Class MultiWordChunker2

  • All Implemented Interfaces:
    Disambiguator

    public class MultiWordChunker2
    extends AbstractDisambiguator
    Multiword tagger-chunker. Note: currently does not support:
    • overlapping tagging (first matching multiword entry wins)
    • Field Detail

      • filename

        private final java.lang.String filename
      • allowFirstCapitalized

        private final boolean allowFirstCapitalized
      • removeOtherReadings

        private boolean removeOtherReadings
      • tagFormat

        private java.lang.String tagFormat
    • Constructor Detail

      • MultiWordChunker2

        public MultiWordChunker2​(java.lang.String filename)
        Parameters:
        filename - text file with multiwords and tags
      • MultiWordChunker2

        public MultiWordChunker2​(java.lang.String filename,
                                 boolean allowFirstCapitalized)
        Parameters:
        filename - text file with multiwords and tags
        allowFirstCapitalized - if set to true, the first word of the multiword can be capitalized
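        The effect of allowFirstCapitalized can be pictured with a small standalone sketch. It does not use the LanguageTool classes; the class FirstCapitalizedSketch and the method matchesEntry are hypothetical names introduced only for illustration, and the real matching logic may differ.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of MultiWordChunker2.
final class FirstCapitalizedSketch {

    /**
     * Returns true if candidate equals the multiword entry exactly, or, when
     * allowFirstCapitalized is set, if it equals the entry with only its
     * first letter upper-cased.
     */
    static boolean matchesEntry(String entry, String candidate, boolean allowFirstCapitalized) {
        if (entry.equals(candidate)) {
            return true;
        }
        if (!allowFirstCapitalized || entry.isEmpty()) {
            return false;
        }
        // Upper-case only the first character; the rest of the entry must match as written.
        String capitalized = entry.substring(0, 1).toUpperCase(Locale.ROOT) + entry.substring(1);
        return capitalized.equals(candidate);
    }
}
```

        Under this assumption, an entry "a priori" would also match "A priori" at the start of a sentence, but not "A Priori".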
    • Method Detail

      • setRemoveOtherReadings

        public void setRemoveOtherReadings​(boolean removeOtherReadings)
        Parameters:
        removeOtherReadings - If true and a multiword matches, other readings will be removed
      • setWrapTag

        public void setWrapTag​(boolean wrapTag)
        Parameters:
        wrapTag - If true, the tag will be wrapped with < and >
      • formatPosTag

        protected java.lang.String formatPosTag​(java.lang.String posTag,
                                                int position,
                                                int multiwordLength)
        Override this method if you want to format the POS tag differently
        Parameters:
        posTag - POS tag for the multiword
        position - Position of the token in the multiword
        Returns:
        The formatted POS tag for the multiword
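        As an illustration of what a custom format could look like, the sketch below uses a standalone static method with the same parameters as formatPosTag. The class name PositionalTagFormatSketch and the "TAG:position/length" output format are invented for this example, and position is assumed to be zero-based, which the documentation above does not state.

```java
// Hypothetical stand-in for an overridden formatPosTag; not the library's implementation.
final class PositionalTagFormatSketch {

    /**
     * Formats the tag as "TAG:n/total", where n is the 1-based index of the
     * token inside the multiword. Assumes the incoming position is zero-based.
     */
    static String formatPosTag(String posTag, int position, int multiwordLength) {
        return posTag + ":" + (position + 1) + "/" + multiwordLength;
    }
}
```

        In a real subclass the same body would go into the overridden protected method, so each token of the multiword carries both the tag and its place in the sequence.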
      • lazyInit

        private void lazyInit()
      • disambiguate

        public AnalyzedSentence disambiguate​(AnalyzedSentence input)
        Implements multiword POS tags, e.g., <ELLIPSIS> for ellipsis (...) start, and </ELLIPSIS> for ellipsis end.
        Parameters:
        input - The tokens to be chunked.
        Returns:
        AnalyzedSentence with additional markers.
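        The start and end markers, together with the "first matching multiword entry wins" limitation noted in the class description, can be sketched without the LanguageTool types. Everything in the block below is an assumption made for illustration: tokens are plain strings, entries map a word sequence to a tag, and MultiwordMarkerSketch and mark are hypothetical names. It is not the actual disambiguate implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of multiword marking; not the library code.
final class MultiwordMarkerSketch {

    /**
     * Returns one marker per token: "<TAG>" on the first token of a matched
     * multiword, "</TAG>" on its last token, and "" everywhere else.
     * Entries are tried in the map's iteration order; the first match wins
     * and its tokens are skipped, so overlapping multiwords are never tagged.
     */
    static List<String> mark(List<String> tokens, Map<List<String>, String> entries) {
        List<String> markers = new ArrayList<>(Collections.nCopies(tokens.size(), ""));
        int i = 0;
        while (i < tokens.size()) {
            int matchedLength = 0;
            for (Map.Entry<List<String>, String> entry : entries.entrySet()) {
                List<String> words = entry.getKey();
                if (words.size() < 2) {
                    continue; // a multiword needs at least two tokens
                }
                int end = i + words.size();
                if (end <= tokens.size() && tokens.subList(i, end).equals(words)) {
                    markers.set(i, "<" + entry.getValue() + ">");
                    markers.set(end - 1, "</" + entry.getValue() + ">");
                    matchedLength = words.size();
                    break; // first matching entry wins
                }
            }
            // Jump past a match, or advance by one token if nothing matched.
            i += Math.max(1, matchedLength);
        }
        return markers;
    }
}
```

        With an insertion-ordered map holding "New York" before "York City", the tokens "in New York City" would receive markers only for "New York"; the overlapping "York City" is left untagged.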
      • matches

        protected boolean matches​(java.lang.String matchText,
                                  AnalyzedTokenReadings inputTokens)
      • loadWords

        private java.util.List<java.lang.String> loadWords​(java.io.InputStream stream)
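        The signature suggests that loadWords reads the multiword file line by line. A minimal sketch of such a loader follows; the UTF-8 encoding, the trimming, and the skipping of blank lines and lines starting with # are all assumptions, and LoadWordsSketch is a hypothetical class name. The actual private method may behave differently.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical loader for illustration; not the library's loadWords.
final class LoadWordsSketch {

    /**
     * Reads the stream as UTF-8 text and returns its non-empty lines,
     * trimmed, skipping lines that start with '#' (assumed comment syntax).
     */
    static List<String> loadWords(InputStream stream) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                lines.add(line);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```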