Class BaseTagger

  • All Implemented Interfaces:
    Tagger

    public abstract class BaseTagger
    extends java.lang.Object
    implements Tagger
    Base tagger using Morfologik binary dictionaries.
    • Field Detail

      • wordTagger

        protected final WordTagger wordTagger
      • conversionLocale

        protected final java.util.Locale conversionLocale
      • tagLowercaseWithUppercase

        private final boolean tagLowercaseWithUppercase
      • dictionaryPath

        private final java.lang.String dictionaryPath
      • dictionary

        private final morfologik.stemming.Dictionary dictionary
    • Constructor Detail

      • BaseTagger

        public BaseTagger(java.lang.String filename)
        Since:
        2.9
      • BaseTagger

        public BaseTagger(java.lang.String filename,
                          java.util.Locale conversionLocale)
        Since:
        2.9
      • BaseTagger

        public BaseTagger(java.lang.String filename,
                          java.util.Locale locale,
                          boolean tagLowercaseWithUppercase)
        Since:
        2.9
    • Method Detail

      • getManualAdditionsFileName

        public abstract @Nullable java.lang.String getManualAdditionsFileName()
        Returns the filename for manual additions (e.g., /en/added.txt), or null if there is none.
        Since:
        2.8
      • getManualRemovalsFileName

        public @Nullable java.lang.String getManualRemovalsFileName()
        Returns the filename for manual removals (e.g., /en/removed.txt), or null if there is none.
        Since:
        3.2
      • getDictionaryPath

        public java.lang.String getDictionaryPath()
        Since:
        2.9
      • overwriteWithManualTagger

        public boolean overwriteWithManualTagger()
        If true, tags from the binary dictionary (*.dict) will be overwritten by manual tags from the plain text dictionary.
        Since:
        2.9
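        The overwrite flag can be illustrated with a minimal sketch. The class below is a hypothetical stand-in that uses plain maps in place of the binary and plain text dictionaries; it is not LanguageTool code, and the behavior when the flag is false (combining both sources) is an assumption, since the description above only covers the true case.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not a LanguageTool class: merges tag lookups
// from a "binary" source and a "manual" source.
final class ManualOverrideSketch {

    // Returns the tags for a word. When overwriteWithManual is true and the
    // manual source knows the word, only the manual tags are returned.
    // Otherwise the tags of both sources are combined (an assumption).
    static List<String> lookup(String word,
                               Map<String, List<String>> binaryDict,
                               Map<String, List<String>> manualDict,
                               boolean overwriteWithManual) {
        List<String> binary = binaryDict.getOrDefault(word, List.of());
        List<String> manual = manualDict.getOrDefault(word, List.of());
        if (overwriteWithManual && !manual.isEmpty()) {
            return new ArrayList<>(manual);
        }
        List<String> combined = new ArrayList<>(binary);
        combined.addAll(manual);
        return combined;
    }
}
```

        With a binary entry run/VB and a manual entry run/NN, the sketch yields only NN when the flag is true and both tags when it is false.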
      • getWordTagger

        protected WordTagger getWordTagger()
      • initWordTagger

        private WordTagger initWordTagger()
      • getDictionary

        protected morfologik.stemming.Dictionary getDictionary()
      • tag

        public java.util.List<AnalyzedTokenReadings> tag(java.util.List<java.lang.String> sentenceTokens)
                                                  throws java.io.IOException
        Description copied from interface: Tagger
        Returns a list of AnalyzedTokenReadings that assigns each term in the sentence some kind of part-of-speech information (not necessarily just one tag).

        Note that this method takes exactly one sentence. Implementations may handle the first word of a sentence specially, since it is usually written with an uppercase letter.

        Specified by:
        tag in interface Tagger
        Parameters:
        sentenceTokens - the text as returned by a WordTokenizer
        Throws:
        java.io.IOException
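        One possible first-word special case can be sketched as follows. This is an illustration only: the class, the map-based dictionary, and the fallback rule (retry the lowercased form for the first word) are assumptions and do not describe what BaseTagger actually does. The Locale parameter plays the role that a conversion locale would play for case conversion.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in, not a LanguageTool class.
final class SentenceStartSketch {

    // Tags each token of one sentence. If the first non-whitespace token is
    // unknown as written, its lowercased form is looked up instead.
    static List<List<String>> tagSentence(List<String> tokens,
                                          Map<String, List<String>> dict,
                                          Locale locale) {
        List<List<String>> result = new ArrayList<>();
        boolean firstWord = true;
        for (String token : tokens) {
            boolean isWhitespace = token.trim().isEmpty();
            List<String> tags = dict.getOrDefault(token, List.of());
            if (tags.isEmpty() && firstWord && !isWhitespace) {
                // Locale-aware lowercasing matters for languages such as Turkish.
                tags = dict.getOrDefault(token.toLowerCase(locale), List.of());
            }
            if (!isWhitespace) {
                firstWord = false;
            }
            result.add(tags);
        }
        return result;
    }
}
```

        In this sketch a capitalized word in the middle of the sentence is not retried in lowercase, which keeps proper nouns from picking up common-noun tags.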
      • getAnalyzedTokens

        protected java.util.List<AnalyzedToken> getAnalyzedTokens(java.lang.String word)
      • asAnalyzedTokenList

        protected java.util.List<AnalyzedToken> asAnalyzedTokenList(java.lang.String word,
                                                                    java.util.List<morfologik.stemming.WordData> wdList)
      • asAnalyzedTokenListForTaggedWords

        protected java.util.List<AnalyzedToken> asAnalyzedTokenListForTaggedWords(java.lang.String word,
                                                                                  java.util.List<TaggedWord> taggedWords)
      • asAnalyzedToken

        protected AnalyzedToken asAnalyzedToken(java.lang.String word,
                                                morfologik.stemming.WordData wd)
      • createNullToken

        public final AnalyzedTokenReadings createNullToken(java.lang.String token,
                                                           int startPos)
        Description copied from interface: Tagger
        Creates the AnalyzedTokenReadings used for whitespace and other non-words. Use null as the POS tag for this token.
        Specified by:
        createNullToken in interface Tagger
      • createToken

        public AnalyzedToken createToken(java.lang.String token,
                                         java.lang.String posTag)
        Description copied from interface: Tagger
        Creates a token specific to the language of the implementing class.
        Specified by:
        createToken in interface Tagger
      • additionalTags

        protected @Nullable java.util.List<AnalyzedToken> additionalTags(java.lang.String word,
                                                                         WordTagger wordTagger)
        Allows additional tagging in some language-dependent circumstances.
        Parameters:
        word - The word to tag
        Returns:
        a list of analyzed tokens with additional tags, or null
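
        The nullable-hook pattern described here can be sketched in isolation. The classes below are hypothetical stand-ins that use plain strings instead of AnalyzedToken and omit the WordTagger parameter; the hyphen rule in the subclass is an invented example of a language-dependent circumstance, not behavior taken from LanguageTool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, not a LanguageTool class.
class HookSketch {

    // Hook for subclasses: returns extra tags for the word, or null if none.
    protected List<String> additionalTags(String word) {
        return null;
    }

    // Combines the dictionary tags with whatever the hook contributes,
    // treating a null result as "nothing to add".
    final List<String> tagsFor(String word, List<String> dictionaryTags) {
        List<String> result = new ArrayList<>(dictionaryTags);
        List<String> extra = additionalTags(word);
        if (extra != null) {
            result.addAll(extra);
        }
        return result;
    }
}

// Invented example subclass: adds a tag to hyphenated words only.
final class HyphenHookSketch extends HookSketch {
    @Override
    protected List<String> additionalTags(String word) {
        return word.contains("-") ? List.of("COMPOUND") : null;
    }
}
```

        Callers of such a hook need the null check shown in tagsFor, since the hook signals "no additional tags" with null rather than an empty list.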