Class PolishWordTokenizer

  • All Implemented Interfaces:
    org.languagetool.tokenizers.Tokenizer

    public class PolishWordTokenizer
    extends org.languagetool.tokenizers.WordTokenizer
    Since:
    2.5
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private java.lang.String plTokenizing  
      private static java.util.Set<java.lang.String> prefixes  
      private org.languagetool.tagging.Tagger tagger  
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      void setTagger​(org.languagetool.tagging.Tagger tagger)
      Set the tagger to use in tokenizing.
      java.util.List<java.lang.String> tokenize​(java.lang.String text)
      Tokenizes text.
      • Methods inherited from class org.languagetool.tokenizers.WordTokenizer

        getProtocols, getTokenizingCharacters, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • plTokenizing

        private final java.lang.String plTokenizing
      • tagger

        private org.languagetool.tagging.Tagger tagger
      • prefixes

        private static final java.util.Set<java.lang.String> prefixes
    • Constructor Detail

      • PolishWordTokenizer

        public PolishWordTokenizer()
    • Method Detail

      • tokenize

        public java.util.List<java.lang.String> tokenize​(java.lang.String text)
        Tokenizes text. The Polish tokenizer differs from the standard one in the following respects:
        1. it does not treat a hyphen as part of a word when the hyphen is at the end of the word;
        2. it treats the en dash and the em dash as tokenizing characters, since these are not included in the spelling dictionary;
        3. it splits certain kinds of compound words containing a hyphen, such as dziecko-geniusz (two nouns), polsko-indonezyjski (an ad-adjectival adjective and an adjective), polsko-francusko-niemiecki (two ad-adjectival adjectives and an adjective), or osiemnaście-dwadzieścia (two numerals), but not words in which the hyphen occurs before a morphological ending (such as SMS-y).
        Specified by:
        tokenize in interface org.languagetool.tokenizers.Tokenizer
        Overrides:
        tokenize in class org.languagetool.tokenizers.WordTokenizer
        Parameters:
        text - String of words to tokenize.
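        The hyphen rule in point 3 can be pictured with a small, self-contained sketch. This is not the class's actual implementation: HyphenSplitSketch, splitHyphenated, and TOY_LEXICON are hypothetical names, the toy lexicon stands in for the lookup that the real tokenizer performs through its Tagger, and emitting the hyphen as its own token is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Illustrative sketch only; all names here are hypothetical and not part of LanguageTool.
class HyphenSplitSketch {

    // Toy stand-in for a dictionary lookup (the real class consults its Tagger).
    static final Set<String> TOY_LEXICON = new HashSet<>(Arrays.asList(
            "dziecko", "geniusz", "polsko", "indonezyjski", "SMS-y"));

    // Splits a hyphenated token into its parts, or returns it unchanged.
    static List<String> splitHyphenated(String token, Predicate<String> isKnown) {
        List<String> result = new ArrayList<>();
        // A token without a hyphen, or one known as a whole (for example a
        // hyphen before a morphological ending, as in "SMS-y"), stays intact.
        if (!token.contains("-") || isKnown.test(token)) {
            result.add(token);
            return result;
        }
        // The limit -1 keeps empty trailing parts, so "word-" yields an empty part.
        String[] parts = token.split("-", -1);
        // Split only when every part is itself a known word.
        for (String part : parts) {
            if (part.isEmpty() || !isKnown.test(part)) {
                result.add(token);
                return result;
            }
        }
        // Assumption: the hyphen is emitted as a separate token between the parts.
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                result.add("-");
            }
            result.add(parts[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        Predicate<String> isKnown = TOY_LEXICON::contains;
        System.out.println(splitHyphenated("dziecko-geniusz", isKnown)); // [dziecko, -, geniusz]
        System.out.println(splitHyphenated("SMS-y", isKnown));           // [SMS-y]
    }
}
```

        In this sketch a compound is split only when each part is found on its own, which is one plausible reading of why a tagger is needed for hyphen handling; unknown compounds are left untouched rather than guessed at.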
      • setTagger

        public void setTagger​(org.languagetool.tagging.Tagger tagger)
        Set the tagger to use in tokenizing. This method is called in the constructor of the Polish class; if PolishWordTokenizer is used on its own, it must be called after construction to enable the hybrid hyphen tokenizing.
        Parameters:
        tagger - The tagger to use (compatible only with the Polish BaseTagger that uses the PoliMorfologik dictionary shipped with the library, version 2.1 or later).
        Since:
        2.5