Package org.languagetool.tokenizers.pl
Class PolishWordTokenizer
- java.lang.Object
  - org.languagetool.tokenizers.WordTokenizer
    - org.languagetool.tokenizers.pl.PolishWordTokenizer
- All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer
public class PolishWordTokenizer extends org.languagetool.tokenizers.WordTokenizer
- Since:
- 2.5
Field Summary
Fields (Modifier and Type, Field):
- private java.lang.String plTokenizing
- private static java.util.Set<java.lang.String> prefixes
- private org.languagetool.tagging.Tagger tagger
Constructor Summary
Constructors:
- PolishWordTokenizer()
Method Summary
Methods (Modifier and Type, Method, Description):
- void setTagger(org.languagetool.tagging.Tagger tagger): Set the tagger to use in tokenizing.
- java.util.List<java.lang.String> tokenize(java.lang.String text): Tokenizes text.
Method Detail
tokenize
public java.util.List<java.lang.String> tokenize(java.lang.String text)
Tokenizes text. The Polish tokenizer differs from the standard one in the following respects:
- it does not treat the hyphen as part of the word if the hyphen is at the end of the word;
- it treats the en dash and the em dash as tokenizing characters, as these are not included in the spelling dictionary;
- it splits compound words containing a hyphen, such as dziecko-geniusz (two nouns), polsko-indonezyjski (an ad-adjectival adjective and an adjective), polsko-francusko-niemiecki (two ad-adjectival adjectives and an adjective), or osiemnaście-dwadzieścia (two numerals), but not words in which the hyphen occurs before a morphological ending (such as SMS-y).
- Specified by:
tokenize in interface org.languagetool.tokenizers.Tokenizer
- Overrides:
tokenize in class org.languagetool.tokenizers.WordTokenizer
- Parameters:
text - String of words to tokenize.
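The first two rules listed for tokenize can be illustrated with a minimal, standalone sketch. It is not the LanguageTool implementation: the class name HyphenRulesSketch and the delimiter set are assumptions made for this example. The en dash and em dash are treated as delimiters, the plain hyphen is not, and a trailing hyphen is split off from the word. The third rule (splitting compounds) is not covered here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration only; not the LanguageTool implementation.
final class HyphenRulesSketch {

  // Assumed delimiter set: whitespace, some punctuation,
  // the en dash (U+2013) and the em dash (U+2014).
  // The plain hyphen is deliberately absent, so it stays inside words.
  private static final String DELIMITERS = " \t\n\r.,;:!?()\u2013\u2014";

  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    // "true" makes the tokenizer return the delimiters as tokens too.
    StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
    while (st.hasMoreTokens()) {
      String token = st.nextToken();
      if (token.length() > 1 && token.endsWith("-")) {
        // A hyphen at the end of a word is emitted as a separate token.
        tokens.add(token.substring(0, token.length() - 1));
        tokens.add("-");
      } else {
        tokens.add(token);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    // Hyphen before a morphological ending: the word is kept whole.
    System.out.println(tokenize("SMS-y"));
    // Trailing hyphen: split off from "dwu".
    System.out.println(tokenize("dwu- i trzypokojowy"));
    // En dash: acts as a delimiter between the two words.
    System.out.println(tokenize("Warszawa\u2013Berlin"));
  }
}
```

Like the inherited tokenizer contract, the sketch returns delimiters (including spaces) as tokens of their own, so callers can reconstruct the original text by concatenating the list.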
setTagger
public void setTagger(org.languagetool.tagging.Tagger tagger)
Set the tagger to use in tokenizing. This method is called in the constructor of the Polish class, but if the tokenizer is used separately, it has to be called after the constructor to enable the hybrid hyphen-tokenizing.
- Parameters:
tagger - The tagger to use (compatible only with the PolishBaseTagger that uses the delivered PoliMorfologik 2.1 or later).
- Since:
- 2.5
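Why the tagger has to be set before hyphenated compounds are split can be shown with a simplified, standalone sketch. Everything in it is an assumption made for illustration: the class HybridHyphenSketch, the method setKnownWord, and the use of a plain Predicate<String> in place of org.languagetool.tagging.Tagger. It also reduces the decision to dictionary membership, whereas the rules documented for tokenize depend on the parts of speech of the components.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-in; these are not the LanguageTool classes.
final class HybridHyphenSketch {

  // Stand-in for the tagger: reports whether a word form is known.
  // Stays null until setKnownWord is called, mirroring a tokenizer
  // that was constructed without a tagger.
  private Predicate<String> knownWord;

  void setKnownWord(Predicate<String> knownWord) {
    this.knownWord = knownWord;
  }

  // Splits a hyphenated token only if a lookup is available and
  // every part is a known word; otherwise returns the token whole.
  List<String> splitCompound(String token) {
    List<String> out = new ArrayList<>();
    String[] parts = token.split("-");
    boolean split = knownWord != null && parts.length > 1;
    if (split) {
      for (String part : parts) {
        if (!knownWord.test(part)) {
          split = false;
          break;
        }
      }
    }
    if (!split) {
      out.add(token);
      return out;
    }
    for (int i = 0; i < parts.length; i++) {
      if (i > 0) {
        out.add("-");
      }
      out.add(parts[i]);
    }
    return out;
  }

  public static void main(String[] args) {
    HybridHyphenSketch tokenizer = new HybridHyphenSketch();
    // No lookup set yet: the compound is kept as a single token.
    System.out.println(tokenizer.splitCompound("dziecko-geniusz"));

    Set<String> lexicon = new HashSet<>(Arrays.asList("dziecko", "geniusz"));
    tokenizer.setKnownWord(lexicon::contains);
    // Both parts are known: the compound is split at the hyphen.
    System.out.println(tokenizer.splitCompound("dziecko-geniusz"));
    // "y" is an ending, not a known word: the token is kept whole.
    System.out.println(tokenizer.splitCompound("SMS-y"));
  }
}
```

With the real class, the analogous step is calling setTagger on a freshly constructed PolishWordTokenizer before the first call to tokenize.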