Package org.languagetool.rules.en
Class GoogleStyleWordTokenizer
- java.lang.Object
  - org.languagetool.tokenizers.WordTokenizer
    - org.languagetool.rules.en.GoogleStyleWordTokenizer
- All Implemented Interfaces: org.languagetool.tokenizers.Tokenizer
public class GoogleStyleWordTokenizer extends org.languagetool.tokenizers.WordTokenizer
Tokenizes sentences into tokens the way Google does for its ngram index. Note: there doesn't seem to be any official documentation of how Google tokenizes text for that index, so this is only an approximation.
- Since: 3.2
Constructor Summary
GoogleStyleWordTokenizer()
Method Summary
java.lang.String getTokenizingCharacters()
java.util.List<java.lang.String> tokenize(java.lang.String text)
Method Detail
getTokenizingCharacters
public java.lang.String getTokenizingCharacters()
- Overrides: getTokenizingCharacters in class org.languagetool.tokenizers.WordTokenizer
tokenize
public java.util.List<java.lang.String> tokenize(java.lang.String text)
- Specified by: tokenize in interface org.languagetool.tokenizers.Tokenizer
- Overrides: tokenize in class org.languagetool.tokenizers.WordTokenizer
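The override relationship between getTokenizingCharacters() and tokenize(java.lang.String) can be illustrated with a standalone sketch. This is not LanguageTool's code: BaseTokenizer and HyphenSplittingTokenizer are hypothetical stand-ins, and both the delimiter set and the idea that a subclass adds a hyphen to it are assumptions made only for this example. It shows how a tokenize method that splits on the string returned by getTokenizingCharacters() changes its output when a subclass overrides that method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerOverrideSketch {

  /** Hypothetical stand-in for a base word tokenizer (not LanguageTool code). */
  static class BaseTokenizer {
    // Assumed delimiter set, chosen only for this illustration.
    public String getTokenizingCharacters() {
      return " .,;:!?()\"";
    }

    public List<String> tokenize(String text) {
      List<String> tokens = new ArrayList<>();
      // returnDelims = true: every delimiter character becomes its own token.
      StringTokenizer st = new StringTokenizer(text, getTokenizingCharacters(), true);
      while (st.hasMoreTokens()) {
        tokens.add(st.nextToken());
      }
      return tokens;
    }
  }

  /** Hypothetical subclass that widens the delimiter set with a hyphen. */
  static class HyphenSplittingTokenizer extends BaseTokenizer {
    @Override
    public String getTokenizingCharacters() {
      return super.getTokenizingCharacters() + "-";
    }
  }

  public static void main(String[] args) {
    String text = "A well-known rule.";
    // The base class keeps "well-known" as one token.
    System.out.println(new BaseTokenizer().tokenize(text));
    // The subclass splits it into "well", "-", "known".
    System.out.println(new HyphenSplittingTokenizer().tokenize(text));
  }
}
```

In this sketch the subclass only overrides the delimiter set and inherits the splitting loop. The real GoogleStyleWordTokenizer overrides tokenize as well, so its output may differ from what is shown here.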