Class GoogleStyleWordTokenizer

  • All Implemented Interfaces:
    org.languagetool.tokenizers.Tokenizer

    public class GoogleStyleWordTokenizer
    extends org.languagetool.tokenizers.WordTokenizer
    Tokenizes sentences into tokens the way Google does for its ngram index. Note: there does not seem to be any official documentation of how Google tokenizes text for that index, so this is only an approximation.
    Since:
    3.2
    • Method Summary

      Modifier and Type Method Description
      java.lang.String getTokenizingCharacters()  
      java.util.List<java.lang.String> tokenize(java.lang.String text)  
      • Methods inherited from class org.languagetool.tokenizers.WordTokenizer

        getProtocols, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • GoogleStyleWordTokenizer

        public GoogleStyleWordTokenizer()
    • Method Detail

      • getTokenizingCharacters

        public java.lang.String getTokenizingCharacters()
        Overrides:
        getTokenizingCharacters in class org.languagetool.tokenizers.WordTokenizer
      • tokenize

        public java.util.List<java.lang.String> tokenize(java.lang.String text)
        Specified by:
        tokenize in interface org.languagetool.tokenizers.Tokenizer
        Overrides:
        tokenize in class org.languagetool.tokenizers.WordTokenizer
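
The override relationship described above (a subclass that changes getTokenizingCharacters and tokenize while keeping the Tokenizer contract) can be illustrated with a small, self-contained sketch. It does not use LanguageTool and is not its implementation: the types Tokenizer, BaseSketch and GoogleStyleSketch below are hypothetical stand-ins, and both the delimiter set and the choice to split on hyphens and drop whitespace tokens are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch only; none of these types come from LanguageTool.
final class TokenizerSketch {

  // Hypothetical stand-in for the Tokenizer interface: one tokenize method.
  interface Tokenizer {
    List<String> tokenize(String text);
  }

  // Hypothetical stand-in for a general word tokenizer.
  static class BaseSketch implements Tokenizer {

    // Assumed delimiter set, chosen for this example.
    public String getTokenizingCharacters() {
      return " \t\n.,;:!?()\"'";
    }

    @Override
    public List<String> tokenize(String text) {
      List<String> tokens = new ArrayList<>();
      // returnDelims = true: every delimiter is emitted as its own token.
      StringTokenizer st = new StringTokenizer(text, getTokenizingCharacters(), true);
      while (st.hasMoreTokens()) {
        tokens.add(st.nextToken());
      }
      return tokens;
    }
  }

  // Hypothetical stand-in for an ngram-style tokenizer built on the base class.
  static class GoogleStyleSketch extends BaseSketch {

    // Assumption: also split on hyphens, so "well-known" becomes three tokens.
    @Override
    public String getTokenizingCharacters() {
      return super.getTokenizingCharacters() + "-";
    }

    // Assumption: whitespace-only tokens are dropped from the result.
    @Override
    public List<String> tokenize(String text) {
      List<String> result = new ArrayList<>();
      for (String token : super.tokenize(text)) {
        if (!token.trim().isEmpty()) {
          result.add(token);
        }
      }
      return result;
    }
  }

  public static void main(String[] args) {
    Tokenizer tokenizer = new GoogleStyleSketch();
    // Prints: [A, well, -, known, example, .]
    System.out.println(tokenizer.tokenize("A well-known example."));
  }
}
```

Because the base class calls getTokenizingCharacters() from inside tokenize, overriding only that method is enough to change where the text is split; the tokenize override then post-processes the inherited result. The actual rules applied by GoogleStyleWordTokenizer may differ from this sketch.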