Class EnglishWordTokenizer

  • All Implemented Interfaces:
    org.languagetool.tokenizers.Tokenizer

    public class EnglishWordTokenizer
    extends org.languagetool.tokenizers.WordTokenizer
    Since:
    2.5
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private static java.lang.String[] EXCEPTION_REPLACEMENT  
      private static java.lang.String[] EXCEPTIONS  
    • Method Summary

      Modifier and Type Method Description
      java.lang.String getTokenizingCharacters()  
      java.util.List<java.lang.String> tokenize(java.lang.String text)
      Tokenizes text.
      • Methods inherited from class org.languagetool.tokenizers.WordTokenizer

        getProtocols, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • EXCEPTIONS

        private static final java.lang.String[] EXCEPTIONS
      • EXCEPTION_REPLACEMENT

        private static final java.lang.String[] EXCEPTION_REPLACEMENT
    • Constructor Detail

      • EnglishWordTokenizer

        public EnglishWordTokenizer()
    • Method Detail

      • getTokenizingCharacters

        public java.lang.String getTokenizingCharacters()
        Overrides:
        getTokenizingCharacters in class org.languagetool.tokenizers.WordTokenizer
      • tokenize

        public java.util.List<java.lang.String> tokenize(java.lang.String text)
        Tokenizes text. The English tokenizer differs from the standard one in two respects:
        1. it does not treat a hyphen as part of a word when the hyphen is at the end of that word;
        2. it treats the en dash as a tokenizing character, because English uses it without surrounding whitespace.
        Specified by:
        tokenize in interface org.languagetool.tokenizers.Tokenizer
        Overrides:
        tokenize in class org.languagetool.tokenizers.WordTokenizer
        Parameters:
        text - String of words to tokenize.
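
The two differences described for tokenize can be illustrated with a minimal, JDK-only sketch. It is not the LanguageTool implementation: the class name TrailingHyphenSketch, the DELIMITERS set, and the choice to return delimiters (including whitespace) as tokens are assumptions made for illustration. The real delimiter set is whatever getTokenizingCharacters() returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TrailingHyphenSketch {

    // Hypothetical delimiter set, for illustration only.
    // It includes the en dash (U+2013) so that "1990\u20132000" is split.
    private static final String DELIMITERS = " \t\n.,;:!?()\"\u2013";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true: every delimiter is returned as its own token.
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.length() > 1 && token.endsWith("-")) {
                // A hyphen at the end of a word becomes a separate token.
                tokens.add(token.substring(0, token.length() - 1));
                tokens.add("-");
            } else {
                // Word-internal hyphens ("post-war") stay inside the word.
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print one token per line, in brackets so whitespace tokens are visible.
        for (String token : tokenize("pre- and post-war, 1990\u20132000")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

With LanguageTool on the classpath, the corresponding call would be new EnglishWordTokenizer().tokenize(text); its exact output may differ from this sketch, since the actual tokenizing characters and exception handling (EXCEPTIONS, EXCEPTION_REPLACEMENT) are not reproduced here.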