Class WordTokenizer

  • All Implemented Interfaces:
    Tokenizer

    public class WordTokenizer
    extends java.lang.Object
    implements Tokenizer
    Tokenizes a sentence into words. Punctuation and whitespace get their own tokens. The tokenizer is a fairly simple character-based one, but it recognizes URLs and keeps each one in a single token, provided the URL is fully specified, including a protocol (for example http://foobar.org).
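The character-based splitting described above can be illustrated with a minimal sketch. This is not the class's actual implementation: the class name WordTokenizerSketch and the DELIMITERS string are assumptions made for illustration, since the real value of TOKENIZING_CHARACTERS is not shown on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch of a character-based tokenizer; not the real WordTokenizer.
final class WordTokenizerSketch {

    // Assumed delimiter set for illustration only; the real
    // TOKENIZING_CHARACTERS constant is private and not listed here.
    private static final String DELIMITERS = " \t\n\r.,;:!?()[]\"'/";

    // Splits text so that every delimiter character becomes its own token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true makes StringTokenizer return each delimiter
        // character as a separate one-character token.
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Hello, world!")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

With this delimiter set, the comma, the space, and the exclamation mark in "Hello, world!" each come back as separate tokens alongside the two words.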
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private static java.util.regex.Pattern DOMAIN_CHARS  
      private static java.util.regex.Pattern E_MAIL  
      private static java.util.regex.Pattern NO_PROTOCOL_URL  
      private static java.util.List<java.lang.String> PROTOCOLS  
      private static java.lang.String TOKENIZING_CHARACTERS  
      private static java.util.regex.Pattern URL_CHARS  
    • Constructor Summary

      Constructors 
      Constructor Description
      WordTokenizer()  
    • Method Summary

      Modifier and Type Method Description
      static java.util.List<java.lang.String> getProtocols()
      Get the protocols that the tokenizer knows about.
      java.lang.String getTokenizingCharacters()  
      static boolean isEMail(java.lang.String token)
      private boolean isProtocol(java.lang.String token)
      static boolean isUrl(java.lang.String token)
      protected java.util.List<java.lang.String> joinEMails(java.util.List<java.lang.String> list)
      protected java.util.List<java.lang.String> joinEMailsAndUrls(java.util.List<java.lang.String> list)
      protected java.util.List<java.lang.String> joinUrls(java.util.List<java.lang.String> l)
      java.util.List<java.lang.String> tokenize(java.lang.String text)
      private boolean urlEndsAt(int i, java.util.List<java.lang.String> l, java.lang.String urlQuote)
      private boolean urlStartsAt(int i, java.util.List<java.lang.String> l)
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • PROTOCOLS

        private static final java.util.List<java.lang.String> PROTOCOLS
      • URL_CHARS

        private static final java.util.regex.Pattern URL_CHARS
      • DOMAIN_CHARS

        private static final java.util.regex.Pattern DOMAIN_CHARS
      • NO_PROTOCOL_URL

        private static final java.util.regex.Pattern NO_PROTOCOL_URL
      • E_MAIL

        private static final java.util.regex.Pattern E_MAIL
      • TOKENIZING_CHARACTERS

        private static final java.lang.String TOKENIZING_CHARACTERS
        See Also:
        Constant Field Values
    • Constructor Detail

      • WordTokenizer

        public WordTokenizer()
    • Method Detail

      • getProtocols

        public static java.util.List<java.lang.String> getProtocols()
        Get the protocols that the tokenizer knows about.
        Returns:
        currently http, https, and ftp
        Since:
        2.1
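As a hedged illustration of how a list of known protocols might be used, the sketch below checks whether a string begins with one of the three protocols named above followed by "://". The class ProtocolCheckSketch and the helper startsWithKnownProtocol are hypothetical names, not part of this API, and the case-sensitive comparison is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; not the real WordTokenizer.
final class ProtocolCheckSketch {

    // The three protocols named in the getProtocols() documentation.
    private static final List<String> PROTOCOLS =
        Collections.unmodifiableList(Arrays.asList("http", "https", "ftp"));

    // Mirrors the documented accessor: returns the known protocols.
    static List<String> getProtocols() {
        return PROTOCOLS;
    }

    // Hypothetical helper: true if text starts with "<protocol>://".
    // The comparison is case-sensitive, which is an assumption.
    static boolean startsWithKnownProtocol(String text) {
        for (String protocol : PROTOCOLS) {
            if (text.startsWith(protocol + "://")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(startsWithKnownProtocol("http://foobar.org"));
        System.out.println(startsWithKnownProtocol("foobar.org"));
    }
}
```

A string without a protocol, such as foobar.org, is rejected by this check, which matches the class description's requirement that a URL be fully specified.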
      • isUrl

        public static boolean isUrl(java.lang.String token)
        Since:
        3.0
      • isEMail

        public static boolean isEMail(java.lang.String token)
        Since:
        3.5
      • tokenize

        public java.util.List<java.lang.String> tokenize(java.lang.String text)
        Specified by:
        tokenize in interface Tokenizer
      • getTokenizingCharacters

        public java.lang.String getTokenizingCharacters()
        Returns:
        The string containing the characters used by the tokenizer to tokenize words.
        Since:
        2.5
      • joinEMailsAndUrls

        protected java.util.List<java.lang.String> joinEMailsAndUrls(java.util.List<java.lang.String> list)
      • joinEMails

        protected java.util.List<java.lang.String> joinEMails(java.util.List<java.lang.String> list)
        Since:
        3.5
      • joinUrls

        protected java.util.List<java.lang.String> joinUrls(java.util.List<java.lang.String> l)
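This page does not document how joinUrls works. The sketch below shows one plausible way that character-level tokens could be merged back into a single URL token when they begin with a known protocol followed by "://". It is an assumption-laden illustration, not the class's implementation: the class name UrlJoinSketch, the rule that a URL runs until the next whitespace token, and the simplified urlStartsAt check are all hypothetical. In particular, this sketch has no counterpart to urlEndsAt, so trailing punctuation is absorbed into the URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of URL joining; not the real WordTokenizer.
final class UrlJoinSketch {

    private static final List<String> PROTOCOLS = Arrays.asList("http", "https", "ftp");

    // Assumed rule: a URL starts where a known protocol token is
    // followed by the three tokens ":", "/", "/".
    static boolean urlStartsAt(int i, List<String> tokens) {
        return i + 3 < tokens.size()
            && PROTOCOLS.contains(tokens.get(i))
            && tokens.get(i + 1).equals(":")
            && tokens.get(i + 2).equals("/")
            && tokens.get(i + 3).equals("/");
    }

    // Merges the tokens of each detected URL into one token.
    // Assumed rule: the URL ends at the next whitespace token or at the end of input.
    static List<String> joinUrls(List<String> tokens) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (urlStartsAt(i, tokens)) {
                StringBuilder url = new StringBuilder();
                while (i < tokens.size() && !tokens.get(i).trim().isEmpty()) {
                    url.append(tokens.get(i));
                    i++;
                }
                result.add(url.toString());
            } else {
                result.add(tokens.get(i));
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "see", " ", "http", ":", "/", "/", "foobar", ".", "org", " ", "now");
        for (String token : joinUrls(tokens)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Tokens that do not start with a protocol, such as "foobar", ".", "org" on their own, pass through unchanged, consistent with the class description.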
      • urlStartsAt

        private boolean urlStartsAt(int i, java.util.List<java.lang.String> l)
      • isProtocol

        private boolean isProtocol(java.lang.String token)
      • urlEndsAt

        private boolean urlEndsAt(int i, java.util.List<java.lang.String> l, java.lang.String urlQuote)