Package org.languagetool.tokenizers
Class WordTokenizer
- java.lang.Object
  - org.languagetool.tokenizers.WordTokenizer
- All Implemented Interfaces: Tokenizer
public class WordTokenizer extends java.lang.Object implements Tokenizer
Tokenizes a sentence into words. Punctuation and whitespace get their own tokens. The tokenizer is a fairly simple character-based one, though it recognizes URLs and keeps each one in a single token if it is fully specified, including a protocol (like http://foobar.org).
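The character-based approach described above can be illustrated with a minimal sketch built on java.util.StringTokenizer. This is not the LanguageTool source: the class name SimpleWordTokenizerSketch and the DELIMITERS set are assumptions made for the example, since the actual value of TOKENIZING_CHARACTERS is not listed on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch only; not the LanguageTool implementation.
class SimpleWordTokenizerSketch {

    // Assumed delimiter set for this example. The real
    // TOKENIZING_CHARACTERS value is not shown on this page.
    static final String DELIMITERS = " \t\n.,;:!?()\"";

    // Splits text into words and gives every delimiter character its own token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true makes each delimiter character a token of its own
        StringTokenizer st = new StringTokenizer(text, DELIMITERS, true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world!"));
    }
}
```

With this delimiter set, tokenize("Hello, world!") yields five tokens: Hello, a comma, a space, world, and an exclamation mark.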
-
-
Field Summary
Fields
Modifier and Type                                Field
private static java.util.regex.Pattern           DOMAIN_CHARS
private static java.util.regex.Pattern           E_MAIL
private static java.util.regex.Pattern           NO_PROTOCOL_URL
private static java.util.List<java.lang.String>  PROTOCOLS
private static java.lang.String                  TOKENIZING_CHARACTERS
private static java.util.regex.Pattern           URL_CHARS
-
Constructor Summary
Constructors
Constructor
WordTokenizer()
-
Method Summary
Modifier and Type                           Method and Description
static java.util.List<java.lang.String>     getProtocols()  Get the protocols that the tokenizer knows about.
java.lang.String                            getTokenizingCharacters()
static boolean                              isEMail(java.lang.String token)
private boolean                             isProtocol(java.lang.String token)
static boolean                              isUrl(java.lang.String token)
protected java.util.List<java.lang.String>  joinEMails(java.util.List<java.lang.String> list)
protected java.util.List<java.lang.String>  joinEMailsAndUrls(java.util.List<java.lang.String> list)
protected java.util.List<java.lang.String>  joinUrls(java.util.List<java.lang.String> l)
java.util.List<java.lang.String>            tokenize(java.lang.String text)
private boolean                             urlEndsAt(int i, java.util.List<java.lang.String> l, java.lang.String urlQuote)
private boolean                             urlStartsAt(int i, java.util.List<java.lang.String> l)
-
-
-
Field Detail
-
PROTOCOLS
private static final java.util.List<java.lang.String> PROTOCOLS
-
URL_CHARS
private static final java.util.regex.Pattern URL_CHARS
-
DOMAIN_CHARS
private static final java.util.regex.Pattern DOMAIN_CHARS
-
NO_PROTOCOL_URL
private static final java.util.regex.Pattern NO_PROTOCOL_URL
-
E_MAIL
private static final java.util.regex.Pattern E_MAIL
-
TOKENIZING_CHARACTERS
private static final java.lang.String TOKENIZING_CHARACTERS
- See Also:
- Constant Field Values
-
-
Method Detail
-
getProtocols
public static java.util.List<java.lang.String> getProtocols()
Get the protocols that the tokenizer knows about.
- Returns:
- currently http, https, and ftp
- Since:
- 2.1
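As a rough illustration of how such a protocol list might be used, the sketch below checks whether a string begins with one of the three documented protocols followed by "://". The class ProtocolCheckSketch and the helper hasKnownProtocol are hypothetical names for this example; they are not part of WordTokenizer, and the list is hard-coded from the description above rather than read from the library.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the LanguageTool implementation.
class ProtocolCheckSketch {

    // Values copied from the getProtocols() description: http, https, and ftp.
    static final List<String> PROTOCOLS = Arrays.asList("http", "https", "ftp");

    // Hypothetical helper: true if the text starts with "<protocol>://"
    // for one of the protocols listed above.
    static boolean hasKnownProtocol(String text) {
        for (String protocol : PROTOCOLS) {
            if (text.startsWith(protocol + "://")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasKnownProtocol("http://foobar.org"));
        System.out.println(hasKnownProtocol("foobar.org"));
    }
}
```

Under these assumptions, a fully specified address such as http://foobar.org passes the check, while a bare foobar.org does not.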
-
isUrl
public static boolean isUrl(java.lang.String token)
- Since:
- 3.0
-
isEMail
public static boolean isEMail(java.lang.String token)
- Since:
- 3.5
-
tokenize
public java.util.List<java.lang.String> tokenize(java.lang.String text)
-
getTokenizingCharacters
public java.lang.String getTokenizingCharacters()
- Returns:
- The string containing the characters used by the tokenizer to tokenize words.
- Since:
- 2.5
-
joinEMailsAndUrls
protected java.util.List<java.lang.String> joinEMailsAndUrls(java.util.List<java.lang.String> list)
-
joinEMails
protected java.util.List<java.lang.String> joinEMails(java.util.List<java.lang.String> list)
- Since:
- 3.5
-
joinUrls
protected java.util.List<java.lang.String> joinUrls(java.util.List<java.lang.String> l)
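The class description states that a URL with a protocol ends up in a single token. One way such a rejoining step could work is sketched below. It is a simplified assumption, not the actual joinUrls logic: the class UrlJoinSketch and the helpers looksLikeUrlStart and joinUrlTokens are hypothetical, and the sketch treats the next whitespace token as the end of the URL, so trailing punctuation would be absorbed into it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the LanguageTool implementation.
class UrlJoinSketch {

    // Protocols as documented for getProtocols().
    static final List<String> PROTOCOLS = Arrays.asList("http", "https", "ftp");

    // Hypothetical check: a known protocol token followed by ":", "/", "/".
    static boolean looksLikeUrlStart(int i, List<String> tokens) {
        return i + 3 < tokens.size()
                && PROTOCOLS.contains(tokens.get(i))
                && tokens.get(i + 1).equals(":")
                && tokens.get(i + 2).equals("/")
                && tokens.get(i + 3).equals("/");
    }

    // Hypothetical join step: merges the tokens of a URL into one token.
    static List<String> joinUrlTokens(List<String> tokens) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (looksLikeUrlStart(i, tokens)) {
                StringBuilder url = new StringBuilder();
                // Simplification: the URL runs until the next whitespace token.
                while (i < tokens.size() && !tokens.get(i).trim().isEmpty()) {
                    url.append(tokens.get(i));
                    i++;
                }
                result.add(url.toString());
            } else {
                result.add(tokens.get(i));
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "See", " ", "http", ":", "/", "/", "foobar", ".", "org", " ", "now");
        System.out.println(joinUrlTokens(tokens));
    }
}
```

The private methods urlStartsAt and urlEndsAt listed on this page suggest the real class detects URL boundaries more carefully than this whitespace rule does.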
-
urlStartsAt
private boolean urlStartsAt(int i, java.util.List<java.lang.String> l)
-
isProtocol
private boolean isProtocol(java.lang.String token)
-
urlEndsAt
private boolean urlEndsAt(int i, java.util.List<java.lang.String> l, java.lang.String urlQuote)
-
-