Package org.languagetool.tokenizers.br
Class BretonWordTokenizer
java.lang.Object
  org.languagetool.tokenizers.WordTokenizer
    org.languagetool.tokenizers.br.BretonWordTokenizer
- All Implemented Interfaces:
org.languagetool.tokenizers.Tokenizer
public class BretonWordTokenizer extends org.languagetool.tokenizers.WordTokenizer
Constructor Summary
Constructor: BretonWordTokenizer()
Method Summary
Modifier and Type: java.util.List<java.lang.String>
Method: tokenize(java.lang.String text)
Description: Tokenizes like WordTokenizer, except that "c’h" is not split.
Method Detail
tokenize
public java.util.List<java.lang.String> tokenize(java.lang.String text)
Tokenizes like WordTokenizer, except that "c’h" is not split. In Breton, "c’h" is treated as a single letter (a trigraph) and occurs in many words, so the tokenizer must not split it. Elided forms such as "n’eo" are split into exactly two tokens: "n’" + "eo".
- Specified by:
tokenize in interface org.languagetool.tokenizers.Tokenizer
- Overrides:
tokenize in class org.languagetool.tokenizers.WordTokenizer
- Parameters:
text - Text to tokenize
- Returns:
List of tokens. Note: a special string, ##BR_APOS##, stands in for apostrophes during tokenizing.
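The behaviour described above can be illustrated with a small stand-alone sketch. It is not the LanguageTool implementation and does not call the library: the class name BretonTokenizeSketch, the simplified delimiter set, and the internal split marker are assumptions made for illustration. Only the placeholder string ##BR_APOS## and the two rules ("c’h" stays whole, "n’eo" becomes "n’" + "eo") come from the description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch, not the LanguageTool source.
class BretonTokenizeSketch {
    // Placeholder named in the Javadoc; it contains no delimiter characters.
    private static final String APOS = "##BR_APOS##";
    // Assumed internal marker that forces a token boundary and is then dropped.
    private static final String SPLIT = "\u0001";
    // Assumed, simplified delimiter set (the real WordTokenizer's set is larger).
    private static final String DELIMS = " \t\n.,;:!?()\"'\u2019";

    static List<String> tokenize(String text) {
        // Step 1: hide the apostrophe inside the trigraph c'h so it is not a delimiter.
        String s = text.replaceAll("([Cc])['\u2019]([Hh])", "$1" + APOS + "$2");
        // Step 2: keep any other apostrophe attached to the preceding letter
        // and mark a split point right after it (n'eo -> n' + eo).
        s = s.replaceAll("(\\p{L})['\u2019]", "$1" + APOS + SPLIT);

        List<String> tokens = new ArrayList<>();
        // Delimiters are returned as tokens too, so whitespace and punctuation survive.
        StringTokenizer st = new StringTokenizer(s, DELIMS + SPLIT, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            if (token.equals(SPLIT)) {
                continue; // the split marker is internal only
            }
            // Step 3: restore the apostrophe (always as the typographic form here).
            tokens.add(token.replace(APOS, "\u2019"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("c\u2019hoazh"));
        System.out.println(tokenize("n\u2019eo ket"));
    }
}
```

In this sketch a restored apostrophe is always written as the typographic "’", even if the input used the ASCII "'"; whether the real tokenizer normalizes apostrophes the same way is not stated in this page.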