Class BretonWordTokenizer

  • All Implemented Interfaces:
    org.languagetool.tokenizers.Tokenizer

    public class BretonWordTokenizer
    extends org.languagetool.tokenizers.WordTokenizer
    • Method Summary

      Modifier and Type Method Description
      java.util.List<java.lang.String> tokenize(java.lang.String text)
      Tokenizes just like WordTokenizer with the exception that "c’h" is not split.
      • Methods inherited from class org.languagetool.tokenizers.WordTokenizer

        getProtocols, getTokenizingCharacters, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • BretonWordTokenizer

        public BretonWordTokenizer()
    • Method Detail

      • tokenize

        public java.util.List<java.lang.String> tokenize(java.lang.String text)
        Tokenizes like WordTokenizer, except that "c’h" is not split. "C’h" is a single letter (a trigraph) in Breton and occurs in many words, so the tokenizer must not split it. Elided forms such as "n’eo" are split into exactly two tokens: "n’" + "eo".
        Specified by:
        tokenize in interface org.languagetool.tokenizers.Tokenizer
        Overrides:
        tokenize in class org.languagetool.tokenizers.WordTokenizer
        Parameters:
        text - Text to tokenize
        Returns:
        List of tokens. Note: a special string ##BR_APOS## is used to replace apostrophes during tokenizing.
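
The behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the LanguageTool implementation: the class name BretonTokenizeSketch, the delimiter set, and the way the ##BR_APOS## placeholder is applied and restored are all assumptions made for illustration, and java.util.StringTokenizer stands in for the inherited WordTokenizer logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative stand-in only; NOT the LanguageTool implementation.
final class BretonTokenizeSketch {

  // Placeholder string named in the Returns note; its exact use here is assumed.
  private static final String APOS = "##BR_APOS##";

  // Assumed delimiter set; the real one comes from getTokenizingCharacters().
  private static final String DELIMS = " \t\n.,;:!?'\u2019";

  static List<String> tokenize(String text) {
    // Step 1: hide the apostrophe inside the trigraph c'h so it is not a delimiter.
    String masked = text.replaceAll("([Cc])['\u2019]([Hh])", "$1" + APOS + "$2");
    // Step 2: hide an apostrophe that follows a letter and add a space after it,
    // so an elided form such as n'eo breaks right after the apostrophe.
    masked = masked.replaceAll("(\\p{L})['\u2019]", "$1" + APOS + " ");

    // Step 3: generic splitting; delimiters are returned as tokens as well.
    List<String> raw = new ArrayList<>();
    StringTokenizer st = new StringTokenizer(masked, DELIMS, true);
    while (st.hasMoreTokens()) {
      raw.add(st.nextToken());
    }

    // Step 4: restore apostrophes (always as the typographic form in this sketch)
    // and drop the space that step 2 inserted after an elided prefix.
    List<String> tokens = new ArrayList<>();
    for (int i = 0; i < raw.size(); i++) {
      String word = raw.get(i).replace(APOS, "\u2019");
      tokens.add(word);
      boolean elidedPrefix = word.length() > 1 && word.endsWith("\u2019");
      if (elidedPrefix && i + 1 < raw.size() && raw.get(i + 1).equals(" ")) {
        i++; // skip the inserted space
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("n\u2019eo ket c\u2019hoazh"));
  }
}
```

For the input "n’eo ket c’hoazh", this sketch yields the tokens "n’", "eo", " ", "ket", " ", "c’hoazh": the elided form is split into two tokens, while the word containing the trigraph stays whole. Whitespace is kept as separate tokens here because the delimiters are returned by the splitter.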