Class CatalanWordTokenizer

  • All Implemented Interfaces:
    org.languagetool.tokenizers.Tokenizer

    public class CatalanWordTokenizer
    extends org.languagetool.tokenizers.WordTokenizer
    Tokenizes a sentence into words. Punctuation marks and whitespace characters each get their own token. Hyphens and apostrophes receive special treatment for Catalan.
    • Method Summary

      Modifier and Type Method Description
      java.util.List<java.lang.String> tokenize(java.lang.String text)
      private java.util.List<java.lang.String> wordsToAdd(java.lang.String s)
      • Methods inherited from class org.languagetool.tokenizers.WordTokenizer

        getProtocols, getTokenizingCharacters, isEMail, isUrl, joinEMails, joinEMailsAndUrls, joinUrls
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • patterns

        private final java.util.regex.Pattern[] patterns
      • speller

        protected org.languagetool.rules.spelling.morfologik.MorfologikSpeller speller
      • ELA_GEMINADA

        private static final java.util.regex.Pattern ELA_GEMINADA
      • ELA_GEMINADA_UPPERCASE

        private static final java.util.regex.Pattern ELA_GEMINADA_UPPERCASE
      • APOSTROF_RECTE

        private static final java.util.regex.Pattern APOSTROF_RECTE
      • APOSTROF_RODO

        private static final java.util.regex.Pattern APOSTROF_RODO
      • APOSTROF_RECTE_1

        private static final java.util.regex.Pattern APOSTROF_RECTE_1
      • APOSTROF_RODO_1

        private static final java.util.regex.Pattern APOSTROF_RODO_1
      • NEARBY_HYPHENS

        private static final java.util.regex.Pattern NEARBY_HYPHENS
      • HYPHENS

        private static final java.util.regex.Pattern HYPHENS
      • DECIMAL_POINT

        private static final java.util.regex.Pattern DECIMAL_POINT
      • DECIMAL_COMMA

        private static final java.util.regex.Pattern DECIMAL_COMMA
      • SPACE_DIGITS0

        private static final java.util.regex.Pattern SPACE_DIGITS0
      • SPACE_DIGITS

        private static final java.util.regex.Pattern SPACE_DIGITS
      • SPACE_DIGITS2

        private static final java.util.regex.Pattern SPACE_DIGITS2
    • Constructor Detail

      • CatalanWordTokenizer

        public CatalanWordTokenizer()
    • Method Detail

      • tokenize

        public java.util.List<java.lang.String> tokenize(java.lang.String text)
        Specified by:
        tokenize in interface org.languagetool.tokenizers.Tokenizer
        Overrides:
        tokenize in class org.languagetool.tokenizers.WordTokenizer
        Parameters:
        text - Text to tokenize
        Returns:
        List of tokens. Note: a special string CA_APOS is used to replace apostrophes, and CA_HYPHEN to replace hyphens.
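
The placeholder technique mentioned in the note above can be illustrated with a minimal, self-contained sketch. This is not the LanguageTool implementation: the class name PlaceholderTokenizerSketch, the two regular expressions, the delimiter set, and the sentinel characters around the placeholders are all assumptions made for illustration. Only the names CA_APOS and CA_HYPHEN come from the documentation above. The sketch keeps a word with an inner apostrophe or hyphen as a single token; the real class may split such words differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

public class PlaceholderTokenizerSketch {

    // Placeholder names come from the tokenize() documentation; wrapping
    // them in \u0001 sentinels is an assumption made for this sketch.
    private static final String APOS = "\u0001CA_APOS\u0001";
    private static final String HYPHEN = "\u0001CA_HYPHEN\u0001";

    // Hypothetical patterns: an apostrophe or hyphen with a letter on both sides.
    private static final Pattern INNER_APOS = Pattern.compile("(?<=\\p{L})'(?=\\p{L})");
    private static final Pattern INNER_HYPHEN = Pattern.compile("(?<=\\p{L})-(?=\\p{L})");

    // Hypothetical delimiter set; it includes ' and - so that unprotected
    // apostrophes and hyphens become tokens of their own.
    private static final String DELIMS = " \t\n.,;:!?'-()\"";

    public static List<String> tokenize(String text) {
        // Step 1: hide word-internal apostrophes and hyphens behind placeholders.
        String s = INNER_APOS.matcher(text).replaceAll(APOS);
        s = INNER_HYPHEN.matcher(s).replaceAll(HYPHEN);

        // Step 2: split on the delimiters, returning each delimiter as a token.
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s, DELIMS, true);
        while (st.hasMoreTokens()) {
            // Step 3: restore the original characters inside each token.
            tokens.add(st.nextToken().replace(APOS, "'").replace(HYPHEN, "-"));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("L'home diu-ho."));
    }
}
```

Protecting the characters before splitting lets one generic delimiter pass handle the rest of the text, at the cost of a restore step afterwards.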
      • wordsToAdd

        private java.util.List<java.lang.String> wordsToAdd(java.lang.String s)