Class EnglishChunker

  • All Implemented Interfaces:
    org.languagetool.chunking.Chunker

    public class EnglishChunker
    extends java.lang.Object
    implements org.languagetool.chunking.Chunker
    OpenNLP-based chunker. It also uses the OpenNLP tokenizer and POS tagger, then maps the results onto our own tokens (which come from our own tokenizer) wherever a trivial mapping exists.
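    The mapping step described above can be pictured with a minimal sketch. It assumes that tokens from both tokenizers carry character offsets into the same sentence and that a chunk tag is transferred only when the offsets coincide exactly. Span, mapChunkTags and the sample tokenization are hypothetical stand-ins for illustration; they are not LanguageTool or OpenNLP classes, and the actual implementation may differ.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: hypothetical stand-ins, not LanguageTool or OpenNLP classes. */
final class OffsetMappingSketch {

    /** Hypothetical token with character offsets into the sentence (end is exclusive). */
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Copies each external chunk tag onto the own token that covers exactly
     * the same character range. Own tokens without an exact match keep null.
     */
    static String[] mapChunkTags(List<Span> ownTokens, List<Span> externalTokens, String[] chunkTags) {
        String[] result = new String[ownTokens.size()];
        for (int i = 0; i < externalTokens.size(); i++) {
            Span ext = externalTokens.get(i);
            for (int j = 0; j < ownTokens.size(); j++) {
                Span own = ownTokens.get(j);
                if (own.start == ext.start && own.end == ext.end) {
                    result[j] = chunkTags[i];
                    break;
                }
            }
        }
        return result;
    }

    /** Sample data for "I can't go"; the external split of "can't" is an assumption. */
    static String[] demo() {
        List<Span> own = Arrays.asList(
                new Span("I", 0, 1), new Span("can't", 2, 7), new Span("go", 8, 10));
        List<Span> external = Arrays.asList(
                new Span("I", 0, 1), new Span("ca", 2, 4), new Span("n't", 4, 7), new Span("go", 8, 10));
        String[] chunkTags = {"B-NP", "B-VP", "I-VP", "I-VP"};
        return mapChunkTags(own, external, chunkTags);
    }

    public static void main(String[] args) {
        // prints [B-NP, null, I-VP]
        System.out.println(Arrays.toString(demo()));
    }
}
```

    In this sketch the token "can't" stays untagged because no single external token covers its range, which is one way to read "wherever a trivial mapping exists".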
    Since:
    2.3
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private static java.lang.String CHUNKER_MODEL  
      private static opennlp.tools.chunker.ChunkerModel chunkerModel  
      private EnglishChunkFilter chunkFilter  
      private static java.lang.String POS_TAGGER_MODEL  
      private static opennlp.tools.postag.POSModel posModel  
      private static java.lang.String TOKENIZER_MODEL  
      private static opennlp.tools.tokenize.TokenizerModel tokenModel
      This needs to be static to save memory: because Language.LANGUAGES is static, any language created there is never released.
    • Constructor Summary

      Constructors 
      Constructor Description
      EnglishChunker()  
    • Method Summary

      Modifier and Type Method Description
      void addChunkTags(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)  
      private void assignChunksToReadings(java.util.List<ChunkTaggedToken> chunkTaggedTokens)  
      private java.lang.String[] chunk(java.lang.String[] tokens, java.lang.String[] posTags)  
      private @Nullable org.languagetool.AnalyzedTokenReadings getAnalyzedTokenReadingsFor(int startPos, int endPos, java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)  
      private java.util.List<ChunkTaggedToken> getChunkTagsForReadings(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)  
      private java.lang.String getSentence(java.util.List<org.languagetool.AnalyzedTokenReadings> sentenceTokens)  
      private java.util.List<ChunkTaggedToken> getTokensWithTokenReadings(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings, java.lang.String[] tokens, java.lang.String[] chunkTags)  
      private java.lang.String[] posTag(java.lang.String[] tokens)  
      (package private) java.lang.String[] tokenize(java.lang.String sentence)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • POS_TAGGER_MODEL

        private static final java.lang.String POS_TAGGER_MODEL
        See Also:
        Constant Field Values
      • tokenModel

        private static volatile opennlp.tools.tokenize.TokenizerModel tokenModel
        This needs to be static to save memory: because Language.LANGUAGES is static, any language created there is never released. As English has several variants, a non-static field would give us as many posModels etc. as we have variants, which would be a huge waste of memory.
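        The memory argument above can be illustrated with a minimal sketch of a model held in a static volatile field and loaded at most once, however many instances are created. The double-checked locking in the constructor, the Model class and the LOAD_COUNT counter are assumptions made for illustration; this page only shows that the fields are private static volatile, not how they are initialized.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: shows one shared, lazily loaded model for all instances. */
final class SharedModelSketch {

    /** Hypothetical stand-in for an expensive model such as a tokenizer model. */
    static final class Model {
        final String name;

        Model(String name) {
            this.name = name;
        }
    }

    /** Counts how often the (pretend) expensive load actually runs. */
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    /** Shared by every instance; volatile so a fully built model is visible to all threads. */
    private static volatile Model model;

    SharedModelSketch() {
        // Double-checked locking (an assumption here): only the first caller loads.
        if (model == null) {
            synchronized (SharedModelSketch.class) {
                if (model == null) {
                    model = load();
                }
            }
        }
    }

    private static Model load() {
        LOAD_COUNT.incrementAndGet();
        return new Model("tokenizer");
    }

    Model model() {
        return model;
    }

    public static void main(String[] args) {
        // Three instances, standing in for three language variants.
        SharedModelSketch a = new SharedModelSketch();
        SharedModelSketch b = new SharedModelSketch();
        SharedModelSketch c = new SharedModelSketch();
        System.out.println("same model: " + (a.model() == b.model() && b.model() == c.model()));
        System.out.println("loads: " + LOAD_COUNT.get());
    }
}
```

        With an instance field instead, each of the three objects in this sketch would hold its own copy of the model.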
      • posModel

        private static volatile opennlp.tools.postag.POSModel posModel
      • chunkerModel

        private static volatile opennlp.tools.chunker.ChunkerModel chunkerModel
    • Constructor Detail

      • EnglishChunker

        public EnglishChunker()
    • Method Detail

      • addChunkTags

        public void addChunkTags(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
        Specified by:
        addChunkTags in interface org.languagetool.chunking.Chunker
      • getChunkTagsForReadings

        private java.util.List<ChunkTaggedToken> getChunkTagsForReadings(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
      • tokenize

        java.lang.String[] tokenize(java.lang.String sentence)
      • posTag

        private java.lang.String[] posTag(java.lang.String[] tokens)
      • chunk

        private java.lang.String[] chunk(java.lang.String[] tokens,
                                         java.lang.String[] posTags)
      • getTokensWithTokenReadings

        private java.util.List<ChunkTaggedToken> getTokensWithTokenReadings(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings,
                                                                            java.lang.String[] tokens,
                                                                            java.lang.String[] chunkTags)
      • assignChunksToReadings

        private void assignChunksToReadings(java.util.List<ChunkTaggedToken> chunkTaggedTokens)
      • getSentence

        private java.lang.String getSentence(java.util.List<org.languagetool.AnalyzedTokenReadings> sentenceTokens)
      • getAnalyzedTokenReadingsFor

        private @Nullable org.languagetool.AnalyzedTokenReadings getAnalyzedTokenReadingsFor(int startPos,
                                                                                             int endPos,
                                                                                             java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)