Package org.languagetool.chunking
Class EnglishChunker
- java.lang.Object
  - org.languagetool.chunking.EnglishChunker
- All Implemented Interfaces:
org.languagetool.chunking.Chunker
public class EnglishChunker extends java.lang.Object implements org.languagetool.chunking.Chunker
OpenNLP-based chunker. It also uses the OpenNLP tokenizer and POS tagger and maps their results onto LanguageTool's own tokens (LanguageTool has its own tokenizer), as far as this is trivially possible.
- Since:
- 2.3
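The mapping between two tokenizations described above can be pictured as an alignment by character offsets. The following is a minimal sketch of that idea, not the actual EnglishChunker code: the class OffsetAlignmentSketch, the OwnToken type, the methods findBySpan and assign, and the sample chunk tags are all hypothetical stand-ins, and no OpenNLP classes are involved. A chunk tag is copied only when an external token and an own token cover exactly the same span; everything else is left untagged, which is one reading of "as far as trivially possible".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetAlignmentSketch {

    /** Hypothetical stand-in for one of "our own" tokens: text plus start offset. */
    static final class OwnToken {
        final String text;
        final int startPos;
        final List<String> chunkTags = new ArrayList<>();

        OwnToken(String text, int startPos) {
            this.text = text;
            this.startPos = startPos;
        }

        int endPos() {
            return startPos + text.length();
        }
    }

    /** Returns the own token covering exactly [startPos, endPos), or null if none does. */
    static OwnToken findBySpan(int startPos, int endPos, List<OwnToken> ownTokens) {
        for (OwnToken t : ownTokens) {
            if (t.startPos == startPos && t.endPos() == endPos) {
                return t;
            }
        }
        return null;
    }

    /**
     * Copies each chunk tag onto the own token with the same character span.
     * External tokens without an exact counterpart are skipped.
     */
    static void assign(String sentence, String[] externalTokens, String[] chunkTags,
                       List<OwnToken> ownTokens) {
        int searchFrom = 0;
        for (int i = 0; i < externalTokens.length; i++) {
            int start = sentence.indexOf(externalTokens[i], searchFrom);
            if (start < 0) {
                continue; // token text not found verbatim, nothing to map
            }
            int end = start + externalTokens[i].length();
            OwnToken match = findBySpan(start, end, ownTokens);
            if (match != null) {
                match.chunkTags.add(chunkTags[i]);
            }
            searchFrom = end;
        }
    }

    public static void main(String[] args) {
        String sentence = "I don't sleep";
        // Own tokenizer (assumed behavior): splits around the apostrophe.
        List<OwnToken> own = Arrays.asList(
            new OwnToken("I", 0), new OwnToken("don", 2), new OwnToken("'", 5),
            new OwnToken("t", 6), new OwnToken("sleep", 8));
        // External tokenizer (assumed behavior): Penn Treebank style "do" + "n't".
        String[] external = {"I", "do", "n't", "sleep"};
        String[] tags = {"B-NP", "B-VP", "I-VP", "I-VP"};
        assign(sentence, external, tags, own);
        for (OwnToken t : own) {
            System.out.println(t.text + " " + t.chunkTags);
        }
    }
}
```

In this example only "I" and "sleep" receive a tag, because the two tokenizers disagree on how to split "don't" and no spans coincide there.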
-
Field Summary
Fields
Modifier and Type    Field    Description
private static java.lang.String
CHUNKER_MODEL
private static opennlp.tools.chunker.ChunkerModel
chunkerModel
private EnglishChunkFilter
chunkFilter
private static java.lang.String
POS_TAGGER_MODEL
private static opennlp.tools.postag.POSModel
posModel
private static java.lang.String
TOKENIZER_MODEL
private static opennlp.tools.tokenize.TokenizerModel
tokenModel
This needs to be static to save memory: because Language.LANGUAGES is static, a language created there is never released.
-
Constructor Summary
Constructors
Constructor    Description
EnglishChunker()
-
Method Summary
Modifier and Type    Method    Description
void
addChunkTags(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
private void
assignChunksToReadings(java.util.List<ChunkTaggedToken> chunkTaggedTokens)
private java.lang.String[]
chunk(java.lang.String[] tokens, java.lang.String[] posTags)
private @Nullable org.languagetool.AnalyzedTokenReadings
getAnalyzedTokenReadingsFor(int startPos, int endPos, java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
private java.util.List<ChunkTaggedToken>
getChunkTagsForReadings(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
private java.lang.String
getSentence(java.util.List<org.languagetool.AnalyzedTokenReadings> sentenceTokens)
private java.util.List<ChunkTaggedToken>
getTokensWithTokenReadings(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings, java.lang.String[] tokens, java.lang.String[] chunkTags)
private java.lang.String[]
posTag(java.lang.String[] tokens)
(package private) java.lang.String[]
tokenize(java.lang.String sentence)
-
Field Detail
-
TOKENIZER_MODEL
private static final java.lang.String TOKENIZER_MODEL
- See Also:
- Constant Field Values
-
POS_TAGGER_MODEL
private static final java.lang.String POS_TAGGER_MODEL
- See Also:
- Constant Field Values
-
CHUNKER_MODEL
private static final java.lang.String CHUNKER_MODEL
- See Also:
- Constant Field Values
-
tokenModel
private static volatile opennlp.tools.tokenize.TokenizerModel tokenModel
This needs to be static to save memory: because Language.LANGUAGES is static, a language created there is never released. As English has several variants, we would otherwise hold as many copies of posModel etc. as there are variants, which would be a huge waste of memory.
-
posModel
private static volatile opennlp.tools.postag.POSModel posModel
-
chunkerModel
private static volatile opennlp.tools.chunker.ChunkerModel chunkerModel
-
chunkFilter
private final EnglishChunkFilter chunkFilter
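The three model fields above are declared static volatile so that all instances share one copy. One common way to fill such a field safely from several threads is lazy initialization with double-checked locking. The sketch below illustrates that pattern under stated assumptions: SharedModelSketch, Model, loadModel and LOAD_COUNT are hypothetical names, and this page does not show whether EnglishChunker initializes its models this way.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SharedModelSketch {

    /** Hypothetical stand-in for an expensive model such as a tokenizer or POS model. */
    static final class Model {
        final String name;

        Model(String name) {
            this.name = name;
        }
    }

    // Counts how often the expensive load actually runs (for demonstration only).
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    // Shared by all instances; volatile so a fully built model is visible to every thread.
    private static volatile Model model;

    /** Simulates an expensive load; a real loader would read a model resource. */
    private static Model loadModel() {
        LOAD_COUNT.incrementAndGet();
        return new Model("demo-model");
    }

    SharedModelSketch() {
        // Double-checked locking: only the first constructor call pays for loading.
        if (model == null) {
            synchronized (SharedModelSketch.class) {
                if (model == null) {
                    model = loadModel();
                }
            }
        }
    }

    Model getModel() {
        return model;
    }

    public static void main(String[] args) {
        SharedModelSketch a = new SharedModelSketch();
        SharedModelSketch b = new SharedModelSketch();
        System.out.println(a.getModel() == b.getModel()); // true: one shared instance
        System.out.println(LOAD_COUNT.get());             // 1: loaded only once
    }
}
```

With an instance field instead of a static one, every language variant object would trigger its own load and keep its own copy alive for as long as the variant itself is referenced.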
-
Method Detail
-
addChunkTags
public void addChunkTags(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
- Specified by:
addChunkTags in interface org.languagetool.chunking.Chunker
-
getChunkTagsForReadings
private java.util.List<ChunkTaggedToken> getChunkTagsForReadings(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)
-
tokenize
java.lang.String[] tokenize(java.lang.String sentence)
-
posTag
private java.lang.String[] posTag(java.lang.String[] tokens)
-
chunk
private java.lang.String[] chunk(java.lang.String[] tokens, java.lang.String[] posTags)
-
getTokensWithTokenReadings
private java.util.List<ChunkTaggedToken> getTokensWithTokenReadings(java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings, java.lang.String[] tokens, java.lang.String[] chunkTags)
-
assignChunksToReadings
private void assignChunksToReadings(java.util.List<ChunkTaggedToken> chunkTaggedTokens)
-
getSentence
private java.lang.String getSentence(java.util.List<org.languagetool.AnalyzedTokenReadings> sentenceTokens)
-
getAnalyzedTokenReadingsFor
private @Nullable org.languagetool.AnalyzedTokenReadings getAnalyzedTokenReadingsFor(int startPos, int endPos, java.util.List<org.languagetool.AnalyzedTokenReadings> tokenReadings)