Class MultiWordChunker2
- java.lang.Object
-
- org.languagetool.tagging.disambiguation.AbstractDisambiguator
-
- org.languagetool.tagging.disambiguation.MultiWordChunker2
-
- All Implemented Interfaces:
Disambiguator
public class MultiWordChunker2 extends AbstractDisambiguator
Multiword tagger-chunker. Note: currently does not support:
- overlapping tagging (first matching multiword entry wins)
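The "first matching multiword entry wins" behaviour noted above can be sketched roughly as follows. This is a standalone illustration, not the LanguageTool source: the `FirstMatchWinsSketch` class, its `Entry` type, and the plain `String` tokens are hypothetical stand-ins, and resuming the scan after a match is an assumption about how overlaps are avoided.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the LanguageTool implementation.
final class FirstMatchWinsSketch {

    // Hypothetical stand-in for MultiWordEntry: the phrase tokens and their POS tag.
    static final class Entry {
        final List<String> tokens;
        final String tag;

        Entry(List<String> tokens, String tag) {
            this.tokens = tokens;
            this.tag = tag;
        }
    }

    // Returns one tag per input token, or null where no multiword matched.
    // Entries are looked up by their first token, mirroring the shape of the
    // tokenToPosTagMap field (first token -> list of entries).
    static String[] tag(String[] tokens, Map<String, List<Entry>> byFirstToken) {
        String[] tags = new String[tokens.length];
        int i = 0;
        while (i < tokens.length) {
            Entry match = null;
            List<Entry> candidates =
                byFirstToken.getOrDefault(tokens[i], Collections.<Entry>emptyList());
            for (Entry candidate : candidates) {
                if (matchesAt(tokens, i, candidate)) {
                    match = candidate; // first matching entry wins
                    break;
                }
            }
            if (match == null) {
                i++;
            } else {
                Arrays.fill(tags, i, i + match.tokens.size(), match.tag);
                // Assumption: scanning resumes after the match, so tags never overlap.
                i += match.tokens.size();
            }
        }
        return tags;
    }

    // True if every token of the entry equals the input token at the same offset.
    static boolean matchesAt(String[] tokens, int start, Entry entry) {
        int length = entry.tokens.size();
        if (length == 0 || start + length > tokens.length) {
            return false;
        }
        for (int k = 0; k < length; k++) {
            if (!tokens[start + k].equals(entry.tokens.get(k))) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch, if both "New York" and "New York City" are listed under "New" in that order, the shorter entry is applied and "City" stays untagged, which is the kind of limitation the note describes.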
-
-
Nested Class Summary
Nested Classes
Modifier and Type Class Description
private static class
MultiWordChunker2.MultiWordEntry
-
Field Summary
Fields
Modifier and Type Field Description
private boolean
allowFirstCapitalized
private java.lang.String
filename
private boolean
removeOtherReadings
private java.lang.String
tagFormat
private java.util.Map<java.lang.String,java.util.List<MultiWordChunker2.MultiWordEntry>>
tokenToPosTagMap
private static java.lang.String
WRAP_TAG
-
Constructor Summary
Constructors
Constructor Description
MultiWordChunker2(java.lang.String filename)
MultiWordChunker2(java.lang.String filename, boolean allowFirstCapitalized)
-
Method Summary
Modifier and Type Method Description
AnalyzedSentence
disambiguate(AnalyzedSentence input)
Implements multiword POS tags, e.g., <ELLIPSIS> for ellipsis (...) start, and </ELLIPSIS> for ellipsis end.
private MultiWordChunker2.MultiWordEntry
findMultiwordEntry(AnalyzedTokenReadings[] inputTokens, int startingPosition, java.util.List<MultiWordChunker2.MultiWordEntry> multiwordItems)
protected java.lang.String
formatPosTag(java.lang.String posTag, int position, int multiwordLength)
Override this method if you want to format the POS tag differently.
private boolean
isMatching(AnalyzedTokenReadings[] inputTokens, int startingPosition, MultiWordChunker2.MultiWordEntry multiWordEntry)
private void
lazyInit()
private java.util.List<java.lang.String>
loadWords(java.io.InputStream stream)
protected boolean
matches(java.lang.String matchText, AnalyzedTokenReadings inputTokens)
protected AnalyzedTokenReadings
prepareNewReading(java.lang.String tokens, java.lang.String tok, AnalyzedTokenReadings token, java.lang.String tag)
private AnalyzedTokenReadings
setAndAnnotate(AnalyzedTokenReadings oldReading, AnalyzedToken newReading)
void
setRemoveOtherReadings(boolean removeOtherReadings)
void
setWrapTag(boolean wrapTag)
-
Methods inherited from class org.languagetool.tagging.disambiguation.AbstractDisambiguator
preDisambiguate
-
-
-
-
Field Detail
-
WRAP_TAG
private static final java.lang.String WRAP_TAG
- See Also:
- Constant Field Values
-
filename
private final java.lang.String filename
-
allowFirstCapitalized
private final boolean allowFirstCapitalized
-
removeOtherReadings
private boolean removeOtherReadings
-
tagFormat
private java.lang.String tagFormat
-
tokenToPosTagMap
private java.util.Map<java.lang.String,java.util.List<MultiWordChunker2.MultiWordEntry>> tokenToPosTagMap
-
-
Constructor Detail
-
MultiWordChunker2
public MultiWordChunker2(java.lang.String filename)
- Parameters:
filename
- text file with multiwords and tags
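The layout of the multiword file is not specified on this page. As a minimal sketch, assuming one entry per line in the form phrase, TAB, tag, with `#` starting a comment line, the entries could be indexed by first token like this. The class name `MultiwordFileSketch`, the file layout, and the space-separated phrase tokens are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; the file layout below is an assumption.
final class MultiwordFileSketch {

    // Maps the first token of each phrase to its {phrase, tag} pairs.
    static Map<String, List<String[]>> index(String fileContent) {
        Map<String, List<String[]>> byFirstToken = new LinkedHashMap<>();
        for (String rawLine : fileContent.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] parts = line.split("\t");
            if (parts.length != 2) {
                continue; // skip lines that do not look like "phrase<TAB>tag"
            }
            String phrase = parts[0].trim();
            String tag = parts[1].trim();
            String firstToken = phrase.split(" ")[0];
            byFirstToken
                .computeIfAbsent(firstToken, key -> new ArrayList<>())
                .add(new String[] {phrase, tag});
        }
        return byFirstToken;
    }
}
```

Indexing by first token keeps the per-position lookup cheap: only entries that start with the current token need to be compared against the following tokens.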
-
MultiWordChunker2
public MultiWordChunker2(java.lang.String filename, boolean allowFirstCapitalized)
- Parameters:
filename
- text file with multiwords and tags
allowFirstCapitalized
- if set to true, the first word of the multiword can be capitalized
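What allowing a capitalized first word could mean in practice is sketched below. This is an assumption-laden illustration rather than the class's actual matching code: `FirstCapitalizedSketch` and `firstTokenMatches` are hypothetical names, and only the first letter of the first word is treated as case-flexible.

```java
import java.util.Locale;

// Illustrative sketch only; not the LanguageTool implementation.
final class FirstCapitalizedSketch {

    // True if the token equals the expected first word exactly, or, when
    // allowFirstCapitalized is set, equals it with the first letter upper-cased.
    static boolean firstTokenMatches(String expected, String token,
                                     boolean allowFirstCapitalized) {
        if (expected.equals(token)) {
            return true;
        }
        if (!allowFirstCapitalized || expected.isEmpty()) {
            return false;
        }
        String capitalized =
            expected.substring(0, 1).toUpperCase(Locale.ROOT) + expected.substring(1);
        return capitalized.equals(token);
    }
}
```

With this approach a sentence-initial "Et cetera" can match a lower-case "et cetera" entry, while an all-caps "ET" still does not match.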
-
-
Method Detail
-
setRemoveOtherReadings
public void setRemoveOtherReadings(boolean removeOtherReadings)
- Parameters:
removeOtherReadings
- If true and a multiword matches, other readings will be removed
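The effect of this flag on a single token can be pictured with plain tag lists. The sketch below is hypothetical: `RemoveOtherReadingsSketch` models readings as strings rather than `AnalyzedToken` objects, and appending the multiword tag after the existing readings is an assumption about ordering.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; readings are modelled as plain POS tag strings.
final class RemoveOtherReadingsSketch {

    // Returns the readings of a token after a multiword matched it.
    static List<String> applyMultiwordTag(List<String> existingTags,
                                          String multiwordTag,
                                          boolean removeOtherReadings) {
        List<String> result = new ArrayList<>();
        if (!removeOtherReadings) {
            result.addAll(existingTags); // keep the earlier readings
        }
        result.add(multiwordTag); // assumption: the new reading is added last
        return result;
    }
}
```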
-
setWrapTag
public void setWrapTag(boolean wrapTag)
- Parameters:
wrapTag
- If true, the tag will be wrapped with < and >
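One plausible way to relate this setter to the `tagFormat` field is a format string that is switched by the flag. The sketch below is a guess at that relationship, not the documented behaviour: `WrapTagSketch` and both format strings are assumptions.

```java
// Illustrative sketch only; the format strings are assumptions.
final class WrapTagSketch {

    // Chooses a format string depending on whether the tag should be wrapped.
    static String tagFormatFor(boolean wrapTag) {
        return wrapTag ? "<%s>" : "%s";
    }

    // Applies the chosen format to a POS tag.
    static String format(String posTag, boolean wrapTag) {
        return String.format(tagFormatFor(wrapTag), posTag);
    }
}
```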
-
formatPosTag
protected java.lang.String formatPosTag(java.lang.String posTag, int position, int multiwordLength)
Override this method if you want to format the POS tag differently.
- Parameters:
posTag
- POS tag for the multiword
position
- Position of the token in the multiword
- Returns:
- Formatted POS tag for the multiword
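A position-aware formatter of the kind an override might provide is sketched below as a standalone static method. It follows the start/end marker style mentioned for `disambiguate` (`<ELLIPSIS>` and `</ELLIPSIS>`). The class name `PosTagFormatSketch`, the zero-based `position`, and leaving middle tokens with the bare tag are assumptions chosen for illustration.

```java
// Illustrative sketch only; not the LanguageTool implementation.
final class PosTagFormatSketch {

    // Formats the tag for the token at the given position of a multiword.
    // Assumption: position is zero-based and multiwordLength counts tokens.
    static String formatPosTag(String posTag, int position, int multiwordLength) {
        if (position == 0) {
            return "<" + posTag + ">";  // opening marker on the first token
        }
        if (position == multiwordLength - 1) {
            return "</" + posTag + ">"; // closing marker on the last token
        }
        return posTag;                  // arbitrary choice for middle tokens
    }
}
```

A real override would be an instance method in a subclass of `MultiWordChunker2` with the same three parameters; the static form here only keeps the example free of library dependencies.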
-
lazyInit
private void lazyInit()
-
disambiguate
public AnalyzedSentence disambiguate(AnalyzedSentence input)
Implements multiword POS tags, e.g., <ELLIPSIS> for ellipsis (...) start, and </ELLIPSIS> for ellipsis end.
- Parameters:
input
- The tokens to be chunked.
- Returns:
- AnalyzedSentence with additional markers.
-
findMultiwordEntry
private MultiWordChunker2.MultiWordEntry findMultiwordEntry(AnalyzedTokenReadings[] inputTokens, int startingPosition, java.util.List<MultiWordChunker2.MultiWordEntry> multiwordItems)
-
isMatching
private boolean isMatching(AnalyzedTokenReadings[] inputTokens, int startingPosition, MultiWordChunker2.MultiWordEntry multiWordEntry)
-
matches
protected boolean matches(java.lang.String matchText, AnalyzedTokenReadings inputTokens)
-
prepareNewReading
protected AnalyzedTokenReadings prepareNewReading(java.lang.String tokens, java.lang.String tok, AnalyzedTokenReadings token, java.lang.String tag)
-
setAndAnnotate
private AnalyzedTokenReadings setAndAnnotate(AnalyzedTokenReadings oldReading, AnalyzedToken newReading)
-
loadWords
private java.util.List<java.lang.String> loadWords(java.io.InputStream stream)
-
-