Class ScannerSrxTextIterator
java.lang.Object
  net.loomchild.segment.AbstractTextIterator
    net.loomchild.segment.srx.legacy.ScannerSrxTextIterator
All Implemented Interfaces:
java.util.Iterator<java.lang.String>, TextIterator
public class ScannerSrxTextIterator extends AbstractTextIterator
Quick-and-dirty implementation of TextIterator using Scanner.
Preliminary tests showed that it takes between 50% and 100% more time to complete than the default text iterator. The probable cause is slow matching of exception rules, although splitting with break rules only is also slower.
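The general idea of splitting with a Scanner and then re-checking each candidate segment against exception rules can be sketched as follows. This is a minimal illustration, not the code of this class: the class name ScannerSplitSketch, the BREAK and EXCEPTION patterns, and the split helper are all hypothetical, and consumed whitespace is replaced by a single space when segments are rejoined, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class ScannerSplitSketch {
    // Hypothetical break rule: whitespace preceded by sentence-ending punctuation.
    static final Pattern BREAK = Pattern.compile("(?<=[.?!])\\s+");
    // Hypothetical exception rule: a candidate segment ending in "Mr." is not a real break.
    static final Pattern EXCEPTION = Pattern.compile("\\bMr\\.$");

    static List<String> split(String text) {
        List<String> segments = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            // The Scanner treats every break-rule match as a delimiter.
            scanner.useDelimiter(BREAK);
            while (scanner.hasNext()) {
                StringBuilder segment = new StringBuilder(scanner.next());
                // Each candidate is re-checked against the exception rule;
                // on a hit the next token is appended instead of starting a new segment.
                while (EXCEPTION.matcher(segment).find() && scanner.hasNext()) {
                    segment.append(' ').append(scanner.next());
                }
                segments.add(segment.toString());
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        // Prints "Mr. Smith arrived." and "He left." on separate lines.
        for (String segment : split("Mr. Smith arrived. He left.")) {
            System.out.println(segment);
        }
    }
}
```

The extra exception check on every candidate segment is one plausible source of the overhead mentioned above, since it runs a second regex match per token.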
Like other iterators that scan with one big pattern, this implementation cannot resolve overlapping rules, and there seems to be no easy solution. Overlapping rules should not occur in input patterns, but in a large SRX file that uses cascading they are very easy to miss.
One solution could be to sort the patterns by length, but this is sometimes impossible. For example:
Rules are "(ab)+" and "a(b)+"
Inputs are "ababx" and "abbbx"
For the first input, the order of the exception rules should be reversed so that the text is split as early as possible, but for the second input it should not. Another solution could be to use reluctant quantifiers instead of greedy ones, but that changes the input patterns provided by the user and is therefore undesirable.
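The ordering problem can be reproduced with plain java.util.regex. The sketch below assumes the two rules are joined into a single alternation, which is an illustration of the one-big-pattern approach and not necessarily how this class builds its patterns; the class name OverlapDemo and the matchEnd helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapDemo {
    // Returns the end index of the match anchored at the start of the input, or -1 if none.
    static int matchEnd(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.lookingAt() ? matcher.end() : -1;
    }

    public static void main(String[] args) {
        String[] inputs = {"ababx", "abbbx"};
        String[] patterns = {
            "(ab)+|a(b)+",   // original rule order, greedy:  ends 4 and 2
            "a(b)+|(ab)+",   // reversed rule order, greedy:  ends 2 and 4
            "(ab)+?|a(b)+?"  // reluctant quantifiers:        ends 2 and 2
        };
        for (String pattern : patterns) {
            for (String input : inputs) {
                System.out.println(pattern + " on " + input + " ends at " + matchEnd(pattern, input));
            }
        }
    }
}
```

Neither greedy ordering yields the earliest end for both inputs, while the reluctant variant does, at the cost of rewriting the user's patterns.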
-
-
Field Summary
Fields
private java.util.Map<java.util.regex.Pattern,java.util.regex.Pattern> exceptionMap
private boolean noBreakRules
private java.util.Scanner scanner
-
Constructor Summary
Constructors
public ScannerSrxTextIterator(SrxDocument document, java.lang.String languageCode, java.io.Reader reader, java.util.Map<java.lang.String,java.lang.Object> parameterMap)
public ScannerSrxTextIterator(SrxDocument document, java.lang.String languageCode, java.lang.String text, java.util.Map<java.lang.String,java.lang.Object> parameterMap)
private ScannerSrxTextIterator(SrxDocument document, java.lang.String languageCode, java.util.Scanner scanner)
-
Method Summary
Methods
private java.lang.String createBreakRegexLookahead(Rule rule)
private java.lang.String createBreakRegexNoLookahead(Rule rule)
private java.lang.String createExceptionRegex(Rule rule)
private java.util.Map<java.util.regex.Pattern,java.util.regex.Pattern> createExceptions(java.util.List<LanguageRule> languageRuleList)
private java.lang.String createSeparator(java.util.List<LanguageRule> languageRuleList)
public boolean hasNext()
private boolean isException(java.lang.StringBuilder segment)
public java.lang.String next()
-
Methods inherited from class net.loomchild.segment.AbstractTextIterator
remove, toString
-
Constructor Detail
-
ScannerSrxTextIterator
public ScannerSrxTextIterator(SrxDocument document, java.lang.String languageCode, java.lang.String text, java.util.Map<java.lang.String,java.lang.Object> parameterMap)
-
ScannerSrxTextIterator
public ScannerSrxTextIterator(SrxDocument document, java.lang.String languageCode, java.io.Reader reader, java.util.Map<java.lang.String,java.lang.Object> parameterMap)
-
ScannerSrxTextIterator
private ScannerSrxTextIterator(SrxDocument document, java.lang.String languageCode, java.util.Scanner scanner)
-
-
Method Detail
-
createSeparator
private java.lang.String createSeparator(java.util.List<LanguageRule> languageRuleList)
-
createExceptions
private java.util.Map<java.util.regex.Pattern,java.util.regex.Pattern> createExceptions(java.util.List<LanguageRule> languageRuleList)
-
createBreakRegexLookahead
private java.lang.String createBreakRegexLookahead(Rule rule)
-
createBreakRegexNoLookahead
private java.lang.String createBreakRegexNoLookahead(Rule rule)
-
createExceptionRegex
private java.lang.String createExceptionRegex(Rule rule)
-
hasNext
public boolean hasNext()
- Returns:
- true if there are more segments
-
next
public java.lang.String next()
- Returns:
- next segment in text, or null if end of text has been reached.
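Because next() is documented to return null at the end of the text, callers can either guard with hasNext() or loop until null. The sketch below uses a stand-in iterator that follows the same contract; NullAtEndSketch, segments, and drain are hypothetical names used only for illustration and are not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class NullAtEndSketch {
    // Stand-in iterator (hypothetical): next() yields null once the input is exhausted.
    static Iterator<String> segments(String... values) {
        final Iterator<String> inner = Arrays.asList(values).iterator();
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return inner.hasNext();
            }

            @Override
            public String next() {
                return inner.hasNext() ? inner.next() : null;
            }
        };
    }

    // Collects segments by looping until next() returns null.
    static List<String> drain(Iterator<String> iterator) {
        List<String> result = new ArrayList<>();
        String segment;
        while ((segment = iterator.next()) != null) {
            result.add(segment);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(drain(segments("First.", "Second.")));
    }
}
```

Guarding with hasNext() is the more conventional Iterator idiom and also works with iterators that throw at the end instead of returning null.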
-
isException
private boolean isException(java.lang.StringBuilder segment)
-
-