Class ScannerSrxTextIterator

  • All Implemented Interfaces:
    java.util.Iterator<java.lang.String>, TextIterator

    public class ScannerSrxTextIterator
    extends AbstractTextIterator

    Quick-and-dirty implementation of TextIterator based on java.util.Scanner.
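
    To illustrate the general idea of Scanner-based splitting, here is a minimal sketch. It is not this class's actual code: the class name ScannerSplitSketch, the method split, and the SEPARATOR pattern are hypothetical stand-ins for whatever separator is built from the SRX break rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerSplitSketch {

    // Hypothetical break rule (an assumption, not taken from any SRX file):
    // break on whitespace that follows '.', '?' or '!'.
    static final String SEPARATOR = "(?<=[.?!])\\s+";

    // Splits the text by using the break pattern as the Scanner delimiter.
    static List<String> split(String text) {
        List<String> segments = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter(SEPARATOR);
            while (scanner.hasNext()) {
                segments.add(scanner.next());
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        for (String segment : split("Hello world. How are you? Fine.")) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

    In this sketch the whitespace matched by the delimiter is discarded, and no exception rules are applied; a real segmenter would have to handle both.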

    Preliminary tests showed that it takes between 50% and 100% longer to complete than the default text iterator. The likely cause is slow matching of exception rules, although splitting with break rules alone is also slower.

    Like other iterators that scan with one big pattern, this implementation cannot resolve overlapping rules, and there seems to be no easy solution. Overlapping rules should not occur in the input patterns, but in a large SRX file that uses cascading they are very easy to miss.

    One solution could be to sort the patterns by length, but this is sometimes impossible. For example:
    Rules are "(ab)+" and "a(b)+"
    Inputs are "ababx" and "abbbx"
    For the first input, the order of the exception rules should be reversed so that the text is split as early as possible, but for the second input it should not. Another option would be to use reluctant quantifiers instead of greedy ones, but that changes the input patterns provided by the user and is therefore undesirable.
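
    The ordering problem described above can be reproduced with plain java.util.regex. The following is a sketch under the assumption that the rules are combined into one alternation; the class name OverlappingRulesDemo and the helper firstMatchEnd are hypothetical and not part of this library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingRulesDemo {

    // Returns the end offset of the first match of regex in input, or -1 if none.
    static int firstMatchEnd(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.end() : -1;
    }

    public static void main(String[] args) {
        String original = "(ab)+|a(b)+";    // rules in their given order
        String reversed = "a(b)+|(ab)+";    // rules in reversed order
        String reluctant = "(ab)+?|a(b)+?"; // reluctant quantifiers

        for (String input : new String[] {"ababx", "abbbx"}) {
            System.out.println(input
                    + " original=" + firstMatchEnd(original, input)
                    + " reversed=" + firstMatchEnd(reversed, input)
                    + " reluctant=" + firstMatchEnd(reluctant, input));
        }
        // prints:
        // ababx original=4 reversed=2 reluctant=2
        // abbbx original=2 reversed=4 reluctant=2
    }
}
```

    Java alternation tries the alternatives from left to right, so no fixed order gives the earliest match end for both inputs. The reluctant variant does, at the cost of rewriting the user's patterns.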

    • Field Detail

      • scanner

        private java.util.Scanner scanner
      • exceptionMap

        private java.util.Map<java.util.regex.Pattern,java.util.regex.Pattern> exceptionMap
      • noBreakRules

        private boolean noBreakRules
    • Constructor Detail

      • ScannerSrxTextIterator

        public ScannerSrxTextIterator(SrxDocument document,
                                      java.lang.String languageCode,
                                      java.lang.String text,
                                      java.util.Map<java.lang.String,java.lang.Object> parameterMap)
      • ScannerSrxTextIterator

        public ScannerSrxTextIterator(SrxDocument document,
                                      java.lang.String languageCode,
                                      java.io.Reader reader,
                                      java.util.Map<java.lang.String,java.lang.Object> parameterMap)
      • ScannerSrxTextIterator

        private ScannerSrxTextIterator(SrxDocument document,
                                       java.lang.String languageCode,
                                       java.util.Scanner scanner)
    • Method Detail

      • createSeparator

        private java.lang.String createSeparator(java.util.List<LanguageRule> languageRuleList)
      • createExceptions

        private java.util.Map<java.util.regex.Pattern,java.util.regex.Pattern> createExceptions(java.util.List<LanguageRule> languageRuleList)
      • createBreakRegexLookahead

        private java.lang.String createBreakRegexLookahead(Rule rule)
      • createBreakRegexNoLookahead

        private java.lang.String createBreakRegexNoLookahead(Rule rule)
      • createExceptionRegex

        private java.lang.String createExceptionRegex(Rule rule)
      • hasNext

        public boolean hasNext()
        Returns:
        true if there are more segments
      • next

        public java.lang.String next()
        Returns:
        the next segment in the text, or null if the end of the text has been reached.
      • isException

        private boolean isException(java.lang.StringBuilder segment)