Class Tokenizer

java.lang.Object
org.gjt.xpp.impl.tokenizer.Tokenizer

public class Tokenizer extends Object
Simple XML Tokenizer (SXT) performs input-stream tokenizing. Advantages:
  • utility class that simplifies creation of XML parsers; especially suited to the pull event model, but can also support push (SAX2)
  • small footprint: the whole tokenizer is in one file
  • minimal memory utilization: uses no memory except for the input and content buffers (which can grow in size)
  • fast: all parsing is done in one function (a simple automaton)
  • supports most of XML 1.0 (except validation and external entities)
  • low level: supports on-demand parsing of Characters, CDSect, Comments, PIs, etc.
  • parsed content: can provide parsed content to the application on demand (standard entities expanded, CDATA sections inserted, Comments and PIs removed), both for attribute values and element content
  • mixed content: allows mixed content to be dynamically disabled
  • small: total compiled size around 15K
Limitations:
  • it is just a tokenizer: it does not enforce grammar
  • readName() uses Java identifier rules, not XML name rules
  • does not parse the DOCTYPE declaration (skips everything in [...])
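The readName() limitation above is easy to observe: Java identifier rules and XML name rules diverge on several characters. A minimal self-contained illustration (not part of the Tokenizer API):

```java
public class NameRuleGap {
    public static void main(String[] args) {
        // Java identifiers may start with '$', which is not a legal XML
        // name start character; conversely '-' is legal inside XML names
        // but not inside Java identifiers.
        System.out.println(Character.isJavaIdentifierStart('$')); // true, yet '$' is illegal in XML names
        System.out.println(Character.isJavaIdentifierPart('-'));  // false, yet '-' is legal inside XML names
    }
}
```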
Author:
Aleksander Slominski
  • Field Details

    • END_DOCUMENT

      public static final byte END_DOCUMENT
    • CONTENT

      public static final byte CONTENT
    • CHARACTERS

      public static final byte CHARACTERS
    • CDSECT

      public static final byte CDSECT
    • COMMENT

      public static final byte COMMENT
    • DOCTYPE

      public static final byte DOCTYPE
    • PI

      public static final byte PI
    • ENTITY_REF

      public static final byte ENTITY_REF
    • CHAR_REF

      public static final byte CHAR_REF
    • ETAG_NAME

      public static final byte ETAG_NAME
    • EMPTY_ELEMENT

      public static final byte EMPTY_ELEMENT
    • STAG_END

      public static final byte STAG_END
    • STAG_NAME

      public static final byte STAG_NAME
    • ATTR_NAME

      public static final byte ATTR_NAME
    • ATTR_CHARACTERS

      public static final byte ATTR_CHARACTERS
    • ATTR_CONTENT

      public static final byte ATTR_CONTENT
    • paramNotifyCharacters

      public boolean paramNotifyCharacters
    • paramNotifyComment

      public boolean paramNotifyComment
    • paramNotifyCDSect

      public boolean paramNotifyCDSect
    • paramNotifyDoctype

      public boolean paramNotifyDoctype
    • paramNotifyPI

      public boolean paramNotifyPI
    • paramNotifyCharRef

      public boolean paramNotifyCharRef
    • paramNotifyEntityRef

      public boolean paramNotifyEntityRef
    • paramNotifyAttValue

      public boolean paramNotifyAttValue
    • buf

      public char[] buf
    • pos

      public int pos
      Position of the next char that will be read from the buffer.
    • posStart

      public int posStart
      The range [posStart, posEnd) defines the part of buf that holds the content of the current token iff parsedContent == false.
    • posEnd

      public int posEnd
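Because the [posStart, posEnd) range is half-open, a token's text can be extracted from buf directly with String(char[], int, int). A minimal sketch with hand-picked offsets (hypothetical values, not produced by the tokenizer):

```java
public class RangeDemo {
    public static void main(String[] args) {
        char[] buf = "<name>value</name>".toCharArray();
        // Hypothetical token range for the element content "value":
        // half-open, so the token occupies buf[posStart] .. buf[posEnd - 1].
        int posStart = 6, posEnd = 11;
        String token = new String(buf, posStart, posEnd - posStart);
        System.out.println(token); // value
    }
}
```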
    • posNsColon

      public int posNsColon
    • nsColonCount

      public int nsColonCount
    • seenContent

      public boolean seenContent
    • parsedContent

      public boolean parsedContent
      This flag decides which buffer is used to retrieve the content of the current token: if true, use pc and [pcStart, pcEnd); if false, use buf and [posStart, posEnd).
    • pc

      public char[] pc
      This is the buffer for parsed content, such as the actual value of an entity (buf holds the raw '&lt;' while pc holds the expanded '<').
    • pcStart

      public int pcStart
      The range [pcStart, pcEnd) defines the part of pc that holds the content of the current token iff parsedContent == true.
    • pcEnd

      public int pcEnd
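Putting the two buffers together: a reader of the current token checks parsedContent to decide whether to take text from pc or from buf. A self-contained sketch with hand-set field values (illustrative state, not produced by the real parser):

```java
public class ContentSelect {
    public static void main(String[] args) {
        // Hypothetical state after character content containing "&lt;"
        // has been parsed: buf holds the raw input, pc the expanded text.
        char[] buf = "a &lt; b".toCharArray();
        int posStart = 0, posEnd = buf.length;
        char[] pc = "a < b".toCharArray();
        int pcStart = 0, pcEnd = pc.length;
        boolean parsedContent = true;

        String text = parsedContent
                ? new String(pc, pcStart, pcEnd - pcStart)
                : new String(buf, posStart, posEnd - posStart);
        System.out.println(text); // a < b
    }
}
```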
    • LOOKUP_MAX

      protected static final int LOOKUP_MAX
    • LOOKUP_MAX_CHAR

      protected static final char LOOKUP_MAX_CHAR
    • lookupNameStartChar

      protected static boolean[] lookupNameStartChar
    • lookupNameChar

      protected static boolean[] lookupNameChar
  • Constructor Details

    • Tokenizer

      public Tokenizer()
  • Method Details

    • reset

      public void reset()
    • setInput

      public void setInput(Reader r)
      Reset tokenizer state and set new input source
    • setInput

      public void setInput(char[] data)
      Reset tokenizer state and set new input source
    • setInput

      public void setInput(char[] data, int off, int len)
    • setNotifyAll

      public void setNotifyAll(boolean enable)
      Set notification of all XML content tokens: Characters, Comment, CDSect, Doctype, PI, EntityRef, CharRef and AttValue (tokens for STag, ETag and Attribute are always sent).
    • setParseContent

      public void setParseContent(boolean enable)
      Enable reporting of parsed content for element content and attribute content (no need to deal with low-level tokens as in setNotifyAll).
    • isAllowedMixedContent

      public boolean isAllowedMixedContent()
    • setAllowedMixedContent

      public void setAllowedMixedContent(boolean enable)
      Set support for mixed content. If mixed content is disabled, the tokenizer will do its best to ensure that no element has a mixed content model; ignorable whitespace will also not be reported as element content.
    • getSoftLimit

      public int getSoftLimit()
    • setSoftLimit

      public void setSoftLimit(int value) throws TokenizerException
      Set a soft limit on the internal buffer size, i.e. a suggested size that the tokenizer will try to keep.
      Throws:
      TokenizerException
    • getHardLimit

      public int getHardLimit()
    • setHardLimit

      public void setHardLimit(int value) throws TokenizerException
      Set a hard limit on the internal buffer size: if the input (such as element content) is bigger than the hard limit, the tokenizer will throw XmlTokenizerBufferOverflowException.
      Throws:
      TokenizerException
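The soft/hard limit policy described above amounts to a grow-on-demand buffer that refuses to exceed the hard limit. A hypothetical helper sketching that policy (not the tokenizer's actual code; the real class throws its own exception type):

```java
public class BufferLimits {
    // Grow buf to hold at least `needed` chars, but never past hardLimit.
    static char[] ensureCapacity(char[] buf, int needed, int hardLimit) {
        if (needed > hardLimit)
            throw new IllegalStateException(
                "buffer overflow: need " + needed + " > hard limit " + hardLimit);
        if (needed <= buf.length) return buf;
        int newSize = Math.max(needed, buf.length * 2); // double on growth
        char[] bigger = new char[Math.min(newSize, hardLimit)];
        System.arraycopy(buf, 0, bigger, 0, buf.length);
        return bigger;
    }

    public static void main(String[] args) {
        char[] buf = ensureCapacity(new char[8], 16, 64);
        System.out.println(buf.length); // 16
        try {
            ensureCapacity(buf, 128, 64);
        } catch (IllegalStateException e) {
            System.out.println("overflow");
        }
    }
}
```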
    • getBufferShrinkOffset

      public int getBufferShrinkOffset()
    • setBufferShrinkable

      public void setBufferShrinkable(boolean shrinkable) throws TokenizerException
      Throws:
      TokenizerException
    • isBufferShrinkable

      public boolean isBufferShrinkable()
    • getPosDesc

      public String getPosDesc()
      Return a string describing the current position of the parser as text: 'at line %d (row) and column %d (column) [seen %s...]'.
    • getLineNumber

      public int getLineNumber()
    • getColumnNumber

      public int getColumnNumber()
    • isNameStartChar

      protected boolean isNameStartChar(char ch)
    • isNameChar

      protected boolean isNameChar(char ch)
    • isS

      protected boolean isS(char ch)
      Determine whether ch is whitespace (XML 1.0 production [3] S).
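XML 1.0 production [3] defines S as one or more of space, tab, carriage return, or line feed, so the check reduces to four comparisons. A sketch of the same test (not the class's actual source):

```java
public class WhitespaceCheck {
    // XML 1.0 production [3]: S ::= (#x20 | #x9 | #xD | #xA)+
    static boolean isS(char ch) {
        return ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n';
    }

    public static void main(String[] args) {
        System.out.println(isS(' '));  // true
        System.out.println(isS('a'));  // false
    }
}
```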
    • next

      public byte next() throws TokenizerException, IOException
      Return the next recognized token, or END_DOCUMENT if there is no more input.

      This is a simple automaton (in pseudo-code):

       byte next() {
          while(state != END_DOCUMENT) {
            ch = more();             // read character from input
            state = func(ch, state); // do transition
            if(state is accepting)
              return state;          // return token to caller
          }
          return END_DOCUMENT;
       }
       

      For speed (and simplicity?) it uses a few helper procedures such as readName() or isS().

      Throws:
      TokenizerException
      IOException
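The automaton loop above can be exercised with a tiny self-contained analogue that recognizes only two token kinds. The constant names mirror this class's style, but the class itself is illustrative and not part of this API:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Minimal sketch of the next()-style automaton on a drastically simplified
// grammar: '<'...'>' is a tag token, anything else is character content.
public class MiniTokenizer {
    public static final byte END_DOCUMENT = 1;
    public static final byte TAG = 2;
    public static final byte CHARACTERS = 3;

    private final Reader in;
    private final StringBuilder text = new StringBuilder();
    private int pushback = -1;

    public MiniTokenizer(Reader in) { this.in = in; }

    // read character from input, honoring one char of pushback
    private int more() throws IOException {
        if (pushback >= 0) { int c = pushback; pushback = -1; return c; }
        return in.read();
    }

    // Loop reading characters until a whole token is accepted.
    public byte next() throws IOException {
        text.setLength(0);
        int ch = more();
        if (ch == -1) return END_DOCUMENT;
        if (ch == '<') {                       // tag state
            while ((ch = more()) != '>') {
                if (ch == -1) throw new IOException("unterminated tag");
                text.append((char) ch);
            }
            return TAG;
        }
        text.append((char) ch);                // characters state
        while ((ch = more()) != -1 && ch != '<') text.append((char) ch);
        if (ch == '<') pushback = ch;          // save for the next call
        return CHARACTERS;
    }

    public String getText() { return text.toString(); }

    public static void main(String[] args) throws IOException {
        MiniTokenizer t = new MiniTokenizer(new StringReader("<a>hi</a>"));
        byte tok;
        while ((tok = t.next()) != END_DOCUMENT)
            System.out.println(tok + " " + t.getText());
    }
}
```

The caller drives parsing by repeatedly invoking next() until END_DOCUMENT, which is the pull pattern this tokenizer is designed for.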