Class SimpleXMLParser

java.lang.Object
com.lowagie.text.xml.simpleparser.SimpleXMLParser

public final class SimpleXMLParser extends Object
A simple XML and HTML parser. Like the SAX parser, it is event-based, but it offers much less functionality.

The parser:

  • recognizes the encoding used
  • recognizes all the elements' start tags and end tags
  • lists attributes, where attribute values can be enclosed in single or double quotes
  • recognizes the <![CDATA[ ... ]]> construct
  • recognizes the standard entities: &amp;, &lt;, &gt;, &quot;, and &apos;, as well as numeric entities
  • maps lines ending in \r\n and \r to \n on input, in accordance with the XML Specification, Section 2.11
  • Field Details

    • UNKNOWN

      private static final int UNKNOWN
      One of the possible parser states.
    • TEXT

      private static final int TEXT
    • TAG_ENCOUNTERED

      private static final int TAG_ENCOUNTERED
    • EXAMIN_TAG

      private static final int EXAMIN_TAG
    • TAG_EXAMINED

      private static final int TAG_EXAMINED
    • IN_CLOSETAG

      private static final int IN_CLOSETAG
    • SINGLE_TAG

      private static final int SINGLE_TAG
    • CDATA

      private static final int CDATA
    • COMMENT

      private static final int COMMENT
    • PI

      private static final int PI
    • ENTITY

      private static final int ENTITY
    • QUOTE

      private static final int QUOTE
    • ATTRIBUTE_KEY

      private static final int ATTRIBUTE_KEY
    • ATTRIBUTE_EQUAL

      private static final int ATTRIBUTE_EQUAL
    • ATTRIBUTE_VALUE

      private static final int ATTRIBUTE_VALUE
    • stack

      Stack<Integer> stack
      the state stack
    • character

      int character
      The current character.
    • previousCharacter

      int previousCharacter
      The previous character.
    • lines

      int lines
      the line we are currently reading
    • columns

      int columns
      the column where the current character occurs
    • eol

      boolean eol
      was the last character equivalent to a newline?
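
The fields character, previousCharacter, lines, columns and eol support the line-ending rule from the class description (\r\n and \r become \n). The following is a minimal sketch of that rule only, not the parser's actual code; the class LineEndingSketch and its methods are hypothetical names introduced for illustration.

```java
// Hypothetical helper for illustration; not part of SimpleXMLParser.
final class LineEndingSketch {

    /** Maps "\r\n" and a lone "\r" to "\n", as XML 1.0 section 2.11 requires. */
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int previous = -1; // plays the role of previousCharacter
        for (int i = 0; i < input.length(); i++) {
            int c = input.charAt(i);
            if (c == '\r') {
                out.append('\n'); // a carriage return always yields one newline
            } else if (c == '\n') {
                if (previous != '\r') {
                    out.append('\n'); // the "\n" of "\r\n" was already emitted
                }
            } else {
                out.append((char) c);
            }
            previous = c;
        }
        return out.toString();
    }

    /** Returns the 1-based number of the last line in already normalized text. */
    static int countLines(String normalized) {
        int lines = 1;
        for (int i = 0; i < normalized.length(); i++) {
            if (normalized.charAt(i) == '\n') {
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = normalize("a\r\nb\rc\nd");
        System.out.println(countLines(text)); // prints 4
    }
}
```

Tracking only the previous character is enough here because the rule never needs to look further back than one position.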
    • nowhite

      boolean nowhite
      Indicates whether the next character should be taken into account if it is a space character. When nowhite is false, the previous character wasn't whitespace.
      Since:
      2.1.5
    • state

      int state
      the current state
    • html

      boolean html
      Are we parsing HTML?
    • text

      current text (whatever is encountered between tags)
    • entity

      StringBuffer entity
      current entity (whatever is encountered between & and ;)
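
To illustrate how the text collected between & and ; might be turned into a character, here is a minimal sketch covering the five standard entities and numeric entities named in the class description. EntitySketch and decode are hypothetical names; returning null for an unrecognized entity and accepting an uppercase "#X" prefix are assumptions of this sketch, not documented behavior of the parser.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of SimpleXMLParser.
final class EntitySketch {

    private static final Map<String, Character> NAMED = Map.of(
            "amp", '&', "lt", '<', "gt", '>', "quot", '"', "apos", '\'');

    /** Decodes the text found between '&' and ';'; returns null if it is not recognized. */
    static String decode(String entity) {
        if (entity.startsWith("#x") || entity.startsWith("#X")) {
            return fromCodePoint(entity.substring(2), 16); // hexadecimal reference
        }
        if (entity.startsWith("#")) {
            return fromCodePoint(entity.substring(1), 10); // decimal reference
        }
        Character c = NAMED.get(entity);
        return c == null ? null : String.valueOf(c);
    }

    private static String fromCodePoint(String digits, int radix) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // empty or malformed digits
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("lt"));   // prints <
        System.out.println(decode("#x41")); // prints A
    }
}
```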
    • tag

      String tag
      current tagname
    • attributes

      Map<String,String> attributes
      current attributes
    • doc

      The handler to which we are going to forward document content
    • comment

      The handler to which we are going to forward comments.
    • nested

      int nested
      Keeps track of the number of tags that are open.
    • quoteCharacter

      int quoteCharacter
      the quote character that was used to open the quote.
    • attributekey

      String attributekey
      the attribute key.
    • attributevalue

      String attributevalue
      the attribute value.
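
The fields quoteCharacter, attributekey and attributevalue, together with the ATTRIBUTE_KEY, ATTRIBUTE_EQUAL and ATTRIBUTE_VALUE states, suggest that attributes are read as key, equals sign, then a value closed by the same quote character that opened it. The sketch below shows that idea on an isolated string; AttributeSketch is a hypothetical class, and its error handling (IllegalArgumentException) is an assumption made for the example rather than what the parser does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of SimpleXMLParser.
final class AttributeSketch {

    /** Parses text such as: id='a1' title="He said 'hi'" */
    static Map<String, String> parse(String s) {
        Map<String, String> attributes = new LinkedHashMap<>();
        int n = s.length();
        int i = 0;
        while (true) {
            i = skipWhitespace(s, i);
            if (i == n) {
                return attributes;
            }
            // Key: everything up to '=' or whitespace.
            int keyStart = i;
            while (i < n && s.charAt(i) != '=' && !Character.isWhitespace(s.charAt(i))) {
                i++;
            }
            String key = s.substring(keyStart, i);
            // Equals sign, optionally surrounded by whitespace.
            i = skipWhitespace(s, i);
            if (i == n || s.charAt(i) != '=') {
                throw new IllegalArgumentException("Expected '=' after " + key);
            }
            i = skipWhitespace(s, i + 1);
            // Value: the opening quote decides which character closes it.
            if (i == n || (s.charAt(i) != '"' && s.charAt(i) != '\'')) {
                throw new IllegalArgumentException("Expected a quoted value for " + key);
            }
            char quote = s.charAt(i);
            int end = s.indexOf(quote, i + 1);
            if (end < 0) {
                throw new IllegalArgumentException("Unterminated value for " + key);
            }
            attributes.put(key, s.substring(i + 1, end));
            i = end + 1;
        }
    }

    private static int skipWhitespace(String s, int i) {
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(parse("id='a1' title=\"He said 'hi'\""));
    }
}
```

Remembering the opening quote is what lets a double-quoted value contain single quotes, and the reverse.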
  • Constructor Details

  • Method Details

    • parse

      public static void parse(SimpleXMLDocHandler doc, SimpleXMLDocHandlerComment comment, Reader r, boolean html) throws IOException
      Parses the XML document firing the events to the handler.
      Parameters:
      doc - the document handler
      comment - the handler to which comments are forwarded
      r - the document. The encoding is already resolved. The reader is not closed
      html - true if the document is to be parsed as HTML
      Throws:
      IOException - on error
    • detectCharsetFromBOM

      private static Optional<Charset> detectCharsetFromBOM(byte[] bom)
      Detect charset from BOM, as per Unicode FAQ.
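
As an illustration of BOM-based detection, the sketch below checks the byte order marks listed in the Unicode FAQ and returns an Optional<Charset>, mirroring the signature above. It is not the library's implementation; BomSketch and detect are hypothetical names, and the order of the checks is a choice of this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper for illustration; not part of SimpleXMLParser.
final class BomSketch {

    /** Returns the charset announced by a byte order mark, if the bytes start with one. */
    static Optional<Charset> detect(byte[] bom) {
        // The 4-byte marks come first: FF FE 00 00 also begins with the UTF-16LE mark.
        if (startsWith(bom, 0x00, 0x00, 0xFE, 0xFF)) {
            return Optional.of(Charset.forName("UTF-32BE"));
        }
        if (startsWith(bom, 0xFF, 0xFE, 0x00, 0x00)) {
            return Optional.of(Charset.forName("UTF-32LE"));
        }
        if (startsWith(bom, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(bom, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(bom, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) { // mask to compare as unsigned bytes
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<'};
        System.out.println(detect(utf8)); // prints Optional[UTF-8]
    }
}
```

Note that FF FE 00 00 is inherently ambiguous: it could also be UTF-16LE text starting with a NUL character. This sketch resolves it in favor of UTF-32LE.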
    • parse

      public static void parse(SimpleXMLDocHandler doc, InputStream in) throws IOException
      Parses the XML document firing the events to the handler.
      Parameters:
      doc - the document handler
      in - the document. The encoding is deduced from the stream. The stream is not closed
      Throws:
      IOException - on error
    • getDeclaredEncoding

      private static String getDeclaredEncoding(String decl)
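
This method carries no description. Judging only by its name and signature, it presumably extracts the encoding pseudo-attribute from an XML declaration string. Under that assumption, a minimal sketch could look as follows; DeclaredEncodingSketch, the regular expression, and the null return for a missing encoding are all choices made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of SimpleXMLParser.
final class DeclaredEncodingSketch {

    // encoding = "name" or encoding = 'name'; the closing quote must match the opening one.
    // The name follows the EncName production of XML 1.0: a letter, then letters, digits, '.', '_' or '-'.
    private static final Pattern ENCODING =
            Pattern.compile("encoding\\s*=\\s*([\"'])([A-Za-z][A-Za-z0-9._-]*)\\1");

    /** Returns the declared encoding name, or null if the declaration has none. */
    static String declaredEncoding(String decl) {
        if (decl == null) {
            return null;
        }
        Matcher m = ENCODING.matcher(decl);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String decl = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";
        System.out.println(declaredEncoding(decl)); // prints ISO-8859-1
    }
}
```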
    • parse

      public static void parse(SimpleXMLDocHandler doc, Reader r) throws IOException
      Throws:
      IOException
    • go

      private void go(Reader r) throws IOException
      Does the actual parsing. Perform this immediately after creating the parser object.
      Throws:
      IOException
    • restoreState

      private int restoreState()
      Gets a state from the stack
      Returns:
      the previous state
    • saveState

      private void saveState(int s)
      Adds a state to the stack.
      Parameters:
      s - a state to add to the stack
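
The restoreState and saveState pair can be pictured as a plain stack of integer states: the parser saves the state it is in before entering a nested construct (for example an entity inside text) and restores it afterwards. The sketch below is hypothetical; the constant values are invented for the example, and returning UNKNOWN when the stack is empty is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper for illustration; not part of SimpleXMLParser.
final class StateStackSketch {

    // Illustrative values only; the real constants are private to the parser.
    static final int UNKNOWN = 0;
    static final int TEXT = 1;
    static final int ENTITY = 2;
    static final int QUOTE = 3;

    private final Deque<Integer> stack = new ArrayDeque<>();

    /** Adds a state to the stack. */
    void saveState(int s) {
        stack.push(s);
    }

    /** Removes and returns the most recently saved state, or UNKNOWN if none is saved. */
    int restoreState() {
        return stack.isEmpty() ? UNKNOWN : stack.pop();
    }

    public static void main(String[] args) {
        StateStackSketch states = new StateStackSketch();
        int state = TEXT;
        states.saveState(state);       // '&' seen while reading text
        state = ENTITY;
        state = states.restoreState(); // ';' seen, back to the saved state
        System.out.println(state == TEXT); // prints true
    }
}
```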
    • flush

      private void flush()
      Flushes the text that is currently in the buffer. Depending on the current state, the text can be ignored, added to the document as content, or added as a comment.
    • initTag

      private void initTag()
      Initializes the tag name and attributes.
    • doTag

      private void doTag()
      Sets the name of the tag.
    • processTag

      private void processTag(boolean start)
      Processes the tag.
      Parameters:
      start - if true we are dealing with a tag that has just been opened; if false we are closing a tag.
    • throwException

      private void throwException(String s) throws IOException
      Throws an exception
      Throws:
      IOException