Class TaggedPdfReaderTool


  • public class TaggedPdfReaderTool
    extends java.lang.Object
    Converts a tagged PDF document into an XML file.
    • Method Summary

      Modifier and Type Method Description
      void convertToXml​(java.io.OutputStream os)
      Converts the current tag structure into an XML file with default encoding (UTF-8).
      void convertToXml​(java.io.OutputStream os, java.lang.String charset)
      Converts the current tag structure into an XML file with provided encoding.
      protected static java.lang.String escapeXML​(java.lang.String s, boolean onlyASCII)
      Escapes a string with the appropriate XML codes. NOTE: copied from the iText 5 XMLUtils class.
      protected static java.lang.String fixTagName​(java.lang.String tag)
      Fixes the specified tag name so that it is a valid XML tag name.
      protected void inspectAttributes​(PdfStructElem kid)
      Inspects the attributes dictionary of a StructTreeRoot child.
      protected void inspectKid​(IStructureNode kid)
      Inspects a child of the StructTreeRoot.
      protected void inspectKids​(java.util.List<IStructureNode> kids)
      Inspects the children of the StructTreeRoot.
      static boolean isValidCharacterValue​(int c)
      Checks if a character value should be escaped/unescaped.
      protected void parseTag​(PdfMcr kid)
      Parses the tag of a Marked Content Reference (MCR) kid of the StructTreeRoot.
      TaggedPdfReaderTool setRootTag​(java.lang.String rootTagName)
      Sets the name of the root tag of the resultant XML file.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • out

        protected java.io.OutputStreamWriter out
      • rootTag

        protected java.lang.String rootTag
      • parsedTags

        protected java.util.Map<PdfDictionary,​java.util.Map<java.lang.Integer,​java.lang.String>> parsedTags
      • inspectedStructTreeElems

        private final java.util.Set<PdfObject> inspectedStructTreeElems
    • Constructor Detail

    • Method Detail

      • isValidCharacterValue

        public static boolean isValidCharacterValue​(int c)
        Checks if a character value should be escaped/unescaped.
        Parameters:
        c - a character value
        Returns:
        true if it's OK to escape or unescape this value.
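
        The description does not spell out which values count as valid. A minimal sketch, assuming the check follows the Char production of XML 1.0 (an assumption, not confirmed by this page); the class and method names below are hypothetical stand-ins, not the library's source:

```java
// Hypothetical stand-in for illustration; not the library's implementation.
class XmlCharCheckSketch {

    // Assumption: a value is "valid" when it matches the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)
                || (c >= 0x10000 && c <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlChar('A'));    // ordinary letter
        System.out.println(isValidXmlChar(0x9));    // tab is allowed in XML
        System.out.println(isValidXmlChar(0x1));    // other C0 controls are not
        System.out.println(isValidXmlChar(0xD800)); // lone surrogates are not
    }
}
```

        Under this assumption, tab, line feed, and carriage return pass, while the remaining control characters and surrogate code units are rejected.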
      • convertToXml

        public void convertToXml​(java.io.OutputStream os)
                          throws java.io.IOException
        Converts the current tag structure into an XML file with default encoding (UTF-8).
        Parameters:
        os - the output stream to save XML file to
        Throws:
        java.io.IOException - in case of any I/O error
      • convertToXml

        public void convertToXml​(java.io.OutputStream os,
                                 java.lang.String charset)
                          throws java.io.IOException
        Converts the current tag structure into an XML file with provided encoding.
        Parameters:
        os - the output stream to save XML file to
        charset - the charset of the resultant XML file
        Throws:
        java.io.IOException - in case of any I/O error
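
        The protected out field is a java.io.OutputStreamWriter, which suggests (as an assumption) that the tool wraps the supplied stream in a writer using the requested charset. The sketch below uses only the JDK to show how the charset name changes the bytes that reach the stream; the class and helper names are hypothetical:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical stand-in; shows only the effect of the charset on the written bytes.
class CharsetOutputSketch {

    // Writes text to an in-memory stream through an OutputStreamWriter
    // created with the given charset name, and returns the resulting bytes.
    static byte[] encode(String text, String charset) throws IOException {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(os, charset)) {
            out.write(text);
        }
        return os.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "caf\u00e9"; // "café"
        System.out.println(encode(text, "UTF-8").length);      // 5: é takes two bytes
        System.out.println(encode(text, "ISO-8859-1").length); // 4: é takes one byte
    }
}
```

        An unknown charset name makes the OutputStreamWriter constructor throw java.io.UnsupportedEncodingException, which is a subclass of java.io.IOException.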
      • setRootTag

        public TaggedPdfReaderTool setRootTag​(java.lang.String rootTagName)
        Sets the name of the root tag of the resultant XML file.
        Parameters:
        rootTagName - the name of the root tag
        Returns:
        this object
      • inspectKids

        protected void inspectKids​(java.util.List<IStructureNode> kids)
        Inspects the children of the StructTreeRoot.
        Parameters:
        kids - list of the direct kids of the StructTreeRoot
      • inspectKid

        protected void inspectKid​(IStructureNode kid)
        Inspects a child of the StructTreeRoot.
        Parameters:
        kid - the direct kid of the StructTreeRoot
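
        The pairing of inspectKids and inspectKid, together with the private inspectedStructTreeElems set, suggests a recursive walk that remembers which elements it has already seen. The sketch below illustrates that pattern on a hypothetical Node type; it is not the library's code, and silently skipping a revisited node is an assumption made for illustration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a structure-tree walk; not the library's implementation.
class StructureWalkSketch {

    // Hypothetical node type: a role name plus child nodes.
    static final class Node {
        final String role;
        final List<Node> kids = new ArrayList<>();
        Node(String role) { this.role = role; }
        Node add(Node kid) { kids.add(kid); return this; }
    }

    // Identity-based set of nodes that were already visited.
    private final Set<Node> inspected =
            Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
    private final StringBuilder out = new StringBuilder();

    void inspectKids(List<Node> kids) {
        for (Node kid : kids) {
            inspectKid(kid);
        }
    }

    void inspectKid(Node kid) {
        if (!inspected.add(kid)) {
            return; // assumption: a node seen before is skipped to avoid endless recursion
        }
        out.append('<').append(kid.role).append('>');
        inspectKids(kid.kids);
        out.append("</").append(kid.role).append('>');
    }

    static String toXml(Node root) {
        StructureWalkSketch walker = new StructureWalkSketch();
        walker.inspectKid(root);
        return walker.out.toString();
    }

    public static void main(String[] args) {
        Node p = new Node("P");
        Node root = new Node("Document").add(new Node("H1")).add(p);
        p.add(root); // deliberately malformed: a cycle back to the root
        System.out.println(toXml(root)); // <Document><H1></H1><P></P></Document>
    }
}
```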
      • inspectAttributes

        protected void inspectAttributes​(PdfStructElem kid)
        Inspects the attributes dictionary of a StructTreeRoot child.
        Parameters:
        kid - the direct kid of the StructTreeRoot
      • parseTag

        protected void parseTag​(PdfMcr kid)
        Parses the tag of a Marked Content Reference (MCR) kid of the StructTreeRoot.
        Parameters:
        kid - the direct PdfMcr kid of the StructTreeRoot
      • fixTagName

        protected static java.lang.String fixTagName​(java.lang.String tag)
        Fixes the specified tag name so that it is a valid XML tag name.
        Parameters:
        tag - tag name to fix
        Returns:
        fixed tag name.
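
        The page does not say how an invalid name is repaired. One plausible approach, sketched below for ASCII input only, replaces an invalid first character with an underscore and any other invalid character with a hyphen. Both replacement characters and the ASCII-only ranges are assumptions, and the class is a hypothetical stand-in:

```java
// Hypothetical stand-in; a simplified, ASCII-only illustration of tag-name repair.
class TagNameSketch {

    static String fixTagName(String tag) {
        StringBuilder sb = new StringBuilder(tag.length());
        for (int k = 0; k < tag.length(); k++) {
            char c = tag.charAt(k);
            // Characters allowed at the start of an XML name (ASCII subset).
            boolean nameStart = c == ':' || c == '_'
                    || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
            // Characters additionally allowed after the first position.
            boolean nameMiddle = c == '-' || c == '.' || (c >= '0' && c <= '9');
            if (k == 0) {
                if (!nameStart) {
                    c = '_'; // assumption: invalid first character becomes '_'
                }
            } else if (!nameStart && !nameMiddle) {
                c = '-';     // assumption: invalid later character becomes '-'
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fixTagName("H1"));       // H1
        System.out.println(fixTagName("1st Para")); // _st-Para
        System.out.println(fixTagName("Table#2"));  // Table-2
    }
}
```

        Replacing characters one for one keeps the result the same length as the input, so distinct tags that differ only in invalid characters may map to the same name.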
      • escapeXML

        protected static java.lang.String escapeXML​(java.lang.String s,
                                                    boolean onlyASCII)
        Escapes a string with the appropriate XML codes. NOTE: copied from the iText 5 XMLUtils class.
        Parameters:
        s - the string to be escaped
        onlyASCII - if true, characters with codes above 127 are always escaped as &#nn;
        Returns:
        the escaped string
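
        To make the effect of onlyASCII concrete, here is a minimal sketch of this kind of escaping. It assumes the five predefined XML entities are used for markup characters, that characters invalid in XML 1.0 are dropped, and that the numeric references are decimal; none of these details is stated on this page, and the class is a hypothetical stand-in:

```java
// Hypothetical stand-in; illustrates XML escaping with an ASCII-only option.
class XmlEscapeSketch {

    // Assumption: validity follows the XML 1.0 Char production (BMP part only,
    // since this sketch works on individual UTF-16 code units).
    static boolean isValid(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    static String escapeXml(String s, boolean onlyASCII) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < s.length(); k++) {
            int c = s.charAt(k);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (isValid(c)) {
                        if (onlyASCII && c > 127) {
                            // Decimal numeric character reference, e.g. &#233;
                            sb.append("&#").append(c).append(';');
                        } else {
                            sb.append((char) c);
                        }
                    }
                    // Assumption: invalid characters are silently dropped.
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("a < b & c", false)); // a &lt; b &amp; c
        System.out.println(escapeXml("caf\u00e9", true));  // caf&#233;
        System.out.println(escapeXml("x\u0001y", false));  // xy
    }
}
```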