Class Jsoup


  • public class Jsoup
    extends java.lang.Object
    The core public access point to the jsoup functionality.
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      private Jsoup()  
    • Method Summary

      Modifier and Type Method Description
      static java.lang.String clean​(java.lang.String bodyHtml, Safelist safelist)
      Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a safe-list of permitted tags and attributes.
      static java.lang.String clean​(java.lang.String bodyHtml, Whitelist safelist)
      Deprecated.
      as of 1.14.1.
      static java.lang.String clean​(java.lang.String bodyHtml, java.lang.String baseUri, Safelist safelist)
      Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through an allow-list of safe tags and attributes.
      static java.lang.String clean​(java.lang.String bodyHtml, java.lang.String baseUri, Safelist safelist, Document.OutputSettings outputSettings)
      Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a safe-list of permitted tags and attributes.
      static java.lang.String clean​(java.lang.String bodyHtml, java.lang.String baseUri, Whitelist safelist)
      Deprecated.
      as of 1.14.1.
      static java.lang.String clean​(java.lang.String bodyHtml, java.lang.String baseUri, Whitelist safelist, Document.OutputSettings outputSettings)
      Deprecated.
      as of 1.14.1.
      static boolean isValid​(java.lang.String bodyHtml, Safelist safelist)
      Test if the input body HTML has only tags and attributes allowed by the Safelist.
      static boolean isValid​(java.lang.String bodyHtml, Whitelist safelist)
      Deprecated.
      as of 1.14.1.
      static Document parse​(java.io.File in, java.lang.String charsetName)
      Parse the contents of a file as HTML.
      static Document parse​(java.io.File in, java.lang.String charsetName, java.lang.String baseUri)
      Parse the contents of a file as HTML.
      static Document parse​(java.io.InputStream in, java.lang.String charsetName, java.lang.String baseUri)
      Read an input stream, and parse it to a Document.
      static Document parse​(java.io.InputStream in, java.lang.String charsetName, java.lang.String baseUri, Parser parser)
      Read an input stream, and parse it to a Document.
      static Document parse​(java.lang.String html)
      Parse HTML into a Document.
      static Document parse​(java.lang.String html, java.lang.String baseUri)
      Parse HTML into a Document.
      static Document parse​(java.lang.String html, java.lang.String baseUri, Parser parser)
      Parse HTML into a Document, using the provided Parser.
      static Document parseBodyFragment​(java.lang.String bodyHtml)
      Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
      static Document parseBodyFragment​(java.lang.String bodyHtml, java.lang.String baseUri)
      Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • Jsoup

        private Jsoup()
    • Method Detail

      • parse

        public static Document parse​(java.lang.String html,
                                     java.lang.String baseUri)
        Parse HTML into a Document. The parser will make a sensible, balanced document tree out of any HTML.
        Parameters:
        html - HTML to parse
        baseUri - The URL the HTML was retrieved from. Used to resolve relative URLs that occur before the HTML declares a <base href> tag into absolute URLs.
        Returns:
        the parsed Document
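As a minimal sketch of how baseUri affects link resolution, the following parses a snippet containing a relative link and reads it back as an absolute URL through Element.absUrl. The example.com URLs, the class name, and the helper method are illustrative assumptions, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseWithBaseUri {
    // Hypothetical helper: returns the absolute URL of the first link, or "" if there is none.
    static String firstLinkAbsUrl(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/docs/intro.html\">Intro</a></p>";
        // The relative href is resolved against the supplied base URI.
        System.out.println(firstLinkAbsUrl(html, "https://example.com/start/"));
    }
}
```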
      • parse

        public static Document parse​(java.lang.String html,
                                     java.lang.String baseUri,
                                     Parser parser)
        Parse HTML into a Document, using the provided Parser. You can provide an alternate parser, such as a simple XML (non-HTML) parser.
        Parameters:
        html - HTML to parse
        baseUri - The URL the HTML was retrieved from. Used to resolve relative URLs that occur before the HTML declares a <base href> tag into absolute URLs.
        parser - alternate parser to use.
        Returns:
        the parsed Document
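The alternate-parser overload could be used along these lines to read XML rather than HTML. This is a sketch only: the feed/item element names and the helper method are made up for illustration, and Parser.xmlParser() is the XML parser that jsoup provides.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class ParseAsXml {
    // Hypothetical helper: returns the text of the first <item> element, or "" if absent.
    static String firstItemText(String xml) {
        // An empty base URI is passed because this sample has no URLs to resolve.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element item = doc.selectFirst("item");
        return item == null ? "" : item.text();
    }

    public static void main(String[] args) {
        String xml = "<feed><item>First</item><item>Second</item></feed>";
        System.out.println(firstItemText(xml));
    }
}
```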
      • parse

        public static Document parse​(java.lang.String html)
        Parse HTML into a Document. As no base URI is specified, absolute URL detection relies on the HTML including a <base href> tag.
        Parameters:
        html - HTML to parse
        Returns:
        the parsed Document
        See Also:
        parse(String, String)
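To illustrate the dependence on <base href>, the sketch below parses the same relative link with and without a base tag in the input. The markup, URLs, and helper name are assumptions chosen for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseWithoutBaseUri {
    // Hypothetical helper: absolute URL of the first link, or "" when it cannot be resolved.
    static String firstLinkAbsUrl(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String plain = "<a href=\"page.html\">Page</a>";
        String withBase = "<html><head><base href=\"https://example.com/docs/\"></head>"
                + "<body><a href=\"page.html\">Page</a></body></html>";
        // Without a base URI or a <base href> tag, absUrl has nothing to resolve against.
        System.out.println("[" + firstLinkAbsUrl(plain) + "]");
        // With a <base href> tag in the input, the relative link can be resolved.
        System.out.println(firstLinkAbsUrl(withBase));
    }
}
```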
      • parse

        public static Document parse​(java.io.File in,
                                     java.lang.String charsetName,
                                     java.lang.String baseUri)
                              throws java.io.IOException
        Parse the contents of a file as HTML.
        Parameters:
        in - file to load HTML from
        charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
        baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
        Returns:
        the parsed Document
        Throws:
        java.io.IOException - if the file could not be found, or read, or if the charsetName is invalid.
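A possible way to exercise the file overload is sketched below. It writes a small HTML snippet to a temporary file, parses it with an explicit charset, and returns the title. The temporary file, the example.com base URI, and the helper name are assumptions for illustration; the IOException is wrapped in an UncheckedIOException only to keep the sample short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFromFile {
    // Hypothetical helper: saves the HTML to a temp file, parses that file, returns its title.
    static String titleOfSavedHtml(String html) {
        try {
            Path tmp = Files.createTempFile("jsoup-sample", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                // charsetName could also be null to let jsoup detect the encoding.
                Document doc = Jsoup.parse(tmp.toFile(), "UTF-8", "https://example.com/");
                return doc.title();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleOfSavedHtml("<title>Saved page</title><p>Hello</p>"));
    }
}
```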
      • parse

        public static Document parse​(java.io.File in,
                                     java.lang.String charsetName)
                              throws java.io.IOException
        Parse the contents of a file as HTML. The location of the file is used as the base URI to qualify relative URLs.
        Parameters:
        in - file to load HTML from
        charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
        Returns:
        the parsed Document
        Throws:
        java.io.IOException - if the file could not be found, or read, or if the charsetName is invalid.
        See Also:
        parse(File, String, String)
      • parse

        public static Document parse​(java.io.InputStream in,
                                     java.lang.String charsetName,
                                     java.lang.String baseUri)
                              throws java.io.IOException
        Read an input stream, and parse it to a Document.
        Parameters:
        in - input stream to read. Make sure to close it after parsing.
        charsetName - (optional) character set of the stream contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
        baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
        Returns:
        the parsed Document
        Throws:
        java.io.IOException - if the stream could not be read, or if the charsetName is invalid.
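Because the caller is responsible for closing the stream, a try-with-resources block is one convenient pattern. The sketch below uses an in-memory stream as a stand-in for a real source; the helper name and base URI are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFromStream {
    // Hypothetical helper: parses HTML bytes through an InputStream and returns the title.
    static String titleFromStream(byte[] htmlBytes) {
        // try-with-resources closes the stream once parsing has finished.
        try (InputStream in = new ByteArrayInputStream(htmlBytes)) {
            // A null charsetName asks jsoup to detect the encoding or fall back to UTF-8.
            Document doc = Jsoup.parse(in, null, "https://example.com/");
            return doc.title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<title>Streamed</title><p>Hello</p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(titleFromStream(bytes));
    }
}
```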
      • parse

        public static Document parse​(java.io.InputStream in,
                                     java.lang.String charsetName,
                                     java.lang.String baseUri,
                                     Parser parser)
                              throws java.io.IOException
        Read an input stream, and parse it to a Document. You can provide an alternate parser, such as a simple XML (non-HTML) parser.
        Parameters:
        in - input stream to read. Make sure to close it after parsing.
        charsetName - (optional) character set of the stream contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
        baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
        parser - alternate parser to use.
        Returns:
        the parsed Document
        Throws:
        java.io.IOException - if the stream could not be read, or if the charsetName is invalid.
      • parseBodyFragment

        public static Document parseBodyFragment​(java.lang.String bodyHtml,
                                                 java.lang.String baseUri)
        Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
        Parameters:
        bodyHtml - body HTML fragment
        baseUri - URL to resolve relative URLs against.
        Returns:
        the parsed Document, with the fragment placed in its body
        See Also:
        Document.body()
      • parseBodyFragment

        public static Document parseBodyFragment​(java.lang.String bodyHtml)
        Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
        Parameters:
        bodyHtml - body HTML fragment
        Returns:
        the parsed Document, with the fragment placed in its body
        See Also:
        Document.body()
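A short sketch of fragment parsing: the input below has no html, head, or body tags and leaves its paragraphs unclosed, yet the parsed content is reachable through Document.body(). The helper name and sample markup are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BodyFragmentExample {
    // Hypothetical helper: number of child elements the fragment produces inside <body>.
    static int countBodyChildren(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body().children().size();
    }

    public static void main(String[] args) {
        // Two unclosed paragraphs become two sibling <p> elements in the body.
        System.out.println(countBodyChildren("<p>One<p>Two"));
    }
}
```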
      • clean

        public static java.lang.String clean​(java.lang.String bodyHtml,
                                             java.lang.String baseUri,
                                             Safelist safelist)
        Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through an allow-list of safe tags and attributes.
        Parameters:
        bodyHtml - input untrusted HTML (body fragment)
        baseUri - URL to resolve relative URLs against
        safelist - list of permitted HTML elements
        Returns:
        safe HTML (body fragment)
        See Also:
        Cleaner.clean(Document)
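The following sketch shows one way this overload might be used to sanitize user-supplied markup. Safelist.basic() is one of the predefined safelists; the base URI, sample input, and helper name are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanWithBaseUri {
    // Hypothetical helper: filters untrusted HTML through the basic safelist.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, "https://example.com/", Safelist.basic());
    }

    public static void main(String[] args) {
        String untrusted = "<p>Hi <script>alert(1)</script>"
                + "<a href=\"/x\" onclick=\"evil()\">link</a></p>";
        // The script element and the onclick attribute are not on the safelist, so they are dropped.
        System.out.println(sanitize(untrusted));
    }
}
```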
      • clean

        @Deprecated
        public static java.lang.String clean​(java.lang.String bodyHtml,
                                             java.lang.String baseUri,
                                             Whitelist safelist)
        Deprecated.
        as of 1.14.1. Use clean(String, String, Safelist) instead.
      • clean

        public static java.lang.String clean​(java.lang.String bodyHtml,
                                             Safelist safelist)
        Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a safe-list of permitted tags and attributes.

        Note that this method does not take a base URI to resolve relative URLs against, so attributes with relative URLs will be removed unless the input HTML contains a <base href> tag. To preserve them, use the clean(String, String, Safelist) method instead, and enable Safelist.preserveRelativeLinks(true).

        Parameters:
        bodyHtml - input untrusted HTML (body fragment)
        safelist - list of permitted HTML elements
        Returns:
        safe HTML (body fragment)
        See Also:
        Cleaner.clean(Document)
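The difference described in the note can be sketched as follows: the two-argument call drops a relative href, while supplying a base URI together with preserveRelativeLinks(true) keeps it. The sample link, base URI, and helper names are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanRelativeLinks {
    // Hypothetical helper: no base URI, so relative URLs cannot be validated.
    static String cleanWithoutBase(String html) {
        return Jsoup.clean(html, Safelist.basic());
    }

    // Hypothetical helper: a base URI plus preserveRelativeLinks(true) keeps the relative href.
    static String cleanKeepingRelative(String html, String baseUri) {
        Safelist safelist = Safelist.basic().preserveRelativeLinks(true);
        return Jsoup.clean(html, baseUri, safelist);
    }

    public static void main(String[] args) {
        String html = "<a href=\"/page\">x</a>";
        System.out.println(cleanWithoutBase(html));
        System.out.println(cleanKeepingRelative(html, "https://example.com/"));
    }
}
```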
      • clean

        @Deprecated
        public static java.lang.String clean​(java.lang.String bodyHtml,
                                             Whitelist safelist)
        Deprecated.
        as of 1.14.1. Use clean(String, Safelist) instead.
      • clean

        public static java.lang.String clean​(java.lang.String bodyHtml,
                                             java.lang.String baseUri,
                                             Safelist safelist,
                                             Document.OutputSettings outputSettings)
        Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a safe-list of permitted tags and attributes.

        The HTML is treated as a body fragment; it's expected the cleaned HTML will be used within the body of an existing document. If you want to clean full documents, use Cleaner.clean(Document) instead, and add structural tags (html, head, body, etc.) to the safelist.

        Parameters:
        bodyHtml - input untrusted HTML (body fragment)
        baseUri - URL to resolve relative URLs against
        safelist - list of permitted HTML elements
        outputSettings - document output settings; use to control pretty-printing and entity escape modes
        Returns:
        safe HTML (body fragment)
        See Also:
        Cleaner.clean(Document)
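As an example of the outputSettings parameter, the sketch below turns off pretty-printing so that the cleaned fragment is emitted without added line breaks. The empty base URI, sample input, and helper name are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class CleanWithOutputSettings {
    // Hypothetical helper: cleans HTML and serializes the result without pretty-printing.
    static String cleanCompact(String html) {
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        // An empty base URI is used because this sample contains no links.
        return Jsoup.clean(html, "", Safelist.basic(), settings);
    }

    public static void main(String[] args) {
        System.out.println(cleanCompact("<p>a</p><p>b</p><script>x</script>"));
    }
}
```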
      • isValid

        public static boolean isValid​(java.lang.String bodyHtml,
                                      Safelist safelist)
        Test if the input body HTML has only tags and attributes allowed by the Safelist. Useful for form validation.

        The input HTML should still be run through the cleaner to set up enforced attributes, and to tidy the output.

        Assumes the HTML is a body fragment (i.e., it will be used within the body of an existing HTML document).

        Parameters:
        bodyHtml - HTML to test
        safelist - safelist to test against
        Returns:
        true if no tags or attributes were removed; false otherwise
        See Also:
        clean(String, Safelist)
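A form-validation flow along the lines described above might look like this sketch: input is first checked with isValid, and accepted input is still passed through clean. The helper names and the choice of Safelist.basic() are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class ValidateThenClean {
    // Hypothetical helper: true when the fragment uses only what the basic safelist permits.
    static boolean isAcceptable(String bodyHtml) {
        return Jsoup.isValid(bodyHtml, Safelist.basic());
    }

    // Hypothetical helper: rejects invalid input, otherwise returns the cleaned fragment.
    static String acceptOrReject(String bodyHtml) {
        if (!isAcceptable(bodyHtml)) {
            throw new IllegalArgumentException("Input contains disallowed tags or attributes");
        }
        // Valid input is still cleaned so that enforced attributes are applied.
        return Jsoup.clean(bodyHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("<b>ok</b>"));
        System.out.println(isAcceptable("<b>ok</b><script>x</script>"));
    }
}
```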
      • isValid

        @Deprecated
        public static boolean isValid​(java.lang.String bodyHtml,
                                      Whitelist safelist)
        Deprecated.
        as of 1.14.1. Use isValid(String, Safelist) instead.