Class EncodingSniffer

java.lang.Object
org.htmlunit.util.EncodingSniffer

public final class EncodingSniffer extends Object
Sniffs encoding settings from HTML, XML or other content. The HTML encoding sniffing algorithm is based on the HTML5 encoding sniffing algorithm.
  • Field Details

    • LOG

      private static final org.apache.commons.logging.Log LOG
      Logging support.
    • COMMENT_START

      private static final byte[][] COMMENT_START
      Sequence(s) of bytes indicating the beginning of a comment.
    • META_START

      private static final byte[][] META_START
      Sequence(s) of bytes indicating the beginning of a meta HTML tag.
    • OTHER_START

      private static final byte[][] OTHER_START
      Sequence(s) of bytes indicating the beginning of miscellaneous HTML content.
    • CHARSET_START

      private static final byte[][] CHARSET_START
      Sequence(s) of bytes indicating the beginning of a charset specification.
    • WHITESPACE

      private static final byte[] WHITESPACE
    • COMMENT_END

      private static final byte[] COMMENT_END
    • XML_DECLARATION_PREFIX

      private static final byte[] XML_DECLARATION_PREFIX
    • CSS_CHARSET_DECLARATION_PREFIX

      private static final byte[] CSS_CHARSET_DECLARATION_PREFIX
    • SIZE_OF_HTML_CONTENT_SNIFFED

      private static final int SIZE_OF_HTML_CONTENT_SNIFFED
      The number of HTML bytes to sniff for encoding info embedded in meta tags.
    • SIZE_OF_XML_CONTENT_SNIFFED

      private static final int SIZE_OF_XML_CONTENT_SNIFFED
      The number of XML bytes to sniff for encoding info embedded in the XML declaration; relatively small because it's always at the very beginning of the file.
    • SIZE_OF_CSS_CONTENT_SNIFFED

      private static final int SIZE_OF_CSS_CONTENT_SNIFFED
  • Constructor Details

    • EncodingSniffer

      private EncodingSniffer()
      Disallow instantiation of this class.
  • Method Details

    • sniffEncoding

      @Deprecated public static Charset sniffEncoding(List<NameValuePair> headers, InputStream content) throws IOException

      If the specified content is HTML content, this method sniffs encoding settings from the specified HTML content and/or the corresponding HTTP headers based on the HTML5 encoding sniffing algorithm.

      If the specified content is XML content, this method sniffs encoding settings from the specified XML content and/or the corresponding HTTP headers using a custom algorithm.

      Otherwise, this method sniffs encoding settings from the specified content of unknown type by looking for Content-Type information in the HTTP headers and Byte Order Mark information in the content.

      Note that if an encoding is found but it is not supported on the current platform, this method returns null, as if no encoding had been found.

      Parameters:
      headers - the HTTP response headers sent back with the content to be sniffed
      content - the content to be sniffed
      Returns:
      the encoding sniffed from the specified content and/or the corresponding HTTP headers, or null if the encoding could not be determined
      Throws:
      IOException - if an IO error occurs
    • isHtml

      @Deprecated static boolean isHtml(List<NameValuePair> headers)
      Deprecated.
      as of version 4.0.0; method will be removed without replacement
      Returns true if the specified HTTP response headers indicate an HTML response.
      Parameters:
      headers - the HTTP response headers
      Returns:
      true if the specified HTTP response headers indicate an HTML response
    • isXml

      @Deprecated static boolean isXml(List<NameValuePair> headers)
      Deprecated.
      as of version 4.0.0; method will be removed without replacement
      Returns true if the specified HTTP response headers indicate an XML response.
      Parameters:
      headers - the HTTP response headers
      Returns:
      true if the specified HTTP response headers indicate an XML response
    • contentTypeEndsWith

      static boolean contentTypeEndsWith(List<NameValuePair> headers, String... contentTypeEndings)
      Returns true if the specified HTTP response headers contain a Content-Type that ends with one of the specified strings.
      Parameters:
      headers - the HTTP response headers
      contentTypeEndings - the content type endings to search for
      Returns:
      true if the specified HTTP response headers contain a Content-Type that ends with one of the specified strings
    • sniffHtmlEncoding

      @Deprecated public static Charset sniffHtmlEncoding(List<NameValuePair> headers, InputStream content) throws IOException

      Sniffs encoding settings from the specified HTML content and/or the corresponding HTTP headers based on the HTML5 encoding sniffing algorithm.

      Note that if an encoding is found but it is not supported on the current platform, this method returns null, as if no encoding had been found.

      Parameters:
      headers - the HTTP response headers sent back with the HTML content to be sniffed
      content - the HTML content to be sniffed
      Returns:
      the encoding sniffed from the specified HTML content and/or the corresponding HTTP headers, or null if the encoding could not be determined
      Throws:
      IOException - if an IO error occurs
    • sniffXmlEncoding

      @Deprecated public static Charset sniffXmlEncoding(List<NameValuePair> headers, InputStream content) throws IOException

      Sniffs encoding settings from the specified XML content and/or the corresponding HTTP headers using a custom algorithm.

      Note that if an encoding is found but it is not supported on the current platform, this method returns null, as if no encoding had been found.

      Parameters:
      headers - the HTTP response headers sent back with the XML content to be sniffed
      content - the XML content to be sniffed
      Returns:
      the encoding sniffed from the specified XML content and/or the corresponding HTTP headers, or null if the encoding could not be determined
      Throws:
      IOException - if an IO error occurs
    • sniffCssEncoding

      @Deprecated private static Charset sniffCssEncoding(List<NameValuePair> headers, InputStream content) throws IOException
      Throws:
      IOException - if an IO error occurs
    • sniffUnknownContentTypeEncoding

      @Deprecated public static Charset sniffUnknownContentTypeEncoding(List<NameValuePair> headers, InputStream content) throws IOException

      Sniffs encoding settings from the specified content of unknown type by looking for Content-Type information in the HTTP headers and Byte Order Mark information in the content.

      Note that if an encoding is found but it is not supported on the current platform, this method returns null, as if no encoding had been found.

      Parameters:
      headers - the HTTP response headers sent back with the content to be sniffed
      content - the content to be sniffed
      Returns:
      the encoding sniffed from the specified content and/or the corresponding HTTP headers, or null if the encoding could not be determined
      Throws:
      IOException - if an IO error occurs
    • sniffEncodingFromHttpHeaders

      @Deprecated public static Charset sniffEncodingFromHttpHeaders(List<NameValuePair> headers)
      Deprecated.
      as of version 4.0.0; method will be removed without replacement
      Attempts to sniff an encoding from the specified HTTP headers.
      Parameters:
      headers - the HTTP headers to examine
      Returns:
      the encoding sniffed from the specified HTTP headers, or null if the encoding could not be determined
    • sniffEncodingFromUnicodeBom

      static Charset sniffEncodingFromUnicodeBom(byte[] bytes)
      Attempts to sniff an encoding from a Byte Order Mark in the specified byte array.
      Parameters:
      bytes - the bytes to check for a Byte Order Mark
      Returns:
      the encoding sniffed from the specified bytes, or null if the encoding could not be determined
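
      The Byte Order Mark check described above can be illustrated with a minimal, standalone sketch. It is not HtmlUnit's implementation: the class name BomSniffSketch and its methods are hypothetical, and only the UTF-8 and UTF-16 marks are handled (a UTF-32LE mark, FF FE 00 00, would be reported as UTF-16LE here). Which marks the real method recognizes is not stated on this page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of org.htmlunit.util.
final class BomSniffSketch {
    private BomSniffSketch() { }

    /** Returns true if bytes begins with every value of prefix, compared as unsigned bytes. */
    static boolean startsWith(byte[] bytes, int... prefix) {
        if (bytes == null || bytes.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((bytes[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Maps a leading Byte Order Mark to a charset, or returns null if no known mark is present. */
    static Charset sniff(byte[] bytes) {
        if (startsWith(bytes, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(bytes, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(bytes, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }
}
```

      Returning null rather than a default mirrors the documented contract, so a caller can fall through to another detection step.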
    • startsWith

      private static boolean startsWith(byte[] bytes, org.apache.commons.io.ByteOrderMark bom)
      Returns whether the specified byte array starts with the given ByteOrderMark, or not.
      Parameters:
      bytes - the byte array to check
      bom - the ByteOrderMark
      Returns:
      whether the specified byte array starts with the given ByteOrderMark, or not
    • sniffEncodingFromMetaTag

      @Deprecated static Charset sniffEncodingFromMetaTag(byte[] bytes) throws IOException
      Deprecated.
      as of version 4.0.0; method will be removed without replacement
      Attempts to sniff an encoding from an HTML meta tag in the specified byte array.
      Parameters:
      bytes - the bytes to check for an HTML meta tag
      Returns:
      the encoding sniffed from the specified bytes, or null if the encoding could not be determined
      Throws:
      IOException - if an IO error occurs
    • sniffEncodingFromMetaTag

      public static Charset sniffEncodingFromMetaTag(InputStream is) throws IOException
      Attempts to sniff an encoding from an HTML meta tag in the specified content stream.
      Parameters:
      is - the content stream to check for an HTML meta tag
      Returns:
      the encoding sniffed from the specified content stream, or null if the encoding could not be determined
      Throws:
      IOException - if an IO error occurs
    • getAttribute

      static EncodingSniffer.Attribute getAttribute(byte[] bytes, int startFrom)
      Extracts an attribute from the specified byte array, starting at the specified index, using the HTML5 attribute algorithm.
      Parameters:
      bytes - the byte array to extract an attribute from
      startFrom - the index to start searching from
      Returns:
      the next attribute in the specified byte array, or null if one is not available
    • extractEncodingFromContentType

      public static Charset extractEncodingFromContentType(String s)
      Extracts an encoding from the specified Content-Type value using the IETF algorithm; if no encoding is found, this method returns null.
      Parameters:
      s - the Content-Type value to search for an encoding
      Returns:
      the encoding found in the specified Content-Type value, or null if no encoding was found
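
      As a rough illustration of pulling a charset parameter out of a Content-Type value, the sketch below uses a regular expression and the JDK charset lookup. It is an assumption-laden stand-in, not the IETF algorithm the method refers to: the class ContentTypeCharsetSketch, its pattern, and its quote handling are all hypothetical.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.htmlunit.util.
final class ContentTypeCharsetSketch {
    // Matches charset=NAME, charset="NAME" or charset='NAME', ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    private ContentTypeCharsetSketch() { }

    /** Returns the charset named in the Content-Type value, or null if none is found or supported. */
    static Charset extract(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        }
        catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }
}
```

      For example, ContentTypeCharsetSketch.extract("text/html") yields null because no charset parameter is present.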
    • sniffEncodingFromXmlDeclaration

      @Deprecated static Charset sniffEncodingFromXmlDeclaration(byte[] bytes) throws IOException
      Deprecated.
      as of version 4.0.0; use sniffEncodingFromXmlDeclaration(InputStream) instead
      Searches the specified XML content for an XML declaration and returns the encoding if found, otherwise returns null.
      Parameters:
      bytes - the XML content to sniff
      Returns:
      the encoding of the specified XML content, or null if it could not be determined
      Throws:
      IOException - if an IO error occurs
    • sniffEncodingFromXmlDeclaration

      public static Charset sniffEncodingFromXmlDeclaration(InputStream is) throws IOException
      Searches the specified XML content for an XML declaration and returns the encoding if found, otherwise returns null.
      Parameters:
      is - the content stream to check for the charset declaration
      Returns:
      the encoding of the specified XML content, or null if it could not be determined
      Throws:
      IOException - if an IO error occurs
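
      The idea of looking for an encoding in the XML declaration can be sketched as follows. This is a simplified, hypothetical illustration (class XmlDeclarationSketch) that works on an already-read byte array and assumes an ASCII-compatible document; it does not reproduce HtmlUnit's own parsing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.htmlunit.util.
final class XmlDeclarationSketch {
    // The declaration must be the very first thing in the content.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._\\-]+)[\"']");

    private XmlDeclarationSketch() { }

    /** Returns the charset named in a leading XML declaration, or null if there is none. */
    static Charset sniff(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        }
        catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name.
            return null;
        }
    }
}
```

      Because the declaration sits at the start of the file, only a small prefix of the content needs to be inspected.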
    • sniffEncodingFromCssDeclaration

      @Deprecated static Charset sniffEncodingFromCssDeclaration(byte[] bytes) throws IOException
      Parses and returns the charset declaration at the start of a CSS file if any, otherwise returns null.

      e.g.

      @charset "UTF-8"
      Parameters:
      bytes - the bytes to parse
      Returns:
      the charset declaration at the start of a CSS file if any, otherwise null
      Throws:
      IOException - if an IO error occurs
    • sniffEncodingFromCssDeclaration

      public static Charset sniffEncodingFromCssDeclaration(InputStream is) throws IOException
      Parses and returns the charset declaration at the start of a CSS file if any, otherwise returns null.

      e.g.

      @charset "UTF-8"
      Parameters:
      is - the input stream to parse
      Returns:
      the charset declaration at the start of a CSS file if any, otherwise null
      Throws:
      IOException - if an IO error occurs
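
      A leading @charset rule can be detected with a simple prefix check, sketched below. The class CssCharsetSketch is hypothetical and works on an already-read byte array; it assumes the rule starts at the very first byte and is written exactly as @charset followed by one space and a double-quoted name. Whether HtmlUnit is equally strict is not stated on this page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of org.htmlunit.util.
final class CssCharsetSketch {
    private static final String PREFIX = "@charset \"";

    private CssCharsetSketch() { }

    /** Returns the charset named by a leading @charset rule, or null if there is none. */
    static Charset sniff(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so decoding never fails.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        if (!text.startsWith(PREFIX)) {
            return null;
        }
        int end = text.indexOf('"', PREFIX.length());
        if (end < 0) {
            return null;
        }
        String name = text.substring(PREFIX.length(), end);
        try {
            return Charset.forName(name);
        }
        catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name.
            return null;
        }
    }
}
```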
    • toCharset

      public static Charset toCharset(String charsetName)
      Returns the Charset for the specified charset name if it is supported on this platform.
      Parameters:
      charsetName - the charset name to check
      Returns:
      the Charset for the specified charset name if it is supported on this platform
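
      A platform-support check of this kind can be written with the JDK alone, as in the sketch below. The class CharsetLookupSketch is hypothetical, and returning null for an unsupported or malformed name is an assumption of the sketch; this page does not say what toCharset returns in that case.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical illustration only; not part of org.htmlunit.util.
final class CharsetLookupSketch {
    private CharsetLookupSketch() { }

    /** Returns the named charset if this JVM supports it; otherwise null (assumed behaviour). */
    static Charset lookup(String name) {
        if (name == null || name.isEmpty()) {
            return null;
        }
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : null;
        }
        catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return null;
        }
    }
}
```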
    • matches

      static boolean matches(byte[] bytes, int i, byte[][] sought)
      Returns true if the byte in the specified byte array at the specified index matches one of the specified byte array patterns.
      Parameters:
      bytes - the byte array to search in
      i - the index at which to search
      sought - the byte array patterns to search for
      Returns:
      true if the byte in the specified byte array at the specified index matches one of the specified byte array patterns
    • skipToAnyOf

      static int skipToAnyOf(byte[] bytes, int startFrom, byte[] targets)
      Skips ahead to the first occurrence of any of the specified targets within the specified array, starting at the specified index. This method returns -1 if none of the targets are found.
      Parameters:
      bytes - the array to search through
      startFrom - the index to start looking at
      targets - the targets to search for
      Returns:
      the index of the first occurrence of any of the specified targets within the specified array, or -1 if none of the targets are found
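
      The scan described above amounts to a nested loop over positions and targets. The following is a minimal sketch under that reading; the class SkipToAnyOfSketch is hypothetical, and treating a negative start index as 0 is a choice made here, not documented behaviour.

```java
// Hypothetical illustration only; not part of org.htmlunit.util.
final class SkipToAnyOfSketch {
    private SkipToAnyOfSketch() { }

    /** Returns the first index at or after startFrom whose byte equals any target, or -1. */
    static int skipToAnyOf(byte[] bytes, int startFrom, byte[] targets) {
        for (int i = Math.max(startFrom, 0); i < bytes.length; i++) {
            for (byte target : targets) {
                if (bytes[i] == target) {
                    return i;
                }
            }
        }
        return -1;
    }
}
```

      A typical use is skipping ahead to the next whitespace or closing angle bracket while walking through tag content.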
    • indexOfSubArray

      static int indexOfSubArray(byte[] array, byte[] subarray, int startIndex)
      Finds the first index of the specified sub-array inside the specified array, starting at the specified index. This method returns -1 if the specified sub-array cannot be found.
      Parameters:
      array - the array to traverse for looking for the sub-array
      subarray - the sub-array to find
      startIndex - the start index to traverse forwards from
      Returns:
      the index of the sub-array within the array, or -1 if the sub-array cannot be found
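
      A straightforward way to implement such a search is a naive forward scan, sketched below. The class IndexOfSubArraySketch is hypothetical; clamping a negative start index to 0 and reporting an empty sub-array as found at the start index are choices of this sketch, not documented behaviour.

```java
// Hypothetical illustration only; not part of org.htmlunit.util.
final class IndexOfSubArraySketch {
    private IndexOfSubArraySketch() { }

    /** Returns the first index at or after startIndex where subarray occurs in array, or -1. */
    static int indexOfSubArray(byte[] array, byte[] subarray, int startIndex) {
        // Last position at which the sub-array could still fit.
        int lastStart = array.length - subarray.length;
        for (int i = Math.max(startIndex, 0); i <= lastStart; i++) {
            boolean found = true;
            for (int j = 0; j < subarray.length; j++) {
                if (array[i + j] != subarray[j]) {
                    found = false;
                    break;
                }
            }
            if (found) {
                return i;
            }
        }
        return -1;
    }
}
```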
    • read

      static byte[] read(InputStream content, int size) throws IOException
      Attempts to read size bytes from the specified input stream. Note that this method is not guaranteed to be able to read size bytes; however, the returned byte array will always be the exact length of the number of bytes read.
      Parameters:
      content - the input stream to read from
      size - the number of bytes to try to read
      Returns:
      the bytes read from the specified input stream
      Throws:
      IOException - if an IO error occurs
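
      The contract above (try to read size bytes, return an array exactly as long as what was read) can be sketched with a read loop and a final trim. The class ReadSketch is hypothetical and is not HtmlUnit's implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical illustration only; not part of org.htmlunit.util.
final class ReadSketch {
    private ReadSketch() { }

    /** Reads up to size bytes; the result is trimmed to the number of bytes actually read. */
    static byte[] read(InputStream content, int size) throws IOException {
        byte[] buffer = new byte[size];
        int total = 0;
        while (total < size) {
            int n = content.read(buffer, total, size - total);
            if (n < 0) {
                break; // end of stream before size bytes were available
            }
            total += n;
        }
        return total == size ? buffer : Arrays.copyOf(buffer, total);
    }
}
```

      Looping matters because a single InputStream.read call may return fewer bytes than requested even when more data follows.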
    • readAndPrepend

      static byte[] readAndPrepend(InputStream content, int size, byte[] prefix) throws IOException
      Attempts to read size bytes from the specified input stream and then prepends the specified prefix to the bytes read, returning the resultant byte array. Note that this method is not guaranteed to be able to read size bytes; however, the returned byte array will always be the exact length of the number of bytes read plus the length of the prefix array.
      Parameters:
      content - the input stream to read from
      size - the number of bytes to try to read
      prefix - the byte array to prepend to the bytes read from the specified input stream
      Returns:
      the bytes read from the specified input stream, prefixed by the specified prefix
      Throws:
      IOException - if an IO error occurs
    • translateEncodingLabel

      @Deprecated public static String translateEncodingLabel(Charset encodingLabel)
      Deprecated.
      as of version 4.0.0; method will be removed without replacement
      Translates the given encoding label into a normalized form according to Reference.
      Parameters:
      encodingLabel - the label to translate
      Returns:
      the normalized encoding name or null if not found
    • translateEncodingLabel

      public static String translateEncodingLabel(String encodingLabel)
      Translates the given encoding label into a normalized form according to Reference.
      Parameters:
      encodingLabel - the label to translate
      Returns:
      the normalized encoding name or null if not found