Class EncodingSniffer


  • public final class EncodingSniffer
    extends java.lang.Object
    Sniffs encoding settings from HTML, XML or other content. The HTML encoding sniffing algorithm is based on the HTML5 encoding sniffing algorithm.
    • Field Detail

      • LOG

        private static final org.apache.commons.logging.Log LOG
        Logging support.
      • COMMENT_START

        private static final byte[][] COMMENT_START
        Sequence(s) of bytes indicating the beginning of a comment.
      • META_START

        private static final byte[][] META_START
        Sequence(s) of bytes indicating the beginning of a meta HTML tag.
      • OTHER_START

        private static final byte[][] OTHER_START
        Sequence(s) of bytes indicating the beginning of miscellaneous HTML content.
      • CHARSET_START

        private static final byte[][] CHARSET_START
        Sequence(s) of bytes indicating the beginning of a charset specification.
      • WHITESPACE

        private static final byte[] WHITESPACE
      • COMMENT_END

        private static final byte[] COMMENT_END
      • XML_DECLARATION_PREFIX

        private static final byte[] XML_DECLARATION_PREFIX
      • CSS_CHARSET_DECLARATION_PREFIX

        private static final byte[] CSS_CHARSET_DECLARATION_PREFIX
      • SIZE_OF_HTML_CONTENT_SNIFFED

        private static final int SIZE_OF_HTML_CONTENT_SNIFFED
        The number of HTML bytes to sniff for encoding info embedded in meta tags.
        See Also:
        Constant Field Values
      • SIZE_OF_XML_CONTENT_SNIFFED

        private static final int SIZE_OF_XML_CONTENT_SNIFFED
        The number of XML bytes to sniff for encoding info embedded in the XML declaration; relatively small because it's always at the very beginning of the file.
        See Also:
        Constant Field Values
      • SIZE_OF_CSS_CONTENT_SNIFFED

        private static final int SIZE_OF_CSS_CONTENT_SNIFFED
        See Also:
        Constant Field Values
    • Constructor Detail

      • EncodingSniffer

        private EncodingSniffer()
        Disallow instantiation of this class.
    • Method Detail

      • sniffEncoding

        @Deprecated
        public static java.nio.charset.Charset sniffEncoding​(java.util.List<NameValuePair> headers,
                                                             java.io.InputStream content)
                                                      throws java.io.IOException

        If the specified content is HTML content, this method sniffs encoding settings from the specified HTML content and/or the corresponding HTTP headers based on the HTML5 encoding sniffing algorithm.

        If the specified content is XML content, this method sniffs encoding settings from the specified XML content and/or the corresponding HTTP headers using a custom algorithm.

        Otherwise, this method sniffs encoding settings from the specified content of unknown type by looking for Content-Type information in the HTTP headers and Byte Order Mark information in the content.

        Note that if an encoding is found but it is not supported on the current platform, this method returns null, as if no encoding had been found.

        Parameters:
        headers - the HTTP response headers sent back with the content to be sniffed
        content - the content to be sniffed
        Returns:
        the encoding sniffed from the specified content and/or the corresponding HTTP headers, or null if the encoding could not be determined
        Throws:
        java.io.IOException - if an IO error occurs
      • isHtml

        @Deprecated
        static boolean isHtml​(java.util.List<NameValuePair> headers)
        Deprecated.
        as of version 4.0.0; method will be removed without replacement
        Returns true if the specified HTTP response headers indicate an HTML response.
        Parameters:
        headers - the HTTP response headers
        Returns:
        true if the specified HTTP response headers indicate an HTML response
      • isXml

        @Deprecated
        static boolean isXml​(java.util.List<NameValuePair> headers)
        Deprecated.
        as of version 4.0.0; method will be removed without replacement
        Returns true if the specified HTTP response headers indicate an XML response.
        Parameters:
        headers - the HTTP response headers
        Returns:
        true if the specified HTTP response headers indicate an XML response
      • contentTypeEndsWith

        static boolean contentTypeEndsWith​(java.util.List<NameValuePair> headers,
                                           java.lang.String... contentTypeEndings)
        Returns true if the specified HTTP response headers contain a Content-Type that ends with one of the specified strings.
        Parameters:
        headers - the HTTP response headers
        contentTypeEndings - the content type endings to search for
        Returns:
        true if the specified HTTP response headers contain a Content-Type that ends with one of the specified strings
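
        The check described above can be sketched with the standard library alone. This is a minimal illustration, not the class's actual code: headers are modelled as Map.Entry pairs (a hypothetical stand-in for NameValuePair), and the class name ContentTypeEndsWithSketch, the case-insensitive comparison and the stripping of parameters after ';' are all assumptions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ContentTypeEndsWithSketch {

    /** Hypothetical stand-in: each header is a (name, value) entry. */
    static boolean contentTypeEndsWith(List<Map.Entry<String, String>> headers,
                                       String... contentTypeEndings) {
        for (Map.Entry<String, String> header : headers) {
            if (!"content-type".equalsIgnoreCase(header.getKey())) {
                continue;
            }
            String value = header.getValue().toLowerCase(Locale.ROOT);
            // Assumption: parameters such as "; charset=utf-8" are ignored.
            int semicolon = value.indexOf(';');
            if (semicolon >= 0) {
                value = value.substring(0, semicolon);
            }
            value = value.trim();
            for (String ending : contentTypeEndings) {
                if (value.endsWith(ending)) {
                    return true;
                }
            }
            return false; // assumption: only the first Content-Type header counts
        }
        return false;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> headers = new ArrayList<>();
        headers.add(new SimpleEntry<>("Content-Type", "application/xhtml+xml; charset=utf-8"));
        System.out.println(contentTypeEndsWith(headers, "text/html"));
        System.out.println(contentTypeEndsWith(headers, "+xml", "/xml"));
    }
}
```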
      • sniffHtmlEncoding

        @Deprecated
        public static java.nio.charset.Charset sniffHtmlEncoding​(java.util.List<NameValuePair> headers,
                                                                 java.io.InputStream content)
                                                          throws java.io.IOException

        Sniffs encoding settings from the specified HTML content and/or the corresponding HTTP headers based on the HTML5 encoding sniffing algorithm.

        Note that if an encoding is found but it is not supported on the current platform, this method returns null, as if no encoding had been found.

        Parameters:
        headers - the HTTP response headers sent back with the HTML content to be sniffed
        content - the HTML content to be sniffed
        Returns:
        the encoding sniffed from the specified HTML content and/or the corresponding HTTP headers, or null if the encoding could not be determined
        Throws:
        java.io.IOException - if an IO error occurs
      • sniffXmlEncoding

        @Deprecated
        public static java.nio.charset.Charset sniffXmlEncoding​(java.util.List<NameValuePair> headers,
                                                                java.io.InputStream content)
                                                         throws java.io.IOException

        Sniffs encoding settings from the specified XML content and/or the corresponding HTTP headers using a custom algorithm.

        Note that if an encoding is found but it is not supported on the current platform, this method returns null, as if no encoding had been found.

        Parameters:
        headers - the HTTP response headers sent back with the XML content to be sniffed
        content - the XML content to be sniffed
        Returns:
        the encoding sniffed from the specified XML content and/or the corresponding HTTP headers, or null if the encoding could not be determined
        Throws:
        java.io.IOException - if an IO error occurs
      • sniffUnknownContentTypeEncoding

        @Deprecated
        public static java.nio.charset.Charset sniffUnknownContentTypeEncoding​(java.util.List<NameValuePair> headers,
                                                                               java.io.InputStream content)
                                                                        throws java.io.IOException

        Sniffs encoding settings from the specified content of unknown type by looking for Content-Type information in the HTTP headers and Byte Order Mark information in the content.

        Note that if an encoding is found but it is not supported on the current platform, this method returns null, as if no encoding had been found.

        Parameters:
        headers - the HTTP response headers sent back with the content to be sniffed
        content - the content to be sniffed
        Returns:
        the encoding sniffed from the specified content and/or the corresponding HTTP headers, or null if the encoding could not be determined
        Throws:
        java.io.IOException - if an IO error occurs
      • sniffEncodingFromHttpHeaders

        @Deprecated
        public static java.nio.charset.Charset sniffEncodingFromHttpHeaders​(java.util.List<NameValuePair> headers)
        Deprecated.
        as of version 4.0.0; method will be removed without replacement
        Attempts to sniff an encoding from the specified HTTP headers.
        Parameters:
        headers - the HTTP headers to examine
        Returns:
        the encoding sniffed from the specified HTTP headers, or null if the encoding could not be determined
      • sniffEncodingFromUnicodeBom

        static java.nio.charset.Charset sniffEncodingFromUnicodeBom​(byte[] bytes)
        Attempts to sniff an encoding from a Byte Order Mark in the specified byte array.
        Parameters:
        bytes - the bytes to check for a Byte Order Mark
        Returns:
        the encoding sniffed from the specified bytes, or null if the encoding could not be determined
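
        As a rough illustration of Byte Order Mark sniffing, the sketch below checks the leading bytes against the UTF-8 and UTF-16 marks using only the standard library. The class and method names are hypothetical, and which marks the real method recognises is an assumption; UTF-32 marks are deliberately left out here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffSketch {

    /** Returns true if data begins with every byte value in prefix. */
    static boolean startsWith(byte[] data, int... prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned byte values.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Maps a leading BOM to a charset, or returns null if none is present. */
    static Charset sniffBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(sniffBom(withBom));
        System.out.println(sniffBom(new byte[] {'h', 'i'}));
    }
}
```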
      • startsWith

        private static boolean startsWith​(byte[] bytes,
                                          org.apache.commons.io.ByteOrderMark bom)
        Returns whether the specified byte array starts with the given ByteOrderMark.
        Parameters:
        bytes - the byte array to check
        bom - the ByteOrderMark
        Returns:
        true if the specified byte array starts with the given ByteOrderMark
      • sniffEncodingFromMetaTag

        @Deprecated
        static java.nio.charset.Charset sniffEncodingFromMetaTag​(byte[] bytes)
                                                          throws java.io.IOException
        Deprecated.
        as of version 4.0.0; method will be removed without replacement
        Attempts to sniff an encoding from an HTML meta tag in the specified byte array.
        Parameters:
        bytes - the bytes to check for an HTML meta tag
        Returns:
        the encoding sniffed from the specified bytes, or null if the encoding could not be determined
        Throws:
        java.io.IOException - if an IO error occurs
      • sniffEncodingFromMetaTag

        public static java.nio.charset.Charset sniffEncodingFromMetaTag​(java.io.InputStream is)
                                                                 throws java.io.IOException
        Attempts to sniff an encoding from an HTML meta tag in the specified content stream.
        Parameters:
        is - the content stream to check for an HTML meta tag
        Returns:
        the encoding sniffed from the specified content stream, or null if the encoding could not be determined
        Throws:
        java.io.IOException - if an IO error occurs
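
        To give a feel for meta tag sniffing, here is a deliberately simplified sketch that looks for a charset value inside a meta tag with a regular expression. It is not the HTML5 prescan algorithm the class is based on (for instance, it does not skip comments), it works on bytes that have already been read, and the names MetaCharsetSketch and sniffMetaCharset are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSketch {

    // Matches a charset value inside a <meta ...> tag, quoted or unquoted.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:+-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the first matching meta tag, or null. */
    static Charset sniffMetaCharset(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher matcher = META_CHARSET.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        try {
            return Charset.forName(matcher.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported name: treat as "not found".
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] html = "<html><head><meta charset=\"utf-8\"></head>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniffMetaCharset(html));
    }
}
```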
      • getAttribute

        static EncodingSniffer.Attribute getAttribute​(byte[] bytes,
                                                      int startFrom)
        Extracts an attribute from the specified byte array, starting at the specified index, using the HTML5 attribute algorithm.
        Parameters:
        bytes - the byte array to extract an attribute from
        startFrom - the index to start searching from
        Returns:
        the next attribute in the specified byte array, or null if one is not available
      • extractEncodingFromContentType

        public static java.nio.charset.Charset extractEncodingFromContentType​(java.lang.String s)
        Extracts an encoding from the specified Content-Type value using the IETF algorithm; if no encoding is found, this method returns null.
        Parameters:
        s - the Content-Type value to search for an encoding
        Returns:
        the encoding found in the specified Content-Type value, or null if no encoding was found
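
        The following sketch shows one simple way to pull a charset parameter out of a Content-Type value. It is a simplification written against the standard library only, not the IETF algorithm the method refers to; the names ContentTypeCharsetSketch and charsetFromContentType are hypothetical.

```java
import java.nio.charset.Charset;

final class ContentTypeCharsetSketch {

    private static final String PARAM = "charset=";

    /** Returns the charset named in a Content-Type value, or null. */
    static Charset charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (!param.regionMatches(true, 0, PARAM, 0, PARAM.length())) {
                continue;
            }
            String value = param.substring(PARAM.length()).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Charset.forName(value);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported name: treat as "not found".
                return null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1"));
        System.out.println(charsetFromContentType("text/html"));
    }
}
```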
      • sniffEncodingFromXmlDeclaration

        @Deprecated
        static java.nio.charset.Charset sniffEncodingFromXmlDeclaration​(byte[] bytes)
                                                                 throws java.io.IOException
        Deprecated.
        as of version 4.0.0; use sniffEncodingFromXmlDeclaration(InputStream) instead
        Searches the specified XML content for an XML declaration and returns the encoding if found, otherwise returns null.
        Parameters:
        bytes - the XML content to sniff
        Returns:
        the encoding of the specified XML content, or null if it could not be determined
        Throws:
        java.io.IOException - if an IO error occurs
      • sniffEncodingFromXmlDeclaration

        public static java.nio.charset.Charset sniffEncodingFromXmlDeclaration​(java.io.InputStream is)
                                                                        throws java.io.IOException
        Searches the specified XML content for an XML declaration and returns the encoding if found, otherwise returns null.
        Parameters:
        is - the content stream to check for the charset declaration
        Returns:
        the encoding of the specified XML content, or null if it could not be determined
        Throws:
        java.io.IOException - if an IO error occurs
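
        A search for the encoding in an XML declaration might look roughly like the sketch below. It assumes the first bytes of the content are already in memory and are ASCII-compatible, and it uses a regular expression rather than whatever the real method does; XmlDeclarationSketch and sniffXmlDeclaration are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlDeclarationSketch {

    // The declaration must be at the very beginning of the content.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._:+-]+)[\"']");

    /** Returns the charset named in a leading XML declaration, or null. */
    static Charset sniffXmlDeclaration(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher matcher = ENCODING.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        try {
            return Charset.forName(matcher.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported name: treat as "not found".
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniffXmlDeclaration(xml));
    }
}
```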
      • sniffEncodingFromCssDeclaration

        @Deprecated
        static java.nio.charset.Charset sniffEncodingFromCssDeclaration​(byte[] bytes)
                                                                 throws java.io.IOException
        Parses and returns the charset declaration at the start of a CSS file if any, otherwise returns null.

        e.g.

        @charset "UTF-8"
        Parameters:
        bytes - the bytes to parse
        Returns:
        the charset declared at the start of a CSS file, or null if there is none
        Throws:
        java.io.IOException - if an IO error occurs
      • sniffEncodingFromCssDeclaration

        public static java.nio.charset.Charset sniffEncodingFromCssDeclaration​(java.io.InputStream is)
                                                                        throws java.io.IOException
        Parses and returns the charset declaration at the start of a CSS file if any, otherwise returns null.

        e.g.

        @charset "UTF-8"
        Parameters:
        is - the input stream to parse
        Returns:
        the charset declared at the start of a CSS file, or null if there is none
        Throws:
        java.io.IOException - if an IO error occurs
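
        Parsing a leading @charset rule can be sketched as a prefix check followed by a search for the closing quote. The code below is an illustration under assumptions: the 64-byte limit, the exact prefix string and the names CssCharsetSketch and sniffCssCharset are hypothetical and not taken from the class.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CssCharsetSketch {

    private static final String PREFIX = "@charset \"";
    private static final int SNIFF_SIZE = 64; // hypothetical limit

    /** Returns the charset declared at the very start of the stream, or null. */
    static Charset sniffCssCharset(InputStream is) throws IOException {
        byte[] buffer = new byte[SNIFF_SIZE];
        int total = 0;
        int n;
        // Keep reading until the buffer is full or the stream ends.
        while (total < buffer.length
                && (n = is.read(buffer, total, buffer.length - total)) != -1) {
            total += n;
        }
        String head = new String(buffer, 0, total, StandardCharsets.ISO_8859_1);
        if (!head.startsWith(PREFIX)) {
            return null;
        }
        int closingQuote = head.indexOf('"', PREFIX.length());
        if (closingQuote < 0) {
            return null;
        }
        try {
            return Charset.forName(head.substring(PREFIX.length(), closingQuote));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported name: treat as "not found".
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] css = "@charset \"UTF-8\"; body { margin: 0 }"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniffCssCharset(new ByteArrayInputStream(css)));
    }
}
```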
      • toCharset

        public static java.nio.charset.Charset toCharset​(java.lang.String charsetName)
        Returns the Charset for the specified charset name if it is supported on this platform.
        Parameters:
        charsetName - the charset name to check
        Returns:
        the Charset for the specified charset name, if it is supported on this platform
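
        A lookup of this kind is commonly written around Charset.forName, as in the sketch below. Returning null for an illegal or unsupported name is an assumption here, chosen to match the null-on-unsupported behaviour described for the sniffing methods; ToCharsetSketch and toCharsetOrNull are hypothetical names.

```java
import java.nio.charset.Charset;

final class ToCharsetSketch {

    /** Returns the named charset, or null if the name is illegal or unsupported. */
    static Charset toCharsetOrNull(String charsetName) {
        try {
            return Charset.forName(charsetName);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException, UnsupportedCharsetException
            // and the IllegalArgumentException thrown for a null name.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toCharsetOrNull("utf-8"));
        System.out.println(toCharsetOrNull("no-such-charset-xyz"));
    }
}
```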
      • matches

        static boolean matches​(byte[] bytes,
                               int i,
                               byte[][] sought)
        Returns true if the byte in the specified byte array at the specified index matches one of the specified byte array patterns.
        Parameters:
        bytes - the byte array to search in
        i - the index at which to search
        sought - the byte array patterns to search for
        Returns:
        true if the byte in the specified byte array at the specified index matches one of the specified byte array patterns
      • skipToAnyOf

        static int skipToAnyOf​(byte[] bytes,
                               int startFrom,
                               byte[] targets)
        Skips ahead to the first occurrence of any of the specified targets within the specified array, starting at the specified index. This method returns -1 if none of the targets are found.
        Parameters:
        bytes - the array to search through
        startFrom - the index to start looking at
        targets - the targets to search for
        Returns:
        the index of the first occurrence of any of the specified targets within the specified array, or -1 if none of the targets are found
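
        The skip-ahead search described above can be sketched as a nested loop. This is an illustrative version only; the class name SkipToAnyOfSketch and the clamping of a negative start index to 0 are assumptions.

```java
import java.nio.charset.StandardCharsets;

final class SkipToAnyOfSketch {

    /** Returns the index of the first byte equal to any target, or -1. */
    static int skipToAnyOf(byte[] bytes, int startFrom, byte[] targets) {
        for (int i = Math.max(startFrom, 0); i < bytes.length; i++) {
            for (byte target : targets) {
                if (bytes[i] == target) {
                    return i;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] tag = "<meta charset>".getBytes(StandardCharsets.US_ASCII);
        byte[] targets = {' ', '>'};
        System.out.println(skipToAnyOf(tag, 0, targets)); // 5, the space after "<meta"
        System.out.println(skipToAnyOf(tag, 6, targets)); // 13, the closing '>'
    }
}
```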
      • indexOfSubArray

        static int indexOfSubArray​(byte[] array,
                                   byte[] subarray,
                                   int startIndex)
        Finds the first index of the specified sub-array inside the specified array, starting at the specified index. This method returns -1 if the specified sub-array cannot be found.
        Parameters:
        array - the array to search for the sub-array
        subarray - the sub-array to find
        startIndex - the start index to traverse forwards from
        Returns:
        the index of the sub-array within the array, or -1 if the sub-array cannot be found
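
        A straightforward way to implement such a sub-array search is a naive scan, sketched below. The class name IndexOfSubArraySketch is hypothetical, and the handling of edge cases (a negative start index is clamped to 0, an empty sub-array matches at the start index) is an assumption rather than documented behaviour.

```java
import java.nio.charset.StandardCharsets;

final class IndexOfSubArraySketch {

    /** Returns the first index at or after startIndex where subarray occurs, or -1. */
    static int indexOfSubArray(byte[] array, byte[] subarray, int startIndex) {
        int lastStart = array.length - subarray.length;
        outer:
        for (int i = Math.max(startIndex, 0); i <= lastStart; i++) {
            for (int j = 0; j < subarray.length; j++) {
                if (array[i + j] != subarray[j]) {
                    continue outer; // mismatch: try the next start position
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] html = "<!-- x -->".getBytes(StandardCharsets.US_ASCII);
        byte[] commentEnd = "-->".getBytes(StandardCharsets.US_ASCII);
        System.out.println(indexOfSubArray(html, commentEnd, 0)); // 7
        System.out.println(indexOfSubArray(html, commentEnd, 8)); // -1
    }
}
```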
      • read

        static byte[] read​(java.io.InputStream content,
                           int size)
                    throws java.io.IOException
        Attempts to read size bytes from the specified input stream. Note that this method is not guaranteed to be able to read size bytes; however, the returned byte array will always be the exact length of the number of bytes read.
        Parameters:
        content - the input stream to read from
        size - the number of bytes to try to read
        Returns:
        the bytes read from the specified input stream
        Throws:
        java.io.IOException - if an IO error occurs
      • readAndPrepend

        static byte[] readAndPrepend​(java.io.InputStream content,
                                     int size,
                                     byte[] prefix)
                              throws java.io.IOException
        Attempts to read size bytes from the specified input stream and then prepends the specified prefix to the bytes read, returning the resultant byte array. Note that this method is not guaranteed to be able to read size bytes; however, the returned byte array will always be the exact length of the number of bytes read plus the length of the prefix array.
        Parameters:
        content - the input stream to read from
        size - the number of bytes to try to read
        prefix - the byte array to prepend to the bytes read from the specified input stream
        Returns:
        the bytes read from the specified input stream, prefixed by the specified prefix
        Throws:
        java.io.IOException - if an IO error occurs
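
        The two read helpers above could be written along the following lines. This is a sketch, not the class's implementation: it keeps reading until size bytes have arrived or the stream ends, which is one possible strategy, and the class name ReadSketch is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class ReadSketch {

    /** Reads up to size bytes; the result is exactly as long as the bytes read. */
    static byte[] read(InputStream content, int size) throws IOException {
        byte[] buffer = new byte[size];
        int total = 0;
        while (total < size) {
            int n = content.read(buffer, total, size - total);
            if (n == -1) {
                break; // end of stream before size bytes were available
            }
            total += n;
        }
        return Arrays.copyOf(buffer, total);
    }

    /** Reads up to size bytes and returns prefix followed by those bytes. */
    static byte[] readAndPrepend(InputStream content, int size, byte[] prefix)
            throws IOException {
        byte[] body = read(content, size);
        byte[] joined = Arrays.copyOf(prefix, prefix.length + body.length);
        System.arraycopy(body, 0, joined, prefix.length, body.length);
        return joined;
    }

    public static void main(String[] args) throws IOException {
        InputStream shortStream = new ByteArrayInputStream(new byte[] {1, 2, 3});
        System.out.println(read(shortStream, 10).length); // 3

        InputStream rest = new ByteArrayInputStream(new byte[] {3, 4, 5});
        System.out.println(Arrays.toString(readAndPrepend(rest, 2, new byte[] {1, 2})));
    }
}
```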
      • translateEncodingLabel

        @Deprecated
        public static java.lang.String translateEncodingLabel​(java.nio.charset.Charset encodingLabel)
        Deprecated.
        as of version 4.0.0; method will be removed without replacement
        Translates the given encoding label into a normalized form according to Reference.
        Parameters:
        encodingLabel - the label to translate
        Returns:
        the normalized encoding name or null if not found
      • translateEncodingLabel

        public static java.lang.String translateEncodingLabel​(java.lang.String encodingLabel)
        Translates the given encoding label into a normalized form according to Reference.
        Parameters:
        encodingLabel - the label to translate
        Returns:
        the normalized encoding name or null if not found