Class EncodingSniffer
- java.lang.Object
-
- org.htmlunit.util.EncodingSniffer
-
public final class EncodingSniffer extends java.lang.Object
Sniffs encoding settings from HTML, XML or other content. The HTML encoding sniffing algorithm is based on the HTML5 encoding sniffing algorithm.
-
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
(package private) static class EncodingSniffer.Attribute
-
Field Summary
Fields
Modifier and Type, Field, Description
private static byte[][] CHARSET_START
Sequence(s) of bytes indicating the beginning of a charset specification.
private static byte[] COMMENT_END
private static byte[][] COMMENT_START
Sequence(s) of bytes indicating the beginning of a comment.
private static byte[] CSS_CHARSET_DECLARATION_PREFIX
private static org.apache.commons.logging.Log LOG
Logging support.
private static byte[][] META_START
Sequence(s) of bytes indicating the beginning of a meta HTML tag.
private static byte[][] OTHER_START
Sequence(s) of bytes indicating the beginning of miscellaneous HTML content.
private static int SIZE_OF_CSS_CONTENT_SNIFFED
private static int SIZE_OF_HTML_CONTENT_SNIFFED
The number of HTML bytes to sniff for encoding info embedded in meta tags.
private static int SIZE_OF_XML_CONTENT_SNIFFED
The number of XML bytes to sniff for encoding info embedded in the XML declaration; relatively small because it's always at the very beginning of the file.
private static byte[] WHITESPACE
private static byte[] XML_DECLARATION_PREFIX
-
Constructor Summary
Constructors
Modifier, Constructor, Description
private EncodingSniffer()
Disallows instantiation of this class.
-
Method Summary
Modifier and Type, Method, Description
(package private) static boolean contentTypeEndsWith(java.util.List<NameValuePair> headers, java.lang.String... contentTypeEndings)
Returns true if the specified HTTP response headers contain a Content-Type that ends with one of the specified strings.
static java.nio.charset.Charset extractEncodingFromContentType(java.lang.String s)
Extracts an encoding from the specified Content-Type value using the IETF algorithm; if no encoding is found, this method returns null.
(package private) static EncodingSniffer.Attribute getAttribute(byte[] bytes, int startFrom)
Extracts an attribute from the specified byte array, starting at the specified index, using the HTML5 attribute algorithm.
(package private) static int indexOfSubArray(byte[] array, byte[] subarray, int startIndex)
Finds the first index of the specified sub-array inside the specified array, starting at the specified index.
(package private) static boolean isHtml(java.util.List<NameValuePair> headers)
Deprecated as of version 4.0.0; method will be removed without replacement.
(package private) static boolean isXml(java.util.List<NameValuePair> headers)
Deprecated as of version 4.0.0; method will be removed without replacement.
(package private) static boolean matches(byte[] bytes, int i, byte[][] sought)
Returns true if the byte in the specified byte array at the specified index matches one of the specified byte array patterns.
(package private) static byte[] read(java.io.InputStream content, int size)
Attempts to read size bytes from the specified input stream.
(package private) static byte[] readAndPrepend(java.io.InputStream content, int size, byte[] prefix)
Attempts to read size bytes from the specified input stream and then prepends the specified prefix to the bytes read, returning the resultant byte array.
(package private) static int skipToAnyOf(byte[] bytes, int startFrom, byte[] targets)
Skips ahead to the first occurrence of any of the specified targets within the specified array, starting at the specified index.
private static java.nio.charset.Charset sniffCssEncoding(java.util.List<NameValuePair> headers, java.io.InputStream content)
Deprecated as of version 4.0.0; depending on the content use sniffEncodingFromMetaTag(InputStream), sniffEncodingFromXmlDeclaration(InputStream), or sniffEncodingFromCssDeclaration(InputStream) instead.
static java.nio.charset.Charset sniffEncoding(java.util.List<NameValuePair> headers, java.io.InputStream content)
Deprecated as of version 4.0.0; depending on the content use sniffEncodingFromMetaTag(InputStream), sniffEncodingFromXmlDeclaration(InputStream), or sniffEncodingFromCssDeclaration(InputStream) instead.
(package private) static java.nio.charset.Charset sniffEncodingFromCssDeclaration(byte[] bytes)
Deprecated as of version 4.0.0; depending on the content use sniffEncodingFromMetaTag(InputStream), sniffEncodingFromXmlDeclaration(InputStream), or sniffEncodingFromCssDeclaration(InputStream) instead.
static java.nio.charset.Charset sniffEncodingFromCssDeclaration(java.io.InputStream is)
Parses and returns the charset declaration at the start of a CSS file if any, otherwise returns null.
static java.nio.charset.Charset sniffEncodingFromHttpHeaders(java.util.List<NameValuePair> headers)
Deprecated as of version 4.0.0; method will be removed without replacement.
(package private) static java.nio.charset.Charset sniffEncodingFromMetaTag(byte[] bytes)
Deprecated as of version 4.0.0; method will be removed without replacement.
static java.nio.charset.Charset sniffEncodingFromMetaTag(java.io.InputStream is)
Attempts to sniff an encoding from an HTML meta tag in the specified input stream.
(package private) static java.nio.charset.Charset sniffEncodingFromUnicodeBom(byte[] bytes)
Attempts to sniff an encoding from a Byte Order Mark in the specified byte array.
(package private) static java.nio.charset.Charset sniffEncodingFromXmlDeclaration(byte[] bytes)
Deprecated as of version 4.0.0; use sniffEncodingFromXmlDeclaration(InputStream) instead.
static java.nio.charset.Charset sniffEncodingFromXmlDeclaration(java.io.InputStream is)
Searches the specified XML content for an XML declaration and returns the encoding if found, otherwise returns null.
static java.nio.charset.Charset sniffHtmlEncoding(java.util.List<NameValuePair> headers, java.io.InputStream content)
Deprecated as of version 4.0.0; depending on the content use sniffEncodingFromMetaTag(InputStream), sniffEncodingFromXmlDeclaration(InputStream), or sniffEncodingFromCssDeclaration(InputStream) instead.
static java.nio.charset.Charset sniffUnknownContentTypeEncoding(java.util.List<NameValuePair> headers, java.io.InputStream content)
Deprecated as of version 4.0.0; depending on the content use sniffEncodingFromMetaTag(InputStream), sniffEncodingFromXmlDeclaration(InputStream), or sniffEncodingFromCssDeclaration(InputStream) instead.
static java.nio.charset.Charset sniffXmlEncoding(java.util.List<NameValuePair> headers, java.io.InputStream content)
Deprecated as of version 4.0.0; depending on the content use sniffEncodingFromMetaTag(InputStream), sniffEncodingFromXmlDeclaration(InputStream), or sniffEncodingFromCssDeclaration(InputStream) instead.
private static boolean startsWith(byte[] bytes, org.apache.commons.io.ByteOrderMark bom)
Returns whether the specified byte array starts with the given ByteOrderMark, or not.
static java.nio.charset.Charset toCharset(java.lang.String charsetName)
Returns the Charset if the specified charset name is supported on this platform.
static java.lang.String translateEncodingLabel(java.lang.String encodingLabel)
Translates the given encoding label into a normalized form according to Reference.
static java.lang.String translateEncodingLabel(java.nio.charset.Charset encodingLabel)
Deprecated as of version 4.0.0; method will be removed without replacement.
-
-
-
Field Detail
-
LOG
private static final org.apache.commons.logging.Log LOG
Logging support.
-
COMMENT_START
private static final byte[][] COMMENT_START
Sequence(s) of bytes indicating the beginning of a comment.
-
META_START
private static final byte[][] META_START
Sequence(s) of bytes indicating the beginning of a meta HTML tag.
-
OTHER_START
private static final byte[][] OTHER_START
Sequence(s) of bytes indicating the beginning of miscellaneous HTML content.
-
CHARSET_START
private static final byte[][] CHARSET_START
Sequence(s) of bytes indicating the beginning of a charset specification.
-
WHITESPACE
private static final byte[] WHITESPACE
-
COMMENT_END
private static final byte[] COMMENT_END
-
XML_DECLARATION_PREFIX
private static final byte[] XML_DECLARATION_PREFIX
-
CSS_CHARSET_DECLARATION_PREFIX
private static final byte[] CSS_CHARSET_DECLARATION_PREFIX
-
SIZE_OF_HTML_CONTENT_SNIFFED
private static final int SIZE_OF_HTML_CONTENT_SNIFFED
The number of HTML bytes to sniff for encoding info embedded in meta tags.
- See Also:
- Constant Field Values
-
SIZE_OF_XML_CONTENT_SNIFFED
private static final int SIZE_OF_XML_CONTENT_SNIFFED
The number of XML bytes to sniff for encoding info embedded in the XML declaration; relatively small because it's always at the very beginning of the file.
- See Also:
- Constant Field Values
-
SIZE_OF_CSS_CONTENT_SNIFFED
private static final int SIZE_OF_CSS_CONTENT_SNIFFED
- See Also:
- Constant Field Values
-
-
Method Detail
-
sniffEncoding
@Deprecated public static java.nio.charset.Charset sniffEncoding(java.util.List<NameValuePair> headers, java.io.InputStream content) throws java.io.IOException
Deprecated as of version 4.0.0; depending on the content use sniffEncodingFromMetaTag(InputStream), sniffEncodingFromXmlDeclaration(InputStream), or sniffEncodingFromCssDeclaration(InputStream) instead.
If the specified content is HTML content, this method sniffs encoding settings from the specified HTML content and/or the corresponding HTTP headers based on the HTML5 encoding sniffing algorithm.
If the specified content is XML content, this method sniffs encoding settings from the specified XML content and/or the corresponding HTTP headers using a custom algorithm.
Otherwise, this method sniffs encoding settings from the specified content of unknown type by looking for Content-Type information in the HTTP headers and Byte Order Mark information in the content.
Note that if an encoding is found but it is not supported on the current platform, this method returns null, as if no encoding had been found.
- Parameters:
headers - the HTTP response headers sent back with the content to be sniffed
content - the content to be sniffed
- Returns:
- the encoding sniffed from the specified content and/or the corresponding HTTP headers, or null if the encoding could not be determined
- Throws:
java.io.IOException - if an IO error occurs
-
isHtml
@Deprecated static boolean isHtml(java.util.List<NameValuePair> headers)
Deprecated as of version 4.0.0; method will be removed without replacement.
Returns true if the specified HTTP response headers indicate an HTML response.
- Parameters:
headers - the HTTP response headers
- Returns:
- true if the specified HTTP response headers indicate an HTML response
-
isXml
@Deprecated static boolean isXml(java.util.List<NameValuePair> headers)
Deprecated as of version 4.0.0; method will be removed without replacement.
Returns true if the specified HTTP response headers indicate an XML response.
- Parameters:
headers - the HTTP response headers
- Returns:
- true if the specified HTTP response headers indicate an XML response
-
contentTypeEndsWith
static boolean contentTypeEndsWith(java.util.List<NameValuePair> headers, java.lang.String... contentTypeEndings)
Returns true if the specified HTTP response headers contain a Content-Type that ends with one of the specified strings.
- Parameters:
headers - the HTTP response headers
contentTypeEndings - the content type endings to search for
- Returns:
- true if the specified HTTP response headers contain a Content-Type that ends with one of the specified strings
-
sniffHtmlEncoding
@Deprecated public static java.nio.charset.Charset sniffHtmlEncoding(java.util.List<NameValuePair> headers, java.io.InputStream content) throws java.io.IOException
Deprecated as of version 4.0.0; depending on the content use sniffEncodingFromMetaTag(InputStream), sniffEncodingFromXmlDeclaration(InputStream), or sniffEncodingFromCssDeclaration(InputStream) instead.
Sniffs encoding settings from the specified HTML content and/or the corresponding HTTP headers based on the HTML5 encoding sniffing algorithm.
Note that if an encoding is found but it is not supported on the current platform, this method returns null, as if no encoding had been found.
- Parameters:
headers - the HTTP response headers sent back with the HTML content to be sniffed
content - the HTML content to be sniffed
- Returns:
- the encoding sniffed from the specified HTML content and/or the corresponding HTTP headers, or null if the encoding could not be determined
- Throws:
java.io.IOException - if an IO error occurs
-
sniffXmlEncoding
@Deprecated public static java.nio.charset.Charset sniffXmlEncoding(java.util.List<NameValuePair> headers, java.io.InputStream content) throws java.io.IOException
Deprecated as of version 4.0.0; depending on the content use sniffEncodingFromMetaTag(InputStream), sniffEncodingFromXmlDeclaration(InputStream), or sniffEncodingFromCssDeclaration(InputStream) instead.
Sniffs encoding settings from the specified XML content and/or the corresponding HTTP headers using a custom algorithm.
Note that if an encoding is found but it is not supported on the current platform, this method returns null, as if no encoding had been found.
- Parameters:
headers - the HTTP response headers sent back with the XML content to be sniffed
content - the XML content to be sniffed
- Returns:
- the encoding sniffed from the specified XML content and/or the corresponding HTTP headers, or null if the encoding could not be determined
- Throws:
java.io.IOException - if an IO error occurs
-
sniffCssEncoding
@Deprecated private static java.nio.charset.Charset sniffCssEncoding(java.util.List<NameValuePair> headers, java.io.InputStream content) throws java.io.IOException
Deprecated as of version 4.0.0; depending on the content use sniffEncodingFromMetaTag(InputStream), sniffEncodingFromXmlDeclaration(InputStream), or sniffEncodingFromCssDeclaration(InputStream) instead.
- Throws:
java.io.IOException
-
sniffUnknownContentTypeEncoding
@Deprecated public static java.nio.charset.Charset sniffUnknownContentTypeEncoding(java.util.List<NameValuePair> headers, java.io.InputStream content) throws java.io.IOException
Deprecated as of version 4.0.0; depending on the content use sniffEncodingFromMetaTag(InputStream), sniffEncodingFromXmlDeclaration(InputStream), or sniffEncodingFromCssDeclaration(InputStream) instead.
Sniffs encoding settings from the specified content of unknown type by looking for Content-Type information in the HTTP headers and Byte Order Mark information in the content.
Note that if an encoding is found but it is not supported on the current platform, this method returns null, as if no encoding had been found.
- Parameters:
headers - the HTTP response headers sent back with the content to be sniffed
content - the content to be sniffed
- Returns:
- the encoding sniffed from the specified content and/or the corresponding HTTP headers, or null if the encoding could not be determined
- Throws:
java.io.IOException - if an IO error occurs
-
sniffEncodingFromHttpHeaders
@Deprecated public static java.nio.charset.Charset sniffEncodingFromHttpHeaders(java.util.List<NameValuePair> headers)
Deprecated as of version 4.0.0; method will be removed without replacement.
Attempts to sniff an encoding from the specified HTTP headers.
- Parameters:
headers - the HTTP headers to examine
- Returns:
- the encoding sniffed from the specified HTTP headers, or null if the encoding could not be determined
-
sniffEncodingFromUnicodeBom
static java.nio.charset.Charset sniffEncodingFromUnicodeBom(byte[] bytes)
Attempts to sniff an encoding from a Byte Order Mark in the specified byte array.
- Parameters:
bytes - the bytes to check for a Byte Order Mark
- Returns:
- the encoding sniffed from the specified bytes, or null if the encoding could not be determined
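To illustrate what Byte Order Mark sniffing involves, here is a minimal standalone sketch using only the JDK. It is not HtmlUnit's implementation: the class name `BomSniffSketch`, its helper methods, and the choice to check only the UTF-8, UTF-16BE and UTF-16LE marks are assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of HtmlUnit.
final class BomSniffSketch {
    private BomSniffSketch() { }

    /** Returns true if bytes begins with every byte value listed in bom. */
    static boolean startsWith(byte[] bytes, int... bom) {
        if (bytes == null || bytes.length < bom.length) {
            return false;
        }
        for (int i = 0; i < bom.length; i++) {
            if ((bytes[i] & 0xFF) != bom[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the charset implied by a leading BOM, or null if none is recognised. */
    static Charset sniffBom(byte[] bytes) {
        // Assumption: only these three marks are checked in this sketch.
        if (startsWith(bytes, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(bytes, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(bytes, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41};
        System.out.println(sniffBom(utf8)); // prints "UTF-8"
    }
}
```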
-
startsWith
private static boolean startsWith(byte[] bytes, org.apache.commons.io.ByteOrderMark bom)
Returns whether the specified byte array starts with the given ByteOrderMark, or not.
- Parameters:
bytes - the byte array to check
bom - the ByteOrderMark
- Returns:
- whether the specified byte array starts with the given ByteOrderMark, or not
-
sniffEncodingFromMetaTag
@Deprecated static java.nio.charset.Charset sniffEncodingFromMetaTag(byte[] bytes) throws java.io.IOException
Deprecated as of version 4.0.0; method will be removed without replacement.
Attempts to sniff an encoding from an HTML meta tag in the specified byte array.
- Parameters:
bytes - the bytes to check for an HTML meta tag
- Returns:
- the encoding sniffed from the specified bytes, or null if the encoding could not be determined
- Throws:
java.io.IOException
-
sniffEncodingFromMetaTag
public static java.nio.charset.Charset sniffEncodingFromMetaTag(java.io.InputStream is) throws java.io.IOException
Attempts to sniff an encoding from an HTML meta tag in the specified input stream.
- Parameters:
is - the content stream to check for an HTML meta tag
- Returns:
- the encoding sniffed from the specified content stream, or null if the encoding could not be determined
- Throws:
java.io.IOException - if an IO error occurs
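As a rough picture of meta tag sniffing, the sketch below reads a bounded prefix of a stream and looks for a `charset` value inside a `<meta ...>` tag. It deliberately swaps in a regular expression for the HTML5 prescan algorithm, so it does not skip comments or handle every edge case. The class `MetaCharsetSketch`, the 1024-byte window, and the pattern are all assumptions for illustration, and `InputStream.readNBytes(int)` requires Java 11 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of HtmlUnit.
final class MetaCharsetSketch {
    // Assumed sniff window; the value of SIZE_OF_HTML_CONTENT_SNIFFED is not stated here.
    private static final int SNIFF_SIZE = 1024;

    // Matches charset=... anywhere inside a single <meta ...> tag (regex shortcut, not the HTML5 prescan).
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?\\s*([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Sniffs a charset from already-read bytes; returns null if nothing usable is found. */
    static Charset sniff(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat as "not found".
            return null;
        }
    }

    /** Reads at most SNIFF_SIZE bytes from the stream and sniffs them. */
    static Charset sniff(InputStream is) throws IOException {
        return sniff(is.readNBytes(SNIFF_SIZE));
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(new ByteArrayInputStream(html))); // prints "UTF-8"
    }
}
```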
-
getAttribute
static EncodingSniffer.Attribute getAttribute(byte[] bytes, int startFrom)
Extracts an attribute from the specified byte array, starting at the specified index, using the HTML5 attribute algorithm.
- Parameters:
bytes - the byte array to extract an attribute from
startFrom - the index to start searching from
- Returns:
- the next attribute in the specified byte array, or null if one is not available
-
extractEncodingFromContentType
public static java.nio.charset.Charset extractEncodingFromContentType(java.lang.String s)
Extracts an encoding from the specified Content-Type value using the IETF algorithm; if no encoding is found, this method returns null.
- Parameters:
s - the Content-Type value to search for an encoding
- Returns:
- the encoding found in the specified Content-Type value, or null if no encoding was found
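The idea of pulling a charset out of a Content-Type value can be sketched with plain JDK calls. The example below is a simplified stand-in, not the IETF algorithm the method refers to: it splits on `;`, looks for a `charset=` parameter, and strips surrounding double quotes. The class name `ContentTypeCharsetSketch` and the decision to return null for unsupported names are assumptions.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical illustration class; not part of HtmlUnit.
final class ContentTypeCharsetSketch {
    private ContentTypeCharsetSketch() { }

    /** Returns the charset named in a Content-Type value, or null if none is usable. */
    static Charset extractCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            // Simplification: whitespace around '=' is not handled.
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return null;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extractCharset("text/html; charset=UTF-8")); // prints "UTF-8"
        System.out.println(extractCharset("text/html"));                // prints "null"
    }
}
```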
-
sniffEncodingFromXmlDeclaration
@Deprecated static java.nio.charset.Charset sniffEncodingFromXmlDeclaration(byte[] bytes) throws java.io.IOException
Deprecated as of version 4.0.0; use sniffEncodingFromXmlDeclaration(InputStream) instead.
Searches the specified XML content for an XML declaration and returns the encoding if found, otherwise returns null.
- Parameters:
bytes - the XML content to sniff
- Returns:
- the encoding of the specified XML content, or null if it could not be determined
- Throws:
java.io.IOException
-
sniffEncodingFromXmlDeclaration
public static java.nio.charset.Charset sniffEncodingFromXmlDeclaration(java.io.InputStream is) throws java.io.IOException
Searches the specified XML content for an XML declaration and returns the encoding if found, otherwise returns null.
- Parameters:
is - the content stream to check for the charset declaration
- Returns:
- the encoding of the specified XML content, or null if it could not be determined
- Throws:
java.io.IOException - if an IO error occurs
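A declaration such as `<?xml version="1.0" encoding="ISO-8859-1"?>` sits at the very start of the document, so only a small prefix needs to be examined. The following sketch shows one way to do that with a regular expression. It is an illustration under assumptions, not HtmlUnit's code: `XmlDeclarationSketch`, the 512-byte window, and the pattern are hypothetical, and `readNBytes(int)` needs Java 11 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of HtmlUnit.
final class XmlDeclarationSketch {
    // Assumed window; the value of SIZE_OF_XML_CONTENT_SNIFFED is not stated here.
    private static final int SNIFF_SIZE = 512;

    // The declaration must be the first thing in the content, hence the ^ anchor.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    /** Sniffs the declared encoding from already-read bytes, or returns null. */
    static Charset sniff(byte[] head) {
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name.
            return null;
        }
    }

    /** Reads at most SNIFF_SIZE bytes from the stream and sniffs them. */
    static Charset sniff(InputStream is) throws IOException {
        return sniff(is.readNBytes(SNIFF_SIZE));
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(new ByteArrayInputStream(xml))); // prints "ISO-8859-1"
    }
}
```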
-
sniffEncodingFromCssDeclaration
@Deprecated static java.nio.charset.Charset sniffEncodingFromCssDeclaration(byte[] bytes) throws java.io.IOException
Deprecated as of version 4.0.0; depending on the content use sniffEncodingFromMetaTag(InputStream), sniffEncodingFromXmlDeclaration(InputStream), or sniffEncodingFromCssDeclaration(InputStream) instead.
Parses and returns the charset declaration at the start of a CSS file if any, otherwise returns null.
- Parameters:
bytes - the bytes to parse
- Returns:
- the charset declaration at the start of a CSS file if any, otherwise null.
e.g. @charset "UTF-8"
- Throws:
java.io.IOException
-
sniffEncodingFromCssDeclaration
public static java.nio.charset.Charset sniffEncodingFromCssDeclaration(java.io.InputStream is) throws java.io.IOException
Parses and returns the charset declaration at the start of a CSS file if any, otherwise returns null.
e.g. @charset "UTF-8"
- Parameters:
is - the input stream to parse
- Returns:
- the charset declaration at the start of a CSS file if any, otherwise null.
- Throws:
java.io.IOException - if an IO error occurs
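The `@charset "UTF-8"` form mentioned above can be recognised with a simple prefix check on the raw bytes. The sketch below is one possible approach, not the library's implementation: `CssCharsetSketch` is a hypothetical class, it works on bytes that have already been read, and it does not verify the terminating semicolon.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of HtmlUnit.
final class CssCharsetSketch {
    private static final byte[] PREFIX = "@charset \"".getBytes(StandardCharsets.US_ASCII);

    private CssCharsetSketch() { }

    /** Returns the charset declared at the very start of the CSS bytes, or null. */
    static Charset sniff(byte[] bytes) {
        if (bytes == null || bytes.length < PREFIX.length) {
            return null;
        }
        for (int i = 0; i < PREFIX.length; i++) {
            if (bytes[i] != PREFIX[i]) {
                return null;
            }
        }
        // Find the closing quote of the charset name.
        int end = -1;
        for (int i = PREFIX.length; i < bytes.length; i++) {
            if (bytes[i] == '"') {
                end = i;
                break;
            }
        }
        if (end < 0) {
            return null;
        }
        String name = new String(bytes, PREFIX.length, end - PREFIX.length,
                StandardCharsets.US_ASCII);
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name.
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] css = "@charset \"UTF-8\"; body { margin: 0 }".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(css)); // prints "UTF-8"
    }
}
```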
-
toCharset
public static java.nio.charset.Charset toCharset(java.lang.String charsetName)
Returns the Charset if the specified charset name is supported on this platform.
- Parameters:
charsetName - the charset name to check
- Returns:
- the Charset if the specified charset name is supported on this platform
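A lookup that tolerates unknown names can be written directly on top of `java.nio.charset.Charset`. The sketch below assumes that "not supported" is signalled by returning null, which this page does not state for `toCharset`; the class name `ToCharsetSketch` is hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical illustration class; not part of HtmlUnit.
final class ToCharsetSketch {
    private ToCharsetSketch() { }

    /** Returns the named charset, or null (an assumption) if the name is illegal or unsupported. */
    static Charset toCharset(String charsetName) {
        try {
            return Charset.forName(charsetName);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException, UnsupportedCharsetException and a null name.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toCharset("UTF-8"));                    // prints "UTF-8"
        System.out.println(toCharset("definitely-not-a-charset")); // prints "null"
    }
}
```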
-
matches
static boolean matches(byte[] bytes, int i, byte[][] sought)
Returns true if the byte in the specified byte array at the specified index matches one of the specified byte array patterns.
- Parameters:
bytes - the byte array to search in
i - the index at which to search
sought - the byte array patterns to search for
- Returns:
- true if the byte in the specified byte array at the specified index matches one of the specified byte array patterns
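The description leaves the shape of `sought` open. One plausible reading, assumed here and not confirmed by this page, is that each entry of `sought` lists the acceptable byte values for one consecutive position, which allows case-insensitive matching of a token such as `<meta`. The class `MatchesSketch` below illustrates that reading only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of HtmlUnit.
final class MatchesSketch {
    private MatchesSketch() { }

    /**
     * Returns true if, starting at index i, each consecutive byte of bytes equals
     * one of the alternatives listed in the corresponding entry of sought.
     */
    static boolean matches(byte[] bytes, int i, byte[][] sought) {
        if (i < 0 || i + sought.length > bytes.length) {
            return false;
        }
        for (int pos = 0; pos < sought.length; pos++) {
            boolean found = false;
            for (byte alternative : sought[pos]) {
                if (bytes[i + pos] == alternative) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    /** Builds a pattern where each string holds the alternatives for one position. */
    static byte[][] pattern(String... alternativesPerPosition) {
        byte[][] result = new byte[alternativesPerPosition.length][];
        for (int k = 0; k < result.length; k++) {
            result[k] = alternativesPerPosition[k].getBytes(StandardCharsets.US_ASCII);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[][] metaStart = pattern("<", "mM", "eE", "tT", "aA");
        byte[] html = "x<MeTa charset=utf-8>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(matches(html, 1, metaStart)); // prints "true"
        System.out.println(matches(html, 0, metaStart)); // prints "false"
    }
}
```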
-
skipToAnyOf
static int skipToAnyOf(byte[] bytes, int startFrom, byte[] targets)
Skips ahead to the first occurrence of any of the specified targets within the specified array, starting at the specified index. This method returns -1 if none of the targets are found.
- Parameters:
bytes - the array to search through
startFrom - the index to start looking at
targets - the targets to search for
- Returns:
- the index of the first occurrence of the specified targets within the specified array
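The contract described here (first index at or after `startFrom` whose byte equals any target, or -1) can be shown in a few lines. This is an illustrative sketch in a hypothetical class `SkipToAnyOfSketch`, not the library source; clamping a negative start index to 0 is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of HtmlUnit.
final class SkipToAnyOfSketch {
    private SkipToAnyOfSketch() { }

    /** Returns the first index >= startFrom holding any of the target bytes, or -1. */
    static int skipToAnyOf(byte[] bytes, int startFrom, byte[] targets) {
        for (int i = Math.max(startFrom, 0); i < bytes.length; i++) {
            for (byte target : targets) {
                if (bytes[i] == target) {
                    return i;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = "abc def>".getBytes(StandardCharsets.US_ASCII);
        byte[] targets = " >".getBytes(StandardCharsets.US_ASCII);
        System.out.println(skipToAnyOf(data, 0, targets)); // prints "3"
        System.out.println(skipToAnyOf(data, 4, targets)); // prints "7"
    }
}
```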
-
indexOfSubArray
static int indexOfSubArray(byte[] array, byte[] subarray, int startIndex)
Finds the first index of the specified sub-array inside the specified array, starting at the specified index. This method returns -1 if the specified sub-array cannot be found.
- Parameters:
array - the array to traverse for looking for the sub-array
subarray - the sub-array to find
startIndex - the start index to traverse forwards from
- Returns:
- the index of the sub-array within the array
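A straightforward way to meet this contract is a naive forward scan, sketched below in a hypothetical class `IndexOfSubArraySketch`. It illustrates the described behaviour and is not taken from the library; clamping a negative start index to 0 is an assumption.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of HtmlUnit.
final class IndexOfSubArraySketch {
    private IndexOfSubArraySketch() { }

    /** Returns the first index >= startIndex where subarray occurs in array, or -1. */
    static int indexOfSubArray(byte[] array, byte[] subarray, int startIndex) {
        for (int i = Math.max(startIndex, 0); i <= array.length - subarray.length; i++) {
            boolean found = true;
            for (int j = 0; j < subarray.length; j++) {
                if (array[i + j] != subarray[j]) {
                    found = false;
                    break;
                }
            }
            if (found) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] html = "<!-- x -->".getBytes(StandardCharsets.US_ASCII);
        byte[] commentEnd = "-->".getBytes(StandardCharsets.US_ASCII);
        System.out.println(indexOfSubArray(html, commentEnd, 0)); // prints "7"
        System.out.println(indexOfSubArray(html, commentEnd, 8)); // prints "-1"
    }
}
```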
-
read
static byte[] read(java.io.InputStream content, int size) throws java.io.IOException
Attempts to read size bytes from the specified input stream. Note that this method is not guaranteed to be able to read size bytes; however, the returned byte array will always be the exact length of the number of bytes read.
- Parameters:
content - the input stream to read from
size - the number of bytes to try to read
- Returns:
- the bytes read from the specified input stream
- Throws:
java.io.IOException - if an IO error occurs
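The "read up to size bytes, return exactly what was read" behaviour can be reproduced with a plain read loop. The sketch below uses a hypothetical class `BoundedReadSketch` and only JDK calls; it illustrates the contract and is not the library's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical illustration class; not part of HtmlUnit.
final class BoundedReadSketch {
    private BoundedReadSketch() { }

    /** Reads at most size bytes; the result is trimmed to the number of bytes actually read. */
    static byte[] read(InputStream content, int size) throws IOException {
        byte[] buffer = new byte[size];
        int total = 0;
        while (total < size) {
            int n = content.read(buffer, total, size - total);
            if (n < 0) {
                break; // end of stream reached before size bytes were available
            }
            total += n;
        }
        return Arrays.copyOf(buffer, total);
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3});
        System.out.println(read(in, 10).length); // prints "3"
    }
}
```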
-
readAndPrepend
static byte[] readAndPrepend(java.io.InputStream content, int size, byte[] prefix) throws java.io.IOException
Attempts to read size bytes from the specified input stream and then prepends the specified prefix to the bytes read, returning the resultant byte array. Note that this method is not guaranteed to be able to read size bytes; however, the returned byte array will always be the exact length of the number of bytes read plus the length of the prefix array.
- Parameters:
content - the input stream to read from
size - the number of bytes to try to read
prefix - the byte array to prepend to the bytes read from the specified input stream
- Returns:
- the bytes read from the specified input stream, prefixed by the specified prefix
- Throws:
java.io.IOException - if an IO error occurs
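Prepending is useful when a few bytes have already been consumed from a stream and need to be placed back in front of the next chunk. The sketch below shows the length guarantee described above using a hypothetical class `ReadAndPrependSketch`; it relies on `InputStream.readNBytes(int)` from Java 11 and is not the library's code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical illustration class; not part of HtmlUnit.
final class ReadAndPrependSketch {
    private ReadAndPrependSketch() { }

    /** Reads at most size bytes and returns prefix followed by the bytes read. */
    static byte[] readAndPrepend(InputStream content, int size, byte[] prefix) throws IOException {
        byte[] read = content.readNBytes(size); // may be shorter than size at end of stream
        byte[] joined = Arrays.copyOf(prefix, prefix.length + read.length);
        System.arraycopy(read, 0, joined, prefix.length, read.length);
        return joined;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {3, 4, 5});
        byte[] result = readAndPrepend(in, 2, new byte[] {1, 2});
        System.out.println(Arrays.toString(result)); // prints "[1, 2, 3, 4]"
    }
}
```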
-
translateEncodingLabel
@Deprecated public static java.lang.String translateEncodingLabel(java.nio.charset.Charset encodingLabel)
Deprecated as of version 4.0.0; method will be removed without replacement.
Translates the given encoding label into a normalized form according to Reference.
- Parameters:
encodingLabel - the label to translate
- Returns:
- the normalized encoding name or null if not found
-
translateEncodingLabel
public static java.lang.String translateEncodingLabel(java.lang.String encodingLabel)
Translates the given encoding label into a normalized form according to Reference.
- Parameters:
encodingLabel - the label to translate
- Returns:
- the normalized encoding name or null if not found
-
-