Package org.htmlcleaner
Class Utils
- java.lang.Object
-
- org.htmlcleaner.Utils
-
public class Utils extends java.lang.Object
Common utilities.
Created by: Vladimir Nikic
Date: November, 2006.
-
-
Field Summary
private static java.lang.String ampNcr
private static java.util.regex.Pattern ASCII_CHAR
static java.util.regex.Pattern DECIMAL
static java.util.regex.Pattern HEX_RELAXED
static java.util.regex.Pattern HEX_STRICT
(package private) static java.util.regex.Pattern VALID_XML_IDENTIFIER_CHAR_PATTERN
(package private) static java.lang.String VALID_XML_IDENTIFIER_CHAR_REGEX
(package private) static java.util.regex.Pattern VALID_XML_IDENTIFIER_START_CHAR_PATTERN
(package private) static java.lang.String VALID_XML_IDENTIFIER_START_CHAR_REGEX
-
Constructor Summary
Utils()
-
Method Summary
(package private) static java.lang.String bchomp(java.lang.String str)
    Removes the first newline and last newline (if present) of a string.
(package private) static java.lang.String chomp(java.lang.String str)
    Removes the last newline (if present) of a string.
private static java.util.regex.Pattern compileUnicodePattern(java.lang.String pattern)
private static int convert_To_Entity_Name(java.lang.String s, boolean domCreation, boolean recognizeUnicodeChars, boolean translateSpecialEntitiesToNCR, java.lang.StringBuilder result, int i)
private static int convertToUnicode(java.lang.String s, boolean domCreation, boolean recognizeUnicodeChars, boolean translateSpecialEntitiesToNCR, java.lang.StringBuilder result, int i)
static java.lang.String deserializeEntities(java.lang.String str, boolean recognizeUnicodeChars)
static java.lang.String escapeHtml(java.lang.String s, CleanerProperties props)
    Escapes an HTML string.
static java.lang.String escapeXml(java.lang.String s, boolean advanced, boolean recognizeUnicodeChars, boolean translateSpecialEntities, boolean isDomCreation, boolean transResCharsToNCR, boolean translateSpecialEntitiesToNCR)
    Change notes: 1) convert ASCII characters encoded using the &#xx; format to the ASCII characters; such encoding may be an attempt to slip in malicious HTML. 2) convert &#xxx; format characters to a &quot;-style representation if one is available for the character.
static java.lang.String escapeXml(java.lang.String s, boolean advanced, boolean recognizeUnicodeChars, boolean translateSpecialEntities, boolean isDomCreation, boolean transResCharsToNCR, boolean translateSpecialEntitiesToNCR, boolean isHtmlOutput)
    Change notes: 1) convert ASCII characters encoded using the &#xx; format to the ASCII characters; such encoding may be an attempt to slip in malicious HTML. 2) convert &#xxx; format characters to a &quot;-style representation if one is available for the character.
static java.lang.String escapeXml(java.lang.String s, CleanerProperties props, boolean isDomCreation)
    Escapes an XML string.
private static int extractCharCode(java.lang.String s, int charIndex, boolean relaxedUnicode, java.lang.StringBuilder unicode)
    (earlier code was failing on this) &#138A; is converted by FF to 3 characters: &#138; + 'A' + ';'. &#0x138A; is converted by FF to 6? 7? characters: &#0 + 'x' + '1' + '3' + '8' + 'A' + ';' (#0 is displayed kind of weird). &#x138A; is a single character.
static java.lang.String fullUrl(java.lang.String pageUrl, java.lang.String link)
    Calculates the full URL for the specified page URL and a link, which may be full, absolute, or relative, as found in A or IMG tags.
private static java.lang.String getAmpNcr()
static java.lang.String getXmlName(java.lang.String name)
static java.lang.String getXmlNSPrefix(java.lang.String name)
static boolean isEmptyString(java.lang.Object o)
static boolean isFullUrl(java.lang.String link)
    Checks if the specified link is a full URL.
static boolean isValidHtmlAttributeName(java.lang.String name)
(package private) static boolean isValidInt(java.lang.String s, int radix)
static boolean isValidXmlIdentifier(java.lang.String s)
    Checks whether the specified string can be a valid tag name or attribute name in XML.
static boolean isValidXmlIdentifierStartChar(java.lang.String identifier)
    Determines whether the initial character of an identifier is valid for XML.
static boolean isWhitespaceString(java.lang.Object object)
    Checks whether the specified object's string representation is an empty string (consisting only of whitespace).
static boolean isXmlReservedCharacter(java.lang.String c)
(package private) static java.lang.String lchomp(java.lang.String str)
    Removes the first newline (if present) of a string.
static java.lang.String ltrim(java.lang.String s)
    Trims the specified string from the left.
(package private) static java.lang.CharSequence readUrl(java.net.URL url, java.lang.String charset)
    Deprecated.
static java.lang.String replaceInvalidXmlIdentifierCharacters(java.lang.String name, java.lang.String replacement)
    Strips out invalid characters from names used for XML Elements and replaces them with the specified character.
static java.lang.String rtrim(java.lang.String s)
    Trims the specified string from the right.
static java.lang.String sanitizeHtmlAttributeName(java.lang.String name)
static java.lang.String sanitizeXmlIdentifier(java.lang.String attName)
static java.lang.String sanitizeXmlIdentifier(java.lang.String attName, java.lang.String prefix)
static java.lang.String sanitizeXmlIdentifier(java.lang.String attName, java.lang.String prefix, java.lang.String replacementCharacter)
    Attempts to replace invalid attribute names with valid ones.
static java.lang.String[] tokenize(java.lang.String s, java.lang.String delimiters)
-
-
-
Field Detail
-
VALID_XML_IDENTIFIER_START_CHAR_REGEX
static final java.lang.String VALID_XML_IDENTIFIER_START_CHAR_REGEX
- See Also:
- Constant Field Values
-
VALID_XML_IDENTIFIER_START_CHAR_PATTERN
static final java.util.regex.Pattern VALID_XML_IDENTIFIER_START_CHAR_PATTERN
-
VALID_XML_IDENTIFIER_CHAR_REGEX
static final java.lang.String VALID_XML_IDENTIFIER_CHAR_REGEX
- See Also:
- Constant Field Values
-
VALID_XML_IDENTIFIER_CHAR_PATTERN
static final java.util.regex.Pattern VALID_XML_IDENTIFIER_CHAR_PATTERN
-
ampNcr
private static java.lang.String ampNcr
-
ASCII_CHAR
private static final java.util.regex.Pattern ASCII_CHAR
-
HEX_STRICT
public static java.util.regex.Pattern HEX_STRICT
-
HEX_RELAXED
public static java.util.regex.Pattern HEX_RELAXED
-
DECIMAL
public static java.util.regex.Pattern DECIMAL
-
-
Method Detail
-
bchomp
static java.lang.String bchomp(java.lang.String str)
Removes the first newline and last newline (if present) of a string.
- Parameters:
str
-
chomp
static java.lang.String chomp(java.lang.String str)
Removes the last newline (if present) of a string.
- Parameters:
str
-
lchomp
static java.lang.String lchomp(java.lang.String str)
Removes the first newline (if present) of a string.
- Parameters:
str
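The three chomp helpers above are package-private, so they cannot be called from outside org.htmlcleaner. The following is a minimal standalone sketch of the behaviour their descriptions state, not the library's implementation. The class name ChompSketch is hypothetical, and the sketch assumes that "\r\n", "\n" and "\r" each count as one newline; this page does not say which line terminators are recognised.

```java
/** Hypothetical stand-in for the package-private chomp helpers; not library code. */
final class ChompSketch {

    /** Removes one trailing newline, if present. */
    static String chomp(String str) {
        if (str.endsWith("\r\n")) {
            return str.substring(0, str.length() - 2);
        }
        if (str.endsWith("\n") || str.endsWith("\r")) {
            return str.substring(0, str.length() - 1);
        }
        return str;
    }

    /** Removes one leading newline, if present. */
    static String lchomp(String str) {
        if (str.startsWith("\r\n")) {
            return str.substring(2);
        }
        if (str.startsWith("\n") || str.startsWith("\r")) {
            return str.substring(1);
        }
        return str;
    }

    /** Removes one leading and one trailing newline, if present. */
    static String bchomp(String str) {
        return chomp(lchomp(str));
    }
}
```

Only a single newline is removed at each end, so a string ending in two newlines keeps one of them.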
-
readUrl
@Deprecated static java.lang.CharSequence readUrl(java.net.URL url, java.lang.String charset) throws java.io.IOException
Deprecated.
Reads content from the specified URL with the specified charset into a string.
- Parameters:
url
charset
- Throws:
java.io.IOException
-
isFullUrl
public static boolean isFullUrl(java.lang.String link)
Checks if the specified link is a full URL.
- Parameters:
link
- Returns:
- True if the link is a full URL, false otherwise.
-
fullUrl
public static java.lang.String fullUrl(java.lang.String pageUrl, java.lang.String link)
Calculates the full URL for the specified page URL and a link, which may be full, absolute, or relative, as found in A or IMG tags. (Reinstated as per user request in bug 159.)
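To illustrate the three kinds of link mentioned above, here is a minimal sketch that resolves a link against a page URL using java.net.URI from the standard library. It is a swapped-in technique, not the way Utils.fullUrl is implemented, and results may differ for malformed input. The class name UrlResolveSketch is hypothetical, and the sketch assumes that a "full" link is one that carries its own scheme, while an "absolute" link starts with "/" and is resolved against the host.

```java
import java.net.URI;

/** Hypothetical illustration built on java.net.URI; not library code. */
final class UrlResolveSketch {

    /** Assumption: a link is "full" when it carries its own scheme, e.g. "http:". */
    static boolean isFullUrl(String link) {
        try {
            return link != null && URI.create(link.trim()).isAbsolute();
        } catch (IllegalArgumentException e) {
            return false; // not parseable as a URI at all
        }
    }

    /** Resolves a full, host-absolute, or relative link against the page URL. */
    static String fullUrl(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link.trim()).toString();
    }
}
```

With the page URL "http://example.com/docs/page.html", the relative link "img/logo.png" resolves under "/docs/", the link "/img/logo.png" resolves against the host root, and a link that already has a scheme is returned unchanged.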
-
escapeHtml
public static java.lang.String escapeHtml(java.lang.String s, CleanerProperties props)
Escapes an HTML string.
- Parameters:
s - String to be escaped
props - Cleaner properties that affect escaping behaviour
- Returns:
- the escaped string
-
escapeXml
public static java.lang.String escapeXml(java.lang.String s, CleanerProperties props, boolean isDomCreation)
Escapes an XML string.
- Parameters:
s - String to be escaped
props - Cleaner properties that affect escaping behaviour
isDomCreation - Tells if escaped content will be part of the DOM
- Returns:
- the escaped string
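As a point of reference for what escaping means here, the sketch below replaces the five XML reserved characters with their predefined entities. It is deliberately much simpler than the escapeXml methods on this page: it ignores CleanerProperties and DOM creation, and it does not recognise entities already present in the input, so "&amp;" would be escaped a second time. The class name XmlEscapeSketch is hypothetical.

```java
/** Hypothetical illustration of basic XML escaping; not library code. */
final class XmlEscapeSketch {

    /** Replaces &, <, >, " and ' with the predefined XML entities. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```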
-
escapeXml
public static java.lang.String escapeXml(java.lang.String s, boolean advanced, boolean recognizeUnicodeChars, boolean translateSpecialEntities, boolean isDomCreation, boolean transResCharsToNCR, boolean translateSpecialEntitiesToNCR)
Change notes:
1) Convert ASCII characters encoded using the &#xx; format to the ASCII characters; such encoding may be an attempt to slip in malicious HTML.
2) Convert &#xxx; format characters to a &quot;-style representation if one is available for the character.
3) Convert HTML special entities to XML &#xxx; when outputting in XML.
- Parameters:
s - the string to escape
advanced - whether to use Advanced XML Escaping
recognizeUnicodeChars - whether to recognise and replace Unicode characters
translateSpecialEntities - whether to translate special entities
isDomCreation - whether the escaping is in the context of DomCreation, an internal operation, with special rules.
- Returns:
- the escaped string
TODO: Consider moving to CleanerProperties since a long list of params is misleading.
-
escapeXml
public static java.lang.String escapeXml(java.lang.String s, boolean advanced, boolean recognizeUnicodeChars, boolean translateSpecialEntities, boolean isDomCreation, boolean transResCharsToNCR, boolean translateSpecialEntitiesToNCR, boolean isHtmlOutput)
Change notes:
1) Convert ASCII characters encoded using the &#xx; format to the ASCII characters; such encoding may be an attempt to slip in malicious HTML.
2) Convert &#xxx; format characters to a &quot;-style representation if one is available for the character.
3) Convert HTML special entities to XML &#xxx; when outputting in XML.
- Parameters:
s - the string to escape
advanced - whether to use Advanced XML Escaping
recognizeUnicodeChars - whether to recognise and replace Unicode characters
translateSpecialEntities - whether to translate special entities
isDomCreation - whether the escaping is in the context of DomCreation, an internal operation, with special rules.
isHtmlOutput - whether the output is intended to be treated as HTML
- Returns:
- the escaped string
TODO: Consider moving to CleanerProperties since a long list of params is misleading.
-
getAmpNcr
private static java.lang.String getAmpNcr()
-
convert_To_Entity_Name
private static int convert_To_Entity_Name(java.lang.String s, boolean domCreation, boolean recognizeUnicodeChars, boolean translateSpecialEntitiesToNCR, java.lang.StringBuilder result, int i)
- Parameters:
s
domCreation
recognizeUnicodeChars
translateSpecialEntitiesToNCR
result
i
-
convertToUnicode
private static int convertToUnicode(java.lang.String s, boolean domCreation, boolean recognizeUnicodeChars, boolean translateSpecialEntitiesToNCR, java.lang.StringBuilder result, int i)
- Parameters:
s
domCreation
recognizeUnicodeChars
translateSpecialEntitiesToNCR
result
i
-
extractCharCode
private static int extractCharCode(java.lang.String s, int charIndex, boolean relaxedUnicode, java.lang.StringBuilder unicode)
- (earlier code was failing on this) &#138A; is converted by FF to 3 characters: &#138; + 'A' + ';'
- &#0x138A; is converted by FF to 6? 7? characters: &#0 + 'x' + '1' + '3' + '8' + 'A' + ';' (#0 is displayed kind of weird)
- &#x138A; is a single character
- Parameters:
s
charIndex
relaxedUnicode - '&#0x138;' is treated like '&#x138;'
unicode
- Returns:
- the index at which to continue scanning the source string, minus 1, so that normal loop incrementing skips the ';'
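The cases listed above concern numeric character references. As a rough illustration, the sketch below decodes a single, complete reference in hexadecimal ("&#x138A;") or decimal ("&#65;") form, with an optional relaxed mode that also accepts the "&#0x138;" spelling. It is not the library's extractCharCode: it works on a whole reference rather than scanning from an index, and the class name, method name and regular expressions are hypothetical and are not the HEX_STRICT, HEX_RELAXED or DECIMAL patterns declared on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical decoder for one numeric character reference; not library code. */
final class NumericRefSketch {

    // Assumed patterns; the library's own pattern constants may differ.
    private static final Pattern HEX = Pattern.compile("&#[xX]([0-9a-fA-F]+);");
    private static final Pattern HEX_LENIENT = Pattern.compile("&#0*[xX]([0-9a-fA-F]+);");
    private static final Pattern DEC = Pattern.compile("&#([0-9]+);");

    /** Returns the decoded character(s), or null if ref is not a valid reference. */
    static String decode(String ref, boolean relaxed) {
        Matcher hex = (relaxed ? HEX_LENIENT : HEX).matcher(ref);
        if (hex.matches()) {
            return fromCodePoint(hex.group(1), 16);
        }
        Matcher dec = DEC.matcher(ref);
        if (dec.matches()) {
            return fromCodePoint(dec.group(1), 10);
        }
        return null;
    }

    private static String fromCodePoint(String digits, int radix) {
        try {
            int cp = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(cp) ? new String(Character.toChars(cp)) : null;
        } catch (NumberFormatException e) {
            return null; // too many digits to fit in an int
        }
    }
}
```

In this sketch "&#138A;" is rejected outright, whereas the notes above describe a browser treating it as "&#138;" followed by literal text.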
-
sanitizeXmlIdentifier
public static java.lang.String sanitizeXmlIdentifier(java.lang.String attName)
-
sanitizeXmlIdentifier
public static java.lang.String sanitizeXmlIdentifier(java.lang.String attName, java.lang.String prefix)
-
sanitizeHtmlAttributeName
public static java.lang.String sanitizeHtmlAttributeName(java.lang.String name)
-
isValidHtmlAttributeName
public static boolean isValidHtmlAttributeName(java.lang.String name)
-
sanitizeXmlIdentifier
public static java.lang.String sanitizeXmlIdentifier(java.lang.String attName, java.lang.String prefix, java.lang.String replacementCharacter)
Attempts to replace invalid attribute names with valid ones.
- Parameters:
attName - the attribute name to fix
prefix - the prefix to use to indicate an attribute name has been altered
- Returns:
- either the original attribute name if valid, or a generated identifier if not
-
isValidXmlIdentifier
public static boolean isValidXmlIdentifier(java.lang.String s)
Checks whether the specified string can be a valid tag name or attribute name in XML.
- Parameters:
s - String to be checked
- Returns:
- True if the string is a valid XML identifier, false otherwise
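The identifier checks and sanitizers above can be pictured with a small sketch. It uses an ASCII-only simplification of the XML Name rules; the actual VALID_XML_IDENTIFIER_* regular expressions are not shown on this page and very likely accept more characters. The class name, the character classes, and the rule "replace invalid characters, then prepend the prefix if the result is still invalid" are all assumptions made for illustration, and the name argument is assumed to be non-null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical ASCII-only illustration of identifier checking; not library code. */
final class XmlIdentifierSketch {

    // Assumed simplification: letters, '_' or ':' first, then also digits, '.' and '-'.
    private static final Pattern NAME = Pattern.compile("[A-Za-z_:][A-Za-z0-9_:.\\-]*");
    private static final Pattern INVALID_CHAR = Pattern.compile("[^A-Za-z0-9_:.\\-]");

    static boolean isValid(String s) {
        return s != null && NAME.matcher(s).matches();
    }

    /** Returns name unchanged if valid; otherwise a cleaned, possibly prefixed, name. */
    static String sanitize(String name, String prefix, String replacement) {
        if (isValid(name)) {
            return name;
        }
        String cleaned = INVALID_CHAR.matcher(name)
                .replaceAll(Matcher.quoteReplacement(replacement));
        // A name such as "1abc" has no invalid characters but a bad first character.
        return isValid(cleaned) ? cleaned : prefix + cleaned;
    }
}
```

Under these assumptions, sanitize("data value", "hc-", "_") gives "data_value" and sanitize("1abc", "hc-", "_") gives "hc-1abc"; the prefix "hc-" is only an example value.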
-
isEmptyString
public static boolean isEmptyString(java.lang.Object o)
- Parameters:
o
- Returns:
- True if the specified string is null or contains only whitespace characters
-
tokenize
public static java.lang.String[] tokenize(java.lang.String s, java.lang.String delimiters)
-
isXmlReservedCharacter
public static boolean isXmlReservedCharacter(java.lang.String c)
-
getXmlNSPrefix
public static java.lang.String getXmlNSPrefix(java.lang.String name)
- Parameters:
name
- Returns:
- For an XML element name or attribute name, returns the prefix (the part before ':'), or null if there is no prefix
-
getXmlName
public static java.lang.String getXmlName(java.lang.String name)
- Parameters:
name
- Returns:
- For an XML element name or attribute name, returns the name after the prefix (the part after ':')
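The prefix and name split described for getXmlNSPrefix and getXmlName can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the class and method names are hypothetical, the split is assumed to happen at the first ':', and a name with no prefix (or with a leading or trailing colon) is assumed to be returned whole by the name lookup.

```java
/** Hypothetical illustration of splitting a qualified XML name; not library code. */
final class XmlNameSketch {

    /** Returns the part before the first ':', or null if there is no prefix. */
    static String prefixOf(String name) {
        int colon = name.indexOf(':');
        return colon > 0 ? name.substring(0, colon) : null;
    }

    /** Returns the part after the first ':', or the whole name if unprefixed. */
    static String localNameOf(String name) {
        int colon = name.indexOf(':');
        boolean hasBothParts = colon > 0 && colon < name.length() - 1;
        return hasBothParts ? name.substring(colon + 1) : name;
    }
}
```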
-
isValidInt
static boolean isValidInt(java.lang.String s, int radix)
-
ltrim
public static java.lang.String ltrim(java.lang.String s)
Trims the specified string from the left.
- Parameters:
s
-
rtrim
public static java.lang.String rtrim(java.lang.String s)
Trims the specified string from the right.
- Parameters:
s
-
isWhitespaceString
public static boolean isWhitespaceString(java.lang.Object object)
Checks whether the specified object's string representation is an empty string (consisting only of whitespace).
- Parameters:
object - Object whose string representation is checked
- Returns:
- true if empty string, false otherwise
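A minimal sketch of one-sided trimming and the whitespace check described above follows. The class name TrimSketch is hypothetical, whitespace is assumed to mean whatever Character.isWhitespace accepts, and a null object is assumed to yield false from the whitespace check; none of these details are stated on this page.

```java
/** Hypothetical illustration of ltrim, rtrim and isWhitespaceString; not library code. */
final class TrimSketch {

    /** Removes leading whitespace only. */
    static String ltrim(String s) {
        if (s == null) {
            return null;
        }
        int start = 0;
        while (start < s.length() && Character.isWhitespace(s.charAt(start))) {
            start++;
        }
        return s.substring(start);
    }

    /** Removes trailing whitespace only. */
    static String rtrim(String s) {
        if (s == null) {
            return null;
        }
        int end = s.length();
        while (end > 0 && Character.isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }

    /** True if the object's string form is empty or whitespace only. */
    static boolean isWhitespaceString(Object object) {
        if (object == null) {
            return false; // assumption: null is not a whitespace string
        }
        String text = object.toString();
        return text != null && ltrim(text).isEmpty();
    }
}
```

Because the check works on the string representation, a StringBuilder holding only spaces counts as a whitespace string as well.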
-
deserializeEntities
public static java.lang.String deserializeEntities(java.lang.String str, boolean recognizeUnicodeChars)
-
isValidXmlIdentifierStartChar
public static boolean isValidXmlIdentifierStartChar(java.lang.String identifier)
Determines whether the initial character of an identifier is valid for XML.
- Parameters:
identifier - the identifier to check
- Returns:
- true if the initial character is valid
-
replaceInvalidXmlIdentifierCharacters
public static java.lang.String replaceInvalidXmlIdentifierCharacters(java.lang.String name, java.lang.String replacement)
Strips out invalid characters from names used for XML Elements and replaces them with the specified character. For example, "" becomes "
"
- Parameters:
name
- Returns:
- valid XML name
-
compileUnicodePattern
private static java.util.regex.Pattern compileUnicodePattern(java.lang.String pattern)
-
-