Package org.htmlcleaner
Class Utils
java.lang.Object
org.htmlcleaner.Utils
Common utilities.
Created by: Vladimir Nikic. Date: November, 2006.
-
Field Summary
Fields -
Constructor Summary
Constructors -
Method Summary
Modifier and Type | Method | Description
(package private) static String | bchomp | Removes the first newline and last newline (if present) of a string
(package private) static String | chomp | Removes the last newline (if present) of a string
private static Pattern | compileUnicodePattern(String pattern) |
private static int | convert_To_Entity_Name(String s, boolean domCreation, boolean recognizeUnicodeChars, boolean translateSpecialEntitiesToNCR, StringBuilder result, int i) |
private static int | convertToUnicode(String s, boolean domCreation, boolean recognizeUnicodeChars, boolean translateSpecialEntitiesToNCR, StringBuilder result, int i) |
static String | deserializeEntities(String str, boolean recognizeUnicodeChars) |
static String | escapeHtml(String s, CleanerProperties props) | Escapes HTML string
static String | escapeXml(String s, boolean advanced, boolean recognizeUnicodeChars, boolean translateSpecialEntities, boolean isDomCreation, boolean transResCharsToNCR, boolean translateSpecialEntitiesToNCR) | change notes: 1) convert ascii characters encoded using &#xx; format to the ascii characters -- may be an attempt to slip in malicious html 2) convert &#xxx; format characters to &quot; style representation if available for the character.
static String | escapeXml(String s, boolean advanced, boolean recognizeUnicodeChars, boolean translateSpecialEntities, boolean isDomCreation, boolean transResCharsToNCR, boolean translateSpecialEntitiesToNCR, boolean isHtmlOutput) | change notes: 1) convert ascii characters encoded using &#xx; format to the ascii characters -- may be an attempt to slip in malicious html 2) convert &#xxx; format characters to &quot; style representation if available for the character.
static String | escapeXml(String s, CleanerProperties props, boolean isDomCreation) | Escapes XML string.
private static int | extractCharCode(String s, int charIndex, boolean relaxedUnicode, StringBuilder unicode) | (earlier code was failing on this) - &#138A; is converted by FF to 3 characters: &#138; + 'A' + ';' &#0x138A; is converted by FF to 6? 7? characters: &#0 'x'+'1'+'3'+ '8' + 'A' + ';' #0 is displayed kind of weird &#x138A; is a single character
static String | fullUrl | Calculates full URL for specified page URL and link which could be full, absolute or relative like there can be found in A or IMG tags.
private static String | getAmpNcr |
static String | getXmlName(String name) |
static String | getXmlNSPrefix(String name) |
static boolean | isEmptyString |
static boolean | isFullUrl | Checks if specified link is full URL.
static boolean | isValidHtmlAttributeName |
(package private) static boolean | isValidInt(String s, int radix) |
static boolean | isValidXmlIdentifier | Checks whether specified string can be valid tag name or attribute name in xml.
static boolean | isValidXmlIdentifierStartChar(String identifier) | Determines whether the initial character of an identifier is valid for XML
static boolean | isWhitespaceString(Object object) | Checks whether specified object's string representation is empty string (containing of only whitespaces).
static boolean | isXmlReservedCharacter |
(package private) static String | lchomp | Removes the first newline (if present) of a string
static String | ltrim | Trims specified string from left.
(package private) static CharSequence | readUrl | Deprecated.
static String | replaceInvalidXmlIdentifierCharacters(String name, String replacement) | Strips out invalid characters from names used for XML Elements and replaces them with the specified character.
static String | rtrim | Trims specified string from right.
static String | sanitizeHtmlAttributeName |
static String | sanitizeXmlIdentifier(String attName) |
static String | sanitizeXmlIdentifier(String attName, String prefix) |
static String | sanitizeXmlIdentifier(String attName, String prefix, String replacementCharacter) | Attempts to replace invalid attribute names with valid ones.
static String[] | tokenize |
-
Field Details
-
VALID_XML_IDENTIFIER_START_CHAR_REGEX
-
VALID_XML_IDENTIFIER_START_CHAR_PATTERN
-
VALID_XML_IDENTIFIER_CHAR_REGEX
-
VALID_XML_IDENTIFIER_CHAR_PATTERN
-
ampNcr
-
ASCII_CHAR
-
HEX_STRICT
-
HEX_RELAXED
-
DECIMAL
-
-
Constructor Details
-
Utils
public Utils()
-
-
Method Details
-
bchomp
Removes the first newline and last newline (if present) of a string.
Parameters:
str
Returns:
-
chomp
Removes the last newline (if present) of a string.
Parameters:
str
Returns:
-
lchomp
Removes the first newline (if present) of a string.
Parameters:
str
Returns:
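The three chomp variants can be illustrated with a minimal sketch. This is not the library's implementation: the class name `ChompSketch` is hypothetical, and it assumes that "newline" means either `\n` or `\r\n`, which the documentation above does not state.

```java
final class ChompSketch {
    /** Removes one trailing newline ("\r\n" or "\n"), if present (assumed semantics). */
    static String chomp(String str) {
        if (str.endsWith("\r\n")) return str.substring(0, str.length() - 2);
        if (str.endsWith("\n")) return str.substring(0, str.length() - 1);
        return str;
    }

    /** Removes one leading newline ("\r\n" or "\n"), if present (assumed semantics). */
    static String lchomp(String str) {
        if (str.startsWith("\r\n")) return str.substring(2);
        if (str.startsWith("\n")) return str.substring(1);
        return str;
    }

    /** Removes both the leading and the trailing newline, if present. */
    static String bchomp(String str) {
        return chomp(lchomp(str));
    }

    public static void main(String[] args) {
        // Inner newlines are left untouched; only the outer ones are removed.
        System.out.println("[" + bchomp("\nline one\nline two\n") + "]");
    }
}
```

Only a single newline is removed at each end, so a string ending in two newlines keeps one of them.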
-
readUrl
Deprecated.
Reads content from the specified URL with the specified charset into a string.
Parameters:
url
charset
Throws:
IOException
-
isFullUrl
Checks if the specified link is a full URL.
Parameters:
link
Returns:
True if the link is a full URL, false otherwise.
-
fullUrl
Calculates the full URL for the specified page URL and a link, which may be full, absolute, or relative, as found in A or IMG tags. (Reinstated as per user request in bug 159.)
-
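To illustrate what resolving a link against a page URL means for fullUrl and isFullUrl, here is a minimal sketch that uses `java.net.URI` from the standard library as a stand-in. It does not reproduce the logic of this class: the class name `FullUrlSketch` and the parameter names `pageUrl` and `link` are assumptions, and "full URL" is approximated here as "has a scheme".

```java
import java.net.URI;

final class FullUrlSketch {
    /** Approximates "full URL" as a URI that carries a scheme (assumption). */
    static boolean isFullUrl(String link) {
        // URI.create throws IllegalArgumentException for syntactically invalid input.
        return URI.create(link).isAbsolute();
    }

    /** Resolves a full, absolute-path, or relative link against the page URL. */
    static String fullUrl(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/docs/index.html";
        System.out.println(fullUrl(page, "img/logo.png"));           // relative link
        System.out.println(fullUrl(page, "/img/logo.png"));          // absolute path
        System.out.println(fullUrl(page, "http://other.org/a.png")); // already full
    }
}
```

A full link is returned unchanged, an absolute path replaces the page's path, and a relative link is resolved against the page's directory.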
escapeHtml
Escapes HTML string.
Parameters:
s - String to be escaped
props - Cleaner properties that affect escaping behaviour
Returns:
the escaped string
-
escapeXml
Escapes XML string.
Parameters:
s - String to be escaped
props - Cleaner properties that affect escaping behaviour
isDomCreation - Tells if escaped content will be part of the DOM
Returns:
the escaped string
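As a rough picture of what XML escaping involves, the sketch below replaces the five characters reserved in XML with their predefined entities. It is deliberately simpler than escapeXml: it ignores CleanerProperties, does not recognise entities already present in the input (so `&amp;` would be escaped again), and does no Unicode or NCR translation. The class name `XmlEscapeSketch` is hypothetical.

```java
final class XmlEscapeSketch {
    /** Escapes the five XML reserved characters; everything else is copied as-is. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & \"c\""));
    }
}
```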
-
escapeXml
public static String escapeXml(String s, boolean advanced, boolean recognizeUnicodeChars, boolean translateSpecialEntities, boolean isDomCreation, boolean transResCharsToNCR, boolean translateSpecialEntitiesToNCR)
Change notes:
1) Convert ASCII characters encoded using the &#xx; format to the ASCII characters; such encoding may be an attempt to slip in malicious HTML.
2) Convert &#xxx; format characters to the &quot; style representation if available for the character.
3) Convert HTML special entities to XML &#xxx; when outputting in XML.
Parameters:
s - the string to escape
advanced - whether to use Advanced XML Escaping
recognizeUnicodeChars - whether to recognise and replace Unicode characters
translateSpecialEntities - whether to translate special entities
isDomCreation - whether the escaping is in the context of DomCreation, an internal operation, with special rules.
Returns:
the escaped string
TODO Consider moving to CleanerProperties since a long list of params is misleading.
-
escapeXml
public static String escapeXml(String s, boolean advanced, boolean recognizeUnicodeChars, boolean translateSpecialEntities, boolean isDomCreation, boolean transResCharsToNCR, boolean translateSpecialEntitiesToNCR, boolean isHtmlOutput)
Change notes:
1) Convert ASCII characters encoded using the &#xx; format to the ASCII characters; such encoding may be an attempt to slip in malicious HTML.
2) Convert &#xxx; format characters to the &quot; style representation if available for the character.
3) Convert HTML special entities to XML &#xxx; when outputting in XML.
Parameters:
s - the string to escape
advanced - whether to use Advanced XML Escaping
recognizeUnicodeChars - whether to recognise and replace Unicode characters
translateSpecialEntities - whether to translate special entities
isDomCreation - whether the escaping is in the context of DomCreation, an internal operation, with special rules.
isHtmlOutput - whether the output is intended to be treated as HTML
Returns:
TODO Consider moving to CleanerProperties since a long list of params is misleading.
-
getAmpNcr
-
convert_To_Entity_Name
private static int convert_To_Entity_Name(String s, boolean domCreation, boolean recognizeUnicodeChars, boolean translateSpecialEntitiesToNCR, StringBuilder result, int i)
Parameters:
s
domCreation
recognizeUnicodeChars
translateSpecialEntitiesToNCR
result
i
Returns:
-
convertToUnicode
private static int convertToUnicode(String s, boolean domCreation, boolean recognizeUnicodeChars, boolean translateSpecialEntitiesToNCR, StringBuilder result, int i)
Parameters:
s
domCreation
recognizeUnicodeChars
translateSpecialEntitiesToNCR
result
i
Returns:
-
extractCharCode
private static int extractCharCode(String s, int charIndex, boolean relaxedUnicode, StringBuilder unicode)
- (earlier code was failing on this) &#138A; is converted by FF to 3 characters: &#138; + 'A' + ';'
- &#0x138A; is converted by FF to 6? 7? characters: &#0 + 'x' + '1' + '3' + '8' + 'A' + ';' (#0 is displayed kind of weird)
- &#x138A; is a single character
Parameters:
s
charIndex
relaxedUnicode - '&#0x138;' is treated like '&#x138;'
unicode
Returns:
the index at which to continue scanning the source string, -1 so that normal loop incrementing skips the ';'
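The three cases listed for extractCharCode can be reproduced with a small sketch of numeric character reference parsing. The patterns below are guesses made for illustration and are not the actual values of the HEX_STRICT, HEX_RELAXED, and DECIMAL fields; the class name `CharRefSketch`, the `parse` method, and its `int[]` return shape are likewise hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharRefSketch {
    // Hypothetical patterns; digit counts are capped so Integer.parseInt cannot overflow.
    private static final Pattern HEX_STRICT = Pattern.compile("&#[xX]([0-9a-fA-F]{1,6})");
    private static final Pattern HEX_RELAXED = Pattern.compile("&#0*[xX]([0-9a-fA-F]{1,6})");
    private static final Pattern DECIMAL = Pattern.compile("&#([0-9]{1,7})");

    /**
     * Parses a numeric character reference at the start of s.
     * Returns {codePoint, charactersConsumed}, or null if s does not start with one.
     * A trailing ';' is not counted as consumed.
     */
    static int[] parse(String s, boolean relaxed) {
        Matcher hex = (relaxed ? HEX_RELAXED : HEX_STRICT).matcher(s);
        if (hex.lookingAt()) {
            return new int[] { Integer.parseInt(hex.group(1), 16), hex.end() };
        }
        Matcher dec = DECIMAL.matcher(s);
        if (dec.lookingAt()) {
            return new int[] { Integer.parseInt(dec.group(1), 10), dec.end() };
        }
        return null;
    }

    public static void main(String[] args) {
        int[] r = parse("&#138A;", false);
        // Decimal digits stop at 'A', so 'A' and ';' remain as ordinary text.
        System.out.println(r[0] + " consumed " + r[1]);
    }
}
```

In strict mode `&#0x138;` parses as the decimal reference `&#0` followed by literal text, while in relaxed mode the leading zero is skipped and the hex digits are read.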
-
sanitizeXmlIdentifier
-
sanitizeXmlIdentifier
-
sanitizeHtmlAttributeName
-
isValidHtmlAttributeName
-
sanitizeXmlIdentifier
public static String sanitizeXmlIdentifier(String attName, String prefix, String replacementCharacter)
Attempts to replace invalid attribute names with valid ones.
Parameters:
attName - the attribute name to fix
prefix - the prefix to use to indicate an attribute name has been altered
Returns:
either the original attribute name if valid, or a generated identifier if not
-
isValidXmlIdentifier
Checks whether the specified string can be a valid tag name or attribute name in XML.
Parameters:
s - String to be checked
Returns:
True if the string is a valid XML identifier, false otherwise
-
isEmptyString
Parameters:
o
Returns:
True if the specified string is null or contains only whitespace characters
-
tokenize
-
isXmlReservedCharacter
-
getXmlNSPrefix
Parameters:
name
Returns:
For an XML element name or attribute name, the prefix (the part before ':'), or null if there is no prefix
-
getXmlName
Parameters:
name
Returns:
For an XML element name or attribute name, the name after the prefix (the part after ':')
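The split performed by getXmlNSPrefix and getXmlName can be sketched as follows. The class and method names are hypothetical, and the behaviour of returning the whole name when there is no ':' is an assumption, since the documentation above only describes the prefixed case.

```java
final class QualifiedNameSketch {
    /** Returns the part before ':' or null if the name has no prefix. */
    static String prefix(String name) {
        int colon = name.indexOf(':');
        return colon >= 0 ? name.substring(0, colon) : null;
    }

    /** Returns the part after ':'; assumed to be the whole name when no ':' is present. */
    static String localName(String name) {
        int colon = name.indexOf(':');
        return colon >= 0 ? name.substring(colon + 1) : name;
    }

    public static void main(String[] args) {
        System.out.println(prefix("xlink:href") + " / " + localName("xlink:href"));
    }
}
```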
-
isValidInt
-
ltrim
Trims the specified string from the left.
Parameters:
s
-
rtrim
Trims the specified string from the right.
Parameters:
s
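One-sided trimming as described for ltrim and rtrim can be sketched like this. The class name `TrimSketch` is hypothetical, and using `Character.isWhitespace` to decide what counts as whitespace is an assumption; the library may use a different definition.

```java
final class TrimSketch {
    /** Drops leading whitespace only. */
    static String ltrim(String s) {
        int start = 0;
        while (start < s.length() && Character.isWhitespace(s.charAt(start))) {
            start++;
        }
        return s.substring(start);
    }

    /** Drops trailing whitespace only. */
    static String rtrim(String s) {
        int end = s.length();
        while (end > 0 && Character.isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + ltrim("  text  ") + "]");
        System.out.println("[" + rtrim("  text  ") + "]");
    }
}
```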
-
isWhitespaceString
Checks whether the specified object's string representation is an empty string (consisting only of whitespace).
Parameters:
object - Object whose string representation is checked
Returns:
true if empty string, false otherwise
-
deserializeEntities
-
isValidXmlIdentifierStartChar
Determines whether the initial character of an identifier is valid for XML.
Parameters:
identifier - the identifier to check
Returns:
true if the initial character is valid
-
replaceInvalidXmlIdentifierCharacters
Strips out invalid characters from names used for XML Elements and replaces them with the specified character. For example, "<p%>" becomes ""
Parameters:
name
Returns:
valid XML name
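The identifier checks above can be approximated with an ASCII-only sketch. The character classes below are a simplified subset of the XML 1.0 name rules and are not the values of the VALID_XML_IDENTIFIER_START_CHAR_REGEX or VALID_XML_IDENTIFIER_CHAR_REGEX fields, which are not shown on this page; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlIdentifierSketch {
    // ASCII-only simplification (assumption): real XML names also allow many Unicode ranges.
    private static final Pattern START_CHAR = Pattern.compile("[A-Za-z_:]");
    private static final Pattern INVALID_CHAR = Pattern.compile("[^A-Za-z0-9_:.\\-]");

    /** True if the first character may start an XML name (ASCII subset only). */
    static boolean isValidStartChar(String identifier) {
        return identifier != null
                && !identifier.isEmpty()
                && START_CHAR.matcher(identifier.substring(0, 1)).matches();
    }

    /** Replaces every character outside the allowed set with the given replacement. */
    static String replaceInvalidChars(String name, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement from being interpreted.
        return INVALID_CHAR.matcher(name).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(isValidStartChar("1title"));
        System.out.println(replaceInvalidChars("p%", "_"));
    }
}
```

Note that this sketch does not check the start character during replacement, so a name such as `1title` passes through `replaceInvalidChars` unchanged.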
-
compileUnicodePattern
-