Class Utils

java.lang.Object
org.htmlcleaner.Utils

public class Utils extends Object

Common utilities.

Created by: Vladimir Nikic
Date: November, 2006.
  • Field Details

    • VALID_XML_IDENTIFIER_START_CHAR_REGEX

      static final String VALID_XML_IDENTIFIER_START_CHAR_REGEX
    • VALID_XML_IDENTIFIER_START_CHAR_PATTERN

      static final Pattern VALID_XML_IDENTIFIER_START_CHAR_PATTERN
    • VALID_XML_IDENTIFIER_CHAR_REGEX

      static final String VALID_XML_IDENTIFIER_CHAR_REGEX
    • VALID_XML_IDENTIFIER_CHAR_PATTERN

      static final Pattern VALID_XML_IDENTIFIER_CHAR_PATTERN
    • ampNcr

      private static String ampNcr
    • ASCII_CHAR

      private static final Pattern ASCII_CHAR
    • HEX_STRICT

      public static Pattern HEX_STRICT
    • HEX_RELAXED

      public static Pattern HEX_RELAXED
    • DECIMAL

      public static Pattern DECIMAL
  • Constructor Details

    • Utils

      public Utils()
  • Method Details

    • bchomp

      static String bchomp(String str)
      Removes the first and last newline of a string, if present.
      Parameters:
      str - the string to process
      Returns:
      the string without its first and last newline
    • chomp

      static String chomp(String str)
      Removes the last newline of a string, if present.
      Parameters:
      str - the string to process
      Returns:
      the string without its last newline
    • lchomp

      static String lchomp(String str)
      Removes the first newline of a string, if present.
      Parameters:
      str - the string to process
      Returns:
      the string without its first newline
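      The three chomp variants differ only in which end of the string they touch. The following is a minimal sketch of that behaviour, not the HtmlCleaner source. It assumes that a newline means a single '\n' character (how "\r\n" is treated is not stated here), and the class name ChompSketch is hypothetical.

```java
// Hypothetical stand-ins for bchomp, chomp and lchomp; not the HtmlCleaner source.
class ChompSketch {

    // Removes one leading '\n', if present.
    static String lchomp(String str) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        return str.charAt(0) == '\n' ? str.substring(1) : str;
    }

    // Removes one trailing '\n', if present.
    static String chomp(String str) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        int last = str.length() - 1;
        return str.charAt(last) == '\n' ? str.substring(0, last) : str;
    }

    // Removes one leading and one trailing '\n', if present.
    static String bchomp(String str) {
        return chomp(lchomp(str));
    }

    public static void main(String[] args) {
        System.out.println("[" + bchomp("\nhello\n") + "]"); // prints "[hello]"
    }
}
```

      Only one newline is removed from each end, so inner newlines and repeated trailing newlines are left alone in this sketch.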
    • readUrl

      @Deprecated static CharSequence readUrl(URL url, String charset) throws IOException
      Deprecated.
      Reads the content of the specified URL into a string, using the specified charset.
      Parameters:
      url - the URL to read
      charset - the charset used to decode the content
      Throws:
      IOException
    • isFullUrl

      public static boolean isFullUrl(String link)
      Checks whether the specified link is a full URL.
      Parameters:
      link - the link to check
      Returns:
      true if the link is a full URL, false otherwise
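      What counts as a "full" URL is not spelled out here. The sketch below shows one plausible reading, under the assumption that a link is full when it starts with an http, https or file scheme. The scheme list and the class name FullUrlCheckSketch are assumptions, not taken from the library.

```java
import java.util.Locale;

// Hypothetical illustration; not the HtmlCleaner implementation.
class FullUrlCheckSketch {

    // Assumption: only these scheme prefixes mark a link as "full".
    private static final String[] FULL_PREFIXES = {"http://", "https://", "file://"};

    static boolean isFullUrl(String link) {
        if (link == null) {
            return false;
        }
        // Compare case-insensitively and ignore surrounding whitespace.
        String normalized = link.trim().toLowerCase(Locale.ROOT);
        for (String prefix : FULL_PREFIXES) {
            if (normalized.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isFullUrl("HTTP://example.com/a.html")); // prints "true"
        System.out.println(isFullUrl("/images/logo.png"));          // prints "false"
    }
}
```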
    • fullUrl

      public static String fullUrl(String pageUrl, String link)
      Calculates the full URL for the specified page URL and a link, which may be full, absolute or relative, such as those found in A or IMG tags. (Reinstated as per user request in bug 159.)
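      The three kinds of link (full, absolute and relative) can be illustrated with the standard java.net.URI class. This sketch uses URI.resolve rather than the library's own logic, so edge cases may differ from fullUrl; the class name ResolveLinkSketch and the example URLs are made up for illustration.

```java
import java.net.URI;

// Illustration with java.net.URI; not how HtmlCleaner computes the result.
class ResolveLinkSketch {

    // Resolves a link found on a page against that page's URL.
    // Throws IllegalArgumentException if either argument is not a valid URI.
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/docs/index.html";
        // Full link: returned unchanged.
        System.out.println(resolve(page, "http://other.org/a.png")); // prints "http://other.org/a.png"
        // Absolute link: replaces the whole path of the page URL.
        System.out.println(resolve(page, "/img/a.png"));             // prints "http://example.com/img/a.png"
        // Relative link: resolved against the directory of the page.
        System.out.println(resolve(page, "img/a.png"));              // prints "http://example.com/docs/img/a.png"
    }
}
```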
    • escapeHtml

      public static String escapeHtml(String s, CleanerProperties props)
      Escapes an HTML string.
      Parameters:
      s - String to be escaped
      props - Cleaner properties that affect escaping behaviour
      Returns:
      the escaped string
    • escapeXml

      public static String escapeXml(String s, CleanerProperties props, boolean isDomCreation)
      Escapes XML string.
      Parameters:
      s - String to be escaped
      props - Cleaner properties that affect escaping behaviour
      isDomCreation - Tells if escaped content will be part of the DOM
      Returns:
      the escaped string
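      As a point of reference, the simplest form of XML escaping replaces the five reserved characters with their predefined entities. The sketch below shows only that baseline. It does not model anything controlled by CleanerProperties or isDomCreation, and unlike a real cleaner it escapes every '&', including one that already starts an entity. The class name XmlEscapeSketch is hypothetical.

```java
// Baseline XML escaping only; not the HtmlCleaner implementation.
class XmlEscapeSketch {

    static String escape(String s) {
        if (s == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & \"c\"")); // prints "a &lt; b &amp; &quot;c&quot;"
    }
}
```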
    • escapeXml

      public static String escapeXml(String s, boolean advanced, boolean recognizeUnicodeChars, boolean translateSpecialEntities, boolean isDomCreation, boolean transResCharsToNCR, boolean translateSpecialEntitiesToNCR)
      Change notes:
      1) Convert ASCII characters encoded in &#xx; format to the ASCII characters; such encoding may be an attempt to slip in malicious HTML.
      2) Convert &#xxx; format characters to a &quot; style representation, if one is available for the character.
      3) Convert HTML special entities to XML &#xxx; format when outputting XML.
      Parameters:
      s - the string to escape
      advanced - whether to use Advanced XML Escaping
      recognizeUnicodeChars - whether to recognise and replace Unicode characters
      translateSpecialEntities - whether to translate special entities
      isDomCreation - whether the escaping is in the context of DomCreation, an internal operation, with special rules.
      Returns:
      the escaped string
      TODO: Consider moving to CleanerProperties, since a long list of params is misleading.
    • escapeXml

      public static String escapeXml(String s, boolean advanced, boolean recognizeUnicodeChars, boolean translateSpecialEntities, boolean isDomCreation, boolean transResCharsToNCR, boolean translateSpecialEntitiesToNCR, boolean isHtmlOutput)
      Change notes:
      1) Convert ASCII characters encoded in &#xx; format to the ASCII characters; such encoding may be an attempt to slip in malicious HTML.
      2) Convert &#xxx; format characters to a &quot; style representation, if one is available for the character.
      3) Convert HTML special entities to XML &#xxx; format when outputting XML.
      Parameters:
      s - the string to escape
      advanced - whether to use Advanced XML Escaping
      recognizeUnicodeChars - whether to recognise and replace Unicode characters
      translateSpecialEntities - whether to translate special entities
      isDomCreation - whether the escaping is in the context of DomCreation, an internal operation, with special rules.
      isHtmlOutput - whether the output is intended to be treated as HTML
      Returns:
      the escaped string
      TODO: Consider moving to CleanerProperties, since a long list of params is misleading.
    • getAmpNcr

      private static String getAmpNcr()
    • convert_To_Entity_Name

      private static int convert_To_Entity_Name(String s, boolean domCreation, boolean recognizeUnicodeChars, boolean translateSpecialEntitiesToNCR, StringBuilder result, int i)
      Parameters:
      s -
      domCreation -
      recognizeUnicodeChars -
      translateSpecialEntitiesToNCR -
      result -
      i -
      Returns:
    • convertToUnicode

      private static int convertToUnicode(String s, boolean domCreation, boolean recognizeUnicodeChars, boolean translateSpecialEntitiesToNCR, StringBuilder result, int i)
      Parameters:
      s -
      domCreation -
      recognizeUnicodeChars -
      translateSpecialEntitiesToNCR -
      result -
      i -
      Returns:
    • extractCharCode

      private static int extractCharCode(String s, int charIndex, boolean relaxedUnicode, StringBuilder unicode)
      • (Earlier code was failing on this.) &#138A; is converted by FF to 3 characters: &#138; + 'A' + ';'
      • &#0x138A; is converted by FF to 6? 7? characters: &#0 + 'x' + '1' + '3' + '8' + 'A' + ';' (#0 is displayed kind of weird)
      • &#x138A; is a single character
      Parameters:
      s -
      charIndex -
      relaxedUnicode - '&#0x138;' is treated like '&#x138;'
      unicode -
      Returns:
      the index at which to continue scanning the source string, minus 1, so that normal loop incrementing skips the ';'
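      The cases listed for extractCharCode come down to recognising decimal and hexadecimal numeric character references, with an optional relaxed mode that tolerates a stray '0' before the 'x'. The sketch below decodes a single, complete reference under those assumptions. The three regular expressions are guesses written for this example and may differ from the library's HEX_STRICT, HEX_RELAXED and DECIMAL fields; the class name CharRefSketch is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical decoder for one numeric character reference; not the HtmlCleaner source.
class CharRefSketch {

    // Assumed patterns, written for this example only.
    private static final Pattern HEX_STRICT_SKETCH = Pattern.compile("&#[xX]([0-9a-fA-F]+);");
    private static final Pattern HEX_RELAXED_SKETCH = Pattern.compile("&#0?[xX]([0-9a-fA-F]+);");
    private static final Pattern DECIMAL_SKETCH = Pattern.compile("&#([0-9]+);");

    // Returns the decoded character(s), or null if ref is not a complete reference.
    static String decode(String ref, boolean relaxed) {
        if (ref == null) {
            return null;
        }
        Matcher hex = (relaxed ? HEX_RELAXED_SKETCH : HEX_STRICT_SKETCH).matcher(ref);
        if (hex.matches()) {
            return fromCodePoint(hex.group(1), 16);
        }
        Matcher dec = DECIMAL_SKETCH.matcher(ref);
        if (dec.matches()) {
            return fromCodePoint(dec.group(1), 10);
        }
        return null;
    }

    private static String fromCodePoint(String digits, int radix) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // too many digits to fit in an int
        }
    }

    public static void main(String[] args) {
        // &#x138A; decodes to the single character U+138A.
        System.out.println(Integer.toHexString(decode("&#x138A;", false).codePointAt(0))); // prints "138a"
        // &#0x138; is only accepted in relaxed mode.
        System.out.println(decode("&#0x138;", false) == null); // prints "true"
    }
}
```

      The sketch does not reproduce the browser behaviour described in the notes, such as splitting &#138A; into several characters.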
    • sanitizeXmlIdentifier

      public static String sanitizeXmlIdentifier(String attName)
    • sanitizeXmlIdentifier

      public static String sanitizeXmlIdentifier(String attName, String prefix)
    • sanitizeHtmlAttributeName

      public static String sanitizeHtmlAttributeName(String name)
    • isValidHtmlAttributeName

      public static boolean isValidHtmlAttributeName(String name)
    • sanitizeXmlIdentifier

      public static String sanitizeXmlIdentifier(String attName, String prefix, String replacementCharacter)
      Attempts to replace invalid attribute names with valid ones.
      Parameters:
      attName - the attribute name to fix
      prefix - the prefix to use to indicate an attribute name has been altered
      Returns:
      either the original attribute name if valid, or a generated identifier if not
    • isValidXmlIdentifier

      public static boolean isValidXmlIdentifier(String s)
      Checks whether the specified string is a valid tag name or attribute name in XML.
      Parameters:
      s - the string to check
      Returns:
      true if the string is a valid XML identifier, false otherwise
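      A check of this kind is usually a regular-expression match against the XML Name production. The sketch below uses an ASCII-only approximation of that production, which is an assumption made to keep the example short; the real VALID_XML_IDENTIFIER_* patterns may also accept the Unicode ranges allowed by XML. The class name XmlNameSketch is hypothetical.

```java
import java.util.regex.Pattern;

// ASCII-only approximation of an XML name check; not the HtmlCleaner regex.
class XmlNameSketch {

    // Start character: letter, '_' or ':'. Following characters also allow digits, '.' and '-'.
    private static final Pattern NAME = Pattern.compile("[A-Za-z_:][A-Za-z0-9_:.-]*");

    static boolean isValidXmlIdentifier(String s) {
        return s != null && NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlIdentifier("data-id")); // prints "true"
        System.out.println(isValidXmlIdentifier("1abc"));    // prints "false"
    }
}
```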
    • isEmptyString

      public static boolean isEmptyString(Object o)
      Parameters:
      o -
      Returns:
      true if the specified string is null or contains only whitespace characters
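      A minimal sketch of this check, assuming the argument is converted with toString() and that whitespace means whatever String.trim() removes (characters up to U+0020). Both points are assumptions, and the class name EmptyStringSketch is hypothetical.

```java
// Hypothetical illustration; not the HtmlCleaner implementation.
class EmptyStringSketch {

    // True for null, for "" and for strings that trim() reduces to "".
    static boolean isEmptyString(Object o) {
        return o == null || o.toString().trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isEmptyString(" \t\n")); // prints "true"
        System.out.println(isEmptyString(" a "));   // prints "false"
    }
}
```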
    • tokenize

      public static String[] tokenize(String s, String delimiters)
    • isXmlReservedCharacter

      public static boolean isXmlReservedCharacter(String c)
    • getXmlNSPrefix

      public static String getXmlNSPrefix(String name)
      Parameters:
      name -
      Returns:
      the prefix (the part before ':') of an XML element or attribute name, or null if there is no prefix
    • getXmlName

      public static String getXmlName(String name)
      Parameters:
      name -
      Returns:
      the name after the prefix (the part after ':') of an XML element or attribute name
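      Splitting a qualified name into prefix and local name can be sketched as follows. The behaviour for names without a colon, or with a colon at either end, is assumed here (the whole name is treated as unprefixed), and the names QNameSketch, prefixOf and localNameOf are hypothetical.

```java
// Hypothetical illustration; not the HtmlCleaner implementation.
class QNameSketch {

    // Index of the separating ':' or -1 when the name has no usable prefix.
    private static int splitIndex(String name) {
        int colon = name.indexOf(':');
        return (colon > 0 && colon < name.length() - 1) ? colon : -1;
    }

    // Part before ':' or null when there is no prefix.
    static String prefixOf(String name) {
        int i = splitIndex(name);
        return i < 0 ? null : name.substring(0, i);
    }

    // Part after ':' or the whole name when there is no prefix (assumption).
    static String localNameOf(String name) {
        int i = splitIndex(name);
        return i < 0 ? name : name.substring(i + 1);
    }

    public static void main(String[] args) {
        System.out.println(prefixOf("xlink:href"));    // prints "xlink"
        System.out.println(localNameOf("xlink:href")); // prints "href"
        System.out.println(prefixOf("div"));           // prints "null"
    }
}
```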
    • isValidInt

      static boolean isValidInt(String s, int radix)
    • ltrim

      public static String ltrim(String s)
      Trims the specified string from the left.
      Parameters:
      s -
    • rtrim

      public static String rtrim(String s)
      Trims the specified string from the right.
      Parameters:
      s -
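      One-sided trimming can be sketched with a simple scan from either end. The sketch assumes that the characters removed are those for which Character.isWhitespace returns true and that null is passed through unchanged; neither detail is stated here, and the class name TrimSketch is hypothetical.

```java
// Hypothetical illustration; not the HtmlCleaner implementation.
class TrimSketch {

    // Drops leading whitespace only.
    static String ltrim(String s) {
        if (s == null) {
            return null;
        }
        int start = 0;
        while (start < s.length() && Character.isWhitespace(s.charAt(start))) {
            start++;
        }
        return s.substring(start);
    }

    // Drops trailing whitespace only.
    static String rtrim(String s) {
        if (s == null) {
            return null;
        }
        int end = s.length();
        while (end > 0 && Character.isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + ltrim("  a b ") + "]"); // prints "[a b ]"
        System.out.println("[" + rtrim("  a b ") + "]"); // prints "[  a b]"
    }
}
```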
    • isWhitespaceString

      public static boolean isWhitespaceString(Object object)
      Checks whether the specified object's string representation is an empty string (one containing only whitespace).
      Parameters:
      object - the object whose string representation is checked
      Returns:
      true if it is an empty string, false otherwise
    • deserializeEntities

      public static String deserializeEntities(String str, boolean recognizeUnicodeChars)
    • isValidXmlIdentifierStartChar

      public static boolean isValidXmlIdentifierStartChar(String identifier)
      Determines whether the initial character of an identifier is valid for XML
      Parameters:
      identifier - the identifier to check
      Returns:
      true if the initial character is valid
    • replaceInvalidXmlIdentifierCharacters

      public static String replaceInvalidXmlIdentifierCharacters(String name, String replacement)
      Strips out invalid characters from names used for XML elements and replaces them with the specified replacement. For example, in "<p%>" the invalid "%" is replaced.
      Parameters:
      name -
      Returns:
      valid XML name
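      Replacing invalid name characters can be sketched with String.replaceAll. The character class below is an ASCII-only approximation of the characters allowed in an XML name, chosen for brevity, so it is stricter than XML itself; it also does not repair an invalid first character such as a leading digit. The class name ReplaceInvalidSketch is hypothetical.

```java
import java.util.regex.Matcher;

// ASCII-only approximation; not the HtmlCleaner implementation.
class ReplaceInvalidSketch {

    // Replaces every character outside the assumed name alphabet with the replacement text.
    static String replaceInvalid(String name, String replacement) {
        if (name == null) {
            return null;
        }
        // quoteReplacement keeps '$' and '\' in the replacement from being read as group references.
        return name.replaceAll("[^A-Za-z0-9_:.-]", Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceInvalid("p%", "_"));      // prints "p_"
        System.out.println(replaceInvalid("my attr", "-")); // prints "my-attr"
    }
}
```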
    • compileUnicodePattern

      private static Pattern compileUnicodePattern(String pattern)