Class HTMLNamedEntitiesParser

java.lang.Object
org.htmlunit.cyberneko.HTMLNamedEntitiesParser

public final class HTMLNamedEntitiesParser extends Object
This is a highly specialized class for recognizing HTML named entities, with the ability to look them up in stages. It is stateless and hence memory friendly. It is also not generated code; rather, it sets itself up from a file at first use and remains fixed from then on. Technically, it is no longer a parser, because it does not hold state in the way the HTML standard describes in 12.2.5.72 Character reference state.

Because it is stateless, it delegates state handling to the caller, who must track how many characters have been seen and decide when to stop.
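The staged, caller-driven lookup can be illustrated with a minimal, self-contained sketch. It is not this class's implementation: the names `StagedEntityLookupSketch`, `Node`, `Step`, `bestMatch`, `bestLength` and `canContinue` are hypothetical, and the entity table is a tiny hand-picked subset. The point is only that the lookup structure itself never changes, while the caller carries the progress object from one character to the next.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the library's code: a fixed trie plus a
// caller-held progress object, fed one character at a time.
final class StagedEntityLookupSketch {

    /** Trie node; populated once during class initialization, then read-only. */
    static final class Node {
        final Map<Integer, Node> children = new HashMap<>();
        String resolved; // non-null if the path to this node spells a full entity
    }

    /** Immutable result of one step; the caller keeps it and passes it back in. */
    static final class Step {
        final Node node;        // position reached, null once no entity can match
        final int consumed;     // characters fed so far
        final String bestMatch; // replacement for the longest entity seen, or null
        final int bestLength;   // number of characters covered by bestMatch

        Step(Node node, int consumed, String bestMatch, int bestLength) {
            this.node = node;
            this.consumed = consumed;
            this.bestMatch = bestMatch;
            this.bestLength = bestLength;
        }

        /** True while feeding more characters could still extend the match. */
        boolean canContinue() {
            return node != null && !node.children.isEmpty();
        }
    }

    private static final Node ROOT = build();
    private static final Step START = new Step(ROOT, 0, null, 0);

    private static Node build() {
        Node root = new Node();
        // Small illustrative subset of the WHATWG named character references.
        add(root, "amp;", "&");
        add(root, "amp", "&");
        add(root, "lt;", "<");
        add(root, "lt", "<");
        add(root, "not;", "\u00AC");
        add(root, "not", "\u00AC");
        add(root, "notin;", "\u2209");
        return root;
    }

    private static void add(Node root, String name, String value) {
        Node node = root;
        for (int i = 0; i < name.length(); i++) {
            node = node.children.computeIfAbsent((int) name.charAt(i), k -> new Node());
        }
        node.resolved = value;
    }

    /** One stage: consumes a single character after the ampersand. */
    static Step lookup(int character, Step previous) {
        Step from = previous == null ? START : previous;
        Node next = from.node == null ? null : from.node.children.get(character);
        int consumed = from.consumed + 1;
        if (next != null && next.resolved != null) {
            return new Step(next, consumed, next.resolved, consumed);
        }
        return new Step(next, consumed, from.bestMatch, from.bestLength);
    }

    /** Convenience wrapper that drives the staged lookup over a whole string. */
    static Step lookup(String name) {
        Step step = START;
        for (int i = 0; i < name.length(); i++) {
            step = lookup(name.charAt(i), step);
            if (!step.canContinue()) {
                break; // the caller decides when to stop
            }
        }
        return step;
    }

    public static void main(String[] args) {
        Step step = lookup("notit;");
        System.out.println(step.bestLength + " characters matched");
    }
}
```

In this sketch, feeding `notit;` stops after the fifth character with `not` as the longest match, so the caller would emit the replacement for `not` and treat the remaining `it;` as ordinary text.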

  • Field Details

  • Constructor Details

    • HTMLNamedEntitiesParser

      private HTMLNamedEntitiesParser()
      Constructor. It builds the parser state from a properties file that defines the entities. This file was taken from https://html.spec.whatwg.org/multipage/named-characters.html (JSON version) and converted appropriately.
  • Method Details

    • get

      public static HTMLNamedEntitiesParser get()
      Returns the singleton. The singleton is stateless and can safely be used in a multi-threaded context.
      Returns:
      the singleton instance of the parser, can never be null
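A singleton that sets itself up at first use and is then safe to share between threads can be sketched with the initialization-on-demand holder idiom. This is an assumption about one possible approach, not a description of how this class does it. The class name `LazyEntityTableSketch`, the `resolve` method and the inline `name=value` properties content are all hypothetical; the real file name and layout are not given here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: lazily built, immutable singleton table.
final class LazyEntityTableSketch {

    // Stand-in for the bundled properties file (assumed name=value layout).
    private static final String HYPOTHETICAL_PROPERTIES = "amp;=&\nlt;=<\ngt;=>\n";

    private final Map<String, String> entities;

    private LazyEntityTableSketch() {
        Properties props = new Properties();
        try (StringReader in = new StringReader(HYPOTHETICAL_PROPERTIES)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> table = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            table.put(name, props.getProperty(name));
        }
        // Never modified after construction, so it can be read concurrently.
        entities = Collections.unmodifiableMap(table);
    }

    // The JVM initializes Holder only when get() is first called,
    // and class initialization is thread-safe.
    private static final class Holder {
        static final LazyEntityTableSketch INSTANCE = new LazyEntityTableSketch();
    }

    static LazyEntityTableSketch get() {
        return Holder.INSTANCE;
    }

    /** Returns the replacement text for a full entity name, or null. */
    String resolve(String name) {
        return entities.get(name);
    }
}
```

Because the instance holds no mutable fields, callers need no synchronization around `get()` or `resolve`.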
    • lookup

      public HTMLNamedEntitiesParser.State lookup(String entityName)
      Utility method, mostly for testing, that allows us to look up an entity from a string instead of from single characters.
      Parameters:
      entityName - the entity to look up
      Returns:
      a state that represents the result, will never be null
    • lookup

      public HTMLNamedEntitiesParser.State lookup(int character, HTMLNamedEntitiesParser.State state)
      Pseudo-parses an entity character by character. We assume that we are presented with the characters following the starting ampersand. This parser does not support Unicode entities, so these must be handled separately.
      Parameters:
      character - the next character, should never be the ampersand
      state - the last known state, or null when starting to parse
      Returns:
      the current state, which might be a valid final result, see HTMLNamedEntitiesParser.State
    • lookupEntityRefFor

      public String lookupEntityRefFor(String key)
      Returns:
      the entity ref for the given key (usually a single char) or null
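A reverse lookup of this kind can be sketched as a map from replacement text to an entity reference. The sketch makes two assumptions that this page does not confirm: when several names map to the same character, the first one registered wins, and the returned reference includes the leading `&` and trailing `;`. The class `ReverseEntityLookupSketch` and its `register` helper are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a character-to-entity-reference table.
final class ReverseEntityLookupSketch {

    private static final Map<String, String> REF_BY_VALUE = new LinkedHashMap<>();

    static {
        // Illustrative subset; "amp" and "AMP" both stand for "&".
        register("amp", "&");
        register("AMP", "&");
        register("lt", "<");
        register("nbsp", "\u00A0");
    }

    private static void register(String name, String value) {
        // Assumption: keep the first name seen for each value.
        REF_BY_VALUE.putIfAbsent(value, "&" + name + ";");
    }

    /** Returns the entity reference for the key, or null if none is known. */
    static String lookupEntityRefFor(String key) {
        return REF_BY_VALUE.get(key);
    }
}
```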