Class HTMLNamedEntitiesParser


  • public final class HTMLNamedEntitiesParser
    extends java.lang.Object
    This is a very specialized class for recognizing HTML named entities, with the ability to look them up in stages. It is stateless and hence memory friendly. It is not generated code; instead, it sets itself up from a file at first use and stays fixed from then on. Technically, it is no longer a parser, because it does not maintain a state that matches the HTML standard: 12.2.5.72 Character reference state

    Because it is stateless, it delegates state handling to the caller, who has to track how many characters have been seen and decide when to stop.
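
    The staged, stateless lookup described above can be pictured as a fixed tree that the caller walks one character at a time. The following is a minimal sketch under that assumption, not the class's actual implementation; StagedLookupSketch, Node, step, and the field names are hypothetical and chosen for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in, NOT the real HTMLNamedEntitiesParser.
final class StagedLookupSketch {
    // One node of the lookup tree; it doubles as the "state" handed back to the caller.
    static final class Node {
        final Map<Integer, Node> next = new HashMap<>();
        final int length;   // number of characters consumed to reach this node
        String resolved;    // replacement text if the path so far is a full entity, else null

        Node(int length) {
            this.length = length;
        }
    }

    private final Node root = new Node(0);

    // Builds the tree once; it is never modified afterwards.
    StagedLookupSketch(Map<String, String> entities) {
        for (Map.Entry<String, String> e : entities.entrySet()) {
            Node n = root;
            for (int i = 0; i < e.getKey().length(); i++) {
                final int depth = i + 1;
                n = n.next.computeIfAbsent((int) e.getKey().charAt(i), c -> new Node(depth));
            }
            n.resolved = e.getValue();
        }
    }

    // Consumes one more character. A null state means "start from the beginning".
    // Returns the node reached, or null if no known entity continues this way.
    Node step(int character, Node state) {
        Node from = (state == null) ? root : state;
        return from.next.get(character);
    }
}
```

    In this sketch the tree is never modified after construction, so all progress lives in the Node reference the caller holds between calls.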

    • Constructor Detail

      • HTMLNamedEntitiesParser

        private HTMLNamedEntitiesParser()
        Constructor. It builds the parser's lookup data from a properties file that defines the entities. This file was taken from https://html.spec.whatwg.org/multipage/named-characters.html (JSON version) and converted appropriately.
    • Method Detail

      • get

        public static HTMLNamedEntitiesParser get()
        Returns the singleton. The singleton is stateless and can safely be used in a multi-threaded context.
        Returns:
        the singleton instance of the parser, can never be null
      • lookup

        public HTMLNamedEntitiesParser.State lookup(java.lang.String entityName)
        Utility method, mostly for testing, that looks up an entity from a string instead of from single characters.
        Parameters:
        entityName - the entity to look up
        Returns:
        a state that represents the result, never null
      • lookup

        public HTMLNamedEntitiesParser.State lookup(int character,
                                                    HTMLNamedEntitiesParser.State state)
        Pseudo-parses an entity character by character. It assumes that it is presented with the characters after the starting ampersand. This parser does not support Unicode entities, so these have to be handled separately.
        Parameters:
        character - the next character, should never be the ampersand
        state - the last known state, or null when starting to parse
        Returns:
        the current state, which might be a valid final result, see HTMLNamedEntitiesParser.State
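
        Since the caller decides when to stop, a typical driver loop feeds characters one by one and remembers the longest full match seen so far. The sketch below illustrates that caller-side pattern only. It does not use the real State class, whose fields are not described here; LongestMatchSketch, Match, resolve, and the tiny stand-in entity table are all hypothetical.

```java
import java.util.TreeMap;

// Hypothetical caller-side sketch; none of these names come from the real class.
final class LongestMatchSketch {
    // Result of a lookup: the replacement text and how many characters it consumed.
    static final class Match {
        final String value;
        final int length;

        Match(String value, int length) {
            this.value = value;
            this.length = length;
        }
    }

    // Stand-in entity table, keyed by the text that follows the ampersand.
    private static final TreeMap<String, String> TABLE = new TreeMap<>();
    static {
        TABLE.put("not", "\u00AC");
        TABLE.put("not;", "\u00AC");
        TABLE.put("notin;", "\u2209");
        TABLE.put("amp;", "&");
    }

    // True if at least one known entity name starts with the given prefix.
    private static boolean isPrefix(String prefix) {
        String candidate = TABLE.ceilingKey(prefix);
        return candidate != null && candidate.startsWith(prefix);
    }

    // Feeds characters one at a time, stops at the first dead end,
    // and returns the longest full match seen, or null if there was none.
    static Match resolve(String afterAmpersand) {
        Match best = null;
        StringBuilder seen = new StringBuilder();
        for (int i = 0; i < afterAmpersand.length(); i++) {
            seen.append(afterAmpersand.charAt(i));
            String prefix = seen.toString();
            if (!isPrefix(prefix)) {
                break; // no entity continues this way
            }
            String value = TABLE.get(prefix);
            if (value != null) {
                best = new Match(value, prefix.length());
            }
        }
        return best;
    }
}
```

        With this stand-in table, resolve("notit;") returns the value for "not" with a consumed length of 3, so the remaining "it;" would be treated as ordinary text by the caller.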
      • lookupEntityRefFor

        public java.lang.String lookupEntityRefFor(java.lang.String key)
        Returns:
        the entity ref for the given key (usually a single char) or null
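
        A reverse lookup of this kind can be sketched as an inverted map from character to entity reference. This is an illustration under assumptions, not the class's actual behavior: the format of the returned reference (here "&name;") and the choice of name when several entities map to the same character are not stated above, and ReverseLookupSketch is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the reverse direction: character -> entity reference.
final class ReverseLookupSketch {
    private final Map<String, String> refByCharacter = new HashMap<>();

    // characterByName maps an entity name without '&' and ';' to its character(s).
    ReverseLookupSketch(Map<String, String> characterByName) {
        for (Map.Entry<String, String> e : characterByName.entrySet()) {
            // Assumption: if several names share a character, an arbitrary one is kept.
            refByCharacter.putIfAbsent(e.getValue(), "&" + e.getKey() + ";");
        }
    }

    // Returns the entity reference for the key, or null if none is known.
    String lookupEntityRefFor(String key) {
        return refByCharacter.get(key);
    }
}
```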