Package org.htmlunit.cyberneko
Class HTMLNamedEntitiesParser
java.lang.Object
org.htmlunit.cyberneko.HTMLNamedEntitiesParser
This is a highly specialized class for recognizing HTML named entities, with the ability
to look them up in stages. It is stateless and hence memory friendly.
Additionally, it is not generated code; rather, it sets itself up from a file at
first use and stays fixed from then on. Technically, it is not a parser,
because it does not keep a state corresponding to the one in the HTML standard:
12.2.5.72 Character reference state
Because it is stateless, it delegates state handling to the caller, who has to track how many characters have been seen and decide when to stop.
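The staged, caller-driven lookup described above can be illustrated with a minimal sketch. This is not the HtmlUnit implementation: the names StagedLookupSketch, Level, and step are hypothetical, and returning null on a dead end is a simplification chosen for brevity. The point is only that the lookup object keeps no per-parse data, so the caller passes the last level back in on every call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a staged, caller-driven lookup; not the HtmlUnit code.
final class StagedLookupSketch {

    // One level of the tree. It is fixed after construction, so it can be shared.
    static final class Level {
        final Map<Character, Level> next = new HashMap<>();
        String resolved; // replacement text if the path so far spells a complete entity
    }

    private final Level root = new Level();

    // Builds the tree once from entity name -> replacement text.
    StagedLookupSketch(Map<String, String> entities) {
        for (Map.Entry<String, String> e : entities.entrySet()) {
            Level level = root;
            for (char c : e.getKey().toCharArray()) {
                level = level.next.computeIfAbsent(c, k -> new Level());
            }
            level.resolved = e.getValue();
        }
    }

    // Advances by one character. The caller hands back the level it received
    // last time (or null to start); nothing is remembered between calls.
    // Simplification: a dead end is reported as null.
    Level step(Level current, char c) {
        Level from = (current == null) ? root : current;
        return from.next.get(c);
    }

    public static void main(String[] args) {
        Map<String, String> defs = new HashMap<>();
        defs.put("lt", "<");
        defs.put("lt;", "<");
        StagedLookupSketch tree = new StagedLookupSketch(defs);

        Level afterL = tree.step(null, 'l');    // "l" is only a prefix
        Level afterLt = tree.step(afterL, 't'); // "lt" is a complete entity
        System.out.println(afterL.resolved == null);
        System.out.println(afterLt.resolved);
    }
}
```

Because every call receives its starting level as an argument, one shared instance can serve any number of concurrent callers without synchronization.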
-
Nested Class Summary
Nested Classes
Modifier and Type, Class, Description
protected static class: This is our initial state and has a special optimization applied.
static class: Our "level" in the treeish structure that keeps its static state and the next level underneath.
-
Field Summary
Fields
Modifier and Type, Field, Description
private FastHashMap<String, String>
private static final HTMLNamedEntitiesParser
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type, Method, Description
static HTMLNamedEntitiesParser get(): Returns the singleton.
lookup(int character, HTMLNamedEntitiesParser.State state): Pseudo-parses an entity character by character.
lookup(String entityName): Utility method, mostly for testing, that allows us to look up an entity from a string instead of from single characters.
lookupEntityRefFor(String key)
-
Field Details
-
instance
-
rootLevel_
-
entities_
-
-
Constructor Details
-
HTMLNamedEntitiesParser
private HTMLNamedEntitiesParser()
Constructor. It builds the parser state from a properties file that defines the entities. This file was taken from https://html.spec.whatwg.org/multipage/named-characters.html (JSON version) and converted appropriately.
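As a rough illustration of the setup step, the sketch below reads entity definitions from properties-style text into a map. The layout of the real file is not documented here, so the one name=value pair per line format is an assumption, and EntityFileSketch and parse are hypothetical names. Only standard java.util.Properties calls are used.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch; the real file format and loading code may differ.
final class EntityFileSketch {

    // Parses properties-style text (assumed format: entityName=replacement)
    // into a map from entity name to replacement text.
    static Map<String, String> parse(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        Map<String, String> entities = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            entities.put(name, props.getProperty(name));
        }
        return entities;
    }

    public static void main(String[] args) {
        Map<String, String> entities = parse("amp;=&\nlt;=<\ngt;=>\n");
        System.out.println(entities.size());
        System.out.println(entities.get("amp;"));
    }
}
```

A map like this could then feed whatever lookup structure is built once at first use.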
-
-
Method Details
-
get
Returns the singleton. The singleton is stateless and can safely be used in a multi-threaded context.
- Returns:
- the singleton instance of the parser, can never be null
-
lookup
Utility method, mostly for testing, that allows us to look up an entity from a string instead of from single characters.
- Parameters:
entityName
- the entity to look up
- Returns:
- a state that represents the result, will never be null
-
lookup
Pseudo-parses an entity character by character. We assume that we are presented with the characters after the starting ampersand. This parser does not support Unicode entities, so these have to be handled separately.
- Parameters:
character
- the next character, should never be the ampersand
state
- the last known state, or null when parsing starts
- Returns:
- the current state, which might be a valid final result, see
HTMLNamedEntitiesParser.State
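The caller's side of a character-by-character lookup might look like the following sketch: feed the characters after the ampersand one at a time, remember the last complete match, and stop as soon as no entity name can continue. This is an illustration under assumptions, not the library's algorithm. It swaps the tree for a plain set of name prefixes, and CharByCharSketch, Match, and scan are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a caller loop; uses a prefix set instead of a tree.
final class CharByCharSketch {

    // The longest entity name matched so far and its replacement text.
    static final class Match {
        final String name;
        final String value;

        Match(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    private final Map<String, String> entities = new HashMap<>();
    private final Set<String> prefixes = new HashSet<>();

    CharByCharSketch(Map<String, String> defs) {
        entities.putAll(defs);
        for (String name : defs.keySet()) {
            for (int i = 1; i <= name.length(); i++) {
                prefixes.add(name.substring(0, i)); // every prefix keeps the scan alive
            }
        }
    }

    // Consumes the text that follows '&' and returns the longest match, or null.
    Match scan(String textAfterAmpersand) {
        StringBuilder seen = new StringBuilder();
        Match best = null;
        for (int i = 0; i < textAfterAmpersand.length(); i++) {
            seen.append(textAfterAmpersand.charAt(i));
            String candidate = seen.toString();
            if (!prefixes.contains(candidate)) {
                break; // no entity name starts like this, so stop consuming
            }
            String value = entities.get(candidate);
            if (value != null) {
                best = new Match(candidate, value); // remember the longest match
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> defs = new HashMap<>();
        defs.put("not", "\u00AC");
        defs.put("not;", "\u00AC");
        defs.put("notin;", "\u2209");
        CharByCharSketch sketch = new CharByCharSketch(defs);
        System.out.println(sketch.scan("notit; I tell you").name);
        System.out.println(sketch.scan("notin; I tell you").name);
    }
}
```

Scanning "notit;" continues past "not" because "noti" could still become "notin;", then falls back to the remembered match "not" once "notit" turns out to be a dead end.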
-
lookupEntityRefFor
- Returns:
- the entity ref for the given key (usually a single char) or null
-