Package morfologik.fsa
Class CFSA2
- java.lang.Object
-
- morfologik.fsa.FSA
-
- morfologik.fsa.CFSA2
-
- All Implemented Interfaces:
java.lang.Iterable<java.nio.ByteBuffer>
public final class CFSA2 extends FSA
CFSA (Compact Finite State Automaton) binary format implementation, version 2:BIT_TARGET_NEXT
applicable on all arcs, not necessarily the last one.- v-coded goto field
- v-coded perfect hashing numbers, if any
- 31 most frequent labels integrated with flags byte
The encoding of automaton body is as follows.
---- CFSA header Byte Description +-+-+-+-+-+-+-+-+\ 0 | | | | | | | | | +------ '\' +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 1 | | | | | | | | | +------ 'f' +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 2 | | | | | | | | | +------ 's' +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 3 | | | | | | | | | +------ 'a' +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 4 | | | | | | | | | +------ version (fixed 0xc6) +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 5 | | | | | | | | | +----\ +-+-+-+-+-+-+-+-+/ \ flags [MSB first] +-+-+-+-+-+-+-+-+\ / 6 | | | | | | | | | +----/ +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 7 | | | | | | | | | +------ label lookup table size +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 8-32 | | | | | | | | | +------ label value lookup table : : : : : : : : : | +-+-+-+-+-+-+-+-+/ ---- Start of a node; only if automaton was compiled with NUMBERS option. Byte +-+-+-+-+-+-+-+-+\ 0 | | | | | | | | | \ +-+-+-+-+-+-+-+-+ + 1 | | | | | | | | | | number of strings recognized +-+-+-+-+-+-+-+-+ +----- by the automaton starting : : : : : : : : : | from this node. v-coding +-+-+-+-+-+-+-+-+ + | | | | | | | | | / +-+-+-+-+-+-+-+-+/ ---- A vector of this node's arcs. An arc's layout depends on the combination of flags. 1) NEXT bit set, mapped arc label. +----------------------- node pointed to is next | +--------------------- the last arc of the node | | +------------------- this arc leads to a final state (acceptor) | | | _______+--------- arc's label; indexed if M > 0, otherwise explicit label follows | | | / | | | | +-+-+-+-+-+-+-+-+\ 0 |N|L|F|M|M|M|M|M| +------ flags + (M) index of the mapped label. +-+-+-+-+-+-+-+-+/ +-+-+-+-+-+-+-+-+\ 1 | | | | | | | | | +------ optional label if M == 0 +-+-+-+-+-+-+-+-+/ : : : : : : : : : +-+-+-+-+-+-+-+-+\ |A|A|A|A|A|A|A|A| +------ v-coded goto address +-+-+-+-+-+-+-+-+/
-
-
Field Summary
Fields Modifier and Type Field Description byte[]
arcs
An array of bytes with the internal representation of the automaton.static int
BIT_FINAL_ARC
The arc corresponds to the last character of a sequence available when building the automaton (acceptor transition).static int
BIT_LAST_ARC
The arc is the last one from the current node's arcs list.static int
BIT_TARGET_NEXT
The target node of this arc follows the last arc of the current state (no goto field).private int
epsilon
Epsilon node's offset.private java.util.EnumSet<FSAFlags>
flags
Flags for this automaton version.private boolean
hasNumbers
Iftrue
states are prepended with numbers.(package private) static int
LABEL_INDEX_BITS
The count of bits assigned to storing an indexed label.(package private) static int
LABEL_INDEX_MASK
Masks only the M bits of a flag byte.static int
LABEL_INDEX_SIZE
Maximum size of the labels index.byte[]
labelMapping
Label mapping for M-indexed labels.static byte
VERSION
Automaton header version value.
-
Constructor Summary
Constructors Constructor Description CFSA2(java.io.InputStream stream)
Reads an automaton from a byte stream.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description int
getArc(int node, byte label)
byte
getArcLabel(int arc)
(package private) int
getDestinationNodeOffset(int arc)
Returns the address of the node pointed to by this arc.int
getEndNode(int arc)
int
getFirstArc(int node)
java.util.Set<FSAFlags>
getFlags()
int
getNextArc(int arc)
int
getRightLanguageCount(int node)
int
getRootNode()
boolean
isArcFinal(int arc)
boolean
isArcLast(int arc)
Returnstrue
if this arc hasNEXT
bit set.boolean
isArcTerminal(int arc)
boolean
isNextSet(int arc)
(package private) static int
readVInt(byte[] array, int offset)
Read a v-int.private int
skipArc(int offset)
Read the arc's layout and skip as many bytes, as needed, to skip it.private int
skipVInt(int offset)
Skip a v-int.(package private) static int
vIntLength(int value)
Return the byte-length of a v-coded int.-
Methods inherited from class morfologik.fsa.FSA
getArcCount, getSequences, getSequences, iterator, read, read, readRemaining, visitAllStates, visitInPostOrder, visitInPostOrder, visitInPreOrder, visitInPreOrder
-
-
-
-
Field Detail
-
VERSION
public static final byte VERSION
Automaton header version value.- See Also:
- Constant Field Values
-
BIT_TARGET_NEXT
public static final int BIT_TARGET_NEXT
The target node of this arc follows the last arc of the current state (no goto field).- See Also:
- Constant Field Values
-
BIT_LAST_ARC
public static final int BIT_LAST_ARC
The arc is the last one from the current node's arcs list.- See Also:
- Constant Field Values
-
BIT_FINAL_ARC
public static final int BIT_FINAL_ARC
The arc corresponds to the last character of a sequence available when building the automaton (acceptor transition).- See Also:
- Constant Field Values
-
LABEL_INDEX_BITS
static final int LABEL_INDEX_BITS
The count of bits assigned to storing an indexed label.- See Also:
- Constant Field Values
-
LABEL_INDEX_MASK
static final int LABEL_INDEX_MASK
Masks only the M bits of a flag byte.- See Also:
- Constant Field Values
-
LABEL_INDEX_SIZE
public static final int LABEL_INDEX_SIZE
Maximum size of the labels index.- See Also:
- Constant Field Values
-
arcs
public byte[] arcs
An array of bytes with the internal representation of the automaton. Please see the documentation of this class for more information on how this structure is organized.
-
flags
private final java.util.EnumSet<FSAFlags> flags
Flags for this automaton version.
-
labelMapping
public final byte[] labelMapping
Label mapping for M-indexed labels.
-
hasNumbers
private final boolean hasNumbers
Iftrue
states are prepended with numbers.
-
epsilon
private final int epsilon
Epsilon node's offset.- See Also:
- Constant Field Values
-
-
Method Detail
-
getRootNode
public int getRootNode()
- Specified by:
getRootNode
in classFSA
- Returns:
- Returns the identifier of the root node of this automaton. Returns 0 if the start node is also the end node (the automaton is empty).
-
getFirstArc
public final int getFirstArc(int node)
- Specified by:
getFirstArc
in classFSA
- Parameters:
node
- Identifier of the node.- Returns:
- Returns the identifier of the first arc leaving
node
or 0 if the node has no outgoing arcs.
-
getNextArc
public final int getNextArc(int arc)
- Specified by:
getNextArc
in classFSA
- Parameters:
arc
- The arc's identifier.- Returns:
- Returns the identifier of the next arc after
arc
and leavingnode
. Zero is returned if no more arcs are available for the node.
-
getArc
public int getArc(int node, byte label)
-
getEndNode
public int getEndNode(int arc)
- Specified by:
getEndNode
in classFSA
- Parameters:
arc
- The arc's identifier.- Returns:
- Return the end node pointed to by a given
arc
. Terminal arcs (those that point to a terminal state) have no end node representation and throw a runtime exception.
-
getArcLabel
public byte getArcLabel(int arc)
- Specified by:
getArcLabel
in classFSA
- Parameters:
arc
- The arc's identifier.- Returns:
- Return the label associated with a given
arc
.
-
getRightLanguageCount
public int getRightLanguageCount(int node)
- Overrides:
getRightLanguageCount
in classFSA
- Parameters:
node
- Identifier of the node.- Returns:
- Returns the number of sequences reachable from the given state if
the automaton was compiled with
FSAFlags.NUMBERS
. The size of the right language of the state, in other words.
-
isArcFinal
public boolean isArcFinal(int arc)
- Specified by:
isArcFinal
in classFSA
- Parameters:
arc
- The arc's identifier.- Returns:
- Returns
true
if the destination node at the end of thisarc
corresponds to an input sequence created when building this automaton.
-
isArcTerminal
public boolean isArcTerminal(int arc)
- Specified by:
isArcTerminal
in classFSA
- Parameters:
arc
- The arc's identifier.- Returns:
- Returns
true
if thisarc
does not have a terminating node (@linkFSA.getEndNode(int)
will throw an exception). ImpliesFSA.isArcFinal(int)
.
-
isArcLast
public boolean isArcLast(int arc)
Returnstrue
if this arc hasNEXT
bit set.- Parameters:
arc
- The node's arc identifier.- Returns:
- Returns true if the argument is the last arc of a node.
- See Also:
BIT_LAST_ARC
-
isNextSet
public boolean isNextSet(int arc)
- Parameters:
arc
- The node's arc identifier.- Returns:
- Returns true if
BIT_TARGET_NEXT
is set for this arc. - See Also:
BIT_TARGET_NEXT
-
getFlags
public java.util.Set<FSAFlags> getFlags()
-
getDestinationNodeOffset
final int getDestinationNodeOffset(int arc)
Returns the address of the node pointed to by this arc.
-
skipArc
private int skipArc(int offset)
Read the arc's layout and skip as many bytes, as needed, to skip it.
-
readVInt
static int readVInt(byte[] array, int offset)
Read a v-int.
-
vIntLength
static int vIntLength(int value)
Return the byte-length of a v-coded int.
-
skipVInt
private int skipVInt(int offset)
Skip a v-int.
-
-