Package morfologik.fsa.builders
Class FSABuilder
- java.lang.Object
-
- morfologik.fsa.builders.FSABuilder
-
public final class FSABuilder extends java.lang.Object
Fast, memory-conservative finite state automaton builder, returning an in-memoryFSA
that is a tradeoff between construction speed and memory consumption. Use serializers to compress the returned automaton into more compact form.- See Also:
FSASerializer
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
FSABuilder.InfoEntry
Debug and information constants.
-
Field Summary
Fields Modifier and Type Field Description private int[]
activePath
States on the "active path" (still mutable).private int
activePathLen
Current length of the active path.private static int
BUFFER_GROWTH_SIZE
Internal serialized FSA buffer expand ratio.private int
bufferGrowthSize
Internal serialized FSA buffer expand ratio.private int
epsilon
An epsilon state.private int[]
hashSet
Hash set of state addresses inserialized
, hashed byhash(int, int)
.private int
hashSize
Number of entries currently stored inhashSet
.private java.util.TreeMap<FSABuilder.InfoEntry,java.lang.Object>
info
Information about the automaton and its compilation.static java.util.Comparator<byte[]>
LEXICAL_ORDERING
A comparator comparing full byte arrays.private static int
MAX_LABELS
Maximum number of labels from a single state.private static int
MB
A megabyte.private int[]
nextArcOffset
The next offset at which an arc will be added to the given state onactivePath
.private byte[]
previous
Previous sequence added to the automaton inadd(byte[], int, int)
.private int
previousLength
previous
sequence's length, used in assertions only.private int
root
Root state.private int
serializationBufferReallocations
Number of serialization buffer reallocations.private byte[]
serialized
Holds serialized and mutable states.private int
size
Number of bytes already taken inserialized
.
-
Constructor Summary
Constructors Constructor Description FSABuilder()
FSABuilder(int bufferGrowthSize)
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description void
add(byte[] sequence, int start, int len)
Add a single sequence of bytes to the FSA.private int
allocateState(int labels)
Allocate space for a state with the given number of outgoing labels.static FSA
build(byte[][] input)
Build a minimal, deterministic automaton from a sorted list of byte sequences.static FSA
build(java.lang.Iterable<byte[]> input)
Build a minimal, deterministic automaton from an iterable list of byte sequences.private int
commonPrefix(byte[] sequence, int start, int len)
private static int
compare(byte[] s1, int start1, int lens1, byte[] s2, int start2, int lens2)
Lexicographic order of input sequences.FSA
complete()
private boolean
equivalent(int start1, int start2, int len)
Returntrue
if two regions inserialized
are identical.private void
expandActivePath(int size)
Append a new mutable state to the active path.private void
expandAndRehash()
Reallocate and rehash the hash set.private void
expandBuffers()
Expand internal buffers for the next state.private int
freezeState(int activePathIndex)
Freeze a state: try to find an equivalent state in the interned states dictionary first, if found, return it, otherwise, serialize the mutable state atactivePathIndex
and return it.private byte
getArcLabel(int arc)
Get label's arc.private int
getArcTarget(int arc)
Returns the address of an arc.java.util.Map<FSABuilder.InfoEntry,java.lang.Object>
getInfo()
private int
hash(int start, int byteCount)
Hash code of a fragment ofserialized
array.private boolean
isArcFinal(int arc)
Is this arc final?private boolean
isArcLast(int arc)
Is this arc the state's last?private int
serialize(int activePathIndex)
Serialize a given state on the active path.private void
setArcTarget(int arc, int state)
Fills the target state address of an arc.private boolean
setPrevious(byte[] sequence, int start, int length)
Copycurrent
into an internal buffer.private int
stateLength(int state)
The total length of the serialized state data (all arcs).
-
-
-
Field Detail
-
MB
private static final int MB
A megabyte.- See Also:
- Constant Field Values
-
BUFFER_GROWTH_SIZE
private static final int BUFFER_GROWTH_SIZE
Internal serialized FSA buffer expand ratio.- See Also:
- Constant Field Values
-
MAX_LABELS
private static final int MAX_LABELS
Maximum number of labels from a single state.- See Also:
- Constant Field Values
-
LEXICAL_ORDERING
public static final java.util.Comparator<byte[]> LEXICAL_ORDERING
A comparator comparing full byte arrays. Unsigned byte comparisons ('C'-locale).
-
bufferGrowthSize
private final int bufferGrowthSize
Internal serialized FSA buffer expand ratio.
-
serialized
private byte[] serialized
Holds serialized and mutable states. Each state is a sequential list of arcs, the last arc is marked with#BIT_ARC_LAST
.
-
size
private int size
Number of bytes already taken inserialized
. Start from 1 to keep 0 a sentinel value (for the hash set and final state).
-
activePath
private int[] activePath
States on the "active path" (still mutable). Values are addresses of each state's first arc.
-
activePathLen
private int activePathLen
Current length of the active path.
-
nextArcOffset
private int[] nextArcOffset
The next offset at which an arc will be added to the given state onactivePath
.
-
root
private int root
Root state. If negative, the automaton has been built already and cannot be extended.
-
epsilon
private int epsilon
An epsilon state. The first and only arc of this state points either to the root or to the terminal state, indicating an empty automaton.
-
hashSet
private int[] hashSet
Hash set of state addresses inserialized
, hashed byhash(int, int)
. Zero reserved for an unoccupied slot.
-
hashSize
private int hashSize
Number of entries currently stored inhashSet
.
-
previous
private byte[] previous
Previous sequence added to the automaton inadd(byte[], int, int)
. Used in assertions only.
-
info
private java.util.TreeMap<FSABuilder.InfoEntry,java.lang.Object> info
Information about the automaton and its compilation.
-
previousLength
private int previousLength
previous
sequence's length, used in assertions only.
-
serializationBufferReallocations
private int serializationBufferReallocations
Number of serialization buffer reallocations.
-
-
Method Detail
-
add
public void add(byte[] sequence, int start, int len)
Add a single sequence of bytes to the FSA. The input must be lexicographically greater than any previously added sequence.- Parameters:
sequence
- The array holding input sequence of bytes.start
- Starting offset (inclusive)len
- Length of the input sequence (at least 1 byte).
-
complete
public FSA complete()
- Returns:
- Finalizes the construction of the automaton and returns it.
-
build
public static FSA build(byte[][] input)
Build a minimal, deterministic automaton from a sorted list of byte sequences.- Parameters:
input
- Input sequences to build automaton from.- Returns:
- Returns the automaton encoding all input sequences.
-
build
public static FSA build(java.lang.Iterable<byte[]> input)
Build a minimal, deterministic automaton from an iterable list of byte sequences.- Parameters:
input
- Input sequences to build automaton from.- Returns:
- Returns the automaton encoding all input sequences.
-
getInfo
public java.util.Map<FSABuilder.InfoEntry,java.lang.Object> getInfo()
- Returns:
- Returns various statistics concerning the FSA and its compilation.
- See Also:
FSABuilder.InfoEntry
-
isArcLast
private boolean isArcLast(int arc)
Is this arc the state's last?
-
isArcFinal
private boolean isArcFinal(int arc)
Is this arc final?
-
getArcLabel
private byte getArcLabel(int arc)
Get label's arc.
-
setArcTarget
private void setArcTarget(int arc, int state)
Fills the target state address of an arc.
-
getArcTarget
private int getArcTarget(int arc)
Returns the address of an arc.
-
commonPrefix
private int commonPrefix(byte[] sequence, int start, int len)
- Returns:
- The number of common prefix characters with the previous sequence.
-
freezeState
private int freezeState(int activePathIndex)
Freeze a state: try to find an equivalent state in the interned states dictionary first, if found, return it, otherwise, serialize the mutable state atactivePathIndex
and return it.
-
expandAndRehash
private void expandAndRehash()
Reallocate and rehash the hash set.
-
stateLength
private int stateLength(int state)
The total length of the serialized state data (all arcs).
-
equivalent
private boolean equivalent(int start1, int start2, int len)
Returntrue
if two regions inserialized
are identical.
-
serialize
private int serialize(int activePathIndex)
Serialize a given state on the active path.
-
hash
private int hash(int start, int byteCount)
Hash code of a fragment ofserialized
array.
-
expandActivePath
private void expandActivePath(int size)
Append a new mutable state to the active path.
-
expandBuffers
private void expandBuffers()
Expand internal buffers for the next state.
-
allocateState
private int allocateState(int labels)
Allocate space for a state with the given number of outgoing labels.- Returns:
- state offset
-
setPrevious
private boolean setPrevious(byte[] sequence, int start, int length)
Copycurrent
into an internal buffer.
-
compare
private static int compare(byte[] s1, int start1, int lens1, byte[] s2, int start2, int lens2)
Lexicographic order of input sequences. By default, consistent with the "C" sort (absolute value of bytes, 0-255).
-
-