Class FSABuilder


  • public final class FSABuilder
    extends java.lang.Object
    Fast, memory-conservative finite state automaton builder, returning an in-memory FSA that is a tradeoff between construction speed and memory consumption. Use serializers to compress the returned automaton into more compact form.
    See Also:
    FSASerializer
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  FSABuilder.InfoEntry
      Debug and information constants.
    • Constructor Summary

      Constructors 
      Constructor Description
      FSABuilder()  
      FSABuilder​(int bufferGrowthSize)  
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      void add​(byte[] sequence, int start, int len)
      Add a single sequence of bytes to the FSA.
      private int allocateState​(int labels)
      Allocate space for a state with the given number of outgoing labels.
      static FSA build​(byte[][] input)
      Build a minimal, deterministic automaton from a sorted list of byte sequences.
      static FSA build​(java.lang.Iterable<byte[]> input)
      Build a minimal, deterministic automaton from an iterable list of byte sequences.
      private int commonPrefix​(byte[] sequence, int start, int len)  
      private static int compare​(byte[] s1, int start1, int lens1, byte[] s2, int start2, int lens2)
      Lexicographic order of input sequences.
      FSA complete()  
      private boolean equivalent​(int start1, int start2, int len)
      Return true if two regions in serialized are identical.
      private void expandActivePath​(int size)
      Append a new mutable state to the active path.
      private void expandAndRehash()
      Reallocate and rehash the hash set.
      private void expandBuffers()
      Expand internal buffers for the next state.
      private int freezeState​(int activePathIndex)
      Freeze a state: try to find an equivalent state in the interned states dictionary first, if found, return it, otherwise, serialize the mutable state at activePathIndex and return it.
      private byte getArcLabel​(int arc)
      Get label's arc.
      private int getArcTarget​(int arc)
      Returns the address of an arc.
      java.util.Map<FSABuilder.InfoEntry,​java.lang.Object> getInfo()  
      private int hash​(int start, int byteCount)
      Hash code of a fragment of serialized array.
      private boolean isArcFinal​(int arc)
      Is this arc final?
      private boolean isArcLast​(int arc)
      Is this arc the state's last?
      private int serialize​(int activePathIndex)
      Serialize a given state on the active path.
      private void setArcTarget​(int arc, int state)
      Fills the target state address of an arc.
      private boolean setPrevious​(byte[] sequence, int start, int length)
      Copy current into an internal buffer.
      private int stateLength​(int state)
      The total length of the serialized state data (all arcs).
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • BUFFER_GROWTH_SIZE

        private static final int BUFFER_GROWTH_SIZE
        Internal serialized FSA buffer expand ratio.
        See Also:
        Constant Field Values
      • MAX_LABELS

        private static final int MAX_LABELS
        Maximum number of labels from a single state.
        See Also:
        Constant Field Values
      • LEXICAL_ORDERING

        public static final java.util.Comparator<byte[]> LEXICAL_ORDERING
        A comparator comparing full byte arrays. Unsigned byte comparisons ('C'-locale).
      • bufferGrowthSize

        private final int bufferGrowthSize
        Internal serialized FSA buffer expand ratio.
      • serialized

        private byte[] serialized
        Holds serialized and mutable states. Each state is a sequential list of arcs, the last arc is marked with #BIT_ARC_LAST.
      • size

        private int size
        Number of bytes already taken in serialized. Start from 1 to keep 0 a sentinel value (for the hash set and final state).
      • activePath

        private int[] activePath
        States on the "active path" (still mutable). Values are addresses of each state's first arc.
      • activePathLen

        private int activePathLen
        Current length of the active path.
      • nextArcOffset

        private int[] nextArcOffset
        The next offset at which an arc will be added to the given state on activePath.
      • root

        private int root
        Root state. If negative, the automaton has been built already and cannot be extended.
      • epsilon

        private int epsilon
        An epsilon state. The first and only arc of this state points either to the root or to the terminal state, indicating an empty automaton.
      • hashSet

        private int[] hashSet
        Hash set of state addresses in serialized, hashed by hash(int, int). Zero reserved for an unoccupied slot.
      • hashSize

        private int hashSize
        Number of entries currently stored in hashSet.
      • previous

        private byte[] previous
        Previous sequence added to the automaton in add(byte[], int, int). Used in assertions only.
      • info

        private java.util.TreeMap<FSABuilder.InfoEntry,​java.lang.Object> info
        Information about the automaton and its compilation.
      • previousLength

        private int previousLength
        previous sequence's length, used in assertions only.
      • serializationBufferReallocations

        private int serializationBufferReallocations
        Number of serialization buffer reallocations.
    • Constructor Detail

      • FSABuilder

        public FSABuilder()
      • FSABuilder

        public FSABuilder​(int bufferGrowthSize)
        Parameters:
        bufferGrowthSize - Buffer growth size (in bytes) when constructing the automaton.
    • Method Detail

      • add

        public void add​(byte[] sequence,
                        int start,
                        int len)
        Add a single sequence of bytes to the FSA. The input must be lexicographically greater than any previously added sequence.
        Parameters:
        sequence - The array holding input sequence of bytes.
        start - Starting offset (inclusive)
        len - Length of the input sequence (at least 1 byte).
      • complete

        public FSA complete()
        Returns:
        Finalizes the construction of the automaton and returns it.
      • build

        public static FSA build​(byte[][] input)
        Build a minimal, deterministic automaton from a sorted list of byte sequences.
        Parameters:
        input - Input sequences to build automaton from.
        Returns:
        Returns the automaton encoding all input sequences.
      • build

        public static FSA build​(java.lang.Iterable<byte[]> input)
        Build a minimal, deterministic automaton from an iterable list of byte sequences.
        Parameters:
        input - Input sequences to build automaton from.
        Returns:
        Returns the automaton encoding all input sequences.
      • isArcLast

        private boolean isArcLast​(int arc)
        Is this arc the state's last?
      • isArcFinal

        private boolean isArcFinal​(int arc)
        Is this arc final?
      • getArcLabel

        private byte getArcLabel​(int arc)
        Get label's arc.
      • setArcTarget

        private void setArcTarget​(int arc,
                                  int state)
        Fills the target state address of an arc.
      • getArcTarget

        private int getArcTarget​(int arc)
        Returns the address of an arc.
      • commonPrefix

        private int commonPrefix​(byte[] sequence,
                                 int start,
                                 int len)
        Returns:
        The number of common prefix characters with the previous sequence.
      • freezeState

        private int freezeState​(int activePathIndex)
        Freeze a state: try to find an equivalent state in the interned states dictionary first, if found, return it, otherwise, serialize the mutable state at activePathIndex and return it.
      • expandAndRehash

        private void expandAndRehash()
        Reallocate and rehash the hash set.
      • stateLength

        private int stateLength​(int state)
        The total length of the serialized state data (all arcs).
      • equivalent

        private boolean equivalent​(int start1,
                                   int start2,
                                   int len)
        Return true if two regions in serialized are identical.
      • serialize

        private int serialize​(int activePathIndex)
        Serialize a given state on the active path.
      • hash

        private int hash​(int start,
                         int byteCount)
        Hash code of a fragment of serialized array.
      • expandActivePath

        private void expandActivePath​(int size)
        Append a new mutable state to the active path.
      • expandBuffers

        private void expandBuffers()
        Expand internal buffers for the next state.
      • allocateState

        private int allocateState​(int labels)
        Allocate space for a state with the given number of outgoing labels.
        Returns:
        state offset
      • setPrevious

        private boolean setPrevious​(byte[] sequence,
                                    int start,
                                    int length)
        Copy current into an internal buffer.
      • compare

        private static int compare​(byte[] s1,
                                   int start1,
                                   int lens1,
                                   byte[] s2,
                                   int start2,
                                   int lens2)
        Lexicographic order of input sequences. By default, consistent with the "C" sort (absolute value of bytes, 0-255).