Class CFSA

  • All Implemented Interfaces:
    java.lang.Iterable<java.nio.ByteBuffer>

    public final class CFSA
    extends FSA
    CFSA (Compact Finite State Automaton) binary format implementation. This is a slightly reorganized version of FSA5 offering smaller automata size at some (minor) performance penalty.

    Note: Serialize to CFSA2 for new code.

    The encoding of automaton body is as follows.

     ---- FSA header (standard)
     Byte                            Description 
           +-+-+-+-+-+-+-+-+\
         0 | | | | | | | | | +------ '\'
           +-+-+-+-+-+-+-+-+/
           +-+-+-+-+-+-+-+-+\
         1 | | | | | | | | | +------ 'f'
           +-+-+-+-+-+-+-+-+/
           +-+-+-+-+-+-+-+-+\
         2 | | | | | | | | | +------ 's'
           +-+-+-+-+-+-+-+-+/
           +-+-+-+-+-+-+-+-+\
         3 | | | | | | | | | +------ 'a'
           +-+-+-+-+-+-+-+-+/
           +-+-+-+-+-+-+-+-+\
         4 | | | | | | | | | +------ version (fixed 0xc5)
           +-+-+-+-+-+-+-+-+/
           +-+-+-+-+-+-+-+-+\
         5 | | | | | | | | | +------ filler character
           +-+-+-+-+-+-+-+-+/
           +-+-+-+-+-+-+-+-+\
         6 | | | | | | | | | +------ annot character
           +-+-+-+-+-+-+-+-+/
           +-+-+-+-+-+-+-+-+\
         7 |C|C|C|C|G|G|G|G| +------ C - node data size (ctl), G - address size (gotoLength)
           +-+-+-+-+-+-+-+-+/
           +-+-+-+-+-+-+-+-+\
      8-32 | | | | | | | | | +------ labels mapped for type (1) of arc encoding. 
           : : : : : : : : : |
           +-+-+-+-+-+-+-+-+/
     
     ---- Start of a node; only if automaton was compiled with NUMBERS option.
     
     Byte
            +-+-+-+-+-+-+-+-+\
          0 | | | | | | | | | \  LSB
            +-+-+-+-+-+-+-+-+  +
          1 | | | | | | | | |  |      number of strings recognized
            +-+-+-+-+-+-+-+-+  +----- by the automaton starting
            : : : : : : : : :  |      from this node.
            +-+-+-+-+-+-+-+-+  +
      ctl-1 | | | | | | | | | /  MSB
            +-+-+-+-+-+-+-+-+/
            
     ---- A vector of node's arcs. Conditional format, depending on flags.
     
     1) NEXT bit set, mapped arc label. 
     
                    +--------------- arc's label mapped in M bits if M's field value > 0
                    | +------------- node pointed to is next
                    | | +----------- the last arc of the node
             _______| | | +--------- the arc is final
            /       | | | |
           +-+-+-+-+-+-+-+-+\
         0 |M|M|M|M|M|1|L|F| +------ flags + (M) index of the mapped label.
           +-+-+-+-+-+-+-+-+/
     
     2) NEXT bit set, label separate.
     
                    +--------------- arc's label stored separately (M's field is zero).
                    | +------------- node pointed to is next
                    | | +----------- the last arc of the node
                    | | | +--------- the arc is final
                    | | | |
           +-+-+-+-+-+-+-+-+\
         0 |0|0|0|0|0|1|L|F| +------ flags
           +-+-+-+-+-+-+-+-+/
           +-+-+-+-+-+-+-+-+\
         1 | | | | | | | | | +------ label
           +-+-+-+-+-+-+-+-+/
     
     3) NEXT bit not set. Full arc.
     
                      +------------- node pointed to is next
                      | +----------- the last arc of the node
                      | | +--------- the arc is final
                      | | |
           +-+-+-+-+-+-+-+-+\
         0 |A|A|A|A|A|0|L|F| +------ flags + (A) address field, lower bits
           +-+-+-+-+-+-+-+-+/
           +-+-+-+-+-+-+-+-+\
         1 | | | | | | | | | +------ label
           +-+-+-+-+-+-+-+-+/
           : : : : : : : : :       
           +-+-+-+-+-+-+-+-+\
     gtl-1 |A|A|A|A|A|A|A|A| +------ address, continuation (MSB)
           +-+-+-+-+-+-+-+-+/
     
    • Field Summary

      Fields 
      Modifier and Type Field Description
      byte[] arcs
      An array of bytes with the internal representation of the automaton.
      static int BIT_FINAL_ARC
      Bitmask indicating that an arc corresponds to the last character of a sequence available when building the automaton.
      static int BIT_LAST_ARC
      Bitmask indicating that an arc is the last one of the node's list and the following one belongs to another node.
      static int BIT_TARGET_NEXT
      Bitmask indicating that the target node of this arc follows it in the compressed automaton structure (no goto field).
      private java.util.Set<FSAFlags> flags
      Flags for this automaton version.
      int gtl
      Number of bytes each address takes in full, expanded form (goto length).
      byte[] labelMapping
      Label mapping for arcs of type (1) (see class documentation).
      int nodeDataLength
      The length of the node header structure (if the automaton was compiled with NUMBERS option).
      static byte VERSION
      Automaton header version value.
    • Constructor Summary

      Constructors 
      Constructor Description
      CFSA​(java.io.InputStream stream)
      Creates a new automaton, reading it from a file in FSA format, version 5.
    • Field Detail

      • VERSION

        public static final byte VERSION
        Automaton header version value.
        See Also:
        Constant Field Values
      • BIT_FINAL_ARC

        public static final int BIT_FINAL_ARC
        Bitmask indicating that an arc corresponds to the last character of a sequence available when building the automaton.
        See Also:
        Constant Field Values
      • BIT_LAST_ARC

        public static final int BIT_LAST_ARC
        Bitmask indicating that an arc is the last one of the node's list and the following one belongs to another node.
        See Also:
        Constant Field Values
      • BIT_TARGET_NEXT

        public static final int BIT_TARGET_NEXT
        Bitmask indicating that the target node of this arc follows it in the compressed automaton structure (no goto field).
        See Also:
        Constant Field Values
      • arcs

        public byte[] arcs
        An array of bytes with the internal representation of the automaton. Please see the documentation of this class for more information on how this structure is organized.
      • nodeDataLength

        public final int nodeDataLength
        The length of the node header structure (if the automaton was compiled with NUMBERS option). Otherwise zero.
      • flags

        private final java.util.Set<FSAFlags> flags
        Flags for this automaton version.
      • gtl

        public final int gtl
        Number of bytes each address takes in full, expanded form (goto length).
      • labelMapping

        public final byte[] labelMapping
        Label mapping for arcs of type (1) (see class documentation). The array is indexed by mapped label's value and contains the original label.
    • Constructor Detail

      • CFSA

        CFSA​(java.io.InputStream stream)
        throws java.io.IOException
        Creates a new automaton, reading it from a file in FSA format, version 5.
        Throws:
        java.io.IOException
    • Method Detail

      • getRootNode

        public int getRootNode()
        Returns the start node of this automaton. May return 0 if the start node is also an end node.
        Specified by:
        getRootNode in class FSA
        Returns:
        Returns the identifier of the root node of this automaton. Returns 0 if the start node is also the end node (the automaton is empty).
      • getFirstArc

        public final int getFirstArc​(int node)
        Specified by:
        getFirstArc in class FSA
        Parameters:
        node - Identifier of the node.
        Returns:
        Returns the identifier of the first arc leaving node or 0 if the node has no outgoing arcs.
      • getNextArc

        public final int getNextArc​(int arc)
        Specified by:
        getNextArc in class FSA
        Parameters:
        arc - The arc's identifier.
        Returns:
        Returns the identifier of the next arc after arc and leaving node. Zero is returned if no more arcs are available for the node.
      • getArc

        public int getArc​(int node,
                          byte label)
        Specified by:
        getArc in class FSA
        Parameters:
        node - Identifier of the node.
        label - The arc's label.
        Returns:
        Returns the identifier of an arc leaving node and labeled with label. An identifier equal to 0 means the node has no outgoing arc labeled label.
      • getEndNode

        public int getEndNode​(int arc)
        Specified by:
        getEndNode in class FSA
        Parameters:
        arc - The arc's identifier.
        Returns:
        Return the end node pointed to by a given arc. Terminal arcs (those that point to a terminal state) have no end node representation and throw a runtime exception.
      • getArcLabel

        public byte getArcLabel​(int arc)
        Specified by:
        getArcLabel in class FSA
        Parameters:
        arc - The arc's identifier.
        Returns:
        Return the label associated with a given arc.
      • getRightLanguageCount

        public int getRightLanguageCount​(int node)
        Overrides:
        getRightLanguageCount in class FSA
        Parameters:
        node - Identifier of the node.
        Returns:
        Returns the number of sequences reachable from the given state if the automaton was compiled with FSAFlags.NUMBERS. The size of the right language of the state, in other words.
      • isArcFinal

        public boolean isArcFinal​(int arc)
        Specified by:
        isArcFinal in class FSA
        Parameters:
        arc - The arc's identifier.
        Returns:
        Returns true if the destination node at the end of this arc corresponds to an input sequence created when building this automaton.
      • isArcTerminal

        public boolean isArcTerminal​(int arc)
        Specified by:
        isArcTerminal in class FSA
        Parameters:
        arc - The arc's identifier.
        Returns:
        Returns true if this arc does not have a terminating node (@link FSA.getEndNode(int) will throw an exception). Implies FSA.isArcFinal(int).
      • isArcLast

        public boolean isArcLast​(int arc)
        Returns true if this arc has NEXT bit set.
        Parameters:
        arc - The node's arc identifier.
        Returns:
        Returns true if the argument is the last arc of a node.
        See Also:
        BIT_LAST_ARC
      • isNextSet

        public boolean isNextSet​(int arc)
        Parameters:
        arc - The node's arc identifier.
        Returns:
        Returns true if BIT_TARGET_NEXT is set for this arc.
        See Also:
        BIT_TARGET_NEXT
      • isLabelCompressed

        public boolean isLabelCompressed​(int arc)
        Parameters:
        arc - The node's arc identifier.
        Returns:
        Returns true if the label is compressed inside flags byte.
      • getFlags

        public java.util.Set<FSAFlags> getFlags()

        For this automaton version, an additional FSAFlags.NUMBERS flag may be set to indicate the automaton contains extra fields for each node.

        Specified by:
        getFlags in class FSA
        Returns:
        Returns a set of flags for this FSA instance.
      • getDestinationNodeOffset

        final int getDestinationNodeOffset​(int arc)
        Returns the address of the node pointed to by this arc.
      • skipArc

        private int skipArc​(int offset)
        Read the arc's layout and skip as many bytes, as needed, to skip it.