Class GOV4Function<T>

java.lang.Object
it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction<T>
it.unimi.dsi.sux4j.mph.GOV4Function<T>
All Implemented Interfaces:
it.unimi.dsi.fastutil.Function<T,Long>, it.unimi.dsi.fastutil.objects.Object2LongFunction<T>, it.unimi.dsi.fastutil.Size64, Serializable, Function<T,Long>, ToLongFunction<T>

public class GOV4Function<T> extends it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction<T> implements Serializable, it.unimi.dsi.fastutil.Size64
An immutable function stored quasi-succinctly using the Genuzio-Ottaviano-Vigna method to solve F2-linear systems. With respect to a GOV3Function, instances of this class have slightly slower lookups and are slightly slower to build, but use less space.

Instances of this class store a function from keys to values. Keys are provided by an iterable object (whose iterators must return elements in a consistent order), whereas values are provided by a LongIterable. If you do not specify values, each key will be assigned its rank (e.g., its position in iteration order starting from zero).

For convenience, this class provides a main method that reads from standard input a (possibly gzip'd) sequence of newline-separated strings, and writes a serialised function mapping each element of the list to its position, or to a given list of values.

Signing

Optionally, it is possible to sign a GOV4Function. Signing is possible if no list of values has been specified (otherwise, there is no way to associate a key with its signature). A w-bit signature will be associated with each key, so that getLong(Object) will return a default return value (by default, -1) on strings that are not in the original key set. As usual, false positives are possible with probability 2-w.

If you're not interested in the rank of a key, but just to know whether the key was in the original set, you can turn the function into an approximate dictionary. In this case, the value associated by the function with a key is exactly its signature, which means that the only space used by the function is that occupied by signatures: this is one of the fastest and most compact way of storing a static approximate dictionary. In this case, the only returned value is one, and the default return value is set to zero.

Building a function

This class provides a great amount of flexibility when creating a new function; such flexibility is exposed through the builder. To exploit the various possibilities, you must understand some details of the construction.

In a first phase, we build a BucketedHashStore containing hashes of the keys. By default, the store will associate each hash with the rank of the key. If you specify values, the store will associate with each hash the corresponding value.

However, if you further require an indirect construction the store will associate again each hash with the rank of the corresponding key, and access randomly the values (which must be either a LongList or a LongBigList). Indirect construction is useful only in complex, multi-layer hashes (such as an LcpMonotoneMinimalPerfectHashFunction) in which we want to reuse a checked BucketedHashStore. Storing values in the BucketedHashStore is extremely scalable because the values must just be a LongIterable that will be scanned sequentially during the store construction. On the other hand, if you have already a store that associates ordinal positions, and you want to build a new function for which a LongList or LongBigList of values needs little space (e.g., because it is described implicitly), you can opt for an indirect construction using the already built store.

Note that if you specify a store it will be used before building a new one (possibly because of a DuplicateException), with obvious benefits in terms of performance. If the store is not checked, and a DuplicateException is thrown, the constructor will try to rebuild the store, but this requires, of course, that the keys, and possibly the values, are available. Note that it is your responsibility to pass a correct store.

Multithreading

This implementation is multithreaded: each bucket returned by the BucketedHashStore is processed independently. By default, this class uses Runtime.availableProcessors() parallel threads, but by default no more than 4. If you wish to set a specific number of threads, you can do so through the system property "it.unimi.dsi.sux4j.mph.threads"<T>.

Implementation Details

The detail of the data structure can be found in “Fast Scalable Construction of (Minimal Perfect Hash) Functions”, by Marco Genuzio, Giuseppe Ottaviano and Sebastiano Vigna, 15th International Symposium on Experimental Algorithms — SEA 2016, Lecture Notes in Computer Science, Springer, 2016. We generate a random 4-regular linear system on F2, where the known term of the k-th equation is the output value for the k-th key. Then, we solve it and store the solution. Since the system must have ≈3% more variables than equations to be solvable, an r-bit GOV4Function on n keys requires 1.03rn bits.

Since:
4.0.0
Author:
Sebastiano Vigna
See Also:
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static class 
    A builder class for GOV4Function.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    The expected bucket size.
    static double
    The ratio between variables and equations.
    protected final it.unimi.dsi.fastutil.longs.LongBigList
    The final magick—the list of values that define the output of the function.
    protected final long
    The seed used to generate the initial signature.
    protected final long
    The number of variables.
    protected final long
    The number of keys.
    static final String
    The system property used to set the number of parallel threads.
    protected final long[]
    A long containing the start offset of each bucket in the lower 56 bits, and the local seed of each bucket in the upper 8 bits.
    protected final long
    The mask to compare signatures, or zero for no signatures.
    protected final it.unimi.dsi.fastutil.longs.LongBigList
    The signatures.
    protected final it.unimi.dsi.bits.TransformationStrategy<? super T>
    The transformation strategy to turn objects of type T into bit vectors.
    protected final int
    The data width.

    Fields inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction

    defRetValue
  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
    protected
    GOV4Function(Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, int signatureWidth, it.unimi.dsi.fastutil.longs.LongIterable values, int dataWidth, File tempDir, BucketedHashStore<T> bucketedHashStore, boolean indirect)
    Creates a new function for the given keys and values.
  • Method Summary

    Modifier and Type
    Method
    Description
    boolean
     
    void
    dump(String file)
     
    long
     
    long
    getLongBySignature(long[] signature)
    Low-level access to the output of this function.
    static void
    main(String[] arg)
     
    long
    Returns the number of bits used by this structure.
    int
    Deprecated.
    long
    Returns the number of keys in the function domain.

    Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction

    defaultReturnValue, defaultReturnValue

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface it.unimi.dsi.fastutil.Function

    apply, clear

    Methods inherited from interface java.util.function.Function

    compose

    Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction

    andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, get, getOrDefault, getOrDefault, put, put, remove, removeLong
  • Field Details

    • C

      public static double C
      The ratio between variables and equations.
    • NUMBER_OF_THREADS_PROPERTY

      public static final String NUMBER_OF_THREADS_PROPERTY
      The system property used to set the number of parallel threads.
      See Also:
    • BUCKET_SIZE

      public static final int BUCKET_SIZE
      The expected bucket size.
      See Also:
    • n

      protected final long n
      The number of keys.
    • m

      protected final long m
      The number of variables.
    • width

      protected final int width
      The data width.
    • globalSeed

      protected final long globalSeed
      The seed used to generate the initial signature.
    • offsetAndSeed

      protected final long[] offsetAndSeed
      A long containing the start offset of each bucket in the lower 56 bits, and the local seed of each bucket in the upper 8 bits.
    • data

      protected final it.unimi.dsi.fastutil.longs.LongBigList data
      The final magick—the list of values that define the output of the function.
    • transform

      protected final it.unimi.dsi.bits.TransformationStrategy<? super T> transform
      The transformation strategy to turn objects of type T into bit vectors.
    • signatureMask

      protected final long signatureMask
      The mask to compare signatures, or zero for no signatures.
    • signatures

      protected final it.unimi.dsi.fastutil.longs.LongBigList signatures
      The signatures.
  • Constructor Details

    • GOV4Function

      protected GOV4Function(Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, int signatureWidth, it.unimi.dsi.fastutil.longs.LongIterable values, int dataWidth, File tempDir, BucketedHashStore<T> bucketedHashStore, boolean indirect) throws IOException
      Creates a new function for the given keys and values.
      Parameters:
      keys - the keys in the domain of the function, or null.
      transform - a transformation strategy for the keys.
      signatureWidth - a positive number for a signature width, 0 for no signature, a negative value for a self-signed function; if nonzero, values must be null and width must be -1.
      values - values to be assigned to each element, in the same order of the iterator returned by keys; if null, the assigned value will the ordinal number of each element.
      dataWidth - the bit width of the values, or -1 if values is null.
      tempDir - a temporary directory for the store files, or null for the standard temporary directory.
      bucketedHashStore - a bucketed hash store containing the keys associated with their ranks (if there are no values, or indirect is true) or values, or null; the store can be unchecked, but in this case keys and transform must be non-null.
      indirect - if true, bucketedHashStore contains ordinal positions, and values is a LongIterable that must be accessed to retrieve the actual values.
      Throws:
      IOException
  • Method Details

    • getLong

      public long getLong(Object o)
      Specified by:
      getLong in interface it.unimi.dsi.fastutil.objects.Object2LongFunction<T>
    • getLongBySignature

      public long getLongBySignature(long[] signature)
      Low-level access to the output of this function.

      This method makes it possible to build several kind of functions on the same BucketedHashStore and then retrieve the resulting values by generating a single signature. The method TwoStepsGOV3Function.getLong(Object) is a good example of this technique.

      Parameters:
      signature - a signature generated as documented in BucketedHashStore.
      Returns:
      the output of the function.
    • size64

      public long size64()
      Returns the number of keys in the function domain.
      Specified by:
      size64 in interface it.unimi.dsi.fastutil.Size64
      Returns:
      the number of the keys in the function domain.
    • size

      @Deprecated public int size()
      Deprecated.
      Specified by:
      size in interface it.unimi.dsi.fastutil.Function<T,Long>
      Specified by:
      size in interface it.unimi.dsi.fastutil.Size64
    • numBits

      public long numBits()
      Returns the number of bits used by this structure.
      Returns:
      the number of bits used by this structure.
    • containsKey

      public boolean containsKey(Object o)
      Specified by:
      containsKey in interface it.unimi.dsi.fastutil.Function<T,Long>
    • dump

      public void dump(String file) throws IOException
      Throws:
      IOException
    • main

      public static void main(String[] arg) throws NoSuchMethodException, IOException, com.martiansoftware.jsap.JSAPException
      Throws:
      NoSuchMethodException
      IOException
      com.martiansoftware.jsap.JSAPException