Class GOVMinimalPerfectHashFunction128<T>

  • All Implemented Interfaces:
    it.unimi.dsi.fastutil.Function<T,​java.lang.Long>, it.unimi.dsi.fastutil.objects.Object2LongFunction<T>, it.unimi.dsi.fastutil.Size64, java.io.Serializable, java.util.function.Function<T,​java.lang.Long>, java.util.function.ToLongFunction<T>

    public class GOVMinimalPerfectHashFunction128<T>
    extends AbstractHashFunction<T>
    implements java.io.Serializable
    A minimal perfect hash function stored using the Genuzio-Ottaviano-Vigna 3-regular F3-linear system technique. It is the fastest minimal perfect hash function available with space close to 2 bits per key.

    Given a list of keys without duplicates, the builder of this class finds a minimal perfect hash function for the list. Subsequent calls to the getLong(Object) method will return a distinct number for each key in the list. For keys out of the list, the resulting number is not specified. In some (rare) cases it might be possible to establish that a key was not in the original list, and in that case -1 will be returned; by signing the function (see below), you can guarantee with a prescribed probability that -1 will be returned on keys not in the original list. The class can then be saved by serialisation and reused later.

    This class uses a bucketed hash store to provide highly scalable construction. Note that at construction time you can pass a BucketedHashStore containing the keys (associated with any value); however, if the store is rebuilt because of a BucketedHashStore.DuplicateException it will be rebuilt associating with each key its ordinal position.

    For convenience, this class provides a main method that reads from standard input a (possibly gzip'd) sequence of newline-separated strings, and writes a serialised minimal perfect hash function for the given list.

    Signing

    Optionally, it is possible to sign the minimal perfect hash function. A w-bit signature will be associated with each key, so that getLong(Object) will return -1 on strings that are not in the original key set. As usual, false positives are possible with probability 2-w.

    Multithreading

    This implementation is multithreaded: each bucket returned by the BucketedHashStore is processed independently. By default, this class uses Runtime.availableProcessors() parallel threads, but by default no more than 4. If you wish to set a specific number of threads, you can do so through the system property "it.unimi.dsi.sux4j.mph.threads".

    How it Works

    The detail of the data structure can be found in “Fast Scalable Construction of (Minimal Perfect Hash) Functions”, by Marco Genuzio, Giuseppe Ottaviano and Sebastiano Vigna, 15th International Symposium on Experimental Algorithms — SEA 2016, Lecture Notes in Computer Science, Springer, 2016. We generate a random 3-regular hypergraph and give it an orientation. From the orientation, we generate a random linear system on F3, where the variables in the k-th equation are the vertices of the k-th hyperedge, and the known term of the k-th equation is the vertex giving orientation to the k-th hyperedge. Then, we solve the system and store the solution, which provides a perfect hash function.

    To obtain a minimal perfect hash function, we simply notice that we whenever we have to assign a value to a vertex, we can take care of using the number 3 instead of 0 if the vertex is actually the output value for some key. The final value of the minimal perfect hash function is the number of nonzero pairs of bits that precede the perfect hash value for the key. To compute this number, we use use in each bucket broadword programming. Since the system must have ≈10% more variables than equations to be solvable, a GOVMinimalPerfectHashFunction128 on n keys requires 2.2n bits.

    Since:
    4.0.0
    Author:
    Sebastiano Vigna
    See Also:
    Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected long[] array
      The bit array supporting bitVector.
      protected it.unimi.dsi.bits.LongArrayBitVector bitVector
      The bit vector underlying values.
      static int BUCKET_SIZE
      The expected bucket size.
      protected long[] edgeOffsetAndSeed
      A long containing the cumulating function of the bucket edges (i.e., keys) in the lower 56 bits, and the local seed of each bucket in the upper 8 bits.
      protected long globalSeed
      The seed used to generate the initial signature.
      protected long n
      The number of keys.
      static java.lang.String NUMBER_OF_THREADS_PROPERTY
      The system property used to set the number of parallel threads.
      static long serialVersionUID  
      protected long signatureMask
      The mask to compare signatures, or zero for no signatures.
      protected it.unimi.dsi.fastutil.longs.LongBigList signatures
      The signatures.
      protected it.unimi.dsi.bits.TransformationStrategy<? super T> transform
      The transformation strategy.
      protected it.unimi.dsi.fastutil.longs.LongBigList values
      The final magick—the list of modulo-3 values that define the output of the minimal perfect hash function.
      • Fields inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction

        defRetValue
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      protected GOVMinimalPerfectHashFunction128​(java.lang.Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, int signatureWidth, java.io.File tempDir, BucketedHashStore<T> bucketedHashStore)
      Creates a new minimal perfect hash function for the given keys.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      static int countNonzeroPairs​(long x)
      Counts the number of nonzero pairs of bits in a long.
      void dump​(java.lang.String file)  
      long getLong​(java.lang.Object key)  
      long getLongBySignature​(long[] signature)
      Low-level access to the output of this minimal perfect hash function.
      static void main​(java.lang.String[] arg)  
      long numBits()
      Returns the number of bits used by this structure.
      long size64()  
      protected static long vertexOffset​(long edgeOffsetSeed)  
      • Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction

        defaultReturnValue, defaultReturnValue
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • Methods inherited from interface it.unimi.dsi.fastutil.Function

        apply, clear
      • Methods inherited from interface java.util.function.Function

        compose
      • Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction

        andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, get, getOrDefault, getOrDefault, put, put, remove, removeLong
    • Field Detail

      • NUMBER_OF_THREADS_PROPERTY

        public static final java.lang.String NUMBER_OF_THREADS_PROPERTY
        The system property used to set the number of parallel threads.
        See Also:
        Constant Field Values
      • BUCKET_SIZE

        public static final int BUCKET_SIZE
        The expected bucket size.
        See Also:
        Constant Field Values
      • n

        protected final long n
        The number of keys.
      • globalSeed

        protected final long globalSeed
        The seed used to generate the initial signature.
      • edgeOffsetAndSeed

        protected final long[] edgeOffsetAndSeed
        A long containing the cumulating function of the bucket edges (i.e., keys) in the lower 56 bits, and the local seed of each bucket in the upper 8 bits. The method vertexOffset(long) returns the bucket (i.e., vertex) cumulative value starting from the edge cumulative value.
      • values

        protected final it.unimi.dsi.fastutil.longs.LongBigList values
        The final magick—the list of modulo-3 values that define the output of the minimal perfect hash function.
      • bitVector

        protected final it.unimi.dsi.bits.LongArrayBitVector bitVector
        The bit vector underlying values.
      • array

        protected transient long[] array
        The bit array supporting bitVector.
      • transform

        protected final it.unimi.dsi.bits.TransformationStrategy<? super T> transform
        The transformation strategy.
      • signatureMask

        protected final long signatureMask
        The mask to compare signatures, or zero for no signatures.
      • signatures

        protected final it.unimi.dsi.fastutil.longs.LongBigList signatures
        The signatures.
    • Constructor Detail

      • GOVMinimalPerfectHashFunction128

        protected GOVMinimalPerfectHashFunction128​(java.lang.Iterable<? extends T> keys,
                                                   it.unimi.dsi.bits.TransformationStrategy<? super T> transform,
                                                   int signatureWidth,
                                                   java.io.File tempDir,
                                                   BucketedHashStore<T> bucketedHashStore)
                                            throws java.io.IOException
        Creates a new minimal perfect hash function for the given keys.
        Parameters:
        keys - the keys to hash, or null.
        transform - a transformation strategy for the keys.
        signatureWidth - a signature width, or 0 for no signature.
        tempDir - a temporary directory for the store files, or null for the standard temporary directory.
        bucketedHashStore - a bucketed hash store containing the keys, or null; the store can be unchecked, but in this case keys and transform must be non-null.
        Throws:
        java.io.IOException
    • Method Detail

      • countNonzeroPairs

        public static final int countNonzeroPairs​(long x)
        Counts the number of nonzero pairs of bits in a long.
        Parameters:
        x - a long.
        Returns:
        the number of nonzero bit pairs in x.
      • vertexOffset

        protected static long vertexOffset​(long edgeOffsetSeed)
      • numBits

        public long numBits()
        Returns the number of bits used by this structure.
        Returns:
        the number of bits used by this structure.
      • getLong

        public long getLong​(java.lang.Object key)
        Specified by:
        getLong in interface it.unimi.dsi.fastutil.objects.Object2LongFunction<T>
      • getLongBySignature

        public long getLongBySignature​(long[] signature)
        Low-level access to the output of this minimal perfect hash function.

        This method makes it possible to build several kind of functions on the same BucketedHashStore and then retrieve the resulting values by generating a single signature. The method TwoStepsGOV3Function.getLong(Object) is a good example of this technique.

        Parameters:
        signature - a signature generated as documented in BucketedHashStore.
        Returns:
        the output of the function.
      • size64

        public long size64()
        Specified by:
        size64 in interface it.unimi.dsi.fastutil.Size64
        Overrides:
        size64 in class AbstractHashFunction<T>
      • dump

        public void dump​(java.lang.String file)
                  throws java.io.IOException
        Throws:
        java.io.IOException
      • main

        public static void main​(java.lang.String[] arg)
                         throws java.io.IOException,
                                com.martiansoftware.jsap.JSAPException
        Throws:
        java.io.IOException
        com.martiansoftware.jsap.JSAPException