Class MinimalPerfectHashFunction<T>

  • All Implemented Interfaces:
    it.unimi.dsi.fastutil.Function<T,​java.lang.Long>, it.unimi.dsi.fastutil.objects.Object2LongFunction<T>, it.unimi.dsi.fastutil.Size64, java.io.Serializable, java.util.function.Function<T,​java.lang.Long>, java.util.function.ToLongFunction<T>

    @Deprecated
    public class MinimalPerfectHashFunction<T>
    extends AbstractHashFunction<T>
    implements java.io.Serializable
    Deprecated.
    A minimal perfect hash function.

    Given a list of keys without duplicates, the builder of this class finds a minimal perfect hash function for the list. Subsequent calls to the getLong(Object) method will return a distinct number for each key in the list. For keys out of the list, the resulting number is not specified. In some (rare) cases it might be possible to establish that a key was not in the original list, and in that case -1 will be returned; by signing the function (see below), you can guarantee with a prescribed probability that -1 will be returned on keys not in the original list. The class can then be saved by serialisation and reused later.

    This class uses a chunked hash store to provide highly scalable construction. Note that at construction time you can pass a it.unimi.dsi.sux4j.io.ChunkedHashStore containing the keys (associated with any value); however, if the store is rebuilt because of a ChunkedHashStore.DuplicateException it will be rebuilt associating with each key its ordinal position.

    The theoretical memory requirements for the algorithm we use are 2γ=2.46 + o(n) bits per key, plus the bits for the random hashes (which are usually negligible). The o(n) part is due to an embedded ranking scheme that increases space occupancy by 6.25%, bringing the actual occupied space to around 2.68 bits per key.

    For convenience, this class provides a main method that reads from standard input a (possibly gzip'd) sequence of newline-separated strings, and writes a serialised minimal perfect hash function for the given list.

    Signing

    Optionally, it is possible to sign the minimal perfect hash function. A w-bit signature will be associated with each key, so that getLong(Object) will return -1 on strings that are not in the original key set. As usual, false positives are possible with probability 2-w.

    How it Works

    The technique used is very similar (but developed independently) to that described by Fabiano C. Botelho, Rasmus Pagh and Nivio Ziviani in “Simple and Efficient Minimal Perfect Hashing Functions”, Algorithms and data structures: 10th international workshop, WADS 2007, number 4619 of Lecture Notes in Computer Science, pages 139−150, 2007. In turn, the mapping technique described therein was actually first proposed by Bernard Chazelle, Joe Kilian, Ronitt Rubinfeld and Ayellet Tal in “The Bloomier Filter: an Efficient Data Structure for Static Support Lookup Tables”, Proc. SODA 2004, pages 30−39, 2004, as one of the steps to implement a mutable table.

    The basic ingredient is the Majewski-Wormald-Havas-Czech 3-hypergraph technique. After generating a random 3-hypergraph, we sort its 3-hyperedges to that a distinguished vertex in each 3-hyperedge, the hinge, never appeared before. We then assign to each vertex a two-bit number in such a way that for each 3-hyperedge the sum of the values associated to its vertices modulo 3 gives the index of the hash function generating the hinge. As as a result we obtain a perfect hash of the original set (one just has to compute the three hash functions, collect the three two-bit values, add them modulo 3 and take the corresponding hash function value).

    To obtain a minimal perfect hash, we simply notice that we whenever we have to assign a value to a vertex, we can take care of using the number 3 instead of 0 if the vertex is actually the output value for some key. The final value of the minimal perfect hash function is the number of nonzero pairs of bits that precede the perfect hash value for the key. To compute this number, we use a simple table-free ranking scheme, recording the number of nonzero pairs each BITS_PER_BLOCK bits and using Long.bitCount(long) to count the number of nonzero pairs of bits in a word.

    Since:
    0.1
    Author:
    Sebastiano Vigna
    See Also:
    Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected long[] array
      Deprecated.
      The bit array supporting values.
      static int BITS_PER_BLOCK
      Deprecated.
      The number of bits per block in the rank structure.
      protected it.unimi.dsi.bits.LongArrayBitVector bitVector
      Deprecated.
      The bit vector underlying values.
      protected long[] count
      Deprecated.
      The array of counts for blocks and superblocks.
      protected long globalSeed
      Deprecated.
      The seed used to generate the initial hash triple.
      static int LOG2_CHUNK_SIZE
      Deprecated.
      The logarithm of the desired chunk size.
      protected long n
      Deprecated.
      The number of keys.
      protected long[] offset
      Deprecated.
      The start offset of each chunk.
      protected long[] seed
      Deprecated.
      The seed of the underlying 3-hypergraphs.
      static long serialVersionUID
      Deprecated.
       
      protected long signatureMask
      Deprecated.
      The mask to compare signatures, or zero for no signatures.
      protected it.unimi.dsi.fastutil.longs.LongBigList signatures
      Deprecated.
      The signatures.
      protected it.unimi.dsi.bits.TransformationStrategy<? super T> transform
      Deprecated.
      The transformation strategy.
      protected it.unimi.dsi.fastutil.longs.LongBigList values
      Deprecated.
      The final magick—the list of modulo-3 values that define the output of the minimal perfect hash function.
      • Fields inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction

        defRetValue
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      protected MinimalPerfectHashFunction​(java.lang.Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, int signatureWidth, java.io.File tempDir, ChunkedHashStore<T> chunkedHashStore)
      Deprecated.
      Creates a new minimal perfect hash function for the given keys.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods Deprecated Methods 
      Modifier and Type Method Description
      static int countNonzeroPairs​(long x)
      Deprecated.
      Counts the number of nonzero pairs of bits in a long.
      long getLong​(java.lang.Object key)
      Deprecated.
       
      long getLongBySignature​(long[] triple)
      Deprecated.
      Low-level access to the output of this minimal perfect hash function.
      static void main​(java.lang.String[] arg)
      Deprecated.
       
      long numBits()
      Deprecated.
      Returns the number of bits used by this structure.
      long size64()
      Deprecated.
       
      • Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction

        defaultReturnValue, defaultReturnValue
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • Methods inherited from interface it.unimi.dsi.fastutil.Function

        apply, clear
      • Methods inherited from interface java.util.function.Function

        compose
      • Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction

        andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, get, getOrDefault, getOrDefault, put, put, remove, removeLong
    • Field Detail

      • serialVersionUID

        public static final long serialVersionUID
        Deprecated.
        See Also:
        Constant Field Values
      • BITS_PER_BLOCK

        public static final int BITS_PER_BLOCK
        Deprecated.
        The number of bits per block in the rank structure.
        See Also:
        Constant Field Values
      • LOG2_CHUNK_SIZE

        public static final int LOG2_CHUNK_SIZE
        Deprecated.
        The logarithm of the desired chunk size.
        See Also:
        Constant Field Values
      • n

        protected final long n
        Deprecated.
        The number of keys.
      • globalSeed

        protected final long globalSeed
        Deprecated.
        The seed used to generate the initial hash triple.
      • seed

        protected final long[] seed
        Deprecated.
        The seed of the underlying 3-hypergraphs.
      • offset

        protected final long[] offset
        Deprecated.
        The start offset of each chunk.
      • values

        protected final it.unimi.dsi.fastutil.longs.LongBigList values
        Deprecated.
        The final magick—the list of modulo-3 values that define the output of the minimal perfect hash function.
      • bitVector

        protected final it.unimi.dsi.bits.LongArrayBitVector bitVector
        Deprecated.
        The bit vector underlying values.
      • array

        protected transient long[] array
        Deprecated.
        The bit array supporting values.
      • transform

        protected final it.unimi.dsi.bits.TransformationStrategy<? super T> transform
        Deprecated.
        The transformation strategy.
      • signatureMask

        protected final long signatureMask
        Deprecated.
        The mask to compare signatures, or zero for no signatures.
      • signatures

        protected final it.unimi.dsi.fastutil.longs.LongBigList signatures
        Deprecated.
        The signatures.
      • count

        protected final long[] count
        Deprecated.
        The array of counts for blocks and superblocks.
    • Constructor Detail

      • MinimalPerfectHashFunction

        protected MinimalPerfectHashFunction​(java.lang.Iterable<? extends T> keys,
                                             it.unimi.dsi.bits.TransformationStrategy<? super T> transform,
                                             int signatureWidth,
                                             java.io.File tempDir,
                                             ChunkedHashStore<T> chunkedHashStore)
                                      throws java.io.IOException
        Deprecated.
        Creates a new minimal perfect hash function for the given keys.
        Parameters:
        keys - the keys to hash, or null.
        transform - a transformation strategy for the keys.
        signatureWidth - a signature width, or 0 for no signature.
        tempDir - a temporary directory for the store files, or null for the standard temporary directory.
        chunkedHashStore - a chunked hash store containing the keys, or null; the store can be unchecked, but in this case keys and transform must be non-null.
        Throws:
        java.io.IOException
    • Method Detail

      • countNonzeroPairs

        public static int countNonzeroPairs​(long x)
        Deprecated.
        Counts the number of nonzero pairs of bits in a long.
        Parameters:
        x - a long.
        Returns:
        the number of nonzero bit pairs in x.
      • numBits

        public long numBits()
        Deprecated.
        Returns the number of bits used by this structure.
        Returns:
        the number of bits used by this structure.
      • getLong

        public long getLong​(java.lang.Object key)
        Deprecated.
        Specified by:
        getLong in interface it.unimi.dsi.fastutil.objects.Object2LongFunction<T>
      • getLongBySignature

        public long getLongBySignature​(long[] triple)
        Deprecated.
        Low-level access to the output of this minimal perfect hash function.

        This method makes it possible to build several kind of functions on the same ChunkedHashStore and then retrieve the resulting values by generating a single triple of hashes. The method TwoStepsMWHCFunction.getLong(Object) is a good example of this technique.

        Parameters:
        triple - a triple generated as documented in ChunkedHashStore.
        Returns:
        the output of the function.
      • size64

        public long size64()
        Deprecated.
        Specified by:
        size64 in interface it.unimi.dsi.fastutil.Size64
        Overrides:
        size64 in class AbstractHashFunction<T>
      • main

        public static void main​(java.lang.String[] arg)
                         throws java.lang.NoSuchMethodException,
                                java.io.IOException,
                                com.martiansoftware.jsap.JSAPException
        Deprecated.
        Throws:
        java.lang.NoSuchMethodException
        java.io.IOException
        com.martiansoftware.jsap.JSAPException