Class CHDMinimalPerfectHashFunction<T>

java.lang.Object
it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction<T>
it.unimi.dsi.sux4j.mph.AbstractHashFunction<T>
it.unimi.dsi.sux4j.mph.CHDMinimalPerfectHashFunction<T>
All Implemented Interfaces:
it.unimi.dsi.fastutil.Function<T,Long>, it.unimi.dsi.fastutil.objects.Object2LongFunction<T>, it.unimi.dsi.fastutil.Size64, Serializable, Function<T,Long>, ToLongFunction<T>

public class CHDMinimalPerfectHashFunction<T> extends AbstractHashFunction<T> implements Serializable
A minimal perfect hash function implemented using the “hash, displace and compress” technique.

Given a list of keys without duplicates, the builder of this class finds a minimal perfect hash function for the list. Subsequent calls to the getLong(Object) method will return a distinct number for each key in the list. For keys out of the list, the resulting number is not specified. In some (rare) cases it might be possible to establish that a key was not in the original list, and in that case -1 will be returned; by signing the function (see below), you can guarantee with a prescribed probability that -1 will be returned on keys not in the original list. The class can then be saved by serialisation and reused later.

This class uses a chunked hash store to provide highly scalable construction. Note that at construction time you can pass a it.unimi.dsi.sux4j.io.ChunkedHashStore containing the keys (associated with any value); however, if the store is rebuilt because of a DuplicateException it will be rebuilt associating with each key its ordinal position.

The memory requirements for the algorithm we use are ≈2 bits per key for load factor equal to one and λ = 5. Thus, this class can use ≈10% less memory than a GOVMinimalPerfectHashFunction.

However, its construction time is an order of magnitude larger, and query time is about 50% slower. Different tradeoffs between construction time, query time and space can be obtained by tweaking the load factor and the parameter λ (see the paper below for their exact meaning).

For convenience, this class provides a main method that reads from standard input a (possibly gzip'd) sequence of newline-separated strings, and writes a serialised minimal perfect hash function for the given list.

Signing

Optionally, it is possible to sign the minimal perfect hash function. A w-bit signature will be associated with each key, so that getLong(Object) will return -1 on strings that are not in the original key set. As usual, false positives are possible with probability 2-w.

How it Works

The technique used is described by Djamal Belazzougui, Fabiano C. Botelho and Martin Dietzfelbinger in “Hash, displace and compress”, Algorithms - ESA 2009, LNCS 5757, pages 682−693, 2009. However, with respect to the algorithm described in the paper, this implementation is much more scalable, as it uses a ChunkedHashStore to split the generation of large key sets into generation of smaller functions for each chunk (of size approximately 216<T>).

Since:
3.2.0
Author:
Sebastiano Vigna
See Also:
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static class 
    A builder class for CHDMinimalPerfectHashFunction.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected final EliasFanoLongBigList
    The displacement coefficients.
    protected final long
    The seed used to generate the initial hash triple.
    static final int
    The logarithm of the desired chunk size.
    protected final long
    The number of keys.
    protected final SparseRank
    The sparse ranking structure containing the unused entries.
    static final long
     
    protected final long
    The mask to compare signatures, or zero for no signatures.
    protected final it.unimi.dsi.fastutil.longs.LongBigList
    The signatures.
    protected final it.unimi.dsi.bits.TransformationStrategy<? super T>
    The transformation strategy.

    Fields inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction

    defRetValue
  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
    protected
    CHDMinimalPerfectHashFunction(Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, int lambda, double loadFactor, int signatureWidth, File tempDir, ChunkedHashStore<T> chunkedHashStore)
    Creates a new CHD minimal perfect hash function for the given keys.
  • Method Summary

    Modifier and Type
    Method
    Description
    long
     
    static void
    main(String[] arg)
     
    long
    Returns the number of bits used by this structure.
    long
     

    Methods inherited from class it.unimi.dsi.sux4j.mph.AbstractHashFunction

    containsKey, size

    Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction

    defaultReturnValue, defaultReturnValue

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

    Methods inherited from interface it.unimi.dsi.fastutil.Function

    apply, clear

    Methods inherited from interface java.util.function.Function

    compose

    Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction

    andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, get, getOrDefault, getOrDefault, put, put, remove, removeLong
  • Field Details

    • serialVersionUID

      public static final long serialVersionUID
      See Also:
    • LOG2_CHUNK_SIZE

      public static final int LOG2_CHUNK_SIZE
      The logarithm of the desired chunk size.
      See Also:
    • n

      protected final long n
      The number of keys.
    • globalSeed

      protected final long globalSeed
      The seed used to generate the initial hash triple.
    • transform

      protected final it.unimi.dsi.bits.TransformationStrategy<? super T> transform
      The transformation strategy.
    • coefficients

      protected final EliasFanoLongBigList coefficients
      The displacement coefficients.
    • rank

      protected final SparseRank rank
      The sparse ranking structure containing the unused entries.
    • signatureMask

      protected final long signatureMask
      The mask to compare signatures, or zero for no signatures.
    • signatures

      protected final it.unimi.dsi.fastutil.longs.LongBigList signatures
      The signatures.
  • Constructor Details

    • CHDMinimalPerfectHashFunction

      protected CHDMinimalPerfectHashFunction(Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, int lambda, double loadFactor, int signatureWidth, File tempDir, ChunkedHashStore<T> chunkedHashStore) throws IOException
      Creates a new CHD minimal perfect hash function for the given keys.
      Parameters:
      keys - the keys to hash, or null.
      transform - a transformation strategy for the keys.
      lambda - the average bucket size.
      loadFactor - the load factor.
      signatureWidth - a signature width, or 0 for no signature.
      tempDir - a temporary directory for the store files, or null for the standard temporary directory.
      chunkedHashStore - a chunked hash store containing the keys, or null; the store can be unchecked, but in this case keys and transform must be non-null.
      Throws:
      IOException
  • Method Details

    • numBits

      public long numBits()
      Returns the number of bits used by this structure.
      Returns:
      the number of bits used by this structure.
    • getLong

      public long getLong(Object key)
      Specified by:
      getLong in interface it.unimi.dsi.fastutil.objects.Object2LongFunction<T>
    • size64

      public long size64()
      Specified by:
      size64 in interface it.unimi.dsi.fastutil.Size64
      Overrides:
      size64 in class AbstractHashFunction<T>
    • main

      public static void main(String[] arg) throws NoSuchMethodException, IOException, com.martiansoftware.jsap.JSAPException
      Throws:
      NoSuchMethodException
      IOException
      com.martiansoftware.jsap.JSAPException