Class CHDMinimalPerfectHashFunction<T>
- java.lang.Object
-
- it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction<K>
-
- it.unimi.dsi.sux4j.mph.AbstractHashFunction<T>
-
- it.unimi.dsi.sux4j.mph.CHDMinimalPerfectHashFunction<T>
-
- All Implemented Interfaces:
it.unimi.dsi.fastutil.Function<T,java.lang.Long>
,it.unimi.dsi.fastutil.objects.Object2LongFunction<T>
,it.unimi.dsi.fastutil.Size64
,java.io.Serializable
,java.util.function.Function<T,java.lang.Long>
,java.util.function.ToLongFunction<T>
public class CHDMinimalPerfectHashFunction<T> extends AbstractHashFunction<T> implements java.io.Serializable
A minimal perfect hash function implemented using the “hash, displace and compress” technique.Given a list of keys without duplicates, the builder of this class finds a minimal perfect hash function for the list. Subsequent calls to the
getLong(Object)
method will return a distinct number for each key in the list. For keys out of the list, the resulting number is not specified. In some (rare) cases it might be possible to establish that a key was not in the original list, and in that case -1 will be returned; by signing the function (see below), you can guarantee with a prescribed probability that -1 will be returned on keys not in the original list. The class can then be saved by serialisation and reused later.This class uses a chunked hash store to provide highly scalable construction. Note that at construction time you can pass a it.unimi.dsi.sux4j.io.ChunkedHashStore containing the keys (associated with any value); however, if the store is rebuilt because of a
DuplicateException
it will be rebuilt associating with each key its ordinal position.The memory requirements for the algorithm we use are ≈2 bits per key for load factor equal to one and λ = 5. Thus, this class can use ≈10% less memory than a
GOVMinimalPerfectHashFunction
.However, its construction time is an order of magnitude larger, and query time is about 50% slower. Different tradeoffs between construction time, query time and space can be obtained by tweaking the load factor and the parameter λ (see the paper below for their exact meaning).
For convenience, this class provides a main method that reads from standard input a (possibly
gzip
'd) sequence of newline-separated strings, and writes a serialised minimal perfect hash function for the given list.Signing
Optionally, it is possible to sign the minimal perfect hash function. A w-bit signature will be associated with each key, so that
getLong(Object)
will return -1 on strings that are not in the original key set. As usual, false positives are possible with probability 2-w.How it Works
The technique used is described by Djamal Belazzougui, Fabiano C. Botelho and Martin Dietzfelbinger in “Hash, displace and compress”, Algorithms - ESA 2009, LNCS 5757, pages 682−693, 2009. However, with respect to the algorithm described in the paper, this implementation is much more scalable, as it uses a
ChunkedHashStore
to split the generation of large key sets into generation of smaller functions for each chunk (of size approximately 216).- Since:
- 3.2.0
- Author:
- Sebastiano Vigna
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
CHDMinimalPerfectHashFunction.Builder<T>
A builder class forCHDMinimalPerfectHashFunction
.
-
Field Summary
Fields Modifier and Type Field Description protected EliasFanoLongBigList
coefficients
The displacement coefficients.protected long
globalSeed
The seed used to generate the initial hash triple.static int
LOG2_CHUNK_SIZE
The logarithm of the desired chunk size.protected long
n
The number of keys.protected SparseRank
rank
The sparse ranking structure containing the unused entries.static long
serialVersionUID
protected long
signatureMask
The mask to compare signatures, or zero for no signatures.protected it.unimi.dsi.fastutil.longs.LongBigList
signatures
The signatures.protected it.unimi.dsi.bits.TransformationStrategy<? super T>
transform
The transformation strategy.
-
Constructor Summary
Constructors Modifier Constructor Description protected
CHDMinimalPerfectHashFunction(java.lang.Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, int lambda, double loadFactor, int signatureWidth, java.io.File tempDir, ChunkedHashStore<T> chunkedHashStore)
Creates a new CHD minimal perfect hash function for the given keys.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description long
getLong(java.lang.Object key)
static void
main(java.lang.String[] arg)
long
numBits()
Returns the number of bits used by this structure.long
size64()
-
Methods inherited from class it.unimi.dsi.sux4j.mph.AbstractHashFunction
containsKey, size
-
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defaultReturnValue, defaultReturnValue
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction
andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, get, getOrDefault, getOrDefault, put, put, remove, removeLong
-
-
-
-
Field Detail
-
serialVersionUID
public static final long serialVersionUID
- See Also:
- Constant Field Values
-
LOG2_CHUNK_SIZE
public static final int LOG2_CHUNK_SIZE
The logarithm of the desired chunk size.- See Also:
- Constant Field Values
-
n
protected final long n
The number of keys.
-
globalSeed
protected final long globalSeed
The seed used to generate the initial hash triple.
-
transform
protected final it.unimi.dsi.bits.TransformationStrategy<? super T> transform
The transformation strategy.
-
coefficients
protected final EliasFanoLongBigList coefficients
The displacement coefficients.
-
rank
protected final SparseRank rank
The sparse ranking structure containing the unused entries.
-
signatureMask
protected final long signatureMask
The mask to compare signatures, or zero for no signatures.
-
signatures
protected final it.unimi.dsi.fastutil.longs.LongBigList signatures
The signatures.
-
-
Constructor Detail
-
CHDMinimalPerfectHashFunction
protected CHDMinimalPerfectHashFunction(java.lang.Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, int lambda, double loadFactor, int signatureWidth, java.io.File tempDir, ChunkedHashStore<T> chunkedHashStore) throws java.io.IOException
Creates a new CHD minimal perfect hash function for the given keys.- Parameters:
keys
- the keys to hash, ornull
.transform
- a transformation strategy for the keys.lambda
- the average bucket size.loadFactor
- the load factor.signatureWidth
- a signature width, or 0 for no signature.tempDir
- a temporary directory for the store files, ornull
for the standard temporary directory.chunkedHashStore
- a chunked hash store containing the keys, ornull
; the store can be unchecked, but in this casekeys
andtransform
must be non-null
.- Throws:
java.io.IOException
-
-
Method Detail
-
numBits
public long numBits()
Returns the number of bits used by this structure.- Returns:
- the number of bits used by this structure.
-
getLong
public long getLong(java.lang.Object key)
- Specified by:
getLong
in interfaceit.unimi.dsi.fastutil.objects.Object2LongFunction<T>
-
size64
public long size64()
- Specified by:
size64
in interfaceit.unimi.dsi.fastutil.Size64
- Overrides:
size64
in classAbstractHashFunction<T>
-
main
public static void main(java.lang.String[] arg) throws java.lang.NoSuchMethodException, java.io.IOException, com.martiansoftware.jsap.JSAPException
- Throws:
java.lang.NoSuchMethodException
java.io.IOException
com.martiansoftware.jsap.JSAPException
-
-