Class BucketedHashStore<T>
- java.lang.Object
-
- it.unimi.dsi.sux4j.io.BucketedHashStore<T>
-
- All Implemented Interfaces:
it.unimi.dsi.io.SafelyCloseable
,java.io.Closeable
,java.io.Serializable
,java.lang.AutoCloseable
,java.lang.Iterable<BucketedHashStore.Bucket>
public class BucketedHashStore<T> extends java.lang.Object implements java.io.Serializable, it.unimi.dsi.io.SafelyCloseable, java.lang.Iterable<BucketedHashStore.Bucket>
A temporary store of signatures virtually divided into buckets.A bucketed hash store accumulates elements (objects of type
T
) by turning them into bit vectors (using a providedTransformationStrategy
) and then hashing such vectors into a signature (a pair of longs, i.e., overall we get a hash of 128 bits). Elements can be added one by one or in batches. Elements must be distinct, or, more precisely, they must be transformed into distinct bit vectors.Besides the hashes, we store some data associated with each element: if no data is specified, we store the rank of each element added (the first element added has rank 0, the second one has rank 1, and so on), unless you specified at construction time a nonzero hash width: in that case, the value stored by
add(Object)
will be given the lowest bits of the first hash of the signature associated with the object (the hash width is the number of bits stored). This feature makes it possible, for example, to implement a static dictionary using aGOV3Function
.The desired expected bucket size can be set by calling
bucketSize(int)
. Once all elements have been added, one callsiterator()
, which returns buckets one at a time (in their natural order); signatures within each bucket are returned by increasing value, and signatures within different buckets are in bucket order. Actually, the iterator provided by a bucket returns a triple of longs whose last element is the data associated with the element that generated the signature.Note that the main difference between an instance of this class and one of a
ChunkedHashStore
is that the latter can only guarantee that the average chunk (here, bucket) size will be within a factor of two from the desired one, as the number of chunks can be specified only as a power of two.It is possible (albeit very unlikely) that different elements generate the same hash. This event is detected during bucket iteration (not while accumulating hashes), and it will throw a
BucketedHashStore.DuplicateException
. At that point, the caller must handle the exception by resetting the store and trying again from scratch. Note that after a few (say, three) exceptions you can safely assume that there are duplicate elements. If you need to force a check on the whole store you can callcheck()
. If all your elements come from anIterable
,checkAndRetry(Iterable, LongIterable)
will try three times to build a checked bucketed hash store.Every
reset(long)
changes the seed used by the store to generate signatures. So, if this seed has to be stored this must happen after the last call toreset(long)
. To help tracking this fact, a call toseed()
will lock the store; any further call toreset(long)
will throw anIllegalStateException
. In case the store needs to be reused, you can callclear()
, that will bring back the store to after-creation state.When you have finished using a bucketed hash store, you should
close()
it. This class implementsSafelyCloseable
, and thus provides a safety-net finalizer.Filtering
You can at any time set a predicate that will filter the signatures returned by the store.
Computing frequencies
If you specify so at construction time, a bucketed hash store will compute for you a a map from values to their frequency.
Implementation details
Internally, a bucketed hash store save signatures into different disk segments using the highest bits (performing, in fact, the first phase of a bucket sort). Once the user chooses a bucket size, the store exhibits the data on disk by grouping disk segments or splitting them into buckets. This process is transparent to the user.
The assignment to a bucket happens conceptually by using the first 64-bit hash (shifted by one to the right, to avoid sign issues) to define a number α in the interval [0..1). Then, the bucket assigned is ⌊αm⌋, where m = 1 +
size()
/bucketSize()
is the number of buckets. Conceptually, we are mapping α, a uniform random number in the unit interval, into ⌊αm⌋, a uniform integer number in the range [0..m), by inversion. As show below, the whole computation can be carried out using a fixed-point representation, as it has been done for pseudorandom number generators since the early days, using only a multiplication and shift. The assignment is monotone nondecreasing, which makes it possible to emit the buckets one at a time scanning the keys in sorted order.Signatures have to be loaded into memory only segment by segment, so to be sorted and tested for uniqueness. As long as
DISK_SEGMENTS
is larger than eight, the store will need less than 0.75 bits per element of main memory.DISK_SEGMENTS
can be increased arbitrarily at compile time, but each store will openDISK_SEGMENTS
files at the same time. (For the same reason, it is strongly suggested that you close your stores as soon as you do not need them).Intended usage
bucketed hash stores should be built by classes that need to manipulate elements in buckets of approximate given size without needing access to the elements themselves, but just to their signatures, a typical example being
GOV3Function
, which uses the signatures to compute a 3-hyperedge. Once a bucketed hash store is built, it can be passed on to further substructures, reducing greatly the computation time (as the original collection need not to be scanned again).To compute the bucket corresponding to a given element, use
final long[] signature = new long[2]; Hashes.spooky4(transform.toBitVector(key), seed, signature); final int bucket = Math.multiplyHigh(signature[0] >>> 1, (1 + n / bucketSize) << 1);
whereseed
is the store seed,n
is the number of keys, andbucketSize
is the provided bucket size.- Since:
- 5.0.0
- Author:
- Sebastiano Vigna
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
BucketedHashStore.Bucket
A bucket returned by aBucketedHashStore
.static class
BucketedHashStore.DuplicateException
Denotes that the bucketed hash store contains a duplicate signature.
-
Field Summary
Fields Modifier and Type Field Description static int
BUFFER_SIZE
The size of the output buffers.static int
DEFAULT_BUCKET_SIZE
The default size of the bucket at creation.static int
DISK_SEGMENTS
The number of disk segments.static int
DISK_SEGMENTS_SHIFT
The shift for disk segments.protected long
filteredSize
The number of elements that pass the current filter, or -1 we it must be recomputed.static int
LOG2_DISK_SEGMENTS
The logarithm of the number of disk segments.protected long
seed
The seed used to generate the hash signatures.static long
serialVersionUID
protected long
size
The number of elements ever added.
-
Constructor Summary
Constructors Constructor Description BucketedHashStore(it.unimi.dsi.bits.TransformationStrategy<? super T> transform)
Creates a bucketed hash store with given transformation strategy.BucketedHashStore(it.unimi.dsi.bits.TransformationStrategy<? super T> transform, it.unimi.dsi.logging.ProgressLogger pl)
Creates a bucketed hash store with given transformation strategy.BucketedHashStore(it.unimi.dsi.bits.TransformationStrategy<? super T> transform, java.io.File tempDir)
Creates a bucketed hash store with given transformation strategy and temporary file directory.BucketedHashStore(it.unimi.dsi.bits.TransformationStrategy<? super T> transform, java.io.File tempDir, int hashWidthOrCountValues, it.unimi.dsi.logging.ProgressLogger pl)
Creates a bucketed hash store with given transformation strategy, hash width and progress logger.BucketedHashStore(it.unimi.dsi.bits.TransformationStrategy<? super T> transform, java.io.File tempDir, it.unimi.dsi.logging.ProgressLogger pl)
Creates a bucketed hash store with given transformation strategy and progress logger.
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description void
add(T o)
Adds an element to this store, associating it with its ordinal position.void
add(T o, long value)
Adds an element to this store, associating it with a specified value.void
addAll(java.util.Iterator<? extends T> elements)
Adds the elements returned by an iterator to this store, associating them with their ordinal position.void
addAll(java.util.Iterator<? extends T> elements, it.unimi.dsi.fastutil.longs.LongIterator values)
Adds the elements returned by an iterator to this store, associating them with specified values.void
addAll(java.util.Iterator<? extends T> elements, it.unimi.dsi.fastutil.longs.LongIterator values, boolean requiresValue2CountMap)
Adds the elements returned by an iterator to this store, associating them with specified values, possibly building the associated value frequency map.int
bucketSize()
Returns the expected bucket size.void
bucketSize(int bucketSize)
Sets the expected bucket size.void
check()
Checks that this store has no duplicate signatures, throwing an exception if this fails to happen.void
checkAndRetry(java.lang.Iterable<? extends T> iterable)
Checks that this store has no duplicate signatures, and try to rebuild if this fails to happen.void
checkAndRetry(java.lang.Iterable<? extends T> iterable, it.unimi.dsi.fastutil.longs.LongIterable values)
Checks that this store has no duplicate signatures, and try to rebuild if this fails to happen.void
clear()
Clears this store.void
close()
Closes this store, disposing all associated resources.void
filter(org.apache.commons.collections4.Predicate<long[]> filter)
Sets a filter for this store.protected void
finalize()
java.util.Iterator<BucketedHashStore.Bucket>
iterator()
Returns an iterator over the buckets of this bucketed hash store.void
reset(long seed)
Resets this store using a new seed.long
seed()
Return the current seed of this bucketed hash store.it.unimi.dsi.fastutil.longs.LongBigList
signatures(int signatureWidth, it.unimi.dsi.logging.ProgressLogger pl)
Generate a list of signatures using the lowest bits of the first hash in this store.long
size()
Returns the size of this store.java.io.File
tempDir()
Return the temporary directory of this bucketed hash store, ornull
.it.unimi.dsi.bits.TransformationStrategy<? super T>
transform()
Return the transformation strategy provided at construction time.it.unimi.dsi.fastutil.longs.Long2LongOpenHashMap
value2FrequencyMap()
Return the current value frequency map.
-
-
-
Field Detail
-
serialVersionUID
public static final long serialVersionUID
- See Also:
- Constant Field Values
-
DEFAULT_BUCKET_SIZE
public static final int DEFAULT_BUCKET_SIZE
The default size of the bucket at creation. From Sux4J 5.2.0 it has been raised (it used to be one) to avoid that calls tocheckAndRetry(Iterable)
on a newly created store are too expensive.- See Also:
- Constant Field Values
-
BUFFER_SIZE
public static final int BUFFER_SIZE
The size of the output buffers.- See Also:
- Constant Field Values
-
LOG2_DISK_SEGMENTS
public static final int LOG2_DISK_SEGMENTS
The logarithm of the number of disk segments.- See Also:
- Constant Field Values
-
DISK_SEGMENTS
public static final int DISK_SEGMENTS
The number of disk segments.- See Also:
- Constant Field Values
-
DISK_SEGMENTS_SHIFT
public static final int DISK_SEGMENTS_SHIFT
The shift for disk segments.- See Also:
- Constant Field Values
-
size
protected long size
The number of elements ever added.
-
filteredSize
protected long filteredSize
The number of elements that pass the current filter, or -1 we it must be recomputed.
-
seed
protected long seed
The seed used to generate the hash signatures.
-
-
Constructor Detail
-
BucketedHashStore
public BucketedHashStore(it.unimi.dsi.bits.TransformationStrategy<? super T> transform) throws java.io.IOException
Creates a bucketed hash store with given transformation strategy.- Parameters:
transform
- a transformation strategy for the elements.- Throws:
java.io.IOException
-
BucketedHashStore
public BucketedHashStore(it.unimi.dsi.bits.TransformationStrategy<? super T> transform, java.io.File tempDir) throws java.io.IOException
Creates a bucketed hash store with given transformation strategy and temporary file directory.- Parameters:
transform
- a transformation strategy for the elements.tempDir
- a temporary directory for the store files, ornull
for the current directory.- Throws:
java.io.IOException
-
BucketedHashStore
public BucketedHashStore(it.unimi.dsi.bits.TransformationStrategy<? super T> transform, it.unimi.dsi.logging.ProgressLogger pl) throws java.io.IOException
Creates a bucketed hash store with given transformation strategy.- Parameters:
transform
- a transformation strategy for the elements.pl
- a progress logger, ornull
.- Throws:
java.io.IOException
-
BucketedHashStore
public BucketedHashStore(it.unimi.dsi.bits.TransformationStrategy<? super T> transform, java.io.File tempDir, it.unimi.dsi.logging.ProgressLogger pl) throws java.io.IOException
Creates a bucketed hash store with given transformation strategy and progress logger.- Parameters:
transform
- a transformation strategy for the elements.tempDir
- a temporary directory for the store files, ornull
for the current directory.pl
- a progress logger, ornull
.- Throws:
java.io.IOException
-
BucketedHashStore
public BucketedHashStore(it.unimi.dsi.bits.TransformationStrategy<? super T> transform, java.io.File tempDir, int hashWidthOrCountValues, it.unimi.dsi.logging.ProgressLogger pl) throws java.io.IOException
Creates a bucketed hash store with given transformation strategy, hash width and progress logger.- Parameters:
transform
- a transformation strategy for the elements.tempDir
- a temporary directory for the store files, ornull
for the current directory.hashWidthOrCountValues
- if positive, no associated data is saved in the store:BucketedHashStore.Bucket.data(long)
will return this many lower bits of the first of the three hashes associated with the key; zero, values are stored; if negative, values are stored and a map from values to their frequency is computed.pl
- a progress logger, ornull
.- Throws:
java.io.IOException
-
-
Method Detail
-
bucketSize
public int bucketSize()
Returns the expected bucket size.- Returns:
- the expected bucket size.
-
bucketSize
public void bucketSize(int bucketSize)
Sets the expected bucket size.- Parameters:
bucketSize
- the expected bucket size.
-
seed
public long seed()
Return the current seed of this bucketed hash store. After calling this method, noreset(long)
will be allowed (unless the store is cleared).- Returns:
- the current seed of this bucketed hash store.
-
tempDir
public java.io.File tempDir()
Return the temporary directory of this bucketed hash store, ornull
.- Returns:
- the temporary directory of this bucketed hash store, or
null
.
-
transform
public it.unimi.dsi.bits.TransformationStrategy<? super T> transform()
Return the transformation strategy provided at construction time.- Returns:
- the transformation strategy provided at construction time.
-
add
public void add(T o, long value) throws java.io.IOException
Adds an element to this store, associating it with a specified value.- Parameters:
o
- the element to be added.value
- the associated value.- Throws:
java.io.IOException
-
add
public void add(T o) throws java.io.IOException
Adds an element to this store, associating it with its ordinal position.- Parameters:
o
- the element to be added.- Throws:
java.io.IOException
-
addAll
public void addAll(java.util.Iterator<? extends T> elements, it.unimi.dsi.fastutil.longs.LongIterator values, boolean requiresValue2CountMap) throws java.io.IOException
Adds the elements returned by an iterator to this store, associating them with specified values, possibly building the associated value frequency map.- Parameters:
elements
- an iterator returning elements.values
- an iterator on values parallel toelements
.requiresValue2CountMap
- whether to build the value frequency map (associating with each value its frequency).- Throws:
java.io.IOException
-
addAll
public void addAll(java.util.Iterator<? extends T> elements, it.unimi.dsi.fastutil.longs.LongIterator values) throws java.io.IOException
Adds the elements returned by an iterator to this store, associating them with specified values.- Parameters:
elements
- an iterator returning elements.values
- an iterator on values parallel toelements
.- Throws:
java.io.IOException
-
addAll
public void addAll(java.util.Iterator<? extends T> elements) throws java.io.IOException
Adds the elements returned by an iterator to this store, associating them with their ordinal position.- Parameters:
elements
- an iterator returning elements.- Throws:
java.io.IOException
-
size
public long size() throws java.io.IOException
Returns the size of this store. Note that if you set up a filter, the first call to this method will require a scan to the whole store.- Returns:
- the number of (possibly filtered) pairs of this store.
- Throws:
java.io.IOException
-
clear
public void clear() throws java.io.IOException
Clears this store. After a call to this method, the store can be reused.- Throws:
java.io.IOException
-
value2FrequencyMap
public it.unimi.dsi.fastutil.longs.Long2LongOpenHashMap value2FrequencyMap()
Return the current value frequency map.- Returns:
- the current value frequency map.
- Throws:
java.lang.IllegalStateException
- if this bucketed hash store does not contain a value frequency map.
-
finalize
protected void finalize() throws java.lang.Throwable
- Overrides:
finalize
in classjava.lang.Object
- Throws:
java.lang.Throwable
-
close
public void close() throws java.io.IOException
Closes this store, disposing all associated resources.- Specified by:
close
in interfacejava.lang.AutoCloseable
- Specified by:
close
in interfacejava.io.Closeable
- Throws:
java.io.IOException
-
reset
public void reset(long seed) throws java.io.IOException
Resets this store using a new seed. All accumulated data are cleared, and a new seed is reinstated.
-
check
public void check() throws BucketedHashStore.DuplicateException
Checks that this store has no duplicate signatures, throwing an exception if this fails to happen.- Throws:
BucketedHashStore.DuplicateException
- if this store contains duplicate signatures.
-
checkAndRetry
public void checkAndRetry(java.lang.Iterable<? extends T> iterable, it.unimi.dsi.fastutil.longs.LongIterable values) throws java.io.IOException
Checks that this store has no duplicate signatures, and try to rebuild if this fails to happen.- Parameters:
iterable
- the elements with which the store will be refilled if there are duplicate signatures.values
- the values that will be associated with the elements returned byiterable
.- Throws:
java.lang.IllegalArgumentException
- if after a few trials the store still contains duplicate signatures.java.io.IOException
-
checkAndRetry
public void checkAndRetry(java.lang.Iterable<? extends T> iterable) throws java.io.IOException
Checks that this store has no duplicate signatures, and try to rebuild if this fails to happen.Warning: the actions are executed exactly in the specified order—first check, then retry. If you invoke this method on an empty store you'll get a checked empty store.
- Parameters:
iterable
- the elements with which the store will be refilled if there are duplicate signatures.- Throws:
java.lang.IllegalArgumentException
- if after a few trials the store still contains duplicate signatures.java.io.IOException
-
signatures
public it.unimi.dsi.fastutil.longs.LongBigList signatures(int signatureWidth, it.unimi.dsi.logging.ProgressLogger pl) throws java.io.IOException
Generate a list of signatures using the lowest bits of the first hash in this store.For this method to work, this store must contain ranks.
- Parameters:
signatureWidth
- the width in bits of the signatures.pl
- a progress logger.- Throws:
java.io.IOException
-
filter
public void filter(org.apache.commons.collections4.Predicate<long[]> filter)
Sets a filter for this store.- Parameters:
filter
- a predicate that will be used to filter signatures.
-
iterator
public java.util.Iterator<BucketedHashStore.Bucket> iterator()
Returns an iterator over the buckets of this bucketed hash store.Note that at each iteration part of the state of this bucketed hash store is reused. Thus, after each call to
next()
the previously returnedBucketedHashStore.Bucket
will be no longer valid. Please use the provided copy constructor if you need to process in parallel several buckets.- Specified by:
iterator
in interfacejava.lang.Iterable<T>
- Returns:
- an iterator over the buckets of this bucketed hash store.
-
-