Class BloomFilter<T>
- java.lang.Object
-
- it.unimi.dsi.util.BloomFilter<T>
-
- All Implemented Interfaces:
it.unimi.dsi.fastutil.Size64
,java.io.Serializable
public class BloomFilter<T> extends java.lang.Object implements java.io.Serializable, it.unimi.dsi.fastutil.Size64
A Bloom filter.Instances of this class represent a set of elements (with false positives) using a Bloom filter (Burton H. Bloom, “Space/time trade-offs in hash coding with allowable errors”, Comm. ACM 13(7):422−426, 1970). Because of the way Bloom filters work, you cannot remove elements.
Given a maximum number of elements, Bloom filters have an expected error rate that depends on the number of hash functions used. More precisely, a Bloom filter for at most n elements with d hash functions will use ln 2 dn ≈ 1.44 dn bits; false positives will happen with probability 2-d. Adding more than n elements will result in a higher rate of false positives. You can conveniently build a filter by specifying the number of hash function, by specifying the expected false positive rate, or just requesting essentially no false positives.
The maximum number of bits supported is 137438953408L, which makes it possible to store with high precision several dozens billion elements.
This class exports access methods that are similar to those of
Set
, but it does not implement that interface, as too many non-optional methods would be unimplementable (e.g., iterators). To store generic objects of typeT
, we rely on MurmurHash3 and Google Guava's funnels, which are strategies turning an object into a sequence of bytes. There are predefined methods for storing character sequences, byte and character arrays, integers and longs; they use the ready-made funnels available inFunnels
. You can define your ownFunnel
and store objects correspondingly.If you intend to storage sequences of bytes which are already random looking (e.g., MD5 digests) you can use the
addHash(byte[])
/containsHash(byte[])
methods, which use the first 16 bytes of the argument byte array directly, without further hashing.If you plan to use predefined methods only, the suggested way to instantiate this class is to use the factory methods without
Funnel
arguments (e.g.,create(long, int)
). They return a filter of generic typeVoid
using anull
funnel that will be never invoked (unless you circument the generics type-safety mechanisms).A main method makes it easy to create serialized Bloom filters starting from a list of strings.
Implementation details
To generate several hash functions we use the technique suggested by Adam Kirsch and Michael Mitzenmacher in “Less hashing, same performance: Building a better Bloom filter”, Random Structures & Algorithms, 33(2):187−218, John Wiley & Sons, 2008. Two 64-bit hashes h0 and h1 are generated using
Hashing.murmur3_128()
(or taken from the argument, in case ofaddHash(byte[])
/containsHash(byte[])
). Then, the hash function of index i is simply h0 + ih1 (i ≥ 0). The paper proves that this choice does not worsen the rate of false positives.- Author:
- Sebastiano Vigna
- See Also:
- Serialized Form
-
-
Field Summary
Fields Modifier and Type Field Description static com.google.common.hash.Funnel<byte[]>
BYTE_ARRAY_FUNNEL
Funnels.byteArrayFunnel()
.static com.google.common.hash.Funnel<java.lang.Integer>
INTEGER_FUNNEL
Funnels.integerFunnel()
.static com.google.common.hash.Funnel<java.lang.Long>
LONG_FUNNEL
Funnels.longFunnel()
.static long
MAX_BITS
The maximum number of bits in a filter (limited by array size and bits in a long).static com.google.common.hash.Funnel<java.lang.CharSequence>
STRING_FUNNEL
Funnels.unencodedCharsFunnel()
.
-
Constructor Summary
Constructors Modifier Constructor Description protected
BloomFilter(long n, int d, com.google.common.hash.Funnel<T> funnel)
Creates a new Bloom filter with given number of hash functions and expected number of elements of given type.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Deprecated Methods Modifier and Type Method Description boolean
add(byte[] a)
Adds a byte array to this filter.boolean
add(char[] a)
Adds a character array to this filter.boolean
add(int x)
Adds an integer to this filter.boolean
add(long x)
Adds a long to this filter.boolean
add(java.lang.CharSequence s)
Adds a character sequence to this filter.boolean
add(T e)
Adds an object of generic type to this filter using the funnel specified at construction time.<V> boolean
add(V e, com.google.common.hash.Funnel<V> funnel)
Adds an object to this filter using a specified funnel.boolean
addHash(byte[] hash)
Adds a hash code to this filter.void
clear()
Clears this filter.boolean
contains(byte[] a)
Checks whether the given byte array is in this filter.boolean
contains(char[] a)
Checks whether the given character array is in this filter.boolean
contains(int x)
Adds an integer is in this filter.boolean
contains(long x)
Checks whether the given long is in this filter.boolean
contains(java.lang.CharSequence s)
Checks whether the given character sequence is in this filter.boolean
contains(T e)
Checks whether an object of generic type is in this filter using the funnel specified at construction time.boolean
containsHash(byte[] hash)
Checks whether a hash code is in this filter.static BloomFilter<java.lang.Void>
create(long n)
Creates a new high-precision Bloom filter a given expected number of elements.static BloomFilter<java.lang.Void>
create(long n, double precision)
Creates a new Bloom filter onVoid
with given precision and expected number of elements.static <T> BloomFilter<T>
create(long n, double precision, com.google.common.hash.Funnel<T> funnel)
Creates a new Bloom filter onVoid
with given precision and expected number of elements of given type.static BloomFilter<java.lang.Void>
create(long n, int d)
Creates a new Bloom filter with given number of hash functions and expected number of elements.static <T> BloomFilter<T>
create(long n, int d, com.google.common.hash.Funnel<T> funnel)
Creates a new Bloom filter with given number of hash functions and expected number of elements of given type.static <T> BloomFilter<T>
create(long n, com.google.common.hash.Funnel<T> funnel)
Creates a new high-precision Bloom filter a given expected number of elements of given type.static void
main(java.lang.String[] arg)
int
size()
Deprecated.long
size64()
Returns the size of this filter.
-
-
-
Field Detail
-
BYTE_ARRAY_FUNNEL
public static final com.google.common.hash.Funnel<byte[]> BYTE_ARRAY_FUNNEL
Funnels.byteArrayFunnel()
.
-
STRING_FUNNEL
public static final com.google.common.hash.Funnel<java.lang.CharSequence> STRING_FUNNEL
Funnels.unencodedCharsFunnel()
.
-
INTEGER_FUNNEL
public static final com.google.common.hash.Funnel<java.lang.Integer> INTEGER_FUNNEL
Funnels.integerFunnel()
.
-
LONG_FUNNEL
public static final com.google.common.hash.Funnel<java.lang.Long> LONG_FUNNEL
Funnels.longFunnel()
.
-
MAX_BITS
public static final long MAX_BITS
The maximum number of bits in a filter (limited by array size and bits in a long).- See Also:
- Constant Field Values
-
-
Constructor Detail
-
BloomFilter
protected BloomFilter(long n, int d, com.google.common.hash.Funnel<T> funnel)
Creates a new Bloom filter with given number of hash functions and expected number of elements of given type.- Parameters:
n
- the expected number of elements.d
- the number of hash functions; if no more thann
elements are added to this filter, false positives will happen with probability 2-d.funnel
- a funnel for the elements of this filter.
-
-
Method Detail
-
create
public static <T> BloomFilter<T> create(long n, com.google.common.hash.Funnel<T> funnel)
Creates a new high-precision Bloom filter a given expected number of elements of given type.This constructor uses a number of hash functions that is logarithmic in the number of expected elements. This usually results in no false positives at all.
- Parameters:
n
- the expected number of elements.funnel
- a funnel for the elements of this filter (usecreate(long)
if you plan on using only the predefined methods).
-
create
public static <T> BloomFilter<T> create(long n, int d, com.google.common.hash.Funnel<T> funnel)
Creates a new Bloom filter with given number of hash functions and expected number of elements of given type.- Parameters:
n
- the expected number of elements.d
- the number of hash functions; if no more thann
elements are added to this filter, false positives will happen with probability 2-d.funnel
- a funnel for the elements of this filter (usecreate(long, int)
if you plan on using only the predefined methods).
-
create
public static <T> BloomFilter<T> create(long n, double precision, com.google.common.hash.Funnel<T> funnel)
Creates a new Bloom filter onVoid
with given precision and expected number of elements of given type.- Parameters:
n
- the expected number of elements.precision
- the expected fraction of false positives; if no more thann
elements are added to this filter, false positives will happen with no more than this probability. plan on using only the predefined methods).funnel
- a funnel for the elements of this filter (usecreate(long, double)
if you plan on using only the predefined methods).
-
create
public static BloomFilter<java.lang.Void> create(long n)
Creates a new high-precision Bloom filter a given expected number of elements.Filters created using this method will be accessible using predefined methods only. Use
create(long, Funnel)
if you need a generic filter.This constructor uses a number of hash functions that is logarithmic in the number of expected elements. This usually results in no false positives at all.
- Parameters:
n
- the expected number of elements.
-
create
public static BloomFilter<java.lang.Void> create(long n, int d)
Creates a new Bloom filter with given number of hash functions and expected number of elements.Filters created using this method will be accessible using predefined methods only. Use
create(long, int, Funnel)
if you need a generic filter.- Parameters:
n
- the expected number of elements.d
- the number of hash functions; if no more thann
elements are added to this filter, false positives will happen with probability 2-d.
-
create
public static BloomFilter<java.lang.Void> create(long n, double precision)
Creates a new Bloom filter onVoid
with given precision and expected number of elements.Filters created using this method will be accessible using predefined methods only. Use
create(long, double, Funnel)
if you need a generic filter.- Parameters:
n
- the expected number of elements.precision
- the expected fraction of false positives; if no more thann
elements are added to this filter, false positives will happen with no more than this probability. plan on using only the predefined methods).- See Also:
BloomFilter(long, int, Funnel)
-
add
public boolean add(java.lang.CharSequence s)
Adds a character sequence to this filter.- Parameters:
s
- a character sequence.- Returns:
- true if this filter was modified.
-
add
public boolean add(byte[] a)
Adds a byte array to this filter.- Parameters:
a
- a byte array.- Returns:
- true if this filter was modified.
-
add
public boolean add(char[] a)
Adds a character array to this filter.- Parameters:
a
- a character array.- Returns:
- true if this filter was modified.
-
add
public boolean add(int x)
Adds an integer to this filter.- Parameters:
x
- an integer.- Returns:
- true if this filter was modified.
-
add
public boolean add(long x)
Adds a long to this filter.- Parameters:
x
- a long.- Returns:
- true if this filter was modified.
-
add
public boolean add(T e)
Adds an object of generic type to this filter using the funnel specified at construction time.- Parameters:
e
- an object.- Returns:
- true if this filter was modified.
-
add
public <V> boolean add(V e, com.google.common.hash.Funnel<V> funnel)
Adds an object to this filter using a specified funnel.- Parameters:
e
- an object.funnel
- a funnel forobject
.- Returns:
- true if this filter was modified.
-
addHash
public boolean addHash(byte[] hash)
Adds a hash code to this filter.This method uses the first 16 bytes of a byte array to build two 64-bit hashes. The intended usage is storing digests and similar already-hashed values.
- Parameters:
hash
- a byte array of at least 16 elements containing a hash code.- Returns:
- true if this filter was modified.
- Throws:
java.lang.ArrayIndexOutOfBoundsException
- ifhash
is shorter than 16.- See Also:
containsHash(byte[])
-
contains
public boolean contains(java.lang.CharSequence s)
Checks whether the given character sequence is in this filter.Note that this method may return true on a character sequence that has never been added to this filter. This will happen with probability 2-d, where d is the number of hash functions specified at creation time, if the number of the elements in this filter is less than n, the number of expected elements specified at creation time.
- Parameters:
s
- a character sequence.- Returns:
- true if
s
is in this filter.
-
contains
public boolean contains(byte[] a)
Checks whether the given byte array is in this filter.- Parameters:
a
- a byte array.- Returns:
- true if
a
is in this filter. - See Also:
contains(CharSequence)
-
contains
public boolean contains(char[] a)
Checks whether the given character array is in this filter.- Parameters:
a
- a character array.- Returns:
- true if
a
is in this filter. - See Also:
contains(CharSequence)
-
contains
public boolean contains(int x)
Adds an integer is in this filter.- Parameters:
x
- an integer.- Returns:
- true if
x
is in this filter. - See Also:
contains(CharSequence)
-
contains
public boolean contains(long x)
Checks whether the given long is in this filter.- Parameters:
x
- a long.- Returns:
- true if
x
is in this filter. - See Also:
contains(CharSequence)
-
contains
public boolean contains(T e)
Checks whether an object of generic type is in this filter using the funnel specified at construction time.- Parameters:
e
- an element.- Returns:
- true if
e
is in this filter. - See Also:
contains(CharSequence)
-
containsHash
public boolean containsHash(byte[] hash)
Checks whether a hash code is in this filter.This method uses the first 16 bytes of a byte array to build two 64-bit hashes. The intended usage is storing digests and similar already-hashed values.
- Parameters:
hash
- a byte array of at least 16 elements containing a hash code.- Returns:
- true if
hash
is in this filter. - Throws:
java.lang.ArrayIndexOutOfBoundsException
- ifhash
is shorter than 16.- See Also:
addHash(byte[])
-
clear
public void clear()
Clears this filter.
-
size64
public long size64()
Returns the size of this filter.Note that the size of a Bloom filter is only a lower bound for the number of distinct elements that have been added to this filter. False positives might make the number returned by this method smaller than it should be.
- Specified by:
size64
in interfaceit.unimi.dsi.fastutil.Size64
- Returns:
- the size of this filter.
-
size
@Deprecated public int size()
Deprecated.- Specified by:
size
in interfaceit.unimi.dsi.fastutil.Size64
-
main
public static void main(java.lang.String[] arg) throws java.io.IOException, com.martiansoftware.jsap.JSAPException, java.lang.NoSuchMethodException
- Throws:
java.io.IOException
com.martiansoftware.jsap.JSAPException
java.lang.NoSuchMethodException
-
-