Class ImmutableExternalPrefixMap
- java.lang.Object
-
- it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction<java.lang.CharSequence>
-
- it.unimi.dsi.util.AbstractPrefixMap
-
- it.unimi.dsi.util.ImmutableExternalPrefixMap
-
- All Implemented Interfaces:
it.unimi.dsi.fastutil.Function<java.lang.CharSequence,java.lang.Long>
,it.unimi.dsi.fastutil.objects.Object2LongFunction<java.lang.CharSequence>
,PrefixMap<MutableString>
,StringMap<MutableString>
,java.io.Serializable
,java.util.function.Function<java.lang.CharSequence,java.lang.Long>
,java.util.function.ToLongFunction<java.lang.CharSequence>
public class ImmutableExternalPrefixMap extends AbstractPrefixMap implements java.io.Serializable
An immutable prefix map mostly stored in external memory.Instances of this class write on a dump stream the string in the domain of the prefix map. More precisely, the dump stream is formed by blocks, and each block (with user-definable length, possibly the size of a basic disk I/O operation) is filled as much as possible with strings front coded and compressed with a
HuTuckerCodec
. Each block starts with the length of the first string in unary, followed by the encoding of the string. Then, for each string we write in unary the length of the common prefix (in characters) with the previous string, the length of the remaining suffix (in characters) and finally the encoded suffix. Note that if the encoding of a string is longer than a block, the string will occupy more than one block.We keep track using an
ImmutableBinaryTrie
of the strings at the start of each block inHu–Tucker coding
: thus, we are able to retrieve the interval corresponding to a given prefix by callinggetApproximatedInterval()
and scanning at most two blocks.Self-contained or non-self-contained
There are two kinds of external prefix maps: self-contained and non-self-contained. In the first case, you get a serialised object that you can load at any time. The dump stream is serialised with the object and expanded at each deserialisation in the Java temporary directory. If you deserialise a map several times, you will get correspondingly many copies of the dump stream in the temporary directory. The dump streams are deleted when the JVM exits. This mechanism is not very efficient, but since this class implements several interfaces it is essential that clients can make things work in a standard way.
Alternatively, you can give at creation time a filename for the dump stream. The resulting non-self-contained external prefix map can be serialised, but after deserialisation you need to set the dump stream filename or even directly the dump stream (for instance, to an input bit stream wrapping a byte array where the dump stream has been loaded). You can deserialise many copies of an external prefix map, letting all copies share the same dump stream.
This data structure is not synchronised, and concurrent reads may cause problems because of clashes in the usage of the underlying input bit stream. It would not be a good idea in any case to open a new stream for each caller, as that would certainly lead to disk thrashing.
The main method of this class helps in building large external prefix maps.
- Since:
- 0.9.3
- Author:
- Sebastiano Vigna
- See Also:
- Serialized Form
-
-
Field Summary
Fields Modifier and Type Field Description protected int[]
blockOffset
An array parallel toblockStart
giving the offset in blocks in the dump file of the corresponding word inblockStart
.protected long
blockSize
The block size of this (in bits).protected int[]
blockStart
The index of the first word in each block, plus an additional entry containingsize
.protected it.unimi.dsi.fastutil.objects.Object2IntLinkedOpenHashMap<MutableString>
cache
A cache for the most recent queriesstatic int
CACHE_MAX_SIZE
The maximum number of entry in the cache map.protected it.unimi.dsi.fastutil.chars.Char2IntOpenHashMap
char2symbol
A map from characters to symbols of the coder.protected Decoder
decoder
A decoder used to read data from the dump stream.protected InputBitStream
dumpStream
A reference to the dump stream.protected ImmutableBinaryTrie<java.lang.CharSequence>
intervalApproximator
The in-memory data structure used to approximate intervals..protected boolean
iteratorIsUsable
If true, the creation of the lastDumpStreamIterator
was not followed by a call to any get method.protected boolean
selfContained
Whether this map is self-contained.static long
serialVersionUID
protected int
size
The number of terms in this map.static int
STD_BLOCK_SIZE
The standard block size (in bytes).protected char[]
symbol2char
A map (given by an array) from symbols in the coder to characters.-
Fields inherited from class it.unimi.dsi.util.AbstractPrefixMap
list, prefixMap, rangeMap
-
-
Constructor Summary
Constructors Constructor Description ImmutableExternalPrefixMap(java.lang.Iterable<? extends java.lang.CharSequence> terms)
Creates an external prefix map with block sizeSTD_BLOCK_SIZE
.ImmutableExternalPrefixMap(java.lang.Iterable<? extends java.lang.CharSequence> terms, int blockSizeInBytes)
Creates an external prefix map with specified block size.ImmutableExternalPrefixMap(java.lang.Iterable<? extends java.lang.CharSequence> terms, int blockSizeInBytes, java.lang.CharSequence dumpStreamFilename)
Creates an external prefix map with specified block size and dump stream.ImmutableExternalPrefixMap(java.lang.Iterable<? extends java.lang.CharSequence> terms, java.lang.CharSequence dumpStreamFilename)
Creates an external prefix map with block sizeSTD_BLOCK_SIZE
and specified dump stream.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description boolean
containsKey(java.lang.Object term)
Interval
getInterval(java.lang.CharSequence prefix)
Returns the range of strings having a given prefix.long
getLong(java.lang.Object o)
protected MutableString
getTerm(int index, MutableString s)
Writes a string specified by index into aMutableString
.it.unimi.dsi.fastutil.objects.ObjectIterator<java.lang.CharSequence>
iterator()
Returns an iterator over the map.static void
main(java.lang.String[] arg)
void
setDumpStream(InputBitStream dumpStream)
Sets the dump stream of this external prefix map to a given input bit stream.void
setDumpStream(java.lang.CharSequence dumpStreamFilename)
Sets the dump stream of this external prefix map to a given filename.int
size()
-
Methods inherited from class it.unimi.dsi.util.AbstractPrefixMap
list, prefixMap, rangeMap
-
Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction
defaultReturnValue, defaultReturnValue
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction
andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, defaultReturnValue, defaultReturnValue, get, getOrDefault, getOrDefault, put, put, remove, removeLong
-
-
-
-
Field Detail
-
serialVersionUID
public static final long serialVersionUID
- See Also:
- Constant Field Values
-
STD_BLOCK_SIZE
public static final int STD_BLOCK_SIZE
The standard block size (in bytes).- See Also:
- Constant Field Values
-
CACHE_MAX_SIZE
public static final int CACHE_MAX_SIZE
The maximum number of entry in the cache map.- See Also:
- Constant Field Values
-
intervalApproximator
protected final ImmutableBinaryTrie<java.lang.CharSequence> intervalApproximator
The in-memory data structure used to approximate intervals..
-
blockSize
protected final long blockSize
The block size of this (in bits).
-
decoder
protected final Decoder decoder
A decoder used to read data from the dump stream.
-
symbol2char
protected final char[] symbol2char
A map (given by an array) from symbols in the coder to characters.
-
char2symbol
protected final it.unimi.dsi.fastutil.chars.Char2IntOpenHashMap char2symbol
A map from characters to symbols of the coder.
-
size
protected final int size
The number of terms in this map.
-
blockStart
protected final int[] blockStart
The index of the first word in each block, plus an additional entry containingsize
.
-
blockOffset
protected final int[] blockOffset
An array parallel toblockStart
giving the offset in blocks in the dump file of the corresponding word inblockStart
. If there are no overflows, this will just be an initial segment of the natural numbers, but overflows cause jumps.
-
selfContained
protected final boolean selfContained
Whether this map is self-contained.
-
iteratorIsUsable
protected transient boolean iteratorIsUsable
If true, the creation of the lastDumpStreamIterator
was not followed by a call to any get method.
-
dumpStream
protected transient InputBitStream dumpStream
A reference to the dump stream.
-
cache
protected transient it.unimi.dsi.fastutil.objects.Object2IntLinkedOpenHashMap<MutableString> cache
A cache for the most recent queries
-
-
Constructor Detail
-
ImmutableExternalPrefixMap
public ImmutableExternalPrefixMap(java.lang.Iterable<? extends java.lang.CharSequence> terms, int blockSizeInBytes, java.lang.CharSequence dumpStreamFilename) throws java.io.IOException
Creates an external prefix map with specified block size and dump stream.This constructor does not assume that
CharSequence
instances returned byterms.iterator()
will be distinct. Thus, it can be safely used withFileLinesMutableStringIterable
.- Parameters:
terms
- an iterable whose iterator will enumerate in lexicographical order the terms for the map.blockSizeInBytes
- the block size (in bytes).dumpStreamFilename
- the name of the dump stream, ornull
for a self-contained map.- Throws:
java.io.IOException
-
ImmutableExternalPrefixMap
public ImmutableExternalPrefixMap(java.lang.Iterable<? extends java.lang.CharSequence> terms, java.lang.CharSequence dumpStreamFilename) throws java.io.IOException
Creates an external prefix map with block sizeSTD_BLOCK_SIZE
and specified dump stream.This constructor does not assume that
CharSequence
instances returned byterms.iterator()
will be distinct. Thus, it can be safely used withFileLinesMutableStringIterable
.- Parameters:
terms
- a collection whose iterator will enumerate in lexicographical order the terms for the map.dumpStreamFilename
- the name of the dump stream, ornull
for a self-contained map.- Throws:
java.io.IOException
-
ImmutableExternalPrefixMap
public ImmutableExternalPrefixMap(java.lang.Iterable<? extends java.lang.CharSequence> terms, int blockSizeInBytes) throws java.io.IOException
Creates an external prefix map with specified block size.This constructor does not assume that
CharSequence
instances returned byterms.iterator()
will be distinct. Thus, it can be safely used withFileLinesMutableStringIterable
.- Parameters:
blockSizeInBytes
- the block size (in bytes).terms
- a collection whose iterator will enumerate in lexicographical order the terms for the map.- Throws:
java.io.IOException
-
ImmutableExternalPrefixMap
public ImmutableExternalPrefixMap(java.lang.Iterable<? extends java.lang.CharSequence> terms) throws java.io.IOException
Creates an external prefix map with block sizeSTD_BLOCK_SIZE
.This constructor does not assume that
CharSequence
instances returned byterms.iterator()
will be distinct. Thus, it can be safely used withFileLinesMutableStringIterable
.- Parameters:
terms
- a collection whose iterator will enumerate in lexicographical order the terms for the map.- Throws:
java.io.IOException
-
-
Method Detail
-
setDumpStream
public void setDumpStream(java.lang.CharSequence dumpStreamFilename) throws java.io.FileNotFoundException
Sets the dump stream of this external prefix map to a given filename.This method sets the dump file used by this map, and should be only called after deserialisation, providing exactly the file generated at creation time. Essentially anything can happen if you do not follow the rules.
Note that this method will attempt to close the old stream, if present.
- Parameters:
dumpStreamFilename
- the name of the dump file.- Throws:
java.io.FileNotFoundException
- See Also:
setDumpStream(InputBitStream)
-
setDumpStream
public void setDumpStream(InputBitStream dumpStream)
Sets the dump stream of this external prefix map to a given input bit stream.This method sets the dump file used by this map, and should be only called after deserialisation, providing a repositionable stream containing exactly the file generated at creation time. Essentially anything can happen if you do not follow the rules.
Using this method you can load an external prefix map in core memory, enjoying the compactness of the data structure, but getting much more speed.
Note that this method will attemp to close the old stream, if present.
- Parameters:
dumpStream
- a repositionable input bit stream containing exactly the dump stream generated at creation time.- See Also:
setDumpStream(CharSequence)
-
getInterval
public Interval getInterval(java.lang.CharSequence prefix)
Description copied from class:AbstractPrefixMap
Returns the range of strings having a given prefix.- Specified by:
getInterval
in classAbstractPrefixMap
- Parameters:
prefix
- a prefix.- Returns:
- the corresponding range of strings as an interval.
-
getTerm
protected MutableString getTerm(int index, MutableString s)
Description copied from class:AbstractPrefixMap
Writes a string specified by index into aMutableString
.- Specified by:
getTerm
in classAbstractPrefixMap
- Parameters:
index
- the index of a string.s
- a mutable string.- Returns:
string
.
-
containsKey
public boolean containsKey(java.lang.Object term)
- Specified by:
containsKey
in interfaceit.unimi.dsi.fastutil.Function<java.lang.CharSequence,java.lang.Long>
-
getLong
public long getLong(java.lang.Object o)
- Specified by:
getLong
in interfaceit.unimi.dsi.fastutil.objects.Object2LongFunction<java.lang.CharSequence>
-
iterator
public it.unimi.dsi.fastutil.objects.ObjectIterator<java.lang.CharSequence> iterator()
Returns an iterator over the map.The iterator returned by this method scans directly the dump stream.
Note that the returned iterator uses the same stream as all get methods. Calling such methods while the iterator is being used will produce an
IllegalStateException
.- Returns:
- an iterator over the map that just scans the dump stream.
-
size
public int size()
- Specified by:
size
in interfaceit.unimi.dsi.fastutil.Function<java.lang.CharSequence,java.lang.Long>
-
main
public static void main(java.lang.String[] arg) throws java.lang.ClassNotFoundException, java.io.IOException, com.martiansoftware.jsap.JSAPException, java.lang.SecurityException, java.lang.NoSuchMethodException
- Throws:
java.lang.ClassNotFoundException
java.io.IOException
com.martiansoftware.jsap.JSAPException
java.lang.SecurityException
java.lang.NoSuchMethodException
-
-