Class GV3CompressedFunction<T>

  • All Implemented Interfaces:
    it.unimi.dsi.fastutil.Function<T,​java.lang.Long>, it.unimi.dsi.fastutil.objects.Object2LongFunction<T>, it.unimi.dsi.fastutil.Size64, java.io.Serializable, java.util.function.Function<T,​java.lang.Long>, java.util.function.ToLongFunction<T>

    public class GV3CompressedFunction<T>
    extends it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction<T>
    implements java.io.Serializable, it.unimi.dsi.fastutil.Size64
    An immutable function stored in a compressed form.

    Instances of this class store a function from keys to values. Keys are provided by an iterable object (whose iterators must return elements in a consistent order), whereas values are provided by a LongIterable. If you do not specify values, each key will be assigned its rank (e.g., its position in iteration order starting from zero).

    The values must have a skewed distribution: some values must be much more frequent than others. In that case, this data structure uses much less space than, say, a GOV3Function because it is able to use, for each key, a number of bits close to the empirical entropy of the value list (with an additional ≈10%). It is faster than a GV4CompressedFunction, and it can be built more quickly, but it uses more space.

    In addition to the standard construction, which uses lazy Gaussian elimination to solve a number of random linear systems, this implementation has the option of enlarging slightly the space used by the structure (+12%) and use a peeling procedure (essentially, triangulation) to solve the systems. The construction time can be significantly shorter in this case.

    For convenience, this class provides a main method that reads from standard input a (possibly gzip'd) sequence of newline-separated strings, and writes a serialised function mapping each element of the list to its position, or to a given list of values.

    Building a function

    This class provides a great amount of flexibility when creating a new function; such flexibility is exposed through the builder. To exploit the various possibilities, you must understand some details of the construction.

    In a first phase, we build a BucketedHashStore containing hashes of the keys. By default, the store will associate each hash with the rank of the key. If you specify values, the store will associate with each hash the corresponding value.

    However, if you further require an indirect construction the store will associate again each hash with the rank of the corresponding key, and access randomly the values (which must be either a LongList or a LongBigList). Indirect construction is useful only in complex, multi-layer hashes (such as an LcpMonotoneMinimalPerfectHashFunction) in which we want to reuse a checked BucketedHashStore. Storing values in the BucketedHashStore is extremely scalable because the values must just be a LongIterable that will be scanned sequentially during the store construction. On the other hand, if you have already a store that associates ordinal positions, and you want to build a new function for which a LongList or LongBigList of values needs little space (e.g., because it is described implicitly), you can opt for an indirect construction using the already built store.

    Note that if you specify a store it will be used before building a new one (possibly because of a DuplicateException), with obvious benefits in terms of performance. If the store is not checked, and a DuplicateException is thrown, the constructor will try to rebuild the store, but this requires, of course, that the keys, and possibly the values, are available. Note that it is your responsibility to pass a correct store.

    Multithreading

    This implementation is multithreaded: each bucket returned by the BucketedHashStore is processed independently. By default, this class uses Runtime.availableProcessors() parallel threads, but by default no more than 4. If you wish to set a specific number of threads, you can do so through the system property "it.unimi.dsi.sux4j.mph.threads".

    Implementation Details

    The detail of the data structure can be found in “Engineering Compressed Functions”, by Marco Genuzio and Sebastiano Vigna, 2017. The theoretical basis for the construction is described by Jóhannes B. Hreinsson, Morten Krøyer, and Rasmus Pagh in “Storing a compressed function with constant time access”, Algorithms - ESA 2009, 17th Annual European Symposium, 2009, pages 730−741. Each output value is represented by a codeword from a prefix-free code (by default, a length-limited Huffman code). We generate a random 3-regular linear system on F2, where the known term of the k-th equation is a bit in a bit array. When we read the bit array at three suitable positions depending on an input key, we can recover the codeword representing the output value.

    Since:
    4.2.0
    Author:
    Sebastiano Vigna, Marco Genuzio
    See Also:
    GV4CompressedFunction, Serialized Form
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      protected GV3CompressedFunction​(java.lang.Iterable<? extends T> keys, it.unimi.dsi.bits.TransformationStrategy<? super T> transform, it.unimi.dsi.fastutil.longs.LongIterable values, boolean indirect, java.io.File tempDir, BucketedHashStore<T> bucketedHashStore, Codec codec, boolean peeled)
      Creates a new function for the given keys and values.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods Deprecated Methods 
      Modifier and Type Method Description
      boolean containsKey​(java.lang.Object o)  
      void dump​(java.lang.String file)  
      long getLong​(java.lang.Object o)  
      static void main​(java.lang.String[] arg)  
      long numBits()
      Returns the number of bits used by this structure.
      int size()
      Deprecated.
      long size64()
      Returns the number of keys in the function domain.
      • Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObject2LongFunction

        defaultReturnValue, defaultReturnValue
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • Methods inherited from interface it.unimi.dsi.fastutil.Function

        apply, clear
      • Methods inherited from interface java.util.function.Function

        compose
      • Methods inherited from interface it.unimi.dsi.fastutil.objects.Object2LongFunction

        andThen, andThenByte, andThenChar, andThenDouble, andThenFloat, andThenInt, andThenLong, andThenObject, andThenReference, andThenShort, applyAsLong, composeByte, composeChar, composeDouble, composeFloat, composeInt, composeLong, composeObject, composeReference, composeShort, get, getOrDefault, getOrDefault, put, put, remove, removeLong
    • Field Detail

      • NUMBER_OF_THREADS_PROPERTY

        public static final java.lang.String NUMBER_OF_THREADS_PROPERTY
        The system property used to set the number of parallel threads.
        See Also:
        Constant Field Values
      • BUCKET_SIZE

        public static final int BUCKET_SIZE
        The expected bucket size.
        See Also:
        Constant Field Values
      • n

        protected final long n
        The number of keys.
      • globalMaxCodewordLength

        protected final int globalMaxCodewordLength
        Length of longest codeword
      • globalSeed

        protected long globalSeed
        The seed used to generate the initial signature.
      • offsetAndSeed

        protected final long[] offsetAndSeed
        A long containing three values per bucket:
        • the top SEED_BITS bits contain the seed (note that it must not be shifted right);
        • the remaining lower bits contain the starting position in data of the bits associated with the bucket.
      • data

        protected final it.unimi.dsi.bits.BitVector data
        A bit vector storing the main data array.
      • transform

        protected final it.unimi.dsi.bits.TransformationStrategy<? super T> transform
        The transformation strategy to turn objects of type T into bit vectors.
      • decoder

        protected final Codec.Decoder decoder
        The decoder that will be used to yield output values.
    • Constructor Detail

      • GV3CompressedFunction

        protected GV3CompressedFunction​(java.lang.Iterable<? extends T> keys,
                                        it.unimi.dsi.bits.TransformationStrategy<? super T> transform,
                                        it.unimi.dsi.fastutil.longs.LongIterable values,
                                        boolean indirect,
                                        java.io.File tempDir,
                                        BucketedHashStore<T> bucketedHashStore,
                                        Codec codec,
                                        boolean peeled)
                                 throws java.io.IOException
        Creates a new function for the given keys and values.
        Parameters:
        keys - the keys in the domain of the function, or null.
        transform - a transformation strategy for the keys.
        values - values to be assigned to each element, in the same order of the iterator returned by keys; if null, the assigned value will the ordinal number of each element.
        indirect - if true, bucketedHashStore contains ordinal positions, and values is a LongIterable that must be accessed to retrieve the actual values.
        tempDir - a temporary directory for the store files, or null for the standard temporary directory.
        bucketedHashStore - a bucketed hash store containing the keys associated with their values and counting value frequencies, or null; the store can be unchecked, but in this case keys and transform must be non-null.
        codec - the Codec used to encode values.
        peeled - whether to use peeling rather than lazy Gaussian elimination; the resulting structure uses +12% space, but it can be constructed much more quickly.
        Throws:
        java.io.IOException
    • Method Detail

      • getLong

        public long getLong​(java.lang.Object o)
        Specified by:
        getLong in interface it.unimi.dsi.fastutil.objects.Object2LongFunction<T>
      • size64

        public long size64()
        Returns the number of keys in the function domain.
        Specified by:
        size64 in interface it.unimi.dsi.fastutil.Size64
        Returns:
        the number of the keys in the function domain.
      • size

        @Deprecated
        public int size()
        Deprecated.
        Specified by:
        size in interface it.unimi.dsi.fastutil.Function<T,​java.lang.Long>
        Specified by:
        size in interface it.unimi.dsi.fastutil.Size64
      • numBits

        public long numBits()
        Returns the number of bits used by this structure.
        Returns:
        the number of bits used by this structure.
      • containsKey

        public boolean containsKey​(java.lang.Object o)
        Specified by:
        containsKey in interface it.unimi.dsi.fastutil.Function<T,​java.lang.Long>
      • dump

        public void dump​(java.lang.String file)
                  throws java.io.IOException
        Throws:
        java.io.IOException
      • main

        public static void main​(java.lang.String[] arg)
                         throws java.lang.NoSuchMethodException,
                                java.io.IOException,
                                com.martiansoftware.jsap.JSAPException
        Throws:
        java.lang.NoSuchMethodException
        java.io.IOException
        com.martiansoftware.jsap.JSAPException