Class S2PointVectorCoder

  • All Implemented Interfaces:
    S2Coder<java.util.List<S2Point>>

    @GwtCompatible
    class S2PointVectorCoder
    extends java.lang.Object
    implements S2Coder<java.util.List<S2Point>>
    An encoder/decoder of Lists.

    Values from the List returned by decode(Bytes, Cursor) are decoded only when they are accessed. This allows for very fast initialization and no additional memory use beyond the encoded data. The encoded data is not owned by the List; typically it points into a large contiguous buffer that contains other encoded data as well.

    • Field Detail

      • FAST

        static final S2PointVectorCoder FAST
        An instance of a S2PointVectorCoder which encodes/decodes S2Points in the FAST encoding format. The FAST format is optimized for fast encoding/decoding.
      • COMPACT

        static final S2PointVectorCoder COMPACT
        An instance of a S2PointVectorCoder which encodes/decodes S2Points in the COMPACT encoding format. The COMPACT format is optimized for disk usage and memory footprint.
      • FORMAT_FAST

        private static final int FORMAT_FAST
        The value of the FAST encoding format.
        See Also:
        Constant Field Values
      • FORMAT_COMPACT

        private static final int FORMAT_COMPACT
        The value of the COMPACT encoding format.
        See Also:
        Constant Field Values
      • SIZEOF_S2POINT

        private static final int SIZEOF_S2POINT
        The size of an encoded S2Point in bytes (3 doubles * 8 bytes per double).
        See Also:
        Constant Field Values
      • BLOCK_SIZE

        private static final int BLOCK_SIZE
        S2CellIds are represented in a special 64-bit format and are encoded in fixed-size blocks. BLOCK_SIZE represents the number of values per block. Block sizes of 4, 8, 16, and 32 were tested, and BLOCK_SIZE == 16 seems to offer the best compression. (Note that BLOCK_SIZE == 32 requires some code modifications which have since been removed).
        See Also:
        Constant Field Values
      • EXCEPTION

        private static final long EXCEPTION
        The exception value in the COMPACT encoding format.
    • Method Detail

      • encode

        public void encode​(java.util.List<S2Point> values,
                           java.io.OutputStream output)
                    throws java.io.IOException
        Description copied from interface: S2Coder
        Encodes value to output.
        Specified by:
        encode in interface S2Coder<java.util.List<S2Point>>
        Throws:
        java.io.IOException
      • encodeFast

        private static void encodeFast​(java.util.List<S2Point> values,
                                       java.io.OutputStream output)
                                throws java.io.IOException
        Throws:
        java.io.IOException
      • encodeCompact

        private static void encodeCompact​(java.util.List<S2Point> values,
                                          java.io.OutputStream output)
                                   throws java.io.IOException
        Encodes a vector of S2Points, optimizing for space.

        OVERVIEW

        We attempt to represent each S2Point as the center of an S2CellId. All S2CellIds must be at the same level. Any points that cannot be encoded exactly as S2CellId centers are stored as exceptions using 24 bytes each. If there are so many exceptions that the COMPACT encoding does not save significant space, we give up and use the uncompressed encoding.

        The first step is to choose the best S2CellId level. This requires converting each point to (face, si, ti) coordinates and checking whether the point can be represented exactly as an S2CellId center at some level. We then build a histogram of S2CellId levels (just like the similar code in S2Polygon.encode) and choose the best level (or give up, if there are not enough S2CellId-encodable points).

        The simplest approach would then be to take all the S2CellIds and right-shift them to remove all the constant bits at the chosen level. This would give the best spatial locality and hence the smallest deltas. However instead we give up some spatial locality and use the similar but faster transformation described below.

        Each encodable point is first converted to the (sj, tj) representation defined below:

        
         sj = (((face & 3) << 30) | (si >> 1)) >> (30 - level);
         tj = (((face & 4) << 29) | ti) >> (31 - level);
         
        These two values encode the (face, si, ti) tuple using (2 * level + 3) bits. To see this, recall that "si" and "ti" are 31-bit values that all share a common suffix consisting of a "1" bit followed by (30 - level) "0" bits. The code above right-shifts these values to remove the constant bits and then prepends the bits for "face", yielding a total of (level + 2) bits for "sj" and (level + 1) bits for "tj".

        We then combine (sj, tj) into one 64-bit value by interleaving bit pairs:

        
         v = interleaveBitPairs(sj, tj);
         
        (We could also interleave individual bits, but it is faster this way). The result is similar to right-shifting an S2CellId by (61 - 2 * level), except that it is faster to decode and the spatial locality is not quite as good.

        The 64-bit values are divided into blocks of size 8, and then each value is encoded as the sum of a base value, a per-block offset, and a per-value delta within that block:

        
         v[i,j] = base + offset[i] + delta[i, j]
         

        where "i" represents a block and "j" represents an entry in that block.

        The deltas for each block are encoded using a fixed number of 4-bit nibbles (1-16 nibbles per delta). This allows any delta to be accessed in constant time.

        The "offset" for each block is a 64-bit value encoded in 0-8 bytes. The offset is left-shifted such that it overlaps the deltas by a configurable number of bits (either 0 or 4), called the "overlap". The overlap and offset length (0-8 bytes) are specified per block. The reason for the overlap is that it allows fewer delta bits to be used in some cases. For example if base == 0 and the range within a block is 0xf0 to 0x110, then rather than using 12-bits deltas with an offset of 0, the overlap lets us use 8-bits deltas with an offset of 0xf0 (saving 7 bytes per block).

        The global minimum value "base" is encoded using 0-7 bytes starting with the most-significant non-zero bit possible for the chosen level. For example, if (level == 7) then the encoded values have at most 17 bits, so if "base" is encoded in 1 byte then it is shifted to occupy bits 9-16.

        Example: at level == 15, there are at most 33 non-zero value bits. The following shows the bit positions covered by "base", "offset", and "delta" assuming that "base" and "offset" are encoded in 2 bytes each, deltas are encoded in 2 nibbles (1 byte) each, and "overlap" is 4 bits:

        
         Base:             1111111100000000-----------------
         Offset:           -------------1111111100000000----
         Delta:            -------------------------00000000
         Overlap:                                   ^^^^
         

        The numbers (0 or 1) in this diagram denote the byte number of the encoded value. Notice that "base" is shifted so that it starts at the leftmost possible bit, "delta" always starts at the rightmost possible bit (bit 0), and "offset" is shifted so that it overlaps "delta" by the chosen "overlap" (either 0 or 4 bits). Also note that all of these values are summed, and therefore each value can affect higher-order bits due to carries.

        NOTE(user): Encoding deltas in 4-bit rather than 8-bit length increments reduces encoded sizes by about 7%. Allowing a 4-bit overlap between the offset and deltas reduces encoded sizes by about 1%. Both optimizations make the code more complex but don't affect running times significantly.

        ENCODING DETAILS

        Now we can move on to the actual encodings. First, there is a 2 byte header encoded as follows:

        
         Byte 0, bits 0-2: encodingFormat (COMPACT)
         Byte 0, bit  3:   haveExceptions
         Byte 0, bits 4-7: (lastBlockSize - 1)
         Byte 1, bits 0-2: baseBytes
         Byte 1, bits 3-7: level (0-30)
         

        This is followed by an EncodedStringVector containing the encoded blocks. Each block contains BLOCK_SIZE (8) values. The total size of the EncodedS2PointVector is not stored explicitly, but instead is calculated as

         num_values == BLOCK_SIZE * (numBlocks - 1) + lastBlockSize
         

        (An empty vector has numBlocks == 0 and lastBlockSize == BLOCK_SIZE.)

        Each block starts with a 1 byte header containing the following:

        
         Byte 0, bits 0-2: (offsetBytes - overlapNibbles)
         Byte 0, bit  3:   overlapNibbles
         Byte 0, bits 4-7: (deltaNibbles - 1)
         

        "overlapNibbles" is either 0 or 1 (indicating an overlap of 0 or 4 bits), while "offsetBytes" is in the range 0-8 (indicating the number of bytes used to encode the offset for this block). Note that some combinations cannot be encoded: in particular, offsetBytes == 0 can only be encoded with an overlap of 0 bits, and offsetBytes == 8 can only be encoded with an overlap of 4 bits. This allows us to encode offset lengths of 0-8 rather than just 0-7 without using an extra bit. (Note that the combinations that can't be encoded are not useful anyway).

        The header is followed by "offsetBytes" bytes for the offset, and then (4 * deltaNibbles) bytes for the deltas.

        If there are any points that could not be represented as S2CellIds, then "haveExceptions" in the header is true. In that case the delta values within each block are encoded as (delta + 8), and values 0-7 are used to represent exceptions. If a block has exceptions, they are encoded immediately following the array of deltas, and are referenced by encoding the corresponding exception index (0-7) as the delta.

        TODO(user): A vector containing a single leaf cell is currently encoded as 13 bytes (2 byte header, 7 byte base, 1 byte block count, 1 byte block length, 1 byte block header, 1 byte delta). However if this case occurs often, a better solution would be implement a separate format that encodes the leading k bytes of an S2CellId. It would have a one-byte header consisting of the encoding format (3 bits) and the number of bytes encoded (3 bits), followed by the S2CellId bytes. The extra 2 header bits could be used to store single points using other encodings, e.g. E7.

        If we wind up using 8-value blocks, we could also use the extra bit in the first byte of the header to indicate that there is only one value, and then skip the 2nd byte of header and the EncodedStringVector. But this would be messy because it also requires special cases while decoding. Essentially this would be a sub-format within the COMPACT format.

        Throws:
        java.io.IOException
      • bitMask

        private static long bitMask​(int n)
        Returns a bit mask with n low-order 1 bits, for 0 <= n <= 64.
      • maxBitsForLevel

        private static int maxBitsForLevel​(int level)
        Returns the maximum number of bits per value at the given S2CellId level.
      • baseShift

        private static int baseShift​(int level,
                                     int baseBits)
        Returns the number of bits that base should be right-shifted in order to encode only its leading baseBits bits, assuming that all points are encoded at the given S2CellId level.
      • chooseBestLevel

        private static int chooseBestLevel​(java.util.List<S2Point> points,
                                           java.util.List<S2PointVectorCoder.CellPoint> cellPoints)
        Returns the S2CellId level for which the greatest number of the given points can be represented as the center of an S2CellId, or -1 if there is no S2CellId that would result in significant space savings.

        Adds the S2CellId representation of each point (if any) to cellPoints.

      • convertCellsToValues

        private static com.google.common.primitives.ImmutableLongArray convertCellsToValues​(java.util.List<S2PointVectorCoder.CellPoint> cellPoints,
                                                                                            int level)
        Given a vector of points in S2PointVectorCoder.CellPoint format and an S2CellId level that has been chosen for encoding, returns a vector of 64-bit values that should be encoded in order to represent these points. Points that cannot be represented losslessly as the center of an S2CellId at the chosen level are indicated by the value EXCEPTION.
      • canEncode

        private static boolean canEncode​(long dMin,
                                         long dMax,
                                         int deltaBits,
                                         int overlapBits,
                                         boolean haveExceptions)
        Returns true if the range of values [dMin, dMax] can be encoded using the specified parameters (deltaBits, overlapBits, and haveExceptions).
      • getBlockCode

        private static void getBlockCode​(S2PointVectorCoder.MutableBlockCode code,
                                         com.google.common.primitives.ImmutableLongArray values,
                                         long base,
                                         boolean haveExceptions)
        Given a vector of 64-bit values to be encoded and an S2CellId level, returns the optimal encoding parameters that should be used to encode each block.