Class FrontCodedStringList

java.lang.Object
java.util.AbstractCollection<MutableString>
it.unimi.dsi.fastutil.objects.AbstractObjectCollection<MutableString>
it.unimi.dsi.fastutil.objects.AbstractObjectList<MutableString>
it.unimi.dsi.util.FrontCodedStringList
All Implemented Interfaces:
it.unimi.dsi.fastutil.objects.ObjectCollection<MutableString>, it.unimi.dsi.fastutil.objects.ObjectIterable<MutableString>, it.unimi.dsi.fastutil.objects.ObjectList<MutableString>, it.unimi.dsi.fastutil.Stack<MutableString>, Serializable, Comparable<List<? extends MutableString>>, Iterable<MutableString>, Collection<MutableString>, List<MutableString>, RandomAccess, SequencedCollection<MutableString>

public class FrontCodedStringList extends it.unimi.dsi.fastutil.objects.AbstractObjectList<MutableString> implements RandomAccess, Serializable
Compact storage of strings using front-coding compression (also known as compression by prefix omission).

This class stores a list of strings using front-coding (also known as prefix-omission) compression; the compression will be reasonable only if the list is sorted, but you could also use instances of this class just as a handy way to manage a large amount of strings. It implements an immutable ObjectList that returns the i-th string (as a MutableString) when the get(int) method is called with argument i. The returned mutable string may be freely modified.

As a commodity, this class provides a main method that reads from standard input a sequence of newline-separated strings, and writes a corresponding serialized front-coded string list.

Implementation Details

To store the list of strings, we use either a UTF-8 coded ByteArrayFrontCodedList, or a CharArrayFrontCodedList, depending on the value of the utf8 parameter at creation time. In the first case, if the strings are ASCII-oriented the resulting array will be much smaller, but access times will increase manifold, as each string must be UTF-8 decoded before being returned.

See Also:
  • Nested Class Summary

    Nested classes/interfaces inherited from class it.unimi.dsi.fastutil.objects.AbstractObjectList

    it.unimi.dsi.fastutil.objects.AbstractObjectList.ObjectRandomAccessSubList<K>, it.unimi.dsi.fastutil.objects.AbstractObjectList.ObjectSubList<K>
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected final it.unimi.dsi.fastutil.bytes.ByteArrayFrontCodedList
    The underlying ByteArrayFrontCodedList, or null.
    protected final it.unimi.dsi.fastutil.chars.CharArrayFrontCodedList
    The underlying CharArrayFrontCodedList, or null.
    static final long
     
    protected final boolean
    Whether this front-coded list is UTF-8 encoded.
  • Constructor Summary

    Constructors
    Constructor
    Description
    FrontCodedStringList(Collection<? extends CharSequence> c, int ratio, boolean utf8)
    Creates a new front-coded string list containing the character sequences contained in the given collection.
    FrontCodedStringList(Iterator<? extends CharSequence> words, int ratio, boolean utf8)
    Creates a new front-coded string list containing the character sequences returned by the given iterator.
  • Method Summary

    Modifier and Type
    Method
    Description
    protected static char[]
    byte2Char(byte[] a, char[] s)
     
    protected static int
    countUTF8Chars(byte[] a)
     
    get(int index)
    Returns the element at the specified position in this front-coded string list as a mutable string.
    void
    get(int index, MutableString s)
    Returns the element at the specified position in this front-coded string list by storing it in a mutable string.
    it.unimi.dsi.fastutil.objects.ObjectListIterator<MutableString>
    listIterator(int k)
     
    static void
    main(String[] arg)
     
    int
    Returns the ratio of the underlying front-coded list.
    int
     
    boolean
    Returns whether this front-coded string list is storing its strings as UTF-8 encoded bytes.

    Methods inherited from class it.unimi.dsi.fastutil.objects.AbstractObjectList

    add, add, addAll, addAll, addElements, addElements, clear, compareTo, contains, ensureIndex, ensureRestrictedIndex, equals, forEach, getElements, hashCode, indexOf, iterator, lastIndexOf, listIterator, peek, pop, push, remove, removeElements, set, setElements, size, subList, toArray, toArray, top, toString

    Methods inherited from class java.util.AbstractCollection

    containsAll, isEmpty, remove, removeAll, retainAll

    Methods inherited from class java.lang.Object

    clone, finalize, getClass, notify, notifyAll, wait, wait, wait

    Methods inherited from interface java.util.Collection

    parallelStream, removeIf, stream, toArray

    Methods inherited from interface it.unimi.dsi.fastutil.objects.ObjectCollection

    spliterator

    Methods inherited from interface it.unimi.dsi.fastutil.objects.ObjectList

    addAll, addAll, setElements, setElements, sort, spliterator, unstableSort

    Methods inherited from interface it.unimi.dsi.fastutil.Stack

    isEmpty
  • Field Details

    • serialVersionUID

      public static final long serialVersionUID
      See Also:
    • byteFrontCodedList

      protected final it.unimi.dsi.fastutil.bytes.ByteArrayFrontCodedList byteFrontCodedList
      The underlying ByteArrayFrontCodedList, or null.
    • charFrontCodedList

      protected final it.unimi.dsi.fastutil.chars.CharArrayFrontCodedList charFrontCodedList
      The underlying CharArrayFrontCodedList, or null.
    • utf8

      protected final boolean utf8
      Whether this front-coded list is UTF-8 encoded.
  • Constructor Details

    • FrontCodedStringList

      public FrontCodedStringList(Iterator<? extends CharSequence> words, int ratio, boolean utf8)
      Creates a new front-coded string list containing the character sequences returned by the given iterator.
      Parameters:
      words - an iterator returning character sequences.
      ratio - the desired ratio.
      utf8 - if true, the strings will be stored as UTF-8 byte arrays.
    • FrontCodedStringList

      public FrontCodedStringList(Collection<? extends CharSequence> c, int ratio, boolean utf8)
      Creates a new front-coded string list containing the character sequences contained in the given collection.
      Parameters:
      c - a collection containing character sequences.
      ratio - the desired ratio.
      utf8 - if true, the strings will be stored as UTF-8 byte arrays.
  • Method Details

    • utf8

      public boolean utf8()
      Returns whether this front-coded string list is storing its strings as UTF-8 encoded bytes.
      Returns:
      true if this front-coded string list is keeping its data as an array of UTF-8 encoded bytes.
    • ratio

      public int ratio()
      Returns the ratio of the underlying front-coded list.
      Returns:
      the ratio of the underlying front-coded list.
    • get

      public MutableString get(int index)
      Returns the element at the specified position in this front-coded string list as a mutable string.
      Specified by:
      get in interface List<MutableString>
      Parameters:
      index - an index in the list.
      Returns:
      a MutableString that will contain the string at the specified position. The string may be freely modified.
    • get

      public void get(int index, MutableString s)
      Returns the element at the specified position in this front-coded string list by storing it in a mutable string.
      Parameters:
      index - an index in the list.
      s - a mutable string that will contain the string at the specified position.
    • countUTF8Chars

      protected static int countUTF8Chars(byte[] a)
    • byte2Char

      protected static char[] byte2Char(byte[] a, char[] s)
    • listIterator

      public it.unimi.dsi.fastutil.objects.ObjectListIterator<MutableString> listIterator(int k)
      Specified by:
      listIterator in interface List<MutableString>
      Specified by:
      listIterator in interface it.unimi.dsi.fastutil.objects.ObjectList<MutableString>
      Overrides:
      listIterator in class it.unimi.dsi.fastutil.objects.AbstractObjectList<MutableString>
    • size

      public int size()
      Specified by:
      size in interface Collection<MutableString>
      Specified by:
      size in interface List<MutableString>
      Specified by:
      size in class AbstractCollection<MutableString>
    • main

      public static void main(String[] arg) throws IOException, com.martiansoftware.jsap.JSAPException, NoSuchMethodException
      Throws:
      IOException
      com.martiansoftware.jsap.JSAPException
      NoSuchMethodException