Class WordSet


  • public final class WordSet
    extends java.lang.Object
    An efficient (both memory and time) implementation of a Set used to verify that a given word is contained within the set. The general usage pattern is expected to be such that most checks are positive, ie. that the word indeed is contained in the set.

    Performance of the set is comparable to that of TreeSet for Strings, ie. 2-3x slower than HashSet when using pre-constructed Strings. This is generally result of algorithmic complexity of structures; Word and Tree sets are roughly logarithmic to the whole data, whereas Hash set is linear to the length of key. However:

    • WordSet can use char arrays as keys, without constructing Strings. In cases where there is no (need for) Strings, WordSet seems to be about twice as fast, even without considering additional GC caused by temporary String instances.
    • WordSet is more compact in its memory presentation; if Strings are shared its size is comparable to optimally filled HashSet, and if no such Strings exists, its much more compact (relatively speaking)

    Although this is an efficient set for specific set of usage patterns, one restriction is that the full set of words to include has to be known before constructing the set. Also, the size of the set is limited to total word content of about 20k characters; factory method does verify the limit and indicates if an instance can not be created.

    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      private static class  WordSet.Builder  
    • Field Summary

      Fields 
      Modifier and Type Field Description
      (package private) static char CHAR_NULL  
      (package private) char[] mData
      Compressed presentation of the word set.
      (package private) static int MIN_BINARY_SEARCH
      This is actually just a guess; but in general linear search should be faster for short sequences (definitely for 4 or less; maybe up to 8 or less?)
      (package private) static int NEGATIVE_OFFSET
      Offset added to numbers to mark 'negative' numbers.
    • Constructor Summary

      Constructors 
      Modifier Constructor Description
      private WordSet​(char[] data)  
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      static char[] constructRaw​(java.util.TreeSet<java.lang.String> wordSet)  
      static WordSet constructSet​(java.util.TreeSet<java.lang.String> wordSet)  
      static boolean contains​(char[] data, char[] str, int start, int end)  
      boolean contains​(char[] buf, int start, int end)  
      static boolean contains​(char[] data, java.lang.String str)  
      boolean contains​(java.lang.String str)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • NEGATIVE_OFFSET

        static final int NEGATIVE_OFFSET
        Offset added to numbers to mark 'negative' numbers. Asymmetric, since range of negative markers needed is smaller than positive numbers...
        See Also:
        Constant Field Values
      • MIN_BINARY_SEARCH

        static final int MIN_BINARY_SEARCH
        This is actually just a guess; but in general linear search should be faster for short sequences (definitely for 4 or less; maybe up to 8 or less?)
        See Also:
        Constant Field Values
      • mData

        final char[] mData
        Compressed presentation of the word set.
    • Constructor Detail

      • WordSet

        private WordSet​(char[] data)
    • Method Detail

      • constructSet

        public static WordSet constructSet​(java.util.TreeSet<java.lang.String> wordSet)
      • constructRaw

        public static char[] constructRaw​(java.util.TreeSet<java.lang.String> wordSet)
      • contains

        public boolean contains​(char[] buf,
                                int start,
                                int end)
      • contains

        public static boolean contains​(char[] data,
                                       char[] str,
                                       int start,
                                       int end)
      • contains

        public boolean contains​(java.lang.String str)
      • contains

        public static boolean contains​(char[] data,
                                       java.lang.String str)