Class OptimizeStrings


  • public class OptimizeStrings
    extends java.lang.Object
    Share common underlying char[] among strings: Optimize sets of strings for efficient storage Apply it to a set of strings. Lifecycle: 1) make an instance of this class 2) call .add (String or String[]) for all strings in the set or - skip 1 and 2 and pass in a String [] that can be modified (sorted] 3) call .optimize() 4) call .getString(String) or .getStringArray(String[]) - returns new string objects .updateStringArray(String[]) to replace original Strings .getOffset(String) - returns int offset in some common string .getCommonStringIndex(String) 5) call getCommonStrings to get all the common strings 6) Let the GC collect the instance of this class The strings are added first to a big list, instead of directly to a stringbuilder in order to improve reuse by sorting for longest strings first The commonStrings are kept in an array. There are more than one to support very large amounts, in excess of 2GB. Nulls, passed in as strings, are mostly ignored, but handled appropriately.
    • Field Summary

      Fields 
      Modifier and Type Field Description
      private java.lang.String[] commonStringsA  
      private boolean doMeasurement  
      private java.util.ArrayList<java.lang.String> inStrings  
      private int[] lastIndexInCommonStringsA
      holds the last index (counting down in sort order) that is valid for the corresponding common string segment.
      private int nextSeq  
      private int[] offsets  
      private java.util.Map<java.lang.String,​java.lang.String> returnedStrings  
      private long savedCharsExact  
      private long savedCharsSubstr  
      private int splitSize
      splitSize = 100 means the common string can be as large as 100 the common string can hold a string of length 100
      private java.util.Map<java.lang.String,​java.lang.Integer> stringToIndexMap
      A two hop map from strings to offsets is used.
    • Constructor Summary

      Constructors 
      Constructor Description
      OptimizeStrings​(boolean doMeasurement)  
      OptimizeStrings​(boolean doMeasurement, int splitSize)  
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      void add​(java.lang.String s)
      null strings not added 0 length strings added
      void add​(java.lang.String[] sa)  
      private int eliminateSortedStringDuplicates​(java.lang.String[] sortedStrings)
      Scan sorted strings, removing duplicates.
      int getCommonStringIndex​(int index)  
      java.lang.String[] getCommonStrings()
      The list of common strings
      int getIndexOrSeqIndex​(java.lang.String s)  
      int getOffset​(int i)  
      long getOffset​(java.lang.String s)  
      long getSavedCharsExact()
      The number of characters saved - for statistical reporting only
      long getSavedCharsSubstr()  
      java.lang.String getString​(java.lang.String s)
      return a string which is made as a substring of the common string
      java.lang.String[] getStringArray​(java.lang.String[] sai)  
      int getStringIndex​(java.lang.String s)  
      void optimize()
      Fully checking indexof for every new string is prohibitively expensive We do a partial check - only checking if a string is a substring of the previous one
      private void optimizeI​(java.lang.String[] sortedStrings)  
      private java.lang.String[] sortStrings​(java.lang.String[] sa)  
      void updateStringArray​(java.lang.String[] sa)  
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • splitSize

        private int splitSize
        splitSize = 100 means the common string can be as large as 100 the common string can hold a string of length 100
      • inStrings

        private java.util.ArrayList<java.lang.String> inStrings
      • stringToIndexMap

        private java.util.Map<java.lang.String,​java.lang.Integer> stringToIndexMap
        A two hop map from strings to offsets is used. "index" (int): The first hop: goes from string to an int index, incrementing 1 per unique shared common string segment "offset" (int): The 2nd hop: goes from the first index to the actual int offset in the particular associated piece of the common strings. The lastIndexInCommonStringA array holds the last index value within each of the common string segments
      • offsets

        private int[] offsets
      • commonStringsA

        private java.lang.String[] commonStringsA
      • lastIndexInCommonStringsA

        private int[] lastIndexInCommonStringsA
        holds the last index (counting down in sort order) that is valid for the corresponding common string segment.
      • returnedStrings

        private java.util.Map<java.lang.String,​java.lang.String> returnedStrings
      • savedCharsExact

        private long savedCharsExact
      • savedCharsSubstr

        private long savedCharsSubstr
      • nextSeq

        private int nextSeq
      • doMeasurement

        private final boolean doMeasurement
    • Constructor Detail

      • OptimizeStrings

        public OptimizeStrings​(boolean doMeasurement)
      • OptimizeStrings

        public OptimizeStrings​(boolean doMeasurement,
                               int splitSize)
    • Method Detail

      • getSavedCharsExact

        public long getSavedCharsExact()
        The number of characters saved - for statistical reporting only
        Returns:
        the number of characters saved
      • getSavedCharsSubstr

        public long getSavedCharsSubstr()
      • getCommonStrings

        public java.lang.String[] getCommonStrings()
        The list of common strings
        Returns:
        the list of common strings
      • add

        public void add​(java.lang.String s)
        null strings not added 0 length strings added
        Parameters:
        s - -
      • add

        public void add​(java.lang.String[] sa)
      • getStringIndex

        public int getStringIndex​(java.lang.String s)
      • getIndexOrSeqIndex

        public int getIndexOrSeqIndex​(java.lang.String s)
        Parameters:
        s - must not be null
        Returns:
        a (positive or 0) or negative number. If positive, it is the offset in the common string. If negative, -v is the index (starting at 1) that sequentially increases, for each new unique string fetched using this method.
      • getString

        public java.lang.String getString​(java.lang.String s)
        return a string which is made as a substring of the common string
        Parameters:
        s - -
        Returns:
        an equal string, made as substring of common string instance equal results return the same string
      • getOffset

        public long getOffset​(java.lang.String s)
      • getOffset

        public int getOffset​(int i)
      • getCommonStringIndex

        public int getCommonStringIndex​(int index)
        Parameters:
        index - an index (not offset) to the sorted strings,
        Returns:
        the index of the segment it belongs to
      • getStringArray

        public java.lang.String[] getStringArray​(java.lang.String[] sai)
      • updateStringArray

        public void updateStringArray​(java.lang.String[] sa)
      • optimize

        public void optimize()
        Fully checking indexof for every new string is prohibitively expensive We do a partial check - only checking if a string is a substring of the previous one
      • optimizeI

        private void optimizeI​(java.lang.String[] sortedStrings)
      • eliminateSortedStringDuplicates

        private int eliminateSortedStringDuplicates​(java.lang.String[] sortedStrings)
        Scan sorted strings, removing duplicates. Copy refs so that final array has no dups up to new length Return new length, but don't trim array
        Parameters:
        sortedStrings -
        Returns:
        new length
      • sortStrings

        private java.lang.String[] sortStrings​(java.lang.String[] sa)