Package org.apache.uima.util.impl
Class OptimizeStrings
- java.lang.Object
-
- org.apache.uima.util.impl.OptimizeStrings
-
public class OptimizeStrings extends java.lang.Object
Share common underlying char[] among strings: Optimize sets of strings for efficient storage Apply it to a set of strings. Lifecycle: 1) make an instance of this class 2) call .add (String or String[]) for all strings in the set or - skip 1 and 2 and pass in a String [] that can be modified (sorted] 3) call .optimize() 4) call .getString(String) or .getStringArray(String[]) - returns new string objects .updateStringArray(String[]) to replace original Strings .getOffset(String) - returns int offset in some common string .getCommonStringIndex(String) 5) call getCommonStrings to get all the common strings 6) Let the GC collect the instance of this class The strings are added first to a big list, instead of directly to a stringbuilder in order to improve reuse by sorting for longest strings first The commonStrings are kept in an array. There are more than one to support very large amounts, in excess of 2GB. Nulls, passed in as strings, are mostly ignored, but handled appropriately.
-
-
Field Summary
Fields Modifier and Type Field Description private java.lang.String[]
commonStringsA
private boolean
doMeasurement
private java.util.ArrayList<java.lang.String>
inStrings
private int[]
lastIndexInCommonStringsA
holds the last index (counting down in sort order) that is valid for the corresponding common string segment.private int
nextSeq
private int[]
offsets
private java.util.Map<java.lang.String,java.lang.String>
returnedStrings
private long
savedCharsExact
private long
savedCharsSubstr
private int
splitSize
splitSize = 100 means the common string can be as large as 100 the common string can hold a string of length 100private java.util.Map<java.lang.String,java.lang.Integer>
stringToIndexMap
A two hop map from strings to offsets is used.
-
Constructor Summary
Constructors Constructor Description OptimizeStrings(boolean doMeasurement)
OptimizeStrings(boolean doMeasurement, int splitSize)
-
Method Summary
All Methods Instance Methods Concrete Methods Modifier and Type Method Description void
add(java.lang.String s)
null strings not added 0 length strings addedvoid
add(java.lang.String[] sa)
private int
eliminateSortedStringDuplicates(java.lang.String[] sortedStrings)
Scan sorted strings, removing duplicates.int
getCommonStringIndex(int index)
java.lang.String[]
getCommonStrings()
The list of common stringsint
getIndexOrSeqIndex(java.lang.String s)
int
getOffset(int i)
long
getOffset(java.lang.String s)
long
getSavedCharsExact()
The number of characters saved - for statistical reporting onlylong
getSavedCharsSubstr()
java.lang.String
getString(java.lang.String s)
return a string which is made as a substring of the common stringjava.lang.String[]
getStringArray(java.lang.String[] sai)
int
getStringIndex(java.lang.String s)
void
optimize()
Fully checking indexof for every new string is prohibitively expensive We do a partial check - only checking if a string is a substring of the previous oneprivate void
optimizeI(java.lang.String[] sortedStrings)
private java.lang.String[]
sortStrings(java.lang.String[] sa)
void
updateStringArray(java.lang.String[] sa)
-
-
-
Field Detail
-
splitSize
private int splitSize
splitSize = 100 means the common string can be as large as 100 the common string can hold a string of length 100
-
inStrings
private java.util.ArrayList<java.lang.String> inStrings
-
stringToIndexMap
private java.util.Map<java.lang.String,java.lang.Integer> stringToIndexMap
A two hop map from strings to offsets is used. "index" (int): The first hop: goes from string to an int index, incrementing 1 per unique shared common string segment "offset" (int): The 2nd hop: goes from the first index to the actual int offset in the particular associated piece of the common strings. The lastIndexInCommonStringA array holds the last index value within each of the common string segments
-
offsets
private int[] offsets
-
commonStringsA
private java.lang.String[] commonStringsA
-
lastIndexInCommonStringsA
private int[] lastIndexInCommonStringsA
holds the last index (counting down in sort order) that is valid for the corresponding common string segment.
-
returnedStrings
private java.util.Map<java.lang.String,java.lang.String> returnedStrings
-
savedCharsExact
private long savedCharsExact
-
savedCharsSubstr
private long savedCharsSubstr
-
nextSeq
private int nextSeq
-
doMeasurement
private final boolean doMeasurement
-
-
Method Detail
-
getSavedCharsExact
public long getSavedCharsExact()
The number of characters saved - for statistical reporting only- Returns:
- the number of characters saved
-
getSavedCharsSubstr
public long getSavedCharsSubstr()
-
getCommonStrings
public java.lang.String[] getCommonStrings()
The list of common strings- Returns:
- the list of common strings
-
add
public void add(java.lang.String s)
null strings not added 0 length strings added- Parameters:
s
- -
-
add
public void add(java.lang.String[] sa)
-
getStringIndex
public int getStringIndex(java.lang.String s)
-
getIndexOrSeqIndex
public int getIndexOrSeqIndex(java.lang.String s)
- Parameters:
s
- must not be null- Returns:
- a (positive or 0) or negative number. If positive, it is the offset in the common string. If negative, -v is the index (starting at 1) that sequentially increases, for each new unique string fetched using this method.
-
getString
public java.lang.String getString(java.lang.String s)
return a string which is made as a substring of the common string- Parameters:
s
- -- Returns:
- an equal string, made as substring of common string instance equal results return the same string
-
getOffset
public long getOffset(java.lang.String s)
-
getOffset
public int getOffset(int i)
-
getCommonStringIndex
public int getCommonStringIndex(int index)
- Parameters:
index
- an index (not offset) to the sorted strings,- Returns:
- the index of the segment it belongs to
-
getStringArray
public java.lang.String[] getStringArray(java.lang.String[] sai)
-
updateStringArray
public void updateStringArray(java.lang.String[] sa)
-
optimize
public void optimize()
Fully checking indexof for every new string is prohibitively expensive We do a partial check - only checking if a string is a substring of the previous one
-
optimizeI
private void optimizeI(java.lang.String[] sortedStrings)
-
eliminateSortedStringDuplicates
private int eliminateSortedStringDuplicates(java.lang.String[] sortedStrings)
Scan sorted strings, removing duplicates. Copy refs so that final array has no dups up to new length Return new length, but don't trim array- Parameters:
sortedStrings
-- Returns:
- new length
-
sortStrings
private java.lang.String[] sortStrings(java.lang.String[] sa)
-
-