Package com.google.code.externalsorting
Class ExternalSort
java.lang.Object
com.google.code.externalsorting.ExternalSort
Goal: offer a generic external-memory sorting program in Java.
It must be : - hackable (easy to adapt) - scalable to large files - sensibly
efficient.
This software is in the public domain.
Usage: java com/google/code/externalsorting/ExternalSort somefile.txt out.txt
You can change the default maximal number of temporary files with the -t
flag: java com/google/code/externalsorting/ExternalSort somefile.txt out.txt
-t 3
For very large files, you might want to use an appropriate flag to allocate
more memory to the Java VM: java -Xms2G
com/google/code/externalsorting/ExternalSort somefile.txt out.txt
By (in alphabetical order) Philippe Beaudoin, Eleftherios Chetzakis, Jon
Elsas, Christan Grant, Daniel Haran, Daniel Lemire, Sugumaran Harikrishnan,
Amit Jain, Thomas Mueller, Jerry Yang, First published: April 2010 originally posted at
http://lemire.me/blog/archives/2010/04/01/external-memory-sorting-in-java/
-
Field Summary
FieldsModifier and TypeFieldDescriptionstatic Comparator
<String> default comparator between strings.static final int
Default maximal number of temporary files allowed. -
Constructor Summary
Constructors -
Method Summary
Modifier and TypeMethodDescriptionprivate static void
static long
This method calls the garbage collector and then returns the free memory.static long
estimateBestSizeOfBlocks
(long sizeoffile, int maxtmpfiles, long maxMemory) we divide the file into small blocks.static void
static long
mergeSortedFiles
(BufferedWriter fbw, Comparator<String> cmp, boolean distinct, List<IOStringStack> buffers) This merges several BinaryFileBuffer to an output writer.static long
mergeSortedFiles
(List<File> files, BufferedWriter fbw, Comparator<String> cmp, Charset cs, boolean distinct, boolean usegzip) This merges a bunch of temporary flat filesstatic long
mergeSortedFiles
(List<File> files, File outputfile) This merges a bunch of temporary flat filesstatic long
mergeSortedFiles
(List<File> files, File outputfile, Comparator<String> cmp) This merges a bunch of temporary flat filesstatic long
mergeSortedFiles
(List<File> files, File outputfile, Comparator<String> cmp, boolean distinct) This merges a bunch of temporary flat filesstatic long
mergeSortedFiles
(List<File> files, File outputfile, Comparator<String> cmp, Charset cs) This merges a bunch of temporary flat filesstatic long
mergeSortedFiles
(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct) This merges a bunch of temporary flat filesstatic long
mergeSortedFiles
(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip) This merges a bunch of temporary flat filesstatic void
This sorts a file (input) to an output file (output) using default parametersstatic void
sort
(File input, File output, Comparator<String> cmp) This sorts a file (input) to an output file (output) using customized comparatorstatic File
sortAndSave
(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory) Sort a list and save it to a temporary filestatic File
sortAndSave
(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip, boolean parallel) Sort a list and save it to a temporary filesortInBatch
(BufferedReader fbr, long datalength) This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.sortInBatch
(BufferedReader fbr, long datalength, Comparator<String> cmp, boolean distinct) This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.sortInBatch
(BufferedReader fbr, long datalength, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel) This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.sortInBatch
(File file) This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.sortInBatch
(File file, Comparator<String> cmp) This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.sortInBatch
(File file, Comparator<String> cmp, boolean distinct) This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.sortInBatch
(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct) This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.sortInBatch
(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader) This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.sortInBatch
(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip) This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.sortInBatch
(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel) This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.sortInBatch
(File file, Comparator<String> cmp, File tmpdirectory, boolean distinct, int numHeader) This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.sortInBatch
(File file, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, int numHeader) This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
-
Field Details
-
defaultcomparator
default comparator between strings. -
DEFAULTMAXTEMPFILES
public static final int DEFAULTMAXTEMPFILESDefault maximal number of temporary files allowed.- See Also:
-
-
Constructor Details
-
ExternalSort
public ExternalSort()
-
-
Method Details
-
displayUsage
private static void displayUsage() -
estimateAvailableMemory
public static long estimateAvailableMemory()This method calls the garbage collector and then returns the free memory. This avoids problems with applications where the GC hasn't reclaimed memory and reports no available memory.- Returns:
- available memory
-
estimateBestSizeOfBlocks
public static long estimateBestSizeOfBlocks(long sizeoffile, int maxtmpfiles, long maxMemory) we divide the file into small blocks. If the blocks are too small, we shall create too many temporary files. If they are too big, we shall be using too much memory.- Parameters:
sizeoffile
- how much data (in bytes) can we expectmaxtmpfiles
- how many temporary files can we create (e.g., 1024)maxMemory
- Maximum memory to use (in bytes)- Returns:
- the estimate
-
main
- Parameters:
args
- command line argument- Throws:
IOException
- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(BufferedWriter fbw, Comparator<String> cmp, boolean distinct, List<IOStringStack> buffers) throws IOException This merges several BinaryFileBuffer to an output writer.- Parameters:
fbw
- A buffer where we write the data.cmp
- A comparator object that tells us how to sort the lines.distinct
- Passtrue
if duplicate lines should be discarded.buffers
- Where the data should be read.- Returns:
- The number of lines sorted.
- Throws:
IOException
- generic IO exception
-
mergeSortedFiles
This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.outputfile
- The outputFile
to merge the results to.- Returns:
- The number of lines sorted.
- Throws:
IOException
- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp) throws IOException This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.outputfile
- The outputFile
to merge the results to.cmp
- TheComparator
to use to compareString
s.- Returns:
- The number of lines sorted.
- Throws:
IOException
- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, boolean distinct) throws IOException This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.outputfile
- The outputFile
to merge the results to.cmp
- TheComparator
to use to compareString
s.distinct
- Passtrue
if duplicate lines should be discarded.- Returns:
- The number of lines sorted.
- Throws:
IOException
- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs) throws IOException This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.outputfile
- The outputFile
to merge the results to.cmp
- TheComparator
to use to compareString
s.cs
- TheCharset
to be used for the byte to character conversion.- Returns:
- The number of lines sorted.
- Throws:
IOException
- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct) throws IOException This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.outputfile
- The outputFile
to merge the results to.cmp
- TheComparator
to use to compareString
s.cs
- TheCharset
to be used for the byte to character conversion.distinct
- Passtrue
if duplicate lines should be discarded.- Returns:
- The number of lines sorted.
- Throws:
IOException
- generic IO exception- Since:
- v0.1.2
-
mergeSortedFiles
public static long mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip) throws IOException This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.outputfile
- The outputFile
to merge the results to.cmp
- TheComparator
to use to compareString
s.cs
- TheCharset
to be used for the byte to character conversion.distinct
- Passtrue
if duplicate lines should be discarded.append
- Passtrue
if result should append toFile
instead of overwrite. Default to be false for overloading methods.usegzip
- assumes we used gzip compression for temporary files- Returns:
- The number of lines sorted.
- Throws:
IOException
- generic IO exception- Since:
- v0.1.4
-
mergeSortedFiles
public static long mergeSortedFiles(List<File> files, BufferedWriter fbw, Comparator<String> cmp, Charset cs, boolean distinct, boolean usegzip) throws IOException This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.fbw
- The outputBufferedWriter
to merge the results to.cmp
- TheComparator
to use to compareString
s.cs
- TheCharset
to be used for the byte to character conversion.distinct
- Passtrue
if duplicate lines should be discarded.usegzip
- assumes we used gzip compression for temporary files- Returns:
- The number of lines sorted.
- Throws:
IOException
- generic IO exception- Since:
- v0.1.4
-
sort
This sorts a file (input) to an output file (output) using default parameters- Parameters:
input
- source fileoutput
- output file- Throws:
IOException
- generic IO exception
-
sort
This sorts a file (input) to an output file (output) using customized comparator- Parameters:
input
- source fileoutput
- output filecmp
- TheComparator
to use to compareString
s.- Throws:
IOException
- generic IO exception
-
sortAndSave
public static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory) throws IOException Sort a list and save it to a temporary file- Parameters:
tmplist
- data to be sortedcmp
- string comparatorcs
- charset to use for output (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)- Returns:
- the file containing the sorted data
- Throws:
IOException
- generic IO exception
-
sortAndSave
public static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip, boolean parallel) throws IOException Sort a list and save it to a temporary file- Parameters:
tmplist
- data to be sortedcmp
- string comparatorcs
- charset to use for output (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.usegzip
- set totrue
if you are using gzip compression for the temporary filesparallel
- set totrue
when sorting in parallel- Returns:
- the file containing the sorted data
- Throws:
IOException
- generic IO exception
-
sortInBatch
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
fbr
- data sourcedatalength
- estimated data volume (in bytes)- Returns:
- a list of temporary flat files
- Throws:
IOException
- generic IO exception
-
sortInBatch
public static List<File> sortInBatch(BufferedReader fbr, long datalength, Comparator<String> cmp, boolean distinct) throws IOException This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
fbr
- data sourcedatalength
- estimated data volume (in bytes)cmp
- string comparatordistinct
- Passtrue
if duplicate lines should be discarded.- Returns:
- a list of temporary flat files
- Throws:
IOException
- generic IO exception
-
sortInBatch
public static List<File> sortInBatch(BufferedReader fbr, long datalength, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel) throws IOException This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
fbr
- data sourcedatalength
- estimated data volume (in bytes)cmp
- string comparatormaxtmpfiles
- maximal number of temporary filesmaxMemory
- maximum amount of memory to use (in bytes)cs
- character set to use (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.numHeader
- number of lines to preclude before sorting startsusegzip
- use gzip compression for the temporary filesparallel
- sort in parallel- Returns:
- a list of temporary flat files
- Throws:
IOException
- generic IO exception
-
sortInBatch
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
file
- some flat file- Returns:
- a list of temporary flat files
- Throws:
IOException
- generic IO exception
-
sortInBatch
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
file
- some flat filecmp
- string comparator- Returns:
- a list of temporary flat files
- Throws:
IOException
- generic IO exception
-
sortInBatch
public static List<File> sortInBatch(File file, Comparator<String> cmp, boolean distinct) throws IOException This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
file
- some flat filecmp
- string comparatordistinct
- Passtrue
if duplicate lines should be discarded.- Returns:
- a list of temporary flat files
- Throws:
IOException
- generic IO exception
-
sortInBatch
public static List<File> sortInBatch(File file, Comparator<String> cmp, File tmpdirectory, boolean distinct, int numHeader) throws IOException This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file
- some flat filecmp
- string comparatortmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.numHeader
- number of lines to preclude before sorting starts- Returns:
- a list of temporary flat files
- Throws:
IOException
- generic IO exception
-
sortInBatch
public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct) throws IOException This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file
- some flat filecmp
- string comparatormaxtmpfiles
- maximal number of temporary filescs
- character set to use (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.- Returns:
- a list of temporary flat files
- Throws:
IOException
- generic IO exception
-
sortInBatch
public static List<File> sortInBatch(File file, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, int numHeader) throws IOException This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file
- some flat filecmp
- string comparatorcs
- character set to use (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.numHeader
- number of lines to preclude before sorting starts- Returns:
- a list of temporary flat files
- Throws:
IOException
- generic IO exception
-
sortInBatch
public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader) throws IOException This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file
- some flat filecmp
- string comparatormaxtmpfiles
- maximal number of temporary filescs
- character set to use (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.numHeader
- number of lines to preclude before sorting starts- Returns:
- a list of temporary flat files
- Throws:
IOException
- generic IO exception
-
sortInBatch
public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip) throws IOException This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file
- some flat filecmp
- string comparatormaxtmpfiles
- maximal number of temporary filescs
- character set to use (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.numHeader
- number of lines to preclude before sorting startsusegzip
- use gzip compression for the temporary files- Returns:
- a list of temporary flat files
- Throws:
IOException
- generic IO exception
-
sortInBatch
public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel) throws IOException This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file
- some flat filecmp
- string comparatormaxtmpfiles
- maximal number of temporary filescs
- character set to use (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.numHeader
- number of lines to preclude before sorting startsusegzip
- use gzip compression for the temporary filesparallel
- whether to sort in parallel- Returns:
- a list of temporary flat files
- Throws:
IOException
- generic IO exception
-