Class ExternalSort

java.lang.Object
com.google.code.externalsorting.ExternalSort

public class ExternalSort extends Object
Goal: offer a generic external-memory sorting program in Java. It must be : - hackable (easy to adapt) - scalable to large files - sensibly efficient. This software is in the public domain. Usage: java com/google/code/externalsorting/ExternalSort somefile.txt out.txt You can change the default maximal number of temporary files with the -t flag: java com/google/code/externalsorting/ExternalSort somefile.txt out.txt -t 3 For very large files, you might want to use an appropriate flag to allocate more memory to the Java VM: java -Xms2G com/google/code/externalsorting/ExternalSort somefile.txt out.txt By (in alphabetical order) Philippe Beaudoin, Eleftherios Chetzakis, Jon Elsas, Christan Grant, Daniel Haran, Daniel Lemire, Sugumaran Harikrishnan, Amit Jain, Thomas Mueller, Jerry Yang, First published: April 2010 originally posted at http://lemire.me/blog/archives/2010/04/01/external-memory-sorting-in-java/
  • Field Details

    • defaultcomparator

      public static Comparator<String> defaultcomparator
      default comparator between strings.
    • DEFAULTMAXTEMPFILES

      public static final int DEFAULTMAXTEMPFILES
      Default maximal number of temporary files allowed.
      See Also:
  • Constructor Details

    • ExternalSort

      public ExternalSort()
  • Method Details

    • displayUsage

      private static void displayUsage()
    • estimateAvailableMemory

      public static long estimateAvailableMemory()
      This method calls the garbage collector and then returns the free memory. This avoids problems with applications where the GC hasn't reclaimed memory and reports no available memory.
      Returns:
      available memory
    • estimateBestSizeOfBlocks

      public static long estimateBestSizeOfBlocks(long sizeoffile, int maxtmpfiles, long maxMemory)
      we divide the file into small blocks. If the blocks are too small, we shall create too many temporary files. If they are too big, we shall be using too much memory.
      Parameters:
      sizeoffile - how much data (in bytes) can we expect
      maxtmpfiles - how many temporary files can we create (e.g., 1024)
      maxMemory - Maximum memory to use (in bytes)
      Returns:
      the estimate
    • main

      public static void main(String[] args) throws IOException
      Parameters:
      args - command line argument
      Throws:
      IOException - generic IO exception
    • mergeSortedFiles

      public static long mergeSortedFiles(BufferedWriter fbw, Comparator<String> cmp, boolean distinct, List<IOStringStack> buffers) throws IOException
      This merges several BinaryFileBuffer to an output writer.
      Parameters:
      fbw - A buffer where we write the data.
      cmp - A comparator object that tells us how to sort the lines.
      distinct - Pass true if duplicate lines should be discarded.
      buffers - Where the data should be read.
      Returns:
      The number of lines sorted.
      Throws:
      IOException - generic IO exception
    • mergeSortedFiles

      public static long mergeSortedFiles(List<File> files, File outputfile) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files - The List of sorted Files to be merged.
      outputfile - The output File to merge the results to.
      Returns:
      The number of lines sorted.
      Throws:
      IOException - generic IO exception
    • mergeSortedFiles

      public static long mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files - The List of sorted Files to be merged.
      outputfile - The output File to merge the results to.
      cmp - The Comparator to use to compare Strings.
      Returns:
      The number of lines sorted.
      Throws:
      IOException - generic IO exception
    • mergeSortedFiles

      public static long mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, boolean distinct) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files - The List of sorted Files to be merged.
      outputfile - The output File to merge the results to.
      cmp - The Comparator to use to compare Strings.
      distinct - Pass true if duplicate lines should be discarded.
      Returns:
      The number of lines sorted.
      Throws:
      IOException - generic IO exception
    • mergeSortedFiles

      public static long mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files - The List of sorted Files to be merged.
      outputfile - The output File to merge the results to.
      cmp - The Comparator to use to compare Strings.
      cs - The Charset to be used for the byte to character conversion.
      Returns:
      The number of lines sorted.
      Throws:
      IOException - generic IO exception
    • mergeSortedFiles

      public static long mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files - The List of sorted Files to be merged.
      outputfile - The output File to merge the results to.
      cmp - The Comparator to use to compare Strings.
      cs - The Charset to be used for the byte to character conversion.
      distinct - Pass true if duplicate lines should be discarded.
      Returns:
      The number of lines sorted.
      Throws:
      IOException - generic IO exception
      Since:
      v0.1.2
    • mergeSortedFiles

      public static long mergeSortedFiles(List<File> files, File outputfile, Comparator<String> cmp, Charset cs, boolean distinct, boolean append, boolean usegzip) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files - The List of sorted Files to be merged.
      outputfile - The output File to merge the results to.
      cmp - The Comparator to use to compare Strings.
      cs - The Charset to be used for the byte to character conversion.
      distinct - Pass true if duplicate lines should be discarded.
      append - Pass true if result should append to File instead of overwrite. Default to be false for overloading methods.
      usegzip - assumes we used gzip compression for temporary files
      Returns:
      The number of lines sorted.
      Throws:
      IOException - generic IO exception
      Since:
      v0.1.4
    • mergeSortedFiles

      public static long mergeSortedFiles(List<File> files, BufferedWriter fbw, Comparator<String> cmp, Charset cs, boolean distinct, boolean usegzip) throws IOException
      This merges a bunch of temporary flat files
      Parameters:
      files - The List of sorted Files to be merged.
      fbw - The output BufferedWriter to merge the results to.
      cmp - The Comparator to use to compare Strings.
      cs - The Charset to be used for the byte to character conversion.
      distinct - Pass true if duplicate lines should be discarded.
      usegzip - assumes we used gzip compression for temporary files
      Returns:
      The number of lines sorted.
      Throws:
      IOException - generic IO exception
      Since:
      v0.1.4
    • sort

      public static void sort(File input, File output) throws IOException
      This sorts a file (input) to an output file (output) using default parameters
      Parameters:
      input - source file
      output - output file
      Throws:
      IOException - generic IO exception
    • sort

      public static void sort(File input, File output, Comparator<String> cmp) throws IOException
      This sorts a file (input) to an output file (output) using customized comparator
      Parameters:
      input - source file
      output - output file
      cmp - The Comparator to use to compare Strings.
      Throws:
      IOException - generic IO exception
    • sortAndSave

      public static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory) throws IOException
      Sort a list and save it to a temporary file
      Parameters:
      tmplist - data to be sorted
      cmp - string comparator
      cs - charset to use for output (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      Returns:
      the file containing the sorted data
      Throws:
      IOException - generic IO exception
    • sortAndSave

      public static File sortAndSave(List<String> tmplist, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, boolean usegzip, boolean parallel) throws IOException
      Sort a list and save it to a temporary file
      Parameters:
      tmplist - data to be sorted
      cmp - string comparator
      cs - charset to use for output (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct - Pass true if duplicate lines should be discarded.
      usegzip - set to true if you are using gzip compression for the temporary files
      parallel - set to true when sorting in parallel
      Returns:
      the file containing the sorted data
      Throws:
      IOException - generic IO exception
    • sortInBatch

      public static List<File> sortInBatch(BufferedReader fbr, long datalength) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      Parameters:
      fbr - data source
      datalength - estimated data volume (in bytes)
      Returns:
      a list of temporary flat files
      Throws:
      IOException - generic IO exception
    • sortInBatch

      public static List<File> sortInBatch(BufferedReader fbr, long datalength, Comparator<String> cmp, boolean distinct) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      Parameters:
      fbr - data source
      datalength - estimated data volume (in bytes)
      cmp - string comparator
      distinct - Pass true if duplicate lines should be discarded.
      Returns:
      a list of temporary flat files
      Throws:
      IOException - generic IO exception
    • sortInBatch

      public static List<File> sortInBatch(BufferedReader fbr, long datalength, Comparator<String> cmp, int maxtmpfiles, long maxMemory, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      Parameters:
      fbr - data source
      datalength - estimated data volume (in bytes)
      cmp - string comparator
      maxtmpfiles - maximal number of temporary files
      maxMemory - maximum amount of memory to use (in bytes)
      cs - character set to use (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct - Pass true if duplicate lines should be discarded.
      numHeader - number of lines to preclude before sorting starts
      usegzip - use gzip compression for the temporary files
      parallel - sort in parallel
      Returns:
      a list of temporary flat files
      Throws:
      IOException - generic IO exception
    • sortInBatch

      public static List<File> sortInBatch(File file) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      Parameters:
      file - some flat file
      Returns:
      a list of temporary flat files
      Throws:
      IOException - generic IO exception
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      Parameters:
      file - some flat file
      cmp - string comparator
      Returns:
      a list of temporary flat files
      Throws:
      IOException - generic IO exception
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, boolean distinct) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
      Parameters:
      file - some flat file
      cmp - string comparator
      distinct - Pass true if duplicate lines should be discarded.
      Returns:
      a list of temporary flat files
      Throws:
      IOException - generic IO exception
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, File tmpdirectory, boolean distinct, int numHeader) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
      Parameters:
      file - some flat file
      cmp - string comparator
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct - Pass true if duplicate lines should be discarded.
      numHeader - number of lines to preclude before sorting starts
      Returns:
      a list of temporary flat files
      Throws:
      IOException - generic IO exception
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
      Parameters:
      file - some flat file
      cmp - string comparator
      maxtmpfiles - maximal number of temporary files
      cs - character set to use (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct - Pass true if duplicate lines should be discarded.
      Returns:
      a list of temporary flat files
      Throws:
      IOException - generic IO exception
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, Charset cs, File tmpdirectory, boolean distinct, int numHeader) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
      Parameters:
      file - some flat file
      cmp - string comparator
      cs - character set to use (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct - Pass true if duplicate lines should be discarded.
      numHeader - number of lines to preclude before sorting starts
      Returns:
      a list of temporary flat files
      Throws:
      IOException - generic IO exception
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
      Parameters:
      file - some flat file
      cmp - string comparator
      maxtmpfiles - maximal number of temporary files
      cs - character set to use (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct - Pass true if duplicate lines should be discarded.
      numHeader - number of lines to preclude before sorting starts
      Returns:
      a list of temporary flat files
      Throws:
      IOException - generic IO exception
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
      Parameters:
      file - some flat file
      cmp - string comparator
      maxtmpfiles - maximal number of temporary files
      cs - character set to use (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct - Pass true if duplicate lines should be discarded.
      numHeader - number of lines to preclude before sorting starts
      usegzip - use gzip compression for the temporary files
      Returns:
      a list of temporary flat files
      Throws:
      IOException - generic IO exception
    • sortInBatch

      public static List<File> sortInBatch(File file, Comparator<String> cmp, int maxtmpfiles, Charset cs, File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel) throws IOException
      This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.
      Parameters:
      file - some flat file
      cmp - string comparator
      maxtmpfiles - maximal number of temporary files
      cs - character set to use (can use Charset.defaultCharset())
      tmpdirectory - location of the temporary files (set to null for default location)
      distinct - Pass true if duplicate lines should be discarded.
      numHeader - number of lines to preclude before sorting starts
      usegzip - use gzip compression for the temporary files
      parallel - whether to sort in parallel
      Returns:
      a list of temporary flat files
      Throws:
      IOException - generic IO exception