Class UnicodeCompressor

java.lang.Object
com.ibm.icu.text.UnicodeCompressor

public final class UnicodeCompressor extends Object
A compression engine implementing the Standard Compression Scheme for Unicode (SCSU) as outlined in Unicode Technical Report #6.

The SCSU works by using dynamically positioned windows, each consisting of 128 consecutive Unicode characters. During compression, characters within the active window are encoded in the compressed stream as the bytes 0x80 to 0xFF. The SCSU provides transparency for the characters (bytes) in the range U+0000 to U+00FF. The SCSU approximates the storage size of traditional character sets, for example 1 byte per character for ASCII or Latin-1 text, and 2 bytes per character for CJK ideographs.
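
The window mapping described above can be illustrated with a small, simplified sketch. It is not the ICU4J implementation: the class ScsuWindowSketch and the method encodeInWindow are hypothetical names, and the sketch assumes a single window that is already active, so it leaves out the tag bytes a real SCSU stream uses to select or define windows.

```java
// Simplified illustration of SCSU single-byte mode with ONE fixed window.
// Hypothetical helper, not part of ICU4J; the tag bytes that select or
// define windows, and Unicode mode, are deliberately left out.
final class ScsuWindowSketch {

    /** Number of consecutive characters covered by one window. */
    static final int WINDOW_SIZE = 128;

    /**
     * Encodes each char of s as one byte, assuming the window starting at
     * windowStart is already active. Characters 0x20..0x7F pass through
     * unchanged; characters inside the window map to 0x80..0xFF.
     */
    static byte[] encodeInWindow(String s, int windowStart) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 && c <= 0x7F) {
                out[i] = (byte) c;                          // ASCII: same byte value
            } else if (c >= windowStart && c < windowStart + WINDOW_SIZE) {
                out[i] = (byte) (0x80 + (c - windowStart)); // offset in window + 0x80
            } else {
                throw new IllegalArgumentException(
                    "outside the sketch's single window: U+"
                    + Integer.toHexString(c).toUpperCase());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Six Cyrillic letters written as escapes; the window starts at U+0400.
        byte[] bytes = encodeInWindow("\u041C\u043E\u0441\u043A\u0432\u0430", 0x0400);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
    }
}
```

Each of the six characters becomes one byte (9C BE C1 BA B2 B0), compared with two bytes each in UTF-16. A real compressor additionally has to emit a tag byte whenever it switches or repositions a window, which this sketch does not model.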

USAGE

The static methods on UnicodeCompressor may be used in a straightforward manner to compress simple strings:

  String s = ... ; // get string from somewhere
  byte [] compressed = UnicodeCompressor.compress(s);
 

The static methods have a fairly large memory footprint. For finer-grained control over memory usage, UnicodeCompressor offers more powerful APIs allowing iterative compression:

  // Compress an array "chars" of length "len" using a buffer of 512 bytes
  // to the OutputStream "out"

  UnicodeCompressor myCompressor         = new UnicodeCompressor();
  final int         BUFSIZE              = 512;
  byte []           byteBuffer           = new byte [ BUFSIZE ];
  int               bytesWritten         = 0;
  int []            unicharsRead         = new int [1];
  int               totalCharsCompressed = 0;
  int               totalBytesWritten    = 0;

  do {
    // do the compression
    bytesWritten = myCompressor.compress(chars, totalCharsCompressed, 
                                         len, unicharsRead,
                                         byteBuffer, 0, BUFSIZE);

    // do something with the current set of bytes
    out.write(byteBuffer, 0, bytesWritten);

    // update the number of characters compressed
    totalCharsCompressed += unicharsRead[0];

    // update the number of bytes written
    totalBytesWritten += bytesWritten;

  } while(totalCharsCompressed < len);

  myCompressor.reset(); // reuse compressor
 
Author:
Stephen F. Booth
See Also: