Class EncodingGuesser


  • public class EncodingGuesser
    extends java.lang.Object
    This class contains a list of known encodings used by TextMimeType. It is used by the TextMimeDetector but can also be used as a standalone utility class in other parts of your program.

    The getPossibleEncodings() method takes a byte[] as its source; the larger the array, the better the detection ratio will be.

    The class is initialised with an empty list of encodings, so it is effectively disabled by default. At any point during program execution you can set the supported encodings to ALL of the encodings supported by your JVM by calling EncodingGuesser.setSupportedEncodings(EncodingGuesser.getCanonicalEncodingNamesSupportedByJVM()). You can also clear the encodings, and so disable the detector, at any point by calling EncodingGuesser.setSupportedEncodings(new ArrayList()). If you later add more encodings dynamically, they will NOT be detected automatically by this class, but you can call the above method again.

    As the JVM can support a large number of encodings, and each one is checked against the byte array, it may be wise to remove every encoding you are sure you will not use in order to reduce the number of tests. Detection does not stop at the first match: it tries to match as many encodings as possible and returns them as a Collection.
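
The "test every candidate and collect all matches" behaviour described above can be illustrated with a minimal sketch that uses only the JDK. It is not the EncodingGuesser implementation: the class name CandidateEncodingSketch and the method possibleEncodings are hypothetical, and strict decoding with CharsetDecoder is an assumed stand-in for whatever checks the library actually performs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of the documented library.
class CandidateEncodingSketch {

    // Returns every candidate whose strict decoder accepts the whole array.
    static Collection<String> possibleEncodings(byte[] data, List<String> candidates) {
        Collection<String> matches = new ArrayList<>();
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                continue; // skip names this JVM does not know
            }
            try {
                Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                matches.add(name); // do not stop here: several encodings may fit
            } catch (CharacterCodingException e) {
                // the bytes are not valid in this encoding, so it is not a match
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // A short candidate list keeps the number of decode attempts small.
        System.out.println(possibleEncodings(utf8, Arrays.asList("US-ASCII", "UTF-8", "ISO-8859-1")));
    }
}
```

Because single-byte encodings such as ISO-8859-1 accept any byte sequence, a sketch like this will often report more than one match, which is one reason a short candidate list is useful.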

    A common scenario is an application that can handle only a small set of text encodings, such as UTF-8 and windows-1252. If this is your case, you can use the setSupportedEncodings() method so that these are the only encodings in the supported encodings Collection. Because far fewer encodings are then tested, this can dramatically improve the performance of this class.

    Small byte arrays of binary data may occasionally be reported as possible text matches, but in general binary data, such as images, should return no matches.

    Some optimisations apply to text files that begin with a BOM (Byte Order Mark), which can occur in UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE data. A BOM is not required, but if one is present it will greatly improve the possible matches returned from the getPossibleEncodings() method.

    • Constructor Summary

      Constructors 
      Constructor Description
      EncodingGuesser()  
    • Method Summary

      All Methods Static Methods Concrete Methods 
      Modifier and Type Method Description
      static boolean compareByteArrays​(byte[] a, int aOffset, byte[] b, int bOffset, int length)
      Utility method to compare a region of two byte arrays for equality
      static byte[] getByteArraySubArray​(byte[] a, int offset, int length)
      Get a sub array of this byte array starting at offset until length
      static java.util.Collection getCanonicalEncodingNamesSupportedByJVM()
      Utility method to get all of the current encoding names, in canonical format, supported by your JVM at the time this is called.
      static java.lang.String getDefaultEncoding()
      Get the JVM default canonical encoding.
      static int getLengthBOM​(java.lang.String encoding, byte[] data)
      Get the length of a BOM for this encoding and byte array
      static java.util.Collection getPossibleEncodings​(byte[] data)
      Get a Collection of all the possible encodings this byte array could be used to represent.
      static java.util.Collection getSupportedEncodings()
      Get the Collection of currently supported encodings
      static java.util.Collection getValidEncodings​(java.lang.String[] encodings)
      Get a Collection containing entries in both the supported encodings and the passed in String [] of encodings.
      static boolean isKnownEncoding​(java.lang.String encoding)
      Check if the encoding String is one of the encodings supported.
      static boolean removeEncoding​(java.lang.String encoding)
      Allows you to remove an encoding you are not interested in from the supported encodings.
      static boolean removeEncodings​(java.lang.String[] encodings)
      Remove all valid encodings in the string array
      static java.util.Collection setSupportedEncodings​(java.util.Collection encodings)
      Set the supported encodings
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • EncodingGuesser

        public EncodingGuesser()
    • Method Detail

      • isKnownEncoding

        public static boolean isKnownEncoding​(java.lang.String encoding)
        Check if the encoding String is one of the encodings supported.
        Parameters:
        encoding -
        Returns:
        true if encoding is understood by this class
      • getPossibleEncodings

        public static java.util.Collection getPossibleEncodings​(byte[] data)
        Get a Collection of all the possible encodings this byte array could be used to represent.
        Parameters:
        data -
        Returns:
        the Collection of possible encodings from the supported encodings
      • removeEncoding

        public static boolean removeEncoding​(java.lang.String encoding)
        Allows you to remove an encoding you are not interested in from the supported encodings.
        Parameters:
        encoding -
        Returns:
        true if removed else false
      • removeEncodings

        public static boolean removeEncodings​(java.lang.String[] encodings)
        Remove all valid encodings in the string array
        Parameters:
        encodings - String [] containing the encodings to remove
        Returns:
        true if at least one of the encodings was removed else false
      • getValidEncodings

        public static java.util.Collection getValidEncodings​(java.lang.String[] encodings)
        Get a Collection containing entries in both the supported encodings and the passed in String [] of encodings. This is used by TextMimeDetector to get a valid list of the preferred encodings.
        Parameters:
        encodings -
        Returns:
        a Collection containing all valid encodings contained in the passed in encodings array
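
The intersection described above can be sketched with plain collections. This is an illustration only: ValidEncodingsSketch and validEncodings are hypothetical names, the supported Collection is passed in explicitly rather than held by the class, and keeping the order of the passed-in array is an assumption that the documentation does not confirm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not part of the documented library.
class ValidEncodingsSketch {

    // Keeps the entries of 'preferred' that also appear in 'supported',
    // preserving the order of the 'preferred' array and dropping duplicates.
    static Collection<String> validEncodings(String[] preferred, Collection<String> supported) {
        List<String> valid = new ArrayList<>();
        for (String name : preferred) {
            if (supported.contains(name) && !valid.contains(name)) {
                valid.add(name);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        Collection<String> supported = Arrays.asList("UTF-8", "windows-1252", "ISO-8859-1");
        String[] preferred = {"UTF-8", "KOI8-R", "windows-1252"};
        System.out.println(validEncodings(preferred, supported)); // prints [UTF-8, windows-1252]
    }
}
```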
      • getDefaultEncoding

        public static java.lang.String getDefaultEncoding()
        Get the JVM default canonical encoding. For instance the canonical encoding for cp1252 is windows-1252
        Returns:
        the default canonical encoding name for the JVM
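
For readers unfamiliar with canonical names versus aliases, the following sketch shows how the JDK itself resolves them through java.nio.charset.Charset. Whether EncodingGuesser uses these exact calls is an assumption; CanonicalNameSketch and its methods are hypothetical names.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of the documented library.
class CanonicalNameSketch {

    // Resolves a charset name or alias to the canonical name used by this JVM.
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    // Canonical name of the JVM default charset.
    static String defaultEncoding() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default: " + defaultEncoding());
        // Guarded because not every runtime is required to ship this charset.
        if (Charset.isSupported("cp1252")) {
            System.out.println("cp1252 -> " + canonicalName("cp1252"));
        }
    }
}
```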
      • getSupportedEncodings

        public static java.util.Collection getSupportedEncodings()
        Get the Collection of currently supported encodings
        Returns:
        the supported encodings.
      • setSupportedEncodings

        public static java.util.Collection setSupportedEncodings​(java.util.Collection encodings)
        Set the supported encodings
        Parameters:
        encodings - the encodings to support. If this is null, the supported encodings are left unchanged.
        Returns:
        a copy of the currently supported encodings
      • getLengthBOM

        public static int getLengthBOM​(java.lang.String encoding,
                                       byte[] data)
        Get the length of a BOM for this encoding and byte array
        Parameters:
        encoding -
        data -
        Returns:
        length of BOM if the data contains a BOM else returns 0
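
A BOM length check of this kind can be sketched as a prefix comparison against the standard Unicode byte order marks. This is a sketch under stated assumptions, not the library's code: BomLengthSketch and bomLength are hypothetical names, and the set of encodings handled is taken from the class description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the documented library.
class BomLengthSketch {

    // Standard Unicode byte order marks, keyed by encoding name.
    private static final Map<String, byte[]> BOMS = new LinkedHashMap<>();
    static {
        BOMS.put("UTF-8", new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF});
        BOMS.put("UTF-16BE", new byte[] {(byte) 0xFE, (byte) 0xFF});
        BOMS.put("UTF-16LE", new byte[] {(byte) 0xFF, (byte) 0xFE});
        BOMS.put("UTF-32BE", new byte[] {0x00, 0x00, (byte) 0xFE, (byte) 0xFF});
        BOMS.put("UTF-32LE", new byte[] {(byte) 0xFF, (byte) 0xFE, 0x00, 0x00});
    }

    // Returns the BOM length if 'data' starts with the BOM of 'encoding', else 0.
    static int bomLength(String encoding, byte[] data) {
        byte[] bom = BOMS.get(encoding);
        if (bom == null || data == null || data.length < bom.length) {
            return 0; // unknown encoding, or data too short to hold the BOM
        }
        for (int i = 0; i < bom.length; i++) {
            if (data[i] != bom[i]) {
                return 0;
            }
        }
        return bom.length;
    }
}
```

Note that the UTF-32LE mark begins with the same two bytes as the UTF-16LE mark, so a prefix check alone cannot tell those two apart; the caller has to decide which encoding to test first.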
      • getByteArraySubArray

        public static byte[] getByteArraySubArray​(byte[] a,
                                                  int offset,
                                                  int length)
        Get a sub array of this byte array starting at offset until length
        Parameters:
        a -
        offset -
        length -
        Returns:
        a new byte array, unless it would replicate or enlarge the original array, in which case the original array is returned
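
One possible reading of this contract is sketched below. It assumes that length is a count of bytes (not an end index), that "replicate" means a request for the whole array, and that "enlarge" means a request running past the end of the array; none of these is confirmed by the documentation. SubArraySketch and subArray are hypothetical names, and negative arguments are not validated.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the documented library.
class SubArraySketch {

    // Copies 'length' bytes of 'a' starting at 'offset'. Returns 'a' itself when
    // the request covers the whole array or would run past its end.
    static byte[] subArray(byte[] a, int offset, int length) {
        boolean wholeArray = offset == 0 && length == a.length;
        boolean pastEnd = offset + length > a.length;
        if (wholeArray || pastEnd) {
            return a; // nothing smaller to return, so hand back the original
        }
        return Arrays.copyOfRange(a, offset, offset + length);
    }
}
```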
      • compareByteArrays

        public static boolean compareByteArrays​(byte[] a,
                                                int aOffset,
                                                byte[] b,
                                                int bOffset,
                                                int length)
        Utility method to compare a region of two byte arrays for equality
        Parameters:
        a -
        aOffset -
        b -
        bOffset -
        length -
        Returns:
        true if the two regions contain the same byte values, else false
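
A region comparison of this shape can be sketched as follows. Returning false when either region falls outside its array is an assumption made for this sketch; the documentation does not say how out-of-range arguments are handled. ByteRegionCompareSketch and regionsEqual are hypothetical names.

```java
// Hypothetical helper, not part of the documented library.
class ByteRegionCompareSketch {

    // Compares a[aOffset .. aOffset+length) with b[bOffset .. bOffset+length).
    static boolean regionsEqual(byte[] a, int aOffset, byte[] b, int bOffset, int length) {
        if (aOffset < 0 || bOffset < 0 || length < 0) {
            return false; // assumed behaviour: invalid arguments never match
        }
        if (a.length - aOffset < length || b.length - bOffset < length) {
            return false; // assumed behaviour: a region that overruns never matches
        }
        for (int i = 0; i < length; i++) {
            if (a[aOffset + i] != b[bOffset + i]) {
                return false;
            }
        }
        return true;
    }
}
```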
      • getCanonicalEncodingNamesSupportedByJVM

        public static java.util.Collection getCanonicalEncodingNamesSupportedByJVM()
        Utility method to get all of the current encoding names, in canonical format, supported by your JVM at the time this is called.
        Returns:
        current Collection of canonical encoding names
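
For comparison, the JDK exposes the canonical names of all charsets currently available through Charset.availableCharsets(). The sketch below shows that call; that EncodingGuesser obtains its list this way is an assumption, and JvmEncodingNamesSketch and canonicalNames are hypothetical names.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collection;

// Hypothetical helper, not part of the documented library.
class JvmEncodingNamesSketch {

    // Snapshot of the canonical charset names available at the time of the call.
    static Collection<String> canonicalNames() {
        // availableCharsets() is keyed by canonical name; copy the keys so that
        // the result is an independent snapshot.
        return new ArrayList<>(Charset.availableCharsets().keySet());
    }

    public static void main(String[] args) {
        Collection<String> names = canonicalNames();
        System.out.println(names.size() + " encodings available on this JVM");
    }
}
```

Because the result is a snapshot, charsets that become available later do not appear in it, which matches the note in the class description about calling the method again.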