Class Utf8
- java.lang.Object
-
- com.google.protobuf.Utf8
-
final class Utf8 extends java.lang.Object
A set of low-level, high-performance static utility methods related to the UTF-8 character encoding. This class has no dependencies outside of the core JDK libraries.
There are several variants of UTF-8. The one implemented by this class is the restricted definition of UTF-8 introduced in Unicode 3.1, which mandates the rejection of "overlong" byte sequences as well as rejection of 3-byte surrogate codepoint byte sequences. Note that the UTF-8 decoder included in Oracle's JDK has been modified to also reject "overlong" byte sequences, but (as of 2011) still accepts 3-byte surrogate codepoint byte sequences.
The byte sequences considered valid by this class are exactly those that can be roundtrip converted to Strings and back to bytes using the UTF-8 charset, without loss:
Arrays.equals(bytes, new String(bytes, Internal.UTF_8).getBytes(Internal.UTF_8))
See the Unicode Standard, Table 3-6 (UTF-8 Bit Distribution) and Table 3-7 (Well-Formed UTF-8 Byte Sequences).
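The round-trip criterion above can be tried with nothing but the JDK. The following is a minimal sketch, not this class's implementation: `RoundTripUtf8Check` and `roundTrips` are hypothetical names, and `StandardCharsets.UTF_8` stands in for `Internal.UTF_8`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripUtf8Check {
  /** Returns true if decoding and re-encoding reproduces the input bytes exactly. */
  static boolean roundTrips(byte[] bytes) {
    String decoded = new String(bytes, StandardCharsets.UTF_8);
    return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    byte[] valid = {(byte) 0x68, (byte) 0xC3, (byte) 0xA9};      // 'h' followed by U+00E9
    byte[] overlong = {(byte) 0xC0, (byte) 0x80};               // overlong form of U+0000
    byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80}; // 3-byte form of U+D800
    System.out.println(roundTrips(valid));
    System.out.println(roundTrips(overlong));
    System.out.println(roundTrips(surrogate));
  }
}
```

The check allocates a String and a second byte array, so it is useful as a reference definition rather than as a fast path.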
This class supports decoding of partial byte sequences, so that the bytes in a complete UTF-8 byte sequence can be stored in multiple segments. Methods typically return:
- MALFORMED if the partial byte sequence is definitely not well-formed;
- COMPLETE if it is well-formed in the absence of additional input; or,
- if the byte sequence apparently terminated in the middle of a character, an opaque integer "state" value containing enough information to decode the character when passed to a subsequent invocation of a partial decoding method.
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class; Description)
private static class Utf8.DecodeUtil
Utility methods for decoding bytes into String.
(package private) static class Utf8.Processor
A processor of UTF-8 strings, providing methods for checking validity and encoding.
(package private) static class Utf8.SafeProcessor
Utf8.Processor implementation that does not use any sun.misc.Unsafe methods.
(package private) static class Utf8.UnpairedSurrogateException
(package private) static class Utf8.UnsafeProcessor
Utf8.Processor that uses sun.misc.Unsafe where possible to improve performance.
-
Field Summary
Fields (Modifier and Type, Field; Description)
private static long ASCII_MASK_LONG
A mask used when performing unsafe reads to determine if a long value contains any non-ASCII characters (i.e. any byte >= 0x80).
(package private) static int COMPLETE
State value indicating that the byte sequence is well-formed and complete (no further bytes are needed to complete a character).
(package private) static int MALFORMED
State value indicating that the byte sequence is definitely not well-formed.
(package private) static int MAX_BYTES_PER_CHAR
Maximum number of bytes per Java UTF-16 char in UTF-8.
private static Utf8.Processor processor
UTF-8 is a runtime hot spot so we attempt to provide heavily optimized implementations depending on what is available on the platform.
private static int UNSAFE_COUNT_ASCII_THRESHOLD
Used by Unsafe UTF-8 string validation logic to determine the minimum string length above which to employ an optimized algorithm for counting ASCII characters.
-
Constructor Summary
Constructors (Modifier, Constructor; Description)
private Utf8()
-
Method Summary
Methods (Modifier and Type, Method; Description)
(package private) static java.lang.String decodeUtf8(byte[] bytes, int index, int size)
Decodes the given UTF-8 encoded byte array slice into a String.
(package private) static java.lang.String decodeUtf8(java.nio.ByteBuffer buffer, int index, int size)
Decodes the given UTF-8 portion of the ByteBuffer into a String.
(package private) static int encode(java.lang.String in, byte[] out, int offset, int length)
(package private) static int encodedLength(java.lang.String string)
Returns the number of bytes in the UTF-8-encoded form of string.
private static int encodedLengthGeneral(java.lang.String string, int start)
(package private) static void encodeUtf8(java.lang.String in, java.nio.ByteBuffer out)
Encodes the given characters to the target ByteBuffer using UTF-8 encoding.
private static int estimateConsecutiveAscii(java.nio.ByteBuffer buffer, int index, int limit)
Counts (approximately) the number of consecutive ASCII characters in the given buffer.
private static int incompleteStateFor(byte[] bytes, int index, int limit)
private static int incompleteStateFor(int byte1)
private static int incompleteStateFor(int byte1, int byte2)
private static int incompleteStateFor(int byte1, int byte2, int byte3)
private static int incompleteStateFor(java.nio.ByteBuffer buffer, int byte1, int index, int remaining)
(package private) static boolean isValidUtf8(byte[] bytes)
Returns true if the given byte array is a well-formed UTF-8 byte sequence.
(package private) static boolean isValidUtf8(byte[] bytes, int index, int limit)
Returns true if the given byte array slice is a well-formed UTF-8 byte sequence.
(package private) static boolean isValidUtf8(java.nio.ByteBuffer buffer)
Determines if the given ByteBuffer is a valid UTF-8 string.
(package private) static int partialIsValidUtf8(int state, byte[] bytes, int index, int limit)
Tells whether the given byte array slice is a well-formed, malformed, or incomplete UTF-8 byte sequence.
(package private) static int partialIsValidUtf8(int state, java.nio.ByteBuffer buffer, int index, int limit)
Determines if the given ByteBuffer is a partially valid UTF-8 string.
-
-
-
Field Detail
-
processor
private static final Utf8.Processor processor
UTF-8 is a runtime hot spot so we attempt to provide heavily optimized implementations depending on what is available on the platform. The processor is the platform-optimized delegate to which all methods delegate directly.
-
ASCII_MASK_LONG
private static final long ASCII_MASK_LONG
A mask used when performing unsafe reads to determine if a long value contains any non-ASCII characters (i.e. any byte >= 0x80).
See Also: Constant Field Values
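As an illustration of how such a mask works, the sketch below tests eight packed bytes with a single AND. The mask value `0x8080808080808080L` (the high bit of every byte) is an assumption that follows from the "any byte >= 0x80" description; `AsciiMaskDemo` and `hasNonAscii` are hypothetical names.

```java
final class AsciiMaskDemo {
  // Assumed value: the high bit of each of the eight bytes in a long.
  static final long HIGH_BIT_MASK = 0x8080808080808080L;

  /** Returns true if any of the eight bytes packed in value is >= 0x80. */
  static boolean hasNonAscii(long value) {
    return (value & HIGH_BIT_MASK) != 0L;
  }

  public static void main(String[] args) {
    System.out.println(hasNonAscii(0x4142434445464748L)); // eight ASCII letters
    System.out.println(hasNonAscii(0x41424344C3A94748L)); // contains bytes 0xC3 and 0xA9
  }
}
```

Because the mask is the same in every byte position, the result does not depend on the byte order in which the long was read.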
-
MAX_BYTES_PER_CHAR
static final int MAX_BYTES_PER_CHAR
Maximum number of bytes per Java UTF-16 char in UTF-8.
See Also: CharsetEncoder.maxBytesPerChar(), Constant Field Values
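A constant like this is typically used to size a worst-case output buffer before encoding. The sketch below assumes the value is 3, since a char in the Basic Multilingual Plane needs at most three UTF-8 bytes and a surrogate pair needs four bytes for two chars; `WorstCaseUtf8Size` and `maxEncodedSize` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

final class WorstCaseUtf8Size {
  // Assumed value: at most three UTF-8 bytes per UTF-16 char.
  static final int MAX_BYTES_PER_CHAR = 3;

  /** Upper bound on the UTF-8 size of s; throws ArithmeticException on int overflow. */
  static int maxEncodedSize(String s) {
    return Math.multiplyExact(s.length(), MAX_BYTES_PER_CHAR);
  }

  public static void main(String[] args) {
    String s = "\u20ac\u20ac"; // two euro signs, three bytes each
    System.out.println(s.getBytes(StandardCharsets.UTF_8).length + " <= " + maxEncodedSize(s));
  }
}
```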
-
COMPLETE
static final int COMPLETE
State value indicating that the byte sequence is well-formed and complete (no further bytes are needed to complete a character).
See Also: Constant Field Values
-
MALFORMED
static final int MALFORMED
State value indicating that the byte sequence is definitely not well-formed.
See Also: Constant Field Values
-
UNSAFE_COUNT_ASCII_THRESHOLD
private static final int UNSAFE_COUNT_ASCII_THRESHOLD
Used by Unsafe UTF-8 string validation logic to determine the minimum string length above which to employ an optimized algorithm for counting ASCII characters. The reason for this threshold is that for small strings, the optimization may not be beneficial or may even negatively impact performance, since it requires additional logic to avoid unaligned reads (when calling Unsafe.getLong). This threshold ensures that, even if the initial offset is unaligned, at least one call to Unsafe.getLong() is made, which provides a performance improvement that entirely subsumes the cost of the additional logic.
See Also: Constant Field Values
-
-
Method Detail
-
isValidUtf8
static boolean isValidUtf8(byte[] bytes)
Returns true if the given byte array is a well-formed UTF-8 byte sequence.
This is a convenience method, equivalent to a call to isValidUtf8(bytes, 0, bytes.length).
-
isValidUtf8
static boolean isValidUtf8(byte[] bytes, int index, int limit)
Returns true if the given byte array slice is a well-formed UTF-8 byte sequence. The range of bytes to be checked extends from index (inclusive) to limit (exclusive).
This is a convenience method, equivalent to partialIsValidUtf8(Utf8.COMPLETE, bytes, index, limit) == Utf8.COMPLETE.
-
partialIsValidUtf8
static int partialIsValidUtf8(int state, byte[] bytes, int index, int limit)
Tells whether the given byte array slice is a well-formed, malformed, or incomplete UTF-8 byte sequence. The range of bytes to be checked extends from index (inclusive) to limit (exclusive).
Parameters:
state - either COMPLETE (if this is the initial decoding operation) or the value returned from a call to a partial decoding method for the previous bytes
Returns:
MALFORMED if the partial byte sequence is definitely not well-formed, COMPLETE if it is well-formed (no additional input needed), or if the byte sequence is "incomplete", i.e. apparently terminated in the middle of a character, an opaque integer "state" value containing enough information to decode the character when passed to a subsequent invocation of a partial decoding method.
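The state-passing contract described above can be sketched with a small validator that follows the well-formedness ranges of Unicode Table 3-7. Everything here is an assumption made for illustration: the class name `ChunkedUtf8Check`, the constant values 0 and -1, and the state layout (remaining continuation count plus the allowed range of the next byte). The state used by this class is opaque and may be encoded differently.

```java
final class ChunkedUtf8Check {
  static final int COMPLETE = 0;   // assumed value for this sketch
  static final int MALFORMED = -1; // assumed value for this sketch

  /** Validates bytes[index, limit), resuming from a state returned by an earlier call. */
  static int check(int state, byte[] bytes, int index, int limit) {
    if (state == MALFORMED) {
      return MALFORMED;
    }
    // Hypothetical layout: bits 0-7 = continuation bytes still expected,
    // bits 8-15 and 16-23 = inclusive range allowed for the next byte.
    int remaining = state & 0xFF;
    int lo = (state >>> 8) & 0xFF;
    int hi = (state >>> 16) & 0xFF;
    for (int i = index; i < limit; i++) {
      int b = bytes[i] & 0xFF;
      if (remaining > 0) {
        if (b < lo || b > hi) {
          return MALFORMED;
        }
        remaining--;
        lo = 0x80; // later continuation bytes are always 80..BF
        hi = 0xBF;
      } else if (b < 0x80) {
        // ASCII: a complete one-byte character.
      } else if (b >= 0xC2 && b <= 0xDF) {
        remaining = 1;
        lo = 0x80;
        hi = 0xBF;
      } else if (b >= 0xE0 && b <= 0xEF) {
        remaining = 2;
        lo = (b == 0xE0) ? 0xA0 : 0x80; // rejects overlong 3-byte forms
        hi = (b == 0xED) ? 0x9F : 0xBF; // rejects surrogates U+D800..U+DFFF
      } else if (b >= 0xF0 && b <= 0xF4) {
        remaining = 3;
        lo = (b == 0xF0) ? 0x90 : 0x80; // rejects overlong 4-byte forms
        hi = (b == 0xF4) ? 0x8F : 0xBF; // rejects code points above U+10FFFF
      } else {
        return MALFORMED; // 80..C1 and F5..FF never start a character
      }
    }
    return remaining == 0 ? COMPLETE : (remaining | (lo << 8) | (hi << 16));
  }
}
```

A caller validating a stream would start with `COMPLETE`, feed each segment in turn, pass the returned value into the next call, and treat any value other than `COMPLETE` after the final segment as invalid input.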
-
incompleteStateFor
private static int incompleteStateFor(int byte1)
-
incompleteStateFor
private static int incompleteStateFor(int byte1, int byte2)
-
incompleteStateFor
private static int incompleteStateFor(int byte1, int byte2, int byte3)
-
incompleteStateFor
private static int incompleteStateFor(byte[] bytes, int index, int limit)
-
incompleteStateFor
private static int incompleteStateFor(java.nio.ByteBuffer buffer, int byte1, int index, int remaining)
-
encodedLength
static int encodedLength(java.lang.String string)
Returns the number of bytes in the UTF-8-encoded form of string. This method is equivalent to string.getBytes(UTF_8).length, but is more efficient in both time and space.
Throws:
java.lang.IllegalArgumentException - if string contains ill-formed UTF-16 (unpaired surrogates)
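The length can be computed from the UTF-16 chars alone, without allocating an encoded copy. The following is a minimal sketch of that idea, not the implementation used by this class; `Utf8LengthSketch` and `utf8Length` are hypothetical names, and int overflow for very long strings is not handled.

```java
final class Utf8LengthSketch {
  /** Counts the UTF-8 bytes needed for s, rejecting unpaired surrogates. */
  static int utf8Length(String s) {
    int bytes = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < 0x80) {
        bytes += 1;
      } else if (c < 0x800) {
        bytes += 2;
      } else if (!Character.isSurrogate(c)) {
        bytes += 3;
      } else if (Character.isHighSurrogate(c)
          && i + 1 < s.length()
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        bytes += 4; // a surrogate pair encodes one supplementary code point
        i++;        // skip the low surrogate
      } else {
        throw new IllegalArgumentException("Unpaired surrogate at index " + i);
      }
    }
    return bytes;
  }
}
```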
-
encodedLengthGeneral
private static int encodedLengthGeneral(java.lang.String string, int start)
-
encode
static int encode(java.lang.String in, byte[] out, int offset, int length)
-
isValidUtf8
static boolean isValidUtf8(java.nio.ByteBuffer buffer)
Determines if the given ByteBuffer is a valid UTF-8 string.
Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.
Parameters:
buffer - the buffer to check.
See Also: isValidUtf8(byte[], int, int)
-
partialIsValidUtf8
static int partialIsValidUtf8(int state, java.nio.ByteBuffer buffer, int index, int limit)
Determines if the given ByteBuffer is a partially valid UTF-8 string.
Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.
Parameters:
buffer - the buffer to check.
See Also: partialIsValidUtf8(int, byte[], int, int)
-
decodeUtf8
static java.lang.String decodeUtf8(java.nio.ByteBuffer buffer, int index, int size) throws InvalidProtocolBufferException
Decodes the given UTF-8 portion of the ByteBuffer into a String.
Throws:
InvalidProtocolBufferException - if the input is not valid UTF-8.
-
decodeUtf8
static java.lang.String decodeUtf8(byte[] bytes, int index, int size) throws InvalidProtocolBufferException
Decodes the given UTF-8 encoded byte array slice into a String.
Throws:
InvalidProtocolBufferException - if the input is not valid UTF-8.
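For comparison, a decode that fails on invalid input can be written with the JDK's CharsetDecoder. This is only a sketch of the fail-on-malformed behavior: `StrictUtf8Decode` is a hypothetical name, the JDK's `CharacterCodingException` stands in for `InvalidProtocolBufferException`, and the JDK decoder's notion of validity may not match this class exactly on every JDK version. Note that the third argument is a size, not an exclusive limit.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Decode {
  /** Decodes size bytes starting at index, throwing instead of substituting U+FFFD. */
  static String decode(byte[] bytes, int index, int size) throws CharacterCodingException {
    return StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .decode(ByteBuffer.wrap(bytes, index, size))
        .toString();
  }
}
```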
-
encodeUtf8
static void encodeUtf8(java.lang.String in, java.nio.ByteBuffer out)
Encodes the given characters to the target ByteBuffer using UTF-8 encoding.
Selects an optimal algorithm based on the type of ByteBuffer (i.e. heap or direct) and the capabilities of the platform.
Parameters:
in - the source string to be encoded
out - the target buffer to receive the encoded string.
See Also: encode(String, byte[], int, int)
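Encoding directly into a caller-supplied buffer can be approximated with the JDK's CharsetEncoder, which works for both heap and direct buffers. This sketch makes its own choices about error handling (`BufferOverflowException` when the target is too small, `CharacterCodingException` for unpaired surrogates); the behavior of this class in those cases is not stated here. `EncodeIntoBuffer` and `encodeInto` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodeIntoBuffer {
  /** Writes the UTF-8 form of in at out's current position, advancing the position. */
  static void encodeInto(String in, ByteBuffer out) throws CharacterCodingException {
    CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    CoderResult result = encoder.encode(CharBuffer.wrap(in), out, true);
    if (result.isUnderflow()) {
      result = encoder.flush(out);
    }
    if (!result.isUnderflow()) {
      result.throwException(); // overflow or malformed input
    }
  }
}
```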
-
estimateConsecutiveAscii
private static int estimateConsecutiveAscii(java.nio.ByteBuffer buffer, int index, int limit)
Counts (approximately) the number of consecutive ASCII characters in the given buffer. The byte order of the ByteBuffer does not matter, so performance can be improved if native byte order is used (i.e. no byte-swapping in ByteBuffer.getLong(int)).
Parameters:
buffer - the buffer to be scanned for ASCII chars
index - the starting index of the scan
limit - the limit within buffer for the scan
Returns:
the number of ASCII characters found. The stopping position will be at or before the first non-ASCII byte.
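One way to obtain such an approximate count is to read eight bytes at a time with the absolute ByteBuffer.getLong(int) and stop at the first word that has any high bit set. The sketch below is written under that assumption and is not necessarily how this method is implemented; `AsciiRunEstimate`, `estimate`, and the mask constant are hypothetical names. The result is a multiple of eight and therefore a lower bound on the true run length.

```java
import java.nio.ByteBuffer;

final class AsciiRunEstimate {
  // Assumed mask: the high bit of each byte in a long.
  private static final long HIGH_BITS = 0x8080808080808080L;

  /** Returns a lower bound on the number of consecutive ASCII bytes starting at index. */
  static int estimate(ByteBuffer buffer, int index, int limit) {
    int i = index;
    int lastWordStart = limit - 8; // last position where a full 8-byte read fits
    while (i <= lastWordStart && (buffer.getLong(i) & HIGH_BITS) == 0L) {
      i += 8;
    }
    return i - index;
  }
}
```

The absolute read leaves the buffer's position untouched, and the caller is expected to check the remaining bytes one at a time from the returned offset.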
-
-