Class RuleBasedBreakIterator

java.lang.Object
com.ibm.icu.text.BreakIterator
com.ibm.icu.text.RuleBasedBreakIterator
All Implemented Interfaces:
Cloneable

public class RuleBasedBreakIterator extends BreakIterator
Rule Based Break Iterator. This is a port of the C++ class RuleBasedBreakIterator from ICU4C.
  • Field Details

    • fRData

      @Deprecated public com.ibm.icu.impl.RBBIDataWrapper fRData
      Deprecated.
      This API is ICU internal only.
      The rule data for this BreakIterator instance. Not intended for public use. Declared public for testing purposes only.
    • fDebugEnv

      @Deprecated public static final String fDebugEnv
      Deprecated.
      This API is ICU internal only.
      Control debug, trace and dump options.
  • Constructor Details

    • RuleBasedBreakIterator

      public RuleBasedBreakIterator(String rules)
      Construct a RuleBasedBreakIterator from a set of rules supplied as a string.
      Parameters:
      rules - The break rules to be used.
  • Method Details

    • getInstanceFromCompiledRules

      public static RuleBasedBreakIterator getInstanceFromCompiledRules(InputStream is) throws IOException
      Create a break iterator from a precompiled set of break rules. Creating a break iterator from the binary rules is much faster than creating one from source rules. The binary rules are generated by the RuleBasedBreakIterator.compileRules() function. Binary break iterator rules are not guaranteed to be compatible between different versions of ICU.
      Parameters:
      is - an input stream supplying the compiled binary rules.
      Throws:
      IOException - if there is an error while reading the rules from the InputStream.
    • getInstanceFromCompiledRules

      @Deprecated public static RuleBasedBreakIterator getInstanceFromCompiledRules(ByteBuffer bytes) throws IOException
      Deprecated.
      This API is ICU internal only.
      Create a break iterator from a precompiled set of break rules. Creating a break iterator from the binary rules is much faster than creating one from source rules. The binary rules are generated by the RuleBasedBreakIterator.compileRules() function. Binary break iterator rules are not guaranteed to be compatible between different versions of ICU.
      Parameters:
      bytes - a buffer supplying the compiled binary rules.
      Throws:
      IOException - if there is an error while reading the rules from the buffer.
    • clone

      public Object clone()
      Clones this iterator.
      Overrides:
      clone in class BreakIterator
      Returns:
      A newly-constructed RuleBasedBreakIterator with the same behavior as this one.
    • equals

      public boolean equals(Object that)
      Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.
      Overrides:
      equals in class Object
    • toString

      public String toString()
      Returns the description (rules) used to create this iterator. (In ICU4C, the equivalent function is RuleBasedBreakIterator::getRules().)
      Overrides:
      toString in class Object
    • hashCode

      public int hashCode()
      Computes a hash code for this BreakIterator.
      Overrides:
      hashCode in class Object
      Returns:
      A hash code
    • dump

      @Deprecated public void dump(PrintStream out)
      Deprecated.
      This API is ICU internal only.
      Dump the contents of the state table and character classes for this break iterator. For debugging only.
    • compileRules

      public static void compileRules(String rules, OutputStream ruleBinary) throws IOException
      Compile a set of source break rules into the binary state tables used by the break iterator engine. Creating a break iterator from precompiled rules is much faster than creating one from source rules. Binary break rules are not guaranteed to be compatible between different versions of ICU.
      Parameters:
      rules - The source form of the break rules
      ruleBinary - An output stream to receive the compiled rules.
      Throws:
      IOException - If there is an error writing the output.
    • first

      public int first()
      Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).
      Specified by:
      first in class BreakIterator
      Returns:
      The offset of the beginning of the text.
    • last

      public int last()
      Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).
      Specified by:
      last in class BreakIterator
      Returns:
      The text's past-the-end offset.
    • next

      public int next(int n)
      Advances the iterator either forward or backward the specified number of steps. Negative values move backward, and positive values move forward. This is equivalent to repeatedly calling next() or previous().
      Specified by:
      next in class BreakIterator
      Parameters:
      n - The number of steps to move. The sign indicates the direction (negative is backwards, and positive is forwards).
      Returns:
      The character offset of the boundary position n boundaries away from the current one.
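      The equivalence between next(int) and repeated next()/previous() calls can be illustrated with a short sketch. To keep it runnable without the ICU4J jar, it uses the JDK's java.text.BreakIterator as a stand-in; that the ICU class behaves identically here is an assumption based on the contract described above. The class and method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

class SteppedMoves {
    // Returns { offset reached by next(2) from the start, offset reached by a further next(-1) }.
    static int[] jumpThenStepBack(String text) {
        // JDK stand-in (assumption): change the import to com.ibm.icu.text.BreakIterator to try ICU4J.
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        it.first();                // position at the beginning of the text
        int forward = it.next(2);  // two boundaries forward, like calling next() twice
        int back = it.next(-1);    // one boundary backward, like calling previous() once
        return new int[] { forward, back };
    }

    public static void main(String[] args) {
        int[] r = jumpThenStepBack("Hello world");
        System.out.println(r[0] + " " + r[1]);
    }
}
```

      With the JDK word iterator and the text "Hello world", whose word boundaries fall at 0, 5, 6 and 11, the two calls land on 6 and then 5.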
    • next

      public int next()
      Advances the iterator to the next boundary position.
      Specified by:
      next in class BreakIterator
      Returns:
      The position of the first boundary after this one.
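      A typical forward traversal combines first() and next() and stops when next() returns DONE. The sketch below is one way to collect every boundary; it is written against the JDK's java.text.BreakIterator so that it runs with no extra dependency, on the assumption that the ICU class follows the same first()/next()/DONE protocol. The helper name is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ForwardBoundaries {
    // Collects every boundary offset from the start of the text to its end.
    static List<Integer> boundaries(String text) {
        // JDK stand-in (assumption): change the import to com.ibm.icu.text.BreakIterator to try ICU4J.
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        List<Integer> out = new ArrayList<>();
        // first() yields the starting offset; next() yields DONE once the end is passed.
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("Hello world"));
    }
}
```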
    • previous

      public int previous()
      Moves the iterator backwards, to the boundary preceding the current one.
      Specified by:
      previous in class BreakIterator
      Returns:
      The position of the boundary immediately preceding the starting position.
    • following

      public int following(int startPos)
      Sets the iterator to refer to the first boundary position following the specified position.
      Specified by:
      following in class BreakIterator
      Parameters:
      startPos - The position from which to begin searching for a break position.
      Returns:
      The position of the first break after the specified position.
    • preceding

      public int preceding(int offset)
      Sets the iterator to refer to the last boundary position before the specified position.
      Overrides:
      preceding in class BreakIterator
      Parameters:
      offset - The position to begin searching for a break from.
      Returns:
      The position of the last boundary before the starting position.
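      following() and preceding() together locate the boundaries on either side of an arbitrary offset. A minimal sketch follows, again using the JDK's java.text.BreakIterator as a stand-in (an assumption made so the code runs without ICU4J); the helper name is hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

class NearestBoundaries {
    // Returns { last boundary strictly before offset, first boundary strictly after offset }.
    static int[] around(String text, int offset) {
        // JDK stand-in (assumption): change the import to com.ibm.icu.text.BreakIterator to try ICU4J.
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        int after = it.following(offset);   // leaves the iterator at the boundary it returns
        int before = it.preceding(offset);  // searches from offset again, independent of the call above
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        int[] r = around("Hello world", 3);
        System.out.println(r[0] + " " + r[1]);
    }
}
```

      Both searches are exclusive of the offset itself: in this sketch, asking about offset 5 (itself a boundary in "Hello world") yields 0 and 6 rather than 5.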
    • checkOffset

      protected static final void checkOffset(int offset, CharacterIterator text)
      Throw IllegalArgumentException unless begin <= offset < end.
    • isBoundary

      public boolean isBoundary(int offset)
      Returns true if the specified position is a boundary position. As a side effect, leaves the iterator pointing to the first boundary position at or after "offset".
      Overrides:
      isBoundary in class BreakIterator
      Parameters:
      offset - the offset to check.
      Returns:
      True if "offset" is a boundary position.
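      The side effect of isBoundary() is easy to overlook, so the sketch below checks an offset and then reads back the iterator position. It relies on the JDK's java.text.BreakIterator as a stand-in; that the ICU class repositions the iterator in the same way is taken from the description above and is an assumption. The method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

class BoundaryCheck {
    // JDK stand-in (assumption): change the import to com.ibm.icu.text.BreakIterator to try ICU4J.
    private static BreakIterator wordIterator(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        return it;
    }

    // True if offset is a word boundary in text.
    static boolean isBreak(String text, int offset) {
        return wordIterator(text).isBoundary(offset);
    }

    // Position the iterator is left at after isBoundary(offset) has been called.
    static int positionAfterCheck(String text, int offset) {
        BreakIterator it = wordIterator(text);
        it.isBoundary(offset);   // result ignored; only the side effect matters here
        return it.current();
    }

    public static void main(String[] args) {
        System.out.println(isBreak("Hello world", 3) + " " + positionAfterCheck("Hello world", 3));
    }
}
```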
    • current

      public int current()
      Returns the current iteration position. Note that DONE is never returned from this function; if iteration has run to the end of a string, current() will return the length of the string while next() will return BreakIterator.DONE.
      Specified by:
      current in class BreakIterator
      Returns:
      The current iteration position.
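      The difference between next() and current() at the end of the text can be shown in a few lines. This sketch uses the JDK's java.text.BreakIterator as a stand-in so it runs without the ICU4J jar, assuming the ICU class behaves as documented above; the names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

class CurrentAfterDone {
    // Returns { value of next() when already at the end, value of current() afterwards }.
    static int[] runOffTheEnd(String text) {
        // JDK stand-in (assumption): change the import to com.ibm.icu.text.BreakIterator to try ICU4J.
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        it.last();                   // move to the past-the-end offset
        int next = it.next();        // nothing left, so this is BreakIterator.DONE
        int current = it.current();  // still the last valid boundary, never DONE
        return new int[] { next, current };
    }

    public static void main(String[] args) {
        int[] r = runOffTheEnd("Hello world");
        System.out.println((r[0] == BreakIterator.DONE) + " " + r[1]);
    }
}
```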
    • getRuleStatus

      public int getRuleStatus()
      Return the status tag from the break rule that determined the boundary at the current iteration position. The values appear in the rule source within brackets, {123}, for example. For rules that do not specify a status, a default value of 0 is returned. If more than one rule applies, the numerically largest of the possible status values is returned.

      Of the standard types of ICU break iterators, only the word and line break iterators provide status values. The values are defined in class RuleBasedBreakIterator, and allow distinguishing between words that contain alphabetic letters, "words" that appear to be numbers, punctuation and spaces, words containing ideographic characters, and more. Call getRuleStatus after obtaining a boundary position from next(), previous(), or any other break iterator function that returns a boundary position.

      Note that getRuleStatus() returns the value corresponding to the current() index even after next() has returned DONE.

      Overrides:
      getRuleStatus in class BreakIterator
      Returns:
      the status from the break rule that determined the boundary at the current iteration position.
    • getRuleStatusVec

      public int getRuleStatusVec(int[] fillInArray)
      Get the status (tag) values from the break rule(s) that determined the boundary at the current iteration position. The values appear in the rule source within brackets, {123}, for example. The default status value for rules that do not explicitly provide one is zero.

      The status values used by the standard ICU break rules are defined as public constants in class RuleBasedBreakIterator.

      If the size of the output array is insufficient to hold the data, the output will be truncated to the available length. No exception will be thrown.

      Overrides:
      getRuleStatusVec in class BreakIterator
      Parameters:
      fillInArray - an array to be filled in with the status values.
      Returns:
      The number of rule status values from the rules that determined the boundary at the current iteration position. In the event that the array is too small, the return value is the total number of status values that were available, not the reduced number that were actually returned.
    • getText

      public CharacterIterator getText()
      Returns a CharacterIterator over the text being analyzed.

      Caution: The state of the returned CharacterIterator must not be modified in any way while the BreakIterator is still in use. Doing so will lead to undefined behavior of the BreakIterator. Clone the returned CharacterIterator first and work with that.

      The returned CharacterIterator is a reference to the actual iterator being used by the BreakIterator. No guarantees are made about the current position of this iterator when it is returned; it may differ from the BreakIterator's current position. If you need to move that position to examine the text, clone this function's return value first.

      Specified by:
      getText in class BreakIterator
      Returns:
      An iterator over the text being analyzed.
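      The clone-before-use advice above might look like the following sketch, which reads the whole text through a copy and leaves the iterator owned by the BreakIterator untouched. It is written against the JDK's java.text.BreakIterator as a stand-in (an assumption, chosen so the code runs without ICU4J); the helper name is hypothetical.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

class SafeTextAccess {
    // Reads the full text through a clone, never moving the BreakIterator's own CharacterIterator.
    static String copyOfText(BreakIterator it) {
        CharacterIterator copy = (CharacterIterator) it.getText().clone();
        StringBuilder sb = new StringBuilder();
        for (char c = copy.first(); c != CharacterIterator.DONE; c = copy.next()) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // JDK stand-in (assumption): change the import to com.ibm.icu.text.BreakIterator to try ICU4J.
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText("Hello world");
        it.next();
        System.out.println(copyOfText(it) + " / still at " + it.current());
    }
}
```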
    • setText

      public void setText(CharacterIterator newText)
      Set the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text. (The old iterator is dropped.)

      Caution: The supplied CharacterIterator is used directly by the BreakIterator, and must not be altered in any way by code outside of the BreakIterator. Doing so will lead to undefined behavior of the BreakIterator.

      Specified by:
      setText in class BreakIterator
      Parameters:
      newText - An iterator over the text to analyze.