Class TextPattern

  • All Implemented Interfaces:
    java.io.Serializable, java.lang.CharSequence

    public class TextPattern
    extends java.lang.Object
    implements java.io.Serializable, java.lang.CharSequence
    Fast pattern matching against a constant string.

    The regular expression facilities of the Java API are a powerful tool; however, when searching for a constant pattern, many algorithms can increase the speed of a search by orders of magnitude.

    This class provides constant-pattern text search facilities by implementing the last-character heuristic of the Boyer–Moore search algorithm using compact approximators, a randomized data structure that can accommodate in a small space (but in an approximate way) the bad-character shift table of a large alphabet such as Unicode.

    Since a large subset of US-ASCII is used in all languages (e.g., whitespace, punctuation, etc.), this class caches separately the shifts for the first 128 Unicode characters, resulting in very good performance even on text in pure US-ASCII.
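
The idea of a bad-character shift table with a separate US-ASCII cache can be sketched as follows. This is only an illustration under stated assumptions, not the code of this class: the name BadCharSearchSketch is hypothetical, the search is the Horspool simplification of Boyer–Moore, and an exact HashMap stands in for the compact approximator used for characters outside the first 128.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a bad-character search; not the TextPattern implementation. */
final class BadCharSearchSketch {
    private final char[] pattern;
    // Shifts for the first 128 Unicode characters, kept in a plain array.
    private final int[] asciiShift = new int[128];
    // Exact shifts for all other characters (assumption: stands in for a compact approximator).
    private final Map<Character, Integer> otherShift = new HashMap<>();

    BadCharSearchSketch(CharSequence p) {
        pattern = p.toString().toCharArray();
        final int m = pattern.length;
        // A character that does not occur in the pattern allows a shift of the full length.
        Arrays.fill(asciiShift, m);
        // For every position but the last, record the distance from the end of the pattern;
        // later occurrences overwrite earlier ones, so the last occurrence wins.
        for (int i = 0; i < m - 1; i++) {
            final char c = pattern[i];
            if (c < 128) asciiShift[c] = m - 1 - i;
            else otherShift.put(c, m - 1 - i);
        }
    }

    /** Returns the first index in [from, to) at which the whole pattern occurs, or -1. */
    int search(CharSequence text, int from, int to) {
        final int m = pattern.length;
        if (m == 0) return from <= to ? from : -1;
        int i = from;
        while (i <= to - m) {
            // Compare right to left.
            int j = m - 1;
            while (j >= 0 && text.charAt(i + j) == pattern[j]) j--;
            if (j < 0) return i;
            // Shift according to the text character aligned with the last pattern position.
            final char c = text.charAt(i + m - 1);
            i += c < 128 ? asciiShift[c] : otherShift.getOrDefault(c, m);
        }
        return -1;
    }
}
```

With this sketch, new BadCharSearchSketch("world").search("hello world", 0, 11) returns 6 after examining only three alignments, because each mismatch on the last character skips several positions at once.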

    Note that the indexOf methods of MutableString use an even more simplified variant of Boyer–Moore's algorithm, which is less efficient but has a smaller setup time and does not generate any object. In general, for short case-sensitive patterns the overhead of this class will make it slower than such methods. The search facilities provided by this class are targeted at searches with long patterns and at case-insensitive searches.

    Instances of this class are immutable and thread-safe.

    Since:
    0.6
    Author:
    Sebastiano Vigna, Paolo Boldi
    See Also:
    MutableString.indexOf(MutableString, int), Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static int CASE_INSENSITIVE
      Enables case-insensitive matching.
      protected char[] pattern
      The pattern backing array.
      static int UNICODE_CASE
      Enables Unicode-aware case folding.
    • Constructor Summary

      Constructors 
      Constructor Description
      TextPattern​(java.lang.CharSequence pattern)
      Creates a new case-sensitive TextPattern object that can be used to search for the given pattern.
      TextPattern​(java.lang.CharSequence pattern, int flags)
      Creates a new TextPattern object that can be used to search for the given pattern.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      boolean caseInsensitive()
      Returns whether this pattern is case insensitive.
      char charAt​(int i)  
      boolean equals​(java.lang.Object o)
      Compares this text pattern to another object.
      int hashCode()
      Returns a hash code for this text pattern.
      int length()  
      int search​(byte[] a)
      Returns the index of the first occurrence of this pattern in the given byte array.
      int search​(byte[] a, int from)
      Returns the index of the first occurrence of this pattern in the given byte array starting from a given index.
      int search​(byte[] a, int from, int to)
      Returns the index of the first occurrence of this pattern in the given byte array between given indices.
      int search​(char[] array)
      Returns the index of the first occurrence of this pattern in the given character array.
      int search​(char[] array, int from)
      Returns the index of the first occurrence of this pattern in the given character array starting from a given index.
      int search​(char[] a, int from, int to)
      Returns the index of the first occurrence of this pattern in the given character array between given indices.
      int search​(it.unimi.dsi.fastutil.chars.CharList list)
      Returns the index of the first occurrence of this pattern in the given character list.
      int search​(it.unimi.dsi.fastutil.chars.CharList list, int from)
      Returns the index of the first occurrence of this pattern in the given character list starting from a given index.
      int search​(it.unimi.dsi.fastutil.chars.CharList list, int from, int to)
      Returns the index of the first occurrence of this pattern in the given character list between given indices.
      int search​(java.lang.CharSequence s)
      Returns the index of the first occurrence of this pattern in the given character sequence.
      int search​(java.lang.CharSequence s, int from)
      Returns the index of the first occurrence of this pattern in the given character sequence starting from a given index.
      int search​(java.lang.CharSequence s, int from, int to)
      Returns the index of the first occurrence of this pattern in the given character sequence between given indices.
      java.lang.CharSequence subSequence​(int from, int to)  
      java.lang.String toString()  
      boolean unicodeCase()
      Returns whether this pattern uses Unicode case folding.
      • Methods inherited from class java.lang.Object

        clone, finalize, getClass, notify, notifyAll, wait, wait, wait
      • Methods inherited from interface java.lang.CharSequence

        chars, codePoints
    • Field Detail

      • CASE_INSENSITIVE

        public static final int CASE_INSENSITIVE
        Enables case-insensitive matching.

        By default, case-insensitive matching assumes that only characters in the ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag.

        Case-insensitivity involves a performance drop.

        See Also:
        Constant Field Values
      • UNICODE_CASE

        public static final int UNICODE_CASE
        Enables Unicode-aware case folding.

        When this flag is specified then case-insensitive matching, when enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the ASCII charset are being matched.

        Unicode-aware case folding is very expensive (two method calls per examined non-ASCII character).
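
The difference between the two folding modes can be illustrated with a small sketch. The class and method names are hypothetical, and the assumption that the two method calls are Character.toUpperCase followed by Character.toLowerCase is ours; the documentation above does not state which calls are used.

```java
/** Hypothetical sketch of ASCII-only versus Unicode-aware folding of a single character. */
final class CaseFoldSketch {
    /** Folds only the letters A-Z; every other character is left untouched. */
    static char asciiFold(char c) {
        return c >= 'A' && c <= 'Z' ? (char) (c + ('a' - 'A')) : c;
    }

    /** Folds ASCII cheaply and pays two method calls for every non-ASCII character. */
    static char unicodeFold(char c) {
        if (c < 128) return asciiFold(c);
        // Assumption: upper-casing then lower-casing, as String.equalsIgnoreCase does.
        return Character.toLowerCase(Character.toUpperCase(c));
    }
}
```

Under ASCII-only folding, the characters U+00C9 and U+00E9 (upper- and lower-case E with acute accent) remain distinct, whereas unicodeFold maps both to U+00E9.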

        See Also:
        Constant Field Values
      • pattern

        protected char[] pattern
        The pattern backing array.
    • Constructor Detail

      • TextPattern

        public TextPattern​(java.lang.CharSequence pattern)
        Creates a new case-sensitive TextPattern object that can be used to search for the given pattern.
        Parameters:
        pattern - the constant pattern to search for.
      • TextPattern

        public TextPattern​(java.lang.CharSequence pattern,
                           int flags)
        Creates a new TextPattern object that can be used to search for the given pattern.
        Parameters:
        pattern - the constant pattern to search for.
        flags - a bit mask that may include CASE_INSENSITIVE and UNICODE_CASE.
    • Method Detail

      • caseInsensitive

        public boolean caseInsensitive()
        Returns whether this pattern is case insensitive.
      • unicodeCase

        public boolean unicodeCase()
        Returns whether this pattern uses Unicode case folding.
      • length

        public int length()
        Specified by:
        length in interface java.lang.CharSequence
      • charAt

        public char charAt​(int i)
        Specified by:
        charAt in interface java.lang.CharSequence
      • subSequence

        public java.lang.CharSequence subSequence​(int from,
                                                  int to)
        Specified by:
        subSequence in interface java.lang.CharSequence
      • search

        public int search​(char[] array)
        Returns the index of the first occurrence of this pattern in the given character array.
        Parameters:
        array - the character array to look in.
        Returns:
        the index of the first occurrence of this pattern contained in the given array, or -1, if the pattern cannot be found.
      • search

        public int search​(char[] array,
                          int from)
        Returns the index of the first occurrence of this pattern in the given character array starting from a given index.
        Parameters:
        array - the character array to look in.
        from - the index from which the search must start.
        Returns:
        the index of the first occurrence of this pattern contained in the subarray starting from from (inclusive), or -1, if the pattern cannot be found.
      • search

        public int search​(char[] a,
                          int from,
                          int to)
        Returns the index of the first occurrence of this pattern in the given character array between given indices.
        Parameters:
        a - the character array to look in.
        from - the index from which the search must start.
        to - the index at which the search must end.
        Returns:
        the index of the first occurrence of this pattern contained in the subarray from from (inclusive) to to (exclusive), or -1, if the pattern cannot be found.
      • search

        public int search​(java.lang.CharSequence s)
        Returns the index of the first occurrence of this pattern in the given character sequence.
        Parameters:
        s - the character sequence to look in.
        Returns:
        the index of the first occurrence of this pattern contained in the given character sequence, or -1, if the pattern cannot be found.
      • search

        public int search​(java.lang.CharSequence s,
                          int from)
        Returns the index of the first occurrence of this pattern in the given character sequence starting from a given index.
        Parameters:
        s - the character sequence to look in.
        from - the index from which the search must start.
        Returns:
        the index of the first occurrence of this pattern contained in the subsequence starting from from (inclusive), or -1, if the pattern cannot be found.
      • search

        public int search​(java.lang.CharSequence s,
                          int from,
                          int to)
        Returns the index of the first occurrence of this pattern in the given character sequence between given indices.
        Parameters:
        s - the character sequence to look in.
        from - the index from which the search must start.
        to - the index at which the search must end.
        Returns:
        the index of the first occurrence of this pattern contained in the subsequence from from (inclusive) to to (exclusive), or -1, if the pattern cannot be found.
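
The range semantics shared by the three-argument search methods can be pinned down with a reference sketch built only on java.lang.String. The name SearchContractSketch is hypothetical, and the sketch assumes that the returned index is relative to the whole sequence rather than to from; it describes the contract as we read it, not how the class computes the result.

```java
/** Hypothetical reference for the [from, to) contract, written with String.indexOf. */
final class SearchContractSketch {
    /**
     * Returns the first index i with from <= i such that the pattern occupies
     * positions i .. i + pattern.length() - 1, all of them below to; -1 otherwise.
     */
    static int search(CharSequence s, CharSequence pattern, int from, int to) {
        // Truncating at to guarantees that a reported occurrence ends before to.
        return s.subSequence(0, to).toString().indexOf(pattern.toString(), from);
    }
}
```

For instance, searching "bc" in "abcabc" yields 1 for the range 0 to 3, 4 for the range 2 to 6, and -1 for the range 0 to 2, because the occurrence at index 1 would extend past the end of that range.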
      • search

        public int search​(byte[] a)
        Returns the index of the first occurrence of this pattern in the given byte array.
        Parameters:
        a - the byte array to look in.
        Returns:
        the index of the first occurrence of this pattern contained in the given byte array, or -1, if the pattern cannot be found.
      • search

        public int search​(byte[] a,
                          int from)
        Returns the index of the first occurrence of this pattern in the given byte array starting from a given index.
        Parameters:
        a - the byte array to look in.
        from - the index from which the search must start.
        Returns:
        the index of the first occurrence of this pattern contained in the array fragment starting from from (inclusive), or -1, if the pattern cannot be found.
      • search

        public int search​(byte[] a,
                          int from,
                          int to)
        Returns the index of the first occurrence of this pattern in the given byte array between given indices.
        Parameters:
        a - the byte array to look in.
        from - the index from which the search must start.
        to - the index at which the search must end.
        Returns:
        the index of the first occurrence of this pattern contained in the array fragment from from (inclusive) to to (exclusive), or -1, if the pattern cannot be found.
      • search

        public int search​(it.unimi.dsi.fastutil.chars.CharList list)
        Returns the index of the first occurrence of this pattern in the given character list.
        Parameters:
        list - the character list to look in.
        Returns:
        the index of the first occurrence of this pattern contained in the given list, or -1, if the pattern cannot be found.
      • search

        public int search​(it.unimi.dsi.fastutil.chars.CharList list,
                          int from)
        Returns the index of the first occurrence of this pattern in the given character list starting from a given index.
        Parameters:
        list - the character list to look in.
        from - the index from which the search must start.
        Returns:
        the index of the first occurrence of this pattern contained in the sublist starting from from (inclusive), or -1, if the pattern cannot be found.
      • search

        public int search​(it.unimi.dsi.fastutil.chars.CharList list,
                          int from,
                          int to)
        Returns the index of the first occurrence of this pattern in the given character list between given indices.
        Parameters:
        list - the character list to look in.
        from - the index from which the search must start.
        to - the index at which the search must end.
        Returns:
        the index of the first occurrence of this pattern contained in the sublist from from (inclusive) to to (exclusive), or -1, if the pattern cannot be found.
      • equals

        public final boolean equals​(java.lang.Object o)
        Compares this text pattern to another object.

        This method will return true iff its argument is a TextPattern containing the same constant pattern with the same flags set.

        Overrides:
        equals in class java.lang.Object
        Parameters:
        o - an object.
        Returns:
        true if the argument is a TextPattern that contains the same constant pattern as this text pattern and has the same flags set.
      • hashCode

        public final int hashCode()
        Returns a hash code for this text pattern.

        The hash code of a text pattern is the same as that of a String with the same content (suitably lowercased if the pattern is case insensitive).

        Overrides:
        hashCode in class java.lang.Object
        Returns:
        a hash code for this object.
        See Also:
        String.hashCode()
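
The relation with String.hashCode() can be checked with a short sketch. PatternHashSketch is a hypothetical name, and lower-casing each character with Character.toLowerCase is an assumption about what "suitably" means; the polynomial in base 31 is the one documented for String.hashCode().

```java
/** Hypothetical sketch of a hash code compatible with String.hashCode(). */
final class PatternHashSketch {
    static int hash(char[] pattern, boolean caseInsensitive) {
        int h = 0;
        for (char c : pattern) {
            // Assumption: case-insensitive patterns are hashed on their lower-cased content.
            if (caseInsensitive) c = Character.toLowerCase(c);
            // Same recurrence as String.hashCode(): s[0]*31^(n-1) + ... + s[n-1].
            h = 31 * h + c;
        }
        return h;
    }
}
```

Under these assumptions, a case-insensitive pattern built from "HeLLo" hashes like the string "hello", while a case-sensitive one hashes like "HeLLo" itself.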
      • toString

        public final java.lang.String toString()
        Specified by:
        toString in interface java.lang.CharSequence
        Overrides:
        toString in class java.lang.Object