Package gnu.lists


package gnu.lists

Contains utility classes and interfaces for sequences (lists), arrays, and trees.

Features:

  • The classes and interfaces are compatible with the Java2 Collections framework, but do not require them. A configuration option allows you to specify whether Sequence extends java.util.List or not.
  • Positions in a Sequence and iterators are the same kind of object, and are represented using magic cookies. In most collections frameworks, when you define a new class that implements a sequence, you would typically define a new iterator class as well. You never do that with gnu.lists; instead you implement suitable methods in the sequence class itself. This approach may seem strange, but it has important performance advantages, especially in applications where you want to go up and down nested sequences (i.e. in trees, such as XML elements), or remember sets of positions (such as XML node-sets).
  • The classes are designed for efficiency, especially concentrating on minimizing object allocation.
  • The framework supports general multi-dimensional arrays.
  • The package is self-contained; it does not depend on classes or features not in JDK 1.1.
  • Various useful sequence implementations, including Lisp-style linked lists; arrays with various element types; adjustable arrays that use a "buffer-gap" to make insertions and deletions more efficient; and adjustable arrays that automatically update positions on insertions and deletions.
  • The TreeList class can compactly represent a nested heterogenous tree. One intended usage is as a very efficient representation for XML data.
  • The Consumer interface is an abstraction of "data sinks" - objects that can receive data and do something with them. It is similar to SAX's DocumentHandler but at a more abstract (and sometimes more efficient) level.

The iteration and position framework

A position points between two elements in its "containing" sequence, or it points before the first element (if any), or after the last. Sometimes, we loosely say that the position refers to or points to an element of the sequence; in that case we mean the position is immediately before the element.

An iterator is an object that has a current position, but that can be moved so it gets a new position. What needs to be emphasized is that all Sequence implementations. use the same SeqPosition class to implement positions and iteration. In other "collections frameworks" each sequence class has its corresponding iterator class, but in this framework you instead add the methods that handle iteration to the sequence class itself, extending the AbstractSequence class. A consequence of this is that if you have an unused SeqPosition object, you can use it to iterate over any Sequence you choose. This potentially saves memory allocation, which gains you the most when iterating over a nested sequence structure, like a tree of XML data.

We do this by requiring that any position be representable using a pair of an int and an Object. In other words, the state of an iterator is restricted to be an (int, Object) pair. Of course that is all you could need, since if you need more state you can always create a helper object, and store the extra state there. The assumption we make though is that for most Sequences, an (int, Object) would be enough, without creating any new objects (beyond, sometimes, the SeqPosition itself).

The int part of a position-state is normally named ipos, and the Object part of a position-state is named xpos. Neither of these values have any meaning in themselves, they only have meaning as interpreted by the specific Sequence. For example, ipos might be the offset of the position in the sequence, but it might also some "magic cookie". If a Sequence supports insertions and deletions, and you want positions to be automatically adjusted, then a position has to be a magic cookie rather than an offset. (A typical implementation is for such a position to be an index into a table of positions managed by the Sequence itself.) So a complete position is actually a (int, Object, AbstractSequence) triple, where the AbstractSequence component is the object that interprets the (int, Object) pair. Normally, the AbstractSequence part of a position triple is the Sequence "containing" the position, but it can be any AbstractSequence that implements the various methods that provide the semantics for the (int, Object) pair.

The PositionContainer interface is a more abstract version of SeqPosition, in that it can represents more than one position. For example, instead of an array of separately allocated SeqPosition objects, you might use some data structure that implements PositionContainer, which is abstractly a list of (int, Object, Sequence)-triples.

The consumer protocol

You typically use an iteration framework it process the elements of a sequence in such a way that the data consumer is active and in control, while the sequence itself (the data producer) is a passive object. The Consumer works the other way round: The data producer is active and in control, while the data consumer is passive, consuming elements when requested by the data producer. The Consumer is a more abstract variant of SAX's ContentHandler interface for processing "events"; however Consumer is intended for general value sequences, and is less tied to XML.

Iteration

int iter = 0; // standard start position
for (;;)
{
  iter = list.nextPos(iter);
  if (iter == 0)
    break;
  Object value = list.getPosPrevious(iter);
  ... use value ...;
}

Position values for buffer-based sequences

The position encoding for sequences implemented using an array is simple. Assume the next index (as returned by nextIndex) is i. If the position is a before/backwards position, then position value is (i<<1). If the position is an after/forwards position, then position value is ((i<<1))|1.

But what each each item in the buffer has variable size? One example is a UTF-8 string. Another example is a buffer of text where each "item" is a line. Then we have the choice: Should the position value encode the index (making nextIndex and get cheap), or should it encode the offset into the buffer (making sequential access cheap)? Since sequential access is preferred, we do the latter. Thus for a before/backwards position, if the buffer offset for item i is pi, then the position value is (pi << 1). However, there is a complication when moving forwards using a ListIterator since using set or remove affects the previous item. It may be much more expensive to calculate the buffer position of the previous item than the next item. (Given pi it may be cheap to calculate pi+1 but expensive to calculate pi-1.) Therefore, we suggest using (((pi-1+1)<<1))|1, where p-1 is defined as -1. The addition of 1 allows us to handle the case of i=0 without conflicting with the use of -1 to mean end-of-sequence. Plus it makes the previous case of a simple array a special case.

  • Class
    Description
     
    A collection of default methods for implementing sequence-like classes.
    General interface to arrays of arbitrary dimension.
     
    A predicate that (only) matches a ATTRIBUTE_VALUE.
     
    Simple adjustable-length vector of Boolean values.
    Binary data which may represent text or other information.
    Simple adjustable-length vector of signed or unsigned 8-bit integers (bytes).
    Editable character sequence using a buffer-gap implementation and self-adjusting position.
    A sequence where each element is a character (Unicode code point).
    Simple adjustable-length vector whose elements are 16-bit chars.
    Indexing "composes" with a set of indexing arrays.
    Same as ComposedArray but also implements AVector.
    An object that can send its contents to a Consumer.
    A Consumer is something that will accept data (output), and do something with it.
    A Writer that wraps (filters) a Consumer.
    A class that encapsulates conversions between primitive and Object.
    A predicate that (only) matches a ELEMENT_VALUE.
    This singleton class represents an empty list.
     
    ExtPosition<E,ESEQ extends AbstractSequence<E>>
    A SeqPosition for sequences that need more than a Pos int for a position.
    Abstract helper class for Sequences that use an ExtPosition.
    Simple adjustable-length vector of 32-bit floats.
    Simple adjustable-length vector of 64-bit doubles.
    A Consumer that wraps some other Consumer.
    View an array as a vector, with the former's elements in row-major order.
    Simple adjustable-length vector whose elements are 32-bit code points Used for the Scheme string type.
    Simple adjustable-length vector of objects.
    A class to handle general multi-dimensional arrays.
     
    A "generalized vector" - a randomly-acessible sequence.
     
    Wrap a List (or an indexed selection of it) as a Sequence.
     
    Simple adjustable-length vector of signed or unsigned 32-bit integers (ints).
    A string implementation with contant-time codepoint indexing.
     
    A predicate (or type) on an item in a sequence.
    Semi-abstract class for traditions Lisp-style lists.
    Simple adjustable-length vector of signed or unsigned 64-bit integers (longs).
    A predicate that (only) matches only "nodes" in the XML sense.
    A "pair" object, as used in Lisp-like languages.
    A Pair with the file name and position it was read from.
    An object that can be "fed" a TreePosition, and will do something with it.
     
     
    A Consumer that extends a PrintWriter.
     
     
     
     
    Simple adjustable-length vector of signed 16-bit integers (shorts).
    Simple adjustable-length vector of signed 32-bit integers (ints).
    Simple adjustable-length vector of signed 64-bit integers (longs).
    Simple adjustable-length vector of signed 8-bit integers (bytes).
    SeqPosition<E,ESEQ extends AbstractSequence<E>>
    A position in a sequence (list).
    A Sequence is an ordered list of elements.
     
    Iterator subclass to iterate of CharSequences.
    Simple adjustable-length vector of signed or unsigned 16-bit integers (shorts).
    A generic simple vector.
    Implements a stable sequence with sticky positions.
    Various static utility methods for general strings (CharSeqs).
    A sequence consisting of a sub-range of the elements of a base sequence.
    Indexes are mapped.
    A compact representation of a tree, that is a nested list structure.
    A position that can also go down and up in a tree.
    Simple adjustable-length vector of unsigned 16-bit integers (shorts).
    Simple adjustable-length vector of unsigned 32-bit integers (ints).
    Simple adjustable-length vector of unsigned 64-bit integers (longs).
    Simple adjustable-length vector of unsigned 8-bit integers (bytes).
    Used for text that is supposed to be written out verbatim.
    A Consumer that does nothing.
    A Consumer extended with XML-specific methods.