Class BatchNode<T>


  • public final class BatchNode<T>
    extends java.lang.Object
    A batch processing data node for when there's no way to fit the data in memory.

    Data is stored in sharded files, and data is written/consumed and processed concurrently.

    The data is processed in batches. Each batch is processed in a single thread. The number of threads is controlled by #parallelism(IntSupplier).

    • Field Detail

      • DUMMY

        private static final java.util.function.Consumer<java.lang.Boolean> DUMMY
      • myDistributor

        private final java.util.function.ToIntFunction<T> myDistributor
      • myParallelism

        private final java.util.function.IntSupplier myParallelism
      • myQueueCapacity

        private final int myQueueCapacity
      • myReaderFactory

        private transient java.util.function.Function<java.io.File,​FromFileReader<T>> myReaderFactory
      • myReaderManager

        private final Throughput myReaderManager
      • myWriterManger

        private final Throughput myWriterManger
    • Method Detail

      • newInstance

        public static <T> BatchNode<T> newInstance​(java.io.File directory,
                                                   DataInterpreter<T> interpreter)
      • dispose

        public void dispose()
        Dispose of this node and explicitly delete all files.
      • processAll

        public void processAll​(java.util.function.Consumer<T> processor)
        Process each and every item individually
        Parameters:
        processor - Must be able to consume concurrently
      • processAll

        public void processAll​(java.util.function.Supplier<java.util.function.Consumer<T>> processorFactory)
        Similar to processAll(Consumer) but you provide a consumer constructor/factory rather than a specific consumer. Internally there will be 1 consumer per worker thread instantiated. This variant is for when the consumer(s) are stateful.
      • processCombineable

        public <R,​A extends TwoStepMapper<T,​R>> void processCombineable​(java.util.function.Supplier<A> aggregatorFactory,
                                                                                    java.util.function.Consumer<A> processor)
        Similar to processMergeable(Supplier, Consumer) but the processor is called with the aggregator instance itself rather than its extracted results. This corresponds to TwoStepMapper#Combineable rather than TwoStepMapper#Mergeable.
        See Also:
        processMergeable(Supplier, Consumer)
      • processMergeable

        public <R,​A extends TwoStepMapper<T,​R>> void processMergeable​(java.util.function.Supplier<A> aggregatorFactory,
                                                                                  java.util.function.Consumer<R> processor)
        Each shard is processed/aggregated separately by a TwoStepMapper instance. The results are then processed/merged by the provided processor.

        There is one TwoStepMapper instance per underlying worker thread. Those instances are reset and reused for each shard.

        The processor is called concurrently from multiple threads.

        You must make sure that all data items that need to be in the same aggregator instance are in the same file/shard. You control the number of shards via BatchNode.Builder.fragmentation(int) and which item goes in which shard via BatchNode.Builder.distributor(ToIntFunction).

        Parameters:
        aggregatorFactory - Produces the TwoStepMapper aggregator instances
        processor - Consumes the aggregated/derived data - the results of one whole TwoStepMapper instance at the time
      • reduceMapped

        @Deprecated
        public <R,​A extends TwoStepMapper.Mergeable<T,​R>> R reduceMapped​(java.util.function.Supplier<A> aggregatorFactory)
        Deprecated.
        v54 Use #reduceByMerging(Supplier) instead
        Same as processMergeable(Supplier, Consumer), but then also reduce/merge the total results using TwoStepMapper#merge(Object).

        Create a class that implements TwoStepMapper and make sure to also implement TwoStepMapper#merge(Object) - you can only use this if merging partial (sub)results is possible. Use a constructor or factory method that produce instances of that type as the argument to this method.

      • getReaderFactory

        private java.util.function.Function<java.io.File,​FromFileReader<T>> getReaderFactory()
      • process

        private void process​(java.io.File shard,
                             java.util.function.Consumer<T> consumer)
      • processAggregators

        private <R,​A extends TwoStepMapper<T,​R>> void processAggregators​(java.io.File shard,
                                                                                     java.util.function.Supplier<A> aggregatorFactory,
                                                                                     java.util.function.Consumer<A> processor)
      • processResults

        private <R,​A extends TwoStepMapper<T,​R>> void processResults​(java.io.File shard,
                                                                                 java.util.function.Supplier<A> aggregatorFactory,
                                                                                 java.util.function.Consumer<R> processor)