Cascading WIP 1.1
Cascading WIP 1.1 is now available as source on GitHub and as a regression-tested distribution from Concurrent, Inc.
Please consider this WIP (and any other Work In Progress branch) unstable and unsuitable for production use. That said, the more users who test it, the sooner it will stabilize.
Also note that the distribution downloads from Concurrent, Inc. are fully regression tested, so they should be a drop-in replacement for Cascading 1.0.
Please see CHANGES.txt for a comprehensive list of new features and bug fixes.
For highlights, please read on.
First, we added Fields.REPLACE and Fields.SWAP.
The former performs an in-line replacement of values within a set of fields while retaining the field names. The latter removes the arguments to an Operation from the stream and swaps in the results of the Operation. Both may only be used as output selectors on a Pipe.
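The difference between the two selectors can be sketched in plain Java. This is only a model of the behavior described above, not Cascading code: ordered maps stand in for Tuples, and the class and method names (SelectorSketch, replace, swap) are hypothetical. It also assumes that a replace returns one value per argument field and that a swap appends the results after the remaining fields.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: ordered maps stand in for Cascading Tuples.
// None of these names are part of the Cascading API.
final class SelectorSketch {

    // REPLACE-like: result values overwrite the argument values in place,
    // so the field names and their order are unchanged.
    static Map<String, Object> replace(Map<String, Object> input,
                                       List<String> argumentFields,
                                       List<Object> results) {
        if (argumentFields.size() != results.size()) {
            throw new IllegalArgumentException("one result value is needed per argument field");
        }
        Map<String, Object> out = new LinkedHashMap<>(input);
        for (int i = 0; i < argumentFields.size(); i++) {
            out.put(argumentFields.get(i), results.get(i));
        }
        return out;
    }

    // SWAP-like: the argument fields are removed from the stream and the
    // result fields (with their own names) are added in their place.
    // Appending them at the end is an assumption of this sketch.
    static Map<String, Object> swap(Map<String, Object> input,
                                    List<String> argumentFields,
                                    Map<String, Object> results) {
        Map<String, Object> out = new LinkedHashMap<>(input);
        out.keySet().removeAll(argumentFields);
        out.putAll(results);
        return out;
    }
}
```

Given a tuple with fields id and name, replacing name keeps both field names and changes only the value, while swapping name for a result field called upper leaves id and upper.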
Second, we removed the restriction that Traps must remain within Map and Reduce boundaries. Traps may now span multiple Map/Reduce instances, so an application may have a single Trap regardless of the Flow complexity.
Third, SpillableTupleList (used during join operations) now supports compression by default. This should improve join performance for some use cases.
Fourth, we added MultiSinkTap (and renamed MultiTap to MultiSourceTap). MultiSinkTap allows multiple locations to be written to simultaneously, with each output stream containing only the values declared by its Scheme.
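As a rough illustration of that projection behavior, the following plain-Java sketch writes one tuple to several sinks, each of which keeps only the fields it declares. It does not use the Cascading API: MultiSinkSketch, Sink, and write are hypothetical names, and in-memory lists stand in for output locations.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; not the Cascading MultiSinkTap implementation.
final class MultiSinkSketch {

    // Hypothetical stand-in for one output location and the fields its scheme declares.
    static final class Sink {
        final List<String> declaredFields;
        final List<Map<String, Object>> rows = new ArrayList<>();

        Sink(List<String> declaredFields) {
            this.declaredFields = declaredFields;
        }
    }

    // Every sink receives the same tuple, projected down to its declared fields.
    static void write(Map<String, Object> tuple, List<Sink> sinks) {
        for (Sink sink : sinks) {
            Map<String, Object> projected = new LinkedHashMap<>();
            for (String field : sink.declaredFields) {
                projected.put(field, tuple.get(field));
            }
            sink.rows.add(projected);
        }
    }
}
```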
Fifth, GlobHfs was created to allow standard Hadoop-style globbing to be used when creating a Tap. This is a simpler version of MultiSourceTap where the files match some pattern.
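For readers unfamiliar with globbing, here is a small sketch of selecting paths by pattern. It uses the JDK's own glob matcher rather than Hadoop or GlobHfs, so the syntax is similar to Hadoop's but not guaranteed to be identical, and it assumes '/' as the path separator. The class name GlobSketch and the example paths are made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: JDK glob matching, not Hadoop's glob implementation.
final class GlobSketch {

    // Returns the candidate paths that match the glob, in their original order.
    static List<String> select(String glob, List<String> candidates) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matched = new ArrayList<>();
        for (String candidate : candidates) {
            if (matcher.matches(Paths.get(candidate))) {
                matched.add(candidate);
            }
        }
        return matched;
    }
}
```

A pattern such as logs/2009-*/part-* would pick up the part files under every 2009 directory, which is the kind of input set that would otherwise have to be listed location by location.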
Sixth, we added support for safe/not-safe Operations. A not-safe operation is one that may have side effects, so it must never see the same Tuple more than once. Sometimes it is faster to process the same Tuples more than once across the cluster; this metadata prevents that from happening if the operation is not idempotent.
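The idea can be sketched as a toy runner that only allows redundant passes over the data when an operation reports itself as safe. This is an assumption-laden model, not how the Cascading planner works: SafetySketch, Op, run, and the redundantPasses parameter are hypothetical names chosen for illustration.

```java
import java.util.List;

// Illustrative model only; the real planner logic is not shown in this post.
final class SafetySketch {

    // Hypothetical operation contract: isSafe() is the metadata flag,
    // apply() may or may not have side effects.
    interface Op {
        boolean isSafe();
        void apply(String tuple);
    }

    // redundantPasses models duplicate processing done for speed.
    // A not-safe operation always gets exactly one pass over the tuples.
    static void run(Op op, List<String> tuples, int redundantPasses) {
        int passes = op.isSafe() ? 1 + redundantPasses : 1;
        for (int pass = 0; pass < passes; pass++) {
            for (String tuple : tuples) {
                op.apply(tuple);
            }
        }
    }
}
```

An operation that increments an external counter or writes to a database is the typical not-safe case: running it twice on the same Tuple would change the outcome, so the flag trades some speed for correctness.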
There are also many more features under the hood to help improve the performance and manageability of Cascading Flows and Cascades.
There is no planned release date for 1.1, but it shouldn’t be terribly far off.