Cascading 3.1 Release
We are happy to announce Cascading 3.1 is now publicly available for download.
Version 3.1 improves the performance of Cascading over 3.0, resolves a number of issues during planning of complex workloads when running on MapReduce and Apache Tez, and further delivers on the promise of new platform portability with the addition of Apache Flink as an execution platform.
As of version 3.1, Cascading can make further use of declared type information when serializing data across the ‘shuffle’ (the partitioning phase) between map and reduce tasks or Tez vertices. For many applications performing ETL or data-cleansing workloads, flexible type support within a column can dramatically improve reliability, because each operation can request values as the type it requires (a string or an integer). When integrating complex data sets, a given column or field may not be consistent across data sets, so requiring a common type, rather than lazily converting values on demand, can be difficult to rely on in practice. But for applications and systems that are consistent, declaring types not only enforces consistency (or prevents inconsistency); the type information is now also used to improve serialization IO efficiency.
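To make the serialization point concrete, here is a minimal plain-Java sketch of the general idea. It is not Cascading's actual wire format or API: the class `TypedSerializationSketch`, the tag constants, and both methods are hypothetical names invented for this illustration. It assumes that, without declared types, each value must carry a small tag so a reader knows what follows, whereas with declared types the schema is known up front and only the values are written.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; not Cascading's serialization code.
final class TypedSerializationSketch {
    static final byte TAG_INT = 1;
    static final byte TAG_STRING = 2;

    /** No declared types: every value is prefixed with a one-byte type tag. */
    static byte[] writeTagged(Object[] row) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Object value : row) {
                if (value instanceof Integer) {
                    out.writeByte(TAG_INT);
                    out.writeInt((Integer) value);
                } else {
                    out.writeByte(TAG_STRING);
                    out.writeUTF(String.valueOf(value));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Declared types: the schema says what each position holds, so no tags are written. */
    static byte[] writeDeclared(Class<?>[] types, Object[] row) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < row.length; i++) {
                if (types[i] == Integer.class) {
                    out.writeInt((Integer) row[i]);
                } else {
                    out.writeUTF((String) row[i]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Object[] row = {42, "alice"};
        Class<?>[] types = {Integer.class, String.class};
        System.out.println("tagged bytes:   " + writeTagged(row).length);
        System.out.println("declared bytes: " + writeDeclared(types, row).length);
    }
}
```

In this sketch the saving is one byte per field per record. That is small for a single row, but it is paid on every record that crosses the shuffle, which is why a declared schema can reduce IO on large workloads.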
In addition to leveraging type information, 3.1 introduces a new filtering mechanism that lets applications consuming directory-partitioned data sets (data stored by date or other values) pre-filter their input. For example, if a data set spans 10 years and is partitioned by year and month, a filter can exclude partitions (directories) that do not match, preventing a Cascading application from reading files outside the expected time range at runtime. This simple enhancement improves performance without impacting the maintainability of large, complex applications.
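The year and month example can be sketched in plain Java as follows. This is only an illustration of the pruning idea, not the Cascading filter API: `PartitionFilterSketch`, `Partition`, `select`, and `inSecondHalfOf2014` are hypothetical names, and the sketch assumes partition directories are relative paths of the form `year/month`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; not the Cascading partition filter API.
final class PartitionFilterSketch {

    /** Parsed form of a relative partition directory such as "2014/07". */
    static final class Partition {
        final int year;
        final int month;

        Partition(int year, int month) {
            this.year = year;
            this.month = month;
        }

        static Partition parse(String relativePath) {
            String[] parts = relativePath.split("/");
            return new Partition(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
        }
    }

    /** Keeps only the directories the predicate accepts; the rest are never opened. */
    static List<String> select(List<String> directories, Predicate<Partition> keep) {
        List<String> selected = new ArrayList<>();
        for (String directory : directories) {
            if (keep.test(Partition.parse(directory))) {
                selected.add(directory);
            }
        }
        return selected;
    }

    /** Example filter: accept only July through December of 2014. */
    static boolean inSecondHalfOf2014(Partition p) {
        return p.year == 2014 && p.month >= 7;
    }

    public static void main(String[] args) {
        List<String> directories =
            Arrays.asList("2013/12", "2014/06", "2014/07", "2014/12", "2015/01");
        System.out.println(select(directories, PartitionFilterSketch::inSecondHalfOf2014));
    }
}
```

The decision is made on the directory name alone, before any file in that directory is read, which is where the performance benefit comes from.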
We are also very happy to remind users of the availability of Cascading on Apache Flink. With the release of 3.1.0, the Cascading Connector on Flink can be run against a stable Cascading release. For more information, see the Apache Conference presentation Faster Workflows.
Please note this is a minor release retaining API compatibility with Cascading 3.0 public methods.
As we continue to advance the code base, a number of other enhancements and bug fixes are included in the release. For the complete list of changes in Cascading 3.1, please see the change log.