Cascading 2.2 WIP and CoercibleTypes

Cascading 2.2 is starting to take shape for those interested in test driving emerging features. Of note is “field type” support. This allows fields read from an input file to retain their type information through to where the data is sunk/stored to a file. This is important for a few reasons:
– Detecting incompatible comparisons during joins and sorting at planner time
– Retaining canonical types in a Tuple
– Reading and writing field type information from/into long term archive files (Avro, Thrift, etc)
– Reducing intermediate file size by guaranteeing field type information
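To illustrate the first reason, here is a minimal conceptual sketch in Python of why declared field types allow a planner to reject a bad join before any data is read. It does not use the Cascading API: `TypedField`, `check_join_keys`, and `coerce` are hypothetical names invented for this example.

```python
# Conceptual sketch only: these names are hypothetical, not the Cascading API.
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedField:
    """A field name paired with its declared type."""
    name: str
    type: type


def check_join_keys(lhs: TypedField, rhs: TypedField) -> None:
    """Fail at 'plan time' (before reading data) if key types differ."""
    if lhs.type is not rhs.type:
        raise TypeError(
            f"cannot join {lhs.name}:{lhs.type.__name__} "
            f"with {rhs.name}:{rhs.type.__name__}"
        )


def coerce(value, field: TypedField):
    """Convert a raw value (e.g. text from a file) to the field's declared type."""
    return field.type(value)


user_id = TypedField("user_id", int)
check_join_keys(user_id, TypedField("id", int))   # compatible, no error
print(coerce("42", user_id) + 1)                  # raw text becomes an int
```

Because the types travel with the field declarations, the incompatible case (for example an `int` key joined against a `str` key) can be reported when the flow is assembled rather than partway through a long-running job.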

Cascading 2.1

We are happy to announce that Cascading 2.1 is now publicly available for download. This release includes a number of new features. Specifically:
– Restartable Flows using Checkpointing
– Improved memory utilization and gc
– Refactored build system, source and javadoc jars now available through conjars.org
For more details see: https://github.com/Cascading/cascading/blob/2.0/CHANGES.txt

Cascading Software Development Kit

The Cascading SDK is now available for download. The SDK includes Cascading source and jars, and many of the Cascading based tools like Load and Multitool. It also includes an Amazon Elastic MapReduce install script (bootstrap action) that will pre-install all included tools on the master node.

Social Recommender Sample Code

Check out the new SampleRecommender source code, which shows a simple implementation of a social recommender using sample data from a Twitter feed. It includes use of stream assertions to validate data, and an R script to analyze the results of the recommender.

Cascading 2.0

We are happy to announce that Cascading 2.0 is now publicly available for download. This release includes a number of new features. Specifically:
– Apache 2.0 Licensing
– Support for Hadoop 1.0.2
– Local and Hadoop planner modes, where local runs in memory without Hadoop dependencies
– HashJoin pipe for “map side joins”
– Merge pipe for “map side merges”
– Simple Checkpointing for capturing intermediate data as a file
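For readers unfamiliar with the term, a “map side join” can be sketched as follows. This is a conceptual Python illustration, not Cascading code: it assumes the smaller input fits in memory, and the function name `hash_join` and the sample records are invented for the example.

```python
# Conceptual sketch only: illustrates the idea of a map-side (hash) join,
# assuming the smaller side fits in memory. Not the Cascading API.
def hash_join(large_stream, small_side, key):
    """Inner-join two iterables of dicts on `key` without sorting or shuffling."""
    # Build an in-memory index from the smaller side once.
    index = {}
    for row in small_side:
        index.setdefault(row[key], []).append(row)

    # Stream the larger side past the index, emitting merged rows on a match.
    for row in large_stream:
        for match in index.get(row[key], []):
            yield {**match, **row}


users = [{"user": 1, "name": "ann"}, {"user": 2, "name": "bob"}]
events = [{"user": 2, "action": "click"}, {"user": 3, "action": "view"}]
joined = list(hash_join(events, users, "user"))
print(joined)
```

Since the large side is only streamed, no grouping step is needed for the join itself, which is the appeal of doing it on the map side.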

Scalding Released

If you are a Scala fan, check out the Scalding announcement from Twitter. Or just grab the Scalding code from GitHub. Of course, don’t forget the other language bindings: Cascalog, PyCascading, and Cascading.JRuby.

PyCascading Released

If you are interested in running Python on Apache Hadoop, check out PyCascading from Twitter. Here is the official announcement on our mailing list. If Clojure is more your thing, there is always Cascalog, another project from the Twitter data teams (formerly BackType).

Intro to Cascading

Scale Unlimited will be offering their online course, Introduction to Cascading, this November 18th.

Cascading 2.0 Early Access

After months of work, we are very happy to announce availability of Cascading 2.0 WIP (Work in Progress). 2.0 is still under development, but it has become stable enough for us to make the work public so we can get early feedback on the APIs and other related changes, without causing unnecessary headaches to early adopters. Currently nearly all changes are internal except for these… Decoupled the internal planner from Hadoop and provided a “local” mode planner for fast in-memory processing.

Apache Solr Integration

Apache Solr integration Tap has just been added to the Cascading extensions page for download from GitHub.

NFJS: Cascading a Simple MapReduce

The No Fluff, Just Stuff conference tour is running a series of presentations on Cascading and Cascalog. Check out the video below for a great introduction to Cascading.

Cascading Load and Multitool

After a bit of work, we have repackaged both Cascading Load and Multitool, giving them helper bash wrappers for installing, running, and updating. The new packages are on the download page. After unpacking Multitool, for example, just run ./bin/multitool install, or ./bin/multitool help for more information. Multitool is a command line interface for running sed- and grep-like applications on Apache Hadoop. It even supports joins across multiple files. It’s perfect for finding files or creating large test datasets from larger ones.

JAX San Jose 2011

Chris will be speaking at JAX in San Jose on Tuesday June 21st on Apache Hadoop and “Big Data”.

Buzzwords 2011

Chris will be speaking at Berlin Buzzwords this June on Common Patterns in MapReduce.

Cascalog Workshop – February 19th in San Francisco

Interested in getting started with Hadoop, Cascading, and Cascalog? If so, sign up here for the Cascalog Workshop in sunny San Francisco on Saturday, February 19th, before space runs out. Nathan Marz of BackType, the author of Cascalog, will be leading the workshop. Chris K Wensel, the author of Cascading, will be lurking about lending a hand where possible.

ReadWriteWeb on BackType and Cascalog

From Secrets of BackType’s Data Engineers: Cascalog is one of their secret weapons, a Clojure-based query language for Hadoop that makes it simple for them to analyze their data in new ways. Inspired by the venerable Datalog, and built on top of Cascading, it allows you to write queries in Clojure and define even complex operations in simple code. Unlike alternatives like Pig or Hive, it’s written within a general-purpose language, so there’s no need for separate user-defined functions, but it’s still a highly-structured way of defining queries.

Cascading 1.2 Now Available

We are happy to announce that Cascading 1.2 is now publicly available for download. This release features many performance and usability enhancements while remaining backwards compatible with 1.0 and 1.1. Specifically:
– Performance optimizations during grouping (StreamComparator)
– Composable map-side partial aggregations (AggregateBy)
– Native Riffle support for non-Cascading (or nested iterative Cascading) processes (ProcessFlow and Riffle)
For a detailed list of changes see: CHANGES.txt
We are also happy to announce that Cascading and its extensions have their own Maven/Ivy Jar repository, Conjars.
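The idea behind map-side partial aggregation can be sketched in a few lines of Python. This is a conceptual illustration under stated assumptions, not how AggregateBy is implemented: the function names, the `threshold` parameter, and the simple oldest-first eviction policy are all invented for the example.

```python
# Conceptual sketch only: partial sums are kept in a bounded in-memory cache
# on the "map side" and merged afterwards. Names and the eviction policy are
# assumptions for illustration, not the Cascading AggregateBy implementation.
def partial_sum(pairs, threshold=10000):
    """Pre-aggregate (key, value) pairs, holding at most `threshold` keys."""
    cache = {}
    for key, value in pairs:
        if key not in cache and len(cache) >= threshold:
            # Cache is full: emit the oldest partial result to make room.
            oldest = next(iter(cache))
            yield oldest, cache.pop(oldest)
        cache[key] = cache.get(key, 0) + value
    # Flush whatever is still cached at the end of the input.
    yield from cache.items()


def final_sum(partials):
    """'Reduce side': merge partial sums into final totals per key."""
    totals = {}
    for key, value in partials:
        totals[key] = totals.get(key, 0) + value
    return totals


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)]
print(final_sum(partial_sum(pairs, threshold=2)))
```

The totals are the same as a plain group-and-sum, but fewer records need to cross the network when keys repeat, because repeats are collapsed before the grouping step.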

Conjars Maven Repo

We would like to announce the Conjars Maven repo, a community driven Java jar repository for Cascading libraries and extensions. Anyone who wishes can register an account and push maven artifacts for use by other Cascading developers. Cascading itself is now available through Conjars; core, xml, and test. Conjars is now live, but still a work in progress. We hope to get some UI improvements in place in the near future.
