Load
Load provides a simple command-line interface for building high-load jobs on a cluster, based on Cascading.
The open source repository is available on GitHub at https://github.com/Cascading/cascading.load, where the README.md file has build and installation instructions and the COMMANDS.md file has a full list of command line options.
Load can be run in Hadoop as a JAR file:
hadoop jar load.jar param1 param2 .. paramN
Or, after installing on your laptop or server, as a command suitable for use in bash scripts, cron jobs, etc.:
cascading.load param1 param2 .. paramN
Why use Load?
There are a number of good reasons for having a library of functional load tests handy:
- Generate datasets for load testing your Hadoop cluster.
- Produce a consistent set of baseline metrics for your Hadoop cluster.
Baseline metrics become particularly useful when you need to modify your cluster. Whether you are tuning Hadoop configuration settings, upgrading hardware, or modifying the switch fabric, Load can produce baseline metrics to help give you an objective, quantitative basis for comparisons.
For example, the generate data app uses only a single mapper, which makes negligible use of HDFS reads but substantial use of HDFS writes. Since there is no reducer, there is no shuffle phase. So the generate data app provides an excellent way to obtain baseline metrics for HDFS write throughput.
hadoop jar load.jar --generate -I output/nop -O output/gendata
The consume data app provides a complement: it is also a single mapper, which reads the data produced by generate data and makes negligible use of HDFS writes. Thus the consume data app provides an excellent way to obtain baseline metrics for HDFS read throughput. For another example, the count sort app provides a great way to measure the cost of the shuffle phase on a given cluster configuration.
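Following the pattern of the --generate invocation above, the consume data and count sort apps can be sketched as command lines of the same shape. The exact flag names and the output paths shown here are assumptions for illustration; consult COMMANDS.md in the repository for the authoritative option list:

hadoop jar load.jar --consume -I output/gendata -O output/consume
hadoop jar load.jar --countsort -I output/gendata -O output/countsort

Each run chains naturally after a --generate run, since both apps read the dataset that generate data wrote, letting you collect read-throughput and shuffle-cost baselines from the same input.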