What's New in CDH4 Beta 1
CDH-wide Changes for CDH4 Beta 1
Apache Hadoop Common
Apache Hadoop HDFS
Apache Hadoop MapReduce
- MapReduce 2.0: MapReduce has been completely overhauled, and CDH4 now includes MapReduce 2.0 (MRv2). The fundamental idea of MRv2's YARN architecture is to split the two primary responsibilities of the JobTracker (resource management and job scheduling/monitoring) into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). With MRv2, the ResourceManager and the per-node NodeManagers (NM) form the data-computation framework. The ResourceManager service effectively replaces the functions of the JobTracker, and NodeManagers run on slave nodes in place of TaskTracker daemons. The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManagers to execute and monitor the tasks. For details of the new architecture, see Apache Hadoop NextGen MapReduce (YARN).
Note: Cloudera does not yet consider the current upstream MRv2 release stable, and it could change in ways that are not backward compatible. Cloudera recommends that you use MRv1 unless you have particular reasons for using MRv2, which should not be considered production-ready.
- CDH4 continues to support the original MapReduce framework (that is, the JobTracker and TaskTrackers), which is referred to as MRv1. During deployment, you can choose either MRv1 or MRv2. Cloudera does not support running MRv1 and YARN daemons on the same nodes at the same time.
- MRv1 in CDH4 is based on its counterpart in CDH3, with some changes to make the MR API compatible with Hadoop 2.0.0 (and Hadoop 0.23 and later). This means that users will need to recompile their applications when going from CDH3 to CDH4 (even when continuing to use MRv1). Recompilation will not be necessary when going from MRv1 to MRv2 within CDH4.
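The division of responsibilities in the MRv2 architecture described in this section can be pictured with a small toy model. This is a conceptual sketch only, not Hadoop code: the class and method names (`ResourceManager.allocate`, `NodeManager.launch`, `ApplicationMaster.run`) are hypothetical and do not correspond to the real YARN API, and a "container" is reduced to a free slot on a node.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy per-node daemon: runs tasks in the slots it has been asked to use."""
    name: str
    free_slots: int
    log: list = field(default_factory=list)

    def launch(self, app_id, task):
        # A real NodeManager would start a process; here we only record the task.
        self.log.append((app_id, task))
        return f"{task}@{self.name}"

    def release(self):
        # Hand the slot back once the task has finished.
        self.free_slots += 1


class ResourceManager:
    """Toy global daemon: responsible for resource allocation only."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, wanted):
        # Grant up to `wanted` slots; each grant is represented by its node.
        granted = []
        for node in self.nodes:
            while node.free_slots > 0 and len(granted) < wanted:
                node.free_slots -= 1
                granted.append(node)
        return granted


class ApplicationMaster:
    """Toy per-application master: negotiates slots and tracks its own tasks."""

    def __init__(self, app_id, tasks):
        self.app_id = app_id
        self.pending = list(tasks)
        self.finished = []

    def run(self, rm):
        while self.pending:
            containers = rm.allocate(len(self.pending))
            if not containers:
                raise RuntimeError("cluster has no free slots")
            for node in containers:
                task = self.pending.pop(0)
                self.finished.append(node.launch(self.app_id, task))
                node.release()
        return self.finished


if __name__ == "__main__":
    nodes = [NodeManager("nm1", free_slots=2), NodeManager("nm2", free_slots=1)]
    am = ApplicationMaster("job_1", ["map-0", "map-1", "map-2", "map-3"])
    print(am.run(ResourceManager(nodes)))
    # prints ['map-0@nm1', 'map-1@nm1', 'map-2@nm2', 'map-3@nm1']
```

In this model the ResourceManager never sees individual tasks and the ApplicationMaster never decides which node has capacity, which mirrors the split of the JobTracker's two roles; scheduling policy, failures, and real resource accounting are deliberately left out.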
Apache HBase
- HFile v2, a new, more efficient storage format (HBASE-3857)
- Faster recovery via distributed log splitting (HBASE-1364)
- Lower-latency region-server operations via new multi-threaded and asynchronous implementations
- An enhanced web UI that exposes more internal state
- Improved logging for identifying slow queries
- Improved corruption detection and repair tools (HBASE-5128)
Apache Hive
- Updated to upstream version 0.8.0
- Includes support for Binary DataType (HIVE-2380)
- Includes support for Timestamp DataType (HIVE-2272)
- Provides Plugin Developer Kit (HIVE-2244)
- Includes support for INSERT INTO append semantics (HIVE-306)
- Includes support for Per-Partition SerDe (HIVE-2484)
- Includes support for Import/Export facilities (HIVE-1918)
- Includes support for Bitmap Indexes (HIVE-1803)
- Includes support for RCFile Block Merge (HIVE-1950)
- Incorporates Group By Optimization (HIVE-1694)
- Provides new Virtual Columns (HIVE-2100)
- Incorporates JDBC Driver improvements (HIVE-559, HIVE-1631, HIVE-2000, HIVE-2054, HIVE-2144, HIVE-2153, HIVE-2358, HIVE-2369, HIVE-2456)
Apache Pig
- Updated to upstream version 0.9.2
- Support for Penny, a framework for workflow instrumentation (PIG-1959)
- Robust CSV Loader/Store (PIG-1924)
- Macro expansion support for Pig Latin (PIG-1793)
- Ability to load data by column family in HBaseStorage (PIG-1782)
- Embed Pig in scripting languages (PIG-1479)
- Support for project-range expression (PIG-1693)
Apache Sqoop (incubating)
- Updated to upstream version 1.4.0
- Updated to upstream version 3.4.1