This is the documentation for CDH 4.7.0.
Documentation for other versions is available at Cloudera Documentation.

What's New in CDH4 Beta 1

CDH-wide Changes for CDH4 Beta 1

Updated Components

Apache Hadoop Common

  • A new FileContext API for applications
  • Hadoop Auth for end-to-end HTTP security. Hadoop Auth is a Java library consisting of client and server components to enable Kerberos SPNEGO authentication for HTTP.
  • Hadoop HttpFS: a read/write Hadoop file system proxy with a REST API
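
As a rough illustration of the HttpFS REST interface, the sketch below builds request URLs in Python without sending them. It assumes that HttpFS serves a WebHDFS-style API under /webhdfs/v1 on port 14000 and accepts a user.name query parameter for simple (non-Kerberos) authentication. Check these details against your deployment. The host name, user, paths, and the httpfs_url helper are hypothetical.

```python
from urllib.parse import quote, urlencode


def httpfs_url(host, path, op, user, port=14000, **params):
    """Build an HttpFS REST URL (hypothetical helper; nothing is sent)."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # Assumed layout: /webhdfs/v1<path>?op=<OPERATION>&user.name=<user>
    query = {"op": op, "user.name": user}
    query.update(params)
    return "http://%s:%d/webhdfs/v1%s?%s" % (
        host, port, quote(path), urlencode(query))


# List a directory, then read a file from its first byte.
list_url = httpfs_url("httpfs.example.com", "/user/alice", "LISTSTATUS", "alice")
read_url = httpfs_url("httpfs.example.com", "/user/alice/part 0.txt",
                      "OPEN", "alice", offset=0)
print(list_url)
print(read_url)
```

The resulting URLs can be passed to any HTTP client. On a cluster secured with Kerberos, the request would instead be authenticated through SPNEGO, which is the role of Hadoop Auth, rather than through the user.name parameter.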

Apache Hadoop HDFS

Apache Hadoop MapReduce

  • MapReduce 2.0: MapReduce has undergone a complete overhaul, and CDH4 now includes MapReduce 2.0 (MRv2). The fundamental idea of MRv2's YARN architecture is to split the two primary responsibilities of the JobTracker (resource management and job scheduling/monitoring) into separate daemons: a global ResourceManager (RM) and per-application ApplicationMasters (AM). With MRv2, the ResourceManager and the per-node NodeManagers (NM) form the data-computation framework. The ResourceManager service effectively replaces the functions of the JobTracker, and NodeManagers run on slave nodes instead of TaskTracker daemons. The per-application ApplicationMaster is, in effect, a framework-specific library tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks. For details of the new architecture, see Apache Hadoop NextGen MapReduce (YARN).
 

Note: Cloudera does not yet consider the current upstream MRv2 release stable; it could change in non-backwards-compatible ways. Cloudera recommends that you use MRv1 unless you have particular reasons for using MRv2, which should not be considered production-ready.

  • CDH4 continues to support the original MapReduce framework (that is, the JobTracker and TaskTrackers), which is referred to as MRv1. You can choose either MRv1 or MRv2 at deployment time. Cloudera does not support running MRv1 and YARN daemons on the same nodes at the same time.
  • MRv1 in CDH4 is based on its counterpart in CDH3, with some changes to make the MR API compatible with Hadoop 2.0.0 (and Hadoop 0.23 and later). This means that you must recompile your applications when moving from CDH3 to CDH4, even if you continue to use MRv1. Recompilation is not necessary when moving from MRv1 to MRv2 within CDH4.
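
The division of responsibilities in MRv2 described above can be sketched as a toy model in Python. This is not the YARN API: the class and method names merely echo the daemon roles, the slot counts are a simplifying assumption, and containers are never released. It only shows which component makes which decision.

```python
class NodeManager:
    """Per-node daemon: launches containers on its own node."""

    def __init__(self, name, slots):
        self.name = name
        self.free_slots = slots  # simplified stand-in for node resources

    def launch(self, task):
        # Run the task in a container and report which node ran it.
        return "%s ran on %s" % (task, self.name)


class ResourceManager:
    """Global daemon: decides which node receives each container."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self):
        # Grant one container on the node with the most free slots, if any.
        node = max(self.node_managers, key=lambda nm: nm.free_slots)
        if node.free_slots == 0:
            return None
        node.free_slots -= 1
        return node


class ApplicationMaster:
    """Per-application: negotiates containers and tracks its own tasks."""

    def __init__(self, tasks):
        self.tasks = tasks

    def run(self, rm):
        results = []
        for task in self.tasks:
            node = rm.allocate()  # negotiate with the ResourceManager
            if node is None:
                results.append("%s is waiting for resources" % task)
            else:
                results.append(node.launch(task))  # work with the NodeManager
        return results


rm = ResourceManager([NodeManager("node1", 2), NodeManager("node2", 1)])
results = ApplicationMaster(["map-0", "map-1", "map-2", "reduce-0"]).run(rm)
for line in results:
    print(line)
```

In this model the ResourceManager knows nothing about individual tasks, and the ApplicationMaster knows nothing about cluster capacity. In MRv1 the JobTracker handled both concerns.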

Apache HBase

User features:

  • HFile v2, a new, more efficient storage format (HBASE-3857)
  • Faster recovery via distributed log splitting (HBASE-1364)
  • Lower-latency region-server operations via new multi-threaded and asynchronous implementations

Operator features:

  • An enhanced web UI that exposes more internal state
  • Improved logging for identifying slow queries
  • Improved corruption detection and repair tools (HBASE-5128)

Developer features:

  • Coprocessors (HBASE-2000)
  • Build support for Hadoop 0.20.20x, 0.22, and 0.23
  • Experimental: offheap slab cache and online table schema change (HBASE-4027)

Apache Hive

  • Updated to upstream version 0.8.0

User features:

Apache Pig

  • Updated to upstream version 0.9.2

New features:

  • Support for Penny, a framework for workflow instrumentation (PIG-1959)
  • Robust CSV Loader/Store (PIG-1924)
  • JavaScript support for Pig embedding and UDFs in scripting languages (PIG-1794)
  • Macro expansion support for Pig Latin (PIG-1793)
  • Ability to load data by column family in HBaseStorage (PIG-1782)
  • Embed Pig in scripting languages (PIG-1479)
  • Support for project-range expression (PIG-1693)

Apache Sqoop (incubating)

  • Updated to upstream version 1.4.0

New features:

Apache ZooKeeper

  • Updated to upstream version 3.4.1