This is the documentation for CDH 4.6.0.
Documentation for other versions is available at Cloudera Documentation.

What's New in CDH4.1.0

Security Fix

Apache Hadoop Common

  • HttpFS now supports delegation tokens.

Apache HDFS

Apache MapReduce

  • The logging level of an individual MapReduce job in MRv1 can be set; see Tips and Guidelines for details.
  • The performance of the MapReduce shuffle handler and IFile reader can be improved for either MRv1 or YARN by using native Linux system calls that cache data before the shuffle or merge operations. See Tips and Guidelines for more details.
  • MRv1 performance has been improved with optimizations that reduce job latency for small jobs. See Tips and Guidelines for more details.
  • MapReduce jobs can query individual job status with JobClient.getJob(JobID).getJobStatus().
  • Encrypted shuffle can be enabled; this feature also encrypts traffic to the Web UIs. See Configuring Encrypted Shuffle for details.
  • MapReduce job recovery is now available for MRv1. If the JobTracker is shut down or crashes, on restart it automatically resubmits all jobs that were running at the time of the shutdown or crash. All recovered jobs are rerun from the beginning; all output from the incomplete run is deleted before resubmission.

Apache Hadoop Security using Kerberos

  • The log file names for Hadoop security using Kerberos have been changed to avoid potential conflict: HDFS security logs are now written to SecurityAuth-hdfs.audit while MapReduce security logs are written to SecurityAuth-mapred.audit.

Apache Flume

  • Update to base version of Flume 1.2.0
  • Major improvements to the file channel, including on-disk encryption support
  • New, higher-throughput Asynchronous HBase sink
  • New, much faster syslog TCP source capable of listening on many ports simultaneously
  • Exponential backoff for failed nodes in the load-balancing RPC client and Avro sink
  • Included "stock" interceptors, including those that annotate events with the current hostname or timestamp
  • New monitoring support for JMX, Ganglia, and HTTP
  • Significantly expanded user documentation
  • Many other enhancements and fixes
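
The exponential backoff mentioned above for the load-balancing RPC client can be pictured with a small sketch. This is not Flume's implementation: the class name BackoffHostSelector, the base delay, and the delay cap are all hypothetical, chosen only to show how a repeatedly failing node is skipped for progressively longer periods.

```python
import itertools


class BackoffHostSelector:
    """Round-robin selector that skips hosts still inside their backoff window.

    Illustrative only: the name, delays, and cap are assumptions,
    not Flume's actual implementation or defaults.
    """

    def __init__(self, hosts, base_delay=1.0, max_delay=30.0):
        self.hosts = list(hosts)
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.failures = {h: 0 for h in self.hosts}    # consecutive failures
        self.retry_at = {h: 0.0 for h in self.hosts}  # earliest retry time
        self._cycle = itertools.cycle(self.hosts)

    def backoff_delay(self, host):
        """Delay doubles with each consecutive failure, capped at max_delay."""
        n = self.failures[host]
        if n == 0:
            return 0.0
        return min(self.base_delay * 2 ** (n - 1), self.max_delay)

    def mark_failed(self, host, now):
        """Record a failure and push the host's next retry time further out."""
        self.failures[host] += 1
        self.retry_at[host] = now + self.backoff_delay(host)

    def mark_ok(self, host):
        """A success resets the failure count and clears the backoff."""
        self.failures[host] = 0
        self.retry_at[host] = 0.0

    def next_host(self, now):
        """Return the next host whose backoff has expired, or None if all are waiting."""
        for _ in range(len(self.hosts)):
            host = next(self._cycle)
            if self.retry_at[host] <= now:
                return host
        return None
```

The caller passes the current time explicitly (for example, time.monotonic()), which keeps the sketch deterministic and easy to test.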

Apache Sqoop

  • Target directory for Hive import no longer needs to match the table name.
  • The Microsoft SQL Server connector and OraOop are now supported.
  • The --columns argument is now supported for export jobs.
  • Microsoft SQL Server table names that include hyphens are now supported.

Hue

  • Hue can now be configured so that users can only see Beeswax queries that they issued or saved. With the default configuration, any Beeswax query can be viewed by any user. The new share_saved_queries property now controls the sharing of the queries; when set to "false", saved or executed queries can be viewed only by the owner or a Hue administrator.
  • The Job Browser configuration now supports the share_jobs property which, when set to "false", prevents a user from viewing information about jobs submitted by other users; an administrator can view jobs for all users. The default behavior allows all users to see all jobs.
  • Retired Jobs can now be viewed through the Job Browser in Hue. The information is less complete than the information displayed for Recent Jobs.
  • Hue now provides an Oozie application for creating workflows of MapReduce, streaming, Java, Pig, Hive, Sqoop, Shell, and ssh jobs and scheduling them to run repeatedly.
  • Hue is now available in German, Spanish, French, Japanese, Korean, Portuguese, Brazilian Portuguese, and Simplified Chinese.
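
The visibility rule described above for share_saved_queries and share_jobs can be summarized as a small predicate. The sketch below is not Hue's code; the function name can_view and its parameters are hypothetical and only restate the behavior described in the bullets.

```python
def can_view(viewer, owner, viewer_is_admin, sharing_enabled=True):
    """Decide whether `viewer` may see an item (query or job) owned by `owner`.

    `sharing_enabled` stands in for share_saved_queries or share_jobs.
    Hypothetical helper that restates the documented behavior only.
    """
    if sharing_enabled:
        # Default configuration: every user can see every item.
        return True
    # Sharing disabled: only the owner or an administrator can see the item.
    return viewer == owner or viewer_is_admin
```

For example, with sharing disabled, can_view("alice", "bob", viewer_is_admin=False, sharing_enabled=False) is False, while the same call with viewer_is_admin=True is True.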

Apache Pig

  • CDH4.1 includes Pig 0.10.0.
  • Boolean data type is now supported. PIG-1429
  • Nested FOREACH statements are now supported. PIG-1631
  • Ruby UDFs are now supported. PIG-2317
  • The LIMIT and SAMPLE operators can now take expressions other than constant values. PIG-1926
  • A default SPLIT destination can be specified with the OTHERWISE keyword. PIG-1904
  • Syntactic sugar for TOTUPLE, TOBAG, and TOMAP has been added. PIG-1387
  • AvroStorage now supports globs and commas. PIG-2492
  • AvroStorage now supports recursive records. PIG-2875
  • CDH4.1 includes the DataFu collection of useful Apache Pig UDFs (User-Defined Functions) for statistical analysis. See Pig Installation for installation instructions.
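
The effect of a default SPLIT destination can be modeled outside Pig with a short sketch. This is Python rather than Pig Latin, and the split function and its arguments are hypothetical; it only illustrates that a record may land in every branch whose condition it satisfies, and that records matching no condition go to the OTHERWISE destination.

```python
def split(records, branches, otherwise=None):
    """Model of SPLIT with an optional default destination.

    branches:  dict mapping destination name -> predicate function
    otherwise: name of the default destination, or None to drop unmatched records
    Hypothetical helper; not part of Pig.
    """
    out = {name: [] for name in branches}
    if otherwise is not None:
        out[otherwise] = []
    for rec in records:
        matched = False
        for name, predicate in branches.items():
            if predicate(rec):
                out[name].append(rec)  # a record may match several branches
                matched = True
        if not matched and otherwise is not None:
            out[otherwise].append(rec)  # default destination
    return out
```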

Apache Oozie

  • Oozie workflow, coordinator and bundle XML definitions 0.4 support a parameters element defining the expected job parameters and default values, if any. If present, the parameters element enables an early verification of the submitted job configuration.
  • Oozie workflow XML definition 0.4 supports a global configuration section which is inherited by all actions. This global section can be used to define common key/values across actions in the workflow such as the job-tracker URI, the name-node URI and configuration properties. Values defined at the action level have precedence over global values.
  • Oozie workflow XML definition 0.4 supports multiple job-xml elements in action definitions. If an action has multiple job-xml elements, the property key/values of those configuration files are loaded to the action configuration; if a property key occurs in multiple configuration files, the last occurrence has precedence.
  • Oozie workflow MapReduce actions now support über JARs. The über JAR must be specified in the MapReduce action configuration section using the oozie.mapreduce.uber.jar property. To use über JARs, the Oozie server must first be configured by setting the oozie.action.mapreduce.uber.jar.enable property to true in oozie-site.xml.
  • Oozie logs are now purged automatically (by default after 30 days). Oozie logs can also be compressed with gzip when rolled.
  • Oozie workflow control nodes (start/end/fork/join/kill) are now treated as "action" nodes and appear in the list of executed workflow actions.
  • Oozie now supports alternate share libraries. This enables the use of alternate sets of JARs for a given action type. Alternate share libraries can be configured at server level, job level and action level.
  • Oozie filesystem action now supports touchz (to create a file of zero length) and recursive chmod operations.
  • Oozie now supports submission of MapReduce jobs, without having to write a workflow, using the Oozie mapreduce subcommand.
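
The two precedence rules described above for workflow definition 0.4 (the last job-xml occurrence wins, and action-level values override global values) can be illustrated with plain dictionaries. The sketch below is not Oozie code; the function names and the sample property values are hypothetical.

```python
def merge_job_xml(job_xml_configs):
    """Merge properties from several job-xml files, in declaration order.

    If a key occurs in more than one file, the last occurrence wins.
    Hypothetical helper; each config is a plain dict of key/value pairs.
    """
    merged = {}
    for config in job_xml_configs:
        merged.update(config)  # later files overwrite earlier keys
    return merged


def apply_global(global_config, action_config):
    """Combine the global section with one action's own values.

    Values defined at the action level take precedence over global values.
    """
    effective = dict(global_config)
    effective.update(action_config)
    return effective


# Sample values below are made up for illustration.
job_xml = merge_job_xml([
    {"mapred.job.queue.name": "default", "a": "1"},
    {"mapred.job.queue.name": "etl"},
])
action = apply_global(
    {"job-tracker": "jt.example.com:8021", "mapred.compress.map.output": "false"},
    {"mapred.compress.map.output": "true"},
)
```

Here job_xml["mapred.job.queue.name"] ends up as "etl", and the action keeps the global job-tracker value while overriding mapred.compress.map.output with "true".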

Apache Hive

  • CDH4.1 upgrades Hive from version 0.8.1 to version 0.9.0. The new version of Hive includes approximately 150 bug fixes and feature enhancements not found in the previous version.
  • HIVE-2935 adds HiveServer2, an improved version of HiveServer that supports a new Thrift API tailored to JDBC and ODBC clients, Kerberos authentication, and multi-client concurrency. This patch also adds a new JDBC driver designed to run on top of HiveServer2, and a new CLI for HiveServer2 named BeeLine.
  • HIVE-3277 adds MetaStore audit logging for all connection types, both secure and non-secure.
  • HIVE-2957 improves the JDBC driver's support for TIMESTAMP column types.
  • HIVE-3056 adds the metatool utility which facilitates bulk updates of metastore catalog records.

Apache HBase

  • Many bug fixes and robustness enhancements to HBase 0.92.
  • HBASE-5189 adds metrics to track region splits.
  • HBASE-6283 adds an option to region_mover.rb to exclude a list of hosts when unloading.
  • HBASE-6643 enables the shell to accept an encoded region name when compacting or splitting a region.
  • HBASE-6444 exposes the ability to set custom HTTP Request Headers.

Apache Whirr

  • CDH4.1 includes Whirr 0.8.0.
  • On Amazon EC2, cluster compute groups are now supported. See WHIRR-63.
  • There is a new Whirr service for starting a cluster administered by Cloudera Manager. See https://github.com/cloudera/whirr-cm.

Apache Mahout

  • CDH4.1 includes the upstream Mahout version 0.7.
  • The K-Means, Fuzzy K-Means, Canopy, and Dirichlet algorithms have been reimplemented to use ClusterClassificationDriver to refactor clustering with outlier pruning support. See MAHOUT-981, MAHOUT-984, MAHOUT-982, and MAHOUT-983.
  • Cluster classifiers now support outlier removal (MAHOUT-929 and MAHOUT-931).
  • The collections and math APIs are consolidated (MAHOUT-768).
  • Old Naive Bayes implementation is removed (MAHOUT-1010).
  • Watchmaker (for evolutionary/genetic algorithms) is removed (MAHOUT-1012).

Apache Avro