This is the documentation for Cloudera Manager 4.8.3.
Documentation for other versions is available at Cloudera Documentation.

Using the Cloudera Manager Java API for Cluster Automation

One of the complexities of Apache Hadoop is the need to deploy clusters of servers, potentially on a regular basis. At Cloudera, which at any time maintains hundreds of test and development clusters in different configurations, this process presents a lot of operational headaches if not done in an automated fashion. This topic describes an approach to cluster automation that works for us, as well as many of our customers and partners.

Cluster Automation Use Cases

At Cloudera engineering, we have a big support matrix: We work on many versions of CDH (multiple release trains, plus things like rolling upgrade testing), and CDH works across a wide variety of OS distributions (RHEL 5 & 6, Ubuntu Precise & Lucid, Debian Squeeze, and SLES 11), and complex configuration combinations — highly available HDFS or simple HDFS, Kerberized or non-secure, using YARN or MR1 as the execution framework, and so on. Clearly, we need an easy way to spin-up a new cluster that has the desired setup, which we can subsequently use for integration, testing, customer support, demos, and so on.

This concept is not new; there are several other examples of Hadoop cluster automation solutions. For example, Yahoo! has its own infrastructure tools, and you can find publicly available Puppet recipes, with various degrees of completeness and maintenance. Furthermore, there are tools that work only with a particular virtualization environment. However, we needed a solution that is more powerful and easier to maintain.

Cloudera's automation system for Hadoop cluster deployment provisions VMs on-demand in our internal cloud. As cool as that capability sounds, it's actually not the most interesting part of the solution. More important is that we can install and configure Hadoop according to precise specifications using a powerful yet simple abstraction — using Cloudera Manager's open source Cloudera Manager API documentation.

This is what our automation system does:

  1. Installs the Cloudera Manager packages on the cluster.
  2. Starts the Cloudera Manager server.
  3. Uses the API to add hosts, install CDH, and define the cluster and its services.
  4. Uses the API to tune heap sizes, set up HDFS HA, turn on Kerberos security and generate keytabs, customize service directories and ports, and so on. Every configuration available in Cloudera Manager is exposed in the API.

The end result is a system that has become an indispensable part of our engineering process. The API also gives access to management features, such as gathering logs and monitoring information, starting and stopping services, polling cluster events, and creating a disaster recovery replication schedule. For example, the API can be used to retrieve logs from HDFS, HBase, or any other service,without needing to know about different log locations. The API can be used to stop any service, without worrying about any additional steps. (HBase needs to be gracefully shutdown.) We use these features extensively in our automated tests. As Cloudera Manager adds support for more services, their setup flows are the same as the existing ones.

Many of our customers and partners have also adopted the Cloudera Manager API for cluster automation:

  • Some OEM and hardware partners, delivering Hadoop-in-a-box appliances, use the API to set up CDH and Cloudera Manager on bare metal in the factory.
  • Some of our high-growth customers are constantly deploying new clusters, and have automated that with a combination of Puppet and the Cloudera Manager API. Puppet does the OS-level provisioning, and the software installation. After that, the Cloudera Manager API sets up the Hadoop services and configures the cluster.
  • Others have found it useful to integrate the API with their reporting and alerting infrastructure. An external script can poll the API for health and metrics information, as well as the stream of events and alerts, to feed into a custom dashboard.

Java API Examples

This example covers the Java API client. (Although Cloudera's internal automation tool uses the Python client today, we plan to move to Java to better work with our other Java-based tools like jclouds.)

To use the Java client, add this dependency to your project's pom.xml:

<project>
  <repositories>
    <repository>
      <id>cdh.repo</id>
      <url>https://repository.cloudera.com/content/groups/cloudera-repos</url>
      <name>Cloudera Repository</name>
    </repository>
    …
  </repositories>
  <dependencies>
    <dependency>
      <groupId>com.cloudera.api</groupId>
      <artifactId>cloudera-manager-api</artifactId>
      <version>4.6.2</version>      <!-- Set to the version you use -->
    </dependency>
    …
  </dependencies>
  ...
</project>

The Java client works like a proxy. It hides from the caller any details about REST, HTTP, and JSON. The entry point is a handle to the root of the API:

RootResourceV4 apiRoot = new ClouderaManagerClientBuilder().withHost("cm.cloudera.com")
.withUsernamePassword("admin", "admin").build().getRootV4();
From the root, you can traverse down to all other resources. (It's called "V4" because the currently Cloudera Manager API version is version 4, but the same builder also returns a V1, V2, or V3 root.) Here is the tree view of some of the key resources and the interesting operations they support:
  • RootResourceV4
    • ClustersResourceV4 - host membership, start cluster
      • ServicesResourceV4 - configuration, get metrics, HA, service commands
        • RolesResource - add roles, get metrics, logs
        • RoleConfigGroupsResource - configuration
      • ParcelsResource - parcel management
  • HostsResource - host management, get metrics
  • UsersResource - user management

Of course, these are all documented in the Javadoc. To give a short concrete example, here is the code to list and start a cluster:

// List of clusters
ApiClusterList clusters = apiRoot.getClustersResource().readClusters(DataView.SUMMARY);
for (ApiCluster cluster : clusters) {
  LOG.info("{}: {}", cluster.getName(), cluster.getVersion());
}

// Start the first cluster
ApiCommand cmd = apiRoot.getClustersResource().startCommand(clusters.get(0).getName());
while (cmd.isActive()) {
   Thread.sleep(100);
   cmd = apiRoot.getCommandsResource().readCommand(cmd.getId());
}
LOG.info("Cluster start {}", cmd.getSuccess() ? "succeeded" : "failed " + cmd.getResultMessage());

To see a full example of cluster deployment using the Java client, see whirr-cm. Specifically, jump straight to CmServerImpl#configure to see the core of the action.

You may find it interesting that the Java client is maintained with very little effort. Using Apache CXF, the client proxy comes free. It figures out the right HTTP call to make by inspecting the JAX-RS annotations in the REST interface, which is the same interface used by the Cloudera Manager API server. Therefore, new API methods are available to the Java client automatically.

Cloudera encourages you to try the Cloudera Manager API, and post your questions and feedback on the Cloudera Manager forums.