This is the documentation for CDH 4.6.0.
Documentation for other versions is available at Cloudera Documentation.

Deploying MapReduce v1 (MRv1) on a Cluster

This section describes configuration and startup tasks for MRv1 clusters only.

  Important:

Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade performance and may result in an unstable cluster deployment.

Do these tasks after you have configured HDFS:

  1. Configure properties for MRv1 clusters
  2. Configure local storage directories for use by MRv1 daemons
  3. Configure a health check script for DataNode processes
  4. Configure JobTracker Recovery
  5. Deploy the configuration to the entire cluster
  6. Start HDFS
  7. Create the HDFS /tmp directory
  8. Create MapReduce /var directories
  9. Verify the HDFS File Structure
  10. Create and configure the mapred.system.dir directory in HDFS
  11. Start MapReduce
  12. Create a Home Directory for each MapReduce User
  13. Configure the Hadoop daemons to start at boot time
  Note: Running Services

When starting, stopping and restarting CDH components, always use the service (8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM) so as to create a predictable environment in which to administer the service. If you run the scripts in/etc/init.d, any environment variables you have set remain in force, and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).)

Step 1: Configuring Properties for MRv1 Clusters

  Note:

Edit these files in the custom directory you created when you copied the Hadoop configuration. When you have finished, you will push this configuration to all the nodes in the cluster; see Step 4.

 

For instructions on configuring a highly available JobTracker, see Configuring High Availability for the JobTracker (MRv1); you need to configure mapred.job.tracker differently in that case, and you must not use the port number.

Property

Configuration File

Description

mapred.job.tracker

conf/mapred-site.xml

If you plan to run your cluster with MRv1 daemons you need to specify the hostname and (optionally) port of the JobTracker's RPC server, in the form <host>:<port>. See Configuring Ports for CDH4 for the default port. If the value is set to local, the default, the JobTracker runs on demand when you run a MapReduce job; do not try to start the JobTracker yourself in this case. Note: if you specify the host (rather than using local) this must be the hostname (for example mynamenode) not the IP address.

Sample configuration:

mapred-site.xml:

<property>
 <name>mapred.job.tracker</name>
 <value>jobtracker-host.company.com:8021</value>
</property>

Step 2: Configure Local Storage Directories for Use by MRv1 Daemons

For MRv1, you need to configure an additional property in the mapred-site.xml file.

Property

Configuration File Location

Description

mapred.local.dir

mapred-site.xml on each TaskTracker

This property specifies the directories where the TaskTracker will store temporary data and intermediate map output files while running MapReduce jobs. Cloudera recommends that this property specifies a directory on each of the JBOD mount points; for example, /data/1/mapred/local through /data/N/mapred/local.

Sample configuration:

mapred-site.xml on each TaskTracker:

<property>
 <name>mapred.local.dir</name>
 <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local</value>
</property>

After specifying these directories in the mapred-site.xml file, you must create the directories and assign the correct file permissions to them on each node in your cluster.

To configure local storage directories for use by MapReduce:

In the following instructions, local path examples are used to represent Hadoop parameters. The mapred.local.dir parameter is represented by the /data/1/mapred/local, /data/2/mapred/local, /data/3/mapred/local, and /data/4/mapred/local path examples. Change the path examples to match your configuration.

  1. Create the mapred.local.dir local directories:
    $ sudo mkdir -p /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
  2. Configure the owner of the mapred.local.dir directory to be the mapred user:
    $ sudo chown -R mapred:hadoop /data/1/mapred/local /data/2/mapred/local /data/3/mapred/local /data/4/mapred/local
    The correct owner and permissions of these local directories are:

    Owner

    Permissions

    mapred:hadoop

    drwxr-xr-x

Step 3: Configure a Health Check Script for DataNode Processes

In earlier releases, the failure of a single mapred.local.dir caused the MapReduce TaskTracker process to shut down, resulting in the machine not being available to execute tasks. In CDH4, the TaskTracker process will continue to execute tasks as long as it has a single functioning mapred.local.dir available. No configuration change is necessary to enable this behavior.

Because a TaskTracker that has few functioning local directories will not perform well, Cloudera recommends configuring a health script that checks if the DataNode process is running (if configured as described under Configuring DataNodes to Tolerate Local Storage Directory Failure, the DataNode will shut down after the configured number of directory failures). Here is an example health script that exits if the DataNode process is not running:

#!/bin/bash
if ! jps | grep -q DataNode ; then
 echo ERROR: datanode not up
fi

In practice, the dfs.data.dir and mapred.local.dir are often configured on the same set of disks, so a disk failure will result in the failure of both a dfs.data.dir and mapred.local.dir.

See the section titled "Configuring the Node Health Check Script" in the Apache cluster setup documentation for further details.

Step 4: Configure JobTracker Recovery

JobTracker recovery means that jobs that are running when JobTracker fails (for example, because of a system crash or hardware failure) are re-run when the JobTracker is restarted. Any jobs that were running at the time of the failure will be re-run from the beginning automatically.

A recovered job will have the following properties:

  • It will have the same job ID as when it was submitted.
  • It will run under the same user as the original job.
  • It will write to the same output directory as the original job, overwriting any previous output.
  • It will show as RUNNING on the JobTracker web page after you restart the JobTracker.

Enabling JobTracker Recovery

By default JobTracker recovery is off, but you can enable it by setting the property mapreduce.jobtracker.restart.recover to true in mapred-site.xml.

Step 5: Deploy your Custom Configuration to your Entire Cluster

To deploy your configuration to your entire cluster:

  1. Push your custom directory (for example /etc/hadoop/conf.my_cluster) to each node in your cluster; for example:
    $ sudo scp -r /etc/hadoop/conf.my_cluster myuser@myCDHnode-<n>.mycompany.com:/etc/hadoop/conf.my_cluster
  2. Manually set alternatives on each node to point to that directory, as follows.

    To manually set the configuration on Red Hat-compatible systems:

    $ sudo alternatives --verbose --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50 
    $ sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster

    To manually set the configuration on Ubuntu and SLES systems:

    $ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
    $ sudo update-alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster

    For more information on alternatives, see the update-alternatives(8) man page on Ubuntu and SLES systems or the alternatives(8) man page On Red Hat-compatible systems.

Step 6: Start HDFS on Every Node in the Cluster

for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done

Step 7: Create the HDFS /tmp Directory

  Important:

If you do not create /tmp properly, with the right permissions as shown below, you may have problems with CDH components later. Specifically, if you don't create /tmp yourself, another process may create it automatically with restrictive permissions that will prevent your other applications from using it.

Create the /tmp directory after HDFS is up and running, and set its permissions to 1777 (drwxrwxrwt), as follows:

$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
  Note:

If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command>

Step 8: Create MapReduce /var directories

sudo -u hdfs hadoop fs -mkdir -p /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging
sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred

Step 9: Verify the HDFS File Structure

$ sudo -u hdfs hadoop fs -ls -R /

You should see:

drwxrwxrwt   - hdfs supergroup          0 2012-04-19 15:14 /tmp
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs
drwxr-xr-x   - hdfs     supergroup          0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache
drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred
drwxr-xr-x   - mapred   supergroup          0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred
drwxrwxrwt   - mapred   supergroup          0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

Step 10: Create and Configure the mapred.system.dir Directory in HDFS

After you start HDFS and create /tmp, but before you start the JobTracker (see the next step), you must also create the HDFS directory specified by the mapred.system.dir parameter (by default ${hadoop.tmp.dir}/mapred/system and configure it to be owned by the mapred user.

To create the directory in its default location:

$ sudo -u hdfs hadoop fs -mkdir /tmp/mapred/system
$ sudo -u hdfs hadoop fs -chown mapred:hadoop /tmp/mapred/system
  Important:

If you create the mapred.system.dir directory in a different location, specify that path in the conf/mapred-site.xml file.

When starting up, MapReduce sets the permissions for the mapred.system.dir directory to drwx------, assuming the user mapred owns that directory.

Step 11: Start MapReduce

To start MapReduce, start the TaskTracker and JobTracker services

On each TaskTracker system:

$ sudo service hadoop-0.20-mapreduce-tasktracker start

On the JobTracker system:

$ sudo service hadoop-0.20-mapreduce-jobtracker start

Step 12: Create a Home Directory for each MapReduce User

Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:

$ sudo -u hdfs hadoop fs -mkdir  /user/<user>
$ sudo -u hdfs hadoop fs -chown <user> /user/<user>

where <user> is the Linux username of each user.

Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

sudo -u hdfs hadoop fs -mkdir /user/$USER
sudo -u hdfs hadoop fs -chown $USER /user/$USER

Configure the Hadoop Daemons to Start at Boot Time