
Upgrading to CDH4

Use the instructions that follow to upgrade to CDH4.

  Note: Running Services

When starting, stopping and restarting CDH components, always use the service(8) command rather than running scripts in /etc/init.d directly. This is important because service sets the current working directory to / and removes most environment variables (passing only LANG and TERM) so as to create a predictable environment in which to administer the service. If you run the scripts in /etc/init.d, any environment variables you have set remain in force, and could produce unpredictable results. (If you install CDH from packages, service will be installed as part of the Linux Standard Base (LSB).)
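
For example, to restart a DataNode daemon (a service used later in this procedure), use:

$ sudo service hadoop-hdfs-datanode restart

rather than invoking /etc/init.d/hadoop-hdfs-datanode restart directly.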

Step 1: Back Up Configuration Data and Uninstall Components

  1. If security is enabled, do the following (see the CDH3 Security Guide for more information about CDH3 security):
    1. Put the NameNode into safe mode:
      $ hadoop dfsadmin -safemode enter
    2. Perform a saveNamespace operation:
      $ hadoop dfsadmin -saveNamespace 

      This will result in a new fsimage being written out with no edit log entries.

    3. With the NameNode still in safe mode, shut down all services as instructed below.
  2. For each component you are using, back up configuration data, databases, and other important files, stop the component, then uninstall it. See the following sections for instructions:
      Note:

    At this point, you are only removing the components; do not install the new versions yet.

    CAUTION:
    On Ubuntu systems, make sure you remove HBase before removing ZooKeeper; otherwise your HBase configuration will be deleted. This is because hadoop-hbase depends on hadoop-zookeeper, and so purging hadoop-zookeeper will purge hadoop-hbase.
  3. Make sure the Hadoop services are shut down across your entire cluster by running the following command on each host:
    $ for x in /etc/init.d/hadoop-* ; do sudo $x stop ; done
  4. Check each host to make sure that no processes are running as the hdfs or mapred users. Run the following command as root:
    # ps -aef | grep java
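
    If you manage many hosts, a short loop run from an administration host can help with this check. The following is only a sketch; it assumes passwordless SSH as root and a hypothetical hosts.txt file listing your cluster nodes (the '[j]ava' pattern keeps grep from matching itself):

    # for host in $(cat hosts.txt); do echo "== $host =="; ssh "$host" "ps -aef | grep '[j]ava'"; done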

Step 2: Back up the HDFS Metadata

  Important:

Do this step when you are sure that all Hadoop services have been shut down. It is particularly important that the NameNode service is not running so that you can make a consistent backup.

To back up the HDFS metadata on the NameNode machine:

  Note:
  • Cloudera recommends backing up HDFS metadata on a regular basis, as well as before a major upgrade.
  • dfs.name.dir is deprecated but still works; dfs.namenode.name.dir is preferred. This example uses dfs.name.dir.
  1. Find the location of your dfs.name.dir (or dfs.namenode.name.dir); for example:
    $ grep -C1 dfs.name.dir /etc/hadoop/conf/hdfs-site.xml
    <property>
    <name>dfs.name.dir</name>
    <value>/mnt/hadoop/hdfs/name</value>
    </property>
  2. Back up the directory. The path inside the <value> XML element is the path to your HDFS metadata. If you see a comma-separated list of paths, there is no need to back up all of them; they store the same data. Back up the first directory, for example, by using the following commands:
    $ cd /mnt/hadoop/hdfs/name
    # tar -cvf /root/nn_backup_data.tar .
    ./
    ./current/
    ./current/fsimage
    ./current/fstime
    ./current/VERSION
    ./current/edits
    ./image/
    ./image/fsimage
      Warning:

    If you see a file containing the word lock, the NameNode is probably still running. Repeat the preceding steps, starting by shutting down the Hadoop services.

Step 3: Copy the Hadoop Configuration to the Correct Location and Update Alternatives

For CDH4, Hadoop looks for the cluster configuration files in a different location from the one used in CDH3, so you need to copy the configuration to the new location and reset the alternatives to point to it. Proceed as follows.

On each node in the cluster:

  1. Copy the existing configuration to the new location, for example:
    $ cp -r /etc/hadoop-0.20/conf.my_cluster /etc/hadoop/conf.my_cluster
  2. Update the alternatives, for example:
    $ sudo update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
  3. Verify that the operation succeeded:
    $ sudo alternatives --display hadoop-conf
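
    As an additional quick check (assuming conf.my_cluster is now the highest-priority alternative), the /etc/hadoop/conf symlink should resolve through /etc/alternatives to your custom directory:

    $ readlink -f /etc/hadoop/conf
    /etc/hadoop/conf.my_cluster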

Step 4: Uninstall CDH3 Hadoop

  Warning:

Do not proceed before you have backed up the HDFS metadata, and the files and databases for the individual components, as instructed in the previous steps.

To uninstall CDH3 Hadoop:

Run this command on each host:

On Red Hat-compatible systems:

$ sudo yum remove hadoop-0.20 bigtop-utils

On SLES systems:

$ sudo zypper remove hadoop-0.20 bigtop-utils

On Ubuntu systems:

$ sudo apt-get purge hadoop-0.20 bigtop-utils
  Warning:

If you are upgrading an Ubuntu or Debian system from CDH3u3 or earlier, you must use apt-get purge (rather than apt-get remove) to make sure the re-install succeeds, but be aware that apt-get purge removes all your configuration data. If you have modified any configuration files, DO NOT PROCEED before backing them up.

To uninstall the repository packages, run this command on each host:

On Red Hat-compatible systems:

$ sudo yum remove cloudera-cdh3

On SLES systems:

$ sudo zypper remove cloudera-cdh

On Ubuntu and Debian systems:

$ sudo apt-get remove cdh3-repository
  Important: On Ubuntu and Debian systems, you need to re-create the /usr/lib/hadoop-0.20/ directory after uninstalling CDH3. Make sure you do this before you install CDH4:
$ sudo mkdir -p /usr/lib/hadoop-0.20/

Step 5: Download CDH4

On Red Hat-compatible systems:

  1. Download the CDH4 "1-click Install" Package:
  2. Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).

    For OS Version                          Click this Link

    Red Hat/CentOS/Oracle 5                 Red Hat/CentOS/Oracle 5 link
    Red Hat/CentOS 6 (32-bit)               Red Hat/CentOS 6 link (32-bit)
    Red Hat/CentOS/Oracle 6 (64-bit)        Red Hat/CentOS/Oracle 6 link (64-bit)

  3. Install the RPM.

    For Red Hat/CentOS/Oracle 5:

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm

    For Red Hat/CentOS 6 (32-bit):

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.i386.rpm

    For Red Hat/CentOS/Oracle 6 (64-bit):

    $ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.x86_64.rpm
      Note:

    For instructions on how to add a CDH4 yum repository or build your own CDH4 yum repository, see Installing CDH4 On Red Hat-compatible systems.

  4. (Optionally) add a repository key on each system in the cluster. Add the Cloudera Public GPG Key to your repository by executing one of the following commands:
    • For Red Hat/CentOS/Oracle 5 systems:
      $ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera 
    • For Red Hat/CentOS/Oracle 6 systems:
      $ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera 

On SLES systems:

  1. Download the CDH4 "1-click Install" Package:
  2. Click this link, choose Save File, and save it to a directory to which you have write access (it can be your home directory).
  3. Install the RPM:
    $ sudo rpm -i cloudera-cdh-4-0.x86_64.rpm
      Note:

    For instructions on how to add a repository or build your own repository, see Installing CDH4 on SLES Systems.

  4. Update your system package index by running:
    $ sudo zypper refresh
  5. (Optionally) add a repository key on each system in the cluster. Add the Cloudera Public GPG Key to your repository by executing the following command:
  • For all SLES systems:
    $ sudo rpm --import http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera  

On Ubuntu and Debian systems:

  1. Download the CDH4 "1-click Install" Package:
  2. Click one of the following: this link for a Squeeze system, this link for a Lucid system, or this link for a Precise system.
  3. Install the package. Do one of the following: choose Open with in the download window to use the package manager, or choose Save File, save the package to a directory to which you have write access (it can be your home directory), and install it from the command line; for example:
    $ sudo dpkg -i cdh4-repository_1.0_all.deb
      Note:

    For instructions on how to add a repository or build your own repository, see Installing CDH4 on Ubuntu Systems.

  4. (Optionally) add a repository key on each system in the cluster. Add the Cloudera Public GPG Key to your repository by executing one of the following commands:
    • For Ubuntu Lucid systems:
      $ curl -s http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -
    • For Ubuntu Precise systems:
      $ curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
    • For Debian Squeeze systems:
      $ curl -s http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -

Step 6a: Install CDH4 with MRv1

  Note:

Skip this step and go to Step 6b if you intend to use only YARN.

  1. Install and deploy ZooKeeper.
      Important:

    Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker.

    Follow instructions under ZooKeeper Installation.

  2. Install each type of daemon package on the appropriate system(s), as follows.

    JobTracker host running:
      Red Hat/CentOS compatible:  sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-jobtracker
      SLES:                       sudo zypper clean --all; sudo zypper install hadoop-0.20-mapreduce-jobtracker
      Ubuntu or Debian:           sudo apt-get update; sudo apt-get install hadoop-0.20-mapreduce-jobtracker

    NameNode host running:
      Red Hat/CentOS compatible:  sudo yum clean all; sudo yum install hadoop-hdfs-namenode
      SLES:                       sudo zypper clean --all; sudo zypper install hadoop-hdfs-namenode
      Ubuntu or Debian:           sudo apt-get update; sudo apt-get install hadoop-hdfs-namenode

    Secondary NameNode host (if used) running:
      Red Hat/CentOS compatible:  sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
      SLES:                       sudo zypper clean --all; sudo zypper install hadoop-hdfs-secondarynamenode
      Ubuntu or Debian:           sudo apt-get update; sudo apt-get install hadoop-hdfs-secondarynamenode

    All cluster hosts except the JobTracker, NameNode, and Secondary (or Standby) NameNode hosts, running:
      Red Hat/CentOS compatible:  sudo yum clean all; sudo yum install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode
      SLES:                       sudo zypper clean --all; sudo zypper install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode
      Ubuntu or Debian:           sudo apt-get update; sudo apt-get install hadoop-0.20-mapreduce-tasktracker hadoop-hdfs-datanode

    All client hosts, running:
      Red Hat/CentOS compatible:  sudo yum clean all; sudo yum install hadoop-client
      SLES:                       sudo zypper clean --all; sudo zypper install hadoop-client
      Ubuntu or Debian:           sudo apt-get update; sudo apt-get install hadoop-client

Step 6b: Install CDH4 with YARN

  Note:

Skip this step if you intend to use only MRv1. Directions for installing MRv1 are in Step 6a.

To install CDH4 with YARN:

  Note:

If you are also installing MRv1, you can skip any packages you have already installed in Step 6a.

  1. Install and deploy ZooKeeper.
      Important:

    Cloudera recommends that you install (or update) and start a ZooKeeper cluster before proceeding. This is a requirement if you are deploying high availability (HA) for the NameNode or JobTracker.

    Follow instructions under ZooKeeper Installation.

  2. Install each type of daemon package on the appropriate system(s), as follows.

    Resource Manager host (analogous to MRv1 JobTracker) running:
      Red Hat/CentOS compatible:  sudo yum clean all; sudo yum install hadoop-yarn-resourcemanager
      SLES:                       sudo zypper clean --all; sudo zypper install hadoop-yarn-resourcemanager
      Ubuntu or Debian:           sudo apt-get update; sudo apt-get install hadoop-yarn-resourcemanager

    NameNode host running:
      Red Hat/CentOS compatible:  sudo yum clean all; sudo yum install hadoop-hdfs-namenode
      SLES:                       sudo zypper clean --all; sudo zypper install hadoop-hdfs-namenode
      Ubuntu or Debian:           sudo apt-get update; sudo apt-get install hadoop-hdfs-namenode

    Secondary NameNode host (if used) running:
      Red Hat/CentOS compatible:  sudo yum clean all; sudo yum install hadoop-hdfs-secondarynamenode
      SLES:                       sudo zypper clean --all; sudo zypper install hadoop-hdfs-secondarynamenode
      Ubuntu or Debian:           sudo apt-get update; sudo apt-get install hadoop-hdfs-secondarynamenode

    All cluster hosts except the Resource Manager (analogous to MRv1 TaskTrackers) running:
      Red Hat/CentOS compatible:  sudo yum clean all; sudo yum install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
      SLES:                       sudo zypper clean --all; sudo zypper install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce
      Ubuntu or Debian:           sudo apt-get update; sudo apt-get install hadoop-yarn-nodemanager hadoop-hdfs-datanode hadoop-mapreduce

    One host in the cluster running:
      Red Hat/CentOS compatible:  sudo yum clean all; sudo yum install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
      SLES:                       sudo zypper clean --all; sudo zypper install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver
      Ubuntu or Debian:           sudo apt-get update; sudo apt-get install hadoop-mapreduce-historyserver hadoop-yarn-proxyserver

    All client hosts, running:
      Red Hat/CentOS compatible:  sudo yum clean all; sudo yum install hadoop-client
      SLES:                       sudo zypper clean --all; sudo zypper install hadoop-client
      Ubuntu or Debian:           sudo apt-get update; sudo apt-get install hadoop-client
      Note:

    The hadoop-yarn and hadoop-hdfs packages are installed on each system automatically as dependencies of the other packages.

Step 7: Copy the CDH4 Logging File

Copy over the log4j.properties file to your custom directory on each node in the cluster; for example:

$ cp /etc/hadoop/conf.empty/log4j.properties /etc/hadoop/conf.my_cluster/log4j.properties
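
If you prefer to push this change from a single administration host rather than logging in to each node, a loop such as the following can be used. This is only a sketch; it assumes passwordless SSH, sudo rights on each node, and a hypothetical hosts.txt file listing your cluster nodes:

$ for host in $(cat hosts.txt); do ssh -t "$host" "sudo cp /etc/hadoop/conf.empty/log4j.properties /etc/hadoop/conf.my_cluster/log4j.properties"; done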

Step 7a: (Secure Clusters Only) Set Variables for Secure DataNodes

  Important:

You must do the following if you are upgrading a CDH3 cluster that has Kerberos security enabled. Otherwise, skip this step.

In order to allow DataNodes to start on a secure Hadoop cluster, you must set the following variables on all DataNodes in /etc/default/hadoop-hdfs-datanode.

export HADOOP_SECURE_DN_USER=hdfs
export HADOOP_SECURE_DN_PID_DIR=/var/lib/hadoop-hdfs
export HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop-hdfs
export JSVC_HOME=/usr/lib/bigtop-utils/
  Note:

Depending on the version of Linux you are using, you may not have the /usr/lib/bigtop-utils directory on your system. If that is the case, set the JSVC_HOME variable to the /usr/libexec/bigtop-utils directory by using this command: export JSVC_HOME=/usr/libexec/bigtop-utils
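
To confirm that the variables are in place on a DataNode, you can inspect the defaults file; for example:

$ grep -E 'HADOOP_SECURE_DN|JSVC_HOME' /etc/default/hadoop-hdfs-datanode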

Step 8: Upgrade the HDFS Metadata

  1. To upgrade the HDFS metadata, run the following command on the NameNode:
    $ sudo service hadoop-hdfs-namenode upgrade
      Note:

    The NameNode upgrade process can take a while depending on how many files you have.

    You can watch the progress of the upgrade by running:

    $ sudo tail -f /var/log/hadoop-hdfs/hadoop-hdfs-namenode-<hostname>.log 

    Look for a line that confirms the upgrade is complete, such as: /var/lib/hadoop-hdfs/cache/hadoop/dfs/<name> is complete

  2. Start up the DataNodes:

    On each DataNode:

    $ sudo service hadoop-hdfs-datanode start
  3. Wait for NameNode to exit safe mode, and then start the Secondary NameNode (if used) and complete the cluster upgrade.
    1. To check that the NameNode has exited safe mode, look for messages in the log file, or the NameNode's web interface, that say "...no longer in safe mode." (A command-line check is shown after this list.)
    2. To start the Secondary NameNode (if used), enter the following command on the Secondary NameNode host:
      $ sudo service hadoop-hdfs-secondarynamenode start
    3. To complete the cluster upgrade, follow the remaining steps below.
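
You can also query the NameNode's safe mode state directly from the command line; for example, on a cluster without Kerberos enabled (with Kerberos, kinit as the hdfs principal first):

$ sudo -u hdfs hdfs dfsadmin -safemode get
Safe mode is OFF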

Step 9: Create the HDFS /tmp Directory

  Important:

If you do not create /tmp properly, with the right permissions as shown below, you may have problems with CDH components later. Specifically, if you don't create /tmp yourself, another process may create it automatically with restrictive permissions that will prevent your other applications from using it.

Create the /tmp directory after HDFS is up and running, and set its permissions to 1777 (drwxrwxrwt), as follows:

$ sudo -u hdfs hadoop fs -mkdir /tmp
$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
  Note:

If Kerberos is enabled, do not use commands in the form sudo -u <user> <command>; they will fail with a security error. Instead, use the following commands: $ kinit <user> (if you are using a password) or $ kinit -kt <keytab> <principal> (if you are using a keytab) and then, for each command executed by this user, $ <command>

Step 10: Start MapReduce (MRv1) or YARN

You are now ready to start and test MRv1 or YARN.

Step 10a: Start MapReduce (MRv1)

  Important:

Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade your performance and may result in an unstable MapReduce cluster deployment. Steps 10a and 10b are mutually exclusive.

After you have verified HDFS is operating correctly, you are ready to start MapReduce. On each TaskTracker system:

$ sudo service hadoop-0.20-mapreduce-tasktracker start

On the JobTracker system:

$ sudo service hadoop-0.20-mapreduce-jobtracker start

Verify that the JobTracker and TaskTracker started properly.

$ sudo jps | grep Tracker

If the permissions of directories are not configured correctly, the JobTracker and TaskTracker processes start and immediately fail. If this happens, check the JobTracker and TaskTracker logs and set the permissions correctly.
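
The exact file names depend on your host names and logging settings, but by default (assuming you have not relocated your logs) the MRv1 daemon logs are written under /var/log/hadoop-0.20-mapreduce; for example, to look at the most recent JobTracker log entries:

$ sudo tail -n 50 /var/log/hadoop-0.20-mapreduce/*jobtracker*.log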

Verify basic cluster operation for MRv1.

At this point your cluster is upgraded and ready to run jobs. Before running your production jobs, verify basic cluster operation by running an example from the Apache Hadoop web site.

  1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
    $ sudo -u hdfs hadoop fs -mkdir /user/joe
    $ sudo -u hdfs hadoop fs -chown joe /user/joe

    Do the following steps as the user joe.

  2. Make a directory in HDFS called input and copy some XML files into it by running the following commands:
    $ hadoop fs -mkdir input
    $ hadoop fs -put /etc/hadoop/conf/*.xml input
    $ hadoop fs -ls input
    Found 3 items:
    -rw-r--r--   1 joe supergroup       1348 2012-02-13 12:21 input/core-site.xml
    -rw-r--r--   1 joe supergroup       1913 2012-02-13 12:21 input/hdfs-site.xml
    -rw-r--r--   1 joe supergroup       1001 2012-02-13 12:21 input/mapred-site.xml
  3. Run an example Hadoop job to grep with a regular expression in your input data.
    $ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'
  4. After the job completes, you can find the output in the HDFS directory named output because you specified that output directory to Hadoop.
    $ hadoop fs -ls
    Found 2 items
    drwxr-xr-x   - joe supergroup  0 2009-08-18 18:36 /user/joe/input
    drwxr-xr-x   - joe supergroup  0 2009-08-18 18:38 /user/joe/output

    You can see that there is a new directory called output.

  5. List the output files.
    $ hadoop fs -ls output
    Found 3 items
    drwxr-xr-x  -  joe supergroup     0 2009-02-25 10:33   /user/joe/output/_logs
    -rw-r--r--  1  joe supergroup  1068 2009-02-25 10:33   /user/joe/output/part-00000
    -rw-r--r--  1  joe supergroup     0 2009-02-25 10:33   /user/joe/output/_SUCCESS
  6. Read the results in the output file; for example:
    $ hadoop fs -cat output/part-00000 | head
    1       dfs.datanode.data.dir
    1       dfs.namenode.checkpoint.dir
    1       dfs.namenode.name.dir
    1       dfs.replication
    1       dfs.safemode.extension
    1       dfs.safemode.min.datanodes

    You have now confirmed your cluster is successfully running CDH4.

      Important:

    If you have client hosts, make sure you also update them to CDH4, and upgrade the components running on those clients as well.

Step 10b: Start MapReduce with YARN

  Important:

Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade your performance and may result in an unstable MapReduce cluster deployment. Steps 10a and 10b are mutually exclusive.

Before deciding to deploy YARN, make sure you read the discussion under New Features.

After you have verified HDFS is operating correctly, you are ready to start YARN. First, create directories and set the correct permissions.

For more information see Deploying MapReduce v2 (YARN) on a Cluster.

Create a history directory and set permissions; for example:

$ sudo -u hdfs hadoop fs -mkdir /user/history
$ sudo -u hdfs hadoop fs -chmod -R 1777 /user/history
$ sudo -u hdfs hadoop fs -chown yarn /user/history

Create the /var/log/hadoop-yarn directory and set ownership:

$ sudo -u hdfs hadoop fs -mkdir /var/log/hadoop-yarn
$ sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn

You need to create this directory because it is the parent of /var/log/hadoop-yarn/apps, which is explicitly configured in yarn-site.xml.

Verify the directory structure, ownership, and permissions:

$ sudo -u hdfs hadoop fs -ls -R /

You should see:

drwxrwxrwt   - hdfs supergroup          0 2012-04-19 14:31 /tmp
drwxr-xr-x   - hdfs supergroup          0 2012-05-31 10:26 /user
drwxrwxrwt   - yarn supergroup          0 2012-04-19 14:31 /user/history
drwxr-xr-x   - hdfs   supergroup        0 2012-05-31 15:31 /var
drwxr-xr-x   - hdfs   supergroup        0 2012-05-31 15:31 /var/log
drwxr-xr-x   - yarn   mapred            0 2012-05-31 15:31 /var/log/hadoop-yarn

To start YARN, start the ResourceManager and NodeManager services:

  Note:

Make sure you always start ResourceManager before starting NodeManager services.

On the ResourceManager system:

$ sudo service hadoop-yarn-resourcemanager start

On each NodeManager system (typically the same ones where DataNode service runs):

$ sudo service hadoop-yarn-nodemanager start

To start the MapReduce JobHistory Server:

On the MapReduce JobHistory Server system:

$ sudo service hadoop-mapreduce-historyserver start

For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop in a YARN installation, set the HADOOP_MAPRED_HOME environment variable as follows:

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
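
To avoid setting this in every new shell session, one option (an assumption about your users' shell setup, not a CDH requirement) is to append the export to each such user's shell profile; for example:

$ echo 'export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce' >> ~/.bashrc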

Verify basic cluster operation for YARN.

At this point your cluster is upgraded and ready to run jobs. Before running your production jobs, verify basic cluster operation by running an example from the Apache Hadoop web site.

  Note:

For important configuration information, see Deploying MapReduce v2 (YARN) on a Cluster.

  1. Create a home directory on HDFS for the user who will be running the job (for example, joe):
    $ sudo -u hdfs hadoop fs -mkdir /user/joe
    $ sudo -u hdfs hadoop fs -chown joe /user/joe

    Do the following steps as the user joe.

  2. Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:
    $ hadoop fs -mkdir input
    $ hadoop fs -put /etc/hadoop/conf/*.xml input
    $ hadoop fs -ls input
    Found 3 items:
    -rw-r--r--   1 joe supergroup       1348 2012-02-13 12:21 input/core-site.xml
    -rw-r--r--   1 joe supergroup       1913 2012-02-13 12:21 input/hdfs-site.xml
    -rw-r--r--   1 joe supergroup       1001 2012-02-13 12:21 input/mapred-site.xml
  3. Set HADOOP_MAPRED_HOME for user joe:
    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  4. Run an example Hadoop job to grep with a regular expression in your input data.
    $ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
  5. After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.
    $ hadoop fs -ls
    Found 2 items
    drwxr-xr-x   - joe supergroup  0 2009-08-18 18:36 /user/joe/input
    drwxr-xr-x   - joe supergroup  0 2009-08-18 18:38 /user/joe/output23

    You can see that there is a new directory called output23.

  6. List the output files.
    $ hadoop fs -ls output23
    Found 2 items
    -rw-r--r--  1  joe supergroup     0 2009-02-25 10:33   /user/joe/output23/_SUCCESS
    -rw-r--r--  1  joe supergroup  1068 2009-02-25 10:33   /user/joe/output23/part-r-00000
  7. Read the results in the output file.
    $ hadoop fs -cat output23/part-r-00000 | head
    1    dfs.safemode.min.datanodes
    1    dfs.safemode.extension
    1    dfs.replication
    1    dfs.permissions.enabled
    1    dfs.namenode.name.dir
    1    dfs.namenode.checkpoint.dir
    1    dfs.datanode.data.dir

You have now confirmed your cluster is successfully running CDH4.

  Important:

If you have client hosts, make sure you also update them to CDH4, and upgrade the components running on those clients as well.

Step 11: Set the Sticky Bit

For security reasons Cloudera strongly recommends you set the sticky bit on directories if you have not already done so.

The sticky bit prevents anyone except the superuser, directory owner, or file owner from deleting or moving the files within a directory. (Setting the sticky bit for a file has no effect.) Do this for directories such as /tmp. (For instructions on creating /tmp and setting its permissions, see these instructions).
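
For example, to set the sticky bit on an existing HDFS directory such as /tmp (already covered if you created /tmp with the permissions shown in Step 9):

$ sudo -u hdfs hadoop fs -chmod 1777 /tmp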

Step 12: Re-Install CDH4 Components

To install the CDH4 components, see the following sections:

  • Flume. For more information, see "Flume Installation" in this guide.
  • Sqoop. For more information, see "Sqoop Installation" in this guide.
  • Sqoop 2. For more information, see "Sqoop 2 Installation" in this guide.
  • HCatalog. For more information, see "Installing and Using HCatalog" in this guide.
  • Hue. For more information, see "Hue Installation" in this guide.
  • Pig. For more information, see "Pig Installation" in this guide.
  • Oozie. For more information, see "Oozie Installation" in this guide.
  • Hive. For more information, see "Hive Installation" in this guide.
  • HBase. For more information, see "HBase Installation" in this guide.
  • ZooKeeper. For more information, see "ZooKeeper Installation" in this guide.
  • Whirr. For more information, see "Whirr Installation" in this guide.
  • Snappy. For more information, see "Snappy Installation" in this guide.
  • Mahout. For more information, see "Mahout Installation" in this guide.

Step 13: Apply Configuration File Changes

  Important:

During uninstall, the package manager renames any configuration files you have modified from <file> to <file>.rpmsave. During re-install, the package manager creates a new <file> with applicable defaults. You are responsible for applying any changes captured in the original CDH3 configuration file to the new CDH4 configuration file. In the case of Ubuntu and Debian upgrades, a file will not be installed if there is already a version of that file on the system, and you will be prompted to resolve conflicts; for details, see Automatic handling of configuration files by dpkg.

For example, if you have modified your CDH3 zoo.cfg configuration file (/etc/zookeeper.dist/zoo.cfg), RPM uninstall and re-install (using yum remove) renames and preserves a copy of your modified zoo.cfg as /etc/zookeeper.dist/zoo.cfg.rpmsave. You should compare this to the new /etc/zookeeper/conf/zoo.cfg and resolve any differences that should be carried forward (typically where you have changed property value defaults). Do this for each component you upgrade to CDH4.
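
A simple way to see which changes need to be carried forward is to diff the preserved file against the new default; for example, for the ZooKeeper configuration mentioned above:

$ diff /etc/zookeeper.dist/zoo.cfg.rpmsave /etc/zookeeper/conf/zoo.cfg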

Step 14: Finalize the HDFS Metadata Upgrade

To finalize the HDFS metadata upgrade you began earlier in this procedure, proceed as follows:

  1. Make sure you are satisfied that the CDH4 upgrade has succeeded and everything is running smoothly. This could take a matter of days, or even weeks.
      Warning:

    Do not proceed until you are sure you are satisfied with the new deployment. Once you have finalized the HDFS metadata, you cannot revert to an earlier version of HDFS.

      Note:

    If you need to restart the NameNode during this period (after having begun the upgrade process, but before you've run finalizeUpgrade) simply restart your NameNode without the -upgrade option.

  2. Finalize the HDFS metadata upgrade: use one of the following commands, depending on whether Kerberos is enabled (see Configuring Hadoop Security in CDH4).
    • If Kerberos is enabled:
      $ kinit -kt /path/to/hdfs.keytab hdfs/<fully.qualified.domain.name@YOUR-REALM.COM> && hdfs dfsadmin -finalizeUpgrade
    • If Kerberos is not enabled:
      $ sudo -u hdfs hdfs dfsadmin -finalizeUpgrade
      Note:

    After the metadata upgrade completes, the previous/ and blocksBeingWritten/ directories in the DataNodes' data directories aren't cleared until the DataNodes are restarted.
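
    If you want those directories cleaned up promptly after finalizing, restart the DataNode service on each DataNode host; for example:

    $ sudo service hadoop-hdfs-datanode restart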