Using DistCp to Migrate Data between Two Clusters
You can use the DistCp tool on the CDH4 cluster to initiate a copy job that moves the data. When the two clusters run different versions of CDH, run DistCp with hftp:// as the source file system and hdfs:// as the destination file system. HFTP is a read-only file system accessed over HTTP, which is why the job runs on the destination (CDH4) cluster.
Example of a source URI: hftp://namenode-location:50070/basePath
where namenode-location is the hostname of the CDH3 NameNode, as defined by its fs.default.name configuration property, and 50070 is the NameNode's HTTP server port, as defined by the dfs.http.address configuration property.
Example of a destination URI: hdfs://nameservice-id/basePath or hdfs://namenode-location
This refers to the CDH4 NameNode, as defined by its fs.defaultFS configuration property.
In both URIs, basePath is the directory you want to copy, if you need to copy a specific directory rather than the whole file system.
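As an illustration, the two URIs can be assembled from their parts in a small shell sketch. The host name cdh3-namenode, the nameservice cdh4-nameservice, and the /hbase path are placeholders borrowed from the examples in this section, and the variable names are hypothetical. Substitute the values from your own fs.default.name, dfs.http.address, and fs.defaultFS settings.

```shell
#!/bin/sh
# Placeholder values (assumptions); replace with your clusters' settings.
SRC_NAMENODE="cdh3-namenode"     # host from the CDH3 fs.default.name
SRC_HTTP_PORT="50070"            # port from the CDH3 dfs.http.address
DST_NAMESERVICE="cdh4-nameservice"  # authority from the CDH4 fs.defaultFS
BASE_PATH="/hbase"               # directory to copy; use "/" for everything

# Source is read over HFTP (HTTP port); destination is written over HDFS.
SRC_URI="hftp://${SRC_NAMENODE}:${SRC_HTTP_PORT}${BASE_PATH}"
DST_URI="hdfs://${DST_NAMESERVICE}${BASE_PATH}"

echo "source:      ${SRC_URI}"
echo "destination: ${DST_URI}"
```

The script only prints the URIs; it does not contact either cluster.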
The DistCp Command
To see all the options available for the DistCp tool, run it without arguments to print the built-in help:
$ hadoop distcp
To copy the entire file system, run a command such as the following on the CDH4 cluster:
$ hadoop distcp hftp://cdh3-namenode:50070/ hdfs://cdh4-nameservice/
Or use a specific path, such as /hbase, to move only the HBase data:
$ hadoop distcp hftp://cdh3-namenode:50070/hbase hdfs://cdh4-nameservice/hbase
DistCp then submits a regular MapReduce job that performs a file-by-file copy.
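Because the copy runs as a MapReduce job, its parallelism can be bounded and the command can be run again after an interruption. The following sketch assumes that the -m option (maximum number of simultaneous copies) and the -update option (copy only files that are missing or differ at the destination) behave on your release as described in the built-in help. The map count of 20, the host names, and the RUN variable are illustrative assumptions, not recommendations.

```shell
#!/bin/sh
# Sketch of a re-runnable DistCp invocation with a capped number of maps.
# All values below are placeholders (assumptions).
SRC_URI="hftp://cdh3-namenode:50070/hbase"
DST_URI="hdfs://cdh4-nameservice/hbase"
MAX_MAPS=20   # assumed cap passed to -m; tune for your cluster

# -update: skip files already present and matching at the destination.
# -m:      upper bound on the number of simultaneous copy tasks.
DISTCP_CMD="hadoop distcp -update -m ${MAX_MAPS} ${SRC_URI} ${DST_URI}"

# Print the command by default; set RUN=1 to execute it on the CDH4 cluster.
if [ "${RUN:-0}" = "1" ]; then
    $DISTCP_CMD
else
    echo "$DISTCP_CMD"
fi
```

Running the script as is prints the command for review. Running it with RUN=1 on a CDH4 node submits the job.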