This is the documentation for CDH 5.1.0.
Documentation for other versions is available at Cloudera Documentation.

Preparing to Index Data

Complete the following steps in preparation for indexing example data with MapReduce or Flume:

  1. Start a SolrCloud cluster containing two servers (this example uses two shards) as described in Deploying Cloudera Search. Follow that procedure through the Starting Solr in SolrCloud Mode step, verify that both server processes are running, and then return here to continue with the next step.
  2. Generate the configuration files for the collection, including a tweet-specific schema.xml:
    $ solrctl instancedir --generate $HOME/solr_configs3
    $ cp /usr/share/doc/search*/examples/solr-nrt/collection1/conf/schema.xml \
  3. Upload the instance directory to ZooKeeper:
    $ solrctl instancedir --create collection3 $HOME/solr_configs3
  4. Create the new collection:
    $ solrctl collection --create collection3 -s 2
  5. Verify that the collection is live. For example, on localhost, open http://localhost:8983/solr/#/~cloud.
  6. Prepare the configuration layout for use with MapReduce:
    $ cp -r $HOME/solr_configs3 $HOME/collection3
  7. Locate input files suitable for indexing, and make sure that the HDFS input directory exists. This example assumes that you run the following commands as a user $USER who has access to HDFS.
    $ sudo -u hdfs hadoop fs -mkdir -p /user/$USER
    $ sudo -u hdfs hadoop fs -chown $USER:$USER /user/$USER
    $ hadoop fs -mkdir -p /user/$USER/indir
    $ hadoop fs -copyFromLocal \
    /usr/share/doc/search*/examples/test-documents/sample-statuses-*.avro \
    $ hadoop fs -ls /user/$USER/indir
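  8. Ensure that outdir exists in HDFS and that it is empty. If outdir does not exist yet, the rm command reports that the path was not found, which you can ignore: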
    $ hadoop fs -rm -r -skipTrash /user/$USER/outdir
    $ hadoop fs -mkdir /user/$USER/outdir
    $ hadoop fs -ls /user/$USER/outdir
  9. Collect HDFS/MapReduce configuration details. You can download these from Cloudera Manager or use /etc/hadoop, depending on how the Hadoop cluster was installed. This example uses the configuration found in /etc/hadoop/conf.cloudera.mapreduce1. Substitute the correct Hadoop configuration path for your cluster.
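
Because the configuration path in step 9 varies from cluster to cluster, a small helper can make the choice explicit. The following is a minimal sketch, not part of the Cloudera tooling: the function name find_hadoop_conf is hypothetical, the second candidate path /etc/hadoop/conf is an assumption about a package-based installation, and the check for core-site.xml is only a rough test that a directory holds a Hadoop client configuration.

```shell
#!/bin/sh
# Sketch: choose a Hadoop client configuration directory.
# find_hadoop_conf is a hypothetical helper, not a Cloudera or Hadoop command.

find_hadoop_conf() {
    # Print the first argument that is a directory containing core-site.xml.
    # Return non-zero if none of the candidates qualifies.
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -f "$dir/core-site.xml" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate list: the first path comes from this example; the second is an
# assumed fallback. Replace both with the paths that apply to your cluster.
if HADOOP_CONF_DIR=$(find_hadoop_conf \
        /etc/hadoop/conf.cloudera.mapreduce1 \
        /etc/hadoop/conf); then
    export HADOOP_CONF_DIR
    echo "Using Hadoop configuration in $HADOOP_CONF_DIR"
else
    echo "No Hadoop configuration directory found; set the path manually" >&2
fi
```

If a directory is found, it can then be passed to later commands, for example through the hadoop command's --config option. Whether you use that option or the exported variable depends on how you run the indexing jobs.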