CCA-410 Study Guide

Begin Your Journey to Administrator Certification

The best way to study for the certification exams is to take a Cloudera training class. There is a high degree of correlation between Cloudera training classes and Cloudera certification exams. This is partially a function of design and partially the result of research and analysis: we train and test on the skills, tasks, and knowledge we believe critical to the daily work of a Hadoop administrator. Of course, every training class is slightly different, influenced by the needs of the students and the approach of the individual instructor. Also, every training class occurs at a moment in time, and the CDH/Hadoop ecosystem is dynamic. Thus, if you have taken a Cloudera training class, you should study the course materials and notes, but you should also use these resources to check your understanding and update your skills.

This resource page is an ongoing work in progress. Please contribute: if you have exam preparation suggestions or ideas, email them to certification@cloudera.com.


Recommended Cloudera Training Course

Cloudera Administrator Training for Apache Hadoop

Practice Test

CCA-410 Practice Test Subscription

General Resources


Exam Sections

These are the current exam sections and the percentage of the exam devoted to these topics.
  1. HDFS (38%)
  2. MapReduce (10%)
  3. Hadoop Cluster Planning (12%)
  4. Hadoop Cluster Installation and Administration (17%)
  5. Resource Management (6%)
  6. Monitoring and Logging (12%)
  7. The Hadoop Ecosystem (5%)

1. HDFS (38%)

Objectives

  • Describe the function of all Hadoop daemons
  • Describe the normal operation of an Apache Hadoop cluster, both in data storage and in data processing
  • Identify current features of computing systems that motivate a system like Apache Hadoop
  • Classify major goals of HDFS design
  • Given a scenario, identify an appropriate use case for HDFS Federation
  • Identify the components and daemons of an HDFS HA-Quorum cluster (see the configuration sketch after this list)
  • Analyze the role of HDFS security (Kerberos)
  • Describe file read and write paths
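
For illustration, here is a minimal hdfs-site.xml sketch for an HA-Quorum deployment. The nameservice ID and hostnames are hypothetical, but the properties show the daemons involved: two NameNodes sharing an edit log through a quorum of JournalNodes, with automatic failover handled by ZooKeeper Failover Controllers (ZKFCs).

    <!-- hdfs-site.xml (sketch): HA-Quorum nameservice; hostnames are hypothetical -->
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>nn1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>nn2.example.com:8020</value>
    </property>
    <!-- shared edit log stored on a quorum of JournalNode daemons -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
    </property>
    <!-- enables the ZKFC daemon that triggers automatic failover -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>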

Study Resources


2. MapReduce (10%)

Objectives

  • Understand how to deploy MapReduce v1 (MRv1)
  • Understand how to deploy MapReduce v2 (MRv2 / YARN)
  • Understand the basic design strategy for MapReduce v2 (MRv2); see the configuration sketch after this list
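
As a concrete illustration (a sketch, not a complete deployment), the mapred-site.xml property below selects which MapReduce runtime the cluster uses: MRv1 additionally requires the JobTracker and TaskTracker daemons, while MRv2 runs jobs on the YARN ResourceManager and NodeManagers.

    <!-- mapred-site.xml (sketch): selects the MapReduce runtime -->
    <!-- "classic" = MRv1 (JobTracker/TaskTracker); "yarn" = MRv2 on YARN -->
    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>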

Study Resources


3. Hadoop Cluster Planning (12%)

Objectives

  • Identify the principal points to consider in choosing the hardware and operating system to host an Apache Hadoop cluster
  • Analyze the choices in selecting an OS
  • Understand kernel tuning and disk swapping (see the vm.swappiness sketch after this list)
  • Given a scenario and workload pattern, identify a hardware configuration appropriate to the scenario
  • Cluster sizing: given a scenario and frequency of execution, identify the specifics for the workload, including CPU, memory, storage, and disk I/O
  • Disk sizing and configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster
  • Network topologies: understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario
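
As an example of kernel tuning, administrators commonly lower vm.swappiness so the OS avoids swapping out Hadoop daemon memory. This is a sketch; the appropriate value depends on the kernel version and workload, with values in the 0-10 range commonly recommended for worker nodes.

    # apply immediately
    sysctl -w vm.swappiness=0
    # persist across reboots
    echo "vm.swappiness = 0" >> /etc/sysctl.conf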

Study Resources

  • Hadoop Operations: Chapter 4

4. Hadoop Cluster Installation and Administration (17%)

Objectives

  • Given a scenario, identify how the cluster will handle disk and machine failures
  • Analyze a logging configuration and the logging configuration file format
  • Understand the basics of Hadoop metrics and cluster health monitoring
  • Identify the function and purpose of available tools for cluster monitoring
  • Identify the function and purpose of available tools for managing the Apache Hadoop file system (see the command sketch after this list)
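
For example, a few of the built-in commands commonly used to inspect cluster and file system health (a sketch; output details vary by version):

    hdfs dfsadmin -report         # per-DataNode capacity, usage, and liveness
    hdfs dfsadmin -safemode get   # check whether the NameNode is in safe mode
    hdfs fsck / -files -blocks    # block-level health report for the file system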

Study Resources

  • Hadoop Operations, Chapter 5

5. Resource Management (6%)

Objectives

  • Understand the overall design goals of each of Hadoop's schedulers
  • Given a scenario, determine how the FIFO Scheduler allocates cluster resources
  • Given a scenario, determine how the Fair Scheduler allocates cluster resources (a sample allocation file appears after this list)
  • Given a scenario, determine how the Capacity Scheduler allocates cluster resources
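
For illustration, here is a minimal sketch of an MRv1 Fair Scheduler allocation file defining two pools like those in the sample question near the end of this guide; the pool names and minimum shares are assumptions.

    <?xml version="1.0"?>
    <!-- Fair Scheduler allocation file (sketch) -->
    <allocations>
      <pool name="production">
        <!-- guaranteed minimum map and reduce slots for this pool -->
        <minMaps>50</minMaps>
        <minReduces>25</minReduces>
      </pool>
      <pool name="development">
        <!-- no minimum share: receives a fair share of whatever is unused -->
      </pool>
    </allocations>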

Study Resources


6. Monitoring and Logging (12%)

Objectives

  • Understand the functions and features of Hadoop’s metrics collection capabilities
  • Analyze the NameNode and JobTracker Web UIs
  • Interpret a log4j configuration (a sample appears after this list)
  • Understand how to monitor the Hadoop daemons
  • Identify and monitor CPU usage on master nodes
  • Describe how to monitor swap and memory allocation on all nodes
  • Identify how to view and manage Hadoop’s log files
  • Interpret a log file
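
To practice interpreting a log4j configuration, the lines below sketch the kind of entries found in Hadoop's stock log4j.properties: a root logger level and a rolling file appender (abbreviated here; the exact defaults vary by release).

    # log4j.properties (sketch)
    hadoop.root.logger=INFO,RFA
    log4j.rootLogger=${hadoop.root.logger}
    # rolling file appender: caps file size and keeps a fixed number of backups
    log4j.appender.RFA=org.apache.log4j.RollingFileAppender
    log4j.appender.RFA.File=${hadoop.log.dir}/${hadoop.log.file}
    log4j.appender.RFA.MaxFileSize=256MB
    log4j.appender.RFA.MaxBackupIndex=10
    log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
    log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n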

Study Resources


7. The Hadoop Ecosystem (5%)

Objectives

  • Understand Ecosystem projects and what you need to do to deploy them on a cluster.

Study Resources


Sample Questions

Question

In which file can you specify the NameNode's heap size?
A. hadoop-env.sh
B. hdfs-site.xml
C. hdfs-core.xml
D. namenode.properties

Answer is A. Read more at http://wiki.apache.org/hadoop/GettingStartedWithHadoop, particularly the configuration section.
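
To illustrate (the values here are hypothetical), heap settings live in hadoop-env.sh as environment variables:

    # hadoop-env.sh (sketch)
    export HADOOP_HEAPSIZE=1000   # default heap for all daemons, in MB
    # JVM options appended for the NameNode only; a later -Xmx overrides the default
    export HADOOP_NAMENODE_OPTS="-Xmx4g ${HADOOP_NAMENODE_OPTS}"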

Question

Your cluster is running Hadoop 2.0.0-cdh4.1.1 or later. Client machine A writes a 500MB file into HDFS. The block size is 128MB. After client A has written 300MB of the data, client B attempts to read the file. Which of the following is true?
A. A File Not Found exception will be thrown on Client B
B. A File Not Found exception will be thrown on Client A
C. Client B will be able to read 300MB of the data
D. Client B will be able to read 256MB of the data
E. Client B will block and, when client A completes its write, client B will then read all 500MB

Answer is D. A reader sees only the blocks the writer has completed: after 300MB, two full 128MB blocks (256MB) are finished and visible, while the third block is still being written.

Question

Your Hadoop cluster running MapReduce version one (MRv1) has a total of 100 Map slots and 50 Reduce slots, and is configured to use the FairScheduler. You have two pools: Production and Development. Production has a minMaps setting of 50, and a minReduces setting of 25. No jobs are running on the cluster. You submit a job to the Development pool which needs a total of 200 Map slots. How many simultaneous Map slots will it be allocated?
A. 200
B. 100
C. 75
D. 50

Answer is B. With no other jobs running, the Fair Scheduler gives the only active pool all available map slots, so the Development job receives all 100 even though it has no minimum share; Production's minimums matter only when Production jobs are running. See the Fair Scheduler allocation file sketch in the Resource Management section above.

Question

If the TaskTracker daemon on a slave node crashes, which of the following will occur?
A. All jobs which had tasks running on that node will fail
B. All jobs which had tasks running on that node will be automatically restarted
C. All tasks which were running on that node will pause until the TaskTracker is restarted
D. All tasks which were running on that node will be reallocated to different nodes

Answer is D. See the "How MapReduce Works" chapter of Tom White's Hadoop: The Definitive Guide, which discusses failures at length, including how the JobTracker reschedules tasks on other nodes when it stops receiving heartbeats from a TaskTracker.

Question

By default, log files for individual tasks in a job are stored:
A. On the TaskTracker's local disk, and in the job's output directory in HDFS
B. On the TaskTracker's local disk only
C. In the job's output directory in HDFS only
D. On the TaskTracker's local disk, and on the JobTracker's local disk
E. On the JobTracker's local disk only

Answer is B. Individual task logs are written to the local disk of the node that ran the task; a search on Hadoop task logs turns up numerous blog posts on the topic, and it is also discussed in White's Hadoop: The Definitive Guide and Eric Sammer's Hadoop Operations.
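
As a sketch of where to look (the exact path depends on the distribution and HADOOP_LOG_DIR; the job and attempt IDs below are hypothetical), each task attempt gets its own directory under the TaskTracker's userlogs directory:

    ls ${HADOOP_LOG_DIR}/userlogs/job_201301011234_0001/attempt_201301011234_0001_m_000000_0/
    # typical contents: stdout  stderr  syslog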


Disclaimer: These exam preparation pages are intended to provide information about the objectives covered by each exam, related resources, and recommended reading and courses. The material contained within these pages is not intended to guarantee a passing score on any exam. Cloudera recommends that a candidate thoroughly understand the objectives for each exam and utilize the resources and training courses recommended on these pages to gain a thorough understanding of the domain of knowledge related to the role the exam evaluates.