CCA-410 Study Guide
Begin Your Journey to Administrator Certification
The best way to study for the certification exams is to take a Cloudera training class. There is a high degree of correlation between Cloudera training classes and Cloudera certification exams. This is partly a function of design and partly the result of research and analysis: we train and test on the skills, tasks, and knowledge we believe are critical to the daily work of a Hadoop administrator. Of course, every training class is slightly different, influenced by the needs of the students and the approach of individual instructors. Every class also occurs at a moment in time, and the CDH/Hadoop ecosystem is dynamic. Thus, if you have taken a Cloudera training class, you should study the course materials and notes, but you should also use these resources to check your understanding and update your skills.
This resource page is a work in progress. Please contribute: if you have exam prep suggestions or ideas, please email them to firstname.lastname@example.org
Recommended Cloudera Training Course: Cloudera Administrator Training for Apache Hadoop
Practice Test: CCA-410 Practice Test Subscription
- CDH4 docs (you can download all of these docs free from Cloudera as PDFs). The CDH4 docs are the most up-to-date resources for administrators. Much changed with CDH4.1 and above, and books can't keep up with the docs.
- Hadoop Tutorial. Cloudera’s free tutorial with demo CDH virtual machines. Note: this tutorial uses the old API and MapReduce v1 (MRv1).
- Cloudera Essentials for Apache Hadoop. Cloudera’s free six-part video webinar explores traditional large-scale computing systems, their limitations, alternative approaches, and how Apache Hadoop addresses particular issues.
- CDH4 Demo VM
- CDH3 Demo VM
- Eric Sammer’s Hadoop Operations: A Guide for Developers and Administrators
- Tom White’s Hadoop: The Definitive Guide, 3rd Edition. Aimed a bit more toward developers but still invaluable for administrators.
- Hadoop File System Shell Guide (note: we don't control apache.org links and as of 11 February 2013, they have been experiencing downtime. You may get a 404 error.)
What’s New in CDH4? A Guide for Previous Attendees of Cloudera Administrator Training for Apache Hadoop
- CDH3 & CDH4 configuration file default settings. You don't need to memorize any of these values for the test; however, you can learn a lot from reading through the defaults. We include CDH3 because we feel it’s instructive to compare default settings.
- CDH3 mapred configuration file default settings
- CDH3 core-site configuration file default settings
- CDH3 hdfs configuration file default settings
- CDH4 mapreduce configuration file default settings
- CDH4 core-site configuration file default settings
- CDH4 hdfs configuration file default settings
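Defaults are overridden per cluster in the site files. As an illustration, here is a minimal hdfs-site.xml that overrides two of the CDH4 defaults listed above (the values and the /tmp path are illustrative for this demo, not recommendations; a real cluster would use /etc/hadoop/conf):

```shell
# Write a demo hdfs-site.xml to a scratch directory. Property names
# are the CDH4/Hadoop 2 names (e.g. dfs.blocksize; CDH3 used the
# older dfs.block.size).
mkdir -p /tmp/hadoop-conf-demo
cat > /tmp/hadoop-conf-demo/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>
EOF
```

Any property not set in a site file silently keeps its default value, which is why reading through the default lists above pays off.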
Exam Sections
These are the current exam sections and the percentage of the exam devoted to each topic.
- HDFS (38%)
- MapReduce (10%)
- Hadoop Cluster Planning (12%)
- Hadoop Cluster Installation and Administration (17%)
- Resource Management (6%)
- Monitoring and Logging (12%)
- Ecosystem (5%)
- Describe the function of all Hadoop Daemons
- Describe the normal operation of an Apache Hadoop cluster, both in data storage and in data processing.
- Identify current features of computing systems that motivate a system like Apache Hadoop.
- Classify major goals of HDFS Design
- Given a scenario, identify appropriate use case for HDFS Federation
- Identify the components and daemons of an HDFS HA-Quorum cluster
- Analyze the role of HDFS security (Kerberos)
- Describe file read and write paths
- Hadoop: The Definitive Guide, 3rd edition: Chapter 3
- Hadoop Operations: Chapter 2
- Hadoop in Practice: Appendix C: HDFS Dissected
- CDH4 High Availability Guide
- CDH4 HA with Quorum-based storage docs
- Apache HDFS High Availability Using the Quorum Journal Manager docs
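The HA docs above cover the haadmin utility; the common checks can be sketched as follows, assuming an HA pair whose NameNode IDs are nn1 and nn2 (the IDs come from dfs.ha.namenodes.&lt;nameservice&gt; in your hdfs-site.xml, so yours may differ):

```shell
# Report whether each NameNode is currently "active" or "standby":
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Initiate a manual failover from nn1 to nn2 (subject to fencing):
hdfs haadmin -failover nn1 nn2
```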
- Understand how to deploy MapReduce v1 (MRv1)
- Understand how to deploy MapReduce v2 (MRv2 / YARN)
- Understand basic design strategy for MapReduce v2 (MRv2)
- Apache YARN docs (note: we don't control apache.org links and as of 11 February 2013, they have been experiencing downtime. You may get a 404 error.)
- CDH4 YARN deployment docs
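On a CDH4 package installation, the MRv2 daemons are started as system services; a sketch, assuming the standard CDH4 package service names (see the deployment docs above for your platform):

```shell
# On the master node:
sudo service hadoop-yarn-resourcemanager start
# On each slave node:
sudo service hadoop-yarn-nodemanager start
# On the node hosting the job history server (one per cluster):
sudo service hadoop-mapreduce-historyserver start
```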
- Principal points to consider in choosing the hardware and operating systems to host an Apache Hadoop cluster.
- Analyze the choices in selecting an OS
- Understand kernel tuning and disk swapping
- Given a scenario and workload pattern, identify a hardware configuration appropriate to the scenario
- Cluster sizing: given a scenario and frequency of execution, identify the specifics for the workload, including CPU, memory, storage, disk I/O
- Disk Sizing and Configuration, including JBOD versus RAID, SANs, virtualization, and disk sizing requirements in a cluster
- Network Topologies: understand network usage in Hadoop (for both HDFS and MapReduce) and propose or identify key network design components for a given scenario
- Hadoop Operations: Chapter 4
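One concrete kernel-tuning check from the objectives above: worker nodes are usually configured so the kernel avoids swapping out JVM heaps. A sketch for inspecting the setting on a Linux node:

```shell
# vm.swappiness ranges from 0 to 100; lower values make the kernel
# less eager to swap application memory. Hadoop nodes are commonly
# set to 0 (or a very low value) so DataNode/TaskTracker JVMs stay
# resident.
cat /proc/sys/vm/swappiness

# To change it persistently (run as root):
#   echo 'vm.swappiness=0' >> /etc/sysctl.conf
#   sysctl -p
```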
- Given a scenario, identify how the cluster will handle disk and machine failures.
- Analyze a logging configuration and logging configuration file format.
- Understand the basics of Hadoop metrics and cluster health monitoring.
- Identify the function and purpose of available tools for cluster monitoring.
- Identify the function and purpose of available tools for managing the Apache Hadoop file system.
- Hadoop Operations, Chapter 5
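Two of the stock file-system management tools referenced above, sketched as they are commonly run against a live cluster:

```shell
# Cluster-wide capacity, per-DataNode usage, and dead-node report:
hdfs dfsadmin -report

# File-system health check: lists missing, corrupt, and
# under-replicated blocks (read-only; repairs nothing by default):
hdfs fsck / -files -blocks

# Space used per directory, human-readable:
hadoop fs -du -h /
```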
- Understand the overall design goals of each of Hadoop's schedulers.
- Given a scenario, determine how the FIFO Scheduler allocates cluster resources.
- Given a scenario, determine how the Fair Scheduler allocates cluster resources.
- Given a scenario, determine how the Capacity Scheduler allocates cluster resources.
- A slide deck from Matei Zaharia, developer of the Fair Scheduler
- Hadoop Operations, Chapter 7
- Capacity Scheduler Apache docs (note: we don't control apache.org links and as of 11 February 2013, they have been experiencing downtime. You may get a 404 error.)
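For the Fair Scheduler specifically, pools and their guaranteed minimums are declared in an allocations file; a minimal MRv1 sketch (pool names and values are illustrative, written to /tmp for the demo — a real cluster points mapred.fairscheduler.allocation.file at the actual location):

```shell
# Demo allocations file: one pool with guaranteed minimum map and
# reduce slots, one pool with no guaranteed minimums.
cat > /tmp/fair-scheduler-demo.xml <<'EOF'
<?xml version="1.0"?>
<allocations>
  <pool name="production">
    <minMaps>50</minMaps>
    <minReduces>25</minReduces>
  </pool>
  <pool name="development">
    <minMaps>0</minMaps>
    <minReduces>0</minReduces>
  </pool>
</allocations>
EOF
```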
- Understand the functions and features of Hadoop’s metric collection abilities
- Analyze the NameNode and JobTracker Web UIs
- Interpret a log4j configuration
- Understand how to monitor the Hadoop Daemons
- Identify and monitor CPU usage on master nodes
- Describe how to monitor swap and memory allocation on all nodes
- Identify how to view and manage Hadoop’s log files
- Interpret a log file
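To practice interpreting a log4j configuration, here is a minimal fragment of the kind found in Hadoop's conf/log4j.properties (written to /tmp for illustration): a root logger at INFO level feeding a daily-rolling file appender.

```shell
cat > /tmp/log4j-demo.properties <<'EOF'
# Root logger: INFO level, output to the DRFA appender
log4j.rootLogger=INFO, DRFA
# DRFA: one log file per day, named from Hadoop's own properties
log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=${hadoop.log.dir}/${hadoop.log.file}
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
EOF
```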
- Understand Ecosystem projects and what you need to do to deploy them on a cluster.
- Hadoop: The Definitive Guide, 3rd Edition: Chapters 11, 12, 14, 15
- Hadoop in Practice: Chapters 10, 11
- Hadoop in Action: Chapters 10, 11
- Apache Hive docs
- Apache Pig docs
- Introduction to Pig Video
- Apache Sqoop docs site
- Aaron Kimball on Sqoop at Hadoop World 2012
- Cloudera Manager Online Training Video Series
- Each project in the Hadoop ecosystem has at least one book devoted to it. The exam does not require deep programming knowledge of Hive, Pig, Sqoop, Cloudera Manager, Flume, etc., but rather an understanding of how those projects contribute to an overall big data ecosystem.
Sample Questions
In which file can you specify the NameNode's heap size?
Answer is A. Read more here: http://wiki.apache.org/hadoop/GettingStartedWithHadoop, particularly in the configuration section
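For reference: the NameNode's heap is configured in hadoop-env.sh, either globally via HADOOP_HEAPSIZE (in MB, applies to all daemons on the node) or per-daemon via HADOOP_NAMENODE_OPTS. A sketch written to /tmp for illustration (the heap values are illustrative, not sizing advice):

```shell
cat > /tmp/hadoop-env-demo.sh <<'EOF'
# Default heap for every Hadoop daemon on this node, in MB:
export HADOOP_HEAPSIZE=1000
# Daemon-specific override: give just the NameNode a 4 GB heap.
export HADOOP_NAMENODE_OPTS="-Xmx4g ${HADOOP_NAMENODE_OPTS}"
EOF
```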
Your cluster is running Hadoop 2.0.0-cdh4.1.1 or above, client machine A writes a 500MB file into HDFS. The block size is 128MB. After client A has written 300MB of the data, client B attempts to read the file. Which of the following is true?
A. A File Not Found exception will be thrown on Client B
B. A File Not Found exception will be thrown on Client A
C. Client B will be able to read 300MB of the data
D. Client B will be able to read 256MB of the data
E. Client B will block and, when client A completes its write, client B will then read all 500MB
Answer is D. While a file is being written, other clients can see only the blocks that have been completely written; after 300MB, the first two 128MB blocks are complete, so client B can read 256MB.
Your Hadoop cluster running MapReduce version one (MRv1) has a total of 100 Map slots and 50 Reduce slots, and is configured to use the FairScheduler. You have two pools: Production and Development. Production has a minMaps setting of 50, and a minReduces setting of 25. No jobs are running on the cluster. You submit a job to the Development pool which needs a total of 200 Map slots. How many simultaneous Map slots will it be allocated?
Answer is B: the job is allocated all 100 Map slots. When no other jobs are running, the FairScheduler lets a pool use the entire cluster, regardless of its minimum share. See the FairScheduler resources listed above.
If the TaskTracker daemon on a slave node crashes, which of the following will occur?
A. All jobs which had tasks running on that node will fail
B. All jobs which had tasks running on that node will be automatically restarted
C. All tasks which were running on that node will pause until the TaskTracker is restarted
D. All tasks which were running on that node will be reallocated to different nodes
Answer is D. See the "How MapReduce Works" chapter of Tom White's Hadoop: The Definitive Guide, which discusses failures at length, including how the JobTracker reschedules tasks when it stops receiving heartbeats from a TaskTracker.
By default, log files for individual tasks in a job are stored:
A. On the TaskTracker's local disk, and in the job's output directory in HDFS
B. On the TaskTracker's local disk only
C. In the job's output directory in HDFS only
D. On the TaskTracker's local disk, and on the JobTracker's local disk
E. On the JobTracker's local disk only
Answer is B. Individual task logs are written to the local disk of the node running the task. A search for Hadoop log file locations turns up numerous blog posts on this; it is also discussed in White's Hadoop: The Definitive Guide and Eric Sammer's Hadoop Operations.
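Concretely, each MRv1 task attempt gets its own directory under the TaskTracker's local log directory. The exact base path varies by distribution and version, and the job and attempt IDs below are made up for illustration:

```shell
# Typical layout on the TaskTracker's local disk:
#   ${HADOOP_LOG_DIR}/userlogs/<job-id>/<attempt-id>/
ls /var/log/hadoop/userlogs/job_201302081000_0001/attempt_201302081000_0001_m_000000_0/
# Each attempt directory typically holds: stdout, stderr, and syslog.
```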
Disclaimer: These exam preparation pages are intended to provide information about the objectives covered by each exam, related resources, and recommended reading and courses. The material contained within these pages is not intended to guarantee a passing score on any exam. Cloudera recommends that a candidate thoroughly understand the objectives for each exam and utilize the resources and training courses recommended on these pages to gain a thorough understanding of the domain of knowledge related to the role the exam evaluates.