Cloudera CCA-470 Dumps

Cloudera
Cloudera Certified Administrator for Apache Hadoop CDH4 Upgrade (CCAH)

Questions & Answers for Cloudera CCA-470

Showing 1-15 of 86 Questions

Question #1

For each job, the Hadoop framework generates task log files. Where are Hadoop's task log
files stored?

A. Cached on the local disk of the slave node running the task, then purged immediately upon task completion.

B. Cached on the local disk of the slave node running the task, then copied into HDFS.

C. In HDFS, in the directory of the user who generates the job.

D. On the local disk of the slave node running the task.

Explanation: Job Statistics
These logs are created by the jobtracker. The jobtracker writes runtime statistics from jobs to
these files. Those statistics include task attempts, time spent shuffling, input splits given to
task attempts, start times of task attempts, and other information.
The statistics files are named:
<hostname>_<epoch-of-jobtracker-start>_<job-id>_<job-name>
where <hostname> is the hostname of the machine creating these logs,
<epoch-of-jobtracker-start> is the number of milliseconds that had elapsed since the Unix
Epoch when the jobtracker daemon was started, <job-id> is the job ID, and <job-name> is
the name of the job.
For example:
ec2-72-44-61-184.compute-1.amazonaws.com_1250641772616_job_200908190029_0002_hadoop_test-mini-mr
These logs are not rotated. You can clear these logs periodically without affecting Hadoop.
However, consider archiving the logs if they are of interest in the job development process.
Make sure you do not move or delete a file that is being written to by a running job.
Individual statistics logs are created for each job that is submitted to the cluster. The size of
each log file varies. Jobs with more tasks produce larger files.
Reference: Apache Hadoop Log Files: Where to find them in CDH, and what info they
contain
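
Separately from the jobtracker statistics files described above, the per-task logs themselves live on the slave node that ran the task attempt, which is what answer D describes. As a hedged illustration only (the log directory, job ID, and attempt ID below are assumptions based on a typical CDH4 MRv1 layout, not taken from the question):

# on the slave node that executed the task attempt
ls /var/log/hadoop-0.20-mapreduce/userlogs/job_201303101230_0001/attempt_201303101230_0001_m_000000_0/
# typically contains: stderr  stdout  syslog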

Question #2

How must you format the underlying filesystem of your Hadoop cluster's slave nodes
running on Linux?

A. They may be formatted as any Linux filesystem

B. They must be formatted as HDFS

C. They must be formatted as either ext3 or ext4

D. They must not be formatted; HDFS will format the filesystem automatically

Explanation: The Hadoop Distributed File System is platform independent and can
function on top of any underlying file system and operating system. Linux offers a variety
of file system choices, each with caveats that have an impact on HDFS.
As a general best practice, if you are mounting disks solely for Hadoop data, mount them
with the noatime option (which disables access-time updates). This speeds up file reads.
There are three Linux file system options that are popular to choose from:
Ext3
Ext4
XFS
Yahoo uses the ext3 file system for its Hadoop deployments. ext3 is also the default
filesystem choice for many popular Linux OS flavours. Since HDFS on ext3 has been
publicly tested on Yahoo's clusters, it makes for a safe choice for the underlying file system.
ext4 is the successor to ext3. ext4 has better performance with large files. ext4 also
introduced delayed allocation of data, which adds a bit more risk with unplanned server
outages while decreasing fragmentation and improving performance.
XFS offers better disk space utilization than ext3 and has much quicker disk formatting
times than ext3. This means that it is quicker to get started with a data node using XFS.
Reference: Hortonworks, Linux File Systems for HDFS
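
As a sketch of the mount-option advice above (the device name, mount point, and filesystem choice are assumptions for illustration only):

# /etc/fstab entry for a disk dedicated to Hadoop data, mounted with noatime
/dev/sdb1  /data/1  ext4  defaults,noatime  0 0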

Question #3

The Hadoop-provided web interfaces can be used for all of the following EXCEPT: (choose 1)

A. Keeping track of the number of files and directories stored in HDFS.

B. Keeping track of jobs running on the cluster.

C. Browsing files in HDFS.

D. Keeping track of tasks running on each individual slave node.

E. Keeping track of processor and memory utilization on each individual slave node.

Question #4

Your existing Hadoop cluster has 30 slave nodes, each of which has 4 x 2TB hard drives.
You plan to add another 10 nodes. How much disk space can your new nodes contain?

A. The new nodes must all contain 8TB of disk space, but it does not matter how the disks are configured

B. The new nodes cannot contain more than 8TB of disk space

C. The new nodes can contain any amount of disk space

D. The new nodes must all contain 4 x 2TB hard drives

Question #5

Your cluster implements HDFS High Availability (HA). Your two NameNodes are named
nn01 and nn02. What occurs when you execute the command:
hdfs haadmin -failover nn01 nn02

A. nn02 becomes the standby NameNode and nn01 becomes the active NameNode

B. nn01 is fenced, and nn01 becomes the active NameNode

C. nn01 is fenced, and nn02 becomes the active NameNode

D. nn01 becomes the standby NameNode and nn02 becomes the active NameNode

Explanation: failover - initiate a failover between two NameNodes
This subcommand causes a failover from the first provided NameNode to the second. If the
first NameNode is in the Standby state, this command simply transitions the second to the
Active state without error. If the first NameNode is in the Active state, an attempt will be
made to gracefully transition it to the Standby state. If this fails, the fencing methods (as
configured by dfs.ha.fencing.methods) will be attempted in order until one of the methods
succeeds. Only after this process will the second NameNode be transitioned to the Active
state. If no fencing method succeeds, the second NameNode will not be transitioned to the
Active state, and an error will be returned.
Reference: HDFS High Availability Administration, HA Administration using the haadmin
command
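
A minimal command-line sketch of checking both NameNodes and then initiating the failover (the NameNode IDs nn01 and nn02 come from the question; output is not shown):

hdfs haadmin -getServiceState nn01
hdfs haadmin -getServiceState nn02
hdfs haadmin -failover nn01 nn02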

Question #6

You have a cluster running with the Fair Scheduler enabled. There are currently no jobs
running on the cluster, and you submit job A, so that only job A is running on the cluster.
A while later, you submit job B. Now job A and job B are running on the cluster at the same
time.
Which of the following describes how the Fair Scheduler operates? (Choose 2)

A. When job B gets submitted, it will get assigned tasks, while job A continues to run with fewer tasks.

B. When job A gets submitted, it doesn't consume all the task slots.

C. When job A gets submitted, it consumes all the task slots.

D. When job B gets submitted, job A has to finish first, before job B can get scheduled.

Reference: http://hadoop.apache.org/common/docs/r0.20.2/fair_scheduler.html
(introduction, first paragraph)

Question #7

You have a cluster running with the Fair Scheduler enabled. There are currently no jobs
running on the cluster. You submit job A, so that only job A is running on the cluster. A
while later, you submit job B. Now job A and job B are running on the cluster at the same
time. How will the Fair Scheduler handle these two jobs?

A. When job A gets submitted, it consumes all the task slots.

B. When job A gets submitted, it doesn't consume all the task slots.

C. When job B gets submitted, job A has to finish first before job B can get scheduled.

D. When job B gets submitted, it will get assigned tasks, while job A continues to run with fewer tasks.

Explanation: Fair scheduling is a method of assigning resources to jobs such that all jobs
get, on average, an equal share of resources over time. When there is a single job running,
that job uses the entire cluster. When other jobs are submitted, task slots that free up are
assigned to the new jobs, so that each job gets roughly the same amount of CPU time.
Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish
in reasonable time while not starving long jobs. It is also a reasonable way to share a
cluster between a number of users. Finally, fair sharing can also work with job priorities -
the priorities are used as weights to determine the fraction of total compute time that each
job should get.
Reference: Hadoop, Fair Scheduler Guide
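
For context, a hedged sketch of how the Fair Scheduler is typically enabled for MRv1 in mapred-site.xml (the class name shown is the standard Fair Scheduler class; treat the snippet as illustrative, not a complete configuration):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>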

Question #8

You have a cluster running with the FIFO scheduler enabled. You submit a large job A to
the cluster which you expect to run for one hour. Then, you submit job B to the cluster,
which you expect to run for only a couple of minutes. Let's assume both jobs are running at
the same priority.
How does the FIFO scheduler execute the jobs? (Choose 3)

A. The order of execution of tasks within a job may vary.

B. When a job is submitted, all tasks belonging to that job are scheduled.

C. Given jobs A and B submitted in that order, all tasks from job A will be scheduled before all tasks from job B.

D. Since job B needs only a few tasks, it might finish before job A completes.

Reference: http://seriss.com/rush-current/rush/rush-priority.html#FIFO%20Scheduling (see
fifo scheduling)

Question #9

Assuming a large properly configured multi-rack Hadoop cluster, which scenario should not
result in loss of HDFS data assuming the default replication factor settings?

A. Ten percent of DataNodes simultaneously fail.

B. All DataNodes simultaneously fail.

C. An entire rack fails.

D. Multiple racks simultaneously fail.

E. Seventy percent of DataNodes simultaneously fail.

Reference: http://stackoverflow.com/questions/12399197/in-a-large-properly-configured-multi-rack-hadoop-cluster-which-scenarios-will-b
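
To see how a file's blocks and replicas are spread across racks on such a cluster, a check along these lines can be used (the path below is hypothetical, for illustration only):

hadoop fsck /user/example/data.txt -files -blocks -racks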

Question #10

Your developers request that you enable them to use Hive on your Hadoop cluster. What
do you install and/or configure?

A. Install the Hive interpreter on the client machines only, and configure a shared remote Hive Metastore.

B. Install the Hive Interpreter on the client machines and all the slave nodes, and configure a shared remote Hive Metastore.

C. Install the Hive interpreter on the master node running the JobTracker, and configure a shared remote Hive Metastore.

D. Install the Hive interpreter on the client machines and all nodes on the cluster

Explanation: The Hive Interpreter runs on a client machine.
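
A minimal sketch of pointing client-side Hive installations at a shared remote metastore via hive-site.xml (the hostname and port below are assumptions for illustration; 9083 is the usual metastore default):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastorehost:9083</value>
</property>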

Question #11

On a cluster running MapReduce v1 (MRv1), the value of the
mapred.tasktracker.map.tasks.maximum configuration parameter in the mapred-site.xml
file should be set to:

A. Half the maximum number of Reduce tasks which can run simultaneously on an individual node.

B. The maximum number of Map tasks which can run simultaneously on an individual node.

C. The same value on each slave node.

D. The maximum number of Map tasks which can run on the cluster as a whole.

E. Half the maximum number of Reduce tasks which can run on the cluster as a whole.

Explanation: mapred.tasktracker.map.tasks.maximum
Range: 1/2 * (cores/node) to 2 * (cores/node)
Description: Number of map tasks to deploy on each machine.
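
As a sketch, the parameter is set per TaskTracker in mapred-site.xml; the value 8 below is only an example for a hypothetical 8-core slave node, not a recommendation:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>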

Question #12

You are running a Hadoop cluster with a NameNode on host mynamenode, a Secondary
NameNode on host mysecondary, and DataNodes.
Which best describes how you determine when the last checkpoint happened?

A. Execute hdfs dfsadmin -report on the command line and look at the Last Checkpoint information.

B. Execute hdfs dfsadmin -saveNamespace on the command line, which returns the last checkpoint value from the fstime file.

C. Connect to the web UI of the Secondary NameNode (http://mysecondarynamenode:50090) and look at the Last Checkpoint information

D. Connect to the web UI of the NameNode (http://mynamenode:50070/) and look at the Last Checkpoint information

Explanation:
Note: SecondaryNameNode is arguably the worst name ever given to a Hadoop module. It
is only a checkpoint server, which periodically gets a copy of the fsimage and edits files
from the NameNode.
It basically serves as a checkpoint server, but it does not come online automatically when
the NameNode goes down. The Secondary NameNode can, however, be used to bring up
the NameNode manually in a worst-case scenario, with some data loss.
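
A hedged example of the check the correct answer describes, using the hostname from the question and the usual Secondary NameNode web UI port (open in a browser, or fetch from the command line):

curl http://mysecondary:50090/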

Question #13

How does the NameNode know DataNodes are available on a cluster running MapReduce
v1 (MRv1)?

A. DataNodes are listed in the dfs.hosts file, which the NameNode uses as the definitive list of available DataNodes.

B. DataNodes heartbeat in to the master on a regular basis.

C. The NameNode broadcasts a heartbeat on the network on a regular basis, and DataNodes respond.

D. The NameNode sends a broadcast across the network when it first starts, and DataNodes respond.

Explanation: How does the NameNode handle DataNode failures?
The NameNode periodically receives a Heartbeat and a Blockreport from each of the
DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning
properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode
notices that it has not received a heartbeat message from a DataNode after a certain
amount of time, the DataNode is marked as dead. Since blocks will be under-replicated, the
system begins replicating the blocks that were stored on the dead DataNode. The
NameNode orchestrates the replication of data blocks from one DataNode to another. The
replication data transfer happens directly between DataNodes; the data never passes
through the NameNode.
Reference: 24 Interview Questions & Answers for Hadoop MapReduce developers, How
NameNode Handles data node failures?
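
To see which DataNodes the NameNode currently considers live or dead on the basis of those heartbeats, the standard HDFS report can be run from any node with cluster client configuration:

hdfs dfsadmin -report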

Question #14

Your cluster is running MapReduce v1 (MRv1), with default replication set to 3 and a
cluster block size of 64MB. Identify which best describes the file read process when a client
application connects to the cluster and requests a 50MB file.

A. The client queries the NameNode for the locations of the block, and reads all three copies. The first copy to complete transfer to the client is the one the client reads as part of Hadoop's execution framework.

B. The client queries the NameNode for the locations of the block, and reads from the first location in the list it receives.

C. The client queries the NameNode for the locations of the block, and reads from a random location in the list it receives to eliminate network I/O loads by balancing which nodes it retrieves data from at any given time.

D. The client queries the NameNode and then retrieves the block from the nearest DataNode to the client and then passes that block back to the client.

Question #15

You configure your cluster with HDFS High Availability (HA) using Quorum-based Storage.
You do not implement HDFS Federation.
What is the maximum number of NameNode daemons you should run on your cluster in
order to avoid a split-brain scenario with your NameNodes?

A. Unlimited. HDFS High Availability (HA) is designed to overcome limitations on the number of NameNodes you can deploy.

B. Two active NameNodes and one Standby NameNode

C. One active NameNode and one Standby NameNode

D. Two active NameNodes and two Standby NameNodes

Explanation: In a typical HA cluster, two separate machines are configured as
NameNodes. At any point in time, one of the NameNodes is in an Active state, and the
other is in a Standby state. The Active NameNode is responsible for all client operations in
the cluster, while the Standby is simply acting as a slave, maintaining enough state to
provide a fast failover if necessary.
Note: It is vital for the correct operation of an HA cluster that only one of the NameNodes
be active at a time. Otherwise, the namespace state would quickly diverge between the
two, risking data loss or other incorrect results. In order to ensure this property and prevent
the so-called "split-brain scenario," the JournalNodes will only ever allow a single
NameNode to be a writer at a time. During a failover, the NameNode which is to become
active will simply take over the role of writing to the JournalNodes, which will effectively
prevent the other NameNode from continuing in the Active state, allowing the new Active
NameNode to safely proceed with failover.
Reference: Cloudera CDH4 High Availability Guide, Quorum-based Storage
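
For context, a minimal sketch of naming exactly two NameNodes for one nameservice in hdfs-site.xml (the nameservice ID "mycluster" is an assumption for illustration; nn01 and nn02 match the IDs used in the earlier HA question):

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn01,nn02</value>
</property>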
