Hadoop Ecosystem Question Bank for C-CAT
Topic-wise Hadoop Ecosystem MCQs for CDAC C-CAT preparation with answers and explanations.
Correct Answer: B - Hadoop Distributed File System
HDFS stands for Hadoop Distributed File System - designed for storing large files across clusters.
Correct Answer: A - Managing metadata and namespace
The NameNode manages the filesystem namespace, maintains the directory tree, and tracks where data blocks are stored.
Correct Answer: D - Actual data blocks
DataNodes store actual data blocks and serve read/write requests from clients.
Correct Answer: A - 128 MB
The default HDFS block size is 128 MB (it was 64 MB in Hadoop 1.x), optimized for large sequential reads.
Correct Answer: D - 3
Default replication factor is 3 - data is stored on 3 different nodes for fault tolerance.
Correct Answer: A - Yet Another Resource Negotiator
YARN stands for Yet Another Resource Negotiator - it manages cluster resources and schedules applications.
Correct Answer: D - ResourceManager
ResourceManager is the master that allocates resources and schedules applications across the cluster.
Correct Answer: D - SQL-like queries on Hadoop
Hive provides a SQL-like interface (HiveQL) to query data stored in Hadoop, ideal for data warehousing.
Correct Answer: D - A high-level scripting language for data analysis
Pig provides the Pig Latin scripting language for expressing data flows and transformations.
Correct Answer: C - NoSQL columnar database
HBase is a distributed, scalable NoSQL database that runs on top of HDFS, modeled after Google's Bigtable.
Correct Answer: B - Transferring data between Hadoop and RDBMS
Sqoop efficiently transfers bulk data between Hadoop and structured datastores like relational databases.
Correct Answer: B - Collecting and aggregating log data
Flume is a distributed service for collecting, aggregating, and moving large amounts of log data.
Correct Answer: D - Workflow scheduler
Oozie is a workflow scheduler for managing Hadoop jobs, supporting MapReduce, Pig, Hive, etc.
Correct Answer: A - Coordination services for distributed systems
ZooKeeper provides centralized configuration, naming, synchronization, and group services.
Correct Answer: B - Performs checkpointing of namespace
The Secondary NameNode periodically merges the namespace image with the edit log - it is not a failover backup.
Correct Answer: B - Kerberos with Ranger/Knox
Kerberos provides authentication, Ranger provides centralized authorization policies and auditing, and Knox provides a gateway that secures REST/HTTP access to the cluster at the perimeter.
Correct Answer: D - Processes data in-memory
Spark keeps intermediate data in memory (RAM) where possible, rather than writing it to disk between stages as MapReduce does.
Correct Answer: D - Large sequential reads
HDFS is designed for large sequential reads of big files, not random access or small files.
Correct Answer: B - Fault tolerance and network optimization
Rack awareness places replicas across racks for fault tolerance and optimizes network bandwidth.
Correct Answer: A - Oracle
Oracle is a traditional RDBMS, not part of the Hadoop ecosystem. Hive, Pig, and HBase are Hadoop tools.
Correct Answer: C - Hadoop Distributed File System
HDFS stands for Hadoop Distributed File System. It is the primary storage system used by Hadoop applications, designed to store large files across multiple machines in a cluster with high fault tolerance.
Correct Answer: B - 128 MB
The default block size in HDFS for Hadoop 2.x and later is 128 MB. In Hadoop 1.x, the default was 64 MB. Large block sizes reduce the metadata overhead on the NameNode and improve throughput for large files.
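As a rough illustration of the block-size arithmetic, the sketch below counts how many blocks a file would occupy. The function name `block_count` is hypothetical, and treating an empty file as zero blocks is a simplifying assumption.

```python
import math

BLOCK_SIZE_MB = 128  # default block size in Hadoop 2.x and later


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks needed to hold a file of the given size."""
    if file_size_mb <= 0:
        return 0  # simplification: an empty file is counted as zero blocks
    # The last block may be only partly filled, so round up.
    return math.ceil(file_size_mb / block_size_mb)


print(block_count(1024))      # 1 GB file with 128 MB blocks
print(block_count(1024, 64))  # same file with the Hadoop 1.x default
```

Halving the block size doubles the number of blocks the NameNode has to track for the same file, which is the metadata overhead the explanation refers to.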
Correct Answer: C - NameNode
The NameNode is the master server that stores all the metadata about the HDFS file system, including the directory tree, file-to-block mappings, and the locations of blocks across DataNodes.
Correct Answer: B - 3
The default replication factor in HDFS is 3. This means each block of data is stored on three different DataNodes to provide fault tolerance. If one or two nodes fail, the data is still available from other replicas.
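The cost and benefit of replication can be worked out with simple arithmetic. The helper names below are hypothetical; the sketch only assumes that every block is stored `replication` times on distinct DataNodes.

```python
def raw_storage_gb(logical_gb, replication=3):
    """Raw cluster storage consumed by data of the given logical size."""
    return logical_gb * replication


def tolerated_replica_losses(replication=3):
    """How many replicas of a block can be lost while one copy survives."""
    return max(replication - 1, 0)


print(raw_storage_gb(100))         # raw GB used by 100 GB of data
print(tolerated_replica_losses())  # node failures a block can survive
```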
Correct Answer: C - It periodically merges the edit log with the FsImage to prevent the edit log from growing too large
The Secondary NameNode periodically merges the namespace image (FsImage) with the edit log to prevent the edit log from becoming too large. It is NOT a hot standby or backup for the NameNode despite its misleading name.
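A toy model may make the checkpoint step concrete. This is only a sketch: the namespace is assumed to be a plain dictionary of path to block list, and the edit tuples (`create`, `delete`, `rename`) are a hypothetical format, not the real on-disk layout.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace dictionary."""
    op = edit[0]
    if op == "create":
        _, path, blocks = edit
        namespace[path] = list(blocks)
    elif op == "delete":
        _, path = edit
        namespace.pop(path, None)
    elif op == "rename":
        _, src, dst = edit
        namespace[dst] = namespace.pop(src)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Replay the edit log onto a copy of the image; return (new image, empty log)."""
    merged = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


image = {"/a": ["blk_1"]}
edits = [("create", "/b", ["blk_2"]), ("rename", "/a", "/c"), ("delete", "/b")]
new_image, new_log = checkpoint(image, edits)
print(new_image, new_log)
```

After the merge the edit log starts empty again, so a restart only has to load the new image rather than replay a long history of edits.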
Correct Answer: D - Yet Another Resource Negotiator
YARN stands for Yet Another Resource Negotiator. It is the resource management layer of Hadoop that manages and schedules resources across the cluster, separating resource management from data processing.
Correct Answer: C - ResourceManager
The ResourceManager is the master daemon of YARN responsible for resource allocation and management across the entire Hadoop cluster. It has two main components: the Scheduler and the ApplicationsManager.
Correct Answer: A - A per-application framework responsible for negotiating resources and managing task execution
The ApplicationMaster is a per-application framework-specific entity responsible for negotiating resources from the ResourceManager, working with NodeManagers to execute and monitor tasks for that specific application.
Correct Answer: B - Apache Avro
Apache Avro is a data serialization framework developed within the Hadoop ecosystem. It provides rich data structures, a compact and fast binary format, and integration with dynamic languages, making it ideal for data storage and RPC.
Correct Answer: B - The NameNode detects the failure and replicates the lost blocks to other DataNodes
When a DataNode fails, the NameNode detects it through missed heartbeat signals. It then identifies the under-replicated blocks that were on the failed node and initiates re-replication to other healthy DataNodes to maintain the configured replication factor.
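The detection step described above can be sketched as follows. The block map layout (block ID to a set of DataNode names) and the function name are assumptions made for illustration; the real NameNode bookkeeping is considerably more involved.

```python
def under_replicated_after_failure(block_locations, failed_node, replication=3):
    """Return {block_id: missing_replica_count} once failed_node is removed."""
    result = {}
    for block_id, nodes in block_locations.items():
        live = nodes - {failed_node}  # replicas that are still reachable
        if len(live) < replication:
            result[block_id] = replication - len(live)
    return result


locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
print(under_replicated_after_failure(locations, "dn1"))
```

Only the blocks that had a replica on the failed node need new copies; blocks stored entirely on healthy nodes are left alone.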
Correct Answer: C - NodeManager
The NodeManager is a per-node agent that runs on each worker node in the cluster. It is responsible for managing containers, monitoring resource usage (CPU, memory, disk, network) on the node, and reporting to the ResourceManager.
Correct Answer: A - Placing replicas on different racks to improve fault tolerance and network performance
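Rack awareness in HDFS is a policy that places block replicas across different racks in a data center. With the default replication factor of 3, the default policy places the first replica on the writer's node (or on a random node if the client is outside the cluster) and the other two replicas on two different nodes of a single remote rack. This protects against rack-level failures while limiting cross-rack write traffic.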
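A simplified sketch of rack-aware placement for three replicas follows. It assumes the first replica goes on the writer's node and the other two on different nodes of one remote rack, and it picks nodes deterministically (sorted order) where HDFS would choose randomly. The `topology` layout and function name are hypothetical.

```python
def place_replicas(topology, writer_node):
    """topology maps rack name -> list of node names; returns three target nodes."""
    node_to_rack = {n: rack for rack, nodes in topology.items() for n in nodes}
    local_rack = node_to_rack[writer_node]
    for rack in sorted(topology):
        # Look for a remote rack that can hold the second and third replicas.
        if rack != local_rack and len(topology[rack]) >= 2:
            second, third = sorted(topology[rack])[:2]
            return [writer_node, second, third]
    # Simplification: real HDFS falls back to other placements instead of failing.
    raise ValueError("no remote rack with two nodes available")


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas(topology, "dn1"))
```

With this layout the block survives the loss of either rack, and only one copy of the data has to cross the rack boundary during the write.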
Correct Answer: D - It contains a complete snapshot of the file system metadata at a point in time
The FsImage file contains a complete snapshot of the file system metadata at a specific point in time, including the directory structure, file permissions, and block mappings. It is loaded by the NameNode at startup.
Correct Answer: C - Apache HBase
Apache HBase is a distributed, column-family NoSQL database that runs on top of HDFS. It provides random, real-time read/write access to Big Data, similar to Google's Bigtable.
Correct Answer: B - A transaction log that records every change made to the file system metadata
The edit log (or edits file) is a transaction log maintained by the NameNode that records every modification to the file system metadata, such as creating files, deleting files, or renaming. It ensures metadata durability.
Correct Answer: A - Hadoop 2.0
YARN was introduced in Hadoop 2.0 as a major architectural improvement. It separated resource management from data processing, replacing the JobTracker/TaskTracker model of Hadoop 1.x, allowing multiple processing frameworks to run on the same cluster.
Correct Answer: C - It represents an allocated set of resources (CPU, memory) on a node for running a task
A Container in YARN represents an allocation of resources (memory, CPU) on a single node. It is the unit of resource allocation managed by the NodeManager, and application tasks run inside containers.
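To see how container sizes bound what one node can run, a minimal sketch is given below. The function name is hypothetical, and it assumes both memory and vcores are enforced; whether vcores are actually considered depends on the scheduler's resource calculator.

```python
def containers_per_node(node_mem_mb, node_vcores, container_mem_mb, container_vcores):
    """Upper bound on identical containers that fit on one node."""
    if container_mem_mb <= 0 or container_vcores <= 0:
        raise ValueError("container resources must be positive")
    by_memory = node_mem_mb // container_mem_mb
    by_cpu = node_vcores // container_vcores
    # The scarcer resource decides how many containers fit.
    return min(by_memory, by_cpu)


print(containers_per_node(65536, 16, 4096, 1))  # memory and CPU both allow the same count
print(containers_per_node(65536, 16, 4096, 2))  # CPU becomes the limiting resource
```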
Correct Answer: B - Through periodic heartbeat signals sent by DataNodes
DataNodes periodically send heartbeat signals to the NameNode (every 3 seconds by default). If the NameNode receives no heartbeat from a DataNode for a configurable period (about 10 minutes 30 seconds with default settings, often rounded to 10 minutes), it marks the DataNode as dead.
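The dead-node timeout is derived from two settings rather than configured directly. The sketch below assumes the usual formula, 2 x `dfs.namenode.heartbeat.recheck-interval` + 10 x `dfs.heartbeat.interval`, with stock defaults of 300000 ms and 3 s; the function name is hypothetical.

```python
def dead_node_timeout_seconds(heartbeat_interval_s=3, recheck_interval_ms=300_000):
    """Seconds without a heartbeat before a DataNode is considered dead."""
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s


timeout = dead_node_timeout_seconds()
print(timeout, "seconds =", timeout / 60, "minutes")
```

With these defaults the result is 630 seconds, which is why the timeout is commonly quoted as roughly 10 minutes.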
Correct Answer: B - Analyzing large datasets using a high-level scripting language called Pig Latin
Apache Pig is a high-level platform for analyzing large datasets that uses a scripting language called Pig Latin. It abstracts the complexity of MapReduce programming and automatically converts Pig Latin scripts into MapReduce jobs.
Correct Answer: A - The set of common utilities and libraries that support other Hadoop modules
Hadoop Common contains the common utilities, libraries, and Java Archive (JAR) files that are required by other Hadoop modules. It provides the foundation including file system abstractions, RPC, and serialization.
Correct Answer: D - It allowed multiple NameNodes to manage separate namespaces independently
HDFS Federation (introduced in Hadoop 2.x) allows multiple independent NameNodes to manage separate namespace volumes. This improves scalability by distributing the namespace load and allows each NameNode to operate independently.
Correct Answer: C - To act as a hot standby that can take over if the Active NameNode fails
In HDFS High Availability (HA), the Standby NameNode maintains an up-to-date copy of the namespace state and can quickly take over if the Active NameNode fails. When automatic failover is configured, this happens without data loss or significant downtime.
Correct Answer: C - Apache ORC
Apache ORC (Optimized Row Columnar) is a columnar storage format that is highly optimized for reading, writing, and processing data in Hadoop. It provides efficient compression and predicate pushdown, and it was developed for Apache Hive, where it is widely used.
Correct Answer: A - Running duplicate copies of slow tasks on other nodes and using the result from whichever finishes first
Speculative execution is an optimization where Hadoop detects tasks that are running slower than average and launches duplicate (speculative) copies on other nodes. The result from whichever copy finishes first is used, and the other is killed.
Correct Answer: D - RPC (Remote Procedure Call)
HDFS uses RPC (Remote Procedure Call) for communication between the NameNode and DataNodes. DataNodes send heartbeats and block reports to the NameNode via RPC, and the NameNode returns its commands to DataNodes in the replies to those heartbeats.
Correct Answer: A - Terabytes to Petabytes
HDFS is designed to handle data ranging from terabytes to petabytes in size. It is optimized for storing and processing very large files distributed across clusters of commodity hardware, not for many small files.
Correct Answer: C - A periodic message from DataNodes to the NameNode listing all blocks stored on that DataNode
A block report is a periodic message sent by each DataNode to the NameNode containing a list of all HDFS blocks stored on that DataNode. The NameNode uses these reports to maintain its block-to-DataNode mapping and detect missing replicas.
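How block reports feed the block-to-DataNode mapping can be sketched in a few lines. The report layout (DataNode name to a list of block IDs) and both function names are assumptions for illustration only.

```python
from collections import defaultdict


def build_block_map(block_reports):
    """Invert per-DataNode reports into a block -> set of DataNodes mapping."""
    block_map = defaultdict(set)
    for datanode, blocks in block_reports.items():
        for block_id in blocks:
            block_map[block_id].add(datanode)
    return dict(block_map)


def missing_blocks(expected_blocks, block_map):
    """Blocks the namespace expects that no DataNode reported."""
    return sorted(b for b in expected_blocks if b not in block_map)


reports = {"dn1": ["blk_1", "blk_2"], "dn2": ["blk_1"]}
block_map = build_block_map(reports)
print(block_map)
print(missing_blocks(["blk_1", "blk_3"], block_map))
```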
Correct Answer: A - Each file occupies metadata in the NameNode's memory, so millions of small files can exhaust NameNode memory
The small files problem occurs because every file, directory, and block in HDFS is represented as an object in the NameNode's memory (~150 bytes each). Millions of small files consume excessive NameNode memory and degrade performance.
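A back-of-the-envelope estimate shows why file count matters more than data volume. The sketch uses the ~150 bytes per object rule of thumb quoted above and, as a simplification, ignores directory objects; the function name is hypothetical.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb size of one namespace object


def namenode_memory_bytes(num_files, blocks_per_file=1, bytes_per_object=BYTES_PER_OBJECT):
    """Approximate NameNode heap used by files and their blocks."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * bytes_per_object


# About 10 TB stored as 10 million 1 MB files versus 78,125 files of 128 MB.
print(namenode_memory_bytes(10_000_000))
print(namenode_memory_bytes(78_125))
```

The same volume of data costs about 3 GB of NameNode memory as small files but only around 23 MB when packed into block-sized files.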
Correct Answer: A - Capacity Scheduler
The Capacity Scheduler allows sharing a cluster along organizational lines by providing guaranteed minimum capacity to each queue. Each organization gets a guaranteed share of cluster resources, and excess capacity can be shared with others.
Correct Answer: C - To maintain a shared edit log between Active and Standby NameNodes
JournalNodes maintain a shared edit log that the Active NameNode writes to and the Standby NameNode reads from to stay synchronized. A quorum of JournalNodes (at least 3) ensures that edits are durably recorded for failover scenarios.
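The quorum requirement explains why JournalNodes are normally deployed in odd numbers. The sketch below assumes an edit counts as durable once a strict majority of JournalNodes has acknowledged it; the function names are hypothetical.

```python
def quorum_size(journal_nodes):
    """Smallest majority of the JournalNode ensemble."""
    return journal_nodes // 2 + 1


def tolerated_failures(journal_nodes):
    """JournalNodes that can be down while a majority is still reachable."""
    return (journal_nodes - 1) // 2


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Going from three to four JournalNodes raises the quorum without tolerating any additional failure, so the next useful step up from three is five.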