
Hadoop Ecosystem - Practice MCQs for CCAT

50 Questions | Section B: Programming Big Data

Hadoop Ecosystem Question Bank for C-CAT

Topic-wise Hadoop Ecosystem MCQs for CDAC C-CAT preparation with answers and explanations.

Q1.
HDFS stands for:
A. Hadoop Data File System
B. Hadoop Distributed File System
C. High Data File Storage
D. Hierarchical Distributed File System

Correct Answer: B - Hadoop Distributed File System

HDFS stands for Hadoop Distributed File System - designed for storing large files across clusters.

Q2.
In HDFS, the NameNode is responsible for:
A. Managing metadata and namespace
B. Storing actual data blocks
C. Running MapReduce jobs
D. Data compression

Correct Answer: A - Managing metadata and namespace

The NameNode manages the filesystem namespace, maintains the directory tree, and tracks where data blocks are stored.

Q3.
DataNodes in HDFS store:
A. Only metadata
B. Only directory structure
C. Only file names
D. Actual data blocks

Correct Answer: D - Actual data blocks

DataNodes store actual data blocks and serve read/write requests from clients.

Q4.
Default block size in HDFS is:
A. 128 MB
B. 1 MB
C. 64 KB
D. 1 GB

Correct Answer: A - 128 MB

The default HDFS block size is 128 MB in Hadoop 2.x and later (it was 64 MB in Hadoop 1.x). Large blocks are optimized for long sequential reads.

Q5.
HDFS default replication factor is:
A. 1
B. 2
C. 5
D. 3

Correct Answer: D - 3

Default replication factor is 3 - data is stored on 3 different nodes for fault tolerance.

Q6.
YARN in Hadoop stands for:
A. Yet Another Resource Negotiator
B. Yet Another Resource Navigator
C. Yielding Application Resource Network
D. Youth Application Resource Node

Correct Answer: A - Yet Another Resource Negotiator

YARN stands for Yet Another Resource Negotiator - it manages cluster resources and schedules applications.

Q7.
Which component schedules tasks in YARN?
A. DataNode
B. NameNode
C. Secondary NameNode
D. ResourceManager

Correct Answer: D - ResourceManager

ResourceManager is the master that allocates resources and schedules applications across the cluster.

Q8.
Hive is used for:
A. Real-time processing
B. Stream processing
C. Graph processing
D. SQL-like queries on Hadoop

Correct Answer: D - SQL-like queries on Hadoop

Hive provides a SQL-like interface (HiveQL) for querying data stored in Hadoop, which makes it well suited to data warehousing.

Q9.
Pig in Hadoop is:
A. A storage system
B. A monitoring tool
C. A security framework
D. A high-level scripting language for data analysis

Correct Answer: D - A high-level scripting language for data analysis

Pig provides the Pig Latin scripting language for expressing data flows and transformations.

Q10.
HBase is a:
A. Relational database
B. Graph database
C. NoSQL columnar database
D. Document database

Correct Answer: C - NoSQL columnar database

HBase is a distributed, scalable, column-family (wide-column) NoSQL database that runs on top of HDFS, modeled after Google's Bigtable.

Q11.
Sqoop is used for:
A. Data visualization
B. Transferring data between Hadoop and RDBMS
C. Stream processing
D. Machine learning

Correct Answer: B - Transferring data between Hadoop and RDBMS

Sqoop efficiently transfers bulk data between Hadoop and structured datastores like relational databases.

Q12.
Flume is designed for:
A. Batch processing
B. Collecting and aggregating log data
C. SQL queries
D. Graph analysis

Correct Answer: B - Collecting and aggregating log data

Flume is a distributed service for collecting, aggregating, and moving large amounts of log data.

Q13.
Oozie in Hadoop is a:
A. Database
B. Security system
C. Query engine
D. Workflow scheduler

Correct Answer: D - Workflow scheduler

Oozie is a workflow scheduler for managing Hadoop jobs, supporting MapReduce, Pig, Hive, etc.

Q14.
ZooKeeper provides:
A. Coordination services for distributed systems
B. Data storage
C. Query processing
D. Data compression

Correct Answer: A - Coordination services for distributed systems

ZooKeeper provides centralized configuration, naming, synchronization, and group services.

Q15.
Secondary NameNode in HDFS:
A. Is a backup NameNode
B. Performs checkpointing of namespace
C. Stores data blocks
D. Manages YARN

Correct Answer: B - Performs checkpointing of namespace

The Secondary NameNode periodically merges the namespace image with the edit log. It is not a failover backup.

Q16.
Which Hadoop component provides security?
A. Oozie
B. Kerberos with Ranger/Knox
C. Flume
D. Sqoop

Correct Answer: B - Kerberos with Ranger/Knox

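Kerberos provides authentication. Apache Ranger provides centralized authorization policies and auditing, and Apache Knox acts as a gateway that secures access to cluster services from outside the cluster.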

Q17.
Spark is faster than MapReduce primarily because:
A. Uses bigger disk drives
B. Has better UI
C. Uses faster networks
D. Processes data in-memory

Correct Answer: D - Processes data in-memory

Spark keeps intermediate data in memory (RAM) where possible, whereas MapReduce writes intermediate results to disk between stages.

Q18.
HDFS is optimized for:
A. Random reads/writes
B. Small files
C. Frequent file modifications
D. Large sequential reads

Correct Answer: D - Large sequential reads

HDFS is designed for large sequential reads of big files, not random access or small files.

Q19.
Rack awareness in HDFS helps with:
A. Faster CPU processing
B. Fault tolerance and network optimization
C. Data compression
D. Query performance

Correct Answer: B - Fault tolerance and network optimization

Rack awareness places replicas across racks for fault tolerance and optimizes network bandwidth.

Q20.
Which is NOT a Hadoop ecosystem tool?
A. Oracle
B. Pig
C. Hive
D. HBase

Correct Answer: A - Oracle

Oracle is a traditional RDBMS, not part of the Hadoop ecosystem. Hive, Pig, and HBase are Hadoop tools.

Q21.
What does HDFS stand for?
A. Hadoop Data Framework System
B. High Data File Storage
C. Hadoop Distributed File System
D. Hierarchical Distributed File System

Correct Answer: C - Hadoop Distributed File System

HDFS stands for Hadoop Distributed File System. It is the primary storage system used by Hadoop applications, designed to store large files across multiple machines in a cluster with high fault tolerance.

Q22.
What is the default block size in HDFS (Hadoop 2.x and later)?
A. 64 MB
B. 128 MB
C. 256 MB
D. 32 MB

Correct Answer: B - 128 MB

The default block size in HDFS for Hadoop 2.x and later is 128 MB. In Hadoop 1.x, the default was 64 MB. Large block sizes reduce the metadata overhead on the NameNode and improve throughput for large files.
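
The effect of block size on metadata can be illustrated with a little arithmetic. The sketch below is only an illustration: the function name `blocks_needed` is hypothetical, and it simply counts how many blocks a file would be split into at a given block size.

```python
MB = 1024 * 1024

def blocks_needed(file_size_bytes: int, block_size_bytes: int = 128 * MB) -> int:
    """Number of blocks a file of the given size is split into (ceiling division)."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

one_gb = 1024 * MB
print(blocks_needed(one_gb))           # 8 blocks at 128 MB
print(blocks_needed(one_gb, 64 * MB))  # 16 blocks at 64 MB
```

Halving the block count halves the number of block entries the NameNode has to track for the same file.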

Q23.
Which component in HDFS is responsible for storing metadata about the file system?
A. DataNode
B. ResourceManager
C. NameNode
D. NodeManager

Correct Answer: C - NameNode

The NameNode is the master server that stores all the metadata about the HDFS file system, including the directory tree, file-to-block mappings, and the locations of blocks across DataNodes.

Q24.
What is the default replication factor in HDFS?
A. 1
B. 3
C. 2
D. 5

Correct Answer: B - 3

The default replication factor in HDFS is 3. This means each block of data is stored on three different DataNodes to provide fault tolerance. If one or two nodes fail, the data is still available from other replicas.
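
As a rough worked example of what replication costs and buys, the sketch below computes raw storage consumed and checks whether a block is still readable after some replicas are lost. The function names are hypothetical and the model ignores everything except the replica count.

```python
def raw_storage_bytes(logical_bytes: int, replication: int = 3) -> int:
    """Raw disk space consumed across the cluster for the given logical data size."""
    return logical_bytes * replication

def block_still_available(lost_replicas: int, replication: int = 3) -> bool:
    """A block stays readable as long as at least one replica survives."""
    return lost_replicas < replication

print(raw_storage_bytes(100))        # 300
print(block_still_available(2))      # True: one replica remains
print(block_still_available(3))      # False: all replicas lost
```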

Q25.
What is the role of the Secondary NameNode in HDFS?
A. It acts as a hot standby for the primary NameNode
B. It stores actual data blocks
C. It periodically merges the edit log with the FsImage to prevent the edit log from growing too large
D. It handles client read requests

Correct Answer: C - It periodically merges the edit log with the FsImage to prevent the edit log from growing too large

The Secondary NameNode periodically merges the namespace image (FsImage) with the edit log to prevent the edit log from becoming too large. It is NOT a hot standby or backup for the NameNode despite its misleading name.
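
The checkpointing idea can be sketched with a toy model. This is not how HDFS stores its metadata: the dictionary-based namespace, the tuple-based edit records, and the function names below are all assumptions made for illustration.

```python
def apply_edits(fsimage: dict, edit_log: list) -> dict:
    """Replay logged operations on a copy of the namespace snapshot."""
    namespace = dict(fsimage)
    for op, path, *args in edit_log:
        if op == "create":
            namespace[path] = args[0]           # args[0]: list of block IDs (toy model)
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[args[0]] = namespace.pop(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace

def checkpoint(fsimage: dict, edit_log: list):
    """Merge the edit log into a new image and start again with an empty log."""
    return apply_edits(fsimage, edit_log), []

image = {"/a": [1]}
edits = [("create", "/b", [2, 3]), ("rename", "/a", "/c"), ("delete", "/b")]
new_image, new_log = checkpoint(image, edits)
print(new_image, new_log)
```

After the merge the log is empty again, which is the point of the exercise: a restart only has to load the image rather than replay a long history of edits.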

Q26.
What does YARN stand for in the Hadoop ecosystem?
A. Yet Another Resource Navigator
B. YARN Application Resource Node
C. Yield And Resource Network
D. Yet Another Resource Negotiator

Correct Answer: D - Yet Another Resource Negotiator

YARN stands for Yet Another Resource Negotiator. It is the resource management layer of Hadoop that manages and schedules resources across the cluster, separating resource management from data processing.

Q27.
Which YARN component is responsible for managing resources across the entire cluster?
A. NodeManager
B. ApplicationMaster
C. ResourceManager
D. Container

Correct Answer: C - ResourceManager

The ResourceManager is the master daemon of YARN responsible for resource allocation and management across the entire Hadoop cluster. It has two main components: the Scheduler and the ApplicationsManager.

Q28.
What is an ApplicationMaster in YARN?
A. A per-application framework responsible for negotiating resources and managing task execution
B. A global resource manager for the cluster
C. A daemon that monitors node health
D. The primary storage manager in HDFS

Correct Answer: A - A per-application framework responsible for negotiating resources and managing task execution

The ApplicationMaster is a per-application, framework-specific entity. It negotiates resources from the ResourceManager and works with NodeManagers to execute and monitor the tasks of that specific application.

Q29.
Which Hadoop ecosystem tool is used for data serialization?
A. Apache Oozie
B. Apache Avro
C. Apache Sqoop
D. Apache Flume

Correct Answer: B - Apache Avro

Apache Avro is a data serialization framework developed within the Hadoop ecosystem. It provides rich data structures, a compact and fast binary format, and integration with dynamic languages, making it ideal for data storage and RPC.

Q30.
What happens when a DataNode fails in HDFS?
A. The entire cluster shuts down
B. The NameNode detects the failure and replicates the lost blocks to other DataNodes
C. Data on that node is permanently lost
D. The Secondary NameNode takes over

Correct Answer: B - The NameNode detects the failure and replicates the lost blocks to other DataNodes

When a DataNode fails, the NameNode detects it through missed heartbeat signals. It then identifies the under-replicated blocks that were on the failed node and initiates re-replication to other healthy DataNodes to maintain the configured replication factor.
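
The re-replication step can be sketched as a small planning function. This is a simplified model under stated assumptions: the block map is a plain dictionary, the function name `plan_rereplication` is hypothetical, and targets are chosen in alphabetical order, whereas the real NameNode applies its block placement policy.

```python
def plan_rereplication(block_locations: dict, live_nodes: set,
                       dead_node: str, replication: int = 3) -> dict:
    """Return {block_id: [target nodes]} for blocks left under-replicated."""
    plan = {}
    for block_id, nodes in block_locations.items():
        holders = nodes - {dead_node}           # replicas that survived
        missing = replication - len(holders)
        if missing > 0:
            # Only nodes that do not already hold the block are valid targets.
            candidates = sorted(live_nodes - holders)
            plan[block_id] = candidates[:missing]
    return plan

locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live = {"dn2", "dn3", "dn4", "dn5"}
print(plan_rereplication(locations, live, dead_node="dn1"))  # {'blk_1': ['dn4']}
```

Only `blk_1` had a replica on the failed node, so only that block is scheduled for a new copy.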

Q31.
Which component in YARN runs on each worker node to manage containers?
A. ResourceManager
B. ApplicationMaster
C. NodeManager
D. JobTracker

Correct Answer: C - NodeManager

The NodeManager is a per-node agent that runs on each worker node in the cluster. It is responsible for managing containers, monitoring resource usage (CPU, memory, disk, network) on the node, and reporting to the ResourceManager.

Q32.
What is the rack awareness feature in HDFS?
A. Placing replicas on different racks to improve fault tolerance and network performance
B. Organizing data alphabetically across racks
C. Limiting storage to a single rack
D. Encrypting data on specific racks

Correct Answer: A - Placing replicas on different racks to improve fault tolerance and network performance

Rack awareness in HDFS is a policy that places block replicas across different racks in a data center. With the default replication factor of 3, the default policy places one replica on the writer's local rack and the other two on different nodes of a single remote rack. This protects against rack-level failures while limiting cross-rack write traffic.
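
A toy version of this placement for three replicas might look like the sketch below. It assumes the writer runs on a DataNode, represents the topology as a dictionary of rack name to node list, and picks the remote rack and nodes deterministically, whereas HDFS chooses among eligible nodes at random. The function name `place_replicas` is hypothetical.

```python
def place_replicas(writer_node: str, topology: dict) -> list:
    """Pick three replica locations: one local, two on a single remote rack."""
    local_rack = next(
        (rack for rack, nodes in topology.items() if writer_node in nodes), None
    )
    if local_rack is None:
        raise ValueError(f"{writer_node} is not in the topology")
    # First remote rack (alphabetically) that has at least two nodes.
    remote_rack = next(
        (rack for rack in sorted(topology)
         if rack != local_rack and len(topology[rack]) >= 2),
        None,
    )
    if remote_rack is None:
        raise ValueError("no remote rack with two nodes available")
    second, third = sorted(topology[remote_rack])[:2]
    return [writer_node, second, third]

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", topology))  # ['dn1', 'dn3', 'dn4']
```

Losing either rack entirely still leaves at least one replica on the other.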

Q33.
What is the purpose of the FsImage file in HDFS?
A. It stores actual data blocks
B. It stores temporary computation results
C. It logs all client requests
D. It contains a complete snapshot of the file system metadata at a point in time

Correct Answer: D - It contains a complete snapshot of the file system metadata at a point in time

The FsImage file contains a complete snapshot of the file system metadata at a specific point in time, including the directory structure, file permissions, and block mappings. It is loaded by the NameNode at startup.

Q34.
Which of the following is a NoSQL database in the Hadoop ecosystem?
A. Apache Hive
B. Apache Pig
C. Apache HBase
D. Apache Sqoop

Correct Answer: C - Apache HBase

Apache HBase is a distributed, column-family NoSQL database that runs on top of HDFS. It provides random, real-time read/write access to Big Data, similar to Google's Bigtable.

Q35.
What is the edit log in HDFS?
A. A log of all data stored in DataNodes
B. A transaction log that records every change made to the file system metadata
C. A log of MapReduce job outputs
D. A log of user authentication attempts

Correct Answer: B - A transaction log that records every change made to the file system metadata

The edit log (or edits file) is a transaction log maintained by the NameNode that records every modification to the file system metadata, such as creating, deleting, or renaming files. It ensures metadata durability.

Q36.
Which Hadoop version introduced YARN?
A. Hadoop 2.0
B. Hadoop 1.0
C. Hadoop 3.0
D. Hadoop 0.20

Correct Answer: A - Hadoop 2.0

YARN was introduced in Hadoop 2.0 as a major architectural improvement. It separated resource management from data processing, replacing the JobTracker/TaskTracker model of Hadoop 1.x, allowing multiple processing frameworks to run on the same cluster.

Q37.
What is the role of a Container in YARN?
A. It stores HDFS blocks
B. It manages the NameNode metadata
C. It represents an allocated set of resources (CPU, memory) on a node for running a task
D. It handles network communication between nodes

Correct Answer: C - It represents an allocated set of resources (CPU, memory) on a node for running a task

A Container in YARN represents an allocation of resources (memory, CPU) on a single node. It is the unit of resource allocation managed by the NodeManager, and application tasks run inside containers.

Q38.
How does the NameNode detect a DataNode failure?
A. By checking disk space
B. Through periodic heartbeat signals sent by DataNodes
C. By running diagnostic tests
D. Through user-reported errors

Correct Answer: B - Through periodic heartbeat signals sent by DataNodes

DataNodes send heartbeat signals to the NameNode every 3 seconds by default. If the NameNode receives no heartbeat from a DataNode for a configurable period (10 minutes 30 seconds with the default settings), it marks the DataNode as dead.
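
The dead-node timeout is derived from two settings rather than set directly. The sketch below applies the formula used by the NameNode, 2 x recheck interval + 10 x heartbeat interval, with the defaults of `dfs.heartbeat.interval` (3 seconds) and `dfs.namenode.heartbeat.recheck-interval` (300000 milliseconds). The function name is hypothetical.

```python
def dead_node_timeout_seconds(heartbeat_interval_s: int = 3,
                              recheck_interval_ms: int = 300_000) -> float:
    """Time without heartbeats after which a DataNode is considered dead."""
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s

timeout = dead_node_timeout_seconds()
print(timeout)        # 630.0 seconds
print(timeout / 60)   # 10.5 minutes
```

With the defaults this works out to 630 seconds, or ten and a half minutes.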

Q39.
What is Apache Pig used for in the Hadoop ecosystem?
A. Real-time data streaming
B. Analyzing large datasets using a high-level scripting language called Pig Latin
C. Managing HDFS metadata
D. Scheduling workflows

Correct Answer: B - Analyzing large datasets using a high-level scripting language called Pig Latin

Apache Pig is a high-level platform for analyzing large datasets that uses a scripting language called Pig Latin. It abstracts the complexity of MapReduce programming and automatically converts Pig Latin scripts into MapReduce jobs.

Q40.
What is the Hadoop Common module?
A. The set of common utilities and libraries that support other Hadoop modules
B. The data storage component of Hadoop
C. The resource management framework
D. The MapReduce processing engine

Correct Answer: A - The set of common utilities and libraries that support other Hadoop modules

Hadoop Common contains the common utilities, libraries, and Java Archive (JAR) files that are required by other Hadoop modules. It provides the foundation including file system abstractions, RPC, and serialization.

Q41.
What improvement did HDFS Federation introduce?
A. It added encryption support
B. It added support for Windows
C. It increased the default block size
D. It allowed multiple NameNodes to manage separate namespaces independently

Correct Answer: D - It allowed multiple NameNodes to manage separate namespaces independently

HDFS Federation (introduced in Hadoop 2.x) allows multiple independent NameNodes to manage separate namespace volumes. This improves scalability by distributing the namespace load and allows each NameNode to operate independently.

Q42.
What is the purpose of the Standby NameNode in HDFS High Availability?
A. To perform data compression
B. To run MapReduce jobs
C. To act as a hot standby that can take over if the Active NameNode fails
D. To manage DataNode heartbeats only

Correct Answer: C - To act as a hot standby that can take over if the Active NameNode fails

In HDFS High Availability (HA), the Standby NameNode maintains an up-to-date copy of the namespace state and can quickly take over if the Active NameNode fails. With automatic failover configured, the switch happens without data loss or significant downtime.

Q43.
Which file format provides both row-based and columnar storage in Hadoop?
A. CSV
B. Plain Text
C. Apache ORC
D. JSON

Correct Answer: C - Apache ORC

Apache ORC (Optimized Row Columnar) divides rows into large stripes and stores the data column by column within each stripe, which is how it combines row-based grouping with columnar storage. It offers efficient compression and predicate pushdown and is widely used with Apache Hive, although Hive's default file format is plain text.

Q44.
What is speculative execution in Hadoop?
A. Running duplicate copies of slow tasks on other nodes and using the result from whichever finishes first
B. Running tasks before they are scheduled
C. Predicting future data patterns
D. Executing tasks without input data

Correct Answer: A - Running duplicate copies of slow tasks on other nodes and using the result from whichever finishes first

Speculative execution is an optimization where Hadoop detects tasks that are running slower than average and launches duplicate (speculative) copies on other nodes. The result from whichever copy finishes first is used, and the other is killed.
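
The two halves of this idea, spotting stragglers and taking the first result, can be sketched as follows. This is a deliberately simplified heuristic, not Hadoop's actual estimator: the function names and the rule "progress below half of the mean" are assumptions for illustration.

```python
def pick_speculative(progress: dict, threshold: float = 0.5) -> list:
    """Return tasks whose progress is below threshold * mean progress."""
    mean = sum(progress.values()) / len(progress)
    return sorted(task for task, done in progress.items() if done < threshold * mean)

def first_finisher(finish_times: dict) -> str:
    """Given {attempt_id: finish time in seconds}, return the attempt whose result is kept."""
    return min(finish_times, key=finish_times.get)

progress = {"task_1": 0.9, "task_2": 0.85, "task_3": 0.2}
print(pick_speculative(progress))                     # ['task_3']
print(first_finisher({"task_3_attempt_0": 120.0,
                      "task_3_attempt_1": 45.0}))     # task_3_attempt_1
```

In the example the speculative attempt finishes first, so its output is used and the original attempt would be killed.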

Q45.
Which protocol does HDFS use for communication between the NameNode and DataNodes?
A. HTTP only
B. FTP
C. SMTP
D. RPC (Remote Procedure Call)

Correct Answer: D - RPC (Remote Procedure Call)

HDFS uses RPC (Remote Procedure Call) for communication between the NameNode and DataNodes. DataNodes send heartbeats and block reports to the NameNode via RPC, and the NameNode sends commands back to DataNodes through the same mechanism.

Q46.
What scale of data is HDFS designed to handle efficiently?
A. Terabytes to Petabytes
B. Gigabytes
C. Megabytes
D. Kilobytes

Correct Answer: A - Terabytes to Petabytes

HDFS is designed to handle data ranging from terabytes to petabytes in size. It is optimized for storing and processing very large files distributed across clusters of commodity hardware, not for many small files.

Q47.
What is a block report in HDFS?
A. A report of failed blocks
B. A user-generated report of file sizes
C. A periodic message from DataNodes to the NameNode listing all blocks stored on that DataNode
D. A log of block creation times

Correct Answer: C - A periodic message from DataNodes to the NameNode listing all blocks stored on that DataNode

A block report is a periodic message sent by each DataNode to the NameNode containing a list of all HDFS blocks stored on that DataNode. The NameNode uses these reports to maintain its block-to-DataNode mapping and detect missing replicas.
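
How block reports feed the block-to-DataNode mapping can be sketched by inverting the reports. The data structures and function names below are assumptions for illustration, not the NameNode's internal representation.

```python
from collections import defaultdict

def build_block_map(block_reports: dict) -> dict:
    """Invert {datanode: [block_ids]} into {block_id: set of datanodes}."""
    block_map = defaultdict(set)
    for datanode, block_ids in block_reports.items():
        for block_id in block_ids:
            block_map[block_id].add(datanode)
    return dict(block_map)

def missing_replicas(block_map: dict, replication: int = 3) -> dict:
    """Return {block_id: number of replicas still needed}."""
    return {
        block_id: replication - len(nodes)
        for block_id, nodes in block_map.items()
        if len(nodes) < replication
    }

reports = {
    "dn1": ["blk_1", "blk_2"],
    "dn2": ["blk_1"],
    "dn3": ["blk_1", "blk_2"],
}
block_map = build_block_map(reports)
print(missing_replicas(block_map))  # {'blk_2': 1}
```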

Q48.
What is the small files problem in HDFS?
A. Each file occupies metadata in the NameNode's memory, so millions of small files can exhaust NameNode memory
B. HDFS cannot store files smaller than 128 MB
C. Small files are automatically deleted
D. Small files cannot be replicated

Correct Answer: A - Each file occupies metadata in the NameNode's memory, so millions of small files can exhaust NameNode memory

The small files problem occurs because every file, directory, and block in HDFS is represented as an object in the NameNode's memory (~150 bytes each). Millions of small files consume excessive NameNode memory and degrade performance.
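
A back-of-the-envelope estimate shows why this matters. The sketch below uses the approximate 150 bytes per object quoted above and counts one object per file, per block, and per directory. The function name is hypothetical and the result is a rough order of magnitude, not a sizing rule.

```python
BYTES_PER_OBJECT = 150  # approximate figure quoted in the explanation above

def namenode_memory_bytes(num_files: int, blocks_per_file: int = 1,
                          num_dirs: int = 0) -> int:
    """Rough NameNode heap needed for the given number of namespace objects."""
    objects = num_files + num_files * blocks_per_file + num_dirs
    return objects * BYTES_PER_OBJECT

# 10 million small files, each occupying a single block:
print(namenode_memory_bytes(10_000_000))                     # 3000000000 bytes, about 3 GB
# The same number of blocks held in 1,000 files of 10,000 blocks each:
print(namenode_memory_bytes(1_000, blocks_per_file=10_000))  # 1500150000 bytes, about 1.5 GB
```

Packing the same blocks into fewer, larger files roughly halves the object count, because the per-file entries disappear.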

Q49.
Which YARN scheduler provides guaranteed minimum resource shares to each queue?
A. Capacity Scheduler
B. FIFO Scheduler
C. Fair Scheduler
D. Priority Scheduler

Correct Answer: A - Capacity Scheduler

The Capacity Scheduler allows sharing a cluster along organizational lines by providing guaranteed minimum capacity to each queue. Each organization gets a guaranteed share of cluster resources, and excess capacity can be shared with others.

Q50.
What is the function of the JournalNode in HDFS High Availability?
A. To store actual data blocks
B. To replace the DataNode
C. To maintain a shared edit log between Active and Standby NameNodes
D. To schedule MapReduce jobs

Correct Answer: C - To maintain a shared edit log between Active and Standby NameNodes

JournalNodes maintain a shared edit log that the Active NameNode writes to and the Standby NameNode reads from to stay synchronized. Deployments typically run at least three JournalNodes, and each edit must be written to a majority (quorum) of them, so the log survives the loss of a minority of JournalNodes.
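
The quorum arithmetic is simple majority voting. The sketch below, with hypothetical function names, shows how many JournalNodes must acknowledge an edit and how many can fail for a given ensemble size.

```python
def quorum_size(num_journal_nodes: int) -> int:
    """Smallest majority of the ensemble."""
    return num_journal_nodes // 2 + 1

def tolerated_failures(num_journal_nodes: int) -> int:
    """How many JournalNodes can be down while a majority is still reachable."""
    return (num_journal_nodes - 1) // 2

for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 nodes: quorum 2, tolerates 1
# 4 nodes: quorum 3, tolerates 1
# 5 nodes: quorum 3, tolerates 2
```

An even-sized ensemble tolerates no more failures than the next smaller odd size, which is why odd numbers of JournalNodes are the usual choice.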
