MapReduce - Practice MCQs for C-CAT

50 Questions | Section B: Programming Big Data

MapReduce Question Bank for C-CAT

Topic-wise MapReduce MCQs for CDAC C-CAT preparation with answers and explanations.

Q1.
MapReduce programming model consists of:
A. Map phase only
B. Reduce phase only
C. Sort phase only
D. Map and Reduce phases

Correct Answer: D - Map and Reduce phases

MapReduce has a Map phase, which transforms input into intermediate key-value pairs, and a Reduce phase, which aggregates the values for each key.

Q2.
The Map function outputs:
A. Final results
B. Key-value pairs
C. Only keys
D. Only values

Correct Answer: B - Key-value pairs

Map function processes input and emits intermediate key-value pairs for the Reduce phase.

Q3.
Shuffle and Sort phase occurs:
A. Before Map phase
B. Between Map and Reduce phases
C. After Reduce phase
D. Only if specified

Correct Answer: B - Between Map and Reduce phases

Shuffle and Sort transfers Map output to Reducers and sorts data by keys between phases.

Q4.
In MapReduce, Combiner is:
A. Same as Reducer
B. A mini-reducer that runs on Map output
C. A file format
D. A compression algorithm

Correct Answer: B - A mini-reducer that runs on Map output

Combiner is an optional local reducer that runs on Map output to reduce network transfer.

Q5.
Partitioner in MapReduce determines:
A. Number of Map tasks
B. File split size
C. Which Reducer gets which key
D. Compression type

Correct Answer: C - Which Reducer gets which key

Partitioner determines which Reducer receives which key-value pairs, typically using hash of the key.

Q6.
Input to Map function is:
A. Entire file
B. Only file name
C. Only value
D. Key-value pair

Correct Answer: D - Key-value pair

Map receives a key-value pair. For plain text input, the key is typically the byte offset of the line in the file and the value is the line's content.

Q7.
InputFormat in MapReduce:
A. Defines how to read and split input files
B. Compresses data
C. Writes final output
D. Manages memory

Correct Answer: A - Defines how to read and split input files

InputFormat defines how input files are read and split into InputSplits for Map tasks.

Q8.
Number of Map tasks is determined by:
A. Number of Reducers
B. Number of input splits
C. Cluster size
D. User specification only

Correct Answer: B - Number of input splits

The number of Map tasks equals the number of input splits, which depends on the input data size and the block size.

Q9.
Word Count MapReduce: Map phase emits:
A. (document, count)
B. (word, 1) for each word
C. (line_number, word)
D. (total_count, word)

Correct Answer: B - (word, 1) for each word

In Word Count, Map emits (word, 1) for each word occurrence; Reduce sums the counts per word.

Q10.
Speculative execution in MapReduce:
A. Runs tasks on failed nodes
B. Runs duplicate tasks to handle slow nodes
C. Predicts task output
D. Caches intermediate results

Correct Answer: B - Runs duplicate tasks to handle slow nodes

Speculative execution runs backup copies of slow-running tasks to prevent stragglers from delaying jobs.

Q11.
Reduce function receives:
A. Single key-value pair
B. Raw input data
C. Only values
D. Key and iterator of all values for that key

Correct Answer: D - Key and iterator of all values for that key

Reduce receives a key and an iterator over all values associated with that key after shuffle/sort.

Q12.
OutputFormat in MapReduce:
A. Splits input files
B. Handles network
C. Manages Map tasks
D. Defines how to write output

Correct Answer: D - Defines how to write output

OutputFormat defines how Reduce output is written - format, location, and structure.

Q13.
Data locality in MapReduce means:
A. All data stored locally
B. Data compression
C. Moving computation to where data resides
D. Data replication

Correct Answer: C - Moving computation to where data resides

Data locality means moving the computation to the nodes where the data is stored rather than moving the data over the network.

Q14.
RecordReader in MapReduce:
A. Writes output records
B. Compresses records
C. Sorts records
D. Reads input split and generates key-value pairs

Correct Answer: D - Reads input split and generates key-value pairs

RecordReader reads an InputSplit and generates key-value pairs for the Map function.

Q15.
Default number of Reducers is:
A. 1
B. 0
C. Same as Mappers
D. Unlimited

Correct Answer: A - 1

The default number of Reducers is 1; it can be configured to suit the data size and cluster capacity.

Q16.
Counters in MapReduce are used for:
A. Counting reducers
B. Memory management
C. File compression
D. Tracking job statistics and metrics

Correct Answer: D - Tracking job statistics and metrics

Counters track various statistics like input/output records, bytes processed, and custom metrics.

Q17.
DistributedCache in MapReduce provides:
A. Read-only data distribution to all nodes
B. In-memory caching
C. Write caching
D. Network caching

Correct Answer: A - Read-only data distribution to all nodes

DistributedCache distributes read-only files (like lookup tables) to all nodes before task execution.

Q18.
Map-only job has:
A. Multiple Reducers
B. No Reduce phase
C. No Map phase
D. Only shuffle phase

Correct Answer: B - No Reduce phase

Map-only jobs set Reducers to 0, outputting Map results directly without reduce/aggregation.

Q19.
JobTracker in Hadoop 1.x was responsible for:
A. Data storage
B. Resource management and job scheduling
C. File splitting
D. Data compression

Correct Answer: B - Resource management and job scheduling

JobTracker managed resources and scheduled jobs in Hadoop 1.x. In Hadoop 2.x, YARN took over these duties, split between the ResourceManager and per-application ApplicationMasters.

Q20.
Secondary sort in MapReduce:
A. Sorts by value as well as key
B. Sorts only keys
C. Sorts files
D. Sorts reducers

Correct Answer: A - Sorts by value as well as key

Secondary sort allows sorting by both key and value, using composite keys and custom comparators.

Q21.
What are the two main phases of a MapReduce job?
A. Input and Output
B. Split and Combine
C. Sort and Merge
D. Map and Reduce

Correct Answer: D - Map and Reduce

The two main phases of a MapReduce job are the Map phase and the Reduce phase. The Map phase processes input data and produces intermediate key-value pairs, while the Reduce phase aggregates the intermediate results to produce the final output.

Q22.
What is the input format of data for a Mapper in MapReduce?
A. Key-value pairs
B. Table rows
C. JSON objects
D. Binary streams

Correct Answer: A - Key-value pairs

The Mapper receives input as key-value pairs. By default, the key is the byte offset of the line in the file, and the value is the text content of that line. The Mapper processes each pair and emits zero or more intermediate key-value pairs.

Q23.
What is the role of a Combiner in MapReduce?
A. To combine multiple input files
B. To merge sort the final output
C. To combine outputs of multiple Reducers
D. To perform local aggregation on Mapper output before sending to Reducer, reducing network traffic

Correct Answer: D - To perform local aggregation on Mapper output before sending to Reducer, reducing network traffic

A Combiner acts as a mini-reducer that runs on the Mapper node. It performs local aggregation on the Mapper's output to reduce the amount of data transferred over the network to the Reducer, improving performance.
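
The effect of local aggregation can be sketched in plain Python. This is a simulation, not Hadoop code: the function names map_words and combine are made up for illustration, and a single mapper's output is held in a list.

```python
from collections import Counter

def map_words(line):
    """Map step: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner step: sum counts per word locally, on one mapper's output."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

mapper_output = map_words("to be or not to be")
combined = combine(mapper_output)

print(len(mapper_output), "records before combining")
print(len(combined), "records after combining:", combined)
```

Fewer records leave the mapper after combining, which is where the saving in network transfer comes from.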

Q24.
What is the Shuffle and Sort phase in MapReduce?
A. The initial input splitting phase
B. The process of transferring Mapper output to Reducers and sorting by key
C. The final output writing phase
D. The data validation phase

Correct Answer: B - The process of transferring Mapper output to Reducers and sorting by key

The Shuffle and Sort phase occurs between the Map and Reduce phases. It transfers the intermediate key-value pairs from Mappers to the appropriate Reducers (shuffle) and sorts them by key so that all values for the same key are grouped together.

Q25.
How many Reducers can a MapReduce job have?
A. Zero or more, configurable by the user
B. Exactly the same number as Mappers
C. Exactly one
D. Only two

Correct Answer: A - Zero or more, configurable by the user

A MapReduce job can have zero or more Reducers, configurable by the user. Setting zero Reducers means only the Map phase runs (map-only job). Multiple Reducers allow parallel reduction of different key ranges.

Q26.
What is the default Partitioner in MapReduce?
A. RangePartitioner
B. RandomPartitioner
C. RoundRobinPartitioner
D. HashPartitioner

Correct Answer: D - HashPartitioner

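The default Partitioner in MapReduce is the HashPartitioner. It uses the hash of the key modulo the number of Reducers to determine which Reducer will receive a particular key-value pair. This sends every pair with the same key to the same Reducer and spreads keys roughly evenly when their hash codes are well distributed.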
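
A minimal sketch of hash partitioning, written in plain Python rather than against the Hadoop API. It assumes string keys and borrows the formula of Java's String.hashCode() as a stand-in hash; Hadoop key types such as Text compute their own hash codes, so the reducer numbers printed here are illustrative only. The helper names are hypothetical.

```python
def java_string_hash(s):
    """Java-style String.hashCode(), wrapped to an unsigned 32-bit value."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def hash_partition(key, num_reducers):
    """Clear the sign bit, then take the remainder by the reducer count."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

for key in ["apple", "banana", "apple", "cherry"]:
    print(key, "-> reducer", hash_partition(key, 3))
```

Both occurrences of "apple" map to the same reducer, which is the property the Reduce phase depends on.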

Q27.
What is the purpose of an InputSplit in MapReduce?
A. To split the output into multiple files
B. To divide the Reducer workload
C. To represent a logical chunk of data that will be processed by a single Mapper
D. To split the network bandwidth

Correct Answer: C - To represent a logical chunk of data that will be processed by a single Mapper

An InputSplit represents a logical chunk of input data that will be processed by a single Map task. By default, an InputSplit corresponds to one HDFS block, and the number of InputSplits determines the number of Map tasks.
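
As a rough worked example, the number of Map tasks for one file can be estimated by dividing its size by the block size and rounding up. The sketch below assumes one split per block and a 128 MB block size; the function name is made up, and a real InputFormat may apply extra rules (such as minimum and maximum split sizes), so treat the result as an estimate.

```python
import math

def estimate_map_tasks(file_size_mb, block_size_mb=128):
    """Estimate the number of input splits (and so Map tasks) for one file."""
    return math.ceil(file_size_mb / block_size_mb)

print(estimate_map_tasks(300))                      # 300 MB file, 128 MB blocks
print(estimate_map_tasks(1024, block_size_mb=256))  # 1 GB file, 256 MB blocks
```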

Q28.
Which of the following is true about the Combiner function?
A. It always runs after the Reducer
B. It must produce output of the same type as the Mapper output
C. It is mandatory in every MapReduce job
D. It runs on the Reducer node

Correct Answer: B - It must produce output of the same type as the Mapper output

The Combiner must produce output with the same key-value types as the Mapper output since its output becomes input to the Reducer. It is optional (not mandatory) and runs on the Mapper node, not the Reducer node.

Q29.
In the classic word count example, what does the Mapper emit?
A. (word, total_count)
B. (word, 1) for each word encountered
C. (filename, word)
D. (line_number, word)

Correct Answer: B - (word, 1) for each word encountered

In the word count example, the Mapper tokenizes each input line into words and emits (word, 1) for each word encountered. The Reducer then sums all the 1s for each unique word to get the total count.
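
The whole word count flow can be simulated in a few lines of plain Python. This is a single-process sketch, not a Hadoop job: the three function names are illustrative, and the shuffle is modelled as an in-memory group-by.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (word, 1) for each word in one input line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle_and_sort(pairs):
    """Group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(word, counts):
    """Sum the 1s emitted for one word."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = [reduce_phase(w, c) for w, c in shuffle_and_sort(intermediate)]
print(result)
```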

Q30.
What happens if a Map task fails during execution?
A. The entire job fails immediately
B. The user must manually restart the task
C. The output file is deleted
D. The task is rescheduled on another node by the framework

Correct Answer: D - The task is rescheduled on another node by the framework

If a Map task fails, the MapReduce framework automatically reschedules it on another available node. The framework provides fault tolerance by tracking task progress and re-executing failed tasks up to a configurable number of attempts.

Q31.
What is the RecordReader responsible for in MapReduce?
A. Writing output records to HDFS
B. Sorting the Reducer output
C. Converting the data in an InputSplit into key-value pairs for the Mapper
D. Managing memory allocation for tasks

Correct Answer: C - Converting the data in an InputSplit into key-value pairs for the Mapper

The RecordReader is responsible for reading data from an InputSplit and converting it into key-value pairs that are passed to the Mapper's map() function. The default RecordReader (LineRecordReader) treats each line as a record.

Q32.
What is the output of the Reduce function?
A. Key-value pairs written to the output file system
B. Always a single value
C. Only keys without values
D. Binary data streams

Correct Answer: A - Key-value pairs written to the output file system

The Reduce function outputs zero or more key-value pairs that are written to the output file system (typically HDFS). Each Reducer writes its output to a separate output file (e.g., part-r-00000).

Q33.
What determines the number of Map tasks in a MapReduce job?
A. The number of Reducers
B. The user always specifies it manually
C. The cluster size
D. The number of input splits (typically equal to the number of HDFS blocks)

Correct Answer: D - The number of input splits (typically equal to the number of HDFS blocks)

The number of Map tasks is determined by the number of input splits, which is typically equal to the number of HDFS blocks in the input data. Each input split is processed by one Map task.

Q34.
What is a map-only job in MapReduce?
A. A job that uses only one Mapper
B. A job that maps data to a database
C. A job with zero Reducers where only the Map phase executes
D. A job that runs on a single node

Correct Answer: C - A job with zero Reducers where only the Map phase executes

A map-only job is a MapReduce job configured with zero Reducers. Only the Map phase executes, and the Mapper output is written directly to HDFS. This is useful for tasks like data transformation or filtering that don't require aggregation.

Q35.
What is the purpose of the Partitioner in MapReduce?
A. To split input data into blocks
B. To partition the HDFS namespace
C. To determine which Reducer receives which key-value pair from the Mapper
D. To divide CPU resources among tasks

Correct Answer: C - To determine which Reducer receives which key-value pair from the Mapper

The Partitioner determines which Reducer will receive a specific intermediate key-value pair output by the Mapper. It ensures that all values for the same key go to the same Reducer for correct aggregation.

Q36.
Which of the following operations is NOT suitable for a Combiner?
A. Average
B. Count
C. Sum
D. Maximum

Correct Answer: A - Average

Average is not suitable for a Combiner because averaging is not associative: the average of local averages generally differs from the overall average when the groups have different sizes. Sum, Count, and Maximum are associative and commutative, so they can be combined locally without changing the result.
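
A small worked example makes the problem concrete. The numbers are invented: one key has the values 1, 2, 3 on one mapper and 10 on another. The sketch also shows a common workaround, which is to have the Combiner pass along partial sums and counts instead of averages.

```python
def mean(values):
    return sum(values) / len(values)

# Values for one key, as seen by two different mappers.
mapper_a = [1, 2, 3]
mapper_b = [10]

true_mean = mean(mapper_a + mapper_b)

# Wrong: each combiner emits a local mean and the reducer averages the means.
mean_of_means = mean([mean(mapper_a), mean(mapper_b)])

# Right: each combiner emits (sum, count); the reducer adds both fields.
partials = [(sum(mapper_a), len(mapper_a)), (sum(mapper_b), len(mapper_b))]
total, count = map(sum, zip(*partials))
combined_mean = total / count

print(true_mean, mean_of_means, combined_mean)
```

The true mean is 4.0, the mean of means is 6.0, and the (sum, count) approach recovers 4.0.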

Q37.
What is the secondary sort technique in MapReduce?
A. Sorting the input data before the Map phase
B. Sorting data by file name
C. Running a second sort after the Reduce phase
D. Controlling the order of values within a Reducer for a given key

Correct Answer: D - Controlling the order of values within a Reducer for a given key

Secondary sort is a technique to control the order in which values arrive at a Reducer for a given key. It is achieved by creating a composite key (natural key + secondary key) and using a custom Partitioner, GroupingComparator, and SortComparator.
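
The idea can be sketched in plain Python with tuples as composite keys. The station and year data are invented for illustration. Sorting on the full tuple plays the role of the SortComparator, and grouping on the first element plays the role of the GroupingComparator; the custom Partitioner (which would route on the natural key only) is not modelled because everything runs in one process.

```python
from itertools import groupby

# ((natural key, secondary key), value), e.g. ((station, year), reading).
records = [
    (("s2", 2021), 7), (("s1", 2022), 5),
    (("s1", 2020), 9), (("s2", 2019), 4), (("s1", 2021), 3),
]

# Sort comparator: order by the full composite key.
records.sort(key=lambda kv: kv[0])

# Grouping comparator: group on the natural key only, so one reduce call
# sees every value for a station, already ordered by year.
grouped = {
    station: [(key[1], value) for key, value in group]
    for station, group in groupby(records, key=lambda kv: kv[0][0])
}
print(grouped)
```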

Q38.
What is data locality in the context of MapReduce?
A. Storing data in local databases
B. Keeping all data on a single node
C. Scheduling Map tasks on nodes where the input data physically resides to minimize network transfer
D. Encrypting data locally

Correct Answer: C - Scheduling Map tasks on nodes where the input data physically resides to minimize network transfer

Data locality means scheduling Map tasks to run on the nodes (or at least the same rack) where the input HDFS blocks are stored. This minimizes data transfer over the network, which is often the bottleneck in distributed computing.

Q39.
In what form is Mapper output held before it is sent to the Reducer?
A. Serialized key-value pairs stored in sorted spill files on local disk
B. Raw text
C. Encrypted binary
D. Compressed ZIP files

Correct Answer: A - Serialized key-value pairs stored in sorted spill files on local disk

The Mapper output is serialized into key-value pairs, written to an in-memory buffer, and when the buffer fills, it is sorted and spilled to local disk files. These sorted spill files are later merged and fetched by the Reducers.

Q40.
What is a distributed cache in MapReduce?
A. A caching layer on top of HDFS
B. A facility to distribute read-only files, archives, and JARs to all task nodes for local access
C. A distributed database
D. A cache for storing intermediate results

Correct Answer: B - A facility to distribute read-only files, archives, and JARs to all task nodes for local access

Distributed cache is a MapReduce facility that allows applications to distribute read-only files (lookup tables, dictionaries, configuration files) to all nodes running tasks, so they can be accessed locally without network overhead.

Q41.
How does MapReduce handle data skew?
A. It automatically balances data across all Reducers
B. MapReduce does not have a data skew problem
C. Data skew can cause some Reducers to receive much more data, which can be mitigated using custom Partitioners or salting keys
D. It splits large keys into smaller ones automatically

Correct Answer: C - Data skew can cause some Reducers to receive much more data, which can be mitigated using custom Partitioners or salting keys

Data skew occurs when some keys have significantly more values than others, causing uneven load on Reducers. It can be mitigated using custom Partitioners, salting keys (adding random prefixes), or using Combiners to reduce data volume.
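
Key salting can be sketched as a two-stage aggregation in plain Python. Everything here is illustrative: the key names, the NUM_SALTS value, and the helper functions are assumptions, and the salt is assigned round-robin rather than randomly so that the output is repeatable.

```python
from collections import Counter
from itertools import cycle

NUM_SALTS = 4  # hypothetical number of salt buckets

def salt_keys(pairs, hot_key):
    """Spread one hot key over NUM_SALTS sub-keys; leave other keys alone."""
    salts = cycle(range(NUM_SALTS))  # round-robin keeps the sketch deterministic
    for key, value in pairs:
        yield (f"{next(salts)}#{key}", value) if key == hot_key else (key, value)

def aggregate(pairs):
    """Sum values per key, standing in for one reduce stage."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals

pairs = [("hot", 1)] * 8 + [("cold", 1)] * 2

# Stage 1: partial sums per salted key, spread over several reducers.
stage1 = aggregate(salt_keys(pairs, "hot"))
# Stage 2: strip the salt and add the partial sums together.
stage2 = aggregate((k.split("#")[-1], v) for k, v in stage1.items())

print(dict(stage1))
print(dict(stage2))
```

The cost of salting is the second aggregation stage needed to recombine the partial results.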

Q42.
What is the OutputFormat class responsible for in MapReduce?
A. Defining how the output of Reducers is written to the file system
B. Reading input data
C. Managing task scheduling
D. Handling network communication

Correct Answer: A - Defining how the output of Reducers is written to the file system

The OutputFormat class defines how the output key-value pairs from Reducers are written to the file system. The default TextOutputFormat writes each key-value pair as a line of text. Custom OutputFormats can write to databases or other systems.

Q43.
What is the purpose of counters in MapReduce?
A. To count the number of nodes in the cluster
B. To count the number of output files
C. To track statistics and metrics about the job execution, such as records read, bytes written, etc.
D. To count the number of users

Correct Answer: C - To track statistics and metrics about the job execution, such as records read, bytes written, etc.

Counters in MapReduce are used to gather statistics about the job, such as the number of input/output records, bytes read/written, and custom application-specific metrics. They help in monitoring and debugging MapReduce jobs.

Q44.
What happens during the merge phase on the Reducer side?
A. Sorted data fetched from multiple Mappers is merged into a single sorted stream grouped by key
B. Input files are merged
C. Output files from multiple Reducers are merged
D. MapReduce jobs are merged

Correct Answer: A - Sorted data fetched from multiple Mappers is merged into a single sorted stream grouped by key

During the merge phase on the Reducer side, the sorted intermediate data fetched from multiple Mappers is merge-sorted into a single sorted stream where all values for the same key are grouped together before being passed to the reduce() function.
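
The merge step can be illustrated with Python's standard library. This is a sketch with invented data: each list stands for the already-sorted output that one Reducer fetched from one Mapper.

```python
import heapq
from itertools import groupby

# Each list is one mapper's output for this reducer, already sorted by key.
from_mapper_1 = [("apple", 1), ("cherry", 1)]
from_mapper_2 = [("apple", 1), ("banana", 1), ("cherry", 1)]

# Merge the sorted runs into one sorted stream without re-sorting everything.
merged = heapq.merge(from_mapper_1, from_mapper_2, key=lambda kv: kv[0])

# Group consecutive records by key, as the input to reduce() would be grouped.
reduce_input = [(key, [v for _, v in group])
                for key, group in groupby(merged, key=lambda kv: kv[0])]
print(reduce_input)
```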

Q45.
Which InputFormat reads entire files as a single record?
A. TextInputFormat
B. KeyValueTextInputFormat
C. WholeFileInputFormat
D. NLineInputFormat

Correct Answer: C - WholeFileInputFormat

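WholeFileInputFormat is not one of Hadoop's built-in InputFormats; it is a commonly used custom implementation that marks files as non-splittable and reads each file as a single record, with the file content as the value. The key depends on the implementation (for example, the file name). This is useful when a file should not be split across multiple Mappers.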

Q46.
What is chaining in MapReduce?
A. Connecting multiple clusters
B. Running Map and Reduce in the same JVM
C. Linking multiple MapReduce jobs sequentially where the output of one job becomes the input of the next
D. Linking multiple HDFS blocks

Correct Answer: C - Linking multiple MapReduce jobs sequentially where the output of one job becomes the input of the next

Chaining in MapReduce refers to running multiple MapReduce jobs sequentially, where the output of one job becomes the input for the next. This is used for complex data processing pipelines that cannot be accomplished in a single MapReduce job.

Q47.
What is the purpose of the setup() method in a Mapper class?
A. To emit key-value pairs
B. To write output to HDFS
C. To perform one-time initialization before any map() calls, such as loading configuration or resources
D. To sort intermediate data

Correct Answer: C - To perform one-time initialization before any map() calls, such as loading configuration or resources

The setup() method in a Mapper class is called once before any map() calls for that task. It is used for one-time initialization such as loading configuration parameters, opening database connections, or reading files from the distributed cache.

Q48.
What is a spill in the context of MapReduce?
A. Writing the in-memory buffer of Mapper output to local disk when the buffer reaches a threshold
B. Data loss during processing
C. Network packet loss
D. Overflowing the Reducer memory

Correct Answer: A - Writing the in-memory buffer of Mapper output to local disk when the buffer reaches a threshold

A spill occurs when the in-memory circular buffer used to store Mapper output reaches its threshold (default 80%). The buffered data is sorted by key, optionally combined, and written (spilled) to a local disk file. Multiple spill files may be merged later.
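
The spill behaviour can be approximated with a toy buffer in plain Python. This is a deliberate simplification: the real buffer is sized in bytes, while this sketch counts records, and the function name and the capacity of 5 records are invented. Only the 80% threshold comes from the explanation above.

```python
def run_mapper_buffer(records, capacity=5, spill_percent=80):
    """Buffer map output; sort and spill once the buffer reaches the threshold."""
    threshold = capacity * spill_percent // 100  # 4 records for these defaults
    buffer, spills = [], []
    for record in records:
        buffer.append(record)
        if len(buffer) >= threshold:
            spills.append(sorted(buffer))  # sort by key before writing out
            buffer = []
    if buffer:  # final flush when the mapper finishes
        spills.append(sorted(buffer))
    return spills

records = [(w, 1) for w in "d a c b f e a d c".split()]
spills = run_mapper_buffer(records)
for i, spill in enumerate(spills):
    print("spill", i, spill)
```

Each spill is sorted on its own, which is why the spill files still need a merge before Reducers fetch them.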

Q49.
In MapReduce, what is a join operation and how is it typically implemented?
A. Using map-side or reduce-side join techniques to combine data from multiple input datasets
B. Using SQL JOIN syntax directly
C. Joins are not possible in MapReduce
D. Using a dedicated JoinNode

Correct Answer: A - Using map-side or reduce-side join techniques to combine data from multiple input datasets

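Joins in MapReduce are implemented as map-side joins or reduce-side joins. A map-side join either merges inputs that are already sorted and identically partitioned, or loads a small dataset into every Mapper through the distributed cache. A reduce-side join tags records from each source and sends them to Reducers, which match and combine records by key.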
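
A reduce-side join can be sketched in plain Python as tag, group, and pair. The two datasets and the "U"/"O" tags are hypothetical, and the shuffle is modelled as an in-memory dictionary rather than a distributed transfer.

```python
from collections import defaultdict

users = [(1, "Asha"), (2, "Ravi")]               # hypothetical (user_id, name)
orders = [(1, "book"), (2, "pen"), (1, "lamp")]  # hypothetical (user_id, item)

# Map: tag each record with its source so the reducer can tell them apart.
tagged = [(uid, ("U", name)) for uid, name in users] + \
         [(uid, ("O", item)) for uid, item in orders]

# Shuffle: bring all records with the same join key together.
by_key = defaultdict(list)
for uid, record in tagged:
    by_key[uid].append(record)

# Reduce: pair every user record with every order record for that key.
joined = []
for uid in sorted(by_key):
    names = [v for tag, v in by_key[uid] if tag == "U"]
    items = [v for tag, v in by_key[uid] if tag == "O"]
    joined += [(uid, name, item) for name in names for item in items]
print(joined)
```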

Q50.
What is the default number of reduce tasks in a MapReduce job?
A. 0
B. 1
C. Equal to number of map tasks
D. Equal to number of nodes

Correct Answer: B - 1

The default number of reduce tasks in a MapReduce job is 1. This means all intermediate key-value pairs are sent to a single Reducer. Users can configure more Reducers using job.setNumReduceTasks() for better parallelism.
