MapReduce Question Bank for C-CAT
Topic-wise MapReduce MCQs for CDAC C-CAT preparation with answers and explanations.
Correct Answer: D - Map and Reduce phases
MapReduce has two phases: the Map phase transforms input data into key-value pairs, and the Reduce phase aggregates the values for each key.
Correct Answer: B - Key-value pairs
The Map function processes each input record and emits intermediate key-value pairs for the Reduce phase.
Correct Answer: B - Between Map and Reduce phases
Shuffle and Sort transfers Map output to Reducers and sorts data by keys between phases.
Correct Answer: B - A mini-reducer that runs on Map output
Combiner is an optional local reducer that runs on Map output to reduce network transfer.
Correct Answer: C - Which Reducer gets which key
Partitioner determines which Reducer receives which key-value pairs, typically using hash of the key.
Correct Answer: D - Key-value pair
Map receives a key-value pair; with the default text input, the key is the byte offset of the line in the file and the value is the line content.
Correct Answer: A - Defines how to read and split input files
InputFormat defines how input files are read and split into InputSplits for Map tasks.
Correct Answer: B - Number of input splits
The number of Map tasks equals the number of input splits, which depends on the input data size and the split size (by default, the HDFS block size).
Correct Answer: B - (word, 1) for each word
In Word Count, Map emits (word, 1) for each word occurrence; Reduce sums the counts per word.
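The Word Count flow can be sketched in plain Python. This is a single-process illustration under simplifying assumptions, not the Hadoop API; the names `map_words`, `reduce_counts`, and `word_count` are hypothetical, and whitespace splitting stands in for real tokenization.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit (word, 1) for every word occurrence in the line."""
    return [(word, 1) for word in line.split()]

def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)

def word_count(lines):
    # Grouping by key here stands in for the shuffle and sort step.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_words(line):
            grouped[word].append(one)
    return dict(reduce_counts(w, c) for w, c in sorted(grouped.items()))

result = word_count(["to be or not", "to be"])
```

Here `result` maps each distinct word to its total, for example `"to"` to 2 and `"not"` to 1.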
Correct Answer: B - Runs duplicate tasks to handle slow nodes
Speculative execution runs backup copies of slow-running tasks to prevent stragglers from delaying jobs.
Correct Answer: D - Key and iterator of all values for that key
Reduce receives a key and an iterator over all values associated with that key after shuffle/sort.
Correct Answer: D - Defines how to write output
OutputFormat defines how Reduce output is written - format, location, and structure.
Correct Answer: C - Moving computation to where data resides
Data locality moves computation to nodes where data is stored rather than moving data over network.
Correct Answer: D - Reads input split and generates key-value pairs
RecordReader reads an InputSplit and generates key-value pairs for the Map function.
Correct Answer: A - 1
Default number of Reducers is 1, but can be configured based on data size and cluster capacity.
Correct Answer: D - Tracking job statistics and metrics
Counters track various statistics like input/output records, bytes processed, and custom metrics.
Correct Answer: A - Read-only data distribution to all nodes
DistributedCache distributes read-only files (like lookup tables) to all nodes before task execution.
Correct Answer: B - No Reduce phase
Map-only jobs set Reducers to 0, outputting Map results directly without reduce/aggregation.
Correct Answer: B - Resource management and job scheduling
In Hadoop 1.x, the JobTracker handled both resource management and job scheduling. In Hadoop 2.x, YARN split these duties between the ResourceManager and a per-application ApplicationMaster.
Correct Answer: A - Sorts by value as well as key
Secondary sort orders the values within each key as well as the keys themselves, using composite keys and custom comparators.
Correct Answer: D - Map and Reduce
The two main phases of a MapReduce job are the Map phase and the Reduce phase. The Map phase processes input data and produces intermediate key-value pairs, while the Reduce phase aggregates the intermediate results to produce the final output.
Correct Answer: A - Key-value pairs
The Mapper receives input as key-value pairs. By default, the key is the byte offset of the line in the file, and the value is the text content of that line. The Mapper processes each pair and emits zero or more intermediate key-value pairs.
Correct Answer: D - To perform local aggregation on Mapper output before sending to Reducer, reducing network traffic
A Combiner acts as a mini-reducer that runs on the Mapper node. It performs local aggregation on the Mapper's output to reduce the amount of data transferred over the network to the Reducer, improving performance.
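To see why local aggregation cuts network traffic, the following sketch counts how many records one Mapper would ship with and without a Combiner. It is a plain-Python simulation, not Hadoop code; `map_words` and `combine` are hypothetical names.

```python
from collections import Counter

def map_words(line):
    """Map step: emit (word, 1) for every word occurrence."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Local aggregation on one mapper's output: sum the counts per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

mapper_output = map_words("the cat and the hat and the bat")
combined = combine(mapper_output)

records_without_combiner = len(mapper_output)  # one record per word occurrence
records_with_combiner = len(combined)          # one record per distinct word
```

In this input the Mapper emits 8 records, but only 5 leave the node after combining, and the Reducer still arrives at the same totals.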
Correct Answer: B - The process of transferring Mapper output to Reducers and sorting by key
The Shuffle and Sort phase occurs between the Map and Reduce phases. It transfers the intermediate key-value pairs from Mappers to the appropriate Reducers (shuffle) and sorts them by key so that all values for the same key are grouped together.
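A minimal single-process sketch of this step, assuming the whole intermediate dataset fits in memory (which a real cluster does not assume). The function name `shuffle_and_sort` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(mapper_outputs):
    """Collect pairs from all mappers, sort by key, and group values per key."""
    all_pairs = [pair for output in mapper_outputs for pair in output]
    all_pairs.sort(key=itemgetter(0))  # sort on the key only
    return [(key, [value for _, value in group])
            for key, group in groupby(all_pairs, key=itemgetter(0))]

mapper_outputs = [
    [("b", 1), ("a", 1)],  # output of a first mapper
    [("a", 1), ("c", 1)],  # output of a second mapper
]
grouped = shuffle_and_sort(mapper_outputs)
```

Each entry of `grouped` is what a single reduce call would receive: one key and all of its values.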
Correct Answer: A - Zero or more, configurable by the user
A MapReduce job can have zero or more Reducers, configurable by the user. Setting zero Reducers means only the Map phase runs (map-only job). Multiple Reducers allow parallel reduction of different key ranges.
Correct Answer: D - HashPartitioner
The default Partitioner in MapReduce is the HashPartitioner. It takes the hash of the key modulo the number of Reducers to decide which Reducer receives a key-value pair. This spreads keys roughly evenly across Reducers, although a few very frequent keys can still overload one Reducer.
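The hash-modulo rule can be sketched as follows. Python's built-in `hash()` for strings varies between runs, so this sketch uses a reimplementation of Java's `String.hashCode()` as a stand-in hash; that choice is an assumption for illustration (Hadoop's own key types define their own `hashCode()`), and the function names are hypothetical. The sign bit is masked off so the result is never negative.

```python
def java_string_hash(s):
    """Mimic Java's String.hashCode() as a signed 32-bit integer (ASCII input)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def hash_partition(key, num_reducers):
    """Clear the sign bit, then take the remainder by the reducer count."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers

keys = ["apple", "banana", "cherry", "apple"]
partitions = [hash_partition(k, 4) for k in keys]
```

The same key always lands in the same partition, which is what lets one Reducer see every value for that key.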
Correct Answer: C - To represent a logical chunk of data that will be processed by a single Mapper
An InputSplit represents a logical chunk of input data that will be processed by a single Map task. By default, an InputSplit corresponds to one HDFS block, and the number of InputSplits determines the number of Map tasks.
Correct Answer: B - It must produce output of the same type as the Mapper output
The Combiner's input and output key-value types must both match the Mapper output types, since the Combiner's output becomes input to the Reducer. It is optional, the framework may run it zero, one, or several times, and it runs on the Mapper node, not the Reducer node.
Correct Answer: B - (word, 1) for each word encountered
In the word count example, the Mapper tokenizes each input line into words and emits (word, 1) for each word encountered. The Reducer then sums all the 1s for each unique word to get the total count.
Correct Answer: D - The task is rescheduled on another node by the framework
If a Map task fails, the MapReduce framework automatically reschedules it on another available node. The framework provides fault tolerance by tracking task progress and re-executing failed tasks up to a configurable number of attempts.
Correct Answer: C - Converting the data in an InputSplit into key-value pairs for the Mapper
The RecordReader is responsible for reading data from an InputSplit and converting it into key-value pairs that are passed to the Mapper's map() function. The default RecordReader (LineRecordReader) treats each line as a record.
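The line-oriented behaviour can be imitated in a few lines of Python. This is only a sketch of the idea, assuming UTF-8 text held entirely in memory; it ignores how a real reader handles lines that cross split boundaries, and `line_records` is a hypothetical name.

```python
def line_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line without
        # its trailing newline characters.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)

records = list(line_records(b"hello world\nfoo\nbar baz\n"))
```

Each tuple in `records` corresponds to one call of the map function: the first line starts at offset 0, the second at offset 12, and so on.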
Correct Answer: A - Key-value pairs written to the output file system
The Reduce function outputs zero or more key-value pairs that are written to the output file system (typically HDFS). Each Reducer writes its output to a separate output file (e.g., part-r-00000).
Correct Answer: D - The number of input splits (typically equal to the number of HDFS blocks)
The number of Map tasks is determined by the number of input splits, which is typically equal to the number of HDFS blocks in the input data. Each input split is processed by one Map task.
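As a rough worked example, the number of Map tasks for one splittable file can be estimated by dividing its size by the split size and rounding up. The 128 MB block size below is an assumption, and the estimate ignores minimum and maximum split-size settings and any tolerance the framework applies to the final split; `estimate_map_tasks` is a hypothetical helper.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # assumed HDFS block size for this example

def estimate_map_tasks(file_size_bytes, split_size_bytes=BLOCK_SIZE):
    """Approximate split count for one file: ceiling of size / split size."""
    return -(-file_size_bytes // split_size_bytes)  # integer ceiling division

one_gb_file = estimate_map_tasks(1024 * MB)   # 1024 / 128 = 8 splits
odd_sized_file = estimate_map_tasks(300 * MB) # 2 full splits plus a partial one
```

Under these assumptions a 1 GB file yields 8 Map tasks and a 300 MB file yields 3.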
Correct Answer: C - A job with zero Reducers where only the Map phase executes
A map-only job is a MapReduce job configured with zero Reducers. Only the Map phase executes, and the Mapper output is written directly to HDFS. This is useful for tasks like data transformation or filtering that don't require aggregation.
Correct Answer: C - To determine which Reducer receives which key-value pair from the Mapper
The Partitioner determines which Reducer will receive a specific intermediate key-value pair output by the Mapper. It ensures that all values for the same key go to the same Reducer for correct aggregation.
Correct Answer: A - Average
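Average is not suitable as a Combiner because averaging is not associative: the average of local averages generally differs from the overall average when the groups have different sizes. Sum, Count, and Maximum are associative and commutative, so they can be combined locally. An average can still be computed correctly by having the Combiner emit (sum, count) pairs and dividing only in the Reducer.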
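A small worked example makes the problem concrete. The numbers are made up for illustration, and the helper names are hypothetical. It also shows one common workaround: ship (sum, count) partials and divide only at the end.

```python
def mean(values):
    return sum(values) / len(values)

# Values for one key, as seen by two different mappers.
mapper_a = [10, 20, 30]  # local mean is 20.0
mapper_b = [50]          # local mean is 50.0

true_mean = mean(mapper_a + mapper_b)                   # 110 / 4
mean_of_means = mean([mean(mapper_a), mean(mapper_b)])  # (20 + 50) / 2

def combine_partial(values):
    """Combiner-safe partial result: (sum, count) instead of a mean."""
    return (sum(values), len(values))

def reduce_mean(partials):
    """Reducer: add up the partial sums and counts, then divide once."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

safe_mean = reduce_mean([combine_partial(mapper_a), combine_partial(mapper_b)])
```

Here `true_mean` is 27.5 while `mean_of_means` is 35.0; `safe_mean` recovers 27.5 because sums and counts can be added in any grouping.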
Correct Answer: D - Controlling the order of values within a Reducer for a given key
Secondary sort is a technique to control the order in which values arrive at a Reducer for a given key. It is achieved by creating a composite key (natural key + secondary key) and using a custom Partitioner, GroupingComparator, and SortComparator.
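The composite-key idea can be sketched in plain Python, with sorting on the full key playing the role of the sort comparator and grouping on the natural key playing the role of the grouping comparator. The station and year records are hypothetical sample data; in a real job the Partitioner would also need to partition on the natural key alone.

```python
from itertools import groupby

# Hypothetical records: (station, year, temperature)
records = [("s1", 2001, 5), ("s2", 1999, 7), ("s1", 1999, 3), ("s1", 2000, 9)]

# Composite key = (natural key, secondary key); the value is carried along.
keyed = [((station, year), temp) for station, year, temp in records]

# Sort comparator: order by the full composite key.
keyed.sort(key=lambda kv: kv[0])

# Grouping comparator: group on the natural key only, so one reduce call
# sees all values for a station, already ordered by year.
for_reducer = {
    station: [(key[1], temp) for key, temp in group]
    for station, group in groupby(keyed, key=lambda kv: kv[0][0])
}
```

For station `s1`, the values arrive in year order (1999, 2000, 2001) without the Reducer sorting anything itself.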
Correct Answer: C - Scheduling Map tasks on nodes where the input data physically resides to minimize network transfer
Data locality means scheduling Map tasks to run on the nodes (or at least the same rack) where the input HDFS blocks are stored. This minimizes data transfer over the network, which is often the bottleneck in distributed computing.
Correct Answer: A - Serialized key-value pairs stored in sorted spill files on local disk
Mapper output is serialized and written to an in-memory buffer. When the buffer fills past a threshold, its contents are partitioned, sorted by key, and spilled to files on local disk. These sorted spill files are later merged and fetched by the Reducers.
Correct Answer: B - A facility to distribute read-only files, archives, and JARs to all task nodes for local access
Distributed cache is a MapReduce facility that allows applications to distribute read-only files (lookup tables, dictionaries, configuration files) to all nodes running tasks, so they can be accessed locally without network overhead.
Correct Answer: C - Data skew can cause some Reducers to receive much more data, which can be mitigated using custom Partitioners or salting keys
Data skew occurs when some keys have significantly more values than others, causing uneven load on Reducers. It can be mitigated using custom Partitioners, salting keys (adding random prefixes), or using Combiners to reduce data volume.
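Key salting can be sketched as a two-pass aggregation. This sketch assigns salts round-robin so the result is reproducible, whereas a random salt is the more usual choice; the number of salt buckets, the underscore separator, and all function names are assumptions for illustration.

```python
from collections import Counter
from itertools import count

NUM_SALTS = 4  # hypothetical number of salt buckets for the hot key

def salt_keys(pairs, hot_key, num_salts=NUM_SALTS):
    """Spread one hot key over several salted keys (round-robin here)."""
    ticket = count()
    salted = []
    for key, value in pairs:
        if key == hot_key:
            key = f"{next(ticket) % num_salts}_{key}"
        salted.append((key, value))
    return salted

def unsalt(key):
    """Strip a numeric salt prefix; leave unsalted keys unchanged."""
    prefix, sep, rest = key.partition("_")
    return rest if sep and prefix.isdigit() else key

pairs = [("hot", 1)] * 8 + [("cold", 1)] * 2
salted = salt_keys(pairs, "hot")

# First pass: partial sums per salted key (each could go to a different reducer).
partial = Counter()
for key, value in salted:
    partial[key] += value

# Second pass: remove the salt and combine the partial sums.
final = Counter()
for key, value in partial.items():
    final[unsalt(key)] += value
```

The 8 records for the hot key are split into 4 groups of 2 in the first pass, so no single Reducer handles all of them, and the second pass restores the correct total.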
Correct Answer: A - Defining how the output of Reducers is written to the file system
The OutputFormat class defines how the output key-value pairs from Reducers are written to the file system. The default TextOutputFormat writes each key-value pair as a line of text. Custom OutputFormats can write to databases or other systems.
Correct Answer: C - To track statistics and metrics about the job execution, such as records read, bytes written, etc.
Counters in MapReduce are used to gather statistics about the job, such as the number of input/output records, bytes read/written, and custom application-specific metrics. They help in monitoring and debugging MapReduce jobs.
Correct Answer: A - Sorted data fetched from multiple Mappers is merged into a single sorted stream grouped by key
During the merge phase on the Reducer side, the sorted intermediate data fetched from multiple Mappers is merge-sorted into a single sorted stream where all values for the same key are grouped together before being passed to the reduce() function.
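The merge can be illustrated with Python's `heapq.merge`, which performs a k-way merge of already-sorted inputs. The three runs below are made-up sample data standing in for the sorted segments fetched from three Mappers.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each list is one mapper's output for this reducer, already sorted by key.
run_1 = [("a", 1), ("c", 1)]
run_2 = [("a", 1), ("b", 1), ("c", 1)]
run_3 = [("b", 1)]

# A k-way merge keeps the combined stream sorted without re-sorting everything.
merged = heapq.merge(run_1, run_2, run_3, key=itemgetter(0))

# Group consecutive pairs with the same key into one reduce input.
reduce_input = [(key, [value for _, value in group])
                for key, group in groupby(merged, key=itemgetter(0))]
```

Because every run is sorted, equal keys end up adjacent in the merged stream, so grouping needs only a single pass.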
Correct Answer: C - WholeFileInputFormat
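WholeFileInputFormat is not a class shipped with Hadoop; it is a commonly used custom InputFormat that marks files as non-splittable and reads an entire file as a single record, with the file content as the value. Implementations differ on the key: some use the file name, others use an empty key. This is useful when files should not be split across multiple Mappers.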
Correct Answer: C - Linking multiple MapReduce jobs sequentially where the output of one job becomes the input of the next
Chaining in MapReduce refers to running multiple MapReduce jobs sequentially, where the output of one job becomes the input for the next. This is used for complex data processing pipelines that cannot be accomplished in a single MapReduce job.
Correct Answer: C - To perform one-time initialization before any map() calls, such as loading configuration or resources
The setup() method in a Mapper class is called once before any map() calls for that task. It is used for one-time initialization such as loading configuration parameters, opening database connections, or reading files from the distributed cache.
Correct Answer: A - Writing the in-memory buffer of Mapper output to local disk when the buffer reaches a threshold
A spill occurs when the in-memory circular buffer that holds Mapper output reaches its threshold (80% of the buffer by default). The buffered data is partitioned, sorted by key within each partition, optionally combined, and written (spilled) to a file on local disk. Multiple spill files are merged into one output file when the Map task finishes.
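A toy model of the spill mechanism, under heavy simplifications: capacity is counted in records rather than bytes, there is a single partition, no combiner, and spilling happens inline rather than in a background thread. The class `SpillBuffer` and its fields are hypothetical.

```python
class SpillBuffer:
    """Toy map-side buffer: write a sorted run whenever the threshold is hit."""

    def __init__(self, capacity, spill_percent=0.80):
        self.limit = int(capacity * spill_percent)  # threshold in records
        self.records = []
        self.spills = []  # each entry stands in for one sorted file on disk

    def collect(self, key, value):
        self.records.append((key, value))
        if len(self.records) >= self.limit:
            self.spill()

    def spill(self):
        if self.records:
            self.spills.append(sorted(self.records))
            self.records = []

buf = SpillBuffer(capacity=5)  # 80% of 5 gives a threshold of 4 records
for key in "dbcafe":
    buf.collect(key, 1)
buf.spill()  # final flush when the map task ends
```

Six records with a threshold of four produce two sorted runs, one with keys a to d and one with keys e and f, which would then be merged.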
Correct Answer: A - Using map-side or reduce-side join techniques to combine data from multiple input datasets
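Joins in MapReduce are implemented as map-side or reduce-side joins. A map-side join either loads a small dataset into every Mapper through the distributed cache (a replicated join) or merges large inputs that are already sorted and identically partitioned on the join key. In a reduce-side join, records from each source are tagged, sent to Reducers by join key, and matched there.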
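A reduce-side join can be sketched in plain Python as follows. The `users` and `orders` datasets, the "U" and "O" tags, and the function names are all hypothetical; a dictionary stands in for the shuffle that groups tagged records by join key.

```python
from collections import defaultdict

# Hypothetical inputs: users(id, name) and orders(user_id, item).
users = [(1, "asha"), (2, "ravi")]
orders = [(1, "book"), (2, "pen"), (1, "lamp")]

def map_tagged(users, orders):
    """Map step: tag each record with its source and key it by the join key."""
    for user_id, name in users:
        yield user_id, ("U", name)
    for user_id, item in orders:
        yield user_id, ("O", item)

def reduce_join(key, tagged_values):
    """Reduce step: pair every user record with every order for the key."""
    names = [v for tag, v in tagged_values if tag == "U"]
    items = [v for tag, v in tagged_values if tag == "O"]
    return [(key, name, item) for name in names for item in items]

# Stand-in for shuffle: collect all tagged values per join key.
grouped = defaultdict(list)
for key, tagged in map_tagged(users, orders):
    grouped[key].append(tagged)

joined = [row for key in sorted(grouped) for row in reduce_join(key, grouped[key])]
```

Each row of `joined` combines one user with one of that user's orders, which corresponds to an inner join on the user id.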
Correct Answer: B - 1
The default number of reduce tasks in a MapReduce job is 1. This means all intermediate key-value pairs are sent to a single Reducer. Users can configure more Reducers using job.setNumReduceTasks() for better parallelism.