Big Data Fundamentals Question Bank for C-CAT
Topic-wise Big Data Fundamentals MCQs for CDAC C-CAT preparation with answers and explanations.
Correct Answer: B - Validation
The 5 Vs of Big Data are Volume, Velocity, Variety, Veracity, and Value. Validation is not one of them.
Correct Answer: A - Too large for traditional database systems
Big Data refers to datasets that are too large, complex, or fast-moving for traditional database systems to handle.
Correct Answer: B - Velocity
Velocity refers to the speed at which data is generated, collected, and processed.
Correct Answer: D - Variety
Variety refers to the different types of data formats - structured (tables), semi-structured (JSON, XML), and unstructured (text, images).
Correct Answer: C - HDFS
HDFS (Hadoop Distributed File System) is designed for distributed storage across clusters of commodity hardware.
Correct Answer: C - Data Lake stores raw data in native format
A Data Lake stores raw data in its native format and applies a schema only when the data is read (schema-on-read), while a Data Warehouse requires structured data that conforms to a predefined schema before it is stored (schema-on-write).
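The contrast between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the `SCHEMA` dictionary and the functions `warehouse_insert`, `lake_store`, and `lake_read` are hypothetical names invented for this example.

```python
import json

# Hypothetical schema used only for this illustration.
SCHEMA = {"id": int, "name": str}

def warehouse_insert(table, record):
    """Schema-on-write: reject the record unless it matches SCHEMA exactly."""
    if set(record) != set(SCHEMA):
        raise ValueError("fields do not match the predefined schema")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be {expected.__name__}")
    table.append(record)

def lake_store(lake, raw_line):
    """Schema-on-read, storage side: keep the raw line untouched."""
    lake.append(raw_line)

def lake_read(lake):
    """Schema-on-read, query side: parse and shape each record when it is read."""
    for raw_line in lake:
        doc = json.loads(raw_line)
        yield {"id": int(doc["id"]), "name": str(doc.get("name", "unknown"))}

warehouse, lake = [], []
warehouse_insert(warehouse, {"id": 1, "name": "Asha"})

# The lake accepts records with extra or missing fields as they are.
lake_store(lake, '{"id": "2", "name": "Ravi", "tags": ["new"]}')
lake_store(lake, '{"id": 3}')
print(list(lake_read(lake)))
```

The warehouse function refuses anything that deviates from the schema at write time, whereas the lake defers all interpretation to `lake_read`.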
Correct Answer: C - Extract, Transform, Load
ETL stands for Extract, Transform, Load - the process of extracting data from source systems, transforming it, and loading it into a destination system.
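The three stages can be sketched with Python's standard library. This is a minimal example under stated assumptions: the sample CSV, the `payments` table, and the rule of dropping rows with unparseable amounts are all invented for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read files, APIs, or databases.
RAW_CSV = """id,name,amount
1, alice ,10.50
2,BOB,n/a
3,carol,7
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy names, cast types, and drop rows with invalid amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        clean.append((int(row["id"]), row["name"].strip().title(), amount))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (id INTEGER, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT id, name, amount FROM payments ORDER BY id").fetchall()
print(loaded)
```

Only cleaned rows reach the destination, which is the defining property of transforming before loading.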
Correct Answer: B - Credit card fraud detection
Credit card fraud detection requires real-time processing to identify suspicious transactions as they occur.
Correct Answer: B - Trustworthiness of data
Veracity refers to the quality, accuracy, and trustworthiness of the data.
Correct Answer: D - Single-user desktop application
Single-user desktop applications don't require Big Data technologies. The others involve processing large datasets.
Correct Answer: D - Processing large volumes of data at scheduled intervals
Batch processing involves processing large volumes of accumulated data at scheduled intervals rather than in real time.
Correct Answer: D - Yahoo
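Hadoop grew out of the Apache Nutch project started by Doug Cutting and Mike Cafarella, and much of its early development took place at Yahoo. Its design is based on Google's published papers on GFS and MapReduce.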
Correct Answer: B - Two of three properties simultaneously
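The CAP theorem states that a distributed system can guarantee at most two of three properties at the same time: Consistency, Availability, and Partition tolerance. In practice, when a network partition occurs, the system must choose between consistency and availability.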
Correct Answer: B - Apache Kafka Streams
Kafka Streams, the stream-processing library that ships with Apache Kafka, is designed for real-time stream processing. MapReduce is designed for batch processing.
Correct Answer: D - Distributing data across multiple databases
Sharding horizontally partitions data across multiple database instances for scalability and performance.
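A hash-based routing scheme is one common way to decide which shard holds a given key. The sketch below is illustrative only: `ShardedStore` is a hypothetical class, in-memory dictionaries stand in for separate database instances, and real systems often use range-based or consistent hashing instead of a plain modulo.

```python
import hashlib

class ShardedStore:
    """Toy key-value store that spreads keys across several shards."""

    def __init__(self, num_shards):
        # Each dict stands in for an independent database instance.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash so the same key always maps to the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)

store = ShardedStore(num_shards=4)
for user_id in ("user-1", "user-2", "user-3", "user-4", "user-5"):
    store.put(user_id, {"id": user_id})

print([len(shard) for shard in store.shards])
print(store.get("user-3"))
```

Each read or write touches only one shard, which is why adding shards can increase total capacity and throughput.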
Correct Answer: D - Social media posts
Social media posts are unstructured - they don't follow a predefined data model or schema.
Correct Answer: C - Data quality, security, and compliance
Data governance encompasses policies and processes for data quality, security, privacy, and regulatory compliance.
Correct Answer: D - Batch and stream processing
Lambda architecture combines batch processing for comprehensive analysis and stream processing for real-time views.
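A rough sketch of how the batch and stream paths fit together is shown below. It assumes a page-view counting use case, and the names `build_batch_view`, `SpeedLayer`, and `query` are hypothetical. Lambda architecture is usually described with a third serving layer that merges both views, which `query` imitates here.

```python
from collections import Counter

def build_batch_view(master_dataset):
    """Batch layer: recompute counts from the full historical dataset."""
    return Counter(event["page"] for event in master_dataset)

class SpeedLayer:
    """Speed layer: count events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["page"]] += 1

def query(page, batch_view, speed_layer):
    """Serving step: merge the batch view with the real-time view."""
    return batch_view[page] + speed_layer.realtime_view[page]

history = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
batch_view = build_batch_view(history)

speed = SpeedLayer()
speed.on_event({"page": "/home"})  # arrives after the batch job finished

print(query("/home", batch_view, speed))
```

The batch view is complete but stale, the real-time view is fresh but partial, and a query combines the two.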
Correct Answer: B - Data freshness
Data freshness measures how current or up-to-date the data is - critical for time-sensitive applications.
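One simple way to quantify freshness is the age of the most recent update compared with a tolerated maximum. The helper names `freshness` and `is_stale` and the thresholds below are assumptions made for this sketch, not a standard metric definition.

```python
from datetime import datetime, timedelta, timezone

def freshness(last_updated, now):
    """Return the age of the data as a timedelta."""
    return now - last_updated

def is_stale(last_updated, now, max_age):
    """Data is stale when its age exceeds the tolerated maximum."""
    return freshness(last_updated, now) > max_age

# Fixed timestamps keep the example deterministic.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_updated = datetime(2024, 1, 1, 11, 45, tzinfo=timezone.utc)

age = freshness(last_updated, now)
print(age)
print(is_stale(last_updated, now, max_age=timedelta(minutes=5)))
print(is_stale(last_updated, now, max_age=timedelta(hours=1)))
```

The same 15-minute-old data is stale for a use case that tolerates 5 minutes and fresh for one that tolerates an hour.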
Correct Answer: A - Adding more machines to the cluster
Horizontal scaling (scale-out) adds more machines to distribute the workload, unlike vertical scaling (scale-up), which upgrades an existing machine with more CPU, memory, or storage.
Correct Answer: D - Veracity
The original 3Vs of Big Data defined by Doug Laney are Volume, Velocity, and Variety. Veracity (data quality/trustworthiness) was added later as a 4th V.
Correct Answer: C - The speed at which data is generated and processed
Velocity refers to the speed at which data is generated, collected, and processed. Real-time or near-real-time processing of streaming data is a key challenge addressed by this V.
Correct Answer: B - Relational database tables
Relational database tables contain structured data organized in predefined rows and columns with fixed schemas. Social media posts, videos, and emails are examples of unstructured or semi-structured data.
Correct Answer: A - A centralized repository that stores raw data in its native format
A Data Lake is a centralized repository that allows you to store all your structured and unstructured data at any scale in its raw, native format. Unlike a data warehouse, it uses schema-on-read rather than schema-on-write.
Correct Answer: D - Extract, Transform, Load
ETL stands for Extract, Transform, Load. It is a process where data is extracted from source systems, transformed (cleaned, enriched, formatted) to fit operational needs, and loaded into a target database or data warehouse.
Correct Answer: C - Apache Kafka
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. It can handle trillions of events per day with high throughput and low latency.
Correct Answer: D - Data Lakes use schema-on-read while Data Warehouses use schema-on-write
Data Lakes follow a schema-on-read approach where raw data is stored without a predefined schema and structure is applied when data is read. Data Warehouses use schema-on-write where data must conform to a predefined schema before being stored.
Correct Answer: C - The economic worth derived from data analysis
Value refers to the economic worth or business insights that can be derived from analyzing Big Data. Having large volumes of data is meaningless unless valuable information and actionable insights can be extracted from it.
Correct Answer: B - Apache Sqoop
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. It can import tables from RDBMS to HDFS and export data from HDFS to RDBMS.
Correct Answer: A - Semi-structured data
XML is classified as semi-structured data. It does not conform to a strict tabular schema like structured data, but it has tags and markers that provide some organizational structure, unlike purely unstructured data such as images or free text.
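The short sketch below shows what "some organizational structure" means in practice. The sample document and its field names are invented for illustration. Both records use the same tags-based structure, yet they do not share the same set of fields, which a fixed tabular schema would not allow.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the two customers carry different fields.
XML_DOC = """<customers>
  <customer id="1"><name>Asha</name><email>asha@example.com</email></customer>
  <customer id="2"><name>Ravi</name><phone>555-0100</phone></customer>
</customers>"""

root = ET.fromstring(XML_DOC)

records = []
for customer in root.findall("customer"):
    record = {"id": customer.get("id")}
    # Tags describe each value, so fields can vary from record to record.
    for child in customer:
        record[child.tag] = child.text
    records.append(record)

print(records)
```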
Correct Answer: C - Apache Tomcat
Apache Tomcat is a web server and servlet container, not a Big Data processing framework. Apache Spark, Flink, and Storm are all frameworks used for large-scale data processing.
Correct Answer: B - The trustworthiness and quality of data
Veracity refers to the trustworthiness, accuracy, and quality of data. In Big Data, data can be noisy, incomplete, or inconsistent, making it important to assess and ensure data quality before analysis.
Correct Answer: B - Collecting and processing data in large blocks at scheduled intervals
Batch processing involves collecting data over a period of time and processing it in large blocks (batches) at scheduled intervals. It is suitable for non-time-critical processing of large volumes of data, as opposed to real-time stream processing.
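The difference between the two models can be illustrated with a running total. This is a simplified sketch: the event list and the function names `batch_total` and `stream_totals` are hypothetical, and a Python generator stands in for a real stream processor.

```python
def batch_total(events):
    """Batch: process the whole accumulated dataset in one scheduled run."""
    return sum(event["amount"] for event in events)

def stream_totals(events):
    """Stream: update the result incrementally as each event arrives."""
    running = 0
    for event in events:
        running += event["amount"]
        yield running

events = [{"amount": 10}, {"amount": 20}, {"amount": 30}]

print(batch_total(events))          # one answer, available after the run
print(list(stream_totals(events)))  # an updated answer after every event
```

Both paths reach the same final total. The batch version produces it once, after all data has been collected, while the streaming version has an up-to-date answer at every step.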
Correct Answer: A - Ingesting large volumes of log and event data into Hadoop
Apache Flume is a distributed, reliable service for efficiently collecting, aggregating, and moving large amounts of log and event data from many sources to a centralized data store like HDFS.
Correct Answer: B - The different formats, types, and sources of data
Variety refers to the different types, formats, and sources of data including structured (databases), semi-structured (JSON, XML), and unstructured (images, videos, text). Managing this diversity is a key Big Data challenge.
Correct Answer: A - Data is loaded into the target system first, then transformed
ELT (Extract, Load, Transform) loads raw data into the target system first and then transforms it within the target system. This approach leverages the processing power of modern data warehouses and data lakes, unlike traditional ETL.
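A minimal ELT sketch follows, with an in-memory SQLite database standing in for the target system. The table names `raw_payments` and `payments`, the sample rows, and the filtering rule are assumptions made for illustration.

```python
import sqlite3

# Hypothetical rows as extracted from the source: everything is still text.
raw_rows = [("1", " alice ", "10.50"), ("2", "BOB", "n/a"), ("3", "carol", "7")]

conn = sqlite3.connect(":memory:")

# Load: store the raw data in the target system without changing it.
conn.execute("CREATE TABLE raw_payments (id TEXT, name TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_payments VALUES (?, ?, ?)", raw_rows)

# Transform: use the target system's own SQL engine to clean and cast.
conn.execute("""
    CREATE TABLE payments AS
    SELECT CAST(id AS INTEGER) AS id,
           TRIM(name)          AS name,
           CAST(amount AS REAL) AS amount
    FROM raw_payments
    WHERE amount GLOB '[0-9]*'
""")

result = conn.execute("SELECT id, name, amount FROM payments ORDER BY id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_payments").fetchone()[0]
print(result, raw_count)
```

The raw table keeps all three original rows, so the transformation can be changed and rerun later without extracting the data again.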
Correct Answer: A - Distributed coordination and configuration management
Apache ZooKeeper is a centralized service for maintaining configuration information, naming, distributed synchronization, and group services. It is used by many Big Data tools, such as Kafka and HBase, for coordination.
Correct Answer: A - A collection of video surveillance footage
Video surveillance footage is unstructured data as it does not follow any predefined data model or schema. Database tables, CSV files, and spreadsheets are structured data with defined formats and schemas.
Correct Answer: C - The process of transporting data from various sources to a storage medium
Data ingestion is the process of obtaining and importing data from various sources for immediate use or storage in a database or data warehouse. It can be done in real time (streaming) or in batches.
Correct Answer: A - Apache Hive
Apache Hive provides a SQL-like query language called HiveQL (HQL) that allows users to query and manage large datasets stored in Hadoop's HDFS. It compiles HiveQL queries into jobs for an execution engine such as MapReduce, Tez, or Spark.
Correct Answer: C - Adding more machines to distribute the workload
Horizontal scaling (scale-out) involves adding more machines to a cluster to distribute the workload. This is the preferred approach in Big Data as it is more cost-effective and provides better fault tolerance compared to vertical scaling.
Correct Answer: B - Schema is applied when data is read from storage
Schema-on-read means the structure or schema is applied to data only when it is read or queried, not when it is stored. This approach is used in data lakes and allows flexibility in storing raw data of any format.
Correct Answer: A - Spark performs in-memory processing which is significantly faster
Apache Spark's primary advantage is in-memory computing. It keeps intermediate data in memory (RAM) rather than writing to disk after each stage like MapReduce, making it up to 100x faster for certain workloads.
Correct Answer: A - The process of cleaning, restructuring, and enriching raw data
Data wrangling, also called data munging, is the process of cleaning, restructuring, and enriching raw data into a more usable format. It involves handling missing values, correcting errors, and transforming data for analysis.
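A small sketch of these steps on a handful of records is shown below. The sample data, the `wrangle` function, and the choice to fill missing ages with the median are assumptions made for this example. Other imputation strategies are equally valid.

```python
from statistics import median

# Hypothetical raw records with stray whitespace, mixed case,
# a missing value, and an exact duplicate.
raw = [
    {"name": " Asha ", "age": "31", "city": "pune"},
    {"name": "Ravi", "age": "", "city": "PUNE"},
    {"name": "Meera", "age": "45", "city": "Mumbai "},
    {"name": "Ravi", "age": "", "city": "PUNE"},
]

def wrangle(rows):
    # Clean: trim whitespace, normalise city casing, drop exact duplicates.
    seen, cleaned = set(), []
    for row in rows:
        rec = (row["name"].strip(), row["age"].strip(), row["city"].strip().title())
        if rec not in seen:
            seen.add(rec)
            cleaned.append(rec)
    # Handle missing values: fill empty ages with the median of known ages.
    # Assumes at least one age is present.
    fill = median(int(age) for _, age, _ in cleaned if age)
    # Restructure: emit typed dictionaries ready for analysis.
    return [
        {"name": name, "age": int(age) if age else fill, "city": city}
        for name, age, city in cleaned
    ]

wrangled = wrangle(raw)
print(wrangled)
```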
Correct Answer: C - Apache Parquet
Apache Parquet is a columnar storage format optimized for Big Data processing. It stores data by columns rather than rows, enabling efficient compression, encoding, and fast analytical queries that only need specific columns.
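The idea behind columnar storage can be shown without any Parquet library. The sketch below is not the Parquet file format. It only illustrates, with invented sample data and a hypothetical `run_length_encode` helper, why grouping values by column helps analytical queries and compression.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"id": 1, "country": "IN", "amount": 10.0},
    {"id": 2, "country": "IN", "amount": 20.0},
    {"id": 3, "country": "IN", "amount": 5.0},
    {"id": 4, "country": "US", "amount": 15.0},
]

# Column-oriented layout: all values of one column are stored together.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytical query that needs one column reads only that column.
total_amount = sum(columns["amount"])

def run_length_encode(values):
    """Compress runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

# Similar values sit next to each other in a column, so they compress well.
encoded_country = run_length_encode(columns["country"])

print(total_amount)
print(encoded_country)
```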
Correct Answer: C - An isolated repository of data controlled by one department, inaccessible to others
A data silo is an isolated collection of data held by one group or department that is not easily or fully accessible by other groups in the same organization. Data silos hinder cross-functional analysis and collaboration.
Correct Answer: D - Apache Oozie
Apache Oozie is a workflow scheduler system for managing Apache Hadoop jobs. It allows users to create directed acyclic graphs (DAGs) of actions and schedule complex data pipelines with dependencies.
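The DAG idea can be illustrated with Python's standard `graphlib` module (Python 3.9 or later). This is not Oozie's own syntax, since Oozie workflows are defined in XML. The action names below are hypothetical, and the sketch only shows how dependencies determine a valid execution order.

```python
from graphlib import TopologicalSorter

# Each key is an action; its set lists the actions it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "export": {"aggregate"},
    "report": {"aggregate"},
}

# A topological order runs every action only after its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Here `export` and `report` both depend only on `aggregate`, so a scheduler is free to run them in either order or in parallel.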
Correct Answer: A - To divide large datasets into smaller, manageable chunks for parallel processing
Data partitioning divides a large dataset into smaller, more manageable subsets (partitions) that can be processed in parallel across multiple nodes. This improves query performance and enables distributed processing.
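The sketch below splits a dataset into partitions and processes them concurrently. It rests on simplifying assumptions: threads stand in for cluster nodes, round-robin assignment stands in for a real partitioning key, and the `partition` and `process` functions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(records, num_partitions):
    """Assign records to partitions in round-robin fashion."""
    parts = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        parts[index % num_partitions].append(record)
    return parts

def process(part):
    """Work done independently on one partition (here: a partial sum)."""
    return sum(part)

records = list(range(1, 101))
parts = partition(records, 4)

# Each partition is processed independently, then results are combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process, parts))

total = sum(partials)
print(partials, total)
```

Because no partition depends on another, the partial results can be computed in any order and merged at the end.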
Correct Answer: D - Apache Storm
Apache Storm is a free, open-source distributed real-time computation system that processes unbounded streams of data reliably. Unlike Hive and Pig, which are batch processing tools, Storm processes data in real time.
Correct Answer: A - Value
Value is considered the 5th V of Big Data and refers to the ability to turn data into meaningful and actionable business insights. Without extracting value, even large volumes of data serve no practical purpose.