Big Data Fundamentals - Practice MCQs for C-CAT

50 Questions Section B: Programming Big Data

Big Data Fundamentals Question Bank for C-CAT

Topic-wise Big Data Fundamentals MCQs for CDAC C-CAT preparation with answers and explanations.

Q1.
Which of the following is NOT one of the 5 V's of Big Data?
A. Volume
B. Validation
C. Variety
D. Velocity

Correct Answer: B - Validation

The 5 V's of Big Data are Volume, Velocity, Variety, Veracity, and Value. Validation is not one of them.

Q2.
Big Data refers to datasets that are:
A. Too large for traditional database systems
B. Small enough to process on a single machine
C. Only structured data
D. Only unstructured data

Correct Answer: A - Too large for traditional database systems

Big Data refers to datasets that are too large, complex, or fast-moving for traditional database systems to handle.

Q3.
Which characteristic of Big Data refers to the speed at which data is generated?
A. Volume
B. Velocity
C. Variety
D. Veracity

Correct Answer: B - Velocity

Velocity refers to the speed at which data is generated, collected, and processed.

Q4.
Structured, semi-structured, and unstructured data represent which Big Data characteristic?
A. Volume
B. Velocity
C. Value
D. Variety

Correct Answer: D - Variety

Variety refers to the different types of data formats - structured (tables), semi-structured (JSON, XML), and unstructured (text, images).

Q5.
Which technology is primarily used for distributed storage in Big Data?
A. MySQL
B. SQLite
C. HDFS
D. PostgreSQL

Correct Answer: C - HDFS

HDFS (Hadoop Distributed File System) is designed for distributed storage across clusters of commodity hardware.

Q6.
Data Lake is different from Data Warehouse because:
A. Data Lake only stores structured data
B. Data Warehouse is cheaper
C. Data Lake stores raw data in native format
D. Data Lake requires schema before loading

Correct Answer: C - Data Lake stores raw data in native format

Data Lake stores raw data in its native format (schema-on-read), while Data Warehouse requires structured data with predefined schema (schema-on-write).

Q7.
ETL in Big Data stands for:
A. Execute, Test, Launch
B. Encrypt, Transfer, Log
C. Extract, Transform, Load
D. Export, Translate, Link

Correct Answer: C - Extract, Transform, Load

ETL stands for Extract, Transform, Load - the process of extracting data from sources, transforming it, and loading into a destination system.

Q8.
Which is an example of real-time Big Data processing?
A. Monthly sales reports
B. Credit card fraud detection
C. Annual inventory audit
D. Quarterly financial statements

Correct Answer: B - Credit card fraud detection

Credit card fraud detection requires real-time processing to identify suspicious transactions as they occur.
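
As a rough illustration of processing events as they arrive, the sketch below checks each transaction the moment it is read from a stream. The fixed amount threshold is a deliberately simplistic stand-in for a real fraud model, and the names `flag_suspicious`, `limit`, and the sample transactions are all hypothetical.

```python
def flag_suspicious(transactions, limit=1000.0):
    """Yield the id of each transaction above the limit as soon as it is seen."""
    for txn in transactions:
        # Each event is evaluated immediately; nothing waits for a batch.
        if txn["amount"] > limit:
            yield txn["id"]


# A plain iterator stands in for a live transaction stream.
stream = iter([
    {"id": "t1", "amount": 25.0},
    {"id": "t2", "amount": 4200.0},
    {"id": "t3", "amount": 310.0},
])

flagged = list(flag_suspicious(stream))
print(flagged)
```

Because `flag_suspicious` is a generator, it never needs the whole dataset in memory, which is the property that distinguishes this style from the batch reports in the other options.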

Q9.
Veracity in Big Data refers to:
A. Speed of data
B. Trustworthiness of data
C. Size of data
D. Type of data

Correct Answer: B - Trustworthiness of data

Veracity refers to the quality, accuracy, and trustworthiness of the data.

Q10.
Which is NOT a common Big Data use case?
A. Predictive maintenance
B. Customer sentiment analysis
C. Recommendation engines
D. Single-user desktop application

Correct Answer: D - Single-user desktop application

Single-user desktop applications don't require Big Data technologies. The others involve processing large datasets.

Q11.
Batch processing in Big Data is characterized by:
A. Real-time responses
B. Stream processing
C. Immediate data processing
D. Processing large volumes of data at scheduled intervals

Correct Answer: D - Processing large volumes of data at scheduled intervals

Batch processing involves processing large volumes of accumulated data at scheduled intervals, not in real-time.

Q12.
Which company originally developed Hadoop?
A. Google
B. Amazon
C. Facebook
D. Yahoo

Correct Answer: D - Yahoo

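Hadoop was created by Doug Cutting and Mike Cafarella. It grew out of the Apache Nutch project and was based on Google's published papers on GFS and MapReduce. Yahoo hired Cutting in 2006 and drove much of Hadoop's early large-scale development, which is why Yahoo is the expected answer among these options.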

Q13.
The CAP theorem states that a distributed system can have at most:
A. One property
B. Two of three properties simultaneously
C. All three properties
D. None of the properties

Correct Answer: B - Two of three properties simultaneously

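The CAP theorem states that a distributed system can guarantee only two of three properties at the same time: Consistency, Availability, and Partition tolerance. In practice, when a network partition occurs, the system must choose between staying consistent and staying available.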

Q14.
Which is a stream processing framework?
A. Hadoop MapReduce
B. Apache Kafka Streams
C. MySQL
D. Oracle DB

Correct Answer: B - Apache Kafka Streams

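Kafka Streams, the stream processing library that ships with Apache Kafka, is designed for real-time stream processing. MapReduce is a batch processing framework, and MySQL and Oracle DB are relational databases.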

Q15.
Data sharding is used for:
A. Data encryption
B. Data backup
C. Data compression
D. Distributing data across multiple databases

Correct Answer: D - Distributing data across multiple databases

Sharding horizontally partitions data across multiple database instances for scalability and performance.
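
One common way to decide which shard holds a record is to hash its key. The sketch below assumes hash-based routing with a fixed shard count; `NUM_SHARDS`, `shard_for`, and the sample user IDs are hypothetical, and real systems may use range-based or directory-based sharding instead.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of database instances


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then reduce to a shard index.
    return int.from_bytes(digest[:8], "big") % num_shards


# In-memory lists stand in for separate database instances.
shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ["user-1", "user-2", "user-3", "user-4", "user-5"]:
    shards[shard_for(user_id)].append(user_id)

print(shards)
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process for strings) keeps the same key on the same shard across restarts.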

Q16.
Which is an example of unstructured data?
A. Employee database table
B. Customer order CSV file
C. Excel spreadsheet
D. Social media posts

Correct Answer: D - Social media posts

Social media posts are unstructured - they don't follow a predefined data model or schema.

Q17.
Data governance in Big Data ensures:
A. Faster processing only
B. Cheaper storage
C. Data quality, security, and compliance
D. Smaller file sizes

Correct Answer: C - Data quality, security, and compliance

Data governance encompasses policies and processes for data quality, security, privacy, and regulatory compliance.

Q18.
Lambda architecture combines:
A. Only batch processing
B. Only stream processing
C. Only data storage
D. Batch and stream processing

Correct Answer: D - Batch and stream processing

Lambda architecture combines batch processing for comprehensive analysis and stream processing for real-time views.
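
The idea of merging a batch view with a real-time view can be sketched in a few lines. This is only a toy model under stated assumptions: the event lists, `batch_view`, `speed_view`, and `query` are hypothetical names, and in-memory counters stand in for the batch, speed, and serving layers.

```python
from collections import Counter

# Events already covered by the last batch run (batch layer input).
batch_events = ["page_a", "page_b", "page_a"]
# Events that arrived after the last batch run (speed layer input).
recent_events = ["page_a", "page_c"]

# Batch layer: recomputed over the full historical dataset.
batch_view = Counter(batch_events)

# Speed layer: updated incrementally as each new event arrives.
speed_view = Counter()
for event in recent_events:
    speed_view[event] += 1


def query(page: str) -> int:
    """Serving layer: combine the batch view with the real-time view."""
    return batch_view[page] + speed_view[page]


print(query("page_a"))
```

When the next batch run absorbs the recent events, the speed view for that period can be discarded.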

Q19.
Which metric measures how current the data is?
A. Data volume
B. Data freshness
C. Data variety
D. Data veracity

Correct Answer: B - Data freshness

Data freshness measures how current or up-to-date the data is - critical for time-sensitive applications.
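
Freshness is often computed as the time elapsed since the data was last updated. The sketch below assumes that definition; `freshness`, `is_stale`, and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def freshness(last_updated: datetime, now: datetime) -> timedelta:
    """Return how long ago the data was last updated."""
    return now - last_updated


def is_stale(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Treat data as stale once its age exceeds the allowed maximum."""
    return freshness(last_updated, now) > max_age


# Fixed timestamps keep the example reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last = datetime(2024, 1, 1, 11, 45, tzinfo=timezone.utc)

print(freshness(last, now))
print(is_stale(last, now, timedelta(minutes=10)))
```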

Q20.
Horizontal scaling in Big Data means:
A. Adding more machines to the cluster
B. Adding more resources to an existing machine
C. Reducing data size
D. Compressing files

Correct Answer: A - Adding more machines to the cluster

Horizontal scaling (scale-out) adds more machines to distribute the workload, unlike vertical scaling which upgrades existing machines.

Q21.
Which of the following is NOT one of the original 3Vs of Big Data?
A. Volume
B. Velocity
C. Variety
D. Veracity

Correct Answer: D - Veracity

The original 3Vs of Big Data defined by Doug Laney are Volume, Velocity, and Variety. Veracity (data quality/trustworthiness) was added later as a 4th V.

Q22.
What does the 'Velocity' characteristic of Big Data refer to?
A. The size of the data
B. The different types of data
C. The speed at which data is generated and processed
D. The accuracy of the data

Correct Answer: C - The speed at which data is generated and processed

Velocity refers to the speed at which data is generated, collected, and processed. Real-time or near-real-time processing of streaming data is a key challenge addressed by this V.

Q23.
Which of the following is an example of structured data?
A. Social media posts
B. Relational database tables
C. Video files
D. Email messages

Correct Answer: B - Relational database tables

Relational database tables contain structured data organized in predefined rows and columns with fixed schemas. Social media posts, videos, and emails are examples of unstructured or semi-structured data.

Q24.
What is a Data Lake?
A. A centralized repository that stores raw data in its native format
B. A relational database with strict schema enforcement
C. A data warehouse optimized for SQL queries
D. A backup storage system for databases

Correct Answer: A - A centralized repository that stores raw data in its native format

A Data Lake is a centralized repository that allows you to store all your structured and unstructured data at any scale in its raw, native format. Unlike a data warehouse, it uses schema-on-read rather than schema-on-write.

Q25.
What does ETL stand for in data processing?
A. Encrypt, Transfer, Load
B. Evaluate, Transform, Load
C. Extract, Transfer, Log
D. Extract, Transform, Load

Correct Answer: D - Extract, Transform, Load

ETL stands for Extract, Transform, Load. It is a process where data is extracted from source systems, transformed (cleaned, enriched, formatted) to fit operational needs, and loaded into a target database or data warehouse.
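
The three stages can be sketched with the Python standard library. This is a minimal illustration, not a production pipeline: the inline CSV text, the `sales` table, and the `extract`, `transform`, and `load` helpers are hypothetical, and an in-memory SQLite database stands in for the target warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data, including messy and invalid rows.
RAW_CSV = """name,amount
 alice ,10.5
BOB,20
carol,not_a_number
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names and drop rows with unparseable amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        clean.append((row["name"].strip().title(), amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
print(loaded)
```

Note that the data is cleaned before it reaches the target table, which is what separates ETL from ELT.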

Q26.
Which tool is primarily used for real-time stream processing of Big Data?
A. Apache Hive
B. Apache Sqoop
C. Apache Kafka
D. Apache Flume

Correct Answer: C - Apache Kafka

Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming applications. It can handle trillions of events per day with high throughput and low latency.

Q27.
What is the primary difference between a Data Lake and a Data Warehouse?
A. There is no difference between them
B. Data Lakes can only store structured data
C. Data Warehouses store raw data
D. Data Lakes use schema-on-read while Data Warehouses use schema-on-write

Correct Answer: D - Data Lakes use schema-on-read while Data Warehouses use schema-on-write

Data Lakes follow a schema-on-read approach where raw data is stored without a predefined schema and structure is applied when data is read. Data Warehouses use schema-on-write where data must conform to a predefined schema before being stored.

Q28.
Which of the following is a characteristic of the 'Value' V in Big Data?
A. The speed of data generation
B. The size of the dataset
C. The economic worth derived from data analysis
D. The diversity of data sources

Correct Answer: C - The economic worth derived from data analysis

Value refers to the economic worth or business insights that can be derived from analyzing Big Data. Having large volumes of data is meaningless unless valuable information and actionable insights can be extracted from it.

Q29.
Which Big Data tool is used for importing data from relational databases into Hadoop?
A. Apache Flume
B. Apache Sqoop
C. Apache Kafka
D. Apache Storm

Correct Answer: B - Apache Sqoop

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. It can import tables from RDBMS to HDFS and export data from HDFS to RDBMS.

Q30.
What type of data is XML classified as?
A. Semi-structured data
B. Unstructured data
C. Structured data
D. Binary data

Correct Answer: A - Semi-structured data

XML is classified as semi-structured data. It does not conform to a strict tabular schema like structured data, but it has tags and markers that provide some organizational structure, unlike purely unstructured data such as images or free text.
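
A short example shows what "some organizational structure" means in practice: the tags let a program locate fields, yet records are free to omit them. The XML snippet and field names below are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two orders share the same tags, but the second has no <qty> element.
doc = """<orders>
  <order id="1"><item>pen</item><qty>2</qty></order>
  <order id="2"><item>book</item></order>
</orders>"""

root = ET.fromstring(doc)

# Tags make fields addressable; missing fields simply come back as None.
orders = [
    {"id": o.get("id"), "item": o.findtext("item"), "qty": o.findtext("qty")}
    for o in root.findall("order")
]
print(orders)
```

A relational table would reject or pad the second record at write time, whereas the XML document accepts both shapes.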

Q31.
Which of the following is NOT a Big Data processing framework?
A. Apache Spark
B. Apache Flink
C. Apache Tomcat
D. Apache Storm

Correct Answer: C - Apache Tomcat

Apache Tomcat is a web server and servlet container, not a Big Data processing framework. Apache Spark, Flink, and Storm are all frameworks used for large-scale data processing.

Q32.
What is the 'Veracity' V in Big Data concerned with?
A. The volume of data
B. The trustworthiness and quality of data
C. The speed of data processing
D. The variety of data formats

Correct Answer: B - The trustworthiness and quality of data

Veracity refers to the trustworthiness, accuracy, and quality of data. In Big Data, data can be noisy, incomplete, or inconsistent, making it important to assess and ensure data quality before analysis.

Q33.
Which of the following best describes batch processing?
A. Processing data as it arrives in real-time
B. Collecting and processing data in large blocks at scheduled intervals
C. Processing only structured data
D. Processing data using a single machine

Correct Answer: B - Collecting and processing data in large blocks at scheduled intervals

Batch processing involves collecting data over a period of time and processing it in large blocks (batches) at scheduled intervals. It is suitable for non-time-critical processing of large volumes of data, as opposed to real-time stream processing.

Q34.
What is Apache Flume primarily used for?
A. Ingesting large volumes of log and event data into Hadoop
B. SQL query execution on Big Data
C. Machine learning on distributed data
D. Scheduling MapReduce jobs

Correct Answer: A - Ingesting large volumes of log and event data into Hadoop

Apache Flume is a distributed, reliable service for efficiently collecting, aggregating, and moving large amounts of log and event data from many sources to a centralized data store like HDFS.

Q35.
In the context of Big Data, what does 'Variety' refer to?
A. The number of users accessing data
B. The different formats, types, and sources of data
C. The processing speed of data
D. The storage capacity required

Correct Answer: B - The different formats, types, and sources of data

Variety refers to the different types, formats, and sources of data including structured (databases), semi-structured (JSON, XML), and unstructured (images, videos, text). Managing this diversity is a key Big Data challenge.

Q36.
Which of the following is an ELT approach?
A. Data is loaded into the target system first, then transformed
B. Data is transformed before loading into the target system
C. Data is only extracted and never loaded
D. Data is encrypted before transformation

Correct Answer: A - Data is loaded into the target system first, then transformed

ELT (Extract, Load, Transform) loads raw data into the target system first and then transforms it within the target system. This approach leverages the processing power of modern data warehouses and data lakes, unlike traditional ETL.
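
The load-first ordering can be sketched with an in-memory SQLite database standing in for the target system. The table names `raw_sales` and `sales_by_customer` and the sample rows are hypothetical; the point is only that the raw values land untouched and the cleanup runs as SQL inside the target.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: raw values go into a staging table exactly as received.
conn.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?)",
    [(" alice ", "10.5"), ("BOB", "20"), ("alice", "4.5")],
)

# Transform: cleaning and aggregation run inside the target system.
conn.execute(
    """
    CREATE TABLE sales_by_customer AS
    SELECT lower(trim(name)) AS customer,
           SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales
    GROUP BY lower(trim(name))
    """
)

totals = conn.execute(
    "SELECT customer, total FROM sales_by_customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping `raw_sales` around means a different transformation can be run later without extracting the data again.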

Q37.
What is Apache ZooKeeper used for in a Big Data ecosystem?
A. Distributed coordination and configuration management
B. Data visualization
C. Machine learning model training
D. Data ingestion from social media

Correct Answer: A - Distributed coordination and configuration management

Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. It is used by many Big Data tools like Kafka and HBase for coordination.

Q38.
Which of the following is an example of unstructured data?
A. A collection of video surveillance footage
B. CSV file with customer records
C. Employee salary table in a database
D. A spreadsheet with sales figures

Correct Answer: A - A collection of video surveillance footage

Video surveillance footage is unstructured data as it does not follow any predefined data model or schema. Database tables, CSV files, and spreadsheets are structured data with defined formats and schemas.

Q39.
What is data ingestion in Big Data?
A. The process of deleting old data
B. The process of encrypting data
C. The process of transporting data from various sources to a storage medium
D. The process of backing up data

Correct Answer: C - The process of transporting data from various sources to a storage medium

Data ingestion is the process of obtaining and importing data from various sources for immediate use or storage in a database or data warehouse. It can be done in real-time (streaming) or in batches.

Q40.
Which tool provides a SQL-like interface to query data stored in Hadoop?
A. Apache Hive
B. Apache Sqoop
C. Apache Flume
D. Apache ZooKeeper

Correct Answer: A - Apache Hive

Apache Hive provides a SQL-like query language called HiveQL (HQL) that allows users to query and manage large datasets stored in Hadoop's HDFS. It compiles HiveQL queries into jobs for an execution engine such as MapReduce or Tez.

Q41.
Which of the following describes horizontal scaling in Big Data systems?
A. Upgrading a single machine's CPU
B. Using faster hard drives
C. Adding more machines to distribute the workload
D. Increasing RAM in existing servers

Correct Answer: C - Adding more machines to distribute the workload

Horizontal scaling (scale-out) involves adding more machines to a cluster to distribute the workload. This is the preferred approach in Big Data as it is more cost-effective and provides better fault tolerance compared to vertical scaling.

Q42.
Which of the following describes the concept of 'schema-on-read'?
A. Schema is enforced when data is written to storage
B. Schema is applied when data is read from storage
C. Schema is never applied to data
D. Schema is defined only for structured data

Correct Answer: B - Schema is applied when data is read from storage

Schema-on-read means the structure or schema is applied to data only when it is read or queried, not when it is stored. This approach is used in data lakes and allows flexibility in storing raw data of any format.
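
A small sketch makes the idea concrete: raw JSON lines are stored untouched, and a schema is applied only while reading them. The sample records, the `SCHEMA` mapping, and `read_with_schema` are hypothetical names chosen for illustration.

```python
import json

# Raw records are stored as-is; their shapes differ from line to line.
RAW_LINES = [
    '{"id": "1", "temp": "21.5", "unit": "C"}',
    '{"id": "2", "temp": "19"}',
    '{"id": "3", "note": "sensor offline"}',
]

# The schema is chosen by the reader, not enforced at write time.
SCHEMA = {"id": int, "temp": float}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the requested schema."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


rows = list(read_with_schema(RAW_LINES, SCHEMA))
print(rows)
```

A different reader could apply a different schema to the same stored lines, which is the flexibility the explanation refers to.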

Q43.
What is Apache Spark's primary advantage over traditional MapReduce?
A. Spark performs in-memory processing which is significantly faster
B. Spark can only process structured data
C. Spark does not require a cluster
D. Spark uses disk-based processing

Correct Answer: A - Spark performs in-memory processing which is significantly faster

Apache Spark's primary advantage is in-memory computing. It keeps intermediate data in memory (RAM) rather than writing to disk after each stage like MapReduce, making it up to 100x faster for certain workloads.

Q44.
What is data wrangling (data munging)?
A. The process of cleaning, restructuring, and enriching raw data
B. Encrypting sensitive data
C. Compressing data for storage
D. Deleting duplicate databases

Correct Answer: A - The process of cleaning, restructuring, and enriching raw data

Data wrangling or data munging is the process of cleaning, restructuring, and enriching raw data into a more usable format. It involves handling missing values, correcting errors, and transforming data for analysis.
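
The steps named above can be sketched on a handful of records. The sample data and the `wrangle` helper are hypothetical; the cleaning rules (trim and title-case names, treat non-numeric ages as missing, drop exact duplicates) are assumptions picked for the example.

```python
raw = [
    {"name": " Asha ", "age": "29"},
    {"name": "ravi", "age": ""},       # missing value
    {"name": " Asha ", "age": "29"},   # duplicate record
    {"name": "Meera", "age": "abc"},   # invalid value
]


def wrangle(records):
    """Clean names, convert ages, and drop duplicate records."""
    seen = set()
    cleaned = []
    for r in records:
        name = r["name"].strip().title()
        # Non-numeric or empty ages are recorded as missing (None).
        age = int(r["age"]) if r["age"].isdigit() else None
        key = (name, age)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned


cleaned = wrangle(raw)
print(cleaned)
```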

Q45.
Which of the following is a columnar storage format commonly used in Big Data?
A. CSV
B. JSON
C. Apache Parquet
D. Plain Text

Correct Answer: C - Apache Parquet

Apache Parquet is a columnar storage format optimized for Big Data processing. It stores data by columns rather than rows, enabling efficient compression, encoding, and fast analytical queries that only need specific columns.
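
The difference between row and column layouts can be shown with plain Python structures. This sketch does not use the Parquet format or any Parquet library; the sample rows are hypothetical and the dictionaries only imitate the two layouts.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "city": "Pune", "amount": 120.0},
    {"id": 2, "city": "Delhi", "amount": 80.0},
    {"id": 3, "city": "Pune", "amount": 50.0},
]

# Column-oriented layout: all values of one field are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytical query over one column touches only that column's values.
total_amount = sum(columns["amount"])
print(columns["city"], total_amount)
```

Storing similar values side by side, such as the repeated city names, is also what makes column-wise compression and encoding effective.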

Q46.
What does the term 'data silo' refer to?
A. A type of data warehouse
B. A backup storage system
C. An isolated repository of data controlled by one department, inaccessible to others
D. A type of NoSQL database

Correct Answer: C - An isolated repository of data controlled by one department, inaccessible to others

A data silo is an isolated collection of data held by one group or department that is not easily or fully accessible by other groups in the same organization. Data silos hinder cross-functional analysis and collaboration.

Q47.
Which Big Data tool is used for workflow scheduling and management?
A. Apache HBase
B. Apache Sqoop
C. Apache Flume
D. Apache Oozie

Correct Answer: D - Apache Oozie

Apache Oozie is a workflow scheduler system for managing Apache Hadoop jobs. It allows users to create directed acyclic graphs (DAGs) of actions and schedule complex data pipelines with dependencies.
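
The DAG idea itself can be illustrated without Oozie. The sketch below is not Oozie's API (Oozie workflows are defined in XML); it uses Python's standard `graphlib` module (Python 3.9+) to order a hypothetical set of pipeline steps so that every step runs after its dependencies.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (hypothetical pipeline).
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
    "export": {"clean"},
}

# static_order() yields one valid execution order for the DAG.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Steps with no dependency between them, such as `aggregate` and `export` here, may appear in either order and could be run in parallel by a scheduler.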

Q48.
What is the primary purpose of data partitioning in Big Data systems?
A. To divide large datasets into smaller, manageable chunks for parallel processing
B. To encrypt data
C. To delete unnecessary data
D. To compress data for storage

Correct Answer: A - To divide large datasets into smaller, manageable chunks for parallel processing

Data partitioning divides a large dataset into smaller, more manageable subsets (partitions) that can be processed in parallel across multiple nodes. This improves query performance and enables distributed processing.
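
The partition-then-combine pattern can be sketched on a single machine. Here a thread pool stands in for the nodes of a cluster, which is an assumption made for brevity; the `partition` helper and the round-robin split are hypothetical choices, and real systems often partition by key or range.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, num_partitions):
    """Split items into num_partitions subsets using round-robin assignment."""
    return [items[i::num_partitions] for i in range(num_partitions)]


data = list(range(1, 101))
parts = partition(data, 4)

# Each partition is processed independently; workers stand in for nodes.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, parts))

# Partial results are combined into the final answer.
total = sum(partial_sums)
print(partial_sums, total)
```

The same shape (split, process independently, combine) underlies MapReduce-style jobs, where the workers are separate machines rather than threads.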

Q49.
Which of the following is a real-time data processing engine?
A. Apache Pig
B. Apache Hive
C. Apache Sqoop
D. Apache Storm

Correct Answer: D - Apache Storm

Apache Storm is a free, open-source distributed real-time computation system that processes unbounded streams of data reliably. Unlike Hive and Pig which are batch processing tools, Storm processes data in real-time.

Q50.
What is the '5th V' of Big Data that refers to the ability to derive meaningful insights?
A. Value
B. Variability
C. Visualization
D. Validity

Correct Answer: A - Value

Value is considered the 5th V of Big Data and refers to the ability to turn data into meaningful and actionable business insights. Without extracting value, even large volumes of data serve no practical purpose.
