Natural Language Processing Question Bank for C-CAT
Topic-wise Natural Language Processing MCQs for CDAC C-CAT preparation with answers and explanations.
Correct Answer: B - Natural Language Processing
NLP (Natural Language Processing) is the branch of AI focused on the interaction between computers and human language.
Correct Answer: C - Breaking text into words or subwords
Tokenization splits text into individual units (tokens) like words, subwords, or characters.
Correct Answer: B - Common words like "the", "is" often removed
Stop words are common words (the, is, at, etc.) often removed as they add little meaning.
Correct Answer: C - Root/base form (may not be valid word)
Stemming strips suffixes with simple rules to get word stems (studies → studi), which may not be valid words.
Correct Answer: B - Producing valid dictionary words
Lemmatization uses vocabulary and morphological analysis to return valid base forms (running → run).
Correct Answer: D - Word importance in document relative to corpus
TF-IDF (Term Frequency-Inverse Document Frequency) measures how important a word is to a document in a corpus.
Correct Answer: C - Represents text as word frequency counts ignoring order
Bag of Words represents text as a collection of word counts, ignoring grammar and word order.
Correct Answer: A - Represent words as dense vectors capturing semantic meaning
Word embeddings map words to dense vectors where similar words have similar vector representations.
Correct Answer: B - Named entities like persons, locations, organizations
NER identifies and classifies named entities in text into categories like person, organization, location.
Correct Answer: D - Emotional tone (positive/negative/neutral)
Sentiment analysis identifies the emotional tone or attitude expressed in text.
Correct Answer: B - Labels words with grammatical categories
POS tagging assigns grammatical categories (noun, verb, adjective, etc.) to words in text.
Correct Answer: B - Transformer-based language model
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model for various NLP tasks.
Correct Answer: B - Focuses on relevant parts of input
Attention allows models to focus on relevant parts of input when producing output, weighing importance dynamically.
Correct Answer: C - Text from one language to another
Machine Translation automatically translates text from one natural language to another.
Correct Answer: D - Categorizing documents into predefined classes
Text classification assigns documents to predefined categories like spam detection or topic labeling.
Correct Answer: B - Contiguous sequences of n items from text
N-grams are contiguous sequences of n items (words or characters) from text, capturing local context.
Correct Answer: C - Sequential data like text
RNNs process sequential data by maintaining a hidden state that captures information from previous inputs.
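The hidden-state idea can be sketched with a scalar recurrent unit. This is only an illustration: the weights `w_x`, `w_h`, and `b` are arbitrary assumed values, and real RNNs use learned weight matrices over vectors.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a scalar recurrent unit over a sequence and return every hidden state."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The new state depends on the current input and the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Only the first input is non-zero, yet later states stay positive:
# the hidden state carries information forward from earlier steps.
states = rnn_forward([1.0, 0.0, 0.0])
print(states)
```

The carried-over signal shrinks at every step in this sketch, which hints at why plain RNNs struggle with long-range dependencies.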
Correct Answer: A - Vanishing gradient and short-term memory
LSTM (Long Short-Term Memory) uses gates to preserve long-term dependencies and mitigate vanishing gradients.
Correct Answer: B - Process sequences in parallel with attention
Transformers use self-attention to process all positions in parallel, enabling faster training and better modeling of long-range dependencies.
Correct Answer: A - Self-supervised learning on large text corpora
GPT (Generative Pre-trained Transformer) is pre-trained using self-supervised learning to predict the next token in text.
Correct Answer: C - Breaking text into smaller units like words or sentences
Tokenization is the process of breaking text into smaller units called tokens, which can be words, subwords, or sentences. It is typically one of the first steps in an NLP pipeline.
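A minimal word-level tokenizer can be sketched with a regular expression. The function name `tokenize` and the choice to lowercase and split off punctuation are assumptions made for this example; production tokenizers handle contractions, URLs, and Unicode far more carefully.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize("NLP is fun, isn't it?")
print(tokens)  # ['nlp', 'is', 'fun', ',', 'isn', "'", 't', 'it', '?']
```

Note how the contraction is split into three tokens, which is one reason real pipelines rely on more elaborate rules or learned tokenizers.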
Correct Answer: D - Root or base form by removing affixes
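Stemming reduces words to a root form by stripping affixes (mostly suffixes) with simple rules. For example, 'running' and 'runs' are both reduced to 'run'. An irregular form such as 'ran' is not handled by suffix stripping and needs lemmatization to map to 'run'. Stemming may produce non-dictionary words.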
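A deliberately naive suffix stripper shows how rule-based stemming behaves. The rule list `SUFFIXES` and the minimum stem length of three are assumptions for this sketch; it is not the Porter algorithm, which applies many more rules and conditions.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rules, tried in this order

def simple_stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["jumps", "jumped", "studies", "ran"]:
    print(w, "->", simple_stem(w))
```

Here 'studies' becomes 'studi', a non-dictionary stem, and 'ran' is left unchanged because no suffix rule applies to an irregular form.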
Correct Answer: B - Identify and classify named entities like persons, organizations, and locations
NER identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, dates, and monetary values.
Correct Answer: A - The emotional tone or opinion expressed in text
Sentiment Analysis determines the emotional tone behind text, that is, whether it expresses positive, negative, or neutral sentiment. It is widely used in analyzing customer reviews and social media.
Correct Answer: A - Lemmatization produces valid dictionary words; stemming may not
Lemmatization uses vocabulary and morphological analysis to return valid dictionary words (lemmas), while stemming uses simple rule-based suffix stripping which may produce non-dictionary words.
Correct Answer: C - Term Frequency-Inverse Document Frequency
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure that evaluates how important a word is to a document in a collection. The score increases with the word's frequency in the document and decreases with the number of documents in the collection that contain the word.
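One common way to compute the score is tf = count / document length and idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. The sketch below assumes that unsmoothed variant and non-empty, pre-tokenized documents; libraries often use smoothed or normalized versions, so their numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus 'the' occurs in every document, so its idf is log(1) = 0 and its score is zero, while 'mat' occurs in only one document and scores highest in the first.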
Correct Answer: B - The
Stop words are common words like 'the', 'is', 'at', 'which', 'on' that are filtered out during text preprocessing because they carry little meaningful information for analysis.
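Stop-word filtering amounts to a set lookup. The list below contains only the five words named in the explanation above; real stop-word lists are much longer and vary by library and task.

```python
STOP_WORDS = {"the", "is", "at", "which", "on"}  # tiny illustrative list

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

filtered = remove_stop_words("The cat is on the mat".split())
print(filtered)  # ['cat', 'mat']
```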
Correct Answer: A - A multiset of words disregarding grammar and word order
Bag of Words represents text as a multiset (bag) of words, disregarding grammar and word order. Each document is represented as a vector of word counts or frequencies.
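A small sketch of the count-vector representation, assuming documents are already tokenized and using a vocabulary sorted alphabetically (the ordering is an arbitrary choice made here):

```python
from collections import Counter

def bag_of_words(docs):
    """Return the shared vocabulary and one count vector per document."""
    vocab = sorted({word for doc in docs for word in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["the cat sat".split(), "the cat saw the dog".split()]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Shuffling the words inside a document leaves its vector unchanged, which is exactly the loss of word order the explanation describes.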
Correct Answer: D - Creating word embeddings
Word2Vec creates dense vector representations (embeddings) of words that capture semantic relationships. Words with similar meanings have similar vectors. It uses CBOW or Skip-gram architectures.
Correct Answer: D - Grammatical categories like noun, verb, adjective to each word
POS tagging assigns grammatical categories (noun, verb, adjective, adverb, etc.) to each word in a sentence. It is fundamental to many NLP tasks like parsing and information extraction.
Correct Answer: A - Speech Recognition
Speech Recognition (also called Automatic Speech Recognition or ASR) converts spoken language into written text. It is used in virtual assistants, dictation software, and voice-controlled systems.
Correct Answer: D - Contiguous sequences of n items from text
N-grams are contiguous sequences of n items (words or characters) from text. Unigrams (n=1), bigrams (n=2), and trigrams (n=3) are commonly used for language modeling and text analysis.
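Extracting word n-grams is a sliding window over the token list. The helper name `ngrams` is an assumption for this sketch:

```python
def ngrams(tokens, n):
    """Return all contiguous n-item windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing is fun".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sequence of 5 tokens yields 4 bigrams and 3 trigrams. Passing a string instead of a list produces character n-grams with the same code.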
Correct Answer: D - Word Embeddings
Word Embeddings (like Word2Vec, GloVe, FastText) convert words into fixed-length dense vectors that capture semantic meaning. Unlike one-hot encoding, similar words have similar vector representations.
Correct Answer: A - Supervised learning
Text classification is a supervised learning task where the model is trained on labeled text data to assign predefined categories to new text. Examples include spam detection and topic classification.
Correct Answer: C - A large collection of text documents
A corpus (plural: corpora) is a large, structured collection of text documents used for training and evaluating NLP models. Examples include Wikipedia, news articles, and book collections.
Correct Answer: C - Automatically translating text from one natural language to another
Machine Translation automatically translates text from one natural language to another (e.g., English to French). Modern approaches use neural networks (Neural Machine Translation).
Correct Answer: A - Transformer
The Transformer architecture revolutionized NLP with its self-attention mechanism that processes all positions in a sequence simultaneously. BERT, GPT, and T5 are all based on Transformers.
Correct Answer: A - Bidirectional Encoder Representations from Transformers
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model by Google that considers context from both directions (left and right) of a word simultaneously.
Correct Answer: D - Cleaning and preparing raw text data for analysis
Text preprocessing involves cleaning and preparing raw text for NLP tasks. Common steps include tokenization, lowercasing, removing stop words, stemming/lemmatization, and handling special characters.
Correct Answer: A - Group words into meaningful phrases based on POS tags
Chunking (shallow parsing) groups adjacent words into meaningful phrases based on their POS tags. For example, grouping 'the big cat' as a noun phrase (NP). It provides partial syntactic analysis.
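A rough noun-phrase chunker can be written as a small state machine over (word, tag) pairs. The sketch assumes the input is already tagged with Penn Treebank-style tags (DT, JJ, NN...) and uses a simplified pattern: an optional determiner, any adjectives, then one or more nouns. Real chunkers use richer grammars or trained models.

```python
def np_chunks(tagged):
    """Group determiner/adjective/noun runs that end in a noun into noun phrases."""
    chunks, current = [], []

    def flush():
        # Keep the collected run only if it ends with a noun.
        if current and current[-1][1].startswith("NN"):
            chunks.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged:
        if tag == "DT":
            flush()  # a determiner always starts a new phrase
            current.append((word, tag))
        elif tag == "JJ":
            if current and current[-1][1].startswith("NN"):
                flush()  # an adjective after a noun starts a new phrase
            current.append((word, tag))
        elif tag.startswith("NN"):
            current.append((word, tag))
        else:
            flush()  # any other tag ends the current phrase
    flush()
    return chunks

tagged = [("the", "DT"), ("big", "JJ"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(np_chunks(tagged))  # ['the big cat', 'the mat']
```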
Correct Answer: C - Image Segmentation
Image Segmentation is a computer vision task, not an NLP application. Chatbots, spam filtering, and language translation are all common applications of Natural Language Processing.
Correct Answer: C - Grammatical relationships between words in a sentence
Dependency parsing analyzes the grammatical structure of a sentence by identifying relationships (dependencies) between words, such as subject-verb, verb-object, and modifier relationships.
Correct Answer: A - To represent words as numerical vectors capturing semantic meaning
Word embeddings represent words as dense numerical vectors in a continuous vector space where semantically similar words are closer together. This makes it possible to compare and combine word meanings with ordinary vector arithmetic.
Correct Answer: D - Subword tokenization
Subword tokenization (like BPE, WordPiece) handles out-of-vocabulary (OOV) words by breaking unknown words into smaller subword units that exist in the vocabulary. This is used in models like BERT and GPT.
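The idea can be sketched with a greedy longest-match split, in the style of WordPiece inference where continuation pieces carry a `##` prefix. The vocabulary `VOCAB` below is made up for the example, and BPE works differently (it applies learned merge rules), so treat this as an illustration rather than either algorithm's reference implementation.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary for illustration only.
VOCAB = {"un", "token", "##token", "##ize", "##ized", "##s", "play", "##ing"}
print(subword_tokenize("untokenized", VOCAB))  # ['un', '##token', '##ized']
print(subword_tokenize("playing", VOCAB))      # ['play', '##ing']
```

The word 'untokenized' is not in the vocabulary, yet it is still represented by three known pieces instead of a single unknown token.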
Correct Answer: B - Automatically extracting structured information from unstructured text
Information Extraction automatically extracts structured data (entities, relationships, events) from unstructured text. It includes tasks like NER, relation extraction, and event extraction.
Correct Answer: C - Measuring similarity between two text vectors
Cosine similarity measures the cosine of the angle between two vectors, indicating how similar they are in direction regardless of magnitude. It is widely used to compare document or word vectors in NLP.
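As a worked example, cosine similarity is the dot product divided by the product of the two vector lengths. Returning 0.0 for a zero vector is a convention chosen for this sketch, since the value is mathematically undefined in that case.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])  # second vector is twice the first
orthogonal = cosine_similarity([1, 0], [0, 1])
print(same_direction, orthogonal)
```

Doubling a vector does not change the result, which is why the measure is useful for comparing documents of different lengths.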
Correct Answer: D - Machine Translation
Machine Translation is a sequence-to-sequence task that takes an input sequence (source language) and produces an output sequence (target language). Text summarization is another seq-to-seq task.
Correct Answer: A - Sparse, high-dimensional vectors
One-hot encoding represents each word as a sparse, high-dimensional vector where only one element is 1 and the rest are 0. The vector dimension equals the vocabulary size, making it memory-inefficient.
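A short sketch with a four-word vocabulary (an arbitrary example) makes the sparsity visible:

```python
def one_hot(word, vocab):
    """Return a vector of vocabulary size with a single 1 at the word's index."""
    index = {w: i for i, w in enumerate(vocab)}
    vector = [0] * len(vocab)
    vector[index[word]] = 1
    return vector

vocab = ["cat", "dog", "mat", "sat"]
print(one_hot("dog", vocab))  # [0, 1, 0, 0]
```

With a realistic vocabulary of tens of thousands of words, each vector would contain a single 1 among tens of thousands of zeros, and any two different words would be equally dissimilar.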
Correct Answer: D - To focus on relevant parts of the input when generating output
The attention mechanism allows the model to focus on the most relevant parts of the input sequence when generating each part of the output. It significantly improves performance on long sequences.
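One common formulation is scaled dot-product attention, as used in Transformers; the explanation above does not name a specific variant, so that choice is an assumption here. The sketch handles a single query vector with plain Python lists: each key is scored against the query, the scores are turned into weights with softmax, and the output is the weighted average of the values.

```python
import math

def softmax(xs):
    """Convert scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output

# Toy vectors chosen for illustration.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
weights, output = attention([1.0, 0.0], keys, values)
print(weights, output)
```

The first and third keys match the query, so they receive equal and larger weights than the second key, and the output leans toward their values.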
Correct Answer: B - Extractive and abstractive
Text summarization is categorized as extractive (selecting important sentences from the original text) and abstractive (generating new sentences that convey the key information). Abstractive summarization is generally the more challenging of the two.
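The extractive approach can be sketched with a simple frequency heuristic: score each sentence by the average corpus frequency of its words and keep the top `k`. The scoring rule and the regex-based sentence splitting are assumptions for this example; practical systems use stronger features or neural models.

```python
import re
from collections import Counter

def extractive_summary(text, k=1):
    """Return the k highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:k]
    return [s for s in sentences if s in top]

text = "Cats chase mice. Cats sleep a lot. Dogs bark."
print(extractive_summary(text, k=1))  # ['Cats chase mice.']
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization; an abstractive system would generate new wording instead.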