NLP Interview Master
The ultimate collection of 200+ meticulously curated Natural Language Processing & LLM questions to help you ace your ML Engineer interview.
What is Natural Language Processing (NLP)?
Beginner
Natural Language Processing (NLP) is a subfield of artificial intelligence, computer science, and linguistics concerned with the interactions between computers ...
What is Tokenization?
Beginner
Tokenization is the process of breaking down a stream of text into smaller components called tokens. These could be words, characters, or subwords. It is usuall...
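As a rough illustration, the sketch below splits one sentence at the word level and at the character level using only Python's standard `re` module. The regular expression and the name `word_tokenize` are assumptions made for this example; real tokenizers handle many more edge cases.

```python
import re

def word_tokenize(text):
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "NLP isn't hard!"
word_tokens = word_tokenize(sentence)
char_tokens = list(sentence.replace(" ", ""))  # character-level tokens, spaces dropped
print(word_tokens)
print(char_tokens[:3])
```

Note how the naive pattern splits the contraction "isn't" into three tokens, which is one reason subword tokenizers are usually preferred in practice.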
What is Stemming?
Beginner
Stemming is a crude heuristic process that chops off the ends of words to find their base or root form, even if the root is not a valid word. For example, 'runn...
What is Lemmatization?
Beginner
Lemmatization uses vocabulary and morphological analysis of words to remove inflectional endings and return the base or dictionary form of a word, which is know...
What is Stop Word Removal?
Beginner
Stop words are high-frequency words that often add little lexical value to sentences, such as 'is', 'the', and 'at'. Removing them can significantly reduce the ...
What is a Bag-of-Words (BoW) model?
Beginner
A Bag-of-Words model is a simple representation of text used in NLP. It describes the occurrence of words within a document, ignoring word order and grammar but...
What is TF-IDF?
Beginner
TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a numerical statistic intended to reflect how important a word is to a document in a collectio...
What are Word Embeddings?
Beginner
Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They map words or phrases to vectors ...
What is Word2Vec?
Beginner
Word2Vec is a popular algorithm used for generating dense word embeddings. Created by researchers at Google, it uses a two-layer neural network (either Skip-Gra...
What is Part-of-Speech (POS) Tagging?
Beginner
POS tagging is the process of marking up a word in a text as corresponding to a particular part of speech based on both its definition and its context (e.g., ta...
What is Named Entity Recognition (NER)?
Beginner
NER is an information extraction technique that seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as pe...
What is Sentiment Analysis?
Beginner
Sentiment analysis (or opinion mining) uses NLP to identify, extract, and quantify subjective information from text, typically classifying the polarity as posit...
What are N-grams?
Beginner
An N-gram is a contiguous sequence of n items from a given sample of text or speech. When n=1 it's a unigram, n=2 is a bigram, and n=3 is a trigram. It captures...
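A minimal sketch of extracting n-grams from a whitespace-tokenized sentence. The helper name `ngrams` and the example sentence are chosen for illustration only.

```python
def ngrams(tokens, n):
    # Slide a window of width n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams[:2])
print(trigrams[:2])
```

A sequence of six tokens yields five bigrams and four trigrams; asking for an n larger than the sequence returns an empty list.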
What is Levenshtein Distance?
Beginner
Levenshtein Distance is a string metric for measuring the difference between two sequences. It is the minimum number of single-character edits (insertions, dele...
What is an RNN (Recurrent Neural Network)?
Beginner
An RNN is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input t...
Why use LSTM instead of a basic RNN?
Beginner
LSTMs (Long Short-Term Memory networks) are a special kind of RNN capable of learning long-term dependencies. They mitigate the vanishing gradient problem of stand...
What is the Vanishing Gradient Problem?
Beginner
During backpropagation in deep neural networks (especially RNNs), gradients are recursively multiplied. If these gradients are small, the resulting gradient val...
What is perplexity in Language Modeling?
Beginner
Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. In NLP, a lower perplexity score indicates the langua...
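To make the definition concrete, the sketch below computes perplexity as the exponential of the average negative log-likelihood per token. The probability lists are invented inputs standing in for what a model would assign to each observed token.

```python
import math

def perplexity(token_probs):
    # exp of the average negative log-likelihood per token
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = perplexity([0.5, 0.5, 0.5, 0.5])
uncertain = perplexity([0.1, 0.1, 0.1, 0.1])
print(confident, uncertain)
```

A model that assigns probability 0.5 to every observed token has a perplexity of about 2, while one that assigns 0.1 has a perplexity of about 10, so the lower score corresponds to the better predictor.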
What is BLEU score?
Beginner
BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another...
What is ROUGE score?
Beginner
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating automatic summarization and machine translation, focusing prim...
What is a Language Model?
Beginner
A language model learns the probability distribution over sequences of words. It tries to predict the next word in a sequence given the previous words.
What are text generation constraints?
Beginner
Constraints like temperature, top-k sampling, and top-p (nucleus) sampling help control how deterministic or creative the output sequence from a language model ...
Calculate Document Frequency.
Beginner
Document Frequency (DF) is the number of documents in which a term appears. It indicates how common or distinctive the term is across the whole collection.
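A small sketch of the calculation, assuming whitespace tokenization and lowercasing (both simplifications). The function name `document_frequency` and the three toy documents are illustrative.

```python
from collections import Counter

def document_frequency(docs):
    df = Counter()
    for doc in docs:
        # set() so a term counts once per document, however often it repeats
        df.update(set(doc.lower().split()))
    return df

docs = ["data science uses data", "big data", "nlp is fun"]
df = document_frequency(docs)
print(df["data"], df["nlp"])
```

Here 'data' occurs three times overall but in only two documents, so its document frequency is 2.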
What is the GloVe embedding?
Beginner
GloVe (Global Vectors) is an unsupervised learning algorithm for obtaining vector representations for words based on aggregated global word-word co-occurrence s...
Why is sequence padding important?
Beginner
Neural networks process batches as matrices, so all input sequences must be the same length. Padding adds neutral tokens (like 0) to shorter sequences so they m...
Explain the Attention mechanism.
Intermediate
Attention mechanisms allow models to focus on specific parts of an input sequence when predicting the output, rather than relying on a single fixed-length hidde...
What is Self-Attention?
Intermediate
Self-attention, or intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequ...
Explain the Sequence-to-Sequence (Seq2Seq) Model.
Intermediate
Seq2Seq models take an input sequence and generate an output sequence. They typically use an Encoder to read the input into a context vector, and a Decoder to g...
What is Teacher Forcing?
Intermediate
Teacher forcing is a fast and effective training technique for RNNs/Seq2Seq models where the model receives the ground truth output from the previous time step ...
What is Subword Tokenization?
Intermediate
Subword algorithms define tokens as characters or subwords, allowing models to mitigate the Out-Of-Vocabulary (OOV) problem efficiently. Examples include Byte-P...
How does Byte-Pair Encoding (BPE) work?
Intermediate
BPE starts with a base vocabulary of single characters. It iteratively finds the most frequent pair of adjacent tokens and merges them into a new single token u...
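The merge loop described above can be sketched in plain Python. This is a toy version under stated assumptions: words are split into characters with no end-of-word marker, ties between equally frequent pairs are broken by first occurrence, and the name `learn_bpe` and the four-word corpus are invented for the example.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(dict(vocab))
```

After two merges the shared stem "low" has become a single token, while the rarer suffix characters remain separate symbols.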
What is WordPiece?
Intermediate
WordPiece (used by BERT) is similar to BPE but it chooses pairs to merge based on maximizing the likelihood of the training data using the language model, rathe...
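The description above concerns how the vocabulary is learned. Once a vocabulary exists, BERT-style WordPiece tokenizers segment a word by greedy longest-match-first lookup, which the sketch below imitates. The tiny vocabulary, the function name `wordpiece_tokenize`, and the choice to map an unmatched word to a single `[UNK]` token are assumptions for illustration.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first and shrink until one is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word maps to the unknown token
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", vocab))
print(wordpiece_tokenize("playing", vocab))
```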
Explain the concept of Word Sense Disambiguation (WSD).
Intermediate
WSD is the process of identifying which sense of a word (i.e., meaning) is used in a sentence, when the word has multiple meanings (e.g., 'bank' of a river vs '...
What is an LLM (Large Language Model)?
Intermediate
An LLM is a very large-scale language model with billions of parameters, typically based on the Transformer architecture, trained on immense quantities...
What is BERT?
Intermediate
BERT (Bidirectional Encoder Representations from Transformers) is a model designed to pre-train deep bidirectional representations from unlabeled text by jointl...
Explain the Masked Language Modeling (MLM) task in BERT.
Intermediate
During pre-training, BERT randomly selects 15% of the input tokens for prediction (most of them are replaced with a [MASK] token), and the objective is to predict the original tokens using context from both directions.
What is Next Sentence Prediction (NSP) in BERT?
Intermediate
NSP is a binary classification task where BERT receives pairs of sentences and learns to predict whether the second sentence logically follows the first one in ...
What is fine-tuning in NLP?
Intermediate
Fine-tuning takes a model that's already been pre-trained on a massive dataset (like BERT) and trains it further on a smaller, task-specific dataset (like senti...
What is the difference between Extractive and Abstractive Summarization?
Intermediate
Extractive summarization pulls the most important sentences verbatim from the text. Abstractive summarization generates new text that captures the essence of th...
How do Convolutional Neural Networks (CNNs) perform in NLP?
Intermediate
While CNNs are mainly used for images, 1D CNNs work well for text classification. They slide multi-word filters over the text embeddings, effectively extracting h...
Explain the Skip-gram architecture in Word2Vec.
Intermediate
Skip-gram predicts context words given a target/center word. It works well with small amounts of data and represents rare words very well compared to CBOW.
Explain Continuous Bag of Words (CBOW) architecture.
Intermediate
CBOW predicts the target word from a window of surrounding context words. It trains faster and represents frequent words better than Skip-gram.
What is Negative Sampling in Word2Vec?
Intermediate
A technique that speeds up the loss calculation by updating weights for only a small sample of 'negative' (incorrect) words along with the target word, rathe...
What is Coreference Resolution?
Intermediate
The task of finding all expressions that refer to the same entity in a text. For example, in 'Jane took her dog out because it was barking', 'Jane' maps to 'her...
What is the difference between Generative and Discriminative models?
Intermediate
Generative models (like Naive Bayes, GANs) model how the data was generated, for example the joint distribution P(X, Y), and can generate new samples. Discriminative models (like Logistic Regression, ...
What is the ELMo algorithm?
Intermediate
Embeddings from Language Models (ELMo) creates contextualized word embeddings using a deep bidirectional LSTM. It computes an embedding for a word based on th...
What is Top-K vs Top-P sampling?
Intermediate
In language generation, Top-K limits the next-word sample to the K most probable tokens. Top-P (nucleus sampling) limits the sample to a dynamic set of tokens w...
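The two filters can be compared side by side on a toy distribution. This is a sketch in plain Python rather than a library implementation; the function names and the five-token probability list are made up for the example.

```python
def top_k_filter(probs, k):
    # Keep the k most probable token ids, then renormalize.
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    # Keep the smallest set of most probable tokens whose cumulative mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

probs = [0.5, 0.3, 0.1, 0.05, 0.05]
top_k = top_k_filter(probs, k=3)
top_p = top_p_filter(probs, p=0.7)
print(sorted(top_k), sorted(top_p))
```

With this distribution, Top-K with k=3 always keeps three candidates, whereas Top-P with p=0.7 keeps only two because the first two tokens already cover 80% of the mass. A token can then be drawn from the surviving set, for instance with `random.choices`.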
What are the common evaluation metrics for NER?
Intermediate
Precision, Recall, and F1-score are standard. However, evaluation must choose between exact boundary matching (exact match) and partial matching (overlap), since entities ofte...
How do you handle Class Imbalance in NLP?
Intermediate
Techniques include oversampling minority classes (SMOTE or synonym replacement), undersampling majority classes, using weighted loss functions, or adopting Foca...
What are Zero-shot and Few-shot learning?
Intermediate
Zero-shot learning means giving the model a task it wasn't explicitly trained on, without any examples. Few-shot means providing the model with a tiny num...
Explain the architecture of a Transformer model.
Advanced
Transformers utilize an Encoder-Decoder structure built entirely upon self-attention mechanisms, dropping recurrent/convolutional layers. Key components include...
What is the purpose of Positional Encoding in Transformers?
Advanced
Since Transformers don't use recurrence and process all tokens simultaneously, they have no inherent notion of sequence order. Positional encodings (using sine/...
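A plain-Python sketch of the fixed sinusoidal scheme from the original Transformer paper, where even dimensions use a sine and odd dimensions a cosine of the same angle. The function name `positional_encoding` and the small sizes are chosen for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Wavelengths grow geometrically with the dimension index.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even dimensions use sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=4, d_model=6)
print(pe[0])
print(pe[1][:2])
```

Each row is added to the token embedding at that position, so two identical tokens at different positions end up with different input vectors.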
Explain the calculation of Scaled Dot-Product Attention.
Advanced
Attention(Q, K, V) = softmax((Q * K^T) / sqrt(d_k)) * V. Query matrices are multiplied with transposed Key matrices to get similarity scores, which are scaled to prevent vanishing gradient...
What is Multi-Head Attention?
Advanced
Instead of performing a single attention function, Multi-Head attention projects Queries, Keys, and Values h times with different learned weights. The h attenti...
What is the difference between GPT and BERT architectures?
Advanced
GPT is an auto-regressive Decoder-only model trained left-to-right to predict the next token (excellent for generation). BERT is an Encoder-only model trained t...
Explain Retrieval Augmented Generation (RAG).
Advanced
RAG connects an LLM to an external knowledge database. Upon a query, semantic search retrieves relevant document chunks from the database, prepends them to the ...
What is LoRA (Low-Rank Adaptation)?
Advanced
LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture. This drastically r...
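A back-of-the-envelope sketch of why the low-rank update is cheap to train: a frozen weight matrix W of shape d_out x d_in is adapted through two small matrices B (d_out x rank) and A (rank x d_in). The 4096 x 4096 layer size and rank 8 are illustrative values, not taken from any particular model.

```python
def lora_param_counts(d_out, d_in, rank):
    full = d_out * d_in           # parameters in the frozen weight matrix W
    lora = rank * (d_out + d_in)  # parameters in B (d_out x rank) and A (rank x d_in)
    return full, lora

full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
ratio = lora / full
print(full, lora, ratio)
```

For this single layer the trainable parameters drop from about 16.8 million to 65,536, which is under 0.4% of the original count.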
What is RLHF (Reinforcement Learning from Human Feedback)?
Advanced
RLHF aligns LLMs with human intent. It trains a 'Reward Model' based on human rankings of text outputs, and then uses Proximal Policy Optimization (PPO) reinfor...
What is Prompt Engineering vs Prompt Tuning?
Advanced
Prompt Engineering is manually writing and refining text prompt templates. Prompt Tuning (or Soft Prompts) keeps model weights frozen but adds a small number of tunable...
Explain the difference between T5 and standard Seq2Seq models.
Advanced
T5 (Text-to-Text Transfer Transformer) casts every NLP task (classification, translation, QA) into a text-to-text format, so both the inputs and outputs are treat...
What are Vector Databases?
Advanced
Databases specialized in storing and indexing high-dimensional embeddings derived from data (text, images). They perform ultra-fast similarity search...
Explain the 'Hallucination' problem in LLMs and mitigation strategies.
Advanced
Hallucination is when LLMs generate fluent but factually incorrect information. Mitigation includes RAG, strict grounding prompts, lower temperature, RLHF, or u...
What are KV (Key-Value) Caches in LLM decoding?
Advanced
During auto-regressive generation, past tokens' Key and Value tensors in the attention layers are cached instead of being recomputed every step. This turns gene...
What is FlashAttention?
Advanced
FlashAttention is an IO-aware algorithm that computes exact attention much faster than the standard implementation. It avoids moving large N x N matrices between HBM (GPU mem...
What is Quantization in neural networks? (e.g. 4-bit, 8-bit)
Advanced
Quantization reduces the precision of the network's weights and activations from FP32 (32-bit float) down to INT8 or INT4 formats. This significantly shrinks mo...
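One simple variant is symmetric per-tensor INT8 quantization, sketched below in plain Python. It assumes at least one non-zero weight and uses a single scale for the whole list; real toolkits typically quantize per channel or per group and calibrate activations separately. The function names and weight values are illustrative.

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: map the largest magnitude to 127.
    scale = max(abs(w) for w in weights) / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the stored integers.
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is now stored as one byte plus a shared scale, and the round-trip error per weight is bounded by half the scale.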
Differentiate between RoPE and Absolute Positional Encodings.
Advanced
Rotary Position Embedding (RoPE) encodes absolute position by multiplying query and key representations with a rotation matrix. It naturally captures relative posit...
Explain ALiBi (Attention with Linear Biases).
Advanced
ALiBi removes positional embeddings entirely. Instead, it statically adds a linear penalty to the attention scores before the softmax operation depending on dis...
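A sketch of the bias matrix for one attention head in a causal (left-to-right) setting. The slope schedule follows the geometric sequence the ALiBi paper gives for a power-of-two head count; masking future positions with negative infinity is the usual causal mask, added here as an assumption, and the function names are illustrative.

```python
def alibi_slopes(num_heads):
    # Geometric sequence of per-head slopes for a power-of-two head count.
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    # bias[i][j] = -slope * (i - j) for keys j at or before query i; future keys are masked.
    return [[-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(seq_len=4, slope=slopes[0])
for row in bias:
    print(row)
```

The bias is added to the raw attention scores before the softmax, so keys that are further from the query receive a proportionally larger penalty.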
Explain Directed Acyclic Graphs (DAGs) in Dependency Parsing.
Advanced
Dependency parsing builds a tree outlining grammatical relations. Using algorithms like Chu-Liu-Edmonds, relationships are formed as directed edges connecting h...
What is the Conditional Random Field (CRF) layer used for in NER?
Advanced
A CRF layer sits on top of an LSTM/Transformer and predicts tags jointly. It learns transition probabilities between labels (e.g., ensuring I-ORG follows B-ORG ...
What is Knowledge Distillation?
Advanced
A compression technique where a smaller 'student' model is trained to mimic the softmax probabilities (soft targets) and intermediate representations of a massi...
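The soft-target part of the objective can be sketched as a KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature as in Hinton et al.'s formulation. Matching intermediate representations is not covered here, and the logits, the temperature of 2.0, and the function names are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions, scaled by T^2
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]
loss_same = distillation_loss(teacher, teacher)
loss_diff = distillation_loss([0.5, 1.0, 4.0], teacher)
print(loss_same, loss_diff)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. In practice it is usually combined with the ordinary cross-entropy on the hard labels.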
Explain Mixture of Experts (MoE) architecture.
Advanced
MoE replaces the dense Feed Forward network with multiple parallel 'experts'. A routing network conditionally determines which small subset of experts (usually 1...
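A toy sketch of top-k routing, using scalar inputs and simple functions as stand-in experts so the mechanics are easy to follow. Real experts are feed-forward networks and the router logits come from a learned linear layer; every name and value below is hypothetical.

```python
import math

def route_top_k(router_logits, k=2):
    # Pick the k highest-scoring experts and softmax over just those scores.
    top = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)[:k]
    exps = [math.exp(router_logits[e] - router_logits[top[0]]) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    # Only the selected experts run; their outputs are mixed by the gate weights.
    return sum(weight * experts[e](x) for e, weight in route_top_k(router_logits, k))

experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x ** 2]
router_logits = [0.1, 2.0, -1.0, 2.0]
routing = route_top_k(router_logits, k=2)
output = moe_forward(3.0, experts, router_logits, k=2)
print(routing, output)
```

Only two of the four experts are evaluated for this input, which is how MoE layers grow total parameter count without a matching growth in compute per token.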
How does Semantic Search differ from Lexical/Keyword Search?
Advanced
Lexical search (e.g., BM25) uses sparse TF-IDF-style scoring that matches exact words. Semantic search embeds queries into dense vectors and finds documents closest in vector space, ma...
What is Cross-Encoder vs Bi-Encoder?
Advanced
Bi-encoders independently embed query and document, computing a simple fast dot product for similarity (great for searching databases). Cross-encoders concatena...
What is Token Healing in LLMs?
Advanced
Because tokenization merges leading spaces and characters into tokens, a prompt that ends abruptly can split a logical word and push the model into low-probability continuations. Token healing dynam...
Explain Direct Preference Optimization (DPO).
Advanced
DPO acts as a simpler, more stable alternative to RLHF. Instead of training a separate reward model, DPO mathematically maps the reward function directly onto t...
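For a single preference pair, the DPO objective can be sketched as a logistic loss on the difference of beta-scaled log-probability ratios between the trained policy and a frozen reference model. The log-probabilities below are made-up numbers, and `beta=0.1` is only an illustrative setting.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards are beta-scaled log-ratios between the policy and the frozen reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    return math.log(1 + math.exp(-margin))  # equals -log(sigmoid(margin))

neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(neutral, improved)
```

When the policy equals the reference the loss is log 2; it falls as the policy raises the likelihood of the chosen response and lowers that of the rejected one relative to the reference.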
Implement Cosine Similarity in Python.
Math/Code
Dot product divided by the product of the magnitudes.
Python / PyTorch Code
import numpy as np
def cosine_sim(a, b):
    # Dot product of the vectors over the product of their L2 norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Calculate TF-IDF for the word 'data'.
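Math/Code
TF(t) = (count of t in doc) / (total words in doc). IDF(t) = log_e(Total Docs / Docs with t). The code adds 1 to the IDF denominator, a common smoothing variant that avoids division by zero for unseen terms.
Python / PyTorch Code
import math
tf = 3 / 100  # 'data' appears 3 times in a 100-word document
idf = math.log(1000 / (10 + 1))  # 1000 documents, 10 contain 'data'; +1 smooths the denominator
tfidf = tf * idf
Write a basic Self-Attention calculation in PyTorch.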
Math/Code
Forward pass using matrix multiplications.
Python / PyTorch Code
import math
import torch
import torch.nn.functional as F
def attention(q, k, v, d_k):
    # Similarity scores between queries and keys, scaled by sqrt(d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    attn = F.softmax(scores, dim=-1)  # normalize over key positions
    return torch.matmul(attn, v)  # weighted sum of the values
Convert a list of texts to a bag of words.
Math/Code
Using scikit-learn's CountVectorizer.
Python / PyTorch Code
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(['nlp is fun', 'ai is nlp'])  # sparse document-term count matrix
print(vec.get_feature_names_out())  # vocabulary in column order
print(X.toarray())
How to apply Softmax in PyTorch?
Math/Code
Using F.softmax along the last dimension.
Python / PyTorch Code
import torch
import torch.nn.functional as F
logits = torch.tensor([1.0, 2.0, -1.0])
probs = F.softmax(logits, dim=-1)  # dim=-1 is the last dimension; probs sums to 1
Implement a simple bigram character generator.
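Math/Code
Look up next-character probabilities in a bigram count matrix.
Python / PyTorch Code
import torch
counts = torch.zeros((27, 27))
# ... fill counts
probs = counts / counts.sum(1, keepdim=True)  # normalize each row into a distribution
i = torch.multinomial(probs[0], num_samples=1)  # sample the character that follows index 0
char = itos[i.item()]  # itos: index-to-character mapping, assumed defined elsewhere
Extract Named Entities using spaCy.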
Math/Code
Load the small English pipeline and iterate over the entities.
Python / PyTorch Code
import spacy
nlp = spacy.load('en_core_web_sm')  # requires: python -m spacy download en_core_web_sm
doc = nlp('Apple is buying a startup based in UK')
for ent in doc.ents:
    print(ent.text, ent.label_)
What is the Cross-Entropy Loss formula?
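Math/Code
Loss = -SUM(p(x) * log(q(x))) where p is the true distribution and q is the predicted distribution.
Python / PyTorch Code
import torch.nn as nn
loss_fn = nn.CrossEntropyLoss()  # expects raw logits and applies log-softmax internally
calculated = loss_fn(logits, targets)  # logits: (batch, classes); targets: (batch,) class indices
Load a transformer model via Hugging Face.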
Math/Code
Using the pipeline helper or loading the model directly.
Python / PyTorch Code
from transformers import pipeline
classifier = pipeline('sentiment-analysis')  # downloads a default model on first use
result = classifier('I love NLP!')
Write WordPiece Subword Tokenizer logic pseudo-code.
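Math/Code
Greedy longest-match loop.
Python / PyTorch Code
token_list = []
while word:
    # get_longest_matching_prefix is a hypothetical vocabulary lookup helper
    substr = get_longest_matching_prefix(word)
    if not substr:  # no vocabulary entry matches: stop instead of looping forever
        token_list = ['[UNK]']
        break
    token_list.append(substr)
    word = word[len(substr):]
How to clip gradients in PyTorch to prevent exploding gradients?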
Math/Code
Clip the norm of the gradients after the backward pass and before stepping the optimizer.
Python / PyTorch Code
# Call after loss.backward() and before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Calculate the Jaccard Similarity between two sentences.
Math/Code
Intersection over Union of the word sets.
Python / PyTorch Code
def jaccard(s1, s2):
    set1, set2 = set(s1.split()), set(s2.split())
    intersect = len(set1.intersection(set2))
    union = len(set1) + len(set2) - intersect
    return intersect / union if union else 0.0  # two empty sentences: define similarity as 0
Write an Attention Mask for sequence padding.
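Math/Code
Build a boolean mask over the padding positions and fill their attention scores with a very large negative number so that Softmax assigns them near-zero weight.
Python / PyTorch Code
# (batch, seq_len) -> (batch, 1, 1, seq_len), broadcastable over heads and query positions
mask = (input_ids == pad_token).unsqueeze(1).unsqueeze(2)
scores = scores.masked_fill(mask, -1e9)  # scores: (batch, heads, q_len, k_len)
Implement an early stopping loop in PyTorch.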
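Math/Code
Track the validation loss and stop once it has not improved for 'patience' consecutive epochs.
Python / PyTorch Code
best = float('inf')
patience, count = 3, 0
for epoch in range(max_epochs):  # max_epochs: assumed training budget
    val_loss = evaluate(model)  # evaluate: hypothetical validation helper
    if val_loss < best:
        best, count = val_loss, 0  # improvement: reset the counter
    else:
        count += 1
        if count >= patience:
            break
Generate Text using Hugging Face GPT-2.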
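Math/Code
Using the generate method.
Python / PyTorch Code
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')
inputs = tokenizer('Hello, my dog is', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Pad sequences in plain Python.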
Math/Code
Add zeros until every sequence reaches max_len.
Python / PyTorch Code
max_len = max(len(seq) for seq in sequences)  # pad to the longest sequence in the batch
padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]
Implement Levenshtein distance conceptually using recursion.
Math/Code
Minimize over insertions, deletions, and substitutions.
Python / PyTorch Code
def lev(a, b):
    # Naive recursion: exponential time, for illustration only
    if not a: return len(b)
    if not b: return len(a)
    cost = 0 if a[0] == b[0] else 1
    return min(lev(a[1:], b) + 1,         # delete from a
               lev(a, b[1:]) + 1,         # insert into a
               lev(a[1:], b[1:]) + cost)  # substitute (or match)
How to initialize LayerNorm weights?
Math/Code
Initialize gamma (scale) to 1 and beta (shift) to 0.
Python / PyTorch Code
# Inside the module's __init__; features is the size of the normalized dimension
self.gamma = nn.Parameter(torch.ones(features))
self.beta = nn.Parameter(torch.zeros(features))
Calculate the output dimension of a 1D Conv Layer over text.
Math/Code
Length = floor((Input - Filter + 2*Pad) / Stride) + 1, assuming a dilation of 1.
Python / PyTorch Code
out_dim = (L_in - kernel_size + 2 * padding) // stride + 1  # integer division floors the quotient
Retrieve embeddings from a PyTorch Embedding layer.
Math/Code
Pass a tensor of indices to the embedding layer instance.
Python / PyTorch Code
import torch
import torch.nn as nn
embeds = nn.Embedding(vocab_size, dim)  # vocab_size and dim assumed defined; vocab_size must exceed 10 here
indices = torch.tensor([1, 4, 10])
vectors = embeds(indices)  # shape: (3, dim)
Calculate Parameters in an LSTM relative to Input/Hidden size.
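Math/Code
4 * ((input_size * hidden_size) + (hidden_size * hidden_size) + hidden_size), one set of weights per gate. PyTorch's nn.LSTM keeps two bias vectors per gate, so its count uses 2 * hidden_size for the bias term.
Python / PyTorch Code
params = 4 * ((n * m) + (m * m) + m)  # n = input_size, m = hidden_size
Save and load a PyTorch NLP Model.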
Math/Code
Using state dicts.
Python / PyTorch Code
torch.save(model.state_dict(), 'model.pth')
model.load_state_dict(torch.load('model.pth'))  # model must be built with the same architecture first
model.eval()  # switch off dropout before inference
Perform Top-K sampling conceptually.
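Math/Code
Take the k highest logits, renormalize over them (the rest get zero probability), and sample.
Python / PyTorch Code
import torch
import torch.nn.functional as F
topk_logits, topk_indices = torch.topk(logits, k=5)
probs = F.softmax(topk_logits, dim=-1)  # renormalize over the k kept tokens
choice = torch.multinomial(probs, 1)  # position within the top-k list
next_token = topk_indices.gather(-1, choice)  # map back to a vocabulary id
Convert Text to a sequence using a Hugging Face Tokenizer.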
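Math/Code
Calling the tokenizer directly returns a dict-like object containing input_ids.
Python / PyTorch Code
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # checkpoint chosen for illustration
encodings = tokenizer(['Text one', 'Text two'], padding=True, truncation=True)
print(encodings['input_ids'])
Write an RNN step function loop.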
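Math/Code
Iterate over the time dimension, manually updating the hidden state.
Python / PyTorch Code
# Assumed shapes: W_xh (hidden, input), W_hh (hidden, hidden), b (hidden,), each x_t (input,)
h_t = torch.zeros(hidden_size)
for x_t in inputs:  # inputs: (seq_len, input_size)
    h_t = torch.tanh(W_xh @ x_t + W_hh @ h_t + b)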