NLP Interview Master

The ultimate collection of 200+ meticulously curated Natural Language Processing & LLM questions to help you ace your ML Engineer interview.

1
What is Natural Language Processing (NLP)?
Beginner

Comprehensive Explanation
Natural Language Processing (NLP) is a subfield of artificial intelligence, computer science, and linguistics concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
Beginner
2
What is Tokenization?
Beginner

Comprehensive Explanation
Tokenization is the process of breaking down a stream of text into smaller components called tokens. These could be words, characters, or subwords. It is usually the first step in an NLP pipeline.
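A minimal sketch of word-level and character-level tokenization, assuming a simple regex is good enough for illustration. The function names are hypothetical; production tokenizers handle contractions, Unicode, and subwords far more carefully.

```python
import re

def word_tokenize(text: str) -> list[str]:
    """Split text into word tokens and standalone punctuation marks."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text: str) -> list[str]:
    """Split text into individual characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]

print(word_tokenize("Hello, world!"))
print(char_tokenize("NLP is fun"))
```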
Beginner
3
What is Stemming?
Beginner

Comprehensive Explanation
Stemming is a crude heuristic process that chops off the ends of words to find their base or root form, even if the root is not a valid word. For example, 'running' becomes 'run', while the Porter stemmer reduces 'studies' to the non-word 'studi'.
Beginner
4
What is Lemmatization?
Beginner

Comprehensive Explanation
Lemmatization uses vocabulary and morphological analysis of words to remove inflectional endings and return the base or dictionary form of a word, which is known as the lemma. For instance, 'better' becomes 'good'.
Beginner
5
What is Stop Word Removal?
Beginner

Comprehensive Explanation
Stop words are high-frequency words that often add little lexical value to sentences, such as 'is', 'the', and 'at'. Removing them shrinks the vocabulary and can speed up training, but it does not always help accuracy: negations such as 'not' are often listed as stop words yet are essential for tasks like sentiment analysis.
Beginner
6
What is a Bag-of-Words (BoW) model?
Beginner

Comprehensive Explanation
A Bag-of-Words model is a simple representation of text used in NLP. It describes the occurrence of words within a document, ignoring word order and grammar but keeping multiplicity (frequency).
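A minimal sketch of the idea, assuming lowercase whitespace splitting as the tokenizer. The function name is hypothetical. Note that two documents with the same words in a different order get the same representation.

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(document.lower().split())

bow = bag_of_words("the cat sat on the mat")
print(bow["the"], bow["cat"], bow["dog"])  # word order is lost, counts are kept
```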
Beginner
7
What is TF-IDF?
Beginner

Comprehensive Explanation
TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Words that appear frequently in one document but rarely across all documents get the highest score.
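A small worked sketch, assuming one common variant: tf is the term's relative frequency in the document and idf is log(N / df). Libraries often add smoothing, so their numbers may differ. The toy corpus and function name are made up for illustration.

```python
import math

def tf_idf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    """TF-IDF with tf = relative frequency and idf = log(N / df)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    if df == 0:
        return 0.0  # term never appears in the corpus
    return tf * math.log(len(corpus) / df)

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
print(tf_idf("the", corpus[0], corpus))  # in every document, so idf = log(1) = 0
print(tf_idf("mat", corpus[0], corpus))  # in one document only, so it scores higher
```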
Beginner
8
What are Word Embeddings?
Beginner

Comprehensive Explanation
Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They map words or phrases to vectors of real numbers in a continuous vector space (e.g., Word2Vec, GloVe).
Beginner
9
What is Word2Vec?
Beginner

Comprehensive Explanation
Word2Vec is a popular algorithm used for generating dense word embeddings. Created by researchers at Google, it uses a two-layer neural network (either Skip-Gram or CBOW) to reconstruct linguistic contexts of words.
Beginner
10
What is Part-of-Speech (POS) Tagging?
Beginner

Comprehensive Explanation
POS tagging is the process of marking up a word in a text as corresponding to a particular part of speech based on both its definition and its context (e.g., tagging a word as a noun, verb, adjective).
Beginner
11
What is Named Entity Recognition (NER)?
Beginner

Comprehensive Explanation
NER is an information extraction technique that seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities.
Beginner
12
What is Sentiment Analysis?
Beginner

Comprehensive Explanation
Sentiment analysis (or opinion mining) uses NLP to identify, extract, and quantify subjective information from text, typically classifying the polarity as positive, negative, or neutral.
Beginner
13
What are N-grams?
Beginner

Comprehensive Explanation
An N-gram is a contiguous sequence of n items from a given sample of text or speech. When n=1 it's a unigram, n=2 is a bigram, and n=3 is a trigram. Because each N-gram keeps neighboring words together, it captures local word order that a unigram Bag-of-Words model misses.
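A minimal sketch of N-gram extraction over a token list. The function name is hypothetical.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york city".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams keep 'new york' together
```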
Beginner
14
What is Levenshtein Distance?
Beginner

Comprehensive Explanation
Levenshtein Distance is a string metric for measuring the difference between two sequences. It is the minimum number of single-character edits (insertions, deletions, substitutions) required to change one word into the other.
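The distance is usually computed with dynamic programming. A minimal sketch that keeps only two rows of the table, with unit cost for each edit (function name is illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (free if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
```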
Beginner
15
What is an RNN (Recurrent Neural Network)?
Beginner

Comprehensive Explanation
An RNN is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes. This makes them good for sequential data like text.
Beginner
16
Why use LSTM instead of a basic RNN?
Beginner

Comprehensive Explanation
LSTMs (Long Short-Term Memory networks) are a special kind of RNN capable of learning long-term dependencies. They mitigate the vanishing gradient problem of standard RNNs through a memory cell and input, forget, and output gates that control what is stored, discarded, and exposed at each step.
Beginner
17
What is the Vanishing Gradient Problem?
Beginner

Comprehensive Explanation
During backpropagation in deep neural networks (especially RNNs), the gradient for an early layer or time step is a product of many per-layer factors. If those factors are smaller than 1, the product shrinks exponentially with depth, so earlier layers receive almost no learning signal.
Beginner
18
What is perplexity in Language Modeling?
Beginner

Comprehensive Explanation
Perplexity measures how well a probability model predicts a sample. Formally, it is the exponential of the average negative log-likelihood per token, so a lower perplexity means the language model assigns higher probability to the actual next words.
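A minimal sketch, assuming you already have the probability the model assigned to each observed token (the input list here is made up). Perplexity is then exp of the average negative log-probability.

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """exp of the average negative log-probability of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that is always choosing uniformly among 4 options has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

One way to read the number: a perplexity of 4 means the model is, on average, as uncertain as if it were picking among 4 equally likely tokens.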
Beginner
19
What is BLEU score?
Beginner

Comprehensive Explanation
BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another based on n-gram overlap between candidate and reference texts.
Beginner
20
What is ROUGE score?
Beginner

Comprehensive Explanation
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating automatic summarization and machine translation. It focuses primarily on recall, that is, how much of the reference text's n-grams or subsequences also appear in the generated text.
Beginner
21
What is a Language Model?
Beginner

Comprehensive Explanation
A language model learns the probability distribution over sequences of words. It tries to predict the next word in a sequence given the previous words.
Beginner
22
What are text generation constraints?
Beginner

Comprehensive Explanation
Constraints like temperature, top-k sampling, and top-p (nucleus) sampling help control how deterministic or creative the output sequence from a language model is.
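A minimal sketch of how temperature works: the logits are divided by the temperature before the softmax. The logit values below are made up for illustration.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Temperatures below 1 sharpen the distribution (more deterministic output); temperatures above 1 flatten it (more varied output).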
Beginner
23
Calculate Document Frequency.
Beginner

Comprehensive Explanation
Document Frequency (DF) is the number of documents in which a term appears. It helps determine the uniqueness of the term globally.
Beginner
24
What is the GloVe embedding?
Beginner

Comprehensive Explanation
GloVe (Global Vectors) is an unsupervised learning algorithm for obtaining vector representations for words based on aggregated global word-word co-occurrence statistics.
Beginner
25
Why is sequence padding important?
Beginner

Comprehensive Explanation
Neural networks process batches as matrices, so all input sequences in a batch must be the same length. Padding appends a neutral token (such as ID 0) to shorter sequences until they match the longest one, and a mask usually tells the model to ignore the padded positions.
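A minimal sketch of right-padding a batch of token-ID lists and building the matching mask. The function name and the choice of 0 as the pad ID are assumptions for illustration.

```python
def pad_batch(sequences: list[list[int]], pad_id: int = 0):
    """Right-pad every sequence to the longest length; mask is 1 for real tokens."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

padded, mask = pad_batch([[5, 6, 7], [8]])
print(padded)
print(mask)
```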
Beginner
26
Explain the Attention mechanism.
Intermediate

Comprehensive Explanation
Attention mechanisms allow models to focus on specific parts of an input sequence when predicting the output, rather than relying on a single fixed-length hidden vector. At each step, attention computes a weighted sum of the input representations, with weights reflecting how relevant each input position is.
Intermediate
27
What is Self-Attention?
Intermediate

Comprehensive Explanation
Self-attention, or intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. It's the core component of Transformers.
Intermediate
28
Explain the Sequence-to-Sequence (Seq2Seq) Model.
Intermediate

Comprehensive Explanation
Seq2Seq models take an input sequence and generate an output sequence. They typically use an Encoder to read the input into a context vector, and a Decoder to generate the output sequence (e.g., for translation).
Intermediate
29
What is Teacher Forcing?
Intermediate

Comprehensive Explanation
Teacher forcing is a fast and effective training technique for RNNs/Seq2Seq models where the model receives the ground truth output from the previous time step as input for the current time step, instead of its own prediction.
Intermediate
30
What is Subword Tokenization?
Intermediate

Comprehensive Explanation
Subword algorithms define tokens as characters or subwords, allowing models to mitigate the Out-Of-Vocabulary (OOV) problem efficiently. Examples include Byte-Pair Encoding (BPE), WordPiece, and SentencePiece.
Intermediate
31
How does Byte-Pair Encoding (BPE) work?
Intermediate

Comprehensive Explanation
BPE starts with a base vocabulary of single characters. It iteratively finds the most frequent pair of adjacent tokens and merges them into a new single token until a target vocabulary size is reached.
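A toy sketch of BPE training. It assumes a fixed number of merges stands in for the target vocabulary size (each merge adds one token), and it ignores details such as end-of-word markers and byte-level fallback. The function name and word list are made up for illustration.

```python
from collections import Counter

def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Return the merge rules learned from a list of words (with repeats)."""
    # Each distinct word starts as a tuple of single characters with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite each word, replacing the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

words = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
print(learn_bpe(words, 3))
```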
Intermediate
32
What is WordPiece?
Intermediate

Comprehensive Explanation
WordPiece (used by BERT) is similar to BPE but it chooses pairs to merge based on maximizing the likelihood of the training data using the language model, rather than just raw frequency.
Intermediate
33
Explain the concept of Word Sense Disambiguation (WSD).
Intermediate

Comprehensive Explanation
WSD is the process of identifying which sense of a word (i.e., meaning) is used in a sentence, when the word has multiple meanings (e.g., 'bank' of a river vs 'bank' as an institution).
Intermediate
34
What is an LLM (Large Language Model)?
Intermediate

Comprehensive Explanation
An LLM is a very large scale language model consisting of billions of parameters, typically based on the Transformer architecture, trained on immense quantities of unlabeled text data.
Intermediate
35
What is BERT?
Intermediate

Comprehensive Explanation
BERT (Bidirectional Encoder Representations from Transformers) is a model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
Intermediate
36
Explain the Masked Language Modeling (MLM) task in BERT.
Intermediate

Comprehensive Explanation
During pre-training, BERT randomly masks 15% of the input tokens, and the objective is to predict those masked words using context from both directions.
Intermediate
37
What is Next Sentence Prediction (NSP) in BERT?
Intermediate

Comprehensive Explanation
NSP is a binary classification task where BERT receives pairs of sentences and learns to predict whether the second sentence logically follows the first one in the original document.
Intermediate
38
What is fine-tuning in NLP?
Intermediate

Comprehensive Explanation
Fine-tuning takes a model that has already been pre-trained on a massive dataset (like BERT) and trains it further on a smaller, task-specific dataset (like sentiment analysis), adjusting its weights slightly.
Intermediate
39
What is the difference between Extractive and Abstractive Summarization?
Intermediate

Comprehensive Explanation
Extractive summarization pulls the most important verbatim sentences from the text. Abstractive summarization generates new text that captures the essence of the original text, similar to human summarizing.
Intermediate
40
How do Convolutional Neural Networks (CNNs) perform in NLP?
Intermediate

Comprehensive Explanation
While CNNs are mainly for images, 1D CNNs are excellent at text classification. They slide multi-word filters over the text embeddings, effectively extracting highly local N-gram features independent of where they appear.
Intermediate
41
Explain the Skip-gram architecture in Word2Vec.
Intermediate

Comprehensive Explanation
Skip-gram predicts context words given a target/center word. It works well with small amounts of data and represents rare words very well compared to CBOW.
Intermediate
42
Explain Continuous Bag of Words (CBOW) architecture.
Intermediate

Comprehensive Explanation
CBOW predicts the target word from a window of surrounding context words. It trains faster and represents frequent words better than Skip-gram.
Intermediate
43
What is Negative Sampling in Word2Vec?
Intermediate

Comprehensive Explanation
A technique to make calculating the loss faster by updating weights for merely a small sample of 'negative' (incorrect) words along with the target word, rather than computing softmax over the entire massive vocabulary.
Intermediate
44
What is Coreference Resolution?
Intermediate

Comprehensive Explanation
The task of finding all expressions that refer to the same entity in a text. For example, in 'Jane took her dog out because it was barking', 'Jane' maps to 'her', and 'dog' maps to 'it'.
Intermediate
45
Difference between Generative and Discriminative models.
Intermediate

Comprehensive Explanation
Generative models (like Naive Bayes or GANs) model how the data was generated, for example the joint distribution P(X, Y), and can therefore generate new samples. Discriminative models (like Logistic Regression or a BERT classifier) learn only the conditional distribution P(Y|X), that is, the decision boundary between classes.
Intermediate
46
What is the ELMo algorithm?
Intermediate

Comprehensive Explanation
Embeddings from Language Models (ELMo) creates contextualized word embeddings using a deep bidirectional LSTM language model: a forward and a backward LSTM whose layer outputs are combined. The embedding for a word therefore depends on the full sentence context.
Intermediate
47
What is Top-K vs Top-P sampling?
Intermediate

Comprehensive Explanation
In language generation, Top-K restricts sampling to the K most probable next tokens. Top-P (nucleus sampling) restricts sampling to the smallest set of most-probable tokens whose cumulative probability reaches P, so the size of the set changes with how confident the model is.
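A minimal sketch of both filters over a toy next-token distribution. The token probabilities and function names are made up; after filtering, the kept probabilities are renormalized so they can be sampled from.

```python
def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
print(top_k_filter(probs, 2))    # always exactly 2 candidates
print(top_p_filter(probs, 0.9))  # as many candidates as needed to cover 90%
```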
Intermediate
48
What are the common evaluation metrics for NER?
Intermediate

Comprehensive Explanation
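Precision, Recall, and F1-score are standard. However, because entities often span multiple words, you must decide how to count a prediction: exact-match scoring requires both the entity boundaries and the type to be correct, while partial-match scoring gives credit for overlapping spans.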
Intermediate
49
How do you handle Class Imbalance in NLP?
Intermediate

Comprehensive Explanation
Techniques include oversampling minority classes (SMOTE or synonym replacement), undersampling majority classes, using weighted loss functions, or adopting Focal Loss.
Intermediate
50
What are Zero-shot and Few-shot learning?
Intermediate

Comprehensive Explanation
Zero-shot learning means giving the model a task it wasn't explicitly trained on, without any examples in the prompt. Few-shot means providing the model with a tiny number of demonstration examples (1-5) in the prompt.
Intermediate
51
Explain the architecture of a Transformer model.
Advanced

Comprehensive Explanation
Transformers utilize an Encoder-Decoder structure built entirely upon self-attention mechanisms, dropping recurrent/convolutional layers. Key components include Multi-Head Attention, Feed Forward Networks, Layer Normalization, and Positional Encodings.
Advanced
52
What is the purpose of Positional Encoding in Transformers?
Advanced

Comprehensive Explanation
Since Transformers don't use recurrence and process all tokens simultaneously, they have no inherent notion of sequence order. Positional encodings (fixed sine/cosine functions in the original Transformer, or learned vectors in models such as BERT) are added to the input embeddings to inject the relative or absolute position of each token.
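A minimal sketch of the sinusoidal scheme from the original Transformer, where even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The function name is illustrative, and plain Python lists stand in for tensors.

```python
import math

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """Return a max_len x d_model table of sine/cosine position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):  # i is the even dimension index (2i in the formula)
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=4, d_model=4)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 in every pair
print(pe[1])
```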
Advanced
53
Explain the calculation of Scaled Dot-Product Attention.
Advanced

Comprehensive Explanation
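Attention(Q, K, V) = softmax((Q * K^T) / sqrt(d_k)) * V. The Query matrix is multiplied by the transposed Key matrix to get similarity scores. The scores are divided by sqrt(d_k) so that large dot products do not push the softmax into saturated regions where gradients vanish. The softmax output is then used to weight the rows of the Value matrix.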
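A minimal single-head sketch in plain Python, with nested lists standing in for matrices. It omits masking and batching, and the function names are illustrative rather than any library's API.

```python
import math

def softmax(row: list[float]) -> list[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for row-major nested lists."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention([[0.0, 0.0]], K, V))   # equal scores: average of V
print(scaled_dot_product_attention([[100.0, 0.0]], K, V)) # strongly matches the first key
```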
Advanced
54
What is Multi-Head Attention?
Advanced

Comprehensive Explanation
Instead of performing a single attention function, Multi-Head attention projects Queries, Keys, and Values h times with different learned weights. The h attention outputs are concatenated and linearly projected. It allows focusing on different representation subspaces (e.g., subject-verb vs adjectives).
Advanced
55
Difference between GPT architectures and BERT?
Advanced

Comprehensive Explanation
GPT is an auto-regressive Decoder-only model trained left-to-right to predict the next token (excellent for generation). BERT is an Encoder-only model trained to bidirectionally reconstruct masked tokens (excellent for classification/understanding).
Advanced
56
Explain Retrieval Augmented Generation (RAG).
Advanced

Comprehensive Explanation
RAG connects an LLM to an external knowledge database. Upon a query, semantic search retrieves relevant document chunks from the database and prepends them to the LLM's prompt, which reduces hallucinations and anchors answers in the retrieved facts.
Advanced
57
What is LoRA (Low-Rank Adaptation)?
Advanced

Comprehensive Explanation
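LoRA freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into layers of the Transformer architecture (typically the attention projections). Each weight update is represented as the product of two small matrices, which cuts the number of trainable parameters to a small fraction of the model (often around 1% or less) while keeping quality close to full fine-tuning.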
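A back-of-the-envelope sketch of where the savings come from. For a frozen weight matrix W of shape d_out x d_in, LoRA trains B (d_out x r) and A (r x d_in) so that the effective weight is W + BA. The layer size and rank below are illustrative assumptions, not taken from any particular model.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Trainable LoRA parameters divided by the parameters of the frozen matrix."""
    full = d_out * d_in            # parameters in the frozen weight W
    lora = rank * (d_out + d_in)   # parameters in B (d_out x r) plus A (r x d_in)
    return lora / full

# Hypothetical 4096 x 4096 projection with rank 8.
print(lora_trainable_fraction(4096, 4096, 8))
```

For this one layer the adapter holds 65,536 parameters against 16,777,216 frozen ones, which is roughly 0.4%.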
Advanced
58
What is RLHF (Reinforcement Learning from Human Feedback)?
Advanced

Comprehensive Explanation
RLHF aligns LLMs with human intent. It trains a 'Reward Model' based on human rankings of text outputs, and then uses Proximal Policy Optimization (PPO) reinforcement learning to fine-tune the LLM to maximize those rewards.
Advanced
59
What is Prompt Engineering vs Prompt Tuning?
Advanced

Comprehensive Explanation
Prompt Engineering is manually writing and refining text prompts or templates; no model parameters change. Prompt Tuning (or Soft Prompts) keeps model weights frozen but adds a small number of tunable, continuous (vector) tokens to the input, which are updated via backpropagation.
Advanced
60
Explain the difference between T5 and standard Seq2Seq models.
Advanced

Comprehensive Explanation
T5 (Text-to-Text Transfer Transformer) casts every NLP task—classification, translation, QA—into a text-to-text format. So both the inputs and outputs are treated as text strings, allowing unified learning.
Advanced
61
What are Vector Databases?
Advanced

Comprehensive Explanation
Databases specialized in storing and indexing high-dimensional embeddings derived from data (text, images). They perform fast similarity searches, usually approximate nearest-neighbor lookups, using metrics like Cosine Similarity (e.g. Pinecone, Milvus, Qdrant).
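A minimal sketch of the core operation, using brute-force cosine similarity over a tiny in-memory index. The document IDs and 3-dimensional vectors are made up; real vector databases avoid scanning every vector by using approximate nearest-neighbor indexes such as HNSW.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query: list[float], index: dict[str, list[float]], top_k: int = 1) -> list[str]:
    """Return the IDs of the top_k vectors most similar to the query."""
    ranked = sorted(index.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]

# Hypothetical embeddings; a real system would get these from an embedding model.
index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}
print(nearest([1.0, 0.0, 0.0], index, top_k=2))
```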
Advanced
62
Explain the 'Hallucination' problem in LLMs and mitigation strategies.
Advanced

Comprehensive Explanation
Hallucination is when LLMs generate fluent but factually incorrect information. Mitigation includes RAG, strict grounding prompts, lower temperature, RLHF, or using external validation tools (Toolformer).
Advanced
63
What are KV (Key-Value) Caches in LLM decoding?
Advanced

Comprehensive Explanation
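During auto-regressive generation, past tokens' Key and Value tensors in the attention layers are cached instead of being recomputed every step. With N tokens of context, this reduces the attention work for each new token from roughly O(N^2) (recomputing everything) to O(N) (one new query against the cached keys and values), at the cost of extra memory for the cache.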
Advanced
64
What is FlashAttention?
Advanced

Comprehensive Explanation
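FlashAttention is an IO-aware algorithm that computes exact attention (not an approximation) much faster than the standard implementation. It tiles the computation so that blocks of Q, K, and V are processed in fast on-chip SRAM, and the full N x N attention matrix is never written to HBM (the GPU's main memory). Cutting this memory traffic substantially speeds up training and inference and lowers memory use.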
Advanced
65
What is Quantization in neural networks? (e.g. 4-bit, 8-bit)
Advanced

Comprehensive Explanation
Quantization reduces the precision of the network's weights (and, in some schemes, activations) from FP32 or FP16 floating point down to INT8 or INT4 formats. This significantly shrinks the model's VRAM footprint and can speed up inference, usually with only a small loss in quality.
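A minimal sketch of one simple scheme, symmetric "absmax" INT8 quantization of a single weight vector. It is an illustration under that assumption; practical methods typically quantize per channel or per block and handle outliers. Function names and weights are made up.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    # Fall back to 1.0 when all weights are zero to avoid dividing by zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
print(q)
print(dequantize(q, scale))  # close to the originals, within about half a scale step
```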
Advanced
66
Differentiate between RoPE and Absolute Positional Encodings.
Advanced

Comprehensive Explanation
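Rotary Position Embedding (RoPE) encodes absolute position by rotating the query and key vectors by an angle proportional to their position. Because the dot product of two rotated vectors depends only on the difference between their angles, the attention score depends on relative position. Absolute positional encodings instead add a position vector to each token embedding. RoPE generally copes with longer sequences better than learned absolute embeddings, although going far beyond the training length often needs additional scaling techniques.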
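A 2-D sketch of the key property: rotating the query and key by position-dependent angles makes their dot product depend only on the position offset. Real RoPE applies this rotation to each pair of dimensions with a different frequency; the single frequency `theta=0.5` and the vectors here are arbitrary illustrative choices.

```python
import math

def rotate(vec: tuple[float, float], position: int, theta: float = 0.5) -> tuple[float, float]:
    """Rotate a 2-D vector by an angle proportional to its position."""
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def dot(a, b) -> float:
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)
# Same offset of 3 positions, at very different absolute positions.
score_near = dot(rotate(q, 5), rotate(k, 2))
score_far = dot(rotate(q, 105), rotate(k, 102))
print(abs(score_near - score_far) < 1e-9)
```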
Advanced
67
Explain ALiBi (Attention with Linear Biases).
Advanced

Comprehensive Explanation
ALiBi removes positional embeddings entirely. Instead, it statically adds a linear penalty to the attention scores before the softmax operation depending on distance. It allows models to easily extrapolate to sequence lengths not seen during training.
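A minimal sketch of the bias matrix for one attention head in a causal model. The slope is a fixed, non-learned constant per head; the value 0.5 below is arbitrary. The -inf entries are the usual causal mask, not part of ALiBi itself.

```python
def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Bias added to attention scores before softmax: slope * (key_pos - query_pos)."""
    return [
        [slope * (j - i) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]

for row in alibi_bias(3, 0.5):
    print(row)  # keys further in the past get a larger penalty
```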
Advanced
68
Explain Directed Acyclic Graphs (DAGs) in Dependency Parsing.
Advanced

Comprehensive Explanation
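Dependency parsing builds a tree of grammatical relations, with directed edges running from heads to dependents. A dependency tree is a restricted DAG: it contains no cycles and every word has exactly one head. Graph-based parsers score candidate edges and use an algorithm such as Chu-Liu-Edmonds to extract the maximum spanning tree, which guarantees the result has no cyclic dependencies.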
Advanced
69
What is the Conditional Random Field (CRF) layer used for in NER?
Advanced

Comprehensive Explanation
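A CRF layer sits on top of an LSTM or Transformer encoder and scores the whole tag sequence jointly instead of classifying each token independently. It learns a transition score for every pair of labels (e.g., favoring I-ORG after B-ORG and penalizing I-ORG after B-PER), and Viterbi decoding then returns the highest-scoring sequence, which makes invalid tag sequences unlikely.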
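The effect of transition scores shows up at decoding time. Below is a minimal Viterbi sketch in plain Python with hand-picked emission and transition scores (all values are made up for illustration; in a real CRF they are learned).

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence under emission + transition scores."""
    n_tags = len(tags)
    score = list(emissions[0])
    backptrs = []
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            ptrs.append(best_i)
        score = new_score
        backptrs.append(ptrs)
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for ptrs in reversed(backptrs):  # follow back-pointers from the last token
        path.append(ptrs[path[-1]])
    return [tags[t] for t in reversed(path)]


tags = ["O", "B-ORG", "I-ORG"]
NEG = -1e4  # large penalty standing in for a learned "forbidden" transition
transitions = [
    [0.0, 0.0, NEG],  # from O: cannot jump straight to I-ORG
    [0.0, 0.0, 1.0],  # from B-ORG: I-ORG is encouraged
    [0.0, 0.0, 1.0],  # from I-ORG
]
emissions = [
    [2.0, 0.5, 0.0],  # token 1 looks like O
    [0.0, 1.0, 1.2],  # token 2: I-ORG narrowly beats B-ORG locally
    [0.0, 0.0, 2.0],  # token 3 looks like I-ORG
]
greedy = [tags[max(range(len(tags)), key=lambda j: e[j])] for e in emissions]
decoded = viterbi(emissions, transitions, tags)
print(greedy)
print(decoded)
```

Per-token argmax yields the invalid sequence O, I-ORG, I-ORG, while joint decoding returns O, B-ORG, I-ORG.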
Advanced
70
What is Knowledge Distillation?
Advanced

A compression technique where a smaller 'student' model is trained to mimic the softmax probabilities (soft targets) and intermediate representations of a massi...

Comprehensive Explanation
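Knowledge distillation is a compression technique in which a smaller 'student' model is trained to match the temperature-softened output probabilities (soft targets), and sometimes the intermediate representations, of a larger 'teacher' model. The student retains much of the teacher's accuracy at a fraction of the size and inference cost.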
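A minimal sketch of the soft-target part of the objective, assuming the common formulation of a KL divergence between temperature-softened distributions scaled by T^2. The function names are illustrative, and in practice this term is usually combined with the ordinary cross-entropy on hard labels.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2


teacher = [1.0, 2.0, 3.0]
print(distillation_loss([1.0, 2.0, 3.0], teacher))  # student matches the teacher
print(distillation_loss([3.0, 2.0, 1.0], teacher))  # student disagrees
```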
Advanced
71
Explain Mixture of Experts (MoE) architecture.
Advanced

MoE replaces the dense Feed Forward network with multiple parallel 'experts'. A routing network conditionally determines which tiny subset of experts (usually 1...

Comprehensive Explanation
MoE replaces the dense feed-forward network in a Transformer block with multiple parallel 'experts'. A learned routing network selects a small subset of experts (usually 1 or 2) to process each token, so the total parameter count can grow substantially while the FLOPs per token stay roughly constant.
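The routing step can be sketched for a single token in plain Python. The experts below are toy scalar functions and the router logits are hard-coded; both are assumptions for illustration, since a real router is a learned linear layer and the experts are feed-forward networks.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    order = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)
    top = order[:k]
    m = max(router_logits[e] for e in top)
    exps = {e: math.exp(router_logits[e] - m) for e in top}
    total = sum(exps.values())
    return {e: exps[e] / total for e in top}


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    gates = top_k_route(router_logits, k)
    return sum(weight * experts[e](x) for e, weight in gates.items())


experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]  # would come from a learned router in practice
gates = top_k_route(router_logits)
output = moe_forward(3.0, experts, router_logits)
print(gates, output)
```

Only two of the four experts run for this token, which is why compute per token stays flat as more experts are added.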
Advanced
72
How does Semantic Search differ from Lexical/Keyword Search?
Advanced

Lexical search (BM25) uses sparse tf-idf matching exact words. Semantic search embeds queries into dense vectors and finds documents closest in vector space, ma...

Comprehensive Explanation
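Lexical search (e.g., BM25) scores documents with sparse term statistics from the TF-IDF family, so it only matches words that literally appear in the query. Semantic search embeds both queries and documents into dense vectors and retrieves the documents nearest to the query in vector space, which matches synonyms and paraphrases rather than just string overlap.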
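The limitation of lexical matching is easy to reproduce. The following is a small BM25 sketch in plain Python, assuming whitespace tokenization and the commonly used defaults k1 = 1.5 and b = 0.75; `bm25_scores` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each whitespace-tokenized document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)  # document frequency
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["the car is fast", "an automobile with great speed", "cooking pasta at home"]
scores = bm25_scores("car", docs)
print(scores)
```

The second document is about the same topic but shares no term with the query, so it scores zero; a dense retriever would be expected to place it close to the query.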
Advanced
73
What is the difference between a Cross-Encoder and a Bi-Encoder?
Advanced

Bi-encoders independently embed query and document, computing a simple fast dot product for similarity (great for searching databases). Cross-encoders concatena...

Comprehensive Explanation
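Bi-encoders embed the query and the document independently and compare them with a fast dot product or cosine similarity, so document embeddings can be precomputed and indexed (well suited to first-stage retrieval over large collections). Cross-encoders feed the concatenated query and document through the model together, letting self-attention relate every query token to every document token. This is much slower, since each pair needs a full forward pass, but typically more accurate, so cross-encoders are mostly used to re-rank a short candidate list.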
Advanced
74
What is Token Healing in LLMs?
Advanced

Because tokenization merges prefix spaces/characters, prompting a model abruptly might split a logical word forcing poor probability spaces. Token healing dynam...

Comprehensive Explanation
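Tokenizers often merge a leading space or adjacent characters into a single token, so a prompt that ends in the middle of such a unit (for example, with a trailing space or a partial word) leaves the model in a state it rarely saw during training and skews the next-token probabilities. Token healing backs up over the last token of the prompt and constrains generation so that the first new token begins with the removed text, letting the model choose the natural tokenization of the continuation.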
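A toy sketch of the idea in plain Python, assuming a tiny made-up vocabulary and a prompt that is already split into tokens. The function `heal_prompt` is hypothetical; real implementations apply the same prefix constraint to the model's logits.

```python
def heal_prompt(prompt_tokens, vocab):
    """Drop the last prompt token and return the tokens allowed to replace it."""
    *kept, last = prompt_tokens
    # Any token that starts with the removed text is a valid first generated token.
    allowed = [tok for tok in vocab if tok.startswith(last)]
    return kept, allowed


vocab = ["http", ":", "://", "//", "www", " the"]
prompt_tokens = ["http", ":"]  # the prompt "http:" ends on an awkward boundary
kept, allowed = heal_prompt(prompt_tokens, vocab)
print(kept, allowed)
```

After healing, the model may emit the single token "://" instead of being forced to continue after a lone ":".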
Advanced
75
Explain Direct Preference Optimization (DPO).
Advanced

DPO acts as a simpler, more stable alternative to RLHF. Instead of training a separate reward model, DPO mathematically maps the reward function directly onto t...

Comprehensive Explanation
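DPO is a simpler and often more stable alternative to RLHF with PPO. Instead of training a separate reward model and then running reinforcement learning, DPO expresses the reward implicitly as the log-ratio between the policy and a frozen reference model, which turns preference learning into a classification-style loss optimized directly on pairs of preferred and rejected responses.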
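For a single preference pair the loss is short enough to write out. This is a minimal sketch in plain Python, assuming the sequence log-probabilities under the policy and the reference model are already computed; the argument names and beta = 0.1 are illustrative choices.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy raises the chosen response and lowers the rejected one: the loss drops.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(baseline, improved)
```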
Advanced
76
Implement Cosine Similarity in Python.
Math/Code

Dot product divided by product of magnitudes.

Comprehensive Explanation
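Cosine similarity is the dot product of two vectors divided by the product of their magnitudes (L2 norms): cos(a, b) = (a . b) / (||a|| * ||b||). It ranges from -1 to 1 and is undefined when either vector is all zeros.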
Python / PyTorch Code
import numpy as np
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Math_code
77
Calculate TF-IDF for the word 'data'.
Math/Code

TF(t) = (count in doc) / (total words in doc). IDF(t) = log_e(Total Docs / Docs with t).

Comprehensive Explanation
TF(t) = (count in doc) / (total words in doc). IDF(t) = log_e(Total Docs / Docs with t).
Python / PyTorch Code
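import math
# 'data' appears 3 times in a 100-word document
tf = 3 / 100
# 10 of the 1000 documents contain 'data'; smoothed variants add 1 to the denominator
idf = math.log(1000 / 10)
tfidf = tf * idf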
Math_code
78
Write a basic Self-Attention calculation in PyTorch.
Math/Code

Forward pass utilizing matrix multiplications.

Comprehensive Explanation
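Scaled dot-product attention computes softmax(Q K^T / sqrt(d_k)) V, which takes two matrix multiplications and a softmax over the key dimension.
Python / PyTorch Code
import math

import torch
import torch.nn.functional as F

def attention(q, k, v, d_k):
    # q, k, v: (..., seq_len, d_k); scores: (..., seq_len, seq_len)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    attn = F.softmax(scores, dim=-1)  # normalize over the key dimension
    return torch.matmul(attn, v)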
Math_code
79
Code: Convert a list of texts to bag of words.
Math/Code

Using SKLearn CountVectorizer.

Comprehensive Explanation
Use scikit-learn's CountVectorizer, which learns the vocabulary and returns a sparse document-term count matrix.
Python / PyTorch Code
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(['nlp is fun', 'ai is nlp'])
print(X.toarray())
Math_code
80
How to apply Softmax in PyTorch?
Math/Code

Using F.softmax along the last dimension.

Comprehensive Explanation
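Call torch.nn.functional.softmax (commonly imported as F.softmax) along the class dimension, which is usually the last one (dim=-1).
Python / PyTorch Code
import torch
logits = torch.tensor([1.0, 2.0, -1.0])
probs = torch.nn.functional.softmax(logits, dim=-1)  # entries sum to 1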
Math_code
81
Implement a simple bigram character generator.
Math/Code

Matrix probabilities lookup.

Comprehensive Explanation
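Normalize a matrix of bigram counts into row-wise probabilities, then sample the next character from the row of the current character.
Python / PyTorch Code
import torch
counts = torch.zeros((27, 27))  # counts[i, j]: how often character j follows character i
# ... fill counts from the training text (every row needs at least one count)
probs = counts / counts.sum(dim=1, keepdim=True)  # normalize each row into a distribution
i = torch.multinomial(probs[0], num_samples=1)  # sample from row 0, assumed to be the start marker
char = itos[i.item()]  # itos: index-to-character lookup, assumed to be defined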
Math_code
82
Extract Named Entities using spaCy.
Math/Code

Load a pretrained pipeline and iterate over the entities.

Comprehensive Explanation
Load a pretrained pipeline with spacy.load and iterate over doc.ents, which holds the recognized entity spans.
Python / PyTorch Code
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is buying a startup based in UK')
for ent in doc.ents:
    print(ent.text, ent.label_)
Math_code
83
What is the Cross-Entropy Loss formula?
Math/Code

Loss = -SUM(p(x) * log(q(x))) where p is true distribution and q is predicted distribution.

Comprehensive Explanation
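Loss = -SUM_x p(x) * log(q(x)), where p is the true distribution and q is the predicted distribution. With one-hot targets this reduces to -log(q(correct class)).
Python / PyTorch Code
import torch
import torch.nn as nn
loss = nn.CrossEntropyLoss()  # expects raw logits and applies log-softmax internally
calculated = loss(logits, targets)  # logits: (batch, classes) raw scores; targets: (batch,) class indices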
Math_code
84
Load a transformer model via HuggingFace.
Math/Code

Using pipeline or direct.

Comprehensive Explanation
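Use the high-level pipeline API for a ready-made task, or load the tokenizer and model directly with the Auto classes (e.g., AutoTokenizer and AutoModel).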
Python / PyTorch Code
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
result = classifier('I love NLP!')
Math_code
85
Write WordPiece Subword Tokenizer logic pseudo-code.
Math/Code

Greedy matching loop.

Comprehensive Explanation
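At inference time WordPiece tokenizes a word greedily: it repeatedly takes the longest piece of the remaining characters that is in the vocabulary, marks non-initial pieces with '##', and falls back to an unknown token if no piece matches.
Python / PyTorch Code
def wordpiece(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and (("##" if start else "") + word[start:end]) not in vocab:
            end -= 1  # shrink until the longest piece in the vocabulary is found
        if end == start:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        tokens.append(("##" if start else "") + word[start:end])
        start = end
    return tokens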
Math_code
86
How to clip gradients in PyTorch to prevent exploding gradients?
Math/Code

Clips norm of the gradients before stepping optimizer.

Comprehensive Explanation
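Clip the global norm of the gradients after backward() and before optimizer.step(), so that a single oversized update cannot destabilize training.
Python / PyTorch Code
# call after loss.backward() and before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)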
Math_code
87
Calculate the Jaccard Similarity between two sentences.
Math/Code

Intersection over Union of sets.

Comprehensive Explanation
Intersection over Union of sets.
Python / PyTorch Code
def jaccard(s1, s2):
    set1, set2 = set(s1.split()), set(s2.split())
    intersect = len(set1.intersection(set2))
    return intersect / (len(set1) + len(set2) - intersect)
Math_code
88
Write an Attention Mask for sequence padding.
Math/Code

Create masks replacing 0s with extremely negative numbers so Softmax zeroes them out.

Comprehensive Explanation
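Build a boolean mask of the padding positions and fill the matching attention scores with a very large negative number before the softmax, so those positions receive a weight of approximately zero.
Python / PyTorch Code
# True at padding positions; shape (batch, 1, 1, seq_len) broadcasts over heads and query positions
mask = (input_ids == pad_token).unsqueeze(1).unsqueeze(2)
scores = scores.masked_fill(mask, -1e9)  # apply before the softmax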
Math_code
89
Implement early stopping loop in PyTorch.
Math/Code

Track validation loss over patience epochs.

Comprehensive Explanation
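Track the best validation loss and stop once it has failed to improve for `patience` consecutive epochs.
Python / PyTorch Code
best = float('inf')
patience, count = 3, 0
for epoch in range(max_epochs):  # max_epochs, train_one_epoch, evaluate: assumed to be defined
    train_one_epoch(model)
    val_loss = evaluate(model)
    if val_loss < best:
        best, count = val_loss, 0  # improvement: reset the counter
    else:
        count += 1
        if count >= patience:
            break  # no improvement for `patience` consecutive epochs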
Math_code
90
Generate Text using HuggingFace GPT-2.
Math/Code

Using the generate method.

Comprehensive Explanation
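Load the GPT-2 tokenizer and model, encode the prompt, and call the model's generate method.
Python / PyTorch Code
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')
inputs = tokenizer('Hello, my dog is', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))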
Math_code
91
Pad sequences in raw python.
Math/Code

Adding zeros to reach max_len.

Comprehensive Explanation
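Append a padding value (0 here) to each sequence until it reaches max_len, the length of the longest sequence.
Python / PyTorch Code
max_len = max(len(seq) for seq in sequences)
padded = [seq + [0] * (max_len - len(seq)) for seq in sequences]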
Math_code
92
Implement Levenshtein distance conceptually using recursion.
Math/Code

Minimizing insertions, deletions, substitutions.

Comprehensive Explanation
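Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The plain recursion is exponential in the string lengths, so it is useful conceptually but needs memoization or dynamic programming in practice.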
Python / PyTorch Code
def lev(a,b):
    if not a: return len(b)
    if not b: return len(a)
    cost = 0 if a[0]==b[0] else 1
    return min(lev(a[1:],b)+1, lev(a,b[1:])+1, lev(a[1:],b[1:])+cost)
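The recursive definition recomputes the same subproblems many times. A minimal dynamic-programming sketch of the same recurrence, under the hypothetical name `lev_dp`, keeps only one previous row of the table:

```python
def lev_dp(a, b):
    """Levenshtein distance with a rolling DP row: O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


print(lev_dp("kitten", "sitting"))
```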
Math_code
93
How to initialize LayerNorm weights.
Math/Code

gamma to 1, beta to 0.

Comprehensive Explanation
Initialize gamma (the scale) to ones and beta (the shift) to zeros, so the layer initially applies plain normalization.
Python / PyTorch Code
self.gamma = nn.Parameter(torch.ones(features))
self.beta = nn.Parameter(torch.zeros(features))
Math_code
94
Calculate output dimension of a 1D Conv Layer over text.
Math/Code

Length = [(Input - Filter + 2*Pad) / Stride] + 1.

Comprehensive Explanation
Length = floor((Input + 2*Pad - Filter) / Stride) + 1, assuming a dilation of 1.
Python / PyTorch Code
# integer division floors the quotient; assumes dilation = 1
out_dim = (L_in + 2 * padding - kernel_size) // stride + 1
Math_code
95
Retrieve embeddings from a PyTorch Embedding layer.
Math/Code

Pass indices into the embedding class instance.

Comprehensive Explanation
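Pass a tensor of integer indices to the nn.Embedding instance; it returns one learned vector per index.
Python / PyTorch Code
import torch
import torch.nn as nn
embeds = nn.Embedding(vocab_size, dim)  # vocab_size must exceed the largest index (10 here)
indices = torch.tensor([1, 4, 10])
vectors = embeds(indices)  # shape: (3, dim)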
Math_code
96
Calculate Parameters in an LSTM relative to Input/Hidden size.
Math/Code

4 * ((input_size * hidden_size) + (hidden_size * hidden_size) + hidden_size).

Comprehensive Explanation
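4 * ((input_size * hidden_size) + (hidden_size * hidden_size) + hidden_size) for a single-layer LSTM: each of the four gates has an input weight matrix, a recurrent weight matrix, and a bias vector. PyTorch's nn.LSTM stores two bias vectors per gate (bias_ih and bias_hh), so its count is 4 * ((input_size * hidden_size) + (hidden_size * hidden_size) + 2 * hidden_size).
Python / PyTorch Code
n, m = 128, 256  # example input_size and hidden_size
params = 4 * ((n * m) + (m * m) + m)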
Math_code
97
Save and load a PyTorch NLP Model.
Math/Code

Using state dicts.

Comprehensive Explanation
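Save only the learned parameters via state_dict(), then load them into a model instance built with the same architecture.
Python / PyTorch Code
torch.save(model.state_dict(), 'model.pth')
model.load_state_dict(torch.load('model.pth'))  # the model must be constructed first
model.eval()  # switch dropout and batch norm to inference mode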
Math_code
98
Perform Top-K sampling conceptually.
Math/Code

Sort, take first k, zero out rest, sample.

Comprehensive Explanation
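Keep the k highest-scoring tokens, renormalize their probabilities (equivalent to zeroing out the rest), sample from them, and map the sampled position back to a vocabulary id.
Python / PyTorch Code
top_logits, top_indices = torch.topk(logits, k=5)  # keep the k largest logits
probs = F.softmax(top_logits, dim=-1)  # renormalize over the k candidates
sampled = torch.multinomial(probs, 1)  # position within the top-k list
next_token = top_indices.gather(-1, sampled)  # map back to a vocabulary id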
Math_code
99
Convert Text to sequence using HuggingFace Tokenizer.
Math/Code

Calling tokenizer directly outputs dict containing input_ids.

Comprehensive Explanation
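Calling the tokenizer directly returns a dict-like BatchEncoding containing input_ids and, by default, an attention_mask.
Python / PyTorch Code
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # example checkpoint
encodings = tokenizer(['Text one', 'Text two'], padding=True, truncation=True)
print(encodings['input_ids'])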
Math_code
100
Write an RNN step function loop.
Math/Code

Iterating time dimension manually updating hidden state.

Comprehensive Explanation
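Iterate over the time dimension manually, updating the hidden state at each step.
Python / PyTorch Code
# inputs: (seq_len, input_size); W_xh: (hidden_size, input_size); W_hh: (hidden_size, hidden_size); b: (hidden_size,)
h_t = torch.zeros(hidden_size)
for x_t in inputs:
    h_t = torch.tanh(W_xh @ x_t + W_hh @ h_t + b)  # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)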
Math_code