NLP Interview Master

The ultimate collection of 200+ meticulously curated Natural Language Processing & LLM questions to help you ace your ML Engineer interview.

What is Natural Language Processing (NLP)?

Beginner

Comprehensive Explanation

Natural Language Processing (NLP) is a subfield of artificial intelligence, computer science, and linguistics concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

What is Tokenization?

Beginner

Comprehensive Explanation

Tokenization is the process of breaking down a stream of text into smaller components called tokens. These could be words, characters, or subwords. It is usually the first step in an NLP pipeline.

What is Stemming?

Beginner

Comprehensive Explanation

Stemming is a crude heuristic process that chops off the ends of words to reach a base or root form, even if the result is not a valid word. For example, 'running' becomes 'run', while the Porter stemmer reduces 'studies' to 'studi'.

What is Lemmatization?

Beginner

Comprehensive Explanation

Lemmatization uses vocabulary and morphological analysis of words to remove inflectional endings and return the base or dictionary form of a word, which is known as the lemma. For instance, 'better' becomes 'good'.

What is Stop Word Removal?

Beginner

Comprehensive Explanation

Stop words are high-frequency words that often add little lexical value to sentences, such as 'is', 'the', and 'at'. Removing them reduces the number of tokens to process and can speed up training for sparse models such as BoW or TF-IDF, but it can hurt accuracy when the removed words carry meaning (for example, 'not' in sentiment analysis).

What is a Bag-of-Words (BoW) model?

Beginner

Comprehensive Explanation

A Bag-of-Words model is a simple representation of text used in NLP. It describes the occurrence of words within a document, ignoring word order and grammar but keeping multiplicity (frequency).

What is TF-IDF?

Beginner

Comprehensive Explanation

TF-IDF stands for Term Frequency-Inverse Document Frequency. It's a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. Words that appear frequently in one document but rarely across all documents get the highest score.
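
To make the scoring concrete, here is a minimal sketch that assumes the plain, unsmoothed variant tf * log(N / df); libraries often use smoothed or normalized variants, so their numbers may differ. The function name and toy corpus are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # Term frequency (share of the document) times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)
```

In this toy corpus 'the' appears in every document, so its score is zero, while 'mat' (found in one document only) outranks 'cat' (found in two).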

What are Word Embeddings?

Beginner

Comprehensive Explanation

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They map words or phrases to vectors of real numbers in a continuous vector space (e.g., Word2Vec, GloVe).

What is Word2Vec?

Beginner

Comprehensive Explanation

Word2Vec is a popular algorithm used for generating dense word embeddings. Created by researchers at Google, it uses a two-layer neural network (either Skip-Gram or CBOW) to reconstruct linguistic contexts of words.

What is Part-of-Speech (POS) Tagging?

Beginner

Comprehensive Explanation

POS tagging is the process of marking up a word in a text as corresponding to a particular part of speech based on both its definition and its context (e.g., tagging a word as a noun, verb, adjective).

What is Named Entity Recognition (NER)?

Beginner

Comprehensive Explanation

NER is an information extraction technique that seeks to locate and classify named entities mentioned in unstructured text into predefined categories such as person names, organizations, locations, medical codes, time expressions, and quantities.

What is Sentiment Analysis?

Beginner

Comprehensive Explanation

Sentiment analysis (or opinion mining) uses NLP to identify, extract, and quantify subjective information from text, typically classifying the polarity as positive, negative, or neutral.

What are N-grams?

Beginner

Comprehensive Explanation

An N-gram is a contiguous sequence of n items from a given sample of text or speech. When n=1 it's a unigram, n=2 is a bigram, and n=3 is a trigram. N-grams with n > 1 capture local word order that a unigram BoW model misses.
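
A short sketch of extracting word-level N-grams from a tokenized sentence; the helper name `ngrams` is an illustrative choice, not a reference to any particular library.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A sequence of length L yields L - n + 1 N-grams, so asking for a window longer than the sentence returns an empty list.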

What is Levenshtein Distance?

Beginner

Comprehensive Explanation

Levenshtein Distance is a string metric for measuring the difference between two sequences. It is the minimum number of single-character edits (insertions, deletions, substitutions) required to change one word into the other.
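
The definition above is usually computed with dynamic programming. The following is a minimal sketch that keeps only two rows of the table; it assumes unit cost for every edit.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

distance = levenshtein("kitten", "sitting")
```

The running time is O(len(a) * len(b)) and the memory use is O(len(b)).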

What is an RNN (Recurrent Neural Network)?

Beginner

Comprehensive Explanation

An RNN is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes. This makes them good for sequential data like text.

Why use LSTM instead of a basic RNN?

Beginner

Comprehensive Explanation

LSTMs (Long Short-Term Memory networks) are a special kind of RNN capable of learning long-term dependencies. They mitigate the vanishing gradient problem of standard RNNs through a memory cell and input, forget, and output gates that control what is stored, discarded, and exposed.

What is the Vanishing Gradient Problem?

Beginner

Comprehensive Explanation

During backpropagation in deep neural networks (especially RNNs), gradients are multiplied repeatedly across layers or time steps. If the factors are smaller than 1, the gradient shrinks exponentially, which stops earlier layers from learning.

What is perplexity in Language Modeling?

Beginner

Comprehensive Explanation

Perplexity measures how well a probability model predicts a sample. For a language model it is the exponential of the average negative log-likelihood per token, so a lower perplexity indicates the model is better at predicting the next word.
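
As a small worked example, perplexity can be computed from the probabilities a model assigned to the tokens that actually occurred. The probability values below are made up for illustration.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that always spreads its mass over 4 equally likely tokens.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that assigns high probability to each observed token.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
```

The uniform case gives a perplexity of about 4, which matches the intuition that the model is "choosing among 4 options" at each step; the confident model scores much lower.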

What is BLEU score?

Beginner

Comprehensive Explanation

BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of text that has been machine-translated from one natural language to another. It scores the n-gram overlap (modified n-gram precision) between the candidate and one or more reference translations and applies a brevity penalty to overly short candidates.

What is ROUGE score?

Beginner

Comprehensive Explanation

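ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating automatic summarization and machine translation. It focuses primarily on recall: how much of the reference text (n-grams or longest common subsequences) is recovered in the generated text.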

What is a Language Model?

Beginner

Comprehensive Explanation

A language model learns the probability distribution over sequences of words. It tries to predict the next word in a sequence given the previous words.

What are Text Generation Constraints?

Beginner

Comprehensive Explanation

Decoding constraints like temperature, top-k sampling, and top-p (nucleus) sampling control how deterministic or creative the output sequence from a language model is. A lower temperature, or a smaller k or p, concentrates sampling on the most likely tokens.
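
The effect of temperature can be sketched by dividing the logits by it before the softmax. The logits below are made-up scores for three candidate tokens, and the function name is illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # low temperature
flat = softmax_with_temperature(logits, 2.0)   # high temperature
```

With the low temperature most of the probability mass lands on the top token; with the high temperature the distribution flattens, so sampling becomes more varied.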

Calculate Document Frequency.

Beginner

Comprehensive Explanation

Document Frequency (DF) is the number of documents in which a term appears. It helps determine the uniqueness of the term globally.

What is the GloVe embedding?

Beginner

Comprehensive Explanation

GloVe (Global Vectors) is an unsupervised learning algorithm for obtaining vector representations for words based on aggregated global word-word co-occurrence statistics.

Why is sequence padding important?

Beginner

Comprehensive Explanation

Neural networks process batches as matrices, so all input sequences in a batch must have the same length. Padding adds neutral tokens (like ID 0) to shorter sequences so they match the length of the longest sequence or a fixed maximum sequence length.
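
A minimal sketch of right-padding a batch of token-ID lists. The attention mask returned alongside is an addition not described above: it is a common companion to padding that marks which positions hold real tokens. The pad ID of 0 is an assumption.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-ID lists to the longest length and build 0/1 masks."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

batch = [[5, 8, 2], [7], [4, 9]]
padded, mask = pad_batch(batch)
```

Every row of `padded` now has the same length, so the batch can be stored as a single matrix.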

Explain the Attention mechanism.

Intermediate

Comprehensive Explanation

Attention mechanisms allow models to focus on specific parts of an input sequence when predicting the output, rather than relying on a single fixed-length hidden vector. They compute a weighted sum of the inputs, with weights based on relevance.

What is Self-Attention?

Intermediate

Comprehensive Explanation

Self-attention, or intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. It's the core component of Transformers.

Explain the Sequence-to-Sequence (Seq2Seq) Model.

Intermediate

Comprehensive Explanation

Seq2Seq models take an input sequence and generate an output sequence. They typically use an Encoder to read the input into a context vector, and a Decoder to generate the output sequence (e.g., for translation).

What is Teacher Forcing?

Intermediate

Comprehensive Explanation

Teacher forcing is a fast and effective training technique for RNNs/Seq2Seq models where the model receives the ground truth output from the previous time step as input for the current time step, instead of its own prediction.

What is Subword Tokenization?

Intermediate

Comprehensive Explanation

Subword algorithms define tokens as characters or subwords, allowing models to mitigate the Out-Of-Vocabulary (OOV) problem efficiently. Examples include Byte-Pair Encoding (BPE), WordPiece, and SentencePiece.

How does Byte-Pair Encoding (BPE) work?

Intermediate

Comprehensive Explanation

BPE starts with a base vocabulary of single characters. It iteratively finds the most frequent pair of adjacent tokens and merges them into a new single token until a target vocabulary size is reached.
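
The merge loop described above can be sketched as follows. This is a toy trainer, not a production tokenizer: it stops after a fixed number of merges rather than at a target vocabulary size, ignores word-boundary markers, and breaks frequency ties by first occurrence (an assumption; real implementations differ).

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Each distinct word is a tuple of symbols, weighted by its count.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = learn_bpe(words, 3)
```

On this toy corpus the frequent suffix shared by 'newest' and 'widest' is merged first, so both words end up sharing an 'est' token.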

What is WordPiece?

Intermediate

Comprehensive Explanation

WordPiece (used by BERT) is similar to BPE but it chooses pairs to merge based on maximizing the likelihood of the training data using the language model, rather than just raw frequency.

Explain the concept of Word Sense Disambiguation (WSD).

Intermediate

Comprehensive Explanation

WSD is the process of identifying which sense of a word (i.e., meaning) is used in a sentence, when the word has multiple meanings (e.g., 'bank' of a river vs 'bank' as an institution).

What is an LLM (Large Language Model)?

Intermediate

Comprehensive Explanation

An LLM is a very large scale language model consisting of billions of parameters, typically based on the Transformer architecture, trained on immense quantities of unlabeled text data.

What is BERT?

Intermediate

Comprehensive Explanation

BERT (Bidirectional Encoder Representations from Transformers) is a model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.

Explain the Masked Language Modeling (MLM) task in BERT.

Intermediate

Comprehensive Explanation

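During pre-training, BERT randomly selects 15% of the input tokens for prediction. Of these, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The objective is to predict the original tokens using context from both directions.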

What is Next Sentence Prediction (NSP) in BERT?

Intermediate

Comprehensive Explanation

NSP is a binary classification task where BERT receives pairs of sentences and learns to predict whether the second sentence logically follows the first one in the original document.

What is fine-tuning in NLP?

Intermediate

Comprehensive Explanation

Fine-tuning takes a model that's already been pre-trained on a massive dataset (like BERT) and trains it further on a smaller, task-specific dataset (like sentiment analysis), adjusting its weights slightly.

What is the difference between Extractive and Abstractive Summarization?

Intermediate

Comprehensive Explanation

Extractive summarization pulls the most important verbatim sentences from the text. Abstractive summarization generates new text that captures the essence of the original text, similar to human summarizing.

How do Convolutional Neural Networks (CNNs) perform in NLP?

Intermediate

Comprehensive Explanation

While CNNs are mainly for images, 1D CNNs are excellent at text classification. They slide multi-word filters over the text embeddings, effectively extracting highly local N-gram features independent of where they appear.

Explain the Skip-gram architecture in Word2Vec.

Intermediate

Comprehensive Explanation

Skip-gram predicts context words given a target/center word. It works well with small amounts of data and represents rare words very well compared to CBOW.

Explain Continuous Bag of Words (CBOW) architecture.

Intermediate

Comprehensive Explanation

CBOW predicts the target word from a window of surrounding context words. It trains faster and represents frequent words better than Skip-gram.

What is Negative Sampling in Word2Vec?

Intermediate

Comprehensive Explanation

A technique that speeds up training by updating the weights for only a small sample of 'negative' (incorrect) words along with the target word, rather than computing a softmax over the entire vocabulary.

What is Coreference Resolution?

Intermediate

Comprehensive Explanation

The task of finding all expressions that refer to the same entity in a text. For example, in 'Jane took her dog out because it was barking', 'Jane' maps to 'her', and 'dog' maps to 'it'.

Difference between Generative and Discriminative models.

Intermediate

Comprehensive Explanation

Generative models (like Naive Bayes or GANs) model how the data was generated, for example the joint distribution P(X, Y), and can generate new samples. Discriminative models (like Logistic Regression or a BERT classifier) learn only the conditional distribution P(Y|X), that is, the decision boundary between classes.

What is the ELMo algorithm?

Intermediate

Comprehensive Explanation

Embeddings from Language Models (ELMo) creates contextualized word embeddings using a deeply bidirectional LSTM. It computes an embedding for a word based on the full sentence context.

What is Top-K vs Top-P sampling?

Intermediate

Comprehensive Explanation

In language generation, Top-K restricts sampling of the next token to the K most probable tokens. Top-P (nucleus sampling) restricts it to the smallest set of most probable tokens whose cumulative probability reaches P, so the size of the set adapts to the shape of the distribution.
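
The two filters can be sketched on a toy next-token distribution. The token names and probabilities are made up, and both helpers renormalize the surviving probabilities so they can be sampled from directly.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalize."""
    keep, mass = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        keep.append(token)
        mass += probs[token]
        if mass >= p:
            break
    return {t: probs[t] / mass for t in keep}

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "emu": 0.05}
k_filtered = top_k_filter(probs, 2)
p_filtered = top_p_filter(probs, 0.9)
```

Here top-k with k=2 always keeps two tokens, while top-p with p=0.9 keeps three because the first two only cover 0.8 of the mass. On a sharply peaked distribution top-p would keep fewer.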

What are the common evaluation metrics for NER?

Intermediate

Comprehensive Explanation

Precision, Recall, and F1-score are standard. Because entities often span multiple words, you must also choose between exact-match scoring (the predicted boundaries and type must both be correct) and partial-match scoring (overlapping spans earn credit).

How do you handle Class Imbalance in NLP?

Intermediate

Comprehensive Explanation

Techniques include oversampling minority classes (SMOTE or synonym replacement), undersampling majority classes, using weighted loss functions, or adopting Focal Loss.

What are Zero-shot and Few-shot learning?

Intermediate

Comprehensive Explanation

Zero-shot learning means giving the model a task it wasn't explicitly trained on, without any examples. Few-shot means providing the model with a tiny number of demonstration examples (for instance, 1-5) in the prompt.

Explain the architecture of a Transformer model.

Advanced

Comprehensive Explanation

Transformers utilize an Encoder-Decoder structure built entirely upon self-attention mechanisms, dropping recurrent/convolutional layers. Key components include Multi-Head Attention, Feed Forward Networks, Layer Normalization, and Positional Encodings.

What is the purpose of Positional Encoding in Transformers?

Advanced

Comprehensive Explanation

Since Transformers don't use recurrence and process all tokens simultaneously, they have no inherent notion of sequence order. Positional encodings (sine/cosine waves in the original Transformer) are added to the input embeddings to inject the relative or absolute position of words.
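
A minimal sketch of the sinusoidal scheme, assuming the formulation from the original Transformer: dimension pairs (2k, 2k+1) share the frequency 1 / 10000^(2k / d_model), with sine on even dimensions and cosine on odd ones. The function name is illustrative.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of positional encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # i - i % 2 rounds i down to the even index 2k of its pair.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(num_positions=4, d_model=8)
```

Each row would be added element-wise to the token embedding at that position. Position 0 always encodes as alternating 0 and 1.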

Explain the calculation of Scaled Dot-Product Attention.

Advanced

Comprehensive Explanation

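Attention(Q, K, V) = softmax((Q * K^T) / sqrt(d_k)) * V. The Query matrix is multiplied by the transposed Key matrix to get similarity scores. The scores are divided by sqrt(d_k) so that large dot products do not push the softmax into regions with vanishing gradients, and the resulting weights are then multiplied by the Value matrix.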
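
The formula can be traced step by step in plain Python. This sketch uses lists of rows instead of a tensor library and omits masking and batching; the toy Q, K, and V values are made up.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Q: n x d_k, K: m x d_k, V: m x d_v, each given as a list of rows."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value rows.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
```

The query is aligned with the first key, so the first value row receives the larger weight and dominates the output.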

What is Multi-Head Attention?

Advanced

Comprehensive Explanation

Instead of performing a single attention function, Multi-Head attention projects Queries, Keys, and Values h times with different learned weights. The h attention outputs are concatenated and linearly projected. It allows focusing on different representation subspaces (e.g., subject-verb vs adjectives).

Difference between GPT architectures and BERT?

Advanced

Comprehensive Explanation

GPT is an auto-regressive Decoder-only model trained left-to-right to predict the next token (excellent for generation). BERT is an Encoder-only model trained to bidirectionally reconstruct masked tokens (excellent for classification/understanding).

Explain Retrieval Augmented Generation (RAG).

Advanced

Comprehensive Explanation

RAG connects an LLM to an external knowledge database. Upon a query, semantic search retrieves relevant document chunks from the database and adds them to the LLM's prompt, which reduces hallucinations and anchors answers in the retrieved facts.

What is LoRA (Low-Rank Adaptation)?

Advanced

Comprehensive Explanation

LoRA freezes the pre-trained model weights and injects trainable low-rank decomposition matrices into layers of the Transformer architecture (typically the attention projections). This drastically reduces the number of trainable parameters, often to about 1% of the model or less, while largely maintaining fine-tuning quality.
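
A minimal sketch of a single LoRA-adapted linear layer, assuming the usual formulation y = W x + (alpha / r) * B (A x) with B initialized to zero. The matrices are tiny and made up, and the 4096 x 4096 layer with rank 8 used for the parameter count is a hypothetical example.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """W stays frozen; only the low-rank factors A (r x d_in) and B (d_out x r) train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

d_out, d_in, r = 4, 4, 1
W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen weights
A = [[0.1, 0.2, 0.3, 0.4]]         # r x d_in
B = [[0.0] for _ in range(d_out)]  # d_out x r, zero so training starts from W
x = [1.0, 2.0, 3.0, 4.0]
y = lora_forward(W, A, B, x, alpha=1.0, r=r)

# Trainable parameters for a hypothetical 4096 x 4096 projection at rank 8.
full_params = 4096 * 4096
lora_params = 8 * (4096 + 4096)
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly. In the hypothetical layer the adapter holds 65,536 parameters against 16,777,216 in the full matrix, which is under 0.4%.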

What is RLHF (Reinforcement Learning from Human Feedback)?

Advanced

Comprehensive Explanation

RLHF aligns LLMs with human intent. It trains a 'Reward Model' based on human rankings of text outputs, and then uses Proximal Policy Optimization (PPO) reinforcement learning to fine-tune the LLM to maximize those rewards.

What is Prompt Engineering vs Prompt Tuning?

Advanced

Comprehensive Explanation

Prompt Engineering is manually designing and refining the text of the prompt. Prompt Tuning (or Soft Prompts) keeps model weights frozen but adds a small number of tunable, continuous (vector) tokens to the input, which are updated via backpropagation.

Explain the difference between T5 and standard Seq2Seq models.

Advanced

Comprehensive Explanation

T5 (Text-to-Text Transfer Transformer) casts every NLP task—classification, translation, QA—into a text-to-text format. So both the inputs and outputs are treated as text strings, allowing unified learning.

What are Vector Databases?

Advanced

Comprehensive Explanation

Databases specialized in storing and indexing high-dimensional embeddings derived from data (text, images). They perform fast, often approximate, similarity searches using metrics like Cosine Similarity (e.g. Pinecone, Milvus, Qdrant).
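
The core operation can be sketched as a brute-force cosine-similarity search. This is only an illustration of the similarity metric: it does not use the API of any of the products named above, and the three-dimensional "embeddings" and document IDs are made up.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query, index, top_k=2):
    """Rank every stored vector against the query and return the best IDs."""
    ranked = sorted(
        index,
        key=lambda doc_id: cosine_similarity(query, index[doc_id]),
        reverse=True,
    )
    return ranked[:top_k]

index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}
results = search([1.0, 0.0, 0.0], index)
```

Comparing the query against every vector costs time linear in the collection size, which is why dedicated systems rely on specialized indexes instead.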

Explain the 'Hallucination' problem in LLMs and mitigation strategies.

Advanced

Comprehensive Explanation

Hallucination is when LLMs generate fluent but factually incorrect information. Mitigation includes RAG, strict grounding prompts, lower temperature, RLHF, or using external validation tools (Toolformer).

What are KV (Key-Value) Caches in LLM decoding?

Advanced

Comprehensive Explanation

During auto-regressive generation, past tokens' Key and Value tensors in the attention layers are cached instead of being recomputed every step. This reduces the attention cost of each new token from O(N^2) to roughly O(N), at the price of memory that grows with the sequence length.

What is FlashAttention?

Advanced

Comprehensive Explanation

FlashAttention is an IO-aware algorithm that computes exact attention much faster than a standard implementation. It processes attention in tiles so that the large N x N score matrix is never written to HBM (GPU main memory) and the work stays in fast on-chip SRAM, which accelerates both training and inference.

What is Quantization in neural networks? (e.g. 4-bit, 8-bit)

Advanced

Comprehensive Explanation

Quantization reduces the precision of the network's weights and activations from FP32 (32-bit float) down to lower-precision formats such as INT8 or INT4. This significantly shrinks the model's VRAM footprint and can speed up inference, usually with only a small loss in quality.
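
One simple scheme, shown here as a sketch, is symmetric per-tensor INT8 quantization: a single scale maps the largest absolute weight to 127. Practical methods are more elaborate (per-channel or per-group scales, calibration), and the weight values below are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [qi * scale for qi in q]

weights = [0.02, -0.5, 0.31, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in 8 bits instead of 32, and the rounding error per weight is bounded by half the scale.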

Differentiate between RoPE and Absolute Positional Encodings.

Advanced

Comprehensive Explanation

Rotary Position Embedding (RoPE) encodes absolute position by rotating the query and key vectors through position-dependent angles (multiplication by a rotation matrix). The dot product between a rotated query and key then depends only on their relative offset, which can help with longer sequences compared to absolute positional encodings added to the input.

Explain ALiBi (Attention with Linear Biases).

Advanced

Comprehensive Explanation

ALiBi removes positional embeddings entirely. Instead, it adds a fixed (non-learned) penalty to the attention scores before the softmax operation, proportional to the distance between query and key, with a different slope per head. It allows models to extrapolate to sequence lengths not seen during training.
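
The bias can be sketched for a single causal attention head. The slope schedule assumes the geometric sequence described for ALiBi when the number of heads is a power of two; the function names are illustrative.

```python
def alibi_bias(seq_len, slope):
    """Bias added to attention scores: -slope * (i - j) for key j <= query i."""
    neg_inf = float("-inf")  # future positions are masked out entirely
    return [
        [-slope * (i - j) if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ...; assumes num_heads is a power of two."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

bias = alibi_bias(seq_len=4, slope=0.5)
slopes = alibi_slopes(8)
```

The last query position gets penalties of -1.5, -1.0, -0.5, and 0 for keys at distances 3, 2, 1, and 0, so nearer tokens are favored without any learned position parameters.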

Explain Directed Acyclic Graphs (DAGs) in Dependency Parsing.

Advanced

Comprehensive Explanation

Dependency parsing builds a tree outlining grammatical relations, with directed edges connecting heads to dependents. Graph-based parsers use algorithms like Chu-Liu-Edmonds to find the highest-scoring spanning tree over the words. Because a tree is a special case of a DAG, this rules out cyclic dependencies.

What is the Conditional Random Field (CRF) layer used for in NER?

Advanced

A CRF layer sits on top of an LSTM/Transformer and predicts tags jointly. It learns transition probabilities between labels (e.g., ensuring I-ORG follows B-ORG ...

Comprehensive Explanation

A CRF layer sits on top of an LSTM/Transformer encoder and predicts the tag sequence jointly rather than token by token. It learns transition scores between labels (e.g., that I-ORG may follow B-ORG but not B-PER), which discourages invalid tag sequences.
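
At inference time the best joint tag sequence is usually found with Viterbi decoding. The sketch below is a simplified NumPy version with hand-set transition scores (in a real CRF they are learned) and no start or end transitions; all names and numbers are illustrative assumptions.

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, K) per-token tag scores; transitions[i, j]: score of tag i -> tag j."""
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate[i, j]: best path ending in tag i at t-1, then moving to tag j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

tags = ["O", "B-ORG", "I-ORG"]
transitions = np.zeros((3, 3))
transitions[0, 2] = -1e4   # O -> I-ORG is not a valid BIO transition

# Token-level scores alone would pick O then I-ORG, which is invalid.
emissions = np.array([[2.0, 1.0, 0.0],
                      [0.0, 0.0, 3.0]])
best = [tags[i] for i in viterbi(emissions, transitions)]
```

With the transition penalty in place, the decoder returns B-ORG followed by I-ORG instead of the invalid per-token argmax.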

Advanced

What is Knowledge Distillation?

Advanced

A compression technique where a smaller 'student' model is trained to mimic the softmax probabilities (soft targets) and intermediate representations of a massi...

Comprehensive Explanation

A compression technique where a smaller 'student' model is trained to mimic the softmax probabilities (soft targets) and intermediate representations of a massive 'teacher' model, maintaining high accuracy at lower scale.
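
A minimal sketch of the soft-target part of the objective, assuming the common formulation of a KL divergence between temperature-softened teacher and student distributions, scaled by T^2. It ignores the intermediate-representation term and the usual hard-label loss; the function names and the temperature value are assumptions.

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(np.mean(kl) * temperature ** 2)

teacher = np.array([[4.0, 1.0, 0.5]])
close_student = np.array([[3.8, 1.1, 0.4]])
far_student = np.array([[0.5, 4.0, 1.0]])

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

A student whose logits track the teacher's gets a much smaller loss than one that ranks the classes differently.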

Advanced

Explain Mixture of Experts (MoE) architecture.

Advanced

MoE replaces the dense Feed Forward network with multiple parallel 'experts'. A routing network conditionally determines which tiny subset of experts (usually 1...

Comprehensive Explanation

MoE replaces the dense feed-forward network with multiple parallel 'experts'. A routing network selects a small subset of experts (usually 1 or 2) to process each token, which allows the total parameter count to scale massively while keeping per-token FLOPs low.
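
The routing step can be sketched as follows. This is a toy NumPy version with top-2 routing in which each "expert" is a single linear map standing in for a feed-forward block; the names `moe_layer` and `router_w` are hypothetical, and load-balancing losses and capacity limits are left out.

```python
import numpy as np

def moe_layer(x, router_w, experts, top_k=2):
    """x: (tokens, d). Each token is processed only by its top_k experts."""
    logits = x @ router_w                           # (tokens, num_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # the k highest-scoring experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                        # softmax over the selected experts only
        for gate, e in zip(gates, top[t]):
            out[t] += gate * experts[e](x[t])
    return out, top

rng = np.random.default_rng(0)
d, num_experts = 4, 8
expert_weights = [rng.normal(size=(d, d)) for _ in range(num_experts)]
experts = [lambda v, w=w: v @ w for w in expert_weights]
router_w = rng.normal(size=(d, num_experts))

x = rng.normal(size=(3, d))
y, routes = moe_layer(x, router_w, experts)
```

Only 2 of the 8 experts run for each token, so compute per token stays roughly constant as more experts (and parameters) are added.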

Advanced

How does Semantic Search differ from Lexical/Keyword Search?

Advanced

Lexical search (BM25) uses sparse tf-idf matching exact words. Semantic search embeds queries into dense vectors and finds documents closest in vector space, ma...

Comprehensive Explanation

Lexical search (e.g., BM25) scores documents with sparse, TF-IDF-style term weights and matches exact words. Semantic search embeds queries and documents into dense vectors and retrieves the documents closest to the query in vector space, matching synonyms and meaning rather than just string overlap.

Advanced

What is the difference between a Cross-Encoder and a Bi-Encoder?

Advanced

Bi-encoders independently embed query and document, computing a simple fast dot product for similarity (great for searching databases). Cross-encoders concatena...

Comprehensive Explanation

Bi-encoders embed the query and the document independently and compare them with a fast dot product, which makes them well suited to searching large databases because document embeddings can be precomputed. Cross-encoders concatenate query and document and run full self-attention over the pair, which is slower but usually more accurate, so they are typically used for re-ranking.

Advanced

What is Token Healing in LLMs?

Advanced

Because tokenization merges prefix spaces/characters, prompting a model abruptly might split a logical word forcing poor probability spaces. Token healing dynam...

Comprehensive Explanation

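Because tokenizers merge characters (including leading spaces) into multi-character tokens, a prompt that ends in the middle of what would normally be a single token forces the model to continue from a token boundary it rarely saw during training, which skews its next-token probabilities. Token healing backs up over the last token of the prompt and constrains generation so that the first generated token starts with the removed text, producing a seamless continuation.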
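
A toy sketch of the idea, using a made-up vocabulary of plain strings. The function name `heal_prompt` and the vocabulary are hypothetical; a real implementation would work on token ids and apply the constraint as a logit mask on the first decoding step.

```python
def heal_prompt(prompt_tokens, vocab):
    """Drop the last prompt token and return the tokens allowed to replace it."""
    *kept, last = prompt_tokens
    # Only tokens that extend the removed text keep the prompt's characters intact.
    allowed = [tok for tok in vocab if tok.startswith(last)]
    return kept, allowed

# The prompt 'http:' ends in ':', but a tokenizer would normally have encoded
# the likely continuation as the single token '://'.
vocab = ["http", ":", "://", "s", "www", ":)"]
kept, allowed = heal_prompt(["http", ":"], vocab)
```

Generation then resumes from `kept`, with the first new token restricted to `allowed`, so the model is free to pick '://' as a single token.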

Advanced

Explain Direct Preference Optimization (DPO).

Advanced

DPO acts as a simpler, more stable alternative to RLHF. Instead of training a separate reward model, DPO mathematically maps the reward function directly onto t...

Comprehensive Explanation

DPO is a simpler, more stable alternative to PPO-based RLHF. Instead of training a separate reward model and then running reinforcement learning, DPO expresses the reward in terms of the language model's own policy (relative to a frozen reference model) and optimizes directly on human preference pairs with a classification-style loss.
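
The per-pair loss can be written in a few lines. The sketch below assumes the inputs are summed log-probabilities of each full response under the policy and the reference model; the function name `dpo_loss` and the value of `beta` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """All four inputs are summed log-probabilities of a full response."""
    # Implicit reward of each response: beta * log(pi_theta / pi_ref)
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: the margin is 0, so the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

The loss falls as the policy, relative to the reference, assigns more probability to the preferred response than to the rejected one.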

Advanced

Implement Cosine Similarity in Python.

Math/Code

Dot product divided by product of magnitudes.

Comprehensive Explanation

Dot product divided by product of magnitudes.

Python / PyTorch Code

import numpy as np
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

Math_code

Calculate TF-IDF for word 'data'.

Math/Code

TF(t) = (count in doc) / (total words in doc). IDF(t) = log_e(Total Docs / Docs with t).

Comprehensive Explanation

TF(t) = (count in doc) / (total words in doc). IDF(t) = log_e(Total Docs / Docs with t).

Python / PyTorch Code

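import math
tf = 3 / 100                     # 'data' occurs 3 times in a 100-word document
idf = math.log(1000 / (10 + 1))  # 10 of 1000 documents contain 'data'; the +1 is optional smoothing not in the formula above
tfidf = tf * idf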

Math_code

Write a PyTorch basic Self-Attention calculation.

Math/Code

Forward pass utilizing matrix multiplications.

Comprehensive Explanation

Forward pass utilizing matrix multiplications.

Python / PyTorch Code

import math
import torch
import torch.nn.functional as F

def attention(q, k, v, d_k):
    # Scaled dot-product scores: (..., seq_q, seq_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    attn = F.softmax(scores, dim=-1)  # normalize over the key positions
    return torch.matmul(attn, v)

Math_code

Code: Convert a list of texts to bag of words.

Math/Code

Using SKLearn CountVectorizer.

Comprehensive Explanation

Using SKLearn CountVectorizer.

Python / PyTorch Code

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
X = vec.fit_transform(['nlp is fun', 'ai is nlp'])
print(X.toarray())

Math_code

How to apply Softmax in PyTorch?

Math/Code

Using F.softmax along the last dimension.

Comprehensive Explanation

Using F.softmax along the last dimension.

Python / PyTorch Code

import torch
logits = torch.tensor([1.0, 2.0, -1.0])
probs = torch.nn.functional.softmax(logits, dim=-1)  # dim=-1 is the last dimension

Math_code

Implement a simple bigram character generator.

Math/Code

Matrix probabilities lookup.

Comprehensive Explanation

Matrix probabilities lookup.

Python / PyTorch Code

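import torch
counts = torch.zeros((27, 27))
# ... fill counts
probs = counts / counts.sum(dim=1, keepdim=True)  # normalize each row into a distribution
i = torch.multinomial(probs[0], num_samples=1)    # sample the next character given index 0
char = itos[i.item()]                             # itos: index-to-character lookup, assumed defined elsewhere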

Math_code

Extract Named Entities using spaCy.

Math/Code

Load the core English model and iterate over the entities.

Comprehensive Explanation

Load the core English model and iterate over the entities.

Python / PyTorch Code

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Apple is buying a startup based in UK')
for ent in doc.ents:
    print(ent.text, ent.label_)

Math_code

What is the Cross-Entropy Loss formula?

Math/Code

Loss = -SUM(p(x) * log(q(x))) where p is true distribution and q is predicted distribution.

Comprehensive Explanation

Loss = -SUM(p(x) * log(q(x))) where p is true distribution and q is predicted distribution.

Python / PyTorch Code

import torch.nn as nn
loss = nn.CrossEntropyLoss()  # takes raw logits; log-softmax is applied internally
calculated = loss(logits, targets)

Math_code

Load a transformer model via HuggingFace.

Math/Code

Using pipeline or direct.

Comprehensive Explanation

Using pipeline or direct.

Python / PyTorch Code

from transformers import pipeline
classifier = pipeline('sentiment-analysis')
result = classifier('I love NLP!')

Math_code

Write WordPiece Subword Tokenizer logic pseudo-code.

Math/Code

Greedy matching loop.

Comprehensive Explanation

Greedy matching loop.

Python / PyTorch Code

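# Pseudo-code: get_longest_matching_prefix stands in for a vocabulary lookup.
token_list = []
while word:
    substr = get_longest_matching_prefix(word)  # longest vocabulary piece that starts `word`
    if not substr:  # nothing matches: emit the unknown token and stop
        token_list = ['[UNK]']
        break
    token_list.append(substr)
    word = word[len(substr):]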
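
For reference, a runnable version of the greedy longest-match loop might look like the sketch below. It assumes the BERT-style convention that non-initial pieces carry a `##` prefix and that a word with no full segmentation becomes a single unknown token; the function name and the toy vocabulary are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:                     # try the longest substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                       # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able", "play", "##ing"}
pieces = wordpiece_tokenize("unaffable", vocab)
```

With this vocabulary, "unaffable" splits into `un`, `##aff`, `##able`, and a word with no matching pieces maps to `[UNK]`.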

Math_code

How to clip gradients in PyTorch to prevent exploding gradients?

Math/Code

Clips the norm of the gradients before stepping the optimizer.

Comprehensive Explanation

Clips the norm of the gradients before stepping the optimizer.

Python / PyTorch Code

# Call after loss.backward() and before optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

Math_code

Calculate the Jaccard Similarity between two sentences.

Math/Code

Intersection over Union of sets.

Comprehensive Explanation

Intersection over Union of sets.

Python / PyTorch Code

def jaccard(s1, s2):
    set1, set2 = set(s1.split()), set(s2.split())
    intersect = len(set1.intersection(set2))
    return intersect / (len(set1) + len(set2) - intersect)

Math_code

Write an Attention Mask for sequence padding.

Math/Code

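Build a boolean mask that is True at padding positions, then set those attention scores to a very large negative number so that softmax gives them near-zero weight.

Comprehensive Explanation

Build a boolean mask that is True at padding positions, then set those attention scores to a very large negative number so that softmax gives them near-zero weight.

Python / PyTorch Code

# mask shape (batch, 1, 1, seq_len) broadcasts over heads and query positions
mask = (input_ids == pad_token).unsqueeze(1).unsqueeze(2)
scores = scores.masked_fill(mask, -1e9)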

Math_code

Implement early stopping loop in PyTorch.

Math/Code

Track validation loss over patience epochs.

Comprehensive Explanation

Track validation loss over patience epochs.

Python / PyTorch Code

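best = float('inf')
patience, count = 3, 0
for epoch in range(max_epochs):
    val_loss = validate(model)  # placeholder for your own validation routine
    if val_loss < best:
        best, count = val_loss, 0  # improvement: reset the counter
    else:
        count += 1
        if count >= patience:
            break  # no improvement for `patience` consecutive epochs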

Math_code

Generate Text using HuggingFace GPT-2.

Math/Code

Using the generate method.

Comprehensive Explanation

Using the generate method.

Python / PyTorch Code

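from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')
inputs = tokenizer('Hello, my dog is', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))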

Math_code

Pad sequences in raw python.

Math/Code

Adding zeros to reach max_len.

Comprehensive Explanation

Adding zeros to reach max_len.

Python / PyTorch Code

max_len = max(len(seq) for seq in sequences)  # pad to the longest sequence
padded = [seq + [0]*(max_len - len(seq)) for seq in sequences]

Math_code

Implement Levenshtein distance conceptually using recursion.

Math/Code

Minimizing insertions, deletions, substitutions.

Comprehensive Explanation

Minimizing insertions, deletions, substitutions.

Python / PyTorch Code

def lev(a,b):
    if not a: return len(b)
    if not b: return len(a)
    cost = 0 if a[0]==b[0] else 1
    return min(lev(a[1:],b)+1, lev(a,b[1:])+1, lev(a[1:],b[1:])+cost)

Math_code

How to initialize LayerNorm weights.

Math/Code

gamma to 1, beta to 0.

Comprehensive Explanation

gamma to 1, beta to 0.

Python / PyTorch Code

self.gamma = nn.Parameter(torch.ones(features))
self.beta = nn.Parameter(torch.zeros(features))

Math_code

Calculate output dimension of a 1D Conv Layer over text.

Math/Code

Length = floor((Input - Filter + 2*Pad) / Stride) + 1.

Comprehensive Explanation

Length = floor((Input - Filter + 2*Pad) / Stride) + 1.

Python / PyTorch Code

# Integer division performs the floor; assumes dilation = 1
out_dim = (L_in - kernel_size + 2*padding) // stride + 1

Math_code

Retrieve embeddings from a PyTorch Embedding layer.

Math/Code

Pass indices into the embedding class instance.

Comprehensive Explanation

Pass indices into the embedding class instance.

Python / PyTorch Code

embeds = nn.Embedding(vocab_size, dim)
indices = torch.tensor([1, 4, 10])
vectors = embeds(indices)

Math_code

Calculate Parameters in an LSTM relative to Input/Hidden size.

Math/Code

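4 * ((input_size * hidden_size) + (hidden_size * hidden_size) + hidden_size). PyTorch's nn.LSTM keeps two bias vectors per gate, so its count uses 2 * hidden_size for the bias term.

Comprehensive Explanation

4 * ((input_size * hidden_size) + (hidden_size * hidden_size) + hidden_size). PyTorch's nn.LSTM keeps two bias vectors per gate, so its count uses 2 * hidden_size for the bias term.

Python / PyTorch Code

# n = input_size, m = hidden_size; 4 gates, one bias vector per gate
params = 4 * ((n * m) + (m * m) + m)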

Math_code

Save and load a PyTorch NLP Model.

Math/Code

Using state dicts.

Comprehensive Explanation

Using state dicts.

Python / PyTorch Code

torch.save(model.state_dict(), 'model.pth')
model.load_state_dict(torch.load('model.pth'))

Math_code

Perform Top-K sampling conceptually.

Math/Code

Sort, take first k, zero out rest, sample.

Comprehensive Explanation

Sort, take first k, zero out rest, sample.

Python / PyTorch Code

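import torch
import torch.nn.functional as F
top_logits, top_idx = torch.topk(logits, k=5)  # k largest logits and their vocabulary ids
probs = F.softmax(top_logits, dim=-1)          # renormalize over the k candidates
choice = torch.multinomial(probs, 1)           # position within the top-k list
next_token = top_idx.gather(-1, choice)        # map back to a vocabulary id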

Math_code

Convert Text to sequence using HuggingFace Tokenizer.

Math/Code

Calling tokenizer directly outputs dict containing input_ids.

Comprehensive Explanation

Calling tokenizer directly outputs dict containing input_ids.

Python / PyTorch Code

encodings = tokenizer(['Text one', 'Text two'], padding=True, truncation=True)
print(encodings['input_ids'])

Math_code

Write an RNN step function loop.

Math/Code

Iterating time dimension manually updating hidden state.

Comprehensive Explanation

Iterating time dimension manually updating hidden state.

Python / PyTorch Code

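h_t = torch.zeros(hidden_size)  # 1-D hidden state so the matrix-vector products line up
for x_t in inputs:  # inputs: (seq_len, input_size)
    h_t = torch.tanh(W_xh @ x_t + W_hh @ h_t + b)  # W_xh: (hidden, input), W_hh: (hidden, hidden)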

Math_code