Q&A — Deep Learning & NLP · ENSAM 2026 Exam Prep

MODULE 01 ANN & Deep Learning Fundamentals 30 Q

01 What is the Universal Approximation Theorem and what does it guarantee about neural networks? ▾

Key Theorem

A feedforward neural network with at least one hidden layer and a sufficient number of neurons can approximate any continuous function to an arbitrary degree of accuracy.

What it does NOT guarantee: it says nothing about how many neurons are needed, how to find the weights, or whether training will converge to that solution. It is an existence result, not a constructive one.

Practical implication: neural networks are expressive enough — the challenge is optimization and generalization, not representational capacity.

02 What is the vanishing gradient problem and what causes it? ▾

Exam Favorite

During backpropagation, gradients are multiplied by the derivative of the activation function at each layer. With sigmoid/tanh, these derivatives are ≤ 0.25. Over many layers, repeated multiplication drives the gradient toward zero.

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_n} \cdot \sigma'(z_n) \cdot W_n \cdot \sigma'(z_{n-1}) \cdot W_{n-1} \cdots$$

Each $\sigma'$ term is a small number. With 10 layers of sigmoid: $0.25^{10} \approx 10^{-6}$ — the gradient effectively vanishes. Early layers stop learning.

Fixes: ReLU activation, residual connections (ResNet), batch normalization, LSTM gates.

03 Why is ReLU preferred over sigmoid/tanh in hidden layers? ▾

Exam Favorite

Property	Sigmoid/Tanh	ReLU
Derivative in saturation	→ 0 (vanishing gradient)	1 (no saturation for x>0)
Computation	Expensive (exp)	max(0,x) — trivial
Sparsity	All neurons active	~50% neurons zero — efficient
Centering	Tanh centered; Sigmoid not	Not zero-centered

ReLU's gradient is exactly 1 for positive inputs, so gradients flow without shrinking across many layers.

04 What is the dying ReLU problem and how does Leaky ReLU fix it? ▾

Dying ReLU: if a neuron's weighted input is always negative (e.g., due to a large negative bias), ReLU always outputs 0. The gradient is also 0 — the neuron never updates and is permanently "dead."

$\text{ReLU}(x) = \max(0, x)$ → gradient = 0 for $x \leq 0$
$\text{Leaky ReLU}(x) = \max(\alpha x, x)$, typically $\alpha = 0.01$

Leaky ReLU allows a small gradient ($\alpha$) for negative inputs, keeping neurons alive.

05 Write the gradient descent weight update formula and explain each term. ▾

Formula

$$w \leftarrow w - \eta \cdot \frac{\partial L}{\partial w}$$

$w$: weight being updated · $\eta$: learning rate (step size) · $\frac{\partial L}{\partial w}$: gradient of loss with respect to that weight (direction of steepest ascent — we subtract to descend).

Each update moves $w$ slightly in the direction that reduces the loss.

06 What is the difference between L1 and L2 regularization? ▾

$L1: \quad \mathcal{L}_{total} = \mathcal{L} + \lambda \sum |w_i|$ → sparse weights (many exactly 0)
$L2: \quad \mathcal{L}_{total} = \mathcal{L} + \lambda \sum w_i^2$ → small weights (none exactly 0)

	L1 (Lasso)	L2 (Ridge / Weight Decay)
Effect	Produces sparse models	Shrinks all weights uniformly
Feature selection	Yes — zeroes out features	No — keeps all features small
Typical use	When you suspect many features are irrelevant	Default regularization in deep learning

07 What is dropout and during which phases (train vs inference) is it active? ▾

Dropout randomly sets a fraction $p$ of neuron outputs to zero during each forward pass of training. This forces the network to not rely on any single neuron — learning redundant representations.

Training: active — neurons randomly zeroed with probability $p$.

Inference: dropout is OFF. All neurons are used, but their outputs are scaled by $(1-p)$ to maintain the same expected activation magnitude (or equivalently, inverted dropout scales during training).

Common values: $p = 0.2$–$0.5$ for Dense layers. Never use on the output layer.

08 What is Batch Normalization and what problem does it solve? ▾

Batch Normalization normalizes the activations of each layer across the mini-batch to have mean 0 and standard deviation 1, then applies learnable scale ($\gamma$) and shift ($\beta$) parameters.

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$

Problem it solves — Internal Covariate Shift: as weights update, the distribution of inputs to each layer keeps changing, making training unstable. BN stabilizes these distributions.

Benefits: faster training, allows higher learning rates, reduces sensitivity to initialization, acts as mild regularizer.

09 What is the difference between Batch GD, Stochastic GD, and Mini-batch GD? ▾

Variant	Samples per update	Gradient quality	Speed
Batch GD	All N samples	Exact — smooth convergence	Very slow per epoch
Stochastic GD (SGD)	1 sample	Noisy — can escape local minima	Fast updates, unstable
Mini-batch GD	32–256 samples	Balanced — stable + efficient	Best in practice

Mini-batch GD is the standard. Typical batch sizes: 32, 64, 128. GPU memory limits maximum batch size.

10 What is Xavier (Glorot) initialization and when should it be used vs He initialization? ▾

Formula

Xavier: $\quad W \sim \mathcal{U}\!\left[-\frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}},\ \frac{\sqrt{6}}{\sqrt{n_{in}+n_{out}}}\right]$

He: $\quad W \sim \mathcal{N}\!\left(0,\ \sqrt{\frac{2}{n_{in}}}\right)$

Init	Designed for	When to use
Xavier	Sigmoid / Tanh	Symmetric activations
He	ReLU / Leaky ReLU	Any ReLU-family activation

Wrong initialization → vanishing or exploding activations from the first forward pass.

11 Write the gradient checking approximation formula and what tolerance is expected. ▾

FormulaExam Favorite

$$\frac{\partial L}{\partial w} \approx \frac{f(w + \varepsilon) - f(w - \varepsilon)}{2\varepsilon}, \qquad \varepsilon \approx 10^{-5}$$

This is the centered finite difference approximation. It numerically estimates the gradient and is compared against the analytical gradient from backprop.

Expected relative error: $< 10^{-7}$ → backprop is correct. If error $> 10^{-5}$, there is a bug in the backpropagation implementation.

12 What does a training curve where train_loss ↓ but val_loss ↑ indicate, and what should you do? ▾

This is the signature of overfitting: the model is memorizing training data rather than learning generalizable patterns.

Remedies:

1. Add Dropout (typical p=0.3–0.5) · 2. Add L2 regularization (weight decay) · 3. Reduce model capacity (fewer layers/neurons) · 4. Data augmentation · 5. EarlyStopping (stop at the point val_loss starts rising) · 6. Collect more data.

If both curves plateau at high loss: underfitting → increase model capacity or train longer.

13 Why use softmax for multi-class output and sigmoid for binary? What is the mathematical relationship? ▾

Sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}} \in (0,1)$ — one probability for binary.

Softmax: $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ — probabilities sum to 1 across all classes.

Binary (2 classes): sigmoid on one output neuron. Output is $P(\text{class}=1)$.

Multi-class (K classes): softmax on K output neurons. Each output is $P(\text{class}=k)$, and they sum to 1.

Softmax with 2 classes is mathematically equivalent to sigmoid. Softmax amplifies differences between logits — making the highest score more dominant.

14 What role does the chain rule play in backpropagation? ▾

Backpropagation computes $\frac{\partial L}{\partial w}$ for every weight $w$ in the network. Since the loss is a composition of many functions (layers), the chain rule allows decomposing this into a product of local gradients.

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$

where $z = wx + b$ (linear), $a = \sigma(z)$ (activation), $L$ (loss). Each term is computed locally — the network only needs to propagate the accumulated gradient backward layer by layer.

15 What is the Adam optimizer and what two techniques does it combine? ▾

Adam (Adaptive Moment Estimation) combines Momentum and RMSprop:

$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ ← 1st moment (momentum)
$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ ← 2nd moment (RMSprop)
$w \leftarrow w - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$

Default hyperparameters: $\beta_1=0.9$, $\beta_2=0.999$, $\eta=10^{-3}$, $\epsilon=10^{-8}$.

Benefit: adapts learning rate per parameter — parameters with rare gradients get larger updates.

16 What is cross-entropy loss and why is it preferred over MSE for classification? ▾

$$\mathcal{L}_{CE} = -\sum_i y_i \log(\hat{y}_i)$$

For binary: $\mathcal{L} = -[y \log \hat{y} + (1-y)\log(1-\hat{y})]$

Why not MSE for classification? With MSE and sigmoid output, gradients saturate when predictions are confidently wrong (the sigmoid is flat). Cross-entropy has a gradient of $(\hat{y} - y)$ — large when wrong, small when correct — producing strong learning signal exactly when needed.

17 What is overfitting, underfitting, and how do you detect each from training curves? ▾

Condition	Train Loss	Val Loss	Diagnosis
Good fit	Low	Low ≈ Train	Model generalizes
Overfitting	Very low	High & diverging	Memorizing training data
Underfitting	High	High ≈ Train	Model too simple / undertrained

Overfitting fix: regularization, dropout, more data, simpler model, EarlyStopping.

Underfitting fix: more layers/neurons, train longer, reduce regularization, better features.

18 Why can't all weights be initialized to zero? What is the problem? ▾

If all weights = 0, every neuron in a layer computes the same output (all zeros × inputs = 0). All neurons receive the same gradient and update identically — they remain identical forever. This is called the symmetry problem.

Result: a layer of N neurons behaves like a single neuron — the entire capacity of the layer is wasted.

Weights must be initialized randomly to break symmetry so each neuron learns different features.

19 What is the purpose of the bias term $b$ in a neuron? ▾

The bias allows the activation function to be shifted horizontally. Without it, the neuron computes $\sigma(\mathbf{w}^T\mathbf{x})$ which must pass through the origin.

$$z = \mathbf{w}^T\mathbf{x} + b$$

With bias, the neuron can fire (activate) even when all inputs are zero. It provides the model a degree of freedom independent of the input — allowing the hyperplane decision boundary to be positioned anywhere, not just through the origin.

20 What is EarlyStopping and what are the key parameters: patience and restore_best_weights? ▾

EarlyStopping monitors a metric (typically val_loss) and stops training when no improvement is seen for a number of epochs.

EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

patience=5: wait 5 epochs with no improvement before stopping.

restore_best_weights=True: after stopping, restore weights from the epoch with the best val_loss (not the last epoch, which may have started overfitting).

21 What is the learning rate and what happens if it is too large or too small? ▾

The learning rate $\eta$ controls the step size of each weight update.

Learning Rate	Effect
Too large	Overshoots minimum — loss oscillates or diverges. Training is unstable.
Too small	Tiny updates — training is extremely slow. May get stuck in local minima.
Optimal	Converges smoothly to a (local) minimum in reasonable time.

Adam's default $\eta = 10^{-3}$ is a good starting point. Use ReduceLROnPlateau to decay automatically.

22 What is gradient clipping and when is it used? ▾

Gradient clipping caps the gradient norm to a maximum value before the weight update step, preventing exploding gradients.

$$\text{if } \|\nabla\| > \text{threshold}: \quad \nabla \leftarrow \frac{\text{threshold}}{\|\nabla\|} \cdot \nabla$$

Typical threshold: 1.0–5.0 (from the RNN lab: GRAD_CLIP = 5.0).

Particularly important for RNNs/LSTMs processing long sequences, where gradients can explode exponentially due to repeated matrix multiplications.

23 What is the difference between the validation set and the test set? ▾

Set	Used for	Seen during training?
Training set	Computing gradients, updating weights	Yes — directly
Validation set	Hyperparameter tuning, model selection, EarlyStopping	Indirectly (no weight updates)
Test set	Final performance estimate on unseen data	Never — touched once at the end

Using the test set to tune hyperparameters is data leakage — your reported performance will be optimistically biased.

24 What activation function is used in the output layer for regression vs binary vs multi-class classification? ▾

Task	Output Activation	Loss Function
Regression	None (linear)	MSE / MAE
Binary classification	Sigmoid	Binary cross-entropy
Multi-class (exclusive)	Softmax	Categorical cross-entropy
Multi-label (independent)	Sigmoid per output	Binary cross-entropy per label

25 What is weight decay and how does it relate to L2 regularization? ▾

Weight decay and L2 regularization are mathematically equivalent for standard SGD.

With L2 regularization, the gradient update becomes:

$$w \leftarrow w - \eta \left(\frac{\partial L}{\partial w} + \lambda w\right) = (1 - \eta\lambda)w - \eta\frac{\partial L}{\partial w}$$

The factor $(1-\eta\lambda)$ decays the weight at every step — hence "weight decay." In Keras: kernel_regularizer=l2(0.01) or optimizer=Adam(weight_decay=1e-4).

26 What is a confusion matrix and what are the four values TP, TN, FP, FN? ▾

Predicted →	Positive	Negative
Actual Positive	TP (True Positive)	FN (False Negative) — missed
Actual Negative	FP (False Positive) — false alarm	TN (True Negative)

Accuracy = (TP+TN)/(TP+TN+FP+FN) · Precision = TP/(TP+FP) · Recall = TP/(TP+FN)

F1 = 2·P·R/(P+R) — harmonic mean, balances precision and recall.

27 What is the ReduceLROnPlateau callback and what does it do when triggered? ▾

ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6)

When val_loss stops improving for patience=3 epochs, it multiplies the learning rate by factor=0.5 (halves it). This allows the optimizer to take smaller steps to escape a plateau and fine-tune around a local minimum.

min_lr=1e-6 sets a floor — the LR will never go below this, preventing arbitrarily slow training.

28 What is the relationship between batch size and training stability/generalization? ▾

Batch size	Gradient noise	Effect
Small (8–32)	High — noisy updates	Better generalization (noise acts as regularizer), slower convergence, GPU underutilized
Large (256–2048)	Low — smooth gradient	Faster GPU utilization, sharper minima (worse generalization), may need LR scaling

Rule of thumb: scale LR linearly with batch size (linear scaling rule). Typical choice: 32–128.

29 What is the exploding gradient problem and how is it different from the vanishing gradient? ▾

	Vanishing Gradient	Exploding Gradient
What happens	Gradients → 0 (too small)	Gradients → ∞ (too large)
Effect	Early layers stop learning	Weight updates are catastrophically large (NaN)
Cause	Repeated multiplication of small numbers (<1)	Repeated multiplication of large numbers (>1)
Fix	ReLU, ResNet, LSTM, BN	Gradient clipping, careful initialization

30 What is momentum in optimization and what problem does it solve over plain gradient descent? ▾

Momentum accumulates a moving average of past gradients to smooth updates:

$v_t = \beta v_{t-1} + (1-\beta) g_t$ (velocity)
$w \leftarrow w - \eta v_t$

Typical $\beta = 0.9$. Problem it solves:

1. Oscillations in narrow ravines — momentum smooths the zig-zag path and accelerates along the consistent gradient direction.

2. Local minima — accumulated velocity can "roll through" shallow local minima.

3. Slow progress in flat regions — momentum keeps moving in the last useful direction.

MODULE 02 Convolutional Neural Networks (CNN) 30 Q

01 Why can't a plain Fully-Connected (Dense) network efficiently process images? ▾

A 224×224 RGB image has $224 \times 224 \times 3 = 150{,}528$ inputs. With just one hidden layer of 1,000 neurons: $150{,}528 \times 1{,}000 = 150$ million parameters — untrainable and prone to overfitting.

Additionally, FCNs ignore spatial structure: a pixel at (10,10) and (11,10) are treated as completely unrelated. CNNs exploit spatial locality through local connectivity and weight sharing.

02 What does a convolutional filter (kernel) compute, conceptually? ▾

A filter slides over the input image computing an element-wise dot product between the filter weights and the local patch of the input at each position:

$$(\mathbf{f} * \mathbf{x})[i,j] = \sum_{m}\sum_{n} f[m,n] \cdot x[i+m,\, j+n]$$

Each filter learns to detect a specific pattern (edge, curve, texture). Early filters detect edges; deeper filters detect complex patterns. The result is a feature map showing where that pattern appears in the image.

03 Write the formula for the output size of a convolutional layer. ▾

FormulaExam Favorite

$$W_{out} = \left\lfloor \frac{W_{in} - K + 2P}{S} \right\rfloor + 1$$

$W_{in}$: input size · $K$: kernel size · $P$: padding · $S$: stride.

Example: Input 5×5, kernel 3×3, P=1 (same), S=1: $\lfloor(5-3+2)/1\rfloor+1 = 5$ → output 5×5 (same padding preserves size).

Example: Input 224×224, K=3, P=0, S=2: $\lfloor(224-3)/2\rfloor+1 = 111$.

04 What is "same" padding vs "valid" padding? ▾

Padding	Zero-padding added	Output size	Use
same	$P = \lfloor K/2 \rfloor$	$W_{out} = W_{in}$ (stride=1)	Preserve spatial dims through conv layers
valid	$P = 0$	$W_{out} = W_{in} - K + 1$	Shrink spatial dims intentionally

"Same" padding adds zeros around the border so the filter reaches every position including the edges.

05 How many trainable parameters does Conv2D(32 filters, 3×3 kernel) have on an RGB input (3 channels)? ▾

Formula

$$\text{Params} = (K \times K \times C_{in} + 1) \times C_{out}$$

$(3 \times 3 \times 3 + 1) \times 32 = 28 \times 32 = \mathbf{896}$ parameters.

The "+1" is the bias per filter. Without bias: $3\times3\times3\times32 = 864$.

Weight sharing insight: the same 896 parameters are reused at every spatial position — this is why CNNs are so parameter-efficient vs FCNs.

06 What is "weight sharing" in CNNs and why is it important? ▾

In a convolutional layer, the same filter weights are reused at every spatial position of the input. This is weight sharing.

Why it matters:

1. Parameter efficiency: 896 params handle an entire 224×224 image instead of millions in an FCN.

2. Translation invariance: if a filter detects a horizontal edge, it detects it anywhere in the image — the same weights fire wherever the edge appears.

3. Generalization: the model is biased toward learning spatially reusable patterns, which matches the structure of natural images.

07 What is max pooling and what are its three benefits? ▾

Max pooling takes the maximum value in each pooling window (typically 2×2 with stride 2), halving the spatial dimensions.

Three benefits:

1. Dimensionality reduction: 2×2 pool with stride 2 → spatial size halved, reducing compute and memory.

2. Spatial invariance: small translations of the input produce the same max — the model is robust to minor shifts.

3. Overfitting control: fewer parameters in subsequent layers.

Max pooling has zero learnable parameters — it is a fixed operation.

08 What is Global Average Pooling (GAP) and why is it better than Flatten + Dense? ▾

Global Average Pooling reduces each feature map to a single number (its spatial average): an input of shape $(H, W, C)$ → output of shape $(C,)$.

	Flatten + Dense	Global Average Pooling
Parameters	H×W×C×Dense_units (massive)	0 (then small Dense layer)
Overfitting risk	High — 88% of VGG params are here	Low — drastically fewer params
Spatial info	Flattened to 1D	Summarized per channel

Modern architectures (ResNet, MobileNet, EfficientNet) all use GAP before the classifier head.

09 Why are two stacked 3×3 conv layers preferred over one 5×5 layer? Give the parameter count. ▾

Exam Favorite

Two stacked 3×3 layers have the same effective receptive field as one 5×5 layer — but fewer parameters:

Two 3×3: $2 \times (3\times3\times C) = 18C$ params
One 5×5: $5\times5\times C = 25C$ params
Saving: 28%

Three stacked 3×3 = effective 7×7: $27C$ vs $49C$ → 45% saving.

Additionally, two conv layers means two non-linearity applications — more representational power.

10 What is the filter doubling rule and why does it keep compute constant? ▾

After each max pooling (spatial size halved), the number of filters is doubled: $32 \to 64 \to 128 \to 256 \to 512$.

Why compute stays constant: total feature map volume $\approx H \times W \times C$. When $H$ and $W$ are halved (÷4 area), doubling $C$ (×2) keeps the product roughly constant: $\frac{H}{2}\times\frac{W}{2}\times 2C = \frac{H \times W \times C}{2}$ — actually halves, but within an acceptable budget.

This lets deeper layers capture more semantic features without exponentially increasing compute.

11 What is the power-of-2 compression cascade for a 224×224 input? How many conv blocks does it imply? ▾

Exam Favorite

Block	Spatial size after pool	÷ Factor
Input	224 × 224	—
Block 1	112 × 112	÷2
Block 2	56 × 56	÷2
Block 3	28 × 28	÷2
Block 4	14 × 14	÷2
Block 5 ← stop	7 × 7	÷2

Stop at 7×7 — going further destroys spatial structure needed for classification. This gives 5 conv blocks, which is exactly the depth of VGGNet.

12 What are the key innovations introduced by AlexNet (2012) that enabled modern deep learning? ▾

AlexNet won ImageNet 2012 with 15.3% top-5 error vs 26.2% runner-up. Key innovations:

1. ReLU activations — replacing tanh, 6× faster training.

2. Dropout (p=0.5) — first large-scale use as regularization.

3. GPU training — split across two GTX 580 GPUs, making deep networks practical.

4. Data augmentation — random crops, horizontal flips, color jitter.

5. Local Response Normalization — early normalization (later replaced by BN).

13 What is the residual (skip) connection in ResNet and what problem does it solve? ▾

Formula

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$$

The identity shortcut $+\mathbf{x}$ adds the input directly to the output of the conv layers.

Problems it solves:

1. Vanishing gradient: the gradient can flow directly through the identity shortcut without passing through activations, reaching early layers effectively.

2. Degradation problem: without skip connections, adding more layers paradoxically made accuracy worse. ResNet enables training 100+ layer networks.

ResNet-50 (25M params) outperforms VGG-16 (138M params).

14 What is the difference between Feature Extraction and Fine-Tuning in transfer learning? ▾

	Feature Extraction	Fine-Tuning
Pre-trained layers	Frozen — weights unchanged	Unfrozen — weights updated
What trains	Only new classifier head (Dense layers)	Entire model or last N layers
Learning rate	Normal LR for head	Very small LR ($10^{-5}$) — don't destroy pretrained weights
Data needed	Small dataset OK	Requires more data
When to use	Target domain ≈ ImageNet	Target domain differs significantly

15 What is MobileNet's key innovation and why is it suited for mobile devices? ▾

MobileNet uses depthwise separable convolutions: split a standard conv into two cheaper operations:

1. Depthwise conv: apply one filter per channel independently (spatial filtering).

2. Pointwise conv (1×1): combine channels (cross-channel mixing).

Standard: $K\times K\times C_{in}\times C_{out}$ multiplications
Depthwise sep: $K\times K\times C_{in} + C_{in}\times C_{out}$ ← ~8–9× fewer operations

This makes it suitable for real-time inference on CPUs and mobile chips with limited compute and battery.

16 In a VGG-like CNN, what percentage of parameters sit in Dense layers vs Conv layers? ▾

Exam Favorite

In VGG-16: Conv layers ≈ 14.7M params (12%), Dense layers ≈ 103M params (88%).

This is why modern architectures replace Flatten → Dense with Global Average Pooling — it eliminates the Dense layers' parameter explosion while matching or improving accuracy.

The 88% in Dense layers is the main reason VGG is memory-inefficient at inference time.

17 What are the three essential training callbacks? Give the purpose and key config for each. ▾

Exam Favorite

EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True) ModelCheckpoint('best.h5', save_best_only=True, monitor='val_loss') ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6)

Callback	Purpose
EarlyStopping	Stop training when val_loss stops improving — prevents overfitting
ModelCheckpoint	Save best model to disk — you always have the best checkpoint
ReduceLROnPlateau	Halve LR when progress stalls — escapes training plateaus

18 When is Recall more important than Precision? Give a concrete medical AI example. ▾

Recall = TP/(TP+FN) — measures the fraction of actual positives that are correctly identified.

Recall is more important when False Negatives are more costly than False Positives.

Example — Tumor detection: A False Negative = telling a patient with cancer that they are cancer-free. This leads to delayed treatment and potentially death. A False Positive = ordering an unnecessary biopsy on a healthy patient (costly but recoverable).

In this case, we want high Recall even at the cost of lower Precision.

19 What is data augmentation and name four transforms used in image classification? ▾

Data augmentation generates new training samples by applying label-preserving transformations to existing data. It reduces overfitting without collecting more data.

Category	Transforms
Geometric	Random flip (horizontal), rotation (±15°), zoom, random crop, translation
Photometric	Brightness, contrast, saturation adjustment, Gaussian noise, blur

Never apply augmentations that change the label — e.g., don't flip digits (6↔9), don't flip road signs.

20 What is Batch Normalization in CNNs — where is it applied and what does it do to feature maps? ▾

In CNNs, Batch Normalization is applied after conv layers, before (or after) ReLU. It normalizes each channel's activations across the mini-batch to mean 0, variance 1, then applies learnable $\gamma, \beta$.

Effect on feature maps: prevents any channel from dominating; stabilizes activation distributions across layers so deeper layers train on a consistent signal.

Benefits: faster convergence, allows larger learning rates, reduces sensitivity to weight initialization, mild regularization effect (can sometimes reduce Dropout need).

21 What is stride and how does stride=2 differ from stride=1 with max pooling for downsampling? ▾

Stride is the step size the filter moves between positions. Stride=1: dense coverage. Stride=2: skip every other position → output half the spatial size.

	Conv with stride=2	Conv(stride=1) + MaxPool(2×2)
Output size	⌊(W-K)/2⌋+1	⌊(W-K+2P)/1⌋+1 then ÷2
Has learnable params?	Yes (in conv)	MaxPool has none
Information retained	Learns what to keep	Always takes max value

Modern architectures (ResNet, EfficientNet) prefer strided conv over pooling.

22 What is the Inception module (GoogLeNet) and what problem does it solve? ▾

The Inception module applies multiple filter sizes in parallel (1×1, 3×3, 5×5) plus max pooling, then concatenates all outputs along the channel dimension.

Problem it solves: choosing the right filter size for each layer is non-trivial. By applying all sizes in parallel and letting the network learn which features are most useful, the network automatically selects the right scale.

1×1 convolutions before larger kernels perform dimensionality reduction (bottleneck), keeping compute manageable.

23 What does `include_top=False` mean when loading a pretrained model like VGG16? ▾

base = VGG16(weights='imagenet', include_top=False, input_shape=(224,224,3))

include_top=False loads the model without the final classifier layers (the Dense layers trained for 1000 ImageNet classes). You get only the convolutional backbone.

This lets you add your own classification head for your specific number of classes. The convolutional features learned on ImageNet are reused; only your new head is trained.

24 What is EfficientNet's compound scaling strategy? ▾

Previous architectures scaled one dimension at a time: more layers (deeper), more channels (wider), or larger input (higher resolution). EfficientNet scales all three simultaneously using a compound coefficient $\phi$:

depth: $d = \alpha^\phi$ · width: $w = \beta^\phi$ · resolution: $r = \gamma^\phi$
subject to: $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ (constant FLOP budget)

Result: EfficientNet-B7 achieves state-of-the-art accuracy with 8.4× fewer parameters than GPipe at the same accuracy level.

25 What is the F1 Score, when do you use it, and why is it better than accuracy for imbalanced datasets? ▾

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\text{TP}}{2\text{TP} + \text{FP} + \text{FN}}$$

F1 is the harmonic mean of Precision and Recall. It's 1.0 only when both are perfect.

Why use over accuracy for imbalanced data: if 99% of data is class A, a dumb classifier that always predicts A gets 99% accuracy but F1≈0 (because Recall for class B = 0). F1 forces the model to actually detect the minority class.

26 What is a 1×1 convolution and what is it used for? ▾

A 1×1 convolution applies a linear combination across channels at each spatial position — with no spatial filtering. For input $(H, W, C_{in})$: output is $(H, W, C_{out})$, same spatial size, different channel count.

Uses:

1. Dimensionality reduction (bottleneck): $C_{in}=256 \to C_{out}=64$ — reduces channel count before expensive 3×3 conv (Inception, ResNet bottleneck).

2. Increase channels: $C_{in}=64 \to C_{out}=256$ — expand representation.

3. Non-linear cross-channel mixing with no spatial receptive field increase.

27 What is AUC-ROC and what does an AUC of 0.5 vs 1.0 mean? ▾

The ROC curve plots True Positive Rate (Recall) vs False Positive Rate (FP/(FP+TN)) across all classification thresholds. AUC = Area Under the ROC Curve.

AUC	Meaning
1.0	Perfect classifier — separates all positives from negatives
0.9–0.99	Excellent
0.7–0.89	Good
0.5	Random guessing — no discriminative ability
<0.5	Worse than random (labels may be flipped)

AUC is threshold-independent and works well for imbalanced datasets.

28 What is the standard CNN pipeline for image classification (the full forward pass)? ▾

Input Image (H×W×3) → Normalize (/255) → [Conv2D(ReLU) → BatchNorm → MaxPool] × N blocks → GlobalAveragePooling or Flatten → Dense(ReLU) → Dropout → Dense(Softmax/Sigmoid) → Predicted class probabilities

Loss: categorical_crossentropy (multi-class) or binary_crossentropy (binary). Optimizer: Adam (lr=1e-3). Evaluate: accuracy + F1 on val set.

29 What does VGG stand for and what are the two main VGG variants? ▾

VGG = Visual Geometry Group (Oxford University, Simonyan & Zisserman, 2014).

Key design philosophy: use only 3×3 conv filters throughout, increasing depth.

Variant	Conv layers	Params	Top-5 error
VGG-16	13 conv + 3 Dense	138M	7.3%
VGG-19	16 conv + 3 Dense	144M	7.3%

VGG demonstrated that depth (using small 3×3 filters) is more effective than shallow networks with large filters.

30 What is the best practice for normalizing image inputs, and what problems does NOT normalizing cause? ▾

Best practice: divide pixel values by 255 to scale to [0,1], or standardize to mean=0, std=1 using ImageNet statistics (mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225] for pretrained models).

Problems from raw 0–255 inputs:

1. Gradient instability: large input magnitudes cause large pre-activations → saturation or exploding gradients.

2. Uneven learning: the optimizer takes large steps in some directions, slow in others — poor conditioning.

3. Weight initialization mismatch: He/Xavier init assumes inputs of moderate magnitude.

MODULE 03 Recurrent Networks — RNN · LSTM · GRU 30 Q

01 What is the fundamental architectural difference between a Feedforward Network and an RNN? ▾

A Feedforward Network maps each input independently to an output — there is no memory of previous inputs. Each sample is processed in isolation.

An RNN maintains a hidden state $h_t$ that is passed from one time step to the next, giving the network a form of memory:

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$

At each step $t$, the output depends on the current input and all previous inputs (encoded in $h_{t-1}$). This makes RNNs suitable for sequential data: text, time series, audio, video.

02 Write the vanilla RNN hidden state update formula and explain each term. ▾

Formula

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h)$$ $$\hat{y}_t = W_y h_t + b_y$$

$h_t$: hidden state at time $t$ · $h_{t-1}$: previous hidden state (memory) · $x_t$: current input · $W_h$: recurrent weight matrix (hidden-to-hidden) · $W_x$: input weight matrix · $\tanh$: keeps hidden state in $[-1,1]$.

The same weights $W_h, W_x$ are reused at every time step — this is weight sharing across time.

03 What is the vanishing gradient problem in RNNs and why is it worse than in FNNs? ▾

Exam Favorite

During BPTT, the gradient flows backward through time by multiplying $W_h^T$ at each step. For a sequence of length $T$:

$$\frac{\partial h_t}{\partial h_0} = \prod_{k=1}^{t} \frac{\partial h_k}{\partial h_{k-1}} = \prod_{k=1}^{t} W_h^T \cdot \text{diag}(\tanh'(\cdot))$$

Each $\tanh'$ ≤ 1. Over 100+ time steps: $\|W_h\|^{100} \cdot (≤1)^{100}$ → effectively zero.

Worse than in FNNs: an FNN has ~10–50 layers, but an RNN may unroll to 100–1000 time steps — far more multiplications.

Consequence: early time steps receive near-zero gradients — the RNN cannot learn long-range dependencies.

04 What is the exploding gradient problem in RNNs and how is it fixed? ▾

If the largest eigenvalue of $W_h$ is $> 1$, repeated multiplication causes gradients to grow exponentially → NaN weights, training collapses.

Fix: gradient clipping. Before the weight update, if the gradient norm exceeds a threshold, scale the gradient down:

$$\text{if } \|\nabla\| > \text{clip\_val}: \quad \nabla \leftarrow \frac{\text{clip\_val}}{\|\nabla\|} \cdot \nabla$$

From the course lab: GRAD_CLIP = 5.0. This is applied in PyTorch as torch.nn.utils.clip_grad_norm_(params, 5.0).

05 What is Backpropagation Through Time (BPTT)? ▾

BPTT is the algorithm for training RNNs. The RNN is "unrolled" through time to create a deep feedforward graph with one layer per time step, then standard backpropagation is applied through all steps.

Steps:

1. Forward pass: compute all $h_1, h_2, \ldots, h_T$ and outputs.

2. Compute total loss $\mathcal{L} = \sum_t \mathcal{L}_t$.

3. Backward pass: compute $\frac{\partial \mathcal{L}}{\partial W}$ by propagating gradients back from $t=T$ to $t=1$.

Truncated BPTT: for very long sequences, backprop only through the last $k$ steps to avoid memory issues.

06 Name the 4 gates of an LSTM and describe the role of each. ▾

Exam Favorite

Gate	Symbol	Activation	Role
Forget gate	$f_t$	Sigmoid	Decides what to erase from the cell state (0=forget, 1=keep)
Input gate	$i_t$	Sigmoid	Decides how much new information to write to cell state
Cell candidate	$\tilde{C}_t$	Tanh	New candidate values to potentially add to cell state
Output gate	$o_t$	Sigmoid	Decides what part of cell state to output as hidden state

07 Write the complete LSTM equations (all 5 equations). ▾

Formula

$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$    ← Forget gate
$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$    ← Input gate
$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$    ← Cell candidate
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$    ← Cell state update
$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$    ← Output gate
$h_t = o_t \odot \tanh(C_t)$    ← Hidden state

$\odot$: element-wise multiplication. $[h_{t-1}, x_t]$: concatenation. $C_t$: cell state (long-term memory). $h_t$: hidden state (working memory).

08 How does the LSTM cell state $C_t$ solve the vanishing gradient problem? ▾

The cell state $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ is updated via additive connections — unlike the vanilla RNN which multiplies $W_h h_{t-1}$ through tanh.

The gradient of the loss with respect to $C_{t-1}$:

$$\frac{\partial C_t}{\partial C_{t-1}} = f_t$$

The forget gate $f_t \in (0,1)$ is learned — when the network needs to remember something, it can set $f_t \approx 1$, allowing gradients to flow back essentially unchanged (gradient $\approx 1$ per step). This is the constant error carousel mechanism.

09 What are the 2 gates of a GRU and how does it differ from LSTM? ▾

GRU (Gated Recurrent Unit) merges the forget and input gates into one update gate and adds a reset gate:

$z_t = \sigma(W_z [h_{t-1}, x_t])$   ← Update gate (replaces forget+input)
$r_t = \sigma(W_r [h_{t-1}, x_t])$   ← Reset gate (how much past to forget)
$\tilde{h}_t = \tanh(W[r_t \odot h_{t-1}, x_t])$   ← Candidate
$h_t = (1-z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$   ← Hidden state

	LSTM	GRU
Gates	4 (forget, input, cell, output)	2 (update, reset)
Separate cell state	Yes ($C_t$ and $h_t$)	No (only $h_t$)
Parameters	More	~25% fewer
Performance	Usually better for long sequences	Comparable, faster to train

10 What does the forget gate learn to do in language modeling? Give a concrete example. ▾

The forget gate learns to selectively erase information from the cell state when it is no longer relevant.

Example — subject-verb agreement:

"The cats that were chasing the mouse..."

When the model reads "cats" (plural subject), it stores this in $C_t$. When it later generates the verb, it needs to remember "cats" (plural) → "are" not "is." The forget gate keeps the plural information alive across the intervening words about the mouse.

After the clause ends, the forget gate can erase "cats" — it's no longer needed for subject-verb agreement.

11 What are the four RNN architecture configurations (many-to-one, one-to-many, etc.) with examples? ▾

Config	Input	Output	Example
One-to-one	Single	Single	Plain FNN (not really RNN)
One-to-many	Single	Sequence	Image captioning, music generation
Many-to-one	Sequence	Single	Sentiment classification, text → label
Many-to-many (equal)	Sequence	Sequence (same length)	POS tagging, NER, video frame labeling
Many-to-many (seq2seq)	Sequence	Sequence (diff length)	Machine translation, summarization

12 What is teacher forcing and when is it used? ▾

In sequence-to-sequence (encoder-decoder) models, the decoder generates output token by token. During training, there are two options:

Without teacher forcing: feed the decoder's own (possibly wrong) prediction as the next input. Errors compound → slow, unstable training.

With teacher forcing: feed the ground-truth previous token as input to the decoder at each step, regardless of what was predicted.

Benefits: faster convergence, stable gradients, avoids early error cascades.

Risk: "exposure bias" — the model performs worse at inference (where it sees its own predictions, not ground truth). Scheduled sampling mitigates this by gradually reducing teacher forcing.

13 What is a Bidirectional RNN and when is it useful? ▾

A Bidirectional RNN runs two separate RNNs on the sequence — one forward (left to right) and one backward (right to left). Their hidden states are concatenated at each time step.

When useful: tasks where context from both past AND future helps predict the current position:

· NER: "Apple" in "I bought an Apple iPhone" vs "I ate an apple" — future context disambiguates.

· POS tagging, sentence encoding, BERT (bidirectional Transformer).

Bidirectional RNNs cannot be used for tasks that require real-time generation (you need the full sequence upfront).

14 What is a stacked (deep) RNN and what does each layer learn? ▾

A stacked RNN has multiple RNN layers, where the output sequence of one layer becomes the input sequence for the next:

h1_t = RNN_1(x_t, h1_{t-1}) h2_t = RNN_2(h1_t, h2_{t-1}) h3_t = RNN_3(h2_t, h3_{t-1})

Layer 1 learns low-level patterns (word-level). Layer 2 learns phrase-level structures. Layer 3 learns sentence-level/semantic patterns.

Typical depth: 2–4 layers. More layers → more expressive but harder to train (vanishing gradients). Lab config: NUM_LAYERS=2.

15 What is the encoder-decoder (seq2seq) architecture and what information passes between them? ▾

The encoder reads the entire input sequence and compresses it into a fixed-size context vector $c$ (the final hidden state). The decoder takes $c$ as its initial hidden state and generates the output sequence token by token.

Encoder: $h_T^{enc}$ = context vector $c$
Decoder: $h_0^{dec} = c$, then generates $y_1, y_2, \ldots$

Bottleneck problem: compressing a long input to a single vector loses information. Attention mechanism (Chapter 7) solves this by giving the decoder access to all encoder hidden states, not just the final one.

16 What is the hidden size hyperparameter in RNNs and what does it control? ▾

The hidden size $d_h$ is the dimensionality of the hidden state vector $h_t \in \mathbb{R}^{d_h}$.

What it controls: the network's memory capacity — how much information can be stored in the hidden state at each time step.

Hidden size	Effect
Too small (e.g., 16)	Cannot capture complex patterns — underfitting
Good (64–512)	Balanced capacity and computation
Too large (1024+)	Slow, risk of overfitting, needs more data

From the course lab: HIDDEN_SIZE = 64 for temperature forecasting.

17 Why are RNNs fundamentally slower to train than CNNs or Transformers? ▾

RNNs have sequential data dependency: $h_t$ must be computed before $h_{t+1}$ because $h_t$ depends on $h_{t-1}$. This prevents parallelization across time steps.

CNNs: all spatial positions are processed in parallel — highly GPU-parallelizable.

Transformers: all positions are processed in parallel via matrix multiplication (attention). No sequential dependency.

This sequential bottleneck is a key motivation for replacing RNNs with Transformers for long sequences.

18 What is the long-range dependency problem and what is the maximum effective range of a vanilla RNN? ▾

The long-range dependency problem: a vanilla RNN cannot reliably learn relationships between tokens that are far apart in the sequence due to vanishing gradients.

In practice, vanilla RNNs effectively "remember" only about 5–10 time steps. Information from 50+ steps ago is largely lost.

LSTMs extend this to hundreds of steps in favorable conditions. Transformers, via direct attention connections, handle thousands of tokens equally regardless of distance — solving long-range dependencies completely.

19 What does the sequence length (SEQ_LEN) hyperparameter control in a time series RNN? ▾

SEQ_LEN is the length of the input window: how many past time steps the model sees at each prediction step. From the course lab: SEQ_LEN = 30 (30 past days of temperature to predict day 31).

SEQ_LEN	Trade-off
Too short	Model misses relevant long-term patterns
Too long	More computation, harder to train, risk of vanishing gradients

SEQ_LEN should match the actual relevant history in the data (e.g., 7 for weekly patterns, 365 for yearly).

20 Compare LSTM vs GRU vs Vanilla RNN for the task of long-form text generation. ▾

Architecture	Long-range	Speed	Params	Recommendation
Vanilla RNN	Poor	Fast	Fewest	Only for very short sequences
GRU	Good	Faster than LSTM	~25% fewer than LSTM	Good default when speed matters
LSTM	Best	Slower	Most	Best for long, complex sequences

For poetry/text generation with coherent long-term structure: LSTM or GRU. For real-time, resource-constrained apps: GRU.

21 What is the difference between a regression RNN and a classification RNN at the output layer? ▾

	Regression RNN	Classification RNN
Output activation	Linear (none)	Softmax (multi-class) / Sigmoid (binary)
Loss	MSE or MAE	Cross-entropy
Example	Temperature prediction (next value)	Sentiment: positive/negative
Output	Continuous value $\hat{y} \in \mathbb{R}$	Class probability $\hat{y} \in [0,1]$

22 What is the conceptual difference between the cell state $C_t$ and the hidden state $h_t$ in LSTM? ▾

Think of them as two types of memory:

	Cell state $C_t$	Hidden state $h_t$
Analogy	Long-term memory	Working memory
Update mechanism	Additive (can preserve unchanged)	Through output gate + tanh(Ct)
Range	Can carry info over 100s of steps	More local, used for immediate decisions
Passed to next step	Yes	Yes
Used as output	No (internal only)	Yes (feeds into Dense layers)

23 What is the purpose of the LEARNING_RATE=1e-3 and how does it interact with gradient clipping? ▾

LEARNING_RATE $\eta = 10^{-3}$ controls the step size of each weight update. For RNNs, this is typically smaller than for CNNs due to the sensitivity of recurrent dynamics.

Interaction with gradient clipping: gradient clipping controls the direction (maximum gradient norm), while LR controls the step size in that direction. Both together prevent unstable updates:

1. Clip: $\nabla \leftarrow \min(1, \text{clip}/\|\nabla\|) \cdot \nabla$
2. Update: $w \leftarrow w - \eta \cdot \nabla$

If gradients explode (norm → large), clipping brings them back; if LR is too large, updates still overshoot. Both controls are needed.

24 What is an embedding layer and why is it placed before the RNN in text models? ▾

An embedding layer maps integer token indices to dense vectors: token index $i \in \{0, \ldots, V-1\}$ → vector $\mathbf{e}_i \in \mathbb{R}^d$.

Embedding(vocab_size=10000, embedding_dim=128) # Input: [3, 47, 1203, 2] → Output: shape (4, 128)

Why before the RNN: RNNs expect continuous, dense inputs. Raw one-hot vectors are both too sparse (vocab_size dimensions with one 1) and semantically meaningless. Embeddings provide low-dimensional, semantically meaningful representations that the RNN can process efficiently.

The embedding weights are learned end-to-end with the RNN, or initialized with pretrained embeddings (Word2Vec, GloVe).

25 What is the difference between return_sequences=True and return_sequences=False in Keras LSTM? ▾

Parameter	Output shape	Use case
`return_sequences=False`	(batch, hidden_size) — only last $h_T$	Many-to-one: sentiment classification, regression
`return_sequences=True`	(batch, seq_len, hidden_size) — all $h_t$	Many-to-many: stacked LSTM, seq labeling, attention

When stacking LSTM layers, all intermediate layers must use return_sequences=True. Only the final layer can use False (if many-to-one).

26 What is the "constant error carousel" property of LSTM and why is it important? ▾

The "constant error carousel" (CEC) is the mechanism by which the LSTM cell state propagates gradients without attenuation. When the forget gate $f_t = 1$ and input gate $i_t = 0$:

$$C_t = 1 \cdot C_{t-1} + 0 \cdot \tilde{C}_t = C_{t-1}$$

The cell state is copied unchanged, and the gradient flows back with a factor of $f_t = 1$ — not decaying. The LSTM can sustain gradients over hundreds of steps when needed.

Importance: this is the fundamental mechanism enabling LSTMs to capture long-range dependencies that vanilla RNNs cannot.

27 What synthetic time series components are typically used in RNN temperature forecasting labs? ▾

The course lab uses a synthetic 4-year daily temperature signal built from three stacked components:

1. Annual seasonality: $A \sin(2\pi t / 365)$ — summer/winter cycle.

2. Weekly pattern: smaller periodic variation across the week.

3. Gaussian noise: random day-to-day fluctuation.

The model (Vanilla RNN with SEQ_LEN=30) must learn to predict day 31 from the 30-day window, effectively learning to decompose and extrapolate these components.

28 Why is tanh used in RNNs (for hidden states) rather than ReLU? ▾

Tanh is preferred for RNN hidden states because:

1. Bounded output $[-1, 1]$: prevents hidden states from growing unboundedly across time steps. With ReLU, repeated application of $h_t = \text{ReLU}(W_h h_{t-1} + \ldots)$ can cause exponential growth.

2. Zero-centered: unlike sigmoid, tanh is centered at 0, leading to better gradient flow (positive and negative gradients cancel less).

3. Empirically works better: in practice, ReLU in vanilla RNNs leads to exploding states. LSTMs/GRUs use tanh for cell candidates but use sigmoid for gating.

29 What is the PATIENCE=10 parameter in the RNN lab's EarlyStopping and why is it higher than in CNN labs? ▾

PATIENCE=10 means the training continues for 10 epochs without improvement before stopping. This is higher than CNN labs (patience=5) because:

1. RNN learning is noisier: gradient variance is higher due to sequential dependencies — temporary plateaus are common.

2. More oscillation: val_loss for sequence models often fluctuates more, so a higher patience prevents stopping too early during a genuine improvement phase.

3. Complex loss landscape: RNNs have more complex optimization surfaces — longer patience allows escaping local plateaus.

30 Summarize: what is the key reason to choose LSTM over vanilla RNN for sequence tasks? ▾

Exam Favorite

The key reason: LSTM solves the vanishing gradient problem through its additive cell state update mechanism, enabling it to learn dependencies spanning hundreds of time steps — which vanilla RNNs cannot.

Specifically:

· Vanilla RNN: gradient multiplied by $W_h^T \cdot \tanh'(\cdot)$ at every step → exponentially decays.

· LSTM: gradient through $C_t$ is multiplied by $f_t$ (learned, can be ≈1) → sustained gradient flow.

Choose vanilla RNN only for very short sequences (≤10 steps) or for educational purposes. Use LSTM or GRU for all practical sequence modeling.

MODULE 04 NLP Preprocessing Pipeline 30 Q

01 List the 6 standard steps of the NLP text cleaning pipeline in order. ▾

Exam Favorite

Step	Operation	Python (regex)
1	Lowercase	`text.lower()`
2	Remove HTML tags	`re.sub(r'<[^>]+>', ' ', text)`
3	Remove URLs	`re.sub(r'http\S+\|www\S+', ' ', text)`
4	Remove punctuation	`re.sub(r'[^\w\s]', ' ', text)`
5	Remove numbers (optional)	`re.sub(r'\d+', ' ', text)`
6	Normalize whitespace	`re.sub(r'\s+', ' ', text).strip()`

Step 5 is domain-dependent — keep numbers for financial text, technical specs, dates.

02 What is tokenization and what are the main types? ▾

Tokenization splits raw text into discrete units (tokens) that the model can process.

Type	Unit	Vocabulary size	OOV handling
Character	Single char	~100	None (no OOV)
Word	Whole word	50k–500k	Poor (UNK token)
Subword (BPE, WordPiece)	Sub-word units	30k–50k (controlled)	Excellent

BERT uses WordPiece; GPT uses BPE. Both are subword methods that balance vocabulary size with OOV robustness.

03 What is the difference between stemming and lemmatization? Give examples of each. ▾

Exam Favorite

	Stemming	Lemmatization
Method	Heuristic rules — chop suffix	Dictionary lookup + POS context
Output	Stem (may not be a real word)	Lemma (always a valid base form)
Speed	Fast	Slower
Example	"running" → "run", "studies" → "studi"	"running" → "run", "better" → "good"
Library	NLTK PorterStemmer	NLTK WordNetLemmatizer, spaCy

"better" → "good" is only possible with lemmatization (it knows the grammatical relationship).

04 What are stop words and why should "not", "no", "never" NOT be removed for sentiment analysis? ▾

Exam Favorite

Stop words: high-frequency function words with little standalone meaning: "the", "a", "is", "in", "of", "to"…

Critical exception — negation words: removing "not", "no", "never" completely destroys sentiment polarity:

Original	After stop word removal	Sentiment lost?
"The movie was not good"	"movie good"	YES — becomes positive!
"I have no complaints"	"complaints"	YES — becomes negative!
"Never disappointing"	"disappointing"	YES — reversed!

Domain-specific stop word lists. For sentiment: always keep negation words.

05 What is Part-of-Speech (POS) tagging and what does it enable downstream? ▾

POS tagging assigns a grammatical category to each token: NN (noun), VB (verb), JJ (adjective), RB (adverb), etc.

Example: "The quick brown fox" → [The/DT, quick/JJ, brown/JJ, fox/NN]

What it enables:

· Better lemmatization (need POS to know "runs" is VBZ → "run" vs noun "runs")

· Chunking and syntactic parsing

· Named entity recognition (nouns are NE candidates)

· Feature engineering for classical ML

· Word sense disambiguation ("bank" as NN near "river" vs near "loan")

06 What is Named Entity Recognition (NER) and what are the standard entity types? ▾

NER identifies and classifies real-world named entities mentioned in text.

Example: "Apple is headquartered in Cupertino, California" → [Apple/ORG, Cupertino/GPE, California/GPE]

Type	Examples
PERSON	Barack Obama, Marie Curie
ORG	Microsoft, ENSAM, WHO
GPE (Geo-Political Entity)	Morocco, Casablanca, Paris
DATE	May 18, 2026, next Tuesday
MONEY	$500, €1,000

Libraries: spaCy, NLTK (maxent_ne_chunker), HuggingFace token classifiers.

07 What is the OOV (Out-of-Vocabulary) problem and how do different tokenization strategies handle it? ▾

OOV occurs when a token at inference time does not appear in the training vocabulary. Word-level tokenizers map it to [UNK] — losing all information.

Strategy	OOV handling
Word-level	Poor — UNK token, loses word identity
Character-level	Perfect — every char is in-vocabulary
BPE (GPT)	Good — breaks rare words into known subword pieces
WordPiece (BERT)	Good — "unhappy" → ["un", "##happy"]
FastText	Good — uses character n-grams, builds OOV vector from subwords

08 What is BPE (Byte Pair Encoding) tokenization and how does it build its vocabulary? ▾

BPE starts with a character-level vocabulary, then iteratively merges the most frequent pair of adjacent tokens into a new token, until the desired vocabulary size is reached.

Example:

Start: l o w e r _ , l o w e s t _ Merge "l" + "o" → "lo": lo w e r _ , lo w e s t _ Merge "lo" + "w" → "low": low e r _ , low e s t _ ...until vocab_size = 30,000

Result: common words stay whole ("the", "is"); rare words split into familiar subword pieces.

Used by: GPT-2/3/4, RoBERTa.

09 What is dependency parsing and what information does it provide? ▾

Dependency parsing analyzes the grammatical structure of a sentence, identifying directed relationships between words.

Example: "The cat ate the fish"

ate ← nsubj ← cat ate → dobj → fish cat ← det ← The fish ← det ← the

Each word (except root) has one head and a typed dependency relation (nsubj=nominal subject, dobj=direct object, det=determiner).

Applications: information extraction, question answering, coreference resolution, relation extraction.

10 What is a Context-Free Grammar (CFG) and write a simple example? ▾

A CFG is a set of recursive rewrite rules (productions) that define the valid syntactic structures of a language:

S → NP VP NP → Det N | N VP → V NP | V Det → "the" | "a" N → "cat" | "fish" V → "ate"

The sentence "the cat ate a fish" is parsed as: S → NP VP → Det N V NP → the cat ate a fish.

CFGs underpin formal syntax analysis and are used in grammar checkers and information extraction.

11 How does NLTK's word_tokenize() differ from Python's str.split()? ▾

Input: "I don't like it, really!"	Result
`.split()`	["I", "don't", "like", "it,", "really!"] — punctuation attached
`word_tokenize()`	["I", "do", "n't", "like", "it", ",", "really", "!"] — contractions split, punct separated

word_tokenize uses Penn Treebank conventions and handles contractions ("don't" → "do" + "n't"), abbreviations (e.g., "U.S."), and punctuation correctly. str.split() is naive — splits only on whitespace.

12 What is the vocabulary size problem and why is it challenging? ▾

The English vocabulary is theoretically unbounded (new words, names, slang, technical terms). Word-level tokenizers must choose a fixed vocabulary size $V$:

· Too small ($V=5{,}000$): high OOV rate, many words → UNK.

· Too large ($V=500{,}000$): huge embedding matrix ($500k \times d$), slow softmax over output vocab, sparse training data per word.

Standard choice: $V = 30{,}000$–$50{,}000$ for word-level. Subword methods (BPE) handle this better by representing words as combinations of subword pieces.

13 When should you keep numbers in the text? Give two domain examples where removing them hurts performance. ▾

Step 5 of the pipeline (remove numbers) is optional and domain-specific.

Keep numbers for:

1. Financial text: "Revenue increased by 23%" — the 23% is the key signal for forecasting, sentiment, or risk classification.

2. Medical/clinical text: "Patient has a BMI of 32.5 and BP of 140/90" — numbers carry diagnostic significance.

3. Legal/scientific text: Article numbers, statute references, measurements are semantically important.

Remove numbers for: general sentiment analysis of movie reviews where numbers (year of release "2023", scene count "120 minutes") are noise.

14 What does spaCy's `en_core_web_sm` model provide and what are its capabilities? ▾

import spacy nlp = spacy.load("en_core_web_sm") doc = nlp("Apple was founded in Cupertino in 1976.") for ent in doc.ents: print(ent.text, ent.label_)

en_core_web_sm is a small English model (~12MB) providing:

· Tokenization (with contractions, punctuation)

· POS tagging (token.pos_, token.tag_)

· Dependency parsing (token.dep_, token.head)

· Named Entity Recognition (doc.ents with entity types)

· Lemmatization (token.lemma_)

Larger models (en_core_web_md, lg) add word vectors.

15 What is the challenge of tokenizing social media text (Twitter, Reddit)? ▾

Standard tokenizers are designed for formal text and fail on social media:

Challenge	Example	Standard tokenizer behavior
Hashtags	#DeepLearning	Splits at #, losing hashtag meaning
@Mentions	@elonmusk	Treats as two tokens
Emojis	🔥👍😍	Often UNK or garbled
Slang/abbreviations	"lol", "brb", "gonna"	OOV or not split correctly
Repeated chars	"sooooo goood"	Rare OOV tokens

Solution: use specialized tokenizers (TweetTokenizer from NLTK, BERTweet for social media).

16 What is sentence segmentation and why is it non-trivial? ▾

Sentence segmentation splits a document into individual sentences (usually needed before tokenization).

Why non-trivial: periods don't always end sentences:

· Abbreviations: "Dr. Smith", "U.S.A.", "etc." — period is part of the word.

· Decimal numbers: "the price was $3.99" — period is not a sentence boundary.

· Ellipsis: "..." — multiple periods, not a sentence end.

NLTK's Punkt tokenizer and spaCy's sentencizer use statistical models trained to distinguish sentence-ending periods from other uses.

17 What is the difference between Porter Stemmer and Lancaster Stemmer? ▾

	Porter Stemmer	Lancaster Stemmer
Aggressiveness	Moderate	More aggressive
Speed	Moderate	Faster
Output readability	Better (closer to real words)	Often incomprehensible
Example	"generously" → "generous"	"generously" → "gen"

Porter is the most widely used English stemmer. Lancaster is useful when speed is critical and the stem form doesn't need to be interpretable.

18 What is coreference resolution and why is it important for information extraction? ▾

Coreference resolution identifies when multiple expressions in a text refer to the same entity:

"Barack Obama was elected in 2008. He served two terms. The president signed the law."

All three bold expressions refer to the same person. Coreference resolution links them.

Importance: without it, information extraction treats "He" and "the president" as separate entities — losing the connection between facts. Essential for question answering ("Who signed the law?" → Obama) and summarization.

19 What is word sense disambiguation (WSD) and why is it hard? ▾

WSD determines which meaning of a polysemous word is intended given its context:

· "I went to the bank to deposit money." → financial institution

· "We sat by the river bank." → river bank

Why hard:

1. Many words have dozens of senses (WordNet: "run" has 39 senses)

2. Sense boundaries are fuzzy and debated even among linguists

3. Context window must be the right size — too small misses signals, too large introduces noise

4. Rare senses have very few training examples

Modern approach: contextual embeddings (BERT) implicitly solve WSD without explicit disambiguation.

20 What regex pattern removes HTML tags and why is this an important preprocessing step for web-scraped data? ▾

import re text = re.sub(r'<[^>]+>', ' ', text) # " This is bold" → " This is bold"

The pattern [^>]+ matches any character except ">" — capturing everything between "<" and ">" (the tag content). Replacing with a space prevents word merging ("word word" → "word word" not "wordword").

Importance: IMDb reviews, news articles, Wikipedia — all scraped from HTML. Tags like  , , <a href=...> are noise that creates OOV tokens and disrupts tokenization.

21 What NLP libraries are used in the course and what is each one's specialty? ▾

Library	Specialty
NLTK	Classical NLP: tokenization, stemming, lemmatization, POS, NER chunker, CFG parsing
spaCy	Industrial-strength NLP: fast POS, NER, dependency parsing with pretrained models
scikit-learn	ML pipelines: CountVectorizer, TfidfVectorizer, classifiers, metrics
gensim	Word embeddings: Word2Vec, GloVe, FastText training
HuggingFace	Pretrained Transformers: BERT, GPT, T5 fine-tuning and inference

22 What is morphological analysis and how does it differ from stemming? ▾

Morphological analysis decomposes words into their constituent morphemes (smallest units of meaning): prefix, stem, suffix, inflectional endings.

Example: "unhappiness" → [un- (prefix, negation) + happy (root) + -ness (suffix, nominalization)]

Difference from stemming: stemming is a crude heuristic (chop endings). Morphological analysis is linguistically principled — it identifies the actual functional components and their roles.

Languages with rich morphology (Arabic, Turkish, Finnish) require proper morphological analysis; English stemming suffices for most English NLP tasks.

23 What is the full clean_text() pipeline function from the course? ▾

Exam Favorite

import re def clean_text(text): text = text.lower() text = re.sub(r'<[^>]+>', ' ', text) # remove HTML text = re.sub(r'http\S+|www\S+', ' ', text) # remove URLs text = re.sub(r'[^\w\s]', ' ', text) # remove punctuation # text = re.sub(r'\d+', ' ', text) # numbers: optional text = re.sub(r'\s+', ' ', text).strip() # normalize whitespace return text

This exact function represents the 6-step pipeline condensed into a reusable form.

24 What is the NLP text classification pipeline end-to-end (from raw text to model output)? ▾

Raw Text → Text Cleaning (clean_text()) → Tokenization (word_tokenize / subword) → Stop Word Removal (optional, domain-dependent) → Stemming / Lemmatization (optional) → Vectorization (BoW / TF-IDF / Embedding) → Model (Logistic Regression / LSTM / BERT) → Prediction (class probabilities) → Evaluation (accuracy, F1, confusion matrix)

25 Why is lowercasing the very first step in the preprocessing pipeline? ▾

Lowercasing must come first because subsequent regex operations work on the cleaned text, and case doesn't affect their patterns. But more importantly:

1. Vocabulary reduction: "Apple", "apple", "APPLE" → one token "apple". Reduces sparse OOV problem.

2. Consistency: sentence-starting capitalization ("The" vs "the") is not semantically meaningful.

3. Tokenizer/stemmer alignment: most stemmers/lemmatizers expect lowercase.

Exception: Named Entity Recognition often relies on capitalization ("Washington" the city vs "washington" the word). If NER is needed, do it before lowercasing.

26 What NLTK corpora need to be downloaded for a complete NLP pipeline? ▾

import nltk nltk.download('punkt') # sentence + word tokenizer nltk.download('stopwords') # stop word lists for 20+ languages nltk.download('wordnet') # WordNetLemmatizer dictionary nltk.download('averaged_perceptron_tagger') # POS tagger nltk.download('maxent_ne_chunker') # NER chunker nltk.download('words') # English word list

These are the 6 resources used in the course lab setup (Task 0.1).

27 What is the difference between lemmatization using NLTK WordNetLemmatizer with and without POS tag? ▾

from nltk.stem import WordNetLemmatizer lemmatizer = WordNetLemmatizer() # Without POS tag (assumes noun by default): lemmatizer.lemmatize("running") # → "running" (wrong! noun form) lemmatizer.lemmatize("better") # → "better" (no change!) # With POS tag: lemmatizer.lemmatize("running", pos='v') # → "run" (correct!) lemmatizer.lemmatize("better", pos='a') # → "good" (correct!)

Without POS context, WordNetLemmatizer defaults to noun, often returning the wrong lemma. Always combine with POS tagging for accurate lemmatization.

28 What is chunking in NLP and how does it differ from full parsing? ▾

Chunking (shallow parsing) groups words into non-overlapping phrase chunks (NP, VP, PP) without producing the full parse tree:

Input POS tags: The/DT quick/JJ brown/JJ fox/NN jumps/VBZ

Chunks: [NP The quick brown fox] [VP jumps]

	Chunking	Full Parsing
Output	Flat phrase groups	Full hierarchical tree
Speed	Fast	Slow
Completeness	Partial (no nested structure)	Complete sentence structure
Use case	NER, information extraction	Grammar checking, machine translation

29 What is whitespace normalization and what problems does it fix? ▾

text = re.sub(r'\s+', ' ', text).strip() # "\n\n Hello \t world " → "Hello world"

Problems it fixes:

1. Multiple spaces between words (from previous substitutions replacing tags/URLs with spaces)

2. Newlines (\n) and tabs (\t) that tokenizers may treat as separate tokens

3. Leading/trailing whitespace

4. Ensures consistent single-space separation between all tokens, which downstream tokenizers and vectorizers expect.

30 Why is text preprocessing domain-specific? Compare the pipeline for medical records vs social media. ▾

Decision	Medical Records	Social Media
Lowercase	Careful — "HIV" vs "hiv", drug names case-sensitive	Yes — most is informal
Remove numbers	NO — doses, measurements are critical	Usually yes
Stop words	Keep "no", "not", "without" (clinical negation)	Remove most, keep negation
Stemming/Lemma	Lemmatization preferred (preserve meaning)	Stemming OK (speed matters)
Special tokens	Expand abbreviations ("MI" → myocardial infarction)	Handle hashtags, @mentions, emojis

There is no universal preprocessing pipeline — always design it for your domain and task.

MODULE 05 Classical Text Representation — BoW · N-grams · TF-IDF 30 Q

01 What is the Bag of Words (BoW) model and what information does it explicitly discard? ▾

BoW represents a document as a vector of word counts over a fixed vocabulary $V$: each dimension counts how many times word $i$ appears, ignoring where.

Explicitly discards:

1. Word order: "dog bites man" = "man bites dog" in BoW.

2. Syntax / grammar.

3. Context: the meaning of each word is independent of surrounding words.

4. Semantic relationships: "car" and "automobile" are treated as completely different features.

02 Write the BoW vector for "I love cats" with vocabulary ["cats","dogs","hate","love"]. ▾

Vocabulary (sorted): ["cats", "dogs", "hate", "love"] — indices 0, 1, 2, 3.

"I love cats" contains: "cats" (×1), "love" (×1). "I" is out-of-vocab.

$$\text{BoW}("I\ love\ cats") = [1, 0, 0, 1]$$

Compare: "I hate dogs" → [0, 1, 1, 0]. These two reviews are maximally different in the vector space — good! BoW correctly separates opposite sentiments here.

03 What is the IMDb 50k benchmark accuracy of BoW with unigrams? ▾

Exam Favorite

86.18% accuracy on the IMDb 50,000 review sentiment dataset (binary: positive/negative).

This is the baseline for classical methods. Remarkably strong for a method that ignores all word order and uses only raw counts.

BoW works surprisingly well for sentiment because sentiment-bearing words ("excellent", "terrible", "love", "hate") carry strong signal even without context.

04 Why does BoW fail on "The movie was not good, not bad"? What is this limitation called? ▾

In BoW, "not good, not bad" produces high counts for both "good" and "bad". A classifier trained on bag of words cannot distinguish:

· "The movie was not good, not bad" (mixed / neutral)

· "The movie was good, not bad" (positive)

· "The movie was not good, bad" (negative)

All three produce the same or very similar BoW vectors containing "good", "bad", "not". The model predicts incorrectly on neutral/mixed sentiments with double negation.

This is the negation blindness problem — BoW cannot model the interaction between "not" and the adjacent adjective.

05 What is an N-gram? Define unigram, bigram, and trigram with examples. ▾

An N-gram is a contiguous sequence of N tokens from a text.

Text: "The cat sat"

N	Name	Examples
1	Unigram	["The", "cat", "sat"]
2	Bigram	["The cat", "cat sat"]
3	Trigram	["The cat sat"]

N-grams capture local word order, enabling the model to recognize "not good" (bigram) as a negative phrase even with BoW.

06 Give the exact IMDb benchmark results for unigram, bigram, and trigram N-gram models. ▾

Exam Favorite

N-gram range	IMDb Accuracy	Interpretation
Unigram only (1,1)	86.18%	Baseline BoW
Bigram (1,2)	~88–89%	+2–3% from phrase detection
Trigram (1,2,3)	90.11%	Best classical N-gram

Each level adds local context, improving negation handling and phrase-level sentiment capture.

07 Write the TF-IDF formula and explain each component. ▾

FormulaExam Favorite

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\!\left(\frac{N}{df(t)}\right) + 1$$

$t$: term (word) · $d$: document · $N$: total number of documents · $df(t)$: number of documents containing term $t$.

TF (Term Frequency): how often $t$ appears in document $d$ — captures local importance.

IDF (Inverse Document Frequency): $\log(N/df(t))$ — penalizes words that appear in many documents (common words like "the" get IDF ≈ 0). The +1 is a smoothing term.

Combined: high weight → word is frequent in this document AND rare across the corpus → distinctive.

08 What is the IDF of a word that appears in ALL N documents? What does this mean? ▾

$$\text{IDF} = \log\!\left(\frac{N}{N}\right) + 1 = \log(1) + 1 = 0 + 1 = 1$$

With smoothing (+1), the IDF = 1 (not 0). Without smoothing: $\log(1) = 0$ — the word has zero weight regardless of TF.

Meaning: a word like "the" that appears in every document is not distinctive — it provides no information about which document is relevant. TF-IDF naturally down-weights it.

From the IMDb 50k corpus: "the" → IDF ≈ 0.0001 (essentially 0). "oscillating" → IDF ≈ 10.3 (very rare, highly distinctive).

09 What is the IMDb accuracy of TF-IDF with (1,2)-grams and why does it outperform raw BoW? ▾

90.1% on IMDb 50k — the best classical method.

TF-IDF outperforms raw BoW (86.18%) for two reasons:

1. IDF down-weighting: common uninformative words ("the", "is", "a") get near-zero weight, so they don't dominate the feature vector. BoW counts them equally with informative words.

2. Bigrams (1,2): captures "not good", "highly recommend", "waste of time" as single features — partial negation handling.

Combined: TF-IDF focuses attention on distinctive, sentiment-bearing terms and phrases.

10 What is the sparsity problem in BoW/TF-IDF and what dimensionality does it produce on IMDb? ▾

On the IMDb 50k corpus, the vocabulary is ~100,000 unique words. Each document is represented as a vector of dimension 100,000. But a typical review uses only 200–500 words — so 99.5%+ of dimensions are zero.

Problems with sparsity:

1. Memory: 50,000 documents × 100,000 features = 5 billion entries (mostly zeros). Must use sparse matrix format.

2. Curse of dimensionality: in high-dimensional sparse space, distances become uninformative — all documents seem equally far apart.

3. No semantic generalization: "car" and "automobile" are orthogonal vectors (no relationship encoded).

11 What is the dimensionality of a BoW vector and what determines it? ▾

The dimensionality = the vocabulary size $|V|$. Each dimension corresponds to one unique word (or N-gram for N>1).

What determines it:

· All unique words in the training corpus (after preprocessing)

· max_features parameter: e.g., CountVectorizer(max_features=10000) keeps only the 10,000 most frequent words.

For N-grams: bigram vocabulary grows as $O(V^2)$ — if 50k unigrams, ~2.5M potential bigrams (but most rare, so max_features is critical).

from sklearn.feature_extraction.text import CountVectorizer vec = CountVectorizer(max_features=10000, ngram_range=(1,2)) X = vec.fit_transform(texts) # shape: (n_docs, 10000)

12 What is CountVectorizer vs TfidfVectorizer in scikit-learn and what does each output? ▾

	CountVectorizer	TfidfVectorizer
Output values	Integer counts: how many times each word appears	Float TF-IDF scores: scaled by rarity
Common words	High count (e.g., "the": 50)	Low score (IDF ≈ 0)
Rare words	Low count (e.g., "oscillating": 1)	High score (high IDF)
Use case	BoW baseline, language modeling	Classification, information retrieval

from sklearn.feature_extraction.text import TfidfVectorizer vec = TfidfVectorizer(ngram_range=(1,2), max_features=50000, sublinear_tf=True) X_train = vec.fit_transform(train_texts)

13 What is the difference between binary BoW and count BoW? ▾

	Binary BoW	Count BoW
Values	0 or 1: does word appear?	0, 1, 2, …: how many times?
Sensitivity to repetition	No — "great great great" = "great"	Yes — "great great great" → count=3
CountVectorizer param	`binary=True`	`binary=False` (default)
Use case	Short texts where repetition is noise	Long documents where frequency matters

For IMDb reviews: count BoW performs better because sentiment words are genuinely more frequent in strongly-sentiment reviews ("amazing amazing amazing!").

14 When should you STOP at classical methods and not use neural networks? Give 4 conditions. ▾

Exam Favorite

Stop at classical (BoW/TF-IDF + logistic regression/SVM) when:

1. Small dataset (<10,000 samples): neural networks overfit; classical methods generalize better with limited data.

2. Interpretability required: you need to explain which words drove the decision (legal, medical, financial compliance).

3. Low latency & resources: inference must be <1ms or runs on low-power hardware — neural networks are too slow/heavy.

4. Accuracy already sufficient: TF-IDF at 90% meets the business requirement; BERT at 94% is not worth the infrastructure cost.

15 What does the `sublinear_tf=True` parameter do in TfidfVectorizer? ▾

With sublinear_tf=True, the TF component is replaced by $1 + \log(\text{TF})$ instead of raw TF:

Normal TF: if "great" appears 10× → TF = 10
Sublinear TF: $1 + \log(10) = 1 + 2.3 = 3.3$

This compresses the range of term frequencies — a word appearing 10× is not 10× as important as one appearing 1×. The improvement diminishes as count grows. Particularly useful for long documents where some words repeat many times.

Recommended for most TF-IDF applications in practice.

16 What is the ngram_range=(1,2) parameter in scikit-learn vectorizers? ▾

TfidfVectorizer(ngram_range=(1,2)) # Extracts both unigrams AND bigrams # "I love this" → ["I", "love", "this", "I love", "love this"]

ngram_range=(min_n, max_n) specifies the lower and upper boundary of the N-gram range to be extracted:

Range	Extracts
(1,1)	Unigrams only
(1,2)	Unigrams + bigrams
(2,2)	Bigrams only
(1,3)	Unigrams + bigrams + trigrams

Larger ranges capture more context but exponentially increase vocabulary size.

17 What are the 4 shared limitations of BoW, N-gram, and TF-IDF representations? ▾

Exam Favorite

Limitation	Example
No semantic meaning	"car" and "automobile" are orthogonal vectors — no relationship
No context / polysemy	"bank" (financial) = "bank" (river) — same vector regardless of context
OOV problem	New words at inference time → zero vector or UNK
High dimensionality & sparsity	100k-dim vector, 99.5% zeros — curse of dimensionality

All three limitations are solved by neural embeddings (Word2Vec → BERT).

18 What is a document-term matrix and how is it structured? ▾

The document-term matrix (DTM) is the output of vectorization: rows = documents, columns = vocabulary terms.

$$\mathbf{X} \in \mathbb{R}^{N_{docs} \times |V|}$$

Entry $X_{ij}$ = count (BoW) or TF-IDF score of word $j$ in document $i$.

Stored as sparse matrix (scipy.sparse.csr_matrix) because 99%+ of entries are 0. In dense format: 50,000 docs × 100,000 words × 4 bytes = 20GB — impossible to fit in RAM. Sparse format stores only nonzero entries.

19 Why do bigrams outperform unigrams on sentiment tasks specifically? ▾

Sentiment is heavily driven by negation and degree adverbs — both require exactly 2-word context:

Bigram	Sentiment	Unigrams alone
"not good"	Negative	"not"(ambig) + "good"(positive) → confused
"highly recommend"	Strongly positive	"highly"(ambig) + "recommend"(positive)
"waste of"	Negative	Each word ambiguous alone
"absolutely terrible"	Very negative	"absolutely" adds no standalone signal

+2–3% from unigrams to bigrams (86.18% → ~88–89%) is directly attributable to capturing these patterns.

20 What does max_features=10000 do in CountVectorizer and what are the trade-offs of this parameter? ▾

max_features=10000 keeps only the 10,000 most frequent terms (by count across the training corpus). Rarer terms are discarded.

max_features	Trade-off
Too small (1,000)	Misses important domain words — underfitting
Good (10,000–50,000)	Balanced coverage vs. sparsity
None (all words)	Highest recall but extreme sparsity, slow training, overfitting risk

Typical values: 10,000–50,000 for BoW; with TF-IDF + N-grams: 50,000–100,000.

21 What is the co-occurrence matrix and how does it relate to word representations? ▾

A co-occurrence matrix $\mathbf{C}$ has shape $|V| \times |V|$ where $C_{ij}$ counts how many times word $i$ and word $j$ appear within a context window (e.g., ±5 words) across the corpus.

Relationship to embeddings: GloVe is based on factorizing the log co-occurrence matrix. The row vector of word $i$ in $\mathbf{C}$ is a high-dimensional, sparse semantic representation — SVD/factorization compresses it into dense word vectors.

Problem: $|V| \times |V|$ for $|V|=100k$ = 10 billion entries — memory-prohibitive without approximations.

22 What is the polysemy problem in classical text representation? Give a concrete example. ▾

Polysemy = one word form with multiple distinct meanings. Classical representations use a single feature dimension per word, forcing all meanings to share one vector entry:

· "I went to the bank to deposit my check." (financial)

· "The frog sat on the river bank." (geography)

Both uses of "bank" increment the same counter in BoW — the representation conflates two entirely different meanings. A classifier cannot distinguish them from the word alone.

This motivates contextual embeddings (BERT) which produce different vectors for "bank" in each sentence.

23 What is the key reason TF-IDF outperforms raw BoW for document classification? ▾

The IDF component solves the fundamental problem of raw BoW: common words dominate.

In raw BoW, "the" might appear 100× in a review and "excellent" 3×. The classifier sees "the" as 33× more important than "excellent" — but "the" is completely uninformative for sentiment.

TF-IDF assigns "the" an IDF ≈ 0 (appears in all documents) and "excellent" an IDF ≈ 8 (rare across corpus). So "excellent" gets weight $3 \times 8 = 24$ while "the" gets $100 \times 0 \approx 0$.

Result: the classifier focuses on discriminative words rather than function words.

24 How do you compute TF (term frequency) for a document? Are there different variants? ▾

Variant	Formula	Use
Raw count	$\text{TF}(t,d) = f_{t,d}$	BoW baseline
Normalized	$\text{TF}(t,d) = f_{t,d} / \sum_{t'} f_{t',d}$	Documents of different lengths
Sublinear (log)	$1 + \log(f_{t,d})$ if $f > 0$, else 0	Most practical TF-IDF (sublinear_tf=True)
Binary	1 if $f_{t,d} > 0$, else 0	Short text, presence/absence only

scikit-learn's TfidfVectorizer uses normalized TF by default (L2 row normalization after TF-IDF computation).

25 What is the typical machine learning classifier paired with TF-IDF vectors and why? ▾

The standard classifier paired with TF-IDF is Logistic Regression or Linear SVM.

Why linear classifiers:

1. TF-IDF vectors are high-dimensional (50k–100k dims) but sparse — linear models are computationally efficient in this space.

2. High-dimensional data is often linearly separable — a linear boundary in 100k dimensions is very expressive.

3. Fast training and inference — critical for production NLP.

4. Interpretable — coefficients directly show which words are most positive/negative.

Naive Bayes also works well for text (strong independence assumption, but efficient with sparse data).

26 What is the production use case where TF-IDF excels over neural methods? ▾

Information Retrieval / Search Engines — the original and still major use case of TF-IDF.

Given a query, rank documents by their TF-IDF cosine similarity to the query:

$$\text{similarity}(q, d) = \frac{\mathbf{v}_q \cdot \mathbf{v}_d}{\|\mathbf{v}_q\| \cdot \|\mathbf{v}_d\|}$$

Documents with query terms that are rare across the corpus (high IDF) are ranked higher than documents with common terms.

BM25 (the modern IR standard, used by Elasticsearch) is an improved variant of TF-IDF. TF-IDF remains the gold standard for keyword search, document deduplication, and keyword extraction in production.

27 What is the decision ladder — when to escalate from BoW → N-gram → TF-IDF → embeddings? ▾

Start with	Escalate if
BoW + Logistic Regression	Accuracy insufficient → try N-grams
N-gram (1,2) + Logistic Regression	Still not sufficient → try TF-IDF
TF-IDF (1,2) + Logistic Regression	Need >91%, have >50k samples → try Word2Vec/FastText
Word2Vec/FastText + LSTM	Need >93%, have large data, have GPU → try BERT fine-tuning

Golden rule: start simple. Only add complexity when accuracy requirements are not met by the simpler model.

28 What is the vocabulary explosion problem with N-grams and how is max_features used to control it? ▾

With unigrams: vocabulary size $|V| \approx 50{,}000$ (unique words in IMDb).

With bigrams: up to $|V|^2/2 \approx 1.25$ billion possible bigrams. In practice ~500k–2M are observed.

With trigrams: $|V|^3$ possible — astronomically large, mostly rare.

Solution — max_features: keep only the most frequent N-grams.

TfidfVectorizer(ngram_range=(1,3), max_features=100000) # Selects top 100k most frequent unigrams+bigrams+trigrams

Most rare N-grams are noise — frequent N-grams capture the real patterns. max_features is essential for N>2.

29 Why can't BoW/TF-IDF capture semantic similarity between "good" and "excellent"? ▾

In BoW/TF-IDF, each word is a one-hot dimension. "good" occupies dimension 4,231; "excellent" occupies dimension 17,842. Their vectors are:

$\mathbf{v}_{\text{good}} = [0,\ldots,1,\ldots,0]$ (1 at dim 4231)
$\mathbf{v}_{\text{excellent}} = [0,\ldots,1,\ldots,0]$ (1 at dim 17842)

Cosine similarity = 0 — they are completely orthogonal (unrelated) in the representation space.

But both words carry positive sentiment! A classifier must learn "excellent" is positive from scratch — there is no sharing of information between semantically related words.

This motivates word embeddings, where similar words have similar vectors.

30 Summarize the performance comparison of all classical methods on IMDb 50k. ▾

Exam Favorite

Method	IMDb Accuracy	Key advantage
BoW unigram (1,1)	86.18%	Simplest, fast, interpretable
N-gram bigram (1,2)	~88–89%	Captures "not good" patterns
N-gram trigram (1,2,3)	90.11%	Best N-gram coverage
TF-IDF (1,2)-gram	90.1%	Down-weights common words

TF-IDF is the best classical method overall. For resource-constrained production systems, TF-IDF at 90% is often the right stopping point before investing in neural methods.

MODULE 06 Word Embeddings — Static & Contextual 30 Q

01 State the distributional hypothesis. Why is it the foundation of word embeddings? ▾

Key Concept

"A word is characterized by the company it keeps." (Firth, 1957)

Words that appear in similar contexts have similar meanings. "cat" and "dog" both appear near "pet", "food", "veterinarian" → they should have similar representations.

Foundation: this hypothesis allows learning meaning purely from co-occurrence statistics — no manual annotation needed. Feed a neural network billions of words and it learns that semantically similar words appear in similar contexts → similar embedding vectors.

02 What problem do dense word embeddings solve compared to sparse BoW vectors? ▾

	BoW / TF-IDF	Word Embeddings
Dimensionality	50k–100k (sparse)	100–300 (dense)
Sparsity	99.5%+ zeros	All dimensions non-zero
Semantic similarity	Orthogonal (no relationship)	Cosine similarity captures semantics
OOV	Zero vector	Trained embeddings (Word2Vec) or subword (FastText)
"car" vs "automobile"	Completely unrelated	High cosine similarity (>0.8)

03 What is Word2Vec Skip-gram? Describe the training objective. ▾

Skip-gram: given a center word, predict the surrounding context words within a window.

Example (window=2): "The quick brown fox jumps" → given "brown", predict ["The", "quick", "fox", "jumps"].

$$\max \sum_{(w, c) \in D} \log P(c \mid w; \theta)$$

A neural network with one hidden layer (the embedding layer) learns to maximize the probability of real context words and minimize the probability of random words (negative sampling).

Skip-gram works better for rare words (each rare word gets many training signal from many contexts).

04 What is Word2Vec CBOW? How does it differ from Skip-gram? ▾

CBOW (Continuous Bag of Words): given the surrounding context words, predict the center word.

Example: ["The", "quick", ?, "fox", "jumps"] → predict "brown".

	Skip-gram	CBOW
Task	Center → predict contexts	Contexts → predict center
Rare words	Better (more training signal)	Worse
Speed	Slower (many predictions per word)	Faster (one prediction per window)
Common words	Comparable	Better (averages context)

Skip-gram is generally preferred; CBOW trains faster on large corpora.

05 Write the famous Word2Vec vector arithmetic result and explain what it demonstrates. ▾

Exam Favorite

$$\mathbf{v}_{\text{king}} - \mathbf{v}_{\text{man}} + \mathbf{v}_{\text{woman}} \approx \mathbf{v}_{\text{queen}}$$

This demonstrates that word embeddings capture linear analogical relationships in the semantic space.

The vector from "man" to "king" encodes the concept of "royalty". Adding this offset to "woman" yields a point close to "queen".

Other examples: Paris - France + Germany ≈ Berlin · Doctor - Man + Woman ≈ Nurse

This was the first clear evidence that distributed representations encode structured semantic knowledge.

06 Write the cosine similarity formula and explain what it measures. ▾

Formula

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \cdot \|\mathbf{v}\|}$$

Cosine similarity measures the angle between two vectors, ignoring their magnitude. Range: [-1, +1].

· +1: identical direction (same meaning)

· 0: orthogonal (no relationship)

· -1: opposite directions (antonyms — e.g., "good" vs "bad")

Preferred over Euclidean distance for embeddings because it's scale-invariant — a 300-dim vector for "cat" doesn't need to be the same magnitude as "dog" to be semantically close.

07 What are the exact IMDb 50k benchmark results for Word2Vec, GloVe, FastText, and BERT? ▾

Exam Favorite

Method	Accuracy	F1-Macro	Latency (CPU)
BoW unigram (baseline)	86.2%	0.86	~1ms
Word2Vec + mean pool	85.7%	0.86	~12ms
GloVe + mean pool	76.8%	0.77	~10ms
FastText	86.0%	0.86	~15ms
BERT fine-tuned	93.9%	0.94	370ms

08 Why does GloVe perform WORSE than BoW on IMDb sentiment (76.8% vs 86.2%)? This seems counterintuitive — explain. ▾

Exam Favorite

GloVe's underperformance on sentiment is a well-known result with two causes:

1. Global co-occurrence statistics fail for sentiment: GloVe trains on global word co-occurrence (how often words appear together across all documents). Words like "not" and "good" co-occur frequently in both positive and negative contexts — the global statistics cannot distinguish "not good" (negative) from "very good" (positive). The embedding for "good" blends all these contexts.

2. Mean-pooling destroys word order: averaging GloVe vectors over a sentence conflates "not good" with "good not" → the directional information ("not" modifies "good") is lost.

BoW with a trained classifier compensates by learning that the combination of "not" + "good" features implies negative sentiment. GloVe + mean-pool cannot.

09 How does GloVe differ from Word2Vec in its training approach? ▾

	Word2Vec	GloVe
Training	Predictive (neural network): predict context from center	Count-based (matrix factorization): factorize co-occurrence matrix
Data used	Local context window (5–10 words)	Global co-occurrence counts (entire corpus)
Objective	Maximize P(context\|center)	Minimize: $(\mathbf{w}_i^T\mathbf{w}_j + b_i + b_j - \log X_{ij})^2$
Scalability	Online (stream data)	Requires co-occurrence matrix upfront

GloVe (Global Vectors) explicitly incorporates global statistics; Word2Vec only sees local windows. Both produce similar quality embeddings for most tasks.

10 How does FastText handle Out-of-Vocabulary (OOV) words? Write the subword decomposition of "acting". ▾

Exam Favorite

FastText represents each word as the sum of its character n-gram embeddings (n=3–6 by default). The word vector is the average of all its subword vectors plus the word itself.

Decomposition of "acting" (n=3):

"acting" → <ac, act, cti, tin, ing, ng> (with boundary markers < and >) v("acting") = avg(v("<ac"), v("act"), v("cti"), v("tin"), v("ing"), v("ng>"), v("<acting>"))

OOV advantage: "unknownword" is OOV but "unknown" and "word" share n-grams with known words → FastText builds a vector from shared subword pieces. Word2Vec/GloVe return zero vector for OOV.

11 What is the polysemy problem in static embeddings? Give two meanings of one word. ▾

Static embeddings (Word2Vec, GloVe, FastText) assign one fixed vector per word, regardless of context. A polysemous word like "bank" has one embedding that blends all its meanings:

· "I deposited money at the bank." (financial institution)

· "She sat on the river bank." (riverbank)

The learned vector for "bank" is somewhere between these two meanings — accurate for neither. Downstream models cannot distinguish which sense is intended.

Contextual embeddings (BERT, GPT) solve this: "bank" near "money" gets a different vector than "bank" near "river".

12 What does BERT stand for and what are its two pre-training tasks? ▾

Exam Favorite

BERT = Bidirectional Encoder Representations from Transformers (Devlin et al., Google, 2018).

Pre-training Task	Description
MLM (Masked Language Modeling)	15% of tokens are masked with [MASK]. BERT predicts the original token using both left and right context → learns bidirectional representations.
NSP (Next Sentence Prediction)	Given pairs of sentences (A, B): predict whether B actually follows A in the original text. Learns discourse-level relationships.

Pre-trained on 3.3 billion words (Wikipedia + BookCorpus) for weeks on 64 TPUs. After pre-training, fine-tuned on downstream tasks.

13 What is the [CLS] token in BERT and how is it used for classification? ▾

[CLS] (Classification) is a special token prepended to every input sequence: [CLS] sentence tokens... [SEP].

After passing through all 12 Transformer encoder layers, the [CLS] token's final hidden state is a sentence-level representation — it has attended to all tokens and aggregated the full sequence meaning.

# Fine-tuning BERT for classification: bert_output = bert(input_ids, attention_mask) cls_vector = bert_output.last_hidden_state[:, 0, :] # shape: (batch, 768) logits = classifier_head(cls_vector) # Dense(num_classes)

This [CLS] vector is passed to a new Dense classification head and the entire model is fine-tuned end-to-end.

14 Compare BERT-Base and BERT-Large: layers, attention heads, hidden dimension, parameters. ▾

	BERT-Base	BERT-Large
Encoder layers (L)	12	24
Hidden size (H)	768	1,024
Attention heads (A)	12	16
Parameters	110M	340M

BERT-Base is the standard for most applications. BERT-Large gives ~1–2% better accuracy on benchmarks but requires ~3× more memory and compute.

15 What is BERT's CPU inference latency on IMDb and why does this matter for production? ▾

370ms per sample on CPU (from the course's exact benchmark).

Compare: BoW+LR ≈ 1ms, Word2Vec ≈ 12ms, GloVe ≈ 10ms, FastText ≈ 15ms.

Why it matters:

· A user-facing API with a 200ms SLA cannot use BERT on CPU — must use GPU (reduces to ~20–30ms) or distilled models (DistilBERT: 40% faster, 97% accuracy retention).

· At 100 requests/sec: BERT-CPU needs 37 CPUs. FastText needs 1.5 CPUs.

· Total cost of ownership can be 10–20× higher for BERT in high-traffic production.

16 What is the "golden rule" for choosing a text representation method? ▾

Exam Favorite

Start simple. Add complexity only when accuracy requires it.

The decision ladder:

1. Start with TF-IDF + Logistic Regression → fast, interpretable, often sufficient.

2. If OOV or morphology matters → FastText.

3. If context/polysemy matters and you have GPU → BERT.

4. The 4% gain (TF-IDF 90% → BERT 94%) comes at 370× higher latency. Is that trade-off worth it for your application?

17 When would you choose FastText over BERT? Give 3 conditions. ▾

Choose FastText over BERT when:

1. Real-time inference required: latency <50ms — BERT's 370ms is unacceptable. FastText: ~15ms.

2. Morphologically rich language or OOV-heavy domain: medical terms, technical jargon, code-switching — FastText handles OOV via subwords; BERT's WordPiece may fragment them poorly.

3. Low resource (no GPU, small dataset): FastText trains in seconds on CPU; BERT fine-tuning needs hours and a GPU.

Bonus condition: if TF-IDF achieves only 86% but you need 87–90% without BERT's overhead — FastText is the right intermediate step.

18 What is mean-pooling of word embeddings and why is it insufficient for sentiment? ▾

Mean-pooling computes the document representation as the average of all word vectors:

$$\mathbf{d} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{e}_t$$

Why insufficient for sentiment:

1. Destroys word order: "not good" and "good not" produce the same average vector.

2. Dilutes negation: "The movie was absolutely terrible except for the amazing cinematography" → average of negative + positive words → neutral vector → misclassified.

3. Equal weighting: "the" (uninformative) and "brilliant" (highly informative) contribute equally to the average.

This is why Word2Vec+mean-pool (85.7%) underperforms even BoW (86.2%).

19 What is the [SEP] token in BERT and when is it used? ▾

[SEP] (Separator) marks the boundary between two sequences in BERT's input format:

# Single sentence: [CLS] I love this film [SEP] # Sentence pair (NSP, QA, NLI): [CLS] Question: What is AI? [SEP] AI is artificial intelligence [SEP]

BERT uses [SEP] to:

1. Signal the end of a sequence (single sentence tasks).

2. Separate two sentences in pair tasks (NSP, question answering where question and passage are concatenated).

Segment embeddings (0 for sentence A, 1 for sentence B) work together with [SEP] to tell BERT which tokens belong to which input.

20 What does fine-tuning BERT mean in practice? What is updated during fine-tuning? ▾

Fine-tuning BERT = taking pre-trained BERT (110M params) and continuing training on a small task-specific labeled dataset, updating all weights including the pre-trained encoder.

from transformers import BertForSequenceClassification model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2) optimizer = AdamW(model.parameters(), lr=2e-5) # very small LR! model.train() # Train for 2-4 epochs on task data

Key details:

· Learning rate must be very small (2e-5 to 5e-5) — large LR destroys pre-trained representations.

· Only 2–4 epochs needed (data is small, pre-training did most of the work).

· A task-specific head (Dense layer) is added on top of [CLS].

21 What is the typical dimensionality of Word2Vec embeddings and what range is common in practice? ▾

Word2Vec embeddings are typically $d = 100$–$300$ dimensions. The original Google News Word2Vec uses $d = 300$.

Dimension	Use case
50–100	Small corpora, fast inference, limited memory
100–300	Standard — balances quality vs cost (most pretrained models)
300+	Large corpora, high-accuracy tasks (GloVe 840B: d=300)

BERT uses hidden size 768 (Base) or 1024 (Large) — much larger because it captures contextual information, not just lexical.

22 What is the 6-review demo result: how many of 6 test reviews does each method classify correctly? ▾

Exam Favorite

Method	Correct / 6	Fails on
BoW	5/6	Negation: "not good, not bad"
TF-IDF	6/6	—
Word2Vec + mean-pool	4/6	Mixed reviews, negation
FastText	4/6	Mixed reviews, negation
BERT	6/6	— (understands context)

Surprising result: TF-IDF matches BERT on this small demo, while Word2Vec underperforms BoW.

23 Describe the three families of text representation in a single comparison. ▾

Property	Classical (BoW/TF-IDF)	Static Embed (W2V/GloVe)	Contextual (BERT)
Representation	Sparse count/weight vector	Dense fixed vector per word	Dense vector per token per context
Semantics	None	Distributional	Deep, contextual
Polysemy	One feature per word	One vector (conflated)	Different vector per context
OOV	Zero (or UNK)	Zero (W2V/GloVe) / subword (FT)	WordPiece subword
Latency	~1ms	~10–15ms	~370ms CPU
IMDb accuracy	90.1%	85.7% (W2V)	93.9%

24 What is the 30-year evolution of NLP text representations? ▾

Era	Method	Key innovation
1990s	BoW, TF-IDF	Sparse count-based vectors
2000s	LSA, LDA	Topic models, matrix factorization
2013	Word2Vec	Dense neural word embeddings, analogy arithmetic
2014	GloVe	Global co-occurrence factorization
2016	FastText	Subword embeddings, OOV robustness
2018	BERT	Contextual embeddings, bidirectional Transformer
2020+	GPT-3, T5, LLaMA	Generative, few-shot, massive scale

25 What is negative sampling in Word2Vec and why is it necessary? ▾

Problem: the full softmax over the entire vocabulary $|V|$ in the Skip-gram objective is computationally prohibitive: $O(|V|)$ per training step with $|V|=100k$ → 10 billion multiplications per epoch.

Negative sampling: instead of updating all $|V|$ words, for each (center, context) positive pair, randomly sample $k=5$–$20$ "negative" (non-context) words and update only those:

$$\mathcal{L} = \log \sigma(\mathbf{v}_{c}^T \mathbf{v}_{w}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n} [\log \sigma(-\mathbf{v}_{w_i}^T \mathbf{v}_w)]$$

Reduces training cost from $O(|V|)$ to $O(k)$ per step — making Word2Vec practical.

26 What is gensim's Word2Vec API? Write the training call and how to get a word vector. ▾

from gensim.models import Word2Vec # Training model = Word2Vec( sentences=tokenized_docs, # list of token lists vector_size=300, # embedding dimension window=5, # context window size min_count=2, # ignore words with freq < 2 sg=1, # 1=Skip-gram, 0=CBOW epochs=10 ) # Get vector for a word vec = model.wv['king'] # shape: (300,) # Find most similar model.wv.most_similar('king', topn=5) # → [('queen', 0.85), ('prince', 0.82), ...] # Analogy model.wv.most_similar(positive=['king','woman'], negative=['man'], topn=1) # → [('queen', 0.83)]

27 What is the strengths and limitations grid for Word2Vec? ▾

Strengths	Limitations
Captures semantic relationships and analogies	One vector per word — cannot handle polysemy
Dense, 300-dim vs 100k-dim BoW	No OOV handling (zero vector for new words)
Transfer: pretrained on large corpora	Mean-pooling loses word order
Cheap inference (~12ms)	Performs worse than BoW on sentiment (85.7% vs 86.2%)

28 What is the difference between feature extraction and fine-tuning when using BERT? ▾

	Feature Extraction (frozen)	Fine-Tuning (unfrozen)
BERT weights	Fixed — only extract [CLS] embedding	Updated with task-specific gradient
Compute	Cheap (no backprop through BERT)	Expensive (backprop through 110M params)
Accuracy	Lower (~90–92%)	Higher (~93.9%) — model adapts to task
Training data needed	Less (classifier only)	More (full model update)

Fine-tuning is the standard for BERT — that 93.9% is from fine-tuning. Feature extraction is used when compute is severely limited.

29 Why did Word2Vec achieve 85.7% (less than BoW's 86.2%) on IMDb despite being a "smarter" representation? ▾

Three compounding reasons:

1. Mean-pooling problem: averaging all word vectors makes the document vector a centroid that loses negation and word order. "not bad" averages to a neutral vector near both "bad" and "not".

2. Long reviews: IMDb reviews can be 500+ words. Averaging 500 vectors produces a very blurry, averaged representation that under-weights key sentiment words.

3. Sentiment vs. semantic task mismatch: Word2Vec learns semantic similarity (car≈automobile). But "terrible" and "brilliant" are semantically very different, which is exactly what we want — they should NOT be near each other. However, they might appear in similar syntactic positions ("the movie was __"), bringing their vectors closer than desired.

30 When would you use static embeddings (Word2Vec/FastText) vs contextual embeddings (BERT) in production? ▾

Scenario	Recommendation	Reason
Real-time chat bot (≤50ms)	FastText	BERT too slow on CPU
Semantic search / document similarity	Sentence-BERT or Word2Vec	Fast, good for retrieval
High-stakes classification (medical, legal)	BERT fine-tuned	Maximum accuracy, GPU available
Multilingual or morphologically rich language	FastText	Subwords handle OOV across languages
Edge / mobile deployment	FastText or DistilBERT	Low memory and compute

MODULE 07 Transformers & Large Language Models 30 Q

01 Why did Transformers replace RNNs? Name the three fundamental limitations of RNNs that Transformers solve. ▾

Exam Favorite

RNN Limitation	Transformer Solution
Sequential processing — cannot parallelize; $h_t$ depends on $h_{t-1}$	All positions processed in parallel via matrix multiplication
Vanishing gradients over long sequences — cannot learn long-range dependencies	Direct attention connections between any two positions — O(1) path length regardless of distance
Fixed-size context — information bottlenecked through hidden state	All encoder hidden states accessible at every decoder step (no bottleneck)

02 Write the scaled dot-product attention formula and explain every term. ▾

FormulaExam Favorite

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

$Q \in \mathbb{R}^{n \times d_k}$: Query matrix — "what am I looking for?"

$K \in \mathbb{R}^{m \times d_k}$: Key matrix — "what do I contain?"

$V \in \mathbb{R}^{m \times d_v}$: Value matrix — "what do I provide?"

$QK^T$: raw attention scores (dot product = similarity). $\sqrt{d_k}$: scaling to prevent softmax saturation. Softmax: converts scores to weights summing to 1. $V$: weighted sum of values.

Output: for each query, a weighted combination of values where weights reflect relevance of each key.

03 Why is scaling by √d_k necessary in the attention formula? ▾

Without scaling, as $d_k$ grows (e.g., $d_k = 64$), the dot products $QK^T$ grow in magnitude proportionally to $\sqrt{d_k}$.

Large dot products push the softmax into its saturation region — where most values are near 0 and one value is near 1 (winner-takes-all). This causes:

1. Vanishing gradients: softmax gradient is $p_i(1-p_i)$ — near 0 when $p_i \approx 0$ or $\approx 1$.

2. Loss of nuance: the model attends to only one position instead of a soft mixture.

Dividing by $\sqrt{d_k}$ keeps the pre-softmax scores in a range where gradients flow well.

04 What are Q, K, V matrices and how are they computed from the input? ▾

Q, K, V are linear projections of the input (or encoder output for cross-attention):

$Q = X W^Q$   ← what this position is "searching for"
$K = X W^K$   ← what this position "advertises"
$V = X W^V$   ← what this position "provides" if selected

$X \in \mathbb{R}^{n \times d_{model}}$: input sequence. $W^Q, W^K \in \mathbb{R}^{d_{model} \times d_k}$, $W^V \in \mathbb{R}^{d_{model} \times d_v}$: learned projection matrices.

In self-attention: Q, K, V all come from the same input X (every position can attend to every other position in the same sequence).

05 Write the Multi-Head Attention formula and explain what h attention heads gain over a single head. ▾

Formula

$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$
$\text{head}_i = \text{Attention}(QW^Q_i,\, KW^K_i,\, VW^V_i)$

Original Transformer: $h = 8$ heads, $d_{model} = 512$, $d_k = d_v = 512/8 = 64$ per head.

What multiple heads gain: each head can attend to different aspects of the sequence simultaneously:

· Head 1: syntactic subject-verb relationships

· Head 2: coreference resolution (linking pronouns to nouns)

· Head 3: semantic similarity

· Head 4–8: other linguistic phenomena

Single attention collapses all patterns into one view — multiple heads provide representational diversity.

06 What are the components of a Transformer encoder block? List all layers in order. ▾

Exam Favorite

Input embeddings + Positional Encoding ↓ [Multi-Head Self-Attention] ↓ [Add & Normalize] ← residual connection + LayerNorm ↓ [Feed-Forward Network (FFN)] ← two linear layers + ReLU ↓ [Add & Normalize] ← residual connection + LayerNorm ↓ Output (fed to next encoder block or decoder)

BERT-Base stacks 12 of these encoder blocks. Each block has ~7.2M parameters.

07 Why do Transformers need positional encodings? What information do they provide? ▾

The attention mechanism is permutation-invariant: it computes the same attention scores regardless of the order of input tokens. "cat sat mat" and "mat cat sat" produce identical attention without positional information.

Positional encodings inject position information into the input embeddings:

$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$
$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$

These sinusoidal encodings are added to word embeddings before the first encoder block. They allow the model to distinguish "The cat sat" from "The sat cat".

Modern LLMs use Rotary Positional Embeddings (RoPE) or learned positional embeddings instead.

08 What is the Add & Normalize (Layer Normalization) operation and why is it used? ▾

$$\text{LayerNorm}(\mathbf{x} + \text{SubLayer}(\mathbf{x}))$$

Add: the residual connection adds the input $\mathbf{x}$ directly to the sublayer output — same as ResNet. This ensures gradient flow and allows training deep networks (12–24 encoder blocks).

Normalize: Layer Normalization normalizes each token's vector to mean=0, std=1 across the feature dimension (unlike Batch Norm which normalizes across the batch).

Why LayerNorm over BatchNorm for Transformers: sequences have variable length, making batch statistics unstable. LayerNorm is computed per sample, per position — stable regardless of batch size or sequence length.

09 What is the Feed-Forward Network (FFN) inside each Transformer block? ▾

The FFN is applied independently to each position (token) after multi-head attention:

$$\text{FFN}(\mathbf{x}) = \max(0, \mathbf{x}W_1 + b_1) W_2 + b_2$$

Two linear layers with ReLU in between. Dimensions: $d_{model} = 512 \to d_{ff} = 2048 \to 512$ (typically $d_{ff} = 4 \times d_{model}$).

Role: attention mixes information across positions (which position to attend to). FFN applies a nonlinear transformation to each position independently — this is where much of the model's "knowledge" is stored. Recent research shows FFN layers act as key-value memories.

10 What is the difference between encoder-only, decoder-only, and encoder-decoder Transformer models? Give one example of each. ▾

Architecture	Structure	Example	Best for
Encoder-only	Stack of encoder blocks with bidirectional attention	BERT, RoBERTa	Classification, NER, QA (understanding)
Decoder-only	Stack of decoder blocks with causal (masked) attention	GPT-2/3/4, LLaMA	Text generation, completion
Encoder-Decoder	Encoder + decoder with cross-attention	T5, BART, original Transformer	Translation, summarization, seq2seq

11 What is causal masking in the decoder and why is it necessary? ▾

In a decoder generating text left-to-right, position $i$ must not attend to positions $j > i$ (future tokens) — this would be "cheating" during training (the model would see the answer).

Causal masking sets the attention score to $-\infty$ for all future positions before softmax:

$$M_{ij} = \begin{cases}-\infty & \text{if } j > i \\ 0 & \text{otherwise}\end{cases}$$

After softmax: $e^{-\infty} = 0$ → future positions get zero attention weight.

This enforces autoregressive generation: each token is predicted using only past context, making training consistent with inference.

12 What is greedy decoding, beam search, and temperature sampling in text generation? ▾

Strategy	How	Output quality	Diversity
Greedy	Always pick argmax token at each step	Locally optimal, often repetitive	None (deterministic)
Beam search (k=5)	Keep top-k sequences at each step, return best final	Higher quality than greedy	Low (structured)
Temperature ($T$)	Scale logits by $1/T$ before softmax. $T\to 0$: greedy; $T\to\infty$: uniform	Tunable	High (stochastic)

$$P(w_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

$T < 1$: more focused/conservative. $T > 1$: more random/creative. Typical: $T = 0.7$–$1.0$.

13 What is an emergent ability in LLMs? Give three examples. ▾

Emergent abilities are capabilities that appear suddenly when a model crosses a certain scale threshold — absent in smaller models, present in larger ones (not a smooth interpolation).

Emergent Ability	Approximate threshold
Multi-step arithmetic (3-digit addition)	~100B parameters
Chain-of-thought reasoning	~100B parameters
Code generation	~40B parameters
Translation without training on bilingual pairs	~100B parameters
Instruction following in zero-shot	~175B parameters (GPT-3)

These abilities are "emergent" because no one explicitly trained for them — they arise from general language modeling at scale.

14 What are the three stages of LLM training? Describe each briefly. ▾

Exam Favorite

Stage	Data	Objective
1. Pre-training	Trillions of tokens from the internet/books	Next-token prediction (autoregressive). Learns language, world knowledge, facts.
2. SFT (Supervised Fine-Tuning)	Thousands of human-written (instruction, response) pairs	Teach the model to follow instructions and respond helpfully.
3. RLHF (Reinforcement Learning from Human Feedback)	Human preferences: which of two responses is better?	Train a reward model on preferences; use PPO to optimize LLM output toward higher reward (safer, more helpful, honest).

15 Write the LoRA formula and explain what rank r and alpha α control. ▾

FormulaExam Favorite

$$W' = W + \frac{\alpha}{r} AB$$

$W \in \mathbb{R}^{d \times d}$: frozen pre-trained weight matrix.

$A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times d}$: trainable low-rank matrices.

$r$: rank — controls the number of trainable parameters. Small $r$ (4–16) → very few params (e.g., $r=8$ for $d=768$: $768\times8 + 8\times768 = 12k$ params vs $768^2 = 590k$ for full fine-tuning).

$\alpha$: scaling factor — controls the magnitude of the LoRA update. Typically set to $r$ or $2r$.

Why LoRA works: the update $\Delta W = AB$ is inherently low-rank — weight updates during fine-tuning have been empirically shown to have low intrinsic rank.

16 What is zero-shot prompting, few-shot prompting, and Chain-of-Thought (CoT)? Give an example of each. ▾

Technique	Prompt structure	When to use
Zero-shot	"Classify this review as positive or negative: 'I loved it!'"	Simple tasks, no examples available
Few-shot	"Positive: 'Great film!' Negative: 'Terrible movie!' Classify: 'It was ok'"	When task format needs demonstration
CoT	"Let's think step by step: Q: 5+3×2=? A: First multiply 3×2=6, then add 5+6=11"	Multi-step reasoning, math, logic

CoT dramatically improves performance on reasoning tasks — it only emerges in models with ~100B+ parameters.

17 What is RAG (Retrieval-Augmented Generation) and what problem does it solve? ▾

Exam Favorite

RAG combines a retrieval system (vector database) with a generative LLM to ground responses in real, up-to-date facts.

Problems it solves:

1. Hallucination: LLMs make up plausible-sounding but false facts. RAG retrieves real documents to constrain the answer.

2. Knowledge cutoff: LLMs only know data up to their training date. RAG can access current information.

3. Private/domain knowledge: LLMs don't know your company's internal documents. RAG retrieves them dynamically.

18 List the 5 steps of a RAG pipeline. ▾

1. INGEST: Split documents into chunks (e.g., 512 tokens) 2. EMBED: Encode chunks into vectors (sentence-transformer) 3. STORE: Index vectors in a vector database (Pinecone, FAISS, Chroma) 4. RETRIEVE: User query → embed query → cosine search → top-k chunks 5. GENERATE: Augmented prompt = query + retrieved chunks → LLM → answer

Step	Tools
Embed chunks	sentence-transformers, OpenAI ada-002
Vector store	FAISS (local), Pinecone (cloud), Chroma (local), Weaviate
Orchestration	LangChain, LlamaIndex

19 What is hallucination in LLMs and what causes it? ▾

Hallucination: an LLM generates confident, fluent, plausible-sounding output that is factually incorrect or fabricated.

Causes:

1. Statistical next-token prediction: the model is trained to produce likely tokens, not true ones. "Likely" ≠ "factual".

2. Knowledge gaps: if a fact wasn't in training data, the model "fills in" with the most statistically plausible completion.

3. No explicit memory: the model has no lookup mechanism — everything comes from parametric memory baked into weights.

4. Overconfidence: no uncertainty calibration — the model cannot distinguish what it knows vs doesn't know.

Mitigations: RAG, RLHF (reduce overconfident wrong answers), Constitutional AI, fine-tuning on factual data.

20 What is the key architectural difference between GPT (decoder-only) and BERT (encoder-only)? ▾

	BERT (Encoder-only)	GPT (Decoder-only)
Attention direction	Bidirectional — each token attends to all others	Causal (left-to-right) — each token attends only to past
Pre-training task	Masked Language Modeling + NSP	Next-token prediction (autoregressive)
Generation	Cannot generate (no autoregressive decoding)	Natural — generates one token at a time
Best for	Understanding tasks: classification, NER, QA	Generation: chat, completion, writing

21 What is a vector database and how does it enable semantic search in RAG? ▾

A vector database stores high-dimensional embedding vectors and enables fast approximate nearest neighbor (ANN) search:

Query: "What is the refund policy?" → embed → $\mathbf{q} \in \mathbb{R}^{768}$
Search: find top-k documents $\mathbf{d}_i$ where $\cos(\mathbf{q}, \mathbf{d}_i)$ is highest

Traditional DB limitation: SQL can only exact-match text. "Refund" and "return policy" are different strings — no match. A vector DB finds them because their embeddings are nearby.

DB	Type	Scale
FAISS	Library (Facebook AI)	Local, millions of vectors
Pinecone	Managed cloud	Billions of vectors
Chroma	Local, easy setup	Prototyping

22 What is RLHF and how does it make LLMs safer and more helpful? ▾

RLHF (Reinforcement Learning from Human Feedback) — Stage 3 of LLM training:

Step 1 — Reward model training: human annotators rank multiple model responses (A is better than B). Train a separate reward model $R$ to predict human preference scores.

Step 2 — PPO optimization: use Proximal Policy Optimization (PPO) to update the LLM to maximize the reward model's score while not diverging too far from the SFT model:

$$\max_\theta \mathbb{E}[R(\text{response})] - \beta \text{KL}[\pi_\theta || \pi_{SFT}]$$

Result: the model learns to produce responses humans prefer — more helpful, harmless, and honest (Anthropic's "HHH" criteria).

23 What is the T5 model's architecture and training paradigm? ▾

T5 (Text-to-Text Transfer Transformer, Google 2020) uses a full encoder-decoder architecture and frames every NLP task as text-to-text:

Task	Input	Output
Translation	"translate English to French: The cat"	"Le chat"
Summarization	"summarize: [long document...]"	"[summary]"
Classification	"sst2 sentence: I love it"	"positive"
QA	"question: Who? context: ..."	"[answer]"

Pre-trained on C4 (Colossal Clean Crawled Corpus) with a span-corruption objective (mask random spans). T5-11B has 11 billion parameters.

24 What is the difference between full fine-tuning and LoRA fine-tuning in terms of parameter count? ▾

Full fine-tuning: update all parameters of the model.

For LLaMA-7B: 7 billion parameters updated, requires ~28GB GPU VRAM (fp32), or ~14GB (bf16).

LoRA fine-tuning: freeze all original weights. Add low-rank matrices $A, B$ (rank $r=8$–16) to attention layers only.

Example: LLaMA-7B with LoRA (r=16): ~4 million trainable parameters (0.06% of 7B). Requires ~8GB VRAM — trainable on a single consumer GPU.

LoRA params per attention weight: $d \times r + r \times d = 2dr$
For $d=4096$, $r=16$: $2 \times 4096 \times 16 = 131k$ params per layer

Yet LoRA achieves comparable accuracy to full fine-tuning on most tasks.

25 What is SFT (Supervised Fine-Tuning) and how does it differ from pre-training? ▾

	Pre-training	SFT
Data	Trillions of tokens (raw web text)	Thousands of (instruction, response) pairs
Objective	Next-token prediction on all text	Next-token prediction on response only
Goal	Learn language, facts, reasoning	Teach to follow instructions helpfully
Duration	Weeks (thousands of GPUs)	Hours–days (dozens of GPUs)

Without SFT, a pre-trained model just continues the prompt (completion). SFT teaches it to actually answer questions, follow instructions, and refuse harmful requests.

26 What is HuggingFace and what three key components does it provide? ▾

HuggingFace is the de facto platform for working with pretrained Transformer models:

1. Model Hub (huggingface.co/models): 200k+ pretrained models (BERT, GPT-2, T5, LLaMA, etc.) — downloadable with one line of code.

2. 🤗 Transformers library: unified Python API for loading, fine-tuning, and running any model: from transformers import AutoModel, AutoTokenizer, pipeline.

3. Datasets library: 10k+ NLP datasets (IMDb, GLUE, SQuAD) with standardized loading and preprocessing.

Also provides: PEFT (LoRA, adapters), Accelerate (multi-GPU), Inference API (hosted inference).

27 What is Ollama and why is it useful for local LLM deployment? ▾

Ollama is a tool for running large language models locally on a laptop/workstation without cloud APIs.

ollama run llama3.1 # downloads and runs LLaMA 3.1 locally ollama run mistral # or Mistral, Gemma, Phi, etc.

Why useful:

1. Privacy: data never leaves your machine — critical for confidential documents.

2. No cost: no API fees for unlimited inference.

3. Offline: works without internet after model download.

4. Experimentation: test different models side by side quickly.

Uses 4-bit quantization to run 7B–13B parameter models on 8GB–16GB RAM laptops.

28 What is LangChain and what problem does it solve for LLM application development? ▾

LangChain is a framework for building LLM-powered applications by composing reusable components:

Component	Purpose
Chains	Sequential LLM calls: summarize → translate → reformat
Agents	LLM + tools (search, code exec, calculator) — LLM decides which tool to use
Memory	Maintain conversation history across calls
Retrievers	Connect vector databases for RAG pipelines
Prompt Templates	Reusable structured prompts with variables

Problem it solves: without LangChain, building multi-step LLM pipelines requires significant boilerplate. LangChain provides abstractions that work with OpenAI, Anthropic, Ollama, HuggingFace simultaneously.

29 What are the 5 questions to answer when selecting a model for a new NLP task? ▾

Exam Favorite

#	Question	Guides toward
1	How much labeled data do I have?	<1k → classical; 1k–100k → fine-tune BERT; 100k+ → train from scratch or LoRA
2	What is my latency requirement?	<10ms → TF-IDF/FastText; <50ms → distilled BERT; flexible → BERT
3	Is understanding or generation the task?	Understanding → BERT; Generation → GPT/T5
4	Is interpretability required?	Yes → TF-IDF + linear; No → any neural model
5	What GPU budget is available?	None → FastText/TF-IDF; Single GPU → LoRA; Multi-GPU → full fine-tuning

30 What is the "map that does not expire" — what fundamental skills remain relevant regardless of which LLM is popular? ▾

Key Concept

Specific model names (GPT-4, LLaMA-3, Claude) will be superseded every 6–12 months. The following fundamentals do not expire:

Skill	Why it lasts
Understanding attention & Transformers mathematically	All future models will be variants of this architecture
Knowing when to use which representation	Trade-offs (accuracy vs latency vs cost) are perennial
Evaluating models rigorously (benchmarks, metrics)	How you measure doesn't change
Data preprocessing & cleaning	"Garbage in, garbage out" is eternal
Prompt engineering principles	LLMs will always need clear, structured instructions

Exam Q&A — All Modules