01 What is the Universal Approximation Theorem and what does it guarantee about neural networks? ▾
Key Theorem
A feedforward neural network with at least one hidden layer and a sufficient number of neurons can approximate any continuous function to an arbitrary degree of accuracy.
What it does NOT guarantee: it says nothing about how many neurons are needed, how to find the weights, or whether training will converge to that solution. It is an existence result, not a constructive one.
Practical implication: neural networks are expressive enough — the challenge is optimization and generalization, not representational capacity.
02 What is the vanishing gradient problem and what causes it? ▾
Exam Favorite
During backpropagation, gradients are multiplied by the derivative of the activation function at each layer. The sigmoid derivative is at most 0.25, and tanh's is at most 1 (and well below 1 away from zero). Over many layers, repeated multiplication of these sub-unit factors drives the gradient toward zero.
Each $\sigma'$ term is a small number. With 10 layers of sigmoid: $0.25^{10} \approx 10^{-6}$ — the gradient effectively vanishes. Early layers stop learning.
Fixes: ReLU activation, residual connections (ResNet), batch normalization, LSTM gates.
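A minimal numerical sketch of the effect, assuming random pre-activations (the exact magnitude depends on the weights):

```python
import numpy as np

# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), with maximum 0.25 at z = 0.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
grad = 1.0
for layer in range(10):
    z = rng.normal()                   # pre-activation at this layer (assumed)
    d = sigmoid(z) * (1 - sigmoid(z))  # local derivative, <= 0.25
    grad *= d
print(f"gradient factor after 10 sigmoid layers: {grad:.2e}")  # on the order of 1e-7
```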
03 Why is ReLU preferred over sigmoid/tanh in hidden layers? ▾
Exam Favorite
| Property | Sigmoid/Tanh | ReLU |
|---|---|---|
| Derivative in saturation | → 0 (vanishing gradient) | 1 (no saturation for x>0) |
| Computation | Expensive (exp) | max(0,x) — trivial |
| Sparsity | All neurons active | ~50% neurons zero — efficient |
| Centering | Tanh centered; Sigmoid not | Not zero-centered |
ReLU's gradient is exactly 1 for positive inputs, so gradients flow without shrinking across many layers.
04 What is the dying ReLU problem and how does Leaky ReLU fix it? ▾
Dying ReLU: if a neuron's weighted input is always negative (e.g., due to a large negative bias), ReLU always outputs 0. The gradient is also 0 — the neuron never updates and is permanently "dead."
$\text{Leaky ReLU}(x) = \max(\alpha x, x)$, typically $\alpha = 0.01$
Leaky ReLU allows a small gradient ($\alpha$) for negative inputs, keeping neurons alive.
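A quick numpy sketch of both activations (toy inputs assumed):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # max(alpha*x, x): passes x for x > 0, alpha*x for x <= 0
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.    0.    0.    1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.    1.5]  <- small negative slope keeps the neuron alive
```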
05 Write the gradient descent weight update formula and explain each term. ▾
Formula
$w \leftarrow w - \eta \frac{\partial L}{\partial w}$
$w$: weight being updated · $\eta$: learning rate (step size) · $\frac{\partial L}{\partial w}$: gradient of loss with respect to that weight (direction of steepest ascent — we subtract to descend).
Each update moves $w$ slightly in the direction that reduces the loss.
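A minimal sketch of the update rule on a toy loss $L(w) = (w-3)^2$, whose minimum is $w = 3$:

```python
# dL/dw = 2 * (w - 3)
w, eta = 0.0, 0.1
for step in range(50):
    grad = 2 * (w - 3)
    w = w - eta * grad   # the update rule above
print(w)                 # ~2.99996, converged near the minimum
```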
06 What is the difference between L1 and L2 regularization? ▾
$L1: \quad \mathcal{L}_{total} = \mathcal{L} + \lambda \sum |w_i|$ → sparse weights (some exactly 0)
$L2: \quad \mathcal{L}_{total} = \mathcal{L} + \lambda \sum w_i^2$ → small weights (none exactly 0)
| | L1 (Lasso) | L2 (Ridge / Weight Decay) |
|---|---|---|
| Effect | Produces sparse models | Shrinks all weights uniformly |
| Feature selection | Yes — zeroes out features | No — keeps all features small |
| Typical use | When you suspect many features are irrelevant | Default regularization in deep learning |
07 What is dropout and during which phases (train vs inference) is it active? ▾
Dropout randomly sets a fraction $p$ of neuron outputs to zero during each forward pass of training. This forces the network not to rely on any single neuron, pushing it to learn redundant representations.
Training: active — neurons randomly zeroed with probability $p$.
Inference: dropout is OFF. All neurons are used, but their outputs are scaled by $(1-p)$ to maintain the same expected activation magnitude (or equivalently, inverted dropout scales during training).
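A sketch of the inverted-dropout variant (scaling at train time so inference needs no change):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p, training):
    """Inverted dropout: scale survivors by 1/(1-p) during training."""
    if not training:
        return a                       # inference: all neurons active, no scaling
    mask = rng.random(a.shape) >= p    # keep each unit with probability 1 - p
    return a * mask / (1.0 - p)        # rescale to preserve expected magnitude

a = np.ones(8)
print(dropout(a, p=0.5, training=True))   # ~half zeros, survivors scaled to 2.0
print(dropout(a, p=0.5, training=False))  # unchanged
```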
08 What is Batch Normalization and what problem does it solve? ▾
Batch Normalization normalizes the activations of each layer across the mini-batch to have mean 0 and standard deviation 1, then applies learnable scale ($\gamma$) and shift ($\beta$) parameters.
Problem it solves — Internal Covariate Shift: as weights update, the distribution of inputs to each layer keeps changing, making training unstable. BN stabilizes these distributions.
Benefits: faster training, allows higher learning rates, reduces sensitivity to initialization, acts as mild regularizer.
09 What is the difference between Batch GD, Stochastic GD, and Mini-batch GD? ▾
| Variant | Samples per update | Gradient quality | Speed |
|---|---|---|---|
| Batch GD | All N samples | Exact — smooth convergence | Very slow per epoch |
| Stochastic GD (SGD) | 1 sample | Noisy — can escape local minima | Fast updates, unstable |
| Mini-batch GD | 32–256 samples | Balanced — stable + efficient | Best in practice |
Mini-batch GD is the standard. Typical batch sizes: 32, 64, 128. GPU memory limits maximum batch size.
10 What is Xavier (Glorot) initialization and when should it be used vs He initialization? ▾
Formula
Xavier: $\quad W \sim \mathcal{N}\!\left(0,\ \sqrt{\frac{2}{n_{in}+n_{out}}}\right)$
He: $\quad W \sim \mathcal{N}\!\left(0,\ \sqrt{\frac{2}{n_{in}}}\right)$
| Init | Designed for | When to use |
|---|---|---|
| Xavier | Sigmoid / Tanh | Symmetric activations |
| He | ReLU / Leaky ReLU | Any ReLU-family activation |
Wrong initialization → vanishing or exploding activations from the first forward pass.
11 Write the gradient checking approximation formula and what tolerance is expected. ▾
Formula · Exam Favorite
$\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + \epsilon) - L(\theta - \epsilon)}{2\epsilon}$, with a small $\epsilon$ (e.g., $10^{-7}$)
This is the centered finite difference approximation. It numerically estimates the gradient and is compared against the analytical gradient from backprop.
Expected relative error: $< 10^{-7}$ → backprop is correct. If error $> 10^{-5}$, there is a bug in the backpropagation implementation.
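A minimal sketch of a gradient check on a toy function $L(w) = w^2$ at $w = 3$ (analytical gradient $2w = 6$):

```python
def loss(w):
    return w ** 2

w, eps = 3.0, 1e-7
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)  # centered difference
analytic = 2 * w
rel_error = abs(numeric - analytic) / max(abs(numeric), abs(analytic))
print(rel_error)  # ~1e-10 < 1e-7  -> backprop (here: the analytic formula) is correct
```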
12 What does a training curve where train_loss ↓ but val_loss ↑ indicate, and what should you do? ▾
This is the signature of overfitting: the model is memorizing training data rather than learning generalizable patterns.
Remedies:
1. Add Dropout (typical p=0.3–0.5) · 2. Add L2 regularization (weight decay) · 3. Reduce model capacity (fewer layers/neurons) · 4. Data augmentation · 5. EarlyStopping (stop at the point val_loss starts rising) · 6. Collect more data.
13 Why use softmax for multi-class output and sigmoid for binary? What is the mathematical relationship? ▾
Softmax: $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ — probabilities sum to 1 across all classes.
Binary (2 classes): sigmoid on one output neuron. Output is $P(\text{class}=1)$.
Multi-class (K classes): softmax on K output neurons. Each output is $P(\text{class}=k)$, and they sum to 1.
Softmax with 2 classes is mathematically equivalent to sigmoid. Softmax amplifies differences between logits — making the highest score more dominant.
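A quick numerical check of the equivalence (2-way softmax equals sigmoid of the logit difference):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z1, z2 = 2.0, 0.5
p_softmax = softmax(np.array([z1, z2]))[0]   # P(class 1) via 2-way softmax
p_sigmoid = sigmoid(z1 - z2)                 # sigmoid of the logit difference
print(p_softmax, p_sigmoid)                  # identical: 0.8175..., 0.8175...
```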
14 What role does the chain rule play in backpropagation? ▾
Backpropagation computes $\frac{\partial L}{\partial w}$ for every weight $w$ in the network. Since the loss is a composition of many functions (layers), the chain rule allows decomposing this into a product of local gradients.
$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$
where $z = wx + b$ (linear), $a = \sigma(z)$ (activation), $L$ (loss). Each term is computed locally — the network only needs to propagate the accumulated gradient backward layer by layer.
15 What is the Adam optimizer and what two techniques does it combine? ▾
Adam (Adaptive Moment Estimation) combines Momentum and RMSprop:
$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ ← 1st moment (Momentum)
$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ ← 2nd moment (RMSprop)
$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$ ← bias correction
$w \leftarrow w - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$
Default hyperparameters: $\beta_1=0.9$, $\beta_2=0.999$, $\eta=10^{-3}$, $\epsilon=10^{-8}$.
Benefit: adapts learning rate per parameter — parameters with rare gradients get larger updates.
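A minimal Adam sketch on the toy loss $L(w) = (w-3)^2$ (an assumed objective, just to show the update mechanics):

```python
import numpy as np

w, eta = 0.0, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = v = 0.0
for t in range(1, 201):
    g = 2 * (w - 3)                      # gradient of the toy loss
    m = beta1 * m + (1 - beta1) * g      # 1st moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2   # 2nd moment (RMSprop)
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)
print(round(w, 3))  # ~3.0
```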
16 What is cross-entropy loss and why is it preferred over MSE for classification? ▾
For binary: $\mathcal{L} = -[y \log \hat{y} + (1-y)\log(1-\hat{y})]$
Why not MSE for classification? With MSE and sigmoid output, gradients saturate when predictions are confidently wrong (the sigmoid is flat). Cross-entropy has a gradient of $(\hat{y} - y)$ — large when wrong, small when correct — producing strong learning signal exactly when needed.
17 What is overfitting, underfitting, and how do you detect each from training curves? ▾
| Condition | Train Loss | Val Loss | Diagnosis |
|---|---|---|---|
| Good fit | Low | Low ≈ Train | Model generalizes |
| Overfitting | Very low | High & diverging | Memorizing training data |
| Underfitting | High | High ≈ Train | Model too simple / undertrained |
Overfitting fix: regularization, dropout, more data, simpler model, EarlyStopping.
Underfitting fix: more layers/neurons, train longer, reduce regularization, better features.
18 Why can't all weights be initialized to zero? What is the problem? ▾
If all weights = 0, every neuron in a layer computes the same output (all zeros × inputs = 0). All neurons receive the same gradient and update identically — they remain identical forever. This is called the symmetry problem.
Result: a layer of N neurons behaves like a single neuron — the entire capacity of the layer is wasted.
19 What is the purpose of the bias term $b$ in a neuron? ▾
The bias allows the activation function to be shifted horizontally. Without it, the neuron computes $\sigma(\mathbf{w}^T\mathbf{x})$ which must pass through the origin.
With bias, the neuron can fire (activate) even when all inputs are zero. It provides the model a degree of freedom independent of the input — allowing the hyperplane decision boundary to be positioned anywhere, not just through the origin.
20 What is EarlyStopping and what are the key parameters: patience and restore_best_weights? ▾
EarlyStopping monitors a metric (typically val_loss) and stops training when no improvement is seen for a number of epochs.
patience=5: wait 5 epochs with no improvement before stopping.
restore_best_weights=True: after stopping, restore weights from the epoch with the best val_loss (not the last epoch, which may have started overfitting).
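A sketch of the corresponding Keras callback (the model and data are assumed to exist):

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",          # metric to watch
    patience=5,                  # epochs with no improvement before stopping
    restore_best_weights=True,   # roll back to the best epoch's weights
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[early_stop])
```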
21 What is the learning rate and what happens if it is too large or too small? ▾
The learning rate $\eta$ controls the step size of each weight update.
| Learning Rate | Effect |
|---|---|
| Too large | Overshoots minimum — loss oscillates or diverges. Training is unstable. |
| Too small | Tiny updates — training is extremely slow. May get stuck in local minima. |
| Optimal | Converges smoothly to a (local) minimum in reasonable time. |
Adam's default $\eta = 10^{-3}$ is a good starting point. Use ReduceLROnPlateau to decay automatically.
22 What is gradient clipping and when is it used? ▾
Gradient clipping caps the gradient norm to a maximum value before the weight update step, preventing exploding gradients.
Typical threshold: 1.0–5.0 (from the RNN lab: GRAD_CLIP = 5.0).
Particularly important for RNNs/LSTMs processing long sequences, where gradients can explode exponentially due to repeated matrix multiplications.
23 What is the difference between the validation set and the test set? ▾
| Set | Used for | Seen during training? |
|---|---|---|
| Training set | Computing gradients, updating weights | Yes — directly |
| Validation set | Hyperparameter tuning, model selection, EarlyStopping | Indirectly (no weight updates) |
| Test set | Final performance estimate on unseen data | Never — touched once at the end |
24 What activation function is used in the output layer for regression vs binary vs multi-class classification? ▾
| Task | Output Activation | Loss Function |
|---|---|---|
| Regression | None (linear) | MSE / MAE |
| Binary classification | Sigmoid | Binary cross-entropy |
| Multi-class (exclusive) | Softmax | Categorical cross-entropy |
| Multi-label (independent) | Sigmoid per output | Binary cross-entropy per label |
25 What is weight decay and how does it relate to L2 regularization? ▾
Weight decay and L2 regularization are mathematically equivalent for standard SGD.
With L2 regularization, the gradient update becomes:
$w \leftarrow (1-\eta\lambda)\, w - \eta \frac{\partial \mathcal{L}}{\partial w}$ (absorbing the constant factor 2 into $\lambda$)
The factor $(1-\eta\lambda)$ decays the weight at every step — hence "weight decay." In Keras: kernel_regularizer=l2(0.01) or optimizer=Adam(weight_decay=1e-4).
26 What is a confusion matrix and what are the four values TP, TN, FP, FN? ▾
| Predicted → | Positive | Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) — missed |
| Actual Negative | FP (False Positive) — false alarm | TN (True Negative) |
Accuracy = (TP+TN)/(TP+TN+FP+FN) · Precision = TP/(TP+FP) · Recall = TP/(TP+FN)
F1 = 2·P·R/(P+R) — harmonic mean, balances precision and recall.
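A sketch computing all four metrics from confusion-matrix counts (toy values assumed):

```python
TP, TN, FP, FN = 80, 90, 10, 20

accuracy  = (TP + TN) / (TP + TN + FP + FN)         # 0.85
precision = TP / (TP + FP)                          # 0.888...
recall    = TP / (TP + FN)                          # 0.80
f1 = 2 * precision * recall / (precision + recall)  # 0.842...
print(accuracy, precision, recall, round(f1, 3))
```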
27 What is the ReduceLROnPlateau callback and what does it do when triggered? ▾
When val_loss stops improving for patience=3 epochs, it multiplies the learning rate by factor=0.5 (halves it). This allows the optimizer to take smaller steps to escape a plateau and fine-tune around a local minimum.
min_lr=1e-6 sets a floor — the LR will never go below this, preventing arbitrarily slow training.
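A sketch of the Keras callback with the settings described above:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",  # metric to watch
    factor=0.5,          # halve the learning rate when triggered
    patience=3,          # epochs without improvement before reducing
    min_lr=1e-6,         # floor on the learning rate
)
# model.fit(..., callbacks=[reduce_lr])
```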
28 What is the relationship between batch size and training stability/generalization? ▾
| Batch size | Gradient noise | Effect |
|---|---|---|
| Small (8–32) | High — noisy updates | Better generalization (noise acts as regularizer), slower convergence, GPU underutilized |
| Large (256–2048) | Low — smooth gradient | Faster GPU utilization, sharper minima (worse generalization), may need LR scaling |
Rule of thumb: scale LR linearly with batch size (linear scaling rule). Typical choice: 32–128.
29 What is the exploding gradient problem and how is it different from the vanishing gradient? ▾
| | Vanishing Gradient | Exploding Gradient |
|---|---|---|
| What happens | Gradients → 0 (too small) | Gradients → ∞ (too large) |
| Effect | Early layers stop learning | Weight updates are catastrophically large (NaN) |
| Cause | Repeated multiplication of small numbers (<1) | Repeated multiplication of large numbers (>1) |
| Fix | ReLU, ResNet, LSTM, BN | Gradient clipping, careful initialization |
30 What is momentum in optimization and what problem does it solve over plain gradient descent? ▾
Momentum accumulates a moving average of past gradients to smooth updates:
$v_t = \beta v_{t-1} + (1-\beta)\, \nabla_w \mathcal{L}$
$w \leftarrow w - \eta v_t$
Typical $\beta = 0.9$. Problem it solves:
1. Oscillations in narrow ravines — momentum smooths the zig-zag path and accelerates along the consistent gradient direction.
2. Local minima — accumulated velocity can "roll through" shallow local minima.
3. Slow progress in flat regions — momentum keeps moving in the last useful direction.
01 Why can't a plain Fully-Connected (Dense) network efficiently process images? ▾
A 224×224 RGB image has $224 \times 224 \times 3 = 150{,}528$ inputs. With just one hidden layer of 1,000 neurons: $150{,}528 \times 1{,}000 = 150$ million parameters — untrainable and prone to overfitting.
Additionally, FCNs ignore spatial structure: a pixel at (10,10) and (11,10) are treated as completely unrelated. CNNs exploit spatial locality through local connectivity and weight sharing.
02 What does a convolutional filter (kernel) compute, conceptually? ▾
A filter slides over the input image computing an element-wise dot product between the filter weights and the local patch of the input at each position:
$(I * K)(i, j) = \sum_{m}\sum_{n} I(i+m,\, j+n)\, K(m, n)$
Each filter learns to detect a specific pattern (edge, curve, texture). Early filters detect edges; deeper filters detect complex patterns. The result is a feature map showing where that pattern appears in the image.
03 Write the formula for the output size of a convolutional layer. ▾
Formula · Exam Favorite
$W_{out} = \left\lfloor \frac{W_{in} - K + 2P}{S} \right\rfloor + 1$
$W_{in}$: input size · $K$: kernel size · $P$: padding · $S$: stride.
Example: Input 5×5, kernel 3×3, P=1 (same), S=1: $\lfloor(5-3+2)/1\rfloor+1 = 5$ → output 5×5 (same padding preserves size).
Example: Input 224×224, K=3, P=0, S=2: $\lfloor(224-3)/2\rfloor+1 = 111$.
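A sketch of the formula as a helper function, checked against the two examples above:

```python
def conv_out(w_in, k, p, s):
    """Output spatial size of a conv layer: floor((W - K + 2P) / S) + 1."""
    return (w_in - k + 2 * p) // s + 1

print(conv_out(5, 3, 1, 1))    # 5   (same padding preserves size)
print(conv_out(224, 3, 0, 2))  # 111
```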
04 What is "same" padding vs "valid" padding? ▾
| Padding | Zero-padding added | Output size | Use |
|---|---|---|---|
| same | $P = \lfloor K/2 \rfloor$ | $W_{out} = W_{in}$ (stride=1) | Preserve spatial dims through conv layers |
| valid | $P = 0$ | $W_{out} = W_{in} - K + 1$ | Shrink spatial dims intentionally |
"Same" padding adds zeros around the border so the filter reaches every position including the edges.
05 How many trainable parameters does Conv2D(32 filters, 3×3 kernel) have on an RGB input (3 channels)? ▾
Formula
$(3 \times 3 \times 3 + 1) \times 32 = 28 \times 32 = \mathbf{896}$ parameters.
The "+1" is the bias per filter. Without bias: $3\times3\times3\times32 = 864$.
Weight sharing insight: the same 896 parameters are reused at every spatial position — this is why CNNs are so parameter-efficient vs FCNs.
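A sketch verifying the count with Keras (input size 32×32 is an arbitrary assumption; only the channel count matters for the parameter total):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),   # RGB input, 3 channels
    layers.Conv2D(32, (3, 3)),         # 32 filters of 3x3
])
model.summary()  # Conv2D params: (3*3*3 + 1) * 32 = 896
```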
06 What is "weight sharing" in CNNs and why is it important? ▾
In a convolutional layer, the same filter weights are reused at every spatial position of the input. This is weight sharing.
Why it matters:
1. Parameter efficiency: 896 params handle an entire 224×224 image instead of millions in an FCN.
2. Translation invariance: if a filter detects a horizontal edge, it detects it anywhere in the image — the same weights fire wherever the edge appears.
3. Generalization: the model is biased toward learning spatially reusable patterns, which matches the structure of natural images.
07 What is max pooling and what are its three benefits? ▾
Max pooling takes the maximum value in each pooling window (typically 2×2 with stride 2), halving the spatial dimensions.
Three benefits:
1. Dimensionality reduction: 2×2 pool with stride 2 → spatial size halved, reducing compute and memory.
2. Spatial invariance: small translations of the input produce the same max — the model is robust to minor shifts.
3. Overfitting control: fewer parameters in subsequent layers.
08 What is Global Average Pooling (GAP) and why is it better than Flatten + Dense? ▾
Global Average Pooling reduces each feature map to a single number (its spatial average): an input of shape $(H, W, C)$ → output of shape $(C,)$.
| | Flatten + Dense | Global Average Pooling |
|---|---|---|
| Parameters | H×W×C×Dense_units (massive) | 0 (then small Dense layer) |
| Overfitting risk | High — ~90% of VGG params are here | Low — drastically fewer params |
| Spatial info | Flattened to 1D | Summarized per channel |
Modern architectures (ResNet, MobileNet, EfficientNet) all use GAP before the classifier head.
09 Why are two stacked 3×3 conv layers preferred over one 5×5 layer? Give the parameter count. ▾
Exam Favorite
Two stacked 3×3 layers have the same effective receptive field as one 5×5 layer — but fewer parameters:
Two 3×3: $2 \times (3\times3\times C) = 18C$ params · One 5×5: $5\times5\times C = 25C$ params
Saving: $1 - 18/25 = 28\%$
Three stacked 3×3 = effective 7×7: $27C$ vs $49C$ → 45% saving.
Additionally, two conv layers means two non-linearity applications — more representational power.
10 What is the filter doubling rule and why does it keep compute constant? ▾
After each max pooling (spatial size halved), the number of filters is doubled: $32 \to 64 \to 128 \to 256 \to 512$.
Why compute stays roughly constant: the cost of a conv layer scales with $H \times W \times C_{in} \times C_{out}$. When pooling halves $H$ and $W$ (÷4) and the filter count doubles both $C_{in}$ and $C_{out}$ (×4), the two effects cancel, so FLOPs per layer stay constant, while the activation volume $\frac{H}{2}\times\frac{W}{2}\times 2C = \frac{H \times W \times C}{2}$ actually halves.
This lets deeper layers capture more semantic features without exponentially increasing compute.
11 What is the power-of-2 compression cascade for a 224×224 input? How many conv blocks does it imply? ▾
Exam Favorite
| Block | Spatial size after pool | ÷ Factor |
|---|---|---|
| Input | 224 × 224 | — |
| Block 1 | 112 × 112 | ÷2 |
| Block 2 | 56 × 56 | ÷2 |
| Block 3 | 28 × 28 | ÷2 |
| Block 4 | 14 × 14 | ÷2 |
| Block 5 ← stop | 7 × 7 | ÷2 |
Stop at 7×7 — going further destroys spatial structure needed for classification. This gives 5 conv blocks, which is exactly the depth of VGGNet.
12 What are the key innovations introduced by AlexNet (2012) that enabled modern deep learning? ▾
AlexNet won ImageNet 2012 with 15.3% top-5 error vs 26.2% runner-up. Key innovations:
1. ReLU activations — replacing tanh, 6× faster training.
2. Dropout (p=0.5) — first large-scale use as regularization.
3. GPU training — split across two GTX 580 GPUs, making deep networks practical.
4. Data augmentation — random crops, horizontal flips, color jitter.
5. Local Response Normalization — early normalization (later replaced by BN).
13 What is the residual (skip) connection in ResNet and what problem does it solve? ▾
Formula
$\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$
The identity shortcut $+\mathbf{x}$ adds the input directly to the output of the conv layers.
Problems it solves:
1. Vanishing gradient: the gradient can flow directly through the identity shortcut without passing through activations, reaching early layers effectively.
2. Degradation problem: without skip connections, adding more layers paradoxically made accuracy worse. ResNet enables training 100+ layer networks.
ResNet-50 (25M params) outperforms VGG-16 (138M params).
14 What is the difference between Feature Extraction and Fine-Tuning in transfer learning? ▾
| | Feature Extraction | Fine-Tuning |
|---|---|---|
| Pre-trained layers | Frozen — weights unchanged | Unfrozen — weights updated |
| What trains | Only new classifier head (Dense layers) | Entire model or last N layers |
| Learning rate | Normal LR for head | Very small LR ($10^{-5}$) — don't destroy pretrained weights |
| Data needed | Small dataset OK | Requires more data |
| When to use | Target domain ≈ ImageNet | Target domain differs significantly |
15 What is MobileNet's key innovation and why is it suited for mobile devices? ▾
MobileNet uses depthwise separable convolutions: split a standard conv into two cheaper operations:
1. Depthwise conv: apply one filter per channel independently (spatial filtering).
2. Pointwise conv (1×1): combine channels (cross-channel mixing).
Standard conv: $K\times K\times C_{in}\times C_{out}$ · Depthwise sep: $K\times K\times C_{in} + C_{in}\times C_{out}$ ← ~8–9× fewer operations for 3×3 kernels
This makes it suitable for real-time inference on CPUs and mobile chips with limited compute and battery.
16 In a VGG-like CNN, what percentage of parameters sit in Dense layers vs Conv layers? ▾
Exam Favorite
In VGG-16: Conv layers ≈ 14.7M params (~10%), Dense layers ≈ 124M params (~90%).
This is why modern architectures replace Flatten → Dense with Global Average Pooling — it eliminates the Dense layers' parameter explosion while matching or improving accuracy.
17 What are the three essential training callbacks? Give the purpose and key config for each. ▾
Exam Favorite
| Callback | Purpose |
|---|---|
| EarlyStopping | Stop training when val_loss stops improving — prevents overfitting |
| ModelCheckpoint | Save best model to disk — you always have the best checkpoint |
| ReduceLROnPlateau | Halve LR when progress stalls — escapes training plateaus |
18 When is Recall more important than Precision? Give a concrete medical AI example. ▾
Recall = TP/(TP+FN) — measures the fraction of actual positives that are correctly identified.
Recall is more important when False Negatives are more costly than False Positives.
Example — Tumor detection: A False Negative = telling a patient with cancer that they are cancer-free. This leads to delayed treatment and potentially death. A False Positive = ordering an unnecessary biopsy on a healthy patient (costly but recoverable).
In this case, we want high Recall even at the cost of lower Precision.
19 What is data augmentation and name four transforms used in image classification? ▾
Data augmentation generates new training samples by applying label-preserving transformations to existing data. It reduces overfitting without collecting more data.
| Category | Transforms |
|---|---|
| Geometric | Random flip (horizontal), rotation (±15°), zoom, random crop, translation |
| Photometric | Brightness, contrast, saturation adjustment, Gaussian noise, blur |
20 What is Batch Normalization in CNNs — where is it applied and what does it do to feature maps? ▾
In CNNs, Batch Normalization is applied after conv layers, before (or after) ReLU. It normalizes each channel's activations across the mini-batch to mean 0, variance 1, then applies learnable $\gamma, \beta$.
Effect on feature maps: prevents any channel from dominating; stabilizes activation distributions across layers so deeper layers train on a consistent signal.
Benefits: faster convergence, allows larger learning rates, reduces sensitivity to weight initialization, mild regularization effect (can sometimes reduce Dropout need).
21 What is stride and how does stride=2 differ from stride=1 with max pooling for downsampling? ▾
Stride is the step size the filter moves between positions. Stride=1: dense coverage. Stride=2: skip every other position → output half the spatial size.
| | Conv with stride=2 | Conv(stride=1) + MaxPool(2×2) |
|---|---|---|
| Output size | ⌊(W-K)/2⌋+1 | ⌊(W-K+2P)/1⌋+1 then ÷2 |
| Has learnable params? | Yes (in conv) | MaxPool has none |
| Information retained | Learns what to keep | Always takes max value |
Modern architectures (ResNet, EfficientNet) prefer strided conv over pooling.
22 What is the Inception module (GoogLeNet) and what problem does it solve? ▾
The Inception module applies multiple filter sizes in parallel (1×1, 3×3, 5×5) plus max pooling, then concatenates all outputs along the channel dimension.
Problem it solves: choosing the right filter size for each layer is non-trivial. By applying all sizes in parallel and letting the network learn which features are most useful, the network automatically selects the right scale.
1×1 convolutions before larger kernels perform dimensionality reduction (bottleneck), keeping compute manageable.
23 What does `include_top=False` mean when loading a pretrained model like VGG16? ▾
include_top=False loads the model without the final classifier layers (the Dense layers trained for 1000 ImageNet classes). You get only the convolutional backbone.
This lets you add your own classification head for your specific number of classes. The convolutional features learned on ImageNet are reused; only your new head is trained.
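A sketch of feature extraction with a frozen VGG16 backbone in Keras (the 10-class head is an assumption for illustration):

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pretrained backbone

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),         # replaces Flatten + huge Dense
    layers.Dense(10, activation="softmax"),  # new head: 10 classes assumed
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```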
24 What is EfficientNet's compound scaling strategy? ▾
Previous architectures scaled one dimension at a time: more layers (deeper), more channels (wider), or larger input (higher resolution). EfficientNet scales all three simultaneously using a compound coefficient $\phi$:
depth $d = \alpha^\phi$, width $w = \beta^\phi$, resolution $r = \gamma^\phi$
subject to: $\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2$ (constant FLOP budget)
Result: EfficientNet-B7 achieves state-of-the-art accuracy with 8.4× fewer parameters than GPipe at the same accuracy level.
25 What is the F1 Score, when do you use it, and why is it better than accuracy for imbalanced datasets? ▾
$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
F1 is the harmonic mean of Precision and Recall. It's 1.0 only when both are perfect.
Why use over accuracy for imbalanced data: if 99% of data is class A, a dumb classifier that always predicts A gets 99% accuracy but F1≈0 (because Recall for class B = 0). F1 forces the model to actually detect the minority class.
26 What is a 1×1 convolution and what is it used for? ▾
A 1×1 convolution applies a linear combination across channels at each spatial position — with no spatial filtering. For input $(H, W, C_{in})$: output is $(H, W, C_{out})$, same spatial size, different channel count.
Uses:
1. Dimensionality reduction (bottleneck): $C_{in}=256 \to C_{out}=64$ — reduces channel count before expensive 3×3 conv (Inception, ResNet bottleneck).
2. Increase channels: $C_{in}=64 \to C_{out}=256$ — expand representation.
3. Non-linear cross-channel mixing with no spatial receptive field increase.
27 What is AUC-ROC and what does an AUC of 0.5 vs 1.0 mean? ▾
The ROC curve plots True Positive Rate (Recall) vs False Positive Rate (FP/(FP+TN)) across all classification thresholds. AUC = Area Under the ROC Curve.
| AUC | Meaning |
|---|---|
| 1.0 | Perfect classifier — separates all positives from negatives |
| 0.9–0.99 | Excellent |
| 0.7–0.89 | Good |
| 0.5 | Random guessing — no discriminative ability |
| <0.5 | Worse than random (labels may be flipped) |
AUC is threshold-independent and works well for imbalanced datasets.
28 What is the standard CNN pipeline for image classification (the full forward pass)? ▾
Input image → [Conv → BN → ReLU] × n → MaxPool (repeat the block, doubling filters) → Global Average Pooling → Dense → Softmax probabilities.
Loss: categorical_crossentropy (multi-class) or binary_crossentropy (binary). Optimizer: Adam (lr=1e-3). Evaluate: accuracy + F1 on val set.
29 What does VGG stand for and what are the two main VGG variants? ▾
VGG = Visual Geometry Group (Oxford University, Simonyan & Zisserman, 2014).
Key design philosophy: use only 3×3 conv filters throughout, increasing depth.
| Variant | Conv layers | Params | Top-5 error |
|---|---|---|---|
| VGG-16 | 13 conv + 3 Dense | 138M | 7.3% |
| VGG-19 | 16 conv + 3 Dense | 144M | 7.3% |
VGG demonstrated that depth (using small 3×3 filters) is more effective than shallow networks with large filters.
30 What is the best practice for normalizing image inputs, and what problems does NOT normalizing cause? ▾
Best practice: divide pixel values by 255 to scale to [0,1], or standardize to mean=0, std=1 using ImageNet statistics (mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225] for pretrained models).
Problems from raw 0–255 inputs:
1. Gradient instability: large input magnitudes cause large pre-activations → saturation or exploding gradients.
2. Uneven learning: the optimizer takes large steps in some directions, slow in others — poor conditioning.
3. Weight initialization mismatch: He/Xavier init assumes inputs of moderate magnitude.
01 What is the fundamental architectural difference between a Feedforward Network and an RNN? ▾
A Feedforward Network maps each input independently to an output — there is no memory of previous inputs. Each sample is processed in isolation.
An RNN maintains a hidden state $h_t$ that is passed from one time step to the next, giving the network a form of memory:
$h_t = f(h_{t-1}, x_t)$
At each step $t$, the output depends on the current input and all previous inputs (encoded in $h_{t-1}$). This makes RNNs suitable for sequential data: text, time series, audio, video.
02 Write the vanilla RNN hidden state update formula and explain each term. ▾
Formula
$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$
$h_t$: hidden state at time $t$ · $h_{t-1}$: previous hidden state (memory) · $x_t$: current input · $W_h$: recurrent weight matrix (hidden-to-hidden) · $W_x$: input weight matrix · $\tanh$: keeps hidden state in $[-1,1]$.
The same weights $W_h, W_x$ are reused at every time step — this is weight sharing across time.
03 What is the vanishing gradient problem in RNNs and why is it worse than in FNNs? ▾
Exam Favorite
During BPTT, the gradient flows backward through time by multiplying $W_h^T$ at each step. For a sequence of length $T$:
$\frac{\partial \mathcal{L}}{\partial h_1} = \frac{\partial \mathcal{L}}{\partial h_T} \prod_{t=2}^{T} W_h^T \,\text{diag}\!\left(\tanh'(\cdot)\right)$
Each $\tanh'$ term is ≤ 1. Over 100+ time steps, if $\|W_h\| < 1$ the product $\|W_h\|^{100} \cdot (\leq 1)^{100}$ → effectively zero.
Worse than in FNNs: an FNN has ~10–50 layers, but an RNN may unroll to 100–1000 time steps — far more multiplications.
Consequence: early time steps receive near-zero gradients — the RNN cannot learn long-range dependencies.
04 What is the exploding gradient problem in RNNs and how is it fixed? ▾
If the largest eigenvalue of $W_h$ is $> 1$, repeated multiplication causes gradients to grow exponentially → NaN weights, training collapses.
Fix: gradient clipping. Before the weight update, if the gradient norm exceeds a threshold, scale the gradient down:
if $\|g\| > \tau$: $\quad g \leftarrow \tau \cdot \frac{g}{\|g\|}$
From the course lab: GRAD_CLIP = 5.0. This is applied in PyTorch as torch.nn.utils.clip_grad_norm_(params, 5.0).
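A sketch of a PyTorch training step with clipping (model, optimizer, loss_fn, and data are assumed to exist):

```python
import torch

GRAD_CLIP = 5.0

def train_step(model, optimizer, loss_fn, x, y):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                       # compute gradients via BPTT
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)  # cap the norm
    optimizer.step()
    return loss.item()
```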
05 What is Backpropagation Through Time (BPTT)? ▾
BPTT is the algorithm for training RNNs. The RNN is "unrolled" through time to create a deep feedforward graph with one layer per time step, then standard backpropagation is applied through all steps.
Steps:
1. Forward pass: compute all $h_1, h_2, \ldots, h_T$ and outputs.
2. Compute total loss $\mathcal{L} = \sum_t \mathcal{L}_t$.
3. Backward pass: compute $\frac{\partial \mathcal{L}}{\partial W}$ by propagating gradients back from $t=T$ to $t=1$.
Truncated BPTT: for very long sequences, backprop only through the last $k$ steps to avoid memory issues.
06 Name the 4 gates of an LSTM and describe the role of each. ▾
Exam Favorite
| Gate | Symbol | Activation | Role |
|---|---|---|---|
| Forget gate | $f_t$ | Sigmoid | Decides what to erase from the cell state (0=forget, 1=keep) |
| Input gate | $i_t$ | Sigmoid | Decides how much new information to write to cell state |
| Cell candidate | $\tilde{C}_t$ | Tanh | New candidate values to potentially add to cell state |
| Output gate | $o_t$ | Sigmoid | Decides what part of cell state to output as hidden state |
07 Write the complete LSTM equations (all 6 equations). ▾
Formula
$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ ← Forget gate
$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ ← Input gate
$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$ ← Cell candidate
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ ← Cell state update
$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ ← Output gate
$h_t = o_t \odot \tanh(C_t)$ ← Hidden state
$\odot$: element-wise multiplication. $[h_{t-1}, x_t]$: concatenation. $C_t$: cell state (long-term memory). $h_t$: hidden state (working memory).
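A numpy sketch of one LSTM step following the equations above (toy sizes and a single stacked weight matrix are assumptions of this sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """W maps the concatenated [h, x] to the 4 stacked gate pre-activations."""
    hx = np.concatenate([h_prev, x_t])
    d = h_prev.size
    z = W @ hx + b                   # stacked pre-activations, shape (4d,)
    f = sigmoid(z[0:d])              # forget gate
    i = sigmoid(z[d:2*d])            # input gate
    C_tilde = np.tanh(z[2*d:3*d])    # cell candidate
    o = sigmoid(z[3*d:4*d])          # output gate
    C = f * C_prev + i * C_tilde     # cell state update (additive)
    h = o * np.tanh(C)               # hidden state
    return h, C

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
W = rng.normal(size=(4 * d_h, d_h + d_x)) * 0.1
b = np.zeros(4 * d_h)
h, C = lstm_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), W, b)
print(h.shape, C.shape)  # (4,) (4,)
```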
08 How does the LSTM cell state $C_t$ solve the vanishing gradient problem? ▾
The cell state $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ is updated via additive connections — unlike the vanilla RNN which multiplies $W_h h_{t-1}$ through tanh.
The gradient of the loss with respect to $C_{t-1}$:
$\frac{\partial C_t}{\partial C_{t-1}} = f_t$ (element-wise, ignoring the gates' own dependence on $C_{t-1}$)
The forget gate $f_t \in (0,1)$ is learned — when the network needs to remember something, it can set $f_t \approx 1$, allowing gradients to flow back essentially unchanged (gradient $\approx 1$ per step). This is the constant error carousel mechanism.
09 What are the 2 gates of a GRU and how does it differ from LSTM? ▾
GRU (Gated Recurrent Unit) merges the forget and input gates into one update gate and adds a reset gate:
$z_t = \sigma(W_z [h_{t-1}, x_t])$ ← Update gate (how much to overwrite)
$r_t = \sigma(W_r [h_{t-1}, x_t])$ ← Reset gate (how much past to forget)
$\tilde{h}_t = \tanh(W[r_t \odot h_{t-1}, x_t])$ ← Candidate
$h_t = (1-z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ ← Hidden state
| | LSTM | GRU |
|---|---|---|
| Gates | 4 (forget, input, cell, output) | 2 (update, reset) |
| Separate cell state | Yes ($C_t$ and $h_t$) | No (only $h_t$) |
| Parameters | More | ~25% fewer |
| Performance | Usually better for long sequences | Comparable, faster to train |
10 What does the forget gate learn to do in language modeling? Give a concrete example. ▾
The forget gate learns to selectively erase information from the cell state when it is no longer relevant.
Example — subject-verb agreement:
"The cats that were chasing the mouse..."
When the model reads "cats" (plural subject), it stores this in $C_t$. When it later generates the verb, it needs to remember "cats" (plural) → "are" not "is." The forget gate keeps the plural information alive across the intervening words about the mouse.
After the clause ends, the forget gate can erase "cats" — it's no longer needed for subject-verb agreement.
11 What are the four RNN architecture configurations (many-to-one, one-to-many, etc.) with examples? ▾
| Config | Input | Output | Example |
|---|---|---|---|
| One-to-one | Single | Single | Plain FNN (not really RNN) |
| One-to-many | Single | Sequence | Image captioning, music generation |
| Many-to-one | Sequence | Single | Sentiment classification, text → label |
| Many-to-many (equal) | Sequence | Sequence (same length) | POS tagging, NER, video frame labeling |
| Many-to-many (seq2seq) | Sequence | Sequence (diff length) | Machine translation, summarization |
12 What is teacher forcing and when is it used? ▾
In sequence-to-sequence (encoder-decoder) models, the decoder generates output token by token. During training, there are two options:
Without teacher forcing: feed the decoder's own (possibly wrong) prediction as the next input. Errors compound → slow, unstable training.
With teacher forcing: feed the ground-truth previous token as input to the decoder at each step, regardless of what was predicted.
Benefits: faster convergence, stable gradients, avoids early error cascades.
Risk: "exposure bias" — the model performs worse at inference (where it sees its own predictions, not ground truth). Scheduled sampling mitigates this by gradually reducing teacher forcing.
13 What is a Bidirectional RNN and when is it useful? ▾
A Bidirectional RNN runs two separate RNNs on the sequence — one forward (left to right) and one backward (right to left). Their hidden states are concatenated at each time step.
When useful: tasks where context from both past AND future helps predict the current position:
· NER: "Apple" in "I bought an Apple iPhone" vs "I ate an apple" — future context disambiguates.
· POS tagging, sentence encoding, BERT (bidirectional Transformer).
14 What is a stacked (deep) RNN and what does each layer learn? ▾
A stacked RNN has multiple RNN layers, where the output sequence of one layer becomes the input sequence for the next: $h_t^{(l)} = f\big(h_{t-1}^{(l)},\ h_t^{(l-1)}\big)$.
Layer 1 learns low-level patterns (word-level). Layer 2 learns phrase-level structures. Layer 3 learns sentence-level/semantic patterns.
Typical depth: 2–4 layers. More layers → more expressive but harder to train (vanishing gradients). Lab config: NUM_LAYERS=2.
15 What is the encoder-decoder (seq2seq) architecture and what information passes between them? ▾
The encoder reads the entire input sequence and compresses it into a fixed-size context vector $c$ (the final hidden state). The decoder takes $c$ as its initial hidden state and generates the output sequence token by token.
Encoder: $c = h_T^{enc}$ (final hidden state after reading $x_1, \ldots, x_T$)
Decoder: $h_0^{dec} = c$, then generates $y_1, y_2, \ldots$
Bottleneck problem: compressing a long input to a single vector loses information. Attention mechanism (Chapter 7) solves this by giving the decoder access to all encoder hidden states, not just the final one.
16 What is the hidden size hyperparameter in RNNs and what does it control? ▾
The hidden size $d_h$ is the dimensionality of the hidden state vector $h_t \in \mathbb{R}^{d_h}$.
What it controls: the network's memory capacity — how much information can be stored in the hidden state at each time step.
| Hidden size | Effect |
|---|---|
| Too small (e.g., 16) | Cannot capture complex patterns — underfitting |
| Good (64–512) | Balanced capacity and computation |
| Too large (1024+) | Slow, risk of overfitting, needs more data |
From the course lab: HIDDEN_SIZE = 64 for temperature forecasting.
17 Why are RNNs fundamentally slower to train than CNNs or Transformers? ▾
RNNs have sequential data dependency: $h_t$ must be computed before $h_{t+1}$ because $h_t$ depends on $h_{t-1}$. This prevents parallelization across time steps.
CNNs: all spatial positions are processed in parallel — highly GPU-parallelizable.
Transformers: all positions are processed in parallel via matrix multiplication (attention). No sequential dependency.
This sequential bottleneck is a key motivation for replacing RNNs with Transformers for long sequences.
18 What is the long-range dependency problem and what is the maximum effective range of a vanilla RNN? ▾
The long-range dependency problem: a vanilla RNN cannot reliably learn relationships between tokens that are far apart in the sequence due to vanishing gradients.
In practice, vanilla RNNs effectively "remember" only about 5–10 time steps. Information from 50+ steps ago is largely lost.
LSTMs extend this to hundreds of steps in favorable conditions. Transformers, via direct attention connections, handle thousands of tokens equally regardless of distance — solving long-range dependencies completely.
19 What does the sequence length (SEQ_LEN) hyperparameter control in a time series RNN? ▾
SEQ_LEN is the length of the input window: how many past time steps the model sees at each prediction step. From the course lab: SEQ_LEN = 30 (30 past days of temperature to predict day 31).
| SEQ_LEN | Trade-off |
|---|---|
| Too short | Model misses relevant long-term patterns |
| Too long | More computation, harder to train, risk of vanishing gradients |
SEQ_LEN should match the actual relevant history in the data (e.g., 7 for weekly patterns, 365 for yearly).
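A sketch of the windowing step that SEQ_LEN controls, on a synthetic series (the sine signal is an assumption standing in for the lab data):

```python
import numpy as np

SEQ_LEN = 30
series = np.sin(2 * np.pi * np.arange(400) / 365)  # toy daily signal

# Build (window, next-value) pairs: 30 past days -> day 31.
X = np.stack([series[i : i + SEQ_LEN] for i in range(len(series) - SEQ_LEN)])
y = series[SEQ_LEN:]
print(X.shape, y.shape)  # (370, 30) (370,)
```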
20 Compare LSTM vs GRU vs Vanilla RNN for the task of long-form text generation. ▾
| Architecture | Long-range | Speed | Params | Recommendation |
|---|---|---|---|---|
| Vanilla RNN | Poor | Fast | Fewest | Only for very short sequences |
| GRU | Good | Faster than LSTM | ~25% fewer than LSTM | Good default when speed matters |
| LSTM | Best | Slower | Most | Best for long, complex sequences |
For poetry/text generation with coherent long-term structure: LSTM or GRU. For real-time, resource-constrained apps: GRU.
21 What is the difference between a regression RNN and a classification RNN at the output layer? ▾
| | Regression RNN | Classification RNN |
|---|---|---|
| Output activation | Linear (none) | Softmax (multi-class) / Sigmoid (binary) |
| Loss | MSE or MAE | Cross-entropy |
| Example | Temperature prediction (next value) | Sentiment: positive/negative |
| Output | Continuous value $\hat{y} \in \mathbb{R}$ | Class probability $\hat{y} \in [0,1]$ |
22 What is the conceptual difference between the cell state $C_t$ and the hidden state $h_t$ in LSTM? ▾
Think of them as two types of memory:
| | Cell state $C_t$ | Hidden state $h_t$ |
|---|---|---|
| Analogy | Long-term memory | Working memory |
| Update mechanism | Additive (can preserve unchanged) | Through output gate: $o_t \odot \tanh(C_t)$ |
| Range | Can carry info over 100s of steps | More local, used for immediate decisions |
| Passed to next step | Yes | Yes |
| Used as output | No (internal only) | Yes (feeds into Dense layers) |
23 What is the purpose of the LEARNING_RATE=1e-3 and how does it interact with gradient clipping? ▾
LEARNING_RATE $\eta = 10^{-3}$ controls the step size of each weight update. For RNNs, this is typically smaller than for CNNs due to the sensitivity of recurrent dynamics.
Interaction with gradient clipping: clipping caps the gradient's magnitude (its norm) while preserving its direction, whereas the LR controls the step size taken along that direction. Both together prevent unstable updates:
1. Clip: if $\|\nabla\| > \tau$, rescale $\nabla \leftarrow \tau \cdot \nabla / \|\nabla\|$
2. Update: $w \leftarrow w - \eta \cdot \nabla$
If gradients explode (norm → large), clipping brings them back; if LR is too large, updates still overshoot. Both controls are needed.
24 What is an embedding layer and why is it placed before the RNN in text models? ▾
An embedding layer maps integer token indices to dense vectors: token index $i \in \{0, \ldots, V-1\}$ → vector $\mathbf{e}_i \in \mathbb{R}^d$.
Why before the RNN: RNNs expect continuous, dense inputs. Raw one-hot vectors are both too sparse (vocab_size dimensions with one 1) and semantically meaningless. Embeddings provide low-dimensional, semantically meaningful representations that the RNN can process efficiently.
The embedding weights are learned end-to-end with the RNN, or initialized with pretrained embeddings (Word2Vec, GloVe).
25 What is the difference between return_sequences=True and return_sequences=False in Keras LSTM? ▾
| Parameter | Output shape | Use case |
|---|---|---|
| return_sequences=False | (batch, hidden_size) — only last $h_T$ | Many-to-one: sentiment classification, regression |
| return_sequences=True | (batch, seq_len, hidden_size) — all $h_t$ | Many-to-many: stacked LSTM, seq labeling, attention |
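A minimal Keras sketch of a stacked LSTM showing both settings (input shape and layer sizes are assumptions):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(30, 8)),             # (seq_len, features) assumed
    layers.LSTM(64, return_sequences=True),  # intermediate layer: full sequence out
    layers.LSTM(64, return_sequences=False), # final layer: last hidden state only
    layers.Dense(1),                         # many-to-one regression head
])
model.summary()  # shapes: (None, 30, 64) -> (None, 64) -> (None, 1)
```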
When stacking LSTM layers, every intermediate layer must use return_sequences=True so the next LSTM receives the full sequence (as in the sketch above); only the final layer can use False (if many-to-one).
26 What is the "constant error carousel" property of LSTM and why is it important? ▾
The "constant error carousel" (CEC) is the mechanism by which the LSTM cell state propagates gradients without attenuation. When the forget gate $f_t = 1$ and input gate $i_t = 0$:
$C_t = C_{t-1} \quad \Rightarrow \quad \frac{\partial C_t}{\partial C_{t-1}} = 1$
The cell state is copied unchanged, and the gradient flows back with a factor of $f_t = 1$ — not decaying. The LSTM can sustain gradients over hundreds of steps when needed.
Importance: this is the fundamental mechanism enabling LSTMs to capture long-range dependencies that vanilla RNNs cannot.
27 What synthetic time series components are typically used in RNN temperature forecasting labs? ▾
The course lab uses a synthetic 4-year daily temperature signal built from three stacked components:
1. Annual seasonality: $A \sin(2\pi t / 365)$ — summer/winter cycle.
2. Weekly pattern: smaller periodic variation across the week.
3. Gaussian noise: random day-to-day fluctuation.
The model (Vanilla RNN with SEQ_LEN=30) must learn to predict day 31 from the 30-day window, effectively learning to decompose and extrapolate these components.
28 Why is tanh used in RNNs (for hidden states) rather than ReLU? ▾
Tanh is preferred for RNN hidden states because:
1. Bounded output $[-1, 1]$: prevents hidden states from growing unboundedly across time steps. With ReLU, repeated application of $h_t = \text{ReLU}(W_h h_{t-1} + \ldots)$ can cause exponential growth.
2. Zero-centered: unlike sigmoid, tanh is centered at 0, leading to better gradient flow (positive and negative gradients cancel less).
3. Empirically works better: in practice, ReLU in vanilla RNNs leads to exploding states. LSTMs/GRUs use tanh for cell candidates but use sigmoid for gating.
29 What is the PATIENCE=10 parameter in the RNN lab's EarlyStopping and why is it higher than in CNN labs? ▾
PATIENCE=10 means the training continues for 10 epochs without improvement before stopping. This is higher than CNN labs (patience=5) because:
1. RNN learning is noisier: gradient variance is higher due to sequential dependencies — temporary plateaus are common.
2. More oscillation: val_loss for sequence models often fluctuates more, so a higher patience prevents stopping too early during a genuine improvement phase.
3. Complex loss landscape: RNNs have more complex optimization surfaces — longer patience allows escaping local plateaus.
30 Summarize: what is the key reason to choose LSTM over vanilla RNN for sequence tasks? ▾
Exam Favorite
The key reason: LSTM solves the vanishing gradient problem through its additive cell state update mechanism, enabling it to learn dependencies spanning hundreds of time steps — which vanilla RNNs cannot.
Specifically:
· Vanilla RNN: gradient multiplied by $W_h^T \cdot \tanh'(\cdot)$ at every step → exponentially decays.
· LSTM: gradient through $C_t$ is multiplied by $f_t$ (learned, can be ≈1) → sustained gradient flow.
01 List the 6 standard steps of the NLP text cleaning pipeline in order. ▾
Exam Favorite
| Step | Operation | Python (regex) |
|---|---|---|
| 1 | Lowercase | text.lower() |
| 2 | Remove HTML tags | re.sub(r'<[^>]+>', ' ', text) |
| 3 | Remove URLs | re.sub(r'http\S+|www\S+', ' ', text) |
| 4 | Remove punctuation | re.sub(r'[^\w\s]', ' ', text) |
| 5 | Remove numbers (optional) | re.sub(r'\d+', ' ', text) |
| 6 | Normalize whitespace | re.sub(r'\s+', ' ', text).strip() |
02 What is tokenization and what are the main types? ▾
Tokenization splits raw text into discrete units (tokens) that the model can process.
| Type | Unit | Vocabulary size | OOV handling |
|---|---|---|---|
| Character | Single char | ~100 | None (no OOV) |
| Word | Whole word | 50k–500k | Poor (UNK token) |
| Subword (BPE, WordPiece) | Sub-word units | 30k–50k (controlled) | Excellent |
BERT uses WordPiece; GPT uses BPE. Both are subword methods that balance vocabulary size with OOV robustness.
03 What is the difference between stemming and lemmatization? Give examples of each. ▾
Exam Favorite
| | Stemming | Lemmatization |
|---|---|---|
| Method | Heuristic rules — chop suffix | Dictionary lookup + POS context |
| Output | Stem (may not be a real word) | Lemma (always a valid base form) |
| Speed | Fast | Slower |
| Example | "running" → "run", "studies" → "studi" | "running" → "run", "better" → "good" |
| Library | NLTK PorterStemmer | NLTK WordNetLemmatizer, spaCy |
"better" → "good" is only possible with lemmatization (it knows the grammatical relationship).
04 What are stop words and why should "not", "no", "never" NOT be removed for sentiment analysis? ▾
Exam Favorite
Stop words: high-frequency function words with little standalone meaning: "the", "a", "is", "in", "of", "to"…
Critical exception — negation words: removing "not", "no", "never" completely destroys sentiment polarity:
| Original | After stop word removal | Sentiment lost? |
|---|---|---|
| "The movie was not good" | "movie good" | YES — becomes positive! |
| "I have no complaints" | "complaints" | YES — becomes negative! |
| "Never disappointing" | "disappointing" | YES — reversed! |
05 What is Part-of-Speech (POS) tagging and what does it enable downstream? ▾
POS tagging assigns a grammatical category to each token: NN (noun), VB (verb), JJ (adjective), RB (adverb), etc.
Example: "The quick brown fox" → [The/DT, quick/JJ, brown/JJ, fox/NN]
What it enables:
· Better lemmatization (need POS to know "runs" is VBZ → "run" vs noun "runs")
· Chunking and syntactic parsing
· Named entity recognition (nouns are NE candidates)
· Feature engineering for classical ML
· Word sense disambiguation ("bank" as NN near "river" vs near "loan")
06 What is Named Entity Recognition (NER) and what are the standard entity types? ▾
NER identifies and classifies real-world named entities mentioned in text.
Example: "Apple is headquartered in Cupertino, California" → [Apple/ORG, Cupertino/GPE, California/GPE]
| Type | Examples |
|---|---|
| PERSON | Barack Obama, Marie Curie |
| ORG | Microsoft, ENSAM, WHO |
| GPE (Geo-Political Entity) | Morocco, Casablanca, Paris |
| DATE | May 18, 2026, next Tuesday |
| MONEY | $500, €1,000 |
Libraries: spaCy, NLTK (maxent_ne_chunker), HuggingFace token classifiers.
07 What is the OOV (Out-of-Vocabulary) problem and how do different tokenization strategies handle it? ▾
OOV occurs when a token at inference time does not appear in the training vocabulary. Word-level tokenizers map it to [UNK] — losing all information.
| Strategy | OOV handling |
|---|---|
| Word-level | Poor — UNK token, loses word identity |
| Character-level | Perfect — every char is in-vocabulary |
| BPE (GPT) | Good — breaks rare words into known subword pieces |
| WordPiece (BERT) | Good — "unhappy" → ["un", "##happy"] |
| FastText | Good — uses character n-grams, builds OOV vector from subwords |
08 What is BPE (Byte Pair Encoding) tokenization and how does it build its vocabulary? ▾
BPE starts with a character-level vocabulary, then iteratively merges the most frequent pair of adjacent tokens into a new token, until the desired vocabulary size is reached.
Example: starting from characters, the most frequent adjacent pair is merged repeatedly; for instance "e"+"s" → "es", then "es"+"t" → "est", so a rare word like "lowest" becomes "low" + "est".
Result: common words stay whole ("the", "is"); rare words split into familiar subword pieces.
Used by: GPT-2/3/4, RoBERTa.
09 What is dependency parsing and what information does it provide? ▾
Dependency parsing analyzes the grammatical structure of a sentence, identifying directed relationships between words.
Example: "The cat ate the fish"
Each word (except root) has one head and a typed dependency relation (nsubj=nominal subject, dobj=direct object, det=determiner).
Applications: information extraction, question answering, coreference resolution, relation extraction.
10 What is a Context-Free Grammar (CFG) and write a simple example? ▾
A CFG is a set of recursive rewrite rules (productions) that define the valid syntactic structures of a language:
S → NP VP · NP → Det N · VP → V NP · Det → "the" | "a" · N → "cat" | "fish" · V → "ate"
The sentence "the cat ate a fish" is parsed as: S → NP VP → Det N V NP → the cat ate a fish.
CFGs underpin formal syntax analysis and are used in grammar checkers and information extraction.
11 How does NLTK's word_tokenize() differ from Python's str.split()? ▾
| Input: "I don't like it, really!" | Result |
|---|---|
| .split() | ["I", "don't", "like", "it,", "really!"] — punctuation attached |
| word_tokenize() | ["I", "do", "n't", "like", "it", ",", "really", "!"] — contractions split, punct separated |
word_tokenize uses Penn Treebank conventions and handles contractions ("don't" → "do" + "n't"), abbreviations (e.g., "U.S."), and punctuation correctly. str.split() is naive — splits only on whitespace.
12 What is the vocabulary size problem and why is it challenging? ▾
The English vocabulary is theoretically unbounded (new words, names, slang, technical terms). Word-level tokenizers must choose a fixed vocabulary size $V$:
· Too small ($V=5{,}000$): high OOV rate, many words → UNK.
· Too large ($V=500{,}000$): huge embedding matrix ($500k \times d$), slow softmax over output vocab, sparse training data per word.
Standard choice: $V = 30{,}000$–$50{,}000$ for word-level. Subword methods (BPE) handle this better by representing words as combinations of subword pieces.
13 When should you keep numbers in the text? Give two domain examples where removing them hurts performance. ▾
Step 5 of the pipeline (remove numbers) is optional and domain-specific.
Keep numbers for:
1. Financial text: "Revenue increased by 23%" — the 23% is the key signal for forecasting, sentiment, or risk classification.
2. Medical/clinical text: "Patient has a BMI of 32.5 and BP of 140/90" — numbers carry diagnostic significance.
3. Legal/scientific text: Article numbers, statute references, measurements are semantically important.
Remove numbers for: general sentiment analysis of movie reviews where numbers (year of release "2023", scene count "120 minutes") are noise.
14 What does spaCy's `en_core_web_sm` model provide and what are its capabilities? ▾
en_core_web_sm is a small English model (~12MB) providing:
· Tokenization (with contractions, punctuation)
· POS tagging (token.pos_, token.tag_)
· Dependency parsing (token.dep_, token.head)
· Named Entity Recognition (doc.ents with entity types)
· Lemmatization (token.lemma_)
Larger models (en_core_web_md, lg) add word vectors.
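A sketch of basic spaCy usage (assumes the model was installed via python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is headquartered in Cupertino, California.")

for token in doc:
    print(token.text, token.pos_, token.dep_, token.lemma_)  # POS, dependency, lemma
for ent in doc.ents:
    print(ent.text, ent.label_)   # Apple ORG, Cupertino GPE, California GPE
```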
15 What is the challenge of tokenizing social media text (Twitter, Reddit)? ▾
Standard tokenizers are designed for formal text and fail on social media:
| Challenge | Example | Standard tokenizer behavior |
|---|---|---|
| Hashtags | #DeepLearning | Splits at #, losing hashtag meaning |
| @Mentions | @elonmusk | Treats as two tokens |
| Emojis | 🔥👍😍 | Often UNK or garbled |
| Slang/abbreviations | "lol", "brb", "gonna" | OOV or not split correctly |
| Repeated chars | "sooooo goood" | Rare OOV tokens |
Solution: use specialized tokenizers (TweetTokenizer from NLTK, BERTweet for social media).
16 What is sentence segmentation and why is it non-trivial? ▾
Sentence segmentation splits a document into individual sentences (usually needed before tokenization).
Why non-trivial: periods don't always end sentences:
· Abbreviations: "Dr. Smith", "U.S.A.", "etc." — period is part of the word.
· Decimal numbers: "the price was $3.99" — period is not a sentence boundary.
· Ellipsis: "..." — multiple periods, not a sentence end.
NLTK's Punkt tokenizer and spaCy's sentencizer use statistical models trained to distinguish sentence-ending periods from other uses.
17 What is the difference between Porter Stemmer and Lancaster Stemmer? ▾
| Porter Stemmer | Lancaster Stemmer | |
|---|---|---|
| Aggressiveness | Moderate | More aggressive |
| Speed | Moderate | Faster |
| Output readability | Better (closer to real words) | Often incomprehensible |
| Example | "generously" → "generous" | "generously" → "gen" |
Porter is the most widely used English stemmer. Lancaster is useful when speed is critical and the stem form doesn't need to be interpretable.
18 What is coreference resolution and why is it important for information extraction? ▾
Coreference resolution identifies when multiple expressions in a text refer to the same entity:
"Barack Obama was elected in 2008. He served two terms. The president signed the law."
All three expressions ("Barack Obama," "He," "The president") refer to the same person. Coreference resolution links them.
Importance: without it, information extraction treats "He" and "the president" as separate entities — losing the connection between facts. Essential for question answering ("Who signed the law?" → Obama) and summarization.
19 What is word sense disambiguation (WSD) and why is it hard? ▾
WSD determines which meaning of a polysemous word is intended given its context:
· "I went to the bank to deposit money." → financial institution
· "We sat by the river bank." → river bank
Why hard:
1. Many words have dozens of senses (WordNet: "run" has 39 senses)
2. Sense boundaries are fuzzy and debated even among linguists
3. Context window must be the right size — too small misses signals, too large introduces noise
4. Rare senses have very few training examples
Modern approach: contextual embeddings (BERT) implicitly solve WSD without explicit disambiguation.
20 What regex pattern removes HTML tags and why is this an important preprocessing step for web-scraped data? ▾
re.sub(r'<[^>]+>', ' ', text)
The pattern [^>]+ matches any character except ">" — capturing everything between "<" and ">" (the tag content). Replacing with a space prevents word merging ("word<br/>word" → "word word" not "wordword").
Importance: IMDb reviews, news articles, Wikipedia — all scraped from HTML. Tags like <br/>, <p>, <a href=...> are noise that creates OOV tokens and disrupts tokenization.
21 What NLP libraries are used in the course and what is each one's specialty? ▾
| Library | Specialty |
|---|---|
| NLTK | Classical NLP: tokenization, stemming, lemmatization, POS, NER chunker, CFG parsing |
| spaCy | Industrial-strength NLP: fast POS, NER, dependency parsing with pretrained models |
| scikit-learn | ML pipelines: CountVectorizer, TfidfVectorizer, classifiers, metrics |
| gensim | Word embeddings: Word2Vec, GloVe, FastText training |
| HuggingFace | Pretrained Transformers: BERT, GPT, T5 fine-tuning and inference |
22 What is morphological analysis and how does it differ from stemming? ▾
Morphological analysis decomposes words into their constituent morphemes (smallest units of meaning): prefix, stem, suffix, inflectional endings.
Example: "unhappiness" → [un- (prefix, negation) + happy (root) + -ness (suffix, nominalization)]
Difference from stemming: stemming is a crude heuristic (chop endings). Morphological analysis is linguistically principled — it identifies the actual functional components and their roles.
Languages with rich morphology (Arabic, Turkish, Finnish) require proper morphological analysis; English stemming suffices for most English NLP tasks.
23 What is the full clean_text() pipeline function from the course? ▾
Exam Favorite
This exact function condenses the 6-step preprocessing pipeline into a reusable form; a sketch of such a function appears below.
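The course's exact function is not shown here; below is a minimal sketch assuming the six steps covered in this section (lowercase, HTML removal, URL removal, punctuation/digit removal, stop-word removal, lemmatization) and that the NLTK stopwords and wordnet resources are downloaded:

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english")) - {"no", "not"}  # keep negation words
LEMMATIZER = WordNetLemmatizer()

def clean_text(text: str) -> str:
    text = text.lower()                                   # 1. lowercase
    text = re.sub(r"<[^>]+>", " ", text)                  # 2. strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # 3. strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)                 # 4. drop punctuation/digits
    tokens = [t for t in text.split() if t not in STOP]   # 5. remove stop words
    return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)  # 6. lemmatize

print(clean_text("The movie was NOT good... <br/> see http://example.com"))
# "movie not good see"
```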
24 What is the NLP text classification pipeline end-to-end (from raw text to model output)? ▾
Raw text → preprocessing (clean_text) → train/test split → vectorization (BoW / TF-IDF / embeddings) → classifier (Logistic Regression, SVM, or a neural model) → prediction → evaluation (accuracy, F1). A minimal scikit-learn version is sketched below.
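A minimal scikit-learn sketch of the chain, with a toy corpus standing in for IMDb:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy stand-in for the IMDb split (real code would load 50k cleaned reviews)
texts = ["loved it", "great film", "terrible movie", "not good at all"]
labels = [1, 1, 0, 0]

pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
pipe.fit(texts, labels)                      # vectorize + train in one call
print(pipe.predict(["what a great movie"]))  # likely [1]
```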
25 Why is lowercasing the very first step in the preprocessing pipeline? ▾
Lowercasing comes first so that every subsequent step (regex substitutions, stop-word lookups, stemming) needs to match only lowercase forms. More importantly:
1. Vocabulary reduction: "Apple", "apple", "APPLE" → one token "apple". Reduces sparse OOV problem.
2. Consistency: sentence-starting capitalization ("The" vs "the") is not semantically meaningful.
3. Tokenizer/stemmer alignment: most stemmers/lemmatizers expect lowercase.
26 What NLTK corpora need to be downloaded for a complete NLP pipeline? ▾
These are the 6 resources used in the course lab setup (Task 0.1); a typical download cell is sketched below.
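The exact six resource names are course-specific and not reproduced here; a typical setup cell downloads commonly used ones like these (illustrative list):

```python
import nltk

for resource in ["punkt", "stopwords", "wordnet", "omw-1.4",
                 "averaged_perceptron_tagger", "maxent_ne_chunker"]:
    nltk.download(resource, quiet=True)
```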
27 What is the difference between lemmatization using NLTK WordNetLemmatizer with and without POS tag? ▾
Without POS context, WordNetLemmatizer defaults to noun, often returning the wrong lemma. Always combine with POS tagging for accurate lemmatization.
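A small sketch of the difference (requires the wordnet resource):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))           # 'running' (treated as noun by default)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run' (verb lemma)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' (adjective lemma)
```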
28 What is chunking in NLP and how does it differ from full parsing? ▾
Chunking (shallow parsing) groups words into non-overlapping phrase chunks (NP, VP, PP) without producing the full parse tree:
Input POS tags: The/DT quick/JJ brown/JJ fox/NN jumps/VBZ
Chunks: [NP The quick brown fox] [VP jumps]
| Chunking | Full Parsing | |
|---|---|---|
| Output | Flat phrase groups | Full hierarchical tree |
| Speed | Fast | Slow |
| Completeness | Partial (no nested structure) | Complete sentence structure |
| Use case | NER, information extraction | Grammar checking, machine translation |
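A minimal NLTK sketch of this NP chunking (the grammar is illustrative):

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN.*>}"  # optional determiner, adjectives, then a noun
parser = nltk.RegexpParser(grammar)
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"),
          ("fox", "NN"), ("jumps", "VBZ")]
print(parser.parse(tagged))
# (S (NP The/DT quick/JJ brown/JJ fox/NN) jumps/VBZ)
```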
29 What is whitespace normalization and what problems does it fix? ▾
Problems it fixes:
1. Multiple spaces between words (from previous substitutions replacing tags/URLs with spaces)
2. Newlines (\n) and tabs (\t) that tokenizers may treat as separate tokens
3. Leading/trailing whitespace
4. Ensures consistent single-space separation between all tokens, which downstream tokenizers and vectorizers expect.
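The usual one-line implementation (a sketch):

```python
import re

def normalize_whitespace(text: str) -> str:
    # collapse spaces, tabs, and newlines into single spaces; trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(normalize_whitespace("word   word\n\tword  "))  # "word word word"
```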
30 Why is text preprocessing domain-specific? Compare the pipeline for medical records vs social media. ▾
| Decision | Medical Records | Social Media |
|---|---|---|
| Lowercase | Careful — "HIV" vs "hiv", drug names case-sensitive | Yes — most is informal |
| Remove numbers | NO — doses, measurements are critical | Usually yes |
| Stop words | Keep "no", "not", "without" (clinical negation) | Remove most, keep negation |
| Stemming/Lemma | Lemmatization preferred (preserve meaning) | Stemming OK (speed matters) |
| Special tokens | Expand abbreviations ("MI" → myocardial infarction) | Handle hashtags, @mentions, emojis |
01 What is the Bag of Words (BoW) model and what information does it explicitly discard? ▾
BoW represents a document as a vector of word counts over a fixed vocabulary $V$: each dimension counts how many times word $i$ appears, ignoring where.
Explicitly discards:
1. Word order: "dog bites man" = "man bites dog" in BoW.
2. Syntax / grammar.
3. Context: the meaning of each word is independent of surrounding words.
4. Semantic relationships: "car" and "automobile" are treated as completely different features.
02 Write the BoW vector for "I love cats" with vocabulary ["cats","dogs","hate","love"]. ▾
Vocabulary (sorted): ["cats", "dogs", "hate", "love"] — indices 0, 1, 2, 3.
"I love cats" contains: "cats" (×1), "love" (×1). "I" is out-of-vocab.
Compare: "I hate dogs" → [0, 1, 1, 0]. These two reviews are maximally different in the vector space — good! BoW correctly separates opposite sentiments here.
03 What is the IMDb 50k benchmark accuracy of BoW with unigrams? ▾
Exam Favorite
86.18% accuracy on the IMDb 50,000 review sentiment dataset (binary: positive/negative).
This is the baseline for classical methods. Remarkably strong for a method that ignores all word order and uses only raw counts.
04 Why does BoW fail on "The movie was not good, not bad"? What is this limitation called? ▾
In BoW, "not good, not bad" produces high counts for both "good" and "bad". A classifier trained on bag of words cannot distinguish:
· "The movie was not good, not bad" (mixed / neutral)
· "The movie was good, not bad" (positive)
· "The movie was not good, bad" (negative)
All three produce the same or very similar BoW vectors containing "good", "bad", "not". The model predicts incorrectly on neutral/mixed sentiments with double negation.
This is the negation blindness problem — BoW cannot model the interaction between "not" and the adjacent adjective.
05 What is an N-gram? Define unigram, bigram, and trigram with examples. ▾
An N-gram is a contiguous sequence of N tokens from a text.
Text: "The cat sat"
| N | Name | Examples |
|---|---|---|
| 1 | Unigram | ["The", "cat", "sat"] |
| 2 | Bigram | ["The cat", "cat sat"] |
| 3 | Trigram | ["The cat sat"] |
N-grams capture local word order, enabling the model to recognize "not good" (bigram) as a negative phrase even with BoW.
06 Give the exact IMDb benchmark results for unigram, bigram, and trigram N-gram models. ▾
Exam Favorite
| N-gram range | IMDb Accuracy | Interpretation |
|---|---|---|
| Unigram only (1,1) | 86.18% | Baseline BoW |
| Bigram (1,2) | ~88–89% | +2–3% from phrase detection |
| Trigram (1,2,3) | 90.11% | Best classical N-gram |
Each level adds local context, improving negation handling and phrase-level sentiment capture.
07 Write the TF-IDF formula and explain each component. ▾
Formula
Exam Favorite
$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t), \qquad \text{IDF}(t) = \log\frac{N}{df(t)} + 1$
$t$: term (word) · $d$: document · $N$: total number of documents · $df(t)$: number of documents containing term $t$.
TF (Term Frequency): how often $t$ appears in document $d$ — captures local importance.
IDF (Inverse Document Frequency): $\log(N/df(t))$ — penalizes words that appear in many documents (common words like "the" get IDF ≈ 0). The +1 is a smoothing term.
Combined: high weight → word is frequent in this document AND rare across the corpus → distinctive.
08 What is the IDF of a word that appears in ALL N documents? What does this mean? ▾
With smoothing (+1), the IDF = 1 (not 0). Without smoothing: $\log(N/N) = \log(1) = 0$ — the word has zero weight regardless of TF.
Meaning: a word like "the" that appears in every document is not distinctive — it provides no information about which document is relevant. TF-IDF naturally down-weights it.
From the IMDb 50k corpus: "the" → IDF ≈ 0.0001 (essentially 0). "oscillating" → IDF ≈ 10.3 (very rare, highly distinctive).
09 What is the IMDb accuracy of TF-IDF with (1,2)-grams and why does it outperform raw BoW? ▾
90.1% on IMDb 50k — the best classical method.
TF-IDF outperforms raw BoW (86.18%) for two reasons:
1. IDF down-weighting: common uninformative words ("the", "is", "a") get near-zero weight, so they don't dominate the feature vector. BoW counts them equally with informative words.
2. Bigrams (1,2): captures "not good", "highly recommend", "waste of time" as single features — partial negation handling.
Combined: TF-IDF focuses attention on distinctive, sentiment-bearing terms and phrases.
10 What is the sparsity problem in BoW/TF-IDF and what dimensionality does it produce on IMDb? ▾
On the IMDb 50k corpus, the vocabulary is ~100,000 unique words. Each document is represented as a vector of dimension 100,000. But a typical review uses only 200–500 words — so 99.5%+ of dimensions are zero.
Problems with sparsity:
1. Memory: 50,000 documents × 100,000 features = 5 billion entries (mostly zeros). Must use sparse matrix format.
2. Curse of dimensionality: in high-dimensional sparse space, distances become uninformative — all documents seem equally far apart.
3. No semantic generalization: "car" and "automobile" are orthogonal vectors (no relationship encoded).
11 What is the dimensionality of a BoW vector and what determines it? ▾
The dimensionality = the vocabulary size $|V|$. Each dimension corresponds to one unique word (or N-gram for N>1).
What determines it:
· All unique words in the training corpus (after preprocessing)
· max_features parameter: e.g., CountVectorizer(max_features=10000) keeps only the 10,000 most frequent words.
For N-grams: bigram vocabulary grows as $O(V^2)$ — with 50k unigrams, up to ~2.5 billion potential bigrams (most never occur in practice, so max_features is critical).
12 What is CountVectorizer vs TfidfVectorizer in scikit-learn and what does each output? ▾
| CountVectorizer | TfidfVectorizer | |
|---|---|---|
| Output values | Integer counts: how many times each word appears | Float TF-IDF scores: scaled by rarity |
| Common words | High count (e.g., "the": 50) | Low score (IDF ≈ 0) |
| Rare words | Low count (e.g., "oscillating": 1) | High score (high IDF) |
| Use case | BoW baseline, language modeling | Classification, information retrieval |
13 What is the difference between binary BoW and count BoW? ▾
| Binary BoW | Count BoW | |
|---|---|---|
| Values | 0 or 1: does word appear? | 0, 1, 2, …: how many times? |
| Sensitivity to repetition | No — "great great great" = "great" | Yes — "great great great" → count=3 |
| CountVectorizer param | binary=True | binary=False (default) |
| Use case | Short texts where repetition is noise | Long documents where frequency matters |
For IMDb reviews: count BoW performs better because sentiment words genuinely appear more often in strongly opinionated reviews ("amazing amazing amazing!").
14 When should you STOP at classical methods and not use neural networks? Give 4 conditions. ▾
Exam Favorite
Stop at classical (BoW/TF-IDF + logistic regression/SVM) when:
1. Small dataset (<10,000 samples): neural networks overfit; classical methods generalize better with limited data.
2. Interpretability required: you need to explain which words drove the decision (legal, medical, financial compliance).
3. Low latency & resources: inference must be <1ms or runs on low-power hardware — neural networks are too slow/heavy.
4. Accuracy already sufficient: TF-IDF at 90% meets the business requirement; BERT at 94% is not worth the infrastructure cost.
15 What does the `sublinear_tf=True` parameter do in TfidfVectorizer? ▾
With sublinear_tf=True, the TF component is replaced by $1 + \log(\text{TF})$ instead of raw TF:
Example: a term with raw TF = 10 gets $1 + \log(10) \approx 1 + 2.3 = 3.3$ instead of 10.
This compresses the range of term frequencies: a word appearing 10× is not 10× as important as one appearing once, and the marginal contribution shrinks as the count grows. Particularly useful for long documents where some words repeat many times.
Recommended for most TF-IDF applications in practice.
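A sketch combining sublinear_tf with the neighboring cards' parameters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams + bigrams
    sublinear_tf=True,    # TF -> 1 + log(TF)
    max_features=50000,   # cap the vocabulary size
)
X = vectorizer.fit_transform(["not good not bad", "very good movie"])
print(X.shape)  # (2, n_features) sparse TF-IDF matrix
```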
16 What is the ngram_range=(1,2) parameter in scikit-learn vectorizers? ▾
ngram_range=(min_n, max_n) specifies the lower and upper boundary of the N-gram range to be extracted:
| Range | Extracts |
|---|---|
| (1,1) | Unigrams only |
| (1,2) | Unigrams + bigrams |
| (2,2) | Bigrams only |
| (1,3) | Unigrams + bigrams + trigrams |
Larger ranges capture more context but exponentially increase vocabulary size.
17 What are the 4 shared limitations of BoW, N-gram, and TF-IDF representations? ▾
Exam Favorite
| Limitation | Example |
|---|---|
| No semantic meaning | "car" and "automobile" are orthogonal vectors — no relationship |
| No context / polysemy | "bank" (financial) = "bank" (river) — same vector regardless of context |
| OOV problem | New words at inference time → zero vector or UNK |
| High dimensionality & sparsity | 100k-dim vector, 99.5% zeros — curse of dimensionality |
All four limitations are progressively addressed by neural embeddings (Word2Vec → FastText → BERT).
18 What is a document-term matrix and how is it structured? ▾
The document-term matrix (DTM) is the output of vectorization: rows = documents, columns = vocabulary terms.
Entry $X_{ij}$ = count (BoW) or TF-IDF score of word $j$ in document $i$.
Stored as sparse matrix (scipy.sparse.csr_matrix) because 99%+ of entries are 0. In dense format: 50,000 docs × 100,000 words × 4 bytes = 20GB — impossible to fit in RAM. Sparse format stores only nonzero entries.
19 Why do bigrams outperform unigrams on sentiment tasks specifically? ▾
Sentiment is heavily driven by negation and degree adverbs — both require exactly 2-word context:
| Bigram | Sentiment | Unigrams alone |
|---|---|---|
| "not good" | Negative | "not"(ambig) + "good"(positive) → confused |
| "highly recommend" | Strongly positive | "highly"(ambig) + "recommend"(positive) |
| "waste of" | Negative | Each word ambiguous alone |
| "absolutely terrible" | Very negative | "absolutely" adds no standalone signal |
+2–3% from unigrams to bigrams (86.18% → ~88–89%) is directly attributable to capturing these patterns.
20 What does max_features=10000 do in CountVectorizer and what are the trade-offs of this parameter? ▾
max_features=10000 keeps only the 10,000 most frequent terms (by count across the training corpus). Rarer terms are discarded.
| max_features | Trade-off |
|---|---|
| Too small (1,000) | Misses important domain words — underfitting |
| Good (10,000–50,000) | Balanced coverage vs. sparsity |
| None (all words) | Highest recall but extreme sparsity, slow training, overfitting risk |
Typical values: 10,000–50,000 for BoW; with TF-IDF + N-grams: 50,000–100,000.
21 What is the co-occurrence matrix and how does it relate to word representations? ▾
A co-occurrence matrix $\mathbf{C}$ has shape $|V| \times |V|$ where $C_{ij}$ counts how many times word $i$ and word $j$ appear within a context window (e.g., ±5 words) across the corpus.
Relationship to embeddings: GloVe is based on factorizing the log co-occurrence matrix. The row vector of word $i$ in $\mathbf{C}$ is a high-dimensional, sparse semantic representation — SVD/factorization compresses it into dense word vectors.
Problem: $|V| \times |V|$ for $|V|=100k$ = 10 billion entries — memory-prohibitive without approximations.
22 What is the polysemy problem in classical text representation? Give a concrete example. ▾
Polysemy = one word form with multiple distinct meanings. Classical representations use a single feature dimension per word, forcing all meanings to share one vector entry:
· "I went to the bank to deposit my check." (financial)
· "The frog sat on the river bank." (geography)
Both uses of "bank" increment the same counter in BoW — the representation conflates two entirely different meanings. A classifier cannot distinguish them from the word alone.
This motivates contextual embeddings (BERT) which produce different vectors for "bank" in each sentence.
23 What is the key reason TF-IDF outperforms raw BoW for document classification? ▾
The IDF component solves the fundamental problem of raw BoW: common words dominate.
In raw BoW, "the" might appear 100× in a review and "excellent" 3×. The classifier sees "the" as 33× more important than "excellent" — but "the" is completely uninformative for sentiment.
TF-IDF assigns "the" an IDF ≈ 0 (appears in all documents) and "excellent" an IDF ≈ 8 (rare across corpus). So "excellent" gets weight $3 \times 8 = 24$ while "the" gets $100 \times 0 \approx 0$.
Result: the classifier focuses on discriminative words rather than function words.
24 How do you compute TF (term frequency) for a document? Are there different variants? ▾
| Variant | Formula | Use |
|---|---|---|
| Raw count | $\text{TF}(t,d) = f_{t,d}$ | BoW baseline |
| Normalized | $\text{TF}(t,d) = f_{t,d} / \sum_{t'} f_{t',d}$ | Documents of different lengths |
| Sublinear (log) | $1 + \log(f_{t,d})$ if $f > 0$, else 0 | Most practical TF-IDF (sublinear_tf=True) |
| Binary | 1 if $f_{t,d} > 0$, else 0 | Short text, presence/absence only |
scikit-learn's TfidfVectorizer uses normalized TF by default (L2 row normalization after TF-IDF computation).
25 What is the typical machine learning classifier paired with TF-IDF vectors and why? ▾
The standard classifier paired with TF-IDF is Logistic Regression or Linear SVM.
Why linear classifiers:
1. TF-IDF vectors are high-dimensional (50k–100k dims) but sparse — linear models are computationally efficient in this space.
2. High-dimensional data is often linearly separable — a linear boundary in 100k dimensions is very expressive.
3. Fast training and inference — critical for production NLP.
4. Interpretable — coefficients directly show which words are most positive/negative.
Naive Bayes also works well for text (strong independence assumption, but efficient with sparse data).
26 What is the production use case where TF-IDF excels over neural methods? ▾
Information Retrieval / Search Engines — the original and still major use case of TF-IDF.
Given a query $q$, rank documents $d$ by $\text{score}(q, d) = \cos(\mathbf{v}_q, \mathbf{v}_d)$ over their TF-IDF vectors:
Documents with query terms that are rare across the corpus (high IDF) are ranked higher than documents with common terms.
BM25 (the modern IR standard, used by Elasticsearch) is an improved variant of TF-IDF. TF-IDF remains the gold standard for keyword search, document deduplication, and keyword extraction in production.
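A minimal ranking sketch (documents and query are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["refund and return policy", "shipping times", "gift cards"]
vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)                 # document-term matrix
q = vectorizer.transform(["how do I return an item"])
scores = cosine_similarity(q, D).ravel()
print(docs[scores.argmax()])                       # "refund and return policy"
```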
27 What is the decision ladder — when to escalate from BoW → N-gram → TF-IDF → embeddings? ▾
| Start with | Escalate if |
|---|---|
| BoW + Logistic Regression | Accuracy insufficient → try N-grams |
| N-gram (1,2) + Logistic Regression | Still not sufficient → try TF-IDF |
| TF-IDF (1,2) + Logistic Regression | Need >91%, have >50k samples → try Word2Vec/FastText |
| Word2Vec/FastText + LSTM | Need >93%, have large data, have GPU → try BERT fine-tuning |
28 What is the vocabulary explosion problem with N-grams and how is max_features used to control it? ▾
With unigrams: vocabulary size $|V| \approx 50{,}000$ (unique words in IMDb).
With bigrams: up to $|V|^2/2 \approx 1.25$ billion possible bigrams. In practice ~500k–2M are observed.
With trigrams: $|V|^3$ possible — astronomically large, mostly rare.
Solution — max_features: keep only the most frequent N-grams.
Most rare N-grams are noise — frequent N-grams capture the real patterns. max_features is essential for N>2.
29 Why can't BoW/TF-IDF capture semantic similarity between "good" and "excellent"? ▾
In BoW/TF-IDF, each word is a one-hot dimension. "good" occupies dimension 4,231; "excellent" occupies dimension 17,842. Their vectors are:
$\mathbf{v}_{\text{good}} = [0,\ldots,1,\ldots,0]$ (1 at dim 4231)
$\mathbf{v}_{\text{excellent}} = [0,\ldots,1,\ldots,0]$ (1 at dim 17842)
Cosine similarity = 0 — they are completely orthogonal (unrelated) in the representation space.
But both words carry positive sentiment! A classifier must learn "excellent" is positive from scratch — there is no sharing of information between semantically related words.
This motivates word embeddings, where similar words have similar vectors.
30 Summarize the performance comparison of all classical methods on IMDb 50k. ▾
Exam Favorite
| Method | IMDb Accuracy | Key advantage |
|---|---|---|
| BoW unigram (1,1) | 86.18% | Simplest, fast, interpretable |
| N-gram bigram (1,2) | ~88–89% | Captures "not good" patterns |
| N-gram trigram (1,2,3) | 90.11% | Best N-gram coverage |
| TF-IDF (1,2)-gram | 90.1% | Down-weights common words |
TF-IDF is the best classical method overall: it matches trigram accuracy with a smaller, better-weighted feature set. For resource-constrained production systems, TF-IDF at 90% is often the right stopping point before investing in neural methods.
01 State the distributional hypothesis. Why is it the foundation of word embeddings? ▾
Key Concept
"A word is characterized by the company it keeps." (Firth, 1957)
Words that appear in similar contexts have similar meanings. "cat" and "dog" both appear near "pet", "food", "veterinarian" → they should have similar representations.
Foundation: this hypothesis allows learning meaning purely from co-occurrence statistics — no manual annotation needed. Feed a neural network billions of words and it learns that semantically similar words appear in similar contexts → similar embedding vectors.
02 What problem do dense word embeddings solve compared to sparse BoW vectors? ▾
| BoW / TF-IDF | Word Embeddings | |
|---|---|---|
| Dimensionality | 50k–100k (sparse) | 100–300 (dense) |
| Sparsity | 99.5%+ zeros | All dimensions non-zero |
| Semantic similarity | Orthogonal (no relationship) | Cosine similarity captures semantics |
| OOV | Zero vector | Trained embeddings (Word2Vec) or subword (FastText) |
| "car" vs "automobile" | Completely unrelated | High cosine similarity (>0.8) |
03 What is Word2Vec Skip-gram? Describe the training objective. ▾
Skip-gram: given a center word, predict the surrounding context words within a window.
Example (window=2): "The quick brown fox jumps" → given "brown", predict ["The", "quick", "fox", "jumps"].
A neural network with one hidden layer (the embedding layer) learns to maximize the probability of real context words and minimize the probability of random words (negative sampling).
Skip-gram works better for rare words (each rare word receives a training signal from every context it appears in).
04 What is Word2Vec CBOW? How does it differ from Skip-gram? ▾
CBOW (Continuous Bag of Words): given the surrounding context words, predict the center word.
Example: ["The", "quick", ?, "fox", "jumps"] → predict "brown".
| Skip-gram | CBOW | |
|---|---|---|
| Task | Center → predict contexts | Contexts → predict center |
| Rare words | Better (more training signal) | Worse |
| Speed | Slower (many predictions per word) | Faster (one prediction per window) |
| Common words | Comparable | Better (averages context) |
Skip-gram is generally preferred; CBOW trains faster on large corpora.
05 Write the famous Word2Vec vector arithmetic result and explain what it demonstrates. ▾
Exam Favorite
$\text{king} - \text{man} + \text{woman} \approx \text{queen}$
This demonstrates that word embeddings capture linear analogical relationships in the semantic space.
The vector from "man" to "king" encodes the concept of "royalty". Adding this offset to "woman" yields a point close to "queen".
Other examples: Paris - France + Germany ≈ Berlin · Doctor - Man + Woman ≈ Nurse
This was the first clear evidence that distributed representations encode structured semantic knowledge.
06 Write the cosine similarity formula and explain what it measures. ▾
Formula
$\cos(\mathbf{a}, \mathbf{b}) = \dfrac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$
Cosine similarity measures the angle between two vectors, ignoring their magnitude. Range: [-1, +1].
· +1: identical direction (same meaning)
· 0: orthogonal (no relationship)
· -1: opposite directions. (In practice, antonyms such as "good" and "bad" appear in similar contexts, so their embeddings are usually similar rather than opposite.)
Preferred over Euclidean distance for embeddings because it's scale-invariant — a 300-dim vector for "cat" doesn't need to be the same magnitude as "dog" to be semantically close.
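A NumPy sketch of the formula:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # dot product of the vectors divided by the product of their norms
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```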
07 What are the exact IMDb 50k benchmark results for Word2Vec, GloVe, FastText, and BERT? ▾
Exam Favorite
| Method | Accuracy | F1-Macro | Latency (CPU) |
|---|---|---|---|
| BoW unigram (baseline) | 86.2% | 0.86 | ~1ms |
| Word2Vec + mean pool | 85.7% | 0.86 | ~12ms |
| GloVe + mean pool | 76.8% | 0.77 | ~10ms |
| FastText | 86.0% | 0.86 | ~15ms |
| BERT fine-tuned | 93.9% | 0.94 | 370ms |
08 Why does GloVe perform WORSE than BoW on IMDb sentiment (76.8% vs 86.2%)? This seems counterintuitive — explain. ▾
Exam Favorite
GloVe's underperformance on sentiment is a well-known result with two causes:
1. Global co-occurrence statistics fail for sentiment: GloVe trains on global word co-occurrence (how often words appear together across all documents). Words like "not" and "good" co-occur frequently in both positive and negative contexts — the global statistics cannot distinguish "not good" (negative) from "very good" (positive). The embedding for "good" blends all these contexts.
2. Mean-pooling destroys word order: averaging GloVe vectors over a sentence conflates "not good" with "good not" → the directional information ("not" modifies "good") is lost.
BoW with a trained classifier compensates by learning that the combination of "not" + "good" features implies negative sentiment. GloVe + mean-pool cannot.
09 How does GloVe differ from Word2Vec in its training approach? ▾
| Word2Vec | GloVe | |
|---|---|---|
| Training | Predictive (neural network): predict context from center | Count-based (matrix factorization): factorize co-occurrence matrix |
| Data used | Local context window (5–10 words) | Global co-occurrence counts (entire corpus) |
| Objective | Maximize P(context|center) | Minimize: $(\mathbf{w}_i^T\mathbf{w}_j + b_i + b_j - \log X_{ij})^2$ |
| Scalability | Online (stream data) | Requires co-occurrence matrix upfront |
GloVe (Global Vectors) explicitly incorporates global statistics; Word2Vec only sees local windows. Both produce similar quality embeddings for most tasks.
10 How does FastText handle Out-of-Vocabulary (OOV) words? Write the subword decomposition of "acting". ▾
Exam Favorite
FastText represents each word as the sum of its character n-gram embeddings (n=3–6 by default) plus an embedding for the word itself.
Decomposition of "acting" (n=3):
OOV advantage: "unknownword" is OOV but "unknown" and "word" share n-grams with known words → FastText builds a vector from shared subword pieces. Word2Vec/GloVe return zero vector for OOV.
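A gensim sketch of the OOV behavior (toy corpus, so the vectors are poor, but the mechanics hold):

```python
from gensim.models import FastText

sentences = [["the", "acting", "was", "great"],
             ["bad", "acting", "ruined", "it"]]
model = FastText(sentences, vector_size=50, window=3,
                 min_count=1, min_n=3, max_n=6)

print("acting" in model.wv.key_to_index)  # True: in the vocabulary
vec = model.wv["actings"]                 # OOV, yet it still gets a vector
print(vec.shape)                          # (50,) assembled from shared char n-grams
```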
11 What is the polysemy problem in static embeddings? Give two meanings of one word. ▾
Static embeddings (Word2Vec, GloVe, FastText) assign one fixed vector per word, regardless of context. A polysemous word like "bank" has one embedding that blends all its meanings:
· "I deposited money at the bank." (financial institution)
· "She sat on the river bank." (riverbank)
The learned vector for "bank" is somewhere between these two meanings — accurate for neither. Downstream models cannot distinguish which sense is intended.
12 What does BERT stand for and what are its two pre-training tasks? ▾
Exam Favorite
BERT = Bidirectional Encoder Representations from Transformers (Devlin et al., Google, 2018).
| Pre-training Task | Description |
|---|---|
| MLM (Masked Language Modeling) | 15% of tokens are masked with [MASK]. BERT predicts the original token using both left and right context → learns bidirectional representations. |
| NSP (Next Sentence Prediction) | Given pairs of sentences (A, B): predict whether B actually follows A in the original text. Learns discourse-level relationships. |
Pre-trained on 3.3 billion words (Wikipedia + BookCorpus), taking about 4 days on up to 64 TPU chips. After pre-training, fine-tuned on downstream tasks.
13 What is the [CLS] token in BERT and how is it used for classification? ▾
[CLS] (Classification) is a special token prepended to every input sequence: [CLS] sentence tokens... [SEP].
After passing through all 12 Transformer encoder layers, the [CLS] token's final hidden state is a sentence-level representation — it has attended to all tokens and aggregated the full sequence meaning.
This [CLS] vector is passed to a new Dense classification head and the entire model is fine-tuned end-to-end.
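A HuggingFace sketch of this setup (the checkpoint name and binary head are illustrative choices):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # fresh classification head over [CLS]

inputs = tokenizer("I loved this movie!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # shape (1, 2); the head reads the [CLS] state
print(logits.softmax(dim=-1))            # near-random until the model is fine-tuned
```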
14 Compare BERT-Base and BERT-Large: layers, attention heads, hidden dimension, parameters. ▾
| BERT-Base | BERT-Large | |
|---|---|---|
| Encoder layers (L) | 12 | 24 |
| Hidden size (H) | 768 | 1,024 |
| Attention heads (A) | 12 | 16 |
| Parameters | 110M | 340M |
BERT-Base is the standard for most applications. BERT-Large gives ~1–2% better accuracy on benchmarks but requires ~3× more memory and compute.
15 What is BERT's CPU inference latency on IMDb and why does this matter for production? ▾
370ms per sample on CPU (from the course's exact benchmark).
Compare: BoW+LR ≈ 1ms, Word2Vec ≈ 12ms, GloVe ≈ 10ms, FastText ≈ 15ms.
Why it matters:
· A user-facing API with a 200ms SLA cannot use BERT on CPU — must use GPU (reduces to ~20–30ms) or distilled models (DistilBERT: 40% smaller, 60% faster, retains ~97% of BERT's performance).
· At 100 requests/sec: BERT-CPU needs 37 CPUs. FastText needs 1.5 CPUs.
· Total cost of ownership can be 10–20× higher for BERT in high-traffic production.
16 What is the "golden rule" for choosing a text representation method? ▾
Exam Favorite
The decision ladder:
1. Start with TF-IDF + Logistic Regression → fast, interpretable, often sufficient.
2. If OOV or morphology matters → FastText.
3. If context/polysemy matters and you have GPU → BERT.
4. The 4% gain (TF-IDF 90% → BERT 94%) comes at 370× higher latency. Is that trade-off worth it for your application?
17 When would you choose FastText over BERT? Give 3 conditions. ▾
Choose FastText over BERT when:
1. Real-time inference required: latency <50ms — BERT's 370ms is unacceptable. FastText: ~15ms.
2. Morphologically rich language or OOV-heavy domain: medical terms, technical jargon, code-switching — FastText handles OOV via subwords; BERT's WordPiece may fragment them poorly.
3. Low resource (no GPU, small dataset): FastText trains in seconds on CPU; BERT fine-tuning needs hours and a GPU.
Bonus condition: if TF-IDF achieves only 86% but you need 87–90% without BERT's overhead — FastText is the right intermediate step.
18 What is mean-pooling of word embeddings and why is it insufficient for sentiment? ▾
Mean-pooling computes the document representation as the average of all word vectors: $\mathbf{d} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{w}_i$
Why insufficient for sentiment:
1. Destroys word order: "not good" and "good not" produce the same average vector.
2. Dilutes negation: "The movie was absolutely terrible except for the amazing cinematography" → average of negative + positive words → neutral vector → misclassified.
3. Equal weighting: "the" (uninformative) and "brilliant" (highly informative) contribute equally to the average.
This is why Word2Vec+mean-pool (85.7%) underperforms even BoW (86.2%).
19 What is the [SEP] token in BERT and when is it used? ▾
[SEP] (Separator) marks the boundary between two sequences in BERT's input format: [CLS] sentence A [SEP] sentence B [SEP]
BERT uses [SEP] to:
1. Signal the end of a sequence (single sentence tasks).
2. Separate two sentences in pair tasks (NSP, question answering where question and passage are concatenated).
Segment embeddings (0 for sentence A, 1 for sentence B) work together with [SEP] to tell BERT which tokens belong to which input.
20 What does fine-tuning BERT mean in practice? What is updated during fine-tuning? ▾
Fine-tuning BERT = taking pre-trained BERT (110M params) and continuing training on a small task-specific labeled dataset, updating all weights including the pre-trained encoder.
Key details:
· Learning rate must be very small (2e-5 to 5e-5) — large LR destroys pre-trained representations.
· Only 2–4 epochs needed (data is small, pre-training did most of the work).
· A task-specific head (Dense layer) is added on top of [CLS].
21 What is the typical dimensionality of Word2Vec embeddings and what range is common in practice? ▾
Word2Vec embeddings are typically $d = 100$–$300$ dimensions. The original Google News Word2Vec uses $d = 300$.
| Dimension | Use case |
|---|---|
| 50–100 | Small corpora, fast inference, limited memory |
| 100–300 | Standard — balances quality vs cost (most pretrained models) |
| 300+ | Large corpora, high-accuracy tasks (GloVe 840B: d=300) |
BERT uses hidden size 768 (Base) or 1024 (Large) — much larger because it captures contextual information, not just lexical.
22 What is the 6-review demo result: how many of 6 test reviews does each method classify correctly? ▾
Exam Favorite
| Method | Correct / 6 | Fails on |
|---|---|---|
| BoW | 5/6 | Negation: "not good, not bad" |
| TF-IDF | 6/6 | — |
| Word2Vec + mean-pool | 4/6 | Mixed reviews, negation |
| FastText | 4/6 | Mixed reviews, negation |
| BERT | 6/6 | — (understands context) |
Surprising result: TF-IDF matches BERT on this small demo, while Word2Vec underperforms BoW.
23 Describe the three families of text representation in a single comparison. ▾
| Property | Classical (BoW/TF-IDF) | Static Embed (W2V/GloVe) | Contextual (BERT) |
|---|---|---|---|
| Representation | Sparse count/weight vector | Dense fixed vector per word | Dense vector per token per context |
| Semantics | None | Distributional | Deep, contextual |
| Polysemy | One feature per word | One vector (conflated) | Different vector per context |
| OOV | Zero (or UNK) | Zero (W2V/GloVe) / subword (FT) | WordPiece subword |
| Latency | ~1ms | ~10–15ms | ~370ms CPU |
| IMDb accuracy | 90.1% | 85.7% (W2V) | 93.9% |
24 What is the 30-year evolution of NLP text representations? ▾
| Era | Method | Key innovation |
|---|---|---|
| 1990s | BoW, TF-IDF | Sparse count-based vectors |
| 2000s | LSA, LDA | Topic models, matrix factorization |
| 2013 | Word2Vec | Dense neural word embeddings, analogy arithmetic |
| 2014 | GloVe | Global co-occurrence factorization |
| 2016 | FastText | Subword embeddings, OOV robustness |
| 2018 | BERT | Contextual embeddings, bidirectional Transformer |
| 2020+ | GPT-3, T5, LLaMA | Generative, few-shot, massive scale |
25 What is negative sampling in Word2Vec and why is it necessary? ▾
Problem: the full softmax over the entire vocabulary $|V|$ in the Skip-gram objective is computationally prohibitive: $O(|V|)$ per training step with $|V|=100k$ → 10 billion multiplications per epoch.
Negative sampling: instead of updating all $|V|$ output weights, for each (center, context) positive pair, randomly sample $k = 5$–$20$ "negative" (non-context) words and update only those, maximizing $\log \sigma(\mathbf{u}_{o}^{\top}\mathbf{v}_c) + \sum_{i=1}^{k} \log \sigma(-\mathbf{u}_{n_i}^{\top}\mathbf{v}_c)$, where $o$ is the true context word and $n_i$ are the sampled negatives.
Reduces training cost from $O(|V|)$ to $O(k)$ per step — making Word2Vec practical.
26 What is gensim's Word2Vec API? Write the training call and how to get a word vector. ▾
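A minimal gensim 4.x sketch with a toy corpus (real training uses a large tokenized corpus; parameter values are illustrative):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "barked"], ["cat", "and", "dog"]]
model = Word2Vec(sentences, vector_size=100, window=5,
                 min_count=1, sg=1, negative=5, epochs=50)  # sg=1: Skip-gram

vec = model.wv["cat"]                        # 100-dim NumPy vector
print(model.wv.most_similar("cat", topn=2))  # nearest neighbors by cosine
```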
27 What is the strengths and limitations grid for Word2Vec? ▾
| Strengths | Limitations |
|---|---|
| Captures semantic relationships and analogies | One vector per word — cannot handle polysemy |
| Dense, 300-dim vs 100k-dim BoW | No OOV handling (zero vector for new words) |
| Transfer: pretrained on large corpora | Mean-pooling loses word order |
| Cheap inference (~12ms) | Performs worse than BoW on sentiment (85.7% vs 86.2%) |
28 What is the difference between feature extraction and fine-tuning when using BERT? ▾
| Feature Extraction (frozen) | Fine-Tuning (unfrozen) | |
|---|---|---|
| BERT weights | Fixed — only extract [CLS] embedding | Updated with task-specific gradient |
| Compute | Cheap (no backprop through BERT) | Expensive (backprop through 110M params) |
| Accuracy | Lower (~90–92%) | Higher (~93.9%) — model adapts to task |
| Training data needed | Less (classifier only) | More (full model update) |
Fine-tuning is the standard for BERT — that 93.9% is from fine-tuning. Feature extraction is used when compute is severely limited.
29 Why did Word2Vec achieve 85.7% (less than BoW's 86.2%) on IMDb despite being a "smarter" representation? ▾
Three compounding reasons:
1. Mean-pooling problem: averaging all word vectors makes the document vector a centroid that loses negation and word order. "not bad" averages to a neutral vector near both "bad" and "not".
2. Long reviews: IMDb reviews can be 500+ words. Averaging 500 vectors produces a very blurry, averaged representation that under-weights key sentiment words.
3. Sentiment vs. semantic task mismatch: Word2Vec learns semantic similarity (car≈automobile). But "terrible" and "brilliant" are semantically very different, which is exactly what we want — they should NOT be near each other. However, they might appear in similar syntactic positions ("the movie was __"), bringing their vectors closer than desired.
30 When would you use static embeddings (Word2Vec/FastText) vs contextual embeddings (BERT) in production? ▾
| Scenario | Recommendation | Reason |
|---|---|---|
| Real-time chat bot (≤50ms) | FastText | BERT too slow on CPU |
| Semantic search / document similarity | Sentence-BERT or Word2Vec | Fast, good for retrieval |
| High-stakes classification (medical, legal) | BERT fine-tuned | Maximum accuracy, GPU available |
| Multilingual or morphologically rich language | FastText | Subwords handle OOV across languages |
| Edge / mobile deployment | FastText or DistilBERT | Low memory and compute |
01 Why did Transformers replace RNNs? Name the three fundamental limitations of RNNs that Transformers solve. ▾
Exam Favorite
| RNN Limitation | Transformer Solution |
|---|---|
| Sequential processing — cannot parallelize; $h_t$ depends on $h_{t-1}$ | All positions processed in parallel via matrix multiplication |
| Vanishing gradients over long sequences — cannot learn long-range dependencies | Direct attention connections between any two positions — O(1) path length regardless of distance |
| Fixed-size context — information bottlenecked through hidden state | All encoder hidden states accessible at every decoder step (no bottleneck) |
02 Write the scaled dot-product attention formula and explain every term. ▾
Formula
Exam Favorite
$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$
$Q \in \mathbb{R}^{n \times d_k}$: Query matrix — "what am I looking for?"
$K \in \mathbb{R}^{m \times d_k}$: Key matrix — "what do I contain?"
$V \in \mathbb{R}^{m \times d_v}$: Value matrix — "what do I provide?"
$QK^T$: raw attention scores (dot product = similarity). $\sqrt{d_k}$: scaling to prevent softmax saturation. Softmax: converts scores to weights summing to 1. $V$: weighted sum of values.
Output: for each query, a weighted combination of values where weights reflect relevance of each key.
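A NumPy sketch of the formula (single head, no batching):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, m) raw similarities
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 queries, d_k = 8
K = rng.normal(size=(4, 8))   # 4 keys
V = rng.normal(size=(4, 8))   # 4 values
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```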
03 Why is scaling by √d_k necessary in the attention formula? ▾
Without scaling, as $d_k$ grows (e.g., $d_k = 64$), the dot products $QK^T$ grow in magnitude proportionally to $\sqrt{d_k}$.
Large dot products push the softmax into its saturation region — where most values are near 0 and one value is near 1 (winner-takes-all). This causes:
1. Vanishing gradients: softmax gradient is $p_i(1-p_i)$ — near 0 when $p_i \approx 0$ or $\approx 1$.
2. Loss of nuance: the model attends to only one position instead of a soft mixture.
Dividing by $\sqrt{d_k}$ keeps the pre-softmax scores in a range where gradients flow well.
04 What are Q, K, V matrices and how are they computed from the input? ▾
Q, K, V are linear projections of the input (or of the encoder output, for cross-attention):
$Q = X W^Q$ ← what this position "looks for"
$K = X W^K$ ← what this position "advertises"
$V = X W^V$ ← what this position "provides" if selected
$X \in \mathbb{R}^{n \times d_{model}}$: input sequence. $W^Q, W^K \in \mathbb{R}^{d_{model} \times d_k}$, $W^V \in \mathbb{R}^{d_{model} \times d_v}$: learned projection matrices.
In self-attention: Q, K, V all come from the same input X (every position can attend to every other position in the same sequence).
05 Write the Multi-Head Attention formula and explain what h attention heads gain over a single head. ▾
Formula
$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$
$\text{head}_i = \text{Attention}(QW^Q_i,\, KW^K_i,\, VW^V_i)$
Original Transformer: $h = 8$ heads, $d_{model} = 512$, $d_k = d_v = 512/8 = 64$ per head.
What multiple heads gain: each head can attend to different aspects of the sequence simultaneously:
· Head 1: syntactic subject-verb relationships
· Head 2: coreference resolution (linking pronouns to nouns)
· Head 3: semantic similarity
· Head 4–8: other linguistic phenomena
Single attention collapses all patterns into one view — multiple heads provide representational diversity.
06 What are the components of a Transformer encoder block? List all layers in order. ▾
Exam Favorite
1. Multi-Head Self-Attention (every position attends to every other)
2. Add & Layer Normalization (residual connection + LayerNorm)
3. Position-wise Feed-Forward Network (FFN)
4. Add & Layer Normalization (residual connection + LayerNorm)
BERT-Base stacks 12 of these encoder blocks. Each block has ~7.2M parameters.
07 Why do Transformers need positional encodings? What information do they provide? ▾
The attention mechanism is permutation-invariant: it computes the same attention scores regardless of the order of input tokens. "cat sat mat" and "mat cat sat" produce identical attention without positional information.
Positional encodings inject position information into the input embeddings:
$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$
$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$
These sinusoidal encodings are added to word embeddings before the first encoder block. They allow the model to distinguish "The cat sat" from "The sat cat".
Modern LLMs use Rotary Positional Embeddings (RoPE) or learned positional embeddings instead.
08 What is the Add & Normalize (Layer Normalization) operation and why is it used? ▾
Add: the residual connection adds the input $\mathbf{x}$ directly to the sublayer output — same as ResNet. This ensures gradient flow and allows training deep networks (12–24 encoder blocks).
Normalize: Layer Normalization normalizes each token's vector to mean=0, std=1 across the feature dimension (unlike Batch Norm which normalizes across the batch).
Why LayerNorm over BatchNorm for Transformers: sequences have variable length, making batch statistics unstable. LayerNorm is computed per sample, per position — stable regardless of batch size or sequence length.
09 What is the Feed-Forward Network (FFN) inside each Transformer block? ▾
The FFN is applied independently to each position (token) after multi-head attention:
$\text{FFN}(\mathbf{x}) = \max(0,\, \mathbf{x}W_1 + b_1)\,W_2 + b_2$ — two linear layers with a ReLU in between. Dimensions: $d_{model} = 512 \to d_{ff} = 2048 \to 512$ (typically $d_{ff} = 4 \times d_{model}$).
Role: attention mixes information across positions (which position to attend to). FFN applies a nonlinear transformation to each position independently — this is where much of the model's "knowledge" is stored. Recent research shows FFN layers act as key-value memories.
10 What is the difference between encoder-only, decoder-only, and encoder-decoder Transformer models? Give one example of each. ▾
| Architecture | Structure | Example | Best for |
|---|---|---|---|
| Encoder-only | Stack of encoder blocks with bidirectional attention | BERT, RoBERTa | Classification, NER, QA (understanding) |
| Decoder-only | Stack of decoder blocks with causal (masked) attention | GPT-2/3/4, LLaMA | Text generation, completion |
| Encoder-Decoder | Encoder + decoder with cross-attention | T5, BART, original Transformer | Translation, summarization, seq2seq |
11 What is causal masking in the decoder and why is it necessary? ▾
In a decoder generating text left-to-right, position $i$ must not attend to positions $j > i$ (future tokens) — this would be "cheating" during training (the model would see the answer).
Causal masking sets the attention score to $-\infty$ for all future positions before softmax: $\text{score}_{ij} \leftarrow -\infty$ for all $j > i$.
After softmax: $e^{-\infty} = 0$ → future positions get zero attention weight.
This enforces autoregressive generation: each token is predicted using only past context, making training consistent with inference.
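A NumPy sketch of the mask construction:

```python
import numpy as np

n = 4
scores = np.random.randn(n, n)                      # raw attention scores
future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
scores[future] = -np.inf                            # block attention to the future
# after softmax, row i puts zero weight on every position j > i
```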
12 What is greedy decoding, beam search, and temperature sampling in text generation? ▾
| Strategy | How | Output quality | Diversity |
|---|---|---|---|
| Greedy | Always pick argmax token at each step | Locally optimal, often repetitive | None (deterministic) |
| Beam search (k=5) | Keep top-k sequences at each step, return best final | Higher quality than greedy | Low (structured) |
| Temperature ($T$) | Scale logits by $1/T$ before softmax. $T\to 0$: greedy; $T\to\infty$: uniform | Tunable | High (stochastic) |
$T < 1$: more focused/conservative. $T > 1$: more random/creative. Typical: $T = 0.7$–$1.0$.
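A NumPy sketch of temperature sampling:

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, T: float = 0.7) -> int:
    z = logits / T               # T < 1 sharpens, T > 1 flattens the distribution
    p = np.exp(z - z.max())
    p /= p.sum()                 # softmax over the scaled logits
    return int(np.random.choice(len(p), p=p))

print(sample_with_temperature(np.array([2.0, 1.0, 0.1]), T=0.7))
```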
13 What is an emergent ability in LLMs? Give three examples. ▾
Emergent abilities are capabilities that appear suddenly when a model crosses a certain scale threshold — absent in smaller models, present in larger ones (not a smooth interpolation).
| Emergent Ability | Approximate threshold |
|---|---|
| Multi-step arithmetic (3-digit addition) | ~100B parameters |
| Chain-of-thought reasoning | ~100B parameters |
| Code generation | ~40B parameters |
| Translation without training on bilingual pairs | ~100B parameters |
| Instruction following in zero-shot | ~175B parameters (GPT-3) |
These abilities are "emergent" because no one explicitly trained for them — they arise from general language modeling at scale.
14 What are the three stages of LLM training? Describe each briefly. ▾
Exam Favorite
| Stage | Data | Objective |
|---|---|---|
| 1. Pre-training | Trillions of tokens from the internet/books | Next-token prediction (autoregressive). Learns language, world knowledge, facts. |
| 2. SFT (Supervised Fine-Tuning) | Thousands of human-written (instruction, response) pairs | Teach the model to follow instructions and respond helpfully. |
| 3. RLHF (Reinforcement Learning from Human Feedback) | Human preferences: which of two responses is better? | Train a reward model on preferences; use PPO to optimize LLM output toward higher reward (safer, more helpful, honest). |
15 Write the LoRA formula and explain what rank r and alpha α control. ▾
Formula
Exam Favorite
$W' = W + \Delta W = W + \dfrac{\alpha}{r}\, A B$
$W \in \mathbb{R}^{d \times d}$: frozen pre-trained weight matrix.
$A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times d}$: trainable low-rank matrices.
$r$: rank — controls the number of trainable parameters. Small $r$ (4–16) → very few params (e.g., $r=8$ for $d=768$: $768\times8 + 8\times768 = 12k$ params vs $768^2 = 590k$ for full fine-tuning).
$\alpha$: scaling factor — controls the magnitude of the LoRA update. Typically set to $r$ or $2r$.
Why LoRA works: the update $\Delta W = AB$ is inherently low-rank — weight updates during fine-tuning have been empirically shown to have low intrinsic rank.
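A NumPy sketch of the LoRA forward pass with the dimensions from this card (the zero init of B is the standard recipe so training starts from the pre-trained behavior):

```python
import numpy as np

d, r, alpha = 768, 8, 16
W = np.random.randn(d, d)          # frozen pre-trained weight (illustrative values)
A = np.random.randn(d, r) * 0.01   # trainable, small random init
B = np.zeros((r, d))               # trainable, zero init so AB = 0 at the start

def lora_forward(x: np.ndarray) -> np.ndarray:
    # only A and B would receive gradients; W stays frozen
    return x @ W + (alpha / r) * (x @ A @ B)

print(lora_forward(np.random.randn(2, d)).shape)  # (2, 768)
```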
16 What is zero-shot prompting, few-shot prompting, and Chain-of-Thought (CoT)? Give an example of each. ▾
| Technique | Prompt structure | When to use |
|---|---|---|
| Zero-shot | "Classify this review as positive or negative: 'I loved it!'" | Simple tasks, no examples available |
| Few-shot | "Positive: 'Great film!' Negative: 'Terrible movie!' Classify: 'It was ok'" | When task format needs demonstration |
| CoT | "Let's think step by step: Q: 5+3×2=? A: First multiply 3×2=6, then add 5+6=11" | Multi-step reasoning, math, logic |
CoT dramatically improves performance on reasoning tasks — it only emerges in models with ~100B+ parameters.
17 What is RAG (Retrieval-Augmented Generation) and what problem does it solve? ▾
Exam Favorite
RAG combines a retrieval system (vector database) with a generative LLM to ground responses in real, up-to-date facts.
Problems it solves:
1. Hallucination: LLMs make up plausible-sounding but false facts. RAG retrieves real documents to constrain the answer.
2. Knowledge cutoff: LLMs only know data up to their training date. RAG can access current information.
3. Private/domain knowledge: LLMs don't know your company's internal documents. RAG retrieves them dynamically.
18 List the 5 steps of a RAG pipeline. ▾
| Step | Tools |
|---|---|
| 1. Chunk documents into passages | text splitters (e.g., LangChain) |
| 2. Embed chunks | sentence-transformers, OpenAI ada-002 |
| 3. Store vectors | FAISS (local), Pinecone (cloud), Chroma (local), Weaviate |
| 4. Retrieve top-k chunks for the query | similarity search in the vector store |
| 5. Generate the answer from retrieved context | LangChain, LlamaIndex (orchestration) |
19 What is hallucination in LLMs and what causes it? ▾
Hallucination: an LLM generates confident, fluent, plausible-sounding output that is factually incorrect or fabricated.
Causes:
1. Statistical next-token prediction: the model is trained to produce likely tokens, not true ones. "Likely" ≠ "factual".
2. Knowledge gaps: if a fact wasn't in training data, the model "fills in" with the most statistically plausible completion.
3. No explicit memory: the model has no lookup mechanism — everything comes from parametric memory baked into weights.
4. Overconfidence: no uncertainty calibration — the model cannot distinguish what it knows vs doesn't know.
Mitigations: RAG, RLHF (reduce overconfident wrong answers), Constitutional AI, fine-tuning on factual data.
20 What is the key architectural difference between GPT (decoder-only) and BERT (encoder-only)? ▾
| BERT (Encoder-only) | GPT (Decoder-only) | |
|---|---|---|
| Attention direction | Bidirectional — each token attends to all others | Causal (left-to-right) — each token attends only to past |
| Pre-training task | Masked Language Modeling + NSP | Next-token prediction (autoregressive) |
| Generation | Cannot generate (no autoregressive decoding) | Natural — generates one token at a time |
| Best for | Understanding tasks: classification, NER, QA | Generation: chat, completion, writing |
21 What is a vector database and how does it enable semantic search in RAG? ▾
A vector database stores high-dimensional embedding vectors and enables fast approximate nearest neighbor (ANN) search:
Store: each document is embedded once into a vector $\mathbf{d}_i$ · Search: embed the query as $\mathbf{q}$, then find the top-k documents where $\cos(\mathbf{q}, \mathbf{d}_i)$ is highest
Traditional DB limitation: SQL can only exact-match text. "Refund" and "return policy" are different strings — no match. A vector DB finds them because their embeddings are nearby.
| DB | Type | Scale |
|---|---|---|
| FAISS | Library (Facebook AI) | Local, millions of vectors |
| Pinecone | Managed cloud | Billions of vectors |
| Chroma | Local, easy setup | Prototyping |
22 What is RLHF and how does it make LLMs safer and more helpful? ▾
RLHF (Reinforcement Learning from Human Feedback) — Stage 3 of LLM training:
Step 1 — Reward model training: human annotators rank multiple model responses (A is better than B). Train a separate reward model $R$ to predict human preference scores.
Step 2 — PPO optimization: use Proximal Policy Optimization (PPO) to update the LLM to maximize the reward model's score while not diverging too far from the SFT model: $\max_\theta \; \mathbb{E}\left[R(x, y)\right] - \beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{SFT}}\right)$
Result: the model learns to produce responses humans prefer — more helpful, harmless, and honest (Anthropic's "HHH" criteria).
23 What is the T5 model's architecture and training paradigm? ▾
T5 (Text-to-Text Transfer Transformer, Google 2020) uses a full encoder-decoder architecture and frames every NLP task as text-to-text:
| Task | Input | Output |
|---|---|---|
| Translation | "translate English to French: The cat" | "Le chat" |
| Summarization | "summarize: [long document...]" | "[summary]" |
| Classification | "sst2 sentence: I love it" | "positive" |
| QA | "question: Who? context: ..." | "[answer]" |
Pre-trained on C4 (Colossal Clean Crawled Corpus) with a span-corruption objective (mask random spans). T5-11B has 11 billion parameters.
24 What is the difference between full fine-tuning and LoRA fine-tuning in terms of parameter count? ▾
Full fine-tuning: update all parameters of the model.
For LLaMA-7B: all 7 billion parameters are updated; the weights alone take ~28GB GPU VRAM (fp32) or ~14GB (bf16), before gradients and optimizer states.
LoRA fine-tuning: freeze all original weights. Add low-rank matrices $A, B$ (rank $r=8$–16) to attention layers only.
Example: LLaMA-7B with LoRA (r=16): ~4 million trainable parameters (0.06% of 7B). Requires ~8GB VRAM — trainable on a single consumer GPU.
For $d=4096$, $r=16$: $2 \times 4096 \times 16 = 131k$ params per layer
Yet LoRA achieves comparable accuracy to full fine-tuning on most tasks.
25 What is SFT (Supervised Fine-Tuning) and how does it differ from pre-training? ▾
| Pre-training | SFT | |
|---|---|---|
| Data | Trillions of tokens (raw web text) | Thousands of (instruction, response) pairs |
| Objective | Next-token prediction on all text | Next-token prediction on response only |
| Goal | Learn language, facts, reasoning | Teach to follow instructions helpfully |
| Duration | Weeks (thousands of GPUs) | Hours–days (dozens of GPUs) |
Without SFT, a pre-trained model just continues the prompt (completion). SFT teaches it to actually answer questions, follow instructions, and refuse harmful requests.
26 What is HuggingFace and what three key components does it provide? ▾
HuggingFace is the de facto platform for working with pretrained Transformer models:
1. Model Hub (huggingface.co/models): 200k+ pretrained models (BERT, GPT-2, T5, LLaMA, etc.) — downloadable with one line of code.
2. 🤗 Transformers library: unified Python API for loading, fine-tuning, and running any model: from transformers import AutoModel, AutoTokenizer, pipeline.
3. Datasets library: 10k+ NLP datasets (IMDb, GLUE, SQuAD) with standardized loading and preprocessing.
Also provides: PEFT (LoRA, adapters), Accelerate (multi-GPU), Inference API (hosted inference).
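A minimal usage sketch:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default distilled BERT
print(classifier("I loved this movie!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]
```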
27 What is Ollama and why is it useful for local LLM deployment? ▾
Ollama is a tool for running large language models locally on a laptop/workstation without cloud APIs.
Why useful:
1. Privacy: data never leaves your machine — critical for confidential documents.
2. No cost: no API fees for unlimited inference.
3. Offline: works without internet after model download.
4. Experimentation: test different models side by side quickly.
Uses 4-bit quantization to run 7B–13B parameter models on 8GB–16GB RAM laptops.
28 What is LangChain and what problem does it solve for LLM application development? ▾
LangChain is a framework for building LLM-powered applications by composing reusable components:
| Component | Purpose |
|---|---|
| Chains | Sequential LLM calls: summarize → translate → reformat |
| Agents | LLM + tools (search, code exec, calculator) — LLM decides which tool to use |
| Memory | Maintain conversation history across calls |
| Retrievers | Connect vector databases for RAG pipelines |
| Prompt Templates | Reusable structured prompts with variables |
Problem it solves: without LangChain, building multi-step LLM pipelines requires significant boilerplate. LangChain provides abstractions that work with OpenAI, Anthropic, Ollama, HuggingFace simultaneously.
29 What are the 5 questions to answer when selecting a model for a new NLP task? ▾
Exam Favorite
| # | Question | Guides toward |
|---|---|---|
| 1 | How much labeled data do I have? | <1k → classical; 1k–100k → fine-tune BERT; 100k+ → train from scratch or LoRA |
| 2 | What is my latency requirement? | <10ms → TF-IDF/FastText; <50ms → distilled BERT; flexible → BERT |
| 3 | Is understanding or generation the task? | Understanding → BERT; Generation → GPT/T5 |
| 4 | Is interpretability required? | Yes → TF-IDF + linear; No → any neural model |
| 5 | What GPU budget is available? | None → FastText/TF-IDF; Single GPU → LoRA; Multi-GPU → full fine-tuning |
30 What is the "map that does not expire" — what fundamental skills remain relevant regardless of which LLM is popular? ▾
Key Concept
Specific model names (GPT-4, LLaMA-3, Claude) will be superseded every 6–12 months. The following fundamentals do not expire:
| Skill | Why it lasts |
|---|---|
| Understanding attention & Transformers mathematically | All future models will be variants of this architecture |
| Knowing when to use which representation | Trade-offs (accuracy vs latency vs cost) are perennial |
| Evaluating models rigorously (benchmarks, metrics) | How you measure doesn't change |
| Data preprocessing & cleaning | "Garbage in, garbage out" is eternal |
| Prompt engineering principles | LLMs will always need clear, structured instructions |