1. Deep Learning vs Machine Learning
DL = multi-layered neural networks that learn features automatically from data. ML requires manual feature engineering. DL needs large data + GPU; ML works with less. DL is less interpretable (black box) but state-of-the-art on complex tasks.
2. Perceptron & Activation Functions
A perceptron computes \(z = W\cdot X + b\), then applies activation: \(y = f(z)\). Without activation, stacking layers = one linear transform.
| Function | Formula | Range | Use |
|---|---|---|---|
| Sigmoid | \(1/(1+e^{-x})\) | (0,1) | Binary output; vanishing gradient for large \|x\| |
| Tanh | \((e^x-e^{-x})/(e^x+e^{-x})\) | (−1,1) | Zero-centered; also saturates |
| ReLU | \(\max(0,x)\) | [0,∞) | Default for hidden layers; fast, no saturation for x>0 |
| Leaky ReLU | \(\max(0.01x,x)\) | (−∞,∞) | Fixes dying ReLU (always-zero neuron) |
| Softmax | \(e^{x_i}/\sum e^{x_j}\) | (0,1), sum=1 | Multi-class output |
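A quick check of these ranges in PyTorch (a minimal sketch with toy values):

```python
import torch

x = torch.linspace(-3, 3, 5)
z = 2.0 * x + 0.5                      # z = w*x + b for a single input
print(torch.sigmoid(z))                # values in (0, 1)
print(torch.tanh(z))                   # values in (-1, 1)
print(torch.relu(z))                   # negatives clipped to 0
print(torch.softmax(z, dim=0))         # non-negative, sums to 1
```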
3. Loss Functions
MSE (regression): \(\frac{1}{n}\sum (y_i - \hat{y}_i)^2\) — penalizes large errors heavily.
Binary Cross-Entropy: \(-\frac{1}{n}\sum[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]\) — classification gold standard.
Categorical Cross-Entropy: used with Softmax for multi-class.
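The corresponding PyTorch losses (a sketch with toy targets and probabilities):

```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 0.0, 1.0])
y_prob = torch.tensor([0.9, 0.2, 0.6])     # sigmoid outputs
print(nn.MSELoss()(y_prob, y_true))        # mean squared error
print(nn.BCELoss()(y_prob, y_true))        # binary cross-entropy
# Multi-class: nn.CrossEntropyLoss takes raw logits (softmax applied internally).
```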
4. Gradient Descent & Backpropagation
Update rule: \(W = W - \alpha\frac{\partial L}{\partial W}\) where α = learning rate. Backprop computes \(\frac{\partial L}{\partial W}\) via the chain rule: \(\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial W}\).
| Variant | Data per update | Tradeoff |
|---|---|---|
| Batch GD | ALL data | Stable but slow |
| SGD | 1 sample | Fast but noisy |
| Mini-batch | 32–256 | Best balance (default) |
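A minimal mini-batch gradient descent loop with autograd (toy linear regression; all shapes assumed):

```python
import torch

X = torch.randn(256, 4)                      # toy data
y = X @ torch.tensor([1.0, -2.0, 0.5, 3.0]) + 0.1 * torch.randn(256)
W = torch.zeros(4, requires_grad=True)
lr = 0.1
for epoch in range(50):
    for i in range(0, 256, 32):              # mini-batches of 32
        xb, yb = X[i:i+32], y[i:i+32]
        loss = ((xb @ W - yb) ** 2).mean()   # MSE
        loss.backward()                       # backprop fills W.grad
        with torch.no_grad():
            W -= lr * W.grad                  # W = W - α ∂L/∂W
            W.grad.zero_()
```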
5. Optimizers
| Optimizer | Key idea |
|---|---|
| SGD | Plain update \(W = W - \alpha\nabla L\) |
| Momentum | Accumulates velocity: \(v = \beta v - \alpha\nabla L\), then \(W = W + v\) |
| RMSprop | Adaptive LR per parameter: divide by √(moving avg of squared grads) |
| Adam | Momentum + RMSprop combined — best general-purpose. β₁=0.9, β₂=0.999 |
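The momentum update from the table, spelled out (toy tensors stand in for real gradients):

```python
import torch

w = torch.randn(4)
v = torch.zeros(4)
beta, lr = 0.9, 0.1
grad = torch.randn(4)                 # stand-in for ∇L
v = beta * v - lr * grad              # accumulate velocity
w = w + v                             # momentum step
# In practice: torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```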
6. Regularization
- L2 (Weight Decay): adds \(\lambda\|W\|^2\) to loss — penalizes large weights
- L1: adds \(\lambda\|W\|_1\) — promotes sparsity (zero weights)
- Dropout: randomly zeros a fraction p of neurons during training. At inference all neurons are active but outputs are scaled by (1−p); PyTorch's inverted dropout instead scales by 1/(1−p) during training. Typical p=0.2–0.5
- BatchNorm: \(\hat{x} = \frac{x-\mu_B}{\sqrt{\sigma_B^2+\varepsilon}}\), then \(y = \gamma\hat{x} + \beta\) — stabilizes, speeds up, regularizes
- Early Stopping: stop when val_loss stops decreasing
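A sketch combining these knobs (λ is the weight_decay argument; layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.BatchNorm1d(256),
                      nn.ReLU(), nn.Dropout(0.3), nn.Linear(256, 10))
# L2 (weight decay) is the λ of the loss penalty, passed to the optimizer:
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Early stopping: track the best val_loss, stop after `patience` bad epochs.
```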
7. ANN Architecture & Parameter Count
A fully-connected block: Linear → BatchNorm → ReLU → Dropout. Parameters per layer = \(in \times out + out\) (weights + biases). Example: 784→256→128→10 = 200,960 + 32,896 + 1,290 = ~235K params.
Weight initialization: Xavier for Sigmoid/Tanh (\(\sqrt{2/(in+out)}\)), He for ReLU (\(\sqrt{2/in}\)). Too small → vanishing; too large → exploding.
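Verifying the parameter count and applying He initialization (a sketch using the layer sizes from the example above):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 128), nn.ReLU(),
                      nn.Linear(128, 10))
total = sum(p.numel() for p in model.parameters())
print(total)  # 235146 = 200960 + 32896 + 1290

for m in model.modules():
    if isinstance(m, nn.Linear):                         # He init for ReLU nets
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
```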
8. Evaluation Metrics
\(\text{Accuracy} = \frac{TP+TN}{Total}\) — use when classes are balanced.
\(\text{Precision} = \frac{TP}{TP+FP}\) — when FP is costly (spam).
\(\text{Recall} = \frac{TP}{TP+FN}\) — when FN is costly (disease).
\(\text{F1} = 2\frac{P\cdot R}{P+R}\) — balance when classes are imbalanced.
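The same four metrics via scikit-learn (toy labels):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))   # (TP+TN)/Total
print(precision_score(y_true, y_pred))  # TP/(TP+FP)
print(recall_score(y_true, y_pred))     # TP/(TP+FN)
print(f1_score(y_true, y_pred))         # harmonic mean of P and R
```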
9. Training Diagnostics
| Pattern | Diagnosis | Action |
|---|---|---|
| train↓ val↓ | Good | Continue |
| train↓ val↑ | Overfitting | More dropout, reduce capacity, early stop |
| Both high | Underfitting | Increase capacity, train longer |
| Large gap | Overfitting | More regularization |
1. Why CNN? (vs ANN on images)
A 224×224×3 image has 150,528 input values. A single fully-connected layer of 1024 units would need 154M parameters — completely impractical. CNNs solve this with: local connectivity (each neuron sees a patch), weight sharing (same filter slides across the image), and hierarchical learning (edges → textures → objects).
2. Conv2d Output Size
\(W_{out} = \lfloor\frac{W_{in} - K + 2P}{S}\rfloor + 1\)
Examples: 28×28, K=3, P=1, S=1 → 28 (same). 28×28, K=3, P=0, S=1 → 26 (shrinks). 28×28, K=2, S=2 → 14 (halved).
3. Conv2d Parameters
\(\text{Params} = (K \times K \times C_{in} + 1) \times C_{out}\)
Conv2d(3→32, K=3): (3×3×3+1)×32 = 896 parameters — orders of magnitude fewer than ANN.
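A sketch that checks both formulas (conv_out is a helper defined here, not a library function):

```python
import torch.nn as nn

def conv_out(w_in, k, p=0, s=1):
    return (w_in - k + 2 * p) // s + 1

print(conv_out(28, 3, p=1))  # 28 ("same")
print(conv_out(28, 3, p=0))  # 26 (shrinks)
print(conv_out(28, 2, s=2))  # 14 (halved)

conv = nn.Conv2d(3, 32, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # (3*3*3+1)*32 = 896
```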
4. Padding & Pooling
- padding=0 (valid): no padding, spatial size shrinks each layer
- padding=1 (same, K=3): 1 border of zeros, output = input size
- MaxPool2d(2,2): takes max in 2×2 window, stride 2 → halves dimensions
- AdaptiveAvgPool2d((1,1)): squeezes any spatial map to 1×1 (global pooling)
5. Standard CNN Block
Conv2d → BatchNorm2d → ReLU → MaxPool2d(2) → Dropout(p)
Conv extracts features, BN normalizes, ReLU adds non-linearity, MaxPool downsamples, Dropout regularizes.
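One such block in PyTorch (channel counts are illustrative):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),  # 28x28 -> 28x28
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Dropout(0.25),
)
```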
6. Architecture Patterns (from lab)
| Model | Domain | Structure | Final dims |
|---|---|---|---|
| DigitCNN | MNIST (28×28×1) | 2 conv blocks + 2 FC | 1→32→64 → 3136 → 128 → 10 |
| ChestCNN | X-ray (224×224×3) | 4 conv blocks + 2 FC | 3→32→64→128→256 → 256 →128 →2 |
Why Dropout 0.25 in conv blocks, 0.5 in classifier? Dense layers have far more parameters → higher overfitting risk → need stronger regularization.
7. Transfer Learning — 5 Architectures
| Model | Year | Params | Innovation | Head |
|---|---|---|---|---|
| VGG16 | 2014 | 138M | Uniform 3×3 convs, deep | classifier[6] |
| GoogLeNet | 2014 | 6.8M | Inception (multi-scale in parallel) | fc |
| ResNet50 | 2015 | 25M | Skip connections (F(x)+x) | fc |
| MobileNetV2 | 2018 | 3.4M | Depthwise separable conv (8× fewer ops) | classifier[1] |
| EfficientNet | 2019 | 5.3M | Compound scaling (depth+width+resolution) | classifier[1] |
Fine-tuning: unfreeze all or some layers. Use when the dataset is larger or the domain differs from ImageNet. LR ~ 1e-5 (pre-trained weights are fragile).
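A hedged sketch of feature extraction with ResNet50 (assumes a recent torchvision with the weights API):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                        # freeze the backbone
model.fc = nn.Linear(model.fc.in_features, 2)      # new head ("fc" from the table)
# Fine-tuning: re-enable requires_grad on some/all layers and drop LR to ~1e-5.
```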
8. Data Augmentation
Applied to training set only. Never val/test. Standard: RandomHorizontalFlip, RandomRotation(10°). Never use vertical flip for medical (anatomically wrong). Normalize with ImageNet stats: mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225].
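A typical transform pair under these rules (a sketch; augmentation on train only):

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
val_tf = transforms.Compose([                      # no augmentation on val/test
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```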
9. Class Imbalance
Two solutions: WeightedRandomSampler (each batch ~balanced) or weighted CrossEntropyLoss (penalize minority mistakes more). For medical: prioritize Recall (Sensitivity) — missing a pneumonia case is fatal.
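Sketches of both options (the class weights and toy labels are hypothetical):

```python
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# Option 1: weighted loss, penalizing minority-class mistakes more
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 3.0]))

# Option 2: weighted sampling, so each batch is roughly class-balanced
labels = torch.tensor([0, 0, 0, 1])                # toy labels
class_count = torch.bincount(labels).float()
sample_w = 1.0 / class_count[labels]               # rare classes sampled more
sampler = WeightedRandomSampler(sample_w, num_samples=len(labels))
```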
1. Why Recurrent Networks?
ANNs/CNNs assume independent inputs. Sequences (text, time series, speech) have temporal dependencies — "not good" ≠ "good not". RNNs maintain a hidden state that carries information across time steps.
2. Vanilla RNN
\(h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b_h)\)
\(y_t = W_{hy}h_t + b_y\)
W_hh and W_xh are shared across all time steps — same number of parameters regardless of sequence length.
3. The Vanishing Gradient Problem
In Backpropagation Through Time (BPTT), gradients flow backward through T time steps, multiplied by W_hh at each step. If the spectral radius (largest eigenvalue magnitude) of W_hh is < 1 → gradients vanish (early time steps have no influence). If > 1 → gradients explode. Consequence: vanilla RNNs cannot learn long-range dependencies (>20–30 steps).
4. LSTM — The Solution
LSTM adds a cell state c_t (long-term memory) and 3 gates that control information flow:
| Gate | Formula | Role |
|---|---|---|
| Forget | \(f_t = \sigma(W_f[h_{t-1},x_t] + b_f)\) | What to erase from c_{t-1} |
| Input | \(i_t = \sigma(W_i[h_{t-1},x_t] + b_i)\) | What new info to store |
| Candidate | \(g_t = \tanh(W_g[h_{t-1},x_t] + b_g)\) | New candidate values |
| Output | \(o_t = \sigma(W_o[h_{t-1},x_t] + b_o)\) | What to expose from c_t |
Cell state update: \(c_t = f_t \odot c_{t-1} + i_t \odot g_t\) — addition creates a gradient highway, solving vanishing gradients.
Hidden state: \(h_t = o_t \odot \tanh(c_t)\)
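The four equations as one cell step (a from-scratch sketch; weight shapes \(W \in \mathbb{R}^{(h+x)\times h}\) assumed):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wg, Wo, bf, bi, bg, bo):
    hx = torch.cat([h_prev, x_t], dim=-1)          # [h_{t-1}, x_t]
    f = torch.sigmoid(hx @ Wf + bf)                # forget gate
    i = torch.sigmoid(hx @ Wi + bi)                # input gate
    g = torch.tanh(hx @ Wg + bg)                   # candidate
    o = torch.sigmoid(hx @ Wo + bo)                # output gate
    c = f * c_prev + i * g                         # additive gradient highway
    h = o * torch.tanh(c)
    return h, c
```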
5. GRU — Simplified LSTM
GRU merges cell+hidden into one state, uses 2 gates (reset + update) instead of 3. Fewer parameters (~3/4 of LSTM), often matches LSTM performance.
\(z_t = \sigma(W_z[h_{t-1},x_t])\) (update gate — interpolates old vs new)
\(r_t = \sigma(W_r[h_{t-1},x_t])\) (reset gate — how much past to use)
\(n_t = \tanh(W_n[r_t \odot h_{t-1}, x_t])\) (candidate state)
\(h_t = (1-z_t) \odot h_{t-1} + z_t \odot n_t\)
6. RNN vs LSTM vs GRU
| Aspect | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| States | h_t only | h_t + c_t | h_t only |
| Gates | 0 | 3 | 2 |
| Parameters | Fewest | Most (~4× RNN) | Middle (~3× RNN) |
| Long-range | Poor | Excellent | Good |
7. Gradient Clipping
nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0) — apply AFTER loss.backward(), BEFORE optimizer.step(). Essential for RNNs to prevent exploding gradients.
8. PyTorch RNN Differences
| Module | Returns | Initial state |
|---|---|---|
| nn.RNN | (output, h_n) | h0 only |
| nn.LSTM | (output, (h_n, c_n)) | Tuple (h0, c0) ← DIFFERENT! |
| nn.GRU | (output, h_n) | h0 only (same as RNN) |
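A sketch tying the two sections above together: the LSTM state tuple, with clipping placed after backward() and before step() (toy dimensions):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()))

x = torch.randn(4, 10, 8)                          # (batch, seq, features)
output, (h_n, c_n) = lstm(x)                       # LSTM returns a state TUPLE
loss = head(output[:, -1]).mean()                  # toy loss on the last step
loss.backward()
nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=5.0)  # after backward...
opt.step()                                         # ...before step
```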
9. Text Generation Pipeline
Language model predicts \(P(w_t \mid w_{<t})\): a distribution over the next word given all previous words.
Generation: autoregressive — feed seed, get distribution over next word, sample, append, repeat. Temperature: T<1 = sharper (repetitive), T>1 = flatter (diverse). Top-k: restrict to top-k candidates (k=40 typical).
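A minimal temperature + top-k sampler (assumes a 1-D logits vector):

```python
import torch

def sample_next(logits, temperature=1.0, top_k=40):
    logits = logits / temperature                  # T<1 sharpens, T>1 flattens
    top_vals, top_idx = torch.topk(logits, top_k)  # restrict to top-k candidates
    probs = torch.softmax(top_vals, dim=-1)
    return top_idx[torch.multinomial(probs, 1)]    # sample one token id
```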
10. Perplexity
\(\text{PPL} = \exp(\text{cross_entropy_loss})\). Lower = less surprised = better model. Random = V (vocab size), good English LM = 20–100.
11. Time Series Forecasting
Preprocessing: MinMaxScaler (fit ONLY on train), sliding window construction (SEQ_LEN days → next value). NEVER shuffle time series — use chronological split (train=first 70%, val=next 15%, test=last 15%). Shuffling causes look-ahead bias.
ACF (Autocorrelation Function): correlation between y_t and y_{t-k}. Use to choose SEQ_LEN — where ACF drops below 0.5.
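Sliding windows plus a chronological split (a sketch on a toy series; scale with a MinMaxScaler fit on train only):

```python
import numpy as np

def make_windows(series, seq_len):
    X = np.stack([series[i:i+seq_len] for i in range(len(series) - seq_len)])
    y = series[seq_len:]                           # next value after each window
    return X, y

series = np.arange(100, dtype=float)
n = len(series)                                    # chronological, never shuffled
train, val, test = series[:int(.7*n)], series[int(.7*n):int(.85*n)], series[int(.85*n):]
X_train, y_train = make_windows(train, seq_len=7)
```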
1. The 4-Stage NLP Pipeline
Clean → Tokenize → Remove stop words → Stem/Lemmatize. Each stage is detailed below.
2. Stage 1: Text Cleaning
Pipeline: lowercase → remove HTML (<[^>]+>) → remove URLs → remove punctuation → normalize whitespace (\s+ → single space). Optional: remove numbers.
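The same pipeline as a function (regexes match those above):

```python
import re

def clean(text):
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)           # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)      # strip URLs
    text = re.sub(r"[^\w\s]", " ", text)           # strip punctuation
    return re.sub(r"\s+", " ", text).strip()       # normalize whitespace

print(clean("<b>Great</b> movie!!  See https://example.com"))  # "great movie see"
```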
3. Stage 2: Tokenization
Sentence tokenization: split paragraph into sentences. Word tokenization: split sentence into words. Penn Treebank style treats punctuation as separate tokens: "it's" → ["it", "'s"].
4. Stage 3: Stop Word Removal
Remove high-frequency, low-information words (the, is, at, by...). When NOT to remove: sentiment analysis ("not good"), authorship attribution (stop words are style markers), neural models (learn importance automatically).
5. Stage 4: Stemming vs Lemmatization
| Method | Mechanism | Output | Example | Speed |
|---|---|---|---|---|
| PorterStemmer | Rule-based suffix stripping | Often non-word | "studies" → "studi" | Fast |
| WordNetLemmatizer | Dictionary lookup + POS | Valid word | "ran" → "run" | Slower |
Critical: WordNetLemmatizer needs pos='v' for verbs! lemmatize("ran") → "ran" (wrong), lemmatize("ran", pos='v') → "run" (correct).
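The NLTK calls behind this (assumes wordnet data has been downloaded):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# one-time setup: import nltk; nltk.download("wordnet")

print(PorterStemmer().stem("studies"))             # "studi" (non-word)
lem = WordNetLemmatizer()
print(lem.lemmatize("ran"))                        # "ran" (defaults to noun POS)
print(lem.lemmatize("ran", pos="v"))               # "run"
```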
6. Context-Free Grammar (CFG)
Production rules describe sentence structure: S → NP VP, NP → Det N, VP → V NP, PP → P NP. Parse tree for "the cat sat on the mat": S dominates NP("the cat") + VP("sat on the mat").
7. POS Tagging
| Tag | Category | Example | Penn Treebank |
|---|---|---|---|
| NOUN | Noun | "model" | NN, NNS, NNP |
| VERB | Verb | "trained" | VB, VBD, VBZ |
| ADJ | Adjective | "deep" | JJ |
| ADP | Preposition | "on", "with" | IN |
8. NER (Named Entity Recognition)
Labels: PERSON, ORG, GPE (countries/cities), DATE, MONEY, EVENT. spaCy: doc.ents returns entity spans with ent.label_.
9. Dependency Parsing
Reveals grammatical structure as a directed tree. Key labels: nsubj (subject), ROOT (main verb), dobj (direct object), amod (adjective modifier). Extract SVO: find ROOT → nsubj (in lefts) → dobj (in rights).
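SVO extraction as described, in spaCy (assumes en_core_web_sm is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The model predicted the labels.")
root = [t for t in doc if t.dep_ == "ROOT"][0]           # main verb
subj = [t for t in root.lefts if t.dep_ == "nsubj"]      # subject on the left
obj = [t for t in root.rights if t.dep_ == "dobj"]       # object on the right
print(subj, root, obj)                                   # [model] predicted [labels]
```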
10. spaCy vs NLTK
| Feature | spaCy | NLTK |
|---|---|---|
| Speed | Fast (C-optimized) | Slower (Python) |
| POS/NER/Parsing | Built-in, production quality | Educational, needs setup |
| Best for | Production NLP | Learning/research |
1. The Representation Problem
ML models need numerical input. Text is symbolic. The representation hierarchy: one-hot → Bag-of-Words → N-grams → TF-IDF, each level fixing a limitation of the previous (covered below).
2. One-Hot Encoding
Vector of size |V| with 1 at word's index. Problems: dimensionality = 50K+, 99.99% sparse, no semantic similarity — cosine("cat","dog") = cosine("cat","table") = 0.
3. Bag-of-Words (BoW)
Document = vector of word counts (order discarded). Uses CountVectorizer(max_features, min_df, max_df, ngram_range). Typically >99% sparse. Limitation: "The dog bit the man" = "The man bit the dog" (word order lost).
4. N-grams
Contiguous sequences of N words. Bigram: "not good" as one feature → captures negation. Tradeoff: bigrams grow vocab ~10×, trigrams ~100×. Performance peaks at bigrams.
5. TF-IDF — Weighted Importance
\(\text{TF}(t,d) = \frac{\text{count}(t,d)}{|d|}\) — how frequent in this doc.
\(\text{IDF}(t) = \log\frac{N}{1 + df(t)} + 1\) — how rare across all docs.
\(\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)\)
| Word | TF | IDF | TF-IDF | Interpretation |
|---|---|---|---|---|
| "the" | 0.15 | 0.1 | 0.015 | Very low — appears everywhere |
| "film" | 0.05 | 2.3 | 0.115 | Medium — domain specific |
| "brilliant" | 0.02 | 4.5 | 0.090 | High — rare, discriminative |
sublinear_tf=True: replaces TF with log(1+TF) — 100× frequency ≠ 100× importance.
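A minimal TfidfVectorizer setup combining TF-IDF with bigrams (toy corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the film was brilliant", "the film was not good"]
vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)  # unigrams + bigrams
X = vec.fit_transform(docs)                        # sparse doc-term matrix
print(X.shape, vec.get_feature_names_out()[:5])
```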
6. IMDb Benchmarks (Classical Methods)
7. Why TF-IDF > BoW
- Common words (stopwords) are down-weighted by low IDF
- Rare, discriminative words are amplified by high IDF
- Document length normalized (TF is relative, not absolute count)
8. Why N-grams > BoW
- "not good" as a single feature captures negation (BoW can't)
- Phrasal patterns captured: specific sentiment bigrams
9. Feature Inspection
For logistic regression: clf.coef_[0] gives weight per feature. High positive → strongly positive sentiment ("brilliant", "excellent"). High negative → strongly negative ("terrible", "awful").
1. Why Embeddings?
Classical methods are sparse, high-dimensional, semantically blind. Word embeddings produce dense, low-dimensional, semantically meaningful vectors. Famous property: \(\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\).
2. The Distributional Hypothesis
"Words that appear in similar contexts have similar meanings" (Harris, 1954). If "brilliant" and "superb" appear in the same contexts, their vectors should be close.
3. Word2Vec (Mikolov, 2013)
Trains a shallow neural network on context prediction. Two modes:
CBOW: context → target word (faster).
Skip-gram: target → context (better for rare words).
| Parameter | Meaning | Typical |
|---|---|---|
| vector_size | Embedding dimension | 100–300 |
| window | Context words on each side | 3–5 |
| sg | 1=Skip-gram, 0=CBOW | 1 |
| min_count | Ignore words below this frequency | 1–5 |
Document vector = mean pooling: average of all word vectors in the document. Problem: all words get equal weight — "not" = "film" = "the". This is why TF-IDF often beats Word2Vec on classification.
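Training plus mean pooling in gensim (a sketch; parameters from the table):

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [["the", "film", "was", "brilliant"], ["a", "superb", "film"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)

def doc_vector(tokens, model):                     # mean pooling: every word equal
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)
```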
4. GloVe (Stanford, 2014)
Builds a global co-occurrence matrix X[i,j] = how often word j appears near word i (weighted by 1/distance). Vectors are then learned so that \(w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij}\) — a weighted least-squares factorization of the log co-occurrence matrix. Better at capturing global statistics; Word2Vec better at local patterns.
5. FastText (Facebook, 2016)
Key innovation: subword embeddings. Decomposes word into character n-grams (min_n=3, max_n=6):
"acting" → ["<ac", "act", "cti", "tin", "ing", "ng>", "<acting>"]
\(\vec{\text{word}} = \sum \vec{\text{subword}}\)
OOV solved: even unseen words get a vector via shared subwords. Best for noisy text, morphologically rich languages, rare domain terms.
6. Method Comparison
| Method | Semantic | Context | OOV | Training |
|---|---|---|---|---|
| Word2Vec | ✅ | ❌ (one vector/word) | ❌ (zero vector) | Neural, local windows |
| GloVe | ✅ | ❌ | ❌ | Matrix factorization, global |
| FastText | ✅ | ❌ | ✅ (subwords) | Neural, local + subword |
| BERT | ✅ | ✅ (contextual) | ✅ | Transformer, bidirectional |
7. IMDb Benchmarks (Embedding Methods)
Why TF-IDF (90.13%) > Word2Vec on IMDb? Mean pooling gives equal weight to all words — "the" and "brilliant" contribute equally. TF-IDF naturally weights by importance. Key sentiment words get diluted by mean pooling.
8. Cosine Similarity
\(\cos(a,b) = \frac{a \cdot b}{\|a\| \times \|b\|}\). Ignores magnitude, focuses on direction. Near-synonyms > 0.8, related words 0.5–0.8, unrelated < 0.3. Better than Euclidean for word vectors (length ≠ strength of meaning).
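As a one-liner (numpy sketch):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```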
9. The Decision Tree
Semantic similarity important? → Word2Vec / GloVe
Very small corpus (<1k docs)? → TF-IDF (embeddings need data)
Always: start with TF-IDF as baseline. Justify complexity with measurable gains.
1. Why Transformers? (vs RNNs)
RNNs have two critical limits: (1) sequential bottleneck — step t depends on step t−1, cannot parallelize; (2) vanishing gradients — even LSTM struggles beyond ~100 tokens. Transformers solve both: all positions processed in parallel, self-attention directly connects any two positions in O(1).
2. Scaled Dot-Product Attention
\(\text{Attention}(Q, K, V) = \text{softmax}\!\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V\)
Q (Query): "what am I looking for?" K (Key): "what do I contain?" V (Value): "what do I contribute?"
The \(\sqrt{d_k}\) scaling prevents dot products from growing large and pushing softmax into saturated (near-zero gradient) regions.
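The formula as code (a from-scratch sketch; single head, toy shapes):

```python
import math
import torch

def attention(Q, K, V, mask=None):
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # QK^T / sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))      # causal masking
    return torch.softmax(scores, dim=-1) @ V                  # weights times values

Q = K = V = torch.randn(2, 5, 64)                  # (batch, seq, d_k)
out = attention(Q, K, V)                           # (2, 5, 64)
```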
3. Multi-Head Attention
\(\text{MHA}(Q,K,V) = \text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O\) where each head = Attention with different learned projections. Each head can focus on different aspects simultaneously: syntax, coreference, semantics.
4. Transformer Encoder Block
Two sub-layers, both with residual + LayerNorm:
\(\text{output} = \text{LayerNorm}(x + \text{sublayer}(x))\). FFN = two linear layers with ReLU: \(\max(0, xW_1+b_1)W_2+b_2\). Hidden dim typically 4× model dim — most parameters live here.
5. Positional Encoding & Causal Masking
Positional encoding: attention is permutation-invariant (tokens as a set). Add sinusoidal position vectors: \(PE(pos,2i)=\sin(pos/10000^{2i/d})\), \(PE(pos,2i+1)=\cos(pos/10000^{2i/d})\).
Causal masking: in decoder self-attention, set positions j > i to −∞ before softmax → token i can only see tokens ≤ i. Used in GPT (autoregressive generation).
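Building the causal mask (usable as the mask argument in the attention sketch above):

```python
import torch

T = 5
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
# True above the diagonal (j > i): masked to -inf, so token i only sees j <= i
```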
6. Decoding Strategies
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Greedy | Pick max prob token | Fast, deterministic | Often suboptimal |
| Beam search | Keep top-k partial seqs | Higher quality | Slower, less diverse |
| Temperature T<1 | Sharpen distribution | More coherent | Repetitive |
| Temperature T>1 | Flatten distribution | More creative/diverse | Less coherent |
7. BERT vs GPT (Architecture)
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Training | Masked LM (bidirectional) | Autoregressive (next token) |
| Context | Sees full sequence (left+right) | Sees only past tokens |
| Best for | Classification, extraction | Generation |
| Parameters | 110M (base) | 175B+ (GPT-3) |
8. BERT Details
BERT-base: 12 encoder layers, 12 heads, d=768, ~110M params. Max 512 tokens.
Pre-training: Masked LM (predict 15% masked tokens) + Next Sentence Prediction (sentence B follows A?).
Tokenization: WordPiece — "unbelievable" → ["un","##believ","##able"]. Special tokens: [CLS], [SEP], [PAD], [MASK].
Fine-tuning: small LR (2e-5 to 5e-5), warmup steps (500), 3 epochs usually enough. Frozen BERT (feature extraction) often worse than TF-IDF — BERT is designed to be fine-tuned.
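WordPiece in action (assumes HuggingFace transformers; the exact subword pieces may differ from the example above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("unbelievable", return_tensors="pt")
print(tok.convert_ids_to_tokens(enc["input_ids"][0]))
# e.g. ['[CLS]', 'un', '##bel', ..., '[SEP]'] with special tokens added
```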
9. LLM Training Stages
Pre-training: next-token prediction on trillions of tokens — broad world knowledge.
SFT: Supervised Fine-Tuning on curated instruction–response pairs — teaches following instructions.
RLHF: train reward model on human preferences, then fine-tune LLM with PPO + KL penalty against SFT model — aligns with human values.
10. LoRA (Low-Rank Adaptation)
\(W' = W + \frac{\alpha}{r}AB\) where \(A \in \mathbb{R}^{d \times r}, B \in \mathbb{R}^{r \times k}\) with \(r \ll \min(d,k)\). Only \(r(d+k)\) trainable params vs \(dk\) for full fine-tuning. For 4096×4096 with r=8: 65K vs 16.8M — a 256× reduction.
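A from-scratch LoRA wrapper matching the formula (a sketch, not the peft library implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                # freeze W
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))  # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)  # W'x = Wx + (α/r)ABx
```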
11. Prompting Strategies
- Zero-shot: task description only, no examples. Works on large well-trained models.
- Few-shot: include k demonstration pairs before query. Guides format + reasoning, no weight updates.
- Chain-of-Thought (CoT): include step-by-step reasoning in demonstrations (or append "Let's think step by step"). Dramatically improves multi-step reasoning.
12. RAG (Retrieval-Augmented Generation)
Grounds the LLM in external documents to reduce hallucination. The pipeline:
Chunk docs → embed into vector DB → embed query → retrieve top-k similar chunks → prepend to prompt → LLM generates grounded answer → return sources.
13. Hallucination
LLMs generate plausible-sounding but factually incorrect content because they're trained to predict fluent continuations, not verify facts. Mitigations: RAG (ground in documents), RLHF honesty training.
14. The Full IMDb Hierarchy
15. The Representation Hierarchy
Each level fixes one limitation of the previous. BERT is the only method that has contextual representations — the same word gets different vectors depending on its surrounding words, solving polysemy.
16. Tools & Ecosystem
- HuggingFace Transformers: unified API for thousands of pre-trained models. AutoTokenizer, AutoModel, pipeline(), Trainer.
- Ollama: run quantized open-source LLMs locally (4-bit quantization). ollama run llama3. Privacy-preserving, no cloud costs.
- LangChain: compose LLM pipelines with standard invoke(input) → output interface. Chains prompt templates, LLMs, retrievers, tools.
- Vector DBs (FAISS, Pinecone, Chroma): fast approximate nearest-neighbor search for dense embeddings. Required for RAG retrieval.
17. Model Selection — 5 Questions
1. Does the task need semantic/contextual understanding, or does a TF-IDF baseline suffice?
2. Latency/memory constraints? (TF-IDF = 1ms; BERT = 370ms CPU)
3. Is labeled fine-tuning data available?
4. Privacy / data residency requirements? (on-prem? Ollama?)
5. Compute budget for training + inference?
18. The Map That Does Not Expire
Understand why architectures work — the math behind attention, gradient flow, representation learning — not today's model names. Models change yearly; the underlying principles (information bottlenecks, optimization landscapes, inductive biases) are permanent.