1. Deep Learning vs Machine Learning
DL = multi-layered neural networks that learn features automatically from data. ML requires manual feature engineering. DL needs large data + GPU; ML works with less. DL is less interpretable (black box) but state-of-the-art on complex tasks.
2. Perceptron & Activation Functions
A perceptron computes \(z = W\cdot X + b\), then applies activation: \(y = f(z)\). Without activation, stacking layers = one linear transform.
| Function | Formula | Range | Use |
|---|---|---|---|
| Sigmoid | \(1/(1+e^{-x})\) | (0,1) | Binary output; vanishing gradient for large \|x\| |
| Tanh | \((e^x-e^{-x})/(e^x+e^{-x})\) | (−1,1) | Zero-centered; also saturates |
| ReLU | \(\max(0,x)\) | [0,∞) | Default for hidden layers; fast, no saturation for x>0 |
| Leaky ReLU | \(\max(0.01x,x)\) | (−∞,∞) | Fixes dying ReLU (always-zero neuron) |
| Softmax | \(e^{x_i}/\sum e^{x_j}\) | (0,1), sum=1 | Multi-class output |
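A quick check of these ranges in PyTorch (a minimal sketch with toy values):

```python
import torch

x = torch.linspace(-3, 3, 5)
z = 2.0 * x + 0.5                      # z = w*x + b for a single input
print(torch.sigmoid(z))                # values in (0, 1)
print(torch.tanh(z))                   # values in (-1, 1)
print(torch.relu(z))                   # negatives clipped to 0
print(torch.softmax(z, dim=0))         # non-negative, sums to 1
```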
3. Loss Functions
MSE (regression): \(\frac{1}{n}\sum (y_i - \hat{y}_i)^2\) — penalizes large errors heavily.
Binary Cross-Entropy: \(-\frac{1}{n}\sum[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]\) — classification gold standard.
Categorical Cross-Entropy: used with Softmax for multi-class.
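The corresponding PyTorch losses (a sketch with toy targets and probabilities):

```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 0.0, 1.0])
y_prob = torch.tensor([0.9, 0.2, 0.6])     # sigmoid outputs
print(nn.MSELoss()(y_prob, y_true))        # mean squared error
print(nn.BCELoss()(y_prob, y_true))        # binary cross-entropy
# Multi-class: nn.CrossEntropyLoss takes raw logits (softmax applied internally).
```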
4. Gradient Descent & Backpropagation
Update rule: \(W = W - \alpha\frac{\partial L}{\partial W}\) where α = learning rate. Backprop computes \(\frac{\partial L}{\partial W}\) via the chain rule: \(\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial W}\).
| Variant | Data per update | Tradeoff |
|---|---|---|
| Batch GD | ALL data | Stable but slow |
| SGD | 1 sample | Fast but noisy |
| Mini-batch | 32–256 | Best balance (default) |
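A minimal mini-batch gradient descent loop with autograd (toy linear regression; all shapes assumed):

```python
import torch

X = torch.randn(256, 4)                      # toy data
y = X @ torch.tensor([1.0, -2.0, 0.5, 3.0]) + 0.1 * torch.randn(256)
W = torch.zeros(4, requires_grad=True)
lr = 0.1
for epoch in range(50):
    for i in range(0, 256, 32):              # mini-batches of 32
        xb, yb = X[i:i+32], y[i:i+32]
        loss = ((xb @ W - yb) ** 2).mean()   # MSE
        loss.backward()                       # backprop fills W.grad
        with torch.no_grad():
            W -= lr * W.grad                  # W = W - α ∂L/∂W
            W.grad.zero_()
```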
5. Optimizers
| Optimizer | Key idea |
|---|---|
| SGD | Plain update \(W = W - \alpha\nabla L\) |
| Momentum | Accumulates velocity: \(v = \beta v - \alpha\nabla L\), then \(W = W + v\) |
| RMSprop | Adaptive LR per parameter: divide by √(moving avg of squared grads) |
| Adam | Momentum + RMSprop combined — best general-purpose. β₁=0.9, β₂=0.999 |
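The momentum update from the table, spelled out (toy tensors stand in for real gradients):

```python
import torch

w = torch.randn(4)
v = torch.zeros(4)
beta, lr = 0.9, 0.1
grad = torch.randn(4)                 # stand-in for ∇L
v = beta * v - lr * grad              # accumulate velocity
w = w + v                             # momentum step
# In practice: torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```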
6. Regularization
- L2 (Weight Decay): adds \(\lambda\|W\|^2\) to loss — penalizes large weights
- L1: adds \(\lambda\|W\|_1\) — promotes sparsity (zero weights)
- Dropout: randomly zeros a fraction p of neurons during training. At inference all neurons are active but outputs are scaled by (1−p); PyTorch's inverted dropout instead scales by 1/(1−p) during training. Typical p=0.2–0.5
- BatchNorm: \(\hat{x} = \frac{x-\mu_B}{\sqrt{\sigma_B^2+\varepsilon}}\), then \(y = \gamma\hat{x} + \beta\) — stabilizes, speeds up, regularizes
- Early Stopping: stop when val_loss stops decreasing
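A sketch combining these knobs (λ is the weight_decay argument; layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.BatchNorm1d(256),
                      nn.ReLU(), nn.Dropout(0.3), nn.Linear(256, 10))
# L2 (weight decay) is the λ of the loss penalty, passed to the optimizer:
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Early stopping: track the best val_loss, stop after `patience` bad epochs.
```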
7. ANN Architecture & Parameter Count
A fully-connected block: Linear → BatchNorm → ReLU → Dropout. Parameters per layer = \(in \times out + out\) (weights + biases). Example: 784→256→128→10 = 200,960 + 32,896 + 1,290 = ~235K params.
Weight initialization: Xavier for Sigmoid/Tanh (\(\sqrt{2/(in+out)}\)), He for ReLU (\(\sqrt{2/in}\)). Too small → vanishing; too large → exploding.
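Verifying the parameter count and applying He initialization (a sketch using the layer sizes from the example above):

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 128), nn.ReLU(),
                      nn.Linear(128, 10))
total = sum(p.numel() for p in model.parameters())
print(total)  # 235146 = 200960 + 32896 + 1290

for m in model.modules():
    if isinstance(m, nn.Linear):                         # He init for ReLU nets
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
```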
8. Evaluation Metrics
\(\text{Accuracy} = \frac{TP+TN}{Total}\) — use when classes are balanced.
\(\text{Precision} = \frac{TP}{TP+FP}\) — when FP is costly (spam).
\(\text{Recall} = \frac{TP}{TP+FN}\) — when FN is costly (disease).
\(\text{F1} = 2\frac{P\cdot R}{P+R}\) — balance when classes are imbalanced.
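The same four metrics via scikit-learn (toy labels):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))   # (TP+TN)/Total
print(precision_score(y_true, y_pred))  # TP/(TP+FP)
print(recall_score(y_true, y_pred))     # TP/(TP+FN)
print(f1_score(y_true, y_pred))         # harmonic mean of P and R
```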
9. Training Diagnostics
| Pattern | Diagnosis | Action |
|---|---|---|
| train↓ val↓ | Good | Continue |
| train↓ val↑ | Overfitting | More dropout, reduce capacity, early stop |
| Both high | Underfitting | Increase capacity, train longer |
| Large gap | Overfitting | More regularization |
1. Why CNN? (vs ANN on images)
A 224×224×3 image has 150,528 input values. A single fully-connected layer of 1024 units would need 154M parameters — completely impractical. CNNs solve this with: local connectivity (each neuron sees a patch), weight sharing (same filter slides across the image), and hierarchical learning (edges → textures → objects).
2. Conv2d Output Size
\(W_{out} = \lfloor\frac{W_{in} - K + 2P}{S}\rfloor + 1\)
Examples: 28×28, K=3, P=1, S=1 → 28 (same). 28×28, K=3, P=0, S=1 → 26 (shrinks). 28×28, K=2, S=2 → 14 (halved).
3. Conv2d Parameters
\(\text{Params} = (K \times K \times C_{in} + 1) \times C_{out}\)
Conv2d(3→32, K=3): (3×3×3+1)×32 = 896 parameters — orders of magnitude fewer than ANN.
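A sketch that checks both formulas (conv_out is a helper defined here, not a library function):

```python
import torch.nn as nn

def conv_out(w_in, k, p=0, s=1):
    return (w_in - k + 2 * p) // s + 1

print(conv_out(28, 3, p=1))  # 28 ("same")
print(conv_out(28, 3, p=0))  # 26 (shrinks)
print(conv_out(28, 2, s=2))  # 14 (halved)

conv = nn.Conv2d(3, 32, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # (3*3*3+1)*32 = 896
```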
4. Padding & Pooling
- padding=0 (valid): no padding, spatial size shrinks each layer
- padding=1 (same, K=3): 1 border of zeros, output = input size
- MaxPool2d(2,2): takes max in 2×2 window, stride 2 → halves dimensions
- AdaptiveAvgPool2d((1,1)): squeezes any spatial map to 1×1 (global pooling)
5. Standard CNN Block
Conv2d → BatchNorm2d → ReLU → MaxPool2d(2) → Dropout(p)
Conv extracts features, BN normalizes, ReLU adds non-linearity, MaxPool downsamples, Dropout regularizes.
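One such block in PyTorch (channel counts are illustrative):

```python
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),  # 28x28 -> 28x28
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Dropout(0.25),
)
```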
6. Architecture Patterns (from lab)
| Model | Domain | Structure | Final dims |
|---|---|---|---|
| DigitCNN | MNIST (28×28×1) | 2 conv blocks + 2 FC | 1→32→64 → 3136 → 128 → 10 |
| ChestCNN | X-ray (224×224×3) | 4 conv blocks + 2 FC | 3→32→64→128→256 → 256 →128 →2 |
Why Dropout 0.25 in conv blocks, 0.5 in classifier? Dense layers have far more parameters → higher overfitting risk → need stronger regularization.
7. Transfer Learning — 5 Architectures
| Model | Year | Params | Innovation | Head |
|---|---|---|---|---|
| VGG16 | 2014 | 138M | Uniform 3×3 convs, deep | classifier[6] |
| GoogLeNet | 2014 | 6.8M | Inception (multi-scale in parallel) | fc |
| ResNet50 | 2015 | 25M | Skip connections (F(x)+x) | fc |
| MobileNetV2 | 2018 | 3.4M | Depthwise separable conv (8× fewer ops) | classifier[1] |
| EfficientNet | 2019 | 5.3M | Compound scaling (depth+width+resolution) | classifier[1] |
Fine-tuning: unfreeze all or some layers. Use when the dataset is larger or the domain differs from ImageNet. LR ~ 1e-5 (pre-trained weights are fragile).
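A hedged sketch of feature extraction with ResNet50 (assumes a recent torchvision with the weights API):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                        # freeze the backbone
model.fc = nn.Linear(model.fc.in_features, 2)      # new head ("fc" from the table)
# Fine-tuning: re-enable requires_grad on some/all layers and drop LR to ~1e-5.
```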
8. Data Augmentation
Applied to training set only. Never val/test. Standard: RandomHorizontalFlip, RandomRotation(10°). Never use vertical flip for medical (anatomically wrong). Normalize with ImageNet stats: mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225].
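A typical transform pair under these rules (a sketch; augmentation on train only):

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
val_tf = transforms.Compose([                      # no augmentation on val/test
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```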
9. Class Imbalance
Two solutions: WeightedRandomSampler (each batch ~balanced) or weighted CrossEntropyLoss (penalize minority mistakes more). For medical: prioritize Recall (Sensitivity) — missing a pneumonia case is fatal.
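Sketches of both options (the class weights and toy labels are hypothetical):

```python
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

# Option 1: weighted loss, penalizing minority-class mistakes more
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 3.0]))

# Option 2: weighted sampling, so each batch is roughly class-balanced
labels = torch.tensor([0, 0, 0, 1])                # toy labels
class_count = torch.bincount(labels).float()
sample_w = 1.0 / class_count[labels]               # rare classes sampled more
sampler = WeightedRandomSampler(sample_w, num_samples=len(labels))
```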
1. Why Recurrent Networks?
ANNs/CNNs assume independent inputs. Sequences (text, time series, speech) have temporal dependencies — "not good" ≠ "good not". RNNs maintain a hidden state that carries information across time steps.
2. Vanilla RNN
\(h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b_h)\)
\(y_t = W_{hy}h_t + b_y\)
W_hh and W_xh are shared across all time steps — same number of parameters regardless of sequence length.
3. The Vanishing Gradient Problem
In Backpropagation Through Time (BPTT), gradients flow backward through T time steps, multiplied by W_hh at each step. If the spectral radius (largest eigenvalue magnitude) of W_hh is < 1 → gradients vanish (early time steps have no influence). If > 1 → gradients explode. Consequence: vanilla RNNs cannot learn long-range dependencies (>20–30 steps).
4. LSTM — The Solution
LSTM adds a cell state c_t (long-term memory) and 3 gates that control information flow:
| Gate | Formula | Role |
|---|---|---|
| Forget | \(f_t = \sigma(W_f[h_{t-1},x_t] + b_f)\) | What to erase from c_{t-1} |
| Input | \(i_t = \sigma(W_i[h_{t-1},x_t] + b_i)\) | What new info to store |
| Candidate | \(g_t = \tanh(W_g[h_{t-1},x_t] + b_g)\) | New candidate values |
| Output | \(o_t = \sigma(W_o[h_{t-1},x_t] + b_o)\) | What to expose from c_t |
Cell state update: \(c_t = f_t \odot c_{t-1} + i_t \odot g_t\) — addition creates a gradient highway, solving vanishing gradients.
Hidden state: \(h_t = o_t \odot \tanh(c_t)\)
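The four equations as one cell step (a from-scratch sketch; weight shapes \(W \in \mathbb{R}^{(h+x)\times h}\) assumed):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wg, Wo, bf, bi, bg, bo):
    hx = torch.cat([h_prev, x_t], dim=-1)          # [h_{t-1}, x_t]
    f = torch.sigmoid(hx @ Wf + bf)                # forget gate
    i = torch.sigmoid(hx @ Wi + bi)                # input gate
    g = torch.tanh(hx @ Wg + bg)                   # candidate
    o = torch.sigmoid(hx @ Wo + bo)                # output gate
    c = f * c_prev + i * g                         # additive gradient highway
    h = o * torch.tanh(c)
    return h, c
```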
5. GRU — Simplified LSTM
GRU merges cell+hidden into one state, uses 2 gates (reset + update) instead of 3. Fewer parameters (~3/4 of LSTM), often matches LSTM performance.
\(z_t = \sigma(W_z[h_{t-1},x_t])\) (update gate — interpolates old vs new)
\(r_t = \sigma(W_r[h_{t-1},x_t])\) (reset gate — how much past to use)
\(n_t = \tanh(W_n[r_t \odot h_{t-1}, x_t])\) (candidate state)
\(h_t = (1-z_t) \odot h_{t-1} + z_t \odot n_t\)
6. RNN vs LSTM vs GRU
| Aspect | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| States | h_t only | h_t + c_t | h_t only |
| Gates | 0 | 3 | 2 |
| Parameters | Fewest | Most (~4× RNN) | Middle (~3× RNN) |
| Long-range | Poor | Excellent | Good |
7. Gradient Clipping
nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0) — apply AFTER loss.backward(), BEFORE optimizer.step(). Essential for RNNs to prevent exploding gradients.
8. PyTorch RNN Differences
| Module | Returns | Initial state |
|---|---|---|
| nn.RNN | (output, h_n) | h0 only |
| nn.LSTM | (output, (h_n, c_n)) | Tuple (h0, c0) ← DIFFERENT! |
| nn.GRU | (output, h_n) | h0 only (same as RNN) |
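A sketch tying the two sections above together: the LSTM state tuple, with clipping placed after backward() and before step() (toy dimensions):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()))

x = torch.randn(4, 10, 8)                          # (batch, seq, features)
output, (h_n, c_n) = lstm(x)                       # LSTM returns a state TUPLE
loss = head(output[:, -1]).mean()                  # toy loss on the last step
loss.backward()
nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=5.0)  # after backward...
opt.step()                                         # ...before step
```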
9. Text Generation Pipeline
Language model predicts \(P(w_t \mid w_{<t})\): a distribution over the next word given all previous words.
Generation: autoregressive — feed seed, get distribution over next word, sample, append, repeat. Temperature: T<1 = sharper (repetitive), T>1 = flatter (diverse). Top-k: restrict to top-k candidates (k=40 typical).
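A minimal temperature + top-k sampler (assumes a 1-D logits vector):

```python
import torch

def sample_next(logits, temperature=1.0, top_k=40):
    logits = logits / temperature                  # T<1 sharpens, T>1 flattens
    top_vals, top_idx = torch.topk(logits, top_k)  # restrict to top-k candidates
    probs = torch.softmax(top_vals, dim=-1)
    return top_idx[torch.multinomial(probs, 1)]    # sample one token id
```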
10. Perplexity
\(\text{PPL} = \exp(\text{cross_entropy_loss})\). Lower = less surprised = better model. Random = V (vocab size), good English LM = 20–100.
11. Time Series Forecasting
Preprocessing: MinMaxScaler (fit ONLY on train), sliding window construction (SEQ_LEN days → next value). NEVER shuffle time series — use chronological split (train=first 70%, val=next 15%, test=last 15%). Shuffling causes look-ahead bias.
ACF (Autocorrelation Function): correlation between y_t and y_{t-k}. Use to choose SEQ_LEN — where ACF drops below 0.5.
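Sliding windows plus a chronological split (a sketch on a toy series; scale with a MinMaxScaler fit on train only):

```python
import numpy as np

def make_windows(series, seq_len):
    X = np.stack([series[i:i+seq_len] for i in range(len(series) - seq_len)])
    y = series[seq_len:]                           # next value after each window
    return X, y

series = np.arange(100, dtype=float)
n = len(series)                                    # chronological, never shuffled
train, val, test = series[:int(.7*n)], series[int(.7*n):int(.85*n)], series[int(.85*n):]
X_train, y_train = make_windows(train, seq_len=7)
```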
1. The 4-Stage NLP Pipeline
Clean → Tokenize → Remove stop words → Stem/Lemmatize. Each stage is detailed below.
2. Stage 1: Text Cleaning
Pipeline: lowercase → remove HTML (<[^>]+>) → remove URLs → remove punctuation → normalize whitespace (\s+ → single space). Optional: remove numbers.
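The same pipeline as a function (regexes match those above):

```python
import re

def clean(text):
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)           # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)      # strip URLs
    text = re.sub(r"[^\w\s]", " ", text)           # strip punctuation
    return re.sub(r"\s+", " ", text).strip()       # normalize whitespace

print(clean("<b>Great</b> movie!!  See https://example.com"))  # "great movie see"
```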
3. Stage 2: Tokenization
Sentence tokenization: split paragraph into sentences. Word tokenization: split sentence into words. Penn Treebank style treats punctuation as separate tokens: "it's" → ["it", "'s"].
4. Stage 3: Stop Word Removal
Remove high-frequency, low-information words (the, is, at, by...). When NOT to remove: sentiment analysis ("not good"), authorship attribution (stop words are style markers), neural models (learn importance automatically).
5. Stage 4: Stemming vs Lemmatization
| Method | Mechanism | Output | Example | Speed |
|---|---|---|---|---|
| PorterStemmer | Rule-based suffix stripping | Often non-word | "studies" → "studi" | Fast |
| WordNetLemmatizer | Dictionary lookup + POS | Valid word | "ran" → "run" | Slower |
Critical: WordNetLemmatizer needs pos='v' for verbs! lemmatize("ran") → "ran" (wrong), lemmatize("ran", pos='v') → "run" (correct).
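The NLTK calls behind this (assumes wordnet data has been downloaded):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# one-time setup: import nltk; nltk.download("wordnet")

print(PorterStemmer().stem("studies"))             # "studi" (non-word)
lem = WordNetLemmatizer()
print(lem.lemmatize("ran"))                        # "ran" (defaults to noun POS)
print(lem.lemmatize("ran", pos="v"))               # "run"
```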
6. Context-Free Grammar (CFG)
Production rules describe sentence structure: S → NP VP, NP → Det N, VP → V NP, PP → P NP. Parse tree for "the cat sat on the mat": S dominates NP("the cat") + VP("sat on the mat").
7. POS Tagging
| Tag | Category | Example | Penn Treebank |
|---|---|---|---|
| NOUN | Noun | "model" | NN, NNS, NNP |
| VERB | Verb | "trained" | VB, VBD, VBZ |
| ADJ | Adjective | "deep" | JJ |
| ADP | Preposition | "on", "with" | IN |
8. NER (Named Entity Recognition)
Labels: PERSON, ORG, GPE (countries/cities), DATE, MONEY, EVENT. spaCy: doc.ents returns entity spans with ent.label_.
9. Dependency Parsing
Reveals grammatical structure as a directed tree. Key labels: nsubj (subject), ROOT (main verb), dobj (direct object), amod (adjective modifier). Extract SVO: find ROOT → nsubj (in lefts) → dobj (in rights).
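SVO extraction as described, in spaCy (assumes en_core_web_sm is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The model predicted the labels.")
root = [t for t in doc if t.dep_ == "ROOT"][0]           # main verb
subj = [t for t in root.lefts if t.dep_ == "nsubj"]      # subject on the left
obj = [t for t in root.rights if t.dep_ == "dobj"]       # object on the right
print(subj, root, obj)                                   # [model] predicted [labels]
```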
10. spaCy vs NLTK
| Feature | spaCy | NLTK |
|---|---|---|
| Speed | Fast (C-optimized) | Slower (Python) |
| POS/NER/Parsing | Built-in, production quality | Educational, needs setup |
| Best for | Production NLP | Learning/research |
1. The Representation Problem
ML models need numerical input. Text is symbolic. The representation hierarchy: one-hot → Bag-of-Words → N-grams → TF-IDF, each level fixing a limitation of the previous (covered below).
2. One-Hot Encoding
Vector of size |V| with 1 at word's index. Problems: dimensionality = 50K+, 99.99% sparse, no semantic similarity — cosine("cat","dog") = cosine("cat","table") = 0.
3. Bag-of-Words (BoW)
Document = vector of word counts (order discarded). Uses CountVectorizer(max_features, min_df, max_df, ngram_range). Typically >99% sparse. Limitation: "The dog bit the man" = "The man bit the dog" (word order lost).
4. N-grams
Contiguous sequences of N words. Bigram: "not good" as one feature → captures negation. Tradeoff: bigrams grow vocab ~10×, trigrams ~100×. Performance peaks at bigrams.
5. TF-IDF — Weighted Importance
\(\text{TF}(t,d) = \frac{\text{count}(t,d)}{|d|}\) — how frequent in this doc.
\(\text{IDF}(t) = \log\frac{N}{1 + df(t)} + 1\) — how rare across all docs.
\(\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)\)
| Word | TF | IDF | TF-IDF | Interpretation |
|---|---|---|---|---|
| "the" | 0.15 | 0.1 | 0.015 | Very low — appears everywhere |
| "film" | 0.05 | 2.3 | 0.115 | Medium — domain specific |
| "brilliant" | 0.02 | 4.5 | 0.090 | High — rare, discriminative |
sublinear_tf=True: replaces TF with log(1+TF) — 100× frequency ≠ 100× importance.
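A minimal TfidfVectorizer setup combining TF-IDF with bigrams (toy corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the film was brilliant", "the film was not good"]
vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)  # unigrams + bigrams
X = vec.fit_transform(docs)                        # sparse doc-term matrix
print(X.shape, vec.get_feature_names_out()[:5])
```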
6. IMDb Benchmarks (Classical Methods)
7. Why TF-IDF > BoW
- Common words (stopwords) are down-weighted by low IDF
- Rare, discriminative words are amplified by high IDF
- Document length normalized (TF is relative, not absolute count)
8. Why N-grams > BoW
- "not good" as a single feature captures negation (BoW can't)
- Phrasal patterns captured: specific sentiment bigrams
9. Feature Inspection
For logistic regression: clf.coef_[0] gives weight per feature. High positive → strongly positive sentiment ("brilliant", "excellent"). High negative → strongly negative ("terrible", "awful").
1. Why Embeddings?
Classical methods are sparse, high-dimensional, semantically blind. Word embeddings produce dense, low-dimensional, semantically meaningful vectors. Famous property: \(\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\).
2. The Distributional Hypothesis
"Words that appear in similar contexts have similar meanings" (Harris, 1954). If "brilliant" and "superb" appear in the same contexts, their vectors should be close.
3. Word2Vec (Mikolov, 2013)
Trains a shallow neural network on context prediction. Two modes:
CBOW: context → target word (faster).
Skip-gram: target → context (better for rare words).
| Parameter | Meaning | Typical |
|---|---|---|
| vector_size | Embedding dimension | 100–300 |
| window | Context words on each side | 3–5 |
| sg | 1=Skip-gram, 0=CBOW | 1 |
| min_count | Ignore words below this frequency | 1–5 |
Document vector = mean pooling: average of all word vectors in the document. Problem: all words get equal weight — "not" = "film" = "the". This is why TF-IDF often beats Word2Vec on classification.
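Training plus mean pooling in gensim (a sketch; parameters from the table):

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [["the", "film", "was", "brilliant"], ["a", "superb", "film"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)

def doc_vector(tokens, model):                     # mean pooling: every word equal
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)
```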
4. GloVe (Stanford, 2014)
Builds a global co-occurrence matrix X[i,j] = how often word j appears near word i (weighted by 1/distance). Vectors are then learned so that \(w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij}\) — a weighted least-squares factorization of the log co-occurrence matrix. Better at capturing global statistics; Word2Vec better at local patterns.
5. FastText (Facebook, 2016)
Key innovation: subword embeddings. Decomposes word into character n-grams (min_n=3, max_n=6):
"acting" → ["<ac", "act", "cti", "tin", "ing", "ng>", "<acting>"]
\(\vec{\text{word}} = \sum \vec{\text{subword}}\)
OOV solved: even unseen words get a vector via shared subwords. Best for noisy text, morphologically rich languages, rare domain terms.
6. Method Comparison
| Method | Semantic | Context | OOV | Training |
|---|---|---|---|---|
| Word2Vec | ✅ | ❌ (one vector/word) | ❌ (zero vector) | Neural, local windows |
| GloVe | ✅ | ❌ | ❌ | Matrix factorization, global |
| FastText | ✅ | ❌ | ✅ (subwords) | Neural, local + subword |
| BERT | ✅ | ✅ (contextual) | ✅ | Transformer, bidirectional |
7. IMDb Benchmarks (Embedding Methods)
Why TF-IDF (90.13%) > Word2Vec on IMDb? Mean pooling gives equal weight to all words — "the" and "brilliant" contribute equally. TF-IDF naturally weights by importance. Key sentiment words get diluted by mean pooling.
8. Cosine Similarity
\(\cos(a,b) = \frac{a \cdot b}{\|a\| \times \|b\|}\). Ignores magnitude, focuses on direction. Near-synonyms > 0.8, related words 0.5–0.8, unrelated < 0.3. Better than Euclidean for word vectors (length ≠ strength of meaning).
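As a one-liner (numpy sketch):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```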
9. The Decision Tree
Semantic similarity important? → Word2Vec / GloVe
Very small corpus (<1k docs)? → TF-IDF (embeddings need data)
Always: start with TF-IDF as baseline. Justify complexity with measurable gains.
1. Why Transformers? (vs RNNs)
RNNs have two critical limits: (1) sequential bottleneck — step t depends on step t−1, cannot parallelize; (2) vanishing gradients — even LSTM struggles beyond ~100 tokens. Transformers solve both: all positions processed in parallel, self-attention directly connects any two positions in O(1).
2. Scaled Dot-Product Attention
\(\text{Attention}(Q, K, V) = \text{softmax}\!\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V\)
Q (Query): "what am I looking for?" K (Key): "what do I contain?" V (Value): "what do I contribute?"
The \(\sqrt{d_k}\) scaling prevents dot products from growing large and pushing softmax into saturated (near-zero gradient) regions.
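The formula as code (a from-scratch sketch; single head, toy shapes):

```python
import math
import torch

def attention(Q, K, V, mask=None):
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # QK^T / sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))      # causal masking
    return torch.softmax(scores, dim=-1) @ V                  # weights times values

Q = K = V = torch.randn(2, 5, 64)                  # (batch, seq, d_k)
out = attention(Q, K, V)                           # (2, 5, 64)
```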
3. Multi-Head Attention
\(\text{MHA}(Q,K,V) = \text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O\) where each head = Attention with different learned projections. Each head can focus on different aspects simultaneously: syntax, coreference, semantics.
4. Transformer Encoder Block
Two sub-layers, both with residual + LayerNorm:
\(\text{output} = \text{LayerNorm}(x + \text{sublayer}(x))\). FFN = two linear layers with ReLU: \(\max(0, xW_1+b_1)W_2+b_2\). Hidden dim typically 4× model dim — most parameters live here.
5. Positional Encoding & Causal Masking
Positional encoding: attention is permutation-invariant (tokens as a set). Add sinusoidal position vectors: \(PE(pos,2i)=\sin(pos/10000^{2i/d})\), \(PE(pos,2i+1)=\cos(pos/10000^{2i/d})\).
Causal masking: in decoder self-attention, set positions j > i to −∞ before softmax → token i can only see tokens ≤ i. Used in GPT (autoregressive generation).
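Building the causal mask (usable as the mask argument in the attention sketch above):

```python
import torch

T = 5
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
# True above the diagonal (j > i): masked to -inf, so token i only sees j <= i
```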
6. Decoding Strategies
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Greedy | Pick max prob token | Fast, deterministic | Often suboptimal |
| Beam search | Keep top-k partial seqs | Higher quality | Slower, less diverse |
| Temperature T<1 | Sharpen distribution | More coherent | Repetitive |
| Temperature T>1 | Flatten distribution | More creative/diverse | Less coherent |
7. BERT vs GPT (Architecture)
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder-only | Decoder-only |
| Training | Masked LM (bidirectional) | Autoregressive (next token) |
| Context | Sees full sequence (left+right) | Sees only past tokens |
| Best for | Classification, extraction | Generation |
| Parameters | 110M (base) | 175B+ (GPT-3) |
8. BERT Details
BERT-base: 12 encoder layers, 12 heads, d=768, ~110M params. Max 512 tokens.
Pre-training: Masked LM (predict 15% masked tokens) + Next Sentence Prediction (sentence B follows A?).
Tokenization: WordPiece — "unbelievable" → ["un","##believ","##able"]. Special tokens: [CLS], [SEP], [PAD], [MASK].
Fine-tuning: small LR (2e-5 to 5e-5), warmup steps (500), 3 epochs usually enough. Frozen BERT (feature extraction) often worse than TF-IDF — BERT is designed to be fine-tuned.
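WordPiece in action (assumes HuggingFace transformers; the exact subword pieces may differ from the example above):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("unbelievable", return_tensors="pt")
print(tok.convert_ids_to_tokens(enc["input_ids"][0]))
# e.g. ['[CLS]', 'un', '##bel', ..., '[SEP]'] with special tokens added
```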
9. LLM Training Stages
Pre-training: next-token prediction on trillions of tokens — broad world knowledge.
SFT: Supervised Fine-Tuning on curated instruction–response pairs — teaches following instructions.
RLHF: train reward model on human preferences, then fine-tune LLM with PPO + KL penalty against SFT model — aligns with human values.
10. LoRA (Low-Rank Adaptation)
\(W' = W + \frac{\alpha}{r}AB\) where \(A \in \mathbb{R}^{d \times r}, B \in \mathbb{R}^{r \times k}\) with \(r \ll \min(d,k)\). Only \(r(d+k)\) trainable params vs \(dk\) for full fine-tuning. For 4096×4096 with r=8: 65K vs 16.8M — a 256× reduction.
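A from-scratch LoRA wrapper matching the formula (a sketch, not the peft library implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                # freeze W
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))  # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)  # W'x = Wx + (α/r)ABx
```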
11. Prompting Strategies
- Zero-shot: task description only, no examples. Works on large well-trained models.
- Few-shot: include k demonstration pairs before query. Guides format + reasoning, no weight updates.
- Chain-of-Thought (CoT): include step-by-step reasoning in demonstrations (or append "Let's think step by step"). Dramatically improves multi-step reasoning.
12. RAG (Retrieval-Augmented Generation)
Grounds the LLM in external documents to reduce hallucination. The pipeline:
Chunk docs → embed into vector DB → embed query → retrieve top-k similar chunks → prepend to prompt → LLM generates grounded answer → return sources.
13. Hallucination
LLMs generate plausible-sounding but factually incorrect content because they're trained to predict fluent continuations, not verify facts. Mitigations: RAG (ground in documents), RLHF honesty training.
14. The Full IMDb Hierarchy
15. The Representation Hierarchy
Each level fixes one limitation of the previous. BERT is the only method that has contextual representations — the same word gets different vectors depending on its surrounding words, solving polysemy.
16. Tools & Ecosystem
- HuggingFace Transformers: unified API for thousands of pre-trained models. AutoTokenizer, AutoModel, pipeline(), Trainer.
- Ollama: run quantized open-source LLMs locally (4-bit quantization). ollama run llama3. Privacy-preserving, no cloud costs.
- LangChain: compose LLM pipelines with standard invoke(input) → output interface. Chains prompt templates, LLMs, retrievers, tools.
- Vector DBs (FAISS, Pinecone, Chroma): fast approximate nearest-neighbor search for dense embeddings. Required for RAG retrieval.
17. Model Selection — 5 Questions
1. Does the task need semantic/contextual understanding, or does a TF-IDF baseline suffice?
2. Latency/memory constraints? (TF-IDF = 1ms; BERT = 370ms CPU)
3. Is labeled fine-tuning data available?
4. Privacy / data residency requirements? (on-prem? Ollama?)
5. Compute budget for training + inference?
18. The Map That Does Not Expire
Understand why architectures work — the math behind attention, gradient flow, representation learning — not today's model names. Models change yearly; the underlying principles (information bottlenecks, optimization landscapes, inductive biases) are permanent.