Transformers
& Large Language Models
Complete course notes — from the sequential bottleneck that motivated Transformers, through attention, multi-head attention, and the full encoder-decoder architecture, to LLMs, prompt engineering, RAG, and fine-tuning. Based on ENSAM 2025/2026 lecture PDFs.
RNNs and LSTMs process sequences one token at a time. Each step depends on the previous hidden state, making parallelisation impossible. This created a hard ceiling:
- Training speed — cannot leverage GPU parallelism; one step must complete before the next begins.
- Long-range dependencies — even LSTMs struggle to connect information separated by hundreds of tokens; gradient signal degrades over long paths.
- Scale barrier — training on internet-scale datasets (trillions of tokens) was simply not feasible with sequential architectures.
The Transformer, introduced in "Attention Is All You Need" (Vaswani et al., 2017), replaces recurrence entirely with attention. Every token attends to every other token in a single parallel operation. The consequence was transformative: training on trillions of tokens became feasible, enabling GPT, BERT, and every major LLM that followed.
| Property | RNN / LSTM | Transformer |
|---|---|---|
| Processing order | Sequential (step-by-step) | Parallel (all tokens at once) |
| Long-range dependencies | Weak (vanishing gradient) | Strong (direct attention path) |
| GPU utilisation | Poor (serial dependency) | Excellent (matrix ops) |
| Training scale | Billions of tokens max | Trillions of tokens feasible |
| Position awareness | Implicit (step index) | Explicit (positional encoding) |
Attention was first introduced in 2015 to improve neural machine translation by creating shortcuts between the encoder and the decoder, solving issues with long input sequences.
Attention dynamically computes relevance scores between all words in a sentence, allowing the model to focus on the most important words within the current context and combine their meanings. The result is context-aware representations.
Classic example — the word "bank":
- "I deposited money at the bank" — attention weights concentrate on "money", "deposited".
- "I sat by the river bank" — attention weights concentrate on "river", "sat".
Same word, different representation — this is what static embeddings (Word2Vec, GloVe) cannot do.
| Year | Milestone | Significance |
|---|---|---|
| 2015 | Attention for translation (Bahdanau) | First attention mechanism, encoder-decoder shortcut |
| 2017 | Attention Is All You Need (Vaswani) | Replace recurrence entirely; the Transformer block |
| 2018 | BERT / GPT-1 | Pre-training on massive text; fine-tune for tasks |
| 2020+ | GPT-3, scaling laws | Scale → emergent abilities; LLMs as general tools |
Self-attention transforms each token into three vectors by multiplying its embedding by three learned weight matrices:
- Query (Q) — "What am I looking for?"
- Key (K) — "What information do I offer?"
- Value (V) — "What content do I actually provide?"
Step-by-step breakdown (sketched in code below):
- Dot products: Compute $QK^T$ — the raw relevance score between every query and every key.
- Scale: Divide by $\sqrt{d_k}$ (the key dimension) so that large dot products do not saturate the softmax and shrink its gradients.
- Softmax: Convert scores to probabilities — the attention weights sum to 1.
- Weighted sum: Multiply weights by $V$ — the output is a weighted combination of all value vectors.
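A minimal NumPy sketch of these four steps (single head, no masking); the dimensions and random weights are for illustration only — in a real model $W^Q, W^K, W^V$ are learned:
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: scores -> scale -> softmax -> weighted sum of values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # steps 1-2: QK^T scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # step 3: softmax, rows sum to 1
    return weights @ V, weights                            # step 4: weighted combination of V

# toy example: 4 tokens, d_model = d_k = 8 (illustrative sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                                # token embeddings (+ positional encoding)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # learned matrices in a real model
out, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(attn.sum(axis=-1))                                   # each row of attention weights sums to 1
```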
Before self-attention, each token passes through:
- Token embedding — maps the token ID to a dense vector (e.g. 512-dimensional).
- Positional encoding — adds a position-dependent signal so the model knows token order (since there is no recurrence). Uses sine/cosine functions of varying frequencies: $PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$, $PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$ (sketched in code below).
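A short NumPy sketch of this encoding; the sequence length and $d_{model}$ are arbitrary illustrative choices, and $d_{model}$ is assumed even:
```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
# the encoding is added element-wise to the token embeddings before the first layer
```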
Single-head attention acts like one giant spotlight that can only point in one direction, often focusing on the most obvious connection while missing others. Multi-head attention runs $h$ attention heads in parallel, each with its own $W^Q_i, W^K_i, W^V_i$ projection matrices.
Different heads specialise in different linguistic relationships simultaneously:
- Syntax — subject-verb agreement, noun-adjective relations; structural grammar.
- Semantics — word meaning in context: "bank" as finance vs. geography.
- Coreference — "The animal didn't cross because it was tired" — links "it" to "animal".
- Local structure — attends to nearby tokens; captures local phrase structure.
The outputs of all heads are concatenated and projected through $W^O$ to produce the final multi-head attention output. The original Transformer (Vaswani 2017) used $h = 8$ heads, $d_{model} = 512$, so each head operates on $d_k = d_v = 64$ dimensions.
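A compact sketch of this split-heads / concatenate / project pattern, reusing the `scaled_dot_product_attention` function above; slicing one shared projection into $h$ pieces is equivalent to using per-head matrices $W^Q_i, W^K_i, W^V_i$, and all shapes here are illustrative:
```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """X: (seq_len, d_model); W_q/W_k/W_v/W_o: (d_model, d_model); h heads of size d_model // h."""
    seq_len, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(h):                                     # real implementations batch this as one matmul
        sl = slice(i * d_k, (i + 1) * d_k)                 # each head works on its own d_k-dim slice
        out, _ = scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl])
        heads.append(out)
    return np.concatenate(heads, axis=-1) @ W_o            # concatenate all heads, then project with W^O
```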
Each encoder layer consists of two sub-layers with residual connections and layer normalisation (Add & Norm) around each:
- Multi-Head Self-Attention — every token attends to all other tokens.
- Feed-Forward Network (FFN) — a position-wise two-layer MLP applied identically to each token.
The FFN has a larger inner dimension ($d_{ff} = 2048$ in the original) than $d_{model} = 512$. It expands then contracts — like a "thinking" step after attention. The same FFN weights are applied to every position independently.
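Schematically, one encoder layer's forward pass can be sketched as follows (post-norm as in the original paper; learned layer-norm gain/bias and dropout are omitted, and `attn_fn` / `ffn_fn` stand in for the two sub-layers):
```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to d_ff, apply ReLU, contract back to d_model (same weights at every position)."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, attn_fn, ffn_fn):
    """One encoder layer: two sub-layers, each wrapped in residual + layer norm (Add & Norm)."""
    x = layer_norm(x + attn_fn(x))                         # sub-layer 1: multi-head self-attention
    x = layer_norm(x + ffn_fn(x))                          # sub-layer 2: position-wise FFN
    return x

# usage sketch: encoder_layer(x, lambda t: multi_head_attention(t, Wq, Wk, Wv, Wo),
#                                lambda t: feed_forward(t, W1, b1, W2, b2))
```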
Each decoder layer has three sub-layers:
- Masked Multi-Head Self-Attention — tokens can only attend to previous positions (causal masking prevents looking ahead during training).
- Multi-Head Cross-Attention — queries come from the decoder, keys and values come from the encoder output. This is how the decoder "reads" the encoded source sequence.
- Feed-Forward Network — same as in the encoder.
After the final decoder layer: Linear projection to vocabulary size → Softmax → probability distribution over the next token.
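The causal mask used in the first decoder sub-layer can be sketched as an upper-triangular matrix of $-\infty$ added to the attention scores before the softmax; shapes here are illustrative:
```python
import numpy as np

def causal_mask(seq_len):
    """Position i may attend only to positions 0..i; future positions receive -inf before the softmax."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.random.randn(5, 5)          # raw QK^T / sqrt(d_k) scores for a 5-token sequence
masked = scores + causal_mask(5)        # softmax turns the -inf entries into attention weight 0
```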
| Component | Function | Original Paper Value |
|---|---|---|
| Token embedding | ID → dense vector | $d_{model} = 512$ |
| Positional encoding | Inject position information | Sinusoidal, same dim as embedding |
| Encoder layers | Build contextual representations | $N = 6$ stacked layers |
| Multi-Head Attention | Capture diverse relationships | $h = 8$ heads, $d_k = d_v = 64$ |
| FFN inner dim | Per-position transformation | $d_{ff} = 2048$ |
| Decoder layers | Generate output sequence | $N = 6$ stacked layers |
| Add & Norm | Residual + layer normalisation | Around every sub-layer |
| Output projection | Decoder → vocabulary logits | Linear + Softmax |
The original Transformer architecture was designed for sequence-to-sequence tasks (e.g. translation). The decoder generates output autoregressively — one token at a time (a code sketch follows this list):
- Start with the `<BOS>` (beginning-of-sequence) token.
- Run it through the decoder; output a probability distribution over the vocabulary.
- Sample or select the highest-probability token (decoding strategy).
- Append the new token to the sequence and repeat from the second step.
- Stop when the `<EOS>` token is generated or max length is reached.
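A minimal sketch of this loop, assuming a hypothetical `model(tokens)` callable that returns next-token logits, and hypothetical `bos_id` / `eos_id` token ids:
```python
import numpy as np

def generate_greedy(model, bos_id, eos_id, max_len=50):
    """Autoregressive greedy decoding: feed the growing sequence back in until <EOS> or max length."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = model(tokens)                 # scores over the vocabulary for the next token
        next_id = int(np.argmax(logits))       # greedy: always take the highest-probability token
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```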
| Strategy | Behaviour | Use Case |
|---|---|---|
| Greedy | Always pick the highest-probability token | Fast; deterministic; can be repetitive |
| Beam Search | Keep top-$k$ sequences at each step | Better quality; used in translation |
| Sampling | Sample from the full distribution | Diverse, creative outputs |
| Top-k Sampling | Sample from the top $k$ tokens only | Balances diversity and coherence |
| Temperature | Sharpen/flatten distribution before sampling | Low $T$ → conservative; high $T$ → creative |
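Temperature and top-k can be sketched as simple transformations of the logits before sampling; this is an illustrative NumPy version, not any particular library's API:
```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Scale logits by temperature, optionally keep only the top-k, then sample from the softmax."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature       # low T sharpens, high T flattens
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]                          # k-th largest logit
        logits = np.where(logits < cutoff, -np.inf, logits)       # discard everything below it
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```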
During training, the decoder uses teacher forcing: all ground-truth tokens are fed in parallel (with causal masking), making training efficient. During inference, the model generates tokens one-by-one using its own previous outputs.
LLMs are Transformer models trained at a scale previously unimaginable: billions of parameters, trillions of tokens, thousands of GPUs over months. The architecture is the same Transformer block — stacked, wider, and deeper.
- Parameters — from millions (BERT: 110M) to hundreds of billions (GPT-4: reportedly ~1.8T). More parameters = more capacity to memorise and generalise.
- Training data — web text, books, code, scientific papers; trillions of tokens. Diverse data enables broad capabilities.
- Compute — scaling laws (Chinchilla, 2022) show the compute-optimal allocation: scale parameters and training tokens roughly in proportion.
- Context window — from 512 tokens (BERT) to 128k+ tokens (GPT-4 Turbo). Longer context = better reasoning over documents.
Beyond predictable performance improvements, large models exhibit emergent abilities — capabilities that appear suddenly at scale and were not present in smaller models:
- Multi-step reasoning — solving multi-step math problems without explicit reasoning training.
- In-context learning — learning from a few examples in the prompt without updating weights.
- Chain-of-thought — generating coherent reasoning chains to improve accuracy.
- Instruction following — understanding and executing complex natural language instructions.
- Code generation — writing functional code across many programming languages.
Every major production LLM follows the same three-stage training pipeline. Each stage converts the raw architecture into a more useful, aligned assistant.
Stage 1 — Pre-training. Objective: next-token prediction (autoregressive) on massive, diverse text data.
- Data: web crawls (Common Crawl), books, Wikipedia, code, scientific papers.
- Outcome: A model that knows language, facts, reasoning patterns — but has no instruction-following behaviour.
- Cost: Millions of dollars in compute; done once by the lab, not practitioners.
Stage 2 — Supervised Fine-Tuning (SFT). Objective: teach the model to follow instructions using high-quality human-written examples.
- Data: Curated prompt-response pairs, written or verified by humans.
- Outcome: Model learns the format and style of helpful, coherent responses.
- Scale: Much smaller dataset than pre-training (thousands to millions of examples).
Stage 3 — Alignment (RLHF / DPO). Objective: align model outputs with human values — helpfulness, harmlessness, honesty.
- RLHF (Reinforcement Learning from Human Feedback): Human raters rank multiple model outputs; a reward model is trained on those rankings; the LLM is optimised against that reward model using RL (PPO).
- DPO (Direct Preference Optimisation): A simpler alternative that skips the explicit reward model.
- Outcome: The model refuses harmful requests, stays on-task, communicates clearly.
| Stage | Objective | Data Type | Cost |
|---|---|---|---|
| Pre-training | Predict next token | Trillions of tokens (raw web) | Very high (lab only) |
| SFT | Follow instructions | Curated prompt-response pairs | Medium |
| RLHF / DPO | Align with human values | Human preference rankings | Medium |
The original encoder-decoder Transformer has spawned a family of variants, each designed for a specific use case by keeping or removing parts of the architecture.
| Type | Architecture | Training Objective | Best For | Examples |
|---|---|---|---|---|
| Encoder-Only | Encoder stack only | Masked Language Modelling (MLM; BERT adds NSP) | Classification, NER, embeddings | BERT, RoBERTa, DistilBERT |
| Decoder-Only | Decoder stack only (causal) | Next-token prediction (autoregressive) | Text generation, chat, code | GPT series, LLaMA, Mistral |
| Encoder-Decoder | Full encoder + decoder | Sequence-to-sequence (span masking) | Translation, summarisation, QA | T5, BART, mT5 |
Zero training data. Zero infrastructure changes. No compute cost. Just better instructions. Prompt engineering is the first tool every practitioner should exhaust before reaching for more expensive approaches.
Sets the model's role, persona, constraints, and style before any conversation begins. The same base model becomes a customer service agent, code reviewer, or research assistant depending on the system prompt.
Give the model a task description with no examples. Works well for common tasks the model encountered during training.
Show 2–3 concrete examples of input → desired output. The model infers the pattern and applies it. Works through in-context learning — an emergent capability. Weights are never updated.
For reasoning tasks (math, logic, multi-step decisions), ask the model to "think step by step" before answering. Generates intermediate reasoning steps that guide the final output. Small prompt change, significant quality gain.
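In practice these patterns are nothing more than prompt construction; a hypothetical example (the task, labels, and wording are invented for illustration):
```python
# Zero-shot: task description only, no examples
zero_shot = ("Classify the sentiment of this review as positive, negative, or neutral:\n"
             "'The battery dies in two hours.'")

# Few-shot: a handful of worked examples before the real input, so the model infers the pattern
few_shot = """Review: 'Great screen, fast delivery.' -> positive
Review: 'Arrived broken, no refund yet.' -> negative
Review: 'The battery dies in two hours.' ->"""

# Chain-of-thought: same question, but asking for intermediate reasoning before the answer
chain_of_thought = ("A train leaves at 9:40 and arrives at 13:05. How long is the journey? "
                    "Think step by step before giving the final answer.")
```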
| Strategy | Data Needed | Cost | Best For |
|---|---|---|---|
| Zero-shot | None | Free | Simple, common tasks |
| Few-shot | 2–10 examples in prompt | Free | Format/pattern tasks |
| Chain-of-Thought | None (just phrasing) | Free | Reasoning, math, logic |
| System prompt | None | Free | Persona, style, constraints |
RAG solves two critical limitations of bare LLMs:
- Stale knowledge — LLMs have a training cutoff; they cannot know about recent events.
- Hallucination — LLMs generate plausible but sometimes incorrect outputs with no external grounding.
The RAG pipeline has four steps (sketched in code below):
- Index — chunk your documents, embed them with a text embedding model, store in a vector database.
- Retrieve — when the user sends a query, embed the query and search for the most similar document chunks (cosine similarity).
- Augment — prepend the retrieved chunks to the LLM's prompt as context.
- Generate — the LLM answers the query grounded in the retrieved context, dramatically reducing hallucination.
You can give any LLM access to private, up-to-date, domain-specific knowledge without modifying its weights. The retrieval backbone uses the word embeddings you learned in Module 05. RAG is how most enterprise LLM applications are built.
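A minimal sketch of the retrieve-and-augment steps, assuming a hypothetical `embed(text)` function that returns a dense vector (any embedding model from the table below could play that role); a real system pre-computes and indexes the chunk embeddings instead of embedding them per query:
```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, chunks, embed, k=3):
    """Embed the query, score every chunk by cosine similarity, return the top-k chunks."""
    q_vec = embed(query)
    scored = sorted(chunks, key=lambda c: cosine_similarity(embed(c), q_vec), reverse=True)
    return scored[:k]

def build_rag_prompt(query, chunks, embed):
    """Augment: prepend the retrieved chunks so the LLM's answer is grounded in them."""
    context = "\n\n".join(retrieve(query, chunks, embed))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```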
| Component | Purpose | Common Tools |
|---|---|---|
| Embedding model | Encode text into dense vectors | text-embedding-ada-002, E5, BGE |
| Vector database | Store and search embeddings at scale | Pinecone, Chroma, FAISS |
| Retriever | Find top-$k$ relevant chunks | Cosine similarity, BM25 hybrid |
| LLM | Generate answer given context | GPT-4, Claude, LLaMA |
| Orchestrator | Tie the pipeline together | LangChain, LlamaIndex |
Fine-tune when specific domain expertise, tone, style, or structured behaviour cannot be reliably produced by prompting and RAG alone.
| Fine-Tuning Type | Goal | Example |
|---|---|---|
| Domain fine-tuning | Teach specialised vocabulary and fluency | Medical reports, legal contracts, financial filings |
| Task fine-tuning | Reliable structured output format or classification | JSON extraction, sentiment → one of three labels |
Often done sequentially: domain adaptation first, then task fine-tuning on top. Choose based on the failure mode — wrong vocabulary or wrong behaviour.
Full fine-tuning updates all billions of parameters — prohibitively expensive for most practitioners. PEFT techniques update only a tiny fraction of parameters while keeping the original weights frozen, achieving high-performance adaptation at a fraction of the cost.
| Method | Approach | Parameters Updated |
|---|---|---|
| LoRA | Inject low-rank matrices $A, B$ alongside frozen weights: $W' = W + \Delta W = W + AB$ | <1% of total |
| QLoRA | LoRA + quantise base model to 4-bit; fits 70B on a single GPU | <1% (4-bit base) |
| Prefix Tuning | Prepend trainable tokens to every attention layer | ~0.1% |
| Adapter layers | Insert small trainable layers between frozen transformer layers | ~2% |
In the LoRA update, $W \in \mathbb{R}^{d \times k}$ is frozen while $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times k}$ are trained, with rank $r \ll \min(d, k)$; a scaling factor $\alpha$ (commonly applied as $\alpha / r$) controls how strongly $AB$ perturbs the frozen weights.
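A sketch of a LoRA-adapted linear layer in PyTorch; `nn.Linear` stands in for a frozen pretrained projection, and the rank and $\alpha$ values are illustrative:
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x W^T + (alpha / r) * x A B  -- W is frozen; only the low-rank factors A and B are trained."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False               # freeze the pretrained weight W
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)   # low-rank factor A
        self.B = nn.Parameter(torch.zeros(r, d_out))         # B starts at zero, so W' = W initially
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)

layer = LoRALinear(512, 512, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable}/{total} = {trainable/total:.1%}")
# ~3% for this single layer; across a full model, where most weights carry no adapter, the fraction falls well below 1%
```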
Model rankings change monthly. This decision framework does not.
| # | Question | Guidance |
|---|---|---|
| 01 | Open or Closed? | Need to inspect/modify weights, run fully offline, or ensure data never leaves your infrastructure? → Open model. Otherwise, closed API gives maximum capability with minimal setup. |
| 02 | Size vs. Your Compute Reality? | 3–7B: runs on a laptop GPU. 70B: needs a server. Frontier: cloud API only. There is no point selecting a model your infrastructure cannot run. |
| 03 | General or Specialized? | A coding task benefits from a code-specialized model. Medical tasks benefit from medical fine-tuned models. Do not default to general when a specialist exists for your domain. |
| 04 | Does Your Task Require Reasoning? | For complex multi-step problems — logical inference, math, research synthesis — a reasoning model (o3, DeepSeek-R1) significantly outperforms standard instruction-tuned models. |
| 05 | Latency & Cost Constraints? | Larger models: higher quality, slower, more expensive. Smaller quantized models: faster and cheaper at some quality cost. Real-time applications almost always need smaller models. |
| If your problem is… | Start with… | Escalate to… |
|---|---|---|
| Simple, common task | Zero-shot prompting | Few-shot prompting |
| Complex reasoning | Chain-of-Thought prompting | Reasoning model (o3, R1) |
| Private / up-to-date data | RAG | RAG + fine-tuning |
| Domain vocabulary | Domain fine-tuning | Domain + task fine-tuning |
| Structured output format | Few-shot + system prompt | Task fine-tuning |
The Hugging Face ecosystem:
- Model Hub — thousands of pretrained models, searchable by task.
- Transformers — unified API to load, run, and fine-tune any model.
- PEFT — LoRA and other parameter-efficient methods.
- Datasets — standardised training and evaluation data access.
Other tooling:
- Local runtimes — run open-source LLMs locally on your machine with one command. No API, no cloud, no cost. Ideal for development, privacy-sensitive applications, and experimentation with models like LLaMA, Mistral, Phi.
- Application frameworks (LangChain, LlamaIndex) — for building LLM-powered applications. Handle RAG pipelines, agent workflows, memory, and tool use. Focus on application logic, not infrastructure plumbing.
- Vector databases — Pinecone, Chroma, FAISS; store and retrieve document embeddings at scale. The retrieval backbone of every RAG system. Use the same embedding space from Module 05.
| Stage | Core Idea | Limitation That Motivated the Next Stage |
|---|---|---|
| RNN | Hidden state carries sequence memory | Vanishing gradient; sequential bottleneck |
| LSTM | Cell state + gates prevent forgetting | Still sequential; long context still limited |
| Attention | Dynamic relevance scores between all tokens | Used alongside RNN; not standalone |
| Transformer | Attention only — no recurrence; full parallelism | Needed massive data + compute to shine |
| LLM | Transformer at scale + 3-stage training | Alignment, reasoning, cost, hallucination |
| Agents / Reasoning | LLMs with tools, memory, multi-step plans | Active research frontier |
Positional encoding → Multi-head self-attention → Residual → FFN → Residual. Same block, stacked. This is the engine of everything after the attention milestone.
Pre-training → SFT → Alignment converts raw architecture into a useful, aligned assistant. Every major model follows it.
Prompt engineering → RAG → Fine-tuning. Each addresses a different gap between what the model can do and what your task needs. Always start at the bottom.
It is never "which model is best?" It is always: "what does my task actually need?" Matching the right tool to the right problem is the professional skill.