Transformers
& Large Language Models
Complete course notes — from the sequential bottleneck that motivated Transformers, through attention, multi-head attention, and the full encoder-decoder architecture, to LLMs, prompt engineering, RAG, and fine-tuning. Based on ENSAM 2025/2026 lecture PDFs.
RNNs and LSTMs process sequences one token at a time. Each step depends on the previous hidden state, making parallelisation impossible. This created a hard ceiling:
- Training speed — cannot leverage GPU parallelism; one step must complete before the next begins.
- Long-range dependencies — even LSTMs struggle to connect information separated by hundreds of tokens; gradient signal degrades over long paths.
- Scale barrier — training on internet-scale datasets (trillions of tokens) was simply not feasible with sequential architectures.
The Transformer, introduced in "Attention Is All You Need" (Vaswani et al., 2017), replaces recurrence entirely with attention. Every token attends to every other token in a single parallel operation. The consequence was transformative: training on trillions of tokens became feasible, enabling GPT, BERT, and every major LLM that followed.
| Property | RNN / LSTM | Transformer |
|---|---|---|
| Processing order | Sequential (step-by-step) | Parallel (all tokens at once) |
| Long-range dependencies | Weak (vanishing gradient) | Strong (direct attention path) |
| GPU utilisation | Poor (serial dependency) | Excellent (matrix ops) |
| Training scale | Billions of tokens max | Trillions of tokens feasible |
| Position awareness | Implicit (step index) | Explicit (positional encoding) |
Attention was first introduced in 2015 to improve neural machine translation by creating shortcuts between the encoder and the decoder, solving issues with long input sequences.
Attention dynamically computes relevance scores between all words in a sentence, allowing the model to focus on the most important words within the current context and combine their meanings. The result is context-aware representations.
Classic example — the word "bank":
- "I deposited money at the bank" — attention weights concentrate on "money", "deposited".
- "I sat by the river bank" — attention weights concentrate on "river", "sat".
Same word, different representation — this is what static embeddings (Word2Vec, GloVe) cannot do.
| Year | Milestone | Significance |
|---|---|---|
| 2015 | Attention for translation (Bahdanau) | First attention mechanism, encoder-decoder shortcut |
| 2017 | Attention Is All You Need (Vaswani) | Replace recurrence entirely; the Transformer block |
| 2018 | BERT / GPT-1 | Pre-training on massive text; fine-tune for tasks |
| 2020+ | GPT-3, scaling laws | Scale → emergent abilities; LLMs as general tools |
Self-attention transforms each token into three vectors by multiplying its embedding by three learned weight matrices:
- Query (Q) — "What am I looking for?"
- Key (K) — "What information do I offer?"
- Value (V) — "What content do I actually provide?"
Step-by-step breakdown (sketched in code below):
- Dot products: Compute $QK^T$ — the raw relevance score between every query and every key.
- Scale: Divide by $\sqrt{d_k}$ (the key dimension) so that large dot products do not saturate the softmax and shrink its gradients.
- Softmax: Convert scores to probabilities — the attention weights sum to 1.
- Weighted sum: Multiply weights by $V$ — the output is a weighted combination of all value vectors.
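A minimal NumPy sketch of these four steps (single head, no masking); the dimensions and random weights are for illustration only — in a real model $W^Q, W^K, W^V$ are learned:
```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: scores -> scale -> softmax -> weighted sum of values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # steps 1-2: QK^T scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # step 3: softmax, rows sum to 1
    return weights @ V, weights                            # step 4: weighted combination of V

# toy example: 4 tokens, d_model = d_k = 8 (illustrative sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                                # token embeddings (+ positional encoding)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # learned matrices in a real model
out, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(attn.sum(axis=-1))                                   # each row of attention weights sums to 1
```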
Before self-attention, each token passes through:
- Token embedding — maps the token ID to a dense vector (e.g. 512-dimensional).
- Positional encoding — adds a position-dependent signal so the model knows token order (since there is no recurrence). Uses sine/cosine functions of varying frequencies: $PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$, $PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$ (sketched in code below).
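A short NumPy sketch of this encoding; the sequence length and $d_{model}$ are arbitrary illustrative choices, and $d_{model}$ is assumed even:
```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=512)
# the encoding is added element-wise to the token embeddings before the first layer
```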
Single-head attention acts like one giant spotlight that can only point in one direction, often focusing on the most obvious connection while missing others. Multi-head attention runs $h$ attention heads in parallel, each with its own $W^Q_i, W^K_i, W^V_i$ projection matrices.
Different heads specialise in different linguistic relationships simultaneously:
- Syntax — subject-verb agreement, noun-adjective relations; structural grammar.
- Semantics — word meaning in context: "bank" as finance vs. geography.
- Coreference — "The animal didn't cross because it was tired" — links "it" to "animal".
- Local structure — attends to nearby tokens; captures local phrase structure.
The outputs of all heads are concatenated and projected through $W^O$ to produce the final multi-head attention output. The original Transformer (Vaswani 2017) used $h = 8$ heads, $d_{model} = 512$, so each head operates on $d_k = d_v = 64$ dimensions.
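A compact sketch of this split-heads / concatenate / project pattern, reusing the `scaled_dot_product_attention` function above; slicing one shared projection into $h$ pieces is equivalent to using per-head matrices $W^Q_i, W^K_i, W^V_i$, and all shapes here are illustrative:
```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """X: (seq_len, d_model); W_q/W_k/W_v/W_o: (d_model, d_model); h heads of size d_model // h."""
    seq_len, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(h):                                     # real implementations batch this as one matmul
        sl = slice(i * d_k, (i + 1) * d_k)                 # each head works on its own d_k-dim slice
        out, _ = scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl])
        heads.append(out)
    return np.concatenate(heads, axis=-1) @ W_o            # concatenate all heads, then project with W^O
```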
Each encoder layer consists of two sub-layers with residual connections and layer normalisation (Add & Norm) around each:
- Multi-Head Self-Attention — every token attends to all other tokens.
- Feed-Forward Network (FFN) — a position-wise two-layer MLP applied identically to each token.
The FFN has a larger inner dimension ($d_{ff} = 2048$ in the original) than $d_{model} = 512$. It expands then contracts — like a "thinking" step after attention. The same FFN weights are applied to every position independently.
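Schematically, one encoder layer's forward pass can be sketched as follows (post-norm as in the original paper; learned layer-norm gain/bias and dropout are omitted, and `attn_fn` / `ffn_fn` stand in for the two sub-layers):
```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to d_ff, apply ReLU, contract back to d_model (same weights at every position)."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, attn_fn, ffn_fn):
    """One encoder layer: two sub-layers, each wrapped in residual + layer norm (Add & Norm)."""
    x = layer_norm(x + attn_fn(x))                         # sub-layer 1: multi-head self-attention
    x = layer_norm(x + ffn_fn(x))                          # sub-layer 2: position-wise FFN
    return x

# usage sketch: encoder_layer(x, lambda t: multi_head_attention(t, Wq, Wk, Wv, Wo),
#                                lambda t: feed_forward(t, W1, b1, W2, b2))
```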
Each decoder layer has three sub-layers:
- Masked Multi-Head Self-Attention — tokens can only attend to previous positions (causal masking prevents looking ahead during training).
- Multi-Head Cross-Attention — queries come from the decoder, keys and values come from the encoder output. This is how the decoder "reads" the encoded source sequence.
- Feed-Forward Network — same as in the encoder.
After the final decoder layer: Linear projection to vocabulary size → Softmax → probability distribution over the next token.
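The causal mask used in the first decoder sub-layer can be sketched as an upper-triangular matrix of $-\infty$ added to the attention scores before the softmax; shapes here are illustrative:
```python
import numpy as np

def causal_mask(seq_len):
    """Position i may attend only to positions 0..i; future positions receive -inf before the softmax."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.random.randn(5, 5)          # raw QK^T / sqrt(d_k) scores for a 5-token sequence
masked = scores + causal_mask(5)        # softmax turns the -inf entries into attention weight 0
```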
| Component | Function | Original Paper Value |
|---|---|---|
| Token embedding | ID → dense vector | $d_{model} = 512$ |
| Positional encoding | Inject position information | Sinusoidal, same dim as embedding |
| Encoder layers | Build contextual representations | $N = 6$ stacked layers |
| Multi-Head Attention | Capture diverse relationships | $h = 8$ heads, $d_k = d_v = 64$ |
| FFN inner dim | Per-position transformation | $d_{ff} = 2048$ |
| Decoder layers | Generate output sequence | $N = 6$ stacked layers |
| Add & Norm | Residual + layer normalisation | Around every sub-layer |
| Output projection | Decoder → vocabulary logits | Linear + Softmax |
The original Transformer architecture was designed for sequence-to-sequence tasks (e.g. translation). The decoder generates output autoregressively — one token at a time (a code sketch follows this list):
- Start with the `<BOS>` (beginning-of-sequence) token.
- Run it through the decoder; output a probability distribution over the vocabulary.
- Sample or select the highest-probability token (decoding strategy).
- Append the new token to the sequence and repeat from the second step.
- Stop when the `<EOS>` token is generated or max length is reached.
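A minimal sketch of this loop, assuming a hypothetical `model(tokens)` callable that returns next-token logits, and hypothetical `bos_id` / `eos_id` token ids:
```python
import numpy as np

def generate_greedy(model, bos_id, eos_id, max_len=50):
    """Autoregressive greedy decoding: feed the growing sequence back in until <EOS> or max length."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = model(tokens)                 # scores over the vocabulary for the next token
        next_id = int(np.argmax(logits))       # greedy: always take the highest-probability token
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```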
| Strategy | Behaviour | Use Case |
|---|---|---|
| Greedy | Always pick the highest-probability token | Fast; deterministic; can be repetitive |
| Beam Search | Keep top-$k$ sequences at each step | Better quality; used in translation |
| Sampling | Sample from the full distribution | Diverse, creative outputs |
| Top-k Sampling | Sample from the top $k$ tokens only | Balances diversity and coherence |
| Temperature | Sharpen/flatten distribution before sampling | Low $T$ → conservative; high $T$ → creative |
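Temperature and top-k can be sketched as simple transformations of the logits before sampling; this is an illustrative NumPy version, not any particular library's API:
```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Scale logits by temperature, optionally keep only the top-k, then sample from the softmax."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature       # low T sharpens, high T flattens
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]                          # k-th largest logit
        logits = np.where(logits < cutoff, -np.inf, logits)       # discard everything below it
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```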
During training, the decoder uses teacher forcing: all ground-truth tokens are fed in parallel (with causal masking), making training efficient. During inference, the model generates tokens one-by-one using its own previous outputs.
LLMs are Transformer models trained at a scale previously unimaginable: billions of parameters, trillions of tokens, thousands of GPUs over months. The architecture is the same Transformer block — stacked, wider, and deeper.
- Parameters — from millions (BERT: 110M) to hundreds of billions (GPT-4: reportedly ~1.8T). More parameters = more capacity to memorise and generalise.
- Training data — web text, books, code, scientific papers; trillions of tokens. Diverse data enables broad capabilities.
- Compute — scaling laws (Chinchilla, 2022) show the compute-optimal allocation: scale parameters and training tokens roughly in proportion.
- Context window — from 512 tokens (BERT) to 128k+ tokens (GPT-4 Turbo). Longer context = better reasoning over documents.
Beyond predictable performance improvements, large models exhibit emergent abilities — capabilities that appear suddenly at scale and were not present in smaller models:
- Multi-step reasoning — solving multi-step math problems without explicit reasoning training.
- In-context learning — learning from a few examples in the prompt without updating weights.
- Chain-of-thought — generating coherent reasoning chains to improve accuracy.
- Instruction following — understanding and executing complex natural language instructions.
- Code generation — writing functional code across many programming languages.
Every major production LLM follows the same three-stage training pipeline. Each stage converts the raw architecture into a more useful, aligned assistant.
Stage 1 — Pre-training. Objective: next-token prediction (autoregressive) on massive, diverse text data.
- Data: web crawls (Common Crawl), books, Wikipedia, code, scientific papers.
- Outcome: A model that knows language, facts, reasoning patterns — but has no instruction-following behaviour.
- Cost: Millions of dollars in compute; done once by the lab, not practitioners.
Stage 2 — Supervised Fine-Tuning (SFT). Objective: teach the model to follow instructions using high-quality human-written examples.
- Data: Curated prompt-response pairs, written or verified by humans.
- Outcome: Model learns the format and style of helpful, coherent responses.
- Scale: Much smaller dataset than pre-training (thousands to millions of examples).
Stage 3 — Alignment (RLHF / DPO). Objective: align model outputs with human values — helpfulness, harmlessness, honesty.
- RLHF (Reinforcement Learning from Human Feedback): Human raters rank multiple model outputs; a reward model is trained on those rankings; the LLM is optimised against that reward model using RL (PPO).
- DPO (Direct Preference Optimisation): A simpler alternative that skips the explicit reward model.
- Outcome: The model refuses harmful requests, stays on-task, communicates clearly.
| Stage | Objective | Data Type | Cost |
|---|---|---|---|
| Pre-training | Predict next token | Trillions of tokens (raw web) | Very high (lab only) |
| SFT | Follow instructions | Curated prompt-response pairs | Medium |
| RLHF / DPO | Align with human values | Human preference rankings | Medium |
The original encoder-decoder Transformer has spawned a family of variants, each designed for a specific use case by keeping or removing parts of the architecture.
| Type | Architecture | Training Objective | Best For | Examples |
|---|---|---|---|---|
| Encoder-Only | Encoder stack only | Masked Language Modelling (MLM; BERT adds NSP) | Classification, NER, embeddings | BERT, RoBERTa, DistilBERT |
| Decoder-Only | Decoder stack only (causal) | Next-token prediction (autoregressive) | Text generation, chat, code | GPT series, LLaMA, Mistral |
| Encoder-Decoder | Full encoder + decoder | Sequence-to-sequence (span masking) | Translation, summarisation, QA | T5, BART, mT5 |
Zero training data. Zero infrastructure changes. No compute cost. Just better instructions. Prompt engineering is the first tool every practitioner should exhaust before reaching for more expensive approaches.
Sets the model's role, persona, constraints, and style before any conversation begins. The same base model becomes a customer service agent, code reviewer, or research assistant depending on the system prompt.
Give the model a task description with no examples. Works well for common tasks the model encountered during training.
Show 2–3 concrete examples of input → desired output. The model infers the pattern and applies it. Works through in-context learning — an emergent capability. Weights are never updated.
For reasoning tasks (math, logic, multi-step decisions), ask the model to "think step by step" before answering. Generates intermediate reasoning steps that guide the final output. Small prompt change, significant quality gain.
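In practice these patterns are nothing more than prompt construction; a hypothetical example (the task, labels, and wording are invented for illustration):
```python
# Zero-shot: task description only, no examples
zero_shot = ("Classify the sentiment of this review as positive, negative, or neutral:\n"
             "'The battery dies in two hours.'")

# Few-shot: a handful of worked examples before the real input, so the model infers the pattern
few_shot = """Review: 'Great screen, fast delivery.' -> positive
Review: 'Arrived broken, no refund yet.' -> negative
Review: 'The battery dies in two hours.' ->"""

# Chain-of-thought: same question, but asking for intermediate reasoning before the answer
chain_of_thought = ("A train leaves at 9:40 and arrives at 13:05. How long is the journey? "
                    "Think step by step before giving the final answer.")
```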
| Strategy | Data Needed | Cost | Best For |
|---|---|---|---|
| Zero-shot | None | Free | Simple, common tasks |
| Few-shot | 2–10 examples in prompt | Free | Format/pattern tasks |
| Chain-of-Thought | None (just phrasing) | Free | Reasoning, math, logic |
| System prompt | None | Free | Persona, style, constraints |
RAG solves two critical limitations of bare LLMs:
- Stale knowledge — LLMs have a training cutoff; they cannot know about recent events.
- Hallucination — LLMs generate plausible but sometimes incorrect outputs with no external grounding.
The RAG pipeline has four steps (sketched in code below):
- Index — chunk your documents, embed them with a text embedding model, store in a vector database.
- Retrieve — when the user sends a query, embed the query and search for the most similar document chunks (cosine similarity).
- Augment — prepend the retrieved chunks to the LLM's prompt as context.
- Generate — the LLM answers the query grounded in the retrieved context, dramatically reducing hallucination.
You can give any LLM access to private, up-to-date, domain-specific knowledge without modifying its weights. The retrieval backbone uses the word embeddings you learned in Module 05. RAG is how most enterprise LLM applications are built.
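A minimal sketch of the retrieve-and-augment steps, assuming a hypothetical `embed(text)` function that returns a dense vector (any embedding model from the table below could play that role); a real system pre-computes and indexes the chunk embeddings instead of embedding them per query:
```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, chunks, embed, k=3):
    """Embed the query, score every chunk by cosine similarity, return the top-k chunks."""
    q_vec = embed(query)
    scored = sorted(chunks, key=lambda c: cosine_similarity(embed(c), q_vec), reverse=True)
    return scored[:k]

def build_rag_prompt(query, chunks, embed):
    """Augment: prepend the retrieved chunks so the LLM's answer is grounded in them."""
    context = "\n\n".join(retrieve(query, chunks, embed))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```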
| Component | Purpose | Common Tools |
|---|---|---|
| Embedding model | Encode text into dense vectors | text-embedding-ada-002, E5, BGE |
| Vector database | Store and search embeddings at scale | Pinecone, Chroma, FAISS |
| Retriever | Find top-$k$ relevant chunks | Cosine similarity, BM25 hybrid |
| LLM | Generate answer given context | GPT-4, Claude, LLaMA |
| Orchestrator | Tie the pipeline together | LangChain, LlamaIndex |
Fine-tune when specific domain expertise, tone, style, or structured behaviour cannot be reliably produced by prompting and RAG alone.
| Fine-Tuning Type | Goal | Example |
|---|---|---|
| Domain fine-tuning | Teach specialised vocabulary and fluency | Medical reports, legal contracts, financial filings |
| Task fine-tuning | Reliable structured output format or classification | JSON extraction, sentiment → one of three labels |
Often done sequentially: domain adaptation first, then task fine-tuning on top. Choose based on the failure mode — wrong vocabulary or wrong behaviour.
Full fine-tuning updates all billions of parameters — prohibitively expensive for most practitioners. PEFT techniques update only a tiny fraction of parameters while keeping the original weights frozen, achieving high-performance adaptation at a fraction of the cost.
| Method | Approach | Parameters Updated |
|---|---|---|
| LoRA | Inject low-rank matrices $A, B$ alongside frozen weights: $W' = W + \Delta W = W + AB$ | <1% of total |
| QLoRA | LoRA + quantise base model to 4-bit; fits 70B on a single GPU | <1% (4-bit base) |
| Prefix Tuning | Prepend trainable tokens to every attention layer | ~0.1% |
| Adapter layers | Insert small trainable layers between frozen transformer layers | ~2% |
In the LoRA update, $W \in \mathbb{R}^{d \times k}$ is frozen while $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times k}$ are trained, with rank $r \ll \min(d, k)$; a scaling factor $\alpha$ (commonly applied as $\alpha / r$) controls how strongly $AB$ perturbs the frozen weights.
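A sketch of a LoRA-adapted linear layer in PyTorch; `nn.Linear` stands in for a frozen pretrained projection, and the rank and $\alpha$ values are illustrative:
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x W^T + (alpha / r) * x A B  -- W is frozen; only the low-rank factors A and B are trained."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False               # freeze the pretrained weight W
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)   # low-rank factor A
        self.B = nn.Parameter(torch.zeros(r, d_out))         # B starts at zero, so W' = W initially
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)

layer = LoRALinear(512, 512, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable}/{total} = {trainable/total:.1%}")
# ~3% for this single layer; across a full model, where most weights carry no adapter, the fraction falls well below 1%
```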
Model rankings change monthly. This decision framework does not.
| # | Question | Guidance |
|---|---|---|
| 01 | Open or Closed? | Need to inspect/modify weights, run fully offline, or ensure data never leaves your infrastructure? → Open model. Otherwise, closed API gives maximum capability with minimal setup. |
| 02 | Size vs. Your Compute Reality? | 3–7B: runs on a laptop GPU. 70B: needs a server. Frontier: cloud API only. There is no point selecting a model your infrastructure cannot run. |
| 03 | General or Specialized? | A coding task benefits from a code-specialized model. Medical tasks benefit from medical fine-tuned models. Do not default to general when a specialist exists for your domain. |
| 04 | Does Your Task Require Reasoning? | For complex multi-step problems — logical inference, math, research synthesis — a reasoning model (o3, DeepSeek-R1) significantly outperforms standard instruction-tuned models. |
| 05 | Latency & Cost Constraints? | Larger models: higher quality, slower, more expensive. Smaller quantized models: faster and cheaper at some quality cost. Real-time applications almost always need smaller models. |
| If your problem is… | Start with… | Escalate to… |
|---|---|---|
| Simple, common task | Zero-shot prompting | Few-shot prompting |
| Complex reasoning | Chain-of-Thought prompting | Reasoning model (o3, R1) |
| Private / up-to-date data | RAG | RAG + fine-tuning |
| Domain vocabulary | Domain fine-tuning | Domain + task fine-tuning |
| Structured output format | Few-shot + system prompt | Task fine-tuning |
The Hugging Face ecosystem:
- Model Hub — thousands of pretrained models, searchable by task.
- Transformers — unified API to load, run, and fine-tune any model.
- PEFT — LoRA and other parameter-efficient methods.
- Datasets — standardised training and evaluation data access.
Other tooling:
- Local runtimes — run open-source LLMs locally on your machine with one command. No API, no cloud, no cost. Ideal for development, privacy-sensitive applications, and experimentation with models like LLaMA, Mistral, Phi.
- Application frameworks (LangChain, LlamaIndex) — for building LLM-powered applications. Handle RAG pipelines, agent workflows, memory, and tool use. Focus on application logic, not infrastructure plumbing.
- Vector databases — Pinecone, Chroma, FAISS; store and retrieve document embeddings at scale. The retrieval backbone of every RAG system. Use the same embedding space from Module 05.
| Stage | Core Idea | Limitation That Motivated the Next Stage |
|---|---|---|
| RNN | Hidden state carries sequence memory | Vanishing gradient; sequential bottleneck |
| LSTM | Cell state + gates prevent forgetting | Still sequential; long context still limited |
| Attention | Dynamic relevance scores between all tokens | Used alongside RNN; not standalone |
| Transformer | Attention only — no recurrence; full parallelism | Needed massive data + compute to shine |
| LLM | Transformer at scale + 3-stage training | Alignment, reasoning, cost, hallucination |
| Agents / Reasoning | LLMs with tools, memory, multi-step plans | Active research frontier |
Positional encoding → Multi-head self-attention → Residual → FFN → Residual. Same block, stacked. This is the engine of everything after the attention milestone.
Pre-training → SFT → Alignment converts raw architecture into a useful, aligned assistant. Every major model follows it.
Prompt engineering → RAG → Fine-tuning. Each addresses a different gap between what the model can do and what your task needs. Always start at the bottom.
It is never "which model is best?" It is always: "what does my task actually need?" Matching the right tool to the right problem is the professional skill.