MOD 07 Transformers & LLMs
ENSAM Casablanca · 2025/2026
Deep Learning & NLP — ENSAM Casablanca

Transformers
& Large Language Models

Complete course notes — from the sequential bottleneck that motivated Transformers, through attention, multi-head attention, and the full encoder-decoder architecture, to LLMs, prompt engineering, RAG, and fine-tuning. Based on ENSAM 2025/2026 lecture PDFs.

Module: 07 of 07
Paper: Attention Is All You Need (2017)
Source: Lecture PDF · ENSAM 2025/2026
Part A
Foundations — Why Transformers & What Attention Means
01 · Why the Transformer Was Needed
The Sequential Bottleneck

RNNs and LSTMs process sequences one token at a time. Each step depends on the previous hidden state, so computation cannot be parallelised across time steps. This created a hard ceiling:

  • Training speed — cannot leverage GPU parallelism; one step must complete before the next begins.
  • Long-range dependencies — even LSTMs struggle to connect information separated by hundreds of tokens; gradient signal degrades over long paths.
  • Scale barrier — training on internet-scale datasets (trillions of tokens) was simply not feasible with sequential architectures.
The Parallel Processing Solution

The Transformer, introduced in "Attention Is All You Need" (Vaswani et al., 2017), replaces recurrence entirely with attention. Every token attends to every other token in a single parallel operation. The consequence was transformative: training on trillions of tokens became feasible, enabling GPT, BERT, and every major LLM that followed.

Property | RNN / LSTM | Transformer
Processing order | Sequential (step-by-step) | Parallel (all tokens at once)
Long-range dependencies | Weak (vanishing gradient) | Strong (direct attention path)
GPU utilisation | Poor (serial dependency) | Excellent (matrix ops)
Training scale | Billions of tokens max | Trillions of tokens feasible
Position awareness | Implicit (step index) | Explicit (positional encoding)
02 · The Core Intuition — What Attention Means

Attention was first introduced in 2015 to improve neural machine translation by creating shortcuts between the encoder and the decoder, solving issues with long input sequences.

The Core Idea

Attention dynamically computes relevance scores between all words in a sentence, allowing the model to focus on the most important words within the current context and combine their meanings. The result is context-aware representations.

Classic example — the word "bank":

  • "I deposited money at the bank" — attention weights concentrate on "money", "deposited".
  • "I sat by the river bank" — attention weights concentrate on "river", "sat".

Same word, different representation — this is what static embeddings (Word2Vec, GloVe) cannot do.

Historical Context
Year | Milestone | Significance
2015 | Attention for translation (Bahdanau) | First attention mechanism, encoder-decoder shortcut
2017 | Attention Is All You Need (Vaswani) | Replace recurrence entirely; the Transformer block
2018 | BERT / GPT-1 | Pre-training on massive text; fine-tune for tasks
2020+ | GPT-3, scaling laws | Scale → emergent abilities; LLMs as general tools
Part B
The Transformer Architecture — Step by Step
03 · Self-Attention
Q, K, V — Query, Key, Value

Self-attention transforms each token into three vectors by multiplying its embedding by three learned weight matrices:

  • Query (Q) — "What am I looking for?"
  • Key (K) — "What information do I offer?"
  • Value (V) — "What content do I actually provide?"
Scaled Dot-Product Attention
Attention Formula $$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

Step-by-step breakdown:

  1. Dot products: Compute $QK^T$ — the raw relevance score between every query and every key.
  2. Scale: Divide by $\sqrt{d_k}$ (key dimension) to prevent vanishing gradients from large dot products.
  3. Softmax: Convert scores to probabilities — the attention weights sum to 1.
  4. Weighted sum: Multiply weights by $V$ — the output is a weighted combination of all value vectors.
Why the scaling? For large $d_k$, dot products grow large, pushing softmax into regions with near-zero gradients. Dividing by $\sqrt{d_k}$ keeps gradients healthy during training.
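The four steps translate almost line for line into code. A minimal NumPy sketch with toy dimensions; a real implementation adds batching, masking, and learned projections:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # steps 1-2: dot products, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # step 3: softmax, rows sum to 1
    return weights @ V                                # step 4: weighted sum of values

# toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8): one context vector per token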
Input Pipeline Before Attention

Before self-attention, each token passes through:

  1. Token embedding — maps the token ID to a dense vector (e.g. 512-dimensional).
  2. Positional encoding — adds a position-dependent signal so the model knows token order (since there is no recurrence). Uses sine/cosine functions of varying frequencies: $PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$, $\quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
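A minimal NumPy sketch of the sinusoidal encoding (even dimensions take the sine, odd dimensions the matching cosine):

import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2): even dimension indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=100, d_model=512)   # added to the token embeddings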
04 · Multi-Head Attention

Single-head attention acts like one giant spotlight that can only point in one direction, often focusing on the most obvious connection while missing others. Multi-head attention runs $h$ attention heads in parallel, each with its own $W^Q_i, W^K_i, W^V_i$ projection matrices.

Multi-Head Attention $$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O$$ $$\text{where} \quad \text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$
What Each Head Learns

Different heads specialise in different linguistic relationships simultaneously:

Syntactic Head

Subject-verb agreement, noun-adjective relations — structural grammar.

Semantic Head

Word meaning in context — "bank" as finance vs. geography.

Coreference Head

"The animal didn't cross because it was tired" — links "it" to "animal".

Position Head

Attends to nearby tokens; captures local phrase structure.

The outputs of all heads are concatenated and projected through $W^O$ to produce the final multi-head attention output. The original Transformer (Vaswani 2017) used $h = 8$ heads, $d_{model} = 512$, so each head operates on $d_k = d_v = 64$ dimensions.
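A minimal sketch of the formula, reusing scaled_dot_product_attention from the self-attention sketch above. The projection matrices here are random placeholders standing in for learned weights:

import numpy as np

def multi_head_attention(X, h=8, d_model=512):
    """h parallel heads, concatenated and projected through W^O."""
    rng = np.random.default_rng(1)
    d_k = d_model // h                          # 512 / 8 = 64 dims per head
    heads = []
    for _ in range(h):
        Wq = rng.normal(size=(d_model, d_k))    # per-head W^Q_i
        Wk = rng.normal(size=(d_model, d_k))    # per-head W^K_i
        Wv = rng.normal(size=(d_model, d_k))    # per-head W^V_i
        heads.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
    concat = np.concatenate(heads, axis=-1)     # (seq_len, h * d_k) = (seq_len, 512)
    W_o = rng.normal(size=(d_model, d_model))
    return concat @ W_o                         # final output projection W^O

X = np.random.default_rng(0).normal(size=(10, 512))   # 10 tokens
out = multi_head_attention(X)                          # (10, 512)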

05 · Full Transformer Architecture
The Encoder Block

Each encoder layer consists of two sub-layers with residual connections and layer normalisation (Add & Norm) around each:

  1. Multi-Head Self-Attention — every token attends to all other tokens.
  2. Feed-Forward Network (FFN) — a position-wise two-layer MLP applied identically to each token.
Encoder Sub-Layer (Add & Norm) $$\text{output} = \text{LayerNorm}(x + \text{SubLayer}(x))$$
Feed-Forward Network
FFN per position $$\text{FFN}(x) = \max(0,\; xW_1 + b_1) W_2 + b_2$$

The FFN has a larger inner dimension ($d_{ff} = 2048$ in the original) than $d_{model} = 512$. It expands then contracts — like a "thinking" step after attention. The same FFN weights are applied to every position independently.
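Combining the two formulas above, one encoder layer is a short function. A minimal NumPy sketch that omits the learned gain and bias of real layer normalisation:

import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm without the learned gain/bias, for brevity."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2: expand to d_ff, contract back to d_model."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, self_attention, ffn_weights):
    x = layer_norm(x + self_attention(x))       # sub-layer 1: attention + Add & Norm
    x = layer_norm(x + ffn(x, *ffn_weights))    # sub-layer 2: FFN + Add & Norm
    return x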

The Decoder Block

Each decoder layer has three sub-layers:

  1. Masked Multi-Head Self-Attention — tokens can only attend to previous positions (causal masking prevents looking ahead during training; see the mask sketch after this list).
  2. Multi-Head Cross-Attention — queries come from the decoder, keys and values come from the encoder output. This is how the decoder "reads" the encoded source sequence.
  3. Feed-Forward Network — same as in the encoder.
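The causal mask from sub-layer 1 is built by setting every future position to $-\infty$ before the softmax, so it receives zero attention weight. A minimal NumPy sketch:

import numpy as np

seq_len = 4
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # True above the diagonal
scores = np.random.default_rng(2).normal(size=(seq_len, seq_len))
scores[mask] = -np.inf    # future positions get zero weight after softmax
# after softmax, row i attends only to positions 0..i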
Output Layer

After the final decoder layer: Linear projection to vocabulary size → Softmax → probability distribution over the next token.

Architecture Summary
Component | Function | Original Paper Value
Token embedding | ID → dense vector | $d_{model} = 512$
Positional encoding | Inject position information | Sinusoidal, same dim as embedding
Encoder layers | Build contextual representations | $N = 6$ stacked layers
Multi-Head Attention | Capture diverse relationships | $h = 8$ heads, $d_k = d_v = 64$
FFN inner dim | Per-position transformation | $d_{ff} = 2048$
Decoder layers | Generate output sequence | $N = 6$ stacked layers
Add & Norm | Residual + layer normalisation | Around every sub-layer
Output projection | Decoder → vocabulary logits | Linear + Softmax
06 · How a Transformer Generates Text

The original Transformer architecture was designed for sequence-to-sequence tasks (e.g. translation). The decoder generates output autoregressively — one token at a time:

  1. Start with the <BOS> (beginning-of-sequence) token.
  2. Run it through the decoder; output a probability distribution over the vocabulary.
  3. Sample or select the highest-probability token (decoding strategy).
  4. Append the new token to the sequence and repeat from step 2.
  5. Stop when the <EOS> token is generated or max length is reached.
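The loop above, written against the real Hugging Face GPT-2 model with greedy decoding. A minimal sketch that starts from a short prompt rather than a bare <BOS> token:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("Attention is all", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                               # steps 2-4: predict, pick, append
        logits = model(ids).logits                    # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()              # greedy: most probable token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        if next_id.item() == tokenizer.eos_token_id:  # step 5: stop at <EOS>
            break
print(tokenizer.decode(ids[0]))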
Decoding Strategies
Strategy | Behaviour | Use Case
Greedy | Always pick the highest-probability token | Fast; deterministic; can be repetitive
Beam Search | Keep top-$k$ sequences at each step | Better quality; used in translation
Sampling | Sample from the full distribution | Diverse, creative outputs
Top-k Sampling | Sample from the top $k$ tokens only | Balances diversity and coherence
Temperature | Sharpen/flatten distribution before sampling | Low $T$ → conservative; high $T$ → creative
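Temperature and top-k both act on the logits before sampling. A minimal NumPy sketch of the two transformations (the function name and toy values are illustrative):

import numpy as np

def sample_next(logits, temperature=1.0, top_k=None):
    """Temperature rescales the logits; top-k keeps only the k most likely tokens."""
    rng = np.random.default_rng()
    logits = logits / temperature              # T < 1 sharpens, T > 1 flattens
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]       # k-th largest logit
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())      # softmax over the surviving tokens
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0])   # toy vocabulary of 5 tokens
sample_next(logits, temperature=0.7, top_k=3)    # conservative, coherent
sample_next(logits, temperature=1.5)             # flatter distribution, more diverse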
Training vs. Inference

During training, the decoder uses teacher forcing: all ground-truth tokens are fed in parallel (with causal masking), making training efficient. During inference, the model generates tokens one-by-one using its own previous outputs.

Part C
Large Language Models — Scale, Training & Architecture Variants
07 · The Scaling Leap — Same Architecture, Radically More Scale

LLMs are Transformer models trained at a scale previously unimaginable: billions of parameters, trillions of tokens, thousands of GPUs over months. The architecture is the same Transformer block — stacked, wider, and deeper.

Scaling Dimensions
Model Parameters

From millions (BERT: 110M) to hundreds of billions and beyond (GPT-4: reportedly ~1.8T, unconfirmed). More parameters = more capacity to memorise and generalise.

Training Data

Web text, books, code, scientific papers — trillions of tokens. Diverse data enables broad capabilities.

Compute (FLOPs)

Scaling laws (Chinchilla, 2022) show optimal compute allocation: roughly equal scaling of parameters and data.

Context Length

From 512 tokens (BERT) to 128k+ tokens (GPT-4 Turbo). Longer context = better reasoning over documents.

Emergent Abilities

Beyond predictable performance improvements, large models exhibit emergent abilities — capabilities that appear suddenly at scale and were not present in smaller models:

  • Multi-step reasoning — solving multi-step math problems without explicit reasoning training.
  • In-context learning — learning from a few examples in the prompt without updating weights.
  • Chain-of-thought — generating coherent reasoning chains to improve accuracy.
  • Instruction following — understanding and executing complex natural language instructions.
  • Code generation — writing functional code across many programming languages.
Emergence threshold — these abilities don't appear gradually; they emerge sharply past a certain scale threshold. This is one of the most striking (and not fully understood) properties of LLMs.
08 · How LLMs Are Trained — The Three-Stage Pipeline

Every major production LLM follows the same three-stage training pipeline. Each stage converts the raw architecture into a more useful, aligned assistant.

Stage 1 — Pre-Training

Objective: Next-token prediction (autoregressive) on massive, diverse text data.

  • Data: web crawls (Common Crawl), books, Wikipedia, code, scientific papers.
  • Outcome: A model that knows language, facts, reasoning patterns — but has no instruction-following behaviour.
  • Cost: Millions of dollars in compute; done once by the lab, not practitioners.
Stage 2 — Supervised Fine-Tuning (SFT)

Objective: Teach the model to follow instructions using high-quality human-written examples.

  • Data: Curated prompt-response pairs, written or verified by humans.
  • Outcome: Model learns the format and style of helpful, coherent responses.
  • Scale: Much smaller dataset than pre-training (thousands to millions of examples).
Stage 3 — Alignment (RLHF)

Objective: Align model outputs with human values — helpfulness, harmlessness, honesty.

  • RLHF (Reinforcement Learning from Human Feedback): Human raters rank multiple model outputs; a reward model is trained on those rankings; the LLM is optimised against that reward model using RL (PPO).
  • DPO (Direct Preference Optimisation): A simpler alternative that skips the explicit reward model.
  • Outcome: The model refuses harmful requests, stays on-task, communicates clearly.
Stage | Objective | Data Type | Cost
Pre-training | Predict next token | Trillions of tokens (raw web) | Very high (lab only)
SFT | Follow instructions | Curated prompt-response pairs | Medium
RLHF / DPO | Align with human values | Human preference rankings | Medium
09 · Transformer-Based Model Variants

The original encoder-decoder Transformer has spawned a family of variants, each designed for a specific use case by keeping or removing parts of the architecture.

Type | Architecture | Training Objective | Best For | Examples
Encoder-Only | Encoder stack only | Masked Language Modelling (MLM) + NSP | Classification, NER, embeddings | BERT, RoBERTa, DistilBERT
Decoder-Only | Decoder stack only (causal) | Next-token prediction (autoregressive) | Text generation, chat, code | GPT series, LLaMA, Mistral
Encoder-Decoder | Full encoder + decoder | Sequence-to-sequence (span masking) | Translation, summarisation, QA | T5, BART, mT5
Modern LLMs are almost all decoder-only (GPT, LLaMA, Mistral, Claude, Gemini). The autoregressive objective maps naturally to generation tasks, and the unified architecture scales more cleanly than encoder-decoder variants.
Part D
Working with LLMs — Prompting, RAG, Fine-Tuning
10 · Prompt Engineering

Zero training data. Zero infrastructure changes. No compute cost. Just better instructions. Prompt engineering is the first tool every practitioner should exhaust before reaching for more expensive approaches.

System Prompts

Sets the model's role, persona, constraints, and style before any conversation begins. The same base model becomes a customer service agent, code reviewer, or research assistant depending on the system prompt.

Zero-Shot Prompting

Give the model a task description with no examples. Works well for common tasks the model encountered during training.

Prompt: "Classify the sentiment of this review as Positive or Negative. Review: 'The battery lasts forever but the screen is dim.' Answer:"
Few-Shot Prompting

Show 2–3 concrete examples of input → desired output. The model infers the pattern and applies it. Works through in-context learning — an emergent capability. Weights are never updated.

Review: "Loved the design, hated the keyboard." → Negative Review: "Fast shipping, perfect quality." → Positive Review: "Looks great but overpriced." → ?
Chain-of-Thought (CoT) Prompting

For reasoning tasks (math, logic, multi-step decisions), ask the model to "think step by step" before answering. Generates intermediate reasoning steps that guide the final output. Small prompt change, significant quality gain.

Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many does he have?
A: Let me think step by step. Roger starts with 5 balls. He buys 2 × 3 = 6 balls. Total = 5 + 6 = 11 balls.
When to stop here: Most practitioners jump to fine-tuning before exhausting sophisticated prompting. Prompting is fast, reversible, and free. Always exhaust it first before reaching for more expensive approaches.
Prompt Engineering Strategy Comparison
Strategy | Data Needed | Cost | Best For
Zero-shot | None | Free | Simple, common tasks
Few-shot | 2–10 examples in prompt | Free | Format/pattern tasks
Chain-of-Thought | None (just phrasing) | Free | Reasoning, math, logic
System prompt | None | Free | Persona, style, constraints
11 · Retrieval-Augmented Generation (RAG)

RAG solves two critical limitations of bare LLMs:

  • Stale knowledge — LLMs have a training cutoff; they cannot know about recent events.
  • Hallucination — LLMs generate plausible but sometimes incorrect outputs with no external grounding.
How RAG Works
  1. Index — chunk your documents, embed them with a text embedding model, store in a vector database.
  2. Retrieve — when the user sends a query, embed the query and search for the most similar document chunks (cosine similarity).
  3. Augment — prepend the retrieved chunks to the LLM's prompt as context.
  4. Generate — the LLM answers the query grounded in the retrieved context, dramatically reducing hallucination.
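A minimal sketch of steps 1 to 3 using the sentence-transformers library; the model name all-MiniLM-L6-v2 and the toy documents are illustrative choices, not prescribed by the course:

import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "Our refund policy allows returns within 30 days.",
    "The Transformer was introduced by Vaswani et al. in 2017.",
    "Support is available Monday to Friday, 9am to 6pm.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)     # 1. Index

query = "How long do I have to return an item?"
q_vec = embedder.encode([query], normalize_embeddings=True)[0]

scores = doc_vecs @ q_vec                                       # 2. Retrieve: cosine similarity
top_chunk = docs[int(np.argmax(scores))]

prompt = f"Context: {top_chunk}\n\nQuestion: {query}\nAnswer:"  # 3. Augment
# 4. Generate: send `prompt` to any LLM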
Why RAG Matters

You can give any LLM access to private, up-to-date, domain-specific knowledge without modifying its weights. The retrieval backbone uses the word embeddings you learned in Module 05. RAG is how most enterprise LLM applications are built.

Component | Purpose | Common Tools
Embedding model | Encode text into dense vectors | text-embedding-ada-002, E5, BGE
Vector database | Store and search embeddings at scale | Pinecone, Chroma, FAISS
Retriever | Find top-$k$ relevant chunks | Cosine similarity, BM25 hybrid
LLM | Generate answer given context | GPT-4, Claude, LLaMA
Orchestrator | Tie the pipeline together | LangChain, LlamaIndex
12 · Fine-Tuning & PEFT
When to Fine-Tune

Fine-tune when a specific domain expertise, tone, style, or structured behaviour cannot be reliably produced by prompting and RAG alone.

Fine-Tuning Type | Goal | Example
Domain fine-tuning | Teach specialised vocabulary and fluency | Medical reports, legal contracts, financial filings
Task fine-tuning | Reliable structured output format or classification | JSON extraction, sentiment → one of three labels

Often done sequentially: domain adaptation first, then task fine-tuning on top. Choose based on the failure mode — wrong vocabulary or wrong behaviour.

Parameter-Efficient Fine-Tuning (PEFT)

Full fine-tuning updates all billions of parameters — prohibitively expensive for most practitioners. PEFT techniques update only a tiny fraction of parameters while keeping the original weights frozen, achieving high-performance adaptation at a fraction of the cost.

Method | Approach | Parameters Updated
LoRA | Inject low-rank matrices $A, B$ alongside frozen weights: $W' = W + \Delta W = W + AB$ | <1% of total
QLoRA | LoRA + quantise base model to 4-bit; fits 70B on a single GPU | <1% (4-bit base)
Prefix Tuning | Prepend trainable tokens to every attention layer | ~0.1%
Adapter layers | Insert small trainable layers between frozen transformer layers | ~2%
LoRA — Low-Rank Adaptation $$W' = W + \Delta W = W + \frac{\alpha}{r} AB$$

$W \in \mathbb{R}^{d \times k}$ is frozen. $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times k}$ where rank $r \ll \min(d, k)$. $\alpha$ is a scaling factor.
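The savings are easy to verify from the dimensions. A minimal NumPy sketch; one factor is initialised to zero so $\Delta W = 0$ at the start of training (as in the LoRA paper):

import numpy as np

d, k, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(4)

W = rng.normal(size=(d, k))           # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01    # trainable
B = np.zeros((r, k))                  # trainable, zero-init so AB starts at 0

W_adapted = W + (alpha / r) * (A @ B)   # W' = W + (alpha/r) AB

full_params = d * k         # 262,144 parameters to fine-tune W directly
lora_params = r * (d + k)   # 8,192, about 3% of the full count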

13 · Model Selection — Five Questions That Do Not Expire

Model rankings change monthly. This decision framework does not.

# | Question | Guidance
01 | Open or Closed? | Need to inspect/modify weights, run fully offline, or ensure data never leaves your infrastructure? → Open model. Otherwise, closed API gives maximum capability with minimal setup.
02 | Size vs. Your Compute Reality? | 3–7B: runs on a laptop GPU. 70B: needs a server. Frontier: cloud API only. There is no point selecting a model your infrastructure cannot run.
03 | General or Specialized? | A coding task benefits from a code-specialized model. Medical tasks benefit from medical fine-tuned models. Do not default to general when a specialist exists for your domain.
04 | Does Your Task Require Reasoning? | For complex multi-step problems — logical inference, math, research synthesis — a reasoning model (o3, DeepSeek-R1) significantly outperforms standard instruction-tuned models.
05 | Latency & Cost Constraints? | Larger models: higher quality, slower, more expensive. Smaller quantized models: faster and cheaper at some quality cost. Real-time applications almost always need smaller models.
Choosing an Adaptation Strategy
If your problem is… | Start with… | Escalate to…
Simple, common task | Zero-shot prompting | Few-shot prompting
Complex reasoning | Chain-of-Thought prompting | Reasoning model (o3, R1)
Private / up-to-date data | RAG | RAG + fine-tuning
Domain vocabulary | Domain fine-tuning | Domain + task fine-tuning
Structured output format | Few-shot + system prompt | Task fine-tuning
14 · Tools & Libraries
HuggingFace

Model Hub — thousands of pretrained models, searchable by task.
Transformers — unified API to load, run, and fine-tune any model.
PEFT — LoRA and other parameter-efficient methods.
Datasets — standardised training and evaluation data access.

Ollama

Run open-source LLMs locally on your machine with one command. No API, no cloud, no cost. Ideal for development, privacy-sensitive applications, and experimentation with models like LLaMA, Mistral, Phi.

LangChain / LlamaIndex

Frameworks for building LLM-powered applications. Handle RAG pipelines, agent workflows, memory, and tool use. Focus on application logic, not infrastructure plumbing.

Vector Databases

Pinecone, Chroma, FAISS — store and retrieve document embeddings at scale. The retrieval backbone of every RAG system. Use the same embedding space from Module 05.

Quick Code — HuggingFace Pipeline
from transformers import pipeline

# Sentiment analysis (fine-tuned BERT)
classifier = pipeline("sentiment-analysis")
result = classifier("The transformer architecture changed everything.")
# → [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation (GPT-2 style)
generator = pipeline("text-generation", model="gpt2")
result = generator("Attention is all you need", max_length=50)

# Zero-shot classification
clf = pipeline("zero-shot-classification")
result = clf(
    "I need to refund my purchase",
    candidate_labels=["billing", "technical support", "shipping"]
)
Quick Code — PEFT with LoRA
from transformers import AutoModelForSequenceClassification
from peft import get_peft_model, LoraConfig, TaskType

# Load base model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,              # rank — controls number of trainable parameters
    lora_alpha=32,     # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "value"]
)

# Wrap model — only LoRA params will be updated
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 296,448 || all params: 109,779,458 || trainable%: 0.27%
Part E
The Map That Does Not Expire
15 · The Mental Map That Does Not Expire
The Evolution Timeline
Stage | Core Idea | Limitation That Motivated the Next Stage
RNN | Hidden state carries sequence memory | Vanishing gradient; sequential bottleneck
LSTM | Cell state + gates prevent forgetting | Still sequential; long context still limited
Attention | Dynamic relevance scores between all tokens | Used alongside RNN; not standalone
Transformer | Attention only — no recurrence; full parallelism | Needed massive data + compute to shine
LLM | Transformer at scale + 3-stage training | Alignment, reasoning, cost, hallucination
Agents / Reasoning | LLMs with tools, memory, multi-step plans | Active research frontier
Three Things to Carry Forward
The Transformer Block

Positional encoding → Multi-head self-attention → Residual → FFN → Residual. Same block, stacked. This is the engine of everything after the attention milestone.

The Three-Stage Pipeline

Pre-training → SFT → Alignment converts raw architecture into a useful, aligned assistant. Every major model follows it.

The Adaptation Ladder

Prompt engineering → RAG → Fine-tuning. Each addresses a different gap between what the model can do and what your task needs. Always start at the bottom.

The Right Question

It is never "which model is best?" It is always: "what does my task actually need?" Matching the right tool to the right problem is the professional skill.

Critical Reminders
I. LLMs are not search engines. They generate possible tokens — they do not retrieve verified facts. They can be confidently wrong. Treat every output as a first draft that requires verification.
II. LLMs do not understand — they simulate understanding. They have no beliefs, no intentions, and no connection to reality. Fluent, confident output is not evidence of comprehension; it is a probabilistic achievement. The model predicts tokens, not truth.
The field will keep moving. The mental map does not expire. RNN → LSTM → Attention → Transformer → LLM → Agents/Reasoning. You now understand the full trajectory.