Recurrent Neural Networks
LSTM & GRU
Complete course notes — sequential data, RNN architecture, BPTT, LSTM gates (forget/input/output), GRU, bidirectional RNN, and when to use each model. Based on ENSAM 2025/2026 lecture PDFs.
Sequential data is data where the order of elements carries meaning. The same data point means different things in different contexts.
"The bank near the river" — same word "bank" has different meaning depending on context. Past words determine current word's meaning.
Stock price at $t=2$ depends on prices at $t=0$ and $t=1$ — order encodes causality. Prediction impossible without temporal context.
Individual sound samples only make sense in sequence. "Not good" → negative. "Not bad" → positive. Position determines meaning.
EEG signals, DNA sequences, video frames, weather patterns, robot joint angles, music notes — all require memory of the past.
Standard feed-forward networks (MLPs and other FFNs) have three critical limitations that make them unsuitable for sequential data:
- No temporal memory: an FFN sees isolated tokens with zero awareness of what came before. When classifying "The movie was not good", it cannot know that "not" negates "good".
- Fixed-length constraint: Real sequences vary in length. Padding wastes compute; truncation destroys information irreversibly.
- No weight sharing: Learning a pattern (e.g., "not good") at position 5 does not transfer to position 12 — leads to massive parameter counts and poor generalization.
A Recurrent Neural Network (RNN) adds a loop to a standard network — maintaining a hidden state across time steps. The same weights are shared across all time steps.
$h_t = f(W_h h_{t-1} + U x_t + b)$ · $y_t = g(V h_t)$
$h_t$: hidden state at step $t$ · $x_t$: input at step $t$ · $h_{t-1}$: previous hidden state
$W_h, U$: weight matrices (shared across all steps) · $b$: bias · $f$: tanh or ReLU
$y_t$: output at step $t$ · $V$: output weight matrix · $g$: output activation (e.g., Softmax)
The hidden state $h_t$ can be read as "everything I've seen so far, compressed into a vector." The same weights $W_h$ and $U$ are reused at EVERY time step — this is weight sharing across time.
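A minimal NumPy sketch of this recurrence (the dimensions, random weights, and identity output activation are arbitrary illustration choices, not values from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 3, 5, 2          # arbitrary toy dimensions

# ONE set of weights, reused at every time step (weight sharing across time)
W_h = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden -> hidden
U   = rng.normal(scale=0.1, size=(d_h, d_in))   # input  -> hidden
V   = rng.normal(scale=0.1, size=(d_out, d_h))  # hidden -> output
b   = np.zeros(d_h)

def rnn_forward(xs):
    """h_t = tanh(W_h h_{t-1} + U x_t + b), y_t = V h_t at each step."""
    h = np.zeros(d_h)               # h_0: empty memory
    ys = []
    for x_t in xs:                  # strictly sequential, step by step
        h = np.tanh(W_h @ h + U @ x_t + b)
        ys.append(V @ h)            # g = identity here for simplicity
    return np.array(ys), h

ys, h_final = rnn_forward(rng.normal(size=(7, d_in)))  # a length-7 sequence
print(ys.shape)                     # (7, 2): one output per time step
```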
To train an RNN, we unroll the network across time steps, treating it as a very deep network where each layer is one time step; standard backpropagation applied through this unrolled chain is called Backpropagation Through Time (BPTT).
BPTT suffers from the vanishing/exploding gradient problem: gradients must be multiplied by $W_h$ at each time step going backward. If $\|W_h\| < 1$ → gradients shrink exponentially (vanish). If $\|W_h\| > 1$ → gradients explode.
Solution for exploding: Gradient clipping (cap gradients at a max norm).
Solution for vanishing: Use LSTM or GRU with gating mechanisms.
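A hedged PyTorch sketch of the gradient-clipping fix (the model, dummy loss, and max_norm=1.0 are placeholder choices; the comment shows why 50 steps is enough to vanish or explode):

```python
import torch
import torch.nn as nn

# Why gradients misbehave over 50 steps: 0.9**50 ≈ 0.005 (vanish), 1.1**50 ≈ 117 (explode)
model = nn.RNN(input_size=3, hidden_size=5, batch_first=True)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 50, 3)           # batch of 8 sequences, 50 steps each
out, h_n = model(x)
loss = out.pow(2).mean()            # placeholder loss, just to get gradients
loss.backward()

# Cap the global gradient norm before the weight update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```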
Not all problems have the same input/output shape. RNNs are flexible:
| Type | Input → Output | Use Case |
|---|---|---|
| Many-to-One | Sequence → Single value | Sentiment analysis (sentence → positive/negative) |
| One-to-Many | Single value → Sequence | Image captioning (image → sentence) |
| Many-to-Many (sync) | Sequence → Sequence (same length) | POS tagging (word → tag for each word) |
| Many-to-Many (async) | Sequence → Sequence (different length) | Machine translation (Seq2Seq with Encoder-Decoder) |
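The same recurrent core handles all four shapes; here is a minimal PyTorch sketch (arbitrary sizes) showing that many-to-many and many-to-one differ only in which hidden states you read out:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=5, batch_first=True)
x = torch.randn(8, 20, 3)           # batch=8, 20 time steps, 3 features

out, (h_n, c_n) = lstm(x)           # out holds the hidden state at every step

many_to_many = out                  # (8, 20, 5): one vector per step (POS tagging)
many_to_one  = out[:, -1, :]        # (8, 5): last step only (sentiment analysis)
print(many_to_many.shape, many_to_one.shape)
```

One-to-many and the asynchronous case additionally need a decoder loop that feeds each output back in as the next input (the Seq2Seq pattern).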
Limitations of vanilla RNNs:
- Short-term memory: a vanilla RNN struggles with long-term dependencies — information from far-back steps effectively "fades away" due to vanishing gradients.
- Vanishing gradient: Gradients diminish exponentially over many time steps → early steps contribute almost nothing to the learning signal.
- Sequential bottleneck: Every time step must wait for the previous one — cannot be parallelized on GPU. Transformers solve this by processing all positions simultaneously.
- No built-in attention: RNNs treat all positions similarly — there is no mechanism to relate position 1 to position 47 directly.
LSTM (Hochreiter & Schmidhuber, 1997) is a specialized RNN architecture designed to learn and remember information over long sequences, effectively solving the vanishing gradient problem.
Key innovation: a separate cell state $C_t$ acts as a "protected highway" for information — gradient flows through the $+$ operation nearly unchanged. This is why LSTM works for long sequences.
Think of the LSTM cell like a person reading a book and taking notes:
- Forget gate: "Strike through old notes that are no longer relevant" (e.g., the previous paragraph's subject)
- Input gate: "Write new information that seems important" (e.g., note the new subject "she")
- Output gate: "Decide what part of your notes to say out loud right now"
Forget gate: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
Output: a value between 0 and 1 for each cell state dimension.
0 = "completely forget this" · 1 = "keep everything"
Example: when the model sees a new paragraph, forget the previous subject's gender agreement rules.
Input gate: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ · Candidate: $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$
Together: $i_t$ decides what to update, $\tilde{C}_t$ provides the candidate values. Example: noting the new subject "she" → future verbs should use the feminine form.
Cell state update: $C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t$. Forget some old stuff ($f_t \times C_{t-1}$) + add some new stuff ($i_t \times \tilde{C}_t$).
The gradient flows through the $+$ operation nearly unchanged — this is why LSTM solves the vanishing gradient problem!
Output gate: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$, then $h_t = o_t \times \tanh(C_t)$. $o_t$ decides what part of the cell state $C_t$ to output as $h_t$. Example: the cell state knows the full subject, but only outputs what's needed for the current prediction.
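Putting the four equations together, a minimal NumPy sketch of one LSTM step (toy dimensions; the concatenated-weight layout is one common convention, not necessarily the course's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step, following the four gate equations above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate: what to erase
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate: what to write
    C_tilde = np.tanh(W["C"] @ z + b["C"])   # candidate values
    C_t = f_t * C_prev + i_t * C_tilde       # the "+" highway for gradients
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate: what to reveal
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

d_in, d_h = 3, 4                             # toy dimensions
rng = np.random.default_rng(1)
W = {k: rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for k in "fiCo"}
b = {k: np.zeros(d_h) for k in "fiCo"}
h, C = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```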
Bidirectional LSTM: processes input in both forward and backward directions and concatenates both hidden states: $[h_\text{forward};\, h_\text{backward}]$. Used when full context is available (document NLP).
Stacked LSTM: multiple LSTM layers stacked — the output of one LSTM layer feeds into the next. Increases learning capacity. Used for complex tasks like speech recognition.
ConvLSTM: integrates convolution into the LSTM for spatial-temporal data. Used for video analysis and weather forecasting, where both spatial and temporal dependencies matter.
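A brief PyTorch sketch of stacking (sizes arbitrary; PyTorch has no built-in ConvLSTM module, so that variant is left as a description):

```python
import torch
import torch.nn as nn

# Two stacked layers: layer 1's hidden sequence is layer 2's input
stacked = nn.LSTM(input_size=3, hidden_size=5, num_layers=2, batch_first=True)
out, (h_n, c_n) = stacked(torch.randn(8, 20, 3))
print(out.shape)    # (8, 20, 5): the top layer's hidden states
print(h_n.shape)    # (2, 8, 5): final hidden state of each of the 2 layers
```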
LSTM drawbacks:
- Complex architecture — computationally intensive
- 4× more parameters than a vanilla RNN
- Slow to train on long sequences
- Still sequential — cannot parallelize across time steps
GRU (Cho et al., 2014) is a simplified LSTM with fewer parameters (~25% fewer). It merges the forget and input gates into a single update gate and eliminates the separate cell state.
Update gate: $z_t = \sigma(W_z [h_{t-1}, x_t])$. Decides how much past state to keep vs. how much of the new candidate to use. Combines the forget + input gates of the LSTM into one operation.
Reset gate: $r_t = \sigma(W_r [h_{t-1}, x_t])$. Decides how much past state influences the new candidate $\tilde{h}_t = \tanh(W [r_t \times h_{t-1}, x_t])$. When $r_t \approx 0$, the new candidate ignores the previous state — a fresh start.
New state: $h_t = z_t \times h_{t-1} + (1 - z_t) \times \tilde{h}_t$. The hidden state $h_t$ directly serves as both output and memory — no separate cell state. GRU trains faster than LSTM, with similar performance on medium-length sequences.
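The "~25% fewer" figure can be checked directly in PyTorch; a small sketch (the 128 sizes are arbitrary):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=128)
gru  = nn.GRU(input_size=128, hidden_size=128)
print(n_params(lstm))                       # 132096 (4 gate blocks)
print(n_params(gru))                        # 99072  (3 gate blocks)
print(1 - n_params(gru) / n_params(lstm))   # 0.25 -> 25% fewer
```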
Bidirectional RNN runs two RNNs in parallel: one forward (left-to-right) and one backward (right-to-left) through the sequence. At each step, concatenates both hidden states: $[h^\rightarrow; h^\leftarrow]$.
"The bank was by the river bank" — the second "bank" needs context from BOTH directions to be correctly understood as "riverbank." A forward-only RNN only sees what came before.
Named entity recognition, POS tagging, machine translation encoder, question answering — any task where full context is available at inference time.
Cannot be used for real-time streaming or autoregressive generation — you don't have future tokens yet. Also requires 2× compute.
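A minimal PyTorch sketch of the concatenation (sizes arbitrary):

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=3, hidden_size=5, bidirectional=True, batch_first=True)
out, _ = bilstm(torch.randn(8, 20, 3))
print(out.shape)    # (8, 20, 10): [h_forward; h_backward] concatenated (2 * 5)
```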
| Feature | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Memory type | Hidden state only | Hidden state + Cell state | Hidden state only |
| Gates | None | 3 (forget, input, output) | 2 (update, reset) |
| Parameters | Fewest | Most (4× RNN) | Middle (3× RNN) |
| Long-term memory | Very poor | Excellent | Good |
| Training speed | Fastest | Slowest | Middle |
| Vanishing gradient | Severe | Nearly solved | Nearly solved |
| Best for | Very short sequences (<10 steps) | Long, complex sequences (50+ steps) | Medium sequences, limited compute |
| Introduced | ~1986 | 1997 | 2014 |
| Use | When | Example |
|---|---|---|
| Vanilla RNN | Very short sequences (<10 steps), maximum speed, learning exercises | Simple character-level generation with 5-char windows |
| LSTM | Long sequences (50+ steps), precise timing matters, best accuracy needed, complex task | Translating 100-word paragraphs, stock price 30 days ahead |
| GRU | Medium-length sequences, limited compute (mobile/edge), similar to LSTM but faster | Sentiment analysis on tweets, keyword spotting on microcontroller |
| BiLSTM | Full context always available (no streaming), both directions needed | Named entity recognition in a document, question answering |
| Seq2Seq | Input and output lengths differ | Machine translation, summarization, speech-to-text |
Why RNNs persist (see the streaming sketch after this list):
- Real-time streaming: process one token at a time with $O(1)$ memory. A Transformer's attention must look back over all previously seen tokens, so its memory grows with stream length — infeasible for unbounded streams.
- Edge/embedded devices: A tiny GRU runs on a microcontroller for wake-word detection. GPT-style transformers cannot.
- Irregular time-series: Sensor data, ECG, financial ticks — RNNs handle varying time steps naturally.
- Reinforcement learning: Agents in partially observable environments use LSTMs as working memory.
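A minimal sketch of the streaming case from the first bullet; the stream() generator is a hypothetical stand-in for a live sensor feed:

```python
import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=3, hidden_size=5)
h = torch.zeros(1, 5)          # the entire memory: constant size forever

def stream():                  # hypothetical stand-in for an unbounded feed
    while True:
        yield torch.randn(1, 3)

for i, x_t in enumerate(stream()):
    h = cell(x_t, h)           # O(1) memory per step, no growing cache
    if i == 99:                # end the demo after 100 tokens
        break
print(h.shape)                 # torch.Size([1, 5])
```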
Why Transformers won:
- Parallelizable training: all positions processed simultaneously — 10–100× faster than sequential RNNs on modern GPUs.
- Global attention: Each position can directly attend to any other position — no sequential bottleneck.
- Scales with data: Pre-training on internet-scale datasets only feasible with parallel processing.
| Concept | One-line intuition |
|---|---|
| RNN | A network with memory that updates at each step |
| Hidden state $h_t$ | "Everything I've seen so far, compressed into a vector" |
| Vanishing gradient | Gradients shrink exponentially backwards through time |
| LSTM cell state $C_t$ | A protected highway for gradients — only touched by + and × |
| Forget gate | "What old memory is no longer relevant?" |
| Input gate | "What new information should I write to memory?" |
| Output gate | "What should I actually output right now?" |
| GRU | LSTM with forget+input merged, cell state eliminated |
| Bidirectional | Run two RNNs: one left-to-right, one right-to-left |
| Why Transformers won | Parallelizable training + global attention, no sequential bottleneck |
| Why RNNs persist | $O(1)$ inference memory, streaming, tiny devices |