Recurrent Neural Networks
LSTM & GRU
Complete course notes — sequential data, RNN architecture, BPTT, LSTM gates (forget/input/output), GRU, bidirectional RNN, and when to use each model. Based on ENSAM 2025/2026 lecture PDFs.
Sequential data is data where the order of elements carries meaning. The same data point means different things in different contexts.
"The bank near the river" — same word "bank" has different meaning depending on context. Past words determine current word's meaning.
Stock price at $t=2$ depends on prices at $t=0$ and $t=1$ — order encodes causality. Prediction impossible without temporal context.
Individual sound samples only make sense in sequence. "Not good" → negative. "Not bad" → positive. Position determines meaning.
EEG signals, DNA sequences, video frames, weather patterns, robot joint angles, music notes — all require memory of the past.
Standard feed-forward networks (MLPs and other FFNs) have three critical limitations that make them unsuitable for sequential data:
- No temporal memory: an FFN sees isolated tokens with zero awareness of what came before. When classifying "The movie was not good", it cannot know that "not" negates "good".
- Fixed-length constraint: Real sequences vary in length. Padding wastes compute; truncation destroys information irreversibly.
- No weight sharing: Learning a pattern (e.g., "not good") at position 5 does not transfer to position 12 — leads to massive parameter counts and poor generalization.
A Recurrent Neural Network (RNN) adds a loop to a standard network — maintaining a hidden state across time steps. The same weights are shared across all time steps.
$h_t = f(W_h h_{t-1} + U x_t + b)$ · $y_t = g(V h_t)$
$h_t$: hidden state at step $t$ · $x_t$: input at step $t$ · $h_{t-1}$: previous hidden state
$W_h, U$: weight matrices (shared across all steps) · $b$: bias · $f$: tanh or ReLU
$y_t$: output at step $t$ · $V$: output weight matrix · $g$: output activation (e.g., Softmax)
The hidden state $h_t$ can be read as "everything I've seen so far, compressed into a vector." The same weights $W_h$ and $U$ are reused at EVERY time step — this is weight sharing across time.
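A minimal NumPy sketch of this recurrence (the dimensions, random weights, and identity output activation are arbitrary illustration choices, not values from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 3, 5, 2          # arbitrary toy dimensions

# ONE set of weights, reused at every time step (weight sharing across time)
W_h = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden -> hidden
U   = rng.normal(scale=0.1, size=(d_h, d_in))   # input  -> hidden
V   = rng.normal(scale=0.1, size=(d_out, d_h))  # hidden -> output
b   = np.zeros(d_h)

def rnn_forward(xs):
    """h_t = tanh(W_h h_{t-1} + U x_t + b), y_t = V h_t at each step."""
    h = np.zeros(d_h)               # h_0: empty memory
    ys = []
    for x_t in xs:                  # strictly sequential, step by step
        h = np.tanh(W_h @ h + U @ x_t + b)
        ys.append(V @ h)            # g = identity here for simplicity
    return np.array(ys), h

ys, h_final = rnn_forward(rng.normal(size=(7, d_in)))  # a length-7 sequence
print(ys.shape)                     # (7, 2): one output per time step
```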
To train an RNN, we unroll the network across time steps, treating it as a very deep network where each layer is one time step; standard backpropagation applied through this unrolled chain is called Backpropagation Through Time (BPTT).
BPTT suffers from the vanishing/exploding gradient problem: gradients must be multiplied by $W_h$ at each time step going backward. If $\|W_h\| < 1$ → gradients shrink exponentially (vanish). If $\|W_h\| > 1$ → gradients explode.
Solution for exploding: Gradient clipping (cap gradients at a max norm).
Solution for vanishing: Use LSTM or GRU with gating mechanisms.
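A hedged PyTorch sketch of the gradient-clipping fix (the model, dummy loss, and max_norm=1.0 are placeholder choices; the comment shows why 50 steps is enough to vanish or explode):

```python
import torch
import torch.nn as nn

# Why gradients misbehave over 50 steps: 0.9**50 ≈ 0.005 (vanish), 1.1**50 ≈ 117 (explode)
model = nn.RNN(input_size=3, hidden_size=5, batch_first=True)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 50, 3)           # batch of 8 sequences, 50 steps each
out, h_n = model(x)
loss = out.pow(2).mean()            # placeholder loss, just to get gradients
loss.backward()

# Cap the global gradient norm before the weight update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```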
Not all problems have the same input/output shape. RNNs are flexible:
| Type | Input → Output | Use Case |
|---|---|---|
| Many-to-One | Sequence → Single value | Sentiment analysis (sentence → positive/negative) |
| One-to-Many | Single value → Sequence | Image captioning (image → sentence) |
| Many-to-Many (sync) | Sequence → Sequence (same length) | POS tagging (word → tag for each word) |
| Many-to-Many (async) | Sequence → Sequence (different length) | Machine translation (Seq2Seq with Encoder-Decoder) |
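The same recurrent core handles all four shapes; here is a minimal PyTorch sketch (arbitrary sizes) showing that many-to-many and many-to-one differ only in which hidden states you read out:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=3, hidden_size=5, batch_first=True)
x = torch.randn(8, 20, 3)           # batch=8, 20 time steps, 3 features

out, (h_n, c_n) = lstm(x)           # out holds the hidden state at every step

many_to_many = out                  # (8, 20, 5): one vector per step (POS tagging)
many_to_one  = out[:, -1, :]        # (8, 5): last step only (sentiment analysis)
print(many_to_many.shape, many_to_one.shape)
```

One-to-many and the asynchronous case additionally need a decoder loop that feeds each output back in as the next input (the Seq2Seq pattern).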
Limitations of vanilla RNNs:
- Short-term memory: a vanilla RNN struggles with long-term dependencies — information from far-back steps effectively "fades away" due to vanishing gradients.
- Vanishing gradient: Gradients diminish exponentially over many time steps → early steps contribute almost nothing to the learning signal.
- Sequential bottleneck: Every time step must wait for the previous one — cannot be parallelized on GPU. Transformers solve this by processing all positions simultaneously.
- No built-in attention: RNNs treat all positions similarly — there is no mechanism to relate position 1 to position 47 directly.
LSTM (Hochreiter & Schmidhuber, 1997) is a specialized RNN architecture designed to learn and remember information over long sequences, effectively solving the vanishing gradient problem.
Key innovation: a separate cell state $C_t$ acts as a "protected highway" for information — gradient flows through the $+$ operation nearly unchanged. This is why LSTM works for long sequences.
Think of the LSTM cell like a person reading a book and taking notes:
- Forget gate: "Strike through old notes that are no longer relevant" (e.g., the previous paragraph's subject)
- Input gate: "Write new information that seems important" (e.g., note the new subject "she")
- Output gate: "Decide what part of your notes to say out loud right now"
Forget gate: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
Output: a value between 0 and 1 for each cell state dimension.
0 = "completely forget this" · 1 = "keep everything"
Example: when the model sees a new paragraph, forget the previous subject's gender agreement rules.
Input gate: $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ · Candidate: $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$
Together: $i_t$ decides what to update, $\tilde{C}_t$ provides the candidate values. Example: noting the new subject "she" → future verbs should use the feminine form.
Cell state update: $C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t$. Forget some old stuff ($f_t \times C_{t-1}$) + add some new stuff ($i_t \times \tilde{C}_t$).
The gradient flows through the $+$ operation nearly unchanged — this is why LSTM solves the vanishing gradient problem!
Output gate: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$, then $h_t = o_t \times \tanh(C_t)$. $o_t$ decides what part of the cell state $C_t$ to output as $h_t$. Example: the cell state knows the full subject, but only outputs what's needed for the current prediction.
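Putting the four equations together, a minimal NumPy sketch of one LSTM step (toy dimensions; the concatenated-weight layout is one common convention, not necessarily the course's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step, following the four gate equations above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate: what to erase
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate: what to write
    C_tilde = np.tanh(W["C"] @ z + b["C"])   # candidate values
    C_t = f_t * C_prev + i_t * C_tilde       # the "+" highway for gradients
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate: what to reveal
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

d_in, d_h = 3, 4                             # toy dimensions
rng = np.random.default_rng(1)
W = {k: rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for k in "fiCo"}
b = {k: np.zeros(d_h) for k in "fiCo"}
h, C = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, b)
```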
Bidirectional LSTM: processes input in both forward and backward directions and concatenates both hidden states: $[h_\text{forward};\, h_\text{backward}]$. Used when full context is available (document NLP).
Stacked LSTM: multiple LSTM layers stacked — the output of one LSTM layer feeds into the next. Increases learning capacity. Used for complex tasks like speech recognition.
ConvLSTM: integrates convolution into the LSTM for spatial-temporal data. Used for video analysis and weather forecasting, where both spatial and temporal dependencies matter.
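A brief PyTorch sketch of stacking (sizes arbitrary; PyTorch has no built-in ConvLSTM module, so that variant is left as a description):

```python
import torch
import torch.nn as nn

# Two stacked layers: layer 1's hidden sequence is layer 2's input
stacked = nn.LSTM(input_size=3, hidden_size=5, num_layers=2, batch_first=True)
out, (h_n, c_n) = stacked(torch.randn(8, 20, 3))
print(out.shape)    # (8, 20, 5): the top layer's hidden states
print(h_n.shape)    # (2, 8, 5): final hidden state of each of the 2 layers
```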
LSTM drawbacks:
- Complex architecture — computationally intensive
- 4× more parameters than a vanilla RNN
- Slow to train on long sequences
- Still sequential — cannot parallelize across time steps
GRU (Cho et al., 2014) is a simplified LSTM with fewer parameters (~25% fewer). It merges the forget and input gates into a single update gate and eliminates the separate cell state.
Update gate: $z_t = \sigma(W_z [h_{t-1}, x_t])$. Decides how much past state to keep vs. how much of the new candidate to use. Combines the forget + input gates of the LSTM into one operation.
Reset gate: $r_t = \sigma(W_r [h_{t-1}, x_t])$. Decides how much past state influences the new candidate $\tilde{h}_t = \tanh(W [r_t \times h_{t-1}, x_t])$. When $r_t \approx 0$, the new candidate ignores the previous state — a fresh start.
New state: $h_t = z_t \times h_{t-1} + (1 - z_t) \times \tilde{h}_t$. The hidden state $h_t$ directly serves as both output and memory — no separate cell state. GRU trains faster than LSTM, with similar performance on medium-length sequences.
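The "~25% fewer" figure can be checked directly in PyTorch; a small sketch (the 128 sizes are arbitrary):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=128)
gru  = nn.GRU(input_size=128, hidden_size=128)
print(n_params(lstm))                       # 132096 (4 gate blocks)
print(n_params(gru))                        # 99072  (3 gate blocks)
print(1 - n_params(gru) / n_params(lstm))   # 0.25 -> 25% fewer
```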
Bidirectional RNN runs two RNNs in parallel: one forward (left-to-right) and one backward (right-to-left) through the sequence. At each step, concatenates both hidden states: $[h^\rightarrow; h^\leftarrow]$.
"The bank was by the river bank" — the second "bank" needs context from BOTH directions to be correctly understood as "riverbank." A forward-only RNN only sees what came before.
Named entity recognition, POS tagging, machine translation encoder, question answering — any task where full context is available at inference time.
Cannot be used for real-time streaming or autoregressive generation — you don't have future tokens yet. Also requires 2× compute.
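A minimal PyTorch sketch of the concatenation (sizes arbitrary):

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=3, hidden_size=5, bidirectional=True, batch_first=True)
out, _ = bilstm(torch.randn(8, 20, 3))
print(out.shape)    # (8, 20, 10): [h_forward; h_backward] concatenated (2 * 5)
```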
| Feature | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Memory type | Hidden state only | Hidden state + Cell state | Hidden state only |
| Gates | None | 3 (forget, input, output) | 2 (update, reset) |
| Parameters | Fewest | Most (4× RNN) | Middle (3× RNN) |
| Long-term memory | Very poor | Excellent | Good |
| Training speed | Fastest | Slowest | Middle |
| Vanishing gradient | Severe | Nearly solved | Nearly solved |
| Best for | Very short sequences (<10 steps) | Long, complex sequences (50+ steps) | Medium sequences, limited compute |
| Introduced | ~1986 | 1997 | 2014 |
| Use | When | Example |
|---|---|---|
| Vanilla RNN | Very short sequences (<10 steps), maximum speed, learning exercises | Simple character-level generation with 5-char windows |
| LSTM | Long sequences (50+ steps), precise timing matters, best accuracy needed, complex task | Translating 100-word paragraphs, stock price 30 days ahead |
| GRU | Medium-length sequences, limited compute (mobile/edge), similar to LSTM but faster | Sentiment analysis on tweets, keyword spotting on microcontroller |
| BiLSTM | Full context always available (no streaming), both directions needed | Named entity recognition in a document, question answering |
| Seq2Seq | Input and output lengths differ | Machine translation, summarization, speech-to-text |
Why RNNs persist (see the streaming sketch after this list):
- Real-time streaming: process one token at a time with $O(1)$ memory. A Transformer's attention must look back over all previously seen tokens, so its memory grows with stream length — infeasible for unbounded streams.
- Edge/embedded devices: A tiny GRU runs on a microcontroller for wake-word detection. GPT-style transformers cannot.
- Irregular time-series: Sensor data, ECG, financial ticks — RNNs handle varying time steps naturally.
- Reinforcement learning: Agents in partially observable environments use LSTMs as working memory.
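A minimal sketch of the streaming case from the first bullet; the stream() generator is a hypothetical stand-in for a live sensor feed:

```python
import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=3, hidden_size=5)
h = torch.zeros(1, 5)          # the entire memory: constant size forever

def stream():                  # hypothetical stand-in for an unbounded feed
    while True:
        yield torch.randn(1, 3)

for i, x_t in enumerate(stream()):
    h = cell(x_t, h)           # O(1) memory per step, no growing cache
    if i == 99:                # end the demo after 100 tokens
        break
print(h.shape)                 # torch.Size([1, 5])
```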
Why Transformers won:
- Parallelizable training: all positions processed simultaneously — 10–100× faster than sequential RNNs on modern GPUs.
- Global attention: Each position can directly attend to any other position — no sequential bottleneck.
- Scales with data: Pre-training on internet-scale datasets only feasible with parallel processing.
| Concept | One-line intuition |
|---|---|
| RNN | A network with memory that updates at each step |
| Hidden state $h_t$ | "Everything I've seen so far, compressed into a vector" |
| Vanishing gradient | Gradients shrink exponentially backwards through time |
| LSTM cell state $C_t$ | A protected highway for gradients — only touched by + and × |
| Forget gate | "What old memory is no longer relevant?" |
| Input gate | "What new information should I write to memory?" |
| Output gate | "What should I actually output right now?" |
| GRU | LSTM with forget+input merged, cell state eliminated |
| Bidirectional | Run two RNNs: one left-to-right, one right-to-left |
| Why Transformers won | Parallelizable training + global attention, no sequential bottleneck |
| Why RNNs persist | $O(1)$ inference memory, streaming, tiny devices |