ENSAM Deep Learning & NLP — Study Guide
Course Notes — Academic Year 2025/2026

Deep Learning & NLP

7 modules of complete course notes rebuilt from the professor's lecture PDFs. From ANN and CNN to Transformers and LLMs. ENSAM Casablanca · Hassan II University.

Modules: 7 complete modules
Source: Professor lecture PDFs
Math: KaTeX — LaTeX formulas
Exam: 18 May 2026 · Written exam
Exam Preparation
Q&A: 210 Exam Questions — All Modules
30 questions per module · Answers revealed on click · Covers every formula, benchmark, and concept from all 7 modules. Start here the night before the exam.
210 Questions · Click to reveal · All Modules
Course Modules
MOD 01 · Deep Learning Essentials & ANN
What is DL (compositionality, end-to-end, distributed representations), ML vs DL, deep architecture types, manifold hypothesis, ANN forward pass, all activation functions, loss functions, backpropagation, gradient descent, optimizers, regularization.
DL Fundamentals · ANN · Backprop
MOD 02 · Convolutional Neural Networks
Why CNNs (parameter explosion, no spatial awareness), convolution operation (filter as flashlight), padding & stride formula, pooling (max/avg/global), CNN architectures (LeNet → EfficientNet, ResNet skip connections), transfer learning, data augmentation, evaluation metrics.
CNN · ResNet · Transfer Learning
MOD 03 · RNN / LSTM / GRU
Sequential data, why FFN fails on sequences, RNN hidden state equation, BPTT, RNN types (many-to-one etc.), LSTM (all 4 gate equations with cell state highway), GRU (update & reset gates), BiRNN, comparison table, when to use each.
RNN · LSTM · GRU · Seq2Seq
MOD 04 · NLP Introduction & Text Processing
Language layers (phonology → pragmatics), NLP definition, NLU vs NLG, full NLP pipeline, cleaning (regex, 6 steps), tokenization (word/sentence/BPE), stop words (danger: "not"), stemming vs lemmatization, POS tagging, NER, dependency parsing, CFG.
NLP · Pipeline · spaCy · NLTK
MOD 05 · Classical Text Representation
Bag of Words (counting vector, 99% sparse), N-grams (captures negation via bigrams), TF-IDF (TF×IDF formula, real IDF values from 50k IMDb). Results: BoW 86.2% → TF-IDF 1+2gram 90.1%. When to stop at classical methods.
BoW · TF-IDF · N-grams · IMDb 50k
MOD 06 · Word Embeddings — Static & Contextual
Word2Vec (Skip-gram/CBOW, king−man+woman≈queen), GloVe (global co-occurrence), FastText (subwords, OOV), polysemy problem, BERT (MLM+NSP, 110M params, 93.9%). Exact IMDb benchmark: GloVe 76.8%, W2V 85.7%, FastText 86.0%, BERT 93.9%.
Word2Vec · GloVe · BERT · Embeddings
MOD 07 · Transformers & Large Language Models
Why Transformers (sequential bottleneck → parallel), attention formula Attention(Q,K,V) = softmax(QKᵀ/√d_k)V, multi-head attention (h=8 heads), full encoder-decoder architecture, Add&Norm, positional encoding, LLMs (scaling, emergent abilities), 3-stage training (pre-training → SFT → RLHF), prompt engineering (zero-shot, few-shot, CoT), RAG, PEFT/LoRA, model selection framework.
Transformers · LLMs · Attention · RAG · LoRA
Must-Know Formulas
Convolution Output Size
W_out = floor((W_in - K + 2P) / S) + 1
K = kernel, P = padding, S = stride
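A quick way to sanity-check the formula is to compute it directly; a minimal Python sketch (the 224 → 112 example assumes a 7×7 kernel with padding 3 and stride 2, as in a typical ResNet stem):

```python
def conv_output_size(w_in: int, k: int, p: int, s: int) -> int:
    # W_out = floor((W_in - K + 2P) / S) + 1
    return (w_in - k + 2 * p) // s + 1

print(conv_output_size(224, 7, 3, 2))  # 224x224 input, 7x7 kernel, P=3, S=2 -> 112
print(conv_output_size(32, 3, 1, 1))   # 3x3 kernel, padding 1, stride 1 keeps 32
```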
Gradient Descent Update
w ← w − η · ∂L/∂w
η = learning rate, L = loss function
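A minimal sketch of the update rule on a toy one-dimensional loss (the loss, learning rate, and starting point are illustrative, not from the course):

```python
# Plain gradient descent on L(w) = (w - 3)^2, whose minimum is at w = 3.
eta = 0.1              # learning rate η
w = 0.0                # initial weight
for _ in range(50):
    grad = 2 * (w - 3)     # ∂L/∂w for this toy loss
    w = w - eta * grad     # w ← w − η · ∂L/∂w
print(round(w, 4))         # ≈ 3.0: each step walks down the gradient
```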
LSTM Cell Update
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
f = forget, i = input, C̃ = candidate
LSTM Output
h_t = o_t ⊙ tanh(C_t)
o_t = output gate
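A single LSTM time step written out in NumPy, combining the two formulas above; the stacked gate matrices and toy dimensions are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # Pre-activations for the 4 gates (forget, input, output, candidate) in one matmul.
    z = W @ x_t + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    c_t = f * c_prev + i * np.tanh(g)   # C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
    h_t = o * np.tanh(c_t)              # h_t = o_t ⊙ tanh(C_t)
    return h_t, c_t

# Toy sizes: input dim 3, hidden dim 2, so the stacked matrices are (4*2, 3) and (4*2, 2).
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(8, 3)), rng.normal(size=(8, 2)), np.zeros(8)
h, c = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2), W, U, b)
```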
TF-IDF
TF-IDF(t,d) = TF(t,d) × (log(N/df(t)) + 1)
N = total docs, df(t) = docs with term t
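A hand computation of this exact formula on a toy three-document corpus (the corpus is made up for illustration; the course benchmark uses the 50k IMDb reviews):

```python
import math
from collections import Counter

docs = [["good", "movie"], ["not", "good"], ["bad", "movie"]]   # toy corpus
N = len(docs)
df = Counter(t for d in docs for t in set(d))   # df(t) = number of docs containing t

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)             # relative term frequency TF(t,d)
    idf = math.log(N / df[term]) + 1            # rarer terms get a larger IDF
    return tf * idf

print(tfidf("good", docs[0]))   # "good" appears in 2/3 docs -> modest weight
print(tfidf("not", docs[1]))    # "not" appears in 1/3 docs  -> higher weight
```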
Scaled Dot-Product Attention
Attention(Q,K,V) = softmax(QKᵀ/√d_k)V
Q/K/V = query, key, value matrices
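The same formula as a few lines of NumPy; shapes and random inputs are illustrative (a real Transformer projects the tokens into separate Q, K, V matrices and runs h = 8 such heads in parallel):

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # QKᵀ / √d_k
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))       # 4 tokens, d_k = 8
out = attention(X, X, X)          # self-attention: Q, K, V from the same tokens
print(out.shape)                  # (4, 8)
```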
RNN Hidden State
h_t = f(W_h·h_{t-1} + U·x_t + b)
Shared weights at every time step
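A vanilla RNN forward pass over a toy sequence, with tanh standing in for the generic activation f; all dimensions are illustrative:

```python
import numpy as np

def rnn_forward(xs, W_h, U, b):
    h = np.zeros(W_h.shape[0])            # h_0 = 0
    outputs = []
    for x_t in xs:                        # same W_h, U, b reused at every time step
        h = np.tanh(W_h @ h + U @ x_t + b)
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))              # sequence of 5 steps, input dim 3
hs = rnn_forward(xs, 0.1 * rng.normal(size=(4, 4)),
                 0.1 * rng.normal(size=(4, 3)), np.zeros(4))
print(hs.shape)                           # (5, 4): one hidden state per time step
```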
LoRA Weight Update
W' = W + (α/r)·AB
r = rank (tiny), A∈R^{d×r}, B∈R^{r×k}
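A NumPy sketch of the LoRA idea using the card's notation; the 768-dimensional shapes and rank are illustrative defaults, not values prescribed by the course:

```python
import numpy as np

d, k, r, alpha = 768, 768, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d, k))            # frozen pre-trained weight (never updated)
A = 0.01 * rng.normal(size=(d, r))     # trainable low-rank factor, d×r
B = np.zeros((r, k))                   # trainable low-rank factor, r×k (zero init → W' = W at start)

W_prime = W + (alpha / r) * (A @ B)    # W' = W + (α/r)·AB

print(W.size)                          # 589,824 frozen parameters
print(A.size + B.size)                 # 12,288 trainable parameters (~2% of the frozen ones)
```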
IMDb 50k Benchmark — Full Results
Method | Type | Accuracy | F1 | Latency | GPU? | Handles OOV? | Key Strength
Bag of Words | Classical | 86.2% | 0.86 | ~1 ms | No | No | Simple, interpretable, no training
N-gram (1+2) | Classical | ~89% | ~0.89 | ~3 ms | No | No | Captures "not good" bigrams
TF-IDF 1+2gram | Classical | 90.1% | 0.90 | ~3 ms | No | No | Best classical — rare word weighting
Word2Vec (mean pool) | Static | 85.7% | 0.86 | ~12 ms | No | No | Semantic similarity, vector analogies
GloVe (mean pool) | Static | 76.8% | 0.77 | ~10 ms | No | No | Analogy champion (WordSim-353)
FastText (mean pool) | Static | 86.0% | 0.86 | ~15 ms | No | Yes | OOV, noisy text, morphological languages
BERT fine-tuned | Contextual | 93.9% | 0.94 | ~370 ms (CPU) | Yes | Yes | Polysemy, negation, bidirectional context
Decision ladder (Golden Rule):
Start with TF-IDF + bigrams (fast baseline, no GPU, 90.1%). → If OOV/noisy text: use FastText. → If polysemy, negation, or nuance is critical and GPU available: BERT fine-tuned (93.9%).
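A sketch of step one of this ladder using scikit-learn (not from the lecture PDFs; the two-review corpus and labels are placeholders for the real IMDb data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["a really great movie", "not good at all"]   # placeholder for IMDb reviews
labels = [1, 0]                                       # 1 = positive, 0 = negative

# TF-IDF with unigrams + bigrams; stop words are kept so bigrams like "not good" survive.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["not a great movie"]))
```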
Architecture Comparison — Four Families
Family | Typical Data | Processing | Strengths | Limitations
Feed-Forward (MLP) | Tabular, vectors | Dense layers stacked | Simplicity, universality | No spatial/temporal structure; parameter explosion on images
CNN | Images, 2D/3D signals | Local convolutions + pooling | Spatial invariance, feature hierarchy, transfer learning | No native temporal handling; needs large datasets
RNN / LSTM / GRU | Sequences (text, time-series, speech) | Temporal loop with hidden state + gates | Long-term memory (LSTM); streaming O(1) inference | Sequential training (slow); no native attention
Transformer | Any sequence (text, images, audio) | Parallel self-attention on all tokens | Parallelism; scales to internet-size data; emergent abilities | O(n²) memory in attention; needs massive compute to shine