Convolutional Neural Networks (CNN)
Complete course notes from professor lectures — CNN architecture, convolution, pooling, key models from LeNet to EfficientNet, transfer learning, and medical AI. Based on ENSAM 2025/2026 lecture PDFs.
Standard MLPs treat input as a flat vector — they lose all spatial structure. For images, this creates three critical problems:
A 224×224 RGB image = 150,528 inputs. With just one hidden layer of 1,000 neurons → ~150 million parameters. Impractical to train.
MLP sees a flat vector — loses all 2D structure. A shifted image looks completely different. Cannot recognize same object at different positions.
MLP uses different weights for every position. An edge detector learned at position (5,5) does not transfer to position (50,50).
Local connectivity: each neuron looks at a small patch.
Weight sharing: same filter reused across image — far fewer parameters.
Translation invariance: recognizes patterns anywhere.
Hierarchical: edges → shapes → objects.
A convolutional layer applies small filters (kernels) that slide across the input image, performing element-wise multiplication and summing to produce a feature map. Each filter detects a specific pattern — edges, corners, textures.
Think of a filter as a small flashlight sliding across the image. At each position: multiply each filter value × corresponding pixel value, then sum all results → one output number. Different filters detect different patterns.
$$(x * f)(i,j) = \sum_{m}\sum_{n} x(i+m,\, j+n)\,f(m,n)$$
$f$: filter/kernel · $x$: input image · Output: feature map. (Deep-learning frameworks implement this as cross-correlation — the kernel is not flipped.)
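The sliding-window operation described above can be sketched in a few lines of pure Python — a minimal "valid" convolution with stride 1 (the image and edge-detecting kernel below are illustrative, not from the lecture):

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (valid mode, stride 1):
    at each position, multiply element-wise and sum -> one output value."""
    H, W = len(image), len(image[0])
    K = len(kernel)
    out = []
    for i in range(H - K + 1):
        row = []
        for j in range(W - K + 1):
            s = sum(image[i + m][j + n] * kernel[m][n]
                    for m in range(K) for n in range(K))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge filter responds where intensity changes left -> right.
image = [[0, 0, 10, 10],
         [0, 0, 10, 10],
         [0, 0, 10, 10],
         [0, 0, 10, 10]]
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
print(conv2d(image, kernel))  # [[30, 30], [30, 30]] -- strong response at the edge
```

Every output position fires strongly here because the 0→10 edge falls inside every 3×3 window of this tiny image.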
- Layer 1 (shallow): Edges, color gradients, simple textures
- Layer 2 (mid): Combines edges → corners, curves, textures
- Layer 3+ (deep): Assembles shapes into object parts — eyes, wheels, fins
- Fully Connected: Final classification using learned high-level feature vector
$$W_{out} = \left\lfloor \frac{W_{in} - K + 2P}{S} \right\rfloor + 1$$
$W_{in}$: input size · $K$: kernel size · $P$: padding · $S$: stride
Adding zeros around the input before convolution.
Same: Adds enough padding so output size = input size. Preserves edges.
Valid: No padding. Output is smaller.
Step size of the filter movement.
Stride=1: Dense coverage, no skipping.
Stride=2: Skips positions → output half the size. Reduces compute.
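A small helper makes the padding/stride trade-offs concrete, assuming the standard output-size formula $W_{out} = \lfloor (W_{in} - K + 2P)/S \rfloor + 1$:

```python
def conv_output_size(w_in, k, p=0, s=1):
    """W_out = floor((W_in - K + 2P) / S) + 1"""
    return (w_in - k + 2 * p) // s + 1

# 'valid' (P=0): output shrinks
print(conv_output_size(224, 3))            # 222
# 'same' (P=1 for a 3x3 kernel, S=1): output size preserved
print(conv_output_size(224, 3, p=1))       # 224
# stride 2 halves the map
print(conv_output_size(224, 3, p=1, s=2))  # 112
```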
Example: Conv2D(32, 3×3) on RGB (3ch): $(3×3×3+1)×32 = 896$ params
Applied element-wise after each conv layer. The standard activation for CNNs because:
- Non-linearity: without an activation, stacked linear layers collapse into a single linear layer
- Combats vanishing gradient: No saturation for positive values — gradients stay strong in deep networks
- Computationally cheap: Just max(0,x) — faster than exp() or tanh
- Sparse activation: Only active neurons fire → efficient representations
Pooling downsamples feature maps — reducing parameters, computation, and building spatial invariance.
Takes maximum in each window. Best for detecting sharp features like edges. Most common in practice.
Takes the mean in each window. Smoother — better for global feature summaries. Used in modern architectures.
Reduces each entire feature map to one number (its average). Replaces Flatten→Dense at end. Drastically reduces parameters and overfitting.
• Reduces spatial size → fewer parameters
• Less compute, less memory
• Builds spatial invariance (small shifts → same output)
• Controls overfitting
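The two main pooling variants can be sketched in pure Python (2×2 window, stride 2 — the feature-map values are illustrative):

```python
def max_pool2d(fmap, window=2, stride=2):
    """Max pooling: keep the maximum of each window x window patch."""
    H, W = len(fmap), len(fmap[0])
    return [[max(fmap[i + m][j + n] for m in range(window) for n in range(window))
             for j in range(0, W - window + 1, stride)]
            for i in range(0, H - window + 1, stride)]

def global_average_pool(fmap):
    """Global Average Pooling: reduce a whole feature map to its mean."""
    vals = [v for row in fmap for v in row]
    return sum(vals) / len(vals)

fmap = [[1, 3, 2, 0],
        [4, 6, 1, 1],
        [0, 2, 5, 7],
        [1, 0, 3, 2]]
print(max_pool2d(fmap))           # [[6, 2], [2, 7]] -- 4x4 map halved to 2x2
print(global_average_pool(fmap))  # 2.375 -- one number per feature map
```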
Flatten converts 2D feature maps from the last conv/pool layer into a 1D vector to connect to dense layers. It bridges convolutional layers (spatial processing) with dense layers (classification).
Dense layers use ReLU activation for non-linearity. The final dense layer uses Softmax to produce class probabilities.
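Flatten and Softmax are both simple enough to sketch directly (pure-Python, max-shifted softmax for numerical stability):

```python
import math

def flatten(fmaps):
    """Flatten a list of 2D feature maps into one 1D vector."""
    return [v for fmap in fmaps for row in fmap for v in row]

def softmax(logits):
    """Exponentiate (shifted by the max for stability), then normalize to sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(flatten([[[1, 2], [3, 4]]]))   # [1, 2, 3, 4]
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(sum(probs))                    # 1.0 -- a valid probability distribution
```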
| Architecture | Year | Key Innovation | Use Case |
|---|---|---|---|
| LeNet | 1998 | First successful CNN for handwritten digit recognition. 2 conv + pooling + 3 dense layers. | MNIST handwritten digits |
| AlexNet | 2012 | Won ImageNet (ILSVRC). Introduced: ReLU activations, Dropout, GPU-based training. 5 conv + 3 dense. | Large-scale image classification |
| VGGNet | 2014 | Very deep (up to 19 layers) using only 3×3 filters. Simple, consistent architecture. 138M params. | General image classification |
| GoogLeNet (Inception) | 2014 | Inception module: multiple filter sizes (1×1, 3×3, 5×5) applied in parallel. Efficient via dimensionality reduction. | Complex classification tasks |
| ResNet | 2015 | Skip (residual) connections $\mathcal{F}(\mathbf{x})+\mathbf{x}$. Enables 100+ layers. Solves vanishing gradient in very deep networks. 25M params. | Classification, detection, keypoints |
| MobileNet | 2017 | Depthwise separable convolutions — drastically fewer operations. Designed for mobile/embedded devices. | Real-time mobile applications |
| EfficientNet | 2019 | Compound scaling of depth, width, and resolution simultaneously. Uses Swish activation. Highest accuracy per parameter. | Accuracy + efficiency balance |
The identity shortcut $+\mathbf{x}$ allows gradients to flow directly through — solves vanishing gradient for networks with 100+ layers.
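The residual idea $\mathcal{F}(\mathbf{x})+\mathbf{x}$ fits in a few lines — a conceptual pure-Python sketch (vectors instead of feature maps, and `f` standing in for the learned conv layers):

```python
def residual_block(x, f):
    """Residual connection: output = F(x) + x (element-wise)."""
    fx = f(x)
    return [a + b for a, b in zip(fx, x)]

# If the learned transform F outputs zeros, the block is exactly the identity:
# the `+ x` shortcut always gives gradients a direct path backward.
zero_f = lambda x: [0.0] * len(x)
print(residual_block([1.0, 2.0, 3.0], zero_f))  # [1.0, 2.0, 3.0]
```

This is why very deep ResNets remain trainable: a block only has to learn the *residual* on top of the identity, not the full mapping.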
Transfer learning uses a model pre-trained on ImageNet (1.2M images, 1000 classes) and adapts it for a new task. Instead of training from scratch, you reuse learned feature representations — early layers learn universal features (edges, textures) useful for any image task.
Freeze all pre-trained conv layers. Only train a new classification head (Dense layers). Fast, requires little data. Best when target domain is similar to ImageNet.
Unfreeze some/all pre-trained layers and continue training on your dataset with very small learning rate. Adapts features to your domain. Requires more data.
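Both strategies map directly onto Keras — a minimal configuration sketch, assuming TensorFlow is installed (the 224×224 input, `num_classes`, and the choice of MobileNetV2 base are illustrative):

```python
import tensorflow as tf

num_classes = 2  # hypothetical, e.g. normal vs abnormal X-ray

# Strategy 1 -- feature extraction: freeze the pre-trained base, train a new head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze all pre-trained conv layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),  # replaces Flatten + big Dense
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Strategy 2 -- fine-tuning: unfreeze and retrain with a very small learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # small LR, or pre-trained
              loss="categorical_crossentropy",           # weights are destroyed
              metrics=["accuracy"])
```

Note the re-`compile` after unfreezing: changing `trainable` only takes effect at the next compile.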
| Callback | Purpose | Typical Config |
|---|---|---|
| EarlyStopping | Stop when val_loss stops improving — prevents overfitting, saves time | patience=5, restore_best_weights=True |
| ModelCheckpoint | Save the model at its best val_loss — you always have the best version | save_best_only=True, monitor='val_loss' |
| ReduceLROnPlateau | Reduce learning rate when progress stalls — escapes plateaus | factor=0.5, patience=3, min_lr=1e-6 |
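The three callbacks above translate one-to-one into a Keras configuration fragment (the checkpoint filename is a placeholder):

```python
from tensorflow.keras.callbacks import (
    EarlyStopping, ModelCheckpoint, ReduceLROnPlateau)

callbacks = [
    # Stop when val_loss stops improving; roll back to the best epoch.
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    # Keep only the best model on disk.
    ModelCheckpoint("best_model.keras", monitor="val_loss", save_best_only=True),
    # Halve the learning rate when progress stalls.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6),
]
# model.fit(x_train, y_train, validation_split=0.2, epochs=50, callbacks=callbacks)
```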
| Rule | Why |
|---|---|
| Always use 3×3 convolutions | Two 3×3 layers = 5×5 receptive field with 18 vs 25 params (28% cheaper). Three 3×3 = 7×7 with 45% savings. The 3×3 is the default. |
| Double filters after each pooling | Keep compute roughly constant. Typical: 32 → 64 → 128 → 256 → 512 as spatial dims halve. |
| Dense layers hold 88% of params | In a VGG-like model: Conv ~14.7M params (12%), Dense ~103M (88%). Modern architectures replace Flatten+Dense with Global Average Pooling to fix this. |
| Normalize inputs to [0,1] | Never feed raw 0–255 values. Divide by 255 or standardise to mean=0, std=1. |
| Start with MobileNetV2 feature extraction | Lightweight, fast. If accuracy insufficient → fine-tune last layers with lr=1e-5. Never fine-tune with a large LR (destroys pre-trained weights). |
For a given input size, the number of conv blocks is fixed by the Power-of-2 Compression Schedule: keep halving with Max Pooling (stride 2) until the spatial map reaches 7 × 7 — the minimum before spatial structure is destroyed. For 224 × 224 RGB (the ImageNet standard), this yields exactly 5 conv blocks — the VGGNet depth.
| Block | Spatial Size After Pooling | ÷ Factor | Cumulative Reduction |
|---|---|---|---|
| Input (no pool) | 224 × 224 | — | ×1 |
| Block 1 | 112 × 112 | ÷2 | ×4 |
| Block 2 | 56 × 56 | ÷2 | ×16 |
| Block 3 | 28 × 28 | ÷2 | ×64 |
| Block 4 | 14 × 14 | ÷2 | ×256 |
| Block 5 | 7 × 7 ← stop | ÷2 | ×1024 |
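The schedule in the table is just repeated integer halving — a sketch that reproduces it:

```python
def compression_schedule(size=224, stop=7):
    """Halve the spatial size (stride-2 pooling) until `stop` is reached."""
    sizes = []
    while size > stop:
        size //= 2
        sizes.append(size)
    return sizes

print(compression_schedule(224))       # [112, 56, 28, 14, 7]
print(len(compression_schedule(224)))  # 5 -> five conv blocks, the VGG depth
```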
Example: 3×3 filter on RGB → 3×3×3 = 27 params. Full layer Conv2D(32, 3×3) on RGB → $(3\!\times\!3\!\times\!3+1)\!\times\!32 = 896$ params (with bias).
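The parameter count generalizes to any layer — a small helper for the formula $(K \times K \times C_{in} + 1)\times\text{filters}$:

```python
def conv2d_params(filters, k, in_channels, bias=True):
    """Parameters in a Conv2D layer: (K*K*C_in + 1 bias) per filter."""
    per_filter = k * k * in_channels + (1 if bias else 0)
    return per_filter * filters

print(conv2d_params(32, 3, 3))   # (3*3*3 + 1) * 32 = 896
print(conv2d_params(64, 3, 32))  # next block: (3*3*32 + 1) * 64 = 18496
```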
Applied after conv layers (before or after ReLU): normalizes feature maps per mini-batch. Benefits: stabilizes training, allows higher learning rates, reduces sensitivity to initialization, regularizes the model.
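The per-mini-batch normalization at the heart of Batch Norm can be sketched in 1D (a simplified version: real BN normalizes per channel and learns γ and β during training):

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize to mean 0 / variance 1 over the mini-batch,
    then apply the learnable scale (gamma) and shift (beta)."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

out = batch_norm([10.0, 20.0, 30.0, 40.0])
print([round(v, 3) for v in out])  # ~[-1.342, -0.447, 0.447, 1.342]
```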
Data augmentation prevents overfitting by generating varied training samples from existing data — the model sees more diversity without collecting new labeled data.
Random flipping (horizontal/vertical), rotation (±15°), zoom, cropping, translation. Preserves semantic meaning while varying appearance.
Brightness, contrast, saturation adjustments. Random noise addition. Gaussian blur. Makes model robust to lighting variations.
| Metric | Formula | Use When |
|---|---|---|
| Accuracy | Correct / Total | Balanced classes |
| Precision | TP / (TP + FP) | False Positives costly |
| Recall | TP / (TP + FN) | False Negatives costly (medical AI) |
| F1 Score | 2·P·R / (P+R) | Imbalanced classes, both errors matter |
| AUC-ROC | Area under ROC curve | Overall discrimination ability |
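The formulas in the table (except AUC-ROC, which needs the full score distribution) follow directly from the confusion-matrix counts — a sketch with a hypothetical imbalanced medical example:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative: 90 healthy, 10 sick patients; the model finds 8 of the 10 sick.
m = classification_metrics(tp=8, fp=5, fn=2, tn=85)
print(round(m["accuracy"], 2))  # 0.93 -- looks great despite 2 missed patients
print(m["recall"])              # 0.8  -- the number that matters in medical AI
```

This is exactly why accuracy alone misleads on imbalanced data: 93% accuracy hides the two false negatives that recall exposes.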
• Automatic feature extraction
• Parameter sharing — far fewer params than dense
• Translation invariance
• Hierarchical learning (edges → objects)
• State-of-the-art on vision tasks
• Transfer learning available
• Large datasets required for from-scratch training
• Computationally intensive (GPU needed)
• Black box — hard to interpret
• No temporal awareness (use CNN+RNN or 3D-CNN for video)
• Requires fixed-size input
• Memory intensive (VGG16: 138M params)
- Image classification: Cats vs dogs, ImageNet 1000 classes
- Object detection: YOLO, Faster-RCNN for self-driving cars
- Medical imaging: Chest X-ray, MRI tumor detection, retinal OCT
- Face recognition: Security systems
- Image segmentation: U-Net for medical image outlining