MOD 02 Convolutional Neural Networks (CNN)
ENSAM Casablanca · 2025/2026
Deep Learning & NLP — ENSAM Casablanca

Convolutional Neural Networks
(CNN)

Complete course notes from professor lectures — CNN architecture, convolution, pooling, key models from LeNet to EfficientNet, transfer learning, and medical AI. Based on ENSAM 2025/2026 lecture PDFs.

Module 02 of 07
Sources: CNN Lecture + ENSAM CNN PDF
Labs: MNIST · Medical Image Classification
01 · Why CNNs? The Problem with MLPs on Images

Standard MLPs treat input as a flat vector — they lose all spatial structure. For images, this creates three critical problems:

Parameter Explosion

224×224 RGB image = 150,528 inputs. With just 1 hidden layer of 1,000 neurons → 150 million parameters. Untrainable.

No Spatial Awareness

MLP sees a flat vector — loses all 2D structure. A shifted image looks completely different. Cannot recognize same object at different positions.

No Weight Sharing

MLP uses different weights for every position. An edge detector learned at position (5,5) does not transfer to position (50,50).

CNN Solutions

Local connectivity: each neuron looks at a small patch.
Weight sharing: same filter reused across image — far fewer parameters.
Translation invariance: recognizes patterns anywhere.
Hierarchical: edges → shapes → objects.

02 · The Convolution Operation

A convolutional layer applies small filters (kernels) that slide across the input image, performing element-wise multiplication and summing to produce a feature map. Each filter detects a specific pattern — edges, corners, textures.

The Filter — A "Flashlight"

Think of a filter as a small flashlight sliding across the image. At each position: multiply each filter value × corresponding pixel value, then sum all results → one output number. Different filters detect different patterns.

Convolution operation (2D) $$(f * x)(i,j) = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} x(i+m,\; j+n) \cdot f(m,n)$$

$f$: filter/kernel · $x$: input image · Output: feature map
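To make the sliding-window arithmetic concrete, here is a minimal NumPy sketch (not from the lecture) of the valid cross-correlation that deep-learning libraries call "convolution":

import numpy as np

def conv2d_valid(x, f):
    """x: (H, W) input image, f: (M, N) filter -> (H-M+1, W-N+1) feature map."""
    H, W = x.shape
    M, N = f.shape
    out = np.zeros((H - M + 1, W - N + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # multiply each filter value by the pixel under it, then sum -> one output number
            out[i, j] = np.sum(x[i:i+M, j:j+N] * f)
    return out

# Example: a vertical-edge (Sobel-like) filter on a 5×5 image
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1., 0., -1.],
                   [2., 0., -2.],
                   [1., 0., -1.]])
print(conv2d_valid(image, kernel).shape)   # (3, 3)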

Hierarchical Feature Learning
  • Layer 1 (shallow): Edges, color gradients, simple textures
  • Layer 2 (mid): Combines edges → corners, curves, textures
  • Layer 3+ (deep): Assembles shapes into object parts — eyes, wheels, fins
  • Fully Connected: Final classification using learned high-level feature vector
03 · Padding & Stride
Output spatial size after convolution $$W_{out} = \left\lfloor \frac{W_{in} - K + 2P}{S} \right\rfloor + 1$$

$W_{in}$: input size · $K$: kernel size · $P$: padding · $S$: stride

Padding

Adding zeros around the input before convolution.
Same: Adds enough padding so output size = input size. Preserves edges.
Valid: No padding. Output is smaller.

Stride

Step size of the filter movement.
Stride=1: Dense coverage, no skipping.
Stride=2: Skips positions → output half the size. Reduces compute.

Example 1: Input 5×5, Filter 3×3, Padding=1 (same), Stride=1 → Output: ⌊(5 − 3 + 2) / 1⌋ + 1 = 5×5 ← same size!
Example 2: Input 5×5, Filter 3×3, Padding=0 (valid), Stride=2 → Output: ⌊(5 − 3) / 2⌋ + 1 = 2×2
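A small helper function (hypothetical, just to check the formula) reproduces both examples:

def conv_output_size(w_in, k, p, s):
    # floor((W_in - K + 2P) / S) + 1
    return (w_in - k + 2 * p) // s + 1

print(conv_output_size(5, 3, 1, 1))   # 5 -> same padding keeps the size
print(conv_output_size(5, 3, 0, 2))   # 2 -> valid padding, stride 2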
Parameters in a Conv Layer
Conv2D parameter count $$\text{Params} = (K \times K \times C_{in} + 1) \times C_{out}$$

Example: Conv2D(32, 3×3) on RGB (3ch): $(3×3×3+1)×32 = 896$ params
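The formula can be checked directly in Keras (a sketch, assumes TensorFlow 2.x):

import tensorflow as tf

layer = tf.keras.layers.Conv2D(32, (3, 3))
layer.build((None, 224, 224, 3))      # C_in = 3 (RGB)
print(layer.count_params())           # (3*3*3 + 1) * 32 = 896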

04 · ReLU Activation in CNNs
ReLU $$f(x) = \max(0, x)$$

Applied element-wise after each conv layer. The standard activation for CNNs because:

  • Non-linearity: Without activation, stacking layers = 1 layer
  • Combats vanishing gradient: No saturation for positive values — gradients stay strong in deep networks
  • Computationally cheap: Just max(0,x) — faster than exp() or tanh
  • Sparse activation: Only active neurons fire → efficient representations
Variants: LeakyReLU keeps small negative gradient · ELU smooths the kink · GELU used in Transformers — but ReLU remains the CNN standard.
05 · Pooling Layers

Pooling downsamples feature maps — reducing parameters, computation, and building spatial invariance.

Output size (2×2 pool, stride 2) $$W_{out} = W_{in} / 2$$
Max Pooling

Takes maximum in each window. Best for detecting sharp features like edges. Most common in practice.

Average Pooling

Takes the mean in each window. Smoother — better for global feature summaries. Used in modern architectures.

Global Average Pooling (GAP)

Reduces each entire feature map to one number (its average). Replaces Flatten→Dense at end. Drastically reduces parameters and overfitting.

Why Pool?

• Reduces spatial size → fewer parameters
• Less compute, less memory
• Builds spatial invariance (small shifts → same output)
• Controls overfitting
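A quick shape check (a sketch, assumes TensorFlow 2.x) shows how each pooling type transforms a feature map:

import tensorflow as tf

x = tf.random.normal((1, 28, 28, 32))                      # (batch, H, W, channels)
print(tf.keras.layers.MaxPooling2D((2, 2))(x).shape)       # (1, 14, 14, 32)
print(tf.keras.layers.AveragePooling2D((2, 2))(x).shape)   # (1, 14, 14, 32)
print(tf.keras.layers.GlobalAveragePooling2D()(x).shape)   # (1, 32) -> one number per feature map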

06 · Flatten & Fully Connected Layers

Flatten converts 2D feature maps from the last conv/pool layer into a 1D vector to connect to dense layers. It bridges convolutional layers (spatial processing) with dense layers (classification).

Dense layer output$$\mathbf{y} = \mathbf{W} \cdot \mathbf{x} + \mathbf{b}$$

Dense layers use ReLU activation for non-linearity. The final dense layer uses Softmax to produce class probabilities.

Standard CNN Pipeline: Input → [Conv2D → ReLU → MaxPool] × N → Flatten → Dense(ReLU) → Dropout → Dense(Softmax)
07 · Full CNN Architecture — MNIST Lab
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(32, (3,3), activation='relu', input_shape=(28,28,1)),
    MaxPooling2D((2,2)),
    Conv2D(64, (3,3), activation='relu'),
    MaxPooling2D((2,2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dropout(0.4),
    Dense(10, activation='softmax')   # 10 digit classes
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.1)
Cross-Entropy Loss (for classification) $$\mathcal{L} = -\sum_i y_i \log(\hat{y}_i)$$
08 · CNN Architectures — LeNet to EfficientNet
Key ENSAM insight: ResNet achieves higher accuracy than VGG with only 25M params vs 138M — efficiency matters! More parameters do not automatically mean better results.
Architecture | Year | Key Innovation | Use Case
LeNet | 1998 | First successful CNN for handwritten digit recognition. 2 conv + pooling + 3 dense layers. | MNIST handwritten digits
AlexNet | 2012 | Won ImageNet (ILSVRC). Introduced ReLU activations, Dropout, GPU-based training. 5 conv + 3 dense. | Large-scale image classification
VGGNet | 2014 | Very deep (up to 19 layers) using only 3×3 filters. Simple, consistent architecture. 138M params. | General image classification
GoogLeNet (Inception) | 2014 | Inception module: multiple filter sizes (1×1, 3×3, 5×5) applied in parallel. Efficient via dimensionality reduction. | Complex classification tasks
ResNet | 2015 | Skip (residual) connections $\mathcal{F}(\mathbf{x})+\mathbf{x}$. Enables 100+ layers. Solves vanishing gradient in very deep networks. 25M params. | Classification, detection, keypoints
MobileNet | 2017 | Depthwise separable convolutions — drastically fewer operations. Designed for mobile/embedded devices. | Real-time mobile applications
EfficientNet | 2019 | Compound scaling of depth, width, and resolution simultaneously. Uses Swish activation. Highest accuracy per parameter. | Accuracy + efficiency balance
ResNet Skip Connection
Residual block $$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}$$

The identity shortcut $+\mathbf{x}$ allows gradients to flow directly through — solves vanishing gradient for networks with 100+ layers.
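A minimal residual block in the Keras functional API (a sketch, simplified: identity shortcut only, no BatchNorm, so input and output channels must match):

from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                                          # identity path
    y = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)                 # F(x)
    y = layers.Add()([y, shortcut])                                       # F(x) + x
    return layers.Activation('relu')(y)

inputs = layers.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, 64)   # 64 filters so channels match the shortcut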

09 · Transfer Learning

Transfer learning uses a model pre-trained on ImageNet (1.2M images, 1000 classes) and adapts it for a new task. Instead of training from scratch, you reuse learned feature representations — early layers learn universal features (edges, textures) useful for any image task.

Feature Extraction (Frozen)

Freeze all pre-trained conv layers. Only train a new classification head (Dense layers). Fast, requires little data. Best when target domain is similar to ImageNet.

Fine-Tuning (Unfrozen)

Unfreeze some/all pre-trained layers and continue training on your dataset with very small learning rate. Adapts features to your domain. Requires more data.

Transfer Learning in Keras (feature extraction):

from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout

base = VGG16(weights='imagenet', include_top=False, input_shape=(224,224,3))
base.trainable = False   # freeze pre-trained layers

model = Sequential([
    base,
    GlobalAveragePooling2D(),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(num_classes, activation='softmax')
])
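For the fine-tuning strategy, one common follow-up pattern (a sketch, not from the lecture code; assumes the base and model defined above and your own X_train/y_train arrays):

from tensorflow.keras.optimizers import Adam

base.trainable = True                   # unfreeze the pre-trained layers
for layer in base.layers[:-4]:          # optionally keep the earliest layers frozen
    layer.trainable = False

model.compile(optimizer=Adam(learning_rate=1e-5),   # very small LR protects pre-trained weights
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=5, validation_split=0.2)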
10 · CNN Training Process & Best Practices
CNN Training Steps:
  1. Data Preparation: normalize /255.0, reshape to (N, H, W, C), one-hot encode labels (sketch below)
  2. Forward Pass: Image → Conv/ReLU/Pool → Flatten → Dense → Softmax → ŷ
  3. Compute Loss: categorical_crossentropy(y, ŷ)
  4. Backward Pass (Backprop): compute gradients ∂L/∂W for all conv and dense layers
  5. Update Weights: Adam optimizer (default lr=1e-3)
  6. Evaluate on validation set: monitor train_acc, val_acc, train_loss, val_loss per epoch
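Step 1 as a concrete sketch for the MNIST lab (assumes TensorFlow 2.x and its built-in MNIST loader):

import tensorflow as tf

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0   # (N, H, W, C), normalized to [0,1]
X_test  = X_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)                 # one-hot for categorical_crossentropy
y_test  = tf.keras.utils.to_categorical(y_test, 10)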
Three Essential Callbacks (Use Them Systematically in Every Project)
Callback | Purpose | Typical Config
EarlyStopping | Stop when val_loss stops improving — prevents overfitting, saves time | patience=5, restore_best_weights=True
ModelCheckpoint | Save the model at its best val_loss — you always have the best version | save_best_only=True, monitor='val_loss'
ReduceLROnPlateau | Reduce learning rate when progress stalls — escapes plateaus | factor=0.5, patience=3, min_lr=1e-6
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

callbacks = [
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    ModelCheckpoint('best_model.h5', save_best_only=True, monitor='val_loss'),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6)
]

model.fit(X_train, y_train, epochs=100, validation_split=0.2, callbacks=callbacks)
# Will stop well before 100 epochs if val_loss plateaus
CNN Architecture Design Rules
Rule | Why
Always use 3×3 convolutions | Two stacked 3×3 layers give a 5×5 receptive field with 18 vs 25 weights (28% cheaper); three 3×3 give 7×7 with ~45% savings. 3×3 is the default.
Double filters after each pooling | Keeps compute roughly constant as spatial dims halve. Typical: 32 → 64 → 128 → 256 → 512.
Dense layers hold the vast majority of params | In VGG16: conv layers ≈ 14.7M params (~11%), dense layers ≈ 124M (~89%). Modern architectures replace Flatten+Dense with Global Average Pooling to fix this.
Normalize inputs to [0,1] | Never feed raw 0–255 values. Divide by 255 or standardize to mean=0, std=1.
Start with MobileNetV2 feature extraction | Lightweight and fast. If accuracy is insufficient → fine-tune the last layers with lr=1e-5. Never fine-tune with a large LR (it destroys the pre-trained weights).
Depth Determination: The 224 × 224 Compression Cascade

For a given input size, the number of conv blocks is fixed by the Power-of-2 Compression Schedule: keep halving with Max Pooling (stride 2) until the spatial map reaches 7 × 7 — the minimum before spatial structure is destroyed. For 224 × 224 RGB (the ImageNet standard), this yields exactly 5 conv blocks — the VGGNet depth.

Block | Spatial Size After Pooling | ÷ Factor | Cumulative Reduction
Input (no pool) | 224 × 224 | | ×1
Block 1 | 112 × 112 | ÷2 | ×4
Block 2 | 56 × 56 | ÷2 | ×16
Block 3 | 28 × 28 | ÷2 | ×64
Block 4 | 14 × 14 | ÷2 | ×256
Block 5 | 7 × 7 ← stop | ÷2 | ×1024
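The schedule can be reproduced in a few lines (illustrative sketch):

size, blocks = 224, 0
while size > 7:                    # halve until the 7×7 floor
    size //= 2
    blocks += 1
    print(f"Block {blocks}: {size} × {size}")
print(blocks)                      # 5 conv blocks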
Parameters per filter (excluding bias) $$\text{Params per filter} = K \times K \times C_{in}$$

Example: 3×3 filter on RGB → 3×3×3 = 27 params. Full layer Conv2D(32, 3×3) on RGB → $(3\!\times\!3\!\times\!3+1)\!\times\!32 = 896$ params (with bias).

Batch Normalization in CNNs

Applied after conv layers (before or after ReLU): normalizes feature maps per mini-batch. Benefits: stabilizes training, allows higher learning rates, reduces sensitivity to initialization, regularizes the model.
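A typical Conv → BatchNorm → ReLU block (a sketch, assumes TensorFlow 2.x; shown in the conv-then-BN-then-ReLU order):

from tensorflow.keras import layers

def conv_bn_relu(x, filters):
    x = layers.Conv2D(filters, (3, 3), padding='same', use_bias=False)(x)   # bias is redundant before BN
    x = layers.BatchNormalization()(x)                                      # normalize per mini-batch
    return layers.Activation('relu')(x)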

11 · Data Augmentation

Data augmentation prevents overfitting by generating varied training samples from existing data — the model sees more diversity without collecting new labeled data.

Geometric Transforms

Random flipping (horizontal/vertical), rotation (±15°), zoom, cropping, translation. Preserves semantic meaning while varying appearance.

Photometric Transforms

Brightness, contrast, saturation adjustments. Random noise addition. Gaussian blur. Makes model robust to lighting variations.

Keras ImageDataGenerator:

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.vgg16 import preprocess_input   # match the pre-trained base you use

datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    zoom_range=0.2,
    preprocessing_function=preprocess_input   # for transfer learning
)
12 · Evaluation Metrics
Metric | Formula | Use When
Accuracy | Correct / Total | Balanced classes
Precision | TP / (TP + FP) | False Positives costly
Recall | TP / (TP + FN) | False Negatives costly (medical AI)
F1 Score | 2·P·R / (P + R) | Imbalanced classes, both errors matter
AUC-ROC | Area under ROC curve | Overall discrimination ability
Medical AI: In clinical applications (chest X-ray, tumor detection), Recall is critical — missing a disease (False Negative) is far more dangerous than a false alarm (False Positive).
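All five metrics are available in scikit-learn (a sketch with made-up labels and probabilities, just to show the API):

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])                     # toy ground-truth labels
y_prob = np.array([0.1, 0.8, 0.4, 0.3, 0.9, 0.2, 0.7, 0.6])     # toy predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))    # the critical metric in medical AI
print("F1 score :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))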
13 · Advantages & Limitations
Advantages

• Automatic feature extraction
• Parameter sharing — far fewer params than dense
• Translation invariance
• Hierarchical learning (edges → objects)
• State-of-the-art on vision tasks
• Transfer learning available

Limitations

• Large datasets required for from-scratch training
• Computationally intensive (GPU needed)
• Black box — hard to interpret
• No temporal awareness (use CNN+RNN or 3D-CNN for video)
• Requires fixed-size input
• Memory intensive (VGG16: 138M params)

Key Applications
  • Image classification: Cats vs dogs, ImageNet 1000 classes
  • Object detection: YOLO, Faster-RCNN for self-driving cars
  • Medical imaging: Chest X-ray, MRI tumor detection, retinal OCT
  • Face recognition: Security systems
  • Image segmentation: U-Net for medical image outlining