Convolutional Neural Networks (CNN)
Complete course notes from professor lectures — CNN architecture, convolution, pooling, key models from LeNet to EfficientNet, transfer learning, and medical AI. Based on ENSAM 2025/2026 lecture PDFs.
Standard MLPs treat input as a flat vector — they lose all spatial structure. For images, this creates three critical problems:
A 224×224 RGB image = 150,528 inputs. With just one hidden layer of 1,000 neurons → ~150 million parameters. Impractical to train.
MLP sees a flat vector — loses all 2D structure. A shifted image looks completely different. Cannot recognize same object at different positions.
MLP uses different weights for every position. An edge detector learned at position (5,5) does not transfer to position (50,50).
Local connectivity: each neuron looks at a small patch.
Weight sharing: same filter reused across image — far fewer parameters.
Translation invariance: recognizes patterns anywhere.
Hierarchical: edges → shapes → objects.
A convolutional layer applies small filters (kernels) that slide across the input image, performing element-wise multiplication and summing to produce a feature map. Each filter detects a specific pattern — edges, corners, textures.
Think of a filter as a small flashlight sliding across the image. At each position: multiply each filter value × corresponding pixel value, then sum all results → one output number. Different filters detect different patterns.
$$(x * f)(i,j) = \sum_{m}\sum_{n} x(i+m,\, j+n)\,f(m,n)$$
$f$: filter/kernel · $x$: input image · Output: feature map. (Deep-learning frameworks implement this as cross-correlation — the kernel is not flipped.)
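The sliding-window operation described above can be sketched in a few lines of pure Python — a minimal "valid" convolution with stride 1 (the image and edge-detecting kernel below are illustrative, not from the lecture):

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (valid mode, stride 1):
    at each position, multiply element-wise and sum -> one output value."""
    H, W = len(image), len(image[0])
    K = len(kernel)
    out = []
    for i in range(H - K + 1):
        row = []
        for j in range(W - K + 1):
            s = sum(image[i + m][j + n] * kernel[m][n]
                    for m in range(K) for n in range(K))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge filter responds where intensity changes left -> right.
image = [[0, 0, 10, 10],
         [0, 0, 10, 10],
         [0, 0, 10, 10],
         [0, 0, 10, 10]]
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
print(conv2d(image, kernel))  # [[30, 30], [30, 30]] -- strong response at the edge
```

Every output position fires strongly here because the 0→10 edge falls inside every 3×3 window of this tiny image.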
- Layer 1 (shallow): Edges, color gradients, simple textures
- Layer 2 (mid): Combines edges → corners, curves, textures
- Layer 3+ (deep): Assembles shapes into object parts — eyes, wheels, fins
- Fully Connected: Final classification using learned high-level feature vector
$$W_{out} = \left\lfloor \frac{W_{in} - K + 2P}{S} \right\rfloor + 1$$
$W_{in}$: input size · $K$: kernel size · $P$: padding · $S$: stride
Adding zeros around the input before convolution.
Same: Adds enough padding so output size = input size. Preserves edges.
Valid: No padding. Output is smaller.
Step size of the filter movement.
Stride=1: Dense coverage, no skipping.
Stride=2: Skips positions → output half the size. Reduces compute.
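A small helper makes the padding/stride trade-offs concrete, assuming the standard output-size formula $W_{out} = \lfloor (W_{in} - K + 2P)/S \rfloor + 1$:

```python
def conv_output_size(w_in, k, p=0, s=1):
    """W_out = floor((W_in - K + 2P) / S) + 1"""
    return (w_in - k + 2 * p) // s + 1

# 'valid' (P=0): output shrinks
print(conv_output_size(224, 3))            # 222
# 'same' (P=1 for a 3x3 kernel, S=1): output size preserved
print(conv_output_size(224, 3, p=1))       # 224
# stride 2 halves the map
print(conv_output_size(224, 3, p=1, s=2))  # 112
```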
Example: Conv2D(32, 3×3) on RGB (3ch): $(3×3×3+1)×32 = 896$ params
Applied element-wise after each conv layer. The standard activation for CNNs because:
- Non-linearity: without an activation, stacked linear layers collapse into a single linear layer
- Combats vanishing gradient: No saturation for positive values — gradients stay strong in deep networks
- Computationally cheap: Just max(0,x) — faster than exp() or tanh
- Sparse activation: Only active neurons fire → efficient representations
Pooling downsamples feature maps — reducing parameters, computation, and building spatial invariance.
Takes maximum in each window. Best for detecting sharp features like edges. Most common in practice.
Takes the mean in each window. Smoother — better for global feature summaries. Used in modern architectures.
Reduces each entire feature map to one number (its average). Replaces Flatten→Dense at end. Drastically reduces parameters and overfitting.
• Reduces spatial size → fewer parameters
• Less compute, less memory
• Builds spatial invariance (small shifts → same output)
• Controls overfitting
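The two main pooling variants can be sketched in pure Python (2×2 window, stride 2 — the feature-map values are illustrative):

```python
def max_pool2d(fmap, window=2, stride=2):
    """Max pooling: keep the maximum of each window x window patch."""
    H, W = len(fmap), len(fmap[0])
    return [[max(fmap[i + m][j + n] for m in range(window) for n in range(window))
             for j in range(0, W - window + 1, stride)]
            for i in range(0, H - window + 1, stride)]

def global_average_pool(fmap):
    """Global Average Pooling: reduce a whole feature map to its mean."""
    vals = [v for row in fmap for v in row]
    return sum(vals) / len(vals)

fmap = [[1, 3, 2, 0],
        [4, 6, 1, 1],
        [0, 2, 5, 7],
        [1, 0, 3, 2]]
print(max_pool2d(fmap))           # [[6, 2], [2, 7]] -- 4x4 map halved to 2x2
print(global_average_pool(fmap))  # 2.375 -- one number per feature map
```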
Flatten converts 2D feature maps from the last conv/pool layer into a 1D vector to connect to dense layers. It bridges convolutional layers (spatial processing) with dense layers (classification).
Dense layers use ReLU activation for non-linearity. The final dense layer uses Softmax to produce class probabilities.
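Flatten and Softmax are both simple enough to sketch directly (pure-Python, max-shifted softmax for numerical stability):

```python
import math

def flatten(fmaps):
    """Flatten a list of 2D feature maps into one 1D vector."""
    return [v for fmap in fmaps for row in fmap for v in row]

def softmax(logits):
    """Exponentiate (shifted by the max for stability), then normalize to sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(flatten([[[1, 2], [3, 4]]]))   # [1, 2, 3, 4]
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(sum(probs))                    # 1.0 -- a valid probability distribution
```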
| Architecture | Year | Key Innovation | Use Case |
|---|---|---|---|
| LeNet | 1998 | First successful CNN for handwritten digit recognition. 2 conv + pooling + 3 dense layers. | MNIST handwritten digits |
| AlexNet | 2012 | Won ImageNet (ILSVRC). Introduced: ReLU activations, Dropout, GPU-based training. 5 conv + 3 dense. | Large-scale image classification |
| VGGNet | 2014 | Very deep (up to 19 layers) using only 3×3 filters. Simple, consistent architecture. 138M params. | General image classification |
| GoogLeNet (Inception) | 2014 | Inception module: multiple filter sizes (1×1, 3×3, 5×5) applied in parallel. Efficient via dimensionality reduction. | Complex classification tasks |
| ResNet | 2015 | Skip (residual) connections $\mathcal{F}(\mathbf{x})+\mathbf{x}$. Enables 100+ layers. Solves vanishing gradient in very deep networks. 25M params. | Classification, detection, keypoints |
| MobileNet | 2017 | Depthwise separable convolutions — drastically fewer operations. Designed for mobile/embedded devices. | Real-time mobile applications |
| EfficientNet | 2019 | Compound scaling of depth, width, and resolution simultaneously. Uses Swish activation. Highest accuracy per parameter. | Accuracy + efficiency balance |
The identity shortcut $+\mathbf{x}$ allows gradients to flow directly through — solves vanishing gradient for networks with 100+ layers.
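The residual idea $\mathcal{F}(\mathbf{x})+\mathbf{x}$ fits in a few lines — a conceptual pure-Python sketch (vectors instead of feature maps, and `f` standing in for the learned conv layers):

```python
def residual_block(x, f):
    """Residual connection: output = F(x) + x (element-wise)."""
    fx = f(x)
    return [a + b for a, b in zip(fx, x)]

# If the learned transform F outputs zeros, the block is exactly the identity:
# the `+ x` shortcut always gives gradients a direct path backward.
zero_f = lambda x: [0.0] * len(x)
print(residual_block([1.0, 2.0, 3.0], zero_f))  # [1.0, 2.0, 3.0]
```

This is why very deep ResNets remain trainable: a block only has to learn the *residual* on top of the identity, not the full mapping.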
Transfer learning uses a model pre-trained on ImageNet (1.2M images, 1000 classes) and adapts it for a new task. Instead of training from scratch, you reuse learned feature representations — early layers learn universal features (edges, textures) useful for any image task.
Freeze all pre-trained conv layers. Only train a new classification head (Dense layers). Fast, requires little data. Best when target domain is similar to ImageNet.
Unfreeze some/all pre-trained layers and continue training on your dataset with very small learning rate. Adapts features to your domain. Requires more data.
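Both strategies map directly onto Keras — a minimal configuration sketch, assuming TensorFlow is installed (the 224×224 input, `num_classes`, and the choice of MobileNetV2 base are illustrative):

```python
import tensorflow as tf

num_classes = 2  # hypothetical, e.g. normal vs abnormal X-ray

# Strategy 1 -- feature extraction: freeze the pre-trained base, train a new head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze all pre-trained conv layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),  # replaces Flatten + big Dense
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Strategy 2 -- fine-tuning: unfreeze and retrain with a very small learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),  # small LR, or pre-trained
              loss="categorical_crossentropy",           # weights are destroyed
              metrics=["accuracy"])
```

Note the re-`compile` after unfreezing: changing `trainable` only takes effect at the next compile.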
| Callback | Purpose | Typical Config |
|---|---|---|
| EarlyStopping | Stop when val_loss stops improving — prevents overfitting, saves time | patience=5, restore_best_weights=True |
| ModelCheckpoint | Save the model at its best val_loss — you always have the best version | save_best_only=True, monitor='val_loss' |
| ReduceLROnPlateau | Reduce learning rate when progress stalls — escapes plateaus | factor=0.5, patience=3, min_lr=1e-6 |
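The three callbacks above translate one-to-one into a Keras configuration fragment (the checkpoint filename is a placeholder):

```python
from tensorflow.keras.callbacks import (
    EarlyStopping, ModelCheckpoint, ReduceLROnPlateau)

callbacks = [
    # Stop when val_loss stops improving; roll back to the best epoch.
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    # Keep only the best model on disk.
    ModelCheckpoint("best_model.keras", monitor="val_loss", save_best_only=True),
    # Halve the learning rate when progress stalls.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6),
]
# model.fit(x_train, y_train, validation_split=0.2, epochs=50, callbacks=callbacks)
```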
| Rule | Why |
|---|---|
| Always use 3×3 convolutions | Two 3×3 layers = 5×5 receptive field with 18 vs 25 params (28% cheaper). Three 3×3 = 7×7 with 45% savings. The 3×3 is the default. |
| Double filters after each pooling | Keep compute roughly constant. Typical: 32 → 64 → 128 → 256 → 512 as spatial dims halve. |
| Dense layers hold 88% of params | In a VGG-like model: Conv ~14.7M params (12%), Dense ~103M (88%). Modern architectures replace Flatten+Dense with Global Average Pooling to fix this. |
| Normalize inputs to [0,1] | Never feed raw 0–255 values. Divide by 255 or standardise to mean=0, std=1. |
| Start with MobileNetV2 feature extraction | Lightweight, fast. If accuracy insufficient → fine-tune last layers with lr=1e-5. Never fine-tune with a large LR (destroys pre-trained weights). |
For a given input size, the number of conv blocks is fixed by the Power-of-2 Compression Schedule: keep halving with Max Pooling (stride 2) until the spatial map reaches 7 × 7 — the minimum before spatial structure is destroyed. For 224 × 224 RGB (the ImageNet standard), this yields exactly 5 conv blocks — the VGGNet depth.
| Block | Spatial Size After Pooling | ÷ Factor | Cumulative Reduction |
|---|---|---|---|
| Input (no pool) | 224 × 224 | — | ×1 |
| Block 1 | 112 × 112 | ÷2 | ×4 |
| Block 2 | 56 × 56 | ÷2 | ×16 |
| Block 3 | 28 × 28 | ÷2 | ×64 |
| Block 4 | 14 × 14 | ÷2 | ×256 |
| Block 5 | 7 × 7 ← stop | ÷2 | ×1024 |
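The schedule in the table is just repeated integer halving — a sketch that reproduces it:

```python
def compression_schedule(size=224, stop=7):
    """Halve the spatial size (stride-2 pooling) until `stop` is reached."""
    sizes = []
    while size > stop:
        size //= 2
        sizes.append(size)
    return sizes

print(compression_schedule(224))       # [112, 56, 28, 14, 7]
print(len(compression_schedule(224)))  # 5 -> five conv blocks, the VGG depth
```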
Example: 3×3 filter on RGB → 3×3×3 = 27 params. Full layer Conv2D(32, 3×3) on RGB → $(3\!\times\!3\!\times\!3+1)\!\times\!32 = 896$ params (with bias).
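The parameter count generalizes to any layer — a small helper for the formula $(K \times K \times C_{in} + 1)\times\text{filters}$:

```python
def conv2d_params(filters, k, in_channels, bias=True):
    """Parameters in a Conv2D layer: (K*K*C_in + 1 bias) per filter."""
    per_filter = k * k * in_channels + (1 if bias else 0)
    return per_filter * filters

print(conv2d_params(32, 3, 3))   # (3*3*3 + 1) * 32 = 896
print(conv2d_params(64, 3, 32))  # next block: (3*3*32 + 1) * 64 = 18496
```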
Applied after conv layers (before or after ReLU): normalizes feature maps per mini-batch. Benefits: stabilizes training, allows higher learning rates, reduces sensitivity to initialization, regularizes the model.
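The per-mini-batch normalization at the heart of Batch Norm can be sketched in 1D (a simplified version: real BN normalizes per channel and learns γ and β during training):

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize to mean 0 / variance 1 over the mini-batch,
    then apply the learnable scale (gamma) and shift (beta)."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

out = batch_norm([10.0, 20.0, 30.0, 40.0])
print([round(v, 3) for v in out])  # ~[-1.342, -0.447, 0.447, 1.342]
```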
Data augmentation prevents overfitting by generating varied training samples from existing data — the model sees more diversity without collecting new labeled data.
Random flipping (horizontal/vertical), rotation (±15°), zoom, cropping, translation. Preserves semantic meaning while varying appearance.
Brightness, contrast, saturation adjustments. Random noise addition. Gaussian blur. Makes model robust to lighting variations.
| Metric | Formula | Use When |
|---|---|---|
| Accuracy | Correct / Total | Balanced classes |
| Precision | TP / (TP + FP) | False Positives costly |
| Recall | TP / (TP + FN) | False Negatives costly (medical AI) |
| F1 Score | 2·P·R / (P+R) | Imbalanced classes, both errors matter |
| AUC-ROC | Area under ROC curve | Overall discrimination ability |
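The formulas in the table (except AUC-ROC, which needs the full score distribution) follow directly from the confusion-matrix counts — a sketch with a hypothetical imbalanced medical example:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative: 90 healthy, 10 sick patients; the model finds 8 of the 10 sick.
m = classification_metrics(tp=8, fp=5, fn=2, tn=85)
print(round(m["accuracy"], 2))  # 0.93 -- looks great despite 2 missed patients
print(m["recall"])              # 0.8  -- the number that matters in medical AI
```

This is exactly why accuracy alone misleads on imbalanced data: 93% accuracy hides the two false negatives that recall exposes.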
• Automatic feature extraction
• Parameter sharing — far fewer params than dense
• Translation invariance
• Hierarchical learning (edges → objects)
• State-of-the-art on vision tasks
• Transfer learning available
• Large datasets required for from-scratch training
• Computationally intensive (GPU needed)
• Black box — hard to interpret
• No temporal awareness (use CNN+RNN or 3D-CNN for video)
• Requires fixed-size input
• Memory intensive (VGG16: 138M params)
- Image classification: Cats vs dogs, ImageNet 1000 classes
- Object detection: YOLO, Faster-RCNN for self-driving cars
- Medical imaging: Chest X-ray, MRI tumor detection, retinal OCT
- Face recognition: Security systems
- Image segmentation: U-Net for medical image outlining