Deep Learning Basics: Architectures & Training

Introduction: Beyond Shallow Networks

You've mastered neural networks with 1-2 hidden layers. But what about deep networks with 10, 50, or even 1000 layers?

Deep learning powers modern AI breakthroughs:

Image Recognition: ResNet (152 layers)
Language Models: GPT-style Transformers (GPT-3 stacks 96 layers)
Game AI: AlphaGo, AlphaZero

Key Insight: Deeper networks learn hierarchical representations – from edges to textures to objects!

Before any theory, get your hands on the first layer of a CNN. Open the Conv Darkroom and feel how a single 3×3 kernel turns a raw image into an edge map — the exact thing backprop discovers on its own later in this lesson.

Try it: Type a new kernel into the editor (or switch to a different preset), then drag the occlusion patch across the input — watch the filtered output and the top-3 prediction change as you cover different regions.

Learning Objectives

Understand why depth matters
Master regularization techniques (dropout, L2)
Learn advanced optimization (Adam, RMSprop)
Apply batch normalization
Handle vanishing/exploding gradients
Build and train deep networks
Use transfer learning

1. Why Go Deep?

Hierarchical Feature Learning

Deep networks learn increasingly abstract features:

Computer Vision Example:

Layer 1: Edges, corners
Layer 2: Textures, simple shapes
Layer 3: Object parts (eyes, wheels)
Layer 4: Complete objects (faces, cars)

Build a Layer-1 Detector by Hand

Before we let backprop discover features for us, build one yourself. Open the Conv Darkroom: type a 3×3 kernel into the editor (or pick sobel-x from the presets) and watch a 64×64 image — a digit, a face silhouette, a checkerboard — convolve in real time. Stack the kernel four layers deep and see the response sharpen.
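To make the convolution step concrete, here is a minimal sketch in plain Python (standard library only) of what applying a single 3×3 kernel involves. The function name `conv2d` and the tiny 5×5 test image are illustrative assumptions, not the Conv Darkroom's actual code; the kernel is the standard Sobel-x filter.

```python
# Sobel-x preset: responds to left-to-right changes in brightness (vertical edges).
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the operation CNN conv layers compute."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            # Multiply the kernel with the window under it and sum the products.
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 5x5 input: dark on the left (0), bright on the right (1).
image = [[0, 0, 0, 1, 1] for _ in range(5)]
edge_map = conv2d(image, SOBEL_X)
for row in edge_map:
    print(row)
```

The response is zero where the window sees only flat dark pixels and positive wherever the window straddles the dark-to-bright boundary, which is the edge-map behavior described above.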

The occlusion overlay lets you cover part of the input with a draggable 12×12 patch while a tiny LeNet-style classifier emits a top-3 prediction. The regions where covering the input makes the predicted class's score drop the most are the "important" regions: a hand-rolled saliency map.
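The occlusion idea can be sketched in a few lines. This is only an illustration under stated assumptions: `occlusion_map` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for a real classifier's class score (it simply averages the top-left 2×2 pixels), so the "important" region is known in advance.

```python
def occlusion_map(image, score_fn, patch=2, fill=0.0):
    """Score drop caused by covering each non-overlapping patch x patch region."""
    base = score_fn(image)
    height, width = len(image), len(image[0])
    drops = []
    for top in range(0, height - patch + 1, patch):
        row = []
        for left in range(0, width - patch + 1, patch):
            occluded = [list(r) for r in image]  # copy so the input stays untouched
            for i in range(top, top + patch):
                for j in range(left, left + patch):
                    occluded[i][j] = fill
            row.append(base - score_fn(occluded))
        drops.append(row)
    return drops


def toy_score(image):
    """Hypothetical 'classifier': the score depends only on the top-left 2x2 pixels."""
    return sum(image[i][j] for i in range(2) for j in range(2)) / 4.0


image = [[1.0] * 4 for _ in range(4)]
saliency = occlusion_map(image, toy_score)
print(saliency)
```

Only the patch that covers the pixels the scorer actually uses produces a non-zero drop. With a real model you would pass its predicted-class probability as `score_fn` and typically slide the patch with a smaller stride.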

The kernels you tune by hand in the Conv Darkroom (at the top of this lesson) closely resemble what the first conv layer of a trained CNN typically learns on its own.

2. Regularization Techniques

Dropout

Idea: Randomly "drop" neurons during training to prevent overfitting.

How it works:

During training: Each neuron is dropped (its output set to zero) with probability $p$
During inference: Use all neurons and scale activations by $(1-p)$ so their expected value matches training
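The two phases above can be sketched as follows. This is a minimal illustration in plain Python, not a framework implementation; the function names and the all-ones activations are assumptions chosen so the averages are easy to read.

```python
import random


def dropout_train(activations, p, rng):
    """Training: zero each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]


def dropout_inference(activations, p):
    """Inference: keep every neuron and scale by the keep probability (1 - p)."""
    return [a * (1.0 - p) for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
p = 0.5
activations = [1.0] * 10_000  # hypothetical layer output

train_out = dropout_train(activations, p, rng)
test_out = dropout_inference(activations, p)

# The two averages should roughly agree, which is the point of the scaling.
train_mean = sum(train_out) / len(train_out)
test_mean = sum(test_out) / len(test_out)
print(train_mean, test_mean)
```

Many frameworks use the equivalent "inverted dropout" variant instead: they divide the surviving activations by $(1-p)$ during training, so inference needs no scaling at all.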

L2 Regularization (Weight Decay)

Add a penalty on large weights to the loss, where $\lambda$ controls the strength of the penalty:

\mathcal{L}_{total} = \mathcal{L}_{data} + \lambda \sum_{i,j} w_{ij}^2
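As a small worked example of the formula, the sketch below computes the penalty and its gradient for a 2×2 weight matrix. The values of `weights`, `lam`, and `data_loss` are hypothetical numbers picked for illustration.

```python
def l2_penalty(weights, lam):
    """lambda times the sum of squared weights of a matrix given as a list of rows."""
    return lam * sum(w * w for row in weights for w in row)


def l2_penalty_grad(weights, lam):
    """Gradient of the penalty with respect to each weight: 2 * lambda * w."""
    return [[2.0 * lam * w for w in row] for row in weights]


weights = [[1.0, -2.0], [0.5, 0.0]]
lam = 0.01        # hypothetical regularization strength
data_loss = 0.30  # hypothetical data loss for one batch

total_loss = data_loss + l2_penalty(weights, lam)
print(total_loss)
print(l2_penalty_grad(weights, lam))
```

Because the gradient of the penalty is proportional to the weight itself, each gradient step shrinks every weight slightly toward zero, which is where the name "weight decay" comes from.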

3. Advanced Optimizers

Adam (Adaptive Moment Estimation)

Combines momentum and adaptive learning rates:

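m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t

v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2

\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}

w_t = w_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

Here $g_t$ is the gradient at step $t$ and $\alpha$ is the learning rate. The hatted terms correct the bias toward zero that comes from initializing $m_0 = v_0 = 0$.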
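A minimal sketch of these updates on a single scalar parameter is shown below. It also applies the standard bias correction from the Adam algorithm (dividing $m_t$ by $1-\beta_1^t$ and $v_t$ by $1-\beta_2^t$) before the weight update. The function name `adam_minimize`, the quadratic objective, and the learning rate of 0.1 are assumptions made for illustration; the defaults for `beta1`, `beta2`, and `eps` are the commonly used ones.

```python
import math


def adam_minimize(grad_fn, w, steps, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on a single scalar parameter w and return its final value."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g      # first moment (momentum)
        v = beta2 * v + (1.0 - beta2) * g * g  # second moment (gradient scale)
        m_hat = m / (1.0 - beta1 ** t)         # bias correction for zero init
        v_hat = v / (1.0 - beta2 ** t)
        w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Hypothetical objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0, steps=500)
print(w_final)
```

Note that the very first step moves the weight by almost exactly `alpha` regardless of the gradient's magnitude, because the ratio of the corrected moments is close to the sign of the gradient. This is one way to see that Adam adapts the step size per parameter.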

4. Batch Normalization

Problem: Internal covariate shift – the distribution of each layer's inputs keeps shifting as the weights of earlier layers update during training.

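Solution: Normalize each layer input using the mini-batch mean $\mu$ and variance $\sigma^2$, then apply a learnable scale $\gamma$ and shift $\beta$:

\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}

y = \gamma \hat{x} + \beta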
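The normalization can be sketched for a single feature across one mini-batch as follows. The function name `batch_norm` and the four sample activations are illustrative assumptions. The `gamma` and `beta` arguments stand for the learnable scale and shift that batch norm layers apply after normalizing; they default to 1 and 0 here, so the output equals $\hat{x}$.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n  # biased (population) variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat]


# Hypothetical activations of a single feature for a mini-batch of four examples.
batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)
print(normalized)
```

The output has mean close to 0 and variance close to 1 whatever the scale of the input. At inference time, real layers usually replace the mini-batch statistics with running averages collected during training, which this sketch leaves out.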

Benefits:

Faster training
Higher learning rates possible
Less sensitive to initialization
Acts as a mild regularizer (mini-batch statistics add noise)

Bonus · Generative Side: Diffusion

Discriminative deep nets classify what is. Generative deep nets — diffusion models powering Stable Diffusion, DALL·E, and Midjourney — synthesize what could be. The principle is shockingly simple: train a network to reverse the process of adding noise.

The Diffusion Studio runs the real closed-form forward process (xₜ = √(α̅ₜ) x₀ + √(1 − α̅ₜ) ε with a cosine schedule) on a 28×28 hand-drawn source. The reverse process is a synthetic demonstration (we don't ship a U-Net), but the schedule, sampler comparison (Euler / DDIM-η=0 / DDIM-η=1), and predict/truth/diff visualization are pedagogically faithful.
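The closed-form forward process can be sketched in plain Python as shown below. This is an illustration under stated assumptions rather than the Diffusion Studio's own code: the function names are hypothetical, the input is a short flattened list instead of a 28×28 image, and the cosine schedule uses the commonly cited offset `s = 0.008`.

```python
import math
import random


def alpha_bar(t, T, s=0.008):
    """Cosine schedule: fraction of signal variance remaining at step t of T."""
    def f(u):
        return math.cos(((u / T + s) / (1.0 + s)) * math.pi / 2.0) ** 2
    return f(t) / f(0)


def forward_diffuse(x0, t, T, rng):
    """Sample x_t directly from x_0 in closed form, with no step-by-step loop."""
    ab = alpha_bar(t, T)
    signal, noise = math.sqrt(ab), math.sqrt(1.0 - ab)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


rng = random.Random(0)   # fixed seed so the run is repeatable
T = 1000                 # hypothetical number of diffusion steps
x0 = [0.0, 0.5, 1.0, 0.5]  # hypothetical flattened pixel values
for t in (0, 250, 500, 1000):
    print(t, round(alpha_bar(t, T), 3), forward_diffuse(x0, t, T, rng))
```

At `t = 0` the sample is the clean input, and as `t` approaches `T` the signal coefficient falls toward zero so the sample becomes almost pure Gaussian noise. A denoising network is trained to undo exactly this corruption.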

Key Takeaways

✅ Deep networks learn hierarchical features from simple to complex

✅ Dropout prevents overfitting by randomly dropping neurons during training

✅ L2 regularization penalizes large weights

✅ Adam optimizer combines momentum and adaptive learning rates

✅ Batch normalization stabilizes training and enables higher learning rates

✅ Transfer learning reuses features learned on large datasets

What's Next?

Next lesson: Time Series Analysis – forecasting, ARIMA, LSTMs, and temporal patterns!