Deep Learning Basics: Architectures & Training
Explore deep learning architectures: hierarchical feature learning with convolutional kernels, dropout, L2 regularization, Adam, batch normalization, and modern training techniques.
Introduction: Beyond Shallow Networks
You've mastered neural networks with 1-2 hidden layers. But what about deep networks with 10, 50, or even 1000 layers?
Deep learning powers modern AI breakthroughs:
- Image Recognition: ResNet (152 layers)
- Language Models: GPT-style transformers (GPT-3 stacks 96 layers)
- Game AI: AlphaGo, AlphaZero
Key Insight: Deeper networks learn hierarchical representations – from edges to textures to objects!
Learning Objectives
- Understand why depth matters
- Master regularization techniques (dropout, L2)
- Learn advanced optimization (Adam, RMSprop)
- Apply batch normalization
- Handle vanishing/exploding gradients
- Build and train deep networks
- Use transfer learning
1. Why Go Deep?
Hierarchical Feature Learning
Deep networks learn increasingly abstract features:
Computer Vision Example:
- Layer 1: Edges, corners
- Layer 2: Textures, simple shapes
- Layer 3: Object parts (eyes, wheels)
- Layer 4: Complete objects (faces, cars)
Build a Layer-1 Detector by Hand
Before we let backprop discover features for us, build one yourself. Open the Conv Darkroom: type a 3×3 kernel into the editor (or pick sobel-x from the presets) and watch a 64×64 image — a digit, a face silhouette, a checkerboard — convolve in real time. Stack the kernel four layers deep and see the response sharpen.
The occlusion overlay lets you cover part of the input with a draggable 12×12 patch while the tiny LeNet-style classifier emits a top-3 prediction. The region whose occlusion causes the largest drop in the predicted class's score is the "important" region: a hand-rolled saliency map.
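The occlusion idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `occlusion_map` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for the classifier's score for one class (it simply averages the pixels in a fixed central window), not the LeNet-style model from the demo.

```python
def occlusion_map(image, score_fn, patch=12, stride=12, fill=0.0):
    """Return {(top, left): score drop} for each occluded patch position.

    A larger drop means the covered region mattered more to score_fn.
    """
    base = score_fn(image)
    height, width = len(image), len(image[0])
    drops = {}
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            occluded = [row[:] for row in image]  # copy so the input stays intact
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    occluded[r][c] = fill
            drops[(top, left)] = base - score_fn(occluded)
    return drops


def toy_score(image):
    """Hypothetical class score: mean intensity of the window rows/cols 24..35."""
    total = sum(image[r][c] for r in range(24, 36) for c in range(24, 36))
    return total / 144.0


image = [[1.0] * 64 for _ in range(64)]  # assumed 64x64 all-white input
drops = occlusion_map(image, toy_score)
most_important = max(drops, key=drops.get)
print(most_important, drops[most_important])
```

With a real classifier you would pass a function that returns the probability of the class of interest and typically use a smaller stride for a smoother map.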
The kernels you tune by hand here closely resemble what the first conv layer of a trained CNN typically ends up learning automatically: edge and orientation detectors.
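To make the hand-built detector concrete, here is a minimal sketch of applying the sobel-x kernel to a 64×64 checkerboard in plain Python. The helper names `make_checkerboard` and `conv2d_valid` are illustrative, the 8-pixel tile size is an assumption, and the loop computes cross-correlation with no padding (the kernel is not flipped), which is what most deep learning libraries call "convolution".

```python
# Sobel-x kernel: responds to left-to-right intensity changes (vertical edges).
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def make_checkerboard(size=64, tile=8):
    """Return a size x size image of alternating 0/1 tiles."""
    return [[(r // tile + c // tile) % 2 for c in range(size)] for r in range(size)]


def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            out[r][c] = sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(kh)
                for j in range(kw)
            )
    return out


image = make_checkerboard()
response = conv2d_valid(image, SOBEL_X)

# Flat regions give 0; windows that straddle a vertical tile boundary respond strongly.
print(response[0][0], response[0][6], response[0][8])
```

The output is 62×62 because a 3×3 kernel without padding trims one pixel from each border.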
2. Regularization Techniques
Dropout
Idea: Randomly "drop" neurons during training to prevent overfitting.
How it works:
- During training: Each neuron is dropped (its output set to zero) with probability $p$
- During inference: Use all neurons and scale activations by $1 - p$, so their expected value matches training
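The two phases can be sketched as follows, assuming a layer's activations are held in a flat Python list. The function names are illustrative. Note that this follows the classic formulation described above; many frameworks instead use "inverted" dropout, which scales the kept activations by 1 / (1 - p) during training and leaves inference untouched.

```python
import random


def dropout_train(activations, p, rng):
    """Training pass: zero each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]


def dropout_inference(activations, p):
    """Inference pass: keep every neuron, scale by the keep probability 1 - p."""
    return [a * (1.0 - p) for a in activations]


rng = random.Random(0)   # fixed seed so the run is repeatable
acts = [1.0] * 10_000    # assumed toy activations
p = 0.5

train_out = dropout_train(acts, p, rng)
infer_out = dropout_inference(acts, p)

train_mean = sum(train_out) / len(train_out)
infer_mean = sum(infer_out) / len(infer_out)
print(train_mean, infer_mean)  # the two means should be close to each other
```

The scaling at inference is what keeps the next layer's input on the same scale it saw during training.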
L2 Regularization (Weight Decay)
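Add a penalty on large weights to the loss: $L_{\text{total}} = L_{\text{data}} + \lambda \sum_i w_i^2$, where $\lambda$ controls the penalty strength.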
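A small numeric sketch of the penalty and its gradient, assuming the common form λ Σ w² (some texts use λ/2 instead, which only rescales λ). The function names and the values of `weights`, `lam`, and `lr` are illustrative.

```python
def l2_penalty(weights, lam):
    """Penalty term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


def l2_penalty_grad(weights, lam):
    """Gradient of the penalty with respect to each weight: 2 * lam * w."""
    return [2.0 * lam * w for w in weights]


weights = [3.0, -4.0]
lam = 0.01
data_loss = 1.0  # assumed value of the unregularized loss

total = data_loss + l2_penalty(weights, lam)  # 1.0 + 0.01 * 25

# One gradient step on the penalty alone shrinks every weight toward zero,
# which is why L2 regularization is also called weight decay.
lr = 0.1
grad = l2_penalty_grad(weights, lam)
decayed = [w - lr * g for w, g in zip(weights, grad)]
print(total, decayed)
```

Each weight is multiplied by the same factor (1 - 2·lr·λ), so large weights lose the most in absolute terms.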
3. Advanced Optimizers
Adam (Adaptive Moment Estimation)
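Combines momentum (a running mean of gradients) and adaptive learning rates (a running mean of squared gradients). For gradient $g_t$ at step $t$:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
$\hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t)$
$\theta_t = \theta_{t-1} - \alpha \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$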
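As a sketch, a single-parameter Adam step following Kingma & Ba's update rule might look like this. The default hyperparameters are the ones suggested in the paper; the quadratic objective, the learning rate of 0.05, and the name `adam_step` are assumptions for illustration.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimize f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
first_theta = None
for t in range(1, 2001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.05)
    if t == 1:
        first_theta = theta  # the first step has size close to lr

print(first_theta, theta)
```

Because of bias correction, the very first step has magnitude close to `lr` regardless of the gradient's scale, which is one reason Adam is less sensitive to how the loss is scaled.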
4. Batch Normalization
Problem: Internal covariate shift – layer inputs' distributions change during training.
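Solution: Normalize each layer's inputs over the mini-batch, then rescale with learnable parameters $\gamma$ and $\beta$: $\hat{x} = (x - \mu_B) / \sqrt{\sigma_B^2 + \epsilon}$, $y = \gamma \hat{x} + \beta$, where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance.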
Benefits:
- Faster training
- Higher learning rates possible
- Less sensitive to initialization
- Acts as regularization
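The training-time forward pass of batch normalization can be sketched in plain Python as below. This assumes the batch is a list of rows with one column per feature, and it covers only the mini-batch statistics; real layers also track running averages of the mean and variance for use at inference, which this sketch omits. The name `batch_norm` and the sample data are illustrative.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale by gamma and shift by beta."""
    n = len(batch)
    n_features = len(batch[0])
    out = [[0.0] * n_features for _ in range(n)]
    for j in range(n_features):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n  # biased (population) variance
        for i in range(n):
            x_hat = (col[i] - mean) / math.sqrt(var + eps)
            out[i][j] = gamma[j] * x_hat + beta[j]
    return out


# Two features on very different scales end up on the same scale.
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
out = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])

col0 = [row[0] for row in out]
col1 = [row[1] for row in out]
print(col0)
print(col1)
```

With `gamma` at 1 and `beta` at 0 the output has roughly zero mean and unit variance per feature; during training the network is free to learn other values if a different scale works better.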
Bonus · Generative Side: Diffusion
Discriminative deep nets classify what is. Generative deep nets — diffusion models powering Stable Diffusion, DALL·E, and Midjourney — synthesize what could be. The principle is shockingly simple: train a network to reverse the process of adding noise.
The Diffusion Studio runs the real closed-form forward process (xₜ = √(α̅ₜ) x₀ + √(1 − α̅ₜ) ε with a cosine schedule) on a 28×28 hand-drawn source. The reverse process is a synthetic demonstration (we don't ship a U-Net), but the schedule, sampler comparison (Euler / DDIM-η=0 / DDIM-η=1), and predict/truth/diff visualization are pedagogically faithful.
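The closed-form forward process above can be sketched directly. This assumes the cosine schedule of Nichol & Dhariwal with offset s = 0.008, a flattened 28×28 image stored as a list, and T = 1000 steps; the names `alpha_bar` and `q_sample` are illustrative and are not taken from the Diffusion Studio itself.

```python
import math
import random


def alpha_bar(t, T, s=0.008):
    """Cumulative signal fraction under the cosine schedule; equals 1 at t = 0."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def q_sample(x0, t, T, rng):
    """Jump straight to step t: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    ab = alpha_bar(t, T)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps


T = 1000
x0 = [0.5] * (28 * 28)   # assumed flat gray image standing in for a drawing
rng = random.Random(0)   # fixed seed so the run is repeatable

x_start, _ = q_sample(x0, 0, T, rng)      # t = 0: no noise added yet
x_mid, eps_mid = q_sample(x0, 500, T, rng)
x_end, _ = q_sample(x0, T, T, rng)        # t = T: almost pure noise

print(alpha_bar(0, T), alpha_bar(500, T), alpha_bar(T, T))
```

The returned `eps` is the noise that was mixed in, which is the quantity a denoising network would typically be trained to predict.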
Key Takeaways
✅ Deep networks learn hierarchical features from simple to complex
✅ Dropout prevents overfitting by randomly dropping neurons during training
✅ L2 regularization penalizes large weights
✅ Adam optimizer combines momentum and adaptive learning rates
✅ Batch normalization stabilizes training and enables higher learning rates
✅ Transfer learning reuses features learned on large datasets
What's Next?
Next lesson: Time Series Analysis – forecasting, ARIMA, LSTMs, and temporal patterns!
Further Reading
Interactive Visualizations
- Distill — Why Momentum Really Works — Goh, 2017. Sliders for momentum, learning rate, and condition number on real loss surfaces.
- An Interactive Tutorial on Numerical Optimization — Frederickson. SGD, momentum, Adam visualized side-by-side on Rosenbrock and other classic test surfaces.
- CS231n — Optimization Animations — Karpathy/Stanford's optimization visualizations in saddle and ravine landscapes.
- Loss Landscape Visualizer — explore high-dimensional loss landscapes for trained networks.
Video Tutorials
- 3Blue1Brown — Gradient Descent and Backpropagation — episodes 2–4.
- Andrej Karpathy — Building makemore (BatchNorm video) — the clearest explanation of why BatchNorm works.
Papers & Articles
- Adam: A Method for Stochastic Optimization — Kingma & Ba, ICLR 2015.
- Decoupled Weight Decay Regularization (AdamW) — Loshchilov & Hutter, ICLR 2019. Use this, not vanilla Adam.
- Batch Normalization — Ioffe & Szegedy, ICML 2015.
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting — Srivastava et al., JMLR 2014.
- The Lottery Ticket Hypothesis — Frankle & Carbin, ICLR 2019. Dense networks contain sparse subnetworks that, trained in isolation from their original initialization, can reach comparable accuracy.
Documentation & Books
- Book: Deep Learning — Goodfellow, Bengio, Courville (Chapters 7–8, free online).
- PyTorch Optimizers & Keras Optimizers — production-grade implementations of every method discussed.