LESSON 05 · ADVANCED · 60 MIN · ◆ 2 INSTRUMENTS

Neural Networks Fundamentals: Perceptrons to Backpropagation

Build neural networks from scratch. Master perceptrons, activation functions, and the backpropagation algorithm with interactive visualizations.

Introduction: The Brain-Inspired Revolution

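Traditional machine learning algorithms such as linear regression or decision trees are limited by a fixed model form: a straight-line fit, or a sequence of threshold splits. But what if we could build systems that learn complex patterns directly from data, loosely inspired by how our brains work?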

Neural networks are the foundation of modern AI. They power:

  • Computer Vision: Face recognition, self-driving cars
  • Natural Language: ChatGPT, translation systems
  • Recommendation: Netflix, Spotify, YouTube
  • Healthcare: Disease diagnosis, drug discovery

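Key Insight: Neural networks are universal function approximators. With enough hidden units and a suitable non-linear activation, a network can approximate any continuous function on a bounded input range to arbitrary accuracy. Finding good weights still depends on the data and on the training procedure.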

Learning Objectives

  • Understand biological inspiration for neural networks
  • Master forward propagation and backpropagation
  • Implement neural networks from scratch
  • Choose appropriate activation functions
  • Understand gradient descent optimization
  • Diagnose and fix training issues
  • Apply neural networks to real problems

1. Biological Inspiration

TIP

🧠 The single best free intro to backprop ever made: 3Blue1Brown's Neural Networks series — four episodes, ~60 minutes total. Watch episode 1 before continuing. The geometric intuition for "what neurons learn" makes everything below feel obvious.

The Neuron

A biological neuron:

  1. Receives signals through dendrites
  2. Processes them in the cell body
  3. Fires an output signal through the axon if a threshold is exceeded
  4. Connects to other neurons via synapses

The Artificial Neuron

An artificial neuron (perceptron) mimics this:

  1. Receives inputs $x_1, x_2, \ldots, x_n$
  2. Computes weighted sum: $z = w_1x_1 + w_2x_2 + \cdots + w_nx_n + b$
  3. Applies activation function: $a = f(z)$
  4. Outputs $a$ to next layer

Mathematical Formula:

$$a = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(w^T x + b)$$

Where:

  • $w$ = weights (synaptic strengths)
  • $b$ = bias (threshold)
  • $f$ = activation function (firing rule)
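
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lesson's own implementation: the function names `sigmoid` and `neuron`, the choice of sigmoid as $f$, and the input values are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, f=sigmoid):
    """Compute a = f(w . x + b) for one artificial neuron."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # weighted sum
    return f(z)                                       # activation

# Hypothetical inputs, weights, and bias chosen only for illustration.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0
a = neuron(x, w, b)  # z = 0.5*1 - 0.25*2 + 0 = 0, so a = sigmoid(0) = 0.5
print(a)
```

Swapping `f` for a different function changes the firing rule without touching the weighted sum.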

2. Neural Network Architecture

Layers

A neural network consists of layers:

  1. Input Layer: Receives raw features
  2. Hidden Layers: Learn intermediate representations
  3. Output Layer: Produces predictions

Deep Learning = neural networks with multiple hidden layers (2+).

Network Topology

Common notation: [3, 4, 4, 2] means:

  • 3 input neurons
  • 2 hidden layers with 4 neurons each
  • 2 output neurons
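
To make the notation concrete, the sketch below counts the trainable parameters implied by a layer-size list. It assumes every layer is fully connected to the next and has one bias per neuron; `count_parameters` is a hypothetical helper name.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    Each pair of adjacent layers (n_in, n_out) contributes an
    n_out x n_in weight matrix plus n_out biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

print(count_parameters([3, 4, 4, 2]))  # (3*4+4) + (4*4+4) + (4*2+2) = 46
```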

Interactive Exploration

Fig. 02. Neural Network Visualizer (interactive): visualize neural network architecture and training.

Try this:

  1. Click on neurons to see their activations
  2. Adjust the architecture – add/remove layers
  3. Watch forward propagation flow through the network
  4. See how weights affect the output

3. Forward Propagation

The Process

Forward propagation computes the network's output:

For each layer $l$:

  1. Compute weighted sum: $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$
  2. Apply activation: $a^{[l]} = f(z^{[l]})$

Example (2-layer network):

Layer 1 (Hidden):

$$z^{[1]} = W^{[1]}x + b^{[1]}$$
$$a^{[1]} = \sigma(z^{[1]})$$

Layer 2 (Output):

$$z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$$
$$\hat{y} = \sigma(z^{[2]})$$

Implementation from Scratch

Fig. 04. Python Code Executor (interactive Python code execution environment).
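
The layer-by-layer recipe from this section can be sketched as follows. This is a minimal version under stated assumptions: it uses plain Python lists instead of NumPy to stay dependency-free, applies sigmoid in every layer as in the 2-layer example, and the helper names (`dense`, `forward`) and weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, a_prev, f):
    """One layer: z = W a_prev + b, then a = f(z) element by element."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    return [f(z_i) for z_i in z]

def forward(x, layers):
    """Run x through a list of (W, b) pairs, starting from a[0] = x."""
    a = x
    for W, b in layers:
        a = dense(W, b, a, sigmoid)
    return a

# Hypothetical 2-3-1 network with hand-picked weights (illustration only).
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),  # W1 (3x2), b1
    ([[0.3, -0.6, 0.9]], [0.05]),                                # W2 (1x3), b2
]
y_hat = forward([1.0, 2.0], layers)
print(y_hat)  # a single value in (0, 1)
```

Each row of a weight matrix holds the incoming weights of one neuron, so the number of rows equals the layer's width.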

4. Activation Functions

Why Non-linearity?

Without activation functions, stacking layers is pointless:

$$a^{[2]} = W^{[2]}(W^{[1]}x + b^{[1]}) + b^{[2]} = W'x + b'$$

where $W' = W^{[2]}W^{[1]}$ and $b' = W^{[2]}b^{[1]} + b^{[2]}$. This is just a single linear (affine) transformation, no matter how many layers are stacked! Non-linear activations enable learning complex patterns.
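
A quick numerical check of this collapse, as a sketch: two layers without activations are compared against the single merged layer. The weights are hypothetical integers so the comparison is exact, and the helper names are made up for the example.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical weights with integer entries (illustration only).
W1, b1 = [[1, 2], [3, 4]], [1, -1]
W2, b2 = [[2, 0], [1, -1]], [0, 5]
x = [3, -2]

# Two stacked layers with no activation function.
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: W' = W2 W1, b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)
one_layer = add(matvec(W_prime, x), b_prime)

print(two_layers, one_layer)  # both are [0, 5]
```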

Common Activation Functions

1. Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
  • Range: (0, 1)
  • Use: Binary classification output
  • Problem: Vanishing gradients (the derivative is at most 0.25 and approaches 0 when $|z|$ is large)

2. Tanh

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
  • Range: (-1, 1)
  • Use: Hidden layers (zero-centered)
  • Problem: Vanishing gradients

3. ReLU (Rectified Linear Unit)

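$$\text{ReLU}(z) = \max(0, z)$$
  • Range: [0, ∞)
  • Use: Hidden layers (default choice!)
  • Advantages: Cheap to compute; the gradient is exactly 1 for $z > 0$, so it does not saturate for positive inputs
  • Problem: Dead neurons (a unit whose input stays negative outputs 0 and receives zero gradient)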

4. Leaky ReLU

$$\text{Leaky ReLU}(z) = \max(0.01z, z)$$
  • Mitigates the dead ReLU problem: the small slope (0.01) for $z < 0$ keeps the gradient non-zero
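
The four functions above and their derivatives can be written down directly. This is a sketch using only the standard library; the `d_*` helper names are hypothetical, and the derivative of ReLU at exactly $z = 0$ (where it is undefined) is set to 0 by convention here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # peaks at 0.25 when z = 0

def tanh(z):
    return math.tanh(z)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2  # peaks at 1 when z = 0

def relu(z):
    return max(0.0, z)

def d_relu(z):
    return 1.0 if z > 0 else 0.0    # 0 chosen at z == 0 by convention

def leaky_relu(z, slope=0.01):
    return z if z > 0 else slope * z

def d_leaky_relu(z, slope=0.01):
    return 1.0 if z > 0 else slope

# Derivatives at a negative, zero, and positive input.
for z in (-5.0, 0.0, 5.0):
    print(z, d_sigmoid(z), d_tanh(z), d_relu(z), d_leaky_relu(z))
```

The printed derivatives show the saturation pattern: sigmoid and tanh are close to 0 at both ends, ReLU is 0 on the negative side only, and Leaky ReLU never reaches 0.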

Visualization

Fig. 06. Python Code Executor (interactive Python code execution environment).

5. Backpropagation: How Networks Learn

The Challenge

We want to minimize loss by adjusting weights:

$$\min_{W,b} \mathcal{L}(W, b)$$

But how do we compute $\frac{\partial \mathcal{L}}{\partial W^{[l]}}$ for each layer?

The Solution: Chain Rule

Backpropagation applies the chain rule to efficiently compute gradients:

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{\partial \mathcal{L}}{\partial a^{[l]}} \cdot \frac{\partial a^{[l]}}{\partial z^{[l]}} \cdot \frac{\partial z^{[l]}}{\partial W^{[l]}}$$

Backward pass:

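  1. Compute output error: $\delta^{[L]} = a^{[L]} - y$ (this form holds when a sigmoid output is paired with cross-entropy loss; in general $\delta^{[L]} = \frac{\partial \mathcal{L}}{\partial a^{[L]}} \odot f'(z^{[L]})$)
  2. Propagate backward: $\delta^{[l]} = (W^{[l+1]})^T\delta^{[l+1]} \odot f'(z^{[l]})$
  3. Compute gradients: $\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \delta^{[l]}(a^{[l-1]})^T$ and $\frac{\partial \mathcal{L}}{\partial b^{[l]}} = \delta^{[l]}$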

Implementation

Fig. 08. Python Code Executor (interactive Python code execution environment).
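
The three backward-pass steps can be sketched for a 2-layer network and checked against a finite-difference estimate. Everything specific here is an assumption made for illustration: sigmoid in both layers, binary cross-entropy loss on a single example (so the output error is $a^{[2]} - y$), a single output neuron, and hand-picked weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Sigmoid hidden layer, single sigmoid output; returns (a1, a2)."""
    a1 = [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    a2 = sigmoid(sum(w * a for w, a in zip(W2, a1)) + b2)
    return a1, a2

def loss(x, y, W1, b1, W2, b2):
    """Binary cross-entropy for one example."""
    _, a2 = forward(x, W1, b1, W2, b2)
    return -(y * math.log(a2) + (1 - y) * math.log(1 - a2))

def backward(x, y, W1, b1, W2, b2):
    """Return (dW1, db1, dW2, db2) using the three backward-pass steps."""
    a1, a2 = forward(x, W1, b1, W2, b2)
    delta2 = a2 - y                                    # step 1: output error
    # step 2: delta1 = W2^T delta2 (elementwise) sigmoid'(z1) = a1 * (1 - a1)
    delta1 = [w * delta2 * a * (1 - a) for w, a in zip(W2, a1)]
    # step 3: weight gradient = delta times previous activation; bias gradient = delta
    dW2 = [delta2 * a for a in a1]
    dW1 = [[d * v for v in x] for d in delta1]
    return dW1, delta1, dW2, delta2

# Hypothetical example and weights (illustration only).
x, y = [0.5, -1.0], 1.0
W1, b1 = [[0.1, -0.3], [0.2, 0.4]], [0.0, 0.1]
W2, b2 = [0.7, -0.5], 0.05

dW1, db1, dW2, db2 = backward(x, y, W1, b1, W2, b2)

def numeric_grad_W1(j, k, eps=1e-6):
    """Central finite difference of the loss with respect to W1[j][k]."""
    plus = [row[:] for row in W1]
    minus = [row[:] for row in W1]
    plus[j][k] += eps
    minus[j][k] -= eps
    return (loss(x, y, plus, b1, W2, b2) - loss(x, y, minus, b1, W2, b2)) / (2 * eps)

print(dW1[0][1], numeric_grad_W1(0, 1))  # the two values should agree closely
```

Comparing analytic and numerical gradients like this is a common way to catch sign and indexing mistakes before training.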

6. Training Dynamics

Gradient Descent

Update rule:

$$W := W - \alpha \frac{\partial \mathcal{L}}{\partial W}$$

Where $\alpha$ is the learning rate.

Learning Rate Selection

  • Too small: Slow training, may get stuck
  • Too large: Overshoots minimum, unstable
  • Just right: Smooth, fast convergence
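
The three regimes can be reproduced on a toy problem. The sketch below applies the update rule to the one-dimensional loss $\mathcal{L}(w) = w^2$, whose gradient is $2w$; the loss, the starting point, and the three learning rates are assumptions chosen only to make the behaviour visible.

```python
def gradient_descent(alpha, w0=1.0, steps=20):
    """Minimize L(w) = w**2 from w0 using a fixed learning rate alpha."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w          # dL/dw
        w = w - alpha * grad    # W := W - alpha * dL/dW
    return w

# Too small, reasonable, too large (hypothetical values).
for alpha in (0.001, 0.4, 1.1):
    print(alpha, gradient_descent(alpha))
```

Each step multiplies $w$ by $(1 - 2\alpha)$, so after 20 steps the smallest rate has barely moved from 1, the middle rate is essentially at the minimum, and the largest rate has grown in magnitude because $|1 - 2\alpha| > 1$.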

Common Issues

1. Vanishing Gradients

  • Problem: Gradients become tiny in deep networks
  • Solution: Use ReLU, better initialization, batch normalization
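
A rough sketch of why depth shrinks gradients with sigmoid activations: backpropagation multiplies one activation derivative per layer, and the sigmoid derivative never exceeds 0.25. The code below ignores the weight matrices, which also scale the gradient, so it illustrates only the activation part of the effect. `chain_factor` is a hypothetical helper.

```python
import math

def d_sigmoid(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def chain_factor(pre_activations):
    """Product of sigmoid derivatives along one path through the layers."""
    factor = 1.0
    for z in pre_activations:
        factor *= d_sigmoid(z)
    return factor

# Best case for sigmoid: every pre-activation is 0, where the derivative is 0.25.
for depth in (1, 5, 10, 20):
    print(depth, chain_factor([0.0] * depth))
```

Even in this best case the factor is $0.25^{\text{depth}}$, which falls below one millionth at depth 10. With ReLU the corresponding factor is 1 along any path of active units.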

2. Exploding Gradients

  • Problem: Gradients become huge
  • Solution: Gradient clipping, careful initialization
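
One common form of gradient clipping rescales the whole gradient when its L2 norm exceeds a limit. The sketch below assumes the gradients have been flattened into a single list; `clip_by_norm` and the threshold value are hypothetical.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

print(clip_by_norm([3.0, 4.0], 1.0))  # norm 5 is rescaled to norm 1
print(clip_by_norm([0.3, 0.4], 1.0))  # norm 0.5 is left unchanged
```

Rescaling keeps the direction of the update and limits only its size.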

3. Overfitting

  • Problem: Memorizes training data
  • Solution: Regularization (dropout, L2), more data
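
As a sketch of the L2 option, the penalty $\frac{\lambda}{2}\sum w^2$ adds $\lambda w$ to each weight's gradient, which pulls weights toward zero at every step. The function names, the flat weight list, and the values of $\alpha$ and $\lambda$ are assumptions for illustration.

```python
def l2_penalty(weights, lam):
    """L2 penalty (lam / 2) * sum(w^2), added to the data loss."""
    return 0.5 * lam * sum(w * w for w in weights)

def sgd_step_l2(weights, grads, alpha, lam):
    """One gradient descent step on data loss + L2 penalty.

    The penalty's gradient is lam * w, so each weight is also shrunk toward 0.
    """
    return [w - alpha * (g + lam * w) for w, g in zip(weights, grads)]

# Hypothetical values: a zero data gradient isolates the shrinking effect.
w = [1.0, -2.0]
w_new = sgd_step_l2(w, grads=[0.0, 0.0], alpha=0.1, lam=0.5)
print(w_new)  # each weight is multiplied by (1 - 0.1 * 0.5) = 0.95
```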

7. Real-World Application: Classification

Fig. 10. Python Code Executor (interactive Python code execution environment).
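
As a small stand-in for a classification task, the sketch below trains a one-hidden-layer network on the XOR problem with full-batch gradient descent. All specifics are assumptions made for illustration: the toy dataset, tanh hidden units, a sigmoid output with cross-entropy loss, 4 hidden neurons, the learning rate, the epoch count, and the random seed. How well it fits depends on those settings.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def forward(x, W1, b1, W2, b2):
    """tanh hidden layer, sigmoid output; returns (a1, y_hat)."""
    a1 = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    y_hat = sigmoid(sum(w * a for w, a in zip(W2, a1)) + b2)
    return a1, y_hat

def mean_loss(data, params):
    """Mean binary cross-entropy, with predictions clamped away from 0 and 1."""
    total = 0.0
    for x, y in data:
        _, p = forward(x, *params)
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

def init_params(n_in, hidden, seed=0):
    """Small random weights, zero biases."""
    rng = random.Random(seed)
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    return W1, [0.0] * hidden, W2, 0.0

def train_step(data, params, alpha):
    """One full-batch gradient descent step; returns updated parameters."""
    W1, b1, W2, b2 = params
    n = len(data)
    gW1 = [[0.0] * len(W1[0]) for _ in W1]
    gb1 = [0.0] * len(b1)
    gW2 = [0.0] * len(W2)
    gb2 = 0.0
    for x, y in data:
        a1, p = forward(x, W1, b1, W2, b2)
        delta2 = (p - y) / n                         # output error, averaged
        for j, a in enumerate(a1):
            delta1 = W2[j] * delta2 * (1.0 - a * a)  # tanh'(z) = 1 - a^2
            gW2[j] += delta2 * a
            gb1[j] += delta1
            for k, v in enumerate(x):
                gW1[j][k] += delta1 * v
        gb2 += delta2
    W1 = [[w - alpha * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
    b1 = [b - alpha * g for b, g in zip(b1, gb1)]
    W2 = [w - alpha * g for w, g in zip(W2, gW2)]
    b2 = b2 - alpha * gb2
    return W1, b1, W2, b2

# Toy XOR dataset and hyperparameters: assumptions for illustration only.
data = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 0)]
params = init_params(n_in=2, hidden=4)
initial = mean_loss(data, params)
for _ in range(5000):
    params = train_step(data, params, alpha=0.5)
final = mean_loss(data, params)

print("loss before:", initial, "after:", final)
for x, y in data:
    print(x, y, forward(x, *params)[1])  # predicted probability of class 1
```

XOR is not linearly separable, so a model without a hidden layer cannot fit it. That makes it a compact test of whether the hidden layer and the backward pass are doing their job.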

Key Takeaways

  • Neural networks are universal function approximators inspired by the brain
  • Architecture: Input → Hidden layers → Output
  • Forward propagation: Compute predictions layer by layer
  • Backpropagation: Efficiently compute gradients using the chain rule
  • Activation functions: Add non-linearity (prefer ReLU for hidden layers)
  • Training: Gradient descent optimizes weights to minimize loss
  • Challenges: Vanishing/exploding gradients, overfitting, hyperparameter tuning


What's Next?

Next lesson: Deep Learning Basics – building deeper networks, regularization techniques, and advanced optimizers!


Further Reading

Documentation & Books

  • Book: Deep Learning — Goodfellow, Bengio, Courville (free online).
  • Book: Dive into Deep Learning — Zhang et al. (free, executable).
  • Book: Neural Networks and Deep Learning — Michael Nielsen (free online). Still the gentlest first read.