Neural Networks Fundamentals: Perceptrons to Backpropagation

Introduction: The Brain-Inspired Revolution

Traditional machine learning algorithms like linear regression or decision trees fit models of a fixed, fairly limited form. But what if we could build systems that learn far more complex patterns, loosely inspired by how our brains work?

Neural networks are the foundation of modern AI. They power:

Computer Vision: Face recognition, self-driving cars
Natural Language: ChatGPT, translation systems
Recommendation: Netflix, Spotify, YouTube
Healthcare: Disease diagnosis, drug discovery

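Key Insight: Neural networks are universal function approximators: with enough hidden neurons, a network can approximate any continuous function on a bounded input range as closely as you like. Whether training actually finds that approximation still depends on the data and the optimizer.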

Learning Objectives

Understand biological inspiration for neural networks
Master forward propagation and backpropagation
Implement neural networks from scratch
Choose appropriate activation functions
Understand gradient descent optimization
Diagnose and fix training issues
Apply neural networks to real problems

See One Before You Build One

Before any of the math, get a feel for what a neural network is: a stack of layers passing numbers forward. The visualizer renders a live network you can reshape and probe.

Try it: Add a hidden layer (or widen one to 6+ neurons), then click a neuron and watch how its activation — and everything downstream of it — changes. Notice that more neurons means more knobs the network can tune to fit a pattern.

1. Biological Inspiration

TIP

🧠 The single best free intro to backprop ever made: 3Blue1Brown's Neural Networks series — four episodes, ~60 minutes total. Watch episode 1 before continuing. The geometric intuition for "what neurons learn" makes everything below feel obvious.

The Neuron

A biological neuron:

Receives signals through dendrites
Processes them in the cell body
Fires an output signal through the axon if a threshold is exceeded
Connects to other neurons via synapses

The Artificial Neuron

An artificial neuron (perceptron) mimics this:

Receives inputs $x_1, x_2, ..., x_n$
Computes weighted sum: $z = w_1x_1 + w_2x_2 + ... + w_nx_n + b$
Applies activation function: $a = f(z)$
Outputs $a$ to next layer

Mathematical Formula:

a = f(\sum_{i=1}^{n} w_ix_i + b) = f(w^Tx + b)

Where:

$w$ = weights (synaptic strengths)
$b$ = bias (threshold)
$f$ = activation function (firing rule)
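
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a sigmoid for the activation $f$; the function names and the input values are made up for the example.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Illustrative values only: two inputs, two weights, one bias.
x = [1.0, 2.0]
w = [0.5, -0.25]
b = 0.0

# z = 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5
a = neuron(x, w, b)
print(a)  # 0.5
```

Changing `b` shifts how large the weighted sum must be before the neuron's output crosses 0.5, which is why the bias plays the role of a threshold.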

2. Neural Network Architecture

Layers

A neural network consists of layers:

Input Layer: Receives raw features
Hidden Layers: Learn intermediate representations
Output Layer: Produces predictions

Deep Learning = neural networks with multiple hidden layers (2+).

Network Topology

Common notation: [3, 4, 4, 2] means:

3 input neurons
2 hidden layers with 4 neurons each
2 output neurons
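
To see what the topology notation implies for the weights, the sketch below derives the shape of each weight matrix and the total parameter count from a list of layer sizes. The helper names `layer_shapes` and `count_parameters` are hypothetical, and the network is assumed to be fully connected with one bias per non-input neuron.

```python
def layer_shapes(topology):
    """Return the (rows, cols) shape of each weight matrix for a list of layer sizes.

    Each matrix maps the previous layer (cols) to the current layer (rows).
    """
    return [(n_out, n_in) for n_in, n_out in zip(topology[:-1], topology[1:])]

def count_parameters(topology):
    """Total number of weights plus biases (one bias per non-input neuron)."""
    return sum(rows * cols + rows for rows, cols in layer_shapes(topology))

topology = [3, 4, 4, 2]
print(layer_shapes(topology))      # [(4, 3), (4, 4), (2, 4)]
print(count_parameters(topology))  # (12 + 4) + (16 + 4) + (8 + 2) = 46
```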

Interactive Exploration

Now that you know the layer/topology vocabulary, scroll back up to the network visualizer at the top of the lesson and revisit it with fresh eyes:

Click on neurons to see their activations
Adjust the architecture – add/remove layers
Watch forward propagation flow through the network
See how weights affect the output

3. Forward Propagation

The Process

Forward propagation computes the network's output:

For each layer $l$:

Compute weighted sum: $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$
Apply activation: $a^{[l]} = f(z^{[l]})$

Example (2-layer network):

Layer 1 (Hidden):

z^{[1]} = W^{[1]}x + b^{[1]}

a^{[1]} = \sigma(z^{[1]})

Layer 2 (Output):

z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}

\hat{y} = \sigma(z^{[2]})

Implementation from Scratch

The whole forward pass is just two steps repeated per layer: a weighted sum, then an activation.
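
A minimal sketch of those two steps, assuming sigmoid activations in every layer as in the 2-layer example above. It uses plain Python lists so it runs with no dependencies; the `dense` and `forward` helpers and the 2-2-1 network are hypothetical, with weights chosen so the result is easy to check by hand.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, a_prev, b):
    """Weighted sum for one layer: z = W a_prev + b, where W is a list of rows."""
    return [sum(w * a for w, a in zip(row, a_prev)) + bias
            for row, bias in zip(W, b)]

def forward(x, layers):
    """Run x through every (W, b) pair, applying sigmoid after each layer."""
    a = x
    for W, b in layers:
        z = dense(W, a, b)           # step 1: weighted sum
        a = [sigmoid(v) for v in z]  # step 2: activation
    return a

# Hypothetical 2-2-1 network.
layers = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # hidden layer weights and biases
    ([[1.0, -1.0]], [0.0]),                  # output layer weights and bias
]

# Hidden activations are both sigmoid(0) = 0.5, so the output sum is 0.5 - 0.5 = 0.
y_hat = forward([0.3, 0.7], layers)
print(y_hat)  # [0.5]
```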

We'll wrap this into a reusable NeuralNetwork class in the Backpropagation section below, where the forward pass is paired with a backward pass so the network can actually learn.

4. Activation Functions

Why Non-linearity?

Without non-linear activation functions, stacking layers adds no expressive power:

a^{[2]} = W^{[2]}(W^{[1]}x + b^{[1]}) + b^{[2]} = (W^{[2]}W^{[1]})x + (W^{[2]}b^{[1]} + b^{[2]}) = W'x + b'

This is just a single linear transformation! Non-linear activations are what let the network learn complex patterns.
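
The collapse is easy to confirm numerically. The sketch below uses one-input layers with made-up weights and checks that two stacked linear layers agree with a single layer whose parameters are $W' = W^{[2]}W^{[1]}$ and $b' = W^{[2]}b^{[1]} + b^{[2]}$.

```python
def linear(w, b):
    """Return a one-input linear layer x -> w * x + b (no activation)."""
    return lambda x: w * x + b

# Made-up parameters for two stacked layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5
layer1 = linear(w1, b1)
layer2 = linear(w2, b2)

# Parameters of the single equivalent layer.
w_prime = w2 * w1
b_prime = w2 * b1 + b2
collapsed = linear(w_prime, b_prime)

for x in (-1.0, 0.0, 2.5):
    stacked = layer2(layer1(x))
    print(x, stacked, collapsed(x))
    assert abs(stacked - collapsed(x)) < 1e-12
```

Inserting any non-linear function between `layer1` and `layer2` breaks this equivalence, which is exactly what gives depth its value.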

Common Activation Functions

1. Sigmoid

\sigma(z) = \frac{1}{1 + e^{-z}}

Range: (0, 1)
Use: Binary classification output
Problem: Vanishing gradients

2. Tanh

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

Range: (-1, 1)
Use: Hidden layers (zero-centered)
Problem: Vanishing gradients

3. ReLU (Rectified Linear Unit)

\text{ReLU}(z) = \max(0, z)

Range: [0, ∞)
Use: Hidden layers (default choice!)
Advantages: Fast to compute, and the gradient does not shrink for positive inputs
Problem: Dead neurons

4. Leaky ReLU

\text{Leaky ReLU}(z) = \max(0.01z, z)

Mitigates the dead ReLU problem by keeping a small slope (0.01) for negative inputs
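
The four functions above, together with the derivatives that backpropagation needs, can be sketched as follows. The derivative formulas are standard results rather than something stated in this lesson, and taking the ReLU derivative at exactly $z = 0$ to be 0 is a convention assumed here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    return max(slope * z, z)

# Derivatives with respect to z.
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2

def relu_prime(z):
    return 1.0 if z > 0 else 0.0  # value at z == 0 is a convention

def leaky_relu_prime(z, slope=0.01):
    return 1.0 if z > 0 else slope

for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  tanh={math.tanh(z):.3f}  "
          f"relu={relu(z):.1f}  leaky_relu={leaky_relu(z):.2f}")
```

Note that `relu_prime` is exactly 0 for negative inputs, which is how a neuron can "die", while `leaky_relu_prime` never drops below the small slope.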

5. Backpropagation: How Networks Learn

Trace It By Hand First

Before reading the chain-rule notation, walk through it. The Backprop Tracer is a tiny SVG-rendered computational graph with three presets — single-neuron, 2-layer MLP, and softmax+CE. Watch the forward values flow left-to-right (top of each edge), then click Backward and watch the gradients propagate right-to-left (bottom of each edge, in rust).

Click any node to see its local derivative formula with the current numeric values substituted in. The same graph that takes 200 lines of NumPy reduces to a dozen multiplications you can step through.

The Challenge

We want to minimize loss by adjusting weights:

\min_{W,b} \mathcal{L}(W, b)

But how do we compute $\frac{\partial \mathcal{L}}{\partial W^{[l]}}$ for each layer?

The Solution: Chain Rule

Backpropagation applies the chain rule to efficiently compute gradients:

\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{\partial \mathcal{L}}{\partial a^{[l]}} \cdot \frac{\partial a^{[l]}}{\partial z^{[l]}} \cdot \frac{\partial z^{[l]}}{\partial W^{[l]}}

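Backward pass, writing $\delta^{[l]} = \frac{\partial \mathcal{L}}{\partial z^{[l]}}$ for the error at layer $l$:

Compute output error: $\delta^{[L]} = a^{[L]} - y$ (this form holds when a sigmoid or softmax output is paired with cross-entropy loss)
Propagate backward: $\delta^{[l]} = (W^{[l+1]})^T\delta^{[l+1]} \odot f'(z^{[l]})$
Compute gradients: $\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \delta^{[l]}(a^{[l-1]})^T$ and $\frac{\partial \mathcal{L}}{\partial b^{[l]}} = \delta^{[l]}$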
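
Why is the output error so simple? The derivation below is a sketch that assumes a sigmoid output $a = \sigma(z)$ trained with binary cross-entropy loss. The lesson does not name the loss, so treat that pairing as an assumption. Layer superscripts are dropped for readability.

```latex
\mathcal{L} = -\left[ y \log a + (1 - y) \log(1 - a) \right]

\frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a} = \frac{a - y}{a(1 - a)}

\frac{\partial a}{\partial z} = \sigma(z)\bigl(1 - \sigma(z)\bigr) = a(1 - a)

\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} = \frac{a - y}{a(1 - a)} \cdot a(1 - a) = a - y
```

The $a(1 - a)$ factors cancel, leaving the difference between prediction and target. With a different loss (for example squared error), the output error keeps an extra $f'(z)$ factor.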

Implementation
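
A minimal sketch of a `NeuralNetwork` class that pairs the forward pass with the backward-pass equations above. It is a teaching sketch under stated assumptions, not the lesson's original listing: sigmoid activations in every layer, binary cross-entropy loss (so the output error is $a^{[L]} - y$), one training example per update, and plain Python lists instead of NumPy so it runs with no dependencies. The method names are illustrative choices.

```python
import math
import random

def sigmoid(z):
    # Numerically stable form: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

class NeuralNetwork:
    """Fully connected sigmoid network (assumed: cross-entropy loss)."""

    def __init__(self, sizes, seed=0):
        rng = random.Random(seed)
        # W[l][j][k]: weight from neuron k of one layer to neuron j of the next.
        self.W = [[[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
                   for _ in range(n_out)]
                  for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        self.b = [[0.0] * n_out for n_out in sizes[1:]]

    def forward(self, x):
        """Return the activations of every layer, input included."""
        acts = [list(x)]
        for W, b in zip(self.W, self.b):
            z = [sum(w * a for w, a in zip(row, acts[-1])) + bj
                 for row, bj in zip(W, b)]
            acts.append([sigmoid(v) for v in z])
        return acts

    def backward(self, x, y):
        """Return (dW, db) for one example using the delta recursion."""
        acts = self.forward(x)
        delta = [a - t for a, t in zip(acts[-1], y)]  # output error: a - y
        dW, db = [None] * len(self.W), [None] * len(self.W)
        for l in reversed(range(len(self.W))):
            a_prev = acts[l]
            dW[l] = [[d * a for a in a_prev] for d in delta]  # delta * a_prev^T
            db[l] = list(delta)
            if l > 0:
                # W^T delta, then elementwise times sigmoid'(z) = a * (1 - a)
                back = [sum(self.W[l][j][k] * delta[j] for j in range(len(delta)))
                        for k in range(len(a_prev))]
                delta = [g * a * (1.0 - a) for g, a in zip(back, a_prev)]
        return dW, db

    def train_step(self, x, y, lr=0.5):
        """One gradient descent update on a single example."""
        dW, db = self.backward(x, y)
        for l in range(len(self.W)):
            for j in range(len(self.W[l])):
                for k in range(len(self.W[l][j])):
                    self.W[l][j][k] -= lr * dW[l][j][k]
                self.b[l][j] -= lr * db[l][j]

    def loss(self, x, y):
        """Binary cross-entropy for one example."""
        out = self.forward(x)[-1]
        return -sum(t * math.log(a) + (1 - t) * math.log(1 - a)
                    for a, t in zip(out, y))

# Usage: try to fit the XOR truth table with a hypothetical 2-4-1 network.
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
net = NeuralNetwork([2, 4, 1])
for epoch in range(2000):
    for x, y in data:
        net.train_step(x, y)
for x, y in data:
    print(x, "target", y[0], "output", round(net.forward(x)[-1][0], 3))
```

Whether XOR is fit well depends on the seed, learning rate, and number of epochs, so inspect the printed outputs rather than assuming convergence. A finite-difference gradient check (nudge one weight, measure the change in loss, and compare it to the backprop gradient) is a good way to validate `backward`.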

6. Training Dynamics

Drop a Ball, Watch It Roll

The intuition for gradient descent is geometric: the loss surface is a landscape, and the optimizer is a ball rolling downhill. The Loss Landscape Atlas ships five hand-coded analytical surfaces — convex quadratic, Rosenbrock (the banana valley), Himmelblau (four local minima), a saddle, and a multi-modal sin/cos landscape. Click anywhere on the heatmap to drop a ball and watch the optimizer descend.

Try the four optimizers — SGD, momentum, Adam, RMSProp — on Rosenbrock at lr=0.01: only momentum and Adam reliably reach the global minimum. Increase the learning rate and watch SGD diverge.

This is the same dynamic that plays out in a million-parameter neural network — projected down to 2D so you can see it.

Gradient Descent

Update rule:

W := W - \alpha \frac{\partial \mathcal{L}}{\partial W}

Where $\alpha$ is the learning rate.
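
As a worked example of the update rule, take a hypothetical one-parameter loss $\mathcal{L}(W) = (W - 3)^2$ (not from this lesson, chosen because its minimum at $W = 3$ is obvious), starting at $W_0 = 1$ with $\alpha = 0.1$:

```latex
\frac{\partial \mathcal{L}}{\partial W} = 2(W - 3)

W_0 = 1: \quad \frac{\partial \mathcal{L}}{\partial W} = 2(1 - 3) = -4, \quad W_1 = 1 - 0.1 \cdot (-4) = 1.4

W_1 = 1.4: \quad \frac{\partial \mathcal{L}}{\partial W} = 2(1.4 - 3) = -3.2, \quad W_2 = 1.4 - 0.1 \cdot (-3.2) = 1.72
```

Each step moves $W$ toward 3, and the steps get smaller as the gradient shrinks near the minimum.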

Learning Rate Selection

Too small: Slow training, may get stuck
Too large: Overshoots minimum, unstable
Just right: Smooth, fast convergence
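
The three regimes can be reproduced on a toy problem. The sketch below minimizes a hypothetical loss $\mathcal{L}(w) = (w - 3)^2$, whose minimum is at $w = 3$, with three illustrative learning rates; the specific values are assumptions picked to show each behavior.

```python
def gradient_descent(lr, steps=50, w=1.0):
    """Minimize the toy loss (w - 3)^2, whose gradient is 2 * (w - 3)."""
    for _ in range(steps):
        w -= lr * 2.0 * (w - 3.0)
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 50 steps = {gradient_descent(lr):.4g}")
```

With `lr=0.001` the parameter has barely moved from its start, with `lr=0.1` it lands very close to 3, and with `lr=1.1` every step overshoots by more than the previous error, so the value diverges.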

Common Issues

1. Vanishing Gradients

Problem: Gradients become tiny in deep networks
Solution: Use ReLU, better initialization, batch normalization
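
A rough way to see why sigmoid layers starve early layers of gradient: the backward pass multiplies by $f'(z)$ once per layer, and the sigmoid's derivative never exceeds 0.25. The sketch below looks only at that activation-derivative factor in its best case and ignores the weight matrices, which is a simplifying assumption.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Best case for sigmoid: every pre-activation is 0, where the slope peaks at 0.25.
for depth in (2, 5, 10, 20):
    factor = sigmoid_prime(0.0) ** depth
    print(f"{depth:2d} layers: activation-derivative factor = {factor:.2e}")
```

ReLU's derivative is 1 for positive inputs, so the same product does not shrink along active paths.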

2. Exploding Gradients

Problem: Gradients become huge
Solution: Gradient clipping, careful initialization
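
Gradient clipping can be sketched in a few lines. This version rescales by the L2 norm, which is one common variant; the helper name `clip_by_norm` and the threshold value are assumptions for illustration.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector so its L2 norm never exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

print(clip_by_norm([3.0, 4.0], max_norm=1.0))  # norm 5, rescaled down to norm 1
print(clip_by_norm([0.3, 0.4], max_norm=1.0))  # norm 0.5, returned unchanged
```

Clipping by norm keeps the direction of the update and only limits its length, so one huge gradient cannot throw the weights far off course.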

3. Overfitting

Problem: Memorizes training data instead of generalizing to unseen data
Solution: Regularization (dropout, L2), more data
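
Of the remedies listed, L2 regularization is the simplest to sketch: it adds a penalty on large weights to the loss, which shows up as an extra term in each weight's gradient. The helper names and the penalty convention $\frac{\lambda}{2}\sum w^2$ are assumptions here; some texts omit the factor of one half.

```python
def l2_penalty(weights, lam):
    """L2 term added to the loss: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)

def l2_regularized_grads(grads, weights, lam):
    """Gradient of (loss + L2 penalty): each weight adds lam * w to its gradient."""
    return [g + lam * w for g, w in zip(grads, weights)]

weights = [1.0, -2.0]
grads = [0.0, 0.5]  # made-up data-loss gradients
print(l2_penalty(weights, lam=0.1))
print(l2_regularized_grads(grads, weights, lam=0.1))
```

Even when the data-loss gradient is zero, the extra term keeps pulling each weight toward zero, which discourages the large weights that often accompany memorization.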

7. Real-World Application: Classification
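
As a small stand-in for a real classification task, the sketch below trains a single sigmoid neuron to separate two clusters of points. Everything in it is an assumption for illustration: the eight data points are made up, and the learning rate and epoch count are hand-picked. A real application would use a real dataset, a held-out test set, and usually hidden layers.

```python
import math

def sigmoid(z):
    # Numerically stable form: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Made-up toy data: two features per point, label 1 for the upper-right cluster.
X = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.4, 0.3),
     (0.8, 0.9), (0.9, 0.7), (0.7, 0.8), (0.6, 0.9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

def predict_proba(w, b, x):
    """Probability of class 1 from a single sigmoid neuron."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

def train(X, y, lr=1.0, epochs=5000):
    """Full-batch gradient descent on mean binary cross-entropy."""
    w, b = [0.0, 0.0], 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, t in zip(X, y):
            delta = predict_proba(w, b, x) - t  # output error: a - y
            grad_w[0] += delta * x[0] / n
            grad_w[1] += delta * x[1] / n
            grad_b += delta / n
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b

w, b = train(X, y)
predictions = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(f"learned weights: {w}, bias: {b:.3f}")
print(f"training accuracy: {accuracy:.2f}")
```

The two clusters are linearly separable, so one neuron is enough here. Training accuracy alone says nothing about overfitting, which is why real projects measure accuracy on data the model has not seen.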

Key Takeaways

✅ Neural networks are universal function approximators inspired by the brain

✅ Architecture: Input → Hidden layers → Output

✅ Forward propagation: Compute predictions layer by layer

✅ Backpropagation: Efficiently compute gradients using chain rule

✅ Activation functions: Add non-linearity (prefer ReLU for hidden layers)

✅ Training: Gradient descent optimizes weights to minimize loss

✅ Challenges: Vanishing/exploding gradients, overfitting, hyperparameter tuning

What's Next?

Next lesson: Deep Learning Basics – building deeper networks, regularization techniques, and advanced optimizers!