Neural Networks Fundamentals: Perceptrons to Backpropagation

Introduction: The Brain-Inspired Revolution

Traditional machine learning models like linear regression or decision trees struggle to capture the rich, layered structure in images, speech, and text. But what if we could build systems that learn such complex patterns, loosely inspired by our brains?

Neural networks are the foundation of modern AI. They power:

  • Computer Vision: Face recognition, self-driving cars
  • Natural Language: ChatGPT, translation systems
  • Recommendation: Netflix, Spotify, YouTube
  • Healthcare: Disease diagnosis, drug discovery

Key Insight: Neural networks are universal function approximators – with enough hidden neurons and training data, they can approximate almost any continuous function!

Learning Objectives

  • Understand biological inspiration for neural networks
  • Master forward propagation and backpropagation
  • Implement neural networks from scratch
  • Choose appropriate activation functions
  • Understand gradient descent optimization
  • Diagnose and fix training issues
  • Apply neural networks to real problems

1. Biological Inspiration

The Neuron

A biological neuron:

  1. Receives signals through dendrites
  2. Processes them in the cell body
  3. Fires output through the axon if threshold exceeded
  4. Connects to other neurons via synapses

The Artificial Neuron

An artificial neuron (perceptron) mimics this:

  1. Receives inputs $x_1, x_2, \ldots, x_n$
  2. Computes a weighted sum: $z = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b$
  3. Applies an activation function: $a = f(z)$
  4. Outputs $a$ to the next layer

Mathematical Formula:

$$a = f\left(\sum_{i=1}^{n} w_i x_i + b\right) = f(w^T x + b)$$

Where:

  • $w$ = weights (synaptic strengths)
  • $b$ = bias (threshold)
  • $f$ = activation function (firing rule)
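
To make the formula concrete, here is a minimal NumPy sketch of a single artificial neuron. The input values, weights, and the choice of sigmoid as $f$ are illustrative assumptions, not part of the lesson's runtime.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs x_1, x_2, x_3
w = np.array([0.8, 0.1, -0.4])   # weights (synaptic strengths)
b = 0.2                          # bias (shifts the firing threshold)

z = w @ x + b                    # weighted sum: w^T x + b
a = sigmoid(z)                   # activation: a = f(z)
print(f"z = {z:.3f}, a = {a:.3f}")
```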

2. Neural Network Architecture

Layers

A neural network consists of layers:

  1. Input Layer: Receives raw features
  2. Hidden Layers: Learn intermediate representations
  3. Output Layer: Produces predictions

Deep Learning = neural networks with multiple hidden layers (2+).

Network Topology

Common notation: [3, 4, 4, 2] means:

  • 3 input neurons
  • 2 hidden layers with 4 neurons each
  • 2 output neurons
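
One way to read this notation is by the trainable parameters it implies: each layer owns a weight matrix and a bias vector, following the formulas used later in this lesson. A small sketch of the count for [3, 4, 4, 2]:

```python
layer_sizes = [3, 4, 4, 2]       # the [3, 4, 4, 2] topology above

total = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_params = n_out * n_in + n_out   # weight matrix (n_out x n_in) plus n_out biases
    print(f"{n_in} -> {n_out}: {n_params} parameters")
    total += n_params
print("Total trainable parameters:", total)   # 16 + 20 + 10 = 46
```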

Interactive Exploration


Try this:

  1. Click on neurons to see their activations
  2. Adjust the architecture – add/remove layers
  3. Watch forward propagation flow through the network
  4. See how weights affect the output

3. Forward Propagation

The Process

Forward propagation computes the network's output:

For each layer $l$:

  1. Compute the weighted sum: $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
  2. Apply the activation: $a^{[l]} = f(z^{[l]})$

Example (2-layer network):

Layer 1 (Hidden):

$$z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]})$$

Layer 2 (Output):

$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad \hat{y} = \sigma(z^{[2]})$$

Implementation from Scratch

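The lesson's embedded runtime is not reproduced here. In its place, a minimal sketch of forward propagation for the 2-layer network above, assuming sigmoid activations in both layers and small random initial weights (the helper names are ours):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_x, n_h, n_y, seed=0):
    """Small random weights, zero biases, for a 2-layer network."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(scale=0.1, size=(n_h, n_x)), "b1": np.zeros((n_h, 1)),
        "W2": rng.normal(scale=0.1, size=(n_y, n_h)), "b2": np.zeros((n_y, 1)),
    }

def forward(X, params):
    """Forward propagation: hidden layer, then output layer."""
    Z1 = params["W1"] @ X + params["b1"]    # z^[1] = W^[1] x + b^[1]
    A1 = sigmoid(Z1)                        # a^[1] = sigma(z^[1])
    Z2 = params["W2"] @ A1 + params["b2"]   # z^[2] = W^[2] a^[1] + b^[2]
    A2 = sigmoid(Z2)                        # y_hat = sigma(z^[2])
    return A2, (Z1, A1, Z2, A2)

# Tiny example: 3 features, 4 hidden units, 1 output, a batch of 5 samples
params = init_params(n_x=3, n_h=4, n_y=1)
X = np.random.default_rng(1).normal(size=(3, 5))
y_hat, _ = forward(X, params)
print(y_hat.shape)   # (1, 5): one prediction per sample
```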


4. Activation Functions

Why Non-linearity?

Without activation functions, stacking layers is pointless:

$$a^{[2]} = W^{[2]}(W^{[1]} x + b^{[1]}) + b^{[2]} = W' x + b'$$

This collapses into a single linear (affine) transformation, no matter how many layers we stack! Non-linear activations are what allow the network to learn complex patterns.
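
A quick numeric check of this collapse (shapes and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x = rng.normal(size=(3, 1))

# Two "layers" with no activation function...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...equal a single linear layer W'x + b'
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ x + b_prime

print(np.allclose(two_layer, one_layer))   # True
```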

Common Activation Functions

1. Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
  • Range: (0, 1)
  • Use: Binary classification output
  • Problem: Vanishing gradients

2. Tanh

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
  • Range: (-1, 1)
  • Use: Hidden layers (zero-centered)
  • Problem: Vanishing gradients

3. ReLU (Rectified Linear Unit)

$$\text{ReLU}(z) = \max(0, z)$$
  • Range: [0, ∞)
  • Use: Hidden layers (default choice!)
  • Advantages: Fast to compute, gradient does not vanish for positive inputs
  • Problem: Dead neurons

4. Leaky ReLU

$$\text{Leaky ReLU}(z) = \max(0.01z, z)$$
  • Solves the dying-ReLU problem by allowing a small gradient when $z < 0$

Visualization

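In place of the embedded runtime, a short sketch that defines and plots these four functions (assumes NumPy and Matplotlib are available):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.linspace(-5, 5, 200)
for name, fn in [("sigmoid", sigmoid), ("tanh", np.tanh),
                 ("ReLU", relu), ("Leaky ReLU", leaky_relu)]:
    plt.plot(z, fn(z), label=name)

plt.axhline(0, color="gray", linewidth=0.5)
plt.axvline(0, color="gray", linewidth=0.5)
plt.legend()
plt.title("Common activation functions")
plt.show()
```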


5. Backpropagation: How Networks Learn

The Challenge

We want to minimize loss by adjusting weights:

$$\min_{W,b} \mathcal{L}(W, b)$$

But how do we compute $\frac{\partial \mathcal{L}}{\partial W^{[l]}}$ for each layer?

The Solution: Chain Rule

Backpropagation applies the chain rule to efficiently compute gradients:

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{\partial \mathcal{L}}{\partial a^{[l]}} \cdot \frac{\partial a^{[l]}}{\partial z^{[l]}} \cdot \frac{\partial z^{[l]}}{\partial W^{[l]}}$$

Backward pass:

  1. Compute the output error: $\delta^{[L]} = a^{[L]} - y$ (this simple form holds for cross-entropy loss with a sigmoid or softmax output)
  2. Propagate it backward: $\delta^{[l]} = (W^{[l+1]})^T \delta^{[l+1]} \odot f'(z^{[l]})$
  3. Compute the gradients: $\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \delta^{[l]} (a^{[l-1]})^T$ and $\frac{\partial \mathcal{L}}{\partial b^{[l]}} = \delta^{[l]}$

Implementation

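Standing in for the embedded runtime again, a minimal sketch of the backward pass for the 2-layer sigmoid network, reusing init_params() and forward() from the forward-propagation sketch above. It assumes binary cross-entropy loss, so the output error is simply A2 - Y; the toy dataset and learning rate are arbitrary.

```python
import numpy as np

def backward(X, Y, params, cache):
    """Backward pass: compute gradients layer by layer via the chain rule."""
    Z1, A1, Z2, A2 = cache
    m = X.shape[1]                                  # number of training examples

    dZ2 = A2 - Y                                    # output error delta^[2]
    dW2 = (dZ2 @ A1.T) / m                          # dL/dW2 = delta^[2] (a^[1])^T
    db2 = dZ2.mean(axis=1, keepdims=True)

    dZ1 = (params["W2"].T @ dZ2) * A1 * (1 - A1)    # back through sigmoid: f'(z) = a(1 - a)
    dW1 = (dZ1 @ X.T) / m                           # dL/dW1 = delta^[1] x^T
    db1 = dZ1.mean(axis=1, keepdims=True)
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

# Tiny training loop on a toy binary problem (reuses forward/init_params from above)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 200))
Y = (X[0:1] + X[1:2] * X[2:3] > 0).astype(float)    # an arbitrary non-linear labeling rule
params = init_params(n_x=3, n_h=4, n_y=1)

for step in range(2000):
    A2, cache = forward(X, params)
    grads = backward(X, Y, params, cache)
    params = {k: params[k] - 0.5 * grads[k] for k in params}   # W := W - alpha * dL/dW

print("Training accuracy:", ((forward(X, params)[0] > 0.5) == Y).mean())
```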


6. Training Dynamics

Gradient Descent

Update rule:

$$W := W - \alpha \frac{\partial \mathcal{L}}{\partial W}$$

where $\alpha$ is the learning rate.
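
A toy illustration of this update on the one-dimensional loss $\mathcal{L}(w) = (w - 3)^2$, run with several arbitrarily chosen learning rates; it previews the behaviour described in the next subsection.

```python
def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2."""
    return 2 * (w - 3)

for alpha in [0.01, 0.1, 0.9, 1.1]:
    w = 0.0
    for _ in range(50):
        w = w - alpha * grad(w)        # W := W - alpha * dL/dW
    print(f"alpha = {alpha:>4}: w after 50 steps = {w:.4f}")
```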

Learning Rate Selection

  • Too small: Slow training, may get stuck
  • Too large: Overshoots minimum, unstable
  • Just right: Smooth, fast convergence

Common Issues

1. Vanishing Gradients

  • Problem: Gradients become tiny in deep networks
  • Solution: Use ReLU, better initialization, batch normalization

2. Exploding Gradients

  • Problem: Gradients become huge
  • Solution: Gradient clipping (sketched below), careful initialization

3. Overfitting

  • Problem: Memorizes training data
  • Solution: Regularization (dropout, L2), more data
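
Of these fixes, gradient clipping (mentioned under exploding gradients above) is simple enough to show in a few lines. A minimal sketch, with an arbitrary max_norm threshold:

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale a gradient so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])    # an "exploding" gradient with norm 50
print(clip_by_norm(g))         # rescaled to norm 5: [ 3. -4.]
```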

7. Real-World Application: Classification

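The lesson's embedded exercise is not reproduced here. As a stand-in, a short sketch that trains a small neural network classifier on scikit-learn's two-moons dataset; the architecture and hyperparameters are illustrative choices, and scikit-learn is assumed to be installed.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Two interleaved half-moons: a classic non-linearly-separable problem
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Small network: two hidden layers of 16 ReLU units, trained with Adam
clf = MLPClassifier(hidden_layer_sizes=(16, 16), activation="relu",
                    solver="adam", max_iter=1000, random_state=42)
clf.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```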


Key Takeaways

  • Neural networks are universal function approximators inspired by the brain
  • Architecture: Input → Hidden layers → Output
  • Forward propagation: Compute predictions layer by layer
  • Backpropagation: Efficiently compute gradients using the chain rule
  • Activation functions: Add non-linearity (prefer ReLU for hidden layers)
  • Training: Gradient descent optimizes weights to minimize loss
  • Challenges: Vanishing/exploding gradients, overfitting, hyperparameter tuning


What's Next?

Next lesson: Deep Learning Basics – building deeper networks, regularization techniques, and advanced optimizers!