Introduction: Learning Through Interaction
Unlike supervised learning (where we learn from labeled examples) or unsupervised learning (where we find structure in unlabeled data), a reinforcement learning agent learns by trial and error, guided by rewards and penalties.
Real-world applications:
- Game AI (AlphaGo, Chess, Atari games)
- Robotics (walking, grasping, manipulation)
- Autonomous vehicles
- Recommendation systems
- Resource optimization
Key Insight: An agent learns to make sequential decisions by maximizing cumulative reward through interaction with an environment.
Learning Objectives
- Understand the RL framework (agent, environment, reward)
- Master key concepts (state, action, policy, value function)
- Learn Q-learning algorithm
- Apply exploration vs. exploitation strategies
- Implement simple RL agents
- Understand credit assignment problem
1. The RL Framework
🕹️ Watch RL converge live before learning the math: Karpathy's REINFORCEjs Gridworld shows value iteration, then TD-learning in a separate demo. Click "Run Value Iteration" — within seconds, the optimal policy emerges from random exploration. Best 60 seconds you can spend before reading the rest of this lesson.
Core Components
- Agent: the learner/decision maker
- Environment: the world the agent interacts with
- State (s): the current situation
- Action (a): what the agent can do
- Reward (r): the feedback signal from the environment
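The components above meet in the interaction loop: the agent observes a state, picks an action, and the environment returns a reward and the next state. A minimal sketch with a hypothetical one-step toy environment (the `CoinFlipEnv` class and its reward values are illustrative, not from any real library):

```python
import random

class CoinFlipEnv:
    """Toy environment: guess a hidden coin flip. +1 for a correct guess, -1 otherwise."""
    def reset(self):
        self.coin = random.choice([0, 1])  # hidden part of the environment's state
        return 0                           # the agent always observes the same state

    def step(self, action):
        reward = 1 if action == self.coin else -1
        done = True                        # one-step episodes
        return 0, reward, done

env = CoinFlipEnv()
total_reward = 0
for episode in range(100):
    state = env.reset()
    done = False
    while not done:
        action = random.choice([0, 1])          # agent chooses an action
        state, reward, done = env.step(action)  # environment responds with reward + next state
        total_reward += reward                  # agent accumulates cumulative reward
print("average reward per episode:", total_reward / 100)
```

A random agent averages close to zero here; learning means doing better than that, which is exactly what the Q-learning section below sets up.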
2. Key Concepts
Policy
A policy π defines the agent's behavior: the probability of taking action a in state s.
- Deterministic: a = π(s)
- Stochastic: π(a|s) = P(Aₜ = a | Sₜ = s)
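The two forms of policy can be made concrete in code (a toy sketch; the state and action names are illustrative):

```python
import random

# Deterministic policy: a fixed mapping from state to action, a = π(s).
deterministic_policy = {"start": "right", "middle": "right", "near_goal": "up"}

# Stochastic policy: a probability distribution over actions in each state, π(a|s).
stochastic_policy = {"start": {"right": 0.8, "up": 0.2}}

def sample_action(policy, state):
    """Draw an action from a stochastic policy π(a|s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

print(deterministic_policy["start"])              # always "right"
print(sample_action(stochastic_policy, "start"))  # "right" about 80% of the time
```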
Value Function
Expected cumulative discounted reward starting from state s and following policy π:

V^π(s) = E_π[ Σ_{t=0}^∞ γ^t r_{t+1} | s₀ = s ]

where γ is the discount factor (0 < γ < 1): rewards far in the future count for less than immediate ones.
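To make discounting concrete, here is the return for a short reward sequence (a worked sketch, not tied to any particular environment):

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_1 + γ·r_2 + γ²·r_3 + ...  for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# A reward of 1 received three steps from now is worth less than one received immediately:
print(discounted_return([0, 0, 0, 1], gamma=0.9))  # ≈ 0.9³ = 0.729
print(discounted_return([1, 0, 0, 0], gamma=0.9))  # 1.0
```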
Q-Function
Expected cumulative discounted reward from taking action a in state s, then following π:

Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t r_{t+1} | s₀ = s, a₀ = a ]
3. Grid World Example
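The running example can be sketched as a small deterministic grid: the agent starts in one corner, the goal cell gives +1, and every other step costs a little (the −0.04 step cost and 4×4 size are illustrative assumptions mirroring the classic gridworld setup, including the REINFORCEjs demo linked above):

```python
class GridWorld:
    """4x4 deterministic grid. State = (row, col); start (0, 0), goal (3, 3)."""
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)  # walls clip movement
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        if self.pos == self.goal:
            return self.pos, 1.0, True     # goal reached, episode ends
        return self.pos, -0.04, False      # small step cost favors short paths

env = GridWorld()
state = env.reset()
# One shortest path is 6 steps: three downs, then three rights.
for a in ["down", "down", "down", "right", "right", "right"]:
    state, reward, done = env.step(a)
print(state, reward, done)  # (3, 3) 1.0 True
```

The step cost is what makes "shortest path to the goal" the optimal behavior rather than just "reach the goal eventually".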
4. Q-Learning Algorithm
Goal: Learn optimal Q-function through experience.
Update Rule:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]

Where:
- α = learning rate (how strongly each new sample overwrites the old estimate)
- γ = discount factor
- r = reward received for taking action a in state s
- s′ = next state (and a′ ranges over the actions available there)
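The update rule is one line of code. Here is a minimal sketch that trains it on a toy 5-state chain where the agent must walk right to reach a goal (the environment, hyperparameters, and episode count are illustrative choices, not canonical values):

```python
import random

random.seed(0)
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                        # move left / move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    s = 0
    while s != GOAL:
        # ε-greedy action selection (explore with probability ε)
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update: Q(s,a) += α [ r + γ max_a' Q(s',a') − Q(s,a) ]
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# After training, moving right (+1) should be preferred in every non-goal state.
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)}
print(policy)  # {0: 1, 1: 1, 2: 1, 3: 1}
```

Note the learned Q-values decay geometrically with distance from the goal (≈ 0.9³, 0.9², 0.9, 1.0), exactly as the discount factor predicts.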
5. Exploration vs. Exploitation
Exploration: try new actions to discover better strategies.
Exploitation: use known best actions to maximize reward.
ε-greedy strategy:
- With probability ε: explore (random action)
- With probability 1-ε: exploit (best known action)
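ε-greedy is a few lines of code. A minimal sketch (the Q-table here is a hypothetical example, not learned values):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability ε take a random action; otherwise take the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit

# Hypothetical Q-table: action "b" currently looks best in state 0.
Q = {(0, "a"): 0.2, (0, "b"): 0.8}
random.seed(1)
picks = [epsilon_greedy(Q, 0, ["a", "b"], epsilon=0.1) for _ in range(1000)]
print(picks.count("b") / 1000)  # ≈ 0.95: mostly exploit, occasionally explore
```

A common refinement is to decay ε over time: explore heavily early on, then settle into exploiting what has been learned.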
Key Takeaways
✅ RL learns through trial and error using rewards
✅ Agent interacts with environment to maximize cumulative reward
✅ Q-learning learns optimal action-values through temporal difference updates
✅ Exploration vs exploitation: Must balance trying new actions vs. using best known actions
✅ Credit assignment: Which actions led to eventual reward/penalty?
✅ Applications: Games, robotics, autonomous systems, optimization
What's Next?
Next lesson: MLOps Fundamentals – automating ML workflows, CI/CD pipelines, and production infrastructure!
Further Reading
Interactive Visualizations
- Andrej Karpathy — REINFORCEjs Gridworld — value iteration visualized cell-by-cell. Click "Run Value Iteration" and watch the policy emerge.
- REINFORCEjs — Q-Learning Demo — TD-learning in your browser, with explore-vs-exploit slider.
- Distill — Why Momentum Really Works — neighbor topic; many RL optimizers use momentum.
- OpenAI Gym Atari Demos — the canonical RL benchmark; Gymnasium is the actively maintained successor.
Video Courses
- David Silver — RL Course (DeepMind / UCL) — 10 lectures, the canonical free RL course.
- Hugging Face — Deep RL Course — modern, free, hands-on with Stable-Baselines3.
- Spinning Up in Deep RL — OpenAI's curated path through modern policy-gradient methods.
Papers & Articles
- Playing Atari with Deep Reinforcement Learning — Mnih et al., DeepMind 2013. The DQN paper that started modern RL.
- Proximal Policy Optimization Algorithms — Schulman et al., OpenAI 2017. The default modern policy-gradient method.
- Mastering the Game of Go with Deep Neural Networks and Tree Search — Silver et al., Nature 2016. AlphaGo.
- Deep Reinforcement Learning from Human Preferences — Christiano et al., 2017. The technical foundation of RLHF in modern LLMs.
Documentation & Books
- Book: Reinforcement Learning: An Introduction (2nd ed.) — Sutton & Barto (free PDF). The textbook.
- Gymnasium — modern, maintained fork of OpenAI Gym.
- Stable-Baselines3 — well-tested implementations of DQN, PPO, SAC, A2C in PyTorch.
- CleanRL — single-file, high-quality reference implementations of every major RL algorithm.