Introduction: Learning Through Interaction
Unlike supervised learning (where we have labels) or unsupervised learning (where we find patterns), reinforcement learning (RL) learns by trial and error, guided by rewards and penalties.
Real-world applications:
- Game AI (AlphaGo, Chess, Atari games)
- Robotics (walking, grasping, manipulation)
- Autonomous vehicles
- Recommendation systems
- Resource optimization
Key Insight: An agent learns to make sequential decisions by maximizing cumulative reward through interaction with an environment.
Learning Objectives
- Understand the RL framework (agent, environment, reward)
- Master key concepts (state, action, policy, value function)
- Learn Q-learning algorithm
- Apply exploration vs. exploitation strategies
- Implement simple RL agents
- Understand credit assignment problem
1. The RL Framework
Watch RL converge live before learning the math: Karpathy's REINFORCEjs Gridworld shows value iteration, and a separate demo shows TD-learning. Click "Run Value Iteration" and within seconds the optimal policy emerges from random exploration. Best 60 seconds you can spend before reading the rest of this lesson.
Core Components
- Agent: the learner/decision-maker
- Environment: the world the agent interacts with
- State (s): the current situation
- Action (a): what the agent can do
- Reward (r): the feedback signal
2. Key Concepts
Policy
A policy π defines the agent's behavior: π(a|s) is the probability of taking action a in state s.
- Deterministic: a = π(s)
- Stochastic: π(a|s) = P(A_t = a | S_t = s)
Value Function
Expected cumulative reward starting from state s:

V^π(s) = E_π[ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s ]

Where γ is the discount factor (0 < γ < 1).
Q-Function
Expected cumulative reward from taking action a in state s:

Q^π(s, a) = E_π[ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s, a_0 = a ]
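Both definitions are expectations of the same discounted return. For a single finished episode that return can be computed directly; a minimal sketch (the function name and reward values are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode's rewards."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# A reward of +1 received three steps in the future is worth gamma^3 today:
print(discounted_return([0, 0, 0, 1], gamma=0.9))  # ~0.729
```

Note how the discount factor makes distant rewards count for less, which is what pushes the agent toward reaching the goal sooner.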
3. Grid World Example
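A minimal grid world can be sketched in a few lines of Python. This is an illustrative sketch, not from any specific library; the 4x4 size, the -1 step cost, and the +10 goal reward are assumptions chosen for the example:

```python
# States are (row, col); the agent starts at (0, 0) and the goal is (3, 3).
# Each step costs -1; reaching the goal gives +10 and ends the episode.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

class GridWorld:
    def __init__(self, size=4, goal=(3, 3)):
        self.size = size
        self.goal = goal
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        dr, dc = ACTIONS[action]
        r, c = self.state
        # Moves that would leave the grid keep the agent in place.
        nr = min(max(r + dr, 0), self.size - 1)
        nc = min(max(c + dc, 0), self.size - 1)
        self.state = (nr, nc)
        done = self.state == self.goal
        reward = 10 if done else -1
        return self.state, reward, done
```

The `reset`/`step` shape mirrors the Gym/Gymnasium convention mentioned in Further Reading, so an agent written against this sketch transfers easily to real environments.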
4. Q-Learning Algorithm
Goal: Learn optimal Q-function through experience.
Update Rule:

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]

Where:
- α = learning rate
- γ = discount factor
- r = reward
- s' = next state
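The update rule translates almost directly to code. A tabular sketch, where the `defaultdict` Q-table and the hyperparameter values are illustrative choices:

```python
from collections import defaultdict

# One tabular Q-learning step, following the update rule above:
#   Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)  # max_a' Q(s', a')
    td_target = r + gamma * best_next                   # r + gamma * max ...
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])        # move toward the target

# All Q-values start at 0, so one update with r = -1 moves Q(s, a) by alpha * (-1):
Q = defaultdict(float)
q_update(Q, s=(0, 0), a="right", r=-1, s_next=(0, 1),
         actions=["up", "down", "left", "right"])
print(Q[((0, 0), "right")])  # -0.1
```

The bracketed quantity is the temporal-difference error: how far the current estimate is from the bootstrapped target.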
5. Exploration vs. Exploitation
- Exploration: try new actions to discover better strategies
- Exploitation: use known best actions to maximize reward
ε-greedy strategy:
- With probability ε: explore (random action)
- With probability 1-ε: exploit (best known action)
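The ε-greedy rule fits in a few lines; a sketch, with the function name and Q-values chosen for illustration:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action from a dict mapping action -> estimated Q-value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))  # explore: uniform random action
    return max(q_values, key=q_values.get)    # exploit: best known action

q = {"left": 0.2, "right": 0.8}
print(epsilon_greedy(q, epsilon=0.0))  # right  (epsilon = 0 is purely greedy)
```

In practice ε is often decayed over training: explore heavily early on, then exploit the learned Q-values.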
Key Takeaways
- RL learns through trial and error using rewards
- Agent interacts with environment to maximize cumulative reward
- Q-learning learns optimal action-values through temporal difference updates
- Exploration vs. exploitation: must balance trying new actions against using the best known actions
- Credit assignment: which actions led to the eventual reward or penalty?
- Applications: games, robotics, autonomous systems, optimization
What's Next?
Next lesson: MLOps Fundamentals – automating ML workflows, CI/CD pipelines, and production infrastructure!
Further Reading
Interactive Visualizations
- Andrej Karpathy – REINFORCEjs Gridworld – value iteration visualized cell-by-cell. Click "Run Value Iteration" and watch the policy emerge.
- REINFORCEjs – Q-Learning Demo – TD-learning in your browser, with an explore-vs-exploit slider.
- Distill – Why Momentum Really Works – a neighboring topic; many deep RL optimizers use momentum.
- OpenAI Gym Atari Demos – the canonical RL benchmark; Gymnasium is the actively maintained successor.
Video Courses
- David Silver – RL Course (DeepMind / UCL) – 10 lectures, the canonical free RL course.
- Hugging Face – Deep RL Course – modern, free, hands-on with Stable-Baselines3.
- Spinning Up in Deep RL – OpenAI's curated path through modern policy-gradient methods.
Papers & Articles
- Playing Atari with Deep Reinforcement Learning – Mnih et al., DeepMind 2013. The DQN paper that started modern deep RL.
- Proximal Policy Optimization Algorithms – Schulman et al., OpenAI 2017. The default modern policy-gradient method.
- Mastering the Game of Go with Deep Neural Networks and Tree Search – Silver et al., Nature 2016. AlphaGo.
- Deep Reinforcement Learning from Human Preferences – Christiano et al., 2017. The technical foundation of RLHF in modern LLMs.
Documentation & Books
- Book: Reinforcement Learning: An Introduction (2nd ed.) – Sutton & Barto (free PDF). The textbook.
- Gymnasium – the modern, maintained fork of OpenAI Gym.
- Stable-Baselines3 – well-tested implementations of DQN, PPO, SAC, and A2C in PyTorch.
- CleanRL ā single-file, high-quality reference implementations of every major RL algorithm.