Reinforcement Learning Introduction: Q-Learning & Agents

Introduction: Learning Through Interaction

Unlike supervised learning (where we have labels) or unsupervised learning (where we find patterns), in reinforcement learning an agent learns by trial and error, guided by rewards and penalties.

Real-world applications:

  • Game AI (AlphaGo, Chess, Atari games)
  • Robotics (walking, grasping, manipulation)
  • Autonomous vehicles
  • Recommendation systems
  • Resource optimization

Key Insight: An agent learns to make sequential decisions by maximizing cumulative reward through interaction with an environment.

Learning Objectives

  • Understand the RL framework (agent, environment, reward)
  • Master key concepts (state, action, policy, value function)
  • Learn Q-learning algorithm
  • Apply exploration vs. exploitation strategies
  • Implement simple RL agents
  • Understand credit assignment problem

1. The RL Framework

Core Components

  • Agent: The learner / decision maker
  • Environment: The world the agent interacts with
  • State ($s$): The current situation
  • Action ($a$): What the agent can do
  • Reward ($r$): The feedback signal

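To make the loop concrete, here is a minimal sketch of the agent-environment interaction cycle. The toy one-dimensional environment, its +1 reward at position 5, and the random placeholder policy are assumptions for illustration, not the lesson's embedded demo.

```python
import random

def step(state, action):
    """Toy environment: walk left/right along positions 0-5; reward +1 for reaching 5."""
    next_state = max(0, min(5, state + (1 if action == "right" else -1)))
    reward = 1.0 if next_state == 5 else 0.0
    done = next_state == 5
    return next_state, reward, done

state = 0
total_reward = 0.0
for t in range(20):
    action = random.choice(["left", "right"])   # placeholder policy: act at random
    state, reward, done = step(state, action)   # environment responds with s', r
    total_reward += reward
    if done:
        break

print(f"Episode ended after {t + 1} steps, cumulative reward = {total_reward}")
```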


2. Key Concepts

Policy $\pi(a|s)$

A policy defines the agent's behavior: the probability of taking action $a$ in state $s$.

  • Deterministic: $a = \pi(s)$
  • Stochastic: $a \sim \pi(\cdot|s)$
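
As a quick illustration, a stochastic policy can be stored as a table of per-state action probabilities and sampled from; the state names, actions, and probabilities below are made up for this sketch.

```python
import random

# Hypothetical stochastic policy: each state maps to action probabilities pi(a|s).
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state):
    """Sample an action a ~ pi(.|s) from the per-state probability table."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "s0"))  # usually "right", sometimes "left"
```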

Value Function $V(s)$

Expected cumulative reward from state $s$:

$$V(s) = \mathbb{E}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid s_t = s]$$

Where $\gamma$ is the discount factor ($0 < \gamma < 1$).

Q-Function $Q(s, a)$

Expected cumulative reward from taking action $a$ in state $s$:

$$Q(s, a) = \mathbb{E}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid s_t = s, a_t = a]$$
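
To see what the discount factor does, here is a small sketch that computes the discounted return $R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots$ for a short episode; the reward values and $\gamma = 0.9$ are made up for illustration.

```python
# Discounted return for a toy reward sequence (illustrative values).
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # hypothetical R_t, R_{t+1}, ...

# Sum gamma^k * R_{t+k}: rewards further in the future count for less.
discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
print(discounted_return)   # 0.81 * 1 + 0.6561 * 5 = 4.0905
```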

3. Grid World Example

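The interactive grid-world demo is not reproduced here, but a minimal version might look like the sketch below. The 4x4 layout, the goal in the bottom-right corner with reward +1, and the small step penalty are assumptions for illustration.

```python
# Minimal grid-world sketch (assumed layout: 4x4 grid, start at (0, 0),
# goal at (3, 3) with reward +1, small step penalty elsewhere).
class GridWorld:
    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.actions = ["up", "down", "left", "right"]
        self.reset()

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        r, c = self.state
        if action == "up":
            r = max(0, r - 1)
        elif action == "down":
            r = min(self.size - 1, r + 1)
        elif action == "left":
            c = max(0, c - 1)
        elif action == "right":
            c = min(self.size - 1, c + 1)
        self.state = (r, c)
        done = self.state == self.goal
        reward = 1.0 if done else -0.01   # step penalty encourages short paths
        return self.state, reward, done
```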


4. Q-Learning Algorithm

Goal: Learn optimal Q-function through experience.

Update Rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:

  • $\alpha$ = learning rate
  • $\gamma$ = discount factor
  • $r$ = reward
  • $s'$ = next state

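Putting the update rule into code, here is a sketch of tabular Q-learning running on the GridWorld class from Section 3. The hyperparameters ($\alpha = 0.1$, $\gamma = 0.9$, $\varepsilon = 0.1$) and the episode count are illustrative, not tuned values from the lesson.

```python
import random
from collections import defaultdict

env = GridWorld()                     # GridWorld sketch from Section 3
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = defaultdict(float)                # Q[(state, action)] defaults to 0.0

for episode in range(500):
    state = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection (see Section 5)
        if random.random() < epsilon:
            action = random.choice(env.actions)
        else:
            action = max(env.actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action)

        # Temporal-difference update toward r + gamma * max_a' Q(s', a')
        best_next = max(Q[(next_state, a)] for a in env.actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state
```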


5. Exploration vs. Exploitation

  • Exploration: Try new actions to discover better strategies
  • Exploitation: Use known best actions to maximize reward

ε-greedy strategy:

  • With probability ε: explore (random action)
  • With probability 1-ε: exploit (best known action)

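A sketch of ε-greedy action selection is below; the decay schedule (multiplying ε by 0.995 each episode down to a floor of 0.05) is an assumed convention for illustration, not something prescribed by the lesson.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])     # exploit

def decay_epsilon(epsilon, decay=0.995, min_epsilon=0.05):
    """Anneal epsilon so the agent explores early and exploits more later on."""
    return max(min_epsilon, epsilon * decay)
```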


Key Takeaways

  • RL learns through trial and error using rewards
  • Agent interacts with environment to maximize cumulative reward
  • Q-learning learns optimal action-values through temporal difference updates
  • Exploration vs. exploitation: Must balance trying new actions vs. using best known actions
  • Credit assignment: Which actions led to eventual reward/penalty?
  • Applications: Games, robotics, autonomous systems, optimization


What's Next?

Next lesson: MLOps Fundamentals – automating ML workflows, CI/CD pipelines, and production infrastructure!