Reinforcement Learning Introduction: Q-Learning & Agents

Introduction: Learning Through Interaction

Unlike supervised learning (where we have labels) or unsupervised learning (where we find patterns), reinforcement learning learns by trial and error through rewards and penalties.

Real-world applications:

  • Game AI (AlphaGo, Chess, Atari games)
  • Robotics (walking, grasping, manipulation)
  • Autonomous vehicles
  • Recommendation systems
  • Resource optimization

Key Insight: An agent learns to make sequential decisions by maximizing cumulative reward through interaction with an environment.

Learning Objectives

  • Understand the RL framework (agent, environment, reward)
  • Master key concepts (state, action, policy, value function)
  • Learn Q-learning algorithm
  • Apply exploration vs. exploitation strategies
  • Implement simple RL agents
  • Understand credit assignment problem

1. The RL Framework

šŸ•¹ļø Watch RL converge live before learning the math: Karpathy's REINFORCEjs Gridworld shows value iteration, then TD-learning in a separate demo. Click "Run Value Iteration" — within seconds, the optimal policy emerges from random exploration. Best 60 seconds you can spend before reading the rest of this lesson.

Core Components

  • Agent: the learner and decision maker
  • Environment: the world the agent interacts with
  • State $s$: the current situation
  • Action $a$: what the agent can do
  • Reward $r$: the feedback signal

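The agent–environment loop above can be sketched in a few lines of Python. `LineWorld` is a made-up toy environment (not part of the lesson): the agent starts at position 0 on a line and receives +1 for reaching position 3, with a small step penalty.

```python
import random

class LineWorld:
    """Hypothetical toy environment: reach position 3 starting from 0."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # action is +1 (move right) or -1 (move left); position floors at 0
        self.state = max(0, self.state + action)
        reward = 1.0 if self.state == 3 else -0.1  # feedback signal
        done = self.state == 3                     # episode ends at the goal
        return self.state, reward, done

env = LineWorld()
state, total_reward, done = 0, 0.0, False
for _ in range(10_000):                # cap steps in case the walk wanders
    action = random.choice([-1, 1])    # a purely random policy for now
    state, reward, done = env.step(action)  # environment responds
    total_reward += reward                  # accumulate cumulative reward
    if done:
        break
```

Even with a random policy the agent eventually stumbles into the goal; learning algorithms like Q-learning (below) replace the random choice with one informed by experience.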

2. Key Concepts

Policy $\pi(a|s)$

A policy defines the agent's behavior: the probability of taking action $a$ in state $s$.

  • Deterministic: $a = \pi(s)$
  • Stochastic: $a \sim \pi(\cdot|s)$
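Both kinds of policy are easy to represent concretely. This sketch uses a hypothetical two-state, two-action example (the state and action names are illustrative, not from the lesson):

```python
import random

# Deterministic policy: a = pi(s), a plain lookup table
deterministic_policy = {"A": "right", "B": "left"}

# Stochastic policy: pi(a|s) as a probability distribution per state
stochastic_policy = {
    "A": {"left": 0.2, "right": 0.8},
    "B": {"left": 0.9, "right": 0.1},
}

def sample_action(policy, state):
    """Draw an action a ~ pi(.|s) from a stochastic policy."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

a = deterministic_policy["A"]               # always "right"
b = sample_action(stochastic_policy, "A")   # "right" about 80% of the time
```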

Value Function $V(s)$

Expected cumulative reward starting from state $s$:

$$V(s) = \mathbb{E}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid s_t = s]$$

where $\gamma$ is the discount factor ($0 < \gamma < 1$).
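The discounted sum inside the expectation is simple to compute for a concrete reward sequence; the helper below is an illustrative sketch, not library code:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^k * R_{t+k} over an observed reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Three steps of reward 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Note how $\gamma$ trades off immediate against future reward: smaller values make the agent short-sighted, values near 1 make distant rewards count almost as much as immediate ones.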

Q-Function $Q(s, a)$

Expected cumulative reward from taking action $a$ in state $s$:

$$Q(s, a) = \mathbb{E}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid s_t = s, a_t = a]$$

3. Grid World Example


4. Q-Learning Algorithm

Goal: Learn optimal Q-function through experience.

Update Rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:

  • $\alpha$ = learning rate
  • $\gamma$ = discount factor
  • $r$ = reward
  • $s'$ = next state
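In the tabular setting the update rule is a one-liner over a dictionary of Q-values. This is a minimal sketch (the states, actions, and hyperparameters are assumptions for illustration):

```python
from collections import defaultdict

# Q-table: Q[(state, action)] -> estimated action-value, initialized to 0
Q = defaultdict(float)
ACTIONS = ["left", "right"]
alpha, gamma = 0.1, 0.9  # learning rate, discount factor

def q_update(s, a, r, s_next):
    """One temporal-difference update of the Q-learning rule."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)  # max_a' Q(s', a')
    td_target = r + gamma * best_next                   # r + gamma * max ...
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])        # move toward target

# Apply one experienced transition (s, a, r, s'):
q_update("A", "right", 1.0, "B")
# Q[("A", "right")] moved from 0 toward the target 1.0 by alpha: now 0.1
```

Repeated over many transitions, these small corrections propagate reward information backward through the state space, which is exactly how Q-learning addresses the credit assignment problem.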

5. Exploration vs. Exploitation

  • Exploration: try new actions to discover better strategies
  • Exploitation: use known best actions to maximize reward

ε-greedy strategy:

  • With probability ε: explore (random action)
  • With probability 1-ε: exploit (best known action)
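The ε-greedy rule above fits in a few lines. A sketch, assuming a Q-table keyed by `(state, action)` pairs as in the Q-learning example (the concrete Q-values here are invented for illustration):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon take a random action (explore);
    otherwise take the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit

# Hypothetical learned Q-values: "right" looks better in state "A"
Q = {("A", "left"): 0.2, ("A", "right"): 0.7}
a = epsilon_greedy(Q, "A", ["left", "right"], epsilon=0.0)  # pure exploitation
```

A common refinement is to decay ε over training: explore heavily early on, then exploit more as the Q-estimates become trustworthy.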

Key Takeaways

āœ… RL learns through trial and error using rewards

āœ… Agent interacts with environment to maximize cumulative reward

āœ… Q-learning learns optimal action-values through temporal difference updates

āœ… Exploration vs exploitation: Must balance trying new actions vs. using best known actions

āœ… Credit assignment: Which actions led to eventual reward/penalty?

āœ… Applications: Games, robotics, autonomous systems, optimization


What's Next?

Next lesson: MLOps Fundamentals – automating ML workflows, CI/CD pipelines, and production infrastructure!


Further Reading

Documentation & Books

  • Book: Reinforcement Learning: An Introduction (2nd ed.) — Sutton & Barto (free PDF). The textbook.
  • Gymnasium — modern, maintained fork of OpenAI Gym.
  • Stable-Baselines3 — well-tested implementations of DQN, PPO, SAC, A2C in PyTorch.
  • CleanRL — single-file, high-quality reference implementations of every major RL algorithm.