LESSON 08 · ADVANCED · 60 MIN · ◆ 1 INSTRUMENT

Reinforcement Learning Introduction: Q-Learning & Agents

Introduction to reinforcement learning: Markov Decision Processes, Q-Learning, and simple agent environments. Foundations for AI agents.

Introduction: Learning Through Interaction

Unlike supervised learning (where we have labels) or unsupervised learning (where we find patterns), reinforcement learning learns by trial and error through rewards and penalties.

Real-world applications:

  • Game AI (AlphaGo, chess, Atari games)
  • Robotics (walking, grasping, manipulation)
  • Autonomous vehicles
  • Recommendation systems
  • Resource optimization

Key Insight: An agent learns to make sequential decisions by maximizing cumulative reward through interaction with an environment.

Learning Objectives

  • Understand the RL framework (agent, environment, reward)
  • Master key concepts (state, action, policy, value function)
  • Learn the Q-learning algorithm
  • Apply exploration vs. exploitation strategies
  • Implement simple RL agents
  • Understand the credit assignment problem

1. The RL Framework

SEE

🕹️ Watch RL converge live before learning the math: Karpathy's REINFORCEjs Gridworld shows value iteration, then TD-learning in a separate demo. Click "Run Value Iteration" and within seconds the optimal policy emerges from repeated value updates over the grid (value iteration sweeps a known model of the environment; learning from trial and error is what the TD-learning demo shows). It is the best 60 seconds you can spend before reading the rest of this lesson.

Core Components

  • Agent: The learner/decision maker
  • Environment: The world the agent interacts with
  • State ($s$): Current situation
  • Action ($a$): What the agent can do
  • Reward ($r$): Feedback signal
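
The interaction between these components can be sketched as a simple loop. This is a minimal illustration, not the lesson's original code: `CoinFlipEnv`, `RandomAgent`, and `run_episode` are hypothetical names, and the `reset`/`step` interface is an assumption loosely modeled on common RL environment APIs.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip, five times per episode."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.steps_left = 5

    def reset(self):
        self.steps_left = 5
        return self.steps_left  # state: number of guesses remaining

    def step(self, action):
        coin = self.rng.choice([0, 1])
        reward = 1.0 if action == coin else -1.0  # feedback signal
        self.steps_left -= 1
        done = self.steps_left == 0
        return self.steps_left, reward, done


class RandomAgent:
    """Hypothetical agent that ignores the state and guesses at random."""

    def __init__(self, seed=1):
        self.rng = random.Random(seed)

    def act(self, state):
        return self.rng.choice([0, 1])


def run_episode(env, agent):
    """Run one episode: observe state, act, receive reward, repeat."""
    state = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = agent.act(state)               # agent chooses an action
        state, reward, done = env.step(action)  # environment responds
        total_reward += reward
        steps += 1
    return total_reward, steps


total_reward, steps = run_episode(CoinFlipEnv(), RandomAgent())
print(steps, total_reward)
```

A learning agent would replace `RandomAgent.act` with a rule that improves as rewards come in; the loop itself stays the same.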

2. Key Concepts

Policy $\pi(a|s)$

A policy defines the agent's behavior: the probability of taking action $a$ in state $s$.

  • Deterministic: $a = \pi(s)$
  • Stochastic: $a \sim \pi(\cdot|s)$
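
The difference between the two policy types can be sketched in a few lines. The state names, action names, and probabilities below are made up for illustration only.

```python
import random

# Deterministic policy: a lookup from state to a single action, a = pi(s).
DETERMINISTIC = {"start": "right", "middle": "up"}

# Stochastic policy: a probability distribution over actions per state, pi(.|s).
STOCHASTIC = {
    "start": {"up": 0.1, "down": 0.1, "left": 0.1, "right": 0.7},
    "middle": {"up": 0.7, "down": 0.1, "left": 0.1, "right": 0.1},
}


def deterministic_policy(state):
    """Always returns the same action for a given state."""
    return DETERMINISTIC[state]


def sample_action(state, rng):
    """Draws an action a ~ pi(.|s) from the stochastic policy."""
    probs = STOCHASTIC[state]
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


rng = random.Random(0)
print(deterministic_policy("start"))
print([sample_action("start", rng) for _ in range(5)])
```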

Value Function $V(s)$

Expected discounted cumulative reward when starting in state $s$ and following the policy thereafter:

$$V(s) = \mathbb{E}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid s_t = s]$$

Where $\gamma$ is the discount factor ($0 < \gamma < 1$); smaller values weight near-term rewards more heavily.
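
To make the discounted sum concrete, here is a small sketch that computes it for one sampled reward sequence. The value function is the expectation of this quantity over many such sequences; `discounted_return` is a hypothetical helper name and the rewards are invented.

```python
def discounted_return(rewards, gamma):
    """Compute R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ... for one trajectory."""
    g = 0.0
    # Work backwards so each step is just: g = r + gamma * g
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With `gamma=0.5`, a reward two steps away counts for only a quarter of its face value.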

Q-Function $Q(s, a)$

Expected discounted cumulative reward from taking action $a$ in state $s$ and following the policy thereafter:

$$Q(s, a) = \mathbb{E}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid s_t = s, a_t = a]$$
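
The two functions are linked: averaging the Q-values of a state under the policy's action probabilities gives that state's value, $V(s) = \sum_a \pi(a|s)\,Q(s, a)$. A minimal sketch with invented numbers for a single state:

```python
# Hypothetical Q-values and policy probabilities for one state.
q_values = {"left": 1.0, "right": 3.0}
policy = {"left": 0.25, "right": 0.75}

# V(s) = sum over actions of pi(a|s) * Q(s, a)
state_value = sum(policy[a] * q_values[a] for a in q_values)

# Acting greedily means picking the action with the highest Q-value.
greedy_action = max(q_values, key=q_values.get)

print(state_value)    # 0.25 * 1.0 + 0.75 * 3.0 = 2.5
print(greedy_action)  # right
```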

3. Grid World Example
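
A grid world can be sketched as a small class with `reset` and `step` methods. This is an illustrative assumption rather than the lesson's original environment: the 4x4 size, the start and goal cells, and the reward values (+1 at the goal, -0.01 per other step) are all chosen here for the example.

```python
class GridWorld:
    """Hypothetical grid: start at the top-left, goal at the bottom-right."""

    # action id -> (row change, column change)
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        d_row, d_col = self.ACTIONS[action]
        # Clamp to the grid so walking into a wall leaves the agent in place.
        row = min(max(self.pos[0] + d_row, 0), self.size - 1)
        col = min(max(self.pos[1] + d_col, 0), self.size - 1)
        self.pos = (row, col)
        done = self.pos == self.goal
        reward = 1.0 if done else -0.01  # small step cost favors short paths
        return self.pos, reward, done


env = GridWorld()
state = env.reset()
for action in [3, 3, 3, 1, 1, 1]:  # right x3, then down x3
    state, reward, done = env.step(action)
print(state, reward, done)  # (3, 3) 1.0 True
```

The state here is the agent's (row, column) position, the actions are the four moves, and the reward signal alone tells the agent that reaching the goal quickly is desirable.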

4. Q-Learning Algorithm

Goal: Learn the optimal Q-function through experience, without needing a model of the environment.

Update Rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$$

Where:

  • $\alpha$ = learning rate
  • $\gamma$ = discount factor
  • $r$ = reward received after taking action $a$ in state $s$
  • $s'$ = next state
  • $a'$ = candidate action in the next state
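
The update rule can be sketched as a tabular training loop. Everything beyond the rule itself is an assumption made for this example: the five-cell corridor environment, the hyperparameter values, and the ε-greedy action selection with random tie-breaking.

```python
import random
from collections import defaultdict

N_STATES = 5      # corridor cells 0..4; cell 4 is the goal (hypothetical environment)
ACTIONS = [0, 1]  # 0 = left, 1 = right


def step(state, action):
    """Move one cell; reward 1.0 only when the goal cell is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # Q(s, a), all entries start at 0
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection, breaking ties at random.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
            # Goal-state entries are never updated, so they stay at 0.
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q = q_learning()
greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)  # expected: [1, 1, 1, 1] (always move right)
```

After training, the learned value of moving right from cell 3 approaches 1.0, and each cell further from the goal is discounted by another factor of `gamma`.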

5. Exploration vs. Exploitation

  • Exploration: Try new actions to discover better strategies
  • Exploitation: Use known best actions to maximize reward

ε-greedy strategy:

  • With probability ε: explore (random action)
  • With probability 1-ε: exploit (best known action)
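
The two cases above can be sketched as a single function. The name `epsilon_greedy` and the list-of-Q-values representation are assumptions for this example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values for the current state."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform random action
    # exploit: index of the highest Q-value
    return max(range(len(q_values)), key=q_values.__getitem__)


q_values = [0.2, 0.8, 0.5]
print(epsilon_greedy(q_values, epsilon=0.0))  # 1 (never explores)

rng = random.Random(0)
picks = [epsilon_greedy(q_values, epsilon=0.2, rng=rng) for _ in range(10_000)]
print(picks.count(1) / len(picks))  # roughly 0.8 + 0.2 / 3
```

Note that a random exploratory action can still land on the best action, which is why the best action is chosen slightly more often than `1 - epsilon` of the time.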

Key Takeaways

  • RL learns through trial and error using rewards
  • The agent interacts with the environment to maximize cumulative reward
  • Q-learning learns optimal action-values through temporal difference updates
  • Exploration vs. exploitation: Must balance trying new actions vs. using best known actions
  • Credit assignment: Which actions led to eventual reward/penalty?
  • Applications: Games, robotics, autonomous systems, optimization


What's Next?

Next lesson: MLOps Fundamentals – automating ML workflows, CI/CD pipelines, and production infrastructure!


Further Reading

Documentation & Books

  • Book: Reinforcement Learning: An Introduction (2nd ed.) by Sutton & Barto (free PDF). The standard textbook.
  • Gymnasium: modern, maintained fork of OpenAI Gym.
  • Stable-Baselines3: well-tested implementations of DQN, PPO, SAC, A2C in PyTorch.
  • CleanRL: single-file, high-quality reference implementations of many widely used RL algorithms.