Reinforcement Learning Introduction: Q-Learning & Agents
Introduction to reinforcement learning: Markov Decision Processes, Q-Learning, and simple agent environments. Foundations for AI agents.
Introduction: Learning Through Interaction
Unlike supervised learning (which learns from labeled examples) or unsupervised learning (which finds patterns in unlabeled data), reinforcement learning trains an agent by trial and error: the agent acts, receives rewards or penalties, and adjusts its behavior accordingly.
Real-world applications:
- Game AI (AlphaGo, Chess, Atari games)
- Robotics (walking, grasping, manipulation)
- Autonomous vehicles
- Recommendation systems
- Resource optimization
Key Insight: An agent learns to make sequential decisions by maximizing cumulative reward through interaction with an environment.
Learning Objectives
- Understand the RL framework (agent, environment, reward)
- Master key concepts (state, action, policy, value function)
- Learn the Q-learning algorithm
- Apply exploration vs. exploitation strategies
- Implement simple RL agents
- Understand the credit assignment problem
1. The RL Framework
TIP▶ Try this first. Open the RL Arena below — an 8×8 gridworld running real tabular Q-learning. Train it and watch the agent go from random flailing to a clean route to the goal as the policy arrows settle; then edit a wall or move the lava and retrain to see the policy adapt. Come back to the framework once you've watched reward turn into behaviour.
Want another angle? Karpathy's REINFORCEjs Gridworld animates value iteration, and his TD demo shows TD-learning — but the Arena above is the one to actually poke.
Core Components
- Agent: The learner/decision maker
- Environment: The world the agent interacts with
- State $s$: Current situation
- Action $a$: What the agent can do
- Reward $r$: Feedback signal
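These components meet in a simple loop: the agent observes a state, picks an action, and the environment answers with a reward and a new state. The sketch below illustrates that loop with a made-up toy environment; `EchoEnv`, `run_episode`, and the `reset`/`step` method names are assumptions chosen for illustration, not part of this lesson's Arena.

```python
import random


class EchoEnv:
    """Hypothetical toy environment: the agent is rewarded for echoing the observed bit."""

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0
        self.state = 0

    def reset(self):
        self.steps = 0
        self.state = self.rng.randint(0, 1)   # state: the bit the agent observes
        return self.state

    def step(self, action):
        reward = 1.0 if action == self.state else -1.0   # reward: feedback signal
        self.steps += 1
        self.state = self.rng.randint(0, 1)              # environment moves to a new state
        done = self.steps >= self.episode_length
        return self.state, reward, done


def run_episode(env, choose_action):
    """Run one episode of the agent-environment loop and return the total reward."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = choose_action(state)            # agent picks an action from the state
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
    return total_reward


echo_return = run_episode(EchoEnv(), lambda s: s)         # always matches the bit
wrong_return = run_episode(EchoEnv(), lambda s: 1 - s)    # always mismatches the bit
print(echo_return, wrong_return)
```

The agent here is just a function from state to action; a learning agent would also update itself from each reward inside the loop.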
2. Key Concepts
Policy
A policy $\pi$ defines the agent's behavior: $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$.
- Deterministic: $a = \pi(s)$
- Stochastic: $a \sim \pi(\cdot \mid s)$
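To make the two kinds of policy concrete, here is a minimal sketch in which a deterministic policy is a lookup table and a stochastic policy is a table of action probabilities. The state names, action names, and probabilities are invented for illustration.

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {"start": "right", "middle": "up"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    "start": {"up": 0.1, "down": 0.1, "left": 0.1, "right": 0.7},
    "middle": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25},
}


def act_deterministic(state):
    """Return the single action the policy prescribes for this state."""
    return deterministic_policy[state]


def act_stochastic(state, rng):
    """Sample an action according to the policy's probabilities for this state."""
    probs = stochastic_policy[state]
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
samples = [act_stochastic("start", rng) for _ in range(1000)]
right_fraction = samples.count("right") / len(samples)
print(act_deterministic("start"), right_fraction)
```

Over many samples, the fraction of "right" choices should land near the 0.7 written in the table, while the deterministic policy returns the same action every time.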
Value Function
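Expected cumulative reward from state $s$ when following policy $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$
Where $r_t$ is the reward received at step $t$ and $\gamma$ is the discount factor ($0 < \gamma < 1$).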
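The effect of the discount factor is easiest to see on a single episode. The sketch below computes the discounted sum of one reward sequence; the helper name `discounted_return` and the sample rewards are assumptions for illustration. The value function is the expectation of this quantity over many episodes, which this snippet does not estimate.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode's reward sequence."""
    total = 0.0
    # Work backwards so each step needs one multiply: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


rewards = [0.0, 0.0, 1.0]   # reward arrives only on the third step
patient = discounted_return(rewards, gamma=0.9)
impatient = discounted_return(rewards, gamma=0.5)
print(patient, impatient)
```

With a reward two steps away, a discount factor of 0.9 values it at roughly 0.81 and a discount factor of 0.5 at 0.25: the smaller the discount factor, the less the agent cares about delayed reward.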
Q-Function
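Expected cumulative reward from taking action $a$ in state $s$ and following policy $\pi$ afterwards:
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_0 = a\right]$$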
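The state-value and action-value functions are two views of the same quantity. In standard notation (assumed here: $\pi(a \mid s)$ for the policy's action probabilities, and a star for the optimal functions), averaging the action-values over the policy's choices gives the state-value, and under an optimal policy the average becomes a maximum:

```latex
V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)
\qquad
V^*(s) = \max_{a} Q^*(s, a)
```

This is why learning a good Q-function is enough to act well: picking the action with the largest Q-value in each state needs no model of the environment.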
3. Grid World Example
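As a minimal sketch of a grid world, the class below models a 4×4 grid with a goal cell and a lava cell. The grid size, cell positions, reward values, and the `reset`/`step` interface are all assumptions made for illustration; they are not the configuration of the 8×8 Arena mentioned earlier.

```python
class GridWorld:
    """Hypothetical 4x4 grid: start top-left, goal bottom-right, one lava cell."""

    # Action index -> (row change, column change): up, down, left, right.
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self, size=4, goal=(3, 3), lava=(1, 1)):
        self.size, self.goal, self.lava = size, goal, lava
        self.pos = (0, 0)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        d_row, d_col = self.ACTIONS[action]
        # Clamp to the grid so that moving into an edge leaves the agent in place.
        row = min(max(self.pos[0] + d_row, 0), self.size - 1)
        col = min(max(self.pos[1] + d_col, 0), self.size - 1)
        self.pos = (row, col)
        if self.pos == self.goal:
            return self.pos, 1.0, True     # reached the goal: reward, episode ends
        if self.pos == self.lava:
            return self.pos, -1.0, True    # stepped into lava: penalty, episode ends
        return self.pos, -0.01, False      # small step cost favors short routes


env = GridWorld()
env.reset()
trajectory = [env.step(a) for a in [3, 3, 3, 1, 1, 1]]   # right x3, then down x3
print(trajectory[-1])
```

Each `step` call returns the next state, the reward, and whether the episode has ended, which is all a tabular agent needs to learn from.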
4. Q-Learning Algorithm
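Goal: Learn the optimal Q-function $Q^*$ through experience.
Update Rule:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Where:
- $\alpha$ = learning rate
- $\gamma$ = discount factor
- $r$ = reward
- $s'$ = next state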
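The update can be sketched as tabular Q-learning on a tiny problem. The example below assumes a hypothetical five-cell corridor where only the rightmost cell gives a reward; the environment, the hyperparameter values, and the function names are illustrative choices, not taken from this lesson's Arena.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)            # corridor cells 0..4; actions: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1    # illustrative hyperparameters, not tuned


def step(state, action):
    """Hypothetical corridor: reaching the right end gives +1 and ends the episode."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_update(q, state, action, reward, next_state, done):
    """One Q-learning step: move Q(s, a) toward r + gamma * max over a' of Q(s', a')."""
    best_next = 0.0 if done else max(q[next_state])   # a terminal state has no future value
    td_target = reward + GAMMA * best_next
    q[state][action] += ALPHA * (td_target - q[state][action])


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]         # Q-table: one row per state
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < EPSILON or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)          # explore, or break ties randomly
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            q_update(q, state, action, reward, next_state, done)
            state = next_state
    return q


q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

After training, the greedy policy read off the table should choose "right" (action 1) in every non-terminal cell, with Q-values shrinking by roughly a factor of the discount at each step away from the goal.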
5. Exploration vs. Exploitation
- Exploration: Try new actions to discover better strategies
- Exploitation: Use known best actions to maximize reward
ε-greedy strategy:
- With probability ε: explore (random action)
- With probability 1-ε: exploit (best known action)
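The ε-greedy rule fits in a few lines. The sketch below assumes the agent's estimates for one state are held in a plain list of Q-values; the function names are illustrative. The decay schedule is an optional refinement that this lesson does not describe, included only to show how exploration is often reduced as estimates improve.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action index with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Hypothetical schedule: shrink epsilon geometrically toward a floor."""
    return max(end, start * decay ** episode)


rng = random.Random(0)
q_values = [0.1, 0.5, 0.2]   # made-up estimates for three actions
choices = [epsilon_greedy(q_values, epsilon=0.1, rng=rng) for _ in range(1000)]
best_fraction = choices.count(1) / len(choices)
print(best_fraction)
```

With ε = 0.1 and three actions, the best action is expected to be chosen about 93% of the time: 90% from exploiting, plus a third of the 10% spent exploring.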
Key Takeaways
✅ RL learns through trial and error using rewards
✅ Agent interacts with environment to maximize cumulative reward
✅ Q-learning learns optimal action-values through temporal difference updates
✅ Exploration vs exploitation: Must balance trying new actions vs. using best known actions
✅ Credit assignment: Which actions led to eventual reward/penalty?
✅ Applications: Games, robotics, autonomous systems, optimization
What's Next?
Next lesson: MLOps Fundamentals – automating ML workflows, CI/CD pipelines, and production infrastructure!
Further Reading
Interactive Visualizations
- Andrej Karpathy — REINFORCEjs Gridworld — value iteration visualized cell-by-cell. Click "Run Value Iteration" and watch the policy emerge.
- REINFORCEjs — Q-Learning Demo — TD-learning in your browser, with explore-vs-exploit slider.
- Distill — Why Momentum Really Works — neighbor topic; many RL optimizers use momentum.
- OpenAI Gym Atari Demos — the canonical RL benchmark; Gymnasium is the actively maintained successor.
Video Courses
- David Silver — RL Course (DeepMind / UCL) — 10 lectures, the canonical free RL course.
- Hugging Face — Deep RL Course — modern, free, hands-on with Stable-Baselines3.
- Spinning Up in Deep RL — OpenAI's curated path through modern policy-gradient methods.
Papers & Articles
- Playing Atari with Deep Reinforcement Learning — Mnih et al., DeepMind 2013. The DQN paper that started modern RL.
- Proximal Policy Optimization Algorithms — Schulman et al., OpenAI 2017. The default modern policy-gradient method.
- Mastering the Game of Go with Deep Neural Networks and Tree Search — Silver et al., Nature 2016. AlphaGo.
- Deep Reinforcement Learning from Human Preferences — Christiano et al., 2017. The technical foundation of RLHF in modern LLMs.
Documentation & Books
- Book: Reinforcement Learning: An Introduction (2nd ed.) — Sutton & Barto (free PDF). The textbook.
- Gymnasium — modern, maintained fork of OpenAI Gym.
- Stable-Baselines3 — well-tested implementations of DQN, PPO, SAC, A2C in PyTorch.
- CleanRL — single-file, high-quality reference implementations of every major RL algorithm.