Introduction: Learning Through Interaction
Unlike supervised learning (where we learn from labeled examples) or unsupervised learning (where we find structure in unlabeled data), a reinforcement learning agent learns by trial and error, guided by rewards and penalties.
Real-world applications:
- Game AI (AlphaGo, Chess, Atari games)
- Robotics (walking, grasping, manipulation)
- Autonomous vehicles
- Recommendation systems
- Resource optimization
Key Insight: An agent learns to make sequential decisions by maximizing cumulative reward through interaction with an environment.
Learning Objectives
- Understand the RL framework (agent, environment, reward)
- Master key concepts (state, action, policy, value function)
- Learn Q-learning algorithm
- Apply exploration vs. exploitation strategies
- Implement simple RL agents
- Understand credit assignment problem
1. The RL Framework
🕹️ Watch RL converge live before learning the math: Karpathy's REINFORCEjs Gridworld shows value iteration, then TD-learning in a separate demo. Click "Run Value Iteration" — within seconds, the optimal policy emerges from random exploration. Best 60 seconds you can spend before reading the rest of this lesson.
Core Components
- Agent: the learner/decision maker
- Environment: the world the agent interacts with
- State (s): the current situation
- Action (a): what the agent can do
- Reward (r): the feedback signal from the environment
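The components above meet in the interaction loop: the agent observes a state, picks an action, and the environment returns a reward and the next state. A minimal sketch with a hypothetical one-step toy environment (the `CoinFlipEnv` class and its reward values are illustrative, not from any real library):

```python
import random

class CoinFlipEnv:
    """Toy environment: guess a hidden coin flip. +1 for a correct guess, -1 otherwise."""
    def reset(self):
        self.coin = random.choice([0, 1])  # hidden part of the environment's state
        return 0                           # the agent always observes the same state

    def step(self, action):
        reward = 1 if action == self.coin else -1
        done = True                        # one-step episodes
        return 0, reward, done

env = CoinFlipEnv()
total_reward = 0
for episode in range(100):
    state = env.reset()
    done = False
    while not done:
        action = random.choice([0, 1])          # agent chooses an action
        state, reward, done = env.step(action)  # environment responds with reward + next state
        total_reward += reward                  # agent accumulates cumulative reward
print("average reward per episode:", total_reward / 100)
```

A random agent averages close to zero here; learning means doing better than that, which is exactly what the Q-learning section below sets up.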
2. Key Concepts
Policy
A policy π defines the agent's behavior: the probability of taking action a in state s.
- Deterministic: a = π(s)
- Stochastic: π(a|s) = P(Aₜ = a | Sₜ = s)
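The two forms of policy can be made concrete in code (a toy sketch; the state and action names are illustrative):

```python
import random

# Deterministic policy: a fixed mapping from state to action, a = π(s).
deterministic_policy = {"start": "right", "middle": "right", "near_goal": "up"}

# Stochastic policy: a probability distribution over actions in each state, π(a|s).
stochastic_policy = {"start": {"right": 0.8, "up": 0.2}}

def sample_action(policy, state):
    """Draw an action from a stochastic policy π(a|s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

print(deterministic_policy["start"])              # always "right"
print(sample_action(stochastic_policy, "start"))  # "right" about 80% of the time
```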
Value Function
Expected cumulative discounted reward starting from state s and following policy π:

V^π(s) = E_π[ Σ_{t=0}^∞ γ^t r_{t+1} | s₀ = s ]

where γ is the discount factor (0 < γ < 1): rewards far in the future count for less than immediate ones.
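To make discounting concrete, here is the return for a short reward sequence (a worked sketch, not tied to any particular environment):

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_1 + γ·r_2 + γ²·r_3 + ...  for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# A reward of 1 received three steps from now is worth less than one received immediately:
print(discounted_return([0, 0, 0, 1], gamma=0.9))  # ≈ 0.9³ = 0.729
print(discounted_return([1, 0, 0, 0], gamma=0.9))  # 1.0
```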
Q-Function
Expected cumulative discounted reward from taking action a in state s, then following π:

Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t r_{t+1} | s₀ = s, a₀ = a ]
3. Grid World Example
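The running example can be sketched as a small deterministic grid: the agent starts in one corner, the goal cell gives +1, and every other step costs a little (the −0.04 step cost and 4×4 size are illustrative assumptions mirroring the classic gridworld setup, including the REINFORCEjs demo linked above):

```python
class GridWorld:
    """4x4 deterministic grid. State = (row, col); start (0, 0), goal (3, 3)."""
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, size=4):
        self.size = size
        self.goal = (size - 1, size - 1)

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)  # walls clip movement
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        if self.pos == self.goal:
            return self.pos, 1.0, True     # goal reached, episode ends
        return self.pos, -0.04, False      # small step cost favors short paths

env = GridWorld()
state = env.reset()
# One shortest path is 6 steps: three downs, then three rights.
for a in ["down", "down", "down", "right", "right", "right"]:
    state, reward, done = env.step(a)
print(state, reward, done)  # (3, 3) 1.0 True
```

The step cost is what makes "shortest path to the goal" the optimal behavior rather than just "reach the goal eventually".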
4. Q-Learning Algorithm
Goal: Learn optimal Q-function through experience.
Update Rule:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]

Where:
- α = learning rate (how strongly each new sample overwrites the old estimate)
- γ = discount factor
- r = reward received for taking action a in state s
- s′ = next state (and a′ ranges over the actions available there)
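The update rule is one line of code. Here is a minimal sketch that trains it on a toy 5-state chain where the agent must walk right to reach a goal (the environment, hyperparameters, and episode count are illustrative choices, not canonical values):

```python
import random

random.seed(0)
N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                        # move left / move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    s = 0
    while s != GOAL:
        # ε-greedy action selection (explore with probability ε)
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == GOAL else 0.0
        # Q-learning update: Q(s,a) += α [ r + γ max_a' Q(s',a') − Q(s,a) ]
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# After training, moving right (+1) should be preferred in every non-goal state.
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(GOAL)}
print(policy)  # {0: 1, 1: 1, 2: 1, 3: 1}
```

Note the learned Q-values decay geometrically with distance from the goal (≈ 0.9³, 0.9², 0.9, 1.0), exactly as the discount factor predicts.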
5. Exploration vs. Exploitation
Exploration: try new actions to discover better strategies.
Exploitation: use known best actions to maximize reward.
ε-greedy strategy:
- With probability ε: explore (random action)
- With probability 1-ε: exploit (best known action)
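ε-greedy is a few lines of code. A minimal sketch (the Q-table here is a hypothetical example, not learned values):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability ε take a random action; otherwise take the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit

# Hypothetical Q-table: action "b" currently looks best in state 0.
Q = {(0, "a"): 0.2, (0, "b"): 0.8}
random.seed(1)
picks = [epsilon_greedy(Q, 0, ["a", "b"], epsilon=0.1) for _ in range(1000)]
print(picks.count("b") / 1000)  # ≈ 0.95: mostly exploit, occasionally explore
```

A common refinement is to decay ε over time: explore heavily early on, then settle into exploiting what has been learned.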
Key Takeaways
✅ RL learns through trial and error using rewards
✅ Agent interacts with environment to maximize cumulative reward
✅ Q-learning learns optimal action-values through temporal difference updates
✅ Exploration vs exploitation: Must balance trying new actions vs. using best known actions
✅ Credit assignment: Which actions led to eventual reward/penalty?
✅ Applications: Games, robotics, autonomous systems, optimization
What's Next?
Next lesson: MLOps Fundamentals – automating ML workflows, CI/CD pipelines, and production infrastructure!
Further Reading
Interactive Visualizations
- Andrej Karpathy — REINFORCEjs Gridworld — value iteration visualized cell-by-cell. Click "Run Value Iteration" and watch the policy emerge.
- REINFORCEjs — Q-Learning Demo — TD-learning in your browser, with explore-vs-exploit slider.
- Distill — Why Momentum Really Works — neighbor topic; many RL optimizers use momentum.
- OpenAI Gym Atari Demos — the canonical RL benchmark; Gymnasium is the actively maintained successor.
Video Courses
- David Silver — RL Course (DeepMind / UCL) — 10 lectures, the canonical free RL course.
- Hugging Face — Deep RL Course — modern, free, hands-on with Stable-Baselines3.
- Spinning Up in Deep RL — OpenAI's curated path through modern policy-gradient methods.
Papers & Articles
- Playing Atari with Deep Reinforcement Learning — Mnih et al., DeepMind 2013. The DQN paper that started modern RL.
- Proximal Policy Optimization Algorithms — Schulman et al., OpenAI 2017. The default modern policy-gradient method.
- Mastering the Game of Go with Deep Neural Networks and Tree Search — Silver et al., Nature 2016. AlphaGo.
- Deep Reinforcement Learning from Human Preferences — Christiano et al., 2017. The technical foundation of RLHF in modern LLMs.
Documentation & Books
- Book: Reinforcement Learning: An Introduction (2nd ed.) — Sutton & Barto (free PDF). The textbook.
- Gymnasium — modern, maintained fork of OpenAI Gym.
- Stable-Baselines3 — well-tested implementations of DQN, PPO, SAC, A2C in PyTorch.
- CleanRL — single-file, high-quality reference implementations of every major RL algorithm.