Introduction: Learning Through Interaction
Unlike supervised learning (where we have labels) or unsupervised learning (where we find patterns), reinforcement learning (RL) learns by trial and error, guided by rewards and penalties.
Real-world applications:
- Game AI (AlphaGo, Chess, Atari games)
- Robotics (walking, grasping, manipulation)
- Autonomous vehicles
- Recommendation systems
- Resource optimization
Key Insight: An agent learns to make sequential decisions by maximizing cumulative reward through interaction with an environment.
Learning Objectives
- Understand the RL framework (agent, environment, reward)
- Master key concepts (state, action, policy, value function)
- Learn Q-learning algorithm
- Apply exploration vs. exploitation strategies
- Implement simple RL agents
- Understand credit assignment problem
1. The RL Framework
Watch RL converge live before learning the math: Karpathy's REINFORCEjs Gridworld shows value iteration, and a separate demo shows TD-learning. Click "Run Value Iteration" and within seconds the optimal policy emerges from random exploration. Best 60 seconds you can spend before reading the rest of this lesson.
Core Components
- Agent: the learner/decision-maker
- Environment: the world the agent interacts with
- State (s): the current situation
- Action (a): what the agent can do
- Reward (r): the feedback signal
2. Key Concepts
Policy
A policy π defines the agent's behavior: π(a|s) is the probability of taking action a in state s.
- Deterministic: a = π(s)
- Stochastic: π(a|s) = P(A_t = a | S_t = s)
Value Function
Expected cumulative reward starting from state s:

V^π(s) = E_π[ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s ]

Where γ is the discount factor (0 < γ < 1).
Q-Function
Expected cumulative reward from taking action a in state s:

Q^π(s, a) = E_π[ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s, a_0 = a ]
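Both definitions are expectations of the same discounted return. For a single finished episode that return can be computed directly; a minimal sketch (the function name and reward values are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode's rewards."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# A reward of +1 received three steps in the future is worth gamma^3 today:
print(discounted_return([0, 0, 0, 1], gamma=0.9))  # ~0.729
```

Note how the discount factor makes distant rewards count for less, which is what pushes the agent toward reaching the goal sooner.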
3. Grid World Example
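A minimal grid world can be sketched in a few lines of Python. This is an illustrative sketch, not from any specific library; the 4x4 size, the -1 step cost, and the +10 goal reward are assumptions chosen for the example:

```python
# States are (row, col); the agent starts at (0, 0) and the goal is (3, 3).
# Each step costs -1; reaching the goal gives +10 and ends the episode.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

class GridWorld:
    def __init__(self, size=4, goal=(3, 3)):
        self.size = size
        self.goal = goal
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        dr, dc = ACTIONS[action]
        r, c = self.state
        # Moves that would leave the grid keep the agent in place.
        nr = min(max(r + dr, 0), self.size - 1)
        nc = min(max(c + dc, 0), self.size - 1)
        self.state = (nr, nc)
        done = self.state == self.goal
        reward = 10 if done else -1
        return self.state, reward, done
```

The `reset`/`step` shape mirrors the Gym/Gymnasium convention mentioned in Further Reading, so an agent written against this sketch transfers easily to real environments.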
4. Q-Learning Algorithm
Goal: Learn optimal Q-function through experience.
Update Rule:

Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ]

Where:
- α = learning rate
- γ = discount factor
- r = reward
- s' = next state
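The update rule translates almost directly to code. A tabular sketch, where the `defaultdict` Q-table and the hyperparameter values are illustrative choices:

```python
from collections import defaultdict

# One tabular Q-learning step, following the update rule above:
#   Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q[(s_next, a2)] for a2 in actions)  # max_a' Q(s', a')
    td_target = r + gamma * best_next                   # r + gamma * max ...
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])        # move toward the target

# All Q-values start at 0, so one update with r = -1 moves Q(s, a) by alpha * (-1):
Q = defaultdict(float)
q_update(Q, s=(0, 0), a="right", r=-1, s_next=(0, 1),
         actions=["up", "down", "left", "right"])
print(Q[((0, 0), "right")])  # -0.1
```

The bracketed quantity is the temporal-difference error: how far the current estimate is from the bootstrapped target.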
5. Exploration vs. Exploitation
- Exploration: try new actions to discover better strategies
- Exploitation: use known best actions to maximize reward
ε-greedy strategy:
- With probability ε: explore (random action)
- With probability 1-ε: exploit (best known action)
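The ε-greedy rule fits in a few lines; a sketch, with the function name and Q-values chosen for illustration:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action from a dict mapping action -> estimated Q-value."""
    if random.random() < epsilon:
        return random.choice(list(q_values))  # explore: uniform random action
    return max(q_values, key=q_values.get)    # exploit: best known action

q = {"left": 0.2, "right": 0.8}
print(epsilon_greedy(q, epsilon=0.0))  # right  (epsilon = 0 is purely greedy)
```

In practice ε is often decayed over training: explore heavily early on, then exploit the learned Q-values.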
Key Takeaways
- RL learns through trial and error using rewards
- Agent interacts with environment to maximize cumulative reward
- Q-learning learns optimal action-values through temporal difference updates
- Exploration vs. exploitation: must balance trying new actions against using the best known actions
- Credit assignment: which actions led to the eventual reward or penalty?
- Applications: games, robotics, autonomous systems, optimization
What's Next?
Next lesson: MLOps Fundamentals – automating ML workflows, CI/CD pipelines, and production infrastructure!
Further Reading
Interactive Visualizations
- Andrej Karpathy – REINFORCEjs Gridworld – value iteration visualized cell-by-cell. Click "Run Value Iteration" and watch the policy emerge.
- REINFORCEjs – Q-Learning Demo – TD-learning in your browser, with an explore-vs-exploit slider.
- Distill – Why Momentum Really Works – a neighboring topic; many deep RL optimizers use momentum.
- OpenAI Gym Atari Demos – the canonical RL benchmark; Gymnasium is the actively maintained successor.
Video Courses
- David Silver – RL Course (DeepMind / UCL) – 10 lectures, the canonical free RL course.
- Hugging Face – Deep RL Course – modern, free, hands-on with Stable-Baselines3.
- Spinning Up in Deep RL – OpenAI's curated path through modern policy-gradient methods.
Papers & Articles
- Playing Atari with Deep Reinforcement Learning – Mnih et al., DeepMind 2013. The DQN paper that started modern deep RL.
- Proximal Policy Optimization Algorithms – Schulman et al., OpenAI 2017. The default modern policy-gradient method.
- Mastering the Game of Go with Deep Neural Networks and Tree Search – Silver et al., Nature 2016. AlphaGo.
- Deep Reinforcement Learning from Human Preferences – Christiano et al., 2017. The technical foundation of RLHF in modern LLMs.
Documentation & Books
- Book: Reinforcement Learning: An Introduction (2nd ed.) – Sutton & Barto (free PDF). The textbook.
- Gymnasium – the modern, maintained fork of OpenAI Gym.
- Stable-Baselines3 – well-tested implementations of DQN, PPO, SAC, and A2C in PyTorch.
- CleanRL ā single-file, high-quality reference implementations of every major RL algorithm.