Introduction: Learning Through Interaction
Unlike supervised learning (where we learn from labeled examples) or unsupervised learning (where we find patterns in unlabeled data), reinforcement learning trains an agent by trial and error, using rewards and penalties as feedback.
Real-world applications:
- Game AI (AlphaGo, Chess, Atari games)
- Robotics (walking, grasping, manipulation)
- Autonomous vehicles
- Recommendation systems
- Resource optimization
Key Insight: An agent learns to make sequential decisions by maximizing cumulative reward through interaction with an environment.
Learning Objectives
- Understand the RL framework (agent, environment, reward)
- Master key concepts (state, action, policy, value function)
- Learn Q-learning algorithm
- Apply exploration vs. exploitation strategies
- Implement simple RL agents
- Understand credit assignment problem
1. The RL Framework
Core Components
- Agent: The learner/decision maker
- Environment: The world the agent interacts with
- State $s$: The current situation
- Action $a$: What the agent can do
- Reward $r$: The feedback signal
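The original interactive demo is omitted here; in its place, a minimal sketch of the agent-environment loop. The `CoinFlipEnv` below is invented for illustration; any environment exposing `reset()`/`step()` in this style would work the same way:

```python
import random

class CoinFlipEnv:
    """Toy environment: the agent tries to guess a random coin flip."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        coin = random.randint(0, 1)
        reward = 1 if action == coin else 0  # reward signal from the environment
        self.state = coin                    # new state observed by the agent
        done = False
        return self.state, reward, done

env = CoinFlipEnv()
state = env.reset()
total_reward = 0
for t in range(10):
    action = random.randint(0, 1)            # agent chooses an action
    state, reward, done = env.step(action)   # environment responds
    total_reward += reward                    # cumulative reward accumulates
print("cumulative reward:", total_reward)
```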
2. Key Concepts
Policy
A policy $\pi$ defines the agent's behavior: $\pi(a \mid s)$ is the probability of taking action $a$ in state $s$.
- Deterministic: $a = \pi(s)$
- Stochastic: $a \sim \pi(a \mid s)$
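Both forms are easy to express in code; here is a small sketch (the state and action names are invented for illustration):

```python
import random

# Deterministic policy: a fixed mapping from state to action
deterministic_policy = {"low_battery": "recharge", "ok_battery": "explore"}

# Stochastic policy: a probability distribution over actions per state
stochastic_policy = {"ok_battery": {"explore": 0.8, "recharge": 0.2}}

def sample_action(state):
    """Sample an action a ~ pi(a|s) from the stochastic policy."""
    dist = stochastic_policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs)[0]

print(deterministic_policy["low_battery"])  # always 'recharge'
print(sample_action("ok_battery"))          # 'explore' about 80% of the time
```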
Value Function
Expected cumulative reward from state $s$:

$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$

Where $\gamma$ is the discount factor ($0 < \gamma < 1$).
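For a finite reward sequence the discounted sum is a one-liner; this quick sketch shows how $\gamma$ down-weights later rewards (the reward values are invented):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Same magnitude of reward counts for less the later it arrives.
print(discounted_return([1, 1, 1]))   # ≈ 2.71 = 1 + 0.9 + 0.81
print(discounted_return([0, 0, 10]))  # ≈ 8.1  = 10 * 0.9**2
```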
Q-Function
Expected cumulative reward from taking action $a$ in state $s$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\; a_0 = a\right]$$
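In the tabular case, $Q$ is just a lookup table over (state, action) pairs, and the value of a state under a greedy policy is the best available Q-value, $V(s) = \max_a Q(s, a)$. A tiny sketch with invented values:

```python
# Tabular Q-function for a single state (values invented for illustration)
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}

# Acting greedily means taking the argmax action; the state's value under
# that greedy policy is the best available Q-value.
best_action = max(("left", "right"), key=lambda a: Q[("s0", a)])
V_s0 = Q[("s0", best_action)]
print(best_action, V_s0)  # right 0.7
```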
3. Grid World Example
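The original interactive grid-world demo isn't reproduced here; below is a minimal stand-in. The 4x4 layout, start/goal cells, and the -0.04 step cost are assumptions for illustration, not necessarily the lesson's original values:

```python
class GridWorld:
    """4x4 grid: the agent starts at (0, 0); the goal at (3, 3) gives +1."""
    SIZE = 4
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        row = min(max(self.pos[0] + dr, 0), self.SIZE - 1)  # clip to the grid
        col = min(max(self.pos[1] + dc, 0), self.SIZE - 1)
        self.pos = (row, col)
        done = self.pos == (self.SIZE - 1, self.SIZE - 1)
        reward = 1.0 if done else -0.04  # small step cost favors short paths
        return self.pos, reward, done

env = GridWorld()
print(env.reset())        # (0, 0)
print(env.step("right"))  # ((0, 1), -0.04, False)
```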
4. Q-Learning Algorithm
Goal: Learn optimal Q-function through experience.
Update Rule:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$

Where:
- $\alpha$ = learning rate
- $\gamma$ = discount factor
- $r$ = reward
- $s'$ = next state
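A compact tabular Q-learning sketch that implements the update rule above, reusing the GridWorld class from Section 3 (the hyperparameter values are illustrative, not tuned):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    actions = list(env.ACTIONS)
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0.0

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy selection: mostly exploit, sometimes explore
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # Temporal-difference update toward r + gamma * max_a' Q(s', a')
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward if done else reward + gamma * best_next
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q

Q = q_learning(GridWorld())
# The greedy action from the start state should now head toward the goal.
print(max(GridWorld.ACTIONS, key=lambda a: Q[((0, 0), a)]))  # e.g. 'right' or 'down'
```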
5. Exploration vs. Exploitation
Exploration: Try new actions to discover better strategies.
Exploitation: Use known best actions to maximize reward.
ε-greedy strategy:
- With probability ε: explore (random action)
- With probability 1-ε: exploit (best known action)
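The selection rule itself is only a few lines; in this sketch the Q-values and the 1000-draw tally are invented to show the explore/exploit split:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(state, a)])   # exploit

# Invented Q-values: 'right' looks better, but 'left' is still tried sometimes.
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.9}
counts = {"left": 0, "right": 0}
for _ in range(1000):
    counts[epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.1)] += 1
print(counts)  # roughly {'left': 50, 'right': 950}
```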
Key Takeaways
✅ RL learns through trial and error using rewards
✅ Agent interacts with environment to maximize cumulative reward
✅ Q-learning learns optimal action-values through temporal difference updates
✅ Exploration vs exploitation: Must balance trying new actions vs. using best known actions
✅ Credit assignment: Which actions led to eventual reward/penalty?
✅ Applications: Games, robotics, autonomous systems, optimization
What's Next?
Next lesson: MLOps Fundamentals – automating ML workflows, CI/CD pipelines, and production infrastructure!