LECTURE 01 · INTRO TO RL

How machines learn
to play, win, repeat.

From spotting cats in photos to mastering Go — we'll trace the leap from supervised learning to reinforcement learning, and play a tiny RL game ourselves.

🎯 Classification · Generation · AlphaGo · 🎮 Play a game
Title
01 / 27
THE BIG PICTURE

Two ways a machine can learn.

Style A

Supervised
Learning

"Here are 10,000 examples with the right answer. Learn the pattern."

  • Needs a teacher with answers
  • One shot per example
  • Great for: spam detection, photo tagging, translation
VS
Style B

Reinforcement
Learning

"Here's a world. Try things. I'll tell you when you do well."

  • No answers — only rewards
  • Many tries, learning from each
  • Great for: games, robots, self-driving
Two Ways
02 / 27
PART 1 · SUPERVISED LEARNING

Like studying with flashcards.

Show the model an example, tell it the right answer, and let it adjust. Repeat millions of times with new examples — until it gets the pattern right on cards it has never seen.

🐱CAT
🐶DOG
🦊FOX
🐰RABBIT

Each card = one labeled example. The model's job: predict the label of a new card.

Supervised Learning
03 / 27
EXAMPLE · CLASSIFICATION

Cat or dog?

A trained model takes any image and sorts it into one of a fixed set of classes. The output is a single guess.

📷 Image → 🧠 Model → 🏷️ "Dog"

Real systems power photo apps, medical scans, even sorting recyclables.
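
Here is a minimal sketch of that last step, turning a model's raw scores into a single guess. The scores are made up for illustration, and `softmax` and `classify` are hypothetical helper names, not part of any particular library.

```python
import math

CLASSES = ["cat", "dog"]  # the fixed set of classes

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores):
    """Return the single most likely class and its probability."""
    probs = softmax(scores)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return CLASSES[best], probs[best]

# Hypothetical raw scores a trained model might output for one photo.
label, confidence = classify([0.4, 2.1])
print(label, round(confidence, 2))
```

The model may be unsure internally, but the output is still one label: whichever class has the highest probability.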

🐱CAT
🐶DOG
😺CAT
🐕DOG
🐈CAT
🦮DOG
Classification
04 / 27
THE MATH UNDER THE HOOD

Make wrong as small as possible.

Every supervised learning problem boils down to one line of math — find the model that makes its mistakes as small as possible.

\[ \min_{f} \; \ell\bigl(f(x),\, y\bigr) \]
\(x\) The input — e.g. a photo of an animal.
\(y\) The right answer — e.g. the label "cat".
\(f\) The model. Feed it \(x\) and it guesses: \(f(x)\).
\(\ell\) The loss — how far the guess \(f(x)\) is from the truth \(y\).
\(\min\) Find the smallest. Tweak \(f\) until the loss is as tiny as possible.

Training = searching through millions of possible \(f\)'s to find the one that's wrong the least.
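
To make the search concrete, here is a toy sketch. It assumes a one-knob model \(f(x) = w \cdot x\), squared error as the loss, and four made-up data points. Real training uses smarter search (gradient descent) over millions of knobs, but the goal is the same.

```python
# Toy data: inputs x and right answers y (made-up numbers, roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def loss(w: float) -> float:
    """Average squared gap between the guess f(x) = w * x and the truth y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# "Training" here is a brute-force search over 501 candidate models:
# w = 0.00, 0.01, ..., 5.00.
candidates = [i / 100 for i in range(501)]
best_w = min(candidates, key=loss)

print(f"best w = {best_w:.2f}, loss = {loss(best_w):.4f}")
```

Every candidate \(w\) is one possible \(f\). The winner is simply the one whose guesses are wrong the least on the examples.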

The Math · Loss
05 / 27
STILL SUPERVISED LEARNING · ANOTHER EXAMPLE

Even ChatGPT is just
predicting the next word.

Same recipe as cat-vs-dog. Show the model billions of sentences, hide the next word, ask it to guess, shrink the loss. The only twist — instead of one label, it produces something one piece at a time.

📝 Text
"Write a haiku about cats"
A whisker twitches.
Sunlight pools on the carpet —
nap is non-negotiable.
🎨 Image
"A cat astronaut"
→ starts from random noise, refined into a picture step by step
🎵 Music / Code / Video
"Lo-fi beat for studying"
→ predicts the next note, then the next, then the next…

Same \(\min \, \ell(f(x), y)\) — but now \(y\) is a piece of the output (a word, a note, a slice of noise), not a label.

Generation
06 / 27
TRY IT · PREDICT THE NEXT WORD

You're the language model.

An LLM picks the next word from a probability distribution — over and over. Tap a word to add it. Watch the sentence grow.

The cat sat on the ___
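
A rough sketch of a single "pick the next word" step. The probabilities below are invented for illustration; a real LLM computes a fresh distribution over tens of thousands of tokens after every word.

```python
import random

# Hypothetical next-word probabilities after "The cat sat on the".
NEXT_WORD_PROBS = {"mat": 0.55, "sofa": 0.25, "roof": 0.15, "moon": 0.05}

def sample_next_word(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the demo is repeatable
sentence = "The cat sat on the"
print(sentence, sample_next_word(NEXT_WORD_PROBS, rng))
```

Likely words win most of the time, but now and then the sampler picks "moon". That randomness is why the same prompt can produce different sentences.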
Next Word
07 / 27
PAUSE · QUICK CHECK-IN

How are we
doing so far?

That was a lot — flashcards, math, even ChatGPT. Before we leave supervised learning behind and try something completely different, let's pause.

Got a question? Anything fuzzy? Speak up — no question is too small.
🔁 Want a replay? The math, the loss, generation — I can rewind any of it.
🚀 All good? Next up: the kind of learning that beat humans at Go.
Check-in
08 / 27
A PROBLEM

But what about
problems with no answer key?

How do you learn to ride a bike, win a chess match, or land a rocket — when no one can show you the "right answer" at every step?

Pivot to RL
09 / 27
WHY SUPERVISED LEARNING ISN'T ENOUGH

Where's the answer key?

Try to imagine a flashcard set for any of these. You can't — not because nobody tried, but because no human knows the right answer at every moment.

🚴 Learning to ride a bike: "What's the right amount to lean, right now?" → No parent can label every microsecond of balance.
🎮 Beating a video game: "Should I jump now? Run? Duck?" → You often find out it was wrong only after the game ends.
🤖 A robot learning to walk: "What's the perfect motor angle for this step?" → No engineer can write down every joint angle for every gait.
🐶 Training a puppy to sit: "How would you make 10,000 flashcards for 'sit'?" → You don't. You give a treat when it works, and that's it.

No teacher. No labels. Just try things, see what happens, get rewarded. Sound familiar? That's how you learn most things — and it's where RL lives.

Why SL Fails
10 / 27
PART 3 · MODELING THE PROBLEM

Two characters, talking in a loop.

To do math on bikes, games, and puppies, we need a picture that fits all of them. RL boils every one of those situations down to two characters — an agent and an environment — passing messages back and forth.

Diagram: Agent (THE LEARNER 🤖, POLICY π) → ACTION → Environment (THE WORLD 🌍) → REWARD + STATE → Agent

The agent's policy π — its strategy — picks an action. The world updates and hands back a new state + a reward, which the agent uses to nudge its policy. Repeat forever. Mathematicians call this a Markov Decision Process — but you can just call it a loop.
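
The loop can be sketched in a few lines of code. Everything here is a made-up toy: `CoinFlipWorld`, `Agent`, and `run_loop` are hypothetical names, the "world" is just a coin that lands heads 70% of the time, and the learning rule is a deliberately crude nudge rather than a standard RL algorithm.

```python
import random

class CoinFlipWorld:
    """Toy environment: guess a biased coin. Reward +1 if right, -1 if wrong."""
    def __init__(self, rng):
        self.rng = rng
        self.state = 0  # number of steps taken so far

    def step(self, action):
        outcome = "heads" if self.rng.random() < 0.7 else "tails"
        reward = 1 if action == outcome else -1
        self.state += 1
        return self.state, reward

class Agent:
    """Toy learner whose policy is one number: the chance of guessing heads."""
    def __init__(self, rng):
        self.rng = rng
        self.p_heads = 0.5

    def act(self, state):
        return "heads" if self.rng.random() < self.p_heads else "tails"

    def learn(self, action, reward):
        # Nudge toward the action if it was rewarded, away from it if not.
        step = 0.05 * reward if action == "heads" else -0.05 * reward
        self.p_heads = min(0.95, max(0.05, self.p_heads + step))

def run_loop(steps=200, seed=0):
    rng = random.Random(seed)
    world, agent = CoinFlipWorld(rng), Agent(rng)
    state, rewards = world.state, []
    for _ in range(steps):
        action = agent.act(state)           # agent -> environment: ACTION
        state, reward = world.step(action)  # environment -> agent: STATE + REWARD
        agent.learn(action, reward)         # agent nudges its policy
        rewards.append(reward)
    return agent, rewards

agent, rewards = run_loop()
print(f"P(guess heads) = {agent.p_heads:.2f}, total reward = {sum(rewards)}")
```

The three commented lines inside `run_loop` are the whole picture: act, observe, adjust. Bikes, games, and puppies differ only in what sits inside the two classes.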

RL Loop
11 / 27
THE REWARD SIGNAL

A number every step.
Add them up.

After every action, the world hands back one number — the score for that step. Sometimes it's a big jackpot at the end (you won the game). Sometimes it's a little something every step (a tasty scoop of ice cream). Either way, the agent's only job: make the total as big as possible.

🏆 +1 WIN A GAME · ONCE, AT THE END
🍦 +0.7 TASTY SCOOP · EVERY STEP
💥 −1 CRASH / LOSE
\[ G = r_1 + r_2 + r_3 + \cdots + r_T \]
Return \(G\) = the sum of every step's reward. RL = pick actions that make \(G\) as big as possible.
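
A quick worked example of adding up \(G\), using the three rewards shown above. The episode lengths are made up for illustration.

```python
def episode_return(rewards):
    """G = r_1 + r_2 + ... + r_T: the sum of every step's reward."""
    return sum(rewards)

# Sparse reward: nothing until a win on the final step.
game = [0, 0, 0, 0, 1]
# Dense reward: a 0.7 scoop on each of five steps.
ice_cream = [0.7] * 5
# A crash on the last step.
crash = [0, 0, -1]

print(episode_return(game))                 # 1
print(round(episode_return(ice_cream), 2))  # 3.5
print(episode_return(crash))                # -1
```

Whether the reward arrives all at once or a little at a time, the agent is judged on the same thing: the total.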
Reward
12 / 27
A CLASSIC TENSION

Explore or exploit?

You walk into an ice cream shop with 31 flavors. Mint chip is your reliable favorite. Do you order it again — or finally try the one with the weird name?

🔍 EXPLORE

Try something new.

Maybe you'll discover something better. Maybe you'll waste a scoop on a flavor you hate.

🍓 🥑 🌶️
🎯 EXPLOIT

Stick with what works.

Guaranteed-good ice cream. But you'll never find anything better than mint chip.

🍦 🍦 🍦

Every RL agent has to balance these. Too much exploring = waste. Too much exploiting = miss the best.

Explore vs Exploit
13 / 27
CASE STUDY · ALPHAGO

The game everyone said
computers couldn't win.

Go board photo from the AlphaGo match
\(10^{170}\)
POSSIBLE BOARD POSITIONS
2,500+
YEARS OF HUMAN STUDY
2016
YEAR ALPHAGO BEAT WORLD CHAMP LEE SEDOL

More positions than atoms in the observable universe (roughly \(10^{80}\)). You can't memorize Go; you have to understand it.

AlphaGo
14 / 27
UNDER THE HOOD

How AlphaGo learned, in three stages.

STAGE 1 📚

Imitate humans

Supervised learning on 30 million moves from expert human games. "Given this board, what would a human play?" Now AlphaGo plays like a strong amateur.

STAGE 2 🤖

Play itself

Reinforcement learning from self-play. Two copies of AlphaGo battle for millions of games. Reward = +1 win, −1 lose. It invents its own strategies.

STAGE 3 🌳

Plan ahead

Tree search at game time. Imagines thousands of futures, picks the move that — across all those futures — gives the best expected return.

Stage 2 is the magic. Without an answer key, RL let the system surpass every human teacher it had.

How AlphaGo Learned
15 / 27
MARCH 10, 2016 · GAME 2 · MOVE 37

The move
no human would have made.

It's not a human move. I've never seen a human play this move. So beautiful.

— Fan Hui, professional Go player, watching live

AlphaGo's own estimate of the chance a human would play it: 1 in 10,000. AlphaGo discovered it through self-play; no teacher could have shown it this move. Lee Sedol stared at the board for 12 minutes.

Go board photo with Move 37
Move 37
16 / 27
AND THEN... SILENCE

Twelve minutes.

I thought AlphaGo was based on probability calculation and it was merely a machine. But when I saw this move, I changed my mind. Surely AlphaGo is creative.

— Lee Sedol, after the match

12 min Lee Sedol's pause before responding
Move 37 + Lee Sedol's reaction
From the AlphaGo documentary (YouTube, at 51:46).

Reaction
17 / 27
PAUSE & THINK

How can a machine play a move
no human ever played?

AlphaGo started by copying 30 million expert human moves. So where did Move 37 come from? Talk to a neighbor and guess before we answer.

🧠 It learned from humans first. Stage 1 = copy 30M expert human moves. So why didn't it stay an imitator?
🤖 Then it played itself. Stage 2 = two AlphaGos battle for millions of games. Who labels the "right" move when no human is watching?
No teacher needed. Reward = +1 win, −1 lose. The game itself decides. That's how RL can surpass every human teacher.
Beyond Humans
18 / 27
NEXT UP · YOU PLAY

Your turn to be the agent.

One game: three mystery boxes, each hiding a reward you can't see. Tap to peek inside. You decide every move — and you only get 20 chances.

Your Turn
19 / 27

🎮 Three Mystery Boxes

Tap a box to pull. 20 pulls total.
Pulls used: 0 / 20 · Total reward: 0.00 · Your best box so far: none yet
The Game
20 / 27
A SIMPLE RL ALGORITHM

Meet ε-greedy.

How would a computer play that game? Here's the simplest RL algorithm that works — it's just a coin flip and a running average. (ε is the Greek letter "epsilon" — think of it as "how curious am I?")

🔍 EXPLORE Pick a random box. Why: maybe one we ignored is secretly the best — only way to find out is to try it.
🎯 EXPLOIT Pick the box with the highest average so far. Why: it's been the winner so far — stick with it and collect points.
EVERY TURN
Roll a random number between 0 and 1.
If it's less than \(\varepsilon\) → explore: pick a random box.
Otherwise → exploit: pick the box with the best average so far.
Update that box's running average with the reward you saw.
\(\varepsilon = 0\) Pure exploit. Always picks the current "best." Risk: stuck on bad early luck.
\(\varepsilon = 1/6\) 🎲 Roll a die. 1 → explore, 2–6 → exploit. We'll use this on the next slide.
\(\varepsilon = 1\) Pure explore. Totally random. Learns the truth fast — but scores poorly.
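
The four steps above can be sketched as follows. This is a minimal version under stated assumptions: each box pays 1 with some hidden probability and 0 otherwise, ties go to the first box, and the function name `epsilon_greedy` and the example means are made up for illustration.

```python
import random

def epsilon_greedy(means, epsilon, pulls=20, seed=0):
    """Play the box game with epsilon-greedy.

    means: hidden win probability of each box (assumed 0/1 payouts).
    Returns (total reward, pulls per box, running average per box).
    """
    rng = random.Random(seed)
    counts = [0] * len(means)
    averages = [0.0] * len(means)
    total = 0.0
    for _ in range(pulls):
        if rng.random() < epsilon:
            # Explore: pick a random box.
            box = rng.randrange(len(means))
        else:
            # Exploit: pick the best average so far (ties go to the first box).
            box = max(range(len(means)), key=lambda i: averages[i])
        reward = 1.0 if rng.random() < means[box] else 0.0
        # Update that box's running average with the reward we saw.
        counts[box] += 1
        averages[box] += (reward - averages[box]) / counts[box]
        total += reward
    return total, counts, averages

total, counts, averages = epsilon_greedy([0.3, 0.75, 0.5], epsilon=1 / 6)
print(total, counts, [round(a, 2) for a in averages])
```

The running-average line is the entire "learning" part: each new reward pulls the box's estimate a little toward what was just observed, and the pull gets weaker as the box collects more samples.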
ε-greedy
21 / 27
SAME BOXES · NEW PLAYER

ε-greedy plays your game.

Same three boxes. Same hidden means. Same 20 pulls. Each turn we roll a die 🎲: if it lands on 1, ε-greedy explores (random box). On 2–6, it exploits (best so far). That's ε = 1/6.

🔍 explore · pick random 🎯 exploit · pick the box with the best average so far
ε = 1/6 ≈ 0.17
click Roll & pull to take one step
A 0 pulls avg —
B 0 pulls avg —
C 0 pulls avg —
YOU: play game 1 first! · ε-GREEDY: click run! · OPTIMAL: 15.00 (always pick B)
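
To see how the die-roll player stacks up against the optimal score, here is a rough simulation. The hidden means are assumptions: only B's 0.75 is implied by the slide (15.00 over 20 pulls), while A, C, and the 0/1 payouts are invented for illustration.

```python
import random

# Assumed hidden means. Only B = 0.75 follows from "OPTIMAL 15.00" over 20 pulls.
HIDDEN_MEANS = {"A": 0.30, "B": 0.75, "C": 0.50}
PULLS = 20

def play_with_die(seed):
    """One 20-pull game: roll a die each turn, explore on a 1, exploit on 2-6."""
    rng = random.Random(seed)
    counts = {box: 0 for box in HIDDEN_MEANS}
    avgs = {box: 0.0 for box in HIDDEN_MEANS}
    total = 0.0
    for _ in range(PULLS):
        if rng.randint(1, 6) == 1:
            box = rng.choice(list(HIDDEN_MEANS))  # explore: random box
        else:
            box = max(avgs, key=avgs.get)         # exploit: best average so far
        reward = 1.0 if rng.random() < HIDDEN_MEANS[box] else 0.0
        counts[box] += 1
        avgs[box] += (reward - avgs[box]) / counts[box]
        total += reward
    return total

# Optimal play knows the answer in advance: always pull the best box.
optimal = PULLS * max(HIDDEN_MEANS.values())  # 20 * 0.75 = 15.0

runs = [play_with_die(seed) for seed in range(2000)]
print(f"optimal expected total: {optimal:.2f}")
print(f"epsilon-greedy average over {len(runs)} games: {sum(runs) / len(runs):.2f}")
```

A single 20-pull game is noisy, so the sketch averages many games. The gap between the two printed numbers is the price ε-greedy pays for not knowing in advance which box is best.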
ε-greedy vs You
22 / 27
PAUSE & DISCUSS

What do you think
of ε-greedy?

You watched it play. You compared it to your own strategy. Now turn to a partner — or shout it out:

💪 What does it do well? Why does just "rolling a die" beat picking a box at random?
⚠️ When would it fail? Imagine 100 boxes and only 20 pulls. Or rewards that change over time.
How would you make it smarter? Should ε start big and shrink? Should it remember more than the average?
Discuss ε-greedy
23 / 27
QUIZ · YOU JUST DID RL

Match each game word
to its RL name.

Tap a phrase on the left, then tap its match on the right. Get all 6 right and you've spoken the language of RL.

Diagram: Agent (YOU 🤖, POLICY π) → ACTION → Environment (3 BOXES 🎁) → REWARD → Agent
FROM THE GAME
RL NAME
Debrief
24 / 27
PAUSE & THINK

Every part of the loop has a hard problem.

The same loop you saw earlier — agent, policy, action, state, reward, environment. Now tap any piece to see what the next 7 lectures will tackle. For each, ask yourself: what would YOU try?

↑ Click any piece of the loop to see its open problems
Challenges
25 / 27
FULL CIRCLE · CHATGPT, CLAUDE, GEMINI

Same recipe.
Now for language.

Remember "predict the next word" from earlier? That's just step one. Modern LLMs follow the same two training stages as AlphaGo: first imitate, then RL.

AlphaGo
STAGE 1 · SUPERVISED Copy humans Train on 30 million expert human moves. "Given this board, what would a human play?" → strong amateur.
STAGE 2 · RL Self-play Two AlphaGos battle. Reward = +1 win, −1 lose. Discovers Move 37 — beyond every teacher.
💬 ChatGPT & friends
STAGE 1 · SUPERVISED Pretrain on the internet Predict the next word over billions of webpages. "What word comes next?" → fluent imitator.
STAGE 2 · RL Learn from feedback Humans (or another model) rate responses. Reward = helpful & honest. Reasoning that no one wrote down emerges.

Same idea, different game. RL is what turns a fluent imitator into something that can surpass its teacher.

RL in LLMs
26 / 27
WRAP-UP

Four ideas to take with you.

01

Supervised = answer key.

Show many labeled examples. Minimize the loss. Powers most AI you use today.

02

RL = trial & reward.

No labels. The agent acts, the environment rewards. Maximize total reward G over time.

03

Explore vs exploit.

Every learning agent — including you in the box game — must balance trying new things and using what works.

04

RL can surpass humans.

AlphaGo's Move 37 wasn't taught — it was discovered. Self-play RL invents strategies no teacher knows.

NEXT · LECTURE 2
What Is Reinforcement Learning? — Agent, environment, state, action, reward.
Takeaways
27 / 27