9 LIVE GAMES · BUILD YOUR FIRST AI

Reinforcement learning, from scratch.

How machines learn to play, win, and repeat — built up from the very first idea. Hands-on games, real math, and the whole arc from cat-or-dog to AlphaGo.

For Grades 9–12 · Pace ~50 min per lecture
Lecturers: Dr. Yao Ji, Dr. Ruqi Bai · Supervisor: Dr. Guanghui (George) Lan

The lectures

LECTURE 01

How machines learn to play

From supervised classification to AlphaGo. Hands-on interactives: predict-the-next-word, a 3-box bandit, and ε-greedy in action.

Supervised vs RL · AlphaGo · Move 37 · Bandit · ε-greedy
Open
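
As a taste of the bandit interactive, here is a minimal ε-greedy sketch in Python. It is not the course's own code: the three win probabilities, the function name `run_bandit`, and the default settings are assumptions chosen for illustration. With probability ε the player opens a random box (explore); otherwise it opens the box with the best average reward so far (exploit).

```python
import random

def run_bandit(win_probs, epsilon=0.1, steps=2000, seed=0):
    """Play a win/lose bandit with epsilon-greedy.

    win_probs are hypothetical payout probabilities, one per box.
    Returns the estimated value of each box and how often it was opened.
    """
    rng = random.Random(seed)
    n = len(win_probs)
    estimates = [0.0] * n  # running average reward per box
    counts = [0] * n       # number of times each box was opened
    for _ in range(steps):
        if rng.random() < epsilon:
            box = rng.randrange(n)  # explore: pick any box
        else:
            box = max(range(n), key=lambda i: estimates[i])  # exploit
        reward = 1.0 if rng.random() < win_probs[box] else 0.0
        counts[box] += 1
        # Incremental average: nudge the estimate toward the new reward.
        estimates[box] += (reward - estimates[box]) / counts[box]
    return estimates, counts

estimates, counts = run_bandit([0.2, 0.5, 0.8])
print(estimates, counts)
```

Try changing `epsilon` to 0 or 1 and compare how the pulls are spread across the boxes.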
LECTURE 02

What is reinforcement learning?

Make the picture precise with agent, environment, state, action, and reward, the formal vocabulary you'll use for the rest of the course. Includes an "Is this RL?" quiz and a design-your-own-policy interactive.

MDP · State · Action · Reward · Policy · Markov
Coming soon
LECTURE 03

Long-term reward & value functions

From single rewards to lifetime planning. Return G, discount γ, Vπ, Qπ, and the optimal V*. Interactive γ-slider + Vπ visualizer.

Return G · γ · Vπ · Qπ · V*
Coming soon
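
To make the return G and the discount γ concrete, the sketch below computes G = r₁ + γ·r₂ + γ²·r₃ + … for a finite list of rewards. The function name `discounted_return` and the sample rewards are assumptions for illustration, not material from the lecture.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, shrinking each later reward by one more factor of gamma."""
    g = 0.0
    # Work backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1], 0.5))  # 1.75
print(discounted_return([0, 0, 10], 0.9))
```

A small γ makes the agent short-sighted; a γ close to 1 makes distant rewards count almost as much as immediate ones.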
LECTURE 04

Evaluating a strategy

Bellman expectation, iterative policy evaluation, convergence. Watch V values flow outward from the goal across a 4×4 grid, sweep by sweep, live.

Bellman · Policy eval · Sweeping · Convergence
Coming soon
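
Iterative policy evaluation can be sketched in a few lines. The setup here is an assumption, not the lecture's exact grid: a 4×4 board, a goal in the bottom-right corner, reward +1 for stepping onto the goal and 0 otherwise, γ = 0.9, bumping into a wall leaves you in place, and the policy being evaluated picks each of the four moves with probability 0.25.

```python
N = 4
GOAL = (3, 3)          # assumed goal cell
GAMMA = 0.9            # assumed discount
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, move):
    """Return (next_state, reward); moving off the grid keeps you in place."""
    r, c = state[0] + move[0], state[1] + move[1]
    if not (0 <= r < N and 0 <= c < N):
        r, c = state
    nxt = (r, c)
    return nxt, (1.0 if nxt == GOAL else 0.0)

def evaluate_random_policy(tol=1e-6, max_sweeps=10_000):
    """Apply the Bellman expectation update until values stop changing."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    for sweep in range(1, max_sweeps + 1):
        new_V = dict(V)
        delta = 0.0
        for s in V:
            if s == GOAL:
                continue  # terminal state keeps value 0
            total = 0.0
            for m in MOVES:
                nxt, reward = step(s, m)
                total += 0.25 * (reward + GAMMA * V[nxt])
            delta = max(delta, abs(total - V[s]))
            new_V[s] = total
        V = new_V
        if delta < tol:
            return V, sweep
    return V, max_sweeps

V, sweeps = evaluate_random_policy()
for r in range(N):
    print(" ".join(f"{V[(r, c)]:.3f}" for c in range(N)))
```

Printing the grid after each sweep instead of only at the end shows the values spreading from the goal one ring of cells at a time.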
LECTURE 05

Improving a strategy

Greedy improvement, policy iteration, value iteration, Bellman optimality, and π*. Watch a random 5×5 policy turn optimal in a few rounds.

Greedy · Policy iter · Value iter · π*
Coming soon
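
One way to see value iteration and greedy improvement together is the sketch below. It assumes a 5×5 grid with the goal in the bottom-right corner, a cost of −1 per move, no discounting (γ = 1), and walls that leave you in place; these choices and the helper names are illustrative only. Under these assumptions the optimal value of a cell is minus its number of steps to the goal.

```python
N = 5
GOAL = (4, 4)  # assumed goal cell
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, name):
    """Next cell after a move; moving off the grid keeps you in place."""
    dr, dc = MOVES[name]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < N and 0 <= c < N):
        return state
    return (r, c)

def value_iteration():
    """Repeat the Bellman optimality update until no value changes."""
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    while True:
        changed = False
        for s in V:
            if s == GOAL:
                continue  # terminal state keeps value 0
            best = max(-1.0 + V[step(s, a)] for a in MOVES)
            if best != V[s]:
                V[s] = best
                changed = True
        if not changed:
            return V

def greedy_policy(V):
    """In every cell, pick the move that looks best under V."""
    return {s: max(MOVES, key=lambda a: -1.0 + V[step(s, a)])
            for s in V if s != GOAL}

V = value_iteration()
policy = greedy_policy(V)
print(V[(0, 0)], policy[(0, 0)])
```

Several moves can tie for best in the same cell, so more than one optimal policy π* exists on this grid.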
LECTURE 06

Learning without a map

Bandits revisited, ε-greedy, Monte Carlo, TD, Q-learning, SARSA. Two interactives: ε-greedy bandit + live Q-learning trace on a 4×4 grid.

ε-greedy · Monte Carlo · TD · Q-learning · SARSA
Coming soon
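
Tabular Q-learning fits in roughly thirty lines. The sketch below is a hedged illustration, not the lecture's trace: it assumes a 4×4 grid, a start at the top-left, a goal at the bottom-right worth +1, and hand-picked values for α, γ, ε, and a per-episode step cap. The agent never sees the map; it only observes the state, the reward, and whether the episode ended.

```python
import random

N = 4
GOAL = (3, 3)  # assumed goal cell
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Environment: clamp to the grid, reward +1 on reaching the goal."""
    r = min(max(state[0] + action[0], 0), N - 1)
    c = min(max(state[1] + action[1], 0), N - 1)
    nxt = (r, c)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1,
               max_steps=200, seed=0):
    rng = random.Random(seed)
    Q = {((r, c), a): 0.0
         for r in range(N) for c in range(N) for a in range(4)}
    for _ in range(episodes):
        state = (0, 0)
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(4)  # explore
            else:
                best = max(Q[(state, b)] for b in range(4))
                # exploit, breaking ties at random
                a = rng.choice([b for b in range(4) if Q[(state, b)] == best])
            nxt, reward, done = step(state, ACTIONS[a])
            future = 0.0 if done else max(Q[(nxt, b)] for b in range(4))
            # Q-learning update: move toward reward + gamma * best next value.
            Q[(state, a)] += alpha * (reward + gamma * future - Q[(state, a)])
            state = nxt
            if done:
                break
    return Q

Q = q_learning()
print(max(Q[((0, 0), a)] for a in range(4)))
```

SARSA differs in one place: instead of the best next value, it uses the value of the action the agent actually takes next.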
LECTURE 07

Putting it together

Synthesis. The big map, method comparator (5 algorithms), 8 project starters, ~30 lines of Python Q-learning, debugging guide. Bring your laptop.

Synthesis · Practice · Projects · Python · Gymnasium
Coming soon
LECTURE 08

RL, looking forward

From your gridworld to the frontier. DQN, AlphaGo lineage (→ MuZero), RLHF for ChatGPT, real-world apps, open problems, safety & reward hacking.

Deep RL · AlphaGo · RLHF · Safety
Coming soon