9 LIVE GAMES · BUILD YOUR FIRST AI

Reinforcement learning, from scratch.

How machines learn to play, win, and repeat — built up from the very first idea. Hands-on games, real math, and the whole arc from cat-or-dog to AlphaGo.

For grades 9–12 · Pace: ~50 min per lecture

Lecturers: Dr. Yao Ji, Dr. Ruqi Bai · Supervisor: Dr. Guanghui (George) Lan

The lectures

LECTURE 01 · How machines learn to play
From supervised classification to AlphaGo. Hands-on interactives: predict-the-next-word, a 3-box bandit, and ε-greedy in action.

Topics: Supervised vs RL · AlphaGo · Move 37 · Bandit · ε-greedy
Status: Open

LECTURE 02 · What is reinforcement learning?
Make the picture precise with agent, environment, state, action, and reward, the formal vocabulary you'll use for the rest of the course. Includes an "Is this RL?" quiz and a design-your-own-policy interactive.

Topics: MDP · State · Action · Reward · Policy · Markov
Status: Coming soon

LECTURE 03 · Long-term reward & value functions
From single rewards to lifetime planning: the return G, the discount γ, Vπ, Qπ, and the optimal value function V*. Interactives: a γ slider and a Vπ visualizer.

Topics: Return G · γ · Vπ · Qπ · V*
Status: Coming soon

LECTURE 04 · Evaluating a strategy
The Bellman expectation equation, iterative policy evaluation, and convergence. Watch V values flow outward from the goal across a 4×4 grid, sweep by sweep, live.

Topics: Bellman · Policy eval · Sweeping · Convergence
Status: Coming soon

LECTURE 05 · Improving a strategy
Greedy improvement, policy iteration, value iteration, Bellman optimality, and the optimal policy π*. Watch a random policy on a 5×5 grid become optimal in a few rounds.

Topics: Greedy · Policy iter · Value iter · π*
Status: Coming soon

LECTURE 06 · Learning without a map
Bandits revisited, ε-greedy, Monte Carlo, temporal-difference (TD) learning, Q-learning, and SARSA. Two interactives: an ε-greedy bandit and a live Q-learning trace on a 4×4 grid.

Topics: ε-greedy · Monte Carlo · TD · Q-learning · SARSA
Status: Coming soon

LECTURE 07 · Putting it together
A synthesis of the course: the big map, a method comparator covering 5 algorithms, 8 project starters, Q-learning in ~30 lines of Python, and a debugging guide. Bring your laptop.

Topics: Synthesis · Practice · Projects · Python · Gymnasium
Status: Coming soon

LECTURE 08 · RL, looking forward
From your gridworld to the frontier: DQN, the AlphaGo lineage (→ MuZero), RLHF for ChatGPT, real-world applications, open problems, and safety and reward hacking.

Topics: Deep RL · AlphaGo · RLHF · Safety
Status: Coming soon
