Coding Challenge #119 - AI Pong Player
This challenge is to build your own AI Pong player with reinforcement learning.
Hi, this is John with this week’s Coding Challenge.
🙏 Thank you for being a subscriber, I’m honoured to have you as a reader. 🎉
If there is a Coding Challenge you’d like to see, please let me know by replying to this email📧
Coding Challenge #119 - AI Pong Player
This challenge is to build your own reinforcement learning agent that learns to play Atari Pong directly from the pixels on the screen.
Pong is one of the oldest video games ever made, and it has a special place in the history of artificial intelligence. In 2013, DeepMind used Pong (and a handful of other Atari games) to show that a single algorithm could learn to play games at a human level, just by watching the screen and being told the score. That work kicked off the modern era of deep reinforcement learning. Pong is the friendliest of the Atari games to start with: the rules are simple, the screen is mostly empty, and the agent only needs to choose between moving the paddle up or down. That makes it the perfect first project for going from “I’ve read about reinforcement learning” to “I’ve actually trained an agent from raw pixels and watched it learn to win.” Building this project will introduce you to ideas you’ll come across again and again throughout your career: turning observations into features, sampling from a stochastic policy, computing returns, reducing variance, and the policy gradient itself.
If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It
Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️, with the bonus that you get 20% off any of my courses.
Buy one of my self-paced courses that walk you through a Coding Challenge.
Join one of my live courses where I personally teach you Go by building five of the coding challenges or systems software development by building a Redis clone.
The Challenge - Building Your Own AI Pong Player
In this challenge you’re going to build a policy gradient agent that learns to play Pong from raw pixels using the REINFORCE algorithm. Your agent will start out playing randomly, lose 21-0 over and over, and then, if you’ve wired everything up correctly, gradually start scoring points, then winning rallies, and eventually beating the built-in opponent more often than it loses.
This challenge is a good fit if you’ve written some Python before, are comfortable with NumPy, and have at least a passing acquaintance with neural networks. You don’t need to be a reinforcement learning expert. REINFORCE is one of the simplest deep reinforcement learning algorithms there is, and the version we’ll build here is famously the one Andrej Karpathy described in his “Pong from Pixels” blog post. A small policy network, no value function, no replay buffer, no target network. Just a policy, some episodes, and a gradient.
A word of warning before you start: training from pixels is slow. Even on a sensible setup, you should expect a few hours of CPU training before the agent really starts to win, and you may want to leave it running overnight. That’s part of the experience: watching the score curve crawl upwards over many thousands of episodes is genuinely exciting once you’ve built the thing yourself.
Step Zero
In this introductory step you’re going to set your environment up ready to begin developing and testing your solution.
Python is the natural choice for this challenge because the reinforcement learning ecosystem lives there, but the ideas transfer cleanly to any language with a deep learning framework.
You’ll need three things installed: Gymnasium (the maintained successor to OpenAI Gym), the Atari environments via ale-py, and a deep learning framework - PyTorch, TensorFlow, or JAX are all fine; pick whichever you’d like to practise with. You’ll also want NumPy, Matplotlib, and probably opencv-python or Pillow for image work. Have a quick read of the Gymnasium docs and the Atari environment list so you know what’s available.
Before you write any code, spend a few minutes playing Pong yourself if you’ve never seen it. Notice that the only thing that matters is your paddle’s vertical position, the ball’s position, and the ball’s direction of travel. Your agent will need to work this out from the screen, with no idea what any of those concepts mean.
Step 1
In this step your goal is to get a Pong environment running and have a “random agent” play a full game so you can see the data flowing.
Create the ALE/Pong-v5 environment from Gymnasium and run a single episode where, at every step, you pick an action uniformly at random and step the environment with it. For each step, print or log the reward. You should see mostly zeros, with the occasional -1 (the built-in opponent has scored against you) and very rarely a +1 (you got lucky). The episode should end after twenty-one points have been scored on one side.
Have a look at the action space (env.action_space) and the observation space (env.observation_space). The action space has six entries, but for Pong you really only ever need two of them: the action that moves the paddle up and the action that moves it down. Constraining your agent’s choices to just those two actions makes learning much faster, because there are fewer wrong things it can do. Pick the two action indices you’ll use throughout the rest of the challenge and write them down somewhere obvious in your code.
The observation is a 210 x 160 x 3 RGB image - the raw screen. Have a look at one with Matplotlib so you know what your agent is seeing. There are a lot of pixels there that have nothing to do with playing Pong: the score at the top, the borders down the sides, the colours. We’ll fix all of that in the next step.
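If it helps to see the moving parts in one place, here’s a minimal sketch of the random-agent episode, assuming Gymnasium and ale-py are installed (the explicit registration call is only needed on some Gymnasium/ale-py version combinations):

```python
# A minimal random-agent sketch for ALE/Pong-v5.
import gymnasium as gym
import ale_py  # makes the ALE environments available to Gymnasium

gym.register_envs(ale_py)  # explicit registration; needed on some versions, harmless otherwise

env = gym.make("ALE/Pong-v5")
observation, info = env.reset(seed=0)
total_reward = 0.0
terminated = truncated = False

while not (terminated or truncated):
    action = env.action_space.sample()  # uniform random action
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if reward != 0:
        print(f"point scored, reward: {reward}")

print(f"episode over, total reward: {total_reward}")
env.close()
```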
Testing: Run your random agent for one episode and confirm that:
The episode terminates of its own accord (you don’t have to cap the step count)
The total reward is somewhere between roughly -21 and -15 (random play loses badly)
The observation shape is (210, 160, 3) with uint8 values
Step 2
In this step your goal is to turn the raw 210 x 160 x 3 screen into a much smaller representation that contains just the information your agent needs.
There are four things to do here, and they should all happen inside a single function that takes a raw frame and returns the preprocessed observation:
Crop away the score area at the top of the screen and the borders on each side, leaving just the playing area.
Convert the result to greyscale - colour adds nothing useful in Pong.
Resize down to 80 x 80 pixels. The image was already mostly empty space; at this resolution you can still clearly see the paddles and the ball.
Flatten the 80 x 80 grid into a single 1D vector of length 6400. This is the input format your policy network will expect.
A static frame doesn’t tell your agent anything about which way the ball is moving, and direction is the most important thing in Pong. The classic trick - and the one used in the original Karpathy write-up - is to feed in the difference between the current preprocessed frame and the previous one. Pixels that didn’t change become zero, and pixels that did change show up as positive or negative values. The ball appears as a little bright streak pointing the way it’s travelling. Add this difference computation on top of your preprocessing function.
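Here’s one possible shape for that preprocessing, roughly in the Karpathy style. The crop rows and the background pixel values (144 and 109) are assumptions that happen to suit the standard Pong screen, and taking a single colour channel stands in for a full greyscale conversion - check your own frames and adjust:

```python
import numpy as np

def preprocess(frame):
    """Crop, downsample and flatten a raw 210x160x3 Pong frame into a length-6400 vector."""
    frame = frame[35:195]                          # crop away the score area and bottom border
    frame = frame[::2, ::2, 0].astype(np.float32)  # downsample by 2, keep one channel -> 80x80
    frame[frame == 144] = 0                        # erase one background colour
    frame[frame == 109] = 0                        # erase the other background colour
    frame[frame != 0] = 1                          # paddles and ball become 1
    return frame.ravel()                           # flatten to shape (6400,)

def frame_difference(current_raw, previous_processed):
    """Return the difference frame (so the agent can see motion) plus the new processed frame."""
    current = preprocess(current_raw)
    if previous_processed is None:
        diff = np.zeros_like(current)
    else:
        diff = current - previous_processed
    return diff, current
```

Keep the previous preprocessed frame around in your rollout loop, pass it back in on the next step, and reset it to None at the start of every episode.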
Testing: Save a few raw frames and their preprocessed versions to disk and look at them with an image viewer. The preprocessed frame should clearly show the two paddles and the ball as bright pixels on a dark background, with nothing else. Display a frame difference - it should be almost entirely black except for the ball and the moving paddle.
A good sanity check: the output of your preprocessing function should be a 1D NumPy array of length 6400 (or whatever shape you’ve chosen) with float32 values, not raw pixel bytes.
Step 3
In this step your goal is to build the neural network that maps a preprocessed observation to a probability distribution over actions, and use it to pick actions.
The policy network Karpathy describes is a tiny network - a single hidden layer with about 200 ReLU units, then an output layer that produces one number per action. Pass that output through a softmax (or a sigmoid if you’ve reduced things to a single output for “probability of moving up”) and you have a probability distribution. To pick an action, sample from that distribution rather than taking the most likely one.
Wire up an “act” function that takes a preprocessed frame, runs it through the network, and returns a sampled action plus whatever extra information you’ll need later for training (typically the log-probability of the action that was taken, or the network output itself).
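As a concrete example, here’s what that could look like in PyTorch (any framework works; the UP_ACTION and DOWN_ACTION indices below are placeholders - use the two you picked in Step 1):

```python
import torch
import torch.nn as nn

UP_ACTION, DOWN_ACTION = 2, 3   # placeholder indices - substitute the ones you chose in Step 1

class Policy(nn.Module):
    """A tiny MLP: 6400 inputs -> 200 ReLU units -> probability of moving up."""
    def __init__(self, input_dim=6400, hidden=200):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return torch.sigmoid(self.out(h))   # P(move up)

policy = Policy()

def act(observation):
    """Sample an action from the policy and return it with its log-probability."""
    x = torch.from_numpy(observation).float().unsqueeze(0)
    p_up = policy(x).squeeze()
    dist = torch.distributions.Bernoulli(probs=p_up)
    sample = dist.sample()                  # 1.0 means "up", 0.0 means "down"
    log_prob = dist.log_prob(sample)        # kept for the REINFORCE update in Step 5
    action = UP_ACTION if sample.item() == 1.0 else DOWN_ACTION
    return action, log_prob
```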
Once that’s working, run another full episode - this time with your untrained network choosing the actions instead of random.choice. The agent will still lose badly (its weights are random), but the score should be in roughly the same ballpark as the random agent from Step 1. If you see something dramatically different, something is wrong with your preprocessing or your sampling.
Testing: Run a single episode with the untrained policy. The total reward should be in the same -21 to -15 range as the random agent. The action distribution - if you log it - should be close to 50/50 at the start of training. Print the shape of the network output and the sampled action index for the first few steps to make sure everything lines up.
Step 4
In this step your goal is to collect a complete episode of experience and turn the rewards into the returns that will drive learning.
For each step in an episode, store three things: the observation that was fed in, the action that was taken (or its log-probability), and the reward that came back from the environment. At the end of the episode you’ll have three lists, all the same length.
Now compute the discounted return for each step. The return at step t is the sum of all the rewards from step t onwards, with rewards further in the future weighted by a discount factor gamma (use 0.99). You should compute this as a single backwards pass over the reward list - much faster and cleaner than the obvious double loop. There’s one Pong-specific subtlety: every time someone scores a point, the rally ends and a new one begins inside the same episode. You probably want to reset the running sum when a non-zero reward appears, so credit for a point only flows back to the actions in that rally rather than all the way to the start of the game. This makes a big difference to learning speed.
Once you have the per-step returns, normalise them across the whole episode by subtracting the mean and dividing by the standard deviation. Normalised returns put roughly half the actions on the “this was better than average” side and half on the “this was worse” side, which gives the policy gradient a much more stable signal.
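Here’s a sketch of the return computation with the per-rally reset and the normalisation folded in (the 1e-8 is only there to avoid dividing by zero on a degenerate episode):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Single backwards pass over the rewards, resetting at each scored point (Pong-specific)."""
    returns = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0                  # a point was scored: credit starts afresh for this rally
        running = rewards[t] + gamma * running
        returns[t] = running
    returns -= returns.mean()              # normalise across the episode...
    returns /= returns.std() + 1e-8        # ...so roughly half the steps look better than average
    return returns
```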
Testing: Run an episode, compute the returns, and have a look:
The length of your returns array matches the number of steps in the episode.
After normalisation, the mean should be close to zero and the standard deviation close to one.
For an action that was followed by a +1 reward soon after, the return should be positive; for one followed by a -1, it should be negative.
A nice sanity print is to show, for the last twenty steps of an episode, the reward at that step and the discounted return - you’ll see the return building up smoothly and then jumping when a point is scored.
Step 5
In this step your goal is to actually update the policy in the direction that makes good actions more likely and bad actions less likely. This is the heart of the whole challenge.
The REINFORCE update is delightfully simple. For each step in your collected rollout, compute the loss as -log(probability of the action taken) * normalised return for that step, then sum (or average) across all the steps. Run that through your framework’s autograd, take a gradient step with an optimiser (Adam or RMSProp at a learning rate around 1e-3 to 1e-4 works well), and that’s it. Actions that led to better-than-average returns get pushed up; actions that led to worse-than-average returns get pushed down. You’re doing gradient ascent on expected return, even though you’re calling loss.backward().
A single episode’s worth of gradient is very noisy. Batch up the gradients over multiple episodes - ten is a sensible starting point - before you actually call the optimiser. You can either accumulate gradients across episodes or concatenate the per-step data and do one bigger update; both work.
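Continuing the PyTorch sketch from Step 3, the update over a batch of collected steps (for example the concatenated data from ten episodes) might look like this:

```python
import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(log_probs, returns):
    """One policy-gradient step over a batch of collected steps."""
    log_probs = torch.stack(log_probs)                   # the values returned by act() during rollout
    returns = torch.as_tensor(returns, dtype=torch.float32)
    loss = -(log_probs * returns).sum()                  # minimising this is ascent on expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```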
Now wrap the whole thing in a training loop that runs for thousands of episodes, prints a running average of the score after each one, and just leaves it going. Be patient. For the first few hundred episodes the score will hover around -21 -- the policy is still essentially random and learning very slowly. After that, you should see the running average start to creep upwards. By the time you’ve trained for several thousand episodes (this can be many hours of wall-clock time on CPU), the running average should cross zero, meaning your agent is winning more rallies than it loses.
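The outer loop might look something like the sketch below. collect_episode is a hypothetical helper you’d write from the pieces in Steps 1 to 4: it plays one episode with act(), and returns the per-step log-probabilities, the per-step rewards, and the episode’s total score. The checkpoint saving anticipates Step 6.

```python
import numpy as np
import torch

BATCH_EPISODES = 10                 # episodes per gradient update
running_avg = None
batch_log_probs, batch_returns = [], []

for episode in range(1, 20001):
    log_probs, rewards, score = collect_episode(env, act)   # hypothetical rollout helper
    batch_log_probs.extend(log_probs)
    batch_returns.append(discounted_returns(rewards))

    # Exponential moving average of the episode score, printed after every episode.
    running_avg = score if running_avg is None else 0.99 * running_avg + 0.01 * score
    print(f"episode {episode}: score {score:+.0f}, running average {running_avg:.2f}")

    if episode % BATCH_EPISODES == 0:
        reinforce_update(batch_log_probs, np.concatenate(batch_returns))
        batch_log_probs, batch_returns = [], []

    if episode % 100 == 0:
        torch.save(policy.state_dict(), f"checkpoint_{episode}.pt")  # don't lose a long run
```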
Testing: This is the step where things either work or they very visibly don’t. A few things to check as training progresses:
The running average reward should be trending upwards over time, not just bouncing around
After ~500 episodes, the agent should reliably score some points (running average above -21)
After a few thousand episodes, the running average should be approaching zero or going positive
If the loss explodes or the score gets stuck at -21 forever, the most common culprits are: forgetting to reset the discounted return between rallies, an unnormalised return signal, the wrong sign on the loss, or feeding raw frames instead of frame differences
If you’d like a stronger signal that things are alive, log the average length of an episode (in steps). Random play produces short episodes; an agent that’s learning to actually rally produces longer ones, well before the score itself starts to go up.
Step 6
In this step your goal is to make your training run something you can show off, not just a console of numbers scrolling by.
There are four things to add:
Save the model weights -- both periodically (every N episodes) and whenever a new best running average reward is achieved. You don’t want to leave a long training run going only to lose the weights.
An evaluation mode that loads a saved set of weights, plays a fixed number of episodes with the policy fixed (no learning, ideally with greedy action selection rather than sampling), and reports the average score. This is what you’d use to honestly compare two different training runs.
Video recording of the agent playing. Gymnasium has a RecordVideo wrapper that writes MP4s (there’s a sketch of wiring it up after this list). Record a video of an early-training agent (it’ll be hilariously bad), a mid-training agent (starting to get the idea), and a late-training agent (winning, hopefully). Stitching these together is the single most satisfying artefact of the whole project.
A training reward plot -- a simple Matplotlib chart showing the per-episode reward and a rolling average over the run. The shape of this curve, going from a flat line at -21 through random play and up into positive territory, is the picture of an agent learning.
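For the video recording, here’s a minimal sketch using Gymnasium’s RecordVideo wrapper. The environment needs render_mode="rgb_array", the folder name is just a choice, and writing the MP4s may pull in an extra dependency such as moviepy:

```python
import gymnasium as gym
from gymnasium.wrappers import RecordVideo

# Wrap an evaluation environment so every episode is written out as an MP4.
eval_env = gym.make("ALE/Pong-v5", render_mode="rgb_array")
eval_env = RecordVideo(
    eval_env,
    video_folder="videos",                      # output directory for the MP4 files
    episode_trigger=lambda episode_id: True,    # record every evaluation episode
    name_prefix="pong-agent",
)
```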
Testing: Once you have all four bits in place:
Kill your training script and restart it from a saved checkpoint. The running average should pick up roughly where it left off, not crash back to -21.
Run evaluation on a checkpoint with sampling vs. greedy action selection; greedy should be at least as good as sampled.
Open one of your recorded videos and watch the agent play. It is unbelievably satisfying to see the paddle that you trained track the ball and put it past the opponent.
Going Further
Here are some ideas to take your Pong agent further:
Add a baseline to reduce variance. REINFORCE has notoriously noisy gradients. Subtract a baseline - the simplest one is the running average reward, the next-simplest is a learned value function - from the returns before you scale the policy gradient. This is the first step from REINFORCE towards Actor-Critic.
Replace the MLP with a small CNN. Convolutional layers are a much more natural fit for image input than a flattened MLP. You’ll lose the trick of feeding the frame difference and instead stack the last few frames as channels. Compare training time and final score against the MLP version.
Try a different algorithm. Once you have all the scaffolding - environment, preprocessing, training loop, logging - you can swap the algorithm out without rewriting the rest. Implement A2C, PPO, or DQN against the same Pong setup and see how they compare on sample efficiency and final score.
Run multiple environments in parallel. A single CPU core stepping through one game at a time is the bottleneck on most training runs. Use Gymnasium’s AsyncVectorEnv to step several Pong games at once and gather rollouts much faster.
Train an opponent. The built-in Pong AI is fixed and not very good. Once your agent beats it consistently, you’ve topped out the score. A natural next step is self-play: have two copies of your agent play each other and improve together.
P.S. If You Enjoy Coding Challenges Here Are Four Ways You Can Help Support It
Refer a friend or colleague to the newsletter. 🙏
Sign up for a paid subscription - think of it as buying me a coffee ☕️ twice a month, with the bonus that you also get 20% off any of my courses.
Buy one of my courses that walk you through a Coding Challenge.
Subscribe to the Coding Challenges YouTube channel!
Share Your Solutions!
If you think your solution is an example other developers can learn from, please share it: put it on GitHub, GitLab or elsewhere. Then let me know via Bluesky or LinkedIn, or just post about it there and tag me. Alternatively, please add a link to it in the Coding Challenges Shared Solutions GitHub repo.
Request for Feedback
I’m writing these challenges to help you develop your skills as a software engineer based on how I’ve approached my own personal learning and development. What works for me might not be the best way for you - so if you have suggestions for how I can make these challenges more useful to you and others, please get in touch and let me know. All feedback is greatly appreciated.
You can reach me on Bluesky, LinkedIn or through Substack
Thanks and happy coding!
John

