Using Alpha Zero to Study an Original Game

Alpha Zero

A few years ago, I cloned one of Google DeepMind’s GitHub repositories, which I can no longer find on their GitHub page (this is the closest I could find to what I had used).

Using that script, I was able to change the game logic to implement a game I had developed around that time.

The Game: Ravel X

I called the game Ravel X, and I even developed an Android version of it, which was available on Google Play for about a year. Unfortunately, it’s no longer available there. Ravel X is a deterministic, two-player abstract strategy game played on an \(N \times N\) grid (where \( N \) is typically \( 6 \) or \( 8 \)). The objective is to

  • form solid lines of your color, while blocking your opponent from forming lines.
  • occupy the diagonal squares with your color.

Forming a line is more difficult than simply occupying diagonal squares, so a line is worth many more points than the diagonal squares are. This creates a natural tension between focusing on the diagonals and trying to form lines.

The Basics of How the Script Works

Before I attempted to train a neural network, I watched the DeepMind x UCL Reinforcement Learning series on YouTube; here’s the first lecture in that series, titled “Introduction to Reinforcement Learning”.

To make the most of the lecture series, you should read Reinforcement Learning: An Introduction. You can download the PDF for free by following the link from Stanford’s website. You will also want to read the class materials, which are linked on YouTube.

You can also read DeepMind’s page on Alpha Zero.

Alpha Zero starts with a neural network whose weights are set randomly. The network that I used had a policy head and a value head. The algorithm then plays the game against itself using a policy that it derives from, and improves upon, the one provided by the neural network. This is the self-improvement mechanism that allows the algorithm to work, provided everything is properly set up (which can be the real struggle).

The Big Picture

As is the case with a lot of machine learning, what you ultimately want is not quite what you train on. What you want is an agent that wins games as often as possible, playing well against a strong and flexible opponent. It achieves this by learning two things:

  1. a policy \( P(s,a) \): Given an observation, \(s\), of the current game state, what is the probability that the algorithm will pick action \(a\). This is returned as a vector rather than a scalar because it contains one probability for each possible action \(a\).
  2. a value \( v(s) \): Given an observation, \(s\), of the current game state, what’s the expected reward, assuming all players play according to the current policy. This is returned as a scalar.

These correspond to the two heads of the neural network mentioned earlier. They are the two outputs of the neural network, a vector for the \(P(s,a)\)’s and a scalar for \(v(s)\).
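To make the two outputs concrete, here is a minimal sketch of the interface such a network exposes. Everything in it (the function name, the action count, the random placeholders) is illustrative rather than taken from the script I used.

    import numpy as np

    NUM_ACTIONS = 36  # illustrative: e.g. one action per square of a 6x6 board

    def predict(state):
        """Stand-in for the two-headed network.

        Returns:
            policy: a vector of length NUM_ACTIONS holding P(s, a) for every action a.
            value:  a single scalar v(s), the expected reward from state s.
        """
        logits = np.random.randn(NUM_ACTIONS)            # placeholder for the policy head
        policy = np.exp(logits) / np.exp(logits).sum()   # softmax over actions
        value = float(np.tanh(np.random.randn()))        # placeholder for the value head, in [-1, 1]
        return policy, value

In the real network, of course, both outputs are computed from the board state rather than from random numbers.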

Self-Improvement

Alpha Zero then uses something called Monte Carlo Tree Search (MCTS) to play the game against itself a number of times (this number, \( n_\text{sims_per_move} \), is another hyperparameter: the number of simulations used to select each move, kept constant throughout the learning process). Each time it plays, it follows a policy similar to the one provided by the neural network, but it deviates from that policy in a well-defined way in order to maximize the amount of information learned. It does this by calculating how confident it is in each of the legal moves and, instead of simply choosing the one it thinks is best, picking the one that maximizes the sum of

  1. the expected reward (\( Q(s,a) \)), which is the current estimate of the expected reward for that move, derived from the MCTS simulations.
  2. the exploration bonus (\(U(s,a)\)), which is a measure of the uncertainty in the expected reward. This bonus is scaled by a hyperparameter (\( c_{\text{score}} \)) and is higher for moves that have a greater prior probability (provided by the neural network’s policy head) but have been visited less often in the search tree.

The exploration bonus term is what enables the algorithm to discover new kinds of moves, which is an essential part of the process of learning to become better at the game, while the expected reward term allows the algorithm to leverage what it’s learned about the game so far to help steer the search in the right direction.

Another way to say this is that during self-play the algorithm selects the move, \(a\), which maximizes

$$ \text{score}(s,a) := Q(s,a) + U(s,a) $$

Again, this process ensures that MCTS efficiently explores promising moves suggested by the neural network while still trying out less-certain options. The \( c_{\text{score}} \) constant itself is a fixed component of the MCTS algorithm, not a learnable parameter of the neural network.

Calculating the exploration bonus

You can calculate the exploration bonus with

$$ U(s,a) := c_{\text{score}} \cdot P(s,a) \cdot \frac{\sqrt{N(s)}}{1 + N(s,a)} $$

where

  • \( P(s,a) \) is the policy output from the neural network (a vector with one entry per action \(a\)) for a given observation, \( s \).
  • \( N(s) \) is the total visit count of state \(s\).
  • \( N(s, a) \) is the visit count of the state-action pair \((s, a)\).

Having \( 1 + N(s,a) \) in the denominator means that an action \(a\) which has already been heavily explored receives a smaller exploration bonus, \( U(s,a) \), than an action which hasn’t been explored as many times.
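To make the selection rule concrete, here is a minimal sketch of a single selection step. The node fields (legal_actions, P, N, Q, N_total) are hypothetical names for the statistics described in this post, not the ones used in the actual script.

    import math

    def select_action(node, c_score):
        """Pick the action that maximizes score(s, a) = Q(s, a) + U(s, a).

        `node` is assumed to expose, for each legal action a:
          node.P[a] - prior probability from the policy head
          node.N[a] - visit count N(s, a)
          node.Q[a] - current mean value estimate W(s, a) / N(s, a)
        and node.N_total, the total visit count N(s) of the state.
        """
        best_action, best_score = None, -math.inf
        for a in node.legal_actions:
            u = c_score * node.P[a] * math.sqrt(node.N_total) / (1 + node.N[a])
            score = node.Q[a] + u
            if score > best_score:
                best_action, best_score = a, score
        return best_action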

Some Subtleties

Alpha Zero does not necessarily roll out (i.e. play the game) until the end of the game; there is a \(\text{max_depth}\) parameter. If a simulation reaches that depth before the game is finished, the value estimate \(v(s)\) is used in place of the game result. On the other hand, if the game does end, the actual result is used to update \(W(s,a)\), which represents the total return accumulated so far for action \(a\) selected at state \(s\).

Another subtlety is that the Alpha Zero algorithm rolls out to \( n_\text{sims_per_move} \) leaf nodes, and it is at those leaf nodes that the neural network is called. When a new leaf node is created, the following statistics are stored for it (see the sketch after this list):

  • \( N(s,a) \) :
    – Tracks how many times action \( a \) has been chosen from state \( s \).
    – Incremented by \(1\) during the backpropagation step of every simulation that passes through \( (s,a) \).
  • \( W(s,a) \) :
    – Tracks the accumulated reward for this action.
    – Updated by adding the Neural Network Value \( v(s’) \) (where \( s’\) is the leaf node state), or the end-game reward during the backpropagation step.
  • \( Q(s,a) = W(s,a) / N(s,a) \) :
    – Used for exploitation in \( \text{score}(s,a) \).
    – Recalculated after \(W\) and \(N\) are updated.
  • \( P(s,a) \) :
    – The initial probability of selecting action \(a\) from state \(s\). Used for exploration in \( \text{score}(s,a) \).
    – Set by the neural network only once, during the expansion step when \(s\) is first added to the tree. It does not change during the search; the policy itself is only updated later, during the training phase.
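To make the bookkeeping concrete, here is a small sketch of how those per-edge statistics might be stored and updated during backpropagation. The class and function names are my own, the sign flip assumes a two-player zero-sum game, and the real script’s data structures will differ.

    from dataclasses import dataclass

    @dataclass
    class EdgeStats:
        """Statistics kept for one (state, action) edge of the search tree."""
        P: float        # prior from the policy head, set once at expansion
        N: int = 0      # visit count N(s, a)
        W: float = 0.0  # accumulated reward W(s, a)

        @property
        def Q(self):
            """Mean value Q(s, a) = W(s, a) / N(s, a); zero while unvisited."""
            return self.W / self.N if self.N > 0 else 0.0

    def backpropagate(path, leaf_value):
        """Update every edge on the path from the root down to the new leaf.

        leaf_value is either the value head's estimate v(s') at the leaf, or the
        actual end-game reward if the game finished within the simulation.
        """
        value = leaf_value
        for edge in reversed(path):   # walk back up toward the root
            edge.N += 1
            edge.W += value
            value = -value            # one player's gain is the other's loss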

What Alpha Zero (MCTS) Returns

After the full \( n_\text{sims_per_move} \) leaf nodes have been reached, Alpha Zero returns an action selected stochastically (randomly) from a probability distribution \( \pi \) such that

$$ \pi (s_\text{root}, a) \propto N(s_\text{root}, a)^{1/\tau} $$

\( \tau \) is called the temperature. It is typically set to \( \tau = 1 \) for the first few moves (e.g. the first 30 moves in Go) to encourage exploration, and then \( \tau \rightarrow 0 \), which is effectively \( \text{arg max} \), for the rest of the game to ensure the end-game is played as well as possible. I did not change the temperature settings, so they were probably left at the values intended for a game like Go.
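As a small illustration of this sampling step (assuming the root’s visit counts are collected into a NumPy array; the function name is mine):

    import numpy as np

    def sample_root_action(visit_counts, tau):
        """Sample a move from pi(s_root, a) proportional to N(s_root, a)^(1/tau)."""
        if tau == 0:
            return int(np.argmax(visit_counts))       # greedy play for the end-game
        pi = visit_counts.astype(float) ** (1.0 / tau)
        pi /= pi.sum()                                # normalize into a distribution
        return int(np.random.choice(len(pi), p=pi))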

Alpha Zero also returns

  • \( \pi (s_\text{root}, a) \)
    – For training this is used as the target for the policy head \( P(s,a) \).
  • the average of \( Q(s) \)
    – For training this is used as the target for the value head \( v(s) \).

Using the Output

The output of Alpha Zero’s MCTS is then used to play self-play games. The observed state, \(s\), as well as the MCTS output are then put into a self-play memory buffer. This is done for each move in each self-play game.

Once the self-play games have finished, their outputs are collected and sometimes processed a little; there is some flexibility in how Alpha Zero is implemented. For training, the target for the policy head is \(\pi(s_\text{root},a)\), and the target for the value head \(v(s)\) is typically the average of \(Q(s)\). For my implementation, however, I chose to use the actual end-game reward instead, propagated back through the moves recorded in the game.

The move data, containing:

  • input state \(s\)
  • \( \pi (s, a) \)
  • either average of \( Q(s) \) or actual end-game reward

is then shuffled and used as the training data for the current iteration. If everything is working properly, this data will be an improvement over the network’s raw \(P(s,a)\) and \( v(s) \) outputs.
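Here is a rough sketch of how one finished self-play game might be turned into training examples when the actual end-game reward is used as the value target, as in my implementation. The helper name is mine, and the sign alternation assumes a two-player zero-sum reward recorded from the first player’s point of view.

    def game_to_examples(game_history, final_reward):
        """Turn one finished self-play game into (state, pi, z) training tuples.

        game_history: list of (state, pi) pairs in the order the moves were played.
        final_reward: the end-game reward from the first player's point of view.
        """
        examples = []
        z = final_reward
        for state, pi in game_history:
            examples.append((state, pi, z))  # z is the value target for this state
            z = -z                           # the next move belongs to the other player
        return examples

The tuples from all the self-play games are then pooled, shuffled, and fed to the trainer.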

Training

The original script had no memory buffer, so at each iteration it generated new self-play games and then used only that newly created data to train the network for the current iteration. I modified the script to load random samples of moves from the last 1 million moves. Moves from previous iterations would have their policy targets updated (sometimes only with a fixed probability) by re-running MCTS on the state, \(s\), saved in the move data.
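A minimal sketch of that kind of buffer is below. The capacity matches the 1 million moves mentioned above, but the refresh probability and the run_mcts hook are placeholders, not values from my script.

    import random
    from collections import deque

    BUFFER_SIZE = 1_000_000    # keep roughly the last 1 million moves
    REFRESH_PROB = 0.25        # placeholder probability of refreshing an old policy target

    replay_buffer = deque(maxlen=BUFFER_SIZE)   # oldest moves are dropped automatically

    def add_game(examples):
        """Append the (state, pi, z) tuples from one self-play game."""
        replay_buffer.extend(examples)

    def sample_batch(batch_size, run_mcts=None):
        """Draw a random training batch, optionally refreshing stale policy targets."""
        batch = random.sample(list(replay_buffer), min(batch_size, len(replay_buffer)))
        if run_mcts is not None:
            batch = [
                (s, run_mcts(s), z) if random.random() < REFRESH_PROB else (s, pi, z)
                for (s, pi, z) in batch
            ]
        return batch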

I also added

  • \( L^2 \) regularization: a penalty proportional to the \( L^2 \) norm of the parameter tensors, added to the loss.
  • an entropy bonus: a multiple of the entropy of the probability distribution returned by the neural network for the input state, \(s\) (see the sketch below).
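Here is one way those two terms might enter the loss. The coefficients are placeholders and the penalty uses the squared \( L^2 \) norm, which is the common choice, so treat this as a sketch rather than my exact implementation.

    import numpy as np

    def extra_loss_terms(params, policy_probs, l2_coef=1e-4, entropy_coef=1e-2):
        """Compute the two extra terms added to the training loss.

        params:       list of the network's parameter arrays.
        policy_probs: network policy outputs P(s, a), shape (batch, num_actions).
        Returns the L2 penalty (added to the loss) and the entropy bonus
        (subtracted from the loss, so the policy is not pushed toward certainty too early).
        """
        l2_penalty = l2_coef * sum(float(np.sum(p ** 2)) for p in params)
        entropy = -np.sum(policy_probs * np.log(policy_probs + 1e-8), axis=-1).mean()
        return l2_penalty, entropy_coef * float(entropy)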

Performance

It probably did better than I should have expected it to. Professionals, like DeepMind, use values like \( n_\text{sims_per_move} = 1600 \) (I think that’s the actual value), while I used \(32\). This causes the \( \text{score}(s,a) \) estimate to be very noisy, which degrades the quality of the MCTS process.

Using fairly noisy data is okay for the first several iterations, but it causes the training to become wildly unstable pretty quickly. That’s why I added things like a memory buffer, \( L^2 \) regularization, and an entropy bonus. Ultimately, though, they aren’t enough to make up for using \( n_\text{sims_per_move} = 32 \).

It Still Learns the Game Decently Well

I was able to train networks that could beat an engine I wrote that used an exhaustive four-move look-ahead to optimize the player’s score. The best networks I trained beat that engine more than 70% of the time. I felt that wasn’t so bad; it’s a lot better than I ever got at the game.

Note: if you check out the website for the game I created, Ravel X, the games shown on YouTube aren’t very good. They were for the \(8 \times 8\) version of the game, and the network used to play them was trained very early on; it wasn’t a very strong network.

Hardware

I used nothing special at all, just one Nvidia GeForce RTX 2070 SUPER. Using \( n_\text{sims_per_move} = 32 \), it would take about 7 hours to create about 125,000 games. If I had done things correctly and used \( n_\text{sims_per_move} = 1600 \), it would have instead taken about 350 hours. That would have been about half a month per iteration. Alpha Zero converges slowly, so it could have taken over 100 iterations to become expert-level. That would have taken several years. On top of that, the network I used was pretty small, so it might not have gotten much better than I was able to achieve, anyway.
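Spelled out, the self-play time per iteration scales roughly linearly with \( n_\text{sims_per_move} \):

$$ 7\ \text{hours} \times \frac{1600}{32} = 7 \times 50 = 350\ \text{hours} \approx 14.6\ \text{days}, $$

and 100 or more iterations at that rate comes out to roughly \( 100 \times 14.6 \approx 1460 \) days, i.e. about four years.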

You can see that it’s very difficult to make progress in this area without a lot of equipment. This is one reason why I tend to find it unhelpful to hear reports of how something like Alpha Zero was able to master a game like Chess in less than a day: a lot of compute power was thrown at a highly parallelizable problem. Such results are a statement about the genius of the algorithm and about the incredible compute power it was allowed to use.

The Neural Network

The script’s name for the network I used was ResnetPolicyValueNet256, which was a pax.Module. The parameters I used were the defaults dim: int = 256 and num_resblock: int = 6.

To quote their documentation:

    """Residual Conv Policy-Value network.

    Two-head network:
                        ┌─> action head
    input ─> Backbone  ─┤
                        └─> value head
    """

The Backbone consists of num_resblock ResidualBlock(dim) blocks.

This is nothing too fancy, and I didn’t feel like changing things around much. I did adjust the parameters a bit, but I couldn’t get better results with larger networks (e.g. ones with larger Backbones).

The network had a total of a little more than 12 million trainable parameters.
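For readers who want a feel for the shape of this kind of network, here is a rough PyTorch analogue of a two-head residual policy-value net. It is not the pax implementation from the script: the input planes, head structure, and normalization layers of the real network differ, so the parameter counts won’t match exactly.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        """One residual block: two 3x3 convolutions plus a skip connection."""
        def __init__(self, dim):
            super().__init__()
            self.conv1 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

        def forward(self, x):
            y = F.relu(self.conv1(x))
            y = self.conv2(y)
            return F.relu(x + y)

    class PolicyValueNet(nn.Module):
        """Backbone of residual blocks feeding an action head and a value head."""
        def __init__(self, board_size=6, in_planes=2, dim=256, num_resblock=6):
            super().__init__()
            # in_planes=2 (one plane per player) is an assumption for this sketch.
            self.stem = nn.Conv2d(in_planes, dim, kernel_size=3, padding=1)
            self.backbone = nn.Sequential(*[ResidualBlock(dim) for _ in range(num_resblock)])
            flat = dim * board_size * board_size
            self.action_head = nn.Linear(flat, board_size * board_size)  # one logit per square
            self.value_head = nn.Linear(flat, 1)

        def forward(self, board):                      # board: (batch, in_planes, N, N)
            x = F.relu(self.stem(board))
            x = self.backbone(x).flatten(start_dim=1)
            policy_logits = self.action_head(x)        # -> P(s, a) after a softmax
            value = torch.tanh(self.value_head(x))     # scalar v(s) in [-1, 1]
            return policy_logits, value.squeeze(-1)

In a sketch like this, most of the trainable parameters sit in the residual backbone rather than in the two heads.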

Conclusion

There’s a reason why, at least for now, there aren’t many deep reinforcement learning examples available online that cover the “middle area” between something large-scale, like Alpha Zero Chess, and something quite small, like tic-tac-toe: a lot of computation is required for Alpha Zero to learn an in-depth game. As processors become increasingly powerful, it will become feasible to use Alpha Zero and similar algorithms to study new games.

However, the pace at which PCs are improving has been decelerating since around 2010. So, it’s probably going to be a while before either

  1. an affordable home setup or
  2. a cheap cloud-based setup

will be able to properly train an agent for a game like Ravel X in a reasonable amount of time.

Thank you for reading!

[There’s no comment section so far, but feel free to reach out to me through the Contact Us page!]