Cloud TPUs in Kubernetes Engine powering Minigo are now available in betaCloud TPUs in Kubernetes Engine powering Minigo are now available in betaProduct ManagerSoftware EngineerProduct Manager

For a more detailed explanation of Cloud TPUs in GKE, for example how to train the TensorFlow ResNet-50 model using Cloud TPU and GKE, check out the documentation.

640 Cloud TPUs in GKE powering Minigo

Internally, we use Cloud TPUs to run one of the most iconic Google machine learning workloads: Go. Specifically, we run Minigo, an open-source and independent implementation of Google DeepMind’s AlphaGo Zero algorithm, which was the first computer program to defeat a professional human Go player and world champion. Minigo was started by Googlers as a 20% project, written only from existing published papers after DeepMind retired AlphaGo.

Go is a strategy board game that was invented in China more than 2,500 years ago and that has fascinated humans ever since—and in recent years challenged computers. Players alternate placing stones on a grid of lines in an attempt to surround the most territory. The large number of choices available for each move and the very long horizon of their effects combine to make Go very difficult to analyze. Unlike chess or shogi, which have clear rules that determine when a game is finished (e.g., checkmate), a Go game is only over when both players agree. That’s a difficult problem for computers. It’s also very hard, even for skilled human players, to determine which player is winning or losing at a given point in the game.

Minigo plays a game of Go using a neural network, or a model, that answers two questions: “Which move is most likely to be played next?” called the policy, and “Which player is likely to win?” called the value. It uses the policy and value to search through the possible future states of the game and determine the best move to be played.

The neural network provides these answers using reinforcement learning which iteratively improves the model in a two-step process. First, the best network plays games against itself, recording the results of its search at each move. Second, the network is updated to better predict the results in step one. Then the updated model plays more games against itself, and the cycle repeats, with the self-play process producing new data for the training process to build better models, and so on ad infinitum.