
Mcts scheduling

At its core, MCTS consists of repeated iterations (ideally infinite, in practice constrained by computing time and resources) of four steps: selection, expansion, simulation and backup. In other words, MCTS pays more attention to nodes that are more promising, so it avoids having to brute-force all possibilities, which is impractical.

Selection

In this step, we use a tree policy to construct a path from the root to the most promising leaf node. A leaf node is a node which has unexplored child node(s). A tree policy is an informed policy used for action (node) selection in the snowcap (the explored part of the game tree, as opposed to the vast unexplored bottom part). One important consideration for such a policy is the balance of exploration vs. exploitation, a recurring topic in AI, especially in Reinforcement Learning. In the algorithm behind AlphaGo, a UCB-based policy is used. More specifically, each node has an associated UCB value, and during selection we always choose the child node with the highest UCB value. As we can see, the more a node i is visited, the lower the second (exploration) term of its UCB value becomes, hence decreasing its probability of being selected again. Therefore, the UCB algorithm has exploration and exploitation inherently built in. For a more rigorous yet still easy-to-follow account of exploration vs. exploitation (including UCB), please read Lilian Weng’s blog on this topic. Finally, in the selection step, we traverse the snowcap, or explored part of the tree, until a leaf node has been reached.

Expansion

In the expansion step, we simply pick an unexplored child node of the leaf node at random. Remember that we reach a leaf node at the end of selection? Smart and neat!

Simulation (or Roll-out)

During this step, we roll out one or multiple simulations, with reward accumulated for each simulation.
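The UCB-based selection described above can be sketched in a few lines of Python. The post does not give the exact formula AlphaGo uses, so this sketch assumes the standard UCB1 form with the usual exploration constant c = √2:

```python
import math

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: exploitation term plus exploration term.

    The second (exploration) term shrinks as `visits` grows, which is
    why a heavily visited node is less likely to be selected again.
    """
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    exploitation = total_reward / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

def select_child(children):
    """During selection, always choose the child with the highest UCB value."""
    parent_visits = sum(child["visits"] for child in children)
    return max(children, key=lambda ch: ucb1(ch["reward"], ch["visits"], parent_visits))

# Toy example: child B has the better average reward, but child C is
# unvisited, so its infinite exploration bonus wins the selection.
children = [
    {"name": "A", "reward": 3.0, "visits": 10},
    {"name": "B", "reward": 6.0, "visits": 10},
    {"name": "C", "reward": 0.0, "visits": 0},
]
print(select_child(children)["name"])  # → C
```

Once every child has been visited at least once, the exploitation term starts to dominate and the highest-average child (B here) is selected most often.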

Some back-of-the-envelope calculation will quickly show how prohibitive it is to use uninformed search in Go:

This also gives some sense of why AlphaGo’s win is such a milestone for humankind. Now we understand that we need something smarter than uninformed search to navigate a gigantic state space like the game of Go. MCTS is one such smart algorithm that gives us a way out. Essentially, MCTS uses Monte Carlo simulation to accumulate value estimates that guide the search towards highly rewarding trajectories in the search tree.
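The back-of-the-envelope calculation can be reproduced in a couple of lines. The branching factor of 250 comes from the post; the game length of roughly 150 moves is an assumed, commonly cited figure:

```python
import math

# Rough size of Go's game tree: branching_factor ** game_length,
# assuming ~250 legal moves per position and ~150 moves per game.
branching_factor = 250
game_length = 150

states = branching_factor ** game_length
print(len(str(states)))                             # → 360 digits
print(math.log10(branching_factor) * game_length)   # ≈ 359.7
```

Roughly 10^360 trajectories, far beyond what any brute-force enumeration could ever touch, which is exactly why an informed search like MCTS is needed.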

[Figure: A game tree of tic-tac-toe (image source). Note that it is not fully expanded due to space.]

To give some idea of the scale, Go has an average branching factor of 250, according to Wikipedia.











