Currently the policy has a large effect on search even when the tree has millions of nodes. This causes weird behaviour and can lead to significant oversights. To fix this, I'm proposing that rather than maximizing Q + c_puct*policy*sqrt(N / n_i), we instead maximize Q + f(policy, N) + c_puct*sqrt(N / n_i), so that the policy contributes a separate additive term instead of multiplying the exploration term. I don't have a great idea of what f should be, but a reasonable first approximation is e^-(a*N + policy); see the link for the graph: https://www.desmos.com/calculator/g3kmkjp5zs
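
To make the comparison concrete, here is a rough Python sketch of the two selection scores. It is only an illustration of the formulas as written above, not engine code: the function names, the default values for c_puct and a, and the assumption that n_i >= 1 (to avoid dividing by zero) are all mine.

```python
import math


def current_score(q, policy, n_parent, n_child, c_puct=1.5):
    """Current rule as written above: Q + c_puct * policy * sqrt(N / n_i).

    Assumes n_child >= 1; the default c_puct is an arbitrary placeholder.
    """
    return q + c_puct * policy * math.sqrt(n_parent / n_child)


def f(policy, n_parent, a=1e-4):
    """First-guess policy term from the post: e^-(a*N + policy).

    The decay rate `a` is a hypothetical tuning constant.
    """
    return math.exp(-(a * n_parent + policy))


def proposed_score(q, policy, n_parent, n_child, c_puct=1.5, a=1e-4):
    """Proposed rule: Q + f(policy, N) + c_puct * sqrt(N / n_i)."""
    return q + f(policy, n_parent, a) + c_puct * math.sqrt(n_parent / n_child)


if __name__ == "__main__":
    # Compare how much the policy separates two otherwise identical children
    # (same Q, same visit count) as the parent visit count N grows.
    for n_parent in (100, 10_000, 1_000_000):
        n_child = n_parent // 10
        gap_now = (current_score(0.0, 0.9, n_parent, n_child)
                   - current_score(0.0, 0.1, n_parent, n_child))
        gap_new = (proposed_score(0.0, 0.9, n_parent, n_child)
                   - proposed_score(0.0, 0.1, n_parent, n_child))
        print(f"N={n_parent}: current gap={gap_now:.4f}, proposed gap={gap_new:.6f}")
```

With the current rule, the gap between two children that differ only in policy depends on the ratio N / n_i and does not shrink as N grows. With the proposed rule, that gap is multiplied by e^-(a*N) and so fades out at large N. Note that with f exactly as written, a larger policy gives a smaller bonus, so the sign inside the exponent probably needs a second look when picking the real f.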