Why not get rid of the softmax normalization and cpuct altogether? The idea is that if the policy is learned and NOT normalized, the NN would automatically learn the best cpuct indirectly through the policy. The policy is currently multiplied by cpuct in the PUCT selection formula, so if the NN could adjust the policy more freely (without the imposed sum-to-one constraint) it could indirectly choose which cpuct to use. An added bonus is that the NN would learn to adapt this per position: if a cpuct of 3.0 were optimal for a position, the NN would simply output a policy three times as high.
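A rough sketch of what I mean, using the standard PUCT exploration term (function and variable names here are just illustrative, not actual engine code): scaling the prior by some factor is mathematically the same as scaling cpuct by that factor, so an unnormalized policy could absorb cpuct entirely.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, cpuct):
    """Q value plus the usual PUCT exploration term."""
    u = cpuct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

# Normalized policy: cpuct sets the exploration scale explicitly.
a = puct_score(q=0.1, prior=0.2, parent_visits=100, child_visits=5, cpuct=3.0)

# Unnormalized policy: the net bakes the factor into the prior itself
# (0.6 = 3.0 * 0.2) and cpuct collapses to 1 -- the score is identical.
b = puct_score(q=0.1, prior=0.6, parent_visits=100, child_visits=5, cpuct=1.0)

assert abs(a - b) < 1e-12
```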
I don't know whether this has already been tried and/or discussed. It might entail a small change to the policy output layer, as I do not know whether any constraints are (directly or indirectly) imposed at that stage, apart from the softmax sum-to-one constraint applied when the output is first fetched. It might also make learning more (or less) difficult. Another thing that would change is that cross-entropy might have to be adapted or replaced so that the learning target also preserves the scale of the policy output.
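Just to illustrate that last point (not a proposal for the actual loss): plain cross-entropy over a softmax normalizes the scale away, but a divergence defined on unnormalized non-negative vectors, such as the generalized KL divergence, does penalize a scale mismatch, assuming the training target itself carries the intended scale (e.g. a visit-count target times the implied cpuct). A toy sketch:

```python
import numpy as np

def generalized_kl(target, output, eps=1e-12):
    """Generalized KL divergence for non-negative (unnormalized) vectors.
    Its minimum requires output == target element-wise, so the overall
    scale of the target is learned instead of being normalized away."""
    return np.sum(target * np.log((target + eps) / (output + eps)) - target + output)

# Hypothetical three-move position where the training target is an
# unnormalized "policy * implied cpuct" vector rather than a distribution.
target = np.array([1.8, 0.9, 0.3])   # sums to 3.0, i.e. an implied cpuct of ~3
good   = np.array([1.8, 0.9, 0.3])
scaled = good / 3.0                  # same direction, wrong scale

print(generalized_kl(target, good))    # ~0.0
print(generalized_kl(target, scaled))  # > 0: the scale mismatch is penalized
```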
Could this be made to work? Any ideas?