Why not get rid of the softmax normalization and cpuct altogether? The idea is that if the policy is learned and NOT normalized, the NN would automatically learn the best cpuct indirectly through the policy. The policy is currently multiplied by cpuct in the PUCT selection formula, so if the NN could adjust the policy more freely (without the imposed sum-to-one constraint) it could indirectly choose which cpuct to use. An added bonus is that the NN would learn to adapt this per position: if a cpuct of 3.0 were optimal for a position, the NN would simply output a policy three times as high.
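A rough sketch of what I mean, using the standard PUCT exploration term (function and variable names here are just illustrative, not actual engine code): scaling the prior by some factor is mathematically the same as scaling cpuct by that factor, so an unnormalized policy could absorb cpuct entirely.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, cpuct):
    """Q value plus the usual PUCT exploration term."""
    u = cpuct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

# Normalized policy: cpuct sets the exploration scale explicitly.
a = puct_score(q=0.1, prior=0.2, parent_visits=100, child_visits=5, cpuct=3.0)

# Unnormalized policy: the net bakes the factor into the prior itself
# (0.6 = 3.0 * 0.2) and cpuct collapses to 1 -- the score is identical.
b = puct_score(q=0.1, prior=0.6, parent_visits=100, child_visits=5, cpuct=1.0)

assert abs(a - b) < 1e-12
```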
I don't know whether this has already been tried and/or discussed. It might entail a small change to the policy output layer, as I do not know whether any constraints are (directly or indirectly) imposed at that stage, apart from the softmax sum-to-one constraint applied when the output is first fetched. It might also make learning more (or less) difficult. Another thing that would change is that cross-entropy might have to be adapted or replaced so that the learning target also preserves the scale of the policy output.
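Just to illustrate that last point (not a proposal for the actual loss): plain cross-entropy over a softmax normalizes the scale away, but a divergence defined on unnormalized non-negative vectors, such as the generalized KL divergence, does penalize a scale mismatch, assuming the training target itself carries the intended scale (e.g. a visit-count target times the implied cpuct). A toy sketch:

```python
import numpy as np

def generalized_kl(target, output, eps=1e-12):
    """Generalized KL divergence for non-negative (unnormalized) vectors.
    Its minimum requires output == target element-wise, so the overall
    scale of the target is learned instead of being normalized away."""
    return np.sum(target * np.log((target + eps) / (output + eps)) - target + output)

# Hypothetical three-move position where the training target is an
# unnormalized "policy * implied cpuct" vector rather than a distribution.
target = np.array([1.8, 0.9, 0.3])   # sums to 3.0, i.e. an implied cpuct of ~3
good   = np.array([1.8, 0.9, 0.3])
scaled = good / 3.0                  # same direction, wrong scale

print(generalized_kl(target, good))    # ~0.0
print(generalized_kl(target, scaled))  # > 0: the scale mismatch is penalized
```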
Could this be made to work? Any ideas?