Radical idea to get rid of cpuct and softmax temperature, and put an end to the endless cpuct tests. #317

@Videodr0me

Description

Why not get rid of the softmax normalization and cpuct altogether? The idea: if policy is learned but NOT normalized, the NN would automatically learn the best cpuct indirectly via the policy itself. In the search, policy is multiplied by cpuct, so if the NN can adjust the policy's scale freely (without the imposed sum-to-one constraint), it can indirectly choose which cpuct to use. An added bonus is that the NN would learn to adapt this per position: if a cpuct of 3.0 were optimal for a given position, the NN would simply output a policy three times as high.
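To make the coupling concrete, here is a minimal sketch of AlphaZero-style PUCT selection (illustrative Python, not lc0's actual C++ code; the `Node` fields and helper names are assumptions). Because the prior and cpuct enter the exploration term only as the product `cpuct * prior`, scaling an unnormalized policy by k is indistinguishable from multiplying cpuct by k:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                 # P(s, a) as emitted by the policy head
    visits: int = 0              # N(s, a)
    total_value: float = 0.0     # W(s, a), sum of backed-up values
    children: list = field(default_factory=list)

def puct_score(parent_visits: int, child: Node, cpuct: float = 3.0) -> float:
    # Exploitation term: mean value Q(s, a) of the child so far.
    q = child.total_value / child.visits if child.visits > 0 else 0.0
    # Exploration term: cpuct and the prior appear only as a product, so an
    # unnormalized prior can absorb cpuct entirely (and vary it per position).
    u = cpuct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return q + u

def select_child(node: Node, cpuct: float = 3.0) -> Node:
    # Descend to the child maximizing Q + U, as in AlphaZero-style MCTS.
    return max(node.children, key=lambda c: puct_score(node.visits, c, cpuct))
```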

I don't know if this has already been tried and/or discussed. It might also require a small change to the policy output layer; I do not know whether any constraints are currently imposed (directly or indirectly) at that stage, apart from the sum-to-one constraint the softmax applies when the output is first fetched. It might also make learning easier or harder. Another thing that would have to change is the loss: the cross-entropy would need to be adapted or replaced so that the learning target also preserves the scale of the policy output.
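On the loss point: softmax cross-entropy is invariant to adding a constant to all logits, so the scale of the raw output is discarded entirely. One possibility (a sketch only, not something the issue proposes) is to train a non-negative raw policy with the usual cross-entropy on its normalized part plus a penalty tying its total mass to a scale target. Everything here is hypothetical, and where a per-position `target_scale` would come from is exactly the open question:

```python
import numpy as np

def scale_preserving_loss(raw_policy: np.ndarray,
                          target_probs: np.ndarray,
                          target_scale: float) -> float:
    """Hypothetical loss for an unnormalized policy head.

    raw_policy:   non-negative, NOT summing to one (its scale carries cpuct)
    target_probs: the usual sum-to-one search-visit target
    target_scale: per-position scale target -- how to obtain it is the
                  open question raised in this issue
    """
    total = raw_policy.sum()
    probs = raw_policy / total
    # Shape term: standard cross-entropy against the normalized target.
    ce = -float(np.sum(target_probs * np.log(probs + 1e-12)))
    # Scale term: penalizes deviation of the total mass from the target,
    # keeping the implicit cpuct a learnable, per-position quantity.
    scale_penalty = (np.log(total) - np.log(target_scale)) ** 2
    return ce + scale_penalty
```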

Could this be made to work? Any ideas?
