Make policy used only for initial guidance

Currently, policy has a large effect on search even when there are millions of nodes. This causes odd behaviour and can lead to significant oversights. To fix this, I propose that rather than maximizing `Q + c_puct*policy*sqrt(N / n_i)`, we instead maximize `Q + f(policy, N) + c_puct*sqrt(N/n_i)`, so that policy enters only through a separate term `f` that depends on `N`. I don't have a great idea of what `f` should be, but a reasonable first approximation is `e^-(aN+policy)`; see the graph at https://www.desmos.com/calculator/g3kmkjp5zs
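To make the comparison concrete, here is a minimal Python sketch of the two selection scores exactly as written above. It is an illustration, not engine code: the function names, the default values for `c_puct` and `a`, and the reading of `N` as the parent's visit count and `n_i` as the child's visit count (assumed to be at least 1) are all assumptions. Note that, taken literally, `e^-(aN+policy)` is smaller for larger `policy`; the sketch reproduces the expression as given rather than guessing at a different form.

```python
import math


def current_score(q, policy, n_parent, n_child, c_puct=1.5):
    """Score as written in the issue: Q + c_puct * policy * sqrt(N / n_i).

    Assumes n_child >= 1; c_puct=1.5 is an arbitrary illustrative value.
    """
    return q + c_puct * policy * math.sqrt(n_parent / n_child)


def policy_term(policy, n_parent, a=1e-4):
    """First-approximation f(policy, N) = e^-(a*N + policy).

    The decay rate a=1e-4 is a hypothetical value; the issue does not fix it.
    """
    return math.exp(-(a * n_parent + policy))


def proposed_score(q, policy, n_parent, n_child, c_puct=1.5, a=1e-4):
    """Proposed score: Q + f(policy, N) + c_puct * sqrt(N / n_i)."""
    exploration = c_puct * math.sqrt(n_parent / n_child)
    return q + policy_term(policy, n_parent, a) + exploration


if __name__ == "__main__":
    # Compare two children that differ only in policy, at a large visit count.
    n_parent, n_child = 1_000_000, 1_000
    for p in (0.1, 0.9):
        print(p,
              current_score(0.0, p, n_parent, n_child),
              proposed_score(0.0, p, n_parent, n_child))
```

With these assumed constants, the two children end up with very different scores under the current formula at a million parent visits, while under the proposed formula the policy term has decayed to practically nothing and the scores are indistinguishable. How quickly that happens depends entirely on the choice of `a`.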


Make policy used only for initial guidance #743
