
Seeking Guidance on Momentum Term Initialization in Momentum Optimizer #19

@szfmsmdx

Description


I'm currently working on the implementation of the Momentum Optimizer and have a small question regarding the initialization of the momentum term. I would greatly appreciate your insights to help clarify my understanding.


Background

In the standard momentum update formula

v(t) = β * v(t-1) + (1 - β) * grad

where β is the momentum coefficient, the lecture mentioned that the momentum term should be initialized to 0 at the first iteration, since there is no historical gradient information initially.
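
For concreteness, here is a minimal sketch of how I understand the full update step, with the momentum term starting at 0. The class and method names (MomentumSGDSketch, step) and the NumPy-based parameters are my own illustration, not the actual course API:

import numpy as np

# Minimal sketch of momentum SGD with zero initialization (hypothetical names).
class MomentumSGDSketch:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.u = {}  # maps id(param) -> accumulated momentum term v(t)

    def step(self, params_and_grads):
        for w, grad in params_and_grads:
            v_prev = self.u.get(id(w), 0.0)  # v(0) = 0: no history yet
            # v(t) = β * v(t-1) + (1 - β) * grad
            v = self.momentum * v_prev + (1 - self.momentum) * grad
            self.u[id(w)] = v
            w -= self.lr * v  # in-place update; assumes w is a mutable array

Note that on the very first call this produces v = (1 - β) * grad, since v_prev is 0.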


My Question

While experimenting with the code, I noticed that manual initialization requires explicitly checking whether the key already exists. For example:

if id(w) not in self.u:
    self.u[id(w)] = 0.0  # initialize momentum to zero, as the lecture suggests

However, this does not pass the test case. When I initialize it as

if id(w) not in self.u:
    self.u[id(w)] = (1 - self.momentum) * grad  # seed momentum with the scaled first gradient

it passes all the tests. My question is this: initializing the momentum term to (1 - self.momentum) * grad means it already contains the current gradient during the first update, rather than being 0 as the standard momentum algorithm requires. This seems to contradict the core idea of the momentum algorithm, because the momentum term should reflect the accumulation of historical gradients rather than directly introducing a scaled copy of the current gradient.
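
To make this concrete, here is a small numerical check (plain Python, hypothetical values) comparing the first-step momentum term under the two initializations:

beta = 0.9
grad = 2.0  # hypothetical first-step gradient

# Path A: initialize to 0, then apply the update formula once.
v_a = 0.0
v_a = beta * v_a + (1 - beta) * grad  # beta * 0 + (1 - beta) * grad = 0.2

# Path B: initialize directly to (1 - beta) * grad, as the passing code does.
v_b = (1 - beta) * grad  # also 0.2

assert v_a == v_b  # identical after one full step

Since both paths give the same value after one full step, I suspect the zero initialization only fails the tests if the first update is skipped or ordered differently once the key is created, but I am not sure whether that is the intended behavior.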

If anyone sees this and is willing to answer my question, I would be very grateful. Thank you very much!
