Description
I'm currently working on the implementation of the Momentum Optimizer and have a small question regarding the initialization of the momentum term. I would greatly appreciate your insights to help clarify my understanding.
Background
In the standard momentum update formula:
v(t) = β * v(t-1) + (1 - β) * grad
where β is the momentum coefficient, the lecture mentioned that the momentum term should be initialized to 0 at the first iteration, since there is no historical gradient information initially.
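To make the first-step behavior concrete, here is a minimal sketch (the values of `beta` and `grad` are illustrative, not from the assignment) showing what one application of the update formula produces when starting from v = 0:

```python
beta = 0.9   # momentum coefficient (illustrative value)
grad = 2.0   # current gradient (illustrative value)

v = 0.0                             # v(0): no historical gradients yet
v = beta * v + (1 - beta) * grad    # first update: v(1)

# After the first update, v equals (1 - beta) * grad,
# since the beta * v(0) term contributes nothing.
assert v == (1 - beta) * grad
```

So starting from 0 and then applying the standard update once gives exactly (1 − β) · grad at the end of the first iteration.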
My Question
While experimenting with the code, I noticed that manual initialization requires explicitly checking whether a key exists. For example:

```python
if id(w) not in self.u:
    self.u[id(w)] = 0.0
```

However, this does not pass the test case. When I instead initialize it as

```python
if id(w) not in self.u:
    self.u[id(w)] = (1 - self.momentum) * grad
```

it passes all the tests. My question is: with this initialization, the momentum term directly contains the current gradient during the first update, rather than 0 as required by the standard momentum algorithm. This seems to contradict the core idea of momentum, because the momentum term should reflect the accumulation of historical gradients rather than directly introducing a scaled copy of the current gradient.
If anyone sees this and is willing to answer my question, I would be very grateful. Thank you very much!