Would SGD with momentum or NAG be a good fit for gradient masking?

I'm wondering whether it makes sense to try this with SGD, since in theory it should behave better than optimizers that track both first and second moments, such as Adam. Thoughts?
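
To make the question concrete, here is a minimal sketch of the kind of update I mean. It assumes "gradient masking" means multiplying the gradient element-wise by a fixed 0/1 mask before the optimizer step, which may not match the actual implementation. The function `masked_sgd_step`, the toy quadratic objective, and the hyperparameters are all hypothetical and only for illustration; the Nesterov variant follows the common "gradient plus momentum times buffer" formulation.

```python
def masked_sgd_step(params, grads, mask, velocity,
                    lr=0.1, momentum=0.9, nesterov=False):
    """One SGD step with momentum (or NAG) on element-wise masked gradients.

    Assumption: masking is a plain element-wise product with a 0/1 mask.
    """
    new_params, new_velocity = [], []
    for p, g, m, v in zip(params, grads, mask, velocity):
        g = g * m                 # masked entries contribute a zero gradient
        v = momentum * v + g      # momentum buffer accumulates masked gradients
        # NAG looks ahead by adding the scaled buffer to the current gradient.
        step = g + momentum * v if nesterov else v
        new_params.append(p - lr * step)
        new_velocity.append(v)
    return new_params, new_velocity


# Toy objective f(x) = 0.5 * sum(x_i ** 2), so the gradient equals x.
params = [1.0, -2.0, 3.0]
mask = [1.0, 0.0, 1.0]        # the second parameter is masked out
velocity = [0.0, 0.0, 0.0]

for _ in range(50):
    grads = list(params)
    params, velocity = masked_sgd_step(
        params, grads, mask, velocity, nesterov=True
    )

print(params)
```

With a mask that is fixed from the first step, the buffer of a masked parameter never leaves zero, so that parameter stays at its initial value. If the mask changes during training, a nonzero buffer would keep moving a newly masked parameter until it decays, so momentum is not automatically "mask-safe" either.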