The Modified Huber Loss Op can be represented by the following function:
    L(g) = max(0, 1 - g)^2,  if g >= -1
           -4g,              otherwise

where g = y * f(x).
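For reference, a minimal NumPy sketch of this function; the names `y`, `fx`, and `modified_huber_loss` are illustrative, not the op's actual API:

```python
import numpy as np

def modified_huber_loss(y, fx):
    # g = y * f(x); quadratic piece on [-1, +inf), linear piece on (-inf, -1)
    g = y * fx
    return np.where(g >= -1.0, np.maximum(0.0, 1.0 - g) ** 2, -4.0 * g)
```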
This is a piecewise function defined over three distinct ranges: (-∞, -1), [-1, 1), and [1, +∞). On [1, +∞) the loss is constantly zero.
Although the three pieces join their neighbors smoothly, the gradient changes at different rates on the two sides of the junctions -1 and 1. Near those junctions, the gradient estimated numerically in Python (by finite differences) and the analytic gradient computed in C++ can therefore differ considerably.
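The mismatch can be seen in a small sketch (function names are illustrative): a central finite difference that straddles the junction at 1 picks up the change in curvature and deviates from the analytic gradient.

```python
import numpy as np

def loss(g):
    # Piecewise loss from the definition above
    return np.where(g >= -1.0, np.maximum(0.0, 1.0 - g) ** 2, -4.0 * g)

def analytic_grad(g):
    # dL/dg: -2 * (1 - g) on [-1, 1), 0 on [1, +inf), -4 on (-inf, -1)
    return np.where(g >= -1.0, -2.0 * np.maximum(0.0, 1.0 - g), -4.0)

eps = 1e-3
g = 1.0 - eps / 4.0                              # just left of the junction at 1
numeric = (loss(g + eps / 2) - loss(g - eps / 2)) / eps  # straddles the junction
print(analytic_grad(g), numeric)                 # -0.0005 vs about -0.0005625
```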
This difference is then divided by the gradient itself, to turn the error estimate into a relative one. Near the junction at 1, however, the gradient itself is very close to zero, so a considerable difference divided by a near-zero value ultimately yields a large error.
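Continuing the sketch above, the denominator of the relative error is itself nearly zero just left of g = 1:

```python
eps = 1e-3
g = 1.0 - eps / 4.0                                 # just left of the junction
analytic = -2.0 * (1.0 - g)                         # = -eps / 2, nearly zero
numeric = (0.0 - (1.0 - (g - eps / 2)) ** 2) / eps  # loss is 0 right of g = 1
rel_err = abs(numeric - analytic) / abs(analytic)   # tiny diff / tiny gradient
print(rel_err)  # 0.125 here; moving g even closer to 1 makes it grow further
```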
A simple solution is to keep every g used in our unit tests far enough away from 1.
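A hypothetical sketch of such test-data generation; the band width 0.05 and all names here are assumptions, not the test suite's real values:

```python
import numpy as np

def sample_margins(n, gap=0.05, seed=0):
    # Draw margins g uniformly, resampling any that land within `gap` of 1
    rng = np.random.default_rng(seed)
    g = rng.uniform(-3.0, 3.0, size=n)
    near = np.abs(g - 1.0) < gap
    while near.any():
        g[near] = rng.uniform(-3.0, 3.0, size=int(near.sum()))
        near = np.abs(g - 1.0) < gap
    return g
```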