This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Embedding Backward (AddTakeGradLargeBatchCaller) non-deterministic nan values #11314

@leezu

Description

The AddTakeGradLargeBatchCaller operator, called during the backward pass of Embedding, is broken and produces nan values at random positions in the gradient array.

Environment info (Required)

  • CUDA 9.0 or CUDA 9.2 with the respective mxnet-cu90 / mxnet-cu92 prebuilt binaries (both v1.2 and nightly are affected)
  • EC2 p3.2xlarge or p2.xlarge instance

While it occurs only rarely with CUDA 9.0 or on a p2.xlarge instance, it occurs almost always with CUDA 9.2 on a p3.2xlarge.

Minimum reproducible example

import mxnet as mx
import numpy as np

N = 50000
ctx = mx.gpu()

embedding = mx.gluon.nn.Embedding(N, 300)
embedding.initialize(ctx=ctx)
i = 0
np.random.seed(1)
# Fixed batch of indices; the exact same data is reused every iteration.
idx = mx.nd.array(np.random.randint(0, N, size=(1024, 160)), ctx=ctx)

got_nan = False
while True:
    i += 1
    with mx.autograd.record():
        emb_in = embedding(idx)
        loss = emb_in.sum()
    loss.backward()

    grad = embedding.weight.grad().asnumpy()
    if not np.all(np.isfinite(grad)):
        # Report the (row, column) positions of the non-finite entries.
        nan_rows, nan_cols = np.where(~np.isfinite(grad))
        print(f'Got nan {i}\tRetrying with same data. '
              f'(Affected indices: {nan_rows.tolist()}, {nan_cols.tolist()}).')
        got_nan = True
    else:
        if got_nan:  # We got nan before and it disappeared now
            print(f'nan disappeared in {i}..')
            break

    if i % 100 == 0:
        print(f'{i}')
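
Note that this reproduction necessarily exercises the AddTakeGradLargeBatchCaller path: EmbeddingOpBackward dispatches to it whenever the flattened output or weight gradient reaches 16384 elements (the threshold visible in the patch below). A back-of-the-envelope check for the shapes above, assuming the internal shape products match the array sizes:

# Sketch of the dispatch condition in EmbeddingOpBackward (see the
# 16384-element threshold in the patch below). The exact internal
# shape products are an assumption based on the array sizes above.
N, dim = 50000, 300       # embedding weight -> grad_in: (N, dim)
batch, seq = 1024, 160    # indices -> grad_out: (batch * seq, dim)

shape_in_prod = N * dim             # 15,000,000
shape_out_prod = batch * seq * dim  # 49,152,000

# Both products far exceed 16384, so the large-batch kernel is used.
uses_large_batch = not (shape_out_prod < 16384 and shape_in_prod < 16384)
print(uses_large_batch)  # True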

Steps to reproduce

Run the above script with CUDA 9.2 and observe very frequent nan values:

% python debug_embedding_nan.py
Got nan 3       Retrying with same data. (Affected indices: [14721, 14721], [1, 2]).
Got nan 4       Retrying with same data. (Affected indices: [20, 20, 39, 39, 18232, 18232], [257, 258, 1, 2, 1, 2]).
Got nan 5       Retrying with same data. (Affected indices: [20, 20, 71, 33346, 38015], [257, 258, 258, 130, 130]).
Got nan 6       Retrying with same data. (Affected indices: [20, 20], [257, 258]).
nan disappeared in 7..
% python debug_embedding_nan.py 
Got nan 7       Retrying with same data. (Affected indices: [20, 20, 33, 71, 71, 71, 71, 71], [257, 258, 1, 1, 2, 129, 130, 258]).
nan disappeared in 8..
% python debug_embedding_nan.py
Got nan 1       Retrying with same data. (Affected indices: [1489], [129]).
Got nan 2       Retrying with same data. (Affected indices: [42581, 42581], [257, 258]).
nan disappeared in 3..

Run the above script with CUDA 9.0 and observe (infrequent) nan values:

100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
Got nan 1461    Retrying with same data. (Affected indices: [3254], [2]).
nan disappeared in 1462..

What have you tried to solve it?

  1. Apply the following patch, which adds an MXNET_FORCE_ADDTAKEGRAD environment variable that forces the AddTakeGrad path:
From 3fd91f0078e70cf990ce1549081c03cfb50292ad Mon Sep 17 00:00:00 2001
From: Leonard Lausen <leonard@lausen.nl>
Date: Fri, 15 Jun 2018 18:45:39 +0000
Subject: [PATCH] MXNET_FORCE_ADDTAKEGRAD to disable
 AddTakeGradLargeBatchCaller

If MXNET_FORCE_ADDTAKEGRAD is set, EmbeddingOpBackward will always use
AddTakeGrad independently of gradient input and output shape
---
 src/operator/tensor/indexing_op.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/operator/tensor/indexing_op.h b/src/operator/tensor/indexing_op.h
index 87381960e..d3a1bdfd6 100644
--- a/src/operator/tensor/indexing_op.h
+++ b/src/operator/tensor/indexing_op.h
@@ -598,7 +598,11 @@ void EmbeddingOpBackward(const nnvm::NodeAttrs& attrs,
         uint64_t shape_out_prod =
           static_cast<uint64_t>(grad_out.shape_[0])*
           static_cast<uint64_t>(grad_out.shape_[1]);
-        if (shape_out_prod < (uint64_t)16384 && shape_in_prod < (uint64_t)16384) {
+
+        const char *type = getenv("MXNET_FORCE_ADDTAKEGRAD");
+        const bool default_addtakegrad = (type == nullptr);
+
+        if (!default_addtakegrad || ( shape_out_prod < (uint64_t)16384 && shape_in_prod < (uint64_t)16384 )) {
           AddTakeGrad(grad_in, data, grad_out);
         } else {
           AddTakeGradLargeBatchCaller(ctx, grad_in, data, grad_out);
--
2.17.1
  2. Run the above script with MXNET_FORCE_ADDTAKEGRAD=1 exported.
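
Since the patched EmbeddingOpBackward reads the variable via getenv() on every call, it can also be set from Python instead of the shell; a minimal sketch, assuming the patched build:

import os

# Assumes a build with the patch above. os.environ updates the process
# environment, so the getenv() call in the patched EmbeddingOpBackward
# sees the variable and always takes the AddTakeGrad path.
os.environ['MXNET_FORCE_ADDTAKEGRAD'] = '1'

import mxnet as mx  # proceed with the reproduction script as before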
