This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Embedding Backward (AddTakeGradLargeBatchCaller) non-deterministic nan values #11314

@leezu

Description

The AddTakeGradLargeBatchCaller operator, called during the backward pass of Embedding, is broken and produces nan values at random positions in the gradient array.

Environment info (Required)

  • CUDA 9.0 or CUDA 9.2 with the respective mxnet-cu90 / mxnet-cu92 prebuilt binaries (both v1.2 and nightly are affected)
  • EC2 p3.2xlarge or p2.xlarge instance

While it occurs only rarely with CUDA 9.0 or on a p2.xlarge instance, it occurs almost always with CUDA 9.2 on a p3.2xlarge.

Minimum reproducible example

import mxnet as mx
import numpy as np

N = 50000
ctx = mx.gpu()

embedding = mx.gluon.nn.Embedding(N, 300)
embedding.initialize(ctx=ctx)
i = 0
np.random.seed(1)
# Fixed batch of indices; the exact same data is reused every iteration.
idx = mx.nd.array(np.random.randint(0, N, size=(1024, 160)), ctx=ctx)

got_nan = False
while True:
    i += 1
    with mx.autograd.record():
        emb_in = embedding(idx)
        loss = emb_in.sum()
    loss.backward()

    grad = embedding.weight.grad().asnumpy()
    if not np.all(np.isfinite(grad)):
        # Report the (row, column) positions of the non-finite entries.
        nan_rows, nan_cols = np.where(~np.isfinite(grad))
        print(f'Got nan {i}\tRetrying with same data. '
              f'(Affected indices: {nan_rows.tolist()}, {nan_cols.tolist()}).')
        got_nan = True
    else:
        if got_nan:  # We got nan before and it disappeared now
            print(f'nan disappeared in {i}..')
            break

    if i % 100 == 0:
        print(f'{i}')
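
Note that this reproduction necessarily exercises the AddTakeGradLargeBatchCaller path: EmbeddingOpBackward dispatches to it whenever the flattened output or weight gradient reaches 16384 elements (the threshold visible in the patch below). A back-of-the-envelope check for the shapes above, assuming the internal shape products match the array sizes:

# Sketch of the dispatch condition in EmbeddingOpBackward (see the
# 16384-element threshold in the patch below). The exact internal
# shape products are an assumption based on the array sizes above.
N, dim = 50000, 300       # embedding weight -> grad_in: (N, dim)
batch, seq = 1024, 160    # indices -> grad_out: (batch * seq, dim)

shape_in_prod = N * dim             # 15,000,000
shape_out_prod = batch * seq * dim  # 49,152,000

# Both products far exceed 16384, so the large-batch kernel is used.
uses_large_batch = not (shape_out_prod < 16384 and shape_in_prod < 16384)
print(uses_large_batch)  # True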

Steps to reproduce

Run the above script with CUDA 9.2 and observe very frequent nan values:

% python debug_embedding_nan.py
Got nan 3       Retrying with same data. (Affected indices: [14721, 14721], [1, 2]).
Got nan 4       Retrying with same data. (Affected indices: [20, 20, 39, 39, 18232, 18232], [257, 258, 1, 2, 1, 2]).
Got nan 5       Retrying with same data. (Affected indices: [20, 20, 71, 33346, 38015], [257, 258, 258, 130, 130]).
Got nan 6       Retrying with same data. (Affected indices: [20, 20], [257, 258]).
nan disappeared in 7..
% python debug_embedding_nan.py 
Got nan 7       Retrying with same data. (Affected indices: [20, 20, 33, 71, 71, 71, 71, 71], [257, 258, 1, 1, 2, 129, 130, 258]).
nan disappeared in 8..
% python debug_embedding_nan.py
Got nan 1       Retrying with same data. (Affected indices: [1489], [129]).
Got nan 2       Retrying with same data. (Affected indices: [42581, 42581], [257, 258]).
nan disappeared in 3..

Run the above script with CUDA 9.0 and observe (infrequent) nan values:

100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
Got nan 1461    Retrying with same data. (Affected indices: [3254], [2]).
nan disappeared in 1462..

What have you tried to solve it?

  1. Apply the following patch, which adds an MXNET_FORCE_ADDTAKEGRAD environment variable that forces the AddTakeGrad path:
From 3fd91f0078e70cf990ce1549081c03cfb50292ad Mon Sep 17 00:00:00 2001
From: Leonard Lausen <leonard@lausen.nl>
Date: Fri, 15 Jun 2018 18:45:39 +0000
Subject: [PATCH] MXNET_FORCE_ADDTAKEGRAD to disable
 AddTakeGradLargeBatchCaller

If MXNET_FORCE_ADDTAKEGRAD is set, EmbeddingOpBackward will always use
AddTakeGrad independently of gradient input and output shape
---
 src/operator/tensor/indexing_op.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/operator/tensor/indexing_op.h b/src/operator/tensor/indexing_op.h
index 87381960e..d3a1bdfd6 100644
--- a/src/operator/tensor/indexing_op.h
+++ b/src/operator/tensor/indexing_op.h
@@ -598,7 +598,11 @@ void EmbeddingOpBackward(const nnvm::NodeAttrs& attrs,
         uint64_t shape_out_prod =
           static_cast<uint64_t>(grad_out.shape_[0])*
           static_cast<uint64_t>(grad_out.shape_[1]);
-        if (shape_out_prod < (uint64_t)16384 && shape_in_prod < (uint64_t)16384) {
+
+        const char *type = getenv("MXNET_FORCE_ADDTAKEGRAD");
+        const bool default_addtakegrad = (type == nullptr);
+
+        if (!default_addtakegrad || ( shape_out_prod < (uint64_t)16384 && shape_in_prod < (uint64_t)16384 )) {
           AddTakeGrad(grad_in, data, grad_out);
         } else {
           AddTakeGradLargeBatchCaller(ctx, grad_in, data, grad_out);
--
2.17.1
  2. Run the above script with MXNET_FORCE_ADDTAKEGRAD=1 exported.
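
Since the patched EmbeddingOpBackward reads the variable via getenv() on every call, it can also be set from Python instead of the shell; a minimal sketch, assuming the patched build:

import os

# Assumes a build with the patch above. os.environ updates the process
# environment, so the getenv() call in the patched EmbeddingOpBackward
# sees the variable and always takes the AddTakeGrad path.
os.environ['MXNET_FORCE_ADDTAKEGRAD'] = '1'

import mxnet as mx  # proceed with the reproduction script as before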
