With the NCCL_LAUNCH_MODE=PARALLEL environment variable set, multi-GPU runs hang easily #16272

@ccmeteorljh

Description

Paddle version: 1.3.0

+-------------------------------+----------------------+----------------------+
|   4  Tesla P40           Off  | 00000000:49:00.0 Off |                    0 |
| N/A   41C    P0    57W / 250W |  21729MiB / 22912MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P40           Off  | 00000000:4A:00.0 Off |                    0 |
| N/A   38C    P0    57W / 250W |  21729MiB / 22912MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P40           Off  | 00000000:4B:00.0 Off |                    0 |
| N/A   38C    P0    58W / 250W |  21729MiB / 22912MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P40           Off  | 00000000:4C:00.0 Off |                    0 |
| N/A   38C    P0    51W / 250W |  21729MiB / 22912MiB |      0%      Default |

The hang occurs with the transformer, image_classification, and deeplabv3+ models.
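For anyone trying to reproduce this, here is a minimal sketch of how the two settings could be compared with a timeout, so that a hang shows up as a timeout instead of a stuck shell. This is only an illustration and not the setup used in the report: `build_env` and `run_with_timeout` are hypothetical helper names, `train.py` is a placeholder for the actual training script, and the device list `4,5,6,7` is an assumption based on the GPUs visible in the nvidia-smi output above.

```python
import os
import subprocess
import sys


def build_env(launch_mode=None, gpus="4,5,6,7"):
    """Return a copy of the current environment for a multi-GPU run.

    launch_mode: "PARALLEL", "GROUP", or None to leave NCCL_LAUNCH_MODE unset.
    gpus: assumed device list (hypothetical default, adjust to your machine).
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = gpus
    if launch_mode is None:
        # Remove the variable so NCCL falls back to its default behaviour.
        env.pop("NCCL_LAUNCH_MODE", None)
    else:
        env["NCCL_LAUNCH_MODE"] = launch_mode
    return env


def run_with_timeout(cmd, env, timeout_s):
    """Run cmd and return (hung, returncode).

    hung is True when the process did not finish within timeout_s seconds;
    subprocess.run kills the child in that case and returncode is None.
    """
    try:
        proc = subprocess.run(cmd, env=env, timeout=timeout_s)
        return False, proc.returncode
    except subprocess.TimeoutExpired:
        return True, None


# Example usage (train.py is a hypothetical placeholder script):
#
#   cmd = [sys.executable, "train.py"]
#   for mode in ("PARALLEL", None):
#       hung, rc = run_with_timeout(cmd, build_env(mode), timeout_s=3600)
#       print(mode, "hung" if hung else "exit code %s" % rc)
```

A timeout only suggests a hang; the timeout value has to be chosen well above the normal run time of the job for the comparison to mean anything.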
