Pass device_ids info from launch to trainer. #30632

Merged
gongweibao merged 9 commits into PaddlePaddle:ascendrc from gongweibao:addinitascend on Jan 21, 2021

@gongweibao
Contributor

PR types

Others

PR changes

Others

Describe

Pass device_ids info from launch to trainer.
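
One common way for a launcher to hand device assignments to a trainer process is through an environment variable. The sketch below illustrates only that general idea; the variable name `EXAMPLE_DEVICE_IDS` and both helper functions are hypothetical and are not taken from this PR or from Paddle's API.

```python
import os
from typing import Dict, List, Mapping, Optional

# Hypothetical variable name, used for illustration only.
DEVICE_IDS_ENV = "EXAMPLE_DEVICE_IDS"


def build_trainer_env(device_ids: List[int],
                      base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Launcher side: copy the environment and add the device list."""
    env = dict(os.environ if base_env is None else base_env)
    env[DEVICE_IDS_ENV] = ",".join(str(i) for i in device_ids)
    return env


def read_device_ids(env: Optional[Mapping[str, str]] = None) -> List[int]:
    """Trainer side: parse the device list, or return [] if it is unset."""
    env = os.environ if env is None else env
    raw = env.get(DEVICE_IDS_ENV, "")
    return [int(part) for part in raw.split(",") if part.strip()]
```

A launcher could pass the returned mapping as the `env` argument when spawning the trainer, for example with `subprocess.Popen`.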

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

rank=fleet.worker_index
nranks=fleet.worker_num
world_size=fleet.worker_num
current_worker_accelerator_id=fleet.current_worker_accelerator_id
Contributor

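Isn't this API name too long and obscure? Does it refer to the IDs of the cards when one process uses several cards, or to the ID within the machine when each process uses one card?
For reference, consider the Open MPI environment variables: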

OMPI_COMM_WORLD_SIZE
OMPI_COMM_WORLD_LOCAL_SIZE
OMPI_COMM_WORLD_RANK
OMPI_COMM_WORLD_LOCAL_RANK
OMPI_COMM_WORLD_NODE_RANK

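I think we could use the following names (or discuss a better scheme):

world_size | nranks (total number of processes)
local_size (number of processes on the node)
rank (rank of the process)
local_rank (rank of the process within its node)
node_rank (rank of the node the process runs on)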
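
To make the proposed names concrete, here is a minimal sketch of how they relate to one another. It assumes every node runs the same number of processes and that global ranks are assigned node by node; `ProcessInfo` and `describe_process` are hypothetical names, not Paddle or Open MPI APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessInfo:
    world_size: int   # total number of processes
    local_size: int   # number of processes on this node
    rank: int         # global rank of the process
    local_rank: int   # rank of the process within its node
    node_rank: int    # rank of the node the process runs on


def describe_process(rank: int, procs_per_node: int, num_nodes: int) -> ProcessInfo:
    """Derive the per-process values from a global rank.

    Assumption: homogeneous nodes, ranks laid out node by node.
    """
    world_size = procs_per_node * num_nodes
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    node_rank, local_rank = divmod(rank, procs_per_node)
    return ProcessInfo(world_size, procs_per_node, rank, local_rank, node_rank)
```

Under these assumptions, global rank 5 on two nodes with four processes each lands on node 1 with local rank 1.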

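In addition, if one process holds multiple devices, that may need to be supported as well (or does it? Adding this concept makes things somewhat more complicated).

world_device_size (total number of devices)
local_device_size (number of devices on the node)
device_size (number of devices in one process)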
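
The three device counts can be sketched the same way. This is only an illustration of the proposed naming: the nested-list `layout` (nodes, then processes, then device IDs) and the function `device_sizes` are hypothetical and not part of Paddle.

```python
from typing import Dict, List

# layout[node_rank][local_rank] is the list of device IDs owned by that process.
Layout = List[List[List[int]]]


def device_sizes(layout: Layout, node_rank: int, local_rank: int) -> Dict[str, int]:
    """Count devices at process, node, and world scope for one process."""
    node = layout[node_rank]
    return {
        # devices held by this one process
        "device_size": len(node[local_rank]),
        # devices held by all processes on the same node
        "local_device_size": sum(len(proc) for proc in node),
        # devices held by all processes on all nodes
        "world_device_size": sum(len(proc) for n in layout for proc in n),
    }
```

For example, with node 0 running two processes that hold two devices each and node 1 running three single-device processes, a process on node 1 sees `device_size` 1, `local_device_size` 3, and `world_device_size` 7.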

Contributor Author

Done.

nranks=fleet.worker_num
world_size=fleet.worker_num
current_worker_accelerator_id=fleet.current_worker_accelerator_id
worker_accelerator_ids=fleet.worker_accelerator_ids
Contributor

How about device_ids for this one?

Contributor Author

A CPU is also a device.

@gongweibao gongweibao merged commit f5aca8f into PaddlePaddle:ascendrc Jan 21, 2021
frankwhzhang added a commit that referenced this pull request Apr 7, 2021
* Ascend rc (#30483)

* Fix compilcation on CANN20.1 and older (#30494)

Fix compilcation on CANN20.1 and older

* Add distribution supported (#30578)

Add distribution supported

* Build praser for Hcom* operators (#30627)

Build praser for Hcom* operators

* Pass device_ids info from launch to trainer. (#30632)

Pass device_ids info from launch to trainer

* Add Hccl program group (#30642)

Add Hccl program group

* Add startup bash files of test_ascend_group. (#30645)

Add startup bash files of test_ascend_group

* cleanup (#30646)

cleanup test_ascend_group.py

* [Feature] Build parser to support distributed training (#30658)

[Feature] Build parser to support distributed training

* fix compilation on ascend-20.1 (#30722)

fix compilation on ascend-20.1

* Dev/fix ascend string (#30749)

Dev/fix ascend string

* code style (#30781)

code style

* Merge ascend_optimizer and ascend_parser. (#30776)

Merge ascend_optimizer and ascend_parser.

* Ascendrc add converted op : [range/equal/range/uniform_random/expand/squeeze], fix cast op bug  (#30797)

Ascendrc add converted op : [range/equal/range/uniform_random/expand/squeeze], fix cast op bug

* Add paddle ascend distribution training supported (#30796)

Add paddle ascend distribution training supported

* pass cxx_flags to gloo cmake (#30857)

* Destroy session first. (#30954)

Destroy session first.

* merge

* fix, test=develop

* fix, test=develop

* fix style, test=develop

* fix, test=develop

* fix

* fix log fatal, test=develop

* fix enforce style, test=develop

* fix, test=develop

* fix, test=develop

* fix rccl, test=develop

* fix test, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix node_num, test=develop

* fix ids str, test=develop

* fix ids str, test=develop

* fix ids str, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix, test=develop

* fix style code, test=develop

* fix style code, test=develop

* fix style code, test=develop

* fix style code, test=develop

Co-authored-by: hutuxian
Co-authored-by: gongweibao
Co-authored-by: Void Main
Co-authored-by: Leo Chen
Co-authored-by: dingsiyu
Co-authored-by: OleNet