Pass device_ids info from launch to trainer. #30632
Merged
gongweibao merged 9 commits into PaddlePaddle:ascendrc on Jan 21, 2021
Thanks for your contribution!
added 2 commits
January 21, 2021 17:46
rank=fleet.worker_index
nranks=fleet.worker_num
world_size=fleet.worker_num
current_worker_accelerator_id=fleet.current_worker_accelerator_id
Contributor
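Isn't this API name too long and too obscure? Does it mean the IDs of the cards when one process drives several cards, or the ID within the machine when one process drives a single card?
For reference, see the OpenMPI environment variables:
OMPI_COMM_WORLD_SIZE
OMPI_COMM_WORLD_LOCAL_SIZE
OMPI_COMM_WORLD_RANK
OMPI_COMM_WORLD_LOCAL_RANK
OMPI_COMM_WORLD_NODE_RANK
I think we could name things as follows (or discuss a better scheme):
world_size | nranks (total number of processes)
local_size (number of processes on the node)
rank (rank of the process)
local_rank (rank of the process within its node)
node_rank (rank of the node the process runs on)
In addition, a single process may hold multiple devices, and that may need to be supported too (or does it need to be supported at all? Adding this concept makes things a bit complicated).
world_device_size (total number of devices)
local_device_size (number of devices on the node)
device_size (number of devices in one process)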
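To make the naming proposal above concrete, here is a minimal sketch of how the five process-level fields could be populated from the OpenMPI variables the comment lists. The `ProcessTopology` class and `topology_from_ompi_env` function are hypothetical names chosen for illustration; they are not part of fleet or of this PR, and the device-level fields are left out because the comment itself leaves them open.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ProcessTopology:
    """Hypothetical container mirroring the names proposed in the review."""
    world_size: int   # total number of processes
    local_size: int   # number of processes on this node
    rank: int         # global rank of this process
    local_rank: int   # rank of this process within its node
    node_rank: int    # rank of the node this process runs on


def topology_from_ompi_env(env: Optional[Mapping[str, str]] = None) -> ProcessTopology:
    """Read the OpenMPI variables named in the comment from `env`.

    Falls back to os.environ when no mapping is given. Raises KeyError
    if a variable is missing, e.g. when not started by an OpenMPI launcher.
    """
    env = os.environ if env is None else env
    return ProcessTopology(
        world_size=int(env["OMPI_COMM_WORLD_SIZE"]),
        local_size=int(env["OMPI_COMM_WORLD_LOCAL_SIZE"]),
        rank=int(env["OMPI_COMM_WORLD_RANK"]),
        local_rank=int(env["OMPI_COMM_WORLD_LOCAL_RANK"]),
        node_rank=int(env["OMPI_COMM_WORLD_NODE_RANK"]),
    )
```

Passing an explicit mapping instead of relying on `os.environ` keeps the function easy to exercise outside an MPI job.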
nranks=fleet.worker_num
world_size=fleet.worker_num
current_worker_accelerator_id=fleet.current_worker_accelerator_id
worker_accelerator_ids=fleet.worker_accelerator_ids
frankwhzhang added a commit that referenced this pull request on Apr 7, 2021
* Ascend rc (#30483)
* Fix compilcation on CANN20.1 and older (#30494)
* Add distribution supported (#30578)
* Build praser for Hcom* operators (#30627)
* Pass device_ids info from launch to trainer. (#30632)
* Add Hccl program group (#30642)
* Add startup bash files of test_ascend_group. (#30645)
* cleanup (#30646): cleanup test_ascend_group.py
* [Feature] Build parser to support distributed training (#30658)
* fix compilation on ascend-20.1 (#30722)
* Dev/fix ascend string (#30749)
* code style (#30781)
* Merge ascend_optimizer and ascend_parser. (#30776)
* Ascendrc add converted op : [range/equal/range/uniform_random/expand/squeeze], fix cast op bug (#30797)
* Add paddle ascend distribution training supported (#30796)
* pass cxx_flags to gloo cmake (#30857)
* Destroy session first. (#30954)
* merge * fix, test=develop * fix, test=develop * fix style, test=develop * fix, test=develop * fix * fix log fatal, test=develop * fix enforce style, test=develop * fix, test=develop * fix, test=develop * fix rccl, test=develop * fix test, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix node_num, test=develop * fix ids str, test=develop * fix ids str, test=develop * fix ids str, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix, test=develop * fix style code, test=develop * fix style code, test=develop * fix style code, test=develop * fix style code, test=develop Co-authored-by: hutuxian <[email protected]> Co-authored-by: gongweibao <[email protected]> Co-authored-by: Void Main <[email protected]> Co-authored-by: Leo Chen <[email protected]> Co-authored-by: dingsiyu <[email protected]> Co-authored-by: OleNet <[email protected]>
PR types
Others
PR changes
Others
Describe
Pass device_ids info from launch to trainer.
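
As an illustration of the general idea only, the sketch below shows one common way a launcher can hand a list of device IDs to a trainer subprocess: encode them into an environment variable and decode them on the trainer side. The variable name `EXAMPLE_DEVICE_IDS` and all function names are hypothetical; this page does not show which mechanism or names the PR actually uses.

```python
import os
import subprocess
import sys
from typing import Dict, List, Mapping, Optional, Sequence

# Hypothetical variable name, used only for this illustration.
DEVICE_IDS_ENV = "EXAMPLE_DEVICE_IDS"


def encode_device_ids(device_ids: Sequence[int]) -> str:
    """Serialize device IDs as a comma-separated string, e.g. '0,1,3'."""
    return ",".join(str(i) for i in device_ids)


def decode_device_ids(value: str) -> List[int]:
    """Parse the comma-separated string back into a list of ints."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_trainer_env(device_ids: Sequence[int],
                      base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Launcher side: copy the base environment and add the device IDs."""
    env = dict(os.environ if base_env is None else base_env)
    env[DEVICE_IDS_ENV] = encode_device_ids(device_ids)
    return env


def launch_trainer(trainer_script: str, device_ids: Sequence[int]) -> subprocess.Popen:
    """Launcher side: start the trainer with the device IDs in its environment."""
    return subprocess.Popen([sys.executable, trainer_script],
                            env=build_trainer_env(device_ids))


def device_ids_in_trainer(env: Optional[Mapping[str, str]] = None) -> List[int]:
    """Trainer side: recover the device IDs set by the launcher (empty if unset)."""
    env = os.environ if env is None else env
    return decode_device_ids(env.get(DEVICE_IDS_ENV, ""))
```

An environment variable is only one possible transport; command-line arguments or a shared config file would serve the same purpose.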