Add interface to launch parallel dygraph by multiprocessing#26044
chenwhql merged 38 commits into PaddlePaddle:develop from
Conversation
… dygraph/add_multiprocess_run_interface
`ParallelStrategy = core.ParallelStrategy`

`def init_parallel_env(backend='nccl'):`
NCCL is an underlying communication library; I don't think it's necessary to let users know we have different backends here. If we want to support operating systems such as Windows that don't support NCCL, it's better to detect the operating system inside the init function and fall back to another communication library, such as Gloo. I highly recommend removing the backend argument for now, for simplicity of usage.
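The auto-detection the reviewer suggests could look roughly like this. This is a hypothetical sketch, not Paddle's actual implementation: the helper name `detect_backend` and the caller-supplied `cuda_available` flag are assumptions for illustration.

```python
import sys

# Hypothetical sketch (not Paddle's actual implementation): pick the
# communication backend from the runtime environment instead of exposing
# a `backend` argument to users. NCCL needs Linux with CUDA devices;
# Gloo covers CPU-only setups and platforms such as Windows.
def detect_backend(cuda_available, platform=None):
    """Return 'nccl' when usable, otherwise fall back to 'gloo'."""
    platform = platform if platform is not None else sys.platform
    if platform.startswith("linux") and cuda_available:
        return "nccl"
    return "gloo"

print(detect_backend(cuda_available=True, platform="linux"))   # nccl
print(detect_backend(cuda_available=False, platform="win32"))  # gloo
```

With a check like this inside the init function, users never need to choose a backend themselves, which is the simplification the review asks for.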
Thanks, I think it is okay to remove it; we can discuss removing this argument via a cherry-pick.
guru4elephant
left a comment
Please remove the backend argument for simplicity.
Thanks for the feedback; there indeed should be one. Can I provide a performance report later? The development time for this interface was quite short: we spent most of the past week discussing and iterating on the interface design, and it has to ship with the 2.0-beta release, so we have only verified correctness and have not had time to run performance comparisons yet. In theory this interface is no different from launch; it only switches to a multiprocessing start method without adding any extra implementation, so there should be no performance gap. It is also just an optional start method and does not affect the existing usage of launch.
PR types
New features
PR changes
APIs
Describe
This PR adds multiprocessing start methods `start_processes` and `spawn` for dygraph data parallel training.

1. Start method difference

- `launch`: `python -m paddle.distributed.launch --selected_gpus=0,1 train.py`
- `spawn`: `python train.py`, calling `spawn` in the `__main__` method, for example:

2. Simple example
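The example snippet from the original PR description is not preserved in this page. As an illustration of the spawn launch pattern (process creation guarded by `__main__`), here is a runnable sketch that uses Python's stdlib `multiprocessing` instead of `paddle.distributed.spawn`; the `train` worker and the queue are stand-ins for real training work, not Paddle code.

```python
import multiprocessing as mp

def train(rank, nprocs, queue):
    # Stand-in for real per-process work; real training code would
    # init the parallel env, build the model, and run the train loop.
    queue.put((rank, nprocs))

def run_spawn(nprocs):
    # With the "spawn" start method each child re-imports this module,
    # so guard against recursive process creation in children.
    if mp.parent_process() is not None:
        return [(i, nprocs) for i in range(nprocs)]
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [ctx.Process(target=train, args=(i, nprocs, queue))
             for i in range(nprocs)]
    for p in procs:
        p.start()
    results = sorted(queue.get() for _ in range(nprocs))
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    print(run_spawn(2))  # [(0, 2), (1, 2)]
```

The key difference from the `launch` tool is visible here: process creation happens inside the script's own `__main__` block rather than in an external launcher module.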
3. API change
Add 4 new APIs:

- `paddle.distributed.spawn`: start multi-process training by the spawn method
- `paddle.distributed.init_parallel_env`: init parallel environment variables & get parallel strategy
- `paddle.distributed.get_rank`: get current process rank
- `paddle.distributed.get_world_size`: get current world size

Move 2 old APIs:

- `paddle.prepare_context` (`fluid.dygraph.prepare_context`) -> `paddle.distributed.prepare_context`
- `paddle.ParallelEnv` (`fluid.dygraph.ParallelEnv`) -> `paddle.distributed.ParallelEnv`

Refine 1 old API:

- `paddle.DataParallel` (`fluid.dygraph.DataParallel`): set `strategy` as an optional argument

Deprecate 1 old API:

- `paddle.distributed.prepare_context` (`fluid.dygraph.prepare_context`): will be replaced by `paddle.distributed.init_parallel_env` later

4. Correctness
Verify the correctness of the interface in the following models:
- `test_parallel_dygraph_mnist.py`
- `test_parallel_dygraph_se_resnext.py`
- `test_parallel_dygraph_transformer.py`

5. Related docs