Add async ssa graph executor #15409
Conversation
paddle::framework::TensorCopy(main_tensor, cpu, t);
};

auto copy_memory = [&] { t->ShareDataWith(main_tensor); };
Seems copy_memory and share_memory are reversed?
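For context, here is a hedged sketch of the distinction the comment points at, with `paddle::framework::Tensor` reduced to a toy `MiniTensor`. All names below (`MiniTensor`, the free-function forms of `ShareDataWith`/`TensorCopy`) are illustrative assumptions, not the real Paddle API:

```cpp
#include <memory>
#include <vector>

// Toy tensor: storage is a shared buffer, so aliasing vs. copying is visible.
struct MiniTensor {
  std::shared_ptr<std::vector<float>> data;
};

// "share" semantics: the destination aliases the source buffer (no allocation).
// Writes through either tensor are visible to both.
inline void ShareDataWith(MiniTensor& dst, const MiniTensor& src) {
  dst.data = src.data;
}

// "copy" semantics: the destination gets its own buffer with the same contents.
// Later writes to the source do not affect the destination.
inline void TensorCopy(const MiniTensor& src, MiniTensor& dst) {
  dst.data = std::make_shared<std::vector<float>>(*src.data);
}
```

Under these definitions, a lambda named `copy_memory` that calls `ShareDataWith` would indeed have the two behaviors swapped, which is what the comment asks about.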
namespace details {

AsyncSSAGraphExecutor::AsyncSSAGraphExecutor(
    const ExecutionStrategy &strategy, const std::vector<Scope *> &local_scopes,
lodtensor_ptrs.push_back(&fetch_data.at(scope_idx).at(fetch_idx));
}
ret.emplace_back();
ret.back().MergeLoDTensor(lodtensor_ptrs, platform::CPUPlace());
When num_iteration_per_run_ > 1, the threads run at different speeds; is it meaningful to merge the results from each local_scope?
I feel this can actually be removed; execution is already fully asynchronous anyway, so it just reduces the amount of data used for eval a bit.
The threads are out of step and their parameter versions differ, so it should indeed be removed; observing a single thread is enough.
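To make the two fetch strategies under discussion concrete, here is a hedged sketch with `LoDTensor` reduced to a plain `std::vector<float>`. `MergeAllScopes` models what `MergeLoDTensor` does across local scopes; `FirstScopeOnly` models the reviewers' suggestion of observing a single thread. Both names and the flattened tensor type are assumptions of this sketch, not Paddle code:

```cpp
#include <vector>

// Stand-in for a fetched LoDTensor, flattened to its values.
using FakeLoDTensor = std::vector<float>;

// Current behavior (sketched): concatenate every local scope's fetch result.
// With async workers at different parameter versions, the merged tensor mixes
// inconsistent snapshots.
inline FakeLoDTensor MergeAllScopes(
    const std::vector<FakeLoDTensor>& per_scope) {
  FakeLoDTensor merged;
  for (const auto& t : per_scope) {
    merged.insert(merged.end(), t.begin(), t.end());
  }
  return merged;
}

// Suggested behavior (sketched): report only one worker's result.
inline FakeLoDTensor FirstScopeOnly(
    const std::vector<FakeLoDTensor>& per_scope) {
  return per_scope.empty() ? FakeLoDTensor{} : per_scope.front();
}
```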
… add-async-ssa-graph-executor test=develop
Force-pushed c9bf8e2 to 10393dd (Compare)
member_->use_cuda_, member_->nccl_ctxs_.get());
if (build_strategy.async_mode_ && !build_strategy.is_distribution_) {
  VLOG(3) << "use local async mode";
  for (size_t i = 0; i < member_->places_.size(); ++i) {
@panyx0718 has a PR that passes a graph instead of a program: #15425. And ParallelGraphExecutor does not depend on program_desc: #15716?
… add-async-ssa-graph-executor test=develop
test=develop
if (pool_) {
  for (auto &f : run_futures) {
    if (exception_holder_.IsCaught()) {
      f.wait();
// num_trainers is 1, so the current fields of build_strategy doesn't tell if
// it's distributed model.
bool is_distribution_{false};
bool async_mode_{false};
What is the relationship between async_mode and is_distribution?
// it's distributed model.
bool is_distribution_{false};
bool async_mode_{false};
int num_trainers_{1};
Can num_trainers be greater than 1 while is_distribution is false?
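The questions above can be phrased as a small predicate over the strategy fields. The struct and helper below are a hedged sketch (hypothetical names, not Paddle's actual logic): they only express that, as the quoted comment says, "distributed" cannot be inferred from num_trainers_ alone, so the explicit is_distribution_ flag is what gates local async mode:

```cpp
// Hypothetical mirror of the BuildStrategy fields quoted in the diff.
struct FakeBuildStrategy {
  bool is_distribution_{false};
  bool async_mode_{false};
  int num_trainers_{1};
};

// Mirrors the check in ParallelExecutor:
//   if (build_strategy.async_mode_ && !build_strategy.is_distribution_) ...
// num_trainers_ deliberately plays no role here, which is exactly why the
// reviewer asks how the two flags relate.
inline bool UseLocalAsyncMode(const FakeBuildStrategy& s) {
  return s.async_mode_ && !s.is_distribution_;
}
```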
const std::vector<Scope *> &local_scopes,
const ExecutionStrategy &exec_strategy, const BuildStrategy &build_strategy,
ir::Graph *graph)
std::vector<ir::Graph *> graphs)
Avoid multiple graphs; a single graph can contain multiple sub-graphs.
if (build_strategy.async_mode_ && !build_strategy.is_distribution_) {
  VLOG(3) << "use local async mode";
  temp_owned_graph =
      build_strategy.Apply(std::move(temp_owned_graph), {member_->places_[0]},
Why does each graph need to go through the multi-device pass?
# step7: init ParallelExecutor
# ParallelExecutor API will be deprecated, don't support parallel graph.
self._graph = core.Graph(main.desc)
self._graphs = []
parallel_executor.py is deprecated.
const ExecutionStrategy &exec_strategy,
const BuildStrategy &build_strategy,
ir::Graph *graph);
std::vector<ir::Graph *> graphs);
Don't use multiple graphs.
… add-async-ssa-graph-executor
namespace operators {
namespace reader {

BufferedReader::~BufferedReader() {
  VLOG(1) << "~BufferedReader";
… add-async-ssa-graph-executor