
Fleet unify distributed training#16791

Merged
seiriosPlus merged 38 commits into PaddlePaddle:develop from seiriosPlus:feature/fleet
Apr 25, 2019

Conversation

@seiriosPlus
Collaborator

No description provided.

@seiriosPlus seiriosPlus changed the title DistributedTranspiler with Fleet [WIP]DistributedTranspiler with Fleet Apr 11, 2019
        self.role_is_generated_ = True


class UserDefinedRoleMaker(RoleMakerBase):
Member

I think you can add a local role maker that can be easily configured. For example, the IP could default to 127.0.0.1 and the port could be randomly generated within a range.
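A minimal sketch of that suggestion: a role maker that needs no explicit configuration for local runs. The class and attribute names here are illustrative, not part of the actual Fleet API.

```python
import random

# Hypothetical locally-configured role maker: loopback endpoints with
# randomly generated ports, so a local run needs no explicit setup.
class LocalRoleMaker:
    def __init__(self, worker_num=1, port_range=(6170, 6270)):
        # Pick distinct random ports within the range, one per worker.
        ports = random.sample(range(*port_range), worker_num)
        self.worker_endpoints = ["127.0.0.1:%d" % p for p in ports]
        self.current_id = 0  # single-node default
```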

Collaborator Author

Yep, I will seriously consider it.

@CLAassistant

CLAassistant commented Apr 17, 2019

CLA assistant check
All committers have signed the CLA.

        return self.worker_endpoints

    def _get_current_id(self):
        return self.current_id


Should we expose these two methods (get_worker_endpoints and get_current_id) to users?
To use paddle.fluid.compiler for distributed training, we have to set build_strategy.num_trainers and build_strategy.trainer_id. So, if we expose the above two methods, we can simplify the programming for users.
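A sketch of the simplification this would allow. `BuildStrategy` below is a plain stand-in for `paddle.fluid.BuildStrategy`, and the role-maker class is a minimal illustration exposing the two accessors under discussion.

```python
# Stand-in for paddle.fluid.BuildStrategy, illustration only.
class BuildStrategy:
    num_trainers = 1
    trainer_id = 0

class RoleMaker:
    """Minimal role maker exposing the two accessors discussed above."""
    def __init__(self, endpoints, current_id):
        self._endpoints = endpoints
        self._current_id = current_id

    def get_worker_endpoints(self):
        return self._endpoints

    def get_current_id(self):
        return self._current_id

def configure(build_strategy, role_maker):
    # With the accessors public, users no longer track these by hand.
    build_strategy.num_trainers = len(role_maker.get_worker_endpoints())
    build_strategy.trainer_id = role_maker.get_current_id()
    return build_strategy
```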

@seiriosPlus seiriosPlus changed the title [WIP]DistributedTranspiler with Fleet DistributedTranspiler with Fleet Apr 18, 2019
@seiriosPlus seiriosPlus changed the title DistributedTranspiler with Fleet Fleet unify distributed training Apr 18, 2019
@seiriosPlus seiriosPlus requested a review from jacquesqiao April 18, 2019 12:52
Member

@guru4elephant left a comment
Generally, I think more usage examples can be added. Although we do not want to release incubate to the public currently, we will have to in the future, and users will need solid documentation for Fleet.


    def is_first_worker(self):
        """
        Check whether the node is the first instance of worker.
Member

Please add an Examples section here if you want to make this public to users.


    def worker_id(self):
        """
        Get current worker id.
Member

I think worker_id should be clarified here. In Collective mode, a worker id corresponds to a GPU device card. In Parameter Server mode, a worker id corresponds to a pod that runs multi-threaded training.
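One way the requested clarification could read in code. The role-maker accessor names are assumptions for illustration, not the actual Fleet internals.

```python
class WorkerInfo:
    """Docstring sketch carrying the per-mode clarification."""
    def __init__(self, role_maker):
        self._role_maker = role_maker

    def worker_id(self):
        """Return the current worker id.

        In Collective mode a worker id corresponds to one GPU device
        card; in Parameter Server mode it corresponds to one trainer
        pod that may run multi-threaded training.
        """
        return self._role_maker.get_current_id()

    def worker_num(self):
        """Return the total number of workers (same per-mode meaning)."""
        return len(self._role_maker.get_worker_endpoints())
```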


    def worker_num(self):
        """
        Get current worker number.
Member

A similar clarification to worker_id's can be added for worker_num.

fleet = PSLib()


class PSLibOptimizer(DistributedOptimizer):
Member

The name PSLibOptimizer seems weird. Could you rename this class with an algorithmic description?

The `dirname` is used to specify the folder where persistable variables
are going to be saved. If you would like to save variables in separate
files, set `filename` None; if you would like to save all variables in a
single file, use `filename` to specify the file name.


The dirname is used to specify the folder where persistable variables
are going to be saved. If you would like to save variables in separate
files, set filename None; if you would like to save all variables in a ...

What does that mean? Or is this added in the wrong place?

@shanyi15
Collaborator

I have checked distribute_transpiler.py and dataset.py; please preview Optimizer first.

sandyhouse previously approved these changes Apr 24, 2019
'AdamaxOptimizer', 'DecayedAdagradOptimizer', 'RMSPropOptimizer',
'FtrlOptimizer', 'Adadelta', 'ModelAverage', 'LarsMomentum',
'LarsMomentumOptimizer', 'DGCMomentumOptimizer',
'Optimizer', 'SGD', 'Momentum', 'Adagrad', 'Adam', 'Adamax',
Contributor

Don't expose Optimizer here.

Collaborator Author

Done.

    """
    __metaclass__ = abc.ABCMeta

    def __init__(self, optimizer, strategy=None):
Contributor

Please explain the args.
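A possible documented form of the constructor; the argument descriptions are my reading of the surrounding discussion, not official documentation.

```python
class DistributedOptimizerSketch:
    """Illustrative stand-in for DistributedOptimizer."""
    def __init__(self, optimizer, strategy=None):
        """
        Args:
            optimizer: the wrapped single-node optimizer instance
                (e.g. SGD or Adam) whose minimize() call is to be
                executed in a distributed fashion.
            strategy: optional distributed-training strategy or
                configuration object; None selects the default.
        """
        self._optimizer = optimizer
        self._strategy = strategy
```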

    def __init__(self):
        """
        A subclass for compatibility with fluid.transpiler.DistributeTranspiler.
        """
Contributor

Lines 38-40 should be moved to between lines 36 and 37.

paddle.fluid.DistributeTranspiler.get_startup_program (ArgSpec(args=['self', 'endpoint', 'pserver_program', 'startup_program'], varargs=None, keywords=None, defaults=(None, None)), ('document', 'd796fc0c8d51503b556fcf6dc15c4f0c'))
paddle.fluid.DistributeTranspiler.get_trainer_program (ArgSpec(args=['self', 'wait_port'], varargs=None, keywords=None, defaults=(True,)), ('document', '736330e31a7a54abccc0c7fd9119d9ff'))
paddle.fluid.DistributeTranspiler.transpile (ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'sync_mode', 'startup_program', 'current_endpoint'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, True, None, '127.0.0.1:6174')), ('document', '06ce55338dfe96311ad1078235ab3bf4'))
paddle.fluid.DistributeTranspiler.transpile (ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'startup_program', 'current_endpoint'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, None, '127.0.0.1:6174')), ('document', '951af0a910f9c264723da78ad555f3df'))
Member

I guess this change may affect previous examples. Could you check the API change against the test cases and public examples?
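Comparing the two transpile signatures copied from the ArgSpec lines above shows exactly which argument callers must drop; any existing call that passed it positionally or by keyword would break.

```python
import inspect

# Old and new signatures, transcribed from the ArgSpec diff above.
def transpile_old(self, trainer_id, program=None,
                  pservers='127.0.0.1:6174', trainers=1, sync_mode=True,
                  startup_program=None, current_endpoint='127.0.0.1:6174'):
    pass

def transpile_new(self, trainer_id, program=None,
                  pservers='127.0.0.1:6174', trainers=1,
                  startup_program=None, current_endpoint='127.0.0.1:6174'):
    pass

# Set difference of parameter names reveals the removed argument.
removed = (set(inspect.signature(transpile_old).parameters)
           - set(inspect.signature(transpile_new).parameters))
```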

class Mode(Enum):
    TRANSPILER = 1
    PSLIB = 2
    COLLECTIVE = 3
Contributor

Could you explain the difference between those modes?

Collaborator Author

These are the three distributed training modes.

Collaborator Author

I will add more information in Mode.
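A commented version of the enum could look like the following; the per-mode descriptions are my reading of the review discussion, not official documentation.

```python
from enum import Enum

class Mode(Enum):
    # Parameter-server training driven by fluid's DistributeTranspiler.
    TRANSPILER = 1
    # Parameter-server training backed by PSLib (large-scale sparse models).
    PSLIB = 2
    # Collective communication (e.g. all-reduce) for multi-GPU training.
    COLLECTIVE = 3
```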

@seiriosPlus seiriosPlus merged commit 1a4a51d into PaddlePaddle:develop Apr 25, 2019
sneaxiy pushed a commit to sneaxiy/Paddle that referenced this pull request Apr 28, 2019
# The first commit's message is:
remove ut test_dist_word2vec in mac ci, will fix it in private, test=develop (PaddlePaddle#17066)

# This is the 2nd commit message:

Fleet unify distributed training (PaddlePaddle#16791)

* implement distributed transpiler with fleet
# This is the 3rd commit message:

ParallelDyGraph with GPU collective mode (PaddlePaddle#16827)

implement dygraph.parallel.DataParallel to hook reduce op.

# This is the 4th commit message:

Init mixed precision training interface (PaddlePaddle#16856)

* Init mixed precision training interface

* Add fp16 test script

test=develop

* All initializers support float16

test=develop

* Code cleanup & add more code annotations

test=develop

* Update API spec

test=develop

* Add usage example in doc

test=develop

# This is the 5th commit message:

fix reference_count_pass,test=develop (PaddlePaddle#17060)

test=develop
# This is the 6th commit message:

Speedup roi_perspective_transform op by caching the information of linear interpolation in forward (PaddlePaddle#17090)

* Cache the information of linear interpolation in forward and use it in backward.
test=develop

* Fix cuda kernel.
test=develop

# This is the 7th commit message:

remove unnecessary prepare_data (PaddlePaddle#17080)

test=develop
# This is the 8th commit message:

fix interpolate cu. test=develop (PaddlePaddle#17101)

# This is the 9th commit message:

test=develop, double backward leaky_relu (PaddlePaddle#17067)

backward of backward: leaky_relu
# This is the 10th commit message:

fix fuse optimizer ops (PaddlePaddle#17102)

test=develop
# This is the 11th commit message:

truncated_gaussian_random supported in distributed training, test=develop (PaddlePaddle#17091)

# This is the 12th commit message:

 Detailed coordinate description for yolov3 loss (PaddlePaddle#17007)

* Detailed coordinate description for yolov3 loss

test=develop

* modified api.spec

test=develop

* modified loss name

* fix api.spec

test=develop

* polish description

test=develop

* modified api.spec

test=develop

# This is the 13th commit message:

fix test_weight_decay (PaddlePaddle#17109)

test=develop
# This is the 14th commit message:

Path flag (PaddlePaddle#17105)

* fix python/paddle/fluid/__init__.py detecting problems
sneaxiy added a commit that referenced this pull request Apr 28, 2019
* refine_dropout_mem,test=develop



6 participants