Fleet unify distributed training (#16791)
Conversation
| self.role_is_generated_ = True |
| class UserDefinedRoleMaker(RoleMakerBase): |
I think you can add a local role maker that can be easily configured. For example, the ip could default to 127.0.0.1 and the port could be randomly generated within a range.
yep, I will seriously consider it.
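A local role maker along those lines might look like the sketch below. The class name `LocalRoleMaker`, the port range, and the method names are all hypothetical illustrations of the suggestion, not code from the PR:

```python
import random


class LocalRoleMaker(object):
    """Sketch of a single-machine role maker: the ip defaults to
    127.0.0.1 and the port is drawn at random from a small range,
    so local experiments need no endpoint configuration at all."""

    def __init__(self, ip="127.0.0.1", port_range=(6170, 6200)):
        self._ip = ip
        # random.randint is inclusive on both ends of the range.
        self._port = random.randint(*port_range)

    def worker_endpoints(self):
        # A single local endpoint such as "127.0.0.1:6174".
        return ["%s:%d" % (self._ip, self._port)]

    def worker_index(self):
        # Only one worker exists on a single machine.
        return 0
```

With such defaults, a user could start a local run without specifying any endpoint at all.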
| return self.worker_endpoints |
| def _get_current_id(self): |
|     return self.current_id |
Should we expose the two methods (get_worker_endpoints and get_current_id) to users?
To use paddle.fluid.compiler for distributed training, we have to set build_strategy.num_trainers and build_strategy.trainer_id. So, if we expose these two methods to users, we can simplify the programming for users.
guru4elephant left a comment
generally I think more usage examples can be added. Although we do not want to release incubate to the public currently, we will have to in the future, and users will need solid documentation for fleet.
| def is_first_worker(self): |
|     """ |
|     Check whether the node is the first instance of worker. |
please add Examples here if you want to make this public to users.
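An Examples section for is_first_worker could be as small as the following sketch. The `Fleet` class here is a minimal stand-in for the API under review, with a hypothetical constructor argument:

```python
class Fleet(object):
    """Minimal stand-in for the fleet API under review."""

    def __init__(self, worker_index):
        self._worker_index = worker_index

    def is_first_worker(self):
        # True only on worker 0, conventionally the node that
        # handles singleton tasks such as writing checkpoints.
        return self._worker_index == 0


# Typical use: only the first worker persists the model.
fleet = Fleet(worker_index=0)
if fleet.is_first_worker():
    print("worker 0 saves the checkpoint")
```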
| def worker_id(self): |
|     """ |
|     Get current worker id. |
I think worker_id should be clarified here. In Collective mode, a worker id corresponds to a GPU device card. In Parameter Server mode, a worker id should be a pod that runs multi-threaded training.
| def worker_num(self): |
|     """ |
|     Get current worker number. |
A similar clarification to worker_id can be added for worker_num.
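The parallel clarification for worker_num might read as follows (again a docstring sketch with an illustrative placeholder return, not the PR's text):

```python
def worker_num():
    """
    Get the total number of workers.

    In Collective mode this counts GPU device cards across all
    nodes; in Parameter Server mode it counts training pods, each of
    which may run multiple training threads internally.
    """
    return 1  # placeholder value for this sketch
```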
| fleet = PSLib() |
| class PSLibOptimizer(DistributedOptimizer): |
The name PSLibOptimizer seems weird. Could you rename this class with an algorithmic description?
| The `dirname` is used to specify the folder where persistable variables |
| are going to be saved. If you would like to save variables in separate |
| files, set `filename` None; if you would like to save all variables in a |
| single file, use `filename` to specify the file name. |
The dirname is used to specify the folder where persistable variables
are going to be saved. If you would like to save variables in separate
files, set filename None; if you would like to save all variables in a ...
What does that mean? Or was it added in the wrong place?
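For what it's worth, the quoted docstring describes a real contract. The sketch below illustrates it with a hypothetical helper (`save_persistables_sketch` is my name; it only touches empty files, it is not Paddle's implementation): `filename=None` means one file per variable under `dirname`, while a concrete `filename` packs everything into that single file.

```python
import os
import tempfile


def save_persistables_sketch(dirname, var_names, filename=None):
    """Illustrates the documented contract: filename=None writes one
    file per variable under dirname; a concrete filename packs all
    variables into that single file."""
    if not os.path.isdir(dirname):
        os.makedirs(dirname)
    if filename is None:
        for name in var_names:  # one file per variable
            open(os.path.join(dirname, name), "w").close()
    else:  # everything in one file
        open(os.path.join(dirname, filename), "w").close()


# Separate files vs. a single packed file:
d1, d2 = tempfile.mkdtemp(), tempfile.mkdtemp()
save_persistables_sketch(d1, ["fc_0.w_0", "fc_0.b_0"])
save_persistables_sketch(d2, ["fc_0.w_0", "fc_0.b_0"], filename="__model__")
```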
test=develop
have checked
python/paddle/fluid/optimizer.py
Outdated
| 'AdamaxOptimizer', 'DecayedAdagradOptimizer', 'RMSPropOptimizer', |
| 'FtrlOptimizer', 'Adadelta', 'ModelAverage', 'LarsMomentum', |
| 'LarsMomentumOptimizer', 'DGCMomentumOptimizer' |
| 'Optimizer', 'SGD', 'Momentum', 'Adagrad', 'Adam', 'Adamax', |
Don't expose Optimizer here.
| """ |
| __metaclass__ = abc.ABCMeta |
| def __init__(self, optimizer, strategy=None): |
Please explain the args.
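The requested explanation could take a shape like this sketch; the Args wording and the attribute names are suggestions, not the PR's actual docstring:

```python
class DistributedOptimizer(object):
    """Sketch of the documented constructor; the Args text below is a
    suggestion for the review comment, not the PR's actual wording."""

    def __init__(self, optimizer, strategy=None):
        """
        Args:
            optimizer: the wrapped single-node optimizer instance
                (e.g. SGD or Adam) whose minimize() is rewritten for
                distributed execution.
            strategy: optional distributed-training configuration;
                None falls back to the default strategy.
        """
        self._optimizer = optimizer
        self._strategy = strategy
```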
| def __init__(self): |
|     """ |
|     A subclass for compatibility with fluid.transpiler.DistributeTranspiler. |
|     """ |
Lines 38~40 should be moved between lines 36 and 37.
paddle/fluid/API.spec
Outdated
| paddle.fluid.DistributeTranspiler.get_startup_program (ArgSpec(args=['self', 'endpoint', 'pserver_program', 'startup_program'], varargs=None, keywords=None, defaults=(None, None)), ('document', 'd796fc0c8d51503b556fcf6dc15c4f0c')) |
| paddle.fluid.DistributeTranspiler.get_trainer_program (ArgSpec(args=['self', 'wait_port'], varargs=None, keywords=None, defaults=(True,)), ('document', '736330e31a7a54abccc0c7fd9119d9ff')) |
| paddle.fluid.DistributeTranspiler.transpile (ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'sync_mode', 'startup_program', 'current_endpoint'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, True, None, '127.0.0.1:6174')), ('document', '06ce55338dfe96311ad1078235ab3bf4')) |
| paddle.fluid.DistributeTranspiler.transpile (ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'startup_program', 'current_endpoint'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, None, '127.0.0.1:6174')), ('document', '951af0a910f9c264723da78ad555f3df')) |
this change may affect previous examples I guess; could you check the API change against the test cases and public examples?
| class Mode(Enum): |
|     TRANSPILER = 1, |
|     PSLIB = 2, |
|     COLLECTIVE = 3 |
Could you explain the difference between those modes?
These are the three distributed training modes.
I will add more information in Mode.
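That added information could live in a class docstring, roughly as below; the one-line summaries are my reading of the discussion, not the PR's text. Incidentally, the trailing commas in the diff (`TRANSPILER = 1,`) make those members tuple-valued (`(1,)` rather than `1`) under a plain `Enum`, which is probably unintended if the values are meant to be ints:

```python
from enum import Enum


class Mode(Enum):
    """The three distributed training modes.

    TRANSPILER: parameter-server training driven by
        DistributeTranspiler (trainer processes plus pservers).
    PSLIB: parameter-server training backed by the PSLib
        large-scale sparse-parameter service.
    COLLECTIVE: all-reduce style multi-GPU training with no
        parameter server.
    """
    TRANSPILER = 1  # no trailing comma, so values stay plain ints
    PSLIB = 2
    COLLECTIVE = 3
```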
# The first commit's message is:
remove ut test_dist_word2vec in mac ci, will fix it in private, test=develop (PaddlePaddle#17066)

# This is the 2nd commit message:
Fleet unify distributed training (PaddlePaddle#16791)
* implement distributed transpiler with fleet

# This is the 3rd commit message:
ParallelDyGraph with GPU collective mode (PaddlePaddle#16827)
implement dygraph.parallel.DataParallel to hook reduce op.

# This is the 4th commit message:
Init mixed precision training interface (PaddlePaddle#16856)
* Init mixed precision training interface
* Add fp16 test script test=develop
* All initializers support float16 test=develop
* Code cleanup & add more code annotations test=develop
* Update API spec test=develop
* Add usage example in doc test=develop

# This is the 5th commit message:
fix reference_count_pass,test=develop (PaddlePaddle#17060)
test=develop

# This is the 6th commit message:
Speedup roi_perspective_transform op by caching the information of linear interpolation in forward (PaddlePaddle#17090)
* Cache the information of linear interpolation in forward and use it in backward. test=develop
* Fix cuda kernel. test=develop

# This is the 7th commit message:
remove unnecessary prepare_data (PaddlePaddle#17080)
test=develop

# This is the 8th commit message:
fix interpolate cu. test=develop (PaddlePaddle#17101)

# This is the 9th commit message:
test=develop, double backward leaky_relu (PaddlePaddle#17067)
backward of backward: leaky_relu

# This is the 10th commit message:
fix fuse optimizer ops (PaddlePaddle#17102)
test=develop

# This is the 11th commit message:
truncated_gaussian_random supported in distributed training, test=develop (PaddlePaddle#17091)

# This is the 12th commit message:
Detailed coordinate description for yolov3 loss (PaddlePaddle#17007)
* Detailed coordinate description for yolov3 loss test=develop
* modified api.spec test=develop
* modified loss name
* fix api.spec test=develop
* polish description test=develop
* modified api.spec test=develop

# This is the 13th commit message:
fix test_weight_decay (PaddlePaddle#17109)
test=develop

# This is the 14th commit message:
Path flag (PaddlePaddle#17105)
* fix python/paddle/fluid/__init__.py detecting problems
* refine_dropout_mem,test=develop