
Fleet unify distributed training#16791

Merged
seiriosPlus merged 38 commits into PaddlePaddle:develop from seiriosPlus:feature/fleet
Apr 25, 2019

Conversation

@seiriosPlus
Collaborator

No description provided.

@seiriosPlus seiriosPlus changed the title DistributedTranspiler with Fleet [WIP]DistributedTranspiler with Fleet Apr 11, 2019
        self.role_is_generated_ = True


class UserDefinedRoleMaker(RoleMakerBase):
Member

I think you can add a local role maker that can be easily configured. For example, the IP could default to 127.0.0.1 and the port could be randomly generated within a range.
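A minimal sketch of that suggestion: a role maker that needs no explicit configuration for local runs. The class and attribute names here are illustrative, not part of the actual Fleet API.

```python
import random

# Hypothetical locally-configured role maker: loopback endpoints with
# randomly generated ports, so a local run needs no explicit setup.
class LocalRoleMaker:
    def __init__(self, worker_num=1, port_range=(6170, 6270)):
        # Pick distinct random ports within the range, one per worker.
        ports = random.sample(range(*port_range), worker_num)
        self.worker_endpoints = ["127.0.0.1:%d" % p for p in ports]
        self.current_id = 0  # single-node default
```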

Collaborator Author

Yep, I will seriously consider it.

@CLAassistant

CLAassistant commented Apr 17, 2019

CLA assistant check
All committers have signed the CLA.

        return self.worker_endpoints

    def _get_current_id(self):
        return self.current_id


Should we expose these two methods (get_worker_endpoints and get_current_id) to users?
To use paddle.fluid.compiler for distributed training, we have to set build_strategy.num_trainers and build_strategy.trainer_id. So, if we expose the above two methods, we can simplify the programming for users.
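A sketch of the simplification this would allow. `BuildStrategy` below is a plain stand-in for `paddle.fluid.BuildStrategy`, and the role-maker class is a minimal illustration exposing the two accessors under discussion.

```python
# Stand-in for paddle.fluid.BuildStrategy, illustration only.
class BuildStrategy:
    num_trainers = 1
    trainer_id = 0

class RoleMaker:
    """Minimal role maker exposing the two accessors discussed above."""
    def __init__(self, endpoints, current_id):
        self._endpoints = endpoints
        self._current_id = current_id

    def get_worker_endpoints(self):
        return self._endpoints

    def get_current_id(self):
        return self._current_id

def configure(build_strategy, role_maker):
    # With the accessors public, users no longer track these by hand.
    build_strategy.num_trainers = len(role_maker.get_worker_endpoints())
    build_strategy.trainer_id = role_maker.get_current_id()
    return build_strategy
```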

@seiriosPlus seiriosPlus changed the title [WIP]DistributedTranspiler with Fleet DistributedTranspiler with Fleet Apr 18, 2019
@seiriosPlus seiriosPlus changed the title DistributedTranspiler with Fleet Fleet unify distributed training Apr 18, 2019
@seiriosPlus seiriosPlus requested a review from jacquesqiao April 18, 2019 12:52
Member

@guru4elephant left a comment
Generally, I think more usage examples can be added. Although we do not want to release incubate to the public currently, we will have to in the future, and users will need solid documentation for Fleet.


    def is_first_worker(self):
        """
        Check whether the node is the first instance of worker.
Member

Please add an Examples section here if you want to make this public to users.


    def worker_id(self):
        """
        Get current worker id.
Member

I think worker_id should be clarified here. In Collective mode, a worker id corresponds to a GPU device card. In Parameter Server mode, a worker id corresponds to a pod that runs multi-threaded training.
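One way the requested clarification could read in code. The role-maker accessor names are assumptions for illustration, not the actual Fleet internals.

```python
class WorkerInfo:
    """Docstring sketch carrying the per-mode clarification."""
    def __init__(self, role_maker):
        self._role_maker = role_maker

    def worker_id(self):
        """Return the current worker id.

        In Collective mode a worker id corresponds to one GPU device
        card; in Parameter Server mode it corresponds to one trainer
        pod that may run multi-threaded training.
        """
        return self._role_maker.get_current_id()

    def worker_num(self):
        """Return the total number of workers (same per-mode meaning)."""
        return len(self._role_maker.get_worker_endpoints())
```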


    def worker_num(self):
        """
        Get current worker number.
Member

A similar clarification to worker_id's can be added for worker_num.

fleet = PSLib()


class PSLibOptimizer(DistributedOptimizer):
Member

The name PSLibOptimizer seems weird. Could you rename this class with an algorithmic description?

The `dirname` is used to specify the folder where persistable variables
are going to be saved. If you would like to save variables in separate
files, set `filename` None; if you would like to save all variables in a
single file, use `filename` to specify the file name.


The dirname is used to specify the folder where persistable variables
are going to be saved. If you would like to save variables in separate
files, set filename None; if you would like to save all variables in a ...

What does that mean? Or is this added in the wrong place?

@shanyi15
Collaborator

I have checked distribute_transpiler.py and dataset.py; please preview Optimizer first.

sandyhouse previously approved these changes Apr 24, 2019
'AdamaxOptimizer', 'DecayedAdagradOptimizer', 'RMSPropOptimizer',
'FtrlOptimizer', 'Adadelta', 'ModelAverage', 'LarsMomentum',
'LarsMomentumOptimizer', 'DGCMomentumOptimizer',
'Optimizer', 'SGD', 'Momentum', 'Adagrad', 'Adam', 'Adamax',
Contributor

Don't expose Optimizer here.

Collaborator Author

Done.

    """
    __metaclass__ = abc.ABCMeta

    def __init__(self, optimizer, strategy=None):
Contributor

Please explain the args.
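A possible documented form of the constructor; the argument descriptions are my reading of the surrounding discussion, not official documentation.

```python
class DistributedOptimizerSketch:
    """Illustrative stand-in for DistributedOptimizer."""
    def __init__(self, optimizer, strategy=None):
        """
        Args:
            optimizer: the wrapped single-node optimizer instance
                (e.g. SGD or Adam) whose minimize() call is to be
                executed in a distributed fashion.
            strategy: optional distributed-training strategy or
                configuration object; None selects the default.
        """
        self._optimizer = optimizer
        self._strategy = strategy
```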

    def __init__(self):
        """
        A subclass for compatibility with fluid.transpiler.DistributeTranspiler.
        """
Contributor

Lines 38-40 should be moved to between lines 36 and 37.

paddle.fluid.DistributeTranspiler.get_startup_program (ArgSpec(args=['self', 'endpoint', 'pserver_program', 'startup_program'], varargs=None, keywords=None, defaults=(None, None)), ('document', 'd796fc0c8d51503b556fcf6dc15c4f0c'))
paddle.fluid.DistributeTranspiler.get_trainer_program (ArgSpec(args=['self', 'wait_port'], varargs=None, keywords=None, defaults=(True,)), ('document', '736330e31a7a54abccc0c7fd9119d9ff'))
paddle.fluid.DistributeTranspiler.transpile (ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'sync_mode', 'startup_program', 'current_endpoint'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, True, None, '127.0.0.1:6174')), ('document', '06ce55338dfe96311ad1078235ab3bf4'))
paddle.fluid.DistributeTranspiler.transpile (ArgSpec(args=['self', 'trainer_id', 'program', 'pservers', 'trainers', 'startup_program', 'current_endpoint'], varargs=None, keywords=None, defaults=(None, '127.0.0.1:6174', 1, None, '127.0.0.1:6174')), ('document', '951af0a910f9c264723da78ad555f3df'))
Member

I guess this change may affect previous examples. Could you check the API change against the test cases and public examples?
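Comparing the two transpile signatures copied from the ArgSpec lines above shows exactly which argument callers must drop; any existing call that passed it positionally or by keyword would break.

```python
import inspect

# Old and new signatures, transcribed from the ArgSpec diff above.
def transpile_old(self, trainer_id, program=None,
                  pservers='127.0.0.1:6174', trainers=1, sync_mode=True,
                  startup_program=None, current_endpoint='127.0.0.1:6174'):
    pass

def transpile_new(self, trainer_id, program=None,
                  pservers='127.0.0.1:6174', trainers=1,
                  startup_program=None, current_endpoint='127.0.0.1:6174'):
    pass

# Set difference of parameter names reveals the removed argument.
removed = (set(inspect.signature(transpile_old).parameters)
           - set(inspect.signature(transpile_new).parameters))
```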

class Mode(Enum):
    TRANSPILER = 1
    PSLIB = 2
    COLLECTIVE = 3
Contributor

Could you explain the difference between those modes?

Collaborator Author

These are the three distributed training modes.

Collaborator Author

I will add more information in Mode.
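A commented version of the enum could look like the following; the per-mode descriptions are my reading of the review discussion, not official documentation.

```python
from enum import Enum

class Mode(Enum):
    # Parameter-server training driven by fluid's DistributeTranspiler.
    TRANSPILER = 1
    # Parameter-server training backed by PSLib (large-scale sparse models).
    PSLIB = 2
    # Collective communication (e.g. all-reduce) for multi-GPU training.
    COLLECTIVE = 3
```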

@seiriosPlus seiriosPlus merged commit 1a4a51d into PaddlePaddle:develop Apr 25, 2019
sneaxiy pushed a commit to sneaxiy/Paddle that referenced this pull request Apr 28, 2019
# The first commit's message is:
remove ut test_dist_word2vec in mac ci, will fix it in private, test=develop (PaddlePaddle#17066)

# This is the 2nd commit message:

Fleet unify distributed training (PaddlePaddle#16791)

* implement distributed transpiler with fleet
# This is the 3rd commit message:

ParallelDyGraph with GPU collective mode (PaddlePaddle#16827)

implement dygraph.parallel.DataParallel to hook reduce op.

# This is the 4th commit message:

Init mixed precision training interface (PaddlePaddle#16856)

* Init mixed precision training interface

* Add fp16 test script

test=develop

* All initializers support float16

test=develop

* Code cleanup & add more code annotations

test=develop

* Update API spec

test=develop

* Add usage example in doc

test=develop

# This is the 5th commit message:

fix reference_count_pass,test=develop (PaddlePaddle#17060)

test=develop
# This is the 6th commit message:

Speedup roi_perspective_transform op by caching the information of linear interpolation in forward (PaddlePaddle#17090)

* Cache the information of linear interpolation in forward and use it in backward.
test=develop

* Fix cuda kernel.
test=develop

# This is the 7th commit message:

remove unnecessary prepare_data (PaddlePaddle#17080)

test=develop
# This is the 8th commit message:

fix interpolate cu. test=develop (PaddlePaddle#17101)

# This is the 9th commit message:

test=develop, double backward leaky_relu (PaddlePaddle#17067)

backward of backward: leaky_relu
# This is the 10th commit message:

fix fuse optimizer ops (PaddlePaddle#17102)

test=develop
# This is the 11th commit message:

truncated_gaussian_random supported in distributed training, test=develop (PaddlePaddle#17091)

# This is the 12th commit message:

 Detailed coordinate description for yolov3 loss (PaddlePaddle#17007)

* Detailed coordinate description for yolov3 loss

test=develop

* modified api.spec

test=develop

* modified loss name

* fix api.spec

test=develop

* polish description

test=develop

* modified api.spec

test=develop

# This is the 13th commit message:

fix test_weight_decay (PaddlePaddle#17109)

test=develop
# This is the 14th commit message:

Path flag (PaddlePaddle#17105)

* fix python/paddle/fluid/__init__.py detecting problems
sneaxiy added a commit that referenced this pull request Apr 28, 2019
* refine_dropout_mem,test=develop



6 participants