Conversation

Contributor

@wangxicoding wangxicoding commented Aug 10, 2021

PR types

Performance optimization

PR changes

Others

Describe

Add fp16 allreduce support for hybrid parallelism. Usage:

import paddle.distributed.fleet as fleet

strategy = fleet.DistributedStrategy()
strategy.sharding = True
strategy.sharding_configs = {
    "sharding_degree": 1,
    "mp_degree": 1,
    "pp_degree": 2,
    "dp_degree": 2,
}
strategy.pipeline = True
strategy.pipeline_configs = {
    "schedule_mode": "1F1B",
    "micro_batch_size": 2,
    "accumulate_steps": 4,
}
strategy.amp = True
strategy.fp16_allreduce = True
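
For intuition, the effect of `fp16_allreduce` can be sketched outside Paddle: each worker casts its fp32 gradient to fp16 before the allreduce, halving the communication volume, and the reduced result is used in fp32. A minimal NumPy simulation (the helper name and the local sum standing in for an allreduce are illustrative, not Paddle APIs):

```python
import numpy as np

def fp16_allreduce_sim(grads_per_worker):
    """Simulate the fp16-allreduce idea: cast fp32 gradients to fp16
    before communication, then accumulate back into fp32."""
    casted = [g.astype(np.float16) for g in grads_per_worker]
    # An allreduce(sum) across workers, simulated here as a local sum
    # with an fp32 accumulator.
    reduced = np.sum(casted, axis=0, dtype=np.float32)
    return reduced

grads = [np.full(4, 0.5, dtype=np.float32) for _ in range(2)]
out = fp16_allreduce_sim(grads)
print(out, out.dtype)  # [1. 1. 1. 1.] float32
# Communication payload is halved: fp16 is 2 bytes/elem vs 4 for fp32.
print(grads[0].astype(np.float16).nbytes, grads[0].nbytes)  # 8 16
```

The trade-off is reduced precision during the sum; under AMP the gradients are already produced in fp16, which is why this pairs naturally with `strategy.amp = True`.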

Test

Tested on 16 nodes with 8 V100 32GB cards each, with the Ernie3.0 model.

Model config:

| config | value |
| --- | --- |
| hidden size | 8192 |
| num attention heads | 128 |
| num hidden layers | 76 |
| num sharing layers | 64 |
| branch hidden size | 768 |
| branch num attention heads | 16 |

Hybrid configs, with fused_allreduce 128MB:

| dp | mp | pp | micro bsz | global bsz |
| --- | --- | --- | --- | --- |
| 2 | 8 | 8 | 2 | 256 |

Performance:

| fp16_allreduce | throughput (tokens/s) | improve |
| --- | --- | --- |
| false | 13394 | - |
| true | 15224 | +13.6% |
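
As a sanity check, the reported speedup follows directly from the table numbers:

```python
baseline = 13394   # tokens/s with fp16_allreduce = false
optimized = 15224  # tokens/s with fp16_allreduce = true

improve = (optimized - baseline) / baseline
print(f"{improve * 100:.2f}%")  # 13.66%, reported as 13.6% above
```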

Contributor

@JZ-LIANG JZ-LIANG left a comment

LGTM

merged_gradient_names = []
first_opt_op_idx = None

merged_suffix = '@MERGED@FP16' if fp16_allreduce else '@MERGED'
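
For illustration, the suffix above selects which merged-gradient variable a parameter's gradient accumulates into. A hedged sketch of the naming scheme (`merged_grad_name` is a hypothetical helper, and the exact `@GRAD` composition is an assumption, not confirmed Paddle internals):

```python
def merged_grad_name(param_name, fp16_allreduce):
    """Hypothetical helper: build the merged-gradient variable name
    for a parameter, using the suffix chosen in the diff above."""
    merged_suffix = '@MERGED@FP16' if fp16_allreduce else '@MERGED'
    return param_name + '@GRAD' + merged_suffix

print(merged_grad_name('fc_0.w_0', True))   # fc_0.w_0@GRAD@MERGED@FP16
print(merged_grad_name('fc_0.w_0', False))  # fc_0.w_0@GRAD@MERGED
```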
Contributor
We should explain the gradient-name suffixes for later maintainers; we now have too many suffixes for gradients: immediate grad, accumulated grad, casted grad, etc.

Contributor Author

OK, I will add that in the next PR.

@sandyhouse sandyhouse left a comment


LGTM

@wangxicoding wangxicoding merged commit 4d7af37 into PaddlePaddle:develop Aug 11, 2021
@wangxicoding wangxicoding deleted the hybird_fp16_allreduce branch August 11, 2021 07:31