Add some dist-training robust cases into fluid benchmark test #11207
typhoonzero merged 16 commits into PaddlePaddle:develop
Conversation
2. add learning rate decay feature into fluid benchmark test
3. add L1&L2 regularization feature into fluid benchmark test
4. add error clipping feature into fluid benchmark test
5. add gradient clipping feature into fluid benchmark test
import paddle.fluid.core as core
import paddle.fluid.framework as framework
from paddle.fluid.executor import Executor
from models.model_base import get_decay_learning_rate
model_base is not uploaded?
Thanks for the review. I added the benchmark/fluid/models/model_base.py file in the next commit.
optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate)
# set gradient clip
set_gradient_clip(args.gradient_clip_method, args.gradient_clip_norm)
Is there a way to disable these settings if the args are empty?
If clip_method in args is None, these settings are disabled. If the user does NOT specify --gradient_clip_method, the argument defaults to None.
The code looks like this:
def set_gradient_clip(clip_method, clip_norm=1.):
    if not clip_method:
        return None
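A minimal, Paddle-free sketch of the behavior described above: the flag defaults to None, so the helper returns early and nothing is configured. The `global_norm` choice and the returned dict are placeholders for illustration only; the real helper would presumably install a Paddle clip object instead.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # default=None means the attribute is None when the flag is omitted.
    parser.add_argument(
        "--gradient_clip_method",
        type=str,
        default=None,
        choices=["global_norm"],  # hypothetical choice, for illustration only
        help="Gradient clipping method; clipping is disabled when omitted.")
    parser.add_argument("--gradient_clip_norm", type=float, default=1.0)
    return parser


def set_gradient_clip(clip_method, clip_norm=1.):
    # None (or an empty string) disables clipping entirely.
    if not clip_method:
        return None
    # Placeholder: a real implementation would register a clip object here.
    return {"method": clip_method, "clip_norm": clip_norm}


# No flags passed, so the method is None and clipping stays off.
args = build_parser().parse_args([])
print(set_gradient_clip(args.gradient_clip_method, args.gradient_clip_norm))
```

Keeping the "disabled" check inside the helper means callers can invoke it unconditionally, which is what the benchmark script in this PR does.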
I'm currently thinking that we can test all these cases with unit tests rather than CE. Running e2e tests with CE may take a lot of time.
Actually, the test cases already cover most of these features; however, what we need is:
So I think putting these features in the fluid benchmark and adding about 6-7 cases to CE is a good choice; each case will cost around 30s.
2. remove lr_decay, regularization, clipping out of fluid_benchmark.py
benchmark/fluid/args.py
Outdated
    choices=[],
    help='Error clipping method, not allowed yet')
parser.add_argument(
    '--error_clip_min',
Can we remove clipping and the other optimization configurations from the arguments? It might be cleaner to leave these settings to the model configs. Thanks!
2. fix bug in test_listen_and_serv_op
          return
      except os.error:
-         retry_times -= 1
+         retry_times -= sleep_time
This change doesn't seem to belong to the current PR. Also, retry_times looks like the total number of attempts, not a time.
I renamed retry_times to start_left_time to indicate that it is the time left for the pserver to start. This change is needed to pass the CI.
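A rough sketch of the time-budget loop discussed here, not the actual test code: `wait_until_ready`, `probe`, and the injectable `sleep` parameter are hypothetical names, and only `start_left_time`, `sleep_time`, and the `os.error` handling come from the thread.

```python
import os
import time


def wait_until_ready(probe, start_left_time=10.0, sleep_time=0.5,
                     sleep=time.sleep):
    """Call probe() until it succeeds or the time budget runs out.

    probe is a hypothetical callable that raises os.error (an alias of
    OSError) while the server is not ready. Returns True on success and
    False once the budget is exhausted.
    """
    while start_left_time >= 0:
        try:
            probe()
            return True
        except os.error:
            sleep(sleep_time)
            # Subtract the time actually waited, not a fixed count of 1,
            # so the variable really tracks remaining seconds.
            start_left_time -= sleep_time
    return False
```

With this shape, changing `sleep_time` alters how often the probe runs but not the overall timeout, which is the distinction the reviewer raised between a retry count and a remaining time.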
Ref #11213 for adding unit tests
Close #11206