autotp training(fix dco) #7004
Conversation
Kudos @inkcherry for contributing AutoTP training! It's a nice feature that makes tensor parallel training/finetuning more accessible to HF model users. I think a tutorial page would help users discover and learn how to use this feature in DeepSpeed. Is it possible to write a tutorial and add it under https://github.com/deepspeedai/DeepSpeed/tree/master/docs/_tutorials introducing the steps for using this feature? I remember you have an example training alpaca with DeepSpeed AutoTP.
Yes, I will add some documentation soon~
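(For reference while the docs are pending: a minimal sketch of what enabling AutoTP training is expected to look like. The `tensor_parallel` / `autotp_size` config keys and the launch flow below are assumptions based on this PR's discussion, not the official documentation.)

```python
import torch
import deepspeed

# Toy model/optimizer just to make the sketch self-contained; in practice
# this would be a Hugging Face model.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    # Assumed key names from this PR: autotp_size is the tensor-parallel
    # degree and must divide the launched world size.
    "tensor_parallel": {"autotp_size": 2},
    "zero_optimization": {"stage": 1},
}

# Launched with e.g. `deepspeed --num_gpus 2 train.py`; initialize() is
# expected to shard the eligible linear layers across the TP group.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)
```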
@inkcherry, I think a blog would be appropriate to publicize this amazing technology. Although blogs can be a bit of work, we will be glad to collaborate and jointly advertise.
@inkcherry @loadams @tjruwase keep_module_on_host is currently ignored and not used in auto_tp.
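(For context, `keep_module_on_host` comes from #6846 and is meant to keep the sharded weights in host memory instead of moving them to the accelerator during AutoTP sharding. A rough sketch of the intent only; the helper below is hypothetical and is not the actual auto_tp code path:)

```python
import torch
from deepspeed.accelerator import get_accelerator

def place_shard(shard: torch.Tensor, keep_module_on_host: bool) -> torch.Tensor:
    # Hypothetical helper illustrating the option's intent: leave the shard
    # on CPU when keep_module_on_host is set, otherwise move it to the
    # current accelerator device.
    device = "cpu" if keep_module_on_host else get_accelerator().current_device_name()
    return shard.to(device)
```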
@inkcherry, can you please help address the issues raised by @oelayan7? It seems we need to re-apply the lost changes from #6846.
@oelayan7 I will fix it later. Thanks!
@inkcherry, @delock, we are seeing AutoTP CI hangs with torch 2.7. @stas00 created the following minimal CI repro that hangs:

```python
class TestTpDataloaderCorrectness(DistributedTest):
    world_size = 4
    reuse_dist_env = True

    def test(self):
        set_autotp_mode(training=True)
```

He noted the cause is related to:

```python
class TestTpDataloaderCorrectness(DistributedTest):
    world_size = 4
    reuse_dist_env = True

    def test(self, tp_size: int):
        set_autotp_mode(training=True)
        set_autotp_mode(training=False)
```
Thank you, Tunji - more specifically - the hanging condition is:
Hi @tjruwase @stas00, thanks for the information! In my local environment with PyTorch 2.7, I noticed that the unit test passes, but the pytest process doesn't exit cleanly (hangs). Using
Yes, that's exactly the same behavior I observed. Here is the py-spy trace of the hang:
It hangs here; the corresponding part of the trace is:
Should we open an issue about it? Otherwise this will get forgotten and someone will have to diagnose it again. It'll happen as soon as someone tries it with pt-2.7.x.
Did #7321 fix the issue? Otherwise we should open an issue to track it.
I suppose the thing to figure out is whether you just need to make the tests pass, or whether there is an actual bug introduced in pt-2.7 that needs to be diagnosed, reported, and fixed in pt-core. Sweeping the problem under the carpet by making the tests pass can rear its head later in a real user application. But this is not a project I'm part of; I just reported the problem, so it's up to you what you do with it.
Thanks @stas00! We will take a look at this. I think it depends on whether this bug can only be reproduced with pytest or whether we can build a standalone reproducer without it. It will need a deeper look to decide.
Thanks @delock and @stas00. Yes, we've observed an apparent conflict between the process exit logic in DistributedTest (which was designed independently) and PyTorch's destroy_process_group. We'll conduct a deeper analysis to isolate whether this is solely a PyTorch issue or a design mismatch in our test infrastructure.
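(For illustration, a rough sketch of the kind of explicit teardown being considered for the test worker processes; the function and where it would be hooked in are assumptions, only the `torch.distributed` calls are existing API:)

```python
import torch.distributed as dist

def teardown_dist_worker():
    # Hypothetical teardown for a DistributedTest worker: synchronize all
    # ranks, then destroy the process group explicitly rather than relying
    # on interpreter exit, which appears to hang under torch 2.7 while the
    # group is still alive.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        dist.destroy_process_group()
```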
Hey @inkcherry, I am confused because the function tp_model_init is not called during initialization. Does this function need to be called explicitly when I want to enable AutoTP training?
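(Not an authoritative answer, but for context: if an explicit call is needed, it would presumably look roughly like the sketch below. The exposed name `deepspeed.tp_model_init` and its arguments are assumptions based on this PR's discussion, so please verify against the merged code.)

```python
import torch
import deepspeed

# Toy stand-in for a Hugging Face model.
model = torch.nn.Linear(1024, 1024)

# Hypothetical explicit sharding step before deepspeed.initialize();
# the function name and the tp_size/dtype arguments are assumptions.
model = deepspeed.tp_model_init(model, tp_size=2, dtype=torch.bfloat16)
```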
Same as [this PR](#6922). affeb88
I noticed the CI updated the DCO check recently. Using the suggested rebase method for sign-off would reintroduce many conflicts, so I opted for a squash merge with sign-off instead. Thanks :)