
Conversation

@inkcherry
Contributor

@inkcherry inkcherry commented Feb 5, 2025

Same as this PR: affeb88.
I noticed that the CI recently updated the DCO check. Using the suggested rebase method for sign-off would reintroduce many conflicts, so I opted for a squash merge with sign-off instead. Thanks :)

@tjruwase tjruwase added this pull request to the merge queue Feb 5, 2025
Merged via the queue into deepspeedai:master with commit f04649d Feb 5, 2025
12 checks passed
@delock
Collaborator

delock commented Feb 6, 2025

Kudos @inkcherry for contributing AutoTP training! It's a nice feature that makes tensor parallel training/finetuning more accessible to HF model users.

I think a tutorial page would help users discover and learn how to use this feature in DeepSpeed. Would it be possible to write a tutorial and add it under https://github.com/deepspeedai/DeepSpeed/tree/master/docs/_tutorials introducing the steps for using this feature? I remember you have an example of training Alpaca with DeepSpeed AutoTP.

tjruwase pushed a commit that referenced this pull request Feb 6, 2025
fitzjalen pushed a commit to fitzjalen/DeepSpeed that referenced this pull request Feb 6, 2025
siqi654321 pushed a commit to siqi654321/DeepSpeed that referenced this pull request Feb 7, 2025
@inkcherry
Contributor Author

Kudos @inkcherry for contributing AutoTP training! It's a nice feature that makes tensor parallel training/finetuning more accessible to HF model users.

I think a tutorial page would help users discover and learn how to use this feature in DeepSpeed. Would it be possible to write a tutorial and add it under https://github.com/deepspeedai/DeepSpeed/tree/master/docs/_tutorials introducing the steps for using this feature? I remember you have an example of training Alpaca with DeepSpeed AutoTP.

Yes, I will add some documentation soon.

@tjruwase
Contributor

tjruwase commented Feb 7, 2025

@inkcherry, I think a blog would be appropriate to publicize this amazing technology. Although blogs can be a bit of work, we would be glad to collaborate and jointly promote it.

loadams pushed a commit that referenced this pull request Feb 7, 2025
traincheck-team pushed a commit to traincheck-team/DeepSpeed that referenced this pull request Feb 9, 2025
gyou2021 pushed 2 commits to gyou2021/DeepSpeed that referenced this pull request Feb 18, 2025
@inkcherry inkcherry mentioned this pull request Feb 25, 2025
@oelayan7
Contributor

@inkcherry @loadams @tjruwase
cc @nelyahu
Please note that this PR removed keep_module_on_host, which was added in #6846.

keep_module_on_host is currently ignored and not used in auto_tp.
Please align the current code with the previous version of auto_tp.

@tjruwase
Contributor

@inkcherry, can you please help address the issues raised by @oelayan7? It seems we need to re-apply the lost changes from #6846.

@inkcherry
Contributor Author

inkcherry commented Feb 27, 2025

@oelayan7
Sorry for the breakage; I noticed this issue as well.
As a temporary workaround, in deepspeed/module_inject/layers set:
_partition = move(_partition, 'cpu').detach()

I will fix it properly later. Thanks!
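
For context, a hedged sketch of the kind of device gating the re-applied keep_module_on_host support would need. place_partition is a hypothetical helper, and the real code in deepspeed/module_inject/layers may look quite different:

from deepspeed.accelerator import get_accelerator

def place_partition(partition, keep_module_on_host: bool):
    # Keep the sharded weight on the host when requested; otherwise move it to the accelerator.
    device = 'cpu' if keep_module_on_host else get_accelerator().current_device_name()
    return partition.to(device).detach()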

gyou2021 pushed a commit to gyou2021/DeepSpeed that referenced this pull request Feb 28, 2025
tohtana pushed a commit that referenced this pull request Feb 28, 2025
ys950902 pushed a commit to ys950902/DeepSpeed that referenced this pull request Mar 6, 2025
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Mar 20, 2025
@sfc-gh-truwase
Collaborator

@inkcherry, @delock, we are seeing AutoTP CI hangs with torch 2.7.

@stas00 created the following minimal CI repro that hangs:

class TestTpDataloaderCorrectness(DistributedTest):
    world_size = 4
    reuse_dist_env = True

    def test(self):
        set_autotp_mode(training=True)

He noted that the cause is related to DEEPSPEED_AUTOTP_MODE and confirmed that the hang goes away with the following change:

class TestTpDataloaderCorrectness(DistributedTest):
    world_size = 4
    reuse_dist_env = True

    def test(self, tp_size: int):
        set_autotp_mode(training=True)
        set_autotp_mode(training=False)
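
One idiomatic way to make that reset unconditional is an autouse pytest fixture. The sketch below assumes set_autotp_mode is imported the same way the test file imports it, and is not necessarily what the actual fix does:

import pytest

@pytest.fixture(autouse=True)
def _reset_autotp_mode():
    # Run the test with whatever AutoTP mode it sets ...
    yield
    # ... then always restore the non-training default on teardown.
    set_autotp_mode(training=False)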

@stas00
Collaborator

stas00 commented May 29, 2025

Thank you, Tunji. More specifically, the hang requires all of the following conditions:

  1. reuse_dist_env=True (no problem with False)
  2. world_size=4 (no problem with 2)
  3. set_autotp_mode(training=True) (has to be called)
  4. pytorch-2.7 or higher

@inkcherry
Contributor Author

inkcherry commented May 30, 2025

Hi @tjruwase @stas00, thanks for the information! In my local environment with PyTorch 2.7, I noticed that the unit test passes, but the pytest process doesn't exit cleanly (it hangs). Using reuse_dist_env=False resolves the issue. However, it appears that set_autotp_mode(training=False) still causes a hang. I've submitted a fix here: #7321.
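
For anyone hitting this before the fix lands, the class-level workaround implied above amounts to the following sketch (names taken from the repro earlier in this thread):

class TestTpDataloaderCorrectness(DistributedTest):
    world_size = 4
    reuse_dist_env = False  # fresh distributed env per test; avoids the teardown hang on torch 2.7

    def test(self):
        set_autotp_mode(training=True)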

@stas00
Collaborator

stas00 commented May 30, 2025

Yes, that's exactly the same behavior I observed.

Here is the py-spy trace of the hang:

$ py-spy dump --pid 1712817
Process 1712817: /usr/bin/python /home/yak/.local/bin/pytest --disable-warnings --instafail -m sequential -sv tests/unit/model_parallelism/test_autotp_training.py::TestTpLayerFwdBwd::testRowParallel[False-4]
Python v3.10.12 (/usr/bin/python3.10)

Thread 1712817 (idle): "MainThread"
    wait (threading.py:320)
    wait (threading.py:607)
    wait (multiprocessing/pool.py:765)
    get (multiprocessing/pool.py:768)
    starmap (multiprocessing/pool.py:375)
    _close_pool (unit/common.py:349)
    pytest_runtest_teardown (conftest.py:80)
    _multicall (pluggy/_callers.py:103)
    _hookexec (pluggy/_manager.py:120)
    __call__ (pluggy/_hooks.py:513)
    <lambda> (_pytest/runner.py:242)
    from_call (_pytest/runner.py:341)
    call_and_report (_pytest/runner.py:241)
    runtestprotocol (_pytest/runner.py:137)
    pytest_runtest_protocol (_pytest/runner.py:113)
    _multicall (pluggy/_callers.py:103)
    _hookexec (pluggy/_manager.py:120)
    __call__ (pluggy/_hooks.py:513)
    pytest_runtestloop (_pytest/main.py:362)
    _multicall (pluggy/_callers.py:103)
    _hookexec (pluggy/_manager.py:120)
    __call__ (pluggy/_hooks.py:513)
    _main (_pytest/main.py:337)
    wrap_session (_pytest/main.py:283)
    pytest_cmdline_main (_pytest/main.py:330)
    _multicall (pluggy/_callers.py:103)
    _hookexec (pluggy/_manager.py:120)
    __call__ (pluggy/_hooks.py:513)
    main (_pytest/config/__init__.py:177)
    console_main (_pytest/config/__init__.py:201)
    <module> (pytest:8)
Thread 1712881 (idle): "Thread-1 (run_server)"
    accept (socket.py:293)
    run_server (pytest_rerunfailures.py:433)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 1712994 (idle): "Thread-2 (_read_thread)"
    _recv_msg (torch/_inductor/compile_worker/subproc_pool.py:57)
    _read_thread (torch/_inductor/compile_worker/subproc_pool.py:182)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 1713067 (idle): "Thread-3 (_handle_workers)"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    _wait_for_updates (multiprocessing/pool.py:502)
    _handle_workers (multiprocessing/pool.py:522)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 1713068 (idle): "Thread-4 (_handle_tasks)"
    _handle_tasks (multiprocessing/pool.py:531)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 1713069 (idle): "Thread-5 (_handle_results)"
    _recv (multiprocessing/connection.py:379)
    _recv_bytes (multiprocessing/connection.py:414)
    recv (multiprocessing/connection.py:250)
    _handle_results (multiprocessing/pool.py:579)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)

@stas00
Collaborator

stas00 commented May 30, 2025

It hangs here in starmap, in the reuse_dist_env handling code in unit/common.py:

    def _close_pool(self, pool, num_procs, force=False):
        if force or not self.reuse_dist_env:
            msg = pool.starmap(self._dist_destroy, [() for _ in range(num_procs)])
            pool.close()
            pool.join()

The corresponding part of the trace is:

    wait (threading.py:320)
    wait (threading.py:607)
    wait (multiprocessing/pool.py:765)
    get (multiprocessing/pool.py:768)
    starmap (multiprocessing/pool.py:375)
    _close_pool (unit/common.py:349)
    pytest_runtest_teardown (conftest.py:80)
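
Not what unit/common.py does today, but one defensive variant (a sketch) would bound the teardown so a wedged _dist_destroy worker surfaces as a timeout instead of hanging the whole pytest session:

import multiprocessing

def _close_pool(self, pool, num_procs, force=False):
    if force or not self.reuse_dist_env:
        try:
            # Same teardown as above, but async with a timeout so it cannot block forever.
            pool.starmap_async(self._dist_destroy, [() for _ in range(num_procs)]).get(timeout=60)
            pool.close()
        except multiprocessing.TimeoutError:
            pool.terminate()  # kill stuck workers instead of waiting on them
        pool.join()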

@stas00
Collaborator

stas00 commented Jun 3, 2025

Should we open an issue about it? Otherwise this will be forgotten and someone will have to diagnose it again, which will happen as soon as someone tries it with pt-2.7.x.

@delock
Collaborator

delock commented Jun 4, 2025

Should we open an issue about it? Otherwise this will be forgotten and someone will have to diagnose it again, which will happen as soon as someone tries it with pt-2.7.x.

Did #7321 fix the issue? If not, we should open an issue to track it.

@inkcherry
Contributor Author

@stas00 @delock Yes, created: #7334.

@stas00
Collaborator

stas00 commented Jun 9, 2025

I suppose the thing to figure out is whether you just need to make the tests pass, or whether there is an actual bug introduced in pt-2.7 that needs to be diagnosed, reported, and fixed in pt-core.

Sweeping the problem under the carpet by just making the tests pass means it can rear its head later in a real user application.

But this is not a project I am part of; I just reported the problem, so it's up to you what you do with it.

@delock
Collaborator

delock commented Jun 10, 2025

I suppose the thing to figure out is whether you just need to make the tests pass, or whether there is an actual bug introduced in pt-2.7 that needs to be diagnosed, reported, and fixed in pt-core.

Sweeping the problem under the carpet by just making the tests pass means it can rear its head later in a real user application.

But this is not a project I am part of; I just reported the problem, so it's up to you what you do with it.

Thanks @stas00! We will take a look at this. I think it depends on whether this bug can only be reproduced through pytest, or whether we can build a standalone reproducer without it. It will need a deeper look to decide.
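
As a starting point, a minimal non-pytest skeleton for a standalone reproducer might look like the sketch below; it only sets up and tears down the default process group the way the tests do, and the AutoTP-specific pieces would still need to be added to actually trigger the hang:

import os
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 4

def worker(rank):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29511"
    dist.init_process_group("gloo", rank=rank, world_size=WORLD_SIZE)
    # ... AutoTP test body (e.g. set_autotp_mode + a layer fwd/bwd) would go here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, nprocs=WORLD_SIZE)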

@inkcherry
Contributor Author

inkcherry commented Jun 10, 2025

Thanks @delock and @stas00. Yes, we've observed an apparent conflict between the process-exit logic in DistributedTest (which was designed independently) and PyTorch's destroy_process_group. We'll conduct a deeper analysis to isolate whether this is solely a PyTorch issue or a design mismatch in our test infrastructure, and will sync findings promptly.

@rookie-ryan

rookie-ryan commented Aug 4, 2025

Hey @inkcherry, I am confused that the function tp_model_init is not called during initialization. Does this function need to be called explicitly when I want to enable AutoTP training?
