
Conversation

@inkcherry
Contributor

@inkcherry inkcherry commented Feb 5, 2025

Same as this PR: affeb88.
I noticed that the CI recently updated the DCO check. Using the suggested rebase method for sign-off would reintroduce many conflicts, so I opted for a squash merge with sign-off instead. Thanks :)

@tjruwase tjruwase added this pull request to the merge queue Feb 5, 2025
Merged via the queue into deepspeedai:master with commit f04649d Feb 5, 2025
12 checks passed
@delock
Collaborator

delock commented Feb 6, 2025

Kudos @inkcherry for contributing AutoTP training! It's a nice feature that makes tensor parallel training/finetuning more accessible to HF model users.

I think a tutorial page would help users discover and learn how to use this feature in DeepSpeed. Would it be possible to write a tutorial and add it under https://github.com/deepspeedai/DeepSpeed/tree/master/docs/_tutorials introducing the steps for using this feature? I remember you have an example of training Alpaca with DeepSpeed AutoTP.

tjruwase pushed a commit that referenced this pull request Feb 6, 2025
fitzjalen pushed a commit to fitzjalen/DeepSpeed that referenced this pull request Feb 6, 2025
siqi654321 pushed a commit to siqi654321/DeepSpeed that referenced this pull request Feb 7, 2025
@inkcherry
Contributor Author

Kudos @inkcherry for contributing AutoTP training! It's a nice feature that makes tensor parallel training/finetuning more accessible to HF model users.

I think a tutorial page would help users discover and learn how to use this feature in DeepSpeed. Would it be possible to write a tutorial and add it under https://github.com/deepspeedai/DeepSpeed/tree/master/docs/_tutorials introducing the steps for using this feature? I remember you have an example of training Alpaca with DeepSpeed AutoTP.

Yes, I will add some documentation soon.

@tjruwase
Contributor

tjruwase commented Feb 7, 2025

@inkcherry, I think a blog would be appropriate to publicize this amazing technology. Although blogs can be a bit of work, we would be glad to collaborate and jointly promote it.

loadams pushed a commit that referenced this pull request Feb 7, 2025
traincheck-team pushed a commit to traincheck-team/DeepSpeed that referenced this pull request Feb 9, 2025
gyou2021 pushed 2 commits to gyou2021/DeepSpeed that referenced this pull request Feb 18, 2025
@inkcherry inkcherry mentioned this pull request Feb 25, 2025
@oelayan7
Contributor

@inkcherry @loadams @tjruwase
cc @nelyahu
Please note that this PR removed keep_module_on_host, which was added in #6846.

keep_module_on_host is currently ignored and not used in auto_tp.
Please align the current code with the previous version of auto_tp.

@tjruwase
Contributor

@inkcherry, can you please help address the issues raised by @oelayan7? It seems we need to re-apply the lost changes from #6846.

@inkcherry
Contributor Author

inkcherry commented Feb 27, 2025

@oelayan7
Sorry for the breakage; I noticed this issue as well.
As a temporary workaround, in deepspeed/module_inject/layers set:
_partition = move(_partition, 'cpu').detach()

I will fix it properly later. Thanks!
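
For context, a hedged sketch of the kind of device gating the re-applied keep_module_on_host support would need. place_partition is a hypothetical helper, and the real code in deepspeed/module_inject/layers may look quite different:

from deepspeed.accelerator import get_accelerator

def place_partition(partition, keep_module_on_host: bool):
    # Keep the sharded weight on the host when requested; otherwise move it to the accelerator.
    device = 'cpu' if keep_module_on_host else get_accelerator().current_device_name()
    return partition.to(device).detach()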

gyou2021 pushed a commit to gyou2021/DeepSpeed that referenced this pull request Feb 28, 2025
tohtana pushed a commit that referenced this pull request Feb 28, 2025
ys950902 pushed a commit to ys950902/DeepSpeed that referenced this pull request Mar 6, 2025
mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Mar 20, 2025
@sfc-gh-truwase
Collaborator

@inkcherry, @delock, we are seeing AutoTP CI hangs with torch 2.7.

@stas00 created the following minimal CI repro that hangs:

class TestTpDataloaderCorrectness(DistributedTest):
    world_size = 4
    reuse_dist_env = True

    def test(self):
        set_autotp_mode(training=True)

He noted that the cause is related to DEEPSPEED_AUTOTP_MODE and confirmed that the hang goes away with the following change:

class TestTpDataloaderCorrectness(DistributedTest):
    world_size = 4
    reuse_dist_env = True

    def test(self, tp_size: int):
        set_autotp_mode(training=True)
        set_autotp_mode(training=False)
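
One idiomatic way to make that reset unconditional is an autouse pytest fixture. The sketch below assumes set_autotp_mode is imported the same way the test file imports it, and is not necessarily what the actual fix does:

import pytest

@pytest.fixture(autouse=True)
def _reset_autotp_mode():
    # Run the test with whatever AutoTP mode it sets ...
    yield
    # ... then always restore the non-training default on teardown.
    set_autotp_mode(training=False)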

@stas00
Collaborator

stas00 commented May 29, 2025

Thank you, Tunji. More specifically, the hang requires all of the following conditions:

  1. reuse_dist_env=True (no problem with False)
  2. world_size=4 (no problem with 2)
  3. set_autotp_mode(training=True) (has to be called)
  4. pytorch-2.7 or higher

@inkcherry
Contributor Author

inkcherry commented May 30, 2025

Hi @tjruwase @stas00, thanks for the information! In my local environment with PyTorch 2.7, I noticed that the unit test passes, but the pytest process doesn't exit cleanly (it hangs). Using reuse_dist_env=False resolves the issue. However, it appears that set_autotp_mode(training=False) still causes a hang. I've submitted a fix here: #7321.
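
For anyone hitting this before the fix lands, the class-level workaround implied above amounts to the following sketch (names taken from the repro earlier in this thread):

class TestTpDataloaderCorrectness(DistributedTest):
    world_size = 4
    reuse_dist_env = False  # fresh distributed env per test; avoids the teardown hang on torch 2.7

    def test(self):
        set_autotp_mode(training=True)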

@stas00
Collaborator

stas00 commented May 30, 2025

Yes, that's exactly the same behavior I observed.

Here is the py-spy trace of the hang:

$ py-spy dump --pid 1712817
Process 1712817: /usr/bin/python /home/yak/.local/bin/pytest --disable-warnings --instafail -m sequential -sv tests/unit/model_parallelism/test_autotp_training.py::TestTpLayerFwdBwd::testRowParallel[False-4]
Python v3.10.12 (/usr/bin/python3.10)

Thread 1712817 (idle): "MainThread"
    wait (threading.py:320)
    wait (threading.py:607)
    wait (multiprocessing/pool.py:765)
    get (multiprocessing/pool.py:768)
    starmap (multiprocessing/pool.py:375)
    _close_pool (unit/common.py:349)
    pytest_runtest_teardown (conftest.py:80)
    _multicall (pluggy/_callers.py:103)
    _hookexec (pluggy/_manager.py:120)
    __call__ (pluggy/_hooks.py:513)
    <lambda> (_pytest/runner.py:242)
    from_call (_pytest/runner.py:341)
    call_and_report (_pytest/runner.py:241)
    runtestprotocol (_pytest/runner.py:137)
    pytest_runtest_protocol (_pytest/runner.py:113)
    _multicall (pluggy/_callers.py:103)
    _hookexec (pluggy/_manager.py:120)
    __call__ (pluggy/_hooks.py:513)
    pytest_runtestloop (_pytest/main.py:362)
    _multicall (pluggy/_callers.py:103)
    _hookexec (pluggy/_manager.py:120)
    __call__ (pluggy/_hooks.py:513)
    _main (_pytest/main.py:337)
    wrap_session (_pytest/main.py:283)
    pytest_cmdline_main (_pytest/main.py:330)
    _multicall (pluggy/_callers.py:103)
    _hookexec (pluggy/_manager.py:120)
    __call__ (pluggy/_hooks.py:513)
    main (_pytest/config/__init__.py:177)
    console_main (_pytest/config/__init__.py:201)
    <module> (pytest:8)
Thread 1712881 (idle): "Thread-1 (run_server)"
    accept (socket.py:293)
    run_server (pytest_rerunfailures.py:433)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 1712994 (idle): "Thread-2 (_read_thread)"
    _recv_msg (torch/_inductor/compile_worker/subproc_pool.py:57)
    _read_thread (torch/_inductor/compile_worker/subproc_pool.py:182)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 1713067 (idle): "Thread-3 (_handle_workers)"
    select (selectors.py:416)
    wait (multiprocessing/connection.py:931)
    _wait_for_updates (multiprocessing/pool.py:502)
    _handle_workers (multiprocessing/pool.py:522)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 1713068 (idle): "Thread-4 (_handle_tasks)"
    _handle_tasks (multiprocessing/pool.py:531)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)
Thread 1713069 (idle): "Thread-5 (_handle_results)"
    _recv (multiprocessing/connection.py:379)
    _recv_bytes (multiprocessing/connection.py:414)
    recv (multiprocessing/connection.py:250)
    _handle_results (multiprocessing/pool.py:579)
    run (threading.py:953)
    _bootstrap_inner (threading.py:1016)
    _bootstrap (threading.py:973)

@stas00
Collaborator

stas00 commented May 30, 2025

It hangs here in starmap, in the reuse_dist_env handling code in unit/common.py:

    def _close_pool(self, pool, num_procs, force=False):
        if force or not self.reuse_dist_env:
            msg = pool.starmap(self._dist_destroy, [() for _ in range(num_procs)])
            pool.close()
            pool.join()

The corresponding part of the trace is:

    wait (threading.py:320)
    wait (threading.py:607)
    wait (multiprocessing/pool.py:765)
    get (multiprocessing/pool.py:768)
    starmap (multiprocessing/pool.py:375)
    _close_pool (unit/common.py:349)
    pytest_runtest_teardown (conftest.py:80)
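
Not what unit/common.py does today, but one defensive variant (a sketch) would bound the teardown so a wedged _dist_destroy worker surfaces as a timeout instead of hanging the whole pytest session:

import multiprocessing

def _close_pool(self, pool, num_procs, force=False):
    if force or not self.reuse_dist_env:
        try:
            # Same teardown as above, but async with a timeout so it cannot block forever.
            pool.starmap_async(self._dist_destroy, [() for _ in range(num_procs)]).get(timeout=60)
            pool.close()
        except multiprocessing.TimeoutError:
            pool.terminate()  # kill stuck workers instead of waiting on them
        pool.join()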

@stas00
Collaborator

stas00 commented Jun 3, 2025

Should we open an issue about it? Otherwise this will be forgotten and someone will have to diagnose it again, which will happen as soon as someone tries it with pt-2.7.x.

@delock
Collaborator

delock commented Jun 4, 2025

Should we open an issue about it? Otherwise this will be forgotten and someone will have to diagnose it again, which will happen as soon as someone tries it with pt-2.7.x.

Did #7321 fix the issue? If not, we should open an issue to track it.

@inkcherry
Contributor Author

@stas00 @delock Yes, created: #7334.

@stas00
Collaborator

stas00 commented Jun 9, 2025

I suppose the thing to figure out is whether you just need to make the tests pass, or whether there is an actual bug introduced in pt-2.7 that needs to be diagnosed, reported, and fixed in pt-core.

Sweeping the problem under the carpet by just making the tests pass means it can rear its head later in a real user application.

But this is not a project I am part of; I just reported the problem, so it's up to you what you do with it.

@delock
Collaborator

delock commented Jun 10, 2025

I suppose the thing to figure out is whether you just need to make the tests pass, or whether there is an actual bug introduced in pt-2.7 that needs to be diagnosed, reported, and fixed in pt-core.

Sweeping the problem under the carpet by just making the tests pass means it can rear its head later in a real user application.

But this is not a project I am part of; I just reported the problem, so it's up to you what you do with it.

Thanks @stas00! We will take a look at this. I think it depends on whether this bug can only be reproduced through pytest, or whether we can build a standalone reproducer without it. It will need a deeper look to decide.
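
As a starting point, a minimal non-pytest skeleton for a standalone reproducer might look like the sketch below; it only sets up and tears down the default process group the way the tests do, and the AutoTP-specific pieces would still need to be added to actually trigger the hang:

import os
import torch.distributed as dist
import torch.multiprocessing as mp

WORLD_SIZE = 4

def worker(rank):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29511"
    dist.init_process_group("gloo", rank=rank, world_size=WORLD_SIZE)
    # ... AutoTP test body (e.g. set_autotp_mode + a layer fwd/bwd) would go here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, nprocs=WORLD_SIZE)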

@inkcherry
Contributor Author

inkcherry commented Jun 10, 2025

Thanks @delock and @stas00. Yes, we've observed an apparent conflict between the process-exit logic in DistributedTest (which was designed independently) and PyTorch's destroy_process_group. We'll conduct a deeper analysis to isolate whether this is solely a PyTorch issue or a design mismatch in our test infrastructure, and will sync findings promptly.

@rookie-ryan

rookie-ryan commented Aug 4, 2025

Hey @inkcherry, I am confused that the function tp_model_init is not called during initialization. Does this function need to be called explicitly when I want to enable AutoTP training?
