
Conversation

@xmfan (Member) commented Dec 3, 2025:

Stacked PRs:


Validated with:

```bash
python -m torchtitan.experiments.auto_parallel.tests.integration_tests artifacts-to-be-uploaded --ngpu 4
```

which runs the following `torchrun` command:

```bash
NGPU=4 LOG_RANK=0,1,2,3 CONFIG_FILE=./tests/integration_tests/base_config.toml \
TRAIN_FILE=torchtitan.train COMM_MODE= TORCHFT_LIGHTHOUSE=http://localhost:29510 \
PYTORCH_ALLOC_CONF=expandable_segments:True \
torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 \
  --local-ranks-filter 0,1,2,3 --role rank --tee 3 \
  -m torchtitan.train \
  --job.config_file ./tests/integration_tests/base_config.toml \
  --job.dump_folder artifacts-to-be-uploaded/llama3_autoparallel_fsdp_tp \
  --model.name auto_parallel.llama3 \
  --parallelism.data_parallel_shard_degree 2 \
  --parallelism.tensor_parallel_degree 2 \
  --job.custom_config_module=torchtitan.experiments.auto_parallel.job_config
```
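For context, the new test is registered in the experiment's integration test list. Below is a minimal sketch of that entry, assuming the `OverrideDefinitions` pattern from torchtitan's `tests/integration_tests.py`; the override flags shown mirror the `torchrun` command above and are illustrative, not copied from the diff (the review thread below quotes the tail of the real entry).

```python
# Sketch only: assumes torchtitan's OverrideDefinitions test-registration
# helper; the override flags mirror the torchrun command above.
integration_tests_flavors = [
    OverrideDefinitions(
        [
            [
                "--model.name auto_parallel.llama3",
                "--parallelism.data_parallel_shard_degree 2",
                "--parallelism.tensor_parallel_degree 2",
                "--job.custom_config_module=torchtitan.experiments.auto_parallel.job_config",
            ],
        ],
        "llama3 AutoParallel with FSDP + TP",
        "llama3_autoparallel_fsdp_tp",
        ngpu=4,
    ),
]
```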

xmfan added a commit that referenced this pull request Dec 3, 2025
stack-info: PR: #2105, branch: xmfan/stack/5
meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Dec 3, 2025
```diff
@@ -0,0 +1,56 @@
+name: Auto Parallel 8 GPU Integration Tests
```
@xmfan (Member, Author) commented:
The pattern across the other workflows seems to be to still name this "8 GPU"

A contributor replied:
IIUC it basically says we use at most 8 GPUs in each integration test (and we only have 8 available)
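In other words, the "8 GPU" in the name refers to the runner's GPU budget, not each test's `ngpu`. A rough sketch of the new 56-line workflow under that convention follows; the trigger, runner label, and reusable-workflow path are assumptions modeled on torchtitan's other 8 GPU workflows, not copied from this diff.

```yaml
# Illustrative sketch only; runner label and reusable workflow path are
# assumptions based on torchtitan's other 8 GPU integration test workflows.
name: Auto Parallel 8 GPU Integration Tests

on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-test:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
    with:
      runner: linux.g5.48xlarge.nvidia.gpu  # 8 GPUs: the per-test ceiling
      script: |
        python -m torchtitan.experiments.auto_parallel.tests.integration_tests \
          artifacts-to-be-uploaded --ngpu 4
```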

"llama3_autoparallel_fsdp_tp",
ngpu=4,
),
# TODO: Re-enable this once we fix the test
@xmfan (Member, Author) commented Dec 3, 2025:
This doesn't work yet; I'm thinking of enabling it in a different PR

@tianyu-l (Contributor) left a comment:
sgtm, except that the test is not passing?

xmfan added a commit that referenced this pull request Dec 5, 2025
stack-info: PR: #2105, branch: xmfan/stack/5
@xmfan (Member, Author) commented Dec 8, 2025:

Test will pass after meta-pytorch/autoparallel#274

@tianyu-l (Contributor) left a comment:

sgtm

xmfan merged commit 1ebd914 into main Dec 9, 2025 (10 checks passed)
