
Conversation

@xmfan (Member) commented Dec 3, 2025:

Stacked PRs:


Validated with:

```bash
python -m torchtitan.experiments.auto_parallel.tests.integration_tests artifacts-to-be-uploaded --ngpu 4
```

which runs the following `torchrun` command:

```bash
NGPU=4 LOG_RANK=0,1,2,3 CONFIG_FILE=./tests/integration_tests/base_config.toml \
TRAIN_FILE=torchtitan.train COMM_MODE= TORCHFT_LIGHTHOUSE=http://localhost:29510 \
PYTORCH_ALLOC_CONF=expandable_segments:True \
torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 \
  --local-ranks-filter 0,1,2,3 --role rank --tee 3 \
  -m torchtitan.train \
  --job.config_file ./tests/integration_tests/base_config.toml \
  --job.dump_folder artifacts-to-be-uploaded/llama3_autoparallel_fsdp_tp \
  --model.name auto_parallel.llama3 \
  --parallelism.data_parallel_shard_degree 2 \
  --parallelism.tensor_parallel_degree 2 \
  --job.custom_config_module=torchtitan.experiments.auto_parallel.job_config
```
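For context, the new test is registered in the experiment's integration test list. Below is a minimal sketch of that entry, assuming the `OverrideDefinitions` pattern from torchtitan's `tests/integration_tests.py`; the override flags shown mirror the `torchrun` command above and are illustrative, not copied from the diff (the review thread below quotes the tail of the real entry).

```python
# Sketch only: assumes torchtitan's OverrideDefinitions test-registration
# helper; the override flags mirror the torchrun command above.
integration_tests_flavors = [
    OverrideDefinitions(
        [
            [
                "--model.name auto_parallel.llama3",
                "--parallelism.data_parallel_shard_degree 2",
                "--parallelism.tensor_parallel_degree 2",
                "--job.custom_config_module=torchtitan.experiments.auto_parallel.job_config",
            ],
        ],
        "llama3 AutoParallel with FSDP + TP",
        "llama3_autoparallel_fsdp_tp",
        ngpu=4,
    ),
]
```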

xmfan added a commit that referenced this pull request Dec 3, 2025
stack-info: PR: #2105, branch: xmfan/stack/5
meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Dec 3, 2025
```diff
@@ -0,0 +1,56 @@
+name: Auto Parallel 8 GPU Integration Tests
```
@xmfan (Member, Author) commented:
The pattern across the other workflows seems to be to still name this "8 GPU"

A contributor replied:
IIUC it basically says we use at most 8 GPUs in each integration test (and we only have 8 available)
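In other words, the "8 GPU" in the name refers to the runner's GPU budget, not each test's `ngpu`. A rough sketch of the new 56-line workflow under that convention follows; the trigger, runner label, and reusable-workflow path are assumptions modeled on torchtitan's other 8 GPU workflows, not copied from this diff.

```yaml
# Illustrative sketch only; runner label and reusable workflow path are
# assumptions based on torchtitan's other 8 GPU integration test workflows.
name: Auto Parallel 8 GPU Integration Tests

on:
  push:
    branches: [main]
  pull_request:

jobs:
  build-test:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
    with:
      runner: linux.g5.48xlarge.nvidia.gpu  # 8 GPUs: the per-test ceiling
      script: |
        python -m torchtitan.experiments.auto_parallel.tests.integration_tests \
          artifacts-to-be-uploaded --ngpu 4
```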

"llama3_autoparallel_fsdp_tp",
ngpu=4,
),
# TODO: Re-enable this once we fix the test
@xmfan (Member, Author) commented Dec 3, 2025:
This doesn't work yet; I'm thinking of enabling it in a different PR

@tianyu-l (Contributor) left a comment:
sgtm, except that the test is not passing?

xmfan added a commit that referenced this pull request Dec 5, 2025
stack-info: PR: #2105, branch: xmfan/stack/5
@xmfan (Member, Author) commented Dec 8, 2025:

Test will pass after meta-pytorch/autoparallel#274

@tianyu-l (Contributor) left a comment:

sgtm

xmfan merged commit 1ebd914 into main Dec 9, 2025 (10 checks passed)
