
Add checks for parallel_state initialization #1680

Merged
regisss merged 6 commits into huggingface:main from yafshar:parallel_state
Jan 21, 2025
Conversation

@yafshar
Contributor

@yafshar yafshar commented Jan 6, 2025

What does this PR do?

  • Modified parallel_state initialization to include a check for uninitialized state.
  • Added validation to ensure sequence parallel world size matches context parallel size.
  • Included a check to ensure FP8 amax reduction group is not already initialized.
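The three checks above could be sketched roughly as follows. This is an illustrative sketch, not the merged code: the module-level globals, guard names, and error messages are assumptions modeled on a Megatron-LM-style `parallel_state` API.

```python
# Hypothetical sketch of the guards described in this PR; names such as
# _SEQUENCE_PARALLEL_WORLD_SIZE and _AMAX_REDUCTION_GROUP follow the
# Megatron-LM-style parallel_state module, but details are assumed.

_SEQUENCE_PARALLEL_WORLD_SIZE = None
_AMAX_REDUCTION_GROUP = None


def model_parallel_is_initialized():
    """Return True if the model parallel state has already been set up."""
    return _SEQUENCE_PARALLEL_WORLD_SIZE is not None


def initialize_model_parallel(sequence_parallel_size=1, use_fp8=False):
    global _SEQUENCE_PARALLEL_WORLD_SIZE
    if model_parallel_is_initialized():
        # Already initialized: validate consistency instead of re-initializing.
        if _SEQUENCE_PARALLEL_WORLD_SIZE != sequence_parallel_size:
            raise RuntimeError(
                "The initialized sequence parallel world size does not "
                "match the context parallel size."
            )
        if not use_fp8 and _AMAX_REDUCTION_GROUP is not None:
            raise RuntimeError(
                "FP8 amax reduction group is already initialized."
            )
        return
    _SEQUENCE_PARALLEL_WORLD_SIZE = sequence_parallel_size
```

With this shape, a second call with the same arguments becomes a no-op, while a call with a mismatching size fails loudly instead of silently re-creating the groups.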

Background

The #1501 PR added support for Context Parallelism, building on the model and data parallel groups used in Megatron-LM. In the original Megatron-LM implementation, these global objects are managed through an init function, whereas #1501 moved the parallel_state setup into a class:

class GaudiPartialState(PartialState):

whose constructor calls

parallel_state.initialize_model_parallel(sequence_parallel_size=context_parallel_size, use_fp8=False)

This usage caused issues such as #1649, where the object created from GaudiTrainingArguments is constructed more than once, so the parallel groups are initialized repeatedly.

Fixes #1649

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

- Modified parallel_state initialization to include a check for
  uninitialized state.
- Added validation to ensure sequence parallel world size matches
  context parallel size.
- Included a check to ensure FP8 amax reduction group is not
  already initialized.
@yafshar yafshar marked this pull request as ready for review January 6, 2025 22:26
@yafshar yafshar requested a review from regisss as a code owner January 6, 2025 22:26
@yafshar
Contributor Author

yafshar commented Jan 6, 2025

@bhargaveede can you review this PR?

@yafshar yafshar mentioned this pull request Jan 6, 2025
Comment thread: optimum/habana/accelerate/state.py (Outdated)
"The initialized sequence parallel world size does not match the context parallel size."
)
# Ensure that the parallel_state is initialized similarly with use_fp8=False
if parallel_state._AMAX_REDUCTION_GROUP is not None:
if parallel_state._AMAX_REDUCTION_GROUP is not None:

This is just to check fp8 is set as False, am I right? Rest all seems fine.

Contributor Author


Yes, since this constructor uses use_fp8=False, this is an extra check to ensure consistency.

Contributor Author


Do you think this check is unnecessary? If we keep it, I should add a getter function in parallel_state.

Contributor Author


@bhargaveede I changed this check to an info message and added a function in parallel_state. If you have time, please check the changes.
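A minimal sketch of what such a helper in `parallel_state` might look like. The function name `amax_reduction_group_is_initialized` and the private global are assumptions for illustration; the merged code may differ.

```python
# Illustrative sketch only; the name and internals are assumed,
# not taken verbatim from the merged PR.
_AMAX_REDUCTION_GROUP = None


def amax_reduction_group_is_initialized():
    """Return True if the FP8 amax reduction group has already been created."""
    return _AMAX_REDUCTION_GROUP is not None
```

Callers can then log an informational message based on this helper instead of reaching into the private `_AMAX_REDUCTION_GROUP` attribute directly.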

- Remove unused global variable
- Add new function to check amax reduction group initialization
@yafshar
Contributor Author

yafshar commented Jan 15, 2025

@libinta would you please check this PR (and add a run_label)? It is necessary for PEFT + sentence transformers.

@regisss
Collaborator

regisss commented Jan 20, 2025

@yafshar LGTM! Can you also merge the latest main branch into your branch? I'll trigger the CI after that 🙂

@yafshar
Contributor Author

yafshar commented Jan 21, 2025

@regisss please trigger the CI.

@regisss regisss merged commit 49aac84 into huggingface:main Jan 21, 2025
@yafshar yafshar deleted the parallel_state branch January 21, 2025 17:52

3 participants