
Add checks for parallel_state initialization #1680

Merged
regisss merged 6 commits into huggingface:main from yafshar:parallel_state
Jan 21, 2025
Conversation

@yafshar
Contributor

@yafshar yafshar commented Jan 6, 2025

What does this PR do?

  • Modified parallel_state initialization to include a check for uninitialized state.
  • Added validation to ensure sequence parallel world size matches context parallel size.
  • Included a check to ensure FP8 amax reduction group is not already initialized.
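The three checks above could be sketched roughly as follows. This is an illustrative sketch, not the merged code: the module-level globals, guard names, and error messages are assumptions modeled on a Megatron-LM-style `parallel_state` API.

```python
# Hypothetical sketch of the guards described in this PR; names such as
# _SEQUENCE_PARALLEL_WORLD_SIZE and _AMAX_REDUCTION_GROUP follow the
# Megatron-LM-style parallel_state module, but details are assumed.

_SEQUENCE_PARALLEL_WORLD_SIZE = None
_AMAX_REDUCTION_GROUP = None


def model_parallel_is_initialized():
    """Return True if the model parallel state has already been set up."""
    return _SEQUENCE_PARALLEL_WORLD_SIZE is not None


def initialize_model_parallel(sequence_parallel_size=1, use_fp8=False):
    global _SEQUENCE_PARALLEL_WORLD_SIZE
    if model_parallel_is_initialized():
        # Already initialized: validate consistency instead of re-initializing.
        if _SEQUENCE_PARALLEL_WORLD_SIZE != sequence_parallel_size:
            raise RuntimeError(
                "The initialized sequence parallel world size does not "
                "match the context parallel size."
            )
        if not use_fp8 and _AMAX_REDUCTION_GROUP is not None:
            raise RuntimeError(
                "FP8 amax reduction group is already initialized."
            )
        return
    _SEQUENCE_PARALLEL_WORLD_SIZE = sequence_parallel_size
```

With this shape, a second call with the same arguments becomes a no-op, while a call with a mismatching size fails loudly instead of silently re-creating the groups.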

Background

The #1501 PR added support for Context Parallelism, building on the model and data parallel groups used in Megatron-LM. In the original Megatron-LM implementation, these global objects are managed through an init function, whereas #1501 moved the parallel_state setup into a class:

class GaudiPartialState(PartialState):

whose constructor calls

parallel_state.initialize_model_parallel(sequence_parallel_size=context_parallel_size, use_fp8=False)

This usage caused issues such as #1649, where the object created from GaudiTrainingArguments is constructed more than once, so the parallel groups are initialized repeatedly.

Fixes #1649

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

- Modified parallel_state initialization to include a check for
  uninitialized state.
- Added validation to ensure sequence parallel world size matches
  context parallel size.
- Included a check to ensure FP8 amax reduction group is not
  already initialized.
@yafshar yafshar marked this pull request as ready for review January 6, 2025 22:26
@yafshar yafshar requested a review from regisss as a code owner January 6, 2025 22:26
@yafshar
Contributor Author

yafshar commented Jan 6, 2025

@bhargaveede can you review this PR?

@yafshar yafshar mentioned this pull request Jan 6, 2025
Comment thread: optimum/habana/accelerate/state.py (Outdated)
"The initialized sequence parallel world size does not match the context parallel size."
)
# Ensure that the parallel_state is initialized similarly with use_fp8=False
if parallel_state._AMAX_REDUCTION_GROUP is not None:
if parallel_state._AMAX_REDUCTION_GROUP is not None:

This is just to check fp8 is set as False, am I right? Rest all seems fine.

Contributor Author


Yes, since this constructor uses use_fp8=False, this is an extra check to ensure consistency.

Contributor Author


Do you think this check is unnecessary? If we keep it, I should add a getter function in parallel_state.

Contributor Author


@bhargaveede I changed this check to an info message and added a function in parallel_state. If you have time, please check the changes.
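A minimal sketch of what such a helper in `parallel_state` might look like. The function name `amax_reduction_group_is_initialized` and the private global are assumptions for illustration; the merged code may differ.

```python
# Illustrative sketch only; the name and internals are assumed,
# not taken verbatim from the merged PR.
_AMAX_REDUCTION_GROUP = None


def amax_reduction_group_is_initialized():
    """Return True if the FP8 amax reduction group has already been created."""
    return _AMAX_REDUCTION_GROUP is not None
```

Callers can then log an informational message based on this helper instead of reaching into the private `_AMAX_REDUCTION_GROUP` attribute directly.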

- Remove unused global variable
- Add new function to check amax reduction group initialization
@yafshar
Contributor Author

yafshar commented Jan 15, 2025

@libinta would you please check this PR (and add a run_label)? It is necessary for PEFT + sentence transformers.

@regisss
Collaborator

regisss commented Jan 20, 2025

@yafshar LGTM! Can you also merge the latest main branch into your branch? I'll trigger the CI after that 🙂

@yafshar
Contributor Author

yafshar commented Jan 21, 2025

@regisss please trigger the CI.

@regisss regisss merged commit 49aac84 into huggingface:main Jan 21, 2025
@yafshar yafshar deleted the parallel_state branch January 21, 2025 17:52

3 participants