Add checks for parallel_state initialization #1680
- Modified parallel_state initialization to include a check for uninitialized state.
- Added validation to ensure sequence parallel world size matches context parallel size.
- Included a check to ensure FP8 amax reduction group is not already initialized.
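The three checks listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `state` namespace, its field names, and `ensure_parallel_state` are hypothetical stand-ins for the module-level globals of a Megatron-style `parallel_state`, and logging the FP8 case at info level is only one possible choice.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the module-level globals of a Megatron-style
# parallel_state; the field names are illustrative only.
state = SimpleNamespace(
    initialized=False,
    sequence_parallel_world_size=None,
    amax_reduction_group=None,
)


def ensure_parallel_state(context_parallel_size):
    """Initialize the shared state once; on later calls only validate it."""
    if not state.initialized:
        # First call: set up the (stand-in) global state.
        state.initialized = True
        state.sequence_parallel_world_size = context_parallel_size
        return "initialized"
    # Already initialized: the existing state must match the requested size.
    if state.sequence_parallel_world_size != context_parallel_size:
        raise ValueError(
            "The initialized sequence parallel world size does not match "
            "the context parallel size."
        )
    # This path assumes use_fp8=False, so an existing FP8 group is unexpected.
    if state.amax_reduction_group is not None:
        logger.info("FP8 amax reduction group is already initialized.")
    return "reused"
```

With a guard like this, constructing the owning object a second time reuses the existing groups instead of trying to initialize them again, and a mismatched configuration fails loudly.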
@bhargaveede can you review this PR?
    "The initialized sequence parallel world size does not match the context parallel size."
)
# Ensure that the parallel_state is initialized similarly with use_fp8=False
if parallel_state._AMAX_REDUCTION_GROUP is not None:
This is just to check that fp8 is set to False, am I right? The rest looks fine.
Yes. Since this constructor uses use_fp8=False, this is an extra check to ensure the state is consistent.
Do you think this check is unnecessary? If we keep it, I should add a getter function in parallel_state.
@bhargaveede I changed this check to an info message and added a function in parallel_state. If you have time, please check the changes.
- Remove unused global variable
- Add new function to check amax reduction group initialization
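A getter of the kind described in this commit might look like the sketch below. The function names `is_amax_reduction_group_initialized`, `set_amax_reduction_group`, and `check_fp8_consistency` are hypothetical (the thread does not give the actual names); only the private global `_AMAX_REDUCTION_GROUP` appears in the diff. The point is that callers ask through a function instead of reading the private global directly.

```python
import logging

logger = logging.getLogger(__name__)

# Mirrors the private global seen in the diff; None means "not initialized".
_AMAX_REDUCTION_GROUP = None


def set_amax_reduction_group(group):
    """Hypothetical helper used here only to simulate FP8 initialization."""
    global _AMAX_REDUCTION_GROUP
    _AMAX_REDUCTION_GROUP = group


def is_amax_reduction_group_initialized():
    """Hypothetical getter: report whether the FP8 amax group exists."""
    return _AMAX_REDUCTION_GROUP is not None


def check_fp8_consistency():
    """Return True when the state matches a use_fp8=False setup."""
    if is_amax_reduction_group_initialized():
        # Report the inconsistency at info level instead of raising.
        logger.info("FP8 amax reduction group is already initialized.")
        return False
    return True
```

Logging at info level rather than raising keeps an unexpected FP8 group from aborting a run that does not use FP8, while still leaving a trace for debugging.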
@libinta would you please check this PR? (add a run_label) It is necessary for PEFT + sentence transformers.
@yafshar LGTM! Can you also merge the latest main branch into your branch? I'll trigger the CI after that 🙂
@regisss please trigger the CI.
What does this PR do?
Background
PR #1501 added support for Context Parallelism and took advantage of the model and data parallel groups used in Megatron-LM. In the original (Megatron-LM) implementation, the global objects are set up through an init function, whereas PR #1501 initializes parallel_state inside a class:
optimum-habana/optimum/habana/accelerate/state.py, line 31 in f48dda8
optimum-habana/optimum/habana/accelerate/state.py, line 91 in f48dda8
This usage causes issues such as the one in #1649, where the object created using GaudiTrainingArguments is created more than once.
Fixes # (issue)
#1649
Before submitting