feat: refactor main_ds.py (2/n) Accelerator class #594

cdoern · 2025-06-04T14:13:04Z

** PLEASE NOTE, THIS PR INCLUDES THE CHANGES IN #572, AND WILL BE REDUCED IN SIZE ONCE THAT MERGES **

Introduce a new design for key components of main_ds.py, namely splitting Model initialization, Accelerator initialization, Optimizer initialization, and Checkpoint saving initialization into classes. This commit introduces the Accelerator class.

The Accelerator class aims both to store commonly accessed variables associated with the accelerated model and to abstract model/optimizer mutation away from the user, who should only access our Model and Optimizer classes.

These classes are one of a few steps needed to "SDK-ify" the training library.

Adding structure to code via classes can either be someone's favorite or least favorite thing. So I figured I'd explain myself before continuing. Here is my rationale:

Classes provide logical structure to code, especially code meant to be a publicly consumable SDK, and allow you to associate related objects and methods with one another.

Being able to group functionality under the Model, Accelerator, and Checkpointer classes inherently reduces code complexity and duplication. Storing things like self.distributed_framework, self.lora_config, etc. on the class, where they are accessible from different methods, drastically reduces the number of arguments per method and avoids complex return values. Simpler methods and simpler argument/return values allow for simpler testing of code.
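To illustrate the argument-count point, here is a minimal sketch, not the code from this PR. `AcceleratorSketch`, `describe_setup`, and `samples_per_gpu` are hypothetical names made up for the example; only `distributed_framework` and `lora_config` come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


# Free-function style: every helper needs the shared state passed in explicitly.
def describe_setup_fn(
    distributed_framework: str,
    lora_config: Optional[dict],
    samples_per_gpu: int,
) -> str:
    lora = "with LoRA" if lora_config is not None else "without LoRA"
    return f"{distributed_framework} {lora}, {samples_per_gpu} samples/GPU"


@dataclass
class AcceleratorSketch:
    """Hypothetical stand-in; NOT the Accelerator class added in this PR."""

    distributed_framework: str
    lora_config: Optional[dict] = None
    samples_per_gpu: int = 1  # assumed attribute, for illustration only

    # Class style: the same helper takes no arguments because the
    # shared state already lives on self.
    def describe_setup(self) -> str:
        lora = "with LoRA" if self.lora_config is not None else "without LoRA"
        return f"{self.distributed_framework} {lora}, {self.samples_per_gpu} samples/GPU"
```

A test for the method only has to construct one object, e.g. `AcceleratorSketch("fsdp", {"r": 4}, 8).describe_setup()`, instead of threading three arguments through every call.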

Accelerator works with Model to abstract common utilities behind a custom class that allows users to seamlessly set up their model for training.

Signed-off-by: Charlie Doern <[email protected]>

github-actions · 2025-06-04T21:27:28Z

E2E (NVIDIA L40S x4) (python 3.11) workflow launched on this PR: View run

github-actions · 2025-06-05T01:14:07Z

e2e workflow succeeded on this PR: View run, congrats!

thisisatharva-rh · 2025-06-05T01:17:16Z

src/instructlab/training/accelerator.py

+class Accelerator:
+    def __init__(
+        self,
+        model: Model,


Model acts as a "factory" class that creates an nn.Module once the from_pretrained method is called, for each of the Liger, Dolomite, and regular transformer models. That nn.Module is what we should pass into the Accelerator, so that we can avoid the weird model.model references.

So, I think by not passing in model: Model we lose a lot of the seamless nature of these classes. Things like self.model.lora_config are not possible within the Accelerator class if model is type hinted to nn.Module, right?

Yeah, that's correct. I guess we need to refine the Model class more for that to happen; approving for now in the spirit of getting the refactor in quickly.
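The trade-off discussed in this thread can be sketched roughly as follows. This is an illustration under assumptions, not the code under review: `ModuleSketch` stands in for a bare nn.Module, and `ModelSketch` and `AcceleratorWithWrapper` are hypothetical names. Only `lora_config` and the `model.model` access pattern are taken from the comments above.

```python
from typing import Any, Optional


class ModuleSketch:
    """Stand-in for a bare nn.Module: it carries no training configuration."""

    def forward(self, x: Any) -> Any:
        return x


class ModelSketch:
    """Hypothetical wrapper: holds the underlying module plus its config."""

    def __init__(self, lora_config: Optional[dict] = None):
        self.model = ModuleSketch()  # the wrapped module
        self.lora_config = lora_config


class AcceleratorWithWrapper:
    """Hypothetical accelerator that is handed the wrapper, not the module."""

    def __init__(self, model: ModelSketch):
        self.model = model

    def uses_lora(self) -> bool:
        # Possible only because the wrapper was passed in.
        return self.model.lora_config is not None

    def forward(self, x: Any) -> Any:
        # The cost: reaching the module takes the model.model double hop.
        return self.model.model.forward(x)
```

Passing the bare module instead would remove the double hop, but `lora_config` would then have to reach the accelerator some other way, for example as a separate constructor argument.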

thisisatharva-rh

We need to have a larger conversation about what the Model class should look like, but for now, this looks good.

mergify bot added the testing and ci-failure labels Jun 4, 2025

cdoern force-pushed the refactor-accelerator branch from 1de1d23 to 25230a7 on June 4, 2025 17:26

mergify bot removed the ci-failure label Jun 4, 2025

cdoern force-pushed the refactor-accelerator branch 2 times, most recently from aa7af32 to 81f731e on June 4, 2025 18:15

mergify bot added ci-failure and removed ci-failure labels Jun 4, 2025

cdoern added 2 commits June 4, 2025 17:22

feat: add Accelerator class and usage

456fa3c

Accelerator works with Model to abstract common utilities behind a custom class that allows users to seamlessly setup their model for training

Signed-off-by: Charlie Doern <[email protected]>

feat: test Accelerator class

bf4be9f

Signed-off-by: Charlie Doern <[email protected]>

cdoern force-pushed the refactor-accelerator branch from 81f731e to bf4be9f on June 4, 2025 21:22

thisisatharva-rh reviewed Jun 5, 2025

thisisatharva-rh approved these changes Jun 5, 2025

mergify bot added the one-approval label Jun 5, 2025

booxter approved these changes Jun 5, 2025

mergify bot merged commit 3a2dcc1 into instructlab:main Jun 5, 2025
18 checks passed

mergify bot removed the one-approval label Jun 5, 2025

cdoern mentioned this pull request Jun 10, 2025

feat: refactor main_ds.py (3/n) Checkpointer Class #605

Open
