Replies: 13 comments 1 reply
-
Unsloth doesn't support full finetune though.
-
@adamo1139 May I know a bit more? It seems I have been using unsloth for full finetuning many times.
-
I've tried to see whether Daniel commented on it recently, and it looks like it officially isn't supported but might kind of work anyway. I just assumed it didn't work, since that was the communication I saw.
-
Ah! I never knew that, thanks for pointing it out!
-
We had a discussion on Discord - it's possible to do CPU offloading, but in a smart way: i.e. offload 1/2 of the layers and bring the other 1/2 back in dynamically. This can hide all the communication and cut VRAM usage by 50% - it's more of an engineering challenge to make it work, sadly.
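For context, a minimal sketch of what such a scheme could look like in plain PyTorch (this is not Unsloth's implementation): the offloaded blocks keep their master weights in pinned CPU memory, and while block i computes on the GPU, block i+1's weights are copied in on a side stream so the transfer hides behind the compute. It is forward-only and glosses over allocator/stream-safety details (e.g. Tensor.record_stream) that a real implementation would need.

```python
import torch
import torch.nn as nn

copy_stream = torch.cuda.Stream()  # dedicated stream for CPU->GPU weight copies

class OffloadedBlocks:
    def __init__(self, blocks: nn.ModuleList):
        self.blocks = blocks
        # Master copies live in pinned CPU memory so H2D copies can be async.
        self.cpu_params = [
            {n: p.detach().cpu().pin_memory() for n, p in b.named_parameters()}
            for b in blocks
        ]

    def _load(self, i: int):
        # Queue async CPU->GPU copies of block i's weights on the side stream.
        with torch.cuda.stream(copy_stream):
            for n, p in self.blocks[i].named_parameters():
                p.data = self.cpu_params[i][n].to("cuda", non_blocking=True)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        self._load(0)
        for i, block in enumerate(self.blocks):
            # Compute must wait until block i's weights have actually arrived.
            torch.cuda.current_stream().wait_stream(copy_stream)
            if i + 1 < len(self.blocks):
                self._load(i + 1)  # overlap the next copy with this compute
            hidden = block(hidden)
            for p in block.parameters():  # drop the GPU copy to free VRAM
                p.data = torch.empty(0, device="cuda")
        return hidden
```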
-
Looks reasonable! Sad to hear it requires so much engineering...
-
Ye, unfortunately the goal is to not make it slower - the dumbest solution is dynamic offloading, i.e. offload everything if it doesn't fit, then bring it in slowly - this is not a good idea, sadly.
-
@danielhanchen I thought about it a bit more, and here is my brainstorm:

Firstly, DeepSpeed does support finetuning in such a scenario (7B params). According to its estimation, we need 15GB of GPU memory (good) and 158GB of CPU memory (well...), if we only offload the optimizer to CPU (and do not offload the params to CPU).

Now the problem becomes whether we can reduce the CPU memory requirement. My naive thought is that there is adamw_8bit (it does not support CPU yet, but this seems to be an engineering problem rather than a research problem), so maybe we can reduce the 8 bytes/param to 2 bytes/param for the optimizer states. There may be other optimizations available, since the scenario is single-GPU, which removes some of the overhead caused by a multi-GPU setup.

And theoretically speaking, since a 1.5B model needs 14GB of memory (I tested), it seems 7B should need about 65GB of memory in total, which is acceptable.
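As a rough sanity check of the per-parameter byte counts above (this ignores activations, fp32 master weights, fragmentation and framework overhead, so real usage will be higher):

```python
# Back-of-envelope memory math for a 7B model, assuming bf16 weights/grads
# and AdamW's two moment tensors per parameter.
params = 7e9
gib = 1024 ** 3

weights_bf16 = params * 2   # 2 bytes/param
grads_bf16   = params * 2   # 2 bytes/param
adamw_fp32   = params * 8   # two fp32 moments -> 8 bytes/param
adamw_8bit   = params * 2   # two int8 moments -> 2 bytes/param

print(f"weights + grads    : {(weights_bf16 + grads_bf16) / gib:5.1f} GiB")
print(f"AdamW states (fp32): {adamw_fp32 / gib:5.1f} GiB")
print(f"AdamW states (8bit): {adamw_8bit / gib:5.1f} GiB")
```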
-
@fzyzcjy Sounds reasonable - actually, on optimizers: Torch AO has a fully CPU-offloaded 8-bit AdamW optimizer https://github.com/pytorch/ao?tab=readme-ov-file#memory-efficient-optimizers which might be interesting.
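A minimal sketch of what using it could look like; the module path and keyword names below are taken from recent torchao releases and may differ in the version you install, so check the linked README:

```python
import torch
import torch.nn as nn
# Assumed import path for recent torchao versions; it may move out of `prototype`.
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = nn.Linear(4096, 4096, device="cuda")

# Optimizer states (and optionally gradients) live in CPU RAM; the step runs
# against CPU copies and the updated parameters are copied back to the GPU.
opt = CPUOffloadOptimizer(
    model.parameters(),
    torch.optim.AdamW,       # underlying optimizer class
    offload_gradients=True,  # also move gradients off the GPU after backward
    lr=1e-5,
)

x = torch.randn(8, 4096, device="cuda")
loss = model(x).square().mean()
loss.backward()
opt.step()
opt.zero_grad()
```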
-
@danielhanchen That looks interesting, thank you!
-
@danielhanchen Hi, are there any updates? I am now interested in finetuning a Llama 3.2 90B vision model using LoRA on a 24GB card (with 64GB of CPU memory). I'm wondering whether the vanilla layer-by-layer copy would fit this.
-
@danielhanchen I am happy to send a PR if this is not too much engineering work. (The most naive version may be to just move tensors between CPU and GPU here and there, possibly asynchronously to allow overlapping communication and computation - does that look good to you?)
-
We now partially offload some of the layers from the GPU.
-
Hi! I wonder whether unsloth will support some kind of CPU offload?
For example, I would like to finetune a 7-8B model on a 24GB GPU. Since LoRA usually results in reduced performance, it would be great if I could do a full finetune.
There seem to be some techniques for CPU offloading during training (e.g. DeepSpeed has some), not to mention the commonly seen CPU offloading for inference. However, searching unsloth's docs does not turn up anything about configuring CPU offloading.
Thus I wonder: is it impossible, does it have severe drawbacks (e.g. it would be 100x slower), or is it just not yet implemented / on the roadmap? Thanks!
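For reference, this is the kind of DeepSpeed setup the question has in mind: ZeRO stage 2 with the optimizer states offloaded to CPU, sketched here as a config dict passed to Hugging Face's Trainer (values are illustrative, not a tuned recipe):

```python
from transformers import TrainingArguments

# "auto" entries are filled in from TrainingArguments by the HF integration.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
)
```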