Mac compatibility! #170
Mac Support: Adapt FramePack for Apple Silicon
|
@donghao1393 , could I get your and @brandon929's support to test this out? I just don't have a more powerful machine. Happy to adjust numbers / make this adaptive for the chip you've got |
|
@mdaiter Thanks for the invitation. I'd be glad to help you test it after finishing my current tasks of submitting those PRs to PyTorch. I thought the issue, #65 (comment), must be on my side, so let me re-run the test and see if there is good news.
|
Thanks @donghao1393 ! Appreciate it. |
@@ -1,8 +1,10 @@
from tkinter import W
This breaks compatibility with my Python installed using homebrew. Also, I don't think the import is being utilized as I didn't run into any issues when I removed the line.
Yep! My bad - I can remove this
|
I did some testing with these changes, compared to just using an MPS pytorch device without any extra changes to FramePack (see my fork of FramePack). Overall, the changes you have added support generating frames at the "640" resolution bucket. If you just use an MPS device with the standard code, you get noisy frames at this resolution (as you mention above, this is not a FramePack issue; the same problem happens in ComfyUI and seems to be an issue with MPS support in pytorch).

On the downside, I do see a performance hit of 50% at lower resolutions that do otherwise generate correctly with only changing the MPS device. At higher resolutions the hit appears to be 100% - however, that may not be a true comparison, as without your changes this resolution only generates noisy garbage frames.

I am testing with an 80 GPU core M3 Ultra using the same image, prompt, etc. with the following results, at a frame resolution of 544 x 704 (this is the standard "640" bucket).

Here is a dump from Terminal for a complete run using this PR with TeaCache enabled:

Here is a dump with the same (kind of) settings using the standard FramePack pipeline with an MPS device with TeaCache enabled. There is another difference in that I am rendering the video at 24fps, so I am generating fewer latent frames in total than if the video was rendered at 30fps. However, each pass should be the same, only there is one fewer pass (you can confirm by looking at the decoded latent tensor shapes):
|
Also, here are the results of running your test scripts.
Results from an M3 Ultra (80 core GPU) Mac Studio:
Results from an M4 Max (40 core GPU) MacBook Pro:
|
@brandon929, you're the man! Thank you.
|
|
I have run it on an M4 Max + 128GB. Here are the results for the test script.
|
Okay. The move's probably to do a fallback system to this, in case frames are too large to fit into a |
|
Alright @brandon929 + @donghao1393, you're up. Got it running on my M3 with automatic chunking for that fused torch.nn.functional attention function; it flies now. Getting 180 seconds / iteration; you should get it to be a lot faster. My machine needs to chunk it 6 times, but you probably need to chunk once because you've got so much RAM. I'd guess it runs in 20-30 seconds per iteration.
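For anyone curious how a chunk count like "6 on an 18 GB machine, 1 on a high-memory one" could be derived automatically, here is a rough sketch. The headroom factor and the use of torch.mps.recommended_max_memory() are my assumptions for illustration, not necessarily what this PR does:

```python
import torch

def estimate_sdpa_chunks(q, k, max_buffer_bytes=None):
    """Rough heuristic for how many query chunks keep the Q @ K^T score
    matrix under a Metal buffer budget. Assumes a float32 softmax."""
    if max_buffer_bytes is None:
        # recommended_max_memory() exists in newer PyTorch builds; otherwise
        # fall back to a conservative guess. Leave headroom for weights etc.
        get_mem = getattr(torch.mps, "recommended_max_memory", None)
        total = get_mem() if (torch.backends.mps.is_available() and get_mem) else 16 * 1024**3
        max_buffer_bytes = total // 4

    b, h, q_len, _ = q.shape                      # [batch, heads, query_len, head_dim]
    score_bytes = b * h * q_len * k.shape[-2] * 4  # full score matrix in float32
    return max(1, -(-score_bytes // max_buffer_bytes))  # ceil division
```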
|
(once that's tested, I'll take out that chunk print statement - wanna see what your chunk size is) |
|
I have run the latest commit. The time per iteration has been reduced to around 20 seconds, but the issue with the output video still exists.
|
On M3 Ultra, I see 27.42s/it for the first iteration (so no TeaCache skipping steps). However, the video is "blown out", which is a condition I have seen when using the non-chunked inferencing on my branch and the resolution is too high.

With high resolutions and MPS, there is a point where the colors are blown out but there are still shapes, etc. recognizable from the image. Then there is a second point where it only creates noisy blocks with no recognizable shape in the image. I have not looked at your changes yet, but I am assuming the chunking is meant to help avoid some of these MPS issues with large tensors. If so, perhaps the threshold used to determine the chunk size is still too large.

I also ran this on my 128GB M4 Max MBP (so there is thermal throttling), and it chose the same chunk size and did 55s/it on the first iteration. The M4 Max's final denoising time with TeaCache for the first 33 frames was [14:54<00:00, 35.79s/it]. The video it generated had the same blown-out colors.

M3 Ultra output:
|
I did try reducing the Perhaps it is a problem with the |
|
Super strange - @brandon929 , you said my code worked previously at high res when yours didn't, right? It might be worth checking whether that's the cause of this issue. We could downscale the photo and try at lower res? |
|
Ahhhh you know what I think it is? |
|
Ahhh @donghao1393 @brandon929 so I actually think it is because of the way MPS implements fused attention.

At its core, the issue isn't the model or the data; it's the way Apple's Metal Performance Shaders backend implements "fused" attention. Unlike the CPU and CUDA paths, which tile and stream the Q⋅Kᵀ → softmax → V matmul under the hood, MPS literally materializes the entire Q×K score matrix in bfloat16 and then immediately casts it to float32 for the softmax. On an 18 GB machine that full-buffer approach blows well past the per-buffer cap, so you either crash or silently fall back to a lower-precision code path that corrupts the output.

Compounding that, when you do any manual or chunked fallback entirely in bfloat16, the 7-bit mantissa simply can't represent the fine gradations of a long softmax over thousands of keys, so even if it "fits" in memory the attention weights collapse (under- or overflow) and your decoded image ends up with blown-out colors. In short: MPS's naive full-matrix allocation and limited bf16 precision conspire to both exceed your memory budget and destroy the numerical fidelity of the softmax. I discovered this behavior in Apple's MPS implementation of scaled_dot_product_attention.

On beefier MPS devices (e.g. 128 GB), you can optimistically try the full fused call first and automatically fall back to chunking only if it OOMs. In practice this means you get the fastest path out-of-the-box on high-memory Macs, and a robust, memory-safe chunked fallback on 16–18 GB systems, with no blown-out colors and predictable performance. Please let me know what's best!
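To make the suggestion concrete, here is a minimal sketch of the "optimistic fused call, then chunked fallback with a float32 softmax" strategy described above. It is an illustration, not the code in this PR; the function name, chunk count, and the bare RuntimeError catch are placeholders, and masks/causal attention are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def sdpa_with_mps_fallback(q, k, v, num_chunks=4):
    """Try the fused kernel first; if it fails (e.g. the Q @ K^T buffer is too
    large for a single Metal allocation), fall back to query-chunked attention
    with the softmax computed in float32."""
    try:
        return F.scaled_dot_product_attention(q, k, v)
    except RuntimeError:
        pass  # fall through to the chunked path

    scale = q.shape[-1] ** -0.5
    outputs = []
    for q_chunk in q.chunk(num_chunks, dim=-2):  # split along the query length
        # Scores for this chunk only, in float32, so a long softmax over
        # thousands of keys does not under/overflow in bfloat16.
        scores = (q_chunk.float() @ k.float().transpose(-2, -1)) * scale
        weights = scores.softmax(dim=-1)
        outputs.append((weights @ v.float()).to(q.dtype))
    return torch.cat(outputs, dim=-2)
```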
|
@mdaiter Following your advice, I have made this commit, donghao1393@c11d3fd. It may not be the correct one, because the output has quality issues across all pixels. Here is the running log.

speed: (19*60+11)/37 = 31.11 s/it

The output cannot be played: 250422_224525_098_8997_10.mp4. You can copy the download link and watch it locally.
|
The issue with MP4s and play-back compatibility was corrected; if you merge the latest from lllyasviel:main you will get the fix. However, when I play the video you uploaded it has the blocky-noise issue discussed by @mdaiter above. There is some threshold where chunking is necessary to ensure the generated frames look correct, and this quality problem occurs before you hit an OOM. There is also an issue where you get frames that match expectations but the colors are "blown-out", which @mdaiter provided an explanation for above.

Overall, it seems to depend on the tensor size passed to the sampling stage (so the problem is likely localized to the attention calculations and the tensors being utilized there; the latent image tensors passed to the sampler aren't particularly large):
250423_074338_892_2939_9.mp4

I have not yet had the time to dig into the problem and identify exactly where those thresholds lie. I know with my 128GB and 512GB Macs, at the default "640" bucket I get issue 2. At a lower resolution bucket like "576" I get issue 3. At the resolution bucket "416" I never see any artifacts in the generated output. These are not issues specifically with FramePack, as I see the same issues using ComfyUI with MPS devices.

Ideally, a short-term solution would be to identify these thresholds and take the minimally-intrusive approach. For example, with case 4, if there is enough RAM just use the PyTorch MPS fused attention implementation; if there is not enough RAM, chunk as appropriate. With cases 3 and 4, maybe resort to chunking and using FP32, taking the perf and RAM usage hit. The long-term solution would be to fix the MPS fused attention implementation in PyTorch. (I have no idea how much work this is.)
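As an illustration of that "minimally-intrusive" short-term approach, a threshold dispatch might look like the sketch below. The byte cut-offs and path names are hypothetical placeholders, not values measured in this thread:

```python
# Hypothetical cut-offs; the real thresholds would have to be measured per
# machine / resolution bucket, as discussed above.
FUSED_OK_BYTES = 4 * 1024**3       # below this, the fused MPS kernel is fine
FP32_NEEDED_BYTES = 12 * 1024**3   # above this, chunk and compute in float32

def choose_attention_path(q, k):
    """Pick the least intrusive attention path from the Q @ K^T score size."""
    b, h, q_len, _ = q.shape
    score_bytes = b * h * q_len * k.shape[-2] * 4  # float32 worst case
    if score_bytes < FUSED_OK_BYTES:
        return "fused"          # use the PyTorch MPS fused implementation
    if score_bytes < FP32_NEEDED_BYTES:
        return "chunked_bf16"   # chunk queries, keep bfloat16
    return "chunked_fp32"       # chunk queries and take the FP32 perf/RAM hit
```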
|
Thanks for the detailed explanation! I have dug into the attention mechanism implementation in the PyTorch MPS backend and understand the problem you describe. The core of the problem is indeed the way the MPS backend implements fused scaled-dot-product attention.
The three thresholds you mentioned make perfect sense.
I think the long-term solution is indeed to fix the MPS fused attention implementation in PyTorch to use chunking and streaming like the CPU and CUDA backends. This would fundamentally solve the problem, rather than circumventing it at the application level. I'd love to contribute to this effort, and I can try to implement this fix in PyTorch if you're interested. Correct me if I'm wrong.
|
I have this error on a MacBook Pro M1 Max 64GB, in a clean py310 venv:
Result:
Following the suggestion, or commenting out the two lines in question, raises:
macOS: 15.3.1 (24D70)
If you understand the PyTorch backend and have an idea of how to fix it, I'm sure the PyTorch folks would appreciate it. It feels like the MPS backend continues to trail behind, sadly because it gets less attention (a lot of the devs just don't have Macs) :(
|
Nice work! Please let me know if you need any testing on a high VRAM machine. I can run tests on an M2 Ultra with 128GB unified memory. Feel free to @ me directly, so I get notified. |
|
Is this still being worked on? Are there plans to update it with the FramePack-F1 support changes from yesterday? Edit: It looks like F1 won't work because avg_pool3d isn't implemented for MPS. |
|
FYI, I got F1 working on Mac using the same changes that @brandon929 made for the normal version, but I had to modify center_down_sample_3d to use the commented-out code instead of the call to PyTorch's avg_pool3d, as that's not supported for MPS. It generates output that looks good, though I'm not sure it's any faster.
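For reference, a non-overlapping 3D average pool can be written as a reshape-plus-mean, which MPS handles fine. This is a sketch of that general idea (assuming the kernel size divides the input evenly), not necessarily the exact commented-out code in center_down_sample_3d:

```python
import torch

def avg_pool3d_mps_safe(x, kernel_size):
    """Non-overlapping 3D average pooling via reshape + mean, usable on
    backends (like MPS) where F.avg_pool3d is not implemented.
    x: [batch, channels, T, H, W]; kernel_size (kt, kh, kw) must divide T, H, W."""
    kt, kh, kw = kernel_size
    b, c, t, h, w = x.shape
    x = x.reshape(b, c, t // kt, kt, h // kh, kh, w // kw, kw)
    return x.mean(dim=(3, 5, 7))  # average over each kt x kh x kw block
```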
|
@Morac2 |
I built it, but I was having issues installing it, as I couldn't find a compatible torchaudio and torchvision. I eventually got it installed with torchaudio 0.11.0 and torchvision 0.12.0, both of which are really old, but newer versions complain they aren't compatible. Generation is working, so I guess it's okay, but it seems like it should be installing a newer version. On the plus side, your version of torch worked with
Edit: I forced installation of the nightly torchaudio and torchvision. Despite it complaining they aren't compatible, it ran.
|
Is there an updated version of test_mps_attention.py? The chunked_attention_bfloat16 function being imported doesn't exist in hunyuan_video_packed; it was removed in commit 2a5edc0, so the test errors out immediately.
|
I have done a first trial of applying flash attention on MPS, but it requires more work to ensure accuracy. Currently I am working on a feature in another project of mine, and I will be on holiday starting next week. I will get back to this after that.
According to pytorch/pytorch#139668, flash attention for MPS is already implemented in PyTorch. It looks like Metal flash attention support was merged into PyTorch 3 weeks ago (pytorch/pytorch#152781). pytorch/pytorch#151742 still hasn't been merged for some reason.
|
Looks gorgeous. I will test whether those PRs fit FramePack.
model.to(device=target_device)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
torch.mps.empty_cache()
torch.mps.empty_cache should only be called when MPS is available.
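A sketch of the guard being suggested, mirroring the existing CUDA check (assuming there are no other constraints in the surrounding code):

```python
import torch

def empty_device_cache():
    # Only clear the cache for the backend that is actually present.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    elif torch.backends.mps.is_available():
        torch.mps.empty_cache()
```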


Hey all!
Got Mac compatibility fairly up and running. Tested it, the latents came out well! It's just kind of slow - working on this now.
It seems like there's some kind of funky stuff going on with Mac's underlying MPS system: transformers aren't run in parallel, which leads to slowdowns (5 mins / frame), but the frames do come out in the proper form.
Happy to answer commentary and get this merged. I'm working on seeing if there's a way to use the torch.nn.functional scaled dot product attention function.
Right now, there's a big component to this: 15GB is too large a buffer to allocate in MPS on Metal for a single frame. That leads to a necessity for chunking the scaled dot product attention portion of this.
Happy to take commentary, make edits, and clean up this code. Would love high level feedback if you've got it.
Big shout out to @donghao1393 for cleaning up some of this code