
Conversation

@mdaiter commented Apr 18, 2025

Sorry, the auto AI stuff bonked a lot of my code. Rolling some stuff back, this is just a train of thought. Basically, just disregard this.

I'm gonna start digging to get tinygrad grafted into the transformer bit. Metal blocks (and compacting that thing down in general) should help a lot with speed and memory usage on Apple Silicon machines.

@mdaiter (Author) commented Apr 18, 2025

Moving DynamicSwap_HunyuanVideoTransformer3DModelPacked to mps with preserved memory: 6 GB
  0%|                                                                                                                                                                                         | 0/25 [00:00<?, ?it/s]/Users/msd/Code/FramePack/diffusers_helper/models/hunyuan_video_packed.py:81: UserWarning: The operator 'aten::avg_pool3d.out' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:14.)
  return torch.nn.functional.avg_pool3d(x, kernel_size, stride=kernel_size)
  0%|                                                                                                                                                                                         | 0/25 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/Users/msd/Code/FramePack/demo_gradio.py", line 237, in worker
    generated_latents = sample_hunyuan(
                        ^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/pipelines/k_diffusion_hunyuan.py", line 116, in sample_hunyuan
    results = sample_unipc(k_model, latents, sigmas, extra_args=sampler_kwargs, disable=False, callback=callback)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 141, in sample_unipc
    return FlowMatchUniPC(model, extra_args=extra_args, variant=variant).sample(noise, sigmas=sigmas, callback=callback, disable_pbar=disable)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 118, in sample
    model_prev_list = [self.model_fn(x, vec_t)]
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 23, in model_fn
    return self.model(x, t, **self.extra_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/k_diffusion/wrapper.py", line 37, in k_model
    pred_positive = transformer(hidden_states=hidden_states, timestep=timestep, return_dict=False, **extra_args['positive'])[0].float()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 997, in forward
    hidden_states, encoder_hidden_states = self.gradient_checkpointing_method(
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 834, in gradient_checkpointing_method
    result = block(*args)
             ^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 654, in forward
    attn_output, context_attn_output = self.attn(
                                       ^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/diffusers/models/attention_processor.py", line 605, in forward
    return self.processor(
           ^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 174, in __call__
    hidden_states = attn_varlen_func(query, key, value, cu_seqlens_q, cu_seqlens_kv, max_seqlen_q, max_seqlen_kv)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 124, in attn_varlen_func
    x = torch.nn.functional.scaled_dot_product_attention(q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)).transpose(1, 2)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid buffer size: 14.45 GiB

^ Here's my current error, if anyone wants to tap in.

@mdaiter (Author) commented Apr 18, 2025

Threw in a Tensor recast in Metal. Here's where I'm at:
MemoryError: Metal OOM while allocating size=15510555648
Super, super strange. I think Metal caps the size of a single contiguous buffer allocation, so a tensor this large (~15.5 GB) fails even when enough total memory is free.
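If it is the working-set cap, recent PyTorch builds expose a couple of MPS memory queries that make it easy to compare the failing allocation against the budget. A minimal sketch, assuming these helpers are available in your PyTorch version:

```python
import torch

# Compare the failing ~15.5 GB allocation against the driver's budget.
# Both helpers exist in recent PyTorch builds (worth double-checking on yours).
print(torch.mps.recommended_max_memory())    # recommended working-set cap, in bytes
print(torch.mps.current_allocated_memory())  # bytes currently held by MPS tensors
```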

@mdaiter (Author) commented Apr 18, 2025

Ayyy, it's running (kinda). Still gotta confirm the output, but it's past 0%! Looks like the attention needed to be segmented, but MPS doesn't have a segmented attention built in, so I built my own.
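A minimal sketch of the idea (not the exact code in this PR): split the query sequence into chunks so each scaled_dot_product_attention call allocates an attention buffer small enough for Metal, while each chunk still attends over the full key/value sequence, so the result is mathematically identical to unchunked attention:

```python
import torch
import torch.nn.functional as F

def chunked_sdpa(q, k, v, chunk_size=4096):
    # q, k, v: (batch, heads, seq, head_dim). Chunking only the queries is
    # exact: each query row's softmax is independent of the other rows.
    outputs = []
    for i in range(0, q.shape[2], chunk_size):
        outputs.append(F.scaled_dot_product_attention(q[:, :, i:i + chunk_size], k, v))
    return torch.cat(outputs, dim=2)
```

Peak memory for the attention weights drops from O(seq²) to O(chunk_size · seq) per call, at the cost of a Python-level loop.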

@mdaiter (Author) commented Apr 18, 2025

@lllyasviel, would love your take on this. Got the MPS kernels thrown in, and split the attention up to batch it. I don't know if we should expose a slider through Gradio for chunk sizing on M-series machines (rough sketch below), but it's currently diffusing (slowly). One operation (aten::avg_pool3d, per the warning above) only runs on the CPU on Mac, so you can run it, but it just takes a while (about 1h30m for a 25-frame video).
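The slider idea would be something like this hypothetical sketch (names and ranges made up for illustration, not code from this PR):

```python
import gradio as gr

with gr.Blocks() as demo:
    # Let M-series users trade speed for peak memory; the value would be
    # threaded through to the worker that calls the chunked attention.
    attn_chunk_size = gr.Slider(
        label="Attention chunk size (M-series Macs)",
        minimum=1024, maximum=16384, value=4096, step=1024,
    )
```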

@mdaiter (Author) commented Apr 18, 2025

Okay, finally started updating portions of this for more gains.

  1. Manual flash attention: works! Still no major speedup.
  2. Matmul optimization: works! Still no major speedup.

Dug harder, and it turns out Python itself is the overhead: the 60 transformer blocks each take about 2.5 seconds on Mac, with buffers of these sizes passed between them, and since Mac can't process them in parallel, everything runs sequentially (oh dear god).
I've tried torch.compile for this model, but Dynamo's having a hissy fit (one speculative workaround is sketched below).
For now, I think this is probably the best it's gonna be. 5 minutes a frame, though :(
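The speculative workaround (not something that has worked here yet): compiling each transformer block separately with a simpler backend, which sometimes sidesteps whole-model Dynamo failures. `transformer_blocks` is an assumption about the packed model's layout:

```python
import torch

# Speculative sketch: compile blocks one at a time with the aot_eager backend
# instead of the default inductor, to dodge whole-graph Dynamo breaks.
for i, block in enumerate(transformer.transformer_blocks):
    transformer.transformer_blocks[i] = torch.compile(block, backend="aot_eager")
```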

@e1732a364fed commented

Just a note: there's another fork for Mac which seems to be faster:

https://www.reddit.com/r/StableDiffusion/comments/1k2neim/framepack_on_macos/

https://github.com/brandon929/FramePack

For reference, on my M3 Ultra Mac Studio and default settings, I am generating 1 second of video in around 2.5 minutes.

@mdaiter (Author) commented Apr 20, 2025

@e1732a364fed , appreciate it! Looking into it now.

@donghao1393 commented Apr 20, 2025

I've added MPS support in PyTorch for avg_pool3d.out (used by this repository) via pytorch/pytorch#151742. Another PR, pytorch/pytorch#151760, adding support for upsample_nearest3d.vec, is under development.

  • Previously, it took more than 100 seconds to render one frame on my M4 Max using code based on this PR, and the output quality was not good. I suppose something must be wrong in my fork, where MPS isn't actually being used.
FramePack on  main [✘!?⇡] via 🐍 v3.10.16 (env)
󰄛 ❯ bash run_demo.sh
Currently enabled native sdp backends: []
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Namespace(share=False, server='0.0.0.0', port=None, inbrowser=False)
Free VRAM 95.99955749511719 GB
High-VRAM Mode: True
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5789.24it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 15.99it/s]
Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 14855.86it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 20.31it/s]
transformer.high_quality_fp32_output_for_inference = True
* Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
latent_padding_size = 27, is_last_section = False
 48%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                                                                                                                                 | 12/25 [20:47<22:30, 103.92s/it]
Traceback (most recent call last):
  File "/Users/dong.hao/studio/projects/public/FramePack/demo_gradio.py", line 243, in worker
    generated_latents = sample_hunyuan(
  File "/Users/dong.hao/studio/projects/public/FramePack/docs/pytorch-offcial/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Users/dong.hao/studio/projects/public/FramePack/diffusers_helper/pipelines/k_diffusion_hunyuan.py", line 116, in sample_hunyuan
    results = sample_unipc(k_model, latents, sigmas, extra_args=sampler_kwargs, disable=False, callback=callback)
  File "/Users/dong.hao/studio/projects/public/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 141, in sample_unipc
    return FlowMatchUniPC(model, extra_args=extra_args, variant=variant).sample(noise, sigmas=sigmas, callback=callback, disable_pbar=disable)
  File "/Users/dong.hao/studio/projects/public/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 134, in sample
    callback({'x': x, 'i': i, 'denoised': model_prev_list[-1]})
  File "/Users/dong.hao/studio/projects/public/FramePack/demo_gradio.py", line 234, in callback
    raise KeyboardInterrupt('User ends the task.')
KeyboardInterrupt: User ends the task.
^CKeyboard interruption in main thread... closing server.
250418_230711_181_7405_37.mp4
  • With this fork and the same PyTorch build, it takes me less than 20 seconds to render one frame, and the output quality is good.
FramePack-macos on  main [?] via 🐍 v3.10.16 (env)
󰄛 ❯ python demo_gradio.py
Currently enabled native sdp backends: ['flash', 'math', 'mem_efficient', 'cudnn']
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Namespace(share=False, server='0.0.0.0', port=None, inbrowser=False, output_dir='./outputs', fp32=False)
Free VRAM 96.0 GB
High-VRAM Mode: True
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 3350.08it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 13.55it/s]
Fetching 3 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3579.78it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 13.26it/s]
transformer.high_quality_fp32_output_for_inference = True
* Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Resolution: 336 x 496
latent_padding_size = 18, is_last_section = False
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [05:51<00:00, 14.06s/it]
/Users/dong.hao/studio/projects/public/FramePack/docs/pytorch-offcial/torch/nn/functional.py:4651: UserWarning: The operator 'aten::upsample_nearest3d.vec' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/dong.hao/studio/projects/public/FramePack/docs/pytorch-offcial/aten/src/ATen/mps/MPSFallback.mm:14.)
  return torch._C._nn.upsample_nearest3d(input, output_size, scale_factors)
Decoded. Current latent shape torch.Size([1, 16, 9, 62, 42]); pixel shape torch.Size([1, 3, 33, 496, 336])
latent_padding_size = 9, is_last_section = False
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
 84%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                            | 21/25 [05:48<01:23, 20.77s/it]100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [07:32<00:00, 18.10s/it]
Decoded. Current latent shape torch.Size([1, 16, 18, 62, 42]); pixel shape torch.Size([1, 3, 69, 496, 336])
latent_padding_size = 0, is_last_section = True
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [08:18<00:00, 19.92s/it]
Decoded. Current latent shape torch.Size([1, 16, 28, 62, 42]); pixel shape torch.Size([1, 3, 109, 496, 336])
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
^CKeyboard interruption in main thread... closing server.
250420_173459_831_8394_28.mp4

I just saw the new PR; I hope this is helpful for it.
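Side note: the repeated huggingface/tokenizers warnings in the logs above can be silenced exactly as the warning text suggests, by setting the environment variable before the process forks. A one-line sketch:

```python
import os

# Per the warning in the log: disable tokenizers parallelism before forking.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```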

@mdaiter (Author) commented Apr 20, 2025

@donghao1393 Gotcha. I chunked these buffers down fairly aggressively to generate the output. The output's not good? I tried it with the guy jumping, and it worked well! Let me know what's going wrong.

Maybe it's because my machine's just really constrained: I've got an M3 Pro with 18 GB of RAM.

@donghao1393 commented Apr 20, 2025

@mdaiter It's in the first output; the video doesn't seem to play right.
I've checked the file, and it doesn't seem to be encoded correctly. It can be played in a video player, but not in Quick Look.
Let me re-encode it and upload it here to show the issue.

250418_230711_181_7405_37.reoutput.mp4
