
Conversation

@mdaiter commented Apr 18, 2025

Sorry, the auto AI stuff bonked a lot of my code. Rolling some stuff back, this is just a train of thought. Basically, just disregard this.

I'm gonna start digging to get tinygrad grafted into the transformer bit. Metal blocks (and compacting that thing down in general) should help a lot with speed and memory usage on Apple Silicon machines.

@mdaiter (Author) commented Apr 18, 2025

Moving DynamicSwap_HunyuanVideoTransformer3DModelPacked to mps with preserved memory: 6 GB
  0%|                                                                                                                                                                                         | 0/25 [00:00<?, ?it/s]/Users/msd/Code/FramePack/diffusers_helper/models/hunyuan_video_packed.py:81: UserWarning: The operator 'aten::avg_pool3d.out' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:14.)
  return torch.nn.functional.avg_pool3d(x, kernel_size, stride=kernel_size)
  0%|                                                                                                                                                                                         | 0/25 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/Users/msd/Code/FramePack/demo_gradio.py", line 237, in worker
    generated_latents = sample_hunyuan(
                        ^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/pipelines/k_diffusion_hunyuan.py", line 116, in sample_hunyuan
    results = sample_unipc(k_model, latents, sigmas, extra_args=sampler_kwargs, disable=False, callback=callback)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 141, in sample_unipc
    return FlowMatchUniPC(model, extra_args=extra_args, variant=variant).sample(noise, sigmas=sigmas, callback=callback, disable_pbar=disable)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 118, in sample
    model_prev_list = [self.model_fn(x, vec_t)]
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 23, in model_fn
    return self.model(x, t, **self.extra_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/k_diffusion/wrapper.py", line 37, in k_model
    pred_positive = transformer(hidden_states=hidden_states, timestep=timestep, return_dict=False, **extra_args['positive'])[0].float()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 997, in forward
    hidden_states, encoder_hidden_states = self.gradient_checkpointing_method(
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 834, in gradient_checkpointing_method
    result = block(*args)
             ^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 654, in forward
    attn_output, context_attn_output = self.attn(
                                       ^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/diffusers/models/attention_processor.py", line 605, in forward
    return self.processor(
           ^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 174, in __call__
    hidden_states = attn_varlen_func(query, key, value, cu_seqlens_q, cu_seqlens_kv, max_seqlen_q, max_seqlen_kv)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/msd/Code/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 124, in attn_varlen_func
    x = torch.nn.functional.scaled_dot_product_attention(q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)).transpose(1, 2)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Invalid buffer size: 14.45 GiB

^ Here's my current error, if anyone wants to tap in.

@mdaiter (Author) commented Apr 18, 2025

Threw in a Tensor recast in Metal. Here's where I'm at:
MemoryError: Metal OOM while allocating size=15510555648
Super, super strange. I think Metal caps the size of a single contiguous buffer allocation, so a tensor this large (~15.5 GB) fails even when enough total memory is free.
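If it is the working-set cap, recent PyTorch builds expose a couple of MPS memory queries that make it easy to compare the failing allocation against the budget. A minimal sketch, assuming these helpers are available in your PyTorch version:

```python
import torch

# Compare the failing ~15.5 GB allocation against the driver's budget.
# Both helpers exist in recent PyTorch builds (worth double-checking on yours).
print(torch.mps.recommended_max_memory())    # recommended working-set cap, in bytes
print(torch.mps.current_allocated_memory())  # bytes currently held by MPS tensors
```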

@mdaiter (Author) commented Apr 18, 2025

Ayyy, it's running (kinda). Still gotta confirm the output, but it's past 0%! Looks like the attention needed to be segmented, but MPS doesn't have a segmented attention built in, so I built my own.
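A minimal sketch of the idea (not the exact code in this PR): split the query sequence into chunks so each scaled_dot_product_attention call allocates an attention buffer small enough for Metal, while each chunk still attends over the full key/value sequence, so the result is mathematically identical to unchunked attention:

```python
import torch
import torch.nn.functional as F

def chunked_sdpa(q, k, v, chunk_size=4096):
    # q, k, v: (batch, heads, seq, head_dim). Chunking only the queries is
    # exact: each query row's softmax is independent of the other rows.
    outputs = []
    for i in range(0, q.shape[2], chunk_size):
        outputs.append(F.scaled_dot_product_attention(q[:, :, i:i + chunk_size], k, v))
    return torch.cat(outputs, dim=2)
```

Peak memory for the attention weights drops from O(seq²) to O(chunk_size · seq) per call, at the cost of a Python-level loop.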

@mdaiter (Author) commented Apr 18, 2025

@lllyasviel, would love your take on this. Got the MPS kernels thrown in, and split the attention up to batch it. I don't know if we should expose a slider through Gradio for chunk sizing on M-series machines (rough sketch below), but it's currently diffusing (slowly). One operation (aten::avg_pool3d, per the warning above) only runs on the CPU on Mac, so you can run it, but it just takes a while (about 1h30m for a 25-frame video).
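The slider idea would be something like this hypothetical sketch (names and ranges made up for illustration, not code from this PR):

```python
import gradio as gr

with gr.Blocks() as demo:
    # Let M-series users trade speed for peak memory; the value would be
    # threaded through to the worker that calls the chunked attention.
    attn_chunk_size = gr.Slider(
        label="Attention chunk size (M-series Macs)",
        minimum=1024, maximum=16384, value=4096, step=1024,
    )
```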

@mdaiter (Author) commented Apr 18, 2025

Okay, finally started updating portions of this for more gains.

  1. Manual flash attention: works! Still no major speedup.
  2. Matmul optimization: works! Still no major speedup.

Dug harder, and it turns out Python itself is the overhead: the 60 transformer blocks each take about 2.5 seconds on Mac, with buffers of these sizes passed between them, and since Mac can't process them in parallel, everything runs sequentially (oh dear god).
I've tried torch.compile for this model, but Dynamo's having a hissy fit (one speculative workaround is sketched below).
For now, I think this is probably the best it's gonna be. 5 minutes a frame, though :(
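The speculative workaround (not something that has worked here yet): compiling each transformer block separately with a simpler backend, which sometimes sidesteps whole-model Dynamo failures. `transformer_blocks` is an assumption about the packed model's layout:

```python
import torch

# Speculative sketch: compile blocks one at a time with the aot_eager backend
# instead of the default inductor, to dodge whole-graph Dynamo breaks.
for i, block in enumerate(transformer.transformer_blocks):
    transformer.transformer_blocks[i] = torch.compile(block, backend="aot_eager")
```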

@e1732a364fed commented

Just a note: there's another fork for Mac which seems to be faster:

https://www.reddit.com/r/StableDiffusion/comments/1k2neim/framepack_on_macos/

https://github.com/brandon929/FramePack

For reference, on my M3 Ultra Mac Studio and default settings, I am generating 1 second of video in around 2.5 minutes.

@mdaiter (Author) commented Apr 20, 2025

@e1732a364fed , appreciate it! Looking into it now.

@donghao1393 commented Apr 20, 2025

I've added MPS support in PyTorch for avg_pool3d.out (used by this repository) via pytorch/pytorch#151742. Another PR, pytorch/pytorch#151760, adding support for upsample_nearest3d.vec, is under development.

  • Previously, it took more than 100 seconds to render one frame on my M4 Max using code based on this PR, and the output quality was not good. I suppose something must be wrong in my fork, where MPS isn't actually being used.
FramePack on  main [✘!?⇡] via 🐍 v3.10.16 (env)
󰄛 ❯ bash run_demo.sh
Currently enabled native sdp backends: []
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Namespace(share=False, server='0.0.0.0', port=None, inbrowser=False)
Free VRAM 95.99955749511719 GB
High-VRAM Mode: True
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5789.24it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 15.99it/s]
Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 14855.86it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 20.31it/s]
transformer.high_quality_fp32_output_for_inference = True
* Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
latent_padding_size = 27, is_last_section = False
 48%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                                                                                                                                 | 12/25 [20:47<22:30, 103.92s/it]
Traceback (most recent call last):
  File "/Users/dong.hao/studio/projects/public/FramePack/demo_gradio.py", line 243, in worker
    generated_latents = sample_hunyuan(
  File "/Users/dong.hao/studio/projects/public/FramePack/docs/pytorch-offcial/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Users/dong.hao/studio/projects/public/FramePack/diffusers_helper/pipelines/k_diffusion_hunyuan.py", line 116, in sample_hunyuan
    results = sample_unipc(k_model, latents, sigmas, extra_args=sampler_kwargs, disable=False, callback=callback)
  File "/Users/dong.hao/studio/projects/public/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 141, in sample_unipc
    return FlowMatchUniPC(model, extra_args=extra_args, variant=variant).sample(noise, sigmas=sigmas, callback=callback, disable_pbar=disable)
  File "/Users/dong.hao/studio/projects/public/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 134, in sample
    callback({'x': x, 'i': i, 'denoised': model_prev_list[-1]})
  File "/Users/dong.hao/studio/projects/public/FramePack/demo_gradio.py", line 234, in callback
    raise KeyboardInterrupt('User ends the task.')
KeyboardInterrupt: User ends the task.
^CKeyboard interruption in main thread... closing server.
250418_230711_181_7405_37.mp4
  • With this fork and the same PyTorch build, it takes me less than 20 seconds to render one frame, and the output quality is good.
FramePack-macos on  main [?] via 🐍 v3.10.16 (env)
󰄛 ❯ python demo_gradio.py
Currently enabled native sdp backends: ['flash', 'math', 'mem_efficient', 'cudnn']
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Namespace(share=False, server='0.0.0.0', port=None, inbrowser=False, output_dir='./outputs', fp32=False)
Free VRAM 96.0 GB
High-VRAM Mode: True
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 3350.08it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 13.55it/s]
Fetching 3 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3579.78it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 13.26it/s]
transformer.high_quality_fp32_output_for_inference = True
* Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Resolution: 336 x 496
latent_padding_size = 18, is_last_section = False
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [05:51<00:00, 14.06s/it]
/Users/dong.hao/studio/projects/public/FramePack/docs/pytorch-offcial/torch/nn/functional.py:4651: UserWarning: The operator 'aten::upsample_nearest3d.vec' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/dong.hao/studio/projects/public/FramePack/docs/pytorch-offcial/aten/src/ATen/mps/MPSFallback.mm:14.)
  return torch._C._nn.upsample_nearest3d(input, output_size, scale_factors)
Decoded. Current latent shape torch.Size([1, 16, 9, 62, 42]); pixel shape torch.Size([1, 3, 33, 496, 336])
latent_padding_size = 9, is_last_section = False
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
 84%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                            | 21/25 [05:48<01:23, 20.77s/it]100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [07:32<00:00, 18.10s/it]
Decoded. Current latent shape torch.Size([1, 16, 18, 62, 42]); pixel shape torch.Size([1, 3, 69, 496, 336])
latent_padding_size = 0, is_last_section = True
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [08:18<00:00, 19.92s/it]
Decoded. Current latent shape torch.Size([1, 16, 28, 62, 42]); pixel shape torch.Size([1, 3, 109, 496, 336])
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
^CKeyboard interruption in main thread... closing server.
250420_173459_831_8394_28.mp4

I just saw the new PR; I hope this is helpful for it.
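Side note: the repeated huggingface/tokenizers warnings in the logs above can be silenced exactly as the warning text suggests, by setting the environment variable before the process forks. A one-line sketch:

```python
import os

# Per the warning in the log: disable tokenizers parallelism before forking.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```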

@mdaiter (Author) commented Apr 20, 2025

@donghao1393 Gotcha. I chunked these buffers down fairly aggressively to generate the output. The output's not good? I tried it with the guy jumping, and it worked well! Let me know what's going wrong.

Maybe it's because my machine's just really constrained: I've got an M3 Pro with 18 GB of RAM.

@donghao1393 commented Apr 20, 2025

@mdaiter It's in the first output; the video doesn't seem to play right.
I've checked the file, and it doesn't seem to be encoded correctly. It can be played in a video player, but not in Quick Look.
Let me re-encode it and upload it here to show the issue.

250418_230711_181_7405_37.reoutput.mp4
