mdaiter commented Apr 20, 2025

Hey all!

Got Mac compatibility fairly up and running. I tested it and the latents came out well! It's just kind of slow - working on that now.

It seems like there's some funky behavior in Mac's underlying MPS system: transformers aren't run in parallel, which leads to slowdowns (5 mins / frame), but the frames do come out in the proper form.

Happy to answer comments and get this merged. I'm working on seeing if there's a way to use torch.nn.functional.scaled_dot_product_attention directly.

Right now, there's one big constraint: 15GB is too large a buffer to allocate in MPS on Metal for a single frame, so the scaled dot product attention portion has to be chunked.
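To give a rough idea of what I mean by chunking, here's an illustrative sketch (not the exact code in this PR - the helper name and chunk size are made up):

import torch
import torch.nn.functional as F

def chunked_sdpa(q, k, v, chunk_size=4096):
    # q, k, v: [batch, heads, seq_len, head_dim]
    out = []
    for start in range(0, q.shape[2], chunk_size):
        q_chunk = q[:, :, start:start + chunk_size, :]
        # Each call only materializes a (chunk_size x seq_len) score block
        # instead of the full (seq_len x seq_len) matrix.
        out.append(F.scaled_dot_product_attention(q_chunk, k, v))
    return torch.cat(out, dim=2)

Chunking over the query dimension is exact: each query row's softmax only depends on the keys, so concatenating the chunked outputs gives the same result as one fused call.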

Happy to take commentary, make edits, and clean up this code. Would love high level feedback if you've got it.

Big shout out to @donghao1393 for cleaning up some of this code.

donghao1393 mentioned this pull request Apr 20, 2025
mdaiter (Author) commented Apr 20, 2025

@donghao1393, could I get your and @brandon929's help testing this out? I just don't have a more powerful machine. Happy to adjust numbers / make this adaptive for whatever chip you've got.

donghao1393 commented Apr 20, 2025

@mdaiter Thanks for the invitation. I'll gladly help test it after finishing my current task of submitting those PRs to pytorch. I thought the issue, #65 (comment), must be on my side, so let me re-run the test and hopefully find some good news.

mdaiter (Author) commented Apr 20, 2025

Thanks @donghao1393! Appreciate it.
I tried splitting attention into a flash-attention-style chunked implementation for Mac. However, that's only needed because of a Metal compatibility issue where a 15GB memory buffer is the absolute maximum you can allocate. It might be best to use @brandon929's solution, and I'll just downscale the video that gets generated on the M3 Pro.
Either way, really appreciate the support / initial patches on my branch - thanks man!

@@ -1,8 +1,10 @@
from tkinter import W


This breaks compatibility with my Python installed using homebrew. Also, I don't think the import is being utilized as I didn't run into any issues when I removed the line.


mdaiter (Author) replied: Yep! My bad - I can remove this.

brandon929 commented Apr 20, 2025

I did some testing with these changes, compared to just using an MPS pytorch device without any extra changes to FramePack (See my fork of FramePack).

Overall, the changes you have added make it possible to generate frames at the "640" resolution bucket. If you just use an MPS device with the standard code, you get noisy frames at this resolution (as you mention above, this is not a FramePack issue; the same problem happens in ComfyUI and seems to be an issue with MPS support in pytorch). On the downside, I do see a performance hit of 50% at lower resolutions that otherwise generate correctly with only the MPS device change. At higher resolutions the hit appears to be 100% - however, that may not be a true comparison, as without your changes this resolution only generates noisy garbage frames.

I am testing with an 80 GPU core M3 Ultra using the same image, prompt, etc. with the following results.
At a frame resolution of 352 x 464:
This PR: 11.78s/it (no TeaCache)
Standard FramePack w/ MPS device: 7.95s/it (no TeaCache)

At a frame resolution of 544 x 704 (this is the standard "640" bucket):
This PR: 44.76s/it (no TeaCache)
Standard FramePack w/ MPS device: 26.73s/it (no TeaCache; of course, this generates garbage as mentioned above, so it may not be comparable)

Here is a dump from Terminal for a complete run using this PR with TeaCache enabled:

Resolution: 352 x 464
latent_padding_size = 27, is_last_section = False
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [02:35<00:00,  6.22s/it]
/opt/homebrew/lib/python3.10/site-packages/torch/nn/functional.py:4737: UserWarning: The operator 'aten::upsample_nearest3d.vec' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:14.)
  return torch._C._nn.upsample_nearest3d(input, output_size, scale_factors)
/opt/homebrew/lib/python3.10/site-packages/torchvision/io/_video_deprecation_warning.py:5: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
Decoded. Current latent shape torch.Size([1, 16, 9, 58, 44]); pixel shape torch.Size([1, 3, 33, 464, 352])
latent_padding_size = 18, is_last_section = False
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [02:35<00:00,  6.23s/it]
Decoded. Current latent shape torch.Size([1, 16, 18, 58, 44]); pixel shape torch.Size([1, 3, 69, 464, 352])
latent_padding_size = 9, is_last_section = False
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [02:35<00:00,  6.22s/it]
Decoded. Current latent shape torch.Size([1, 16, 27, 58, 44]); pixel shape torch.Size([1, 3, 105, 464, 352])
latent_padding_size = 0, is_last_section = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [02:35<00:00,  6.23s/it]
Decoded. Current latent shape torch.Size([1, 16, 37, 58, 44]); pixel shape torch.Size([1, 3, 145, 464, 352])

Here is a dump with the same (kind of) settings using the standard FramePack pipeline with an MPS device and TeaCache enabled. There is another difference in that I am rendering the video at 24fps, so I am generating fewer latent frames in total than if the video were rendered at 30fps. However, each pass should be the same; there is just one fewer pass (you can confirm by looking at the decoded latent tensor shapes):

Resolution: 352 x 464
latent_padding_size = 18, is_last_section = False
  0%|                                                                                                                                                                                                  | 0/25 [00:00<?, ?it/s]/Users/bcook/Documents/FramePack/diffusers_helper/models/hunyuan_video_packed.py:79: UserWarning: The operator 'aten::avg_pool3d.out' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:14.)
  return torch.nn.functional.avg_pool3d(x, kernel_size, stride=kernel_size)
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [01:44<00:00,  4.16s/it]
/opt/homebrew/lib/python3.10/site-packages/torchvision/io/_video_deprecation_warning.py:5: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
Decoded. Current latent shape torch.Size([1, 16, 9, 58, 44]); pixel shape torch.Size([1, 3, 33, 464, 352])
latent_padding_size = 9, is_last_section = False
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [01:44<00:00,  4.17s/it]
Decoded. Current latent shape torch.Size([1, 16, 18, 58, 44]); pixel shape torch.Size([1, 3, 69, 464, 352])
latent_padding_size = 0, is_last_section = True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [01:44<00:00,  4.18s/it]
Decoded. Current latent shape torch.Size([1, 16, 28, 58, 44]); pixel shape torch.Size([1, 3, 109, 464, 352])

@brandon929

Also, here are the results of running your test scripts.

Results from an M3 Ultra (80 core GPU) Mac Studio:

(.venv) ➜  FramePack git:(mdaiterUpdates) python3.10 test_mps_attention.py
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
Currently enabled native sdp backends: []
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Testing MPS attention mechanisms for FramePack
MPS is available. Device: mps
Free MPS memory: 464.00 GB
Creating test tensors with shape [batch=1, seq_len=64, hidden_size=4096]
Testing regular attention...
Regular attention completed in 0.1114 seconds
Output shape: torch.Size([1, 64, 4096])

Testing chunked attention...
Testing with chunk_size=16
Testing with chunk_size=32
Mean absolute error between regular and chunked attention: 0.000000
Chunked attention tests completed in 0.1284 seconds

Testing MPS variable length attention...
Variable length attention completed in 0.1718 seconds

All MPS attention tests passed!

---

MPS attention test PASSED
(.venv) ➜  FramePack git:(mdaiterUpdates) ✗ python3.10 test_mps_basic.py 
Testing basic MPS support for FramePack
MPS is available. Device: mps
Free MPS memory: 464.00 GB
CPU tensor created with shape torch.Size([1000, 1000])
Tensor moved to MPS: mps:0
Matrix multiplication on MPS completed with shape torch.Size([1000, 1000])
Successfully created tensor with dtype torch.float32 on MPS
Successfully created tensor with dtype torch.float16 on MPS
Successfully created tensor with dtype torch.bfloat16 on MPS

MPS basic support test PASSED

---

(.venv) ➜  FramePack git:(mdaiterUpdates) ✗ python3.10 test_mac_support.py
Currently enabled native sdp backends: []
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Testing MPS support for FramePack
MPS is available. Device: mps
Free MPS memory: 464.00 GB
Test tensor created on MPS: mps:0
Loading transformer model...
Fetching 3 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 45590.26it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.23it/s]
Model loaded successfully on mps:0
Testing forward pass with small input...
Forward pass successful!
Output shape: torch.Size([1, 16, 4, 64, 64])

MPS support test PASSED

Results from an M4 Max (40 core GPU) MacBook Pro:

(.venv) ➜  FramePack git:(mdaiterUpdates) ✗ python3.10 test_mps_attention.py 
Currently enabled native sdp backends: []
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Testing MPS attention mechanisms for FramePack
MPS is available. Device: mps
Free MPS memory: 96.00 GB
Creating test tensors with shape [batch=1, seq_len=64, hidden_size=4096]
Testing regular attention...
Regular attention completed in 0.1045 seconds
Output shape: torch.Size([1, 64, 4096])

Testing chunked attention...
Testing with chunk_size=16
Testing with chunk_size=32
Mean absolute error between regular and chunked attention: 0.000000
Chunked attention tests completed in 0.1327 seconds

Testing MPS variable length attention...
Variable length attention completed in 0.1511 seconds

All MPS attention tests passed!

MPS attention test PASSED

----

(.venv) ➜  FramePack git:(mdaiterUpdates) ✗ python3.10 test_mps_basic.py    
Testing basic MPS support for FramePack
MPS is available. Device: mps
Free MPS memory: 96.00 GB
CPU tensor created with shape torch.Size([1000, 1000])
Tensor moved to MPS: mps:0
Matrix multiplication on MPS completed with shape torch.Size([1000, 1000])
Successfully created tensor with dtype torch.float32 on MPS
Successfully created tensor with dtype torch.float16 on MPS
Successfully created tensor with dtype torch.bfloat16 on MPS

MPS basic support test PASSED

---

(.venv) ➜  FramePack git:(mdaiterUpdates) ✗ python3.10 test_mac_support.py
Currently enabled native sdp backends: []
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Testing MPS support for FramePack
MPS is available. Device: mps
Free MPS memory: 96.00 GB
Test tensor created on MPS: mps:0
Loading transformer model...
Fetching 3 files: 100%|████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 27354.16it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 3/3 [00:00<00:00, 15.64it/s]
Model loaded successfully on mps:0
Testing forward pass with small input...
Forward pass successful!
Output shape: torch.Size([1, 16, 4, 64, 64])

MPS support test PASSED

mdaiter (Author) commented Apr 21, 2025

@brandon929, you're the man! Thank you.
The best way to move forward might be to profile the underlying system and choose the right attention implementation: if memory's constrained, use the flash-attention-style manual segmentation that I used. Also - my bet is that with this architecture, you only really need my method at the top of the transformer stack. The easiest way would be to:

  1. Measure the size of the buffer, see the memory on the system, and choose.
  2. If you're too high in the transformer stack and need more memory, chunk.
  3. If you're not too high in the transformer stack, use the fused torch.nn.functional op.
  4. Keep chunking and deciding
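A rough sketch of that decision logic (the threshold and helper name here are just illustrative, not the constants actually used in this branch):

import torch
import torch.nn.functional as F

def adaptive_sdpa(q, k, v, max_buffer_bytes=15 * 1024**3):
    # Estimate the Q.K^T score matrix a fused call would materialize.
    b, h, q_len, _ = q.shape
    k_len = k.shape[2]
    score_bytes = b * h * q_len * k_len * q.element_size()
    if score_bytes <= max_buffer_bytes:
        # Fits under the cap: take the fused kernel path.
        return F.scaled_dot_product_attention(q, k, v)
    # Too big: split Q so each score block fits, but keep the fused kernel per chunk.
    n_chunks = -(-score_bytes // max_buffer_bytes)  # ceil division
    chunk = max(1, -(-q_len // n_chunks))
    out = [F.scaled_dot_product_attention(q[:, :, s:s + chunk, :], k, v)
           for s in range(0, q_len, chunk)]
    return torch.cat(out, dim=2)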

donghao1393 commented Apr 21, 2025

I have run it under an M4 Max + 128GB.
It took 93 seconds to render one frame.

FramePack on  AvgPool3dToConv3d [?] via 🐍 v3.10.16 (env)
󰄛 ❯ python demo_gradio.py
Currently enabled native sdp backends: []
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Namespace(share=False, server='0.0.0.0', port=None, inbrowser=False)
Free VRAM 95.99955749511719 GB
High-VRAM Mode: True
Downloading shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 2467.60it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 15.85it/s]
Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 25627.11it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 20.12it/s]
transformer.high_quality_fp32_output_for_inference = True
* Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
latent_padding_size = 27, is_last_section = False
  0%|                                                                                                                                 | 0/25 [01:32<?, ?it/s]
Traceback (most recent call last):
  File "/Users/dong.hao/studio/projects/public/FramePack/demo_gradio.py", line 244, in worker
    generated_latents = sample_hunyuan(
  File "/Users/dong.hao/studio/projects/public/FramePack/docs/pytorch-offcial/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Users/dong.hao/studio/projects/public/FramePack/diffusers_helper/pipelines/k_diffusion_hunyuan.py", line 116, in sample_hunyuan
    results = sample_unipc(k_model, latents, sigmas, extra_args=sampler_kwargs, disable=False, callback=callback)
  File "/Users/dong.hao/studio/projects/public/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 141, in sample_unipc
    return FlowMatchUniPC(model, extra_args=extra_args, variant=variant).sample(noise, sigmas=sigmas, callback=callback, disable_pbar=disable)
  File "/Users/dong.hao/studio/projects/public/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 134, in sample
    callback({'x': x, 'i': i, 'denoised': model_prev_list[-1]})
  File "/Users/dong.hao/studio/projects/public/FramePack/demo_gradio.py", line 235, in callback
    raise KeyboardInterrupt('User ends the task.')
KeyboardInterrupt: User ends the task.
latent_padding_size = 27, is_last_section = False
 16%|███████████████████▏                                                                                                    | 4/25 [10:29<55:05, 157.38s/it]
Traceback (most recent call last):
  File "/Users/dong.hao/studio/projects/public/FramePack/demo_gradio.py", line 244, in worker
    generated_latents = sample_hunyuan(
  File "/Users/dong.hao/studio/projects/public/FramePack/docs/pytorch-offcial/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Users/dong.hao/studio/projects/public/FramePack/diffusers_helper/pipelines/k_diffusion_hunyuan.py", line 116, in sample_hunyuan
    results = sample_unipc(k_model, latents, sigmas, extra_args=sampler_kwargs, disable=False, callback=callback)
  File "/Users/dong.hao/studio/projects/public/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 141, in sample_unipc
    return FlowMatchUniPC(model, extra_args=extra_args, variant=variant).sample(noise, sigmas=sigmas, callback=callback, disable_pbar=disable)
  File "/Users/dong.hao/studio/projects/public/FramePack/diffusers_helper/k_diffusion/uni_pc_fm.py", line 134, in sample
    callback({'x': x, 'i': i, 'denoised': model_prev_list[-1]})
  File "/Users/dong.hao/studio/projects/public/FramePack/demo_gradio.py", line 235, in callback
    raise KeyboardInterrupt('User ends the task.')
KeyboardInterrupt: User ends the task.
^CKeyboard interruption in main thread... closing server.

For the test scripts:

FramePack on  AvgPool3dToConv3d [?] via 🐍 v3.10.16 (env)
󰄛 ❯ python test_mps_attention.py
Currently enabled native sdp backends: []
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Testing MPS attention mechanisms for FramePack
MPS is available. Device: mps
Free MPS memory: 96.00 GB
Creating test tensors with shape [batch=1, seq_len=64, hidden_size=4096]
Testing regular attention...
Regular attention completed in 0.0195 seconds
Output shape: torch.Size([1, 64, 4096])

Testing chunked attention...
Testing with chunk_size=16
Testing with chunk_size=32
Mean absolute error between regular and chunked attention: 0.000000
Chunked attention tests completed in 0.0286 seconds

Testing MPS variable length attention...
Variable length attention completed in 0.0299 seconds

All MPS attention tests passed!

MPS attention test PASSED
                                                                                                                                                             
FramePack on  AvgPool3dToConv3d [?] via 🐍 v3.10.16 (env) took 3s
󰄛 ❯ python test_mps_basic.py
Testing basic MPS support for FramePack
MPS is available. Device: mps
Free MPS memory: 96.00 GB
CPU tensor created with shape torch.Size([1000, 1000])
Tensor moved to MPS: mps:0
Matrix multiplication on MPS completed with shape torch.Size([1000, 1000])
Successfully created tensor with dtype torch.float32 on MPS
Successfully created tensor with dtype torch.float16 on MPS
Successfully created tensor with dtype torch.bfloat16 on MPS

MPS basic support test PASSED
                                                                                                                                                             
FramePack on  AvgPool3dToConv3d [?] via 🐍 v3.10.16 (env)
󰄛 ❯ python test_mac_support.py
Currently enabled native sdp backends: []
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Testing MPS support for FramePack
MPS is available. Device: mps
Free MPS memory: 96.00 GB
Test tensor created on MPS: mps:0
Loading transformer model...
Fetching 3 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 53544.31it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 22.99it/s]
Model loaded successfully on mps:0
Testing forward pass with small input...
Forward pass successful!
Output shape: torch.Size([1, 16, 4, 64, 64])

MPS support test PASSED
                                                                                                                                                             
FramePack on  AvgPool3dToConv3d [?] via 🐍 v3.10.16 (env) took 25s
󰄛 ❯

mdaiter (Author) commented Apr 21, 2025

Okay. The move is probably to add a fallback system for this, in case frames are too large to fit into a single scaled_dot_product_attention call. On it - @brandon929 and @donghao1393, lemme try to get this out or collab with you guys on the best solution.

mdaiter (Author) commented Apr 21, 2025

Alright @brandon929 + @donghao1393, you're up. Got it running on my M3 with automatic chunking around the fused torch.nn.functional attention call, and it flies now. I'm getting 180 seconds / iteration; you should get it a lot faster. My machine needs to chunk 6 times, but you probably only need one chunk because you've got so much RAM. I'd guess it runs in 20-30 seconds per iteration for you.

mdaiter mentioned this pull request Apr 21, 2025
mdaiter (Author) commented Apr 21, 2025

(Once that's tested, I'll take out that chunk print statement - I wanna see what your chunk size is.)

@donghao1393

I have run it with the latest commit. The time per iteration has been reduced to around 20 seconds, but the issue with the output video still exists.

FramePack on  AvgPool3dToConv3d [?] via 🐍 v3.10.16 (env) took 33s
󰄛 ❯ TOKENIZERS_PARALLELISM=false python demo_gradio.py --share
Currently enabled native sdp backends: []
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Namespace(share=True, server='0.0.0.0', port=None, inbrowser=False)
Free VRAM 95.99955749511719 GB
High-VRAM Mode: True
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 7533.55it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.66it/s]
Fetching 3 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 13515.48it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 13.66it/s]
transformer.high_quality_fp32_output_for_inference = True
* Running on local URL:  http://0.0.0.0:7860
* Running on public URL: https://4642e40a8103a26ff0.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
latent_padding_size = 0, is_last_section = True
  0%|                                                                                                                                  | 0/25 [00:00<?, ?it/s]Trying 6 chunks of size 3068
 ... repeated ...
  4%|████▉                                                                                                                     | 1/25 [01:32<37:03, 92.64s/it]Trying 6 chunks of size 3068
 ... repeated ...
  8%|█████████▊                                                                                                                | 2/25 [02:58<34:03, 88.87s/it]Trying 6 chunks of size 3068
 ... repeated ...
 16%|███████████████████▌                                                                                                      | 4/25 [04:21<18:13, 52.09s/it]Trying 6 chunks of size 3068
 ... repeated ...
 24%|█████████████████████████████▎                                                                                            | 6/25 [05:43<13:08, 41.50s/it]Trying 6 chunks of size 3068
 ... repeated ...
 32%|███████████████████████████████████████                                                                                   | 8/25 [07:04<10:31, 37.17s/it]Trying 6 chunks of size 3068
 ... repeated ...
 44%|█████████████████████████████████████████████████████▏                                                                   | 11/25 [08:25<06:45, 28.97s/it]Trying 6 chunks of size 3068
 ... repeated ...
 56%|███████████████████████████████████████████████████████████████████▊                                                     | 14/25 [09:46<04:44, 25.89s/it]Trying 6 chunks of size 3068
 ... repeated ...
 68%|██████████████████████████████████████████████████████████████████████████████████▎                                      | 17/25 [11:05<03:15, 24.43s/it]Trying 6 chunks of size 3068
 ... repeated ...
 80%|████████████████████████████████████████████████████████████████████████████████████████████████▊                        | 20/25 [12:23<01:58, 23.60s/it]Trying 6 chunks of size 3068
 ... repeated ...
 88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▍              | 22/25 [13:41<01:21, 27.30s/it]Trying 6 chunks of size 3068
 ... repeated ...
 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏    | 24/25 [14:59<00:29, 29.55s/it]Trying 6 chunks of size 3068
 ... repeated ...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [16:17<00:00, 39.09s/it]
Decoded. Current latent shape torch.Size([1, 16, 10, 64, 96]); pixel shape torch.Size([1, 3, 37, 512, 768])
^CKeyboard interruption in main thread... closing server.
Killing tunnel 0.0.0.0:7860 <> https://4642e40a8103a26ff0.gradio.live
                                                                                                                                                              
FramePack on  AvgPool3dToConv3d [?] via 🐍 v3.10.16 (env) took 26m52s
󰄛 ❯

brandon929 commented Apr 21, 2025

On the M3 Ultra, I see 27.42s/it for the first iteration (so no TeaCache skipping steps). However, the video is "blown out", which is a condition I have seen when using the non-chunked inference on my branch and the resolution is too high.

With high resolutions and MPS, there is a point where the colors are blown out but there are still shapes, etc. recognizable from the image. Then there is a second point where it only creates noisy blocks with no recognizable shape in the image.

I have not looked at your changes yet, but I am assuming the chunking is meant to help avoid some of these MPS issues with large tensors. If so, perhaps the threshold used to determine the chunk size is still too large.

I also ran this on my 128GB M4 Max MBP (so there is thermal throttling), and it chose the same chunk size and did 55s/it on the first iteration. The M4 Max's final denoising time with TeaCache for the first 33 frames was [14:54<00:00, 35.79s/it]. The video it generated had the same blown out colors.

M3 Ultra output:

Currently enabled native sdp backends: []
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Namespace(share=False, server='0.0.0.0', port=None, inbrowser=False)
Free VRAM 463.9995574951172 GB
High-VRAM Mode: True
Downloading shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 3984.14it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  6.16it/s]
Fetching 3 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 24769.51it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  6.03it/s]
transformer.high_quality_fp32_output_for_inference = True
* Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
latent_padding_size = 27, is_last_section = False
  0%|                                                                                                                                                                                                                                                   | 0/25 [00:00<?, ?it/s]Trying 6 chunks of size 3067
Trying 6 chunks of size 3067
 ... repeated ...
Trying 6 chunks of size 3067
  4%|█████████▍                                                                                                                                                                                                                                 | 1/25 [00:27<10:58, 27.42s/it]Trying 6 chunks of size 3067
Trying 6 chunks of size 3067
 ... truncated logs ...
Trying 6 chunks of size 3067
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [06:19<00:00, 15.18s/it]
/opt/homebrew/lib/python3.10/site-packages/torch/nn/functional.py:4737: UserWarning: The operator 'aten::upsample_nearest3d.vec' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:14.)
  return torch._C._nn.upsample_nearest3d(input, output_size, scale_factors)
/opt/homebrew/lib/python3.10/site-packages/torchvision/io/_video_deprecation_warning.py:5: UserWarning: The video decoding and encoding capabilities of torchvision are deprecated from version 0.22 and will be removed in version 0.24. We recommend that you migrate to TorchCodec, where we'll consolidate the future decoding/encoding capabilities of PyTorch: https://github.com/pytorch/torchcodec
  warnings.warn(
Decoded. Current latent shape torch.Size([1, 16, 9, 96, 64]); pixel shape torch.Size([1, 3, 33, 768, 512])

@brandon929

I did try reducing the _SAFE_BUF size to increase the number of chunks, and the video output seems the same (Trying 23 chunks of size 801).

Perhaps it is a problem with the safe_sdp_attention logic and not an MPS pytorch bug? Looking at the latent preview on the web app, the first few steps seem OK but it seems to go bad as it continues to denoise.

mdaiter (Author) commented Apr 22, 2025

Super strange - safe_sdp_attention is just chunking and wrapping calls to the original function. In the worst case, it just falls back to a manual matmul.

@brandon929, you said my code worked previously at high res when yours didn't, right? It might be worth checking whether that's the cause of this issue. We could downscale the photo and try at a lower res?

mdaiter (Author) commented Apr 22, 2025

Ahhhh, you know what I think it is? bfloat16's precision is exhausted, and we're blowing things out. Let me try with .float().

mdaiter (Author) commented Apr 22, 2025

Ahhh @donghao1393 @brandon929 so I actually think it is because of bfloat16. My latents seem fine when keeping everything in float32, but at the expense of a pretty hard slowdown + a lot more memory being used.

At its core, the issue isn’t the model or the data—it’s the way Apple’s Metal Performance Shaders backend implements “fused” attention. Unlike the CPU and CUDA paths, which tile and stream the Q⋅Kᵀ→softmax→V matmul under the hood, MPS literally materializes the entire Q×K score matrix in bfloat16 and then immediately casts it to float32 for the softmax. On an 18 GB machine that full-buffer approach blows well past the per‑buffer cap, so you either crash or silently fall back to a lower‑precision code path that corrupts the output.

Compounding that, when you do any manual or chunked fallback entirely in bfloat16, the 7‑bit mantissa simply can’t represent the fine gradations of a long softmax over thousands of keys—so even if it “fits” in memory, the attention weights collapse (under‑ or overflow) and your decoded image ends up with blown‑out colors. In short: MPS’s naive full‑matrix allocation and limited bf16 precision conspire to both exceed your memory budget and destroy the numerical fidelity of the softmax.

I discovered that Apple’s MPS implementation of scaled_dot_product_attention naively allocates a full Q⋅Kᵀ matrix in bfloat16 (2 bytes/elem) plus an intermediate float32 softmax buffer (4 bytes/elem), which exceeds the 16 GiB per‑buffer cap on smaller‑memory machines and silently crashes or corrupts the output. My solution is to slice Q into manageable chunks, run each chunk in BF16 matmul, up‑cast only the smaller score block to float32 for a precise softmax, then cast back to BF16 for the final V matmul, writing results in‑place. This preserves the full dynamic range of softmax without recasting the huge K and V tensors, avoids one giant allocation, and keeps performance within about 20% of a fused kernel. This is one solution, but it's been fairly slow on my M3 Pro.
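Roughly, the chunked path I'm describing looks like this (a simplified sketch of the idea, not the exact code in the branch; the chunk size is arbitrary and it assumes q, k, v share the same head dim):

import torch

def chunked_attention_fp32_softmax(q, k, v, chunk_size=3072):
    # q, k, v: [batch, heads, seq_len, head_dim], typically bfloat16 on MPS.
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for s in range(0, q.shape[2], chunk_size):
        q_blk = q[:, :, s:s + chunk_size, :]
        # BF16 matmul for this score block only; the full Q.K^T is never built.
        scores = torch.matmul(q_blk, k.transpose(-1, -2)) * scale
        # Up-cast just this block to float32 so the softmax keeps its dynamic
        # range, then cast back to the value dtype for the V matmul.
        probs = torch.softmax(scores.float(), dim=-1).to(v.dtype)
        # Write the result for this slice of queries in place.
        out[:, :, s:s + chunk_size, :] = torch.matmul(probs, v)
    return out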

On beefier MPS devices (e.g. 128 GB), you can optimistically try the full Q fused call first and automatically fall back to chunking only if it OOMs. In practice this means you get the fastest path out‑of‑the‑box on high‑memory Macs, and a robust, memory‑safe chunked fallback on 16–18 GB systems—with no blown‑out colors and predictable performance.
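The optimistic wrapper could be as simple as this (again just a sketch; it reuses the chunked helper above and assumes the MPS allocation failure surfaces as a RuntimeError):

import torch.nn.functional as F

def sdpa_optimistic(q, k, v):
    try:
        # High-memory Macs: take the fused kernel path first.
        return F.scaled_dot_product_attention(q, k, v)
    except RuntimeError:
        # Allocation blew past the Metal per-buffer cap: fall back to chunking.
        return chunked_attention_fp32_softmax(q, k, v)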

Please let me know what's best!

donghao1393 commented Apr 22, 2025

@mdaiter Following your advice, I have made this commit: donghao1393@c11d3fd. It may not be the correct one, because the output has quality issues across all pixels. Here is the running log.

FramePack on  AvgPool3dToConv3d [?⇡] via 🐍 v3.10.16 (env) took 13s
󰄛 ❯ TOKENIZERS_PARALLELISM=false python demo_gradio.py --share
Currently enabled native sdp backends: []
Xformers is not installed!
Flash Attn is not installed!
Sage Attn is not installed!
Namespace(share=True, server='0.0.0.0', port=None, inbrowser=False)
Free VRAM 95.99955749511719 GB
High-VRAM Mode: True
Downloading shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 9642.08it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 15.23it/s]
Fetching 3 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 14786.03it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 12.79it/s]
transformer.high_quality_fp32_output_for_inference = True
* Running on local URL:  http://0.0.0.0:7860
* Running on public URL: https://b13c18e8fb47268bb4.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
latent_padding_size = 0, is_last_section = True
  0%|                                                                                                                                  | 0/25 [00:00<?, ?it/s]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
  4%|████▉                                                                                                                     | 1/25 [01:13<29:27, 73.64s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
  8%|█████████▊                                                                                                                | 2/25 [02:30<28:52, 75.32s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
 16%|███████████████████▌                                                                                                      | 4/25 [04:07<18:06, 51.73s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
 24%|█████████████████████████████▎                                                                                            | 6/25 [05:35<13:30, 42.66s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
 32%|███████████████████████████████████████                                                                                   | 8/25 [07:03<11:07, 39.27s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
 40%|████████████████████████████████████████████████▍                                                                        | 10/25 [08:26<09:10, 36.68s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
 52%|██████████████████████████████████████████████████████████████▉                                                          | 13/25 [09:47<05:45, 28.75s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
 64%|█████████████████████████████████████████████████████████████████████████████▍                                           | 16/25 [11:08<03:53, 25.95s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
 72%|███████████████████████████████████████████████████████████████████████████████████████                                  | 18/25 [12:29<03:24, 29.28s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
 80%|████████████████████████████████████████████████████████████████████████████████████████████████▊                        | 20/25 [13:49<02:34, 30.98s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
 84%|█████████████████████████████████████████████████████████████████████████████████████████████████████▋                   | 21/25 [15:08<02:58, 44.73s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
 92%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████▎         | 23/25 [16:27<01:17, 38.74s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏    | 24/25 [17:48<00:51, 51.21s/it]
Trying 1 chunks of size 18407 (total tokens: 18407, safe_chunk: 5185)
...repeat
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [19:11<00:00, 46.07s/it]
Decoded. Current latent shape torch.Size([1, 16, 10, 96, 64]); pixel shape torch.Size([1, 3, 37, 768, 512])
^CKeyboard interruption in main thread... closing server.
Killing tunnel 0.0.0.0:7860 <> https://b13c18e8fb47268bb4.gradio.live

FramePack on  AvgPool3dToConv3d [?⇡] via 🐍 v3.10.16 (env) took 26m46s
󰄛 ❯ ffprobe -v error -select_streams v:0 -count_frames -show_entries stream=nb_read_frames outputs/250422_224525_098_8997_10.mp4
[STREAM]
nb_read_frames=37
[/STREAM]

speed: (19*60+11)/37 ≈ 31.11 s/frame

The output cannot be played.

250422_224525_098_8997_10.mp4

You can copy the download link and watch it locally.

@brandon929
Copy link

The issue with MP4s and play-back compatibility was corrected; if you merge the latest from lllyasviel:main you will get the fix. However, when I play the video you uploaded, it shows the blocky-noise issue discussed by @mdaiter above.

There is some threshold where chunking is necessary to ensure the generated frames look correct. This quality problem occurs before you hit an OOM. There is also an issue where you get frames that match expectations but the colors are "blown-out", which @mdaiter provided an explanation for above.

Overall, the behaviour seems to depend on the tensor sizes involved in the sampling stage (so the problem is likely localized to the attention calculations and the tensors used there; the latent image tensors passed to the sampler aren't particularly large):

  1. At some large threshold, you get an OOM.
  2. At a lower threshold, you get blocky-noise as in your uploaded video.
    436239796-3a6674a5-4c43-448f-b8c1-fb7245c73fd2-0001
  3. At a yet lower threshold, you get blown-out colors.
    250423_073637_558_3335_9-0001
  4. Finally, at a still lower threshold, you get properly generated frames without visual artifacts.
    250423_074338_892_2939_9.mp4

I have not yet had the time to dig into the problem and identify exactly where those thresholds lie. I know with my 128GB and 512GB Macs, at the default "640" bucket I get issue 2. At a lower resolution bucket like "576" I get issue 3. At the resolution bucket "416" I never see any artifacts in the generated output.

These are not issues specifically with FramePack as I see these same issues using ComfyUI with MPS devices.

Ideally, a short-term solution would be to identify these thresholds and take the minimally-intrusive approach.

For example, with case 4, if there is enough RAM just use the PyTorch MPS fused attention implementation. If there is not enough RAM, chunk appropriately.

With cases 3 and 4, maybe resort to chunking in FP32 and take the perf and RAM usage hit.
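
As a rough sketch of that kind of dispatch, with a made-up token threshold standing in for whatever the measured cut-off turns out to be, and reusing a chunked helper like the one sketched earlier in this thread:

```python
import torch.nn.functional as F

def attention_for_mps(q, k, v, fused_token_limit=8192):
    # Hypothetical dispatch: below the (empirically determined) limit, use the
    # fused MPS kernel; above it, fall back to chunked attention whose softmax
    # runs in float32, accepting the perf and RAM hit.
    if q.shape[-2] <= fused_token_limit:
        return F.scaled_dot_product_attention(q, k, v)
    return chunked_sdpa(q, k, v)
```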

The long-term solution would be to fix the MPS fused attention implementation in PyTorch. (I have no idea how much work this is.)

@donghao1393
Copy link

donghao1393 commented Apr 23, 2025

Thanks for the detailed explanation! I have dug into the attention mechanism implementation in the PyTorch MPS backend and understand the problem you describe.

The core of the problem is indeed the way the MPS backend implements attention:

  1. The MPS backend allocates memory for the entire Q×K matrix at once, rather than using chunking and streaming like the CPU and CUDA backends
  2. Matrix multiplication is done in bfloat16 precision (2 bytes/element) and then immediately converted to float32 (4 bytes/element) for the softmax calculation
  3. This causes a sudden increase in memory usage and can lead to precision issues
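
To put rough numbers on that, using the 18407-token run from the log above (the head count below is an assumption for illustration, not taken from the model config):

```python
tokens = 18407                   # total tokens from the run log above
heads = 24                       # assumed head count, for illustration only
score_elems = tokens * tokens    # full Q @ K^T score matrix per head

bf16_gib = score_elems * heads * 2 / 1024**3   # ~15.1 GiB before the softmax
fp32_gib = score_elems * heads * 4 / 1024**3   # ~30.3 GiB once up-cast to float32

print(f"bf16 scores ~{bf16_gib:.1f} GiB, fp32 scores ~{fp32_gib:.1f} GiB")
# Either allocation is at or past the Metal per-buffer cap, which is why the
# full-matrix approach fails at this token count while chunking Q does not.
```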

The four thresholds you mentioned make perfect sense:

  • Threshold 1: exceeding this results in OOM (out of memory) errors
  • Threshold 2: lower than threshold 1 but still high; results in blocky noise
  • Threshold 3: a bit lower; results in blown-out colors but still recognizable shapes
  • Threshold 4: low enough to produce frames without visual artifacts

I think the long-term solution is indeed to fix the MPS fused attention implementation in PyTorch to use chunking and streaming like the CPU and CUDA backends. This will fundamentally solve the problem, rather than circumventing it at the application level.

I'd love to contribute to this effort, and I can try to implement this fix in PyTorch if you're interested.

Correct me if I'm wrong.

@astr0gator
Copy link

I have this error on a MacBook Pro M1 Max 64GB.

Clean py310 venv:

pip install torch torchvision torchaudio 
pip install -r requirements.txt 
python demo_gradio.py

Result:

(py310)  FramePack ❯ python demo_gradio.py
Traceback (most recent call last):
  File "/Users/a/Downloads/FramePack/FramePack/demo_gradio.py", line 21, in <module>
    from diffusers_helper.models.hunyuan_video_packed import HunyuanVideoTransformer3DModelPacked
  File "/Users/a/Downloads/FramePack/FramePack/diffusers_helper/models/hunyuan_video_packed.py", line 29, in <module>
    if torch.backends.cuda.cudnn_sdp_enabled():
AttributeError: module 'torch.backends.cuda' has no attribute 'cudnn_sdp_enabled'. Did you mean: 'flash_sdp_enabled'?

Following the suggestion, or commenting out the two lines in question, raises:
AssertionError: Torch not compiled with CUDA enabled

python demo_gradio.py --fp32 doesn't help

MacOS: 15.3.1 (24D70)
Commit: a875c8b
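
One possible local workaround for the first AttributeError (just a sketch, untested here, and it won't address the later CUDA assertion) would be to guard the CUDA-only backend query:

```python
import torch

# Hypothetical guard around the CUDA-only query in hunyuan_video_packed.py;
# older or MPS-only torch builds don't expose cudnn_sdp_enabled.
if torch.cuda.is_available() and hasattr(torch.backends.cuda, "cudnn_sdp_enabled"):
    cudnn_sdp = torch.backends.cuda.cudnn_sdp_enabled()
else:
    cudnn_sdp = False  # fall through to a non-CUDA attention path
```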

@cchance27
Copy link

cchance27 commented Apr 23, 2025

I'd love to contribute to this effort, and I can try to implement this fix in PyTorch if you're interested.

Correct me if I'm wrong.

If you understand the PyTorch backend and have an idea of how to fix it, I'm sure the PyTorch folks would appreciate it. Sadly, it feels like the MPS backend continues to trail behind because it gets less attention (a lot of the devs just don't have Macs) :(

@SplittyDev
Copy link

SplittyDev commented Apr 27, 2025

Nice work! Please let me know if you need any testing on a high VRAM machine. I can run tests on an M2 Ultra with 128GB unified memory. Feel free to @ me directly, so I get notified.

@Morac2
Copy link

Morac2 commented May 4, 2025

Is this still being worked on? Are there plans to update it with the FramePack-F1 support changes from yesterday?

Edit: It looks like F1 won't work because avg_pool3d isn't implemented for MPS.

@Morac2
Copy link

Morac2 commented May 4, 2025

FYI, I got F1 working on Mac using the same changes that @brandon929 made for the normal version, but I had to modify center_down_sample_3d to use the commented-out code instead of the call to PyTorch's avg_pool3d, as that's not supported for MPS. It generates output that looks good, though I'm not sure it's any faster.

See main...Morac2:FramePack:main
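
For anyone curious, the general idea looks roughly like this (a reshape-and-mean sketch that avoids avg_pool3d; not necessarily the exact commented-out code in the repo, and it assumes each dimension is divisible by the kernel size):

```python
import torch

def center_down_sample_3d_mps(x, kernel_size):
    # x: [batch, channels, frames, height, width]
    # Equivalent of avg_pool3d(x, kernel_size, stride=kernel_size) built only
    # from reshape + mean, which the MPS backend supports.
    kt, kh, kw = kernel_size
    b, c, t, h, w = x.shape
    x = x.reshape(b, c, t // kt, kt, h // kh, kh, w // kw, kw)
    return x.mean(dim=(3, 5, 7))
```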

@donghao1393
Copy link

donghao1393 commented May 4, 2025

@Morac2
I've been working on this part of pytorch.
I have a PR for avg_pool3d that you can patch: pytorch/pytorch#151742

@Morac2
Copy link

Morac2 commented May 4, 2025

@Morac2 I've been working on this part of pytorch. I have a PR for avg_pool3d that you can patch: pytorch/pytorch#151742

I built it, but I had issues installing it as I couldn't find a compatible torchaudio and torchvision. I eventually got it installed with torchaudio 0.11.0 and torchvision 0.12.0, both of which are really old, but newer versions complain they aren't compatible. Generation is working, so I guess it's okay, but it seems like it should be installing a newer version.

On the plus side, your version of torch worked with avg_pool3d. I'm not sure it was much faster, and it's still using a ridiculous amount of RAM (over 75 GB), which grows with each second of video added. It was over 90 GB after finishing 5 seconds of video.

Edit: I forced installation of the nightly torchaudio and torchvision. Despite it complaining they aren't compatible, it ran.

@Morac2
Copy link

Morac2 commented May 10, 2025

Is there an updated version of test_mps_attention.py? The chunked_attention_bfloat16 function being imported doesn't exist in hunyuan_video_packed. It was removed in commit 2a5edc0, so the test errors out immediately.

@donghao1393
Copy link

donghao1393 commented May 16, 2025

I have done a first trial of applying flash attention on MPS, but it requires more work to ensure accuracy. Currently I am working on a feature in another of my projects, and I will be on holiday starting next week. I will get back to this after that.

@Morac2
Copy link

Morac2 commented May 26, 2025

I have done a first trial of applying flash attention on MPS, but it requires more work to ensure accuracy. Currently I am working on a feature in another of my projects, and I will be on holiday starting next week. I will get back to this after that.

According to pytorch/pytorch#139668, flash attention for MPS is already implemented in PyTorch. It looks like Metal flash attention support was merged into PyTorch 3 weeks ago: pytorch/pytorch#152781

pytorch/pytorch#151742 still hasn’t been merged for some reason.

@donghao1393
Copy link

donghao1393 commented May 27, 2025

Looks gorgeous. Will test whether those PRs fit FramePack.

```python
model.to(device=target_device)
if torch.cuda.is_available():
    torch.cuda.empty_cache()
torch.mps.empty_cache()
```


torch.mps.empty_cache should only be called when MPS is available.
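
i.e. something along these lines:

```python
if torch.cuda.is_available():
    torch.cuda.empty_cache()
elif torch.backends.mps.is_available():
    torch.mps.empty_cache()
```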
