This is a ComfyUI custom node for basic image generation with Hunyuan Image 3.0.
- Supports CPU and disk offload, enabling generation on consumer hardware
- When using CPU offload, weights are stored in system RAM and transferred to the GPU as needed for processing
- When using disk offload, weights are split between system RAM and disk, and are transferred to the GPU as needed for processing
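A toy model of the two offload modes helps show where data moves. This is a minimal sketch, not this node's code: the `Layer` class and `forward_pass` function are hypothetical, and it assumes 32 equally sized layers of an illustrative 5 GB each. It only tallies how many gigabytes would be read from disk and copied over PCIe during one pass.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Layer:
    name: str
    size_gb: float
    location: str  # hypothetical labels for this sketch: "ram" or "disk"


def forward_pass(layers: List[Layer]) -> Dict[str, float]:
    """Simulate one pass through an offloaded model, one layer at a time.

    Nothing is computed; the function only counts data movement.
    """
    disk_read_gb = 0.0
    pcie_copied_gb = 0.0
    for layer in layers:
        if layer.location == "disk":
            # Disk-offloaded weights are first read back into system RAM.
            disk_read_gb += layer.size_gb
        # Every offloaded layer then crosses the PCIe bus to the GPU,
        # is used for computation, and is released to free VRAM.
        pcie_copied_gb += layer.size_gb
    return {"disk_read_gb": disk_read_gb, "pcie_copied_gb": pcie_copied_gb}


# 10 of 32 layers on disk, the rest in RAM (illustrative sizes).
layers = [Layer(f"layer_{i}", 5.0, "disk" if i < 10 else "ram") for i in range(32)]
print(forward_pass(layers))
```

With CPU offload only, `disk_read_gb` drops to zero but the PCIe traffic stays the same, which is why both the drive and the PCIe link matter.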
Tested (uses CPU + disk offload):
- Windows [1]
- An Nvidia GPU with 24GB of VRAM [1]
- 128GB of system memory [2]
- A PCIe 4.0 x16 connection between GPU and CPU [3]
- An NVMe SSD [4]
Recommended (uses CPU offload, but no disk offload) [5]:
- Windows
- An Nvidia GPU with >32GB of VRAM
- 192GB of system memory
- A PCIe 5.0 x16 connection between GPU and CPU
Ultra (does not use offload) [5]:
- Enough VRAM to run entirely on the GPU (possibly around 192GB)
- Navigate to your `custom_nodes` folder
- Clone this repo: `git clone https://github.com/bgreene2/ComfyUI-HunyuanImage3.git`
- Change to the directory: `cd ComfyUI-HunyuanImage3`
- With the correct Python environment loaded, install the dependencies: `pip install -r requirements.txt`
- If PyTorch is not already installed, install it: `pip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128`
- Add the flag `--disable-cuda-malloc` to your ComfyUI startup script
- Download the model weights
- If you are using an Nvidia GPU, open the Nvidia control panel and change the "CUDA Sysmem Fallback Policy" to "Prefer Sysmem Fallback"
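If you start ComfyUI from a Python script rather than a batch file, the launch command can be built so the required flag is never forgotten. A minimal sketch, assuming ComfyUI is started through its `main.py` with the currently active Python; `comfyui_command` and `launch` are hypothetical helpers and the `"ComfyUI"` path is a placeholder.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Optional

REQUIRED_FLAG = "--disable-cuda-malloc"


def comfyui_command(comfyui_dir: str, extra_args: Optional[List[str]] = None) -> List[str]:
    """Return a ComfyUI launch command that always includes --disable-cuda-malloc."""
    args = list(extra_args or [])
    if REQUIRED_FLAG not in args:
        args.append(REQUIRED_FLAG)
    # Assumes ComfyUI's entry point is main.py and uses the active interpreter.
    return [sys.executable, str(Path(comfyui_dir) / "main.py"), *args]


def launch(comfyui_dir: str, extra_args: Optional[List[str]] = None) -> None:
    """Start ComfyUI and block until it exits."""
    subprocess.run(comfyui_command(comfyui_dir, extra_args), check=True)


# Example with a placeholder install location:
print(comfyui_command("ComfyUI", ["--listen"]))
```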
The node has the following inputs:
- prompt - Your text prompt
The node has the following parameters:
- seed - The seed for the random number generator
- control after generate - ComfyUI's standard widget attached to the `seed` parameter; it controls how the seed changes after each run (fixed, increment, decrement, or randomize)
- use_dimensions - Whether to use the width and height values below. If false, the model chooses the image dimensions automatically.
- width - Image width
- height - Image height
- steps - Number of steps. The recommended value is 50.
- guidance_scale - CFG. The recommended value is 7.5.
- attn_implementation - Valid values are `sdpa` and `flash_attention_2`
- moe_impl - Valid values are `eager` and `flashinfer`
- weights_folder - The path to the model's weights on disk. Note: the path must not contain a dot (`.`).
- use_offload - Whether to use CPU / disk offload.
- disk_offload_layers - The number of layers (out of 32) to offload to disk, rather than hold in memory. See the Performance Tuning section for more details.
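The constraints in the parameter list can be checked before queueing a long run. A minimal sketch: `HunyuanImage3Params` is a hypothetical mirror of the node's parameters, not a class this node exposes. The defaults for `steps` and `guidance_scale` are the recommended values above; the other defaults (seed, 1024x1024, `sdpa`, `eager`) are assumptions for illustration.

```python
from dataclasses import dataclass

VALID_ATTN = ("sdpa", "flash_attention_2")
VALID_MOE = ("eager", "flashinfer")
NUM_LAYERS = 32


@dataclass
class HunyuanImage3Params:
    """Hypothetical mirror of the node's parameters, used only for validation."""

    weights_folder: str
    seed: int = 0
    use_dimensions: bool = False
    width: int = 1024  # assumed default
    height: int = 1024  # assumed default
    steps: int = 50  # recommended value
    guidance_scale: float = 7.5  # recommended value
    attn_implementation: str = "sdpa"
    moe_impl: str = "eager"
    use_offload: bool = True
    disk_offload_layers: int = 0

    def validate(self) -> None:
        """Raise ValueError if a value is outside what the node accepts."""
        if self.attn_implementation not in VALID_ATTN:
            raise ValueError(f"attn_implementation must be one of {VALID_ATTN}")
        if self.moe_impl not in VALID_MOE:
            raise ValueError(f"moe_impl must be one of {VALID_MOE}")
        if "." in self.weights_folder:
            raise ValueError("weights_folder must not contain a dot")
        if not 0 <= self.disk_offload_layers <= NUM_LAYERS:
            raise ValueError(f"disk_offload_layers must be between 0 and {NUM_LAYERS}")


params = HunyuanImage3Params(weights_folder="D:/models/HunyuanImage-3", disk_offload_layers=10)
params.validate()  # passes
```

The dot rule is easy to trip over: a folder named `HunyuanImage-3.0` would be rejected, so rename it before pointing `weights_folder` at it.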
Basic usage: Connect a String (Multiline) node to the prompt input, and connect the Image output to a Save Image node. An example workflow is provided.
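The same three-node graph can also be queued through ComfyUI's HTTP API. This is a sketch under several assumptions: `PrimitiveStringMultiline` is assumed to be the class name of the String (Multiline) node, `HunyuanImage3` is a placeholder for this node's class name, input names are assumed to match the parameter names above, and the Image is assumed to be output 0. "control after generate" is assumed to be handled by the frontend and is omitted. Exporting the provided workflow in API format is the reliable way to confirm the real names.

```python
import json
import urllib.request

STRING_NODE = "PrimitiveStringMultiline"  # assumption: String (Multiline) class name
HUNYUAN_NODE = "HunyuanImage3"  # hypothetical placeholder for this node's class name


def build_workflow(prompt_text: str, weights_folder: str, seed: int = 0) -> dict:
    """API-format graph: String (Multiline) -> Hunyuan node -> Save Image."""
    return {
        "1": {"class_type": STRING_NODE, "inputs": {"value": prompt_text}},
        "2": {
            "class_type": HUNYUAN_NODE,
            "inputs": {
                "prompt": ["1", 0],  # link to output 0 of node "1"
                "seed": seed,
                "use_dimensions": False,
                "width": 1024,
                "height": 1024,
                "steps": 50,
                "guidance_scale": 7.5,
                "attn_implementation": "sdpa",
                "moe_impl": "eager",
                "weights_folder": weights_folder,
                "use_offload": True,
                "disk_offload_layers": 10,
            },
        },
        "3": {
            "class_type": "SaveImage",
            "inputs": {"images": ["2", 0], "filename_prefix": "HunyuanImage3"},
        },
    }


def queue_prompt(workflow: dict, server: str = "http://127.0.0.1:8188") -> dict:
    """POST the workflow to a running ComfyUI server's /prompt endpoint."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"{server}/prompt", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


workflow = build_workflow("A watercolor painting of a lighthouse at dusk", "D:/models/HunyuanImage-3")
# queue_prompt(workflow)  # requires a running ComfyUI server
```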

If you are using disk offload, choose `disk_offload_layers` so that your system RAM never fills up completely; otherwise the system will start using swap. Finding the right value may take some trial and error while monitoring memory usage. On a system with 128GB of memory, 10 layers is a good starting point.
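A rough starting value can be estimated before the trial-and-error begins. A minimal sketch, assuming the weights are split evenly across the 32 layers and ignoring non-layer weights; `suggest_disk_offload_layers` is hypothetical. `model_gb` should be measured (for example, the total size of `weights_folder`); the 160 GB used below is an assumption, as is the 16 GB reserved for the OS, ComfyUI, and activations.

```python
import math


def suggest_disk_offload_layers(
    ram_gb: float, model_gb: float, reserve_gb: float, num_layers: int = 32
) -> int:
    """Estimate how many layers must go to disk so the rest fit in RAM.

    Assumes equally sized layers and ignores weights outside the layers.
    """
    per_layer_gb = model_gb / num_layers
    budget_gb = max(0.0, ram_gb - reserve_gb)
    layers_in_ram = min(num_layers, math.floor(budget_gb / per_layer_gb))
    return num_layers - layers_in_ram


print(suggest_disk_offload_layers(ram_gb=128, model_gb=160, reserve_gb=16))
print(suggest_disk_offload_layers(ram_gb=192, model_gb=160, reserve_gb=16))
```

Treat the result only as a first guess: if memory usage still approaches 100%, raise the value by one or two layers and try again.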
With an RTX 4090, 128GB of DDR4, and PCIe 4.0, an image takes roughly 1 to 1.5 hours to generate.
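Dividing the total time by the step count gives a per-step figure you can compare against ComfyUI's progress bar. A minimal sketch, assuming the recommended 50 steps and that time is spread evenly across steps (model loading and image decoding are ignored); `seconds_per_step` is a hypothetical helper.

```python
def seconds_per_step(total_hours: float, steps: int = 50) -> float:
    """Average time per sampling step, assuming time is spread evenly."""
    return total_hours * 3600 / steps


for hours in (1.0, 1.5):
    print(f"{hours} h over 50 steps -> {seconds_per_step(hours):.0f} s per step")
```

If your setup is far slower per step than this range, it is worth checking memory usage for swapping before changing anything else.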
This model works best with detailed prompts. See the HuggingFace page for a prompting guide. You can also use an LLM-based prompt rewriter, such as PromptEnhancer, to expand your prompts.
If you are getting crashes due to running out of GPU memory, make sure you are doing all of the following:
- Use Windows
- Use an Nvidia GPU
- Set the Sysmem Fallback Policy in the Nvidia control panel to "Prefer Sysmem Fallback"
- Run ComfyUI with `--disable-cuda-malloc`
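Three of the four checklist items can be verified from Python. A minimal sketch; `oom_checklist` is hypothetical and not part of this node. The Sysmem Fallback Policy is a driver setting that this sketch does not read, so check it by hand. `argv` defaults to the current process's arguments, so the flag check is only meaningful inside the ComfyUI process; otherwise pass your startup command explicitly.

```python
import platform
import sys
from typing import List, Optional


def oom_checklist(
    argv: Optional[List[str]] = None,
    system: Optional[str] = None,
    cuda_available: Optional[bool] = None,
) -> List[str]:
    """Return the unsatisfied checklist items (an empty list means all pass).

    The Nvidia Sysmem Fallback Policy is not checked here.
    """
    argv = sys.argv if argv is None else argv
    system = platform.system() if system is None else system
    if cuda_available is None:
        try:
            import torch

            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False

    problems = []
    if system != "Windows":
        problems.append("Not running on Windows")
    if not cuda_available:
        problems.append("No CUDA-capable (Nvidia) GPU visible to PyTorch")
    if "--disable-cuda-malloc" not in argv:
        problems.append("ComfyUI was not started with --disable-cuda-malloc")
    return problems


print(oom_checklist(argv=["main.py"], system="Linux", cuda_available=True))
```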
Footnotes
- [1] Windows + Nvidia is required when using a GPU with low VRAM, because this combination allows the driver to fall back to system memory when graphics memory is full. Otherwise, some operations will crash the program with an out-of-memory (OOM) error. Graphics memory usage appears to spike above 32GB, so system memory fallback is probably still necessary with an RTX 5090.
- [2] The less memory you have, the more weights must be offloaded to disk, with a dramatic performance penalty.
- [3] When using CPU offload, a major performance bottleneck is copying weights across the PCIe bus to the GPU.
- [4] When using disk offload, a major bottleneck is reading from the drive, so you want it to be as fast as possible.
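The PCIe and drive bottlenecks described above can be put into rough numbers by dividing data volume by bandwidth. A minimal sketch: the PCIe figures are the theoretical one-direction peaks of an x16 link (about 31.5 GB/s for PCIe 4.0 and 63 GB/s for PCIe 5.0), while the 160 GB model size, 50 GB on disk, and 7 GB/s NVMe read speed are illustrative assumptions. Real throughput is lower, so these are lower bounds per pass over the weights.

```python
# Theoretical peak one-direction bandwidth of an x16 link, in GB/s.
PCIE_X16_GBPS = {"PCIe 4.0": 31.5, "PCIe 5.0": 63.0}


def transfer_seconds(gigabytes: float, gb_per_second: float) -> float:
    """Lower bound on the time to move `gigabytes` at the given bandwidth."""
    return gigabytes / gb_per_second


MODEL_GB = 160.0  # assumption: total size of the offloaded weights
DISK_LAYERS_GB = 50.0  # assumption: e.g. 10 of 32 layers at ~5 GB each
NVME_GBPS = 7.0  # assumption: sequential read speed of a fast NVMe SSD

for link, bandwidth in PCIE_X16_GBPS.items():
    print(f"{link} x16: >= {transfer_seconds(MODEL_GB, bandwidth):.1f} s per pass over the weights")
print(f"NVMe read: >= {transfer_seconds(DISK_LAYERS_GB, NVME_GBPS):.1f} s per pass over disk-offloaded layers")
```

Doubling the link bandwidth halves the PCIe lower bound, which is the reasoning behind recommending PCIe 5.0.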

