Please refer to the Inference section of INSTALL.md for instructions on environment setup.
- Generate a Hugging Face access token. Set the access token to 'Read' permission (default is 'Fine-grained').
- Log in to Hugging Face with the access token:

  ```bash
  huggingface-cli login
  ```

- Accept the Llama-Guard-3-8B terms.
- Download the Cosmos model weights from Hugging Face:

  ```bash
  PYTHONPATH=$(pwd) python scripts/download_checkpoints.py --output_dir checkpoints/
  ```

  Note that this requires about 300 GB of free storage; not all of these checkpoints are used in every generation. A quick sanity check for the download is shown after the directory tree below.
- The downloaded files should be in the following structure:

  ```
  checkpoints/
  ├── nvidia
  │   ├── Cosmos-Guardrail1
  │   │   ├── README.md
  │   │   ├── blocklist/...
  │   │   ├── face_blur_filter/...
  │   │   └── video_content_safety_filter/...
  │   ├── Cosmos-Transfer1-7B
  │   │   ├── base_model.pt
  │   │   ├── vis_control.pt
  │   │   ├── edge_control.pt
  │   │   ├── edge_control_distilled.pt
  │   │   ├── seg_control.pt
  │   │   ├── depth_control.pt
  │   │   ├── keypoint_control.pt
  │   │   ├── 4kupscaler_control.pt
  │   │   └── config.json
  │   ├── Cosmos-Transfer1-7B-Sample-AV
  │   │   ├── base_model.pt
  │   │   ├── hdmap_control.pt
  │   │   └── lidar_control.pt
  │   ├── Cosmos-Tokenize1-CV8x8x8-720p
  │   │   ├── decoder.jit
  │   │   ├── encoder.jit
  │   │   ├── autoencoder.jit
  │   │   └── mean_std.pt
  │   └── Cosmos-UpsamplePrompt1-12B-Transfer
  │       ├── depth
  │       │   ├── consolidated.safetensors
  │       │   ├── params.json
  │       │   └── tekken.json
  │       ├── README.md
  │       ├── segmentation
  │       │   ├── consolidated.safetensors
  │       │   ├── params.json
  │       │   └── tekken.json
  │       ├── seg_upsampler_example.png
  │       └── viscontrol
  │           ├── consolidated.safetensors
  │           ├── params.json
  │           └── tekken.json
  ├── depth-anything/...
  ├── facebook/...
  ├── google-t5/...
  ├── IDEA-Research/...
  └── meta-llama/...
  ```
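To sanity-check the download, you can confirm that the expected model folders exist. This is a minimal sketch; the folder names are taken from the tree above, and the `du` summary is optional:

```bash
# Verify the main model folders from the tree above are present.
for d in Cosmos-Guardrail1 Cosmos-Transfer1-7B Cosmos-Transfer1-7B-Sample-AV \
         Cosmos-Tokenize1-CV8x8x8-720p Cosmos-UpsamplePrompt1-12B-Transfer; do
  [ -d "checkpoints/nvidia/$d" ] && echo "OK       $d" || echo "MISSING  $d"
done

# Optional: total size on disk (should be roughly 300 GB when complete).
du -sh checkpoints/
```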
For a general overview of how to use the model, see this guide.
Ensure you are at the root of the repository before executing the following:

```bash
export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/inference_keypoint \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_keypoint.json \
    --offload_text_encoder_model
```
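Before launching, it can help to confirm that the GPU selected via CUDA_VISIBLE_DEVICES is actually visible to the driver (assuming the NVIDIA driver utilities are installed):

```bash
# Show the index, name, and memory of each visible GPU.
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```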
You can also choose to run the inference on multiple GPUs as follows:

```bash
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0,1,2,3}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=4}"
PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/inference_keypoint \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_keypoint.json \
    --offload_text_encoder_model \
    --num_gpus $NUM_GPU
```
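To keep NUM_GPU and CUDA_VISIBLE_DEVICES in sync, you can derive one from the other. A minimal sketch, assuming the comma-separated device list format used above:

```bash
# Derive NUM_GPU by counting the devices in CUDA_VISIBLE_DEVICES.
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0,1,2,3}"
export NUM_GPU=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l)
echo "Launching with $NUM_GPU GPUs: $CUDA_VISIBLE_DEVICES"
```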
Either command launches transfer.py and configures the controlnets for inference according to assets/inference_cosmos_transfer1_single_control_keypoint.json:

```json
{
  "prompt": "The video takes place in a kitchen setting ...",
  "input_video_path": "assets/inference_keypoint_input_video.mp4",
  "keypoint": {
    "control_weight": 1.0
  }
}
```

The input video looks like this:
ex2_input.mp4
Here's what the model outputs:
output.mp4
Note that the faces in the generated video have been blurred by the guardrail.
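To experiment with the strength of the control signal, you can copy the spec and adjust control_weight before rerunning. A minimal sketch using jq; my_keypoint_spec.json and the output folder name are hypothetical, and 0.5 is an arbitrary example value:

```bash
# Write a variant spec with a weaker keypoint control weight.
jq '.keypoint.control_weight = 0.5' \
    assets/inference_cosmos_transfer1_single_control_keypoint.json \
    > my_keypoint_spec.json

# Rerun inference with the modified spec.
PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/inference_keypoint_weak \
    --controlnet_specs my_keypoint_spec.json \
    --offload_text_encoder_model
```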