
# Cosmos-Transfer1: Inference featuring keypoint control

## Install Cosmos-Transfer1

### Environment setup

Please refer to the Inference section of INSTALL.md for instructions on environment setup.

### Download Checkpoints

1. Generate a Hugging Face access token. Set the access token to 'Read' permission (default is 'Fine-grained').

2. Log in to Hugging Face with the access token:

```bash
huggingface-cli login
```
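If you need a non-interactive login (e.g., in a script or CI job), `huggingface-cli` also accepts the token directly, and `huggingface_hub` honors the `HF_TOKEN` environment variable. A minimal sketch; the token value is a placeholder:

```bash
# Pass the token as a flag instead of typing it at the prompt
huggingface-cli login --token "$HF_TOKEN"

# Or export it and let huggingface_hub pick it up automatically
export HF_TOKEN=hf_your_token_here   # placeholder value
```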
3. Accept the Llama-Guard-3-8B terms.

4. Download the Cosmos model weights from Hugging Face:

```bash
PYTHONPATH=$(pwd) python scripts/download_checkpoints.py --output_dir checkpoints/
```

Note that this will require about 300GB of free storage. Not all these checkpoints will be used in every generation.
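Before kicking off the download, it is worth confirming the target filesystem actually has that much room. A minimal check, assuming you are downloading into the current directory:

```bash
# Show available space on the filesystem that will hold checkpoints/
df -h .
```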

5. The downloaded files should be in the following structure:

```
checkpoints/
├── nvidia
│   │
│   ├── Cosmos-Guardrail1
│   │   ├── README.md
│   │   ├── blocklist/...
│   │   ├── face_blur_filter/...
│   │   └── video_content_safety_filter/...
│   │
│   ├── Cosmos-Transfer1-7B
│   │   ├── base_model.pt
│   │   ├── vis_control.pt
│   │   ├── edge_control.pt
│   │   ├── edge_control_distilled.pt
│   │   ├── seg_control.pt
│   │   ├── depth_control.pt
│   │   ├── keypoint_control.pt
│   │   ├── 4kupscaler_control.pt
│   │   └── config.json
│   │
│   ├── Cosmos-Transfer1-7B-Sample-AV
│   │   ├── base_model.pt
│   │   ├── hdmap_control.pt
│   │   └── lidar_control.pt
│   │
│   ├── Cosmos-Tokenize1-CV8x8x8-720p
│   │   ├── decoder.jit
│   │   ├── encoder.jit
│   │   ├── autoencoder.jit
│   │   └── mean_std.pt
│   │
│   └── Cosmos-UpsamplePrompt1-12B-Transfer
│       ├── depth
│       │   ├── consolidated.safetensors
│       │   ├── params.json
│       │   └── tekken.json
│       ├── README.md
│       ├── segmentation
│       │   ├── consolidated.safetensors
│       │   ├── params.json
│       │   └── tekken.json
│       ├── seg_upsampler_example.png
│       └── viscontrol
│           ├── consolidated.safetensors
│           ├── params.json
│           └── tekken.json
│
├── depth-anything/...
├── facebook/...
├── google-t5/...
├── IDEA-Research/...
└── meta-llama/...
```
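Once the download completes, a quick way to confirm the weights this example relies on are actually in place (a minimal sketch; the paths mirror the tree above):

```bash
# Check that the Cosmos-Transfer1-7B control checkpoints listed above exist
for f in base_model.pt vis_control.pt edge_control.pt edge_control_distilled.pt \
         seg_control.pt depth_control.pt keypoint_control.pt 4kupscaler_control.pt; do
    test -f "checkpoints/nvidia/Cosmos-Transfer1-7B/$f" || echo "missing: $f"
done
```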

## Run Example

For a general overview of how to use the model, see this guide.

Ensure you are at the root of the repository before executing the following:

```bash
export CUDA_VISIBLE_DEVICES=0
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
PYTHONPATH=$(pwd) python cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/inference_keypoint \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_keypoint.json \
    --offload_text_encoder_model
```

You can also run inference on multiple GPUs as follows:

export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0,1,2,3}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=4}"
PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
    --checkpoint_dir $CHECKPOINT_DIR \
    --video_save_folder outputs/inference_keypoint \
    --controlnet_specs assets/inference_cosmos_transfer1_single_control_keypoint.json \
    --offload_text_encoder_model \
    --num_gpus $NUM_GPU
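To use a different number of GPUs, keep `CUDA_VISIBLE_DEVICES` and `NUM_GPU` consistent. For example, for two GPUs (a hypothetical variation of the command above):

```bash
export CUDA_VISIBLE_DEVICES=0,1
export NUM_GPU=2
# then launch torchrun with --nproc_per_node=$NUM_GPU exactly as above
```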

This launches transfer.py and configures the controlnets for inference according to assets/inference_cosmos_transfer1_single_control_keypoint.json:

```json
{
    "prompt": "The video takes place in a kitchen setting ...",
    "input_video_path": "assets/inference_keypoint_input_video.mp4",
    "keypoint": {
        "control_weight": 1.0
    }
}
```
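The spec maps each controlnet name to its settings; here only the keypoint controlnet is enabled, at full strength. If you want the pose guidance to constrain the output less strictly, you can lower `control_weight`. A hypothetical variant (the 0.5 value is an illustration, not a shipped config):

```json
{
    "prompt": "The video takes place in a kitchen setting ...",
    "input_video_path": "assets/inference_keypoint_input_video.mp4",
    "keypoint": {
        "control_weight": 0.5
    }
}
```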

### The input and output videos

The input video looks like this:

ex2_input.mp4

Here's what the model outputs:

output.mp4

Note that the faces in the generated video have been blurred by the guardrail.