Implement timestep_conditioning [Ready for review] #3411
Yash-Vijay29 wants to merge 16 commits into openvinotoolkit:master
Conversation
Pull request overview
Adds end-to-end support for VAE timestep conditioning in the LTX video generation path, exposing the new capability through the C++ and Python APIs and validating it with Python tests.
Changes:
- Extend `Text2VideoPipeline::decode()` and `AutoencoderKLLTXVideo::decode()` to accept `decode_timestep` (defaulting to `0.0f`).
- Pass the normalized last scheduler timestep into VAE decode inside `LTXPipeline`.
- Expose `timestep_conditioning` in the Python config binding and add Python tests covering config exposure and pipeline decode API behavior.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/python_tests/test_video_generation.py | Adds tests for timestep_conditioning exposure and Text2VideoPipeline.decode() accepting an optional decode_timestep. |
| src/python/py_video_generation_pipelines.cpp | Exposes Text2VideoPipeline.decode(latent, decode_timestep=0.0) to Python with GIL release and docstring. |
| src/python/py_video_generation_models.cpp | Exposes AutoencoderKLLTXVideo::Config::timestep_conditioning and adds decode_timestep arg to VAE decode binding. |
| src/cpp/src/video_generation/text2video_pipeline.cpp | Implements new Text2VideoPipeline::decode(latent, decode_timestep) forwarding to impl. |
| src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp | Implements timestep input handling in reshape/decode when timestep_conditioning is enabled. |
| src/cpp/src/video_generation/ltx_pipeline.hpp | Computes decode_timestep from scheduler timesteps and passes it into VAE decode; updates pipeline decode signature. |
| src/cpp/include/openvino/genai/video_generation/text2video_pipeline.hpp | Updates public API and docs for decode_timestep. |
| src/cpp/include/openvino/genai/video_generation/autoencoder_kl_ltx_video.hpp | Updates public API and docs for decode_timestep on VAE decode. |
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (4)
src/cpp/include/openvino/genai/video_generation/autoencoder_kl_ltx_video.hpp:56
- The comment says normalization is `timestep / 1000`, but the actual normalization in the pipeline is `timestep / timesteps.front()` (effectively `timestep / num_train_timesteps`). Consider updating the wording to avoid implying the divisor is always exactly 1000.
// When timestep_conditioning is enabled in the config, decode_timestep must be
// the last scheduler timestep normalized to [0, 1] (i.e., timestep / 1000).
// For models without timestep_conditioning, the value is ignored.
ov::Tensor decode(const ov::Tensor& latent, float decode_timestep = 0.0f);
src/cpp/src/video_generation/ltx_pipeline.hpp:676
`LTXPipeline::decode()` updates the shared `m_perf_metrics.vae_decoder_inference_duration`. Since user callbacks are executed on a worker thread (`ThreadedCallbackWrapper`), calling `pipe.decode()` from inside a callback will write to `m_perf_metrics` concurrently with the main generation thread, causing a data race (UB) and potentially corrupting perf metrics. Consider returning perf stats computed locally for `decode()` (or guarding perf metrics with a mutex / making `decode()` not mutate shared state).
VideoGenerationResult decode(const ov::Tensor& latent, float decode_timestep = 0.0f) {
ov::Tensor postprocessed = postprocess_latents(latent);
const auto decode_start = std::chrono::steady_clock::now();
ov::Tensor video = m_vae->decode(postprocessed, decode_timestep);
m_perf_metrics.vae_decoder_inference_duration =
std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - decode_start)
.count();
    return VideoGenerationResult{video, m_perf_metrics};
}
src/python/py_video_generation_pipelines.cpp:118
- Docstring states normalization as `timestep / 1000`, but the scheduler's `num_train_timesteps` is configurable (even if typically 1000). Consider updating the wording to `timestep / num_train_timesteps` (or `timestep / max_timestep`) to match the C++ implementation and avoid confusion.
decode_timestep (float): Last scheduler timestep normalized to [0, 1] (timestep / 1000).
Required when the VAE config has timestep_conditioning=True (e.g., LTX-Video 0.9.1+).
Ignored for models without timestep conditioning.
src/python/py_video_generation_models.cpp:227
- Docstring hard-codes normalization as `timestep / 1000`, but the scheduler config can change `num_train_timesteps`. Consider wording this as `timestep / num_train_timesteps` (or `timestep / max_timestep`) for accuracy and consistency with the pipeline's normalization logic.
decode_timestep (float): Last scheduler timestep normalized to [0, 1] (timestep / 1000).
Required when the VAE config has timestep_conditioning=True (e.g., LTX-Video 0.9.1+).
Ignored for models without timestep conditioning.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
tests/python_tests/test_video_generation.py:318
- This comment assumes `decode_timestep=0.5` corresponds to `500/1000`, but the normalization is described elsewhere as `timestep / max_timestep` (scheduler-dependent). Please reword to avoid hard-coding 1000 so the test comment stays accurate if scheduler configs/models change.
# decode_timestep=0.5 corresponds to scheduler timestep 500 / 1000; ignored for non-conditioning models.
result = pipe.decode(latent_tensor, decode_timestep=0.5)
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (3)
src/cpp/include/openvino/genai/video_generation/autoencoder_kl_ltx_video.hpp:56
- Header comment says `decode_timestep` is `timestep / 1000`, but scheduler `num_train_timesteps` is configurable and `Text2VideoPipeline` docs describe normalization as `timestep / max_timestep` (typically `num_train_timesteps`). Please update this comment to avoid hardcoding `1000` and keep documentation consistent across the API surface.
// When timestep_conditioning is enabled in the config, decode_timestep must be
// the last scheduler timestep normalized to [0, 1] (i.e., timestep / 1000).
// For models without timestep_conditioning, the value is ignored.
ov::Tensor decode(const ov::Tensor& latent, float decode_timestep = 0.0f);
src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp:186
`ov::Tensor ts` is allocated on every `decode()` call when `timestep_conditioning` is enabled. If `decode()` is used inside callbacks to preview intermediate results, this repeated allocation can add overhead. Consider caching/reusing a `{1}` f32 tensor (e.g., as a member) and just updating its value before `infer()`.
if (m_config.timestep_conditioning) {
ov::Tensor ts(ov::element::f32, {1});
ts.data<float>()[0] = decode_timestep;
m_decoder_request.set_tensor("timestep", ts);
}
tests/python_tests/test_video_generation.py:318
- The inline comment assumes normalization is `timestep / 1000` ("500 / 1000"), but the scheduler's `num_train_timesteps` is configurable and the C++ API docs describe normalization as `timestep / max_timestep` (typically `num_train_timesteps`). Consider rewording this comment to avoid hardcoding 1000 and just state that `0.5` is a representative normalized timestep value.
# decode_timestep=0.5 corresponds to scheduler timestep 500 / 1000; ignored for non-conditioning models.
result = pipe.decode(latent_tensor, decode_timestep=0.5)
Force-pushed c2bc0a0 to 4243dcc
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
src/cpp/include/openvino/genai/video_generation/generation_config.hpp:76
`VideoGenerationConfig` is a public struct; inserting new fields (`decode_timestep`, `image_cond_noise_scale`) in the middle changes the offsets of existing members that follow (e.g., `taylorseer_config`, `adapters`), which is a stronger ABI break than appending new fields at the end. If maintaining C++ ABI for existing clients is a goal, consider adding new fields at the end of the struct (or moving the struct behind a pImpl/versioned wrapper).
/// Decode-time timestep for timestep-conditioned VAE decoders.
/// std::nullopt uses pipeline default which is 0.0f for LTX-Video pipeline runtime.
/// This value is forwarded to VAE only when VAE config enables timestep_conditioning.
std::optional<float> decode_timestep = std::nullopt;
/// Decode-time image conditioning noise scale for timestep-conditioned VAE decoders.
/// std::nullopt uses pipeline default which is 0.0f for LTX-Video pipeline runtime.
/// This value is forwarded to VAE only when VAE config enables timestep_conditioning.
std::optional<float> image_cond_noise_scale = std::nullopt;
/**
* TaylorSeer configuration for caching transformer outputs.
* When set, enables TaylorSeer Lite acceleration which skips some transformer inferences
* and predicts outputs using Taylor series approximation.
*/
std::optional<TaylorSeerCacheConfig> taylorseer_config;
Removed error handling for video loading and skipped pairs tracking.
Description
Wrote logic for timestep conditioning
Wrote tests for timestep conditioning
Tested against LTX-Video-0.9.1
Changes:
@likholat
So I patched optimum-intel to work with LTX-Video 0.9.1 (it has timestep_conditioning in its decoder) and modified the WWB CLI to help benchmark these models.
I also updated the documentation to include the extra parameters for timestep_conditioning.
I opened a PR against optimum-intel as well, since it didn't support exporting timestep_conditioning models to IR format either:
huggingface/optimum-intel#1652
Together with this PR it should allow inference to work with LTX-Video 0.9.1 at least. Other similar models from the LTX family should work too.
TESTING METHOD FOR TIMESTEP:
first ran
wwb --base-model Lightricks/LTX-Video-0.9.1 --gt-data video_gen_test_ts/gt.csv --model-type text-to-video --hf --decode-timestep 0.05 --decode-noise-scale 0.025 --num-samples 5
then ran
wwb --target-model ltx-video-0.9.1-ov --gt-data video_gen_test_ts/gt.csv --model-type text-to-video --genai --output ltx_video_genai_ts --decode-timestep 0.05 --decode-noise-scale 0.025 --num-samples 5
Accuracy with timestep 0.05 and decode-noise-scale 0.025:
0.76939785
Regular HF took 53 minutes to complete.
GenAI pipeline took 40 minutes to complete.
attaching metrics:
metrics_per_question.csv
TESTING FOR LTX-Video 0.9.1 with TIMESTEP OFF
first ran
wwb --target-model ltx-video-0.9.1-ov --gt-data video_gen_test_ts/gt.csv --model-type text-to-video --genai --output ltx_video_genai_ts --decode-timestep 0 --decode-noise-scale 0 --num-samples 5
then ran
wwb --target-model ltx-video-0.9.1-ov --gt-data video_gen_test_ts/gt.csv --model-type text-to-video --genai --output ltx_video_genai_ts --decode-timestep 0 --decode-noise-scale 0 --num-samples 5
Similarity score over the 5 prompts:
0.751931
Attaching metrics per question:
metrics_per_question_ts_off.csv
Let me know if you need other tests run or some changes to the codebase.
LTX-Video 0.9.1 works as far as I can tell; other similar models should too, hopefully.
Fixes #3410
Checklist: