
Implement timestep_conditioning [Ready for review] #3411

Open
Yash-Vijay29 wants to merge 16 commits into openvinotoolkit:master from Yash-Vijay29:timestep_conditioning

Conversation


@Yash-Vijay29 Yash-Vijay29 commented Feb 28, 2026

Description

  • Wrote logic for timestep conditioning
  • Wrote tests for timestep conditioning
  • Tested against LTX-Video 0.9.1

Changes:

  • Models with timestep conditioning, such as LTX-Video 0.9.1, can now be run successfully through the pipeline
  • WWB now accepts two extra parameters, --decode-timestep and --decode-noise-scale, for such models
  • --decode-timestep and --decode-noise-scale apply to models whose VAE decoder supports timestep conditioning (timestep_conditioning is true in the config)
  • A search algorithm looks for timestep embeddings in the vae_decoder; common keywords are included
  • Documentation has been updated accordingly for the LTX pipeline and WWB
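The keyword search mentioned above could look roughly like the following sketch. This is illustrative only, not the PR's actual implementation: the function name and keyword list are assumptions.

```python
# Hypothetical sketch of keyword-based detection of a timestep input
# among a decoder's input names. Keywords and function name are
# illustrative; the PR's actual search logic may differ.
TIMESTEP_KEYWORDS = ("timestep", "time_step", "conditioning_t", "t_emb")

def find_timestep_input(input_names):
    """Return the first input name that looks like a timestep input, else None."""
    for name in input_names:
        lowered = name.lower()
        if any(keyword in lowered for keyword in TIMESTEP_KEYWORDS):
            return name
    return None

assert find_timestep_input(["latent", "timestep"]) == "timestep"
assert find_timestep_input(["latent"]) is None
```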

@likholat

So I patched optimum-intel to work with LTX-Video 0.9.1 (its decoder uses timestep_conditioning), modified the WWB CLI to help benchmark these models, and updated the documentation to cover the extra timestep_conditioning parameters.

I also opened a PR against optimum-intel, since it did not support exporting timestep_conditioning models to IR format either:
huggingface/optimum-intel#1652

Together with this PR, it should allow inference to work with at least LTX-Video 0.9.1. Other similar models from the LTX family should work as well.

TESTING METHOD FOR TIMESTEP:

First ran:
wwb --base-model Lightricks/LTX-Video-0.9.1 --gt-data video_gen_test_ts/gt.csv --model-type text-to-video --hf --decode-timestep 0.05 --decode-noise-scale 0.025 --num-samples 5

Then ran:
wwb --target-model ltx-video-0.9.1-ov --gt-data video_gen_test_ts/gt.csv --model-type text-to-video --genai --output ltx_video_genai_ts --decode-timestep 0.05 --decode-noise-scale 0.025 --num-samples 5

Accuracy with --decode-timestep 0.05 and --decode-noise-scale 0.025:

0.76939785

The regular HF run took 53 minutes to complete.
The GenAI pipeline took 40 minutes to complete.

Attaching metrics:
metrics_per_question.csv

TESTING FOR LTX-Video 0.9.1 WITH TIMESTEP OFF

First ran:
wwb --target-model ltx-video-0.9.1-ov --gt-data video_gen_test_ts/gt.csv --model-type text-to-video --genai --output ltx_video_genai_ts --decode-timestep 0 --decode-noise-scale 0 --num-samples 5

Then ran:
wwb --target-model ltx-video-0.9.1-ov --gt-data video_gen_test_ts/gt.csv --model-type text-to-video --genai --output ltx_video_genai_ts --decode-timestep 0 --decode-noise-scale 0 --num-samples 5

Similarity score over the 5 prompts:
0.751931

Attaching metrics per question:
metrics_per_question_ts_off.csv

Let me know if you need other tests run or any changes to the codebase.

LTX-Video 0.9.1 works as far as I can tell; other similar models should hopefully work too.

Fixes #3410

Checklist:

  • This PR follows GenAI Contributing guidelines.
  • Tests have been updated or added to cover the new code.
  • This PR fully addresses the ticket.
  • I have made corresponding changes to the documentation.

Copilot AI review requested due to automatic review settings February 28, 2026 09:00
@Yash-Vijay29 Yash-Vijay29 marked this pull request as draft February 28, 2026 09:01
@github-actions github-actions bot added labels category: Python API, category: CPP API, category: GGUF, category: video generation Feb 28, 2026

Copilot AI left a comment


Pull request overview

Adds end-to-end support for VAE timestep conditioning in the LTX video generation path, exposing the new capability through the C++ and Python APIs and validating it with Python tests.

Changes:

  • Extend Text2VideoPipeline::decode() and AutoencoderKLLTXVideo::decode() to accept decode_timestep (defaulting to 0.0f).
  • Pass the normalized last scheduler timestep into VAE decode inside LTXPipeline.
  • Expose timestep_conditioning in the Python config binding and add Python tests covering config exposure and pipeline decode API behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

  • tests/python_tests/test_video_generation.py — Adds tests for timestep_conditioning exposure and Text2VideoPipeline.decode() accepting an optional decode_timestep.
  • src/python/py_video_generation_pipelines.cpp — Exposes Text2VideoPipeline.decode(latent, decode_timestep=0.0) to Python with GIL release and docstring.
  • src/python/py_video_generation_models.cpp — Exposes AutoencoderKLLTXVideo::Config::timestep_conditioning and adds decode_timestep arg to VAE decode binding.
  • src/cpp/src/video_generation/text2video_pipeline.cpp — Implements new Text2VideoPipeline::decode(latent, decode_timestep) forwarding to impl.
  • src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp — Implements timestep input handling in reshape/decode when timestep_conditioning is enabled.
  • src/cpp/src/video_generation/ltx_pipeline.hpp — Computes decode_timestep from scheduler timesteps and passes it into VAE decode; updates pipeline decode signature.
  • src/cpp/include/openvino/genai/video_generation/text2video_pipeline.hpp — Updates public API and docs for decode_timestep.
  • src/cpp/include/openvino/genai/video_generation/autoencoder_kl_ltx_video.hpp — Updates public API and docs for decode_timestep on VAE decode.

Copilot AI review requested due to automatic review settings February 28, 2026 09:21

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (4)

src/cpp/include/openvino/genai/video_generation/autoencoder_kl_ltx_video.hpp:56

  • The comment says normalization is timestep / 1000, but the actual normalization in the pipeline is timestep / timesteps.front() (effectively timestep / num_train_timesteps). Consider updating the wording to avoid implying the divisor is always exactly 1000.
    // When timestep_conditioning is enabled in the config, decode_timestep must be
    // the last scheduler timestep normalized to [0, 1] (i.e., timestep / 1000).
    // For models without timestep_conditioning, the value is ignored.
    ov::Tensor decode(const ov::Tensor& latent, float decode_timestep = 0.0f);
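The review's point can be shown with a small sketch: the normalized decode timestep is scheduler-dependent, so the divisor should be the scheduler's num_train_timesteps (often, but not necessarily, 1000). The function name is illustrative, not part of the PR.

```python
# Illustrative normalization of the last scheduler timestep to [0, 1].
# Dividing by num_train_timesteps rather than a hard-coded 1000, as the
# review comment suggests.
def normalize_decode_timestep(last_timestep: float, num_train_timesteps: int) -> float:
    return last_timestep / num_train_timesteps

# With the common default of 1000 train timesteps, 50 -> 0.05:
assert normalize_decode_timestep(50, 1000) == 0.05
# But with a scheduler configured for 500 train timesteps, 50 -> 0.1:
assert normalize_decode_timestep(50, 500) == 0.1
```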

src/cpp/src/video_generation/ltx_pipeline.hpp:676

  • LTXPipeline::decode() updates the shared m_perf_metrics.vae_decoder_inference_duration. Since user callbacks are executed on a worker thread (ThreadedCallbackWrapper), calling pipe.decode() from inside a callback will write to m_perf_metrics concurrently with the main generation thread, causing a data race (UB) and potentially corrupting perf metrics. Consider returning perf stats computed locally for decode() (or guarding perf metrics with a mutex / making decode() not mutate shared state).
    VideoGenerationResult decode(const ov::Tensor& latent, float decode_timestep = 0.0f) {
        ov::Tensor postprocessed = postprocess_latents(latent);

        const auto decode_start = std::chrono::steady_clock::now();
        ov::Tensor video = m_vae->decode(postprocessed, decode_timestep);
        m_perf_metrics.vae_decoder_inference_duration =
            std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - decode_start)
                .count();

        return VideoGenerationResult{video, m_perf_metrics};
    }
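One of the reviewer's suggested fixes, computing the decode duration locally and returning it with the result instead of mutating shared pipeline state, can be sketched language-agnostically as follows. Names here are hypothetical stand-ins for the C++ members.

```python
import time

# Hypothetical sketch of the reviewer's suggestion: the duration is owned
# by this call's return value, so concurrent decode() calls from callback
# threads never race on a shared metrics field.
def decode_with_local_metrics(vae_decode, latent, decode_timestep=0.0):
    start = time.monotonic()
    video = vae_decode(latent, decode_timestep)
    duration_ms = (time.monotonic() - start) * 1000.0
    # No shared state is touched; metrics travel with the result.
    return {"video": video, "vae_decoder_inference_duration_ms": duration_ms}

result = decode_with_local_metrics(lambda latent, t: latent, [0.0])
```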

src/python/py_video_generation_pipelines.cpp:118

  • Docstring states normalization as timestep / 1000, but the scheduler’s num_train_timesteps is configurable (even if typically 1000). Consider updating the wording to timestep / num_train_timesteps (or timestep / max_timestep) to match the C++ implementation and avoid confusion.
                decode_timestep (float): Last scheduler timestep normalized to [0, 1] (timestep / 1000).
                    Required when the VAE config has timestep_conditioning=True (e.g., LTX-Video 0.9.1+).
                    Ignored for models without timestep conditioning.

src/python/py_video_generation_models.cpp:227

  • Docstring hard-codes normalization as timestep / 1000, but the scheduler config can change num_train_timesteps. Consider wording this as timestep / num_train_timesteps (or timestep / max_timestep) for accuracy and consistency with the pipeline’s normalization logic.
                decode_timestep (float): Last scheduler timestep normalized to [0, 1] (timestep / 1000).
                    Required when the VAE config has timestep_conditioning=True (e.g., LTX-Video 0.9.1+).
                    Ignored for models without timestep conditioning.

Copilot AI review requested due to automatic review settings February 28, 2026 10:10

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

tests/python_tests/test_video_generation.py:318

  • This comment assumes decode_timestep=0.5 corresponds to 500/1000, but the normalization is described elsewhere as timestep / max_timestep (scheduler-dependent). Please reword to avoid hard-coding 1000 so the test comment stays accurate if scheduler configs/models change.
        # decode_timestep=0.5 corresponds to scheduler timestep 500 / 1000; ignored for non-conditioning models.
        result = pipe.decode(latent_tensor, decode_timestep=0.5)

Copilot AI review requested due to automatic review settings February 28, 2026 10:20

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (3)

src/cpp/include/openvino/genai/video_generation/autoencoder_kl_ltx_video.hpp:56

  • Header comment says decode_timestep is timestep / 1000, but scheduler num_train_timesteps is configurable and Text2VideoPipeline docs describe normalization as timestep / max_timestep (typically num_train_timesteps). Please update this comment to avoid hardcoding 1000 and keep documentation consistent across the API surface.
    // When timestep_conditioning is enabled in the config, decode_timestep must be
    // the last scheduler timestep normalized to [0, 1] (i.e., timestep / 1000).
    // For models without timestep_conditioning, the value is ignored.
    ov::Tensor decode(const ov::Tensor& latent, float decode_timestep = 0.0f);

src/cpp/src/video_generation/models/autoencoder_kl_ltx_video.cpp:186

  • ov::Tensor ts is allocated on every decode() call when timestep_conditioning is enabled. If decode() is used inside callbacks to preview intermediate results, this repeated allocation can add overhead. Consider caching/reusing a {1} f32 tensor (e.g., as a member) and just updating its value before infer().
    if (m_config.timestep_conditioning) {
        ov::Tensor ts(ov::element::f32, {1});
        ts.data<float>()[0] = decode_timestep;
        m_decoder_request.set_tensor("timestep", ts);
    }
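The suggested optimization, allocating the one-element timestep buffer once and only updating its value on each decode, can be sketched like this. A plain Python buffer stands in for the ov::Tensor; class and attribute names are illustrative, not the PR's code.

```python
import array

# Hypothetical sketch: cache the single-element f32 timestep buffer as a
# member instead of reallocating it on every decode() call.
class Decoder:
    def __init__(self, timestep_conditioning: bool):
        self.timestep_conditioning = timestep_conditioning
        # Allocated once, reused on every decode() call.
        self._ts_buffer = array.array("f", [0.0])

    def decode(self, latent, decode_timestep: float = 0.0):
        if self.timestep_conditioning:
            self._ts_buffer[0] = decode_timestep  # update in place, no realloc
        return latent

dec = Decoder(timestep_conditioning=True)
buf_id = id(dec._ts_buffer)
dec.decode([0.0], 0.05)
dec.decode([0.0], 0.1)
assert id(dec._ts_buffer) == buf_id  # same buffer reused across calls
```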

tests/python_tests/test_video_generation.py:318

  • The inline comment assumes normalization is timestep / 1000 ("500 / 1000"), but the scheduler’s num_train_timesteps is configurable and the C++ API docs describe normalization as timestep / max_timestep (typically num_train_timesteps). Consider rewording this comment to avoid hardcoding 1000 and just state that 0.5 is a representative normalized timestep value.
        # decode_timestep=0.5 corresponds to scheduler timestep 500 / 1000; ignored for non-conditioning models.
        result = pipe.decode(latent_tensor, decode_timestep=0.5)

@Yash-Vijay29 Yash-Vijay29 changed the title from "implement tests and logic for timestep_conditioning" to "logic for timestep_conditioning" Mar 6, 2026
Copilot AI review requested due to automatic review settings March 6, 2026 08:09

Copilot AI left a comment


Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

@Yash-Vijay29 Yash-Vijay29 changed the title from "logic for timestep_conditioning" to "Implement timestep_conditioning" Mar 6, 2026
@Yash-Vijay29 Yash-Vijay29 marked this pull request as ready for review March 6, 2026 11:43
@github-actions github-actions bot added the category: WWB label Mar 6, 2026
Copilot AI review requested due to automatic review settings March 26, 2026 08:34

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

@github-actions github-actions bot removed the category: tokenizers label Mar 26, 2026
Copilot AI review requested due to automatic review settings March 26, 2026 13:19
@Yash-Vijay29 Yash-Vijay29 force-pushed the timestep_conditioning branch from c2bc0a0 to 4243dcc Compare March 26, 2026 13:19

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.


Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/cpp/include/openvino/genai/video_generation/generation_config.hpp:76

  • VideoGenerationConfig is a public struct; inserting new fields (decode_timestep, image_cond_noise_scale) in the middle changes the offsets of existing members that follow (e.g., taylorseer_config, adapters), which is a stronger ABI break than appending new fields at the end. If maintaining C++ ABI for existing clients is a goal, consider adding new fields at the end of the struct (or moving the struct behind a pImpl/versioned wrapper).
    /// Decode-time timestep for timestep-conditioned VAE decoders.
    /// std::nullopt uses pipeline default which is 0.0f for LTX-Video pipeline runtime.
    /// This value is forwarded to VAE only when VAE config enables timestep_conditioning.
    std::optional<float> decode_timestep = std::nullopt;

    /// Decode-time image conditioning noise scale for timestep-conditioned VAE decoders.
    /// std::nullopt uses pipeline default which is 0.0f for LTX-Video pipeline runtime.
    /// This value is forwarded to VAE only when VAE config enables timestep_conditioning.
    std::optional<float> image_cond_noise_scale = std::nullopt;

    /**
     * TaylorSeer configuration for caching transformer outputs.
     * When set, enables TaylorSeer Lite acceleration which skips some transformer inferences
     * and predicts outputs using Taylor series approximation.
     */
    std::optional<TaylorSeerCacheConfig> taylorseer_config;
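The ABI concern above can be demonstrated with a small layout experiment: inserting a field in the middle of a struct shifts the offsets of every member after it, while appending at the end leaves existing offsets intact. The ctypes structs below mimic C++ standard-layout structs; the field names are simplified stand-ins for the real config.

```python
import ctypes

# Original layout: int at offset 0, float at offset 4 (typical alignment).
class ConfigV1(ctypes.Structure):
    _fields_ = [("num_inference_steps", ctypes.c_int),
                ("guidance_scale", ctypes.c_float)]

# New field inserted in the middle pushes guidance_scale to a new offset.
class ConfigInsertedMiddle(ctypes.Structure):
    _fields_ = [("num_inference_steps", ctypes.c_int),
                ("decode_timestep", ctypes.c_float),
                ("guidance_scale", ctypes.c_float)]

# New field appended at the end leaves existing offsets unchanged.
class ConfigAppended(ctypes.Structure):
    _fields_ = [("num_inference_steps", ctypes.c_int),
                ("guidance_scale", ctypes.c_float),
                ("decode_timestep", ctypes.c_float)]

assert ConfigAppended.guidance_scale.offset == ConfigV1.guidance_scale.offset
assert ConfigInsertedMiddle.guidance_scale.offset != ConfigV1.guidance_scale.offset
```

Clients compiled against the old layout would read the wrong bytes for any member after the insertion point, which is why appending (or a pImpl wrapper) is the safer evolution strategy.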


Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.


Copilot AI left a comment


Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.


Copilot AI left a comment


Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Removed error handling for video loading and skipped pairs tracking.

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.


Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.


Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.


Labels

category: CPP API, category: GGUF, category: GH Pages Docs, category: Python API, category: video generation, category: WWB


Development

Successfully merging this pull request may close these issues.

[Feature Request] Timestep_conditioning in ltx_pipeline

3 participants