Conversation

@davidberenstein1957
Member

Description

  • Introduced ImageRewardMetric to evaluate text-to-image generation quality; its authors report it matches human preferences better than prior scoring methods such as CLIP score.
  • Updated pyproject.toml to include new dependencies: image-reward and clip.
  • Integrated ImageRewardMetric into the evaluation task processing.
  • Added unit tests for ImageRewardMetric, covering registration, prompt extraction, scoring, and error handling.

Related Issue

Fixes #268

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@davidberenstein1957 davidberenstein1957 linked an issue Jul 22, 2025 that may be closed by this pull request
@davidberenstein1957
Member Author

@begumcig I found that all prompts are truncated to 35 tokens... not sure if that makes sense, but otherwise we could override the score function, since most of the prompts in their dataset actually extend beyond 35 tokens.

https://github.com/THUDM/ImageReward/blob/c1392c6dd0fd6ecd6d416c96959ab744a6d0a8fb/ImageReward/ImageReward.py#L110
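A toy illustration of what that truncation means in practice. This uses whitespace "tokens" as a stand-in, NOT the real BLIP tokenizer (which ImageReward calls with `truncation=True, max_length=35` at the line linked above), but the effect is the same: everything past the limit is silently dropped before scoring.

```python
def truncate_prompt(prompt: str, max_tokens: int = 35) -> tuple[str, int]:
    """Whitespace stand-in for tokenizer truncation.

    Returns the kept prefix and the number of dropped tokens.
    """
    tokens = prompt.split()
    kept = tokens[:max_tokens]
    return " ".join(kept), len(tokens) - len(kept)

# A 50-token prompt loses its last 15 tokens before the reward model sees it.
long_prompt = " ".join(f"word{i}" for i in range(50))
kept, dropped = truncate_prompt(long_prompt)
print(len(kept.split()), dropped)  # 35 15
```

So for prompts longer than 35 tokens, the score is computed against a prefix only, which is the motivation for overriding `score`.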

* Added ImageReward as a dependency in pyproject.toml for enhanced text-to-image evaluation.
* Included timm>=1.0.0 in dependencies for improved model performance.
* Updated uv.lock to reflect changes in package versions and ensure consistency across environments.
* Changed the GitHub repository source for the image-reward dependency from THUDM to PrunaAI in both pyproject.toml and uv.lock files.
* Removed the timm>=1.0.0 dependency from pyproject.toml to streamline the dependency list.
@davidberenstein1957
Member Author

@sharpenb @sdiazlor @begumcig I've implemented a more generalisable reward metric here: #272


@begumcig begumcig left a comment

Really great work David! Thank you so much for being so attentive to the details of the evaluation module. It's already almost there, requested some small changes :)

assert all(prompt.startswith("prompt_") for prompt in extracted)


def test_score_image():

These tests are really comprehensive, thanks a lot! Shall we also add a case (or cases) using either PrunaModel.run_inference() or the EvaluationAgent, similar to our tests for the CMMD metric? That way we would also ensure the metric is compatible with our engine and agent!
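For readers outside the project, the shape of such an integration-style test might look like the sketch below. Every name here (`run_inference`, the metric interface) is an illustrative stand-in, not the real pruna API; the actual test should use PrunaModel.run_inference() or the EvaluationAgent as the existing CMMD tests do.

```python
def run_inference(prompt: str) -> str:
    """Pretend inference step (stand-in for PrunaModel.run_inference()).

    Returns the path where a generated image would be saved.
    """
    return f"/tmp/{prompt.replace(' ', '_')}.png"


class FakeImageRewardMetric:
    """Minimal metric interface: accumulate (prompt, image) pairs, then compute."""

    def __init__(self) -> None:
        self.pairs: list[tuple[str, str]] = []

    def update(self, prompt: str, image_path: str) -> None:
        self.pairs.append((prompt, image_path))

    def compute(self) -> float:
        # The real ImageReward returns a learned preference score; constant here.
        return 1.0 if self.pairs else 0.0


# Drive the metric through the inference path, as the engine/agent would,
# instead of scoring pre-made (prompt, image) pairs directly.
metric = FakeImageRewardMetric()
for prompt in ["a red fox in snow", "a blue bird on a branch"]:
    metric.update(prompt, run_inference(prompt))
print(len(metric.pairs), metric.compute())  # 2 1.0
```

The point of the pattern is that the test exercises the same call chain the agent uses, so interface mismatches surface in CI rather than at evaluation time.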


@sharpenb sharpenb left a comment


The PR description suggests a big overlap between #272 and #270. It would be nice to merge these two PRs together (and address all of @begum's comments ;))

@github-actions

This PR has been inactive for 10 days and is now marked as stale.


# Compute the result
result = metric.compute()
import pdb; pdb.set_trace()

Bug: Debugging Code Left in Test

A pdb.set_trace() call was left in test_update_and_compute. This debugging statement pauses test execution, which breaks automated test runs and CI/CD pipelines.
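The fix is simply deleting the `pdb.set_trace()` line and asserting on the computed result instead. A minimal sketch of the corrected test tail, using a stub in place of the real ImageRewardMetric (whose implementation isn't shown in this thread):

```python
class StubMetric:
    """Stand-in for ImageRewardMetric; just averages the updated scores."""

    def __init__(self) -> None:
        self._scores: list[float] = []

    def update(self, score: float) -> None:
        self._scores.append(score)

    def compute(self) -> float:
        return sum(self._scores) / len(self._scores)


metric = StubMetric()
metric.update(0.5)
metric.update(1.0)

# Compute the result -- no pdb.set_trace(); assert on the value instead
result = metric.compute()
assert result == 0.75
print(result)  # 0.75
```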



Development

Successfully merging this pull request may close these issues.

[FEATURE] Add ImageReward metric

4 participants