@juanmichelini
Collaborator

@juanmichelini juanmichelini commented Oct 28, 2025

Fixes #24

juanmichelini and others added 9 commits October 15, 2025 14:43
- Added missing swebench-eval script entry point that maps to benchmarks.swe_bench.eval_infer:main
- This resolves the 'No such file or directory' error when running 'uv run swebench-eval'
- Now swebench-eval works as an entry point to eval_infer.py, similar to how swebench-infer works for run_infer.py
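
A script entry point of the form `module.path:function` names a module to import and an attribute to call. The sketch below shows that resolution step using the standard library's `importlib.metadata.EntryPoint`. The `json:dumps` target and the `demo-script` name are stand-ins chosen so the snippet runs outside this repository; they are not part of the PR.

```python
from importlib.metadata import EntryPoint

# "json:dumps" is a stand-in target so the sketch runs anywhere. The target
# described in the commit message is "benchmarks.swe_bench.eval_infer:main".
entry = EntryPoint(name="demo-script", value="json:dumps", group="console_scripts")

# load() imports the module named before the colon and returns the attribute
# named after it, which is what a generated console script ends up calling.
func = entry.load()
print(func({"ok": True}))
```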

Co-authored-by: openhands <[email protected]>
@juanmichelini
Collaborator Author

@OpenHands /codereview-roasted

@openhands-ai

openhands-ai bot commented Oct 30, 2025

I'm on it! juanmichelini can track my progress at all-hands.dev

@openhands-ai

openhands-ai bot commented Oct 30, 2025

Code Review Summary for PR #57: "jmj/swebench eval infer"

I've completed a comprehensive code review of the changes in this PR, which adds SWE-Bench evaluation functionality to complement the existing inference capabilities.

Taste Rating: 🔴 Needs improvement - Violates fundamental principles

Critical Issues That Must Be Fixed:

  1. [benchmarks/utils/patch_utils.py, Lines 42-47] - BROKEN FILE FILTERING: The logic any(f in diff for f in files) uses substring matching, which can silently drop the wrong diffs. A file named "my_setup.py" would incorrectly match "setup.py" and be filtered out.

  2. [benchmarks/swe_bench/eval_infer.py, Line 158] - BRITTLE SYSTEM DEPENDENCY: Hardcoded /usr/bin/python3 path will break on systems where Python is installed elsewhere, violating the "never break userspace" principle.

  3. [benchmarks/swe_bench/eval_infer.py, Lines 134-196 & 199-289] - EXCESSIVE COMPLEXITY: Functions are doing too much - the main function handles 5+ different responsibilities, and run_swebench_evaluation mixes path manipulation, subprocess execution, and error handling.
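
For item 2, a common alternative to a hardcoded interpreter path is `sys.executable`, which points at the interpreter running the current process. The following is a minimal sketch, assuming the evaluation is launched as a Python module; `build_python_command` is a hypothetical helper and not code from this PR.

```python
import sys


def build_python_command(module: str, *args: str) -> list[str]:
    """Build a command that runs `module` with the current interpreter."""
    # sys.executable is the absolute path of the running interpreter, so the
    # child process uses the same environment (for example a uv-managed venv)
    # instead of whatever happens to live at /usr/bin/python3.
    return [sys.executable, "-m", module, *args]


# Assumption: the harness is started as a module. The module name here is
# only an illustration of how the helper would be called.
command = build_python_command("swebench.harness.run_evaluation")
print(command)
```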

Key Technical Problems:

  • Security Risk: Substring-based file filtering could match partial paths and cause silent data loss
  • Fragile Design: Git patches treated as raw strings instead of structured data
  • Poor Error Handling: Subprocess calls don't properly capture or handle stderr
  • Hardcoded Values: Setup file list and Python path are not configurable
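
The error-handling point can be illustrated with a small wrapper around `subprocess.run` that captures stderr and includes it in the raised error. This is a sketch under assumptions: `run_command` and its timeout default are hypothetical and do not reflect how eval_infer.py is actually written.

```python
import subprocess
import sys


def run_command(cmd: list[str], timeout: float = 60.0) -> str:
    """Run cmd and return its stdout; raise RuntimeError with stderr on failure."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=timeout
        )
    except subprocess.CalledProcessError as exc:
        # check=True raises on a non-zero exit code; exc.stderr holds the
        # captured error output because capture_output=True was set.
        raise RuntimeError(
            f"Command {cmd!r} exited with {exc.returncode}: {exc.stderr.strip()}"
        ) from exc
    return result.stdout


if __name__ == "__main__":
    print(run_command([sys.executable, "-c", "print('ok')"]))
```

A timeout is not caught here, so `subprocess.TimeoutExpired` propagates to the caller unchanged.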

Verdict:

Needs rework - The patch filtering logic is fundamentally broken and will cause data corruption in production. The subprocess execution is brittle and environment-dependent.

Key Architectural Insight:

This code treats git patches as unstructured text when they have well-defined formats that should be parsed properly. The file filtering especially needs complete rewriting to use proper path matching instead of dangerous substring operations.

The core functionality (converting OpenHands output to SWE-Bench format) solves a real problem, but the implementation has several critical flaws that must be addressed before this can be safely merged.

@juanmichelini
Collaborator Author

@OpenHands you made some good points in the code review, can you update this branch with your suggestions and I'll test it? Do the necessary changes but let's keep the PR as minimal as possible.

@openhands-ai

openhands-ai bot commented Oct 30, 2025

I'm on it! juanmichelini can track my progress at all-hands.dev

Replace dangerous substring matching with proper filename extraction and exact matching.
The previous implementation used any(f in diff for f in files), which would incorrectly
filter files like 'my_setup.py' when trying to filter 'setup.py'.

Now properly parses diff headers using regex to extract exact filenames and performs
exact matching instead of substring matching.

Co-authored-by: openhands <[email protected]>
@openhands-ai

openhands-ai bot commented Oct 30, 2025

✅ Code Review Suggestions Implemented

I have successfully updated the jmj/swebench-eval-infer branch with the critical fix identified in the code review, keeping the changes minimal as requested.

🔧 Changes Made

Fixed Critical File Filtering Bug in benchmarks/utils/patch_utils.py:

  • Problem: The original code used dangerous substring matching any(f in diff for f in files) which would incorrectly filter files like my_setup.py when trying to filter setup.py
  • Solution: Replaced with proper regex-based filename extraction and exact matching that parses diff headers correctly
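
The approach described above can be sketched roughly as follows. This is not the code committed to patch_utils.py: `remove_files_from_patch`, `DIFF_HEADER`, and the choice to match on the full repo-relative path are assumptions made for illustration.

```python
import re

# Matches a git diff header such as "diff --git a/setup.py b/setup.py".
# Assumption: paths do not themselves contain the sequence " b/".
DIFF_HEADER = re.compile(r"^diff --git a/(?P<old>.+?) b/(?P<new>.+)$", re.MULTILINE)


def remove_files_from_patch(patch: str, files: set[str]) -> str:
    """Drop every per-file diff whose path exactly matches an entry in files."""
    headers = list(DIFF_HEADER.finditer(patch))
    kept = []
    for i, header in enumerate(headers):
        # Each per-file diff runs from its header to the next header (or EOF).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(patch)
        # Exact comparison on the extracted path, not a substring test.
        if header.group("new") not in files:
            kept.append(patch[header.start():end])
    return "".join(kept)
```

With `files={"setup.py"}`, a diff for `my_setup.py` is kept because the extracted path is compared as a whole string, whereas the substring test `"setup.py" in diff` would have matched it.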

📊 Impact

  • Files changed: 1 file
  • Lines changed: +14 insertions, -2 deletions
  • Scope: Minimal and targeted fix addressing the fundamental issue

Verification

  • Created and ran a comprehensive test confirming the fix works correctly
  • Verified exact filename matching now works properly
  • Confirmed no regression in functionality

🚀 Status

The changes have been committed and pushed to the PR branch. The fix addresses the "BROKEN FILE FILTERING" issue that was flagged as violating fundamental principles in the code review, ensuring that only exact filename matches are filtered instead of dangerous substring matching.

The PR is now ready for your testing!

Comment on lines 112 to 131
def install_swebench() -> None:
    """Install SWE-Bench if not already available."""
    import importlib.util

    if importlib.util.find_spec("swebench") is not None:
        logger.info("SWE-Bench is already installed")
        return

    logger.info("Installing SWE-Bench...")
    try:
        # Install SWE-Bench using uv
        subprocess.run(
            ["uv", "add", "swebench"], check=True, capture_output=True, text=True
        )
        logger.info("SWE-Bench installed successfully")
    except subprocess.CalledProcessError as e:
        logger.error(f"Failed to install SWE-Bench: {e}")
        logger.error(f"stdout: {e.stdout}")
        logger.error(f"stderr: {e.stderr}")
        raise
Contributor

You should just uv add swebench and maintain its version in pyproject.toml and uv.lock -- then you can get rid of this function and all associated configs

Contributor

@OpenHands please do it for me

I'm on it! xingyaoww can track my progress at all-hands.dev

Summary

I've successfully addressed the PR comment from @xingyaoww on PR #57. The requested changes have been completed and pushed to the jmj/swebench-eval-infer branch.

Changes Made:

  1. Added swebench as a proper dependency in pyproject.toml using uv add swebench
  2. Removed the install_swebench() function (22 lines) - This function previously attempted to dynamically install swebench at runtime
  3. Removed the --install-swebench CLI argument and its help text
  4. Removed the code that called install_swebench() in the main function
  5. Updated README.md to remove the --install-swebench example and simplified the workflow description

Result:

The SWE-Bench dependency is now properly managed through pyproject.toml and uv.lock like other dependencies, eliminating the need for runtime installation logic. This is cleaner, more maintainable, and follows standard Python dependency management practices.
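
With the dependency declared in pyproject.toml, a script can still report a clear error if the environment was never synced, without installing anything at runtime. The sketch below is one way to do that; `require_package` is a hypothetical helper and is not part of this PR.

```python
from importlib import metadata


def require_package(name: str) -> str:
    """Return the installed version of name, or raise with a setup hint."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError as exc:
        # Nothing is installed here. The message only points the user at the
        # normal workflow for installing locked dependencies.
        raise RuntimeError(
            f"{name} is not installed; run `uv sync` to install locked dependencies"
        ) from exc
```

Calling `require_package("swebench")` at startup would then either return the locked version string or stop with an actionable message.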

Commit: ec01450 - "refactor: manage swebench dependency via uv add"
Branch: jmj/swebench-eval-infer

The changes are now live on the PR and ready for review.

- Added swebench to pyproject.toml dependencies
- Removed install_swebench() function and --install-swebench flag
- Updated README to reflect simplified installation process
- SWE-Bench is now installed as a regular dependency via uv
@openhands-ai

openhands-ai bot commented Nov 4, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #57 at branch `jmj/swebench-eval-infer`

Feel free to include any additional details that might help me get this PR into a better state.


- Applied ruff formatting to benchmarks/utils/patch_utils.py
- Added .openhands/setup.sh from software-agent-sdk for environment setup
- Updated .gitignore to allow .openhands/setup.sh

Co-authored-by: openhands <[email protected]>
- Added succinct repository documentation for OpenHands benchmarks
- Covers dev setup, code standards, testing, and benchmark-specific guidelines
- Updated .gitignore to allow .openhands/microagents/ directory

Co-authored-by: openhands <[email protected]>
- Locked swebench v4.1.0 and its dependencies
- Added beautifulsoup4, gitpython, ghapi, and other transitive deps

Co-authored-by: openhands <[email protected]>
@xingyaoww xingyaoww merged commit d789b46 into main Nov 4, 2025
1 check passed
Development

Successfully merging this pull request may close these issues.

feat: swe_bench/eval_infer.py

4 participants