-
Notifications
You must be signed in to change notification settings - Fork 6
jmj/swebench eval infer #57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Added missing swebench-eval script entry point that maps to benchmarks.swe_bench.eval_infer:main - This resolves the 'No such file or directory' error when running 'uv run swebench-eval' - Now swebench-eval works as an entry point to eval_infer.py, similar to how swebench-infer works for run_infer.py Co-authored-by: openhands <[email protected]>
|
@OpenHands /codereview-roasted |
|
I'm on it! juanmichelini can track my progress at all-hands.dev |
Code Review Summary for PR #57: "jmj/swebench eval infer"I've completed a comprehensive code review of the changes in this PR, which adds SWE-Bench evaluation functionality to complement the existing inference capabilities. Taste Rating: 🔴 Needs improvement - Violates fundamental principlesCritical Issues That Must Be Fixed:
Key Technical Problems:
Verdict:❌ Needs rework - The patch filtering logic is fundamentally broken and will cause data corruption in production. The subprocess execution is brittle and environment-dependent. Key Architectural Insight:This code treats git patches as unstructured text when they have well-defined formats that should be parsed properly. The file filtering especially needs complete rewriting to use proper path matching instead of dangerous substring operations. The core functionality (converting OpenHands output to SWE-Bench format) solves a real problem, but the implementation has several critical flaws that must be addressed before this can be safely merged. |
|
@OpenHands you made some good points in the code review, can you update this branch with your suggestions and I'll test it? Do the necessary changes but let's keep the PR as minimal as possible. |
|
I'm on it! juanmichelini can track my progress at all-hands.dev |
Replace dangerous substring matching with proper filename extraction and exact matching. The previous implementation used which would incorrectly filter files like 'my_setup.py' when trying to filter 'setup.py'. Now properly parses diff headers using regex to extract exact filenames and performs exact matching instead of substring matching. Co-authored-by: openhands <[email protected]>
✅ Code Review Suggestions ImplementedI have successfully updated the 🔧 Changes MadeFixed Critical File Filtering Bug in
📊 Impact
✅ Verification
🚀 StatusThe changes have been committed and pushed to the PR branch. The fix addresses the "BROKEN FILE FILTERING" issue that was flagged as violating fundamental principles in the code review, ensuring that only exact filename matches are filtered instead of dangerous substring matching. The PR is now ready for your testing! |
benchmarks/swe_bench/eval_infer.py
Outdated
| def install_swebench() -> None: | ||
| """Install SWE-Bench if not already available.""" | ||
| import importlib.util | ||
|
|
||
| if importlib.util.find_spec("swebench") is not None: | ||
| logger.info("SWE-Bench is already installed") | ||
| return | ||
|
|
||
| logger.info("Installing SWE-Bench...") | ||
| try: | ||
| # Install SWE-Bench using uv | ||
| subprocess.run( | ||
| ["uv", "add", "swebench"], check=True, capture_output=True, text=True | ||
| ) | ||
| logger.info("SWE-Bench installed successfully") | ||
| except subprocess.CalledProcessError as e: | ||
| logger.error(f"Failed to install SWE-Bench: {e}") | ||
| logger.error(f"stdout: {e.stdout}") | ||
| logger.error(f"stderr: {e.stderr}") | ||
| raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should just uv add swebench and maintain its version in pyproject.toml and uv.lock -- so you can get rid of this function and all associate configs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@OpenHands please do it for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on it! xingyaoww can track my progress at all-hands.dev
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary
I've successfully addressed the PR comment from @xingyaoww on PR #57. The requested changes have been completed and pushed to the jmj/swebench-eval-infer branch.
Changes Made:
✅ Added swebench as a proper dependency in pyproject.toml using uv add swebench
✅ Removed the install_swebench() function (22 lines) - This function previously attempted to dynamically install swebench at runtime
✅ Removed the --install-swebench CLI argument and its help text
✅ Removed the code that called install_swebench() in the main function
✅ Updated README.md to remove the --install-swebench example and simplified the workflow description
Result:
The SWE-Bench dependency is now properly managed through pyproject.toml and uv.lock like other dependencies, eliminating the need for runtime installation logic. This is cleaner, more maintainable, and follows standard Python dependency management practices.
Commit: ec01450 - "refactor: manage swebench dependency via uv add"
Branch: jmj/swebench-eval-infer
The changes are now live on the PR and ready for review.
- Added swebench to pyproject.toml dependencies - Removed install_swebench() function and --install-swebench flag - Updated README to reflect simplified installation process - SWE-Bench is now installed as a regular dependency via uv
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
- Applied ruff formatting to benchmarks/utils/patch_utils.py - Added .openhands/setup.sh from software-agent-sdk for environment setup - Updated .gitignore to allow .openhands/setup.sh Co-authored-by: openhands <[email protected]>
- Added succinct repository documentation for OpenHands benchmarks - Covers dev setup, code standards, testing, and benchmark-specific guidelines - Updated .gitignore to allow .openhands/microagents/ directory Co-authored-by: openhands <[email protected]>
- Locked swebench v4.1.0 and its dependencies - Added beautifulsoup4, gitpython, ghapi, and other transitive deps Co-authored-by: openhands <[email protected]>
Fixes #24