Release/2.7.0 #1219
Merged
Conversation
… evaluation methods
Co-authored-by: Copilot <[email protected]>
…th sample generation and tqdm progress tracking
Co-authored-by: Copilot <[email protected]>
Feature/implement the amega
…sks; implement AttackerLLM for adversarial learning in exam questions
…sage handling and reasoning prompt generation
…s; add MedFuzzSample for improved data handling
…d improved response handling
…andling in MedFuzz transformation
…for original and perturbed data
…-tests-in-robustness Feature/implement the fuzz tests in robustness
…ialogue summarization tasks
…rySample with result tracking
… enhance DialogueToSummarySample with evaluation logic
- Removed the `_build_prompt` method and now call `llm.predict` directly in the `evaluate` function.
- Changed the `evaluate` method to accept single input and prediction dictionaries instead of lists.
- Updated error handling in `evaluate` to return a structured error response.
- Enhanced `evaluate_batch` to process a list of input and prediction pairs.
- Added support for specifying the number of samples and column names when loading datasets in `ClinicalNoteSummary`.
- Implemented a new method, `aci_dialog`, for loading ACI dialog datasets.
- Fixed typos in comments and class docstrings for clarity.
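The described interface can be sketched as follows. The function names mirror the commit message, but the bodies here are hypothetical stand-ins (a toy token-overlap metric instead of the real ClinicalNoteSummary scoring logic), meant only to illustrate the single-pair signature and the structured error response:

```python
def evaluate(inputs: dict, predictions: dict) -> dict:
    """Score a single input/prediction pair, returning a structured result.

    Illustrative sketch only; the real evaluator in LangTest's
    ClinicalNoteSummary uses different scoring internals.
    """
    try:
        expected = inputs["summary"]
        predicted = predictions["summary"]
        # Toy metric: fraction of reference tokens found in the prediction.
        ref = set(expected.lower().split())
        hyp = set(predicted.lower().split())
        score = len(ref & hyp) / max(len(ref), 1)
        return {"score": score, "error": None}
    except Exception as exc:
        # Structured error response instead of raising mid-batch.
        return {"score": 0.0, "error": str(exc)}


def evaluate_batch(pairs: list) -> list:
    """Apply `evaluate` over a list of (inputs, predictions) pairs."""
    return [evaluate(inp, pred) for inp, pred in pairs]
```

Returning an error dict rather than raising lets `evaluate_batch` keep processing the remaining pairs when one sample is malformed.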
…valuation logic in ClinicalNoteSummary
…-Corp in README.md
…entalchat16k-dataset-support-for-clinical-evaluation Feature/implement the mentalchat16k dataset support for clinical evaluation
dcecchini approved these changes on Sep 11, 2025
…test and self-hosted
…ing torch cache command
…s-for-270 Updates/websites updates for 270
…s-for-270 updated: api documentation with pacific ai links
amit-shrestha approved these changes on Sep 19, 2025
📢 Highlights
We’re thrilled to announce the latest LangTest release, bringing advanced benchmarks, new robustness testing, and improved developer experience to your model evaluation workflows.
🩺 Autonomous Medical Evaluation for Guideline Adherence (AMEGA):
We’ve integrated AMEGA, a comprehensive benchmark for assessing LLM adherence to clinical guidelines. Covering 20 diagnostic scenarios across 13 specialties, the benchmark includes 135 questions and 1,337 weighted scoring elements, providing one of the most rigorous frameworks for evaluating medical knowledge in real-world clinical settings.
🧪 MedFuzz Robustness Testing:
To better reflect real-world clinical complexities, we’re introducing MedFuzz, a healthcare-specific robustness approach that probes LLMs beyond standard benchmarks.
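LangTest robustness tests are typically enabled through a YAML test configuration. A sketch of what enabling this category might look like is below; the `medfuzz` test key and pass-rate values are assumptions for illustration, not confirmed configuration names:

```yaml
# Hypothetical config sketch; check the LangTest docs for the actual test key.
tests:
  defaults:
    min_pass_rate: 0.65
  robustness:
    medfuzz:
      min_pass_rate: 0.70
```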
🎲 Randomized Options in QA Tasks:
To mitigate positional bias in multiple-choice evaluations, LangTest now supports a randomized option ordering test type in the robustness category.
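The idea behind the test can be sketched in plain Python: shuffle the answer options, remap the gold label, and check whether the model's choice survives the reordering. The function below illustrates the perturbation itself and is not LangTest's implementation; all names are hypothetical:

```python
import random


def randomize_options(options: dict, answer_key: str, seed=None):
    """Shuffle MCQ option texts across the letter slots and return the
    shuffled options plus the correct answer's new letter.

    `options` maps letters to texts, e.g. {"A": "...", "B": "..."};
    `answer_key` is the letter of the correct option. Assumes option
    texts are distinct (enough for this sketch).
    """
    rng = random.Random(seed)
    letters = sorted(options)                 # e.g. ["A", "B", "C", "D"]
    texts = [options[k] for k in letters]
    correct_text = options[answer_key]
    rng.shuffle(texts)                        # permute texts over the slots
    shuffled = dict(zip(letters, texts))
    # Find which letter the correct text landed on.
    new_key = next(k for k, v in shuffled.items() if v == correct_text)
    return shuffled, new_key
```

A model free of positional bias should pick the option carrying `correct_text` regardless of which letter it lands on.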
📝 ACI-Bench: Ambient Clinical Intelligence Benchmark:
LangTest now supports evaluation with ACI-Bench, a novel benchmark for automatic visit note generation in clinical contexts.
💬 MTS-Dialog: Clinical Summary Evaluation:
We’ve added support for the MTS-Dialog dataset to evaluate models on dialogue-to-summary generation, including sectioned summaries (headers + contents) for more structured evaluation.
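To picture what a sectioned summary looks like, the sketch below splits a summary into header/contents pairs. The uppercase-header-ending-in-colon convention is an assumption made for illustration, not necessarily the MTS-Dialog format:

```python
def split_sections(summary: str) -> dict:
    """Split a sectioned clinical summary into {header: contents}.

    Illustrative convention: a header is an uppercase line ending in ':';
    subsequent non-empty lines belong to that section.
    """
    sections, current = {}, None
    for line in summary.splitlines():
        stripped = line.strip()
        if stripped.endswith(":") and stripped[:-1].isupper():
            current = stripped[:-1]
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)
    return {k: " ".join(v) for k, v in sections.items()}
```

Evaluating each section separately lets a metric credit a model for getting, say, the assessment right even when the history section is weak.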
🧠 MentalChat16K Clinical Evaluation Support:
LangTest now supports the MentalChat16K dataset, enabling evaluation of LLMs in mental health–focused conversational contexts.
🔒 Security Enhancements:
Critical vulnerabilities and security issues have been addressed, reinforcing LangTest’s overall stability and safety.