Conversation

@chakravarthik27
Collaborator

📢 Highlights

We’re thrilled to announce the latest LangTest release, bringing advanced benchmarks, new robustness testing, and improved developer experience to your model evaluation workflows.

  • 🩺 Autonomous Medical Evaluation for Guideline Adherence (AMEGA):
    We’ve integrated AMEGA, a comprehensive benchmark for assessing LLM adherence to clinical guidelines across 20 diagnostic scenarios spanning 13 specialties. The benchmark comprises 135 questions and 1,337 weighted scoring elements, providing one of the most rigorous frameworks for evaluating medical knowledge in real-world clinical settings.

  • 🧪 MedFuzz Robustness Testing:
    To better reflect real-world clinical complexity, we’re introducing MedFuzz, a healthcare-specific robustness approach that probes LLMs beyond standard benchmarks.

  • 🎲 Randomized Options in QA Tasks:
    To mitigate positional bias in multiple-choice evaluations, LangTest now supports a randomized option ordering test type in the robustness category.

  • 📝 ACI-Bench: Ambient Clinical Intelligence Benchmark:
    LangTest now supports evaluation with ACI-Bench, a benchmark for automatic visit note generation in clinical contexts.

  • 💬 MTS-Dialog: Clinical Summary Evaluation:
    We’ve added support for the MTS-Dialog dataset to evaluate models on dialogue-to-summary generation, including sectioned summaries (headers + contents) for more structured evaluation.

  • 🧠 MentalChat16K Clinical Evaluation Support:
    LangTest now supports the MentalChat16K dataset, enabling evaluation of LLMs in mental health–focused conversational contexts.

  • 🔒 Security Enhancements:
    Critical vulnerabilities and security issues have been addressed, reinforcing LangTest’s overall stability and safety.
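The idea behind the randomized option ordering test can be illustrated with a small self-contained sketch (the function name here is hypothetical, not LangTest’s API): shuffle a question’s answer options while remapping the gold label, so a model’s accuracy on the original and shuffled orderings can be compared to expose positional bias.

```python
import random

def randomize_options(options, answer_idx, seed=None):
    """Shuffle multiple-choice options and remap the gold answer index.

    Returns the shuffled options plus the new index of the original gold
    answer, so positional bias can be measured by comparing model accuracy
    on the original vs. the shuffled ordering.
    """
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_answer_idx = order.index(answer_idx)
    return shuffled, new_answer_idx

opts = ["aspirin", "heparin", "warfarin", "clopidogrel"]
shuffled, new_idx = randomize_options(opts, answer_idx=1, seed=7)
assert shuffled[new_idx] == "heparin"  # gold answer follows the shuffle
```

A model that only passes on the original ordering is likely keying on option position rather than content.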
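To make the notion of “weighted scoring elements” in AMEGA concrete, here is a generic weighted-adherence computation; it only illustrates the idea of weighting, and is not AMEGA’s exact rubric (which is defined by the benchmark itself):

```python
def weighted_adherence(elements):
    """Compute a weighted adherence score in [0, 1].

    `elements` is a list of (weight, met) pairs, where `met` says whether
    the model's answer satisfied that scoring element. Generic illustration
    of weighted scoring, not AMEGA's exact rubric.
    """
    total = sum(w for w, _ in elements)
    earned = sum(w for w, met in elements if met)
    return earned / total if total else 0.0

# Three scoring elements for one question: the model satisfied the two
# lighter-weighted elements but missed the heavily weighted one.
score = weighted_adherence([(2.0, True), (1.0, True), (3.0, False)])
# score == 3.0 / 6.0 == 0.5
```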

chakravarthik27 and others added 30 commits March 16, 2025 13:38
Co-authored-by: Copilot <[email protected]>
…th sample generation and tqdm progress tracking
Co-authored-by: Copilot <[email protected]>
…sks; implement AttackerLLM for adversarial learning in exam questions
…sage handling and reasoning prompt generation
…s; add MedFuzzSample for improved data handling
…-tests-in-robustness

Feature/implement the fuzz tests in robustness
… enhance DialogueToSummarySample with evaluation logic
- Simplified the `_build_prompt` method by removing it and directly using the `llm.predict` method in the `evaluate` function.
- Changed the `evaluate` method to accept single input and prediction dictionaries instead of lists.
- Updated error handling in the `evaluate` method to return a structured error response.
- Enhanced `evaluate_batch` to process a list of input and prediction pairs.
- Added support for specifying the number of samples and column names when loading datasets in `ClinicalNoteSummary`.
- Implemented a new method `aci_dialog` for loading ACI dialog datasets.
- Fixed typos in comments and class docstrings for clarity.
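The evaluate/evaluate_batch split described above can be sketched as follows (class, keys, and the toy metric are illustrative, not LangTest’s internals): a per-sample `evaluate` that accepts single input and prediction dicts and returns a structured result, catches errors into the same dict shape, and an `evaluate_batch` that maps it over pairs.

```python
class SummaryEvaluator:
    """Illustrative evaluator: per-sample dicts in, structured dicts out."""

    def evaluate(self, inputs: dict, prediction: dict) -> dict:
        """Score one (input, prediction) pair; never raise."""
        try:
            reference = inputs["summary"]
            candidate = prediction["summary"]
            # Toy metric: word overlap relative to the reference length.
            ref_words = set(reference.lower().split())
            cand_words = set(candidate.lower().split())
            score = len(ref_words & cand_words) / max(len(ref_words), 1)
            return {"pass": score >= 0.5, "score": round(score, 3)}
        except KeyError as exc:
            # Structured error response instead of a raised exception.
            return {"pass": False, "score": 0.0, "error": f"missing key: {exc}"}

    def evaluate_batch(self, pairs):
        """Evaluate a list of (inputs, prediction) pairs."""
        return [self.evaluate(i, p) for i, p in pairs]

ev = SummaryEvaluator()
results = ev.evaluate_batch([
    ({"summary": "patient reports chest pain"},
     {"summary": "chest pain reported by patient"}),
    ({"summary": "follow up in two weeks"}, {}),  # missing key -> error dict
])
```

Returning an error dict from the same method keeps batch processing total: one malformed sample cannot abort the whole run.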
…entalchat16k-dataset-support-for-clinical-evaluation

Feature/implement the mentalchat16k dataset support for clinical evaluation
@chakravarthik27 chakravarthik27 self-assigned this Sep 11, 2025
@chakravarthik27 chakravarthik27 marked this pull request as draft September 11, 2025 07:34
…s-for-270

Updates/websites updates for 270
…s-for-270

updated: api documentation with pacific ai links
@chakravarthik27 chakravarthik27 marked this pull request as ready for review September 19, 2025 12:43
@chakravarthik27 chakravarthik27 merged commit cdcadc7 into main Sep 19, 2025
3 checks passed