Medical LLM Uncertainty Quantification

A robust framework for quantifying and analyzing uncertainty in medical answers generated by large language models (LLMs). This system detects potential hallucinations and provides confidence scores to help assess the reliability of AI-generated medical responses.

Overview

Medical hallucinations in LLMs occur when the model generates medically inaccurate or completely fabricated information with high confidence. This framework mitigates this risk by:

  • Monte Carlo Sampling: Generating multiple responses to the same question with temperature-based stochasticity
  • Uncertainty Quantification: Computing multiple uncertainty metrics from the generated outputs
  • Hallucination Detection: Fusing metrics into a composite hallucination risk score
  • Confidence Calibration: Producing interpretable confidence scores and risk labels for medical practitioners
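The temperature-based stochasticity behind the Monte Carlo step works by dividing logits by a temperature T before the softmax: T > 1 flattens the next-token distribution (more varied samples), T < 1 sharpens it. A minimal standalone sketch of that mechanism with toy logits (these helper functions are illustrative, not from this repo):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy(probs):
    """Shannon entropy in nats; higher = more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [4.0, 2.0, 1.0, 0.5]              # toy next-token logits
p_sharp = softmax(logits, temperature=0.7)
p_flat = softmax(logits, temperature=1.5)
# Sampling from the flatter distribution yields more varied outputs,
# which is what makes repeated Monte Carlo runs disagree when the
# model is uncertain.
assert shannon_entropy(p_flat) > shannon_entropy(p_sharp)
```

This is why a single question is sampled many times: disagreement across the samples is the raw signal that the uncertainty metrics below quantify.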

Architecture

Core Components

medical_llm_uncertainty/
├── main.py                 # Entry point to run the complete pipeline
├── config.py              # Configuration parameters for the system
├── model_loader.py        # LLM model initialization and loading
├── inference/
│   └── mc_sampler.py      # Monte Carlo sampling engine
├── uncertainty/
│   ├── entropy.py         # Entropy-based uncertainty metrics
│   ├── semantic_variance.py # Semantic variation analysis
│   └── fusion.py          # Metric fusion for hallucination scoring
└── calibration/
    └── confidence.py      # Confidence score calculation and risk labels

Uncertainty Metrics

  1. Mean Entropy (entropy.py)

    • Average Shannon entropy across all Monte Carlo samples
    • Higher values indicate greater model uncertainty
    • Computed from token probability distributions
  2. Entropy Spike (entropy.py)

    • Detects sharp increases in entropy (95th percentile threshold)
    • Identifies points where the model suddenly becomes uncertain
    • Suggests potential knowledge gaps
  3. Semantic Variance (semantic_variance.py)

    • Measures divergence in semantic meaning between outputs
    • Uses sentence embeddings (all-MiniLM-L6-v2) to compute cosine distances
    • Higher variance = more inconsistent outputs = potential hallucination
  4. Hallucination Score (fusion.py)

    • Weighted fusion of the three metrics
    • Combines entropy, spike, and semantic variance signals
    • Configurable weights via hyperparameters
  5. Confidence Score (confidence.py)

    • Calibrated from hallucination score using exponential decay
    • Range: 0.0 (no confidence) to 1.0 (high confidence)
    • Risk labels: Low (≥0.85), Medium (0.60-0.84), High (<0.60)
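The metric definitions above can be sketched in plain Python. These are toy implementations that mirror the descriptions, not the repo's actual code; the function names match the README but the exact formulas inside the repo may differ:

```python
import math

def mean_entropy(entropies):
    """Average token-level Shannon entropy over all MC samples."""
    return sum(entropies) / len(entropies)

def entropy_spike(entropies, percentile=0.95):
    """Fraction of tokens whose entropy exceeds the 95th-percentile
    threshold -- a rough proxy for sudden uncertainty jumps."""
    s = sorted(entropies)
    threshold = s[int(percentile * (len(s) - 1))]
    return sum(1 for e in entropies if e > threshold) / len(entropies)

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def semantic_variance(embeddings):
    """Mean pairwise cosine distance between sample embeddings;
    identical outputs give ~0, divergent outputs give larger values."""
    pairs = [(i, j) for i in range(len(embeddings))
             for j in range(i + 1, len(embeddings))]
    return sum(cosine_distance(embeddings[i], embeddings[j])
               for i, j in pairs) / len(pairs)
```

In the real pipeline the embeddings would come from the all-MiniLM-L6-v2 sentence transformer; here they are just plain vectors.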

Installation

Requirements

  • Python 3.10+
  • CUDA 12.0+ (for GPU acceleration)
  • 8GB+ VRAM (recommended for Llama-2-7b)

Setup

  1. Clone the repository:
git clone <repository-url>
cd medical_llm_uncertainty
  2. Create a virtual environment (recommended):
python -m venv venv
venv\Scripts\activate  # On Windows
source venv/bin/activate  # On Linux/Mac
  3. Install dependencies:
pip install -r requirements.txt
  4. Accept the Llama-2 model terms: request access on the meta-llama/Llama-2-7b-chat-hf model page on Hugging Face, then authenticate with huggingface-cli login.

Configuration

Adjust system parameters in config.py:

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # Base LLM

# Monte Carlo sampling
MC_RUNS = 15           # Number of samples per question
MAX_NEW_TOKENS = 128   # Max tokens per sample

# Generation parameters
TEMPERATURE = 0.7      # Stochasticity (higher = more varied outputs)
TOP_P = 0.9           # Nucleus sampling threshold

# Metric weights (sum should be ~1.0)
ALPHA = 0.4           # Mean entropy weight
BETA = 0.3            # Entropy spike weight
GAMMA = 0.3           # Semantic variance weight

LAMBDA = 1.2          # Confidence decay factor
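Given these constants, the fusion and calibration stages reduce to a weighted sum followed by exponential decay: H = ALPHA·E_mean + BETA·E_spike + GAMMA·V_sem, then C = exp(−LAMBDA·H). A sketch under that assumption (the repo's exact formulas in fusion.py and confidence.py may differ):

```python
import math

ALPHA, BETA, GAMMA, LAMBDA = 0.4, 0.3, 0.3, 1.2

def hallucination_score(mean_ent, spike, sem_var):
    """Weighted fusion of the three uncertainty signals."""
    return ALPHA * mean_ent + BETA * spike + GAMMA * sem_var

def confidence_score(h):
    """Exponential decay maps hallucination score into (0, 1]."""
    return math.exp(-LAMBDA * h)

def risk_label(c):
    """Thresholds from the README: Low >= 0.85, Medium 0.60-0.84."""
    if c >= 0.85:
        return "Low"
    if c >= 0.60:
        return "Medium"
    return "High"

h = hallucination_score(0.2345, 0.5678, 0.1234)
c = confidence_score(h)
label = risk_label(c)
```

Because the weights sum to 1.0 and each metric is non-negative, a perfectly consistent answer (all metrics 0) yields C = 1.0, and confidence decays smoothly as any signal grows.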

Usage

Quick Start

Run the complete pipeline on a medical question:

python main.py

Default question: "According to standard clinical guidelines, what are the recommended first-line treatments for hypertension in adults?"

Expected Output

==============================
ANSWER:
First-line treatments for hypertension in adults typically include:
1. Thiazide diuretics
2. ACE inhibitors
3. Angiotensin II receptor blockers
4. Calcium channel blockers
[...]

--- Reliability ---
Mean Entropy      : 0.2345
Entropy Spike     : 0.5678
Semantic Variance : 0.1234

Confidence Score (C): 0.892
Hallucination Risk : Low
==============================

Programmatic Usage

Use the pipeline programmatically in your code:

from main import run_pipeline

# Single question
run_pipeline("What is the standard dosage for metformin?")

# Or use components individually
from model_loader import load_model
from inference.mc_sampler import monte_carlo_generate
from uncertainty.entropy import mean_entropy, entropy_spike
from uncertainty.semantic_variance import semantic_variance
from uncertainty.fusion import hallucination_score
from calibration.confidence import confidence_score, risk_label

tokenizer, model = load_model()
outputs, entropies = monte_carlo_generate(model, tokenizer, question)
mean_ent = mean_entropy(entropies)
conf_score = confidence_score(hallucination_score(mean_ent, ...))

How It Works

Pipeline Flow

Medical Question
    ↓
[Model Loader] → Load Llama-2-7b with 4-bit quantization
    ↓
[MC Sampler] → Generate 15 diverse outputs with temperature sampling
    ↓
[Uncertainty Metrics]
    ├→ Mean Entropy: Token-level uncertainty
    ├→ Entropy Spike: Sudden confidence drops
    └→ Semantic Variance: Output diversity
    ↓
[Fusion] → Weighted combination into hallucination score
    ↓
[Calibration] → Convert to confidence score (0-1)
    ↓
Risk Label: Low / Medium / High

Key Features

  • 4-bit Quantization: Reduces memory footprint (↓75%) while maintaining quality
  • Automatic Device Mapping: Handles CPU/GPU/multi-GPU setups
  • Efficient Embeddings: Uses lightweight sentence transformers
  • Production-Ready: Configurable, modular, and easily extendable
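The ↓75% figure follows from the bit widths alone: 4-bit weights take a quarter of the space of 16-bit weights. A back-of-the-envelope check for a 7B-parameter model (weights only; activations, KV cache, and quantization metadata add overhead on top):

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate weight storage in GiB."""
    return n_params * bits_per_weight / 8 / (1024 ** 3)

n = 7_000_000_000
fp16 = weight_memory_gib(n, 16)   # roughly 13 GiB
int4 = weight_memory_gib(n, 4)    # roughly 3.3 GiB
reduction = 1 - int4 / fp16       # exactly 0.75
```

This is consistent with the 6-7GB figure in the performance table below once runtime overhead is included.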

Performance Considerations

Component              Latency   Memory
Model Load             ~30s      6-7GB (4-bit)
MC Sampling (15 runs)  ~45s      Shared with model
Entropy Calculation    <100ms    Minimal
Semantic Variance      ~2s       500MB (embedder)
Total Pipeline         ~90s      7-8GB total

Benchmarks on RTX 3090 Ti

Troubleshooting

CUDA Out of Memory

  • Reduce MC_RUNS in config.py
  • Use DEVICE_MAP = "cpu" for CPU-only mode
  • Increase GPU memory allocation if using MPS/ROCm

Slow Inference

  • Verify the GPU is being used: run nvidia-smi during execution
  • Ensure CUDA toolkit matches PyTorch version
  • Update to latest bitsandbytes for optimized kernels

Rate Limiting (HuggingFace)

  • Ensure you're logged in: huggingface-cli login
  • Token limits apply for free tier accounts

Extending the System

Add New Uncertainty Metrics

Create a new module in uncertainty/ and integrate into the pipeline:

# uncertainty/custom_metric.py
def custom_uncertainty_score(outputs, entropies):
    """Your custom metric implementation.
    Illustrative example: variance of output lengths across MC samples
    (inconsistent lengths can hint at inconsistent answers)."""
    lengths = [len(text.split()) for text in outputs]
    mean_len = sum(lengths) / len(lengths)
    return sum((n - mean_len) ** 2 for n in lengths) / len(lengths)

# Update main.py
from uncertainty.custom_metric import custom_uncertainty_score
custom_score = custom_uncertainty_score(outputs, entropies)

Swap Models

Modify config.py:

MODEL_NAME = "meta-llama/Llama-2-13b-chat-hf"  # Or any HF model
# Adjust MAX_NEW_TOKENS and TEMPERATURE as needed

Limitations & Future Work

  • Fine-tuning: Currently uses base Llama-2; domain-specific fine-tuning could improve medical accuracy
  • Evaluation: Requires ground-truth medical datasets for metric validation
  • Real-time: Currently single-question focused; batch processing would enable scalability
  • Multilingual: Currently English-only; multilingual models would broaden applicability

Citation

If you use this framework in research, please cite:

@software{medical_llm_uncertainty_2026,
  title={Medical LLM Uncertainty Quantification Framework},
  author={Your Name},
  year={2026},
  url={https://github.com/your-repo}
}

License

[Add your license here - e.g., MIT, Apache 2.0]

Disclaimer

⚠️ Medical Use: This system is a research tool for analyzing LLM outputs. It should NOT be used as a standalone clinical decision-making system. Always consult qualified healthcare professionals for medical advice.

Contact & Support

For issues, questions, or contributions, please open an issue or contact the maintainers.


Last Updated: February 2026
