A robust framework for quantifying and analyzing uncertainty in medical answers generated by large language models (LLMs). This system detects potential hallucinations and provides confidence scores to help assess the reliability of AI-generated medical responses.
Medical hallucinations in LLMs occur when the model generates medically inaccurate or completely fabricated information with high confidence. This framework mitigates this risk by:
- Monte Carlo Sampling: Generating multiple responses to the same question with temperature-based stochasticity
- Uncertainty Quantification: Computing multiple uncertainty metrics from the generated outputs
- Hallucination Detection: Fusing metrics into a composite hallucination risk score
- Confidence Calibration: Producing interpretable confidence scores and risk labels for medical practitioners
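The temperature-based stochasticity behind the Monte Carlo step is worth seeing concretely. Below is a minimal, self-contained sketch (illustrative names, not the framework's actual API): logits are scaled by 1/T before the softmax, a token is sampled from the resulting distribution, and the per-step Shannon entropy is recorded alongside it.

```python
import numpy as np

def sample_with_entropy(logits, temperature=0.7, rng=None):
    """Sample one token from temperature-scaled logits and return
    (token_id, Shannon entropy of the scaled distribution, in nats)."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # stable softmax
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    token = int(rng.choice(len(probs), p=probs))
    return token, entropy

# A peaked distribution yields low entropy; a flat one is maximal.
peaked = np.array([10.0, 0.0, 0.0, 0.0])
flat = np.zeros(4)
_, h_peaked = sample_with_entropy(peaked)
_, h_flat = sample_with_entropy(flat)   # ≈ log(4) ≈ 1.386 nats
```

Higher temperatures flatten the distribution, which both diversifies the sampled outputs and raises the recorded entropy — exactly the signal the downstream metrics consume.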
```
medical_llm_uncertainty/
├── main.py                    # Entry point to run the complete pipeline
├── config.py                  # Configuration parameters for the system
├── model_loader.py            # LLM model initialization and loading
├── inference/
│   └── mc_sampler.py          # Monte Carlo sampling engine
├── uncertainty/
│   ├── entropy.py             # Entropy-based uncertainty metrics
│   ├── semantic_variance.py   # Semantic variation analysis
│   └── fusion.py              # Metric fusion for hallucination scoring
└── calibration/
    └── confidence.py          # Confidence score calculation and risk labels
```
- Mean Entropy (`entropy.py`)
  - Average Shannon entropy across all Monte Carlo samples
  - Higher values indicate greater model uncertainty
  - Computed from token probability distributions
- Entropy Spike (`entropy.py`)
  - Detects sharp increases in entropy (95th-percentile threshold)
  - Identifies points where the model suddenly becomes uncertain
  - Suggests potential knowledge gaps
- Semantic Variance (`semantic_variance.py`)
  - Measures divergence in semantic meaning between outputs
  - Uses sentence embeddings (all-MiniLM-L6-v2) to compute cosine distances
  - Higher variance = more inconsistent outputs = potential hallucination
- Hallucination Score (`fusion.py`)
  - Weighted fusion of the three metrics
  - Combines entropy, spike, and semantic variance signals
  - Configurable weights via hyperparameters
- Confidence Score (`confidence.py`)
  - Calibrated from the hallucination score using exponential decay
  - Range: 0.0 (no confidence) to 1.0 (high confidence)
  - Risk labels: Low (≥0.85), Medium (0.60–0.84), High (<0.60)
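The two entropy metrics can be sketched in a few lines of NumPy. The function names mirror `entropy.py`, but the signatures are assumptions, and the spike definition used here (mean entropy of the tokens above the 95th-percentile threshold) is one plausible reading of the description, not necessarily the repo's exact formula.

```python
import numpy as np

def mean_entropy(entropies):
    """Average Shannon entropy over all tokens of all MC samples.
    `entropies` is a list of per-sample arrays of token-level entropies."""
    return float(np.mean(np.concatenate(entropies)))

def entropy_spike(entropies, percentile=95):
    """Magnitude of sudden uncertainty: mean entropy of the tokens at or
    above the given percentile threshold."""
    flat = np.concatenate(entropies)
    threshold = np.percentile(flat, percentile)
    return float(flat[flat >= threshold].mean())

runs = [np.array([0.1, 0.2, 0.1]), np.array([0.1, 2.5, 0.2])]
print(mean_entropy(runs))    # ≈ 0.533
print(entropy_spike(runs))   # 2.5 — the single spiking token dominates
```

A single high-entropy token barely moves the mean but is captured cleanly by the spike metric, which is why the two signals are fused rather than used alone.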
- Python 3.10+
- CUDA 12.0+ (for GPU acceleration)
- 8GB+ VRAM (recommended for Llama-2-7b)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd medical_llm_uncertainty
  ```

- Create a virtual environment (recommended):

  ```bash
  python -m venv venv
  venv\Scripts\activate      # On Windows
  source venv/bin/activate   # On Linux/Mac
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Accept the Llama-2 model terms:
  - Visit https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
  - Accept the model license
  - Log in locally:

    ```bash
    huggingface-cli login
    ```
Adjust system parameters in `config.py`:

```python
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # Base LLM

# Monte Carlo sampling
MC_RUNS = 15            # Number of samples per question
MAX_NEW_TOKENS = 128    # Max tokens per sample

# Generation parameters
TEMPERATURE = 0.7       # Stochasticity (higher = more varied outputs)
TOP_P = 0.9             # Nucleus sampling threshold

# Metric weights (should sum to ~1.0)
ALPHA = 0.4             # Mean entropy weight
BETA = 0.3              # Entropy spike weight
GAMMA = 0.3             # Semantic variance weight

LAMBDA = 1.2            # Confidence decay factor
```

Run the complete pipeline on a medical question:

```bash
python main.py
```

Default question: "According to standard clinical guidelines, what are the recommended first-line treatments for hypertension in adults?"
```
==============================
ANSWER:
First-line treatments for hypertension in adults typically include:
1. Thiazide diuretics
2. ACE inhibitors
3. Angiotensin II receptor blockers
4. Calcium channel blockers
[...]

--- Reliability ---
Mean Entropy        : 0.2345
Entropy Spike       : 0.5678
Semantic Variance   : 0.1234
Confidence Score (C): 0.892
Hallucination Risk  : Low
==============================
```
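The reliability figures above come from fusing the three metrics and calibrating the result. A minimal sketch, assuming the fusion is a plain weighted sum and the calibration is C = exp(−LAMBDA · H); the function names mirror `fusion.py` and `confidence.py`, but the exact formulas are assumptions:

```python
import math

# Weights mirror the config.py defaults (illustrative).
ALPHA, BETA, GAMMA, LAMBDA = 0.4, 0.3, 0.3, 1.2

def hallucination_score(mean_ent, spike, sem_var):
    """Weighted fusion of the three normalized uncertainty signals."""
    return ALPHA * mean_ent + BETA * spike + GAMMA * sem_var

def confidence_score(h_score):
    """Exponential decay: C = exp(-LAMBDA * H), in (0, 1]."""
    return math.exp(-LAMBDA * h_score)

def risk_label(confidence):
    if confidence >= 0.85:
        return "Low"
    if confidence >= 0.60:
        return "Medium"
    return "High"

h = hallucination_score(0.05, 0.02, 0.03)   # H = 0.035
c = confidence_score(h)                      # ≈ 0.959
print(risk_label(c))                         # Low
```

Note the monotone mapping: zero uncertainty gives C = 1.0 exactly, and larger LAMBDA makes the confidence drop faster for the same fused score.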
Use the pipeline programmatically in your code:
```python
from main import run_pipeline

# Single question
run_pipeline("What is the standard dosage for metformin?")

# Or use components individually
from model_loader import load_model
from inference.mc_sampler import monte_carlo_generate
from uncertainty.entropy import mean_entropy, entropy_spike
from uncertainty.semantic_variance import semantic_variance
from uncertainty.fusion import hallucination_score
from calibration.confidence import confidence_score, risk_label

tokenizer, model = load_model()
outputs, entropies = monte_carlo_generate(model, tokenizer, question)
mean_ent = mean_entropy(entropies)
conf_score = confidence_score(hallucination_score(mean_ent, ...))
```

```
Medical Question
        ↓
[Model Loader] → Load Llama-2-7b with 4-bit quantization
        ↓
[MC Sampler] → Generate 15 diverse outputs with temperature sampling
        ↓
[Uncertainty Metrics]
  ├→ Mean Entropy: Token-level uncertainty
  ├→ Entropy Spike: Sudden confidence drops
  └→ Semantic Variance: Output diversity
        ↓
[Fusion] → Weighted combination into hallucination score
        ↓
[Calibration] → Convert to confidence score (0–1)
        ↓
Risk Label: Low / Medium / High
```
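The Semantic Variance stage admits a compact definition: the mean pairwise cosine distance between output embeddings. The sketch below uses stand-in 2-D vectors; the actual pipeline embeds outputs with all-MiniLM-L6-v2. This is one plausible formulation, not necessarily the repo's exact formula.

```python
import numpy as np

def semantic_variance(embeddings):
    """Mean pairwise cosine distance between output embeddings.
    Identical outputs -> 0; divergent outputs -> values approaching 1."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = E @ E.T                                    # cosine similarities
    iu = np.triu_indices(len(E), k=1)                 # each pair once
    return float(np.mean(1.0 - sims[iu]))

# Three near-identical answers -> variance 0
print(semantic_variance([[1, 0], [1, 0], [1, 0]]))   # 0.0
# Two orthogonal answers -> variance 1
print(semantic_variance([[1, 0], [0, 1]]))           # 1.0
```

Because the metric only compares outputs with each other, it needs no reference answer — consistency across the 15 Monte Carlo samples stands in for correctness.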
- 4-bit Quantization: Reduces memory footprint (↓75%) while maintaining quality
- Automatic Device Mapping: Handles CPU/GPU/multi-GPU setups
- Efficient Embeddings: Uses lightweight sentence transformers
- Production-Ready: Configurable, modular, and easily extendable
| Component | Latency | Memory |
|---|---|---|
| Model Load | ~30s | 6-7GB (4-bit) |
| MC Sampling (15 runs) | ~45s | Shared with model |
| Entropy Calculation | <100ms | Minimal |
| Semantic Variance | ~2s | 500MB (embedder) |
| Total Pipeline | ~90s | 7-8GB total |
Benchmarks measured on an RTX 3090 Ti.
- Out of memory:
  - Reduce `MC_RUNS` in `config.py`
  - Use `DEVICE_MAP = "cpu"` for CPU-only mode
  - Increase GPU memory allocation if using MPS/ROCm
- Slow generation:
  - Check that the GPU is being used: run `nvidia-smi` during execution
  - Ensure the CUDA toolkit version matches the PyTorch build
  - Update to the latest `bitsandbytes` for optimized kernels
- Hugging Face authentication errors:
  - Ensure you're logged in: `huggingface-cli login`
  - Token limits apply to free-tier accounts
Create a new module in `uncertainty/` and integrate it into the pipeline:

```python
# uncertainty/custom_metric.py
def custom_uncertainty_score(outputs, entropies):
    """Your custom metric implementation."""
    score = 0.0  # e.g., combine output-length variance with entropy statistics
    return score
```

```python
# Update main.py
from uncertainty.custom_metric import custom_uncertainty_score

custom_score = custom_uncertainty_score(outputs, entropies)
```

To swap in a different model, modify `config.py`:

```python
MODEL_NAME = "meta-llama/Llama-2-13b-chat-hf"  # Or any HF causal LM
# Adjust MAX_NEW_TOKENS and TEMPERATURE as needed
```

- Fine-tuning: Currently uses base Llama-2; domain-specific fine-tuning could improve medical accuracy
- Evaluation: Requires ground-truth medical datasets for metric validation
- Real-time: Currently single-question focused; batch processing would enable scalability
- Multilingual: Currently English-only; multilingual models would broaden applicability
If you use this framework in research, please cite:

```bibtex
@software{medical_llm_uncertainty_2026,
  title={Medical LLM Uncertainty Quantification Framework},
  author={Your Name},
  year={2026},
  url={https://github.com/your-repo}
}
```

[Add your license here - e.g., MIT, Apache 2.0]
For issues, questions, or contributions, please open an issue or contact the maintainers.
Last Updated: February 2026