A robust framework for quantifying and analyzing uncertainty in medical answers generated by large language models (LLMs). This system detects potential hallucinations and provides confidence scores to help assess the reliability of AI-generated medical responses.
Medical hallucinations in LLMs occur when the model generates medically inaccurate or completely fabricated information with high confidence. This framework mitigates this risk by:
- Monte Carlo Sampling: Generating multiple responses to the same question with temperature-based stochasticity
- Uncertainty Quantification: Computing multiple uncertainty metrics from the generated outputs
- Hallucination Detection: Fusing metrics into a composite hallucination risk score
- Confidence Calibration: Producing interpretable confidence scores and risk labels for medical practitioners
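The temperature-based stochasticity behind the Monte Carlo step is worth seeing concretely. Below is a minimal, self-contained sketch (illustrative names, not the framework's actual API): logits are scaled by 1/T before the softmax, a token is sampled from the resulting distribution, and the per-step Shannon entropy is recorded alongside it.

```python
import numpy as np

def sample_with_entropy(logits, temperature=0.7, rng=None):
    """Sample one token from temperature-scaled logits and return
    (token_id, Shannon entropy of the scaled distribution, in nats)."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # stable softmax
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    token = int(rng.choice(len(probs), p=probs))
    return token, entropy

# A peaked distribution yields low entropy; a flat one is maximal.
peaked = np.array([10.0, 0.0, 0.0, 0.0])
flat = np.zeros(4)
_, h_peaked = sample_with_entropy(peaked)
_, h_flat = sample_with_entropy(flat)   # ≈ log(4) ≈ 1.386 nats
```

Higher temperatures flatten the distribution, which both diversifies the sampled outputs and raises the recorded entropy — exactly the signal the downstream metrics consume.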
```
medical_llm_uncertainty/
├── main.py                    # Entry point to run the complete pipeline
├── config.py                  # Configuration parameters for the system
├── model_loader.py            # LLM model initialization and loading
├── inference/
│   └── mc_sampler.py          # Monte Carlo sampling engine
├── uncertainty/
│   ├── entropy.py             # Entropy-based uncertainty metrics
│   ├── semantic_variance.py   # Semantic variation analysis
│   └── fusion.py              # Metric fusion for hallucination scoring
└── calibration/
    └── confidence.py          # Confidence score calculation and risk labels
```
- Mean Entropy (`entropy.py`)
  - Average Shannon entropy across all Monte Carlo samples
  - Higher values indicate greater model uncertainty
  - Computed from token probability distributions
- Entropy Spike (`entropy.py`)
  - Detects sharp increases in entropy (95th-percentile threshold)
  - Identifies points where the model suddenly becomes uncertain
  - Suggests potential knowledge gaps
- Semantic Variance (`semantic_variance.py`)
  - Measures divergence in semantic meaning between outputs
  - Uses sentence embeddings (all-MiniLM-L6-v2) to compute cosine distances
  - Higher variance = more inconsistent outputs = potential hallucination
- Hallucination Score (`fusion.py`)
  - Weighted fusion of the three metrics
  - Combines entropy, spike, and semantic variance signals
  - Configurable weights via hyperparameters
- Confidence Score (`confidence.py`)
  - Calibrated from the hallucination score using exponential decay
  - Range: 0.0 (no confidence) to 1.0 (high confidence)
  - Risk labels: Low (≥0.85), Medium (0.60–0.84), High (<0.60)
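The two entropy metrics can be sketched in a few lines of NumPy. The function names mirror `entropy.py`, but the signatures are assumptions, and the spike definition used here (mean entropy of the tokens above the 95th-percentile threshold) is one plausible reading of the description, not necessarily the repo's exact formula.

```python
import numpy as np

def mean_entropy(entropies):
    """Average Shannon entropy over all tokens of all MC samples.
    `entropies` is a list of per-sample arrays of token-level entropies."""
    return float(np.mean(np.concatenate(entropies)))

def entropy_spike(entropies, percentile=95):
    """Magnitude of sudden uncertainty: mean entropy of the tokens at or
    above the given percentile threshold."""
    flat = np.concatenate(entropies)
    threshold = np.percentile(flat, percentile)
    return float(flat[flat >= threshold].mean())

runs = [np.array([0.1, 0.2, 0.1]), np.array([0.1, 2.5, 0.2])]
print(mean_entropy(runs))    # ≈ 0.533
print(entropy_spike(runs))   # 2.5 — the single spiking token dominates
```

A single high-entropy token barely moves the mean but is captured cleanly by the spike metric, which is why the two signals are fused rather than used alone.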
- Python 3.10+
- CUDA 12.0+ (for GPU acceleration)
- 8GB+ VRAM (recommended for Llama-2-7b)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd medical_llm_uncertainty
  ```

- Create a virtual environment (recommended):

  ```bash
  python -m venv venv
  venv\Scripts\activate      # On Windows
  source venv/bin/activate   # On Linux/Mac
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Accept the Llama-2 model terms:
  - Visit https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
  - Accept the model license
  - Log in locally:

    ```bash
    huggingface-cli login
    ```
Adjust system parameters in `config.py`:

```python
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # Base LLM

# Monte Carlo sampling
MC_RUNS = 15            # Number of samples per question
MAX_NEW_TOKENS = 128    # Max tokens per sample

# Generation parameters
TEMPERATURE = 0.7       # Stochasticity (higher = more varied outputs)
TOP_P = 0.9             # Nucleus sampling threshold

# Metric weights (should sum to ~1.0)
ALPHA = 0.4             # Mean entropy weight
BETA = 0.3              # Entropy spike weight
GAMMA = 0.3             # Semantic variance weight

LAMBDA = 1.2            # Confidence decay factor
```

Run the complete pipeline on a medical question:

```bash
python main.py
```

Default question: "According to standard clinical guidelines, what are the recommended first-line treatments for hypertension in adults?"
```
==============================
ANSWER:
First-line treatments for hypertension in adults typically include:
1. Thiazide diuretics
2. ACE inhibitors
3. Angiotensin II receptor blockers
4. Calcium channel blockers
[...]

--- Reliability ---
Mean Entropy        : 0.2345
Entropy Spike       : 0.5678
Semantic Variance   : 0.1234
Confidence Score (C): 0.892
Hallucination Risk  : Low
==============================
```
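The reliability figures above come from fusing the three metrics and calibrating the result. A minimal sketch, assuming the fusion is a plain weighted sum and the calibration is C = exp(−LAMBDA · H); the function names mirror `fusion.py` and `confidence.py`, but the exact formulas are assumptions:

```python
import math

# Weights mirror the config.py defaults (illustrative).
ALPHA, BETA, GAMMA, LAMBDA = 0.4, 0.3, 0.3, 1.2

def hallucination_score(mean_ent, spike, sem_var):
    """Weighted fusion of the three normalized uncertainty signals."""
    return ALPHA * mean_ent + BETA * spike + GAMMA * sem_var

def confidence_score(h_score):
    """Exponential decay: C = exp(-LAMBDA * H), in (0, 1]."""
    return math.exp(-LAMBDA * h_score)

def risk_label(confidence):
    if confidence >= 0.85:
        return "Low"
    if confidence >= 0.60:
        return "Medium"
    return "High"

h = hallucination_score(0.05, 0.02, 0.03)   # H = 0.035
c = confidence_score(h)                      # ≈ 0.959
print(risk_label(c))                         # Low
```

Note the monotone mapping: zero uncertainty gives C = 1.0 exactly, and larger LAMBDA makes the confidence drop faster for the same fused score.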
Use the pipeline programmatically in your code:
```python
from main import run_pipeline

# Single question
run_pipeline("What is the standard dosage for metformin?")

# Or use components individually
from model_loader import load_model
from inference.mc_sampler import monte_carlo_generate
from uncertainty.entropy import mean_entropy, entropy_spike
from uncertainty.semantic_variance import semantic_variance
from uncertainty.fusion import hallucination_score
from calibration.confidence import confidence_score, risk_label

tokenizer, model = load_model()
outputs, entropies = monte_carlo_generate(model, tokenizer, question)
mean_ent = mean_entropy(entropies)
conf_score = confidence_score(hallucination_score(mean_ent, ...))
```

```
Medical Question
        ↓
[Model Loader] → Load Llama-2-7b with 4-bit quantization
        ↓
[MC Sampler] → Generate 15 diverse outputs with temperature sampling
        ↓
[Uncertainty Metrics]
  ├→ Mean Entropy: Token-level uncertainty
  ├→ Entropy Spike: Sudden confidence drops
  └→ Semantic Variance: Output diversity
        ↓
[Fusion] → Weighted combination into hallucination score
        ↓
[Calibration] → Convert to confidence score (0–1)
        ↓
Risk Label: Low / Medium / High
```
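The Semantic Variance stage admits a compact definition: the mean pairwise cosine distance between output embeddings. The sketch below uses stand-in 2-D vectors; the actual pipeline embeds outputs with all-MiniLM-L6-v2. This is one plausible formulation, not necessarily the repo's exact formula.

```python
import numpy as np

def semantic_variance(embeddings):
    """Mean pairwise cosine distance between output embeddings.
    Identical outputs -> 0; divergent outputs -> values approaching 1."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = E @ E.T                                    # cosine similarities
    iu = np.triu_indices(len(E), k=1)                 # each pair once
    return float(np.mean(1.0 - sims[iu]))

# Three near-identical answers -> variance 0
print(semantic_variance([[1, 0], [1, 0], [1, 0]]))   # 0.0
# Two orthogonal answers -> variance 1
print(semantic_variance([[1, 0], [0, 1]]))           # 1.0
```

Because the metric only compares outputs with each other, it needs no reference answer — consistency across the 15 Monte Carlo samples stands in for correctness.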
- 4-bit Quantization: Reduces memory footprint (↓75%) while maintaining quality
- Automatic Device Mapping: Handles CPU/GPU/multi-GPU setups
- Efficient Embeddings: Uses lightweight sentence transformers
- Production-Ready: Configurable, modular, and easily extendable
| Component | Latency | Memory |
|---|---|---|
| Model Load | ~30s | 6-7GB (4-bit) |
| MC Sampling (15 runs) | ~45s | Shared with model |
| Entropy Calculation | <100ms | Minimal |
| Semantic Variance | ~2s | 500MB (embedder) |
| Total Pipeline | ~90s | 7-8GB total |
Benchmarks measured on an RTX 3090 Ti.
- Out of memory:
  - Reduce `MC_RUNS` in `config.py`
  - Use `DEVICE_MAP = "cpu"` for CPU-only mode
  - Increase GPU memory allocation if using MPS/ROCm
- Slow generation:
  - Check that the GPU is being used: run `nvidia-smi` during execution
  - Ensure the CUDA toolkit version matches the PyTorch build
  - Update to the latest `bitsandbytes` for optimized kernels
- Hugging Face authentication errors:
  - Ensure you're logged in: `huggingface-cli login`
  - Token limits apply to free-tier accounts
Create a new module in `uncertainty/` and integrate it into the pipeline:

```python
# uncertainty/custom_metric.py
def custom_uncertainty_score(outputs, entropies):
    """Your custom metric implementation."""
    score = 0.0  # e.g., combine output-length variance with entropy statistics
    return score
```

```python
# Update main.py
from uncertainty.custom_metric import custom_uncertainty_score

custom_score = custom_uncertainty_score(outputs, entropies)
```

To swap in a different model, modify `config.py`:

```python
MODEL_NAME = "meta-llama/Llama-2-13b-chat-hf"  # Or any HF causal LM
# Adjust MAX_NEW_TOKENS and TEMPERATURE as needed
```

- Fine-tuning: Currently uses base Llama-2; domain-specific fine-tuning could improve medical accuracy
- Evaluation: Requires ground-truth medical datasets for metric validation
- Real-time: Currently single-question focused; batch processing would enable scalability
- Multilingual: Currently English-only; multilingual models would broaden applicability
If you use this framework in research, please cite:

```bibtex
@software{medical_llm_uncertainty_2026,
  title={Medical LLM Uncertainty Quantification Framework},
  author={Your Name},
  year={2026},
  url={https://github.com/your-repo}
}
```

[Add your license here - e.g., MIT, Apache 2.0]
For issues, questions, or contributions, please open an issue or contact the maintainers.
Last Updated: February 2026