A Python pipeline for detecting whether an LLM has been tampered with (fine-tuned, backdoored, quantized, or instruction-tuned) relative to a trusted baseline, using the LIIH (Layered Integrity Invariant Hash) Framework.
LIIH frames tamper detection as a pair classification problem: given a trusted baseline model and a candidate model, it classifies the pair as legitimate (label 0) or tampered (label 1).
Detection targets:
- Fine-tuning / RLHF: Unauthorized weight updates or alignment changes
- Backdoor injection: BadNet-style poisoned models
- Quantization: Precision modifications (FP32 → INT8)
- Instruction-tuning: Base → instruction-tuned variant swaps
- Cross-family substitution: Model replaced with a different architecture
Four orthogonal detection components, each contributing comparison features to the final vector:
- Zeroth-order Jacobian estimation via ZeroPrint (arXiv:2510.06605)
- Semantic perturbations using DistilBERT as perturber
- Per-probe cosine similarities + delta mean vector + scalar statistics
- Detects: Model substitution, deep fine-tuning, weight-level changes
- Free-text response embeddings via `all-MiniLM-L6-v2`
- Linguistic complexity analysis (sentence length, vocabulary richness)
- Per-probe cosine similarities + cross-mean similarity + complexity deltas
- Detects: RLHF updates, alignment drift, behavioral shifts
- TTFT (Time to First Token) and OTPS (Output Tokens Per Second)
- Hardware-agnostic ratios and relative differences
- Detects: Quantization, hardware substitution, infrastructure changes
- Bilateral trace embeddings via LLMmap (arXiv:2407.15847)
- Query ∥ response embeddings using `multilingual-E5-large-instruct`
- Per-probe cosine similarities + scalar statistics
- Detects: Identity-level model swaps, architecture changes
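The comparison features named in the lists above (per-probe cosine similarities, a delta mean vector, scalar statistics, and hardware-agnostic timing ratios) can be illustrated with a small standard-library sketch. The function names `cosine`, `comparison_features`, and `timing_features`, and the choice of mean, minimum, and standard deviation as the scalar statistics, are assumptions made for illustration; the repository's extractors may compute these differently.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def comparison_features(baseline, candidate):
    """Compare two models' per-probe vectors (same probes, same order).

    Returns per-probe cosine similarities, then the mean per-dimension
    difference (candidate - baseline), then three scalar statistics.
    The exact set of statistics here is an assumption.
    """
    if len(baseline) != len(candidate):
        raise ValueError("both models must be probed with the same inputs")
    sims = [cosine(b, c) for b, c in zip(baseline, candidate)]
    dim = len(baseline[0])
    delta_mean = [
        mean(c[i] - b[i] for b, c in zip(baseline, candidate))
        for i in range(dim)
    ]
    stats = [mean(sims), min(sims), pstdev(sims)]
    return sims + delta_mean + stats


def timing_features(baseline_value, candidate_value):
    """Ratio and relative difference of one timing metric (e.g. TTFT or OTPS)."""
    if baseline_value <= 0:
        raise ValueError("baseline timing must be positive")
    ratio = candidate_value / baseline_value
    relative_diff = (candidate_value - baseline_value) / baseline_value
    return [ratio, relative_diff]
```

Ratios and relative differences are unitless, which is one plausible reading of "hardware-agnostic": both models are measured on the same machine, so the absolute speed of that machine cancels out.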
src/
├── __init__.py
├── config.py # Model pairs, datasets, LIIH component configs
├── pipeline.py # Main orchestration script
├── data/
│   ├── model_loader.py # HuggingFace model loader (incl. int8 quantization)
│   └── dataset_loader.py # MMLU-Pro & TruthfulQA probe datasets
├── features/
│   ├── jacobian_extractor.py # Component I: ZeroPrint Jacobian fingerprinting
│   ├── semantic_extractor.py # Component B/S: Semantic drift detection
│   ├── temporal_extractor.py # Component T: TTFT & OTPS profiling
│   ├── llmmap_extractor.py # Component L: LLMmap bilateral traces
│   └── liih_builder.py # Composite LIIH comparison vector builder
├── classifier/
│   ├── trainer.py # Multi-classifier trainer (RF, XGBoost, SVM, LR)
│   └── evaluator.py # Evaluation, confusion matrices, feature importance
└── utils/
    ├── perturber.py # Semantic perturbation for ZeroPrint
    └── helpers.py # Utility functions
- Python 3.8+
- CUDA-capable GPU (recommended; required for int8 quantization pairs)
- 16GB+ RAM
pip install -r requirements.txt

Key packages:
- `transformers` — HuggingFace Transformers
- `torch` — PyTorch
- `bitsandbytes` — INT8/INT4 quantization via CUDA kernels
- `sentence-transformers` — Sentence embeddings
- `scikit-learn` — ML algorithms (RF, SVM, LR)
- `xgboost` — XGBoost classifier
- `datasets` — HuggingFace Datasets
python src/pipeline.py

This will:
- Load all model pairs defined in `config.py` (legitimate + tampered)
- Inject in-pipeline backdoor pairs (BadNet-style)
- Extract per-model LIIH signatures (Jacobian, Semantic, Temporal, LLMmap)
- Build pairwise 100-feature comparison vectors
- Train four classifiers (Random Forest, XGBoost, SVM, Logistic Regression)
- Run component-level ablation study
- Generate evaluation reports and figures in `results/`
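The component-level ablation step can be pictured as removing one component's columns from every comparison vector and re-evaluating. The sketch below is only an illustration: `COMPONENT_LAYOUT` is a made-up 10-column toy layout (the real split of the 100 features across components is not stated here), and `drop_component`, `leave_one_out`, and the `evaluate` callback are hypothetical names.

```python
# Toy layout: component name -> (start, stop) column range.
# These ranges are invented for illustration, not the pipeline's real layout.
COMPONENT_LAYOUT = {
    "jacobian": (0, 4),
    "semantic": (4, 7),
    "temporal": (7, 8),
    "llmmap": (8, 10),
}


def drop_component(vector, component, layout=COMPONENT_LAYOUT):
    """Return a copy of `vector` without the columns owned by `component`."""
    start, stop = layout[component]
    return vector[:start] + vector[stop:]


def leave_one_out(vectors, labels, evaluate, layout=COMPONENT_LAYOUT):
    """Score the feature set once per component, with that component removed.

    `evaluate(vectors, labels)` is a caller-supplied function, for example
    one that trains a classifier and returns its cross-validated accuracy.
    """
    results = {}
    for component in layout:
        reduced = [drop_component(v, component, layout) for v in vectors]
        results[component] = evaluate(reduced, labels)
    return results
```

A large drop in score when a component is removed suggests that component carries signal the others do not.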
Edit src/config.py to customize model pairs and LIIH component settings:
JACOBIAN_CONFIG = {
    "k_top_tokens": 32,       # Top-K tokens for Jacobian approximation
    "num_probes": 20,         # Number of perturbation probes per model
    "perturber_model": "..."  # DistilBERT perturber model name
}

SEMANTIC_CONFIG = {
    "num_behavioral_probes": 20,  # Number of semantic probes per model
    "max_new_tokens": 64,
    "embedding_model": "all-MiniLM-L6-v2"
}

TEMPORAL_CONFIG = {
    "num_timing_probes": 10,  # Number of timing measurements per model
    "fixed_input_length": 32,
    "fixed_output_length": 32,
    "warmup_runs": 2
}

The pipeline evaluates 38 model pairs across five modification types, spanning 82M to 7B parameters and eight model families (GPT-2, GPT-Neo, OPT, Pythia, BLOOM, SmolLM2, Qwen2.5, OLMo):
| Modification Type | # Pairs | Label |
|---|---|---|
| Identical / legitimate | 17 | 0 |
| Instruction-tuning | 9 | 1 |
| Backdoor injection | 7 | 1 |
| Quantization (INT8) | 3 | 1 |
| Fine-tuning | 2 | 1 |
Train/test split: 28 train / 10 test (stratified by modification type × label).
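To make the stratified split concrete, here is a minimal standard-library sketch that splits each (modification type, label) stratum separately. The function `stratified_split`, the stratum key names, and the per-stratum rounding rule are assumptions for illustration; the pipeline may use a library splitter instead. With the pair counts from the table above and a test fraction of 10/38, this rounding rule happens to yield 28 training and 10 test pairs.

```python
import random
from collections import defaultdict


def stratified_split(items, strata, test_fraction, seed=0):
    """Split `items` into (train, test), sampling each stratum separately."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item, key in zip(items, strata):
        groups[key].append(item)

    train, test = [], []
    for key in sorted(groups):
        members = list(groups[key])
        rng.shuffle(members)
        # Per-stratum rounding is an assumed rule; totals can drift for other counts.
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Pair counts per (modification type, label), taken from the table above.
counts = {
    ("identical", 0): 17,
    ("instruction_tuning", 1): 9,
    ("backdoor", 1): 7,
    ("quantization", 1): 3,
    ("fine_tuning", 1): 2,
}
strata = [key for key, n in counts.items() for _ in range(n)]
pair_ids = list(range(len(strata)))
train_ids, test_ids = stratified_split(pair_ids, strata, test_fraction=10 / 38)
```

Splitting per stratum keeps even the two fine-tuning pairs represented on both sides, which a purely random split over 38 pairs would not guarantee.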
| Classifier | CV Accuracy | Test Accuracy |
|---|---|---|
| Random Forest | 93.33% ± 9.43% | 100% |
| XGBoost | 100.00% ± 0.00% | 100% |
| SVM | 85.93% ± 4.19% | 90% |
| Logistic Regression | 85.93% ± 4.19% | 90% |
Results and figures are saved to results/:
- `confusion_matrix4.pdf` — RF and XGBoost confusion matrices
- `confusion_matrix5.pdf` — SVM and LR confusion matrices
- `random_forest_feature_importance.pdf`
- `xgboost_feature_importance.pdf`
- `logistric_regression_feature_importance.pdf`
- `*_metrics_report.txt` — per-classifier evaluation reports
- Black-box detection — no access to model weights required
- Four orthogonal detection layers — Jacobian, Semantic, Temporal, LLMmap
- Pair-based classification — compares $V_1$ vs $V_2$ directly
- In-pipeline backdoor injection — BadNet-style poisoning without external datasets
- Real INT8 quantization — via bitsandbytes CUDA kernels (≥1.3B params)
- Signature caching — per-model features cached for fast re-experimentation
- Component-level ablation — built-in leave-one-out ablation study
- Multi-classifier — RF, XGBoost, SVM, and Logistic Regression trained in parallel
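Signature caching, as listed above, could work along the following lines: derive a cache key from the model name and extraction settings, and only run the expensive extraction when no cached file exists. This is a sketch under assumptions, not the repository's actual cache: `cache_path`, `load_or_compute`, the JSON file format, and the `compute` callback are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def cache_path(cache_dir, model_name, config):
    """Build a stable file path from the model name and extraction config."""
    key = json.dumps({"model": model_name, "config": config}, sort_keys=True)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"{digest}.json"


def load_or_compute(cache_dir, model_name, config, compute):
    """Return the cached signature if present, else compute and store it.

    `compute(model_name, config)` stands in for the expensive per-model
    feature extraction and must return JSON-serializable data.
    """
    path = cache_path(cache_dir, model_name, config)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    signature = compute(model_name, config)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(signature), encoding="utf-8")
    return signature
```

Including the config in the key means that changing a setting such as the number of probes produces a new cache entry instead of silently reusing stale features.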
- ZeroPrint: Shao et al., "Reading Between the Lines: Towards Reliable Black-box LLM Fingerprinting via Zeroth-order Gradient Estimation", arXiv:2510.06605 (2025)
- LLMmap: Pasquini, Kornaropoulos, Ateniese, "LLMmap: Fingerprinting For Large Language Models", arXiv:2407.15847 (2024)
- BadNets: Gu, Dolan-Gavitt, Garg, "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain", arXiv:1708.06733 (2017)
This implementation uses open-source models and datasets. See individual model licenses on HuggingFace.