[NeurIPS 2025] Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling
- 2025.12.16: Released interpretability analysis code.
- 2025.11.24: Released all preprocessed WSI features, full codebase, and instructions for running HiVE-MIL.
- 2025.09.18: Our HiVE-MIL has been accepted at NeurIPS 2025!
- 2025.05.16: Released the initial code submission.
We propose HiVE-MIL (Hierarchical Vision-LanguagE MIL), a data-efficient VLM adaptation framework for gigapixel WSIs that models hierarchical interactions and intra-scale vision-language alignments, enabling robust few-shot WSI classification and interpretable predictions.
- WSIs are gigapixel-scale and have a hierarchical structure (e.g., 5x → 20x).
- Traditional MIL requires large labeled WSI datasets, which are limited by privacy constraints and rare-disease scarcity. It also learns only from the original slides, which leaves it sensitive to staining variability and domain shift.
- Existing Vision-Language MIL incorporates text as domain knowledge but still lacks explicit hierarchy modeling and robust multimodal alignment.
- Cross-scale hierarchical interaction
- Hierarchical Graph: Constructs parent-child edges between 5x and 20x visual/text nodes.
- Hierarchical Text Contrastive Loss (HTCL): Enforces semantic consistency across scales.
- Intra-scale multimodal interaction
- Heterogeneous Graph: Models image-text relationships within each scale.
- Text-Guided Dynamic Filtering (TGDF): Selects informative patch-text pairs while suppressing weak or irrelevant ones during training.
Together, these components enable HiVE-MIL to capture hierarchical and semantic dependencies across scales and modalities, delivering strong few-shot performance and interpretable predictions across multiple WSI benchmarks.
Set up your environment with these simple steps:
# Create and activate environment
conda create --name hivemil python=3.9.21
conda activate hivemil
conda install pytorch==2.3.0 torchvision==0.18.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# Install dependencies
git clone https://github.com/bryanwong17/HiVE-MIL.git
cd HiVE-MIL
pip install -r requirements.txt
pip install torch_sparse -f https://data.pyg.org/whl/torch-2.3.0+cu118.html
pip install topk@git+https://github.com/oval-group/smooth-topk.git@12c1645f187e2fa0c05f47bf1fe48864d4bd2707

The public TCGA datasets can be downloaded from the NIH Genomic Data Commons Data Portal. For the specific downloading tool, please refer to GDC Data Transfer.
For each dataset, prepare a .csv file in the following format and place it in the dataset_csv folder:
Header: 'case_id, slide_id, label, level0_mag'
Each line: 'patient_0, TCGA-A7-A26I-01Z-00-DX1.0077D012-BC14-4E96-84F7-A1A6A3A778DF, IDC, 40'
For reference, we provide example files in the dataset_csv folder.
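As a quick sanity check before training, a dataset CSV in the format above can be written and validated with Python's standard csv module. This is a minimal sketch: the helper names (write_dataset_csv, validate_dataset_csv) are hypothetical, and the repository may apply stricter or different checks.

```python
import csv
import io

# Column order taken from the header described above.
EXPECTED_HEADER = ["case_id", "slide_id", "label", "level0_mag"]


def write_dataset_csv(rows, handle):
    """Write rows (dicts keyed by EXPECTED_HEADER) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=EXPECTED_HEADER)
    writer.writeheader()
    writer.writerows(rows)


def validate_dataset_csv(handle):
    """Return the parsed rows, raising an error on a malformed file."""
    reader = csv.DictReader(handle, skipinitialspace=True)
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    for line_no, row in enumerate(rows, start=2):
        if not row["slide_id"]:
            raise ValueError(f"line {line_no}: empty slide_id")
        int(row["level0_mag"])  # fails if the magnification is not an integer
    return rows


# Round-trip the example row from this README through an in-memory buffer.
example_rows = [{
    "case_id": "patient_0",
    "slide_id": "TCGA-A7-A26I-01Z-00-DX1.0077D012-BC14-4E96-84F7-A1A6A3A778DF",
    "label": "IDC",
    "level0_mag": 40,
}]
buffer = io.StringIO()
write_dataset_csv(example_rows, buffer)
buffer.seek(0)
parsed = validate_dataset_csv(buffer)
```

To check a file on disk instead, open it with `open(path, newline="")` and pass the handle to validate_dataset_csv.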
To reproduce our results, download the preprocessed hierarchical WSI features (~20 - 25 GB) using the links below.
tcga_brca:

| Feature extractor | Download link |
|---|---|
| PLIP | tcga_brca_plip.zip |
| QuiltNet | tcga_brca_quiltnet.zip |
| CONCH | tcga_brca_conch.zip |

tcga_nsclc:

| Feature extractor | Download link |
|---|---|
| PLIP | tcga_nsclc_plip.zip |
| QuiltNet | tcga_nsclc_quiltnet.zip |
| CONCH | tcga_nsclc_conch.zip |

tcga_rcc:

| Feature extractor | Download link |
|---|---|
| PLIP | tcga_rcc_plip.zip |
| QuiltNet | tcga_rcc_quiltnet.zip |
| CONCH | tcga_rcc_conch.zip |

To unzip a downloaded file (e.g., tcga_brca_quiltnet.zip), use the following command:
unzip tcga_brca_quiltnet.zip
After extraction, the folder structure should look like this:
DATASET_ROOT/
└── tcga_brca_quiltnet/
    ├── quiltnet_5x/
    │   ├── slide_a.h5
    │   ├── slide_b.h5
    │   └── ...
    └── hierarchical_quiltnet_5x_20x/
        ├── slide_a.h5
        ├── slide_b.h5
        └── ...
For each slide (e.g., slide_a), the feature files are:
- quiltnet_5x/slide_a.h5: 5x patch features, shape [#5x_patches, feat_dim]
- hierarchical_quiltnet_5x_20x/slide_a.h5: 5x→20x hierarchical features, shape [#5x_patches, 16, feat_dim]

Note: In the parent-child hierarchy, each 5x patch corresponds to up to 16 20x child patches. Missing children are zero-padded.
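The padding convention in the note can be pictured with a small pure-Python sketch. The function name pad_children and the accompanying validity mask are illustrative assumptions, not part of the released code; this README only states that missing children are zero-padded.

```python
def pad_children(children, max_children=16):
    """Pad a list of child feature vectors to a fixed length with zero vectors.

    Returns (padded, mask), where mask[i] is True for real children
    and False for zero-padded slots.
    """
    if not children:
        raise ValueError("need at least one child to infer feat_dim")
    if len(children) > max_children:
        raise ValueError("more children than max_children")
    feat_dim = len(children[0])
    n_missing = max_children - len(children)
    padded = [list(c) for c in children]
    padded += [[0.0] * feat_dim for _ in range(n_missing)]
    mask = [True] * len(children) + [False] * n_missing
    return padded, mask


# A 5x patch (e.g., at a tissue border) with only 3 of its 16 possible 20x children.
children = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
padded, mask = pad_children(children)
```

A mask like this is one way to keep zero-padded slots from contributing to pooling or attention; whether HiVE-MIL handles padding this way internally is not stated here.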
To generate hierarchical features, you must first extract both 5x and 20x patch features. You can follow the preprocessing pipeline from ViLa-MIL, which includes:
- generate patch coordinates
- crop patches
- extract patch features
After running all preprocessing steps (e.g., for the tcga_brca dataset using quiltnet), your directory should include the quiltnet_5x and quiltnet_20x folders with the following structure:
DATASET_ROOT/
└── tcga_brca_quiltnet/
    ├── quiltnet_5x/
    │   ├── slide_a.h5
    │   ├── slide_b.h5
    │   └── ...
    └── quiltnet_20x/
        ├── slide_a.h5
        ├── slide_b.h5
        └── ...
For each WSI (e.g., slide_a), the extracted feature files should be:
- quiltnet_5x/slide_a.h5: 5x patch features, shape [#5x_patches, feat_dim]
- quiltnet_20x/slide_a.h5: 20x patch features, shape [#20x_patches, feat_dim]

Next, construct the hierarchical 5x→20x features by linking patches via their absolute coordinates:
python create_hierarchical_features.py \
--dataset-root-path DATASET_ROOT \
--dataset-name DATASET_NAME \
--feature-extractor-name FEATURE_EXTRACTOR_NAME \
--low-mag 5 \
--high-mag 20 \
--max-patches 16

Parameter Descriptions:
- dataset-root-path: Path to the directory containing the extracted 5x and 20x feature folders.
- dataset-name: Name of the dataset to process (e.g., tcga_brca, tcga_nsclc, tcga_rcc).
- feature-extractor-name: Feature extractor used to generate patch embeddings (e.g., quiltnet, plip, conch).
- low-mag: Low magnification level used for parent patches (e.g., 5 for 5x).
- high-mag: High magnification level used for child patches (e.g., 20 for 20x).
- max-patches: Maximum number of 20x child patches linked to each 5x parent patch (Default: (high_mag / low_mag)^2, e.g., (20 / 5)^2 = 16).
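The default for max-patches and the idea of linking patches by absolute coordinates can be sketched as follows. This is an illustration under stated assumptions, not the logic of create_hierarchical_features.py: it assumes coordinates are stored in level-0 pixels, that patches at both magnifications share one pixel size (256 here), that the 5x and 20x grids are aligned, and that the slide's base magnification is 40 (the level0_mag column). All function names are hypothetical.

```python
def default_max_patches(low_mag, high_mag):
    """(high_mag / low_mag)^2, e.g. (20 / 5)^2 = 16."""
    if high_mag % low_mag != 0:
        raise ValueError("high_mag must be an integer multiple of low_mag")
    return (high_mag // low_mag) ** 2


def candidate_child_coords(parent_xy, patch_size, level0_mag, low_mag, high_mag):
    """Level-0 top-left corners of the high-mag patches inside one low-mag patch."""
    parent_extent = patch_size * level0_mag // low_mag   # level-0 pixels per parent
    child_extent = patch_size * level0_mag // high_mag   # level-0 pixels per child
    per_side = parent_extent // child_extent
    x0, y0 = parent_xy
    return [
        (x0 + col * child_extent, y0 + row * child_extent)
        for row in range(per_side)
        for col in range(per_side)
    ]


def link_children(parent_xy, available_child_coords, **grid):
    """Keep only candidate children that were actually extracted."""
    available = set(available_child_coords)
    return [xy for xy in candidate_child_coords(parent_xy, **grid) if xy in available]


# Assumed geometry: 256-pixel patches on a slide scanned at 40x.
grid = dict(patch_size=256, level0_mag=40, low_mag=5, high_mag=20)
candidates = candidate_child_coords((0, 0), **grid)

# Pretend only two 20x patches under this parent passed tissue filtering.
linked = link_children((0, 0), [(0, 0), (512, 0), (9999, 9999)], **grid)
```

Parents with fewer linked children than max-patches are the cases where zero-padding applies.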
To verify that each 5x patch correctly aligns with its corresponding 20x child patches, run:
python check_hierarchical_consistency.py \
--dataset-root-path DATASET_ROOT \
--dataset-name DATASET_NAME \
--feature-extractor-name FEATURE_EXTRACTOR_NAME

Parameter Descriptions:
- dataset-root-path: Path to the directory containing the extracted 5x and 20x feature folders.
- dataset-name: Name of the dataset to process (e.g., tcga_brca, tcga_nsclc, tcga_rcc).
- feature-extractor-name: Feature extractor used to generate patch embeddings (e.g., quiltnet, plip, conch).
You can follow the dataset-splitting steps from ViLa-MIL, which include:
- generate k dataset splits with different seeds
- build the few-shot dataset split

For reproducibility, we also provide the splits used in our main experiments (16-shot) in the splits folder.
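For intuition, a few-shot split draws a fixed number of labeled slides per class from the training pool, and varying the seed yields different splits. The sketch below is a generic illustration with hypothetical names and toy labels; it is not the ViLa-MIL splitting script, whose file formats and options may differ.

```python
import random
from collections import defaultdict


def build_few_shot_split(slide_labels, few_shot_num, seed):
    """Sample up to few_shot_num slide IDs per class, reproducibly for a given seed."""
    by_class = defaultdict(list)
    for slide_id, label in sorted(slide_labels.items()):
        by_class[label].append(slide_id)

    rng = random.Random(seed)
    split = {}
    for label, slide_ids in sorted(by_class.items()):
        n = min(few_shot_num, len(slide_ids))
        split[label] = sorted(rng.sample(slide_ids, n))
    return split


# Toy training pool: 40 slides alternating between two placeholder classes.
pool = {
    f"slide_{i:02d}": ("class_a" if i % 2 == 0 else "class_b")
    for i in range(40)
}

# One 16-shot split per seed, mirroring k = 5 splits.
splits = [build_few_shot_split(pool, few_shot_num=16, seed=s) for s in range(5)]
```

Sorting the inputs before sampling keeps the result independent of dictionary insertion order, so the same seed always yields the same split.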
HiVE-MIL uses a frozen LLM to generate hierarchical morphological descriptions for each class in the dataset.
The prompt used for generating these descriptions is:
The task is to summarize the morphological features of the {dataset_name} dataset for the classes {class_name_1}, {class_name_2}, ..., {class_name_c} classes. For each class, list four representative morphological features observed at 5x magnification, followed by three finer sub-features observed at 20x magnification for each. Each description should include the morphological term along with an explanation of its defining visual features.
The generated structure follows these rules:
- The first four entries correspond to coarse-scale (5x) morphological features.
- Each 5x feature is expanded into three fine-scale (20x) sub-features, providing more detailed morphological descriptions.
For reproducibility, all generated hierarchical descriptions are included in the text_prompt directory, produced using GPT-4o.
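To make the template concrete, the sketch below fills in the placeholders and checks that a set of descriptions follows the four-by-three hierarchy described above. The helper names, the nested-dictionary layout, and the placeholder description strings are hypothetical; the actual descriptions and their file format are those in the text_prompt directory.

```python
# Template wording copied from the prompt above; the class list is joined
# into a single {class_names} placeholder.
PROMPT_TEMPLATE = (
    "The task is to summarize the morphological features of the {dataset_name} "
    "dataset for the classes {class_names} classes. For each class, list four "
    "representative morphological features observed at 5x magnification, "
    "followed by three finer sub-features observed at 20x magnification for "
    "each. Each description should include the morphological term along with "
    "an explanation of its defining visual features."
)


def build_prompt(dataset_name, class_names):
    """Fill the template with a dataset name and a comma-separated class list."""
    return PROMPT_TEMPLATE.format(
        dataset_name=dataset_name, class_names=", ".join(class_names)
    )


def check_hierarchy(descriptions, n_coarse=4, n_fine=3):
    """Verify each class has n_coarse 5x features, each with n_fine 20x sub-features."""
    for class_name, coarse in descriptions.items():
        if len(coarse) != n_coarse:
            raise ValueError(f"{class_name}: expected {n_coarse} 5x features")
        for feature, fine in coarse.items():
            if len(fine) != n_fine:
                raise ValueError(
                    f"{class_name}/{feature}: expected {n_fine} 20x sub-features"
                )
    return True


# Example dataset and class names, used here only for illustration.
prompt = build_prompt("tcga_brca", ["IDC", "ILC"])

# Placeholder text only; real descriptions come from the LLM.
toy_descriptions = {
    "IDC": {
        f"5x feature {i}": [f"20x sub-feature {i}.{j}" for j in range(3)]
        for i in range(4)
    },
}
ok = check_hierarchy(toy_descriptions)
```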
To use conch as the feature extractor, download the pre-trained conch model from the following link: pytorch_model.bin, and place the file at conch/pytorch_model.bin.
Once everything is set up, run the following script to train HiVE-MIL.
python main.py \
--seed 1 \
--drop_out \
--early_stopping \
--lr 1e-4 \
--k 5 \
--few_shot_num 16 \
--feature_extractor FEATURE_EXTRACTOR_NAME \
--feature_dim 512 \
--bag_loss ce \
--task TASK \
--results_dir RESULT_DIR \
--model_type HiVE_MIL \
--data_root_dir DATASET_ROOT

Parameter Descriptions:
- seed: Random seed used to ensure reproducibility across training runs (Default: 1).
- drop_out: Enables dropout during training to improve model generalization (Default: True).
- early_stopping: Activates early stopping based on validation performance to prevent overfitting (Default: True).
- lr: Learning rate used by the optimizer (Default: 1e-4).
- k: Number of few-shot splits (Default: 5).
- few_shot_num: Number of labeled samples per class used in the few-shot training setting (Default: 16).
- feature_extractor: Feature extractor used for training (e.g., quiltnet, plip, conch).
- feature_dim: Dimensionality of the extracted patch features (Default: 512).
- bag_loss: Loss function applied at the bag level for MIL training (Default: ce).
- task: Name of the dataset to be used for training (e.g., tcga_brca, tcga_nsclc, tcga_rcc).
- results_dir: Directory where training outputs and results will be saved.
- model_type: Model architecture used for training (Default: HiVE_MIL).
- data_root_dir: Path to the directory containing the extracted 5x and 20x feature folders.
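To run the same configuration across several datasets or feature extractors, the training command can be assembled programmatically. The sketch below only builds and prints the argument lists using the flags shown in this README; the helper name, the results directory layout, and the DATASET_ROOT value are placeholders, and actually launching the runs (for example with subprocess.run) is left to the reader.

```python
import itertools
import shlex


def build_train_command(task, feature_extractor, data_root_dir, results_dir, seed=1):
    """Return the main.py invocation from this README as an argument list."""
    return [
        "python", "main.py",
        "--seed", str(seed),
        "--drop_out",
        "--early_stopping",
        "--lr", "1e-4",
        "--k", "5",
        "--few_shot_num", "16",
        "--feature_extractor", feature_extractor,
        "--feature_dim", "512",
        "--bag_loss", "ce",
        "--task", task,
        "--results_dir", results_dir,
        "--model_type", "HiVE_MIL",
        "--data_root_dir", data_root_dir,
    ]


tasks = ["tcga_brca", "tcga_nsclc", "tcga_rcc"]
extractors = ["plip", "quiltnet", "conch"]
commands = [
    build_train_command(
        task,
        extractor,
        data_root_dir="DATASET_ROOT",               # placeholder path
        results_dir=f"results/{task}_{extractor}",  # hypothetical layout
    )
    for task, extractor in itertools.product(tasks, extractors)
]

# Print shell-ready command lines for inspection.
for command in commands:
    print(shlex.join(command))
```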
After training, you can check the mean and standard deviation of ACC, AUC, and Macro F1 across the few-shot splits in the results.csv file, located in the RESULT_DIR folder. These values should closely match the main results reported in Table 1 of the paper.
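The aggregation described above amounts to a mean and standard deviation over the per-split scores. A minimal sketch with made-up numbers follows; the values are placeholders, not results from the paper, and whether the repository uses the sample or the population standard deviation is not stated here (the sketch uses the sample version).

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-split scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder per-split metrics for k = 5 splits (illustrative values only).
per_split = {
    "ACC": [0.70, 0.72, 0.74, 0.76, 0.78],
    "AUC": [0.80, 0.82, 0.84, 0.86, 0.88],
    "Macro F1": [0.60, 0.62, 0.64, 0.66, 0.68],
}
summary = {metric: summarize(values) for metric, values in per_split.items()}

for metric, (mean, std) in summary.items():
    print(f"{metric}: {mean:.3f} +/- {std:.3f}")
```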
Please refer to interpretability_analysis.ipynb, which provides interpretable evidence based on the text descriptions (5x and 20x) that contribute to each prediction.
If you find our work useful in your research, please consider starring this repo and citing our paper:
@article{wong2025few,
title={Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling},
author={Wong, Bryan and Kim, Jong Woo and Fu, Huazhu and Yi, Mun Yong},
journal={arXiv preprint arXiv:2505.17982},
year={2025}
}
This project is based on ViLa-MIL, CLAM, CoOp, PLIP, QuiltNet, and CONCH. We sincerely thank the authors of these excellent works.
If you have any questions, feedback, or issues regarding this project, please reach out to us via email: [email protected] or open an issue on GitHub.

