Official implementation for WWW 2026 paper: "Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning"
Introduction: RGLM is a novel alignment pipeline, termed reconstructive graph instruction tuning. The key idea is to reconstruct graph information from the LLM's graph token outputs, which are largely ignored by current GTokenLLMs.

Framework overview (`assets/images/RGLM.png`): (a) Overview of RGLM. Given an input TAG, RGLM reconstructs the graph information from the LLM's graph token outputs, explicitly incorporating graph supervision to constrain the alignment process. (b) The RGLM-Decoder variant directly reconstructs the raw node features and topology in the input space. (c) The RGLM-Similarizer and (d) RGLM-Denoiser variants reconstruct latent node representations of pre-trained GNNs via a cosine-similarity strategy and a denoising strategy, respectively.
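To make the reconstructive idea concrete, here is an illustrative NumPy sketch of a cosine-similarity alignment loss in the spirit of the Similarizer variant. This is a mock-up under assumed names and shapes, not the repository's actual implementation:

```python
import numpy as np

def cosine_alignment_loss(llm_graph_tokens: np.ndarray,
                          gnn_embeddings: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between the LLM's graph token outputs
    and latent node representations from a pre-trained GNN.
    Both inputs: shape (num_nodes, hidden_dim). Hypothetical helper."""
    a = llm_graph_tokens / np.linalg.norm(llm_graph_tokens, axis=1, keepdims=True)
    b = gnn_embeddings / np.linalg.norm(gnn_embeddings, axis=1, keepdims=True)
    cos = np.sum(a * b, axis=1)       # per-node cosine similarity in [-1, 1]
    return float(np.mean(1.0 - cos))  # 0 when the two spaces are perfectly aligned

# Identical representations incur (near-)zero loss.
x = np.random.default_rng(0).normal(size=(4, 8))
print(cosine_alignment_loss(x, x))
```

Minimizing this term pulls each graph token output toward the corresponding GNN embedding, which is the kind of explicit graph supervision the alignment process otherwise lacks.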
- OS: Linux Ubuntu 5.15.0-102-generic
- CPU: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
- GPU: NVIDIA A800 80GB
Create and activate a conda environment with the required packages:
```shell
# Create conda environment
conda create -n rglm python=3.10
conda activate rglm

# Install dependencies
pip install -r requirements.txt
```

The repository is organized as follows:

```
RGLM/
├── assets/
│   └── images/
│       └── RGLM.png                        # framework overview figure
├── checkpoints/                            # saved model checkpoints
├── dataset/
│   ├── cora/                               # Cora raw/processed files
│   ├── pubmed/                             # Pubmed raw/processed files
│   ├── ogbn-arxiv/                         # OGBN-Arxiv raw/processed files
│   ├── reddit/                             # Reddit raw/processed files
│   ├── preprocess.py                       # dataset preprocessing pipeline
│   ├── preprocess_link.py                  # link-level preprocessing utilities
│   └── laplacian_2_*.pt                    # precomputed Laplacian tensors
├── model/
│   ├── language_model/
│   │   └── reconglm_llama.py               # LLM backbone integration
│   ├── decoder/
│   │   └── decoder.py                      # RGLM-Decoder implementation
│   ├── similarizer/
│   │   └── similarizer.py                  # RGLM-Similarizer implementation
│   ├── denoiser/
│   │   ├── graph_denoiser.py               # RGLM-Denoiser main module
│   │   └── diffusion_utils/                # diffusion process components
│   │       ├── gaussian_diffusion.py
│   │       ├── diffusion_utils.py
│   │       └── respace.py
│   ├── reconglm_arch.py                    # overall RGLM architecture
│   ├── builder.py                          # model construction entry
│   ├── apply_delta.py                      # apply parameter delta
│   ├── make_delta.py                       # build parameter delta
│   └── consolidate.py                      # checkpoint consolidation
├── train/
│   ├── train.py                            # training entry point
│   ├── train_mem.py                        # memory-optimized training
│   ├── reconglm_trainer.py                 # custom trainer for RGLM
│   └── llama_flash_attn_monkey_patch.py    # flash-attention patch
├── eval/
│   ├── eval_pretrain.py                    # pretraining-stage evaluation
│   └── eval_res.py                         # result evaluation/reporting
├── scripts/
│   ├── train_decoder.sh                    # train RGLM-Decoder
│   ├── train_similarizer.sh                # train RGLM-Similarizer
│   ├── train_denoiser.sh                   # train RGLM-Denoiser
│   └── eval.sh                             # evaluation script
├── utils/
│   ├── constants.py
│   ├── conversation.py
│   ├── data_process.py
│   └── utils.py
├── requirements.txt
└── README.md
```
Core workflow:
- Prepare datasets in `dataset/` via the preprocessing scripts.
- Launch variant-specific training with the scripts in `scripts/` or with `train/train.py`.
- Load model components from `model/` (decoder/similarizer/denoiser).
- Evaluate with the `eval/` scripts and store artifacts in `checkpoints/`.
We use the following datasets, all available under MIT license:
- Cora, Pubmed, OGBN-Arxiv: LLaGA Repository
- Reddit: GLBench Repository
For convenience, we provide preprocessed datasets at dataset. Place the downloaded files in the corresponding subdirectories under `dataset/`.
To add your own dataset, first convert it to the LLaGA format (use GLBench_preprocess to build the instruction dataset for node classification, or generate_link_instruction.py to build the instruction dataset for link prediction), then run preprocess.py for node classification or preprocess_link.py for link prediction.
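For orientation, a node-classification instruction record in an LLaGA-style format might look roughly like the following. The field names and prompt wording here are illustrative assumptions, not the exact schema produced by GLBench_preprocess:

```python
import json

# Hypothetical instruction record; field names are illustrative only.
sample = {
    "id": "cora_0",
    "graph": [0, 5, 9, 14],  # center node followed by sampled neighbor ids
    "conversations": [
        {"from": "human",
         "value": "Given a node-centered graph: <graph>, classify the paper."},
        {"from": "gpt", "value": "Neural_Networks"},
    ],
}
print(json.dumps(sample, indent=2))
```

Once your records follow whatever schema the conversion scripts emit, the preprocessing scripts consume them directly.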
The following table summarizes the datasets used in this project:
| Dataset | # Nodes | # Edges | # Classes | Split (train:val:test) | Domain |
|---|---|---|---|---|---|
| Cora | 2,708 | 5,429 | 7 | 6:2:2 | citation |
| Pubmed | 19,717 | 44,338 | 3 | 6:2:2 | citation |
| OGBN-Arxiv | 169,343 | 1,166,243 | 40 | 6:2:3 | citation |
| Reddit | 33,434 | 198,448 | 2 | 1:1:8 | social network |
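The ratio-style splits above translate into node counts as in this small sketch (`split_sizes` is a hypothetical helper for illustration, not part of the repository):

```python
def split_sizes(n_nodes: int, ratio: tuple) -> list:
    """Convert a split ratio like (6, 2, 2) into train/val/test node counts,
    assigning any rounding remainder to the training split."""
    total = sum(ratio)
    sizes = [n_nodes * r // total for r in ratio]
    sizes[0] += n_nodes - sum(sizes)
    return sizes

print(split_sizes(2708, (6, 2, 2)))   # Cora, 6:2:2
print(split_sizes(33434, (1, 1, 8)))  # Reddit, 1:1:8
```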
We provide runnable scripts in scripts/ for training and evaluation.
Before running experiments, activate the environment and move to the project root:
```shell
conda activate rglm
```

Train each variant with:

```shell
# RGLM-Decoder
bash scripts/train_decoder.sh

# RGLM-Similarizer
bash scripts/train_similarizer.sh

# RGLM-Denoiser
bash scripts/train_denoiser.sh
```

The default output checkpoints are saved to:

- `./checkpoints/reconglm_decoder/...`
- `./checkpoints/reconglm_similarizer/...`
- `./checkpoints/reconglm_denoiser/...`
Use the evaluation script with a trained checkpoint path:
```shell
bash scripts/eval.sh /path/to/checkpoint
```

During instruction tuning, we train for one epoch using AdamW with a per-device batch size of 4 and a projector learning rate of 2e-3. Following LLaGA, we upsample the smallest datasets (Cora and Reddit) by replicating their training samples three times to alleviate data imbalance. The warmup ratio is set to 3e-2, the maximum LLM input length is 4096, and for the Neighbor Detail Template we sample 2-hop neighbors with 10 neighbors per hop. During inference, we set the LLM temperature to 0.001 for near-deterministic, reproducible outputs.
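Two of the settings above are easy to make concrete. The helper names below are hypothetical, not functions from the codebase:

```python
import math

def warmup_steps(total_steps: int, warmup_ratio: float = 3e-2) -> int:
    """Number of linear warmup steps implied by a warmup ratio of 3e-2."""
    return math.ceil(total_steps * warmup_ratio)

def upsample(samples: list, factor: int = 3) -> list:
    """Replicate training samples (3x for Cora and Reddit) to ease imbalance."""
    return samples * factor

print(warmup_steps(1000))             # 30 warmup steps for a 1000-step run
print(len(upsample([1, 2, 3])))       # 3 samples replicated 3x -> 9
```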
For LoRA fine-tuning, we use the following hyperparameter grid:
- lora_r: 8, 16, 32
- lora_alpha: 2 * lora_r, 4 * lora_r
- lora_dropout: 0.05, 0.1
- learning_rate: 1e-4, 2e-4, 5e-4
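Enumerating this grid, with `lora_alpha` tied to `lora_r` as stated above, yields 36 configurations; a minimal sketch:

```python
from itertools import product

lora_r = [8, 16, 32]
alpha_mult = [2, 4]                  # lora_alpha = mult * lora_r
lora_dropout = [0.05, 0.1]
learning_rate = [1e-4, 2e-4, 5e-4]

grid = [
    {"lora_r": r, "lora_alpha": m * r, "lora_dropout": d, "learning_rate": lr}
    for r, m, d, lr in product(lora_r, alpha_mult, lora_dropout, learning_rate)
]
print(len(grid))  # 3 * 2 * 2 * 3 = 36 configurations
```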
For the three RGLM variants, the variant-specific search spaces are:
- RGLM-Decoder: $\lambda_{f} \in \{0.1, 0.2, 0.4, 0.6, 0.8, 1.0\}$, $\lambda_{s} \in \{1, 2, 4, 6, 8, 10\}$
- RGLM-Similarizer and RGLM-Denoiser: $\lambda_{l} \in [0.2, 2.0]$ with step size 0.2
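The $\lambda_{l}$ grid can be materialized without floating-point drift, for example:

```python
# lambda_l in [0.2, 2.0] with step 0.2: generate as integer multiples of 0.2
# and round once, so accumulated floating-point error never leaks into names.
lambda_l_grid = [round(0.2 * k, 1) for k in range(1, 11)]
print(lambda_l_grid)  # [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
```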
If you find this work useful, please consider starring 🌟 this repo and citing 📑 our paper:
```bibtex
@article{zhang2026rglm,
  title={Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning},
  author={Zhang, Zhongjian and Wang, Xiao and Zhang, Mengmei and Tan, Jiarui and Shi, Chuan},
  journal={arXiv preprint arXiv:2603.01385},
  year={2026}
}
```