If you find our research helpful, please consider giving us a star to support the latest updates.
News
Coming soon: We will release the Plan2Gen agent implementation, along with full code for analysis and ablation studies, enabling complete reproduction and future extensions of our framework. Stay tuned!
2025.06.07: To further advance AGI-level T2I, we've added a structured summary of key insights to GitHub, including both in-paper highlights and new reflections.
    from datasets import load_dataset

    # Login using e.g. `huggingface-cli login` to access this dataset
    ds = load_dataset("YCZhou/LongBench-T2I")
2025.05.31: We open-sourced the LongBench-T2I dataset and evaluation toolkit on GitHub, now available for the community. Take the LongBench-T2I Challenge!
2025.05.30: We released the paper Draw ALL Your Imagine, a holistic benchmark and agent framework for complex instruction-based image generation. Please check it out for more details!
Citation
If you find our work useful for your research, please kindly cite our paper as follows:
@article{zhou2025draw,
title={Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation},
author={Zhou, Yucheng and Yuan, Jiahao and Wang, Qianning},
journal={arXiv preprint arXiv:2505.24787},
year={2025}
}
Overview
LongBench-T2I is a comprehensive benchmark and agent framework for evaluating and improving complex instruction-based text-to-image generation, pushing toward AGI-level capabilities in controllable visual synthesis.
Evaluation results will be saved as a .jsonl file in the specified --eval_folder, with per-image scores, comments, and an overall statistical summary.
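As a rough illustration of how such a results file could be consumed, the sketch below averages per-image scores from a .jsonl file. The record layout is an assumption: the "scores" key and the dimension names used in the usage note are hypothetical, not the toolkit's documented schema. Lines without a "scores" dictionary (for example, a summary record) are skipped.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def summarize_scores(jsonl_path):
    """Average each numeric entry found under the (assumed) "scores" key."""
    per_dimension = defaultdict(list)
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            scores = record.get("scores")
            if not isinstance(scores, dict):
                continue  # e.g. a summary record without per-image scores
            for dimension, value in scores.items():
                # bool is a subclass of int, so exclude it explicitly
                if isinstance(value, (int, float)) and not isinstance(value, bool):
                    per_dimension[dimension].append(value)
    return {dimension: mean(values) for dimension, values in per_dimension.items()}
```

Calling summarize_scores on a file under the evaluation folder would return a dictionary mapping each score dimension to its mean, assuming the records follow the layout above. Adjust the key names to match the actual output.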
| Argument | Type | Description | Default |
|---|---|---|---|
| … | … | Name of the image generation method. Determines the subdirectory under ./outputs/ to evaluate. | "plan2gen_3" |
| --eval_folder | str | Directory to save evaluation output results (.jsonl format). | "./eval" |
| --object_file | str | Path to the input .jsonl file containing object instruction labels. | "data/instruction.jsonl" |
| --evaluator | str | Evaluation model to use. Choices: "gemini-2.0-flash", "OpenGVLab/InternVL3-78B". | "gemini-2.0-flash" |
| --Gemni_API_Key | List[str] | API key(s) for accessing Gemini models. Multiple keys supported for rotation. | Required (for Gemini) |
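The options above can be mirrored with a small argparse sketch. This is not the toolkit's actual parser. The option that selects the generation method is left out because its flag name is not shown here. The round-robin rotation is an assumption about what "rotation" means, and the helper names (build_parser, parse_args, rotate_keys) are hypothetical.

```python
import argparse
from itertools import cycle


def build_parser():
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the evaluation options")
    parser.add_argument("--eval_folder", type=str, default="./eval",
                        help="Directory to save evaluation results (.jsonl).")
    parser.add_argument("--object_file", type=str, default="data/instruction.jsonl",
                        help="Input .jsonl file with object instruction labels.")
    parser.add_argument("--evaluator", type=str, default="gemini-2.0-flash",
                        choices=["gemini-2.0-flash", "OpenGVLab/InternVL3-78B"],
                        help="Evaluation model to use.")
    # Flag spelling follows the documentation as written.
    parser.add_argument("--Gemni_API_Key", type=str, nargs="+", default=None,
                        help="One or more Gemini API keys.")
    return parser


def parse_args(argv=None):
    """Parse arguments and enforce that Gemini evaluators receive a key."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.evaluator.startswith("gemini") and not args.Gemni_API_Key:
        parser.error("--Gemni_API_Key is required when the evaluator is a Gemini model")
    return args


def rotate_keys(keys):
    """Return an endless round-robin iterator over the keys (assumed policy)."""
    if not keys:
        raise ValueError("at least one API key is needed")
    return cycle(keys)
```

With two keys supplied, successive next() calls on rotate_keys(args.Gemni_API_Key) alternate between them, which spreads requests across keys. A real implementation might instead switch keys only after a rate-limit error.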
Key Insights
Key Insight 1: Diffusion-based vs AR-based Models
AR-based models outperform diffusion-based models in complex instruction-following by offering better structure, coherence, and efficiency, while diffusion models still lead in visual detail and richness.
Key Insight 2: Text Encoder-based vs LLM framework-based Models
LLM framework-based models significantly outperform text encoder-based models, especially in composition, text understanding, and background quality, confirming the advantage of LLM-guided planning in handling complex image generation prompts.
Key Insight 3: Language Understanding ≠ Visual Quality
Surprisingly, higher perplexity sometimes correlates with better image quality, especially in smaller models, revealing a disconnect between language understanding and visual generation.
Case Study Comparison
Instruction | GPT-4o | Plan2Gen (Ours)
Case 0 (IDX: 176)
"The sun hung low behind a stained-glass skylight, casting kaleidoscopic shadows across a kitchen that felt both ancient and impossible. A ripe papaya, its flesh the color of melted amber, lay cut open on a worn wooden cutting board, glistening as if drenched in a slow rain of golden honey. From its core, a single tendril of vapor curled into the air, rising not in heat but in defiance of gravity, drifting sideways into the space where the ceiling should have been. There, it met a hanging chandelier of suspended clockwork orbs, each rotating in counter-orbit to one another, their gears clicking in harmony with the drip of papaya juice into a ceramic bowl that seemed both solid and semi-transparent. Inside the bowl, liquid shimmered and thickened as it was scooped by a silver spoon, which had not been touched by a hand but by a shadowy tendril extending from a nearby wall, its form flickering like a mirage caught in a heatwave. The fly above hovered in place, its wings moving impossibly fast, casting a tiny shadow that danced across the wall and transformed, in turn, into the silhouette of a woman mid-stance in a slow, graceful pirouette. Behind her, a mirror hung askew, reflecting not the kitchen but a jungle of overripe fruit trees, each bearing fruit that pulsed and changed: mangoes into eyes, pineapples into tiny doors. A vine wrapped from the mirror to the clockwork chandelier, anchoring it with tendrils that fed on time itself, their leaves unfurling in perfect synchronization with the heartbeat of the spoon as it dipped into the papaya's pool of nectar. A ceramic cup, empty yet full, sat nearby, its surface etched with the same pattern as the wooden cutting board, the two objects linked by an invisible thread that pulsed with a faint violet glow. A small breeze moved through the room, not from any window or fan, but from the very air as it remembered itself shifting in the absence of time.
The wooden board creaked slightly, reacting to the subtle tremors caused by the spoon's shadowy grip, while the clockwork orbs began to hum in a chord that seemed to stretch both into the future and the past. Somewhere in the periphery, a door creaked open by itself, revealing a corridor that led into a garden of fruit and gears and liquid light, each element alive with purpose and impossible symmetry. In this space, the papaya was not just fruit but a nexus, its juice a conduit for memory, its flesh a map to forgotten worlds. The kitchen, though grounded in familiar objects, was now a threshold, where logic unraveled and reality bent into the beautiful, the bewildering, and the boundless."
Case 1 (IDX: 227)
"In the flickering amber glow of a gas lamp that hovers midair just above the cluttered wooden kitchen table, a weathered leather suitcase remains open, its brass latches frozen in the act of being unlatched, as if time had hesitated in the moment before a journey. Inside, a tangled ball of earbuds is slowly unraveling itself, each wire twisting through a constellation of folded maps, some of which are shifting subtly as if the geography they depict is alive and restless. A half-eaten chocolate bar lies nestled among these relics of travel, its melting sides dripping not into the grain of the wood but upward, as if gravity has momentarily lost interest in this particular corner of the room. A red scarf, threaded with the faint shimmer of liquid silver, emerges from beneath a stack of notebooks, one of which is open and turning its own pages, each sheet writing new lines as it flips, ink blooming like spilled stars from an unseen pen. The coffee cup, left to cool in the corner of the table, has left a circular ring of moisture, not just on the wood but on the glass of the window, where it distorts the blurred silhouette of distant mountains. Outside, the rain does not fall but floats in suspended motion, the droplets reflecting the scene within like ghostly mirrors. The lamp casts long, wavering shadows that stretch toward the ceiling, which is not a ceiling at all but a swirling expanse of sky, where constellations blink in and out in rhythm with the turning pages. The scarf, now caught in a slow spiral of air that rises from the melting chocolate, begins to lift from the table, carrying with it a loose notebook page that floats into the lamp's glow and is briefly consumed by its flickering flame before reappearing crumpled in the center of the suitcase. The coffee, left to sit in silence, begins to ripple without disturbance, forming patterns that mirror the tangled earbuds below it.
The maps continue to shift, their borders dissolving and reforming as though they are deciding the shape of the world in real time, and with each new configuration, the mountains outside subtly change their position and hue. A faint ticking begins in the space between the scarf and the window, like the heartbeat of the room itself, and with each beat, the suitcase seems to pulse as if it is breathing, the leather contracting and expanding in a rhythm that echoes the slow, hypnotic drip of the chocolate."