Equivariant Diffusion Model for Text-Driven Human Motion Generation

This project explores enhancing text-guided Human Motion Generation (HMG) by incorporating rotational and reflectional equivariance into a diffusion-based architecture. We propose Equi-MDM, a modification of the baseline Motion Diffusion Model (MDM), where standard linear layers are replaced with Equivariant Linear (EquiLinear) layers to preserve the symmetry and physical plausibility of generated human motions.

Method Overview

Human motion often exhibits strong structural properties like symmetry, periodicity, and limb coordination. To better capture these patterns, we designed an equivariant diffusion model using EquiLinear layers that enforce SO(3) symmetries throughout the learning and generation process.

Our approach is structured as follows:

Use MDM as the baseline.
Replace its transformer encoder linear layers with EquiLinear layers (based on scalar irreducible representations).
Train and evaluate on the HumanML3D dataset.
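
To make the layer swap concrete, here is a minimal NumPy sketch of one kind of equivariant linear map: a bias-free layer that mixes channels of 3D vectors. It is not the project's EquiLinear implementation. The class name `VectorChannelLinear`, the choice to treat the 22 joints as input channels, and the tensor shapes are all assumptions made for illustration. Because the weights act only on the channel axis, the layer commutes with any rotation or reflection applied to the coordinate axis.

```python
import numpy as np


class VectorChannelLinear:
    """Hypothetical equivariant linear layer: Y[o] = sum_i W[o, i] * X[i].

    There is no bias and no per-coordinate weight, so the map commutes with
    any orthogonal 3x3 matrix (rotation or reflection) applied to the vectors.
    """

    def __init__(self, in_channels, out_channels, rng):
        scale = 1.0 / np.sqrt(in_channels)
        self.weight = rng.normal(0.0, scale, size=(out_channels, in_channels))

    def __call__(self, x):
        # x: (..., in_channels, 3) -> (..., out_channels, 3)
        return np.einsum("oi,...id->...od", self.weight, x)


def random_orthogonal(rng):
    """Random 3x3 orthogonal matrix from a QR decomposition (det is +1 or -1)."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    return q


rng = np.random.default_rng(0)
layer = VectorChannelLinear(in_channels=22, out_channels=64, rng=rng)
x = rng.normal(size=(8, 22, 3))  # 8 frames, 22 joints as channels (assumed layout)
g = random_orthogonal(rng)

# Equivariance check: transforming the input first or the output last
# should give the same result up to floating-point error.
transform_then_layer = layer(x @ g.T)
layer_then_transform = layer(x) @ g.T
equivariance_error = np.abs(transform_then_layer - layer_then_transform).max()
print(f"max equivariance error: {equivariance_error:.2e}")
```

An ordinary linear layer applied to the flattened coordinates would generally fail this check, which is the motivation for replacing it.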

The core idea: ensure that noise prediction functions inherit equivariance from the expert policy, respecting rotations and reflections during generation.
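
Written as a formula, the property reads as follows. This is a sketch under two assumptions not spelled out in the text: a group element $g$ acts on the joint coordinates of the noisy motion $x_t$, and the timestep $t$ and text condition $c$ are left unchanged by that action. The symbols $\epsilon_\theta$ and $G$ are chosen here for illustration.

```latex
\epsilon_\theta(g \cdot x_t,\; t,\; c) \;=\; g \cdot \epsilon_\theta(x_t,\; t,\; c)
\qquad \text{for all } g \in G
```

Here $G$ is the symmetry group in question: rotations alone correspond to $SO(3)$, while rotations together with reflections correspond to $O(3)$.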

Model Architecture

Figure: (Left) Overview of the Equivariant Motion Diffusion Model (Equi-MDM). (Right) Sampling process illustrating denoising over steps.
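
The sampling process in the right panel can be sketched as a standard DDPM-style reverse loop. This is a generic illustration, not the project's sampler. The linear beta schedule, the function names, the `(frames, joints, 3)` shape, and the `toy_denoiser` stand-in are all assumptions; a trained, text-conditioned network would take the place of `toy_denoiser`.

```python
import numpy as np


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed noise schedule; the project may use a different one."""
    return np.linspace(beta_start, beta_end, num_steps)


def sample(denoiser, shape, num_steps, rng):
    """Run the reverse diffusion chain from pure noise down to step 0."""
    betas = linear_beta_schedule(num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.normal(size=shape)  # x_T drawn from a standard Gaussian
    for t in range(num_steps - 1, -1, -1):
        eps_hat = denoiser(x, t)  # predicted noise at step t
        # Mean of x_{t-1} given x_t and the predicted noise
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x = mean + np.sqrt(betas[t]) * rng.normal(size=shape)
        else:
            x = mean  # no noise is added at the final step
    return x


def toy_denoiser(x, t):
    """Placeholder for a trained noise-prediction network (hypothetical)."""
    return 0.1 * x


rng = np.random.default_rng(0)
motion = sample(toy_denoiser, shape=(60, 22, 3), num_steps=50, rng=rng)
print(motion.shape)
```

If the denoiser is equivariant and the initial Gaussian noise is isotropic, each step of this loop preserves the symmetry, which is why the property only needs to be built into the network.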

Quantitative Results

Methods	Matching Score ↓	R-Precision@1 ↑	R-Precision@2 ↑	R-Precision@3 ↑	FID ↓	Diversity ↑
T2M-GPT	3.505 ± 0.017	0.470 ± 0.003	0.659 ± 0.002	0.758 ± 0.002	0.335	-
MMM	3.359 ± 0.009	0.487 ± 0.003	0.683 ± 0.002	0.782 ± 0.001	0.132	-
MoMask	3.353 ± 0.010	0.490 ± 0.004	0.687 ± 0.003	0.786 ± 0.003	0.116	-
MDM (50 steps)	3.640 ± 0.028	0.440 ± 0.007	0.636 ± 0.006	0.742 ± 0.004	0.518	-
MotionDiffuse	3.490 ± 0.023	0.450 ± 0.006	0.641 ± 0.005	0.753 ± 0.005	0.778	-
Ground Truth	3.238 ± 0.006	0.453 ± 0.003	0.657 ± 0.002	0.768 ± 0.002	0.001	9.264
Equi-MDM (500K)	3.363 ± 0.024	0.435 ± 0.005	0.644 ± 0.006	0.758 ± 0.005	0.742	10.109

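Equi-MDM achieves semantic alignment comparable to state-of-the-art models: its R-Precision@3 (0.758) equals that of T2M-GPT and exceeds that of MDM (0.742), and its Matching Score (3.363) is close to that of MoMask (3.353).
Its Diversity (10.109) is higher than that of the ground truth (9.264), which suggests better multi-modality generation; the table reports no Diversity values for the other methods.
FID improvement at higher training steps indicates increased realism; at 500K steps the FID is 0.742.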
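
For readers unfamiliar with the retrieval metrics in the table, the following sketch shows how a matching score and R-Precision@k are commonly computed from paired text and motion embeddings. It assumes the embeddings come from a pretrained evaluator, which is replaced here by random vectors. The function name `retrieval_metrics` and the pool size are illustrative, and benchmark code typically ranks within fixed-size pools rather than the whole set.

```python
import numpy as np


def retrieval_metrics(text_emb, motion_emb, top_k=3):
    """Row i of text_emb and motion_emb form a matched pair, both of shape (N, D).

    Returns (matching_score, r_precision), where r_precision[k - 1] is the
    fraction of texts whose true motion is among the k nearest motions.
    """
    # dist[i, j] = Euclidean distance between text i and motion j
    diff = text_emb[:, None, :] - motion_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Matching score: mean distance between each text and its own motion
    matching_score = float(np.mean(np.diag(dist)))

    # Rank of the true motion among all candidates (0 means nearest)
    order = np.argsort(dist, axis=1)
    ranks = np.argmax(order == np.arange(len(dist))[:, None], axis=1)
    r_precision = [float(np.mean(ranks < k)) for k in range(1, top_k + 1)]
    return matching_score, r_precision


rng = np.random.default_rng(0)
text_emb = rng.normal(size=(32, 16))
# Stand-in for generated motions that track their prompts closely
motion_emb = text_emb + 0.01 * rng.normal(size=(32, 16))

matching_score, r_precision = retrieval_metrics(text_emb, motion_emb)
print(matching_score, r_precision)
```

Because the matching score is a distance, a smaller value means the generated motion sits closer to its text prompt in the evaluator's embedding space.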

Qualitative Results

Symmetric and realistic motion generation.
Improved naturalness for actions like yoga poses, walking, and dancing.
Less frame-wise noise compared to baseline MDM.

Motion Sequences: Symmetric motions generated for text prompts like "person performing tree pose" and "person walking straight then running."

📽️ Video Demonstrations

📽️ Comparison with MDM

🏊 Swim: MDM vs. Equivariant

🌳 Tree Pose: MDM vs. Equivariant

Conclusion

We demonstrated that embedding equivariant inductive biases into diffusion-based human motion generation models improves symmetry, motion fidelity, and semantic alignment. Future work will explore:

Dynamic sequence length prediction.
Extending equivariance to the attention mechanisms.
Applying full symmetry handling (rotation, translation) across broader architectures.

Acknowledgments

This code stands on the shoulders of giants. We thank the authors of the following project, on which our code is based:

Motion Diffusion Model

References

Please cite our project if you find this work helpful.
