MM-Gesture: Towards Precise Micro-Gesture Recognition through Multimodal Fusion

Jihao Gu1, Fei Wang2,5, Kun Li3 πŸ“§, Yanyan Wei2, Zhiliang Wu3, and Dan Guo2,4,5

1 University College London (UCL), Gower Street, London, WC1E 6BT, UK
2 School of Computer Science and Information Engineering, School of Artificial Intelligence, Hefei University of Technology (HFUT)
3 ReLER, CCAI, Zhejiang University, China
4 Key Laboratory of Knowledge Engineering with Big Data (HFUT), Ministry of Education
5 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China

πŸ†Champion Solution for Micro-gesture Classification in 3rd MiGA @ IJCAI 2025


πŸŽ‰ The generated ensemble/prediction.zip represents our final submission, achieving an impressive πŸ† Top-1 Accuracy of 73.213%! 🌟

Figure: overall framework of MM-Gesture.

πŸ“š 0. Table of Contents

  • πŸ“¦ 1. Installation
  • πŸ“‚ 2. Data preparation
  • πŸ‹οΈβ€β™‚οΈ 3. Training & Testing
  • πŸ’₯ 4. Ensemble (Multi-modal Fusion)
  • πŸ™ 5. Acknowledgement
  • πŸ“§ 6. Contact

πŸ“¦ 1. Installation

git clone https://github.com/momiji-bit/MM-Gesture
cd MM-Gesture

πŸ“‚ 2. Data preparation

πŸ”½ 2.1 Download our pre-processed dataset (Recommended)

πŸ” To facilitate your access to our preprocessed video data, you can download it directly from HuggingFace.

πŸ” To comply with the dataset’s usage policy, we have restricted access to the processed files. Please request access through HuggingFace, and we will approve it promptly.

cd dataset 
pip install huggingface_hub
huggingface-cli login
# export HF_ENDPOINT=https://hf-mirror.com  # (Optional) For users in China, enable the mirror
mkdir -p ./iMiGUE_SRTFD
huggingface-cli download Geo2425/iMiGUE_SRTFD --repo-type dataset --local-dir ./iMiGUE_SRTFD
unzip ./iMiGUE_SRTFD/Skeleton.zip -d .
unzip ./iMiGUE_SRTFD/RGB.zip -d .
unzip ./iMiGUE_SRTFD/Taylor.zip -d .
unzip ./iMiGUE_SRTFD/Flow.zip -d .
unzip ./iMiGUE_SRTFD/Depth.zip -d .

mkdir RGB/clips
cp -r RGB/train/* RGB/clips
cp -r RGB/val/* RGB/clips
cp -r RGB/test/* RGB/clips

# rm -r ./iMiGUE_SRTFD
cd ..
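
After extraction, the dataset/ directory should contain the five modality folders plus the merged RGB/clips folder. The snippet below is a minimal sanity check based on the commands above (run it from dataset/; the exact contents of Skeleton/ may be organized differently in your setup):

# Quick sanity check of the expected layout after extraction (sketch only;
# the subfolder list mirrors the unzip/cp commands above).
import os

expected = {
    "Skeleton": [],
    "RGB": ["train", "val", "test", "clips"],
    "Taylor": ["train", "val", "test"],
    "Flow": ["train", "val", "test"],
    "Depth": ["train", "val", "test"],
}

for root, subdirs in expected.items():
    if not os.path.isdir(root):
        print(f"[missing] {root}/")
        continue
    print(f"[ok] {root}/")
    for sub in subdirs:
        path = os.path.join(root, sub)
        count = len(os.listdir(path)) if os.path.isdir(path) else 0
        print(f"  {'[ok]' if count else '[missing/empty]'} {path} ({count} entries)")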

βš™οΈ 2.2 Process dataset by yourself [Optional]

If you've already downloaded the preprocessed data, feel free to skip this step.

cd dataset
mkdir Skeleton RGB Taylor Flow Depth MiGA

2.2.1 Download MiGA'25 Official Dataset (Track 1)

Download here: Kaggle MiGA Challenge Track 1

You just need to download the following files:

  • 1️⃣ imigue_skeleton_phase1.zip β†’ imigue_data_phase1
  • 2️⃣ imigue_rgb_phase1.zip β†’ imigue_rgb_phase1
  • 3️⃣ imigue_skeleton_phase2.zip β†’ imigue_data_phase2 πŸ”’
  • 4️⃣ imigue_rgb_phase2.zip β†’ imigue_rgb_phase2 πŸ”’

Or use these commands to download and unzip:

cd MiGA

# πŸ‹οΈβ€β™‚οΈ Train and Validation dataset
wget https://miga3.a3s.fi/imigue_skeleton_phase1.zip
wget https://miga3.a3s.fi/imigue_rgb_phase1.zip
unzip imigue_skeleton_phase1.zip
unzip imigue_rgb_phase1.zip

# πŸ§ͺ Test dataset
# πŸ”’ Note: Links might expire based on organizer’s access policy.
wget https://miga3.a3s.fi/imigue_skeleton_phase2.zip
wget https://miga3.a3s.fi/imigue_rgb_phase2.zip
unzip imigue_skeleton_phase2.zip
unzip imigue_rgb_phase2.zip

2.2.2 Generate Skeleton Data

To generate the skeleton data, simply run the code provided in the Jupyter notebook:

Open and execute `dataset/tools/processing_Skeleton.ipynb`.
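
For reference, PoseConv3D in PYSKL reads skeleton clips from a single annotation pickle. The sketch below only illustrates the general shape of that conversion with a fabricated loader; the joint layout, file names, and helper functions are placeholders, not the notebook's actual code.

# Illustrative sketch of packing skeleton clips into a PYSKL-style annotation
# pickle. The loader fabricates random joints as a stand-in; the real
# conversion lives in dataset/tools/processing_Skeleton.ipynb.
import pickle
import numpy as np

def load_raw_skeleton(sample_id, num_frames=64, num_joints=17):
    """Placeholder loader: returns (T, V, 2) coordinates and (T, V) scores."""
    kpt = np.random.rand(num_frames, num_joints, 2).astype(np.float32)
    score = np.ones((num_frames, num_joints), dtype=np.float32)
    return kpt, score

def make_annotation(sample_id, label, img_shape=(1080, 1920)):
    kpt, score = load_raw_skeleton(sample_id)
    return dict(
        frame_dir=sample_id,                 # clip identifier
        label=int(label),                    # micro-gesture class id
        img_shape=img_shape,
        original_shape=img_shape,
        total_frames=kpt.shape[0],
        keypoint=kpt[None],                  # (M=1, T, V, 2)
        keypoint_score=score[None],          # (M=1, T, V)
    )

samples = [("clip_0001", 0), ("clip_0002", 3)]   # (id, label) placeholders
data = dict(
    split=dict(train=["clip_0001"], val=["clip_0002"]),
    annotations=[make_annotation(sid, lbl) for sid, lbl in samples],
)
with open("imigue_skeleton_sketch.pkl", "wb") as f:
    pickle.dump(data, f)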

2.2.3 Generate RGB Videos

For RGB video generation, use the provided Jupyter notebook:

Open and execute `dataset/tools/processing_RGB.ipynb`.
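
In essence, this step organizes and re-encodes the official per-clip RGB videos into RGB/train, RGB/val, and RGB/test. The snippet below is a purely illustrative OpenCV sketch of that kind of operation; the source path, target resolution, and codec are assumptions, not the notebook's actual settings.

# Illustrative sketch: re-encode RGB clips at a fixed resolution into a split
# folder. The 256x256 size and paths are assumptions for illustration only.
import os
import cv2

def reencode_clip(src_path, dst_path, size=(256, 256)):
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(cv2.resize(frame, size))
    cap.release()
    writer.release()

src_root, dst_root = "MiGA/imigue_rgb_phase1", "RGB/train"  # example paths
os.makedirs(dst_root, exist_ok=True)
for name in sorted(os.listdir(src_root)):
    if name.endswith(".mp4"):
        reencode_clip(os.path.join(src_root, name), os.path.join(dst_root, name))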

2.2.4 Generate Taylor Videos

To generate Taylor-encoded videos:

cd ../tools

python taylor.py ../RGB/train ../Taylor/train
python taylor.py ../RGB/val ../Taylor/val
python taylor.py ../RGB/test ../Taylor/test
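
Conceptually, a Taylor video replaces raw frames with a truncated temporal Taylor expansion, so static appearance is suppressed and motion terms (first- and higher-order temporal differences) dominate. The numpy sketch below only illustrates that idea; the window length and number of terms are assumptions, and tools/taylor.py is the authoritative implementation.

# Conceptual sketch of Taylor-style temporal encoding: estimate the k-th
# temporal derivative by finite differences over a short window and combine
# the terms with 1/k! weights, as in a truncated Taylor series.
import numpy as np
from math import factorial

def taylor_frame(window, order=2):
    """window: (K, H, W, C) consecutive frames. Returns one encoded frame."""
    diff = window.astype(np.float32)
    terms = np.zeros_like(diff[0])
    for k in range(1, order + 1):
        diff = np.diff(diff, axis=0)          # k-th order forward difference
        terms += diff[0] / factorial(k)       # derivative estimate at the window start
    return terms

# Example: encode a random 8-frame window
frames = np.random.randint(0, 256, size=(8, 112, 112, 3), dtype=np.uint8)
encoded = taylor_frame(frames)
print(encoded.shape)  # (112, 112, 3)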

2.2.5 Generate Optical Flow Videos

We use memflow for optical flow generation.

  1. Setup: Follow memflow’s official instructions to install dependencies and download the pretrained models.
  2. Optimized execution: Use the custom script inference_mp4.py for efficient GPU utilization.
  3. Run the following commands:
python inference_mp4.py \
  --name MemFlowNet \
  --stage things \
  --restore_ckpt ckpts/MemFlowNet_things.pth \
  --input_dir ../../MiGA/RGB/train \
  --output_dir ../../MiGA/Flow/train

python inference_mp4.py \
  --name MemFlowNet \
  --stage things \
  --restore_ckpt ckpts/MemFlowNet_things.pth \
  --input_dir ../../MiGA/RGB/val \
  --output_dir ../../MiGA/Flow/val

python inference_mp4.py \
  --name MemFlowNet \
  --stage things \
  --restore_ckpt ckpts/MemFlowNet_things.pth \
  --input_dir ../../MiGA/RGB/test \
  --output_dir ../../MiGA/Flow/test

2.2.6 Generate Depth Videos

We use Video-Depth-Anything to generate depth videos.

  1. Setup: Follow the official instructions to configure the environment and download the pretrained models.
  2. Optimized execution: Use the custom script run_dir.py for efficient GPU utilization (a generic sketch of this dispatch pattern follows the commands below).
  3. Run the following commands:
# For training data
python3 run_dir.py \
  --input_dir ../../MiGA/RGB/train \
  --output_dir ../../MiGA/Depth/train \
  --encoder vits \
  --grayscale \
  --procs_per_gpu 2

# For validation data
python3 run_dir.py \
  --input_dir ../../MiGA/RGB/val \
  --output_dir ../../MiGA/Depth/val \
  --encoder vits \
  --grayscale \
  --procs_per_gpu 2

# For test data
python3 run_dir.py \
  --input_dir ../../MiGA/RGB/test \
  --output_dir ../../MiGA/Depth/test \
  --encoder vits \
  --grayscale \
  --procs_per_gpu 2
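
Both wrapper scripts (inference_mp4.py and run_dir.py) exist mainly to keep every GPU busy by sharding the per-clip videos across worker processes. The snippet below is a generic sketch of that dispatch pattern; the worker body is a placeholder and this is not the scripts' actual code.

# Generic sketch: shard a directory of .mp4 clips across GPUs with a fixed
# number of worker processes per GPU. run_inference() is a placeholder for
# the wrapped model's per-video inference.
import os
from itertools import cycle
from multiprocessing import Process

def run_inference(video_path, output_dir):
    """Placeholder for the wrapped model's per-video inference."""
    print(f"processing {video_path} -> {output_dir}")

def worker(video_list, output_dir, gpu_id):
    # Pin this worker to one GPU before any model is created.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    for path in video_list:
        run_inference(path, output_dir)

def dispatch(input_dir, output_dir, num_gpus=1, procs_per_gpu=2):
    videos = sorted(os.path.join(input_dir, f)
                    for f in os.listdir(input_dir) if f.endswith(".mp4"))
    num_workers = num_gpus * procs_per_gpu
    shards = [videos[i::num_workers] for i in range(num_workers)]
    gpus = cycle(range(num_gpus))
    procs = [Process(target=worker, args=(shard, output_dir, next(gpus)))
             for shard in shards]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

if __name__ == "__main__":
    dispatch("../../MiGA/RGB/train", "../../MiGA/Flow/train", num_gpus=1, procs_per_gpu=2)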

πŸ‹οΈβ€β™‚οΈ 3. Training & Testing

✨ Pre-trained models are available for download here. πŸ“₯🎯

| Model (Size) | Modality | Link |
| --- | --- | --- |
| PoseConv3D | Joint | Download |
| PoseConv3D | Limb | Download |
| PoseConv3D | RGB+Joint | Download |
| PoseConv3D | RGB+Limb | Download |
| VideoSwinT (Base/Small/Tiny) | RGB | Download |
| VideoSwinT (Small/Tiny) | RGB* | Download |
| VideoSwinT (Base/Small/Tiny) | Taylor | Download |
| VideoSwinT (Base) | Optical Flow | Download |
| VideoSwinT (Base/Small) | Depth | Download |

3.1 PoseConv3D

# Install dependencies
conda env create -f pyskl_environment.yml -y
conda activate pyskl  # Or: source activate pyskl
cd pyskl

Then, run the code in pyskl/RUN.ipynb for training and testing.

3.2 VideoSwinT

# Install dependencies
conda env create -f openmmlab_environment.yml -y
conda activate openmmlab  # Or: source activate openmmlab
cd mmaction2

Then, run the code in mmaction2/RUN.ipynb for training and testing.

πŸ’₯ 4. Ensemble (Multi-modal Fusion)

We provide a script for combining six modalities (Joint, Limb, RGB, Taylor, Optical Flow, Depth) to leverage their complementary strengths and improve accuracy:

  • Run ensemble/ensemble.py to generate the final competition results.
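
The fusion is a score-level (late) ensemble: each single-modality model is tested once, its per-clip class scores are saved, and the final label is the argmax of a weighted sum of those scores. Below is a minimal sketch of that logic; the pickle file names, score format, and weights are placeholders, and ensemble/ensemble.py holds the actual configuration used for the 73.213% submission.

# Minimal sketch of score-level fusion across six modalities. Score pickles
# are assumed to hold arrays of shape (num_clips, num_classes) in a shared
# clip order; file names and weights are placeholders, not the real ones.
import pickle
import numpy as np

score_files = {          # modality -> dumped test scores (placeholder paths)
    "joint":  "scores/joint.pkl",
    "limb":   "scores/limb.pkl",
    "rgb":    "scores/rgb.pkl",
    "taylor": "scores/taylor.pkl",
    "flow":   "scores/flow.pkl",
    "depth":  "scores/depth.pkl",
}
weights = {m: 1.0 for m in score_files}   # placeholder: uniform weights

fused = None
for modality, path in score_files.items():
    with open(path, "rb") as f:
        scores = np.asarray(pickle.load(f), dtype=np.float32)
    # Softmax so that modalities with different logit scales are comparable.
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    scores /= scores.sum(axis=1, keepdims=True)
    fused = weights[modality] * scores if fused is None else fused + weights[modality] * scores

predictions = fused.argmax(axis=1)        # one class id per test clip
np.savetxt("prediction.csv", predictions, fmt="%d")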

πŸ™ 5. Acknowledgement

This codebase is built on the PYSKL and MMAction2 toolboxes. We thank the developers for doing most of the heavy lifting.

If you found this code useful, please consider citing:

@article{gu2025mm,
  title={MM-Gesture: Towards Precise Micro-Gesture Recognition through Multimodal Fusion},
  author={Gu, Jihao and Wang, Fei and Li, Kun and Wei, Yanyan and Wu, Zhiliang and Guo, Dan},
  journal={arXiv preprint arXiv:2507.08344},
  year={2025}
}

@article{guo2024benchmarking,
  title={Benchmarking Micro-action Recognition: Dataset, Methods, and Applications},
  author={Guo, Dan and Li, Kun and Hu, Bin and Zhang, Yan and Wang, Meng},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2024},
  volume={34},
  number={7},
  pages={6238-6252}
}

@misc{2020mmaction2,
  title={OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark},
  author={MMAction2 Contributors},
  howpublished={\url{https://github.com/open-mmlab/mmaction2}},
  year={2020}
}
 

πŸ“§ 6. Contact

For any questions, feel free to contact: Dr. Kun Li (kunli.hfut@gmail.com) and Mr. Jihao Gu (jihao.gu.23@ucl.ac.uk).
