Detects deepfakes by exploiting the natural identity-level homogeneity between voice and face — a cross-modal consistency that deepfake generation breaks.
Harry Cheng1, Yangyang Guo2*, Tianyi Wang3, Qi Li1, Xiaojun Chang4, Liqiang Nie5*
1 School of Computer Science and Technology, Shandong University 2 School of Computing, National University of Singapore 3 Department of Computer Science, The University of Hong Kong 4 Faculty of Engineering and Information Technology, University of Technology Sydney 5 Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen) * Corresponding authors
- Paper: Voice-Face Homogeneity Tells Deepfake
- arXiv: 2203.02195
- Checkpoint (DFDC): Google Drive
- Checkpoint (FakeAVCeleb): Google Drive
- Code Repository: GitHub
## Table of Contents
- Updates
- Introduction
- Highlights
- Method / Framework
- Project Structure
- Checkpoints / Models
- Dataset / Benchmark
- Usage
- TODO
- Citation
- Acknowledgement
- License
## Updates
- [04/2026] Transferred repos to iLearn-Lab
- [2023] Paper published in ACM Transactions on Multimedia Computing, Communications and Applications (ToMM), Vol. 20, Issue 3
## Introduction
This repository is the official implementation of Voice-Face Homogeneity Tells Deepfake, published in ACM ToMM 2023.
Real videos exhibit a natural identity-level homogeneity between a person's voice and face — their vocal and visual characteristics are correlated through shared identity. Deepfake generation typically manipulates only one modality, breaking this natural cross-modal consistency.
VFD (Voice-Face Deepfake detection) detects deepfakes by measuring the matching degree between the voice and face in a video clip. A mismatch signals a potential forgery.
## Highlights
- Exploits voice-face identity homogeneity as a natural, annotation-free detection signal
- Detects audio-visual deepfakes across DFDC, DF-TIMIT, and FakeAVCeleb
- Provides pretrained checkpoints for DFDC and FakeAVCeleb
## Method / Framework
VFD trains a cross-modal matching model to determine whether the voice and face in a video clip belong to the same identity. Real videos produce high consistency scores; deepfakes that manipulate one modality produce a detectable mismatch.
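The decision rule above can be sketched in a few lines. Note this is a minimal illustration, not the repository's actual API: the embeddings are assumed to come from the trained voice and face encoders, and the threshold value is a hypothetical placeholder.

```python
import numpy as np

def match_score(face_emb: np.ndarray, voice_emb: np.ndarray) -> float:
    """Cosine similarity between L2-normalized face and voice embeddings."""
    f = face_emb / np.linalg.norm(face_emb)
    v = voice_emb / np.linalg.norm(voice_emb)
    return float(f @ v)

def predict_fake(face_emb: np.ndarray, voice_emb: np.ndarray,
                 threshold: float = 0.5) -> bool:
    """Flag a clip as fake when voice-face consistency falls below a threshold."""
    return match_score(face_emb, voice_emb) < threshold
```

Real clips should score near the high end of the similarity range; a manipulated modality pulls the score down below the decision threshold.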
## Project Structure

```
.
├── FaceModel/              # Face feature extraction model
├── configs/                # Dataset-specific configuration files
│   ├── DFDC/
│   └── FakeAVCeleb/
├── datasets/               # Dataset class definitions
├── lists/                  # Annotation list files (train/test splits)
├── utils/                  # Utility functions
├── finetune_deepfake.py    # Fine-tuning script
├── pretrain_general.py     # General pretraining script
├── test.py                 # Testing script
├── test_vfd.py             # Main evaluation script
└── README.md
```
## Checkpoints / Models

Download the checkpoints and place them into `./exp/[Dataset]/`:
- DFDC: Google Drive
- FakeAVCeleb: Google Drive
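Placement can be sketched in the shell as follows; the checkpoint filenames are left out since they depend on the downloaded files:

```shell
# Create the expected per-dataset checkpoint directories.
mkdir -p ./exp/DFDC ./exp/FakeAVCeleb
# Then move each downloaded checkpoint into its dataset directory, e.g.:
#   mv <downloaded_checkpoint> ./exp/DFDC/
ls ./exp
```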
## Dataset / Benchmark

Supports DFDC, DF-TIMIT, and FakeAVCeleb:

1. Download the original datasets from their official sources.
2. Extract frames and audio, then organize the annotation files under `./lists/[Dataset]/`:

```
/data/FakeAVCeleb/test/face/RealVideo-RealAudio/African/women/id04245/00001.jpg 0
/data/FakeAVCeleb/test/voice/RealVideo-RealAudio/African/women/id04245/00001.wav 0
```

Format: `<file_path> <label>`, where the label is 0 (real) or 1/2/3 (fake).
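For illustration, a line in this format can be parsed as below; the helper name is mine, not part of the repository:

```python
def parse_annotation_line(line: str) -> tuple[str, int]:
    """Split a '<file_path> <label>' entry; label 0 is real, 1/2/3 are fake variants."""
    path, label = line.strip().rsplit(maxsplit=1)
    return path, int(label)
```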
## Usage

```shell
python test_vfd.py --config ./configs/DFDC/test.yaml
python test_vfd.py --config ./configs/FakeAVCeleb/test.yaml
```

## TODO

- Add training script documentation
- Release DF-TIMIT configuration and checkpoint
## Citation

If you find our paper useful, please cite:

```bibtex
@article{cheng2023voice,
  title={Voice-face homogeneity tells deepfake},
  author={Cheng, Harry and Guo, Yangyang and Wang, Tianyi and Li, Qi and Chang, Xiaojun and Nie, Liqiang},
  journal={ACM Transactions on Multimedia Computing, Communications and Applications},
  volume={20},
  number={3},
  pages={1--22},
  year={2023},
}
```

## Acknowledgement

- Thanks to the creators of DFDC, DF-TIMIT, and FakeAVCeleb for making their datasets available.
- Thanks to our supervisors and collaborators for their support.
## License

This project is released under the Apache License 2.0.