Audio-driven talking head generation is critical for applications such as virtual assistants, video games, and films. Despite progress in this field, challenges remain in producing both consistent and realistic facial animations. Existing methods face three major limitations:
To address these issues, we propose MoDiT, a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer. Our contributions include:
Overview of the Diffusion Transformer Pipeline with temporal and spatial condition injection.
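As a rough, hypothetical illustration of the pipeline idea described above (not the authors' released code), the sketch below denoises a window of 3DMM motion coefficients with a Transformer conditioned on per-frame audio features and the diffusion timestep. All names and dimensions (`MotionDenoiser`, `coeff_dim=70`, `audio_dim=384`, the placeholder noise schedule) are assumptions for illustration only.

```python
# Hypothetical sketch: audio-conditioned diffusion over 3DMM motion
# coefficients. Names, dimensions, and the noise schedule are illustrative
# assumptions, not the MoDiT implementation.
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Predicts the noise added to a window of 3DMM motion coefficients."""
    def __init__(self, coeff_dim=70, audio_dim=384, hidden=512, layers=4):
        super().__init__()
        self.coeff_proj = nn.Linear(coeff_dim, hidden)
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers)
        self.out = nn.Linear(hidden, coeff_dim)

    def forward(self, noisy_coeffs, audio_feats, t):
        # noisy_coeffs: (B, T, coeff_dim); audio_feats: (B, T, audio_dim);
        # t: (B,) integer diffusion timesteps.
        h = self.coeff_proj(noisy_coeffs) + self.audio_proj(audio_feats)
        h = h + self.time_embed(t.float().view(-1, 1, 1) / 1000.0)
        return self.out(self.backbone(h))

# Standard epsilon-prediction training step with a placeholder schedule.
model = MotionDenoiser()
coeffs = torch.randn(2, 32, 70)      # clean 3DMM motion coefficients
audio = torch.randn(2, 32, 384)      # per-frame audio features
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(coeffs)
alpha_bar = torch.rand(2, 1, 1)      # placeholder cumulative schedule values
noisy = alpha_bar.sqrt() * coeffs + (1 - alpha_bar).sqrt() * noise
loss = nn.functional.mse_loss(model(noisy, audio, t), noise)
```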
Illustration of the structural details of the Transformer block.
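To make the "condition injection" in the block figure concrete, here is a minimal sketch of a DiT-style block where a fused condition vector (e.g. audio plus timestep embedding) modulates each sub-layer through adaptive layer norm. The AdaLN mechanism, module names, and sizes are assumptions; MoDiT's actual block may inject temporal and spatial conditions differently (e.g. via cross-attention tokens).

```python
# Hypothetical DiT-style block with AdaLN condition injection; illustrative
# only, not the MoDiT block definition.
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    def __init__(self, dim=512, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        # Map the condition vector to per-block scale/shift/gate parameters.
        self.adaln = nn.Linear(dim, 6 * dim)

    def forward(self, x, cond):
        # x: (B, T, dim) token sequence; cond: (B, dim) fused condition.
        s1, b1, g1, s2, b2, g2 = self.adaln(cond).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)
```

Temporal conditions could equally be injected as extra tokens through cross-attention; the AdaLN variant above is just one common way to feed conditions into every block.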
Comparison with state-of-the-art lip-syncing methods.
Comparison with state-of-the-art methods on the HDTF and VFHQ datasets.
@article{modit2025,
  author  = {Yucheng Wang and Dan Xu},
  title   = {MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation},
  journal = {arXiv preprint arXiv:2507.05092},
  year    = {2025},
}