MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation

The Hong Kong University of Science and Technology
Teaser Image

MoDiT combines a 3D Morphable Model (3DMM) with a Diffusion-based Transformer to generate realistic talking heads from audio and a single reference image.

Abstract

Audio-driven talking head generation is critical for applications such as virtual assistants, video games, and films. Despite progress in this field, challenges remain in producing both consistent and realistic facial animations. Existing methods face three major limitations:

  • Temporal jittering caused by weak temporal constraints, resulting in frame inconsistencies.
  • Identity drift due to insufficient 3D information, leading to poor facial identity preservation.
  • Unnatural blinking behavior due to inadequate modeling of realistic blink dynamics.

To address these issues, we propose MoDiT, a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer. Our contributions include:

  • A hierarchical denoising strategy with revised temporal attention and biased self/cross-attention mechanisms, enabling the model to refine lip synchronization and progressively enhance full-face coherence, effectively mitigating temporal jittering.
  • The integration of 3DMM coefficients to provide explicit spatial constraints, ensuring accurate 3D-informed optical flow prediction and improving lip synchronization by leveraging Wav2Lip outputs, thereby preserving identity consistency.
  • A refined blinking strategy to model natural eye movements, with smoother and more realistic blinking behaviors.
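To make the temporal-attention idea above concrete, the following is a minimal NumPy sketch of self-attention over frames with a distance-based bias on the attention logits, which encourages each frame to attend mostly to its temporal neighbors and thus smooths the predicted motion coefficients. The function name, the Gaussian form of the bias, and the `sigma` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biased_temporal_attention(q, k, v, sigma=2.0):
    """Self-attention across T frames with a temporal-distance bias.

    q, k, v: (T, d) arrays of per-frame motion features.
    A Gaussian penalty on |t_i - t_j| down-weights attention between
    distant frames, favoring temporally coherent outputs.
    (Illustrative sketch only; the bias used in MoDiT may differ.)
    """
    T, d = q.shape
    logits = (q @ k.T) / np.sqrt(d)          # (T, T) frame-to-frame similarity
    t = np.arange(T)
    bias = -((t[:, None] - t[None, :]) ** 2) / (2.0 * sigma ** 2)
    weights = softmax(logits + bias, axis=-1)  # rows sum to 1
    return weights @ v                         # (T, d) smoothed features
```

With a small `sigma`, each output frame becomes close to a local average of its neighbors, which is one simple way to suppress frame-to-frame jitter.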

Methods

Pipeline Image

Overview of the Diffusion Transformer Pipeline with temporal and spatial condition injection.

Transformer Image

Illustration of the structural details of the Transformer block.

Experiments

Comparison Chart

Comparison with state-of-the-art lip-syncing methods.

Comparison Table

Comparison with state-of-the-art methods on the HDTF and VFHQ datasets.

BibTeX

@article{modit2025,
  author    = {Yucheng Wang and Dan Xu},
  title     = {MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation},
  journal   = {arXiv preprint arXiv:2507.05092},
  year      = {2025},
}