GMTalker: Gaussian Mixture-based Audio-Driven
Emotional Talking Video Portraits

arXiv 2024


Yibo Xia†,1, Lizhen Wang2, Xiang Deng2, Xiaoyan Luo✉,1, Yebin Liu2

1Beihang University    2Tsinghua University
✉Corresponding author     †Work done during an internship at Tsinghua University

Abstract


Synthesizing high-fidelity and emotion-controllable talking video portraits, with audio-lip sync, vivid expressions, realistic head poses, and eye blinks, has been an important and challenging task in recent years. Most existing methods struggle to achieve personalized and precise emotion control, smooth transitions between different emotion states, and the generation of diverse motions. To tackle these challenges, we present GMTalker, a Gaussian mixture-based emotional talking portrait generation framework. Specifically, we propose a Gaussian mixture-based expression generator that constructs a continuous and disentangled latent space, enabling more flexible emotion manipulation. Furthermore, we introduce a normalizing flow-based motion generator, pretrained on a large dataset with wide-ranging motions, to generate diverse head poses, blinks, and eyeball movements. Finally, we propose a personalized emotion-guided head generator with an emotion mapping network that can synthesize high-fidelity and faithful emotional video portraits. Both quantitative and qualitative experiments demonstrate that our method outperforms previous methods in image quality, photo-realism, emotion accuracy, and motion diversity.


Emotion Control

Given the same speech, we can generate the target emotional expressions as well as various head motions.


Intensity Control

Our GMTalker can also control the emotion intensity of the generated portrait.


Emotion Manipulation

We can manipulate emotion categories and intensity flexibly by interpolating in our continuous and disentangled Gaussian mixture latent space.
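The snippet below is a minimal, illustrative sketch of this kind of manipulation, assuming each emotion category corresponds to one Gaussian component of the latent space; the latent dimensionality, component means, and helper names are hypothetical placeholders, not our released implementation.

import numpy as np

LATENT_DIM = 64  # assumed latent dimensionality

# Hypothetical per-emotion component means of the Gaussian mixture latent space.
emotion_means = {
    "neutral": np.zeros(LATENT_DIM),
    "happy": np.random.randn(LATENT_DIM),
    "angry": np.random.randn(LATENT_DIM),
}

def blend_emotions(src, dst, alpha):
    """Linearly interpolate between two emotion components (alpha in [0, 1])."""
    return (1.0 - alpha) * emotion_means[src] + alpha * emotion_means[dst]

def scale_intensity(z, intensity):
    """Scale emotion intensity by moving the code toward or away from neutral."""
    return emotion_means["neutral"] + intensity * (z - emotion_means["neutral"])

# Example: a latent code 70% of the way from happy to angry, at half intensity.
z = scale_intensity(blend_emotions("happy", "angry", 0.7), intensity=0.5)
# z would then be decoded, together with the audio features, into 3DMM
# expression coefficients by the expression generator.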


Comparison with Emotion-controllable Methods


We compare our GMTalker with state-of-the-art emotion-controllable methods and several representative methods on the MEAD and CREMA-D datasets.
Our approach excels at generating detailed emotional facial expressions that are faithful to the ground-truth emotion states and motions.

Emotion Interpolation Comparison


We further conduct an emotion interpolation study to compare the continuity and disentanglement of emotion interpolation.
Our method achieves smoother emotion transitions while maintaining emotion accuracy and lip sync.

Comparison with Pose-controllable Methods


To demonstrate the diversity of our generated motions, we compare our GMTalker with several state-of-the-art pose-controllable methods on the LSP dataset.

Method


 

Pipeline of GMTalker. Our framework consists of three parts: (a) In the first part, given the input speech and emotion weight labels, we propose a Gaussian mixture-based expression generator (GMEG) to generate 3DMM expression coefficients by sampling from a Gaussian mixture latent space. (b) In the second part, we introduce a normalizing flow-based motion generator (NFMG) to predict motion coefficients from the audio, including head poses, eye blinks, and gaze. (c) In the third part, we render these coefficients to 3DMM renderings of the target person and then use an emotion-guided head generator with an emotion mapping network (EMN) to synthesize photo-realistic video portraits with a personalized style.
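For readers who prefer code, the following rough sketch shows how the three stages fit together; all interfaces (gmeg.sample, nfmg.predict, renderer, head_generator) are assumed placeholder names for illustration, not the released API.

def generate_talking_portrait(audio_features, emotion_weights,
                              gmeg, nfmg, renderer, head_generator):
    # (a) Expression generator: sample 3DMM expression coefficients from the
    #     Gaussian mixture latent space, conditioned on speech and emotion weights.
    expr_coeffs = gmeg.sample(audio_features, emotion_weights)

    # (b) Motion generator: predict head poses, eye blinks, and gaze from audio
    #     with the normalizing flow-based motion generator.
    pose, blink, gaze = nfmg.predict(audio_features)

    # (c) Render the coefficients to 3DMM renderings of the target person, then
    #     synthesize photo-realistic frames with the emotion-guided head
    #     generator and its emotion mapping network.
    renderings = renderer(expr_coeffs, pose, blink, gaze)
    return head_generator(renderings, emotion_weights)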

 


Demo Video



Citation


@article{xia2024gmtalker,
  title={GMTalker: Gaussian Mixture-based Audio-Driven Emotional Talking Video Portraits},
  author={Xia, Yibo and Wang, Lizhen and Deng, Xiang and Luo, Xiaoyan and Liu, Yebin},
  journal={arXiv preprint arXiv:2312.07669v2},
  year={2024}
}