¹Beihang University  ²Tsinghua University
✉ Corresponding author  † Work done during an internship at Tsinghua University
Synthesizing high-fidelity and emotion-controllable talking video portraits, with audio-lip sync, vivid expressions, realistic head poses, and eye blinks, has been an important and challenging task in recent years. Most existing methods struggle to achieve personalized and precise emotion control, smooth transitions between different emotion states, and diverse motion generation. To tackle these challenges, we present GMTalker, a Gaussian mixture-based framework for generating emotional talking portraits. Specifically, we propose a Gaussian mixture-based expression generator that constructs a continuous and disentangled latent space, enabling more flexible emotion manipulation. Furthermore, we introduce a normalizing flow-based motion generator, pretrained on a large dataset with wide-ranging motions, to generate diverse head poses, blinks, and eyeball movements. Finally, we propose a personalized emotion-guided head generator with an emotion mapping network that synthesizes high-fidelity and faithful emotional video portraits. Both quantitative and qualitative experiments demonstrate that our method outperforms previous methods in image quality, photo-realism, emotion accuracy, and motion diversity.
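To illustrate how a Gaussian mixture latent space can support continuous emotion control, the sketch below blends per-emotion component statistics with user-supplied emotion weights before drawing a latent sample. This is a minimal illustration only: the component count, latent dimension, blending rule, and function names are assumptions and do not describe the paper's actual implementation.

```python
# Illustrative sketch (not the paper's code): sampling an expression latent
# from an emotion-conditioned Gaussian mixture, where emotion weights blend
# the per-emotion component means and variances.
import torch

def sample_expression_latent(emotion_weights: torch.Tensor,
                             means: torch.Tensor,
                             log_vars: torch.Tensor) -> torch.Tensor:
    """Blend per-emotion Gaussian components and draw one latent sample.

    emotion_weights: (K,) non-negative weights over K emotion categories.
    means:           (K, D) mean of each emotion component.
    log_vars:        (K, D) log-variance of each emotion component.
    """
    w = emotion_weights / emotion_weights.sum()      # normalize to a convex combination
    mu = (w[:, None] * means).sum(dim=0)             # blended mean -> continuous emotion control
    var = (w[:, None] * log_vars.exp()).sum(dim=0)   # blended variance (simple moment matching)
    return mu + var.sqrt() * torch.randn_like(mu)    # reparameterized sample

# Example: a latent halfway between emotion 0 (e.g. "happy") and emotion 1 (e.g. "sad").
K, D = 8, 64                                         # hypothetical sizes
means, log_vars = torch.randn(K, D), torch.zeros(K, D)
weights = torch.zeros(K)
weights[0], weights[1] = 0.5, 0.5
z = sample_expression_latent(weights, means, log_vars)
```

Because the weights vary continuously, intermediate emotion states and smooth transitions fall out of interpolating the mixture components rather than switching between discrete emotion labels.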
Pipeline of GMTalker. Our framework consists of three parts: (a) In the first part, given the input speech and an emotion weight label, we propose GMEG to generate 3DMM expression coefficients by sampling from a Gaussian mixture latent space. (b) In the second part, we introduce NFMG to predict motion coefficients from the audio, including head poses, eye blinks, and gaze. (c) In the third part, we render these coefficients into 3DMM renderings of the target person and then use an emotion-guided head generator with EMN to synthesize photo-realistic video portraits with a personalized style.
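The caption's three-stage dataflow can be summarized as the sketch below, where GMEG, NFMG, the 3DMM rendering step, and the emotion-guided head generator are all stubbed with placeholder modules. Every layer choice, tensor shape, and coefficient dimension here is an assumption for illustration and does not reflect the released model.

```python
# Dataflow sketch of the three pipeline stages (placeholder modules, assumed shapes).
import torch
import torch.nn as nn

class GMEG(nn.Module):
    """(a) audio features + emotion weights -> 3DMM expression coefficients (stub)."""
    def __init__(self, audio_dim=80, emo_dim=8, exp_dim=64):
        super().__init__()
        self.net = nn.Linear(audio_dim + emo_dim, exp_dim)
    def forward(self, audio_feat, emo_weights):
        return self.net(torch.cat([audio_feat, emo_weights], dim=-1))

class NFMG(nn.Module):
    """(b) audio features -> motion coefficients: pose (6), blink (1), gaze (2) (stub)."""
    def __init__(self, audio_dim=80, motion_dim=6 + 1 + 2):
        super().__init__()
        self.net = nn.Linear(audio_dim, motion_dim)
    def forward(self, audio_feat):
        return self.net(audio_feat)

class EmotionGuidedHead(nn.Module):
    """(c) 3DMM rendering + emotion code -> photo-realistic frame (stub)."""
    def __init__(self, emo_dim=8, style_dim=16):
        super().__init__()
        self.emotion_mapping = nn.Linear(emo_dim, style_dim)   # EMN stand-in: emotion -> style code
        self.to_rgb = nn.Conv2d(3 + style_dim, 3, kernel_size=3, padding=1)
    def forward(self, rendering, emo_weights):
        style = self.emotion_mapping(emo_weights)               # (B, style_dim)
        style = style[:, :, None, None].expand(-1, -1, *rendering.shape[-2:])
        return self.to_rgb(torch.cat([rendering, style], dim=1))

# Per-frame dataflow: audio drives expressions and motions; the rendering step
# is stubbed with a fixed image feeding the emotion-guided head generator.
audio_feat = torch.randn(1, 80)
emo_weights = torch.softmax(torch.randn(1, 8), dim=-1)
exp_coeffs = GMEG()(audio_feat, emo_weights)
motion_coeffs = NFMG()(audio_feat)
rendering = torch.rand(1, 3, 256, 256)                          # stand-in for the 3DMM rendering
frame = EmotionGuidedHead()(rendering, emo_weights)
```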
@article{xia2024gmtalker,
title={GMTalker: Gaussian Mixture-based Audio-Driven Emotional Talking Video Portraits},
author={Xia, Yibo and Wang, Lizhen and Deng, Xiang and Luo, Xiaoyan and Liu, Yebin},
journal={arXiv preprint arXiv:2312.07669v2},
year={2024}
}