Automatic Head-Gesture Synthesis Using Speech Prosody


E. Sargın, E. Erzin, Y. Yemez, Ç. E. Erdem, T. Erdem, M. Tekalp, M. Özkan
Koç University, Momentum A.S.

State-of-the-art visual speaker animation methods can generate synchronized lip movements automatically from speech content; however, they cannot synthesize speaker gestures automatically from speech. Head and face gestures are usually added manually by artists, which is costly and often looks unrealistic. Hence, automatic and realistic synthesis of speaker gestures from speech, by learning the correlation between the gesture and speech patterns of a speaker, remains a challenging research problem.

We have developed a new method for automatic and realistic synthesis of the head gestures of an avatar from speech prosody. The proposed technique is based on a new framework for joint analysis of the head gesture and speech prosody patterns of a speaker. We first perform a two-stage analysis on a training video sequence to learn the elementary prosody and head gesture patterns of a particular speaker, as well as the correlations between these head gesture and prosody patterns. The resulting audio-visual mapping model is then used to synthesize natural head gestures from arbitrary input test speech, given a head model for the speaker. We represent head gestures by the Euler angles associated with head rotations, and speech prosody by temporal variations in pitch frequency and speech intensity. In the first analysis stage, we perform Hidden Markov Model (HMM) based unsupervised temporal segmentation of the head gesture and speech prosody features separately, to determine the elementary head gesture and speech prosody patterns, respectively. In the second stage, the correlations between these elementary head gesture and prosody patterns are analyzed jointly using Multi-Stream HMMs to obtain an audio-visual mapping model. In the synthesis stage, the audio-visual mapping model is used to predict a sequence of gesture patterns from the prosody pattern sequence computed for the input test speech. The Euler angles associated with each gesture pattern are then applied to animate the speaker head model. Objective and subjective evaluations indicate that the proposed synthesis-by-analysis scheme produces natural-looking head gestures for the speaker from arbitrary input test speech, as well as in "prosody transplant" and "gesture transplant" scenarios.
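The synthesis step described above, predicting a gesture pattern sequence from a prosody pattern sequence, can be illustrated with a much simplified sketch. It is not the Multi-Stream HMM formulation of the paper: it assumes that both streams have already been segmented into integer pattern labels aligned frame by frame, fits a plain discrete HMM by counting (gesture labels as hidden states, prosody labels as observations), and decodes with the Viterbi algorithm. The function names train_mapping and decode_gestures, the add-one smoothing, and the uniform initial distribution are all assumptions made for illustration.

```python
import math


def _normalize(rows):
    """Turn each row of counts into a probability distribution."""
    return [[v / sum(row) for v in row] for row in rows]


def train_mapping(prosody_labels, gesture_labels, n_prosody, n_gesture, smooth=1.0):
    """Estimate P(gesture_t | gesture_{t-1}) and P(prosody_t | gesture_t).

    Both inputs are equal-length lists of integer pattern labels
    (hypothetical output of a prior segmentation stage).
    """
    trans = [[smooth] * n_gesture for _ in range(n_gesture)]
    emit = [[smooth] * n_prosody for _ in range(n_gesture)]
    for p, g in zip(prosody_labels, gesture_labels):
        emit[g][p] += 1
    for g_prev, g_next in zip(gesture_labels, gesture_labels[1:]):
        trans[g_prev][g_next] += 1
    return _normalize(trans), _normalize(emit)


def decode_gestures(prosody_labels, trans, emit):
    """Viterbi decoding of the most likely gesture label sequence.

    Assumes a uniform initial distribution over gesture patterns.
    """
    if not prosody_labels:
        return []
    n = len(trans)
    score = [math.log(1.0 / n) + math.log(emit[g][prosody_labels[0]])
             for g in range(n)]
    backpointers = []
    for p in prosody_labels[1:]:
        prev, score, pointers = score, [], []
        for g in range(n):
            # Best predecessor state for gesture pattern g at this frame.
            best = max(range(n), key=lambda k: prev[k] + math.log(trans[k][g]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][g])
                         + math.log(emit[g][p]))
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda g: score[g])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy training data in which gesture pattern k tends to co-occur with
# prosody pattern k (made-up labels, not data from the recordings).
train_prosody = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
train_gesture = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
trans, emit = train_mapping(train_prosody, train_gesture, n_prosody=2, n_gesture=2)
predicted = decode_gestures([0, 0, 1, 1], trans, emit)
```

In a full system, each predicted gesture label would then be expanded into the Euler angle trajectory associated with that pattern and applied to the head model; that step is omitted here.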

The movie files provided below demonstrate our prosody-driven head gesture animation technique in several application scenarios. The system is trained and tested on audio-visual recordings of four different stories (fairy tales) told by a single person. Two of the stories are used for training and the remaining two for testing. The first movie shows the speaker animated with the original head motion captured from the recordings, for comparison with the second movie, which shows the head motion synthesized automatically by our technique. The other two movies demonstrate the prosody-transplant and gesture-transplant scenarios. In the former, speaker B's speech drives the audio-visual model of speaker A, whereas in the latter, speaker A's speech drives the visual model of speaker B (audio-visual training is based on speaker A). The animations have been generated using the talking head avatar of Momentum Inc.

Movies in .avi format

Acknowledgement Sentence: M. E. Sargin, E. Erzin, Y. Yemez, A. M. Tekalp, A. T. Erdem, C. Erdem, M. Ozkan, "Prosody-Driven Head-Gesture Animation", ICASSP 2007.