StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

Abstract

We introduce StyleFusion-TTS, a zero-shot text-to-speech (TTS) synthesis system with controllable style and speaker identity, conditioned on a text prompt, an audio reference, or both. It is designed to improve on the editability and naturalness of systems in the current research literature. We propose a general front-end encoder, a compact and effective module that takes multimodal inputs (text prompts, audio references, and speaker timbre references) in a fully zero-shot manner and produces disentangled style and speaker control embeddings. Our approach also uses a hierarchical conformer structure to fuse the style and speaker control embeddings, aiming to achieve optimal feature fusion within a current advanced TTS architecture. StyleFusion-TTS is evaluated with multiple subjective and objective metrics. The system shows promising performance across our evaluations, suggesting its potential to advance zero-shot text-to-speech synthesis.

Overall Pipeline

Model overview for StyleFusion-TTS
Front-end general style fusion encoder (GSF-enc) for speaker and style representation and disentanglement
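As a rough illustration of the front-end interface described above, the sketch below shows how one speaker embedding and up to two style sources (text prompt, style audio) could be turned into a pair of control embeddings. Everything in it is an assumption made for illustration: the names ControlEmbeddings, combine_style, and front_end are hypothetical, the embeddings are assumed to be precomputed vectors, and the element-wise average stands in for the learned fusion, which is not specified on this page.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class ControlEmbeddings:
    """Hypothetical container for the style and speaker control signals."""
    style: List[float]
    speaker: List[float]
    control_type: str  # "Text", "Audio", or "Text + Audio"


def combine_style(
    prompt_emb: Optional[Sequence[float]],
    audio_emb: Optional[Sequence[float]],
) -> Tuple[List[float], str]:
    """Select or merge whichever style sources were supplied.

    Assumption: when both sources are present they are averaged
    element-wise. The real encoder learns its fusion; this is a stand-in.
    """
    if prompt_emb is None and audio_emb is None:
        raise ValueError("need a style prompt, a style audio reference, or both")
    if audio_emb is None:
        return list(prompt_emb), "Text"
    if prompt_emb is None:
        return list(audio_emb), "Audio"
    if len(prompt_emb) != len(audio_emb):
        raise ValueError("style embeddings must share one dimension")
    merged = [(p + a) / 2.0 for p, a in zip(prompt_emb, audio_emb)]
    return merged, "Text + Audio"


def front_end(
    speaker_emb: Sequence[float],
    prompt_emb: Optional[Sequence[float]] = None,
    audio_emb: Optional[Sequence[float]] = None,
) -> ControlEmbeddings:
    """Return separate style and speaker embeddings for a downstream TTS model."""
    style, control_type = combine_style(prompt_emb, audio_emb)
    return ControlEmbeddings(style, list(speaker_emb), control_type)


if __name__ == "__main__":
    out = front_end([0.1, 0.2], prompt_emb=[1.0, 3.0], audio_emb=[3.0, 5.0])
    print(out.control_type, out.style)
```

The speaker embedding is kept in its own field and never mixed into the style vector, which mirrors the stated goal of disentangled control; how the two are later fused inside the acoustic model is outside the scope of this sketch.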

Demo

1. Comparison of our proposed method for emotional style control
Sample 1
  Style prompt (Anger): A man speaks with a tone of extreme anger.
  Content text: Its yellow bristles, rather a mane than a head of hair, covered and concealed a lofty brow, evidently made to contain thought.
Sample 2
  Style prompt (Surprised): A man says with a slightly surprised tone
  Content text: I, whom the Kshatriyas know as Kerim Shah, a prince from Iranistan, am no greater a masquerader than most men.
Sample 3
  Style prompt (Terror): There’s an intense terror in the man’s tone.
  Content text: "Why don’t you move the pony?"
Sample 4
  Style prompt (Angry): A man says with a slightly angry tone.
  Content text: "My name is Phoebe Pyncheon." said the girl, with a manner of some reserve.
Sample 5
  Style prompt (Sad): A man says with a moderately sad tone.
  Content text: The patience of the meek Theodosius was provoked; and he dissolved in anger this episcopal tumult.
Sample 6
  Style prompt (Disgusted): A woman says with an extremely disgusted tone.
  Content text: The camp fire was burning brightly when the first guard, having completed its tour of duty, came galloping in.
Sample 7
  Style prompt (Surprise): A woman speaks with a tone of extreme surprise.
  Content text: Nature had been prodigal of her kindness to Gwynplaine.
Sample 8
  Style prompt (Sad): A woman speaks with a tone of extreme sadness.
  Content text: To teach Cosette to read, and to let her play, this constituted nearly the whole of Jean Valjean’s existence.
Sample 9
  Style prompt (Angry): Her tone reflects a tempered but still angry expression.
  Content text: What an instrument is the human voice!
Sample 10
  Style prompt (Happy): A woman says with a moderately happy tone.
  Content text: But things haven't change yet.
Audio samples are provided for each of the following systems:
  EmotiVoice
  OpenVoice
  MMStyleSpeech
  MMTTS
  StyleFusion-TTS control by style prompt reference
  StyleFusion-TTS control by style audio reference
  StyleFusion-TTS control by style prompt and audio reference
2. Comparison of our proposed method with SOTA zero-shot TTS for speaker cloning
Each sample is paired with a speaker timbre audio clip used for enrollment.
Sample 1
  Content text: So Tom saw night as it were broad daylight.
Sample 2
  Content text: These words behind the ears is nonsense.
Sample 3
  Content text: Andy what's the gyre and to gimble.
Sample 4
  Content text: On the twenty second of last march.
Sample 5
  Content text: Born once every one hundred years, dies in flames!
Audio samples are provided for each of the following systems:
  VALL-E
  OpenVoice
  HierSpeech++
  StyleFusion-TTS control by style prompt reference
  StyleFusion-TTS control by style audio reference
  StyleFusion-TTS control by style prompt and audio reference
3. Style control for different emotions
Samples and control types
  Sample 1 (Text + Audio), Sample 2 (Text), Sample 3 (Audio)
    Content text: Born once every one hundred years, dies in flames!
  Sample 4 (Text + Audio), Sample 5 (Text), Sample 6 (Audio)
    Content text: "My name is Phoebe Pyncheon." said the girl, with a manner of some reserve.
Style prompt text for each emotion (each emotion also has a style reference audio)
  Neutral: Synthesize a voice that feels neutral.
  Angry: Synthesize a voice that feels angry.
  Surprise: Synthesize a voice that feels surprise.
  Sad: Set the emotion tone to sad.
  Happy: Set the emotion tone to happy.
  Sleepy: Set the emotion tone to sleepy.
  Disgust: Set the emotion tone to disgust.