StyleFusion-TTS

StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

Abstract

We introduce StyleFusion-TTS, a zero-shot text-to-speech (TTS) synthesis system whose style and speaker characteristics can be controlled by text prompts, audio references, or both, designed to improve on the editability and naturalness of systems in the current research literature. We propose a general front-end encoder as a compact and effective module that takes multimodal inputs (text prompts, audio references, and speaker timbre references) in a fully zero-shot manner and produces disentangled style and speaker control embeddings. Our approach also uses a hierarchical conformer structure to fuse the style and speaker control embeddings, aiming to achieve optimal feature fusion within the current advanced TTS architecture. StyleFusion-TTS is evaluated with multiple subjective and objective metrics. The system shows promising performance across our evaluations, suggesting its potential to advance zero-shot text-to-speech synthesis.

Overall Pipeline

Model overview for StyleFusion-TTS

Front-end general style fusion encoder (GSF-enc) for speaker and style representation and disentanglement
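
The three control modes used in the demo below (text prompt only, audio reference only, or both) can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the names `ControlInputs`, `control_type`, and `encode` are hypothetical, the embeddings are plain lists of floats, and simple averaging stands in for the learned GSF-enc, whose actual fusion is not described on this page. The only points taken from the text are that style can be driven by a prompt and/or an audio reference, and that the style and speaker embeddings are kept separate.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vector = List[float]


@dataclass
class ControlInputs:
    """Hypothetical container for the references fed to the front-end encoder."""
    speaker_timbre: Vector                 # embedding of the speaker timbre audio
    style_prompt: Optional[Vector] = None  # embedding of the text style prompt
    style_audio: Optional[Vector] = None   # embedding of the style reference audio


def control_type(inputs: ControlInputs) -> str:
    """Name the control mode, mirroring the 'Control type' row of the demo."""
    has_text = inputs.style_prompt is not None
    has_audio = inputs.style_audio is not None
    if has_text and has_audio:
        return "Text + Audio"
    if has_text:
        return "Text"
    if has_audio:
        return "Audio"
    raise ValueError("at least one style reference is required")


def encode(inputs: ControlInputs) -> Tuple[Vector, Vector]:
    """Return (style_embedding, speaker_embedding) as two separate vectors.

    Assumption: averaging the available style references is only a
    placeholder for the learned encoder, not the authors' method.
    """
    control_type(inputs)  # raises if no style reference was given
    refs = [r for r in (inputs.style_prompt, inputs.style_audio) if r is not None]
    if any(len(r) != len(refs[0]) for r in refs):
        raise ValueError("style references must have the same dimension")
    style = [sum(vals) / len(refs) for vals in zip(*refs)]
    return style, list(inputs.speaker_timbre)


if __name__ == "__main__":
    both = ControlInputs(
        speaker_timbre=[0.1, 0.9],
        style_prompt=[1.0, 0.0],
        style_audio=[0.0, 1.0],
    )
    print(control_type(both), encode(both))
```

Keeping the speaker embedding out of the averaging step reflects the disentanglement goal: changing the style references in this sketch never alters the speaker vector.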

Demo

1. Comparison of our proposed method with existing methods for emotional style control

	Sample 1	Sample 2	Sample 3	Sample 4	Sample 5	Sample 6	Sample 7	Sample 8	Sample 9	Sample 10
Style Prompt	(Anger) A man speaks with a tone of extreme anger.	(Surprised) A man says with a slightly surprised tone	(Terror) There’s an intense terror in the man’s tone.	(Angry) A man says with a slightly angry tone.	(Sad) A man says with a moderately sad tone.	(Disgusted) A woman says with an extremely disgusted tone.	(Surprise) A woman speaks with a tone of extreme surprise.	(Sad) A woman speaks with a tone of extreme sadness.	(Angry) Her tone reflects a tempered but still angry expression.	(Happy) A woman says with a moderately happy tone.
Content Text	Its yellow bristles, rather a mane than a head of hair, covered and concealed a lofty brow, evidently made to contain thought.	I, whom the Kshatriyas know as Kerim Shah, a prince from Iranistan, am no greater a masquerader than most men.	"Why don’t you move the pony?"	"My name is Phoebe Pyncheon." said the girl, with a manner of some reserve.	The patience of the meek Theodosius was provoked; and he dissolved in anger this episcopal tumult.	The camp fire was burning brightly when the first guard, having completed its tour of duty, came galloping in.	Nature had been prodigal of her kindness to Gwynplaine.	To teach Cosette to read, and to let her play, this constituted nearly the whole of Jean Valjean’s existence.	What an instrument is the human voice!	But things haven't change yet.
EmotiVoice
OpenVoice
MMStyleSpeech
MMTTS
StyleFusion-TTS controlled by style prompt reference
StyleFusion-TTS controlled by style audio reference
StyleFusion-TTS controlled by style prompt and audio reference

2. Comparison of our proposed method with SOTA zero-shot TTS for speaker cloning

	Sample 1	Sample 2	Sample 3	Sample 4	Sample 5
Speaker timbre audio for enrollment
Content Text	So Tom saw night as it were broad daylight.	These words behind the ears is nonsense.	Andy what's the gyre and to gimble.	On the twenty second of last march.	Born once every one hundred years, dies in flames!
VALL-E
OpenVoice
HierSpeech++
StyleFusion-TTS controlled by style prompt reference
StyleFusion-TTS controlled by style audio reference
StyleFusion-TTS controlled by style prompt and audio reference

3. Style control for different emotions

	Style prompt text	Style reference audio	Sample 1	Sample 2	Sample 3	Sample 4	Sample 5	Sample 6
Control type			Text + Audio	Text	Audio	Text + Audio	Text	Audio
Content Text			Born once every one hundred years, dies in flames!	Born once every one hundred years, dies in flames!	Born once every one hundred years, dies in flames!	"My name is Phoebe Pyncheon." said the girl, with a manner of some reserve.	"My name is Phoebe Pyncheon." said the girl, with a manner of some reserve.	"My name is Phoebe Pyncheon." said the girl, with a manner of some reserve.
Neutral	Synthesize a voice that feels neutral.
Angry	Synthesize a voice that feels angry.
Surprise	Synthesize a voice that feels surprise.
Sad	Set the emotion tone to sad.
Happy	Set the emotion tone to happy.
Sleepy	Set the emotion tone to sleepy.
Disgust	Set the emotion tone to disgust.