M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis

Abstract

Zero-shot text-to-speech (TTS) must generate speech from text while preserving the speaker characteristics of a short reference utterance. Recent non-autoregressive systems improve decoding efficiency, but they often bridge the text-speech length mismatch with duration predictors, filler-token padding, or text upsampling, and inject reference speech through sequence-level conditioning. We propose M3-TTS, a multi-modal, multi-stream, and modulation-based diffusion Transformer (DiT) for reference-conditioned masked acoustic generation. Its core design is a native two-stream architecture, termed Self-Adaptive Joint Representation (SAJR), in which text tokens and speech representations keep their original lengths and interact through packed joint self-attention. M3-TTS further uses a joint-to-audio generation pipeline to separate cross-modal correspondence learning from acoustic refinement, together with AdaLN-style acoustic prompting to inject speaker information in a unified acoustic representation space. Experiments show that the FBank-based M3-TTS reaches a word error rate (WER) of 1.52% on English and a character error rate (CER) of 1.35% on Chinese in the 24 kHz Seed-TTS comparison, while the Mel-VAE variant reduces the real-time factor (RTF) to 0.12.
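
The English and Chinese scores quoted above are edit-distance metrics over word and character sequences respectively. The following is a minimal sketch of how such scores are commonly computed, not the evaluation code behind the reported numbers: the function names are hypothetical, the hypothesis transcripts are invented for illustration, and a real evaluation would also need an ASR model and a text-normalization step that this page does not specify.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    return edit_distance(ref, hyp) / len(ref)


def char_error_rate(reference, hypothesis):
    """CER: character-level edit distance, ignoring whitespace."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical ASR transcripts of synthesized audio (illustration only).
wer = word_error_rate("construction requires study and observation",
                      "construction requires study an observation")
cer = char_error_rate("好失望呀", "好希望呀")
print(wer)  # 0.2
print(cer)  # 0.25
```

Corpus-level scores are usually obtained by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance rates.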

Model Architecture
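
The abstract describes text tokens and speech representations that keep their original lengths and interact through joint self-attention. A minimal single-head sketch of that idea follows. It assumes each stream has its own query, key, and value projections and that the two streams are simply concatenated along the sequence axis before attention. All names, dimensions, and random weights are hypothetical, and the actual M3-TTS layers may differ.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(tokens, proj):
    """Apply one stream's own (Wq, Wk, Wv) projections to its tokens."""
    Wq, Wk, Wv = proj
    return ([matvec(Wq, t) for t in tokens],
            [matvec(Wk, t) for t in tokens],
            [matvec(Wv, t) for t in tokens])


def joint_self_attention(text, speech, text_proj, speech_proj):
    """Attend over the concatenation of two streams of different lengths."""
    tq, tk, tv = project(text, text_proj)
    sq, sk, sv = project(speech, speech_proj)
    # Concatenate along the sequence axis: no padding, upsampling, or
    # duration prediction is needed to make the lengths match.
    q, k, v = tq + sq, tk + sk, tv + sv
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    # Split back so each stream keeps its original length.
    return out[:len(text)], out[len(text):]


rng = random.Random(0)
dim = 4
text = random_matrix(rng, 5, dim)     # 5 text tokens (hypothetical)
speech = random_matrix(rng, 12, dim)  # 12 speech frames (hypothetical)
text_proj = tuple(random_matrix(rng, dim, dim) for _ in range(3))
speech_proj = tuple(random_matrix(rng, dim, dim) for _ in range(3))

text_out, speech_out = joint_self_attention(text, speech, text_proj, speech_proj)
print(len(text_out), len(speech_out))  # 5 12
```

Because every query attends to keys from both streams, text-speech correspondence is learned inside attention itself instead of being imposed by an explicit aligner, which is the property the abstract attributes to SAJR.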

Zero-shot TTS Task

Text	Ground truth	F5-TTS	ZipVoice	M3-TTS (Mel-VAE)	M3-TTS (FBank)
It seemed the ordained order of things that dogs should work.
Construction requires study and observation.
She was one of several driven onto the beach.
Why Born Enslaved!
因为我们悄悄走过，所以当时那些惊涛骇浪都烟消云散。
来到河边，蹦豆打开渔网一看，好失望呀。
自动驾驶将大幅提升出行安全，效率。
我们将为全球市场的可持续发展贡献力量。