M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis

Abstract

Non-autoregressive (NAR) text-to-speech (TTS) synthesis relies on length alignment between text sequences and audio representations, which constrains naturalness and expressiveness. Existing methods depend on duration modeling or pseudo-alignment strategies, which limit both naturalness and computational efficiency. We propose M3-TTS, a concise and efficient NAR TTS paradigm based on the multi-modal diffusion transformer (MM-DiT) architecture. M3-TTS employs joint diffusion transformer layers for cross-modal alignment, achieving stable monotonic alignment between variable-length text and speech sequences without requiring pseudo-alignment. Single diffusion transformer layers further enhance acoustic detail modeling. The framework also integrates a mel-VAE codec that provides a 3× training speedup. Experimental results on the Seed-TTS and AISHELL-3 benchmarks show that M3-TTS achieves state-of-the-art NAR performance, with the lowest word error rates (1.36% English, 1.31% Chinese) while maintaining competitive naturalness scores.

arXiv Paper

GitHub Repository

Model Architecture
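
The joint diffusion transformer layers described in the abstract follow the general MM-DiT pattern: each modality keeps its own projection weights, while attention runs over the concatenated text and speech sequence, so no duration model or pre-computed alignment is needed to relate the two lengths. The sketch below illustrates only that idea and is not the authors' implementation. It assumes a single attention head and omits timestep conditioning, normalization, and feed-forward blocks; the function name `joint_attention`, the `params` dictionary, and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(text, speech, params):
    """Single-head joint attention over text and speech tokens.

    Each modality has its own Q/K/V projections (hypothetical names),
    but attention is computed over the concatenated sequence.
    """
    d = text.shape[-1]
    q = np.concatenate([text @ params["text_q"], speech @ params["speech_q"]])
    k = np.concatenate([text @ params["text_k"], speech @ params["speech_k"]])
    v = np.concatenate([text @ params["text_v"], speech @ params["speech_v"]])
    weights = softmax(q @ k.T / np.sqrt(d))  # shape (T + S, T + S)
    out = weights @ v
    n_text = text.shape[0]
    # Split the joint output back into the two modality streams.
    return out[:n_text], out[n_text:], weights

# Toy inputs: 7 text tokens and 40 speech frames of different lengths.
rng = np.random.default_rng(0)
d_model, n_text, n_speech = 16, 7, 40
params = {
    f"{modality}_{proj}": rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    for modality in ("text", "speech")
    for proj in ("q", "k", "v")
}
text = rng.standard_normal((n_text, d_model))
speech = rng.standard_normal((n_speech, d_model))

text_out, speech_out, weights = joint_attention(text, speech, params)

# The speech-to-text block of the attention map is where a text-speech
# alignment would be read off in a trained model.
speech_to_text = weights[n_text:, :n_text]
print(text_out.shape, speech_out.shape, speech_to_text.shape)
```

With random weights the speech-to-text block carries no meaningful alignment; the sketch shows only that variable-length text and speech sequences can attend to each other directly inside one layer.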

Zero-Shot TTS Task

Text	Ground truth	F5-TTS	ZipVoice	M3-TTS (VAE)	M3-TTS (Fbank)
It seemed the ordained order of things that dogs should work.
Construction requires study and observation.
She was one of several driven onto the beach.
Why Born Enslaved!
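因为我们悄悄走过，所以当时那些惊涛骇浪都烟消云散。 (Because we passed by quietly, the stormy waves of that time have all vanished like smoke.)
来到河边，蹦豆打开渔网一看，好失望呀。 (Arriving at the riverbank, Bengdou opened the fishing net, took a look, and was so disappointed.)
自动驾驶将大幅提升出行安全，效率。 (Autonomous driving will greatly improve travel safety and efficiency.)
我们将为全球市场的可持续发展贡献力量。 (We will contribute to the sustainable development of the global market.)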