M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis

Abstract

Non-autoregressive (NAR) zero-shot TTS often relies on rigid duration modeling or pseudo-alignment, which limits prosodic expressiveness and increases computational overhead. We propose M3-TTS, an efficient dual-stage MMDiT (multi-modal DiT) framework that establishes robust cross-modal correspondence. A Self-Adaptive Joint Representation enables implicit monotonic alignment, improving both temporal modeling and pronunciation fidelity. An Adaptive Distribution Modulator provides explicit content-timbre disentanglement, while a Mel-VAE-based Distilled Latent Prior provides spatiotemporal compression, which accelerates training threefold and enables native 44.1 kHz synthesis. Extensive evaluations on Seed-TTS-eval show that M3-TTS attains state-of-the-art performance, achieving the lowest WER: 1.36% for English and 1.33% for Chinese.
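
For readers unfamiliar with the WER metric quoted above, the following is a minimal sketch of how a word error rate is typically computed: the word-level edit distance between a reference text and a transcript of the synthesized audio, divided by the number of reference words. The function name `word_error_rate`, the lowercase-and-split normalization, and the example hypothesis string are assumptions made for illustration; this is not the Seed-TTS-eval scoring script, and for Chinese the error rate is commonly computed over characters rather than whitespace-separated words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcript with one substitution and one deletion
# against a 5-word reference.
example = word_error_rate(
    "Construction requires study and observation",
    "construction requires steady observation",
)
print(example)  # 0.4
```

In practice the hypothesis would come from an ASR model run on the generated audio, and a corpus-level WER sums the edit distances over all utterances before dividing by the total number of reference words.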

Model Architecture

Zero-shot TTS Task

Text | Ground truth | F5-TTS | ZipVoice | M3-TTS (VAE) | M3-TTS (Fbank)
It seemed the ordained order of things that dogs should work.
Construction requires study and observation.
She was one of several driven onto the beach.
Why Born Enslaved!
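因为我们悄悄走过,所以当时那些惊涛骇浪都烟消云散。 (English: Because we passed through quietly, the raging storms of that time all vanished like smoke.)
来到河边,蹦豆打开渔网一看,好失望呀。 (English: Arriving at the riverside, Bengdou opened the fishing net, took a look, and was so disappointed.)
自动驾驶将大幅提升出行安全,效率。 (English: Autonomous driving will greatly improve travel safety and efficiency.)
我们将为全球市场的可持续发展贡献力量。 (English: We will contribute to the sustainable development of the global market.)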