Abstract
Non-autoregressive (NAR) zero-shot TTS often relies on rigid duration modeling or pseudo-alignment, which limits prosodic expressiveness and increases computational overhead. We propose M3-TTS, an efficient dual-stage MMDiT framework that establishes robust cross-modal correspondence. A Self-Adaptive Joint Representation enables implicit monotonic alignment, improving both temporal modeling and pronunciation fidelity. An Adaptive Distribution Modulator ensures explicit content-timbre disentanglement, while a Mel-VAE-based Distilled Latent Prior provides spatiotemporal compression, accelerating training threefold and enabling native 44.1 kHz synthesis. Extensive evaluations on Seed-TTS-eval show that M3-TTS attains state-of-the-art performance, achieving the lowest WER: 1.36% for English and 1.33% for Chinese.
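For readers unfamiliar with the WER metric quoted above, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and this is not the benchmark's actual scoring pipeline, which also involves an ASR model and text normalization that are not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "she was one of several driven onto the beach"
    hyp = "she was one of several driven onto a beach"
    print(f"WER = {word_error_rate(ref, hyp):.4f}")
```

Whitespace tokenization only makes sense for languages that separate words with spaces. Chinese text is commonly scored per character instead, which this sketch does not cover.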
Model Architecture
Zero-Shot TTS Task
| Text | Ground truth | F5-TTS | ZipVoice | M3-TTS (VAE) | M3-TTS (Fbank) |
|---|---|---|---|---|---|
| It seemed the ordained order of things that dogs should work. | |||||
| Construction requires study and observation. | |||||
| She was one of several driven onto the beach. | |||||
| Why Born Enslaved! | |||||
| 因为我们悄悄走过,所以当时那些惊涛骇浪都烟消云散。 | |||||
| 来到河边,蹦豆打开渔网一看,好失望呀。 | |||||
| 自动驾驶将大幅提升出行安全,效率。 | |||||
| 我们将为全球市场的可持续发展贡献力量。 | |||||