Abstract
Non-autoregressive (NAR) text-to-speech synthesis must align text sequences with audio representations of a different length, and existing methods handle this with explicit duration models or pseudo-alignment strategies that limit both naturalness and computational efficiency. We propose M3-TTS, a concise and efficient NAR TTS paradigm built on the multi-modal diffusion transformer (MM-DiT) architecture. M3-TTS uses joint diffusion transformer layers for cross-modal alignment, achieving stable monotonic alignment between variable-length text and speech sequences without any pseudo-alignment step; single-modality diffusion transformer layers then refine acoustic detail. The framework also integrates a Mel-VAE codec that yields a 3× training speed-up. On the Seed-TTS and AISHELL-3 benchmarks, M3-TTS achieves state-of-the-art NAR performance with the lowest word error rates (1.36% English, 1.31% Chinese) while maintaining competitive naturalness scores.
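The joint diffusion transformer layers described above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: it uses single-head attention with identity projections in place of learned Q/K/V weights, and NumPy in place of a deep-learning framework. The core idea it shows is the MM-DiT-style joint block: text tokens and speech-latent frames of different lengths are concatenated into one sequence, a single attention pass lets each modality attend to the other, and the outputs are split back into their streams.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(text, audio):
    """Toy MM-DiT-style joint block: one attention pass over the
    concatenated text and audio streams, then split back apart.
    Identity projections stand in for learned Wq/Wk/Wv (assumption)."""
    x = np.concatenate([text, audio], axis=0)      # (Lt + La, d)
    d = x.shape[-1]
    q, k, v = x, x, x                              # illustrative only
    attn = softmax(q @ k.T / np.sqrt(d))           # cross-modal attention
    out = attn @ v
    return out[: len(text)], out[len(text):]       # split per modality

# Toy shapes: 4 text tokens, 6 speech-latent frames, dim 8.
# Note the two streams need not have equal length.
rng = np.random.default_rng(0)
t, a = rng.normal(size=(4, 8)), rng.normal(size=(6, 8))
t_out, a_out = joint_attention(t, a)
print(t_out.shape, a_out.shape)  # (4, 8) (6, 8)
```

Because the attention is computed over the concatenation, no duration predictor or pseudo-alignment is needed to relate the two variable-length sequences; the model learns the monotonic text-to-speech alignment inside the attention map itself.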
arXiv Paper | GitHub Repository

Model Architecture
Zero Shot TTS Task
| Text | Ground Truth | F5-TTS | ZipVoice | M3-TTS (VAE) | M3-TTS (Fbank) |
|---|---|---|---|---|---|
| It seemed the ordained order of things that dogs should work. | |||||
| Construction requires study and observation. | |||||
| She was one of several driven onto the beach. | |||||
| Why Born Enslaved! | |||||
| 因为我们悄悄走过,所以当时那些惊涛骇浪都烟消云散。 (Because we passed by quietly, the stormy waves of that time all vanished into thin air.) | |||||
| 来到河边,蹦豆打开渔网一看,好失望呀。 (Arriving at the riverside, Bengdou opened the fishing net, took one look, and was so disappointed.) | |||||
| 自动驾驶将大幅提升出行安全、效率。 (Autonomous driving will greatly improve travel safety and efficiency.) | |||||
| 我们将为全球市场的可持续发展贡献力量。 (We will contribute to the sustainable development of the global market.) |