Building voice AI for Malaysia
ConvoTTS is a closed-source multilingual text-to-speech system purpose-built for Malaysia and Southeast Asia. Our architecture draws from recent breakthroughs in tokenizer-free speech synthesis, diffusion transformers, and large language model scaling to deliver studio-quality 48kHz audio across five languages natively spoken in Malaysia.
Five languages are currently stable. We are actively extending coverage to additional languages that the underlying language model already understands, though these remain experimental.
Built on open research, optimized for Malaysia
ConvoTTS is not a wrapper around a single model. Our architecture synthesizes ideas from multiple research directions in modern speech synthesis, combining them into a system specifically tuned for Malaysian multilingual requirements.
Tokenizer-free continuous modeling
Drawing from advances in continuous speech representation, our system bypasses discrete tokenization to preserve acoustic richness and expressivity across tonal and non-tonal languages.
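ConvoTTS itself is closed-source, but the trade-off this paragraph describes can be illustrated with a toy sketch: pushing continuous acoustic frames through a small discrete codebook (as a tokenizer-based pipeline would) loses detail that a continuous path preserves. The frames, codebook size, and dimensions below are arbitrary illustrations, not our actual representation.

```python
import random

random.seed(0)

def quantize(frame, codebook):
    # Discrete tokenization: snap a continuous frame to its nearest codebook entry.
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(frame, c)))

def mse(a, b):
    # Mean squared error between two sequences of frames.
    return sum(sum((x - y) ** 2 for x, y in zip(fa, fb))
               for fa, fb in zip(a, b)) / len(a)

# Toy "acoustic frames": 4-dimensional continuous vectors.
frames = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]

# A small discrete codebook, standing in for a speech tokenizer.
codebook = [[random.gauss(0, 1) for _ in range(4)] for _ in range(16)]

discrete_error = mse(frames, [quantize(f, codebook) for f in frames])
continuous_error = mse(frames, frames)  # the continuous path keeps frames intact

print(f"discrete round-trip error:   {discrete_error:.3f}")
print(f"continuous round-trip error: {continuous_error:.3f}")
```

The quantization error is exactly the "bottleneck" a tokenizer-free design avoids: subtle pitch and timbre cues that matter for tonal languages survive in the continuous representation.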
Diffusion autoregressive generation
Informed by diffusion transformer research, our generation pipeline combines the stability of autoregressive decoding with the quality of diffusion-based acoustic modeling.
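The shape of this hybrid can be sketched in a few lines: an autoregressive outer loop emits one frame at a time, and each frame is produced by an inner iterative-denoising loop conditioned on the previous frame. The denoiser here is a trivial stand-in for a learned diffusion transformer, and the 0.9 conditioning factor is a made-up toy dynamic.

```python
def denoise_step(x, target, t, total):
    # Toy denoiser: pull the noisy sample a fraction of the way toward the
    # conditional target; stands in for one learned diffusion step.
    alpha = 1.0 / (total - t)
    return [xi + alpha * (ti - xi) for xi, ti in zip(x, target)]

def sample_frame(cond, steps=8):
    # Inner diffusion loop: start from an arbitrary state and iteratively
    # denoise toward a target conditioned on the previous frame.
    x = [0.0] * len(cond)
    target = [0.9 * c for c in cond]  # hypothetical conditional mean
    for t in range(steps):
        x = denoise_step(x, target, t, steps)
    return x

# Autoregressive outer loop: each generated frame conditions the next.
frame = [1.0, 1.0, 1.0, 1.0]
frames = []
for _ in range(5):
    frame = sample_frame(frame)
    frames.append(frame)

print(len(frames), "frames generated")
```

The outer loop gives the stability of left-to-right decoding; the inner loop gives each frame a multi-step refinement rather than a single forward pass.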
Flow matching for natural prosody
Inspired by flow matching techniques in multilingual TTS, our prosody model captures the natural rhythm and intonation patterns specific to each Malaysian language.
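Flow matching generates a sample by integrating a learned velocity field from noise to data. The sketch below integrates the straight-line conditional path with Euler steps to produce a toy pitch contour; the target contour values and the closed-form velocity are illustrative assumptions, not our trained model.

```python
import random

random.seed(2)

def sample_contour(target, steps=100):
    # Euler integration of a toy velocity field from noise to a target pitch
    # contour. v(x, t) = (target - x) / (1 - t) traces the straight-line
    # conditional path used in flow matching.
    dt = 1.0 / steps
    x = [random.gauss(0, 1) for _ in target]
    for i in range(steps):
        t = i * dt
        v = [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical F0 targets (Hz) for three syllables.
contour = sample_contour([120.0, 130.0, 125.0])
print([round(f, 2) for f in contour])
```

In the real setting, the velocity field is a network conditioned on text and language, so the same integration procedure yields language-specific rhythm and intonation.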
LLM backbone for multilingual understanding
Our language model backbone leverages scaling research from large language models, enabling deep understanding of text semantics across Malay, English, Mandarin, Tamil, and Hindi.
Neural audio codec for high fidelity
Our audio encoder-decoder is informed by neural audio codec research, enabling 48kHz studio-quality reconstruction from compact latent representations.
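The encode/decode contract of a neural codec can be illustrated with a deliberately crude stand-in: an "encoder" that averages every few samples into one latent and a "decoder" that expands latents back to the waveform length. The hop size and signal below are arbitrary; a real neural codec replaces both functions with learned networks.

```python
import math

def encode(samples, hop=4):
    # Toy encoder: average each hop of samples into one latent value,
    # giving a hop-times more compact representation.
    return [sum(samples[i:i + hop]) / hop for i in range(0, len(samples), hop)]

def decode(latents, hop=4):
    # Toy decoder: hold each latent for `hop` samples.
    return [z for z in latents for _ in range(hop)]

sr = 48_000  # the 48kHz studio-quality target
# 10 ms of a 220 Hz tone as a stand-in waveform.
tone = [math.sin(2 * math.pi * 220 * n / sr) for n in range(sr // 100)]

latents = encode(tone)
recon = decode(latents)
print(f"compression: {len(tone)} samples -> {len(latents)} latents "
      f"({len(tone) // len(latents)}:1)")
```

The point of the sketch is the interface, not the quality: generation happens in the compact latent space, and the decoder is responsible for reconstructing full-bandwidth 48kHz audio from it.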
Conversational speech patterns
Guided by research in long-form conversational TTS, our system handles natural turn-taking, speaker consistency, and expressive dialogue generation.
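One piece of this, speaker consistency across turns, reduces to a simple invariant: every turn from the same speaker must resolve to the same voice identity. The sketch below shows that bookkeeping over a toy dialogue script; the `Turn` structure, voice ids, and speaker names are hypothetical illustrations.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str
    text: str

# Hypothetical per-speaker voice profiles keep timbre consistent across turns.
VOICES = {"host": "voice_a", "guest": "voice_b"}

def render_dialogue(turns):
    # Walk the script in order (turn-taking) and attach each speaker's fixed
    # voice id (speaker consistency) to every utterance.
    return [(VOICES[t.speaker], t.text) for t in turns]

script = [
    Turn("host", "Selamat datang! How was your week?"),
    Turn("guest", "Busy, but good. Terima kasih for asking."),
    Turn("host", "Let's dive in."),
]
plan = render_dialogue(script)
print(plan[0])
```

In the full system, the voice id maps to a speaker embedding that conditions generation, so long-form dialogue never drifts between voices mid-conversation.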
Research References
Our work is informed by the following published research. We gratefully acknowledge these contributions to the field of speech synthesis.
Informed our tokenizer-free continuous speech modeling approach, enabling richer expressivity without discrete tokenization bottlenecks.
Guided our conversational speech synthesis design, particularly multi-speaker handling and long-form generation stability.
Inspired our flow matching techniques for natural prosody generation and multilingual zero-shot voice synthesis.
Shaped our diffusion transformer architecture for high-quality continuous speech token generation.
Informed our non-autoregressive inference strategies for efficient, low-latency speech generation.
Our language model backbone draws from the Qwen2.5 series scaling principles, particularly the 2B parameter class, for efficient multilingual text understanding and generation.
Guided our audio-language integration design for cross-modal understanding between text semantics and speech acoustics.
Foundation for our neural audio codec design, enabling high-fidelity 48kHz audio reconstruction from compact latent representations.
Ready to try?
No login required. Generate speech in seconds.