Our Mission

Building voice AI
for Malaysia

ConvoTTS is a closed-source multilingual text-to-speech system purpose-built for Malaysia and Southeast Asia. Our architecture draws from recent breakthroughs in tokenizer-free speech synthesis, diffusion transformers, and large language model scaling to deliver studio-quality 48kHz audio across five languages natively spoken in Malaysia.

Five languages are currently stable. We are actively scaling to additional languages already covered by the underlying language model, though some of these remain experimental.

Built on open research, optimized for Malaysia

ConvoTTS is not a wrapper around a single model. Our architecture synthesizes ideas from multiple research directions in modern speech synthesis, combining them into a system specifically tuned for Malaysian multilingual requirements.

Tokenizer-free continuous modeling

Drawing from advances in continuous speech representation, our system bypasses discrete tokenization to preserve acoustic richness and expressivity across tonal and non-tonal languages.

Diffusion autoregressive generation

Informed by diffusion transformer research, our generation pipeline combines the stability of autoregressive decoding with the quality of diffusion-based acoustic modeling.

Flow matching for natural prosody

Inspired by flow matching techniques in multilingual TTS, our prosody model captures the natural rhythm and intonation patterns specific to each Malaysian language.
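For intuition on the flow-matching objective referenced here, a minimal conditional flow-matching training step can be sketched in NumPy: sample noise, interpolate linearly toward a data frame, and regress a velocity field. This is an illustrative sketch of the general technique, not ConvoTTS's actual model or training code; the toy linear "model" and feature dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_step(x1, model, rng):
    """One conditional flow-matching step on a batch of feature
    frames x1 of shape (batch, dim). Illustrative only."""
    x0 = rng.standard_normal(x1.shape)        # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))    # per-example time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1              # point on the linear path
    target_v = x1 - x0                        # constant velocity along that path
    pred_v = model(xt, t)
    return np.mean((pred_v - target_v) ** 2)  # regression loss on the velocity field

# Stand-in "model": a fixed linear map from (x_t, t) to a velocity.
dim = 8
W = rng.standard_normal((dim + 1, dim)) * 0.1
def toy_model(xt, t):
    return np.concatenate([xt, t], axis=1) @ W

x1 = rng.standard_normal((16, dim))  # pretend prosody feature frames
loss = cfm_training_step(x1, toy_model, rng)
```

At inference, the learned velocity field is integrated from noise to a sample in a handful of ODE steps, which is what makes flow matching attractive for low-latency prosody generation.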

LLM backbone for multilingual understanding

Our language model backbone leverages scaling research from large language models, enabling deep understanding of text semantics across Malay, English, Mandarin, Tamil, and Hindi.

Neural audio codec for high fidelity

Our audio encoder-decoder is informed by neural audio codec research, enabling 48kHz studio-quality reconstruction from compact latent representations.
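For intuition on the DAC-style codec family cited in the references, here is a minimal NumPy sketch of residual vector quantization, where each codebook stage quantizes the residual left by the previous stage. Codebook sizes and dimensions are illustrative assumptions, not ConvoTTS's actual codec.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Residual VQ: quantize x of shape (batch, dim) with a stack of
    codebooks, each refining the residual of the previous stage."""
    residual = x.copy()
    quantized = np.zeros_like(x)
    codes = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)    # nearest codeword per example
        quantized += cb[idx]      # accumulate the reconstruction
        residual -= cb[idx]       # pass the leftover to the next stage
        codes.append(idx)
    return quantized, np.stack(codes, axis=1)

dim, n_books, book_size = 16, 4, 64
codebooks = [rng.standard_normal((book_size, dim)) for _ in range(n_books)]
x = rng.standard_normal((8, dim))  # pretend latent audio frames
xq, codes = rvq_encode(x, codebooks)
```

A trained codec learns the codebooks jointly with the encoder and decoder; the stacked discrete codes form the compact latent representation from which the waveform is reconstructed.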

Conversational speech patterns

Guided by research in long-form conversational TTS, our system handles natural turn-taking, speaker consistency, and expressive dialogue generation.

Research References

Our work is informed by the following published research. We gratefully acknowledge these contributions to the field of speech synthesis.

Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
Zhou et al.·ICLR 2026·2025

Informed our tokenizer-free continuous speech modeling approach, enabling richer expressivity without discrete tokenization bottlenecks.

VibeVoice: Long-Form Multi-Speaker Conversational TTS
Microsoft Research·arXiv·2025

Guided our conversational speech synthesis design, particularly multi-speaker handling and long-form generation stability.

CosyVoice: Scalable Multilingual Zero-shot TTS via Supervised Semantic Tokens
Du et al.·arXiv·2024

Inspired our flow matching techniques for natural prosody generation and multilingual zero-shot voice synthesis.

DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
Liu et al.·arXiv·2025

Shaped our diffusion transformer architecture for high-quality continuous speech token generation.

F5-TTS: Flow Matching with Diffusion Transformer for Speech Synthesis
Chen et al.·ICLR 2025·2024

Informed our non-autoregressive inference strategies for efficient, low-latency speech generation.

Qwen2.5 Technical Report
Yang et al., Alibaba·arXiv·2024

Our language model backbone draws from the Qwen2.5 series scaling principles, particularly the 2B parameter class, for efficient multilingual text understanding and generation.

Qwen2-Audio: Advancing Audio-Language Understanding
Chu et al., Alibaba·arXiv·2024

Guided our audio-language integration design for cross-modal understanding between text semantics and speech acoustics.

High-Fidelity Audio Compression with Improved RVQGAN (DAC)
Kumar et al., Descript·NeurIPS 2023

Foundation for our neural audio codec design, enabling high-fidelity 48kHz audio reconstruction from compact latent representations.

Technical Specifications

Architecture: Closed-source, proprietary
Audio Output: 48kHz studio quality
Latency: ~3-7 seconds for typical requests
Max Text: 5,000 characters per request
Voice Cloning: 3-30 seconds reference audio
Voice Design: Natural language description
Streaming: Real-time chunked audio
Languages (Stable): Malay, English, Mandarin, Tamil, Hindi
Languages (Scaling): Additional languages in progress
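The streaming API itself is not documented here, so the following is a hypothetical client-side sketch only: it shows how chunked 16-bit PCM at the stated 48kHz output rate could be reassembled into a waveform. The chunk format and the `assemble_pcm_chunks` helper are assumptions for illustration, not the real API.

```python
import numpy as np

SAMPLE_RATE = 48_000  # matches the 48kHz output spec above

def assemble_pcm_chunks(chunks):
    """Concatenate streamed 16-bit little-endian PCM chunks into one
    float waveform and report its duration in seconds. The network
    transport is out of scope; chunks arrive as raw bytes."""
    pcm = b"".join(chunks)
    audio = np.frombuffer(pcm, dtype="<i2").astype(np.float32) / 32768.0
    return audio, len(audio) / SAMPLE_RATE

# Simulate three 100 ms chunks of silence arriving from a stream.
chunk = np.zeros(4800, dtype="<i2").tobytes()
audio, duration = assemble_pcm_chunks([chunk] * 3)  # duration == 0.3 s
```

In a real client, each chunk would be handed to an audio output device as it arrives rather than buffered to completion, which is what makes chunked streaming feel real-time.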

Supported Languages

🇲🇾 Malay (Bahasa Melayu): Stable
🇬🇧 English: Stable
🇨🇳 Mandarin (中文): Stable
🇮🇳 Tamil (தமிழ்): Stable
🇮🇳 Hindi (हिन्दी): Stable

Team

Isarar Siddique — Lead Developer
Raiyan Siddique — Assist

A product of

Sarinder Labs

Building AI infrastructure for Southeast Asia

Ready to try?

No login required. Generate speech in seconds.