TTS Models - Inworld AI Documentation

Inworld provides a family of state-of-the-art TTS models, optimized for different use cases, quality levels, and performance requirements.

Realtime TTS-2

Our flagship, top-ranked model — the best choice for production

Best quality and steerability, with natural language steering for more contextually aware speech
Support for 200+ languages and locales
Ultra-low latency (~120 ms median) at high concurrency
High-quality instant voice cloning
Enhanced timestamps with phonetic details and visemes

Realtime TTS 1.5 Max

Rich, expressive speech with maximum stability

Support for 15 languages
Optimized for real-time use (<200 ms median latency)
High-quality instant voice cloning

Realtime TTS 1.5 Mini

Our most cost-efficient model — for English workloads where price is the top priority

Lowest cost per character
Best suited to English; use TTS-2 for other languages
High-quality instant voice cloning

Models overview

Name	Model ID	Description	Supported languages
Realtime TTS-2	`inworld-tts-2`	Our newest, most powerful model with natural language steering and stronger multilingual capabilities	200+ languages and locales — see Languages
Realtime TTS 1.5 Max	`inworld-tts-1.5-max`	High-quality, maximum-stability model with enhanced timestamps	`en`, `zh`, `ja`, `ko`, `ru`, `it`, `es`, `pt`, `fr`, `de`, `pl`, `nl`, `hi`, `he`, `ar`
Realtime TTS 1.5 Mini	`inworld-tts-1.5-mini`	Most cost-efficient model for English workloads, with enhanced timestamps	`en`, `zh`, `ja`, `ko`, `ru`, `it`, `es`, `pt`, `fr`, `de`, `pl`, `nl`, `hi`, `he`, `ar`
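
The guidance above can be condensed into a small selection helper. This is only a sketch: `pick_model` and its `priority` values are hypothetical names invented for illustration and are not part of any Inworld SDK. It assumes language tags such as `en-US` can be reduced to their base code, and it falls back to `inworld-tts-2` whenever a 1.5 model does not fit.

```python
# Language codes listed for the 1.5 models in the overview table.
LANGS_1_5 = frozenset({
    "en", "zh", "ja", "ko", "ru", "it", "es", "pt",
    "fr", "de", "pl", "nl", "hi", "he", "ar",
})


def pick_model(lang: str, priority: str = "quality") -> str:
    """Hypothetical helper: map a language tag and a priority to a model ID.

    priority is an illustrative parameter, one of "quality", "stability",
    or "cost". Anything else is treated like "quality".
    """
    # Assumption: "en-US" and "en" should be treated the same way.
    base = lang.split("-")[0].lower()

    # Mini is positioned for English workloads where price matters most.
    if priority == "cost" and base == "en":
        return "inworld-tts-1.5-mini"

    # Max is positioned for stability, within its listed languages.
    if priority == "stability" and base in LANGS_1_5:
        return "inworld-tts-1.5-max"

    # The flagship model covers every other case.
    return "inworld-tts-2"
```

For example, `pick_model("en-US", "cost")` selects the Mini model, while the same priority with `"fr"` falls back to `inworld-tts-2`.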

Looking for `inworld-tts-1` or `inworld-tts-1-max`? These previous-generation models were discontinued on June 15, 2026. Requests to them are now automatically routed to their 1.5 successors (`inworld-tts-1` → `inworld-tts-1.5-mini`, `inworld-tts-1-max` → `inworld-tts-1.5-max`). We recommend migrating to `inworld-tts-2` for better quality and lower latency.
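
Although the rerouting described above happens automatically, it can be useful to make it explicit in your own configuration so that logs and settings show the model that actually serves the request. The following is a minimal client-side sketch. `LEGACY_SUCCESSORS` and `resolve_model_id` are hypothetical names, and the mapping simply mirrors the routing stated in the note.

```python
# Discontinued model IDs and the 1.5 successors that requests are routed to.
LEGACY_SUCCESSORS = {
    "inworld-tts-1": "inworld-tts-1.5-mini",
    "inworld-tts-1-max": "inworld-tts-1.5-max",
}


def resolve_model_id(model_id: str) -> str:
    """Hypothetical helper: return the successor of a discontinued model ID.

    Current model IDs are returned unchanged.
    """
    return LEGACY_SUCCESSORS.get(model_id, model_id)


def is_discontinued(model_id: str) -> bool:
    """Hypothetical helper: flag IDs that should be migrated."""
    return model_id in LEGACY_SUCCESSORS
```

A configuration check could call `is_discontinued` at startup and warn when an old ID is still in use, which gives a natural prompt to evaluate `inworld-tts-2` instead of silently relying on the routing.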
