Voices
Inworld offers a variety of built-in voices across available languages that showcase a range of vocal characteristics and styles. These voices can be immediately tried out in TTS Playground and used in your applications.
For greater customization, we recommend voice cloning. Create distinct, personalized voices tailored to your experience, with as little as 5 seconds of audio.
Voices perform optimally when synthesizing text in the same language as the original voice. While cross-language synthesis is possible, you’ll achieve the best quality, pronunciation, and naturalness by matching the voice’s native language to your text content.
Language Support
As a larger and more capable model, Inworld TTS 1.5 Max is better suited for multilingual applications, offering better pronunciation, more accurate intonation, and more natural-sounding speech.
- English (en)
- Arabic (ar)
- Chinese (zh)
- Dutch (nl)
- French (fr)
- German (de)
- Hebrew (he)
- Hindi (hi)
- Italian (it)
- Japanese (ja)
- Korean (ko)
- Polish (pl)
- Portuguese (pt)
- Russian (ru)
- Spanish (es)
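The recommendation to match a voice's native language to the text can be sketched as a small lookup. This is a minimal illustration, not part of the Inworld API: the `pick_voice` helper and the voice IDs in the usage note are hypothetical, and only the language codes come from the list above.

```python
from typing import Dict, Optional

# Language codes taken from the supported-language list above.
SUPPORTED_LANGUAGES = {
    "en": "English", "ar": "Arabic", "zh": "Chinese", "nl": "Dutch",
    "fr": "French", "de": "German", "he": "Hebrew", "hi": "Hindi",
    "it": "Italian", "ja": "Japanese", "ko": "Korean", "pl": "Polish",
    "pt": "Portuguese", "ru": "Russian", "es": "Spanish",
}


def pick_voice(voices: Dict[str, str], text_language: str) -> Optional[str]:
    """Return the first voice whose native language matches the text.

    `voices` maps a (hypothetical) voice ID to its native language code.
    Returns None when no voice matches, so the caller can decide whether
    to fall back to cross-language synthesis.
    """
    if text_language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {text_language!r}")
    for voice_id, native_language in voices.items():
        if native_language == text_language:
            return voice_id
    return None
```

For example, with `voices = {"voice_a": "en", "voice_b": "ja"}`, calling `pick_voice(voices, "ja")` selects `voice_b`, while a French text yields `None` and leaves the cross-language decision to the caller.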
Supported Formats
Multiple audio formats are available via API to support different application requirements. The default is MP3.
- MP3: Popular compressed format with broad device and platform compatibility.
  - Sample rate: 16kHz - 48kHz
  - Bit rates: 32kbps - 320kbps
- PCM (PCM): Raw uncompressed 16-bit signed little-endian samples with no WAV header. Recommended for WebSocket use cases and real-time applications that process raw audio samples directly without needing container metadata.
  - Sample rate: 8kHz - 48kHz
  - Bit depth: 16-bit
- WAV (WAV): Uncompressed 16-bit signed little-endian samples with WAV header optimized for HTTP streaming. For non-streaming, the WAV header is included in the response. For HTTP streaming, the WAV header is included in the first audio chunk only, so all chunks in that response can be concatenated directly into a single valid WAV file. For WebSocket streaming, a WAV header is emitted at the first audio chunk of each flush/flush_completed event, so direct concatenation without processing is only valid within a single flush; to build one continuous WAV file across multiple flushes, clients must strip or rebuild the repeated headers between flushes.
  - Sample rate: 8kHz - 48kHz
  - Bit depth: 16-bit
- Linear PCM (LINEAR16): Uncompressed 16-bit signed little-endian samples with WAV header. Maintained for backward compatibility. For non-streaming, the WAV header is included in the response. For streaming (HTTP streaming or WebSocket), the WAV header is included in every audio chunk, so each chunk is a valid WAV file on its own. Clients must strip headers when concatenating chunks.
  - Sample rate: 8kHz - 48kHz
  - Bit depth: 16-bit
- Opus: High-quality compressed format optimized for low latency web and mobile applications.
  - Sample rate: 8kHz - 48kHz
  - Bit rates: 32kbps - 192kbps
- μ-law: Compressed telephony format ideal for voice applications with bandwidth constraints.
  - Sample rate: 8kHz
- A-law: Compressed telephony format ideal for voice applications with bandwidth constraints.
  - Sample rate: 8kHz
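The header handling described for WAV and LINEAR16 streaming can be sketched as follows. This is a minimal illustration rather than an official client: the helper names are hypothetical, the chunks are assumed to be already-decoded `bytes`, and the sample rate and channel count (mono by default here) are assumptions the caller must supply to match the request.

```python
import io
import struct
import wave
from typing import Iterable


def strip_wav_header(chunk: bytes) -> bytes:
    """Return the PCM payload of a chunk, dropping a leading WAV header if present."""
    if chunk[:4] != b"RIFF" or chunk[8:12] != b"WAVE":
        return chunk  # no header: treat the chunk as raw PCM
    pos = 12
    while pos + 8 <= len(chunk):
        sub_id = chunk[pos:pos + 4]
        (size,) = struct.unpack_from("<I", chunk, pos + 4)
        if sub_id == b"data":
            # Ignore the declared size, since streamed headers may carry a
            # placeholder; everything after the data header is payload.
            return chunk[pos + 8:]
        pos += 8 + size + (size & 1)  # sub-chunks are word-aligned
    raise ValueError("WAV header has no data sub-chunk")


def join_chunks_to_wav(chunks: Iterable[bytes], sample_rate: int, channels: int = 1) -> bytes:
    """Strip repeated headers and wrap the combined 16-bit PCM in one WAV file."""
    pcm = b"".join(strip_wav_header(c) for c in chunks)
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return out.getvalue()
```

Because chunks without a header pass through unchanged, the same function covers both cases described above: LINEAR16 streams where every chunk carries a header, and WebSocket WAV streams where only the first chunk of each flush does.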
Additional Configurations
The following optional configurations can also be adjusted as needed when synthesizing audio:
- Temperature: Higher values increase variation, which can produce more diverse and sometimes more desirable outputs, but also increase the chances of bad generations and hallucinations. Lower values improve stability and speaker similarity, though going too low increases the chances of broken generation. The default is 1.0.
- Talking Speed: Controls how fast the voice speaks. A value of 1.0 is the voice's normal speed, 0.5 is half that speed, and 1.5 is 1.5 times that speed.
- Emphasis Markers: Asterisks around a word (e.g. *really*) can be used to signal emphasis, prompting the voice to stress that word more strongly. This helps convey tone, intent, or emotion more clearly in spoken output.
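Two of the settings above lend themselves to a short sketch: wrapping a word in asterisks for emphasis, and estimating how talking speed changes clip length. Both helpers are hypothetical and not part of the Inworld API. The duration estimate assumes length scales inversely with the speed value, which is an inference from the description of 0.5 and 1.5, not a documented guarantee.

```python
import re


def emphasize(text: str, word: str) -> str:
    """Wrap each whole-word occurrence of `word` in asterisks.

    Occurrences that already have an asterisk on either side are left alone.
    """
    pattern = rf"(?<!\*)\b{re.escape(word)}\b(?!\*)"
    return re.sub(pattern, lambda m: f"*{m.group(0)}*", text)


def estimated_duration(base_seconds: float, talking_speed: float) -> float:
    """Rough clip length at a given talking speed (1.0 = normal speed)."""
    if talking_speed <= 0:
        raise ValueError("talking_speed must be positive")
    return base_seconds / talking_speed
```

For instance, `emphasize("I really mean it", "really")` produces `I *really* mean it`, and a clip that takes 10 seconds at normal speed would be estimated at about 20 seconds with a talking speed of 0.5.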