Overview
Inworld Router lets you combine LLM requests with Inworld Text-to-Speech in a single request. Instead of managing two separate API calls (one for text generation, one for speech synthesis), you send one request and receive both text and audio back. Both streaming and non-streaming modes are supported.
In streaming mode, Inworld Router handles the entire pipeline: it intelligently routes your prompt to the best LLM, streams the generated text through an optimized chunking engine, and sends each chunk to the TTS engine as it’s produced. The result is low-latency voice output — you hear the first audio well before the LLM finishes generating the full response. In non-streaming mode, the complete audio and transcript are returned together once the full response is ready.
This is ideal for:
- Voice assistants and conversational agents
- Real-time narration and read-aloud features
- Accessibility-first applications
- Any workflow where your users hear AI responses instead of (or in addition to) reading them
Quick Start
Add the audio parameter to any chat completions request to enable TTS. You’ll receive both the text response and audio data in the same stream.
curl --request POST \
  --url https://api.inworld.ai/v1/chat/completions \
  --header 'Authorization: Basic <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "inworld/my-router",
    "max_tokens": 1000,
    "stream": true,
    "audio": {
      "voice": "Dennis",
      "model": "inworld-tts-1.5-max"
    },
    "messages": [
      {"role": "user", "content": "What is the meaning of life?"}
    ]
  }'
That’s it. Inworld Router will:
- Route your prompt to your preset Inworld Route (or your chosen model)
- Stream text chunks to Inworld TTS as they’re generated
- Return both text and audio in the SSE stream
Audio Parameters
The audio object controls voice synthesis:
| Parameter | Type | Description |
|---|---|---|
| voice | string | Required. The voice ID to use for speech synthesis (e.g., "Dennis", "Chloe"). See List Voices for all available voices. |
| model | string | Required. The TTS model to use (e.g., "inworld-tts-1.5-max"). See TTS Models for available options. |
Default Audio Output
| Property | Value |
|---|---|
| Sample rate | 48,000 Hz |
| Format | 16-bit PCM, mono |
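Raw PCM has no header, so most players won't open it directly. As a minimal sketch (assuming 16-bit mono samples at 48 kHz, and with `save_wav` as an illustrative name), you can wrap the bytes in a WAV container using Python's standard `wave` module:

```python
import wave

def save_wav(pcm_bytes: bytes, path: str, sample_rate: int = 48000) -> None:
    """Wrap raw 16-bit mono PCM in a WAV container so standard players can open it."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples (2 bytes each)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)

# Example: 0.1 seconds of silence (4800 frames at 48 kHz)
save_wav(b"\x00\x00" * 4800, "output.wav")
```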
Streaming Response
When streaming is enabled ("stream": true), the response is delivered as Server-Sent Events (SSE). Each event is a JSON object in the data field.
When TTS is active, text is delivered through delta.audio.transcript. Audio data and its corresponding transcript are sent together via delta.audio:
data: {"choices":[{"delta":{"audio":{"data":"<base64-pcm-audio>","transcript":"Hello! How can I assist you today?"}},"index":0}],...}
| Field | Description |
|---|---|
| delta.audio.data | Base64-encoded PCM audio. |
| delta.audio.transcript | The text being spoken. Use this for real-time text display. |
Text and audio are chunked independently. Text is chunked at natural sentence boundaries, while audio is chunked at fixed byte sizes. This means a single transcript value may span multiple audio chunks. The transcript for a text segment is attached to the first audio chunk of that segment — subsequent audio chunks for the same segment will contain only data without a transcript field.
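That alignment rule can be sketched in a few lines. Assuming each SSE delta is a dict with a data field and an optional transcript field (as in the event shown above), this illustrative helper groups raw audio chunks under the transcript segment they belong to:

```python
def group_by_segment(deltas):
    """Group audio chunks so each transcript owns all of its audio bytes.

    A new segment starts whenever a delta carries a transcript field;
    deltas without one extend the current segment's audio.
    """
    segments = []
    for delta in deltas:
        if "transcript" in delta:
            segments.append({"transcript": delta["transcript"], "audio": [delta["data"]]})
        elif segments:
            segments[-1]["audio"].append(delta["data"])
    return segments

# Hypothetical stream: one segment spanning two audio chunks, then a second segment
deltas = [
    {"transcript": "Hello!", "data": b"\x01"},
    {"data": b"\x02"},  # same segment, audio only
    {"transcript": " How are you?", "data": b"\x03"},
]
print(group_by_segment(deltas))
```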
Non-Streaming Response
Without streaming ("stream": false), the full audio and transcript are returned in the message.audio object:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "",
      "audio": {
        "id": "audio_chatcmpl-xyz",
        "data": "<base64-pcm-audio>",
        "transcript": "Hello! How can I assist you today?"
      }
    },
    "finish_reason": "stop"
  }]
}
When TTS is active, message.content is empty. The full text is available in message.audio.transcript.
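Handling such a response is straightforward: read the transcript from message.audio.transcript and base64-decode message.audio.data. A short sketch (the response dict here is a hand-written stand-in, not real API output):

```python
import base64

# Stand-in for the JSON body returned with "stream": false
response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": "",
            "audio": {
                "id": "audio_chatcmpl-xyz",
                "data": base64.b64encode(b"\x00\x01\x02\x03").decode(),
                "transcript": "Hello! How can I assist you today?",
            },
        },
        "finish_reason": "stop",
    }]
}

audio = response["choices"][0]["message"]["audio"]
transcript = audio["transcript"]             # full text of the reply
pcm_bytes = base64.b64decode(audio["data"])  # raw PCM, ready to wrap in a WAV header
print(transcript, len(pcm_bytes))
```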
Use Any LLM
The audio parameter works with any model available through Inworld Router. The LLM generates text, and Inworld Router handles the TTS conversion separately — so your choice of voice is independent of your choice of model. See the Models API for a full list of supported LLM models.
# Use auto model selection + TTS
curl --request POST \
  --url https://api.inworld.ai/v1/chat/completions \
  --header 'Authorization: Basic <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "auto",
    "stream": true,
    "audio": {
      "voice": "Chloe",
      "model": "inworld-tts-1.5-max"
    },
    "messages": [
      {"role": "user", "content": "Tell me a short bedtime story."}
    ],
    "extra_body": {
      "sort": ["latency"]
    }
  }'
This combines Inworld Router’s intelligent model selection with TTS — you get the fastest available LLM and voice output in one call.
Combine with Smart Routing Features
All Inworld Router capabilities work alongside TTS:
Failover with Voice
If your primary model is unavailable, Inworld Router fails over to a backup — and the voice output continues seamlessly:
curl --request POST \
  --url https://api.inworld.ai/v1/chat/completions \
  --header 'Authorization: Basic <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "openai/gpt-5",
    "stream": true,
    "audio": {
      "voice": "Dennis",
      "model": "inworld-tts-1.5-max"
    },
    "messages": [
      {"role": "user", "content": "Explain quantum computing simply."}
    ],
    "extra_body": {
      "models": ["anthropic/claude-opus-4-6", "google-ai-studio/gemini-2.5-flash"]
    }
  }'
If GPT-5 fails, the request fails over to Claude or Gemini — and the same voice (Dennis) is used regardless of which model generates the text.
Cost-Optimized Voice Responses
Route to the cheapest model while still getting audio output:
curl --request POST \
  --url https://api.inworld.ai/v1/chat/completions \
  --header 'Authorization: Basic <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "auto",
    "stream": true,
    "audio": {
      "voice": "Dennis",
      "model": "inworld-tts-1.5-max"
    },
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "extra_body": {
      "sort": ["price", "latency"]
    }
  }'
Optimized Chunking
Inworld Router includes a built-in text chunking engine optimized for TTS. Rather than waiting for the LLM to finish generating the full response, the router:
- Buffers incoming tokens from the LLM
- Detects natural sentence and clause boundaries
- Sends each chunk to the TTS engine as soon as it’s ready
This pipeline significantly reduces Time to First Audio (TTFA) — your users start hearing the response while the LLM is still generating text. The chunking is tuned for natural-sounding speech: it avoids breaking mid-word or mid-phrase, producing smooth, conversational audio.
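As an illustration only (not the router's actual engine), the buffer-and-flush loop above can be approximated like this: accumulate incoming tokens and flush a chunk whenever the buffer ends at sentence-final punctuation.

```python
def chunk_tokens(tokens, boundaries=".!?"):
    """Yield TTS-ready chunks: buffer tokens, flush at sentence-ending punctuation."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(tuple(boundaries)):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()

# Hypothetical token stream from an LLM
tokens = ["The sky ", "looks blue ", "because of scattering. ",
          "Shorter wavelengths ", "scatter more."]
print(list(chunk_tokens(tokens)))
```

A real chunker would also handle abbreviations, decimals, and clause-level splits for long sentences, but the shape of the pipeline is the same.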
Tool calls (function calling) work alongside TTS. When the LLM decides to call a tool, the tool call is returned as standard delta.tool_calls chunks (no audio is generated for that turn). Once you execute the tool and send the result back with TTS enabled, the final response is spoken.
import requests
import json

API_URL = "https://api.inworld.ai/v1/chat/completions"
HEADERS = {
    "Authorization": "Basic <your-api-key>",
    "Content-Type": "application/json",
}

# Step 1: Request with tools + TTS enabled
response = requests.post(API_URL, headers=HEADERS, json={
    "model": "openai/gpt-5",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }],
    "audio": {
        "voice": "Dennis",
        "model": "inworld-tts-1.5-max"
    }
}).json()

# LLM returns tool_calls (no audio on this turn)
tool_call = response["choices"][0]["message"]["tool_calls"][0]

# Step 2: Parse the arguments and execute the tool call
args = json.loads(tool_call["function"]["arguments"])
tool_result = get_weather(args["location"])  # your function

# Step 3: Send tool result back — this response is spoken aloud
audio_response = requests.post(API_URL, headers=HEADERS, json={
    "model": "openai/gpt-5",
    "stream": True,
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo?"},
        response["choices"][0]["message"],
        {"role": "tool", "content": json.dumps(tool_result), "tool_call_id": tool_call["id"]}
    ],
    "audio": {
        "voice": "Dennis",
        "model": "inworld-tts-1.5-max"
    }
}, stream=True)

# Parse SSE stream for audio chunks (same as Python example below)
Python Example
import requests
import json
import base64

response = requests.post(
    "https://api.inworld.ai/v1/chat/completions",
    headers={
        "Authorization": "Basic <your-api-key>",
        "Content-Type": "application/json",
    },
    json={
        "model": "openai/gpt-5",
        "max_tokens": 500,
        "stream": True,
        "audio": {
            "voice": "Dennis",
            "model": "inworld-tts-1.5-max",
        },
        "messages": [
            {"role": "user", "content": "Tell me a fun fact about space."}
        ],
    },
    stream=True,
)

audio_chunks = []
full_transcript = ""

for line in response.iter_lines():
    line = line.decode("utf-8")
    if not line.startswith("data: "):
        continue
    data = line[6:]
    if data == "[DONE]":
        break
    chunk = json.loads(data)
    delta = chunk["choices"][0].get("delta", {})
    audio = delta.get("audio")
    if audio:
        # Text transcript — use for real-time text display
        if "transcript" in audio:
            full_transcript += audio["transcript"]
            print(audio["transcript"], end="", flush=True)
        # Audio data — PCM 48kHz 16-bit mono
        if "data" in audio:
            audio_chunks.append(base64.b64decode(audio["data"]))

pcm_audio = b"".join(audio_chunks)
JavaScript / Node.js Example
const response = await fetch("https://api.inworld.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: "Basic <your-api-key>",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-5",
    max_tokens: 500,
    stream: true,
    audio: {
      voice: "Dennis",
      model: "inworld-tts-1.5-max",
    },
    messages: [
      { role: "user", content: "Tell me a fun fact about space." },
    ],
  }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
const audioChunks = [];
let fullTranscript = "";
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop();
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const data = line.slice(6);
    if (data === "[DONE]") break;
    const chunk = JSON.parse(data);
    const audio = chunk.choices[0]?.delta?.audio;
    if (audio) {
      // Text transcript — use for real-time text display
      if (audio.transcript) {
        fullTranscript += audio.transcript;
        process.stdout.write(audio.transcript);
      }
      // Audio data — PCM 48kHz 16-bit mono
      if (audio.data) {
        audioChunks.push(Buffer.from(audio.data, "base64"));
      }
    }
  }
}

const pcmAudio = Buffer.concat(audioChunks);
Next Steps