Overview
Inworld Router lets you combine LLM requests with Inworld Text-to-Speech in a single request. Instead of managing two separate API calls (one for text generation, one for speech synthesis), you send one request and receive both text and audio back. Both streaming and non-streaming modes are supported.
In streaming mode, Inworld Router handles the entire pipeline: it intelligently routes your prompt to the best LLM, streams the generated text through an optimized chunking engine, and sends each chunk to the TTS engine as it’s produced. The result is low-latency voice output — you hear the first audio well before the LLM finishes generating the full response. In non-streaming mode, the complete audio and transcript are returned together once the full response is ready.
This is ideal for:
- Voice assistants and conversational agents
- Real-time narration and read-aloud features
- Accessibility-first applications
- Any workflow where your users hear AI responses instead of (or in addition to) reading them
Quick Start
Add the audio parameter to any chat completions request to enable TTS. You’ll receive both the text response and audio data in the same stream.
curl --request POST \
  --url https://api.inworld.ai/v1/chat/completions \
  --header 'Authorization: Basic <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "inworld/my-router",
    "max_tokens": 1000,
    "stream": true,
    "audio": {
      "voice": "Dennis",
      "model": "inworld-tts-1.5-max"
    },
    "messages": [
      {"role": "user", "content": "What is the meaning of life?"}
    ]
  }'
That’s it. Inworld Router will:
- Route your prompt to your preset Inworld Route (or your chosen model)
- Stream text chunks to Inworld TTS as they’re generated
- Return both text and audio in the SSE stream
Audio Parameters
The audio object controls voice synthesis:
| Parameter | Type | Description |
|---|---|---|
| voice | string | Required. The voice ID to use for speech synthesis (e.g., "Dennis", "Chloe"). See List Voices for all available voices. |
| model | string | Required. The TTS model to use (e.g., "inworld-tts-1.5-max"). See TTS Models for available options. |
Default Audio Output
| Property | Value |
|---|---|
| Sample rate | 48,000 Hz |
| Format | 16-bit PCM, mono |
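Raw PCM has no header, so most players won't open it directly. As a minimal sketch (assuming 16-bit mono samples at 48 kHz, and with `save_wav` as an illustrative name), you can wrap the bytes in a WAV container using Python's standard `wave` module:

```python
import wave

def save_wav(pcm_bytes: bytes, path: str, sample_rate: int = 48000) -> None:
    """Wrap raw 16-bit mono PCM in a WAV container so standard players can open it."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples (2 bytes each)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)

# Example: 0.1 seconds of silence (4800 frames at 48 kHz)
save_wav(b"\x00\x00" * 4800, "output.wav")
```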
Streaming Response
When streaming is enabled ("stream": true), the response is delivered as Server-Sent Events (SSE). Each event is a JSON object in the data field.
When TTS is active, text is delivered through delta.audio.transcript. Audio data and its corresponding transcript are sent together via delta.audio:
data: {"choices":[{"delta":{"audio":{"data":"<base64-pcm-audio>","transcript":"Hello! How can I assist you today?"}},"index":0}],...}
| Field | Description |
|---|---|
| delta.audio.data | Base64-encoded PCM audio. |
| delta.audio.transcript | The text being spoken. Use this for real-time text display. |
Text and audio are chunked independently. Text is chunked at natural sentence boundaries, while audio is chunked at fixed byte sizes. This means a single transcript value may span multiple audio chunks. The transcript for a text segment is attached to the first audio chunk of that segment — subsequent audio chunks for the same segment will contain only data without a transcript field.
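That alignment rule can be sketched in a few lines. Assuming each SSE delta is a dict with a data field and an optional transcript field (as in the event shown above), this illustrative helper groups raw audio chunks under the transcript segment they belong to:

```python
def group_by_segment(deltas):
    """Group audio chunks so each transcript owns all of its audio bytes.

    A new segment starts whenever a delta carries a transcript field;
    deltas without one extend the current segment's audio.
    """
    segments = []
    for delta in deltas:
        if "transcript" in delta:
            segments.append({"transcript": delta["transcript"], "audio": [delta["data"]]})
        elif segments:
            segments[-1]["audio"].append(delta["data"])
    return segments

# Hypothetical stream: one segment spanning two audio chunks, then a second segment
deltas = [
    {"transcript": "Hello!", "data": b"\x01"},
    {"data": b"\x02"},  # same segment, audio only
    {"transcript": " How are you?", "data": b"\x03"},
]
print(group_by_segment(deltas))
```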
Non-Streaming Response
Without streaming ("stream": false), the full audio and transcript are returned in the message.audio object:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "",
      "audio": {
        "id": "audio_chatcmpl-xyz",
        "data": "<base64-pcm-audio>",
        "transcript": "Hello! How can I assist you today?"
      }
    },
    "finish_reason": "stop"
  }]
}
When TTS is active, message.content is empty. The full text is available in message.audio.transcript.
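Handling such a response is straightforward: read the transcript from message.audio.transcript and base64-decode message.audio.data. A short sketch (the response dict here is a hand-written stand-in, not real API output):

```python
import base64

# Stand-in for the JSON body returned with "stream": false
response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": "",
            "audio": {
                "id": "audio_chatcmpl-xyz",
                "data": base64.b64encode(b"\x00\x01\x02\x03").decode(),
                "transcript": "Hello! How can I assist you today?",
            },
        },
        "finish_reason": "stop",
    }]
}

audio = response["choices"][0]["message"]["audio"]
transcript = audio["transcript"]             # full text of the reply
pcm_bytes = base64.b64decode(audio["data"])  # raw PCM, ready to wrap in a WAV header
print(transcript, len(pcm_bytes))
```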
Use Any LLM
The audio parameter works with any model available through Inworld Router. The LLM generates text, and Inworld Router handles the TTS conversion separately — so your choice of voice is independent of your choice of model. See the Models API for a full list of supported LLM models.
# Use auto model selection + TTS
curl --request POST \
  --url https://api.inworld.ai/v1/chat/completions \
  --header 'Authorization: Basic <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "auto",
    "stream": true,
    "audio": {
      "voice": "Chloe",
      "model": "inworld-tts-1.5-max"
    },
    "messages": [
      {"role": "user", "content": "Tell me a short bedtime story."}
    ],
    "extra_body": {
      "sort": ["latency"]
    }
  }'
This combines Inworld Router’s intelligent model selection with TTS — you get the fastest available LLM and voice output in one call.
Combine with Smart Routing Features
All Inworld Router capabilities work alongside TTS:
Failover with Voice
If your primary model is unavailable, Inworld Router fails over to a backup — and the voice output continues seamlessly:
curl --request POST \
  --url https://api.inworld.ai/v1/chat/completions \
  --header 'Authorization: Basic <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "openai/gpt-5",
    "stream": true,
    "audio": {
      "voice": "Dennis",
      "model": "inworld-tts-1.5-max"
    },
    "messages": [
      {"role": "user", "content": "Explain quantum computing simply."}
    ],
    "extra_body": {
      "models": ["anthropic/claude-opus-4-6", "google-ai-studio/gemini-2.5-flash"]
    }
  }'
If GPT-5 fails, the request fails over to Claude or Gemini — and the same voice (Dennis) is used regardless of which model generates the text.
Cost-Optimized Voice Responses
Route to the cheapest model while still getting audio output:
curl --request POST \
  --url https://api.inworld.ai/v1/chat/completions \
  --header 'Authorization: Basic <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "auto",
    "stream": true,
    "audio": {
      "voice": "Dennis",
      "model": "inworld-tts-1.5-max"
    },
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "extra_body": {
      "sort": ["price", "latency"]
    }
  }'
Optimized Chunking
Inworld Router includes a built-in text chunking engine optimized for TTS. Rather than waiting for the LLM to finish generating the full response, the router:
- Buffers incoming tokens from the LLM
- Detects natural sentence and clause boundaries
- Sends each chunk to the TTS engine as soon as it’s ready
This pipeline significantly reduces Time to First Audio (TTFA) — your users start hearing the response while the LLM is still generating text. The chunking is tuned for natural-sounding speech: it avoids breaking mid-word or mid-phrase, producing smooth, conversational audio.
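As an illustration only (not the router's actual engine), the buffer-and-flush loop above can be approximated like this: accumulate incoming tokens and flush a chunk whenever the buffer ends at sentence-final punctuation.

```python
def chunk_tokens(tokens, boundaries=".!?"):
    """Yield TTS-ready chunks: buffer tokens, flush at sentence-ending punctuation."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(tuple(boundaries)):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()

# Hypothetical token stream from an LLM
tokens = ["The sky ", "looks blue ", "because of scattering. ",
          "Shorter wavelengths ", "scatter more."]
print(list(chunk_tokens(tokens)))
```

A real chunker would also handle abbreviations, decimals, and clause-level splits for long sentences, but the shape of the pipeline is the same.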
Tool calls (function calling) work alongside TTS. When the LLM decides to call a tool, the tool call is returned as standard delta.tool_calls chunks (no audio is generated for that turn). Once you execute the tool and send the result back with TTS enabled, the final response is spoken.
import requests
import json

API_URL = "https://api.inworld.ai/v1/chat/completions"
HEADERS = {
    "Authorization": "Basic <your-api-key>",
    "Content-Type": "application/json",
}

# Step 1: Request with tools + TTS enabled
response = requests.post(API_URL, headers=HEADERS, json={
    "model": "openai/gpt-5",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }],
    "audio": {
        "voice": "Dennis",
        "model": "inworld-tts-1.5-max"
    }
}).json()

# LLM returns tool_calls (no audio on this turn)
tool_call = response["choices"][0]["message"]["tool_calls"][0]

# Step 2: Parse the arguments and execute the tool call
args = json.loads(tool_call["function"]["arguments"])
tool_result = get_weather(args["location"])  # your function

# Step 3: Send tool result back — this response is spoken aloud
audio_response = requests.post(API_URL, headers=HEADERS, json={
    "model": "openai/gpt-5",
    "stream": True,
    "messages": [
        {"role": "user", "content": "What's the weather in Tokyo?"},
        response["choices"][0]["message"],
        {"role": "tool", "content": json.dumps(tool_result), "tool_call_id": tool_call["id"]}
    ],
    "audio": {
        "voice": "Dennis",
        "model": "inworld-tts-1.5-max"
    }
}, stream=True)

# Parse SSE stream for audio chunks (same as Python example below)
Python Example
import requests
import json
import base64

response = requests.post(
    "https://api.inworld.ai/v1/chat/completions",
    headers={
        "Authorization": "Basic <your-api-key>",
        "Content-Type": "application/json",
    },
    json={
        "model": "openai/gpt-5",
        "max_tokens": 500,
        "stream": True,
        "audio": {
            "voice": "Dennis",
            "model": "inworld-tts-1.5-max",
        },
        "messages": [
            {"role": "user", "content": "Tell me a fun fact about space."}
        ],
    },
    stream=True,
)

audio_chunks = []
full_transcript = ""

for line in response.iter_lines():
    line = line.decode("utf-8")
    if not line.startswith("data: "):
        continue
    data = line[6:]
    if data == "[DONE]":
        break
    chunk = json.loads(data)
    delta = chunk["choices"][0].get("delta", {})
    audio = delta.get("audio")
    if audio:
        # Text transcript — use for real-time text display
        if "transcript" in audio:
            full_transcript += audio["transcript"]
            print(audio["transcript"], end="", flush=True)
        # Audio data — PCM 48kHz 16-bit mono
        if "data" in audio:
            audio_chunks.append(base64.b64decode(audio["data"]))

pcm_audio = b"".join(audio_chunks)
JavaScript / Node.js Example
const response = await fetch("https://api.inworld.ai/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: "Basic <your-api-key>",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "openai/gpt-5",
    max_tokens: 500,
    stream: true,
    audio: {
      voice: "Dennis",
      model: "inworld-tts-1.5-max",
    },
    messages: [
      { role: "user", content: "Tell me a fun fact about space." },
    ],
  }),
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
const audioChunks = [];
let fullTranscript = "";
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop();
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const data = line.slice(6);
    if (data === "[DONE]") break;
    const chunk = JSON.parse(data);
    const audio = chunk.choices[0]?.delta?.audio;
    if (audio) {
      // Text transcript — use for real-time text display
      if (audio.transcript) {
        fullTranscript += audio.transcript;
        process.stdout.write(audio.transcript);
      }
      // Audio data — PCM 48kHz 16-bit mono
      if (audio.data) {
        audioChunks.push(Buffer.from(audio.data, "base64"));
      }
    }
  }
}

const pcmAudio = Buffer.concat(audioChunks);
Next Steps