The Inworld Realtime API is wire-compatible with the OpenAI Realtime spec — clients written against OpenAI’s session, audio, and response events work against Inworld unchanged. On top of that baseline, Inworld layers production-grade extensions that improve quality, latency, and conversational naturalness:
- STT tuning: voice profile signals, language hints, explicit end-of-turn and VAD overrides
- TTS segmentation and steering: pick how the LLM token stream is chunked into TTS calls, the synthesis language, the TTS-2 delivery preset, and (for TTS-2) a shared multi-turn context
- Automatic conversation memory: periodic summarization and fact extraction that keep long sessions inside the context window
- Back-channel: short interjections ("uh-huh", "I see") emitted while the user is still speaking, so the agent feels like an active listener
- Responsiveness fillers: short filler audio ("let me think") spoken in the gap after a user turn if the main LLM is slow to produce its first delta
Everything Inworld adds beyond the OpenAI spec is exposed through a single field on the session object: providerData. Send it inside any session.update and the server merges it with current state. Most fields hot-swap mid-session and take effect on the next audio chunk or turn; the locked-at-session-open exceptions are called out in the hot-swap reference at the bottom of this page.
This page is the field-by-field reference for the full providerData surface. For task-driven walkthroughs (language switching, conversation management, etc.) and for the event-handling client code that pairs with back-channel and responsiveness, see the linked guides under each branch.
Branch overview
providerData is a flat object with five branches. Each branch is independent — send only the ones you want to configure.
| Branch | Purpose | Hot-swap |
|---|---|---|
| stt | STT tuning (voice profile, language hints, end-of-turn thresholds, VAD overrides) | Yes; the STT stream is restarted on the next audio chunk |
| tts | TTS segmentation, language, delivery preset, conversational context | Mostly; conversational and user_turn_mode are locked at session open |
| memory | Automatic conversation memory and summarization | Yes |
| backchannel | Short interjections while the user is speaking | Yes |
| responsiveness | Filler audio while the main LLM warms up after a user turn | Yes |
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      stt: { /* ... */ },
      tts: { /* ... */ },
      memory: { /* ... */ },
      backchannel: { /* ... */ },
      responsiveness: { /* ... */ }
    }
  }
}));
Partial updates are supported on every branch — omit a field to keep its current value.
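As a minimal sketch of a partial update, the helper below builds a session.update payload that carries only the branches the caller passes in. The name buildProviderDataUpdate is hypothetical and not part of any Inworld SDK; the branch names come from the table above.

```javascript
// Hypothetical helper, not part of any Inworld SDK: builds a session.update
// message that carries only the providerData branches supplied by the caller.
const BRANCHES = ['stt', 'tts', 'memory', 'backchannel', 'responsiveness'];

function buildProviderDataUpdate(branches) {
  const providerData = {};
  for (const name of BRANCHES) {
    // Skip branches the caller did not mention so the server keeps their state.
    if (branches[name] !== undefined) providerData[name] = branches[name];
  }
  return JSON.stringify({ type: 'session.update', session: { providerData } });
}

// Change only the TTS language; every other branch and field is left alone.
const languageSwitch = buildProviderDataUpdate({ tts: { language: 'pt-BR' } });
console.log(languageSwitch);
// With an open WebSocket `ws`: ws.send(languageSwitch);
```

Keys outside the five known branches are silently dropped by this sketch, which is a design choice made here to catch typos early rather than documented server behavior.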
providerData also accepts two top-level metadata fields alongside the branches: user_id and metadata. They aren’t configuration branches; they tag the session for tracing and downstream routing. See Session metadata below.
STT (providerData.stt)
Inworld extensions to the OpenAI-standard STT config. Every field here is hot-swappable; the STT stream is restarted automatically so the next chunk of audio uses the new value.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: { transcription: { model: 'inworld/inworld-stt-1' } }
    },
    providerData: {
      stt: {
        prompt: 'Medical dictation. Vocabulary: angioplasty.',
        voice_profile: true,
        // Only honored by soniox/stt-rt-v4; other models ignore it.
        language_hints: ['en-US', 'es-MX'],
        end_of_turn_confidence_threshold: 0.7,
        min_end_of_turn_silence: 200, // ms
        max_turn_silence: 5000,       // ms
        vad_threshold: 0.5
      }
    }
  }
}));
| Field | Type | Description |
|---|---|---|
| prompt | string | Transcription guidance (vocabulary hints, domain context, formatting preferences). Equivalent to audio.input.transcription.prompt. |
| voice_profile | boolean | When true, attach voice-profile signals (age, gender, emotion, vocal style, accent) to transcription events under providerData.voiceProfile. See Voice profile payload below for the returned shape. |
| language_hints | string[] | BCP-47-ish hints to bias recognition without committing to a single language. Soniox-specific (soniox/stt-rt-v4); ignored by other models. |
| end_of_turn_confidence_threshold | number | STT end-of-turn confidence cutoff (0.0–1.0). Explicit override of the semantic_vad.eagerness mapping. |
| vad_threshold | number | Speech/silence VAD cutoff (0.0–1.0). Explicit override of the eagerness mapping. |
| min_end_of_turn_silence | integer | Minimum trailing silence (ms) before STT considers a turn finished. Explicit override of the eagerness mapping. |
| max_turn_silence | integer | Hard ceiling (ms) on within-turn silence before STT force-closes the turn. Explicit override of the eagerness mapping. |
For the eagerness preset that these fields override, see semantic_vad.
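Because these overrides bypass the eagerness preset, it can help to sanity-check them before sending. The sketch below is a hypothetical client-side check (validateSttOverrides is not an Inworld API). The 0.0–1.0 ranges come from the table above; rejecting negative millisecond values is an assumption, and the server's own validation may differ.

```javascript
// Hypothetical client-side check; the server's own validation may differ.
const UNIT_RANGE_FIELDS = ['end_of_turn_confidence_threshold', 'vad_threshold'];
const MILLISECOND_FIELDS = ['min_end_of_turn_silence', 'max_turn_silence'];

function validateSttOverrides(stt) {
  const errors = [];
  for (const field of UNIT_RANGE_FIELDS) {
    const value = stt[field];
    if (value === undefined) continue; // omitted fields keep their current value
    if (typeof value !== 'number' || Number.isNaN(value) || value < 0 || value > 1) {
      errors.push(`${field} must be a number in [0.0, 1.0]`);
    }
  }
  for (const field of MILLISECOND_FIELDS) {
    const value = stt[field];
    if (value === undefined) continue;
    // Assumption: negative durations are never meaningful.
    if (!Number.isInteger(value) || value < 0) {
      errors.push(`${field} must be a non-negative integer (ms)`);
    }
  }
  return errors;
}

const validOverrides = validateSttOverrides({ vad_threshold: 0.5, max_turn_silence: 5000 });
const invalidOverrides = validateSttOverrides({ vad_threshold: 1.5, min_end_of_turn_silence: 200.5 });
console.log(validOverrides.length, invalidOverrides.length); // 0 2
```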
Voice profile payload
When providerData.stt.voice_profile is true, every conversation.item.input_audio_transcription.delta and conversation.item.input_audio_transcription.completed event carries a providerData.voiceProfile object alongside the transcript text:
{
  "type": "conversation.item.input_audio_transcription.completed",
  "event_id": "evt_5f7d2",
  "item_id": "item_aud_01HF…",
  "content_index": 0,
  "transcript": "Hello, how are you?",
  "providerData": {
    "voiceProfile": {
      "age": [{ "label": "adult", "confidence": 0.78 }],
      "gender": [{ "label": "female", "confidence": 0.91 }],
      "emotion": [{ "label": "neutral", "confidence": 0.65 }],
      "vocal_style": [{ "label": "conversational", "confidence": 0.82 }],
      "accent": [{ "label": "en-US", "confidence": 0.88 }]
    }
  }
}
Each top-level key is an array of { label, confidence } objects sorted by descending confidence. Keys are omitted when the STT backend does not produce labels for that category, so always null-check before reading. Confidence values are in [0.0, 1.0].
| Category | Notes |
|---|---|
| age | Estimated age band of the speaker. |
| gender | Estimated gender of the speaker. |
| emotion | Detected emotional tone in the current segment. Can shift across deltas within a single turn. |
| vocal_style | Speaking style (e.g. conversational, narration, whisper, monotone). |
| accent | Regional accent or dialect as a BCP-47-like locale code (e.g. en-US, en-GB). |
Voice profile is computed by the realtime service regardless of the STT backend, so voice_profile: true works across all supported STT models.
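Since any category key can be missing, a small accessor keeps event handlers tidy. The sketch below assumes only the payload shape shown above; topVoiceLabel is a hypothetical helper name, and the second emotion entry ("happy") is invented sample data.

```javascript
// Returns the highest-confidence entry for a category, or null when the key
// is absent or empty. Relies on the documented descending-confidence order.
function topVoiceLabel(event, category) {
  const entries = event?.providerData?.voiceProfile?.[category];
  return Array.isArray(entries) && entries.length > 0 ? entries[0] : null;
}

const sampleEvent = {
  type: 'conversation.item.input_audio_transcription.completed',
  transcript: 'Hello, how are you?',
  providerData: {
    voiceProfile: {
      emotion: [
        { label: 'neutral', confidence: 0.65 },
        { label: 'happy', confidence: 0.2 } // invented sample entry
      ]
      // "accent" is deliberately missing to exercise the absent-key path
    }
  }
};

console.log(topVoiceLabel(sampleEvent, 'emotion')); // { label: 'neutral', confidence: 0.65 }
console.log(topVoiceLabel(sampleEvent, 'accent'));  // null
```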
TTS (providerData.tts)
Controls how the LLM text stream is segmented and forwarded to the TTS backend, the language and delivery preset used for synthesis, and (for TTS-2) whether a shared upstream context is preserved across turns. Available on inworld-tts-1.5-mini and inworld-tts-2.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { model: 'inworld-tts-2', voice: 'Olivia' } },
    providerData: {
      tts: {
        segmenter_strategy: 'sentence',
        steering_handling: 'emit_once',
        language: 'en-US',
        delivery_mode: 'CREATIVE',
        conversational: false,
        user_turn_mode: 'both'
      }
    }
  }
}));
| Field | Type | Description |
|---|---|---|
| segmenter_strategy | string | How the LLM token stream is chunked before being forwarded to TTS. One of auto, balanced, sentence, full_turn, fast_start, per_segment_context. Empty string inherits the server default. Hot-swappable. See Segmenter strategies. |
| steering_handling | string | How to handle a leading [steering] tag captured from the LLM turn. repeat_each_chunk re-prepends it to every TTS request (default). emit_once prepends it only to the first request (recommended for inworld-tts-2). Hot-swappable. |
| language | string | BCP-47 tag (e.g. "en-US", "pt-BR") forwarded to TTS as the synthesis language. Independent from audio.input.transcription.language, so STT and TTS can use different languages. Empty string lets the TTS backend infer. Hot-swappable. |
| delivery_mode | string | TTS-2 generation preset trading off stability vs. expressiveness. One of STABLE, BALANCED, CREATIVE (case-insensitive). Empty or unrecognised values are treated as unspecified. No-op for non-TTS-2 models. Hot-swappable. |
| conversational | boolean | TTS-2 only. When true, opens a single shared upstream TTS context for the entire WebSocket session. Locked at session open; mid-session toggles are ignored. See Conversational TTS. |
| user_turn_mode | string | Conversational-mode only. Which channels of the user turn are forwarded to TTS before each assistant generation. One of both (default), audio_only, text_only, or none. No-op outside conversational mode. Locked at session open. |
Conversational TTS
Setting providerData.tts.conversational = true opts TTS-2 into a multi-turn shared context: the upstream TurnContext sees every user and assistant turn for the lifetime of the WebSocket. This lets the model condition its delivery on the audio history of the conversation. The trade-off is a longer-lived state on the TTS backend and slightly higher per-turn cost.
In conversational mode, segmenter_strategy is internally locked to full_turn semantics. Per-sentence and per-segment-context strategies are coerced (with a server-side WARN) because they would either fragment the upstream history or open a fresh context per segment, both of which defeat the multi-turn TurnContext.
With conversational: true, TTS conditions each response on the audio of previous turns — higher per-turn cost in exchange for potentially more natural output. Off by default.
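To make the coercion rule concrete, the sketch below predicts on the client which segmenter the server will effectively use. effectiveSegmenter is a hypothetical helper. Mapping an empty string to auto, and flagging every non-full_turn value as overridden, are assumptions drawn from the text above, which names only the sentence and per_segment_context strategies as triggering the server-side WARN.

```javascript
// Client-side prediction of the documented coercion. The server remains the
// source of truth; this helper only mirrors the rule for logging or UI hints.
function effectiveSegmenter(tts) {
  // Assumption: an empty or missing strategy resolves to the "auto" default.
  const requested = tts.segmenter_strategy || 'auto';
  if (tts.conversational === true) {
    // Conversational mode is documented as locked to full_turn semantics.
    return { strategy: 'full_turn', overridden: requested !== 'full_turn' };
  }
  return { strategy: requested, overridden: false };
}

console.log(effectiveSegmenter({ conversational: true, segmenter_strategy: 'sentence' }));
// { strategy: 'full_turn', overridden: true }
console.log(effectiveSegmenter({ conversational: false, segmenter_strategy: 'sentence' }));
// { strategy: 'sentence', overridden: false }
```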
Segmenter strategies
| Strategy | Behaviour |
|---|---|
| auto | Default. inworld-tts-2 uses sentence splits; older models use balanced splits. |
| balanced | Punctuation + conjunction splits. Tuned for inworld-tts-1.5. |
| sentence | Hard terminal-punctuation splits only. Tuned for inworld-tts-2. |
| full_turn | Buffer the entire LLM turn and emit it at turn end. Highest quality, highest latency. |
| fast_start | Strict sentence rules for the first emission, then a relaxed config (larger chunks, no idle-flush) for the rest of the turn. Optimizes time-to-first-audio. |
| per_segment_context | Each segment opens a fresh TTS context on the duplex stream. Per-segment handles are serialized so audio order is preserved. |
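To illustrate what "hard terminal-punctuation splits" means for a streamed LLM response, here is a toy segmenter. It is not the server's implementation: the class name, the choice of ., ! and ? as terminal marks, and the requirement of trailing whitespace before a split are all assumptions made for this sketch.

```javascript
// Illustrative only: buffers streamed text and emits a segment each time a
// terminal punctuation mark (., !, ?) is followed by whitespace.
class SentenceSegmenter {
  constructor() {
    this.buffer = '';
  }

  // Feed one streamed token; returns the segments that became complete.
  push(token) {
    this.buffer += token;
    const segments = [];
    const boundary = /[.!?]+\s+/g;
    let start = 0;
    let match;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      segments.push(this.buffer.slice(start, end).trim());
      start = end;
    }
    this.buffer = this.buffer.slice(start);
    return segments;
  }

  // Call at turn end to release whatever is still buffered.
  flush() {
    const rest = this.buffer.trim();
    this.buffer = '';
    return rest ? [rest] : [];
  }
}

const segmenter = new SentenceSegmenter();
const spoken = [];
for (const token of ['Sure', '. ', 'The flight', ' leaves at 9', '. Anything', ' else?']) {
  spoken.push(...segmenter.push(token));
}
spoken.push(...segmenter.flush());
console.log(spoken); // [ 'Sure.', 'The flight leaves at 9.', 'Anything else?' ]
```

In the same toy model, a full_turn strategy would correspond to never emitting from push and sending the whole buffer on flush, which is why it trades latency for a single uninterrupted synthesis request.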
Memory (providerData.memory)
Automatic conversation memory and summarization. When enabled, the server periodically asks the LLM to extract durable facts and a rolling summary, prepends them to the system prompt, and trims the transcript so context stays bounded.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      memory: {
        enabled: true,
        turn_interval: 5,
        max_memory_length: 2000,
        max_transcript_items: 40,
        max_facts: 50,
        trim_after_summarize: true
      }
    }
  }
}));
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable automatic memory generation. |
| turn_interval | integer | 5 | Generate memory every N completed turns. |
| max_memory_length | integer | 2000 | Maximum character length for the rolling summary. |
| max_transcript_items | integer | 40 | Maximum conversation items to keep after trimming. |
| max_facts | integer | 50 | Maximum facts retained in state.facts. |
| trim_after_summarize | boolean | true | Remove old transcript items after summarization. |
After each generation cycle the server populates providerData.memory.state (read-only) and emits a session.updated event so clients can observe the rolling summary, fact list, and bookkeeping counters.
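The interaction between turn_interval and max_transcript_items can be modeled in a few lines. This is a hypothetical client-side model, not the server's logic: it assumes generation fires exactly when the completed-turn count is a multiple of turn_interval, and that trimming keeps the most recent items.

```javascript
// Hypothetical model of the documented knobs, not the server implementation.
const MEMORY_CONFIG = {
  enabled: true, // the documented default is false; enabled here for the demo
  turn_interval: 5,
  max_transcript_items: 40,
  trim_after_summarize: true
};

// Assumption: "every N completed turns" means turn counts divisible by N.
function shouldGenerateMemory(completedTurns, config) {
  return config.enabled && completedTurns > 0 && completedTurns % config.turn_interval === 0;
}

// Assumption: trimming keeps the newest max_transcript_items entries.
function trimTranscript(items, config) {
  if (!config.trim_after_summarize) return items;
  const firstKept = Math.max(0, items.length - config.max_transcript_items);
  return items.slice(firstKept);
}

const firingTurns = [];
for (let completed = 1; completed <= 12; completed++) {
  if (shouldGenerateMemory(completed, MEMORY_CONFIG)) firingTurns.push(completed);
}
console.log(firingTurns); // [ 5, 10 ]

const transcript = Array.from({ length: 45 }, (_, i) => `item_${i}`);
const trimmed = trimTranscript(transcript, MEMORY_CONFIG);
console.log(trimmed.length, trimmed[0]); // 40 item_5
```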
Back-channel (providerData.backchannel)
Short audio interjections — "uh-huh", "right", "I see" — emitted while the user is still speaking. Opt-in per session and gated by server prerequisites; contact your account team to confirm prerequisites for your deployment.
For event handling (the response.backchannel.audio.delta / .done / .skipped events), client integration tips, and tuning guidance, see the dedicated Back-channel guide.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      backchannel: {
        enabled: true,
        eval_interval_ms: 800,
        min_speech_ms: 800,
        min_gap_ms: 4000,
        max_per_turn: 3,
        hard_deadline_ms: 1500,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 6,
        volume_gain: 0.6,
        require_pause: false,
        decider_kind: 'llm'
      }
    }
  }
}));
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Per-session opt-in. Sessions that don’t send this field never receive back-channels. |
| small_model | string | server default | Override the decider LLM model identifier. Empty string inherits the default. |
| eval_interval_ms | integer | 800 | How often the manager evaluates eligibility while the user is producing partial transcripts. |
| min_speech_ms | integer | 800 | Minimum time after speech onset before any back-channel can fire. |
| min_gap_ms | integer | 4000 | Minimum spacing between two back-channels in the same user turn. |
| max_per_turn | integer | 3 | Cap on back-channels emitted within a single user turn. |
| hard_deadline_ms | integer | 1500 | Combined small-LLM + TTS deadline per attempt. Misses are dropped. |
| history_tail_items | integer | 4 | Recent conversation items the small LLM sees as context. |
| temperature | number | 0.7 | Sampling temperature for the small LLM. |
| max_tokens | integer | 6 | Max tokens for the small LLM’s reply. |
| volume_gain | number | 0.6 | Linear gain multiplier applied to synthesized back-channel audio. 0.0 mutes; 1.0 keeps the synthesized volume; >1.0 amplifies. |
| require_pause | boolean | false | When true, only fire after a smart-turn pause signal (input_audio_buffer.turn_suggestion). |
| allowed_phrases | string[] | server default | Restrict the phrase bank. null / omitted inherits the default; an explicit empty array disables back-channel for the session; a populated array replaces the bank. |
| prompt_template | string | server default | Override the decider prompt. Supports Go text/template tokens {{.PhrasesList}}, {{.History}}, {{.Partial}}. |
| decider_kind | "llm" \| "rule" | "llm" | llm uses a small LLM. rule picks phrases from the bank with per-tick probability rule_fire_probability. |
| rule_fire_probability | number | 1.0 | Per-tick fire probability for the rule decider (0.0–1.0). Ignored when decider_kind != "rule". |
Sending providerData.backchannel: {} (empty object) clears all overrides; the server falls back to its compiled-in defaults.
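The timing fields are easier to tune once you see how they combine. The sketch below models only the three timing gates (min_speech_ms, min_gap_ms, max_per_turn) with the defaults from the table above, and fires on every eligible tick, as a rule decider with probability 1.0 might. The function and state names are hypothetical, and the real manager may apply further conditions such as require_pause and the hard deadline.

```javascript
// Defaults copied from the back-channel field table.
const BACKCHANNEL_DEFAULTS = { min_speech_ms: 800, min_gap_ms: 4000, max_per_turn: 3 };

// userTurn: { speechStartMs, lastBackchannelMs (or null), emittedCount }
function passesBackchannelGates(userTurn, nowMs, config = BACKCHANNEL_DEFAULTS) {
  if (userTurn.emittedCount >= config.max_per_turn) return false;
  if (nowMs - userTurn.speechStartMs < config.min_speech_ms) return false;
  if (userTurn.lastBackchannelMs !== null &&
      nowMs - userTurn.lastBackchannelMs < config.min_gap_ms) return false;
  return true;
}

// Simulate 12 s of continuous speech, evaluated every 800 ms (eval_interval_ms).
const userTurn = { speechStartMs: 0, lastBackchannelMs: null, emittedCount: 0 };
const firedAt = [];
for (let nowMs = 800; nowMs <= 12000; nowMs += 800) {
  if (passesBackchannelGates(userTurn, nowMs)) {
    firedAt.push(nowMs);
    userTurn.lastBackchannelMs = nowMs;
    userTurn.emittedCount += 1;
  }
}
console.log(firedAt); // [ 800, 4800, 8800 ]
```

With the defaults, the per-turn cap rather than the gap is what stops a fourth interjection in this simulated turn.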
Responsiveness (providerData.responsiveness)
Short filler audio ("let me think", "one moment") spoken after the user’s turn ends if the main LLM is slow to produce its first delta. Opt-in per session and gated by two server prerequisites (a small filler model and an Unleash flag); contact your account team to confirm both are in place.
For how the filler races the main LLM, TTS pipeline details, and tuning guidance, see the dedicated Responsiveness guide.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      responsiveness: {
        enabled: true,
        initial_wait_timeout_ms: 1200,
        hard_deadline_ms: 2000,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 12,
        min_filler_gap_ms: 8000,
        max_initial_per_turn: 1,
        enable_filler_on_first_assistant_reply: false,
        pause_text: ''
      }
    }
  }
}));
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Per-session opt-in. A session that does not send this object never gets a filler. |
| small_model | string | server default | Override the filler LLM model identifier. |
| initial_wait_timeout_ms | integer | server default | The wait timeout (T): how long to wait for the main LLM’s first delta before committing to the filler. Lower values fire fillers more aggressively. |
| hard_deadline_ms | integer | server default | Caps the small / filler LLM’s total streaming time so a slow filler model can’t become a latency tax. |
| history_tail_items | integer | server default | Recent conversation items the small LLM sees as context. |
| temperature | number | server default | Sampling temperature for the small LLM. |
| max_tokens | integer | server default | Caps the small LLM’s response length. Keep it small; fillers should be brief. |
| min_filler_gap_ms | integer | server default | Minimum gap between any two fillers within a single user-turn chain. |
| max_initial_per_turn | integer | 1 | Caps initial fillers per user turn. |
| max_buffer_deltas | integer | server default | Bounds the in-memory buffer of main-LLM deltas held while the filler is being spoken. |
| enable_filler_on_first_assistant_reply | boolean | false | Allows responsiveness fillers on the very first assistant response in a session. |
| prompt_template | string | server default | Overrides the system prompt fed to the small filler LLM. Append a language directive here for multilingual sessions. |
| pause_text | string | server default | TTS-only hint injected between the filler and the main answer (e.g. a brief connector word). Empty string disables injection. |
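As a rough mental model of when a filler is spoken, the sketch below turns the fields above into a single decision. It is a hypothetical approximation, not the server's race logic: shouldSpeakFiller and the shape of the turn-state object are invented for illustration, and because most defaults are server-side, the config values here are simply the ones from the example request.

```javascript
// Values taken from the example session.update above, not from server defaults.
const RESPONSIVENESS_CONFIG = {
  enabled: true,
  initial_wait_timeout_ms: 1200,
  min_filler_gap_ms: 8000,
  max_initial_per_turn: 1,
  enable_filler_on_first_assistant_reply: false
};

// turnState: { msSinceTurnEnd, firstDeltaReceived, isFirstAssistantReply,
//              initialFillersThisTurn, msSinceLastFiller (or null) }
function shouldSpeakFiller(turnState, config) {
  if (!config.enabled) return false;
  if (turnState.firstDeltaReceived) return false; // main LLM won the race
  if (turnState.isFirstAssistantReply && !config.enable_filler_on_first_assistant_reply) return false;
  if (turnState.initialFillersThisTurn >= config.max_initial_per_turn) return false;
  if (turnState.msSinceLastFiller !== null &&
      turnState.msSinceLastFiller < config.min_filler_gap_ms) return false;
  // Commit to the filler only once the wait timeout T has elapsed.
  return turnState.msSinceTurnEnd >= config.initial_wait_timeout_ms;
}

const waitingTurn = {
  msSinceTurnEnd: 1300,
  firstDeltaReceived: false,
  isFirstAssistantReply: false,
  initialFillersThisTurn: 0,
  msSinceLastFiller: null
};
console.log(shouldSpeakFiller(waitingTurn, RESPONSIVENESS_CONFIG)); // true
console.log(shouldSpeakFiller({ ...waitingTurn, msSinceTurnEnd: 1000 }, RESPONSIVENESS_CONFIG)); // false
```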
Session metadata
Two optional fields sit alongside the five branches at the top of providerData. They don’t configure STT, TTS, or memory; they tag the session so it can be traced, correlated, and routed downstream.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      user_id: 'user_abc123',
      metadata: {
        tenant: 'acme-corp',
        experiment: 'voice-preset-A'
      }
    }
  }
}));
| Field | Type | Description |
|---|---|---|
| user_id | string | Stable per-user identifier surfaced in tracing, logs, and downstream service requests. Useful for cross-session memory keying and incident debugging. |
| metadata | object (string → string) | Arbitrary key-value pairs forwarded to the LLM router as extra_body.metadata. Use for downstream-routing hints, customer-side correlation IDs, or A/B-test bucketing. |
Both fields are optional and hot-swappable.
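Because metadata is typed as string → string, application values such as numbers or booleans should be stringified before sending. The sketch below shows one way to do that; buildSessionTags is a hypothetical helper, and dropping null or undefined entries is a choice made here rather than documented behavior.

```javascript
// Hypothetical helper: coerces tag values to strings to match the documented
// string -> string shape, and omits entries that have no value.
function buildSessionTags(userId, tags) {
  const metadata = {};
  for (const [key, value] of Object.entries(tags)) {
    if (value === undefined || value === null) continue;
    metadata[key] = String(value);
  }
  return {
    type: 'session.update',
    session: { providerData: { user_id: userId, metadata } }
  };
}

const tagUpdate = buildSessionTags('user_abc123', { tenant: 'acme-corp', bucket: 7, note: null });
console.log(JSON.stringify(tagUpdate.session.providerData));
// {"user_id":"user_abc123","metadata":{"tenant":"acme-corp","bucket":"7"}}
```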
Hot-swap reference
Most providerData fields take effect on the next audio chunk or turn after the session.update is acknowledged. The exceptions — locked once at session open and ignored afterwards — are:
- providerData.tts.conversational
- providerData.tts.user_turn_mode
If you need to change either of these, open a new WebSocket session.
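A client can check a pending change against the locked fields before deciding between a session.update and a reconnect. The sketch below is a hypothetical helper (planProviderDataChange is not an Inworld API). It uses the defaults stated in the TTS field reference (conversational off, user_turn_mode both) when the current session never set a value.

```javascript
// Locked fields and their documented defaults.
const LOCKED_TTS_DEFAULTS = { conversational: false, user_turn_mode: 'both' };

// Compares the desired providerData against the current one and reports which
// locked fields would change, meaning a new WebSocket session is required.
function planProviderDataChange(current, desired) {
  const currentTts = current.tts || {};
  const desiredTts = desired.tts || {};
  const lockedChanges = [];
  for (const [field, fallback] of Object.entries(LOCKED_TTS_DEFAULTS)) {
    if (!(field in desiredTts)) continue; // omitted fields keep their value
    const currentValue = currentTts[field] ?? fallback;
    if (desiredTts[field] !== currentValue) lockedChanges.push(`tts.${field}`);
  }
  return { needsNewSession: lockedChanges.length > 0, lockedChanges };
}

const currentSession = { tts: { language: 'en-US' } };
console.log(planProviderDataChange(currentSession, { tts: { language: 'pt-BR' } }));
// { needsNewSession: false, lockedChanges: [] }
console.log(planProviderDataChange(currentSession, { tts: { conversational: true } }));
// { needsNewSession: true, lockedChanges: [ 'tts.conversational' ] }
```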