The Inworld Realtime API is wire-compatible with the OpenAI Realtime spec — clients written against OpenAI’s session, audio, and response events work against Inworld unchanged. On top of that baseline, Inworld layers production-grade extensions that improve quality, latency, and conversational naturalness:
  • STT tuning — voice profile signals, language hints, explicit end-of-turn and VAD overrides
  • TTS segmentation and steering — pick how the LLM token stream is chunked into TTS calls, the synthesis language, the TTS-2 delivery preset, and (for TTS-2) a shared multi-turn context
  • Automatic conversation memory — periodic summarization and fact extraction that keep long sessions inside the context window
  • Back-channel — short interjections ("uh-huh", "I see") emitted while the user is still speaking, so the agent feels like an active listener
  • Responsiveness fillers — short filler audio ("let me think") spoken in the gap after a user turn if the main LLM is slow to produce its first delta
Everything Inworld adds beyond the OpenAI spec is exposed through a single field on the session object: providerData. Send it inside any session.update and the server merges it with current state. Most fields hot-swap mid-session and take effect on the next audio chunk or turn; the locked-at-session-open exceptions are called out in the hot-swap reference at the bottom of this page.
This page is the field-by-field reference for the full providerData surface. For task-driven walkthroughs (language switching, conversation management, etc.) and for the event-handling client code that pairs with back-channel and responsiveness, see the linked guides under each branch.

Branch overview

providerData is a flat object with five branches. Each branch is independent — send only the ones you want to configure.
| Branch | Purpose | Hot-swap |
| --- | --- | --- |
| stt | STT tuning (voice profile, language hints, end-of-turn thresholds, VAD overrides) | Yes; the STT stream is restarted on the next audio chunk |
| tts | TTS segmentation, language, delivery preset, conversational context | Mostly; conversational and user_turn_mode are locked at session open |
| memory | Automatic conversation memory and summarization | Yes |
| backchannel | Short interjections while the user is speaking | Yes |
| responsiveness | Filler audio while the main LLM warms up after a user turn | Yes |
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      stt: { /* ... */ },
      tts: { /* ... */ },
      memory: { /* ... */ },
      backchannel: { /* ... */ },
      responsiveness: { /* ... */ }
    }
  }
}));
Partial updates are supported on every branch — omit a field to keep its current value. providerData also accepts two top-level metadata fields alongside the branches: user_id and metadata. They aren’t configuration branches; they tag the session for tracing and downstream routing. See Session metadata below.
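As a minimal sketch of the partial-update pattern described above, the helper below builds a session.update that carries only the keys the caller wants to change. The function name buildProviderDataUpdate and the client-side rejection of unknown keys are illustrative assumptions, not part of the Inworld API; the branch and metadata names come from this page.

```javascript
// Hypothetical helper (not part of any Inworld SDK): builds a session.update
// that carries only the providerData keys the caller wants to change.
const BRANCHES = ['stt', 'tts', 'memory', 'backchannel', 'responsiveness'];
const METADATA_KEYS = ['user_id', 'metadata'];

function buildProviderDataUpdate(changes) {
  const providerData = {};
  for (const [key, value] of Object.entries(changes)) {
    // Assumption: failing fast on unknown keys is a client-side choice.
    if (!BRANCHES.includes(key) && !METADATA_KEYS.includes(key)) {
      throw new Error(`Unknown providerData key: ${key}`);
    }
    // Omitted (undefined) values are skipped so the server keeps its state.
    if (value !== undefined) providerData[key] = value;
  }
  return { type: 'session.update', session: { providerData } };
}

// Change only the TTS language; every other field keeps its current value.
const partialUpdate = buildProviderDataUpdate({ tts: { language: 'pt-BR' } });
console.log(JSON.stringify(partialUpdate));
// ws.send(JSON.stringify(partialUpdate)); // `ws` is an open WebSocket (assumed)
```

Keeping the message this small relies on the server-side merge described above, so there is no need to resend the other branches.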

STT (providerData.stt)

Inworld extensions to the OpenAI-standard STT config. Every field here is hot-swappable; the STT stream is restarted automatically so the next chunk of audio uses the new value.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: {
      input: { transcription: { model: 'inworld/inworld-stt-1' } }
    },
    providerData: {
      stt: {
        prompt: 'Medical dictation. Vocabulary: angioplasty.',
        voice_profile: true,
        language_hints: ['en-US', 'es-MX'],
        end_of_turn_confidence_threshold: 0.7,
        min_end_of_turn_silence: 200,
        max_turn_silence: 5000,
        vad_threshold: 0.5
      }
    }
  }
}));
| Field | Type | Description |
| --- | --- | --- |
| prompt | string | Transcription guidance (vocabulary hints, domain context, formatting preferences). Equivalent to audio.input.transcription.prompt. |
| voice_profile | boolean | When true, attach voice-profile signals (age, gender, emotion, vocal style, accent) to transcription events under providerData.voiceProfile. See Voice profile payload below for the returned shape. |
| language_hints | string[] | BCP-47-ish hints to bias recognition without committing to a single language. Soniox-specific (soniox/stt-rt-v4); ignored by other models. |
| end_of_turn_confidence_threshold | number | STT end-of-turn confidence cutoff (0.0 to 1.0). Explicit override of the semantic_vad.eagerness mapping. |
| vad_threshold | number | Speech/silence VAD cutoff (0.0 to 1.0). Explicit override of the eagerness mapping. |
| min_end_of_turn_silence | integer | Minimum trailing silence (ms) before STT considers a turn finished. Explicit override of the eagerness mapping. |
| max_turn_silence | integer | Hard ceiling (ms) on within-turn silence before STT force-closes the turn. Explicit override of the eagerness mapping. |
For the eagerness preset that these fields override, see semantic_vad.
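The ranges and units listed for the STT overrides can be checked on the client before sending. The sketch below is one way to do that; validateSttOverrides is a hypothetical helper, and the cross-check that min_end_of_turn_silence should not exceed max_turn_silence is an assumption rather than a documented server rule.

```javascript
// Hypothetical client-side check for providerData.stt overrides.
// Returns a list of human-readable problems; an empty list means "looks fine".
function validateSttOverrides(stt) {
  const errors = [];
  const inUnitRange = (v) => typeof v === 'number' && v >= 0 && v <= 1;
  const isNonNegativeInt = (v) => Number.isInteger(v) && v >= 0;

  // Thresholds are documented as values between 0.0 and 1.0.
  for (const key of ['end_of_turn_confidence_threshold', 'vad_threshold']) {
    if (key in stt && !inUnitRange(stt[key])) {
      errors.push(`${key} must be a number in [0.0, 1.0]`);
    }
  }
  // Silence windows are documented as integer milliseconds.
  for (const key of ['min_end_of_turn_silence', 'max_turn_silence']) {
    if (key in stt && !isNonNegativeInt(stt[key])) {
      errors.push(`${key} must be a non-negative integer (ms)`);
    }
  }
  // Assumption: a minimum above the hard ceiling is almost certainly a mistake.
  if (isNonNegativeInt(stt.min_end_of_turn_silence) &&
      isNonNegativeInt(stt.max_turn_silence) &&
      stt.min_end_of_turn_silence > stt.max_turn_silence) {
    errors.push('min_end_of_turn_silence exceeds max_turn_silence');
  }
  return errors;
}

const sttProblems = validateSttOverrides({
  vad_threshold: 1.5,
  min_end_of_turn_silence: 200,
  max_turn_silence: 5000
});
console.log(sttProblems);
```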

Voice profile payload

When providerData.stt.voice_profile is true, every conversation.item.input_audio_transcription.delta and conversation.item.input_audio_transcription.completed event carries a providerData.voiceProfile object alongside the transcript text:
{
  "type": "conversation.item.input_audio_transcription.completed",
  "event_id": "evt_5f7d2",
  "item_id": "item_aud_01HF…",
  "content_index": 0,
  "transcript": "Hello, how are you?",
  "providerData": {
    "voiceProfile": {
      "age":         [{ "label": "adult",          "confidence": 0.78 }],
      "gender":      [{ "label": "female",         "confidence": 0.91 }],
      "emotion":     [{ "label": "neutral",        "confidence": 0.65 }],
      "vocal_style": [{ "label": "conversational", "confidence": 0.82 }],
      "accent":      [{ "label": "en-US",          "confidence": 0.88 }]
    }
  }
}
Each top-level key is an array of { label, confidence } objects sorted by descending confidence. Keys are omitted when the STT backend does not produce labels for that category, so always null-check before reading. Confidence values are in [0.0, 1.0].
| Category | Notes |
| --- | --- |
| age | Estimated age band of the speaker. |
| gender | Estimated gender of the speaker. |
| emotion | Detected emotional tone in the current segment. Can shift across deltas within a single turn. |
| vocal_style | Speaking style (e.g. conversational, narration, whisper, monotone). |
| accent | Regional accent or dialect as a BCP-47-like locale code (e.g. en-US, en-GB). |
Voice profile is computed by the realtime service regardless of the STT backend, so voice_profile: true works across all supported STT models.
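Because categories can be missing from the payload, reads should be defensive. The following sketch shows one way to pull the highest-confidence label for a category; topVoiceLabel and its minConfidence parameter are hypothetical names, and the event shape is the one shown in the sample above.

```javascript
// Hypothetical reader for the voiceProfile payload on transcription events.
// Returns the highest-confidence label for a category, or null when the
// category is absent or below the caller's confidence floor.
function topVoiceLabel(event, category, minConfidence = 0) {
  const entries = event?.providerData?.voiceProfile?.[category];
  if (!Array.isArray(entries) || entries.length === 0) return null;
  // Entries are documented as sorted by descending confidence.
  const best = entries[0];
  return best.confidence >= minConfidence ? best.label : null;
}

const sampleTranscriptionEvent = {
  type: 'conversation.item.input_audio_transcription.completed',
  transcript: 'Hello, how are you?',
  providerData: {
    voiceProfile: {
      emotion: [{ label: 'neutral', confidence: 0.65 }],
      accent: [{ label: 'en-US', confidence: 0.88 }]
    }
  }
};

console.log(topVoiceLabel(sampleTranscriptionEvent, 'accent'));       // en-US
console.log(topVoiceLabel(sampleTranscriptionEvent, 'emotion', 0.8)); // null
console.log(topVoiceLabel(sampleTranscriptionEvent, 'age'));          // null
```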

TTS (providerData.tts)

Controls how the LLM text stream is segmented and forwarded to the TTS backend, the language and delivery preset used for synthesis, and (for TTS-2) whether a shared upstream context is preserved across turns. Available on inworld-tts-1.5-mini and inworld-tts-2.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { model: 'inworld-tts-2', voice: 'Olivia' } },
    providerData: {
      tts: {
        segmenter_strategy: 'sentence',
        steering_handling: 'emit_once',
        language: 'en-US',
        delivery_mode: 'CREATIVE',
        conversational: false,
        user_turn_mode: 'both'
      }
    }
  }
}));
| Field | Type | Description |
| --- | --- | --- |
| segmenter_strategy | string | How the LLM token stream is chunked before being forwarded to TTS. One of auto, balanced, sentence, full_turn, fast_start, per_segment_context. Empty string inherits the server default. Hot-swappable. See Segmenter strategies. |
| steering_handling | string | How to handle a leading [steering] tag captured from the LLM turn. repeat_each_chunk re-prepends it to every TTS request (default). emit_once prepends it only to the first request (recommended for inworld-tts-2). Hot-swappable. |
| language | string | BCP-47 tag (e.g. "en-US", "pt-BR") forwarded to TTS as the synthesis language. Independent from audio.input.transcription.language; STT and TTS can use different languages. Empty string lets the TTS backend infer. Hot-swappable. |
| delivery_mode | string | TTS-2 generation preset trading off stability vs. expressiveness. One of STABLE, BALANCED, CREATIVE (case-insensitive). Empty or unrecognised values are treated as unspecified. No-op for non-TTS-2 models. Hot-swappable. |
| conversational | boolean | TTS-2 only. When true, opens a single shared upstream TTS context for the entire WebSocket session. Locked at session open; mid-session toggles are ignored. See Conversational TTS. |
| user_turn_mode | string | Conversational-mode only. Which channels of the user turn are forwarded to TTS before each assistant generation. One of both (default), audio_only, text_only, or none. No-op outside conversational mode. Locked at session open. |

Conversational TTS

Setting providerData.tts.conversational = true opts TTS-2 into a multi-turn shared context: the upstream TurnContext sees every user and assistant turn for the lifetime of the WebSocket. This lets the model condition its delivery on the audio history of the conversation. The trade-off is a longer-lived state on the TTS backend and slightly higher per-turn cost. In conversational mode, segmenter_strategy is internally locked to full_turn semantics. Per-sentence and per-segment-context strategies are coerced (with a server-side WARN) because they would either fragment the upstream history or open a fresh context per segment, both of which defeat the multi-turn TurnContext.
With conversational: true, TTS conditions each response on the audio of previous turns — higher per-turn cost in exchange for potentially more natural output. Off by default.

Segmenter strategies

| Strategy | Behaviour |
| --- | --- |
| auto | Default. inworld-tts-2 uses sentence splits; older models use balanced splits. |
| balanced | Punctuation + conjunction splits. Tuned for inworld-tts-1.5. |
| sentence | Hard terminal-punctuation splits only. Tuned for inworld-tts-2. |
| full_turn | Buffer the entire LLM turn and emit it at turn end. Highest quality, highest latency. |
| fast_start | Strict sentence rules for the first emission, then a relaxed config (larger chunks, no idle-flush) for the rest of the turn. Optimizes time-to-first-audio. |
| per_segment_context | Each segment opens a fresh TTS context on the duplex stream. Per-segment handles are serialized so audio order is preserved. |
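To reason about which strategy actually applies, the rules on this page can be written down as a small client-side model. This is only a mental-model sketch: effectiveSegmenterStrategy is a hypothetical function, the server performs the real resolution, and two points are assumptions (that the server default for an empty string is auto, and that conversational is simply ignored on models other than inworld-tts-2).

```javascript
// Client-side model of the documented segmenter rules. The server is the
// source of truth; this only mirrors the behaviour described on this page.
const STRATEGIES = [
  'auto', 'balanced', 'sentence', 'full_turn', 'fast_start', 'per_segment_context'
];

function effectiveSegmenterStrategy({ model, segmenter_strategy = '', conversational = false }) {
  const isTts2 = model === 'inworld-tts-2';
  // Conversational mode (TTS-2 only) is locked to full_turn semantics.
  // Assumption: the flag has no effect on other models.
  if (conversational && isTts2) return 'full_turn';

  // Assumption: the server default for an empty string is "auto".
  const requested = segmenter_strategy === '' ? 'auto' : segmenter_strategy;
  if (!STRATEGIES.includes(requested)) {
    throw new Error(`Unknown segmenter_strategy: ${requested}`);
  }
  // "auto" resolves per model: sentence for inworld-tts-2, balanced otherwise.
  if (requested === 'auto') return isTts2 ? 'sentence' : 'balanced';
  return requested;
}

console.log(effectiveSegmenterStrategy({ model: 'inworld-tts-2' }));
console.log(effectiveSegmenterStrategy({ model: 'inworld-tts-1.5-mini', segmenter_strategy: 'auto' }));
console.log(effectiveSegmenterStrategy({
  model: 'inworld-tts-2', segmenter_strategy: 'sentence', conversational: true
}));
```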

Memory (providerData.memory)

Automatic conversation memory and summarization. When enabled, the server periodically asks the LLM to extract durable facts and a rolling summary, prepends them to the system prompt, and trims the transcript so context stays bounded.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      memory: {
        enabled: true,
        turn_interval: 5,
        max_memory_length: 2000,
        max_transcript_items: 40,
        max_facts: 50,
        trim_after_summarize: true
      }
    }
  }
}));
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Enable automatic memory generation. |
| turn_interval | integer | 5 | Generate memory every N completed turns. |
| max_memory_length | integer | 2000 | Maximum character length for the rolling summary. |
| max_transcript_items | integer | 40 | Maximum conversation items to keep after trimming. |
| max_facts | integer | 50 | Maximum facts retained in state.facts. |
| trim_after_summarize | boolean | true | Remove old transcript items after summarization. |
After each generation cycle the server populates providerData.memory.state (read-only) and emits a session.updated event so clients can observe the rolling summary, fact list, and bookkeeping counters.
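A client can observe the read-only memory state by watching session.updated events. The sketch below assumes the state arrives at event.session.providerData.memory.state; that path and the readMemoryState name are assumptions for illustration. Only state.facts is named on this page, so the rest of the state object is passed through untouched rather than guessed at.

```javascript
// Hypothetical reader for the read-only memory state on session.updated.
// Assumption: the state is nested at event.session.providerData.memory.state.
function readMemoryState(event) {
  if (event?.type !== 'session.updated') return null;
  const state = event.session?.providerData?.memory?.state;
  if (!state) return null;
  return {
    // state.facts is the only field named in this reference.
    facts: Array.isArray(state.facts) ? state.facts : [],
    // Other fields (summary, counters) are exposed as-is, without guessing names.
    raw: state
  };
}

const sampleUpdatedEvent = {
  type: 'session.updated',
  session: { providerData: { memory: { state: { facts: ['prefers email'] } } } }
};

const memoryState = readMemoryState(sampleUpdatedEvent);
console.log(memoryState.facts);
```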

Back-channel (providerData.backchannel)

Short audio interjections — "uh-huh", "right", "I see" — emitted while the user is still speaking. Opt-in per session and gated by server prerequisites; contact your account team to confirm prerequisites for your deployment. For event handling (the response.backchannel.audio.delta / .done / .skipped events), client integration tips, and tuning guidance, see the dedicated Back-channel guide.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      backchannel: {
        enabled: true,
        eval_interval_ms: 800,
        min_speech_ms: 800,
        min_gap_ms: 4000,
        max_per_turn: 3,
        hard_deadline_ms: 1500,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 6,
        volume_gain: 0.6,
        require_pause: false,
        decider_kind: 'llm'
      }
    }
  }
}));
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Per-session opt-in. Sessions that don’t send this field never receive back-channels. |
| small_model | string | server default | Override the decider LLM model identifier. Empty string inherits the default. |
| eval_interval_ms | integer | 800 | How often the manager evaluates eligibility while the user is producing partial transcripts. |
| min_speech_ms | integer | 800 | Minimum time after speech onset before any back-channel can fire. |
| min_gap_ms | integer | 4000 | Minimum spacing between two back-channels in the same user turn. |
| max_per_turn | integer | 3 | Cap on back-channels emitted within a single user turn. |
| hard_deadline_ms | integer | 1500 | Combined small-LLM + TTS deadline per attempt. Misses are dropped. |
| history_tail_items | integer | 4 | Recent conversation items the small LLM sees as context. |
| temperature | number | 0.7 | Sampling temperature for the small LLM. |
| max_tokens | integer | 6 | Max tokens for the small LLM’s reply. |
| volume_gain | number | 0.6 | Linear gain multiplier applied to synthesized back-channel audio. 0.0 mutes; 1.0 keeps the synthesized volume; >1.0 amplifies. |
| require_pause | boolean | false | When true, only fire after a smart-turn pause signal (input_audio_buffer.turn_suggestion). |
| allowed_phrases | string[] | server default | Restrict the phrase bank. null / omitted inherits the default; an explicit empty array disables back-channel for the session; a populated array replaces the bank. |
| prompt_template | string | server default | Override the decider prompt. Supports Go text/template tokens {{.PhrasesList}}, {{.History}}, {{.Partial}}. |
| decider_kind | "llm" or "rule" | "llm" | llm uses a small LLM. rule picks phrases from the bank with per-tick probability rule_fire_probability. |
| rule_fire_probability | number | 1.0 | Per-tick fire probability for the rule decider (0.0 to 1.0). Ignored when decider_kind != "rule". |
Sending providerData.backchannel: {} (empty object) clears all overrides; the server falls back to its compiled-in defaults.
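When tuning the timing fields, it can help to model how they interact. The sketch below is a simplified client-side reading of min_speech_ms, min_gap_ms, max_per_turn, and the empty allowed_phrases rule; it is not the server's implementation. backchannelAllowed and the shape of the turn object are hypothetical.

```javascript
// Simplified model of the documented back-channel gates, for tuning only.
// Defaults are the values listed in the field table.
const BACKCHANNEL_DEFAULTS = { min_speech_ms: 800, min_gap_ms: 4000, max_per_turn: 3 };

// turn: { speechStartMs, lastFiredMs (or null), firedCount } -- hypothetical shape
function backchannelAllowed(cfg, turn, nowMs) {
  const c = { ...BACKCHANNEL_DEFAULTS, ...cfg };
  if (!c.enabled) return false;
  // An explicit empty phrase bank disables back-channel for the session.
  if (Array.isArray(c.allowed_phrases) && c.allowed_phrases.length === 0) return false;
  // Too early after speech onset.
  if (nowMs - turn.speechStartMs < c.min_speech_ms) return false;
  // Per-turn cap reached.
  if (turn.firedCount >= c.max_per_turn) return false;
  // Too close to the previous back-channel in this turn.
  if (turn.lastFiredMs !== null && nowMs - turn.lastFiredMs < c.min_gap_ms) return false;
  return true;
}

const cfg = { enabled: true };
console.log(backchannelAllowed(cfg, { speechStartMs: 0, lastFiredMs: null, firedCount: 0 }, 500));  // false
console.log(backchannelAllowed(cfg, { speechStartMs: 0, lastFiredMs: null, firedCount: 0 }, 900));  // true
console.log(backchannelAllowed(cfg, { speechStartMs: 0, lastFiredMs: 900, firedCount: 1 }, 3000));  // false
console.log(backchannelAllowed(cfg, { speechStartMs: 0, lastFiredMs: 900, firedCount: 1 }, 5000));  // true
```

With the defaults, a single user turn must last roughly nine seconds before the third back-channel becomes possible (800 ms onset plus two 4000 ms gaps), which is worth keeping in mind before raising max_per_turn.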

Responsiveness (providerData.responsiveness)

Short filler audio ("let me think", "one moment") spoken after the user’s turn ends if the main LLM is slow to produce its first delta. Opt-in per session and gated by two server prerequisites (a small filler model and an Unleash flag); contact your account team to confirm both are in place. For how the filler races the main LLM, TTS pipeline details, and tuning guidance, see the dedicated Responsiveness guide.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      responsiveness: {
        enabled: true,
        initial_wait_timeout_ms: 1200,
        hard_deadline_ms: 2000,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 12,
        min_filler_gap_ms: 8000,
        max_initial_per_turn: 1,
        enable_filler_on_first_assistant_reply: false,
        pause_text: ''
      }
    }
  }
}));
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Per-session opt-in. A session that does not send this object never gets a filler. |
| small_model | string | server default | Override the filler LLM model identifier. |
| initial_wait_timeout_ms | integer | server default | T: how long to wait for the main LLM’s first delta before committing to the filler. Lower values fire fillers more aggressively. |
| hard_deadline_ms | integer | server default | Caps the small / filler LLM’s total streaming time so a slow filler model can’t become a latency tax. |
| history_tail_items | integer | server default | Recent conversation items the small LLM sees as context. |
| temperature | number | server default | Sampling temperature for the small LLM. |
| max_tokens | integer | server default | Caps the small LLM’s response length. Keep small; fillers should be brief. |
| min_filler_gap_ms | integer | server default | Minimum gap between any two fillers within a single user-turn chain. |
| max_initial_per_turn | integer | 1 | Caps initial fillers per user turn. |
| max_buffer_deltas | integer | server default | Bounds the in-memory buffer of main-LLM deltas held while the filler is being spoken. |
| enable_filler_on_first_assistant_reply | boolean | false | Allows responsiveness fillers on the very first assistant response in a session. |
| prompt_template | string | server default | Overrides the system prompt fed to the small filler LLM. Append a language directive here for multilingual sessions. |
| pause_text | string | server default | TTS-only hint injected between the filler and the main answer (e.g. a brief connector word). Empty string disables injection. |
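The core decision behind a filler can be sketched as a small predicate over the documented fields. This is a simplified client-side model for reasoning about initial_wait_timeout_ms and the per-turn caps, not the server's logic; shouldSpeakFiller and the turn object are hypothetical, and initial_wait_timeout_ms must be passed explicitly because its server default is not published here.

```javascript
// Simplified model of when a responsiveness filler would be spoken.
// turn: { waitedMs, firstDeltaSeen, fillersThisTurn, isFirstAssistantReply }
// (hypothetical shape, for illustration only)
function shouldSpeakFiller(cfg, turn) {
  if (!cfg.enabled) return false;
  // Fillers on the very first assistant reply are off unless opted in.
  if (turn.isFirstAssistantReply && !cfg.enable_filler_on_first_assistant_reply) return false;
  // Per-turn cap; 1 is the documented default.
  if (turn.fillersThisTurn >= (cfg.max_initial_per_turn ?? 1)) return false;
  // The main LLM already produced its first delta, so there is no gap to fill.
  if (turn.firstDeltaSeen) return false;
  // Commit to the filler only once the wait timeout has elapsed.
  return turn.waitedMs >= cfg.initial_wait_timeout_ms;
}

const fillerCfg = { enabled: true, initial_wait_timeout_ms: 1200 };
const slowTurn = { waitedMs: 1300, firstDeltaSeen: false, fillersThisTurn: 0, isFirstAssistantReply: false };
const fastTurn = { waitedMs: 1300, firstDeltaSeen: true, fillersThisTurn: 0, isFirstAssistantReply: false };

console.log(shouldSpeakFiller(fillerCfg, slowTurn)); // true
console.log(shouldSpeakFiller(fillerCfg, fastTurn)); // false
```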

Session metadata

Two optional fields sit alongside the five branches at the top of providerData. They don’t configure STT, TTS, or memory — they tag the session so it can be traced, correlated, and routed downstream.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      user_id: 'user_abc123',
      metadata: {
        tenant: 'acme-corp',
        experiment: 'voice-preset-A'
      }
    }
  }
}));
| Field | Type | Description |
| --- | --- | --- |
| user_id | string | Stable per-user identifier surfaced in tracing, logs, and downstream service requests. Useful for cross-session memory keying and incident debugging. |
| metadata | object (string → string) | Arbitrary key-value pairs forwarded to the LLM router as extra_body.metadata. Use for downstream-routing hints, customer-side correlation IDs, or A/B-test bucketing. |
Both fields are optional and hot-swappable.
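Since metadata is typed as a string-to-string map, application values such as numbers or booleans need converting before they are sent. The helper below is one possible normalization; toStringMetadata is a hypothetical name, and dropping null or undefined values is a client-side choice rather than documented server behaviour.

```javascript
// Hypothetical normalizer: coerces metadata values to strings so the object
// matches the documented string -> string shape.
function toStringMetadata(input) {
  const out = {};
  for (const [key, value] of Object.entries(input)) {
    // Assumption: skipping empty values is preferable to sending "null".
    if (value === undefined || value === null) continue;
    out[key] = typeof value === 'string' ? value : JSON.stringify(value);
  }
  return out;
}

const sessionMetadata = toStringMetadata({
  tenant: 'acme-corp',
  bucket: 3,
  beta: true,
  region: null
});
console.log(JSON.stringify(sessionMetadata));
// Usage: session: { providerData: { user_id: 'user_abc123', metadata: sessionMetadata } }
```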

Hot-swap reference

Most providerData fields take effect on the next audio chunk or turn after the session.update is acknowledged. The exceptions — locked once at session open and ignored afterwards — are:
  • providerData.tts.conversational
  • providerData.tts.user_turn_mode
If you need to change either of these, open a new WebSocket session.
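A client can detect up front whether a desired TTS change touches one of the locked fields and therefore needs a reconnect. In the sketch below, lockedFieldChanges is a hypothetical helper; the defaults used for comparison (conversational off, user_turn_mode set to both) are the ones stated earlier on this page.

```javascript
// Fields documented as locked at session open.
const LOCKED_TTS_FIELDS = ['conversational', 'user_turn_mode'];
// Documented defaults, used when the session was opened without the field.
const LOCKED_TTS_DEFAULTS = { conversational: false, user_turn_mode: 'both' };

// Returns the locked field paths whose desired value differs from the value
// in effect when the session was opened. Non-empty means "open a new session".
function lockedFieldChanges(ttsAtOpen, desiredTts) {
  const inEffect = { ...LOCKED_TTS_DEFAULTS, ...ttsAtOpen };
  return LOCKED_TTS_FIELDS
    .filter((field) => field in desiredTts && desiredTts[field] !== inEffect[field])
    .map((field) => `providerData.tts.${field}`);
}

const blocked = lockedFieldChanges(
  { language: 'en-US' },
  { language: 'pt-BR', conversational: true }
);
console.log(blocked);
if (blocked.length > 0) {
  console.log('Reconnect required to change:', blocked.join(', '));
}
```

Hot-swappable fields such as language are ignored by the check, so they can still be sent in a regular session.update on the existing connection.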

See also