Responsiveness — the intermediate-filler layer — bridges the gap between the moment the user finishes speaking and the moment the main LLM produces its first audible delta. This can be useful if a tool call is made or if the main LLM is slow to produce its response. A small “filler” LLM races against the main model: if the main model takes longer than initial_wait_timeout_ms to emit its first token, the server speaks a short filler (“let me think”, “good question, one moment”) via TTS, then transparently hands off to the main response when it lands.
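
The race described above can be modeled as a small timeline function. This is only an illustrative sketch, not the server's implementation: planTurn, mainFirstDeltaMs, and the returned shape are hypothetical names introduced here, and all times are assumed to be milliseconds measured from the end of the user's speech.

```javascript
// Hypothetical model of the filler race; NOT the server's actual code.
// All times are milliseconds measured from the end of the user's speech.
function planTurn({ mainFirstDeltaMs, initialWaitTimeoutMs }) {
  // The main model wins if its first delta lands within the wait window.
  if (mainFirstDeltaMs <= initialWaitTimeoutMs) {
    return { fillerAtMs: null, order: ['main'] };
  }
  // Otherwise the filler is committed at the timeout and the main response
  // follows once it is available.
  return { fillerAtMs: initialWaitTimeoutMs, order: ['filler', 'main'] };
}

// A fast main model: no filler is spoken.
console.log(planTurn({ mainFirstDeltaMs: 600, initialWaitTimeoutMs: 1200 }));
// A slow main model (for example, one waiting on a tool call): filler first.
console.log(planTurn({ mainFirstDeltaMs: 3000, initialWaitTimeoutMs: 1200 }));
```

The model ignores everything except the first-delta latency and the wait window, which is enough to see why lowering initial_wait_timeout_ms makes fillers more frequent.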

Enabling responsiveness

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      responsiveness: {
        enabled: true,
        initial_wait_timeout_ms: 1200,
        hard_deadline_ms: 2000,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 12,
        min_filler_gap_ms: 8000,
        max_initial_per_turn: 1,
        enable_filler_on_first_assistant_reply: false,
        pause_text: ''
      }
    }
  }
}));

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Per-session opt-in. A session that does not send this object never gets a filler, even when the server has the racing infrastructure wired and the Unleash flag is on. |
| small_model | string | server default | Override the filler LLM model identifier. Useful for A/B testing different small models without a service redeploy. |
| initial_wait_timeout_ms | integer | server default | T: how long to wait for the main LLM’s first delta before committing to the filler. Lower values fire fillers more aggressively (better perceived responsiveness, more frequent fillers). |
| hard_deadline_ms | integer | server default | Caps the small / filler LLM’s total streaming time so a slow filler model can’t itself become a latency tax. |
| history_tail_items | integer | server default | Recent conversation items the small LLM sees as context. Trades coherence (more history = more on-topic fillers) for token cost. |
| temperature | number | server default | Sampling temperature for the small LLM. |
| max_tokens | integer | server default | Caps the small LLM’s response length. Keep this small; fillers should be brief. |
| min_filler_gap_ms | integer | server default | Minimum gap between any two fillers within a single user-turn chain. Prevents back-to-back fillers when the main LLM is consistently slow. |
| max_initial_per_turn | integer | 1 | Caps initial fillers per user turn. The default of 1 matches the v1 single-filler behavior. |
| max_buffer_deltas | integer | server default | Bounds the in-memory buffer of main-LLM deltas held while the filler is being spoken. Once the buffer is exhausted the main response is flushed even if the filler is still in progress. |
| enable_filler_on_first_assistant_reply | boolean | false | Allows responsiveness fillers on the very first assistant response in a session. Default false because the first reply is often a greeting that doesn’t benefit from a filler. |
| prompt_template | string | server default | Overrides the system prompt fed to the small filler LLM. Empty string is treated as “use server default”; you can’t clear it to literally empty. Append a language directive here for multilingual sessions; the compiled-in default is English-biased. |
| pause_text | string | server default | TTS-only hint injected between the filler and the main answer (e.g. a brief audible breath or a short connector word). Empty string disables injection for this session. |

All fields are optional. Sending null or omitting a field leaves the server-side default in place.
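
Because null and omitted fields both fall back to server defaults, a client can build the update from only the values it actually wants to change. The helper below is a sketch under that assumption; buildResponsivenessUpdate is a hypothetical name, not part of any SDK. It keeps empty strings, since an empty pause_text is meaningful (it disables injection).

```javascript
// Hypothetical helper (not an SDK function): keep only the fields the caller
// actually set, so everything else stays at the server-side default.
function buildResponsivenessUpdate(overrides = {}) {
  const responsiveness = {};
  for (const [key, value] of Object.entries(overrides)) {
    // null and undefined both mean "use the server default", so skip them.
    // Empty strings are kept on purpose: pause_text: '' disables injection.
    if (value !== null && value !== undefined) responsiveness[key] = value;
  }
  return {
    type: 'session.update',
    session: { providerData: { responsiveness } },
  };
}

const update = buildResponsivenessUpdate({
  enabled: true,
  initial_wait_timeout_ms: 900,
  temperature: null, // dropped: stays at the server default
  pause_text: '',    // kept: disables pause injection for this session
});
console.log(JSON.stringify(update, null, 2));
// With an open WebSocket: ws.send(JSON.stringify(update));
```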

Interaction with TTS

The filler is spoken through the same TTS pipeline as the main response — the same voice, the same delivery_mode, the same language. Filler audio is delivered on your transport’s normal assistant-audio path: on WebSocket sessions it arrives on the regular response.output_audio.delta stream, and on WebRTC sessions it plays on the inbound RTP audio track (the same track that carries main-response audio — see WebRTC). There is no dedicated filler event type or separate media track. This means:
  • Voice / language / accent flips made via session.update apply to fillers immediately.
  • If you have configured providerData.tts.conversational = true, fillers participate in the shared TTS context just like main responses.
  • pause_text, when set, is synthesized through the same TTS call after the filler — useful for a softer transition from “let me think” into the actual answer.
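
On the client side, this means a WebSocket audio handler needs no filler-specific branch. The sketch below assumes a Node.js client, that each server message is a JSON object with a type field, and that the audio payload is base64 in a delta field; the payload field name and encoding are assumptions not stated on this page, and handleServerMessage is a hypothetical helper.

```javascript
// Sketch of a client-side handler (Node.js). ASSUMPTIONS: messages are JSON
// with a `type` field, and audio deltas carry base64 audio in `delta`.
function handleServerMessage(raw, playbackQueue) {
  const event = JSON.parse(raw);
  if (event.type === 'response.output_audio.delta') {
    // Filler audio and main-response audio both arrive here, so one code
    // path queues them in arrival order.
    playbackQueue.push(Buffer.from(event.delta, 'base64'));
  }
  // Every other event type is ignored by this sketch.
  return playbackQueue;
}

const queue = [];
handleServerMessage(
  JSON.stringify({ type: 'response.output_audio.delta', delta: 'YWJj' }),
  queue
);
handleServerMessage(JSON.stringify({ type: 'session.updated' }), queue);
console.log(queue.length); // 1
```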

Example: Japanese fillers

The compiled-in filler prompt biases the small LLM toward English, so a Japanese session needs both a Japanese TTS language pin and a prompt_template override that tells the filler LLM to reply in Japanese. The TTS voice and delivery_mode come from your normal audio.output and providerData.tts settings, so the filler is spoken in the Japanese voice you’ve already configured.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { voice: 'Hiroshi', model: 'inworld-tts-2' } },
    providerData: {
      tts: { language: 'ja-JP' },
      responsiveness: {
        enabled: true,
        initial_wait_timeout_ms: 1200,
        max_tokens: 12,
        prompt_template:
          'You are a polite Japanese conversational assistant. ' +
          'The user has just spoken and the main response is being prepared. ' +
          'Output a single short filler in natural spoken Japanese ' +
          '(e.g. "ちょっと待ってください", "少々お待ちを", "そうですね") — ' +
          'no greeting, no follow-up question, just the filler. ' +
          'Reply in Japanese only.'
      }
    }
  }
}));
What this does:
  • prompt_template is the single most important setting for non-English fillers. The compiled-in default is English-biased; without the override, the filler LLM emits English (“let me think”) even mid-Japanese conversation. Keep the prompt short and example-driven — the filler LLM is small and benefits from concrete examples in the target language.
  • providerData.tts.language: 'ja-JP' anchors the TTS accent so the synthesized filler audio is rendered in Japanese, not transliterated through an English voice.
  • max_tokens: 12 caps filler length. Japanese fillers are typically short (ちょっと待ってください is roughly 5 tokens, though the exact count depends on the tokenizer); keep this low so the filler doesn’t outrun the main LLM warmup.
  • Voice selection (e.g. Hiroshi) and delivery_mode are inherited from your TTS config — no responsiveness-specific override needed.
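
The same prompt shape can be generated for other languages. The helper below is a hypothetical convenience (buildFillerPrompt is not part of the API) that reuses the wording of the Japanese prompt above and swaps in a language name and a few example fillers in the target language.

```javascript
// Hypothetical helper: builds a short, example-driven prompt_template that
// mirrors the Japanese prompt above for any target language.
function buildFillerPrompt({ languageName, examples }) {
  const quoted = examples.map((example) => `"${example}"`).join(', ');
  return (
    `You are a polite ${languageName} conversational assistant. ` +
    'The user has just spoken and the main response is being prepared. ' +
    `Output a single short filler in natural spoken ${languageName} ` +
    `(e.g. ${quoted}); ` +
    'no greeting, no follow-up question, just the filler. ' +
    `Reply in ${languageName} only.`
  );
}

const spanishPrompt = buildFillerPrompt({
  languageName: 'Spanish',
  examples: ['Un momento, por favor', 'Déjame pensar'],
});
console.log(spanishPrompt);
// Pass the result as providerData.responsiveness.prompt_template in session.update.
```

The TTS language pin and voice still need to be set separately for the target language, as in the Japanese example.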

Tuning tips

  • Start with enabled: true and the server defaults. Measure end-to-end perceived latency before tuning anything else.
  • If fillers fire too often (the user notices a filler before every response), raise initial_wait_timeout_ms. Fillers should mask only the slowest tail of main-LLM responses.
  • If fillers fire too rarely (long awkward silences), lower initial_wait_timeout_ms.
  • For multilingual sessions, append a language directive to prompt_template. The compiled-in default biases the filler LLM toward English; without the override the agent will speak English fillers mid-Spanish or mid-French conversation.
  • Pair with enable_filler_on_first_assistant_reply: false (the default) so the opening greeting plays cleanly — fillers on the very first turn tend to feel awkward.
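
One way to pick a starting value for initial_wait_timeout_ms is to measure main-LLM first-delta latencies in your own client and set the timeout at a high percentile, so that only the slowest tail of turns gets a filler. The sketch below is a hypothetical heuristic, not an official recommendation; suggestInitialWaitTimeoutMs, the 80th-percentile default, and the sample latencies are illustrative assumptions.

```javascript
// Hypothetical heuristic (NOT an official recommendation): choose the timeout
// at a percentile of observed first-delta latencies, so only the slowest
// turns exceed it and trigger a filler.
function suggestInitialWaitTimeoutMs(latenciesMs, percentile = 80) {
  if (latenciesMs.length === 0) throw new Error('need at least one sample');
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  // Nearest-rank percentile: the smallest sample with at least `percentile`
  // percent of all samples at or below it.
  const rank = Math.max(1, Math.ceil((percentile * sorted.length) / 100));
  return sorted[rank - 1];
}

// Made-up latencies in milliseconds, for illustration only.
const samples = [400, 550, 620, 700, 810, 950, 1100, 1500, 2400, 3100];
console.log(suggestInitialWaitTimeoutMs(samples)); // 1500
```

With these made-up samples, a timeout of 1500 would let fillers fire on the two slowest turns out of ten. Raise the percentile if fillers still feel too frequent, or lower it if silences remain noticeable.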