Responsiveness (intermediate fillers) - Inworld AI Documentation

Responsiveness, the intermediate-filler layer, bridges the gap between the moment the user finishes speaking and the moment the main LLM produces its first audible delta. It is useful when the main LLM makes a tool call or is otherwise slow to respond. A small “filler” LLM races against the main model: if the main model takes longer than initial_wait_timeout_ms to emit its first delta, the server speaks a short filler (“let me think”, “good question, one moment”) via TTS, then hands off to the main response when it arrives.
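
The race can be pictured with a small model of a single turn. This is only an illustrative sketch, not the server implementation: simulateTurn and its result fields are hypothetical names, and the model ignores filler-LLM generation time and TTS latency.

```javascript
// Simplified model of one turn. Times are milliseconds after the user stops speaking.
// Assumption: the filler is committed exactly when initial_wait_timeout_ms elapses
// without a main-LLM delta. Filler generation and TTS latency are ignored.
function simulateTurn({ mainFirstDeltaMs, initialWaitTimeoutMs }) {
  if (mainFirstDeltaMs <= initialWaitTimeoutMs) {
    // The main model won the race, so no filler is spoken.
    return { fillerSpoken: false, firstOutputAtMs: mainFirstDeltaMs };
  }
  // The timeout elapsed first: a filler is spoken, then the main response follows.
  return { fillerSpoken: true, firstOutputAtMs: initialWaitTimeoutMs };
}

// A fast turn and a slow turn with the same timeout.
const fastTurn = simulateTurn({ mainFirstDeltaMs: 800, initialWaitTimeoutMs: 1200 });
const slowTurn = simulateTurn({ mainFirstDeltaMs: 3000, initialWaitTimeoutMs: 1200 });
console.log(fastTurn.fillerSpoken, slowTurn.fillerSpoken);
```

In this model the user never waits longer than initial_wait_timeout_ms before hearing something, which is the effect the feature aims for.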

Enabling responsiveness

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      responsiveness: {
        enabled: true,
        initial_wait_timeout_ms: 1200,
        hard_deadline_ms: 2000,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 12,
        min_filler_gap_ms: 8000,
        max_initial_per_turn: 1,
        enable_filler_on_first_assistant_reply: false,
        pause_text: ''
      }
    }
  }
}));

Field	Type	Default	Description
`enabled`	boolean	`false`	Per-session opt-in. A session that does not send this object never gets a filler, even when the server has the racing infrastructure wired and the Unleash flag is on.
`small_model`	string	server default	Override the filler LLM model identifier. Useful for A/B testing different small models without a service redeploy.
`initial_wait_timeout_ms`	integer	server default	T — how long to wait for the main LLM’s first delta before committing to the filler. Lower values fire fillers more aggressively (better perceived responsiveness, more frequent fillers).
`hard_deadline_ms`	integer	server default	Caps the small / filler LLM’s total streaming time so a slow filler model can’t itself become a latency tax.
`history_tail_items`	integer	server default	Recent conversation items the small LLM sees as context. Trades coherence (more history = more on-topic fillers) for token cost.
`temperature`	number	server default	Sampling temperature for the small LLM.
`max_tokens`	integer	server default	Caps the small LLM’s response length. Keep this small — fillers should be brief.
`min_filler_gap_ms`	integer	server default	Minimum gap between any two fillers within a single user-turn chain. Prevents back-to-back fillers when the main LLM is consistently slow.
`max_initial_per_turn`	integer	`1`	Caps initial fillers per user turn. The default of `1` matches the v1 single-filler behavior.
`max_buffer_deltas`	integer	server default	Bounds the in-memory buffer of main-LLM deltas held while the filler is being spoken. Once the buffer is full, the main response is flushed even if the filler is still in progress.
`enable_filler_on_first_assistant_reply`	boolean	`false`	Allows responsiveness fillers on the very first assistant response in a session. Default `false` because the first reply is often a greeting that doesn’t benefit from a filler.
`prompt_template`	string	server default	Overrides the system prompt fed to the small filler LLM. Empty string is treated as “use server default” — you can’t clear it to literally empty. Append a language directive here for multilingual sessions; the compiled-in default is English-biased.
`pause_text`	string	server default	TTS-only hint injected between the filler and the main answer (e.g. a brief audible breath or a short connector word). Empty string disables injection for this session.
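
One way to read how the gating fields combine is the sketch below. It is a guess at the semantics described in the table, not the server's actual logic: mayFireFiller and the state object (fillersThisTurn, lastFillerAtMs, isFirstAssistantReply) are hypothetical, and the order of the checks is an assumption.

```javascript
// Hypothetical per-turn state:
//   fillersThisTurn: fillers already spoken for this user turn
//   lastFillerAtMs: timestamp of the previous filler, or null if none yet
//   isFirstAssistantReply: true for the first assistant response in the session
function mayFireFiller(state, config, nowMs) {
  if (!config.enabled) return false;
  if (state.isFirstAssistantReply && !config.enable_filler_on_first_assistant_reply) {
    return false;
  }
  if (state.fillersThisTurn >= config.max_initial_per_turn) return false;
  if (
    state.lastFillerAtMs !== null &&
    nowMs - state.lastFillerAtMs < config.min_filler_gap_ms
  ) {
    return false; // too soon after the previous filler
  }
  return true;
}

const exampleConfig = {
  enabled: true,
  max_initial_per_turn: 1,
  min_filler_gap_ms: 8000,
  enable_filler_on_first_assistant_reply: false,
};
const exampleState = { fillersThisTurn: 0, lastFillerAtMs: null, isFirstAssistantReply: false };
console.log(mayFireFiller(exampleState, exampleConfig, 1200));
```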

All fields are optional: omitting a field or sending null leaves the server-side default in place.
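
Because unset fields fall back to server defaults, a client can send only the values it wants to override. The helper below is a minimal sketch of that idea; buildResponsivenessUpdate is a hypothetical name, and the payload shape simply mirrors the session.update example above.

```javascript
// Builds a session.update payload containing only the overrides that are set.
// null and undefined are dropped so the server default stays in place.
// Empty strings are kept, since pause_text: '' has its own meaning (disable injection).
function buildResponsivenessUpdate(overrides = {}) {
  const responsiveness = { enabled: true };
  for (const [key, value] of Object.entries(overrides)) {
    if (value !== null && value !== undefined) {
      responsiveness[key] = value;
    }
  }
  return {
    type: 'session.update',
    session: { providerData: { responsiveness } },
  };
}

const update = buildResponsivenessUpdate({
  initial_wait_timeout_ms: 1500,
  temperature: null, // dropped: keep the server default
  pause_text: '',    // kept: disables pause injection for this session
});
console.log(JSON.stringify(update));
```

The result can be passed to ws.send(JSON.stringify(update)) in the same way as the hand-written payload.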

Interaction with TTS

The filler is spoken through the same TTS pipeline as the main response — the same voice, the same delivery_mode, the same language. Filler audio is delivered on your transport’s normal assistant-audio path: on WebSocket sessions it arrives on the regular response.output_audio.delta stream, and on WebRTC sessions it plays on the inbound RTP audio track (the same track that carries main-response audio — see WebRTC). There is no dedicated filler event type or separate media track. This means:

Voice / language / accent flips made via session.update apply to fillers immediately.
If you have configured providerData.tts.conversational = true, fillers participate in the shared TTS context just like main responses.
pause_text, when set, is synthesized through the same TTS call after the filler — useful for a softer transition from “let me think” into the actual answer.
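
Since fillers share the assistant-audio path, a WebSocket client needs no filler-specific branch. The sketch below shows a single handler collecting both filler and main-response audio. It assumes a Node.js environment and that each response.output_audio.delta event carries base64-encoded audio in a delta field; that field name is an assumption, and createAudioCollector is a hypothetical helper.

```javascript
// Collects assistant audio from raw WebSocket messages.
// Assumption: audio events look like { type: 'response.output_audio.delta', delta: '<base64>' }.
function createAudioCollector() {
  const chunks = [];
  return {
    // Returns true when the message was an audio delta and was stored.
    handleMessage(raw) {
      const event = JSON.parse(raw);
      if (event.type === 'response.output_audio.delta' && typeof event.delta === 'string') {
        chunks.push(Buffer.from(event.delta, 'base64'));
        return true;
      }
      return false;
    },
    // Total number of decoded audio bytes received so far.
    totalBytes() {
      return chunks.reduce((sum, chunk) => sum + chunk.length, 0);
    },
  };
}

const collector = createAudioCollector();
// Filler audio and main-response audio arrive as the same event type.
collector.handleMessage(JSON.stringify({
  type: 'response.output_audio.delta',
  delta: Buffer.from([1, 2, 3, 4]).toString('base64'),
}));
console.log(collector.totalBytes());
```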

Example: Japanese fillers

The compiled-in filler prompt biases the small LLM toward English, so a Japanese session needs both a Japanese TTS language pin and a prompt_template override that tells the filler LLM to reply in Japanese. The TTS voice and delivery_mode come from your normal audio.output and providerData.tts settings, so the filler is spoken in the Japanese voice you’ve already configured.

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { voice: 'Hiroshi', model: 'inworld-tts-2' } },
    providerData: {
      tts: { language: 'ja-JP' },
      responsiveness: {
        enabled: true,
        initial_wait_timeout_ms: 1200,
        max_tokens: 12,
        prompt_template:
          'You are a polite Japanese conversational assistant. ' +
          'The user has just spoken and the main response is being prepared. ' +
          'Output a single short filler in natural spoken Japanese ' +
          '(e.g. "ちょっと待ってください", "少々お待ちを", "そうですね") — ' +
          'no greeting, no follow-up question, just the filler. ' +
          'Reply in Japanese only.'
      }
    }
  }
}));

What this does:

prompt_template is the single most important setting for non-English fillers. The compiled-in default is English-biased; without the override, the filler LLM emits English (“let me think”) even mid-Japanese conversation. Keep the prompt short and example-driven — the filler LLM is small and benefits from concrete examples in the target language.
providerData.tts.language: 'ja-JP' anchors the TTS accent so the synthesized filler audio is rendered in Japanese, not transliterated through an English voice.
max_tokens: 12 caps filler length. Japanese fillers are typically short (ちょっと待ってください is roughly 5 tokens); keep this low so the filler doesn’t outrun the main LLM warmup.
Voice selection (e.g. Hiroshi) and delivery_mode are inherited from your TTS config — no responsiveness-specific override needed.
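
The same pattern carries over to other languages. As a rough sketch, a small helper can assemble a short, example-driven prompt_template for any target language; buildFillerPrompt is a hypothetical function, and the wording just follows the Japanese prompt above rather than any server-provided template.

```javascript
// Builds a short, example-driven filler prompt for a target language.
// languageName: e.g. 'Spanish'; examples: filler phrases in that language.
function buildFillerPrompt(languageName, examples) {
  const quoted = examples.map((phrase) => `"${phrase}"`).join(', ');
  return (
    `You are a polite ${languageName} conversational assistant. ` +
    'The user has just spoken and the main response is being prepared. ' +
    `Output a single short filler in natural spoken ${languageName} (e.g. ${quoted}). ` +
    'No greeting, no follow-up question, just the filler. ' +
    `Reply in ${languageName} only.`
  );
}

const spanishPrompt = buildFillerPrompt('Spanish', ['un momento', 'déjame pensar']);
console.log(spanishPrompt);
```

The returned string would go in responsiveness.prompt_template, alongside a matching providerData.tts.language value.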

Tuning tips

Start with enabled: true and the server defaults. Measure end-to-end perceived latency before tuning anything else.
If fillers fire too often (the user notices a filler before every response), raise initial_wait_timeout_ms. Fillers should mask only the slowest tail of main-LLM responses.
If fillers fire too rarely (long awkward silences), lower initial_wait_timeout_ms.
For multilingual sessions, append a language directive to prompt_template. The compiled-in default biases the filler LLM toward English; without the override the agent will speak English fillers mid-Spanish or mid-French conversation.
Leave enable_filler_on_first_assistant_reply at false (the default) so the opening greeting plays cleanly; fillers on the very first turn tend to feel awkward.
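
To make "mask only the slowest tail" concrete, one option is to derive initial_wait_timeout_ms from measured first-delta latencies. The sketch below uses a nearest-rank percentile; suggestInitialWaitTimeout, the sample latencies, and the 75th-percentile default are all illustrative assumptions, not recommended values.

```javascript
// Suggests a timeout so that only turns slower than the given percentile get a filler.
// firstDeltaLatenciesMs: client-measured time from end of user speech to first main-LLM delta.
function suggestInitialWaitTimeout(firstDeltaLatenciesMs, percentile = 0.75) {
  if (firstDeltaLatenciesMs.length === 0) {
    throw new Error('need at least one latency sample');
  }
  const sorted = [...firstDeltaLatenciesMs].sort((a, b) => a - b);
  // Nearest-rank percentile: the smallest sample with at least `percentile` of samples at or below it.
  const rank = Math.max(Math.ceil(percentile * sorted.length), 1);
  return sorted[rank - 1];
}

// Made-up samples for illustration only.
const samples = [400, 500, 600, 700, 800, 900, 1000, 3000];
console.log(suggestInitialWaitTimeout(samples)); // 900
```

With these samples, a timeout of 900 ms would trigger fillers on two of the eight turns (the 1000 ms and 3000 ms ones). Raising the percentile makes fillers rarer; lowering it makes them more frequent.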
