Responsiveness (the intermediate-filler layer) bridges the gap between the moment the user finishes speaking and the moment the main LLM produces its first audible delta. This can be useful if a tool call is made or if the main LLM is slow to produce its response. A small “filler” LLM races against the main model: if the main model takes longer than initial_wait_timeout_ms to emit its first token, the server speaks a short filler (“let me think”, “good question, one moment”) via TTS, then transparently hands off to the main response when it lands.
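The race described above can be sketched as a small asyncio simulation. This is only a toy model of the timing rule, not the server's implementation: the function names are hypothetical, the main LLM is replaced by a sleep, and the filler is a hard-coded string rather than the output of a small LLM.

```python
import asyncio


async def main_llm_first_delta(latency_ms: int) -> str:
    """Stand-in for the main LLM: yields its first delta after latency_ms."""
    await asyncio.sleep(latency_ms / 1000)
    return "main: here is the answer"


async def respond(main_latency_ms: int, initial_wait_timeout_ms: int) -> list[str]:
    """Return, in order, what the user would hear for one turn."""
    spoken: list[str] = []
    main_task = asyncio.create_task(main_llm_first_delta(main_latency_ms))

    # Give the main model up to initial_wait_timeout_ms to produce a delta.
    done, _pending = await asyncio.wait(
        {main_task}, timeout=initial_wait_timeout_ms / 1000
    )
    if not done:
        # The main model missed the deadline: speak a filler first.
        spoken.append("filler: let me think")

    # Hand off to the main response whenever it lands.
    spoken.append(await main_task)
    return spoken


fast = asyncio.run(respond(main_latency_ms=10, initial_wait_timeout_ms=200))
slow = asyncio.run(respond(main_latency_ms=300, initial_wait_timeout_ms=50))
print(fast)
print(slow)
```

In the fast case the main model beats the timeout and no filler is spoken; in the slow case the filler goes first and the main response follows.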
Enabling responsiveness
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Per-session opt-in. A session that does not send this object never gets a filler, even when the server has the racing infrastructure wired and the Unleash flag is on. |
| small_model | string | server default | Override the filler LLM model identifier. Useful for A/B testing different small models without a service redeploy. |
| initial_wait_timeout_ms | integer | server default | T: how long to wait for the main LLM’s first delta before committing to the filler. Lower values fire fillers more aggressively (better perceived responsiveness, more frequent fillers). |
| hard_deadline_ms | integer | server default | Caps the small filler LLM’s total streaming time so a slow filler model can’t itself become a latency tax. |
| history_tail_items | integer | server default | Recent conversation items the small LLM sees as context. Trades coherence (more history = more on-topic fillers) for token cost. |
| temperature | number | server default | Sampling temperature for the small LLM. |
| max_tokens | integer | server default | Caps the small LLM’s response length. Keep this small; fillers should be brief. |
| min_filler_gap_ms | integer | server default | Minimum gap between any two fillers within a single user-turn chain. Prevents back-to-back fillers when the main LLM is consistently slow. |
| max_initial_per_turn | integer | 1 | Caps initial fillers per user turn. The default of 1 matches the v1 single-filler behavior. |
| max_buffer_deltas | integer | server default | Bounds the in-memory buffer of main-LLM deltas held while the filler is being spoken. Once the buffer is exhausted the main response is flushed even if the filler is still in progress. |
| enable_filler_on_first_assistant_reply | boolean | false | Allows responsiveness fillers on the very first assistant response in a session. Default false because the first reply is often a greeting that doesn’t benefit from a filler. |
| prompt_template | string | server default | Overrides the system prompt fed to the small filler LLM. An empty string is treated as “use server default”; you can’t clear it to literally empty. Append a language directive here for multilingual sessions; the compiled-in default is English-biased. |
| pause_text | string | server default | TTS-only hint injected between the filler and the main answer (e.g. a brief audible breath or a short connector word). An empty string disables injection for this session. |
Sending null for a field, or omitting it, leaves the server-side default in place.
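As a rough illustration of the opt-in and default rules, the sketch below builds a responsiveness object that sends only the fields you want to override. The helper name is hypothetical, the numeric values are illustrative rather than recommendations, and the placement of the object under providerData.responsiveness inside a session.update message is an assumption; check the session schema for the actual path.

```python
import json


def build_responsiveness(**overrides):
    """Build a responsiveness object with enabled=True plus explicit overrides.

    Fields passed as None are dropped so the server-side default stays in place.
    """
    config = {"enabled": True}
    config.update({key: value for key, value in overrides.items() if value is not None})
    return config


responsiveness = build_responsiveness(
    initial_wait_timeout_ms=600,  # illustrative value, not a recommendation
    max_tokens=16,                # keep fillers brief
    small_model=None,             # None: omitted, so the server default applies
)

# ASSUMPTION: the message envelope and the providerData.responsiveness path
# are guesses made for this example, not confirmed by the reference above.
session_update = {
    "type": "session.update",
    "session": {"providerData": {"responsiveness": responsiveness}},
}

print(json.dumps(session_update, indent=2))
```

Dropping unset fields on the client keeps the payload small and makes it obvious which values are deliberate overrides.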
Interaction with TTS
The filler is spoken through the same TTS pipeline as the main response: the same voice, the same delivery_mode, the same language. Filler audio is delivered on your transport’s normal assistant-audio path: on WebSocket sessions it arrives on the regular response.output_audio.delta stream, and on WebRTC sessions it plays on the inbound RTP audio track (the same track that carries main-response audio; see WebRTC). There is no dedicated filler event type or separate media track. This means:
- Voice / language / accent flips made via session.update apply to fillers immediately.
- If you have configured providerData.tts.conversational = true, fillers participate in the shared TTS context just like main responses.
- pause_text, when set, is synthesized through the same TTS call after the filler. This is useful for a softer transition from “let me think” into the actual answer.
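Because filler and main-response audio share one stream, a WebSocket client needs no filler-specific branch. The sketch below shows one way a client might collect audio in arrival order. It assumes each response.output_audio.delta event is a JSON object carrying base64-encoded audio in a delta field; that payload shape, the helper names, and the example.other_event type are assumptions made for illustration.

```python
import base64
import json


def collect_audio(raw_events):
    """Concatenate audio from response.output_audio.delta events, in arrival order."""
    audio = bytearray()
    for raw in raw_events:
        event = json.loads(raw)
        if event.get("type") == "response.output_audio.delta":
            # ASSUMPTION: audio bytes arrive base64-encoded in a "delta" field.
            audio.extend(base64.b64decode(event["delta"]))
        # Any other event type is ignored by this sketch.
    return bytes(audio)


def fake_delta(chunk: bytes) -> str:
    """Build a stand-in audio event for the demo below (hypothetical helper)."""
    return json.dumps(
        {
            "type": "response.output_audio.delta",
            "delta": base64.b64encode(chunk).decode("ascii"),
        }
    )


# Filler, pause, and main audio all arrive as the same event type.
events = [
    fake_delta(b"FILLER"),
    fake_delta(b"PAUSE"),
    fake_delta(b"MAIN"),
    json.dumps({"type": "example.other_event"}),  # hypothetical non-audio event
]
pcm = collect_audio(events)
print(pcm)
```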
Example: Japanese fillers
The compiled-in filler prompt biases the small LLM toward English, so a Japanese session needs both a Japanese TTS language pin and a prompt_template override that tells the filler LLM to reply in Japanese. The TTS voice and delivery_mode come from your normal audio.output and providerData.tts settings, so the filler is spoken in the Japanese voice you’ve already configured.
- prompt_template is the single most important setting for non-English fillers. The compiled-in default is English-biased; without the override, the filler LLM emits English (“let me think”) even mid-Japanese conversation. Keep the prompt short and example-driven: the filler LLM is small and benefits from concrete examples in the target language.
- providerData.tts.language: 'ja-JP' anchors the TTS accent so the synthesized filler audio is rendered in Japanese, not transliterated through an English voice.
- max_tokens: 12 caps filler length. Japanese fillers are typically short (ちょっと待ってください is roughly 5 tokens); keep this low so the filler doesn’t outrun the main LLM warmup.
- Voice selection (e.g. Hiroshi) and delivery_mode are inherited from your TTS config, so no responsiveness-specific override is needed.
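The settings in the list above might be combined as in the sketch below. The field names and the values 'ja-JP' and 12 come from this page; the message envelope, the providerData.responsiveness path, and the prompt wording are assumptions made for illustration. Voice and delivery_mode are deliberately left out because they are inherited from the existing TTS config.

```python
import json

# ASSUMPTION: illustrative prompt wording, not the server's compiled-in default.
# It stays short and example-driven, with the examples in the target language.
JAPANESE_FILLER_PROMPT = (
    "You produce one very short spoken filler while the assistant prepares "
    "its answer. Always reply in Japanese. "
    "Examples: ちょっと待ってください / 少々お待ちください / そうですね"
)

# ASSUMPTION: the envelope and the providerData.responsiveness path are guesses.
japanese_session = {
    "type": "session.update",
    "session": {
        "providerData": {
            "tts": {"language": "ja-JP"},  # pins the TTS language
            "responsiveness": {
                "enabled": True,
                "max_tokens": 12,  # keeps the filler short
                "prompt_template": JAPANESE_FILLER_PROMPT,
            },
        },
    },
}

# ensure_ascii=False keeps the Japanese examples readable on the wire.
wire = json.dumps(japanese_session, ensure_ascii=False)
```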
Tuning tips
- Start with enabled: true and the server defaults. Measure end-to-end perceived latency before tuning anything else.
- If fillers fire too often (the user notices a filler before every response), raise initial_wait_timeout_ms. Fillers should mask only the slowest tail of main-LLM responses.
- If fillers fire too rarely (long awkward silences), lower initial_wait_timeout_ms.
- For multilingual sessions, append a language directive to prompt_template. The compiled-in default biases the filler LLM toward English; without the override the agent will speak English fillers mid-Spanish or mid-French conversation.
- Pair with enable_filler_on_first_assistant_reply: false (the default) so the opening greeting plays cleanly. Fillers on the very first turn tend to feel awkward.
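One way to make the raise/lower advice concrete is to measure the main LLM's first-delta latency over a batch of turns and choose a timeout that only the slowest share of turns exceeds. The sketch below is a client-side heuristic under that assumption, not a documented procedure; the helper names are hypothetical and the latency samples are made-up numbers for illustration.

```python
import math


def filler_rate(first_delta_ms, timeout_ms):
    """Fraction of turns whose first main-LLM delta arrives after timeout_ms."""
    late = sum(1 for ms in first_delta_ms if ms > timeout_ms)
    return late / len(first_delta_ms)


def timeout_for_target_rate(first_delta_ms, target_rate):
    """Smallest observed latency that keeps the filler rate at or below target_rate."""
    ordered = sorted(first_delta_ms)
    # Number of samples allowed to exceed the timeout, clamped so at least
    # one sample remains at or below it.
    allowed = min(math.floor(target_rate * len(ordered)), len(ordered) - 1)
    return ordered[len(ordered) - allowed - 1]


# Made-up first-delta latencies in milliseconds, for illustration only.
samples = [220, 250, 300, 310, 340, 400, 450, 520, 900, 1500]

suggested = timeout_for_target_rate(samples, target_rate=0.2)
print(suggested, filler_rate(samples, suggested))
print(filler_rate(samples, 300))
```

With these samples, a timeout of 300 ms would fire a filler on most turns, while the suggested value reserves fillers for the two slowest responses.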