Back-channel responses are short audio interjections ("uh-huh", "right", "I see") that the server emits while the user is still speaking. They sit out-of-band from the main response stream and give the agent the cadence of an active listener without interrupting the user’s turn.
Enabling back-channel
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `false` | Per-session opt-in. Sessions that don’t send this field never receive back-channels even when server prerequisites are met. |
| `small_model` | string | server default | Override the decider LLM model identifier. Empty string inherits the server default. |
| `eval_interval_ms` | integer | `800` | How often the manager evaluates eligibility while the user is producing partial transcripts. |
| `min_speech_ms` | integer | `800` | Minimum time after speech onset before any back-channel can fire. Suppresses interjections on micro-utterances. |
| `min_gap_ms` | integer | `4000` | Minimum spacing between two back-channels in the same user turn. |
| `max_per_turn` | integer | `3` | Cap on back-channels emitted within a single user turn. |
| `hard_deadline_ms` | integer | `1500` | Combined small-LLM + TTS deadline per attempt. Misses are dropped. |
| `history_tail_items` | integer | `4` | Recent conversation items the small LLM sees as context. |
| `temperature` | number | `0.7` | Sampling temperature for the small LLM. |
| `max_tokens` | integer | `6` | Max tokens for the small LLM’s reply. |
| `volume_gain` | number | `0.6` | Linear gain multiplier applied to synthesized back-channel audio before it is sent to the client. `0.0` mutes back-channels (synthesis still runs but no audio is delivered); `1.0` keeps the synthesized volume; values above `1.0` amplify. Useful when back-channels feel louder than the main response audio. |
| `require_pause` | boolean | `false` | When `true`, only fire after a smart-turn pause signal (the `input_audio_buffer.turn_suggestion` event). When `false`, the periodic ticker fires regardless of speech state. |
| `allowed_phrases` | string[] | server default | Restrict the phrase bank. `null` or an omitted field inherits the default; an explicit empty array disables back-channel for the session (no phrase can be picked); a populated array replaces the bank. |
| `prompt_template` | string | server default | Override the decider prompt. Empty string inherits the server default. Supports the Go `text/template` tokens `{{.PhrasesList}}`, `{{.History}}`, `{{.Partial}}`. |
| `decider_kind` | `"llm" \| "rule"` | `"llm"` | `llm` uses a small LLM. `rule` picks phrases from the bank with per-tick probability `rule_fire_probability`, which is useful for load tests or low-cost production. |
| `rule_fire_probability` | number | `1.0` | Per-tick fire probability for the rule decider (0.0–1.0). Values outside [0,1] are clamped. Ignored when `decider_kind != "rule"`. The default of `1.0` matches legacy / test behavior; production rule-decider deployments typically set this lower (e.g. `0.3`) for a natural cadence. |
Sending `providerData.backchannel: {}` (an empty object) clears all overrides; the server falls back to its compiled-in defaults.
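As a rough illustration of how these fields might be sent, the sketch below types a subset of the table and wraps it in a session-update message. The `session.update` envelope and the `session.providerData.backchannel` nesting are assumptions inferred from the `providerData.backchannel` path above, and `BackchannelConfig` and `buildBackchannelUpdate` are hypothetical names, so check the message shape against your SDK before relying on it.

```typescript
// Subset of the documented back-channel fields; every field is an optional override.
interface BackchannelConfig {
  enabled?: boolean;
  eval_interval_ms?: number;
  min_speech_ms?: number;
  min_gap_ms?: number;
  max_per_turn?: number;
  volume_gain?: number;
  require_pause?: boolean;
  allowed_phrases?: string[] | null;
  decider_kind?: "llm" | "rule";
  rule_fire_probability?: number;
}

// Hypothetical helper: the envelope below is an assumption,
// not a confirmed wire format.
function buildBackchannelUpdate(config: BackchannelConfig) {
  return {
    type: "session.update",
    session: { providerData: { backchannel: config } },
  };
}

// Opt in and leave everything else at the server defaults.
const optIn = buildBackchannelUpdate({ enabled: true });

// An empty object clears all overrides.
const reset = buildBackchannelUpdate({});

console.log(JSON.stringify(optIn));
```

Keeping every field optional mirrors the table: anything omitted inherits the server default.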
Handling back-channel audio
Use `backchannel_id` as the playback bucket key so that chunks of one interjection don’t collide with the active assistant response item.
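One way to keep interjection audio separate is to accumulate decoded chunks in a map keyed by `backchannel_id`. This is a minimal sketch: the `type` discriminator on the event, the `BackchannelBuckets` class, and its method names are assumptions for illustration, and the global `atob` is assumed to be available (browsers and recent Node versions).

```typescript
// Assumed event shape: backchannel_id and delta come from the event reference;
// the `type` field is an assumption.
interface BackchannelAudioDelta {
  type: "response.backchannel.audio.delta";
  backchannel_id: string;
  delta: string; // base64-encoded PCM chunk
}

// Decode a base64 string into raw bytes.
function decodeBase64(b64: string): Uint8Array {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return bytes;
}

// Hypothetical store: one bucket of decoded chunks per interjection.
class BackchannelBuckets {
  private buckets = new Map<string, Uint8Array[]>();

  // Append a decoded chunk to the bucket for this interjection.
  onDelta(event: BackchannelAudioDelta): void {
    const chunks = this.buckets.get(event.backchannel_id) ?? [];
    chunks.push(decodeBase64(event.delta));
    this.buckets.set(event.backchannel_id, chunks);
  }

  // Concatenate the chunks and release the bucket once audio.done arrives.
  onDone(backchannelId: string): Uint8Array {
    const chunks = this.buckets.get(backchannelId) ?? [];
    this.buckets.delete(backchannelId);
    const total = chunks.reduce((n, c) => n + c.length, 0);
    const out = new Uint8Array(total);
    let offset = 0;
    for (let i = 0; i < chunks.length; i++) {
      out.set(chunks[i], offset);
      offset += chunks[i].length;
    }
    return out;
  }
}

const buckets = new BackchannelBuckets();
buckets.onDelta({ type: "response.backchannel.audio.delta", backchannel_id: "bc_1", delta: "AQI=" });
buckets.onDelta({ type: "response.backchannel.audio.delta", backchannel_id: "bc_1", delta: "Aw==" });
const pcm = buckets.onDone("bc_1");
console.log(pcm.length); // 3
```

A real client would more likely feed each chunk to its audio output as it arrives; buffering until `done` is used here only to keep the example short.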
Client integration: preserve audio during user speech
A natural-feeling back-channel is audible to the user while they are still talking. The default audio-ducking behavior in many client integrations attenuates all output channels when VAD reports user activity. This also silences the back-channel, defeating its purpose. When you wire back-channel into your client, exempt the back-channel playback bucket (keyed by `backchannel_id`) from the duck-on-user-speech rule so interjections remain audible while the user holds the floor.
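The exemption can be modeled as a per-bucket gain that the ducking rule skips for back-channel buckets. The sketch below is illustrative only: `DuckingMixer`, its methods, and the ducked gain of `0.2` are hypothetical, and a real client would apply these gains to its actual audio nodes.

```typescript
type BucketKind = "assistant" | "backchannel";

interface Bucket {
  kind: BucketKind;
  gain: number;
}

// Hypothetical mixer model: each playback bucket carries its own gain.
class DuckingMixer {
  private buckets = new Map<string, Bucket>();
  private duckedGain = 0.2; // illustrative attenuation level

  addBucket(key: string, kind: BucketKind): void {
    this.buckets.set(key, { kind, gain: 1.0 });
  }

  // Called when client-side VAD reports user speech starting or stopping.
  setUserSpeaking(speaking: boolean): void {
    this.buckets.forEach((bucket) => {
      // Back-channel buckets are exempt from ducking so they stay audible.
      if (bucket.kind === "backchannel") return;
      bucket.gain = speaking ? this.duckedGain : 1.0;
    });
  }

  gainOf(key: string): number | undefined {
    return this.buckets.get(key)?.gain;
  }
}

const mixer = new DuckingMixer();
mixer.addBucket("resp_1", "assistant");
mixer.addBucket("bc_1", "backchannel"); // keyed by backchannel_id
mixer.setUserSpeaking(true);
console.log(mixer.gainOf("resp_1"), mixer.gainOf("bc_1")); // 0.2 1
```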
Event reference
| Event | Direction | Description |
|---|---|---|
| `response.backchannel.audio.delta` | server → client | Streaming PCM audio chunk for one interjection. Carries `backchannel_id` and a base64 `delta`. |
| `response.backchannel.audio.done` | server → client | Indicates all audio for the back-channel interjection identified by `backchannel_id` has been streamed. Carries an optional `phrase` field with the chosen utterance (e.g. `"uh-huh"`) when the decider surfaces it; omitted when the decider doesn’t expose the phrase (configuration-dependent). The wire shape is `omitempty`, so absent and `null` are equivalent. |
| `response.backchannel.skipped` | server → client | The decider chose not to fire on an evaluation tick. Carries a `reason` string (e.g. `"deadline_missed"`, `"no_phrase"`) for client-side telemetry. Safe to ignore. |
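For client-side telemetry, the `skipped` and `done` events can be tallied with a small discriminated union. This is a sketch under assumptions: the `type` discriminator and the `BackchannelTelemetry` class are hypothetical, while `reason`, `phrase`, and `backchannel_id` are the fields named in the table.

```typescript
// Assumed shapes; only reason, phrase, and backchannel_id come from the table.
type BackchannelEvent =
  | { type: "response.backchannel.audio.done"; backchannel_id: string; phrase?: string | null }
  | { type: "response.backchannel.skipped"; reason: string };

// Hypothetical collector for skip reasons and surfaced phrases.
class BackchannelTelemetry {
  readonly skipsByReason = new Map<string, number>();
  readonly phrases: string[] = [];

  record(event: BackchannelEvent): void {
    switch (event.type) {
      case "response.backchannel.skipped": {
        const count = this.skipsByReason.get(event.reason) ?? 0;
        this.skipsByReason.set(event.reason, count + 1);
        break;
      }
      case "response.backchannel.audio.done":
        // Absent and null are equivalent on the wire, so treat them the same.
        if (event.phrase != null) this.phrases.push(event.phrase);
        break;
    }
  }
}

const telemetry = new BackchannelTelemetry();
telemetry.record({ type: "response.backchannel.skipped", reason: "deadline_missed" });
telemetry.record({ type: "response.backchannel.skipped", reason: "deadline_missed" });
telemetry.record({ type: "response.backchannel.audio.done", backchannel_id: "bc_1", phrase: "uh-huh" });
telemetry.record({ type: "response.backchannel.audio.done", backchannel_id: "bc_2" });
console.log(telemetry.skipsByReason.get("deadline_missed"), telemetry.phrases); // 2 [ 'uh-huh' ]
```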
response.backchannel.skipped payload:
Example: Spanish back-channel with rule decider
For a low-cost back-channel with no small-LLM call per evaluation, just a random pick from a fixed Spanish phrase bank, use `decider_kind: "rule"` with a populated `allowed_phrases` and a per-tick fire probability tuned for natural cadence. The TTS voice and language come from your normal `audio.output` and `providerData.tts.language` settings, so the phrases get spoken in the Spanish accent you’ve already configured.
- No LLM in the hot path. Every evaluation tick is a coin flip against `rule_fire_probability`; if it fires, the server picks a random phrase from `allowed_phrases`. Latency is effectively just TTS synthesis.
- `rule_fire_probability: 0.3` keeps the cadence natural. At the default 800 ms `eval_interval_ms`, that’s roughly one fire every 3.3 ticks (about 2.7 s) once the gating thresholds pass. Tune up for more eager back-channels, down for sparser.
- `allowed_phrases` replaces the compiled-in English bank with Spanish utterances. With the LLM decider you’d instead append a language directive to `prompt_template`; with the rule decider, the phrase bank is the only thing controlling what gets said, so list them explicitly.
- `providerData.tts.language: 'es-ES'` anchors the TTS accent. Without it, TTS-2 may infer a different accent from the audio or text context.
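The settings described above might look like the fragment below, followed by the cadence arithmetic. This is a sketch, not a confirmed payload: the nesting under `providerData` follows the paths named in the text, the Spanish phrases are illustrative rather than a server-provided bank, and `expectedFireIntervalMs` is a hypothetical helper.

```typescript
// Hypothetical session settings for a Spanish rule-decider back-channel.
const spanishBackchannel = {
  providerData: {
    tts: { language: "es-ES" },
    backchannel: {
      enabled: true,
      decider_kind: "rule",
      rule_fire_probability: 0.3,
      // Illustrative phrases; a populated array replaces the default bank.
      allowed_phrases: ["ajá", "claro", "ya veo", "entiendo"],
    },
  },
};

// With per-tick probability p, the wait until the next fire is geometric
// with mean 1/p ticks, so the mean spacing is evalIntervalMs / p.
function expectedFireIntervalMs(p: number, evalIntervalMs: number): number {
  return evalIntervalMs / p;
}

const intervalMs = expectedFireIntervalMs(
  spanishBackchannel.providerData.backchannel.rule_fire_probability,
  800, // default eval_interval_ms
);
console.log(Math.round(intervalMs)); // 2667
```

This figure ignores the gating fields. With the default `min_gap_ms` of 4000 and `max_per_turn` of 3, the spacing actually observed within a turn would be wider than the raw coin-flip average.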
Tuning tips
- Start with the server defaults (just `{ enabled: true }`). Adjust `min_speech_ms` and `min_gap_ms` first if back-channels feel too eager or too sparse.
- Pair with `turn_detection.eagerness: 'low'` so the main response model gives the user space to continue; back-channel fills the perceived silence.
- If back-channels feel louder than the main assistant response, lower `volume_gain` (default `0.6`); if they feel inaudible, raise it toward `1.0`. Setting `volume_gain: 0.0` mutes back-channels entirely without disabling synthesis, which is useful for A/B tests of the decider in isolation.
- For multilingual sessions, append a language directive to `prompt_template`. The compiled-in default lists English example phrases; without a language hint the small model echoes them in English regardless of the conversation language.
- For load tests, switch `decider_kind` to `rule` with `rule_fire_probability: 0.3` to remove the LLM call from the hot path while still exercising the audio pipeline.
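The tips above can be captured as a handful of named override sets layered on the minimal opt-in. The preset names and the `withPreset` helper are hypothetical; the field values are the ones mentioned in the tips.

```typescript
type BackchannelOverrides = {
  enabled?: boolean;
  volume_gain?: number;
  decider_kind?: "llm" | "rule";
  rule_fire_probability?: number;
};

// Minimal opt-in: everything else stays at the server defaults.
const baseline: BackchannelOverrides = { enabled: true };

// Hypothetical presets built from the tuning tips.
const presets: Record<string, BackchannelOverrides> = {
  // Run the decider and synthesis but deliver no audio (decider A/B tests).
  mutedAbTest: { volume_gain: 0.0 },
  // Remove the LLM call from the hot path while exercising the audio pipeline.
  loadTest: { decider_kind: "rule", rule_fire_probability: 0.3 },
};

// Layer a named preset over the baseline; preset values win on conflict.
function withPreset(name: string): BackchannelOverrides {
  return { ...baseline, ...presets[name] };
}

console.log(JSON.stringify(withPreset("loadTest")));
```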