Back-channel responses - Inworld AI Documentation

Back-channel responses are short audio interjections — "uh-huh", "right", "I see" — that the server emits while the user is still speaking. They sit out-of-band from the main response stream and give the agent the cadence of an active listener without interrupting the user’s turn.

Enabling back-channel

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      backchannel: {
        enabled: true,
        eval_interval_ms: 800,
        min_speech_ms: 800,
        min_gap_ms: 4000,
        max_per_turn: 3,
        hard_deadline_ms: 1500,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 6,
        volume_gain: 0.6,
        require_pause: false,
        decider_kind: 'llm'
      }
    }
  }
}));

Field	Type	Default	Description
`enabled`	boolean	`false`	Per-session opt-in. Sessions that don’t send this field never receive back-channels even when server prerequisites are met.
`small_model`	string	server default	Override the decider LLM model identifier. Empty string inherits the server default.
`eval_interval_ms`	integer	`800`	How often the manager evaluates eligibility while the user is producing partial transcripts.
`min_speech_ms`	integer	`800`	Minimum time after speech onset before any back-channel can fire. Suppresses interjections on micro-utterances.
`min_gap_ms`	integer	`4000`	Minimum spacing between two back-channels in the same user turn.
`max_per_turn`	integer	`3`	Cap on back-channels emitted within a single user turn.
`hard_deadline_ms`	integer	`1500`	Combined small-LLM + TTS deadline per attempt. Misses are dropped.
`history_tail_items`	integer	`4`	Recent conversation items the small LLM sees as context.
`temperature`	number	`0.7`	Sampling temperature for the small LLM.
`max_tokens`	integer	`6`	Max tokens for the small LLM’s reply.
`volume_gain`	number	`0.6`	Linear gain multiplier applied to synthesized back-channel audio before it is sent to the client. `0.0` mutes back-channels (synthesis still runs but no audio is delivered); `1.0` keeps the synthesized volume; values >1.0 amplify. Useful when back-channels feel louder than the main response audio.
`require_pause`	boolean	`false`	When `true`, only fire after a smart-turn pause signal (the `input_audio_buffer.turn_suggestion` event). When `false`, the periodic ticker fires regardless of speech state.
`allowed_phrases`	string[]	server default	Restrict the phrase bank. `null` / field omitted inherits the default; an explicit empty array disables back-channel for the session (no phrase can be picked); a populated array replaces the bank.
`prompt_template`	string	server default	Override the decider prompt. Empty string inherits the server default. Supports the Go `text/template` tokens `{{.PhrasesList}}`, `{{.History}}`, `{{.Partial}}`.
`decider_kind`	`"llm"` \| `"rule"`	`"llm"`	`llm` uses a small LLM. `rule` picks phrases from the bank with per-tick probability `rule_fire_probability` — useful for load tests or low-cost production.
`rule_fire_probability`	number	`1.0`	Per-tick fire probability for the rule decider (`0.0`–`1.0`). Values outside `[0,1]` are clamped. Ignored when `decider_kind != "rule"`. The default of `1.0` matches legacy / test behavior; production rule-decider deployments typically set this lower (e.g. `0.3`) for a natural cadence.

Sending providerData.backchannel: {} (empty object) clears all overrides; the server falls back to its compiled-in defaults.
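
As a rough illustration of sending only the overrides you care about, the sketch below builds a session.update message from a partial config. The helper name buildBackchannelUpdate and its validation rules are hypothetical and are not part of the Inworld SDK; the field names come from the table above. It rejects misspelled keys locally and mirrors the documented clamping of rule_fire_probability, on the assumption that catching these on the client is cheaper than debugging a silently ignored field.

```javascript
// Field names taken from the table above. The helper itself is a hypothetical
// client-side convenience, not an Inworld API.
const KNOWN_BACKCHANNEL_FIELDS = new Set([
  'enabled', 'small_model', 'eval_interval_ms', 'min_speech_ms', 'min_gap_ms',
  'max_per_turn', 'hard_deadline_ms', 'history_tail_items', 'temperature',
  'max_tokens', 'volume_gain', 'require_pause', 'allowed_phrases',
  'prompt_template', 'decider_kind', 'rule_fire_probability',
]);

// Build a session.update message carrying only the given overrides.
// Passing no overrides yields `backchannel: {}`, the documented "clear all overrides" form.
function buildBackchannelUpdate(overrides = {}) {
  for (const key of Object.keys(overrides)) {
    if (!KNOWN_BACKCHANNEL_FIELDS.has(key)) {
      throw new TypeError(`Unknown back-channel field: ${key}`);
    }
  }
  const backchannel = { ...overrides };
  // Mirror the documented server-side clamp to [0, 1] so logs show the effective value.
  if (typeof backchannel.rule_fire_probability === 'number') {
    backchannel.rule_fire_probability =
      Math.min(1, Math.max(0, backchannel.rule_fire_probability));
  }
  return { type: 'session.update', session: { providerData: { backchannel } } };
}
```

With an open WebSocket, usage would look like ws.send(JSON.stringify(buildBackchannelUpdate({ enabled: true }))).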

Handling back-channel audio

session.on('transport_event', (event) => {
  switch (event.type) {
    case 'response.backchannel.audio.delta': {
      // event.backchannel_id groups deltas + done for one interjection.
      // event.delta is base64-encoded PCM16 (or whatever output format you
      // configured on session.audio.output.format).
      audioHandler.playAudio(event.delta, `backchannel:${event.backchannel_id}`);
      break;
    }
    case 'response.backchannel.audio.done': {
      // event.phrase (optional) is the chosen utterance, e.g. "uh-huh".
      // No teardown needed — playAudio queues until exhausted.
      break;
    }
  }
});

Use backchannel_id as the playback bucket key so chunks of one interjection don’t collide with the active assistant response item.
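
To make the bucket-key idea concrete, here is a minimal sketch of a per-bucket queue. It assumes Node.js (for Buffer) and that the configured output format is little-endian PCM16; the BucketedAudioQueue class and routeBackchannelDelta function are hypothetical names, standing in for whatever your audioHandler does internally.

```javascript
// Hypothetical client-side queue: one FIFO of decoded PCM16 chunks per bucket key.
class BucketedAudioQueue {
  constructor() {
    this.buckets = new Map();
  }

  // Decode a base64 delta (assumed PCM16, little-endian) and append it to its bucket.
  // Returns the number of samples decoded.
  push(bucketKey, base64Delta) {
    const bytes = Buffer.from(base64Delta, 'base64');
    const samples = new Int16Array(Math.floor(bytes.length / 2));
    for (let i = 0; i < samples.length; i++) {
      samples[i] = bytes.readInt16LE(i * 2);
    }
    if (!this.buckets.has(bucketKey)) this.buckets.set(bucketKey, []);
    this.buckets.get(bucketKey).push(samples);
    return samples.length;
  }

  // Remove and return every queued chunk for one bucket (empty array if none).
  drain(bucketKey) {
    const chunks = this.buckets.get(bucketKey) ?? [];
    this.buckets.delete(bucketKey);
    return chunks;
  }
}

// Route a transport event into the queue; other event types are ignored here.
function routeBackchannelDelta(queue, event) {
  if (event.type !== 'response.backchannel.audio.delta') return 0;
  return queue.push(`backchannel:${event.backchannel_id}`, event.delta);
}
```

Because each interjection lands under its own backchannel:<id> key, its chunks never interleave with chunks queued for the assistant response item.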

Client integration: preserve audio during user speech

A natural-feeling back-channel is audible to the user while they are still talking. The default audio-ducking behavior in many client integrations attenuates all output channels when VAD reports user activity — this also silences the back-channel, defeating its purpose. When you wire back-channel into your client, exempt the back-channel playback bucket (keyed by backchannel_id) from the duck-on-user-speech rule so interjections remain audible while the user holds the floor.
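
One way to express the exemption is a small gain function consulted by the mixer for each playback bucket. This is only a sketch: the backchannel: prefix matches the bucket key used in the handler above, while DUCKED_GAIN and outputGain are hypothetical names and values for illustration.

```javascript
const BACKCHANNEL_PREFIX = 'backchannel:';
const DUCKED_GAIN = 0.2; // hypothetical attenuation applied to ducked output

// Return the playback gain for a bucket given the current VAD state.
// Back-channel buckets are exempt from ducking so they stay audible mid-turn.
function outputGain(bucketKey, userSpeaking) {
  if (!userSpeaking) return 1.0;
  return bucketKey.startsWith(BACKCHANNEL_PREFIX) ? 1.0 : DUCKED_GAIN;
}
```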

Event reference

Event	Direction	Description
`response.backchannel.audio.delta`	server → client	Streaming PCM audio chunk for one interjection. Carries `backchannel_id`, base64 `delta`.
`response.backchannel.audio.done`	server → client	Indicates all audio for the back-channel interjection identified by `backchannel_id` has been streamed. Carries an optional `phrase` field with the chosen utterance (e.g. `"uh-huh"`) when the decider surfaces it; omitted when the decider doesn’t expose the phrase (configuration-dependent). The wire shape is `omitempty`, so absent and `null` are equivalent.
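`response.backchannel.skipped`	server → client	No back-channel was emitted for an evaluation tick, either because the decider chose not to fire or because the attempt was dropped. Carries a `reason` string (e.g. `"deadline_missed"`, `"no_phrase"`, `"min_gap_not_elapsed"`) for client-side telemetry. Safe to ignore.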

Example response.backchannel.skipped payload:

{
  "event_id": "evt_5f7d2",
  "type": "response.backchannel.skipped",
  "reason": "min_gap_not_elapsed"
}

See the WebSocket API reference for the full schemas.
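
If you do want the skipped events for telemetry, a simple tally by reason is usually enough. The sketch below is a hypothetical client-side counter (SkipReasonCounter is not an SDK class); it only relies on the type and reason fields shown in the payload above.

```javascript
// Hypothetical telemetry helper: counts response.backchannel.skipped events by reason.
class SkipReasonCounter {
  constructor() {
    this.counts = new Map();
  }

  // Record one event. Returns true if it was a skipped event, false otherwise.
  record(event) {
    if (event.type !== 'response.backchannel.skipped') return false;
    const reason = typeof event.reason === 'string' ? event.reason : 'unknown';
    this.counts.set(reason, (this.counts.get(reason) ?? 0) + 1);
    return true;
  }

  // Plain-object view of the tallies, convenient for logging.
  snapshot() {
    return Object.fromEntries(this.counts);
  }
}
```

A high share of deadline_missed in the snapshot would suggest looking at hard_deadline_ms, while frequent min_gap_not_elapsed simply reflects the spacing gate doing its job.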

Example: Spanish back-channel with rule decider

For a low-cost back-channel with no small-LLM call per evaluation (just a random pick from a fixed Spanish phrase bank), use decider_kind: "rule" with a populated allowed_phrases and a per-tick fire probability tuned for natural cadence. The TTS voice and language come from your normal audio.output and providerData.tts.language settings, so the phrases are spoken in the Spanish accent you’ve already configured.

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { voice: 'Olivia', model: 'inworld-tts-2' } },
    providerData: {
      tts: { language: 'es-ES' },
      backchannel: {
        enabled: true,
        decider_kind: 'rule',
        rule_fire_probability: 0.3,
        allowed_phrases: ['ajá', 'claro', 'sí', 'mhm', 'entiendo', 'vale'],
        min_speech_ms: 800,
        min_gap_ms: 4000,
        max_per_turn: 3
      }
    }
  }
}));

What this does:

No LLM in the hot path. Every evaluation tick is a coin flip against rule_fire_probability; if it fires, the server picks a random phrase from allowed_phrases. Latency is effectively just TTS synthesis.
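rule_fire_probability: 0.3 keeps the cadence natural: each eligible tick fires with probability 0.3, so on average one tick in ~3.3 fires (1 / 0.3). At the default 800 ms eval_interval_ms, that’s roughly one fire every ~2.7 s once the gating thresholds pass; min_gap_ms and max_per_turn then space and cap the fires further. Tune up for more eager back-channels, down for sparser.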
allowed_phrases replaces the compiled-in English bank with Spanish utterances. With the LLM decider you’d instead append a language directive to prompt_template; with the rule decider, the phrase bank is the only thing controlling what gets said, so list them explicitly.
providerData.tts.language: 'es-ES' anchors the TTS accent. Without it, TTS-2 may infer a different accent from the audio or text context.
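
To get a feel for how the gates and the per-tick probability interact, the following client-side simulation can help when choosing values. It is a sketch under stated assumptions, not the server's implementation: it assumes ticks start one eval_interval_ms after speech onset and that gates are checked in the order max_per_turn, min_speech_ms, min_gap_ms, then the probability draw. simulateRuleDecider is a hypothetical name, and the random source is injectable so runs can be made repeatable.

```javascript
// Simulate one user turn of `turnMs` milliseconds and return the ticks that would fire.
// `random` must return a number in [0, 1); it defaults to Math.random.
function simulateRuleDecider(cfg, turnMs, random = Math.random) {
  const fires = [];
  let lastFireMs = -Infinity;
  for (let t = cfg.eval_interval_ms; t <= turnMs; t += cfg.eval_interval_ms) {
    if (fires.length >= cfg.max_per_turn) break;             // per-turn cap reached
    if (t < cfg.min_speech_ms) continue;                     // too early in the turn
    if (t - lastFireMs < cfg.min_gap_ms) continue;           // too close to the last fire
    if (random() >= cfg.rule_fire_probability) continue;     // probability draw failed
    const index = Math.floor(random() * cfg.allowed_phrases.length);
    fires.push({ atMs: t, phrase: cfg.allowed_phrases[index] });
    lastFireMs = t;
  }
  return fires;
}

const spanishConfig = {
  eval_interval_ms: 800,
  min_speech_ms: 800,
  min_gap_ms: 4000,
  max_per_turn: 3,
  rule_fire_probability: 0.3,
  allowed_phrases: ['ajá', 'claro', 'sí', 'mhm', 'entiendo', 'vale'],
};
```

Calling simulateRuleDecider(spanishConfig, 20000) a few hundred times and averaging the number of fires gives a quick estimate of how chatty a given setting would be over a 20-second turn.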

Tuning tips

Start with the server defaults (just { enabled: true }). Adjust min_speech_ms and min_gap_ms first if back-channels feel too eager or too sparse.
Pair with turn_detection.eagerness: 'low' so the main response model gives the user space to continue — back-channel fills the perceived silence.
If back-channels feel louder than the main assistant response, lower volume_gain (default 0.6); if they feel inaudible, raise it toward 1.0. Setting volume_gain: 0.0 mutes back-channels entirely without disabling synthesis — useful for A/B tests of the decider in isolation.
For multilingual sessions, append a language directive to prompt_template. The compiled-in default lists English example phrases; without a language hint the small model echoes them in English regardless of the conversation language.
For load tests, switch decider_kind to rule with rule_fire_probability: 0.3 to remove the LLM call from the hot path while still exercising the audio pipeline.
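
For intuition about what a linear volume_gain does to the audio, the sketch below applies a gain multiplier to PCM16 samples on the client. The server applies volume_gain before sending, so you would not normally need this; it is illustrative only, applyGain is a hypothetical helper, and the clamping to the 16-bit range is an assumption about how amplification above 1.0 would have to be handled rather than documented server behavior.

```javascript
// Multiply PCM16 samples by a linear gain and clamp to the signed 16-bit range.
function applyGain(samples, gain) {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const scaled = Math.round(samples[i] * gain);
    out[i] = Math.max(-32768, Math.min(32767, scaled));
  }
  return out;
}
```

A gain of 0.6 scales every sample to 60% of its amplitude, 0.0 produces silence, and values above 1.0 can clip loud passages, which is one reason to raise the setting gradually.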
