Back-channel responses are short audio interjections — "uh-huh", "right", "I see" — that the server emits while the user is still speaking. They sit out-of-band from the main response stream and give the agent the cadence of an active listener without interrupting the user’s turn.

Enabling back-channel

ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    providerData: {
      backchannel: {
        enabled: true,
        eval_interval_ms: 800,
        min_speech_ms: 800,
        min_gap_ms: 4000,
        max_per_turn: 3,
        hard_deadline_ms: 1500,
        history_tail_items: 4,
        temperature: 0.7,
        max_tokens: 6,
        volume_gain: 0.6,
        require_pause: false,
        decider_kind: 'llm'
      }
    }
  }
}));
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Per-session opt-in. Sessions that don’t send this field never receive back-channels even when server prerequisites are met. |
| small_model | string | server default | Override the decider LLM model identifier. Empty string inherits the server default. |
| eval_interval_ms | integer | 800 | How often the manager evaluates eligibility while the user is producing partial transcripts. |
| min_speech_ms | integer | 800 | Minimum time after speech onset before any back-channel can fire. Suppresses interjections on micro-utterances. |
| min_gap_ms | integer | 4000 | Minimum spacing between two back-channels in the same user turn. |
| max_per_turn | integer | 3 | Cap on back-channels emitted within a single user turn. |
| hard_deadline_ms | integer | 1500 | Combined small-LLM + TTS deadline per attempt. Misses are dropped. |
| history_tail_items | integer | 4 | Recent conversation items the small LLM sees as context. |
| temperature | number | 0.7 | Sampling temperature for the small LLM. |
| max_tokens | integer | 6 | Max tokens for the small LLM’s reply. |
| volume_gain | number | 0.6 | Linear gain multiplier applied to synthesized back-channel audio before it is sent to the client. 0.0 mutes back-channels (synthesis still runs but no audio is delivered); 1.0 keeps the synthesized volume; values >1.0 amplify. Useful when back-channels feel louder than the main response audio. |
| require_pause | boolean | false | When true, only fire after a smart-turn pause signal (the input_audio_buffer.turn_suggestion event). When false, the periodic ticker fires regardless of speech state. |
| allowed_phrases | string[] | server default | Restrict the phrase bank. null / field omitted inherits the default; an explicit empty array disables back-channel for the session (no phrase can be picked); a populated array replaces the bank. |
| prompt_template | string | server default | Override the decider prompt. Empty string inherits the server default. Supports the Go text/template tokens {{.PhrasesList}}, {{.History}}, {{.Partial}}. |
| decider_kind | "llm" or "rule" | "llm" | llm uses a small LLM. rule picks phrases from the bank with per-tick probability rule_fire_probability; useful for load tests or low-cost production. |
| rule_fire_probability | number | 1.0 | Per-tick fire probability for the rule decider (0.0 to 1.0). Values outside [0,1] are clamped. Ignored when decider_kind != "rule". The default of 1.0 matches legacy / test behavior; production rule-decider deployments typically set this lower (e.g. 0.3) for a natural cadence. |

Sending providerData.backchannel: {} (empty object) clears all overrides; the server falls back to its compiled-in defaults.
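
As a rough sketch of how a client might assemble this message, the helper below wraps caller overrides in a session.update payload and tidies two numeric fields before sending. The names buildBackchannelUpdate and clamp are hypothetical conveniences, not part of the API, and the local clamping is an assumption made for illustration; the server applies its own validation.

```javascript
// Hypothetical helper: builds a session.update payload for back-channel.
// Field names come from the table above; the helper itself is illustrative.
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

function buildBackchannelUpdate(overrides = {}) {
  const backchannel = { ...overrides };

  // The table says rule_fire_probability is clamped to [0, 1] by the server;
  // clamping locally keeps what we send and what is applied in sync.
  if (typeof backchannel.rule_fire_probability === 'number') {
    backchannel.rule_fire_probability = clamp(backchannel.rule_fire_probability, 0, 1);
  }
  // Negative gain is not described in the table, so floor it at 0 (assumption).
  if (typeof backchannel.volume_gain === 'number') {
    backchannel.volume_gain = Math.max(0, backchannel.volume_gain);
  }

  return {
    type: 'session.update',
    session: { providerData: { backchannel } },
  };
}

// Minimal opt-in: only `enabled` is sent, everything else uses server defaults.
const minimal = buildBackchannelUpdate({ enabled: true });
console.log(JSON.stringify(minimal));
```

The result can be passed straight to ws.send(JSON.stringify(...)) in place of the hand-written object shown earlier.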

Handling back-channel audio

session.on('transport_event', (event) => {
  switch (event.type) {
    case 'response.backchannel.audio.delta': {
      // event.backchannel_id groups deltas + done for one interjection.
      // event.delta is base64-encoded PCM16 (or whatever output format you
      // configured on session.audio.output.format).
      audioHandler.playAudio(event.delta, `backchannel:${event.backchannel_id}`);
      break;
    }
    case 'response.backchannel.audio.done': {
      // event.phrase (optional) is the chosen utterance, e.g. "uh-huh".
      // No teardown needed — playAudio queues until exhausted.
      break;
    }
  }
});
Use backchannel_id as the playback bucket key so chunks of one interjection don’t collide with the active assistant response item.
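
To make the bucketing idea concrete, here is a minimal sketch of a per-interjection buffer for clients that manage audio themselves. It assumes the output format is PCM16 little-endian and that the code runs where Node's Buffer is available; the BackchannelBuckets class and its method names are hypothetical and stand in for whatever your audio handler does internally.

```javascript
// Hypothetical per-interjection buffer, keyed by backchannel_id.
// Assumes event.delta is base64-encoded PCM16 (little-endian).
class BackchannelBuckets {
  constructor() {
    this.buckets = new Map(); // backchannel_id -> array of Int16Array chunks
  }

  // Decode one delta and append it to the bucket for its interjection.
  onDelta(event) {
    const bytes = Buffer.from(event.delta, 'base64');
    const samples = new Int16Array(Math.floor(bytes.length / 2));
    for (let i = 0; i < samples.length; i++) {
      samples[i] = bytes.readInt16LE(i * 2);
    }
    const chunks = this.buckets.get(event.backchannel_id) ?? [];
    chunks.push(samples);
    this.buckets.set(event.backchannel_id, chunks);
    return samples;
  }

  // On done, release the bucket and report what was received.
  onDone(event) {
    const chunks = this.buckets.get(event.backchannel_id) ?? [];
    this.buckets.delete(event.backchannel_id);
    const totalSamples = chunks.reduce((n, c) => n + c.length, 0);
    return { phrase: event.phrase ?? null, totalSamples };
  }
}
```

Because each interjection has its own key, its chunks never interleave with chunks belonging to the active assistant response item.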

Client integration: preserve audio during user speech

A natural-feeling back-channel is audible to the user while they are still talking. The default audio-ducking behavior in many client integrations attenuates all output channels when VAD reports user activity — this also silences the back-channel, defeating its purpose. When you wire back-channel into your client, exempt the back-channel playback bucket (keyed by backchannel_id) from the duck-on-user-speech rule so interjections remain audible while the user holds the floor.
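
One way to express that exemption is a small gain policy consulted by the mixer. The sketch below assumes bucket keys follow the backchannel:<id> convention from the handler example; DUCK_GAIN, isBackchannelBucket, and playbackGain are hypothetical names and values, not part of any SDK.

```javascript
// Hypothetical ducking policy. DUCK_GAIN is an illustrative value.
const DUCK_GAIN = 0.2;

function isBackchannelBucket(bucketKey) {
  return bucketKey.startsWith('backchannel:');
}

// Returns the playback gain for a bucket given the current VAD state.
function playbackGain(bucketKey, userIsSpeaking) {
  if (!userIsSpeaking) return 1.0;                 // nothing to duck
  if (isBackchannelBucket(bucketKey)) return 1.0;  // exempt interjections
  return DUCK_GAIN;                                // duck everything else
}

console.log(playbackGain('backchannel:bc_1', true)); // back-channel stays at full gain
console.log(playbackGain('item_abc', true));         // main response is ducked
```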

Event reference

| Event | Direction | Description |
| --- | --- | --- |
| response.backchannel.audio.delta | server → client | Streaming PCM audio chunk for one interjection. Carries backchannel_id, base64 delta. |
| response.backchannel.audio.done | server → client | Indicates all audio for the back-channel interjection identified by backchannel_id has been streamed. Carries an optional phrase field with the chosen utterance (e.g. "uh-huh") when the decider surfaces it; omitted when the decider doesn’t expose the phrase (configuration-dependent). The wire shape is omitempty, so absent and null are equivalent. |
| response.backchannel.skipped | server → client | The decider chose not to fire on an evaluation tick. Carries a reason string (e.g. "deadline_missed", "no_phrase") for client-side telemetry. Safe to ignore. |

Example response.backchannel.skipped payload:
{
  "event_id": "evt_5f7d2",
  "type": "response.backchannel.skipped",
  "reason": "min_gap_not_elapsed"
}
See the WebSocket API reference for the full schemas.
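
If you do want to use skipped events for telemetry, a simple counter keyed by reason is usually enough. The sketch below treats reason strings as opaque, since the full set of values is defined by the server; SkipStats is a hypothetical class name.

```javascript
// Hypothetical telemetry counter for response.backchannel.skipped events.
class SkipStats {
  constructor() {
    this.counts = new Map(); // reason -> number of occurrences
  }

  // Count one skipped event; other event types are ignored.
  record(event) {
    if (event.type !== 'response.backchannel.skipped') return;
    const reason = event.reason ?? 'unknown';
    this.counts.set(reason, (this.counts.get(reason) ?? 0) + 1);
  }

  // Reasons sorted by frequency, most common first.
  summary() {
    return [...this.counts.entries()].sort((a, b) => b[1] - a[1]);
  }
}
```

A summary dominated by one reason is a hint about which setting to adjust, for example raising hard_deadline_ms if most skips report a missed deadline.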

Example: Spanish back-channel with rule decider

For a low-cost back-channel with no small-LLM call per evaluation, just a random pick from a fixed Spanish phrase bank, use decider_kind: "rule" with a populated allowed_phrases and a per-tick fire probability tuned for natural cadence. The TTS voice and language come from your normal audio.output and providerData.tts.language settings, so the phrases are spoken in the Spanish accent you’ve already configured.
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    audio: { output: { voice: 'Olivia', model: 'inworld-tts-2' } },
    providerData: {
      tts: { language: 'es-ES' },
      backchannel: {
        enabled: true,
        decider_kind: 'rule',
        rule_fire_probability: 0.3,
        allowed_phrases: ['ajá', 'claro', 'sí', 'mhm', 'entiendo', 'vale'],
        min_speech_ms: 800,
        min_gap_ms: 4000,
        max_per_turn: 3
      }
    }
  }
}));
What this does:
  • No LLM in the hot path. Every evaluation tick is a coin flip against rule_fire_probability; if it fires, the server picks a random phrase from allowed_phrases. Latency is effectively just TTS synthesis.
  • rule_fire_probability: 0.3 keeps the cadence natural: at the default eval_interval_ms of 800 ms, the coin flip alone succeeds on average once every ~3.3 ticks (about 2.7 s) once the gating thresholds pass, and min_gap_ms still enforces at least 4 s between fires. Tune up for more eager back-channels, down for sparser ones.
  • allowed_phrases replaces the compiled-in English bank with Spanish utterances. With the LLM decider you’d instead append a language directive to prompt_template; with the rule decider, the phrase bank is the only thing controlling what gets said, so list them explicitly.
  • providerData.tts.language: 'es-ES' anchors the TTS accent. Without it, TTS-2 may infer a different accent from the audio or text context.
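
To get a feel for how these settings interact, the cadence can be approximated with a small client-side simulation. This is not the server implementation: the tick schedule (first tick one interval after speech onset), the order of the gating checks, and the names mulberry32 and simulateTurn are all assumptions made for illustration. Only the setting names and their documented meanings come from this page.

```javascript
// Small seeded PRNG (mulberry32) so simulation runs are repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Returns the times (ms since speech onset) at which a back-channel would fire.
function simulateTurn(cfg, turnMs, random) {
  const fires = [];
  let lastFire = -Infinity;
  for (let t = cfg.eval_interval_ms; t <= turnMs; t += cfg.eval_interval_ms) {
    if (fires.length >= cfg.max_per_turn) break;   // per-turn cap reached
    if (t < cfg.min_speech_ms) continue;           // too early in the turn
    if (t - lastFire < cfg.min_gap_ms) continue;   // too close to the last fire
    if (random() < cfg.rule_fire_probability) {    // the per-tick coin flip
      fires.push(t);
      lastFire = t;
    }
  }
  return fires;
}

const cfg = {
  eval_interval_ms: 800,
  min_speech_ms: 800,
  min_gap_ms: 4000,
  max_per_turn: 3,
  rule_fire_probability: 0.3,
};

// Simulate a 30-second user turn.
const fires = simulateTurn(cfg, 30000, mulberry32(42));
console.log(fires);

// Ungated expectation for the coin flip alone: 1 / p ticks between fires.
const meanMs = (1 / cfg.rule_fire_probability) * cfg.eval_interval_ms;
console.log(meanMs.toFixed(0) + ' ms');
```

Running the simulation across a range of probabilities and gap values gives a quick sense of how many interjections a long turn would receive before changing any live configuration.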

Tuning tips

  • Start with the server defaults (just { enabled: true }). Adjust min_speech_ms and min_gap_ms first if back-channels feel too eager or too sparse.
  • Pair with turn_detection.eagerness: 'low' so the main response model gives the user space to continue — back-channel fills the perceived silence.
  • If back-channels feel louder than the main assistant response, lower volume_gain (default 0.6); if they feel inaudible, raise it toward 1.0. Setting volume_gain: 0.0 mutes back-channels entirely without disabling synthesis — useful for A/B tests of the decider in isolation.
  • For multilingual sessions, append a language directive to prompt_template. The compiled-in default lists English example phrases; without a language hint the small model echoes them in English regardless of the conversation language.
  • For load tests, switch decider_kind to rule with rule_fire_probability: 0.3 to remove the LLM call from the hot path while still exercising the audio pipeline.
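
For intuition about the volume_gain tip, the sketch below shows what a linear gain multiplier does to PCM16 samples and how a gain value maps to decibels. How the server applies volume_gain internally is an assumption; only the "linear gain multiplier" description comes from the field table. applyGain and gainToDb are hypothetical helper names.

```javascript
// Scale PCM16 samples by a linear gain, clipping to the 16-bit range.
function applyGain(samples, gain) {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const scaled = Math.round(samples[i] * gain);
    out[i] = Math.max(-32768, Math.min(32767, scaled));
  }
  return out;
}

// Convert a linear gain to decibels; a gain of 0 maps to -Infinity (mute).
function gainToDb(gain) {
  return 20 * Math.log10(gain);
}

const quieter = applyGain(Int16Array.from([1000, -1000, 32767]), 0.6);
console.log(Array.from(quieter));           // samples scaled to 60% amplitude
console.log(gainToDb(0.6).toFixed(2));      // the default gain expressed in dB
```

The default of 0.6 corresponds to roughly 4.4 dB of attenuation, which is a noticeable but modest reduction; values above 1.0 risk clipping on loud samples.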