Connect via WebRTC for browser-native, low-latency voice. A WebRTC proxy bridges your peer connection to the same realtime service used by the WebSocket transport, transcoding OPUS ↔ PCM16 and forwarding events transparently.
## Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/v1/realtime/calls` | POST | SDP offer/answer exchange |
| `/v1/realtime/ice-servers` | GET | STUN/TURN server configuration |
## Authentication

Pass your Inworld API key as a Bearer token. The proxy forwards it to the realtime service.

```
Authorization: Bearer <base64-api-key>
```

Keep the API key server-side. Serve it to the browser via a backend endpoint (see examples below).
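To sanity-check a key before wiring up the browser flow, the ICE-server endpoint can be called directly (a sketch; the `ice_servers` response field matches what the server example below reads):

```shell
# Assumes INWORLD_API_KEY is set in the environment.
# Expected response shape: {"ice_servers": [...]}
curl -s https://api.inworld.ai/v1/realtime/ice-servers \
  -H "Authorization: Bearer $INWORLD_API_KEY"
```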
## Flow

- Fetch config from your server (API key + ICE servers)
- Create an `RTCPeerConnection` with the ICE servers
- Create the `oai-events` data channel and add the microphone track
- Create an SDP offer → POST it to `/v1/realtime/calls` → set the SDP answer
- Data channel opens → send `session.update` → start the conversation

Audio flows via RTP tracks (no manual encode/decode). Events flow over the data channel using the same JSON schema as the WebSocket transport.
## Session Config

Send the same `session.update` as over WebSocket, but through the data channel. See model, voice, and TTS configuration for details.

```javascript
dc.send(JSON.stringify({
  type: 'session.update',
  session: {
    type: 'realtime',
    model: 'openai/gpt-4o-mini',
    instructions: 'You are a concise concierge.',
    output_modalities: ['audio', 'text'],
    audio: {
      input: {
        turn_detection: {
          type: 'semantic_vad',
          eagerness: 'medium',
          create_response: true,
          interrupt_response: true
        }
      },
      output: {
        voice: 'Clive',
        model: 'inworld-tts-1.5-mini',
        speed: 1.0
      }
    }
  }
}));
```
## Audio

Unlike WebSocket (manual base64 PCM), WebRTC handles audio natively:

- Input: the browser captures the mic and sends OPUS over RTP automatically
- Output: the proxy sends AI audio back as an RTP track — attach it to an `<audio>` element to play

```javascript
pc.ontrack = (e) => {
  const audio = document.createElement('audio');
  audio.autoplay = true;
  audio.srcObject = new MediaStream([e.track]);
  document.body.appendChild(audio);
};
```

`response.output_audio.delta` events are not sent through the data channel — audio is delivered via the RTP track instead.
## Text & Responses

Same as WebSocket, but sent through the data channel:

```javascript
dc.send(JSON.stringify({
  type: 'conversation.item.create',
  item: { type: 'message', role: 'user', content: [{ type: 'input_text', text: 'Hello!' }] }
}));
dc.send(JSON.stringify({ type: 'response.create' }));
```
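The two sends can be wrapped in a small helper (a sketch; `sendUserText` is an illustrative name, not part of the API):

```javascript
// Queue a user message on the data channel, then request a response.
// `dc` is any object with a send(string) method, e.g. an RTCDataChannel.
function sendUserText(dc, text) {
  dc.send(JSON.stringify({
    type: 'conversation.item.create',
    item: { type: 'message', role: 'user', content: [{ type: 'input_text', text }] },
  }));
  dc.send(JSON.stringify({ type: 'response.create' }));
}
```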
## Events

Same event types as WebSocket, received on the data channel.
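A minimal dispatcher for `dc.onmessage` might look like this (handler names are illustrative; the event types are the ones used throughout this page):

```javascript
// Parse a data-channel message and route it by event type.
// `handlers` is an optional bag of callbacks; unknown types fall through to onOther.
function routeEvent(raw, handlers = {}) {
  const msg = JSON.parse(raw);
  switch (msg.type) {
    case 'response.output_text.delta':
      handlers.onTextDelta?.(msg.delta);
      break;
    case 'error':
      handlers.onError?.(msg.error?.message);
      break;
    default:
      handlers.onOther?.(msg);
  }
  return msg;
}
```

Wire it up with `dc.onmessage = (e) => routeEvent(e.data, { onTextDelta: console.log })`.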
## Option 1: Direct WebRTC

Server — serves the page and a `/api/config` endpoint that fetches ICE servers and keeps the API key out of the static page source:

```javascript
import 'dotenv/config';
import { readFileSync } from 'fs';
import { createServer } from 'http';

const html = readFileSync('index.html');
const API_KEY = process.env.INWORLD_API_KEY || '';
const PROXY = 'https://api.inworld.ai';

const server = createServer(async (req, res) => {
  if (req.url === '/api/config') {
    let ice = [];
    try {
      const r = await fetch(`${PROXY}/v1/realtime/ice-servers`, {
        headers: { Authorization: `Bearer ${API_KEY}` },
      });
      if (r.ok) ice = (await r.json()).ice_servers || [];
    } catch {}
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ api_key: API_KEY, ice_servers: ice, url: `${PROXY}/v1/realtime/calls` }));
    return;
  }
  res.writeHead(200, { 'Content-Type': 'text/html' });
  res.end(html);
});

let port = 3000;
server.on('error', (e) => {
  if (e.code === 'EADDRINUSE') { console.warn(`Port ${port} in use, trying ${++port}…`); server.listen(port); }
  else throw e;
});
server.listen(port, () => console.log(`http://localhost:${port}`));
```
Client — full WebRTC flow in the browser:

```javascript
const cfg = await (await fetch('/api/config')).json();
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

const pc = new RTCPeerConnection({ iceServers: cfg.ice_servers });
const dc = pc.createDataChannel('oai-events', { ordered: true });
stream.getAudioTracks().forEach(t => pc.addTrack(t, stream));

pc.ontrack = (e) => {
  const audio = document.createElement('audio');
  audio.autoplay = true;
  audio.srcObject = new MediaStream([e.track]);
  document.body.appendChild(audio);
};

dc.onopen = () => {
  dc.send(JSON.stringify({
    type: 'session.update',
    session: {
      type: 'realtime',
      model: 'openai/gpt-4o-mini',
      instructions: 'You are a helpful voice assistant.',
      output_modalities: ['audio', 'text'],
      audio: {
        input: { turn_detection: { type: 'semantic_vad', eagerness: 'medium', create_response: true, interrupt_response: true } },
        output: { voice: 'Clive', model: 'inworld-tts-1.5-mini' }
      }
    }
  }));
};

dc.onmessage = (e) => {
  const msg = JSON.parse(e.data);
  if (msg.type === 'response.output_text.delta') console.log(msg.delta);
  if (msg.type === 'error') console.error(msg.error?.message);
};

const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

// Wait for ICE gathering to finish so the POSTed SDP includes all candidates
await new Promise((resolve) => {
  if (pc.iceGatheringState === 'complete') return resolve();
  pc.addEventListener('icegatheringstatechange', () => {
    if (pc.iceGatheringState === 'complete') resolve();
  });
});

const res = await fetch(cfg.url, {
  method: 'POST',
  headers: { 'Content-Type': 'application/sdp', Authorization: `Bearer ${cfg.api_key}` },
  body: pc.localDescription.sdp,
});
await pc.setRemoteDescription({ type: 'answer', sdp: await res.text() });
```
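To end a call, close what the example opened (a hypothetical teardown helper; this page does not prescribe a shutdown sequence):

```javascript
// Close the data channel, stop local capture, then tear down the peer connection.
// `pc`, `dc`, and `stream` are the objects created in the client example above.
function hangUp(pc, dc, stream) {
  dc.close();
  stream.getTracks().forEach((t) => t.stop());
  pc.close();
}
```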
## Option 2: OpenAI Agents SDK

The OpenAI Agents SDK manages the full WebRTC lifecycle — peer connection, SDP exchange, mic capture, and audio playback:

```javascript
import { RealtimeSession, RealtimeAgent, OpenAIRealtimeWebRTC } from '@openai/agents/realtime';

const agent = new RealtimeAgent({
  name: 'assistant',
  instructions: 'You are a helpful voice assistant.',
  model: 'openai/gpt-4o-mini',
});

const cfg = await (await fetch('/api/config')).json();
const audioEl = document.createElement('audio');
audioEl.autoplay = true;

const session = new RealtimeSession(agent, {
  transport: new OpenAIRealtimeWebRTC({
    useInsecureApiKey: true,
    audioElement: audioEl,
    changePeerConnection: async (pc) => {
      if (cfg.ice_servers?.length) pc.setConfiguration({ iceServers: cfg.ice_servers });
      return pc;
    },
  }),
  model: 'gpt-4o-realtime-preview-2025-06-03',
});

await session.connect({ url: cfg.url, apiKey: cfg.api_key });
session.sendMessage('Hello!');
```

The server-side `/api/config` endpoint is identical to Option 1.
## WebSocket vs WebRTC

| | WebSocket | WebRTC |
|---|---|---|
| Audio | PCM16 base64 (manual) | OPUS via RTP (native) |
| Latency | Higher | Lower (UDP) |
| NAT traversal | Not needed | ICE (STUN/TURN) |
| Events | WS messages | DataChannel (same schema) |
| Best for | Server-side / Node.js | Browser voice apps |
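For contrast, this is roughly the manual step the WebSocket transport requires and WebRTC removes: packing Float32 mic samples into base64-encoded PCM16 (a sketch, assuming Node's `Buffer`):

```javascript
// Convert Float32 audio samples (range -1..1) to little-endian PCM16, base64-encoded.
// With WebRTC this never runs: the browser encodes OPUS and sends it over RTP.
function floatToPcm16Base64(samples) {
  const buf = Buffer.alloc(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp before scaling
    buf.writeInt16LE(Math.round(s * 32767), i * 2);
  }
  return buf.toString('base64');
}
```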
See the API reference for full event schemas.