Messages

{
  "type": "session.update",
  "session": {
    "instructions": "You are a friendly voice assistant.",
    "audio": {
      "input": {
        "transcription": {
          "model": "inworld/inworld-stt-1"
        },
        "turn_detection": {
          "type": "semantic_vad",
          "eagerness": "medium",
          "create_response": true,
          "interrupt_response": true
        }
      },
      "output": {
        "model": "inworld-tts-2",
        "voice": "Dennis",
        "speed": 1
      }
    }
  }
}

{
  "event_id": "2c23cfd4-a4b5-4a96-83b8-a6a151f3989e",
  "type": "error",
  "error": {
    "type": "server_error",
    "code": null,
    "message": "Failed to read content stream.",
    "param": null,
    "event_id": null
  }
}

Realtime API

Realtime API (WebSocket)

Real-time, multimodal AI interactions over WebSocket. Enables low-latency speech-to-speech conversations through a cascaded pipeline (STT → LLM → TTS), supporting both audio and text modalities.

The API maintains a persistent WebSocket connection where clients can:

Create and configure sessions with custom instructions and voice settings
Stream audio input in real-time for natural voice conversations
Send text input as an alternative to audio
Receive streaming audio and text responses with low latency
Manage conversation flow with turn detection and response control

Key Features:

Low Latency: Optimized for real-time interactions
Multimodal: Supports both audio and text input/output
Voice Activity Detection: Automatic speech detection with configurable thresholds
Streaming Responses: Receive response events as they’re generated
Session Management: Maintain conversation context across multiple interactions

Rate Limits: Concurrent session limits vary by subscription plan. See features and limits by plan for the per-tier table.

Inworld extensions: The session object accepts a providerData field carrying Inworld-specific extensions to the OpenAI-compatible shape — STT tuning, TTS segmentation/steering, automatic memory, back-channel, and responsiveness fillers. See API Extensions for the field-by-field reference.

This API implements the Realtime interface. Refer to the Realtime overview for hands-on guides.

WSS

api

realtime

session

Messages

{
  "type": "session.update",
  "session": {
    "instructions": "You are a friendly voice assistant.",
    "audio": {
      "input": {
        "transcription": {
          "model": "inworld/inworld-stt-1"
        },
        "turn_detection": {
          "type": "semantic_vad",
          "eagerness": "medium",
          "create_response": true,
          "interrupt_response": true
        }
      },
      "output": {
        "model": "inworld-tts-2",
        "voice": "Dennis",
        "speed": 1
      }
    }
  }
}

{
  "event_id": "2c23cfd4-a4b5-4a96-83b8-a6a151f3989e",
  "type": "error",
  "error": {
    "type": "server_error",
    "code": null,
    "message": "Failed to read content stream.",
    "param": null,
    "event_id": null
  }
}

bearerAuth

type:http

Use your API key for authentication. See Authentication for details. You can create a key in one command with the Inworld CLI: inworld workspace add-key.

session.update

type:object

Update the session configuration. The server responds with a session.updated event.

conversation.item.create

type:object

Add a conversation item (message, function call result, etc.).

conversation.item.truncate

type:object

Truncate an assistant message's audio.

conversation.item.delete

type:object

Delete a conversation item by ID.

conversation.item.retrieve

type:object

Retrieve a conversation item by ID.

response.create

type:object

Trigger a model response. The server streams back response events.

response.cancel

type:object

Cancel an in-progress response.

input_audio_buffer.append

type:object

Append audio bytes to the input buffer.

input_audio_buffer.commit

type:object

Commit the buffered audio as a user message.

input_audio_buffer.clear

type:object

Discard all audio in the input buffer.

output_audio_buffer.clear

type:object

Clear the server's output audio buffer, stopping playback.

session.created

type:object

Sent by the server immediately when the WebSocket connection is established, carrying the session's default configuration. Send a session.update to configure the session.

session.updated

type:object

Confirms a session.update was applied.

error

type:object

Indicates an error occurred.

conversation.item.added

type:object

A new item was added to the conversation.

conversation.item.done

type:object

An item finished being populated.

conversation.item.deleted

type:object

An item was deleted from the conversation.

conversation.item.retrieved

type:object

Response to conversation.item.retrieve.

conversation.item.truncated

type:object

An assistant audio item was truncated.

conversation.item.input_audio_transcription.delta

type:object

Streaming partial transcription for user audio.

conversation.item.input_audio_transcription.completed

type:object

Final transcription for a user audio item.

response.created

type:object

A new response was created. Contains the full response object in its initial state.

response.done

type:object

The response finished. Contains the completed response object with final status and output.

response.output_item.added

type:object

An output item was added to the response.

response.output_item.done

type:object

An output item finished.

response.content_part.added

type:object

A content part was added to an output item.

response.content_part.done

type:object

A content part finished.

response.output_text.delta

type:object

Streaming text chunk from the model.

response.output_text.done

type:object

Text output finished.

response.output_audio_transcript.delta

type:object

Streaming transcript for generated audio.

response.output_audio_transcript.done

type:object

Final transcript for generated audio.

response.output_audio.done

type:object

Audio output for a content part finished.

response.function_call_arguments.delta

type:object

Streaming function call arguments.

response.function_call_arguments.done

type:object

Function call arguments finished.

input_audio_buffer.speech_started

type:object

Voice activity detected — user started speaking. Always emitted before transcripts, including on STT providers without native VAD (the server synthesizes the event at first-audio so client code can rely on the same ordering across providers).

input_audio_buffer.speech_stopped

type:object

Voice activity ended — user stopped speaking.

input_audio_buffer.committed

type:object

Buffered audio was committed as a conversation item.

input_audio_buffer.cleared

type:object

Input audio buffer was cleared.

input_audio_buffer.timeout_triggered

type:object

An idle timeout was triggered on the input buffer. Emitted under server_vad when no speech has been detected within turn_detection.idle_timeout_ms.

input_audio_buffer.turn_suggestion

type:object

Emitted by the server VAD smart-turn detector when it predicts the user has reached an end-of-turn boundary. Clients can use this signal to drive low-latency UI cues or to pre-warm a response without waiting for the final speech_stopped commit. May be followed by input_audio_buffer.turn_suggestion_revoked if the user resumes speaking.

input_audio_buffer.turn_suggestion_revoked

type:object

Emitted when the user resumes speaking after a previous turn_suggestion. Pairs with the most recent turn_suggestion sharing the same utterance_index.

output_audio_buffer.started

type:object

Server started sending output audio.

output_audio_buffer.stopped

type:object

Server stopped sending output audio.

output_audio_buffer.cleared

type:object

Output audio buffer was cleared.

response.backchannel.audio.delta

type:object

Streaming PCM audio chunk for a low-latency back-channel interjection (e.g. "uh-huh", "right") emitted while the user is mid-utterance. Out-of-band from the main response stream — use backchannel_id to group chunks belonging to the same interjection.

response.backchannel.audio.done

type:object

All audio for a back-channel interjection has been streamed. No teardown required — playback queues until exhausted.

response.backchannel.skipped

type:object

An evaluation tick chose not to fire a back-channel. Useful for client-side telemetry; clients that don't care can ignore this event.

rate_limits.updated

type:object

Reports current rate limit state.

⌘I