{
"type": "session.update",
"session": {
"instructions": "You are a friendly voice assistant.",
"audio": {
"input": {
"transcription": {
"model": "inworld/inworld-stt-1"
},
"turn_detection": {
"type": "semantic_vad",
"eagerness": "medium",
"create_response": true,
"interrupt_response": true
}
},
"output": {
"model": "inworld-tts-2",
"voice": "Dennis",
"speed": 1
}
}
}
}{
"type": "conversation.item.create",
"item": {
"type": "message",
"role": "user",
"content": [
{
"type": "input_text",
"text": "Hello, how are you?"
}
]
}
}{
"const": "<string>",
"event_id": "<string>",
"item_id": "<string>",
"content_index": 123,
"audio_end_ms": 123
}{
"const": "<string>",
"event_id": "<string>",
"item_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"item_id": "<string>"
}{
"type": "response.create",
"response": {
"output_modalities": [
"audio",
"text"
],
"instructions": "Respond in a cheerful tone."
}
}{
"const": "<string>",
"event_id": "<string>",
"response_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"audio": "<string>"
}{
"const": "<string>",
"event_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>"
}{
"event_id": "2c23cfd4-a4b5-4a96-83b8-a6a151f3989e",
"type": "error",
"error": {
"type": "server_error",
"code": null,
"message": "Failed to read content stream.",
"param": null,
"event_id": null
}
}{
"const": "<string>",
"event_id": "<string>",
"item_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"item_id": "<string>",
"content_index": 123,
"delta": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"item_id": "<string>",
"content_index": 123,
"transcript": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"response_id": "<string>",
"item_id": "<string>",
"output_index": 123,
"content_index": 123,
"delta": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"response_id": "<string>",
"item_id": "<string>",
"output_index": 123,
"content_index": 123,
"text": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"response_id": "<string>",
"item_id": "<string>",
"output_index": 123,
"content_index": 123,
"delta": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"response_id": "<string>",
"item_id": "<string>",
"output_index": 123,
"content_index": 123,
"transcript": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"response_id": "<string>",
"item_id": "<string>",
"output_index": 123,
"content_index": 123
}{
"const": "<string>",
"event_id": "<string>",
"response_id": "<string>",
"item_id": "<string>",
"output_index": 123,
"content_index": 123,
"delta": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"response_id": "<string>",
"item_id": "<string>",
"output_index": 123,
"content_index": 123,
"arguments": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"audio_start_ms": 123,
"item_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"audio_end_ms": 123,
"item_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"item_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"audio_start_ms": 123,
"audio_end_ms": 123,
"item_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"item_id": "<string>",
"utterance_index": 123,
"probability": 123,
"trailing_silence_ms": 123,
"audio_duration_ms": 123,
"inference_ms": 123
}{
"const": "<string>",
"event_id": "<string>",
"item_id": "<string>",
"utterance_index": 123
}{
"const": "<string>",
"event_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"backchannel_id": "<string>",
"delta": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"backchannel_id": "<string>",
"phrase": "<string>"
}{
"const": "<string>",
"event_id": "<string>",
"reason": "<string>"
}{
"const": "<string>",
"event_id": "<string>"
}
Realtime API (WebSocket)
Real-time, multimodal AI interactions over WebSocket. Enables low-latency speech-to-speech conversations with AI models, supporting both audio and text modalities.
The API maintains a persistent WebSocket connection where clients can:
- Create and configure sessions with custom instructions and voice settings
- Stream audio input in real-time for natural voice conversations
- Send text input as an alternative to audio
- Receive streaming audio and text responses with low latency
- Manage conversation flow with turn detection and response control
Key Features:
- Low Latency: Optimized for real-time interactions
- Multimodal: Supports both audio and text input/output
- Voice Activity Detection: Automatic speech detection with configurable thresholds
- Streaming Responses: Receive response events as they’re generated
- Session Management: Maintain conversation context across multiple interactions
Rate Limits: Concurrent session limits vary by subscription plan. See features and limits by plan for the per-tier table.
Inworld extensions: The session object accepts a providerData field carrying Inworld-specific extensions to the OpenAI-compatible shape — STT tuning, TTS segmentation/steering, automatic memory, back-channel, and responsiveness fillers. See API Extensions for the field-by-field reference.
This API implements the Realtime interface. Refer to the Realtime overview for hands-on guides.
Use your API key for authentication. See Authentication for details. You can create a key in one command with the Inworld CLI: inworld workspace add-key.
Update the session configuration. The server responds with a session.updated event.
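As a rough illustration, the session.update payload could be assembled in client code like this. The field names and values mirror the session.update example on this page; the helper name build_session_update and its parameters are hypothetical, and the sketch only builds the JSON text without opening a connection.

```python
import json


def build_session_update(instructions, voice="Dennis", eagerness="medium"):
    """Return a session.update event as a dict (hypothetical helper)."""
    return {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "audio": {
                "input": {
                    "transcription": {"model": "inworld/inworld-stt-1"},
                    "turn_detection": {
                        "type": "semantic_vad",
                        "eagerness": eagerness,
                        "create_response": True,
                        "interrupt_response": True,
                    },
                },
                "output": {"model": "inworld-tts-2", "voice": voice, "speed": 1},
            },
        },
    }


# Serialize to the text frame a client would send over the WebSocket.
payload = json.dumps(build_session_update("You are a friendly voice assistant."))
```

A client would typically wait for the session.updated confirmation before relying on the new settings.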
Add a conversation item (message, function call result, etc.).
Truncate an assistant message's audio.
Delete a conversation item by ID.
Retrieve a conversation item by ID.
Trigger a model response. The server streams back response events.
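One way to send a text turn is to add a user message and then request a response. The sketch below builds both frames in that order, using the conversation.item.create and response.create shapes from the examples on this page; the function name text_turn_frames is hypothetical, and whether an explicit response.create is needed for text input is an assumption.

```python
import json


def text_turn_frames(text, instructions=None):
    """Build the two JSON frames for a text turn (hypothetical helper)."""
    item_event = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": text}],
        },
    }
    response = {"output_modalities": ["audio", "text"]}
    if instructions is not None:
        # Per-response instructions, as in the response.create example.
        response["instructions"] = instructions
    response_event = {"type": "response.create", "response": response}
    # Order matters: the item must exist before the response is requested.
    return [json.dumps(item_event), json.dumps(response_event)]


frames = text_turn_frames("Hello, how are you?", "Respond in a cheerful tone.")
```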
Cancel an in-progress response.
Append audio bytes to the input buffer.
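A minimal sketch of chunking raw audio into append events follows. Several details are assumptions rather than facts from this page: the event type name input_audio_buffer.append (taken from the OpenAI-compatible shape), base64 encoding of the audio field, and the 3200-byte chunk size. Only the presence of a string-valued audio field comes from the example panes.

```python
import base64
import json


def audio_append_frames(pcm, chunk_bytes=3200):
    """Yield one JSON frame per audio chunk (hypothetical helper)."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",  # assumed event name
            "audio": base64.b64encode(chunk).decode("ascii"),  # assumed encoding
        })


# 7000 bytes of silence split into 3200-byte chunks gives three frames.
frames = list(audio_append_frames(b"\x00" * 7000))
```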
Commit the buffered audio as a user message.
Discard all audio in the input buffer.
Clear the server's output audio buffer, stopping playback.
Not currently supported. The session starts immediately with default configuration. Send a session.update to configure the session.
Confirms a session.update was applied.
Indicates an error occurred.
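For illustration, a client might unpack error events along these lines. The field names follow the error example on this page; the RealtimeError class and parse_error function are hypothetical names, and the fallback values for missing fields are an assumption.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class RealtimeError:
    type: str
    message: str
    code: Optional[str] = None
    param: Optional[str] = None
    event_id: Optional[str] = None


def parse_error(raw):
    """Return a RealtimeError for error events, or None for any other event."""
    event = json.loads(raw)
    if event.get("type") != "error":
        return None
    err = event.get("error") or {}
    return RealtimeError(
        type=err.get("type", "unknown"),
        message=err.get("message", ""),
        code=err.get("code"),
        param=err.get("param"),
        event_id=err.get("event_id"),
    )


sample = json.dumps({
    "event_id": "2c23cfd4-a4b5-4a96-83b8-a6a151f3989e",
    "type": "error",
    "error": {
        "type": "server_error",
        "code": None,
        "message": "Failed to read content stream.",
        "param": None,
        "event_id": None,
    },
})
parsed = parse_error(sample)
```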
A new item was added to the conversation.
An item finished being populated.
An item was deleted from the conversation.
Response to conversation.item.retrieve.
An assistant audio item was truncated.
Streaming partial transcription for user audio.
Final transcription for a user audio item.
A new response was created. Contains the full response object in its initial state.
The response finished. Contains the completed response object with final status and output.
An output item was added to the response.
An output item finished.
A content part was added to an output item.
A content part finished.
Streaming text chunk from the model.
Text output finished.
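Streaming text chunks can be stitched together on the client roughly as follows. The delta, text, item_id, and content_index fields appear in the example panes; the event type names response.output_text.delta and response.output_text.done are assumptions based on the OpenAI-compatible shape, and TextAssembler is a hypothetical name.

```python
from collections import defaultdict

TEXT_DELTA = "response.output_text.delta"  # assumed event name
TEXT_DONE = "response.output_text.done"    # assumed event name


class TextAssembler:
    """Accumulate text deltas per (item_id, content_index)."""

    def __init__(self):
        self._parts = defaultdict(list)
        self.finished = {}

    def handle(self, event):
        """Return the final text when a content part finishes, else None."""
        key = (event.get("item_id"), event.get("content_index"))
        kind = event.get("type")
        if kind == TEXT_DELTA:
            self._parts[key].append(event["delta"])
        elif kind == TEXT_DONE:
            streamed = "".join(self._parts.pop(key, []))
            # Prefer the server's final text; fall back to the joined deltas.
            self.finished[key] = event.get("text", streamed)
            return self.finished[key]
        return None


assembler = TextAssembler()
assembler.handle({"type": TEXT_DELTA, "item_id": "i1", "content_index": 0, "delta": "Hel"})
assembler.handle({"type": TEXT_DELTA, "item_id": "i1", "content_index": 0, "delta": "lo"})
final = assembler.handle({"type": TEXT_DONE, "item_id": "i1", "content_index": 0})
```

The same pattern would apply to the audio transcript and function call argument streams, which carry delta fields in the same position.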
Streaming transcript for generated audio.
Final transcript for generated audio.
Audio output for a content part finished.
Streaming function call arguments.
Function call arguments finished.
Voice activity detected — user started speaking. Always emitted before transcripts, including on STT providers without native VAD (the server synthesizes the event at first-audio so client code can rely on the same ordering across providers).
Voice activity ended — user stopped speaking.
Buffered audio was committed as a conversation item.
Input audio buffer was cleared.
An idle timeout was triggered on the input buffer. Emitted under server_vad when no speech has been detected within turn_detection.idle_timeout_ms.
Emitted by the server VAD smart-turn detector when it predicts the user has reached an end-of-turn boundary. Clients can use this signal to drive low-latency UI cues or to pre-warm a response without waiting for the final speech_stopped commit. May be followed by input_audio_buffer.turn_suggestion_revoked if the user resumes speaking.
Emitted when the user resumes speaking after a previous turn_suggestion. Pairs with the most recent turn_suggestion sharing the same utterance_index.
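The pairing of a suggestion with its revocation could be tracked as sketched below. The utterance_index field and the name input_audio_buffer.turn_suggestion_revoked come from this page; the full name input_audio_buffer.turn_suggestion is inferred by analogy and should be treated as an assumption, as should the "prewarm" and "cancel_prewarm" action labels and the class name.

```python
SUGGESTION = "input_audio_buffer.turn_suggestion"        # assumed full name
REVOKED = "input_audio_buffer.turn_suggestion_revoked"


class TurnSuggestionTracker:
    """Track pending end-of-turn suggestions by utterance_index."""

    def __init__(self):
        self.pending = {}

    def handle(self, event):
        """Return an action label for the client, or None."""
        kind = event.get("type")
        index = event.get("utterance_index")
        if kind == SUGGESTION:
            self.pending[index] = event
            return "prewarm"
        if kind == REVOKED:
            # Only cancel work that was started for this utterance.
            if self.pending.pop(index, None) is not None:
                return "cancel_prewarm"
        return None


tracker = TurnSuggestionTracker()
first = tracker.handle({"type": SUGGESTION, "utterance_index": 3, "probability": 0.9})
second = tracker.handle({"type": REVOKED, "utterance_index": 3})
```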
Server started sending output audio.
Server stopped sending output audio.
Output audio buffer was cleared.
Streaming PCM audio chunk for a low-latency back-channel interjection (e.g. "uh-huh", "right") emitted while the user is mid-utterance. Out-of-band from the main response stream — use backchannel_id to group chunks belonging to the same interjection.
All audio for a back-channel interjection has been streamed. No teardown required — playback queues until exhausted.
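Grouping back-channel chunks by backchannel_id might look like the sketch below. The backchannel_id, delta, and phrase fields appear in the example panes, but the two event type strings used here are hypothetical placeholders (this page does not show the real names), and base64 encoding of delta is an assumption.

```python
import base64

BACKCHANNEL_DELTA = "backchannel.delta"  # hypothetical placeholder name
BACKCHANNEL_DONE = "backchannel.done"    # hypothetical placeholder name


class BackchannelBuffer:
    """Collect PCM chunks per interjection, separate from the main response."""

    def __init__(self):
        self._chunks = {}

    def handle(self, event):
        """Return (pcm_bytes, phrase) when an interjection completes, else None."""
        kind = event.get("type")
        bid = event.get("backchannel_id")
        if kind == BACKCHANNEL_DELTA:
            decoded = base64.b64decode(event["delta"])  # assumed base64 PCM
            self._chunks.setdefault(bid, bytearray()).extend(decoded)
        elif kind == BACKCHANNEL_DONE:
            return bytes(self._chunks.pop(bid, b"")), event.get("phrase")
        return None


buffer = BackchannelBuffer()
for part in (b"\x01\x02", b"\x03\x04"):
    buffer.handle({
        "type": BACKCHANNEL_DELTA,
        "backchannel_id": "bc1",
        "delta": base64.b64encode(part).decode("ascii"),
    })
result = buffer.handle({"type": BACKCHANNEL_DONE, "backchannel_id": "bc1", "phrase": "uh-huh"})
```

Keeping these chunks in their own buffer reflects the out-of-band nature of the stream: they are not mixed into the audio of the main response.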
An evaluation tick chose not to fire a back-channel. Useful for client-side telemetry; clients that don't care can ignore this event.
Reports current rate limit state.