Bidirectional streaming API for real-time speech-to-text transcription over WebSocket.
This method listens for streaming audio input and returns recognized text chunks one by one as soon as they are ready. Audio chunks are expected to be part of a single voice input. Suitable for streaming live conversations, microphone input, or other streaming audio sources.
To use the API:
1. Send a transcribe_config message first to configure the session (model, language, audio encoding, etc.).
2. Send audio_chunk messages containing raw audio bytes.
3. Receive transcription results as they become available, including both interim (partial) and final results.
4. Send end_turn to signal the end of a speaker's turn.
5. Send close_stream when done.
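The five steps above can be sketched as a small helper that produces the client-to-server frames in order. This is an illustrative Python sketch, not official client code; it assumes messages travel as JSON text frames over the WebSocket and omits the transport and authentication layers:

```python
import json

def session_messages(audio_chunks,
                     model_id="assemblyai/universal-streaming-multilingual",
                     encoding="LINEAR16", sample_rate=16000, language="en-US"):
    """Yield the ordered JSON frames for one transcription session."""
    # 1. Configure the session first.
    yield json.dumps({"transcribe_config": {
        "modelId": model_id,
        "audioEncoding": encoding,
        "sampleRateHertz": sample_rate,
        "language": language,
    }})
    # 2. Stream the audio payloads.
    for chunk in audio_chunks:
        yield json.dumps({"audio_chunk": {"content": chunk}})
    # 3. Signal the end of the speaker's turn (a no-op on some providers).
    yield json.dumps({"end_turn": {}})
    # 4. Tell the server no more audio is coming.
    yield json.dumps({"close_stream": {}})

msgs = list(session_messages(["<YOUR_AUDIO>"]))
```

A real client would interleave sending these frames with receiving transcription results, rather than precomputing the whole list.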
Example messages:

transcribe_config (client → server):
{
  "transcribe_config": {
    "modelId": "assemblyai/universal-streaming-multilingual",
    "audioEncoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "language": "en-US"
  }
}

audio_chunk (client → server):
{
  "audio_chunk": {
    "content": "<YOUR_AUDIO>"
  }
}

end_turn (client → server):
{
  "end_turn": {}
}

close_stream (client → server):
{
  "close_stream": {}
}

transcription result (server → client):
{
  "result": {
    "transcription": {
      "transcript": "Hello, this is a test transcription.",
      "isFinal": true,
      "wordTimestamps": []
    }
  }
}

speechStarted (server → client):
{
  "result": {
    "speechStarted": {
      "startTimeMs": 1250,
      "confidence": 0.95
    }
  }
}
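As a client-side illustration of the audio_chunk message, the sketch below slices a raw PCM buffer into fixed-duration chunks. The base64 encoding of the content field is an assumption for the JSON transport (the example above shows only an opaque <YOUR_AUDIO> placeholder); check the official client SDKs for the exact byte encoding:

```python
import base64
import json

def audio_chunk_frames(pcm_bytes, chunk_ms=100, sample_rate=16000):
    """Split raw 16-bit mono LINEAR16 PCM into ~chunk_ms slices and wrap
    each slice in an audio_chunk message.

    Assumption: audio bytes are base64-encoded into "content" for the
    JSON transport; this is not confirmed by the examples above."""
    bytes_per_chunk = sample_rate * 2 * chunk_ms // 1000  # 16-bit mono = 2 bytes/sample
    for start in range(0, len(pcm_bytes), bytes_per_chunk):
        payload = base64.b64encode(pcm_bytes[start:start + bytes_per_chunk]).decode("ascii")
        yield json.dumps({"audio_chunk": {"content": payload}})

frames = list(audio_chunk_frames(b"\x00" * 32000))  # 1 s of silence at 16 kHz
```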
Your authentication credentials. For Basic authentication, populate this with Basic $INWORLD_API_KEY.
transcribe_config: Configure the transcription session. Must be the first message sent. Contains model selection, audio format settings, and optional feature configurations.
audio_chunk: Send a chunk of audio data for transcription. Must be sent after the initial transcribe_config message.
end_turn: Signal the end of a speaker's turn. Some providers do not support manual turn-taking; for those providers, sending this message has no effect.
close_stream: Signal that the client is done sending audio data. Required for HTTP/WebSocket clients since there is no equivalent to gRPC stream close.
transcription: Transcription result streamed back as audio is processed. May be an interim (partial) result or a final result, depending on the isFinal field.
Usage metrics for billing and monitoring purposes. Coming soon; this field is not yet populated.
speechStarted: Signal indicating the start of a speaker's speech. Sent when voice activity is detected in the audio stream.
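A receiver can distinguish the result variants described above by which field is present inside result. A minimal sketch, assuming result frames arrive as JSON text and each carries exactly one variant:

```python
import json

def route_result(frame):
    """Classify one server frame by its result variant, per the fields above."""
    result = json.loads(frame).get("result", {})
    if "transcription" in result:
        t = result["transcription"]
        kind = "final" if t.get("isFinal") else "interim"
        return (kind, t["transcript"])
    if "speechStarted" in result:
        return ("speech_started", result["speechStarted"]["startTimeMs"])
    return ("other", None)  # e.g. future usage-metrics frames

events = [route_result(f) for f in (
    '{"result": {"speechStarted": {"startTimeMs": 1250, "confidence": 0.95}}}',
    '{"result": {"transcription": {"transcript": "Hello, this is a test transcription.", '
    '"isFinal": true, "wordTimestamps": []}}}',
)]
```

In practice, interim results are typically rendered live and replaced when the final result for the same turn arrives.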