How it works
With inworld/inworld-stt-1 streaming, turn detection runs by default, with no configuration required:
- As you stream audio, the server returns interim (partial) transcription results.
- When the server detects end-of-turn (for example, a sustained pause), it finalizes the transcript for that turn (isFinal: true).
- Speech after the turn boundary starts a new transcript.
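As a rough illustration of the flow above, the sketch below folds a stream of results into per-turn transcripts. Only isFinal comes from this page. The text key, the dict shape, and the idea that each interim result replaces the previous one are assumptions made for the example; the WebSocket API Reference has the real schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TurnAssembler:
    """Collects interim and final results into per-turn transcripts.

    Assumes each result is a dict with a hypothetical "text" key plus the
    "isFinal" flag described above.
    """

    finished_turns: List[str] = field(default_factory=list)
    current_partial: str = ""

    def on_result(self, result: dict) -> None:
        text = result.get("text", "")  # "text" is an assumed field name
        if result.get("isFinal", False):
            # End-of-turn: the transcript for this turn is final.
            self.finished_turns.append(text)
            self.current_partial = ""
        else:
            # Interim result: assumed to replace the previous partial.
            self.current_partial = text


assembler = TurnAssembler()
for result in [
    {"text": "book a", "isFinal": False},
    {"text": "book a table", "isFinal": False},
    {"text": "Book a table for two.", "isFinal": True},
    {"text": "at seven", "isFinal": False},  # speech after the boundary
]:
    assembler.on_result(result)
```

After the loop, the first turn is stored in finished_turns and the partial for the next turn is held separately, which mirrors how speech after a turn boundary starts a new transcript.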
| Event | Meaning |
|---|---|
| speechStarted | Voice activity detected in the audio stream |
| speechStopped | Silence detected after speech has stopped |
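One possible use of these events is tracking whether the user is currently speaking, for example to pause agent playback on barge-in. The sketch below assumes each event arrives as a dict whose event name sits under a hypothetical "type" key; only the names speechStarted and speechStopped come from the table above.

```python
from typing import Iterable


def is_user_speaking(events: Iterable[dict], initially: bool = False) -> bool:
    """Returns the speaking state after replaying the given events in order."""
    speaking = initially
    for event in events:
        kind = event.get("type")  # "type" is an assumed envelope key
        if kind == "speechStarted":
            speaking = True
        elif kind == "speechStopped":
            speaking = False
        # Any other event leaves the state unchanged.
    return speaking


state = is_user_speaking([
    {"type": "speechStarted"},
    {"type": "speechStopped"},
    {"type": "speechStarted"},
])
```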
Tuning automatic turn detection
Adjust sensitivity via transcribeConfig in the first WebSocket message:
| Field | Type | Default | Description |
|---|---|---|---|
| endOfTurnConfidenceThreshold | float | 0.5 | Confidence required to declare end-of-turn. Higher values reduce false positives (fewer premature turn splits) at the cost of slower turn detection. Range: 0.0–1.0. |
| inworldSttV1Config.minEndOfTurnSilenceWhenConfident | integer (ms) | — | Minimum silence duration before finalizing a turn when confidence is high. |
| inworldSttV1Config.vadThreshold | float | 0.5 | Voice activity detection threshold. Range: 0.0–1.0. |
| inactivityTimeoutSeconds | integer | — | Stops transcription if the client is silent for this duration. |
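A minimal sketch of assembling these fields, assuming the dotted names in the table mean nesting under inworldSttV1Config and that the first message wraps everything in a top-level transcribeConfig key. The helper name build_transcribe_config, the exact envelope, and the sample values (0.7 and 400 ms) are illustrative assumptions, not documented defaults.

```python
import json
from typing import Optional


def build_transcribe_config(
    end_of_turn_confidence: float = 0.5,
    vad_threshold: float = 0.5,
    min_silence_ms: Optional[int] = None,
    inactivity_timeout_s: Optional[int] = None,
) -> dict:
    """Builds a transcribeConfig dict, validating the documented 0.0-1.0 ranges."""
    for name, value in (
        ("endOfTurnConfidenceThreshold", end_of_turn_confidence),
        ("vadThreshold", vad_threshold),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")

    # Assumed nesting: dotted field names map to a nested object.
    stt_config = {"vadThreshold": vad_threshold}
    if min_silence_ms is not None:
        stt_config["minEndOfTurnSilenceWhenConfident"] = min_silence_ms

    config = {
        "endOfTurnConfidenceThreshold": end_of_turn_confidence,
        "inworldSttV1Config": stt_config,
    }
    if inactivity_timeout_s is not None:
        config["inactivityTimeoutSeconds"] = inactivity_timeout_s
    return config


# Stricter end-of-turn detection for a noisy room (illustrative values).
config = build_transcribe_config(end_of_turn_confidence=0.7, min_silence_ms=400)
first_message = json.dumps({"transcribeConfig": config})  # assumed envelope
```

Optional fields are left out unless set, so the server-side defaults still apply to anything the client does not override.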
Turn-detection tuning fields are also available for AssemblyAI models via assemblyaiConfig (minEndOfTurnSilenceWhenConfident, maxTurnSilence, vadThreshold). Turn-detection behavior for third-party models follows the capabilities of each provider.
Manual turn control
To hand turn control fully to the client, disable server-side voice activity detection by setting vadThreshold to 0:
- Send an endTurn message at the end of each speaker turn to finalize the transcript.
- Send closeStream when you are done sending audio.
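The message order for manual turn control can be sketched as a generator that yields what a client would send after its initial configuration message. The JSON shapes used for endTurn and closeStream, and sending audio as raw binary chunks, are assumptions for illustration; only the message names and their order come from the steps above.

```python
import json
from typing import Iterable, Iterator, Union


def manual_turn_messages(
    turns: Iterable[Iterable[bytes]],
) -> Iterator[Union[bytes, str]]:
    """Yields outgoing messages: audio chunks, endTurn per turn, then closeStream."""
    for chunks in turns:
        for chunk in chunks:
            yield chunk  # audio payload; binary framing is assumed
        # Finalize the transcript for this speaker turn (hypothetical envelope).
        yield json.dumps({"endTurn": {}})
    # No more audio will follow (hypothetical envelope).
    yield json.dumps({"closeStream": {}})


# Two push-to-talk turns with placeholder audio bytes.
messages = list(manual_turn_messages([[b"a", b"b"], [b"c"]]))
```

Keeping the sequencing in a pure generator separates turn logic from transport, so the same function can feed whichever WebSocket client the application already uses.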
With manual turn control, a single turn has a maximum length (currently around 30 seconds; subject to change). Send endTurn regularly at natural turn boundaries rather than relying on the limit.
Choosing a mode
| Mode | When to use |
|---|---|
| Automatic (default) | Voice agents and live transcription where the server should decide when the speaker is done |
| Automatic, tuned | Environments with background noise, slow speakers, or domain-specific pacing — adjust thresholds to reduce premature or delayed turn splits |
| Manual (vadThreshold: 0) | Push-to-talk UIs, client-side VAD, or applications with their own turn-taking logic |
Next steps
WebSocket API Reference
Full message and configuration schema for the streaming endpoint.
Developer Quickstart
Make your first STT API call and get a transcript.