Turn detection identifies when a speaker has finished talking — the core signal a voice agent needs to know when to respond. The STT streaming API supports turn detection out of the box: the server detects end-of-turn automatically, and you can tune its sensitivity or take full manual control. Turn detection is available on the WebSocket streaming endpoint. Sync (file upload) transcription processes complete audio files, so turn detection does not apply.

How it works

With inworld/inworld-stt-1 streaming, turn detection runs by default — no configuration required:
  1. As you stream audio, the server returns interim (partial) transcription results.
  2. When the server detects end-of-turn (for example, a sustained pause), it finalizes the transcript for that turn (isFinal: true).
  3. Speech after the turn boundary starts a new transcript.
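The three steps above can be sketched as a small client-side accumulator. This is a minimal sketch under stated assumptions, not the documented message schema: it assumes each server result has already been parsed into a dict with a `transcript` string and an `isFinal` boolean, and that each interim result replaces the previous one. Only `isFinal` appears on this page; the `transcript` key and the `TurnAccumulator` class are hypothetical names used for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TurnAccumulator:
    """Collects finalized turns and tracks the latest interim text."""

    turns: list[str] = field(default_factory=list)
    interim: str = ""

    def handle_result(self, result: dict) -> None:
        # Assumed message shape (hypothetical): {"transcript": str, "isFinal": bool}.
        text = result.get("transcript", "")
        if result.get("isFinal", False):
            # End-of-turn: store the final text and clear the interim buffer,
            # because speech after the boundary starts a new transcript.
            self.turns.append(text)
            self.interim = ""
        else:
            # Assumption: an interim result replaces the previous partial.
            self.interim = text
```

Feeding it one interim result, one final result, and another interim result leaves one entry in `turns` and the newest partial in `interim`.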
With default settings, a sustained mid-utterance pause (on the order of a couple of seconds) is enough to split the transcript into two final results. The exact pause duration is not fixed: it depends on the end-of-turn model’s confidence and can be tuned with the thresholds below.

The server also emits voice-activity events you can use to drive application behavior (e.g., interrupt playback when the user starts speaking):

| Event | Meaning |
| --- | --- |
| speechStarted | Voice activity detected in the audio stream |
| speechStopped | Silence detected after speech has stopped |
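The playback-interruption use case mentioned above might look like the following. It is a sketch only: the event names come from the table, but how an event arrives on the wire is not shown on this page, so the code assumes the caller has already extracted the event name as a string. `PlaybackController` is a hypothetical class, and stopping real audio output is left as a comment.

```python
class PlaybackController:
    """Tracks agent playback state and interrupts it when the user speaks."""

    def __init__(self) -> None:
        self.playing = False
        self.user_speaking = False

    def start_playback(self) -> None:
        self.playing = True

    def handle_event(self, event_name: str) -> bool:
        """Apply one voice-activity event; return True if playback was interrupted."""
        if event_name == "speechStarted":
            self.user_speaking = True
            if self.playing:
                # In a real application, stop the audio output device here.
                self.playing = False
                return True
        elif event_name == "speechStopped":
            self.user_speaking = False
        return False
```

Returning a boolean lets the caller decide what else to do on an interruption, such as cancelling a pending response.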

Tuning automatic turn detection

Adjust sensitivity via transcribeConfig in the first WebSocket message:
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| endOfTurnConfidenceThreshold | float | 0.5 | Confidence required to declare end-of-turn. Higher values reduce false positives (fewer premature turn splits) at the cost of slower turn detection. Range: 0.0–1.0 |
| inworldSttV1Config.minEndOfTurnSilenceWhenConfident | integer (ms) | | Minimum silence duration before finalizing a turn when confidence is high |
| inworldSttV1Config.vadThreshold | float | 0.5 | Voice activity detection threshold. Range: 0.0–1.0 |
| inactivityTimeoutSeconds | integer | | Stops transcription if the client is silent for this duration |
{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16",
    "endOfTurnConfidenceThreshold": 0.7,
    "inworldSttV1Config": {
      "minEndOfTurnSilenceWhenConfident": 800
    }
  }
}
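A small helper can build this first message and reject out-of-range values before they reach the server. The sketch below uses the field names and the 0.0–1.0 ranges from the table above. The function name, its parameters, and the non-negative check on the silence duration are assumptions added for illustration.

```python
import json
from typing import Optional


def build_transcribe_config(
    end_of_turn_confidence: Optional[float] = None,
    min_silence_ms: Optional[int] = None,
    vad_threshold: Optional[float] = None,
) -> dict:
    """Build the first WebSocket message, omitting fields left unset."""
    config: dict = {
        "modelId": "inworld/inworld-stt-1",
        "audioEncoding": "LINEAR16",
    }
    model_config: dict = {}

    if end_of_turn_confidence is not None:
        if not 0.0 <= end_of_turn_confidence <= 1.0:
            raise ValueError("endOfTurnConfidenceThreshold must be in 0.0-1.0")
        config["endOfTurnConfidenceThreshold"] = end_of_turn_confidence
    if min_silence_ms is not None:
        # Assumption: a negative duration is never meaningful.
        if min_silence_ms < 0:
            raise ValueError("minEndOfTurnSilenceWhenConfident must be >= 0")
        model_config["minEndOfTurnSilenceWhenConfident"] = min_silence_ms
    if vad_threshold is not None:
        if not 0.0 <= vad_threshold <= 1.0:
            raise ValueError("vadThreshold must be in 0.0-1.0")
        model_config["vadThreshold"] = vad_threshold

    if model_config:
        config["inworldSttV1Config"] = model_config
    return {"transcribeConfig": config}


# Reproduces the JSON example above as a string ready to send.
first_message = json.dumps(build_transcribe_config(0.7, 800))
```

Leaving a parameter as `None` omits the field, so the server-side default applies.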
Turn-detection tuning fields are also available for AssemblyAI models via assemblyaiConfig (minEndOfTurnSilenceWhenConfident, maxTurnSilence, vadThreshold). Turn-detection behavior for third-party models follows the capabilities of each provider.

Manual turn control

To hand turn control fully to the client, disable server-side voice activity detection by setting vadThreshold to 0:
{
  "transcribeConfig": {
    "modelId": "inworld/inworld-stt-1",
    "audioEncoding": "LINEAR16",
    "inworldSttV1Config": {
      "vadThreshold": 0
    }
  }
}
With VAD disabled, the server no longer splits turns automatically. Signal turn boundaries yourself:
  • Send an endTurn message at the end of each speaker turn to finalize the transcript.
  • Send closeStream when you are done sending audio.
With manual turn control, a single turn has a maximum length (currently around 30 seconds; subject to change). Send endTurn regularly at natural turn boundaries rather than relying on the limit.
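One way to avoid running into the turn-length limit is to track how much audio has been sent in the current turn and treat a threshold as a fallback, while still sending endTurn at natural boundaries. The sketch below rests on several assumptions that this page does not confirm: the JSON shapes of the `endTurn` and `closeStream` messages are hypothetical, the audio is assumed to be mono LINEAR16 (2 bytes per sample) at 16 kHz, and the 25-second margin is an arbitrary choice below the approximate 30-second limit.

```python
import json

MAX_TURN_SECONDS = 25.0  # Assumed safety margin below the ~30 s limit.


class ManualTurnTracker:
    """Tracks audio sent in the current turn under manual turn control."""

    def __init__(self, max_turn_seconds: float = MAX_TURN_SECONDS) -> None:
        self.max_turn_seconds = max_turn_seconds
        self.turn_seconds = 0.0

    def add_audio(self, num_bytes: int, sample_rate_hz: int = 16000) -> bool:
        """Record one sent chunk; return True once endTurn is overdue."""
        # LINEAR16 uses 2 bytes per sample; mono audio is assumed.
        self.turn_seconds += num_bytes / (2 * sample_rate_hz)
        return self.turn_seconds >= self.max_turn_seconds

    def end_turn_message(self) -> str:
        """Reset the turn timer and return the message that finalizes the turn."""
        self.turn_seconds = 0.0
        return json.dumps({"endTurn": {}})  # Hypothetical message shape.


def close_stream_message() -> str:
    """Return the message sent once no more audio will follow."""
    return json.dumps({"closeStream": {}})  # Hypothetical message shape.
```

In a push-to-talk UI, `end_turn_message()` would typically be sent when the user releases the button, with the `add_audio` return value acting only as a guard for unusually long turns.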

Choosing a mode

| Mode | When to use |
| --- | --- |
| Automatic (default) | Voice agents and live transcription where the server should decide when the speaker is done |
| Automatic, tuned | Environments with background noise, slow speakers, or domain-specific pacing; adjust thresholds to reduce premature or delayed turn splits |
| Manual (vadThreshold: 0) | Push-to-talk UIs, client-side VAD, or applications with their own turn-taking logic |

Next steps

  • WebSocket API Reference: Full message and configuration schema for the streaming endpoint.
  • Developer Quickstart: Make your first STT API call and get a transcript.