The Realtime Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers, without managing multiple SDKs or credentials. The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio.
Developer Quickstart
Make your first STT API call and get a transcript.
API Reference
View the complete API specification.
Code Examples
Browse ready-to-use GitHub samples for sync and real-time STT.
Supported Providers
Inworld (first-party) — Experimental
| Model ID | Endpoints | Best for |
|---|---|---|
| inworld/inworld-stt-1 | Sync API + WebSocket | Voice agents and character-driven apps that benefit from transcription plus Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking |
The Inworld first-party model is currently Experimental. Features and pricing are subject to change.
Groq
| Model ID | Endpoints | Best for |
|---|---|---|
| groq/whisper-large-v3 | Sync API only | General-purpose transcription for recorded audio |
AssemblyAI
| Model ID | Endpoints | Best for |
|---|---|---|
| assemblyai/universal-streaming-multilingual | WebSocket only | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) |
| assemblyai/universal-streaming-english | WebSocket only | English-optimized streaming |
| assemblyai/u3-rt-pro | WebSocket only | High-accuracy, sub-300ms latency, multilingual streaming (English, Spanish, French, German, Italian, Portuguese) |
| assemblyai/whisper-rt | WebSocket only | Real-time Whisper transcription |
AssemblyAI models currently support the WebSocket streaming endpoint only. Sync HTTP support is coming soon.
Soniox
| Model ID | Endpoints | Best for |
|---|---|---|
| soniox/stt-rt-v4 | WebSocket only | High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support |
Soniox models currently support the WebSocket streaming endpoint only.
Model comparison
| Feature | inworld/inworld-stt-1 | groq/whisper-large-v3 | assemblyai/universal-streaming-multilingual | assemblyai/universal-streaming-english | assemblyai/u3-rt-pro | assemblyai/whisper-rt | soniox/stt-rt-v4 |
|---|---|---|---|---|---|---|---|
| Pricing | See pricing | See pricing | See pricing | See pricing | See pricing | See pricing | See pricing |
| Endpoint | Sync API + WebSocket | Sync API only | WebSocket only | WebSocket only | WebSocket only | WebSocket only | WebSocket only |
| Real-time streaming | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Best for | Voice agents with Voice Profile and configurable turn-taking | General-purpose transcription for recorded audio | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | English-optimized streaming | High-accuracy, sub-300ms multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | Real-time Whisper transcription | High-accuracy real-time streaming with semantic end-of-turn detection and multilingual support |
| Languages | English; 29 Experimental (see below) | 100+ (Whisper) | 6 languages | English | 6 languages | 100+ (Whisper) | Multilingual |
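The endpoint support in the tables above can be captured as a small lookup so that a mismatch (for example, a sync-only model used for streaming) is caught before any request is sent. This is a minimal sketch: the model IDs and endpoint support come from the tables, while the `SYNC`/`WEBSOCKET` labels and the helper names are hypothetical.

```python
# Endpoint support per model, transcribed from the tables above.
# The SYNC/WEBSOCKET labels and helper names are illustrative, not part of the API.
SYNC = "sync"
WEBSOCKET = "websocket"

MODEL_ENDPOINTS = {
    "inworld/inworld-stt-1": {SYNC, WEBSOCKET},
    "groq/whisper-large-v3": {SYNC},
    "assemblyai/universal-streaming-multilingual": {WEBSOCKET},
    "assemblyai/universal-streaming-english": {WEBSOCKET},
    "assemblyai/u3-rt-pro": {WEBSOCKET},
    "assemblyai/whisper-rt": {WEBSOCKET},
    "soniox/stt-rt-v4": {WEBSOCKET},
}


def models_for(endpoint: str) -> list:
    """Return the model IDs that support the given endpoint kind, sorted."""
    return sorted(m for m, eps in MODEL_ENDPOINTS.items() if endpoint in eps)


def check_model(model_id: str, endpoint: str) -> None:
    """Raise ValueError if the model is unknown or lacks the endpoint."""
    if model_id not in MODEL_ENDPOINTS:
        raise ValueError(f"unknown model ID: {model_id}")
    if endpoint not in MODEL_ENDPOINTS[model_id]:
        raise ValueError(f"{model_id} does not support the {endpoint} endpoint")
```

Calling `check_model("groq/whisper-large-v3", WEBSOCKET)` raises, since that model is listed as Sync API only.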
Supported Audio Formats
| Format | Sync API | WebSocket Streaming |
|---|---|---|
| LINEAR16 (PCM) | | |
| MP3 | | |
| OGG_OPUS | | |
| FLAC | | |
| AUTO_DETECT | | |
sampleRateHertz is optional — the API auto-detects it from the file header.
STT performs best with 16 kHz audio. Lower sample rates (such as 8 kHz telephony audio) contain fewer data points for the model to interpret, which reduces transcription accuracy. Upsampling low-sample-rate audio does not improve quality — it only interpolates between existing samples without adding new information.
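One way to act on the sample-rate guidance is to read the rate from the file header before uploading and flag anything below 16 kHz. The sketch below uses only Python's standard `wave` module and assumes the input is a WAV file. The helper names are hypothetical, and `make_silent_wav` exists only to produce demo input.

```python
import io
import wave

RECOMMENDED_RATE_HZ = 16_000  # from the guidance above


def wav_sample_rate(data: bytes) -> int:
    """Read the sample rate (Hz) from a WAV file header."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getframerate()


def is_below_recommended(data: bytes) -> bool:
    """True if the WAV audio was recorded below the recommended 16 kHz."""
    return wav_sample_rate(data) < RECOMMENDED_RATE_HZ


def make_silent_wav(rate_hz: int, seconds: float = 0.1) -> bytes:
    """Build a small mono 16-bit WAV of silence, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00\x00" * int(rate_hz * seconds))
    return buf.getvalue()


if __name__ == "__main__":
    telephony = make_silent_wav(8_000)
    print(wav_sample_rate(telephony), is_below_recommended(telephony))
```

A check like this only warns. As noted above, resampling 8 kHz audio up to 16 kHz would not recover the missing detail.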
Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /stt/v1/transcribe | POST | Send complete audio, receive full transcript |
| /stt/v1/transcribe:streamBidirectional | WebSocket | Stream audio in real time, receive transcription chunks as they become available |
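As a rough illustration of preparing a request for the sync endpoint, the sketch below assembles a JSON body and a URL without sending anything. Only the path, the model IDs, and the names audioEncoding and language appear on this page. The keys `modelId` and `audioContent`, the flat layout of the body, base64 encoding of the audio, and the `base_url` argument are all assumptions, so check the API Reference for the actual schema and authentication scheme.

```python
import base64
import json
from typing import Optional

TRANSCRIBE_PATH = "/stt/v1/transcribe"  # from the table above


def build_transcribe_body(audio: bytes, model_id: str, audio_encoding: str,
                          language: Optional[str] = None) -> dict:
    """Assemble a JSON-serializable request body.

    Key names and nesting are assumptions for illustration only.
    """
    body = {
        "modelId": model_id,              # hypothetical key name
        "audioEncoding": audio_encoding,  # name from this page; placement assumed
        # hypothetical key name; base64 is an assumed transport encoding
        "audioContent": base64.b64encode(audio).decode("ascii"),
    }
    if language is not None:
        body["language"] = language       # omit to allow auto-detection
    return body


def transcribe_url(base_url: str) -> str:
    """Join a caller-supplied base URL (not specified here) with the sync path."""
    return base_url.rstrip("/") + TRANSCRIBE_PATH


if __name__ == "__main__":
    body = build_transcribe_body(b"\x00\x01", "groq/whisper-large-v3", "MP3")
    print(json.dumps(body, indent=2))
```

Keeping body construction separate from the HTTP call makes it easy to swap in the real field names once confirmed against the API specification.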
Supported Languages
Language support depends on the STT provider. See Model comparison above for more details.
Inworld first-party model (inworld/inworld-stt-1)
Available:
- English (en)
Experimental:
- Spanish (es)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Dutch (nl)
- Russian (ru)
- Chinese (zh)
- Japanese (ja)
- Korean (ko)
- Arabic (ar)
- Hindi (hi)
- Turkish (tr)
- Polish (pl)
- Swedish (sv)
- Cantonese (yue)
- Indonesian (id)
- Thai (th)
- Vietnamese (vi)
- Malay (ms)
- Danish (da)
- Finnish (fi)
- Czech (cs)
- Filipino (fil)
- Persian (fa)
- Greek (el)
- Hungarian (hu)
- Macedonian (mk)
- Romanian (ro)
Set language when you want to force recognition of a known language. Omit language to allow auto-detection when supported.
Error Handling
Errors follow the standard gRPC status format.
| Code | Name | Description |
|---|---|---|
| 3 | INVALID_ARGUMENT | Invalid or missing request field (encoding, model ID, audio data) |
| 8 | RESOURCE_EXHAUSTED | Too many concurrent requests (rate limit) |
| 16 | UNAUTHENTICATED | Invalid or missing API key |
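A client might branch on these codes roughly as follows. The sketch assumes the error payload has already been parsed into a dict with `code` and `message` fields, as in the standard gRPC status message. How that payload is wrapped in the HTTP response is not specified here. Retrying only RESOURCE_EXHAUSTED, and the backoff parameters, are illustrative policy choices rather than documented behavior.

```python
import random

# Codes and names from the table above.
STATUS_NAMES = {3: "INVALID_ARGUMENT", 8: "RESOURCE_EXHAUSTED", 16: "UNAUTHENTICATED"}

# Assumption: only rate limiting is worth retrying; the other two need a fix first.
RETRYABLE_CODES = {8}


def should_retry(status: dict) -> bool:
    """True if the parsed status dict carries a code we treat as retryable."""
    return status.get("code") in RETRYABLE_CODES


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Exponential backoff with full jitter, in seconds; attempt starts at 0."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def describe(status: dict) -> str:
    """Format a parsed status dict for logging."""
    code = status.get("code")
    name = STATUS_NAMES.get(code, "UNLISTED")
    return f"{name} ({code}): {status.get('message', '')}"
```

INVALID_ARGUMENT and UNAUTHENTICATED are left out of the retry set because resending the same request or key would fail the same way.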
Best Practices
- Model choice: Use inworld/inworld-stt-1 when you want Voice Profile or Inworld-optimized turn-taking; use Groq/AssemblyAI/Soniox for specific latency/accuracy needs.
- Audio: Use MP3/OGG_OPUS for file uploads to reduce size; use LINEAR16 for streaming (required) and when you need highest quality.
- Streaming: For Inworld model with manual turn-taking, send EndTurn at each turn boundary and CloseStream when done.
- Voice Profile: Set voiceProfileConfig.enableVoiceProfile to true and optionally adjust topN (default: 10) to control how many labels per category are returned.
- Test with sample audio and your target language before production.
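The ordering described in the Streaming item (audio for a turn, then EndTurn, and CloseStream once at the end) can be sketched without any network code. The generator below splits mono 16-bit LINEAR16 audio into fixed-duration chunks and emits abstract `(kind, payload)` pairs. The tuple format, the 100 ms chunk size, and the function names are hypothetical. The real WebSocket message schema is defined in the API Reference.

```python
from typing import Iterator, List, Tuple


def chunk_pcm(pcm: bytes, sample_rate_hz: int = 16_000, chunk_ms: int = 100,
              bytes_per_sample: int = 2) -> Iterator[bytes]:
    """Split mono LINEAR16 audio into fixed-duration chunks (last one may be short)."""
    step = sample_rate_hz * bytes_per_sample * chunk_ms // 1000
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


def stream_events(turns: List[bytes]) -> Iterator[Tuple[str, bytes]]:
    """Yield audio chunks per turn, EndTurn after each turn, then CloseStream.

    The (kind, payload) tuples are placeholders for real protocol messages.
    """
    for pcm in turns:
        for chunk in chunk_pcm(pcm):
            yield ("audio", chunk)
        yield ("EndTurn", b"")
    yield ("CloseStream", b"")


if __name__ == "__main__":
    two_turns = [b"\x00" * 6400, b"\x00" * 3200]
    print([kind for kind, _ in stream_events(two_turns)])
```

With the defaults, each full chunk is 3,200 bytes (16,000 samples per second × 2 bytes × 0.1 s).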
Troubleshooting
| Issue | What to check |
|---|---|
| No transcript | API key, audio encoding matches request, valid audio file |
| UNAUTHENTICATED | INWORLD_API_KEY set correctly and not expired in Portal |
| INVALID_ARGUMENT | audioEncoding matches the actual format (LINEAR16 for raw PCM, MP3 for MP3, etc.) |
| Poor quality | Try a higher-accuracy model; use 16 kHz sample rate (8 kHz telephony audio has fewer data points and will produce lower-quality results); ensure clear speech |
| Large file failures | Split or compress (e.g. MP3/OGG_OPUS); respect upload size limits |
| No Voice Profile | Ensure voiceProfileConfig.enableVoiceProfile is set to true in your request; response may also omit it if the selected model does not support it |
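For the encoding-mismatch rows above, a quick local check of the file's leading bytes can confirm what the audio actually is before choosing audioEncoding. This is a heuristic sketch based on well-known container signatures, and the function name and return labels are hypothetical. Raw LINEAR16 has no header, so it cannot be positively identified this way and falls through to `UNKNOWN`.

```python
def sniff_audio_format(data: bytes) -> str:
    """Guess the container/codec from leading bytes. Heuristic, not exhaustive."""
    if data[:4] == b"fLaC":
        return "FLAC"
    if data[:4] == b"OggS":
        # An Ogg Opus stream carries an "OpusHead" packet in its first page.
        return "OGG_OPUS" if b"OpusHead" in data[:64] else "OGG"
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "WAV"
    if data[:3] == b"ID3":
        return "MP3"  # MP3 file starting with an ID3v2 tag
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "MP3"  # MPEG audio frame sync (11 set bits)
    return "UNKNOWN"  # possibly headerless raw PCM


if __name__ == "__main__":
    print(sniff_audio_format(b"ID3\x04\x00\x00\x00\x00\x00\x00"))
```

If the sniffed format disagrees with the audioEncoding in the request, that mismatch is a likely cause of INVALID_ARGUMENT or an empty transcript.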