Inworld’s Speech-to-Text (STT) API provides a unified integration point for industry-leading transcription providers. You get consistent authentication, request formatting, and response handling across providers — without managing multiple SDKs or credentials. The API supports both synchronous transcription for complete audio files and real-time bidirectional streaming over WebSocket for live audio.

Supported Providers

Groq

| Model ID | Endpoints | Best for |
| --- | --- | --- |
| groq/whisper-large-v3 | Sync API only | General-purpose transcription for recorded audio |

AssemblyAI

| Model ID | Endpoints | Best for |
| --- | --- | --- |
| assemblyai/universal-streaming-multilingual | WebSocket only | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) |
| assemblyai/universal-streaming-english | WebSocket only | English-optimized streaming |

AssemblyAI models currently support the WebSocket streaming endpoint only. Sync HTTP support is coming soon.
For pricing details, see inworld.ai/pricing.

Features

| Feature | groq/whisper-large-v3 | assemblyai/universal-streaming-multilingual | assemblyai/universal-streaming-english |
| --- | --- | --- | --- |
| Pricing | $0.111/hour | $0.15/hour | $0.15/hour |
| Endpoint | Sync API only | WebSocket only | WebSocket only |
| Real-time streaming | No | Yes | Yes |
| Best for | General-purpose transcription for recorded audio | Multilingual streaming (English, Spanish, French, German, Italian, Portuguese) | English-optimized streaming |
| Languages | 100+ (Whisper) | 6 languages | English |
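
The feature table can be read as a small lookup. The sketch below is one way to pick a model ID and estimate cost from it. The helper names (`pick_model`, `estimate_cost_usd`) and the two-letter language codes are assumptions made for illustration and are not part of the API, and the cost estimate assumes audio is billed pro rata by duration, which the table does not state.

```python
# Figures copied from the feature table. The two-letter language codes are
# shorthand chosen for this sketch, not identifiers defined by the API.
MODELS = {
    "groq/whisper-large-v3": {
        "streaming": False,
        "usd_per_hour": 0.111,
        "languages": None,  # 100+ Whisper languages, not enumerated here
    },
    "assemblyai/universal-streaming-multilingual": {
        "streaming": True,
        "usd_per_hour": 0.15,
        "languages": {"en", "es", "fr", "de", "it", "pt"},
    },
    "assemblyai/universal-streaming-english": {
        "streaming": True,
        "usd_per_hour": 0.15,
        "languages": {"en"},
    },
}


def pick_model(streaming: bool, language: str) -> str:
    """Return a model ID that matches the endpoint type and language."""
    if not streaming:
        # Only the Groq model is listed for the sync API. Its language
        # coverage is not enumerated, so no language check is made here.
        return "groq/whisper-large-v3"
    if language == "en":
        return "assemblyai/universal-streaming-english"
    multilingual = "assemblyai/universal-streaming-multilingual"
    if language in MODELS[multilingual]["languages"]:
        return multilingual
    raise ValueError(f"no streaming model listed for language {language!r}")


def estimate_cost_usd(model_id: str, audio_seconds: float) -> float:
    """Estimate cost, assuming pro-rata billing by audio duration."""
    return MODELS[model_id]["usd_per_hour"] * audio_seconds / 3600.0
```

Under the pro-rata assumption, a 30-minute stream on either AssemblyAI model comes to about $0.075. Check inworld.ai/pricing for the actual billing granularity.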

Supported Audio Formats

| Format | Sync API | WebSocket Streaming |
| --- | --- | --- |
| LINEAR16 (PCM) | | |
| MP3 | | |
| OGG_OPUS | | |
| FLAC | | |
| AUTO_DETECT | | |
Recommended defaults: 16,000 Hz sample rate, 16-bit depth, mono. For container formats (MP3, FLAC, OGG_OPUS), sampleRateHertz is optional — the API auto-detects it from the file header.
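
As a rough illustration of the rule above, the following sketch builds an audio configuration that omits `sampleRateHertz` for container formats unless the caller supplies one. Only `sampleRateHertz` and the format names come from this page. The `audioEncoding` field name and both helper functions are hypothetical, and falling back to 16,000 Hz for raw PCM is a choice made for this sketch, not documented API behavior.

```python
from typing import Optional

# Container formats carry the sample rate in their file header.
CONTAINER_FORMATS = {"MP3", "FLAC", "OGG_OPUS"}

# Recommended defaults from the paragraph above.
DEFAULT_SAMPLE_RATE_HZ = 16_000
DEFAULT_BYTES_PER_SAMPLE = 2  # 16-bit depth
DEFAULT_CHANNELS = 1          # mono


def build_audio_config(encoding: str,
                       sample_rate_hertz: Optional[int] = None) -> dict:
    """Build a config dict; "audioEncoding" is a hypothetical field name."""
    config = {"audioEncoding": encoding}
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz
    elif encoding == "LINEAR16":
        # Raw PCM has no header, so this sketch falls back to the
        # recommended default instead of leaving the rate unspecified.
        config["sampleRateHertz"] = DEFAULT_SAMPLE_RATE_HZ
    # For container formats the field is simply left out.
    return config


def pcm_duration_seconds(num_bytes: int,
                         sample_rate_hertz: int = DEFAULT_SAMPLE_RATE_HZ) -> float:
    """Duration of raw 16-bit mono PCM audio of the given byte length."""
    bytes_per_second = sample_rate_hertz * DEFAULT_BYTES_PER_SAMPLE * DEFAULT_CHANNELS
    return num_bytes / bytes_per_second
```

At the recommended defaults, raw PCM runs at 32,000 bytes per second, so `pcm_duration_seconds(32000)` is one second of audio.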

Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /stt/v1/transcribe | POST | Send complete audio, receive full transcript |
| /stt/v1/transcribe:streamBidirectional | WebSocket | Stream audio in real time, receive transcription chunks as they become available |
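
To make the two endpoints concrete, here is a minimal sketch that builds (but does not send) a sync request and derives the WebSocket URL. Only the two paths come from the table. The base URL `https://api.inworld.ai`, the `Basic` authorization scheme, and the body fields `modelId` and `audioContent` are assumptions made for illustration, so check the API reference for the actual request schema before relying on them.

```python
import base64
import json
import urllib.request

BASE_URL = "https://api.inworld.ai"  # assumption: not stated on this page
SYNC_PATH = "/stt/v1/transcribe"
STREAM_PATH = "/stt/v1/transcribe:streamBidirectional"


def endpoint_url(streaming: bool, base_url: str = BASE_URL) -> str:
    """Return the full URL for the sync or streaming endpoint."""
    if streaming:
        # WebSocket connections use the wss:// scheme on the same host.
        return base_url.replace("https://", "wss://", 1) + STREAM_PATH
    return base_url + SYNC_PATH


def build_sync_request(audio: bytes, model_id: str,
                       api_key: str) -> urllib.request.Request:
    """Prepare a POST request for complete audio without sending it."""
    body = {
        # Hypothetical field names, used here only for illustration.
        "modelId": model_id,
        "audioContent": base64.b64encode(audio).decode("ascii"),
    }
    return urllib.request.Request(
        endpoint_url(streaming=False),
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Basic {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to inspect while the real schema is being confirmed.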