In this quickstart, you’ll send an audio file to the STT API and receive a transcript. It also highlights Inworld STT (inworld/inworld-stt-1), which adds Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking (automatic or manual).
The STT API accepts base64-encoded audio and supports multiple audio formats. Requirements vary by use case:
| Use case | Format | Notes |
| --- | --- | --- |
| File upload (sync) | LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT | Sample rate can be auto-detected from file headers when possible |
| Streaming | LINEAR16 (PCM) | Other encodings are not supported for streaming to minimize latency and preserve quality |
Recommended settings:
Sample rate: 16,000 Hz (STT performs best at this rate; lower sample rates like 8 kHz contain fewer data points, reducing accuracy)
Bit depth: 16-bit (for LINEAR16)
Channels: Mono (1 channel)
For file uploads (MP3, FLAC, OGG_OPUS, WAV), sampleRateHertz is optional — the API can auto-detect it from the file header.
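The recommended settings above can be checked locally before uploading a WAV file. The following is a minimal sketch using Python's standard `wave` module; `check_wav` is a hypothetical helper name, not part of the STT API, and it only reports differences from the recommended values rather than enforcing anything the API requires.

```python
import wave


def check_wav(path_or_file):
    """Return a list of differences from the recommended settings.

    Accepts a file path or a binary file-like object containing WAV data.
    An empty list means the file is 16 kHz, 16-bit, mono.
    """
    problems = []
    with wave.open(path_or_file, "rb") as wav:
        rate = wav.getframerate()
        width_bytes = wav.getsampwidth()
        channels = wav.getnchannels()

    if rate != 16000:
        problems.append(f"sample rate is {rate} Hz (recommended: 16000 Hz)")
    if width_bytes != 2:
        problems.append(f"bit depth is {width_bytes * 8}-bit (recommended: 16-bit)")
    if channels != 1:
        problems.append(f"{channels} channels (recommended: mono)")
    return problems
```

Calling `check_wav("input.wav")` on the file used in this quickstart should return an empty list when the file matches the recommended settings. A non-empty list does not necessarily mean the request will fail, since the API can read the sample rate from the file header for uploads.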
3. Send the request
Audio is sent as a JSON payload with base64-encoded audio content. The API returns the full transcript once processing completes, along with a Voice Profile when the API returns one.

Create a new file inworld_stt_quickstart.py or inworld_stt_quickstart.js and use the code below. The Inworld model (inworld/inworld-stt-1) provides transcription plus optional Voice Profile (age, pitch, emotion, vocal style, accent) and configurable turn-taking for streaming.
```python
import requests
import os
import base64

# Sync endpoint
URL = "https://api.inworld.ai/stt/v1/transcribe"

# Use a 16-bit PCM WAV file (16 kHz, mono)
with open("input.wav", "rb") as f:
    audio_content = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "transcribe_config": {
        "model_id": "inworld/inworld-stt-1",
        "language": "en-US",
        "audio_encoding": "LINEAR16",
        "voice_profile_config": {
            "enable_voice_profile": True,
        },
    },
    "audio_data": {"content": audio_content},
}

headers = {
    "Authorization": f"Basic {os.getenv('INWORLD_API_KEY')}",
    "Content-Type": "application/json",
}

response = requests.post(URL, headers=headers, json=payload)
response.raise_for_status()
result = response.json()

print("Transcript:", result["transcription"]["transcript"])

# Voice Profile (when returned by the API)
if "voiceProfile" in result and result["voiceProfile"]:
    vp = result["voiceProfile"]
    if vp.get("age"):
        print("Age:", vp["age"].get("label"), vp["age"].get("confidence"))
    if vp.get("pitch"):
        print("Pitch:", vp["pitch"].get("label"), vp["pitch"].get("confidence"))
```
4. Review the response
The response includes the transcript and usage fields, plus an optional voiceProfile when available.

Response (sync)

| Field | Description |
| --- | --- |
| transcription.transcript | The transcribed text |
| transcription.isFinal | Whether the result is finalized |
| transcription.wordTimestamps | Per-word timing data (when available) |
| usage | Usage metrics for billing |
| voiceProfile | (When returned) Age, pitch, emotion, vocal_style, accent with label and confidence. Available with Inworld and supported third-party models |
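Because several response fields are optional, it can help to read them defensively. The sketch below flattens a parsed response into a small summary dict. `summarize_response` is a hypothetical helper, and it assumes each Voice Profile category is a single object with `label` and `confidence`, as in the quickstart code; if the API returns a list of labels per category, the loop would need adjusting.

```python
def summarize_response(result):
    """Summarize a parsed sync STT response, tolerating missing optional fields."""
    transcription = result.get("transcription") or {}
    summary = {
        "transcript": transcription.get("transcript", ""),
        "is_final": transcription.get("isFinal"),
        # wordTimestamps is only present when available, so default to empty.
        "word_count": len(transcription.get("wordTimestamps") or []),
        "voice_profile": {},
    }

    # Assumption: each category (age, pitch, ...) maps to one
    # {"label": ..., "confidence": ...} object.
    for category, value in (result.get("voiceProfile") or {}).items():
        if isinstance(value, dict) and "label" in value:
            summary["voice_profile"][category] = (
                value.get("label"),
                value.get("confidence"),
            )
    return summary
```

With the quickstart script, this would be called as `summarize_response(response.json())`. A response without `voiceProfile` simply yields an empty `voice_profile` dict.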
5. Configuration parameters
transcribeConfig / transcribe_config

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| modelId / model_id | string | Yes | STT model ID. Use inworld/inworld-stt-1 for WebSocket and HTTP |
| language | string | No | BCP-47 language code (e.g. en-US). If omitted, the model may auto-detect. See Supported Languages for the full list |
| audioEncoding | string | Yes | One of: LINEAR16, MP3, OGG_OPUS, FLAC, AUTO_DETECT. For streaming, use LINEAR16 only |
| sampleRateHertz | integer | No | Sample rate in Hz. Default 16000. Can be omitted for formats with headers (MP3, FLAC, OGG_OPUS, WAV) |
| numberOfChannels | integer | No | Channel count. Default 1 |
| voiceProfileConfig | object | No | Voice Profile configuration. See below |
voiceProfileConfig / voice_profile_config

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| enableVoiceProfile / enable_voice_profile | bool | Yes | Set to true to enable Voice Profile analysis |
| topN / top_n | integer | No | Number of top labels per category to return. Default: 10 |
audioData

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| content | string | Yes | Base64-encoded audio bytes |
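The parameter tables above can be turned into a small payload builder that sets only the required fields and adds optional ones when given. This is a sketch, not an official client: `build_payload` and `FILE_ENCODINGS` are hypothetical names, and the keys use the camelCase spellings from the tables, whereas the quickstart request uses the snake_case spellings.

```python
import base64

# Encodings listed for file upload in the tables above.
FILE_ENCODINGS = {"LINEAR16", "MP3", "OGG_OPUS", "FLAC", "AUTO_DETECT"}


def build_payload(
    audio_bytes,
    audio_encoding,
    model_id="inworld/inworld-stt-1",
    language=None,
    sample_rate_hertz=None,
    number_of_channels=None,
    enable_voice_profile=False,
    top_n=None,
):
    """Build a request body; optional fields are left out unless provided."""
    if audio_encoding not in FILE_ENCODINGS:
        raise ValueError(f"unsupported audio encoding: {audio_encoding!r}")

    config = {"modelId": model_id, "audioEncoding": audio_encoding}
    if language is not None:
        config["language"] = language
    if sample_rate_hertz is not None:
        config["sampleRateHertz"] = sample_rate_hertz
    if number_of_channels is not None:
        config["numberOfChannels"] = number_of_channels
    if enable_voice_profile:
        voice_profile = {"enableVoiceProfile": True}
        if top_n is not None:
            voice_profile["topN"] = top_n
        config["voiceProfileConfig"] = voice_profile

    return {
        "transcribeConfig": config,
        "audioData": {"content": base64.b64encode(audio_bytes).decode("ascii")},
    }
```

Leaving `sampleRateHertz` and `numberOfChannels` out by default relies on the documented defaults (16000 and 1) and on header auto-detection for file uploads.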
6. Run the code
```bash
pip install requests  # if needed
python inworld_stt_quickstart.py
```
Example output:
Transcript: Hey, I just wanted to check in on the delivery status for my order.
Responses stream back as Transcription (interim and final), optional voiceProfile, and finally Usage when the stream is closed.

Streaming endpoint (WebSocket): wss://api.inworld.ai/stt/v1/transcribe:streamBidirectional
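Streaming accepts LINEAR16 only, so a client usually needs to split raw PCM into small pieces before sending. The WebSocket message format is not covered in this section, so the sketch below stops at the chunking step. `pcm_chunks` is a hypothetical helper, and the 100 ms chunk duration is an illustrative assumption rather than a value taken from the API.

```python
def pcm_chunks(pcm_bytes, sample_rate_hertz=16000, bytes_per_sample=2,
               channels=1, chunk_ms=100):
    """Yield fixed-duration slices of raw LINEAR16 audio.

    Defaults match the recommended settings: 16 kHz, 16-bit, mono.
    The final chunk may be shorter than the others.
    """
    frame_bytes = bytes_per_sample * channels
    frames_per_chunk = sample_rate_hertz * chunk_ms // 1000
    chunk_bytes = frames_per_chunk * frame_bytes
    if chunk_bytes <= 0:
        raise ValueError("chunk duration is too short for this sample rate")

    for start in range(0, len(pcm_bytes), chunk_bytes):
        yield pcm_bytes[start:start + chunk_bytes]
```

At the recommended settings, one second of audio is 16,000 samples × 2 bytes = 32,000 bytes, so 100 ms chunks are 3,200 bytes each. Note that a WAV file carries a header before the PCM data; the input here is assumed to be the raw sample bytes only.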