Voice Profile analyzes vocal characteristics of the speaker alongside transcription. It returns structured classification data for Age, Emotion, Vocal Style, and Accent, each with confidence scores ranging from 0.0 to 1.0.
Voice Profile is available across all STT models on the Inworld STT API. By understanding who is speaking and how they are speaking, applications can adapt responses, adjust tone, route conversations, or trigger context-sensitive behaviors in real time.
Use cases
- Voice agents and NPCs — Adapt responses based on the speaker’s detected emotion or vocal style (e.g., respond empathetically to a sad tone).
- Accessibility — Detect age category or vocal style to adjust UI, pacing, or interaction complexity.
- Content moderation — Flag unusual vocal patterns (shouting, crying) for escalation or review.
- Analytics and insights — Aggregate emotion and vocal style data across sessions for user experience analysis.
- Localization — Use accent detection to dynamically select language models or localized content.
How it works
Voice Profile analysis runs automatically when a threshold is set via the voiceProfileThreshold field inside inworldConfig in your request (HTTP or WebSocket). The threshold controls which labels are returned: only labels with a confidence score at or above it are included in the response. Default: 0.5. Range: 0.0–1.0.
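The filtering behavior can be sketched in Python; the helper function and raw scores below are illustrative, not part of the API:

```python
def filter_labels(scores, threshold=0.5):
    """Keep only labels whose confidence meets the threshold, ranked high to low."""
    kept = [{"label": label, "confidence": score}
            for label, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda entry: entry["confidence"], reverse=True)

# Hypothetical raw emotion scores before thresholding.
raw = {"tender": 0.97, "sad": 0.03, "calm": 0.62}
print(filter_labels(raw))
```

With the default threshold of 0.5, only "tender" and "calm" survive; raising the threshold to 0.7 would drop "calm" as well.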
Classification categories
Age
Estimates the speaker’s age category. Returns a single label with the highest confidence.
| Label | Description |
|---|---|
| young | Young adult / teenager |
| adult | Adult speaker |
| kid | Child speaker |
| old | Elderly speaker |
| unclear | Age could not be determined |
Emotion
Detects emotional tone in the speaker’s voice. Returns multiple labels ranked by confidence.
| Label | Description |
|---|---|
| tender | Soft, gentle, caring tone |
| sad | Sorrowful or melancholy tone |
| calm | Relaxed, even-tempered delivery |
| neutral | No strong emotional signal |
| happy | Cheerful, upbeat tone |
| angry | Frustrated, aggressive tone |
| fearful | Anxious or frightened tone |
| surprised | Startled or astonished tone |
| disgusted | Revulsion or strong disapproval |
| unclear | Emotion could not be determined |
Vocal Style
Identifies the speaker’s manner of delivery. Returns multiple labels ranked by confidence.
| Label | Description |
|---|---|
| whispering | Hushed, breathy delivery |
| normal | Standard conversational speech |
| singing | Melodic or musical delivery |
| mumbling | Unclear, low-articulation speech |
| crying | Speech accompanied by crying |
| laughing | Speech accompanied by laughter |
| shouting | Loud, raised-voice delivery |
| monotone | Flat, unvaried pitch delivery |
| unclear | Vocal style could not be determined |
Accent
Detects the speaker’s accent or regional dialect using BCP-47 locale codes. Returns a single label with the highest confidence, plus additional candidates ranked below.
| Label | Region |
|---|---|
| en-US | American English |
| en-GB | British English |
| en-AU | Australian English |
| zh-CN | Mandarin Chinese |
| fr-FR | French (France) |
| es-ES | Spanish (Spain) |
| es-419 | Spanish (Latin America) |
| es-MX | Spanish (Mexico) |
| ar-EG | Arabic (Egypt) |
Additional accent locales may be returned beyond those listed above. The model supports a broad range of BCP-47 codes.
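For localization routing, the detected accent works best as a soft signal with a fallback, since confidence can be low. A minimal Python sketch (the function name and the 0.5 cutoff are illustrative choices, not API behavior):

```python
def pick_locale(accent, supported, fallback="en-US"):
    """Select a localization from a detected accent label.

    `accent` is the ClassLabel dict from voiceProfile.accent (or None if absent).
    Falls back when the accent is missing, unsupported, or low-confidence.
    """
    if accent and accent["label"] in supported and accent["confidence"] >= 0.5:
        return accent["label"]
    return fallback

locale = pick_locale({"label": "es-MX", "confidence": 0.72}, {"es-MX", "en-US"})
```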
Configuration
The STT API accepts both camelCase and snake_case field names (e.g., transcribeConfig / transcribe_config, voiceProfileThreshold / voice_profile_threshold). The examples below use camelCase.
Set voiceProfileThreshold inside inworldConfig:
```json
{
  "transcribeConfig": {
    "modelId": "<MODEL_ID>",
    "language": "en-US",
    "audioEncoding": "MP3",
    "inworldConfig": {
      "voiceProfileThreshold": 0.5
    }
  }
}
```
Use any STT model that supports Voice Profiles (for example, groq/whisper-large-v3 for synchronous HTTP, or the assemblyai/... streaming models listed in the STT overview).
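As a sketch, the sync payload can be assembled and validated in Python before posting it with your API key (the builder function below is illustrative, not part of an SDK):

```python
import json

def build_stt_request(model_id, threshold=0.5):
    """Build a sync transcription payload with a Voice Profile threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("voiceProfileThreshold must be between 0.0 and 1.0")
    return {
        "transcribeConfig": {
            "modelId": model_id,
            "language": "en-US",
            "audioEncoding": "MP3",
            "inworldConfig": {"voiceProfileThreshold": threshold},
        }
    }

payload = build_stt_request("groq/whisper-large-v3", 0.6)
print(json.dumps(payload, indent=2))
```

Validating the threshold client-side keeps out-of-range values from silently producing unexpected label sets.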
WebSocket (Streaming)
Include inworldConfig in the first WebSocket message:
```json
{
  "transcribeConfig": {
    "modelId": "<MODEL_ID>",
    "audioEncoding": "LINEAR16",
    "inworldConfig": {
      "voiceProfileThreshold": 0.5
    }
  }
}
```
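The message ordering (config first, then audio) can be sketched in Python. The chunk frame shape used here, a JSON object with a base64-encoded `audio` field, is an illustrative assumption; check the STT streaming reference for the exact frame format:

```python
import base64
import json

def streaming_messages(model_id, audio_chunks, threshold=0.5):
    """Yield the WebSocket message sequence: config message first, then audio.

    NOTE: the {"audio": ...} chunk shape is assumed for illustration only.
    """
    yield json.dumps({
        "transcribeConfig": {
            "modelId": model_id,
            "audioEncoding": "LINEAR16",
            "inworldConfig": {"voiceProfileThreshold": threshold},
        }
    })
    for chunk in audio_chunks:
        yield json.dumps({"audio": base64.b64encode(chunk).decode("ascii")})

messages = list(streaming_messages("<MODEL_ID>", [b"\x00\x01", b"\x02\x03"]))
```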
Configuration parameters
| Field | Type | Default | Description |
|---|---|---|---|
| voice_profile_threshold / voiceProfileThreshold | float | 0.5 | Minimum confidence score (0.0–1.0) for a label to be included in the response. Higher values return fewer, more confident labels. |
Response structure
The voiceProfile object is returned alongside transcription and usage in both sync and streaming responses. Each category contains a label and a confidence score.
Example response (sync)
```json
{
  "transcription": {
    "transcript": "Hey, I just wanted to check in on the delivery status.",
    "isFinal": true
  },
  "voiceProfile": {
    "age": { "label": "young", "confidence": 0.78 },
    "emotion": [
      { "label": "tender", "confidence": 0.97 },
      { "label": "sad", "confidence": 0.03 }
    ],
    "vocal_style": [
      { "label": "whispering", "confidence": 0.97 },
      { "label": "normal", "confidence": 0.03 }
    ],
    "accent": { "label": "en-US", "confidence": 0.48 }
  },
  "usage": {
    "transcribed_audio_ms": 3200,
    "model_id": "inworld/inworld-stt-1"
  }
}
```
Response fields
| Field | Type | Description |
|---|---|---|
| voiceProfile.age | ClassLabel | Single label: estimated age category of the speaker. |
| voiceProfile.emotion | ClassLabel[] | Array of detected emotions, ranked by confidence. Multiple emotions may be present. |
| voiceProfile.vocal_style | ClassLabel[] | Array of detected vocal styles, ranked by confidence. Multiple styles may be present. |
| voiceProfile.accent | ClassLabel | Single label: detected accent as a BCP-47 locale code. |
Each ClassLabel contains:
- label (string) — The predicted class name
- confidence (float) — Score from 0.0 to 1.0
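Because any of these fields may be absent, a small accessor that degrades gracefully is useful when consuming responses. A Python sketch (the helper and its default value are illustrative):

```python
def top_emotion(response, default="neutral"):
    """Return the highest-confidence emotion label, or a default when absent.

    Handles a missing voiceProfile object, a missing emotion field,
    and an empty emotion array without raising.
    """
    profile = response.get("voiceProfile") or {}
    emotions = profile.get("emotion") or []
    return emotions[0]["label"] if emotions else default

sample = {
    "voiceProfile": {
        "emotion": [
            {"label": "tender", "confidence": 0.97},
            {"label": "sad", "confidence": 0.03},
        ]
    }
}
```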
Best practices
- Start with the default threshold (0.5) — This filters out low-confidence noise while keeping useful labels. Lower the threshold if you need broader signal; raise it for precision-critical use cases.
- Use emotion and vocal style together — Combining both categories gives a richer picture. A “tender” emotion with “whispering” vocal style tells a different story than “tender” with “normal” style.
- Handle missing fields gracefully — Voice Profile fields may be absent if the model cannot make a confident classification or if the audio quality is insufficient. Always check for presence before accessing.
- Accent is probabilistic — Accent detection returns the most likely locale, not a definitive answer. Use it as a signal rather than a hard routing decision.
- Test with representative audio — Classification accuracy depends on audio quality, background noise, and speech duration. Test with samples that reflect your production environment.
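As an illustration of combining emotion and vocal style for moderation-style routing (the label combinations chosen here are an example policy, not a recommendation from the API):

```python
def needs_escalation(profile):
    """Flag a session where the speaker sounds distressed.

    Example policy: an angry or fearful emotion combined with a shouting
    or crying delivery. `profile` is the voiceProfile object from a response.
    """
    emotions = {e["label"] for e in profile.get("emotion", [])}
    styles = {s["label"] for s in profile.get("vocal_style", [])}
    return bool(emotions & {"angry", "fearful"}) and bool(styles & {"shouting", "crying"})
```

Requiring agreement across both categories reduces false positives compared with acting on a single high-confidence label.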