Timestamp alignment currently supports English only; other languages are experimental.
Use the timestampType request parameter to control granularity:
- WORD: Return timestamps for each word, including detailed phoneme-level timing with viseme symbols
- CHARACTER: Return timestamps for each character or punctuation
Enabling timestamp alignment can increase latency (especially for the non-streaming endpoint).
The returned fields depend on timestampType:
- WORD: timestampInfo.wordAlignment with words, wordStartTimeSeconds, wordEndTimeSeconds
  - For TTS 1.5 models, phoneticDetails containing detailed phoneme-level timing with viseme symbols
- CHARACTER: timestampInfo.characterAlignment with characters, characterStartTimeSeconds, characterEndTimeSeconds
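
The WORD alignment fields can be combined into one record per word. The following is a minimal sketch, assuming the three fields are parallel arrays of equal length and that the response has already been decoded into a Python dict; the `word_timings` helper and the sample values are hypothetical and are not real API output.

```python
from typing import NamedTuple


class WordTiming(NamedTuple):
    word: str
    start: float  # seconds
    end: float    # seconds


def word_timings(timestamp_info: dict) -> list[WordTiming]:
    """Zip the WORD-alignment arrays into one record per word.

    Assumes words, wordStartTimeSeconds and wordEndTimeSeconds are
    parallel arrays (an assumption, not confirmed by the source).
    """
    alignment = timestamp_info.get("wordAlignment", {})
    words = alignment.get("words", [])
    starts = alignment.get("wordStartTimeSeconds", [])
    ends = alignment.get("wordEndTimeSeconds", [])
    if not (len(words) == len(starts) == len(ends)):
        raise ValueError("alignment arrays have different lengths")
    return [WordTiming(w, s, e) for w, s, e in zip(words, starts, ends)]


# Hand-written sample for illustration only.
sample = {
    "wordAlignment": {
        "words": ["Hello", "world"],
        "wordStartTimeSeconds": [0.0, 0.42],
        "wordEndTimeSeconds": [0.38, 0.9],
    }
}
timings = word_timings(sample)
```

The same pattern would apply to characterAlignment by swapping in characters, characterStartTimeSeconds, and characterEndTimeSeconds.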
Phoneme and viseme timings (phoneticDetails) are currently only returned for WORD alignment (not CHARACTER).
Streaming behavior
You can control how timestamp data is delivered alongside audio using timestampTransportStrategy.
Sync (default)
Audio and alignment arrive together in each chunk. Every chunk contains both audio data and its corresponding timestamps.
Async
Audio chunks arrive first, followed by separate trailing messages containing only timestamp data. This reduces time-to-first-audio with TTS 1.5 models, since the server doesn’t need to wait for alignment computation before sending audio.
Set timestampTransportStrategy to SYNC or ASYNC in your request. See the API reference for details.
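
A client can handle both strategies with one loop by treating audio and timestamps as independently optional in each message. The sketch below assumes each streamed message has already been decoded into a dict; the `audio` key, the `iter_events` helper, and the sample chunks are hypothetical, and only timestampInfo is a field name taken from this page.

```python
from typing import Iterable, Iterator


def iter_events(chunks: Iterable[dict]) -> Iterator[tuple[str, object]]:
    """Yield ("audio", payload) and ("timestamps", info) events in arrival order.

    Works for SYNC (both present in every chunk) and ASYNC
    (timestamps arrive in trailing messages with no audio).
    """
    for chunk in chunks:
        audio = chunk.get("audio")  # hypothetical key name
        if audio:
            yield ("audio", audio)
        info = chunk.get("timestampInfo")
        if info:
            yield ("timestamps", info)


# Hand-written chunks for illustration only.
hello = {"wordAlignment": {"words": ["Hello"]}}
world = {"wordAlignment": {"words": ["world"]}}

sync_chunks = [
    {"audio": b"\x00\x01", "timestampInfo": hello},
    {"audio": b"\x02\x03", "timestampInfo": world},
]
async_chunks = [
    {"audio": b"\x00\x01"},
    {"audio": b"\x02\x03"},
    {"timestampInfo": hello},
    {"timestampInfo": world},
]

sync_order = [kind for kind, _ in iter_events(sync_chunks)]
async_order = [kind for kind, _ in iter_events(async_chunks)]
```

Because the generator yields audio as soon as it is seen, playback can begin before any alignment data has arrived in the ASYNC case.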
Response structure
TTS 1.5 models (inworld-tts-1.5-mini, inworld-tts-1.5-max)
Returns enhanced alignment data with phonetic details: detailed phoneme-level timing with viseme symbols for precise lip-sync animation.
Phonetic details structure
Each entry in phoneticDetails contains:
| Field | Description |
|---|---|
| wordIndex | Index of the word this phonetic detail belongs to (0-based). |
| phones | Array of phonemes that make up this word. |
| isPartial | True when the server considers the word potentially unstable (e.g., last word in a non-final streaming update). Clients may choose to delay processing partial words until isPartial becomes false. |

Each entry in phones contains:
| Field | Description |
|---|---|
| phoneSymbol | The phoneme symbol in IPA notation. |
| startTimeSeconds | Start time of the phoneme in seconds. May be omitted for the first phoneme of a word. |
| durationSeconds | Duration of the phoneme in seconds. |
| visemeSymbol | The viseme symbol for lip-sync animation. |
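
The two structures above can be flattened into a single list of viseme events for animation. This is a minimal sketch under stated assumptions: entries are plain dicts, partial words are skipped until a later update marks them stable, and an omitted startTimeSeconds falls back to the word's start time supplied by the caller. That fallback is a guess, not documented behavior, and `viseme_events` and the sample data are hypothetical.

```python
from typing import NamedTuple


class VisemeEvent(NamedTuple):
    start: float     # seconds
    duration: float  # seconds
    viseme: str
    phone: str


def viseme_events(phonetic_details: list[dict],
                  word_starts: list[float]) -> list[VisemeEvent]:
    """Flatten phoneticDetails into a list of viseme events."""
    events: list[VisemeEvent] = []
    for detail in phonetic_details:
        if detail.get("isPartial", False):
            continue  # word may still change; wait for a later update
        # Assumption: a missing startTimeSeconds means "starts with the word".
        word_start = word_starts[detail["wordIndex"]]
        for phone in detail.get("phones", []):
            events.append(VisemeEvent(
                start=phone.get("startTimeSeconds", word_start),
                duration=phone.get("durationSeconds", 0.0),
                viseme=phone["visemeSymbol"],
                phone=phone["phoneSymbol"],
            ))
    return events


# Hand-written sample for illustration only.
details = [
    {"wordIndex": 0, "isPartial": False, "phones": [
        {"phoneSymbol": "m", "durationSeconds": 0.08, "visemeSymbol": "bmp"},
        {"phoneSymbol": "ɪ", "startTimeSeconds": 0.08,
         "durationSeconds": 0.1, "visemeSymbol": "ee"},
    ]},
    {"wordIndex": 1, "isPartial": True, "phones": [
        {"phoneSymbol": "w", "startTimeSeconds": 0.42,
         "durationSeconds": 0.07, "visemeSymbol": "qw"},
    ]},
]
events = viseme_events(details, word_starts=[0.0, 0.42])
```

Skipping partial words trades a little latency for stability; a client that prefers responsiveness could instead render them and replace them when the final version arrives.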
Viseme symbols
The following viseme symbols are used for lip-sync animation:
| Viseme | Description |
|---|---|
| aei | Open mouth vowels (a, e, i, ə, ʌ, æ, ɑ, etc.) |
| o | Rounded vowels (o, ʊ, əʊ, oʊ, etc.) |
| ee | Front vowels (i, ɪ, eɪ, etc.) |
| bmp | Bilabial consonants (b, m, p) |
| fv | Labiodental consonants (f, v) |
| l | Lateral consonant (l) |
| r | Rhotic sounds (r, ɝ, ɚ) |
| th | Dental fricatives (θ, ð) |
| qw | Rounded consonants (w, ʍ) |
| cdgknstxyz | Alveolar/velar consonants (c, d, g, k, n, s, t, x, y, z) |
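
During playback, a lip-sync loop typically needs the viseme that is active at the current audio time. One possible approach is a binary search over event start times, sketched below. The `active_viseme` helper, the (start, duration, viseme) tuple layout, and the `"rest"` value returned during gaps are all hypothetical; `"rest"` is not one of the symbols in the table.

```python
from bisect import bisect_right

# The ten symbols listed in the viseme table.
KNOWN_VISEMES = frozenset(
    {"aei", "o", "ee", "bmp", "fv", "l", "r", "th", "qw", "cdgknstxyz"}
)


def active_viseme(timeline: list[tuple[float, float, str]],
                  t: float, rest: str = "rest") -> str:
    """Return the viseme covering time t, or `rest` if none does.

    `timeline` holds (start, duration, viseme) tuples sorted by start.
    """
    starts = [start for start, _, _ in timeline]
    i = bisect_right(starts, t) - 1  # last event starting at or before t
    if i < 0:
        return rest
    start, duration, viseme = timeline[i]
    return viseme if t < start + duration else rest


# Hand-written timeline for illustration only.
timeline = [(0.0, 0.08, "bmp"), (0.08, 0.10, "ee"), (0.42, 0.07, "qw")]
current = active_viseme(timeline, 0.05)
```

For long timelines, the list of start times could be computed once and reused rather than rebuilt on every call.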