Timestamp alignment currently supports English only; other languages are experimental.
Timestamp alignment lets you retrieve timing information that matches the generated audio, which is useful for experiences like word highlighting, karaoke‑style captions, and lip-sync. Set the timestampType request parameter to control granularity:
  • WORD: Return timestamps for each word. For TTS 1.5 models, this includes detailed phoneme-level timing with viseme symbols
  • CHARACTER: Return timestamps for each character or punctuation mark
Enabling timestamp alignment can increase latency (especially for the non-streaming endpoint).
When enabled, the response includes timestamp arrays:
  • WORD: timestampInfo.wordAlignment with words, wordStartTimeSeconds, wordEndTimeSeconds
    • For TTS 1.5 models, phoneticDetails containing detailed phoneme-level timing with viseme symbols
  • CHARACTER: timestampInfo.characterAlignment with characters, characterStartTimeSeconds, characterEndTimeSeconds
Phoneme and viseme timings (phoneticDetails) are currently only returned for WORD alignment (not CHARACTER).
See the API reference for full details.
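As an illustration of the response fields listed above, the following minimal sketch turns either alignment type into a list of (text, start, end) spans. The helper name alignment_spans and the Span type are hypothetical, and the sketch assumes the response has already been decoded from JSON into a Python dict.

```python
from typing import NamedTuple


class Span(NamedTuple):
    text: str
    start: float
    end: float


# Field names per alignment type, as listed in the section above.
_FIELDS = {
    "wordAlignment": ("words", "wordStartTimeSeconds", "wordEndTimeSeconds"),
    "characterAlignment": (
        "characters",
        "characterStartTimeSeconds",
        "characterEndTimeSeconds",
    ),
}


def alignment_spans(response: dict) -> list[Span]:
    """Return (text, start, end) spans from whichever alignment is present."""
    info = response.get("timestampInfo", {})
    for key, (items, starts, ends) in _FIELDS.items():
        alignment = info.get(key)
        if alignment:
            # The three arrays are parallel: index i describes the same item.
            return [
                Span(text, start, end)
                for text, start, end in zip(
                    alignment[items], alignment[starts], alignment[ends]
                )
            ]
    return []  # Timestamp alignment was not enabled or nothing was returned.
```

Pairing the parallel arrays once, up front, keeps later code (captions, highlighting) independent of whether WORD or CHARACTER granularity was requested.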

Streaming behavior

You can control how timestamp data is delivered alongside audio using timestampTransportStrategy.

Sync (default)

Audio and alignment arrive together in each chunk. Every chunk contains both audio data and its corresponding timestamps.
Chunk 1: audio + timestamps for chunk 1
Chunk 2: audio + timestamps for chunk 2
Chunk 3: audio + timestamps for chunk 3
This is the simplest approach, but the first audio chunk is slightly delayed because the server waits for alignment before sending it.

Async

Audio chunks arrive first, followed by separate trailing messages containing only timestamp data. This reduces time-to-first-audio with TTS 1.5 models, since the server doesn’t need to wait for alignment computation before sending audio.
Chunk 1: audio only
Chunk 2: audio only
Chunk 3: audio only
Chunk 4: timestamps only (alignment for chunks 1–3)
Chunk 5: timestamps only
...
Use async when you prioritize playback speed and can handle timestamps arriving after their corresponding audio. Use sync when you need audio and timestamps together in each chunk (e.g., for real-time lip-sync or word highlighting during playback). Set timestampTransportStrategy to SYNC or ASYNC in your request. See the API reference for details.
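A client can be written so that it tolerates both delivery strategies. The sketch below is one way to do that, under stated assumptions: each streamed chunk has already been decoded into a dict, the "audio" key (raw bytes) is a hypothetical name rather than the documented field, and timestamp data uses the timestampInfo.wordAlignment fields described earlier. It does not deduplicate or revise words across updates.

```python
from typing import Iterable


def collect_stream(chunks: Iterable[dict]) -> tuple[bytes, list[tuple[str, float, float]]]:
    """Accumulate audio and word timings from SYNC or ASYNC chunk sequences."""
    audio = bytearray()
    words: list[tuple[str, float, float]] = []
    for chunk in chunks:
        # "audio" is a hypothetical key; timestamp-only chunks carry no audio.
        audio.extend(chunk.get("audio", b""))
        # Audio-only chunks (ASYNC) carry no timestampInfo.
        alignment = chunk.get("timestampInfo", {}).get("wordAlignment")
        if alignment:
            words.extend(
                zip(
                    alignment["words"],
                    alignment["wordStartTimeSeconds"],
                    alignment["wordEndTimeSeconds"],
                )
            )
    return bytes(audio), words
```

Because each chunk is checked for audio and timestamps independently, the same loop handles chunks that carry both (sync) and chunks that carry only one of the two (async).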

Response structure

TTS 1.5 models (inworld-tts-1.5-mini, inworld-tts-1.5-max)

Returns enhanced alignment data with phonetic details: detailed phoneme-level timing with viseme symbols for precise lip-sync animation.
{
  "timestampInfo": {
    "wordAlignment": {
      "words": ["Hello,", "world,", "this", "will", "be", "saved"],
      "wordStartTimeSeconds": [0, 0.28, 0.96, 1.25, 1.38, 1.5],
      "wordEndTimeSeconds": [0.28, 0.8, 1.25, 1.38, 1.5, 1.99],
      "phoneticDetails": [
        {
          "wordIndex": 0,
          "phones": [
            {"phoneSymbol": "h", "startTimeSeconds": 0, "durationSeconds": 0.07, "visemeSymbol": "aei"},
            {"phoneSymbol": "ə", "startTimeSeconds": 0.07, "durationSeconds": 0.030000001, "visemeSymbol": "aei"},
            {"phoneSymbol": "l", "startTimeSeconds": 0.1, "durationSeconds": 0.089999996, "visemeSymbol": "l"},
            {"phoneSymbol": "oʊ1", "startTimeSeconds": 0.19, "durationSeconds": 0.09, "visemeSymbol": "o"}
          ],
          "isPartial": false
        },
        {
          "wordIndex": 1,
          "phones": [
            {"phoneSymbol": "w", "startTimeSeconds": 0.28, "durationSeconds": 0.18, "visemeSymbol": "qw"},
            {"phoneSymbol": "ɝ1", "startTimeSeconds": 0.46, "durationSeconds": 0.119999975, "visemeSymbol": "r"},
            {"phoneSymbol": "l", "startTimeSeconds": 0.58, "durationSeconds": 0.08000004, "visemeSymbol": "l"},
            {"phoneSymbol": "d", "startTimeSeconds": 0.66, "durationSeconds": 0.13999999, "visemeSymbol": "cdgknstxyz"}
          ],
          "isPartial": false
        },
        {
          "wordIndex": 2,
          "phones": [
            {"phoneSymbol": "ð", "startTimeSeconds": 0.96, "durationSeconds": 0.14000005, "visemeSymbol": "th"},
            {"phoneSymbol": "ɪ1", "startTimeSeconds": 1.1, "durationSeconds": 0.06999993, "visemeSymbol": "ee"},
            {"phoneSymbol": "s", "startTimeSeconds": 1.17, "durationSeconds": 0.08000004, "visemeSymbol": "cdgknstxyz"}
          ],
          "isPartial": false
        }
      ]
    }
  }
}
Phonetic details structure
Each entry in phoneticDetails contains:
  • wordIndex: Index of the word this phonetic detail belongs to (0-based).
  • phones: Array of phonemes that make up this word.
  • isPartial: True when the server considers the word potentially unstable (e.g., last word in a non-final streaming update). Clients may choose to delay processing partial words until isPartial becomes false.
Each phone entry contains:
  • phoneSymbol: The phoneme symbol in IPA notation.
  • startTimeSeconds: Start time of the phoneme in seconds. May be omitted for the first phoneme of a word.
  • durationSeconds: Duration of the phoneme in seconds.
  • visemeSymbol: The viseme symbol for lip-sync animation.
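To show how these fields fit together, here is a minimal sketch that flattens phoneticDetails into a viseme timeline. The function name viseme_timeline is hypothetical. Two choices in it are assumptions rather than documented behavior: entries with isPartial set are skipped until a later update marks them stable, and an omitted startTimeSeconds is filled in with the word's start time (or the end of the previous phoneme).

```python
def viseme_timeline(word_alignment: dict) -> list[tuple[float, float, str]]:
    """Flatten phoneticDetails into (start, end, visemeSymbol) tuples."""
    word_starts = word_alignment["wordStartTimeSeconds"]
    timeline = []
    for detail in word_alignment.get("phoneticDetails", []):
        if detail.get("isPartial", False):
            continue  # Assumption: wait for a stable update before animating.
        # Assumption: a missing start time means "where the word begins".
        cursor = word_starts[detail["wordIndex"]]
        for phone in detail["phones"]:
            start = phone.get("startTimeSeconds", cursor)
            end = start + phone["durationSeconds"]
            timeline.append((start, end, phone["visemeSymbol"]))
            cursor = end  # Fallback start for the next phoneme, if needed.
    return timeline
```

Called on the wordAlignment object of a TTS 1.5 response, this yields one tuple per phoneme, in order, which an animation loop can step through as playback advances.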
Viseme symbols
The following viseme symbols are used for lip-sync animation:
  • aei: Open mouth vowels (a, e, i, ə, ʌ, æ, ɑ, etc.)
  • o: Rounded vowels (o, ʊ, əʊ, oʊ, etc.)
  • ee: Front vowels (i, ɪ, eɪ, etc.)
  • bmp: Bilabial consonants (b, m, p)
  • fv: Labiodental consonants (f, v)
  • l: Lateral consonant (l)
  • r: Rhotic sounds (r, ɝ, ɚ)
  • th: Dental fricatives (θ, ð)
  • qw: Rounded consonants (w, ʍ)
  • cdgknstxyz: Alveolar/velar consonants (c, d, g, k, n, s, t, x, y, z)

TTS 1 models (inworld-tts-1, inworld-tts-1-max)

Returns basic word/character timing arrays:
{
  "timestampInfo": {
    "wordAlignment": {
      "words": ["Hello", "world,", "this", "will", "be", "saved"],
      "wordStartTimeSeconds": [0, 0.33, 0.69, 0.89, 1.1, 1.26],
      "wordEndTimeSeconds": [0.28, 0.63, 0.87, 1.05, 1.16, 1.6]
    }
  }
}
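For word highlighting, the timing arrays can be searched by playback position. The sketch below is one possible approach, not part of the API: active_word_index is a hypothetical helper that assumes wordStartTimeSeconds is sorted in ascending order and that the playback time t is measured in seconds from the start of the audio.

```python
from bisect import bisect_right
from typing import Optional


def active_word_index(word_alignment: dict, t: float) -> Optional[int]:
    """Return the index of the word being spoken at time t, or None."""
    starts = word_alignment["wordStartTimeSeconds"]
    ends = word_alignment["wordEndTimeSeconds"]
    # Last word whose start time is <= t (binary search on sorted starts).
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < ends[i]:
        return i
    return None  # t falls in a gap between words, or outside the audio.
```

Returning None for gaps lets a caption renderer clear the highlight during pauses instead of holding the previous word.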