Receive audio chunks as they are individually processed.
Your authentication credentials. For Basic authentication, set the value to Basic $INWORLD_API_KEY.
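As a minimal sketch of the credential format described above, the helper below builds request headers from an API key. The header name Authorization and the JSON content type are assumptions based on standard HTTP conventions and are not stated on this page. build_auth_headers is a hypothetical helper name.

```python
import os


def build_auth_headers(api_key: str) -> dict:
    """Build HTTP headers carrying the credentials in the documented 'Basic <key>' form."""
    if not api_key:
        raise ValueError("API key must not be empty")
    return {
        # Assumption: credentials travel in the standard Authorization header.
        "Authorization": f"Basic {api_key}",
        # Assumption: the request body is sent as JSON.
        "Content-Type": "application/json",
    }


# Read the key from the INWORLD_API_KEY environment variable when it is set.
api_key = os.environ.get("INWORLD_API_KEY", "")
headers = build_auth_headers(api_key) if api_key else None
```

Keeping the key in an environment variable rather than in source code avoids committing it by accident.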
Request body for the streaming speech synthesis endpoint.
The text to be synthesized into speech. Maximum input of 2,000 characters.
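Longer inputs have to be split before they are sent. The sketch below is one possible client-side approach, not part of the API: it breaks text at whitespace so that each piece stays within the 2,000-character limit. It assumes the limit is counted the way Python's len() counts characters, which this page does not specify. split_text is a hypothetical helper name.

```python
MAX_CHARS = 2000  # documented maximum input length per request


def split_text(text: str, limit: int = MAX_CHARS) -> list:
    """Split text into pieces of at most `limit` characters, preferring whitespace breaks."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Look for the last space that keeps the piece within the limit.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no usable space: fall back to a hard split
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks
```

Each returned piece can then be sent as its own request. Splitting at sentence boundaries instead of spaces would likely give more natural prosody, at the cost of a slightly more involved splitter.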
The ID of the voice to use for synthesizing speech.
Configurations to use when synthesizing speech.
Determines the degree of randomness when sampling audio tokens to generate the response.
Defaults to 1.1. Accepts values between 0 (exclusive) and 2 (inclusive). Higher values will make the output more random and can lead to more expressive results. Lower values will make it more deterministic. If 0 is provided, the default value will be used.
For the most stable results, we recommend using the default value.
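The range rule above can be mirrored on the client so that invalid values are caught before a request is made. This is a sketch of the documented behavior only; resolve_temperature is a hypothetical helper name, and the service performs its own validation regardless.

```python
DEFAULT_TEMPERATURE = 1.1  # documented default


def resolve_temperature(value=None) -> float:
    """Return the temperature to send, following the documented range rules."""
    # A missing value or 0 falls back to the default, as described above.
    if value is None or value == 0:
        return DEFAULT_TEMPERATURE
    # Valid range is 0 (exclusive) to 2 (inclusive).
    if not (0 < value <= 2):
        raise ValueError(f"temperature must be in (0, 2], got {value}")
    return float(value)
```

For example, resolve_temperature(0) yields the default of 1.1, while resolve_temperature(2.5) raises ValueError.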
Controls timestamp metadata returned with the audio. When enabled, the response includes timing arrays, which can be useful for word highlighting, karaoke-style captions, and lip-sync.
WORD alignment: timestampInfo.wordAlignment (words, wordStartTimeSeconds, wordEndTimeSeconds).
CHARACTER alignment: timestampInfo.characterAlignment (characters, characterStartTimeSeconds, characterEndTimeSeconds).
Phonetic details: phoneticDetails is currently returned only for WORD alignment (not CHARACTER).
Latency note: Alignment requires additional computation, so enabling it can increase latency.
Model differences:
inworld-tts-1, inworld-tts-1-max: Returns basic word/character timing arrays.
inworld-tts-1.5-mini, inworld-tts-1.5-max: Returns enhanced alignment data with detailed phoneticDetails containing phoneme-level timing and viseme symbols for lip-sync.
Note: Timestamp alignment currently supports English only; other languages are experimental.
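To illustrate how the word alignment arrays might be consumed for word highlighting, the sketch below zips the three parallel arrays into (word, start, end) tuples and looks up the word spoken at a given playback time. The field names come from the description above. The helper names and the sample values are hypothetical, and the sketch assumes the three arrays have equal length.

```python
def word_timings(timestamp_info: dict) -> list:
    """Pair each word with its start and end time in seconds."""
    alignment = timestamp_info.get("wordAlignment", {})
    words = alignment.get("words", [])
    starts = alignment.get("wordStartTimeSeconds", [])
    ends = alignment.get("wordEndTimeSeconds", [])
    # zip stops at the shortest array, so mismatched lengths do not raise.
    return list(zip(words, starts, ends))


def word_at(timings: list, t: float):
    """Return the word being spoken at time t, or None during silence."""
    for word, start, end in timings:
        if start <= t < end:
            return word
    return None


# Hypothetical sample shaped like the documented wordAlignment object.
sample = {
    "wordAlignment": {
        "words": ["Hello", "world"],
        "wordStartTimeSeconds": [0.0, 0.5],
        "wordEndTimeSeconds": [0.4, 0.9],
    }
}
timings = word_timings(sample)
```

A caption renderer could call word_at with the current playback position on every frame and highlight the returned word.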
Allowed values for the timestamp type: TIMESTAMP_TYPE_UNSPECIFIED, WORD, CHARACTER
When enabled, text normalization automatically expands and standardizes things like numbers, dates, times, and abbreviations before converting them to speech. For example, Dr. Smith becomes Doctor Smith, and 3/10/25 is spoken as March tenth, twenty twenty-five. Turning this off may reduce latency, but the speech output will read the text exactly as written. By default, the service decides automatically whether to apply text normalization.
Allowed values for text normalization: APPLY_TEXT_NORMALIZATION_UNSPECIFIED, ON, OFF
The transport strategy for timestamp info.
TIMESTAMP_TRANSPORT_STRATEGY_UNSPECIFIED: The service automatically decides the transport strategy.
SYNC: Timestamps are returned in the same message as the audio data.
ASYNC: Timestamps may be returned in a trailing message after the audio data. Use this strategy to reduce the latency of the first audio chunk with v1.5+ models.
Allowed values for the transport strategy: TIMESTAMP_TRANSPORT_STRATEGY_UNSPECIFIED, SYNC, ASYNC
A successful response returns a stream of objects.
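Because ASYNC timestamps may arrive after the audio they describe, a client should not assume that every message carries both. The sketch below shows one way to accumulate a stream of already-parsed response objects that works for either strategy. Several details are assumptions not confirmed on this page: that each object is a dict, that audio arrives base64-encoded under a field called audioContent, and that timing data sits under timestampInfo at the top level of each object. collect_stream is a hypothetical helper name.

```python
import base64


def collect_stream(messages):
    """Accumulate audio bytes and timestamp objects from parsed stream messages."""
    audio = bytearray()
    timestamps = []
    for message in messages:
        # Assumption: audio is base64 text under the hypothetical key "audioContent".
        chunk = message.get("audioContent")
        if chunk:
            audio.extend(base64.b64decode(chunk))
        # With SYNC this accompanies the audio; with ASYNC it may come in a
        # later message that carries no audio at all.
        info = message.get("timestampInfo")
        if info:
            timestamps.append(info)
    return bytes(audio), timestamps
```

Handling the audio and the timing data independently means the same loop needs no changes when the transport strategy is switched.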