Audio markups let you control how the model speaks: not only what it says, but also its pacing, emotion, and non-verbal sounds. This page covers two kinds: SSML break tags, which insert silences, and bracket-style markups for emotion, delivery style, and non-verbal vocalizations.

SSML break tags

Use SSML break tags when you need precise control over the duration and position of silences in the generated speech. The TTS API and Inworld Portal support SSML <break time="1s" /> in text input for streaming, non-streaming, and WebSocket requests, in all languages. You can specify the duration in milliseconds or seconds; for example, <break time="1000ms" /> and <break time="1s" /> produce the same result.
Constraints:
  • Use well-formed SSML: include the angle brackets and the closing slash, for example <break time="1s" />.
  • Tag names and attributes are case-insensitive; for example, <BREAK time="2s" /> works.
  • Up to 20 break tags are supported per request. Any break tags after the first 20 are ignored.
  • Each break can last at most 10 seconds, for example time="10s" or time="10000ms".
Example:
One second pause <break time="1s" /> two seconds pause <BREAK time="2s" /> this is the end.<break time="500ms" />
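The constraints above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the TTS API: the names break_durations_ms and check_breaks are hypothetical, and the pattern assumes whole-number durations, since this page does not say whether fractional values such as "1.5s" are accepted.

```python
import re

MAX_BREAKS = 20        # break tags honored per request, per the constraints above
MAX_BREAK_MS = 10_000  # longest single break: 10 seconds

# Matches well-formed, self-closing break tags. Tag and attribute names are
# matched case-insensitively. Assumption: durations are whole numbers.
BREAK_RE = re.compile(r'<break\s+time="(\d+)(ms|s)"\s*/>', re.IGNORECASE)


def break_durations_ms(text: str) -> list[int]:
    """Return the duration of each well-formed break tag, in milliseconds."""
    durations = []
    for value, unit in BREAK_RE.findall(text):
        factor = 1000 if unit.lower() == "s" else 1
        durations.append(int(value) * factor)
    return durations


def check_breaks(text: str) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    durations = break_durations_ms(text)
    if len(durations) > MAX_BREAKS:
        problems.append(
            f"{len(durations)} break tags found; tags after the first "
            f"{MAX_BREAKS} are ignored"
        )
    for ms in durations:
        if ms > MAX_BREAK_MS:
            problems.append(f"break of {ms} ms exceeds the {MAX_BREAK_MS} ms limit")
    return problems


example = (
    'One second pause <break time="1s" /> two seconds pause '
    '<BREAK time="2s" /> this is the end.<break time="500ms" />'
)
print(break_durations_ms(example))
print(check_breaks(example))
```

Converting every duration to milliseconds first keeps the limit check in one unit, which mirrors the statement that time="1s" and time="1000ms" are equivalent. Malformed tags are simply not matched by this sketch, so it does not report them.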

Emotion, delivery, and non-verbal markups

Use these markups to control emotion and delivery style (such as whispering) or to add non-verbal vocalizations such as sighs and coughs. They are currently experimental and supported for English only.

Emotion and Delivery Style

Emotion and delivery style markups control the way a given text is spoken. These work best when used at the beginning of a text and apply to the text that follows.
  • Emotion: [happy], [sad], [angry], [surprised], [fearful], [disgusted]
  • Delivery Style: [laughing], [whispering]
For example:
[happy] I can't believe this is happening.
Best practices: Use only one emotion or delivery style markup, placed at the beginning of your text. Using several of these markups, or placing them mid-text, may produce mixed results; instead, split the text into separate requests with the markup at the start of each. See our Best Practices guide for more details.
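The splitting advice above can be sketched as a small preprocessing step. This is an illustration under stated assumptions rather than an official helper: split_by_style is a hypothetical name, markups are matched exactly as listed on this page (lowercase), and each returned chunk is meant to be sent as its own request.

```python
import re

# Emotion and delivery style markups listed on this page.
STYLE_MARKUPS = [
    "happy", "sad", "angry", "surprised", "fearful", "disgusted",
    "laughing", "whispering",
]
STYLE_RE = re.compile(r"\[(?:" + "|".join(STYLE_MARKUPS) + r")\]")


def split_by_style(text: str) -> list[str]:
    """Split text so that each chunk starts with at most one style markup."""
    chunks = []
    start = 0
    # Cut the text immediately before every style markup after the first.
    boundaries = [m.start() for m in STYLE_RE.finditer(text)] + [len(text)]
    for end in boundaries:
        piece = text[start:end].strip()
        # Skip empty pieces and pieces that contain only a markup with no text.
        if piece and not STYLE_RE.fullmatch(piece):
            chunks.append(piece)
        start = end
    return chunks


script = "[happy] I can't believe this is happening. [sad] But it will not last."
for chunk in split_by_style(script):
    print(chunk)
```

Non-verbal markups such as [sigh] are deliberately left in place, because they are positional and can appear anywhere within a chunk.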

Non-verbal Vocalization

Non-verbal vocalization markups insert a non-verbal sound at the point where the markup appears in the text.
  • [breathe], [clear_throat], [cough], [laugh], [sigh], [yawn]
For example:
[clear_throat] Did you hear what I said? [sigh] You never listen to me!
Best practices: You can use multiple non-verbal vocalizations within a single piece of text to add the appropriate vocal effects throughout the speech.
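Because some markups look alike (for example, the delivery style [laughing] and the vocalization [laugh]), it can help to check bracketed tags against the lists on this page before sending text. The sketch below is hypothetical client-side code, and it assumes markups must match the documented lowercase spelling exactly, which this page does not confirm.

```python
import re

# Markups documented on this page, grouped by kind.
EMOTION = {"happy", "sad", "angry", "surprised", "fearful", "disgusted"}
DELIVERY = {"laughing", "whispering"}
NON_VERBAL = {"breathe", "clear_throat", "cough", "laugh", "sigh", "yawn"}
KNOWN = EMOTION | DELIVERY | NON_VERBAL

# Any run of non-bracket characters enclosed in square brackets.
BRACKET_RE = re.compile(r"\[([^\[\]]+)\]")


def unknown_markups(text: str) -> list[str]:
    """Return bracketed names that are not in the documented markup lists."""
    return [name for name in BRACKET_RE.findall(text) if name not in KNOWN]


line = "[clear_throat] Did you hear what I said? [sigh] You never listen to me!"
print(unknown_markups(line))
print(unknown_markups("[sneeze] Excuse me. [laughs]"))
```

Square brackets that are a legitimate part of the spoken text will also be reported, so the result is best treated as a warning rather than an error.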