Generating speech - Inworld AI Documentation

This guide covers techniques and best practices for generating high-quality, natural-sounding speech for your applications.

If you’re using an LLM to generate text for TTS, see our dedicated guide on Prompting for TTS for prompt templates and techniques.

General Best Practices

Pick a suitable voice - Different voices will be better suited for different applications. Choose a voice that matches the emotional range and expression you’re looking for. For example, for a meditation app, select a more steady and calm voice. For an encouraging fitness coach, select a more expressive and excited voice.
Pay attention to punctuation - Punctuation matters! Use exclamation points (!) to make the voice more emphatic and excited. Use periods to insert natural pauses. Where possible, end every sentence with punctuation.
Use asterisks for emphasis - You can emphasize specific words by surrounding them with asterisks. For example, writing “We *need* a beach vacation” will cause the voice to stress the word “need” when speaking, whereas “We need a *beach* vacation” will emphasize the word “beach”. This can help clarify tone or intent in nuanced dialogue.
Match the voice to the text language - Voices perform optimally when synthesizing text in the same language as the original voice. While cross-language synthesis is possible, you’ll achieve the best quality, pronunciation, and naturalness by matching the voice’s native language to your text content.
Normalize complex text - If you find that the model is mispronouncing certain complex phrases like phone numbers or dollar amounts, it can help to normalize the text. This may be particularly helpful for non-English languages. Some examples of normalization include:
- Phone numbers: “(123)456-7891” -> “one two three, four five six, seven eight nine one”
- Dates: “5/6/2025” -> “May sixth twenty twenty-five” (helpful since date formats may vary)
- Times: “12:55 PM” -> “twelve fifty-five PM”
- Emails: “test@example.com” -> “test at example dot com”
- Monetary values: “$5,342.29” -> “five thousand three hundred and forty-two dollars and twenty-nine cents”
- Symbols: “2+2=4” -> “two plus two equals four”
Tune the temperature - The temperature controls how random the audio output is. Higher values result in more random outputs and can lead to more expressive results. This can be desirable for generating barks, demo clips, or other non-real-time use cases. Lower temperatures result in more deterministic output, though temperatures that are too low will often produce poor results. For real-time use cases, we recommend keeping the temperature between 0.6 and 1, with the default being 1.1.
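The text normalization examples above can be sketched in Python. This is a minimal illustration under stated assumptions, not an Inworld API: all function names are hypothetical, and the rules cover only phone numbers in the “(123)456-7891” format, simple email addresses, and single-digit arithmetic with + and =, in English. Dates, times, and monetary values need locale-aware handling and are left out.

```python
import re

# Digit names used when a number should be read out digit by digit.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

# Spoken forms for a few symbols (illustrative, not exhaustive).
SYMBOLS = {"+": "plus", "=": "equals"}

PHONE_RE = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})")
EMAIL_RE = re.compile(r"\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)\b")
MATH_RE = re.compile(r"\d(?:\s*[+=]\s*\d)+")  # single-digit operands only


def spell_digits(digits: str) -> str:
    """Read a string of digits one digit at a time."""
    return " ".join(DIGITS[int(d)] for d in digits)


def normalize_phone(match: re.Match) -> str:
    # Commas between digit groups give the voice a short pause.
    return ", ".join(spell_digits(group) for group in match.groups())


def normalize_email(match: re.Match) -> str:
    local, domain = match.groups()
    return f"{local} at {domain}".replace(".", " dot ")


def normalize_math(match: re.Match) -> str:
    tokens = re.findall(r"\d|[+=]", match.group())
    return " ".join(SYMBOLS.get(t) or DIGITS[int(t)] for t in tokens)


def normalize_text(text: str) -> str:
    """Apply each rule in turn before sending the text to TTS."""
    text = PHONE_RE.sub(normalize_phone, text)
    text = EMAIL_RE.sub(normalize_email, text)
    return MATH_RE.sub(normalize_math, text)


if __name__ == "__main__":
    print(normalize_text("Call (123)456-7891 or email test@example.com."))
```

Running normalization as a separate step before the TTS request keeps the rules easy to test and to swap out per language.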

Latency

For real-time use cases, minimizing latency is critical. Here are some tips and techniques you can use:

Stream TTS output - Instead of waiting for the entire generation to finish (which may take some time for long text), start playback as soon as the first chunk arrives so that the user doesn’t have to wait. Inworld’s WebSocket streaming should be the lowest-latency option, but streaming over HTTP will also outperform a non-streaming setup.
Chunk TTS input - Instead of sending a large request to the TTS model (whether it’s pre-written or generated by an LLM), consider breaking it into sentence chunks and sending them one by one. The Inworld Agent Runtime provides built-in tools to handle this in a performant manner.
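One way to chunk input is to buffer incoming text and release it sentence by sentence. The sketch below assumes the text arrives as an iterable of string fragments (for example, tokens streamed from an LLM). `iter_sentences` is a hypothetical helper, not part of the Inworld SDK or Agent Runtime, and its sentence-boundary rule is deliberately naive: it splits wherever ., ! or ? is followed by whitespace, so abbreviations such as “Dr.” will also trigger a split.

```python
import re
from typing import Iterable, Iterator

# A sentence is considered complete once ., ! or ? is followed by whitespace.
SENTENCE_END = re.compile(r"[.!?]+\s+")


def iter_sentences(fragments: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they are available."""
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        while True:
            match = SENTENCE_END.search(buffer)
            if match is None:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    # Flush whatever is left once the input ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    tokens = ["Hello", " there", "! How", " are you", " today? ", "I'm", " fine."]
    for sentence in iter_sentences(tokens):
        # In a real application, each sentence would be sent to the TTS
        # service here, ideally over a streaming connection.
        print(sentence)
```

Because the generator yields each sentence as soon as its boundary is seen, the first TTS request can be sent before the rest of the text has been produced.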

Advanced Tips

Natural, Conversational Speech

Natural human conversation is not perfect. It’s full of filler words, pauses, and other natural speech patterns that make it sound more human. Our TTS models are trained to generate the requested text as is, in order to produce the most accurate and consistent output that can be used for a wide range of applications. After all, not all applications want to have a bunch of filler words inserted into the speech! To generate natural, conversational speech, you can use the following techniques:

Insert filler words like uh, um, well, like, and you know in the text. For example, instead of:
```
I'm not too sure about that.
```
change it to:
```
Uh, I'm not uh too sure about that.
```
If the text is already being generated using an LLM, you can add instructions in the prompt to insert filler words in the response. Alternatively, you can use a small LLM to insert filler words given a piece of text.
Use audio markups to add non-verbal vocalizations like [sigh], [breathe], [clear_throat]. These vocalizations can make the speech sound more human.

Audio Markups

This feature is currently experimental, and is not recommended for real-time, production use cases.

When using audio markups, there are a number of techniques for producing the best results.

Choose contextually appropriate markups - Markups will work best when they make sense with the text content. When markups conflict with the text, the model may struggle to handle the contradiction. For example, the following phrase can be challenging:
```
[angry] I appreciate your help and I’m really grateful for your kindness.
```
The text is clearly grateful and sincere, which contradicts the angry markup.
Avoid conflicting markups - When using multiple markups for a single text, ensure they don’t conflict with each other. For example, this markup can be problematic:
```
[angry] I can't believe you did that. [yawn] You never listen.
```
Yawning typically indicates boredom or tiredness, which rarely occurs alongside anger.
Break up the text - Emotion and delivery style markups work best when placed at the beginning of text with a single markup per request. Using multiple emotion and delivery style markups or placing them mid-text may produce mixed results. Instead of making one request like this:
```
[angry] I can't believe you didn't save the last bite of cake for me. [laughing] Got you! I was just kidding.
```
Break it into two requests:
```
[angry] I can't believe you didn't save the last bite of cake for me.
```
```
[laughing] Got you! I was just kidding.
```
Repeat non-verbal vocalizations if necessary - If a non-verbal vocalization is consistently being omitted, it may help to repeat the markup to ensure that it is vocalized. This works best for vocalizations where repetition sounds natural, such as [laugh] [laugh] or [cough] [cough].
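The “break up the text” advice can be automated by splitting a script at each emotion or delivery style markup, so that every request starts with exactly one. This is a rough sketch: `split_by_style_markup` is a hypothetical helper, and `STYLE_MARKUPS` lists only the two style markups that appear in this guide, so it would need extending with whichever markups your model supports. Non-verbal vocalizations such as [yawn] or [laugh] are left in place.

```python
import re

# Assumption: only the style markups shown in this guide. Extend as needed.
STYLE_MARKUPS = ("angry", "laughing")

# Matches any one of the style markups, e.g. "[angry]".
STYLE_RE = re.compile(r"\[(?:%s)\]" % "|".join(map(re.escape, STYLE_MARKUPS)))


def split_by_style_markup(text: str) -> list[str]:
    """Split text so each piece has at most one style markup, at its start."""
    starts = [m.start() for m in STYLE_RE.finditer(text)]
    # Text before the first markup (if any) becomes its own request.
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    pieces = (text[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [p for p in pieces if p]


if __name__ == "__main__":
    script = ("[angry] I can't believe you didn't save the last bite of cake "
              "for me. [laughing] Got you! I was just kidding.")
    for request_text in split_by_style_markup(script):
        # Each piece would be sent as a separate TTS request.
        print(request_text)
```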
