General Best Practices
- Pick a suitable voice - Different voices will be better suited for different applications. Choose a voice that matches the emotional range and expression you’re looking for. For example, for a meditation app, select a more steady and calm voice. For an encouraging fitness coach, select a more expressive and excited voice.
- Pay attention to punctuation - Punctuation matters! Use exclamation points (!) to make the voice more emphatic and excited. Use an ellipsis (…) or a dash (—) to insert natural pauses. Where possible, end every sentence with punctuation.
- Use asterisks for emphasis - You can emphasize specific words by surrounding them with asterisks. For example, writing “We *need* a beach vacation” will cause the voice to stress the word “need” when speaking, whereas “We need a *beach* vacation” will emphasize the word “beach”. This can help clarify tone or intent in nuanced dialogue.
- Match the voice to the text language - Voices perform optimally when synthesizing text in the same language as the original voice. While cross-language synthesis is possible, you’ll achieve the best quality, pronunciation, and naturalness by matching the voice’s native language to your text content.
- Normalize complex text - If you find that the model is mispronouncing certain complex phrases like phone numbers or dollar amounts, it can help to normalize the text. This may be particularly helpful for non-English languages. Some examples of normalization include:
  - Phone numbers: “(123)456-7891” -> “one two three, four five six, seven eight nine one”
  - Dates: “5/6/2025” -> “may sixth twenty twenty-five” (helpful since date formats may vary)
  - Times: “12:55 PM” -> “twelve fifty-five PM”
  - Emails: “test@example.com” -> “test at example dot com”
  - Monetary values: “$5,342.29” -> “five thousand three hundred and forty-two dollars and twenty-nine cents”
  - Symbols: “2+2=4” -> “two plus two equals four”
 
- Tune the temperature - The temperature controls how random the audio output is. Higher values produce more varied output and can lead to more expressive results, which can be desirable for generating barks, demo clips, or other non-real-time use cases. Lower temperatures produce more deterministic output, though temperatures that are too low will often produce poor results. The default is 1.1. For real-time use cases, we recommend keeping the temperature between 0.6 and 1.
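The phone number and email normalizations listed above can be sketched as a small preprocessing step that runs before the text is sent for synthesis. This is a minimal illustration, not part of any TTS SDK: the function names (`normalize_text`, `spell_digits`) and the regular expressions are assumptions, and the phone pattern only covers the “(123)456-7891” shape used in the example. Dates, times, and monetary values need a number-to-words step that is not shown here.

```python
import re

# Assumption: all names and patterns below are hypothetical helpers for
# illustration; they are not part of any TTS API.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Matches only the "(123)456-7891" shape, with optional space after ")".
PHONE_RE = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})")
# Simple email shape: local part, "@", dotted domain.
EMAIL_RE = re.compile(r"\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)")


def spell_digits(digits: str) -> str:
    """Spell a string of digits one digit at a time."""
    return " ".join(DIGIT_WORDS[int(d)] for d in digits)


def _phone_to_words(match: re.Match) -> str:
    # Commas between the digit groups give the voice a short pause.
    return ", ".join(spell_digits(group) for group in match.groups())


def _email_to_words(match: re.Match) -> str:
    local, domain = match.groups()
    return f"{local.replace('.', ' dot ')} at {domain.replace('.', ' dot ')}"


def normalize_text(text: str) -> str:
    """Rewrite phone numbers and emails as speakable words."""
    text = PHONE_RE.sub(_phone_to_words, text)
    text = EMAIL_RE.sub(_email_to_words, text)
    return text


print(normalize_text("Call (123)456-7891 or email test@example.com."))
```

Running the normalization on your side keeps it deterministic, so the same input text always produces the same spoken form regardless of language or voice.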
Advanced Tips
Natural, Conversational Speech
Natural human conversation is not perfect. It’s full of filler words, pauses, and other natural speech patterns that make it sound more human. Our TTS models are trained to generate the requested text as is, in order to produce the most accurate and consistent output that can be used for a wide range of applications. After all, not all applications want to have a bunch of filler words inserted into the speech! To generate natural, conversational speech, you can use the following techniques:
- Insert filler words like “uh”, “um”, “well”, “like”, and “you know” in the text. For example, instead of:
change it to:
If the text is already being generated using an LLM, you can add instructions in the prompt to insert filler words in the response. Alternatively, you can use a small LLM to insert filler words given a piece of text.
- Use audio markups to add non-verbal vocalizations like [sigh], [breathe], and [clear_throat]. These vocalizations can make the speech sound more natural.
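One way to apply the LLM-based approach to filler words is to wrap the text in a rewrite instruction before sending it to whichever model you use. The sketch below only builds the prompt string. The function name `build_filler_prompt` and the prompt wording are assumptions made for illustration, and the call to the LLM itself is left out because it depends on your provider.

```python
# Assumption: the function name and the prompt wording are hypothetical;
# adapt them to the LLM and style you actually use.
DEFAULT_FILLERS = ("uh", "um", "well", "like", "you know")


def build_filler_prompt(text: str, fillers=DEFAULT_FILLERS) -> str:
    """Build an instruction asking an LLM to add filler words to `text`."""
    filler_list = ", ".join(fillers)
    return (
        "Rewrite the text below so it sounds like natural, conversational "
        "speech. "
        f"Sparingly insert filler words such as: {filler_list}. "
        "Keep the meaning unchanged and return only the rewritten text.\n\n"
        f"Text: {text}"
    )


print(build_filler_prompt("I think we should leave early tomorrow."))
```

Asking for the rewritten text only, with no commentary, makes it easier to pass the model's response straight to the TTS request.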
Audio Markups
This feature is currently experimental, and is not recommended for real-time, production use cases.
- Choose contextually appropriate markups - Markups will work best when they make sense with the text content. When markups conflict with the text, the model may struggle to handle the contradiction. For example, the following phrase can be challenging:
The text is clearly grateful and sincere, which contradicts the angry markup.
- Avoid conflicting markups - When using multiple markups for a single text, ensure they don’t conflict with each other. For example, this markup can be problematic:
Yawning typically indicates boredom or tiredness, which rarely occurs alongside anger.
- Break up the text - Emotion and delivery style markups work best when placed at the beginning of the text, with a single markup per request. Using multiple emotion and delivery style markups, or placing them mid-text, may produce mixed results. Instead of making one request like this:
Break it into two requests:
- Repeat non-verbal vocalizations if necessary - If a non-verbal vocalization is consistently being omitted, it may help to repeat the markup to ensure that it is vocalized. This works best for vocalizations where repetition sounds natural, such as [laugh] [laugh] or [cough] [cough].
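The “Break up the text” tip above can be automated with a small helper that splits a script into one chunk per emotion or delivery style markup, so that each request starts with a single markup. This is a sketch under stated assumptions: the function name `split_by_emotion_markup` is hypothetical, and the set of markup names is illustrative ([angry] appears in this guide, while [happy] and [sad] are placeholders you should replace with the markups your model supports). Non-verbal markups such as [laugh] are left in place.

```python
import re

# Assumption: illustrative emotion/delivery markup names only.
EMOTION_MARKUPS = frozenset({"angry", "happy", "sad"})

# Matches any bracketed markup such as [angry] or [clear_throat].
MARKUP_RE = re.compile(r"\[(\w+)\]")


def split_by_emotion_markup(text: str, emotion_markups=EMOTION_MARKUPS):
    """Split `text` so each chunk begins at an emotion markup.

    Text before the first emotion markup becomes its own chunk.
    """
    chunks = []
    start = 0
    for match in MARKUP_RE.finditer(text):
        # Start a new chunk at every emotion markup that is not already
        # at the beginning of the current chunk.
        if match.group(1) in emotion_markups and match.start() > start:
            chunk = text[start:match.start()].strip()
            if chunk:
                chunks.append(chunk)
            start = match.start()
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks


script = "[happy] I got the job! [laugh] [sad] But I have to move away."
for request_text in split_by_emotion_markup(script):
    print(request_text)
```

Each returned chunk can then be sent as its own synthesis request and the resulting audio concatenated in order.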