12/04, acronyms aren’t expanded, and there are no cues for emphasis, pauses, or emotion.
This guide shows you what to add to your LLM system prompt so that its output is optimized for Inworld TTS.
Quality Dimensions
Emphasis
Use asterisks around words to make TTS stress them. Exclamation marks add energy, and ellipses create trailing-off effects.
Before:
I think this is a really important point and you should consider it carefully.
After (with emphasis guidance):
I think this is a *really* important point, and you should consider it *carefully*.
Pronunciation
For uncommon words such as brand names, proper nouns, and technical terms, Inworld TTS supports inline IPA phoneme notation. You can include a pronunciation dictionary in your system prompt so that the LLM substitutes the IPA spelling inline.
Before:
You should visit Crete for your honeymoon.
After (with IPA substitution):
You should visit /kriːt/ for your honeymoon.
Inworld TTS reads the IPA notation and produces the correct pronunciation. See Custom Pronunciation for details on finding the right IPA phonemes.
Another common approach is to use a string parser that replaces important-to-pronounce words from your pronunciation dictionary before passing the text to TTS. This works well as a post-processing step when you don’t want to add IPA instructions to your LLM prompt, or when the same dictionary must be applied consistently across multiple LLM providers.
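The string-parser approach can be sketched as a small post-processing function. This is a minimal sketch, assuming your dictionary maps exact, case-sensitive words to IPA written without the surrounding slashes; the names PRONUNCIATIONS and apply_pronunciations are hypothetical, not part of any Inworld SDK.

```python
import re

# Hypothetical pronunciation dictionary: word -> IPA (slashes added on output).
# Entries are illustrative; see Custom Pronunciation for finding the right IPA.
PRONUNCIATIONS = {
    "Crete": "kriːt",
}


def apply_pronunciations(text: str, dictionary: dict[str, str]) -> str:
    """Replace whole-word, case-sensitive matches with inline /IPA/ notation."""
    if not dictionary:
        return text  # an empty alternation would match everywhere
    # Try longer keys first so multi-word entries win over their sub-words.
    keys = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: f"/{dictionary[m.group(1)]}/", text)


if __name__ == "__main__":
    print(apply_pronunciations("You should visit Crete for your honeymoon.", PRONUNCIATIONS))
```

The word-boundary anchors keep "Crete" from matching inside longer words; if your terms appear in varying case, you would need to normalize or add case variants to the dictionary.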
Pauses and Pacing
Punctuation controls pacing in TTS. Periods create natural pauses between thoughts, and commas insert shorter breaks. Sentence length shapes the overall rhythm: short sentences speed things up, while longer sentences slow them down.
Before:
The results are in and we exceeded our target by 40 percent so this is the best quarter we have ever had.
After (with pacing guidance):
The results are in. We exceeded our target… by *forty percent*. This is the *best* quarter we have ever had.
Non-verbal Vocalizations
Inworld TTS supports non-verbal tokens that add human-like sounds: [sigh], [laugh], [breathe], [cough], [clear_throat], and [yawn]. These make speech sound more natural and emotionally grounded.
Audio markups are currently experimental and only support English.
Before:
I really thought that would work. Oh well, let’s try again.
After (with vocalizations):
[sigh] I *really* thought that would work. [laugh] Oh well, let’s try again.
See Audio Markups for the full list of supported markups, including emotion and delivery style tags.
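Because the LLM may invent tags that are not on the list above, it can help to filter its output before synthesis. The sketch below assumes tags are lowercase words in square brackets and only allows the six non-verbal tokens listed in this guide; NON_VERBAL_TAGS and strip_unknown_tags are hypothetical names, and you would extend the allowed set with any emotion or delivery style tags from Audio Markups that your prompt uses.

```python
import re

# The non-verbal tokens listed in this guide. Audio Markups documents further
# tags (emotion, delivery style); add them here if your prompt relies on them.
NON_VERBAL_TAGS = frozenset(
    {"sigh", "laugh", "breathe", "cough", "clear_throat", "yawn"}
)

# A bracketed lowercase token such as "[sigh]", plus any whitespace after it.
_TAG_RE = re.compile(r"\[([a-z_]+)\]\s*")


def strip_unknown_tags(text: str, allowed: frozenset[str] = NON_VERBAL_TAGS) -> str:
    """Remove bracketed tags that are not in `allowed`; keep everything else."""

    def keep_or_drop(match: re.Match) -> str:
        return match.group(0) if match.group(1) in allowed else ""

    return _TAG_RE.sub(keep_or_drop, text)


if __name__ == "__main__":
    print(strip_unknown_tags("[sigh] I *really* thought that would work. [giggle] Oh well."))
```

Dropping unknown tags, rather than passing them through, avoids the risk of a made-up tag being read aloud literally; the trade-off is that you must keep the allowed set in sync with the markups you actually prompt for.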
Conversational Naturalness
Natural human speech is full of filler words such as "uh," "um," "well," "like," and "you know." Adding these to LLM output makes TTS sound less robotic and more conversational.
Before:
I was thinking we could try a different approach.
After (with fillers):
So, uh, I was thinking we could, you know, try a *different* approach.
Output Length
LLMs tend to be verbose. A detailed paragraph may read well on screen, but it sounds unnatural and exhausting when spoken aloud. Keeping responses short produces better-sounding speech and reduces latency. A good default is to ask your LLM to respond in 1–2 sentences unless the user’s query specifically demands a longer answer. Use sentences as your length unit, not words or characters: LLMs operate on tokens, so word and character counts are unreliable constraints.
Before:
Well, the weather forecast for tomorrow is showing that there will be partly cloudy skies throughout the morning hours, with temperatures expected to reach a high of around seventy-five degrees Fahrenheit by the early afternoon, and then cooling down to approximately sixty degrees in the evening.
After (concise):
Tomorrow looks like partly cloudy skies, with a high around *seventy-five* and cooling to sixty by evening.
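The prompt instruction is the main length control, but a post-processing guard can catch replies where the model ignores it. This is a heuristic sketch, assuming sentences end with "!", "?", or a single "." followed by whitespace; limit_sentences is a hypothetical helper, and abbreviations such as "Dr." will be mistaken for sentence ends.

```python
import re

# Break on whitespace after "!", "?", or a single "." (not "..."). The "…"
# character is not treated as a sentence end, so trailing-off pauses used for
# pacing stay inside their sentence.
_SENTENCE_BREAK = re.compile(r"(?:(?<=[!?])|(?<=[^.]\.))\s+")


def limit_sentences(text: str, max_sentences: int = 2) -> str:
    """Keep at most `max_sentences` sentences of `text` (heuristic split)."""
    sentences = [s for s in _SENTENCE_BREAK.split(text.strip()) if s]
    return " ".join(sentences[:max_sentences])


if __name__ == "__main__":
    reply = ("The results are in. We exceeded our target… by *forty percent*. "
             "This is the *best* quarter we have ever had.")
    print(limit_sentences(reply))
```

Truncation silently drops content, so during prompt iteration it may be more useful to log how often the limit is hit and tighten the prompt instead.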
Example Prompt Templates
Below are complete, copyable system prompt blocks tailored for common use cases. Each template combines the techniques above into a ready-to-use prompt.
- Companion / Conversational
- Support / Sales
- Dev Tools / Technical
Use this template for chatbots, AI companions, virtual friends, and other informal conversational applications.
Notes on Normalization
Inworld TTS includes an optional normalization step that automatically expands dates, numbers, emails, currencies, and symbols into their spoken forms before synthesis. Understanding how normalization interacts with your LLM output is important for getting the best results. Toggle normalization with the applyTextNormalization parameter in your TTS API request:
- ON: always normalize
- OFF: skip normalization entirely
- APPLY_TEXT_NORMALIZATION_UNSPECIFIED (default): TTS decides per request
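A request body carrying this parameter might be assembled as below. Only applyTextNormalization and its three values come from this guide; the "text", "voiceId", and "modelId" field names and the placeholder voice and model IDs are assumptions, so check them against the Inworld TTS API reference before use. The sketch builds the JSON body only and does not send it.

```python
import json

# The three values this guide lists for applyTextNormalization.
NORMALIZATION_MODES = ("ON", "OFF", "APPLY_TEXT_NORMALIZATION_UNSPECIFIED")


def build_tts_request(text: str, voice_id: str, model_id: str,
                      normalization: str = "APPLY_TEXT_NORMALIZATION_UNSPECIFIED") -> dict:
    """Assemble a TTS request body (field names other than
    applyTextNormalization are assumptions)."""
    if normalization not in NORMALIZATION_MODES:
        raise ValueError(f"unknown normalization mode: {normalization!r}")
    return {
        "text": text,
        "voiceId": voice_id,    # assumed field name
        "modelId": model_id,    # assumed field name
        "applyTextNormalization": normalization,
    }


if __name__ == "__main__":
    # Latency-sensitive path: the LLM already wrote everything in spoken form.
    body = build_tts_request("Your total is forty dollars.",
                             "example-voice", "example-model",  # hypothetical IDs
                             normalization="OFF")
    print(json.dumps(body, indent=2))
```

Validating the mode locally catches typos such as "off" before they reach the API.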
Normalization adds slight latency to each TTS request. For latency-sensitive applications, consider having your LLM handle text expansion directly and setting applyTextNormalization to OFF.
With Normalization On
Inworld TTS handles common expansions automatically. Your LLM prompt still benefits from guiding edge cases that normalization may not cover:
- Ambiguous dates: 01/02/2025 could be January 2nd or February 1st, depending on locale
- Domain-specific abbreviations: RDS, k8s, and HIPAA may not expand as expected
- Uncommon acronyms: industry-specific terms that aren’t in common usage
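One way to remove date ambiguity is to let your application, which knows the user's locale, expand numeric dates before text reaches TTS. This is a minimal sketch, assuming slash-separated dates, a caller-supplied day_first flag, and US-style year reading for years 1100 to 2999; spoken_date and its helpers are hypothetical names. It produces the same target form as the comparison table below ("December fourth, twenty twenty-five").

```python
from datetime import datetime

_MONTHS = ["January", "February", "March", "April", "May", "June", "July",
           "August", "September", "October", "November", "December"]
_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
         "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety"]
# Cardinal words whose ordinal form is irregular.
_IRREGULAR = {"one": "first", "two": "second", "three": "third", "five": "fifth",
              "eight": "eighth", "nine": "ninth", "twelve": "twelfth"}


def _cardinal(n: int) -> str:
    """Spell out 0-99, e.g. 25 -> 'twenty-five'."""
    if n < 20:
        return _ONES[n]
    tens, ones = divmod(n, 10)
    return _TENS[tens] + (f"-{_ONES[ones]}" if ones else "")


def _ordinal(n: int) -> str:
    """Spell out an ordinal day, e.g. 4 -> 'fourth', 22 -> 'twenty-second'."""
    head, _, last = _cardinal(n).rpartition("-")
    if last in _IRREGULAR:
        last = _IRREGULAR[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"  # twenty -> twentieth
    else:
        last += "th"
    return f"{head}-{last}" if head else last


def _year(y: int) -> str:
    """Speak a year the common US way, e.g. 2025 -> 'twenty twenty-five'."""
    if not 1100 <= y <= 2999:
        raise ValueError(f"year outside supported range: {y}")
    if 2000 <= y <= 2009:
        return "two thousand" + (f" {_ONES[y - 2000]}" if y > 2000 else "")
    high, low = divmod(y, 100)
    if low == 0:
        return f"{_cardinal(high)} hundred"          # 1900 -> nineteen hundred
    if low < 10:
        return f"{_cardinal(high)} oh {_ONES[low]}"  # 1905 -> nineteen oh five
    return f"{_cardinal(high)} {_cardinal(low)}"


def spoken_date(raw: str, day_first: bool = False) -> str:
    """Expand a date like 12/04/2025 using an explicit field order."""
    parsed = datetime.strptime(raw, "%d/%m/%Y" if day_first else "%m/%d/%Y")
    return f"{_MONTHS[parsed.month - 1]} {_ordinal(parsed.day)}, {_year(parsed.year)}"


if __name__ == "__main__":
    print(spoken_date("01/02/2025"))                  # month first
    print(spoken_date("01/02/2025", day_first=True))  # day first
```

Month names come from a fixed list rather than strftime("%B"), whose output depends on the process locale.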
With Normalization Off
The LLM must handle all text expansion. Your prompt must instruct the LLM to write everything in spoken form: no digits, no symbols, and no shorthand.
Comparison Table
| Raw Text | Normalization Produces | LLM Should Produce (Normalization Off) |
|---|---|---|
| 12/04/2025 | "twelve oh four twenty twenty-five" | "December fourth, twenty twenty-five" |
| (555) 123-4567 | "five five five, one two three, four five six seven" | "five five five, one two three, four five six seven" |
| $1,249.99 | "one thousand two hundred forty-nine dollars and ninety-nine cents" | "twelve hundred forty-nine dollars and ninety-nine cents" |
| 3:45 PM | "three forty-five PM" | "three forty-five PM" |
| test@example.com | "test at example dot com" | "test at example dot com" |
| 2 + 2 = 4 | "two plus two equals four" | "two plus two equals four" |
When to Use Each
- Normalization on (recommended for most cases): Less prompt engineering required. Inworld TTS handles standard expansions and you only need to guide edge cases.
- Normalization off: Use when you need full control over how text is spoken, or when your domain has specific pronunciation requirements that conflict with default expansion rules.
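With normalization off, a quick automated check over sample LLM outputs can flag text that was not fully expanded before you listen to it. This is a heuristic sketch; the character set is an assumption, and it deliberately ignores asterisks, square brackets, and slashes because this guide uses them for emphasis, audio markups, and inline IPA. It cannot catch letter-only shorthand such as HIPAA; find_unspoken is a hypothetical name.

```python
import re

# Characters that usually mean text was not expanded into spoken form.
# Digits also catch dates, times, phone numbers, and shorthand like "k8s".
_UNSPOKEN = re.compile(r"[0-9$€£%@&#+=<>]")


def find_unspoken(text: str) -> list[str]:
    """Return the distinct suspicious characters in order of first appearance."""
    seen: list[str] = []
    for ch in _UNSPOKEN.findall(text):
        if ch not in seen:
            seen.append(ch)
    return seen


if __name__ == "__main__":
    for sample in ["Your total is $1,249.99.",
                   "Your total is twelve hundred forty-nine dollars and ninety-nine cents."]:
        print(repr(sample), "->", find_unspoken(sample) or "OK")
```

Running this over a batch of outputs shows which prompt rules the model is ignoring, which you can then confirm by ear in the TTS Playground.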
Tips for Iterating
- Test with the TTS Playground: Use the TTS Playground to quickly hear how your LLM output sounds when synthesized. Paste in sample outputs and iterate on your prompt until the speech quality meets your needs.
- Tune LLM temperature for consistency: Lower temperatures produce more consistent output that follows your formatting rules reliably. Higher temperatures can produce more expressive text but may ignore specific instructions. Start around 0.7 and adjust based on results.
- Iterate on your pronunciation dictionary: Start with a small set of terms and expand it as you discover mispronunciations during testing. You can ask an LLM to generate IPA for new terms.
Next Steps
Generating Speech
Best practices for synthesizing high-quality speech, including punctuation, emphasis, and temperature tuning.
Audio Markups
Control emotion, delivery style, and non-verbal vocalizations with markup tags.
Custom Pronunciation
Define exact pronunciations for uncommon words using inline IPA notation.