When an LLM generates text that gets fed into TTS, the default output often sounds flat and unnatural. LLMs tend to produce clean, well-formatted text, but clean text isn’t the same as speakable text. Dates stay as 12/04, acronyms aren’t expanded, and there are no cues for emphasis, pauses, or emotion. This guide shows you what to add to your LLM system prompt so that its output is optimized for Inworld TTS.

Quality Dimensions

Emphasis

Use asterisks around words to make TTS stress them. Exclamation marks add energy, and ellipses create trailing-off effects.
Prompt snippet:
Use asterisks (*word*) to emphasize key words in your response — focus on
prices, deadlines, action items, or any word the listener needs to catch.
Use punctuation to convey tone:
- Exclamation marks for excitement or urgency
- Ellipsis (...) for trailing off, hesitation, or leaving a thought unfinished
  Example: "I thought it would work, but..."
Before (no emphasis guidance):
I think this is a really important point and you should consider it carefully.
After (with emphasis guidance):
I think this is a *really* important point, and you should consider it *carefully*.
Use single asterisks only (*word*). Double asterisks (**word**) will cause TTS to read the asterisk characters aloud instead of emphasizing the word.
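Because LLMs often fall back to markdown-style bold, it can help to guard against double asterisks in code as well as in the prompt. The following is a minimal sketch of such a post-processing step; the function name `normalize_emphasis` is hypothetical and not part of any Inworld SDK.

```python
import re

# Matches two or more asterisks on each side of a span, e.g. **word** or ***word***
_MULTI_ASTERISK = re.compile(r"\*{2,}([^*\n]+?)\*{2,}")


def normalize_emphasis(text: str) -> str:
    """Collapse **word** (or ***word***) to *word* before sending text to TTS."""
    return _MULTI_ASTERISK.sub(r"*\1*", text)


print(normalize_emphasis("This is a **really** important point."))
# → This is a *really* important point.
```

Text that already uses single asterisks passes through unchanged, so the function is safe to run on every response.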

Pronunciation

For uncommon words like brand names, proper nouns, and technical terms, Inworld TTS supports inline IPA phoneme notation. You can provide a pronunciation dictionary in your system prompt that the LLM substitutes inline.
Prompt snippet:
When you use any of the following words, replace them with their IPA pronunciation
inline using slash notation:
- "Crete" → /kriːt/
- "Yosemite" → /joʊˈsɛmɪti/
- "Nguyen" → /ŋwɪən/
- "Acai" → /ɑːsɑːˈiː/
Before (no pronunciation guidance):
You should visit Crete for your honeymoon.
After (with IPA substitution):
You should visit /kriːt/ for your honeymoon.
Inworld TTS reads the IPA notation and produces the correct pronunciation. See Custom Pronunciation for details on finding the right IPA phonemes.
Another common approach is to use a string parser that replaces words from your pronunciation dictionary with their IPA notation before passing the text to TTS. This works well as a post-processing step when you don’t want to add IPA instructions to your LLM prompt, or when the same dictionary needs to be applied consistently across multiple LLM providers.
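The string-parser approach can be sketched in a few lines. This is one possible implementation, not an Inworld utility: the names `PRONUNCIATIONS` and `apply_pronunciations` are hypothetical, and matching is assumed to be whole-word and case-insensitive.

```python
import re

# Example dictionary taken from the prompt snippet above
PRONUNCIATIONS = {
    "Crete": "/kriːt/",
    "Yosemite": "/joʊˈsɛmɪti/",
    "Nguyen": "/ŋwɪən/",
    "Acai": "/ɑːsɑːˈiː/",
}


def apply_pronunciations(text: str, dictionary: dict[str, str]) -> str:
    """Replace whole-word dictionary terms with their IPA notation."""
    if not dictionary:
        return text
    # Longest terms first so multi-word entries win over shorter overlapping ones
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): ipa for term, ipa in dictionary.items()}
    # Fall back to the matched text if the lowercase lookup ever misses
    return pattern.sub(lambda m: lookup.get(m.group(1).lower(), m.group(0)), text)


print(apply_pronunciations("You should visit Crete for your honeymoon.", PRONUNCIATIONS))
# → You should visit /kriːt/ for your honeymoon.
```

Word boundaries keep the replacement from firing inside longer words, so "concrete" is left alone.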

Pauses and Pacing

Punctuation controls pacing in TTS. Periods create natural pauses between thoughts. Commas insert shorter breaks. Sentence length affects overall rhythm: short sentences speed things up, longer sentences slow them down.
Prompt snippet:
Control pacing through punctuation and sentence structure:
- Use periods to separate thoughts and create pauses
- Use commas for shorter breaks within sentences
- Use ellipsis (...) to create a lingering pause or beat
- Use short sentences for emphasis and urgency
- Use longer sentences for calm, measured delivery
Before (flat pacing):
The results are in and we exceeded our target by 40 percent so this is the best quarter we have ever had.
After (with pacing guidance):
The results are in. We exceeded our target... by *forty percent*. This is the *best* quarter we have ever had.

Non-verbal Vocalizations

Inworld TTS supports non-verbal tokens that add human-like sounds: [sigh], [laugh], [breathe], [cough], [clear_throat], [yawn]. These make speech sound more natural and emotionally grounded.
Audio markups are currently experimental and only support English.
Prompt snippet:
Insert non-verbal vocalizations where they would naturally occur in conversation:
- [sigh] for frustration, relief, or resignation
- [laugh] for amusement or warmth
- [breathe] before delivering important or emotional statements
- [cough] or [clear_throat] for naturalistic transitions
- [yawn] for tiredness

Place these tokens inline in your text, e.g.: "[sigh] I really thought that would work."
Before (no vocalizations):
I really thought that would work. Oh well, let’s try again.
After (with vocalizations):
[sigh] I *really* thought that would work. [laugh] Oh well, let’s try again.
See Audio Markups for the full list of supported markups including emotion and delivery style tags.
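An LLM may occasionally invent bracketed tokens that the TTS does not recognize. A small check like the one below can surface them during testing. It is a sketch only: the function name is hypothetical, and the default allowed set contains just the six vocalizations listed in this section, so extend it with any other markups you rely on from Audio Markups.

```python
import re

# The six non-verbal tokens listed in this guide; pass a larger set if you use other markups
DOCUMENTED_VOCALIZATIONS = frozenset(
    {"sigh", "laugh", "breathe", "cough", "clear_throat", "yawn"}
)

_BRACKET_TOKEN = re.compile(r"\[([A-Za-z_]+)\]")


def find_unknown_tokens(text: str, allowed=DOCUMENTED_VOCALIZATIONS) -> list[str]:
    """Return bracketed tokens in the text that are not in the allowed set."""
    return [tok for tok in _BRACKET_TOKEN.findall(text) if tok not in allowed]


print(find_unknown_tokens("[sigh] I *really* thought that would work. [giggle] Oh well."))
# → ['giggle']
```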

Conversational Naturalness

Natural human speech is full of filler words like uh, um, well, like, you know. Adding these to LLM output makes TTS sound less robotic and more conversational.
Prompt snippet:
To sound natural and conversational, include filler words where a human speaker
would naturally use them:
- "uh" and "um" for thinking moments
- "well" and "so" for transitions
- "like" and "you know" for casual emphasis

Example: "So, uh, I was thinking we could, you know, try a different approach."
Before (no fillers):
I was thinking we could try a different approach.
After (with fillers):
So, uh, I was thinking we could, you know, try a *different* approach.
Filler words are best for casual, conversational use cases. Skip them for formal applications like news reading, professional narration, or customer support.

Output Length

LLMs tend to be verbose. A detailed paragraph may read well on screen, but sounds unnatural and exhausting when spoken aloud. Keeping responses short produces better-sounding speech and reduces latency. A good default is to ask your LLM to respond in 1–2 sentences unless the user’s query specifically demands a longer answer. Use sentences as your length unit, not words or characters. LLMs operate on tokens, so word and character counts are unreliable constraints.
Prompt snippet:
Keep your responses to 1-2 sentences unless the user's question specifically
requires a longer explanation. Prefer concise, direct answers.
Before (too verbose):
Well, the weather forecast for tomorrow is showing that there will be partly cloudy skies throughout the morning hours, with temperatures expected to reach a high of around seventy-five degrees Fahrenheit by the early afternoon, and then cooling down to approximately sixty degrees in the evening.
After (concise):
Tomorrow looks like partly cloudy skies, with a high around *seventy-five* and cooling to sixty by evening.
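To monitor whether the LLM actually respects the sentence limit, you can count sentences in its output before synthesis. The sketch below uses a naive punctuation-based splitter, so abbreviations such as "Dr." and mid-sentence ellipses will be counted as sentence breaks; the function names are hypothetical.

```python
import re

# Split on whitespace that follows sentence-ending punctuation (naive heuristic)
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into rough sentences using terminal punctuation."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def exceeds_sentence_limit(text: str, max_sentences: int = 2) -> bool:
    """Return True when the response has more sentences than the limit."""
    return len(split_sentences(text)) > max_sentences


reply = "Tomorrow looks like partly cloudy skies, with a high around *seventy-five* and cooling to sixty by evening."
print(exceeds_sentence_limit(reply))
# → False
```

A failed check is better treated as a signal to tighten the prompt than as a reason to truncate the reply, since cutting mid-thought tends to sound worse than a slightly long answer.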

Example Prompt Templates

Below are complete, copyable system prompt blocks tailored for common use cases. Each template combines the techniques above into a ready-to-use prompt.
Use this template for chatbots, AI companions, virtual friends, and other informal conversational applications.
## Speech Output Rules

Your responses will be converted to speech using TTS. Follow these
rules to produce natural, expressive spoken output:

### Expressiveness
- Use *asterisks* to emphasize key words
- Use exclamation marks for excitement, ellipsis for trailing off
- Insert non-verbal vocalizations where natural:
  [sigh], [laugh], [breathe], [cough], [clear_throat], [yawn]
  Example: "[laugh] That's *exactly* what I was thinking!"

### Naturalness
- Include filler words (uh, um, well, like, you know) where a human would naturally pause
- Vary sentence length for natural rhythm
- Use contractions (don't, can't, I'm, we're) instead of formal forms

### Pronunciation
- Replace uncommon proper nouns with IPA: e.g., /kriːt/ for Crete
[Add your pronunciation dictionary here]

### Text Formatting
- Write numbers in spoken form: "twenty-three" not "23"
- Write dates in spoken form: "march fifteenth" not "3/15"
- Never use markdown formatting, bullet points, or structured text
- Never use emojis or special characters
- Write everything as natural spoken sentences
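If your pronunciation dictionary lives in code, you can generate the lines that replace the "[Add your pronunciation dictionary here]" placeholder in the template rather than maintaining them by hand. A minimal sketch, assuming the dictionary is a plain mapping from word to IPA string (the function name is hypothetical):

```python
def render_pronunciation_rules(dictionary: dict[str, str]) -> str:
    """Render a word-to-IPA mapping as prompt lines in the style used in this guide."""
    return "\n".join(f'- "{word}" → {ipa}' for word, ipa in dictionary.items())


rules = render_pronunciation_rules({"Crete": "/kriːt/", "Nguyen": "/ŋwɪən/"})
print(rules)
# → - "Crete" → /kriːt/
# → - "Nguyen" → /ŋwɪən/
```

Generating the prompt lines from the same mapping you use for post-processing keeps the two approaches from drifting apart.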

Notes on Normalization

Inworld TTS includes an optional normalization step that automatically expands dates, numbers, emails, currencies, and symbols into their spoken forms before synthesis. How much text expansion your LLM prompt needs to request depends on whether this step is on or off. Toggle normalization with the applyTextNormalization parameter in your TTS API request:
  • ON — always normalize
  • OFF — skip normalization entirely
  • APPLY_TEXT_NORMALIZATION_UNSPECIFIED (default) — TTS decides per-request
Normalization adds slight latency to each TTS request. For latency-sensitive applications, consider having your LLM handle text expansion directly and setting applyTextNormalization to OFF.
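As an illustration, the helper below builds a request body with normalization set explicitly. Only the `applyTextNormalization` parameter and its three values come from this guide; the `text` and `voiceId` field names are assumptions for the sketch, so check the TTS API reference for the actual request schema.

```python
VALID_NORMALIZATION = {"ON", "OFF", "APPLY_TEXT_NORMALIZATION_UNSPECIFIED"}


def build_tts_payload(
    text: str,
    voice_id: str,
    normalization: str = "APPLY_TEXT_NORMALIZATION_UNSPECIFIED",
) -> dict:
    """Build a TTS request body with an explicit normalization setting."""
    if normalization not in VALID_NORMALIZATION:
        raise ValueError(f"Unsupported normalization value: {normalization!r}")
    return {
        "text": text,          # assumed field name
        "voiceId": voice_id,   # assumed field name
        "applyTextNormalization": normalization,  # parameter described above
    }


# The LLM has already written everything in spoken form, so skip normalization
payload = build_tts_payload("It costs forty-nine dollars.", "hypothetical-voice", "OFF")
```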

With Normalization On

Inworld TTS handles common expansions automatically. Your LLM prompt still benefits from guiding edge cases that normalization may not cover:
  • Ambiguous dates: 01/02/2025 could be January 2nd or February 1st depending on locale
  • Domain-specific abbreviations: RDS, k8s, HIPAA may not expand as expected
  • Uncommon acronyms: Industry-specific terms that aren’t in common usage
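Ambiguous dates are easy to detect mechanically. The sketch below flags slash-separated dates where both leading fields could be a month, which is the case normalization cannot resolve without knowing the locale. The function name is hypothetical, and the pattern only covers the numeric slash format shown above.

```python
import re

_NUMERIC_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{2,4})\b")


def find_ambiguous_dates(text: str) -> list[str]:
    """Return numeric dates whose first two fields could both be a month."""
    ambiguous = []
    for match in _NUMERIC_DATE.finditer(text):
        first, second = int(match.group(1)), int(match.group(2))
        # 05/05 reads the same either way; 12/25 can only be month/day
        if 1 <= first <= 12 and 1 <= second <= 12 and first != second:
            ambiguous.append(match.group(0))
    return ambiguous


print(find_ambiguous_dates("Ships on 01/02/2025, arrives by 12/25/2025."))
# → ['01/02/2025']
```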

With Normalization Off

The LLM must handle all text expansion. Your prompt must instruct the LLM to write everything in spoken form: no digits, no symbols, no shorthand.

Comparison Table

| Raw Text | Normalization Produces | LLM Should Produce (Normalization Off) |
| --- | --- | --- |
| 12/04/2025 | "twelve oh four twenty twenty-five" | "december fourth, twenty twenty-five" |
| (555) 123-4567 | "five five five, one two three, four five six seven" | "five five five, one two three, four five six seven" |
| $1,249.99 | "one thousand two hundred forty-nine dollars and ninety-nine cents" | "twelve hundred forty-nine dollars and ninety-nine cents" |
| 3:45 PM | "three forty-five PM" | "three forty-five PM" |
| test@example.com | "test at example dot com" | "test at example dot com" |
| 2 + 2 = 4 | "two plus two equals four" | "two plus two equals four" |

When to Use Each

  • Normalization on (recommended for most cases): Less prompt engineering required. Inworld TTS handles standard expansions and you only need to guide edge cases.
  • Normalization off: Use when you need full control over how text is spoken, or when your domain has specific pronunciation requirements that conflict with default expansion rules.
Prompt snippet for normalization off:
CRITICAL: Write ALL text in fully spoken form. Never use digits, symbols, or abbreviations.
- Dates: "december fourth, twenty twenty-five" not "12/04/2025"
- Phone numbers: "five five five, one two three, four five six seven" not "(555) 123-4567"
- Currency: "forty-nine dollars and ninety-nine cents" not "$49.99"
- Times: "three forty-five PM" not "3:45 PM"
- Emails: "john at example dot com" not "john@example.com"
- Symbols: "two plus two equals four" not "2+2=4"
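With normalization off, nothing downstream will catch a stray digit or symbol, so it can be worth checking LLM output before synthesis. The following is a rough sketch with a hypothetical function name; the symbol list is an assumption and deliberately excludes asterisks, brackets, and slashes, since those are used for emphasis, non-verbal tokens, and IPA notation.

```python
import re

# Digits plus a small, assumed set of symbols that should have been spelled out
_UNSPOKEN = re.compile(r"[0-9$€£%@#&+=<>]")


def find_unspoken_characters(text: str) -> list[str]:
    """Return the distinct digits and symbols that were not written in spoken form."""
    return sorted(set(_UNSPOKEN.findall(text)))


print(find_unspoken_characters("It costs $49.99."))
# → ['$', '4', '9']
print(find_unspoken_characters("It costs forty-nine dollars and ninety-nine cents."))
# → []
```

A non-empty result can trigger a retry, or a fallback request with normalization turned on.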

Tips for Iterating

  • Test with the TTS Playground: Use the TTS Playground to quickly hear how your LLM output sounds when synthesized. Paste in sample outputs and iterate on your prompt until the speech quality meets your needs.
  • Tune LLM temperature for consistency: Lower temperatures produce more consistent output that follows your formatting rules reliably. Higher temperatures can produce more expressive text but may ignore specific instructions. Start around 0.7, then lower it if the output drifts from your formatting rules.
  • Iterate on your pronunciation dictionary: Start with a small set of terms and expand as you discover mispronunciations during testing. Ask an LLM to generate IPA for new terms.

Next Steps