Prompting for TTS - Inworld AI Documentation

When an LLM generates text that gets fed into TTS, the default output often sounds flat and unnatural. LLMs tend to produce clean, well-formatted text, but clean text isn’t the same as speakable text. Dates stay as 12/04, acronyms aren’t expanded, and there are no cues for emphasis, pauses, or emotion. This guide shows you what to add to your LLM system prompt so that its output is optimized for Inworld TTS.

Quality Dimensions

Emphasis

Use asterisks around words to make TTS stress them. Exclamation marks add energy, and ellipses create trailing-off effects.

Prompt snippet:

Use asterisks (*word*) to emphasize key words in your response — focus on
prices, deadlines, action items, or any word the listener needs to catch.
Use punctuation to convey tone:
- Exclamation marks for excitement or urgency
- Ellipsis (...) for trailing off, hesitation, or leaving a thought unfinished
  Example: "I thought it would work, but..."

Before (no emphasis guidance):

I think this is a really important point and you should consider it carefully.

After (with emphasis guidance):

I think this is a *really* important point, and you should consider it *carefully*.

Use single asterisks only (*word*). Double asterisks (**word**) will cause TTS to read the asterisk characters aloud instead of emphasizing the word.
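LLMs often fall back on markdown-style double asterisks even when the prompt asks for single ones. The following is a minimal post-processing sketch, assuming you can modify the text between the LLM and the TTS request; the function name `normalize_emphasis` is illustrative and not part of any Inworld SDK.

```python
import re

# Matches **word** or **several words**; non-greedy so adjacent spans stay separate.
_DOUBLE_ASTERISKS = re.compile(r"\*\*(.+?)\*\*")


def normalize_emphasis(text: str) -> str:
    """Collapse markdown-style **bold** into single-asterisk *emphasis*."""
    return _DOUBLE_ASTERISKS.sub(r"*\1*", text)


print(normalize_emphasis("This is a **really** important point."))
```

Text that already uses single asterisks passes through unchanged, so the function can run on every response.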

Pronunciation

For uncommon words like brand names, proper nouns, and technical terms, Inworld TTS supports inline IPA phoneme notation. You can provide a pronunciation dictionary in your system prompt that the LLM substitutes inline.

Prompt snippet:

When you use any of the following words, replace them with their IPA pronunciation
inline using slash notation:
- "Crete" → /kriːt/
- "Yosemite" → /joʊˈsɛmɪti/
- "Nguyen" → /ŋwɪən/
- "Acai" → /ɑːsɑːˈiː/

Before (no pronunciation guidance):

You should visit Crete for your honeymoon.

After (with IPA substitution):

You should visit /kriːt/ for your honeymoon.

Inworld TTS reads the IPA notation and produces the correct pronunciation. See Custom Pronunciation for details on finding the right IPA phonemes.

Another common approach is to use a string parser that replaces important-to-pronounce words from your pronunciation dictionary before passing the text to TTS. This works well as a post-processing step when you don’t want to add IPA instructions to your LLM prompt, or when the same dictionary needs to be applied consistently across multiple LLM providers.
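The string-parser approach could look roughly like the sketch below. It assumes whole-word, case-insensitive matching is what you want; the dictionary entries are copied from the prompt snippet in this section, and the names `PRONUNCIATIONS` and `apply_pronunciations` are hypothetical.

```python
import re

# Example dictionary; entries are taken from the prompt snippet in this section.
PRONUNCIATIONS = {
    "Crete": "/kriːt/",
    "Yosemite": "/joʊˈsɛmɪti/",
    "Nguyen": "/ŋwɪən/",
    "Acai": "/ɑːsɑːˈiː/",
}


def apply_pronunciations(text: str, dictionary: dict[str, str]) -> str:
    """Replace whole-word, case-insensitive matches with their IPA form."""
    if not dictionary:
        return text
    lookup = {word.lower(): ipa for word, ipa in dictionary.items()}
    # Longest entries first so a longer term wins over a shorter one it contains.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: lookup[match.group(1).lower()], text)


print(apply_pronunciations("You should visit Crete for your honeymoon.", PRONUNCIATIONS))
```

The word boundaries keep the substitution from firing inside longer words such as "discrete".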

Pauses and Pacing

Punctuation controls pacing in TTS. Periods create natural pauses between thoughts. Commas insert shorter breaks. Sentence length affects overall rhythm: short sentences speed things up, longer sentences slow them down.

Prompt snippet:

Control pacing through punctuation and sentence structure:
- Use periods to separate thoughts and create pauses
- Use commas for shorter breaks within sentences
- Use ellipsis (...) to create a lingering pause or beat
- Use short sentences for emphasis and urgency
- Use longer sentences for calm, measured delivery

Before (flat pacing):

The results are in and we exceeded our target by 40 percent so this is the best quarter we have ever had.

After (with pacing guidance):

The results are in. We exceeded our target... by *forty percent*. This is the *best* quarter we have ever had.

Non-verbal Vocalizations

Inworld TTS supports non-verbal tokens that add human-like sounds: [sigh], [laugh], [breathe], [cough], [clear_throat], [yawn]. These make speech sound more natural and emotionally grounded.

Audio markups are currently experimental and only support English.

Prompt snippet:

Insert non-verbal vocalizations where they would naturally occur in conversation:
- [sigh] for frustration, relief, or resignation
- [laugh] for amusement or warmth
- [breathe] before delivering important or emotional statements
- [cough] or [clear_throat] for naturalistic transitions
- [yawn] for tiredness

Place these tokens inline in your text, e.g.: "[sigh] I really thought that would work."

Before (no vocalizations):

I really thought that would work. Oh well, let’s try again.

After (with vocalizations):

[sigh] I *really* thought that would work. [laugh] Oh well, let’s try again.

See Audio Markups for the full list of supported markups including emotion and delivery style tags.

Conversational Naturalness

Natural human speech is full of filler words like uh, um, well, like, you know. Adding these to LLM output makes TTS sound less robotic and more conversational.

Prompt snippet:

To sound natural and conversational, include filler words where a human speaker
would naturally use them:
- "uh" and "um" for thinking moments
- "well" and "so" for transitions
- "like" and "you know" for casual emphasis

Example: "So, uh, I was thinking we could, you know, try a different approach."

Before (no fillers):

I was thinking we could try a different approach.

After (with fillers):

So, uh, I was thinking we could, you know, try a *different* approach.

Filler words are best for casual, conversational use cases. Skip them for formal applications like news reading, professional narration, or customer support.

Output Length

LLMs tend to be verbose. A detailed paragraph may read well on screen, but it sounds unnatural and exhausting when spoken aloud. Keeping responses short produces better-sounding speech and reduces latency. A good default is to ask your LLM to respond in 1–2 sentences unless the user’s query specifically demands a longer answer.

Use sentences as your length unit, not words or characters. LLMs operate on tokens, so word and character counts are unreliable constraints.

Prompt snippet:

Keep your responses to 1-2 sentences unless the user's question specifically
requires a longer explanation. Prefer concise, direct answers.

Before (too verbose):

Well, the weather forecast for tomorrow is showing that there will be partly cloudy skies throughout the morning hours, with temperatures expected to reach a high of around seventy-five degrees Fahrenheit by the early afternoon, and then cooling down to approximately sixty degrees in the evening.

After (concise):

Tomorrow looks like partly cloudy skies, with a high around *seventy-five* and cooling to sixty by evening.
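Because the length target is expressed in sentences, you can also check it in code before sending text to TTS. The sketch below is a rough heuristic, not a real sentence tokenizer: it assumes sentences start with a capital letter, a bracketed token, or an asterisk, and the function names are illustrative.

```python
import re

# Split after ., !, ?, or … when the next chunk starts with a capital letter,
# a [token], or an *emphasized* word. Heuristic only: abbreviations such as
# "Dr. Smith" will be miscounted.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?…])\s+(?=[A-Z\[*])")


def count_sentences(text: str) -> int:
    """Return the approximate number of sentences in text."""
    return len([part for part in _SENTENCE_BREAK.split(text.strip()) if part])


def exceeds_limit(text: str, max_sentences: int = 2) -> bool:
    """True when a response runs longer than the prompt asked for."""
    return count_sentences(text) > max_sentences


reply = "Tomorrow looks like partly cloudy skies, with a high around *seventy-five* and cooling to sixty by evening."
print(count_sentences(reply), exceeds_limit(reply))
```

A check like this is most useful for logging how often the LLM overshoots, which tells you whether the length instruction needs strengthening.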

Example Prompt Templates

Below are complete, copyable system prompt blocks tailored for common use cases. Each template combines the techniques above into a ready-to-use prompt. There are three, in this order: Companion / Conversational, Support / Sales, and Dev Tools / Technical.

Companion / Conversational

Use this template for chatbots, AI companions, virtual friends, and other informal conversational applications.

## Speech Output Rules

Your responses will be converted to speech using TTS. Follow these
rules to produce natural, expressive spoken output:

### Expressiveness
- Use *asterisks* to emphasize key words
- Use exclamation marks for excitement, ellipsis for trailing off
- Insert non-verbal vocalizations where natural:
  [sigh], [laugh], [breathe], [cough], [clear_throat], [yawn]
  Example: "[laugh] That's *exactly* what I was thinking!"

### Naturalness
- Include filler words (uh, um, well, like, you know) where a human would naturally pause
- Vary sentence length for natural rhythm
- Use contractions (don't, can't, I'm, we're) instead of formal forms

### Pronunciation
- Replace uncommon proper nouns with IPA: e.g., /kriːt/ for Crete
[Add your pronunciation dictionary here]

### Text Formatting
- Write numbers in spoken form: "twenty-three" not "23"
- Write dates in spoken form: "march fifteenth" not "3/15"
- Never use markdown formatting, bullet points, or structured text
- Never use emojis or special characters
- Write everything as natural spoken sentences

Support / Sales

Use this template for customer support agents, sales assistants, and other professional conversational applications.

## Speech Output Rules

Your responses will be converted to speech using TTS. Follow these
rules to produce clear, professional spoken output:

### Clarity
- Use *asterisks* sparingly to emphasize critical information (prices, deadlines, action items)
- Use short, clear sentences for important details
- Use periods to separate distinct points

### Professionalism
- Do NOT use filler words (uh, um, like, you know)
- Do NOT use non-verbal vocalizations ([sigh], [laugh], etc.)
- Maintain a warm but professional tone
- Use contractions naturally (don't, we'll, you're)

### Numbers and Data
- Speak account numbers digit by digit: "one two three four five six" not "123456"
- Speak prices naturally: "forty-nine ninety-nine" or "forty-nine dollars and ninety-nine cents"
- Speak dates fully: "january fifteenth, twenty twenty-five" not "1/15/2025"
- Speak phone numbers in groups: "five five five, one two three, four five six seven"

### Pronunciation
- Replace product names and brand terms with IPA where needed
[Add your pronunciation dictionary here]

### Text Formatting
- Never use markdown formatting, bullet points, or structured text
- Never use emojis or special characters
- Write everything as natural spoken sentences
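Prompt rules are not always followed, so a professional deployment may want a safety net for the "no non-verbal vocalizations" rule. The sketch below strips the six tokens named in this guide before the text reaches TTS; it is an assumption about how you might enforce the rule, and `strip_vocalizations` is a hypothetical helper.

```python
import re

# The six non-verbal tokens listed in this guide.
VOCALIZATIONS = ("sigh", "laugh", "breathe", "cough", "clear_throat", "yawn")

# Match a token plus any whitespace that follows it.
_TOKEN = re.compile(r"\[(?:" + "|".join(VOCALIZATIONS) + r")\]\s*")


def strip_vocalizations(text: str) -> str:
    """Remove non-verbal tokens such as [sigh] from text."""
    return _TOKEN.sub("", text).strip()


print(strip_vocalizations("[sigh] I really thought that would work. [laugh] Oh well, let's try again."))
```

Other bracketed markups are left alone, since only the listed tokens are matched.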

Dev Tools / Technical

Use this template for coding assistants, documentation readers, technical narrators, and developer-facing tools.

## Speech Output Rules

Your responses will be converted to speech using TTS. Follow these
rules to produce accurate, well-paced technical speech:

### Technical Accuracy
- Expand acronyms on first use: "AWS, or Amazon Web Services"
- For common acronyms after first use, speak them as words if pronounceable
  (e.g., "NASA") or say them letter by letter if not (e.g., "A-P-I")
- Speak URLs by component: "github dot com slash inworld dash AI"
- Speak code identifiers in plain English: "the getUserName function" not "getUserName()"
- Speak version numbers naturally: "version three point two" not "v3.2"

### Pronunciation
- Replace technical proper nouns with IPA:
[Add your pronunciation dictionary here, e.g.:]
- "Kubernetes" → /kuːbərˈnɛtiːz/
- "Nginx" → /ˈɛndʒɪnɛks/
- "PostgreSQL" → /ˈpoʊstɡrɛsˌkjuːˈɛl/

### Pacing
- Use measured, even pacing. Avoid rushing through technical content.
- Insert periods before key technical terms to create natural pauses
- Keep sentences moderate length
- Do NOT use filler words (uh, um, like, you know)

### Text Formatting
- Write all numbers in spoken form: "forty-two" not "42"
- Never use markdown formatting, bullet points, or code blocks
- Never use emojis or special characters
- Write everything as natural spoken sentences
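Each template leaves a placeholder for your pronunciation dictionary. One way to keep that section in sync with a dictionary maintained in code is to render it programmatically, as in this sketch. The function name and the idea of storing the dictionary as a Python dict are assumptions; the output format mirrors the Dev Tools / Technical template.

```python
def render_pronunciation_section(entries: dict[str, str]) -> str:
    """Render a '### Pronunciation' prompt section from word -> IPA pairs."""
    lines = ["### Pronunciation", "- Replace technical proper nouns with IPA:"]
    # One bullet per entry, in the dictionary's insertion order.
    lines += [f'- "{word}" → {ipa}' for word, ipa in entries.items()]
    return "\n".join(lines)


print(render_pronunciation_section({
    "Kubernetes": "/kuːbərˈnɛtiːz/",
    "Nginx": "/ˈɛndʒɪnɛks/",
}))
```

The returned string can be pasted or concatenated into the system prompt in place of the placeholder line.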

Notes on Normalization

Inworld TTS includes an optional normalization step that automatically expands dates, numbers, emails, currencies, and symbols into their spoken forms before synthesis. Understanding how normalization interacts with your LLM output is important for getting the best results.

Toggle normalization with the applyTextNormalization parameter in your TTS API request:

ON — always normalize
OFF — skip normalization entirely
APPLY_TEXT_NORMALIZATION_UNSPECIFIED (default) — TTS decides per-request

Normalization adds slight latency to each TTS request. For latency-sensitive applications, consider having your LLM handle text expansion directly and setting applyTextNormalization to OFF.
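As an illustration of setting the parameter, the sketch below builds a request body with normalization turned off. Only `applyTextNormalization` and its three values come from this guide; the `text` and `voiceId` field names, the voice name, and the overall payload shape are assumptions, so check the TTS API reference for the actual request schema.

```python
import json
from enum import Enum


class TextNormalization(str, Enum):
    """The three values this guide lists for applyTextNormalization."""
    ON = "ON"
    OFF = "OFF"
    UNSPECIFIED = "APPLY_TEXT_NORMALIZATION_UNSPECIFIED"


def build_tts_request(
    text: str,
    voice_id: str,
    normalization: TextNormalization = TextNormalization.UNSPECIFIED,
) -> dict:
    """Build a request body; no network call is made here."""
    return {
        "text": text,          # assumed field name
        "voiceId": voice_id,   # assumed field name
        "applyTextNormalization": normalization.value,
    }


payload = build_tts_request(
    "Your order ships on december fourth.",
    "example-voice",  # placeholder voice identifier
    TextNormalization.OFF,
)
print(json.dumps(payload, indent=2))
```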

With Normalization On

Inworld TTS handles common expansions automatically. Your LLM prompt still benefits from guiding edge cases that normalization may not cover:

Ambiguous dates: 01/02/2025 could be January 2nd or February 1st depending on locale
Domain-specific abbreviations: RDS, k8s, HIPAA may not expand as expected
Uncommon acronyms: Industry-specific terms that aren’t in common usage
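When your application already knows which calendar date it means, one way to sidestep the ambiguous-date problem is to render the spoken form in code and pass that to the LLM or TTS. This is a minimal sketch, limited to the years 2010 through 2099 to keep the year logic simple; all helper names are illustrative.

```python
from datetime import date

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Ordinals that are not simply the cardinal plus "th".
IRREGULAR = {1: "first", 2: "second", 3: "third", 5: "fifth", 8: "eighth",
             9: "ninth", 12: "twelfth", 20: "twentieth", 30: "thirtieth"}


def ordinal(day: int) -> str:
    """Spoken ordinal for a day of the month (1-31)."""
    if day in IRREGULAR:
        return IRREGULAR[day]
    if day < 20:
        return ONES[day] + "th"
    tens, ones = divmod(day, 10)
    return f"{TENS[tens]}-{ordinal(ones)}"


def two_digit(n: int) -> str:
    """Spoken cardinal for 10-99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (f"-{ONES[ones]}" if ones else "")


def spoken_date(d: date) -> str:
    """Render a date as lowercase spoken text, e.g. for 2025-12-04."""
    if not 2010 <= d.year <= 2099:
        raise ValueError("this sketch only handles years 2010-2099")
    return f"{MONTHS[d.month - 1]} {ordinal(d.day)}, twenty {two_digit(d.year % 100)}"


print(spoken_date(date(2025, 1, 2)))
print(spoken_date(date(2025, 2, 1)))
```

The two calls show the two readings of 01/02/2025 as distinct, unambiguous strings.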

With Normalization Off

The LLM must handle all text expansion. Your prompt must instruct the LLM to write everything in spoken form: no digits, no symbols, no shorthand.

Comparison Table

Raw Text	Normalization Produces	LLM Should Produce (Normalization Off)
`12/04/2025`	"twelve oh four twenty twenty-five"	"december fourth, twenty twenty-five"
`(555) 123-4567`	"five five five, one two three, four five six seven"	"five five five, one two three, four five six seven"
`$1,249.99`	"one thousand two hundred forty-nine dollars and ninety-nine cents"	"twelve hundred forty-nine dollars and ninety-nine cents"
`3:45 PM`	"three forty-five PM"	"three forty-five PM"
`test@example.com`	"test at example dot com"	"test at example dot com"
`2 + 2 = 4`	"two plus two equals four"	"two plus two equals four"
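To make the phone-number row concrete, the sketch below produces the same grouped, digit-by-digit form from a raw number. It is an illustration of the target output, not a description of how Inworld's normalizer works; reading "0" as "zero" and the function name are assumptions.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spoken_phone_number(raw: str) -> str:
    """Speak each digit, with a comma between the number's digit groups."""
    groups = re.findall(r"\d+", raw)  # e.g. ["555", "123", "4567"]
    return ", ".join(
        " ".join(DIGIT_WORDS[int(ch)] for ch in group) for group in groups
    )


print(spoken_phone_number("(555) 123-4567"))
```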

When to Use Each

Normalization on (recommended for most cases): Less prompt engineering required. Inworld TTS handles standard expansions and you only need to guide edge cases.
Normalization off: Use when you need full control over how text is spoken, or when your domain has specific pronunciation requirements that conflict with default expansion rules.

Prompt snippet for normalization off:

CRITICAL: Write ALL text in fully spoken form. Never use digits, symbols, or abbreviations.
- Dates: "december fourth, twenty twenty-five" not "12/04/2025"
- Phone numbers: "five five five, one two three, four five six seven" not "(555) 123-4567"
- Currency: "forty-nine dollars and ninety-nine cents" not "$49.99"
- Times: "three forty-five PM" not "3:45 PM"
- Emails: "john at example dot com" not "john@example.com"
- Symbols: "two plus two equals four" not "2+2=4"
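With normalization off, anything the LLM leaves unexpanded reaches TTS as-is, so it can help to scan responses for leftover digits and symbols. The following sketch assumes a small, hand-picked symbol set; the slash is deliberately excluded because inline IPA uses slash notation. The function name is hypothetical.

```python
import re

# Digits plus a few symbols that should have been written out in words.
# "/" is left out on purpose because inline IPA uses slash notation.
_UNSPOKEN = re.compile(r"[0-9$@%+=#&]")


def find_unspoken_characters(text: str) -> list[str]:
    """Return the distinct digits and symbols that still need expansion."""
    return sorted(set(_UNSPOKEN.findall(text)))


print(find_unspoken_characters("Your total is $49.99"))
print(find_unspoken_characters("three forty-five PM"))
```

An empty list means the text passed the check; a non-empty one could trigger a retry or a log entry.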

Tips for Iterating

Test with the TTS Playground: Use the TTS Playground to quickly hear how your LLM output sounds when synthesized. Paste in sample outputs and iterate on your prompt until the speech quality meets your needs.
Tune LLM temperature for consistency: Lower temperatures produce more consistent output that follows your formatting rules reliably. Higher temperatures can produce more expressive text but may ignore specific instructions. Start around 0.7 and adjust based on results.
Iterate on your pronunciation dictionary: Start with a small set of terms and expand as you discover mispronunciations during testing. Ask an LLM to generate IPA for new terms, then listen to each one in the TTS Playground before adding it.

Next Steps

Generating Speech

Best practices for synthesizing high-quality speech, including punctuation, emphasis, and temperature tuning.

Audio Markups

Control emotion, delivery style, and non-verbal vocalizations with markup tags.

Custom Pronunciation

Define exact pronunciations for uncommon words using inline IPA notation.
