Steering

Steering, available only on inworld-tts-2, lets you control how a voice performs through natural-language instructions. Convincing audio depends not just on the words spoken but on how they are delivered: flat, mechanical voices break immersion and signal to listeners that they are interacting with a machine. Steering instructions cover emotion, pace, volume, style, and more. Think of it as giving direction to a voice actor. Wrap each instruction in square brackets and place it before the input text. No markup languages or numeric parameters are required.
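The bracket convention described above can be sketched as a small string helper. This is only an illustration of how the input text is assembled: the function name steer is hypothetical and not part of any Inworld SDK, and the resulting string would still need to be sent to inworld-tts-2 through whichever client you already use.

```python
def steer(instruction: str, text: str) -> str:
    """Prefix `text` with a bracketed steering instruction.

    `steer` is a hypothetical helper for illustration only; it just builds
    the input string and does not call any TTS API.
    """
    # Accept the instruction with or without its surrounding brackets.
    instruction = instruction.strip().strip("[]").strip()
    if not instruction:
        # Nothing to steer with, so return the text unchanged.
        return text
    return f"[{instruction}] {text.lstrip()}"


if __name__ == "__main__":
    print(steer("say excitedly",
                "We just got the green light, the product launches tomorrow!"))
```

Keeping the instruction and the spoken text as separate arguments makes it easy to reuse one line of dialogue with several different deliveries.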

Metadata-based instructions

Single-property instructions that target one aspect of delivery at a time.

Emotion: Emotion is the difference between a voice that informs and one that resonates. Tell the voice exactly how to feel, and it will. Examples: [sound sad] [say excitedly] [sound terrified]
[say excitedly] We just got the green light, the product launches tomorrow!
Speed: Pacing shapes urgency. A frantic warning lands differently at half speed, and a careful instruction loses authority when rushed. Match the tempo to the moment. Examples: [speak quickly] [say fast] [extremely slowly]
[speak quickly] Run, they’re right behind us, don’t stop, keep moving!
Volume: From a hushed confession to a crowd-filling announcement, volume sets the physical presence of the voice. Use it to place the listener in the scene. Examples: [quietly] [softly] [in a loud voice]
[quietly] Don’t make a sound. There’s someone right outside the door.
Vocal style: Sometimes it’s not what is said but how it’s said. Shift the entire register of the voice, from a tense whisper to a sung melody, to unlock creative possibilities beyond standard speech. Examples: [whisper] [sing] [shout]
[whisper] She’s finally asleep. Don’t make a sound, but we did it.
Tone: Tone defines the relationship between the voice and the listener. A casual aside lands differently than a tense delivery, even with identical words. Set the right interpersonal register for your context. Examples: [speak conversationally] [in an anxious manner]
[speak conversationally] So anyway, I was telling her about the trip, and she just laughed the whole time.

Free-form turn instructions

Describe the full character of a delivery in natural language, like a director coaching an actor before a take. A single instruction can capture emotion, energy, pacing, and intent all at once.

[speak as if you are barely holding back rage, emphasizing every word through gritted teeth] I have told you. Repeatedly. And you STILL didn’t listen.

[deliver this like you find it absolutely ridiculous but are trying to stay professional] Apparently the entire presentation was sent to the wrong client. Every. Single. Slide.

Non-verbals

Insert organic, human sounds at any point in the text to add realism. Supported tags: [laugh] [breathe] [clear throat] [sigh] [cough] [yawn]

[clear throat] If I could have everyone’s attention, please.

I told him what happened, and he just [laugh] couldn’t believe it!
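Because non-verbal tags can appear anywhere in the text, it can help to locate them before sending a request. The sketch below scans a string for the six supported tags listed above. The function name and the regular expression are assumptions made for illustration; the list of tag names is the only part taken from this page.

```python
import re

# Supported non-verbal tags, as listed in this section.
NON_VERBAL_TAGS = {"laugh", "breathe", "clear throat", "sigh", "cough", "yawn"}

# Matches any [bracketed] span that contains no nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_non_verbals(text: str) -> list[tuple[int, str]]:
    """Return (character offset, tag name) for each supported non-verbal tag.

    Hypothetical helper: bracketed spans that are not in NON_VERBAL_TAGS
    (for example delivery instructions) are ignored.
    """
    found = []
    for match in TAG_PATTERN.finditer(text):
        name = match.group(1).strip().lower()
        if name in NON_VERBAL_TAGS:
            found.append((match.start(), name))
    return found


if __name__ == "__main__":
    line = "I told him what happened, and he just [laugh] couldn't believe it!"
    print(find_non_verbals(line))
```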

Emphasis

Capitalize letters within your input text to draw attention to specific words or syllables. Fully capitalizing a word stresses the entire word, while capitalizing individual letters within a word emphasizes a specific syllable.

I told you NOT to open that door.

Are you seriously asking if I want pizza? AbsoLUTEly I do.
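Capitalization-based emphasis is plain text manipulation, so it can be applied programmatically. The following is a minimal sketch with two hypothetical helpers: one upper-cases a whole word, the other upper-cases a slice of a word. The syllable boundaries are supplied by the caller; nothing here detects syllables automatically.

```python
import re


def emphasize(text: str, target: str) -> str:
    """Upper-case the first whole-word occurrence of `target` in `text`."""
    pattern = re.compile(rf"\b{re.escape(target)}\b", re.IGNORECASE)
    return pattern.sub(lambda m: m.group(0).upper(), text, count=1)


def emphasize_syllable(word: str, start: int, end: int) -> str:
    """Upper-case the characters word[start:end], leaving the rest untouched."""
    return word[:start] + word[start:end].upper() + word[end:]


if __name__ == "__main__":
    print(emphasize("I told you not to open that door.", "not"))
    print(emphasize_syllable("Absolutely", 4, 8))
```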

Best practices

Keep instructions concise. Short, specific instructions give the model clearer direction. Overly long or compound instructions can dilute the effect.
Avoid conflicting instructions. Combining opposing directions in the same tag, for example [whisper] and [in a loud voice], produces unpredictable results. Use one clear instruction per tag.
Match the instruction to the text. The content being spoken should be consistent with the delivery style. A mismatch like [say sadly] applied to "This is the happiest day of my life, I just landed my dream job and fell in love!" sends contradictory signals and may degrade output quality.
Place steering instruction tags before the text. Tags that direct delivery, such as emotion, speed, volume, vocal style, tone, or free-form performance instructions, should appear at the start of the text input they apply to. These instructions may apply inconsistently when placed mid-sentence. Non-verbal tags like [laugh] or [sigh] are the exception and can be inserted inline where the sound should occur.
Check model compatibility. Steering is supported exclusively on inworld-tts-2. It has no effect when used with other models.
Use pause controls for longer pauses, for example when you want a pause for added emphasis.
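Some of these practices can be checked mechanically before a request is sent. The sketch below is a heuristic pre-flight check, not an official validator: the function name, the warning messages, and the 20-word length threshold are all assumptions chosen for illustration. It flags delivery instructions that appear after spoken text has begun and instructions that look overly long, and it leaves inline non-verbal tags alone.

```python
import re

# Supported non-verbal tags; these are allowed inline.
NON_VERBAL_TAGS = {"laugh", "breathe", "clear throat", "sigh", "cough", "yawn"}
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")
MAX_INSTRUCTION_WORDS = 20  # hypothetical threshold, not a documented limit


def lint_steering(text: str) -> list[str]:
    """Return heuristic warnings for steering tags found in `text`."""
    warnings = []
    leading = True  # True while only tags and whitespace have been seen
    cursor = 0
    for match in TAG_PATTERN.finditer(text):
        # Any non-whitespace text before this tag means speech has started.
        if text[cursor:match.start()].strip():
            leading = False
        cursor = match.end()
        tag = match.group(1).strip()
        if tag.lower() in NON_VERBAL_TAGS:
            continue
        if not leading:
            warnings.append(f"[{tag}] is a delivery instruction placed mid-text")
        if len(tag.split()) > MAX_INSTRUCTION_WORDS:
            warnings.append(f"[{tag}] is long; consider a shorter instruction")
    return warnings


if __name__ == "__main__":
    print(lint_steering("Run, [speak quickly] they're right behind us!"))
```

A check like this cannot detect conflicting or mismatched instructions, which depend on meaning rather than position, so those still need a human review.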
