Voice Design

This guide walks through best practices and techniques for generating high-quality voices using Voice Design. For a step-by-step walkthrough on how to design a voice, check out the Voice Design guide.

Voice Design is currently in research preview. Please share any feedback with us via the feedback form in Portal or in Discord.

Voice Description Best Practices

The voice description helps the model understand the type of voice you want to generate. The following best practices will help you write descriptions that produce better voices:

Be specific in your description - Vague descriptions like “a fun voice” tend to produce less consistent results. Include details about age, gender, language (if not English), accent, pitch, pace, timbre, tone, and emotional quality. We generally recommend structuring your description in this order: Distinctive Qualities → Gender → Language / Accent → Age → Tone → Delivery Style → Pacing → Additional Qualities → Audio Quality. For example:
“A soothing, calming female voice with a soft American accent, 30-45 years old. Gentle, flowing delivery with natural pauses and smooth transitions. Warm, peaceful tone that creates relaxation without sounding robotic. Perfect broadcast quality audio.”
Be specific with age - If more general terms like “young” and “old” are not producing the desired voice, use more specific age ranges like “mid-20s to early 30s” or “late 60s to early 70s”.
- For child voices, try specifying exact ages (e.g., “8-10 years old”) and emphasize “natural” and “age-appropriate” to avoid over-cutesy results.
- For elderly voices, include both the age range and specific texture descriptors (“gravelly,” “weathered”) along with pacing cues (“slower, deliberate”).
For regional accents, specify the city or region - Always name the specific city or region rather than a broad area. For example, write “Boston accent” rather than “Northeast accent.”
Describe vocal texture in the middle - Place descriptions of the vocal texture and timbre (e.g., “raspy,” “breathy,” “nasally”) in the middle of your voice description, never at the end. Use modifiers like “slight,” “subtle,” or “natural” to prevent over-exaggeration.
End with audio quality - For the clearest audio quality, include the phrase “Perfect broadcast quality audio.” at the end of your description. This is especially helpful if the description includes texture words like “gravelly,” “breathy,” or “scratchy” that the model might otherwise interpret as audio degradation.
Avoid conflicting descriptors - Contradictory pairs (e.g., “fast-paced” with “slow, deliberate”) may confuse the model.
Experiment with multiple generations - Each generation produces slightly different results. Especially for less common voices (e.g., children, elderly, specific regional accents), you may need to generate a few times to get a successful voice.
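The ordering and conflict guidance above can be sketched in a few lines of Python. This is only an illustrative helper, not part of any Voice Design API: the names `build_description`, `find_conflicts`, and `CONFLICTING_PAIRS` are hypothetical, and the conflict list contains just the one pair mentioned in this guide.

```python
# Closing phrase recommended in this guide.
AUDIO_QUALITY = "Perfect broadcast quality audio."

# Hypothetical conflict list: only the pair named in this guide.
# Extend it with pairs that matter for your own descriptions.
CONFLICTING_PAIRS = [("fast-paced", "slow")]


def build_description(distinctive, gender, accent, age,
                      tone="", delivery="", pacing="", extra=""):
    """Assemble a description in the recommended order:
    distinctive qualities, gender, accent, age, tone, delivery,
    pacing, additional qualities, then audio quality last."""
    opening = f"A {distinctive} {gender} voice with a {accent} accent, {age}."
    parts = [opening]
    for detail in (tone, delivery, pacing, extra):
        if detail:
            # Normalize each optional detail into its own sentence.
            parts.append(detail.strip().rstrip(".") + ".")
    parts.append(AUDIO_QUALITY)
    return " ".join(parts)


def find_conflicts(description):
    """Return the known conflicting pairs that both appear in the text.
    This is a naive substring check, so treat hits as hints only."""
    text = description.lower()
    return [pair for pair in CONFLICTING_PAIRS
            if pair[0] in text and pair[1] in text]


description = build_description(
    "soothing, calming", "female", "soft American", "30-45 years old",
    tone="Warm, peaceful tone",
    delivery="Gentle, flowing delivery with natural pauses",
)
```

Keeping the audio quality phrase as a constant appended last means texture words placed earlier in the description never end up in the final position.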

Voice Script Best Practices

The script shapes the voice that gets generated, as the model will tailor the voice to suit the content it’s speaking. If writing your own script, the following best practices will help ensure the best results.

Match the script to the voice - The model will tailor the voice to the script. Write a script that matches your voice and desired use case. For example, if you’re designing a customer support voice, use a script that sounds like a customer support conversation. For accented voices, use words and phrasing typical of that accent. For example, for a British voice, use words like “brilliant,” “proper,” or “spot on.”
Aim for 5-15 seconds - Write a script that will generate 5-15 seconds of audio (50-200 characters in English), so that the resulting voice has enough audio to serve as a reference for how it should sound in future generations.
Match the desired language - Make sure the script is in the desired language (e.g., write a Chinese script if you want the voice to speak Chinese).
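As a quick pre-flight check, the length guideline above can be tested before submitting a script. The sketch below assumes an English script and uses the 50-200 character range from this guide as a rough stand-in for 5-15 seconds of audio; `check_script_length` is a hypothetical helper, not part of any API.

```python
# Character range this guide gives for 5-15 seconds of English audio.
MIN_CHARS = 50
MAX_CHARS = 200


def check_script_length(script):
    """Classify a script as 'too short', 'too long', or 'ok'
    based on its character count (surrounding whitespace ignored)."""
    length = len(script.strip())
    if length < MIN_CHARS:
        return "too short"
    if length > MAX_CHARS:
        return "too long"
    return "ok"


support_script = (
    "Thanks for calling! I can see your order shipped yesterday, "
    "so it should arrive by Friday. Is there anything else I can help with?"
)
status = check_script_length(support_script)
```

Character count is only a proxy for duration. Actual audio length depends on the voice's pacing, so a slow, deliberate voice may need a shorter script than a fast-paced one.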

Next Steps

Follow the step-by-step guide to design your first voice.

Speech Generation Best Practices

Learn best practices for synthesizing high-quality speech with your designed voices.

Voice Cloning

Clone an existing voice with just 5-15 seconds of audio.
