This guide covers best practices and techniques for generating high-quality voice clones. For instructions on creating a voice clone, see the separate guide on that topic. Inworld offers two types of voice cloning: instant voice cloning (available via Inworld Portal) and professional voice cloning (please reach out for more information). The guide is organized into general best practices that apply to all voice clones, followed by best practices specific to each type of cloning.

General Best Practices

  1. Capture the full range of expression - Make sure your script and delivery cover the emotions and expressiveness you want the voice to capture. The more variety you include, the better the model will be at recreating those feelings. If the audio is flat, the resulting voice will usually sound monotone as well. Below are some scripts you can use that we’ve found work well:
    • Are you ready to save big? Get set for the sale of the century! Deals and discounts like never before! You won’t want to miss this.
    • Every challenge we face is an opportunity in disguise. Wouldn’t you agree? So cheer up! It’ll all be okay.
    • How have you been? It’s been way too long since we last caught up. By the way, I heard about your recent promotion. Congratulations! I’m so excited for you!
  2. Speak clearly and consistently - Pronounce each word carefully and avoid filler sounds like sighs or coughs. Try not to have unnaturally long pauses in the middle of your recording, as this can affect the flow of the cloned voice.
  3. Minimize noise - Record in a quiet environment and keep a reasonable distance from the microphone to reduce echo, plosives, and device noise. After recording, listen back to ensure your audio is clean and free of any unwanted sounds.

Best Practices for Instant Voice Cloning

  1. Keep your recordings short and sweet - Aim for clips between 5 and 15 seconds. This length gives the model enough context while keeping the voice consistent.
  2. Use high-quality audio settings - We recommend recording audio with at least a 22 kHz sample rate and 16-bit depth.
Instant voice cloning may not perform well for less common voices, such as children’s voices or unique accents. For those use cases, we recommend professional voice cloning.
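
The length and audio-setting recommendations above can be checked before upload. The following is a minimal sketch, not an Inworld tool: it assumes the clip is an uncompressed PCM .wav file, reads only the file header with Python's standard `wave` module, and treats both the 22 kHz sample rate and the 16-bit depth as minimums. The function name `check_instant_clip` and the constants are hypothetical names chosen for illustration.

```python
import wave

# Thresholds taken from the recommendations above.
MIN_SECONDS = 5.0
MAX_SECONDS = 15.0
MIN_SAMPLE_RATE_HZ = 22_000
MIN_BIT_DEPTH = 16  # assumption: 16-bit is treated as a minimum


def check_instant_clip(path):
    """Return a list of problems found in a PCM .wav clip (empty if none)."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        bit_depth = wav.getsampwidth() * 8  # bytes per sample -> bits
        duration = wav.getnframes() / sample_rate

    problems = []
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(
            f"duration is {duration:.1f} s, expected {MIN_SECONDS:.0f}-{MAX_SECONDS:.0f} s"
        )
    if sample_rate < MIN_SAMPLE_RATE_HZ:
        problems.append(
            f"sample rate is {sample_rate} Hz, expected at least {MIN_SAMPLE_RATE_HZ} Hz"
        )
    if bit_depth < MIN_BIT_DEPTH:
        problems.append(f"bit depth is {bit_depth}, expected at least {MIN_BIT_DEPTH}")
    return problems
```

Calling `check_instant_clip("my_clip.wav")` on a recording returns an empty list when the clip meets all three recommendations. The check covers file properties only; noise, clarity, and expressiveness still need to be judged by listening.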

Best Practices for Professional Voice Cloning

  1. Follow the optimal recording specifications - For the best voice quality, we recommend recording audio with the following specifications:
    • Audio Format: .wav 
    • Sampling Frequency: 48 kHz
    • Bit Depth: 24 bits
    • Codec: Linear PCM (uncompressed)
    • Channel(s): 1 (mono)
    • Loudness Level: -23 LUFS ±0.5 LU (compliant with ITU-R BS.1770-3)
    • Peak Values Level (Max): -5 dBFS using True Peak value (compliant with ITU-R BS.1770-3)
    • Noise Floor Level (Max): -60 dB
  2. Maintain consistent voice delivery - Keep your voice consistent throughout all recordings. It’s fine to reflect natural variation based on the script (such as hesitations, questions, or exclamations), but avoid major changes in accent or style between samples.
  3. Provide ample, high-quality samples - The minimum required audio is only 30 samples (5–20 seconds each, totaling about 5 minutes), but we recommend at least 120 samples (totaling about 20 minutes) for the best results. There’s no upper limit to the number of samples you can provide; additional clean, high-quality recordings will generally lead to higher-quality clones.
  4. Include transcripts where possible - Text transcripts are not strictly necessary, but we recommend providing them if available—especially for uncommon words, product names, or company terms. This ensures accurate pronunciation in the final voice clone.
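
A folder of recordings can be screened against the file-level specifications and sample counts above before submission. The sketch below is illustrative only and makes several assumptions: samples are stored as PCM .wav files in a single folder, any deviation from 48 kHz, 24-bit, mono is reported as a problem, and the function names (`inspect_wav`, `spec_problems`, `summarize_dataset`) are hypothetical. It reads headers with Python's standard `wave` module and does not measure loudness, true peak, or noise floor, which need a dedicated audio meter. Depending on the Python version, `wave` may reject .wav files that use the extensible header format.

```python
import wave
from pathlib import Path

# Targets taken from the specifications and sample counts above.
TARGET_SAMPLE_RATE_HZ = 48_000
TARGET_BIT_DEPTH = 24
TARGET_CHANNELS = 1
MIN_CLIP_SECONDS = 5.0
MAX_CLIP_SECONDS = 20.0
MIN_SAMPLES = 30
RECOMMENDED_SAMPLES = 120


def inspect_wav(path):
    """Read sample rate, bit depth, channel count, and duration from a .wav header."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "bit_depth": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
            "seconds": wav.getnframes() / rate,
        }


def spec_problems(info):
    """List the ways one file deviates from the recommended specifications."""
    problems = []
    if info["sample_rate"] != TARGET_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {info['sample_rate']} Hz")
    if info["bit_depth"] != TARGET_BIT_DEPTH:
        problems.append(f"bit depth {info['bit_depth']}")
    if info["channels"] != TARGET_CHANNELS:
        problems.append(f"{info['channels']} channels")
    if not MIN_CLIP_SECONDS <= info["seconds"] <= MAX_CLIP_SECONDS:
        problems.append(f"duration {info['seconds']:.1f} s")
    return problems


def summarize_dataset(folder):
    """Summarize sample count, total length, and per-file problems for a folder."""
    problems = {}
    total_seconds = 0.0
    count = 0
    for path in sorted(Path(folder).glob("*.wav")):
        info = inspect_wav(path)
        count += 1
        total_seconds += info["seconds"]
        issues = spec_problems(info)
        if issues:
            problems[path.name] = issues
    return {
        "sample_count": count,
        "total_minutes": total_seconds / 60,
        "meets_minimum": count >= MIN_SAMPLES,
        "meets_recommended": count >= RECOMMENDED_SAMPLES,
        "problems": problems,
    }
```

Calling `summarize_dataset("recordings")` returns a dictionary whose `problems` entry maps each non-conforming file name to its deviations, so an empty `problems` entry together with `meets_recommended` being true indicates the folder matches the header-level recommendations.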