Voice Cloning - Inworld AI Documentation

This guide covers best practices and techniques for generating high-quality voice clones. For more information on how to create a voice clone, check out this guide. Inworld offers two types of voice cloning: instant voice cloning (available via Inworld Portal) and professional voice cloning (please reach out for more information). The guidance below is split into general best practices that apply to all voice clones and more specific best practices for each type of cloning.

General Best Practices

Capture the full range of expression - Make sure your script and delivery cover the emotions and expressiveness you want the voice to capture. The more variety you include, the better the model will be at recreating those feelings. If the audio is flat, the resulting voice will usually sound monotone as well. Below are some scripts you can use that we’ve found work well:
- Are you ready to save big? Get set for the sale of the century! Deals and discounts like never before! You won’t want to miss this.
- Every challenge we face is an opportunity in disguise. Wouldn’t you agree? So cheer up! It’ll all be okay.
- How have you been? It’s been way too long since we last caught up. By the way, I heard about your recent promotion. Congratulations! I’m so excited for you!
Speak clearly and consistently - Pronounce each word carefully and avoid filler sounds like sighs or coughs. Try not to have unnaturally long pauses in the middle of your recording, as this can affect the flow of the cloned voice.
Minimize noise - Record in a quiet environment and keep a reasonable distance from the microphone to reduce echo, plosives, and device noise. After recording, listen back to ensure your audio is clean and free of any unwanted sounds.
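One way to spot unnaturally long pauses before uploading is to scan the recording for quiet stretches. The sketch below is an illustration only: the function name `find_long_pauses`, the -45 dBFS silence threshold, and the 1-second pause length are assumptions chosen for the example, not values published by Inworld. It expects decoded 16-bit PCM samples as a sequence of integers.

```python
import math
from typing import List, Sequence, Tuple


def find_long_pauses(
    samples: Sequence[int],
    sample_rate: int,
    silence_dbfs: float = -45.0,  # assumed threshold; tune for your room and mic
    min_pause_s: float = 1.0,     # assumed definition of a "long" pause
    window_s: float = 0.02,       # analysis window length in seconds
    full_scale: int = 32768,      # full scale for 16-bit PCM
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of quiet stretches >= min_pause_s."""
    window = max(1, round(sample_rate * window_s))
    pauses: List[Tuple[float, float]] = []
    run_start = None  # start time of the current quiet run, if any
    for i in range(math.ceil(len(samples) / window)):
        chunk = samples[i * window:(i + 1) * window]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        level = 20 * math.log10(rms / full_scale) if rms > 0 else float("-inf")
        t = i * window / sample_rate
        if level < silence_dbfs:
            if run_start is None:
                run_start = t
        else:
            if run_start is not None and t - run_start >= min_pause_s:
                pauses.append((run_start, t))
            run_start = None
    # Close a quiet run that lasts until the end of the recording.
    end = len(samples) / sample_rate
    if run_start is not None and end - run_start >= min_pause_s:
        pauses.append((run_start, end))
    return pauses
```

Any stretch it reports is a candidate for trimming or re-recording. It does not replace listening back to the audio, since it cannot tell a breath or a cough from speech.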

Best Practices for Instant Voice Cloning

Keep the final clip short - Aim for a total length of 5-15 seconds, which gives the model enough context while keeping the voice consistent.
Use high-quality audio - Record with at least a 22 kHz sample rate and 16-bit depth.
Vary emotion and delivery - Combine a few short clips that show different expressions into your final clip; use short pauses or crossfades between clips to avoid abrupt cuts.
Use clean audio - Avoid artifacts, background noise, and non-speech sounds.
Normalize volume - Keep levels fairly consistent, allowing for normal voice variation, and avoid clipping caused by excessively high levels.
Avoid mid-word cuts - Don’t use samples that break in the middle of words.
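The length and format guidance above can be checked automatically before a clip is uploaded. The following is a minimal sketch using Python's standard `wave` module, assuming the clip is an uncompressed WAV file. The names `check_instant_clip` and `synth_wav` are made up for this example, and `synth_wav` exists only to generate test input.

```python
import io
import wave
from typing import BinaryIO, List, Union


def check_instant_clip(source: Union[str, BinaryIO]) -> List[str]:
    """Return a list of problems; an empty list means the clip meets the guidance."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        duration_s = wav.getnframes() / rate
    problems = []
    if not 5.0 <= duration_s <= 15.0:
        problems.append(f"total length is {duration_s:.1f}s, expected 5-15s")
    if rate < 22000:
        problems.append(f"sample rate is {rate} Hz, expected at least 22 kHz")
    if bits < 16:
        problems.append(f"bit depth is {bits}, expected at least 16")
    return problems


def synth_wav(duration_s: float, rate: int, sample_width: int) -> io.BytesIO:
    """Build a silent mono WAV in memory, used here only to exercise the checker."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(bytes(int(duration_s * rate) * sample_width))
    buf.seek(0)
    return buf


print(check_instant_clip(synth_wav(10.0, 44100, 2)))  # a clip within the guidance
print(check_instant_clip(synth_wav(3.0, 16000, 1)))   # too short, low rate, 8-bit
```

With a real recording, pass the file path instead of the in-memory buffer. The check covers only length, sample rate, and bit depth. Noise, artifacts, and mid-word cuts still need a listening pass.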

Instant voice cloning may not perform well for less common voices, such as children’s voices or unique accents. For those use cases, we recommend professional voice cloning.

Best Practices for Professional Voice Cloning

Follow the optimal recording specifications - For the best voice quality, we recommend recording audio with the following specifications:
- Audio Format: .wav
- Sampling Frequency: 48 kHz
- Bit Depth: 24 bits
- Codec: Linear PCM (uncompressed)
- Channel(s): 1 (mono)
- Loudness Level: -23 LUFS ±0.5 LU (compliant with ITU-R BS.1770-3)
- Peak Level (Max): -5 dBFS, measured as True Peak (compliant with ITU-R BS.1770-3)
- Noise Floor Level (Max): -60 dB
Maintain consistent voice delivery - Keep your voice consistent throughout all recordings. It’s fine to reflect natural variation based on the script (such as hesitations, questions, or exclamations), but avoid major changes in accent or style between samples.
Provide ample, high-quality samples - While the minimum requirement is only 30 samples (5–20 seconds each, totaling about 5 minutes), we recommend at least 120 samples (totaling about 20 minutes) for the best results. There is no upper limit on the number of samples you can provide, and more clean, high-quality recordings will generally lead to higher-quality clones.
Include transcripts where possible - Text transcripts are not strictly necessary, but we recommend providing them if available—especially for uncommon words, product names, or company terms. This ensures accurate pronunciation in the final voice clone.
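Part of the recording specification and the sample-count guidance above can be verified with a short script before submitting a dataset. The sketch below uses Python's standard `wave` module, and the function names are invented for this example. It measures the plain sample peak, which is only an approximation of the True Peak value in the specification, and it does not measure loudness (LUFS) or noise floor, which need a dedicated BS.1770 meter.

```python
import io
import math
import wave
from typing import BinaryIO, Dict, List, Union


def inspect_sample(source: Union[str, BinaryIO]) -> Dict[str, object]:
    """Report duration, sample peak, and spec problems for one WAV file."""
    # Older Python versions may raise wave.Error on WAVE_FORMAT_EXTENSIBLE
    # headers, which some tools write for 24-bit audio.
    with wave.open(source, "rb") as wav:
        rate, width, channels = wav.getframerate(), wav.getsampwidth(), wav.getnchannels()
        n_frames = wav.getnframes()
        frames = wav.readframes(n_frames)
    duration_s = n_frames / rate
    problems: List[str] = []
    if rate != 48000:
        problems.append(f"sampling frequency is {rate} Hz, expected 48 kHz")
    if width != 3:
        problems.append(f"bit depth is {width * 8}, expected 24")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if not 5.0 <= duration_s <= 20.0:
        problems.append(f"duration is {duration_s:.1f}s, expected 5-20s")
    peak_dbfs = None
    if width == 3:
        # Decode little-endian signed 24-bit samples and take the largest magnitude.
        peak = max(
            (abs(int.from_bytes(frames[i:i + 3], "little", signed=True))
             for i in range(0, len(frames) - 2, 3)),
            default=0,
        )
        peak_dbfs = 20 * math.log10(peak / 2 ** 23) if peak else float("-inf")
        if peak_dbfs > -5.0:
            problems.append(f"sample peak is {peak_dbfs:.1f} dBFS, expected at most -5")
    return {"duration_s": duration_s, "peak_dbfs": peak_dbfs, "problems": problems}


def summarize_dataset(durations_s: List[float]) -> Dict[str, object]:
    """Compare a set of sample durations with the minimum and recommended counts."""
    return {
        "samples": len(durations_s),
        "total_minutes": sum(durations_s) / 60,
        "meets_minimum": len(durations_s) >= 30,
        "meets_recommended": len(durations_s) >= 120,
    }


def synth_24bit_wav(duration_s: float, amplitude: int, rate: int = 48000) -> io.BytesIO:
    """Build a constant-level 24-bit mono WAV in memory, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(3)
        wav.setframerate(rate)
        wav.writeframes(amplitude.to_bytes(3, "little", signed=True) * int(duration_s * rate))
    buf.seek(0)
    return buf


report = inspect_sample(synth_24bit_wav(6.0, 2 ** 21))
print(report["problems"], summarize_dataset([10.0] * 30))
```

In practice, `inspect_sample` would be called once per file path, and the collected durations passed to `summarize_dataset`.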

Automation via API

If you need to clone multiple voices (for example, to support a batch of creators or a pipeline workflow), you can automate voice cloning via the API.

API reference: Clone a voice
Python example: example_voice_clone.py
JavaScript example: example_voice_clone.js

Voice cloning has lower rate limits than regular speech synthesis. For details, see Rate limits.
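Because cloning requests are rate limited more tightly, a batch job usually needs to retry with a delay rather than fail on the first rejection. The sketch below shows one common pattern, exponential backoff with a little jitter. It does not call the Inworld API: `clone_voice` is a hypothetical callable you would supply (for example, a wrapper around the request shown in the API reference), and `RateLimitError` is a hypothetical exception that wrapper would raise when the service reports a rate limit. The attempt count and delays are assumptions too.

```python
import random
import time
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical: raised by your own client wrapper on a rate-limit response."""


def clone_all(
    voice_names: Iterable[str],
    clone_voice: Callable[[str], T],         # hypothetical wrapper around the API call
    max_attempts: int = 5,                   # assumed retry budget
    base_delay_s: float = 1.0,               # assumed initial delay
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, T]:
    """Clone each voice in turn, backing off exponentially when rate limited."""
    results: Dict[str, T] = {}
    for name in voice_names:
        for attempt in range(max_attempts):
            try:
                results[name] = clone_voice(name)
                break
            except RateLimitError:
                if attempt == max_attempts - 1:
                    raise  # give up after the final attempt
                delay = base_delay_s * 2 ** attempt
                # Add up to 10% jitter so parallel workers do not retry in lockstep.
                sleep(delay + random.uniform(0, delay * 0.1))
    return results
```

Processing voices one at a time keeps the request rate low by construction. The `sleep` parameter is injectable so the retry logic can be tested without waiting.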
