The inworld-tts Python SDK wraps the Inworld TTS REST API in a Pythonic interface. It automatically handles chunking for long text, retries with exponential backoff, and connection management, reducing a typical integration from 30+ lines of raw HTTP to a few lines of code.
pip install inworld-tts
Requires Python 3.10+.

Quick Start

from inworld_tts import InworldTTS

tts = InworldTTS()  # reads INWORLD_API_KEY from env

tts.generate(
    text="What a wonderful day to be a text-to-speech model!",
    voice="Ashley",
    output_file="output.mp3",
)

Speech Synthesis

generate(options)

Synthesize speech and return the complete audio as bytes. Text longer than 2,000 characters is automatically chunked and sent in parallel.
audio = tts.generate(
    text="Hello, world!",
    voice="Ashley",
    model="inworld-tts-1.5-max",
    encoding="MP3",
    output_file="output.mp3",  # optional — also writes to disk
)
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| text | str | Yes | | Text to synthesize. Any length. Supports <break time="Xs"/> SSML. |
| voice | str | Yes | | Voice ID (e.g. "Ashley", "Dennis", or a custom voice ID). |
| model | str | No | "inworld-tts-1.5-max" | Model ID. |
| encoding | str | No | "MP3" | Audio format: MP3, OGG_OPUS, FLAC, LINEAR16, WAV, PCM, ALAW, MULAW. |
| sample_rate | int | No | 48000 | Sample rate in Hz. |
| bit_rate | int | No | 128000 | Bit rate in bps (MP3 / OGG_OPUS only). |
| speaking_rate | float | No | 1.0 | Speed multiplier (0.5–1.5). |
| temperature | float | No | 1.0 | Expressiveness (0.0–2.0). Higher = more expressive. |
| output_file | str | No | | Write audio to this file path. |
| play | bool | No | False | Play audio immediately after synthesis. |
Returns: bytes — raw audio bytes in the requested encoding.
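
To illustrate the "chunked and sent in parallel" behaviour described above, here is a minimal sketch of bounded-concurrency synthesis that preserves chunk order. It is not the SDK's implementation: `synthesize_chunks` and `fake_synthesize` are hypothetical names, and `fake_synthesize` stands in for a single API request. How the SDK actually joins audio internally is not documented here.

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize_chunks(chunks, synthesize, max_concurrent_requests=4):
    """Synthesize each chunk with bounded parallelism and join audio in input order."""
    with ThreadPoolExecutor(max_workers=max_concurrent_requests) as pool:
        # Executor.map yields results in input order,
        # regardless of which request finishes first.
        parts = list(pool.map(synthesize, chunks))
    return b"".join(parts)


def fake_synthesize(text):
    # Hypothetical stand-in for one API request; returns placeholder bytes.
    return text.upper().encode("utf-8")


audio = synthesize_chunks(["first chunk. ", "second chunk."], fake_synthesize)
print(audio)
```

The default of 4 workers mirrors the documented default for max_concurrent_requests.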

stream(options)

Stream audio chunks over HTTP as they are generated. Lower time-to-first-audio than generate(). Text must be 2,000 characters or fewer.
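import asyncio

async def main():
    chunks = []
    # stream() yields audio chunks as they are generated
    async for chunk in tts.stream(
        text="Streaming is great for real-time playback!",
        voice="Ashley",
    ):
        chunks.append(chunk)
    return b"".join(chunks)

audio = asyncio.run(main())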
Parameters are the same as generate(), except text must be ≤2,000 characters and the default model is "inworld-tts-1.5-mini".
Yields: bytes (audio chunks as they arrive).

generate_with_timestamps(options)

Same as generate() but also returns word- or character-level timing data. Useful for lip-sync, karaoke, and subtitle alignment.
result = tts.generate_with_timestamps(
    text="Timestamps are useful for lip sync.",
    voice="Ashley",
    timestamp_type="WORD",
)

# result.audio → bytes
# result.timestamps.word_alignment.words → ["Timestamps", "are", "useful", ...]
# result.timestamps.word_alignment.word_start_time_seconds → [0.0, 0.42, 0.61, ...]
Takes all the same parameters as generate(), plus:
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| timestamp_type | "WORD" or "CHARACTER" | Yes | "WORD" returns word timing, phonemes, and visemes. "CHARACTER" returns per-character timing. |
Returns: an object with audio: bytes and timestamps: TimestampInfo.
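
For the subtitle-alignment use case, word timings can be turned into SRT cues. The sketch below works on two plain lists shaped like `word_alignment.words` and `word_alignment.word_start_time_seconds`. The helper names, the sample values after the first three, and the `final_duration` padding are assumptions for illustration. Only start times are shown in this reference, so each cue ends where the next one begins.

```python
def format_srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, start_times, words_per_cue=3, final_duration=0.5):
    """Group words into cues; each cue ends where the next cue starts."""
    cues = []
    for i in range(0, len(words), words_per_cue):
        next_i = i + words_per_cue
        start = start_times[i]
        # The last cue has no following word, so pad it by an assumed duration.
        if next_i < len(words):
            end = start_times[next_i]
        else:
            end = start_times[-1] + final_duration
        text = " ".join(words[i:next_i])
        cues.append(
            f"{len(cues) + 1}\n{format_srt_time(start)} --> {format_srt_time(end)}\n{text}\n"
        )
    return "\n".join(cues)


# Hypothetical sample data in the shape of the word alignment lists.
words = ["Timestamps", "are", "useful", "for", "lip", "sync."]
starts = [0.0, 0.42, 0.61, 1.0, 1.2, 1.5]
print(words_to_srt(words, starts))
```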

stream_with_timestamps(options)

Stream audio chunks, each paired with optional timestamp data. Text must be ≤2,000 characters.
import asyncio

async def main():
    async for chunk in tts.stream_with_timestamps(
        text="Streaming with timestamps!",
        voice="Ashley",
        timestamp_type="WORD",
    ):
        # chunk.audio: bytes
        # chunk.timestamps: TimestampInfo | None
        pass

asyncio.run(main())
Takes all the same parameters as stream(), plus timestamp_type (required). Default model is "inworld-tts-1.5-mini".
Yields: objects with audio: bytes and optional timestamps: TimestampInfo.

play(audio, options)

Play audio from bytes or a file path. Encoding is auto-detected from magic bytes unless overridden.
audio = tts.generate(text="Listen to this!", voice="Ashley")
tts.play(audio)

# Or play from a file
tts.play("output.mp3")
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| audio | bytes or str | Yes | | Raw audio bytes or a file path. |
| encoding | str | No | auto-detected | Format hint ("MP3", "WAV", etc.). Inferred from extension for file paths. |

Voice Management

list_voices(options)

List available voices, optionally filtered by language.
voices = tts.list_voices()

# Filter by language
en_voices = tts.list_voices(lang="EN_US")
multi_lang = tts.list_voices(lang=["EN_US", "ES_ES"])
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| lang | str or list[str] | No | Filter by language code(s). Returns all voices when omitted. |
Returns: list[VoiceInfo]

get_voice(voice)

Get details for a single voice. Works with custom voices in your workspace (cloned or designed voices).
voice = tts.get_voice("my-custom-voice-id")
# voice.voice_id, voice.display_name, voice.lang_code, ...
Returns: VoiceInfo

clone_voice(options)

Clone a voice from one or more audio recordings. Only 5–15 seconds of audio is needed.
result = tts.clone_voice(
    audio_samples=["./recording.wav"],
    display_name="My Cloned Voice",
    lang="EN_US",
)

print(result.voice.voice_id)  # use this ID in generate()
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| audio_samples | list of bytes or str | Yes | | Audio files as bytes, or file paths. WAV or MP3. |
| display_name | str | No | "Cloned Voice" | Display name for the cloned voice. |
| lang | str | No | "EN_US" | Language code of the recordings. |
| transcriptions | list[str] | No | | Transcriptions aligned with each audio sample. Improves clone quality. |
| description | str | No | | Voice description. |
| tags | list[str] | No | | Tags for filtering. |
| remove_background_noise | bool | No | False | Apply noise reduction before cloning. |
Returns: CloneVoiceResult — the cloned voice ID is at result.voice.voice_id.

design_voice(options)

Design a new voice from a text description — no audio recording needed.
result = tts.design_voice(
    design_prompt="A warm, friendly female voice with a slight British accent",
    preview_text="Hello! Welcome to our application.",
    number_of_samples=3,
)

# Listen to previews, then publish the one you like
chosen_voice = result.preview_voices[0]
| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| design_prompt | str | Yes | | Natural-language description of the voice (30–250 characters). |
| preview_text | str | Yes | | Text the generated voice will speak in the preview. |
| lang | str | No | "EN_US" | Language code. |
| number_of_samples | int | No | 1 | Number of preview candidates (1–3). |
Returns: DesignVoiceResult — preview voices at result.preview_voices.
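
The documented limits (a 30–250 character prompt, 1–3 samples) can be checked locally before a request is sent. The helper below is a hypothetical pre-flight check and not part of the SDK. Whether the SDK or the API performs its own validation is not stated here.

```python
def validate_design_request(design_prompt: str, number_of_samples: int = 1) -> list[str]:
    """Return a list of problems; an empty list means the inputs are within limits."""
    problems = []
    if not 30 <= len(design_prompt) <= 250:
        problems.append(
            f"design_prompt must be 30-250 characters, got {len(design_prompt)}"
        )
    if not 1 <= number_of_samples <= 3:
        problems.append(
            f"number_of_samples must be between 1 and 3, got {number_of_samples}"
        )
    return problems


prompt = "A warm, friendly female voice with a slight British accent"
print(validate_design_request(prompt, number_of_samples=3))
```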

publish_voice(options)

Publish a designed or cloned voice preview to your library so it can be used in generate() and stream().
voice = tts.publish_voice(
    voice=chosen_voice.voice_id,
    display_name="My Designed Voice",
)
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| voice | str | Yes | Voice ID from design_voice() or clone_voice(). |
| display_name | str | No | Display name for the published voice. |
| description | str | No | Description. |
| tags | list[str] | No | Tags for filtering. |
Returns: VoiceInfo

migrate_from_elevenlabs(options)

Migrate a voice from ElevenLabs to your Inworld workspace. Fetches the voice’s audio samples directly from ElevenLabs and clones them into Inworld. No ElevenLabs SDK required.
import os

result = tts.migrate_from_elevenlabs(
    eleven_labs_api_key=os.environ["ELEVEN_LABS_API_KEY"],
    eleven_labs_voice_id="abc123",
)

print(f'Migrated "{result.eleven_labs_name}" → {result.inworld_voice_id}')
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| eleven_labs_api_key | str | Yes | Your ElevenLabs API key. |
| eleven_labs_voice_id | str | Yes | ElevenLabs voice ID to migrate. |
Returns: an object with eleven_labs_voice_id, eleven_labs_name, and inworld_voice_id.

Configuration

Create a client with InworldTTS():
from inworld_tts import InworldTTS

tts = InworldTTS()                   # reads INWORLD_API_KEY from env
tts = InworldTTS(api_key="your_key") # or pass explicitly
| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| api_key | str | | INWORLD_API_KEY env var | Inworld API key. |
| base_url | str | No | https://api.inworld.ai | Override the API base URL. |
| timeout | int | No | 120 | Global HTTP timeout in seconds. |
| max_retries | int | No | 2 | Retry attempts on NetworkError or 5xx. Uses exponential backoff (1s, 2s, 4s… capped at 16s). 0 disables retries. |
| max_concurrent_requests | int | No | 4 | Max parallel chunk requests for long-text generate(). |
| debug | bool | No | False | Enable debug logging. Also activated by DEBUG=inworld-tts env var. |
api_key must be provided directly or through the INWORLD_API_KEY environment variable. If neither is set, a MissingApiKeyError is raised.
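
The backoff schedule described for max_retries (1s, 2s, 4s… capped at 16s) doubles on each attempt up to the cap. A minimal sketch of that schedule follows. `backoff_delays` is a hypothetical helper, and whether the SDK adds jitter to these delays is not stated in this reference.

```python
def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 16.0) -> list[float]:
    """Delay in seconds before each retry: base * 2^attempt, capped at `cap`."""
    return [min(base * 2 ** attempt, cap) for attempt in range(max_retries)]


print(backoff_delays(2))  # schedule for the default max_retries
print(backoff_delays(6))  # the cap applies from the fifth retry onward
```

Summing the list gives the worst-case time spent waiting between attempts, which is useful when choosing max_retries alongside timeout.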

Long Text

generate() and generate_with_timestamps() automatically chunk text longer than 2,000 characters and send chunks in parallel (controlled by max_concurrent_requests). The resulting audio is seamlessly concatenated, and timestamp offsets are merged correctly. stream() and stream_with_timestamps() require text of 2,000 characters or fewer. For longer text with streaming, split the text yourself and call stream() for each segment.
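
For splitting long text yourself before streaming, one possible approach is to pack whole sentences greedily into segments of at most 2,000 characters. The sketch below is an assumption about a reasonable strategy and not the SDK's own chunking algorithm. `split_text` is a hypothetical helper, and the sentence regex is deliberately simple.

```python
import re


def split_text(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack whole sentences into segments of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself too long.
        while len(sentence) > max_chars:
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so start a new segment.
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments


print(split_text("One. Two. Three.", max_chars=10))
```

Each returned segment can then be passed to stream() in order, with the resulting audio played or appended sequentially.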

Error Handling

The SDK exports three error classes, all extending InworldTTSError:
from inworld_tts import (
    InworldTTS,
    InworldTTSError,
    ApiError,
    NetworkError,
    MissingApiKeyError,
)

tts = InworldTTS()  # reads INWORLD_API_KEY from env

try:
    audio = tts.generate(text="Hello!", voice="Ashley")
except MissingApiKeyError:
    # No API key provided
    pass
except ApiError as err:
    print(f"HTTP {err.code}: {err.message}", err.details)
except NetworkError as err:
    print(f"Network error: {err}")
| Error | When |
| --- | --- |
| MissingApiKeyError | No api_key was provided and INWORLD_API_KEY is not set. |
| ApiError | The API returned a 4xx or 5xx response. Includes .code (HTTP status) and .details. |
| NetworkError | Connection failure or timeout. Automatically retried up to max_retries times before being raised. |

Next Steps

Voice Cloning

Create a personalized voice clone with just 5 seconds of audio.

Best Practices

Learn tips and tricks for synthesizing high-quality speech.

API Reference

View the complete TTS API specification.