A comprehensive load testing tool for TTS On-Premises that measures performance metrics including latency, throughput, and streaming characteristics across different QPS (Queries Per Second) loads.
Overview
The tool simulates realistic TTS workloads by sending requests at specified rates with configurable burstiness patterns. It measures:
- End-to-end latency
- Audio generation latency per second
- Streaming metrics (first chunk, 4th chunk, average chunk latencies)
- Request success rates
- Server performance under different load conditions
Quick start
```shell
# Install the load test tool
pip install -e .

# Basic load test with streaming
python load-test.main \
  --host http://localhost:8081 \
  --stream \
  --min-qps 1.0 \
  --max-qps 7.0 \
  --qps-step 2.0 \
  --number-of-samples 300
```
Parameters
Required
| Parameter | Description | Example |
|---|---|---|
| --host | Base address of the On-Premises TTS server (the endpoint path is appended automatically) | http://localhost:8081 |
Load configuration
| Parameter | Default | Description |
|---|---|---|
| --min-qps | 1.0 | Minimum requests per second to test |
| --max-qps | 10.0 | Maximum requests per second to test |
| --qps-step | 2.0 | Step size for QPS increments |
| --number-of-samples | 1 | Total number of texts to synthesize per QPS level |
| --burstiness | 1.0 | Request timing pattern (1.0 = Poisson, < 1.0 = bursty, > 1.0 = uniform) |
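As a sketch of how the sweep parameters could combine, the QPS levels visited by a run can be enumerated like this (the function name is illustrative, not taken from the tool's source):

```python
def qps_levels(min_qps: float, max_qps: float, qps_step: float) -> list[float]:
    """Enumerate the QPS levels a sweep visits, inclusive of max_qps."""
    levels = []
    qps = min_qps
    # Small epsilon guards against float drift excluding the final level.
    while qps <= max_qps + 1e-9:
        levels.append(round(qps, 6))
        qps += qps_step
    return levels

# With the quick-start settings (--min-qps 1.0 --max-qps 7.0 --qps-step 2.0):
print(qps_levels(1.0, 7.0, 2.0))  # [1.0, 3.0, 5.0, 7.0]
```

At each of these levels the tool sends --number-of-samples requests before stepping up to the next rate.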
TTS configuration
| Parameter | Default | Description |
|---|---|---|
| --stream | False | Use streaming synthesis (/SynthesizeSpeechStream) vs non-streaming (/SynthesizeSpeech) |
| --max_tokens | 400 | Maximum tokens to synthesize (~8s audio at 50 tokens/s) |
| --voice-ids | ["Olivia", "Remy"] | Voice IDs to use (can specify multiple) |
| --model_id | None | Model ID for TTS synthesis (optional) |
| --text_samples_file | scripts/tts_load_testing/text_samples.json | File containing text samples |
Output and analysis
| Parameter | Default | Description |
|---|---|---|
| --benchmark_name | auto-generated | Name for the benchmark run (affects output files) |
| --plot_only | False | Only generate plots from existing results (skip testing) |
| --verbose | False | Enable verbose output for debugging |
Examples
Streaming vs non-streaming comparison
```shell
# Non-streaming test
python load-test.main \
  --host http://localhost:8081 \
  --min-qps 10.0 \
  --max-qps 50.0 \
  --qps-step 10.0 \
  --number-of-samples 500 \
  --benchmark_name non-streaming-test

# Streaming test
python load-test.main \
  --host http://localhost:8081 \
  --stream \
  --min-qps 10.0 \
  --max-qps 50.0 \
  --qps-step 10.0 \
  --number-of-samples 500 \
  --benchmark_name streaming-test
```
Plot-only mode
Generate plots from existing results without re-running tests:
```shell
./scripts/tts-load-test \
  --plot_only \
  --benchmark_name prod-stress-test
```
Understanding results
The tool generates comprehensive metrics for each QPS level.
Latency metrics
- E2E Latency: Complete request-response time
- Audio Generation Latency: Time per second of generated audio
- First Chunk Latency: Time to first audio chunk (streaming only)
- 4th Chunk Latency: Time to 4th audio chunk (streaming only)
- Average Chunk Latency: Mean time between chunks (streaming only)
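Given per-chunk arrival timestamps for one streaming request, the metrics above can be derived as follows (a sketch; the field names are illustrative and are not the tool's actual schema):

```python
def chunk_metrics(request_start: float, chunk_times: list[float]) -> dict:
    """Derive streaming latency metrics from chunk arrival timestamps (seconds)."""
    metrics = {"first_chunk_latency": chunk_times[0] - request_start}
    if len(chunk_times) >= 4:
        metrics["fourth_chunk_latency"] = chunk_times[3] - request_start
    # Mean gap between consecutive chunks.
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    if gaps:
        metrics["avg_chunk_latency"] = sum(gaps) / len(gaps)
    metrics["e2e_latency"] = chunk_times[-1] - request_start
    return metrics

m = chunk_metrics(10.0, [10.2, 10.3, 10.4, 10.5, 10.7])
# first ≈ 0.2 s, fourth ≈ 0.5 s, mean gap ≈ 0.125 s, e2e ≈ 0.7 s
```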
Percentiles
Results include P50, P90, P95, and P99 percentiles for all latency metrics.
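For reference, these percentiles can be computed with the standard library alone; this sketch uses the nearest-rank method (the tool's exact interpolation scheme may differ):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

latencies = [0.12, 0.15, 0.11, 0.40, 0.13, 0.14, 0.16, 0.90, 0.18, 0.17]
summary = {f"p{p}": percentile(latencies, p) for p in (50, 90, 95, 99)}
```

Note how the tail percentiles (P95, P99) surface the two slow outliers that the median hides entirely.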
Output files
Results are saved in benchmark_result/{benchmark_name}/:
- result.json — Raw performance data
- {benchmark_name}_*.png — Performance charts
Burstiness parameter
The burstiness parameter controls request timing distribution:
| Value | Behavior |
|---|---|
| 1.0 | Poisson process (natural randomness) |
| < 1.0 | More bursty (requests come in clusters) |
| > 1.0 | More uniform (evenly spaced requests) |
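One common way to realize this family of timing patterns is gamma-distributed inter-arrival times, with the shape set to the burstiness and the mean pinned at 1/QPS; whether the tool uses exactly this parameterization is an assumption:

```python
import random

def inter_arrival(qps: float, burstiness: float) -> float:
    """Sample the delay before the next request (seconds).

    shape = burstiness, scale = 1 / (qps * burstiness), so the mean is always
    1/qps regardless of burstiness. burstiness == 1.0 reduces to an exponential
    distribution (a Poisson process); < 1.0 clusters requests; > 1.0 spaces
    them more evenly.
    """
    return random.gammavariate(burstiness, 1.0 / (qps * burstiness))

random.seed(0)
delays = [inter_arrival(2.0, 0.8) for _ in range(100_000)]
mean = sum(delays) / len(delays)  # ≈ 0.5 s, i.e. 1 / QPS
```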
Best practices
- Start small — Begin with low QPS and small sample sizes
- Use appropriate text samples — Match your production text length distribution
- Monitor server resources — Watch CPU, memory, and network during tests
- Consider burstiness — Real-world traffic is often bursty (try 0.7–0.9)
- Test both modes — Compare streaming vs non-streaming for your use case
Troubleshooting
Common issues
| Issue | Solution |
|---|---|
| Connection errors | Verify server address and network connectivity |
| Authentication errors | Set INWORLD_API_KEY for external APIs |
| High latency | Check server load and network conditions |
| Memory issues | Reduce number-of-samples for high QPS tests |
Debug mode
Use the --verbose flag for detailed request/response logging:
```shell
./scripts/tts-load-test --verbose --host ... # other params
```
Architecture
The tool uses:
- Async/await: Efficient concurrent request handling
- Pausable timers: Accurate server-only timing measurements
- Multiple protocols: gRPC, HTTP REST API support
- Configurable clients: Pluggable client architecture
- Real-time progress: Live progress bars and status updates
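The async dispatch loop at the heart of such a tool can be sketched in a few lines: requests are created as tasks and left to overlap, while the loop only paces when the next one starts (illustrative only; the names and the fixed pacing do not come from the tool's source):

```python
import asyncio
import random

async def fake_request(text: str) -> float:
    """Stand-in for a TTS call; returns the simulated latency in seconds."""
    latency = random.uniform(0.01, 0.03)
    await asyncio.sleep(latency)
    return latency

async def run_load(texts: list[str], qps: float) -> list[float]:
    """Fire one request per sample at roughly `qps` without awaiting each in turn."""
    tasks = []
    for text in texts:
        tasks.append(asyncio.create_task(fake_request(text)))
        await asyncio.sleep(1.0 / qps)  # pace dispatch; in-flight requests overlap
    return await asyncio.gather(*tasks)

latencies = asyncio.run(run_load(["hello"] * 5, qps=50.0))
```

Because dispatch never blocks on a response, a slow request degrades its own latency numbers without throttling the offered load, which is what a load generator needs.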