A comprehensive load testing tool for TTS On-Premises that measures performance metrics including latency, throughput, and streaming characteristics across different QPS (Queries Per Second) loads.
Overview
The tool simulates realistic TTS workloads by sending requests at specified rates with configurable burstiness patterns. It measures:
- End-to-end latency
- Audio generation latency per second
- Streaming metrics (first chunk, 4th chunk, average chunk latencies)
- Request success rates
- Server performance under different load conditions
Quick start
```shell
# Install the load test tool
pip install -e .

# Basic load test with streaming
python load-test.main \
  --host http://localhost:8081 \
  --stream \
  --min-qps 1.0 \
  --max-qps 7.0 \
  --qps-step 2.0 \
  --number-of-samples 300
```
Parameters
Required
| Parameter | Description | Example |
|---|---|---|
| --host | Base address of the On-Premises TTS server (the endpoint path is appended automatically) | http://localhost:8081 |
Load configuration
| Parameter | Default | Description |
|---|---|---|
| --min-qps | 1.0 | Minimum requests per second to test |
| --max-qps | 10.0 | Maximum requests per second to test |
| --qps-step | 2.0 | Step size for QPS increments |
| --number-of-samples | 1 | Total number of texts to synthesize per QPS level |
| --burstiness | 1.0 | Request timing pattern (1.0 = Poisson, < 1.0 = bursty, > 1.0 = uniform) |
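As a sketch of how the sweep parameters could combine, the QPS levels visited by a run can be enumerated like this (the function name is illustrative, not taken from the tool's source):

```python
def qps_levels(min_qps: float, max_qps: float, qps_step: float) -> list[float]:
    """Enumerate the QPS levels a sweep visits, inclusive of max_qps."""
    levels = []
    qps = min_qps
    # Small epsilon guards against float drift excluding the final level.
    while qps <= max_qps + 1e-9:
        levels.append(round(qps, 6))
        qps += qps_step
    return levels

# With the quick-start settings (--min-qps 1.0 --max-qps 7.0 --qps-step 2.0):
print(qps_levels(1.0, 7.0, 2.0))  # [1.0, 3.0, 5.0, 7.0]
```

At each of these levels the tool sends --number-of-samples requests before stepping up to the next rate.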
TTS configuration
| Parameter | Default | Description |
|---|---|---|
| --stream | False | Use streaming synthesis (/SynthesizeSpeechStream) vs non-streaming (/SynthesizeSpeech) |
| --max_tokens | 400 | Maximum tokens to synthesize (~8s audio at 50 tokens/s) |
| --voice-ids | ["Olivia", "Remy"] | Voice IDs to use (can specify multiple) |
| --model_id | None | Model ID for TTS synthesis (optional) |
| --text_samples_file | scripts/tts_load_testing/text_samples.json | File containing text samples |
Output and analysis
| Parameter | Default | Description |
|---|---|---|
| --benchmark_name | auto-generated | Name for the benchmark run (affects output files) |
| --plot_only | False | Only generate plots from existing results (skip testing) |
| --verbose | False | Enable verbose output for debugging |
Examples
Streaming vs non-streaming comparison
```shell
# Non-streaming test
python load-test.main \
  --host http://localhost:8081 \
  --min-qps 10.0 \
  --max-qps 50.0 \
  --qps-step 10.0 \
  --number-of-samples 500 \
  --benchmark_name non-streaming-test

# Streaming test
python load-test.main \
  --host http://localhost:8081 \
  --stream \
  --min-qps 10.0 \
  --max-qps 50.0 \
  --qps-step 10.0 \
  --number-of-samples 500 \
  --benchmark_name streaming-test
```
Plot-only mode
Generate plots from existing results without re-running tests:
```shell
./scripts/tts-load-test \
  --plot_only \
  --benchmark_name prod-stress-test
```
Understanding results
The tool generates comprehensive metrics for each QPS level.
Latency metrics
- E2E Latency: Complete request-response time
- Audio Generation Latency: Time per second of generated audio
- First Chunk Latency: Time to first audio chunk (streaming only)
- 4th Chunk Latency: Time to 4th audio chunk (streaming only)
- Average Chunk Latency: Mean time between chunks (streaming only)
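Given per-chunk arrival timestamps for one streaming request, the metrics above can be derived as follows (a sketch; the field names are illustrative and are not the tool's actual schema):

```python
def chunk_metrics(request_start: float, chunk_times: list[float]) -> dict:
    """Derive streaming latency metrics from chunk arrival timestamps (seconds)."""
    metrics = {"first_chunk_latency": chunk_times[0] - request_start}
    if len(chunk_times) >= 4:
        metrics["fourth_chunk_latency"] = chunk_times[3] - request_start
    # Mean gap between consecutive chunks.
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    if gaps:
        metrics["avg_chunk_latency"] = sum(gaps) / len(gaps)
    metrics["e2e_latency"] = chunk_times[-1] - request_start
    return metrics

m = chunk_metrics(10.0, [10.2, 10.3, 10.4, 10.5, 10.7])
# first ≈ 0.2 s, fourth ≈ 0.5 s, mean gap ≈ 0.125 s, e2e ≈ 0.7 s
```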
Percentiles
Results include P50, P90, P95, and P99 percentiles for all latency metrics.
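For reference, these percentiles can be computed with the standard library alone; this sketch uses the nearest-rank method (the tool's exact interpolation scheme may differ):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

latencies = [0.12, 0.15, 0.11, 0.40, 0.13, 0.14, 0.16, 0.90, 0.18, 0.17]
summary = {f"p{p}": percentile(latencies, p) for p in (50, 90, 95, 99)}
```

Note how the tail percentiles (P95, P99) surface the two slow outliers that the median hides entirely.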
Output files
Results are saved in benchmark_result/{benchmark_name}/:
- result.json — Raw performance data
- {benchmark_name}_*.png — Performance charts
Burstiness parameter
The burstiness parameter controls request timing distribution:
| Value | Behavior |
|---|---|
| 1.0 | Poisson process (natural randomness) |
| < 1.0 | More bursty (requests come in clusters) |
| > 1.0 | More uniform (evenly spaced requests) |
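One common way to realize this family of timing patterns is gamma-distributed inter-arrival times, with the shape set to the burstiness and the mean pinned at 1/QPS; whether the tool uses exactly this parameterization is an assumption:

```python
import random

def inter_arrival(qps: float, burstiness: float) -> float:
    """Sample the delay before the next request (seconds).

    shape = burstiness, scale = 1 / (qps * burstiness), so the mean is always
    1/qps regardless of burstiness. burstiness == 1.0 reduces to an exponential
    distribution (a Poisson process); < 1.0 clusters requests; > 1.0 spaces
    them more evenly.
    """
    return random.gammavariate(burstiness, 1.0 / (qps * burstiness))

random.seed(0)
delays = [inter_arrival(2.0, 0.8) for _ in range(100_000)]
mean = sum(delays) / len(delays)  # ≈ 0.5 s, i.e. 1 / QPS
```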
Best practices
- Start small — Begin with low QPS and small sample sizes
- Use appropriate text samples — Match your production text length distribution
- Monitor server resources — Watch CPU, memory, and network during tests
- Consider burstiness — Real-world traffic is often bursty (try 0.7–0.9)
- Test both modes — Compare streaming vs non-streaming for your use case
Troubleshooting
Common issues
| Issue | Solution |
|---|---|
| Connection errors | Verify server address and network connectivity |
| Authentication errors | Set INWORLD_API_KEY for external APIs |
| High latency | Check server load and network conditions |
| Memory issues | Reduce number-of-samples for high QPS tests |
Debug mode
Use the --verbose flag for detailed request/response logging:
```shell
./scripts/tts-load-test --verbose --host ... # other params
```
Architecture
The tool uses:
- Async/await: Efficient concurrent request handling
- Pausable timers: Accurate server-only timing measurements
- Multiple protocols: gRPC, HTTP REST API support
- Configurable clients: Pluggable client architecture
- Real-time progress: Live progress bars and status updates
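The async dispatch loop at the heart of such a tool can be sketched in a few lines: requests are created as tasks and left to overlap, while the loop only paces when the next one starts (illustrative only; the names and the fixed pacing do not come from the tool's source):

```python
import asyncio
import random

async def fake_request(text: str) -> float:
    """Stand-in for a TTS call; returns the simulated latency in seconds."""
    latency = random.uniform(0.01, 0.03)
    await asyncio.sleep(latency)
    return latency

async def run_load(texts: list[str], qps: float) -> list[float]:
    """Fire one request per sample at roughly `qps` without awaiting each in turn."""
    tasks = []
    for text in texts:
        tasks.append(asyncio.create_task(fake_request(text)))
        await asyncio.sleep(1.0 / qps)  # pace dispatch; in-flight requests overlap
    return await asyncio.gather(*tasks)

latencies = asyncio.run(run_load(["hello"] * 5, qps=50.0))
```

Because dispatch never blocks on a response, a slow request degrades its own latency numbers without throttling the offered load, which is what a load generator needs.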