Overview
The tool simulates realistic TTS workloads by sending requests at specified rates with configurable burstiness patterns. It measures:

- End-to-end latency
- Audio generation latency per second
- Streaming metrics (first chunk, 4th chunk, average chunk latencies)
- Request success rates
- Server performance under different load conditions
Quick start
Parameters
Required
| Parameter | Description | Example |
|---|---|---|
| --host | Base address of the On-Premises TTS server (endpoint auto-appended) | http://localhost:8081 |
Load configuration
| Parameter | Default | Description |
|---|---|---|
| --min-qps | 1.0 | Minimum requests per second to test |
| --max-qps | 10.0 | Maximum requests per second to test |
| --qps-step | 2.0 | Step size for QPS increments |
| --number-of-samples | 1 | Total number of texts to synthesize per QPS level |
| --burstiness | 1.0 | Request timing pattern (1.0 = Poisson, < 1.0 = bursty, > 1.0 = uniform) |
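Taken together, --min-qps, --max-qps, and --qps-step define the set of QPS levels the benchmark sweeps. A minimal sketch of that enumeration (whether the upper bound is inclusive is an assumption, as is the helper name):

```python
def qps_levels(min_qps: float, max_qps: float, step: float) -> list[float]:
    """Enumerate the QPS levels swept by the benchmark (upper bound inclusive)."""
    levels = []
    q = min_qps
    while q <= max_qps + 1e-9:  # small tolerance for float accumulation
        levels.append(round(q, 6))
        q += step
    return levels

print(qps_levels(1.0, 10.0, 2.0))  # → [1.0, 3.0, 5.0, 7.0, 9.0]
```

With the defaults above, the sweep stops at 9.0 because the next increment (11.0) would exceed --max-qps.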
TTS configuration
| Parameter | Default | Description |
|---|---|---|
| --stream | False | Use streaming synthesis (/SynthesizeSpeechStream) vs non-streaming (/SynthesizeSpeech) |
| --max_tokens | 400 | Maximum tokens to synthesize (~8s audio at 50 tokens/s) |
| --voice-ids | ["Olivia", "Remy"] | Voice IDs to use (can specify multiple) |
| --model_id | None | Model ID for TTS synthesis (optional) |
| --text_samples_file | scripts/tts_load_testing/text_samples.json | File containing text samples |
Output and analysis
| Parameter | Default | Description |
|---|---|---|
| --benchmark_name | auto-generated | Name for the benchmark run (affects output files) |
| --plot_only | False | Only generate plots from existing results (skip testing) |
| --verbose | False | Enable verbose output for debugging |
Examples
Streaming vs non-streaming comparison
Plot-only mode
Generate plots from existing results without re-running tests.

Understanding results
The tool generates comprehensive metrics for each QPS level.

Latency metrics
- E2E Latency: Complete request-response time
- Audio Generation Latency: Time per second of generated audio
- First Chunk Latency: Time to first audio chunk (streaming only)
- 4th Chunk Latency: Time to 4th audio chunk (streaming only)
- Average Chunk Latency: Mean time between chunks (streaming only)
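For the streaming metrics, each chunk's arrival timestamp is enough to derive all three values. A hypothetical helper illustrating the arithmetic (not the tool's actual API):

```python
def chunk_latencies(request_start: float, chunk_times: list[float]):
    """Derive streaming latency metrics from a request start time and
    per-chunk arrival timestamps, all in seconds."""
    if not chunk_times:
        return {"first_chunk": None, "fourth_chunk": None, "avg_inter_chunk": None}
    first = chunk_times[0] - request_start
    fourth = chunk_times[3] - request_start if len(chunk_times) >= 4 else None
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    avg = sum(gaps) / len(gaps) if gaps else None
    return {"first_chunk": first, "fourth_chunk": fourth, "avg_inter_chunk": avg}

m = chunk_latencies(0.0, [0.12, 0.20, 0.28, 0.36, 0.44])
print(m)  # first_chunk=0.12, fourth_chunk=0.36, avg gap ≈ 0.08
```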
Percentiles
Results include P50, P90, P95, and P99 percentiles for all latency metrics.

Output files
Results are saved in benchmark_result/{benchmark_name}/:

- result.json — Raw performance data
- {benchmark_name}_*.png — Performance charts
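Given a list of raw latencies like those recorded in result.json (its exact schema is not documented here), the percentile summaries above can be computed with a simple nearest-rank scheme; whether the tool uses this or an interpolating method is an assumption:

```python
import math

def percentile(sorted_vals: list[float], p: float) -> float:
    """Nearest-rank percentile on a pre-sorted, non-empty list."""
    k = max(0, math.ceil(p / 100 * len(sorted_vals)) - 1)
    return sorted_vals[k]

latencies = sorted([0.21, 0.35, 0.18, 0.50, 0.27, 0.90, 0.33, 0.25, 0.41, 0.60])
summary = {f"p{p}": percentile(latencies, p) for p in (50, 90, 95, 99)}
print(summary)  # → {'p50': 0.33, 'p90': 0.6, 'p95': 0.9, 'p99': 0.9}
```

Note that with small sample counts (e.g. the default --number-of-samples of 1), P95 and P99 collapse onto the maximum observed latency, so high percentiles are only meaningful with larger runs.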
Burstiness parameter
The burstiness parameter controls request timing distribution:

| Value | Behavior |
|---|---|
| 1.0 | Poisson process (natural randomness) |
| < 1.0 | More bursty (requests come in clusters) |
| > 1.0 | More uniform (evenly spaced requests) |
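A common way to implement such a burstiness knob is to draw inter-arrival gaps from a gamma distribution whose shape parameter is the burstiness value: shape 1.0 reduces to an exponential distribution (a Poisson process), smaller shapes cluster requests, and larger shapes space them more evenly, while the mean gap stays at 1/QPS. Whether this tool uses exactly this scheme is an assumption; the sketch below shows the idea:

```python
import random

def inter_arrival_times(qps: float, burstiness: float, n: int, seed: int = 0) -> list[float]:
    """Sample n gaps (seconds) between requests. Gamma shape = burstiness,
    scale chosen so the mean gap is always 1/qps regardless of burstiness."""
    rng = random.Random(seed)
    theta = 1.0 / (qps * burstiness)  # mean of Gamma(shape, scale) is shape * scale
    return [rng.gammavariate(burstiness, theta) for _ in range(n)]

gaps = inter_arrival_times(qps=5.0, burstiness=1.0, n=10_000)
print(sum(gaps) / len(gaps))  # ≈ 0.2 s, i.e. 1/qps
```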
Performance tips
- Start small — Begin with low QPS and small sample sizes
- Use appropriate text samples — Match your production text length distribution
- Monitor server resources — Watch CPU, memory, and network during tests
- Consider burstiness — Real-world traffic is often bursty (try 0.7–0.9)
- Test both modes — Compare streaming vs non-streaming for your use case
Troubleshooting
Common issues
| Issue | Solution |
|---|---|
| Connection errors | Verify server address and network connectivity |
| Authentication errors | Set INWORLD_API_KEY for external APIs |
| High latency | Check server load and network conditions |
| Memory issues | Reduce --number-of-samples for high-QPS tests |
Debug mode
Use the --verbose flag for detailed request/response logging.
Architecture
The tool uses:

- Async/await: Efficient concurrent request handling
- Pausable timers: Accurate server-only timing measurements
- Multiple protocols: gRPC, HTTP REST API support
- Configurable clients: Pluggable client architecture
- Real-time progress: Live progress bars and status updates
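The "pausable timers" idea can be sketched as a stopwatch that accumulates time only while running, so client-side work between server interactions can be excluded from the measurement (a sketch of the concept, not the tool's actual class):

```python
import time

class PausableTimer:
    """Stopwatch that accumulates elapsed time only between start() and
    pause() calls, so only server-facing wait time is counted."""

    def __init__(self):
        self.elapsed = 0.0
        self._started_at = None

    def start(self):
        self._started_at = time.perf_counter()

    def pause(self):
        if self._started_at is not None:
            self.elapsed += time.perf_counter() - self._started_at
            self._started_at = None

t = PausableTimer()
t.start()          # waiting on the server: counted
time.sleep(0.05)
t.pause()          # client-side decoding/bookkeeping: not counted
time.sleep(0.05)
t.start()
time.sleep(0.05)
t.pause()
print(round(t.elapsed, 2))  # roughly 0.1 s (the two counted sleeps)
```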