A comprehensive load testing tool for TTS On-Premises that measures performance metrics including latency, throughput, and streaming characteristics across different QPS (Queries Per Second) loads.

Overview

The tool simulates realistic TTS workloads by sending requests at specified rates with configurable burstiness patterns. It measures:
  • End-to-end latency
  • Audio generation latency per second
  • Streaming metrics (first chunk, 4th chunk, average chunk latencies)
  • Request success rates
  • Server performance under different load conditions
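The "audio generation latency per second" metric is easiest to see with the token-rate figure quoted later in this doc (~8 s of audio per 400 tokens, i.e. ~50 tokens/s). The following is a hypothetical sketch, assuming the metric is end-to-end latency divided by the duration of the generated audio; the tool's actual definition may differ.

```python
# Hypothetical sketch (not the tool's code): convert a token budget into
# audio duration, then normalize end-to-end latency by that duration.

TOKENS_PER_SECOND = 50  # assumption taken from the --max_tokens description

def audio_seconds(token_count: int) -> float:
    """Approximate duration of the generated audio in seconds."""
    return token_count / TOKENS_PER_SECOND

def latency_per_audio_second(e2e_latency_s: float, token_count: int) -> float:
    """Latency spent per second of generated audio; values below 1.0
    mean the server synthesizes faster than real time."""
    return e2e_latency_s / audio_seconds(token_count)

print(audio_seconds(400))                  # 8.0
print(latency_per_audio_second(2.0, 400))  # 0.25
```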

Quick start

# Install the load test tool
pip install -e .

# Basic load test with streaming
python load-test.main \
    --host http://localhost:8081 \
    --stream \
    --min-qps 1.0 \
    --max-qps 7.0 \
    --qps-step 2.0 \
    --number-of-samples 300

Parameters

Required

| Parameter | Description | Example |
| --- | --- | --- |
| --host | Base address of the On-Premises TTS server (endpoint auto-appended) | http://localhost:8081 |

Load configuration

| Parameter | Default | Description |
| --- | --- | --- |
| --min-qps | 1.0 | Minimum requests per second to test |
| --max-qps | 10.0 | Maximum requests per second to test |
| --qps-step | 2.0 | Step size for QPS increments |
| --number-of-samples | 1 | Total number of texts to synthesize per QPS level |
| --burstiness | 1.0 | Request timing pattern (1.0 = Poisson, < 1.0 = bursty, > 1.0 = uniform) |
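The three QPS flags define a sweep from --min-qps up to --max-qps in --qps-step increments. A minimal sketch (assumed behavior, not the tool's code) of how the tested levels expand:

```python
def qps_levels(min_qps: float, max_qps: float, step: float) -> list[float]:
    """Enumerate QPS levels from min_qps up to and including max_qps."""
    levels = []
    qps = min_qps
    while qps <= max_qps + 1e-9:  # small tolerance for float accumulation
        levels.append(round(qps, 6))
        qps += step
    return levels

# The quick-start example (--min-qps 1.0 --max-qps 7.0 --qps-step 2.0)
# would test four levels:
print(qps_levels(1.0, 7.0, 2.0))  # [1.0, 3.0, 5.0, 7.0]
```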

TTS configuration

| Parameter | Default | Description |
| --- | --- | --- |
| --stream | False | Use streaming synthesis (/SynthesizeSpeechStream) vs non-streaming (/SynthesizeSpeech) |
| --max_tokens | 400 | Maximum tokens to synthesize (~8 s of audio at 50 tokens/s) |
| --voice-ids | ["Olivia", "Remy"] | Voice IDs to use (can specify multiple) |
| --model_id | None | Model ID for TTS synthesis (optional) |
| --text_samples_file | scripts/tts_load_testing/text_samples.json | File containing text samples |

Output and analysis

| Parameter | Default | Description |
| --- | --- | --- |
| --benchmark_name | auto-generated | Name for the benchmark run (affects output file names) |
| --plot_only | False | Only generate plots from existing results (skip testing) |
| --verbose | False | Enable verbose output for debugging |

Examples

Streaming vs non-streaming comparison

# Non-streaming test
python load-test.main \
    --host http://localhost:8081 \
    --min-qps 10.0 \
    --max-qps 50.0 \
    --qps-step 10.0 \
    --number-of-samples 500 \
    --benchmark_name non-streaming-test

# Streaming test
python load-test.main \
    --host http://localhost:8081 \
    --stream \
    --min-qps 10.0 \
    --max-qps 50.0 \
    --qps-step 10.0 \
    --number-of-samples 500 \
    --benchmark_name streaming-test

Plot-only mode

Generate plots from existing results without re-running tests:
./scripts/tts-load-test \
    --plot_only \
    --benchmark_name prod-stress-test

Understanding results

The tool generates comprehensive metrics for each QPS level.

Latency metrics

  • E2E Latency: Complete request-response time
  • Audio Generation Latency: Time per second of generated audio
  • First Chunk Latency: Time to first audio chunk (streaming only)
  • 4th Chunk Latency: Time to 4th audio chunk (streaming only)
  • Average Chunk Latency: Mean time between chunks (streaming only)
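The chunk metrics above can all be derived from one request's chunk arrival timestamps. A hypothetical sketch of that derivation (the metric names mirror the list above; the tool's internal representation may differ):

```python
def chunk_metrics(request_start: float, chunk_times: list[float]) -> dict:
    """Derive streaming latency metrics (seconds) from the request start
    time and the arrival timestamps of each audio chunk."""
    first = chunk_times[0] - request_start
    fourth = chunk_times[3] - request_start if len(chunk_times) >= 4 else None
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    avg_gap = round(sum(gaps) / len(gaps), 6) if gaps else 0.0
    return {"first_chunk": first, "fourth_chunk": fourth, "avg_chunk": avg_gap}

# Five chunks arriving 200 ms apart, the first after 250 ms:
print(chunk_metrics(0.0, [0.25, 0.45, 0.65, 0.85, 1.05]))
```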

Percentiles

Results include P50, P90, P95, and P99 percentiles for all latency metrics.
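For reference, a nearest-rank percentile over a list of latency samples can be computed like this (a sketch only; the tool may use an interpolating method instead, which gives slightly different values):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value at rank ceil-ish p% of the
    sorted samples, clamped to the valid index range."""
    ranked = sorted(samples)
    k = max(0, min(len(ranked) - 1, round(p / 100 * len(ranked)) - 1))
    return ranked[k]

latencies = [0.8, 1.1, 0.9, 1.4, 2.3, 1.0, 1.2, 0.7, 3.1, 1.5]
for p in (50, 90, 95, 99):
    print(f"P{p}: {percentile(latencies, p)}")
```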

Output files

Results are saved in benchmark_result/{benchmark_name}/:
  • result.json — Raw performance data
  • {benchmark_name}_*.png — Performance charts

Burstiness parameter

The burstiness parameter controls request timing distribution:
| Value | Behavior |
| --- | --- |
| 1.0 | Poisson process (natural randomness) |
| < 1.0 | More bursty (requests come in clusters) |
| > 1.0 | More uniform (evenly spaced requests) |
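A common way to implement this kind of knob in serving benchmarks is to draw inter-arrival gaps from a gamma distribution whose shape parameter is the burstiness value: shape 1 reduces to an exponential gap (Poisson process), shape < 1 clusters requests, and shape > 1 spaces them more evenly. This is a sketch of the idea only, not necessarily how this tool implements --burstiness:

```python
import random

def inter_arrival_times(qps: float, burstiness: float, n: int,
                        seed: int = 0) -> list[float]:
    """Sample n request gaps whose mean stays at 1/qps regardless of
    the burstiness (gamma shape) value."""
    rng = random.Random(seed)
    scale = 1.0 / (qps * burstiness)  # mean = shape * scale = 1/qps
    return [rng.gammavariate(burstiness, scale) for _ in range(n)]

gaps = inter_arrival_times(qps=5.0, burstiness=1.0, n=10_000)
print(round(sum(gaps) / len(gaps), 2))  # mean gap stays near 1/qps = 0.2 s
```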

Performance tips

  1. Start small — Begin with low QPS and small sample sizes
  2. Use appropriate text samples — Match your production text length distribution
  3. Monitor server resources — Watch CPU, memory, and network during tests
  4. Consider burstiness — Real-world traffic is often bursty (try 0.7–0.9)
  5. Test both modes — Compare streaming vs non-streaming for your use case

Troubleshooting

Common issues

| Issue | Solution |
| --- | --- |
| Connection errors | Verify server address and network connectivity |
| Authentication errors | Set INWORLD_API_KEY for external APIs |
| High latency | Check server load and network conditions |
| Memory issues | Reduce --number-of-samples for high-QPS tests |

Debug mode

Use the --verbose flag for detailed request/response logging:
./scripts/tts-load-test --verbose --host ... # other params

Architecture

The tool uses:
  • Async/await: Efficient concurrent request handling
  • Pausable timers: Accurate server-only timing measurements
  • Multiple protocols: gRPC, HTTP REST API support
  • Configurable clients: Pluggable client architecture
  • Real-time progress: Live progress bars and status updates
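The pausable-timer idea can be sketched as a small class that accumulates elapsed time only while running, so client-side work (writing audio to disk, parsing responses) can be excluded from server-side measurements. This is a hypothetical illustration, not the tool's actual class:

```python
import time

class PausableTimer:
    """Accumulates wall time only between start() and pause() calls."""

    def __init__(self):
        self._elapsed = 0.0
        self._started_at = None

    def start(self):
        self._started_at = time.perf_counter()

    def pause(self):
        if self._started_at is not None:
            self._elapsed += time.perf_counter() - self._started_at
            self._started_at = None

    def elapsed(self) -> float:
        running = (time.perf_counter() - self._started_at
                   if self._started_at is not None else 0.0)
        return self._elapsed + running

timer = PausableTimer()
timer.start()
time.sleep(0.05)   # counted (simulates waiting on the server)
timer.pause()
time.sleep(0.05)   # not counted (simulates client-side work)
timer.start()
time.sleep(0.05)   # counted
timer.pause()
print(timer.elapsed())  # roughly 0.1 s, not 0.15 s
```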