{
"create": {
"voiceId": "Dennis",
"modelId": "inworld-tts-1.5-max",
"bufferCharThreshold": 100,
"autoMode": true,
"timestampType": "WORD",
"timestampTransportStrategy": "ASYNC"
},
"contextId": "ctx-1"
}

{
"send_text": {
"text": "Hello, what a wonderful day to be a text-to-speech model!",
"flush_context": {}
},
"contextId": "ctx-1"
}

{
"flush_context": {},
"contextId": "ctx-1"
}

{
"close_context": {},
"contextId": "ctx-1"
}

{
"result": {
"contextId": "ctx-1",
"contextCreated": {
"voiceId": "Dennis",
"audioConfig": {
"audioEncoding": "LINEAR16",
"sampleRateHertz": 16000
},
"modelId": "inworld-tts-1.5-max",
"timestampType": "WORD",
"maxBufferDelayMs": 3000,
"autoMode": true,
"timestampTransportStrategy": "SYNC"
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}

{
"result": {
"contextId": "ctx-1",
"audioChunk": {
"audioContent": "UklGRgSYAABXQVZFZm10IBAAAAABAAEAgD4AAAB9AAACABAAZGF0YeCX=",
"usage": {
"processedCharactersCount": 79,
"modelId": "inworld-tts-1.5-max"
},
"timestampInfo": {
"wordAlignment": {
"words": [
"Hello,",
"what",
"a",
"wonderful",
"day",
"to",
"be",
"a",
"text-to-speech",
"model."
],
"wordStartTimeSeconds": [
0.031,
0.375,
0.901,
1.002,
1.386,
1.548,
1.649,
1.771,
1.852,
2.58
],
"wordEndTimeSeconds": [
0.355,
0.86,
0.921,
1.326,
1.528,
1.609,
1.71,
1.791,
2.539,
2.802
]
}
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}
}

{
"result": {
"contextId": "ctx-1",
"contextClosed": {},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}

{
"result": {
"contextId": "ctx-1",
"flushCompleted": {},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}

Generate audio from text input while managing multiple independent audio generation streams over a single WebSocket connection.
The independent audio streams each correspond to a context, identified by contextId, that maintains its own state. To use the API:
1. Create a context with your chosen voice and configuration. Text sent to a context is buffered on the server until it is flushed (tune buffering with maxBufferDelayMs and bufferCharThreshold in the context configurations).
2. Optionally enable auto_mode, which automatically balances latency and quality of the generations.
3. Send text and flush the context to start synthesis. Every response includes the contextId so you can match the audio to the request.

Your authentication credentials. For Basic authentication, populate the value Basic $INWORLD_API_KEY.
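The four client messages above all share the same shape: an operation object plus a contextId. As an illustration, the helpers below build those payloads; the field names come from the example payloads in this document, while the helper names themselves are our own.

```python
import json

def create_context(context_id, voice_id, model_id, **config):
    """Build a `create` message; extra keyword args become context config."""
    return json.dumps({"create": {"voiceId": voice_id, "modelId": model_id, **config},
                       "contextId": context_id})

def send_text(context_id, text, flush=False):
    """Build a `send_text` message; include `flush_context` to synthesize immediately."""
    body = {"text": text}
    if flush:
        body["flush_context"] = {}
    return json.dumps({"send_text": body, "contextId": context_id})

def flush_context(context_id):
    return json.dumps({"flush_context": {}, "contextId": context_id})

def close_context(context_id):
    return json.dumps({"close_context": {}, "contextId": context_id})

# Each string is sent as one text frame over the WebSocket connection, e.g.:
#   await ws.send(create_context("ctx-1", "Dennis", "inworld-tts-1.5-max", autoMode=True))
```

Because every message carries its contextId, the same connection can interleave messages for several contexts.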
Create a new context with the specified voice and configuration. A context is an independent conversation happening over the connection. Each context's configuration is completely separate: contexts on the same connection can use different voice IDs, models, output formats, and so on. Note: each connection supports at most 5 contexts. If you don't need multiple contexts, omit contextId from the message to use a single-context connection.
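A client that multiplexes several contexts may want to track what it has opened and stay under the 5-context cap before the server rejects a create. The sketch below is purely client-side bookkeeping (the server remains the authority), and the second voice ID is a placeholder, not a documented voice:

```python
import json

MAX_CONTEXTS = 5  # per-connection limit noted above

class ContextRegistry:
    """Client-side bookkeeping for open contexts on one connection (illustrative)."""

    def __init__(self):
        self.open = {}

    def create(self, context_id, voice_id, model_id):
        if len(self.open) >= MAX_CONTEXTS:
            raise RuntimeError("a single connection supports at most 5 contexts")
        self.open[context_id] = {"voiceId": voice_id, "modelId": model_id}
        return json.dumps({"create": self.open[context_id], "contextId": context_id})

    def close(self, context_id):
        self.open.pop(context_id)
        return json.dumps({"close_context": {}, "contextId": context_id})

# Two independent contexts with different voices on the same connection:
registry = ContextRegistry()
registry.create("ctx-1", "Dennis", "inworld-tts-1.5-max")
registry.create("ctx-2", "PlaceholderVoice", "inworld-tts-1.5-max")
```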
Send text to be synthesized for a specific context. A single send_text request can carry at most 1000 characters. Text is buffered on the server unless you include flush_context in the message, which starts synthesis immediately.
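Longer passages therefore have to be split across multiple send_text requests. One way to do that, as a sketch (the helper name is ours, and breaking at spaces is just a convenience, not an API requirement):

```python
def chunk_text(text, limit=1000):
    """Split `text` into pieces of at most `limit` characters,
    preferring to break at spaces, for successive send_text requests."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind(" ", 0, limit + 1)
        if cut <= 0:            # no space found: hard-cut at the limit
            cut = limit
        chunks.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

Each returned piece can then go into its own send_text message on the same context.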
Flush a context and start synthesis of all accumulated text. Note that the buffer will automatically flush all text if the length of text is greater than 1000 characters, regardless of any other buffer settings.
Close an existing context and release all of its resources. Closing a context implicitly flushes it first, so any text remaining in the buffer is synthesized before the context is closed. Note that the session is automatically closed after 10 minutes of inactivity across all contexts.
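Because the server closes idle sessions, a long-lived client may want to track its own activity window and reconnect (or close cleanly) before the cutoff. A minimal sketch, assuming the 10-minute figure above; the class and its injectable clock are our own invention, used here so the logic is testable:

```python
import time

IDLE_TIMEOUT_S = 600  # 10 minutes of inactivity closes the session (per the note above)

class IdleTimer:
    """Track time since the last message on any context (illustrative only)."""

    def __init__(self, now=time.monotonic):
        self.now = now                  # injectable clock for testing
        self.last_activity = self.now()

    def touch(self):
        """Call on every send or receive, on any context."""
        self.last_activity = self.now()

    def expired(self):
        return self.now() - self.last_activity >= IDLE_TIMEOUT_S
```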
Event sent when a new TTS context has been successfully created
Audio data chunk containing synthesized speech
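In the audioChunk payload, audioContent is base64-encoded audio (a RIFF/WAV container for the LINEAR16 config shown above), and the three wordAlignment arrays are parallel: index i of words, wordStartTimeSeconds, and wordEndTimeSeconds all describe the i-th word. A decoding sketch, using a trimmed-down payload in the shape of the example response (the audio bytes here are fake stand-ins):

```python
import base64

chunk = {
    "audioContent": base64.b64encode(b"RIFF\x00fake-wav-bytes").decode(),
    "timestampInfo": {
        "wordAlignment": {
            "words": ["Hello,", "what", "a"],
            "wordStartTimeSeconds": [0.031, 0.375, 0.901],
            "wordEndTimeSeconds": [0.355, 0.86, 0.921],
        }
    },
}

audio = base64.b64decode(chunk["audioContent"])  # raw WAV bytes in this config
align = chunk["timestampInfo"]["wordAlignment"]
# Zip the parallel arrays into (word, start, end) triples:
timeline = list(zip(align["words"],
                    align["wordStartTimeSeconds"],
                    align["wordEndTimeSeconds"]))
```

The resulting timeline is convenient for captioning or lip-sync, where each word needs its start and end offsets within the synthesized audio.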
Event sent when a context has been closed
Event sent when speech synthesis for a flush of text has completed. Some WebSocket use cases need an indicator that synthesis for a flushed batch of text is finished, so an empty "flushCompleted": {} event is sent at the end of speech synthesis for each flush. Note that the implementation currently assumes flushes execute sequentially, so the first flushCompleted event corresponds to the first flush call made on the client side.
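Given that sequential-flush assumption, a client can pair each flushCompleted event with the oldest outstanding flush using a simple FIFO queue. A minimal sketch (the class and tag values are our own):

```python
from collections import deque

class FlushTracker:
    """Pair flushCompleted events with flush calls in FIFO order,
    relying on the sequential-flush assumption described above."""

    def __init__(self):
        self.pending = deque()

    def on_flush_sent(self, tag):
        self.pending.append(tag)        # remember each flush we issued

    def on_flush_completed(self):
        return self.pending.popleft()   # oldest outstanding flush finished
```

For example, after issuing two flushes tagged "greeting" and "farewell", the first flushCompleted event resolves to "greeting" and the second to "farewell".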