{
"create": {
"voiceId": "Dennis",
"modelId": "inworld-tts-1.5-max",
"bufferCharThreshold": 100,
"autoMode": true,
"timestampType": "WORD",
"timestampTransportStrategy": "ASYNC"
},
"contextId": "ctx-1"
}

{
"send_text": {
"text": "Hello, what a wonderful day to be a text-to-speech model!",
"flush_context": {}
},
"contextId": "ctx-1"
}

{
"flush_context": {},
"contextId": "ctx-1"
}

{
"close_context": {},
"contextId": "ctx-1"
}

{
"result": {
"contextId": "ctx-1",
"contextCreated": {
"voiceId": "Dennis",
"audioConfig": {
"audioEncoding": "LINEAR16",
"sampleRateHertz": 16000
},
"modelId": "inworld-tts-1.5-max",
"timestampType": "WORD",
"maxBufferDelayMs": 3000,
"autoMode": true,
"timestampTransportStrategy": "SYNC"
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}

{
"result": {
"contextId": "ctx-1",
"audioChunk": {
"audioContent": "UklGRgSYAABXQVZFZm10IBAAAAABAAEAgD4AAAB9AAACABAAZGF0YeCX=",
"usage": {
"processedCharactersCount": 79,
"modelId": "inworld-tts-1.5-max"
},
"timestampInfo": {
"wordAlignment": {
"words": [
"Hello,",
"what",
"a",
"wonderful",
"day",
"to",
"be",
"a",
"text-to-speech",
"model."
],
"wordStartTimeSeconds": [
0.031,
0.375,
0.901,
1.002,
1.386,
1.548,
1.649,
1.771,
1.852,
2.58
],
"wordEndTimeSeconds": [
0.355,
0.86,
0.921,
1.326,
1.528,
1.609,
1.71,
1.791,
2.539,
2.802
]
}
},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}
}

{
"result": {
"contextId": "ctx-1",
"contextClosed": {},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}

{
"result": {
"contextId": "ctx-1",
"flushCompleted": {},
"status": {
"code": 0,
"message": "",
"details": []
}
}
}

Generate audio from text input while managing multiple independent audio generation streams over a single WebSocket connection.
The independent audio streams each correspond to a context, identified by contextId, that maintains its own state. To use the API:
1. Create a context with your chosen voice and configuration. Text sent to a context is buffered on the server until it is flushed (tune buffering with maxBufferDelayMs and bufferCharThreshold in the context configurations).
2. Optionally enable auto_mode, which automatically balances latency and quality of the generations.
3. Send text and flush the context to start synthesis. Every response includes the contextId so you can match the audio to the request.

Your authentication credentials. For Basic authentication, populate the value Basic $INWORLD_API_KEY.
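The four client messages above all share the same shape: an operation object plus a contextId. As an illustration, the helpers below build those payloads; the field names come from the example payloads in this document, while the helper names themselves are our own.

```python
import json

def create_context(context_id, voice_id, model_id, **config):
    """Build a `create` message; extra keyword args become context config."""
    return json.dumps({"create": {"voiceId": voice_id, "modelId": model_id, **config},
                       "contextId": context_id})

def send_text(context_id, text, flush=False):
    """Build a `send_text` message; include `flush_context` to synthesize immediately."""
    body = {"text": text}
    if flush:
        body["flush_context"] = {}
    return json.dumps({"send_text": body, "contextId": context_id})

def flush_context(context_id):
    return json.dumps({"flush_context": {}, "contextId": context_id})

def close_context(context_id):
    return json.dumps({"close_context": {}, "contextId": context_id})

# Each string is sent as one text frame over the WebSocket connection, e.g.:
#   await ws.send(create_context("ctx-1", "Dennis", "inworld-tts-1.5-max", autoMode=True))
```

Because every message carries its contextId, the same connection can interleave messages for several contexts.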
Create a new context with the specified voice and configuration. A context is an independent conversation happening over the connection. Each context's configuration is completely separate: contexts on the same connection can use different voice IDs, models, output formats, and so on. Note: each connection supports at most 5 contexts. If you don't need multiple contexts, omit contextId from the message to use a single-context connection.
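A client that multiplexes several contexts may want to track what it has opened and stay under the 5-context cap before the server rejects a create. The sketch below is purely client-side bookkeeping (the server remains the authority), and the second voice ID is a placeholder, not a documented voice:

```python
import json

MAX_CONTEXTS = 5  # per-connection limit noted above

class ContextRegistry:
    """Client-side bookkeeping for open contexts on one connection (illustrative)."""

    def __init__(self):
        self.open = {}

    def create(self, context_id, voice_id, model_id):
        if len(self.open) >= MAX_CONTEXTS:
            raise RuntimeError("a single connection supports at most 5 contexts")
        self.open[context_id] = {"voiceId": voice_id, "modelId": model_id}
        return json.dumps({"create": self.open[context_id], "contextId": context_id})

    def close(self, context_id):
        self.open.pop(context_id)
        return json.dumps({"close_context": {}, "contextId": context_id})

# Two independent contexts with different voices on the same connection:
registry = ContextRegistry()
registry.create("ctx-1", "Dennis", "inworld-tts-1.5-max")
registry.create("ctx-2", "PlaceholderVoice", "inworld-tts-1.5-max")
```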
Send text to be synthesized for a specific context. A single send_text request can carry at most 1000 characters. Text is buffered on the server unless you include flush_context in the message, which starts synthesis immediately.
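Longer passages therefore have to be split across multiple send_text requests. One way to do that, as a sketch (the helper name is ours, and breaking at spaces is just a convenience, not an API requirement):

```python
def chunk_text(text, limit=1000):
    """Split `text` into pieces of at most `limit` characters,
    preferring to break at spaces, for successive send_text requests."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind(" ", 0, limit + 1)
        if cut <= 0:            # no space found: hard-cut at the limit
            cut = limit
        chunks.append(text[:cut].rstrip())
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

Each returned piece can then go into its own send_text message on the same context.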
Flush a context and start synthesis of all accumulated text. Note that the buffer will automatically flush all text if the length of text is greater than 1000 characters, regardless of any other buffer settings.
Close an existing context and release all of its resources. Closing a context implicitly flushes it first, so any text remaining in the buffer is synthesized before the context is closed. Note that the session is automatically closed after 10 minutes of inactivity across all contexts.
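Because the server closes idle sessions, a long-lived client may want to track its own activity window and reconnect (or close cleanly) before the cutoff. A minimal sketch, assuming the 10-minute figure above; the class and its injectable clock are our own invention, used here so the logic is testable:

```python
import time

IDLE_TIMEOUT_S = 600  # 10 minutes of inactivity closes the session (per the note above)

class IdleTimer:
    """Track time since the last message on any context (illustrative only)."""

    def __init__(self, now=time.monotonic):
        self.now = now                  # injectable clock for testing
        self.last_activity = self.now()

    def touch(self):
        """Call on every send or receive, on any context."""
        self.last_activity = self.now()

    def expired(self):
        return self.now() - self.last_activity >= IDLE_TIMEOUT_S
```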
Event sent when a new TTS context has been successfully created
Audio data chunk containing synthesized speech
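In the audioChunk payload, audioContent is base64-encoded audio (a RIFF/WAV container for the LINEAR16 config shown above), and the three wordAlignment arrays are parallel: index i of words, wordStartTimeSeconds, and wordEndTimeSeconds all describe the i-th word. A decoding sketch, using a trimmed-down payload in the shape of the example response (the audio bytes here are fake stand-ins):

```python
import base64

chunk = {
    "audioContent": base64.b64encode(b"RIFF\x00fake-wav-bytes").decode(),
    "timestampInfo": {
        "wordAlignment": {
            "words": ["Hello,", "what", "a"],
            "wordStartTimeSeconds": [0.031, 0.375, 0.901],
            "wordEndTimeSeconds": [0.355, 0.86, 0.921],
        }
    },
}

audio = base64.b64decode(chunk["audioContent"])  # raw WAV bytes in this config
align = chunk["timestampInfo"]["wordAlignment"]
# Zip the parallel arrays into (word, start, end) triples:
timeline = list(zip(align["words"],
                    align["wordStartTimeSeconds"],
                    align["wordEndTimeSeconds"]))
```

The resulting timeline is convenient for captioning or lip-sync, where each word needs its start and end offsets within the synthesized audio.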
Event sent when a context has been closed
Event sent when speech synthesis for a flush of text has completed. Some WebSocket use cases need an indicator that synthesis for a flushed batch of text is finished, so an empty "flushCompleted": {} event is sent at the end of speech synthesis for each flush. Note that the implementation currently assumes flushes execute sequentially, so the first flushCompleted event corresponds to the first flush call made on the client side.
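Given that sequential-flush assumption, a client can pair each flushCompleted event with the oldest outstanding flush using a simple FIFO queue. A minimal sketch (the class and tag values are our own):

```python
from collections import deque

class FlushTracker:
    """Pair flushCompleted events with flush calls in FIFO order,
    relying on the sequential-flush assumption described above."""

    def __init__(self):
        self.pending = deque()

    def on_flush_sent(self, tag):
        self.pending.append(tag)        # remember each flush we issued

    def on_flush_completed(self):
        return self.pending.popleft()   # oldest outstanding flush finished
```

For example, after issuing two flushes tagged "greeting" and "farewell", the first flushCompleted event resolves to "greeting" and the second to "farewell".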