The node-tts template illustrates how to convert text to speech using the TTS node.
Architecture
  • Backend: Inworld Runtime
  • Frontend: N/A (CLI example)

Run the Template

  1. Download and extract the Inworld Templates.
  2. Install the Runtime SDK inside the cli directory.
    yarn add @inworld/runtime
    
  3. Set up your Base64 Runtime API key by copying the .env-sample file to a .env file in the cli folder and adding your API key. The template reads this value from the environment at startup (see the sketch after these steps).
    .env
    # Inworld Runtime Base64 API key
    INWORLD_API_KEY=<your_api_key_here>
    
  4. Run the template. To try a different model or voice, specify the model with the --modelId parameter and a voice with the --voiceName parameter:
    yarn node-tts "Hello, how are you?" --modelId=inworld-tts-1-max --voiceName=Ronald
    
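Inside the template, the key from step 3 is read from the environment at startup. A minimal sketch of that step, assuming the template loads .env via the dotenv package (the actual entry point may differ):

// Minimal sketch: load .env and fail fast if the key is missing.
// Assumption: dotenv is used; adjust to the template's actual setup code.
import 'dotenv/config';

const apiKey = process.env.INWORLD_API_KEY;
if (!apiKey) {
  throw new Error('INWORLD_API_KEY is not set; check the .env file in the cli folder');
}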

Understanding the Template

The main functionality of the template is contained in the run function, which demonstrates how to use the Inworld Runtime to convert text to speech using the TTS node. Let's break the template down in more detail:

1) Node initialization

We start by creating the TTS node.
const ttsNode = new RemoteTTSNode({
  id: 'tts_node',
  speakerId: voiceName,
  modelId,
  sampleRate: SAMPLE_RATE,
  temperature: 1.1,
  speakingRate: 1,
});
When creating the TTS node, you can specify:
  • id: A unique identifier for the node
  • speakerId: The voice to use for synthesis (see available voices)
  • modelId: The TTS model to use for synthesis
  • sampleRate: Audio output sample rate
  • temperature: Controls randomness in synthesis (higher values produce more varied delivery)
  • speakingRate: Controls the speed of speech (1.0 is the voice’s natural speed)
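For example, to get slower and more consistent output you could lower both temperature and speakingRate. A sketch with illustrative values (the voice and model names are taken from the run command above):

// Illustrative variant: the values here are examples, not recommendations.
const calmTtsNode = new RemoteTTSNode({
  id: 'tts_node_calm',
  speakerId: 'Ronald',          // voice from the run example above
  modelId: 'inworld-tts-1-max', // model from the run example above
  sampleRate: SAMPLE_RATE,
  temperature: 0.7,  // less variation between runs
  speakingRate: 0.9, // slightly slower than the voice's natural speed
});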

2) Graph initialization

Next, we create the graph using the GraphBuilder, adding the TTS node and setting it as both the start and end node:
const graph = new GraphBuilder({
  id: 'node_tts_graph',
  apiKey,
  enableRemoteConfig: false,
})
  .addNode(ttsNode)
  .setStartNode(ttsNode)
  .setEndNode(ttsNode)
  .build();
The GraphBuilder configuration includes:
  • id: A unique identifier for the graph
  • apiKey: Your Inworld API key for authentication
  • enableRemoteConfig: Whether to enable remote configuration (set to false for local execution)
In this example there is only a single TTS node, so we set it as both the start and end node. In more complex applications, you could connect other nodes, such as an LLM node, to the TTS node to create a processing pipeline, as sketched below.
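A minimal sketch of such a pipeline, assuming the builder exposes an addEdge method and the SDK provides an LLM chat node (RemoteLLMChatNode and its options are assumptions here; check the SDK for the exact names):

// Hypothetical two-node pipeline: the LLM node's output feeds the TTS node.
const llmNode = new RemoteLLMChatNode({ id: 'llm_node' });

const pipeline = new GraphBuilder({
  id: 'llm_to_tts_graph',
  apiKey,
  enableRemoteConfig: false,
})
  .addNode(llmNode)
  .addNode(ttsNode)
  .addEdge(llmNode, ttsNode) // route LLM output into TTS
  .setStartNode(llmNode)
  .setEndNode(ttsNode)
  .build();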

3) Graph execution

Now we execute the graph with the text input directly:
const { outputStream } = graph.start(text);
The text input is passed directly to the graph, where the TTS node processes it.
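For context, the text and the optional --modelId/--voiceName flags come from the command line. A minimal sketch of how they might be parsed, assuming plain process.argv (the real template may use an argument-parsing library, and the default constants below are hypothetical):

// Hypothetical argument parsing; the template's actual parsing may differ.
const args = process.argv.slice(2);
const text = args.find((a) => !a.startsWith('--')) ?? 'Hello, how are you?';
// DEFAULT_MODEL_ID and DEFAULT_VOICE_NAME are hypothetical fallbacks.
const modelId =
  args.find((a) => a.startsWith('--modelId='))?.split('=')[1] ?? DEFAULT_MODEL_ID;
const voiceName =
  args.find((a) => a.startsWith('--voiceName='))?.split('=')[1] ?? DEFAULT_VOICE_NAME;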

4) Response handling

The audio generation results are handled using the processResponse method, which supports streaming audio responses:
let initialText = '';
let resultCount = 0;
let allAudioData: number[] = [];

for await (const result of outputStream) {
  await result.processResponse({
    TTSOutputStream: async (ttsStream: GraphTypes.TTSOutputStream) => {
      for await (const chunk of ttsStream) {
        // A chunk may carry the text being synthesized, audio samples, or both.
        if (chunk.text) initialText += chunk.text;
        if (chunk.audio?.data) {
          // Accumulate samples so the full clip can be encoded once at the end.
          allAudioData = allAudioData.concat(Array.from(chunk.audio.data));
        }
        resultCount++;
      }
    },
  });
}

console.log(`Result count: ${resultCount}`);
console.log(`Initial text: ${initialText}`);
The response handler processes:
  • TTSOutputStream: Streaming audio responses containing both text and audio data
  • chunk.text: The text being synthesized
  • chunk.audio.data: The audio data as Float32Array samples
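The template uses these float samples directly in the next step. If a downstream consumer needed 16-bit PCM instead, a small helper could convert them (illustrative; not part of the template):

// Illustrative helper: convert [-1, 1] float samples to 16-bit signed PCM.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to the valid range
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}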

5) Audio file creation

Finally, we encode the accumulated audio samples and save them as a WAV file:
const audio = {
  sampleRate: SAMPLE_RATE,
  channelData: [new Float32Array(allAudioData)], // mono: a single channel of samples
};

const buffer = await wavEncoder.encode(audio);

// Create the output directory if it does not exist yet.
if (!fs.existsSync(OUTPUT_DIRECTORY)) {
  fs.mkdirSync(OUTPUT_DIRECTORY, { recursive: true });
}

fs.writeFileSync(OUTPUT_PATH, Buffer.from(buffer));

console.log(`Audio saved to ${OUTPUT_PATH}`);
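Because channelData holds a single channel at SAMPLE_RATE, the clip's duration follows directly from the sample count, which makes for a quick sanity check:

// Duration in seconds = total samples / samples per second (mono audio).
const durationSec = allAudioData.length / SAMPLE_RATE;
console.log(`Wrote ${durationSec.toFixed(2)}s of audio at ${SAMPLE_RATE} Hz`);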