STT (Speech-to-Text) Node Demo - Inworld AI Documentation

This demo showcases how to use the STTNode.

Run the Template

Go to Assets/InworldRuntime/Scenes/Nodes and play the STTNode scene.
Once the graph is compiled, speak into the microphone to generate text.

Understanding the Graph

You can find the graph on the InworldGraphExecutor of STTCanvas.

The graph is very simple. It contains a single node, STTNode, with no edges. STTNode is both the StartNode and the EndNode.

InworldController

The InworldController is also simple; it contains only one primitive module: STT.

For details about the primitive module, see the STT Primitive Demo.

InworldAudioManager

InworldAudioManager handles audio processing and is also modular. In this demo, it uses four components:

AudioCapturer: Manages microphone on/off and input devices. Uses Unity’s Microphone by default, and can be extended via third‑party plugins.
AudioCollector: Collects raw samples from the microphone.
PlayerVoiceDetector: Implements IPlayerAudioEventHandler and ICalibrateAudioHandler to emit player audio events and decide which timestamped segments to keep from the stream.
AudioDispatcher: Sends the captured microphone data for downstream processing.
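
To make the detection step more concrete, here is a minimal sketch of threshold-based speech detection using a signal-to-noise ratio, the approach the Workflow section attributes to PlayerVoiceDetector. The class name SnrVoiceDetectorSketch, the 10 dB default threshold, and the RMS-based noise estimate are all assumptions made for illustration. This is not the SDK's implementation, which may calibrate and smooth its measurements differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for illustration; NOT the SDK's PlayerVoiceDetector.
public class SnrVoiceDetectorSketch
{
    readonly double m_ThresholdDb;   // assumed threshold, in decibels
    double m_NoiseRms = 1e-6;        // background level learned during calibration

    public SnrVoiceDetectorSketch(double thresholdDb = 10.0)
    {
        m_ThresholdDb = thresholdDb;
    }

    // Root-mean-square amplitude of one frame of samples.
    public static double Rms(IReadOnlyList<float> frame)
    {
        if (frame.Count == 0)
            return 0.0;
        double sumOfSquares = 0.0;
        for (int i = 0; i < frame.Count; i++)
            sumOfSquares += (double)frame[i] * frame[i];
        return Math.Sqrt(sumOfSquares / frame.Count);
    }

    // Calibration: average the RMS of frames assumed to contain only background noise.
    public void Calibrate(IEnumerable<IReadOnlyList<float>> noiseFrames)
    {
        double total = 0.0;
        int count = 0;
        foreach (IReadOnlyList<float> frame in noiseFrames)
        {
            total += Rms(frame);
            count++;
        }
        if (count > 0)
            m_NoiseRms = Math.Max(total / count, 1e-6); // floor avoids division by zero
    }

    // SNR in dB: 20 * log10(signal RMS / noise RMS).
    public double SnrDb(IReadOnlyList<float> frame)
    {
        double signal = Math.Max(Rms(frame), 1e-9);
        return 20.0 * Math.Log10(signal / m_NoiseRms);
    }

    // A frame counts as speech when its SNR is above the threshold.
    public bool IsSpeech(IReadOnlyList<float> frame) => SnrDb(frame) > m_ThresholdDb;
}
```

In this sketch, a caller would run Calibrate() once on a short stretch of silence, then call IsSpeech() on each incoming frame and forward only the frames that pass. A real detector would likely also add hysteresis so that brief pauses do not end an utterance.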

Workflow

Audio Thread: At startup, the microphone calibrates to background noise. PlayerVoiceDetector then listens for speech using the signal-to-noise ratio (SNR). When the SNR exceeds the threshold, AudioDispatcher starts streaming audio frames, which are then converted to InworldAudio.

Main Thread:

When the game starts, InworldController initializes its only module, STTModule, which creates the STTInterface.
Next, InworldGraphExecutor initializes its graph asset by calling each component’s CreateRuntime(). In this case, only STTNode.CreateRuntime() is called, using the created STTInterface as input.
After initialization, the graph calls Compile() and returns the executor handle.
After compilation, the OnGraphCompiled event is invoked. In this demo, STTNodeTemplate subscribes to it and enables the UI components. Users can then interact with the graph system.
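
The startup order described above can be sketched with plain C# classes. Every type here (SketchSTTModule, SketchSTTInterface, SketchSTTNode, SketchGraphExecutor) is a hypothetical stand-in invented for this illustration, and compilation is reduced to setting a flag. The sketch only shows the ordering: create the interface, build each node's runtime from it, compile, then raise the event.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; these are NOT the Inworld SDK types.
public class SketchSTTInterface { }

public class SketchSTTModule
{
    // Step 1: the module creates the interface that nodes will consume.
    public SketchSTTInterface Initialize() => new SketchSTTInterface();
}

public class SketchSTTNode
{
    public SketchSTTInterface Interface { get; private set; }

    // Step 2: the node builds its runtime from the already created interface.
    public void CreateRuntime(SketchSTTInterface sttInterface)
    {
        Interface = sttInterface ?? throw new ArgumentNullException(nameof(sttInterface));
    }
}

public class SketchGraphExecutor
{
    readonly List<SketchSTTNode> m_Nodes = new List<SketchSTTNode>();

    public event Action OnGraphCompiled;
    public bool IsCompiled { get; private set; }

    public void AddNode(SketchSTTNode node) => m_Nodes.Add(node);

    public void Initialize(SketchSTTInterface sttInterface)
    {
        foreach (SketchSTTNode node in m_Nodes)
            node.CreateRuntime(sttInterface);
        IsCompiled = true;           // step 3: stands in for Compile()
        OnGraphCompiled?.Invoke();   // step 4: notify subscribers such as the UI
    }
}
```

A subscriber would attach a handler to OnGraphCompiled before calling Initialize(), which mirrors how the template enables its UI only once the graph is ready.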

STTNodeTemplate.cs

protected override void OnGraphCompiled(InworldGraphAsset obj)
{
    // The graph is ready: enable every UI element so the user can interact.
    foreach (InworldUIElement element in m_UIElements)
        element.Interactable = true;
}

When AudioDispatcher sends data, STTNodeTemplate handles the onAudioSent event with its SendAudio() method, which converts the List<float> audio data into InworldAudio and passes it to the graph.

STTNodeTemplate.cs

protected override void OnEnable()
{
    base.OnEnable();
    if (!m_Audio)
        return;
    m_Audio.Event.onStartCalibrating.AddListener(()=>Title("Calibrating"));
    m_Audio.Event.onStopCalibrating.AddListener(Calibrated);
    m_Audio.Event.onPlayerStartSpeaking.AddListener(()=>Title("PlayerSpeaking"));
    m_Audio.Event.onPlayerStopSpeaking.AddListener(()=>
    {
        Title("");
        if (m_STTResult)
            m_STTResult.text = "";
    });
    m_Audio.Event.onAudioSent.AddListener(SendAudio);
}

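void SendAudio(List<float> audioData)
{
    // Ignore audio until the STT module has finished initializing.
    if (!m_ModuleInitialized)
        return;
    // Copy the samples into an InworldVector, which InworldAudio is constructed from.
    InworldVector<float> wave = new InworldVector<float>();
    wave.AddRange(audioData);

    // Fire and forget: the result is delivered later through OnGraphResult().
    _ = m_InworldGraphExecutor.ExecuteGraphAsync("STT", new InworldAudio(wave, wave.Size));
}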

Calling ExecuteGraphAsync() eventually produces a result and invokes OnGraphResult(), which STTNodeTemplate overrides to append the recognized text to the result field.

STTNodeTemplate.cs

protected override void OnGraphResult(InworldBaseData obj)
{
    // Interpret the graph output as text and append it to the on-screen result.
    InworldText outputStream = new InworldText(obj);
    if (outputStream.IsValid && m_STTResult)
        m_STTResult.text += outputStream;
}
