Run the Template
- Go to Assets/InworldRuntime/Scenes/Primitives and play the STTTemplate scene.
- When the game starts, stay quiet for a moment to let the microphone calibrate to background noise.
- Once you see the Calibrated message, speak into the microphone.
- You’ll see the transcribed text appear on screen.
 
Understanding the Template
Structure
- This demo has two prefabs under InworldController: STT (contains InworldSTTModule) and VAD (contains InworldVADModule).
- When InworldController initializes, it calls InitializeAsync() on both modules (see Primitives Overview).
- These functions create STTFactory and VADFactory, and each factory creates its STTInterface or VADInterface based on the current STTConfig/VADConfig.
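The initialization path above might be sketched roughly as follows. The class names mirror the ones in this doc (InworldSTTModule, STTFactory, STTInterface), but the exact signatures and the config accessor are assumptions, not the actual SDK API:

```csharp
using System.Threading.Tasks;
using UnityEngine;

// Illustrative sketch only: a module that builds its interface via a factory.
public class InworldSTTModuleSketch : MonoBehaviour
{
    STTInterface m_Interface;

    public async Task InitializeAsync()
    {
        // The factory inspects the current STTConfig to decide which
        // concrete STTInterface implementation to construct.
        STTFactory factory = new STTFactory();
        m_Interface = await factory.CreateAsync(InworldController.Instance.STTConfig);
    }

    // Other code can check this before sending audio for recognition.
    public bool IsInitialized => m_Interface != null;
}
```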

InworldAudioManager
InworldAudioManager handles audio processing and is also modular. In this demo, it uses four components:
- AudioCapturer: Manages microphone on/off and input devices. Uses Unity's Microphone by default, and can be extended via third-party plugins.
- AudioCollector: Collects raw samples from the microphone.
- PlayerVoiceDetector: Implements IPlayerAudioEventHandler and ICalibrateAudioHandler to emit player audio events and decide which timestamped segments to keep from the stream. For example, TurnBasedVoiceDetector automatically pauses capture while the character is speaking to prevent echo. In this demo, VoiceActivityDetector extends PlayerVoiceDetector and leverages an AI model to accurately detect when the player is speaking.
- AudioDispatcher: Sends the captured microphone data for downstream processing.
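The echo-prevention idea behind TurnBasedVoiceDetector could be sketched like this. The member names (IsCharacterSpeaking, StartCapture/StopCapture) are assumptions chosen for illustration, not the SDK's real API:

```csharp
using UnityEngine;

// Illustrative sketch: pause microphone capture during character speech
// so the character's own voice is not fed back into STT as player input.
public class EchoSafeVoiceDetectorSketch : MonoBehaviour
{
    [SerializeField] AudioCapturer m_Capturer; // hypothetical capturer reference

    void Update()
    {
        if (InworldController.Instance.IsCharacterSpeaking)
            m_Capturer.StopCapture();   // character's turn: mute the mic
        else
            m_Capturer.StartCapture();  // player's turn: resume capture
    }
}
```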

Workflow
Audio Thread: At startup, the microphone calibrates to background noise. The VAD (Voice Activity Detection) module listens for speech; when speech is detected, the AudioDispatcher streams audio frames to the STT module. Both partial and final transcriptions are produced and displayed in the UI. Since this section focuses on STT, audio capture is covered in detail later.
Main Thread: In this demo's STTCanvas, each audio-thread event is registered in the OnEnable method. Simple events, such as starting or stopping calibration, are handled directly, for example by updating on-screen text:
STTCanvas.cs
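A hedged sketch of what that event wiring in STTCanvas might look like; the event object and event names (onStartCalibrating, onCalibrated) are assumptions based on the description above, not verified SDK identifiers:

```csharp
using TMPro;
using UnityEngine;

// Illustrative sketch of registering audio-thread events in OnEnable.
public class STTCanvasSketch : MonoBehaviour
{
    [SerializeField] TMP_Text m_StatusText;

    void OnEnable()
    {
        InworldController.Audio.Event.onStartCalibrating.AddListener(OnStartCalibrating);
        InworldController.Audio.Event.onCalibrated.AddListener(OnCalibrated);
    }

    void OnDisable()
    {
        // Unregister to avoid callbacks reaching a disabled canvas.
        InworldController.Audio.Event.onStartCalibrating.RemoveListener(OnStartCalibrating);
        InworldController.Audio.Event.onCalibrated.RemoveListener(OnCalibrated);
    }

    // Simple events just update the on-screen text directly.
    void OnStartCalibrating() => m_StatusText.text = "Calibrating... please stay quiet.";
    void OnCalibrated() => m_StatusText.text = "Calibrated. Start speaking!";
}
```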
When the onAudioSent event is received, we assemble the audio data into an AudioChunk—the audio should be resampled to mono with a sample rate of 16,000 Hz—and call InworldController.STT.RecognizeSpeechAsync().
This function checks whether the STT module exists and has been initialized (i.e., the STTInterface is valid).
If so, it directly calls sttInterface.RecognizeSpeech, returns the transcription string, and displays it on the STTCanvas.
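The resampling requirement mentioned above (mono, 16,000 Hz) can be met with a simple channel average plus linear interpolation. This helper is illustrative only, not the SDK's implementation:

```csharp
using System;

static class AudioResampleSketch
{
    // Converts interleaved multi-channel float samples at sourceRate
    // into mono 16,000 Hz, as expected by the STT module.
    public static float[] ResampleToMono16k(float[] samples, int channels, int sourceRate)
    {
        const int targetRate = 16000;
        int srcFrames = samples.Length / channels;

        // Downmix to mono by averaging the channels of each frame.
        float[] mono = new float[srcFrames];
        for (int i = 0; i < srcFrames; i++)
        {
            float sum = 0f;
            for (int c = 0; c < channels; c++)
                sum += samples[i * channels + c];
            mono[i] = sum / channels;
        }

        // Resample via linear interpolation between neighboring frames.
        int dstFrames = (int)((long)srcFrames * targetRate / sourceRate);
        float[] result = new float[dstFrames];
        double step = (double)sourceRate / targetRate;
        for (int i = 0; i < dstFrames; i++)
        {
            double pos = i * step;
            int i0 = (int)pos;
            int i1 = Math.Min(i0 + 1, srcFrames - 1);
            float t = (float)(pos - i0);
            result[i] = mono[i0] + (mono[i1] - mono[i0]) * t;
        }
        return result;
    }
}
```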
InworldController.cs
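The guard logic described above—check that the module exists and its STTInterface is valid before recognizing—might look like this sketch; the types and member names are assumptions:

```csharp
using System.Threading.Tasks;
using UnityEngine;

// Illustrative sketch of the RecognizeSpeechAsync entry point.
public partial class InworldSTTModuleSketch
{
    STTInterface m_Interface; // created by STTFactory during initialization

    public async Task<string> RecognizeSpeechAsync(AudioChunk chunk)
    {
        // Guard: the STTInterface must be valid before recognition.
        if (m_Interface == null)
        {
            Debug.LogWarning("STT module not initialized; call InitializeAsync first.");
            return string.Empty;
        }
        // Delegate directly to the interface and return the transcription,
        // which the caller then displays on the STTCanvas.
        return await m_Interface.RecognizeSpeech(chunk);
    }
}
```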