- Speech-to-text (STT) - Voice input processing with VAD-based segmentation
- Multimodal image chat - Combined text and image understanding
- Text-to-speech (TTS) - Streaming audio response generation
- WebSocket communication - Real-time bidirectional data exchange
- Unity integration - Full client implementation for mobile/desktop
Overview
The Multimodal Companion consists of two main components:
- Node.js Server - Handles WebSocket connections, processes audio/text/image inputs, and manages graph executions
- Unity Client - Provides the user interface for capturing audio, images, and displaying responses
The server can:
- Convert speech to text using VAD for segmentation
- Process text and images through LLM models
- Generate speech responses via TTS
- Stream results back to the client in real-time
Prerequisites
- Node.js 18+ and TypeScript 5+
- Unity 2017+ (for full client experience)
- Inworld Runtime SDK v0.5 (installed automatically via package.json)
Run the Template
You have two options for running this template.
Option 1: Run the Node.js server with Test Pages
Use the built-in HTML test pages for rapid prototyping and testing of the Node.js Server functionality without Unity.
- Clone the server repository.
- In the root directory, copy .env-sample to .env and set the required values:
  - INWORLD_API_KEY: Your Base64 Runtime API key
  - VAD_MODEL_PATH: Path to your VAD model file (the repo includes the VAD model at silero_vad.onnx)
  - ALLOW_TEST_CLIENT: Must be true to enable test pages
- Install and start the server; it listens on http://localhost:3000 by default.
- Test the functionality:
  - Audio interface: http://localhost:3000/test-audio
  - Multimodal interface: http://localhost:3000/test-image

The test endpoints require ALLOW_TEST_CLIENT=true. Never enable this in production.
Option 2: Run the full application with Unity client
For the complete multimodal companion experience with a proper UI:
- Set up your workspace: clone both the Node server repo and the Unity client repo.
- Start the server:
  a. Navigate to runtime-multimodal-companion-node.
  b. Copy .env-sample to .env and set the required values:
    - INWORLD_API_KEY: Your Base64 Runtime API key
    - VAD_MODEL_PATH: Path to your VAD model file (the repo includes the VAD model at silero_vad.onnx)
    - ALLOW_TEST_CLIENT: Set to false to disable test pages (not needed with the Unity client).
  c. Install and start the server.
- Now, configure the Unity client:
  a. Open Unity Hub and click Add → Add project from disk.
  b. Select the runtime-multimodal-companion-unity folder.
  c. Open the scene DemoScene_WebSocket and set the connection settings:
    - HTTP URL: http://localhost:3000
    - WebSocket URL: ws://localhost:3000
    - API Key and API Secret: Your Inworld JWT credentials (see Authentication)
- Run the application:
  - Click Play in Unity.
  - Hold the record button to capture audio, release to send.
  - The app connects to your Node.js server for real-time interactions.
Understanding the Template
The Multimodal Companion uses a graph-based architecture to process multiple input types and generate appropriate responses.
Message Flow
- Client Connection
  - Unity client authenticates and receives a session token
  - WebSocket connection is established with the session key
- Input Processing
  - Voice: Audio chunks → VAD → STT Graph → Text
  - Text: Direct text input → LLM processing
  - Image+Text: Combined multimodal input → LLM → TTS
- Response Generation
  - Text responses are streamed as they're generated
  - Audio is synthesized in chunks for low latency
  - All responses include interaction IDs for tracking (see the sketch below)
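To make the interaction-ID tracking concrete, here is a minimal sketch of how a server can tag every streamed chunk. The envelope shape, field names, and helper below are illustrative assumptions, not the template's actual types:

```typescript
import { randomUUID } from "node:crypto";
import type { WebSocket } from "ws";

// Hypothetical envelope: every streamed chunk carries the interaction ID it
// belongs to, so the client can group partial text, audio chunks, and the
// closing INTERACTION_END marker.
interface ResponseEnvelope {
  type: "TEXT" | "AUDIO" | "INTERACTION_END" | "ERROR";
  interactionId: string;
  payload?: unknown;
}

async function streamTextResponse(ws: WebSocket, chunks: AsyncIterable<string>) {
  const interactionId = randomUUID();
  try {
    for await (const text of chunks) {
      const msg: ResponseEnvelope = {
        type: "TEXT",
        interactionId,
        payload: { text, final: false },
      };
      ws.send(JSON.stringify(msg));
    }
    ws.send(JSON.stringify({ type: "INTERACTION_END", interactionId }));
  } catch (err) {
    ws.send(JSON.stringify({ type: "ERROR", interactionId, payload: String(err) }));
  }
}
```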
Core Components
1. Speech Processing Pipeline
The STT graph uses Voice Activity Detection (VAD) to segment speech into utterances before transcription.
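As a rough sketch of what VAD-based segmentation looks like in practice, the server buffers audio while speech is detected and flushes a segment to STT once a long enough pause occurs. The Vad/Stt interfaces, function names, and the 500 ms threshold below are placeholders, not the Inworld Runtime SDK's API:

```typescript
// Placeholder interfaces -- the real template wires these up as graph nodes in
// the Inworld Runtime SDK; the shapes below are illustrative only.
interface Vad { isSpeech(frame: Float32Array): Promise<boolean>; }
interface Stt { transcribe(audio: Float32Array): Promise<string>; }

const SAMPLE_RATE = 16_000;               // Unity microphone input rate
const PAUSE_DURATION_THRESHOLD_MS = 500;  // assumed value; configurable in constants.ts

// Accumulate frames while speech is detected; once silence lasts longer than
// the pause threshold, flush the buffered segment to STT and emit the text.
export async function segmentAndTranscribe(
  frames: AsyncIterable<Float32Array>,
  vad: Vad,
  stt: Stt,
  onText: (text: string) => void,
) {
  let buffer: Float32Array[] = [];
  let silenceMs = 0;

  for await (const frame of frames) {
    const frameMs = (frame.length / SAMPLE_RATE) * 1000;
    if (await vad.isSpeech(frame)) {
      buffer.push(frame);
      silenceMs = 0;
    } else if (buffer.length > 0) {
      silenceMs += frameMs;
      if (silenceMs >= PAUSE_DURATION_THRESHOLD_MS) {
        const segment = concat(buffer);
        buffer = [];
        silenceMs = 0;
        onText(await stt.transcribe(segment));
      }
    }
  }
}

// Join buffered frames into one contiguous audio segment.
function concat(chunks: Float32Array[]): Float32Array {
  const out = new Float32Array(chunks.reduce((n, c) => n + c.length, 0));
  let offset = 0;
  for (const c of chunks) { out.set(c, offset); offset += c.length; }
  return out;
}
```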
2. Multimodal Processing
For image+text inputs, the system creates a streaming pipeline.
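Conceptually, the image chat path streams LLM output into TTS as it arrives instead of waiting for the full reply. The interfaces and function below are placeholders standing in for the template's graph nodes, not the SDK's actual API:

```typescript
// Placeholder interfaces standing in for the template's LLM and TTS graph nodes.
interface MultimodalLlm {
  stream(input: { text: string; imageBase64: string }): AsyncIterable<string>;
}
interface Tts {
  synthesize(text: string, voiceId?: string): Promise<Uint8Array>; // one WAV chunk
}

// Stream the LLM reply and synthesize audio sentence-by-sentence so the first
// audio chunk reaches the client before the full response is generated.
export async function imageChat(
  llm: MultimodalLlm,
  tts: Tts,
  input: { text: string; imageBase64: string; voiceId?: string },
  emit: (event: { text?: string; audio?: Uint8Array }) => void,
) {
  let pending = "";
  for await (const token of llm.stream(input)) {
    emit({ text: token });
    pending += token;
    const sentenceEnd = pending.search(/[.!?]\s/);
    if (sentenceEnd >= 0) {
      const sentence = pending.slice(0, sentenceEnd + 1);
      pending = pending.slice(sentenceEnd + 2);
      emit({ audio: await tts.synthesize(sentence, input.voiceId) });
    }
  }
  if (pending.trim()) emit({ audio: await tts.synthesize(pending, input.voiceId) });
}
```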
3. Custom Nodes
The template demonstrates creating custom nodes for specialized processing.
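A custom node is essentially a typed transform step that the graph wires between other nodes. The base-class name and method signature in this sketch are assumptions for illustration, not the SDK's real API:

```typescript
// Illustrative only: the base-class name and process() signature below are
// assumptions, not the Inworld Runtime SDK's actual custom-node API.
abstract class CustomNode<In, Out> {
  abstract process(input: In): Promise<Out>;
}

// Example: a node that trims transcripts and drops empty segments before they
// reach the LLM.
class TranscriptFilterNode extends CustomNode<string, string | null> {
  async process(transcript: string): Promise<string | null> {
    const cleaned = transcript.trim();
    return cleaned.length > 0 ? cleaned : null;
  }
}
```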
4. WebSocket Protocol
Messages follow a structured format.
Client → Server:
- { type: "text", text: string }
- { type: "audio", audio: number[][] }
- { type: "audioSessionEnd" }
- { type: "imageChat", text: string, image: string, voiceId?: string }
Server → Client:
- TEXT: { text: { text, final }, routing: { source } }
- AUDIO: { audio: { chunk: base64_wav } }
- INTERACTION_END: Signals completion
- ERROR: { error: string }
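These message shapes translate naturally into a TypeScript discriminated union on either side of the socket. The type-alias names are mine, and the exact nesting of server messages in the template may differ slightly:

```typescript
// Client → Server messages, as listed above.
type ClientMessage =
  | { type: "text"; text: string }
  | { type: "audio"; audio: number[][] }
  | { type: "audioSessionEnd" }
  | { type: "imageChat"; text: string; image: string; voiceId?: string };

// Server → Client messages. Field names follow the protocol summary above; the
// template's actual nesting may differ.
type ServerMessage =
  | { type: "TEXT"; text: { text: string; final: boolean }; routing: { source: string } }
  | { type: "AUDIO"; audio: { chunk: string } } // base64-encoded WAV
  | { type: "INTERACTION_END" }
  | { type: "ERROR"; error: string };

// Narrow incoming messages with a switch on the discriminant.
function handle(raw: string) {
  const msg = JSON.parse(raw) as ServerMessage;
  switch (msg.type) {
    case "TEXT": return msg.text.final ? msg.text.text : undefined;
    case "AUDIO": return Buffer.from(msg.audio.chunk, "base64");
    case "INTERACTION_END": return;
    case "ERROR": throw new Error(msg.error);
  }
}
```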
Graph Execution Strategy
The template uses different execution strategies for optimal performance:
- STT Graph: Single shared executor for all connections (fast first token)
- Image Chat Graph: Per-connection executor with voice-specific configuration
- Queue Management: Serialized processing per connection to prevent conflicts
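The per-connection queue can be as simple as chaining promises so that each connection processes one request at a time while different connections run in parallel. A minimal sketch, not the template's actual implementation:

```typescript
// Serialize work per connection: each new task waits for the previous task on
// the same connection, while different connections proceed independently.
const queues = new Map<string, Promise<unknown>>();

export function enqueue<T>(connectionId: string, task: () => Promise<T>): Promise<T> {
  const previous = queues.get(connectionId) ?? Promise.resolve();
  const next = previous.then(task, task); // run even if the previous task failed
  // Store a settled-safe tail so one failure doesn't poison the queue.
  queues.set(connectionId, next.catch(() => undefined));
  return next;
}
```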
Error Handling
The system implements robust error recovery:
- gRPC Deadline Exceeded: Automatic retry once
- HTTP/2 GOAWAY: Rebuild executor on next use
- WebSocket Disconnection: Client auto-reconnect with backoff
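A hedged sketch of what "retry once on deadline exceeded, rebuild on GOAWAY" can look like around a graph execution call. The error-detection helpers are assumptions about error message contents, not the SDK's API:

```typescript
// Wrap a graph execution with the two recovery behaviors described above.
// isDeadlineExceeded / isGoAway inspect the error message; the exact error
// shapes surfaced by the SDK may differ.
const isDeadlineExceeded = (e: unknown) => String(e).includes("DEADLINE_EXCEEDED");
const isGoAway = (e: unknown) => String(e).includes("GOAWAY");

export async function executeWithRecovery<T>(
  run: () => Promise<T>,
  rebuildExecutor: () => void,
): Promise<T> {
  try {
    return await run();
  } catch (err) {
    if (isDeadlineExceeded(err)) {
      return await run();  // retry exactly once
    }
    if (isGoAway(err)) {
      rebuildExecutor();   // next use gets a fresh executor
    }
    throw err;
  }
}
```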
Configuration Options
Model Providers
Configure LLM providers in the code.
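As an illustration only, a provider/model pair is typically selected through a couple of constants; the names and values below are placeholders rather than the template's actual configuration keys:

```typescript
// Illustrative sketch only: the template's real provider configuration lives in
// its source, and these constant names and values are placeholders.
export const LLM_PROVIDER = "your-provider";    // which hosted LLM provider to use
export const LLM_MODEL_NAME = "your-model-id";  // model identifier for that provider
export const DEFAULT_VOICE_ID = "your-voice";   // TTS voice used when the client sends none
```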
Text Generation Settings
Adjust generation parameters in constants.ts:
- temperature: Output randomness (0-1)
- topP: Nucleus sampling threshold
- maxNewTokens: Response length limit
- Various penalties for repetition control
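Assuming a plain constants module, the tunable generation parameters might look like the following. The values and the specific penalty fields are placeholders; use the defaults shipped in constants.ts:

```typescript
// constants.ts (excerpt sketch) -- values and penalty field names are
// placeholders, not the template's defaults.
export const TEXT_GENERATION_CONFIG = {
  temperature: 0.7,        // output randomness, 0-1
  topP: 0.9,               // nucleus sampling threshold
  maxNewTokens: 512,       // response length limit
  repetitionPenalty: 1.1,  // one of several penalties for repetition control
  presencePenalty: 0,
  frequencyPenalty: 0,
};
```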
Audio Settings
- Input sample rate: 16 kHz (Unity microphone)
- VAD model: Silero ONNX
- Pause threshold: Configurable in PAUSE_DURATION_THRESHOLD_MS
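The audio-side settings similarly reduce to a few constants. The numbers below are placeholders except the 16 kHz input rate stated above:

```typescript
// Audio pipeline constants (sketch); the pause threshold value is a placeholder.
export const INPUT_SAMPLE_RATE = 16_000;  // Unity microphone capture rate
export const VAD_MODEL_PATH = process.env.VAD_MODEL_PATH ?? "./silero_vad.onnx";
export const PAUSE_DURATION_THRESHOLD_MS = 500;  // silence length that closes a speech segment
```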
Deployment Considerations
Production Setup
- Disable test endpoints: ALLOW_TEST_CLIENT=false
- Implement proper authentication for WebSocket connections
- Use environment-specific configuration
- Set appropriate concurrency limits (2-4 for basic plans)
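For example, the test pages should only be mounted when ALLOW_TEST_CLIENT is explicitly enabled. This sketch assumes an Express-style server and hypothetical file names; the template's actual HTTP setup may differ:

```typescript
import express from "express";

const app = express();

// Only expose the HTML test pages when explicitly enabled; keep this false in production.
if (process.env.ALLOW_TEST_CLIENT === "true") {
  app.get("/test-audio", (_req, res) => res.sendFile("test-audio.html", { root: "public" }));
  app.get("/test-image", (_req, res) => res.sendFile("test-image.html", { root: "public" }));
}

app.listen(Number(process.env.PORT ?? 3000));
```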
Performance Optimization
- Reuse graph executors across requests
- Implement connection pooling
- Monitor memory usage with long-running executors
- Handle GOAWAY errors gracefully
Next Steps
- Extend with additional input modalities (video, documents)
- Implement conversation history and context management
- Add custom voice cloning or style transfer
- Integrate with external services and APIs