- Voice Activity Detection (VAD) - for parsing speech activity out of open mic audio
- Speech-to-text (STT) - for understanding speech inputs
- Jinja prompt templating - for passing app state into formatted context and instructions for an LLM
- LLM - for generating the agent text response
- Text-to-speech (TTS) - for generating agent speech audio
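
As a rough illustration of how the components listed above could chain together for a single conversational turn, here is a minimal sketch. The `Stages` interface, `runTurn`, and `renderTemplate` are hypothetical names invented for this example; they are not Inworld Runtime APIs. The template renderer handles only `{{ name }}` placeholders and is a stand-in for a real Jinja engine. The stages are synchronous here for brevity, whereas real ones would most likely be asynchronous.

```typescript
// Hypothetical stage signatures: illustrative stand-ins, NOT Inworld Runtime APIs.
interface Stages {
  detectSpeech: (micAudio: Float32Array) => Float32Array | null; // VAD: null when no speech is found
  transcribe: (speech: Float32Array) => string;                  // STT
  generate: (prompt: string) => string;                          // LLM
  synthesize: (text: string) => Uint8Array;                      // TTS
}

interface TurnResult {
  transcript: string;
  prompt: string;
  replyText: string;
  replyAudio: Uint8Array;
}

// Simplified stand-in for Jinja: substitutes {{ name }} placeholders only.
// Unknown placeholders render as an empty string.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (_match: string, key: string) => vars[key] ?? ""
  );
}

// Runs one turn: VAD -> STT -> prompt templating -> LLM -> TTS.
// Returns null when the VAD stage finds no speech in the mic audio.
function runTurn(
  micAudio: Float32Array,
  stages: Stages,
  template: string,
  appState: Record<string, string>
): TurnResult | null {
  const speech = stages.detectSpeech(micAudio);
  if (speech === null) {
    return null;
  }
  const transcript = stages.transcribe(speech);
  // App state and the transcript are merged into the prompt context.
  const prompt = renderTemplate(template, { ...appState, user_input: transcript });
  const replyText = stages.generate(prompt);
  const replyAudio = stages.synthesize(replyText);
  return { transcript, prompt, replyText, replyAudio };
}
```

Passing the stages in as an object keeps each one swappable, so fake implementations can be used to exercise the flow without any audio hardware or model access.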
Architecture
- Backend: Inworld Runtime + Express.js
- Frontend: Vanilla HTML/CSS/JavaScript
- Communication: WebSocket
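
To make the WebSocket layer more concrete, the sketch below shows one way the frontend and backend might agree on JSON message shapes and validate incoming messages. The message types (`user_text`, `audio_end`, `transcript`, `agent_text`, `error`) and the helper names are assumptions made for illustration; the template's actual wire format may differ.

```typescript
// Hypothetical message shapes, assumed for this example only.
type ClientMessage =
  | { type: "user_text"; text: string } // typed input from the browser
  | { type: "audio_end" };              // signals the end of a spoken utterance

type ServerMessage =
  | { type: "transcript"; text: string } // STT result echoed to the UI
  | { type: "agent_text"; text: string } // LLM response text
  | { type: "error"; message: string };

// Serializes a message for sending over a WebSocket as a text frame.
function encodeMessage(msg: ClientMessage | ServerMessage): string {
  return JSON.stringify(msg);
}

// Parses and validates a raw text frame received by the server.
// Returns null for malformed JSON or unrecognized message shapes.
function parseClientMessage(raw: string): ClientMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof value !== "object" || value === null) {
    return null;
  }
  const msg = value as { type?: unknown; text?: unknown };
  if (msg.type === "audio_end") {
    return { type: "audio_end" };
  }
  if (msg.type === "user_text" && typeof msg.text === "string") {
    return { type: "user_text", text: msg.text };
  }
  return null;
}
```

A discriminated union on `type` lets both sides switch on the message kind, and validating at the boundary keeps malformed frames from reaching the rest of the pipeline.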
Understanding the Template
Depending on your learning style, you may want to:
- Watch the tutorial videos that walk through how the functionality is implemented using the Inworld Runtime
- Clone the open-source GitHub repo to investigate the full code context (or add new features!)