Realtime sessions are ephemeral. The server holds conversation state for the lifetime of the WebSocket and a short trailing TTL (15 minutes of inactivity by default); after that, server-side memory is gone. To support long-running relationships — a user chats with a voice agent today, comes back next week, and the agent picks up where they left off — your client needs to persist memory externally and re-inject it on every new session. This guide shows the pattern: enable the in-session memory feature, capture the facts and summary it generates, store them in your own database, and seed the next session with what you’ve saved.

How the memory feature fits in

The providerData.memory branch enables automatic conversation memory inside a single session. While the session is open, the server periodically asks the LLM to extract durable facts and a rolling summary from the conversation, prepends them to the system prompt, and trims older transcript items so context stays bounded. This is what makes a single 90-minute conversation coherent — the agent doesn’t forget what was said 20 minutes ago even as the raw transcript ages out. But because all of this state lives only on the server until the session closes, you need to capture it and replay it yourself to bridge across sessions.

The persistence lifecycle

The contract is: anything your client persists during one session can be sent back as configuration on the next. Nothing about the realtime API itself is durable across sessions — your DB is the source of truth for long-term state.

1. Enable memory in the session

Send the memory config in your initial session.update (or a later one):
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    instructions: SYSTEM_PROMPT,
    providerData: {
      memory: {
        enabled: true,
        turn_interval: 5,       // generate every 5 completed turns
        max_facts: 50,          // keep up to 50 facts
        max_memory_length: 2000 // summary capped at 2000 chars
      }
    }
  }
}));
See the memory field reference for every tunable.

2. Capture state from session.updated

Each time the server completes a memory-generation cycle, it sends a session.updated event containing the full session object — including a freshly populated providerData.memory.state. The state shape is:
{
  "version": 3,
  "facts": [
    "User's name is Sarah.",
    "Sarah is preparing for a marathon in October.",
    "Sarah's longest run so far is 18 km."
  ],
  "summary": "Sarah is training for an October marathon and tracking her long runs. She asked about pacing strategy and tapering, and is following a Pfitzinger plan.",
  "context_text": "...what the server prepended to the system prompt...",
  "turns_since_gen": 0,
  "total_turns": 15,
  "items_trimmed": 4
}
The server emits session.updated events for two reasons: as an acknowledgement after every client-sent session.update, and whenever it completes a memory-generation cycle. Both carry the full session object, so most session.updated events you receive will repeat the same memory state you’ve already seen. Use the version counter (incremented only when new memory is generated) to detect a real change and avoid redundant DB writes:
let lastMemoryVersion = 0;

ws.addEventListener('message', async (e) => {
  const msg = JSON.parse(e.data);
  if (msg.type !== 'session.updated') return;

  const state = msg.session?.providerData?.memory?.state;
  if (!state || !state.version || state.version <= lastMemoryVersion) return;

  lastMemoryVersion = state.version;
  await persistMemory(userId, state);   // your DB write
});
If you also want to be able to replay literal recent turns (not just the summary), subscribe to conversation.item.input_audio_transcription.completed for user turns and response.output_audio_transcript.done for assistant turns, and write each turn into your own transcript table keyed by user_id.
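The transcript table itself is left to your application. As a minimal sketch, assuming an in-memory `Map` stands in for that table, the two helpers this guide calls elsewhere (`appendTurn` and `loadRecentTurns`) could look like the following. The stored shape (`role`, `text`, `at`) is an assumption for illustration, not something the realtime API prescribes.

```javascript
// In-memory stand-in for a transcript table keyed by user_id (assumption:
// a real app would back these two helpers with its own database).
const transcripts = new Map();

// Append one completed turn; role is "user" or "assistant".
async function appendTurn(userId, role, text) {
  if (!transcripts.has(userId)) transcripts.set(userId, []);
  transcripts.get(userId).push({ role, text, at: Date.now() });
}

// Return the last n turns in chronological order, ready for replay.
async function loadRecentTurns(userId, n = 4) {
  const turns = transcripts.get(userId) ?? [];
  // slice(-0) would return the whole array, so handle n <= 0 explicitly.
  const recent = n > 0 ? turns.slice(-n) : [];
  return recent.map(({ role, text }) => ({ role, text }));
}
```

Swapping the `Map` for real queries should not change the callers, since both helpers are already asynchronous.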

3. Store the memories

There are two reasonable shapes for the persisted side, depending on how many sessions per user you expect.

Option A — flat persistence (KV, Postgres row, JSON blob)

For most apps, just save the latest facts array and summary string per user. Overwrite on every memory-generation cycle. Cheap, simple, fits in a single row:
CREATE TABLE user_memory (
  user_id    TEXT PRIMARY KEY,
  summary    TEXT NOT NULL DEFAULT '',
  facts      JSONB NOT NULL DEFAULT '[]',
  version    INT NOT NULL DEFAULT 0,
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
async function persistMemory(userId, state) {
  await db.query(
    `INSERT INTO user_memory (user_id, summary, facts, version)
     VALUES ($1, $2, $3::jsonb, $4)
     ON CONFLICT (user_id) DO UPDATE
       SET summary = EXCLUDED.summary,
           facts = EXCLUDED.facts,
           version = EXCLUDED.version,
           updated_at = now()
     WHERE user_memory.version < EXCLUDED.version`,
    [userId, state.summary, JSON.stringify(state.facts), state.version]
  );
}
This works well when each user has one ongoing relationship with the agent and you always want the full saved state in every new session.
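The restore step needs a read path to match `persistMemory`. A minimal sketch of a `loadMemory` helper follows. It assumes `db` exposes a node-postgres style `query(text, params)` method and takes `db` as an explicit first argument so it can be exercised with a stub; the rest of this guide calls it as `loadMemory(userId)` with the connection in scope.

```javascript
// Hypothetical counterpart to persistMemory: read the user's saved row.
// `db` is assumed to expose a node-postgres style query(text, params) method.
async function loadMemory(db, userId) {
  const { rows } = await db.query(
    'SELECT summary, facts FROM user_memory WHERE user_id = $1',
    [userId]
  );
  // New users have no row yet; return the same shape with empty values.
  if (rows.length === 0) return { summary: '', facts: [] };
  const { summary, facts } = rows[0];
  // JSONB columns usually arrive already parsed, but tolerate a raw JSON string too.
  return { summary, facts: typeof facts === 'string' ? JSON.parse(facts) : facts };
}
```

Returning empty values instead of `null` lets first-time and returning users share one code path.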

Option B — vector store for retrieval across many sessions

If your users accumulate many sessions over months and the saved facts pile up past what fits in a single prompt, store each fact (or each session’s summary) with embeddings in a vector DB, then retrieve only the most relevant memories at the start of each new session. The Inworld language-learning Node example demonstrates this end-to-end with Supabase + pgvector: it embeds each saved fact, then on a new session runs a similarity query against the user’s opening utterance and injects the top-K matching memories. The same pattern works with Pinecone, Weaviate, or any pgvector-backed Postgres. Rough shape (using OpenAI embeddings + pgvector):
async function persistMemory(userId, state) {
  for (const fact of state.facts) {
    // Skip facts that are already stored for this user
    const { rows: existing } = await db.query(
      `SELECT 1 FROM user_facts WHERE user_id = $1 AND fact = $2`,
      [userId, fact]
    );
    if (existing.length) continue;
    const embedding = await embed(fact); // your embeddings provider; returns number[]
    // pgvector expects the '[1,2,3]' text form, so serialize the array and cast it
    await db.query(
      `INSERT INTO user_facts (user_id, fact, embedding) VALUES ($1, $2, $3::vector)`,
      [userId, fact, JSON.stringify(embedding)]
    );
  }
}

async function loadRelevantMemories(userId, openingUtterance, k = 10) {
  const queryEmbedding = await embed(openingUtterance);
  // <=> is pgvector's cosine-distance operator; the smallest distance is the closest match
  const { rows } = await db.query(
    `SELECT fact FROM user_facts
     WHERE user_id = $1
     ORDER BY embedding <=> $2::vector LIMIT $3`,
    [userId, JSON.stringify(queryEmbedding), k]
  );
  return rows.map(r => r.fact);
}
Use Option B when the “complete saved state” no longer fits comfortably in the system prompt, typically once a user has accumulated several hundred facts. Until then, Option A is simpler and gives the agent more context on every turn.
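pgvector's `<=>` operator orders rows by cosine distance (1 minus cosine similarity). To make that ranking concrete, or to unit-test retrieval without a database, the same ordering can be sketched in plain JavaScript. The function names and the `{ fact, embedding }` row shape below are illustrative assumptions, and the sketch does not handle zero-length vectors.

```javascript
// Cosine distance: 0 for the same direction, 1 for orthogonal, 2 for opposite.
function cosineDistance(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// In-memory equivalent of "ORDER BY embedding <=> query LIMIT k".
function topKFacts(rows, queryEmbedding, k = 10) {
  return rows
    .map(r => ({ fact: r.fact, distance: cosineDistance(r.embedding, queryEmbedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k)
    .map(r => r.fact);
}
```

A linear scan like this is only practical for small fact sets; the database index is what keeps the real query fast as facts accumulate.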

4. Restore on a new session

When a returning user connects, load their saved state and bake it into the new session before the conversation starts. The most reliable approach is to inject everything you want the agent to recall directly into the instructions field of the initial session.update. The model already treats instructions as authoritative, and this approach is portable across LLM providers — it doesn’t depend on any server-side state restoration semantics.
async function buildResumedSession(userId, openingUtterance) {
  // Option A: load the flat saved state
  const { summary, facts } = await loadMemory(userId);

  // Option B (vector): pull only the relevant ones
  // const facts = await loadRelevantMemories(userId, openingUtterance);
  // const summary = await loadLastSummary(userId);

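  // Build each part independently so a saved summary is kept even when there are no facts
  const factsBlock = facts.length
    ? `\n\nWhat you remember about this user:\n${facts.map(f => `- ${f}`).join('\n')}`
    : '';
  const summaryBlock = summary ? `\n\nRecent conversation summary: ${summary}` : '';
  const memoryBlock = factsBlock + summaryBlock;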

  return {
    type: 'session.update',
    session: {
      instructions: BASE_SYSTEM_PROMPT + memoryBlock,
      providerData: {
        memory: { enabled: true, turn_interval: 5 } // keep memory on for this session too
      }
    }
  };
}

ws.addEventListener('message', async (e) => {
  const msg = JSON.parse(e.data);
  if (msg.type === 'session.created') {
    const update = await buildResumedSession(currentUserId, /* opening utterance if known */ '');
    ws.send(JSON.stringify(update));
  }
});

Optional: replay the last few turns

The injected facts and summary give the agent semantic memory. If you also want the agent to have the literal last few exchanges in context — useful for greetings like “as I was saying earlier…” — replay them with conversation.item.create after the session.updated ack:
async function replayRecentTurns(ws, userId, n = 4) {
  const turns = await loadRecentTurns(userId, n); // your transcript table
  for (const turn of turns) {
    ws.send(JSON.stringify({
      type: 'conversation.item.create',
      item: {
        type: 'message',
        role: turn.role,                              // "user" or "assistant"
        content: [{ type: 'input_text', text: turn.text }]
      }
    }));
  }
}
Trade-off: every replayed turn costs tokens on every subsequent LLM call. For most apps, the summary + facts approach is sufficient — replay only as much literal transcript as your latency / cost budget allows.
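One way to keep that cost bounded is to cap replay by size and not only by count. The sketch below selects the newest turns that fit a character budget. The helper name, the default budget, and the use of characters as a rough proxy for tokens are all assumptions for illustration.

```javascript
// Pick the most recent turns whose combined text fits a character budget.
// Characters are only a rough proxy for tokens; turns are ordered oldest first.
function selectTurnsWithinBudget(turns, maxChars = 1200) {
  const selected = [];
  let used = 0;
  // Walk backwards so the newest turns win when the budget is tight.
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = turns[i].text.length;
    if (used + cost > maxChars) break;
    selected.unshift(turns[i]); // keep chronological order in the result
    used += cost;
  }
  return selected;
}
```

Applied to the turns loaded from your transcript table before the `conversation.item.create` loop, this keeps a single long turn from dominating the replayed context.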

End-to-end example

A reconnect handler that loads from your DB, resumes the session with persisted memory baked in, and keeps capturing fresh memories during the new session:
import { WebSocket } from 'ws';

const BASE_SYSTEM_PROMPT = 'You are a friendly running coach. Keep responses brief.';

async function startSession(userId) {
  const ws = new WebSocket(
    `wss://api.inworld.ai/api/v1/realtime/session?key=voice-${userId}-${Date.now()}&protocol=realtime`,
    { headers: { Authorization: `Basic ${process.env.INWORLD_API_KEY}` } }
  );

  let lastMemoryVersion = 0;

  ws.on('message', async (raw) => {
    const msg = JSON.parse(raw.toString()); // ws delivers text frames as a Buffer

    if (msg.type === 'session.created') {
      const { summary, facts } = await loadMemory(userId);
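      // Build each part independently so a saved summary is kept even when there are no facts
      const factsBlock = facts.length
        ? `\n\nWhat you remember about this user:\n${facts.map(f => `- ${f}`).join('\n')}`
        : '';
      const summaryBlock = summary ? `\n\nRecent conversation summary: ${summary}` : '';
      const memoryBlock = factsBlock + summaryBlock;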

      ws.send(JSON.stringify({
        type: 'session.update',
        session: {
          instructions: BASE_SYSTEM_PROMPT + memoryBlock,
          providerData: {
            memory: { enabled: true, turn_interval: 5, max_facts: 50 }
          }
        }
      }));

      await replayRecentTurns(ws, userId, 4); // optional
    }

    if (msg.type === 'session.updated') {
      const state = msg.session?.providerData?.memory?.state;
      if (state?.version && state.version > lastMemoryVersion) {
        lastMemoryVersion = state.version;
        await persistMemory(userId, state);
      }
    }

    if (msg.type === 'conversation.item.input_audio_transcription.completed' && msg.transcript) {
      await appendTurn(userId, 'user', msg.transcript);
    }
    if (msg.type === 'response.output_audio_transcript.done' && msg.transcript) {
      await appendTurn(userId, 'assistant', msg.transcript);
    }
  });

  return ws;
}

Tuning tips

  • turn_interval: generate memory every N turns. Lower values catch facts sooner but spend more on memory-generation LLM calls. 5 is a good default; drop to 3 for fact-dense conversations (onboarding, intake) or raise to 10 for casual chat.
  • max_facts: cap the size of the persisted fact list. If you’re using Option A (flat persistence) and finding the resumed instructions getting too long, drop this number — older facts get aged out and the summary absorbs them.
  • max_memory_length: cap on the rolling summary length in characters. Match this to the model’s context budget; a 2000-char summary plus ~50 short facts fits comfortably within the context window of any modern LLM.
  • Stable user_id: the entire pattern keys on a stable user identifier. Pass it as providerData.user_id so it shows up in tracing and logs alongside the memory state (see SessionProvider in providerData).
  • First-session vs returning-user paths: keep them in one handler. A new user’s loadMemory returns empty facts and an empty summary, which produces an empty memoryBlock — the same code path serves both cases.
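The tips above can be collected into a small configuration helper. This is a sketch only: the preset names and the `memoryConfigFor` function are hypothetical, and the numbers are the ones suggested in this guide (interval 5 by default, 3 for fact-dense conversations, 10 for casual chat, 50 facts, 2000-character summary).

```javascript
// Hypothetical presets derived from the tuning tips in this guide.
const MEMORY_PRESETS = {
  default:   { turn_interval: 5 },
  factDense: { turn_interval: 3 },  // onboarding, intake
  casual:    { turn_interval: 10 }, // casual chat
};

// Build the providerData.memory object; explicit overrides take precedence.
function memoryConfigFor(kind = 'default', overrides = {}) {
  const preset = MEMORY_PRESETS[kind] ?? MEMORY_PRESETS.default;
  return {
    enabled: true,
    max_facts: 50,
    max_memory_length: 2000,
    ...preset,
    ...overrides,
  };
}
```

For example, `providerData: { memory: memoryConfigFor('factDense', { max_facts: 30 }) }` would request a generation cycle every 3 turns with a shorter fact list.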

See also