To save on inference costs, you can enable prompt caching on supported providers and models. Inworld Router supports both implicit and explicit caching.

Implicit Caching

Implicit caching depends on the model provider. If a provider supports prompt caching, it works automatically on their terms—no configuration required.
Provider stickiness: Inworld Router ensures all requests go to the same model provider for a given conversation. If you hit cache with one provider, subsequent requests stay routed to that provider, so you never lose cached data by being switched to a different provider.
Implicit caching is typically available on providers like OpenAI, DeepSeek, and Google Gemini 2.5. Each provider has its own minimum token requirements and TTL behavior. Consult the provider’s documentation for pricing and model-specific details.

Explicit Caching

Explicit caching is supported for Anthropic and Google providers. It gives you control over what gets cached and for how long.

Unified Protocol

Both providers use the same protocol: a cache_control object attached to a message content part, with a configurable ttl (time to live).

Cache Control in Messages

Add cache_control to text parts within multipart message content. Reserve it for large bodies of text such as character cards, CSV data, RAG context, or book chapters.
{
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a historian studying the fall of the Roman Empire. Below is an extensive reference book:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY HERE",
          "cache_control": {
            "type": "ephemeral",
            "ttl": "1h"
          }
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What triggered the collapse?"
        }
      ]
    }
  ]
}
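
The payload above can also be assembled programmatically. The following Python sketch mirrors the JSON structure in this section; the helper functions are illustrative and not part of any SDK.

```python
# Sketch: build a chat request whose large system context is marked cacheable.
# The payload shape mirrors the JSON example above; these helpers are
# hypothetical, not part of any SDK.

def cached_text_part(text: str, ttl: str = "1h") -> dict:
    """A text content part with explicit cache_control attached."""
    return {
        "type": "text",
        "text": text,
        "cache_control": {"type": "ephemeral", "ttl": ttl},
    }

def build_cached_request(instructions: str, reference_text: str, question: str) -> dict:
    """System message carries the instructions plus a cacheable reference block."""
    return {
        "messages": [
            {
                "role": "system",
                "content": [
                    {"type": "text", "text": instructions},
                    cached_text_part(reference_text, ttl="1h"),
                ],
            },
            {"role": "user", "content": [{"type": "text", "text": question}]},
        ]
    }

payload = build_cached_request(
    "You are a historian studying the fall of the Roman Empire. "
    "Below is an extensive reference book:",
    "HUGE TEXT BODY HERE",
    "What triggered the collapse?",
)
```

Keeping the cacheable block in its own content part, after the stable instructions, means the large text is byte-identical across requests, which is what allows the cache to hit.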

TTL Management

  • Configurable TTL: Set ttl in the cache_control object (e.g., "ttl": "5m", "ttl": "1h").
  • Update at any time: You can change the TTL on subsequent requests, for example to shorten it if you originally set it longer than needed.
  • Automatic prolongation: When 5% or less of the TTL remains and there are still messages hitting the cache, Inworld Router automatically extends the TTL to keep your cache warm.
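
The prolongation rule above can be sketched numerically. This assumes TTL strings use the "5m" / "1h" forms shown in this section; the parser and threshold check are our own illustration, not part of the API.

```python
# Sketch: when does automatic prolongation kick in?
# TTL strings follow the "5m" / "30m" / "1h" forms used in this doc;
# this parser and threshold check are illustrative, not part of the API.

def ttl_seconds(ttl: str) -> int:
    """Parse a TTL like '5m' or '1h' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(ttl[:-1]) * units[ttl[-1]]

def should_prolong(ttl: str, seconds_remaining: float) -> bool:
    """The router extends the cache when 5% or less of the TTL remains."""
    return seconds_remaining <= 0.05 * ttl_seconds(ttl)

# With a 1h TTL (3600 s), the 5% threshold is 180 s:
print(should_prolong("1h", 200))  # False: more than 5% remains
print(should_prolong("1h", 150))  # True: within the final 5%
```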

User Message Example

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Based on the book text below:"
        },
        {
          "type": "text",
          "text": "HUGE TEXT BODY HERE",
          "cache_control": {
            "type": "ephemeral",
            "ttl": "30m"
          }
        },
        {
          "type": "text",
          "text": "List all main characters mentioned in the text above."
        }
      ]
    }
  ]
}

Inspecting Cache Usage

To see how much caching saved on each generation, check the prompt_tokens_details object in the usage response.
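
As a sketch, a response's usage block can be inspected like this. The field names follow prompt_tokens_details as documented in this section; the summarizing helper itself is illustrative.

```python
# Sketch: summarize caching activity from a completion's usage object.
# Field names follow prompt_tokens_details as documented here; the
# cache_summary helper is our own, not part of the API.

def cache_summary(usage: dict) -> dict:
    details = usage.get("prompt_tokens_details", {})
    cached = details.get("cached_tokens", 0)
    written = details.get("cache_write_tokens", 0)
    prompt = usage.get("prompt_tokens", 0)
    return {
        "cached_tokens": cached,
        "cache_write_tokens": written,
        # Fraction of the prompt that was served from cache.
        "cached_fraction": cached / prompt if prompt else 0.0,
    }

usage = {
    "prompt_tokens": 1424,
    "completion_tokens": 65,
    "prompt_tokens_details": {"cached_tokens": 1388},
}
summary = cache_summary(usage)
print(f"{summary['cached_fraction']:.0%} of prompt tokens came from cache")
```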

Request and Response Samples

The following samples were verified against the Inworld Router API. The first request establishes a cache; the second request hits the cache.

Request 1 — Cache write (establishing cache)
curl -X POST "https://api.inworld.ai/v1/chat/completions" \
  -H "Authorization: Basic <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google-ai-studio/gemini-2.5-flash",
    "messages": [
      {
        "role": "system",
        "content": [
          {
            "type": "text",
            "text": "You are a helpful historian. Below is reference material:"
          },
          {
            "type": "text",
            "text": "<large text body of 1024+ tokens>",
            "cache_control": {"type": "ephemeral", "ttl": "5m"}
          }
        ]
      },
      {"role": "user", "content": "What year did the Western Roman Empire fall? Reply in one short sentence."}
    ],
    "max_tokens": 50
  }'
Response 1 — Cache write
{
  "id": "chatcmpl-1772169766796",
  "model": "google-ai-studio/gemini-2.5-flash",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The Western Roman Empire fell in 476 AD.\n"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1402,
    "completion_tokens": 13,
    "total_tokens": 1415,
    "prompt_tokens_details": {
      "cache_write_tokens": 1388,
      "cached_tokens": 1388
    }
  }
}
Request 2 — Cache hit (same conversation, follow-up question)

Send the same system message and cached content, plus the previous exchange and a new user message:
curl -X POST "https://api.inworld.ai/v1/chat/completions" \
  -H "Authorization: Basic <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google-ai-studio/gemini-2.5-flash",
    "messages": [
      {
        "role": "system",
        "content": [
          {"type": "text", "text": "You are a helpful historian. Below is reference material:"},
          {
            "type": "text",
            "text": "<same large text body as request 1>",
            "cache_control": {"type": "ephemeral", "ttl": "5m"}
          }
        ]
      },
      {"role": "user", "content": "What year did the Western Roman Empire fall? Reply in one short sentence."},
      {"role": "assistant", "content": "The Western Roman Empire fell in 476 AD.\n"},
      {"role": "user", "content": "Name two factors that contributed to its fall."}
    ],
    "max_tokens": 80
  }'
Response 2 — Cache hit
{
  "id": "chatcmpl-1772169775239",
  "model": "google-ai-studio/gemini-2.5-flash",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "While the provided text doesn't explicitly detail the reasons for the fall of the Western Roman Empire, it does highlight factors key to its rise and endurance... Drawing inferences from that, two factors could be: a decline in military prowess and ineffective administration.\n"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 1424,
    "completion_tokens": 65,
    "total_tokens": 1489,
    "prompt_tokens_details": {
      "cached_tokens": 1388
    }
  }
}
On the cache hit, cached_tokens is 1388 and cache_write_tokens is omitted (no write). The cached content is reused, reducing input token cost.
Explicit caching requires a minimum of 1024 tokens in the cacheable content block. Shorter content returns a validation error.
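
Because undersized content is rejected, it can help to screen candidate blocks before marking them cacheable. The 4-characters-per-token ratio below is a rough heuristic for English text, not the provider's tokenizer.

```python
# Sketch: rough pre-check against the 1024-token minimum for explicit caching.
# Real token counts come from the provider's tokenizer; the ~4 chars/token
# ratio is only a heuristic for English text.

MIN_CACHEABLE_TOKENS = 1024
CHARS_PER_TOKEN_ESTIMATE = 4

def probably_cacheable(text: str) -> bool:
    """Estimate whether a text block clears the explicit-caching minimum."""
    estimated_tokens = len(text) / CHARS_PER_TOKEN_ESTIMATE
    return estimated_tokens >= MIN_CACHEABLE_TOKENS

print(probably_cacheable("short note"))  # False: far below 1024 tokens
print(probably_cacheable("x" * 10_000))  # True: roughly 2500 estimated tokens
```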

Response Example (summary)

{
  "id": "chatcmpl-...",
  "usage": {
    "prompt_tokens": 4641,
    "completion_tokens": 1817,
    "prompt_tokens_details": {
      "cached_tokens": 4608
    }
  }
}

Usage Fields

  • cached_tokens: Number of tokens read from the cache (cache hit). When greater than zero, you benefit from cached content.
  • cache_write_tokens: Number of tokens written to the cache. Appears on the first request when establishing a new cache entry.
Some providers charge differently for cache writes vs. reads. Anthropic, for example, charges for cache writes but offers discounts on cache reads. Check provider pricing for details.
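
To illustrate why write versus read pricing matters, here is a cost sketch. All per-token prices below are hypothetical placeholders; consult your provider's pricing page for real rates.

```python
# Sketch: estimate input-token cost with caching. All prices are hypothetical
# placeholders; consult provider pricing pages for real rates.

INPUT_PRICE = 3.00 / 1_000_000        # $/token for regular input (hypothetical)
CACHE_WRITE_PRICE = 3.75 / 1_000_000  # writes often cost a premium (hypothetical)
CACHE_READ_PRICE = 0.30 / 1_000_000   # reads are typically discounted (hypothetical)

def input_cost(prompt_tokens: int, cached: int = 0, written: int = 0) -> float:
    """Bill cache writes at the write rate, cache reads at the read rate,
    and the remaining prompt tokens at the regular input rate."""
    regular = prompt_tokens - cached - written
    return (regular * INPUT_PRICE
            + written * CACHE_WRITE_PRICE
            + cached * CACHE_READ_PRICE)

# Using the token counts from the samples above: the first request writes
# 1388 tokens to cache; the follow-up reads them back.
first = input_cost(1402, written=1388)
hit = input_cost(1424, cached=1388)
print(f"cache write request: ${first:.6f}, cache hit request: ${hit:.6f}")
```

Under these placeholder rates the cache-hit request costs roughly a tenth of an equivalent uncached request, which is the typical shape of the trade-off: a small premium on the first write, then steep discounts on every hit.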