Prompt Caching allows users to make repeated API calls more efficiently by reusing context from recent prompts, resulting in a reduction in input token costs and faster response times. The Prompt Caching option is now available for Claude, OpenAI and Google Gemini models.Documentation Index
Fetch the complete documentation index at: https://docs.typingmind.com/llms.txt
Use this file to discover all available pages before exploring further.
Challenges with Current AI Context Handling
When interacting with an AI model, the relevant conversation history and instructions need to be sent with each new query so the model can maintain context. For long conversations or complex tasks, repeatedly processing the same context may lead to higher input token costs and slower responses. With Prompt Caching, supported AI providers can reuse stable prompt content from recent requests instead of processing it again from scratch.How Prompt Caching Works
Prompt Caching helps AI models reuse stable context, such as system instructions, large documents, background information, examples, or earlier conversation turns. When you send a request with Prompt Caching enabled or supported automatically:- The provider checks whether the beginning of your prompt matches recently cached content.
- If a matching prompt prefix is found, the cached tokens are reused.
- If no match is found, the request is processed normally, and eligible content may be cached for future requests.
Time to Live (TTL) for Cache Storage
Cache lifetime varies by provider and model:- For OpenAI: cached prefixes using in-memory retention generally remain active for 5 to 10 minutes of inactivity, up to a maximum of 1 hour. Some newer models also support extended cache retention for up to 24 hours.
- For Claude: the default cache lifetime is 5 minutes and is refreshed each time the cached content is reused. Claude also supports a 1-hour cache duration at additional cost.
- For Gemini: prompt caching is handled automatically on supported Gemini models. There is no configurable or documented TTL, and cache hits are more likely when requests with similar prefixes are sent within a short period of time.
Supported Models
Prompt Caching support changes over time as providers release new models and retire older ones. Refer to each provider’s official documentation for the latest model-specific availability.OpenAI
OpenAI Prompt Caching is automatically available for supported recent models, starting with GPT-4o and newer model families. Prompt Caching applies to eligible prompts of at least 1,024 tokens.Claude
Claude Prompt Caching is supported on current Claude model families, including supported Claude Opus, Sonnet, and Haiku models. The minimum cacheable prompt length depends on the specific Claude model.Google Gemini
Google Gemini Prompt Caching Automatically enabled by Google for Gemini 2.5 and newer models.Why Use Prompt Caching?
For Claude
With Prompt Caching for Claude models, repeated input context can be processed at a significantly lower cost. For the default 5-minute cache:- Writing content to the cache costs 25% more than standard input tokens.
- Reading cached content costs 10% of the standard input token price.
- The cache lifetime is refreshed each time the cached content is reused.

- The default cache lifetime is 5 minutes and is refreshed whenever the cached content is reused.
- The minimum cacheable prompt length varies by model.
- You can set up to 4 cache breakpoints within a prompt.
For OpenAI
OpenAI Prompt Caching is applied automatically on supported models when an eligible prompt is sent. OpenAI states that Prompt Caching can reduce input token costs by up to 90% and latency by up to 80%, depending on the model and cache hit.
- The minimum cacheable prompt length is 1,024 tokens.
- Cache hits require an exact matching prompt prefix.
- Stable content should be placed at the beginning of the prompt to improve cache efficiency.
- Cached token usage is included in the API response for supported requests.
- Cache retention depends on the model and selected retention policy.
For Gemini
Gemini has a complex pricing structure with costs including:- Regular input/output costs when the cache is missed
- 75% discount on input costs when the cache is used
- Cache storage costs
Some important notes:
- The minimum input token count for context caching is 32,768, and the maximum is the same as the maximum for the given model.
How Prompt Caching Can Be Used?
Prompt caching is useful for scenarios where you want to send a large prompt context once and refer back to it in subsequent requests: This is especially useful for:- Analyze long documents: process and interact with entire books, legal documents, or other extensive texts without slowing down.
- Help in coding: keep track of large codebases to provide more accurate suggestions, help with debugging, and ensure code consistency.
- Set up hyper-detailed instructions: allow for the inclusion of numerous examples to improve AI output quality.
- Solve complex issues: address multi-step problems by maintaining a comprehensive understanding of the context throughout the process.
How To Enable Automatic Prompt Caching on TypingMind
If you are using supported OpenAI models, you do not need to take any further action. Prompt Caching is automatically applied by OpenAI when your prompt is eligible.
If you are using Prompt caching for Claude and Gemini models, here’s the detail guidelines:
- Go to Models in the left sidebar.
- Expand the Advanced Model Parameter
- Scroll down to enable the “Prompt Caching” option
Best Practices for Using Prompt Caching
To get the most out of prompt caching, consider following these best practices from OpenAI:- Place reusable content at the beginning of prompts for better cache efficiency.
- Prompts that aren’t used regularly are automatically removed from the cache. To prevent cache evictions, maintain consistent usage of prompts.
- Regularly track cache hit rates, latency, and the proportion of cached tokens. Use these insights to fine-tune your caching strategy and maximize performance.