
Automatic Prompt Caching (Claude and OpenAI)

Prompt Caching allows users to make repeated API calls more efficiently by reusing context from recent prompts, resulting in a reduction in input token costs and faster response times.
The Prompt Caching option is now available for Claude and OpenAI models.

Challenges with Current AI Context Handling

Previously, when interacting with an AI model, the entire conversation history had to be sent to the LLM with each new query so the model could keep the conversation in context.
This repetitive processing can lead to slower responses and higher operational costs, especially during long conversations or complex tasks.

How Prompt Caching Works

Prompt Caching improves AI efficiency by allowing Claude or OpenAI to store and reuse stable contexts, such as system instructions or background information.
When you send a request with Prompt Caching enabled:
  1. The system checks whether the start of your prompt is already cached from a recent query.
  2. If it is, the cached version is reused, speeding up the response and lowering costs.
  3. If not, the full prompt is processed and its prefix is cached for future use.
This is especially useful for prompts with many examples, large contexts, repetitive tasks, and long multi-turn conversations. By reusing cached information, the model can focus on the new query without reprocessing the entire conversation history, which also improves accuracy on complex tasks, especially when refining previous outputs.
πŸ“Œ
  • For OpenAI: cached prefixes generally remain active for 5 to 10 minutes of inactivity. However, during off-peak periods, caches may persist for up to one hour.
  • For Claude: the cache has a 5-minute lifetime, refreshed each time the cached content is used.
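To make the flow above concrete, here is a minimal sketch (using the official anthropic Python SDK) of how a stable prefix, such as a long system prompt, can be marked for caching with Claude. The model alias and placeholder text are illustrative, and exact parameters may vary slightly between SDK versions.
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder for a long, stable context (system instructions, a document, examples).
LONG_BACKGROUND = "...your large, rarely-changing context goes here..."

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model alias
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_BACKGROUND,
            # Marks the end of the stable prefix; everything up to this point
            # is cached and reused by later requests that share the same prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the key points of the context above."}],
)

print(response.content[0].text)
```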

Supported Models

With Claude, Prompt Caching is currently supported on:
  • Claude 3.5 Sonnet
  • Claude 3 Haiku
  • Claude 3 Opus
With OpenAI, Prompt Caching is supported on the latest version of:
  • GPT-4o
  • GPT-4o mini
  • o1-preview
  • o1-mini
  • Fine-tuned versions of the above models.

Why Use Prompt Caching?

For Claude

With Prompt Caching for Claude models, you can get up to 85% faster response times for cached prompts and potentially reduce costs by up to 90%.
πŸ’‘
Please note: while creating the initial cached prompt incurs a 25% higher cost than the standard API rate, subsequent requests using the cached prompt will be up to 90% cheaper than the usual API cost.
Here’s what you need to know:
  • The cache has a lifetime (TTL) of about 5 minutes. This lifetime is refreshed each time the cached content is used.
  • The minimum cacheable prompt length is 1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus, and 2048 tokens for Claude 3 Haiku (support for caching shorter prompts is coming soon).
  • You can set up to 4 cache breakpoints within a prompt.
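If you call the Claude API directly, the response's usage block reports how the request was billed, which makes the write surcharge and read discount above easy to verify. Below is a small sketch that assumes a response object like the one returned in the earlier example; the field names follow Anthropic's Messages API.
```python
def report_cache_usage(response) -> None:
    """Print how a Claude response was billed: cache writes vs. cache reads."""
    usage = response.usage
    print("uncached input tokens:", usage.input_tokens)
    # Tokens written to the cache are billed ~25% above the normal input rate.
    print("cache write tokens:   ", getattr(usage, "cache_creation_input_tokens", 0) or 0)
    # Tokens read from the cache are billed at ~10% of the normal input rate.
    print("cache read tokens:    ", getattr(usage, "cache_read_input_tokens", 0) or 0)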

For OpenAI

You get a 50% discount on input tokens when using cached prompts, and latency can be reduced by up to 80%.
Here’s what you need to know:
  • The minimum cacheable prompt length is 1024 tokens, and cached prefixes grow in increments of 128 tokens.
  • Cached prefixes generally remain active for 5 to 10 minutes of inactivity. However, during off-peak periods, caches may persist for up to one hour.
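Because OpenAI's caching is automatic, no extra request parameter is needed; the response simply reports how many prompt tokens were served from the cache. Below is a hedged sketch using the official openai Python SDK; the placeholder system prompt must reach the 1024-token minimum before any caching occurs.
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder for stable instructions/examples; must be >= 1024 tokens to be cached.
LONG_SYSTEM_PROMPT = "...your large, rarely-changing instructions go here..."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},  # stable prefix first
        {"role": "user", "content": "Answer the next question using the rules above."},
    ],
)

details = response.usage.prompt_tokens_details
print("prompt tokens:", response.usage.prompt_tokens)
print("cached tokens:", details.cached_tokens if details else 0)
```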

How Can Prompt Caching Be Used?

Prompt Caching fits scenarios where you send a large prompt context once and refer back to it in subsequent requests. It is especially valuable when you need to:
  • Analyze long documents: process and interact with entire books, legal documents, or other extensive texts without slowing down (see the sketch below).
  • Assist with coding: keep track of large codebases to provide more accurate suggestions, help with debugging, and ensure code consistency.
  • Set up hyper-detailed instructions: include numerous examples and detailed guidelines to improve AI output quality.
  • Solve complex problems: address multi-step problems while maintaining a comprehensive understanding of the context throughout.
More applications can be found in Prompt Caching with Claude and Prompt Caching with OpenAI.
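As an illustration of the long-document use case, the sketch below sends a document once as a cached Claude prefix and then asks several questions against it, so only each new question is processed at full cost. The file name, model alias, and prompt wording are placeholders.
```python
import anthropic

client = anthropic.Anthropic()

# Placeholder: any long text of at least 1024 tokens.
DOCUMENT = open("contract.txt", encoding="utf-8").read()

def ask(question: str) -> str:
    """Ask a question against the same cached document prefix."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model alias
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": "You are a careful analyst. Reference document:\n\n" + DOCUMENT,
                "cache_control": {"type": "ephemeral"},  # cache the document prefix
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text

print(ask("List the termination clauses."))
print(ask("Are there any indemnification obligations?"))  # reuses the cached prefix
```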

How To Enable Automatic Prompt Caching on TypingMind

If you are using Prompt Caching with OpenAI models, no further action is needed: caching is applied automatically on the latest versions of GPT-4o, GPT-4o mini, o1-preview, and o1-mini.
If you are using Prompt Caching with Claude models, follow these steps:
  • Go to Model Settings
  • Expand the Advanced Model Parameter
  • Scroll down to enable the β€œPrompt Caching” option
πŸ’‘
Important notes:
  • Avoid using Prompt Caching together with Dynamic Context via API, because a system prompt that changes with each request cannot be cached.

Best Practices for Using Prompt Caching

To get the most out of prompt caching, consider following these best practices from OpenAI:
  • Place reusable content at the beginning of prompts for better cache efficiency.
  • Prompts that aren't used regularly are automatically removed from the cache. To prevent cache evictions, maintain consistent usage of prompts.
  • Regularly track cache hit rates, latency, and the proportion of cached tokens. Use these insights to fine-tune your caching strategy and maximize performance.
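For the monitoring point above, here is a rough sketch of how the cached-token share could be tracked across a batch of OpenAI responses; the responses argument is assumed to be a list of chat-completion responses you have already collected with the openai SDK.
```python
def cache_hit_ratio(responses) -> float:
    """Share of prompt tokens served from the cache across a batch of responses."""
    prompt_tokens = cached_tokens = 0
    for r in responses:
        prompt_tokens += r.usage.prompt_tokens
        details = r.usage.prompt_tokens_details
        cached_tokens += details.cached_tokens if details else 0
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0
```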

Final Thought

Prompt Caching brings major benefits that address core limitations in how AI models handle context.
By reducing the need for repetitive processing, Prompt Caching helps improve efficiency, reduce costs, and unlock new possibilities for how AI can be applied in real-world scenarios.