Understanding Prompt Caching in the Gemini API: Up to 90% Cost Savings

By FS AI Hub

2026-06-27

Introduction

If you are building agentic AI applications, you are likely packing large amounts of data into your system prompts. Supplying a 100,000-token repository or a 50-page employee handbook as context lets the model answer from that material, but it introduces two problems:

Cost: Paying for 100,000 input tokens on every single user message gets expensive.
Latency: The LLM takes several seconds just to "read" that prefix before generating a response.

Context Caching (or Prompt Caching) solves both.

How Context Caching Works

Instead of sending your large system instructions with every API call, you send them once to Google's servers and request that they be cached for a set time-to-live (TTL).

When you start a new conversation, you simply reference the cache ID. Conceptually, the API then reuses the already-processed prefix (in transformer terms, typically the stored key/value state of the attention layers) instead of recomputing it on every request.
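
As a rough mental model only (this is not a description of how Google's servers are implemented), the flow resembles memoizing an expensive preprocessing step under an ID with an expiry time. The names `PrefixCache` and `expensivePrefill` below are hypothetical, and the "prefill" is a stand-in that merely counts words:

```javascript
// Conceptual sketch only: "process the prefix once, reuse it by ID".
// PrefixCache and expensivePrefill are hypothetical names, not part of any SDK.
let prefillRuns = 0;

function expensivePrefill(prefixText) {
  // Stand-in for the costly step of processing a long prompt prefix.
  prefillRuns += 1;
  return { wordCount: prefixText.split(/\s+/).filter(Boolean).length };
}

class PrefixCache {
  constructor() {
    this.entries = new Map();
    this.nextId = 1;
  }

  // Process the prefix once and hand back an ID to reference it later.
  create(prefixText, ttlMs, now = Date.now()) {
    const id = `cache-${this.nextId++}`;
    this.entries.set(id, {
      state: expensivePrefill(prefixText),
      expiresAt: now + ttlMs,
    });
    return id;
  }

  // Return the stored state, or null once the entry has expired.
  get(id, now = Date.now()) {
    const entry = this.entries.get(id);
    if (!entry || now >= entry.expiresAt) {
      this.entries.delete(id);
      return null;
    }
    return entry.state;
  }
}

const prefixCache = new PrefixCache();
// Timestamps are passed explicitly so the example is deterministic.
const cacheId = prefixCache.create("You are an expert. Context: ...", 30 * 60 * 1000, 0);
const first = prefixCache.get(cacheId, 1000);              // reuses stored state
const second = prefixCache.get(cacheId, 2000);             // reuses it again
const expired = prefixCache.get(cacheId, 31 * 60 * 1000);  // past the TTL
```

The expensive step runs once no matter how many lookups follow, and a lookup after the TTL returns nothing, at which point the prefix would have to be sent and processed again.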

The Impact

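Costs Drop: Cached input tokens are priced at a steep discount (often 75% to 90% cheaper, depending on the model). For Gemini 1.5 Pro, for example, active tokens cost $7.00/1M while cached tokens cost $1.75/1M, a 75% discount. Explicit caches are also billed for storage for as long as they live, so the net saving depends on how often the cache is reused.
Latency Drops: Time-To-First-Token (TTFT) can fall substantially, because the model no longer has to reprocess the full prefix on each request.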
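
To make the cost claim concrete, here is a small back-of-the-envelope calculation using the Gemini 1.5 Pro prices quoted above and the 100,000-token prefix from the introduction. It is a simplified sketch: the message count of 1,000 is an arbitrary assumption, and the calculation ignores cache storage charges, output tokens, and the uncached tokens of each user message.

```javascript
// Prices are the Gemini 1.5 Pro figures quoted in this article, USD per 1M input tokens.
const ACTIVE_PRICE_PER_M = 7.0;
const CACHED_PRICE_PER_M = 1.75;

// Input cost in USD for sending `tokens` prefix tokens on each of `messages` requests.
function inputCost(tokens, messages, pricePerMillion) {
  return (tokens / 1_000_000) * pricePerMillion * messages;
}

const prefixTokens = 100_000; // prefix size from the introduction
const messages = 1_000;       // assumed volume, for illustration only

const uncached = inputCost(prefixTokens, messages, ACTIVE_PRICE_PER_M);
const cached = inputCost(prefixTokens, messages, CACHED_PRICE_PER_M);
const discount = 1 - cached / uncached;

console.log(`uncached: $${uncached.toFixed(2)}`);
console.log(`cached:   $${cached.toFixed(2)}`);
console.log(`discount: ${(discount * 100).toFixed(0)}%`);
```

At these prices the prefix costs about $0.70 per message uncached, so 1,000 messages come to roughly $700 versus $175 with caching.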

Implementation (Node.js)

Context caching has a minimum input size that depends on the model; with the figure used here, your cached payload must be at least 32,768 tokens. Here is how you do it with the @google/genai SDK:
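
Before creating a cache, it can help to check whether the payload actually meets the minimum. The sketch below is one possible way to do that, assuming the SDK's `ai.models.countTokens` call; the helper name `checkCacheable` and the constant `MIN_CACHE_TOKENS` are hypothetical, and the client is passed in as a parameter so the function can be exercised with a stub.

```javascript
// Minimum taken from the text above; treat it as a configurable assumption,
// since the required minimum can differ by model.
const MIN_CACHE_TOKENS = 32768;

// `ai` is expected to be a GoogleGenAI client (or any object exposing
// models.countTokens with the same shape). Resolves to the token count and
// whether the text is large enough to be worth passing to ai.caches.create.
async function checkCacheable(ai, model, text, minTokens = MIN_CACHE_TOKENS) {
  const { totalTokens } = await ai.models.countTokens({ model, contents: text });
  return { totalTokens, cacheable: totalTokens >= minTokens };
}
```

If `cacheable` comes back false, sending the prompt normally is the simpler path, since a cache creation request below the minimum would be expected to fail.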

import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Explicit caching expects a pinned model version (note the -001 suffix).
const MODEL = 'gemini-1.5-flash-001';

// 1. Create the cache (TTL of 30 minutes). displayName, ttl, and
//    systemInstruction all belong inside `config`.
const cache = await ai.caches.create({
  model: MODEL,
  config: {
    displayName: 'massive_document_cache',
    ttl: '1800s',
    systemInstruction: 'You are an expert. Context: [50k tokens...]',
  },
});

// 2. Chat using the cache: pass the cache name via config.cachedContent.
const chat = ai.chats.create({
  model: MODEL,
  history: [],
  config: { cachedContent: cache.name },
});

const response = await chat.sendMessage({ message: 'Summarize chapter 3.' });
console.log(response.text);
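
To confirm that a request was actually served from the cache, one option is to inspect the response's `usageMetadata`. The helper below is a minimal sketch: `cacheUsage` is a hypothetical name, it assumes the `promptTokenCount` and `cachedContentTokenCount` fields of the response usage metadata, and the sample numbers are made up for illustration.

```javascript
// Summarize how much of a prompt was served from the cache.
// `usage` is the usageMetadata object from a sendMessage / generateContent response.
function cacheUsage(usage) {
  const promptTokens = usage?.promptTokenCount ?? 0;
  const cachedTokens = usage?.cachedContentTokenCount ?? 0;
  return {
    promptTokens,
    cachedTokens,
    // Share of the prompt that came from the cache (0 when nothing was cached).
    cachedFraction: promptTokens > 0 ? cachedTokens / promptTokens : 0,
  };
}

// Hand-written object shaped like usageMetadata; the values are invented.
const summary = cacheUsage({ promptTokenCount: 50_000, cachedContentTokenCount: 49_000 });
console.log(summary);
```

In the chat example, this would be called as `cacheUsage(response.usageMetadata)`; a cached fraction near zero suggests the cache reference was not applied or the cache has expired.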

Conclusion

If your application relies on long, static system prompts that are reused across many requests, Context Caching is not optional: treat it as a standard architectural pattern for production.