Google has introduced a new cost-saving feature for its Gemini API aimed at third-party developers, known as “implicit caching.” This automated feature promises developers substantial savings—up to 75%—on repeated contexts submitted to Gemini models. Currently, implicit caching supports Google’s Gemini 2.5 Pro and 2.5 Flash AI models.
The announcement comes after developers raised concerns about the high and unpredictable cost of using Google's state-of-the-art models. Previously, Google offered only "explicit caching," which required developers to manually specify the prompts they wanted cached. While that could cut costs, it demanded significant manual upkeep and often proved cumbersome.
Implicit caching, by contrast, is enabled by default and operates automatically: it detects when an API request shares a common prefix with an earlier one and, on a cache hit, passes the savings on to the developer. Google's latest update also lowers the bar for triggering it: caching now activates automatically at a minimum of 1,024 tokens for Gemini 2.5 Flash and 2,048 tokens for Gemini 2.5 Pro. Since roughly 1,000 tokens correspond to about 750 words, those thresholds are easy to reach.
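As a rough illustration, here is a minimal sketch of what verifying a cache hit might look like using the google-genai Python SDK, assuming an API key is set in the environment; the file name is a placeholder, and the `cached_content_token_count` field is the usage-metadata counter the Gemini API uses to report cached prompt tokens.

```python
from google import genai

# Assumes GOOGLE_API_KEY (or GEMINI_API_KEY) is set in the environment.
client = genai.Client()

# A shared prefix long enough to clear the 1,024-token minimum
# that triggers implicit caching on Gemini 2.5 Flash.
shared_prefix = open("reference_doc.txt").read()  # hypothetical file

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=shared_prefix + "\n\nQuestion: Summarize section 3.",
)

# On a cache hit, usage metadata reports how many prompt tokens were
# served from cache and billed at the discounted rate. The field may
# be None when nothing was cached.
meta = response.usage_metadata
print("prompt tokens:", meta.prompt_token_count)
print("cached tokens:", meta.cached_content_token_count)
```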
In practical terms, developers receive these discounts when their API requests share a common opening, which encourages the best practice of keeping static context (system instructions, reference documents) at the start of a request and placing dynamic, per-request content at the end. Google's previous explicit caching implementation drew criticism when developers received unexpectedly high bills from ineffective caching, prompting the Gemini team to apologize publicly and commit to fixing the problem.
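To make that prefix-first pattern concrete, the sketch below reuses one static block of context across many calls so every request begins identically; the model choice, file name, and helper function are illustrative, not part of Google's announcement.

```python
from google import genai

client = genai.Client()  # assumes an API key in the environment

# Static material that never changes between requests goes first, so
# every call shares an identical prefix the cache can recognize. For
# Gemini 2.5 Pro this prefix must exceed 2,048 tokens to qualify.
static_context = (
    "You are a support assistant for AcmeCo products.\n\n"
    + open("support_manual.txt").read()  # hypothetical document
)

def ask(question: str) -> str:
    # Dynamic, per-request content is appended after the shared prefix.
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=static_context + "\n\nCustomer question: " + question,
    )
    return response.text

print(ask("How do I reset the device?"))
print(ask("Is the battery replaceable?"))  # later calls can hit the cache
```

Reversing the order, with the question first and the document last, would defeat the prefix match entirely, which is why Google's guidance puts variable content at the end.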
Although implicit caching looks promising and developer-friendly, the claimed savings are so far Google's own figures, with no third-party verification, so developers will have to wait for early adopters to confirm whether the discounts materialize as advertised.