The next battle in AI may not be about building larger models.
It may be about making today’s models dramatically cheaper to run.
Some of the biggest AI companies are developing new techniques that can sharply reduce the amount of memory large language models need during inference.
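NVIDIA researchers have introduced KVTC, a system that combines techniques such as principal component analysis (PCA) and entropy coding to compress the key-value (KV) cache, the per-token attention state a model holds in memory during inference, by as much as **20x**, with some reported use cases approaching **40x**.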
Google and other AI developers are pursuing similar approaches, including projects like TurboQuant.
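To put those ratios in perspective, here is a rough back-of-the-envelope sketch of how much memory a KV cache takes and what 20x or 40x compression would leave. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit values, a 128,000-token context) is a hypothetical example chosen for illustration, not a figure from NVIDIA or Google, and the function name `kv_cache_bytes` is made up for this sketch.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to hold keys and values for every layer and every token."""
    # The leading 2 accounts for storing one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 1024 ** 3

# Hypothetical model shape and context length (assumptions, not reported figures).
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=128_000)

# Compare the uncompressed cache with the compression ratios cited above.
for ratio in (1, 20, 40):
    size_gib = baseline / ratio / GIB
    print(f"{ratio:>2}x compression: {size_gib:.2f} GiB per sequence")
```

Under these assumptions a single long-context request holds roughly 15.6 GiB of cache uncompressed, and well under 1 GiB at 20x. Multiplied across thousands of concurrent requests, that gap is the reason buyers care.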
The goal is straightforward.
Lower memory usage.
Lower infrastructure costs.
Higher efficiency.
That matters because memory has become one of the biggest bottlenecks in AI infrastructure.
Companies like Micron and SK Hynix have benefited from surging demand for high-bandwidth memory as hyperscalers poured hundreds of billions of dollars into AI data centers.
But the largest buyers are also highly motivated to reduce those costs.
Every percentage point of memory savings lowers the cost of serving AI models at scale.
The timing is notable.
Micron just reported blockbuster earnings driven by AI demand.
At the same time, hyperscalers are investing heavily in technologies designed to reduce future memory requirements.
That does not automatically mean memory demand will fall.
History shows that efficiency improvements can sometimes increase overall demand by making a technology cheaper and more widely adopted, an effect economists call the Jevons paradox.
But it does change the conversation.
The question is no longer just how much memory AI needs today.
The question is how much memory AI will need three years from now.
The companies selling memory want higher demand.
The companies buying memory are spending billions trying to need less of it.
That tension may become one of the most important battles in the next phase of the AI race.