Does caveman mode really save 65% of tokens?

No. In an A/B test of six prompts drawn from four real tasks, measured output-token savings were 18.6%. The 65% figure is a hardcoded constant (COMPRESSION = { 'full': 0.65 }) in caveman-stats.js, not a measurement of your session. The tool estimates a baseline by multiplying your output by 2.86 and reports the gap as savings.

Why was the measured saving so much lower than the claim?

Savings depend entirely on output shape. Caveman trims prose filler, so chatty coding questions compress well (the original 65% benchmark was ten such prompts). Dense work — data analysis, code, revenue figures — barely compresses, and caveman explicitly leaves code blocks and tool results untouched, which dominate real agentic token cost.

Caveman plugin saves tokens at an 18.6% rate

07 Jun 2026

Nikhil Bhardwaj, AI Product Manager at DataDrew. Field notes on shipping AI agents, evals, and measuring the tools you bet on.

There's a plugin on my machine called caveman mode. It rewrites the assistant's replies into terse, article-dropping fragments — "Bug in auth middleware. Token expiry check use < not <=. Fix:" instead of the usual three polite paragraphs.

The pitch of the plugin is that it saves 65% of the tokens. When I run /caveman-stats, it tells me — in confident numbers — how much I've saved this session.

I'd read that 65% figure and wanted to know if it held for the work I actually do day to day — reading, analysing, reasoning. So I ran a proper A/B test: same prompts, caveman on versus off, reading the token count straight off the API response.

The real number came to 18.6%.

Here's how a 65% became an 18.6%.

The setup

I picked four things from a normal week:

Reading a GitHub repo — pointed it at a usage-monitor project, asked for a summary
Unpacking a transferred zip — a frontend prototype someone had sent me
Data analysis — a real usage export, 1,336 rows × 55 columns of product-query logs
A reasoning task — a Knights-and-Knaves logic puzzle

For the data file I ran three different analyst questions on the same sheet — an overview (what the data is, the headline stats), a monetization read (where revenue and conversion opportunity sit), and a segmentation pass (how to group the users into meaningful cohorts) — so the dense-output case carried more weight than the others.

Every task ran twice. Once with the caveman system prompt, once without it. Same user prompt both times. I logged the output tokens the model actually produced — no estimation, just the number off the response metadata.
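The loop described above can be sketched roughly as follows. This is a minimal outline, not the script used for the test: `callModel`, `runAB`, and `fakeModel` are hypothetical names, and the `outputTokens` field is a stand-in for whatever your provider's response metadata calls its output-token count.

```javascript
// Hypothetical harness. `callModel` stands in for your own API client; it must
// resolve to an object whose `outputTokens` comes from the response metadata.
async function runAB(callModel, cavemanSystemPrompt, tasks) {
  const rows = [];
  for (const task of tasks) {
    // Same user prompt both times; only the system prompt differs.
    const normal = await callModel({ system: '', user: task.prompt });
    const caveman = await callModel({ system: cavemanSystemPrompt, user: task.prompt });
    rows.push({
      task: task.name,
      normal: normal.outputTokens,
      caveman: caveman.outputTokens,
    });
  }
  return rows;
}

// Stub model so the sketch runs offline. It fakes shorter output whenever a
// system prompt is present; a real run would call the API instead.
async function fakeModel({ system, user }) {
  const base = user.length * 10;
  return { outputTokens: system ? Math.round(base * 0.8) : base };
}

runAB(fakeModel, 'talk terse', [{ name: 'demo', prompt: 'hello' }])
  .then((rows) => console.log(rows));
```

The point of the design is that both numbers are read from real responses, so nothing about the baseline is estimated.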

The results

Task	Normal (output tokens)	Caveman (output tokens)	Saved
Reasoning (logic puzzle)	409	342	16.4%
Git repo summary	496	509	−2.6%
Document unpack	437	437	0%
Data — overview	1,473	1,398	5.1%
Data — monetization	2,292	1,453	36.6%
Data — segmentation	1,266	1,050	17.1%
Total	6,373	5,189	18.6%

The 18.6% is total output tokens saved across all six runs — the number that actually shows up on the bill. One run per cell, so the small per-task figures sit inside the noise; I'm reading the shape, not the decimals.
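To make the arithmetic checkable, here is a small sketch that recomputes the per-task and total savings from the figures in the table. The function names (`savedPct`, `summarize`) are illustrative; only the token counts come from the test.

```javascript
// Output-token counts copied from the results table.
const runs = [
  { task: 'Reasoning (logic puzzle)', normal: 409, caveman: 342 },
  { task: 'Git repo summary', normal: 496, caveman: 509 },
  { task: 'Document unpack', normal: 437, caveman: 437 },
  { task: 'Data: overview', normal: 1473, caveman: 1398 },
  { task: 'Data: monetization', normal: 2292, caveman: 1453 },
  { task: 'Data: segmentation', normal: 1266, caveman: 1050 },
];

// Percentage of baseline tokens that caveman mode removed (negative = wrote more).
function savedPct(normal, caveman) {
  return ((normal - caveman) / normal) * 100;
}

function summarize(rows) {
  const totalNormal = rows.reduce((sum, r) => sum + r.normal, 0);
  const totalCaveman = rows.reduce((sum, r) => sum + r.caveman, 0);
  return {
    perTask: rows.map((r) => ({ task: r.task, savedPct: savedPct(r.normal, r.caveman) })),
    totalNormal,
    totalCaveman,
    // Computed on summed tokens, so long replies weigh more than short ones.
    totalSavedPct: savedPct(totalNormal, totalCaveman),
  };
}

const summary = summarize(runs);
for (const row of summary.perTask) {
  console.log(row.task, row.savedPct.toFixed(1) + '%');
}
console.log('Total', summary.totalSavedPct.toFixed(1) + '%'); // Total 18.6%
```

Because the total is computed on summed tokens rather than by averaging the six percentages, the long monetization reply pulls the headline figure up more than the short tasks pull it down.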

The spread is the whole story.

One task went negative: caveman wrote slightly more on the repo summary. One came out dead flat. And one, the monetization analysis, cut more than a third of the output.

So the honest answer to "how much does it save?" is: it depends entirely on what you ask it. There is no single number. The plugin sells one anyway.

Where the 65% actually comes from

This is the part that stopped me.

The 65% is not measured from your session. It's a hardcoded constant.

// caveman-stats.js, line 19
const COMPRESSION = { 'full': 0.65 };

When you run /caveman-stats, it takes the tokens you produced and runs them backwards through one line:

estNormal = output / (1 − 0.65) // = output × 2.86

It assumes — every single time — that a normal reply would have been 2.86× longer, then reports the difference as "saved". It never runs the baseline. It has no way to see what you'd have gotten without it. The savings figure is an article of faith, soldered into a constant.
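A short sketch shows why this always lands on the same percentage. Only the `COMPRESSION` constant and the division are taken from the plugin as quoted above; the function name `estimateSavings` and the returned fields are my own reconstruction for illustration, not the plugin's actual code.

```javascript
// The constant as quoted from caveman-stats.js.
const COMPRESSION = { full: 0.65 };

// Illustrative reconstruction of the estimate: derive a baseline from the
// caveman output alone, then report the gap as savings.
function estimateSavings(outputTokens, mode = 'full') {
  const ratio = COMPRESSION[mode];
  const estNormal = outputTokens / (1 - ratio); // roughly output × 2.86
  const saved = estNormal - outputTokens;
  return { estNormal, saved, savedPct: (saved / estNormal) * 100 };
}

// Three caveman output counts from the A/B test: flat task, best task, total.
for (const tokens of [437, 1453, 5189]) {
  const { estNormal, savedPct } = estimateSavings(tokens);
  console.log(tokens, Math.round(estNormal), savedPct.toFixed(1) + '%');
}
```

Every row reports 65.0%, whatever the input. For the 5,189 caveman tokens in the test, this formula implies a baseline of about 14,826 tokens, while the baseline actually measured was 6,373.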

To be fair to whoever built it — 65% wasn't pulled from air. It came from their own benchmark, ten prompts. I read them. All ten are conversational coding questions: "explain rebase vs merge", "why is my React component re-rendering", "how do I set up a Postgres connection pool".

My read: those are prose-heavy questions, and prose is mostly compressible filler. Mine weren't. A data analysis is mostly numbers, segment names, findings. A revenue figure can't be compressed — it's already one token doing one job. Caveman trims filler, and dense work doesn't carry much. That's the whole gap between their 65% and my 18.6%, sitting in plain sight.

The bigger miss — it only touches the part you read

Even my 18.6% flatters it.

My test measured single-shot replies: one prompt, one text answer, no tools. That's caveman's best case, because every output token was reply text, the only thing it is allowed to compress.

A real task isn't shaped like that. When the assistant reads five files, runs a few commands, writes some code — the token bill is dominated by things caveman is explicitly told to leave alone:

The file contents it reads back in — one 2,000-line file outweighs every word of saved commentary
The code it writes — caveman's own rules say "code blocks unchanged"
The bash commands, the tool results, the context that piles up turn over turn

Caveman compresses the narration between the work. In a real agentic loop, that narration is a thin slice of the total. So the share of tokens it genuinely saves drops well below my 18.6%: the denominator is full of weight it was never given permission to touch.
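The dilution can be put into rough numbers. The session breakdown below is entirely hypothetical (I did not measure an agentic run), and `overallSavedPct` is an illustrative helper; the only measured input is the 18.6% saving on narration.

```javascript
// Overall saving = tokens saved on narration / all tokens in the session.
function overallSavedPct(breakdown, narrationSavedPct) {
  const total = Object.values(breakdown).reduce((sum, n) => sum + n, 0);
  const savedTokens = breakdown.narration * (narrationSavedPct / 100);
  return (savedTokens / total) * 100;
}

// HYPOTHETICAL token mix for one agentic task. These are made-up round
// numbers chosen for illustration, not measurements.
const hypotheticalSession = {
  fileReads: 6000,   // file contents read back in (untouched by caveman)
  toolResults: 2000, // command output and accumulated context (untouched)
  code: 1000,        // code blocks (left unchanged by caveman's own rules)
  narration: 1000,   // the prose caveman is allowed to compress
};

console.log(overallSavedPct(hypotheticalSession, 18.6).toFixed(2) + '%'); // 1.86%
```

Under these assumed proportions, an 18.6% cut to the narration is a 1.86% cut to the session. The real figure depends on how much of your own token bill is narration.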

So is it useless?

No — and this is the fair other side.

Output tokens are the expensive class. They run roughly 4–5× the price of input tokens. Caveman cuts exactly those, and on a chatty back-and-forth where you're reading a lot of model prose, shaving a high-teens percentage off the priciest tokens adds up over a year.

It's also nicer to read for some work. Terse is a feature when you want an answer, not an essay. That part, I'll keep using.

My problem is the scoreboard. A tool that multiplies your output by a fixed 2.86 and calls the gap "savings" is measuring its own assumption. The number feels earned. It isn't.


Key takeaways

If a tool ships with its own success metric, look under the metric before you trust it — caveman's 65% is a hardcoded constant, not a reading of your session.
65% wasn't a lie. It was a real measurement on a workload that happened not to be mine — the trap is applying one benchmark's number to every workload as a constant.
Savings depend entirely on output shape — prose compresses, dense data and code barely move — and it never touches the tool-result and context bulk that dominates real agentic cost.

And that's not really about caveman. It's most AI tooling right now — a headline figure from a clean demo, handed to you as if it holds for your messy reality. The demo question and your actual question are rarely the same shape, and almost nobody checks.

I'd rather carry the 18.6% I measured than the 65% I was handed.

