Nikhil Bhardwaj, AI Product Manager at DataDrew. Field notes on shipping AI agents, evals, and measuring the tools you bet on.

There's a plugin on my machine called caveman mode. It rewrites the assistant's replies into terse, article-dropping fragments — "Bug in auth middleware. Token expiry check use < not <=. Fix:" instead of the usual three polite paragraphs.

The pitch of the plugin is that it saves 65% of the tokens. When I run /caveman-stats, it tells me — in confident numbers — how much I've saved this session.

I'd read that 65% figure and wanted to know if it held for the work I actually do day to day — reading, analysing, reasoning. So I ran a proper A/B test: same prompts, caveman on versus off, reading the token count straight off the API response.

The real number came to 18.6%.

Here's how a 65% became an 18.6%.

The setup

I picked four things from a normal week: a reasoning task (a logic puzzle), a git repo summary, a document unpack, and a data file.

For the data file I ran three different analyst questions on the same sheet — an overview (what the data is, the headline stats), a monetization read (where revenue and conversion opportunity sit), and a segmentation pass (how to group the users into meaningful cohorts) — so the dense-output case carried more weight than the others.

Every task ran twice. Once with the caveman system prompt, once without it. Same user prompt both times. I logged the output tokens the model actually produced — no estimation, just the number off the response metadata.
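The loop described above can be sketched roughly like this. It is a minimal sketch, not the script I ran: `runAB` and `callModel` are hypothetical names, and `callModel` stands in for whatever client you use, assumed to resolve to an object whose `outputTokens` field is copied from the response's usage metadata.

```javascript
// Sketch of the A/B loop. `callModel` is a hypothetical adapter you supply:
// it takes { system, prompt } and resolves to { outputTokens }, where
// outputTokens comes straight from the API response metadata.
async function runAB(tasks, cavemanSystemPrompt, callModel) {
  const rows = [];
  for (const task of tasks) {
    // Same user prompt both times; only the system prompt differs.
    const normal = await callModel({ system: null, prompt: task.prompt });
    const caveman = await callModel({
      system: cavemanSystemPrompt,
      prompt: task.prompt,
    });
    rows.push({
      name: task.name,
      normal: normal.outputTokens,
      caveman: caveman.outputTokens,
    });
  }
  return rows;
}
```

Passing the client in as a function keeps the comparison logic separate from any one provider's SDK, and makes it easy to dry-run the loop with a stub before spending real tokens.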

The results

| Task | Normal | Caveman | Saved |
|---|---|---|---|
| Reasoning (logic puzzle) | 409 | 342 | 16.4% |
| Git repo summary | 496 | 509 | −2.6% |
| Document unpack | 437 | 437 | 0% |
| Data: overview | 1,473 | 1,398 | 5.1% |
| Data: monetization | 2,292 | 1,453 | 36.6% |
| Data: segmentation | 1,266 | 1,050 | 17.1% |
| Total | 6,373 | 5,189 | 18.6% |

The 18.6% is total output tokens saved across all six runs — the number that actually shows up on the bill. One run per cell, so the small per-task figures sit inside the noise; I'm reading the shape, not the decimals.
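For anyone who wants to check the arithmetic, here is a small sketch that recomputes the figures from the token counts in the table. The variable and function names are mine, chosen for illustration. It also prints the plain average of the six per-task percentages, which comes out lower than the total because the total is weighted by how many tokens each task produced.

```javascript
// Output-token counts copied from the results table: [task, normal, caveman].
const runs = [
  ['Reasoning (logic puzzle)', 409, 342],
  ['Git repo summary', 496, 509],
  ['Document unpack', 437, 437],
  ['Data: overview', 1473, 1398],
  ['Data: monetization', 2292, 1453],
  ['Data: segmentation', 1266, 1050],
];

// Percentage of output tokens saved, relative to the normal run.
function savedPct(normal, caveman) {
  return ((normal - caveman) / normal) * 100;
}

const totalNormal = runs.reduce((sum, [, normal]) => sum + normal, 0);
const totalCaveman = runs.reduce((sum, [, , caveman]) => sum + caveman, 0);

for (const [task, normal, caveman] of runs) {
  console.log(`${task}: ${savedPct(normal, caveman).toFixed(1)}%`);
}

// Token-weighted total: the figure that matches the bill.
console.log(`Total: ${savedPct(totalNormal, totalCaveman).toFixed(1)}%`); // 18.6%

// Unweighted mean of the six per-task percentages, for comparison.
const meanOfRows =
  runs.reduce((sum, [, n, c]) => sum + savedPct(n, c), 0) / runs.length;
console.log(`Mean of rows: ${meanOfRows.toFixed(1)}%`); // 12.1%
```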

The spread is the whole story.

One task went negative: caveman wrote slightly more on the repo summary. One came out dead flat. And one, the monetization analysis, cut more than a third of the output.

So the honest answer to "how much does it save?" is: it depends entirely on what you ask it. There is no single number. The plugin sells one anyway.

Where the 65% actually comes from

This is the part that stopped me.

The 65% is not measured from your session. It's a hardcoded constant.

    // caveman-stats.js, line 19
    const COMPRESSION = { 'full': 0.65 };

When you run /caveman-stats, it takes the tokens you produced and runs them backwards through one line:

estNormal = output / (1 − 0.65) // = output × 2.86

It assumes — every single time — that a normal reply would have been 2.86× longer, then reports the difference as "saved". It never runs the baseline. It has no way to see what you'd have gotten without it. The savings figure is an article of faith, soldered into a constant.
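To see how far the assumption drifts from a measured baseline, here is a sketch that applies the same formula to the caveman total from my six runs. Only the constant and the formula come from the plugin; `estimateSavings` is a hypothetical helper name, and this is not the plugin's actual code.

```javascript
// The constant and one-line formula described in the text.
const COMPRESSION = { full: 0.65 };

// Hypothetical helper: mirrors estNormal = output / (1 - 0.65).
function estimateSavings(outputTokens, ratio = COMPRESSION.full) {
  const estNormal = outputTokens / (1 - ratio);
  return { estNormal, estSaved: estNormal - outputTokens };
}

// Totals from the results table.
const cavemanTotal = 5189;
const measuredNormal = 6373;

const { estNormal, estSaved } = estimateSavings(cavemanTotal);

console.log(Math.round(estNormal)); // 14826 assumed baseline, vs 6373 measured
console.log(Math.round(estSaved)); // 9637 reported as "saved"
console.log(measuredNormal - cavemanTotal); // 1184 actually saved
```

On these numbers the formula would report roughly eight times the savings that the measured baseline supports.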

To be fair to whoever built it — 65% wasn't pulled from air. It came from their own benchmark, ten prompts. I read them. All ten are conversational coding questions: "explain rebase vs merge", "why is my React component re-rendering", "how do I set up a Postgres connection pool".

My read: those are prose-heavy questions, and prose is mostly compressible filler. Mine weren't. A data analysis is mostly numbers, segment names, findings. A revenue figure can't be compressed: it's already a few tokens doing one job. Caveman trims filler, and dense work doesn't carry much. That's the whole gap between their 65% and my 18.6%, sitting in plain sight.

The bigger miss — it only touches the part you read

Even my 18.6% flatters it.

My test measured single-shot replies — one prompt, one text answer, no tools. That's caveman's best case, because the output was pure prose and prose is the only thing it can compress.

A real task isn't shaped like that. When the assistant reads five files, runs a few commands, and writes some code, the token bill is dominated by things caveman is explicitly told to leave alone.

Caveman compresses the narration between the work. In a real agentic loop, that narration is a thin slice of the total. So the share of tokens it genuinely saves drops well below 18.6%: the denominator is full of weight it was never given permission to touch.
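The dilution is simple multiplication, sketched below. The narration shares are assumed values for illustration only; I did not measure an agentic session. The 18.6% is carried over from the single-shot test as the savings rate on the narration slice, which is itself an assumption.

```javascript
// Illustrative only: narrationShare is an assumed input, not a measurement.
// Only the narration slice shrinks; the rest of the bill is left untouched.
function blendedSavings(narrationShare, narrationSavings) {
  return narrationShare * narrationSavings;
}

const narrationSavings = 0.186; // single-shot result, assumed to carry over

for (const sharePct of [100, 50, 20, 10]) {
  const overallPct = blendedSavings(sharePct / 100, narrationSavings) * 100;
  console.log(
    `narration = ${sharePct}% of tokens -> ${overallPct.toFixed(1)}% saved overall`
  );
}
```

If narration were a fifth of the tokens in a session, an 18.6% trim on that slice would come to about 3.7% of the whole.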

So is it useless?

No — and this is the fair other side.

Output tokens are the expensive class. They run roughly 4–5× the price of input tokens. Caveman cuts exactly those, and on a chatty back-and-forth where you're reading a lot of model prose, shaving high-teens percent off the priciest tokens adds up over a year.

It's also nicer to read for some work. Terse is a feature when you want an answer, not an essay. That part, I'll keep using.

My problem is the scoreboard. A tool that multiplies your output by a fixed 2.86 and calls the gap "savings" is measuring its own assumption. The number feels earned. It isn't.



And that's not really about caveman. It's most AI tooling right now — a headline figure from a clean demo, handed to you as if it holds for your messy reality. The demo question and your actual question are rarely the same shape, and almost nobody checks.

I'd rather carry the 18.6% I measured than the 65% I was handed.

