CostCompass An Almanac Beta

Manage what your LLMs actually cost

Tracking tells you the number. Managing means acting on it, and you can't pull the right lever until you can see which model the money is going to. Here is how to decide what to change first, and how to tell whether the change worked.

By Joubert Berger | Published June 7, 2026

There's a difference between watching a bill and managing one. Watching is reading the meter as it climbs. Managing is deciding which dial to turn, turning it, then reading the meter again to see if it mattered. The first half — here is your usage, here is the total — is the easy part; the part that actually saves money is deciding what to change and confirming it worked.

This guide is about that second half: the levers that lower an LLM bill, how to decide which one to pull first, and how to tell afterward whether the bill moved. You can't manage what you can't see broken down, so the work starts with knowing which model the money is going to.

[Image: an antique almanac engraving of a row of brass control valves of unequal size feeding a single pressure gauge, a gloved hand resting on the largest valve while the needle holds steady, the chosen valve picked out in copper.]
The skill isn't turning every valve. It's reading the gauge, finding the one that moves it, and turning that.

What is LLM cost management?

LLM cost management is the loop you run around your spend: measure where the money is going, act on the largest line, then re-measure to see whether the change reduced cost. The measuring half is covered elsewhere: the guide to the ways to track AI costs across providers lays out how you get to a number, and an AI API cost dashboard is what shows it to you. This guide is about what you do with that number once it’s in front of you.

The distinction matters because the two halves have different owners. A tracker or a dashboard can hand you the measurement. The action lives in your code and your provider settings — choosing a model, restructuring a prompt, switching on a discount tier — and no tracker can make those calls for you. A good breakdown points you to the largest cost. It tells you which model and which provider the spend concentrates in, so you can put your effort into the biggest line rather than the easiest one.

Where does an LLM bill actually go?

Before you change anything, it’s worth knowing what you’d be changing. A few drivers account for most of an LLM bill, and they’re rarely spread evenly:

  • Model tier. The per-token rate, and the widest spread of the lot. A flagship reasoning model can cost many times a small one for the same call, so the choice of model for a job tends to move the bill more than any other single decision.
  • Input size. Everything you send (prompt, system instructions, retrieved context, conversation history) is metered. Long context and stuffed system prompts are a steady, easy-to-miss cost on every call.
  • Output length. The expensive side of the meter. Verbose completions run it up, and reasoning models bill their hidden thinking tokens as output — so even a short visible reply can carry a large output charge.
  • Request volume and retries. The multiplier underneath everything else. A retry loop or a chatty agent turns a small per-call cost into the line item that dominates the month.

So spend concentrates. One model, or one high-volume call path, usually accounts for a disproportionate share, while the rest is rounding error. You find the biggest line and work on it first, and you can’t do that until you have a breakdown that ranks where the money went.
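As a rough sketch of how these drivers combine: the cost of a call is input tokens times the input rate plus output tokens (including any billed thinking tokens) times the output rate, multiplied across request volume. The rates, model names, and usage figures below are made-up placeholders, not any provider's actual pricing.

```python
from dataclasses import dataclass

# Hypothetical USD rates per million tokens; real rates vary by provider and change over time.
RATES = {
    "flagship": {"input": 10.00, "output": 30.00},
    "small": {"input": 0.50, "output": 1.50},
}

@dataclass
class Usage:
    model: str
    requests: int       # total calls, retries included
    input_tokens: int   # average per request
    output_tokens: int  # average per request, including billed thinking tokens

def line_cost(u: Usage) -> float:
    """Cost of one model's line: per-call cost times request volume."""
    rate = RATES[u.model]
    per_call = (u.input_tokens * rate["input"] + u.output_tokens * rate["output"]) / 1_000_000
    return per_call * u.requests

def ranked_breakdown(usages):
    """Return (model, cost, share of total) rows, largest line first."""
    total = sum(line_cost(u) for u in usages)
    rows = [(u.model, line_cost(u), line_cost(u) / total) for u in usages]
    return sorted(rows, key=lambda row: row[1], reverse=True)

usage = [
    Usage("small", requests=200_000, input_tokens=1_000, output_tokens=200),
    Usage("flagship", requests=20_000, input_tokens=3_000, output_tokens=800),
]
for model, cost, share in ranked_breakdown(usage):
    print(f"{model:10s} ${cost:10.2f} {share:6.1%}")
```

With these invented numbers the flagship handles a tenth of the requests yet carries most of the spend, which is the kind of concentration a ranked breakdown is meant to surface.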

How do you lower an LLM bill, and which lever comes first?

You pull the levers in priority order, the biggest expected saving first, against where your breakdown says the money is. The levers themselves are well known (the per-provider guides cover how Claude, OpenAI, Gemini, DeepSeek, and OpenRouter each price the moves); what’s worth adding is the order. Rank each lever on three things: how much you expect it to save, how easy it is to apply, and how little it risks quality. A lever that halves the cost of a model you barely use loses to the one that shaves a tenth off your largest line.
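The ranking can be sketched as a small scoring exercise. The spend figures, saving fractions, and 1-to-3 ease and safety scores below are hypothetical, and sorting by dollars first with the other two as tie-breakers is only one possible weighting. The point it illustrates is that expected saving is a fraction of the line it applies to, so the size of the line dominates.

```python
# Hypothetical candidates: monthly spend on the line each lever targets (USD),
# the fraction of that line the lever is expected to save, and rough 1-3 scores
# for ease (3 = easiest) and quality safety (3 = least risk).
candidates = [
    {"lever": "halve a rarely used model", "line_spend": 40.0,
     "saving_frac": 0.50, "ease": 3, "safety": 3},
    {"lever": "shave a tenth off the largest line", "line_spend": 1000.0,
     "saving_frac": 0.10, "ease": 3, "safety": 3},
]

def expected_saving(candidate):
    """Expected monthly saving in USD: size of the line times the fraction saved."""
    return candidate["line_spend"] * candidate["saving_frac"]

def rank(cands):
    """Sort by dollars saved first; ease and safety only break ties."""
    return sorted(
        cands,
        key=lambda c: (expected_saving(c), c["ease"], c["safety"]),
        reverse=True,
    )

for c in rank(candidates):
    print(f'{c["lever"]:38s} saves about ${expected_saving(c):.2f}/month')
```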

  • Right-size the model. Send the easy, high-volume calls to a cheaper model (a smaller tier, or an open-weight model on a cheaper host) and reserve the flagship for work that needs it. Because the per-token spread across tiers is so wide, this is usually the single largest mover, and on genuinely simple calls — once you’ve validated the routing against real cases — it holds the quality the user sees.
  • Cache repeated input (prompt caching) — where the provider offers it. When a later call reuses a stable prefix, that portion can bill at a much cheaper cached rate. But this is per-provider: some apply it automatically, some require you to opt in and structure the prompt deliberately, and the discount may or may not appear in the usage they report. Check each provider’s prompt-caching terms before you count on it.
  • Cache whole responses, not just prompts. Distinct from prompt caching: response (or semantic) caching skips the model call entirely when an identical or near-identical request has already been answered, serving the stored answer and avoiding another model inference charge. It lives in your application rather than the provider, and pays off most on repetitive, bounded workloads like classification, FAQ bots, or deduplicated batch jobs. The risk is staleness, so scope it to requests whose answer doesn’t drift.
  • Move bulk work to a batch tier. Several providers offer a discounted asynchronous tier for work that doesn’t need an immediate answer. Availability and the size of the discount vary by provider, so confirm it exists before you plan around it.
  • Trim input and cap output. Shorten bloated system prompts, prune the context down to what you actually need, and set a sensible max_tokens ceiling. This cuts both sides of the meter and is entirely within your control on every provider.
  • Route and fall back across providers — with eyes open. Routing the right job to the right model is a real lever (it’s the whole premise of OpenRouter), but a fallback isn’t automatically cheaper: when your first-choice model is busy, a fallback can silently send traffic to a pricier one, so the bill moves even though your volume didn’t. Treat routing as a lever you have to watch, not set and forget.
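Of the levers above, response caching is the one that lives entirely in your own code, so it is easy to sketch. This is a minimal exact-match version (no semantic matching), assuming a hypothetical `call_model` function that stands in for whatever client you use. The time-to-live is one simple way to bound staleness, and the case and whitespace normalization is an assumption that only suits workloads where those differences don't change the answer.

```python
import hashlib
import time

class ResponseCache:
    """Exact-match response cache with a time-to-live to limit staleness."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, answer)

    @staticmethod
    def _key(model, prompt):
        # Normalize whitespace and case so trivially different requests share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, call_model):
        key = self._key(model, prompt)
        hit = self._store.get(key)
        if hit is not None and hit[0] > self.clock():
            return hit[1]  # served from cache: no model call, no inference charge
        answer = call_model(model, prompt)  # miss or expired: pay for one inference
        self._store[key] = (self.clock() + self.ttl, answer)
        return answer

# Usage with a stand-in for a real client (hypothetical).
calls = []

def fake_call_model(model, prompt):
    calls.append(prompt)
    return f"label-for:{prompt.strip().lower()}"

cache = ResponseCache(ttl_seconds=60)
first = cache.get_or_call("small", "Is this spam?  ", fake_call_model)
second = cache.get_or_call("small", "is this spam?", fake_call_model)
print(f"model calls made: {len(calls)}")
```

The second request differs only in case and trailing whitespace, so it is answered from the store and the stand-in model is called once.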

Right-sizing the model is worth a worked example. Say a model at a tenth of the flagship’s per-token rate can take 60% of your calls without the user noticing the difference. Move that share over and that workload’s bill falls to 46% of where it started: 40% of the calls still at full price, 60% of them now at a tenth (0.4 + 0.6 × 0.1 = 0.46, taking the calls as roughly equal in cost). That’s a 54% cut from a single routing rule, and it works because the per-token spread between tiers is the widest gap in the whole bill; compare any provider’s current rates (Anthropic, OpenAI, Gemini). (The exact ratio is illustrative and rates move; the point is the order of magnitude, not the decimal.)
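The same arithmetic as a small function, assuming (as the example does) that calls are roughly equal in cost. The function and parameter names are illustrative.

```python
def blended_cost_fraction(share_moved: float, cheap_rate_ratio: float) -> float:
    """Fraction of the original bill left after moving `share_moved` of the calls
    to a model priced at `cheap_rate_ratio` times the flagship's rate."""
    if not 0.0 <= share_moved <= 1.0:
        raise ValueError("share_moved must be between 0 and 1")
    # Calls that stay pay full price; calls that move pay the cheaper ratio.
    return (1.0 - share_moved) + share_moved * cheap_rate_ratio

remaining = blended_cost_fraction(share_moved=0.6, cheap_rate_ratio=0.1)
print(f"bill falls to {remaining:.0%} of baseline, a {1 - remaining:.0%} cut")
```

Plugging in your own share and rate ratio shows how quickly the saving shrinks when fewer calls qualify as easy: moving 20% of calls at the same ratio leaves 82% of the bill.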

| Lever | What it cuts | Effort | Trade-off / risk |
|---|---|---|---|
| Right-size the model | Per-token rate on high-volume calls | Low–medium | Quality on calls you misjudge as “easy” |
| Prompt caching | Repeated input on stable prefixes | Low–medium | Per-provider; discount may not show in reported usage |
| Response caching | Whole repeated calls, end to end | Medium | Only fits repetitive workloads; stale-answer risk |
| Batch tier | A flat percentage on async work | Low | Not universal; no immediate response; varies by provider |
| Trim input / cap output | Both sides of the meter | Low | Truncated answers if the cap is too tight |
| Route + fall back | Sends easy work to cheaper models | Medium | Fallback can route to a pricier model; watch the blend |

How do you know a change actually lowered the bill?

You measure the cost on both sides of the change and compare. The catch is how much that comparison really proves. Every lever above is a hypothesis: this change will lower the bill. Testing it means comparing before and after, which the providers make hard. Each console shows its own slice, in its own units, and no provider’s own tooling forecasts across your full stack. What you want is one ranked, cross-provider view that tells you where to aim, plus a forward number for where the month lands if nothing changes.

Two honest limits come with that measurement:

  • A breakdown tells you which model is expensive. It does not tell you the right fix. It can’t see your prompt structure, what’s cacheable, or where quality matters. It aims you at the costly line; choosing the lever is still your call.
  • A lower total after a change is a signal, not a proof. Re-measuring isn’t a controlled experiment: if request volume or workload mix shifted at the same time, the number moved for reasons other than your lever. And for providers that settle caching or batch discounts only on the invoice, the raw usage you read won’t reflect the saving until billing does. Read the movement as directional, and confirm large changes against the eventual bill. (For how the forward number is built from recent usage, see forecasting your AI spend.)
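One way to make the before/after read a little less ambiguous is to split the change in total spend into a part explained by request volume and a part explained by cost per request. The sketch below assumes you can get total cost and request count for two comparable periods. The figures are invented, and it does not correct for a shift in workload mix, so the result is still directional.

```python
def compare_periods(before_cost, before_requests, after_cost, after_requests):
    """Split a change in spend into a volume effect and a per-request cost effect.
    Directional only; this is not a controlled experiment."""
    before_unit = before_cost / before_requests
    after_unit = after_cost / after_requests
    # What the 'after' period would have cost at the old per-request cost.
    expected_at_old_unit = before_unit * after_requests
    return {
        "total_change": after_cost - before_cost,
        "volume_effect": expected_at_old_unit - before_cost,
        "unit_cost_effect": after_cost - expected_at_old_unit,
        "unit_cost_change_pct": (after_unit - before_unit) / before_unit * 100,
    }

result = compare_periods(
    before_cost=1000.0, before_requests=100_000,
    after_cost=900.0, after_requests=120_000,
)
for name, value in result.items():
    print(f"{name:22s} {value:10.2f}")
```

In this made-up case the total fell by only 100, but volume grew at the same time. At the old per-request cost the period would have come to 1,200, so the per-request cost fell by a quarter even though the headline number barely moved.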

Some teams get that before/after view a different way — by instrumenting calls through an observability SDK or gateway (Helicone, Langfuse, LiteLLM, Portkey), which capture per-request cost in exchange for sitting in your code or your request path. That’s a heavier setup than reading the bill each provider already keeps, and the trade-offs are laid out in how AI cost tools collect their numbers.

Where does CostCompass fit?

CostCompass handles the measurement and the feedback. It doesn’t enforce anything; that part is on you and the providers. It reads each provider’s own usage on demand, with no SDK in your code and no gateway in your request path, and prices it into one month-to-date total, a forecast, and a per-model, per-provider breakdown that ranks where the money went. The breakdown points you at the costly line; the forecast tells you whether the trend is worth acting on.
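For intuition about what a forward number can look like, here is a naive linear run-rate projection. This is an illustrative assumption only, not a description of how CostCompass builds its forecast, which this guide does not specify; the function name and figures are hypothetical.

```python
import calendar
from datetime import date

def run_rate_forecast(month_to_date: float, today: date) -> float:
    """Project month-end spend by extending the average daily spend so far.
    Counts today as a full day, which is a simplification."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    days_elapsed = today.day
    return month_to_date / days_elapsed * days_in_month

projected = run_rate_forecast(month_to_date=420.0, today=date(2026, 6, 14))
print(f"projected month-end spend: ${projected:.2f}")
```

A straight-line projection like this overreacts early in the month and ignores weekly patterns, so treat it as a rough indication of trend rather than a prediction.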

[Image: a ranked bar chart of month-to-date LLM spend broken down by model. A flagship model carries most of the cost, with several cheaper models below it, making the largest line obvious at a glance.]
A per-model breakdown ranks where the spend went, so the line worth attacking is the one at the top.

You pull the data; nothing polls in the background and nothing alerts you. After you ship a change — say a new route, a tighter prompt, or a cheaper model on the easy calls — a click on Refresh reads the latest usage and shows whether the number moved, with the caveats above in mind. Coverage isn’t uniform. Most LLM providers expose per-model detail, but some don’t. DeepSeek, for instance, exposes only an account balance, so its line shows a total without the per-model split.

[Image: the CostCompass dashboard showing month-to-date spend across providers with a forecast and burn rate.]
One month-to-date total and a forecast across every connected provider — the before-and-after the management loop reads from.

It also doesn’t stop at model APIs. The same view rolls Claude, OpenAI, Gemini, DeepSeek, and the routed models behind OpenRouter up next to the GPU box and the hosting bill, with the full set on the providers page and a wider overview of tracking AI costs across providers. Your provider keys are encrypted in your browser before they’re stored, sealed with your vault password and saved only as ciphertext the server can’t decrypt. What sits at rest is never a usable credential.

Managing your LLM costs with it takes three steps:

  1. Connect each provider — paste the usage or admin key it gives you. It’s encrypted in your browser before it’s stored, so CostCompass stores only ciphertext — no usable credential sits in its database or logs.
  2. Read the breakdown. Find the model carrying the most spend, and pick the lever that fits it — a cheaper model for the easy calls, a tighter prompt, a discount tier the provider offers.
  3. Ship the change, then click Refresh and read whether the number moved. Run it again the next time the bill drifts.

Frequently asked questions

What's the difference between LLM cost tracking and cost management?
Tracking is reading the meter — what you've spent, priced into money and summed across providers. Management is what you do with that reading — which model or call is worth changing, where you make the change, and whether the bill moved afterward. A tracker that shows you a per-model total has done its job. The saving comes from what you do next.
What actually drives an LLM bill?
A handful of things, rarely spread evenly. The model tier sets the per-token rate, and the gap between a cheap model and a flagship is wide enough that which model handles a job usually matters more than anything else. After that come input size (prompt, system instructions, retrieved context), output length, and the number of requests, with retries on top. Reasoning modes add a wrinkle, since the thinking tokens count toward the output you pay for. Spend concentrates in a few places. So the first move in managing it is finding the biggest line and starting there.
How do I reduce my LLM costs without hurting quality?
Start with the levers that cut cost without touching the answer the user sees. The biggest is usually routing. Send the easy, high-volume calls to a cheaper model and keep the flagship for the hard ones; on genuinely simple cases nobody notices. Two smaller levers pay off quietly. A long, stable prompt prefix kept consistent can land on a provider's cached-input rate, so the same prompt costs less. Trimming a bloated system prompt cuts the input side of the meter, and capping output length cuts the expensive output side. Each of these is a trade you can measure, which is what the per-model breakdown is for.
Does CostCompass set budgets or alert me when spend is high?
No. CostCompass has no budgets and no spend caps, and no alerts either. It doesn't sit in your request path, so it can't stop a call. What it does is show you the number on demand — a month-to-date total, a forecast of where the month is heading at your pace, and a per-model breakdown of where it's going. You read it, decide what to change, change it yourself, then click Refresh to see whether it moved.
Will switching models or adding caching show up in CostCompass?
Where the provider exposes per-model usage, a model switch shows up clearly. The per-model breakdown shifts on the next Refresh, with the cheaper model carrying more of the volume. Caching and batch discounts are messier. Some providers fold them into the usage they report, so the total comes in lower; with others, the discount may land only on the invoice, so the raw usage CostCompass reads won't reflect it until billing does. The provider's bill moves either way. But for an invoice-only discount the refreshed CostCompass total can lag behind it, so you won't always pin the exact saving on the lever you pulled. Read a refreshed total as directional, and check the big changes against the actual bill.
Can I put a hard cap on my LLM spend?
Not in CostCompass. It has no enforcement layer. Any ceiling lives at the providers, and they differ. Most model APIs let you set a monthly budget or a usage quota in their own dashboard, but the threshold behaves differently from provider to provider. Some keep serving past a soft budget and only flag it after the fact, while others enforce a hard quota that stops requests once you cross it. Check the specific provider before you rely on one as a cap. CostCompass doesn't replace those controls. It helps you choose the number, since the forecast shows where the month is heading before you set the limit.
What's the difference between prompt caching and response caching?
Prompt caching is a provider feature. When a later call reuses a stable prefix, that portion bills at a cheaper cached-input rate, but the model still runs. Response caching lives in your own application and skips the model call outright when an identical or near-identical request has already been answered, serving the stored answer and avoiding another model inference charge. Prompt caching trims the input side of one call; response caching removes the call. Response caching saves more where it fits, but it only fits workloads whose answers don't drift, since a stale cache returns a wrong answer.
Why use CostCompass to manage LLM costs instead of just reading each provider's console?
Each console shows one provider's spend in its own units, and a few add a forecast — but only for that one provider. None of them forecast across your whole stack. To manage across a stack you'd open every console, convert each meter to money, add them up, and work out the combined slope yourself — every time you wanted to check whether a change helped. CostCompass puts that in one place — a per-model breakdown that points at the costly line, a latest-refreshed month-to-date total, and one forecast across every provider you connect. From there you can see what to change and, one Refresh later, whether it worked. You manage the bill instead of reassembling it by hand each time.

About the author

Joubert Berger builds CostCompass, a spend-intelligence dashboard that pulls usage from AI and compute providers into one month-to-date total, a forecast, and a per-provider breakdown. This guide reflects how CostCompass reads each provider's own usage API — see the security model for how your keys are handled.

Find the costly model, change it, then read the result

Connect each provider once and pull a per-model breakdown where the provider exposes it, a month-to-date total, and a forecast on demand. A change you ship today is one Refresh away from showing whether it worked.