Skip to content
BrainRoad BrainRoad

How to Cut AI Agent API Costs by 65% With Model Tiering

BrainRoad ·
Beacon the lighthouse character shining light on a tiered stack of coins, illustrating AI agent API cost savings.
Share
On this page

I spent a weekend reverse-engineering an OpenClaw agent bill that had quietly climbed to $340/month. The agent wasn’t doing anything exotic — email triage, calendar checks, some research summaries. Normal personal assistant work. But the bill looked like it was running a nuclear simulation.

The default OpenClaw configuration runs everything through one model. Claude Opus. The smart, capable, expensive one. Great for hard problems. Overkill for ‘is this email a newsletter or does it need a reply?’ — which is roughly 80% of what an agent actually does all day.

The fix sounds obvious once you see it. But most cost-cutting guides go straight to ‘use a cheaper model’ — which breaks quality — or ‘reduce your context window’ — which breaks memory. The answer is routing: right model for right task. I’ll show you the specific tier breakdown after we look at where the money actually leaks, because the source is not what you’d expect.

Why Claude API Costs Get Out of Hand in Agent Workloads

A chatbot makes one API call per message. An AI agent makes 3–10 API calls per user request — planning, tool selection, execution, verification, response. Each loop burns tokens. And unlike a chatbot where you control the conversation, an agent running 24/7 keeps making calls whether you’re watching or not.

The math compounds fast. According to research from Zylos AI, a single unconstrained agent solving a software engineering task can cost $5–8 per task in API fees alone. That’s one task. Your personal AI assistant is running dozens of background operations per hour — most of them trivial.

Here’s where the actual money goes on a default OpenClaw setup running Claude Opus:

  • Heartbeat checks — OpenClaw pings the configured model every 30 minutes, loading full context each time. That’s 48 API calls per day, $3–5 in idle costs before your agent does a single useful thing.

Beacon the lighthouse illuminating a tiered stack of coins and circuit nodes, symbolizing layered AI model cost savings. Not every question needs your smartest model — sometimes a smaller light does the job just fine.

  • Context file loading — Files like AGENTS.md, TOOLS.md, and MEMORY.md total roughly 15,000 words worth of instructions. Every session loads them. At Opus input rates across 50 conversations per day, that’s $11.25/day just to remind the agent who it is.
  • Output token premium — Output tokens cost 3–8x more than input tokens across nearly all major providers. Agents that generate verbose intermediate reasoning pay this premium on every loop step.
  • Formatting and classification tasks — Summarizing an email, deciding if a message is urgent, generating a subject line. None of this needs a frontier model. All of it runs on Opus by default.

According to analysis from the OpenClaw Blueprint, organizations using a single top-tier model for all agent tasks may be overpaying by 40–85% on their AI bills. The number that stuck with me: 80% of agent activity doesn’t require a frontier model. Triage, classification, health monitoring, session warming, heartbeats — all of it can run cheaper.

You’re paying Opus rates for work that Haiku could do in its sleep. That’s the whole problem. Now here’s the routing strategy that fixes it.

The Three-Tier Model Routing Strategy for Claude API Cost Reduction

Model tiering means you match model capability to task complexity. Three tiers cover virtually every agent workload:

  • Tier 1 — Frontier (Opus): Complex reasoning, multi-step planning, nuanced writing, code generation and review, image understanding. Use this for 15–20% of tasks.
  • Tier 2 — Mid-tier (Sonnet): Code execution, debugging, structured data extraction, tool orchestration, response drafting. Use this for 30–40% of tasks.
  • Tier 3 — Fast/cheap (Haiku, Gemini Flash): Email triage, message classification, health checks, formatting, heartbeats, simple lookups. Use this for the remaining 45–55% of tasks.

The pricing difference between these tiers is not subtle. Claude Opus 4.6 runs $5 per million input words and $25 per million output words. Claude Haiku runs $1 per million input and $5 per million output. That’s a meaningful gap — and the gap between Haiku and running local models via Ollama is infinite, because Ollama costs $0.00 per call.

Implementation has three approaches, depending on how deep you want to go:

  1. Keyword/regex matching — Simplest. Tag task types by keywords in the request. ‘Schedule’, ‘remind’, ‘check’ → Haiku. ‘Write’, ‘analyze’, ‘plan’ → Sonnet or Opus. No additional API calls required.
  2. Pre-router classification — A cheap model (Haiku or Flash) reads the incoming request and classifies it as simple/medium/complex before the main agent runs. Costs one small API call to save potentially many large ones.
  3. Custom routing skill — A Python function inside OpenClaw that evaluates task metadata and routes accordingly. Most flexible, requires setup, worth it at scale.

If you’re exploring AI automation setups for the first time, start with keyword matching. You can implement it in an afternoon and see immediate results in your API dashboard.

Heartbeat Routing: Killing the $3–5/Day Idle Tax

This one surprised me when I first saw it in the billing breakdown. Heartbeats are background pings that keep your agent alive and aware — OpenClaw sends one every 30 minutes. That’s fine. The problem is that by default, every heartbeat loads full context and calls Opus.

Forty-eight Opus calls per day. Before your agent handles a single email.

The fix: route heartbeat calls to your cheapest model — Haiku or Gemini Flash. A heartbeat check doesn’t need to reason. It needs to confirm the agent is alive and that nothing urgent just arrived. Flash can do that. The routing rule is simple: if the call type is ‘heartbeat’ or ‘health_check’, send it to Tier 3. Full stop.

This change alone cuts $3–5/day from idle costs. That’s $90–150/month for zero change in agent capability.

LLM Cost Optimization With Prompt Caching

Model tiering handles which model gets called. Prompt caching handles how much you pay when you call it.

Anthropic’s prompt caching charges only 10% of the normal input rate for cache hits. When your agent loads the same AGENTS.md file — same 15,000 words — across 50 conversations per day, you pay full price once and 10% for the remaining 49 loads. That context file loading problem I mentioned earlier? Goes from $11.25/day to roughly $1.50/day.

If you want to go deeper: stack prompt caching with Anthropic’s Batch API, which gives a 50% discount on processing when you don’t need real-time responses. Stack both discounts on Opus and you can get effective input costs down to $0.25 per million words — a 95% reduction from the standard rate. Not every task can batch, but scheduled research summaries, overnight report generation, and non-urgent email drafting all can.

Ollama for Zero-Cost Repetitive Agent Tasks

Some tasks genuinely require zero intelligence. Heartbeats. Session warming — loading context before a conversation starts so the first response isn’t slow. Simple keyword lookups. Checking whether a calendar slot is free.

For these, Ollama local models cost $0.00 per call. The only cost is electricity. On a modern laptop or a small home server, you can run a local model that handles Tier 3 tasks without touching the API.

The tradeoff: local models are slower and less capable than cloud models. But for tasks that ‘require zero creativity,’ according to the Clawctl cost optimization guide, that doesn’t matter. Nobody cares if the heartbeat ping takes two seconds instead of 200 milliseconds.

Practical use cases for Ollama in a tiered setup: heartbeats, email triage classifications, response formatting cleanup, simple yes/no routing decisions. Anything where you’re essentially asking ‘is this urgent?’ rather than ‘how should I handle this complex situation?‘

OpenRouter Auto-Routing as a Safety Net

OpenRouter sits between your agent and the model providers, and it offers auto-routing: you specify a quality/cost target, and it picks the cheapest model that meets your threshold. You can set price caps per request and let OpenRouter handle fallbacks automatically.

I think of OpenRouter as the safety net, not the primary strategy. Manual tiering gives you more control and predictability. But if you’re running a complex multi-agent setup and don’t want to hand-code every routing rule, OpenRouter auto-routing plus a defined cost ceiling is a reasonable starting point.

For a full breakdown of what OpenRouter charges and how the pricing tiers actually work, the OpenRouter pricing guide covers the details — including when it saves money versus when you’re better off routing directly.

Stay in the loop

Get the latest AI insights delivered to your inbox.

Join Free

Where the Three-Tier Strategy Breaks Down

This approach works well for predictable agent workloads. It has real failure modes you should know about before you implement it.

  • Misclassification at scale — Keyword matching is fast but brittle. An email that says ‘just checking in’ might be a client threatening to cancel. If your router sends it to Haiku and Haiku misses the subtext, you’ve got a quality failure that keyword matching can’t prevent. Start with a conservative router that over-assigns to Tier 2 rather than over-saving.
  • Context loss between tiers — If Haiku handles triage but Sonnet handles the response, make sure the full conversation context passes between them. A common mistake is letting the tier-switch discard memory. This is especially sharp in multi-agent setups.
  • Local model latency — Ollama on underpowered hardware is slow. If your agent is responding to messages in real time, a 5-second local inference might be worse than a 200ms Haiku API call. Benchmark your hardware before committing local models to latency-sensitive tasks.
  • Caching config is manual — Prompt caching on Anthropic’s API doesn’t activate automatically. You have to structure your requests correctly and explicitly mark cacheable content. If you skip this, you’re paying full input rates while thinking you’re not.
  • Cost monitoring lag — Most API dashboards show costs with a 24-48 hour delay. You can optimize for a week and not see results in the dashboard until the following week. Set a spending alert before you start so you have a real-time signal.

How to Know Your AI Token Optimization Is Actually Working

  • Opus call percentage drops below 20% — If your API logs show Opus handling more than 20–25% of total calls, your router is under-classifying. Most tasks should route to Sonnet or Haiku.
  • Cache hit rate above 70% for system prompts — Anthropic’s dashboard shows cache hit rates. If your context files are loading uncached more than 30% of the time, your caching configuration needs adjustment.
  • Heartbeat calls show Haiku or Flash in logs — Check that background health checks aren’t still routing to Opus. This is the easiest quick win to verify.
  • Daily cost drops within 48 hours of heartbeat rerouting — The heartbeat fix shows up fast. If you don’t see a meaningful drop in idle costs within two days, the routing rule isn’t firing.
  • Agent response quality holds — Run your standard test prompts against the tiered setup before fully deploying. Complex reasoning tasks should still route to Opus. If quality degraded, your classifier is too aggressive.

Your Monday Morning AI Agent Cost Audit

Here’s the exact sequence I’d run if I were starting fresh on a $300/month agent bill:

  1. Pull your API usage logs for the last 7 days. Break down calls by model. If Opus is handling more than 25% of total calls, you have a routing problem. If heartbeat-type calls appear in your Opus logs, fix that first — it’s the fastest payback.
  2. Route all heartbeat and health check calls to Haiku or Gemini Flash. Target: 0% of heartbeat calls hitting Opus. Expected savings: $90–150/month from idle costs alone.
  3. Enable prompt caching for your context files (AGENTS.md, TOOLS.md, MEMORY.md). This requires updating your API call structure to mark content as cacheable. Anthropic charges 10% of normal input rates on cache hits. If your agent runs 50+ sessions per day, expect 60–75% reduction in context-loading costs.
  4. Implement keyword routing for the most obvious task types. At minimum: any call containing ‘format’, ‘summarize brief’, ‘check if’, or ‘is this urgent’ routes to Haiku. Any call containing ‘write’, ‘analyze’, ‘plan’, or ‘generate code’ routes to Sonnet or Opus. This alone shifts 30–40% of calls to cheaper tiers.
  5. If you run repetitive lookups or session-warming calls, test Ollama locally. Install a small local model (7B parameters or less runs fine on most machines), point your Tier 3 routing at it, and measure latency. If response time is under 3 seconds, you’ve eliminated those API costs entirely.
  6. Set a hard daily spending alert at 20% below your current average. This gives you a real-time signal during the optimization period instead of waiting for weekly billing summaries.
  7. Wait 72 hours, then re-pull your usage logs. Compare Opus call percentage, cache hit rate, and daily cost against your baseline. If you followed steps 1–5, expect a 50–65% reduction. If you’re only seeing 20–30%, your keyword routing rules need tightening.

If you’re running this on BrainRoad’s platform, the agent console gives you per-model call breakdowns directly — you don’t need to parse raw API logs. The BrainRoad console guide walks through where to find those usage views.

What This Means for Your Agent Budget

  • Default single-model OpenClaw setups running Opus generate $150–300/month, with 60–80% of that cost going to tasks that never needed a frontier model.
  • Three-tier routing — Opus for planning, Sonnet for execution, Haiku/Flash for routine tasks — cuts real-world bills from $340/month to approximately $120/month without quality loss.
  • Heartbeat rerouting to cheap models eliminates $3–5/day in idle API costs, or $90–150/month, in a single configuration change.
  • Anthropic’s prompt caching cuts input costs by 90% on cache hits. Stack that with Batch API’s 50% discount and effective Opus input costs drop to $0.25 per million words — a 95% reduction from base rate.
  • 80% of agent activity — triage, classification, health monitoring, heartbeats — can run on Haiku or local Ollama models. Only 20% actually requires frontier-model reasoning.
  • Start with heartbeat rerouting and prompt caching. Both are straightforward config changes with immediate, measurable payback. Add model tiering and local models after you’ve validated the baseline savings.

Frequently Asked Questions

Will routing tasks to Haiku degrade my agent's output quality?

Only if you route the wrong tasks. Haiku handles triage, classification, formatting, and health checks well. It’s not built for complex reasoning, nuanced writing, or multi-step code generation. The key is routing by task type, not by random assignment. Test your routing rules against real samples before deploying — if a borderline task gets misclassified, bump it up a tier rather than optimizing the classifier further.

How do I actually enable prompt caching in OpenClaw?

Prompt caching requires structuring your API calls to mark cacheable content explicitly — it doesn’t activate automatically. Anthropic’s documentation covers the cache_control parameter. In OpenClaw, your context files (AGENTS.md, TOOLS.md, MEMORY.md) are the primary candidates. Once marked correctly, Anthropic charges 10% of normal input rates on subsequent cache hits for those blocks.

What's the cheapest way to reduce AI agent costs right now?

Heartbeat rerouting. Change one config line to send health check calls to Haiku or Flash instead of Opus. No classifier, no complex routing logic. You’ll see $90–150/month in savings within 48 hours. Prompt caching is the second change — it’s a bit more technical but cuts context-loading costs by up to 90%. Do those two before anything else.

Does OpenRouter auto-routing replace manual model tiering?

It can for some workloads, but manual tiering gives you more control and predictability. OpenRouter auto-routing works well as a cost ceiling safety net — you set a max price per request and let it pick the cheapest model that qualifies. For complex agents where you care deeply about which specific model handles which task type, manual routing rules are more reliable. Many production setups use both: manual rules for known task types, OpenRouter as a fallback for edge cases.

Can I run Ollama locally alongside a cloud agent?

Yes. Ollama runs on your local machine or a small server and exposes an API endpoint that looks like a standard model API. You configure your Tier 3 routing to point at your local Ollama instance for zero-cost tasks. The tradeoff is latency — local inference on modest hardware takes 2–5 seconds per call, which is fine for heartbeats and background tasks but may be too slow for real-time message responses.

Stay in the loop

Get the latest AI insights delivered to your inbox.

Join Free

Sources

Topics

AI Agent Platform

Stay updated

Get AI strategy insights delivered weekly. No fluff, no spam.

Related Articles