
LLM Cost Tracking Across OpenAI, Anthropic, and Vertex

Most teams discover their LLM spend through the monthly invoice. By then it's too late. This is the cost-tracking stack — what to instrument, what to alert on, and the multi-provider math that makes 'unified spend across OpenAI + Anthropic + Gemini' actually possible.

The pattern is recognizable in every team that integrates LLMs into a product: the bill arrives, the bill is unexpected, the post-mortem reveals that nobody had visibility into per-request cost in real time. By the time you have a problem, you've already had a problem.

This page is the stack we'd ship for unified LLM cost visibility across OpenAI, Anthropic, and Google (Vertex / Gemini API), plus the alert and budget patterns that catch problems before the invoice does. The recommendations draw from Helicone's free tier, LiteLLM's gateway model, OpenLLMetry's OTEL approach, and the post-mortems we've read from teams that learned this the expensive way.

The four data points you need

For every LLM request in production, you need:

  1. Provider (anthropic, openai, vertex, etc.)
  2. Model (claude-opus-4-7, gpt-5.4, gemini-3.1-pro, ...)
  3. Token counts (input + output, plus cache hit/miss for Anthropic and OpenAI)
  4. Cost in USD (derived from token counts × per-provider rate)

That's it. Every advanced metric (cost-per-feature, cost-per-tenant, p50/p95 cost-per-session) builds on these four. Most teams don't have the basics, which is why the advanced metrics are absent too.
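
Point 4 is the only one that needs any math. A minimal sketch of the derivation, using the May 2026 rates from the table further down; the dictionary and function names are illustrative, not from any particular tool:

import os  # not needed here; shown only if you extend this to read keys

# Per-million-token rates, mirroring the pricing table below (illustrative).
PRICE_PER_MTOK = {
    ("anthropic", "claude-opus-4-7"): {"input": 5.00, "output": 25.00, "cache_read": 0.50},
    ("openai", "gpt-5.4"):            {"input": 10.00, "output": 30.00, "cache_read": 5.00},
    ("vertex", "gemini-3.1-pro"):     {"input": 1.25, "output": 5.00, "cache_read": 0.31},
}

def request_cost_usd(provider: str, model: str,
                     input_tokens: int, output_tokens: int,
                     cached_input_tokens: int = 0) -> float:
    """USD cost for one request; cached_input_tokens is assumed to be a subset of input_tokens."""
    rates = PRICE_PER_MTOK[(provider, model)]
    uncached = input_tokens - cached_input_tokens
    return (
        uncached * rates["input"]
        + cached_input_tokens * rates["cache_read"]
        + output_tokens * rates["output"]
    ) / 1_000_000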

The stack we'd ship

Three options, ordered by setup effort.

Option 1 — Helicone (SaaS, fastest setup)

Helicone is a proxy-based observability platform with cost tracking built in. Free tier covers 10k requests/month and 1 seat. Setup is a one-line base-URL swap in your SDK config:
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # was https://api.openai.com/v1
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_KEY']}",
    },
)

Same pattern for Anthropic (anthropic.helicone.ai) and for Vertex / Gemini (gemini.helicone.ai). Helicone's dashboard shows requests, costs, per-model breakdowns, and supports custom properties for per-tenant or per-feature attribution.
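For the Anthropic side, the same two overrides go into the Anthropic SDK's constructor. A sketch, assuming the anthropic Python client and the base URL mentioned above (verify the exact Helicone endpoint for your account):

import os

from anthropic import Anthropic

client = Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    base_url="https://anthropic.helicone.ai",  # was https://api.anthropic.com
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_KEY']}",
    },
)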

What you get: 5 minutes of setup, real-time cost dashboards, alert rules, cache-hit visibility.

What you pay: $0 free tier, ~$50/mo for 100k requests, $250/mo for 1M requests. Your data goes through Helicone's proxy, which is a trust consideration for some teams.

Option 2 — LiteLLM self-hosted (gateway + tracking)

BerriAI/litellm is an open-source LLM gateway that supports 100+ models and includes cost tracking. Run it as a proxy in front of your application; it logs every request and computes cost from a maintained pricing table.
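That pricing table also ships inside the litellm Python package, so you can compute a request's cost in code even before the proxy is in place. A sketch, assuming litellm's completion and completion_cost helpers behave as documented for the version you install; the model name is the fictional one used elsewhere on this page:

from litellm import completion, completion_cost

# One call through the SDK (not the proxy); litellm looks up the model's
# rates in its bundled pricing JSON to compute the cost of this response.
response = completion(
    model="anthropic/claude-opus-4-7",
    messages=[{"role": "user", "content": "ping"}],
)
print(completion_cost(completion_response=response))  # USD for this request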

Deploy via Docker:

docker run -p 4000:4000 \
  -e LITELLM_MASTER_KEY=sk-... \
  -e DATABASE_URL=postgresql://... \
  ghcr.io/berriai/litellm:main-stable

Your app calls LiteLLM as if it were OpenAI:

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-litellm-key",
    base_url="http://litellm-proxy:4000/v1",
)

# Now call any model: model="claude-opus-4-7" or "gpt-5.4" or "gemini-3.1-pro"

What you get: full self-hosted observability, unified routing across providers, budget controls per virtual API key, no data leaves your network.

What you pay: software is free. You pay for Postgres + container hosting. Realistic ops cost: $50-200/month for a small team's traffic.

Budget alerts:

# Set a per-key daily budget
litellm-cli budget set --api-key sk-xxx --daily-usd 50

# When exceeded, LiteLLM rejects new requests and webhooks to your alert channel
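
If you'd rather set budgets programmatically, the cap can be attached when the virtual key is created. A hedged sketch: the /key/generate route and the max_budget / budget_duration fields follow our reading of LiteLLM's key-management API, so verify both against your deployed version before relying on them.

import os
import requests

LITELLM_URL = "http://litellm-proxy:4000"  # your proxy address (assumption)

# Create a virtual key with a daily spend cap (field names per LiteLLM's
# key-management API; confirm against the version you run).
resp = requests.post(
    f"{LITELLM_URL}/key/generate",
    headers={"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"},
    json={
        "max_budget": 50.0,        # USD cap for this key
        "budget_duration": "1d",   # reset the cap daily
        "metadata": {"team": "search"},
    },
)
resp.raise_for_status()
print(resp.json()["key"])  # hand this key to the calling service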

Option 3 — OpenLLMetry (OTEL-native, plug into existing stack)

OpenLLMetry (by Traceloop) is an OpenTelemetry instrumentation library for LLM SDKs. It emits OTLP spans with token counts and cost attributes; export them to Datadog, Grafana, Honeycomb, New Relic — wherever your existing telemetry goes.

Setup:

from openllmetry import OpenLLMTracker

tracker = OpenLLMTracker()
tracker.init()  # auto-instruments OpenAI, Anthropic, Google SDKs

What you get: native OTEL traces with gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.model, and computed cost attributes. Plugs into your existing dashboards.

What you pay: software is free. You pay for the OTEL backend you export to. Best when your team already runs Datadog / Grafana / Honeycomb — you get LLM observability without a separate tool.
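
If you want cost rolled up inside your own pipeline rather than in the backend, a span processor can read those gen_ai.* attributes as spans finish. A sketch against the OpenTelemetry Python SDK; the attribute names are the GenAI semantic-convention names quoted above, and the rate table is a placeholder you'd load from a maintained pricing source:

from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor

# Illustrative per-million-token (input, output) rates; load from a
# maintained pricing table in practice rather than hard-coding.
RATES = {"claude-opus-4-7": (5.00, 25.00), "gpt-5.4": (10.00, 30.00)}

class CostLoggingProcessor(SpanProcessor):
    """Logs an estimated USD cost for every finished LLM span."""

    def on_end(self, span: ReadableSpan) -> None:
        attrs = span.attributes or {}
        model = attrs.get("gen_ai.response.model")
        if model not in RATES:
            return
        in_rate, out_rate = RATES[model]
        in_tok = attrs.get("gen_ai.usage.input_tokens", 0)
        out_tok = attrs.get("gen_ai.usage.output_tokens", 0)
        cost = (in_tok * in_rate + out_tok * out_rate) / 1_000_000
        print(f"{model}: ${cost:.6f} ({in_tok} in / {out_tok} out)")

# Register on your TracerProvider alongside the OTLP exporter, e.g.:
# provider.add_span_processor(CostLoggingProcessor())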

The pricing tables (May 2026)

Per-million-token rates. Subject to vendor changes; always verify against the pricing page before alerting.

Model               Input     Output    Cache read   Cache write
Claude Opus 4.7     $5.00     $25.00    $0.50        $6.25
Claude Sonnet 4.7   $3.00     $15.00    $0.30        $3.75
Claude Haiku 4.5    $0.80     $4.00     $0.08        $1.00
GPT-5.4             $10.00    $30.00    $5.00        n/a
GPT-5.4 mini        $2.50     $10.00    $1.25        n/a
Gemini 3.1 Pro      $1.25     $5.00     $0.31        n/a
Gemini 3.1 Flash    $0.30     $1.20     $0.075       n/a

The headline observation: cache reads are 10% (Anthropic), 50% (OpenAI), 25% (Gemini) of input cost. If your cache hit rate is below 70%, that's where the next dollar lives — not in switching providers.
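
To put a dollar figure on hit rate, the blended input rate is one line of arithmetic. A sketch using the table above; it ignores Anthropic's cache-write surcharge for simplicity:

def blended_input_rate(input_rate: float, cache_read_rate: float, hit_rate: float) -> float:
    """Effective $/M input tokens at a given prompt-cache hit rate."""
    return hit_rate * cache_read_rate + (1 - hit_rate) * input_rate

# Claude Opus 4.7: $5.00 input, $0.50 cache read
print(blended_input_rate(5.00, 0.50, 0.70))  # 1.85 -> $1.85/M at a 70% hit rate
print(blended_input_rate(5.00, 0.50, 0.30))  # 3.65 -> $3.65/M at a 30% hit rate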

Per-feature attribution

The single highest-leverage practice for production cost tracking: tag every LLM request with the feature it serves. Then the dashboard can answer "which feature is responsible for the spike?"

Pattern in Helicone:

client.chat.completions.create(
    model="claude-opus-4-7",
    messages=[...],
    extra_headers={
        "Helicone-Property-Feature": "rag-query",
        "Helicone-Property-Tenant": tenant_id,
        "Helicone-Property-UserId": user_id,
    },
)

In LiteLLM, the same idea via metadata:

client.chat.completions.create(
    model="claude-opus-4-7",
    messages=[...],
    metadata={
        "feature": "rag-query",
        "tenant_id": tenant_id,
        "user_id": user_id,
    },
)

Now your dashboard can group by feature. When the bill jumps, you know where to look.

Budget alerts

Three tiers we set on every project:

Tier 1 — daily threshold. "Spend exceeded $X today" → Slack notification. Threshold = 2x expected daily spend. Catches usage spikes (someone's running a job in a tight loop).

Tier 2 — monthly run-rate. "Current run-rate exceeds budget by 20%" → email + Slack. Catches gradual drift before it's a problem.

Tier 3 — per-feature daily. "Feature X spent more than $Y today" → channel-specific Slack. Catches single-feature regressions before they affect the rollup.

LiteLLM has all three built in. Helicone has the first two. OpenLLMetry needs you to wire alerts in your OTEL backend (which most teams already have for non-LLM workloads).
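
If you're wiring Tier 1 yourself, it's a small script on a cron. A minimal sketch, assuming a get_spend_today() helper that sums costs from wherever your request logs land and a standard Slack incoming webhook; both names are placeholders:

import os
import requests

DAILY_THRESHOLD_USD = 50.0  # ~2x expected daily spend

def get_spend_today() -> float:
    """Placeholder: sum today's request costs from your tracking store."""
    raise NotImplementedError

def check_daily_spend() -> None:
    spend = get_spend_today()
    if spend > DAILY_THRESHOLD_USD:
        requests.post(
            os.environ["SLACK_WEBHOOK_URL"],
            json={"text": f"LLM spend today is ${spend:.2f} "
                          f"(threshold ${DAILY_THRESHOLD_USD:.2f})"},
        )

if __name__ == "__main__":
    check_daily_spend()  # run from cron, e.g. hourly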

The cross-provider comparison problem

A common request: "show me cost-per-task across providers so I can pick the cheapest." Three things make this harder than it sounds:

1. Tokenizers differ. 1,000 tokens of input in Claude is not 1,000 tokens of input in GPT. Same English prose tokenizes to roughly 1,000 tokens in both, but specific content (code, JSON, non-Latin scripts) varies by 10-30%.

2. Output verbosity differs. Sonnet 4.7 produces shorter outputs than GPT-5.4 on the same prompt by default. "Same task" produces 800 tokens out from Sonnet and 1,200 from GPT — even though the input is identical.

3. Quality differs. Cheaper output isn't useful if it requires retries. Track success rate alongside cost.

The comparison that actually works: cost-per-successful-task in production, measured for a week, per feature. Stop trying to do theoretical apples-to-apples; measure what your real workload costs on each provider, and switch based on data.
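
A sketch of that per-feature comparison, assuming each request-log record already carries provider, feature, cost, and a success flag; the field names are illustrative:

from collections import defaultdict

def cost_per_successful_task(records: list[dict]) -> dict[tuple[str, str], float]:
    """records: [{"provider": ..., "feature": ..., "cost_usd": ..., "success": ...}, ...]"""
    cost = defaultdict(float)
    successes = defaultdict(int)
    for r in records:
        key = (r["provider"], r["feature"])
        cost[key] += r["cost_usd"]
        successes[key] += 1 if r["success"] else 0
    # Only keys with at least one success are comparable
    return {k: cost[k] / successes[k] for k in cost if successes[k]}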

The cache discipline

The most actionable optimization in this space is improving prompt-cache hit rate. Specific to providers:

Anthropic: Cache hits cost 10% of input rate. 20-block lookback window. 5-minute default TTL, 1-hour beta available. See /topic/anthropic-prompt-caching for the implementation playbook.

OpenAI: Cached input costs 50% of standard input rate. Automatic for prompts over 1,024 tokens with no explicit opt-in. Less aggressive than Anthropic's mechanism but still a real saving.

Gemini: Context caching available; pricing varies by model. Less mature than the other two as of May 2026.

Your dashboard should show cache hit rate per model, per feature. If hit rate is below 60% on a request pattern that should be cacheable, that's where the next dollar is.
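
For Anthropic, the per-response usage block already separates cached from fresh input tokens, so the hit rate falls out directly. A sketch using the Messages API usage fields; in practice you'd aggregate the token counts over a window rather than per request:

def anthropic_cache_hit_rate(usage) -> float:
    """Share of prompt tokens served from cache, for one response's usage block."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = usage.input_tokens  # uncached input tokens are counted separately
    total = read + written + fresh
    return read / total if total else 0.0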

Where this fails

Provider billing APIs lag. Most providers update billing data on 6-24 hour delays. Real-time dashboards estimate from request logs, not provider billing. The estimates are usually within 5% of the official invoice; the differences come from billing adjustments and credits.

Multi-tenant attribution is hard if your auth flow is messy. "Tag every request with tenant_id" presumes tenant_id is available at the call site. If your code structure makes it inconvenient, tag what you can and accept the rest.

Cache hit rate is misleading for short-lived workloads. A cache that expires every 5 minutes won't help workloads that run once per hour. Match the cache TTL to your traffic pattern, or accept the lower hit rate.

Streaming responses complicate token counting. When a response streams, you don't know the final output token count until completion. Helicone and LiteLLM handle this transparently; if you're building your own, plan for delayed totals (see the sketch after these notes).

The pricing tables go stale. Anthropic and OpenAI both adjust pricing more often than they used to. Hard-coded rates in your dashboard drift out of date. LiteLLM maintains a community pricing JSON that updates with new model releases; we pin to its main branch.
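
For the streaming caveat, OpenAI's API can append a final usage chunk when asked. A sketch, assuming stream_options={"include_usage": True} behaves as documented for your SDK version:

from openai import OpenAI

client = OpenAI()  # or pointed at your proxy, as in the examples above

stream = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "..."}],
    stream=True,
    stream_options={"include_usage": True},  # ask for a final usage chunk
)

usage = None
for chunk in stream:
    # ... forward content deltas to the caller ...
    if chunk.usage is not None:  # only the final chunk carries token totals
        usage = chunk.usage

# Safe to log for cost only after the stream completes
print(usage.prompt_tokens, usage.completion_tokens)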

Frequently asked

What's the cheapest way to track LLM cost across providers?
Helicone's free tier (10k requests/month) covers most solo developer use. Above that, LiteLLM self-hosted is free in software but requires running your own proxy. For teams over 100k requests/month, Helicone Pro or self-hosted LiteLLM + a custom dashboard is the realistic choice. The unified-spend question is harder than the unified-tracking question — see below.
Why is multi-provider cost tracking hard?
Three reasons. (1) Each provider's pricing is different — Anthropic prices inputs at $5/M for Opus, OpenAI at $10/M for GPT-5.4, Gemini at $1.25/M for Gemini 3.1 Pro. (2) Token counts differ between tokenizers, so 1,000 'tokens' isn't comparable. (3) Provider-specific discounts (Anthropic batch + cache stacking, OpenAI volume tiers) need per-provider math. The 'unified daily spend in USD' number is the only one that's comparable.
Should I use Helicone, LiteLLM, or build my own?
Helicone if you want SaaS observability with cost tracking included. LiteLLM if you want a self-hosted proxy that doubles as a gateway and includes cost tracking. Build your own if you have specific tracking needs (per-tenant attribution, per-feature attribution, custom budget alerts). Most teams should pick Helicone or LiteLLM, not build.
What's the cheapest alert mechanism for 'spend crossed $X today'?
LiteLLM has built-in budget alerts that hit Slack/PagerDuty via webhooks. Helicone has alert rules in its UI. Build-your-own: a daily cron that queries provider billing APIs and posts to Slack if total exceeds threshold. The build-your-own version is roughly 50 lines of code; the SaaS version is one config screen.
Does Anthropic's prompt caching show up in cost-tracking tools?
Yes, in the providers' billing data. Cache hits are billed at 10% of input rate (Anthropic), 50% (OpenAI cached input). Helicone, LiteLLM, and OpenLLMetry all separate cache hit/miss spend in their reports. If your tool doesn't, switch tools — cache hit rate is the single most actionable cost metric.
How accurate are token estimates before a request?
Pretty accurate within a provider (5-10% margin). Inaccurate across providers because tokenizers differ. tiktoken estimates OpenAI tokens; you cannot use it to predict Anthropic spend. Use each provider's official tokenizer library if you need pre-request cost predictions; accept that cross-provider comparison is post-hoc only.
