How to track and reduce LLM token costs at scale — $40k/month and climbing
Situation
Our LLM spend hit $40k last month. I have almost no visibility into which agents/features are responsible. The OpenAI dashboard aggregates everything.
What I've tried
- Adding the `user` field to OpenAI calls (helps a bit)
- Log-based cost estimation (inaccurate, misses cached tokens)
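For reference, log-based estimation can account for cached tokens if the logs capture the usage object from each response (OpenAI reports cached prompt tokens in the usage details). A minimal per-agent estimator sketch — the prices in the table are illustrative placeholders, not guaranteed current rates, so substitute your own from the pricing page:

```python
# Rough per-call cost estimator for log-based attribution.
# PRICES are illustrative placeholders (USD per 1M tokens) -- replace
# with the real per-model rates from your provider's pricing page.
PRICES = {
    "gpt-4o":      {"input": 2.50, "cached_input": 1.25, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "cached_input": 0.075, "output": 0.60},
}

def call_cost(model, prompt_tokens, completion_tokens, cached_tokens=0):
    """Dollar cost of one call; cached prompt tokens billed at the cached rate."""
    p = PRICES[model]
    uncached = prompt_tokens - cached_tokens
    return (uncached * p["input"]
            + cached_tokens * p["cached_input"]
            + completion_tokens * p["output"]) / 1_000_000

def attribute(calls):
    """Sum dollar cost per agent from an iterable of logged call records."""
    totals = {}
    for c in calls:
        totals[c["agent"]] = totals.get(c["agent"], 0.0) + call_cost(
            c["model"], c["prompt_tokens"], c["completion_tokens"],
            c.get("cached_tokens", 0))
    return totals
```

Feeding this from request logs gives a per-agent breakdown even before any dedicated tooling is in place.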
Questions
- What's the best open-source stack for per-agent token cost attribution?
- Are there patterns for caching prompt prefixes that actually move the needle?
- At what point does it make sense to host your own model vs keep paying OpenAI?
Asked by @bob-martinez
1 Answer
1 new answer0
For per-agent attribution, LangSmith is the best-in-class paid option. Open-source alternative: Helicone (self-hostable, great cost dashboards).
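Helicone works as a proxy: you point the OpenAI client's `base_url` at its endpoint and tag each call with custom property headers, which the dashboard then groups costs by. A sketch under those assumptions — the `Helicone-Auth` and `Helicone-Property-*` header scheme is Helicone's convention, but the property names (`Agent`, `Feature`) here are arbitrary labels of my own, so verify the exact header format against their docs:

```python
import os

def helicone_headers(agent: str, feature: str) -> dict:
    """Per-call headers: Helicone auth plus custom properties to group costs by."""
    return {
        "Helicone-Auth": f"Bearer {os.environ.get('HELICONE_API_KEY', '')}",
        "Helicone-Property-Agent": agent,      # arbitrary property names;
        "Helicone-Property-Feature": feature,  # they appear as dashboard filters
    }

def helicone_client():
    """OpenAI client routed through Helicone's OpenAI-compatible proxy."""
    from openai import OpenAI  # deferred import: header helper works without the SDK
    return OpenAI(base_url="https://oai.helicone.ai/v1")

# Usage (sketch): pass the headers on each call so every request is attributed.
# client = helicone_client()
# client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[{"role": "user", "content": "..."}],
#     extra_headers=helicone_headers("billing-agent", "invoice-summary"),
# )
```

Once every call carries an agent property, the "which feature is burning money" question becomes a dashboard filter instead of a log archaeology project.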
What actually moves the needle on cost
- Prompt caching — enable it on every call (saves 50-80% on repeated prefixes). Claude has explicit cache control; GPT-4o does it automatically if the prefix is stable.
- Model routing — use GPT-4o-mini for classification/extraction tasks, GPT-4o only for generation. Rule of thumb: if a task has a verifiable ground truth, start with the smallest model and escalate.
- Output length budgets — add `max_tokens` to every call. Agents without limits generate padding that costs money and adds noise.
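For the Claude case, explicit caching means marking the stable prefix (system prompt, tool definitions) with a `cache_control` block. A sketch assuming the Anthropic Messages API's block format — check their prompt-caching docs for the current shape and minimum cacheable length:

```python
def cached_system(stable_prompt: str) -> list:
    """Wrap a long, stable system prompt so Anthropic can cache it as a prefix.
    The cache_control marker tells the API everything up to this block is
    cacheable; keep the stable content identical across calls or the cache misses."""
    return [{
        "type": "text",
        "text": stable_prompt,
        "cache_control": {"type": "ephemeral"},
    }]

# Usage with the Anthropic SDK (sketch only; SDK call not executed here):
# client.messages.create(
#     model="claude-sonnet-4-20250514",
#     max_tokens=1024,
#     system=cached_system(LONG_TOOL_AND_POLICY_PROMPT),
#     messages=[{"role": "user", "content": user_msg}],
# )
```

The key operational point is byte-for-byte stability: put anything that varies per request (user data, timestamps) after the cached block, never inside it.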
A $40k/month bill almost certainly hides one high-volume feature making expensive calls for work a smaller model could handle. Find it first; it usually pays for the rest of the effort.
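Routing and output budgets combine naturally into one chokepoint: every agent asks a router for `(model, max_tokens)` instead of hardcoding the large model. The task names and budgets below are illustrative examples, not a prescription:

```python
# Illustrative routing table: the cheap model for tasks with verifiable
# ground truth, the large model only for open-ended generation.
# Budgets are example values -- tune them per task from real output lengths.
ROUTES = {
    "classify":  ("gpt-4o-mini", 16),
    "extract":   ("gpt-4o-mini", 256),
    "summarize": ("gpt-4o-mini", 512),
    "generate":  ("gpt-4o", 1024),
}

def route(task: str) -> tuple:
    """Return (model, max_tokens) for a task; unknown tasks default cheap."""
    return ROUTES.get(task, ("gpt-4o-mini", 256))

# Every call site then passes both through, so no call ships unbudgeted:
# model, max_tokens = route("classify")
# client.chat.completions.create(model=model, max_tokens=max_tokens, ...)
```

Defaulting unknown tasks to the small model means new features start cheap and must justify an escalation, rather than starting expensive by accident.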
Answered by @carol-johnson · 1d ago