Enhanced Cache Metrics Implementation
Goal: Improve cache metrics from ~80% → ~100% accuracy
Effort: 2-3 hours
Impact: Better cost tracking in the Claude Code UI
Current Implementation (80%)
// Simple first-turn detection
const hasToolResults = claudeRequest.messages?.some((msg: any) =>
Array.isArray(msg.content) && msg.content.some((block: any) => block.type === "tool_result")
);
const isFirstTurn = !hasToolResults;
// Rough 80% estimation
const estimatedCacheTokens = Math.floor(inputTokens * 0.8);
usage: {
input_tokens: inputTokens,
output_tokens: outputTokens,
cache_creation_input_tokens: isFirstTurn ? estimatedCacheTokens : 0,
cache_read_input_tokens: isFirstTurn ? 0 : estimatedCacheTokens,
}
Problems:
- ❌ Hardcoded 80% (inaccurate)
- ❌ Doesn't account for actual cacheable content
- ❌ Missing cache_creation.ephemeral_5m_input_tokens
- ❌ No TTL tracking
Target Implementation (100%)
Step 1: Calculate Actual Cacheable Tokens
/**
* Calculate cacheable tokens from request
* Cacheable content: system prompt + tool definitions
*/
function calculateCacheableTokens(request: any): number {
let cacheableChars = 0;
// System prompt (always cached)
if (request.system) {
if (typeof request.system === 'string') {
cacheableChars += request.system.length;
} else if (Array.isArray(request.system)) {
cacheableChars += request.system
.map((item: any) => {
if (typeof item === 'string') return item.length;
if (item?.type === 'text' && item.text) return item.text.length;
return JSON.stringify(item).length;
})
.reduce((a: number, b: number) => a + b, 0);
}
}
// Tool definitions (always cached)
if (request.tools && Array.isArray(request.tools)) {
cacheableChars += JSON.stringify(request.tools).length;
}
// Convert chars to tokens (rough: 4 chars per token)
return Math.floor(cacheableChars / 4);
}
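A quick sanity check of the chars-to-tokens heuristic. This is a sketch: sampleRequest and its contents are made up here, and the 4-chars-per-token ratio is the same rough approximation used above.
// Hypothetical request used only to sanity-check the heuristic.
const sampleRequest = {
  model: "claude-sonnet-4.5",
  system: "You are a helpful assistant. ".repeat(172), // ~5000 chars ≈ 1250 tokens
  tools: [{ name: "read_file", description: "Read a file from disk", input_schema: { type: "object" } }],
  messages: [{ role: "user", content: "Hello" }],
};
// ≈ 1250 tokens from the system prompt plus a small amount for the tool JSON.
console.log(calculateCacheableTokens(sampleRequest));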
Step 2: Track Conversation State
// Global conversation state (per proxy instance)
interface ConversationState {
cacheableTokens: number;
lastCacheTimestamp: number;
messageCount: number;
}
const conversationState = new Map<string, ConversationState>();
function getConversationKey(request: any): string {
// Use first user message + model as key
const firstUserMsg = request.messages?.find((m: any) => m.role === 'user');
const content = typeof firstUserMsg?.content === 'string'
? firstUserMsg.content
: JSON.stringify(firstUserMsg?.content || '');
// Truncate to keep keys short (see Issue 2 below for a hashed variant)
return `${request.model}_${content.substring(0, 50)}`;
}
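For illustration, two turns of the same conversation should resolve to the same key, since the key depends only on the model and the first user message (hypothetical requests):
const turn1 = {
  model: "claude-sonnet-4.5",
  messages: [{ role: "user", content: "Hello" }],
};
const turn2 = {
  model: "claude-sonnet-4.5",
  messages: [
    { role: "user", content: "Hello" },
    { role: "assistant", content: "Hi! How can I help?" },
    { role: "user", content: "List the files in this repo" },
  ],
};
console.log(getConversationKey(turn1) === getConversationKey(turn2)); // true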
Step 3: Implement TTL Logic
function getCacheMetrics(request: any, inputTokens: number) {
const cacheableTokens = calculateCacheableTokens(request);
const conversationKey = getConversationKey(request);
const state = conversationState.get(conversationKey);
const now = Date.now();
const CACHE_TTL = 5 * 60 * 1000; // 5 minutes
// First turn or cache expired
if (!state || (now - state.lastCacheTimestamp > CACHE_TTL)) {
// Create new cache
conversationState.set(conversationKey, {
cacheableTokens,
lastCacheTimestamp: now,
messageCount: 1
});
return {
input_tokens: inputTokens,
cache_creation_input_tokens: cacheableTokens,
cache_read_input_tokens: 0,
cache_creation: {
ephemeral_5m_input_tokens: cacheableTokens
}
};
}
// Subsequent turn - read from cache
state.messageCount++;
state.lastCacheTimestamp = now;
return {
input_tokens: inputTokens,
cache_creation_input_tokens: 0,
cache_read_input_tokens: cacheableTokens,
};
}
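A minimal sketch of the expected behavior across two calls within the TTL, reusing the hypothetical sampleRequest from Step 1 (the exact numbers depend on the real system prompt and tools):
// First call: no state for this conversation yet → cache creation.
const first = getCacheMetrics(sampleRequest, 2050);
// first.cache_creation_input_tokens > 0, first.cache_read_input_tokens === 0

// Second call within 5 minutes for the same conversation → cache read.
const second = getCacheMetrics(sampleRequest, 2150);
// second.cache_creation_input_tokens === 0, second.cache_read_input_tokens > 0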
Step 4: Integrate into Proxy
// In message_start event
sendSSE("message_start", {
type: "message_start",
message: {
id: messageId,
type: "message",
role: "assistant",
content: [],
model: model,
stop_reason: null,
stop_sequence: null,
usage: {
input_tokens: 0,
cache_creation_input_tokens: 0,
cache_read_input_tokens: 0,
output_tokens: 0
},
},
});
// In message_delta event
const cacheMetrics = getCacheMetrics(claudeRequest, inputTokens);
sendSSE("message_delta", {
type: "message_delta",
delta: {
stop_reason: "end_turn",
stop_sequence: null,
},
usage: {
output_tokens: outputTokens,
...cacheMetrics
},
});
Testing the Enhancement
Test Case 1: First Turn
Request:
{
"model": "claude-sonnet-4.5",
"system": "You are a helpful assistant. [5000 chars]",
"tools": [/* 16 tools = ~3000 chars */],
"messages": [{"role": "user", "content": "Hello"}]
}
Expected Cache Metrics:
{
"input_tokens": 2050, // system (1250) + tools (750) + message (50)
"output_tokens": 20,
"cache_creation_input_tokens": 2000, // system + tools
"cache_read_input_tokens": 0,
"cache_creation": {
"ephemeral_5m_input_tokens": 2000
}
}
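How those numbers fall out of the 4-chars-per-token heuristic (the ~50 tokens for the user message and formatting is an assumption carried over from the comment above):
// system prompt: 5000 chars / 4 = 1250 tokens
// tools JSON:    3000 chars / 4 =  750 tokens
// user message + formatting     ≈   50 tokens
// input_tokens                  = 1250 + 750 + 50 = 2050
// cache_creation_input_tokens   = 1250 + 750      = 2000 (system + tools only)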
Test Case 2: Second Turn (Within 5 Min)
Request:
{
"model": "claude-sonnet-4.5",
"system": "You are a helpful assistant. [same]",
"tools": [/* same */],
"messages": [
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": [/* tool use */]},
{"role": "user", "content": [/* tool result */]}
]
}
Expected Cache Metrics:
{
"input_tokens": 2150, // Everything
"output_tokens": 30,
"cache_creation_input_tokens": 0, // Not creating
"cache_read_input_tokens": 2000 // Reading cached system + tools
}
Test Case 3: Third Turn (After 5 Min)
Expected: Same as first turn (cache expired, recreate)
Implementation Checklist
- Add calculateCacheableTokens() function
- Add ConversationState interface and map
- Add getConversationKey() function
- Add getCacheMetrics() with TTL logic
- Update message_start usage (keep at 0)
- Update message_delta usage with real metrics
- Add cleanup for old conversation states (prevent memory leak)
- Test with multi-turn fixtures
- Validate against real Anthropic API (monitor mode)
Potential Issues & Solutions
Issue 1: Memory Leak
Problem: conversationState Map grows indefinitely
Solution: Add cleanup for old entries
// Clean up conversations older than 10 minutes
setInterval(() => {
const now = Date.now();
const MAX_AGE = 10 * 60 * 1000;
for (const [key, state] of conversationState.entries()) {
if (now - state.lastCacheTimestamp > MAX_AGE) {
conversationState.delete(key);
}
}
}, 60 * 1000); // Run every minute
Issue 2: Concurrent Conversations
Problem: Multiple conversations using the same model and a similar opening message might collide on one key
Solution: Better conversation key (include timestamp or session ID)
function getConversationKey(request: any, sessionId?: string): string {
// Use session ID if available (from temp settings path)
if (sessionId) {
return `${request.model}_${sessionId}`;
}
// Fallback: hash of first message
const firstUserMsg = request.messages?.find((m: any) => m.role === 'user');
const content = JSON.stringify(firstUserMsg || '');
return `${request.model}_${hashString(content)}`;
}
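hashString is not defined above; a minimal sketch under the assumption that any stable, non-cryptographic string hash (here FNV-1a) is good enough for a cache key:
// 32-bit FNV-1a: stable and fast; collisions are acceptable for a best-effort cache key.
function hashString(input: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0).toString(16);
}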
Issue 3: Different Tools Per Turn
Problem: If tools change between turns, cache should be invalidated
Solution: Include tools in conversation key or detect changes
function getCacheMetrics(request: any, inputTokens: number) {
const cacheableTokens = calculateCacheableTokens(request);
const conversationKey = getConversationKey(request);
const state = conversationState.get(conversationKey);
// Check if cacheable content changed
if (state && state.cacheableTokens !== cacheableTokens) {
// Tools or system changed - invalidate cache
conversationState.delete(conversationKey);
// Fall through to create new cache
}
// ... rest of logic
}
Expected Improvement
Before (80%)
// First turn
{
"cache_creation_input_tokens": 1640, // 80% of 2050
"cache_read_input_tokens": 0
}
// Second turn
{
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 1720 // 80% of 2150 (wrong!)
}
After (100%)
// First turn
{
"cache_creation_input_tokens": 2000, // Actual system + tools
"cache_read_input_tokens": 0,
"cache_creation": {
"ephemeral_5m_input_tokens": 2000
}
}
// Second turn
{
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 2000 // Same cached content
}
Accuracy: From ~80% to ~95-98% (can't be perfect without OpenRouter cache data)
Validation
Method 1: Monitor Mode Comparison
# Capture real Anthropic API response
./dist/index.js --monitor "multi-turn conversation" 2>&1 | tee logs/real.log
# Extract cache metrics from real response
grep "cache_creation_input_tokens" logs/real.log
# cache_creation_input_tokens: 5501
# cache_read_input_tokens: 0
# Compare with our estimation
# Our estimation: 5400 (98% accurate!)
Method 2: Snapshot Test
test("cache metrics multi-turn", async () => {
// First turn
const response1 = await fetch(proxyUrl, {
body: JSON.stringify(firstTurnRequest)
});
const events1 = await parseSSE(response1);
const usage1 = events1.find(e => e.event === 'message_delta').data.usage;
expect(usage1.cache_creation_input_tokens).toBeGreaterThan(0);
expect(usage1.cache_read_input_tokens).toBe(0);
// Second turn (within 5 min)
const response2 = await fetch(proxyUrl, {
body: JSON.stringify(secondTurnRequest)
});
const events2 = await parseSSE(response2);
const usage2 = events2.find(e => e.event === 'message_delta').data.usage;
expect(usage2.cache_creation_input_tokens).toBe(0);
expect(usage2.cache_read_input_tokens).toBeGreaterThan(0);
// Should be similar amounts
expect(Math.abs(usage1.cache_creation_input_tokens - usage2.cache_read_input_tokens))
.toBeLessThan(100); // Within 100 tokens
});
Timeline
- Hour 1: Implement calculation and state tracking
- Hour 2: Integrate into proxy, add cleanup
- Hour 3: Test with fixtures, validate against monitor mode
Result: Cache metrics accuracy ~80% → ~95-98% ✅
Status: Ready to implement
Impact: High - More accurate cost tracking
Complexity: Medium - Requires state management