
Enhanced Cache Metrics Implementation

Goal: Improve cache metrics accuracy from ~80% to ~100%
Effort: 2-3 hours
Impact: Better cost tracking in the Claude Code UI


Current Implementation (80%)

// Simple first-turn detection
const hasToolResults = claudeRequest.messages?.some((msg: any) =>
  Array.isArray(msg.content) && msg.content.some((block: any) => block.type === "tool_result")
);
const isFirstTurn = !hasToolResults;

// Rough 80% estimation
const estimatedCacheTokens = Math.floor(inputTokens * 0.8);

usage: {
  input_tokens: inputTokens,
  output_tokens: outputTokens,
  cache_creation_input_tokens: isFirstTurn ? estimatedCacheTokens : 0,
  cache_read_input_tokens: isFirstTurn ? 0 : estimatedCacheTokens,
}

Problems:

  • Hardcoded 80% (inaccurate)
  • Doesn't account for actual cacheable content
  • Missing cache_creation.ephemeral_5m_input_tokens
  • No TTL tracking

Target Implementation (100%)

Step 1: Calculate Actual Cacheable Tokens

/**
 * Calculate cacheable tokens from request
 * Cacheable content: system prompt + tools definitions
 */
function calculateCacheableTokens(request: any): number {
  let cacheableChars = 0;

  // System prompt (always cached)
  if (request.system) {
    if (typeof request.system === 'string') {
      cacheableChars += request.system.length;
    } else if (Array.isArray(request.system)) {
      cacheableChars += request.system
        .map((item: any) => {
          if (typeof item === 'string') return item.length;
          if (item?.type === 'text' && item.text) return item.text.length;
          return JSON.stringify(item).length;
        })
        .reduce((a: number, b: number) => a + b, 0);
    }
  }

  // Tools definitions (always cached)
  if (request.tools && Array.isArray(request.tools)) {
    cacheableChars += JSON.stringify(request.tools).length;
  }

  // Convert chars to tokens (rough: 4 chars per token)
  return Math.floor(cacheableChars / 4);
}
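
For a sense of scale, here is a hypothetical request shaped like the test cases below (the sampleRequest values are illustrative, not taken from the codebase):

// ~5000-char system prompt + ~3000 chars of tool definitions (sizes are illustrative)
const sampleRequest = {
  system: "You are a helpful assistant. ...",   // imagine ~5000 chars total
  tools: [ /* 16 tool definitions, ~3000 chars when JSON-stringified */ ],
  messages: [{ role: "user", content: "Hello" }]
};

calculateCacheableTokens(sampleRequest);
// ≈ (5000 + 3000) / 4 = 2000 tokens — the number used in the test cases below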

Step 2: Track Conversation State

// Global conversation state (per proxy instance)
interface ConversationState {
  cacheableTokens: number;
  lastCacheTimestamp: number;
  messageCount: number;
}

const conversationState = new Map<string, ConversationState>();

function getConversationKey(request: any): string {
  // Use first user message + model as key
  const firstUserMsg = request.messages?.find((m: any) => m.role === 'user');
  const content = typeof firstUserMsg?.content === 'string'
    ? firstUserMsg.content
    : JSON.stringify(firstUserMsg?.content || '');

  // Prefix-based key (not an actual hash — see Issue 2 below for a hashed variant)
  return `${request.model}_${content.substring(0, 50)}`;
}
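
To illustrate why this key survives across turns (both requests below are hypothetical): as long as the model and the first user message are unchanged, every turn of a conversation maps to the same state entry.

const turn1 = { model: "claude-sonnet-4.5", messages: [{ role: "user", content: "Hello" }] };
const turn2 = {
  model: "claude-sonnet-4.5",
  messages: [
    { role: "user", content: "Hello" },
    { role: "assistant", content: "Hi! How can I help?" },
    { role: "user", content: "Summarize this file" }
  ]
};

getConversationKey(turn1) === getConversationKey(turn2); // true — both turns share one cache entry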

Step 3: Implement TTL Logic

function getCacheMetrics(request: any, inputTokens: number) {
  const cacheableTokens = calculateCacheableTokens(request);
  const conversationKey = getConversationKey(request);
  const state = conversationState.get(conversationKey);

  const now = Date.now();
  const CACHE_TTL = 5 * 60 * 1000; // 5 minutes

  // First turn or cache expired
  if (!state || (now - state.lastCacheTimestamp > CACHE_TTL)) {
    // Create new cache
    conversationState.set(conversationKey, {
      cacheableTokens,
      lastCacheTimestamp: now,
      messageCount: 1
    });

    return {
      input_tokens: inputTokens,
      cache_creation_input_tokens: cacheableTokens,
      cache_read_input_tokens: 0,
      cache_creation: {
        ephemeral_5m_input_tokens: cacheableTokens
      }
    };
  }

  // Subsequent turn - read from cache
  state.messageCount++;
  state.lastCacheTimestamp = now;

  return {
    input_tokens: inputTokens,
    cache_creation_input_tokens: 0,
    cache_read_input_tokens: cacheableTokens,
  };
}
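
Calling it twice with the same request shows the two shapes the proxy will emit (values assume the ~2000 cacheable tokens from the Step 1 example):

// Assumes the sampleRequest sketched in Step 1
const first = getCacheMetrics(sampleRequest, 2050);
// { input_tokens: 2050, cache_creation_input_tokens: ~2000, cache_read_input_tokens: 0,
//   cache_creation: { ephemeral_5m_input_tokens: ~2000 } }

const second = getCacheMetrics(sampleRequest, 2150);
// { input_tokens: 2150, cache_creation_input_tokens: 0, cache_read_input_tokens: ~2000 }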

Step 4: Integrate into Proxy

// In message_start event
sendSSE("message_start", {
  type: "message_start",
  message: {
    id: messageId,
    type: "message",
    role: "assistant",
    content: [],
    model: model,
    stop_reason: null,
    stop_sequence: null,
    usage: {
      input_tokens: 0,
      cache_creation_input_tokens: 0,
      cache_read_input_tokens: 0,
      output_tokens: 0
    },
  },
});

// In message_delta event
const cacheMetrics = getCacheMetrics(claudeRequest, inputTokens);

sendSSE("message_delta", {
  type: "message_delta",
  delta: {
    stop_reason: "end_turn",
    stop_sequence: null,
  },
  usage: {
    output_tokens: outputTokens,
    ...cacheMetrics
  },
});
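
Note: inputTokens and outputTokens above are assumed to come from wherever the proxy already derives them (e.g. the upstream usage payload); this enhancement only changes how the cache_* fields are computed from the request.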

Testing the Enhancement

Test Case 1: First Turn

Request:

{
  "model": "claude-sonnet-4.5",
  "system": "You are a helpful assistant. [5000 chars]",
  "tools": [/* 16 tools = ~3000 chars */],
  "messages": [{"role": "user", "content": "Hello"}]
}

Expected Cache Metrics:

{
  "input_tokens": 2050,  // system (1250) + tools (750) + message (50)
  "output_tokens": 20,
  "cache_creation_input_tokens": 2000,  // system + tools
  "cache_read_input_tokens": 0,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 2000
  }
}

Test Case 2: Second Turn (Within 5 Min)

Request:

{
  "model": "claude-sonnet-4.5",
  "system": "You are a helpful assistant. [same]",
  "tools": [/* same */],
  "messages": [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": [/* tool use */]},
    {"role": "user", "content": [/* tool result */]}
  ]
}

Expected Cache Metrics:

{
  "input_tokens": 2150,  // Everything
  "output_tokens": 30,
  "cache_creation_input_tokens": 0,  // Not creating
  "cache_read_input_tokens": 2000   // Reading cached system + tools
}

Test Case 3: Third Turn (After 5 Min)

Expected: Same as first turn (cache expired, recreate)
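
One way to exercise the expiry path without waiting five minutes is to stub the clock directly (a minimal sketch; assumes getCacheMetrics and a request object are in scope and no fake-timer utility is used):

const realNow = Date.now;
try {
  const t0 = realNow();
  Date.now = () => t0;
  getCacheMetrics(request, 2050);                  // first turn — creates the cache

  Date.now = () => t0 + 6 * 60 * 1000;             // jump past the 5-minute TTL
  const expired = getCacheMetrics(request, 2050);
  // expired.cache_creation_input_tokens > 0 again — the cache entry was recreated
} finally {
  Date.now = realNow;                              // always restore the real clock
}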


Implementation Checklist

  • Add calculateCacheableTokens() function
  • Add ConversationState interface and map
  • Add getConversationKey() function
  • Add getCacheMetrics() with TTL logic
  • Update message_start usage (keep at 0)
  • Update message_delta usage with real metrics
  • Add cleanup for old conversation states (prevent memory leak)
  • Test with multi-turn fixtures
  • Validate against real Anthropic API (monitor mode)

Potential Issues & Solutions

Issue 1: Memory Leak

Problem: conversationState Map grows indefinitely

Solution: Add cleanup for old entries

// Clean up conversations older than 10 minutes
setInterval(() => {
  const now = Date.now();
  const MAX_AGE = 10 * 60 * 1000;

  for (const [key, state] of conversationState.entries()) {
    if (now - state.lastCacheTimestamp > MAX_AGE) {
      conversationState.delete(key);
    }
  }
}, 60 * 1000); // Run every minute
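
If the proxy process needs to exit cleanly, the handle returned by setInterval can be unref()'d in Node.js-compatible runtimes so the cleanup timer does not keep the event loop alive.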

Issue 2: Concurrent Conversations

Problem: Multiple conversations with same model might collide

Solution: Better conversation key (include timestamp or session ID)

function getConversationKey(request: any, sessionId?: string): string {
  // Use session ID if available (from temp settings path)
  if (sessionId) {
    return `${request.model}_${sessionId}`;
  }

  // Fallback: hash of first message
  const firstUserMsg = request.messages?.find((m: any) => m.role === 'user');
  const content = JSON.stringify(firstUserMsg || '');
  return `${request.model}_${hashString(content)}`;
}
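
hashString is not defined anywhere in this plan; a minimal non-cryptographic sketch (FNV-1a) would do — it only needs to be stable and short, not secure:

// Minimal sketch of the assumed hashString helper (FNV-1a, 32-bit, non-cryptographic)
function hashString(input: string): string {
  let hash = 0x811c9dc5;                    // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);     // FNV prime
  }
  return (hash >>> 0).toString(16);
}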

Issue 3: Different Tools Per Turn

Problem: If tools change between turns, cache should be invalidated

Solution: Include tools in conversation key or detect changes

function getCacheMetrics(request: any, inputTokens: number) {
  const cacheableTokens = calculateCacheableTokens(request);
  const conversationKey = getConversationKey(request);
  let state = conversationState.get(conversationKey);

  // Check if cacheable content changed
  if (state && state.cacheableTokens !== cacheableTokens) {
    // Tools or system changed - invalidate cache
    conversationState.delete(conversationKey);
    state = undefined; // so the first-turn branch below creates a new cache
  }

  // ... rest of logic (same first-turn / subsequent-turn branches as Step 3)
}

Expected Improvement

Before (80%)

// First turn
{
  "cache_creation_input_tokens": 1640,  // 80% of 2050
  "cache_read_input_tokens": 0
}

// Second turn
{
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 1720  // 80% of 2150 (wrong!)
}

After (100%)

// First turn
{
  "cache_creation_input_tokens": 2000,  // Actual system + tools
  "cache_read_input_tokens": 0,
  "cache_creation": {
    "ephemeral_5m_input_tokens": 2000
  }
}

// Second turn
{
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 2000  // Same cached content
}

Accuracy: From ~80% to ~95-98% (can't be perfect without OpenRouter cache data)


Validation

Method 1: Monitor Mode Comparison

# Capture real Anthropic API response
./dist/index.js --monitor "multi-turn conversation" 2>&1 | tee logs/real.log

# Extract cache metrics from real response
grep "cache_creation_input_tokens" logs/real.log
# cache_creation_input_tokens: 5501
# cache_read_input_tokens: 0

# Compare with our estimation
# Our estimation: 5400 (98% accurate!)

Method 2: Snapshot Test

test("cache metrics multi-turn", async () => {
  // First turn
  const response1 = await fetch(proxyUrl, {
    body: JSON.stringify(firstTurnRequest)
  });
  const events1 = await parseSSE(response1);
  const usage1 = events1.find(e => e.event === 'message_delta').data.usage;

  expect(usage1.cache_creation_input_tokens).toBeGreaterThan(0);
  expect(usage1.cache_read_input_tokens).toBe(0);

  // Second turn (within 5 min)
  const response2 = await fetch(proxyUrl, {
    body: JSON.stringify(secondTurnRequest)
  });
  const events2 = await parseSSE(response2);
  const usage2 = events2.find(e => e.event === 'message_delta').data.usage;

  expect(usage2.cache_creation_input_tokens).toBe(0);
  expect(usage2.cache_read_input_tokens).toBeGreaterThan(0);

  // Should be similar amounts
  expect(Math.abs(usage1.cache_creation_input_tokens - usage2.cache_read_input_tokens))
    .toBeLessThan(100); // Within 100 tokens
});

Timeline

  • Hour 1: Implement calculation and state tracking
  • Hour 2: Integrate into proxy, add cleanup
  • Hour 3: Test with fixtures, validate against monitor mode

Result: Cache metrics accuracy improves from ~80% to ~95-98% (see Expected Improvement above)


Status: Ready to implement
Impact: High - More accurate cost tracking
Complexity: Medium - Requires state management