Protocol Compliance Plan: Achieving 1:1 Claude Code Compatibility

Goal: Ensure the Claudish proxy provides a user experience identical to official Claude Code, regardless of which model is used.

Status: Testing framework complete ✅ | Proxy fixes pending ⏳

Executive Summary

We have built a comprehensive snapshot testing system that captures real Claude Code protocol interactions and validates proxy responses against them. The current proxy implementation is 60-70% compliant, with critical gaps in the streaming protocol, tool handling, and cache metrics.

What's Complete ✅

Monitor Mode - Pass-through proxy with complete logging
Fixture Capture - Tool to extract test cases from monitor logs
Snapshot Tests - Automated validation of protocol compliance
Protocol Validators - Event sequence, block indices, tool streaming, usage, stop reasons
Example Fixtures - Documented examples for text and tool use
Workflow Scripts - End-to-end capture → test automation

What's Pending ⏳

Fix content block index management (CRITICAL)
Add tool input JSON validation (CRITICAL)
Implement continuous ping events (MEDIUM)
Add cache metrics emulation (MEDIUM)
Capture comprehensive fixture library (20+ scenarios)
Run full test suite and fix remaining issues

Testing System Architecture

╔══════════════════════════════════════════════════════════════╗
║                   MONITOR MODE (Capture)                      ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  1. Run: ./dist/index.js --monitor "query"                  ║
║  2. Captures: Request + Response (SSE events)               ║
║  3. Logs: Complete Anthropic API traffic                    ║
║                                                              ║
║  Output: logs/capture_*.log                                 ║
╚══════════════════════════════════════════════════════════════╝
                           ↓
╔══════════════════════════════════════════════════════════════╗
║                FIXTURE GENERATION (Extract)                   ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  1. Parse: bun tests/capture-fixture.ts logs/file.log       ║
║  2. Normalize: Dynamic values (IDs, timestamps)             ║
║  3. Analyze: Build assertions (blocks, sequence, usage)     ║
║                                                              ║
║  Output: tests/fixtures/*.json                              ║
╚══════════════════════════════════════════════════════════════╝
                           ↓
╔══════════════════════════════════════════════════════════════╗
║              SNAPSHOT TESTING (Validate)                      ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  1. Replay: Request through proxy                           ║
║  2. Capture: Actual SSE response                            ║
║  3. Validate: Against captured fixture                      ║
║  4. Report: Pass/Fail with detailed errors                  ║
║                                                              ║
║  Run: bun test tests/snapshot.test.ts                       ║
╚══════════════════════════════════════════════════════════════╝
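The "Normalize" step in the fixture-generation stage can be sketched as follows. This is a minimal illustration, not the code in tests/capture-fixture.ts: the function name `normalizeDynamicValues`, the placeholder strings, and the exact patterns are assumptions. It relies only on the fact that Anthropic message IDs start with `msg_` and tool-use IDs start with `toolu_`.

```typescript
// Hypothetical normalizer; the real tests/capture-fixture.ts may work differently.
const DYNAMIC_PATTERNS: Array<[RegExp, string]> = [
  [/\bmsg_[A-Za-z0-9]+/g, "msg_NORMALIZED"], // message IDs
  [/\btoolu_[A-Za-z0-9]+/g, "toolu_NORMALIZED"], // tool_use IDs
  // ISO-8601 UTC timestamps such as 2025-01-15T10:30:00.000Z
  [/\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z/g, "TIMESTAMP"],
];

// Replaces dynamic values anywhere in a JSON-serializable object so that
// two captures of the same scenario compare equal.
function normalizeDynamicValues<T>(value: T): T {
  const json = JSON.stringify(value);
  const normalized = DYNAMIC_PATTERNS.reduce(
    (text, [pattern, replacement]) => text.replace(pattern, replacement),
    json,
  );
  return JSON.parse(normalized) as T;
}
```

Working on the serialized JSON keeps the sketch short; a production version would more likely walk the object and normalize only known fields.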

Protocol Requirements (From Analysis)

Streaming Events (7 Types)

Claude Code ALWAYS uses streaming. Complete sequence:

message_start → Initialize message with usage
content_block_start → Begin text or tool block
content_block_delta → Stream content incrementally
ping → Keep-alive (every 15s)
content_block_stop → End content block
message_delta → Stop reason + final usage
message_stop → Stream complete
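The ordering above can be checked with a small state machine. The sketch below is illustrative only: `validateEventSequence` and the state names are hypothetical, and it assumes content blocks are streamed one at a time (no interleaving) and that `ping` may appear at any point.

```typescript
type SSEEventName =
  | "message_start"
  | "content_block_start"
  | "content_block_delta"
  | "ping"
  | "content_block_stop"
  | "message_delta"
  | "message_stop";

type State = "init" | "between_blocks" | "in_block" | "finishing" | "done";

// Allowed transitions; any event not listed for a state is an error.
const TRANSITIONS: Record<State, Partial<Record<SSEEventName, State>>> = {
  init: { message_start: "between_blocks" },
  between_blocks: { content_block_start: "in_block", message_delta: "finishing" },
  in_block: { content_block_delta: "in_block", content_block_stop: "between_blocks" },
  finishing: { message_stop: "done" },
  done: {},
};

// Returns a list of ordering errors; an empty list means the sequence is valid.
function validateEventSequence(events: SSEEventName[]): string[] {
  const errors: string[] = [];
  let state: State = "init";
  events.forEach((event, position) => {
    if (event === "ping") return; // keep-alives are allowed anywhere
    const next = TRANSITIONS[state][event];
    if (next === undefined) {
      errors.push(`unexpected ${event} at position ${position} (state: ${state})`);
      return;
    }
    state = next;
  });
  if (state !== "done") {
    errors.push(`stream ended in state ${state}, expected done`);
  }
  return errors;
}
```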

Content Block Management

Blocks must have sequential indices:

Expected:  [text @ 0] [tool @ 1] [tool @ 2]
Current:   [text @ 0] [tool @ 0] [tool @ 1]  ❌ WRONG
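A check for this rule might look like the sketch below. `BlockStart` and `findIndexErrors` are hypothetical names, not the validators in tests/snapshot.test.ts; the input is assumed to be the list of `content_block_start` events in the order they were emitted.

```typescript
interface BlockStart {
  type: "text" | "tool_use";
  index: number;
}

// Each block's index must equal its position in the stream: 0, 1, 2, ...
function findIndexErrors(blocks: BlockStart[]): string[] {
  const errors: string[] = [];
  blocks.forEach((block, expected) => {
    if (block.index !== expected) {
      errors.push(`${block.type} block has index ${block.index}, expected ${expected}`);
    }
  });
  return errors;
}
```

Applied to the "Current" row above, this reports both tool blocks, since each is one position behind where it should be.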

Fine-Grained Tool Streaming

Tool input must stream as partial JSON:

// Chunk 1: {"event": "content_block_delta", "data": {"delta": {"partial_json": "{\"file"}}}
// Chunk 2: {"event": "content_block_delta", "data": {"delta": {"partial_json": "_path\":\"test.ts\""}}}
// Chunk 3: {"event": "content_block_delta", "data": {"delta": {"partial_json": "}"}}}
// Result:  {"file_path":"test.ts"} ✅ Valid JSON
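On the receiving side, the chunks are concatenated and parsed only once the block is complete. A minimal sketch, with `assembleToolInput` as a hypothetical helper name:

```typescript
type AssembledInput =
  | { ok: true; input: unknown }
  | { ok: false; raw: string };

// Joins partial_json fragments and attempts a single parse at the end.
// Individual fragments are not valid JSON on their own, so parsing
// them one by one would always fail.
function assembleToolInput(chunks: string[]): AssembledInput {
  const raw = chunks.join("");
  try {
    return { ok: true, input: JSON.parse(raw) };
  } catch {
    return { ok: false, raw };
  }
}
```

Returning the raw string in the failure case makes it easy to log exactly what was received when a stream is cut short.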

Usage Metrics

Must include cache metrics:

{
  "usage": {
    "input_tokens": 150,
    "cache_creation_input_tokens": 5501,    // NEW
    "cache_read_input_tokens": 0,           // NEW
    "output_tokens": 50,
    "cache_creation": {                     // OPTIONAL
      "ephemeral_5m_input_tokens": 5501
    }
  }
}
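A snapshot validator could confirm that these fields are present with a check along these lines. The names `REQUIRED_USAGE_FIELDS` and `missingUsageFields` are hypothetical; `cache_creation` is left out of the required list because it is marked optional above.

```typescript
const REQUIRED_USAGE_FIELDS = [
  "input_tokens",
  "cache_creation_input_tokens",
  "cache_read_input_tokens",
  "output_tokens",
] as const;

// Returns the names of required usage fields that are absent or not numeric.
function missingUsageFields(usage: Record<string, unknown>): string[] {
  return REQUIRED_USAGE_FIELDS.filter((field) => typeof usage[field] !== "number");
}
```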

Required Headers

anthropic-version: 2023-06-01
anthropic-beta: oauth-2025-04-20,interleaved-thinking-2025-05-14,fine-grained-tool-streaming-2025-05-14

Critical Fixes Required

1. Content Block Index Management (CRITICAL)

File: src/proxy-server.ts:600-850

Current Problem:

// Line 750 - Text block delta
sendSSE("content_block_delta", {
  index: 0,  // ❌ Hardcoded!
  delta: { type: "text_delta", text: delta.content }
});

// Line 787 - Text block stop
sendSSE("content_block_stop", {
  index: 0,  // ❌ Hardcoded!
});

Fix Required:

// Initialize block tracking
let currentBlockIndex = 0;
let textBlockIndex = -1;
const toolBlocks = new Map<number, number>(); // toolIndex → blockIndex

// Start text block
textBlockIndex = currentBlockIndex++;
sendSSE("content_block_start", {
  index: textBlockIndex,
  content_block: { type: "text", text: "" }
});

// Text delta
sendSSE("content_block_delta", {
  index: textBlockIndex,  // ✅ Correct
  delta: { type: "text_delta", text: delta.content }
});

// Start tool block
const toolBlockIndex = currentBlockIndex++;
toolBlocks.set(toolIndex, toolBlockIndex);
sendSSE("content_block_start", {
  index: toolBlockIndex,  // ✅ Sequential
  content_block: { type: "tool_use", id: toolId, name: toolName, input: {} }
});

Impact: HIGH - Claude Code may reject responses with incorrect indices

Complexity: MEDIUM - Need to track state across stream

2. Tool Input JSON Validation (CRITICAL)

File: src/proxy-server.ts:829

Current Problem:

// Line 829 - Close tool block immediately
if (choice?.finish_reason === "tool_calls") {
  sendSSE("content_block_stop", {
    index: toolState.blockIndex  // No validation!
  });
}

Fix Required:

// Validate JSON before closing
if (choice?.finish_reason === "tool_calls") {
  for (const toolState of toolCalls.values()) {
    // Validate that the accumulated arguments form complete JSON
    try {
      JSON.parse(toolState.args);
      log(`[Proxy] Tool ${toolState.name} arguments valid JSON`);
      sendSSE("content_block_stop", {
        index: toolState.blockIndex
      });
    } catch (e) {
      log(`[Proxy] WARNING: Tool ${toolState.name} has incomplete JSON!`);
      log(`[Proxy] Args so far: ${toolState.args}`);
      // Don't close block yet - wait for more chunks.
      // Note: finish_reason usually arrives with the last chunk, so a
      // fallback that closes the block at end of stream may still be needed.
    }
  }
}

Impact: HIGH - Malformed tool calls will fail execution

Complexity: LOW - Simple JSON.parse check

3. Continuous Ping Events (MEDIUM)

File: src/proxy-server.ts:636

Current Problem:

// Line 636 - One ping at start
sendSSE("ping", {
  type: "ping",
});
// No more pings!

Fix Required:

// Send ping every 15 seconds
const pingInterval = setInterval(() => {
  if (!isClosed) {
    sendSSE("ping", { type: "ping" });
  }
}, 15000);

// Clear interval when done
try {
  // ... streaming logic ...
} finally {
  clearInterval(pingInterval);
  if (!isClosed) {
    controller.close();
    isClosed = true;
  }
}

Impact: MEDIUM - Long streams may timeout without pings

Complexity: LOW - Simple setInterval

4. Cache Metrics Emulation (MEDIUM)

File: src/proxy-server.ts:614

Current Problem:

// Line 614 - Cache fields present but never populated
usage: {
  input_tokens: 0,
  cache_creation_input_tokens: 0,  // Present but always 0
  cache_read_input_tokens: 0,      // Present but always 0
  output_tokens: 0
}

Fix Required:

// Estimate cache metrics from multi-turn conversations
// First turn: All tokens go to cache_creation
// Subsequent turns: Most tokens come from cache_read

let isFirstTurn = /* detect from conversation history */;
let estimatedCacheTokens = Math.floor(inputTokens * 0.8);

usage: {
  input_tokens: inputTokens,
  cache_creation_input_tokens: isFirstTurn ? estimatedCacheTokens : 0,
  cache_read_input_tokens: isFirstTurn ? 0 : estimatedCacheTokens,
  output_tokens: outputTokens,
  cache_creation: {
    ephemeral_5m_input_tokens: isFirstTurn ? estimatedCacheTokens : 0
  }
}

Impact: MEDIUM - Inaccurate cost tracking in Claude Code UI

Complexity: MEDIUM - Need conversation state tracking
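The estimate can be packaged as a pure function, which also makes it easy to unit test. This is a sketch under stated assumptions, not the proxy's implementation: `isFirstTurn` and `estimateCacheUsage` are hypothetical names, "first turn" is taken to mean "no assistant message in the history yet", and the 0.8 ratio is the same guess used above. Unlike the snippet above, it subtracts the cached estimate from `input_tokens`, on the understanding that Anthropic reports cached tokens separately from `input_tokens`; drop the subtraction if that is not the behavior you want to emulate.

```typescript
interface ChatMessage {
  role: "user" | "assistant" | "system";
}

interface EstimatedUsage {
  input_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
  output_tokens: number;
  cache_creation: { ephemeral_5m_input_tokens: number };
}

// Assumption: a history with no assistant message yet is the first turn.
function isFirstTurn(messages: ChatMessage[]): boolean {
  return !messages.some((message) => message.role === "assistant");
}

// Splits the prompt into an estimated cached portion and an uncached remainder.
function estimateCacheUsage(
  messages: ChatMessage[],
  inputTokens: number,
  outputTokens: number,
  cacheRatio = 0.8, // guessed share of the prompt that is cacheable
): EstimatedUsage {
  const firstTurn = isFirstTurn(messages);
  const cached = Math.floor(inputTokens * cacheRatio);
  return {
    input_tokens: inputTokens - cached, // uncached remainder only
    cache_creation_input_tokens: firstTurn ? cached : 0,
    cache_read_input_tokens: firstTurn ? 0 : cached,
    output_tokens: outputTokens,
    cache_creation: { ephemeral_5m_input_tokens: firstTurn ? cached : 0 },
  };
}
```

These numbers are estimates, so any cost shown in the Claude Code UI will be approximate.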

5. Stop Reason Validation (LOW)

File: src/proxy-server.ts:695

Current Check:

// Line 695 - Basic mapping exists
stop_reason: "end_turn",  // From mapStopReason()

Verify Mapping:

function mapStopReason(finishReason: string | undefined): string {
  switch (finishReason) {
    case "stop":       return "end_turn";     // ✅
    case "length":     return "max_tokens";   // ✅
    case "tool_calls": return "tool_use";     // ✅
    case "content_filter": return "stop_sequence"; // ⚠️ Not quite right
    default:           return "end_turn";     // ✅ Safe fallback
  }
}

Impact: LOW - Already mostly correct

Complexity: LOW - Verify edge cases

Testing Workflow

Phase 1: Capture Fixtures (2-3 hours)

Capture comprehensive test cases:

# Build
bun run build

# Capture scenarios
./tests/snapshot-workflow.sh --capture

Scenarios to Capture (20+ fixtures):

Simple text (2+2)
Long text (explain quantum physics)
Read file
Grep search
Glob pattern
Write file
Edit file
Bash command
Multi-tool (Read + Edit)
Tool with error
Multi-turn conversation
All 16 official tools
Thinking mode (if supported)
Max tokens reached
Content filter

Phase 2: Run Baseline Tests (30 mins)

Run tests to identify failures:

bun test tests/snapshot.test.ts --verbose > test-results.txt 2>&1

Expected Failures (before fixes):

❌ Content block indices
❌ Tool JSON validation
⚠️ Ping events (may pass if short)
⚠️ Cache metrics (present but zero)

Phase 3: Fix Proxy (1-2 days)

Implement fixes in order:

Day 1 Morning: Fix content block indices
Day 1 Afternoon: Add tool JSON validation
Day 2 Morning: Add continuous ping events
Day 2 Afternoon: Add cache metrics estimation

Phase 4: Validate (1-2 hours)

Re-run tests after each fix:

# After each fix
bun test tests/snapshot.test.ts

# Expected progression:
# After fix #1: 70-80% pass
# After fix #2: 85-90% pass
# After fix #3: 90-95% pass
# After fix #4: 95-100% pass

Phase 5: Integration Testing (2-3 hours)

Test with real Claude Code:

# Start proxy
./dist/index.js --model "anthropic/claude-sonnet-4.5"

# In another terminal, use real Claude Code
# Point it to localhost:8337
# Perform various tasks

# Validate:
# - No errors in Claude Code UI
# - Tools execute correctly
# - Multi-turn conversations work
# - Cost tracking accurate

Success Criteria

For 1:1 compatibility:

✅ 100% test coverage for critical paths
✅ All snapshot tests pass
✅ Event sequences match protocol spec
✅ Block indices sequential (0, 1, 2, ...)
✅ Tool JSON validates before block close
✅ Ping events sent every 15 seconds
✅ Cache metrics present (even if estimated)
✅ Stop reason valid in all cases
✅ No Claude Code errors in real usage
✅ Multi-turn works perfectly

Risk Mitigation

If OpenRouter Models Don't Support Feature X

Problem: Model doesn't provide thinking mode, cache metrics, etc.

Solution: Implement graceful degradation

// Example: Thinking mode emulation
if (modelSupportsThinking(model)) {
  // Use real thinking blocks
} else {
  // Convert to text blocks with prefix
  sendSSE("content_block_delta", {
    index: textBlockIndex,
    delta: {
      type: "text_delta",
      text: "[Thinking: " + thinkingContent + "]\n\n"
    }
  });
}

If Tests Fail on Specific Models

Problem: Model behaves differently than Claude

Solution: Model-specific adapters

// tests/model-adapters.ts
export const modelAdapters = {
  "openai/gpt-4": {
    // GPT-4 specific quirks
    requiresSpecialToolFormat: true,
    maxToolsPerCall: 5
  },
  "anthropic/claude-sonnet-4.5": {
    // Should be 100% compatible
    requiresSpecialToolFormat: false
  }
};

If Proxy Performance Issues

Problem: Snapshot tests timeout

Solution: Optimize streaming

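// Batch small deltas
let deltaBuffer = "";
let bufferTimeout: ReturnType<typeof setTimeout> | undefined;

function sendDelta(text: string) {
  deltaBuffer += text;

  // Schedule one flush per batch. Resetting the timer on every delta
  // would postpone the flush for as long as deltas keep arriving.
  if (bufferTimeout !== undefined) return;
  bufferTimeout = setTimeout(() => {
    bufferTimeout = undefined;
    if (deltaBuffer) {
      sendSSE("content_block_delta", { /* ... */ });
      deltaBuffer = "";
    }
  }, 50); // Flush buffered deltas at most every 50ms
}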

Timeline

| Phase | Duration | Status |
| --- | --- | --- |
| Testing Framework | 1 day | ✅ Complete |
| Fixture Capture | 2-3 hours | ⏳ Pending |
| Proxy Fixes | 1-2 days | ⏳ Pending |
| Validation | 2-3 hours | ⏳ Pending |
| Total | 2-3 days | In Progress |

Next Steps

Immediate (Today):
- Run ./tests/snapshot-workflow.sh --capture to build fixture library
- Run bun test tests/snapshot.test.ts to see current failures
- Start with Fix #1 (content block indices)
Tomorrow:
- Complete Fixes #1-2 (critical)
- Re-run tests, validate improvements
- Implement Fixes #3-4 (medium priority)
Day 3:
- Run full test suite
- Fix any remaining issues
- Integration test with real Claude Code
- Document model-specific limitations

Files Created

| File | Purpose |
| --- | --- |
| `tests/capture-fixture.ts` | Extract fixtures from monitor logs |
| `tests/snapshot.test.ts` | Snapshot test runner with validators |
| `tests/fixtures/README.md` | Fixture format documentation |
| `tests/fixtures/example_simple_text.json` | Example text fixture |
| `tests/fixtures/example_tool_use.json` | Example tool use fixture |
| `tests/snapshot-workflow.sh` | End-to-end workflow automation |
| `SNAPSHOT_TESTING.md` | Testing system documentation |
| `PROTOCOL_COMPLIANCE_PLAN.md` | This file |

References

Protocol Specification - Complete protocol docs
Snapshot Testing Guide - Testing system docs
Monitor Mode Guide - Monitor mode usage
Streaming Protocol - SSE event details

Status: Framework complete, ready for fixture capture and proxy fixes
Next Action: Run ./tests/snapshot-workflow.sh --capture
Owner: Jack Rudenko @ MadAppGang
Last Updated: 2025-01-15
