# Protocol Compliance Plan: Achieving 1:1 Claude Code Compatibility

**Goal**: Ensure the Claudish proxy provides a user experience identical to official Claude Code, regardless of which model is used.

**Status**: Testing framework complete ✅ | Proxy fixes pending ⏳

---

## Executive Summary

We have built a comprehensive snapshot testing system that captures real Claude Code protocol interactions and validates proxy responses against them. The current proxy implementation is **60-70% compliant**, with critical gaps in the streaming protocol, tool handling, and cache metrics.

### What's Complete ✅

1. **Monitor Mode** - Pass-through proxy with complete logging
2. **Fixture Capture** - Tool to extract test cases from monitor logs
3. **Snapshot Tests** - Automated validation of protocol compliance
4. **Protocol Validators** - Event sequence, block indices, tool streaming, usage, stop reasons
5. **Example Fixtures** - Documented examples for text and tool use
6. **Workflow Scripts** - End-to-end capture → test automation

### What's Pending ⏳

1. **Fix content block index management** (CRITICAL)
2. **Add tool input JSON validation** (CRITICAL)
3. **Implement continuous ping events** (MEDIUM)
4. **Add cache metrics emulation** (MEDIUM)
5. **Capture comprehensive fixture library** (20+ scenarios)
6. **Run full test suite and fix remaining issues**

---

## Testing System Architecture

```
╔══════════════════════════════════════════════════════════════╗
║                    MONITOR MODE (Capture)                    ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  1. Run: ./dist/index.js --monitor "query"                   ║
║  2. Captures: Request + Response (SSE events)                ║
║  3. Logs: Complete Anthropic API traffic                     ║
║                                                              ║
║  Output: logs/capture_*.log                                  ║
╚══════════════════════════════════════════════════════════════╝
                               ↓
╔══════════════════════════════════════════════════════════════╗
║                 FIXTURE GENERATION (Extract)                 ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  1. Parse: bun tests/capture-fixture.ts logs/file.log        ║
║  2. Normalize: Dynamic values (IDs, timestamps)              ║
║  3. Analyze: Build assertions (blocks, sequence, usage)      ║
║                                                              ║
║  Output: tests/fixtures/*.json                               ║
╚══════════════════════════════════════════════════════════════╝
                               ↓
╔══════════════════════════════════════════════════════════════╗
║                 SNAPSHOT TESTING (Validate)                  ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  1. Replay: Request through proxy                            ║
║  2. Capture: Actual SSE response                             ║
║  3. Validate: Against captured fixture                       ║
║  4. Report: Pass/Fail with detailed errors                   ║
║                                                              ║
║  Run: bun test tests/snapshot.test.ts                        ║
╚══════════════════════════════════════════════════════════════╝
```

---

## Protocol Requirements (From Analysis)

### Streaming Events (7 Types)

Claude Code **ALWAYS** uses streaming. The complete set of events:

1. **message_start** → Initialize message with usage
2. **content_block_start** → Begin text or tool block
3. **content_block_delta** → Stream content incrementally
4. **ping** → Keep-alive (every 15s)
5. **content_block_stop** → End content block
6. **message_delta** → Stop reason + final usage
7. **message_stop** → Stream complete

### Content Block Management

Blocks must have **sequential indices** shared across text and tool blocks:

```
Expected: [text @ 0] [tool @ 1] [tool @ 2]
Current:  [text @ 0] [tool @ 0] [tool @ 1]  ❌ WRONG
```

### Fine-Grained Tool Streaming

Tool input must stream as partial JSON. The individual chunks are not valid JSON on their own; only their concatenation is:

```json
// Chunk 1:
{"event": "content_block_delta", "data": {"delta": {"partial_json": "{\"file"}}}

// Chunk 2:
{"event": "content_block_delta", "data": {"delta": {"partial_json": "_path\":\"test.ts\""}}}

// Chunk 3:
{"event": "content_block_delta", "data": {"delta": {"partial_json": "}"}}}

// Result: {"file_path":"test.ts"} ✅ Valid JSON
```

### Usage Metrics

Usage must include cache metrics:

```json
{
  "usage": {
    "input_tokens": 150,
    "cache_creation_input_tokens": 5501,  // NEW
    "cache_read_input_tokens": 0,         // NEW
    "output_tokens": 50,
    "cache_creation": {                   // OPTIONAL
      "ephemeral_5m_input_tokens": 5501
    }
  }
}
```

### Required Headers

```
anthropic-version: 2023-06-01
anthropic-beta: oauth-2025-04-20,interleaved-thinking-2025-05-14,fine-grained-tool-streaming-2025-05-14
```

---

## Critical Fixes Required

### 1. Content Block Index Management (CRITICAL)

**File**: `src/proxy-server.ts:600-850`

**Current Problem**:

```typescript
// Line 750 - Text block delta
sendSSE("content_block_delta", {
  index: 0, // ❌ Hardcoded!
  delta: { type: "text_delta", text: delta.content }
});

// Line 787 - Text block stop
sendSSE("content_block_stop", {
  index: 0, // ❌ Hardcoded!
});
```

**Fix Required**:

```typescript
// Initialize block tracking
let currentBlockIndex = 0;
let textBlockIndex = -1;
const toolBlocks = new Map<number, number>(); // toolIndex → blockIndex

// Start text block
textBlockIndex = currentBlockIndex++;
sendSSE("content_block_start", {
  index: textBlockIndex,
  content_block: { type: "text", text: "" }
});

// Text delta
sendSSE("content_block_delta", {
  index: textBlockIndex, // ✅ Correct
  delta: { type: "text_delta", text: delta.content }
});

// Start tool block
const toolBlockIndex = currentBlockIndex++;
toolBlocks.set(toolIndex, toolBlockIndex);
sendSSE("content_block_start", {
  index: toolBlockIndex, // ✅ Sequential
  content_block: { type: "tool_use", id: toolId, name: toolName }
});
```

**Impact**: HIGH - Claude Code may reject responses with incorrect indices

**Complexity**: MEDIUM - Need to track state across the stream

---

### 2. Tool Input JSON Validation (CRITICAL)

**File**: `src/proxy-server.ts:829`

**Current Problem**:

```typescript
// Line 829 - Close tool block immediately
if (choice?.finish_reason === "tool_calls") {
  sendSSE("content_block_stop", {
    index: toolState.blockIndex // No validation!
  });
}
```

**Fix Required**:

```typescript
// Validate JSON before closing
if (choice?.finish_reason === "tool_calls") {
  for (const [toolIndex, toolState] of toolCalls.entries()) {
    // Validate JSON is complete
    try {
      JSON.parse(toolState.args);
      log(`[Proxy] Tool ${toolState.name} arguments valid JSON`);
      sendSSE("content_block_stop", {
        index: toolState.blockIndex
      });
    } catch (e) {
      log(`[Proxy] WARNING: Tool ${toolState.name} has incomplete JSON!`);
      log(`[Proxy] Args so far: ${toolState.args}`);
      // Don't close block yet - wait for more chunks
      // (finish_reason is usually the last chunk, so a fallback close is still needed)
    }
  }
}
```

**Impact**: HIGH - Malformed tool calls will fail execution

**Complexity**: LOW - Simple JSON.parse check

---

### 3. Continuous Ping Events (MEDIUM)

**File**: `src/proxy-server.ts:636`

**Current Problem**:

```typescript
// Line 636 - One ping at start
sendSSE("ping", {
  type: "ping",
});
// No more pings!
```

**Fix Required**:

```typescript
// Send ping every 15 seconds
const pingInterval = setInterval(() => {
  if (!isClosed) {
    sendSSE("ping", { type: "ping" });
  }
}, 15000);

// Clear interval when done
try {
  // ... streaming logic ...
} finally {
  clearInterval(pingInterval);
  if (!isClosed) {
    controller.close();
    isClosed = true;
  }
}
```

**Impact**: MEDIUM - Long streams may time out without pings

**Complexity**: LOW - Simple setInterval

---

### 4. Cache Metrics Emulation (MEDIUM)

**File**: `src/proxy-server.ts:614`

**Current Problem**:

```typescript
// Line 614 - Cache fields present but hardcoded to zero
usage: {
  input_tokens: 0,
  cache_creation_input_tokens: 0, // Present but always 0
  cache_read_input_tokens: 0,     // Present but always 0
  output_tokens: 0
}
```

**Fix Required**:

```typescript
// Estimate cache metrics from multi-turn conversations
// First turn: All tokens go to cache_creation
// Subsequent turns: Most tokens come from cache_read
const isFirstTurn = /* detect from conversation history */;
const estimatedCacheTokens = Math.floor(inputTokens * 0.8);

const usage = {
  input_tokens: inputTokens,
  cache_creation_input_tokens: isFirstTurn ? estimatedCacheTokens : 0,
  cache_read_input_tokens: isFirstTurn ? 0 : estimatedCacheTokens,
  output_tokens: outputTokens,
  cache_creation: {
    ephemeral_5m_input_tokens: isFirstTurn ? estimatedCacheTokens : 0
  }
};
```

**Impact**: MEDIUM - Inaccurate cost tracking in Claude Code UI

**Complexity**: MEDIUM - Need conversation state tracking

---

### 5. Stop Reason Validation (LOW)

**File**: `src/proxy-server.ts:695`

**Current Check**:

```typescript
// Line 695 - Basic mapping exists
stop_reason: "end_turn", // From mapStopReason()
```

**Verify Mapping**:

```typescript
function mapStopReason(finishReason: string | undefined): string {
  switch (finishReason) {
    case "stop": return "end_turn";                 // ✅
    case "length": return "max_tokens";             // ✅
    case "tool_calls": return "tool_use";           // ✅
    case "content_filter": return "stop_sequence";  // ⚠️ Not quite right
    default: return "end_turn";                     // ✅ Safe fallback
  }
}
```

**Impact**: LOW - Already mostly correct

**Complexity**: LOW - Verify edge cases

---

## Testing Workflow

### Phase 1: Capture Fixtures (2-3 hours)

Capture comprehensive test cases:

```bash
# Build
bun run build

# Capture scenarios
./tests/snapshot-workflow.sh --capture
```

**Scenarios to Capture** (20+ fixtures):

- [x] Simple text (2+2)
- [ ] Long text (explain quantum physics)
- [ ] Read file
- [ ] Grep search
- [ ] Glob pattern
- [ ] Write file
- [ ] Edit file
- [ ] Bash command
- [ ] Multi-tool (Read + Edit)
- [ ] Tool with error
- [ ] Multi-turn conversation
- [ ] All 16 official tools
- [ ] Thinking mode (if supported)
- [ ] Max tokens reached
- [ ] Content filter

### Phase 2: Run Baseline Tests (30 mins)

Run tests to identify failures:

```bash
bun test tests/snapshot.test.ts --verbose > test-results.txt 2>&1
```

**Expected Failures** (before fixes):

- ❌ Content block indices
- ❌ Tool JSON validation
- ⚠️ Ping events (may pass if the stream is short)
- ⚠️ Cache metrics (present but zero)

### Phase 3: Fix Proxy (1-2 days)

Implement fixes in order:

1. **Day 1 Morning**: Fix content block indices
2. **Day 1 Afternoon**: Add tool JSON validation
3. **Day 2 Morning**: Add continuous ping events
4. **Day 2 Afternoon**: Add cache metrics estimation

### Phase 4: Validate (1-2 hours)

Re-run tests after each fix:

```bash
# After each fix
bun test tests/snapshot.test.ts

# Expected progression:
# After fix #1: 70-80% pass
# After fix #2: 85-90% pass
# After fix #3: 90-95% pass
# After fix #4: 95-100% pass
```

### Phase 5: Integration Testing (2-3 hours)

Test with real Claude Code:

```bash
# Start proxy
./dist/index.js --model "anthropic/claude-sonnet-4.5"

# In another terminal, use real Claude Code
# Point it to localhost:8337
# Perform various tasks

# Validate:
# - No errors in Claude Code UI
# - Tools execute correctly
# - Multi-turn conversations work
# - Cost tracking accurate
```

---

## Success Criteria

For 1:1 compatibility:

- ✅ **100% test coverage** for critical paths
- ✅ **All snapshot tests pass**
- ✅ **Event sequences match** protocol spec
- ✅ **Block indices sequential** (0, 1, 2, ...)
- ✅ **Tool JSON validates** before block close
- ✅ **Ping events sent** every 15 seconds
- ✅ **Cache metrics present** (even if estimated)
- ✅ **Stop reason valid** in all cases
- ✅ **No Claude Code errors** in real usage
- ✅ **Multi-turn works** perfectly

---

## Risk Mitigation

### If OpenRouter Models Don't Support Feature X

**Problem**: Model doesn't provide thinking mode, cache metrics, etc.

**Solution**: Implement graceful degradation

```typescript
// Example: Thinking mode emulation
if (modelSupportsThinking(model)) {
  // Use real thinking blocks
} else {
  // Convert to text blocks with prefix
  sendSSE("content_block_delta", {
    index: textBlockIndex,
    delta: {
      type: "text_delta",
      text: "[Thinking: " + thinkingContent + "]\n\n"
    }
  });
}
```

### If Tests Fail on Specific Models

**Problem**: Model behaves differently than Claude

**Solution**: Model-specific adapters

```typescript
// tests/model-adapters.ts
export const modelAdapters = {
  "openai/gpt-4": {
    // GPT-4 specific quirks
    requiresSpecialToolFormat: true,
    maxToolsPerCall: 5
  },
  "anthropic/claude-sonnet-4.5": {
    // Should be 100% compatible
    requiresSpecialToolFormat: false
  }
};
```

### If Proxy Performance Issues

**Problem**: Snapshot tests time out

**Solution**: Optimize streaming

```typescript
// Batch small deltas
let deltaBuffer = "";
let bufferTimeout: ReturnType<typeof setTimeout> | undefined;

function sendDelta(text: string) {
  deltaBuffer += text;
  // Start a flush timer only if none is pending; resetting the timer on
  // every delta would postpone the flush indefinitely during a steady stream.
  if (bufferTimeout !== undefined) return;
  bufferTimeout = setTimeout(() => {
    bufferTimeout = undefined;
    if (deltaBuffer) {
      sendSSE("content_block_delta", { /* ... */ });
      deltaBuffer = "";
    }
  }, 50); // Flush batched deltas at most every 50ms
}
```

---

## Timeline

| Phase | Duration | Status |
|-------|----------|--------|
| Testing Framework | 1 day | ✅ Complete |
| Fixture Capture | 2-3 hours | ⏳ Pending |
| Proxy Fixes | 1-2 days | ⏳ Pending |
| Validation | 2-3 hours | ⏳ Pending |
| **Total** | **2-3 days** | **In Progress** |

---

## Next Steps

1. **Immediate** (Today):
   - Run `./tests/snapshot-workflow.sh --capture` to build the fixture library
   - Run `bun test tests/snapshot.test.ts` to see current failures
   - Start with Fix #1 (content block indices)

2. **Tomorrow**:
   - Complete Fixes #1-2 (critical)
   - Re-run tests, validate improvements
   - Implement Fixes #3-4 (medium priority)

3. **Day 3**:
   - Run full test suite
   - Fix any remaining issues
   - Integration test with real Claude Code
   - Document model-specific limitations

---

## Files Created

| File | Purpose |
|------|---------|
| `tests/capture-fixture.ts` | Extract fixtures from monitor logs |
| `tests/snapshot.test.ts` | Snapshot test runner with validators |
| `tests/fixtures/README.md` | Fixture format documentation |
| `tests/fixtures/example_simple_text.json` | Example text fixture |
| `tests/fixtures/example_tool_use.json` | Example tool use fixture |
| `tests/snapshot-workflow.sh` | End-to-end workflow automation |
| `SNAPSHOT_TESTING.md` | Testing system documentation |
| `PROTOCOL_COMPLIANCE_PLAN.md` | This file |

---

## References

- [Protocol Specification](./PROTOCOL_SPECIFICATION.md) - Complete protocol docs
- [Snapshot Testing Guide](./SNAPSHOT_TESTING.md) - Testing system docs
- [Monitor Mode Guide](./MONITOR_MODE_COMPLETE.md) - Monitor mode usage
- [Streaming Protocol](./STREAMING_PROTOCOL_EXPLAINED.md) - SSE event details

---

**Status**: Framework complete, ready for fixture capture and proxy fixes

**Next Action**: Run `./tests/snapshot-workflow.sh --capture`

**Owner**: Jack Rudenko @ MadAppGang

**Last Updated**: 2025-01-15
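
As a companion to Fix #1 (content block index management), the index bookkeeping could be isolated in a small helper so that it can be unit-tested apart from the streaming code. The sketch below is only an illustration: `BlockIndexTracker`, `BlockSlot`, and their methods are hypothetical names that do not exist in `src/proxy-server.ts`, and it assumes that upstream tool calls are identified by a numeric index and that a response contains at most one text block.

```typescript
// Hypothetical helper, not part of the existing proxy code.
// Allocates one shared, sequential index per content block.
interface BlockSlot {
  index: number;  // index to put on content_block_* events
  isNew: boolean; // true when content_block_start still has to be sent
}

class BlockIndexTracker {
  private nextIndex = 0;
  private textIndex: number | undefined;
  // Upstream tool call index -> content block index
  private readonly toolIndices = new Map<number, number>();

  /** Index of the text block, allocated on first use. */
  textBlock(): BlockSlot {
    if (this.textIndex !== undefined) {
      return { index: this.textIndex, isNew: false };
    }
    this.textIndex = this.nextIndex++;
    return { index: this.textIndex, isNew: true };
  }

  /** Index of the block for an upstream tool call, allocated on first use. */
  toolBlock(toolIndex: number): BlockSlot {
    const existing = this.toolIndices.get(toolIndex);
    if (existing !== undefined) {
      return { index: existing, isNew: false };
    }
    const index = this.nextIndex++;
    this.toolIndices.set(toolIndex, index);
    return { index, isNew: true };
  }

  /** Every allocated block index in order, e.g. for sending content_block_stop. */
  allIndices(): number[] {
    return Array.from({ length: this.nextIndex }, (_, i) => i);
  }
}

// Usage: text first, then two tool calls, then a later delta for the first tool.
const tracker = new BlockIndexTracker();
console.log(tracker.textBlock());  // { index: 0, isNew: true }
console.log(tracker.toolBlock(0)); // { index: 1, isNew: true }
console.log(tracker.toolBlock(1)); // { index: 2, isNew: true }
console.log(tracker.toolBlock(0)); // { index: 1, isNew: false }
```

The `isNew` flag is one way to decide when `content_block_start` has to be emitted before a delta. A model that produces text again after a tool call would need a second text block, which this sketch does not handle.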