claudish/ai_docs/PROTOCOL_COMPLIANCE_PLAN.md

# Protocol Compliance Plan: Achieving 1:1 Claude Code Compatibility

**Goal**: Ensure Claudish proxy provides identical user experience to official Claude Code, regardless of which model is used.

**Status**: Testing framework complete ✅ | Proxy fixes pending ⏳

---

## Executive Summary

We have built a comprehensive snapshot testing system that captures real Claude Code protocol interactions and validates proxy responses. The current proxy implementation is **60-70% compliant** with critical gaps in streaming protocol, tool handling, and cache metrics.

### What's Complete ✅

1. **Monitor Mode** - Pass-through proxy with complete logging
2. **Fixture Capture** - Tool to extract test cases from monitor logs
3. **Snapshot Tests** - Automated validation of protocol compliance
4. **Protocol Validators** - Event sequence, block indices, tool streaming, usage, stop reasons
5. **Example Fixtures** - Documented examples for text and tool use
6. **Workflow Scripts** - End-to-end capture → test automation

### What's Pending ⏳

1. **Fix content block index management** (CRITICAL)
2. **Add tool input JSON validation** (CRITICAL)
3. **Implement continuous ping events** (MEDIUM)
4. **Add cache metrics emulation** (MEDIUM)
5. **Capture comprehensive fixture library** (20+ scenarios)
6. **Run full test suite and fix remaining issues**

---

## Testing System Architecture

```
╔══════════════════════════════════════════════════════════════╗
║                   MONITOR MODE (Capture)                      ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  1. Run: ./dist/index.js --monitor "query"                  ║
║  2. Captures: Request + Response (SSE events)               ║
║  3. Logs: Complete Anthropic API traffic                    ║
║                                                              ║
║  Output: logs/capture_*.log                                 ║
╚══════════════════════════════════════════════════════════════╝
                           ↓
╔══════════════════════════════════════════════════════════════╗
║                FIXTURE GENERATION (Extract)                   ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  1. Parse: bun tests/capture-fixture.ts logs/file.log       ║
║  2. Normalize: Dynamic values (IDs, timestamps)             ║
║  3. Analyze: Build assertions (blocks, sequence, usage)     ║
║                                                              ║
║  Output: tests/fixtures/*.json                              ║
╚══════════════════════════════════════════════════════════════╝
                           ↓
╔══════════════════════════════════════════════════════════════╗
║              SNAPSHOT TESTING (Validate)                      ║
╠══════════════════════════════════════════════════════════════╣
║                                                              ║
║  1. Replay: Request through proxy                           ║
║  2. Capture: Actual SSE response                            ║
║  3. Validate: Against captured fixture                      ║
║  4. Report: Pass/Fail with detailed errors                  ║
║                                                              ║
║  Run: bun test tests/snapshot.test.ts                       ║
╚══════════════════════════════════════════════════════════════╝
```

---

## Protocol Requirements (From Analysis)

### Streaming Events (7 Types)

Claude Code **ALWAYS** uses streaming. Complete sequence:

1. **message_start** → Initialize message with usage
2. **content_block_start** → Begin text or tool block
3. **content_block_delta** → Stream content incrementally
4. **ping** → Keep-alive (every 15s)
5. **content_block_stop** → End content block
6. **message_delta** → Stop reason + final usage
7. **message_stop** → Stream complete

### Content Block Management

Blocks must have **sequential indices**:

```
Expected:  [text @ 0] [tool @ 1] [tool @ 2]
Current:   [text @ 0] [tool @ 0] [tool @ 1]  ❌ WRONG
```

### Fine-Grained Tool Streaming

Tool input must stream as partial JSON:

```json
// Chunk 1: {"event": "content_block_delta", "data": {"delta": {"partial_json": "{\"file"}}}
// Chunk 2: {"event": "content_block_delta", "data": {"delta": {"partial_json": "_path\":\"test.ts\""}}}
// Chunk 3: {"event": "content_block_delta", "data": {"delta": {"partial_json": "}"}}}
// Result:  {"file_path":"test.ts"} ✅ Valid JSON
```

### Usage Metrics

Must include cache metrics:

```json
{
  "usage": {
    "input_tokens": 150,
    "cache_creation_input_tokens": 5501,    // NEW
    "cache_read_input_tokens": 0,           // NEW
    "output_tokens": 50,
    "cache_creation": {                     // OPTIONAL
      "ephemeral_5m_input_tokens": 5501
    }
  }
}
```

### Required Headers

```
anthropic-version: 2023-06-01
anthropic-beta: oauth-2025-04-20,interleaved-thinking-2025-05-14,fine-grained-tool-streaming-2025-05-14
```

---

## Critical Fixes Required

### 1. Content Block Index Management (CRITICAL)

**File**: `src/proxy-server.ts:600-850`

**Current Problem**:

```typescript
// Line 750 - Text block delta
sendSSE("content_block_delta", {
  index: 0,  // ❌ Hardcoded!
  delta: { type: "text_delta", text: delta.content }
});

// Line 787 - Text block stop
sendSSE("content_block_stop", {
  index: 0,  // ❌ Hardcoded!
});
```

**Fix Required**:

```typescript
// Initialize block tracking
let currentBlockIndex = 0;
let textBlockIndex = -1;
const toolBlocks = new Map<number, number>(); // toolIndex → blockIndex

// Start text block
textBlockIndex = currentBlockIndex++;
sendSSE("content_block_start", {
  index: textBlockIndex,
  content_block: { type: "text", text: "" }
});

// Text delta
sendSSE("content_block_delta", {
  index: textBlockIndex,  // ✅ Correct
  delta: { type: "text_delta", text: delta.content }
});

// Start tool block
const toolBlockIndex = currentBlockIndex++;
toolBlocks.set(toolIndex, toolBlockIndex);
sendSSE("content_block_start", {
  index: toolBlockIndex,  // ✅ Sequential
  content_block: { type: "tool_use", id: toolId, name: toolName }
});
```

**Impact**: HIGH - Claude Code may reject responses with incorrect indices

**Complexity**: MEDIUM - Need to track state across stream

---

### 2. Tool Input JSON Validation (CRITICAL)

**File**: `src/proxy-server.ts:829`

**Current Problem**:

```typescript
// Line 829 - Close tool block immediately
if (choice?.finish_reason === "tool_calls") {
  sendSSE("content_block_stop", {
    index: toolState.blockIndex  // No validation!
  });
}
```

**Fix Required**:

```typescript
// Validate JSON before closing
if (choice?.finish_reason === "tool_calls") {
  for (const [toolIndex, toolState] of toolCalls.entries()) {
    // Validate JSON is complete
    try {
      JSON.parse(toolState.args);
      log(`[Proxy] Tool ${toolState.name} arguments valid JSON`);
      sendSSE("content_block_stop", {
        index: toolState.blockIndex
      });
    } catch (e) {
      log(`[Proxy] WARNING: Tool ${toolState.name} has incomplete JSON!`);
      log(`[Proxy] Args so far: ${toolState.args}`);
      // Don't close block yet - wait for more chunks
    }
  }
}
```

**Impact**: HIGH - Malformed tool calls will fail execution

**Complexity**: LOW - Simple JSON.parse check

---

### 3. Continuous Ping Events (MEDIUM)

**File**: `src/proxy-server.ts:636`

**Current Problem**:

```typescript
// Line 636 - One ping at start
sendSSE("ping", {
  type: "ping",
});
// No more pings!
```

**Fix Required**:

```typescript
// Send ping every 15 seconds
const pingInterval = setInterval(() => {
  if (!isClosed) {
    sendSSE("ping", { type: "ping" });
  }
}, 15000);

// Clear interval when done
try {
  // ... streaming logic ...
} finally {
  clearInterval(pingInterval);
  if (!isClosed) {
    controller.close();
    isClosed = true;
  }
}
```

**Impact**: MEDIUM - Long streams may timeout without pings

**Complexity**: LOW - Simple setInterval

---

### 4. Cache Metrics Emulation (MEDIUM)

**File**: `src/proxy-server.ts:614`

**Current Problem**:

```typescript
// Line 614 - Missing cache fields
usage: {
  input_tokens: 0,
  cache_creation_input_tokens: 0,  // Present but always 0
  cache_read_input_tokens: 0,      // Present but always 0
  output_tokens: 0
}
```

**Fix Required**:

```typescript
// Estimate cache metrics from multi-turn conversations
// First turn: All tokens go to cache_creation
// Subsequent turns: Most tokens come from cache_read

let isFirstTurn = /* detect from conversation history */;
let estimatedCacheTokens = Math.floor(inputTokens * 0.8);

usage: {
  input_tokens: inputTokens,
  cache_creation_input_tokens: isFirstTurn ? estimatedCacheTokens : 0,
  cache_read_input_tokens: isFirstTurn ? 0 : estimatedCacheTokens,
  output_tokens: outputTokens,
  cache_creation: {
    ephemeral_5m_input_tokens: isFirstTurn ? estimatedCacheTokens : 0
  }
}
```

**Impact**: MEDIUM - Inaccurate cost tracking in Claude Code UI

**Complexity**: MEDIUM - Need conversation state tracking

---

### 5. Stop Reason Validation (LOW)

**File**: `src/proxy-server.ts:695`

**Current Check**:

```typescript
// Line 695 - Basic mapping exists
stop_reason: "end_turn",  // From mapStopReason()
```

**Verify Mapping**:

```typescript
function mapStopReason(finishReason: string | undefined): string {
  switch (finishReason) {
    case "stop":       return "end_turn";     // ✅
    case "length":     return "max_tokens";   // ✅
    case "tool_calls": return "tool_use";     // ✅
    case "content_filter": return "stop_sequence"; // ⚠️ Not quite right
    default:           return "end_turn";     // ✅ Safe fallback
  }
}
```

**Impact**: LOW - Already mostly correct

**Complexity**: LOW - Verify edge cases

---

## Testing Workflow

### Phase 1: Capture Fixtures (2-3 hours)

Capture comprehensive test cases:

```bash
# Build
bun run build

# Capture scenarios
./tests/snapshot-workflow.sh --capture
```

**Scenarios to Capture** (20+ fixtures):

- [x] Simple text (2+2)
- [ ] Long text (explain quantum physics)
- [ ] Read file
- [ ] Grep search
- [ ] Glob pattern
- [ ] Write file
- [ ] Edit file
- [ ] Bash command
- [ ] Multi-tool (Read + Edit)
- [ ] Tool with error
- [ ] Multi-turn conversation
- [ ] All 16 official tools
- [ ] Thinking mode (if supported)
- [ ] Max tokens reached
- [ ] Content filter

### Phase 2: Run Baseline Tests (30 mins)

Run tests to identify failures:

```bash
bun test tests/snapshot.test.ts --verbose > test-results.txt 2>&1
```

**Expected Failures** (before fixes):
- ❌ Content block indices
- ❌ Tool JSON validation
- ⚠️  Ping events (may pass if short)
- ⚠️  Cache metrics (present but zero)

### Phase 3: Fix Proxy (1-2 days)

Implement fixes in order:

1. **Day 1 Morning**: Fix content block indices
2. **Day 1 Afternoon**: Add tool JSON validation
3. **Day 2 Morning**: Add continuous ping events
4. **Day 2 Afternoon**: Add cache metrics estimation

### Phase 4: Validate (1-2 hours)

Re-run tests after each fix:

```bash
# After each fix
bun test tests/snapshot.test.ts

# Expected progression:
# After fix #1: 70-80% pass
# After fix #2: 85-90% pass
# After fix #3: 90-95% pass
# After fix #4: 95-100% pass
```

### Phase 5: Integration Testing (2-3 hours)

Test with real Claude Code:

```bash
# Start proxy
./dist/index.js --model "anthropic/claude-sonnet-4.5"

# In another terminal, use real Claude Code
# Point it to localhost:8337
# Perform various tasks

# Validate:
# - No errors in Claude Code UI
# - Tools execute correctly
# - Multi-turn conversations work
# - Cost tracking accurate
```

---

## Success Criteria

For 1:1 compatibility:

- ✅ **100% test coverage** for critical paths
- ✅ **All snapshot tests pass**
- ✅ **Event sequences match** protocol spec
- ✅ **Block indices sequential** (0, 1, 2, ...)
- ✅ **Tool JSON validates** before block close
- ✅ **Ping events sent** every 15 seconds
- ✅ **Cache metrics present** (even if estimated)
- ✅ **Stop reason valid** in all cases
- ✅ **No Claude Code errors** in real usage
- ✅ **Multi-turn works** perfectly

---

## Risk Mitigation

### If OpenRouter Models Don't Support Feature X

**Problem**: Model doesn't provide thinking mode, cache metrics, etc.

**Solution**: Implement graceful degradation

```typescript
// Example: Thinking mode emulation
if (modelSupportsThinking(model)) {
  // Use real thinking blocks
} else {
  // Convert to text blocks with prefix
  sendSSE("content_block_delta", {
    index: textBlockIndex,
    delta: {
      type: "text_delta",
      text: "[Thinking: " + thinkingContent + "]\n\n"
    }
  });
}
```

### If Tests Fail on Specific Models

**Problem**: Model behaves differently than Claude

**Solution**: Model-specific adapters

```typescript
// tests/model-adapters.ts
export const modelAdapters = {
  "openai/gpt-4": {
    // GPT-4 specific quirks
    requiresSpecialToolFormat: true,
    maxToolsPerCall: 5
  },
  "anthropic/claude-sonnet-4.5": {
    // Should be 100% compatible
    requiresSpecialToolFormat: false
  }
};
```

### If Proxy Performance Issues

**Problem**: Snapshot tests timeout

**Solution**: Optimize streaming

```typescript
// Batch small deltas
let deltaBuffer = "";
let bufferTimeout: Timer;

function sendDelta(text: string) {
  deltaBuffer += text;

  clearTimeout(bufferTimeout);
  bufferTimeout = setTimeout(() => {
    if (deltaBuffer) {
      sendSSE("content_block_delta", { /* ... */ });
      deltaBuffer = "";
    }
  }, 50); // Batch deltas every 50ms
}
```

---

## Timeline

| Phase | Duration | Status |
|-------|----------|--------|
| Testing Framework | 1 day | ✅ Complete |
| Fixture Capture | 2-3 hours | ⏳ Pending |
| Proxy Fixes | 1-2 days | ⏳ Pending |
| Validation | 2-3 hours | ⏳ Pending |
| **Total** | **2-3 days** | **In Progress** |

---

## Next Steps

1. **Immediate** (Today):
   - Run `./tests/snapshot-workflow.sh --capture` to build fixture library
   - Run `bun test tests/snapshot.test.ts` to see current failures
   - Start with Fix #1 (content block indices)

2. **Tomorrow**:
   - Complete Fixes #1-2 (critical)
   - Re-run tests, validate improvements
   - Implement Fixes #3-4 (medium priority)

3. **Day 3**:
   - Run full test suite
   - Fix any remaining issues
   - Integration test with real Claude Code
   - Document model-specific limitations

---

## Files Created

| File | Purpose |
|------|---------|
| `tests/capture-fixture.ts` | Extract fixtures from monitor logs |
| `tests/snapshot.test.ts` | Snapshot test runner with validators |
| `tests/fixtures/README.md` | Fixture format documentation |
| `tests/fixtures/example_simple_text.json` | Example text fixture |
| `tests/fixtures/example_tool_use.json` | Example tool use fixture |
| `tests/snapshot-workflow.sh` | End-to-end workflow automation |
| `SNAPSHOT_TESTING.md` | Testing system documentation |
| `PROTOCOL_COMPLIANCE_PLAN.md` | This file |

---

## References

- [Protocol Specification](./PROTOCOL_SPECIFICATION.md) - Complete protocol docs
- [Snapshot Testing Guide](./SNAPSHOT_TESTING.md) - Testing system docs
- [Monitor Mode Guide](./MONITOR_MODE_COMPLETE.md) - Monitor mode usage
- [Streaming Protocol](./STREAMING_PROTOCOL_EXPLAINED.md) - SSE event details

---

**Status**: Framework complete, ready for fixture capture and proxy fixes
**Next Action**: Run `./tests/snapshot-workflow.sh --capture`
**Owner**: Jack Rudenko @ MadAppGang
**Last Updated**: 2025-01-15