claudish/ai_docs/PROTOCOL_COMPLIANCE_PLAN.md

589 lines
16 KiB
Markdown

# Protocol Compliance Plan: Achieving 1:1 Claude Code Compatibility
**Goal**: Ensure Claudish proxy provides identical user experience to official Claude Code, regardless of which model is used.
**Status**: Testing framework complete ✅ | Proxy fixes pending ⏳
---
## Executive Summary
We have built a comprehensive snapshot testing system that captures real Claude Code protocol interactions and validates proxy responses. The current proxy implementation is **60-70% compliant** with critical gaps in streaming protocol, tool handling, and cache metrics.
### What's Complete ✅
1. **Monitor Mode** - Pass-through proxy with complete logging
2. **Fixture Capture** - Tool to extract test cases from monitor logs
3. **Snapshot Tests** - Automated validation of protocol compliance
4. **Protocol Validators** - Event sequence, block indices, tool streaming, usage, stop reasons
5. **Example Fixtures** - Documented examples for text and tool use
6. **Workflow Scripts** - End-to-end capture → test automation
### What's Pending ⏳
1. **Fix content block index management** (CRITICAL)
2. **Add tool input JSON validation** (CRITICAL)
3. **Implement continuous ping events** (MEDIUM)
4. **Add cache metrics emulation** (MEDIUM)
5. **Capture comprehensive fixture library** (20+ scenarios)
6. **Run full test suite and fix remaining issues**
---
## Testing System Architecture
```
╔══════════════════════════════════════════════════════════════╗
║ MONITOR MODE (Capture) ║
╠══════════════════════════════════════════════════════════════╣
║ ║
║ 1. Run: ./dist/index.js --monitor "query" ║
║ 2. Captures: Request + Response (SSE events) ║
║ 3. Logs: Complete Anthropic API traffic ║
║ ║
║ Output: logs/capture_*.log ║
╚══════════════════════════════════════════════════════════════╝
╔══════════════════════════════════════════════════════════════╗
║ FIXTURE GENERATION (Extract) ║
╠══════════════════════════════════════════════════════════════╣
║ ║
║ 1. Parse: bun tests/capture-fixture.ts logs/file.log ║
║ 2. Normalize: Dynamic values (IDs, timestamps) ║
║ 3. Analyze: Build assertions (blocks, sequence, usage) ║
║ ║
║ Output: tests/fixtures/*.json ║
╚══════════════════════════════════════════════════════════════╝
╔══════════════════════════════════════════════════════════════╗
║ SNAPSHOT TESTING (Validate) ║
╠══════════════════════════════════════════════════════════════╣
║ ║
║ 1. Replay: Request through proxy ║
║ 2. Capture: Actual SSE response ║
║ 3. Validate: Against captured fixture ║
║ 4. Report: Pass/Fail with detailed errors ║
║ ║
║ Run: bun test tests/snapshot.test.ts ║
╚══════════════════════════════════════════════════════════════╝
```
---
## Protocol Requirements (From Analysis)
### Streaming Events (7 Types)
Claude Code **ALWAYS** uses streaming. Complete sequence:
1. **message_start** → Initialize message with usage
2. **content_block_start** → Begin text or tool block
3. **content_block_delta** → Stream content incrementally
4. **ping** → Keep-alive (every 15s)
5. **content_block_stop** → End content block
6. **message_delta** → Stop reason + final usage
7. **message_stop** → Stream complete
### Content Block Management
Blocks must have **sequential indices**:
```
Expected: [text @ 0] [tool @ 1] [tool @ 2]
Current: [text @ 0] [tool @ 0] [tool @ 1] ❌ WRONG
```
### Fine-Grained Tool Streaming
Tool input must stream as partial JSON:
```json
// Chunk 1: {"event": "content_block_delta", "data": {"delta": {"partial_json": "{\"file"}}}
// Chunk 2: {"event": "content_block_delta", "data": {"delta": {"partial_json": "_path\":\"test.ts\""}}}
// Chunk 3: {"event": "content_block_delta", "data": {"delta": {"partial_json": "}"}}}
// Result: {"file_path":"test.ts"} ✅ Valid JSON
```
### Usage Metrics
Must include cache metrics:
```json
{
"usage": {
"input_tokens": 150,
"cache_creation_input_tokens": 5501, // NEW
"cache_read_input_tokens": 0, // NEW
"output_tokens": 50,
"cache_creation": { // OPTIONAL
"ephemeral_5m_input_tokens": 5501
}
}
}
```
### Required Headers
```
anthropic-version: 2023-06-01
anthropic-beta: oauth-2025-04-20,interleaved-thinking-2025-05-14,fine-grained-tool-streaming-2025-05-14
```
---
## Critical Fixes Required
### 1. Content Block Index Management (CRITICAL)
**File**: `src/proxy-server.ts:600-850`
**Current Problem**:
```typescript
// Line 750 - Text block delta
sendSSE("content_block_delta", {
index: 0, // ❌ Hardcoded!
delta: { type: "text_delta", text: delta.content }
});
// Line 787 - Text block stop
sendSSE("content_block_stop", {
index: 0, // ❌ Hardcoded!
});
```
**Fix Required**:
```typescript
// Initialize block tracking
let currentBlockIndex = 0;
let textBlockIndex = -1;
const toolBlocks = new Map<number, number>(); // toolIndex → blockIndex
// Start text block
textBlockIndex = currentBlockIndex++;
sendSSE("content_block_start", {
index: textBlockIndex,
content_block: { type: "text", text: "" }
});
// Text delta
sendSSE("content_block_delta", {
index: textBlockIndex, // ✅ Correct
delta: { type: "text_delta", text: delta.content }
});
// Start tool block
const toolBlockIndex = currentBlockIndex++;
toolBlocks.set(toolIndex, toolBlockIndex);
sendSSE("content_block_start", {
index: toolBlockIndex, // ✅ Sequential
content_block: { type: "tool_use", id: toolId, name: toolName }
});
```
**Impact**: HIGH - Claude Code may reject responses with incorrect indices
**Complexity**: MEDIUM - Need to track state across stream
---
### 2. Tool Input JSON Validation (CRITICAL)
**File**: `src/proxy-server.ts:829`
**Current Problem**:
```typescript
// Line 829 - Close tool block immediately
if (choice?.finish_reason === "tool_calls") {
sendSSE("content_block_stop", {
index: toolState.blockIndex // No validation!
});
}
```
**Fix Required**:
```typescript
// Validate JSON before closing
if (choice?.finish_reason === "tool_calls") {
for (const [toolIndex, toolState] of toolCalls.entries()) {
// Validate JSON is complete
try {
JSON.parse(toolState.args);
log(`[Proxy] Tool ${toolState.name} arguments valid JSON`);
sendSSE("content_block_stop", {
index: toolState.blockIndex
});
} catch (e) {
log(`[Proxy] WARNING: Tool ${toolState.name} has incomplete JSON!`);
log(`[Proxy] Args so far: ${toolState.args}`);
// Don't close block yet - wait for more chunks
}
}
}
```
**Impact**: HIGH - Malformed tool calls will fail execution
**Complexity**: LOW - Simple JSON.parse check
---
### 3. Continuous Ping Events (MEDIUM)
**File**: `src/proxy-server.ts:636`
**Current Problem**:
```typescript
// Line 636 - One ping at start
sendSSE("ping", {
type: "ping",
});
// No more pings!
```
**Fix Required**:
```typescript
// Send ping every 15 seconds
const pingInterval = setInterval(() => {
if (!isClosed) {
sendSSE("ping", { type: "ping" });
}
}, 15000);
// Clear interval when done
try {
// ... streaming logic ...
} finally {
clearInterval(pingInterval);
if (!isClosed) {
controller.close();
isClosed = true;
}
}
```
**Impact**: MEDIUM - Long streams may timeout without pings
**Complexity**: LOW - Simple setInterval
---
### 4. Cache Metrics Emulation (MEDIUM)
**File**: `src/proxy-server.ts:614`
**Current Problem**:
```typescript
// Line 614 - Missing cache fields
usage: {
input_tokens: 0,
cache_creation_input_tokens: 0, // Present but always 0
cache_read_input_tokens: 0, // Present but always 0
output_tokens: 0
}
```
**Fix Required**:
```typescript
// Estimate cache metrics from multi-turn conversations
// First turn: All tokens go to cache_creation
// Subsequent turns: Most tokens come from cache_read
let isFirstTurn = /* detect from conversation history */;
let estimatedCacheTokens = Math.floor(inputTokens * 0.8);
usage: {
input_tokens: inputTokens,
cache_creation_input_tokens: isFirstTurn ? estimatedCacheTokens : 0,
cache_read_input_tokens: isFirstTurn ? 0 : estimatedCacheTokens,
output_tokens: outputTokens,
cache_creation: {
ephemeral_5m_input_tokens: isFirstTurn ? estimatedCacheTokens : 0
}
}
```
**Impact**: MEDIUM - Inaccurate cost tracking in Claude Code UI
**Complexity**: MEDIUM - Need conversation state tracking
---
### 5. Stop Reason Validation (LOW)
**File**: `src/proxy-server.ts:695`
**Current Check**:
```typescript
// Line 695 - Basic mapping exists
stop_reason: "end_turn", // From mapStopReason()
```
**Verify Mapping**:
```typescript
function mapStopReason(finishReason: string | undefined): string {
switch (finishReason) {
case "stop": return "end_turn"; // ✅
case "length": return "max_tokens"; // ✅
case "tool_calls": return "tool_use"; // ✅
case "content_filter": return "stop_sequence"; // ⚠️ Not quite right
default: return "end_turn"; // ✅ Safe fallback
}
}
```
**Impact**: LOW - Already mostly correct
**Complexity**: LOW - Verify edge cases
---
## Testing Workflow
### Phase 1: Capture Fixtures (2-3 hours)
Capture comprehensive test cases:
```bash
# Build
bun run build
# Capture scenarios
./tests/snapshot-workflow.sh --capture
```
**Scenarios to Capture** (20+ fixtures):
- [x] Simple text (2+2)
- [ ] Long text (explain quantum physics)
- [ ] Read file
- [ ] Grep search
- [ ] Glob pattern
- [ ] Write file
- [ ] Edit file
- [ ] Bash command
- [ ] Multi-tool (Read + Edit)
- [ ] Tool with error
- [ ] Multi-turn conversation
- [ ] All 16 official tools
- [ ] Thinking mode (if supported)
- [ ] Max tokens reached
- [ ] Content filter
### Phase 2: Run Baseline Tests (30 mins)
Run tests to identify failures:
```bash
bun test tests/snapshot.test.ts --verbose > test-results.txt 2>&1
```
**Expected Failures** (before fixes):
- ❌ Content block indices
- ❌ Tool JSON validation
- ⚠️ Ping events (may pass if short)
- ⚠️ Cache metrics (present but zero)
### Phase 3: Fix Proxy (1-2 days)
Implement fixes in order:
1. **Day 1 Morning**: Fix content block indices
2. **Day 1 Afternoon**: Add tool JSON validation
3. **Day 2 Morning**: Add continuous ping events
4. **Day 2 Afternoon**: Add cache metrics estimation
### Phase 4: Validate (1-2 hours)
Re-run tests after each fix:
```bash
# After each fix
bun test tests/snapshot.test.ts
# Expected progression:
# After fix #1: 70-80% pass
# After fix #2: 85-90% pass
# After fix #3: 90-95% pass
# After fix #4: 95-100% pass
```
### Phase 5: Integration Testing (2-3 hours)
Test with real Claude Code:
```bash
# Start proxy
./dist/index.js --model "anthropic/claude-sonnet-4.5"
# In another terminal, use real Claude Code
# Point it to localhost:8337
# Perform various tasks
# Validate:
# - No errors in Claude Code UI
# - Tools execute correctly
# - Multi-turn conversations work
# - Cost tracking accurate
```
---
## Success Criteria
For 1:1 compatibility:
-**100% test coverage** for critical paths
-**All snapshot tests pass**
-**Event sequences match** protocol spec
-**Block indices sequential** (0, 1, 2, ...)
-**Tool JSON validates** before block close
-**Ping events sent** every 15 seconds
-**Cache metrics present** (even if estimated)
-**Stop reason valid** in all cases
-**No Claude Code errors** in real usage
-**Multi-turn works** perfectly
---
## Risk Mitigation
### If OpenRouter Models Don't Support Feature X
**Problem**: Model doesn't provide thinking mode, cache metrics, etc.
**Solution**: Implement graceful degradation
```typescript
// Example: Thinking mode emulation
if (modelSupportsThinking(model)) {
// Use real thinking blocks
} else {
// Convert to text blocks with prefix
sendSSE("content_block_delta", {
index: textBlockIndex,
delta: {
type: "text_delta",
text: "[Thinking: " + thinkingContent + "]\n\n"
}
});
}
```
### If Tests Fail on Specific Models
**Problem**: Model behaves differently than Claude
**Solution**: Model-specific adapters
```typescript
// tests/model-adapters.ts
export const modelAdapters = {
"openai/gpt-4": {
// GPT-4 specific quirks
requiresSpecialToolFormat: true,
maxToolsPerCall: 5
},
"anthropic/claude-sonnet-4.5": {
// Should be 100% compatible
requiresSpecialToolFormat: false
}
};
```
### If Proxy Performance Issues
**Problem**: Snapshot tests timeout
**Solution**: Optimize streaming
```typescript
// Batch small deltas
let deltaBuffer = "";
let bufferTimeout: Timer;
function sendDelta(text: string) {
deltaBuffer += text;
clearTimeout(bufferTimeout);
bufferTimeout = setTimeout(() => {
if (deltaBuffer) {
sendSSE("content_block_delta", { /* ... */ });
deltaBuffer = "";
}
}, 50); // Batch deltas every 50ms
}
```
---
## Timeline
| Phase | Duration | Status |
|-------|----------|--------|
| Testing Framework | 1 day | ✅ Complete |
| Fixture Capture | 2-3 hours | ⏳ Pending |
| Proxy Fixes | 1-2 days | ⏳ Pending |
| Validation | 2-3 hours | ⏳ Pending |
| **Total** | **2-3 days** | **In Progress** |
---
## Next Steps
1. **Immediate** (Today):
- Run `./tests/snapshot-workflow.sh --capture` to build fixture library
- Run `bun test tests/snapshot.test.ts` to see current failures
- Start with Fix #1 (content block indices)
2. **Tomorrow**:
- Complete Fixes #1-2 (critical)
- Re-run tests, validate improvements
- Implement Fixes #3-4 (medium priority)
3. **Day 3**:
- Run full test suite
- Fix any remaining issues
- Integration test with real Claude Code
- Document model-specific limitations
---
## Files Created
| File | Purpose |
|------|---------|
| `tests/capture-fixture.ts` | Extract fixtures from monitor logs |
| `tests/snapshot.test.ts` | Snapshot test runner with validators |
| `tests/fixtures/README.md` | Fixture format documentation |
| `tests/fixtures/example_simple_text.json` | Example text fixture |
| `tests/fixtures/example_tool_use.json` | Example tool use fixture |
| `tests/snapshot-workflow.sh` | End-to-end workflow automation |
| `SNAPSHOT_TESTING.md` | Testing system documentation |
| `PROTOCOL_COMPLIANCE_PLAN.md` | This file |
---
## References
- [Protocol Specification](./PROTOCOL_SPECIFICATION.md) - Complete protocol docs
- [Snapshot Testing Guide](./SNAPSHOT_TESTING.md) - Testing system docs
- [Monitor Mode Guide](./MONITOR_MODE_COMPLETE.md) - Monitor mode usage
- [Streaming Protocol](./STREAMING_PROTOCOL_EXPLAINED.md) - SSE event details
---
**Status**: Framework complete, ready for fixture capture and proxy fixes
**Next Action**: Run `./tests/snapshot-workflow.sh --capture`
**Owner**: Jack Rudenko @ MadAppGang
**Last Updated**: 2025-01-15