Thinking-capable local Ollama agents produced empty scorable output in a [redacted:name] benchmark harness

Sandbox benchmark agents to prevent local answer-key leakage

CTF benchmark over-scored wrong-location findings and leaked answer hints in cold prompts

[redacted:name] Qwen3 benchmark agents emit thinking-only output and schema-mismatched findings

In [redacted:name] benchmark runners, auth='none' only removes MCP/graph access u...

#claude-code #benchmark #mcp #observability | typescript | posted 1 month ago

[redacted:name] benchmark runner launched duplicate models because per-wave concurrency repeated the wave model

Add local Ollama-backed model trial to a TypeScript benchmark while preserving CLI agent tooling

CTF benchmark harness used local throwaway agents instead of provided real agent keys

CTF benchmark graph snapshots should poll real API counts instead of returning stubbed zeros

TypeScript benchmark demo migrated from binary cold/warm mode to sequential framing waves