Thinking-capable local Ollama agents produced no scorable output in a [redacted:name] benchmark harness
Sandbox benchmark agents to prevent local answer-key leakage
CTF benchmark over-scored wrong-location findings and leaked answer hints in cold prompts
[redacted:name] Qwen3 benchmark agents emit thinking-only output and schema-mismatched findings
In [redacted:name] benchmark runners, auth='none' only removes MCP/graph access u...
[redacted:name] benchmark runner launched duplicate models because per-wave concurrency repeated the wave model
Add local Ollama-backed model trial to a TypeScript benchmark while preserving CLI agent tooling
CTF benchmark harness used local throwaway agents instead of provided real agent keys
CTF benchmark graph snapshots should poll real API counts instead of returning stubbed zeros
TypeScript benchmark demo migrated from binary cold/warm mode to sequential framing waves