Report

Thinking-capable local Ollama agents produced empty scorable output in a Claude Code benchmark harness

7f99d27b-6249-4a1f-8f42-fdaad920660e

A benchmark harness ran local Ollama models through Claude Code stream-json and scored only the visible final-answer blocks. Thinking-capable models could spend the short response budget on hidden reasoning, leaving the visible answer empty or too short for the scorer. An initially proposed fix risked counting thinking as a scoreboard metric; the intended behavior was that thinking should not count toward tool turns or scoring activity.
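The separation described above can be illustrated with a minimal sketch. It assumes each stream-json line is a JSON object, that assistant events carry Messages-style content blocks (`text`, `thinking`, `tool_use`) under `message.content`, and that a thinking block stores its text under a `thinking` key. The names `RunSummary`, `summarize_stream`, and `scorable_answer` are hypothetical and are not taken from the harness in this report.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    visible_text: list[str] = field(default_factory=list)  # the only scorable output
    thinking_chars: int = 0  # diagnostic only, never fed to the scorer
    tool_turns: int = 0      # assistant events that contain a tool_use block


def summarize_stream(lines):
    """Split visible text from thinking in a stream of JSON lines.

    Assumed event shape (hypothetical, based on Messages-style blocks):
    {"type": "assistant", "message": {"content": [{"type": "text", ...}]}}
    """
    summary = RunSummary()
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise instead of failing the run
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        message = event.get("message")
        blocks = message.get("content") if isinstance(message, dict) else None
        if not isinstance(blocks, list):
            continue
        used_tool = False
        for block in blocks:
            if not isinstance(block, dict):
                continue
            kind = block.get("type")
            if kind == "text":
                summary.visible_text.append(block.get("text", ""))
            elif kind == "thinking":
                # Recorded for debugging; it is neither a tool turn nor a score.
                summary.thinking_chars += len(block.get("thinking", ""))
            elif kind == "tool_use":
                used_tool = True
        if used_tool:
            summary.tool_turns += 1
    return summary


def scorable_answer(summary):
    """Join the non-empty visible text blocks; thinking is excluded."""
    return "\n".join(t for t in summary.visible_text if t.strip())


if __name__ == "__main__":
    demo = [
        json.dumps({"type": "assistant", "message": {"content": [
            {"type": "thinking", "thinking": "long hidden reasoning"}]}}),
    ]
    result = summarize_stream(demo)
    print(repr(scorable_answer(result)), result.thinking_chars, result.tool_turns)
```

Keeping `thinking_chars` as a separate diagnostic makes an empty scorable answer easy to explain (the budget went to reasoning) without letting thinking leak into tool-turn counts or scores.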
