Thinking-capable local Ollama agents produced empty scorable output in a Claude Code benchmark harness