
CTF benchmark: LLM agents quit after solving easy challenges — survival pressure fixes it


When running multi-agent CTF benchmarks with Gemini 2.5 Flash, agents solved 3-4 challenges from the trivial and easy tiers and then stopped making tool calls. Instead of issuing a tool call, the model responded with text such as "I've completed the challenges I can solve". Because the agent loop exits on if (!functionCalls?.length) break, the first text-only response ended the run, typically with 3/21 flags and only 10-12 tool calls used out of a 100-call budget. The agents never attempted the harder challenges, even though relevant techniques were available to them in the knowledge graph.
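The early exit can be reproduced without a real model. The sketch below is a minimal illustration, not the benchmark's actual loop. Only the break condition comes from the report. The types, runAgent, makeQuittingModel, and the run_command tool name are all hypothetical stand-ins, and the stub replaces the Gemini call entirely.

```typescript
// Hypothetical response shape; a real SDK response type will differ.
interface FunctionCall { name: string; args: Record<string, unknown>; }
interface ModelResponse { text?: string; functionCalls?: FunctionCall[]; }
type Model = (history: string[]) => ModelResponse;

interface RunResult { toolCalls: number; stoppedEarly: boolean; lastText?: string; }

// Simplified agent loop: keep asking the model for tool calls until the
// budget is spent or the model replies without any function calls.
function runAgent(model: Model, budget: number): RunResult {
  const history: string[] = [];
  let toolCalls = 0;
  let lastText: string | undefined;

  while (toolCalls < budget) {
    const response = model(history);
    const functionCalls = response.functionCalls;
    lastText = response.text;

    // Exit condition from the report: a text-only reply ends the run,
    // no matter how much of the budget remains.
    if (!functionCalls?.length) break;

    for (const call of functionCalls) {
      if (toolCalls >= budget) break;
      toolCalls++;
      history.push(`tool:${call.name}`); // stand-in for executing the tool
    }
  }

  return { toolCalls, stoppedEarly: toolCalls < budget, lastText };
}

// Stub model (assumption): issues a fixed number of tool calls, then
// declares itself finished with a text-only reply.
function makeQuittingModel(callsBeforeQuitting: number): Model {
  let issued = 0;
  return () => {
    if (issued < callsBeforeQuitting) {
      issued++;
      return { functionCalls: [{ name: "run_command", args: {} }] };
    }
    return { text: "I've completed the challenges I can solve" };
  };
}

const result = runAgent(makeQuittingModel(11), 100);
console.log(result);
```

With the stub quitting after 11 calls, the run ends with 89 calls of budget unused, which mirrors the 10-12 out of 100 pattern described above. The loop treats "no function calls" as "done", so any text-only turn ends the run.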