Report
CTF benchmark: cold agents solved flags via source metadata and loose scoring
4b7e87c4-6ac0-489f-9054-fc3845e4924d
A CTF-style agent benchmark let cold (no-graph) agents capture flags too easily, for two reasons. First, cold source workspaces exposed full repository metadata, including git history and NEWS, ChangeLog, and security advisory files. Second, the scorer accepted file-level findings and wrong-CVE advisory guesses as solved flags. Separately, tool-use budget accounting was based partly on stderr activity instead of parsed tool-use events.
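The three problems above could be addressed along the following lines. This is a minimal sketch, not the benchmark's actual code: the names `strip_metadata`, `is_solved`, `count_tool_uses`, `Flag`, and `Finding`, the deny-list of file names, the function-level match requirement, and the JSON-lines event schema with a `"type": "tool_use"` field are all assumptions made for illustration.

```python
import json
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Hypothetical deny-list. The report names git history and NEWS, ChangeLog,
# and security advisory files; the exact file names here are assumptions.
LEAKY_NAMES = {".git", "news", "news.md", "changelog", "changelog.md", "security.md"}


def strip_metadata(workspace: Path) -> list[str]:
    """Delete leaky metadata from a cold workspace; return removed relative paths."""
    removed = []
    for path in list(workspace.rglob("*")):
        if not path.exists() and not path.is_symlink():
            continue  # already deleted along with a removed parent directory
        if path.name.lower() in LEAKY_NAMES:
            if path.is_dir() and not path.is_symlink():
                shutil.rmtree(path)
            else:
                path.unlink()
            removed.append(path.relative_to(workspace).as_posix())
    return sorted(removed)


@dataclass(frozen=True)
class Flag:
    cve_id: str
    file: str
    function: str


@dataclass(frozen=True)
class Finding:
    cve_id: Optional[str]
    file: str
    function: Optional[str]


def is_solved(flag: Flag, finding: Finding) -> bool:
    """Strict scoring (assumed policy): reject wrong-CVE guesses and file-only findings."""
    if finding.cve_id is not None and finding.cve_id != flag.cve_id:
        return False
    return finding.file == flag.file and finding.function == flag.function


def count_tool_uses(event_log: str) -> int:
    """Count parsed tool-use events in a JSON-lines log, ignoring stderr-style noise."""
    count = 0
    for line in event_log.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured event, so it does not consume budget
        if isinstance(event, dict) and event.get("type") == "tool_use":
            count += 1
    return count
```

Under this sketch, a finding that names only the right file, or that names the right location but a different CVE, does not count as a solved flag, and a line of stderr output never counts against the tool-use budget.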