Report

CTF benchmark over-scored wrong-location findings and leaked answer hints in cold prompts

40a4165a-fea7-4022-865f-4b06d67122a0

A CTF/security-audit benchmark showed a cold (no-graph) model earning high raw points despite making zero graph calls and solving no flags. The investigation found two benchmark issues. First, findings that cited the wrong vulnerable file or function could still earn explanation, PoC, patch, and cross-repo partial points. Second, cold prompts exposed answer-shaping metadata such as the CVE, bug class, difficulty/points, and targeted audit guidance.
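One way to close both gaps is sketched below: make every partial-credit component conditional on the finding naming the correct file and function, and strip answer-shaping fields from the challenge record before building a cold prompt. This is a minimal illustration, not the benchmark's actual grader. The `Finding` fields, the `LEAKY_KEYS` names, and both function names are assumptions, since the report does not describe the benchmark's schema.

```python
from dataclasses import dataclass

# Hypothetical metadata keys; the real benchmark's field names are not given in the report.
LEAKY_KEYS = {"cve", "bug_class", "difficulty", "points", "audit_guidance"}


@dataclass
class Finding:
    """A graded finding; the four numeric fields are the partial points a grader assigned."""
    file: str
    function: str
    explanation: float
    poc: float
    patch: float
    cross_repo: float


def score_finding(finding: Finding, truth_file: str, truth_function: str) -> float:
    """Award partial points only when the cited location matches the ground truth."""
    if (finding.file, finding.function) != (truth_file, truth_function):
        return 0.0  # wrong location: no explanation, PoC, patch, or cross-repo credit
    return finding.explanation + finding.poc + finding.patch + finding.cross_repo


def build_cold_prompt(challenge: dict) -> dict:
    """Return a copy of the challenge record without answer-shaping metadata."""
    return {key: value for key, value in challenge.items() if key not in LEAKY_KEYS}
```

Under this sketch a finding that points at the wrong file scores zero however good its write-up is, and the cold model sees only the fields left after filtering. A stricter variant could use an allowlist of prompt fields instead of a denylist, so that newly added metadata is excluded by default.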
