Process watchdog false-aborts live-but-quiet subprocesses; fix with byte-level liveness
b1d2278e-d95f-4d8e-b26e-5a4ea9507806
A heartbeat-based watchdog supervising a long-running child subprocess (e.g. a streaming agent/LLM CLI behind a gateway) force-kills the process as "stalled" even though it is alive and working. Root issue: the watchdog measures liveness only from high-level, parsed "progress" events. During long internal work the child streams plenty of raw output (thinking/keepalive/tool-frame bytes) that never maps to a parsed progress event, so the "last progress" age grows past the abort threshold and the run is killed. With only this one signal, a genuinely hung process (dead socket, no output at all) is indistinguishable from a process that is alive but emitting no progress events. Worse, tightening the abort timeout to recover faster from real hangs makes false aborts of healthy long runs more frequent: the classic tight-vs-loose timeout tension.
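A minimal sketch of the byte-level liveness idea named in the title, in Python. It is an illustration under stated assumptions, not the actual watchdog: the class name `LivenessTracker`, the method names, the verdict strings, and both threshold values are hypothetical. The looser secondary cap on parsed progress is also an assumption added here to show how the two signals can be kept separate.

```python
import time
from typing import Callable


class LivenessTracker:
    """Tracks two independent liveness signals for a supervised child.

    Hypothetical helper: names and default thresholds are illustrative only.
    """

    def __init__(
        self,
        byte_timeout: float = 30.0,       # assumed: max silence on the raw stream
        progress_timeout: float = 900.0,  # assumed: loose cap without parsed progress
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.byte_timeout = byte_timeout
        self.progress_timeout = progress_timeout
        self.clock = clock
        now = clock()
        self.last_byte_at = now
        self.last_progress_at = now

    def on_bytes(self, chunk: bytes) -> None:
        """Call for every raw chunk read from the child, parsed or not."""
        if chunk:
            self.last_byte_at = self.clock()

    def on_progress(self) -> None:
        """Call when a chunk parses into a high-level progress event."""
        now = self.clock()
        self.last_progress_at = now
        self.last_byte_at = now  # a progress event implies bytes arrived

    def verdict(self) -> str:
        """Classify the child as 'alive', 'hung', or 'no_progress'."""
        now = self.clock()
        if now - self.last_byte_at > self.byte_timeout:
            return "hung"         # no output at all: safe to use a tight timeout
        if now - self.last_progress_at > self.progress_timeout:
            return "no_progress"  # streaming bytes but never progressing
        return "alive"
```

In this sketch the read loop would call `on_bytes` for every chunk read from the child's stdout and `on_progress` only when a chunk parses into a progress event. Because the tight timeout applies to raw bytes rather than parsed events, it can be shortened to catch dead sockets quickly without killing runs that are still streaming keepalive or thinking output.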