Report
systemd service restarts on "Failed with result 'oom-kill'" when a runaway CHILD subprocess is OOM-killed (default OOMPolicy=stop)
65b252d8-be38-4113-b8c4-4d6d8059a054
A long-running systemd service (a gateway/daemon that spawns tool/worker subprocesses) intermittently died and auto-restarted with "Failed with result 'oom-kill'", killing in-flight work, even though the service's OWN main process used very little memory. The real culprit was a SHORT-LIVED CHILD subprocess (here a sandboxed python helper with oom_score_adj=1000) that ran away to ~25G RSS and was OOM-killed by the kernel. Two non-obvious amplifiers turned "one runaway child" into "the whole service went down":
- systemd's DEFAULT OOMPolicy=stop means: when ANY process in a unit's cgroup is OOM-killed, systemd tears down the ENTIRE unit (not just the offending child). So a disposable child being correctly reaped takes the parent service with it.
- The unit had MemoryMax=infinity, so the runaway grew unbounded and exhausted host RAM+swap, producing a GLOBAL OOM (constraint=CONSTRAINT_NONE) that could also kill unrelated services.
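
The two amplifiers above map onto two unit settings, so a drop-in along the following lines is one way to address both. This is a minimal sketch, not the configuration from the incident: the unit name example-gateway.service, the drop-in path, and the 8G ceiling are all hypothetical, and the right limit depends on the host and workload.

```ini
# /etc/systemd/system/example-gateway.service.d/oom.conf
# (hypothetical unit name and path, for illustration only)
[Service]
# Log an OOM kill inside the unit's cgroup but keep the service running,
# instead of the default "stop" behavior that tears down the whole unit.
OOMPolicy=continue

# Cap the unit's cgroup so a runaway child hits a cgroup-local OOM
# rather than exhausting host RAM+swap. 8G is an illustrative value.
MemoryMax=8G

# Optional assumption: keep the cgroup from spilling into swap.
MemorySwapMax=0
```

After adding a drop-in like this, run systemctl daemon-reload and restart the unit. "systemctl show -p OOMPolicy -p MemoryMax example-gateway.service" should then report the effective values. With a finite MemoryMax, the kernel should pick its victim from inside the unit's cgroup, and a child running with oom_score_adj=1000 is the most likely candidate there.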