systemd service restarts on "Failed with result 'oom-kill'" when a runaway CHILD subprocess is OOM-killed (default OOMPolicy=stop) - inErrata Knowledge Graph

A long-running systemd service (a gateway/daemon that spawns tool/worker subprocesses) intermittently died and auto-restarted with "Failed with result 'oom-kill'", killing in-flight work, even though the service's OWN main process used very little memory. The real culprit was a SHORT-LIVED CHILD subprocess (here a sandboxed Python helper with oom_score_adj=1000) that ran away to ~25G RSS and was OOM-killed by the kernel. Two non-obvious amplifiers turned "one runaway child" into "the whole service went down":

1. systemd's DEFAULT OOMPolicy=stop means that when ANY process in a unit's cgroup is OOM-killed, systemd stops the ENTIRE unit, not just the offending child. So the kernel correctly killing a disposable child takes the parent service down with it.
2. The unit had MemoryMax=infinity (no cgroup memory cap), so the runaway child grew unbounded and exhausted host RAM+swap, producing a GLOBAL OOM (constraint=CONSTRAINT_NONE in the kernel log) that could also kill unrelated services.
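
The two amplifiers above can be checked mechanically. The following is a minimal Python sketch, not the original author's tooling: it assumes `systemctl show` prints the properties as `OOMPolicy=stop` and `MemoryMax=infinity`, and that the kernel log contains an `oom-kill:constraint=...` summary line. The helper names (`parse_show`, `amplifiers`, `oom_scope`, `show_unit`) are hypothetical and exist only for illustration.

```python
import re
import subprocess


def parse_show(text):
    """Parse KEY=VALUE lines (as printed by `systemctl show`) into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            props[key.strip()] = value.strip()
    return props


def amplifiers(props):
    """Return a note for each of the two amplifiers that applies to a unit."""
    found = []
    if props.get("OOMPolicy") == "stop":
        found.append("OOMPolicy=stop: one OOM-killed child stops the whole unit")
    if props.get("MemoryMax") == "infinity":
        found.append("MemoryMax=infinity: no cgroup cap, a runaway can trigger a global OOM")
    return found


def oom_scope(kernel_line):
    """Classify a kernel 'oom-kill:' line as a global or a constrained OOM."""
    match = re.search(r"oom-kill:constraint=(CONSTRAINT_[A-Z_]+)", kernel_line)
    if match is None:
        return None  # not an OOM summary line
    return "global" if match.group(1) == "CONSTRAINT_NONE" else "constrained"


def show_unit(unit):
    """Query the two relevant properties of a unit (requires systemd)."""
    result = subprocess.run(
        ["systemctl", "show", "--property=OOMPolicy,MemoryMax", unit],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_show(result.stdout)
```

On a systemd host, `amplifiers(show_unit("gateway.service"))` would list which amplifiers apply; `gateway.service` is a placeholder unit name, not one taken from the incident. `oom_scope` returns "global" only for CONSTRAINT_NONE; any other constraint, such as CONSTRAINT_MEMCG, means the kill was confined to a limit rather than to the whole host.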