diff --git a/docs/superpowers/plans/2026-05-09-overlay-umount-helper-namespace-pin.md b/docs/superpowers/plans/2026-05-09-overlay-umount-helper-namespace-pin.md new file mode 100644 index 0000000..ffe5811 --- /dev/null +++ b/docs/superpowers/plans/2026-05-09-overlay-umount-helper-namespace-pin.md @@ -0,0 +1,161 @@ +# Overlay umount helper was pinning the unit's mount namespace alive + +> **Status:** fixed in `5eac51a` (helper nsenter wrap) and `87d56a0` +> (modal delegation). This doc is a postmortem so future maintainers +> don't walk the same path. + +## Symptom + +After commit `936c8bb` ("ExecStart srcds_run from merged overlay, +not installation/"), every Reset job started failing: + +``` +OSError: [Errno 16] Device or resource busy: + '/var/lib/left4me/runtime//merged' +``` + +`shutil.rmtree(runtime_dir)` in `_purge_instance` tripped on the +still-mounted `merged/`. The unit's `ExecStopPost` had run the umount +helper, the helper had returned non-zero, the unit went `failed`, and +the rmtree downstream couldn't proceed. + +## False starts (don't repeat these) + +We initially modeled this as an unavoidable kernel-level race between +ExecStopPost and the deferred reaping of the unit's per-service mount +namespace. The "fixes" applied in that frame: + +1. **Eager-retry loop in `cmd_umount`** (started at 4 s deadline, + bumped to 12 s, then 25 s). Each bump worked sometimes and broke + sometimes — because we were timing the helper's own life, not the + kernel's reaping (see root cause). +2. **Lazy-umount (`umount -l`) fallback** if eager retries exhausted. + This *would* have made the unit not go `failed`, but it left + `work/work` half-finalized and just moved the EBUSY downstream. +3. **`TimeoutStopSec=15s` → `60s`** to give ExecStopPost more retry + room. This made Stop sit in "stopping" for tens of seconds. + +All three workarounds shipped to the test box and were reverted in +`5eac51a` once we found the actual cause. + +## Root cause + +A live empirical probe (`/tmp/probe-umount2.sh` on the test box, +polling `/proc/*/ns/mnt` while a stop was in flight) showed: + +``` +[t= 0.00] mounted=Y holders=[] +[t= 2.27] mounted=Y holders=[35259(left4me-overlay) ] +[t= 4.53] mounted=Y holders=[35259(left4me-overlay) ] +[t= … ] (steady for ~22 s) +[t=22.97] mounted=Y holders=[35259(left4me-overlay) ] +[t=25.22] mounted=N holders=[] ← helper finally exited +``` + +The single PID holding a reference to the unit's dying mount namespace +was **our own umount helper** running as ExecStopPost. The EBUSY +window matched the helper's retry budget exactly. The mount became +unmountable the moment the helper exited. + +### Why the helper was holding the namespace + +systemd's `+` Exec prefix removes sandbox & credentials, but does +**not** detach from the unit's per-service mount namespace (created +by `PrivateTmp=true` + `Protect*` directives). The Python interpreter +that runs `left4me-overlay` was launched inside the unit's namespace. + +Inside the helper we did: + +```python +subprocess.run([NSENTER, "--mount=/proc/1/ns/mnt", "--", UMOUNT_BIN, ...]) +``` + +That nsenter put the *child process* (the umount syscall) in PID 1's +namespace — but the *parent process* (the helper Python interpreter) +never left the unit's namespace. As long as the helper was alive, it +held a reference to that namespace, which kept the slave-mount tree +alive, which made `umount` in PID 1 return EBUSY (mount-propagation +can't reconcile a slave that still has open references). + +Self-defeating loop: the helper tried to umount the namespace it was +holding open. The mount only released when the helper gave up. + +### Why this didn't surface before commit `936c8bb` + +Before that commit, `ExecStart` invoked `srcds_run` from the +`installation/` lower layer. Srcds processes had cwd / mmaps in +`installation/`, **not** in the overlay mount. The unit's namespace +still existed and the helper still pinned it, but the kernel didn't +need to reconcile any references inside the overlay — so `umount` in +PID 1 found nothing busy and succeeded immediately. + +Once srcds started running from inside `merged/`, the unit's namespace +gained file references inside the overlay, and the helper's +namespace-pin became the thing keeping those references in place. + +## Fix + +**One change at the systemd Exec line, two consequential cleanups.** + +### `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` + +Wrap both helper invocations with nsenter at the unit level: + +```ini +ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i +ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i +``` + +`nsenter` runs in the unit's namespace momentarily, switches its own +mount namespace to PID 1's, then `execve`s the helper. From that +point the helper Python interpreter — *the long-lived parent process* +— lives in PID 1's namespace and holds no reference to the unit's +namespace. + +`TimeoutStopSec` reverts to `15s`. + +### `deploy/files/usr/local/libexec/left4me/left4me-overlay` + +With the helper already in PID 1's namespace, internal nsenter is +redundant. Removed: + +- `nsenter --mount=/proc/1/ns/mnt --` prefix on the mount/umount argv. +- `cmd_umount`'s eager-retry loop (no race left to ride out). +- Lazy-umount (`umount -l`) fallback (no fallback needed; eager + succeeds first try). +- `work_inner` cleanup retry (no kernel-finalisation residual after a + successful eager umount). +- `import time`. + +Kept: input validation, idempotency guards (`os.path.ismount`), +`work_inner` rmtree (the kernel-overlayfs orphan dir is unrelated to +the namespace issue and still needs cleaning up). + +## Verification + +After deploy on the test box: + +| Metric | Before fix | After fix | +|---|---|---| +| Reset duration (`l4d2ctl reset 3`) | ~25 s | ~0.5 s | +| `holders=` of dying namespace | `[helper_pid]` for ~25 s | `[]` immediately | +| Unit state after Stop | `failed` | `inactive` | +| ExecStopPost exit code | 32 (EBUSY) | 0 | + +UI flow (`/servers/3` → Start → Reset): job `#164 reset succeeded` +in 1.3 s end-to-end. No `failed` rows on subsequent resets. + +## Lessons + +- **A retry loop is a hint, not a fix.** If you find yourself reaching + for "retry until kernel finishes," check whether *your own process* + is what's blocking the kernel from finishing. nsenter at the + syscall level looks right, but only escapes the namespace for the + child process; the parent still pins it. +- **Probe for the holder, don't assume async.** `/proc/*/ns/mnt` plus + a tight polling loop quickly tells you who's actually holding a + namespace alive. We jumped to "task_work_add reaping" as the + explanation and burned a round of workarounds before checking. +- **`+` prefix only escapes sandbox & credentials.** Mount namespace + inheritance is unaffected; if you need PID 1's namespace, do + `nsenter` yourself at the Exec line.