From 36d3d83de6a9e92f11ad70eeb71d04d601ea370c Mon Sep 17 00:00:00 2001 From: mwiegand Date: Sat, 9 May 2026 15:50:41 +0200 Subject: [PATCH] docs: postmortem for the overlay-umount EBUSY rabbit hole MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Captures the symptom (Reset blew up on `umount target busy`), the false starts (eager retry, lazy fallback, TimeoutStopSec bump — all shipped briefly and reverted), the actual root cause (the helper's own Python interpreter inheriting and pinning the unit's mount namespace), and the fix (nsenter at the systemd Exec line). The lessons section is the part future-me reads first: a retry loop is a hint that something we own is the blocker; probe `/proc/*/ns/mnt` before assuming kernel async; `+` Exec prefix doesn't escape the unit's mount namespace. Co-Authored-By: Claude Opus 4.7 (1M context) --- ...-09-overlay-umount-helper-namespace-pin.md | 161 ++++++++++++++++++ 1 file changed, 161 insertions(+) create mode 100644 docs/superpowers/plans/2026-05-09-overlay-umount-helper-namespace-pin.md diff --git a/docs/superpowers/plans/2026-05-09-overlay-umount-helper-namespace-pin.md b/docs/superpowers/plans/2026-05-09-overlay-umount-helper-namespace-pin.md new file mode 100644 index 0000000..ffe5811 --- /dev/null +++ b/docs/superpowers/plans/2026-05-09-overlay-umount-helper-namespace-pin.md @@ -0,0 +1,161 @@ +# Overlay umount helper was pinning the unit's mount namespace alive + +> **Status:** fixed in `5eac51a` (helper nsenter wrap) and `87d56a0` +> (modal delegation). This doc is a postmortem so future maintainers +> don't walk the same path. + +## Symptom + +After commit `936c8bb` ("ExecStart srcds_run from merged overlay, +not installation/"), every Reset job started failing: + +``` +OSError: [Errno 16] Device or resource busy: + '/var/lib/left4me/runtime//merged' +``` + +`shutil.rmtree(runtime_dir)` in `_purge_instance` tripped on the +still-mounted `merged/`. The unit's `ExecStopPost` had run the umount +helper, the helper had returned non-zero, the unit went `failed`, and +the rmtree downstream couldn't proceed. + +## False starts (don't repeat these) + +We initially modeled this as an unavoidable kernel-level race between +ExecStopPost and the deferred reaping of the unit's per-service mount +namespace. The "fixes" applied in that frame: + +1. **Eager-retry loop in `cmd_umount`** (started at 4 s deadline, + bumped to 12 s, then 25 s). Each bump worked sometimes and broke + sometimes — because we were timing the helper's own life, not the + kernel's reaping (see root cause). +2. **Lazy-umount (`umount -l`) fallback** if eager retries exhausted. + This *would* have made the unit not go `failed`, but it left + `work/work` half-finalized and just moved the EBUSY downstream. +3. **`TimeoutStopSec=15s` → `60s`** to give ExecStopPost more retry + room. This made Stop sit in "stopping" for tens of seconds. + +All three workarounds shipped to the test box and were reverted in +`5eac51a` once we found the actual cause. + +## Root cause + +A live empirical probe (`/tmp/probe-umount2.sh` on the test box, +polling `/proc/*/ns/mnt` while a stop was in flight) showed: + +``` +[t= 0.00] mounted=Y holders=[] +[t= 2.27] mounted=Y holders=[35259(left4me-overlay) ] +[t= 4.53] mounted=Y holders=[35259(left4me-overlay) ] +[t= … ] (steady for ~22 s) +[t=22.97] mounted=Y holders=[35259(left4me-overlay) ] +[t=25.22] mounted=N holders=[] ← helper finally exited +``` + +The single PID holding a reference to the unit's dying mount namespace +was **our own umount helper** running as ExecStopPost. The EBUSY +window matched the helper's retry budget exactly. The mount became +unmountable the moment the helper exited. + +### Why the helper was holding the namespace + +systemd's `+` Exec prefix removes sandbox & credentials, but does +**not** detach from the unit's per-service mount namespace (created +by `PrivateTmp=true` + `Protect*` directives). The Python interpreter +that runs `left4me-overlay` was launched inside the unit's namespace. + +Inside the helper we did: + +```python +subprocess.run([NSENTER, "--mount=/proc/1/ns/mnt", "--", UMOUNT_BIN, ...]) +``` + +That nsenter put the *child process* (the umount syscall) in PID 1's +namespace — but the *parent process* (the helper Python interpreter) +never left the unit's namespace. As long as the helper was alive, it +held a reference to that namespace, which kept the slave-mount tree +alive, which made `umount` in PID 1 return EBUSY (mount-propagation +can't reconcile a slave that still has open references). + +Self-defeating loop: the helper tried to umount the namespace it was +holding open. The mount only released when the helper gave up. + +### Why this didn't surface before commit `936c8bb` + +Before that commit, `ExecStart` invoked `srcds_run` from the +`installation/` lower layer. Srcds processes had cwd / mmaps in +`installation/`, **not** in the overlay mount. The unit's namespace +still existed and the helper still pinned it, but the kernel didn't +need to reconcile any references inside the overlay — so `umount` in +PID 1 found nothing busy and succeeded immediately. + +Once srcds started running from inside `merged/`, the unit's namespace +gained file references inside the overlay, and the helper's +namespace-pin became the thing keeping those references in place. + +## Fix + +**One change at the systemd Exec line, two consequential cleanups.** + +### `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` + +Wrap both helper invocations with nsenter at the unit level: + +```ini +ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i +ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i +``` + +`nsenter` runs in the unit's namespace momentarily, switches its own +mount namespace to PID 1's, then `execve`s the helper. From that +point the helper Python interpreter — *the long-lived parent process* +— lives in PID 1's namespace and holds no reference to the unit's +namespace. + +`TimeoutStopSec` reverts to `15s`. + +### `deploy/files/usr/local/libexec/left4me/left4me-overlay` + +With the helper already in PID 1's namespace, internal nsenter is +redundant. Removed: + +- `nsenter --mount=/proc/1/ns/mnt --` prefix on the mount/umount argv. +- `cmd_umount`'s eager-retry loop (no race left to ride out). +- Lazy-umount (`umount -l`) fallback (no fallback needed; eager + succeeds first try). +- `work_inner` cleanup retry (no kernel-finalisation residual after a + successful eager umount). +- `import time`. + +Kept: input validation, idempotency guards (`os.path.ismount`), +`work_inner` rmtree (the kernel-overlayfs orphan dir is unrelated to +the namespace issue and still needs cleaning up). + +## Verification + +After deploy on the test box: + +| Metric | Before fix | After fix | +|---|---|---| +| Reset duration (`l4d2ctl reset 3`) | ~25 s | ~0.5 s | +| `holders=` of dying namespace | `[helper_pid]` for ~25 s | `[]` immediately | +| Unit state after Stop | `failed` | `inactive` | +| ExecStopPost exit code | 32 (EBUSY) | 0 | + +UI flow (`/servers/3` → Start → Reset): job `#164 reset succeeded` +in 1.3 s end-to-end. No `failed` rows on subsequent resets. + +## Lessons + +- **A retry loop is a hint, not a fix.** If you find yourself reaching + for "retry until kernel finishes," check whether *your own process* + is what's blocking the kernel from finishing. nsenter at the + syscall level looks right, but only escapes the namespace for the + child process; the parent still pins it. +- **Probe for the holder, don't assume async.** `/proc/*/ns/mnt` plus + a tight polling loop quickly tells you who's actually holding a + namespace alive. We jumped to "task_work_add reaping" as the + explanation and burned a round of workarounds before checking. +- **`+` prefix only escapes sandbox & credentials.** Mount namespace + inheritance is unaffected; if you need PID 1's namespace, do + `nsenter` yourself at the Exec line.