# Overlay umount helper was pinning the unit's mount namespace alive > **Status:** fixed in `5eac51a` (helper nsenter wrap) and `87d56a0` > (modal delegation). This doc is a postmortem so future maintainers > don't walk the same path. ## Symptom After commit `936c8bb` ("ExecStart srcds_run from merged overlay, not installation/"), every Reset job started failing: ``` OSError: [Errno 16] Device or resource busy: '/var/lib/left4me/runtime//merged' ``` `shutil.rmtree(runtime_dir)` in `_purge_instance` tripped on the still-mounted `merged/`. The unit's `ExecStopPost` had run the umount helper, the helper had returned non-zero, the unit went `failed`, and the rmtree downstream couldn't proceed. ## False starts (don't repeat these) We initially modeled this as an unavoidable kernel-level race between ExecStopPost and the deferred reaping of the unit's per-service mount namespace. The "fixes" applied in that frame: 1. **Eager-retry loop in `cmd_umount`** (started at 4 s deadline, bumped to 12 s, then 25 s). Each bump worked sometimes and broke sometimes — because we were timing the helper's own life, not the kernel's reaping (see root cause). 2. **Lazy-umount (`umount -l`) fallback** if eager retries exhausted. This *would* have made the unit not go `failed`, but it left `work/work` half-finalized and just moved the EBUSY downstream. 3. **`TimeoutStopSec=15s` → `60s`** to give ExecStopPost more retry room. This made Stop sit in "stopping" for tens of seconds. All three workarounds shipped to the test box and were reverted in `5eac51a` once we found the actual cause. ## Root cause A live empirical probe (`/tmp/probe-umount2.sh` on the test box, polling `/proc/*/ns/mnt` while a stop was in flight) showed: ``` [t= 0.00] mounted=Y holders=[] [t= 2.27] mounted=Y holders=[35259(left4me-overlay) ] [t= 4.53] mounted=Y holders=[35259(left4me-overlay) ] [t= … ] (steady for ~22 s) [t=22.97] mounted=Y holders=[35259(left4me-overlay) ] [t=25.22] mounted=N holders=[] ← helper finally exited ``` The single PID holding a reference to the unit's dying mount namespace was **our own umount helper** running as ExecStopPost. The EBUSY window matched the helper's retry budget exactly. The mount became unmountable the moment the helper exited. ### Why the helper was holding the namespace systemd's `+` Exec prefix removes sandbox & credentials, but does **not** detach from the unit's per-service mount namespace (created by `PrivateTmp=true` + `Protect*` directives). The Python interpreter that runs `left4me-overlay` was launched inside the unit's namespace. Inside the helper we did: ```python subprocess.run([NSENTER, "--mount=/proc/1/ns/mnt", "--", UMOUNT_BIN, ...]) ``` That nsenter put the *child process* (the umount syscall) in PID 1's namespace — but the *parent process* (the helper Python interpreter) never left the unit's namespace. As long as the helper was alive, it held a reference to that namespace, which kept the slave-mount tree alive, which made `umount` in PID 1 return EBUSY (mount-propagation can't reconcile a slave that still has open references). Self-defeating loop: the helper tried to umount the namespace it was holding open. The mount only released when the helper gave up. ### Why this didn't surface before commit `936c8bb` Before that commit, `ExecStart` invoked `srcds_run` from the `installation/` lower layer. Srcds processes had cwd / mmaps in `installation/`, **not** in the overlay mount. The unit's namespace still existed and the helper still pinned it, but the kernel didn't need to reconcile any references inside the overlay — so `umount` in PID 1 found nothing busy and succeeded immediately. Once srcds started running from inside `merged/`, the unit's namespace gained file references inside the overlay, and the helper's namespace-pin became the thing keeping those references in place. ## Fix **One change at the systemd Exec line, two consequential cleanups.** ### `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` Wrap both helper invocations with nsenter at the unit level: ```ini ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i ``` `nsenter` runs in the unit's namespace momentarily, switches its own mount namespace to PID 1's, then `execve`s the helper. From that point the helper Python interpreter — *the long-lived parent process* — lives in PID 1's namespace and holds no reference to the unit's namespace. `TimeoutStopSec` reverts to `15s`. ### `deploy/files/usr/local/libexec/left4me/left4me-overlay` With the helper already in PID 1's namespace, internal nsenter is redundant. Removed: - `nsenter --mount=/proc/1/ns/mnt --` prefix on the mount/umount argv. - `cmd_umount`'s eager-retry loop (no race left to ride out). - Lazy-umount (`umount -l`) fallback (no fallback needed; eager succeeds first try). - `work_inner` cleanup retry (no kernel-finalisation residual after a successful eager umount). - `import time`. Kept: input validation, idempotency guards (`os.path.ismount`), `work_inner` rmtree (the kernel-overlayfs orphan dir is unrelated to the namespace issue and still needs cleaning up). ## Verification After deploy on the test box: | Metric | Before fix | After fix | |---|---|---| | Reset duration (`l4d2ctl reset 3`) | ~25 s | ~0.5 s | | `holders=` of dying namespace | `[helper_pid]` for ~25 s | `[]` immediately | | Unit state after Stop | `failed` | `inactive` | | ExecStopPost exit code | 32 (EBUSY) | 0 | UI flow (`/servers/3` → Start → Reset): job `#164 reset succeeded` in 1.3 s end-to-end. No `failed` rows on subsequent resets. ## Lessons - **A retry loop is a hint, not a fix.** If you find yourself reaching for "retry until kernel finishes," check whether *your own process* is what's blocking the kernel from finishing. nsenter at the syscall level looks right, but only escapes the namespace for the child process; the parent still pins it. - **Probe for the holder, don't assume async.** `/proc/*/ns/mnt` plus a tight polling loop quickly tells you who's actually holding a namespace alive. We jumped to "task_work_add reaping" as the explanation and burned a round of workarounds before checking. - **`+` prefix only escapes sandbox & credentials.** Mount namespace inheritance is unaffected; if you need PID 1's namespace, do `nsenter` yourself at the Exec line.