Captures the symptom (Reset blew up on `umount target busy`), the false starts (eager retry, lazy fallback, TimeoutStopSec bump — all shipped briefly and reverted), the actual root cause (the helper's own Python interpreter inheriting and pinning the unit's mount namespace), and the fix (nsenter at the systemd Exec line). The lessons section is the part future-me reads first: a retry loop is a hint that something we own is the blocker; probe `/proc/*/ns/mnt` before assuming kernel async; `+` Exec prefix doesn't escape the unit's mount namespace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6.4 KiB
Overlay umount helper was pinning the unit's mount namespace alive
Status: fixed in
5eac51a(helper nsenter wrap) and87d56a0(modal delegation). This doc is a postmortem so future maintainers don't walk the same path.
Symptom
After commit 936c8bb ("ExecStart srcds_run from merged overlay,
not installation/"), every Reset job started failing:
OSError: [Errno 16] Device or resource busy:
'/var/lib/left4me/runtime/<id>/merged'
shutil.rmtree(runtime_dir) in _purge_instance tripped on the
still-mounted merged/. The unit's ExecStopPost had run the umount
helper, the helper had returned non-zero, the unit went failed, and
the rmtree downstream couldn't proceed.
False starts (don't repeat these)
We initially modeled this as an unavoidable kernel-level race between ExecStopPost and the deferred reaping of the unit's per-service mount namespace. The "fixes" applied in that frame:
- Eager-retry loop in
cmd_umount(started at 4 s deadline, bumped to 12 s, then 25 s). Each bump worked sometimes and broke sometimes — because we were timing the helper's own life, not the kernel's reaping (see root cause). - Lazy-umount (
umount -l) fallback if eager retries exhausted. This would have made the unit not gofailed, but it leftwork/workhalf-finalized and just moved the EBUSY downstream. TimeoutStopSec=15s→60sto give ExecStopPost more retry room. This made Stop sit in "stopping" for tens of seconds.
All three workarounds shipped to the test box and were reverted in
5eac51a once we found the actual cause.
Root cause
A live empirical probe (/tmp/probe-umount2.sh on the test box,
polling /proc/*/ns/mnt while a stop was in flight) showed:
[t= 0.00] mounted=Y holders=[]
[t= 2.27] mounted=Y holders=[35259(left4me-overlay) ]
[t= 4.53] mounted=Y holders=[35259(left4me-overlay) ]
[t= … ] (steady for ~22 s)
[t=22.97] mounted=Y holders=[35259(left4me-overlay) ]
[t=25.22] mounted=N holders=[] ← helper finally exited
The single PID holding a reference to the unit's dying mount namespace was our own umount helper running as ExecStopPost. The EBUSY window matched the helper's retry budget exactly. The mount became unmountable the moment the helper exited.
Why the helper was holding the namespace
systemd's + Exec prefix removes sandbox & credentials, but does
not detach from the unit's per-service mount namespace (created
by PrivateTmp=true + Protect* directives). The Python interpreter
that runs left4me-overlay was launched inside the unit's namespace.
Inside the helper we did:
subprocess.run([NSENTER, "--mount=/proc/1/ns/mnt", "--", UMOUNT_BIN, ...])
That nsenter put the child process (the umount syscall) in PID 1's
namespace — but the parent process (the helper Python interpreter)
never left the unit's namespace. As long as the helper was alive, it
held a reference to that namespace, which kept the slave-mount tree
alive, which made umount in PID 1 return EBUSY (mount-propagation
can't reconcile a slave that still has open references).
Self-defeating loop: the helper tried to umount the namespace it was holding open. The mount only released when the helper gave up.
Why this didn't surface before commit 936c8bb
Before that commit, ExecStart invoked srcds_run from the
installation/ lower layer. Srcds processes had cwd / mmaps in
installation/, not in the overlay mount. The unit's namespace
still existed and the helper still pinned it, but the kernel didn't
need to reconcile any references inside the overlay — so umount in
PID 1 found nothing busy and succeeded immediately.
Once srcds started running from inside merged/, the unit's namespace
gained file references inside the overlay, and the helper's
namespace-pin became the thing keeping those references in place.
Fix
One change at the systemd Exec line, two consequential cleanups.
deploy/files/usr/local/lib/systemd/system/left4me-server@.service
Wrap both helper invocations with nsenter at the unit level:
ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i
nsenter runs in the unit's namespace momentarily, switches its own
mount namespace to PID 1's, then execves the helper. From that
point the helper Python interpreter — the long-lived parent process
— lives in PID 1's namespace and holds no reference to the unit's
namespace.
TimeoutStopSec reverts to 15s.
deploy/files/usr/local/libexec/left4me/left4me-overlay
With the helper already in PID 1's namespace, internal nsenter is redundant. Removed:
nsenter --mount=/proc/1/ns/mnt --prefix on the mount/umount argv.cmd_umount's eager-retry loop (no race left to ride out).- Lazy-umount (
umount -l) fallback (no fallback needed; eager succeeds first try). work_innercleanup retry (no kernel-finalisation residual after a successful eager umount).import time.
Kept: input validation, idempotency guards (os.path.ismount),
work_inner rmtree (the kernel-overlayfs orphan dir is unrelated to
the namespace issue and still needs cleaning up).
Verification
After deploy on the test box:
| Metric | Before fix | After fix |
|---|---|---|
Reset duration (l4d2ctl reset 3) |
~25 s | ~0.5 s |
holders= of dying namespace |
[helper_pid] for ~25 s |
[] immediately |
| Unit state after Stop | failed |
inactive |
| ExecStopPost exit code | 32 (EBUSY) | 0 |
UI flow (/servers/3 → Start → Reset): job #164 reset succeeded
in 1.3 s end-to-end. No failed rows on subsequent resets.
Lessons
- A retry loop is a hint, not a fix. If you find yourself reaching for "retry until kernel finishes," check whether your own process is what's blocking the kernel from finishing. nsenter at the syscall level looks right, but only escapes the namespace for the child process; the parent still pins it.
- Probe for the holder, don't assume async.
/proc/*/ns/mntplus a tight polling loop quickly tells you who's actually holding a namespace alive. We jumped to "task_work_add reaping" as the explanation and burned a round of workarounds before checking. +prefix only escapes sandbox & credentials. Mount namespace inheritance is unaffected; if you need PID 1's namespace, donsenteryourself at the Exec line.