docs: postmortem for the overlay-umount EBUSY rabbit hole

Captures the symptom (Reset blew up on `umount target busy`), the
false starts (eager retry, lazy fallback, TimeoutStopSec bump — all
shipped briefly and reverted), the actual root cause (the helper's
own Python interpreter inheriting and pinning the unit's mount
namespace), and the fix (nsenter at the systemd Exec line).

The lessons section is the part future-me reads first: a retry loop
is a hint that something we own is the blocker; probe `/proc/*/ns/mnt`
before assuming kernel async; `+` Exec prefix doesn't escape the
unit's mount namespace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-09 15:50:41 +02:00

6.4 KiB

Raw Blame History

Overlay umount helper was pinning the unit's mount namespace alive

Status: fixed in 5eac51a (helper nsenter wrap) and 87d56a0 (modal delegation). This doc is a postmortem so future maintainers don't walk the same path.

Symptom

After commit 936c8bb ("ExecStart srcds_run from merged overlay, not installation/"), every Reset job started failing:

OSError: [Errno 16] Device or resource busy:
  '/var/lib/left4me/runtime/<id>/merged'

shutil.rmtree(runtime_dir) in _purge_instance tripped on the still-mounted merged/. The unit's ExecStopPost had run the umount helper, the helper had returned non-zero, the unit went failed, and the rmtree downstream couldn't proceed.

False starts (don't repeat these)

We initially modeled this as an unavoidable kernel-level race between ExecStopPost and the deferred reaping of the unit's per-service mount namespace. The "fixes" applied in that frame:

Eager-retry loop in cmd_umount (started at 4 s deadline, bumped to 12 s, then 25 s). Each bump worked sometimes and broke sometimes — because we were timing the helper's own life, not the kernel's reaping (see root cause).
Lazy-umount (umount -l) fallback if eager retries exhausted. This would have made the unit not go failed, but it left work/work half-finalized and just moved the EBUSY downstream.
TimeoutStopSec=15s → 60s to give ExecStopPost more retry room. This made Stop sit in "stopping" for tens of seconds.

All three workarounds shipped to the test box and were reverted in 5eac51a once we found the actual cause.

Root cause

A live empirical probe (/tmp/probe-umount2.sh on the test box, polling /proc/*/ns/mnt while a stop was in flight) showed:

[t= 0.00] mounted=Y holders=[]
[t= 2.27] mounted=Y holders=[35259(left4me-overlay) ]
[t= 4.53] mounted=Y holders=[35259(left4me-overlay) ]
[t= …  ]                          (steady for ~22 s)
[t=22.97] mounted=Y holders=[35259(left4me-overlay) ]
[t=25.22] mounted=N holders=[]    ← helper finally exited

The single PID holding a reference to the unit's dying mount namespace was our own umount helper running as ExecStopPost. The EBUSY window matched the helper's retry budget exactly. The mount became unmountable the moment the helper exited.

Why the helper was holding the namespace

systemd's + Exec prefix removes sandbox & credentials, but does not detach from the unit's per-service mount namespace (created by PrivateTmp=true + Protect* directives). The Python interpreter that runs left4me-overlay was launched inside the unit's namespace.

Inside the helper we did:

subprocess.run([NSENTER, "--mount=/proc/1/ns/mnt", "--", UMOUNT_BIN, ...])

That nsenter put the child process (the umount syscall) in PID 1's namespace — but the parent process (the helper Python interpreter) never left the unit's namespace. As long as the helper was alive, it held a reference to that namespace, which kept the slave-mount tree alive, which made umount in PID 1 return EBUSY (mount-propagation can't reconcile a slave that still has open references).

Self-defeating loop: the helper tried to umount the namespace it was holding open. The mount only released when the helper gave up.

Why this didn't surface before commit `936c8bb`

Before that commit, ExecStart invoked srcds_run from the installation/ lower layer. Srcds processes had cwd / mmaps in installation/, not in the overlay mount. The unit's namespace still existed and the helper still pinned it, but the kernel didn't need to reconcile any references inside the overlay — so umount in PID 1 found nothing busy and succeeded immediately.

Once srcds started running from inside merged/, the unit's namespace gained file references inside the overlay, and the helper's namespace-pin became the thing keeping those references in place.

Fix

One change at the systemd Exec line, two consequential cleanups.

`deploy/files/usr/local/lib/systemd/system/left4me-server@.service`

Wrap both helper invocations with nsenter at the unit level:

ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i

nsenter runs in the unit's namespace momentarily, switches its own mount namespace to PID 1's, then execves the helper. From that point the helper Python interpreter — the long-lived parent process — lives in PID 1's namespace and holds no reference to the unit's namespace.

TimeoutStopSec reverts to 15s.

`deploy/files/usr/local/libexec/left4me/left4me-overlay`

With the helper already in PID 1's namespace, internal nsenter is redundant. Removed:

nsenter --mount=/proc/1/ns/mnt -- prefix on the mount/umount argv.
cmd_umount's eager-retry loop (no race left to ride out).
Lazy-umount (umount -l) fallback (no fallback needed; eager succeeds first try).
work_inner cleanup retry (no kernel-finalisation residual after a successful eager umount).
import time.

Kept: input validation, idempotency guards (os.path.ismount), work_inner rmtree (the kernel-overlayfs orphan dir is unrelated to the namespace issue and still needs cleaning up).

Verification

After deploy on the test box:

Metric	Before fix	After fix
Reset duration (`l4d2ctl reset 3`)	~25 s	~0.5 s
`holders=` of dying namespace	`[helper_pid]` for ~25 s	`[]` immediately
Unit state after Stop	`failed`	`inactive`
ExecStopPost exit code	32 (EBUSY)	0

UI flow (/servers/3 → Start → Reset): job #164 reset succeeded in 1.3 s end-to-end. No failed rows on subsequent resets.

Lessons

A retry loop is a hint, not a fix. If you find yourself reaching for "retry until kernel finishes," check whether your own process is what's blocking the kernel from finishing. nsenter at the syscall level looks right, but only escapes the namespace for the child process; the parent still pins it.
Probe for the holder, don't assume async. /proc/*/ns/mnt plus a tight polling loop quickly tells you who's actually holding a namespace alive. We jumped to "task_work_add reaping" as the explanation and burned a round of workarounds before checking.
+ prefix only escapes sandbox & credentials. Mount namespace inheritance is unaffected; if you need PID 1's namespace, do nsenter yourself at the Exec line.

6.4 KiB Raw Blame History