left4me/docs/superpowers/plans/2026-05-09-overlay-umount-helper-namespace-pin.md
mwiegand 36d3d83de6
docs: postmortem for the overlay-umount EBUSY rabbit hole
Captures the symptom (Reset blew up on `umount target busy`), the
false starts (eager retry, lazy fallback, TimeoutStopSec bump — all
shipped briefly and reverted), the actual root cause (the helper's
own Python interpreter inheriting and pinning the unit's mount
namespace), and the fix (nsenter at the systemd Exec line).

The lessons section is the part future-me reads first: a retry loop
is a hint that something we own is the blocker; probe `/proc/*/ns/mnt`
before assuming kernel async; `+` Exec prefix doesn't escape the
unit's mount namespace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:50:41 +02:00

161 lines
6.4 KiB
Markdown

# Overlay umount helper was pinning the unit's mount namespace alive
> **Status:** fixed in `5eac51a` (helper nsenter wrap) and `87d56a0`
> (modal delegation). This doc is a postmortem so future maintainers
> don't walk the same path.
## Symptom
After commit `936c8bb` ("ExecStart srcds_run from merged overlay,
not installation/"), every Reset job started failing:
```
OSError: [Errno 16] Device or resource busy:
'/var/lib/left4me/runtime/<id>/merged'
```
`shutil.rmtree(runtime_dir)` in `_purge_instance` tripped on the
still-mounted `merged/`. The unit's `ExecStopPost` had run the umount
helper, the helper had returned non-zero, the unit went `failed`, and
the rmtree downstream couldn't proceed.
## False starts (don't repeat these)
We initially modeled this as an unavoidable kernel-level race between
ExecStopPost and the deferred reaping of the unit's per-service mount
namespace. The "fixes" applied in that frame:
1. **Eager-retry loop in `cmd_umount`** (started at 4 s deadline,
bumped to 12 s, then 25 s). Each bump worked sometimes and broke
sometimes — because we were timing the helper's own life, not the
kernel's reaping (see root cause).
2. **Lazy-umount (`umount -l`) fallback** if eager retries exhausted.
This *would* have made the unit not go `failed`, but it left
`work/work` half-finalized and just moved the EBUSY downstream.
3. **`TimeoutStopSec=15s``60s`** to give ExecStopPost more retry
room. This made Stop sit in "stopping" for tens of seconds.
All three workarounds shipped to the test box and were reverted in
`5eac51a` once we found the actual cause.
## Root cause
A live empirical probe (`/tmp/probe-umount2.sh` on the test box,
polling `/proc/*/ns/mnt` while a stop was in flight) showed:
```
[t= 0.00] mounted=Y holders=[]
[t= 2.27] mounted=Y holders=[35259(left4me-overlay) ]
[t= 4.53] mounted=Y holders=[35259(left4me-overlay) ]
[t= … ] (steady for ~22 s)
[t=22.97] mounted=Y holders=[35259(left4me-overlay) ]
[t=25.22] mounted=N holders=[] ← helper finally exited
```
The single PID holding a reference to the unit's dying mount namespace
was **our own umount helper** running as ExecStopPost. The EBUSY
window matched the helper's retry budget exactly. The mount became
unmountable the moment the helper exited.
### Why the helper was holding the namespace
systemd's `+` Exec prefix removes sandbox & credentials, but does
**not** detach from the unit's per-service mount namespace (created
by `PrivateTmp=true` + `Protect*` directives). The Python interpreter
that runs `left4me-overlay` was launched inside the unit's namespace.
Inside the helper we did:
```python
subprocess.run([NSENTER, "--mount=/proc/1/ns/mnt", "--", UMOUNT_BIN, ...])
```
That nsenter put the *child process* (the umount syscall) in PID 1's
namespace — but the *parent process* (the helper Python interpreter)
never left the unit's namespace. As long as the helper was alive, it
held a reference to that namespace, which kept the slave-mount tree
alive, which made `umount` in PID 1 return EBUSY (mount-propagation
can't reconcile a slave that still has open references).
Self-defeating loop: the helper tried to umount the namespace it was
holding open. The mount only released when the helper gave up.
### Why this didn't surface before commit `936c8bb`
Before that commit, `ExecStart` invoked `srcds_run` from the
`installation/` lower layer. Srcds processes had cwd / mmaps in
`installation/`, **not** in the overlay mount. The unit's namespace
still existed and the helper still pinned it, but the kernel didn't
need to reconcile any references inside the overlay — so `umount` in
PID 1 found nothing busy and succeeded immediately.
Once srcds started running from inside `merged/`, the unit's namespace
gained file references inside the overlay, and the helper's
namespace-pin became the thing keeping those references in place.
## Fix
**One change at the systemd Exec line, two consequential cleanups.**
### `deploy/files/usr/local/lib/systemd/system/left4me-server@.service`
Wrap both helper invocations with nsenter at the unit level:
```ini
ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i
```
`nsenter` runs in the unit's namespace momentarily, switches its own
mount namespace to PID 1's, then `execve`s the helper. From that
point the helper Python interpreter — *the long-lived parent process*
— lives in PID 1's namespace and holds no reference to the unit's
namespace.
`TimeoutStopSec` reverts to `15s`.
### `deploy/files/usr/local/libexec/left4me/left4me-overlay`
With the helper already in PID 1's namespace, internal nsenter is
redundant. Removed:
- `nsenter --mount=/proc/1/ns/mnt --` prefix on the mount/umount argv.
- `cmd_umount`'s eager-retry loop (no race left to ride out).
- Lazy-umount (`umount -l`) fallback (no fallback needed; eager
succeeds first try).
- `work_inner` cleanup retry (no kernel-finalisation residual after a
successful eager umount).
- `import time`.
Kept: input validation, idempotency guards (`os.path.ismount`),
`work_inner` rmtree (the kernel-overlayfs orphan dir is unrelated to
the namespace issue and still needs cleaning up).
## Verification
After deploy on the test box:
| Metric | Before fix | After fix |
|---|---|---|
| Reset duration (`l4d2ctl reset 3`) | ~25 s | ~0.5 s |
| `holders=` of dying namespace | `[helper_pid]` for ~25 s | `[]` immediately |
| Unit state after Stop | `failed` | `inactive` |
| ExecStopPost exit code | 32 (EBUSY) | 0 |
UI flow (`/servers/3` → Start → Reset): job `#164 reset succeeded`
in 1.3 s end-to-end. No `failed` rows on subsequent resets.
## Lessons
- **A retry loop is a hint, not a fix.** If you find yourself reaching
for "retry until kernel finishes," check whether *your own process*
is what's blocking the kernel from finishing. nsenter at the
syscall level looks right, but only escapes the namespace for the
child process; the parent still pins it.
- **Probe for the holder, don't assume async.** `/proc/*/ns/mnt` plus
a tight polling loop quickly tells you who's actually holding a
namespace alive. We jumped to "task_work_add reaping" as the
explanation and burned a round of workarounds before checking.
- **`+` prefix only escapes sandbox & credentials.** Mount namespace
inheritance is unaffected; if you need PID 1's namespace, do
`nsenter` yourself at the Exec line.