left4me/docs/superpowers/plans/2026-05-09-overlay-umount-helper-namespace-pin.md

# Overlay umount helper was pinning the unit's mount namespace alive

> **Status:** fixed in `5eac51a` (helper nsenter wrap) and `87d56a0`
> (modal delegation). This doc is a postmortem so future maintainers
> don't walk the same path.

## Symptom

After commit `936c8bb` ("ExecStart srcds_run from merged overlay,
not installation/"), every Reset job started failing:

```
OSError: [Errno 16] Device or resource busy:
  '/var/lib/left4me/runtime/<id>/merged'
```

`shutil.rmtree(runtime_dir)` in `_purge_instance` tripped on the
still-mounted `merged/`. The unit's `ExecStopPost` had run the umount
helper, the helper had returned non-zero, the unit went `failed`, and
the rmtree downstream couldn't proceed.

## False starts (don't repeat these)

We initially modeled this as an unavoidable kernel-level race between
ExecStopPost and the deferred reaping of the unit's per-service mount
namespace. The "fixes" applied in that frame:

1. **Eager-retry loop in `cmd_umount`** (started at 4 s deadline,
   bumped to 12 s, then 25 s). Each bump worked sometimes and broke
   sometimes — because we were timing the helper's own life, not the
   kernel's reaping (see root cause).
2. **Lazy-umount (`umount -l`) fallback** if eager retries exhausted.
   This *would* have made the unit not go `failed`, but it left
   `work/work` half-finalized and just moved the EBUSY downstream.
3. **`TimeoutStopSec=15s` → `60s`** to give ExecStopPost more retry
   room. This made Stop sit in "stopping" for tens of seconds.

All three workarounds shipped to the test box and were reverted in
`5eac51a` once we found the actual cause.

## Root cause

A live empirical probe (`/tmp/probe-umount2.sh` on the test box,
polling `/proc/*/ns/mnt` while a stop was in flight) showed:

```
[t= 0.00] mounted=Y holders=[]
[t= 2.27] mounted=Y holders=[35259(left4me-overlay) ]
[t= 4.53] mounted=Y holders=[35259(left4me-overlay) ]
[t= …  ]                          (steady for ~22 s)
[t=22.97] mounted=Y holders=[35259(left4me-overlay) ]
[t=25.22] mounted=N holders=[]    ← helper finally exited
```

The single PID holding a reference to the unit's dying mount namespace
was **our own umount helper** running as ExecStopPost. The EBUSY
window matched the helper's retry budget exactly. The mount became
unmountable the moment the helper exited.

### Why the helper was holding the namespace

systemd's `+` Exec prefix removes sandbox & credentials, but does
**not** detach from the unit's per-service mount namespace (created
by `PrivateTmp=true` + `Protect*` directives). The Python interpreter
that runs `left4me-overlay` was launched inside the unit's namespace.

Inside the helper we did:

```python
subprocess.run([NSENTER, "--mount=/proc/1/ns/mnt", "--", UMOUNT_BIN, ...])
```

That nsenter put the *child process* (the umount syscall) in PID 1's
namespace — but the *parent process* (the helper Python interpreter)
never left the unit's namespace. As long as the helper was alive, it
held a reference to that namespace, which kept the slave-mount tree
alive, which made `umount` in PID 1 return EBUSY (mount-propagation
can't reconcile a slave that still has open references).

Self-defeating loop: the helper tried to umount the namespace it was
holding open. The mount only released when the helper gave up.

### Why this didn't surface before commit `936c8bb`

Before that commit, `ExecStart` invoked `srcds_run` from the
`installation/` lower layer. Srcds processes had cwd / mmaps in
`installation/`, **not** in the overlay mount. The unit's namespace
still existed and the helper still pinned it, but the kernel didn't
need to reconcile any references inside the overlay — so `umount` in
PID 1 found nothing busy and succeeded immediately.

Once srcds started running from inside `merged/`, the unit's namespace
gained file references inside the overlay, and the helper's
namespace-pin became the thing keeping those references in place.

## Fix

**One change at the systemd Exec line, two consequential cleanups.**

### `deploy/files/usr/local/lib/systemd/system/left4me-server@.service`

Wrap both helper invocations with nsenter at the unit level:

```ini
ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i
```

`nsenter` runs in the unit's namespace momentarily, switches its own
mount namespace to PID 1's, then `execve`s the helper. From that
point the helper Python interpreter — *the long-lived parent process*
— lives in PID 1's namespace and holds no reference to the unit's
namespace.

`TimeoutStopSec` reverts to `15s`.

### `deploy/files/usr/local/libexec/left4me/left4me-overlay`

With the helper already in PID 1's namespace, internal nsenter is
redundant. Removed:

- `nsenter --mount=/proc/1/ns/mnt --` prefix on the mount/umount argv.
- `cmd_umount`'s eager-retry loop (no race left to ride out).
- Lazy-umount (`umount -l`) fallback (no fallback needed; eager
  succeeds first try).
- `work_inner` cleanup retry (no kernel-finalisation residual after a
  successful eager umount).
- `import time`.

Kept: input validation, idempotency guards (`os.path.ismount`),
`work_inner` rmtree (the kernel-overlayfs orphan dir is unrelated to
the namespace issue and still needs cleaning up).

## Verification

After deploy on the test box:

| Metric | Before fix | After fix |
|---|---|---|
| Reset duration (`l4d2ctl reset 3`) | ~25 s | ~0.5 s |
| `holders=` of dying namespace | `[helper_pid]` for ~25 s | `[]` immediately |
| Unit state after Stop | `failed` | `inactive` |
| ExecStopPost exit code | 32 (EBUSY) | 0 |

UI flow (`/servers/3` → Start → Reset): job `#164 reset succeeded`
in 1.3 s end-to-end. No `failed` rows on subsequent resets.

## Lessons

- **A retry loop is a hint, not a fix.** If you find yourself reaching
  for "retry until kernel finishes," check whether *your own process*
  is what's blocking the kernel from finishing. nsenter at the
  syscall level looks right, but only escapes the namespace for the
  child process; the parent still pins it.
- **Probe for the holder, don't assume async.** `/proc/*/ns/mnt` plus
  a tight polling loop quickly tells you who's actually holding a
  namespace alive. We jumped to "task_work_add reaping" as the
  explanation and burned a round of workarounds before checking.
- **`+` prefix only escapes sandbox & credentials.** Mount namespace
  inheritance is unaffected; if you need PID 1's namespace, do
  `nsenter` yourself at the Exec line.