left4me/docs/superpowers/specs/2026-05-08-kernel-overlayfs-helper-design.md
mwiegand db120d77d3
docs(specs): kernel overlayfs migration design + plan
Captures the architectural fix for the mount-propagation bug: replace
fuse-overlayfs (rootless mount inside the web service's namespace, never
visible to host or to gameserver units) with kernel-native overlayfs
mounted via a privileged sudo helper that nsenters into PID 1's mount
namespace. Companion plan numbers the migration as five tasks ending in
end-to-end verification on the test box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:19:26 +02:00

80 lines
8.9 KiB
Markdown

# Kernel Overlayfs Helper Design
**Goal:** Replace the per-instance `fuse-overlayfs` mount with kernel-native overlayfs invoked through a privileged sudo helper that mounts in PID 1's mount namespace. Restores host-namespace visibility of the merged overlay so gameserver units (`left4me-server@%i.service`) can `chdir` into it at unshare time.
**Approval status:** User-approved design direction. Implementation proceeds in lockstep with the companion plan at `docs/superpowers/plans/2026-05-08-kernel-overlayfs-helper.md`.
## Context
**Symptom.** After redeploys, starting a gameserver leaves the systemd unit in `activating (auto-restart)` with `status=200/CHDIR — Changing to the requested working directory failed: No such file or directory`. Investigation showed:
- `fuse-overlayfs` running as `left4me` user mounts in `left4me-web.service`'s mount namespace.
- `ProtectSystem=full` + `ReadWritePaths=/var/lib/left4me` forces `PrivateMounts=yes` on the unit (`systemd-analyze security` confirms).
- The unit's bind of `/var/lib/left4me` shows `shared:471 master:1` in `/proc/<pid>/mountinfo` — slave-receive-only — so mounts created beneath it never propagate back to host.
- `MountFlags=shared` (added in commit `1968684` to fix this) sets only the unit's *root* propagation; it does not override the slave-direction propagation that `ProtectSystem`/`ReadWritePaths` apply to their bind mounts. The gameserver unit, on unshare, inherits *host* mounts and sees nothing at the merged path → CHDIR fails.
The system *appeared* to work for ~1d8h before this investigation because the prior fuse daemon happened to land in the host namespace via some transient state. The mechanism documented in `1968684` does not reliably work on systemd 257 with this hardening shape.
**Out-of-scope item now in scope.** The 2026-05-07 workshop-overlays spec already lists this transition at line 211: *"Switch from fuse-overlayfs to kernel overlayfs via a privileged helper. Matches the existing systemd / steam-install sudoers helper pattern under `/usr/local/libexec/left4me/`."* The mount-propagation bug is the trigger to do it now.
## Locked Decisions
1. **Privileged helper does the mount.** New `left4me-overlay` script under `/usr/local/libexec/left4me/`, invoked via `sudo -n`. Mirrors the existing `left4me-systemctl` and `left4me-journalctl` pattern. The helper enters PID 1's mount namespace via `nsenter --mount=/proc/1/ns/mnt` and then calls `/bin/mount -t overlay …` or `/bin/umount`. Result: all overlay mounts live in the host namespace, visible to gameserver units.
2. **Kernel-native overlayfs, not fuse.** Once a privileged helper exists, fuse-overlayfs's rootless-mount-via-setuid-`fusermount3` advantage disappears. Kernel overlayfs is faster, has no long-running daemon, simpler unmount, and one fewer runtime dep.
3. **Helper is Python, not shell.** Path canonicalization, env-file parsing, and lowerdir prefix-allowlist validation are too brittle in shell. Uses system `/usr/bin/python3` (never the venv) and stdlib only. Owned by root, mode 0755.
4. **Verbs are `mount` and `umount`.** Matches the kernel/userspace utility names; reduces cognitive friction over `unmount`.
5. **Helper takes only the instance name as input.** It reads `${LEFT4ME_ROOT:-/var/lib/left4me}/instances/<name>/instance.env` for `L4D2_LOWERDIRS=` and computes `upper`/`work`/`merged` from the runtime root. Equivalent in security to taking lowerdirs as args (the user already controls instance.env), and produces a one-line audit trail in `journalctl _COMM=sudo`.
6. **Strict path validation in the helper.**
- Instance name matches `^[a-z0-9][a-z0-9_-]{0,63}$` (mirrors `validate_instance_name` in `l4d2host/paths.py`).
- Each lowerdir from `L4D2_LOWERDIRS` is `os.path.realpath`'d and must resolve under one of an allowlist: `installation/`, `overlays/`, `global_overlay_cache/`, `workshop_cache/`. Empty entries and traversals are rejected.
- `upper`/`work`/`merged` must resolve exactly to `runtime/<name>/{upper,work,merged}`.
- Lowerdir count ≤ 500 (kernel overlayfs hard cap; was 64 before kernel 5.2).
7. **Whiteout-format guard.** `fuse-overlayfs` running as non-root uses `user.fuseoverlayfs.*` xattrs for whiteouts and opaque dirs, which kernel overlayfs ignores entirely. Before mounting, the helper walks `upperdir` once and refuses if any such xattr is present. Defensive; catches a stale fuse-era upperdir that wasn't wiped during migration.
8. **One-time migration: wipe existing `upper/` and `work/`.** Deploy script runs a gated migration (sentinel file `/var/lib/left4me/.kernel-overlay-migrated`) that stops gameservers, stops web service, unmounts any stale fuse/overlay mounts, recreates empty `upper`/`work` dirs for every instance. Players' in-place edits to merged content are sacrificed; v1 accepts this for a test deployment.
9. **Sudoers verb constraints.** `left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-overlay mount *, /usr/local/libexec/left4me/left4me-overlay umount *`. Defense in depth (real validation lives in the helper); makes `sudo -l` output self-documenting.
10. **Wire the existing `OverlayMounter` ABC through.** `start_instance`/`stop_instance`/`delete_instance` today bypass the abstraction at `l4d2host/fs/base.py`. The new `KernelOverlayFSMounter` replaces the unused `FuseOverlayFSMounter` AND becomes the only path through `instances.py`. `FuseOverlayFSMounter` and the `fuse_overlayfs.py` module are deleted.
11. **Double-mount guard in `start_instance`.** Kernel mounts persist when the web worker dies (unlike fuse daemons, which die with their cgroup). `start_instance` checks `os.path.ismount(merged)` and refuses with a clear error rather than double-mounting.
12. **Hardening cleanup on `left4me-web.service`.** Drop `MountFlags=shared` (no longer the mechanism). Restore `PrivateTmp=true` (was dropped in commit `593611e` for fuse propagation that did not work). Keep `NoNewPrivileges` unset (sudo still requires setuid). Update the comment block to reflect the new model.
13. **AGENTS.md contracts unchanged.** The host library's CLI surface (`install`, `initialize`, `start`, `stop`, `delete`, `status`, `logs`) is unchanged. The web app continues to drive operations via `l4d2ctl`. The fuse-overlayfs implementation detail was never part of the public contract.
## Architecture
```
left4me-web.service (hardened, private mount namespace)
│ start_instance(name=…)
l4d2host.instances.start_instance
│ KernelOverlayFSMounter().mount(merged=…)
sudo -n /usr/local/libexec/left4me/left4me-overlay mount <name>
│ • validate name (regex)
│ • parse instance.env → L4D2_LOWERDIRS
│ • realpath each lowerdir, prefix-allowlist check
│ • compute upper/work/merged under runtime/<name>/
│ • walk upperdir, refuse if any user.fuseoverlayfs.* xattr
nsenter --mount=/proc/1/ns/mnt -- \
/bin/mount -t overlay overlay \
-o "lowerdir=…,upperdir=…,workdir=…" \
/var/lib/left4me/runtime/<name>/merged
host mount namespace now has the overlay; gameserver unit, on
unshare, inherits it and CHDIRs into …/merged/left4dead2 successfully.
```
## Operational Notes
- **Migration ordering on the test box (test-server, …).** The deploy script must, in order: (1) stop all `left4me-server@*.service`, (2) stop `left4me-web.service` (kills any lingering fuse-overlayfs daemons by reaping their cgroup), (3) `findmnt` + force-unmount any leftover fuse/overlay mounts under `/var/lib/left4me/runtime/`, (4) wipe and recreate `upper`/`work` for every instance, (5) deploy + start the new code. The sentinel file `/var/lib/left4me/.kernel-overlay-migrated` gates reruns.
- **Filesystem.** `/var/lib/left4me` is btrfs on the test box. Kernel overlayfs on btrfs is supported on kernel ≥ 5.10; the box is on 6.12 — fine. AppArmor ships enabled on Debian Trixie; verify no overlay-related denials in `journalctl -k` after first start.
- **Concurrency.** Two threads racing on `start_instance` for the same name is a latent issue unaffected by this change. The double-mount guard partly mitigates: the loser hits the existing mount and errors cleanly.
## Out Of Scope
- **Replace `sudo` with `AmbientCapabilities=CAP_SYS_ADMIN`** on a dedicated helper unit. Broader blast radius than the wrapper-script approach.
- **A `systemd-mount` per-instance mount unit.** Considered as the alternative architectural fix but adds more moving parts than the helper-script approach. The helper matches the established privileged-helper pattern in this codebase.
- **Re-enable `NoNewPrivileges` on `left4me-web.service`.** Requires removing sudo; not feasible while the helper invocation pattern stays.
- **Multi-process job-worker-claim safety.** The `_claim_lock` in `l4d2host/services/job_worker.py:131-138` is process-local; correctness depends on `--workers 1`. This change doesn't touch it.
- **Replicating the migration on production deployments.** v1 covers only the test-server deployment shape.