docs(specs): kernel overlayfs migration design + plan

Captures the architectural fix for the mount-propagation bug: replace
fuse-overlayfs (rootless mount inside the web service's namespace, never
visible to host or to gameserver units) with kernel-native overlayfs
mounted via a privileged sudo helper that nsenters into PID 1's mount
namespace. Companion plan numbers the migration as five tasks ending in
end-to-end verification on the test box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
mwiegand 2026-05-08 12:19:26 +02:00
parent d5d710afa7
commit db120d77d3
No known key found for this signature in database
2 changed files with 309 additions and 0 deletions

View file

@ -0,0 +1,229 @@
# Kernel Overlayfs Helper Implementation Plan
> **Approval status:** User-approved 2026-05-08. Implementation proceeds.
**Goal:** Implement the kernel-overlayfs migration per `docs/superpowers/specs/2026-05-08-kernel-overlayfs-helper-design.md`. Add a Python `left4me-overlay` privileged helper, a `KernelOverlayFSMounter` Python class, wire the existing `OverlayMounter` ABC through `l4d2host/instances.py`, drop `fuse-overlayfs` from the deploy stack, and migrate existing on-disk upper/work directories.
**Architecture:** The web app continues to call `l4d2ctl start|stop|delete <name>`; `l4d2host` continues to expose the same CLI verbs. Internally, `start_instance`/`stop_instance`/`delete_instance` move from a hardcoded subprocess call to `fuse-overlayfs`/`fusermount3` to using `KernelOverlayFSMounter`, which invokes the new sudo helper that mounts in PID 1's namespace via `nsenter`.
---
## Locked Decisions
See `docs/superpowers/specs/2026-05-08-kernel-overlayfs-helper-design.md` for the design rationale. Implementation-relevant summary:
- `left4me-overlay` Python helper in `/usr/local/libexec/left4me/`, owned root, mode 0755, system `/usr/bin/python3`, stdlib only.
- Verbs: `mount <name>`, `umount <name>`.
- Validation in helper: name regex; realpath + allowlist for each lowerdir; exact-prefix check for upper/work/merged; reject upperdir with `user.fuseoverlayfs.*` xattrs; lowerdir count ≤ 500.
- Sudoers verb-constrained: `mount *`, `umount *`.
- `KernelOverlayFSMounter` in `l4d2host/fs/kernel_overlayfs.py` — implements `OverlayMounter`. Derives `name` from the merged path's parent.
- `start_instance` adds `os.path.ismount(merged)` guard before mounting.
- Deploy migration: gated on sentinel file `/var/lib/left4me/.kernel-overlay-migrated`; stops gameservers + web, force-unmounts stale mounts, wipes upper/work, recreates empty.
- Web unit cleanup: drop `MountFlags=shared`, restore `PrivateTmp=true`, rewrite comment block. Keep `NoNewPrivileges` unset.
- Delete `l4d2host/fs/fuse_overlayfs.py` (currently unused — `start_instance` bypasses it).
- AGENTS.md contracts unchanged.
---
## Current Gap
- `l4d2host/instances.py` `start_instance` calls `fuse-overlayfs` directly (lines 85-101); `stop_instance`/`delete_instance` call `fusermount3 -u` directly. The `OverlayMounter` ABC at `l4d2host/fs/base.py` and the `FuseOverlayFSMounter` impl at `l4d2host/fs/fuse_overlayfs.py` exist but are unused.
- Mounts land in the web service's private mount namespace, invisible to host and to gameserver units. `MountFlags=shared` does not fix it.
- No privileged mount helper exists; only `left4me-systemctl` and `left4me-journalctl`.
- Deploy script installs `fuse-overlayfs` apt package and assumes it as a runtime tool.
- Existing `runtime/<name>/upper` directories may carry `user.fuseoverlayfs.*` xattrs that kernel overlayfs would silently ignore (resurrecting "deleted" files).
---
## Task 1: Helper Script + Sudoers + Mounter Class (RED-first)
**Files:**
- Create: `deploy/files/usr/local/libexec/left4me/left4me-overlay` (Python, mode 0755 after deploy)
- Modify: `deploy/files/etc/sudoers.d/left4me`
- Create: `l4d2host/fs/kernel_overlayfs.py`
- Create: `l4d2host/tests/test_kernel_overlayfs.py`
- Create: `l4d2host/tests/test_overlay_helper.py`
- Modify: `deploy/tests/test_deploy_artifacts.py` (assert helper deployed + sudoers entry)
Test plan (RED first):
1. `test_kernel_overlayfs.py::test_mount_invokes_helper_with_name` — mock `run_command`, call `KernelOverlayFSMounter().mount(lowerdirs="/x:/y", upperdir=Path("/var/lib/left4me/runtime/alpha/upper"), workdir=Path("/var/lib/left4me/runtime/alpha/work"), merged=Path("/var/lib/left4me/runtime/alpha/merged"))`, assert argv `["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", "mount", "alpha"]`.
2. `test_kernel_overlayfs.py::test_unmount_invokes_helper_with_umount_verb` — mock + call + assert argv with `umount`.
3. `test_overlay_helper.py` — drives the helper script as a subprocess with `LEFT4ME_OVERLAY_PRINT_ONLY=1` env var (helper prints the would-be `nsenter …` command line and exits 0 instead of execve), and with isolated `LEFT4ME_ROOT=tmp_path`. Cases:
- Valid mount: prints expected `nsenter --mount=/proc/1/ns/mnt -- /bin/mount -t overlay …` line.
- Valid umount: prints expected umount line.
- Bad name (`../escape`, uppercase, empty): exit non-zero, stderr matches.
- Lowerdir traversal (`/etc`, `/var/lib/left4me/../etc`, symlink escape): exit non-zero.
- Missing `instance.env`: exit non-zero.
- Tainted upperdir (with `user.fuseoverlayfs.opaque` xattr): exit non-zero with clear message. (Optional: skip if `setfattr` is unavailable on dev machine; keep test on Linux only via `pytest.mark.skipif`.)
- Lowerdir count > 500: exit non-zero.
4. `test_deploy_artifacts.py` — assert `/usr/local/libexec/left4me/left4me-overlay` is present in deployed files; sudoers includes the new lines.
Implementation:
- Helper script structure: `argparse` for the verb, then path-validation funcs, then `os.execv("/usr/bin/nsenter", [...])` (or printing it under `LEFT4ME_OVERLAY_PRINT_ONLY`).
- `KernelOverlayFSMounter`: `name = merged.parent.name` (with a one-line comment), then `run_command(["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", verb, name], on_stdout=…, on_stderr=…, passthrough=…, should_cancel=…)`.
**Verification:**
```
python3 -m pytest l4d2host/tests/test_kernel_overlayfs.py l4d2host/tests/test_overlay_helper.py deploy/tests/test_deploy_artifacts.py -q
```
Expected before implementation: FAIL on missing class/script. After: all green.
**Commit:** `feat(l4d2-host): KernelOverlayFSMounter + left4me-overlay helper`
---
## Task 2: Wire OverlayMounter Through Lifecycle + Drop Fuse Module
**Files:**
- Modify: `l4d2host/instances.py` (start/stop/delete)
- Modify: `l4d2host/tests/test_lifecycle.py` (update argv assertions, add double-mount guard test)
- Delete: `l4d2host/fs/fuse_overlayfs.py`
- Verify: `l4d2host/fs/__init__.py` does not re-export `FuseOverlayFSMounter`
Test plan (update RED, then GREEN):
1. `test_lifecycle.py::test_start_order` — change assertion: `calls[0]` is now `["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", "mount", "alpha"]`. Adjust setup so the test still creates the merged directory.
2. `test_lifecycle.py::test_stop_succeeds_when_unmount_fails``cmd[0:5] == ["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", "umount", "alpha"]`.
3. `test_lifecycle.py::test_delete_succeeds_when_unmount_fails` — same.
4. NEW `test_lifecycle.py::test_start_refuses_double_mount` — monkeypatch `os.path.ismount` to return True; expect `start_instance` to raise `subprocess.CalledProcessError`; assert NO mount command was issued.
5. `test_lifecycle.py::test_lifecycle_rejects_unsafe_instance_names` — unchanged.
6. `test_lifecycle.py::test_delete_missing_is_noop` — unchanged.
Implementation:
- `instances.py` imports `KernelOverlayFSMounter`. Module-level singleton instance (`_mounter = KernelOverlayFSMounter()`). Replace direct `run_command([...fuse-overlayfs...])` with `_mounter.mount(...)`. Replace direct `run_command([...fusermount3...])` with `_mounter.unmount(...)` (still inside the existing try/except for stop/delete).
- Add the ismount guard at the top of `start_instance` after `runtime_dir` is computed, before `emit_step("mounting runtime overlay...")`. Raise `subprocess.CalledProcessError(returncode=1, cmd=["mount-guard"], stderr="runtime overlay already mounted at <path>; refusing to double-mount")`.
- Delete `l4d2host/fs/fuse_overlayfs.py`.
- Confirm `l4d2host/fs/__init__.py` is empty (already verified to be 1 line).
**Verification:**
```
python3 -m pytest l4d2host/tests -q
python3 -m pytest l4d2web/tests -q
```
Both green. Web tests: the `"Step: mounting runtime overlay..."` log line is preserved in `start_instance`.
**Commit:** `refactor(l4d2-host): start/stop/delete go through OverlayMounter; drop FuseOverlayFSMounter`
---
## Task 3: Deploy Script Migration (Apt Deps + Wipe Upper/Work)
**Files:**
- Modify: `deploy/deploy-test-server.sh`
- Modify: `deploy/tests/test_deploy_artifacts.py` (assert deploy script contains migration lines; assert `fuse-overlayfs` no longer in apt-get install)
Test plan:
1. `test_deploy_artifacts.py::test_deploy_script_drops_fuse_overlayfs_apt_dep``assert "fuse-overlayfs" not in deploy_script` and `assert "kernel-overlay-migrated" in deploy_script`.
2. `test_deploy_artifacts.py::test_deploy_script_migration_block_uses_sentinel``assert ".kernel-overlay-migrated" in deploy_script`.
Implementation:
In `deploy/deploy-test-server.sh`, drop `fuse-overlayfs` from the apt-get and dnf lines (lines 82, 84). Insert before the existing `systemctl restart left4me-web.service` (line 182):
```sh
# One-time migration: fuse-overlayfs upperdir → kernel overlayfs upperdir.
# fuse-overlayfs running as the left4me user uses user.fuseoverlayfs.* xattrs
# for whiteouts and opaque dirs; kernel overlayfs ignores those, so any
# pre-existing upper/ from the fuse era would resurrect "deleted" files.
sentinel=/var/lib/left4me/.kernel-overlay-migrated
if [ ! -e "$sentinel" ]; then
$sudo_cmd systemctl stop 'left4me-server@*.service' 2>/dev/null || true
$sudo_cmd systemctl stop left4me-web.service 2>/dev/null || true
$sudo_cmd sh -c 'findmnt -t fuse.fuse-overlayfs -o TARGET --noheadings | xargs -r -n1 fusermount3 -u 2>/dev/null || true'
$sudo_cmd sh -c "findmnt -t overlay -o TARGET --noheadings | grep '/var/lib/left4me/runtime/' | xargs -r -n1 umount 2>/dev/null || true"
$sudo_cmd sh -c 'for d in /var/lib/left4me/runtime/*/; do [ -d "$d" ] || continue; rm -rf "$d/upper" "$d/work"; mkdir -p "$d/upper" "$d/work"; chown left4me:left4me "$d/upper" "$d/work"; done'
$sudo_cmd touch "$sentinel"
$sudo_cmd chown left4me:left4me "$sentinel"
fi
```
**Verification:**
```
python3 -m pytest deploy/tests -q
```
Green.
**Commit:** `chore(deploy): drop fuse-overlayfs apt dep + one-shot migrate upper/work`
---
## Task 4: Web Unit Hardening Cleanup + Docs
**Files:**
- Modify: `deploy/files/usr/local/lib/systemd/system/left4me-web.service`
- Modify: `deploy/tests/test_deploy_artifacts.py`
- Modify: `README.md`
- Modify: `l4d2host/README.md`
- Modify: `deploy/README.md`
Test plan:
1. `test_deploy_artifacts.py::test_web_unit_contains_required_runtime_contract` — drop `assert "MountFlags=shared" in unit` (or rather: replace with `assert "MountFlags=" not in unit`); add `assert "PrivateTmp=true" in unit`; add `assert "left4me-overlay" not in unit` (just to be precise — the unit shouldn't reference the helper directly, only via Python code).
Implementation:
Edit `left4me-web.service`:
- Drop `MountFlags=shared`.
- Restore `PrivateTmp=true`.
- Rewrite the comment block above hardening lines to explain: mounts now go through the `left4me-overlay` helper which `nsenter`s into PID 1's mount namespace, so this unit's namespace is irrelevant to gameserver visibility. `NoNewPrivileges` stays unset because sudo is setuid.
README updates:
- `README.md` (line ~59): drop fuse-overlayfs from tech-stack list; replace with "kernel overlayfs via privileged helper".
- `l4d2host/README.md`: lines 29, 52, 64 reference fuse — update to "kernel overlayfs (mount via the `left4me-overlay` helper deployed to `/usr/local/libexec/left4me/`)".
- `deploy/README.md`: add `/usr/local/libexec/left4me/left4me-overlay` to the privileged-helpers inventory.
**Verification:**
```
python3 -m pytest deploy/tests -q
```
Green. Manual readthrough of the three READMEs confirms no stale fuse references.
**Commit:** `chore(deploy): cleanup left4me-web hardening + docs for kernel overlayfs`
---
## Task 5: End-to-End Verification on `ckn@10.0.4.128`
**Pre-deploy:** branch is clean, all four prior commits land, all tests green locally.
**Deploy:**
```
deploy/deploy-test-server.sh ckn@10.0.4.128
```
**Verification commands on the box:**
1. `test -e /var/lib/left4me/.kernel-overlay-migrated && echo migrated` — sentinel created.
2. `systemctl status left4me-web.service --no-pager``active (running)`, recent invocation timestamp.
3. From the UI or via `sudo -u left4me /opt/left4me/.venv/bin/l4d2ctl start test-server` — exit 0.
4. `findmnt /var/lib/left4me/runtime/test-server/merged` — shows fstype `overlay` in the host namespace.
5. `systemctl status left4me-server@test-server --no-pager``active (running)` after the start; **not** in `activating (auto-restart)`. No `status=200/CHDIR` errors in `journalctl -u left4me-server@test-server`.
6. `sudo journalctl -k --since "5 minutes ago" | grep -i apparmor | tail` — no overlay-related denials.
7. Negative test: `sudo -u left4me sudo -n /usr/local/libexec/left4me/left4me-overlay mount '../escape'` — exits non-zero with validation error.
8. Idempotency: `l4d2ctl stop test-server && l4d2ctl stop test-server` — both succeed (per the prior `fix(l4d2-host): make stop_instance idempotent` commit, still holds).
9. Re-start: `l4d2ctl start test-server` — succeeds, `findmnt` shows the mount again.
10. Double-mount guard: while the server is running, attempting another start (not via UI; via Python REPL or a second job) — `start_instance` raises `CalledProcessError` with the "refusing to double-mount" message. Optional, can be left to the unit test.
**On failure of any step:** stop and report. Do NOT push. The deploy script is rerunnable; the migration sentinel stays so wipe doesn't repeat.
---
## Out Of Scope
- See spec's "Out Of Scope" section.
- This plan does not push commits; pushing is a separate user decision after end-to-end verification passes.

View file

@ -0,0 +1,80 @@
# Kernel Overlayfs Helper Design
**Goal:** Replace the per-instance `fuse-overlayfs` mount with kernel-native overlayfs invoked through a privileged sudo helper that mounts in PID 1's mount namespace. Restores host-namespace visibility of the merged overlay so gameserver units (`left4me-server@%i.service`) can `chdir` into it at unshare time.
**Approval status:** User-approved design direction. Implementation proceeds in lockstep with the companion plan at `docs/superpowers/plans/2026-05-08-kernel-overlayfs-helper.md`.
## Context
**Symptom.** After redeploys, starting a gameserver leaves the systemd unit in `activating (auto-restart)` with `status=200/CHDIR — Changing to the requested working directory failed: No such file or directory`. Investigation showed:
- `fuse-overlayfs` running as `left4me` user mounts in `left4me-web.service`'s mount namespace.
- `ProtectSystem=full` + `ReadWritePaths=/var/lib/left4me` forces `PrivateMounts=yes` on the unit (`systemd-analyze security` confirms).
- The unit's bind of `/var/lib/left4me` shows `shared:471 master:1` in `/proc/<pid>/mountinfo` — slave-receive-only — so mounts created beneath it never propagate back to host.
- `MountFlags=shared` (added in commit `1968684` to fix this) sets only the unit's *root* propagation; it does not override the slave-direction propagation that `ProtectSystem`/`ReadWritePaths` apply to their bind mounts. The gameserver unit, on unshare, inherits *host* mounts and sees nothing at the merged path → CHDIR fails.
The system *appeared* to work for ~1d8h before this investigation because the prior fuse daemon happened to land in the host namespace via some transient state. The mechanism documented in `1968684` does not reliably work on systemd 257 with this hardening shape.
**Out-of-scope item now in scope.** The 2026-05-07 workshop-overlays spec already lists this transition at line 211: *"Switch from fuse-overlayfs to kernel overlayfs via a privileged helper. Matches the existing systemd / steam-install sudoers helper pattern under `/usr/local/libexec/left4me/`."* The mount-propagation bug is the trigger to do it now.
## Locked Decisions
1. **Privileged helper does the mount.** New `left4me-overlay` script under `/usr/local/libexec/left4me/`, invoked via `sudo -n`. Mirrors the existing `left4me-systemctl` and `left4me-journalctl` pattern. The helper enters PID 1's mount namespace via `nsenter --mount=/proc/1/ns/mnt` and then calls `/bin/mount -t overlay …` or `/bin/umount`. Result: all overlay mounts live in the host namespace, visible to gameserver units.
2. **Kernel-native overlayfs, not fuse.** Once a privileged helper exists, fuse-overlayfs's rootless-mount-via-setuid-`fusermount3` advantage disappears. Kernel overlayfs is faster, has no long-running daemon, simpler unmount, and one fewer runtime dep.
3. **Helper is Python, not shell.** Path canonicalization, env-file parsing, and lowerdir prefix-allowlist validation are too brittle in shell. Uses system `/usr/bin/python3` (never the venv) and stdlib only. Owned by root, mode 0755.
4. **Verbs are `mount` and `umount`.** Matches the kernel/userspace utility names; reduces cognitive friction over `unmount`.
5. **Helper takes only the instance name as input.** It reads `${LEFT4ME_ROOT:-/var/lib/left4me}/instances/<name>/instance.env` for `L4D2_LOWERDIRS=` and computes `upper`/`work`/`merged` from the runtime root. Equivalent in security to taking lowerdirs as args (the user already controls instance.env), and produces a one-line audit trail in `journalctl _COMM=sudo`.
6. **Strict path validation in the helper.**
- Instance name matches `^[a-z0-9][a-z0-9_-]{0,63}$` (mirrors `validate_instance_name` in `l4d2host/paths.py`).
- Each lowerdir from `L4D2_LOWERDIRS` is `os.path.realpath`'d and must resolve under one of an allowlist: `installation/`, `overlays/`, `global_overlay_cache/`, `workshop_cache/`. Empty entries and traversals are rejected.
- `upper`/`work`/`merged` must resolve exactly to `runtime/<name>/{upper,work,merged}`.
- Lowerdir count ≤ 500 (kernel overlayfs hard cap; was 64 before kernel 5.2).
7. **Whiteout-format guard.** `fuse-overlayfs` running as non-root uses `user.fuseoverlayfs.*` xattrs for whiteouts and opaque dirs, which kernel overlayfs ignores entirely. Before mounting, the helper walks `upperdir` once and refuses if any such xattr is present. Defensive; catches a stale fuse-era upperdir that wasn't wiped during migration.
8. **One-time migration: wipe existing `upper/` and `work/`.** Deploy script runs a gated migration (sentinel file `/var/lib/left4me/.kernel-overlay-migrated`) that stops gameservers, stops web service, unmounts any stale fuse/overlay mounts, recreates empty `upper`/`work` dirs for every instance. Players' in-place edits to merged content are sacrificed; v1 accepts this for a test deployment.
9. **Sudoers verb constraints.** `left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-overlay mount *, /usr/local/libexec/left4me/left4me-overlay umount *`. Defense in depth (real validation lives in the helper); makes `sudo -l` output self-documenting.
10. **Wire the existing `OverlayMounter` ABC through.** `start_instance`/`stop_instance`/`delete_instance` today bypass the abstraction at `l4d2host/fs/base.py`. The new `KernelOverlayFSMounter` replaces the unused `FuseOverlayFSMounter` AND becomes the only path through `instances.py`. `FuseOverlayFSMounter` and the `fuse_overlayfs.py` module are deleted.
11. **Double-mount guard in `start_instance`.** Kernel mounts persist when the web worker dies (unlike fuse daemons, which die with their cgroup). `start_instance` checks `os.path.ismount(merged)` and refuses with a clear error rather than double-mounting.
12. **Hardening cleanup on `left4me-web.service`.** Drop `MountFlags=shared` (no longer the mechanism). Restore `PrivateTmp=true` (was dropped in commit `593611e` for fuse propagation that did not work). Keep `NoNewPrivileges` unset (sudo still requires setuid). Update the comment block to reflect the new model.
13. **AGENTS.md contracts unchanged.** The host library's CLI surface (`install`, `initialize`, `start`, `stop`, `delete`, `status`, `logs`) is unchanged. The web app continues to drive operations via `l4d2ctl`. The fuse-overlayfs implementation detail was never part of the public contract.
## Architecture
```
left4me-web.service (hardened, private mount namespace)
│ start_instance(name=…)
l4d2host.instances.start_instance
│ KernelOverlayFSMounter().mount(merged=…)
sudo -n /usr/local/libexec/left4me/left4me-overlay mount <name>
│ • validate name (regex)
│ • parse instance.env → L4D2_LOWERDIRS
│ • realpath each lowerdir, prefix-allowlist check
│ • compute upper/work/merged under runtime/<name>/
│ • walk upperdir, refuse if any user.fuseoverlayfs.* xattr
nsenter --mount=/proc/1/ns/mnt -- \
/bin/mount -t overlay overlay \
-o "lowerdir=…,upperdir=…,workdir=…" \
/var/lib/left4me/runtime/<name>/merged
host mount namespace now has the overlay; gameserver unit, on
unshare, inherits it and CHDIRs into …/merged/left4dead2 successfully.
```
## Operational Notes
- **Migration ordering on the test box (test-server, …).** The deploy script must, in order: (1) stop all `left4me-server@*.service`, (2) stop `left4me-web.service` (kills any lingering fuse-overlayfs daemons by reaping their cgroup), (3) `findmnt` + force-unmount any leftover fuse/overlay mounts under `/var/lib/left4me/runtime/`, (4) wipe and recreate `upper`/`work` for every instance, (5) deploy + start the new code. The sentinel file `/var/lib/left4me/.kernel-overlay-migrated` gates reruns.
- **Filesystem.** `/var/lib/left4me` is btrfs on the test box. Kernel overlayfs on btrfs is supported on kernel ≥ 5.10; the box is on 6.12 — fine. AppArmor ships enabled on Debian Trixie; verify no overlay-related denials in `journalctl -k` after first start.
- **Concurrency.** Two threads racing on `start_instance` for the same name is a latent issue unaffected by this change. The double-mount guard partly mitigates: the loser hits the existing mount and errors cleanly.
## Out Of Scope
- **Replace `sudo` with `AmbientCapabilities=CAP_SYS_ADMIN`** on a dedicated helper unit. Broader blast radius than the wrapper-script approach.
- **A `systemd-mount` per-instance mount unit.** Considered as the alternative architectural fix but adds more moving parts than the helper-script approach. The helper matches the established privileged-helper pattern in this codebase.
- **Re-enable `NoNewPrivileges` on `left4me-web.service`.** Requires removing sudo; not feasible while the helper invocation pattern stays.
- **Multi-process job-worker-claim safety.** The `_claim_lock` in `l4d2host/services/job_worker.py:131-138` is process-local; correctness depends on `--workers 1`. This change doesn't touch it.
- **Replicating the migration on production deployments.** v1 covers only the test-server deployment shape.