Captures the architectural fix for the mount-propagation bug: replace fuse-overlayfs (rootless mount inside the web service's namespace, never visible to host or to gameserver units) with kernel-native overlayfs mounted via a privileged sudo helper that nsenters into PID 1's mount namespace. Companion plan numbers the migration as five tasks ending in end-to-end verification on the test box. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Kernel Overlayfs Helper Implementation Plan

> **Approval status:** User-approved 2026-05-08. Implementation proceeds.

**Goal:** Implement the kernel-overlayfs migration per `docs/superpowers/specs/2026-05-08-kernel-overlayfs-helper-design.md`. Add a privileged Python helper, `left4me-overlay`; add a `KernelOverlayFSMounter` Python class; wire the existing `OverlayMounter` ABC through `l4d2host/instances.py`; drop `fuse-overlayfs` from the deploy stack; and migrate existing on-disk upper/work directories.

**Architecture:** The web app continues to call `l4d2ctl start|stop|delete <name>`, and `l4d2host` continues to expose the same CLI verbs. Internally, `start_instance`/`stop_instance`/`delete_instance` replace their hardcoded `fuse-overlayfs`/`fusermount3` subprocess calls with `KernelOverlayFSMounter`, which invokes the new sudo helper. The helper mounts in PID 1's mount namespace via `nsenter`.

---

## Locked Decisions

See `docs/superpowers/specs/2026-05-08-kernel-overlayfs-helper-design.md` for the design rationale. Implementation-relevant summary:

- `left4me-overlay` Python helper in `/usr/local/libexec/left4me/`, owned by root, mode 0755, run with the system `/usr/bin/python3`, stdlib only.
- Verbs: `mount <name>`, `umount <name>`.
- Validation in helper: name regex; realpath + allowlist for each lowerdir; exact-prefix check for upper/work/merged; reject upperdir with `user.fuseoverlayfs.*` xattrs; lowerdir count ≤ 500.
- Sudoers entries are verb-constrained: `mount *`, `umount *`.
- `KernelOverlayFSMounter` in `l4d2host/fs/kernel_overlayfs.py` implements `OverlayMounter` and derives `name` from the merged path's parent directory.
- `start_instance` adds an `os.path.ismount(merged)` guard before mounting.
- Deploy migration: gated on sentinel file `/var/lib/left4me/.kernel-overlay-migrated`; stops gameservers + web, force-unmounts stale mounts, wipes upper/work, recreates them empty.
- Web unit cleanup: drop `MountFlags=shared`, restore `PrivateTmp=true`, rewrite comment block. Keep `NoNewPrivileges` unset.
- Delete `l4d2host/fs/fuse_overlayfs.py` (currently unused; `start_instance` bypasses it).
- AGENTS.md contracts unchanged.
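
The validation rules above might look roughly like the following inside the helper. This is a minimal sketch under stated assumptions, not the helper's actual code: the name pattern, the `allowed_roots` parameter, and every function name here are hypothetical, and the authoritative regex and allowlist are whatever the design spec defines. The sketch derives upper/work/merged from the validated name instead of accepting them from the caller, which is one possible way to satisfy the exact-prefix rule.

```python
import os
import re
from pathlib import Path

# Hypothetical values: the real regex and limits are defined by the design spec.
NAME_RE = re.compile(r"[a-z0-9][a-z0-9-]{0,62}")
MAX_LOWERDIRS = 500


def validate_name(name: str) -> str:
    """Reject anything that is not a plain lowercase instance name."""
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid instance name: {name!r}")
    return name


def validate_lowerdirs(lowerdirs: list[str], allowed_roots: list[Path]) -> list[Path]:
    """Resolve each lowerdir and require it to sit under an allowlisted root."""
    if not lowerdirs or len(lowerdirs) > MAX_LOWERDIRS:
        raise ValueError(f"lowerdir count must be 1..{MAX_LOWERDIRS}")
    roots = [Path(os.path.realpath(root)) for root in allowed_roots]
    resolved = []
    for raw in lowerdirs:
        # realpath collapses ".." segments and follows symlinks before the check.
        real = Path(os.path.realpath(raw))
        if not any(real == root or root in real.parents for root in roots):
            raise ValueError(f"lowerdir outside allowlist: {raw}")
        resolved.append(real)
    return resolved


def expected_runtime_paths(root: Path, name: str) -> dict[str, Path]:
    """Derive upper/work/merged from the name; never take them from the caller."""
    base = root / "runtime" / name
    return {kind: base / kind for kind in ("upper", "work", "merged")}
```
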

---

## Current Gap

- `l4d2host/instances.py` `start_instance` calls `fuse-overlayfs` directly (lines 85-101); `stop_instance`/`delete_instance` call `fusermount3 -u` directly. The `OverlayMounter` ABC at `l4d2host/fs/base.py` and the `FuseOverlayFSMounter` impl at `l4d2host/fs/fuse_overlayfs.py` exist but are unused.
- Mounts land in the web service's private mount namespace, invisible to host and to gameserver units. `MountFlags=shared` does not fix it.
- No privileged mount helper exists; only `left4me-systemctl` and `left4me-journalctl`.
- Deploy script installs `fuse-overlayfs` apt package and assumes it as a runtime tool.
- Existing `runtime/<name>/upper` directories may carry `user.fuseoverlayfs.*` xattrs that kernel overlayfs would silently ignore (resurrecting "deleted" files).
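
To make the last point concrete, here is one way a tainted upperdir could be detected. It is a sketch only: `find_tainted_paths` and the injectable `list_xattrs` parameter are hypothetical names introduced for illustration (the injection lets the logic be tested on filesystems without user xattr support), and the helper may well implement its check differently. `os.listxattr` is available on Linux only.

```python
import os
from pathlib import Path

FUSE_XATTR_PREFIX = "user.fuseoverlayfs."


def is_fuse_xattr(attr_name: str) -> bool:
    """True for any attribute in the user.fuseoverlayfs.* family."""
    return attr_name.startswith(FUSE_XATTR_PREFIX)


def _linux_listxattr(path):
    # Linux-only call; do not follow symlinks out of the upperdir.
    return os.listxattr(path, follow_symlinks=False)


def find_tainted_paths(upperdir: Path, list_xattrs=_linux_listxattr) -> list[Path]:
    """Return every path in upperdir (itself included) carrying a fuse-overlayfs xattr."""
    candidates = [Path(upperdir)]
    for dirpath, dirnames, filenames in os.walk(upperdir):
        candidates.extend(Path(dirpath) / entry for entry in dirnames + filenames)
    return [
        path
        for path in candidates
        if any(is_fuse_xattr(attr) for attr in list_xattrs(path))
    ]
```
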

---

## Task 1: Helper Script + Sudoers + Mounter Class (RED-first)

**Files:**
- Create: `deploy/files/usr/local/libexec/left4me/left4me-overlay` (Python, mode 0755 after deploy)
- Modify: `deploy/files/etc/sudoers.d/left4me`
- Create: `l4d2host/fs/kernel_overlayfs.py`
- Create: `l4d2host/tests/test_kernel_overlayfs.py`
- Create: `l4d2host/tests/test_overlay_helper.py`
- Modify: `deploy/tests/test_deploy_artifacts.py` (assert helper deployed + sudoers entry)

Test plan (RED first):

1. `test_kernel_overlayfs.py::test_mount_invokes_helper_with_name`: mock `run_command`, call `KernelOverlayFSMounter().mount(lowerdirs="/x:/y", upperdir=Path("/var/lib/left4me/runtime/alpha/upper"), workdir=Path("/var/lib/left4me/runtime/alpha/work"), merged=Path("/var/lib/left4me/runtime/alpha/merged"))`, assert argv `["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", "mount", "alpha"]`.
2. `test_kernel_overlayfs.py::test_unmount_invokes_helper_with_umount_verb`: mock + call + assert argv with `umount`.
3. `test_overlay_helper.py`: drives the helper script as a subprocess with the `LEFT4ME_OVERLAY_PRINT_ONLY=1` env var set (the helper prints the would-be `nsenter …` command line and exits 0 instead of calling `os.execv`), and with an isolated `LEFT4ME_ROOT=tmp_path`. Cases:
   - Valid mount: prints expected `nsenter --mount=/proc/1/ns/mnt -- /bin/mount -t overlay …` line.
   - Valid umount: prints expected umount line.
   - Bad name (`../escape`, uppercase, empty): exit non-zero, stderr matches.
   - Lowerdir traversal (`/etc`, `/var/lib/left4me/../etc`, symlink escape): exit non-zero.
   - Missing `instance.env`: exit non-zero.
   - Tainted upperdir (with `user.fuseoverlayfs.opaque` xattr): exit non-zero with a clear message. (Optional: skip if `setfattr` is unavailable on the dev machine; restrict the test to Linux via `pytest.mark.skipif`.)
   - Lowerdir count > 500: exit non-zero.
4. `test_deploy_artifacts.py`: assert `/usr/local/libexec/left4me/left4me-overlay` is present in deployed files; sudoers includes the new lines.

Implementation:

- Helper script structure: `argparse` for the verb, then path-validation funcs, then `os.execv("/usr/bin/nsenter", [...])` (or printing the command line under `LEFT4ME_OVERLAY_PRINT_ONLY`).
- `KernelOverlayFSMounter`: `name = merged.parent.name` (with a one-line comment), then `run_command(["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", verb, name], on_stdout=…, on_stderr=…, passthrough=…, should_cancel=…)`.
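
The mounter bullet above could be sketched as follows. This is an illustration under assumptions rather than the planned class: it does not subclass the real `OverlayMounter` ABC, the injected `run` callable is a hypothetical stand-in for `run_command` (whose callback arguments are omitted), and the `unmount` signature is a guess. What it does show is the one piece the plan pins down: the instance name comes from `merged.parent.name`, and only the verb and the name reach the helper.

```python
from pathlib import Path
from typing import Callable, Sequence

HELPER = "/usr/local/libexec/left4me/left4me-overlay"


class KernelOverlayFSMounterSketch:
    """Illustrative stand-in; the real class implements the OverlayMounter ABC."""

    def __init__(self, run: Callable[[Sequence[str]], None]):
        # `run` is a hypothetical injection point standing in for run_command.
        self._run = run

    def _helper(self, verb: str, merged: Path) -> None:
        # merged is .../runtime/<name>/merged, so its parent directory is the instance name.
        name = merged.parent.name
        self._run(["sudo", "-n", HELPER, verb, name])

    def mount(self, *, lowerdirs: str, upperdir: Path, workdir: Path, merged: Path) -> None:
        # The other paths are accepted for interface compatibility only;
        # the helper receives nothing but the verb and the name.
        self._helper("mount", merged)

    def unmount(self, *, merged: Path) -> None:
        self._helper("umount", merged)


# Usage: record argv instead of executing it, as the unit tests would with a mock.
recorded = []
mounter = KernelOverlayFSMounterSketch(run=recorded.append)
mounter.unmount(merged=Path("/var/lib/left4me/runtime/alpha/merged"))
assert recorded == [["sudo", "-n", HELPER, "umount", "alpha"]]
```

Passing only the name keeps the sudoers rule narrow: the unprivileged caller cannot hand arbitrary paths to a root process.
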

**Verification:**

```
python3 -m pytest l4d2host/tests/test_kernel_overlayfs.py l4d2host/tests/test_overlay_helper.py deploy/tests/test_deploy_artifacts.py -q
```

Expected before implementation: FAIL on the missing class and script. After implementation: all green.

**Commit:** `feat(l4d2-host): KernelOverlayFSMounter + left4me-overlay helper`

---

## Task 2: Wire OverlayMounter Through Lifecycle + Drop Fuse Module

**Files:**
- Modify: `l4d2host/instances.py` (start/stop/delete)
- Modify: `l4d2host/tests/test_lifecycle.py` (update argv assertions, add double-mount guard test)
- Delete: `l4d2host/fs/fuse_overlayfs.py`
- Verify: `l4d2host/fs/__init__.py` does not re-export `FuseOverlayFSMounter`

Test plan (update RED, then GREEN):

1. `test_lifecycle.py::test_start_order`: change the assertion so that `calls[0]` is now `["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", "mount", "alpha"]`. Adjust setup so the test still creates the merged directory.
2. `test_lifecycle.py::test_stop_succeeds_when_unmount_fails`: `cmd[0:5] == ["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", "umount", "alpha"]`.
3. `test_lifecycle.py::test_delete_succeeds_when_unmount_fails`: same assertion.
4. NEW `test_lifecycle.py::test_start_refuses_double_mount`: monkeypatch `os.path.ismount` to return True; expect `start_instance` to raise `subprocess.CalledProcessError`; assert NO mount command was issued.
5. `test_lifecycle.py::test_lifecycle_rejects_unsafe_instance_names`: unchanged.
6. `test_lifecycle.py::test_delete_missing_is_noop`: unchanged.

Implementation:

- `instances.py` imports `KernelOverlayFSMounter` and creates a module-level singleton (`_mounter = KernelOverlayFSMounter()`). Replace the direct `run_command([...fuse-overlayfs...])` with `_mounter.mount(...)`. Replace the direct `run_command([...fusermount3...])` with `_mounter.unmount(...)` (still inside the existing try/except for stop/delete).
- Add the ismount guard at the top of `start_instance`, after `runtime_dir` is computed and before `emit_step("mounting runtime overlay...")`. Raise `subprocess.CalledProcessError(returncode=1, cmd=["mount-guard"], stderr="runtime overlay already mounted at <path>; refusing to double-mount")`.
- Delete `l4d2host/fs/fuse_overlayfs.py`.
- Confirm `l4d2host/fs/__init__.py` is effectively empty, with no `FuseOverlayFSMounter` re-export (already verified to be 1 line).
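
The guard described above is small enough to sketch in full. `ensure_not_mounted` is a hypothetical name; in the plan the check sits inline at the top of `start_instance`, and the exception arguments below are copied from the bullet above.

```python
import os
import subprocess
from pathlib import Path


def ensure_not_mounted(merged: Path) -> None:
    """Raise before any mount command is issued if merged is already a mount point."""
    if os.path.ismount(merged):
        raise subprocess.CalledProcessError(
            returncode=1,
            cmd=["mount-guard"],
            stderr=f"runtime overlay already mounted at {merged}; refusing to double-mount",
        )
```

Because the guard raises the same exception type that a failed mount command would, callers that already handle `CalledProcessError` from `start_instance` need no new error path.
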

**Verification:**

```
python3 -m pytest l4d2host/tests -q
python3 -m pytest l4d2web/tests -q
```

Both green. The web tests stay green because the `"Step: mounting runtime overlay..."` log line is preserved in `start_instance`.

**Commit:** `refactor(l4d2-host): start/stop/delete go through OverlayMounter; drop FuseOverlayFSMounter`

---

## Task 3: Deploy Script Migration (Apt Deps + Wipe Upper/Work)

**Files:**
- Modify: `deploy/deploy-test-server.sh`
- Modify: `deploy/tests/test_deploy_artifacts.py` (assert deploy script contains migration lines; assert `fuse-overlayfs` no longer in apt-get install)

Test plan:

1. `test_deploy_artifacts.py::test_deploy_script_drops_fuse_overlayfs_apt_dep`: assert that no `apt-get`/`dnf` install line in `deploy_script` mentions `fuse-overlayfs`, and `assert "kernel-overlay-migrated" in deploy_script`. A whole-file `"fuse-overlayfs" not in deploy_script` check would fail here, because the migration block itself names `fuse-overlayfs` in its comments and in the `findmnt -t fuse.fuse-overlayfs` filter.
2. `test_deploy_artifacts.py::test_deploy_script_migration_block_uses_sentinel`: `assert ".kernel-overlay-migrated" in deploy_script`.
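
One way to scope the first assertion to package-install lines is sketched below. `install_lines_mentioning` is a hypothetical helper, the sample script text is made up for the example, and the line-based match is deliberately naive: it would miss an install command continued across lines with a trailing backslash.

```python
import re

# Matches a line that invokes apt-get or dnf with an "install" subcommand.
INSTALL_LINE = re.compile(r"\b(apt-get|dnf)\b.*\binstall\b")


def install_lines_mentioning(script: str, package: str) -> list[str]:
    """Return the package-manager install lines that still name `package`."""
    return [
        line
        for line in script.splitlines()
        if INSTALL_LINE.search(line) and package in line
    ]


# Made-up sample: the install line is clean, while the migration block
# legitimately mentions fuse-overlayfs.
sample = """\
apt-get install -y python3 fuse3
# One-time migration: fuse-overlayfs upperdir -> kernel overlayfs upperdir.
findmnt -t fuse.fuse-overlayfs -o TARGET --noheadings
sentinel=/var/lib/left4me/.kernel-overlay-migrated
"""
assert install_lines_mentioning(sample, "fuse-overlayfs") == []
assert "fuse-overlayfs" in sample  # a whole-file check would have failed
```
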

Implementation:

In `deploy/deploy-test-server.sh`, drop `fuse-overlayfs` from the apt-get and dnf lines (lines 82, 84). Insert before the existing `systemctl restart left4me-web.service` (line 182):

```sh
# One-time migration: fuse-overlayfs upperdir → kernel overlayfs upperdir.
# fuse-overlayfs running as the left4me user uses user.fuseoverlayfs.* xattrs
# for whiteouts and opaque dirs; kernel overlayfs ignores those, so any
# pre-existing upper/ from the fuse era would resurrect "deleted" files.
sentinel=/var/lib/left4me/.kernel-overlay-migrated
if [ ! -e "$sentinel" ]; then
    # Stop everything that could hold a runtime mount open.
    $sudo_cmd systemctl stop 'left4me-server@*.service' 2>/dev/null || true
    $sudo_cmd systemctl stop left4me-web.service 2>/dev/null || true
    # Force-unmount leftovers: fuse-era mounts first, then any overlay under runtime/.
    $sudo_cmd sh -c 'findmnt -t fuse.fuse-overlayfs -o TARGET --noheadings | xargs -r -n1 fusermount3 -u 2>/dev/null || true'
    $sudo_cmd sh -c "findmnt -t overlay -o TARGET --noheadings | grep '/var/lib/left4me/runtime/' | xargs -r -n1 umount 2>/dev/null || true"
    # Wipe and recreate upper/work for every instance.
    $sudo_cmd sh -c 'for d in /var/lib/left4me/runtime/*/; do [ -d "$d" ] || continue; rm -rf "$d/upper" "$d/work"; mkdir -p "$d/upper" "$d/work"; chown left4me:left4me "$d/upper" "$d/work"; done'
    # Mark the migration done so later deploys skip the wipe.
    $sudo_cmd touch "$sentinel"
    $sudo_cmd chown left4me:left4me "$sentinel"
fi
```
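
The sentinel gating in the block above is what makes the wipe safe to leave in a rerunnable deploy script. The same logic, reduced to its core, can be sketched in Python and exercised against a temporary directory. `migrate_once` is a hypothetical function for illustration only; it omits the service stops, the unmounts, and the `chown` calls that the shell block performs.

```python
import shutil
from pathlib import Path


def migrate_once(root: Path) -> bool:
    """Wipe runtime/*/upper and runtime/*/work once; return True if the wipe ran."""
    sentinel = root / ".kernel-overlay-migrated"
    if sentinel.exists():
        return False  # already migrated: later runs leave instance data alone
    runtime = root / "runtime"
    instances = sorted(p for p in runtime.iterdir() if p.is_dir()) if runtime.is_dir() else []
    for instance in instances:
        for sub in ("upper", "work"):
            target = instance / sub
            shutil.rmtree(target, ignore_errors=True)
            target.mkdir(parents=True, exist_ok=True)
    sentinel.touch()
    return True
```

A second call returns `False` without touching anything, which mirrors how a repeated deploy skips the wipe once the sentinel exists.
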

**Verification:**

```
python3 -m pytest deploy/tests -q
```

Green.

**Commit:** `chore(deploy): drop fuse-overlayfs apt dep + one-shot migrate upper/work`

---

## Task 4: Web Unit Hardening Cleanup + Docs

**Files:**
- Modify: `deploy/files/usr/local/lib/systemd/system/left4me-web.service`
- Modify: `deploy/tests/test_deploy_artifacts.py`
- Modify: `README.md`
- Modify: `l4d2host/README.md`
- Modify: `deploy/README.md`

Test plan:

1. `test_deploy_artifacts.py::test_web_unit_contains_required_runtime_contract`: replace `assert "MountFlags=shared" in unit` with `assert "MountFlags=" not in unit`; add `assert "PrivateTmp=true" in unit`; add `assert "left4me-overlay" not in unit` (the unit should never reference the helper directly; only the Python code invokes it).

Implementation:

Edit `left4me-web.service`:

- Drop `MountFlags=shared`.
- Restore `PrivateTmp=true`.
- Rewrite the comment block above the hardening lines to explain that mounts now go through the `left4me-overlay` helper, which `nsenter`s into PID 1's mount namespace, so this unit's namespace is irrelevant to gameserver visibility. `NoNewPrivileges` stays unset because sudo is setuid.

README updates:

- `README.md` (line ~59): drop fuse-overlayfs from the tech-stack list; replace with "kernel overlayfs via privileged helper".
- `l4d2host/README.md`: lines 29, 52, 64 reference fuse; update them to "kernel overlayfs (mount via the `left4me-overlay` helper deployed to `/usr/local/libexec/left4me/`)".
- `deploy/README.md`: add `/usr/local/libexec/left4me/left4me-overlay` to the privileged-helpers inventory.

**Verification:**

```
python3 -m pytest deploy/tests -q
```

Green. A manual readthrough of the three READMEs confirms no stale fuse references remain.

**Commit:** `chore(deploy): cleanup left4me-web hardening + docs for kernel overlayfs`

---

## Task 5: End-to-End Verification on `ckn@10.0.4.128`

**Pre-deploy:** the branch is clean, all four prior commits have landed, and all tests are green locally.

**Deploy:**

```
deploy/deploy-test-server.sh ckn@10.0.4.128
```

**Verification commands on the box:**

1. `test -e /var/lib/left4me/.kernel-overlay-migrated && echo migrated`: sentinel created.
2. `systemctl status left4me-web.service --no-pager`: `active (running)`, recent invocation timestamp.
3. From the UI or via `sudo -u left4me /opt/left4me/.venv/bin/l4d2ctl start test-server`: exit 0.
4. `findmnt /var/lib/left4me/runtime/test-server/merged`: shows fstype `overlay` in the host namespace.
5. `systemctl status left4me-server@test-server --no-pager`: `active (running)` after the start; **not** in `activating (auto-restart)`. No `status=200/CHDIR` errors in `journalctl -u left4me-server@test-server`.
6. `sudo journalctl -k --since "5 minutes ago" | grep -i apparmor | tail`: no overlay-related denials.
7. Negative test: `sudo -u left4me sudo -n /usr/local/libexec/left4me/left4me-overlay mount '../escape'` exits non-zero with a validation error.
8. Idempotency: `l4d2ctl stop test-server && l4d2ctl stop test-server`: both succeed (the behavior from the prior `fix(l4d2-host): make stop_instance idempotent` commit still holds).
9. Re-start: `l4d2ctl start test-server` succeeds, and `findmnt` shows the mount again.
10. Double-mount guard: while the server is running, attempt another start (not via the UI; use a Python REPL or a second job). `start_instance` raises `CalledProcessError` with the "refusing to double-mount" message. Optional; this can be left to the unit test.
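
Step 4 can also be checked programmatically. The sketch below parses text in the `/proc/self/mountinfo` format and reports the filesystem type of a mount point. `fstype_of` is a hypothetical helper introduced for illustration, not part of the plan, and it does not decode the octal escapes that mountinfo uses for unusual characters in paths.

```python
from typing import Optional


def fstype_of(mountinfo_text: str, mount_point: str) -> Optional[str]:
    """Return the fstype mounted at mount_point, or None if nothing is mounted there."""
    fstype = None
    for line in mountinfo_text.splitlines():
        # Each line has a " - " separator; the fstype is the first field after it.
        before, sep, after = line.partition(" - ")
        if not sep:
            continue
        fields = before.split()
        # Field 5 (index 4) is the mount point.
        if len(fields) >= 5 and fields[4] == mount_point:
            fstype = after.split()[0]  # a later entry shadows an earlier one
    return fstype
```

On the test box, running it from a host shell as `fstype_of(open("/proc/self/mountinfo").read(), "/var/lib/left4me/runtime/test-server/merged")` should agree with what `findmnt` reports in step 4.
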

**On failure of any step:** stop and report. Do NOT push. The deploy script is rerunnable; the migration sentinel stays in place, so the wipe does not repeat.

---

## Out Of Scope

- See spec's "Out Of Scope" section.
- This plan does not push commits; pushing is a separate user decision after end-to-end verification passes.