left4me/docs/superpowers/plans/2026-05-08-kernel-overlayfs-helper.md
mwiegand db120d77d3
docs(specs): kernel overlayfs migration design + plan
Captures the architectural fix for the mount-propagation bug: replace
fuse-overlayfs (rootless mount inside the web service's namespace, never
visible to host or to gameserver units) with kernel-native overlayfs
mounted via a privileged sudo helper that nsenters into PID 1's mount
namespace. Companion plan numbers the migration as five tasks ending in
end-to-end verification on the test box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:19:26 +02:00


Kernel Overlayfs Helper Implementation Plan

Approval status: User-approved 2026-05-08. Implementation proceeds.

Goal: Implement the kernel-overlayfs migration per docs/superpowers/specs/2026-05-08-kernel-overlayfs-helper-design.md: add a privileged Python helper (left4me-overlay) and a KernelOverlayFSMounter Python class, wire the existing OverlayMounter ABC through l4d2host/instances.py, drop fuse-overlayfs from the deploy stack, and migrate existing on-disk upper/work directories.

Architecture: The web app continues to call l4d2ctl start|stop|delete <name>, and l4d2host continues to expose the same CLI verbs. Internally, start_instance, stop_instance, and delete_instance stop shelling out directly to fuse-overlayfs and fusermount3 and instead call KernelOverlayFSMounter, which invokes the new sudo helper; the helper mounts in PID 1's mount namespace via nsenter.


Locked Decisions

See docs/superpowers/specs/2026-05-08-kernel-overlayfs-helper-design.md for the design rationale. Implementation-relevant summary:

  • left4me-overlay Python helper in /usr/local/libexec/left4me/, owned root, mode 0755, system /usr/bin/python3, stdlib only.
  • Verbs: mount <name>, umount <name>.
  • Validation in helper: name regex; realpath + allowlist for each lowerdir; exact-prefix check for upper/work/merged; reject upperdir with user.fuseoverlayfs.* xattrs; lowerdir count ≤ 500.
  • Sudoers verb-constrained: mount *, umount *.
  • KernelOverlayFSMounter in l4d2host/fs/kernel_overlayfs.py — implements OverlayMounter. Derives name from the merged path's parent.
  • start_instance adds os.path.ismount(merged) guard before mounting.
  • Deploy migration: gated on sentinel file /var/lib/left4me/.kernel-overlay-migrated; stops gameservers + web, force-unmounts stale mounts, wipes upper/work, recreates empty.
  • Web unit cleanup: drop MountFlags=shared, restore PrivateTmp=true, rewrite comment block. Keep NoNewPrivileges unset.
  • Delete l4d2host/fs/fuse_overlayfs.py (currently unused — start_instance bypasses it).
  • AGENTS.md contracts unchanged.
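
The name and lowerdir checks listed above could look roughly like the following. This is a minimal sketch, not the helper's actual code: the plan only says "name regex" and "allowlist", so the pattern in NAME_RE and the allowed_roots parameter are illustrative assumptions; only the 500-entry limit and the realpath-then-allowlist order come from the plan.

```python
import os
import re

# Assumption: the plan does not spell out the regex; this lowercase-only
# pattern is illustrative and rejects "../escape", uppercase, and "".
NAME_RE = re.compile(r"[a-z0-9][a-z0-9_-]{0,63}")
MAX_LOWERDIRS = 500  # limit stated in the plan


def validate_name(name: str) -> str:
    """Return name unchanged if it matches the pattern, else raise ValueError."""
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid instance name: {name!r}")
    return name


def validate_lowerdirs(lowerdirs: str, allowed_roots: list[str]) -> list[str]:
    """Resolve each colon-separated lowerdir and require it to sit under an allowed root."""
    parts = lowerdirs.split(":")
    if len(parts) > MAX_LOWERDIRS:
        raise ValueError(f"too many lowerdirs: {len(parts)} > {MAX_LOWERDIRS}")
    roots = [os.path.realpath(root) for root in allowed_roots]
    resolved = []
    for part in parts:
        if not part:
            raise ValueError("empty lowerdir entry")
        # realpath collapses ".." and follows symlinks before the allowlist check,
        # so traversal and symlink escapes are judged by where they really point.
        real = os.path.realpath(part)
        if not any(real == root or real.startswith(root + os.sep) for root in roots):
            raise ValueError(f"lowerdir outside allowlist: {part!r}")
        resolved.append(real)
    return resolved
```

Comparing against root + os.sep rather than the bare root keeps a sibling directory such as installs-evil from passing a check meant for installs.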

Current Gap

  • l4d2host/instances.py start_instance calls fuse-overlayfs directly (lines 85-101); stop_instance/delete_instance call fusermount3 -u directly. The OverlayMounter ABC at l4d2host/fs/base.py and the FuseOverlayFSMounter impl at l4d2host/fs/fuse_overlayfs.py exist but are unused.
  • Mounts land in the web service's private mount namespace, invisible to host and to gameserver units. MountFlags=shared does not fix it.
  • No privileged mount helper exists; only left4me-systemctl and left4me-journalctl.
  • Deploy script installs fuse-overlayfs apt package and assumes it as a runtime tool.
  • Existing runtime/<name>/upper directories may carry user.fuseoverlayfs.* xattrs that kernel overlayfs would silently ignore (resurrecting "deleted" files).
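
The taint check for leftover fuse-era xattrs might be sketched as below. The function names are hypothetical; the only detail taken from the plan is the user.fuseoverlayfs. prefix. os.listxattr is Linux-only, so the walker looks it up defensively.

```python
import os

FUSE_XATTR_PREFIX = "user.fuseoverlayfs."  # prefix named in the plan


def fuse_xattrs(names):
    """Return the xattr names that belong to fuse-overlayfs."""
    return [name for name in names if name.startswith(FUSE_XATTR_PREFIX)]


def find_tainted_paths(upperdir):
    """Yield (path, xattr_names) for upperdir and every entry beneath it
    that still carries fuse-overlayfs xattrs."""
    listxattr = getattr(os, "listxattr", None)  # only available on Linux
    if listxattr is None:
        raise RuntimeError("os.listxattr is unavailable on this platform")

    def check(path):
        hits = fuse_xattrs(listxattr(path, follow_symlinks=False))
        if hits:
            yield path, hits

    yield from check(upperdir)
    for dirpath, dirnames, filenames in os.walk(upperdir):
        for name in dirnames + filenames:
            yield from check(os.path.join(dirpath, name))
```

A helper built on this would refuse to mount when next(find_tainted_paths(upper), None) is not None, which matches the "reject upperdir with user.fuseoverlayfs.* xattrs" decision.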

Task 1: Helper Script + Sudoers + Mounter Class (RED-first)

Files:

  • Create: deploy/files/usr/local/libexec/left4me/left4me-overlay (Python, mode 0755 after deploy)
  • Modify: deploy/files/etc/sudoers.d/left4me
  • Create: l4d2host/fs/kernel_overlayfs.py
  • Create: l4d2host/tests/test_kernel_overlayfs.py
  • Create: l4d2host/tests/test_overlay_helper.py
  • Modify: deploy/tests/test_deploy_artifacts.py (assert helper deployed + sudoers entry)

Test plan (RED first):

  1. test_kernel_overlayfs.py::test_mount_invokes_helper_with_name — mock run_command, call KernelOverlayFSMounter().mount(lowerdirs="/x:/y", upperdir=Path("/var/lib/left4me/runtime/alpha/upper"), workdir=Path("/var/lib/left4me/runtime/alpha/work"), merged=Path("/var/lib/left4me/runtime/alpha/merged")), assert argv ["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", "mount", "alpha"].
  2. test_kernel_overlayfs.py::test_unmount_invokes_helper_with_umount_verb — mock + call + assert argv with umount.
  3. test_overlay_helper.py — drives the helper script as a subprocess with LEFT4ME_OVERLAY_PRINT_ONLY=1 env var (helper prints the would-be nsenter … command line and exits 0 instead of execve), and with isolated LEFT4ME_ROOT=tmp_path. Cases:
    • Valid mount: prints expected nsenter --mount=/proc/1/ns/mnt -- /bin/mount -t overlay … line.
    • Valid umount: prints expected umount line.
    • Bad name (../escape, uppercase, empty): exit non-zero, stderr matches.
    • Lowerdir traversal (/etc, /var/lib/left4me/../etc, symlink escape): exit non-zero.
    • Missing instance.env: exit non-zero.
    • Tainted upperdir (with user.fuseoverlayfs.opaque xattr): exit non-zero with clear message. (Optional: skip if setfattr is unavailable on dev machine; keep test on Linux only via pytest.mark.skipif.)
    • Lowerdir count > 500: exit non-zero.
  4. test_deploy_artifacts.py — assert /usr/local/libexec/left4me/left4me-overlay is present in deployed files; sudoers includes the new lines.
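
The print-only switch that item 3 relies on could be structured as in this sketch. The nsenter prefix and the LEFT4ME_OVERLAY_PRINT_ONLY variable come from the plan; the overlay option string and the /bin/umount path are assumptions, since the plan elides the rest of the command line.

```python
import os
import shlex
import sys

NSENTER = "/usr/bin/nsenter"  # path given in the plan
NS_PREFIX = [NSENTER, "--mount=/proc/1/ns/mnt", "--"]


def build_mount_argv(lowerdirs, upperdir, workdir, merged):
    # Assumption: the usual kernel overlayfs option string; the plan does not
    # spell out what follows "-t overlay".
    options = f"lowerdir={lowerdirs},upperdir={upperdir},workdir={workdir}"
    return NS_PREFIX + ["/bin/mount", "-t", "overlay", "overlay", "-o", options, str(merged)]


def build_umount_argv(merged):
    # Assumption: /bin/umount mirrors the /bin/mount path used for mounting.
    return NS_PREFIX + ["/bin/umount", str(merged)]


def run_or_print(argv, environ=None, out=None):
    """Print the command under LEFT4ME_OVERLAY_PRINT_ONLY=1, otherwise exec it."""
    environ = os.environ if environ is None else environ
    out = sys.stdout if out is None else out
    if environ.get("LEFT4ME_OVERLAY_PRINT_ONLY") == "1":
        print(shlex.join(argv), file=out)
        return 0
    os.execv(argv[0], argv)  # replaces the process; does not return on success
```

Keeping argv construction separate from the exec step lets the subprocess tests compare the printed line without ever needing root.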

Implementation:

  • Helper script structure: argparse for the verb, then path-validation funcs, then os.execv("/usr/bin/nsenter", [...]) (or printing it under LEFT4ME_OVERLAY_PRINT_ONLY).
  • KernelOverlayFSMounter: name = merged.parent.name (with a one-line comment), then run_command(["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", verb, name], on_stdout=…, on_stderr=…, passthrough=…, should_cancel=…).
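
A minimal sketch of the mounter's shape, under stated assumptions: the real class calls the project's run_command with its callback keyword arguments, which are not reproduced here, so this version takes an injected run callable instead. The class name, the unmount signature, and the constructor are illustrative; the argv and the merged.parent.name derivation follow the plan.

```python
from pathlib import Path
from typing import Callable, Sequence

HELPER = "/usr/local/libexec/left4me/left4me-overlay"


class KernelOverlayFSMounterSketch:
    """Illustrative stand-in for KernelOverlayFSMounter with an injectable runner."""

    def __init__(self, run: Callable[[Sequence[str]], object]):
        self._run = run  # hypothetical seam; the plan uses run_command directly

    def _call_helper(self, verb: str, merged: Path) -> None:
        # The instance name is the directory holding merged/ (runtime/<name>/merged).
        name = merged.parent.name
        self._run(["sudo", "-n", HELPER, verb, name])

    def mount(self, lowerdirs: str, upperdir: Path, workdir: Path, merged: Path) -> None:
        # Only the name is passed on; the helper re-derives and validates all paths.
        self._call_helper("mount", merged)

    def unmount(self, merged: Path) -> None:
        self._call_helper("umount", merged)
```

With calls = [] and KernelOverlayFSMounterSketch(run=calls.append), a test can assert on the recorded argv in the same way the planned tests mock run_command.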

Verification:

python3 -m pytest l4d2host/tests/test_kernel_overlayfs.py l4d2host/tests/test_overlay_helper.py deploy/tests/test_deploy_artifacts.py -q

Expected before implementation: FAIL on missing class/script. After: all green.

Commit: feat(l4d2-host): KernelOverlayFSMounter + left4me-overlay helper


Task 2: Wire OverlayMounter Through Lifecycle + Drop Fuse Module

Files:

  • Modify: l4d2host/instances.py (start/stop/delete)
  • Modify: l4d2host/tests/test_lifecycle.py (update argv assertions, add double-mount guard test)
  • Delete: l4d2host/fs/fuse_overlayfs.py
  • Verify: l4d2host/fs/__init__.py does not re-export FuseOverlayFSMounter

Test plan (update RED, then GREEN):

  1. test_lifecycle.py::test_start_order — change assertion: calls[0] is now ["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", "mount", "alpha"]. Adjust setup so the test still creates the merged directory.
  2. test_lifecycle.py::test_stop_succeeds_when_unmount_fails: assert cmd[0:5] == ["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", "umount", "alpha"].
  3. test_lifecycle.py::test_delete_succeeds_when_unmount_fails — same.
  4. NEW test_lifecycle.py::test_start_refuses_double_mount — monkeypatch os.path.ismount to return True; expect start_instance to raise subprocess.CalledProcessError; assert NO mount command was issued.
  5. test_lifecycle.py::test_lifecycle_rejects_unsafe_instance_names — unchanged.
  6. test_lifecycle.py::test_delete_missing_is_noop — unchanged.

Implementation:

  • instances.py imports KernelOverlayFSMounter. Module-level singleton instance (_mounter = KernelOverlayFSMounter()). Replace direct run_command([...fuse-overlayfs...]) with _mounter.mount(...). Replace direct run_command([...fusermount3...]) with _mounter.unmount(...) (still inside the existing try/except for stop/delete).
  • Add the ismount guard at the top of start_instance after runtime_dir is computed, before emit_step("mounting runtime overlay..."). Raise subprocess.CalledProcessError(returncode=1, cmd=["mount-guard"], stderr="runtime overlay already mounted at <path>; refusing to double-mount").
  • Delete l4d2host/fs/fuse_overlayfs.py.
  • Confirm l4d2host/fs/__init__.py is empty (already verified to be 1 line).
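
The double-mount guard described above might look like this sketch. The function name and the injectable ismount parameter are assumptions made for testability; the exception type, return code, cmd, and message follow the plan.

```python
import os
import subprocess
from pathlib import Path


def refuse_double_mount(merged: Path, ismount=os.path.ismount) -> None:
    """Raise CalledProcessError if merged is already a mount point."""
    if ismount(merged):
        raise subprocess.CalledProcessError(
            returncode=1,
            cmd=["mount-guard"],
            stderr=f"runtime overlay already mounted at {merged}; refusing to double-mount",
        )
```

Raising CalledProcessError rather than a custom exception means callers that already handle failed subprocess calls need no new error path.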

Verification:

python3 -m pytest l4d2host/tests -q
python3 -m pytest l4d2web/tests -q

Both green. Web tests: the "Step: mounting runtime overlay..." log line is preserved in start_instance.

Commit: refactor(l4d2-host): start/stop/delete go through OverlayMounter; drop FuseOverlayFSMounter


Task 3: Deploy Script Migration (Apt Deps + Wipe Upper/Work)

Files:

  • Modify: deploy/deploy-test-server.sh
  • Modify: deploy/tests/test_deploy_artifacts.py (assert deploy script contains migration lines; assert fuse-overlayfs no longer in apt-get install)

Test plan:

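  1. test_deploy_artifacts.py::test_deploy_script_drops_fuse_overlayfs_apt_dep: assert "fuse-overlayfs" does not appear on the apt-get or dnf install lines of deploy_script (a whole-file assert "fuse-overlayfs" not in deploy_script would trip on the migration block, whose comments and findmnt filter mention it), and assert "kernel-overlay-migrated" in deploy_script.
  2. test_deploy_artifacts.py::test_deploy_script_migration_block_uses_sentinel: assert ".kernel-overlay-migrated" in deploy_script.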
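
Because the migration block itself mentions fuse-overlayfs (in its comments and in the findmnt filter), the package check has to look only at install lines. One way to sketch that, assuming each install command sits on a single line and using hypothetical function names:

```python
import re

# Assumption: install commands are single lines that name apt-get or dnf
# followed somewhere by "install"; continuation lines are not handled.
_INSTALL_RE = re.compile(r"\b(apt-get|dnf)\b.*\binstall\b")


def package_install_lines(script: str) -> list[str]:
    """Return non-comment lines that look like apt-get/dnf install commands."""
    return [
        line
        for line in script.splitlines()
        if _INSTALL_RE.search(line) and not line.lstrip().startswith("#")
    ]


def check_deploy_script(script: str) -> list[str]:
    """Return a list of problems; an empty list means the script passes."""
    problems = []
    for line in package_install_lines(script):
        if "fuse-overlayfs" in line:
            problems.append(f"fuse-overlayfs still installed: {line.strip()}")
    if ".kernel-overlay-migrated" not in script:
        problems.append("migration sentinel missing")
    return problems
```

A pytest case would then reduce to assert check_deploy_script(deploy_script) == [].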

Implementation:

In deploy/deploy-test-server.sh, drop fuse-overlayfs from the apt-get and dnf lines (lines 82, 84). Insert before the existing systemctl restart left4me-web.service (line 182):

# One-time migration: fuse-overlayfs upperdir → kernel overlayfs upperdir.
# fuse-overlayfs running as the left4me user uses user.fuseoverlayfs.* xattrs
# for whiteouts and opaque dirs; kernel overlayfs ignores those, so any
# pre-existing upper/ from the fuse era would resurrect "deleted" files.
sentinel=/var/lib/left4me/.kernel-overlay-migrated
if [ ! -e "$sentinel" ]; then
    $sudo_cmd systemctl stop 'left4me-server@*.service' 2>/dev/null || true
    $sudo_cmd systemctl stop left4me-web.service 2>/dev/null || true
    $sudo_cmd sh -c 'findmnt -t fuse.fuse-overlayfs -o TARGET --noheadings | xargs -r -n1 fusermount3 -u 2>/dev/null || true'
    $sudo_cmd sh -c "findmnt -t overlay -o TARGET --noheadings | grep '/var/lib/left4me/runtime/' | xargs -r -n1 umount 2>/dev/null || true"
    $sudo_cmd sh -c 'for d in /var/lib/left4me/runtime/*/; do [ -d "$d" ] || continue; rm -rf "$d/upper" "$d/work"; mkdir -p "$d/upper" "$d/work"; chown left4me:left4me "$d/upper" "$d/work"; done'
    $sudo_cmd touch "$sentinel"
    $sudo_cmd chown left4me:left4me "$sentinel"
fi

Verification:

python3 -m pytest deploy/tests -q

Green.

Commit: chore(deploy): drop fuse-overlayfs apt dep + one-shot migrate upper/work


Task 4: Web Unit Hardening Cleanup + Docs

Files:

  • Modify: deploy/files/usr/local/lib/systemd/system/left4me-web.service
  • Modify: deploy/tests/test_deploy_artifacts.py
  • Modify: README.md
  • Modify: l4d2host/README.md
  • Modify: deploy/README.md

Test plan:

  1. test_deploy_artifacts.py::test_web_unit_contains_required_runtime_contract: replace assert "MountFlags=shared" in unit with assert "MountFlags=" not in unit; add assert "PrivateTmp=true" in unit; add a check that no directive line references left4me-overlay (the unit reaches the helper only through Python code, never directly). Scope the negative checks to non-comment lines, because the rewritten comment block is expected to name the helper.
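
Since the unit's comment block will describe the helper, checks that something is absent are safer when run against directive lines only. A small sketch with hypothetical function names, relying on the fact that systemd treats lines starting with # or ; as comments:

```python
def unit_directives(unit_text: str) -> list[str]:
    """Return stripped, non-empty, non-comment lines of a systemd unit file."""
    directives = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        if line and not line.startswith(("#", ";")):
            directives.append(line)
    return directives


def web_unit_problems(unit_text: str) -> list[str]:
    """Return contract violations for the web unit; empty means it passes."""
    directives = unit_directives(unit_text)
    problems = []
    if "PrivateTmp=true" not in directives:
        problems.append("PrivateTmp=true missing")
    if any(line.startswith("MountFlags=") for line in directives):
        problems.append("MountFlags= still set")
    if any("left4me-overlay" in line for line in directives):
        problems.append("unit references left4me-overlay directly")
    return problems
```

The existing test could call web_unit_problems(unit) and assert that the result is empty, keeping the positive and negative checks in one place.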

Implementation:

Edit left4me-web.service:

  • Drop MountFlags=shared.
  • Restore PrivateTmp=true.
  • Rewrite the comment block above hardening lines to explain: mounts now go through the left4me-overlay helper which nsenters into PID 1's mount namespace, so this unit's namespace is irrelevant to gameserver visibility. NoNewPrivileges stays unset because sudo is setuid.

README updates:

  • README.md (line ~59): drop fuse-overlayfs from tech-stack list; replace with "kernel overlayfs via privileged helper".
  • l4d2host/README.md: lines 29, 52, 64 reference fuse — update to "kernel overlayfs (mount via the left4me-overlay helper deployed to /usr/local/libexec/left4me/)".
  • deploy/README.md: add /usr/local/libexec/left4me/left4me-overlay to the privileged-helpers inventory.

Verification:

python3 -m pytest deploy/tests -q

Green. Manual readthrough of the three READMEs confirms no stale fuse references.

Commit: chore(deploy): cleanup left4me-web hardening + docs for kernel overlayfs


Task 5: End-to-End Verification on ckn@10.0.4.128

Pre-deploy: branch is clean, all four prior commits have landed, and all tests are green locally.

Deploy:

deploy/deploy-test-server.sh ckn@10.0.4.128

Verification commands on the box:

  1. test -e /var/lib/left4me/.kernel-overlay-migrated && echo migrated — sentinel created.
  2. systemctl status left4me-web.service --no-pager: active (running), recent invocation timestamp.
  3. From the UI or via sudo -u left4me /opt/left4me/.venv/bin/l4d2ctl start test-server — exit 0.
  4. findmnt /var/lib/left4me/runtime/test-server/merged — shows fstype overlay in the host namespace.
  5. systemctl status left4me-server@test-server --no-pager: active (running) after the start; not in activating (auto-restart). No status=200/CHDIR errors in journalctl -u left4me-server@test-server.
  6. sudo journalctl -k --since "5 minutes ago" | grep -i apparmor | tail — no overlay-related denials.
  7. Negative test: sudo -u left4me sudo -n /usr/local/libexec/left4me/left4me-overlay mount '../escape' — exits non-zero with validation error.
  8. Idempotency: run l4d2ctl stop test-server && l4d2ctl stop test-server; both invocations succeed, confirming that the prior commit "fix(l4d2-host): make stop_instance idempotent" still holds.
  9. Re-start: l4d2ctl start test-server — succeeds, findmnt shows the mount again.
  10. Double-mount guard: while the server is running, attempt another start outside the UI (from a Python REPL or a second job); start_instance raises CalledProcessError with the "refusing to double-mount" message. Optional; the unit test can cover this instead.

On failure of any step: stop and report. Do NOT push. The deploy script is rerunnable; the migration sentinel persists, so the wipe does not repeat.


Out Of Scope

  • See spec's "Out Of Scope" section.
  • This plan does not push commits; pushing is a separate user decision after end-to-end verification passes.