left4me/docs/superpowers/specs/2026-05-08-kernel-overlayfs-helper-design.md
mwiegand db120d77d3
docs(specs): kernel overlayfs migration design + plan
Captures the architectural fix for the mount-propagation bug: replace
fuse-overlayfs (rootless mount inside the web service's namespace, never
visible to host or to gameserver units) with kernel-native overlayfs
mounted via a privileged sudo helper that nsenters into PID 1's mount
namespace. Companion plan numbers the migration as five tasks ending in
end-to-end verification on the test box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:19:26 +02:00

Kernel Overlayfs Helper Design

Goal: Replace the per-instance fuse-overlayfs mount with kernel-native overlayfs, invoked through a privileged sudo helper that mounts in PID 1's mount namespace. This restores host-namespace visibility of the merged overlay, so gameserver units (left4me-server@%i.service) can chdir into it at unshare time.

Approval status: User-approved design direction. Implementation proceeds in lockstep with the companion plan at docs/superpowers/plans/2026-05-08-kernel-overlayfs-helper.md.

Context

Symptom. After redeploys, starting a gameserver leaves the systemd unit in activating (auto-restart) with status=200/CHDIR — Changing to the requested working directory failed: No such file or directory. Investigation showed:

  • fuse-overlayfs running as left4me user mounts in left4me-web.service's mount namespace.
  • ProtectSystem=full + ReadWritePaths=/var/lib/left4me forces PrivateMounts=yes on the unit (systemd-analyze security confirms).
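  • The unit's bind of /var/lib/left4me shows shared:471 master:1 in /proc/<pid>/mountinfo. It is a slave of host peer group 1 and shared only within its own peer group 471, so it receives mounts from the host but mounts created beneath it never propagate back to the host.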
  • MountFlags=shared (added in commit 1968684 to fix this) sets only the unit's root propagation; it does not override the slave-direction propagation that ProtectSystem/ReadWritePaths apply to their bind mounts. The gameserver unit, on unshare, inherits host mounts and sees nothing at the merged path → CHDIR fails.
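
The propagation state described above can be read mechanically from the optional fields of a mountinfo line. The following is a minimal sketch, not part of the codebase: the function names are hypothetical, and every field of the sample line except the shared:471 master:1 tags is invented for illustration.

```python
# Sketch: extract propagation tags from one /proc/<pid>/mountinfo line.
# Field layout per proc(5): six fixed fields, then zero or more optional
# fields, then a lone "-" separator, then fstype, source, super options.

# Illustrative line; only "shared:471 master:1" comes from the investigation.
SAMPLE = ("812 790 0:31 /left4me /var/lib/left4me rw,relatime "
          "shared:471 master:1 - btrfs /dev/sda2 rw")


def parse_propagation(line):
    """Return the mount point plus its shared/master peer group ids."""
    fields = line.split()
    separator = fields.index("-", 6)  # optional fields end at the "-"
    tags = {}
    for item in fields[6:separator]:
        key, _, value = item.partition(":")
        tags[key] = value
    return {
        "mount_point": fields[4],
        "shared": tags.get("shared"),  # peer group this mount shares with
        "master": tags.get("master"),  # peer group this mount is a slave of
    }


def is_slave(info):
    """A mount with a master tag receives events but never sends them back."""
    return info["master"] is not None


if __name__ == "__main__":
    info = parse_propagation(SAMPLE)
    print(info, "slave" if is_slave(info) else "not a slave")
```

A mount that carries a master tag is exactly the case where an overlay mounted beneath it stays invisible to the master's peer group, which here is the host.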

The system appeared to work for about 1 day 8 hours before this investigation because the prior fuse daemon happened to land in the host namespace via some transient state. The mechanism documented in 1968684 does not work reliably on systemd 257 with this hardening shape.

Out-of-scope item now in scope. The 2026-05-07 workshop-overlays spec already lists this transition at line 211: "Switch from fuse-overlayfs to kernel overlayfs via a privileged helper. Matches the existing systemd / steam-install sudoers helper pattern under /usr/local/libexec/left4me/." The mount-propagation bug is the trigger to do it now.

Locked Decisions

  1. Privileged helper does the mount. New left4me-overlay script under /usr/local/libexec/left4me/, invoked via sudo -n. Mirrors the existing left4me-systemctl and left4me-journalctl pattern. The helper enters PID 1's mount namespace via nsenter --mount=/proc/1/ns/mnt and then calls /bin/mount -t overlay … or /bin/umount. Result: all overlay mounts live in the host namespace, visible to gameserver units.
  2. Kernel-native overlayfs, not fuse. Once a privileged helper exists, fuse-overlayfs's rootless-mount-via-setuid-fusermount3 advantage disappears. Kernel overlayfs is faster, needs no long-running daemon, unmounts more simply, and drops one runtime dependency.
  3. Helper is Python, not shell. Path canonicalization, env-file parsing, and lowerdir prefix-allowlist validation are too brittle in shell. Uses system /usr/bin/python3 (never the venv) and stdlib only. Owned by root, mode 0755.
  4. Verbs are mount and umount. They match the mount(8) and umount(8) utility names, so nobody has to remember whether the helper spells the second verb umount or unmount.
  5. Helper takes only the instance name as input. It reads ${LEFT4ME_ROOT:-/var/lib/left4me}/instances/<name>/instance.env for L4D2_LOWERDIRS= and computes upper/work/merged from the runtime root. Equivalent in security to taking lowerdirs as args (the user already controls instance.env), and produces a one-line audit trail in journalctl _COMM=sudo.
  6. Strict path validation in the helper.
    • Instance name matches ^[a-z0-9][a-z0-9_-]{0,63}$ (mirrors validate_instance_name in l4d2host/paths.py).
    • Each lowerdir from L4D2_LOWERDIRS is os.path.realpath'd and must resolve under one of the allowlisted roots: installation/, overlays/, global_overlay_cache/, workshop_cache/. Empty entries and traversals are rejected.
    • upper/work/merged must resolve exactly to runtime/<name>/{upper,work,merged}.
    • Lowerdir count ≤ 500 (kernel overlayfs hard cap; was 64 before kernel 5.2).
  7. Whiteout-format guard. fuse-overlayfs running as non-root uses user.fuseoverlayfs.* xattrs for whiteouts and opaque dirs, which kernel overlayfs ignores entirely. Before mounting, the helper walks upperdir once and refuses if any such xattr is present. Defensive; catches a stale fuse-era upperdir that wasn't wiped during migration.
  8. One-time migration: wipe existing upper/ and work/. The deploy script runs a gated migration (sentinel file /var/lib/left4me/.kernel-overlay-migrated) that stops gameservers, stops the web service, unmounts any stale fuse/overlay mounts, and recreates empty upper/work dirs for every instance. Players' in-place edits to merged content are sacrificed; v1 accepts this for a test deployment.
  9. Sudoers verb constraints. left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-overlay mount *, /usr/local/libexec/left4me/left4me-overlay umount *. Defense in depth (real validation lives in the helper); makes sudo -l output self-documenting.
  10. Wire the existing OverlayMounter ABC through. start_instance/stop_instance/delete_instance today bypass the abstraction at l4d2host/fs/base.py. The new KernelOverlayFSMounter replaces the unused FuseOverlayFSMounter AND becomes the only path through instances.py. FuseOverlayFSMounter and the fuse_overlayfs.py module are deleted.
  11. Double-mount guard in start_instance. Kernel mounts persist when the web worker dies (unlike fuse daemons, which die with their cgroup). start_instance checks os.path.ismount(merged) and refuses with a clear error rather than double-mounting.
  12. Hardening cleanup on left4me-web.service. Drop MountFlags=shared (no longer the mechanism). Restore PrivateTmp=true (was dropped in commit 593611e for fuse propagation that did not work). Keep NoNewPrivileges unset (sudo still requires setuid). Update the comment block to reflect the new model.
  13. AGENTS.md contracts unchanged. The host library's CLI surface (install, initialize, start, stop, delete, status, logs) is unchanged. The web app continues to drive operations via l4d2ctl. The fuse-overlayfs implementation detail was never part of the public contract.
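
The validation rules in decisions 5 and 6 can be sketched as follows. This is an illustration under stated assumptions rather than the helper itself: the function names other than validate_instance_name are hypothetical, L4D2_LOWERDIRS is assumed to be a colon-separated, unquoted list, and the allowlisted roots are assumed to sit directly under the left4me root.

```python
import os
import re

INSTANCE_NAME_RE = re.compile(r"[a-z0-9][a-z0-9_-]{0,63}")
ALLOWED_LOWER_ROOTS = ("installation", "overlays",
                       "global_overlay_cache", "workshop_cache")
MAX_LOWERDIRS = 500  # cap stated in decision 6


def validate_instance_name(name):
    # fullmatch instead of match with "$": "$" would accept a trailing newline.
    if not INSTANCE_NAME_RE.fullmatch(name):
        raise ValueError("invalid instance name: %r" % name)
    return name


def parse_lowerdirs(env_text):
    """Return the raw L4D2_LOWERDIRS entries (assumed colon-separated)."""
    for line in env_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key == "L4D2_LOWERDIRS":
            return value.split(":")
    raise ValueError("L4D2_LOWERDIRS not found")


def validate_lowerdirs(root, entries):
    """Resolve each entry and require it to sit under an allowlisted root."""
    root = os.path.realpath(root)
    allowed = [os.path.join(root, d) for d in ALLOWED_LOWER_ROOTS]
    if not entries or len(entries) > MAX_LOWERDIRS:
        raise ValueError("lowerdir count out of range")
    resolved = []
    for entry in entries:
        if not entry:
            raise ValueError("empty lowerdir entry")
        real = os.path.realpath(entry)  # collapses ".." and follows symlinks
        if not any(os.path.commonpath([real, base]) == base
                   for base in allowed):
            raise ValueError("lowerdir outside allowlist: %r" % entry)
        resolved.append(real)
    return resolved
```

Comparing with os.path.commonpath rather than a string prefix avoids accepting a sibling such as overlays_evil/ for the overlays/ root.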

Architecture

left4me-web.service  (hardened, private mount namespace)
    │
    │  start_instance(name=…)
    ▼
l4d2host.instances.start_instance
    │
    │  KernelOverlayFSMounter().mount(merged=…)
    ▼
sudo -n /usr/local/libexec/left4me/left4me-overlay mount <name>
    │   • validate name (regex)
    │   • parse instance.env → L4D2_LOWERDIRS
    │   • realpath each lowerdir, prefix-allowlist check
    │   • compute upper/work/merged under runtime/<name>/
    │   • walk upperdir, refuse if any user.fuseoverlayfs.* xattr
    ▼
nsenter --mount=/proc/1/ns/mnt -- \
    /bin/mount -t overlay overlay \
        -o "lowerdir=…,upperdir=…,workdir=…" \
        /var/lib/left4me/runtime/<name>/merged
    │
    ▼
host mount namespace now has the overlay; gameserver unit, on
unshare, inherits it and CHDIRs into …/merged/left4dead2 successfully.
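
The last two helper steps in the diagram, the upperdir xattr walk and the nsenter command line, might look roughly like this. It is a hedged sketch: both function names are hypothetical, the injectable listxattr parameter exists only to make the walk testable, and nsenter is referenced by bare name as in the diagram (a real helper might pin an absolute path).

```python
import os

FUSE_XATTR_PREFIX = "user.fuseoverlayfs."


def find_fuse_xattrs(upperdir, listxattr=None):
    """Return (path, xattr) pairs for every fuse-overlayfs xattr in upperdir.

    os.walk visits each directory once as dirpath, so checking dirpath plus
    its files covers every directory and file exactly once.
    """
    listxattr = listxattr or os.listxattr  # os.listxattr is Linux-only
    hits = []
    for dirpath, _dirnames, filenames in os.walk(upperdir):
        for path in [dirpath] + [os.path.join(dirpath, f) for f in filenames]:
            for name in listxattr(path, follow_symlinks=False):
                if name.startswith(FUSE_XATTR_PREFIX):
                    hits.append((path, name))
    return hits


def build_mount_argv(lowerdirs, upperdir, workdir, merged):
    """Build the argv shown in the diagram; no shell is involved."""
    options = "lowerdir={},upperdir={},workdir={}".format(
        ":".join(lowerdirs), upperdir, workdir)
    return ["nsenter", "--mount=/proc/1/ns/mnt", "--",
            "/bin/mount", "-t", "overlay", "overlay",
            "-o", options, merged]
```

Under these assumptions the helper would refuse to mount when find_fuse_xattrs returns a non-empty list, and otherwise pass the argv to subprocess.run(argv, check=True). Passing a list instead of a shell string keeps the validated paths from being reinterpreted by a shell.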

Operational Notes

  • Migration ordering on the test box (test-server, …). The deploy script must, in order: (1) stop all left4me-server@*.service, (2) stop left4me-web.service (kills any lingering fuse-overlayfs daemons by reaping their cgroup), (3) findmnt + force-unmount any leftover fuse/overlay mounts under /var/lib/left4me/runtime/, (4) wipe and recreate upper/work for every instance, (5) deploy + start the new code. The sentinel file /var/lib/left4me/.kernel-overlay-migrated gates reruns.
  • Filesystem. /var/lib/left4me is btrfs on the test box. Kernel overlayfs on btrfs is supported on kernel ≥ 5.10; the box is on 6.12 — fine. AppArmor ships enabled on Debian Trixie; verify no overlay-related denials in journalctl -k after first start.
  • Concurrency. Two threads racing on start_instance for the same name is a latent issue unaffected by this change. The double-mount guard partly mitigates: the loser hits the existing mount and errors cleanly.
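
Step (4) of the migration ordering, together with the sentinel gate, can be sketched as below. This is an assumption-laden illustration and not the deploy script: the function name is hypothetical, stopping units and unmounting (steps 1 to 3) are left out, and restoring the ownership and mode the deployment expects on the recreated directories is also omitted.

```python
import os
import shutil


def wipe_upper_and_work(runtime_root, sentinel):
    """Recreate empty upper/ and work/ for every instance, exactly once.

    Returns the instance names that were wiped; returns an empty list when
    the sentinel already exists, so reruns of the deploy are no-ops.
    """
    if os.path.exists(sentinel):
        return []
    wiped = []
    for name in sorted(os.listdir(runtime_root)):
        instance_dir = os.path.join(runtime_root, name)
        if not os.path.isdir(instance_dir):
            continue
        for sub in ("upper", "work"):
            path = os.path.join(instance_dir, sub)
            if os.path.lexists(path):
                shutil.rmtree(path)  # raises if path is a symlink
            os.makedirs(path)
            # A real deploy step would also chown/chmod the new directory.
        wiped.append(name)
    # Write the sentinel last so a failed run is retried on the next deploy.
    with open(sentinel, "w"):
        pass
    return wiped
```

Writing the sentinel only after every instance has been processed means an interrupted migration repeats the wipe, which is harmless because the wipe is idempotent.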

Out Of Scope

  • Replace sudo with AmbientCapabilities=CAP_SYS_ADMIN on a dedicated helper unit. Broader blast radius than the wrapper-script approach.
  • A systemd-mount per-instance mount unit. Considered as the alternative architectural fix but adds more moving parts than the helper-script approach. The helper matches the established privileged-helper pattern in this codebase.
  • Re-enable NoNewPrivileges on left4me-web.service. Requires removing sudo; not feasible while the helper invocation pattern stays.
  • Multi-process job-worker-claim safety. The _claim_lock in l4d2host/services/job_worker.py:131-138 is process-local; correctness depends on --workers 1. This change doesn't touch it.
  • Replicating the migration on production deployments. v1 covers only the test-server deployment shape.