Captures the architectural fix for the mount-propagation bug: replace fuse-overlayfs (rootless mount inside the web service's namespace, never visible to host or to gameserver units) with kernel-native overlayfs mounted via a privileged sudo helper that nsenters into PID 1's mount namespace. Companion plan numbers the migration as five tasks ending in end-to-end verification on the test box.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kernel Overlayfs Helper Design
Goal: Replace the per-instance fuse-overlayfs mount with kernel-native overlayfs invoked through a privileged sudo helper that mounts in PID 1's mount namespace. Restores host-namespace visibility of the merged overlay so gameserver units (left4me-server@%i.service) can chdir into it at unshare time.
Approval status: User-approved design direction. Implementation proceeds in lockstep with the companion plan at docs/superpowers/plans/2026-05-08-kernel-overlayfs-helper.md.
Context
Symptom. After redeploys, starting a gameserver leaves the systemd unit in activating (auto-restart) with status=200/CHDIR — Changing to the requested working directory failed: No such file or directory. Investigation showed:
- fuse-overlayfs running as left4me user mounts in left4me-web.service's mount namespace.
- ProtectSystem=full + ReadWritePaths=/var/lib/left4me forces PrivateMounts=yes on the unit (systemd-analyze security confirms).
- The unit's bind of /var/lib/left4me shows shared:471 master:1 in /proc/<pid>/mountinfo (slave-receive-only), so mounts created beneath it never propagate back to host.
- MountFlags=shared (added in commit 1968684 to fix this) sets only the unit's root propagation; it does not override the slave-direction propagation that ProtectSystem/ReadWritePaths apply to their bind mounts.
- The gameserver unit, on unshare, inherits host mounts and sees nothing at the merged path → CHDIR fails.
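The propagation state described above can be read directly from a mountinfo line. The following is a minimal sketch rather than part of the project's tooling: the function names are hypothetical, and the sample line is invented apart from the shared:471 master:1 tags quoted above.

```python
def propagation_tags(mountinfo_line: str) -> dict[str, str]:
    """Return the optional propagation fields of one /proc/<pid>/mountinfo line."""
    fields = mountinfo_line.split()
    # Optional fields sit between the mount options (index 5) and a lone "-".
    separator = fields.index("-")
    tags = {}
    for field in fields[6:separator]:
        key, _, value = field.partition(":")
        tags[key] = value
    return tags


def is_slave(tags: dict[str, str]) -> bool:
    """A mount with a master receives mount events from it but sends none back."""
    return "master" in tags


# Illustrative line: only the "shared:471 master:1" tags come from the text above.
sample = "512 480 0:31 / /var/lib/left4me rw,relatime shared:471 master:1 - btrfs /dev/sda2 rw"
tags = propagation_tags(sample)
print(tags, is_slave(tags))
```

A mount that carries a master tag is the case the investigation hit: anything mounted beneath it stays inside the unit's namespace.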
The system appeared to work for ~1d8h before this investigation because the prior fuse daemon happened to land in the host namespace via some transient state. The mechanism documented in 1968684 does not reliably work on systemd 257 with this hardening shape.
Out-of-scope item now in scope. The 2026-05-07 workshop-overlays spec already lists this transition at line 211: "Switch from fuse-overlayfs to kernel overlayfs via a privileged helper. Matches the existing systemd / steam-install sudoers helper pattern under /usr/local/libexec/left4me/." The mount-propagation bug is the trigger to do it now.
Locked Decisions
- Privileged helper does the mount. New left4me-overlay script under /usr/local/libexec/left4me/, invoked via sudo -n. Mirrors the existing left4me-systemctl and left4me-journalctl pattern. The helper enters PID 1's mount namespace via nsenter --mount=/proc/1/ns/mnt and then calls /bin/mount -t overlay … or /bin/umount. Result: all overlay mounts live in the host namespace, visible to gameserver units.
- Kernel-native overlayfs, not fuse. Once a privileged helper exists, fuse-overlayfs's rootless-mount-via-setuid-fusermount3 advantage disappears. Kernel overlayfs is faster, has no long-running daemon, simpler unmount, and one fewer runtime dep.
- Helper is Python, not shell. Path canonicalization, env-file parsing, and lowerdir prefix-allowlist validation are too brittle in shell. Uses system /usr/bin/python3 (never the venv) and stdlib only. Owned by root, mode 0755.
- Verbs are mount and umount. Matches the kernel/userspace utility names; reduces cognitive friction over unmount.
- Helper takes only the instance name as input. It reads ${LEFT4ME_ROOT:-/var/lib/left4me}/instances/<name>/instance.env for L4D2_LOWERDIRS= and computes upper/work/merged from the runtime root. Equivalent in security to taking lowerdirs as args (the user already controls instance.env), and produces a one-line audit trail in journalctl _COMM=sudo.
- Strict path validation in the helper.
  - Instance name matches ^[a-z0-9][a-z0-9_-]{0,63}$ (mirrors validate_instance_name in l4d2host/paths.py).
  - Each lowerdir from L4D2_LOWERDIRS is os.path.realpath'd and must resolve under one of an allowlist: installation/, overlays/, global_overlay_cache/, workshop_cache/. Empty entries and traversals are rejected.
  - upper/work/merged must resolve exactly to runtime/<name>/{upper,work,merged}.
  - Lowerdir count ≤ 500 (kernel overlayfs hard cap; was 64 before kernel 5.2).
- Whiteout-format guard. fuse-overlayfs running as non-root uses user.fuseoverlayfs.* xattrs for whiteouts and opaque dirs, which kernel overlayfs ignores entirely. Before mounting, the helper walks upperdir once and refuses if any such xattr is present. Defensive; catches a stale fuse-era upperdir that wasn't wiped during migration.
- One-time migration: wipe existing upper/ and work/. Deploy script runs a gated migration (sentinel file /var/lib/left4me/.kernel-overlay-migrated) that stops gameservers, stops web service, unmounts any stale fuse/overlay mounts, recreates empty upper/work dirs for every instance. Players' in-place edits to merged content are sacrificed; v1 accepts this for a test deployment.
- Sudoers verb constraints. left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-overlay mount *, /usr/local/libexec/left4me/left4me-overlay umount *. Defense in depth (real validation lives in the helper); makes sudo -l output self-documenting.
- Wire the existing OverlayMounter ABC through. start_instance/stop_instance/delete_instance today bypass the abstraction at l4d2host/fs/base.py. The new KernelOverlayFSMounter replaces the unused FuseOverlayFSMounter AND becomes the only path through instances.py. FuseOverlayFSMounter and the fuse_overlayfs.py module are deleted.
- Double-mount guard in start_instance. Kernel mounts persist when the web worker dies (unlike fuse daemons, which die with their cgroup). start_instance checks os.path.ismount(merged) and refuses with a clear error rather than double-mounting.
- Hardening cleanup on left4me-web.service. Drop MountFlags=shared (no longer the mechanism). Restore PrivateTmp=true (was dropped in commit 593611e for fuse propagation that did not work). Keep NoNewPrivileges unset (sudo still requires setuid). Update the comment block to reflect the new model.
- AGENTS.md contracts unchanged. The host library's CLI surface (install, initialize, start, stop, delete, status, logs) is unchanged. The web app continues to drive operations via l4d2ctl. The fuse-overlayfs implementation detail was never part of the public contract.
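The strict path validation listed above could look roughly like the following. This is a sketch under stated assumptions, not the helper's actual code: the function names are hypothetical, L4D2_LOWERDIRS is assumed to be colon-separated, the allowlisted directories are assumed to be direct children of the LEFT4ME root, and a lowerdir equal to an allowlisted directory is assumed to be acceptable.

```python
import os
import re

# Pattern quoted in the decision above; fullmatch also rejects a trailing newline.
INSTANCE_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{0,63}$")
# Assumption: the allowlisted directories are direct children of the LEFT4ME root.
ALLOWED_LOWER_ROOTS = ("installation", "overlays", "global_overlay_cache", "workshop_cache")
MAX_LOWERDIRS = 500


def validate_name(name: str) -> str:
    if not INSTANCE_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid instance name: {name!r}")
    return name


def validate_lowerdirs(raw: str, root: str) -> list[str]:
    """Resolve each entry and require it to sit at or below an allowlisted directory."""
    entries = raw.split(":")  # assumption: colon-separated, like the lowerdir= option
    if len(entries) > MAX_LOWERDIRS:
        raise ValueError(f"too many lowerdirs: {len(entries)}")
    real_root = os.path.realpath(root)
    bases = [os.path.join(real_root, d) for d in ALLOWED_LOWER_ROOTS]
    resolved = []
    for entry in entries:
        if not entry:
            raise ValueError("empty lowerdir entry")
        real = os.path.realpath(entry)  # collapses ".." and symlinks before the check
        if not any(os.path.commonpath([real, base]) == base for base in bases):
            raise ValueError(f"lowerdir outside allowlist: {entry!r}")
        resolved.append(real)
    return resolved


def runtime_paths(name: str, root: str) -> dict[str, str]:
    """Compute upper/work/merged under runtime/<name>/ from the validated name alone."""
    base = os.path.join(os.path.realpath(root), "runtime", validate_name(name))
    return {kind: os.path.join(base, kind) for kind in ("upper", "work", "merged")}
```

Comparing whole path components with os.path.commonpath, rather than using a string prefix test, keeps a sibling such as overlays_evil/ from passing as overlays/.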
Architecture
left4me-web.service (hardened, private mount namespace)
│
│ start_instance(name=…)
▼
l4d2host.instances.start_instance
│
│ KernelOverlayFSMounter().mount(merged=…)
▼
sudo -n /usr/local/libexec/left4me/left4me-overlay mount <name>
│ • validate name (regex)
│ • parse instance.env → L4D2_LOWERDIRS
│ • realpath each lowerdir, prefix-allowlist check
│ • compute upper/work/merged under runtime/<name>/
│ • walk upperdir, refuse if any user.fuseoverlayfs.* xattr
▼
nsenter --mount=/proc/1/ns/mnt -- \
/bin/mount -t overlay overlay \
-o "lowerdir=…,upperdir=…,workdir=…" \
/var/lib/left4me/runtime/<name>/merged
│
▼
host mount namespace now has the overlay; gameserver unit, on
unshare, inherits it and CHDIRs into …/merged/left4dead2 successfully.
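The two command lines in the flow above can be assembled as argument lists so that no shell quoting is involved. This is a sketch only: the function names are hypothetical, and rejecting paths that contain "," or ":" (instead of escaping them) is an assumption made to keep the option string unambiguous.

```python
HELPER = "/usr/local/libexec/left4me/left4me-overlay"


def sudo_argv(verb: str, name: str) -> list[str]:
    """Command the unprivileged web service would run (sudo -n never prompts)."""
    if verb not in ("mount", "umount"):
        raise ValueError(f"unknown verb: {verb!r}")
    return ["sudo", "-n", HELPER, verb, name]


def mount_argv(lowerdirs: list[str], upper: str, work: str, merged: str) -> list[str]:
    """Command the helper would run, as root, inside PID 1's mount namespace."""
    for path in (*lowerdirs, upper, work, merged):
        # Assumption: reject separators rather than escape them in the option string.
        if "," in path or ":" in path:
            raise ValueError(f"unsupported character in path: {path!r}")
    options = f"lowerdir={':'.join(lowerdirs)},upperdir={upper},workdir={work}"
    return [
        "nsenter", "--mount=/proc/1/ns/mnt", "--",
        "/bin/mount", "-t", "overlay", "overlay", "-o", options, merged,
    ]
```

Either list can be handed to subprocess.run(argv, check=True) as-is; because it is a list and no shell is involved, the instance name and paths are never re-parsed.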
Operational Notes
- Migration ordering on the test box (test-server, …). The deploy script must, in order: (1) stop all left4me-server@*.service, (2) stop left4me-web.service (kills any lingering fuse-overlayfs daemons by reaping their cgroup), (3) findmnt + force-unmount any leftover fuse/overlay mounts under /var/lib/left4me/runtime/, (4) wipe and recreate upper/work for every instance, (5) deploy + start the new code. The sentinel file /var/lib/left4me/.kernel-overlay-migrated gates reruns.
- Filesystem. /var/lib/left4me is btrfs on the test box. Kernel overlayfs on btrfs is supported on kernel ≥ 5.10; the box is on 6.12, so this is fine. AppArmor ships enabled on Debian Trixie; verify no overlay-related denials in journalctl -k after first start.
- Concurrency. Two threads racing on start_instance for the same name is a latent issue unaffected by this change. The double-mount guard partly mitigates: the loser hits the existing mount and errors cleanly.
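The sentinel gating and the wipe in step (4) of the migration note might be sketched as follows. Both function names are hypothetical, the steps are passed in as plain callables so the ordering stays explicit, and ownership and permissions of the recreated directories are deliberately left out.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable


def reset_layer_dirs(instance_runtime: Path) -> None:
    """Step (4): discard upper/ and work/ and recreate them empty."""
    for kind in ("upper", "work"):
        layer = instance_runtime / kind
        if layer.exists():
            shutil.rmtree(layer)
        layer.mkdir(parents=True)


def run_once(sentinel: Path, steps: Iterable[Callable[[], None]]) -> bool:
    """Run every step in order unless the sentinel exists; mark success afterwards."""
    if sentinel.exists():
        return False
    for step in steps:
        step()  # an exception aborts before the sentinel is written
    sentinel.touch()
    return True
```

Writing the sentinel only after the last step means a failed migration is retried on the next deploy, so each step should tolerate being run twice.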
Out Of Scope
- Replace sudo with AmbientCapabilities=CAP_SYS_ADMIN on a dedicated helper unit. Broader blast radius than the wrapper-script approach.
- A systemd-mount per-instance mount unit. Considered as the alternative architectural fix but adds more moving parts than the helper-script approach. The helper matches the established privileged-helper pattern in this codebase.
- Re-enable NoNewPrivileges on left4me-web.service. Requires removing sudo; not feasible while the helper invocation pattern stays.
- Multi-process job-worker-claim safety. The _claim_lock in l4d2host/services/job_worker.py:131-138 is process-local; correctness depends on --workers 1. This change doesn't touch it.
- Replicating the migration on production deployments. v1 covers only the test-server deployment shape.