Insert an idmapped bind mount in front of each lowerdir whose top-level
uid matches l4d2-sandbox at overlay-mount time, so that overlayfs copy-up
produces left4me-owned upperdir entries instead of EACCES.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
SteamInstaller defaults to steamcmd="steamcmd" (bare name), which relies
on PATH lookup. Deployments that don't have steamcmd on PATH — or where
steamcmd.sh's `cd "$(dirname "$0")"` breaks under PATH-symlink invocation
(observed with the Valve-shipped script) — can now pin an absolute path
via LEFT4ME_STEAMCMD. Default keeps bare-name lookup for dev/tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
systemd's `+` Exec prefix removes sandbox/credentials but does NOT
detach from the unit's per-service mount namespace (created by
PrivateTmp/Protect*). The Python interpreter for the helper was
launched inside that namespace, and even though the helper internally
nsenter'd into PID 1 for the umount syscall, the calling Python
process itself never left the unit's namespace. Its existence pinned
the namespace alive, which kept the slave mount tree alive, which
made PID 1's umount return EBUSY for the entire duration of the
helper's run. The mount became unmountable the moment the helper
exited — empirically verified by polling /proc/*/ns/mnt during stop:
the only PID holding the dying namespace was the helper itself.
Wrap both ExecStartPre and ExecStopPost with `/usr/bin/nsenter
--mount=/proc/1/ns/mnt --` so the helper Python interpreter runs in
PID 1's mount namespace from the start. With the helper out of the
unit's namespace, umount succeeds first try once the cgroup empties.
Reset went from ~25 s with retry/lazy-fallback workarounds to ~0.5 s
clean.
Knock-on cleanups:
- Helper drops internal nsenter for the syscalls (already in PID 1's
namespace), and drops the eager-retry loop + lazy-umount fallback +
inner work_inner retry (no race left to ride out).
- Revert TimeoutStopSec=60s back to 15s.
- Tests updated to expect the new argv shapes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs surfaced by the previous deploy attempt:
1. l4d2host/pyproject.toml still listed `l4d2host.fs` in the explicit
packages= list. After deleting the fs/ package, pip install -e fails
with "package directory './fs' does not exist".
2. The CPU-isolation deploy step uses `nproc` to detect host core count,
but `nproc` honors Cpus_allowed of the calling shell. On a host that
already has the cpuset drop-ins applied (system.slice/user.slice →
AllowedCPUs=0), the SSH login lands constrained to one core and
`nproc` returns 1 — making subsequent deploys think they're on a
single-core box and skip the cpuset writes entirely. `nproc --all`
reports installed processors regardless of affinity, which is what
the deploy actually wants.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symmetric with the earlier mount cleanup (commits 519567e..a982995). Until
now, the unit's ExecStartPre handled mount but the Python side still drove
unmount: stop_instance and _purge_instance both called _mounter.unmount,
which wrapped sudo + the helper. Two code paths for two halves of the
same lifecycle.
Move unmount into the unit:
- ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i
(ExecStopPost, not ExecStop, so it runs after the cgroup is cleared;
ExecStop runs while srcds is alive and would EBUSY the umount syscall.)
- Helper's umount verb is now idempotent (mirrors mount): if merged
isn't a mount point, return early. PRINT_ONLY mode bypasses both
short-circuits so the unit tests still exercise the full nsenter argv.
Drop the dead Python machinery:
- _mounter.unmount(...) calls in stop_instance and _purge_instance
- _mounter global + KernelOverlayFSMounter import
- The whole l4d2host/fs/ package (OverlayMounter ABC + KernelOverlayFSMounter
class) — no production callers, just self-tests
- l4d2host/tests/test_kernel_overlayfs.py
- test_stop_succeeds_when_unmount_fails / test_delete_succeeds_when_unmount_fails
(tested Python-side unmount-failure tolerance that no longer exists)
- The l4d2host.fs.kernel_overlayfs.run_command monkeypatches in lifecycle tests
After this, the only thing start_instance does beyond cfg-staging is ask
systemd to enable+start the unit. stop/delete/reset only ask systemd to
disable; the overlay lifecycle lives entirely in the unit file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before this change there were two callers of left4me-overlay mount:
the web app's start_instance (Python, in-process) and the unit's
ExecStartPre (shell, via sudo). The duplication invited divergence; the
helper's recently-added idempotency made both paths technically work
but at the cost of a "first wins" race and dead-code retry logic in
start_instance.
Drop the in-process _mounter.mount() call from start_instance. The web
app now only stages cfg files (which still must happen on the host
filesystem before mount, to avoid overlayfs copy-up changing ownership),
then asks systemd to enable+start the unit; the unit's ExecStartPre
does the mount.
Removed:
- os.path.ismount(merged) refusal in start_instance and its test
(test_start_refuses_to_double_mount). The race the check guarded
against is now handled by the helper's idempotency.
- _load_instance_env helper and the `os` import (both became dead).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Servers started via the web UI now create a WantedBy= symlink under
multi-user.target.wants/, so they auto-start on the next host reboot.
Helper verbs renamed start/stop -> enable/disable; service_control.py
renamed start_service/stop_service -> enable_service/disable_service.
The user-facing l4d2ctl start/stop commands keep their names per the
AGENTS.md contract -- only the implementation changes. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each linked overlay gets a checkbox on the blueprint detail page that opts
its server.cfg in as exec server_overlay_<id>. The web app builds the
spec with {path, alias} per overlay and prepends exec server_overlay_<id>
lines to the blueprint config in lowest-overlay-first order. The host
stages those copies in the overlayfs upper layer before mounting (avoids
copy-up writes against a sandbox-uid file). A live preview block above the
Config textarea shows what gets auto-executed.
Schema:
- alembic 0007: BlueprintOverlay.expose_server_cfg BOOLEAN
Spec contract:
- l4d2host OverlayRef(path, alias?). load_spec accepts both bare-string
and {path, alias} entries.
Side effects folded in (same file in l4d2_facade):
- start_server auto-initializes; the manual Initialize step is no longer
needed before Start.
- initialize_server no longer runs blueprint builders — builds happen on
overlay save, not on every server Start.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reset stops the systemd service, unmounts the overlay, and rm -rf's both
runtime/<name> and instances/<name>, but keeps the Server row, blueprint,
and (shared) systemd template. Next Start re-initializes from the current
blueprint, so users can clean up logs/caches/accumulated game state without
losing the server.
Implementation factors a shared _purge_instance helper out of
delete_instance; reset_instance reuses it without the existence guard. New
"reset" lifecycle op flows through the same route + worker + facade plumbing
as the other server ops.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop MountFlags=shared (the assumption that it propagated fuse mounts
to host was incorrect on systemd 257 with ProtectSystem+ReadWritePaths).
Restore PrivateTmp=true (was dropped in 593611e for fuse propagation
that did not work). Rewrite the comment block to describe the new
model: mounts go through the left4me-overlay helper which nsenters
into PID 1's mount namespace, so the unit's mount-ns layout is no
longer load-bearing.
Update the three user-facing READMEs (root, l4d2host, deploy) to drop
fuse-overlayfs / fusermount3 prereqs and call out the kernel overlayfs
mount path through the privileged helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace direct fuse-overlayfs / fusermount3 subprocess calls in
start_instance / stop_instance / delete_instance with the existing
OverlayMounter abstraction, now backed by KernelOverlayFSMounter.
Adds an os.path.ismount guard at the top of start_instance so a
kernel-level overlay that survived a web-worker crash isn't double-
mounted (kernel mounts persist when the cgroup dies, unlike fuse
daemons).
Delete the unused FuseOverlayFSMounter module.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New privileged helper at /usr/local/libexec/left4me/left4me-overlay
(Python, system /usr/bin/python3, stdlib only) takes only the instance
name, parses instance.env for L4D2_LOWERDIRS, validates each lowerdir
against an allowlist (installation/, overlays/, global_overlay_cache/,
workshop_cache/), refuses upperdirs tainted with user.fuseoverlayfs.*
xattrs from the prior fuse era, and execs `nsenter --mount=/proc/1/ns/mnt
-- mount -t overlay ...` so the resulting mount lives in the host
namespace. Mirrors the existing left4me-systemctl / left4me-journalctl
pattern; sudoers entry is verb-constrained.
KernelOverlayFSMounter implements the existing OverlayMounter ABC,
deriving the instance name from the merged path. No call sites use it
yet — that's the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
systemctl stop is already a no-op on a stopped unit, but stop_instance
was unconditionally running fusermount3 -u and bubbling up the EINVAL
when the overlay wasn't currently mounted (e.g. server already stopped).
Mirror the established delete_instance pattern: always attempt the
unmount, swallow CalledProcessError, and label the step "(if mounted)".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
L4D2 dedicated server expects to dlopen steamclient.so via
~/.steam/sdk32 (and sdk64). Without those symlinks, srcds_run logs
'cannot open shared object file' and SteamAPI_Init fails, which means
the server is invisible to the public Steam master server, Workshop
addon downloads break, and Steam 'Join Game' / lobby joins do not
reach the server (only direct-IP connect works).
SteamInstaller.install_or_update now ensures the symlinks exist after
SteamCMD finishes. Targets prefer SteamCMD's linux32/linux64 sibling
dirs; falls back to <install_dir>/bin/ if the siblings cannot be
located. Idempotent: re-running the install repairs or leaves the
symlinks alone.
Path.home() respects HOME, which the systemd web unit sets to
/var/lib/left4me, so the symlinks land in the left4me user's home.
Existing deploys can apply the fix by re-running 'Install' from /admin
without a full redeploy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- validate instance names at the host lib and web boundary against
[a-z0-9][a-z0-9_-]{0,63} to prevent path traversal via Server.name
- fail-closed on SECRET_KEY: load_config returns None when env unset,
create_app raises if missing or "dev" outside TESTING
- close login timing oracle by hashing a dummy digest when the user
is not found, equalizing response time
- set SESSION_COOKIE_SECURE outside TESTING
- delete_instance tolerates stop_service and fusermount3 failures so
partially-initialized instances clean up without contract breaks;
drops the is_mount() preflight that violated AGENTS.md
- document claim_next_job's single-process assumption
- clarify emit_step contract via docstring
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>