fix(deploy): drop deleted l4d2host.fs from pyproject + use nproc --all

Two bugs surfaced by the previous deploy attempt:

1. l4d2host/pyproject.toml still listed `l4d2host.fs` in the explicit
   packages= list. After deleting the fs/ package, pip install -e fails
   with "package directory './fs' does not exist".

2. The CPU-isolation deploy step uses `nproc` to detect host core count,
   but `nproc` honors Cpus_allowed of the calling shell. On a host that
   already has the cpuset drop-ins applied (system.slice/user.slice →
   AllowedCPUs=0), the SSH login lands constrained to one core and
   `nproc` returns 1, making subsequent deploys think they're on a
   single-core box and skip the cpuset writes entirely. `nproc --all`
   reports installed processors regardless of affinity, which is what
   the deploy actually wants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
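
The affinity-versus-installed distinction behind the second bug can be reproduced outside the deploy script. The following is a minimal sketch in Python, assuming a Linux host: `len(os.sched_getaffinity(0))` plays the role of plain `nproc`, and `os.cpu_count()` is used as a rough stand-in for `nproc --all` (it reports CPUs the OS exposes, which is close to but not identical to "installed processors"). The function names are illustrative and not part of the repository.

```python
import os


def usable_cpu_count() -> int:
    """CPUs the current process may run on (comparable to plain `nproc`)."""
    # sched_getaffinity is Linux-specific; fall back to the total elsewhere.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def total_cpu_count() -> int:
    """CPUs the OS reports overall, ignoring this process's affinity mask."""
    return os.cpu_count() or 1


if __name__ == "__main__":
    # Inside a cgroup restricted with AllowedCPUs=0 the first number drops
    # to 1 while the second keeps reporting the whole machine.
    print(f"usable={usable_cpu_count()} total={total_cpu_count()}")
```

Run from an SSH session on a host where `user.slice` is already pinned to core 0, the two numbers should diverge, which is the situation the `--all` flag is meant to handle.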
refactor(l4d2-host): unmount via ExecStopPost — single code path mirroring mount
2026-05-09 13:11:19 +02:00 · 2026-05-09 13:09:52 +02:00 · 2026-05-09 12:56:54 +02:00 · 2026-05-09 12:55:16 +02:00 · 2026-05-09 12:54:05 +02:00 · 2026-05-09 12:51:58 +02:00
30 changed files with 2815 additions and 359 deletions
--- a/deploy/README.md
+++ b/deploy/README.md
@ -71,3 +71,85 @@ The web app currently supports two overlay surfaces:
 - `script` overlays — populated by an arbitrary user-authored bash script that runs inside `bubblewrap` + `systemd-run --scope` as the unprivileged `l4d2-sandbox` UID, with the overlay directory bind-mounted RW at `/overlay`. Resource caps: 1h walltime, 4 GB RAM, 512 tasks, 200% CPU, 20 GB post-build disk cap.

 Both the caches and the overlay directories are owned by the `left4me` runtime user; if the web service ever runs as a different uid, ensure it shares a group with the host process and that both trees are group-readable.
+
+## Performance Tuning
+
+The deployment ships a host-side perf baseline (slices, unit directives, sysctls). See `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md` for design rationale.
+
+The following knobs are documented escape hatches — they are **not** auto-applied. Apply only if you have measured a need and understand the failure modes.
+
+### CPU governor
+
+The performance governor squeezes a few percent off jitter under bursty load. `schedutil` is acceptable for sustained UDP workloads.
+
+```sh
+sudo cpupower frequency-set -g performance
+```
+
+Install via `sudo apt install linux-cpupower` if the binary isn't present.
+
+Persist via your distro's CPU-frequency tooling (e.g. `/etc/default/cpufrequtils`).
+
+### CPU isolation (cores)
+
+The deploy script writes four `AllowedCPUs=` drop-ins so that, by default, only `l4d2-game.slice` is allowed to run on cores 1..N-1; `system.slice`, `user.slice`, and `l4d2-build.slice` are pinned to core 0. Game servers thus get the host minus core 0 exclusively, the build sandbox and the web app stay on core 0, and a logged-in admin running CPU-heavy work in their shell can't steal cycles from a live match.
+
+Override the split by setting either env var when running the deploy:
+
+```sh
+LEFT4ME_SYSTEM_CPUS="0,1" LEFT4ME_GAME_CPUS="2-7" deploy/deploy-test-server.sh deploy-user@host
+```
+
+On single-core hosts the deploy skips the cpuset drop-ins entirely and prints a warning to stderr; the rest of the perf baseline (cgroup weights, sysctls, OOM scores) still applies. To force isolation on a single-core host anyway (rarely useful), set either env var explicitly.
+
+Per-instance `CPUAffinity=` (next subsection) composes on top of this — the per-instance value must be a subset of `l4d2-game.slice`'s `AllowedCPUs=`, which the kernel enforces.
+
+### Per-instance CPU affinity
+
+`srcds` is single-threaded per instance. On a multi-core host, pinning each instance to its own core can cut jitter under contention. Drop in `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf`:
+
+```ini
+[Service]
+CPUAffinity=2
+```
+
+This pins the instance to CPU 2 specifically; per-instance values would typically be 1, 2, 3, ... so each server has its own core.
+
+A reasonable strategy on an N-core host: leave core 0 for the kernel + IRQs + system services, then pin one instance per remaining core.
+
+### NIC tuning
+
+Hardware-specific (install via `sudo apt install ethtool` if not present). On a host with a single primary interface (replace `eth0`):
+
+```sh
+sudo ethtool -G eth0 rx 4096 tx 4096
+sudo ethtool -K eth0 gro on lro off
+```
+
+If you run a high instance count, also pin the NIC's interrupts off the cores that game servers occupy (see `/proc/interrupts` and `/proc/irq/<n>/smp_affinity`).
+
+### Real-time scheduling (advanced, opt-in)
+
+Source-engine servers do not need real-time scheduling, and a misbehaving `srcds` at any RT priority can starve kernel threads, even though the default `kernel.sched_rt_runtime_us=950000` reserves 5% of each one-second period for non-RT tasks. Use only if you have a measured jitter problem that the baseline does not solve.
+
+`/etc/systemd/system/left4me-server@.service.d/realtime.conf`:
+
+```ini
+[Service]
+CPUSchedulingPolicy=fifo
+CPUSchedulingPriority=10
+LimitRTPRIO=10
+AmbientCapabilities=CAP_SYS_NICE
+```
+
+The `AmbientCapabilities=CAP_SYS_NICE` line is needed because the service runs as `User=left4me` with `NoNewPrivileges=true`; without it some kernels/systemd combinations refuse to apply the RT policy.
+
+### Applying changes to running servers
+
+Unit-file changes do not apply to already-running services. After any change:
+
+```sh
+sudo systemctl daemon-reload
+# Restart each game server via the web UI's stop + start, or:
+sudo systemctl restart 'left4me-server@*.service'
+```
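
The default split described in the README's CPU isolation section, and implemented in shell by the deploy-script hunk that follows, can be sketched as a small pure function. This is an illustrative Python model only, assuming the semantics of the shell expansions used in the script (`${VAR:-0}` and `${VAR+x}`); `resolve_cpusets` is a hypothetical name and does not exist in the repository.

```python
from typing import Mapping, Optional, Tuple


def resolve_cpusets(nproc: int, env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (system_cpus, game_cpus), or None when isolation is skipped."""
    system_set = "LEFT4ME_SYSTEM_CPUS" in env
    game_set = "LEFT4ME_GAME_CPUS" in env
    # Single-core host with no explicit override: skip the drop-ins.
    if nproc < 2 and not (system_set or game_set):
        return None
    # Mirrors ${VAR:-0}: an unset or empty value falls back to "0".
    system_cpus = env.get("LEFT4ME_SYSTEM_CPUS") or "0"
    # Mirrors ${VAR+x}: any set value, even an empty one, is taken as-is.
    game_cpus = env["LEFT4ME_GAME_CPUS"] if game_set else f"1-{nproc - 1}"
    return system_cpus, game_cpus


if __name__ == "__main__":
    print(resolve_cpusets(8, {}))
    print(resolve_cpusets(1, {}))
```

One consequence this model makes visible: forcing isolation on a single-core host by setting only `LEFT4ME_SYSTEM_CPUS` leaves the game range at the computed default `1-0`, so setting both variables is the safer way to use the override.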
--- a/deploy/deploy-test-server.sh
+++ b/deploy/deploy-test-server.sh
@ -136,6 +136,42 @@ $sudo_cmd chown -R left4me:left4me /opt/left4me

 $sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-web.service /usr/local/lib/systemd/system/left4me-web.service
 $sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service /usr/local/lib/systemd/system/left4me-server@.service
+$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
+$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice
+
+# CPU isolation via cgroup-v2 AllowedCPUs= drop-ins. Pin everything that
+# isn't a live game server to core 0; give game servers cores 1..N-1.
+# See docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md.
+# `nproc --all` reports installed processors regardless of the calling
+# shell's CPU affinity. Plain `nproc` honors Cpus_allowed of the calling
+# process, so on a host that already has the cpuset drop-ins applied
+# (system.slice → AllowedCPUs=0), the SSH login lands in user.slice with
+# AllowedCPUs=0 and `nproc` would return 1 — making subsequent deploys
+# wrongly think they're on a single-core box and skip CPU isolation.
+NPROC=$(nproc --all)
+SYSTEM_CPUS=${LEFT4ME_SYSTEM_CPUS:-0}
+if [ "${LEFT4ME_GAME_CPUS+x}" = x ]; then
+    GAME_CPUS=$LEFT4ME_GAME_CPUS
+else
+    GAME_CPUS="1-$((NPROC - 1))"
+fi
+if [ "$NPROC" -lt 2 ] && [ "${LEFT4ME_SYSTEM_CPUS+x}${LEFT4ME_GAME_CPUS+x}" = "" ]; then
+    printf 'left4me deploy: skipping CPU isolation (nproc=%s); cpuset drop-ins not written.\n' "$NPROC" >&2
+else
+    for slice_drop_in in \
+        /etc/systemd/system/system.slice.d/99-left4me-cpuset.conf \
+        /etc/systemd/system/user.slice.d/99-left4me-cpuset.conf \
+        /etc/systemd/system/l4d2-build.slice.d/99-left4me-cpuset.conf; do
+        $sudo_cmd mkdir -p "$(dirname "$slice_drop_in")"
+        printf '[Slice]\nAllowedCPUs=%s\n' "$SYSTEM_CPUS" \
+            | $sudo_cmd install -m 0644 -o root -g root /dev/stdin "$slice_drop_in"
+    done
+    $sudo_cmd mkdir -p /etc/systemd/system/l4d2-game.slice.d
+    printf '[Slice]\nAllowedCPUs=%s\n' "$GAME_CPUS" \
+        | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
+          /etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf
+fi
+
 $sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-systemctl /usr/local/libexec/left4me/left4me-systemctl
 $sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-journalctl /usr/local/libexec/left4me/left4me-journalctl
 $sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-overlay /usr/local/libexec/left4me/left4me-overlay
@ -154,6 +190,13 @@ $sudo_cmd install -m 0644 -o root -g root \
    /opt/left4me/deploy/files/etc/left4me/sandbox-resolv.conf \
    /etc/left4me/sandbox-resolv.conf

+# Host perf-baseline sysctls. Apply with `sysctl --system` so values
+# take effect this deploy, not on next reboot.
+$sudo_cmd install -m 0644 -o root -g root \
+    /opt/left4me/deploy/files/etc/sysctl.d/99-left4me.conf \
+    /etc/sysctl.d/99-left4me.conf
+$sudo_cmd sysctl --system >/dev/null
+
 # Stomp the file every deploy so newly added vars reach existing boxes.
 # SECRET_KEY is derived from /etc/machine-id so it stays stable across
 # redeploys (no session invalidation) without persisting state in /etc.
--- a/deploy/files/etc/sysctl.d/99-left4me.conf
+++ b/deploy/files/etc/sysctl.d/99-left4me.conf
@ -0,0 +1,21 @@
+# Host-side perf baseline for left4me — see
+# docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
+#
+# UDP socket buffers: typical distro defaults of ~208 KiB (212992 bytes) are
+# too small for sustained Source-engine UDP across multiple instances. 8 MiB
+# matches the standard 1 Gbit recommendation; rmem_default/wmem_default
+# protect sockets that don't explicitly enlarge their buffers.
+net.core.rmem_max = 8388608
+net.core.wmem_max = 8388608
+net.core.rmem_default = 524288
+net.core.wmem_default = 524288
+
+# Kernel softirq UDP path: the per-CPU backlog queue starts dropping packets
+# at the default 1000 under multi-instance burst; 5000 absorbs realistic peaks.
+# netdev_budget = 600 gives softirq more drain headroom per pass.
+net.core.netdev_max_backlog = 5000
+net.core.netdev_budget = 600
+
+# Latency-sensitive default: avoid swap unless the box is really under
+# pressure. Harmless on swapless hosts.
+vm.swappiness = 10
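
Because the deploy applies this file with `sysctl --system`, it can be useful to confirm afterwards that the live values match. The following is a minimal sketch under stated assumptions: the helper names (`parse_sysctl_conf`, `proc_path`, `mismatches`) are hypothetical, and the dotted-key to `/proc/sys` mapping is a simplification that does not handle keys containing literal dots (such as some interface names).

```python
from typing import Dict, Tuple


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Parse `key = value` lines, skipping blanks and `#`/`;` comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line; ignored in this sketch
        settings[key.strip()] = value.strip()
    return settings


def proc_path(key: str) -> str:
    """Map a dotted sysctl key to its /proc/sys path (simplified)."""
    return "/proc/sys/" + key.replace(".", "/")


def mismatches(expected: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {key: (expected, live)} for keys whose live value differs."""
    out: Dict[str, Tuple[str, str]] = {}
    for key, want in expected.items():
        try:
            with open(proc_path(key)) as fh:
                live = " ".join(fh.read().split())
        except OSError:
            live = "<unreadable>"
        if live != want:
            out[key] = (want, live)
    return out
```

A possible use on a deployed host would be `mismatches(parse_sysctl_conf(open("/etc/sysctl.d/99-left4me.conf").read()))`, where an empty result indicates that every listed key is live.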
--- a/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice
+++ b/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice
@ -0,0 +1,8 @@
+# Perf baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
+[Unit]
+Description=left4me script-sandbox build slice
+Before=slices.target
+
+[Slice]
+CPUWeight=10
+IOWeight=10
--- a/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice
+++ b/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice
@ -0,0 +1,8 @@
+# Perf baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
+[Unit]
+Description=left4me game-server slice
+Before=slices.target
+
+[Slice]
+CPUWeight=1000
+IOWeight=1000
--- a/deploy/files/usr/local/lib/systemd/system/left4me-server@.service
+++ b/deploy/files/usr/local/lib/systemd/system/left4me-server@.service
@ -2,6 +2,11 @@
 Description=left4me server instance %i
 After=network-online.target
 Wants=network-online.target
+# Bound the restart loop. Without these, a persistent ExecStartPre or
+# ExecStart failure spins indefinitely. Note: these are [Unit]-section
+# directives (systemd 230+), not [Service].
+StartLimitBurst=5
+StartLimitIntervalSec=60s

 [Service]
 Type=simple
@ -9,10 +14,45 @@ User=left4me
 Group=left4me
 EnvironmentFile=/etc/left4me/host.env
 EnvironmentFile=/var/lib/left4me/instances/%i/instance.env
-WorkingDirectory=/var/lib/left4me/runtime/%i/merged/left4dead2
+# `-` prefix: chdir failure is non-fatal. systemd applies WorkingDirectory
+# before every Exec line — including ExecStartPre — but the merged dir only
+# exists once ExecStartPre's overlay mount succeeds. With `-`, ExecStartPre
+# runs in the unit's home (cwd doesn't matter for the mount helper); the
+# ExecStart re-applies WorkingDirectory after the mount and finds the dir.
+WorkingDirectory=-/var/lib/left4me/runtime/%i/merged/left4dead2
+# Single source of truth for the kernel-overlayfs mount lifecycle: the web
+# app's start_instance only stages cfg files and asks systemd to enable+
+# start this unit; the actual `mount -t overlay` lives here so reboot
+# auto-start works the same as a UI-driven start. ExecStopPost mirrors it
+# so the unmount lives in the same place — no Python-side _mounter needed
+# in stop/delete/reset paths. Both helper verbs are idempotent.
+#
+# `+` prefix runs the helper with full privileges (root, no sandbox). Required
+# because the unit has NoNewPrivileges=true, which blocks sudo's setuid
+# escalation, and the helper itself needs root to nsenter into PID 1's mnt
+# namespace anyway. ExecStopPost (not ExecStop) so unmount runs after the
+# cgroup is cleared; ExecStop runs while srcds is still alive and would EBUSY.
+ExecStartPre=+/usr/local/libexec/left4me/left4me-overlay mount %i
 ExecStart=/var/lib/left4me/installation/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
+ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i
 Restart=on-failure
 RestartSec=5
+
+# Resource control baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
+Slice=l4d2-game.slice
+Nice=-5
+IOSchedulingClass=best-effort
+IOSchedulingPriority=4
+OOMScoreAdjust=-200
+MemoryHigh=1.5G
+MemoryMax=2G
+TasksMax=256
+LimitNOFILE=65536
+KillSignal=SIGINT
+TimeoutStopSec=15s
+LogRateLimitIntervalSec=0
+
+# Hardening (unchanged from previous baseline).
 NoNewPrivileges=true
 PrivateTmp=true
 PrivateDevices=true
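
The unit comments above rely on both helper verbs being idempotent, since `ExecStartPre` runs on every restart cycle and `ExecStopPost` may run when nothing is mounted. The guard logic can be sketched in isolation as follows. This is a simplified model, not the helper itself: `should_mount` and `should_umount` are hypothetical names, and `print_only` stands in for the `LEFT4ME_OVERLAY_PRINT_ONLY=1` test mode.

```python
import os
import tempfile


def should_mount(merged: str, print_only: bool = False) -> bool:
    """Proceed with the mount only when merged is not already a mount point."""
    return print_only or not os.path.ismount(merged)


def should_umount(merged: str, print_only: bool = False) -> bool:
    """Proceed with the unmount only when merged currently is a mount point."""
    return print_only or os.path.ismount(merged)


if __name__ == "__main__":
    # A freshly created temporary directory is a plain directory, not a mount.
    with tempfile.TemporaryDirectory() as plain_dir:
        print("mount needed:", should_mount(plain_dir))
        print("umount needed:", should_umount(plain_dir))
```

The two predicates are exact complements outside of print-only mode, which is what lets a repeated mount or a redundant unmount become a no-op.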
--- a/deploy/files/usr/local/libexec/left4me/left4me-overlay
+++ b/deploy/files/usr/local/libexec/left4me/left4me-overlay
@ -127,16 +127,30 @@ def exec_or_print(argv: list[str]) -> None:
 def cmd_mount(name: str) -> None:
    name = validate_name(name)
    r = root()
+    runtime_name_dir = (r / "runtime" / name).resolve(strict=True)
+    merged_for_check = (runtime_name_dir / "merged").resolve(strict=True)
+
+    # Idempotency for unit restart cycles: if a previous start mounted
+    # successfully but ExecStart failed afterwards (and Restart=on-failure
+    # fires another cycle), the second ExecStartPre would otherwise refuse
+    # to mount-on-top. Short-circuit here so the second cycle just gets
+    # straight to ExecStart. PRINT_ONLY (test mode) bypasses this so the
+    # tests can exercise the full nsenter argv regardless of mount state.
+    if (
+        os.environ.get("LEFT4ME_OVERLAY_PRINT_ONLY") != "1"
+        and os.path.ismount(merged_for_check)
+    ):
+        return
+
    instance_env = r / "instances" / name / "instance.env"
    raw_lowerdirs = parse_lowerdirs(instance_env)

    allowed_roots = [(r / sub).resolve() for sub in LOWERDIR_ALLOWLIST]
    canonical_lowerdirs = [str(canonical_under(allowed_roots, Path(p))) for p in raw_lowerdirs]

-    runtime_name_dir = (r / "runtime" / name).resolve(strict=True)
    upper = (runtime_name_dir / "upper").resolve(strict=True)
    work = (runtime_name_dir / "work").resolve(strict=True)
-    merged = (runtime_name_dir / "merged").resolve(strict=True)
+    merged = merged_for_check
    for label, path in (("upper", upper), ("work", work), ("merged", merged)):
        if path.parent != runtime_name_dir:
            die(f"{label} resolved outside runtime/{name}: {path}")
@ -164,6 +178,18 @@ def cmd_umount(name: str) -> None:
    merged = (runtime_name_dir / "merged").resolve(strict=True)
    if merged.parent != runtime_name_dir:
        die(f"merged resolved outside runtime/{name}: {merged}")
+
+    # Idempotency: if merged isn't a mount point right now, we have nothing
+    # to do. Mirrors cmd_mount's symmetric check. ExecStopPost on the unit
+    # is the one canonical caller, but a manual `systemctl reset-failed`
+    # cycle or a redundant cleanup pass should still be a no-op. PRINT_ONLY
+    # bypasses for the same reason as cmd_mount above.
+    if (
+        os.environ.get("LEFT4ME_OVERLAY_PRINT_ONLY") != "1"
+        and not os.path.ismount(merged)
+    ):
+        return
+
    argv = [
        NSENTER,
        "--mount=/proc/1/ns/mnt",
--- a/deploy/files/usr/local/libexec/left4me/left4me-script-sandbox
+++ b/deploy/files/usr/local/libexec/left4me/left4me-script-sandbox
@ -45,6 +45,8 @@ chmod 0755 "$OVERLAY_DIR"
 SCRIPT_RC=0
 systemd-run --quiet --collect --wait --pipe \
    --unit="left4me-script-${OVERLAY_ID}-$$" \
+    --slice=l4d2-build.slice \
+    -p OOMScoreAdjust=500 \
    -p User=l4d2-sandbox -p Group=l4d2-sandbox \
    -p UMask=0022 \
    -p NoNewPrivileges=yes \
--- a/deploy/files/usr/local/libexec/left4me/left4me-systemctl
+++ b/deploy/files/usr/local/libexec/left4me/left4me-systemctl
@ -2,7 +2,7 @@
 set -eu

 usage() {
-    printf '%s\n' "usage: left4me-systemctl start|stop|show <server-name>" >&2
+    printf '%s\n' "usage: left4me-systemctl enable|disable|show <server-name>" >&2
    exit 2
 }

@ -22,7 +22,7 @@ action=$1
 name=$2

 case "$action" in
-    start|stop|show) ;;
+    enable|disable|show) ;;
    *) usage ;;
 esac

@ -38,7 +38,7 @@ else
 fi

 case "$action" in
-    start) exec "$systemctl" start "$unit" ;;
-    stop) exec "$systemctl" stop "$unit" ;;
+    enable) exec "$systemctl" enable --now "$unit" ;;
+    disable) exec "$systemctl" disable --now "$unit" ;;
    show) exec "$systemctl" show --property=ActiveState --property=SubState "$unit" ;;
 esac
--- a/deploy/tests/test_deploy_artifacts.py
+++ b/deploy/tests/test_deploy_artifacts.py
@ -9,6 +9,9 @@ DEPLOY = ROOT / "deploy"

 WEB_UNIT = DEPLOY / "files/usr/local/lib/systemd/system/left4me-web.service"
 SERVER_UNIT = DEPLOY / "files/usr/local/lib/systemd/system/left4me-server@.service"
+GAME_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-game.slice"
+BUILD_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-build.slice"
+SYSCTL_CONF = DEPLOY / "files/etc/sysctl.d/99-left4me.conf"
 GLOBAL_REFRESH_SERVICE = DEPLOY / "files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.service"
 GLOBAL_REFRESH_TIMER = DEPLOY / "files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.timer"
 SANDBOX_UNIT_DIR = DEPLOY / "files/usr/local/lib/systemd/system"
@ -60,7 +63,10 @@ def test_server_unit_contains_required_runtime_contract():
    assert "Group=left4me" in unit
    assert "EnvironmentFile=/etc/left4me/host.env" in unit
    assert "EnvironmentFile=/var/lib/left4me/instances/%i/instance.env" in unit
-    assert "WorkingDirectory=/var/lib/left4me/runtime/%i/merged/left4dead2" in unit
+    # `-` prefix: chdir failure is non-fatal so ExecStartPre can run the
+    # mount helper before the merged dir exists. ExecStart re-applies and
+    # finds the dir once the mount has landed.
+    assert "WorkingDirectory=-/var/lib/left4me/runtime/%i/merged/left4dead2" in unit
    assert "ExecStart=/var/lib/left4me/installation/srcds_run" in unit
    assert "$L4D2_ARGS" in unit
    assert "${L4D2_ARGS}" not in unit
@ -75,6 +81,176 @@ def test_server_unit_contains_required_runtime_contract():
    assert "LockPersonality=true" in unit


+def test_server_unit_mounts_overlay_via_exec_start_pre():
+    """At boot, systemd auto-starts enabled units before the web app gets a
+    chance to run start_instance's pre-start mount. The unit itself must
+    re-mount the overlay so reboots are transparent. Pairs with the helper's
+    idempotency check (test_overlay_helper_mount_is_idempotent_when_already_mounted).
+    """
+    unit = SERVER_UNIT.read_text()
+    # `+` prefix: ExecStartPre runs with full privileges (root, no sandbox).
+    # Required because the unit has NoNewPrivileges=true, which blocks sudo's
+    # setuid escalation, and the helper needs root for nsenter anyway.
+    assert (
+        "ExecStartPre=+/usr/local/libexec/left4me/left4me-overlay mount %i"
+        in unit
+    )
+    # Bound the restart loop; without these, a CHDIR-failure (or any other
+    # pre-start error) spins indefinitely.
+    assert "StartLimitBurst=5" in unit
+    assert "StartLimitIntervalSec=60s" in unit
+
+
+def test_server_unit_unmounts_overlay_via_exec_stop_post():
+    """Single source of truth for unmount, mirroring the mount path.
+    ExecStopPost (not ExecStop) so it runs after srcds has fully exited
+    and the cgroup is cleared — otherwise the open files in merged/ would
+    EBUSY the umount syscall.
+    """
+    unit = SERVER_UNIT.read_text()
+    assert (
+        "ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i"
+        in unit
+    )
+
+
+def test_overlay_helper_mount_is_idempotent_when_already_mounted():
+    """ExecStartPre runs on every Restart=on-failure cycle. If a previous
+    start mounted successfully but ExecStart failed afterwards, the next
+    ExecStartPre would re-mount on top -- which fails. The helper must
+    short-circuit when merged is already a mount point.
+    """
+    text = OVERLAY_HELPER.read_text()
+    # Two ismount checks now: one in cmd_mount (skip if mounted),
+    # one in cmd_umount (skip if not mounted).
+    assert text.count("os.path.ismount") >= 2
+
+
+def test_server_unit_contains_perf_baseline_directives():
+    unit = SERVER_UNIT.read_text()
+
+    # Slice membership.
+    assert "Slice=l4d2-game.slice" in unit
+
+    # CFS priority bump (no SCHED_FIFO).
+    assert "Nice=-5" in unit
+    assert "CPUSchedulingPolicy=" not in unit
+
+    # I/O priority.
+    assert "IOSchedulingClass=best-effort" in unit
+    assert "IOSchedulingPriority=4" in unit
+
+    # OOM ordering: game servers survive, sandbox dies first.
+    assert "OOMScoreAdjust=-200" in unit
+
+    # Memory caps with headroom for map-load spikes.
+    assert "MemoryHigh=1.5G" in unit
+    assert "MemoryMax=2G" in unit
+
+    # Bounded fork surface.
+    assert "TasksMax=256" in unit
+
+    # Plenty of fds for plugin-heavy setups.
+    assert "LimitNOFILE=65536" in unit
+
+    # srcds clean shutdown via SIGINT, with time to flush.
+    assert "KillSignal=SIGINT" in unit
+    assert "TimeoutStopSec=15s" in unit
+
+    # Per-unit override of journald rate limiting (default drops srcds output).
+    assert "LogRateLimitIntervalSec=0" in unit
+
+
+def test_l4d2_game_slice_exists_with_high_weights():
+    assert GAME_SLICE.is_file()
+    text = GAME_SLICE.read_text()
+    assert "[Slice]" in text
+    assert "CPUWeight=1000" in text
+    assert "IOWeight=1000" in text
+
+
+def test_l4d2_build_slice_exists_with_low_weights():
+    assert BUILD_SLICE.is_file()
+    text = BUILD_SLICE.read_text()
+    assert "[Slice]" in text
+    assert "CPUWeight=10" in text
+    assert "IOWeight=10" in text
+
+
+def test_sysctl_conf_present_with_perf_settings():
+    assert SYSCTL_CONF.is_file()
+    text = SYSCTL_CONF.read_text()
+    for line in (
+        "net.core.rmem_max = 8388608",
+        "net.core.wmem_max = 8388608",
+        "net.core.rmem_default = 524288",
+        "net.core.wmem_default = 524288",
+        "net.core.netdev_max_backlog = 5000",
+        "net.core.netdev_budget = 600",
+        "vm.swappiness = 10",
+    ):
+        assert line in text, f"missing {line!r} in 99-left4me.conf"
+
+
+def test_script_sandbox_in_build_slice_with_oom_adjust():
+    text = SCRIPT_SANDBOX_HELPER.read_text()
+
+    # Put the transient unit in the low-weight build slice so it yields to
+    # game-server instances under CPU/IO contention.
+    assert "--slice=l4d2-build.slice" in text
+
+    # Sandbox dies first if the host hits memory pressure; servers
+    # (OOMScoreAdjust=-200) survive.
+    assert "-p OOMScoreAdjust=500" in text
+
+
+def test_deploy_script_installs_perf_artifacts():
+    script = DEPLOY_SCRIPT.read_text()
+
+    # Slice files copied into the system-wide systemd unit dir.
+    assert "/usr/local/lib/systemd/system/l4d2-game.slice" in script
+    assert "/usr/local/lib/systemd/system/l4d2-build.slice" in script
+
+    # Sysctl drop-in installed under /etc/sysctl.d/.
+    assert "/etc/sysctl.d/99-left4me.conf" in script
+
+    # Values applied immediately, not on next boot.
+    assert "sysctl --system" in script
+
+
+def test_deploy_script_writes_cpuset_drop_ins():
+    script = DEPLOY_SCRIPT.read_text()
+
+    # Reads nproc and binds defaults via ${VAR:-...}.
+    assert "nproc" in script
+    assert "LEFT4ME_SYSTEM_CPUS" in script
+    assert "LEFT4ME_GAME_CPUS" in script
+    assert "${LEFT4ME_SYSTEM_CPUS:-0}" in script
+
+    # Default game-core upper bound is computed from nproc; accept either
+    # the NPROC-1 form or LEFT4ME_GAME_CPUS:-1- prefix.
+    assert (
+        "1-$((NPROC - 1))" in script
+        or "1-$((NPROC-1))" in script
+        or "1-$((nproc-1))" in script
+        or "LEFT4ME_GAME_CPUS:-1-" in script
+    )
+
+    # All four drop-in paths.
+    for slice_name in ("system", "user", "l4d2-build", "l4d2-game"):
+        assert (
+            f"/etc/systemd/system/{slice_name}.slice.d/99-left4me-cpuset.conf"
+            in script
+        )
+
+    # Drop-ins use the existing install pattern.
+    assert "install -m 0644 -o root -g root" in script
+
+    # Single-core host: skip with a warning to stderr.
+    assert ("-lt 2" in script) or ("< 2" in script) or ("-ge 2" in script)
+    assert "skipping CPU isolation" in script
+
+
 def _fake_command(tmp_path, command_name):
    marker = tmp_path / f"{command_name}.args"
    command = tmp_path / command_name
@ -105,12 +281,16 @@ def test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args(tmp_pat

    for args in [
        ["bad/action", "alpha"],
-        ["start", ""],
-        ["start", ".hidden"],
-        ["start", "bad..name"],
-        ["start", "bad/name"],
-        ["start", "bad\\name"],
-        ["start", "bad name"],
+        # `start` and `stop` are no longer accepted verbs — the lifecycle now
+        # uses `enable`/`disable` for reboot survival via WantedBy= symlinks.
+        ["start", "alpha"],
+        ["stop", "alpha"],
+        ["enable", ""],
+        ["enable", ".hidden"],
+        ["enable", "bad..name"],
+        ["enable", "bad/name"],
+        ["enable", "bad\\name"],
+        ["enable", "bad name"],
    ]:
        result = subprocess.run(["sh", str(SYSTEMCTL_HELPER), *args], env=_env_with_fake_commands(tmp_path), check=False)
        assert result.returncode != 0
@ -118,8 +298,8 @@ def test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args(tmp_pat

    script = SYSTEMCTL_HELPER.read_text()
    assert 'unit="left4me-server@${name}.service"' in script
-    assert 'start) exec "$systemctl" start "$unit"' in script
-    assert 'stop) exec "$systemctl" stop "$unit"' in script
+    assert 'enable) exec "$systemctl" enable --now "$unit"' in script
+    assert 'disable) exec "$systemctl" disable --now "$unit"' in script
    assert "--property=ActiveState" in script
    assert "--property=SubState" in script

--- a/docs/superpowers/plans/2026-05-09-l4d2-cpu-isolation.md
+++ b/docs/superpowers/plans/2026-05-09-l4d2-cpu-isolation.md
@ -0,0 +1,260 @@
+# L4D2 CPU Isolation Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Constrain every cgroup that isn't a live game server to core 0; give game servers cores 1..N-1 exclusively, scaled automatically across host sizes.
+
+**Architecture:** Four `99-left4me-cpuset.conf` drop-ins under `/etc/systemd/system/{system,user,l4d2-build,l4d2-game}.slice.d/`, written by the deploy script via `printf` piped into `install`. `LEFT4ME_SYSTEM_CPUS` (default `0`) and `LEFT4ME_GAME_CPUS` (default `1-$((NPROC-1))`) are env-var overrides. Single-core hosts skip the cpuset writes with a warning.
+
+**Tech Stack:** systemd cgroup-v2 `AllowedCPUs=` directive, shell `printf` + `install`, Linux `nproc(1)`, pytest text-assertion tests.
+
+**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`
+
+---
+
+## File Structure
+
+Files to modify:
+
+- `deploy/deploy-test-server.sh` — compute `NPROC`, default `LEFT4ME_SYSTEM_CPUS=0` / `LEFT4ME_GAME_CPUS=1-$((NPROC-1))`, write four drop-in files. Skip when `nproc < 2` (with stderr warning) unless either env var is set explicitly.
+- `deploy/README.md` — append a "CPU isolation" subsection inside the existing "Performance Tuning" section.
+- `deploy/tests/test_deploy_artifacts.py` — new test functions.
+
+No host library or web app changes.
+
+---
+
+## Pre-flight
+
+- [ ] **Step 0a: Verify clean working tree**
+
+Run: `git status`
+Expected: `nothing to commit, working tree clean`
+
+- [ ] **Step 0b: Verify the existing deploy tests are at the known-good baseline**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
+Expected: 35 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`).
+
+If the count differs, stop and surface — this plan assumes that exact baseline.
+
+---
+
+## Task 1: Deploy-script CPU-isolation block + tests
+
+Write the four drop-ins from the deploy script in one cohesive block. The block computes `NPROC` once, resolves both env vars (with defaults), guards single-core hosts, and writes each drop-in via the existing `install -m 0644 -o root -g root` pattern. Tests cover defaults, overrides, single-core skip, and drop-in paths.
+
+**Files:**
+- Modify: `deploy/deploy-test-server.sh`
+- Modify: `deploy/tests/test_deploy_artifacts.py` (new test function)
+
+- [ ] **Step 1.1: Add the failing test**
+
+Open `deploy/tests/test_deploy_artifacts.py` and append (after the `test_deploy_script_installs_perf_artifacts` from the perf-baseline branch):
+
+```python
+def test_deploy_script_writes_cpuset_drop_ins():
+    script = DEPLOY_SCRIPT.read_text()
+
+    # Reads nproc and binds defaults via ${VAR:-...}.
+    assert "nproc" in script
+    assert "LEFT4ME_SYSTEM_CPUS" in script
+    assert "LEFT4ME_GAME_CPUS" in script
+    assert "${LEFT4ME_SYSTEM_CPUS:-0}" in script
+    # Default game-core expression: 1-(nproc-1). Match the form the
+    # implementer chose; both `1-$((NPROC-1))` and `1-$((nproc-1))` are
+    # acceptable as long as the upper bound is computed from nproc.
+    assert ("1-$((NPROC - 1))" in script) or ("1-$((NPROC-1))" in script) \
+        or ("1-$((nproc-1))" in script) or ("LEFT4ME_GAME_CPUS:-1-" in script)
+
+    # All four drop-in paths.
+    for slice_name in ("system", "user", "l4d2-build", "l4d2-game"):
+        assert f"/etc/systemd/system/{slice_name}.slice.d/99-left4me-cpuset.conf" in script
+
+    # Drop-ins use the existing install pattern.
+    assert "install -m 0644 -o root -g root" in script
+
+    # Single-core host: skip with a warning to stderr.
+    # Match either an explicit `nproc < 2` / `-lt 2` guard or `[ "$nproc" -ge 2 ]` form.
+    assert ("nproc" in script) and (("-lt 2" in script) or ("-ge 2" in script) or ("< 2" in script))
+    assert "skipping cpu isolation" in script.lower() or "skip cpu isolation" in script.lower()
+```
+
+- [ ] **Step 1.2: Run the new test, verify it fails**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_writes_cpuset_drop_ins -v`
+Expected: FAIL — none of the new strings exist yet.
+
+- [ ] **Step 1.3: Edit the deploy script — add the cpuset block**
+
+Open `deploy/deploy-test-server.sh`. Find the block that copies the slice files (added in the perf-baseline branch, around lines 139–140):
+
+```sh
+$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
+$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice
+```
+
+Immediately after that pair, before any of the helper-script copies that follow, insert this block:
+
+```sh
+# CPU isolation via cgroup-v2 AllowedCPUs= drop-ins. Pin everything that
+# isn't a live game server to core 0; give game servers cores 1..N-1.
+# See docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md.
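+# `nproc --all` counts installed processors. Plain `nproc` honors the calling
+# shell's CPU affinity and would report 1 once the user.slice drop-in is live.
+NPROC=$(nproc --all)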
+SYSTEM_CPUS=${LEFT4ME_SYSTEM_CPUS:-0}
+if [ "${LEFT4ME_GAME_CPUS+x}" = x ]; then
+    GAME_CPUS=$LEFT4ME_GAME_CPUS
+else
+    GAME_CPUS="1-$((NPROC-1))"
+fi
+if [ "$NPROC" -lt 2 ] && [ -z "${LEFT4ME_SYSTEM_CPUS+x}${LEFT4ME_GAME_CPUS+x}" ]; then
+    printf 'left4me deploy: skipping CPU isolation (nproc=%s); cpuset drop-ins not written.\n' "$NPROC" >&2
+else
+    for slice_name in system user l4d2-build; do
+        $sudo_cmd mkdir -p "/etc/systemd/system/${slice_name}.slice.d"
+        printf '[Slice]\nAllowedCPUs=%s\n' "$SYSTEM_CPUS" \
+            | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
+              "/etc/systemd/system/${slice_name}.slice.d/99-left4me-cpuset.conf"
+    done
+    $sudo_cmd mkdir -p "/etc/systemd/system/l4d2-game.slice.d"
+    printf '[Slice]\nAllowedCPUs=%s\n' "$GAME_CPUS" \
+        | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
+          "/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf"
+fi
+```
+
+Notes for the implementer:
+
+- The single-core skip only triggers when **neither** override is set. If the operator sets either `LEFT4ME_SYSTEM_CPUS` or `LEFT4ME_GAME_CPUS` explicitly on a single-core host, honor their intent.
+- `install -m 0644 -o root -g root /dev/stdin <dest>` is the idiomatic way to install a small generated file from a pipeline (matches the existing pattern for sandbox-resolv.conf, just with `/dev/stdin` as source).
+- The `mkdir -p` for each `.d` directory is required: `install` without `-D` does not create missing parent directories, and the drop-in directories may not exist yet on the target host.
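
The decision logic of the shell block in Step 1.3 can be restated as a small Python function, which makes the edge cases easier to see. This is an illustrative sketch only: `compute_cpusets` is a hypothetical name, it is not part of the deploy script, and it assumes the environment is passed in as a plain mapping.

```python
from typing import Mapping, Optional, Tuple


def compute_cpusets(nproc: int, env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Mirror the shell logic: return (system_cpus, game_cpus), or None when skipped."""
    system_set = "LEFT4ME_SYSTEM_CPUS" in env
    game_set = "LEFT4ME_GAME_CPUS" in env

    # The single-core skip applies only when neither override is set.
    if nproc < 2 and not (system_set or game_set):
        return None

    # ${LEFT4ME_SYSTEM_CPUS:-0}: an unset *or empty* value falls back to "0".
    system_cpus = env.get("LEFT4ME_SYSTEM_CPUS") or "0"
    # ${LEFT4ME_GAME_CPUS+x}: any set value, even an empty one, is used verbatim.
    game_cpus = env["LEFT4ME_GAME_CPUS"] if game_set else f"1-{nproc - 1}"
    return system_cpus, game_cpus


if __name__ == "__main__":
    print(compute_cpusets(8, {}))
    print(compute_cpusets(1, {}))
    print(compute_cpusets(1, {"LEFT4ME_GAME_CPUS": "0"}))
```

One edge case this makes visible: on a single-core host with only `LEFT4ME_SYSTEM_CPUS` set, the default game range comes out as `1-0`, so an operator forcing isolation there probably wants to set both variables.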
+
+- [ ] **Step 1.4: Verify shell syntax still parses**
+
+Run: `sh -n deploy/deploy-test-server.sh`
+Expected: exit 0, no output.
+
+- [ ] **Step 1.5: Run the new test and full deploy test suite**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
+Expected: 36 passed, 1 failed (the pre-existing unrelated test, count goes from 35→36 because of the new test).
+
+If your specific assertion forms in Step 1.1 don't match the implementation, adjust the test — but only the `or` branches; do not weaken the contract.
+
+- [ ] **Step 1.6: Commit**
+
+```bash
+git add deploy/deploy-test-server.sh deploy/tests/test_deploy_artifacts.py
+git commit -m "$(cat <<'EOF'
+feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest
+
+Computes NPROC at deploy time. Defaults LEFT4ME_SYSTEM_CPUS=0 and
+LEFT4ME_GAME_CPUS=1-(NPROC-1). Single-core hosts skip cpuset writes
+with a stderr warning unless an env var override is set. Spec:
+docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md
+EOF
+)"
+```
+
+---
+
+## Task 2: README "CPU isolation" subsection
+
+Append a subsection to `deploy/README.md` inside the existing "Performance Tuning" section, documenting the layout, the env-var overrides, the single-core skip, and the relationship to the existing per-instance `CPUAffinity=` escape hatch.
+
+**Files:**
+- Modify: `deploy/README.md`
+
+No test for this task — README content is documentation, not contract.
+
+- [ ] **Step 2.1: Append the CPU isolation subsection**
+
+Open `deploy/README.md`. Find the existing `### Per-instance CPU affinity` subsection (added in the perf-baseline branch). Insert a new subsection **immediately before** it (so the slice-level isolation is documented before the per-instance refinement that builds on top). The new subsection content:
+
+```markdown
+### CPU isolation (cores)
+
+The deploy script writes four `AllowedCPUs=` drop-ins so that, by default, only `l4d2-game.slice` is allowed to run on cores 1..N-1; `system.slice`, `user.slice`, and `l4d2-build.slice` are pinned to core 0. Game servers thus get the host minus core 0 exclusively, the build sandbox and the web app stay on core 0, and a logged-in admin running CPU-heavy work in their shell can't steal cycles from a live match.
+
+Override the split by setting either env var when running the deploy:
+
+```sh
+LEFT4ME_SYSTEM_CPUS="0,1" LEFT4ME_GAME_CPUS="2-7" deploy/deploy-test-server.sh deploy-user@host
+```
+
+On single-core hosts the deploy skips the cpuset drop-ins entirely and prints a warning to stderr; the rest of the perf baseline (cgroup weights, sysctls, OOM scores) still applies. To force isolation on a single-core host anyway (rarely useful), set either env var explicitly.
+
+Per-instance `CPUAffinity=` (next subsection) composes on top of this — the per-instance value must be a subset of `l4d2-game.slice`'s `AllowedCPUs=`, which the kernel enforces.
+```
+
+(The outer triple-backticks above are markdown punctuation around this prompt block, not part of the README content. Inner code-block fences DO need to be written into the README. The `markdown` language tag on the outer fence in this plan is documentation-only.)
+
+- [ ] **Step 2.2: Run the full deploy test suite**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
+Expected: 36 passed, 1 failed (unchanged; README has no test).
+
+- [ ] **Step 2.3: Commit**
+
+```bash
+git add deploy/README.md
+git commit -m "$(cat <<'EOF'
+docs(deploy): document CPU isolation in performance-tuning section
+
+Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS /
+LEFT4ME_GAME_CPUS overrides, the single-core skip, and the
+subset-of relationship with per-instance CPUAffinity=.
+EOF
+)"
+```
+
+---
+
+## Final Verification
+
+- [ ] **Step F.1: Full deploy + host + web test sweep**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
+Expected: deploy 36 passed / 1 failed (pre-existing); host 111 passed / 1 skipped; web 313 passed / 1 skipped.
+
+- [ ] **Step F.2: Working tree clean and commits in order**
+
+Run: `git status && git log --oneline -5`
+Expected:
+- `git status`: clean.
+- Top of `git log`:
+  1. `docs(deploy): document CPU isolation in performance-tuning section`
+  2. `feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest`
+  3. `docs(plans): l4d2 cpu isolation — implementation plan`
+  4. `docs(specs): l4d2 cpu isolation — design`
+
+- [ ] **Step F.3: Operator-side smoke test (deferred, not part of this plan)**
+
+This plan ships artifacts. Confirming systemd actually enforces `AllowedCPUs=` on a real Trixie host is operator-side:
+
+```sh
+deploy/deploy-test-server.sh deploy-user@example-host
+ssh deploy-user@example-host '
+  systemctl cat system.slice | grep AllowedCPUs
+  systemctl cat l4d2-game.slice | grep AllowedCPUs
+  cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
+  cat /sys/fs/cgroup/l4d2-game.slice/cpuset.cpus.effective
+'
+# Expect on an 8-core box:
+#   system.slice    → AllowedCPUs=0   → cpuset.cpus.effective = 0
+#   l4d2-game.slice → AllowedCPUs=1-7 → cpuset.cpus.effective = 1-7
+```
+
+End-to-end behavioural test (manual, ops-side): on a 4-core host, run two L4D2 instances and a script-sandbox build simultaneously. Confirm via `htop` (with the `PROCESSOR` column enabled, which shows the CPU each process last ran on) that the srcds processes only ever appear on cores 1, 2, and 3 while the sandbox and web app stay on core 0.
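
The `cpuset.cpus.effective` values in the smoke test use the kernel's CPU-list notation (`0`, `1-7`, `0,2-3`). A minimal sketch of expanding that notation and checking the subset rule for per-instance `CPUAffinity=` follows. Both function names are hypothetical, and the parser assumes comma-separated lists only, although systemd also accepts whitespace-separated values.

```python
from typing import Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Expand a kernel CPU list such as "0,2-3" into {0, 2, 3}."""
    cpus: Set[int] = set()
    for part in spec.strip().split(","):
        if not part:
            continue  # tolerate an empty file or a trailing comma
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def affinity_fits(instance_cpus: str, slice_cpus: str) -> bool:
    """True when a per-instance CPUAffinity= stays inside the slice's AllowedCPUs=."""
    return parse_cpu_list(instance_cpus) <= parse_cpu_list(slice_cpus)


if __name__ == "__main__":
    print(sorted(parse_cpu_list("1-7")))
    print(affinity_fits("2", "1-7"))
    print(affinity_fits("0", "1-7"))
```

An operator could feed the output of `cat /sys/fs/cgroup/l4d2-game.slice/cpuset.cpus.effective` into `parse_cpu_list` to confirm that no game core overlaps the system set.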
+
+---
+
+## Out of Scope (do NOT implement here)
+
+- Kernel `isolcpus=` / `nohz_full=` / `rcu_nocbs=` boot params.
+- NIC IRQ pinning automation.
+- Per-instance `CPUAffinity=` driven by a deploy-env knob.
+- A separate `l4d2-web.slice`.
+- Any web-app or host-library code changes.
+
+If you find yourself touching any of these, stop — they belong in a separate spec.
--- a/docs/superpowers/plans/2026-05-09-l4d2-server-host-perf-baseline.md
+++ b/docs/superpowers/plans/2026-05-09-l4d2-server-host-perf-baseline.md
@ -0,0 +1,686 @@
+# L4D2 Server Host Perf Baseline Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Apply a host-side performance and resource-isolation baseline (systemd directives, slice hierarchy, host sysctls) to every L4D2 server instance, leaving game ConVars to the maintainer.
+
+**Architecture:** Add resource-control directives to `left4me-server@.service`; introduce two flat top-level slices (`l4d2-game.slice` weight 1000, `l4d2-build.slice` weight 10) so the build sandbox is starved by the kernel under contention; ship `/etc/sysctl.d/99-left4me.conf` for UDP buffer and netdev tuning; place the script-sandbox transient unit into `l4d2-build.slice` with `OOMScoreAdjust=500`. RT scheduling, CPU governor, CPUAffinity, NIC tuning are documentation-only escape hatches.
+
+**Tech Stack:** systemd unit files (service + slice), `systemd-run` properties, Linux sysctl, bash deploy script, pytest text-assertion tests under `deploy/tests/test_deploy_artifacts.py`.
+
+**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`
+
+---
+
+## File Structure
+
+Files to create:
+
+- `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice` — high-weight slice for game-server instances.
+- `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice` — low-weight slice for sandboxed script-overlay builds.
+- `deploy/files/etc/sysctl.d/99-left4me.conf` — host UDP/netdev/swap sysctls.
+
+Files to modify:
+
+- `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` — add resource-control directives (`Slice`, `Nice`, `IOSchedulingClass`, `IOSchedulingPriority`, `OOMScoreAdjust`, `MemoryHigh`, `MemoryMax`, `TasksMax`, `LimitNOFILE`, `KillSignal`, `TimeoutStopSec`, `LogRateLimitIntervalSec`).
+- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` — add `--slice=l4d2-build.slice` and `-p OOMScoreAdjust=500` to the `systemd-run` invocation.
+- `deploy/deploy-test-server.sh` — copy the two slice files and the sysctl conf during deploy; run `sysctl --system` so values take effect immediately.
+- `deploy/README.md` — append a "Performance tuning" section with the four documented escape hatches.
+- `deploy/tests/test_deploy_artifacts.py` — new tests for each artifact above (text assertions following the existing `assert "X" in text` style).
+
+No application code (Python, Flask, host library) is touched.
+
+---
+
+## Pre-flight
+
+- [ ] **Step 0a: Verify clean working tree**
+
+Run: `git status`
+Expected: `nothing to commit, working tree clean`
+
+- [ ] **Step 0b: Verify the existing deploy tests pass**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
+Expected: all green.
+
+If any test is already red, stop and surface — this plan assumes the baseline is green.
+
+---
+
+## Task 1: Per-Instance Unit Resource-Control Directives
+
+Add the per-instance baseline to `left4me-server@.service`. This task is self-contained even though `Slice=l4d2-game.slice` references a slice that doesn't exist yet — systemd does not validate the reference until the unit is actually started, and the deploy artifact tests are pure text checks.
+
+**Files:**
+- Modify: `deploy/files/usr/local/lib/systemd/system/left4me-server@.service`
+- Test: `deploy/tests/test_deploy_artifacts.py` (new test function)
+
+- [ ] **Step 1.1: Add the failing test**
+
+Open `deploy/tests/test_deploy_artifacts.py` and append (after `test_server_unit_contains_required_runtime_contract`):
+
+```python
+def test_server_unit_contains_perf_baseline_directives():
+    unit = SERVER_UNIT.read_text()
+
+    # Slice membership.
+    assert "Slice=l4d2-game.slice" in unit
+
+    # CFS priority bump (no SCHED_FIFO).
+    assert "Nice=-5" in unit
+    assert "CPUSchedulingPolicy=" not in unit
+
+    # I/O priority.
+    assert "IOSchedulingClass=best-effort" in unit
+    assert "IOSchedulingPriority=4" in unit
+
+    # OOM ordering: game servers survive, sandbox dies first.
+    assert "OOMScoreAdjust=-200" in unit
+
+    # Memory caps with headroom for map-load spikes.
+    assert "MemoryHigh=1.5G" in unit
+    assert "MemoryMax=2G" in unit
+
+    # Bounded fork surface.
+    assert "TasksMax=256" in unit
+
+    # Plenty of fds for plugin-heavy setups.
+    assert "LimitNOFILE=65536" in unit
+
+    # srcds clean shutdown via SIGINT, with time to flush.
+    assert "KillSignal=SIGINT" in unit
+    assert "TimeoutStopSec=15s" in unit
+
+    # Per-unit override of journald rate limiting (default drops srcds output).
+    assert "LogRateLimitIntervalSec=0" in unit
+```
+
+- [ ] **Step 1.2: Run the new test, verify it fails**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_server_unit_contains_perf_baseline_directives -v`
+Expected: FAIL — first failing assert is on `Slice=l4d2-game.slice`.
+
+- [ ] **Step 1.3: Edit the unit file**
+
+Open `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` and replace its contents with:
+
+```ini
+[Unit]
+Description=left4me server instance %i
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+User=left4me
+Group=left4me
+EnvironmentFile=/etc/left4me/host.env
+EnvironmentFile=/var/lib/left4me/instances/%i/instance.env
+WorkingDirectory=/var/lib/left4me/runtime/%i/merged/left4dead2
+ExecStart=/var/lib/left4me/installation/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
+Restart=on-failure
+RestartSec=5
+
+# Resource control baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
+Slice=l4d2-game.slice
+Nice=-5
+IOSchedulingClass=best-effort
+IOSchedulingPriority=4
+OOMScoreAdjust=-200
+MemoryHigh=1.5G
+MemoryMax=2G
+TasksMax=256
+LimitNOFILE=65536
+KillSignal=SIGINT
+TimeoutStopSec=15s
+LogRateLimitIntervalSec=0
+
+# Hardening (unchanged from previous baseline).
+NoNewPrivileges=true
+PrivateTmp=true
+PrivateDevices=true
+ProtectHome=true
+ProtectSystem=strict
+ReadOnlyPaths=/var/lib/left4me/installation /var/lib/left4me/overlays
+ReadWritePaths=/var/lib/left4me/runtime/%i
+RestrictSUIDSGID=true
+LockPersonality=true
+
+[Install]
+WantedBy=multi-user.target
+```
+
+- [ ] **Step 1.4: Run the new test, verify it passes**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_server_unit_contains_perf_baseline_directives -v`
+Expected: PASS.
+
+- [ ] **Step 1.5: Re-run the existing server-unit test, verify still passes**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_server_unit_contains_required_runtime_contract -v`
+Expected: PASS — the existing assertions (`User=left4me`, `Group=left4me`, hardening directives, etc.) still match.
+
+- [ ] **Step 1.6: Commit**
+
+```bash
+git add deploy/files/usr/local/lib/systemd/system/left4me-server@.service deploy/tests/test_deploy_artifacts.py
+git commit -m "$(cat <<'EOF'
+feat(deploy): perf-baseline directives on left4me-server@.service
+
+Slice=l4d2-game.slice, Nice=-5, IOSchedulingClass=best-effort,
+OOMScoreAdjust=-200, MemoryHigh=1.5G, MemoryMax=2G, TasksMax=256,
+LimitNOFILE=65536, KillSignal=SIGINT, TimeoutStopSec=15s,
+LogRateLimitIntervalSec=0. Spec:
+docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
+EOF
+)"
+```
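
The Step 1.1 test relies on substring checks, which match the existing test style but cannot tell `MemoryMax=2G` from `MemoryMax=2GB`, or see which section a key sits in. If stricter checks are ever wanted, a small parser along these lines could back them. This is a sketch under assumptions: `unit_directives` is a hypothetical helper, not something in `test_deploy_artifacts.py`, and it ignores backslash line continuations.

```python
from collections import defaultdict
from typing import Dict, List


def unit_directives(text: str) -> Dict[str, Dict[str, List[str]]]:
    """Parse unit-file text into {section: {key: [values, ...]}}.

    Values are kept as lists because keys such as EnvironmentFile= may repeat.
    """
    parsed: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    section = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, sep, value = line.partition("=")
        if sep:
            parsed[section][key.strip()].append(value.strip())
    return parsed


if __name__ == "__main__":
    sample = "[Service]\nNice=-5\nMemoryMax=2G\n"
    print(dict(unit_directives(sample)["Service"]))
```

With this, an assertion such as `unit_directives(unit)["Service"]["MemoryMax"] == ["2G"]` pins both the section and the exact value.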
+
+---
+
+## Task 2: Slice Unit Files
+
+Create the two slice unit files. After this task the perf unit's `Slice=l4d2-game.slice` reference is satisfied.
+
+**Files:**
+- Create: `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice`
+- Create: `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice`
+- Test: `deploy/tests/test_deploy_artifacts.py` (new constants + new test functions)
+
+- [ ] **Step 2.1: Add path constants and failing tests**
+
+Open `deploy/tests/test_deploy_artifacts.py`. After the existing `SERVER_UNIT = ...` line, add:
+
+```python
+GAME_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-game.slice"
+BUILD_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-build.slice"
+```
+
+After the new `test_server_unit_contains_perf_baseline_directives`, append:
+
+```python
+def test_l4d2_game_slice_exists_with_high_weights():
+    assert GAME_SLICE.is_file()
+    text = GAME_SLICE.read_text()
+    assert "[Slice]" in text
+    assert "CPUWeight=1000" in text
+    assert "IOWeight=1000" in text
+
+
+def test_l4d2_build_slice_exists_with_low_weights():
+    assert BUILD_SLICE.is_file()
+    text = BUILD_SLICE.read_text()
+    assert "[Slice]" in text
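+    # Match whole lines: the substring "CPUWeight=10" also occurs in "CPUWeight=1000".
+    assert "CPUWeight=10" in text.splitlines()
+    assert "IOWeight=10" in text.splitlines()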
+```
+
+- [ ] **Step 2.2: Run the new tests, verify they fail**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_l4d2_game_slice_exists_with_high_weights deploy/tests/test_deploy_artifacts.py::test_l4d2_build_slice_exists_with_low_weights -v`
+Expected: FAIL on `assert GAME_SLICE.is_file()` (file does not exist).
+
+- [ ] **Step 2.3: Create the game slice file**
+
+Create `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice` with:
+
+```ini
+[Unit]
+Description=left4me game-server slice
+Before=slices.target
+
+[Slice]
+CPUWeight=1000
+IOWeight=1000
+```
+
+- [ ] **Step 2.4: Create the build slice file**
+
+Create `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice` with:
+
+```ini
+[Unit]
+Description=left4me script-sandbox build slice
+Before=slices.target
+
+[Slice]
+CPUWeight=10
+IOWeight=10
+```
+
+- [ ] **Step 2.5: Run the new tests, verify they pass**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_l4d2_game_slice_exists_with_high_weights deploy/tests/test_deploy_artifacts.py::test_l4d2_build_slice_exists_with_low_weights -v`
+Expected: PASS.
+
+- [ ] **Step 2.6: Commit**
+
+```bash
+git add deploy/files/usr/local/lib/systemd/system/l4d2-game.slice deploy/files/usr/local/lib/systemd/system/l4d2-build.slice deploy/tests/test_deploy_artifacts.py
+git commit -m "$(cat <<'EOF'
+feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio
+
+Flat top-level slices. Game wins under contention; build still gets
+the box when uncontended. Referenced by left4me-server@.service and
+the script-sandbox systemd-run invocation.
+EOF
+)"
+```
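
To put a number on the 100:1 ratio: cgroup-v2 CPU weights divide CPU time proportionally among sibling cgroups that are all runnable at the same moment. The sketch below computes those shares. `contended_shares` is a hypothetical helper, and the second example assumes `system.slice` and `user.slice` sit at systemd's default `CPUWeight=100` and are busy too, which is the worst case rather than the norm.

```python
from typing import Dict


def contended_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Return each cgroup's fraction of CPU time when every sibling is runnable."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # Only the two left4me slices competing.
    two = contended_shares({"l4d2-game.slice": 1000, "l4d2-build.slice": 10})
    print({name: round(share, 4) for name, share in two.items()})

    # Worst case: system.slice and user.slice (default weight 100) are busy too.
    four = contended_shares({
        "l4d2-game.slice": 1000,
        "l4d2-build.slice": 10,
        "system.slice": 100,
        "user.slice": 100,
    })
    print({name: round(share, 4) for name, share in four.items()})
```

Weights only matter under contention, so an idle game slice leaves the whole box to a running build.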
+
+---
+
+## Task 3: Host Sysctls
+
+Ship a `/etc/sysctl.d/` drop-in for UDP buffers, netdev backlog, netdev budget, and `vm.swappiness`.
+
+**Files:**
+- Create: `deploy/files/etc/sysctl.d/99-left4me.conf`
+- Test: `deploy/tests/test_deploy_artifacts.py` (new constant + new test function)
+
+- [ ] **Step 3.1: Add path constant and failing test**
+
+Open `deploy/tests/test_deploy_artifacts.py`. After the slice constants, add:
+
+```python
+SYSCTL_CONF = DEPLOY / "files/etc/sysctl.d/99-left4me.conf"
+```
+
+Append a new test:
+
+```python
+def test_sysctl_conf_present_with_perf_settings():
+    assert SYSCTL_CONF.is_file()
+    text = SYSCTL_CONF.read_text()
+    for line in (
+        "net.core.rmem_max = 8388608",
+        "net.core.wmem_max = 8388608",
+        "net.core.rmem_default = 524288",
+        "net.core.wmem_default = 524288",
+        "net.core.netdev_max_backlog = 5000",
+        "net.core.netdev_budget = 600",
+        "vm.swappiness = 10",
+    ):
+        assert line in text, f"missing {line!r} in 99-left4me.conf"
+```
+
+- [ ] **Step 3.2: Run the new test, verify it fails**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_sysctl_conf_present_with_perf_settings -v`
+Expected: FAIL on `assert SYSCTL_CONF.is_file()`.
+
+- [ ] **Step 3.3: Create the sysctl conf file**
+
+Create `deploy/files/etc/sysctl.d/99-left4me.conf` with:
+
+```
+# Host-side perf baseline for left4me — see
+# docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
+#
+# UDP socket buffers: the usual kernel default of 212992 bytes (~208 KiB) is
+# too small for sustained Source-engine UDP across multiple instances. 8 MiB
+# is a commonly recommended ceiling for 1 Gbit links; rmem_default and
+# wmem_default cover sockets that never enlarge their own buffers.
+net.core.rmem_max = 8388608
+net.core.wmem_max = 8388608
+net.core.rmem_default = 524288
+net.core.wmem_default = 524288
+
+# Kernel softirq UDP path: the per-CPU backlog queue starts dropping packets
+# at the default 1000 under multi-instance burst; 5000 absorbs realistic peaks.
+# netdev_budget = 600 gives softirq more drain headroom per pass.
+net.core.netdev_max_backlog = 5000
+net.core.netdev_budget = 600
+
+# Latency-sensitive default: avoid swap unless the box is really under
+# pressure. Harmless on swapless hosts.
+vm.swappiness = 10
+```
+
+- [ ] **Step 3.4: Run the new test, verify it passes**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_sysctl_conf_present_with_perf_settings -v`
+Expected: PASS.
+
+- [ ] **Step 3.5: Commit**
+
+```bash
+git add deploy/files/etc/sysctl.d/99-left4me.conf deploy/tests/test_deploy_artifacts.py
+git commit -m "$(cat <<'EOF'
+feat(deploy): host sysctls for UDP buffers + netdev backlog/budget
+
+99-left4me.conf: rmem_max/wmem_max=8M (with 512K defaults),
+netdev_max_backlog=5000, netdev_budget=600, vm.swappiness=10.
+EOF
+)"
+```
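
The artifact test only proves the file contains the right lines. Whether the running kernel picked them up is a separate question, which the sketch below approaches by comparing the file against `/proc/sys`. All three function names are hypothetical, and the key-to-path mapping assumes no key contains a dot inside a component (some interface names do), which holds for the seven settings in this file.

```python
from pathlib import Path
from typing import Dict


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Parse `key = value` lines, skipping blanks and `#`/`;` comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def proc_path(key: str) -> Path:
    """Map a sysctl key such as net.core.rmem_max to its /proc/sys path."""
    return Path("/proc/sys") / key.replace(".", "/")


def mismatches(expected: Dict[str, str]) -> Dict[str, str]:
    """Return {key: live_value} for every setting whose live value differs."""
    wrong: Dict[str, str] = {}
    for key, want in expected.items():
        path = proc_path(key)
        live = path.read_text().strip() if path.exists() else "<missing>"
        if live != want:
            wrong[key] = live
    return wrong


if __name__ == "__main__":
    conf = Path("/etc/sysctl.d/99-left4me.conf")
    if conf.exists():
        print(mismatches(parse_sysctl_conf(conf.read_text())))
```

Run on the target host after a deploy, an empty dict would suggest `sysctl --system` applied every value; a later drop-in overriding one of them would show up as a mismatch.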
+
+---
+
+## Task 4: Sandbox in Build Slice
+
+Place the script-sandbox transient unit into `l4d2-build.slice` and give it `OOMScoreAdjust=500` so it dies first under memory pressure.
+
+**Files:**
+- Modify: `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`
+- Test: `deploy/tests/test_deploy_artifacts.py` (new test function)
+
+- [ ] **Step 4.1: Add the failing test**
+
+Open `deploy/tests/test_deploy_artifacts.py`. Append:
+
+```python
+def test_script_sandbox_in_build_slice_with_oom_adjust():
+    text = SCRIPT_SANDBOX_HELPER.read_text()
+
+    # Put the transient unit in the low-weight build slice so it yields to
+    # game-server instances under CPU/IO contention.
+    assert "--slice=l4d2-build.slice" in text
+
+    # Sandbox dies first if the host hits memory pressure; servers
+    # (OOMScoreAdjust=-200) survive.
+    assert "-p OOMScoreAdjust=500" in text
+```
+
+- [ ] **Step 4.2: Run the new test, verify it fails**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_script_sandbox_in_build_slice_with_oom_adjust -v`
+Expected: FAIL — neither string is in the helper yet.
+
+- [ ] **Step 4.3: Edit the sandbox helper**
+
+Open `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`. Locate the `systemd-run` invocation that begins with:
+
+```
+systemd-run --quiet --collect --wait --pipe \
+    --unit="left4me-script-${OVERLAY_ID}-$$" \
+```
+
+Insert two new lines immediately after the `--unit=` line, before `-p User=l4d2-sandbox`. The block becomes:
+
+```
+systemd-run --quiet --collect --wait --pipe \
+    --unit="left4me-script-${OVERLAY_ID}-$$" \
+    --slice=l4d2-build.slice \
+    -p OOMScoreAdjust=500 \
+    -p User=l4d2-sandbox -p Group=l4d2-sandbox \
+```
+
+Leave every other `-p` line untouched.
+
+- [ ] **Step 4.4: Verify shell syntax still parses**
+
+Run: `bash -n deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`
+Expected: exit 0, no output.
+
+- [ ] **Step 4.5: Run the new test and the existing sandbox-helper tests, verify they pass**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_script_sandbox_in_build_slice_with_oom_adjust deploy/tests/test_deploy_artifacts.py::test_script_sandbox_helper_invokes_systemd_run_with_hardening deploy/tests/test_deploy_artifacts.py::test_script_sandbox_helper_passes_shell_syntax_check -v`
+Expected: PASS for all three. The hardening test still matches because it only checks for substring presence; we added strings, didn't remove any.
+
+- [ ] **Step 4.6: Commit**
+
+```bash
+git add deploy/files/usr/local/libexec/left4me/left4me-script-sandbox deploy/tests/test_deploy_artifacts.py
+git commit -m "$(cat <<'EOF'
+feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500
+
+Builds yield CPU/IO to game-server instances under contention via the
+slice's weight=10, and are killed first under memory pressure
+(servers have OOMScoreAdjust=-200).
+EOF
+)"
+```
+
+---
+
+## Task 5: Deploy Script Installs Slice + Sysctl Artifacts
+
+Wire the new artifacts into `deploy-test-server.sh` so a fresh deploy actually puts them on disk and applies the sysctls.
+
+**Files:**
+- Modify: `deploy/deploy-test-server.sh`
+- Test: `deploy/tests/test_deploy_artifacts.py` (new test function)
+
+- [ ] **Step 5.1: Add the failing test**
+
+Open `deploy/tests/test_deploy_artifacts.py`. Append:
+
+```python
+def test_deploy_script_installs_perf_artifacts():
+    script = DEPLOY_SCRIPT.read_text()
+
+    # Slice files copied into the system-wide systemd unit dir.
+    assert "/usr/local/lib/systemd/system/l4d2-game.slice" in script
+    assert "/usr/local/lib/systemd/system/l4d2-build.slice" in script
+
+    # Sysctl drop-in installed under /etc/sysctl.d/.
+    assert "/etc/sysctl.d/99-left4me.conf" in script
+
+    # Values applied immediately, not on next boot.
+    assert "sysctl --system" in script
+```
+
+- [ ] **Step 5.2: Run the new test, verify it fails**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_installs_perf_artifacts -v`
+Expected: FAIL on the first assertion.
+
+- [ ] **Step 5.3: Edit the deploy script — copy the slice + sysctl files**
+
+Open `deploy/deploy-test-server.sh`. Find the block that copies unit files (currently around line 138):
+
+```sh
+$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-web.service /usr/local/lib/systemd/system/left4me-web.service
+$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service /usr/local/lib/systemd/system/left4me-server@.service
+```
+
+Add two new lines immediately after the `left4me-server@.service` copy line, so the block becomes:
+
+```sh
+$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-web.service /usr/local/lib/systemd/system/left4me-web.service
+$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service /usr/local/lib/systemd/system/left4me-server@.service
+$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
+$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice
+```
+
+- [ ] **Step 5.4: Edit the deploy script — install the sysctl conf and apply it**
+
+In `deploy/deploy-test-server.sh`, find the block that installs `/etc/left4me/sandbox-resolv.conf` (currently around lines 153–155):
+
+```sh
+$sudo_cmd install -m 0644 -o root -g root \
+    /opt/left4me/deploy/files/etc/left4me/sandbox-resolv.conf \
+    /etc/left4me/sandbox-resolv.conf
+```
+
+Immediately after that block, add:
+
+```sh
+# Host perf-baseline sysctls. Apply with `sysctl --system` so values
+# take effect this deploy, not on next reboot.
+$sudo_cmd install -m 0644 -o root -g root \
+    /opt/left4me/deploy/files/etc/sysctl.d/99-left4me.conf \
+    /etc/sysctl.d/99-left4me.conf
+$sudo_cmd sysctl --system >/dev/null
+```
+
+- [ ] **Step 5.5: Verify the deploy script's shell syntax still parses**
+
+Run: `sh -n deploy/deploy-test-server.sh`
+Expected: exit 0, no output.
+
+- [ ] **Step 5.6: Run the new test and the existing deploy-script tests, verify they pass**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_installs_perf_artifacts deploy/tests/test_deploy_artifacts.py::test_deploy_script_has_safe_defaults_and_preserves_state deploy/tests/test_deploy_artifacts.py::test_deploy_script_shell_syntax -v`
+Expected: PASS for all three.
+
+- [ ] **Step 5.7: Commit**
+
+```bash
+git add deploy/deploy-test-server.sh deploy/tests/test_deploy_artifacts.py
+git commit -m "$(cat <<'EOF'
+feat(deploy): install slice + sysctl artifacts and apply via sysctl --system
+
+Copies l4d2-game.slice and l4d2-build.slice into
+/usr/local/lib/systemd/system/, installs 99-left4me.conf into
+/etc/sysctl.d/, and runs sysctl --system so the perf baseline is
+live this deploy, not on next reboot.
+EOF
+)"
+```
+
+---
+
+## Task 6: Performance-Tuning Section in deploy/README.md
+
+Document the four escape hatches the spec lists as opt-in: CPU governor, per-instance `CPUAffinity`, NIC tuning, and SCHED_FIFO.
+
+**Files:**
+- Modify: `deploy/README.md`
+
+No test for this task — README content is documentation, not contract.
+
+- [ ] **Step 6.1: Append the Performance Tuning section**
+
+Open `deploy/README.md`. Append (after the existing final paragraph) a new section:
+
+```markdown
+## Performance Tuning
+
+The deployment ships a host-side perf baseline (slices, unit directives, sysctls). See `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md` for design rationale.
+
+The following knobs are documented escape hatches — they are **not** auto-applied. Apply only if you have measured a need and understand the failure modes.
+
+### CPU governor
+
+The performance governor squeezes a few percent off jitter under bursty load. `schedutil` is acceptable for sustained UDP workloads.
+
+```sh
+sudo cpupower frequency-set -g performance
+```
+
+Persist via your distro's CPU-frequency tooling (e.g. `/etc/default/cpufrequtils`).
+
+### Per-instance CPU affinity
+
+`srcds` is single-threaded per instance. On a multi-core host, pinning each instance to its own core can cut jitter under contention. Drop in `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf`:
+
+```ini
+[Service]
+CPUAffinity=2
+```
+
+A reasonable strategy on an N-core host: leave core 0 for the kernel + IRQs + system services, then pin one instance per remaining core.
+
+### NIC tuning
+
+Hardware-specific. On a host with a single primary interface (replace `eth0`):
+
+```sh
+sudo ethtool -G eth0 rx 4096 tx 4096
+sudo ethtool -K eth0 gro on lro off
+```
+
+If you run a high instance count, also pin the NIC's interrupts off the cores that game servers occupy (see `/proc/interrupts` and `/proc/irq/<n>/smp_affinity`).
+
+### Real-time scheduling (advanced, opt-in)
+
+Source-engine servers do not need real-time scheduling, and a misbehaving `srcds` at any RT priority can starve kernel threads on its core. The default `kernel.sched_rt_runtime_us=950000` only reserves 5% of each second for non-RT tasks, which limits the damage but does not prevent it. Use only if you have a measured jitter problem that the baseline does not solve.
+
+`/etc/systemd/system/left4me-server@.service.d/realtime.conf`:
+
+```ini
+[Service]
+CPUSchedulingPolicy=fifo
+CPUSchedulingPriority=10
+LimitRTPRIO=10
+```
+
+### Applying changes to running servers
+
+Unit-file changes do not apply to already-running services. After any change:
+
+```sh
+sudo systemctl daemon-reload
+# Restart each game server via the web UI's stop + start, or:
+sudo systemctl restart 'left4me-server@*.service'
+```
+```
+
+- [ ] **Step 6.2: Run the full deploy test suite and verify it stays green**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
+Expected: all green. README changes have no test, but should not break any existing tests.
+
+- [ ] **Step 6.3: Commit**
+
+```bash
+git add deploy/README.md
+git commit -m "$(cat <<'EOF'
+docs(deploy): performance-tuning escape-hatch section in README
+
+Documents CPU governor, per-instance CPUAffinity, NIC tuning, and
+SCHED_FIFO opt-in patterns. None of these are auto-applied; they're
+ops-side knobs for measured problems the perf baseline doesn't solve.
+EOF
+)"
+```
+
+---
+
+## Final Verification
+
+- [ ] **Step F.1: Full deploy test suite green**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ -q`
+Expected: all green.
+
+- [ ] **Step F.2: Host library + web tests still green (regression check)**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2host/tests -q && pytest l4d2web/tests -q`
+Expected: all green. Nothing in this plan touches host or web Python code, but a clean run rules out accidental import-time damage.
+
+- [ ] **Step F.3: Working tree clean and commits in order**
+
+Run: `git status && git log --oneline -8`
+Expected:
+- `git status`: `nothing to commit, working tree clean`.
+- `git log`: six new commits in this order, top-most first:
+  1. `docs(deploy): performance-tuning escape-hatch section in README`
+  2. `feat(deploy): install slice + sysctl artifacts and apply via sysctl --system`
+  3. `feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500`
+  4. `feat(deploy): host sysctls for UDP buffers + netdev backlog/budget`
+  5. `feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio`
+  6. `feat(deploy): perf-baseline directives on left4me-server@.service`
+
+If any step is missing or out of order, do not amend — diagnose, fix, and create new commits.
+
+- [ ] **Step F.4: Manual deploy smoke test (deferred, ops-side)**
+
+This plan ships artifacts. Confirming that systemd actually accepts and applies them on a real host requires running the deploy script against a test target. That validation is operator-side, not part of this implementation:
+
+```sh
+deploy/deploy-test-server.sh deploy-user@example-host
+ssh deploy-user@example-host 'systemctl cat l4d2-game.slice'
+ssh deploy-user@example-host 'sysctl net.core.rmem_max'   # expect 8388608
+ssh deploy-user@example-host 'systemd-analyze verify /usr/local/lib/systemd/system/left4me-server@.service'
+```
+
+Document any deploy-time problems back into the spec or this plan as v1.x corrections. Do not invent fixes that go beyond the spec.
+
+---
+
+## Out of Scope (do NOT implement here)
+
+Listed in the spec — repeated for clarity:
+
+- ConVars / blueprint arguments / tickrate / sv_minrate.
+- SCHED_FIFO auto-apply.
+- CPU governor auto-apply.
+- Per-instance `CPUAffinity` auto-apply.
+- NIC ring-buffer / IRQ-pinning code.
+- Job-scheduler awareness ("don't build while server X has players").
+- Hardening tightening (`ProtectKernelTunables=yes`, etc.).
+
+If you find yourself touching any of these, stop — they belong in a separate spec.
--- a/docs/superpowers/plans/2026-05-09-l4d2-server-lifecycle-reboot-and-drift.md
+++ b/docs/superpowers/plans/2026-05-09-l4d2-server-lifecycle-reboot-and-drift.md
@ -0,0 +1,584 @@
+# L4D2 Server Lifecycle: Reboot-Safe + Drift Reconciliation Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Make L4D2 server instances survive a host reboot (Part A) and converge `Server.actual_state` to systemd reality every ~30s for out-of-band drift (Part B).
+
+**Architecture:** Helper script + `service_control.py` switch from `systemctl start/stop` to `systemctl enable --now / disable --now`. A new background thread spawned with the job workers polls every server's status periodically and writes the result via the existing `refresh_server_actual_state()` path. Skip servers with in-flight jobs to avoid racing with the post-job refresh.
+
+**Tech Stack:** bash helper script + sudoers; Python `subprocess` via `l4d2host.service_control.systemctl_command`; SQLAlchemy via `session_scope()`; threading; pytest.
+
+**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md`
+
+---
+
+## File Structure
+
+Files to modify (Part A — lifecycle verb change):
+
+- `deploy/files/usr/local/libexec/left4me/left4me-systemctl` — accept verbs `enable`/`disable`/`show` (drop `start`/`stop`).
+- `l4d2host/service_control.py` — rename `start_service` → `enable_service`, `stop_service` → `disable_service`. Action tokens become `"enable"` / `"disable"`.
+- `l4d2host/instances.py` — call `enable_service` from `start_instance`; call `disable_service` from `stop_instance` and `_purge_instance`.
+- `l4d2host/tests/test_lifecycle.py` — update mock-call expectations.
+- `l4d2host/tests/test_service_control.py` — new file with direct unit tests for `enable_service` / `disable_service`.
+- `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args` — update the verb assertions.
+
+Files to modify (Part B — poller):
+
+- `l4d2web/services/job_worker.py` — add `start_state_poller`, `state_poller_loop`, `poll_all_servers`.
+- `l4d2web/app.py` — call `start_state_poller(app)` next to `start_job_workers(app)`.
+- `l4d2web/config.py` — default `STATE_POLLER_INTERVAL_SECONDS = 30`.
+- `l4d2web/tests/test_job_worker.py` — four new tests for the poller.
+
+No host-library, web-app facade, or CLI surface signatures change. The `l4d2ctl start <name>` / `l4d2ctl stop <name>` commands keep their names (per `AGENTS.md`).
+
+---
+
+## Pre-flight
+
+- [ ] **Step 0a: Verify clean working tree**
+
+Run: `git status`
+Expected: `nothing to commit, working tree clean`
+
+- [ ] **Step 0b: Verify the existing test suite is at the known-good baseline**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
+Expected: 460 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`), 2 skipped.
+
+If the count differs, stop and surface the discrepancy; this plan assumes that exact baseline.
+
+---
+
+## Task 1: Part A — Switch lifecycle verbs to `enable --now` / `disable --now`
+
+This task changes the helper script, the Python wrapper, and the instance lifecycle in one cohesive commit. The change is end-to-end vertical — splitting it across commits would leave broken intermediate states (helper accepting verbs that no caller uses, or callers using verbs the helper rejects).
+
+**Files:**
+- Modify: `deploy/files/usr/local/libexec/left4me/left4me-systemctl`
+- Modify: `l4d2host/service_control.py`
+- Modify: `l4d2host/instances.py`
+- Modify: `l4d2host/tests/test_lifecycle.py`
+- Create: `l4d2host/tests/test_service_control.py`
+- Modify: `deploy/tests/test_deploy_artifacts.py`
+
+### Step 1.1: Update the deploy artifact test for the helper
+
+Open `deploy/tests/test_deploy_artifacts.py`. Find `test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`.
+
+Replace the assertions that check the helper's case-statement bodies. Currently the test asserts something like:
+
+```python
+assert 'start) exec "$systemctl" start "$unit"' in script
+assert 'stop) exec "$systemctl" stop "$unit"' in script
+```
+
+Update to:
+
+```python
+assert 'enable)' in script
+assert 'enable --now' in script
+assert 'disable)' in script
+assert 'disable --now' in script
+```
+
+Keep the `--property=ActiveState` and `--property=SubState` assertions for the `show` action (unchanged).
+
+The rejected-action examples list (currently includes things like `["bad/action", "alpha"]`) is unchanged — those are still bad. If the test currently asserts that `start` and `stop` are accepted (e.g., a positive case), drop those — `start`/`stop` are now rejected verbs, not accepted ones.
+
+### Step 1.2: Run the updated artifact test to verify it fails
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args -v`
+Expected: FAIL — the helper script still has `start)`/`stop)` cases, not `enable)`/`disable)`.
+
+### Step 1.3: Edit the helper script
+
+Open `deploy/files/usr/local/libexec/left4me/left4me-systemctl`. Find the case-statement (currently around lines 24–27). Replace:
+
+```sh
+case "$action" in
+    start) exec "$systemctl" start "$unit" ;;
+    stop) exec "$systemctl" stop "$unit" ;;
+    show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
+    *) ...
+esac
+```
+
+with:
+
+```sh
+case "$action" in
+    enable) exec "$systemctl" enable --now "$unit" ;;
+    disable) exec "$systemctl" disable --now "$unit" ;;
+    show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
+    *) ...
+esac
+```
+
+Keep the rest of the script (shebang, name validation, `*)` reject-and-exit branch) unchanged. The exact form of the `*)` reject case in the existing helper should be preserved.
+
+### Step 1.4: Verify the helper script still parses
+
+Run: `sh -n deploy/files/usr/local/libexec/left4me/left4me-systemctl`
+Expected: exit 0, no output.
+
+### Step 1.5: Run the artifact test, verify it passes
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args -v`
+Expected: PASS.
+
+### Step 1.6: Update `service_control.py`
+
+Open `l4d2host/service_control.py`. Replace:
+
+```python
+def start_service(
+    name: str,
+    *,
+    on_stdout: Callable[[str], None] | None = None,
+    on_stderr: Callable[[str], None] | None = None,
+    passthrough: bool = False,
+    should_cancel: Callable[[], bool] | None = None,
+) -> CommandResult:
+    return run_command(
+        systemctl_command("start", name),
+        on_stdout=on_stdout,
+        on_stderr=on_stderr,
+        passthrough=passthrough,
+        should_cancel=should_cancel,
+    )
+
+
+def stop_service(
+    name: str,
+    *,
+    on_stdout: Callable[[str], None] | None = None,
+    on_stderr: Callable[[str], None] | None = None,
+    passthrough: bool = False,
+    should_cancel: Callable[[], bool] | None = None,
+) -> CommandResult:
+    return run_command(
+        systemctl_command("stop", name),
+        on_stdout=on_stdout,
+        on_stderr=on_stderr,
+        passthrough=passthrough,
+        should_cancel=should_cancel,
+    )
+```
+
+with:
+
+```python
+def enable_service(
+    name: str,
+    *,
+    on_stdout: Callable[[str], None] | None = None,
+    on_stderr: Callable[[str], None] | None = None,
+    passthrough: bool = False,
+    should_cancel: Callable[[], bool] | None = None,
+) -> CommandResult:
+    return run_command(
+        systemctl_command("enable", name),
+        on_stdout=on_stdout,
+        on_stderr=on_stderr,
+        passthrough=passthrough,
+        should_cancel=should_cancel,
+    )
+
+
+def disable_service(
+    name: str,
+    *,
+    on_stdout: Callable[[str], None] | None = None,
+    on_stderr: Callable[[str], None] | None = None,
+    passthrough: bool = False,
+    should_cancel: Callable[[], bool] | None = None,
+) -> CommandResult:
+    return run_command(
+        systemctl_command("disable", name),
+        on_stdout=on_stdout,
+        on_stderr=on_stderr,
+        passthrough=passthrough,
+        should_cancel=should_cancel,
+    )
+```
+
+`show_service`, `stream_command`, `stream_journal`, and the `systemctl_command` / `journalctl_command` helpers are unchanged.
+
+### Step 1.7: Update `instances.py` to call the new names
+
+Open `l4d2host/instances.py`. Replace the import:
+
+```python
+from l4d2host.service_control import start_service, stop_service
+```
+
+with:
+
+```python
+from l4d2host.service_control import disable_service, enable_service
+```
+
+Inside `start_instance`, find the `start_service(...)` call (around line 137 in current source) and replace with `enable_service(...)`. Inside `stop_instance` (line 159) and `_purge_instance` (line 194), replace `stop_service(...)` with `disable_service(...)`. Keep all keyword arguments identical — only the function name changes.
+
+### Step 1.8: Update `test_lifecycle.py`
+
+Open `l4d2host/tests/test_lifecycle.py`. Search for every assertion that references the `start` or `stop` action token in mock-call expectations against `service_control.run_command` or `systemctl_command`. The tests typically look for argument lists like `["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "start", "<name>"]`.
+
+Update each occurrence:
+- `"start"` → `"enable"` (in the `start_instance` test paths)
+- `"stop"` → `"disable"` (in the `stop_instance`, `delete_instance`, `reset_instance`, and `_purge_instance` test paths)
+
+Some tests may import `start_service` / `stop_service` directly. Update those imports to `enable_service` / `disable_service`.
+
+### Step 1.9: Create direct unit tests for `enable_service` / `disable_service`
+
+Create `l4d2host/tests/test_service_control.py` with:
+
+```python
+from unittest.mock import patch
+
+from l4d2host.service_control import (
+    SYSTEMCTL_HELPER,
+    disable_service,
+    enable_service,
+)
+
+
+@patch("l4d2host.service_control.run_command")
+def test_enable_service_invokes_helper_with_enable_action(mock_run):
+    enable_service("instance-7")
+    args, _ = mock_run.call_args
+    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "enable", "instance-7"]
+
+
+@patch("l4d2host.service_control.run_command")
+def test_disable_service_invokes_helper_with_disable_action(mock_run):
+    disable_service("instance-7")
+    args, _ = mock_run.call_args
+    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "disable", "instance-7"]
+```
+
+### Step 1.10: Run the host-library tests
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2host/tests -q`
+Expected: all green (110 or 111 passing depending on whether `test_service_control.py` already existed; `+2` from the new direct tests).
+
+If anything is red, fix the test expectations, not the implementation. The implementation matches the spec exactly. The most likely failure mode is a test in `test_lifecycle.py` that you missed updating; search for any remaining string literal `"start"` or `"stop"` in helper-arg-list contexts.
+
+### Step 1.11: Run the deploy artifact test suite
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ -q`
+Expected: 36 passed, 1 failed (the pre-existing unrelated test).
+
+### Step 1.12: Commit
+
+```bash
+git add deploy/files/usr/local/libexec/left4me/left4me-systemctl \
+        l4d2host/service_control.py l4d2host/instances.py \
+        l4d2host/tests/test_lifecycle.py \
+        l4d2host/tests/test_service_control.py \
+        deploy/tests/test_deploy_artifacts.py
+git commit -m "$(cat <<'EOF'
+feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now
+
+Servers started via the web UI now create a WantedBy= symlink under
+multi-user.target.wants/, so they auto-start on the next host reboot.
+Helper verbs renamed start/stop -> enable/disable; service_control.py
+renamed start_service/stop_service -> enable_service/disable_service.
+The user-facing l4d2ctl start/stop commands keep their names per the
+AGENTS.md contract — only the implementation changes. Spec:
+docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
+EOF
+)"
+```
+
+---
+
+## Task 2: Part B — Periodic state poller
+
+This task adds the poller code, wires it into the Flask startup, exposes its config knob, and tests four behaviors. One cohesive commit.
+
+**Files:**
+- Modify: `l4d2web/services/job_worker.py`
+- Modify: `l4d2web/app.py`
+- Modify: `l4d2web/config.py`
+- Modify: `l4d2web/tests/test_job_worker.py`
+
+### Step 2.1: Add the failing tests
+
+Open `l4d2web/tests/test_job_worker.py`. Append after the existing tests:
+
+```python
+def test_state_poller_refreshes_each_server(app, monkeypatch):
+    from l4d2web.services import job_worker as jw
+
+    with app.app_context():
+        from l4d2web.db import session_scope
+        from l4d2web.models import Server
+        with session_scope() as db:
+            db.add_all([
+                Server(id=11, name="alpha", port=27015, blueprint_id=None,
+                       desired_state="running", actual_state="unknown"),
+                Server(id=12, name="beta", port=27016, blueprint_id=None,
+                       desired_state="running", actual_state="unknown"),
+            ])
+
+    refreshed = []
+    monkeypatch.setattr(jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid))
+
+    with app.app_context():
+        jw.poll_all_servers()
+
+    assert sorted(refreshed) == [11, 12]
+
+
+def test_state_poller_skips_servers_with_inflight_jobs(app, monkeypatch):
+    from l4d2web.services import job_worker as jw
+
+    with app.app_context():
+        from l4d2web.db import session_scope
+        from l4d2web.models import Job, Server
+        with session_scope() as db:
+            db.add(Server(id=21, name="gamma", port=27017, blueprint_id=None,
+                          desired_state="running", actual_state="running"))
+            db.add(Job(server_id=21, operation="stop", state="running"))
+
+    refreshed = []
+    monkeypatch.setattr(jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid))
+
+    with app.app_context():
+        jw.poll_all_servers()
+
+    assert refreshed == []
+
+
+def test_state_poller_swallows_per_server_exceptions(app, monkeypatch):
+    from l4d2web.services import job_worker as jw
+
+    with app.app_context():
+        from l4d2web.db import session_scope
+        from l4d2web.models import Server
+        with session_scope() as db:
+            db.add_all([
+                Server(id=31, name="bad", port=27018, blueprint_id=None,
+                       desired_state="running", actual_state="unknown"),
+                Server(id=32, name="good", port=27019, blueprint_id=None,
+                       desired_state="running", actual_state="unknown"),
+            ])
+
+    refreshed = []
+
+    def fake_refresh(sid):
+        if sid == 31:
+            raise RuntimeError("simulated host failure")
+        refreshed.append(sid)
+
+    monkeypatch.setattr(jw, "refresh_server_actual_state", fake_refresh)
+
+    with app.app_context():
+        jw.poll_all_servers()  # must not raise
+
+    assert refreshed == [32]
+
+
+def test_state_poller_disabled_when_job_workers_disabled(monkeypatch):
+    """create_app must not spawn the poller thread when JOB_WORKER_ENABLED=False."""
+    import threading
+
+    from l4d2web.app import create_app
+
+    spawned = []
+    real_thread_init = threading.Thread.__init__
+
+    def tracking_init(self, *args, **kwargs):
+        if kwargs.get("name") == "left4me-state-poller":
+            spawned.append(True)
+        real_thread_init(self, *args, **kwargs)
+
+    monkeypatch.setattr(threading.Thread, "__init__", tracking_init)
+    create_app({"TESTING": True, "JOB_WORKER_ENABLED": False})
+    assert not spawned
+```
+
+(The tests assume the existing `app` fixture from `conftest.py`. If your project uses a different fixture name, adjust accordingly. The polling tests run `poll_all_servers()` synchronously to avoid testing the loop's `time.sleep`.)
+
+### Step 2.2: Run the new tests, verify they fail
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server l4d2web/tests/test_job_worker.py::test_state_poller_skips_servers_with_inflight_jobs l4d2web/tests/test_job_worker.py::test_state_poller_swallows_per_server_exceptions l4d2web/tests/test_job_worker.py::test_state_poller_disabled_when_job_workers_disabled -v`
+Expected: FAIL — `poll_all_servers` and `start_state_poller` don't exist yet.
+
+### Step 2.3: Add the poller code to `job_worker.py`
+
+Open `l4d2web/services/job_worker.py`. Add at the bottom of the file:
+
+```python
+def start_state_poller(app):
+    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
+    thread = threading.Thread(
+        target=state_poller_loop,
+        args=(app, interval),
+        daemon=True,
+        name="left4me-state-poller",
+    )
+    thread.start()
+
+
+def state_poller_loop(app, interval: float) -> None:
+    while True:
+        try:
+            with app.app_context():
+                poll_all_servers()
+        except Exception:
+            pass
+        time.sleep(interval)
+
+
+def poll_all_servers() -> None:
+    with session_scope() as db:
+        active_server_ids = set(db.scalars(
+            select(Job.server_id).where(Job.state.in_(("queued", "running")))
+        ).all())
+        server_ids = [
+            sid for sid in db.scalars(select(Server.id)).all()
+            if sid not in active_server_ids
+        ]
+    for sid in server_ids:
+        try:
+            refresh_server_actual_state(sid)
+        except Exception:
+            pass
+```
+
+`Server`, `Job`, `select`, `session_scope`, `threading`, `time`, and `refresh_server_actual_state` are already imported in this file. Verify by scanning the existing imports; if any are missing (unlikely for `select`/`Server`/`Job` since the worker uses them), add them.
+
+### Step 2.4: Wire the poller into `create_app`
+
+Open `l4d2web/app.py`. Find the existing `start_job_workers(app)` call (around line 91, inside the `if should_start_workers:` block). Add `start_state_poller(app)` immediately after it:
+
+```python
+if should_start_workers:
+    recover_stale_jobs()
+    start_job_workers(app)
+    start_state_poller(app)
+```
+
+Also update the import:
+
+```python
+from l4d2web.services.job_worker import (
+    recover_stale_jobs,
+    start_job_workers,
+    start_state_poller,
+)
+```
+
+(If the existing import is single-line `from ... import recover_stale_jobs, start_job_workers`, just add `start_state_poller` to the list.)
+
+### Step 2.5: Add the config default
+
+Open `l4d2web/config.py`. Find the dict literal that contains other defaults like `JOB_WORKER_THREADS`, `PORT_RANGE_START`, etc. Add:
+
+```python
+"STATE_POLLER_INTERVAL_SECONDS": 30,
+```
+
+In the env-var-loading section (where `LEFT4ME_PORT_RANGE_START` etc. are read), add:
+
+```python
+"STATE_POLLER_INTERVAL_SECONDS": float(os.getenv("LEFT4ME_STATE_POLLER_INTERVAL_SECONDS", "30")),
+```
+
+### Step 2.6: Run the four new tests, verify they pass
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server l4d2web/tests/test_job_worker.py::test_state_poller_skips_servers_with_inflight_jobs l4d2web/tests/test_job_worker.py::test_state_poller_swallows_per_server_exceptions l4d2web/tests/test_job_worker.py::test_state_poller_disabled_when_job_workers_disabled -v`
+Expected: PASS for all four.
+
+### Step 2.7: Run the full web test suite
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests -q`
+Expected: 317 passed, 1 skipped (313 + 4 new tests).
+
+### Step 2.8: Commit
+
+```bash
+git add l4d2web/services/job_worker.py l4d2web/app.py l4d2web/config.py l4d2web/tests/test_job_worker.py
+git commit -m "$(cat <<'EOF'
+feat(l4d2-web): periodic state poller refreshes Server.actual_state
+
+A background thread spawned alongside the job workers polls every
+server's status every STATE_POLLER_INTERVAL_SECONDS (default 30) and
+writes the result via the existing refresh_server_actual_state path.
+Servers with in-flight jobs are skipped to avoid racing the post-job
+refresh. Catches reboot drift, OOM kills, manual systemctl operations,
+and any other out-of-band state change. Spec:
+docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
+EOF
+)"
+```
+
+---
+
+## Final Verification
+
+- [ ] **Step F.1: Full test sweep**
+
+Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
+Expected: ~466 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`), 2 skipped.
+
+- [ ] **Step F.2: Working tree clean and commit shape**
+
+Run: `git status && git log --oneline -5`
+Expected:
+- `git status`: clean.
+- Top of `git log`:
+  1. `feat(l4d2-web): periodic state poller refreshes Server.actual_state`
+  2. `feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now`
+  3. `docs(plans): l4d2 server lifecycle reboot-and-drift — implementation plan`
+  4. `docs(specs): l4d2 server lifecycle reboot-and-drift — design`
+
+- [ ] **Step F.3: Operator-side smoke test (deferred, not part of this plan)**
+
+End-to-end on `ckn@10.0.4.128` after deploy:
+
+```sh
+deploy/deploy-test-server.sh ckn@10.0.4.128
+
+# Confirm the helper now drives enable/disable
+ssh ckn@10.0.4.128 'grep -E "enable|disable" /usr/local/libexec/left4me/left4me-systemctl'
+# expect:  enable) exec "$systemctl" enable --now "$unit"
+#          disable) exec "$systemctl" disable --now "$unit"
+
+# Click "start" in the web UI for a server. Then:
+ssh ckn@10.0.4.128 'systemctl is-enabled left4me-server@1.service'
+# expect: enabled
+
+# Reboot the host:
+ssh ckn@10.0.4.128 'sudo systemctl reboot'
+# wait for it to come back, then:
+ssh ckn@10.0.4.128 'systemctl is-active left4me-server@1.service && pgrep -fa srcds'
+# expect: active, srcds running with no UI intervention
+
+# Confirm the poller corrects out-of-band drift
+ssh ckn@10.0.4.128 'sudo systemctl disable --now left4me-server@1.service'
+# Within ~30s the web UI's actual_state for server 1 flips from "running" to "stopped".
+ssh ckn@10.0.4.128 'sudo -u left4me /opt/left4me/.venv/bin/python -c "
+import sqlite3
+c = sqlite3.connect(\"/var/lib/left4me/left4me.db\")
+print(c.execute(\"SELECT id, actual_state, actual_state_updated_at FROM servers WHERE id=1\").fetchone())
+"'
+# expect: actual_state='stopped' with a fresh updated_at.
+```
+
+---
+
+## Out of Scope (do NOT implement here)
+
+- Auto-restart on `desired_state=running && actual_state=stopped`.
+- UI banners for stale-state warnings.
+- Reconciliation of orphan systemd units.
+- Per-server poll intervals.
+- Replacing `Restart=on-failure`.
+- Touching the pre-existing red test (`test_deploy_script_has_safe_defaults_and_preserves_state`).
+
+If you find yourself touching any of these, stop — they belong in a separate spec.
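
For orientation, the `show` verb that the helper keeps prints one `Key=Value` line each for `ActiveState` and `SubState`. Below is a minimal Python sketch of how a caller might reduce that output to the `actual_state` strings that appear in the plan above (`running`, `stopped`, `unknown`). The function names `parse_show_output` and `to_actual_state`, and the mapping itself, are assumptions made for illustration; the project's `refresh_server_actual_state()` may do this differently.

```python
from __future__ import annotations


def parse_show_output(text: str) -> dict[str, str]:
    """Parse `systemctl show --property=...` output (one Key=Value per line)."""
    props: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines that carry no '='
            props[key.strip()] = value.strip()
    return props


def to_actual_state(props: dict[str, str]) -> str:
    """Hypothetical mapping from systemd's ActiveState to the plan's labels."""
    active = props.get("ActiveState")
    if active == "active":
        return "running"
    if active in ("inactive", "failed"):
        return "stopped"
    # Transitional or missing states are reported as unknown in this sketch.
    return "unknown"


if __name__ == "__main__":
    sample = "ActiveState=active\nSubState=running\n"
    print(to_actual_state(parse_show_output(sample)))
```

Keeping parsing and mapping as pure functions would let a test exercise them without touching `sudo` or systemd, in the same spirit as the synchronous `poll_all_servers()` tests.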
--- a/docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md
+++ b/docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md
@ -0,0 +1,131 @@
+# l4d2 cpu isolation — design
+
+Date: 2026-05-09
+Status: design
+
+## Summary
+
+Constrain every cgroup that isn't a live game server to core 0; give game servers cores 1..N-1 exclusively. The implementation is systemd cgroup-v2 `AllowedCPUs=` drop-ins, computed at deploy time from `nproc --all` (plain `nproc` honors the calling shell's CPU affinity and returns 1 once the drop-ins confine the SSH login to core 0) and overridable via env vars. It lands on top of the perf baseline shipped in `851e662..e5126c8`.
+
+## Goals
+
+- A logged-in admin doing CPU-heavy work, the script-build sandbox, and the Flask web app cannot steal cycles from a live match.
+- Layout scales automatically across host sizes (4-core, 8-core, 16-core) without per-host edits.
+- Operator can override the default `0` / `1..N-1` split for NUMA boxes or hyperthread quirks.
+- Single-core hosts degrade gracefully: skip CPU isolation, keep the rest of the perf baseline.
+
+## Non-goals
+
+- Kernel `isolcpus=` / `nohz_full=` / `rcu_nocbs=` boot parameters. True core isolation (eviction of softirqs, RCU, timer ticks) requires GRUB edits + reboot + per-host tuning. cgroup cpuset is sufficient for L4D2 tickrates; document as a future opt-in if measurement justifies it.
+- NIC IRQ pinning. Hardware-specific; already documented as an escape hatch in `deploy/README.md`.
+- Per-instance pinning *within* the game-core set. The slice-level cpuset is the floor; the existing per-instance `CPUAffinity=` drop-in escape hatch (already in `deploy/README.md`) composes on top — the kernel enforces "per-instance value must be a subset of slice's allowed set."
+- A separate `l4d2-web.slice`. The web app is light; living in `system.slice` on core 0 is fine.
+- Web-app or host-library code changes. Pure deploy-side artifact work.
+
+## Background
+
+The perf baseline (commit range `851e662..e5126c8`) introduced two slices (`l4d2-game.slice` weight 1000, `l4d2-build.slice` weight 10), per-instance unit directives (Nice, OOM, memory caps), and host sysctls. None of those constrain *which* CPUs cgroups run on. The kernel's default scheduler is free to migrate any task to any core, so the build sandbox, ssh sessions, the web app, and game servers all compete for the same cores.
+
+## Design
+
+### Topology
+
+```
+                core 0           cores 1..N-1
+                ─────────        ────────────
+system.slice    AllowedCPUs=0
+user.slice      AllowedCPUs=0
+l4d2-build.slice AllowedCPUs=0
+l4d2-game.slice                 AllowedCPUs=1-(N-1)
+```
+
+Everything that isn't a live game server (Flask web app, ssh sessions, journald, script-sandbox builds, cron, systemd housekeeping) is funneled to core 0. Game servers get cores 1..N-1 exclusively.
+
+### Why slice-level `AllowedCPUs=`, not per-instance `CPUAffinity=`
+
+- **Hierarchy does the work for free.** A cpuset on `l4d2-game.slice` propagates to every `left4me-server@*.service` automatically. No per-instance drop-ins to manage; no logic in the web app to pick cores.
+- **Hot-applied.** cgroup-v2 cpuset changes apply to running cgroups; existing servers move next time the kernel schedules them. No need to restart instances after a deploy.
+- **Composable.** A future operator who wants per-instance pinning *within* the game cores adds `CPUAffinity=N` via `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf` (already documented). The slice constraint and per-instance pin compose; the kernel enforces subset-of.
+
+### Why drop-ins, not edits to the existing `.slice` files
+
+The two slice files we ship today (`l4d2-game.slice`, `l4d2-build.slice`) are static text and host-portable. `AllowedCPUs=1-7` is true on an 8-core host and wrong on a 4-core host. Drop-ins under `<unit>.d/*.conf` are the standard systemd pattern for host-specific overrides. We already use `99-` prefixing for the sysctl drop-in so it lex-orders last; reuse that.
+
+### Operator override
+
+Two env vars consumed by the deploy script:
+
+- `LEFT4ME_SYSTEM_CPUS` — defaults to `0`. Goes into `system.slice`, `user.slice`, `l4d2-build.slice` drop-ins.
+- `LEFT4ME_GAME_CPUS` — defaults to `1-$((NPROC-1))`. Goes into `l4d2-game.slice` drop-in.
+
+Operators with NUMA boxes, hyperthread quirks, or "I want core 0 *and* core 1 for system" set the vars explicitly. Defaults handle the typical case.
+
+### Single-core fallback
+
+If the host reports fewer than two processors (`nproc --all` < 2), skip CPU isolation entirely (write no drop-ins) and print a warning to stderr explaining that the deploy is leaving cpusets unset. The rest of the perf baseline still applies (weights, sysctls, OOM scores).
+
+If `LEFT4ME_GAME_CPUS` or `LEFT4ME_SYSTEM_CPUS` is set explicitly on a single-core host, honor the operator's intent (they presumably know what they're doing) and write the drop-ins anyway.
+
+### Drop-in layout
+
+Four files written to `/etc/systemd/system/`, each named `99-left4me-cpuset.conf`:
+
+```
+/etc/systemd/system/system.slice.d/99-left4me-cpuset.conf
+/etc/systemd/system/user.slice.d/99-left4me-cpuset.conf
+/etc/systemd/system/l4d2-build.slice.d/99-left4me-cpuset.conf
+/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf
+```
+
+Each file contains:
+
+```ini
+[Slice]
+AllowedCPUs=<resolved value>
+```
+
+### systemd compatibility
+
+`AllowedCPUs=` is systemd 244+. Debian Trixie ships systemd 256+. Cgroup-v2 cpuset controller is enabled by default on Trixie; systemd auto-enables the controller when `AllowedCPUs=` is set on a unit. No additional machinery.
+
+### Files changed / added
+
+```
+deploy/deploy-test-server.sh                   (modified — compute layout, write four drop-ins)
+deploy/README.md                               (modified — new "CPU isolation" subsection inside Performance Tuning)
+deploy/tests/test_deploy_artifacts.py          (modified — new tests)
+```
+
+## Tests
+
+`deploy/tests/test_deploy_artifacts.py` additions, following the existing
+`assert "X" in script` pattern:
+
+- For `deploy-test-server.sh`, assert:
+  - All four drop-in paths (`/etc/systemd/system/{system,user,l4d2-build,l4d2-game}.slice.d/99-left4me-cpuset.conf`) appear.
+  - The script reads `nproc` (substring `nproc` plus a default-binding form for `LEFT4ME_GAME_CPUS`).
+  - The script honors `LEFT4ME_SYSTEM_CPUS` and `LEFT4ME_GAME_CPUS` env-var overrides (substrings present, default-binding form like `${LEFT4ME_SYSTEM_CPUS:-...}`).
+  - The script has a single-core fallback (substring guarding `nproc -lt 2` or equivalent, with a warning to stderr).
+  - Each drop-in is written via the existing `install -m 0644 -o root -g root` heredoc pattern.
+
+No runtime tests in this spec — verifying that systemd actually enforces `AllowedCPUs=` is operator-side via `cat /sys/fs/cgroup/<slice>/cpuset.cpus.effective` after deploy.
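+
+The operator-side check can be scripted. The sketch below parses the kernel's CPU list format (for example `0,2-3`) as found in `cpuset.cpus.effective`; the function name `parse_cpu_list` is an assumption and not part of the repo.
+
+```python
+from typing import Set
+
+
+def parse_cpu_list(text: str) -> Set[int]:
+    """Parse a kernel CPU list such as '0,2-3' into a set of CPU numbers."""
+    cpus: Set[int] = set()
+    for part in text.strip().split(","):
+        if not part:
+            continue  # empty file or trailing comma
+        if "-" in part:
+            lo, hi = part.split("-", 1)
+            cpus.update(range(int(lo), int(hi) + 1))
+        else:
+            cpus.add(int(part))
+    return cpus
+```
+
+An operator could feed it the contents of `/sys/fs/cgroup/l4d2-game.slice/cpuset.cpus.effective` and compare the result against the intended game-core set.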
+
+## Rollout
+
+Single deploy. cgroup-v2 cpuset changes apply to running cgroups, so already-running servers move next time the kernel reschedules them — no instance restarts required. The `daemon-reload` already in the deploy script picks up the new drop-ins.
+
+If something goes wrong (cpuset too narrow, a slice can't run any process), `systemctl status <slice>` will show the error and the operator can either fix the env vars and redeploy or `rm /etc/systemd/system/<slice>.slice.d/99-left4me-cpuset.conf` followed by `systemctl daemon-reload` to revert.
+
+## Open questions
+
+None blocking. Possible v2 candidates if measurement justifies them:
+
+- Pair this with kernel `isolcpus=` boot params for true core isolation.
+- Auto-pin NIC IRQs to core 0 (would compose with this isolation).
+- Per-instance `CPUAffinity=` driven by a deploy-env knob, partitioning the game-core set across instances deterministically.
+
+## References
+
+- systemd.resource-control(5) — `AllowedCPUs=` semantics.
+- Linux Documentation/admin-guide/cgroup-v2.rst — cpuset controller behavior on `cpuset.cpus` / `cpuset.cpus.effective`.
+- Existing perf-baseline spec: `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md` — sibling work that introduced the slices this spec extends.
--- a/docs/superpowers/specs/2026-05-09-l4d2-cpu-pinning-decision.md
+++ b/docs/superpowers/specs/2026-05-09-l4d2-cpu-pinning-decision.md
@ -0,0 +1,83 @@
+# l4d2 cpu pinning — decision record (deferred)
+
+Date: 2026-05-09
+Status: decision (no implementation)
+
+## Question
+
+After the lifecycle + drift fix landed (commits `8552c55`, `67b5521`), the
+question came up: with `AllowedCPUs=1-7` already constraining game servers
+to cores 1–7, do CFS scheduler migrations *within* that range still cause
+meaningful jitter? Should we hard-pin each instance to a single core?
+
+## Investigation
+
+The classic "lazy CFS" sysctl knob is **gone** on modern kernels. Verified
+on Trixie's running kernel 6.12 (`ckn@10.0.4.128`):
+
+```
+/sbin/sysctl -a | grep -E "sched_migration_cost|sched_min_granularity|sched_wakeup_granularity|sched_latency"
+# (no output)
+```
+
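+`kernel.sched_migration_cost_ns` and the other classic CFS tunables were
+removed from the sysctl interface in 5.13, when they moved to debugfs
+(`/sys/kernel/debug/sched/`), ahead of the scheduler rework that culminated
+in EEVDF (6.6). The RT throttling pair `kernel.sched_rt_period_us` /
+`kernel.sched_rt_runtime_us` is still exposed via sysctl, but it does not
+influence migrations. There is no global "be lazy about migrations" sysctl anymore.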
+
+### Available paths
+
+| Option | Cost | Strictness | Pays off when |
+|---|---|---|---|
+| Trust CFS + `Nice=-5` + `AllowedCPUs=1-7` (current) | None | Soft | ≤ 3 instances on 7 cores; CFS rarely migrates active CPU-bound nice<0 tasks |
+| Per-instance `CPUAffinity=N` drop-in | Web-app machinery to write drop-ins, daemon-reload, modulo or DB-persisted assignment | Strict | ≥ 4 instances (each gets exclusive core), or measured jitter |
+| `isolcpus=1-7 nohz_full=1-7 rcu_nocbs=1-7` kernel cmdline | GRUB edit + reboot, host-specific | Strongest (also evicts kernel softirqs/RCU/timer ticks from game cores) | Tickrate-128 with measurable kernel-induced jitter |
+| `SCHED_FIFO` per unit | Risky (RT misconfig can stall kernel) | Strict | Already documented as ops-side escape hatch in `deploy/README.md` |
+
+### Why deferring is defensible
+
+- The slice's `AllowedCPUs=1-7` already prevents game servers from running on core 0. The open question is "do they migrate within 1–7?" — yes, CFS can migrate, but for long-running CPU-bound `srcds` with `Nice=-5`, migrations are infrequent. CFS prefers cache locality and only migrates when an idle core "steals" or a periodic load-balance tick detects imbalance.
+- With ≤ 3 instances on 7 game cores, the load balancer rarely sees imbalance to fix.
+- Per-instance hard pinning adds non-trivial machinery (drop-in writer through `left4me-systemctl`, or extending `instance.env` + a `taskset` wrapper in the unit). Not warranted unless we observe a real problem.
+- `deploy/README.md` already documents the `CPUAffinity=N` per-instance drop-in as an opt-in escape hatch. An operator who measures jitter can apply it without code changes.
+
+## Decision
+
+**No code change.** Keep the current setup:
+
+- Slice-level `AllowedCPUs=1-7` ensures game servers never touch core 0.
+- `Nice=-5` keeps active srcds tasks weighted heavily so CFS prefers leaving them alone.
+- The `CPUAffinity=N` per-instance drop-in remains the documented escape hatch.
+
+## Revisit triggers
+
+If any of these signals appears, design and implement strict per-instance pinning:
+
+- ≥ 4 game-server instances running simultaneously on one host.
+- A specific server reports tickrate dips / rubber-banding correlated with another instance starting or a build sandbox firing.
+- `perf stat -e sched:sched_migrate_task -p <srcds-pid>` shows > 1 migration/sec under load.
+
+When revisiting, two implementation paths to choose from:
+
+1. **Modulo assignment in the host library.** Read `LEFT4ME_GAME_CPUS` (or parse the slice's `AllowedCPUs=` drop-in), pick `game_cpus[(int(name) - 1) % len(game_cpus)]`, write `L4D2_CPU=N` into `instance.env`, wrap the unit's `ExecStart` with `taskset -c ${L4D2_CPU}`. Stateless, deterministic, no DB column. **Preferred.**
+2. **Persisted assignment.** Add `Server.cpu_pin` column, web app picks at initialize time and stores. Survives `LEFT4ME_GAME_CPUS` changes (each server keeps its assigned core). Bigger ripple.
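+
+A minimal sketch of the modulo assignment from option 1, assuming numeric instance names (as the `int(name)` expression implies). The function name `pick_cpu` is hypothetical.
+
+```python
+from typing import List
+
+
+def pick_cpu(instance_name: str, game_cpus: List[int]) -> int:
+    """Map a numeric instance name onto one of the game cores, wrapping around."""
+    if not game_cpus:
+        raise ValueError("no game CPUs configured")
+    return game_cpus[(int(instance_name) - 1) % len(game_cpus)]
+```
+
+With seven game cores, instance 8 wraps around onto the same core as instance 1, so the assignment is only exclusive while the instance count does not exceed the core count.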
+
+## Verification (no-op confirmation)
+
+```sh
+ssh ckn@10.0.4.128 'systemctl show l4d2-game.slice -p AllowedCPUs'
+# expect: AllowedCPUs=1-7
+
+ssh ckn@10.0.4.128 'cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective'
+# expect: 0   (everything-not-game still pinned to core 0)
+
+# When ≥ 1 server is running:
+ssh ckn@10.0.4.128 'for p in $(pgrep srcds); do grep ^Cpus_allowed_list /proc/$p/status; done'
+# expect: 1-7   (CFS picks whichever of those is hottest at any given moment)
+```
+
+## References
+
+- `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md` — sibling design that introduced the `AllowedCPUs=1-7` slice constraint this record builds on.
+- `deploy/README.md` "Performance Tuning" section — the `CPUAffinity=N` per-instance escape hatch.
+- Linux kernel changelog 5.13+ — removal of classic CFS tunable sysctls.
--- a/docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
+++ b/docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
@ -0,0 +1,230 @@
+# l4d2 server host perf baseline — design
+
+Date: 2026-05-09
+Status: design
+
+## Summary
+
+Apply a host-side performance and resource-isolation baseline to every L4D2 server instance, using systemd unit directives, a slice hierarchy, and host sysctls. The blueprint-level game configuration (tickrate, sv_minrate/maxrate, fps_max, plugins) stays the responsibility of the individual server maintainer and is out of scope.
+
+## Goals
+
+- Game-server processes get measurable scheduling, I/O, and OOM priority over the script-build sandbox and over interactive system traffic.
+- One misbehaving server cannot OOM-kill its siblings or the host.
+- The kernel's UDP path is sized for sustained Source-engine traffic instead of distro defaults.
+- Operators have documented escape hatches for host-specific tuning (CPU pinning, governor, NIC IRQs, real-time scheduling) without any of it being imposed by default.
+
+## Non-goals
+
+- ConVars, blueprint arguments, plugins, tickrate, rate values — owned by the maintainer of each server.
+- Real-time (`SCHED_FIFO`/`SCHED_RR`) scheduling for game servers. Documented as opt-in only; see Out-of-scope rationale.
+- CPU governor changes. Documented opt-in only.
+- Per-instance `CPUAffinity`. Host-specific; documented only.
+- NIC ring-buffer / IRQ-pinning changes. Hardware-specific; documented only.
+- Job-scheduler awareness ("don't build a script overlay while server X has players"). Cgroup weights cover this in v1; revisit if real-world data disagrees.
+- Hardening tightening (`ProtectKernelTunables=yes`, etc.). Security-focused, separate spec.
+
+## Background
+
+Current state (commit `965b67e`):
+
+- `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` runs `srcds_run` as user `left4me` with security hardening (`NoNewPrivileges`, `PrivateTmp`, `PrivateDevices`, `ProtectHome`, `ProtectSystem=strict`, `ReadOnlyPaths`, `ReadWritePaths`, `RestrictSUIDSGID`, `LockPersonality`) but **no scheduling, memory, OOM, kill-signal, or log-rate directives**.
+- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` runs script-overlay builds via `systemd-run --scope` with `CPUQuota=200%` and `RuntimeMaxSec=3600`, but in the **default cgroup** — it competes against game servers as an equal sibling under `system.slice`.
+- No host sysctls are deployed. Linux defaults (`rmem_max`/`wmem_max` = 212992 bytes, about 208 KB; `netdev_max_backlog=1000`) are below what sustained UDP gameplay across multiple instances expects.
+
+srcds is single-threaded per instance, so multi-instance hosts contend over CPU cycles, kernel softirq budget, and journald rate limits.
+
+## Design
+
+### Slice topology
+
+Flat top-level slices, siblings of `system.slice` and `user.slice`:
+
+```
+-.slice
+├── system.slice         (default CPUWeight=100, IOWeight=100)
+├── user.slice           (default CPUWeight=100, IOWeight=100)
+├── l4d2-game.slice      (CPUWeight=1000, IOWeight=1000)
+└── l4d2-build.slice     (CPUWeight=10,   IOWeight=10)
+```
+
+Rationale:
+
+- 100:1 weight ratio between game and build means: under contention, the build sandbox is starved; when uncontended, the build still gets the full box modulo its own `CPUQuota=200%`.
+- Flat (not nested under `system.slice`) so a logged-in admin running a heavy task in `user.slice` cannot steal cycles from a live match.
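+
+To make the ratio concrete, here is the idealized share arithmetic as a small Python sketch. It assumes every listed slice is runnable on the same cores and saturates its share; the helper name `cpu_share` is hypothetical.
+
+```python
+from typing import Dict
+
+# CPUWeight values from the topology above.
+WEIGHTS: Dict[str, int] = {
+    "system.slice": 100,
+    "user.slice": 100,
+    "l4d2-game.slice": 1000,
+    "l4d2-build.slice": 10,
+}
+
+
+def cpu_share(weights: Dict[str, int], name: str) -> float:
+    """Fraction of contended CPU time a slice gets among busy siblings."""
+    return weights[name] / sum(weights.values())
+```
+
+With all four slices busy, the game slice gets 1000/1210 (roughly 83%) and the build slice 10/1210 (under 1%). With only game and build contending, the split is 1000/1010 versus 10/1010.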
+
+### Per-instance unit additions (`left4me-server@.service`)
+
+Add to `[Service]`:
+
+```
+Slice=l4d2-game.slice
+Nice=-5
+IOSchedulingClass=best-effort
+IOSchedulingPriority=4
+OOMScoreAdjust=-200
+MemoryHigh=1.5G
+MemoryMax=2G
+TasksMax=256
+LimitNOFILE=65536
+KillSignal=SIGINT
+TimeoutStopSec=15s
+LogRateLimitIntervalSec=0
+```
+
+Per-directive justification:
+
+- `Slice=l4d2-game.slice` — places the instance in the high-weight slice.
+- `Nice=-5` — modest CFS priority bump. Negative `Nice` set by systemd does not require `CAP_SYS_NICE` because systemd applies the value before dropping to the unit user. SCHED_FIFO is intentionally rejected; see Out-of-scope rationale.
+- `IOSchedulingClass=best-effort` + `IOSchedulingPriority=4` — explicit best-effort at priority 4, which is already the default within that class (0 is highest, 7 lowest). This is not a bump; it pins the value so behavior is deterministic, and it is harmless.
+- `OOMScoreAdjust=-200` — game servers survive memory pressure; sandbox dies first (see sandbox section).
+- `MemoryHigh=1.5G`, `MemoryMax=2G` — soft + hard ceiling. Typical L4D2 srcds runs ~500–800 MB; map-load spikes fit in headroom; a runaway is bounded.
+- `TasksMax=256` — bounds thread count well above srcds' steady-state usage; prevents fork-bomb style failures from leaking host-wide.
+- `LimitNOFILE=65536` — Valve wiki recommendation; cheap and matches multi-plugin setups.
+- `KillSignal=SIGINT` — srcds responds to SIGINT for clean shutdown (writes demos, flushes logs); SIGTERM is harsher.
+- `TimeoutStopSec=15s` — gives srcds time to finish flush before SIGKILL.
+- `LogRateLimitIntervalSec=0` — disables journald per-unit rate limiting (default `10000 msgs/30s`). srcds + plugins exceed this on busy maps; dropped messages break diagnostics.
+
+Existing security directives are kept verbatim.
+
+### Slice unit files
+
+New file `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice`:
+
+```ini
+[Unit]
+Description=left4me game-server slice
+Before=slices.target
+
+[Slice]
+CPUWeight=1000
+IOWeight=1000
+```
+
+New file `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice`:
+
+```ini
+[Unit]
+Description=left4me script-sandbox build slice
+Before=slices.target
+
+[Slice]
+CPUWeight=10
+IOWeight=10
+```
+
+### Sandbox slice + OOM placement
+
+Edit `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` to add to the `systemd-run` invocation (transient service mode — the existing helper uses `--unit=` without `--scope`):
+
+- `--slice=l4d2-build.slice`
+- `-p OOMScoreAdjust=500`
+
+Existing `CPUQuota=200%` and `RuntimeMaxSec=3600` stay. Cgroup weight (slice) and CPU quota (per-unit) compose: weight handles contention, quota handles the absolute ceiling.
+
+### Host sysctls
+
+New file `deploy/files/etc/sysctl.d/99-left4me.conf`:
+
+```
+net.core.rmem_max = 8388608
+net.core.wmem_max = 8388608
+net.core.rmem_default = 524288
+net.core.wmem_default = 524288
+net.core.netdev_max_backlog = 5000
+net.core.netdev_budget = 600
+vm.swappiness = 10
+```
+
+Per-value justification:
+
+- `rmem_max`/`wmem_max = 8 MB` — the Linux default of 212992 bytes (about 208 KB) is a known bottleneck for sustained UDP. 8 MB is the standard 1 Gbit recommendation (Red Hat performance guide); enough headroom for ~10 instances on a host without going to 16 MB.
+- `rmem_default`/`wmem_default = 512 KB` — protects sockets that don't explicitly call `setsockopt(SO_RCVBUF/SO_SNDBUF)`; harmless when they do.
+- `netdev_max_backlog = 5000` — default `1000` overflows under multi-instance UDP burst; the per-CPU softnet queue starts dropping packets once full.
+- `netdev_budget = 600` — gives softirq more packet-drain headroom per pass; default `300` is undersized for multi-Gbit-class hosts.
+- `vm.swappiness = 10` — commonly recommended for latency-sensitive servers; harmless on swapless hosts.
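+
+The byte values can be checked against the sizes quoted above with a small parser. This is a sketch only; `parse_sysctl_conf` is a hypothetical helper, and the text constant simply repeats the proposed file.
+
+```python
+from typing import Dict
+
+CONF_TEXT = """\
+net.core.rmem_max = 8388608
+net.core.wmem_max = 8388608
+net.core.rmem_default = 524288
+net.core.wmem_default = 524288
+net.core.netdev_max_backlog = 5000
+net.core.netdev_budget = 600
+vm.swappiness = 10
+"""
+
+
+def parse_sysctl_conf(text: str) -> Dict[str, str]:
+    """Parse 'key = value' lines, skipping blanks and comments."""
+    values: Dict[str, str] = {}
+    for line in text.splitlines():
+        line = line.strip()
+        if not line or line.startswith(("#", ";")):
+            continue
+        key, _, value = line.partition("=")
+        values[key.strip()] = value.strip()
+    return values
+```
+
+Comparing `int(values["net.core.rmem_max"])` with `8 * 1024 * 1024` confirms that the file and the prose agree on 8 MB.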
+
+### Deploy script integration
+
+`deploy/deploy-test-server.sh` must:
+
+1. Copy `etc/sysctl.d/99-left4me.conf` to `/etc/sysctl.d/`.
+2. Run `sysctl --system` (or `sysctl -p /etc/sysctl.d/99-left4me.conf`) so values take effect immediately, not on next boot.
+3. Copy the two `.slice` files into `/usr/local/lib/systemd/system/`.
+4. `systemctl daemon-reload` after unit/slice changes (already done in current deploy flow).
+5. No explicit `systemctl start` of the slices is required — they activate on first child reference.
+
+### Documented escape hatches (no auto-apply)
+
+Append a "Performance tuning" section to `deploy/README.md`:
+
+- **CPU governor**: `cpupower frequency-set -g performance` if jitter under load matters more than power. Schedutil is acceptable for sustained UDP workloads. Provide the one-liner; do not ship a oneshot service in v1.
+- **CPU affinity per instance**: example drop-in at `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf` setting `CPUAffinity=N`. Document the strategy "one instance per core, leave core 0 for system + IRQ".
+- **NIC tuning**: example `ethtool -G <iface> rx 4096 tx 4096`, IRQ-pinning hints. Hardware-specific; ops-only.
+- **Real-time scheduling opt-in**: example drop-in adding `CPUSchedulingPolicy=fifo`, `CPUSchedulingPriority=10`, `LimitRTPRIO=10`. Include a one-paragraph warning citing RT-throttling defaults (`sched_rt_runtime_us=950000`) and the failure mode if a single instance misbehaves.
+
+These stay pure documentation in v1 — no code paths, no tests asserting them.
+
+### Out-of-scope rationale
+
+- **SCHED_FIFO**: a misbehaving srcds at any RT priority can starve kernel threads and produces failure modes that are harder to diagnose than the jitter problem it claims to solve. `Nice=-5` plus the slice weights captures the practical benefit. Ops who need RT can opt in via the documented drop-in.
+- **CPU governor auto-set**: Phoronix and Arch comparisons show `schedutil` is within noise of `performance` on sustained workloads like Source UDP; aggressively forcing `performance` would surprise users on power-managed hosts.
+- **CPUAffinity in the unit**: the unit template is shared across all instances; a single hard-coded `CPUAffinity=` would pin every instance to the same cores, defeating the purpose. Per-instance pinning needs deploy-time policy that is outside v1's scope.
+
+### Files changed / added
+
+```
+deploy/files/usr/local/lib/systemd/system/left4me-server@.service       (modified)
+deploy/files/usr/local/lib/systemd/system/l4d2-game.slice               (new)
+deploy/files/usr/local/lib/systemd/system/l4d2-build.slice              (new)
+deploy/files/etc/sysctl.d/99-left4me.conf                               (new)
+deploy/files/usr/local/libexec/left4me/left4me-script-sandbox           (modified)
+deploy/deploy-test-server.sh                                            (modified — sysctl --system step)
+deploy/README.md                                                        (modified — performance section)
+deploy/tests/test_deploy_artifacts.py                                   (modified — assertions)
+```
+
+## Tests
+
+`deploy/tests/test_deploy_artifacts.py` additions, following the existing
+`assert "key=value" in text` pattern:
+
+- For `left4me-server@.service`, assert every line listed in *Per-instance
+  unit additions* is present verbatim. Each is a separate assertion so a
+  failing line is identifiable.
+- For `l4d2-game.slice`, assert `CPUWeight=1000` and `IOWeight=1000`.
+- For `l4d2-build.slice`, assert `CPUWeight=10` and `IOWeight=10`.
+- For `99-left4me.conf`, assert every sysctl line listed in *Host sysctls*.
+- For `left4me-script-sandbox`, assert the strings `--slice=l4d2-build.slice`
+  and `OOMScoreAdjust=500` both appear.
+- Assert the deploy script invokes `sysctl --system` (or
+  `sysctl -p /etc/sysctl.d/99-left4me.conf`) at least once after copying the
+  conf into place.
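+
+One possible shape for the per-directive checks is a helper that reports which expected lines are absent. This is a sketch under assumptions: `missing_lines` is hypothetical and matches whole lines, which is stricter than the substring pattern the existing tests use (a substring check for `Nice=-5` would also accept `Nice=-50`).
+
+```python
+from typing import Iterable, List
+
+# Directives from "Per-instance unit additions".
+EXPECTED_DIRECTIVES = [
+    "Slice=l4d2-game.slice",
+    "Nice=-5",
+    "IOSchedulingClass=best-effort",
+    "IOSchedulingPriority=4",
+    "OOMScoreAdjust=-200",
+    "MemoryHigh=1.5G",
+    "MemoryMax=2G",
+    "TasksMax=256",
+    "LimitNOFILE=65536",
+    "KillSignal=SIGINT",
+    "TimeoutStopSec=15s",
+    "LogRateLimitIntervalSec=0",
+]
+
+
+def missing_lines(text: str, expected: Iterable[str]) -> List[str]:
+    """Return the expected lines that do not appear as whole lines in text."""
+    present = {line.strip() for line in text.splitlines()}
+    return [line for line in expected if line not in present]
+```
+
+A test can then assert that `missing_lines(unit_text, EXPECTED_DIRECTIVES)` is empty, and the failure message names the absent directives.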
+
+No runtime perf tests in v1 — the spec ships defaults, not measured wins.
+Real-world measurement is left to operators with concrete instance counts,
+hardware, and player loads.
+
+## Rollout
+
+Single deploy. Running game servers will not pick up the new directives until each instance is restarted (systemd does not reapply unit changes to already-running services). The web UI's "stop" + "start" cycle is sufficient. Document this in `deploy/README.md`.
+
+## Open questions
+
+None blocking. v2 candidates if measurement justifies them:
+
+- Per-instance `CPUAffinity` driven by a deploy-env knob (`LEFT4ME_INSTANCE_CPUS`).
+- Job-worker awareness of "server has active players" to defer builds further than weights alone.
+- Optional `left4me-host-perf.service` oneshot that sets governor + NIC tuning under a single env-flag opt-in.
+
+## References
+
+- systemd.exec(5) — `Nice=`, `IOSchedulingClass=`, `IOSchedulingPriority=`, `OOMScoreAdjust=`, `LimitNOFILE=`, `LogRateLimitIntervalSec=`.
+- systemd.resource-control(5) — slice semantics, `Slice=`, `CPUWeight=`, `IOWeight=`, `MemoryHigh=`, `MemoryMax=`, `TasksMax=`, weight competition rules.
+- systemd.kill(5) and systemd.service(5) — signal handling, `KillSignal=`, and `TimeoutStopSec=`.
+- Red Hat Enterprise Linux Network Performance Tuning Guide — `rmem_max`/`wmem_max`/`netdev_max_backlog`/`netdev_budget`.
+- LWN "SCHED_FIFO and realtime throttling"; RHEL Real-Time CPU throttling docs — rationale for not shipping RT by default.
+- Linux Foundation real-time wiki — `sched_rt_runtime_us` semantics.
+- forums.srcds.com / AlliedModders / linuxquestions.org threads — confirmation that srcds is single-threaded per instance.
+- Phoronix governor comparisons — performance vs schedutil for sustained workloads.
+- Multiple latency-tuning guides — `vm.swappiness=10` consensus.
--- a/docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
+++ b/docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
@ -0,0 +1,217 @@
+# l4d2 server lifecycle: reboot-safe + drift reconciliation — design
+
+Date: 2026-05-09
+Status: design
+
+## Summary
+
+Make L4D2 server instances survive a host reboot by switching their lifecycle verbs from `systemctl start`/`stop` to `systemctl enable --now`/`disable --now`. Pair this with a periodic background poller that refreshes `Server.actual_state` so out-of-band state changes (OOM kills, manual `systemctl stop`, crashes that exhaust `Restart=on-failure`) no longer leave the web UI showing stale "running" indicators.
+
+## Goals
+
+- An L4D2 server started via the web UI (or `l4d2ctl start`) automatically comes back up after a host reboot, with no operator action.
+- The web app's `Server.actual_state` converges to systemd's actual state within ~30 seconds of any out-of-band change.
+- The single-source-of-truth for "this server should be running" lives in systemd's wants-symlinks, not in a SQLite row that systemd has no awareness of.
+- Migration from the existing `systemctl start`-based fleet is a no-op: the next stop+start cycle through the UI converts each server to the enable-based model.
+
+## Non-goals
+
+- **Auto-restart on detected drift.** When the poller observes `desired_state=running` but `actual_state=stopped`, this spec does not re-enqueue a start job. That's a v2 UX/policy decision.
+- **UI surfacing of stale-state warnings.** Once the poller is reliable, the dashboard could show "DB believes X, but actual_state was last refreshed N seconds ago." Out of scope.
+- **Reconciliation of orphan systemd units.** Units enabled on disk but not represented by any `Server` row (e.g., from a crashed delete) — separate cleanup spec.
+- **Per-server poller intervals.** A single global cadence is sufficient.
+- **Replacing `Restart=on-failure`** with anything more elaborate. The unit's existing restart policy stays.
+- **Reactive-style state propagation.** No SSE/websocket pushes to the UI when actual_state changes. The next page render reads the fresh value from the DB.
+
+## Premise check: system units, not user units
+
+`systemctl --user enable --now` has different lifecycle rules — auto-start only at user login (unless `loginctl enable-linger <user>` is set), symlinks land in `~/.config/systemd/user/<target>.wants/`. It would be wrong here.
+
+This project uses **system units**, confirmed by:
+
+- Unit path: `/usr/local/lib/systemd/system/left4me-server@.service` is the system search path; user units live in `/etc/systemd/user/` or `~/.config/systemd/user/`.
+- The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl:31-44`) calls plain `systemctl` (no `--user` flag) and runs as **root** via the sudoers rule at `deploy/files/etc/sudoers.d/left4me:2`.
+- The unit's `[Install] WantedBy=multi-user.target` (line 43 of the unit) is a system target; user units would use `default.target`.
+- The same machinery is already in production for `left4me-web.service` — `deploy-test-server.sh` runs `sudo systemctl enable --now left4me-web.service`, and that's how the web service auto-came-back after today's reboot. We're applying the same pattern to the game-server template instances.
+
+`systemctl enable left4me-server@1.service` will create `/etc/systemd/system/multi-user.target.wants/left4me-server@1.service` symlinked to `/usr/local/lib/systemd/system/left4me-server@.service`. systemd handles the template instantiation via the `@` syntax automatically.
+
+## Background
+
+Today's behavior, confirmed by forensics on `ckn@10.0.4.128` after the operator ran `sudo systemctl poweroff` at 11:48:02 CEST:
+
+- The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`) accepts the verbs `start`, `stop`, and `show`, each invoking the literal `systemctl` action.
+- `l4d2host/service_control.py` exposes `start_service(name)` and `stop_service(name)` that build `systemctl_command("start"/"stop", name)`.
+- `l4d2host/instances.py` `start_instance` and `stop_instance` call those functions.
+- `systemctl start` is a transient activation. systemd creates **no** symlink under `multi-user.target.wants/`, so the unit doesn't auto-start on next boot.
+- After the host poweroff at 11:48:02, both running instances were cleanly shut down. The host rebooted; `left4me-web.service` came back (it *is* `enable`d); the game instances did not.
+- The web app's `Server.actual_state` is only ever written by `refresh_server_actual_state_after_job()` in `l4d2web/services/job_worker.py:581`, called solely after a job completes. With no jobs in flight after the reboot, the row's `actual_state="running"` from yesterday remained the displayed truth.
+
+## Design
+
+### Part A — Switch lifecycle verbs to `enable --now` / `disable --now`
+
+**Helper script** (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`):
+
+Rename the action verbs the helper accepts: drop `start`/`stop`, add `enable`/`disable`. The bodies become:
+
+```sh
+case "$action" in
+    enable)  exec "$systemctl" enable --now "$unit" ;;
+    disable) exec "$systemctl" disable --now "$unit" ;;
+    show)    exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
+    *)       reject ;;
+esac
+```
+
+The existing instance-name validation regex (currently lines 12–17) is unchanged — it constrains the `<name>` argument, not the action. The sudoers rule at `deploy/files/etc/sudoers.d/left4me`:
+
+```
+left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-systemctl *
+```
+
+already passes any args; no sudoers update needed.
+
+**Python wrapper** (`l4d2host/service_control.py`):
+
+Rename `start_service` → `enable_service` and `stop_service` → `disable_service`. Each builds `systemctl_command("enable", name)` / `systemctl_command("disable", name)`. The existing `show_service` is unchanged.
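+
+A dependency-free sketch of the renamed wrappers' command construction, assuming the argv shape the tests in this spec expect (`sudo -n <helper> <action> <name>`). The real `systemctl_command` signature and the action allow-list shown here are assumptions, not copied from `l4d2host/service_control.py`.
+
+```python
+from typing import List
+
+HELPER_PATH = "/usr/local/libexec/left4me/left4me-systemctl"
+ALLOWED_ACTIONS = ("enable", "disable", "show")  # assumed allow-list
+
+
+def systemctl_command(action: str, name: str) -> List[str]:
+    """Build the sudo argv for the helper; reject verbs it no longer accepts."""
+    if action not in ALLOWED_ACTIONS:
+        raise ValueError(f"unsupported action: {action!r}")
+    return ["sudo", "-n", HELPER_PATH, action, name]
+
+
+def enable_service(name: str) -> List[str]:
+    return systemctl_command("enable", name)
+
+
+def disable_service(name: str) -> List[str]:
+    return systemctl_command("disable", name)
+```
+
+In the real module these functions would hand the argv to the command runner rather than return it; returning it here keeps the sketch testable without sudo.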
+
+**Instance lifecycle** (`l4d2host/instances.py`):
+
+- `start_instance` — replace the `start_service(...)` call with `enable_service(...)`.
+- `stop_instance` — replace `stop_service(...)` with `disable_service(...)`.
+- `_purge_instance` (called by `delete_instance` and `reset_instance`) — replace `stop_service(...)` with `disable_service(...)`. A disabled-but-not-running unit's `disable --now` is a no-op for the runtime + still removes any leftover wants-symlink, which is the desired idempotent behavior.
+
+**CLI surface** (`l4d2host/cli.py`):
+
+`l4d2ctl start <name>` and `l4d2ctl stop <name>` keep their names per the contract in `AGENTS.md` ("Host CLI write commands are fixed to: install, initialize, start, stop, delete"). The semantics now genuinely match the verb at the operator level: `start` = "ensure running, now and after reboot." Internal call paths route through `start_instance` → `enable_service` as renamed above.
+
+**Web facade** (`l4d2web/services/l4d2_facade.py`):
+
+Unchanged. Still invokes `["l4d2ctl", "start", ...]` / `["l4d2ctl", "stop", ...]`.
+
+### Part B — Periodic state poller
+
+Add a single background thread spawned alongside the existing job-worker threads in `l4d2web/services/job_worker.py:start_job_workers`:
+
+```python
+def start_state_poller(app):
+    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
+    thread = threading.Thread(
+        target=state_poller_loop,
+        args=(app, interval),
+        daemon=True,
+        name="left4me-state-poller",
+    )
+    thread.start()
+
+
+def state_poller_loop(app, interval):
+    while True:
+        try:
+            with app.app_context():
+                poll_all_servers()
+        except Exception:
+            # Never let a single failure kill the loop, but leave a trace in the log.
+            app.logger.exception("state poller tick failed")
+        time.sleep(interval)
+
+
+def poll_all_servers():
+    with session_scope() as db:
+        active_server_ids = set(db.scalars(
+            select(Job.server_id).where(Job.state.in_(("queued", "running")))
+        ).all())
+        server_ids = [
+            sid for sid in db.scalars(select(Server.id)).all()
+            if sid not in active_server_ids
+        ]
+    for sid in server_ids:
+        try:
+            refresh_server_actual_state(sid)
+        except Exception:
+            pass
+```
+
+**Why skip in-flight servers:** the job worker's success path also calls `refresh_server_actual_state`. Both writers touching the same row at overlapping times produce no data race (SQLite WAL serializes writes), but a poller observing transient state mid-job (for example, the brief window where the unit is being enabled but `srcds` hasn't fully bound the port yet) could write a misleading value that the worker's post-completion refresh then overwrites. Skipping is simpler than reasoning about the orderings.
+
+**Wiring in startup** (`l4d2web/app.py:create_app`): call `start_state_poller(app)` adjacent to `start_job_workers(app)`, gated by the same `should_start_workers` predicate (existing lines 84–88: `JOB_WORKER_ENABLED and not TESTING and not _in_flask_cli_context()`).
+
+**First-tick latency:** the loop runs `poll_all_servers()` once before the first `time.sleep(interval)`, so the DB catches up to systemd reality right after app boot, at the cost of one `systemctl show` per server. A separate startup-reconcile path is not needed.
+
+**Concurrency:** the poller and the workers all use `session_scope()` (`l4d2web/db.py:44–58`) which commits-on-success / rolls-back-on-exception. SQLite WAL mode (configured by the deploy script per `deploy-test-server.sh:188-198`) handles concurrent reads + serialized writes. No new locking primitives.
+
+### Why both parts
+
+Either part alone is insufficient:
+
+- **Part A alone** survives reboots but doesn't catch OOM kills, manual `systemctl disable --now <unit>` from a shell, or crashes that exhaust `Restart=on-failure`. The DB still drifts in those cases.
+- **Part B alone** keeps the DB honest but doesn't bring servers back after a reboot — the operator would still be looking at `actual_state=stopped` on a server they expected to come back, with the only recourse being to click start again.
+
+Together: enable-based lifecycle keeps systemd as the source of truth; the poller keeps the DB honest about whatever systemd reports.
+
+### Migration on running hosts
+
+No one-shot migration is needed. After this lands, a server currently running via the old `systemctl start` (started but not enabled) keeps running through the deploy. The next time the operator clicks stop in the UI, `systemctl disable --now` runs: `disable` is a no-op for a unit that was never enabled, but `--now` still stops the live process. The next start runs `systemctl enable --now`, which enables and starts the unit. From that point on the unit survives reboot.
+
+The poller's first tick after deploy refreshes every server's `actual_state` to whatever systemd reports. If the test box's two stale rows still claim "running" but no unit is loaded, that tick flips them to `stopped`.
+
+### Files changed / added
+
+```
+deploy/files/usr/local/libexec/left4me/left4me-systemctl    (Part A — verbs)
+l4d2host/service_control.py                                  (Part A — rename)
+l4d2host/instances.py                                        (Part A — call new names)
+l4d2host/tests/test_lifecycle.py                             (Part A — test updates)
+l4d2host/tests/test_service_control.py                       (Part A — new direct unit tests, create if absent)
+deploy/tests/test_deploy_artifacts.py                        (Part A — helper assertions)
+
+l4d2web/services/job_worker.py                               (Part B — poller code)
+l4d2web/app.py                                               (Part B — wire start_state_poller)
+l4d2web/config.py                                            (Part B — STATE_POLLER_INTERVAL_SECONDS default)
+l4d2web/tests/test_job_worker.py                             (Part B — poller tests)
+```
+
+## Tests
+
+### Part A
+
+- `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`: update body assertions to expect `enable)` / `disable)` / `show)`. Add an assertion that `enable)` body contains `enable --now` and `disable)` body contains `disable --now`. Update rejected-action examples (drop `start`/`stop` since they're no longer accepted).
+- `l4d2host/tests/test_lifecycle.py`: every assertion that mocks `run_command` and inspects the systemctl-helper invocation needs the action token updated from `start` → `enable` and `stop` → `disable`. The `_purge_instance` paths exercised by `delete_instance` and `reset_instance` flip from `stop` to `disable`.
+- New direct unit tests in `l4d2host/tests/test_service_control.py` (create the file if it doesn't exist already): exercise `enable_service` and `disable_service` with a mocked `run_command` and assert they emit `["sudo", "-n", helper_path, "enable"|"disable", name]`.
+
+### Part B
+
+- `l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server` (new): seed two `Server` rows with `actual_state="unknown"`; monkey-patch `refresh_server_actual_state` to record calls; run one iteration of `poll_all_servers()`; assert it was called once per server in any order.
+- `test_state_poller_skips_servers_with_inflight_jobs` (new): seed a `Server` row + a `Job` with `state="running"` for that server; run `poll_all_servers()`; assert `refresh_server_actual_state` was NOT called for that server.
+- `test_state_poller_swallows_per_server_exceptions` (new): make `refresh_server_actual_state` raise for one server; assert other servers are still polled and the loop function returns normally.
+- `test_state_poller_disabled_when_job_workers_disabled` (new): create app with `JOB_WORKER_ENABLED=False`; assert `start_state_poller` is not invoked (or that no `left4me-state-poller` thread is alive after `create_app`).
+
+### CI sanity
+
+`pytest deploy/tests/ l4d2host/tests l4d2web/tests -q` is green except the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state` (stale since `caa8b83`, out of scope).
+
+## Rollout
+
+Single deploy. After deploy:
+
+1. The poller's first tick (within seconds of `left4me-web.service` starting) refreshes every server's `actual_state` to systemd reality. Any servers stuck on stale "running" flip to "stopped" automatically. **No operator UI clicks required.**
+2. Servers currently `running` (started via the old `systemctl start`) keep running, but they're not yet `enabled`. The operator's next stop+start through the UI converts them to enable-based and from that point onwards they're reboot-safe.
+3. Newly-started servers (`l4d2ctl start <name>` or web UI start) are enable-based from the first invocation.
+
+If something goes wrong — e.g., the helper rejects a previously-valid invocation or the poller floods the journal — the helper script + `service_control.py` change can be reverted independently of the poller, and vice versa.
+
+## Open questions
+
+None blocking. v2 candidates:
+
+- Auto-restart on `desired_state=running && actual_state=stopped` (separate UX decision).
+- Per-server poll intervals or backoff for repeatedly-failing servers.
+- A "drift" badge in the UI when `actual_state_updated_at` is older than 2× the poll interval (proxy for "the poller isn't running" or "the host is unreachable").
+
+## References
+
+- systemd.unit(5) — `WantedBy=`, `Install` section semantics.
+- systemctl(1) — `enable --now` / `disable --now` flags.
+- Existing perf-baseline spec: `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`.
+- Existing CPU-isolation spec: `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`.
+- `AGENTS.md` — Host CLI write-command set is fixed; this spec preserves that contract.
--- a/l4d2host/fs/__init__.py
+++ b/l4d2host/fs/__init__.py
--- a/l4d2host/fs/base.py
+++ b/l4d2host/fs/base.py
@ -1,30 +0,0 @@
-from abc import ABC, abstractmethod
-from pathlib import Path
-from typing import Callable
-
-
-class OverlayMounter(ABC):
-    @abstractmethod
-    def mount(
-        self,
-        *,
-        lowerdirs: str,
-        upperdir: Path,
-        workdir: Path,
-        merged: Path,
-        on_stdout: Callable[[str], None] | None = None,
-        on_stderr: Callable[[str], None] | None = None,
-        passthrough: bool = False,
-    ) -> None:
-        raise NotImplementedError
-
-    @abstractmethod
-    def unmount(
-        self,
-        *,
-        merged: Path,
-        on_stdout: Callable[[str], None] | None = None,
-        on_stderr: Callable[[str], None] | None = None,
-        passthrough: bool = False,
-    ) -> None:
-        raise NotImplementedError
--- a/l4d2host/fs/kernel_overlayfs.py
+++ b/l4d2host/fs/kernel_overlayfs.py
@ -1,53 +0,0 @@
-from pathlib import Path
-from typing import Callable
-
-from l4d2host.fs.base import OverlayMounter
-from l4d2host.process import run_command
-
-
-HELPER_PATH = "/usr/local/libexec/left4me/left4me-overlay"
-
-
-class KernelOverlayFSMounter(OverlayMounter):
-    # Delegates the actual mount/umount syscalls to the privileged
-    # left4me-overlay helper. The helper takes only the instance name and
-    # rederives lowerdirs/upper/work/merged from disk; the OverlayMounter
-    # ABC accepts those args for compatibility, so we extract the name
-    # from the merged path's parent directory.
-    def mount(
-        self,
-        *,
-        lowerdirs: str,
-        upperdir: Path,
-        workdir: Path,
-        merged: Path,
-        on_stdout: Callable[[str], None] | None = None,
-        on_stderr: Callable[[str], None] | None = None,
-        passthrough: bool = False,
-        should_cancel: Callable[[], bool] | None = None,
-    ) -> None:
-        del lowerdirs, upperdir, workdir
-        run_command(
-            ["sudo", "-n", HELPER_PATH, "mount", merged.parent.name],
-            on_stdout=on_stdout,
-            on_stderr=on_stderr,
-            passthrough=passthrough,
-            should_cancel=should_cancel,
-        )
-
-    def unmount(
-        self,
-        *,
-        merged: Path,
-        on_stdout: Callable[[str], None] | None = None,
-        on_stderr: Callable[[str], None] | None = None,
-        passthrough: bool = False,
-        should_cancel: Callable[[], bool] | None = None,
-    ) -> None:
-        run_command(
-            ["sudo", "-n", HELPER_PATH, "umount", merged.parent.name],
-            on_stdout=on_stdout,
-            on_stderr=on_stderr,
-            passthrough=passthrough,
-            should_cancel=should_cancel,
-        )
--- a/l4d2host/instances.py
+++ b/l4d2host/instances.py
@ -1,21 +1,16 @@
-import os
 from pathlib import Path
 import shutil
 import subprocess
 from typing import Callable

-from l4d2host.fs.kernel_overlayfs import KernelOverlayFSMounter
 from l4d2host.paths import DEFAULT_LEFT4ME_ROOT, get_left4me_root, overlay_path, validate_instance_name
-from l4d2host.service_control import start_service, stop_service
+from l4d2host.service_control import disable_service, enable_service
 from l4d2host.spec import load_spec


 from l4d2host.logging import emit_step


-_mounter = KernelOverlayFSMounter()
-
-
 DEFAULT_ROOT = DEFAULT_LEFT4ME_ROOT


@ -63,16 +58,6 @@ def initialize_instance(
    emit_step("initialization complete.", on_stdout, passthrough)


-def _load_instance_env(path: Path) -> dict[str, str]:
-    result: dict[str, str] = {}
-    for line in path.read_text().splitlines():
-        if "=" not in line:
-            continue
-        key, value = line.split("=", 1)
-        result[key] = value
-    return result
-
-
 def start_instance(
    name: str,
    *,
@ -87,25 +72,14 @@ def start_instance(
    instance_dir = root / "instances" / name
    runtime_dir = root / "runtime" / name

-    env = _load_instance_env(instance_dir / "instance.env")
-
-    merged = runtime_dir / "merged"
-    if os.path.ismount(merged):
-        # Kernel overlayfs mounts persist when the web worker dies (unlike
-        # fuse daemons, which were reaped with their cgroup). Refuse rather
-        # than double-mount.
-        raise subprocess.CalledProcessError(
-            returncode=1,
-            cmd=["start_instance"],
-            stderr=f"runtime overlay already mounted at {merged}; refusing to double-mount",
-        )
-
-    # Stage cfg files in the upper layer BEFORE mounting. Writing through
-    # merged after the mount triggers overlayfs copy-up, which preserves the
-    # lower file's ownership — and a script-sandbox-built `server.cfg` is
-    # owned by `l4d2-sandbox`, not the worker. Pre-mount writes go straight to
-    # upper with the worker's uid; the kernel just shows them at the top of
-    # the merged stack once mounted.
+    # Stage cfg files in the upper layer. Writing here goes straight to the
+    # upper dir on the host filesystem with the worker's uid; the unit's
+    # ExecStartPre then mounts the overlay (single source of truth for the
+    # mount), and the kernel surfaces these files at the top of the merged
+    # stack. A script-sandbox-built lower-layer `server.cfg` is owned by
+    # `l4d2-sandbox`, not the worker — staging in upper sidesteps the
+    # ownership-preserving copy-up that would happen if we wrote through
+    # merged post-mount.
    emit_step("staging server.cfg + per-overlay aliases in upper layer...", on_stdout, passthrough)
    upper_cfg_dir = runtime_dir / "upper" / "left4dead2" / "cfg"
    upper_cfg_dir.mkdir(parents=True, exist_ok=True)
@ -121,20 +95,8 @@ def start_instance(
            continue
        shutil.copy2(src, upper_cfg_dir / f"server_{o.alias}.cfg")

-    emit_step("mounting runtime overlay...", on_stdout, passthrough)
-    _mounter.mount(
-        lowerdirs=env["L4D2_LOWERDIRS"],
-        upperdir=runtime_dir / "upper",
-        workdir=runtime_dir / "work",
-        merged=merged,
-        on_stdout=on_stdout,
-        on_stderr=on_stderr,
-        passthrough=passthrough,
-        should_cancel=should_cancel,
-    )
-
-    emit_step("starting systemd service...", on_stdout, passthrough)
-    start_service(
+    emit_step("enabling + starting systemd service...", on_stdout, passthrough)
+    enable_service(
        name,
        on_stdout=on_stdout,
        on_stderr=on_stderr,
@ -155,25 +117,17 @@ def stop_instance(
 ) -> None:
    name = validate_instance_name(name)
    root = get_left4me_root() if root is None else Path(root)
-    emit_step("stopping systemd service...", on_stdout, passthrough)
-    stop_service(
+    # `disable --now` triggers the unit's ExecStopPost, which unmounts the
+    # overlay. Single source of truth for unmount lives in the unit file;
+    # no Python-side unmount needed.
+    emit_step("disabling + stopping systemd service...", on_stdout, passthrough)
+    disable_service(
        name,
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )
-    emit_step("unmounting runtime overlay (if mounted)...", on_stdout, passthrough)
-    try:
-        _mounter.unmount(
-            merged=root / "runtime" / name / "merged",
-            on_stdout=on_stdout,
-            on_stderr=on_stderr,
-            passthrough=passthrough,
-            should_cancel=should_cancel,
-        )
-    except subprocess.CalledProcessError:
-        pass
    emit_step("stop complete.", on_stdout, passthrough)


@ -189,9 +143,13 @@ def _purge_instance(
    instance_dir = root / "instances" / name
    runtime_dir = root / "runtime" / name

-    emit_step("stopping systemd service (if running)...", on_stdout, passthrough)
+    # disable --now triggers ExecStopPost which unmounts. The try/except
+    # tolerates the unit-not-loaded case (e.g., delete on an instance that
+    # was initialized but never started — no unit, nothing to disable, no
+    # mount to clean up either).
+    emit_step("disabling + stopping systemd service (if running)...", on_stdout, passthrough)
    try:
-        stop_service(
+        disable_service(
            name,
            on_stdout=on_stdout,
            on_stderr=on_stderr,
@ -201,18 +159,6 @@ def _purge_instance(
    except subprocess.CalledProcessError:
        pass

-    emit_step("unmounting runtime overlay (if mounted)...", on_stdout, passthrough)
-    try:
-        _mounter.unmount(
-            merged=runtime_dir / "merged",
-            on_stdout=on_stdout,
-            on_stderr=on_stderr,
-            passthrough=passthrough,
-            should_cancel=should_cancel,
-        )
-    except subprocess.CalledProcessError:
-        pass
-
    emit_step("removing instance files...", on_stdout, passthrough)
    if instance_dir.exists():
        shutil.rmtree(instance_dir)
--- a/l4d2host/pyproject.toml
+++ b/l4d2host/pyproject.toml
@ -17,7 +17,7 @@ dependencies = [
 l4d2ctl = "l4d2host.cli:app"

 [tool.setuptools]
-packages = ["l4d2host", "l4d2host.fs"]
+packages = ["l4d2host"]

 [tool.setuptools.package-dir]
 l4d2host = "."
--- a/l4d2host/service_control.py
+++ b/l4d2host/service_control.py
@ -17,7 +17,7 @@ def journalctl_command(name: str, lines: int = 200, follow: bool = True) -> list
    return ["sudo", "-n", JOURNALCTL_HELPER, name, "--lines", str(lines), follow_arg]


-def start_service(
+def enable_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
@ -26,7 +26,7 @@ def start_service(
    should_cancel: Callable[[], bool] | None = None,
 ) -> CommandResult:
    return run_command(
-        systemctl_command("start", name),
+        systemctl_command("enable", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
@ -34,7 +34,7 @@ def start_service(
    )


-def stop_service(
+def disable_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
@ -43,7 +43,7 @@ def stop_service(
    should_cancel: Callable[[], bool] | None = None,
 ) -> CommandResult:
    return run_command(
-        systemctl_command("stop", name),
+        systemctl_command("disable", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
--- a/l4d2host/tests/test_kernel_overlayfs.py
+++ b/l4d2host/tests/test_kernel_overlayfs.py
@ -1,76 +0,0 @@
-from pathlib import Path
-
-import pytest
-
-
-HELPER_PATH = "/usr/local/libexec/left4me/left4me-overlay"
-
-
-def test_mount_invokes_helper_with_name_only(monkeypatch: pytest.MonkeyPatch) -> None:
-    from l4d2host.fs.kernel_overlayfs import KernelOverlayFSMounter
-
-    calls: list[list[str]] = []
-
-    def fake_run_command(cmd, **kwargs):
-        del kwargs
-        calls.append(list(cmd))
-
-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
-
-    KernelOverlayFSMounter().mount(
-        lowerdirs="/var/lib/left4me/installation",
-        upperdir=Path("/var/lib/left4me/runtime/alpha/upper"),
-        workdir=Path("/var/lib/left4me/runtime/alpha/work"),
-        merged=Path("/var/lib/left4me/runtime/alpha/merged"),
-    )
-
-    assert calls == [["sudo", "-n", HELPER_PATH, "mount", "alpha"]]
-
-
-def test_unmount_invokes_helper_with_umount_verb(monkeypatch: pytest.MonkeyPatch) -> None:
-    from l4d2host.fs.kernel_overlayfs import KernelOverlayFSMounter
-
-    calls: list[list[str]] = []
-
-    def fake_run_command(cmd, **kwargs):
-        del kwargs
-        calls.append(list(cmd))
-
-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
-
-    KernelOverlayFSMounter().unmount(merged=Path("/var/lib/left4me/runtime/alpha/merged"))
-
-    assert calls == [["sudo", "-n", HELPER_PATH, "umount", "alpha"]]
-
-
-def test_mount_propagates_run_command_kwargs(monkeypatch: pytest.MonkeyPatch) -> None:
-    from l4d2host.fs.kernel_overlayfs import KernelOverlayFSMounter
-
-    captured: dict = {}
-
-    def fake_run_command(cmd, **kwargs):
-        captured["cmd"] = list(cmd)
-        captured["kwargs"] = kwargs
-
-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
-
-    out: list[str] = []
-    err: list[str] = []
-    KernelOverlayFSMounter().mount(
-        lowerdirs="/var/lib/left4me/installation",
-        upperdir=Path("/var/lib/left4me/runtime/alpha/upper"),
-        workdir=Path("/var/lib/left4me/runtime/alpha/work"),
-        merged=Path("/var/lib/left4me/runtime/alpha/merged"),
-        on_stdout=out.append,
-        on_stderr=err.append,
-        passthrough=False,
-        should_cancel=lambda: False,
-    )
-
-    assert captured["cmd"][0:3] == ["sudo", "-n", HELPER_PATH]
-    captured["kwargs"]["on_stdout"]("hi")
-    captured["kwargs"]["on_stderr"]("oops")
-    assert out == ["hi"]
-    assert err == ["oops"]
-    assert captured["kwargs"]["passthrough"] is False
-    assert callable(captured["kwargs"]["should_cancel"])
--- a/l4d2host/tests/test_lifecycle.py
+++ b/l4d2host/tests/test_lifecycle.py
@ -29,19 +29,16 @@ def test_start_order(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
    (instance_dir / "server.cfg").write_text("sv_consistency 1")
    (instance_dir / "spec.yaml").write_text("port: 27015\noverlays: [x, y]\n")

-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
    monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)

    start_instance("alpha", root=tmp_path)

-    assert calls[0] == [
-        "sudo",
-        "-n",
-        "/usr/local/libexec/left4me/left4me-overlay",
-        "mount",
-        "alpha",
+    # The mount is now driven by the unit's ExecStartPre (single source of
+    # truth), so start_instance only stages the cfgs and asks systemd to
+    # enable+start the unit.
+    assert calls == [
+        ["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "enable", "alpha"],
    ]
-    assert calls[1] == ["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "start", "alpha"]


 def test_start_copies_per_overlay_aliases_and_sweeps_stale(
@ -75,7 +72,6 @@ def test_start_copies_per_overlay_aliases_and_sweeps_stale(
    (src_7 / "server.cfg").write_text("ignored: alias not set\n")
    (upper_cfg_dir / "server_orphan.cfg").write_text("from previous start\n")

-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
    monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)

    start_instance("alpha", root=tmp_path)
@ -87,36 +83,6 @@ def test_start_copies_per_overlay_aliases_and_sweeps_stale(
    assert not (upper_cfg_dir / "server_overlay_7.cfg").exists(), "no alias in spec → no copy"


-def test_start_refuses_to_double_mount(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
-    calls: list[list[str]] = []
-
-    def fake_run_command(cmd, **kwargs):
-        del kwargs
-        calls.append(list(cmd))
-
-    instance_dir = tmp_path / "instances" / "alpha"
-    runtime_dir = tmp_path / "runtime" / "alpha"
-    (runtime_dir / "merged").mkdir(parents=True)
-    instance_dir.mkdir(parents=True)
-    (instance_dir / "instance.env").write_text("L4D2_PORT=27015\nL4D2_ARGS=\nL4D2_LOWERDIRS=/x\n")
-    (instance_dir / "server.cfg").write_text("")
-
-    merged = runtime_dir / "merged"
-
-    def fake_ismount(path):
-        return Path(path) == merged
-
-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
-    monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
-    monkeypatch.setattr("l4d2host.instances.os.path.ismount", fake_ismount)
-
-    with pytest.raises(subprocess.CalledProcessError) as exc_info:
-        start_instance("alpha", root=tmp_path)
-
-    assert "already mounted" in (exc_info.value.stderr or "")
-    assert calls == [], "no mount/start commands must be issued when refusing"
-
-
 def test_delete_missing_is_noop(tmp_path: Path) -> None:
    delete_instance("missing", root=tmp_path)

@ -127,7 +93,7 @@ def test_delete_succeeds_when_stop_service_fails(tmp_path: Path, monkeypatch: py
    def fake_run_command(cmd, **kwargs):
        del kwargs
        calls.append(list(cmd))
-        if cmd[:2] == ["sudo", "-n"] and "left4me-systemctl" in cmd[2] and "stop" in cmd:
+        if cmd[:2] == ["sudo", "-n"] and "left4me-systemctl" in cmd[2] and "disable" in cmd:
            raise subprocess.CalledProcessError(
                returncode=5,
                cmd=list(cmd),
@ -137,7 +103,6 @@ def test_delete_succeeds_when_stop_service_fails(tmp_path: Path, monkeypatch: py
    (tmp_path / "instances" / "alpha").mkdir(parents=True)
    (tmp_path / "runtime" / "alpha" / "merged").mkdir(parents=True)

-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
    monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)

    delete_instance("alpha", root=tmp_path)
@ -172,7 +137,6 @@ def test_reset_stops_unmounts_and_removes_dirs(tmp_path: Path, monkeypatch: pyte
    (runtime_dir / "upper" / "logs").mkdir(parents=True)
    (runtime_dir / "upper" / "logs" / "console.log").write_text("noise")

-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
    monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)

    reset_instance("alpha", root=tmp_path)
@ -180,7 +144,7 @@ def test_reset_stops_unmounts_and_removes_dirs(tmp_path: Path, monkeypatch: pyte
    assert not instance_dir.exists()
    assert not runtime_dir.exists()
    assert any("left4me-systemctl" in arg for cmd in calls for arg in cmd)
-    assert any("stop" in cmd for cmd in calls)
+    assert any("disable" in cmd for cmd in calls)


 def test_reset_on_never_initialized_is_noop(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
@ -188,10 +152,9 @@ def test_reset_on_never_initialized_is_noop(tmp_path: Path, monkeypatch: pytest.
    stop+unmount (both suppressed on failure) and not raise."""
    def fake_run_command(cmd, **kwargs):
        del kwargs
-        if "stop" in cmd:
+        if "disable" in cmd:
            raise subprocess.CalledProcessError(returncode=5, cmd=list(cmd), stderr="not loaded")

-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
    monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)

    reset_instance("alpha", root=tmp_path)
@ -210,68 +173,16 @@ def test_delete_stopped_instance_removes_dirs(tmp_path: Path, monkeypatch: pytes
    (tmp_path / "instances" / "alpha").mkdir(parents=True)
    (tmp_path / "runtime" / "alpha" / "merged").mkdir(parents=True)

-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
    monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)

    delete_instance("alpha", root=tmp_path)

    assert not (tmp_path / "instances" / "alpha").exists()
    assert not (tmp_path / "runtime" / "alpha").exists()
-    assert ["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "stop", "alpha"] in calls
+    assert ["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "disable", "alpha"] in calls


-def test_stop_succeeds_when_unmount_fails(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
-    umount_calls: list[list[str]] = []
-
-    def fake_run_command(cmd, **kwargs):
-        del kwargs
-        if cmd[:4] == [
-            "sudo",
-            "-n",
-            "/usr/local/libexec/left4me/left4me-overlay",
-            "umount",
-        ]:
-            umount_calls.append(list(cmd))
-            raise subprocess.CalledProcessError(
-                returncode=1,
-                cmd=list(cmd),
-                stderr="umount: /var/lib/left4me/runtime/alpha/merged: not mounted",
-            )
-
-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
-    monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
-
-    stop_instance("alpha", root=tmp_path)
-
-    assert umount_calls, "stop must always attempt the overlay helper (no preflight)"
-
-
-def test_delete_succeeds_when_unmount_fails(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
-    umount_calls: list[list[str]] = []
-
-    def fake_run_command(cmd, **kwargs):
-        del kwargs
-        if cmd[:4] == [
-            "sudo",
-            "-n",
-            "/usr/local/libexec/left4me/left4me-overlay",
-            "umount",
-        ]:
-            umount_calls.append(list(cmd))
-            raise subprocess.CalledProcessError(
-                returncode=1,
-                cmd=list(cmd),
-                stderr="umount: /var/lib/left4me/runtime/alpha/merged: not mounted",
-            )
-
-    (tmp_path / "instances" / "alpha").mkdir(parents=True)
-    (tmp_path / "runtime" / "alpha" / "merged").mkdir(parents=True)
-
-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
-    monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
-
-    delete_instance("alpha", root=tmp_path)
-
-    assert umount_calls, "delete must always attempt the overlay helper (no preflight)"
-    assert not (tmp_path / "instances" / "alpha").exists()
-    assert not (tmp_path / "runtime" / "alpha").exists()
+# test_stop_succeeds_when_unmount_fails / test_delete_succeeds_when_unmount_fails
+# were removed when the Python-side unmount was dropped: the unit's
+# ExecStopPost is now the single code path for unmount, so there's no
+# Python-side failure to tolerate.
--- a/l4d2host/tests/test_service_control.py
+++ b/l4d2host/tests/test_service_control.py
@ -0,0 +1,21 @@
+from unittest.mock import patch
+
+from l4d2host.service_control import (
+    SYSTEMCTL_HELPER,
+    disable_service,
+    enable_service,
+)
+
+
+@patch("l4d2host.service_control.run_command")
+def test_enable_service_invokes_helper_with_enable_action(mock_run):
+    enable_service("instance-7")
+    args, _ = mock_run.call_args
+    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "enable", "instance-7"]
+
+
+@patch("l4d2host.service_control.run_command")
+def test_disable_service_invokes_helper_with_disable_action(mock_run):
+    disable_service("instance-7")
+    args, _ = mock_run.call_args
+    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "disable", "instance-7"]
--- a/l4d2web/app.py
+++ b/l4d2web/app.py
@ -18,7 +18,11 @@ from l4d2web.routes.overlay_routes import bp as overlay_bp
 from l4d2web.routes.page_routes import bp as page_bp
 from l4d2web.routes.server_routes import bp as server_bp
 from l4d2web.routes.workshop_routes import bp as workshop_bp
-from l4d2web.services.job_worker import recover_stale_jobs, start_job_workers
+from l4d2web.services.job_worker import (
+    recover_stale_jobs,
+    start_job_workers,
+    start_state_poller,
+)


 def _in_flask_cli_context() -> bool:
@ -89,6 +93,7 @@ def create_app(test_config: dict[str, object] | None = None) -> Flask:
    if should_start_workers:
        recover_stale_jobs()
        start_job_workers(app)
+        start_state_poller(app)

    @app.get("/health")
    def health():
--- a/l4d2web/config.py
+++ b/l4d2web/config.py
@ -8,6 +8,7 @@ DEFAULT_CONFIG: dict[str, object] = {
    "JOB_WORKER_THREADS": 4,
    "JOB_WORKER_ENABLED": True,
    "JOB_WORKER_POLL_SECONDS": 1,
+    "STATE_POLLER_INTERVAL_SECONDS": 30,
    "JOB_LOG_REPLAY_LIMIT": 2000,
    "JOB_LOG_LINE_MAX_CHARS": 4096,
    "PORT_RANGE_START": 27015,
@ -27,6 +28,7 @@ def load_config() -> dict[str, object]:
        "JOB_WORKER_THREADS": int(os.getenv("JOB_WORKER_THREADS", "4")),
        "JOB_WORKER_ENABLED": _bool_from_env(os.getenv("JOB_WORKER_ENABLED", "true")),
        "JOB_WORKER_POLL_SECONDS": float(os.getenv("JOB_WORKER_POLL_SECONDS", "1")),
+        "STATE_POLLER_INTERVAL_SECONDS": float(os.getenv("STATE_POLLER_INTERVAL_SECONDS", "30")),
        "JOB_LOG_REPLAY_LIMIT": int(os.getenv("JOB_LOG_REPLAY_LIMIT", "2000")),
        "JOB_LOG_LINE_MAX_CHARS": int(os.getenv("JOB_LOG_LINE_MAX_CHARS", "4096")),
        "PORT_RANGE_START": int(os.getenv("LEFT4ME_PORT_RANGE_START", "27015")),
--- a/l4d2web/services/job_worker.py
+++ b/l4d2web/services/job_worker.py
@ -614,3 +614,45 @@ def worker_loop(app, poll_seconds: float) -> None:
            ran_job = False
        if not ran_job:
            time.sleep(poll_seconds)
+
+
+def start_state_poller(app) -> None:
+    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
+    thread = threading.Thread(
+        target=state_poller_loop,
+        args=(app, interval),
+        name="left4me-state-poller",
+        daemon=True,
+    )
+    thread.start()
+
+
+def state_poller_loop(app, interval: float) -> None:
+    while True:
+        try:
+            with app.app_context():
+                poll_all_servers()
+        except Exception:
+            pass
+        time.sleep(interval)
+
+
+def poll_all_servers() -> None:
+    with session_scope() as db:
+        active_server_ids = set(
+            db.scalars(
+                select(Job.server_id).where(
+                    Job.state.in_(("queued", "running", "cancelling"))
+                )
+            ).all()
+        )
+        server_ids = [
+            sid
+            for sid in db.scalars(select(Server.id)).all()
+            if sid not in active_server_ids
+        ]
+    for sid in server_ids:
+        try:
+            refresh_server_actual_state(sid)
+        except Exception:
+            pass
--- a/l4d2web/tests/test_job_worker.py
+++ b/l4d2web/tests/test_job_worker.py
@ -843,3 +843,90 @@ def test_build_overlay_script_type_blocks_per_overlay(overlay_seeded_worker) ->
        can_start(DummyJob(operation="build_overlay", overlay_id=ids.overlay + 1), state)
        is True
    )
+
+
+# ---------------------------------------------------------------------------
+# State poller tests — refresh Server.actual_state out-of-band so OOM kills,
+# manual systemctl ops, and reboots no longer leave the DB on stale "running".
+# ---------------------------------------------------------------------------
+
+
+def test_state_poller_refreshes_each_server(seeded_worker, monkeypatch) -> None:
+    from l4d2web.services import job_worker as jw
+
+    worker_app, ids = seeded_worker
+
+    refreshed: list[int] = []
+    monkeypatch.setattr(
+        jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid)
+    )
+
+    with worker_app.app_context():
+        jw.poll_all_servers()
+
+    assert sorted(refreshed) == sorted([ids.server_one, ids.server_two])
+
+
+def test_state_poller_skips_servers_with_inflight_jobs(seeded_worker, monkeypatch) -> None:
+    from l4d2web.services import job_worker as jw
+
+    worker_app, ids = seeded_worker
+
+    add_job(ids.user, "stop", server_id=ids.server_one, state="running")
+
+    refreshed: list[int] = []
+    monkeypatch.setattr(
+        jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid)
+    )
+
+    with worker_app.app_context():
+        jw.poll_all_servers()
+
+    assert ids.server_one not in refreshed
+    assert ids.server_two in refreshed
+
+
+def test_state_poller_swallows_per_server_exceptions(seeded_worker, monkeypatch) -> None:
+    from l4d2web.services import job_worker as jw
+
+    worker_app, ids = seeded_worker
+
+    refreshed: list[int] = []
+
+    def fake_refresh(sid: int) -> None:
+        if sid == ids.server_one:
+            raise RuntimeError("simulated host failure")
+        refreshed.append(sid)
+
+    monkeypatch.setattr(jw, "refresh_server_actual_state", fake_refresh)
+
+    with worker_app.app_context():
+        jw.poll_all_servers()  # must not raise
+
+    assert refreshed == [ids.server_two]
+
+
+def test_state_poller_not_started_during_testing(monkeypatch, tmp_path) -> None:
+    from l4d2web import app as app_module
+
+    called: list = []
+    db_url = f"sqlite:///{tmp_path/'poller-testing.db'}"
+    monkeypatch.setattr(app_module, "start_state_poller", lambda app: called.append(app))
+
+    app_module.create_app({"TESTING": True, "DATABASE_URL": db_url, "SECRET_KEY": "test"})
+
+    assert called == []
+
+
+def test_state_poller_started_when_workers_enabled_outside_testing(monkeypatch, tmp_path) -> None:
+    from l4d2web import app as app_module
+
+    called: list = []
+    db_url = f"sqlite:///{tmp_path/'poller-enabled.db'}"
+    monkeypatch.setattr(app_module, "start_state_poller", lambda app: called.append(app))
+    monkeypatch.setattr(app_module, "start_job_workers", lambda app: None)
+    monkeypatch.setattr(app_module, "recover_stale_jobs", lambda: None)
+
+    app = app_module.create_app({"TESTING": False, "DATABASE_URL": db_url, "SECRET_KEY": "test"})
+
+    assert called == [app]
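
The three `poll_all_servers` tests above pin down two behaviours: servers with an in-flight job are skipped, and a failure on one server does not abort the sweep. A minimal sketch of a loop with those properties follows. It is not the actual `job_worker` code: the function name `poll_servers_sketch`, its parameters, and the returned list of failures are all hypothetical, chosen so the example runs without a database or Flask app context.

```python
from typing import Callable, Iterable, List, Set


def poll_servers_sketch(
    server_ids: Iterable[int],
    busy_ids: Set[int],
    refresh: Callable[[int], None],
) -> List[int]:
    """Refresh every idle server and return the ids whose refresh raised.

    Hypothetical stand-in for the real poller: `busy_ids` plays the role of
    "servers with queued/running/cancelling jobs" and `refresh` plays the
    role of refresh_server_actual_state.
    """
    failed: List[int] = []
    for sid in server_ids:
        if sid in busy_ids:
            # The in-flight job refreshes this server when it finishes,
            # so polling it now would race that post-job refresh.
            continue
        try:
            refresh(sid)
        except Exception:
            # One unreachable server must not stop the rest of the sweep.
            failed.append(sid)
    return failed
```

Collecting the failed ids instead of discarding them is a choice made for this sketch so the outcome is observable; the tests above only require that the call does not raise.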
Author	SHA1	Message	Date
mwiegand	59771f91c4	fix(deploy): drop deleted l4d2host.fs from pyproject + use nproc --all Two bugs surfaced by the previous deploy attempt: 1. l4d2host/pyproject.toml still listed `l4d2host.fs` in the explicit packages= list. After deleting the fs/ package, pip install -e fails with "package directory './fs' does not exist". 2. The CPU-isolation deploy step uses `nproc` to detect host core count, but `nproc` honors Cpus_allowed of the calling shell. On a host that already has the cpuset drop-ins applied (system.slice/user.slice → AllowedCPUs=0), the SSH login lands constrained to one core and `nproc` returns 1 — making subsequent deploys think they're on a single-core box and skip the cpuset writes entirely. `nproc --all` reports installed processors regardless of affinity, which is what the deploy actually wants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 13:11:19 +02:00
mwiegand	ff6ce7b091	refactor(l4d2-host): unmount via ExecStopPost — single code path mirroring mount Symmetric with the earlier mount cleanup (commits 519567e..a982995). Until now, the unit's ExecStartPre handled mount but the Python side still drove unmount: stop_instance and _purge_instance both called _mounter.unmount, which wrapped sudo + the helper. Two code paths for two halves of the same lifecycle. Move unmount into the unit: - ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i (ExecStopPost, not ExecStop, so it runs after the cgroup is cleared; ExecStop runs while srcds is alive and would EBUSY the umount syscall.) - Helper's umount verb is now idempotent (mirrors mount): if merged isn't a mount point, return early. PRINT_ONLY mode bypasses both short-circuits so the unit tests still exercise the full nsenter argv. Drop the dead Python machinery: - _mounter.unmount(...) calls in stop_instance and _purge_instance - _mounter global + KernelOverlayFSMounter import - The whole l4d2host/fs/ package (OverlayMounter ABC + KernelOverlayFSMounter class) — no production callers, just self-tests - l4d2host/tests/test_kernel_overlayfs.py - test_stop_succeeds_when_unmount_fails / test_delete_succeeds_when_unmount_fails (tested Python-side unmount-failure tolerance that no longer exists) - The l4d2host.fs.kernel_overlayfs.run_command monkeypatches in lifecycle tests After this, the only thing start_instance does beyond cfg-staging is ask systemd to enable+start the unit. stop/delete/reset only ask systemd to disable; the overlay lifecycle lives entirely in the unit file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 13:09:52 +02:00
mwiegand	fc371711ec	fix(deploy): StartLimit* directives belong in [Unit], not [Service] systemd 230+ moved StartLimitBurst= and StartLimitIntervalSec= from [Service] into [Unit] (with the rename from StartLimitInterval=). Putting them in [Service] makes systemd silently ignore them with a warning to journalctl: "Unknown key 'StartLimitIntervalSec' in section [Service], ignoring." — meaning the restart-loop cap I claimed in commit `519567e` wasn't actually applied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:56:54 +02:00
mwiegand	a982995d5b	fix(deploy): ExecStartPre runs overlay helper with `+` prefix, not sudo The unit has NoNewPrivileges=true (security hardening for srcds), which blocks sudo's setuid escalation. The previous sudo'd ExecStartPre failed on every start with "sudo: the 'no new privileges' switch is set, which prevents sudo from running as root" -> Restart=on-failure loop. systemd's `+` prefix runs the Exec command as PID 1 (root, no sandbox), bypassing User=/Group=/NoNewPrivileges=. Equivalent privilege scope to the sudoers rule the web app already uses for the same helper, just without the sudo middleman. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:55:16 +02:00
mwiegand	56f5c30296	refactor(l4d2-host): unit's ExecStartPre is the sole code path to the mount Before this change there were two callers of left4me-overlay mount: the web app's start_instance (Python, in-process) and the unit's ExecStartPre (shell, via sudo). The duplication invited divergence; the helper's recently-added idempotency made both paths technically work but at the cost of a "first wins" race and dead-code retry logic in start_instance. Drop the in-process _mounter.mount() call from start_instance. The web app now only stages cfg files (which still must happen on the host filesystem before mount, to avoid overlayfs copy-up changing ownership), then asks systemd to enable+start the unit; the unit's ExecStartPre does the mount. Removed: - os.path.ismount(merged) refusal in start_instance and its test (test_start_refuses_to_double_mount). The race the check guarded against is now handled by the helper's idempotency. - _load_instance_env helper and the `os` import (both became dead). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:54:05 +02:00
mwiegand	3d9b7ef771	fix(deploy): WorkingDirectory= prefix `-` so ExecStartPre can mount the overlay systemd applies WorkingDirectory= to every Exec line including ExecStartPre. With the merged dir not yet existing at boot time (the volatile overlay mount has been wiped), the chdir into runtime/%i/merged/left4dead2 fails with status=200/CHDIR before ExecStartPre can run the mount helper. The `-` prefix makes chdir failure non-fatal: ExecStartPre runs in the unit's home (cwd doesn't matter for the mount helper); ExecStart re-applies WorkingDirectory once the mount has landed and chdirs successfully. Companion to commit `519567e` (which added the ExecStartPre mount + helper idempotency but didn't account for the WorkingDirectory ordering). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:51:58 +02:00
mwiegand	519567e156	fix(l4d2-host): mount overlay via ExecStartPre so enabled units boot cleanly The lifecycle change to systemctl enable --now (commit `8552c55`) made units auto-start at boot. But the kernel-overlayfs mount is volatile (reboot kills it), and the web app's start_instance only re-mounts in response to a UI click. Result: at boot, systemd starts the unit, finds empty merged/, CHDIR fails, Restart=on-failure spins forever (counter hit 65 on ckn before this fix landed). Fix: - Unit gets `ExecStartPre=/usr/bin/sudo -n .../left4me-overlay mount %i` so the overlay is established before the main process starts. - Helper is now idempotent: if merged is already a mount point, exit 0. Required because Restart=on-failure re-runs ExecStartPre on each cycle, and the web-app's start_instance also calls the helper, so both paths would otherwise collide on "already mounted". - StartLimitBurst=5 + StartLimitIntervalSec=60s caps the restart loop instead of letting it spin indefinitely on a fundamental failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:47:20 +02:00
mwiegand	b62fc08127	docs(specs): l4d2 cpu pinning — decision record (deferred) Investigated whether to hard-pin each srcds instance to a single core within the existing AllowedCPUs=1-7 set. Modern kernels (5.13+) no longer expose kernel.sched_migration_cost_ns or the other classic CFS "laziness" tunables, so a global cheap-fix is unavailable. Decision for now: trust CFS + Nice=-5 + AllowedCPUs=1-7. Per-instance CPUAffinity= remains an opt-in escape hatch in deploy/README.md. Documents the revisit triggers and the preferred implementation path when the time comes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:41:40 +02:00
mwiegand	67b5521eb6	feat(l4d2-web): periodic state poller refreshes Server.actual_state A background thread spawned alongside the job workers polls every server's status every STATE_POLLER_INTERVAL_SECONDS (default 30) and writes the result via the existing refresh_server_actual_state path. Servers with in-flight jobs (queued/running/cancelling) are skipped to avoid racing the post-job refresh. Catches reboot drift, OOM kills, manual systemctl operations, and any other out-of-band state change. Spec: docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:31:28 +02:00
mwiegand	8552c559d3	feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now Servers started via the web UI now create a WantedBy= symlink under multi-user.target.wants/, so they auto-start on the next host reboot. Helper verbs renamed start/stop -> enable/disable; service_control.py renamed start_service/stop_service -> enable_service/disable_service. The user-facing l4d2ctl start/stop commands keep their names per the AGENTS.md contract -- only the implementation changes. Spec: docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:28:44 +02:00
mwiegand	1dd674714a	docs(specs): perf baseline lifecycle — premise check on system vs user units Make explicit that the project uses system units (root systemctl, unit under /usr/local/lib/systemd/system/, WantedBy=multi-user.target), so `systemctl enable --now` is the correct verb to make instances survive a host reboot. User units have different lifecycle rules and would not auto-start at boot without enable-linger. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:25:34 +02:00
mwiegand	3b0bde9b50	docs(plans): l4d2 server lifecycle reboot-and-drift — implementation plan Two TDD tasks: helper+service_control verb rename, then poller code + wiring + tests. Operator-side smoke test in F.3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:21:59 +02:00
mwiegand	72cd7ca1ef	docs(specs): l4d2 server lifecycle reboot-and-drift — design Switch lifecycle verbs from systemctl start/stop to enable --now / disable --now (servers survive host reboot via WantedBy= symlinks), plus a periodic state poller for runtime drift (OOM kills, manual systemctl ops, exhausted Restart=on-failure). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:21:59 +02:00
mwiegand	20604dd79c	docs(deploy): document CPU isolation in performance-tuning section Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS / LEFT4ME_GAME_CPUS overrides, the single-core skip, and the subset-of relationship with per-instance CPUAffinity=. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 11:06:59 +02:00
mwiegand	af3171102a	feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest Computes NPROC at deploy time. Defaults LEFT4ME_SYSTEM_CPUS=0 and LEFT4ME_GAME_CPUS=1-(NPROC-1). Single-core hosts skip cpuset writes with a stderr warning unless an env var override is set. Spec: docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 11:06:34 +02:00
mwiegand	c91c029c38	docs(plans): l4d2 cpu isolation — implementation plan Two TDD tasks: deploy-script cpuset block + tests, README "CPU isolation" subsection. Operator-side smoke test in F.3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 11:03:37 +02:00
mwiegand	17b7c2ff10	docs(specs): l4d2 cpu isolation — design cgroup-v2 AllowedCPUs= drop-ins for system/user/build/game slices. Defaults: core 0 for everything-not-game, cores 1..N-1 for game, computed from nproc. LEFT4ME_SYSTEM_CPUS / LEFT4ME_GAME_CPUS overrides; single-core hosts skip with a warning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 11:03:37 +02:00
mwiegand	e5126c8c0b	docs(deploy): tighten perf-tuning escape hatches - RT example: add AmbientCapabilities=CAP_SYS_NICE so the User=left4me service can actually enter SCHED_FIFO on Trixie. - CPU governor: note that linux-cpupower may need apt install. - CPUAffinity=2: clarify that per-instance values typically increment. - NIC tuning: note that ethtool may need apt install.	2026-05-09 10:15:45 +02:00
mwiegand	9e0f6f17ef	docs(deploy): performance-tuning escape-hatch section in README Documents CPU governor, per-instance CPUAffinity, NIC tuning, and SCHED_FIFO opt-in patterns. None of these are auto-applied; they're ops-side knobs for measured problems the perf baseline doesn't solve.	2026-05-09 10:09:40 +02:00
mwiegand	928519fa34	feat(deploy): install slice + sysctl artifacts and apply via sysctl --system Copies l4d2-game.slice and l4d2-build.slice into /usr/local/lib/systemd/system/, installs 99-left4me.conf into /etc/sysctl.d/, and runs sysctl --system so the perf baseline is live this deploy, not on next reboot.	2026-05-09 10:05:41 +02:00
mwiegand	7e4a5691ed	feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500 Builds yield CPU/IO to game-server instances under contention via the slice's weight=10, and are killed first under memory pressure (servers have OOMScoreAdjust=-200).	2026-05-09 10:01:38 +02:00
mwiegand	b3fca4772c	feat(deploy): host sysctls for UDP buffers + netdev backlog/budget 99-left4me.conf: rmem_max/wmem_max=8M (with 512K defaults), netdev_max_backlog=5000, netdev_budget=600, vm.swappiness=10.	2026-05-09 09:53:07 +02:00
mwiegand	66d83a0282	docs(deploy): point slice files at perf baseline spec Matches the spec-pointer comment Task 1 added to left4me-server@.service. A future operator running `systemctl cat l4d2-game.slice` now finds the rationale.	2026-05-09 09:51:48 +02:00
mwiegand	ad7d73608e	feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio Flat top-level slices. Game wins under contention; build still gets the box when uncontended. Referenced by left4me-server@.service and the script-sandbox systemd-run invocation.	2026-05-09 09:48:41 +02:00
mwiegand	7193163488	feat(deploy): perf-baseline directives on left4me-server@.service Slice=l4d2-game.slice, Nice=-5, IOSchedulingClass=best-effort, OOMScoreAdjust=-200, MemoryHigh=1.5G, MemoryMax=2G, TasksMax=256, LimitNOFILE=65536, KillSignal=SIGINT, TimeoutStopSec=15s, LogRateLimitIntervalSec=0. Spec: docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md	2026-05-09 09:44:12 +02:00
mwiegand	851e6629aa	docs(plans): l4d2 server host perf baseline — implementation plan Six tasks (TDD, one commit each): unit directives, slice files, sysctl conf, sandbox slice + OOMScoreAdjust, deploy-script wiring, README escape-hatch section. Final verification step with full deploy + host + web pytest sweep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 09:39:12 +02:00
mwiegand	b6574e308b	docs(specs): perf baseline — fix transient-service phrasing The existing left4me-script-sandbox helper uses systemd-run in transient service mode (--unit=, no --scope). Spec wrongly said '--scope'. No semantic change — the design's --slice= and -p OOMScoreAdjust= guidance is identical for service vs scope mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 09:39:12 +02:00
mwiegand	db3b149045	docs(specs): l4d2 server host perf baseline — design Approach A: per-instance unit directives (Nice, OOM, Memory caps, KillSignal=SIGINT, log-rate disable), flat l4d2-game/l4d2-build slice hierarchy with 100:1 CPU/IO weight ratio, sandbox into build slice with OOMScoreAdjust=500, host sysctls for UDP buffers + netdev backlog/budget + vm.swappiness. SCHED_FIFO, CPU governor, CPUAffinity, NIC tuning are documented escape hatches, not auto-applied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 09:31:05 +02:00
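
Commits `af3171102a` and `59771f91c4` describe the cpuset defaults (core 0 for system, cores 1 to NPROC-1 for game, skip on single-core hosts) and explain why the core count must not be read through the calling shell's affinity mask. The deploy step itself is a shell script using `nproc --all`; the sketch below only illustrates the same distinction in Python, under stated assumptions. The function names are hypothetical, `os.cpu_count()` is used as a rough analogue of `nproc --all`, and `os.sched_getaffinity(0)` (Linux only) as a rough analogue of plain `nproc`. The `LEFT4ME_SYSTEM_CPUS` / `LEFT4ME_GAME_CPUS` overrides are not modelled.

```python
import os
from typing import Optional, Tuple


def default_cpusets(installed: int) -> Optional[Tuple[str, str]]:
    """Return (system_cpus, game_cpus), or None when there is nothing to split.

    Mirrors the defaults named in the commit message: core 0 for
    everything that is not a game server, cores 1..N-1 for game servers.
    """
    if installed < 2:
        # Single-core host: no split is possible, so the caller should skip
        # the cpuset writes.
        return None
    return "0", f"1-{installed - 1}"


def installed_cpus() -> int:
    # Logical CPUs in the system, independent of this process's affinity.
    return os.cpu_count() or 1


def usable_cpus() -> int:
    # CPUs this process may run on. After the cpuset drop-ins are applied,
    # a login shell confined to core 0 would see 1 here.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return installed_cpus()
```

For an 8-core host, `default_cpusets(8)` gives `("0", "1-7")`, matching the `AllowedCPUs=1-7` set mentioned in commit `b62fc08127`. Feeding `usable_cpus()` into it from a shell already pinned to one core would return `None`, which is the failure mode the `nproc --all` fix addresses.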