Compare commits

28 commits

Author SHA1 Message Date
mwiegand
59771f91c4
fix(deploy): drop deleted l4d2host.fs from pyproject + use nproc --all
Two bugs surfaced by the previous deploy attempt:

1. l4d2host/pyproject.toml still listed `l4d2host.fs` in the explicit
   packages= list. After deleting the fs/ package, pip install -e fails
   with "package directory './fs' does not exist".

2. The CPU-isolation deploy step uses `nproc` to detect host core count,
   but `nproc` honors Cpus_allowed of the calling shell. On a host that
   already has the cpuset drop-ins applied (system.slice/user.slice →
   AllowedCPUs=0), the SSH login lands constrained to one core and
   `nproc` returns 1 — making subsequent deploys think they're on a
   single-core box and skip the cpuset writes entirely. `nproc --all`
   reports installed processors regardless of affinity, which is what
   the deploy actually wants.
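
The affinity-versus-installed distinction in item 2 can be reproduced from Python. This is only an illustrative sketch, not part of the deploy script: `usable_cpus` and `installed_cpus` are hypothetical names, and `os.cpu_count()` is a rough stand-in for `nproc --all` (it reports the CPUs the OS sees rather than the exact value `nproc --all` prints).

```python
import os


def usable_cpus() -> int:
    """CPUs this process may run on (affinity-aware, like plain `nproc`)."""
    # Linux-only call; returns the set of CPU ids in the affinity mask.
    return len(os.sched_getaffinity(0))


def installed_cpus() -> int:
    """CPUs the OS reports, ignoring this process's affinity mask."""
    # Assumption: used here as an approximation of `nproc --all`.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"usable={usable_cpus()} installed={installed_cpus()}")
```

Run from a shell confined to `AllowedCPUs=0`, the first number would be expected to drop to 1 while the second stays at the host's core count.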

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:11:19 +02:00
mwiegand
ff6ce7b091
refactor(l4d2-host): unmount via ExecStopPost — single code path mirroring mount
Symmetric with the earlier mount cleanup (commits 519567e..a982995). Until
now, the unit's ExecStartPre handled mount but the Python side still drove
unmount: stop_instance and _purge_instance both called _mounter.unmount,
which wrapped sudo + the helper. Two code paths for two halves of the
same lifecycle.

Move unmount into the unit:

- ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i
  (ExecStopPost, not ExecStop, so it runs after the cgroup is cleared;
  ExecStop runs while srcds is alive and would EBUSY the umount syscall.)
- Helper's umount verb is now idempotent (mirrors mount): if merged
  isn't a mount point, return early. PRINT_ONLY mode bypasses both
  short-circuits so the unit tests still exercise the full nsenter argv.
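
The two short-circuits and the PRINT_ONLY bypass described above can be summarised in one small function. This is a sketch under stated assumptions: `is_noop` is a hypothetical name and does not exist in the helper; only the `LEFT4ME_OVERLAY_PRINT_ONLY` variable and the use of `os.path.ismount` come from the change itself.

```python
import os
import tempfile


def is_noop(verb: str, merged: str, environ=os.environ) -> bool:
    """Return True when the helper verb has nothing to do for `merged`."""
    # Test mode always exercises the full code path.
    if environ.get("LEFT4ME_OVERLAY_PRINT_ONLY") == "1":
        return False
    mounted = os.path.ismount(merged)
    if verb == "mount":
        return mounted        # already mounted: skip the mount
    if verb == "umount":
        return not mounted    # not mounted: skip the umount
    raise ValueError(f"unknown verb: {verb}")


if __name__ == "__main__":
    plain_dir = tempfile.mkdtemp()  # a fresh directory is not a mount point
    print(is_noop("mount", plain_dir), is_noop("umount", plain_dir))
```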

Drop the dead Python machinery:

- _mounter.unmount(...) calls in stop_instance and _purge_instance
- _mounter global + KernelOverlayFSMounter import
- The whole l4d2host/fs/ package (OverlayMounter ABC + KernelOverlayFSMounter
  class) — no production callers, just self-tests
- l4d2host/tests/test_kernel_overlayfs.py
- test_stop_succeeds_when_unmount_fails / test_delete_succeeds_when_unmount_fails
  (tested Python-side unmount-failure tolerance that no longer exists)
- The l4d2host.fs.kernel_overlayfs.run_command monkeypatches in lifecycle tests

After this, the only thing start_instance does beyond cfg-staging is ask
systemd to enable+start the unit. stop/delete/reset only ask systemd to
disable; the overlay lifecycle lives entirely in the unit file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:09:52 +02:00
mwiegand
fc371711ec
fix(deploy): StartLimit* directives belong in [Unit], not [Service]
systemd 230+ moved StartLimitBurst= and StartLimitIntervalSec= from
[Service] into [Unit] (with the rename from StartLimitInterval=). Putting
them in [Service] makes systemd silently ignore them with a warning to
journalctl: "Unknown key 'StartLimitIntervalSec' in section [Service],
ignoring." — meaning the restart-loop cap I claimed in commit 519567e
wasn't actually applied.
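
A small lint can catch this kind of misplacement before a deploy. The following is a minimal sketch, not code from this repository; `misplaced_start_limits` is a hypothetical helper that scans unit-file text line by line and reports any StartLimit* key found outside `[Unit]`.

```python
START_LIMIT_KEYS = {"StartLimitBurst", "StartLimitIntervalSec"}


def misplaced_start_limits(unit_text: str) -> list:
    """Return (section, key) pairs for StartLimit* keys outside [Unit]."""
    section = None
    misplaced = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key = line.split("=", 1)[0].strip()
        if key in START_LIMIT_KEYS and section != "Unit":
            misplaced.append((section, key))
    return misplaced
```

Calling it on the text of `left4me-server@.service` would be expected to return an empty list once the directives sit in `[Unit]`.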

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:56:54 +02:00
mwiegand
a982995d5b
fix(deploy): ExecStartPre runs overlay helper with + prefix, not sudo
The unit has NoNewPrivileges=true (security hardening for srcds), which
blocks sudo's setuid escalation. The previous sudo'd ExecStartPre failed
on every start with "sudo: the 'no new privileges' switch is set, which
prevents sudo from running as root" -> Restart=on-failure loop.

systemd's `+` prefix runs the Exec command as PID 1 (root, no sandbox),
bypassing User=/Group=/NoNewPrivileges=. Equivalent privilege scope to
the sudoers rule the web app already uses for the same helper, just
without the sudo middleman.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:55:16 +02:00
mwiegand
56f5c30296
refactor(l4d2-host): unit's ExecStartPre is the sole code path to the mount
Before this change there were two callers of left4me-overlay mount:
the web app's start_instance (Python, in-process) and the unit's
ExecStartPre (shell, via sudo). The duplication invited divergence; the
helper's recently-added idempotency made both paths technically work
but at the cost of a "first wins" race and dead-code retry logic in
start_instance.

Drop the in-process _mounter.mount() call from start_instance. The web
app now only stages cfg files (which still must happen on the host
filesystem before mount, to avoid overlayfs copy-up changing ownership),
then asks systemd to enable+start the unit; the unit's ExecStartPre
does the mount.

Removed:
- os.path.ismount(merged) refusal in start_instance and its test
  (test_start_refuses_to_double_mount). The race the check guarded
  against is now handled by the helper's idempotency.
- _load_instance_env helper and the `os` import (both became dead).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:54:05 +02:00
mwiegand
3d9b7ef771
fix(deploy): WorkingDirectory= prefix - so ExecStartPre can mount the overlay
systemd applies WorkingDirectory= to every Exec line including ExecStartPre.
With the merged dir not yet existing at boot time (the volatile overlay
mount has been wiped), the chdir into runtime/%i/merged/left4dead2 fails
with status=200/CHDIR before ExecStartPre can run the mount helper.

The `-` prefix makes chdir failure non-fatal: ExecStartPre runs in the
unit's home (cwd doesn't matter for the mount helper); ExecStart re-applies
WorkingDirectory once the mount has landed and chdirs successfully.

Companion to commit 519567e (which added the ExecStartPre mount + helper
idempotency but didn't account for the WorkingDirectory ordering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:51:58 +02:00
mwiegand
519567e156
fix(l4d2-host): mount overlay via ExecStartPre so enabled units boot cleanly
The lifecycle change to systemctl enable --now (commit 8552c55) made
units auto-start at boot. But the kernel-overlayfs mount is volatile
(reboot kills it), and the web app's start_instance only re-mounts in
response to a UI click. Result: at boot, systemd starts the unit, finds
empty merged/, CHDIR fails, Restart=on-failure spins forever (counter
hit 65 on ckn before this fix landed).

Fix:
- Unit gets `ExecStartPre=/usr/bin/sudo -n .../left4me-overlay mount %i`
  so the overlay is established before the main process starts.
- Helper is now idempotent: if merged is already a mount point, exit 0.
  Required because Restart=on-failure re-runs ExecStartPre on each
  cycle, and the web-app's start_instance also calls the helper, so
  both paths would otherwise collide on "already mounted".
- StartLimitBurst=5 + StartLimitIntervalSec=60s caps the restart loop
  instead of letting it spin indefinitely on a fundamental failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:47:20 +02:00
mwiegand
b62fc08127
docs(specs): l4d2 cpu pinning — decision record (deferred)
Investigated whether to hard-pin each srcds instance to a single core
within the existing AllowedCPUs=1-7 set. Modern kernels (5.13+) no
longer expose kernel.sched_migration_cost_ns or the other classic CFS
"laziness" tunables, so a global cheap-fix is unavailable. Decision
for now: trust CFS + Nice=-5 + AllowedCPUs=1-7. Per-instance
CPUAffinity= remains an opt-in escape hatch in deploy/README.md.
Documents the revisit triggers and the preferred implementation path
when the time comes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:41:40 +02:00
mwiegand
67b5521eb6
feat(l4d2-web): periodic state poller refreshes Server.actual_state
A background thread spawned alongside the job workers polls every
server's status every STATE_POLLER_INTERVAL_SECONDS (default 30) and
writes the result via the existing refresh_server_actual_state path.
Servers with in-flight jobs (queued/running/cancelling) are skipped to
avoid racing the post-job refresh. Catches reboot drift, OOM kills,
manual systemctl operations, and any other out-of-band state change.
Spec: docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
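
The skip-if-in-flight rule can be sketched as follows. All names here (`poll_once`, `run_poller`, `job_states_for`, `refresh`) are hypothetical and chosen for illustration; the real poller goes through refresh_server_actual_state and its own job model, which this page does not show.

```python
import threading

IN_FLIGHT = {"queued", "running", "cancelling"}


def poll_once(servers, job_states_for, refresh) -> list:
    """Refresh every server that has no in-flight job; return those refreshed."""
    refreshed = []
    for server in servers:
        if IN_FLIGHT & set(job_states_for(server)):
            continue  # the post-job refresh will handle this one
        refresh(server)
        refreshed.append(server)
    return refreshed


def run_poller(stop: threading.Event, interval: float,
               list_servers, job_states_for, refresh) -> None:
    """Loop until `stop` is set, polling once per interval."""
    while not stop.is_set():
        poll_once(list_servers(), job_states_for, refresh)
        stop.wait(interval)  # returns early when stop is set
```

Using `Event.wait` instead of `time.sleep` lets a shutdown interrupt the interval rather than waiting out the full 30 seconds.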

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:31:28 +02:00
mwiegand
8552c559d3
feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now
Servers started via the web UI now create a WantedBy= symlink under
multi-user.target.wants/, so they auto-start on the next host reboot.
Helper verbs renamed start/stop -> enable/disable; service_control.py
renamed start_service/stop_service -> enable_service/disable_service.
The user-facing l4d2ctl start/stop commands keep their names per the
AGENTS.md contract -- only the implementation changes. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
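
The rename keeps the user-facing verbs stable while the helper verbs change. A minimal sketch of that mapping, assuming a hypothetical `helper_argv` function; how service_control.py actually invokes the helper (for example through sudo) is not shown on this page.

```python
HELPER = "/usr/local/libexec/left4me/left4me-systemctl"

# User-facing l4d2ctl verb -> helper verb after the rename.
VERB_MAP = {"start": "enable", "stop": "disable"}


def helper_argv(user_verb: str, server_name: str) -> list:
    """Build the helper command line for a user-facing verb."""
    if user_verb not in VERB_MAP:
        raise ValueError(f"unsupported verb: {user_verb}")
    return [HELPER, VERB_MAP[user_verb], server_name]
```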

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:28:44 +02:00
mwiegand
1dd674714a
docs(specs): perf baseline lifecycle — premise check on system vs user units
Make explicit that the project uses system units (root systemctl, unit
under /usr/local/lib/systemd/system/, WantedBy=multi-user.target), so
`systemctl enable --now` is the correct verb to make instances survive
a host reboot. User units have different lifecycle rules and would not
auto-start at boot without enable-linger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:25:34 +02:00
mwiegand
3b0bde9b50
docs(plans): l4d2 server lifecycle reboot-and-drift — implementation plan
Two TDD tasks: helper+service_control verb rename, then poller code
+ wiring + tests. Operator-side smoke test in F.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:21:59 +02:00
mwiegand
72cd7ca1ef
docs(specs): l4d2 server lifecycle reboot-and-drift — design
Switch lifecycle verbs from systemctl start/stop to enable --now /
disable --now (servers survive host reboot via WantedBy= symlinks),
plus a periodic state poller for runtime drift (OOM kills, manual
systemctl ops, exhausted Restart=on-failure).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:21:59 +02:00
mwiegand
20604dd79c
docs(deploy): document CPU isolation in performance-tuning section
Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS /
LEFT4ME_GAME_CPUS overrides, the single-core skip, and the
subset-of relationship with per-instance CPUAffinity=.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:06:59 +02:00
mwiegand
af3171102a
feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest
Computes NPROC at deploy time. Defaults LEFT4ME_SYSTEM_CPUS=0 and
LEFT4ME_GAME_CPUS=1-(NPROC-1). Single-core hosts skip cpuset writes
with a stderr warning unless an env var override is set. Spec:
docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:06:34 +02:00
mwiegand
c91c029c38
docs(plans): l4d2 cpu isolation — implementation plan
Two TDD tasks: deploy-script cpuset block + tests, README
"CPU isolation" subsection. Operator-side smoke test in F.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:03:37 +02:00
mwiegand
17b7c2ff10
docs(specs): l4d2 cpu isolation — design
cgroup-v2 AllowedCPUs= drop-ins for system/user/build/game slices.
Defaults: core 0 for everything-not-game, cores 1..N-1 for game,
computed from nproc. LEFT4ME_SYSTEM_CPUS / LEFT4ME_GAME_CPUS
overrides; single-core hosts skip with a warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:03:37 +02:00
mwiegand
e5126c8c0b
docs(deploy): tighten perf-tuning escape hatches
- RT example: add AmbientCapabilities=CAP_SYS_NICE so the User=left4me
  service can actually enter SCHED_FIFO on Trixie.
- CPU governor: note that linux-cpupower may need apt install.
- CPUAffinity=2: clarify that per-instance values typically increment.
- NIC tuning: note that ethtool may need apt install.
2026-05-09 10:15:45 +02:00
mwiegand
9e0f6f17ef
docs(deploy): performance-tuning escape-hatch section in README
Documents CPU governor, per-instance CPUAffinity, NIC tuning, and
SCHED_FIFO opt-in patterns. None of these are auto-applied; they're
ops-side knobs for measured problems the perf baseline doesn't solve.
2026-05-09 10:09:40 +02:00
mwiegand
928519fa34
feat(deploy): install slice + sysctl artifacts and apply via sysctl --system
Copies l4d2-game.slice and l4d2-build.slice into
/usr/local/lib/systemd/system/, installs 99-left4me.conf into
/etc/sysctl.d/, and runs sysctl --system so the perf baseline is
live this deploy, not on next reboot.
2026-05-09 10:05:41 +02:00
mwiegand
7e4a5691ed
feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500
Builds yield CPU/IO to game-server instances under contention via the
slice's weight=10, and are killed first under memory pressure
(servers have OOMScoreAdjust=-200).
2026-05-09 10:01:38 +02:00
mwiegand
b3fca4772c
feat(deploy): host sysctls for UDP buffers + netdev backlog/budget
99-left4me.conf: rmem_max/wmem_max=8M (with 512K defaults),
netdev_max_backlog=5000, netdev_budget=600, vm.swappiness=10.
2026-05-09 09:53:07 +02:00
mwiegand
66d83a0282
docs(deploy): point slice files at perf baseline spec
Matches the spec-pointer comment Task 1 added to
left4me-server@.service. A future operator running
`systemctl cat l4d2-game.slice` now finds the rationale.
2026-05-09 09:51:48 +02:00
mwiegand
ad7d73608e
feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio
Flat top-level slices. Game wins under contention; build still gets
the box when uncontended. Referenced by left4me-server@.service and
the script-sandbox systemd-run invocation.
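
To make the ratio concrete: cgroup weights divide CPU time proportionally among siblings that are competing at that moment. The sketch below uses a hypothetical `contended_share` function and considers only the two left4me slices; it ignores other top-level siblings such as system.slice (default weight 100) and any cpuset isolation that keeps the slices on different cores.

```python
def contended_share(weight: int, sibling_weights: list) -> float:
    """Fraction of CPU a cgroup gets when it and all siblings are busy."""
    return weight / (weight + sum(sibling_weights))


if __name__ == "__main__":
    game = contended_share(1000, [10])
    build = contended_share(10, [1000])
    print(f"game={game:.4f} build={build:.4f}")
```

With weights 1000 and 10 the game slice gets about 99% under full contention, while an uncontended build still gets the whole box because weights only matter when both sides want CPU.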
2026-05-09 09:48:41 +02:00
mwiegand
7193163488
feat(deploy): perf-baseline directives on left4me-server@.service
Slice=l4d2-game.slice, Nice=-5, IOSchedulingClass=best-effort,
OOMScoreAdjust=-200, MemoryHigh=1.5G, MemoryMax=2G, TasksMax=256,
LimitNOFILE=65536, KillSignal=SIGINT, TimeoutStopSec=15s,
LogRateLimitIntervalSec=0. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
2026-05-09 09:44:12 +02:00
mwiegand
851e6629aa
docs(plans): l4d2 server host perf baseline — implementation plan
Six tasks (TDD, one commit each): unit directives, slice files,
sysctl conf, sandbox slice + OOMScoreAdjust, deploy-script wiring,
README escape-hatch section. Final verification step with full
deploy + host + web pytest sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 09:39:12 +02:00
mwiegand
b6574e308b
docs(specs): perf baseline — fix transient-service phrasing
The existing left4me-script-sandbox helper uses systemd-run in
transient service mode (--unit=, no --scope). Spec wrongly said
'--scope'. No semantic change — the design's --slice= and
-p OOMScoreAdjust= guidance is identical for service vs scope mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 09:39:12 +02:00
mwiegand
db3b149045
docs(specs): l4d2 server host perf baseline — design
Approach A: per-instance unit directives (Nice, OOM, Memory caps,
KillSignal=SIGINT, log-rate disable), flat l4d2-game/l4d2-build slice
hierarchy with 100:1 CPU/IO weight ratio, sandbox into build slice with
OOMScoreAdjust=500, host sysctls for UDP buffers + netdev backlog/budget
+ vm.swappiness. SCHED_FIFO, CPU governor, CPUAffinity, NIC tuning are
documented escape hatches, not auto-applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 09:31:05 +02:00
30 changed files with 2815 additions and 359 deletions

@ -71,3 +71,85 @@ The web app currently supports two overlay surfaces:
- `script` overlays — populated by an arbitrary user-authored bash script that runs inside `bubblewrap` + `systemd-run --scope` as the unprivileged `l4d2-sandbox` UID, with the overlay directory bind-mounted RW at `/overlay`. Resource caps: 1h walltime, 4 GB RAM, 512 tasks, 200% CPU, 20 GB post-build disk cap.
Both the caches and the overlay directories are owned by the `left4me` runtime user; if the web service ever runs as a different uid, ensure it shares a group with the host process and that both trees are group-readable.
## Performance Tuning
The deployment ships a host-side perf baseline (slices, unit directives, sysctls). See `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md` for design rationale.
The following knobs are documented escape hatches — they are **not** auto-applied. Apply only if you have measured a need and understand the failure modes.
### CPU governor
The performance governor squeezes a few percent off jitter under bursty load. `schedutil` is acceptable for sustained UDP workloads.
```sh
sudo cpupower frequency-set -g performance
```
Install via `sudo apt install linux-cpupower` if the binary isn't present.
Persist via your distro's CPU-frequency tooling (e.g. `/etc/default/cpufrequtils`).
### CPU isolation (cores)
The deploy script writes four `AllowedCPUs=` drop-ins so that, by default, only `l4d2-game.slice` is allowed to run on cores 1..N-1; `system.slice`, `user.slice`, and `l4d2-build.slice` are pinned to core 0. Game servers thus get the host minus core 0 exclusively, the build sandbox and the web app stay on core 0, and a logged-in admin running CPU-heavy work in their shell can't steal cycles from a live match.
Override the split by setting either env var when running the deploy:
```sh
LEFT4ME_SYSTEM_CPUS="0,1" LEFT4ME_GAME_CPUS="2-7" deploy/deploy-test-server.sh deploy-user@host
```
On single-core hosts the deploy skips the cpuset drop-ins entirely and prints a warning to stderr; the rest of the perf baseline (cgroup weights, sysctls, OOM scores) still applies. To force isolation on a single-core host anyway (rarely useful), set either env var explicitly.
Per-instance `CPUAffinity=` (next subsection) composes on top of this — the per-instance value must be a subset of `l4d2-game.slice`'s `AllowedCPUs=`, which the kernel enforces.
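The default split, the overrides, and the single-core skip described in this subsection can be sketched as a small function. This is an illustration only; `plan_cpusets` is a hypothetical name, and `None` stands in for an unset environment variable.

```python
from typing import Optional, Tuple


def plan_cpusets(nproc: int,
                 system_override: Optional[str] = None,
                 game_override: Optional[str] = None
                 ) -> Optional[Tuple[str, str]]:
    """Return (system_cpus, game_cpus), or None when isolation is skipped."""
    # Single-core host with no explicit override: write nothing.
    if nproc < 2 and system_override is None and game_override is None:
        return None
    # An unset or empty system override falls back to core 0.
    system_cpus = system_override or "0"
    # The game range defaults to every core except core 0.
    game_cpus = game_override if game_override is not None else f"1-{nproc - 1}"
    return system_cpus, game_cpus
```

For an 8-core host this yields `("0", "1-7")`. One consequence worth noting: forcing isolation on a single-core host by setting only the system variable leaves the game range at its computed default of `1-0`, so setting both variables is the safer way to force it.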
### Per-instance CPU affinity
`srcds` is single-threaded per instance. On a multi-core host, pinning each instance to its own core can cut jitter under contention. Drop in `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf`:
```ini
[Service]
CPUAffinity=2
```
This pins the instance to CPU 2 specifically; per-instance values would typically be 1, 2, 3, ... so each server has its own core.
A reasonable strategy on an N-core host: leave core 0 for the kernel + IRQs + system services, then pin one instance per remaining core.
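The one-instance-per-core strategy can be sketched as a generator of drop-in contents. `affinity_drop_ins` is a hypothetical helper, not something the deployment ships; it only computes paths and file text and writes nothing.

```python
def affinity_drop_ins(instance_names: list, ncores: int) -> dict:
    """Map each instance to its own core (1..ncores-1); return path -> content."""
    game_cores = list(range(1, ncores))  # core 0 stays with the system
    if len(instance_names) > len(game_cores):
        raise ValueError("more instances than free game cores")
    drop_ins = {}
    for name, core in zip(instance_names, game_cores):
        path = (f"/etc/systemd/system/left4me-server@{name}"
                ".service.d/affinity.conf")
        drop_ins[path] = f"[Service]\nCPUAffinity={core}\n"
    return drop_ins
```

The function refuses to double up instances on a core, since sharing a core would defeat the point of pinning.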
### NIC tuning
Hardware-specific (install via `sudo apt install ethtool` if not present). On a host with a single primary interface (replace `eth0`):
```sh
sudo ethtool -G eth0 rx 4096 tx 4096
sudo ethtool -K eth0 gro on lro off
```
If you run a high instance count, also pin the NIC's interrupts off the cores that game servers occupy (see `/proc/interrupts` and `/proc/irq/<n>/smp_affinity`).
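`/proc/irq/<n>/smp_affinity` takes a hexadecimal CPU bitmask, which is easy to get wrong by hand. A minimal sketch for computing it, assuming a host with at most 32 CPUs (larger masks are written as comma-separated 32-bit groups, which this hypothetical `smp_affinity_mask` helper does not handle):

```python
def smp_affinity_mask(cpus) -> str:
    """Return the hex bitmask that selects the given CPU ids."""
    mask = 0
    for cpu in cpus:
        if cpu < 0 or cpu > 31:
            raise ValueError(f"cpu id out of range for this sketch: {cpu}")
        mask |= 1 << cpu  # bit N selects CPU N
    return format(mask, "x")
```

To keep NIC interrupts on core 0 only, the mask is `1`; the game cores 1 through 7 on an 8-core host correspond to `fe`, which is the mask to avoid.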
### Real-time scheduling (advanced, opt-in)
Source-engine servers do not need real-time scheduling, and a misbehaving `srcds` at any RT priority can starve kernel threads, even though the default `kernel.sched_rt_runtime_us=950000` reserves 5% of each scheduling period for non-RT tasks. Use only if you have a measured jitter problem that the baseline does not solve.
`/etc/systemd/system/left4me-server@.service.d/realtime.conf`:
```ini
[Service]
CPUSchedulingPolicy=fifo
CPUSchedulingPriority=10
LimitRTPRIO=10
AmbientCapabilities=CAP_SYS_NICE
```
The `AmbientCapabilities=CAP_SYS_NICE` line is needed because the service runs as `User=left4me` with `NoNewPrivileges=true`; without it some kernels/systemd combinations refuse to apply the RT policy.
### Applying changes to running servers
Unit-file changes do not apply to already-running services. After any change:
```sh
sudo systemctl daemon-reload
# Restart each game server via the web UI's stop + start, or:
sudo systemctl restart 'left4me-server@*.service'
```

@ -136,6 +136,42 @@ $sudo_cmd chown -R left4me:left4me /opt/left4me
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-web.service /usr/local/lib/systemd/system/left4me-web.service
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service /usr/local/lib/systemd/system/left4me-server@.service
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice
# CPU isolation via cgroup-v2 AllowedCPUs= drop-ins. Pin everything that
# isn't a live game server to core 0; give game servers cores 1..N-1.
# See docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md.
# `nproc --all` reports installed processors regardless of the calling
# shell's CPU affinity. Plain `nproc` honors Cpus_allowed of the calling
# process, so on a host that already has the cpuset drop-ins applied
# (system.slice → AllowedCPUs=0), the SSH login lands in user.slice with
# AllowedCPUs=0 and `nproc` would return 1 — making subsequent deploys
# wrongly think they're on a single-core box and skip CPU isolation.
NPROC=$(nproc --all)
SYSTEM_CPUS=${LEFT4ME_SYSTEM_CPUS:-0}
if [ "${LEFT4ME_GAME_CPUS+x}" = x ]; then
GAME_CPUS=$LEFT4ME_GAME_CPUS
else
GAME_CPUS="1-$((NPROC - 1))"
fi
if [ "$NPROC" -lt 2 ] && [ "${LEFT4ME_SYSTEM_CPUS+x}${LEFT4ME_GAME_CPUS+x}" = "" ]; then
printf 'left4me deploy: skipping CPU isolation (nproc=%s); cpuset drop-ins not written.\n' "$NPROC" >&2
else
for slice_drop_in in \
/etc/systemd/system/system.slice.d/99-left4me-cpuset.conf \
/etc/systemd/system/user.slice.d/99-left4me-cpuset.conf \
/etc/systemd/system/l4d2-build.slice.d/99-left4me-cpuset.conf; do
$sudo_cmd mkdir -p "$(dirname "$slice_drop_in")"
printf '[Slice]\nAllowedCPUs=%s\n' "$SYSTEM_CPUS" \
| $sudo_cmd install -m 0644 -o root -g root /dev/stdin "$slice_drop_in"
done
$sudo_cmd mkdir -p /etc/systemd/system/l4d2-game.slice.d
printf '[Slice]\nAllowedCPUs=%s\n' "$GAME_CPUS" \
| $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf
fi
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-systemctl /usr/local/libexec/left4me/left4me-systemctl
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-journalctl /usr/local/libexec/left4me/left4me-journalctl
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-overlay /usr/local/libexec/left4me/left4me-overlay
@ -154,6 +190,13 @@ $sudo_cmd install -m 0644 -o root -g root \
/opt/left4me/deploy/files/etc/left4me/sandbox-resolv.conf \
/etc/left4me/sandbox-resolv.conf
# Host perf-baseline sysctls. Apply with `sysctl --system` so values
# take effect this deploy, not on next reboot.
$sudo_cmd install -m 0644 -o root -g root \
/opt/left4me/deploy/files/etc/sysctl.d/99-left4me.conf \
/etc/sysctl.d/99-left4me.conf
$sudo_cmd sysctl --system >/dev/null
# Stomp the file every deploy so newly added vars reach existing boxes.
# SECRET_KEY is derived from /etc/machine-id so it stays stable across
# redeploys (no session invalidation) without persisting state in /etc.

@ -0,0 +1,21 @@
# Host-side perf baseline for left4me — see
# docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
#
# UDP socket buffers: distro defaults of ~128 KiB are too small for sustained
# Source-engine UDP across multiple instances. 8 MiB matches the standard
# 1 Gbit recommendation; rmem_default/wmem_default protect sockets that don't
# explicitly enlarge their buffers.
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
# Kernel softirq UDP path: the per-CPU backlog queue starts dropping packets
# at the default 1000 under multi-instance burst; 5000 absorbs realistic peaks.
# netdev_budget = 600 gives softirq more drain headroom per pass.
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
# Latency-sensitive default: avoid swap unless the box is really under
# pressure. Harmless on swapless hosts.
vm.swappiness = 10

@ -0,0 +1,8 @@
# Perf baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
[Unit]
Description=left4me script-sandbox build slice
Before=slices.target
[Slice]
CPUWeight=10
IOWeight=10

@ -0,0 +1,8 @@
# Perf baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
[Unit]
Description=left4me game-server slice
Before=slices.target
[Slice]
CPUWeight=1000
IOWeight=1000

@ -2,6 +2,11 @@
Description=left4me server instance %i
After=network-online.target
Wants=network-online.target
# Bound the restart loop. Without these, a persistent ExecStartPre or
# ExecStart failure spins indefinitely. Note: these are [Unit]-section
# directives (systemd 230+), not [Service].
StartLimitBurst=5
StartLimitIntervalSec=60s
[Service]
Type=simple
@ -9,10 +14,45 @@ User=left4me
Group=left4me
EnvironmentFile=/etc/left4me/host.env
EnvironmentFile=/var/lib/left4me/instances/%i/instance.env
WorkingDirectory=/var/lib/left4me/runtime/%i/merged/left4dead2
# `-` prefix: chdir failure is non-fatal. systemd applies WorkingDirectory
# before every Exec line — including ExecStartPre — but the merged dir only
# exists once ExecStartPre's overlay mount succeeds. With `-`, ExecStartPre
# runs in the unit's home (cwd doesn't matter for the mount helper); the
# ExecStart re-applies WorkingDirectory after the mount and finds the dir.
WorkingDirectory=-/var/lib/left4me/runtime/%i/merged/left4dead2
# Single source of truth for the kernel-overlayfs mount lifecycle: the web
# app's start_instance only stages cfg files and asks systemd to enable+
# start this unit; the actual `mount -t overlay` lives here so reboot
# auto-start works the same as a UI-driven start. ExecStopPost mirrors it
# so the unmount lives in the same place — no Python-side _mounter needed
# in stop/delete/reset paths. Both helper verbs are idempotent.
#
# `+` prefix runs the helper as PID 1 (root, no sandbox). Required because
# the unit has NoNewPrivileges=true, which blocks sudo's setuid escalation
# — and the helper itself needs root to nsenter into PID 1's mnt namespace
# anyway. ExecStopPost (not ExecStop) so unmount runs after the cgroup is
# cleared; ExecStop runs while srcds is still alive and would EBUSY.
ExecStartPre=+/usr/local/libexec/left4me/left4me-overlay mount %i
ExecStart=/var/lib/left4me/installation/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i
Restart=on-failure
RestartSec=5
# Resource control baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0
# Hardening (unchanged from previous baseline).
NoNewPrivileges=true
PrivateTmp=true
PrivateDevices=true

@ -127,16 +127,30 @@ def exec_or_print(argv: list[str]) -> None:
def cmd_mount(name: str) -> None:
name = validate_name(name)
r = root()
runtime_name_dir = (r / "runtime" / name).resolve(strict=True)
merged_for_check = (runtime_name_dir / "merged").resolve(strict=True)
# Idempotency for unit restart cycles: if a previous start mounted
# successfully but ExecStart failed afterwards (and Restart=on-failure
# fires another cycle), the second ExecStartPre would otherwise refuse
# to mount-on-top. Short-circuit here so the second cycle just gets
# straight to ExecStart. PRINT_ONLY (test mode) bypasses this so the
# tests can exercise the full nsenter argv regardless of mount state.
if (
os.environ.get("LEFT4ME_OVERLAY_PRINT_ONLY") != "1"
and os.path.ismount(merged_for_check)
):
return
instance_env = r / "instances" / name / "instance.env"
raw_lowerdirs = parse_lowerdirs(instance_env)
allowed_roots = [(r / sub).resolve() for sub in LOWERDIR_ALLOWLIST]
canonical_lowerdirs = [str(canonical_under(allowed_roots, Path(p))) for p in raw_lowerdirs]
runtime_name_dir = (r / "runtime" / name).resolve(strict=True)
upper = (runtime_name_dir / "upper").resolve(strict=True)
work = (runtime_name_dir / "work").resolve(strict=True)
merged = (runtime_name_dir / "merged").resolve(strict=True)
merged = merged_for_check
for label, path in (("upper", upper), ("work", work), ("merged", merged)):
if path.parent != runtime_name_dir:
die(f"{label} resolved outside runtime/{name}: {path}")
@ -164,6 +178,18 @@ def cmd_umount(name: str) -> None:
merged = (runtime_name_dir / "merged").resolve(strict=True)
if merged.parent != runtime_name_dir:
die(f"merged resolved outside runtime/{name}: {merged}")
# Idempotency: if merged isn't a mount point right now, we have nothing
# to do. Mirrors cmd_mount's symmetric check. ExecStopPost on the unit
# is the one canonical caller, but a manual `systemctl reset-failed`
# cycle or a redundant cleanup pass should still be a no-op. PRINT_ONLY
# bypasses for the same reason as cmd_mount above.
if (
os.environ.get("LEFT4ME_OVERLAY_PRINT_ONLY") != "1"
and not os.path.ismount(merged)
):
return
argv = [
NSENTER,
"--mount=/proc/1/ns/mnt",

@ -45,6 +45,8 @@ chmod 0755 "$OVERLAY_DIR"
SCRIPT_RC=0
systemd-run --quiet --collect --wait --pipe \
--unit="left4me-script-${OVERLAY_ID}-$$" \
--slice=l4d2-build.slice \
-p OOMScoreAdjust=500 \
-p User=l4d2-sandbox -p Group=l4d2-sandbox \
-p UMask=0022 \
-p NoNewPrivileges=yes \

@ -2,7 +2,7 @@
 set -eu
 usage() {
-printf '%s\n' "usage: left4me-systemctl start|stop|show <server-name>" >&2
+printf '%s\n' "usage: left4me-systemctl enable|disable|show <server-name>" >&2
 exit 2
 }
@ -22,7 +22,7 @@ action=$1
 name=$2
 case "$action" in
-start|stop|show) ;;
+enable|disable|show) ;;
 *) usage ;;
 esac
@ -38,7 +38,7 @@ else
 fi
 case "$action" in
-start) exec "$systemctl" start "$unit" ;;
-stop) exec "$systemctl" stop "$unit" ;;
+enable) exec "$systemctl" enable --now "$unit" ;;
+disable) exec "$systemctl" disable --now "$unit" ;;
 show) exec "$systemctl" show --property=ActiveState --property=SubState "$unit" ;;
 esac

@ -9,6 +9,9 @@ DEPLOY = ROOT / "deploy"
WEB_UNIT = DEPLOY / "files/usr/local/lib/systemd/system/left4me-web.service"
SERVER_UNIT = DEPLOY / "files/usr/local/lib/systemd/system/left4me-server@.service"
GAME_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-game.slice"
BUILD_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-build.slice"
SYSCTL_CONF = DEPLOY / "files/etc/sysctl.d/99-left4me.conf"
GLOBAL_REFRESH_SERVICE = DEPLOY / "files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.service"
GLOBAL_REFRESH_TIMER = DEPLOY / "files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.timer"
SANDBOX_UNIT_DIR = DEPLOY / "files/usr/local/lib/systemd/system"
@ -60,7 +63,10 @@ def test_server_unit_contains_required_runtime_contract():
assert "Group=left4me" in unit
assert "EnvironmentFile=/etc/left4me/host.env" in unit
assert "EnvironmentFile=/var/lib/left4me/instances/%i/instance.env" in unit
assert "WorkingDirectory=/var/lib/left4me/runtime/%i/merged/left4dead2" in unit
# `-` prefix: chdir failure is non-fatal so ExecStartPre can run the
# mount helper before the merged dir exists. ExecStart re-applies and
# finds the dir once the mount has landed.
assert "WorkingDirectory=-/var/lib/left4me/runtime/%i/merged/left4dead2" in unit
assert "ExecStart=/var/lib/left4me/installation/srcds_run" in unit
assert "$L4D2_ARGS" in unit
assert "${L4D2_ARGS}" not in unit
@ -75,6 +81,176 @@ def test_server_unit_contains_required_runtime_contract():
assert "LockPersonality=true" in unit
def test_server_unit_mounts_overlay_via_exec_start_pre():
"""At boot, systemd auto-starts enabled units before the web app gets a
chance to run start_instance's pre-start mount. The unit itself must
re-mount the overlay so reboots are transparent. Pairs with the helper's
idempotency check (test_overlay_helper_mount_is_idempotent_when_already_mounted).
"""
unit = SERVER_UNIT.read_text()
# `+` prefix: ExecStartPre runs as PID 1 (root, no sandbox). Required
# because the unit has NoNewPrivileges=true, which blocks sudo's setuid
# escalation — and the helper needs root for nsenter anyway.
assert (
"ExecStartPre=+/usr/local/libexec/left4me/left4me-overlay mount %i"
in unit
)
# Bound the restart loop; without these, a CHDIR-failure (or any other
# pre-start error) spins indefinitely.
assert "StartLimitBurst=5" in unit
assert "StartLimitIntervalSec=60s" in unit
def test_server_unit_unmounts_overlay_via_exec_stop_post():
"""Single source of truth for unmount, mirroring the mount path.
ExecStopPost (not ExecStop) so it runs after srcds has fully exited
and the cgroup is cleared; otherwise the open files in merged/ would
EBUSY the umount syscall.
"""
unit = SERVER_UNIT.read_text()
assert (
"ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i"
in unit
)
def test_overlay_helper_mount_is_idempotent_when_already_mounted():
"""ExecStartPre runs on every Restart=on-failure cycle. If a previous
start mounted successfully but ExecStart failed afterwards, the next
ExecStartPre would re-mount on top -- which fails. The helper must
short-circuit when merged is already a mount point.
"""
text = OVERLAY_HELPER.read_text()
# Two ismount checks now: one in cmd_mount (skip if mounted),
# one in cmd_umount (skip if not mounted).
assert text.count("os.path.ismount") >= 2
def test_server_unit_contains_perf_baseline_directives():
unit = SERVER_UNIT.read_text()
# Slice membership.
assert "Slice=l4d2-game.slice" in unit
# CFS priority bump (no SCHED_FIFO).
assert "Nice=-5" in unit
assert "CPUSchedulingPolicy=" not in unit
# I/O priority.
assert "IOSchedulingClass=best-effort" in unit
assert "IOSchedulingPriority=4" in unit
# OOM ordering: game servers survive, sandbox dies first.
assert "OOMScoreAdjust=-200" in unit
# Memory caps with headroom for map-load spikes.
assert "MemoryHigh=1.5G" in unit
assert "MemoryMax=2G" in unit
# Bounded fork surface.
assert "TasksMax=256" in unit
# Plenty of fds for plugin-heavy setups.
assert "LimitNOFILE=65536" in unit
# srcds clean shutdown via SIGINT, with time to flush.
assert "KillSignal=SIGINT" in unit
assert "TimeoutStopSec=15s" in unit
# Per-unit override of journald rate limiting (default drops srcds output).
assert "LogRateLimitIntervalSec=0" in unit
def test_l4d2_game_slice_exists_with_high_weights():
assert GAME_SLICE.is_file()
text = GAME_SLICE.read_text()
assert "[Slice]" in text
assert "CPUWeight=1000" in text
assert "IOWeight=1000" in text
def test_l4d2_build_slice_exists_with_low_weights():
assert BUILD_SLICE.is_file()
text = BUILD_SLICE.read_text()
assert "[Slice]" in text
assert "CPUWeight=10" in text
assert "IOWeight=10" in text
def test_sysctl_conf_present_with_perf_settings():
assert SYSCTL_CONF.is_file()
text = SYSCTL_CONF.read_text()
for line in (
"net.core.rmem_max = 8388608",
"net.core.wmem_max = 8388608",
"net.core.rmem_default = 524288",
"net.core.wmem_default = 524288",
"net.core.netdev_max_backlog = 5000",
"net.core.netdev_budget = 600",
"vm.swappiness = 10",
):
assert line in text, f"missing {line!r} in 99-left4me.conf"
def test_script_sandbox_in_build_slice_with_oom_adjust():
text = SCRIPT_SANDBOX_HELPER.read_text()
# Put the transient unit in the low-weight build slice so it yields to
# game-server instances under CPU/IO contention.
assert "--slice=l4d2-build.slice" in text
# Sandbox dies first if the host hits memory pressure; servers
# (OOMScoreAdjust=-200) survive.
assert "-p OOMScoreAdjust=500" in text
def test_deploy_script_installs_perf_artifacts():
script = DEPLOY_SCRIPT.read_text()
# Slice files copied into the system-wide systemd unit dir.
assert "/usr/local/lib/systemd/system/l4d2-game.slice" in script
assert "/usr/local/lib/systemd/system/l4d2-build.slice" in script
# Sysctl drop-in installed under /etc/sysctl.d/.
assert "/etc/sysctl.d/99-left4me.conf" in script
# Values applied immediately, not on next boot.
assert "sysctl --system" in script
def test_deploy_script_writes_cpuset_drop_ins():
script = DEPLOY_SCRIPT.read_text()
# Reads nproc and binds defaults via ${VAR:-...}.
assert "nproc" in script
assert "LEFT4ME_SYSTEM_CPUS" in script
assert "LEFT4ME_GAME_CPUS" in script
assert "${LEFT4ME_SYSTEM_CPUS:-0}" in script
# Default game-core upper bound is computed from nproc; accept either
# the NPROC-1 form or LEFT4ME_GAME_CPUS:-1- prefix.
assert (
"1-$((NPROC - 1))" in script
or "1-$((NPROC-1))" in script
or "1-$((nproc-1))" in script
or "LEFT4ME_GAME_CPUS:-1-" in script
)
# All four drop-in paths.
for slice_name in ("system", "user", "l4d2-build", "l4d2-game"):
assert (
f"/etc/systemd/system/{slice_name}.slice.d/99-left4me-cpuset.conf"
in script
)
# Drop-ins use the existing install pattern.
assert "install -m 0644 -o root -g root" in script
# Single-core host: skip with a warning to stderr.
assert ("-lt 2" in script) or ("< 2" in script) or ("-ge 2" in script)
assert "skipping CPU isolation" in script
def _fake_command(tmp_path, command_name):
marker = tmp_path / f"{command_name}.args"
command = tmp_path / command_name
@@ -105,12 +281,16 @@ def test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args(tmp_pat
for args in [
["bad/action", "alpha"],
["start", ""],
["start", ".hidden"],
["start", "bad..name"],
["start", "bad/name"],
["start", "bad\\name"],
["start", "bad name"],
# `start` and `stop` are no longer accepted verbs — the lifecycle now
# uses `enable`/`disable` for reboot survival via WantedBy= symlinks.
["start", "alpha"],
["stop", "alpha"],
["enable", ""],
["enable", ".hidden"],
["enable", "bad..name"],
["enable", "bad/name"],
["enable", "bad\\name"],
["enable", "bad name"],
]:
result = subprocess.run(["sh", str(SYSTEMCTL_HELPER), *args], env=_env_with_fake_commands(tmp_path), check=False)
assert result.returncode != 0
@@ -118,8 +298,8 @@ def test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args(tmp_pat
script = SYSTEMCTL_HELPER.read_text()
assert 'unit="left4me-server@${name}.service"' in script
assert 'start) exec "$systemctl" start "$unit"' in script
assert 'stop) exec "$systemctl" stop "$unit"' in script
assert 'enable) exec "$systemctl" enable --now "$unit"' in script
assert 'disable) exec "$systemctl" disable --now "$unit"' in script
assert "--property=ActiveState" in script
assert "--property=SubState" in script

@@ -0,0 +1,260 @@
# L4D2 CPU Isolation Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Constrain every cgroup that isn't a live game server to core 0; give game servers cores 1..N-1 exclusively, scaled automatically across host sizes.
**Architecture:** Four `99-left4me-cpuset.conf` drop-ins under `/etc/systemd/system/{system,user,l4d2-build,l4d2-game}.slice.d/`, written by the deploy script from heredocs. `LEFT4ME_SYSTEM_CPUS` (default `0`) and `LEFT4ME_GAME_CPUS` (default `1-$((NPROC-1))`) are env-var overrides. Single-core hosts skip the cpuset writes with a warning.
**Tech Stack:** systemd cgroup-v2 `AllowedCPUs=` directive, bash heredoc + `install`, Linux `nproc(1)`, pytest text-assertion tests.
**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`
---
## File Structure
Files to modify:
- `deploy/deploy-test-server.sh` — compute `NPROC`, default `LEFT4ME_SYSTEM_CPUS=0` / `LEFT4ME_GAME_CPUS=1-$((NPROC-1))`, write four drop-in files. Skip when `nproc < 2` (with stderr warning) unless either env var is set explicitly.
- `deploy/README.md` — append a "CPU isolation" subsection inside the existing "Performance Tuning" section.
- `deploy/tests/test_deploy_artifacts.py` — new test functions.
No host library or web app changes.
---
## Pre-flight
- [ ] **Step 0a: Verify clean working tree**
Run: `git status`
Expected: `nothing to commit, working tree clean`
- [ ] **Step 0b: Verify the existing deploy tests are at the known-good baseline**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: 35 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`).
If the count differs, stop and surface — this plan assumes that exact baseline.
---
## Task 1: Deploy-script CPU-isolation block + tests
Write the four drop-ins from the deploy script in one cohesive block. The block computes `NPROC` once, resolves both env vars (with defaults), guards single-core hosts, and writes each drop-in via the existing `install -m 0644 -o root -g root` pattern. Tests cover defaults, overrides, single-core skip, and drop-in paths.
**Files:**
- Modify: `deploy/deploy-test-server.sh`
- Modify: `deploy/tests/test_deploy_artifacts.py` (new test function)
- [ ] **Step 1.1: Add the failing test**
Open `deploy/tests/test_deploy_artifacts.py` and append (after the `test_deploy_script_installs_perf_artifacts` from the perf-baseline branch):
```python
def test_deploy_script_writes_cpuset_drop_ins():
    script = DEPLOY_SCRIPT.read_text()
    # Reads nproc and binds defaults via ${VAR:-...}.
    assert "nproc" in script
    assert "LEFT4ME_SYSTEM_CPUS" in script
    assert "LEFT4ME_GAME_CPUS" in script
    assert "${LEFT4ME_SYSTEM_CPUS:-0}" in script
    # Default game-core expression: 1-(nproc-1). Match the form the
    # implementer chose; `1-$((NPROC - 1))`, `1-$((NPROC-1))` and
    # `1-$((nproc-1))` are all acceptable as long as the upper bound is
    # computed from nproc.
    assert (
        "1-$((NPROC - 1))" in script
        or "1-$((NPROC-1))" in script
        or "1-$((nproc-1))" in script
        or "LEFT4ME_GAME_CPUS:-1-" in script
    )
    # All four drop-in paths.
    for slice_name in ("system", "user", "l4d2-build", "l4d2-game"):
        assert f"/etc/systemd/system/{slice_name}.slice.d/99-left4me-cpuset.conf" in script
    # Drop-ins use the existing install pattern.
    assert "install -m 0644 -o root -g root" in script
    # Single-core host: skip with a warning to stderr.
    # Match either an explicit `nproc < 2` / `-lt 2` guard or `[ "$nproc" -ge 2 ]` form.
    assert ("nproc" in script) and (("-lt 2" in script) or ("-ge 2" in script) or ("< 2" in script))
    # Compare lowercase against lowercase; a mixed-case needle can never
    # match `script.lower()`.
    assert "skipping cpu isolation" in script.lower() or "skip cpu isolation" in script.lower()
```
- [ ] **Step 1.2: Run the new test, verify it fails**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_writes_cpuset_drop_ins -v`
Expected: FAIL — none of the new strings exist yet.
- [ ] **Step 1.3: Edit the deploy script — add the cpuset block**
Open `deploy/deploy-test-server.sh`. Find the block that copies the slice files (added in the perf-baseline branch, around lines 139-140):
```sh
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice
```
Immediately after that pair, before any of the helper-script copies that follow, insert this block:
```sh
# CPU isolation via cgroup-v2 AllowedCPUs= drop-ins. Pin everything that
# isn't a live game server to core 0; give game servers cores 1..N-1.
# See docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md.
# `nproc --all` counts installed processors; plain `nproc` honors the calling
# shell's CPU affinity and reports 1 once the cpuset drop-ins are active.
NPROC=$(nproc --all)
SYSTEM_CPUS=${LEFT4ME_SYSTEM_CPUS:-0}
if [ "${LEFT4ME_GAME_CPUS+x}" = x ]; then
    GAME_CPUS=$LEFT4ME_GAME_CPUS
else
    GAME_CPUS="1-$((NPROC - 1))"
fi
if [ "$NPROC" -lt 2 ] && [ -z "${LEFT4ME_SYSTEM_CPUS+x}${LEFT4ME_GAME_CPUS+x}" ]; then
    printf 'left4me deploy: skipping CPU isolation (nproc=%s); cpuset drop-ins not written.\n' "$NPROC" >&2
else
    for slice_name in system user l4d2-build; do
        $sudo_cmd mkdir -p "/etc/systemd/system/${slice_name}.slice.d"
        printf '[Slice]\nAllowedCPUs=%s\n' "$SYSTEM_CPUS" \
            | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
                "/etc/systemd/system/${slice_name}.slice.d/99-left4me-cpuset.conf"
    done
    $sudo_cmd mkdir -p "/etc/systemd/system/l4d2-game.slice.d"
    printf '[Slice]\nAllowedCPUs=%s\n' "$GAME_CPUS" \
        | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
            "/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf"
fi
```
Notes for the implementer:
- The single-core skip only triggers when **neither** override is set. If the operator sets either `LEFT4ME_SYSTEM_CPUS` or `LEFT4ME_GAME_CPUS` explicitly on a single-core host, honor their intent.
- `install -m 0644 -o root -g root /dev/stdin <dest>` is the idiomatic way to install a small generated file from a pipeline (matches the existing pattern for sandbox-resolv.conf, just with `/dev/stdin` as source).
- The `mkdir -p` for each `.d` directory is required: systemd reads drop-ins only from existing directories.
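The default and skip rules above can be restated as a small Python sketch for reasoning about edge cases. This is illustrative only and not part of the deploy script; the function name `resolve_cpu_split` and the dict-based `env` argument are assumptions made for the example.

```python
from typing import Mapping, Optional, Tuple


def resolve_cpu_split(nproc: int, env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (system_cpus, game_cpus), or None when isolation is skipped.

    Hypothetical helper mirroring the shell block; `env` stands in for the
    deploy environment.
    """
    system_set = "LEFT4ME_SYSTEM_CPUS" in env
    game_set = "LEFT4ME_GAME_CPUS" in env
    # The single-core skip applies only when neither override is present.
    if nproc < 2 and not (system_set or game_set):
        return None
    # ${LEFT4ME_SYSTEM_CPUS:-0}: unset or empty falls back to "0".
    system_cpus = env.get("LEFT4ME_SYSTEM_CPUS") or "0"
    # ${LEFT4ME_GAME_CPUS+x}: any set value, even empty, is taken verbatim.
    game_cpus = env["LEFT4ME_GAME_CPUS"] if game_set else f"1-{nproc - 1}"
    return system_cpus, game_cpus
```

For an 8-core host with no overrides this yields `("0", "1-7")`. The sketch also exposes one corner worth knowing about: on a single-core host with only `LEFT4ME_SYSTEM_CPUS` set, the default game range comes out as `1-0`, so an operator forcing isolation there probably wants to set both variables.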
- [ ] **Step 1.4: Verify shell syntax still parses**
Run: `sh -n deploy/deploy-test-server.sh`
Expected: exit 0, no output.
- [ ] **Step 1.5: Run the new test and full deploy test suite**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: 36 passed, 1 failed (the pre-existing unrelated test, count goes from 35→36 because of the new test).
If your specific assertion forms in Step 1.1 don't match the implementation, adjust the test — but only the `or` branches; do not weaken the contract.
- [ ] **Step 1.6: Commit**
```bash
git add deploy/deploy-test-server.sh deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest
Computes NPROC at deploy time. Defaults LEFT4ME_SYSTEM_CPUS=0 and
LEFT4ME_GAME_CPUS=1-(NPROC-1). Single-core hosts skip cpuset writes
with a stderr warning unless an env var override is set. Spec:
docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md
EOF
)"
```
---
## Task 2: README "CPU isolation" subsection
Append a subsection to `deploy/README.md` inside the existing "Performance Tuning" section, documenting the layout, the env-var overrides, the single-core skip, and the relationship to the existing per-instance `CPUAffinity=` escape hatch.
**Files:**
- Modify: `deploy/README.md`
No test for this task — README content is documentation, not contract.
- [ ] **Step 2.1: Append the CPU isolation subsection**
Open `deploy/README.md`. Find the existing `### Per-instance CPU affinity` subsection (added in the perf-baseline branch). Insert a new subsection **immediately before** it (so the slice-level isolation is documented before the per-instance refinement that builds on top). The new subsection content:
```markdown
### CPU isolation (cores)
The deploy script writes four `AllowedCPUs=` drop-ins so that, by default, only `l4d2-game.slice` is allowed to run on cores 1..N-1; `system.slice`, `user.slice`, and `l4d2-build.slice` are pinned to core 0. Game servers thus get the host minus core 0 exclusively, the build sandbox and the web app stay on core 0, and a logged-in admin running CPU-heavy work in their shell can't steal cycles from a live match.
Override the split by setting either env var when running the deploy:
```sh
LEFT4ME_SYSTEM_CPUS="0,1" LEFT4ME_GAME_CPUS="2-7" deploy/deploy-test-server.sh deploy-user@host
```
On single-core hosts the deploy skips the cpuset drop-ins entirely and prints a warning to stderr; the rest of the perf baseline (cgroup weights, sysctls, OOM scores) still applies. To force isolation on a single-core host anyway (rarely useful), set either env var explicitly.
Per-instance `CPUAffinity=` (next subsection) composes on top of this — the per-instance value must be a subset of `l4d2-game.slice`'s `AllowedCPUs=`, which the kernel enforces.
```
(The outer triple-backticks above are markdown punctuation around this prompt block, not part of the README content. Inner code-block fences DO need to be written into the README. The `markdown` language tag on the outer fence in this plan is documentation-only.)
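The subset relationship between a per-instance `CPUAffinity=` and the slice's `AllowedCPUs=` can be checked ahead of time with a short Python sketch. It assumes the CPU-list syntax used in this plan (comma- or whitespace-separated indices and `a-b` ranges); `parse_cpu_list` and `affinity_fits_slice` are hypothetical names, not part of the repository.

```python
from typing import Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Expand a CPU list such as "0,2-4" into {0, 2, 3, 4}."""
    cpus: Set[int] = set()
    # Treat commas and whitespace as equivalent separators.
    for part in spec.replace(",", " ").split():
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def affinity_fits_slice(cpu_affinity: str, allowed_cpus: str) -> bool:
    """True when every CPU in cpu_affinity is also in allowed_cpus."""
    return parse_cpu_list(cpu_affinity) <= parse_cpu_list(allowed_cpus)
```

With the default 8-core split, `affinity_fits_slice("2-3", "1-7")` holds, while an affinity of `"0"` falls outside the game slice.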
- [ ] **Step 2.2: Run the full deploy test suite**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: 36 passed, 1 failed (unchanged; README has no test).
- [ ] **Step 2.3: Commit**
```bash
git add deploy/README.md
git commit -m "$(cat <<'EOF'
docs(deploy): document CPU isolation in performance-tuning section
Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS /
LEFT4ME_GAME_CPUS overrides, the single-core skip, and the
subset-of relationship with per-instance CPUAffinity=.
EOF
)"
```
---
## Final Verification
- [ ] **Step F.1: Full deploy + host + web test sweep**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
Expected: deploy 36 passed / 1 failed (pre-existing); host 111 passed / 1 skipped; web 313 passed / 1 skipped.
- [ ] **Step F.2: Working tree clean and commits in order**
Run: `git status && git log --oneline -5`
Expected:
- `git status`: clean.
- Top of `git log`:
1. `docs(deploy): document CPU isolation in performance-tuning section`
2. `feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest`
3. `docs(plans): l4d2 cpu isolation — implementation plan`
4. `docs(specs): l4d2 cpu isolation — design`
- [ ] **Step F.3: Operator-side smoke test (deferred, not part of this plan)**
This plan ships artifacts. Confirming systemd actually enforces `AllowedCPUs=` on a real Trixie host is operator-side:
```sh
deploy/deploy-test-server.sh deploy-user@example-host
ssh deploy-user@example-host '
systemctl cat system.slice | grep AllowedCPUs
systemctl cat l4d2-game.slice | grep AllowedCPUs
cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
cat /sys/fs/cgroup/l4d2-game.slice/cpuset.cpus.effective
'
# Expect on an 8-core box:
# system.slice → AllowedCPUs=0 → cpuset.cpus.effective = 0
# l4d2-game.slice → AllowedCPUs=1-7 → cpuset.cpus.effective = 1-7
```
End-to-end behavioural test (manual, ops-side): on a 4-core host, run two L4D2 instances + a script-sandbox build simultaneously. Confirm via `htop` (with affinity column on) that the srcds processes only ever appear on cores 1, 2, 3 and the sandbox + web stay on core 0.
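The `cpuset.cpus.effective` comparison from the smoke test could be scripted roughly as follows. This is a sketch to run on the host itself; `effective_cpus` and `isolation_mismatches` are hypothetical helpers, and it assumes the slice's cgroup directory exists (a slice cgroup is typically only present while the slice has running units).

```python
from pathlib import Path
from typing import Dict, Tuple

CGROUP_ROOT = Path("/sys/fs/cgroup")


def effective_cpus(slice_name: str, cgroup_root: Path = CGROUP_ROOT) -> str:
    """Read the kernel's effective CPU list for one slice cgroup."""
    return (cgroup_root / slice_name / "cpuset.cpus.effective").read_text().strip()


def isolation_mismatches(
    expected: Dict[str, str], cgroup_root: Path = CGROUP_ROOT
) -> Dict[str, Tuple[str, str]]:
    """Return {slice: (expected, actual)} for every slice that differs."""
    mismatches = {}
    for slice_name, want in expected.items():
        got = effective_cpus(slice_name, cgroup_root)
        if got != want:
            mismatches[slice_name] = (want, got)
    return mismatches
```

On the 8-core example, `isolation_mismatches({"system.slice": "0", "l4d2-game.slice": "1-7"})` returning an empty dict corresponds to the expected output listed above.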
---
## Out of Scope (do NOT implement here)
- Kernel `isolcpus=` / `nohz_full=` / `rcu_nocbs=` boot params.
- NIC IRQ pinning automation.
- Per-instance `CPUAffinity=` driven by a deploy-env knob.
- A separate `l4d2-web.slice`.
- Any web-app or host-library code changes.
If you find yourself touching any of these, stop — they belong in a separate spec.

@@ -0,0 +1,686 @@
# L4D2 Server Host Perf Baseline Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Apply a host-side performance and resource-isolation baseline (systemd directives, slice hierarchy, host sysctls) to every L4D2 server instance, leaving game ConVars to the maintainer.
**Architecture:** Add resource-control directives to `left4me-server@.service`; introduce two flat top-level slices (`l4d2-game.slice` weight 1000, `l4d2-build.slice` weight 10) so the build sandbox is starved by the kernel under contention; ship `/etc/sysctl.d/99-left4me.conf` for UDP buffer and netdev tuning; place the script-sandbox transient unit into `l4d2-build.slice` with `OOMScoreAdjust=500`. RT scheduling, CPU governor, CPUAffinity, NIC tuning are documentation-only escape hatches.
**Tech Stack:** systemd unit files (service + slice), `systemd-run` properties, Linux sysctl, bash deploy script, pytest text-assertion tests under `deploy/tests/test_deploy_artifacts.py`.
**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`
---
## File Structure
Files to create:
- `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice` — high-weight slice for game-server instances.
- `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice` — low-weight slice for sandboxed script-overlay builds.
- `deploy/files/etc/sysctl.d/99-left4me.conf` — host UDP/netdev/swap sysctls.
Files to modify:
- `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` — add resource-control directives (`Slice`, `Nice`, `IOSchedulingClass`, `IOSchedulingPriority`, `OOMScoreAdjust`, `MemoryHigh`, `MemoryMax`, `TasksMax`, `LimitNOFILE`, `KillSignal`, `TimeoutStopSec`, `LogRateLimitIntervalSec`).
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` — add `--slice=l4d2-build.slice` and `-p OOMScoreAdjust=500` to the `systemd-run` invocation.
- `deploy/deploy-test-server.sh` — copy the two slice files and the sysctl conf during deploy; run `sysctl --system` so values take effect immediately.
- `deploy/README.md` — append a "Performance tuning" section with the four documented escape hatches.
- `deploy/tests/test_deploy_artifacts.py` — new tests for each artifact above (text assertions following the existing `assert "X" in text` style).
No application code (Python, Flask, host library) is touched.
---
## Pre-flight
- [ ] **Step 0a: Verify clean working tree**
Run: `git status`
Expected: `nothing to commit, working tree clean`
- [ ] **Step 0b: Verify the existing deploy tests pass**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: all green.
If any test is already red, stop and surface — this plan assumes the baseline is green.
---
## Task 1: Per-Instance Unit Resource-Control Directives
Add the per-instance baseline to `left4me-server@.service`. This task is self-contained even though `Slice=l4d2-game.slice` references a slice that doesn't exist yet — systemd does not validate the reference until the unit is actually started, and the deploy artifact tests are pure text checks.
**Files:**
- Modify: `deploy/files/usr/local/lib/systemd/system/left4me-server@.service`
- Test: `deploy/tests/test_deploy_artifacts.py` (new test function)
- [ ] **Step 1.1: Add the failing test**
Open `deploy/tests/test_deploy_artifacts.py` and append (after `test_server_unit_contains_required_runtime_contract`):
```python
def test_server_unit_contains_perf_baseline_directives():
    unit = SERVER_UNIT.read_text()
    # Slice membership.
    assert "Slice=l4d2-game.slice" in unit
    # CFS priority bump (no SCHED_FIFO).
    assert "Nice=-5" in unit
    assert "CPUSchedulingPolicy=" not in unit
    # I/O priority.
    assert "IOSchedulingClass=best-effort" in unit
    assert "IOSchedulingPriority=4" in unit
    # OOM ordering: game servers survive, sandbox dies first.
    assert "OOMScoreAdjust=-200" in unit
    # Memory caps with headroom for map-load spikes.
    assert "MemoryHigh=1.5G" in unit
    assert "MemoryMax=2G" in unit
    # Bounded fork surface.
    assert "TasksMax=256" in unit
    # Plenty of fds for plugin-heavy setups.
    assert "LimitNOFILE=65536" in unit
    # srcds clean shutdown via SIGINT, with time to flush.
    assert "KillSignal=SIGINT" in unit
    assert "TimeoutStopSec=15s" in unit
    # Per-unit override of journald rate limiting (default drops srcds output).
    assert "LogRateLimitIntervalSec=0" in unit
```
- [ ] **Step 1.2: Run the new test, verify it fails**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_server_unit_contains_perf_baseline_directives -v`
Expected: FAIL — first failing assert is on `Slice=l4d2-game.slice`.
- [ ] **Step 1.3: Edit the unit file**
Open `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` and replace its contents with:
```ini
[Unit]
Description=left4me server instance %i
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=left4me
Group=left4me
EnvironmentFile=/etc/left4me/host.env
EnvironmentFile=/var/lib/left4me/instances/%i/instance.env
WorkingDirectory=/var/lib/left4me/runtime/%i/merged/left4dead2
ExecStart=/var/lib/left4me/installation/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
Restart=on-failure
RestartSec=5
# Resource control baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0
# Hardening (unchanged from previous baseline).
NoNewPrivileges=true
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
ProtectSystem=strict
ReadOnlyPaths=/var/lib/left4me/installation /var/lib/left4me/overlays
ReadWritePaths=/var/lib/left4me/runtime/%i
RestrictSUIDSGID=true
LockPersonality=true
[Install]
WantedBy=multi-user.target
```
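To make the memory headroom concrete, the sketch below converts the `MemoryHigh=` and `MemoryMax=` values into bytes. It assumes the size suffixes are base 1024, which is how the systemd resource-control documentation describes them; `size_to_bytes` is a hypothetical helper for illustration, not a systemd API.

```python
# Base-1024 multipliers for the K/M/G/T suffixes (assumed per systemd docs).
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def size_to_bytes(value: str) -> int:
    """Convert a size such as "1.5G" or "2G" into a byte count."""
    suffix = value[-1].upper()
    if suffix in _UNITS:
        return int(float(value[:-1]) * _UNITS[suffix])
    # No suffix: the value is already a plain byte count.
    return int(value)


memory_high = size_to_bytes("1.5G")
memory_max = size_to_bytes("2G")
# Gap between the throttling threshold and the hard cap, in MiB.
headroom_mib = (memory_max - memory_high) // 1024**2
```

Under that assumption an instance is throttled from 1.5 GiB upward and has 512 MiB of further room before the hard cap at 2 GiB.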
- [ ] **Step 1.4: Run the new test, verify it passes**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_server_unit_contains_perf_baseline_directives -v`
Expected: PASS.
- [ ] **Step 1.5: Re-run the existing server-unit test, verify still passes**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_server_unit_contains_required_runtime_contract -v`
Expected: PASS — the existing assertions (`User=left4me`, `Group=left4me`, hardening directives, etc.) still match.
- [ ] **Step 1.6: Commit**
```bash
git add deploy/files/usr/local/lib/systemd/system/left4me-server@.service deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): perf-baseline directives on left4me-server@.service
Slice=l4d2-game.slice, Nice=-5, IOSchedulingClass=best-effort,
OOMScoreAdjust=-200, MemoryHigh=1.5G, MemoryMax=2G, TasksMax=256,
LimitNOFILE=65536, KillSignal=SIGINT, TimeoutStopSec=15s,
LogRateLimitIntervalSec=0. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
EOF
)"
```
---
## Task 2: Slice Unit Files
Create the two slice unit files. After this task the perf unit's `Slice=l4d2-game.slice` reference is satisfied.
**Files:**
- Create: `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice`
- Create: `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice`
- Test: `deploy/tests/test_deploy_artifacts.py` (new constants + new test functions)
- [ ] **Step 2.1: Add path constants and failing tests**
Open `deploy/tests/test_deploy_artifacts.py`. After the existing `SERVER_UNIT = ...` line, add:
```python
GAME_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-game.slice"
BUILD_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-build.slice"
```
After the new `test_server_unit_contains_perf_baseline_directives`, append:
```python
def test_l4d2_game_slice_exists_with_high_weights():
    assert GAME_SLICE.is_file()
    text = GAME_SLICE.read_text()
    assert "[Slice]" in text
    assert "CPUWeight=1000" in text
    assert "IOWeight=1000" in text


def test_l4d2_build_slice_exists_with_low_weights():
    assert BUILD_SLICE.is_file()
    text = BUILD_SLICE.read_text()
    assert "[Slice]" in text
    assert "CPUWeight=10" in text
    assert "IOWeight=10" in text
```
- [ ] **Step 2.2: Run the new tests, verify they fail**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_l4d2_game_slice_exists_with_high_weights deploy/tests/test_deploy_artifacts.py::test_l4d2_build_slice_exists_with_low_weights -v`
Expected: FAIL on `assert GAME_SLICE.is_file()` (file does not exist).
- [ ] **Step 2.3: Create the game slice file**
Create `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice` with:
```ini
[Unit]
Description=left4me game-server slice
Before=slices.target
[Slice]
CPUWeight=1000
IOWeight=1000
```
- [ ] **Step 2.4: Create the build slice file**
Create `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice` with:
```ini
[Unit]
Description=left4me script-sandbox build slice
Before=slices.target
[Slice]
CPUWeight=10
IOWeight=10
```
- [ ] **Step 2.5: Run the new tests, verify they pass**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_l4d2_game_slice_exists_with_high_weights deploy/tests/test_deploy_artifacts.py::test_l4d2_build_slice_exists_with_low_weights -v`
Expected: PASS.
- [ ] **Step 2.6: Commit**
```bash
git add deploy/files/usr/local/lib/systemd/system/l4d2-game.slice deploy/files/usr/local/lib/systemd/system/l4d2-build.slice deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio
Flat top-level slices. Game wins under contention; build still gets
the box when uncontended. Referenced by left4me-server@.service and
the script-sandbox systemd-run invocation.
EOF
)"
```
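The effect of the 100:1 weight ratio can be worked out with a few lines of Python. The sketch assumes the usual cgroup-v2 proportional model, where each busy sibling receives `weight / sum(weights)` of the contended resource, and that only the two slices named here are competing; `contended_shares` is a hypothetical name.

```python
from typing import Dict


def contended_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Fraction of a contended resource each busy sibling cgroup receives."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = contended_shares({"l4d2-game.slice": 1000, "l4d2-build.slice": 10})
game_percent = round(shares["l4d2-game.slice"] * 100, 2)
build_percent = round(shares["l4d2-build.slice"] * 100, 2)
```

With both slices saturated, the game slice gets about 99% and the build slice about 1%. Other busy top-level slices, such as `system.slice` at the default weight of 100, would take their proportional share too, so real numbers depend on what else is running.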
---
## Task 3: Host Sysctls
Ship a `/etc/sysctl.d/` drop-in for UDP buffers, netdev backlog, netdev budget, and `vm.swappiness`.
**Files:**
- Create: `deploy/files/etc/sysctl.d/99-left4me.conf`
- Test: `deploy/tests/test_deploy_artifacts.py` (new constant + new test function)
- [ ] **Step 3.1: Add path constant and failing test**
Open `deploy/tests/test_deploy_artifacts.py`. After the slice constants, add:
```python
SYSCTL_CONF = DEPLOY / "files/etc/sysctl.d/99-left4me.conf"
```
Append a new test:
```python
def test_sysctl_conf_present_with_perf_settings():
    assert SYSCTL_CONF.is_file()
    text = SYSCTL_CONF.read_text()
    for line in (
        "net.core.rmem_max = 8388608",
        "net.core.wmem_max = 8388608",
        "net.core.rmem_default = 524288",
        "net.core.wmem_default = 524288",
        "net.core.netdev_max_backlog = 5000",
        "net.core.netdev_budget = 600",
        "vm.swappiness = 10",
    ):
        assert line in text, f"missing {line!r} in 99-left4me.conf"
```
- [ ] **Step 3.2: Run the new test, verify it fails**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_sysctl_conf_present_with_perf_settings -v`
Expected: FAIL on `assert SYSCTL_CONF.is_file()`.
- [ ] **Step 3.3: Create the sysctl conf file**
Create `deploy/files/etc/sysctl.d/99-left4me.conf` with:
```
# Host-side perf baseline for left4me — see
# docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
#
# UDP socket buffers: distro defaults of ~128 KiB are too small for sustained
# Source-engine UDP across multiple instances. 8 MiB matches the standard
# 1 Gbit recommendation; rmem_default/wmem_default protect sockets that don't
# explicitly enlarge their buffers.
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
# Kernel softirq UDP path: the per-CPU backlog queue starts dropping packets
# at the default 1000 under multi-instance burst; 5000 absorbs realistic peaks.
# netdev_budget = 600 gives softirq more drain headroom per pass.
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
# Latency-sensitive default: avoid swap unless the box is really under
# pressure. Harmless on swapless hosts.
vm.swappiness = 10
```
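As a sanity check on the byte values, the following Python sketch parses a `key = value` sysctl file and converts the buffer sizes into binary units. `parse_sysctl_conf` is a hypothetical helper written for this example; it handles only the simple syntax used in this file (comments starting with `#` or `;`, one setting per line).

```python
from typing import Dict


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Parse `key = value` lines, skipping blanks and comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


sample = """
# UDP socket buffers
net.core.rmem_max = 8388608
net.core.rmem_default = 524288
vm.swappiness = 10
"""
settings = parse_sysctl_conf(sample)
rmem_max_mib = int(settings["net.core.rmem_max"]) // 1024**2
rmem_default_kib = int(settings["net.core.rmem_default"]) // 1024
```

The two values come out as 8 MiB and 512 KiB, matching the sizes quoted in the file's comments and the commit message.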
- [ ] **Step 3.4: Run the new test, verify it passes**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_sysctl_conf_present_with_perf_settings -v`
Expected: PASS.
- [ ] **Step 3.5: Commit**
```bash
git add deploy/files/etc/sysctl.d/99-left4me.conf deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): host sysctls for UDP buffers + netdev backlog/budget
99-left4me.conf: rmem_max/wmem_max=8M (with 512K defaults),
netdev_max_backlog=5000, netdev_budget=600, vm.swappiness=10.
EOF
)"
```
---
## Task 4: Sandbox in Build Slice
Place the script-sandbox transient unit into `l4d2-build.slice` and give it `OOMScoreAdjust=500` so it dies first under memory pressure.
**Files:**
- Modify: `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`
- Test: `deploy/tests/test_deploy_artifacts.py` (new test function)
- [ ] **Step 4.1: Add the failing test**
Open `deploy/tests/test_deploy_artifacts.py`. Append:
```python
def test_script_sandbox_in_build_slice_with_oom_adjust():
    text = SCRIPT_SANDBOX_HELPER.read_text()
    # Put the transient unit in the low-weight build slice so it yields to
    # game-server instances under CPU/IO contention.
    assert "--slice=l4d2-build.slice" in text
    # Sandbox dies first if the host hits memory pressure; servers
    # (OOMScoreAdjust=-200) survive.
    assert "-p OOMScoreAdjust=500" in text
```
- [ ] **Step 4.2: Run the new test, verify it fails**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_script_sandbox_in_build_slice_with_oom_adjust -v`
Expected: FAIL — neither string is in the helper yet.
- [ ] **Step 4.3: Edit the sandbox helper**
Open `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`. Locate the `systemd-run` invocation that begins with:
```
systemd-run --quiet --collect --wait --pipe \
    --unit="left4me-script-${OVERLAY_ID}-$$" \
```
Insert two new lines immediately after the `--unit=` line, before `-p User=l4d2-sandbox`. The block becomes:
```
systemd-run --quiet --collect --wait --pipe \
    --unit="left4me-script-${OVERLAY_ID}-$$" \
    --slice=l4d2-build.slice \
    -p OOMScoreAdjust=500 \
    -p User=l4d2-sandbox -p Group=l4d2-sandbox \
```
Leave every other `-p` line untouched.
- [ ] **Step 4.4: Verify shell syntax still parses**
Run: `bash -n deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`
Expected: exit 0, no output.
- [ ] **Step 4.5: Run the new test and the existing sandbox-helper tests, verify they pass**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_script_sandbox_in_build_slice_with_oom_adjust deploy/tests/test_deploy_artifacts.py::test_script_sandbox_helper_invokes_systemd_run_with_hardening deploy/tests/test_deploy_artifacts.py::test_script_sandbox_helper_passes_shell_syntax_check -v`
Expected: PASS for all three. The hardening test still matches because it only checks for substring presence; we added strings, didn't remove any.
- [ ] **Step 4.6: Commit**
```bash
git add deploy/files/usr/local/libexec/left4me/left4me-script-sandbox deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500
Builds yield CPU/IO to game-server instances under contention via the
slice's weight=10, and are killed first under memory pressure
(servers have OOMScoreAdjust=-200).
EOF
)"
```
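Step 4.5 notes that the existing hardening test only checks substring presence. If placement matters as well (the two new flags sitting between `--unit=` and `-p User=`), an order-aware check could look like the sketch below. It assumes the `systemd-run` call is written as backslash-continued lines, as in Step 4.3; both function names are hypothetical.

```python
def systemd_run_lines(script_text: str) -> list[str]:
    """Return the lines of the first backslash-continued systemd-run command."""
    collected: list[str] = []
    inside = False
    for line in script_text.splitlines():
        stripped = line.strip()
        if not inside:
            if not stripped.startswith("systemd-run"):
                continue
            inside = True
        # Drop the trailing continuation backslash, keep the flag text.
        collected.append(stripped.rstrip("\\").strip())
        if not stripped.endswith("\\"):
            break  # last line of the command
    return collected


def _index_of(lines: list[str], prefix: str) -> int:
    for i, entry in enumerate(lines):
        if entry.startswith(prefix):
            return i
    return -1


def new_flags_placed_as_planned(script_text: str) -> bool:
    lines = systemd_run_lines(script_text)
    unit = _index_of(lines, "--unit=")
    slice_ = _index_of(lines, "--slice=l4d2-build.slice")
    oom = _index_of(lines, "-p OOMScoreAdjust=500")
    user = _index_of(lines, "-p User=")
    # All four must exist and appear in this order.
    return -1 < unit < slice_ < oom < user
```

This is stricter than the plan requires; the substring assertions in Step 4.1 remain the contract.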
---
## Task 5: Deploy Script Installs Slice + Sysctl Artifacts
Wire the new artifacts into `deploy-test-server.sh` so a fresh deploy actually puts them on disk and applies the sysctls.
**Files:**
- Modify: `deploy/deploy-test-server.sh`
- Test: `deploy/tests/test_deploy_artifacts.py` (new test function)
- [ ] **Step 5.1: Add the failing test**
Open `deploy/tests/test_deploy_artifacts.py`. Append:
```python
def test_deploy_script_installs_perf_artifacts():
    script = DEPLOY_SCRIPT.read_text()
    # Slice files copied into the system-wide systemd unit dir.
    assert "/usr/local/lib/systemd/system/l4d2-game.slice" in script
    assert "/usr/local/lib/systemd/system/l4d2-build.slice" in script
    # Sysctl drop-in installed under /etc/sysctl.d/.
    assert "/etc/sysctl.d/99-left4me.conf" in script
    # Values applied immediately, not on next boot.
    assert "sysctl --system" in script
```
- [ ] **Step 5.2: Run the new test, verify it fails**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_installs_perf_artifacts -v`
Expected: FAIL on the first assertion.
- [ ] **Step 5.3: Edit the deploy script — copy the slice + sysctl files**
Open `deploy/deploy-test-server.sh`. Find the block that copies unit files (currently around line 138):
```sh
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-web.service /usr/local/lib/systemd/system/left4me-web.service
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service /usr/local/lib/systemd/system/left4me-server@.service
```
Add two new lines immediately after the `left4me-server@.service` copy line, so the block becomes:
```sh
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-web.service /usr/local/lib/systemd/system/left4me-web.service
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service /usr/local/lib/systemd/system/left4me-server@.service
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice
```
- [ ] **Step 5.4: Edit the deploy script — install the sysctl conf and apply it**
In `deploy/deploy-test-server.sh`, find the block that installs `/etc/left4me/sandbox-resolv.conf` (currently around lines 153-155):
```sh
$sudo_cmd install -m 0644 -o root -g root \
    /opt/left4me/deploy/files/etc/left4me/sandbox-resolv.conf \
    /etc/left4me/sandbox-resolv.conf
```
Immediately after that block, add:
```sh
# Host perf-baseline sysctls. Apply with `sysctl --system` so values
# take effect this deploy, not on next reboot.
$sudo_cmd install -m 0644 -o root -g root \
    /opt/left4me/deploy/files/etc/sysctl.d/99-left4me.conf \
    /etc/sysctl.d/99-left4me.conf
$sudo_cmd sysctl --system >/dev/null
```
- [ ] **Step 5.5: Verify the deploy script's shell syntax still parses**
Run: `sh -n deploy/deploy-test-server.sh`
Expected: exit 0, no output.
- [ ] **Step 5.6: Run the new test and the existing deploy-script tests, verify they pass**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_installs_perf_artifacts deploy/tests/test_deploy_artifacts.py::test_deploy_script_has_safe_defaults_and_preserves_state deploy/tests/test_deploy_artifacts.py::test_deploy_script_shell_syntax -v`
Expected: PASS for all three.
- [ ] **Step 5.7: Commit**
```bash
git add deploy/deploy-test-server.sh deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): install slice + sysctl artifacts and apply via sysctl --system
Copies l4d2-game.slice and l4d2-build.slice into
/usr/local/lib/systemd/system/, installs 99-left4me.conf into
/etc/sysctl.d/, and runs sysctl --system so the perf baseline is
live this deploy, not on next reboot.
EOF
)"
```
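Whether `sysctl --system` actually took effect can be confirmed by reading the live values back from `/proc/sys`, where each dotted key maps to a path. The sketch below assumes keys without dots inside a component (true for every key in `99-left4me.conf`); `sysctl_path`, `read_live_value`, and `drifted_keys` are hypothetical names.

```python
from pathlib import Path


def sysctl_path(key: str, proc_sys: Path = Path("/proc/sys")) -> Path:
    """Map a dotted sysctl key to its file, e.g. net.core.rmem_max."""
    return proc_sys.joinpath(*key.split("."))


def read_live_value(key: str, proc_sys: Path = Path("/proc/sys")) -> str:
    return sysctl_path(key, proc_sys).read_text().strip()


def drifted_keys(expected: dict[str, str],
                 proc_sys: Path = Path("/proc/sys")) -> list[str]:
    """Return keys whose live value differs from the expected one."""
    return [k for k, v in expected.items() if read_live_value(k, proc_sys) != v]
```

The `proc_sys` parameter exists only so the functions can be pointed at a temporary directory in tests.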
---
## Task 6: Performance-Tuning Section in deploy/README.md
Document the four escape hatches the spec lists as opt-in: CPU governor, per-instance `CPUAffinity`, NIC tuning, and SCHED_FIFO.
**Files:**
- Modify: `deploy/README.md`
No test for this task — README content is documentation, not contract.
- [ ] **Step 6.1: Append the Performance Tuning section**
Open `deploy/README.md`. Append (after the existing final paragraph) a new section:
````markdown
## Performance Tuning

The deployment ships a host-side perf baseline (slices, unit directives, sysctls). See `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md` for design rationale.

The following knobs are documented escape hatches; they are **not** auto-applied. Apply only if you have measured a need and understand the failure modes.

### CPU governor

The performance governor squeezes a few percent off jitter under bursty load. `schedutil` is acceptable for sustained UDP workloads.

```sh
sudo cpupower frequency-set -g performance
```

Persist via your distro's CPU-frequency tooling (e.g. `/etc/default/cpufrequtils`).

### Per-instance CPU affinity

`srcds` is single-threaded per instance. On a multi-core host, pinning each instance to its own core can cut jitter under contention. Drop in `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf`:

```ini
[Service]
CPUAffinity=2
```

A reasonable strategy on an N-core host: leave core 0 for the kernel + IRQs + system services, then pin one instance per remaining core.

### NIC tuning

Hardware-specific. On a host with a single primary interface (replace `eth0`):

```sh
sudo ethtool -G eth0 rx 4096 tx 4096
sudo ethtool -K eth0 gro on lro off
```

If you run a high instance count, also pin the NIC's interrupts off the cores that game servers occupy (see `/proc/interrupts` and `/proc/irq/<n>/smp_affinity`).

### Real-time scheduling (advanced, opt-in)

Source-engine servers do not need real-time scheduling, and a misbehaving `srcds` at any RT priority can starve kernel threads, even though the default `kernel.sched_rt_runtime_us=950000` reserves 5% of CPU time for non-RT tasks. Use only if you have a measured jitter problem that the baseline does not solve.

`/etc/systemd/system/left4me-server@.service.d/realtime.conf`:

```ini
[Service]
CPUSchedulingPolicy=fifo
CPUSchedulingPriority=10
LimitRTPRIO=10
```

### Applying changes to running servers

Unit-file changes do not apply to already-running services. After any change:

```sh
sudo systemctl daemon-reload
# Restart each game server via the web UI's stop + start, or:
sudo systemctl restart 'left4me-server@*.service'
```
````
- [ ] **Step 6.2: Run the full deploy test suite and verify it stays green**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: all green. README changes have no test, but should not break any existing tests.
- [ ] **Step 6.3: Commit**
```bash
git add deploy/README.md
git commit -m "$(cat <<'EOF'
docs(deploy): performance-tuning escape-hatch section in README
Documents CPU governor, per-instance CPUAffinity, NIC tuning, and
SCHED_FIFO opt-in patterns. None of these are auto-applied; they're
ops-side knobs for measured problems the perf baseline doesn't solve.
EOF
)"
```
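The per-instance affinity strategy described in the README section (core 0 left to the system, one instance per remaining core) can be sketched as a small generator of drop-in contents. This is an illustration only: `affinity_dropins` is a hypothetical function, and nothing in this plan generates these files automatically.

```python
def affinity_dropins(instance_names: list[str], core_count: int) -> dict[str, str]:
    """Map drop-in path -> file content, one core per instance, skipping core 0."""
    usable_cores = list(range(1, core_count))
    if len(instance_names) > len(usable_cores):
        raise ValueError(
            f"{len(instance_names)} instances do not fit on "
            f"{len(usable_cores)} non-system cores"
        )
    dropins: dict[str, str] = {}
    for name, core in zip(instance_names, usable_cores):
        path = f"/etc/systemd/system/left4me-server@{name}.service.d/affinity.conf"
        dropins[path] = f"[Service]\nCPUAffinity={core}\n"
    return dropins
```

Raising when instances outnumber the free cores is a design choice of the sketch; an operator might instead prefer to share cores between low-traffic instances.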
---
## Final Verification
- [ ] **Step F.1: Full deploy test suite green**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ -q`
Expected: all green.
- [ ] **Step F.2: Host library + web tests still green (regression check)**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2host/tests -q && pytest l4d2web/tests -q`
Expected: all green. Nothing in this plan touches host or web Python code, but a clean run rules out accidental import-time damage.
- [ ] **Step F.3: Working tree clean and commits in order**
Run: `git status && git log --oneline -8`
Expected:
- `git status`: `nothing to commit, working tree clean`.
- `git log`: six new commits in this order, top-most first:
1. `docs(deploy): performance-tuning escape-hatch section in README`
2. `feat(deploy): install slice + sysctl artifacts and apply via sysctl --system`
3. `feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500`
4. `feat(deploy): host sysctls for UDP buffers + netdev backlog/budget`
5. `feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio`
6. `feat(deploy): perf-baseline directives on left4me-server@.service`
If any step is missing or out of order, do not amend — diagnose, fix, and create new commits.
- [ ] **Step F.4: Manual deploy smoke test (deferred, ops-side)**
This plan ships artifacts. Confirming that systemd actually accepts and applies them on a real host requires running the deploy script against a test target. That validation is operator-side, not part of this implementation:
```sh
deploy/deploy-test-server.sh deploy-user@example-host
ssh deploy-user@example-host 'systemctl cat l4d2-game.slice'
ssh deploy-user@example-host 'sysctl net.core.rmem_max' # expect 8388608
ssh deploy-user@example-host 'systemd-analyze verify /usr/local/lib/systemd/system/left4me-server@.service'
```
Document any deploy-time problems back into the spec or this plan as v1.x corrections. Do not invent fixes that go beyond the spec.
---
## Out of Scope (do NOT implement here)
Listed in the spec — repeated for clarity:
- ConVars / blueprint arguments / tickrate / sv_minrate.
- SCHED_FIFO auto-apply.
- CPU governor auto-apply.
- Per-instance `CPUAffinity` auto-apply.
- NIC ring-buffer / IRQ-pinning code.
- Job-scheduler awareness ("don't build while server X has players").
- Hardening tightening (`ProtectKernelTunables=yes`, etc.).
If you find yourself touching any of these, stop — they belong in a separate spec.

@ -0,0 +1,584 @@
# L4D2 Server Lifecycle: Reboot-Safe + Drift Reconciliation Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Make L4D2 server instances survive a host reboot (Part A) and converge `Server.actual_state` to systemd reality every ~30s for out-of-band drift (Part B).
**Architecture:** Helper script + `service_control.py` switch from `systemctl start/stop` to `systemctl enable --now / disable --now`. A new background thread spawned with the job workers polls every server's status periodically and writes the result via the existing `refresh_server_actual_state()` path. Skip servers with in-flight jobs to avoid racing with the post-job refresh.
**Tech Stack:** bash helper script + sudoers; Python `subprocess` via `l4d2host.service_control.systemctl_command`; SQLAlchemy via `session_scope()`; threading; pytest.
**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md`
---
## File Structure
Files to modify (Part A — lifecycle verb change):
- `deploy/files/usr/local/libexec/left4me/left4me-systemctl` — accept verbs `enable`/`disable`/`show` (drop `start`/`stop`).
- `l4d2host/service_control.py` — rename `start_service` → `enable_service`, `stop_service` → `disable_service`. Action tokens become `"enable"` / `"disable"`.
- `l4d2host/instances.py` — call `enable_service` from `start_instance`; call `disable_service` from `stop_instance` and `_purge_instance`.
- `l4d2host/tests/test_lifecycle.py` — update mock-call expectations.
- `l4d2host/tests/test_service_control.py` — new file with direct unit tests for `enable_service` / `disable_service`.
- `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args` — update the verb assertions.
Files to modify (Part B — poller):
- `l4d2web/services/job_worker.py` — add `start_state_poller`, `state_poller_loop`, `poll_all_servers`.
- `l4d2web/app.py` — call `start_state_poller(app)` next to `start_job_workers(app)`.
- `l4d2web/config.py` — default `STATE_POLLER_INTERVAL_SECONDS = 30`.
- `l4d2web/tests/test_job_worker.py` — four new tests for the poller.
No host-library, web-app facade, or CLI surface signatures change. The `l4d2ctl start <name>` / `l4d2ctl stop <name>` commands keep their names (per `AGENTS.md`).
---
## Pre-flight
- [ ] **Step 0a: Verify clean working tree**
Run: `git status`
Expected: `nothing to commit, working tree clean`
- [ ] **Step 0b: Verify the existing test suite is at the known-good baseline**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
Expected: 460 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`), 2 skipped.
If the count differs, stop and surface — this plan assumes that exact baseline.
---
## Task 1: Part A — Switch lifecycle verbs to `enable --now` / `disable --now`
This task changes the helper script, the Python wrapper, and the instance lifecycle in one cohesive commit. The change is end-to-end vertical — splitting it across commits would leave broken intermediate states (helper accepting verbs that no caller uses, or callers using verbs the helper rejects).
**Files:**
- Modify: `deploy/files/usr/local/libexec/left4me/left4me-systemctl`
- Modify: `l4d2host/service_control.py`
- Modify: `l4d2host/instances.py`
- Modify: `l4d2host/tests/test_lifecycle.py`
- Create: `l4d2host/tests/test_service_control.py`
- Modify: `deploy/tests/test_deploy_artifacts.py`
### Step 1.1: Update the deploy artifact test for the helper
Open `deploy/tests/test_deploy_artifacts.py`. Find `test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`.
Replace the assertions that check the helper's case-statement bodies. Currently the test asserts something like:
```python
assert 'start) exec "$systemctl" start "$unit"' in script
assert 'stop) exec "$systemctl" stop "$unit"' in script
```
Update to:
```python
assert 'enable)' in script
assert 'enable --now' in script
assert 'disable)' in script
assert 'disable --now' in script
```
Keep the `--property=ActiveState` and `--property=SubState` assertions for the `show` action (unchanged).
The rejected-action examples list (currently includes things like `["bad/action", "alpha"]`) is unchanged — those are still bad. If the test currently asserts that `start` and `stop` are accepted (e.g., a positive case), drop those — `start`/`stop` are now rejected verbs, not accepted ones.
### Step 1.2: Run the updated artifact test to verify it fails
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args -v`
Expected: FAIL — the helper script still has `start)`/`stop)` cases, not `enable)`/`disable)`.
### Step 1.3: Edit the helper script
Open `deploy/files/usr/local/libexec/left4me/left4me-systemctl`. Find the case-statement (currently around lines 24-27). Replace:
```sh
case "$action" in
  start) exec "$systemctl" start "$unit" ;;
  stop) exec "$systemctl" stop "$unit" ;;
  show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
  *) ...
esac
```
with:
```sh
case "$action" in
  enable) exec "$systemctl" enable --now "$unit" ;;
  disable) exec "$systemctl" disable --now "$unit" ;;
  show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
  *) ...
esac
```
Keep the rest of the script (shebang, name validation, `*)` reject-and-exit branch) unchanged. The exact form of the `*)` reject case in the existing helper should be preserved.
### Step 1.4: Verify the helper script still parses
Run: `sh -n deploy/files/usr/local/libexec/left4me/left4me-systemctl`
Expected: exit 0, no output.
### Step 1.5: Run the artifact test, verify it passes
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args -v`
Expected: PASS.
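For reasoning about the helper's new contract without a shell, the verb dispatch from Step 1.3 can be modelled in a few lines of Python. This is a sketch of the case-statement only: `helper_argv` is a hypothetical name, the default `systemctl` path is an assumption, and the helper's name validation and the way it derives `$unit` are not shown in this plan and are therefore left out.

```python
def helper_argv(action: str, unit: str, systemctl: str = "systemctl") -> list[str]:
    """Model of the helper's case-statement: action -> systemctl argv."""
    if action == "enable":
        return [systemctl, "enable", "--now", unit]
    if action == "disable":
        return [systemctl, "disable", "--now", unit]
    if action == "show":
        return [systemctl, "show", unit,
                "--property=ActiveState", "--property=SubState"]
    # Everything else, including the old start/stop verbs, is rejected.
    raise ValueError(f"rejected action: {action!r}")
```

The point of the model is the last branch: after this change `start` and `stop` fall through to the reject case, which is why Step 1.1 drops any positive assertions for them.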
### Step 1.6: Update `service_control.py`
Open `l4d2host/service_control.py`. Replace:
```python
def start_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("start", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )


def stop_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("stop", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )
```
with:
```python
def enable_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("enable", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )


def disable_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("disable", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )
```
`show_service`, `stream_command`, `stream_journal`, and the `systemctl_command` / `journalctl_command` helpers are unchanged.
### Step 1.7: Update `instances.py` to call the new names
Open `l4d2host/instances.py`. Replace the import:
```python
from l4d2host.service_control import start_service, stop_service
```
with:
```python
from l4d2host.service_control import disable_service, enable_service
```
Inside `start_instance`, find the `start_service(...)` call (around line 137 in current source) and replace with `enable_service(...)`. Inside `stop_instance` (line 159) and `_purge_instance` (line 194), replace `stop_service(...)` with `disable_service(...)`. Keep all keyword arguments identical — only the function name changes.
### Step 1.8: Update `test_lifecycle.py`
Open `l4d2host/tests/test_lifecycle.py`. Search for every assertion that references the `start` or `stop` action token in mock-call expectations against `service_control.run_command` or `systemctl_command`. The tests typically look for argument lists like `["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "start", "<name>"]`.
Update each occurrence:
- `"start"` → `"enable"` (in the `start_instance` test paths)
- `"stop"` → `"disable"` (in the `stop_instance`, `delete_instance`, `reset_instance`, and `_purge_instance` test paths)
Some tests may import `start_service` / `stop_service` directly. Update those imports to `enable_service` / `disable_service`.
### Step 1.9: Create direct unit tests for `enable_service` / `disable_service`
Create `l4d2host/tests/test_service_control.py` with:
```python
from unittest.mock import patch

from l4d2host.service_control import (
    SYSTEMCTL_HELPER,
    disable_service,
    enable_service,
)


@patch("l4d2host.service_control.run_command")
def test_enable_service_invokes_helper_with_enable_action(mock_run):
    enable_service("instance-7")
    args, _ = mock_run.call_args
    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "enable", "instance-7"]


@patch("l4d2host.service_control.run_command")
def test_disable_service_invokes_helper_with_disable_action(mock_run):
    disable_service("instance-7")
    args, _ = mock_run.call_args
    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "disable", "instance-7"]
```
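The two tests above unpack `mock_run.call_args` and inspect `args[0]`. The standalone sketch below shows why that index holds the argv list. It assumes `systemctl_command` returns `["sudo", "-n", <helper>, <action>, <name>]`, which is inferred from the argument lists quoted in this plan rather than from the module source; the stand-in functions here are not the real `l4d2host` implementations.

```python
from unittest.mock import MagicMock

# Helper path as quoted in the Step 1.8 argument lists.
SYSTEMCTL_HELPER = "/usr/local/libexec/left4me/left4me-systemctl"


def systemctl_command(action: str, name: str) -> list[str]:
    # Stand-in with the assumed shape of the real helper function.
    return ["sudo", "-n", SYSTEMCTL_HELPER, action, name]


def enable_service(name: str, run_command, **kwargs):
    # The real wrapper uses a module-level run_command; it is injected here
    # so the example runs without the l4d2host package installed.
    return run_command(systemctl_command("enable", name), **kwargs)


mock_run = MagicMock()
enable_service("instance-7", mock_run, passthrough=False)

# call_args unpacks into (positional args tuple, keyword args dict).
args, kwargs = mock_run.call_args
argv = args[0]  # the command list is the first positional argument
```

Because the argv is passed positionally and the callbacks by keyword, renaming the wrapper functions changes nothing in `kwargs`; only the action token inside `argv` moves from `start`/`stop` to `enable`/`disable`.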
### Step 1.10: Run the host-library tests
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2host/tests -q`
Expected: all green (110 or 111 passing depending on whether `test_service_control.py` already existed; `+2` from the new direct tests).
If anything is red: fix the test expectations, not the implementation. The implementation matches the spec exactly. The most likely failure mode is a test in `test_lifecycle.py` that you missed updating; search for any remaining string literal `"start"` or `"stop"` in helper-arg-list contexts.
### Step 1.11: Run the deploy artifact test suite
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ -q`
Expected: 36 passed, 1 failed (the pre-existing unrelated test).
### Step 1.12: Commit
```bash
git add deploy/files/usr/local/libexec/left4me/left4me-systemctl \
l4d2host/service_control.py l4d2host/instances.py \
l4d2host/tests/test_lifecycle.py \
l4d2host/tests/test_service_control.py \
deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now
Servers started via the web UI now create a WantedBy= symlink under
multi-user.target.wants/, so they auto-start on the next host reboot.
Helper verbs renamed start/stop -> enable/disable; service_control.py
renamed start_service/stop_service -> enable_service/disable_service.
The user-facing l4d2ctl start/stop commands keep their names per the
AGENTS.md contract — only the implementation changes. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
EOF
)"
```
---
## Task 2: Part B — Periodic state poller
This task adds the poller code, wires it into the Flask startup, exposes its config knob, and tests four behaviors. One cohesive commit.
**Files:**
- Modify: `l4d2web/services/job_worker.py`
- Modify: `l4d2web/app.py`
- Modify: `l4d2web/config.py`
- Modify: `l4d2web/tests/test_job_worker.py`
### Step 2.1: Add the failing tests
Open `l4d2web/tests/test_job_worker.py`. Append after the existing tests:
```python
def test_state_poller_refreshes_each_server(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Server

        with session_scope() as db:
            db.add_all([
                Server(id=11, name="alpha", port=27015, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
                Server(id=12, name="beta", port=27016, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
            ])

    refreshed = []
    monkeypatch.setattr(jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid))

    with app.app_context():
        jw.poll_all_servers()

    assert sorted(refreshed) == [11, 12]


def test_state_poller_skips_servers_with_inflight_jobs(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Job, Server

        with session_scope() as db:
            db.add(Server(id=21, name="gamma", port=27017, blueprint_id=None,
                          desired_state="running", actual_state="running"))
            db.add(Job(server_id=21, operation="stop", state="running"))

    refreshed = []
    monkeypatch.setattr(jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid))

    with app.app_context():
        jw.poll_all_servers()

    assert refreshed == []


def test_state_poller_swallows_per_server_exceptions(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Server

        with session_scope() as db:
            db.add_all([
                Server(id=31, name="bad", port=27018, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
                Server(id=32, name="good", port=27019, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
            ])

    refreshed = []

    def fake_refresh(sid):
        if sid == 31:
            raise RuntimeError("simulated host failure")
        refreshed.append(sid)

    monkeypatch.setattr(jw, "refresh_server_actual_state", fake_refresh)

    with app.app_context():
        jw.poll_all_servers()  # must not raise

    assert refreshed == [32]


def test_state_poller_disabled_when_job_workers_disabled(monkeypatch):
    """create_app must not spawn the poller thread when JOB_WORKER_ENABLED=False."""
    import threading

    from l4d2web.app import create_app

    spawned = []
    real_thread_init = threading.Thread.__init__

    def tracking_init(self, *args, **kwargs):
        if kwargs.get("name") == "left4me-state-poller":
            spawned.append(True)
        real_thread_init(self, *args, **kwargs)

    monkeypatch.setattr(threading.Thread, "__init__", tracking_init)
    create_app({"TESTING": True, "JOB_WORKER_ENABLED": False})
    assert not spawned
```
(The tests assume the existing `app` fixture from `conftest.py`. If your project uses a different fixture name, adjust accordingly. The polling tests run `poll_all_servers()` synchronously to avoid testing the loop's `time.sleep`.)
### Step 2.2: Run the new tests, verify they fail
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server l4d2web/tests/test_job_worker.py::test_state_poller_skips_servers_with_inflight_jobs l4d2web/tests/test_job_worker.py::test_state_poller_swallows_per_server_exceptions l4d2web/tests/test_job_worker.py::test_state_poller_disabled_when_job_workers_disabled -v`
Expected: FAIL — `poll_all_servers` and `start_state_poller` don't exist yet.
### Step 2.3: Add the poller code to `job_worker.py`
Open `l4d2web/services/job_worker.py`. Add at the bottom of the file:
```python
def start_state_poller(app):
    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
    thread = threading.Thread(
        target=state_poller_loop,
        args=(app, interval),
        daemon=True,  # do not block interpreter shutdown
        name="left4me-state-poller",
    )
    thread.start()


def state_poller_loop(app, interval: float) -> None:
    while True:
        try:
            with app.app_context():
                poll_all_servers()
        except Exception:
            # A failed pass must not kill the poller thread; retry next tick.
            pass
        time.sleep(interval)


def poll_all_servers() -> None:
    with session_scope() as db:
        # Servers with a queued or running job are refreshed by that job.
        active_server_ids = set(db.scalars(
            select(Job.server_id).where(Job.state.in_(("queued", "running")))
        ).all())
        server_ids = [
            sid for sid in db.scalars(select(Server.id)).all()
            if sid not in active_server_ids
        ]
    for sid in server_ids:
        try:
            refresh_server_actual_state(sid)
        except Exception:
            # One unreachable server must not block the others.
            pass
```
`Server`, `Job`, `select`, `session_scope`, `threading`, `time`, and `refresh_server_actual_state` are already imported in this file. Verify by scanning the existing imports; if any are missing (unlikely for `select`/`Server`/`Job` since the worker uses them), add them.
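The selection rule inside `poll_all_servers` (skip any server that has a queued or running job, and keep going when one refresh fails) can be isolated from the database for reasoning or unit testing. The sketch below uses plain tuples instead of ORM rows; `servers_to_poll` and `poll` are hypothetical names, and returning the failures is an addition of the sketch, since the plan's version silently discards them.

```python
from typing import Callable, Iterable

IN_FLIGHT_STATES = ("queued", "running")


def servers_to_poll(server_ids: Iterable[int],
                    jobs: Iterable[tuple[int, str]]) -> list[int]:
    """jobs is an iterable of (server_id, state) pairs."""
    busy = {sid for sid, state in jobs if state in IN_FLIGHT_STATES}
    return [sid for sid in server_ids if sid not in busy]


def poll(server_ids: Iterable[int],
         jobs: Iterable[tuple[int, str]],
         refresh: Callable[[int], None]) -> list[tuple[int, Exception]]:
    """Refresh every eligible server; collect failures instead of raising."""
    failures: list[tuple[int, Exception]] = []
    for sid in servers_to_poll(server_ids, jobs):
        try:
            refresh(sid)
        except Exception as exc:  # one bad server must not stop the pass
            failures.append((sid, exc))
    return failures
```

Collecting failures rather than dropping them would make it straightforward to log them later, if the spec is ever extended in that direction.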
### Step 2.4: Wire the poller into `create_app`
Open `l4d2web/app.py`. Find the existing `start_job_workers(app)` call (around line 91, inside the `if should_start_workers:` block). Add `start_state_poller(app)` immediately after it:
```python
if should_start_workers:
    recover_stale_jobs()
    start_job_workers(app)
    start_state_poller(app)
```
Also update the import:
```python
from l4d2web.services.job_worker import (
    recover_stale_jobs,
    start_job_workers,
    start_state_poller,
)
```
(If the existing import is single-line `from ... import recover_stale_jobs, start_job_workers`, just add `start_state_poller` to the list.)
### Step 2.5: Add the config default
Open `l4d2web/config.py`. Find the dict literal that contains other defaults like `JOB_WORKER_THREADS`, `PORT_RANGE_START`, etc. Add:
```python
"STATE_POLLER_INTERVAL_SECONDS": 30,
```
In the env-var-loading section (where `LEFT4ME_PORT_RANGE_START` etc. are read), add:
```python
"STATE_POLLER_INTERVAL_SECONDS": float(os.getenv("LEFT4ME_STATE_POLLER_INTERVAL_SECONDS", "30")),
```
### Step 2.6: Run the four new tests, verify they pass
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server l4d2web/tests/test_job_worker.py::test_state_poller_skips_servers_with_inflight_jobs l4d2web/tests/test_job_worker.py::test_state_poller_swallows_per_server_exceptions l4d2web/tests/test_job_worker.py::test_state_poller_disabled_when_job_workers_disabled -v`
Expected: PASS for all four.
### Step 2.7: Run the full web test suite
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests -q`
Expected: 317 passed, 1 skipped (313 + 4 new tests).
### Step 2.8: Commit
```bash
git add l4d2web/services/job_worker.py l4d2web/app.py l4d2web/config.py l4d2web/tests/test_job_worker.py
git commit -m "$(cat <<'EOF'
feat(l4d2-web): periodic state poller refreshes Server.actual_state
A background thread spawned alongside the job workers polls every
server's status every STATE_POLLER_INTERVAL_SECONDS (default 30) and
writes the result via the existing refresh_server_actual_state path.
Servers with in-flight jobs are skipped to avoid racing the post-job
refresh. Catches reboot drift, OOM kills, manual systemctl operations,
and any other out-of-band state change. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
EOF
)"
```
---
## Final Verification
- [ ] **Step F.1: Full test sweep**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
Expected: ~466 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`), 2 skipped.
- [ ] **Step F.2: Working tree clean and commit shape**
Run: `git status && git log --oneline -5`
Expected:
- `git status`: clean.
- Top of `git log`:
1. `feat(l4d2-web): periodic state poller refreshes Server.actual_state`
2. `feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now`
3. `docs(plans): l4d2 server lifecycle reboot-and-drift — implementation plan`
4. `docs(specs): l4d2 server lifecycle reboot-and-drift — design`
- [ ] **Step F.3: Operator-side smoke test (deferred, not part of this plan)**
End-to-end on `ckn@10.0.4.128` after deploy:
```sh
deploy/deploy-test-server.sh ckn@10.0.4.128
# Confirm the helper now drives enable/disable
ssh ckn@10.0.4.128 'grep -E "enable|disable" /usr/local/libexec/left4me/left4me-systemctl'
# expect: enable) exec "$systemctl" enable --now "$unit"
# disable) exec "$systemctl" disable --now "$unit"
# Click "start" in the web UI for a server. Then:
ssh ckn@10.0.4.128 'systemctl is-enabled left4me-server@1.service'
# expect: enabled
# Reboot the host:
ssh ckn@10.0.4.128 'sudo systemctl reboot'
# wait for it to come back, then:
ssh ckn@10.0.4.128 'systemctl is-active left4me-server@1.service && pgrep -fa srcds'
# expect: active, srcds running with no UI intervention
# Confirm the poller corrects out-of-band drift
ssh ckn@10.0.4.128 'sudo systemctl disable --now left4me-server@1.service'
# Within ~30s the web UI's actual_state for server 1 flips from "running" to "stopped".
ssh ckn@10.0.4.128 'sudo -u left4me /opt/left4me/.venv/bin/python -c "
import sqlite3
c = sqlite3.connect(\"/var/lib/left4me/left4me.db\")
print(c.execute(\"SELECT id, actual_state, actual_state_updated_at FROM servers WHERE id=1\").fetchone())
"'
# expect: actual_state='stopped' with a fresh updated_at.
```
---
## Out of Scope (do NOT implement here)
- Auto-restart on `desired_state=running && actual_state=stopped`.
- UI banners for stale-state warnings.
- Reconciliation of orphan systemd units.
- Per-server poll intervals.
- Replacing `Restart=on-failure`.
- Touching the pre-existing red test (`test_deploy_script_has_safe_defaults_and_preserves_state`).
If you find yourself touching any of these, stop — they belong in a separate spec.


@ -0,0 +1,131 @@
# l4d2 cpu isolation — design
Date: 2026-05-09
Status: design
## Summary
Constrain every cgroup that isn't a live game server to core 0; give game servers cores 1..N-1 exclusively. Implementation is systemd cgroup-v2 `AllowedCPUs=` drop-ins, computed at deploy time from `nproc`, overridable via env vars. Lands on top of the perf baseline shipped in `851e662..e5126c8`.
## Goals
- A logged-in admin doing CPU-heavy work, the script-build sandbox, and the Flask web app cannot steal cycles from a live match.
- Layout scales automatically across host sizes (4-core, 8-core, 16-core) without per-host edits.
- Operator can override the default `0` / `1..N-1` split for NUMA boxes or hyperthread quirks.
- Single-core hosts degrade gracefully: skip CPU isolation, keep the rest of the perf baseline.
## Non-goals
- Kernel `isolcpus=` / `nohz_full=` / `rcu_nocbs=` boot parameters. True core isolation (eviction of softirqs, RCU, timer ticks) requires GRUB edits + reboot + per-host tuning. cgroup cpuset is sufficient for L4D2 tickrates; document as a future opt-in if measurement justifies it.
- NIC IRQ pinning. Hardware-specific; already documented as an escape hatch in `deploy/README.md`.
- Per-instance pinning *within* the game-core set. The slice-level cpuset is the floor; the existing per-instance `CPUAffinity=` drop-in escape hatch (already in `deploy/README.md`) composes on top — the kernel enforces "per-instance value must be a subset of slice's allowed set."
- A separate `l4d2-web.slice`. The web app is light; living in `system.slice` on core 0 is fine.
- Web-app or host-library code changes. Pure deploy-side artifact work.
## Background
The perf baseline (commit range `851e662..e5126c8`) introduced two slices (`l4d2-game.slice` weight 1000, `l4d2-build.slice` weight 10), per-instance unit directives (Nice, OOM, memory caps), and host sysctls. None of those constrain *which* CPUs cgroups run on. Under the kernel CFS, every task can move to any core; the build sandbox, ssh sessions, the web app, and game servers all compete for the same cores.
## Design
### Topology
```
                    core 0           cores 1..N-1
                    ─────────        ────────────
system.slice        AllowedCPUs=0
user.slice          AllowedCPUs=0
l4d2-build.slice    AllowedCPUs=0
l4d2-game.slice                      AllowedCPUs=1-(N-1)
```
Everything that isn't a live game server (Flask web app, ssh sessions, journald, script-sandbox builds, cron, systemd housekeeping) is funneled to core 0. Game servers get cores 1..N-1 exclusively.
### Why slice-level `AllowedCPUs=`, not per-instance `CPUAffinity=`
- **Hierarchy does the work for free.** A cpuset on `l4d2-game.slice` propagates to every `left4me-server@*.service` automatically. No per-instance drop-ins to manage; no logic in the web app to pick cores.
- **Hot-applied.** cgroup-v2 cpuset changes apply to running cgroups; existing servers move next time the kernel schedules them. No need to restart instances after a deploy.
- **Composable.** A future operator who wants per-instance pinning *within* the game cores adds `CPUAffinity=N` via `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf` (already documented). The slice constraint and per-instance pin compose; the kernel enforces subset-of.
### Why drop-ins, not edits to the existing `.slice` files
The two slice files we ship today (`l4d2-game.slice`, `l4d2-build.slice`) are static text and host-portable. `AllowedCPUs=1-7` is true on an 8-core host and wrong on a 4-core host. Drop-ins under `<unit>.d/*.conf` are the standard systemd pattern for host-specific overrides. We already use `99-` prefixing for the sysctl drop-in so it lex-orders last; reuse that.
### Operator override
Two env vars consumed by the deploy script:
- `LEFT4ME_SYSTEM_CPUS` — defaults to `0`. Goes into `system.slice`, `user.slice`, `l4d2-build.slice` drop-ins.
- `LEFT4ME_GAME_CPUS` — defaults to `1-$((NPROC-1))`. Goes into `l4d2-game.slice` drop-in.
Operators with NUMA boxes, hyperthread quirks, or "I want core 0 *and* core 1 for system" set the vars explicitly. Defaults handle the typical case.
### Single-core fallback
If `nproc < 2`, skip CPU isolation entirely (write no drop-ins). Print a warning to stderr explaining the deploy is leaving cpuset unset. The rest of the perf baseline still applies (weights, sysctls, OOM scores).
If `LEFT4ME_GAME_CPUS` or `LEFT4ME_SYSTEM_CPUS` is set explicitly on a single-core host, honor the operator's intent (they presumably know what they're doing) and write the drop-ins anyway instead of skipping them.
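
The default, override, and single-core rules above can be sketched as one small pure function. The deploy script itself is shell; this Python version is only an illustration, and `resolve_cpu_layout` and its return shape are hypothetical. The spec does not say what the unset variable should fall back to when only one override is given on a single-core host, so the sketch simply applies the usual defaults there.

```python
def resolve_cpu_layout(nproc, env):
    """Return (system_cpus, game_cpus), or None when isolation is skipped.

    nproc is the host core count; env is a mapping such as os.environ.
    """
    system_override = env.get("LEFT4ME_SYSTEM_CPUS")
    game_override = env.get("LEFT4ME_GAME_CPUS")

    # Single-core fallback: with no overrides, write no drop-ins at all.
    if nproc < 2 and not (system_override or game_override):
        return None

    # Defaults mirror the spec: core 0 for system, 1..N-1 for games.
    # Assumption: if only one override is set, the other value still
    # falls back to these defaults (the spec is silent on this case).
    system_cpus = system_override or "0"
    game_cpus = game_override or f"1-{nproc - 1}"
    return system_cpus, game_cpus


print(resolve_cpu_layout(8, {}))
print(resolve_cpu_layout(1, {}))
print(resolve_cpu_layout(4, {"LEFT4ME_GAME_CPUS": "2-3"}))
```

Keeping the decision in one function with no side effects makes the three cases (default, override, skip) easy to test separately from the file writes.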
### Drop-in layout
Four files written to `/etc/systemd/system/`, each named `99-left4me-cpuset.conf`:
```
/etc/systemd/system/system.slice.d/99-left4me-cpuset.conf
/etc/systemd/system/user.slice.d/99-left4me-cpuset.conf
/etc/systemd/system/l4d2-build.slice.d/99-left4me-cpuset.conf
/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf
```
Each file contains:
```ini
[Slice]
AllowedCPUs=<resolved value>
```
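
As a rough illustration of the layout above, the four paths and their contents can be generated from the two resolved values. This is a Python sketch, not the deploy script's shell code; `render_dropins` is a hypothetical name.

```python
SLICES_ON_SYSTEM_CORES = ("system.slice", "user.slice", "l4d2-build.slice")
GAME_SLICE = "l4d2-game.slice"


def render_dropins(system_cpus, game_cpus):
    """Map each drop-in path to the text that should be written there."""
    dropins = {}
    for slice_name in (*SLICES_ON_SYSTEM_CORES, GAME_SLICE):
        # Only the game slice gets the game cores; everything else
        # shares the system cores.
        cpus = game_cpus if slice_name == GAME_SLICE else system_cpus
        path = f"/etc/systemd/system/{slice_name}.d/99-left4me-cpuset.conf"
        dropins[path] = f"[Slice]\nAllowedCPUs={cpus}\n"
    return dropins


for path, body in render_dropins("0", "1-7").items():
    print(path)
    print(body)
```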
### systemd compatibility
`AllowedCPUs=` is systemd 244+. Debian Trixie ships systemd 256+. Cgroup-v2 cpuset controller is enabled by default on Trixie; systemd auto-enables the controller when `AllowedCPUs=` is set on a unit. No additional machinery.
### Files changed / added
```
deploy/deploy-test-server.sh (modified — compute layout, write four drop-ins)
deploy/README.md (modified — new "CPU isolation" subsection inside Performance Tuning)
deploy/tests/test_deploy_artifacts.py (modified — new tests)
```
## Tests
`deploy/tests/test_deploy_artifacts.py` additions, following the existing
`assert "X" in script` pattern:
- For `deploy-test-server.sh`, assert:
- All four drop-in paths (`/etc/systemd/system/{system,user,l4d2-build,l4d2-game}.slice.d/99-left4me-cpuset.conf`) appear.
- The script reads `nproc` (substring `nproc` plus a default-binding form for `LEFT4ME_GAME_CPUS`).
- The script honors `LEFT4ME_SYSTEM_CPUS` and `LEFT4ME_GAME_CPUS` env-var overrides (substrings present, default-binding form like `${LEFT4ME_SYSTEM_CPUS:-...}`).
- The script has a single-core fallback (substring guarding `nproc -lt 2` or equivalent, with a warning to stderr).
- Each drop-in is written via the existing `install -m 0644 -o root -g root` heredoc pattern.
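
A minimal sketch of how the substring checks listed above might look. The helper names and the sample script text are stand-ins for illustration; the real test file's fixtures and the real script contents are not shown in this spec.

```python
EXPECTED_DROPIN_PATHS = [
    f"/etc/systemd/system/{name}.slice.d/99-left4me-cpuset.conf"
    for name in ("system", "user", "l4d2-build", "l4d2-game")
]


def missing_dropin_paths(script_text):
    """Return the expected drop-in paths that the script never mentions."""
    return [p for p in EXPECTED_DROPIN_PATHS if p not in script_text]


def has_env_default(script_text, var):
    """True if the script uses the ${VAR:-default} binding form for var."""
    return "${" + var + ":-" in script_text


# Hypothetical script excerpt, used only to exercise the helpers.
sample = "\n".join(EXPECTED_DROPIN_PATHS) + '\nGAME="${LEFT4ME_GAME_CPUS:-1-3}"\n'
print(missing_dropin_paths(sample))
print(has_env_default(sample, "LEFT4ME_GAME_CPUS"))
print(has_env_default(sample, "LEFT4ME_SYSTEM_CPUS"))
```

Returning the list of missing paths, rather than a single boolean, keeps a failing assertion message specific about which drop-in is absent.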
No runtime tests in this spec — verifying that systemd actually enforces `AllowedCPUs=` is operator-side via `cat /sys/fs/cgroup/<slice>/cpuset.cpus.effective` after deploy.
## Rollout
Single deploy. cgroup-v2 cpuset changes apply to running cgroups, so already-running servers move next time the kernel reschedules them — no instance restarts required. The `daemon-reload` already in the deploy script picks up the new drop-ins.
If something goes wrong (cpuset too narrow, a slice can't run any process), `systemctl status <slice>` will show the error and the operator can either fix the env vars and redeploy or `rm /etc/systemd/system/<slice>.slice.d/99-left4me-cpuset.conf` followed by `systemctl daemon-reload` to revert.
## Open questions
None blocking. Possible v2 candidates if measurement justifies them:
- Pair this with kernel `isolcpus=` boot params for true core isolation.
- Auto-pin NIC IRQs to core 0 (would compose with this isolation).
- Per-instance `CPUAffinity=` driven by a deploy-env knob, partitioning the game-core set across instances deterministically.
## References
- systemd.resource-control(5) — `AllowedCPUs=` semantics.
- Linux Documentation/admin-guide/cgroup-v2.rst — cpuset controller behavior on `cpuset.cpus` / `cpuset.cpus.effective`.
- Existing perf-baseline spec: `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md` — sibling work that introduced the slices this spec extends.


@ -0,0 +1,83 @@
# l4d2 cpu pinning — decision record (deferred)
Date: 2026-05-09
Status: decision (no implementation)
## Question
After the lifecycle + drift fix landed (commits `8552c55`, `67b5521`), the
question came up: with `AllowedCPUs=1-7` already constraining game servers
to cores 1-7, do CFS scheduler migrations *within* that range still cause
meaningful jitter? Should we hard-pin each instance to a single core?
## Investigation
The classic "lazy CFS" sysctl knob is **gone** on modern kernels. Verified
on Trixie's running kernel 6.12 (`ckn@10.0.4.128`):
```
/sbin/sysctl -a | grep -E "sched_migration_cost|sched_min_granularity|sched_wakeup_granularity|sched_latency"
# (no output)
```
`kernel.sched_migration_cost_ns` and the other classic CFS tunables were
removed in 5.13+ as part of the scheduler internals refactor that culminated
in EEVDF (6.6). Only `kernel.sched_rt_period_us` / `sched_rt_runtime_us`
remain. There is no global "be lazy about migrations" knob anymore.
### Available paths
| Option | Cost | Strictness | Pays off when |
|---|---|---|---|
| Trust CFS + `Nice=-5` + `AllowedCPUs=1-7` (current) | None | Soft | ≤ 3 instances on 7 cores; CFS rarely migrates active CPU-bound nice<0 tasks |
| Per-instance `CPUAffinity=N` drop-in | Web-app machinery to write drop-ins, daemon-reload, modulo or DB-persisted assignment | Strict | ≥ 4 instances (each gets exclusive core), or measured jitter |
| `isolcpus=1-7 nohz_full=1-7 rcu_nocbs=1-7` kernel cmdline | GRUB edit + reboot, host-specific | Strongest (also evicts kernel softirqs/RCU/timer ticks from game cores) | Tickrate-128 with measurable kernel-induced jitter |
| `SCHED_FIFO` per unit | Risky (RT misconfig can stall kernel) | Strict | Already documented as ops-side escape hatch in `deploy/README.md` |
### Why deferring is defensible
- The slice's `AllowedCPUs=1-7` already prevents game servers from running on core 0. The open question is "do they migrate within 1-7?" — yes, CFS can migrate, but for long-running CPU-bound `srcds` with `Nice=-5`, migrations are infrequent. CFS prefers cache locality and only migrates when an idle core "steals" or a periodic load-balance tick detects imbalance.
- With ≤ 3 instances on 7 game cores, the load balancer rarely sees imbalance to fix.
- Per-instance hard pinning adds non-trivial machinery (drop-in writer through `left4me-systemctl`, or extending `instance.env` + a `taskset` wrapper in the unit). Not warranted unless we observe a real problem.
- `deploy/README.md` already documents the `CPUAffinity=N` per-instance drop-in as an opt-in escape hatch. An operator who measures jitter can apply it without code changes.
## Decision
**No code change.** Keep the current setup:
- Slice-level `AllowedCPUs=1-7` ensures game servers never touch core 0.
- `Nice=-5` keeps active srcds tasks weighted heavily so CFS prefers leaving them alone.
- The `CPUAffinity=N` per-instance drop-in remains the documented escape hatch.
## Revisit triggers
If any of these signals appears, design and implement strict per-instance pinning:
- ≥ 4 game-server instances running simultaneously on one host.
- A specific server reports tickrate dips / rubber-banding correlated with another instance starting or a build sandbox firing.
- `perf stat -e sched:sched_migrate_task -p <srcds-pid>` shows > 1 migration/sec under load.
When revisiting, two implementation paths to choose from:
1. **Modulo assignment in the host library.** Read `LEFT4ME_GAME_CPUS` (or parse the slice's `AllowedCPUs=` drop-in), pick `game_cpus[(int(name) - 1) % len(game_cpus)]`, write `L4D2_CPU=N` into `instance.env`, wrap the unit's `ExecStart` with `taskset -c ${L4D2_CPU}`. Stateless, deterministic, no DB column. **Preferred.**
2. **Persisted assignment.** Add `Server.cpu_pin` column, web app picks at initialize time and stores. Survives `LEFT4ME_GAME_CPUS` changes (each server keeps its assigned core). Bigger ripple.
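
The modulo assignment in option 1 can be sketched as follows. The sketch assumes instance names are numeric strings (as the `int(name)` in the expression above implies) and that the CPU list uses the comma-and-range form; `parse_cpu_list` and `pick_cpu` are hypothetical helper names.

```python
def parse_cpu_list(spec):
    """Expand a cpuset-style list such as "1-3,6" into [1, 2, 3, 6]."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def pick_cpu(instance_name, game_cpus):
    """Deterministically map a numeric instance name to one game core."""
    return game_cpus[(int(instance_name) - 1) % len(game_cpus)]


game_cpus = parse_cpu_list("1-7")
for name in ("1", "2", "7", "8"):
    print(name, pick_cpu(name, game_cpus))
```

With seven game cores, instance 8 wraps around to the same core as instance 1, which is the point at which "each gets an exclusive core" stops holding.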
## Verification (no-op confirmation)
```sh
ssh ckn@10.0.4.128 'systemctl show l4d2-game.slice -p AllowedCPUs'
# expect: AllowedCPUs=1-7
ssh ckn@10.0.4.128 'cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective'
# expect: 0 (everything-not-game still pinned to core 0)
# When ≥ 1 server is running:
ssh ckn@10.0.4.128 'for p in $(pgrep srcds); do grep ^Cpus_allowed_list /proc/$p/status; done'
# expect: 1-7 (CFS picks whichever of those is hottest at any given moment)
```
## References
- `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md` — sibling design that introduced the `AllowedCPUs=1-7` slice constraint this record builds on.
- `deploy/README.md` "Performance Tuning" section — the `CPUAffinity=N` per-instance escape hatch.
- Linux kernel changelog 5.13+ — removal of classic CFS tunable sysctls.


@ -0,0 +1,230 @@
# l4d2 server host perf baseline — design
Date: 2026-05-09
Status: design
## Summary
Apply a host-side performance and resource-isolation baseline to every L4D2 server instance, using systemd unit directives, a slice hierarchy, and host sysctls. The blueprint-level game configuration (tickrate, sv_minrate/maxrate, fps_max, plugins) stays the responsibility of the individual server maintainer and is out of scope.
## Goals
- Game-server processes get measurable scheduling, I/O, and OOM priority over the script-build sandbox and over interactive system traffic.
- One misbehaving server cannot OOM-kill its siblings or the host.
- The kernel's UDP path is sized for sustained Source-engine traffic instead of distro defaults.
- Operators have documented escape hatches for host-specific tuning (CPU pinning, governor, NIC IRQs, real-time scheduling) without any of it being imposed by default.
## Non-goals
- ConVars, blueprint arguments, plugins, tickrate, rate values — owned by the maintainer of each server.
- Real-time (`SCHED_FIFO`/`SCHED_RR`) scheduling for game servers. Documented as opt-in only; see Out-of-scope rationale.
- CPU governor changes. Documented opt-in only.
- Per-instance `CPUAffinity`. Host-specific; documented only.
- NIC ring-buffer / IRQ-pinning changes. Hardware-specific; documented only.
- Job-scheduler awareness ("don't build a script overlay while server X has players"). Cgroup weights cover this in v1; revisit if real-world data disagrees.
- Hardening tightening (`ProtectKernelTunables=yes`, etc.). Security-focused, separate spec.
## Background
Current state (commit `965b67e`):
- `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` runs `srcds_run` as user `left4me` with security hardening (`NoNewPrivileges`, `PrivateTmp`, `PrivateDevices`, `ProtectHome`, `ProtectSystem=strict`, `ReadOnlyPaths`, `ReadWritePaths`, `RestrictSUIDSGID`, `LockPersonality`) but **no scheduling, memory, OOM, kill-signal, or log-rate directives**.
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` runs script-overlay builds via `systemd-run` (transient service mode, `--unit=` without `--scope`) with `CPUQuota=200%` and `RuntimeMaxSec=3600`, but in the **default cgroup** — it competes against game servers as an equal sibling under `system.slice`.
- No host sysctls are deployed. Linux defaults (`rmem_max`/`wmem_max` ≈ 128 KB, `netdev_max_backlog=1000`) are below what sustained UDP gameplay across multiple instances expects.
srcds is single-threaded per instance, so multi-instance hosts contend over CPU cycles, kernel softirq budget, and journald rate limits.
## Design
### Slice topology
Flat top-level slices, siblings of `system.slice` and `user.slice`:
```
-.slice
├── system.slice (default CPUWeight=100, IOWeight=100)
├── user.slice (default CPUWeight=100, IOWeight=100)
├── l4d2-game.slice (CPUWeight=1000, IOWeight=1000)
└── l4d2-build.slice (CPUWeight=10, IOWeight=10)
```
Rationale:
- 100:1 weight ratio between game and build means: under contention, the build sandbox is starved; when uncontended, the build still gets the full box modulo its own `CPUQuota=200%`.
- Flat (not nested under `system.slice`) so a logged-in admin running a heavy task in `user.slice` cannot steal cycles from a live match.
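
To make the weight ratio concrete: cgroup-v2 CPU weights split contended CPU time among busy siblings in proportion to their weights, and idle siblings take no share. The sketch below is a simplified model under that assumption; it ignores `CPUQuota=` and cpusets, and `cpu_shares` is a hypothetical helper.

```python
def cpu_shares(weights):
    """Fraction of CPU each busy sibling gets under full contention."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# All four top-level slices busy at once (worst case for the game slice).
all_busy = cpu_shares({
    "system.slice": 100,
    "user.slice": 100,
    "l4d2-game.slice": 1000,
    "l4d2-build.slice": 10,
})
# Only game and build competing: the 100:1 ratio from the rationale.
game_vs_build = cpu_shares({"l4d2-game.slice": 1000, "l4d2-build.slice": 10})

for name, share in all_busy.items():
    print(f"{name}: {share:.1%}")
print(f"build share vs game only: {game_vs_build['l4d2-build.slice']:.1%}")
```

Even with every slice busy, the game slice keeps 1000/1210 of the contended CPU time in this model, and the build slice drops to 10/1210.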
### Per-instance unit additions (`left4me-server@.service`)
Add to `[Service]`:
```
Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0
```
Per-directive justification:
- `Slice=l4d2-game.slice` — places the instance in the high-weight slice.
- `Nice=-5` — modest CFS priority bump. Negative `Nice` set by systemd does not require `CAP_SYS_NICE` because systemd applies the value before dropping to the unit user. SCHED_FIFO is intentionally rejected; see Out-of-scope rationale.
- `IOSchedulingClass=best-effort` + `IOSchedulingPriority=4` — explicit best-effort at priority 4, which matches the default for that class on most distros; stating it explicitly is deterministic and harmless.
- `OOMScoreAdjust=-200` — game servers survive memory pressure; sandbox dies first (see sandbox section).
- `MemoryHigh=1.5G`, `MemoryMax=2G` — soft + hard ceiling. Typical L4D2 srcds runs ~500-800 MB; map-load spikes fit in headroom; a runaway is bounded.
- `TasksMax=256` — bounds thread count well above srcds' steady-state usage; prevents fork-bomb style failures from leaking host-wide.
- `LimitNOFILE=65536` — Valve wiki recommendation; cheap and matches multi-plugin setups.
- `KillSignal=SIGINT` — srcds responds to SIGINT for clean shutdown (writes demos, flushes logs); SIGTERM is harsher.
- `TimeoutStopSec=15s` — gives srcds time to finish flush before SIGKILL.
- `LogRateLimitIntervalSec=0` — disables journald per-unit rate limiting (default `10000 msgs/30s`). srcds + plugins exceed this on busy maps; dropped messages break diagnostics.
Existing security directives are kept verbatim.
### Slice unit files
New file `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice`:
```ini
[Unit]
Description=left4me game-server slice
Before=slices.target
[Slice]
CPUWeight=1000
IOWeight=1000
```
New file `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice`:
```ini
[Unit]
Description=left4me script-sandbox build slice
Before=slices.target
[Slice]
CPUWeight=10
IOWeight=10
```
### Sandbox slice + OOM placement
Edit `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` to add to the `systemd-run` invocation (transient service mode — the existing helper uses `--unit=` without `--scope`):
- `--slice=l4d2-build.slice`
- `-p OOMScoreAdjust=500`
Existing `CPUQuota=200%` and `RuntimeMaxSec=3600` stay. Cgroup weight (slice) and CPU quota (per-unit) compose: weight handles contention, quota handles the absolute ceiling.
### Host sysctls
New file `deploy/files/etc/sysctl.d/99-left4me.conf`:
```
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
vm.swappiness = 10
```
Per-value justification:
- `rmem_max`/`wmem_max = 8 MB` — Linux default of ~128 KB is a known bottleneck for sustained UDP. 8 MB is the standard 1 Gbit recommendation (Red Hat performance guide); enough headroom for ~10 instances on a host without going to 16 MB.
- `rmem_default`/`wmem_default = 512 KB` — protects sockets that don't explicitly call `setsockopt(SO_RCVBUF/SO_SNDBUF)`; harmless when they do.
- `netdev_max_backlog = 5000` — default `1000` overflows under multi-instance UDP burst; the per-CPU softnet queue starts dropping packets once full.
- `netdev_budget = 600` — gives softirq more packet-drain headroom per pass; default `300` is undersized for multi-Gbit-class hosts.
- `vm.swappiness = 10` — universally recommended for latency-sensitive servers; harmless on swapless hosts.
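
A small sketch of how the values above could be checked, either against the shipped conf file or against `/proc/sys` after a deploy. The helper names are hypothetical, and the parser only handles the plain `key = value` form used in this file.

```python
EXPECTED_SYSCTLS = {
    "net.core.rmem_max": "8388608",        # 8 * 1024 * 1024
    "net.core.wmem_max": "8388608",
    "net.core.rmem_default": "524288",     # 512 * 1024
    "net.core.wmem_default": "524288",
    "net.core.netdev_max_backlog": "5000",
    "net.core.netdev_budget": "600",
    "vm.swappiness": "10",
}


def parse_sysctl_conf(text):
    """Parse "key = value" lines, skipping blanks and # or ; comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def sysctl_proc_path(key):
    """Map a sysctl key to its /proc/sys path for a post-deploy check."""
    return "/proc/sys/" + key.replace(".", "/")


conf_text = "\n".join(f"{k} = {v}" for k, v in EXPECTED_SYSCTLS.items())
print(parse_sysctl_conf(conf_text) == EXPECTED_SYSCTLS)
print(sysctl_proc_path("net.core.rmem_max"))
```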
### Deploy script integration
`deploy/deploy-test-server.sh` must:
1. Copy `etc/sysctl.d/99-left4me.conf` to `/etc/sysctl.d/`.
2. Run `sysctl --system` (or `sysctl -p /etc/sysctl.d/99-left4me.conf`) so values take effect immediately, not on next boot.
3. Copy the two `.slice` files into `/usr/local/lib/systemd/system/`.
4. `systemctl daemon-reload` after unit/slice changes (already done in current deploy flow).
5. No explicit `systemctl start` of the slices is required — they activate on first child reference.
### Documented escape hatches (no auto-apply)
Append a "Performance tuning" section to `deploy/README.md`:
- **CPU governor**: `cpupower frequency-set -g performance` if jitter under load matters more than power. Schedutil is acceptable for sustained UDP workloads. Provide the one-liner; do not ship a oneshot service in v1.
- **CPU affinity per instance**: example drop-in at `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf` setting `CPUAffinity=N`. Document the strategy "one instance per core, leave core 0 for system + IRQ".
- **NIC tuning**: example `ethtool -G <iface> rx 4096 tx 4096`, IRQ-pinning hints. Hardware-specific; ops-only.
- **Real-time scheduling opt-in**: example drop-in adding `CPUSchedulingPolicy=fifo`, `CPUSchedulingPriority=10`, `LimitRTPRIO=10`. Include a one-paragraph warning citing RT-throttling defaults (`sched_rt_runtime_us=950000`) and the failure mode if a single instance misbehaves.
These stay pure documentation in v1 — no code paths, no tests asserting them.
### Out-of-scope rationale
- **SCHED_FIFO**: a misbehaving srcds at any RT priority can starve kernel threads and produces failure modes that are harder to diagnose than the jitter problem it claims to solve. `Nice=-5` plus the slice weights captures the practical benefit. Ops who need RT can opt in via the documented drop-in.
- **CPU governor auto-set**: Phoronix and Arch comparisons show `schedutil` is within noise of `performance` on sustained workloads like Source UDP; aggressively forcing `performance` would surprise users on power-managed hosts.
- **CPUAffinity in the unit**: the unit template is shared across all instances; a single hard-coded `CPUAffinity=` would pin every instance to the same cores, defeating the purpose. Per-instance pinning needs deploy-time policy that is outside v1's scope.
### Files changed / added
```
deploy/files/usr/local/lib/systemd/system/left4me-server@.service (modified)
deploy/files/usr/local/lib/systemd/system/l4d2-game.slice (new)
deploy/files/usr/local/lib/systemd/system/l4d2-build.slice (new)
deploy/files/etc/sysctl.d/99-left4me.conf (new)
deploy/files/usr/local/libexec/left4me/left4me-script-sandbox (modified)
deploy/deploy-test-server.sh (modified — sysctl --system step)
deploy/README.md (modified — performance section)
deploy/tests/test_deploy_artifacts.py (modified — assertions)
```
## Tests
`deploy/tests/test_deploy_artifacts.py` additions, following the existing
`assert "key=value" in text` pattern:
- For `left4me-server@.service`, assert every line listed in *Per-instance
unit additions* is present verbatim. Each is a separate assertion so a
failing line is identifiable.
- For `l4d2-game.slice`, assert `CPUWeight=1000` and `IOWeight=1000`.
- For `l4d2-build.slice`, assert `CPUWeight=10` and `IOWeight=10`.
- For `99-left4me.conf`, assert every sysctl line listed in *Host sysctls*.
- For `left4me-script-sandbox`, assert the strings `--slice=l4d2-build.slice`
and `OOMScoreAdjust=500` both appear.
- Assert the deploy script invokes `sysctl --system` (or
`sysctl -p /etc/sysctl.d/99-left4me.conf`) at least once after copying the
conf into place.
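
One caveat worth illustrating for the slice assertions: a plain substring check for `CPUWeight=10` also matches `CPUWeight=1000`. The sketch below shows a whole-line comparison as a possible refinement; it is not part of the spec, and `missing_lines` is a hypothetical helper.

```python
def missing_lines(unit_text, expected_lines):
    """Return expected directives that are not present as a whole line."""
    present = {line.strip() for line in unit_text.splitlines()}
    return [line for line in expected_lines if line not in present]


game_slice = "[Slice]\nCPUWeight=1000\nIOWeight=1000\n"

# Substring matching is too loose here: "CPUWeight=10" is a prefix of
# "CPUWeight=1000", so this check passes against the wrong slice file.
print("CPUWeight=10" in game_slice)
# Whole-line matching reports the mismatch.
print(missing_lines(game_slice, ["CPUWeight=10", "IOWeight=10"]))
```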
No runtime perf tests in v1 — the spec ships defaults, not measured wins.
Real-world measurement is left to operators with concrete instance counts,
hardware, and player loads.
## Rollout
Single deploy. Running game servers will not pick up the new directives until each instance is restarted (systemd does not reapply unit changes to already-running services). The web UI's "stop" + "start" cycle is sufficient. Document this in `deploy/README.md`.
## Open questions
None blocking. v2 candidates if measurement justifies them:
- Per-instance `CPUAffinity` driven by a deploy-env knob (`LEFT4ME_INSTANCE_CPUS`).
- Job-worker awareness of "server has active players" to defer builds further than weights alone.
- Optional `left4me-host-perf.service` oneshot that sets governor + NIC tuning under a single env-flag opt-in.
## References
- systemd.exec(5) — `Nice=`, `IOSchedulingClass=`, `IOSchedulingPriority=`, `OOMScoreAdjust=`, `LimitNOFILE=`, `LogRateLimitIntervalSec=`.
- systemd.resource-control(5) — slice semantics, `CPUWeight=`, `IOWeight=`, `MemoryHigh=`, `MemoryMax=`, `TasksMax=`, weight competition rules.
- systemd.kill(5) — signal handling and `KillSignal=`.
- systemd.service(5) — `TimeoutStopSec=`.
- Red Hat Enterprise Linux Network Performance Tuning Guide — `rmem_max`/`wmem_max`/`netdev_max_backlog`/`netdev_budget`.
- LWN "SCHED_FIFO and realtime throttling"; RHEL Real-Time CPU throttling docs — rationale for not shipping RT by default.
- Linux Foundation real-time wiki — `sched_rt_runtime_us` semantics.
- forums.srcds.com / AlliedModders / linuxquestions.org threads — confirmation that srcds is single-threaded per instance.
- Phoronix governor comparisons — performance vs schedutil for sustained workloads.
- Multiple latency-tuning guides — `vm.swappiness=10` consensus.


@ -0,0 +1,217 @@
# l4d2 server lifecycle: reboot-safe + drift reconciliation — design
Date: 2026-05-09
Status: design
## Summary
Make L4D2 server instances survive a host reboot by switching their lifecycle verbs from `systemctl start`/`stop` to `systemctl enable --now`/`disable --now`. Pair this with a periodic background poller that refreshes `Server.actual_state` so out-of-band state changes (OOM kills, manual `systemctl stop`, crashes that exhaust `Restart=on-failure`) no longer leave the web UI showing stale "running" indicators.
## Goals
- An L4D2 server started via the web UI (or `l4d2ctl start`) automatically comes back up after a host reboot, with no operator action.
- The web app's `Server.actual_state` converges to systemd's actual state within ~30 seconds of any out-of-band change.
- The single-source-of-truth for "this server should be running" lives in systemd's wants-symlinks, not in a SQLite row that systemd has no awareness of.
- Migration from the existing `systemctl start`-based fleet is a no-op: the next stop+start cycle through the UI converts each server to the enable-based model.
## Non-goals
- **Auto-restart on detected drift.** When the poller observes `desired_state=running` but `actual_state=stopped`, this spec does not re-enqueue a start job. That's a v2 UX/policy decision.
- **UI surfacing of stale-state warnings.** Once the poller is reliable, the dashboard could show "DB believes X, but actual_state was last refreshed N seconds ago." Out of scope.
- **Reconciliation of orphan systemd units.** Units enabled on disk but not represented by any `Server` row (e.g., from a crashed delete) — separate cleanup spec.
- **Per-server poller intervals.** A single global cadence is sufficient.
- **Replacing `Restart=on-failure`** with anything more elaborate. The unit's existing restart policy stays.
- **Reactive-style state propagation.** No SSE/websocket pushes to the UI when actual_state changes. The next page render reads the fresh value from the DB.
## Premise check: system units, not user units
`systemctl --user enable --now` has different lifecycle rules — auto-start only at user login (unless `loginctl enable-linger <user>` is set), symlinks land in `~/.config/systemd/user/<target>.wants/`. It would be wrong here.
This project uses **system units**, confirmed by:
- Unit path: `/usr/local/lib/systemd/system/left4me-server@.service` is the system search path; user units live in `/etc/systemd/user/` or `~/.config/systemd/user/`.
- The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl:31-44`) calls plain `systemctl` (no `--user` flag) and runs as **root** via the sudoers rule at `deploy/files/etc/sudoers.d/left4me:2`.
- The unit's `[Install] WantedBy=multi-user.target` (line 43 of the unit) is a system target; user units would use `default.target`.
- The same machinery is already in production for `left4me-web.service`: `deploy-test-server.sh` runs `sudo systemctl enable --now left4me-web.service`, and that's how the web service auto-came-back after today's reboot. We're applying the same pattern to the game-server template instances.
`systemctl enable left4me-server@1.service` will create `/etc/systemd/system/multi-user.target.wants/left4me-server@1.service` symlinked to `/usr/local/lib/systemd/system/left4me-server@.service`. systemd handles the template instantiation via the `@` syntax automatically.
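
The symlink naming described above can be expressed as a short sketch, useful for example when checking on a host whether an instance is enabled. Both helper names are hypothetical.

```python
from pathlib import PurePosixPath


def wants_symlink(instance_unit, target="multi-user.target"):
    """Path of the symlink that `systemctl enable` creates for a unit."""
    return PurePosixPath("/etc/systemd/system") / f"{target}.wants" / instance_unit


def template_for(instance_unit):
    """Strip the instance part: left4me-server@1.service -> left4me-server@.service."""
    prefix, _, rest = instance_unit.partition("@")
    _, dot, suffix = rest.rpartition(".")
    return f"{prefix}@{dot}{suffix}"


unit = "left4me-server@1.service"
print(wants_symlink(unit))
print(template_for(unit))
```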
## Background
Today's behavior, confirmed by forensics on `ckn@10.0.4.128` after the operator ran `sudo systemctl poweroff` at 11:48:02 CEST:
- The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`) accepts the verbs `start`, `stop`, and `show`, each invoking the literal `systemctl` action.
- `l4d2host/service_control.py` exposes `start_service(name)` and `stop_service(name)` that build `systemctl_command("start"/"stop", name)`.
- `l4d2host/instances.py` `start_instance` and `stop_instance` call those functions.
- `systemctl start` is a transient activation. systemd creates **no** symlink under `multi-user.target.wants/`, so the unit doesn't auto-start on next boot.
- After the host poweroff at 11:48:02, both running instances were cleanly shut down. The host rebooted; `left4me-web.service` came back (it *is* `enable`d); the game instances did not.
- The web app's `Server.actual_state` is only ever written by `refresh_server_actual_state_after_job()` in `l4d2web/services/job_worker.py:581`, called solely after a job completes. With no jobs in flight after the reboot, the row's `actual_state="running"` from yesterday remained the displayed truth.
## Design
### Part A — Switch lifecycle verbs to `enable --now` / `disable --now`
**Helper script** (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`):
Rename the action verbs the helper accepts: drop `start`/`stop`, add `enable`/`disable`. The bodies become:
```sh
case "$action" in
    enable)  exec "$systemctl" enable --now "$unit" ;;
    disable) exec "$systemctl" disable --now "$unit" ;;
    show)    exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
    *)       reject ;;
esac
```
The existing instance-name validation regex (currently lines 12-17) is unchanged — it constrains the `<name>` argument, not the action. The sudoers rule at `deploy/files/etc/sudoers.d/left4me`:
```
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-systemctl *
```
already passes any args; no sudoers update needed.
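
For the `show` verb, `systemctl show` prints one `KEY=VALUE` line per requested property. A minimal parser for that output could look like the sketch below; how the existing code maps these properties onto `actual_state` is not shown in this spec, so the sketch stops at parsing, and `parse_show_output` is a hypothetical name.

```python
def parse_show_output(text):
    """Parse KEY=VALUE lines as printed by `systemctl show --property=...`."""
    props = {}
    for line in text.splitlines():
        # Split on the first "=" only; values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


sample = "ActiveState=active\nSubState=running\n"
props = parse_show_output(sample)
print(props["ActiveState"], props["SubState"])
```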
**Python wrapper** (`l4d2host/service_control.py`):
Rename `start_service` → `enable_service` and `stop_service` → `disable_service`. Each builds `systemctl_command("enable", name)` / `systemctl_command("disable", name)`. The existing `show_service` is unchanged.
**Instance lifecycle** (`l4d2host/instances.py`):
- `start_instance`: replace the `start_service(...)` call with `enable_service(...)`.
- `stop_instance`: replace `stop_service(...)` with `disable_service(...)`.
- `_purge_instance` (called by `delete_instance` and `reset_instance`): replace `stop_service(...)` with `disable_service(...)`. Running `disable --now` on a unit that is already stopped changes nothing at runtime but still removes any leftover wants-symlink, which is the desired idempotent behavior.
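The idempotent-purge behavior described in the list above can be sketched as a small wrapper that tolerates a failing disable. This is an illustrative sketch only: `disable_ignoring_failure` is a hypothetical name, and the `disable` callable stands in for `disable_service`.

```python
import subprocess
from typing import Callable


def disable_ignoring_failure(disable: Callable[[str], None], name: str) -> bool:
    """Call `disable(name)`; report False instead of raising when it fails.

    Purge paths must proceed to file removal even if the unit was never
    loaded, so a CalledProcessError from the helper is swallowed here.
    """
    try:
        disable(name)
    except subprocess.CalledProcessError:
        return False
    return True


if __name__ == "__main__":
    def failing(name: str) -> None:
        raise subprocess.CalledProcessError(returncode=5, cmd=["disable", name])

    print(disable_ignoring_failure(failing, "alpha"))
```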
**CLI surface** (`l4d2host/cli.py`):
`l4d2ctl start <name>` and `l4d2ctl stop <name>` keep their names per the contract in `AGENTS.md` ("Host CLI write commands are fixed to: install, initialize, start, stop, delete"). The semantics now genuinely match the verb at the operator level: `start` = "ensure running, now and after reboot." Internal call paths route through `start_instance` → `enable_service` as renamed above.
**Web facade** (`l4d2web/services/l4d2_facade.py`):
Unchanged. Still invokes `["l4d2ctl", "start", ...]` / `["l4d2ctl", "stop", ...]`.
### Part B — Periodic state poller
Add a single background thread spawned alongside the existing job-worker threads in `l4d2web/services/job_worker.py:start_job_workers`:
```python
def start_state_poller(app):
    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
    thread = threading.Thread(
        target=state_poller_loop,
        args=(app, interval),
        daemon=True,
        name="left4me-state-poller",
    )
    thread.start()


def state_poller_loop(app, interval):
    while True:
        try:
            with app.app_context():
                poll_all_servers()
        except Exception:
            pass  # never let a single failure kill the loop
        time.sleep(interval)


def poll_all_servers():
    with session_scope() as db:
        active_server_ids = set(db.scalars(
            select(Job.server_id).where(Job.state.in_(("queued", "running")))
        ).all())
        server_ids = [
            sid for sid in db.scalars(select(Server.id)).all()
            if sid not in active_server_ids
        ]
    for sid in server_ids:
        try:
            refresh_server_actual_state(sid)
        except Exception:
            pass
```
**Why skip in-flight servers:** the job worker's success path also calls `refresh_server_actual_state`. Two writers touching the same row at overlapping times cannot corrupt it (SQLite WAL serializes writes), but a poller that observes transient state mid-job (for example, the brief window where the unit is being enabled but `srcds` hasn't bound its port yet) could write a misleading value that the worker's post-completion refresh then overwrites. Skipping is simpler than reasoning about the orderings.
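The skip rule reduces to a set difference. The sketch below isolates it as a pure function so it can be reasoned about without a database; `servers_to_poll` is a hypothetical name, and the handling of `None` assumes some jobs (for example overlay builds) carry no `server_id`.

```python
from typing import Iterable, Optional


def servers_to_poll(
    all_server_ids: Iterable[int],
    busy_server_ids: Iterable[Optional[int]],
) -> list[int]:
    """Return the server ids that have no queued/running job attached."""
    # Assumption: jobs not tied to a server report server_id=None; ignore them.
    busy = {sid for sid in busy_server_ids if sid is not None}
    return [sid for sid in all_server_ids if sid not in busy]


if __name__ == "__main__":
    print(servers_to_poll([1, 2, 3], [2, None]))
```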
**Wiring in startup** (`l4d2web/app.py:create_app`): call `start_state_poller(app)` adjacent to `start_job_workers(app)`, gated by the same `should_start_workers` predicate (existing lines 84–88: `JOB_WORKER_ENABLED && not TESTING && not _in_flask_cli_context()`).
**First-tick latency:** the loop runs `poll_all_servers()` once before the first `time.sleep(interval)`, so the DB catches up to systemd reality within milliseconds of app boot (one `systemctl show` per server). A separate startup-reconcile path is not needed.
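The poll-before-sleep ordering is what gives the immediate first tick. Below is a minimal, testable variant of the loop, offered as a sketch rather than the shipped code: it swaps `while True` plus `time.sleep` for a `threading.Event` (the `stop` parameter is an assumption introduced here so the loop can be exited in a test).

```python
import threading
from typing import Callable


def poller_loop(poll: Callable[[], None], interval: float, stop: threading.Event) -> None:
    """Run `poll` immediately, then once per `interval` until `stop` is set."""
    while not stop.is_set():
        try:
            poll()  # first call happens before any waiting
        except Exception:
            pass  # a failing tick must not end the loop
        stop.wait(interval)  # returns early as soon as `stop` is set


if __name__ == "__main__":
    ticks: list[int] = []
    stop = threading.Event()

    def one_shot() -> None:
        ticks.append(1)
        stop.set()

    poller_loop(one_shot, 3600.0, stop)
    print(len(ticks))
```

Even with a one-hour interval, the example returns after a single tick because the poll runs before the wait.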
**Concurrency:** the poller and the workers all use `session_scope()` (`l4d2web/db.py:44-58`), which commits on success and rolls back on exception. SQLite WAL mode (configured by the deploy script per `deploy-test-server.sh:188-198`) handles concurrent reads + serialized writes. No new locking primitives.
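The commit-on-success / rollback-on-exception contract can be shown with the standard library alone. This sketch is a stand-in, not the project's `session_scope()` (which is not reproduced in this document): it wraps a plain `sqlite3` connection and uses a hypothetical `server` table.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def session_scope(conn: sqlite3.Connection) -> Iterator[sqlite3.Connection]:
    """Commit if the block succeeds; roll back and re-raise if it fails."""
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE server (id INTEGER PRIMARY KEY, actual_state TEXT)")
    conn.execute("INSERT INTO server VALUES (1, 'running')")
    conn.commit()
    with session_scope(conn) as db:
        db.execute("UPDATE server SET actual_state = 'stopped' WHERE id = 1")
    print(conn.execute("SELECT actual_state FROM server").fetchone()[0])
```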
### Why both parts
Either part alone is insufficient:
- **Part A alone** survives reboots but doesn't catch OOM kills, manual `systemctl disable --now <unit>` from a shell, or crashes that exhaust `Restart=on-failure`. The DB still drifts in those cases.
- **Part B alone** keeps the DB honest but doesn't bring servers back after a reboot — the operator would still be looking at `actual_state=stopped` on a server they expected to come back, with the only recourse being to click start again.
Together: enable-based lifecycle keeps systemd as the source of truth; the poller keeps the DB honest about whatever systemd reports.
### Migration on running hosts
No one-shot migration is needed. After this lands, a server currently running via the old `systemctl start` (started but not enabled) keeps running through the deploy. The next time the operator clicks stop in the UI, `systemctl disable --now` runs: `disable` is a no-op for a unit that was never enabled, but `--now` still stops the live process. The next start runs `systemctl enable --now`, which enables and starts the unit. From that point on the unit survives reboot.
The poller's first tick after deploy refreshes every server's `actual_state` to whatever systemd reports: if the test box's two stale rows still claim `running` while no unit is loaded, that tick flips them to `stopped`.
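The migration argument above can be checked with a toy state model. This is a simplified sketch under stated assumptions: a unit is reduced to two booleans, and a reboot is assumed to leave a unit active exactly when it is enabled (ignoring start failures). The names `UnitState` and `apply_event` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitState:
    enabled: bool  # wants-symlink present
    active: bool   # process currently running


def apply_event(state: UnitState, event: str) -> UnitState:
    """Return the unit state after one lifecycle event."""
    if event == "start":            # legacy path: no symlink created
        return UnitState(state.enabled, True)
    if event == "stop":             # legacy path: symlink untouched
        return UnitState(state.enabled, False)
    if event == "enable --now":
        return UnitState(True, True)
    if event == "disable --now":
        return UnitState(False, False)
    if event == "reboot":           # only enabled units come back
        return UnitState(state.enabled, state.enabled)
    raise ValueError(f"unknown event: {event!r}")


if __name__ == "__main__":
    legacy = UnitState(enabled=False, active=True)
    print(apply_event(legacy, "reboot"))
    migrated = apply_event(apply_event(legacy, "disable --now"), "enable --now")
    print(apply_event(migrated, "reboot"))
```

In this model a legacy-started unit is lost on reboot, while one UI stop+start cycle makes it reboot-safe.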
### Files changed / added
```
deploy/files/usr/local/libexec/left4me/left4me-systemctl (Part A — verbs)
l4d2host/service_control.py (Part A — rename)
l4d2host/instances.py (Part A — call new names)
l4d2host/tests/test_lifecycle.py (Part A — test updates)
l4d2host/tests/test_service_control.py (Part A — new direct unit tests, create if absent)
deploy/tests/test_deploy_artifacts.py (Part A — helper assertions)
l4d2web/services/job_worker.py (Part B — poller code)
l4d2web/app.py (Part B — wire start_state_poller)
l4d2web/config.py (Part B — STATE_POLLER_INTERVAL_SECONDS default)
l4d2web/tests/test_job_worker.py (Part B — poller tests)
```
## Tests
### Part A
- `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`: update body assertions to expect `enable)` / `disable)` / `show)`. Add an assertion that `enable)` body contains `enable --now` and `disable)` body contains `disable --now`. Update rejected-action examples (drop `start`/`stop` since they're no longer accepted).
- `l4d2host/tests/test_lifecycle.py`: every assertion that mocks `run_command` and inspects the systemctl-helper invocation needs the action token updated from `start` → `enable` and `stop` → `disable`. The `_purge_instance` paths exercised by `delete_instance` and `reset_instance` flip from `stop` to `disable`.
- New direct unit tests in `l4d2host/tests/test_service_control.py` (create the file if it doesn't exist already): exercise `enable_service` and `disable_service` with a mocked `run_command` and assert they emit `["sudo", "-n", helper_path, "enable"|"disable", name]`.
### Part B
- `l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server` (new): seed two `Server` rows with `actual_state="unknown"`; monkey-patch `refresh_server_actual_state` to record calls; run one iteration of `poll_all_servers()`; assert it was called once per server in any order.
- `test_state_poller_skips_servers_with_inflight_jobs` (new): seed a `Server` row + a `Job` with `state="running"` for that server; run `poll_all_servers()`; assert `refresh_server_actual_state` was NOT called for that server.
- `test_state_poller_swallows_per_server_exceptions` (new): make `refresh_server_actual_state` raise for one server; assert other servers are still polled and the loop function returns normally.
- `test_state_poller_disabled_when_job_workers_disabled` (new): create app with `JOB_WORKER_ENABLED=False`; assert `start_state_poller` is not invoked (or that no `left4me-state-poller` thread is alive after `create_app`).
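For the thread-based variant of the last assertion ("no `left4me-state-poller` thread is alive"), a check over `threading.enumerate()` is one option. The helper below is a hypothetical sketch; only the thread name comes from the design above.

```python
import threading

POLLER_THREAD_NAME = "left4me-state-poller"


def poller_thread_alive() -> bool:
    """True if a live thread with the poller's name exists in this process."""
    return any(
        t.name == POLLER_THREAD_NAME and t.is_alive()
        for t in threading.enumerate()
    )


if __name__ == "__main__":
    print(poller_thread_alive())
```

Patching `start_state_poller` and asserting it was not called avoids depending on thread scheduling, which may make it the more robust of the two options.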
### CI sanity
`pytest deploy/tests/ l4d2host/tests l4d2web/tests -q` is green except the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state` (stale since `caa8b83`, out of scope).
## Rollout
Single deploy. After deploy:
1. The poller's first tick (within seconds of `left4me-web.service` starting) refreshes every server's `actual_state` to systemd reality. Any servers stuck on stale "running" flip to "stopped" automatically. **No operator UI clicks required.**
2. Servers currently `running` (started via the old `systemctl start`) keep running, but they're not yet `enabled`. The operator's next stop+start through the UI converts them to enable-based and from that point onwards they're reboot-safe.
3. Newly-started servers (`l4d2ctl start <name>` or web UI start) are enable-based from the first invocation.
If something goes wrong — e.g., the helper rejects a previously-valid invocation or the poller floods the journal — the helper script + `service_control.py` change can be reverted independently of the poller, and vice versa.
## Open questions
None blocking. v2 candidates:
- Auto-restart on `desired_state=running && actual_state=stopped` (separate UX decision).
- Per-server poll intervals or backoff for repeatedly-failing servers.
- A "drift" badge in the UI when `actual_state_updated_at` is older than 2× the poll interval (proxy for "the poller isn't running" or "the host is unreachable").
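The drift-badge rule in the last item is a single comparison. A hedged sketch follows, assuming `actual_state_updated_at` is stored as a timezone-aware `datetime`; the function name and the `factor` parameter are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_drifted(
    updated_at: datetime,
    now: datetime,
    poll_interval_seconds: float,
    factor: float = 2.0,
) -> bool:
    """True when the last refresh is older than `factor` poll intervals."""
    return now - updated_at > timedelta(seconds=poll_interval_seconds * factor)


if __name__ == "__main__":
    now = datetime(2026, 5, 9, 12, 0, 0, tzinfo=timezone.utc)
    print(is_drifted(now - timedelta(seconds=61), now, 30))
```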
## References
- systemd.unit(5) — `WantedBy=`, `Install` section semantics.
- systemctl(1) — `enable --now` / `disable --now` flags.
- Existing perf-baseline spec: `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`.
- Existing CPU-isolation spec: `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`.
- `AGENTS.md` — Host CLI write-command set is fixed; this spec preserves that contract.

@ -1,30 +0,0 @@
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Callable


class OverlayMounter(ABC):
    @abstractmethod
    def mount(
        self,
        *,
        lowerdirs: str,
        upperdir: Path,
        workdir: Path,
        merged: Path,
        on_stdout: Callable[[str], None] | None = None,
        on_stderr: Callable[[str], None] | None = None,
        passthrough: bool = False,
    ) -> None:
        raise NotImplementedError

    @abstractmethod
    def unmount(
        self,
        *,
        merged: Path,
        on_stdout: Callable[[str], None] | None = None,
        on_stderr: Callable[[str], None] | None = None,
        passthrough: bool = False,
    ) -> None:
        raise NotImplementedError

@ -1,53 +0,0 @@
from pathlib import Path
from typing import Callable

from l4d2host.fs.base import OverlayMounter
from l4d2host.process import run_command

HELPER_PATH = "/usr/local/libexec/left4me/left4me-overlay"


class KernelOverlayFSMounter(OverlayMounter):
    # Delegates the actual mount/umount syscalls to the privileged
    # left4me-overlay helper. The helper takes only the instance name and
    # rederives lowerdirs/upper/work/merged from disk; the OverlayMounter
    # ABC accepts those args for compatibility, so we extract the name
    # from the merged path's parent directory.
    def mount(
        self,
        *,
        lowerdirs: str,
        upperdir: Path,
        workdir: Path,
        merged: Path,
        on_stdout: Callable[[str], None] | None = None,
        on_stderr: Callable[[str], None] | None = None,
        passthrough: bool = False,
        should_cancel: Callable[[], bool] | None = None,
    ) -> None:
        del lowerdirs, upperdir, workdir
        run_command(
            ["sudo", "-n", HELPER_PATH, "mount", merged.parent.name],
            on_stdout=on_stdout,
            on_stderr=on_stderr,
            passthrough=passthrough,
            should_cancel=should_cancel,
        )

    def unmount(
        self,
        *,
        merged: Path,
        on_stdout: Callable[[str], None] | None = None,
        on_stderr: Callable[[str], None] | None = None,
        passthrough: bool = False,
        should_cancel: Callable[[], bool] | None = None,
    ) -> None:
        run_command(
            ["sudo", "-n", HELPER_PATH, "umount", merged.parent.name],
            on_stdout=on_stdout,
            on_stderr=on_stderr,
            passthrough=passthrough,
            should_cancel=should_cancel,
        )

@ -1,21 +1,16 @@
import os
from pathlib import Path
import shutil
import subprocess
from typing import Callable
from l4d2host.fs.kernel_overlayfs import KernelOverlayFSMounter
from l4d2host.paths import DEFAULT_LEFT4ME_ROOT, get_left4me_root, overlay_path, validate_instance_name
from l4d2host.service_control import start_service, stop_service
from l4d2host.service_control import disable_service, enable_service
from l4d2host.spec import load_spec
from l4d2host.logging import emit_step
_mounter = KernelOverlayFSMounter()
DEFAULT_ROOT = DEFAULT_LEFT4ME_ROOT
@ -63,16 +58,6 @@ def initialize_instance(
emit_step("initialization complete.", on_stdout, passthrough)
def _load_instance_env(path: Path) -> dict[str, str]:
result: dict[str, str] = {}
for line in path.read_text().splitlines():
if "=" not in line:
continue
key, value = line.split("=", 1)
result[key] = value
return result
def start_instance(
name: str,
*,
@ -87,25 +72,14 @@ def start_instance(
instance_dir = root / "instances" / name
runtime_dir = root / "runtime" / name
env = _load_instance_env(instance_dir / "instance.env")
merged = runtime_dir / "merged"
if os.path.ismount(merged):
# Kernel overlayfs mounts persist when the web worker dies (unlike
# fuse daemons, which were reaped with their cgroup). Refuse rather
# than double-mount.
raise subprocess.CalledProcessError(
returncode=1,
cmd=["start_instance"],
stderr=f"runtime overlay already mounted at {merged}; refusing to double-mount",
)
# Stage cfg files in the upper layer BEFORE mounting. Writing through
# merged after the mount triggers overlayfs copy-up, which preserves the
# lower file's ownership — and a script-sandbox-built `server.cfg` is
# owned by `l4d2-sandbox`, not the worker. Pre-mount writes go straight to
# upper with the worker's uid; the kernel just shows them at the top of
# the merged stack once mounted.
# Stage cfg files in the upper layer. Writing here goes straight to the
# upper dir on the host filesystem with the worker's uid; the unit's
# ExecStartPre then mounts the overlay (single source of truth for the
# mount), and the kernel surfaces these files at the top of the merged
# stack. A script-sandbox-built lower-layer `server.cfg` is owned by
# `l4d2-sandbox`, not the worker — staging in upper sidesteps the
# ownership-preserving copy-up that would happen if we wrote through
# merged post-mount.
emit_step("staging server.cfg + per-overlay aliases in upper layer...", on_stdout, passthrough)
upper_cfg_dir = runtime_dir / "upper" / "left4dead2" / "cfg"
upper_cfg_dir.mkdir(parents=True, exist_ok=True)
@ -121,20 +95,8 @@ def start_instance(
continue
shutil.copy2(src, upper_cfg_dir / f"server_{o.alias}.cfg")
emit_step("mounting runtime overlay...", on_stdout, passthrough)
_mounter.mount(
lowerdirs=env["L4D2_LOWERDIRS"],
upperdir=runtime_dir / "upper",
workdir=runtime_dir / "work",
merged=merged,
on_stdout=on_stdout,
on_stderr=on_stderr,
passthrough=passthrough,
should_cancel=should_cancel,
)
emit_step("starting systemd service...", on_stdout, passthrough)
start_service(
emit_step("enabling + starting systemd service...", on_stdout, passthrough)
enable_service(
name,
on_stdout=on_stdout,
on_stderr=on_stderr,
@ -155,25 +117,17 @@ def stop_instance(
) -> None:
name = validate_instance_name(name)
root = get_left4me_root() if root is None else Path(root)
emit_step("stopping systemd service...", on_stdout, passthrough)
stop_service(
# `disable --now` triggers the unit's ExecStopPost, which unmounts the
# overlay. Single source of truth for unmount lives in the unit file;
# no Python-side unmount needed.
emit_step("disabling + stopping systemd service...", on_stdout, passthrough)
disable_service(
name,
on_stdout=on_stdout,
on_stderr=on_stderr,
passthrough=passthrough,
should_cancel=should_cancel,
)
emit_step("unmounting runtime overlay (if mounted)...", on_stdout, passthrough)
try:
_mounter.unmount(
merged=root / "runtime" / name / "merged",
on_stdout=on_stdout,
on_stderr=on_stderr,
passthrough=passthrough,
should_cancel=should_cancel,
)
except subprocess.CalledProcessError:
pass
emit_step("stop complete.", on_stdout, passthrough)
@ -189,9 +143,13 @@ def _purge_instance(
instance_dir = root / "instances" / name
runtime_dir = root / "runtime" / name
emit_step("stopping systemd service (if running)...", on_stdout, passthrough)
# disable --now triggers ExecStopPost which unmounts. The try/except
# tolerates the unit-not-loaded case (e.g., delete on an instance that
# was initialized but never started — no unit, nothing to disable, no
# mount to clean up either).
emit_step("disabling + stopping systemd service (if running)...", on_stdout, passthrough)
try:
stop_service(
disable_service(
name,
on_stdout=on_stdout,
on_stderr=on_stderr,
@ -201,18 +159,6 @@ def _purge_instance(
except subprocess.CalledProcessError:
pass
emit_step("unmounting runtime overlay (if mounted)...", on_stdout, passthrough)
try:
_mounter.unmount(
merged=runtime_dir / "merged",
on_stdout=on_stdout,
on_stderr=on_stderr,
passthrough=passthrough,
should_cancel=should_cancel,
)
except subprocess.CalledProcessError:
pass
emit_step("removing instance files...", on_stdout, passthrough)
if instance_dir.exists():
shutil.rmtree(instance_dir)

@ -17,7 +17,7 @@ dependencies = [
l4d2ctl = "l4d2host.cli:app"
[tool.setuptools]
packages = ["l4d2host", "l4d2host.fs"]
packages = ["l4d2host"]
[tool.setuptools.package-dir]
l4d2host = "."

@ -17,7 +17,7 @@ def journalctl_command(name: str, lines: int = 200, follow: bool = True) -> list
return ["sudo", "-n", JOURNALCTL_HELPER, name, "--lines", str(lines), follow_arg]
def start_service(
def enable_service(
name: str,
*,
on_stdout: Callable[[str], None] | None = None,
@ -26,7 +26,7 @@ def start_service(
should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
return run_command(
systemctl_command("start", name),
systemctl_command("enable", name),
on_stdout=on_stdout,
on_stderr=on_stderr,
passthrough=passthrough,
@ -34,7 +34,7 @@ def start_service(
)
def stop_service(
def disable_service(
name: str,
*,
on_stdout: Callable[[str], None] | None = None,
@ -43,7 +43,7 @@ def stop_service(
should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
return run_command(
systemctl_command("stop", name),
systemctl_command("disable", name),
on_stdout=on_stdout,
on_stderr=on_stderr,
passthrough=passthrough,

@ -1,76 +0,0 @@
from pathlib import Path

import pytest

HELPER_PATH = "/usr/local/libexec/left4me/left4me-overlay"


def test_mount_invokes_helper_with_name_only(monkeypatch: pytest.MonkeyPatch) -> None:
    from l4d2host.fs.kernel_overlayfs import KernelOverlayFSMounter

    calls: list[list[str]] = []

    def fake_run_command(cmd, **kwargs):
        del kwargs
        calls.append(list(cmd))

    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
    KernelOverlayFSMounter().mount(
        lowerdirs="/var/lib/left4me/installation",
        upperdir=Path("/var/lib/left4me/runtime/alpha/upper"),
        workdir=Path("/var/lib/left4me/runtime/alpha/work"),
        merged=Path("/var/lib/left4me/runtime/alpha/merged"),
    )
    assert calls == [["sudo", "-n", HELPER_PATH, "mount", "alpha"]]


def test_unmount_invokes_helper_with_umount_verb(monkeypatch: pytest.MonkeyPatch) -> None:
    from l4d2host.fs.kernel_overlayfs import KernelOverlayFSMounter

    calls: list[list[str]] = []

    def fake_run_command(cmd, **kwargs):
        del kwargs
        calls.append(list(cmd))

    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
    KernelOverlayFSMounter().unmount(merged=Path("/var/lib/left4me/runtime/alpha/merged"))
    assert calls == [["sudo", "-n", HELPER_PATH, "umount", "alpha"]]


def test_mount_propagates_run_command_kwargs(monkeypatch: pytest.MonkeyPatch) -> None:
    from l4d2host.fs.kernel_overlayfs import KernelOverlayFSMounter

    captured: dict = {}

    def fake_run_command(cmd, **kwargs):
        captured["cmd"] = list(cmd)
        captured["kwargs"] = kwargs

    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
    out: list[str] = []
    err: list[str] = []
    KernelOverlayFSMounter().mount(
        lowerdirs="/var/lib/left4me/installation",
        upperdir=Path("/var/lib/left4me/runtime/alpha/upper"),
        workdir=Path("/var/lib/left4me/runtime/alpha/work"),
        merged=Path("/var/lib/left4me/runtime/alpha/merged"),
        on_stdout=out.append,
        on_stderr=err.append,
        passthrough=False,
        should_cancel=lambda: False,
    )
    assert captured["cmd"][0:3] == ["sudo", "-n", HELPER_PATH]
    captured["kwargs"]["on_stdout"]("hi")
    captured["kwargs"]["on_stderr"]("oops")
    assert out == ["hi"]
    assert err == ["oops"]
    assert captured["kwargs"]["passthrough"] is False
    assert callable(captured["kwargs"]["should_cancel"])

@ -29,19 +29,16 @@ def test_start_order(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
(instance_dir / "server.cfg").write_text("sv_consistency 1")
(instance_dir / "spec.yaml").write_text("port: 27015\noverlays: [x, y]\n")
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
start_instance("alpha", root=tmp_path)
assert calls[0] == [
"sudo",
"-n",
"/usr/local/libexec/left4me/left4me-overlay",
"mount",
"alpha",
# The mount is now driven by the unit's ExecStartPre (single source of
# truth), so start_instance only stages the cfgs and asks systemd to
# enable+start the unit.
assert calls == [
["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "enable", "alpha"],
]
assert calls[1] == ["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "start", "alpha"]
def test_start_copies_per_overlay_aliases_and_sweeps_stale(
@ -75,7 +72,6 @@ def test_start_copies_per_overlay_aliases_and_sweeps_stale(
(src_7 / "server.cfg").write_text("ignored: alias not set\n")
(upper_cfg_dir / "server_orphan.cfg").write_text("from previous start\n")
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
start_instance("alpha", root=tmp_path)
@ -87,36 +83,6 @@ def test_start_copies_per_overlay_aliases_and_sweeps_stale(
assert not (upper_cfg_dir / "server_overlay_7.cfg").exists(), "no alias in spec → no copy"
def test_start_refuses_to_double_mount(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
calls: list[list[str]] = []
def fake_run_command(cmd, **kwargs):
del kwargs
calls.append(list(cmd))
instance_dir = tmp_path / "instances" / "alpha"
runtime_dir = tmp_path / "runtime" / "alpha"
(runtime_dir / "merged").mkdir(parents=True)
instance_dir.mkdir(parents=True)
(instance_dir / "instance.env").write_text("L4D2_PORT=27015\nL4D2_ARGS=\nL4D2_LOWERDIRS=/x\n")
(instance_dir / "server.cfg").write_text("")
merged = runtime_dir / "merged"
def fake_ismount(path):
return Path(path) == merged
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
monkeypatch.setattr("l4d2host.instances.os.path.ismount", fake_ismount)
with pytest.raises(subprocess.CalledProcessError) as exc_info:
start_instance("alpha", root=tmp_path)
assert "already mounted" in (exc_info.value.stderr or "")
assert calls == [], "no mount/start commands must be issued when refusing"
def test_delete_missing_is_noop(tmp_path: Path) -> None:
delete_instance("missing", root=tmp_path)
@ -127,7 +93,7 @@ def test_delete_succeeds_when_stop_service_fails(tmp_path: Path, monkeypatch: py
def fake_run_command(cmd, **kwargs):
del kwargs
calls.append(list(cmd))
if cmd[:2] == ["sudo", "-n"] and "left4me-systemctl" in cmd[2] and "stop" in cmd:
if cmd[:2] == ["sudo", "-n"] and "left4me-systemctl" in cmd[2] and "disable" in cmd:
raise subprocess.CalledProcessError(
returncode=5,
cmd=list(cmd),
@ -137,7 +103,6 @@ def test_delete_succeeds_when_stop_service_fails(tmp_path: Path, monkeypatch: py
(tmp_path / "instances" / "alpha").mkdir(parents=True)
(tmp_path / "runtime" / "alpha" / "merged").mkdir(parents=True)
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
delete_instance("alpha", root=tmp_path)
@ -172,7 +137,6 @@ def test_reset_stops_unmounts_and_removes_dirs(tmp_path: Path, monkeypatch: pyte
(runtime_dir / "upper" / "logs").mkdir(parents=True)
(runtime_dir / "upper" / "logs" / "console.log").write_text("noise")
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
reset_instance("alpha", root=tmp_path)
@ -180,7 +144,7 @@ def test_reset_stops_unmounts_and_removes_dirs(tmp_path: Path, monkeypatch: pyte
assert not instance_dir.exists()
assert not runtime_dir.exists()
assert any("left4me-systemctl" in arg for cmd in calls for arg in cmd)
assert any("stop" in cmd for cmd in calls)
assert any("disable" in cmd for cmd in calls)
def test_reset_on_never_initialized_is_noop(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
@ -188,10 +152,9 @@ def test_reset_on_never_initialized_is_noop(tmp_path: Path, monkeypatch: pytest.
stop+unmount (both suppressed on failure) and not raise."""
def fake_run_command(cmd, **kwargs):
del kwargs
if "stop" in cmd:
if "disable" in cmd:
raise subprocess.CalledProcessError(returncode=5, cmd=list(cmd), stderr="not loaded")
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
reset_instance("alpha", root=tmp_path)
@ -210,68 +173,16 @@ def test_delete_stopped_instance_removes_dirs(tmp_path: Path, monkeypatch: pytes
(tmp_path / "instances" / "alpha").mkdir(parents=True)
(tmp_path / "runtime" / "alpha" / "merged").mkdir(parents=True)
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
delete_instance("alpha", root=tmp_path)
assert not (tmp_path / "instances" / "alpha").exists()
assert not (tmp_path / "runtime" / "alpha").exists()
assert ["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "stop", "alpha"] in calls
assert ["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "disable", "alpha"] in calls
def test_stop_succeeds_when_unmount_fails(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
umount_calls: list[list[str]] = []
def fake_run_command(cmd, **kwargs):
del kwargs
if cmd[:4] == [
"sudo",
"-n",
"/usr/local/libexec/left4me/left4me-overlay",
"umount",
]:
umount_calls.append(list(cmd))
raise subprocess.CalledProcessError(
returncode=1,
cmd=list(cmd),
stderr="umount: /var/lib/left4me/runtime/alpha/merged: not mounted",
)
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
stop_instance("alpha", root=tmp_path)
assert umount_calls, "stop must always attempt the overlay helper (no preflight)"
def test_delete_succeeds_when_unmount_fails(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
umount_calls: list[list[str]] = []
def fake_run_command(cmd, **kwargs):
del kwargs
if cmd[:4] == [
"sudo",
"-n",
"/usr/local/libexec/left4me/left4me-overlay",
"umount",
]:
umount_calls.append(list(cmd))
raise subprocess.CalledProcessError(
returncode=1,
cmd=list(cmd),
stderr="umount: /var/lib/left4me/runtime/alpha/merged: not mounted",
)
(tmp_path / "instances" / "alpha").mkdir(parents=True)
(tmp_path / "runtime" / "alpha" / "merged").mkdir(parents=True)
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
delete_instance("alpha", root=tmp_path)
assert umount_calls, "delete must always attempt the overlay helper (no preflight)"
assert not (tmp_path / "instances" / "alpha").exists()
assert not (tmp_path / "runtime" / "alpha").exists()
# test_stop_succeeds_when_unmount_fails / test_delete_succeeds_when_unmount_fails
# were removed when the Python-side unmount was dropped: the unit's
# ExecStopPost is now the single code path for unmount, so there's no
# Python-side failure to tolerate.

@ -0,0 +1,21 @@
from unittest.mock import patch

from l4d2host.service_control import (
    SYSTEMCTL_HELPER,
    disable_service,
    enable_service,
)


@patch("l4d2host.service_control.run_command")
def test_enable_service_invokes_helper_with_enable_action(mock_run):
    enable_service("instance-7")
    args, _ = mock_run.call_args
    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "enable", "instance-7"]


@patch("l4d2host.service_control.run_command")
def test_disable_service_invokes_helper_with_disable_action(mock_run):
    disable_service("instance-7")
    args, _ = mock_run.call_args
    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "disable", "instance-7"]

@ -18,7 +18,11 @@ from l4d2web.routes.overlay_routes import bp as overlay_bp
from l4d2web.routes.page_routes import bp as page_bp
from l4d2web.routes.server_routes import bp as server_bp
from l4d2web.routes.workshop_routes import bp as workshop_bp
from l4d2web.services.job_worker import recover_stale_jobs, start_job_workers
from l4d2web.services.job_worker import (
recover_stale_jobs,
start_job_workers,
start_state_poller,
)
def _in_flask_cli_context() -> bool:
@ -89,6 +93,7 @@ def create_app(test_config: dict[str, object] | None = None) -> Flask:
if should_start_workers:
recover_stale_jobs()
start_job_workers(app)
start_state_poller(app)
@app.get("/health")
def health():

@ -8,6 +8,7 @@ DEFAULT_CONFIG: dict[str, object] = {
"JOB_WORKER_THREADS": 4,
"JOB_WORKER_ENABLED": True,
"JOB_WORKER_POLL_SECONDS": 1,
"STATE_POLLER_INTERVAL_SECONDS": 30,
"JOB_LOG_REPLAY_LIMIT": 2000,
"JOB_LOG_LINE_MAX_CHARS": 4096,
"PORT_RANGE_START": 27015,
@ -27,6 +28,7 @@ def load_config() -> dict[str, object]:
"JOB_WORKER_THREADS": int(os.getenv("JOB_WORKER_THREADS", "4")),
"JOB_WORKER_ENABLED": _bool_from_env(os.getenv("JOB_WORKER_ENABLED", "true")),
"JOB_WORKER_POLL_SECONDS": float(os.getenv("JOB_WORKER_POLL_SECONDS", "1")),
"STATE_POLLER_INTERVAL_SECONDS": float(os.getenv("STATE_POLLER_INTERVAL_SECONDS", "30")),
"JOB_LOG_REPLAY_LIMIT": int(os.getenv("JOB_LOG_REPLAY_LIMIT", "2000")),
"JOB_LOG_LINE_MAX_CHARS": int(os.getenv("JOB_LOG_LINE_MAX_CHARS", "4096")),
"PORT_RANGE_START": int(os.getenv("LEFT4ME_PORT_RANGE_START", "27015")),

@ -614,3 +614,45 @@ def worker_loop(app, poll_seconds: float) -> None:
ran_job = False
if not ran_job:
time.sleep(poll_seconds)


def start_state_poller(app) -> None:
    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
    thread = threading.Thread(
        target=state_poller_loop,
        args=(app, interval),
        name="left4me-state-poller",
        daemon=True,
    )
    thread.start()


def state_poller_loop(app, interval: float) -> None:
    while True:
        try:
            with app.app_context():
                poll_all_servers()
        except Exception:
            pass
        time.sleep(interval)


def poll_all_servers() -> None:
    with session_scope() as db:
        active_server_ids = set(
            db.scalars(
                select(Job.server_id).where(
                    Job.state.in_(("queued", "running", "cancelling"))
                )
            ).all()
        )
        server_ids = [
            sid
            for sid in db.scalars(select(Server.id)).all()
            if sid not in active_server_ids
        ]
    for sid in server_ids:
        try:
            refresh_server_actual_state(sid)
        except Exception:
            pass

@ -843,3 +843,90 @@ def test_build_overlay_script_type_blocks_per_overlay(overlay_seeded_worker) ->
can_start(DummyJob(operation="build_overlay", overlay_id=ids.overlay + 1), state)
is True
)
# ---------------------------------------------------------------------------
# State poller tests — refresh Server.actual_state out-of-band so OOM kills,
# manual systemctl ops, and reboots no longer leave the DB on stale "running".
# ---------------------------------------------------------------------------


def test_state_poller_refreshes_each_server(seeded_worker, monkeypatch) -> None:
    from l4d2web.services import job_worker as jw

    worker_app, ids = seeded_worker
    refreshed: list[int] = []
    monkeypatch.setattr(
        jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid)
    )
    with worker_app.app_context():
        jw.poll_all_servers()
    assert sorted(refreshed) == sorted([ids.server_one, ids.server_two])


def test_state_poller_skips_servers_with_inflight_jobs(seeded_worker, monkeypatch) -> None:
    from l4d2web.services import job_worker as jw

    worker_app, ids = seeded_worker
    add_job(ids.user, "stop", server_id=ids.server_one, state="running")
    refreshed: list[int] = []
    monkeypatch.setattr(
        jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid)
    )
    with worker_app.app_context():
        jw.poll_all_servers()
    assert ids.server_one not in refreshed
    assert ids.server_two in refreshed


def test_state_poller_swallows_per_server_exceptions(seeded_worker, monkeypatch) -> None:
    from l4d2web.services import job_worker as jw

    worker_app, ids = seeded_worker
    refreshed: list[int] = []

    def fake_refresh(sid: int) -> None:
        if sid == ids.server_one:
            raise RuntimeError("simulated host failure")
        refreshed.append(sid)

    monkeypatch.setattr(jw, "refresh_server_actual_state", fake_refresh)
    with worker_app.app_context():
        jw.poll_all_servers()  # must not raise
    assert refreshed == [ids.server_two]


def test_state_poller_not_started_during_testing(monkeypatch, tmp_path) -> None:
    from l4d2web import app as app_module

    called: list = []
    db_url = f"sqlite:///{tmp_path/'poller-testing.db'}"
    monkeypatch.setattr(app_module, "start_state_poller", lambda app: called.append(app))
    app_module.create_app({"TESTING": True, "DATABASE_URL": db_url, "SECRET_KEY": "test"})
    assert called == []


def test_state_poller_started_when_workers_enabled_outside_testing(monkeypatch, tmp_path) -> None:
    from l4d2web import app as app_module

    called: list = []
    db_url = f"sqlite:///{tmp_path/'poller-enabled.db'}"
    monkeypatch.setattr(app_module, "start_state_poller", lambda app: called.append(app))
    monkeypatch.setattr(app_module, "start_job_workers", lambda app: None)
    monkeypatch.setattr(app_module, "recover_stale_jobs", lambda: None)
    app = app_module.create_app({"TESTING": False, "DATABASE_URL": db_url, "SECRET_KEY": "test"})
    assert called == [app]