Compare commits
28 commits
965b67e6fc...59771f91c4
| Author | SHA1 | Date |
|---|---|---|
| | 59771f91c4 | |
| | ff6ce7b091 | |
| | fc371711ec | |
| | a982995d5b | |
| | 56f5c30296 | |
| | 3d9b7ef771 | |
| | 519567e156 | |
| | b62fc08127 | |
| | 67b5521eb6 | |
| | 8552c559d3 | |
| | 1dd674714a | |
| | 3b0bde9b50 | |
| | 72cd7ca1ef | |
| | 20604dd79c | |
| | af3171102a | |
| | c91c029c38 | |
| | 17b7c2ff10 | |
| | e5126c8c0b | |
| | 9e0f6f17ef | |
| | 928519fa34 | |
| | 7e4a5691ed | |
| | b3fca4772c | |
| | 66d83a0282 | |
| | ad7d73608e | |
| | 7193163488 | |
| | 851e6629aa | |
| | b6574e308b | |
| | db3b149045 | |
30 changed files with 2815 additions and 359 deletions
|
|
@ -71,3 +71,85 @@ The web app currently supports two overlay surfaces:
|
|||
- `script` overlays — populated by an arbitrary user-authored bash script that runs inside `bubblewrap` + `systemd-run --scope` as the unprivileged `l4d2-sandbox` UID, with the overlay directory bind-mounted RW at `/overlay`. Resource caps: 1h walltime, 4 GB RAM, 512 tasks, 200% CPU, 20 GB post-build disk cap.
|
||||
|
||||
Both the caches and the overlay directories are owned by the `left4me` runtime user; if the web service ever runs as a different uid, ensure it shares a group with the host process and that both trees are group-readable.
|
||||
|
||||
## Performance Tuning
|
||||
|
||||
The deployment ships a host-side perf baseline (slices, unit directives, sysctls). See `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md` for design rationale.
|
||||
|
||||
The following knobs are documented escape hatches — they are **not** auto-applied. Apply only if you have measured a need and understand the failure modes.
|
||||
|
||||
### CPU governor
|
||||
|
||||
The performance governor squeezes a few percent off jitter under bursty load. `schedutil` is acceptable for sustained UDP workloads.
|
||||
|
||||
```sh
|
||||
sudo cpupower frequency-set -g performance
|
||||
```
|
||||
|
||||
Install via `sudo apt install linux-cpupower` if the binary isn't present.
|
||||
|
||||
Persist via your distro's CPU-frequency tooling (e.g. `/etc/default/cpufrequtils`).
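
Before switching governors it can help to see what each core is using right now. The following is a minimal sketch, assuming the standard cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`); the function name `read_governors` and its `sysfs_root` parameter are illustrative and not part of this repository:

```python
from pathlib import Path


def read_governors(sysfs_root: str = "/sys/devices/system/cpu") -> dict[str, str]:
    """Return {cpu_name: governor} for every core that exposes cpufreq."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(sysfs_root).glob(pattern)):
        # gov_file is .../cpuN/cpufreq/scaling_governor, so the CPU name
        # is two directory levels up.
        governors[gov_file.parent.parent.name] = gov_file.read_text().strip()
    return governors
```

An empty result usually means the host exposes no cpufreq interface (common on virtual machines), in which case there is probably nothing to tune here.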
|
||||
|
||||
### CPU isolation (cores)
|
||||
|
||||
The deploy script writes four `AllowedCPUs=` drop-ins so that, by default, only `l4d2-game.slice` is allowed to run on cores 1..N-1; `system.slice`, `user.slice`, and `l4d2-build.slice` are pinned to core 0. Game servers thus get the host minus core 0 exclusively, the build sandbox and the web app stay on core 0, and a logged-in admin running CPU-heavy work in their shell can't steal cycles from a live match.
|
||||
|
||||
Override the split by setting either env var when running the deploy:
|
||||
|
||||
```sh
|
||||
LEFT4ME_SYSTEM_CPUS="0,1" LEFT4ME_GAME_CPUS="2-7" deploy/deploy-test-server.sh deploy-user@host
|
||||
```
|
||||
|
||||
On single-core hosts the deploy skips the cpuset drop-ins entirely and prints a warning to stderr; the rest of the perf baseline (cgroup weights, sysctls, OOM scores) still applies. To force isolation on a single-core host anyway (rarely useful), set either env var explicitly.
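
The default split and the single-core skip described above can be modelled in a few lines. This is a sketch that mirrors the shell logic in the deploy script, not code from the repository; `resolve_cpu_split` is a hypothetical name, and `None` stands in for an unset environment variable:

```python
from typing import Optional, Tuple


def resolve_cpu_split(
    nproc: int,
    system_cpus: Optional[str] = None,
    game_cpus: Optional[str] = None,
) -> Optional[Tuple[str, str]]:
    """Return (system_cpus, game_cpus), or None when the drop-ins are skipped."""
    overridden = system_cpus is not None or game_cpus is not None
    if nproc < 2 and not overridden:
        # Single-core host with no explicit override: skip CPU isolation.
        return None
    # Like ${LEFT4ME_SYSTEM_CPUS:-0}: unset or empty falls back to core 0.
    resolved_system = system_cpus or "0"
    # Like the ${LEFT4ME_GAME_CPUS+x} check: any set value is used verbatim.
    resolved_game = game_cpus if game_cpus is not None else f"1-{nproc - 1}"
    return resolved_system, resolved_game
```

On an 8-core host this yields `("0", "1-7")`. Note that forcing isolation on a single-core host by setting only the system variable would leave the game range at its computed default of `1-0`, so in that case it seems safer to set both variables.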
|
||||
|
||||
Per-instance `CPUAffinity=` (next subsection) composes on top of this — the per-instance value must be a subset of `l4d2-game.slice`'s `AllowedCPUs=`, which the kernel enforces.
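
To check the subset requirement before writing a drop-in, the CPU lists can be expanded and compared. A small sketch, assuming the usual systemd list syntax (indices and ranges separated by commas or whitespace); `parse_cpu_list` and `affinity_fits_slice` are hypothetical helper names:

```python
def parse_cpu_list(spec: str) -> set[int]:
    """Expand a CPU list such as "0,2-4" into {0, 2, 3, 4}."""
    cpus: set[int] = set()
    # Treat whitespace and commas as equivalent separators.
    for part in ",".join(spec.split()).split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def affinity_fits_slice(cpu_affinity: str, allowed_cpus: str) -> bool:
    """True when every CPU in cpu_affinity is also in allowed_cpus."""
    return parse_cpu_list(cpu_affinity) <= parse_cpu_list(allowed_cpus)
```

For example, `affinity_fits_slice("2", "1-7")` holds, while pinning an instance to core 0 against a game slice of `1-7` does not.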
|
||||
|
||||
### Per-instance CPU affinity
|
||||
|
||||
`srcds` is single-threaded per instance. On a multi-core host, pinning each instance to its own core can cut jitter under contention. Drop in `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf`:
|
||||
|
||||
```ini
|
||||
[Service]
|
||||
CPUAffinity=2
|
||||
```
|
||||
|
||||
This pins the instance to CPU 2 specifically; per-instance values would typically be 1, 2, 3, ... so each server has its own core.
|
||||
|
||||
A reasonable strategy on an N-core host: leave core 0 for the kernel + IRQs + system services, then pin one instance per remaining core.
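
The one-instance-per-core strategy can be sketched as a function that produces the drop-in contents for each instance. This is an illustration only: `plan_affinity` is a hypothetical helper, and it assumes core 0 is reserved and cores 1..N-1 are available to game servers:

```python
def plan_affinity(instances: list[str], nproc: int, first_game_cpu: int = 1) -> dict[str, str]:
    """Map each instance's affinity.conf path to its contents, one core each."""
    game_cpus = list(range(first_game_cpu, nproc))
    if len(instances) > len(game_cpus):
        raise ValueError(
            f"{len(instances)} instances but only {len(game_cpus)} game cores"
        )
    plan = {}
    for name, cpu in zip(instances, game_cpus):
        path = f"/etc/systemd/system/left4me-server@{name}.service.d/affinity.conf"
        plan[path] = f"[Service]\nCPUAffinity={cpu}\n"
    return plan
```

The sketch refuses to double up instances on a core; whether sharing a core is acceptable depends on measured load and is left to the operator.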
|
||||
|
||||
### NIC tuning
|
||||
|
||||
Hardware-specific (install via `sudo apt install ethtool` if not present). On a host with a single primary interface (replace `eth0`):
|
||||
|
||||
```sh
|
||||
sudo ethtool -G eth0 rx 4096 tx 4096
|
||||
sudo ethtool -K eth0 gro on lro off
|
||||
```
|
||||
|
||||
If you run a high instance count, also pin the NIC's interrupts off the cores that game servers occupy (see `/proc/interrupts` and `/proc/irq/<n>/smp_affinity`).
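
`/proc/irq/<n>/smp_affinity` takes a hexadecimal CPU bitmask rather than a CPU list, which is easy to get wrong by hand. A minimal sketch of the conversion, assuming a host with fewer than 32 CPUs (larger hosts use comma-separated 32-bit groups, which this sketch does not produce); `smp_affinity_mask` is an illustrative name:

```python
def smp_affinity_mask(cpus: list[int]) -> str:
    """Return the hex bitmask selecting the given CPUs (bit N = CPU N)."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError(f"CPU index {cpu} outside the range this sketch handles")
        mask |= 1 << cpu
    return format(mask, "x")
```

Keeping NIC interrupts on core 0 therefore corresponds to the mask `1`, and cores 0 and 1 together to `3`.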
|
||||
|
||||
### Real-time scheduling (advanced, opt-in)
|
||||
|
||||
Source-engine servers do not need real-time scheduling, and a misbehaving `srcds` at any RT priority can starve kernel threads, even though the default `kernel.sched_rt_runtime_us=950000` reserves 5% of each scheduling period for non-RT tasks. Use only if you have a measured jitter problem that the baseline does not solve.
|
||||
|
||||
`/etc/systemd/system/left4me-server@.service.d/realtime.conf`:
|
||||
|
||||
```ini
|
||||
[Service]
|
||||
CPUSchedulingPolicy=fifo
|
||||
CPUSchedulingPriority=10
|
||||
LimitRTPRIO=10
|
||||
AmbientCapabilities=CAP_SYS_NICE
|
||||
```
|
||||
|
||||
The `AmbientCapabilities=CAP_SYS_NICE` line is needed because the service runs as `User=left4me` with `NoNewPrivileges=true`; without it some kernels/systemd combinations refuse to apply the RT policy.
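
The 5% figure mentioned earlier in this subsection follows from the two RT throttling sysctls. A short worked sketch, assuming the kernel default `kernel.sched_rt_period_us=1000000` alongside `kernel.sched_rt_runtime_us=950000`; `non_rt_share` is a hypothetical helper name:

```python
def non_rt_share(runtime_us: int = 950_000, period_us: int = 1_000_000) -> float:
    """Fraction of each period the kernel holds back for non-RT tasks."""
    if runtime_us < 0:
        # A runtime of -1 disables RT throttling: nothing is held back.
        return 0.0
    return (period_us - runtime_us) / period_us
```

With the defaults this gives 0.05: an RT task may still monopolise a core for up to 950 ms out of every second, which is why the throttle is a safety net rather than a guarantee of responsiveness.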
|
||||
|
||||
### Applying changes to running servers
|
||||
|
||||
Unit-file changes do not apply to already-running services. After any change:
|
||||
|
||||
```sh
|
||||
sudo systemctl daemon-reload
|
||||
# Restart each game server via the web UI's stop + start, or:
|
||||
sudo systemctl restart 'left4me-server@*.service'
|
||||
```
|
||||
|
|
|
|||
|
|
@ -136,6 +136,42 @@ $sudo_cmd chown -R left4me:left4me /opt/left4me
|
|||
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-web.service /usr/local/lib/systemd/system/left4me-web.service
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service /usr/local/lib/systemd/system/left4me-server@.service
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice
|
||||
|
||||
# CPU isolation via cgroup-v2 AllowedCPUs= drop-ins. Pin everything that
|
||||
# isn't a live game server to core 0; give game servers cores 1..N-1.
|
||||
# See docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md.
|
||||
# `nproc --all` reports installed processors regardless of the calling
|
||||
# shell's CPU affinity. Plain `nproc` honors Cpus_allowed of the calling
|
||||
# process, so on a host that already has the cpuset drop-ins applied
|
||||
# (system.slice → AllowedCPUs=0), the SSH login lands in user.slice with
|
||||
# AllowedCPUs=0 and `nproc` would return 1 — making subsequent deploys
|
||||
# wrongly think they're on a single-core box and skip CPU isolation.
|
||||
NPROC=$(nproc --all)
|
||||
SYSTEM_CPUS=${LEFT4ME_SYSTEM_CPUS:-0}
|
||||
if [ "${LEFT4ME_GAME_CPUS+x}" = x ]; then
|
||||
GAME_CPUS=$LEFT4ME_GAME_CPUS
|
||||
else
|
||||
GAME_CPUS="1-$((NPROC - 1))"
|
||||
fi
|
||||
if [ "$NPROC" -lt 2 ] && [ "${LEFT4ME_SYSTEM_CPUS+x}${LEFT4ME_GAME_CPUS+x}" = "" ]; then
|
||||
printf 'left4me deploy: skipping CPU isolation (nproc=%s); cpuset drop-ins not written.\n' "$NPROC" >&2
|
||||
else
|
||||
for slice_drop_in in \
|
||||
/etc/systemd/system/system.slice.d/99-left4me-cpuset.conf \
|
||||
/etc/systemd/system/user.slice.d/99-left4me-cpuset.conf \
|
||||
/etc/systemd/system/l4d2-build.slice.d/99-left4me-cpuset.conf; do
|
||||
$sudo_cmd mkdir -p "$(dirname "$slice_drop_in")"
|
||||
printf '[Slice]\nAllowedCPUs=%s\n' "$SYSTEM_CPUS" \
|
||||
| $sudo_cmd install -m 0644 -o root -g root /dev/stdin "$slice_drop_in"
|
||||
done
|
||||
$sudo_cmd mkdir -p /etc/systemd/system/l4d2-game.slice.d
|
||||
printf '[Slice]\nAllowedCPUs=%s\n' "$GAME_CPUS" \
|
||||
| $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
|
||||
/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf
|
||||
fi
|
||||
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-systemctl /usr/local/libexec/left4me/left4me-systemctl
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-journalctl /usr/local/libexec/left4me/left4me-journalctl
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-overlay /usr/local/libexec/left4me/left4me-overlay
|
||||
|
|
@ -154,6 +190,13 @@ $sudo_cmd install -m 0644 -o root -g root \
|
|||
/opt/left4me/deploy/files/etc/left4me/sandbox-resolv.conf \
|
||||
/etc/left4me/sandbox-resolv.conf
|
||||
|
||||
# Host perf-baseline sysctls. Apply with `sysctl --system` so values
|
||||
# take effect this deploy, not on next reboot.
|
||||
$sudo_cmd install -m 0644 -o root -g root \
|
||||
/opt/left4me/deploy/files/etc/sysctl.d/99-left4me.conf \
|
||||
/etc/sysctl.d/99-left4me.conf
|
||||
$sudo_cmd sysctl --system >/dev/null
|
||||
|
||||
# Stomp the file every deploy so newly added vars reach existing boxes.
|
||||
# SECRET_KEY is derived from /etc/machine-id so it stays stable across
|
||||
# redeploys (no session invalidation) without persisting state in /etc.
|
||||
|
|
|
|||
21
deploy/files/etc/sysctl.d/99-left4me.conf
Normal file
|
|
@ -0,0 +1,21 @@
|
|||
# Host-side perf baseline for left4me — see
|
||||
# docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
|
||||
#
|
||||
# UDP socket buffers: distro defaults of ~128 KiB are too small for sustained
|
||||
# Source-engine UDP across multiple instances. 8 MiB matches the standard
|
||||
# 1 Gbit recommendation; rmem_default/wmem_default protect sockets that don't
|
||||
# explicitly enlarge their buffers.
|
||||
net.core.rmem_max = 8388608
|
||||
net.core.wmem_max = 8388608
|
||||
net.core.rmem_default = 524288
|
||||
net.core.wmem_default = 524288
|
||||
|
||||
# Kernel softirq UDP path: the per-CPU backlog queue starts dropping packets
|
||||
# at the default 1000 under multi-instance burst; 5000 absorbs realistic peaks.
|
||||
# netdev_budget = 600 gives softirq more drain headroom per pass.
|
||||
net.core.netdev_max_backlog = 5000
|
||||
net.core.netdev_budget = 600
|
||||
|
||||
# Latency-sensitive default: avoid swap unless the box is really under
|
||||
# pressure. Harmless on swapless hosts.
|
||||
vm.swappiness = 10
|
||||
|
|
@ -0,0 +1,8 @@
|
|||
# Perf baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
|
||||
[Unit]
|
||||
Description=left4me script-sandbox build slice
|
||||
Before=slices.target
|
||||
|
||||
[Slice]
|
||||
CPUWeight=10
|
||||
IOWeight=10
|
||||
|
|
@ -0,0 +1,8 @@
|
|||
# Perf baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
|
||||
[Unit]
|
||||
Description=left4me game-server slice
|
||||
Before=slices.target
|
||||
|
||||
[Slice]
|
||||
CPUWeight=1000
|
||||
IOWeight=1000
|
||||
|
|
@ -2,6 +2,11 @@
|
|||
Description=left4me server instance %i
|
||||
After=network-online.target
|
||||
Wants=network-online.target
|
||||
# Bound the restart loop. Without these, a persistent ExecStartPre or
|
||||
# ExecStart failure spins indefinitely. Note: these are [Unit]-section
|
||||
# directives (systemd 230+), not [Service].
|
||||
StartLimitBurst=5
|
||||
StartLimitIntervalSec=60s
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
|
|
@ -9,10 +14,45 @@ User=left4me
|
|||
Group=left4me
|
||||
EnvironmentFile=/etc/left4me/host.env
|
||||
EnvironmentFile=/var/lib/left4me/instances/%i/instance.env
|
||||
WorkingDirectory=/var/lib/left4me/runtime/%i/merged/left4dead2
|
||||
# `-` prefix: chdir failure is non-fatal. systemd applies WorkingDirectory
|
||||
# before every Exec line — including ExecStartPre — but the merged dir only
|
||||
# exists once ExecStartPre's overlay mount succeeds. With `-`, ExecStartPre
|
||||
# runs in the unit's home (cwd doesn't matter for the mount helper); the
|
||||
# ExecStart re-applies WorkingDirectory after the mount and finds the dir.
|
||||
WorkingDirectory=-/var/lib/left4me/runtime/%i/merged/left4dead2
|
||||
# Single source of truth for the kernel-overlayfs mount lifecycle: the web
|
||||
# app's start_instance only stages cfg files and asks systemd to enable+
|
||||
# start this unit; the actual `mount -t overlay` lives here so reboot
|
||||
# auto-start works the same as a UI-driven start. ExecStopPost mirrors it
|
||||
# so the unmount lives in the same place — no Python-side _mounter needed
|
||||
# in stop/delete/reset paths. Both helper verbs are idempotent.
|
||||
#
|
||||
# `+` prefix runs the helper as PID 1 (root, no sandbox). Required because
|
||||
# the unit has NoNewPrivileges=true, which blocks sudo's setuid escalation
|
||||
# — and the helper itself needs root to nsenter into PID 1's mnt namespace
|
||||
# anyway. ExecStopPost (not ExecStop) so unmount runs after the cgroup is
|
||||
# cleared; ExecStop runs while srcds is still alive and would EBUSY.
|
||||
ExecStartPre=+/usr/local/libexec/left4me/left4me-overlay mount %i
|
||||
ExecStart=/var/lib/left4me/installation/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
|
||||
ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i
|
||||
Restart=on-failure
|
||||
RestartSec=5
|
||||
|
||||
# Resource control baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
|
||||
Slice=l4d2-game.slice
|
||||
Nice=-5
|
||||
IOSchedulingClass=best-effort
|
||||
IOSchedulingPriority=4
|
||||
OOMScoreAdjust=-200
|
||||
MemoryHigh=1.5G
|
||||
MemoryMax=2G
|
||||
TasksMax=256
|
||||
LimitNOFILE=65536
|
||||
KillSignal=SIGINT
|
||||
TimeoutStopSec=15s
|
||||
LogRateLimitIntervalSec=0
|
||||
|
||||
# Hardening (unchanged from previous baseline).
|
||||
NoNewPrivileges=true
|
||||
PrivateTmp=true
|
||||
PrivateDevices=true
|
||||
|
|
|
|||
|
|
@ -127,16 +127,30 @@ def exec_or_print(argv: list[str]) -> None:
|
|||
def cmd_mount(name: str) -> None:
|
||||
name = validate_name(name)
|
||||
r = root()
|
||||
runtime_name_dir = (r / "runtime" / name).resolve(strict=True)
|
||||
merged_for_check = (runtime_name_dir / "merged").resolve(strict=True)
|
||||
|
||||
# Idempotency for unit restart cycles: if a previous start mounted
|
||||
# successfully but ExecStart failed afterwards (and Restart=on-failure
|
||||
# fires another cycle), the second ExecStartPre would otherwise refuse
|
||||
# to mount-on-top. Short-circuit here so the second cycle just gets
|
||||
# straight to ExecStart. PRINT_ONLY (test mode) bypasses this so the
|
||||
# tests can exercise the full nsenter argv regardless of mount state.
|
||||
if (
|
||||
os.environ.get("LEFT4ME_OVERLAY_PRINT_ONLY") != "1"
|
||||
and os.path.ismount(merged_for_check)
|
||||
):
|
||||
return
|
||||
|
||||
instance_env = r / "instances" / name / "instance.env"
|
||||
raw_lowerdirs = parse_lowerdirs(instance_env)
|
||||
|
||||
allowed_roots = [(r / sub).resolve() for sub in LOWERDIR_ALLOWLIST]
|
||||
canonical_lowerdirs = [str(canonical_under(allowed_roots, Path(p))) for p in raw_lowerdirs]
|
||||
|
||||
runtime_name_dir = (r / "runtime" / name).resolve(strict=True)
|
||||
upper = (runtime_name_dir / "upper").resolve(strict=True)
|
||||
work = (runtime_name_dir / "work").resolve(strict=True)
|
||||
merged = (runtime_name_dir / "merged").resolve(strict=True)
|
||||
merged = merged_for_check
|
||||
for label, path in (("upper", upper), ("work", work), ("merged", merged)):
|
||||
if path.parent != runtime_name_dir:
|
||||
die(f"{label} resolved outside runtime/{name}: {path}")
|
||||
|
|
@ -164,6 +178,18 @@ def cmd_umount(name: str) -> None:
|
|||
merged = (runtime_name_dir / "merged").resolve(strict=True)
|
||||
if merged.parent != runtime_name_dir:
|
||||
die(f"merged resolved outside runtime/{name}: {merged}")
|
||||
|
||||
# Idempotency: if merged isn't a mount point right now, we have nothing
|
||||
# to do. Mirrors cmd_mount's symmetric check. ExecStopPost on the unit
|
||||
# is the one canonical caller, but a manual `systemctl reset-failed`
|
||||
# cycle or a redundant cleanup pass should still be a no-op. PRINT_ONLY
|
||||
# bypasses for the same reason as cmd_mount above.
|
||||
if (
|
||||
os.environ.get("LEFT4ME_OVERLAY_PRINT_ONLY") != "1"
|
||||
and not os.path.ismount(merged)
|
||||
):
|
||||
return
|
||||
|
||||
argv = [
|
||||
NSENTER,
|
||||
"--mount=/proc/1/ns/mnt",
|
||||
|
|
|
|||
|
|
@ -45,6 +45,8 @@ chmod 0755 "$OVERLAY_DIR"
|
|||
SCRIPT_RC=0
|
||||
systemd-run --quiet --collect --wait --pipe \
|
||||
--unit="left4me-script-${OVERLAY_ID}-$$" \
|
||||
--slice=l4d2-build.slice \
|
||||
-p OOMScoreAdjust=500 \
|
||||
-p User=l4d2-sandbox -p Group=l4d2-sandbox \
|
||||
-p UMask=0022 \
|
||||
-p NoNewPrivileges=yes \
|
||||
|
|
|
|||
|
|
@ -2,7 +2,7 @@
|
|||
set -eu
|
||||
|
||||
usage() {
|
||||
printf '%s\n' "usage: left4me-systemctl start|stop|show <server-name>" >&2
|
||||
printf '%s\n' "usage: left4me-systemctl enable|disable|show <server-name>" >&2
|
||||
exit 2
|
||||
}
|
||||
|
||||
|
|
@ -22,7 +22,7 @@ action=$1
|
|||
name=$2
|
||||
|
||||
case "$action" in
|
||||
start|stop|show) ;;
|
||||
enable|disable|show) ;;
|
||||
*) usage ;;
|
||||
esac
|
||||
|
||||
|
|
@ -38,7 +38,7 @@ else
|
|||
fi
|
||||
|
||||
case "$action" in
|
||||
start) exec "$systemctl" start "$unit" ;;
|
||||
stop) exec "$systemctl" stop "$unit" ;;
|
||||
enable) exec "$systemctl" enable --now "$unit" ;;
|
||||
disable) exec "$systemctl" disable --now "$unit" ;;
|
||||
show) exec "$systemctl" show --property=ActiveState --property=SubState "$unit" ;;
|
||||
esac
|
||||
|
|
|
|||
|
|
@ -9,6 +9,9 @@ DEPLOY = ROOT / "deploy"
|
|||
|
||||
WEB_UNIT = DEPLOY / "files/usr/local/lib/systemd/system/left4me-web.service"
|
||||
SERVER_UNIT = DEPLOY / "files/usr/local/lib/systemd/system/left4me-server@.service"
|
||||
GAME_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-game.slice"
|
||||
BUILD_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-build.slice"
|
||||
SYSCTL_CONF = DEPLOY / "files/etc/sysctl.d/99-left4me.conf"
|
||||
GLOBAL_REFRESH_SERVICE = DEPLOY / "files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.service"
|
||||
GLOBAL_REFRESH_TIMER = DEPLOY / "files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.timer"
|
||||
SANDBOX_UNIT_DIR = DEPLOY / "files/usr/local/lib/systemd/system"
|
||||
|
|
@ -60,7 +63,10 @@ def test_server_unit_contains_required_runtime_contract():
|
|||
assert "Group=left4me" in unit
|
||||
assert "EnvironmentFile=/etc/left4me/host.env" in unit
|
||||
assert "EnvironmentFile=/var/lib/left4me/instances/%i/instance.env" in unit
|
||||
assert "WorkingDirectory=/var/lib/left4me/runtime/%i/merged/left4dead2" in unit
|
||||
# `-` prefix: chdir failure is non-fatal so ExecStartPre can run the
|
||||
# mount helper before the merged dir exists. ExecStart re-applies and
|
||||
# finds the dir once the mount has landed.
|
||||
assert "WorkingDirectory=-/var/lib/left4me/runtime/%i/merged/left4dead2" in unit
|
||||
assert "ExecStart=/var/lib/left4me/installation/srcds_run" in unit
|
||||
assert "$L4D2_ARGS" in unit
|
||||
assert "${L4D2_ARGS}" not in unit
|
||||
|
|
@ -75,6 +81,176 @@ def test_server_unit_contains_required_runtime_contract():
|
|||
assert "LockPersonality=true" in unit
|
||||
|
||||
|
||||
def test_server_unit_mounts_overlay_via_exec_start_pre():
|
||||
"""At boot, systemd auto-starts enabled units before the web app gets a
|
||||
chance to run start_instance's pre-start mount. The unit itself must
|
||||
re-mount the overlay so reboots are transparent. Pairs with the helper's
|
||||
idempotency check (test_overlay_helper_mount_is_idempotent_when_already_mounted).
|
||||
"""
|
||||
unit = SERVER_UNIT.read_text()
|
||||
# `+` prefix: ExecStartPre runs as PID 1 (root, no sandbox). Required
|
||||
# because the unit has NoNewPrivileges=true, which blocks sudo's setuid
|
||||
# escalation — and the helper needs root for nsenter anyway.
|
||||
assert (
|
||||
"ExecStartPre=+/usr/local/libexec/left4me/left4me-overlay mount %i"
|
||||
in unit
|
||||
)
|
||||
# Bound the restart loop; without these, a CHDIR-failure (or any other
|
||||
# pre-start error) spins indefinitely.
|
||||
assert "StartLimitBurst=5" in unit
|
||||
assert "StartLimitIntervalSec=60s" in unit
|
||||
|
||||
|
||||
def test_server_unit_unmounts_overlay_via_exec_stop_post():
|
||||
"""Single source of truth for unmount, mirroring the mount path.
|
||||
ExecStopPost (not ExecStop) so it runs after srcds has fully exited
|
||||
and the cgroup is cleared — otherwise the open files in merged/ would
|
||||
EBUSY the umount syscall.
|
||||
"""
|
||||
unit = SERVER_UNIT.read_text()
|
||||
assert (
|
||||
"ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i"
|
||||
in unit
|
||||
)
|
||||
|
||||
|
||||
def test_overlay_helper_mount_is_idempotent_when_already_mounted():
|
||||
"""ExecStartPre runs on every Restart=on-failure cycle. If a previous
|
||||
start mounted successfully but ExecStart failed afterwards, the next
|
||||
ExecStartPre would re-mount on top -- which fails. The helper must
|
||||
short-circuit when merged is already a mount point.
|
||||
"""
|
||||
text = OVERLAY_HELPER.read_text()
|
||||
# Two ismount checks now: one in cmd_mount (skip if mounted),
|
||||
# one in cmd_umount (skip if not mounted).
|
||||
assert text.count("os.path.ismount") >= 2
|
||||
|
||||
|
||||
def test_server_unit_contains_perf_baseline_directives():
|
||||
unit = SERVER_UNIT.read_text()
|
||||
|
||||
# Slice membership.
|
||||
assert "Slice=l4d2-game.slice" in unit
|
||||
|
||||
# CFS priority bump (no SCHED_FIFO).
|
||||
assert "Nice=-5" in unit
|
||||
assert "CPUSchedulingPolicy=" not in unit
|
||||
|
||||
# I/O priority.
|
||||
assert "IOSchedulingClass=best-effort" in unit
|
||||
assert "IOSchedulingPriority=4" in unit
|
||||
|
||||
# OOM ordering: game servers survive, sandbox dies first.
|
||||
assert "OOMScoreAdjust=-200" in unit
|
||||
|
||||
# Memory caps with headroom for map-load spikes.
|
||||
assert "MemoryHigh=1.5G" in unit
|
||||
assert "MemoryMax=2G" in unit
|
||||
|
||||
# Bounded fork surface.
|
||||
assert "TasksMax=256" in unit
|
||||
|
||||
# Plenty of fds for plugin-heavy setups.
|
||||
assert "LimitNOFILE=65536" in unit
|
||||
|
||||
# srcds clean shutdown via SIGINT, with time to flush.
|
||||
assert "KillSignal=SIGINT" in unit
|
||||
assert "TimeoutStopSec=15s" in unit
|
||||
|
||||
# Per-unit override of journald rate limiting (default drops srcds output).
|
||||
assert "LogRateLimitIntervalSec=0" in unit
|
||||
|
||||
|
||||
def test_l4d2_game_slice_exists_with_high_weights():
|
||||
assert GAME_SLICE.is_file()
|
||||
text = GAME_SLICE.read_text()
|
||||
assert "[Slice]" in text
|
||||
assert "CPUWeight=1000" in text
|
||||
assert "IOWeight=1000" in text
|
||||
|
||||
|
||||
def test_l4d2_build_slice_exists_with_low_weights():
|
||||
assert BUILD_SLICE.is_file()
|
||||
text = BUILD_SLICE.read_text()
|
||||
assert "[Slice]" in text
|
||||
assert "CPUWeight=10" in text
|
||||
assert "IOWeight=10" in text
|
||||
|
||||
|
||||
def test_sysctl_conf_present_with_perf_settings():
|
||||
assert SYSCTL_CONF.is_file()
|
||||
text = SYSCTL_CONF.read_text()
|
||||
for line in (
|
||||
"net.core.rmem_max = 8388608",
|
||||
"net.core.wmem_max = 8388608",
|
||||
"net.core.rmem_default = 524288",
|
||||
"net.core.wmem_default = 524288",
|
||||
"net.core.netdev_max_backlog = 5000",
|
||||
"net.core.netdev_budget = 600",
|
||||
"vm.swappiness = 10",
|
||||
):
|
||||
assert line in text, f"missing {line!r} in 99-left4me.conf"
|
||||
|
||||
|
||||
def test_script_sandbox_in_build_slice_with_oom_adjust():
|
||||
text = SCRIPT_SANDBOX_HELPER.read_text()
|
||||
|
||||
# Put the transient unit in the low-weight build slice so it yields to
|
||||
# game-server instances under CPU/IO contention.
|
||||
assert "--slice=l4d2-build.slice" in text
|
||||
|
||||
# Sandbox dies first if the host hits memory pressure; servers
|
||||
# (OOMScoreAdjust=-200) survive.
|
||||
assert "-p OOMScoreAdjust=500" in text
|
||||
|
||||
|
||||
def test_deploy_script_installs_perf_artifacts():
|
||||
script = DEPLOY_SCRIPT.read_text()
|
||||
|
||||
# Slice files copied into the system-wide systemd unit dir.
|
||||
assert "/usr/local/lib/systemd/system/l4d2-game.slice" in script
|
||||
assert "/usr/local/lib/systemd/system/l4d2-build.slice" in script
|
||||
|
||||
# Sysctl drop-in installed under /etc/sysctl.d/.
|
||||
assert "/etc/sysctl.d/99-left4me.conf" in script
|
||||
|
||||
# Values applied immediately, not on next boot.
|
||||
assert "sysctl --system" in script
|
||||
|
||||
|
||||
def test_deploy_script_writes_cpuset_drop_ins():
|
||||
script = DEPLOY_SCRIPT.read_text()
|
||||
|
||||
# Reads nproc and binds defaults via ${VAR:-...}.
|
||||
assert "nproc" in script
|
||||
assert "LEFT4ME_SYSTEM_CPUS" in script
|
||||
assert "LEFT4ME_GAME_CPUS" in script
|
||||
assert "${LEFT4ME_SYSTEM_CPUS:-0}" in script
|
||||
|
||||
# Default game-core upper bound is computed from nproc; accept either
|
||||
# the NPROC-1 form or LEFT4ME_GAME_CPUS:-1- prefix.
|
||||
assert (
|
||||
"1-$((NPROC - 1))" in script
|
||||
or "1-$((NPROC-1))" in script
|
||||
or "1-$((nproc-1))" in script
|
||||
or "LEFT4ME_GAME_CPUS:-1-" in script
|
||||
)
|
||||
|
||||
# All four drop-in paths.
|
||||
for slice_name in ("system", "user", "l4d2-build", "l4d2-game"):
|
||||
assert (
|
||||
f"/etc/systemd/system/{slice_name}.slice.d/99-left4me-cpuset.conf"
|
||||
in script
|
||||
)
|
||||
|
||||
# Drop-ins use the existing install pattern.
|
||||
assert "install -m 0644 -o root -g root" in script
|
||||
|
||||
# Single-core host: skip with a warning to stderr.
|
||||
assert ("-lt 2" in script) or ("< 2" in script) or ("-ge 2" in script)
|
||||
assert "skipping CPU isolation" in script
|
||||
|
||||
|
||||
def _fake_command(tmp_path, command_name):
|
||||
marker = tmp_path / f"{command_name}.args"
|
||||
command = tmp_path / command_name
|
||||
|
|
@ -105,12 +281,16 @@ def test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args(tmp_pat
|
|||
|
||||
for args in [
|
||||
["bad/action", "alpha"],
|
||||
["start", ""],
|
||||
["start", ".hidden"],
|
||||
["start", "bad..name"],
|
||||
["start", "bad/name"],
|
||||
["start", "bad\\name"],
|
||||
["start", "bad name"],
|
||||
# `start` and `stop` are no longer accepted verbs — the lifecycle now
|
||||
# uses `enable`/`disable` for reboot survival via WantedBy= symlinks.
|
||||
["start", "alpha"],
|
||||
["stop", "alpha"],
|
||||
["enable", ""],
|
||||
["enable", ".hidden"],
|
||||
["enable", "bad..name"],
|
||||
["enable", "bad/name"],
|
||||
["enable", "bad\\name"],
|
||||
["enable", "bad name"],
|
||||
]:
|
||||
result = subprocess.run(["sh", str(SYSTEMCTL_HELPER), *args], env=_env_with_fake_commands(tmp_path), check=False)
|
||||
assert result.returncode != 0
|
||||
|
|
@ -118,8 +298,8 @@ def test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args(tmp_pat
|
|||
|
||||
script = SYSTEMCTL_HELPER.read_text()
|
||||
assert 'unit="left4me-server@${name}.service"' in script
|
||||
assert 'start) exec "$systemctl" start "$unit"' in script
|
||||
assert 'stop) exec "$systemctl" stop "$unit"' in script
|
||||
assert 'enable) exec "$systemctl" enable --now "$unit"' in script
|
||||
assert 'disable) exec "$systemctl" disable --now "$unit"' in script
|
||||
assert "--property=ActiveState" in script
|
||||
assert "--property=SubState" in script
|
||||
|
||||
|
|
|
|||
260
docs/superpowers/plans/2026-05-09-l4d2-cpu-isolation.md
Normal file
|
|
@ -0,0 +1,260 @@
|
|||
# L4D2 CPU Isolation Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Constrain every cgroup that isn't a live game server to core 0; give game servers cores 1..N-1 exclusively, scaled automatically across host sizes.
|
||||
|
||||
**Architecture:** Four `99-left4me-cpuset.conf` drop-ins under `/etc/systemd/system/{system,user,l4d2-build,l4d2-game}.slice.d/`, written by the deploy script from heredocs. `LEFT4ME_SYSTEM_CPUS` (default `0`) and `LEFT4ME_GAME_CPUS` (default `1-$((NPROC-1))`) are env-var overrides. Single-core hosts skip the cpuset writes with a warning.
|
||||
|
||||
**Tech Stack:** systemd cgroup-v2 `AllowedCPUs=` directive, bash heredoc + `install`, Linux `nproc(1)`, pytest text-assertion tests.
|
||||
|
||||
**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`
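
As an illustration of the architecture above, the four drop-ins and their contents can be expressed as a single mapping. This is a sketch for reasoning about the layout, not the deploy script itself (which writes the files from shell); `render_cpuset_dropins` is a hypothetical name:

```python
def render_cpuset_dropins(system_cpus: str, game_cpus: str) -> dict[str, str]:
    """Return {path: contents} for the four 99-left4me-cpuset.conf drop-ins."""
    assignments = {
        # Everything that is not a live game server shares the system cores.
        "system": system_cpus,
        "user": system_cpus,
        "l4d2-build": system_cpus,
        # Game servers get the remaining cores.
        "l4d2-game": game_cpus,
    }
    return {
        f"/etc/systemd/system/{slice_name}.slice.d/99-left4me-cpuset.conf":
            f"[Slice]\nAllowedCPUs={cpus}\n"
        for slice_name, cpus in assignments.items()
    }
```

Calling `render_cpuset_dropins("0", "1-7")` yields the default layout for an 8-core host.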
|
||||
|
||||
---
|
||||
|
||||
## File Structure
|
||||
|
||||
Files to modify:
|
||||
|
||||
- `deploy/deploy-test-server.sh` — compute `NPROC`, default `LEFT4ME_SYSTEM_CPUS=0` / `LEFT4ME_GAME_CPUS=1-$((NPROC-1))`, write four drop-in files. Skip when `nproc < 2` (with stderr warning) unless either env var is set explicitly.
|
||||
- `deploy/README.md` — append a "CPU isolation" subsection inside the existing "Performance Tuning" section.
|
||||
- `deploy/tests/test_deploy_artifacts.py` — new test functions.
|
||||
|
||||
No host library or web app changes.
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight
|
||||
|
||||
- [ ] **Step 0a: Verify clean working tree**
|
||||
|
||||
Run: `git status`
|
||||
Expected: `nothing to commit, working tree clean`
|
||||
|
||||
- [ ] **Step 0b: Verify the existing deploy tests are at the known-good baseline**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
|
||||
Expected: 35 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`).
|
||||
|
||||
If the count differs, stop and surface — this plan assumes that exact baseline.
|
||||
|
||||
---
|
||||
|
||||
## Task 1: Deploy-script CPU-isolation block + tests
|
||||
|
||||
Write the four drop-ins from the deploy script in one cohesive block. The block computes `NPROC` once, resolves both env vars (with defaults), guards single-core hosts, and writes each drop-in via the existing `install -m 0644 -o root -g root` pattern. Tests cover defaults, overrides, single-core skip, and drop-in paths.
|
||||
|
||||
**Files:**
|
||||
- Modify: `deploy/deploy-test-server.sh`
|
||||
- Modify: `deploy/tests/test_deploy_artifacts.py` (new test function)
|
||||
|
||||
- [ ] **Step 1.1: Add the failing test**
|
||||
|
||||
Open `deploy/tests/test_deploy_artifacts.py` and append the following test after `test_deploy_script_installs_perf_artifacts` (added on the perf-baseline branch):
|
||||
|
||||
```python
def test_deploy_script_writes_cpuset_drop_ins():
    script = DEPLOY_SCRIPT.read_text()
    # Whitespace-insensitive view, so `$((NPROC - 1))` and `$((NPROC-1))` both match.
    compact = script.replace(" ", "")

    # Reads nproc and binds defaults via ${VAR:-...}.
    assert "nproc" in script
    assert "LEFT4ME_SYSTEM_CPUS" in script
    assert "LEFT4ME_GAME_CPUS" in script
    assert "${LEFT4ME_SYSTEM_CPUS:-0}" in script
    # Default game-core expression: 1-(nproc-1). Match the form the
    # implementer chose; both `1-$((NPROC-1))` and `1-$((nproc-1))` are
    # acceptable as long as the upper bound is computed from nproc.
    assert ("1-$((NPROC-1))" in compact) or ("1-$((nproc-1))" in compact) \
        or ("LEFT4ME_GAME_CPUS:-1-" in compact)

    # All four drop-in paths, written either literally or through a
    # `${slice_name}` loop variable that iterates over the slice names.
    looped = "/etc/systemd/system/${slice_name}.slice.d/99-left4me-cpuset.conf"
    for slice_name in ("system", "user", "l4d2-build", "l4d2-game"):
        literal = f"/etc/systemd/system/{slice_name}.slice.d/99-left4me-cpuset.conf"
        assert literal in script or (looped in script and slice_name in script)

    # Drop-ins use the existing install pattern.
    assert "install -m 0644 -o root -g root" in script

    # Single-core host: skip with a warning to stderr.
    # Match either an explicit `nproc < 2` / `-lt 2` guard or `[ "$nproc" -ge 2 ]` form.
    assert ("nproc" in script) and (("-lt 2" in script) or ("-ge 2" in script) or ("< 2" in script))
    # Compare lowercase against lowercase; a mixed-case needle can never match `.lower()`.
    assert "skipping cpu isolation" in script.lower() or "skip cpu isolation" in script.lower()
```
|
||||
|
||||
- [ ] **Step 1.2: Run the new test, verify it fails**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_writes_cpuset_drop_ins -v`
|
||||
Expected: FAIL — none of the new strings exist yet.
|
||||
|
||||
- [ ] **Step 1.3: Edit the deploy script — add the cpuset block**
|
||||
|
||||
Open `deploy/deploy-test-server.sh`. Find the block that copies the slice files (added in the perf-baseline branch, around lines 139–140):
|
||||
|
||||
```sh
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice
|
||||
```
|
||||
|
||||
Immediately after that pair, before any of the helper-script copies that follow, insert this block:
|
||||
|
||||
```sh
# CPU isolation via cgroup-v2 AllowedCPUs= drop-ins. Pin everything that
# isn't a live game server to core 0; give game servers cores 1..N-1.
# See docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md.
NPROC=$(nproc)
SYSTEM_CPUS=${LEFT4ME_SYSTEM_CPUS:-0}
if [ "${LEFT4ME_GAME_CPUS+x}" = x ]; then
    GAME_CPUS=$LEFT4ME_GAME_CPUS
else
    GAME_CPUS="1-$((NPROC - 1))"
fi
if [ "$NPROC" -lt 2 ] && [ -z "${LEFT4ME_SYSTEM_CPUS+x}${LEFT4ME_GAME_CPUS+x}" ]; then
    printf 'left4me deploy: skipping CPU isolation (nproc=%s); cpuset drop-ins not written.\n' "$NPROC" >&2
else
    for slice_name in system user l4d2-build; do
        $sudo_cmd mkdir -p "/etc/systemd/system/${slice_name}.slice.d"
        printf '[Slice]\nAllowedCPUs=%s\n' "$SYSTEM_CPUS" \
            | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
                "/etc/systemd/system/${slice_name}.slice.d/99-left4me-cpuset.conf"
    done
    $sudo_cmd mkdir -p "/etc/systemd/system/l4d2-game.slice.d"
    printf '[Slice]\nAllowedCPUs=%s\n' "$GAME_CPUS" \
        | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
            "/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf"
fi
```
|
||||
|
||||
Notes for the implementer:
|
||||
|
||||
- The single-core skip only triggers when **neither** override is set. If the operator sets either `LEFT4ME_SYSTEM_CPUS` or `LEFT4ME_GAME_CPUS` explicitly on a single-core host, honor their intent.
|
||||
- `install -m 0644 -o root -g root /dev/stdin <dest>` is the idiomatic way to install a small generated file from a pipeline (matches the existing pattern for sandbox-resolv.conf, just with `/dev/stdin` as source).
|
||||
- The `mkdir -p` for each `.d` directory is required: `install` without `-D` does not create missing parent directories, and systemd only reads drop-ins from `<unit>.d/` directories that exist.
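
The decision logic of the cpuset block can also be modelled outside the shell, which makes the edge cases easier to reason about. The following is a minimal Python sketch and not part of the deploy script; `resolve_cpusets` and its return shape are hypothetical names chosen for illustration.

```python
from typing import Mapping, Optional, Tuple


def resolve_cpusets(nproc: int, env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Mirror the shell block: return (system_cpus, game_cpus), or None to skip.

    Hypothetical helper for illustration only; the deploy script itself
    stays in POSIX sh.
    """
    system_set = "LEFT4ME_SYSTEM_CPUS" in env
    game_set = "LEFT4ME_GAME_CPUS" in env

    # The single-core skip applies only when neither override is present.
    if nproc < 2 and not (system_set or game_set):
        return None

    # ${LEFT4ME_SYSTEM_CPUS:-0}: unset or empty falls back to "0".
    system_cpus = env.get("LEFT4ME_SYSTEM_CPUS") or "0"
    # ${LEFT4ME_GAME_CPUS+x}: any set value, even an empty one, is taken verbatim.
    game_cpus = env["LEFT4ME_GAME_CPUS"] if game_set else f"1-{nproc - 1}"
    return system_cpus, game_cpus


print(resolve_cpusets(8, {}))                            # ('0', '1-7')
print(resolve_cpusets(1, {}))                            # None
print(resolve_cpusets(1, {"LEFT4ME_SYSTEM_CPUS": "0"}))  # ('0', '1-0')
```

The third call shows a corner of the override rule: with only `LEFT4ME_SYSTEM_CPUS` set on a single-core host, the game default evaluates to `1-0`, which does not describe any core. Setting both variables is the safer choice in that situation.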
|
||||
|
||||
- [ ] **Step 1.4: Verify shell syntax still parses**
|
||||
|
||||
Run: `sh -n deploy/deploy-test-server.sh`
|
||||
Expected: exit 0, no output.
|
||||
|
||||
- [ ] **Step 1.5: Run the new test and full deploy test suite**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
|
||||
Expected: 36 passed, 1 failed. The failure is the pre-existing, unrelated test; the pass count rises from 35 to 36 because of the new test.
|
||||
|
||||
If your specific assertion forms in Step 1.1 don't match the implementation, adjust the test — but only the `or` branches; do not weaken the contract.
|
||||
|
||||
- [ ] **Step 1.6: Commit**
|
||||
|
||||
```bash
|
||||
git add deploy/deploy-test-server.sh deploy/tests/test_deploy_artifacts.py
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest
|
||||
|
||||
Computes NPROC at deploy time. Defaults LEFT4ME_SYSTEM_CPUS=0 and
|
||||
LEFT4ME_GAME_CPUS=1-(NPROC-1). Single-core hosts skip cpuset writes
|
||||
with a stderr warning unless an env var override is set. Spec:
|
||||
docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 2: README "CPU isolation" subsection
|
||||
|
||||
Append a subsection to `deploy/README.md` inside the existing "Performance Tuning" section, documenting the layout, the env-var overrides, the single-core skip, and the relationship to the existing per-instance `CPUAffinity=` escape hatch.
|
||||
|
||||
**Files:**
|
||||
- Modify: `deploy/README.md`
|
||||
|
||||
No test for this task — README content is documentation, not contract.
|
||||
|
||||
- [ ] **Step 2.1: Append the CPU isolation subsection**
|
||||
|
||||
Open `deploy/README.md`. Find the existing `### Per-instance CPU affinity` subsection (added in the perf-baseline branch). Insert a new subsection **immediately before** it (so the slice-level isolation is documented before the per-instance refinement that builds on top). The new subsection content:
|
||||
|
||||
```markdown
|
||||
### CPU isolation (cores)
|
||||
|
||||
The deploy script writes four `AllowedCPUs=` drop-ins so that, by default, only `l4d2-game.slice` is allowed to run on cores 1..N-1; `system.slice`, `user.slice`, and `l4d2-build.slice` are pinned to core 0. Game servers thus get the host minus core 0 exclusively, the build sandbox and the web app stay on core 0, and a logged-in admin running CPU-heavy work in their shell can't steal cycles from a live match.
|
||||
|
||||
Override the split by setting either env var when running the deploy:
|
||||
|
||||
```sh
|
||||
LEFT4ME_SYSTEM_CPUS="0,1" LEFT4ME_GAME_CPUS="2-7" deploy/deploy-test-server.sh deploy-user@host
|
||||
```
|
||||
|
||||
On single-core hosts the deploy skips the cpuset drop-ins entirely and prints a warning to stderr; the rest of the perf baseline (cgroup weights, sysctls, OOM scores) still applies. To force isolation on a single-core host anyway (rarely useful), set either env var explicitly.
|
||||
|
||||
Per-instance `CPUAffinity=` (next subsection) composes on top of this — the per-instance value must be a subset of `l4d2-game.slice`'s `AllowedCPUs=`, which the kernel enforces.
|
||||
```
|
||||
|
||||
(The outer triple-backticks above are markdown punctuation around this prompt block, not part of the README content. Inner code-block fences DO need to be written into the README. The `markdown` language tag on the outer fence in this plan is documentation-only.)
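
The subset relationship between a per-instance `CPUAffinity=` and the slice's `AllowedCPUs=` can be checked before deploying. The sketch below is an illustration only: `parse_cpu_list` and `affinity_fits` are hypothetical helpers, and the parser covers just comma- or whitespace-separated indices and inclusive ranges, which is a simplification of the syntax systemd accepts.

```python
from typing import Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Expand a CPU list such as "0,1" or "2-7" into a set of core indices."""
    cpus: Set[int] = set()
    for part in spec.replace(",", " ").split():
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def affinity_fits(instance_affinity: str, slice_allowed: str) -> bool:
    """True when the per-instance affinity is a subset of the slice's set."""
    return parse_cpu_list(instance_affinity) <= parse_cpu_list(slice_allowed)


print(affinity_fits("2,3", "1-7"))  # True: both cores belong to the game slice
print(affinity_fits("0-1", "1-7"))  # False: core 0 is reserved for system work
```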
|
||||
|
||||
- [ ] **Step 2.2: Run the full deploy test suite**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
|
||||
Expected: 36 passed, 1 failed (unchanged; README has no test).
|
||||
|
||||
- [ ] **Step 2.3: Commit**
|
||||
|
||||
```bash
|
||||
git add deploy/README.md
|
||||
git commit -m "$(cat <<'EOF'
|
||||
docs(deploy): document CPU isolation in performance-tuning section
|
||||
|
||||
Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS /
|
||||
LEFT4ME_GAME_CPUS overrides, the single-core skip, and the
|
||||
subset-of relationship with per-instance CPUAffinity=.
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Final Verification
|
||||
|
||||
- [ ] **Step F.1: Full deploy + host + web test sweep**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
|
||||
Expected: deploy 36 passed / 1 failed (pre-existing); host 111 passed / 1 skipped; web 313 passed / 1 skipped.
|
||||
|
||||
- [ ] **Step F.2: Working tree clean and commits in order**
|
||||
|
||||
Run: `git status && git log --oneline -5`
|
||||
Expected:
|
||||
- `git status`: clean.
|
||||
- Top of `git log`:
|
||||
1. `docs(deploy): document CPU isolation in performance-tuning section`
|
||||
2. `feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest`
|
||||
3. `docs(plans): l4d2 cpu isolation — implementation plan`
|
||||
4. `docs(specs): l4d2 cpu isolation — design`
|
||||
|
||||
- [ ] **Step F.3: Operator-side smoke test (deferred, not part of this plan)**
|
||||
|
||||
This plan ships artifacts. Confirming systemd actually enforces `AllowedCPUs=` on a real Trixie host is operator-side:
|
||||
|
||||
```sh
|
||||
deploy/deploy-test-server.sh deploy-user@example-host
|
||||
ssh deploy-user@example-host '
|
||||
systemctl cat system.slice | grep AllowedCPUs
|
||||
systemctl cat l4d2-game.slice | grep AllowedCPUs
|
||||
cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
|
||||
cat /sys/fs/cgroup/l4d2-game.slice/cpuset.cpus.effective
|
||||
'
|
||||
# Expect on an 8-core box:
|
||||
# system.slice → AllowedCPUs=0 → cpuset.cpus.effective = 0
|
||||
# l4d2-game.slice → AllowedCPUs=1-7 → cpuset.cpus.effective = 1-7
|
||||
```
|
||||
|
||||
End-to-end behavioural test (manual, ops-side): on a 4-core host, run two L4D2 instances and a script-sandbox build simultaneously. Confirm via `htop` (with the `PROCESSOR` column enabled, which shows the core each process last ran on) that the srcds processes only ever appear on cores 1, 2, 3 and that the sandbox and the web app stay on core 0.
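
As a scripted complement to watching `htop`, the allowed-core set of a running process can be read directly. This is a minimal sketch assuming a Linux host (`os.sched_getaffinity` is not available elsewhere) and assuming the cpuset restriction is reflected in the process affinity mask; `cores_outside` is a hypothetical helper name.

```python
import os
from typing import Iterable, Set


def cores_outside(pid: int, allowed: Iterable[int]) -> Set[int]:
    """Return cores the process may run on that are not in `allowed`.

    An empty result means the process is confined to the allowed set.
    """
    return set(os.sched_getaffinity(pid)) - set(allowed)


# PID 0 means "the calling process". On the host one would pass the PID of
# an srcds process and allowed={1, 2, 3} for the 4-core example above.
own_cores = os.sched_getaffinity(0)
print(cores_outside(0, own_cores))
```

A non-empty result for an srcds PID would indicate that the process is allowed onto a core outside `l4d2-game.slice`.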
|
||||
|
||||
---
|
||||
|
||||
## Out of Scope (do NOT implement here)
|
||||
|
||||
- Kernel `isolcpus=` / `nohz_full=` / `rcu_nocbs=` boot params.
|
||||
- NIC IRQ pinning automation.
|
||||
- Per-instance `CPUAffinity=` driven by a deploy-env knob.
|
||||
- A separate `l4d2-web.slice`.
|
||||
- Any web-app or host-library code changes.
|
||||
|
||||
If you find yourself touching any of these, stop — they belong in a separate spec.
|
||||
|
|
@@ -0,0 +1,686 @@
|
|||
# L4D2 Server Host Perf Baseline Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Apply a host-side performance and resource-isolation baseline (systemd directives, slice hierarchy, host sysctls) to every L4D2 server instance, leaving game ConVars to the maintainer.
|
||||
|
||||
**Architecture:** Add resource-control directives to `left4me-server@.service`; introduce two flat top-level slices (`l4d2-game.slice` weight 1000, `l4d2-build.slice` weight 10) so the kernel gives the build sandbox only a small share of CPU and I/O under contention; ship `/etc/sysctl.d/99-left4me.conf` for UDP buffer and netdev tuning; place the script-sandbox transient unit into `l4d2-build.slice` with `OOMScoreAdjust=500`. RT scheduling, CPU governor, CPUAffinity, and NIC tuning are documentation-only escape hatches.
|
||||
|
||||
**Tech Stack:** systemd unit files (service + slice), `systemd-run` properties, Linux sysctl, bash deploy script, pytest text-assertion tests under `deploy/tests/test_deploy_artifacts.py`.
|
||||
|
||||
**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`
|
||||
|
||||
---
|
||||
|
||||
## File Structure
|
||||
|
||||
Files to create:
|
||||
|
||||
- `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice` — high-weight slice for game-server instances.
|
||||
- `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice` — low-weight slice for sandboxed script-overlay builds.
|
||||
- `deploy/files/etc/sysctl.d/99-left4me.conf` — host UDP/netdev/swap sysctls.
|
||||
|
||||
Files to modify:
|
||||
|
||||
- `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` — add resource-control directives (`Slice`, `Nice`, `IOSchedulingClass`, `IOSchedulingPriority`, `OOMScoreAdjust`, `MemoryHigh`, `MemoryMax`, `TasksMax`, `LimitNOFILE`, `KillSignal`, `TimeoutStopSec`, `LogRateLimitIntervalSec`).
|
||||
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` — add `--slice=l4d2-build.slice` and `-p OOMScoreAdjust=500` to the `systemd-run` invocation.
|
||||
- `deploy/deploy-test-server.sh` — copy the two slice files and the sysctl conf during deploy; run `sysctl --system` so values take effect immediately.
|
||||
- `deploy/README.md` — append a "Performance tuning" section with the four documented escape hatches.
|
||||
- `deploy/tests/test_deploy_artifacts.py` — new tests for each artifact above (text assertions following the existing `assert "X" in text` style).
|
||||
|
||||
No application code (Python, Flask, host library) is touched.
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight
|
||||
|
||||
- [ ] **Step 0a: Verify clean working tree**
|
||||
|
||||
Run: `git status`
|
||||
Expected: `nothing to commit, working tree clean`
|
||||
|
||||
- [ ] **Step 0b: Verify the existing deploy tests pass**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
|
||||
Expected: all green.
|
||||
|
||||
If any test is already red, stop and surface — this plan assumes the baseline is green.
|
||||
|
||||
---
|
||||
|
||||
## Task 1: Per-Instance Unit Resource-Control Directives
|
||||
|
||||
Add the per-instance baseline to `left4me-server@.service`. This task is self-contained even though `Slice=l4d2-game.slice` references a slice that doesn't exist yet — systemd does not validate the reference until the unit is actually started, and the deploy artifact tests are pure text checks.
|
||||
|
||||
**Files:**
|
||||
- Modify: `deploy/files/usr/local/lib/systemd/system/left4me-server@.service`
|
||||
- Test: `deploy/tests/test_deploy_artifacts.py` (new test function)
|
||||
|
||||
- [ ] **Step 1.1: Add the failing test**
|
||||
|
||||
Open `deploy/tests/test_deploy_artifacts.py` and append (after `test_server_unit_contains_required_runtime_contract`):
|
||||
|
||||
```python
def test_server_unit_contains_perf_baseline_directives():
    unit = SERVER_UNIT.read_text()

    # Slice membership.
    assert "Slice=l4d2-game.slice" in unit

    # CFS priority bump (no SCHED_FIFO).
    assert "Nice=-5" in unit
    assert "CPUSchedulingPolicy=" not in unit

    # I/O priority.
    assert "IOSchedulingClass=best-effort" in unit
    assert "IOSchedulingPriority=4" in unit

    # OOM ordering: game servers survive, sandbox dies first.
    assert "OOMScoreAdjust=-200" in unit

    # Memory caps with headroom for map-load spikes.
    assert "MemoryHigh=1.5G" in unit
    assert "MemoryMax=2G" in unit

    # Bounded fork surface.
    assert "TasksMax=256" in unit

    # Plenty of fds for plugin-heavy setups.
    assert "LimitNOFILE=65536" in unit

    # srcds clean shutdown via SIGINT, with time to flush.
    assert "KillSignal=SIGINT" in unit
    assert "TimeoutStopSec=15s" in unit

    # Per-unit override of journald rate limiting (default drops srcds output).
    assert "LogRateLimitIntervalSec=0" in unit
```
|
||||
|
||||
- [ ] **Step 1.2: Run the new test, verify it fails**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_server_unit_contains_perf_baseline_directives -v`
|
||||
Expected: FAIL — first failing assert is on `Slice=l4d2-game.slice`.
|
||||
|
||||
- [ ] **Step 1.3: Edit the unit file**
|
||||
|
||||
Open `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` and replace its contents with:
|
||||
|
||||
```ini
|
||||
[Unit]
|
||||
Description=left4me server instance %i
|
||||
After=network-online.target
|
||||
Wants=network-online.target
|
||||
|
||||
[Service]
|
||||
Type=simple
|
||||
User=left4me
|
||||
Group=left4me
|
||||
EnvironmentFile=/etc/left4me/host.env
|
||||
EnvironmentFile=/var/lib/left4me/instances/%i/instance.env
|
||||
WorkingDirectory=/var/lib/left4me/runtime/%i/merged/left4dead2
|
||||
ExecStart=/var/lib/left4me/installation/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
|
||||
Restart=on-failure
|
||||
RestartSec=5
|
||||
|
||||
# Resource control baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
|
||||
Slice=l4d2-game.slice
|
||||
Nice=-5
|
||||
IOSchedulingClass=best-effort
|
||||
IOSchedulingPriority=4
|
||||
OOMScoreAdjust=-200
|
||||
MemoryHigh=1.5G
|
||||
MemoryMax=2G
|
||||
TasksMax=256
|
||||
LimitNOFILE=65536
|
||||
KillSignal=SIGINT
|
||||
TimeoutStopSec=15s
|
||||
LogRateLimitIntervalSec=0
|
||||
|
||||
# Hardening (unchanged from previous baseline).
|
||||
NoNewPrivileges=true
|
||||
PrivateTmp=true
|
||||
PrivateDevices=true
|
||||
ProtectHome=true
|
||||
ProtectSystem=strict
|
||||
ReadOnlyPaths=/var/lib/left4me/installation /var/lib/left4me/overlays
|
||||
ReadWritePaths=/var/lib/left4me/runtime/%i
|
||||
RestrictSUIDSGID=true
|
||||
LockPersonality=true
|
||||
|
||||
[Install]
|
||||
WantedBy=multi-user.target
|
||||
```
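
To make the headroom between `MemoryHigh=` and `MemoryMax=` concrete, the two values can be converted to bytes. This is a small worked example, assuming systemd's documented base-1024 interpretation of the `K`/`M`/`G`/`T` suffixes; `size_to_bytes` is a hypothetical helper, not a systemd API.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def size_to_bytes(value: str) -> int:
    """Convert a size such as "1.5G" or "2G" to bytes using base-1024 suffixes."""
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(float(value[:-1]) * UNITS[suffix])
    return int(value)


memory_high = size_to_bytes("1.5G")
memory_max = size_to_bytes("2G")
print(memory_high)               # 1610612736
print(memory_max)                # 2147483648
print(memory_max - memory_high)  # 536870912, i.e. 512 MiB between throttle and hard cap
```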
|
||||
|
||||
- [ ] **Step 1.4: Run the new test, verify it passes**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_server_unit_contains_perf_baseline_directives -v`
|
||||
Expected: PASS.
|
||||
|
||||
- [ ] **Step 1.5: Re-run the existing server-unit test, verify still passes**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_server_unit_contains_required_runtime_contract -v`
|
||||
Expected: PASS — the existing assertions (`User=left4me`, `Group=left4me`, hardening directives, etc.) still match.
|
||||
|
||||
- [ ] **Step 1.6: Commit**
|
||||
|
||||
```bash
|
||||
git add deploy/files/usr/local/lib/systemd/system/left4me-server@.service deploy/tests/test_deploy_artifacts.py
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(deploy): perf-baseline directives on left4me-server@.service
|
||||
|
||||
Slice=l4d2-game.slice, Nice=-5, IOSchedulingClass=best-effort,
|
||||
OOMScoreAdjust=-200, MemoryHigh=1.5G, MemoryMax=2G, TasksMax=256,
|
||||
LimitNOFILE=65536, KillSignal=SIGINT, TimeoutStopSec=15s,
|
||||
LogRateLimitIntervalSec=0. Spec:
|
||||
docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 2: Slice Unit Files
|
||||
|
||||
Create the two slice unit files. After this task the perf unit's `Slice=l4d2-game.slice` reference is satisfied.
|
||||
|
||||
**Files:**
|
||||
- Create: `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice`
|
||||
- Create: `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice`
|
||||
- Test: `deploy/tests/test_deploy_artifacts.py` (new constants + new test functions)
|
||||
|
||||
- [ ] **Step 2.1: Add path constants and failing tests**
|
||||
|
||||
Open `deploy/tests/test_deploy_artifacts.py`. After the existing `SERVER_UNIT = ...` line, add:
|
||||
|
||||
```python
|
||||
GAME_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-game.slice"
|
||||
BUILD_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-build.slice"
|
||||
```
|
||||
|
||||
After the new `test_server_unit_contains_perf_baseline_directives`, append:
|
||||
|
||||
```python
def test_l4d2_game_slice_exists_with_high_weights():
    assert GAME_SLICE.is_file()
    text = GAME_SLICE.read_text()
    assert "[Slice]" in text
    assert "CPUWeight=1000" in text
    assert "IOWeight=1000" in text


def test_l4d2_build_slice_exists_with_low_weights():
    assert BUILD_SLICE.is_file()
    text = BUILD_SLICE.read_text()
    assert "[Slice]" in text
    assert "CPUWeight=10" in text
    assert "IOWeight=10" in text
```
|
||||
|
||||
- [ ] **Step 2.2: Run the new tests, verify they fail**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_l4d2_game_slice_exists_with_high_weights deploy/tests/test_deploy_artifacts.py::test_l4d2_build_slice_exists_with_low_weights -v`
|
||||
Expected: FAIL on `assert GAME_SLICE.is_file()` (file does not exist).
|
||||
|
||||
- [ ] **Step 2.3: Create the game slice file**
|
||||
|
||||
Create `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice` with:
|
||||
|
||||
```ini
|
||||
[Unit]
|
||||
Description=left4me game-server slice
|
||||
Before=slices.target
|
||||
|
||||
[Slice]
|
||||
CPUWeight=1000
|
||||
IOWeight=1000
|
||||
```
|
||||
|
||||
- [ ] **Step 2.4: Create the build slice file**
|
||||
|
||||
Create `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice` with:
|
||||
|
||||
```ini
|
||||
[Unit]
|
||||
Description=left4me script-sandbox build slice
|
||||
Before=slices.target
|
||||
|
||||
[Slice]
|
||||
CPUWeight=10
|
||||
IOWeight=10
|
||||
```
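
The effect of the two weights can be estimated with simple arithmetic. The sketch below assumes the usual cgroup-v2 model in which sibling groups that are all runnable share CPU time in proportion to their weights, and it uses systemd's default `CPUWeight=100` for `system.slice`; `cpu_shares` is a hypothetical helper for illustration.

```python
from typing import Dict


def cpu_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Fraction of CPU time each busy sibling cgroup gets under full contention."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Only the game and build slices are busy: roughly 99% versus 1%.
print(cpu_shares({"l4d2-game.slice": 1000, "l4d2-build.slice": 10}))

# With system.slice (default weight 100) also busy, game still gets about 90%.
print(cpu_shares({"l4d2-game.slice": 1000, "l4d2-build.slice": 10, "system.slice": 100}))
```

Weights only matter when the slices actually compete for the same cores; an idle game slice leaves the full machine to the build.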
|
||||
|
||||
- [ ] **Step 2.5: Run the new tests, verify they pass**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_l4d2_game_slice_exists_with_high_weights deploy/tests/test_deploy_artifacts.py::test_l4d2_build_slice_exists_with_low_weights -v`
|
||||
Expected: PASS.
|
||||
|
||||
- [ ] **Step 2.6: Commit**
|
||||
|
||||
```bash
|
||||
git add deploy/files/usr/local/lib/systemd/system/l4d2-game.slice deploy/files/usr/local/lib/systemd/system/l4d2-build.slice deploy/tests/test_deploy_artifacts.py
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio
|
||||
|
||||
Flat top-level slices. Game wins under contention; build still gets
|
||||
the box when uncontended. Referenced by left4me-server@.service and
|
||||
the script-sandbox systemd-run invocation.
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 3: Host Sysctls
|
||||
|
||||
Ship a `/etc/sysctl.d/` drop-in for UDP buffers, netdev backlog, netdev budget, and `vm.swappiness`.
|
||||
|
||||
**Files:**
|
||||
- Create: `deploy/files/etc/sysctl.d/99-left4me.conf`
|
||||
- Test: `deploy/tests/test_deploy_artifacts.py` (new constant + new test function)
|
||||
|
||||
- [ ] **Step 3.1: Add path constant and failing test**
|
||||
|
||||
Open `deploy/tests/test_deploy_artifacts.py`. After the slice constants, add:
|
||||
|
||||
```python
|
||||
SYSCTL_CONF = DEPLOY / "files/etc/sysctl.d/99-left4me.conf"
|
||||
```
|
||||
|
||||
Append a new test:
|
||||
|
||||
```python
def test_sysctl_conf_present_with_perf_settings():
    assert SYSCTL_CONF.is_file()
    text = SYSCTL_CONF.read_text()
    for line in (
        "net.core.rmem_max = 8388608",
        "net.core.wmem_max = 8388608",
        "net.core.rmem_default = 524288",
        "net.core.wmem_default = 524288",
        "net.core.netdev_max_backlog = 5000",
        "net.core.netdev_budget = 600",
        "vm.swappiness = 10",
    ):
        assert line in text, f"missing {line!r} in 99-left4me.conf"
```
|
||||
|
||||
- [ ] **Step 3.2: Run the new test, verify it fails**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_sysctl_conf_present_with_perf_settings -v`
|
||||
Expected: FAIL on `assert SYSCTL_CONF.is_file()`.
|
||||
|
||||
- [ ] **Step 3.3: Create the sysctl conf file**
|
||||
|
||||
Create `deploy/files/etc/sysctl.d/99-left4me.conf` with:
|
||||
|
||||
```
|
||||
# Host-side perf baseline for left4me — see
|
||||
# docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
|
||||
#
|
||||
# UDP socket buffers: typical kernel defaults of ~208 KiB (212992 bytes) are too small for sustained
|
||||
# Source-engine UDP across multiple instances. 8 MiB matches the standard
|
||||
# 1 Gbit recommendation; rmem_default/wmem_default protect sockets that don't
|
||||
# explicitly enlarge their buffers.
|
||||
net.core.rmem_max = 8388608
|
||||
net.core.wmem_max = 8388608
|
||||
net.core.rmem_default = 524288
|
||||
net.core.wmem_default = 524288
|
||||
|
||||
# Kernel softirq UDP path: the per-CPU backlog queue starts dropping packets
|
||||
# at the default 1000 under multi-instance burst; 5000 absorbs realistic peaks.
|
||||
# netdev_budget = 600 gives softirq more drain headroom per pass.
|
||||
net.core.netdev_max_backlog = 5000
|
||||
net.core.netdev_budget = 600
|
||||
|
||||
# Latency-sensitive default: avoid swap unless the box is really under
|
||||
# pressure. Harmless on swapless hosts.
|
||||
vm.swappiness = 10
|
||||
```
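
After a deploy, the values in this file can be compared against what the kernel currently reports. The following is a minimal sketch under the assumption that each sysctl key maps to a file below `/proc/sys` with dots replaced by slashes; `parse_sysctl_conf` and `proc_path` are hypothetical helpers, and the sample text stands in for the real file.

```python
from pathlib import Path
from typing import Dict


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Parse `key = value` lines, skipping blanks and `#` or `;` comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def proc_path(key: str) -> Path:
    """Map a sysctl key such as net.core.rmem_max to its /proc/sys file."""
    return Path("/proc/sys") / key.replace(".", "/")


sample = "# comment\nnet.core.rmem_max = 8388608\nvm.swappiness = 10\n"
for key, wanted in parse_sysctl_conf(sample).items():
    try:
        live = proc_path(key).read_text().strip()
    except OSError:
        live = "unavailable"  # not Linux, or the key is not exposed here
    print(f"{key}: wanted={wanted} live={live}")
```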
|
||||
|
||||
- [ ] **Step 3.4: Run the new test, verify it passes**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_sysctl_conf_present_with_perf_settings -v`
|
||||
Expected: PASS.
|
||||
|
||||
- [ ] **Step 3.5: Commit**
|
||||
|
||||
```bash
|
||||
git add deploy/files/etc/sysctl.d/99-left4me.conf deploy/tests/test_deploy_artifacts.py
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(deploy): host sysctls for UDP buffers + netdev backlog/budget
|
||||
|
||||
99-left4me.conf: rmem_max/wmem_max=8M (with 512K defaults),
|
||||
netdev_max_backlog=5000, netdev_budget=600, vm.swappiness=10.
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 4: Sandbox in Build Slice
|
||||
|
||||
Place the script-sandbox transient unit into `l4d2-build.slice` and give it `OOMScoreAdjust=500` so it dies first under memory pressure.
|
||||
|
||||
**Files:**
|
||||
- Modify: `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`
|
||||
- Test: `deploy/tests/test_deploy_artifacts.py` (new test function)
|
||||
|
||||
- [ ] **Step 4.1: Add the failing test**
|
||||
|
||||
Open `deploy/tests/test_deploy_artifacts.py`. Append:
|
||||
|
||||
```python
def test_script_sandbox_in_build_slice_with_oom_adjust():
    text = SCRIPT_SANDBOX_HELPER.read_text()

    # Put the transient unit in the low-weight build slice so it yields to
    # game-server instances under CPU/IO contention.
    assert "--slice=l4d2-build.slice" in text

    # Sandbox dies first if the host hits memory pressure; servers
    # (OOMScoreAdjust=-200) survive.
    assert "-p OOMScoreAdjust=500" in text
```
|
||||
|
||||
- [ ] **Step 4.2: Run the new test, verify it fails**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_script_sandbox_in_build_slice_with_oom_adjust -v`
|
||||
Expected: FAIL — neither string is in the helper yet.
|
||||
|
||||
- [ ] **Step 4.3: Edit the sandbox helper**
|
||||
|
||||
Open `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`. Locate the `systemd-run` invocation that begins with:
|
||||
|
||||
```
|
||||
systemd-run --quiet --collect --wait --pipe \
|
||||
--unit="left4me-script-${OVERLAY_ID}-$$" \
|
||||
```
|
||||
|
||||
Insert two new lines immediately after the `--unit=` line, before `-p User=l4d2-sandbox`. The block becomes:
|
||||
|
||||
```
|
||||
systemd-run --quiet --collect --wait --pipe \
|
||||
--unit="left4me-script-${OVERLAY_ID}-$$" \
|
||||
--slice=l4d2-build.slice \
|
||||
-p OOMScoreAdjust=500 \
|
||||
-p User=l4d2-sandbox -p Group=l4d2-sandbox \
|
||||
```
|
||||
|
||||
Leave every other `-p` line untouched.
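
Why `OOMScoreAdjust=500` on the sandbox and `-200` on the servers produces the intended kill order can be illustrated with a simplified model of the kernel's OOM scoring, in which the adjustment contributes `adj / 1000` of total memory on top of the process footprint. This is a rough sketch under that assumption, not the kernel's exact algorithm, and all page counts are hypothetical.

```python
def oom_badness(resident_pages: int, oom_score_adj: int, total_pages: int) -> int:
    """Simplified badness score: memory footprint plus adj/1000 of total memory."""
    return resident_pages + oom_score_adj * total_pages // 1000


TOTAL = 4_000_000                            # hypothetical host, about 16 GB in 4 KiB pages
server = oom_badness(400_000, -200, TOTAL)   # hypothetical game server, about 1.6 GB
sandbox = oom_badness(100_000, 500, TOTAL)   # hypothetical build, about 0.4 GB
print(server, sandbox, sandbox > server)     # -400000 2100000 True
```

Even though the hypothetical server uses four times the memory of the build, the adjustment dominates and the sandbox scores higher, so it would be selected first.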
|
||||
|
||||
- [ ] **Step 4.4: Verify shell syntax still parses**
|
||||
|
||||
Run: `bash -n deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`
|
||||
Expected: exit 0, no output.
|
||||
|
||||
- [ ] **Step 4.5: Run the new test and the existing sandbox-helper tests, verify they pass**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_script_sandbox_in_build_slice_with_oom_adjust deploy/tests/test_deploy_artifacts.py::test_script_sandbox_helper_invokes_systemd_run_with_hardening deploy/tests/test_deploy_artifacts.py::test_script_sandbox_helper_passes_shell_syntax_check -v`
|
||||
Expected: PASS for all three. The hardening test still matches because it only checks for substring presence; we added strings, didn't remove any.
|
||||
|
||||
- [ ] **Step 4.6: Commit**
|
||||
|
||||
```bash
|
||||
git add deploy/files/usr/local/libexec/left4me/left4me-script-sandbox deploy/tests/test_deploy_artifacts.py
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500
|
||||
|
||||
Builds yield CPU/IO to game-server instances under contention via the
|
||||
slice's weight=10, and are killed first under memory pressure
|
||||
(servers have OOMScoreAdjust=-200).
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 5: Deploy Script Installs Slice + Sysctl Artifacts
|
||||
|
||||
Wire the new artifacts into `deploy-test-server.sh` so a fresh deploy actually puts them on disk and applies the sysctls.
|
||||
|
||||
**Files:**
|
||||
- Modify: `deploy/deploy-test-server.sh`
|
||||
- Test: `deploy/tests/test_deploy_artifacts.py` (new test function)
|
||||
|
||||
- [ ] **Step 5.1: Add the failing test**
|
||||
|
||||
Open `deploy/tests/test_deploy_artifacts.py`. Append:
|
||||
|
||||
```python
def test_deploy_script_installs_perf_artifacts():
    script = DEPLOY_SCRIPT.read_text()

    # Slice files copied into the system-wide systemd unit dir.
    assert "/usr/local/lib/systemd/system/l4d2-game.slice" in script
    assert "/usr/local/lib/systemd/system/l4d2-build.slice" in script

    # Sysctl drop-in installed under /etc/sysctl.d/.
    assert "/etc/sysctl.d/99-left4me.conf" in script

    # Values applied immediately, not on next boot.
    assert "sysctl --system" in script
```
|
||||
|
||||
- [ ] **Step 5.2: Run the new test, verify it fails**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_installs_perf_artifacts -v`
|
||||
Expected: FAIL on the first assertion.
|
||||
|
||||
- [ ] **Step 5.3: Edit the deploy script — copy the slice + sysctl files**
|
||||
|
||||
Open `deploy/deploy-test-server.sh`. Find the block that copies unit files (currently around line 138):
|
||||
|
||||
```sh
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-web.service /usr/local/lib/systemd/system/left4me-web.service
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service /usr/local/lib/systemd/system/left4me-server@.service
|
||||
```
|
||||
|
||||
Add two new lines immediately after the `left4me-server@.service` copy line, so the block becomes:
|
||||
|
||||
```sh
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-web.service /usr/local/lib/systemd/system/left4me-web.service
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service /usr/local/lib/systemd/system/left4me-server@.service
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
|
||||
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice
|
||||
```
|
||||
|
||||
- [ ] **Step 5.4: Edit the deploy script — install the sysctl conf and apply it**
|
||||
|
||||
In `deploy/deploy-test-server.sh`, find the block that installs `/etc/left4me/sandbox-resolv.conf` (currently around lines 153–155):
|
||||
|
||||
```sh
|
||||
$sudo_cmd install -m 0644 -o root -g root \
|
||||
/opt/left4me/deploy/files/etc/left4me/sandbox-resolv.conf \
|
||||
/etc/left4me/sandbox-resolv.conf
|
||||
```
|
||||
|
||||
Immediately after that block, add:
|
||||
|
||||
```sh
# Host perf-baseline sysctls. Apply with `sysctl --system` so values
# take effect this deploy, not on next reboot.
$sudo_cmd install -m 0644 -o root -g root \
    /opt/left4me/deploy/files/etc/sysctl.d/99-left4me.conf \
    /etc/sysctl.d/99-left4me.conf
$sudo_cmd sysctl --system >/dev/null
```
|
||||
|
||||
- [ ] **Step 5.5: Verify the deploy script's shell syntax still parses**
|
||||
|
||||
Run: `sh -n deploy/deploy-test-server.sh`
|
||||
Expected: exit 0, no output.
|
||||
|
||||
- [ ] **Step 5.6: Run the new test and the existing deploy-script tests, verify they pass**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_installs_perf_artifacts deploy/tests/test_deploy_artifacts.py::test_deploy_script_has_safe_defaults_and_preserves_state deploy/tests/test_deploy_artifacts.py::test_deploy_script_shell_syntax -v`
|
||||
Expected: PASS for all three.
|
||||
|
||||
- [ ] **Step 5.7: Commit**
|
||||
|
||||
```bash
|
||||
git add deploy/deploy-test-server.sh deploy/tests/test_deploy_artifacts.py
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(deploy): install slice + sysctl artifacts and apply via sysctl --system
|
||||
|
||||
Copies l4d2-game.slice and l4d2-build.slice into
|
||||
/usr/local/lib/systemd/system/, installs 99-left4me.conf into
|
||||
/etc/sysctl.d/, and runs sysctl --system so the perf baseline is
|
||||
live this deploy, not on next reboot.
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 6: Performance-Tuning Section in deploy/README.md
|
||||
|
||||
Document the four escape hatches the spec lists as opt-in: CPU governor, per-instance `CPUAffinity`, NIC tuning, and SCHED_FIFO.
|
||||
|
||||
**Files:**
|
||||
- Modify: `deploy/README.md`
|
||||
|
||||
No test for this task — README content is documentation, not contract.
|
||||
|
||||
- [ ] **Step 6.1: Append the Performance Tuning section**
|
||||
|
||||
Open `deploy/README.md`. Append (after the existing final paragraph) a new section:
|
||||
|
||||
```markdown
|
||||
## Performance Tuning
|
||||
|
||||
The deployment ships a host-side perf baseline (slices, unit directives, sysctls). See `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md` for design rationale.
|
||||
|
||||
The following knobs are documented escape hatches — they are **not** auto-applied. Apply only if you have measured a need and understand the failure modes.
|
||||
|
||||
### CPU governor
|
||||
|
||||
The performance governor squeezes a few percent off jitter under bursty load. `schedutil` is acceptable for sustained UDP workloads.
|
||||
|
||||
```sh
|
||||
sudo cpupower frequency-set -g performance
|
||||
```
|
||||
|
||||
Persist via your distro's CPU-frequency tooling (e.g. `/etc/default/cpufrequtils`).
|
||||
|
||||
### Per-instance CPU affinity
|
||||
|
||||
`srcds` is single-threaded per instance. On a multi-core host, pinning each instance to its own core can cut jitter under contention. Drop in `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf`:
|
||||
|
||||
```ini
|
||||
[Service]
|
||||
CPUAffinity=2
|
||||
```
|
||||
|
||||
A reasonable strategy on an N-core host: leave core 0 for the kernel + IRQs + system services, then pin one instance per remaining core.
|
||||
|
||||
### NIC tuning
|
||||
|
||||
Hardware-specific. On a host with a single primary interface (replace `eth0`):
|
||||
|
||||
```sh
|
||||
sudo ethtool -G eth0 rx 4096 tx 4096
|
||||
sudo ethtool -K eth0 gro on lro off
|
||||
```
|
||||
|
||||
If you run a high instance count, also pin the NIC's interrupts off the cores that game servers occupy (see `/proc/interrupts` and `/proc/irq/<n>/smp_affinity`).
|
||||
|
||||
### Real-time scheduling (advanced, opt-in)
|
||||
|
||||
Source-engine servers do not need real-time scheduling, and a misbehaving `srcds` at any RT priority can starve kernel threads, even though the default `kernel.sched_rt_runtime_us=950000` reserves 5% of each one-second scheduling period for non-RT tasks. Use only if you have a measured jitter problem that the baseline does not solve.
|
||||
|
||||
`/etc/systemd/system/left4me-server@.service.d/realtime.conf`:
|
||||
|
||||
```ini
|
||||
[Service]
|
||||
CPUSchedulingPolicy=fifo
|
||||
CPUSchedulingPriority=10
|
||||
LimitRTPRIO=10
|
||||
```
|
||||
|
||||
### Applying changes to running servers
|
||||
|
||||
Unit-file changes do not apply to already-running services. After any change:
|
||||
|
||||
```sh
|
||||
sudo systemctl daemon-reload
|
||||
# Restart each game server via the web UI's stop + start, or:
|
||||
sudo systemctl restart 'left4me-server@*.service'
|
||||
```
|
||||
```
|
||||
|
||||
- [ ] **Step 6.2: Run the full deploy test suite and verify it stays green**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
|
||||
Expected: all green. README changes have no test, but should not break any existing tests.
|
||||
|
||||
- [ ] **Step 6.3: Commit**
|
||||
|
||||
```bash
|
||||
git add deploy/README.md
|
||||
git commit -m "$(cat <<'EOF'
|
||||
docs(deploy): performance-tuning escape-hatch section in README
|
||||
|
||||
Documents CPU governor, per-instance CPUAffinity, NIC tuning, and
|
||||
SCHED_FIFO opt-in patterns. None of these are auto-applied; they're
|
||||
ops-side knobs for measured problems the perf baseline doesn't solve.
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Final Verification
|
||||
|
||||
- [ ] **Step F.1: Full deploy test suite green**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ -q`
|
||||
Expected: all green.
|
||||
|
||||
- [ ] **Step F.2: Host library + web tests still green (regression check)**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2host/tests -q && pytest l4d2web/tests -q`
|
||||
Expected: all green. Nothing in this plan touches host or web Python code, but a clean run rules out accidental import-time damage.
|
||||
|
||||
- [ ] **Step F.3: Working tree clean and commits in order**
|
||||
|
||||
Run: `git status && git log --oneline -8`
|
||||
Expected:
|
||||
- `git status`: `nothing to commit, working tree clean`.
|
||||
- `git log`: six new commits in this order, top-most first:
|
||||
1. `docs(deploy): performance-tuning escape-hatch section in README`
|
||||
2. `feat(deploy): install slice + sysctl artifacts and apply via sysctl --system`
|
||||
3. `feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500`
|
||||
4. `feat(deploy): host sysctls for UDP buffers + netdev backlog/budget`
|
||||
5. `feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio`
|
||||
6. `feat(deploy): perf-baseline directives on left4me-server@.service`
|
||||
|
||||
If any step is missing or out of order, do not amend — diagnose, fix, and create new commits.
|
||||
|
||||
- [ ] **Step F.4: Manual deploy smoke test (deferred, ops-side)**
|
||||
|
||||
This plan ships artifacts. Confirming that systemd actually accepts and applies them on a real host requires running the deploy script against a test target. That validation is operator-side, not part of this implementation:
|
||||
|
||||
```sh
|
||||
deploy/deploy-test-server.sh deploy-user@example-host
|
||||
ssh deploy-user@example-host 'systemctl cat l4d2-game.slice'
|
||||
ssh deploy-user@example-host 'sysctl net.core.rmem_max' # expect 8388608
|
||||
ssh deploy-user@example-host 'systemd-analyze verify /usr/local/lib/systemd/system/left4me-server@.service'
|
||||
```
|
||||
|
||||
Document any deploy-time problems back into the spec or this plan as v1.x corrections. Do not invent fixes that go beyond the spec.
|
||||
|
||||
---
|
||||
|
||||
## Out of Scope (do NOT implement here)
|
||||
|
||||
Listed in the spec — repeated for clarity:
|
||||
|
||||
- ConVars / blueprint arguments / tickrate / sv_minrate.
|
||||
- SCHED_FIFO auto-apply.
|
||||
- CPU governor auto-apply.
|
||||
- Per-instance `CPUAffinity` auto-apply.
|
||||
- NIC ring-buffer / IRQ-pinning code.
|
||||
- Job-scheduler awareness ("don't build while server X has players").
|
||||
- Hardening tightening (`ProtectKernelTunables=yes`, etc.).
|
||||
|
||||
If you find yourself touching any of these, stop — they belong in a separate spec.
|
||||
|
|
@ -0,0 +1,584 @@
|
|||
# L4D2 Server Lifecycle: Reboot-Safe + Drift Reconciliation Implementation Plan
|
||||
|
||||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||||
|
||||
**Goal:** Make L4D2 server instances survive a host reboot (Part A) and converge `Server.actual_state` to systemd reality every ~30s for out-of-band drift (Part B).
|
||||
|
||||
**Architecture:** Helper script + `service_control.py` switch from `systemctl start/stop` to `systemctl enable --now / disable --now`. A new background thread spawned with the job workers polls every server's status periodically and writes the result via the existing `refresh_server_actual_state()` path. Skip servers with in-flight jobs to avoid racing with the post-job refresh.
|
||||
|
||||
**Tech Stack:** bash helper script + sudoers; Python `subprocess` via `l4d2host.service_control.systemctl_command`; SQLAlchemy via `session_scope()`; threading; pytest.
|
||||
|
||||
**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md`
|
||||
|
||||
---
|
||||
|
||||
## File Structure
|
||||
|
||||
Files to modify (Part A — lifecycle verb change):
|
||||
|
||||
- `deploy/files/usr/local/libexec/left4me/left4me-systemctl` — accept verbs `enable`/`disable`/`show` (drop `start`/`stop`).
|
||||
- `l4d2host/service_control.py` — rename `start_service` → `enable_service`, `stop_service` → `disable_service`. Action tokens become `"enable"` / `"disable"`.
|
||||
- `l4d2host/instances.py` — call `enable_service` from `start_instance`; call `disable_service` from `stop_instance` and `_purge_instance`.
|
||||
- `l4d2host/tests/test_lifecycle.py` — update mock-call expectations.
|
||||
- `l4d2host/tests/test_service_control.py` — new file with direct unit tests for `enable_service` / `disable_service`.
|
||||
- `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args` — update the verb assertions.
|
||||
|
||||
Files to modify (Part B — poller):
|
||||
|
||||
- `l4d2web/services/job_worker.py` — add `start_state_poller`, `state_poller_loop`, `poll_all_servers`.
|
||||
- `l4d2web/app.py` — call `start_state_poller(app)` next to `start_job_workers(app)`.
|
||||
- `l4d2web/config.py` — default `STATE_POLLER_INTERVAL_SECONDS = 30`.
|
||||
- `l4d2web/tests/test_job_worker.py` — four new tests for the poller.
|
||||
|
||||
No host-library, web-app facade, or CLI surface signatures change. The `l4d2ctl start <name>` / `l4d2ctl stop <name>` commands keep their names (per `AGENTS.md`).
|
||||
|
||||
---
|
||||
|
||||
## Pre-flight
|
||||
|
||||
- [ ] **Step 0a: Verify clean working tree**
|
||||
|
||||
Run: `git status`
|
||||
Expected: `nothing to commit, working tree clean`
|
||||
|
||||
- [ ] **Step 0b: Verify the existing test suite is at the known-good baseline**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
|
||||
Expected: 460 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`), 2 skipped.
|
||||
|
||||
If the count differs, stop and surface — this plan assumes that exact baseline.
|
||||
|
||||
---
|
||||
|
||||
## Task 1: Part A — Switch lifecycle verbs to `enable --now` / `disable --now`
|
||||
|
||||
This task changes the helper script, the Python wrapper, and the instance lifecycle in one cohesive commit. The change is end-to-end vertical — splitting it across commits would leave broken intermediate states (helper accepting verbs that no caller uses, or callers using verbs the helper rejects).
|
||||
|
||||
**Files:**
|
||||
- Modify: `deploy/files/usr/local/libexec/left4me/left4me-systemctl`
|
||||
- Modify: `l4d2host/service_control.py`
|
||||
- Modify: `l4d2host/instances.py`
|
||||
- Modify: `l4d2host/tests/test_lifecycle.py`
|
||||
- Create: `l4d2host/tests/test_service_control.py`
|
||||
- Modify: `deploy/tests/test_deploy_artifacts.py`
|
||||
|
||||
### Step 1.1: Update the deploy artifact test for the helper
|
||||
|
||||
Open `deploy/tests/test_deploy_artifacts.py`. Find `test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`.
|
||||
|
||||
Replace the assertions that check the helper's case-statement bodies. Currently the test asserts something like:
|
||||
|
||||
```python
|
||||
assert 'start) exec "$systemctl" start "$unit"' in script
|
||||
assert 'stop) exec "$systemctl" stop "$unit"' in script
|
||||
```
|
||||
|
||||
Update to:
|
||||
|
||||
```python
|
||||
assert 'enable)' in script
|
||||
assert 'enable --now' in script
|
||||
assert 'disable)' in script
|
||||
assert 'disable --now' in script
|
||||
```
|
||||
|
||||
Keep the `--property=ActiveState` and `--property=SubState` assertions for the `show` action (unchanged).
|
||||
|
||||
The rejected-action examples list (currently includes things like `["bad/action", "alpha"]`) is unchanged — those are still bad. If the test currently asserts that `start` and `stop` are accepted (e.g., a positive case), drop those — `start`/`stop` are now rejected verbs, not accepted ones.
|
||||
|
||||
### Step 1.2: Run the updated artifact test to verify it fails
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args -v`
|
||||
Expected: FAIL — the helper script still has `start)`/`stop)` cases, not `enable)`/`disable)`.
|
||||
|
||||
### Step 1.3: Edit the helper script
|
||||
|
||||
Open `deploy/files/usr/local/libexec/left4me/left4me-systemctl`. Find the case-statement (currently around lines 24–27). Replace:
|
||||
|
||||
```sh
case "$action" in
  start) exec "$systemctl" start "$unit" ;;
  stop) exec "$systemctl" stop "$unit" ;;
  show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
  *) ...
esac
```
|
||||
|
||||
with:
|
||||
|
||||
```sh
case "$action" in
  enable) exec "$systemctl" enable --now "$unit" ;;
  disable) exec "$systemctl" disable --now "$unit" ;;
  show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
  *) ...
esac
```
|
||||
|
||||
Keep the rest of the script (shebang, name validation, `*)` reject-and-exit branch) unchanged. The exact form of the `*)` reject case in the existing helper should be preserved.
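The helper's allow-list can be modelled in a few lines of Python to make the accepted and rejected verbs explicit. This is only an illustrative sketch: the real helper is a shell script, `build_systemctl_argv` is a hypothetical name, and the bare `systemctl` program name stands in for the script's `$systemctl` variable.

```python
def build_systemctl_argv(action: str, unit: str) -> list[str]:
    """Map a helper action to the systemctl argv the case statement would exec.

    Only the three verbs from the updated case statement are accepted;
    anything else (including the old start/stop verbs) is rejected.
    """
    if action == "enable":
        return ["systemctl", "enable", "--now", unit]
    if action == "disable":
        return ["systemctl", "disable", "--now", unit]
    if action == "show":
        return [
            "systemctl", "show", unit,
            "--property=ActiveState", "--property=SubState",
        ]
    raise ValueError(f"unsupported action: {action!r}")
```

After this step, `start` and `stop` fall through to the reject branch in the same way any other unknown verb does.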
|
||||
|
||||
### Step 1.4: Verify the helper script still parses
|
||||
|
||||
Run: `sh -n deploy/files/usr/local/libexec/left4me/left4me-systemctl`
|
||||
Expected: exit 0, no output.
|
||||
|
||||
### Step 1.5: Run the artifact test, verify it passes
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args -v`
|
||||
Expected: PASS.
|
||||
|
||||
### Step 1.6: Update `service_control.py`
|
||||
|
||||
Open `l4d2host/service_control.py`. Replace:
|
||||
|
||||
```python
def start_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("start", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )


def stop_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("stop", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )
```
|
||||
|
||||
with:
|
||||
|
||||
```python
def enable_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("enable", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )


def disable_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("disable", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )
```
|
||||
|
||||
`show_service`, `stream_command`, `stream_journal`, and the `systemctl_command` / `journalctl_command` helpers are unchanged.
|
||||
|
||||
### Step 1.7: Update `instances.py` to call the new names
|
||||
|
||||
Open `l4d2host/instances.py`. Replace the import:
|
||||
|
||||
```python
|
||||
from l4d2host.service_control import start_service, stop_service
|
||||
```
|
||||
|
||||
with:
|
||||
|
||||
```python
|
||||
from l4d2host.service_control import disable_service, enable_service
|
||||
```
|
||||
|
||||
Inside `start_instance`, find the `start_service(...)` call (around line 137 in current source) and replace with `enable_service(...)`. Inside `stop_instance` (line 159) and `_purge_instance` (line 194), replace `stop_service(...)` with `disable_service(...)`. Keep all keyword arguments identical — only the function name changes.
|
||||
|
||||
### Step 1.8: Update `test_lifecycle.py`
|
||||
|
||||
Open `l4d2host/tests/test_lifecycle.py`. Search for every assertion that references the `start` or `stop` action token in mock-call expectations against `service_control.run_command` or `systemctl_command`. The tests typically look for argument lists like `["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "start", "<name>"]`.
|
||||
|
||||
Update each occurrence:
|
||||
- `"start"` → `"enable"` (in the `start_instance` test paths)
|
||||
- `"stop"` → `"disable"` (in the `stop_instance`, `delete_instance`, `reset_instance`, and `_purge_instance` test paths)
|
||||
|
||||
Some tests may import `start_service` / `stop_service` directly. Update those imports to `enable_service` / `disable_service`.
|
||||
|
||||
### Step 1.9: Create direct unit tests for `enable_service` / `disable_service`
|
||||
|
||||
Create `l4d2host/tests/test_service_control.py` with:
|
||||
|
||||
```python
from unittest.mock import patch

from l4d2host.service_control import (
    SYSTEMCTL_HELPER,
    disable_service,
    enable_service,
)


@patch("l4d2host.service_control.run_command")
def test_enable_service_invokes_helper_with_enable_action(mock_run):
    enable_service("instance-7")
    args, _ = mock_run.call_args
    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "enable", "instance-7"]


@patch("l4d2host.service_control.run_command")
def test_disable_service_invokes_helper_with_disable_action(mock_run):
    disable_service("instance-7")
    args, _ = mock_run.call_args
    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "disable", "instance-7"]
```
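For readers unfamiliar with the `call_args` pattern used in these tests, the following stand-alone sketch shows what is being asserted. It assumes that `systemctl_command` builds the `sudo -n <helper> <action> <name>` list implied by the assertions; the function body and the helper path value here are reconstructions for illustration, not copies of the real module, and `captured_argv` is a hypothetical name.

```python
from unittest.mock import Mock

# Assumed value, taken from the argv example quoted in Step 1.8.
SYSTEMCTL_HELPER = "/usr/local/libexec/left4me/left4me-systemctl"


def systemctl_command(action: str, name: str) -> list[str]:
    """Assumed shape of the real helper: a non-interactive sudo call."""
    return ["sudo", "-n", SYSTEMCTL_HELPER, action, name]


def captured_argv(action: str, name: str) -> list[str]:
    """Call a stand-in for run_command and return the argv it received."""
    fake_run = Mock()
    fake_run(systemctl_command(action, name), passthrough=False)
    # call_args unpacks into (positional args, keyword args) of the last call.
    args, _kwargs = fake_run.call_args
    return args[0]
```

The real tests reach the same `call_args` tuple through `@patch`, which swaps `run_command` inside `l4d2host.service_control` for the duration of each test.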
|
||||
|
||||
### Step 1.10: Run the host-library tests
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2host/tests -q`
|
||||
Expected: all green (110 or 111 passing depending on whether `test_service_control.py` already existed; `+2` from the new direct tests).
|
||||
|
||||
If anything is red, fix the test expectations, not the implementation; the implementation matches the spec exactly. The most likely failure mode is a test in `test_lifecycle.py` that was missed in Step 1.8: search for any remaining string literal `"start"` or `"stop"` in helper-arg-list contexts.
|
||||
|
||||
### Step 1.11: Run the deploy artifact test suite
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ -q`
|
||||
Expected: 36 passed, 1 failed (the pre-existing unrelated test).
|
||||
|
||||
### Step 1.12: Commit
|
||||
|
||||
```bash
|
||||
git add deploy/files/usr/local/libexec/left4me/left4me-systemctl \
|
||||
l4d2host/service_control.py l4d2host/instances.py \
|
||||
l4d2host/tests/test_lifecycle.py \
|
||||
l4d2host/tests/test_service_control.py \
|
||||
deploy/tests/test_deploy_artifacts.py
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now
|
||||
|
||||
Servers started via the web UI now create a WantedBy= symlink under
|
||||
multi-user.target.wants/, so they auto-start on the next host reboot.
|
||||
Helper verbs renamed start/stop -> enable/disable; service_control.py
|
||||
renamed start_service/stop_service -> enable_service/disable_service.
|
||||
The user-facing l4d2ctl start/stop commands keep their names per the
|
||||
AGENTS.md contract — only the implementation changes. Spec:
|
||||
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task 2: Part B — Periodic state poller
|
||||
|
||||
This task adds the poller code, wires it into the Flask startup, exposes its config knob, and tests four behaviors. One cohesive commit.
|
||||
|
||||
**Files:**
|
||||
- Modify: `l4d2web/services/job_worker.py`
|
||||
- Modify: `l4d2web/app.py`
|
||||
- Modify: `l4d2web/config.py`
|
||||
- Modify: `l4d2web/tests/test_job_worker.py`
|
||||
|
||||
### Step 2.1: Add the failing tests
|
||||
|
||||
Open `l4d2web/tests/test_job_worker.py`. Append after the existing tests:
|
||||
|
||||
```python
def test_state_poller_refreshes_each_server(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Server
        with session_scope() as db:
            db.add_all([
                Server(id=11, name="alpha", port=27015, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
                Server(id=12, name="beta", port=27016, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
            ])

    refreshed = []
    monkeypatch.setattr(jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid))

    with app.app_context():
        jw.poll_all_servers()

    assert sorted(refreshed) == [11, 12]


def test_state_poller_skips_servers_with_inflight_jobs(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Job, Server
        with session_scope() as db:
            db.add(Server(id=21, name="gamma", port=27017, blueprint_id=None,
                          desired_state="running", actual_state="running"))
            db.add(Job(server_id=21, operation="stop", state="running"))

    refreshed = []
    monkeypatch.setattr(jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid))

    with app.app_context():
        jw.poll_all_servers()

    assert refreshed == []


def test_state_poller_swallows_per_server_exceptions(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Server
        with session_scope() as db:
            db.add_all([
                Server(id=31, name="bad", port=27018, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
                Server(id=32, name="good", port=27019, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
            ])

    refreshed = []

    def fake_refresh(sid):
        if sid == 31:
            raise RuntimeError("simulated host failure")
        refreshed.append(sid)

    monkeypatch.setattr(jw, "refresh_server_actual_state", fake_refresh)

    with app.app_context():
        jw.poll_all_servers()  # must not raise

    assert refreshed == [32]


def test_state_poller_disabled_when_job_workers_disabled(monkeypatch):
    """create_app must not spawn the poller thread when JOB_WORKER_ENABLED=False."""
    import threading

    from l4d2web.app import create_app

    spawned = []
    real_thread_init = threading.Thread.__init__

    def tracking_init(self, *args, **kwargs):
        if kwargs.get("name") == "left4me-state-poller":
            spawned.append(True)
        real_thread_init(self, *args, **kwargs)

    monkeypatch.setattr(threading.Thread, "__init__", tracking_init)
    create_app({"TESTING": True, "JOB_WORKER_ENABLED": False})
    assert not spawned
```
|
||||
|
||||
(The tests assume the existing `app` fixture from `conftest.py`. If your project uses a different fixture name, adjust accordingly. The polling tests run `poll_all_servers()` synchronously to avoid testing the loop's `time.sleep`.)
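If the loop itself ever needs coverage, one option is to make the sleep function injectable so a test can stop the loop after a fixed number of sweeps. The sketch below is a hypothetical variant, not the plan's implementation: `poller_loop`, `run_iterations`, and the `sleep` parameter are illustrative names, and the Flask app context is left out to keep it self-contained.

```python
import time
from collections.abc import Callable


class StopLoop(Exception):
    """Raised by the fake sleep to break out of the otherwise endless loop."""


def poller_loop(
    poll: Callable[[], None],
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    # Same shape as a poll-then-sleep loop; only `sleep` is injectable.
    while True:
        try:
            poll()
        except Exception:
            pass  # a failed sweep must not kill the polling thread
        sleep(interval)  # outside the try, so StopLoop propagates


def run_iterations(poll: Callable[[], None], iterations: int) -> list[float]:
    """Drive poller_loop for a fixed number of sweeps; return the sleeps seen."""
    sleeps: list[float] = []

    def fake_sleep(seconds: float) -> None:
        sleeps.append(seconds)
        if len(sleeps) >= iterations:
            raise StopLoop

    try:
        poller_loop(poll, 30.0, sleep=fake_sleep)
    except StopLoop:
        pass
    return sleeps
```

Because the fake sleep raises outside the loop's `try` block, the exception ends the loop without being swallowed, while errors raised by `poll` are still ignored.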
|
||||
|
||||
### Step 2.2: Run the new tests, verify they fail
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server l4d2web/tests/test_job_worker.py::test_state_poller_skips_servers_with_inflight_jobs l4d2web/tests/test_job_worker.py::test_state_poller_swallows_per_server_exceptions l4d2web/tests/test_job_worker.py::test_state_poller_disabled_when_job_workers_disabled -v`
|
||||
Expected: FAIL — `poll_all_servers` and `start_state_poller` don't exist yet.
|
||||
|
||||
### Step 2.3: Add the poller code to `job_worker.py`
|
||||
|
||||
Open `l4d2web/services/job_worker.py`. Add at the bottom of the file:
|
||||
|
||||
```python
def start_state_poller(app):
    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
    thread = threading.Thread(
        target=state_poller_loop,
        args=(app, interval),
        daemon=True,
        name="left4me-state-poller",
    )
    thread.start()


def state_poller_loop(app, interval: float) -> None:
    while True:
        try:
            with app.app_context():
                poll_all_servers()
        except Exception:
            pass
        time.sleep(interval)


def poll_all_servers() -> None:
    with session_scope() as db:
        active_server_ids = set(db.scalars(
            select(Job.server_id).where(Job.state.in_(("queued", "running")))
        ).all())
        server_ids = [
            sid for sid in db.scalars(select(Server.id)).all()
            if sid not in active_server_ids
        ]
    for sid in server_ids:
        try:
            refresh_server_actual_state(sid)
        except Exception:
            pass
```
|
||||
|
||||
`Server`, `Job`, `select`, `session_scope`, `threading`, `time`, and `refresh_server_actual_state` are already imported in this file. Verify by scanning the existing imports; if any are missing (unlikely for `select`/`Server`/`Job` since the worker uses them), add them.
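The selection and error-handling rules in `poll_all_servers` can be examined without a database. The following is a hypothetical pure-Python model of the same logic; `poll_servers` and its parameters are illustrative names, and it returns the refreshed ids only to make the behaviour easy to observe.

```python
from collections.abc import Callable, Iterable


def poll_servers(
    server_ids: Iterable[int],
    busy_server_ids: Iterable[int],
    refresh: Callable[[int], None],
) -> list[int]:
    """Refresh every server without an in-flight job; return the ids that succeeded."""
    busy = set(busy_server_ids)
    refreshed: list[int] = []
    for sid in server_ids:
        if sid in busy:
            continue  # the job's own post-job refresh will update this server
        try:
            refresh(sid)
        except Exception:
            continue  # one failing server must not stop the sweep
        refreshed.append(sid)
    return refreshed
```

This mirrors the three behaviours the tests in Step 2.1 check: every idle server is refreshed, servers with queued or running jobs are skipped, and a failure on one server does not affect the others.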
|
||||
|
||||
### Step 2.4: Wire the poller into `create_app`
|
||||
|
||||
Open `l4d2web/app.py`. Find the existing `start_job_workers(app)` call (around line 91, inside the `if should_start_workers:` block). Add `start_state_poller(app)` immediately after it:
|
||||
|
||||
```python
if should_start_workers:
    recover_stale_jobs()
    start_job_workers(app)
    start_state_poller(app)
```
|
||||
|
||||
Also update the import:
|
||||
|
||||
```python
from l4d2web.services.job_worker import (
    recover_stale_jobs,
    start_job_workers,
    start_state_poller,
)
```
|
||||
|
||||
(If the existing import is single-line `from ... import recover_stale_jobs, start_job_workers`, just add `start_state_poller` to the list.)
|
||||
|
||||
### Step 2.5: Add the config default
|
||||
|
||||
Open `l4d2web/config.py`. Find the dict literal that contains other defaults like `JOB_WORKER_THREADS`, `PORT_RANGE_START`, etc. Add:
|
||||
|
||||
```python
|
||||
"STATE_POLLER_INTERVAL_SECONDS": 30,
|
||||
```
|
||||
|
||||
In the env-var-loading section (where `LEFT4ME_PORT_RANGE_START` etc. are read), add:
|
||||
|
||||
```python
|
||||
"STATE_POLLER_INTERVAL_SECONDS": float(os.getenv("LEFT4ME_STATE_POLLER_INTERVAL_SECONDS", "30")),
|
||||
```
|
||||
|
||||
### Step 2.6: Run the four new tests, verify they pass
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server l4d2web/tests/test_job_worker.py::test_state_poller_skips_servers_with_inflight_jobs l4d2web/tests/test_job_worker.py::test_state_poller_swallows_per_server_exceptions l4d2web/tests/test_job_worker.py::test_state_poller_disabled_when_job_workers_disabled -v`
|
||||
Expected: PASS for all four.
|
||||
|
||||
### Step 2.7: Run the full web test suite
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests -q`
|
||||
Expected: 317 passed, 1 skipped (313 + 4 new tests).
|
||||
|
||||
### Step 2.8: Commit
|
||||
|
||||
```bash
|
||||
git add l4d2web/services/job_worker.py l4d2web/app.py l4d2web/config.py l4d2web/tests/test_job_worker.py
|
||||
git commit -m "$(cat <<'EOF'
|
||||
feat(l4d2-web): periodic state poller refreshes Server.actual_state
|
||||
|
||||
A background thread spawned alongside the job workers polls every
|
||||
server's status every STATE_POLLER_INTERVAL_SECONDS (default 30) and
|
||||
writes the result via the existing refresh_server_actual_state path.
|
||||
Servers with in-flight jobs are skipped to avoid racing the post-job
|
||||
refresh. Catches reboot drift, OOM kills, manual systemctl operations,
|
||||
and any other out-of-band state change. Spec:
|
||||
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
|
||||
EOF
|
||||
)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Final Verification
|
||||
|
||||
- [ ] **Step F.1: Full test sweep**
|
||||
|
||||
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
|
||||
Expected: ~466 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`), 2 skipped.
|
||||
|
||||
- [ ] **Step F.2: Working tree clean and commit shape**
|
||||
|
||||
Run: `git status && git log --oneline -5`
|
||||
Expected:
|
||||
- `git status`: clean.
|
||||
- Top of `git log`:
|
||||
1. `feat(l4d2-web): periodic state poller refreshes Server.actual_state`
|
||||
2. `feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now`
|
||||
3. `docs(plans): l4d2 server lifecycle reboot-and-drift — implementation plan`
|
||||
4. `docs(specs): l4d2 server lifecycle reboot-and-drift — design`
|
||||
|
||||
- [ ] **Step F.3: Operator-side smoke test (deferred, not part of this plan)**
|
||||
|
||||
End-to-end on `ckn@10.0.4.128` after deploy:
|
||||
|
||||
```sh
|
||||
deploy/deploy-test-server.sh ckn@10.0.4.128
|
||||
|
||||
# Confirm the helper now drives enable/disable
|
||||
ssh ckn@10.0.4.128 'grep -E "enable|disable" /usr/local/libexec/left4me/left4me-systemctl'
|
||||
# expect: enable) exec "$systemctl" enable --now "$unit"
|
||||
# disable) exec "$systemctl" disable --now "$unit"
|
||||
|
||||
# Click "start" in the web UI for a server. Then:
|
||||
ssh ckn@10.0.4.128 'systemctl is-enabled left4me-server@1.service'
|
||||
# expect: enabled
|
||||
|
||||
# Reboot the host:
|
||||
ssh ckn@10.0.4.128 'sudo systemctl reboot'
|
||||
# wait for it to come back, then:
|
||||
ssh ckn@10.0.4.128 'systemctl is-active left4me-server@1.service && pgrep -fa srcds'
|
||||
# expect: active, srcds running with no UI intervention
|
||||
|
||||
# Confirm the poller corrects out-of-band drift
|
||||
ssh ckn@10.0.4.128 'sudo systemctl disable --now left4me-server@1.service'
|
||||
# Within ~30s the web UI's actual_state for server 1 flips from "running" to "stopped".
|
||||
ssh ckn@10.0.4.128 'sudo -u left4me /opt/left4me/.venv/bin/python -c "
|
||||
import sqlite3
|
||||
c = sqlite3.connect(\"/var/lib/left4me/left4me.db\")
|
||||
print(c.execute(\"SELECT id, actual_state, actual_state_updated_at FROM servers WHERE id=1\").fetchone())
|
||||
"'
|
||||
# expect: actual_state='stopped' with a fresh updated_at.
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Out of Scope (do NOT implement here)
|
||||
|
||||
- Auto-restart on `desired_state=running && actual_state=stopped`.
|
||||
- UI banners for stale-state warnings.
|
||||
- Reconciliation of orphan systemd units.
|
||||
- Per-server poll intervals.
|
||||
- Replacing `Restart=on-failure`.
|
||||
- Touching the pre-existing red test (`test_deploy_script_has_safe_defaults_and_preserves_state`).
|
||||
|
||||
If you find yourself touching any of these, stop — they belong in a separate spec.
|
||||
131
docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md
Normal file
|
|
@ -0,0 +1,131 @@
|
|||
# l4d2 cpu isolation — design
|
||||
|
||||
Date: 2026-05-09
|
||||
Status: design
|
||||
|
||||
## Summary
|
||||
|
||||
Constrain every cgroup that isn't a live game server to core 0; give game servers cores 1..N-1 exclusively. Implementation is systemd cgroup-v2 `AllowedCPUs=` drop-ins, computed at deploy time from `nproc`, overridable via env vars. Lands on top of the perf baseline shipped in `851e662..e5126c8`.
|
||||
|
||||
## Goals
|
||||
|
||||
- A logged-in admin doing CPU-heavy work, the script-build sandbox, and the Flask web app cannot steal cycles from a live match.
|
||||
- Layout scales automatically across host sizes (4-core, 8-core, 16-core) without per-host edits.
|
||||
- Operator can override the default `0` / `1..N-1` split for NUMA boxes or hyperthread quirks.
|
||||
- Single-core hosts degrade gracefully: skip CPU isolation, keep the rest of the perf baseline.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- Kernel `isolcpus=` / `nohz_full=` / `rcu_nocbs=` boot parameters. True core isolation (eviction of softirqs, RCU, timer ticks) requires GRUB edits + reboot + per-host tuning. cgroup cpuset is sufficient for L4D2 tickrates; document as a future opt-in if measurement justifies it.
|
||||
- NIC IRQ pinning. Hardware-specific; already documented as an escape hatch in `deploy/README.md`.
|
||||
- Per-instance pinning *within* the game-core set. The slice-level cpuset is the floor; the existing per-instance `CPUAffinity=` drop-in escape hatch (already in `deploy/README.md`) composes on top — the kernel enforces "per-instance value must be a subset of slice's allowed set."
|
||||
- A separate `l4d2-web.slice`. The web app is light; living in `system.slice` on core 0 is fine.
|
||||
- Web-app or host-library code changes. Pure deploy-side artifact work.
|
||||
|
||||
## Background
|
||||
|
||||
The perf baseline (commit range `851e662..e5126c8`) introduced two slices (`l4d2-game.slice` weight 1000, `l4d2-build.slice` weight 10), per-instance unit directives (Nice, OOM, memory caps), and host sysctls. None of those constrain *which* CPUs a cgroup may run on. Left alone, the kernel scheduler can place any task on any core, so the build sandbox, ssh sessions, the web app, and game servers all compete for the same cores.
|
||||
|
||||
## Design
|
||||
|
||||
### Topology
|
||||
|
||||
```
|
||||
core 0 cores 1..N-1
|
||||
───────── ────────────
|
||||
system.slice AllowedCPUs=0
|
||||
user.slice AllowedCPUs=0
|
||||
l4d2-build.slice AllowedCPUs=0
|
||||
l4d2-game.slice AllowedCPUs=1-(N-1)
|
||||
```
|
||||
|
||||
Everything that isn't a live game server (Flask web app, ssh sessions, journald, script-sandbox builds, cron, systemd housekeeping) is funneled to core 0. Game servers get cores 1..N-1 exclusively.
|
||||
|
||||
### Why slice-level `AllowedCPUs=`, not per-instance `CPUAffinity=`
|
||||
|
||||
- **Hierarchy does the work for free.** A cpuset on `l4d2-game.slice` propagates to every `left4me-server@*.service` automatically. No per-instance drop-ins to manage; no logic in the web app to pick cores.
|
||||
- **Hot-applied.** cgroup-v2 cpuset changes apply to running cgroups; existing servers move next time the kernel schedules them. No need to restart instances after a deploy.
|
||||
- **Composable.** A future operator who wants per-instance pinning *within* the game cores adds `CPUAffinity=N` via `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf` (already documented). The slice constraint and per-instance pin compose; the kernel enforces subset-of.
|
||||
|
||||
### Why drop-ins, not edits to the existing `.slice` files
|
||||
|
||||
The two slice files we ship today (`l4d2-game.slice`, `l4d2-build.slice`) are static text and host-portable. `AllowedCPUs=1-7` is true on an 8-core host and wrong on a 4-core host. Drop-ins under `<unit>.d/*.conf` are the standard systemd pattern for host-specific overrides. We already use `99-` prefixing for the sysctl drop-in so it lex-orders last; reuse that.
|
||||
|
||||
### Operator override
|
||||
|
||||
Two env vars consumed by the deploy script:
|
||||
|
||||
- `LEFT4ME_SYSTEM_CPUS` — defaults to `0`. Goes into `system.slice`, `user.slice`, `l4d2-build.slice` drop-ins.
|
||||
- `LEFT4ME_GAME_CPUS` — defaults to `1-$((NPROC-1))`. Goes into `l4d2-game.slice` drop-in.
|
||||
|
||||
Operators with NUMA boxes, hyperthread quirks, or "I want core 0 *and* core 1 for system" set the vars explicitly. Defaults handle the typical case.
|
||||
|
||||
### Single-core fallback
|
||||
|
||||
If `nproc < 2`, skip CPU isolation entirely (write no drop-ins). Print a warning to stderr explaining the deploy is leaving cpuset unset. The rest of the perf baseline still applies (weights, sysctls, OOM scores).
|
||||
|
||||
If `LEFT4ME_GAME_CPUS` or `LEFT4ME_SYSTEM_CPUS` is set explicitly on a single-core host, honor the operator's intent and write the drop-ins anyway; they presumably know what they're doing.
|
||||
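The override and fallback rules above can be sketched in a few lines. This is a minimal illustration in Python rather than the deploy script's own shell; the function name `resolve_cpu_layout` is hypothetical, and treating an empty variable like an unset one mirrors the shell `${VAR:-default}` form, which is an assumption about how the script binds its defaults.

```python
import sys
from typing import Mapping, Optional, Tuple


def resolve_cpu_layout(nproc: int, env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (system_cpus, game_cpus), or None when isolation is skipped.

    Hypothetical helper: the real logic lives in the deploy script.
    """
    system_override = env.get("LEFT4ME_SYSTEM_CPUS")
    game_override = env.get("LEFT4ME_GAME_CPUS")

    # Single-core fallback: with no explicit override there is nothing to split.
    if nproc < 2 and not (system_override or game_override):
        print("warning: fewer than 2 CPUs; leaving cpuset unset", file=sys.stderr)
        return None

    # Empty values fall through to the defaults, like ${VAR:-default} in shell.
    system_cpus = system_override or "0"
    # Same default as the spec: 1-$((NPROC-1)). If only one variable is set on
    # a single-core host, the other default may be meaningless; the spec leaves
    # that case to the operator.
    game_cpus = game_override or f"1-{nproc - 1}"
    return system_cpus, game_cpus
```

On an 8-core host with no overrides this yields `("0", "1-7")`, matching the topology diagram.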
|
||||
### Drop-in layout
|
||||
|
||||
Four files written to `/etc/systemd/system/`, each named `99-left4me-cpuset.conf`:
|
||||
|
||||
```
|
||||
/etc/systemd/system/system.slice.d/99-left4me-cpuset.conf
|
||||
/etc/systemd/system/user.slice.d/99-left4me-cpuset.conf
|
||||
/etc/systemd/system/l4d2-build.slice.d/99-left4me-cpuset.conf
|
||||
/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf
|
||||
```
|
||||
|
||||
Each file contains:
|
||||
|
||||
```ini
|
||||
[Slice]
|
||||
AllowedCPUs=<resolved value>
|
||||
```
|
||||
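As a rough illustration of the layout above, the four files and their bodies can be generated from the two resolved values. The helper name `render_dropins` is hypothetical and the sketch only builds strings; the actual deploy script is expected to write the files with `install`.

```python
from typing import Dict

DROPIN_NAME = "99-left4me-cpuset.conf"
SYSTEM_SIDE_SLICES = ("system.slice", "user.slice", "l4d2-build.slice")
GAME_SLICE = "l4d2-game.slice"


def render_dropins(system_cpus: str, game_cpus: str) -> Dict[str, str]:
    """Map each drop-in path to the content it should contain."""

    def body(cpus: str) -> str:
        return f"[Slice]\nAllowedCPUs={cpus}\n"

    # Everything that is not a game server shares the system CPU set.
    files = {
        f"/etc/systemd/system/{unit}.d/{DROPIN_NAME}": body(system_cpus)
        for unit in SYSTEM_SIDE_SLICES
    }
    # Game servers get their own set.
    files[f"/etc/systemd/system/{GAME_SLICE}.d/{DROPIN_NAME}"] = body(game_cpus)
    return files
```

Keeping the path list in one place makes it easy for a test to assert that exactly four drop-ins exist.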
|
||||
### systemd compatibility
|
||||
|
||||
`AllowedCPUs=` is systemd 244+. Debian Trixie ships systemd 256+. Cgroup-v2 cpuset controller is enabled by default on Trixie; systemd auto-enables the controller when `AllowedCPUs=` is set on a unit. No additional machinery.
|
||||
|
||||
### Files changed / added
|
||||
|
||||
```
|
||||
deploy/deploy-test-server.sh (modified — compute layout, write four drop-ins)
|
||||
deploy/README.md (modified — new "CPU isolation" subsection inside Performance Tuning)
|
||||
deploy/tests/test_deploy_artifacts.py (modified — new tests)
|
||||
```
|
||||
|
||||
## Tests
|
||||
|
||||
`deploy/tests/test_deploy_artifacts.py` additions, following the existing
|
||||
`assert "X" in script` pattern:
|
||||
|
||||
- For `deploy-test-server.sh`, assert:
|
||||
- All four drop-in paths (`/etc/systemd/system/{system,user,l4d2-build,l4d2-game}.slice.d/99-left4me-cpuset.conf`) appear.
|
||||
- The script reads `nproc` (substring `nproc` plus a default-binding form for `LEFT4ME_GAME_CPUS`).
|
||||
- The script honors `LEFT4ME_SYSTEM_CPUS` and `LEFT4ME_GAME_CPUS` env-var overrides (substrings present, default-binding form like `${LEFT4ME_SYSTEM_CPUS:-...}`).
|
||||
- The script has a single-core fallback (substring guarding `nproc -lt 2` or equivalent, with a warning to stderr).
|
||||
- Each drop-in is written via the existing `install -m 0644 -o root -g root` heredoc pattern.
|
||||
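A sketch of how those substring assertions might be grouped, assuming the script text has already been read into a string. The names `REQUIRED_SUBSTRINGS` and `missing_substrings` are illustrative and not taken from the existing test module; the single-core guard is left out because the spec allows an "or equivalent" form.

```python
from typing import List

# Substrings the spec expects to find in deploy-test-server.sh.
REQUIRED_SUBSTRINGS = [
    "/etc/systemd/system/system.slice.d/99-left4me-cpuset.conf",
    "/etc/systemd/system/user.slice.d/99-left4me-cpuset.conf",
    "/etc/systemd/system/l4d2-build.slice.d/99-left4me-cpuset.conf",
    "/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf",
    "nproc",
    "${LEFT4ME_SYSTEM_CPUS:-",
    "${LEFT4ME_GAME_CPUS:-",
    "install -m 0644 -o root -g root",
]


def missing_substrings(script: str) -> List[str]:
    """Return every required substring that does not occur in the script."""
    return [needle for needle in REQUIRED_SUBSTRINGS if needle not in script]
```

A test can then assert `missing_substrings(script) == []`, or loop over the list so that each failing substring is reported separately.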
|
||||
No runtime tests in this spec — verifying that systemd actually enforces `AllowedCPUs=` is operator-side via `cat /sys/fs/cgroup/<slice>/cpuset.cpus.effective` after deploy.
|
||||
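For the operator-side check, the kernel reports `cpuset.cpus.effective` as a comma-separated list of CPUs and ranges, so comparing it to the configured value is easiest after expanding both. The helpers below are a hypothetical sketch (names are not from the codebase).

```python
from typing import Set


def parse_cpu_list(text: str) -> Set[int]:
    """Expand a kernel CPU list such as "0,2-3" into {0, 2, 3}."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty string means "no CPUs"
        lo, _, hi = part.partition("-")
        # A bare number like "0" has no upper bound; treat it as a 1-CPU range.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def cpuset_matches(effective: str, expected: str) -> bool:
    """True when both lists describe the same set of CPUs."""
    return parse_cpu_list(effective) == parse_cpu_list(expected)
```

For example, `cpuset_matches(open("/sys/fs/cgroup/l4d2-game.slice/cpuset.cpus.effective").read(), "1-7")` would confirm the game slice on an 8-core host.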
|
||||
## Rollout
|
||||
|
||||
Single deploy. cgroup-v2 cpuset changes apply to running cgroups, so already-running servers move next time the kernel reschedules them — no instance restarts required. The `daemon-reload` already in the deploy script picks up the new drop-ins.
|
||||
|
||||
If something goes wrong (cpuset too narrow, a slice can't run any process), `systemctl status <slice>.slice` will show the error. The operator can then either fix the env vars and redeploy, or revert with `rm /etc/systemd/system/<slice>.slice.d/99-left4me-cpuset.conf` followed by `systemctl daemon-reload`.
|
||||
|
||||
## Open questions
|
||||
|
||||
None blocking. Possible v2 candidates if measurement justifies them:
|
||||
|
||||
- Pair this with kernel `isolcpus=` boot params for true core isolation.
|
||||
- Auto-pin NIC IRQs to core 0 (would compose with this isolation).
|
||||
- Per-instance `CPUAffinity=` driven by a deploy-env knob, partitioning the game-core set across instances deterministically.
|
||||
|
||||
## References
|
||||
|
||||
- systemd.resource-control(5) — `AllowedCPUs=` semantics.
|
||||
- Linux Documentation/admin-guide/cgroup-v2.rst — cpuset controller behavior on `cpuset.cpus` / `cpuset.cpus.effective`.
|
||||
- Existing perf-baseline spec: `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md` — sibling work that introduced the slices this spec extends.
|
||||
|
|
@ -0,0 +1,83 @@
|
|||
# l4d2 cpu pinning — decision record (deferred)
|
||||
|
||||
Date: 2026-05-09
|
||||
Status: decision (no implementation)
|
||||
|
||||
## Question
|
||||
|
||||
After the lifecycle + drift fix landed (commits `8552c55`, `67b5521`), the
|
||||
question came up: with `AllowedCPUs=1-7` already constraining game servers
|
||||
to cores 1–7, do CFS scheduler migrations *within* that range still cause
|
||||
meaningful jitter? Should we hard-pin each instance to a single core?
|
||||
|
||||
## Investigation
|
||||
|
||||
The classic "lazy CFS" sysctl knob is **gone** on modern kernels. Verified
|
||||
on Trixie's running kernel 6.12 (`ckn@10.0.4.128`):
|
||||
|
||||
```
|
||||
/sbin/sysctl -a | grep -E "sched_migration_cost|sched_min_granularity|sched_wakeup_granularity|sched_latency"
|
||||
# (no output)
|
||||
```
|
||||
|
||||
`kernel.sched_migration_cost_ns` and the other classic CFS tunables were
removed from the sysctl interface in 5.13 (the kernel moved them to debugfs
under `/sys/kernel/debug/sched/`), ahead of the scheduler rework that
culminated in EEVDF (6.6). The real-time knobs `kernel.sched_rt_period_us` /
`sched_rt_runtime_us` remain as sysctls. There is no global "be lazy about
migrations" sysctl anymore.
|
||||
|
||||
### Available paths
|
||||
|
||||
| Option | Cost | Strictness | Pays off when |
|---|---|---|---|
| Trust CFS + `Nice=-5` + `AllowedCPUs=1-7` (current) | None | Soft | ≤ 3 instances on 7 cores; CFS rarely migrates active CPU-bound nice<0 tasks |
| Per-instance `CPUAffinity=N` drop-in | Web-app machinery to write drop-ins, daemon-reload, modulo or DB-persisted assignment | Strict | ≥ 4 instances (each gets exclusive core), or measured jitter |
| `isolcpus=1-7 nohz_full=1-7 rcu_nocbs=1-7` kernel cmdline | GRUB edit + reboot, host-specific | Strongest (also evicts kernel softirqs/RCU/timer ticks from game cores) | Tickrate-128 with measurable kernel-induced jitter |
| `SCHED_FIFO` per unit | Risky (RT misconfig can stall kernel) | Strict | Already documented as ops-side escape hatch in `deploy/README.md` |
|
||||
|
||||
### Why deferring is defensible
|
||||
|
||||
- The slice's `AllowedCPUs=1-7` already prevents game servers from running on core 0. The open question is "do they migrate within 1–7?" — yes, CFS can migrate, but for long-running CPU-bound `srcds` with `Nice=-5`, migrations are infrequent. CFS prefers cache locality and only migrates when an idle core "steals" or a periodic load-balance tick detects imbalance.
|
||||
- With ≤ 3 instances on 7 game cores, the load balancer rarely sees imbalance to fix.
|
||||
- Per-instance hard pinning adds non-trivial machinery (drop-in writer through `left4me-systemctl`, or extending `instance.env` + a `taskset` wrapper in the unit). Not warranted unless we observe a real problem.
|
||||
- `deploy/README.md` already documents the `CPUAffinity=N` per-instance drop-in as an opt-in escape hatch. An operator who measures jitter can apply it without code changes.
|
||||
|
||||
## Decision
|
||||
|
||||
**No code change.** Keep the current setup:
|
||||
|
||||
- Slice-level `AllowedCPUs=1-7` ensures game servers never touch core 0.
|
||||
- `Nice=-5` keeps active srcds tasks weighted heavily so CFS prefers leaving them alone.
|
||||
- The `CPUAffinity=N` per-instance drop-in remains the documented escape hatch.
|
||||
|
||||
## Revisit triggers
|
||||
|
||||
If any of these signals appears, design and implement strict per-instance pinning:
|
||||
|
||||
- ≥ 4 game-server instances running simultaneously on one host.
|
||||
- A specific server reports tickrate dips / rubber-banding correlated with another instance starting or a build sandbox firing.
|
||||
- `perf stat -e sched:sched_migrate_task -p <srcds-pid>` shows > 1 migration/sec under load.
|
||||
|
||||
When revisiting, there are two implementation paths to choose from:
|
||||
|
||||
1. **Modulo assignment in the host library.** Read `LEFT4ME_GAME_CPUS` (or parse the slice's `AllowedCPUs=` drop-in), pick `game_cpus[(int(name) - 1) % len(game_cpus)]`, write `L4D2_CPU=N` into `instance.env`, wrap the unit's `ExecStart` with `taskset -c ${L4D2_CPU}`. Stateless, deterministic, no DB column. **Preferred.**
|
||||
2. **Persisted assignment.** Add `Server.cpu_pin` column, web app picks at initialize time and stores. Survives `LEFT4ME_GAME_CPUS` changes (each server keeps its assigned core). Bigger ripple.
|
||||
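The modulo rule from option 1 is small enough to sketch directly. This assumes instance names are numeric (as in `left4me-server@1.service`) and that the game CPU set has already been expanded into a list; `pick_core` is a hypothetical name.

```python
from typing import Sequence


def pick_core(instance_name: str, game_cpus: Sequence[int]) -> int:
    """Deterministically map a numeric instance name onto one game core."""
    if not game_cpus:
        raise ValueError("game CPU set is empty")
    # Instance 1 gets the first game core; names beyond the set wrap around.
    return game_cpus[(int(instance_name) - 1) % len(game_cpus)]
```

With seven game cores, instances 1 through 7 each land on their own core and instance 8 wraps back to share with instance 1. That wrap-around is the trade-off against the persisted-assignment option.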
|
||||
## Verification (no-op confirmation)
|
||||
|
||||
```sh
|
||||
ssh ckn@10.0.4.128 'systemctl show l4d2-game.slice -p AllowedCPUs'
|
||||
# expect: AllowedCPUs=1-7
|
||||
|
||||
ssh ckn@10.0.4.128 'cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective'
|
||||
# expect: 0 (everything-not-game still pinned to core 0)
|
||||
|
||||
# When ≥ 1 server is running:
|
||||
ssh ckn@10.0.4.128 'for p in $(pgrep srcds); do grep ^Cpus_allowed_list /proc/$p/status; done'
|
||||
# expect: 1-7 (the scheduler may run the task on any of those cores at a given moment)
|
||||
```
|
||||
|
||||
## References
|
||||
|
||||
- `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md` — sibling design that introduced the `AllowedCPUs=1-7` slice constraint this record builds on.
|
||||
- `deploy/README.md` "Performance Tuning" section — the `CPUAffinity=N` per-instance escape hatch.
|
||||
- Linux kernel changelog 5.13+ — removal of classic CFS tunable sysctls.
|
||||
|
|
@ -0,0 +1,230 @@
|
|||
# l4d2 server host perf baseline — design
|
||||
|
||||
Date: 2026-05-09
|
||||
Status: design
|
||||
|
||||
## Summary
|
||||
|
||||
Apply a host-side performance and resource-isolation baseline to every L4D2 server instance, using systemd unit directives, a slice hierarchy, and host sysctls. The blueprint-level game configuration (tickrate, sv_minrate/maxrate, fps_max, plugins) stays the responsibility of the individual server maintainer and is out of scope.
|
||||
|
||||
## Goals
|
||||
|
||||
- Game-server processes get measurable scheduling, I/O, and OOM priority over the script-build sandbox and over interactive system traffic.
|
||||
- One misbehaving server cannot OOM-kill its siblings or the host.
|
||||
- The kernel's UDP path is sized for sustained Source-engine traffic instead of distro defaults.
|
||||
- Operators have documented escape hatches for host-specific tuning (CPU pinning, governor, NIC IRQs, real-time scheduling) without any of it being imposed by default.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- ConVars, blueprint arguments, plugins, tickrate, rate values — owned by the maintainer of each server.
|
||||
- Real-time (`SCHED_FIFO`/`SCHED_RR`) scheduling for game servers. Documented as opt-in only; see Out-of-scope rationale.
|
||||
- CPU governor changes. Documented opt-in only.
|
||||
- Per-instance `CPUAffinity`. Host-specific; documented only.
|
||||
- NIC ring-buffer / IRQ-pinning changes. Hardware-specific; documented only.
|
||||
- Job-scheduler awareness ("don't build a script overlay while server X has players"). Cgroup weights cover this in v1; revisit if real-world data disagrees.
|
||||
- Hardening tightening (`ProtectKernelTunables=yes`, etc.). Security-focused, separate spec.
|
||||
|
||||
## Background
|
||||
|
||||
Current state (commit `965b67e`):
|
||||
|
||||
- `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` runs `srcds_run` as user `left4me` with security hardening (`NoNewPrivileges`, `PrivateTmp`, `PrivateDevices`, `ProtectHome`, `ProtectSystem=strict`, `ReadOnlyPaths`, `ReadWritePaths`, `RestrictSUIDSGID`, `LockPersonality`) but **no scheduling, memory, OOM, kill-signal, or log-rate directives**.
|
||||
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` runs script-overlay builds via `systemd-run --scope` with `CPUQuota=200%` and `RuntimeMaxSec=3600`, but in the **default cgroup** — it competes against game servers as an equal sibling under `system.slice`.
|
||||
- No host sysctls are deployed. Linux defaults (`rmem_max`/`wmem_max` = 212992 bytes, roughly 208 KB; `netdev_max_backlog=1000`) are below what sustained UDP gameplay across multiple instances expects.
|
||||
|
||||
srcds is single-threaded per instance, so multi-instance hosts contend over CPU cycles, kernel softirq budget, and journald rate limits.
|
||||
|
||||
## Design
|
||||
|
||||
### Slice topology
|
||||
|
||||
Flat top-level slices, siblings of `system.slice` and `user.slice`:
|
||||
|
||||
```
|
||||
-.slice
|
||||
├── system.slice (default CPUWeight=100, IOWeight=100)
|
||||
├── user.slice (default CPUWeight=100, IOWeight=100)
|
||||
├── l4d2-game.slice (CPUWeight=1000, IOWeight=1000)
|
||||
└── l4d2-build.slice (CPUWeight=10, IOWeight=10)
|
||||
```
|
||||
|
||||
Rationale:
|
||||
|
||||
- 100:1 weight ratio between game and build means: under contention, the build sandbox is squeezed to about 1/100 of the game slice's CPU time; when uncontended, the build can still use the whole box up to its own `CPUQuota=200%`.
|
||||
- Flat (not nested under `system.slice`) so a logged-in admin running a heavy task in `user.slice` cannot steal cycles from a live match.
|
||||
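To make the weight ratio concrete, the following worked example computes each slice's share when all four top-level slices are saturated at once. It is a simplified model: cgroup-v2 weights divide CPU time proportionally among siblings that are actually contending, and the sketch ignores `CPUQuota=` and any cpuset restrictions.

```python
from typing import Dict


def cpu_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Fraction of CPU each sibling gets when all of them are busy."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


weights = {
    "system.slice": 100,
    "user.slice": 100,
    "l4d2-game.slice": 1000,
    "l4d2-build.slice": 10,
}
shares = cpu_shares(weights)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under full contention the game slice gets 1000/1210 of the CPU time (roughly 83%) and the build slice 10/1210 (under 1%). When the other slices are idle, the same weights impose no limit at all.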
|
||||
### Per-instance unit additions (`left4me-server@.service`)
|
||||
|
||||
Add to `[Service]`:
|
||||
|
||||
```
|
||||
Slice=l4d2-game.slice
|
||||
Nice=-5
|
||||
IOSchedulingClass=best-effort
|
||||
IOSchedulingPriority=4
|
||||
OOMScoreAdjust=-200
|
||||
MemoryHigh=1.5G
|
||||
MemoryMax=2G
|
||||
TasksMax=256
|
||||
LimitNOFILE=65536
|
||||
KillSignal=SIGINT
|
||||
TimeoutStopSec=15s
|
||||
LogRateLimitIntervalSec=0
|
||||
```
|
||||
|
||||
Per-directive justification:
|
||||
|
||||
- `Slice=l4d2-game.slice` — places the instance in the high-weight slice.
|
||||
- `Nice=-5` — modest CFS priority bump. Negative `Nice` set by systemd does not require `CAP_SYS_NICE` because systemd applies the value before dropping to the unit user. SCHED_FIFO is intentionally rejected; see Out-of-scope rationale.
|
||||
- `IOSchedulingClass=best-effort` + `IOSchedulingPriority=4` — explicit best-effort at priority 4, which matches the default within that class (0 is highest, 7 lowest); this pins the value rather than raising it, so it is deterministic and harmless.
|
||||
- `OOMScoreAdjust=-200` — game servers survive memory pressure; sandbox dies first (see sandbox section).
|
||||
- `MemoryHigh=1.5G`, `MemoryMax=2G` — soft + hard ceiling. Typical L4D2 srcds runs ~500–800 MB; map-load spikes fit in headroom; a runaway is bounded.
|
||||
- `TasksMax=256` — bounds thread count well above srcds' steady-state usage; prevents fork-bomb style failures from leaking host-wide.
|
||||
- `LimitNOFILE=65536` — Valve wiki recommendation; cheap and matches multi-plugin setups.
|
||||
- `KillSignal=SIGINT` — srcds responds to SIGINT for clean shutdown (writes demos, flushes logs); SIGTERM is harsher.
|
||||
- `TimeoutStopSec=15s` — gives srcds time to finish flush before SIGKILL.
|
||||
- `LogRateLimitIntervalSec=0` — disables journald per-unit rate limiting (default `10000 msgs/30s`). srcds + plugins exceed this on busy maps; dropped messages break diagnostics.
|
||||
|
||||
Existing security directives are kept verbatim.
|
||||
|
||||
### Slice unit files
|
||||
|
||||
New file `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice`:
|
||||
|
||||
```ini
|
||||
[Unit]
|
||||
Description=left4me game-server slice
|
||||
Before=slices.target
|
||||
|
||||
[Slice]
|
||||
CPUWeight=1000
|
||||
IOWeight=1000
|
||||
```
|
||||
|
||||
New file `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice`:
|
||||
|
||||
```ini
|
||||
[Unit]
|
||||
Description=left4me script-sandbox build slice
|
||||
Before=slices.target
|
||||
|
||||
[Slice]
|
||||
CPUWeight=10
|
||||
IOWeight=10
|
||||
```
|
||||
|
||||
### Sandbox slice + OOM placement
|
||||
|
||||
Edit `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` to add to the `systemd-run` invocation (transient service mode — the existing helper uses `--unit=` without `--scope`):
|
||||
|
||||
- `--slice=l4d2-build.slice`
|
||||
- `-p OOMScoreAdjust=500`
|
||||
|
||||
Existing `CPUQuota=200%` and `RuntimeMaxSec=3600` stay. Cgroup weight (slice) and CPU quota (per-unit) compose: weight handles contention, quota handles the absolute ceiling.
|
||||
|
||||
### Host sysctls
|
||||
|
||||
New file `deploy/files/etc/sysctl.d/99-left4me.conf`:
|
||||
|
||||
```
|
||||
net.core.rmem_max = 8388608
|
||||
net.core.wmem_max = 8388608
|
||||
net.core.rmem_default = 524288
|
||||
net.core.wmem_default = 524288
|
||||
net.core.netdev_max_backlog = 5000
|
||||
net.core.netdev_budget = 600
|
||||
vm.swappiness = 10
|
||||
```
|
||||
|
||||
Per-value justification:
|
||||
|
||||
- `rmem_max`/`wmem_max = 8 MB` — Linux default of ~208 KB (212992 bytes) is a known bottleneck for sustained UDP. 8 MB is the standard 1 Gbit recommendation (Red Hat performance guide); enough headroom for ~10 instances on a host without going to 16 MB.
|
||||
- `rmem_default`/`wmem_default = 512 KB` — protects sockets that don't explicitly call `setsockopt(SO_RCVBUF/SO_SNDBUF)`; harmless when they do.
|
||||
- `netdev_max_backlog = 5000` — default `1000` overflows under multi-instance UDP burst; the per-CPU softnet queue starts dropping packets once full.
|
||||
- `netdev_budget = 600` — gives softirq more packet-drain headroom per pass; default `300` is undersized for multi-Gbit-class hosts.
|
||||
- `vm.swappiness = 10` — universally recommended for latency-sensitive servers; harmless on swapless hosts.
|
||||
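One way an operator might confirm these values took effect is to parse the shipped conf and compare each key with its file under `/proc/sys`, where dots in the key become path separators. The helper names below are hypothetical, and the parser only handles the simple `key = value` form used in this file.

```python
from typing import Dict


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Parse `key = value` lines, skipping blanks and comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def proc_path(key: str) -> str:
    """Map a sysctl key to its /proc/sys file (net.core.rmem_max style keys)."""
    return "/proc/sys/" + key.replace(".", "/")
```

Reading `proc_path(key)` for each parsed key and comparing it with the expected value gives a quick post-deploy check without shelling out to `sysctl`.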
|
||||
### Deploy script integration
|
||||
|
||||
`deploy/deploy-test-server.sh` must:
|
||||
|
||||
1. Copy `etc/sysctl.d/99-left4me.conf` to `/etc/sysctl.d/`.
|
||||
2. Run `sysctl --system` (or `sysctl -p /etc/sysctl.d/99-left4me.conf`) so values take effect immediately, not on next boot.
|
||||
3. Copy the two `.slice` files into `/usr/local/lib/systemd/system/`.
|
||||
4. `systemctl daemon-reload` after unit/slice changes (already done in current deploy flow).
|
||||
5. No explicit `systemctl start` of the slices is required — they activate on first child reference.
|
||||
|
||||
### Documented escape hatches (no auto-apply)
|
||||
|
||||
Append a "Performance tuning" section to `deploy/README.md`:
|
||||
|
||||
- **CPU governor**: `cpupower frequency-set -g performance` if jitter under load matters more than power. Schedutil is acceptable for sustained UDP workloads. Provide the one-liner; do not ship a oneshot service in v1.
|
||||
- **CPU affinity per instance**: example drop-in at `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf` setting `CPUAffinity=N`. Document the strategy "one instance per core, leave core 0 for system + IRQ".
|
||||
- **NIC tuning**: example `ethtool -G <iface> rx 4096 tx 4096`, IRQ-pinning hints. Hardware-specific; ops-only.
|
||||
- **Real-time scheduling opt-in**: example drop-in adding `CPUSchedulingPolicy=fifo`, `CPUSchedulingPriority=10`, `LimitRTPRIO=10`. Include a one-paragraph warning citing RT-throttling defaults (`sched_rt_runtime_us=950000`) and the failure mode if a single instance misbehaves.
|
||||
|
||||
These stay pure documentation in v1 — no code paths, no tests asserting them.
|
||||
|
||||
### Out-of-scope rationale
|
||||
|
||||
- **SCHED_FIFO**: a misbehaving srcds at any RT priority can starve kernel threads and produces failure modes that are harder to diagnose than the jitter problem it claims to solve. `Nice=-5` plus the slice weights captures the practical benefit. Ops who need RT can opt in via the documented drop-in.
|
||||
- **CPU governor auto-set**: Phoronix and Arch comparisons show `schedutil` is within noise of `performance` on sustained workloads like Source UDP; aggressively forcing `performance` would surprise users on power-managed hosts.
|
||||
- **CPUAffinity in the unit**: the unit template is shared across all instances; a single hard-coded `CPUAffinity=` would pin every instance to the same cores, defeating the purpose. Per-instance pinning needs deploy-time policy that is outside v1's scope.
|
||||
|
||||
### Files changed / added
|
||||
|
||||
```
|
||||
deploy/files/usr/local/lib/systemd/system/left4me-server@.service (modified)
|
||||
deploy/files/usr/local/lib/systemd/system/l4d2-game.slice (new)
|
||||
deploy/files/usr/local/lib/systemd/system/l4d2-build.slice (new)
|
||||
deploy/files/etc/sysctl.d/99-left4me.conf (new)
|
||||
deploy/files/usr/local/libexec/left4me/left4me-script-sandbox (modified)
|
||||
deploy/deploy-test-server.sh (modified — sysctl --system step)
|
||||
deploy/README.md (modified — performance section)
|
||||
deploy/tests/test_deploy_artifacts.py (modified — assertions)
|
||||
```
|
||||
|
||||
## Tests
|
||||
|
||||
`deploy/tests/test_deploy_artifacts.py` additions, following the existing
|
||||
`assert "key=value" in text` pattern:
|
||||
|
||||
- For `left4me-server@.service`, assert every line listed in *Per-instance
|
||||
unit additions* is present verbatim. Each is a separate assertion so a
|
||||
failing line is identifiable.
|
||||
- For `l4d2-game.slice`, assert `CPUWeight=1000` and `IOWeight=1000`.
|
||||
- For `l4d2-build.slice`, assert `CPUWeight=10` and `IOWeight=10`.
|
||||
- For `99-left4me.conf`, assert every sysctl line listed in *Host sysctls*.
|
||||
- For `left4me-script-sandbox`, assert the strings `--slice=l4d2-build.slice`
|
||||
and `OOMScoreAdjust=500` both appear.
|
||||
- Assert the deploy script invokes `sysctl --system` (or
|
||||
`sysctl -p /etc/sysctl.d/99-left4me.conf`) at least once after copying the
|
||||
conf into place.
|
||||
|
||||
No runtime perf tests in v1 — the spec ships defaults, not measured wins.
|
||||
Real-world measurement is left to operators with concrete instance counts,
|
||||
hardware, and player loads.
|
||||
|
||||
## Rollout
|
||||
|
||||
Single deploy. Running game servers will not pick up the new directives until each instance is restarted (systemd does not reapply unit changes to already-running services). The web UI's "stop" + "start" cycle is sufficient. Document this in `deploy/README.md`.
|
||||
|
||||
## Open questions
|
||||
|
||||
None blocking. v2 candidates if measurement justifies them:
|
||||
|
||||
- Per-instance `CPUAffinity` driven by a deploy-env knob (`LEFT4ME_INSTANCE_CPUS`).
|
||||
- Job-worker awareness of "server has active players" to defer builds further than weights alone.
|
||||
- Optional `left4me-host-perf.service` oneshot that sets governor + NIC tuning under a single env-flag opt-in.
|
||||
|
||||
## References
|
||||
|
||||
- systemd.exec(5) — `Nice=`, `IOSchedulingClass=`, `IOSchedulingPriority=`, `OOMScoreAdjust=`, `LimitNOFILE=`, `LogRateLimitIntervalSec=`.
- systemd.resource-control(5) — slice semantics, `Slice=`, `CPUWeight=`, `IOWeight=`, `MemoryHigh=`, `MemoryMax=`, `TasksMax=`, weight competition rules.
- systemd.kill(5) — signal handling and `KillSignal=`.
- systemd.service(5) — `TimeoutStopSec=`.
|
||||
- Red Hat Enterprise Linux Network Performance Tuning Guide — `rmem_max`/`wmem_max`/`netdev_max_backlog`/`netdev_budget`.
|
||||
- LWN "SCHED_FIFO and realtime throttling"; RHEL Real-Time CPU throttling docs — rationale for not shipping RT by default.
|
||||
- Linux Foundation real-time wiki — `sched_rt_runtime_us` semantics.
|
||||
- forums.srcds.com / AlliedModders / linuxquestions.org threads — confirmation that srcds is single-threaded per instance.
|
||||
- Phoronix governor comparisons — performance vs schedutil for sustained workloads.
|
||||
- Multiple latency-tuning guides — `vm.swappiness=10` consensus.
|
||||
|
|
@ -0,0 +1,217 @@
|
|||
# l4d2 server lifecycle: reboot-safe + drift reconciliation — design
|
||||
|
||||
Date: 2026-05-09
|
||||
Status: design
|
||||
|
||||
## Summary
|
||||
|
||||
Make L4D2 server instances survive a host reboot by switching their lifecycle verbs from `systemctl start`/`stop` to `systemctl enable --now`/`disable --now`. Pair this with a periodic background poller that refreshes `Server.actual_state` so out-of-band state changes (OOM kills, manual `systemctl stop`, crashes that exhaust `Restart=on-failure`) no longer leave the web UI showing stale "running" indicators.
|
||||
|
||||
## Goals
|
||||
|
||||
- An L4D2 server started via the web UI (or `l4d2ctl start`) automatically comes back up after a host reboot, with no operator action.
|
||||
- The web app's `Server.actual_state` converges to systemd's actual state within ~30 seconds of any out-of-band change.
|
||||
- The single source of truth for "this server should be running" lives in systemd's wants-symlinks, not in a SQLite row that systemd has no awareness of.
|
||||
- Migration from the existing `systemctl start`-based fleet is a no-op: the next stop+start cycle through the UI converts each server to the enable-based model.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Auto-restart on detected drift.** When the poller observes `desired_state=running` but `actual_state=stopped`, this spec does not re-enqueue a start job. That's a v2 UX/policy decision.
|
||||
- **UI surfacing of stale-state warnings.** Once the poller is reliable, the dashboard could show "DB believes X, but actual_state was last refreshed N seconds ago." Out of scope.
|
||||
- **Reconciliation of orphan systemd units.** Units enabled on disk but not represented by any `Server` row (e.g., from a crashed delete) — separate cleanup spec.
|
||||
- **Per-server poller intervals.** A single global cadence is sufficient.
|
||||
- **Replacing `Restart=on-failure`** with anything more elaborate. The unit's existing restart policy stays.
|
||||
- **Reactive-style state propagation.** No SSE/websocket pushes to the UI when actual_state changes. The next page render reads the fresh value from the DB.
|
||||
|
||||
## Premise check: system units, not user units
|
||||
|
||||
`systemctl --user enable --now` has different lifecycle rules — auto-start only at user login (unless `loginctl enable-linger <user>` is set), symlinks land in `~/.config/systemd/user/<target>.wants/`. It would be wrong here.
|
||||
|
||||
This project uses **system units**, confirmed by:
|
||||
|
||||
- Unit path: `/usr/local/lib/systemd/system/left4me-server@.service` is the system search path; user units live in `/etc/systemd/user/` or `~/.config/systemd/user/`.
|
||||
- The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl:31-44`) calls plain `systemctl` (no `--user` flag) and runs as **root** via the sudoers rule at `deploy/files/etc/sudoers.d/left4me:2`.
|
||||
- The unit's `[Install] WantedBy=multi-user.target` (line 43 of the unit) is a system target; user units would use `default.target`.
|
||||
- The same machinery is already in production for `left4me-web.service`: `deploy-test-server.sh` runs `sudo systemctl enable --now left4me-web.service`, which is why the web service came back automatically after today's reboot. We're applying the same pattern to the game-server template instances.
|
||||
|
||||
`systemctl enable left4me-server@1.service` will create `/etc/systemd/system/multi-user.target.wants/left4me-server@1.service` symlinked to `/usr/local/lib/systemd/system/left4me-server@.service`. systemd handles the template instantiation via the `@` syntax automatically.
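
To make the expected on-disk effect concrete, here is a minimal sketch that computes the wants-symlink `systemctl enable` should create for a template instance. The helper names `wants_symlink_for` and `is_enabled_on_disk` are hypothetical and not part of the codebase; only the two paths come from this spec.

```python
from pathlib import Path

# Paths quoted in this spec; the helpers below are illustrative only.
SYSTEM_WANTS_DIR = Path("/etc/systemd/system/multi-user.target.wants")
TEMPLATE_UNIT = Path("/usr/local/lib/systemd/system/left4me-server@.service")


def wants_symlink_for(instance: str) -> tuple[Path, Path]:
    """Return (link, target) that `systemctl enable` is expected to create."""
    link = SYSTEM_WANTS_DIR / f"left4me-server@{instance}.service"
    return link, TEMPLATE_UNIT


def is_enabled_on_disk(instance: str) -> bool:
    """True if the wants-symlink exists; a cheap check that needs no systemctl call."""
    link, _target = wants_symlink_for(instance)
    return link.is_symlink()
```

A check like this could be handy in a manual post-deploy verification, though `systemctl is-enabled` remains the authoritative answer.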
|
||||
|
||||
## Background
|
||||
|
||||
Today's behavior, confirmed by forensics on `ckn@10.0.4.128` after the operator ran `sudo systemctl poweroff` at 11:48:02 CEST:
|
||||
|
||||
- The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`) accepts the verbs `start`, `stop`, and `show`, each invoking the literal `systemctl` action.
|
||||
- `l4d2host/service_control.py` exposes `start_service(name)` and `stop_service(name)` that build `systemctl_command("start"/"stop", name)`.
|
||||
- `l4d2host/instances.py` `start_instance` and `stop_instance` call those functions.
|
||||
- `systemctl start` is a transient activation. systemd creates **no** symlink under `multi-user.target.wants/`, so the unit doesn't auto-start on next boot.
|
||||
- After the host poweroff at 11:48:02, both running instances were cleanly shut down. The host rebooted; `left4me-web.service` came back (it *is* `enable`d); the game instances did not.
|
||||
- The web app's `Server.actual_state` is only ever written by `refresh_server_actual_state_after_job()` in `l4d2web/services/job_worker.py:581`, called solely after a job completes. With no jobs in flight after the reboot, the row's `actual_state="running"` from yesterday remained the displayed truth.
|
||||
|
||||
## Design
|
||||
|
||||
### Part A — Switch lifecycle verbs to `enable --now` / `disable --now`
|
||||
|
||||
**Helper script** (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`):
|
||||
|
||||
Rename the action verbs the helper accepts: drop `start`/`stop`, add `enable`/`disable`. The bodies become:
|
||||
|
||||
```sh
case "$action" in
  enable)  exec "$systemctl" enable --now "$unit" ;;
  disable) exec "$systemctl" disable --now "$unit" ;;
  show)    exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
  *)       reject ;;
esac
```
|
||||
|
||||
The existing instance-name validation regex (currently lines 12–17) is unchanged — it constrains the `<name>` argument, not the action. The sudoers rule at `deploy/files/etc/sudoers.d/left4me`:
|
||||
|
||||
```
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-systemctl *
```
|
||||
|
||||
already passes any args; no sudoers update needed.
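
The helper's `show` verb prints `Key=Value` lines for `ActiveState` and `SubState`. As a rough sketch of how a caller might consume that output, the functions below parse the lines and collapse them into a coarse label. Both function names and the exact mapping to `running` / `stopped` / `unknown` are assumptions for illustration; the real refresh code may map states differently.

```python
def parse_show_output(text: str) -> dict[str, str]:
    """Parse `Key=Value` lines as printed by `systemctl show --property=...`."""
    props: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key.strip()] = value.strip()
    return props


def to_actual_state(props: dict[str, str]) -> str:
    """Collapse ActiveState/SubState into a coarse label (mapping is an assumption)."""
    if props.get("ActiveState") == "active" and props.get("SubState") == "running":
        return "running"
    if props.get("ActiveState") in ("inactive", "failed"):
        return "stopped"
    return "unknown"  # e.g. activating/deactivating, or missing properties
```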
|
||||
|
||||
**Python wrapper** (`l4d2host/service_control.py`):
|
||||
|
||||
Rename `start_service` → `enable_service` and `stop_service` → `disable_service`. Each builds `systemctl_command("enable", name)` / `systemctl_command("disable", name)`. The existing `show_service` is unchanged.
|
||||
|
||||
**Instance lifecycle** (`l4d2host/instances.py`):
|
||||
|
||||
- `start_instance` — replace the `start_service(...)` call with `enable_service(...)`.
|
||||
- `stop_instance` — replace `stop_service(...)` with `disable_service(...)`.
|
||||
- `_purge_instance` (called by `delete_instance` and `reset_instance`): replace `stop_service(...)` with `disable_service(...)`. For a unit that is not running, `disable --now` leaves the runtime untouched but still removes any leftover wants-symlink, which is the desired idempotent behavior.
|
||||
|
||||
**CLI surface** (`l4d2host/cli.py`):
|
||||
|
||||
`l4d2ctl start <name>` and `l4d2ctl stop <name>` keep their names per the contract in `AGENTS.md` ("Host CLI write commands are fixed to: install, initialize, start, stop, delete"). The semantics now genuinely match the verb at the operator level: `start` = "ensure running, now and after reboot." Internal call paths route through `start_instance` → `enable_service` as renamed above.
|
||||
|
||||
**Web facade** (`l4d2web/services/l4d2_facade.py`):
|
||||
|
||||
Unchanged. Still invokes `["l4d2ctl", "start", ...]` / `["l4d2ctl", "stop", ...]`.
|
||||
|
||||
### Part B — Periodic state poller
|
||||
|
||||
Add a single background thread spawned alongside the existing job-worker threads in `l4d2web/services/job_worker.py:start_job_workers`:
|
||||
|
||||
```python
def start_state_poller(app):
    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
    thread = threading.Thread(
        target=state_poller_loop,
        args=(app, interval),
        daemon=True,
        name="left4me-state-poller",
    )
    thread.start()


def state_poller_loop(app, interval):
    while True:
        try:
            with app.app_context():
                poll_all_servers()
        except Exception:
            pass  # never let a single failure kill the loop
        time.sleep(interval)


def poll_all_servers():
    with session_scope() as db:
        active_server_ids = set(db.scalars(
            select(Job.server_id).where(Job.state.in_(("queued", "running")))
        ).all())
        server_ids = [
            sid for sid in db.scalars(select(Server.id)).all()
            if sid not in active_server_ids
        ]
    for sid in server_ids:
        try:
            refresh_server_actual_state(sid)
        except Exception:
            pass
```
|
||||
|
||||
**Why skip in-flight servers:** the job worker's success path also calls `refresh_server_actual_state`. Two writers touching the same row at overlapping times cannot corrupt it (SQLite WAL serializes writes), but a poller observing transient state mid-job (e.g., the brief window where the unit is being enabled but `srcds` hasn't fully bound the port yet) could write a misleading value that the worker's post-completion refresh then overwrites. Skipping is simpler than reasoning about the orderings.
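
The skip rule can be isolated as a pure function, which makes it easy to test without a database. This is a sketch only: `servers_to_poll` and the `(server_id, job_state)` pair shape are hypothetical stand-ins for the two queries in `poll_all_servers()`.

```python
from typing import Iterable

# Job states that count as "in flight", as named in poll_all_servers().
IN_FLIGHT_STATES = ("queued", "running")


def servers_to_poll(
    all_server_ids: Iterable[int],
    jobs: Iterable[tuple[int, str]],
) -> list[int]:
    """Return the server ids with no queued or running job.

    `jobs` is an iterable of (server_id, job_state) pairs.
    """
    busy = {sid for sid, state in jobs if state in IN_FLIGHT_STATES}
    return [sid for sid in all_server_ids if sid not in busy]
```

For example, with servers 1, 2 and 3 and a running job on server 2, only servers 1 and 3 would be refreshed on that tick.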
|
||||
|
||||
**Wiring in startup** (`l4d2web/app.py:create_app`): call `start_state_poller(app)` adjacent to `start_job_workers(app)`, gated by the same `should_start_workers` predicate (existing lines 84–88: `JOB_WORKER_ENABLED and not TESTING and not _in_flask_cli_context()`).
|
||||
|
||||
**First-tick latency:** the loop runs `poll_all_servers()` once before the first `time.sleep(interval)`, so the DB catches up to systemd reality within seconds of app boot (the cost is one `systemctl show` per server). A separate startup-reconcile path is not needed.
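
The poll-before-sleep ordering is what gives the immediate first tick. The sketch below shows the same ordering in a variant that can be stopped cleanly, which is mainly useful in tests. The `poller_loop` name and the injected `poll_once` / `stop` parameters are assumptions; the spec's loop uses `time.sleep` and runs until the process exits.

```python
import threading
from typing import Callable


def poller_loop(
    poll_once: Callable[[], None],
    interval: float,
    stop: threading.Event,
) -> None:
    """Poll immediately, then once per interval until `stop` is set."""
    while not stop.is_set():
        try:
            poll_once()
        except Exception:
            pass  # mirror the spec: one failed tick must not end the loop
        # Like time.sleep(interval), but returns early once `stop` is set.
        stop.wait(interval)
```

Because the first `poll_once()` call happens before any wait, a test can set `stop` from inside the callback and observe exactly one tick without sleeping.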
|
||||
|
||||
**Concurrency:** the poller and the workers all use `session_scope()` (`l4d2web/db.py:44–58`) which commits-on-success / rolls-back-on-exception. SQLite WAL mode (configured by the deploy script per `deploy-test-server.sh:188-198`) handles concurrent reads + serialized writes. No new locking primitives.
|
||||
|
||||
### Why both parts
|
||||
|
||||
Either part alone is insufficient:
|
||||
|
||||
- **Part A alone** survives reboots but doesn't catch OOM kills, manual `systemctl disable --now <unit>` from a shell, or crashes that exhaust `Restart=on-failure`. The DB still drifts in those cases.
|
||||
- **Part B alone** keeps the DB honest but doesn't bring servers back after a reboot — the operator would still be looking at `actual_state=stopped` on a server they expected to come back, with the only recourse being to click start again.
|
||||
|
||||
Together: enable-based lifecycle keeps systemd as the source of truth; the poller keeps the DB honest about whatever systemd reports.
|
||||
|
||||
### Migration on running hosts
|
||||
|
||||
No one-shot migration is needed. After this lands, a server currently running via the old `systemctl start` (that is, started but not enabled) keeps running through the deploy. The next time the operator clicks stop in the UI, `systemctl disable --now` runs: `disable` is a no-op for a unit that is not enabled, but `--now` still stops the live process. The next start runs `systemctl enable --now`, which enables and starts the unit. From that point on the unit survives reboot.
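
The migration path in the paragraph above can be traced with a deliberately simplified model of a unit as two booleans. Everything here (`UnitState`, `apply`, the command strings) is illustrative; it ignores crashes, `Restart=on-failure`, and every other systemd detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitState:
    enabled: bool  # wants-symlink present, so the unit starts at boot
    active: bool   # process currently running


def apply(state: UnitState, command: str) -> UnitState:
    """Simplified model of how each command changes a unit."""
    if command == "start":
        return UnitState(state.enabled, True)
    if command == "enable --now":
        return UnitState(True, True)
    if command == "disable --now":
        return UnitState(False, False)
    if command == "reboot":
        # After a reboot only enabled units come back.
        return UnitState(state.enabled, state.enabled)
    raise ValueError(f"unknown command: {command}")


# A legacy server: started with plain `systemctl start`, never enabled.
legacy = apply(UnitState(enabled=False, active=False), "start")
```

In this model, rebooting `legacy` leaves it stopped, while one `disable --now` followed by `enable --now` yields a unit that is still active after a reboot.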
|
||||
|
||||
The poller's first tick after deploy refreshes every server's `actual_state` to whatever systemd reports. If the test box's two stale rows still claim `running` but no unit is loaded, that tick flips them to `stopped`.
|
||||
|
||||
### Files changed / added
|
||||
|
||||
```
deploy/files/usr/local/libexec/left4me/left4me-systemctl   (Part A — verbs)
l4d2host/service_control.py                                (Part A — rename)
l4d2host/instances.py                                      (Part A — call new names)
l4d2host/tests/test_lifecycle.py                           (Part A — test updates)
l4d2host/tests/test_service_control.py                     (Part A — new direct unit tests, create if absent)
deploy/tests/test_deploy_artifacts.py                      (Part A — helper assertions)

l4d2web/services/job_worker.py                             (Part B — poller code)
l4d2web/app.py                                             (Part B — wire start_state_poller)
l4d2web/config.py                                          (Part B — STATE_POLLER_INTERVAL_SECONDS default)
l4d2web/tests/test_job_worker.py                           (Part B — poller tests)
```
|
||||
|
||||
## Tests
|
||||
|
||||
### Part A
|
||||
|
||||
- `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`: update body assertions to expect `enable)` / `disable)` / `show)`. Add an assertion that `enable)` body contains `enable --now` and `disable)` body contains `disable --now`. Update rejected-action examples (drop `start`/`stop` since they're no longer accepted).
|
||||
- `l4d2host/tests/test_lifecycle.py`: every assertion that mocks `run_command` and inspects the systemctl-helper invocation needs the action token updated from `start` → `enable` and `stop` → `disable`. The `_purge_instance` paths exercised by `delete_instance` and `reset_instance` flip from `stop` to `disable`.
|
||||
- New direct unit tests in `l4d2host/tests/test_service_control.py` (create the file if it doesn't exist already): exercise `enable_service` and `disable_service` with a mocked `run_command` and assert they emit `["sudo", "-n", helper_path, "enable"|"disable", name]`.
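
A possible shape for the new `test_service_control.py` tests is sketched below. To stay self-contained it defines stand-ins for `systemctl_command`, `enable_service` and `disable_service` and injects `run_command` as a parameter; the real functions live in `l4d2host/service_control.py`, take more keyword arguments, and would be exercised by monkey-patching the module-level `run_command` instead.

```python
HELPER_PATH = "/usr/local/libexec/left4me/left4me-systemctl"  # path from this spec


def systemctl_command(action: str, name: str) -> list[str]:
    # Stand-in for the real command builder in l4d2host/service_control.py.
    return ["sudo", "-n", HELPER_PATH, action, name]


def enable_service(name: str, run_command):
    # Simplified: the real function also forwards on_stdout, on_stderr, etc.
    return run_command(systemctl_command("enable", name))


def disable_service(name: str, run_command):
    return run_command(systemctl_command("disable", name))


def test_enable_and_disable_emit_helper_invocations():
    calls: list[list[str]] = []
    enable_service("alpha", calls.append)
    disable_service("alpha", calls.append)
    assert calls == [
        ["sudo", "-n", HELPER_PATH, "enable", "alpha"],
        ["sudo", "-n", HELPER_PATH, "disable", "alpha"],
    ]
```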
|
||||
|
||||
### Part B
|
||||
|
||||
- `l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server` (new): seed two `Server` rows with `actual_state="unknown"`; monkey-patch `refresh_server_actual_state` to record calls; run one iteration of `poll_all_servers()`; assert it was called once per server in any order.
|
||||
- `test_state_poller_skips_servers_with_inflight_jobs` (new): seed a `Server` row + a `Job` with `state="running"` for that server; run `poll_all_servers()`; assert `refresh_server_actual_state` was NOT called for that server.
|
||||
- `test_state_poller_swallows_per_server_exceptions` (new): make `refresh_server_actual_state` raise for one server; assert other servers are still polled and the loop function returns normally.
|
||||
- `test_state_poller_disabled_when_job_workers_disabled` (new): create app with `JOB_WORKER_ENABLED=False`; assert `start_state_poller` is not invoked (or that no `left4me-state-poller` thread is alive after `create_app`).
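
The exception-swallowing test above could look roughly like the following. The sketch uses a hypothetical `poll_each` function with an injected `refresh` callable so that it runs without the app or a database; the real test would monkey-patch `refresh_server_actual_state` and call `poll_all_servers()`.

```python
def poll_each(server_ids, refresh):
    """Call refresh(sid) for every id; a failure for one id does not stop the rest."""
    for sid in server_ids:
        try:
            refresh(sid)
        except Exception:
            pass  # same policy as the spec's inner try/except


def test_poll_swallows_per_server_exceptions():
    seen = []

    def flaky_refresh(sid):
        seen.append(sid)
        if sid == 2:
            raise RuntimeError("systemctl show failed")

    # Must return normally even though server 2 raises.
    poll_each([1, 2, 3], flaky_refresh)
    assert seen == [1, 2, 3]
```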
|
||||
|
||||
### CI sanity
|
||||
|
||||
`pytest deploy/tests/ l4d2host/tests l4d2web/tests -q` is green except the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state` (stale since `caa8b83`, out of scope).
|
||||
|
||||
## Rollout
|
||||
|
||||
Single deploy. After deploy:
|
||||
|
||||
1. The poller's first tick (within seconds of `left4me-web.service` starting) refreshes every server's `actual_state` to systemd reality. Any servers stuck on stale "running" flip to "stopped" automatically. **No operator UI clicks required.**
|
||||
2. Servers currently `running` (started via the old `systemctl start`) keep running, but they're not yet `enabled`. The operator's next stop+start through the UI converts them to enable-based and from that point onwards they're reboot-safe.
|
||||
3. Newly-started servers (`l4d2ctl start <name>` or web UI start) are enable-based from the first invocation.
|
||||
|
||||
If something goes wrong — e.g., the helper rejects a previously-valid invocation or the poller floods the journal — the helper script + `service_control.py` change can be reverted independently of the poller, and vice versa.
|
||||
|
||||
## Open questions
|
||||
|
||||
None blocking. v2 candidates:
|
||||
|
||||
- Auto-restart on `desired_state=running && actual_state=stopped` (separate UX decision).
|
||||
- Per-server poll intervals or backoff for repeatedly-failing servers.
|
||||
- A "drift" badge in the UI when `actual_state_updated_at` is older than 2× the poll interval (proxy for "the poller isn't running" or "the host is unreachable").
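
For the drift-badge candidate, the staleness rule reduces to a single comparison. The sketch below assumes `actual_state_updated_at` is stored as a timezone-aware `datetime`; the function name and the `factor` parameter are hypothetical, and nothing like it exists in the codebase yet.

```python
from datetime import datetime, timedelta, timezone


def is_drift_suspected(
    updated_at: datetime,
    now: datetime,
    poll_interval_seconds: float,
    factor: float = 2.0,
) -> bool:
    """True when the last refresh is older than factor x the poll interval."""
    threshold = timedelta(seconds=poll_interval_seconds * factor)
    return (now - updated_at) > threshold
```

With the default 30-second interval, a row last refreshed more than 60 seconds ago would be flagged.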
|
||||
|
||||
## References
|
||||
|
||||
- systemd.unit(5) — `WantedBy=`, `Install` section semantics.
|
||||
- systemctl(1) — `enable --now` / `disable --now` flags.
|
||||
- Existing perf-baseline spec: `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`.
|
||||
- Existing CPU-isolation spec: `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`.
|
||||
- `AGENTS.md` — Host CLI write-command set is fixed; this spec preserves that contract.
|
||||
|
|
@ -1,30 +0,0 @@
|
|||
from abc import ABC, abstractmethod
|
||||
from pathlib import Path
|
||||
from typing import Callable
|
||||
|
||||
|
||||
class OverlayMounter(ABC):
|
||||
@abstractmethod
|
||||
def mount(
|
||||
self,
|
||||
*,
|
||||
lowerdirs: str,
|
||||
upperdir: Path,
|
||||
workdir: Path,
|
||||
merged: Path,
|
||||
on_stdout: Callable[[str], None] | None = None,
|
||||
on_stderr: Callable[[str], None] | None = None,
|
||||
passthrough: bool = False,
|
||||
) -> None:
|
||||
raise NotImplementedError
|
||||
|
||||
@abstractmethod
|
||||
def unmount(
|
||||
self,
|
||||
*,
|
||||
merged: Path,
|
||||
on_stdout: Callable[[str], None] | None = None,
|
||||
on_stderr: Callable[[str], None] | None = None,
|
||||
passthrough: bool = False,
|
||||
) -> None:
|
||||
raise NotImplementedError
|
||||
|
|
@ -1,53 +0,0 @@
|
|||
from pathlib import Path
|
||||
from typing import Callable
|
||||
|
||||
from l4d2host.fs.base import OverlayMounter
|
||||
from l4d2host.process import run_command
|
||||
|
||||
|
||||
HELPER_PATH = "/usr/local/libexec/left4me/left4me-overlay"
|
||||
|
||||
|
||||
class KernelOverlayFSMounter(OverlayMounter):
|
||||
# Delegates the actual mount/umount syscalls to the privileged
|
||||
# left4me-overlay helper. The helper takes only the instance name and
|
||||
# rederives lowerdirs/upper/work/merged from disk; the OverlayMounter
|
||||
# ABC accepts those args for compatibility, so we extract the name
|
||||
# from the merged path's parent directory.
|
||||
def mount(
|
||||
self,
|
||||
*,
|
||||
lowerdirs: str,
|
||||
upperdir: Path,
|
||||
workdir: Path,
|
||||
merged: Path,
|
||||
on_stdout: Callable[[str], None] | None = None,
|
||||
on_stderr: Callable[[str], None] | None = None,
|
||||
passthrough: bool = False,
|
||||
should_cancel: Callable[[], bool] | None = None,
|
||||
) -> None:
|
||||
del lowerdirs, upperdir, workdir
|
||||
run_command(
|
||||
["sudo", "-n", HELPER_PATH, "mount", merged.parent.name],
|
||||
on_stdout=on_stdout,
|
||||
on_stderr=on_stderr,
|
||||
passthrough=passthrough,
|
||||
should_cancel=should_cancel,
|
||||
)
|
||||
|
||||
def unmount(
|
||||
self,
|
||||
*,
|
||||
merged: Path,
|
||||
on_stdout: Callable[[str], None] | None = None,
|
||||
on_stderr: Callable[[str], None] | None = None,
|
||||
passthrough: bool = False,
|
||||
should_cancel: Callable[[], bool] | None = None,
|
||||
) -> None:
|
||||
run_command(
|
||||
["sudo", "-n", HELPER_PATH, "umount", merged.parent.name],
|
||||
on_stdout=on_stdout,
|
||||
on_stderr=on_stderr,
|
||||
passthrough=passthrough,
|
||||
should_cancel=should_cancel,
|
||||
)
|
||||
|
|
@ -1,21 +1,16 @@
|
|||
import os
|
||||
from pathlib import Path
|
||||
import shutil
|
||||
import subprocess
|
||||
from typing import Callable
|
||||
|
||||
from l4d2host.fs.kernel_overlayfs import KernelOverlayFSMounter
|
||||
from l4d2host.paths import DEFAULT_LEFT4ME_ROOT, get_left4me_root, overlay_path, validate_instance_name
|
||||
from l4d2host.service_control import start_service, stop_service
|
||||
from l4d2host.service_control import disable_service, enable_service
|
||||
from l4d2host.spec import load_spec
|
||||
|
||||
|
||||
from l4d2host.logging import emit_step
|
||||
|
||||
|
||||
_mounter = KernelOverlayFSMounter()
|
||||
|
||||
|
||||
DEFAULT_ROOT = DEFAULT_LEFT4ME_ROOT
|
||||
|
||||
|
||||
|
|
@ -63,16 +58,6 @@ def initialize_instance(
|
|||
emit_step("initialization complete.", on_stdout, passthrough)
|
||||
|
||||
|
||||
def _load_instance_env(path: Path) -> dict[str, str]:
|
||||
result: dict[str, str] = {}
|
||||
for line in path.read_text().splitlines():
|
||||
if "=" not in line:
|
||||
continue
|
||||
key, value = line.split("=", 1)
|
||||
result[key] = value
|
||||
return result
|
||||
|
||||
|
||||
def start_instance(
|
||||
name: str,
|
||||
*,
|
||||
|
|
@ -87,25 +72,14 @@ def start_instance(
|
|||
instance_dir = root / "instances" / name
|
||||
runtime_dir = root / "runtime" / name
|
||||
|
||||
env = _load_instance_env(instance_dir / "instance.env")
|
||||
|
||||
merged = runtime_dir / "merged"
|
||||
if os.path.ismount(merged):
|
||||
# Kernel overlayfs mounts persist when the web worker dies (unlike
|
||||
# fuse daemons, which were reaped with their cgroup). Refuse rather
|
||||
# than double-mount.
|
||||
raise subprocess.CalledProcessError(
|
||||
returncode=1,
|
||||
cmd=["start_instance"],
|
||||
stderr=f"runtime overlay already mounted at {merged}; refusing to double-mount",
|
||||
)
|
||||
|
||||
# Stage cfg files in the upper layer BEFORE mounting. Writing through
|
||||
# merged after the mount triggers overlayfs copy-up, which preserves the
|
||||
# lower file's ownership — and a script-sandbox-built `server.cfg` is
|
||||
# owned by `l4d2-sandbox`, not the worker. Pre-mount writes go straight to
|
||||
# upper with the worker's uid; the kernel just shows them at the top of
|
||||
# the merged stack once mounted.
|
||||
# Stage cfg files in the upper layer. Writing here goes straight to the
|
||||
# upper dir on the host filesystem with the worker's uid; the unit's
|
||||
# ExecStartPre then mounts the overlay (single source of truth for the
|
||||
# mount), and the kernel surfaces these files at the top of the merged
|
||||
# stack. A script-sandbox-built lower-layer `server.cfg` is owned by
|
||||
# `l4d2-sandbox`, not the worker — staging in upper sidesteps the
|
||||
# ownership-preserving copy-up that would happen if we wrote through
|
||||
# merged post-mount.
|
||||
emit_step("staging server.cfg + per-overlay aliases in upper layer...", on_stdout, passthrough)
|
||||
upper_cfg_dir = runtime_dir / "upper" / "left4dead2" / "cfg"
|
||||
upper_cfg_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
|
@ -121,20 +95,8 @@ def start_instance(
|
|||
continue
|
||||
shutil.copy2(src, upper_cfg_dir / f"server_{o.alias}.cfg")
|
||||
|
||||
emit_step("mounting runtime overlay...", on_stdout, passthrough)
|
||||
_mounter.mount(
|
||||
lowerdirs=env["L4D2_LOWERDIRS"],
|
||||
upperdir=runtime_dir / "upper",
|
||||
workdir=runtime_dir / "work",
|
||||
merged=merged,
|
||||
on_stdout=on_stdout,
|
||||
on_stderr=on_stderr,
|
||||
passthrough=passthrough,
|
||||
should_cancel=should_cancel,
|
||||
)
|
||||
|
||||
emit_step("starting systemd service...", on_stdout, passthrough)
|
||||
start_service(
|
||||
emit_step("enabling + starting systemd service...", on_stdout, passthrough)
|
||||
enable_service(
|
||||
name,
|
||||
on_stdout=on_stdout,
|
||||
on_stderr=on_stderr,
|
||||
|
|
@ -155,25 +117,17 @@ def stop_instance(
|
|||
) -> None:
|
||||
name = validate_instance_name(name)
|
||||
root = get_left4me_root() if root is None else Path(root)
|
||||
emit_step("stopping systemd service...", on_stdout, passthrough)
|
||||
stop_service(
|
||||
# `disable --now` triggers the unit's ExecStopPost, which unmounts the
|
||||
# overlay. Single source of truth for unmount lives in the unit file;
|
||||
# no Python-side unmount needed.
|
||||
emit_step("disabling + stopping systemd service...", on_stdout, passthrough)
|
||||
disable_service(
|
||||
name,
|
||||
on_stdout=on_stdout,
|
||||
on_stderr=on_stderr,
|
||||
passthrough=passthrough,
|
||||
should_cancel=should_cancel,
|
||||
)
|
||||
emit_step("unmounting runtime overlay (if mounted)...", on_stdout, passthrough)
|
||||
try:
|
||||
_mounter.unmount(
|
||||
merged=root / "runtime" / name / "merged",
|
||||
on_stdout=on_stdout,
|
||||
on_stderr=on_stderr,
|
||||
passthrough=passthrough,
|
||||
should_cancel=should_cancel,
|
||||
)
|
||||
except subprocess.CalledProcessError:
|
||||
pass
|
||||
emit_step("stop complete.", on_stdout, passthrough)
|
||||
|
||||
|
||||
|
|
@ -189,9 +143,13 @@ def _purge_instance(
|
|||
instance_dir = root / "instances" / name
|
||||
runtime_dir = root / "runtime" / name
|
||||
|
||||
emit_step("stopping systemd service (if running)...", on_stdout, passthrough)
|
||||
# disable --now triggers ExecStopPost which unmounts. The try/except
|
||||
# tolerates the unit-not-loaded case (e.g., delete on an instance that
|
||||
# was initialized but never started — no unit, nothing to disable, no
|
||||
# mount to clean up either).
|
||||
emit_step("disabling + stopping systemd service (if running)...", on_stdout, passthrough)
|
||||
try:
|
||||
stop_service(
|
||||
disable_service(
|
||||
name,
|
||||
on_stdout=on_stdout,
|
||||
on_stderr=on_stderr,
|
||||
|
|
@ -201,18 +159,6 @@ def _purge_instance(
|
|||
except subprocess.CalledProcessError:
|
||||
pass
|
||||
|
||||
emit_step("unmounting runtime overlay (if mounted)...", on_stdout, passthrough)
|
||||
try:
|
||||
_mounter.unmount(
|
||||
merged=runtime_dir / "merged",
|
||||
on_stdout=on_stdout,
|
||||
on_stderr=on_stderr,
|
||||
passthrough=passthrough,
|
||||
should_cancel=should_cancel,
|
||||
)
|
||||
except subprocess.CalledProcessError:
|
||||
pass
|
||||
|
||||
emit_step("removing instance files...", on_stdout, passthrough)
|
||||
if instance_dir.exists():
|
||||
shutil.rmtree(instance_dir)
|
||||
|
|
|
|||
|
|
@ -17,7 +17,7 @@ dependencies = [
|
|||
l4d2ctl = "l4d2host.cli:app"
|
||||
|
||||
[tool.setuptools]
|
||||
packages = ["l4d2host", "l4d2host.fs"]
|
||||
packages = ["l4d2host"]
|
||||
|
||||
[tool.setuptools.package-dir]
|
||||
l4d2host = "."
|
||||
|
|
|
|||
|
|
@ -17,7 +17,7 @@ def journalctl_command(name: str, lines: int = 200, follow: bool = True) -> list
|
|||
return ["sudo", "-n", JOURNALCTL_HELPER, name, "--lines", str(lines), follow_arg]
|
||||
|
||||
|
||||
def start_service(
|
||||
def enable_service(
|
||||
name: str,
|
||||
*,
|
||||
on_stdout: Callable[[str], None] | None = None,
|
||||
|
|
@ -26,7 +26,7 @@ def start_service(
|
|||
should_cancel: Callable[[], bool] | None = None,
|
||||
) -> CommandResult:
|
||||
return run_command(
|
||||
systemctl_command("start", name),
|
||||
systemctl_command("enable", name),
|
||||
on_stdout=on_stdout,
|
||||
on_stderr=on_stderr,
|
||||
passthrough=passthrough,
|
||||
|
|
@ -34,7 +34,7 @@ def start_service(
|
|||
)
|
||||
|
||||
|
||||
def stop_service(
|
||||
def disable_service(
|
||||
name: str,
|
||||
*,
|
||||
on_stdout: Callable[[str], None] | None = None,
|
||||
|
|
@ -43,7 +43,7 @@ def stop_service(
|
|||
should_cancel: Callable[[], bool] | None = None,
|
||||
) -> CommandResult:
|
||||
return run_command(
|
||||
systemctl_command("stop", name),
|
||||
systemctl_command("disable", name),
|
||||
on_stdout=on_stdout,
|
||||
on_stderr=on_stderr,
|
||||
passthrough=passthrough,
|
||||
|
|
|
|||
|
|
@ -1,76 +0,0 @@
|
|||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
HELPER_PATH = "/usr/local/libexec/left4me/left4me-overlay"
|
||||
|
||||
|
||||
def test_mount_invokes_helper_with_name_only(monkeypatch: pytest.MonkeyPatch) -> None:
|
||||
from l4d2host.fs.kernel_overlayfs import KernelOverlayFSMounter
|
||||
|
||||
calls: list[list[str]] = []
|
||||
|
||||
def fake_run_command(cmd, **kwargs):
|
||||
del kwargs
|
||||
calls.append(list(cmd))
|
||||
|
||||
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
|
||||
|
||||
KernelOverlayFSMounter().mount(
|
||||
lowerdirs="/var/lib/left4me/installation",
|
||||
upperdir=Path("/var/lib/left4me/runtime/alpha/upper"),
|
||||
workdir=Path("/var/lib/left4me/runtime/alpha/work"),
|
||||
merged=Path("/var/lib/left4me/runtime/alpha/merged"),
|
||||
)
|
||||
|
||||
assert calls == [["sudo", "-n", HELPER_PATH, "mount", "alpha"]]
|
||||
|
||||
|
||||
def test_unmount_invokes_helper_with_umount_verb(monkeypatch: pytest.MonkeyPatch) -> None:
|
||||
from l4d2host.fs.kernel_overlayfs import KernelOverlayFSMounter
|
||||
|
||||
calls: list[list[str]] = []
|
||||
|
||||
def fake_run_command(cmd, **kwargs):
|
||||
del kwargs
|
||||
calls.append(list(cmd))
|
||||
|
||||
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
|
||||
|
||||
KernelOverlayFSMounter().unmount(merged=Path("/var/lib/left4me/runtime/alpha/merged"))
|
||||
|
||||
assert calls == [["sudo", "-n", HELPER_PATH, "umount", "alpha"]]
|
||||
|
||||
|
||||
def test_mount_propagates_run_command_kwargs(monkeypatch: pytest.MonkeyPatch) -> None:
|
||||
from l4d2host.fs.kernel_overlayfs import KernelOverlayFSMounter
|
||||
|
||||
captured: dict = {}
|
||||
|
||||
def fake_run_command(cmd, **kwargs):
|
||||
captured["cmd"] = list(cmd)
|
||||
captured["kwargs"] = kwargs
|
||||
|
||||
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
|
||||
|
||||
out: list[str] = []
|
||||
err: list[str] = []
|
||||
KernelOverlayFSMounter().mount(
|
||||
lowerdirs="/var/lib/left4me/installation",
|
||||
upperdir=Path("/var/lib/left4me/runtime/alpha/upper"),
|
||||
workdir=Path("/var/lib/left4me/runtime/alpha/work"),
|
||||
merged=Path("/var/lib/left4me/runtime/alpha/merged"),
|
||||
on_stdout=out.append,
|
||||
on_stderr=err.append,
|
||||
passthrough=False,
|
||||
should_cancel=lambda: False,
|
||||
)
|
||||
|
||||
assert captured["cmd"][0:3] == ["sudo", "-n", HELPER_PATH]
|
||||
captured["kwargs"]["on_stdout"]("hi")
|
||||
captured["kwargs"]["on_stderr"]("oops")
|
||||
assert out == ["hi"]
|
||||
assert err == ["oops"]
|
||||
assert captured["kwargs"]["passthrough"] is False
|
||||
assert callable(captured["kwargs"]["should_cancel"])
|
||||
|
|
@ -29,19 +29,16 @@ def test_start_order(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
|
|||
(instance_dir / "server.cfg").write_text("sv_consistency 1")
|
||||
(instance_dir / "spec.yaml").write_text("port: 27015\noverlays: [x, y]\n")
|
||||
|
||||
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
|
||||
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
|
||||
|
||||
start_instance("alpha", root=tmp_path)
|
||||
|
||||
assert calls[0] == [
|
||||
"sudo",
|
||||
"-n",
|
||||
"/usr/local/libexec/left4me/left4me-overlay",
|
||||
"mount",
|
||||
"alpha",
|
||||
# The mount is now driven by the unit's ExecStartPre (single source of
|
||||
# truth), so start_instance only stages the cfgs and asks systemd to
|
||||
# enable+start the unit.
|
||||
assert calls == [
|
||||
["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "enable", "alpha"],
|
||||
]
|
||||
assert calls[1] == ["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "start", "alpha"]
|
||||
|
||||
|
||||
def test_start_copies_per_overlay_aliases_and_sweeps_stale(
|
||||
|
|
@ -75,7 +72,6 @@ def test_start_copies_per_overlay_aliases_and_sweeps_stale(
|
|||
(src_7 / "server.cfg").write_text("ignored: alias not set\n")
|
||||
(upper_cfg_dir / "server_orphan.cfg").write_text("from previous start\n")
|
||||
|
||||
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
|
||||
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
|
||||
|
||||
start_instance("alpha", root=tmp_path)
|
||||
|
|
@ -87,36 +83,6 @@ def test_start_copies_per_overlay_aliases_and_sweeps_stale(
|
|||
assert not (upper_cfg_dir / "server_overlay_7.cfg").exists(), "no alias in spec → no copy"
|
||||
|
||||
|
||||
def test_start_refuses_to_double_mount(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
|
||||
calls: list[list[str]] = []
|
||||
|
||||
def fake_run_command(cmd, **kwargs):
|
||||
del kwargs
|
||||
calls.append(list(cmd))
|
||||
|
||||
instance_dir = tmp_path / "instances" / "alpha"
|
||||
runtime_dir = tmp_path / "runtime" / "alpha"
|
||||
(runtime_dir / "merged").mkdir(parents=True)
|
||||
instance_dir.mkdir(parents=True)
|
||||
(instance_dir / "instance.env").write_text("L4D2_PORT=27015\nL4D2_ARGS=\nL4D2_LOWERDIRS=/x\n")
|
||||
(instance_dir / "server.cfg").write_text("")
|
||||
|
||||
merged = runtime_dir / "merged"
|
||||
|
||||
def fake_ismount(path):
|
||||
return Path(path) == merged
|
||||
|
||||
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
|
||||
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
|
||||
monkeypatch.setattr("l4d2host.instances.os.path.ismount", fake_ismount)
|
||||
|
||||
with pytest.raises(subprocess.CalledProcessError) as exc_info:
|
||||
start_instance("alpha", root=tmp_path)
|
||||
|
||||
assert "already mounted" in (exc_info.value.stderr or "")
|
||||
assert calls == [], "no mount/start commands must be issued when refusing"
|
||||
|
||||
|
||||
def test_delete_missing_is_noop(tmp_path: Path) -> None:
|
||||
delete_instance("missing", root=tmp_path)
|
||||
|
||||
|
|
@ -127,7 +93,7 @@ def test_delete_succeeds_when_stop_service_fails(tmp_path: Path, monkeypatch: py
|
|||
def fake_run_command(cmd, **kwargs):
|
||||
del kwargs
|
||||
calls.append(list(cmd))
|
||||
if cmd[:2] == ["sudo", "-n"] and "left4me-systemctl" in cmd[2] and "stop" in cmd:
|
||||
if cmd[:2] == ["sudo", "-n"] and "left4me-systemctl" in cmd[2] and "disable" in cmd:
|
||||
raise subprocess.CalledProcessError(
|
||||
returncode=5,
|
||||
cmd=list(cmd),
|
||||
|
|
@ -137,7 +103,6 @@ def test_delete_succeeds_when_stop_service_fails(tmp_path: Path, monkeypatch: py
|
|||
(tmp_path / "instances" / "alpha").mkdir(parents=True)
|
||||
(tmp_path / "runtime" / "alpha" / "merged").mkdir(parents=True)
|
||||
|
||||
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
|
||||
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
|
||||
|
||||
delete_instance("alpha", root=tmp_path)
|
||||
|
|
@ -172,7 +137,6 @@ def test_reset_stops_unmounts_and_removes_dirs(tmp_path: Path, monkeypatch: pyte
|
|||
(runtime_dir / "upper" / "logs").mkdir(parents=True)
|
||||
(runtime_dir / "upper" / "logs" / "console.log").write_text("noise")
|
||||
|
||||
monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
|
||||
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
|
||||
|
||||
reset_instance("alpha", root=tmp_path)
|
||||
|
|
@ -180,7 +144,7 @@ def test_reset_stops_unmounts_and_removes_dirs(tmp_path: Path, monkeypatch: pyte
|
|||
assert not instance_dir.exists()
|
||||
assert not runtime_dir.exists()
|
||||
assert any("left4me-systemctl" in arg for cmd in calls for arg in cmd)
|
||||
assert any("stop" in cmd for cmd in calls)
|
||||
assert any("disable" in cmd for cmd in calls)
|
||||
|
||||
|
||||
def test_reset_on_never_initialized_is_noop(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
|
||||
|
|
@ -188,10 +152,9 @@ def test_reset_on_never_initialized_is_noop(tmp_path: Path, monkeypatch: pytest.
|
|||
stop+unmount (both suppressed on failure) and not raise."""
|
||||
def fake_run_command(cmd, **kwargs):
|
||||
del kwargs
|
||||
-if "stop" in cmd:
|
||||
+if "disable" in cmd:
|
||||
raise subprocess.CalledProcessError(returncode=5, cmd=list(cmd), stderr="not loaded")
|
||||
|
||||
-monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
|
||||
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
|
||||
|
||||
reset_instance("alpha", root=tmp_path)
|
||||
|
|
@ -210,68 +173,16 @@ def test_delete_stopped_instance_removes_dirs(tmp_path: Path, monkeypatch: pytes
|
|||
(tmp_path / "instances" / "alpha").mkdir(parents=True)
|
||||
(tmp_path / "runtime" / "alpha" / "merged").mkdir(parents=True)
|
||||
|
||||
-monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
|
||||
monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
|
||||
|
||||
delete_instance("alpha", root=tmp_path)
|
||||
|
||||
assert not (tmp_path / "instances" / "alpha").exists()
|
||||
assert not (tmp_path / "runtime" / "alpha").exists()
|
||||
-assert ["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "stop", "alpha"] in calls
|
||||
+assert ["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "disable", "alpha"] in calls
|
||||
|
||||
|
||||
-def test_stop_succeeds_when_unmount_fails(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
-    umount_calls: list[list[str]] = []
-
-    def fake_run_command(cmd, **kwargs):
-        del kwargs
-        if cmd[:4] == [
-            "sudo",
-            "-n",
-            "/usr/local/libexec/left4me/left4me-overlay",
-            "umount",
-        ]:
-            umount_calls.append(list(cmd))
-            raise subprocess.CalledProcessError(
-                returncode=1,
-                cmd=list(cmd),
-                stderr="umount: /var/lib/left4me/runtime/alpha/merged: not mounted",
-            )
-
-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
-    monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
-
-    stop_instance("alpha", root=tmp_path)
-
-    assert umount_calls, "stop must always attempt the overlay helper (no preflight)"
-
-
-def test_delete_succeeds_when_unmount_fails(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> None:
-    umount_calls: list[list[str]] = []
-
-    def fake_run_command(cmd, **kwargs):
-        del kwargs
-        if cmd[:4] == [
-            "sudo",
-            "-n",
-            "/usr/local/libexec/left4me/left4me-overlay",
-            "umount",
-        ]:
-            umount_calls.append(list(cmd))
-            raise subprocess.CalledProcessError(
-                returncode=1,
-                cmd=list(cmd),
-                stderr="umount: /var/lib/left4me/runtime/alpha/merged: not mounted",
-            )
-
-    (tmp_path / "instances" / "alpha").mkdir(parents=True)
-    (tmp_path / "runtime" / "alpha" / "merged").mkdir(parents=True)
-
-    monkeypatch.setattr("l4d2host.fs.kernel_overlayfs.run_command", fake_run_command)
-    monkeypatch.setattr("l4d2host.service_control.run_command", fake_run_command)
-
-    delete_instance("alpha", root=tmp_path)
-
-    assert umount_calls, "delete must always attempt the overlay helper (no preflight)"
-    assert not (tmp_path / "instances" / "alpha").exists()
-    assert not (tmp_path / "runtime" / "alpha").exists()
+# test_stop_succeeds_when_unmount_fails / test_delete_succeeds_when_unmount_fails
+# were removed when the Python-side unmount was dropped: the unit's
+# ExecStopPost is now the single code path for unmount, so there's no
+# Python-side failure to tolerate.
|
||||
|
|
|
|||
21
l4d2host/tests/test_service_control.py
Normal file
|
|
@ -0,0 +1,21 @@
|
|||
from unittest.mock import patch
|
||||
|
||||
from l4d2host.service_control import (
|
||||
SYSTEMCTL_HELPER,
|
||||
disable_service,
|
||||
enable_service,
|
||||
)
|
||||
|
||||
|
||||
@patch("l4d2host.service_control.run_command")
|
||||
def test_enable_service_invokes_helper_with_enable_action(mock_run):
|
||||
enable_service("instance-7")
|
||||
args, _ = mock_run.call_args
|
||||
assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "enable", "instance-7"]
|
||||
|
||||
|
||||
@patch("l4d2host.service_control.run_command")
|
||||
def test_disable_service_invokes_helper_with_disable_action(mock_run):
|
||||
disable_service("instance-7")
|
||||
args, _ = mock_run.call_args
|
||||
assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "disable", "instance-7"]
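Reviewer note: the two tests above pin down only the argv passed to `run_command`. The `service_control` module itself is not shown in this part of the diff, so the following is a minimal sketch of helpers that would satisfy those assertions, not the actual implementation. The helper path is taken from the assertions in the lifecycle tests earlier in this diff; `helper_argv` and the injectable `run` parameter are hypothetical names added so the sketch can be exercised without `sudo`.

```python
import subprocess
from typing import Callable

# Path as it appears in the lifecycle test assertions in this diff.
SYSTEMCTL_HELPER = "/usr/local/libexec/left4me/left4me-systemctl"


def helper_argv(action: str, name: str) -> list[str]:
    """Build the non-interactive sudo argv for one helper action (hypothetical helper)."""
    return ["sudo", "-n", SYSTEMCTL_HELPER, action, name]


def enable_service(name: str, run: Callable[..., object] = subprocess.run) -> None:
    """Enable the unit for `name`; `run` is injectable so tests need no sudo."""
    run(helper_argv("enable", name), check=True)


def disable_service(name: str, run: Callable[..., object] = subprocess.run) -> None:
    """Disable the unit for `name`."""
    run(helper_argv("disable", name), check=True)
```

Keeping the argv construction in one place means a test can assert on the exact command list, which is what the two tests in this file do through `mock_run.call_args`.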
|
||||
|
|
@ -18,7 +18,11 @@ from l4d2web.routes.overlay_routes import bp as overlay_bp
|
|||
from l4d2web.routes.page_routes import bp as page_bp
|
||||
from l4d2web.routes.server_routes import bp as server_bp
|
||||
from l4d2web.routes.workshop_routes import bp as workshop_bp
|
||||
-from l4d2web.services.job_worker import recover_stale_jobs, start_job_workers
|
||||
+from l4d2web.services.job_worker import (
|
||||
+recover_stale_jobs,
|
||||
+start_job_workers,
|
||||
+start_state_poller,
|
||||
+)
|
||||
|
||||
|
||||
def _in_flask_cli_context() -> bool:
|
||||
|
|
@ -89,6 +93,7 @@ def create_app(test_config: dict[str, object] | None = None) -> Flask:
|
|||
if should_start_workers:
|
||||
recover_stale_jobs()
|
||||
start_job_workers(app)
|
||||
+start_state_poller(app)
|
||||
|
||||
@app.get("/health")
|
||||
def health():
|
||||
|
|
|
|||
|
|
@ -8,6 +8,7 @@ DEFAULT_CONFIG: dict[str, object] = {
|
|||
"JOB_WORKER_THREADS": 4,
|
||||
"JOB_WORKER_ENABLED": True,
|
||||
"JOB_WORKER_POLL_SECONDS": 1,
|
||||
+"STATE_POLLER_INTERVAL_SECONDS": 30,
|
||||
"JOB_LOG_REPLAY_LIMIT": 2000,
|
||||
"JOB_LOG_LINE_MAX_CHARS": 4096,
|
||||
"PORT_RANGE_START": 27015,
|
||||
|
|
@ -27,6 +28,7 @@ def load_config() -> dict[str, object]:
|
|||
"JOB_WORKER_THREADS": int(os.getenv("JOB_WORKER_THREADS", "4")),
|
||||
"JOB_WORKER_ENABLED": _bool_from_env(os.getenv("JOB_WORKER_ENABLED", "true")),
|
||||
"JOB_WORKER_POLL_SECONDS": float(os.getenv("JOB_WORKER_POLL_SECONDS", "1")),
|
||||
+"STATE_POLLER_INTERVAL_SECONDS": float(os.getenv("STATE_POLLER_INTERVAL_SECONDS", "30")),
|
||||
"JOB_LOG_REPLAY_LIMIT": int(os.getenv("JOB_LOG_REPLAY_LIMIT", "2000")),
|
||||
"JOB_LOG_LINE_MAX_CHARS": int(os.getenv("JOB_LOG_LINE_MAX_CHARS", "4096")),
|
||||
"PORT_RANGE_START": int(os.getenv("LEFT4ME_PORT_RANGE_START", "27015")),
|
||||
|
|
|
|||
|
|
@ -614,3 +614,45 @@ def worker_loop(app, poll_seconds: float) -> None:
|
|||
ran_job = False
|
||||
if not ran_job:
|
||||
time.sleep(poll_seconds)
|
||||
|
||||
|
||||
def start_state_poller(app) -> None:
|
||||
interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
|
||||
thread = threading.Thread(
|
||||
target=state_poller_loop,
|
||||
args=(app, interval),
|
||||
name="left4me-state-poller",
|
||||
daemon=True,
|
||||
)
|
||||
thread.start()
|
||||
|
||||
|
||||
def state_poller_loop(app, interval: float) -> None:
|
||||
while True:
|
||||
try:
|
||||
with app.app_context():
|
||||
poll_all_servers()
|
||||
except Exception:
|
||||
pass
|
||||
time.sleep(interval)
|
||||
|
||||
|
||||
def poll_all_servers() -> None:
|
||||
with session_scope() as db:
|
||||
active_server_ids = set(
|
||||
db.scalars(
|
||||
select(Job.server_id).where(
|
||||
Job.state.in_(("queued", "running", "cancelling"))
|
||||
)
|
||||
).all()
|
||||
)
|
||||
server_ids = [
|
||||
sid
|
||||
for sid in db.scalars(select(Server.id)).all()
|
||||
if sid not in active_server_ids
|
||||
]
|
||||
for sid in server_ids:
|
||||
try:
|
||||
refresh_server_actual_state(sid)
|
||||
except Exception:
|
||||
pass
|
||||
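Reviewer note: `state_poller_loop` above runs forever and discards every exception silently. As a possible follow-up, here is a minimal sketch of the same loop shape with two changes: failures are logged, and the sleep is replaced by a `threading.Event` wait so the loop can be stopped. The names `run_poller`, `poll_once`, `stop_event`, and the logger name are assumptions for illustration and are not part of this change set.

```python
import logging
import threading
from typing import Callable

log = logging.getLogger("left4me.state_poller")  # hypothetical logger name


def run_poller(
    poll_once: Callable[[], None],
    interval: float,
    stop_event: threading.Event,
) -> None:
    """Call poll_once() every `interval` seconds until stop_event is set."""
    while not stop_event.is_set():
        try:
            poll_once()
        except Exception:
            # Keep the loop alive as the original does, but leave a trace.
            log.exception("state poll failed; retrying in %.1fs", interval)
        # Event.wait doubles as an interruptible sleep.
        stop_event.wait(interval)
```

Behaviour under normal operation matches the loop in the diff: one failed poll never stops later polls. The event only adds a clean shutdown path, which can also make the loop easier to test.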
|
|
|
|||
|
|
@ -843,3 +843,90 @@ def test_build_overlay_script_type_blocks_per_overlay(overlay_seeded_worker) ->
|
|||
can_start(DummyJob(operation="build_overlay", overlay_id=ids.overlay + 1), state)
|
||||
is True
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# State poller tests — refresh Server.actual_state out-of-band so OOM kills,
|
||||
# manual systemctl ops, and reboots no longer leave the DB on stale "running".
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def test_state_poller_refreshes_each_server(seeded_worker, monkeypatch) -> None:
|
||||
from l4d2web.services import job_worker as jw
|
||||
|
||||
worker_app, ids = seeded_worker
|
||||
|
||||
refreshed: list[int] = []
|
||||
monkeypatch.setattr(
|
||||
jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid)
|
||||
)
|
||||
|
||||
with worker_app.app_context():
|
||||
jw.poll_all_servers()
|
||||
|
||||
assert sorted(refreshed) == sorted([ids.server_one, ids.server_two])
|
||||
|
||||
|
||||
def test_state_poller_skips_servers_with_inflight_jobs(seeded_worker, monkeypatch) -> None:
|
||||
from l4d2web.services import job_worker as jw
|
||||
|
||||
worker_app, ids = seeded_worker
|
||||
|
||||
add_job(ids.user, "stop", server_id=ids.server_one, state="running")
|
||||
|
||||
refreshed: list[int] = []
|
||||
monkeypatch.setattr(
|
||||
jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid)
|
||||
)
|
||||
|
||||
with worker_app.app_context():
|
||||
jw.poll_all_servers()
|
||||
|
||||
assert ids.server_one not in refreshed
|
||||
assert ids.server_two in refreshed
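Reviewer note: the selection rule this test exercises (skip any server that has a queued, running, or cancelling job) can be stated without the database. The sketch below restates it over plain tuples as an illustration only; `servers_to_poll` and the `(server_id, state)` job shape are hypothetical, while the three state names come from the `poll_all_servers` query in this diff.

```python
from typing import Iterable, Optional

# Job states treated as in flight by poll_all_servers in this diff.
IN_FLIGHT_STATES = ("queued", "running", "cancelling")


def servers_to_poll(
    server_ids: Iterable[int],
    jobs: Iterable[tuple[Optional[int], str]],
) -> list[int]:
    """Return the server ids that have no in-flight job, preserving input order."""
    busy = {server_id for server_id, state in jobs if state in IN_FLIGHT_STATES}
    return [sid for sid in server_ids if sid not in busy]
```

For example, with servers `[1, 2, 3]` and jobs `[(1, "running"), (2, "succeeded")]`, only servers 2 and 3 are refreshed, so the poller does not touch server 1 while a job on it is still in flight.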
|
||||
|
||||
|
||||
def test_state_poller_swallows_per_server_exceptions(seeded_worker, monkeypatch) -> None:
|
||||
from l4d2web.services import job_worker as jw
|
||||
|
||||
worker_app, ids = seeded_worker
|
||||
|
||||
refreshed: list[int] = []
|
||||
|
||||
def fake_refresh(sid: int) -> None:
|
||||
if sid == ids.server_one:
|
||||
raise RuntimeError("simulated host failure")
|
||||
refreshed.append(sid)
|
||||
|
||||
monkeypatch.setattr(jw, "refresh_server_actual_state", fake_refresh)
|
||||
|
||||
with worker_app.app_context():
|
||||
jw.poll_all_servers() # must not raise
|
||||
|
||||
assert refreshed == [ids.server_two]
|
||||
|
||||
|
||||
def test_state_poller_not_started_during_testing(monkeypatch, tmp_path) -> None:
|
||||
from l4d2web import app as app_module
|
||||
|
||||
called: list = []
|
||||
db_url = f"sqlite:///{tmp_path/'poller-testing.db'}"
|
||||
monkeypatch.setattr(app_module, "start_state_poller", lambda app: called.append(app))
|
||||
|
||||
app_module.create_app({"TESTING": True, "DATABASE_URL": db_url, "SECRET_KEY": "test"})
|
||||
|
||||
assert called == []
|
||||
|
||||
|
||||
def test_state_poller_started_when_workers_enabled_outside_testing(monkeypatch, tmp_path) -> None:
|
||||
from l4d2web import app as app_module
|
||||
|
||||
called: list = []
|
||||
db_url = f"sqlite:///{tmp_path/'poller-enabled.db'}"
|
||||
monkeypatch.setattr(app_module, "start_state_poller", lambda app: called.append(app))
|
||||
monkeypatch.setattr(app_module, "start_job_workers", lambda app: None)
|
||||
monkeypatch.setattr(app_module, "recover_stale_jobs", lambda: None)
|
||||
|
||||
app = app_module.create_app({"TESTING": False, "DATABASE_URL": db_url, "SECRET_KEY": "test"})
|
||||
|
||||
assert called == [app]
|
||||
|
|
|
|||