diff --git a/docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md b/docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md new file mode 100644 index 0000000..04b9190 --- /dev/null +++ b/docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md @@ -0,0 +1,230 @@ +# l4d2 server host perf baseline — design + +Date: 2026-05-09 +Status: design + +## Summary + +Apply a host-side performance and resource-isolation baseline to every L4D2 server instance, using systemd unit directives, a slice hierarchy, and host sysctls. The blueprint-level game configuration (tickrate, sv_minrate/maxrate, fps_max, plugins) stays the responsibility of the individual server maintainer and is out of scope. + +## Goals + +- Game-server processes get measurable scheduling, I/O, and OOM priority over the script-build sandbox and over interactive system traffic. +- One misbehaving server cannot OOM-kill its siblings or the host. +- The kernel's UDP path is sized for sustained Source-engine traffic instead of distro defaults. +- Operators have documented escape hatches for host-specific tuning (CPU pinning, governor, NIC IRQs, real-time scheduling) without any of it being imposed by default. + +## Non-goals + +- ConVars, blueprint arguments, plugins, tickrate, rate values — owned by the maintainer of each server. +- Real-time (`SCHED_FIFO`/`SCHED_RR`) scheduling for game servers. Documented as opt-in only; see Out-of-scope rationale. +- CPU governor changes. Documented opt-in only. +- Per-instance `CPUAffinity`. Host-specific; documented only. +- NIC ring-buffer / IRQ-pinning changes. Hardware-specific; documented only. +- Job-scheduler awareness ("don't build a script overlay while server X has players"). Cgroup weights cover this in v1; revisit if real-world data disagrees. +- Hardening tightening (`ProtectKernelTunables=yes`, etc.). Security-focused, separate spec. 
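+
+## Background
+
+Current state (commit `965b67e`):
+
+- `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` runs `srcds_run` as user `left4me` with security hardening (`NoNewPrivileges`, `PrivateTmp`, `PrivateDevices`, `ProtectHome`, `ProtectSystem=strict`, `ReadOnlyPaths`, `ReadWritePaths`, `RestrictSUIDSGID`, `LockPersonality`) but **no scheduling, memory, OOM, kill-signal, or log-rate directives**.
+- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` runs script-overlay builds via `systemd-run --scope` with `CPUQuota=200%` and `RuntimeMaxSec=3600`, but in the **default cgroup** — it competes against game servers as an equal sibling under `system.slice`.
+- No host sysctls are deployed. Kernel defaults (`rmem_max`/`wmem_max` = 212992 bytes, about 208 KB, on common distro kernels; `netdev_max_backlog=1000`) are below what sustained UDP gameplay across multiple instances expects.
+
+srcds is single-threaded per instance, so multi-instance hosts contend over CPU cycles, kernel softirq budget, and journald rate limits.
+
+## Design
+
+### Slice topology
+
+Flat top-level slices, siblings of `system.slice` and `user.slice`:
+
+```
+-.slice
+├── system.slice (default CPUWeight=100, IOWeight=100)
+├── user.slice (default CPUWeight=100, IOWeight=100)
+├── l4d2-game.slice (CPUWeight=1000, IOWeight=1000)
+└── l4d2-build.slice (CPUWeight=10, IOWeight=10)
+```
+
+Rationale:
+
+- The 100:1 weight ratio between game and build means that when both are busy, the build sandbox gets about 1% (10/1010) of their combined CPU and I/O time. Weights are work-conserving, so when the game slice is idle the build can still use the whole box, up to its own `CPUQuota=200%` (two CPUs).
+- Flat (not nested under `system.slice`) so the game slice competes at the top level with weight 1000 against 100 for `user.slice`. When only those two are busy, a logged-in admin running a heavy task in `user.slice` can take about 1/11 of the CPU from a live match, instead of the half it would get if the game slice sat inside `system.slice`.
+- Slice names encode the hierarchy: systemd.slice(5) treats each dash as a nesting level, so `l4d2-game.slice` and `l4d2-build.slice` as named land inside an implicit `l4d2.slice` with the default `CPUWeight=100`/`IOWeight=100`, not at the top level drawn above. Keeping the layout flat requires dash-free names (for example `l4d2game.slice`); otherwise the 1000/10 weights only split whatever `l4d2.slice` wins against `system.slice` and `user.slice`, which compete with it 1:1:1.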
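The weight arithmetic behind the slice rationale can be checked with a short sketch. It assumes cgroup v2 semantics, where busy siblings split CPU in proportion to `CPUWeight=` and idle siblings drop out of the calculation. `cpu_shares` and `parent_slice` are hypothetical helpers for illustration, not part of left4me; `parent_slice` mirrors the systemd.slice(5) rule that each dash in a slice name marks one level of nesting, which is worth checking before settling on slice names.

```python
"""Hypothetical helpers (not part of left4me) for reasoning about systemd
slice placement and cgroup v2 CPU weights."""

from typing import Dict


def cpu_shares(busy_weights: Dict[str, int]) -> Dict[str, float]:
    """Fraction of CPU each *busy* sibling gets under full contention.

    cgroup v2 weights are proportional and work-conserving: an idle sibling
    is simply left out of the mapping, and its share goes to the others.
    """
    total = sum(busy_weights.values())
    if total <= 0:
        raise ValueError("need at least one busy sibling with a positive weight")
    return {name: weight / total for name, weight in busy_weights.items()}


def parent_slice(name: str) -> str:
    """Parent slice implied by a slice unit name, per systemd.slice(5).

    'a-b-c.slice' lives inside 'a-b.slice'; a dash-free name such as
    'system.slice' lives directly under the root slice '-.slice'.
    """
    if not name.endswith(".slice"):
        raise ValueError(f"not a slice unit: {name!r}")
    if name == "-.slice":
        raise ValueError("the root slice has no parent")
    stem = name[: -len(".slice")]
    head, sep, _ = stem.rpartition("-")
    return f"{head}.slice" if sep else "-.slice"


if __name__ == "__main__":
    # All four siblings from the diagram busy at the same time.
    shares = cpu_shares({
        "l4d2-game.slice": 1000,
        "system.slice": 100,
        "user.slice": 100,
        "l4d2-build.slice": 10,
    })
    for slice_name, share in shares.items():
        print(f"{slice_name:18} {share:6.1%}")
    # Where systemd actually places a dashed slice name.
    print(parent_slice("l4d2-game.slice"))
```

With all four busy, the game slice gets about 82.6% of the CPU and the build slice about 0.8%. Nested slices multiply: a slice's effective share is its share among its siblings times its parent's share, so if the game and build slices end up under a common parent, apply `cpu_shares` one level at a time.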
+
+### Per-instance unit additions (`left4me-server@.service`)
+
+Add to `[Service]`:
+
+```ini
+Slice=l4d2-game.slice
+Nice=-5
+IOSchedulingClass=best-effort
+IOSchedulingPriority=4
+OOMScoreAdjust=-200
+MemoryHigh=1.5G
+MemoryMax=2G
+TasksMax=256
+LimitNOFILE=65536
+KillSignal=SIGINT
+TimeoutStopSec=15s
+LogRateLimitIntervalSec=0
+```
+
+Per-directive justification:
+
+- `Slice=l4d2-game.slice` — places the instance in the high-weight slice.
+- `Nice=-5` — modest CFS priority bump. Negative `Nice` set by systemd does not require `CAP_SYS_NICE` because systemd applies the value before dropping to the unit user. SCHED_FIFO is intentionally rejected; see Out-of-scope rationale.
+- `IOSchedulingClass=best-effort` + `IOSchedulingPriority=4` — pins the class and level explicitly so they do not depend on the nice value. Best-effort levels run 0–7 with lower numbers served first, and 4 is the level a nice-0 process gets, so this is not a bump: without it, the kernel would derive level 3 from `Nice=-5`. A real bump needs a level of 3 or lower, and levels only matter under I/O schedulers that honor them (e.g. BFQ).
+- `OOMScoreAdjust=-200` — makes game servers less likely OOM-killer victims under host memory pressure; the sandbox (`OOMScoreAdjust=500`, see sandbox section) is picked first.
+- `MemoryHigh=1.5G`, `MemoryMax=2G` — soft + hard ceiling: above `MemoryHigh=` the instance is throttled and reclaimed, and reaching `MemoryMax=` triggers an OOM kill confined to that instance's own cgroup, so siblings are unaffected. Typical L4D2 srcds runs ~500–800 MB; map-load spikes fit in headroom; a runaway is bounded.
+- `TasksMax=256` — bounds the instance's combined process and thread count well above srcds' steady-state usage; prevents fork-bomb style failures from leaking host-wide.
+- `LimitNOFILE=65536` — Valve wiki recommendation; cheap and matches multi-plugin setups.
+- `KillSignal=SIGINT` — srcds responds to SIGINT for clean shutdown (writes demos, flushes logs); SIGTERM is harsher. With the default `KillMode=control-group`, systemd sends the signal to every process in the unit's cgroup, so the game-server binary started by `srcds_run` receives it, not only the wrapper script.
+- `TimeoutStopSec=15s` — gives srcds 15 seconds to finish flushing before systemd escalates to SIGKILL.
+- `LogRateLimitIntervalSec=0` — disables journald per-unit rate limiting (default `10000 msgs/30s`). srcds + plugins exceed this on busy maps; dropped messages break diagnostics.
+
+Existing security directives are kept verbatim.
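The I/O priority directives are easier to reason about with the kernel's fallback rule in hand. Per ioprio_set(2), a task with no explicit I/O priority gets a best-effort level computed from its CPU nice value as `(nice + 20) / 5`, where levels run 0 to 7 and lower numbers are served first. The sketch below uses hypothetical helper names to show the level `Nice=-5` alone would produce and how an explicit `IOSchedulingPriority=` replaces it.

```python
"""Hypothetical helpers: the best-effort I/O level a task ends up with.

Follows the rule in ioprio_set(2): with no explicit I/O priority, the level
is derived from the CPU nice value as (nice + 20) / 5. Levels run 0..7 and
lower numbers are served first.
"""

from typing import Optional

BEST_EFFORT_LEVELS = range(8)


def io_level_from_nice(nice: int) -> int:
    """Best-effort level the kernel derives from a nice value in [-20, 19]."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be between -20 and 19")
    return (nice + 20) // 5


def effective_io_level(nice: int, explicit_level: Optional[int] = None) -> int:
    """An explicit IOSchedulingPriority= wins; otherwise fall back to nice."""
    if explicit_level is None:
        return io_level_from_nice(nice)
    if explicit_level not in BEST_EFFORT_LEVELS:
        raise ValueError("best-effort levels are 0..7")
    return explicit_level


if __name__ == "__main__":
    for nice in (0, -5):
        print(f"Nice={nice:>3}: derived level {io_level_from_nice(nice)}")
    print("Nice=-5 with IOSchedulingPriority=4:", effective_io_level(-5, 4))
```

For `Nice=0` the derived level is 4 and for `Nice=-5` it is 3; setting `IOSchedulingPriority=` overrides the derived value outright, whichever direction that moves it.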
+
+### Slice unit files
+
+New file `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice`:
+
+```ini
+[Unit]
+Description=left4me game-server slice
+Before=slices.target
+
+[Slice]
+CPUWeight=1000
+IOWeight=1000
+```
+
+New file `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice`:
+
+```ini
+[Unit]
+Description=left4me script-sandbox build slice
+Before=slices.target
+
+[Slice]
+CPUWeight=10
+IOWeight=10
+```
+
+### Sandbox slice + OOM placement
+
+Edit `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` to add to the `systemd-run --scope` invocation:
+
+- `--slice=l4d2-build.slice`
+- `-p OOMScoreAdjust=500`
+
+Existing `CPUQuota=200%` and `RuntimeMaxSec=3600` stay. Cgroup weight (slice) and CPU quota (per-scope) compose: weight handles contention, quota handles the absolute ceiling.
+
+`OOMScoreAdjust=` is an exec setting (systemd.exec(5)), while scope units accept only resource-control and kill settings (systemd.scope(5)). If `systemd-run --scope` rejects `-p OOMScoreAdjust=500`, set the score on the child process instead, for example by prefixing the sandboxed command with util-linux `choom -n 500 --`; raising a process's own OOM score needs no extra privilege.
+
+### Host sysctls
+
+New file `deploy/files/etc/sysctl.d/99-left4me.conf`:
+
+```
+net.core.rmem_max = 8388608
+net.core.wmem_max = 8388608
+net.core.rmem_default = 524288
+net.core.wmem_default = 524288
+net.core.netdev_max_backlog = 5000
+net.core.netdev_budget = 600
+vm.swappiness = 10
+```
+
+Per-value justification:
+
+- `rmem_max`/`wmem_max = 8 MB` — the kernel default of 212992 bytes (about 208 KB) on common distro kernels is a known bottleneck for sustained UDP. 8 MB is the standard 1 Gbit recommendation (Red Hat performance guide). It is a per-socket ceiling rather than a host-wide pool, and buffer memory is only charged while packets are queued, so the limit does not need to grow with instance count or go to 16 MB.
+- `rmem_default`/`wmem_default = 512 KB` — raises the starting buffer for sockets that never call `setsockopt(SO_RCVBUF/SO_SNDBUF)`; an explicit `setsockopt()` overrides it (capped at `rmem_max`/`wmem_max`). TCP sockets size their buffers from `net.ipv4.tcp_rmem`/`tcp_wmem` instead, so for game traffic this mainly affects UDP sockets.
+- `netdev_max_backlog = 5000` — the default `1000` can overflow under multi-instance UDP bursts; the per-CPU softnet queue drops packets once full, and those drops show up in the second column of `/proc/net/softnet_stat`.
+- `netdev_budget = 600` — gives softirq more packet-drain headroom per pass; default `300` is undersized for multi-Gbit-class hosts. Each pass is also capped in time by `net.core.netdev_budget_usecs`, so raising the packet budget alone may have limited effect.
+- `vm.swappiness = 10` — commonly recommended for latency-sensitive servers; harmless on swapless hosts.
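A post-deploy check could confirm that the kernel actually picked up these values. The sketch below is a hypothetical helper, not part of the deploy tooling listed in this spec: it parses a sysctl.d fragment (`key = value` lines with `#` or `;` comments, as sysctl.d(5) describes), maps each dotted key to its file under `/proc/sys`, and flags any `*_default` socket buffer larger than its `*_max`. It deliberately ignores rarer sysctl.d syntax such as a leading `-` or glob keys.

```python
"""Hypothetical post-deploy helper for 99-left4me.conf (not part of left4me)."""

from pathlib import Path, PurePosixPath
from typing import Dict, List


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines; blank lines and '#'/';' comments are skipped."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'key = value', got {raw!r}")
        values[key.strip()] = value.strip()
    return values


def proc_sys_path(key: str) -> PurePosixPath:
    """Map a dotted key such as net.core.rmem_max to /proc/sys/net/core/rmem_max."""
    return PurePosixPath("/proc/sys", *key.split("."))


def buffer_limit_problems(values: Dict[str, str]) -> List[str]:
    """Flag socket-buffer defaults that exceed their matching *_max ceiling."""
    problems = []
    for kind in ("rmem", "wmem"):
        default = int(values[f"net.core.{kind}_default"])
        ceiling = int(values[f"net.core.{kind}_max"])
        if default > ceiling:
            problems.append(
                f"net.core.{kind}_default={default} exceeds {kind}_max={ceiling}"
            )
    return problems


def live_mismatches(values: Dict[str, str]) -> Dict[str, str]:
    """On a Linux host, return {key: live value} for keys that differ from the file."""
    mismatches = {}
    for key, wanted in values.items():
        live = Path(proc_sys_path(key)).read_text().strip()
        if live != wanted:
            mismatches[key] = live
    return mismatches
```

On a deployed host, `live_mismatches(parse_sysctl_conf(Path("/etc/sysctl.d/99-left4me.conf").read_text()))` should return an empty dict after `sysctl --system`; a non-empty result suggests the step did not run or another sysctl file overrides a value.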
+
+### Deploy script integration
+
+`deploy/deploy-test-server.sh` must:
+
+1. Copy `etc/sysctl.d/99-left4me.conf` to `/etc/sysctl.d/`.
+2. Run `sysctl --system` (or `sysctl -p /etc/sysctl.d/99-left4me.conf`) so values take effect immediately, not on next boot.
+3. Copy the two `.slice` files into `/usr/local/lib/systemd/system/`.
+4. `systemctl daemon-reload` after unit/slice changes (already done in current deploy flow).
+5. No explicit `systemctl start` of the slices is required; systemd starts each slice automatically when the first unit assigned to it starts.
+
+### Documented escape hatches (no auto-apply)
+
+Append a "Performance tuning" section to `deploy/README.md`:
+
+- **CPU governor**: `cpupower frequency-set -g performance` if jitter under load matters more than power. Schedutil is acceptable for sustained UDP workloads. Provide the one-liner; do not ship a oneshot service in v1.
+- **CPU affinity per instance**: example per-instance drop-in at `/etc/systemd/system/left4me-server@<instance>.service.d/affinity.conf` setting `CPUAffinity=N`; a drop-in under the template directory `left4me-server@.service.d/` would apply to every instance. Document the strategy "one instance per core, leave core 0 for system + IRQ".
+- **NIC tuning**: example `ethtool -G <iface> rx 4096 tx 4096` (check the hardware maximums with `ethtool -g <iface>` first), IRQ-pinning hints. Hardware-specific; ops-only.
+- **Real-time scheduling opt-in**: example drop-in adding `CPUSchedulingPolicy=fifo`, `CPUSchedulingPriority=10`, `LimitRTPRIO=10`. Include a one-paragraph warning citing RT-throttling defaults (`sched_rt_runtime_us=950000` per `sched_rt_period_us=1000000`, i.e. RT tasks may use at most 0.95 s of each 1 s period) and the failure mode if a single instance misbehaves: a spinning FIFO task can monopolize its CPU for most of every period, starving ordinary tasks and kernel threads there.
+
+These stay pure documentation in v1 — no code paths, no tests asserting them.
+
+### Out-of-scope rationale
+
+- **SCHED_FIFO**: a misbehaving srcds at any RT priority can starve kernel threads, producing failure modes that are harder to diagnose than the jitter problem it claims to solve. `Nice=-5` plus the slice weights captures the practical benefit. Ops who need RT can opt in via the documented drop-in.
+- **CPU governor auto-set**: Phoronix and Arch comparisons show `schedutil` is within noise of `performance` on sustained workloads like Source UDP; aggressively forcing `performance` would surprise users on power-managed hosts. +- **CPUAffinity in the unit**: the unit template is shared across all instances; a single hard-coded `CPUAffinity=` would pin every instance to the same cores, defeating the purpose. Per-instance pinning needs deploy-time policy that is outside v1's scope. + +### Files changed / added + +``` +deploy/files/usr/local/lib/systemd/system/left4me-server@.service (modified) +deploy/files/usr/local/lib/systemd/system/l4d2-game.slice (new) +deploy/files/usr/local/lib/systemd/system/l4d2-build.slice (new) +deploy/files/etc/sysctl.d/99-left4me.conf (new) +deploy/files/usr/local/libexec/left4me/left4me-script-sandbox (modified) +deploy/deploy-test-server.sh (modified — sysctl --system step) +deploy/README.md (modified — performance section) +deploy/tests/test_deploy_artifacts.py (modified — assertions) +``` + +## Tests + +`deploy/tests/test_deploy_artifacts.py` additions, following the existing +`assert "key=value" in text` pattern: + +- For `left4me-server@.service`, assert every line listed in *Per-instance + unit additions* is present verbatim. Each is a separate assertion so a + failing line is identifiable. +- For `l4d2-game.slice`, assert `CPUWeight=1000` and `IOWeight=1000`. +- For `l4d2-build.slice`, assert `CPUWeight=10` and `IOWeight=10`. +- For `99-left4me.conf`, assert every sysctl line listed in *Host sysctls*. +- For `left4me-script-sandbox`, assert the strings `--slice=l4d2-build.slice` + and `OOMScoreAdjust=500` both appear. +- Assert the deploy script invokes `sysctl --system` (or + `sysctl -p /etc/sysctl.d/99-left4me.conf`) at least once after copying the + conf into place. + +No runtime perf tests in v1 — the spec ships defaults, not measured wins. 
+Real-world measurement is left to operators with concrete instance counts,
+hardware, and player loads.
+
+## Rollout
+
+Single deploy. Running game servers will not pick up the new directives until each instance is restarted: settings such as `Slice=`, `Nice=`, and `OOMScoreAdjust=` are applied when systemd starts the process, not on `daemon-reload`. The same restart also picks up the new `rmem_default`/`wmem_default`, which only apply to sockets created after `sysctl --system` runs. The web UI's "stop" + "start" cycle is sufficient. Document this in `deploy/README.md`.
+
+## Open questions
+
+None blocking. v2 candidates if measurement justifies them:
+
+- Per-instance `CPUAffinity` driven by a deploy-env knob (`LEFT4ME_INSTANCE_CPUS`).
+- Job-worker awareness of "server has active players" to defer builds further than weights alone.
+- Optional `left4me-host-perf.service` oneshot that sets governor + NIC tuning under a single env-flag opt-in.
+
+## References
+
+- systemd.exec(5) — `Nice=`, `IOSchedulingClass=`, `IOSchedulingPriority=`, `OOMScoreAdjust=`, `LimitNOFILE=`, `LogRateLimitIntervalSec=`.
+- systemd.resource-control(5) — slice semantics, `CPUWeight=`, `IOWeight=`, `MemoryHigh=`, `MemoryMax=`, `TasksMax=`, weight competition rules.
+- systemd.slice(5) — slice units and the dash-encoded slice hierarchy.
+- systemd.kill(5) — signal handling and `KillSignal`.
+- systemd.service(5) — `TimeoutStopSec=`.
+- Red Hat Enterprise Linux Network Performance Tuning Guide — `rmem_max`/`wmem_max`/`netdev_max_backlog`/`netdev_budget`.
+- LWN "SCHED_FIFO and realtime throttling"; RHEL Real-Time CPU throttling docs — rationale for not shipping RT by default.
+- Linux Foundation real-time wiki — `sched_rt_runtime_us` semantics.
+- forums.srcds.com / AlliedModders / linuxquestions.org threads — confirmation that srcds is single-threaded per instance.
+- Phoronix governor comparisons — performance vs schedutil for sustained workloads.
+- Multiple latency-tuning guides — `vm.swappiness=10` consensus.