left4me/docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
mwiegand b6574e308b
docs(specs): perf baseline — fix transient-service phrasing
The existing left4me-script-sandbox helper uses systemd-run in
transient service mode (--unit=, no --scope). Spec wrongly said
'--scope'. No semantic change — the design's --slice= and
-p OOMScoreAdjust= guidance is identical for service vs scope mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 09:39:12 +02:00

# l4d2 server host perf baseline — design
Date: 2026-05-09
Status: design
## Summary
Apply a host-side performance and resource-isolation baseline to every L4D2 server instance, using systemd unit directives, a slice hierarchy, and host sysctls. The blueprint-level game configuration (tickrate, sv_minrate/maxrate, fps_max, plugins) stays the responsibility of the individual server maintainer and is out of scope.
## Goals
- Game-server processes get measurable scheduling, I/O, and OOM priority over the script-build sandbox and over interactive system traffic.
- One misbehaving server cannot OOM-kill its siblings or the host.
- The kernel's UDP path is sized for sustained Source-engine traffic instead of distro defaults.
- Operators have documented escape hatches for host-specific tuning (CPU pinning, governor, NIC IRQs, real-time scheduling) without any of it being imposed by default.
## Non-goals
- ConVars, blueprint arguments, plugins, tickrate, rate values — owned by the maintainer of each server.
- Real-time (`SCHED_FIFO`/`SCHED_RR`) scheduling for game servers. Documented as opt-in only; see Out-of-scope rationale.
- CPU governor changes. Documented opt-in only.
- Per-instance `CPUAffinity`. Host-specific; documented only.
- NIC ring-buffer / IRQ-pinning changes. Hardware-specific; documented only.
- Job-scheduler awareness ("don't build a script overlay while server X has players"). Cgroup weights cover this in v1; revisit if real-world data disagrees.
- Hardening tightening (`ProtectKernelTunables=yes`, etc.). Security-focused, separate spec.
## Background
Current state (commit `965b67e`):
- `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` runs `srcds_run` as user `left4me` with security hardening (`NoNewPrivileges`, `PrivateTmp`, `PrivateDevices`, `ProtectHome`, `ProtectSystem=strict`, `ReadOnlyPaths`, `ReadWritePaths`, `RestrictSUIDSGID`, `LockPersonality`) but **no scheduling, memory, OOM, kill-signal, or log-rate directives**.
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` runs script-overlay builds via `systemd-run` in transient service mode (`--unit=`, no `--scope`) with `CPUQuota=200%` and `RuntimeMaxSec=3600`, but without an explicit slice, so each build lands in `system.slice` and competes against game servers as an equal sibling.
- No host sysctls are deployed. Linux defaults (`rmem_max`/`wmem_max` = 212992 bytes, about 208 KB, on current kernels; `netdev_max_backlog=1000`) are below what sustained UDP gameplay across multiple instances expects.
srcds is single-threaded per instance, so multi-instance hosts contend over CPU cycles, kernel softirq budget, and journald rate limits.
## Design
### Slice topology
Flat top-level slices, siblings of `system.slice` and `user.slice`:
```
-.slice
├── system.slice (default CPUWeight=100, IOWeight=100)
├── user.slice (default CPUWeight=100, IOWeight=100)
├── l4d2-game.slice (CPUWeight=1000, IOWeight=1000)
└── l4d2-build.slice (CPUWeight=10, IOWeight=10)
```
Rationale:
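- The 100:1 weight ratio between game and build means that, under contention, the build sandbox is squeezed to roughly 1% of the CPU and I/O time the game slice receives. When the host is otherwise idle, the build can still use the whole box, capped only by its own `CPUQuota=200%`.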
- Flat (not nested under `system.slice`) so a logged-in admin running a heavy task in `user.slice` cannot steal cycles from a live match.
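As a rough illustration of what these weights mean, the sketch below computes each slice's CPU fraction when every listed slice is runnable at once. It assumes that sibling cgroups split contended CPU time in proportion to their weights, which is a simplification; the function name `contended_shares` is made up for this example.

```python
# Slice weights from the topology above; system.slice and user.slice
# keep the systemd default of 100.
WEIGHTS = {
    "system.slice": 100,
    "user.slice": 100,
    "l4d2-game.slice": 1000,
    "l4d2-build.slice": 10,
}


def contended_shares(weights, busy):
    """Return the CPU fraction of each slice in `busy`, assuming all of
    them are runnable and time is split proportionally to weight."""
    total = sum(weights[name] for name in busy)
    return {name: weights[name] / total for name in busy}


if __name__ == "__main__":
    # Case 1: every slice is busy. Case 2: only game and build compete.
    for busy in (list(WEIGHTS), ["l4d2-game.slice", "l4d2-build.slice"]):
        shares = contended_shares(WEIGHTS, busy)
        print(", ".join(f"{name}={share:.1%}" for name, share in shares.items()))
```

With all four slices busy, the game slice gets 1000/1210 of the CPU (about 83%) and the build slice 10/1210 (under 1%). Slices that are idle drop out of the denominator.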
### Per-instance unit additions (`left4me-server@.service`)
Add to `[Service]`:
```
Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0
```
Per-directive justification:
- `Slice=l4d2-game.slice` — places the instance in the high-weight slice.
- `Nice=-5` — modest CFS priority bump. Negative `Nice` set by systemd does not require `CAP_SYS_NICE` because systemd applies the value before dropping to the unit user. SCHED_FIFO is intentionally rejected; see Out-of-scope rationale.
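- `IOSchedulingClass=best-effort` + `IOSchedulingPriority=4` — pins the I/O class and priority explicitly. Level 4 (on a 0–7 scale, 0 highest) is also the systemd default for the best-effort class, so this is not a bump over the default; it makes the setting deterministic and is harmless.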
- `OOMScoreAdjust=-200` — game servers survive memory pressure; sandbox dies first (see sandbox section).
- `MemoryHigh=1.5G`, `MemoryMax=2G` — soft + hard ceiling. Typical L4D2 srcds runs ~500–800 MB; map-load spikes fit in headroom; a runaway is bounded.
- `TasksMax=256` — bounds thread count well above srcds' steady-state usage; prevents fork-bomb style failures from leaking host-wide.
- `LimitNOFILE=65536` — Valve wiki recommendation; cheap and matches multi-plugin setups.
- `KillSignal=SIGINT` — srcds responds to SIGINT for clean shutdown (writes demos, flushes logs); SIGTERM is harsher.
- `TimeoutStopSec=15s` — gives srcds time to finish flush before SIGKILL.
- `LogRateLimitIntervalSec=0` — disables journald per-unit rate limiting (default `10000 msgs/30s`). srcds + plugins exceed this on busy maps; dropped messages break diagnostics.
Existing security directives are kept verbatim.
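The memory headroom claimed above can be checked with a few lines of arithmetic. This is a simplified sketch, not systemd's own parser: it assumes the base-1024 meaning of the `K`/`M`/`G`/`T` suffixes and treats the "typical" srcds figure as MiB. The helper names are hypothetical.

```python
# Binary multipliers, matching systemd's base-1024 reading of K/M/G/T.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(value):
    """Parse a size such as '1.5G' or '2G' into bytes (simplified)."""
    value = value.strip()
    if value and value[-1].upper() in SUFFIXES:
        return int(float(value[:-1]) * SUFFIXES[value[-1].upper()])
    return int(value)


def headroom_mib(limit, typical_mib):
    """MiB left between a typical resident size and a memory limit."""
    return parse_size(limit) / 1024**2 - typical_mib


if __name__ == "__main__":
    # 800 MiB is the upper end of the typical range quoted above.
    print("below MemoryHigh:", headroom_mib("1.5G", 800), "MiB")
    print("below MemoryMax: ", headroom_mib("2G", 800), "MiB")
```

At the upper end of the typical range, an instance has 736 MiB of slack before `MemoryHigh` starts throttling and 1248 MiB before `MemoryMax` kills it.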
### Slice unit files
New file `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice`:
```ini
[Unit]
Description=left4me game-server slice
Before=slices.target
[Slice]
CPUWeight=1000
IOWeight=1000
```
New file `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice`:
```ini
[Unit]
Description=left4me script-sandbox build slice
Before=slices.target
[Slice]
CPUWeight=10
IOWeight=10
```
### Sandbox slice + OOM placement
Edit `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` to add to the `systemd-run` invocation (transient service mode — the existing helper uses `--unit=` without `--scope`):
- `--slice=l4d2-build.slice`
- `-p OOMScoreAdjust=500`
Existing `CPUQuota=200%` and `RuntimeMaxSec=3600` stay. Cgroup weight (slice) and CPU quota (per-unit) compose: weight handles contention, quota handles the absolute ceiling.
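For orientation, the resulting command line might look like the sketch below. The helper's real argument list is not reproduced in this spec, so the unit name, the payload command, and the function `build_sandbox_argv` are placeholders; only the flags named above are taken from the design.

```python
import shlex


def build_sandbox_argv(unit_name, command):
    """Assemble a systemd-run command line for a transient service.

    Deliberately omits --scope: the sandbox runs as a service unit.
    """
    return [
        "systemd-run",
        f"--unit={unit_name}",
        "--slice=l4d2-build.slice",    # new: low-weight build slice
        "-p", "OOMScoreAdjust=500",    # new: sandbox is killed before game servers
        "-p", "CPUQuota=200%",         # existing: absolute CPU ceiling
        "-p", "RuntimeMaxSec=3600",    # existing: one-hour runtime cap
        "--",
        *command,
    ]


if __name__ == "__main__":
    # Hypothetical unit name and payload, printed rather than executed.
    print(shlex.join(build_sandbox_argv("left4me-build-example", ["/bin/true"])))
```

Building the argument vector as a list avoids shell quoting problems and makes the new flags easy to assert on in tests.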
### Host sysctls
New file `deploy/files/etc/sysctl.d/99-left4me.conf`:
```
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
vm.swappiness = 10
```
Per-value justification:
- `rmem_max`/`wmem_max = 8 MB` — the Linux default (212992 bytes, about 208 KB, on current kernels) is a known bottleneck for sustained UDP. 8 MB is the standard 1 Gbit recommendation (Red Hat performance guide); enough headroom for ~10 instances on a host without going to 16 MB.
- `rmem_default`/`wmem_default = 512 KB` — protects sockets that don't explicitly call `setsockopt(SO_RCVBUF/SO_SNDBUF)`; harmless when they do.
- `netdev_max_backlog = 5000` — default `1000` overflows under multi-instance UDP burst; the per-CPU softnet queue starts dropping packets once full.
- `netdev_budget = 600` — gives softirq more packet-drain headroom per pass; default `300` is undersized for multi-Gbit-class hosts.
- `vm.swappiness = 10` — commonly recommended for latency-sensitive servers; has no practical effect on swapless hosts.
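An operator who wants to confirm that the values are live could compare the conf file against `/proc/sys`. The following is a minimal sketch under that assumption. The function names are hypothetical, and the dotted-key-to-path translation only covers simple keys like the ones in this file.

```python
from pathlib import Path

CONF_TEXT = """\
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
vm.swappiness = 10
"""


def parse_sysctl_conf(text):
    """Map sysctl keys to string values, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def proc_path(key, root="/proc/sys"):
    """Translate a dotted sysctl key into its path under /proc/sys."""
    return Path(root, *key.split("."))


def mismatches(expected, root="/proc/sys"):
    """Return {key: (expected, live)} where the live value differs.

    An unreadable key is reported with a live value of None.
    """
    diff = {}
    for key, want in expected.items():
        try:
            live = proc_path(key, root).read_text().strip()
        except OSError:
            live = None
        if live != want:
            diff[key] = (want, live)
    return diff


if __name__ == "__main__":
    for key, (want, live) in mismatches(parse_sysctl_conf(CONF_TEXT)).items():
        print(f"{key}: expected {want}, live {live}")
```

An empty result means every value in the file is in effect; any printed line points at a key that `sysctl --system` has not applied or that something else later overrode.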
### Deploy script integration
`deploy/deploy-test-server.sh` must:
1. Copy `etc/sysctl.d/99-left4me.conf` to `/etc/sysctl.d/`.
2. Run `sysctl --system` (or `sysctl -p /etc/sysctl.d/99-left4me.conf`) so values take effect immediately, not on next boot.
3. Copy the two `.slice` files into `/usr/local/lib/systemd/system/`.
4. `systemctl daemon-reload` after unit/slice changes (already done in current deploy flow).
5. No explicit `systemctl start` of the slices is required — they activate on first child reference.
### Documented escape hatches (no auto-apply)
Append a "Performance tuning" section to `deploy/README.md`:
- **CPU governor**: `cpupower frequency-set -g performance` if jitter under load matters more than power. Schedutil is acceptable for sustained UDP workloads. Provide the one-liner; do not ship a oneshot service in v1.
- **CPU affinity per instance**: example drop-in at `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf` setting `CPUAffinity=N`. Document the strategy "one instance per core, leave core 0 for system + IRQ".
- **NIC tuning**: example `ethtool -G <iface> rx 4096 tx 4096`, IRQ-pinning hints. Hardware-specific; ops-only.
- **Real-time scheduling opt-in**: example drop-in adding `CPUSchedulingPolicy=fifo`, `CPUSchedulingPriority=10`, `LimitRTPRIO=10`. Include a one-paragraph warning citing RT-throttling defaults (`sched_rt_runtime_us=950000`) and the failure mode if a single instance misbehaves.
These stay pure documentation in v1 — no code paths, no tests asserting them.
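To make the "one instance per core, leave core 0 for system + IRQ" strategy concrete for the README, a sketch like the following could generate the drop-in contents. The instance names `alpha` and `bravo` and the function `affinity_dropins` are illustrative only; nothing in v1 ships or runs this.

```python
def affinity_dropins(instances, cpu_count, reserved=(0,)):
    """Map drop-in path -> file content, assigning one core per instance.

    Cores listed in `reserved` (core 0 by default) are left for the
    system and IRQ handling.
    """
    cores = [core for core in range(cpu_count) if core not in reserved]
    if len(instances) > len(cores):
        raise ValueError("more instances than free cores")
    return {
        f"/etc/systemd/system/left4me-server@{name}.service.d/affinity.conf":
            f"[Service]\nCPUAffinity={core}\n"
        for name, core in zip(instances, cores)
    }


if __name__ == "__main__":
    # Two hypothetical instances on a four-core host: cores 1 and 2.
    for path, content in affinity_dropins(["alpha", "bravo"], cpu_count=4).items():
        print(path)
        print(content)
```

Refusing to assign when instances outnumber free cores is a deliberate choice here: silently doubling up would bring back the contention that pinning is meant to avoid.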
### Out-of-scope rationale
- **SCHED_FIFO**: a misbehaving srcds at any RT priority can starve kernel threads and produces failure modes that are harder to diagnose than the jitter problem it claims to solve. `Nice=-5` plus the slice weights captures the practical benefit. Ops who need RT can opt in via the documented drop-in.
- **CPU governor auto-set**: Phoronix and Arch comparisons show `schedutil` is within noise of `performance` on sustained workloads like Source UDP; aggressively forcing `performance` would surprise users on power-managed hosts.
- **CPUAffinity in the unit**: the unit template is shared across all instances; a single hard-coded `CPUAffinity=` would pin every instance to the same cores, defeating the purpose. Per-instance pinning needs deploy-time policy that is outside v1's scope.
### Files changed / added
```
deploy/files/usr/local/lib/systemd/system/left4me-server@.service (modified)
deploy/files/usr/local/lib/systemd/system/l4d2-game.slice (new)
deploy/files/usr/local/lib/systemd/system/l4d2-build.slice (new)
deploy/files/etc/sysctl.d/99-left4me.conf (new)
deploy/files/usr/local/libexec/left4me/left4me-script-sandbox (modified)
deploy/deploy-test-server.sh (modified — sysctl --system step)
deploy/README.md (modified — performance section)
deploy/tests/test_deploy_artifacts.py (modified — assertions)
```
## Tests
`deploy/tests/test_deploy_artifacts.py` additions, following the existing
`assert "key=value" in text` pattern:
- For `left4me-server@.service`, assert every line listed in *Per-instance
unit additions* is present verbatim. Each is a separate assertion so a
failing line is identifiable.
- For `l4d2-game.slice`, assert `CPUWeight=1000` and `IOWeight=1000`.
- For `l4d2-build.slice`, assert `CPUWeight=10` and `IOWeight=10`.
- For `99-left4me.conf`, assert every sysctl line listed in *Host sysctls*.
- For `left4me-script-sandbox`, assert the strings `--slice=l4d2-build.slice`
and `OOMScoreAdjust=500` both appear.
- Assert the deploy script invokes `sysctl --system` (or
`sysctl -p /etc/sysctl.d/99-left4me.conf`) at least once after copying the
conf into place.
No runtime perf tests in v1 — the spec ships defaults, not measured wins.
Real-world measurement is left to operators with concrete instance counts,
hardware, and player loads.
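The assertions listed above could be backed by two small helpers, sketched below. They are not the existing test code: `missing_lines` and `runs_after` are hypothetical names. The line-based match is a deliberate tightening of the substring pattern, because `"CPUWeight=10" in text` also passes for a file that contains `CPUWeight=1000`.

```python
SERVICE_LINES = [
    "Slice=l4d2-game.slice",
    "Nice=-5",
    "IOSchedulingClass=best-effort",
    "IOSchedulingPriority=4",
    "OOMScoreAdjust=-200",
    "MemoryHigh=1.5G",
    "MemoryMax=2G",
    "TasksMax=256",
    "LimitNOFILE=65536",
    "KillSignal=SIGINT",
    "TimeoutStopSec=15s",
    "LogRateLimitIntervalSec=0",
]


def missing_lines(text, expected):
    """Return the expected directives that are not present as whole lines."""
    present = {line.strip() for line in text.splitlines()}
    return [line for line in expected if line not in present]


def runs_after(script_text, earlier, later):
    """True if `later` occurs after the first occurrence of `earlier`."""
    start = script_text.find(earlier)
    return start != -1 and script_text.find(later, start + len(earlier)) != -1


if __name__ == "__main__":
    # Inline sample text; the real test would read the deployed files.
    unit = "[Service]\n" + "\n".join(SERVICE_LINES) + "\n"
    print("missing:", missing_lines(unit, SERVICE_LINES))
    script = "cp etc/sysctl.d/99-left4me.conf /etc/sysctl.d/\nsysctl --system\n"
    print("ordered:", runs_after(script, "99-left4me.conf", "sysctl --system"))
```

Reporting the list of missing lines, rather than a single boolean, keeps the "failing line is identifiable" property even if the per-line assertions are later collapsed into one.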
## Rollout
Single deploy. Running game servers will not pick up the new directives until each instance is restarted (systemd does not reapply unit changes to already-running services). The web UI's "stop" + "start" cycle is sufficient. Document this in `deploy/README.md`.
## Open questions
None blocking. v2 candidates if measurement justifies them:
- Per-instance `CPUAffinity` driven by a deploy-env knob (`LEFT4ME_INSTANCE_CPUS`).
- Job-worker awareness of "server has active players" to defer builds further than weights alone.
- Optional `left4me-host-perf.service` oneshot that sets governor + NIC tuning under a single env-flag opt-in.
## References
- systemd.exec(5) — `Nice=`, `IOSchedulingClass=`, `IOSchedulingPriority=`, `OOMScoreAdjust=`, `LimitNOFILE=`, `LogRateLimitIntervalSec=`.
- systemd.resource-control(5) — slice semantics, `Slice=`, `CPUWeight=`, `IOWeight=`, `MemoryHigh=`, `MemoryMax=`, `TasksMax=`, weight competition rules.
- systemd.kill(5) — signal handling and `KillSignal=`.
- systemd.service(5) — `TimeoutStopSec=`.
- Red Hat Enterprise Linux Network Performance Tuning Guide — `rmem_max`/`wmem_max`/`netdev_max_backlog`/`netdev_budget`.
- LWN "SCHED_FIFO and realtime throttling"; RHEL Real-Time CPU throttling docs — rationale for not shipping RT by default.
- Linux Foundation real-time wiki — `sched_rt_runtime_us` semantics.
- forums.srcds.com / AlliedModders / linuxquestions.org threads — confirmation that srcds is single-threaded per instance.
- Phoronix governor comparisons — performance vs schedutil for sustained workloads.
- Multiple latency-tuning guides — `vm.swappiness=10` consensus.