docs(specs): l4d2 server host perf baseline — design
Approach A: per-instance unit directives (Nice, OOM, Memory caps, KillSignal=SIGINT, log-rate disable), flat l4d2-game/l4d2-build slice hierarchy with 100:1 CPU/IO weight ratio, sandbox moved into the build slice with OOMScoreAdjust=500, host sysctls for UDP buffers + netdev backlog/budget + vm.swappiness. SCHED_FIFO, CPU governor, CPUAffinity, NIC tuning are documented escape hatches, not auto-applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Parent: 965b67e6fc
Commit: db3b149045
1 changed file with 230 additions and 0 deletions

# l4d2 server host perf baseline — design

Date: 2026-05-09
Status: design

## Summary

Apply a host-side performance and resource-isolation baseline to every L4D2 server instance, using systemd unit directives, a slice hierarchy, and host sysctls. The blueprint-level game configuration (tickrate, sv_minrate/maxrate, fps_max, plugins) stays the responsibility of the individual server maintainer and is out of scope.

## Goals

- Game-server processes get measurable scheduling, I/O, and OOM priority over the script-build sandbox and over interactive system traffic.
- One misbehaving server cannot OOM-kill its siblings or the host.
- The kernel's UDP path is sized for sustained Source-engine traffic instead of distro defaults.
- Operators have documented escape hatches for host-specific tuning (CPU pinning, governor, NIC IRQs, real-time scheduling) without any of it being imposed by default.

## Non-goals

- ConVars, blueprint arguments, plugins, tickrate, rate values: owned by the maintainer of each server.
- Real-time (`SCHED_FIFO`/`SCHED_RR`) scheduling for game servers. Documented as opt-in only; see Out-of-scope rationale.
- CPU governor changes. Documented opt-in only.
- Per-instance `CPUAffinity`. Host-specific; documented only.
- NIC ring-buffer / IRQ-pinning changes. Hardware-specific; documented only.
- Job-scheduler awareness ("don't build a script overlay while server X has players"). Cgroup weights cover this in v1; revisit if real-world data disagrees.
- Tighter hardening (`ProtectKernelTunables=yes`, etc.). Security-focused, separate spec.

## Background

Current state (commit `965b67e`):

- `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` runs `srcds_run` as user `left4me` with security hardening (`NoNewPrivileges`, `PrivateTmp`, `PrivateDevices`, `ProtectHome`, `ProtectSystem=strict`, `ReadOnlyPaths`, `ReadWritePaths`, `RestrictSUIDSGID`, `LockPersonality`) but **no scheduling, memory, OOM, kill-signal, or log-rate directives**.
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` runs script-overlay builds via `systemd-run --scope` with `CPUQuota=200%` and `RuntimeMaxSec=3600`, but in the **default cgroup**, where it competes against game servers as an equal sibling under `system.slice`.
- No host sysctls are deployed. Linux defaults (`rmem_max`/`wmem_max` = 212992 bytes, about 208 KB, on current 64-bit kernels; `netdev_max_backlog=1000`) are below what sustained UDP gameplay across multiple instances expects.

srcds is single-threaded per instance, so multi-instance hosts contend over CPU cycles, kernel softirq budget, and journald rate limits.

## Design

### Slice topology

Flat top-level slices, siblings of `system.slice` and `user.slice`:

```
-.slice
├── system.slice (default CPUWeight=100, IOWeight=100)
├── user.slice (default CPUWeight=100, IOWeight=100)
├── l4d2-game.slice (CPUWeight=1000, IOWeight=1000)
└── l4d2-build.slice (CPUWeight=10, IOWeight=10)
```

Rationale:

- 100:1 weight ratio between game and build: under contention, the build sandbox is squeezed to roughly 1% of the CPU and I/O share the game slice receives; when uncontended, the build still gets the full box, capped only by its own `CPUQuota=200%`.
- Flat (siblings of `system.slice`, not nested under it) so the game slice's weight of 1000 competes directly against `user.slice` at 100. A logged-in admin running a heavy task in `user.slice` cannot steal cycles from a live match.

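
To make the weight arithmetic concrete, the following Python sketch computes each slice's proportional share when a given set of sibling slices is saturated at the same time. It assumes the cgroup v2 model in which contending siblings split CPU time in proportion to their weights; the `cpu_shares` helper and the two scenarios are illustrative and not part of the deployment.

```python
# Proportional-share arithmetic for sibling cgroups (illustrative only).
WEIGHTS = {
    "system.slice": 100,
    "user.slice": 100,
    "l4d2-game.slice": 1000,
    "l4d2-build.slice": 10,
}


def cpu_shares(weights, busy):
    """Return each busy slice's fraction of the contended CPU time."""
    total = sum(weights[name] for name in busy)
    return {name: weights[name] / total for name in busy}


# Scenario 1: every top-level slice is saturated at once.
all_busy = cpu_shares(WEIGHTS, set(WEIGHTS))

# Scenario 2: only the game and build slices are competing.
game_and_build = cpu_shares(WEIGHTS, {"l4d2-game.slice", "l4d2-build.slice"})

for name in sorted(all_busy):
    print(f"{name:18s} {all_busy[name]:6.2%}")
```

With all four slices busy, the game slice works out to 1000/1210 (about 83%) and the build slice to 10/1210 (under 1%). Weights only matter while slices actually contend; an idle sibling's share is redistributed to the others.
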
### Per-instance unit additions (`left4me-server@.service`)

Add to `[Service]`:

```
Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0
```

Per-directive justification:

- `Slice=l4d2-game.slice`: places the instance in the high-weight slice.
- `Nice=-5`: modest CFS priority bump. Negative `Nice` set by systemd does not require `CAP_SYS_NICE` because systemd applies the value before dropping to the unit user. SCHED_FIFO is intentionally rejected; see Out-of-scope rationale.
- `IOSchedulingClass=best-effort` + `IOSchedulingPriority=4`: explicit best-effort at priority 4 (range 0 to 7, 0 highest). This matches the documented default for the class rather than raising it; stating it explicitly keeps the value deterministic across distros and is harmless.
- `OOMScoreAdjust=-200`: game servers survive memory pressure; sandbox dies first (see sandbox section).
- `MemoryHigh=1.5G`, `MemoryMax=2G`: soft + hard ceiling. Typical L4D2 srcds runs ~500–800 MB; map-load spikes fit in headroom; a runaway is bounded.
- `TasksMax=256`: bounds the task (process + thread) count well above srcds' steady-state usage; prevents fork-bomb style failures from leaking host-wide.
- `LimitNOFILE=65536`: Valve wiki recommendation; cheap and matches multi-plugin setups.
- `KillSignal=SIGINT`: srcds responds to SIGINT for clean shutdown (writes demos, flushes logs); SIGTERM is harsher.
- `TimeoutStopSec=15s`: gives srcds time to finish flushing before SIGKILL.
- `LogRateLimitIntervalSec=0`: disables journald per-unit rate limiting (default `10000 msgs/30s`). srcds + plugins exceed this on busy maps; dropped messages break diagnostics.

Existing security directives are kept verbatim.

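
As a rough capacity check, the memory ceilings can be converted to bytes and multiplied by the instance count. The sketch below assumes systemd's base-1024 reading of the `K`/`M`/`G`/`T` suffixes; `parse_size` is a simplified stand-in for systemd's own parser, and the instance count of 10 is only an example figure.

```python
# Convert the unit's memory ceilings to bytes and size the worst case.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(value):
    """Parse sizes such as '1.5G' or '4096' into bytes (base-1024 suffixes)."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in SUFFIXES:
        return int(float(value[:-1]) * SUFFIXES[suffix])
    return int(value)


MEMORY_HIGH = parse_size("1.5G")  # soft ceiling: reclaim pressure starts here
MEMORY_MAX = parse_size("2G")     # hard ceiling: the cgroup is OOM-killed above it


def worst_case_total(instances):
    """Upper bound on memory all instances can hold at the same time."""
    return instances * MEMORY_MAX


print(MEMORY_HIGH, MEMORY_MAX)
print(f"10 instances: {worst_case_total(10) / 1024**3:.0f} GiB worst case")
```

The hard ceiling bounds a runaway instance, but the sum of ceilings can still exceed physical RAM, so the instance count per host has to be chosen with that total in mind.
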
### Slice unit files

New file `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice`:

```ini
[Unit]
Description=left4me game-server slice
Before=slices.target

[Slice]
CPUWeight=1000
IOWeight=1000
```

New file `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice`:

```ini
[Unit]
Description=left4me script-sandbox build slice
Before=slices.target

[Slice]
CPUWeight=10
IOWeight=10
```

### Sandbox slice + OOM placement

Edit `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` to add to the `systemd-run --scope` invocation:

- `--slice=l4d2-build.slice`
- `-p OOMScoreAdjust=500`

Existing `CPUQuota=200%` and `RuntimeMaxSec=3600` stay. Cgroup weight (slice) and CPU quota (per-scope) compose: weight handles contention, quota handles the absolute ceiling.

Scope units wrap processes started by the caller rather than by systemd, so confirm on the target systemd version that `systemd-run --scope` accepts `OOMScoreAdjust=` as a property before relying on it.

### Host sysctls

New file `deploy/files/etc/sysctl.d/99-left4me.conf`:

```
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
vm.swappiness = 10
```

Per-value justification:

- `rmem_max`/`wmem_max = 8 MB`: the Linux default (212992 bytes, about 208 KB, on current 64-bit kernels) is a known bottleneck for sustained UDP. 8 MB is the standard 1 Gbit recommendation (Red Hat performance guide); enough headroom for ~10 instances on a host without going to 16 MB.
- `rmem_default`/`wmem_default = 512 KB`: protects sockets that don't explicitly call `setsockopt(SO_RCVBUF/SO_SNDBUF)`; harmless when they do.
- `netdev_max_backlog = 5000`: default `1000` overflows under multi-instance UDP burst; the per-CPU softnet queue starts dropping packets once full.
- `netdev_budget = 600`: gives softirq more packet-drain headroom per pass; default `300` is undersized for multi-Gbit-class hosts.
- `vm.swappiness = 10`: commonly recommended for latency-sensitive servers (the kernel default is 60); has no effect on hosts without swap.

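
One way an operator might confirm that the values above are live after a deploy is to compare the conf file against `/proc/sys`. The Python sketch below is a minimal illustration: `parse_sysctl_conf`, `proc_path`, and `drift` are hypothetical helper names, and the dotted-key to path mapping assumes keys like the seven listed here, with no dots inside a single component.

```python
from pathlib import Path

# Contents of 99-left4me.conf as listed in this section.
CONF_TEXT = """\
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
vm.swappiness = 10
"""


def parse_sysctl_conf(text):
    """Parse 'key = value' lines, skipping blanks and # or ; comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def proc_path(key):
    """Map a dotted sysctl key to its /proc/sys path."""
    return Path("/proc/sys") / key.replace(".", "/")


def drift(expected):
    """Return {key: (wanted, live)} for every key whose live value differs."""
    mismatches = {}
    for key, want in expected.items():
        try:
            have = proc_path(key).read_text().strip()
        except OSError:
            have = None  # key missing or unreadable on this host
        if have != want:
            mismatches[key] = (want, have)
    return mismatches


EXPECTED = parse_sysctl_conf(CONF_TEXT)

if __name__ == "__main__":
    for key, (want, have) in sorted(drift(EXPECTED).items()):
        print(f"{key}: want {want}, have {have}")
```

An empty report means the running kernel already matches the file; any line printed points at a value that was not applied or was overridden by a later file in `sysctl.d`.
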
### Deploy script integration

`deploy/deploy-test-server.sh` must:

1. Copy `etc/sysctl.d/99-left4me.conf` to `/etc/sysctl.d/`.
2. Run `sysctl --system` (or `sysctl -p /etc/sysctl.d/99-left4me.conf`) so values take effect immediately, not on next boot.
3. Copy the two `.slice` files into `/usr/local/lib/systemd/system/`.
4. `systemctl daemon-reload` after unit/slice changes (already done in current deploy flow).
5. No explicit `systemctl start` of the slices is required; they activate on first child reference.

### Documented escape hatches (no auto-apply)

Append a "Performance tuning" section to `deploy/README.md`:

- **CPU governor**: `cpupower frequency-set -g performance` if jitter under load matters more than power. `schedutil` is acceptable for sustained UDP workloads. Provide the one-liner; do not ship a oneshot service in v1.
- **CPU affinity per instance**: example drop-in at `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf` setting `CPUAffinity=N`. Document the strategy "one instance per core, leave core 0 for system + IRQ".
- **NIC tuning**: example `ethtool -G <iface> rx 4096 tx 4096`, IRQ-pinning hints. Hardware-specific; ops-only.
- **Real-time scheduling opt-in**: example drop-in adding `CPUSchedulingPolicy=fifo`, `CPUSchedulingPriority=10`, `LimitRTPRIO=10`. Include a one-paragraph warning citing RT-throttling defaults (`sched_rt_runtime_us=950000`) and the failure mode if a single instance misbehaves.

These stay pure documentation in v1: no code paths, no tests asserting them.

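
For the README's CPU-affinity entry, the "one instance per core, leave core 0 for system + IRQ" strategy can be illustrated with a small planning sketch in Python. Everything here is hypothetical and for documentation only: the helper names, the instance names `alpha`/`bravo`/`charlie`, and the 4-core host are assumptions, and nothing in v1 generates these drop-ins.

```python
# Plan per-instance CPU pinning: one core each, core 0 left for system + IRQ.


def plan_affinity(instances, cpu_count, reserved=1):
    """Assign each instance its own core, skipping the first `reserved` cores."""
    usable = list(range(reserved, cpu_count))
    if len(instances) > len(usable):
        raise ValueError(
            f"{len(instances)} instances but only {len(usable)} unreserved cores"
        )
    return dict(zip(instances, usable))


def dropin_path(instance):
    """Location of the per-instance drop-in described in the README entry."""
    return f"/etc/systemd/system/left4me-server@{instance}.service.d/affinity.conf"


def render_dropin(core):
    """Drop-in body pinning the instance to a single core."""
    return f"[Service]\nCPUAffinity={core}\n"


plan = plan_affinity(["alpha", "bravo", "charlie"], cpu_count=4)
for instance, core in plan.items():
    print(dropin_path(instance))
    print(render_dropin(core))
```

Raising an error when instances outnumber free cores keeps the plan from silently doubling up two single-threaded servers on one core, which is the situation pinning is meant to avoid.
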
### Out-of-scope rationale

- **SCHED_FIFO**: a misbehaving srcds at any RT priority can starve kernel threads and produce failure modes that are harder to diagnose than the jitter problem it claims to solve. `Nice=-5` plus the slice weights captures the practical benefit. Ops who need RT can opt in via the documented drop-in.
- **CPU governor auto-set**: Phoronix and Arch comparisons show `schedutil` is within noise of `performance` on sustained workloads like Source UDP; aggressively forcing `performance` would surprise users on power-managed hosts.
- **CPUAffinity in the unit**: the unit template is shared across all instances; a single hard-coded `CPUAffinity=` would pin every instance to the same cores, defeating the purpose. Per-instance pinning needs deploy-time policy that is outside v1's scope.

### Files changed / added

```
deploy/files/usr/local/lib/systemd/system/left4me-server@.service (modified)
deploy/files/usr/local/lib/systemd/system/l4d2-game.slice (new)
deploy/files/usr/local/lib/systemd/system/l4d2-build.slice (new)
deploy/files/etc/sysctl.d/99-left4me.conf (new)
deploy/files/usr/local/libexec/left4me/left4me-script-sandbox (modified)
deploy/deploy-test-server.sh (modified: sysctl --system step)
deploy/README.md (modified: performance section)
deploy/tests/test_deploy_artifacts.py (modified: assertions)
```

## Tests

`deploy/tests/test_deploy_artifacts.py` additions, following the existing `assert "key=value" in text` pattern:

- For `left4me-server@.service`, assert every line listed in *Per-instance unit additions* is present verbatim. Each is a separate assertion so a failing line is identifiable.
- For `l4d2-game.slice`, assert `CPUWeight=1000` and `IOWeight=1000`.
- For `l4d2-build.slice`, assert `CPUWeight=10` and `IOWeight=10`.
- For `99-left4me.conf`, assert every sysctl line listed in *Host sysctls*.
- For `left4me-script-sandbox`, assert the strings `--slice=l4d2-build.slice` and `OOMScoreAdjust=500` both appear.
- Assert the deploy script invokes `sysctl --system` (or `sysctl -p /etc/sysctl.d/99-left4me.conf`) at least once after copying the conf into place.

No runtime perf tests in v1: the spec ships defaults, not measured wins. Real-world measurement is left to operators with concrete instance counts, hardware, and player loads.

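
The assertions listed above could be built on two small helpers, sketched here in Python. This is not the existing test code: `missing_lines` and `runs_after` are hypothetical names, and `SAMPLE_UNIT` and `SAMPLE_DEPLOY` are inline stand-ins for the real files so the sketch runs on its own.

```python
# Helpers for artifact assertions, exercised against inline sample text.
REQUIRED_UNIT_LINES = [
    "Slice=l4d2-game.slice",
    "Nice=-5",
    "IOSchedulingClass=best-effort",
    "IOSchedulingPriority=4",
    "OOMScoreAdjust=-200",
    "MemoryHigh=1.5G",
    "MemoryMax=2G",
    "TasksMax=256",
    "LimitNOFILE=65536",
    "KillSignal=SIGINT",
    "TimeoutStopSec=15s",
    "LogRateLimitIntervalSec=0",
]


def missing_lines(text, required):
    """Return the required lines that do not appear as whole lines in text."""
    present = {line.strip() for line in text.splitlines()}
    return [line for line in required if line not in present]


def runs_after(script, first, then):
    """True if `then` occurs somewhere after the first occurrence of `first`."""
    start = script.find(first)
    return start != -1 and script.find(then, start + len(first)) != -1


# Stand-in unit that deliberately omits the last required directive.
SAMPLE_UNIT = "[Service]\n" + "\n".join(REQUIRED_UNIT_LINES[:-1]) + "\n"

# Stand-in deploy script: copy the conf, then apply it.
SAMPLE_DEPLOY = (
    "install -m 0644 etc/sysctl.d/99-left4me.conf /etc/sysctl.d/\n"
    "sysctl --system\n"
)

print(missing_lines(SAMPLE_UNIT, REQUIRED_UNIT_LINES))
print(runs_after(SAMPLE_DEPLOY, "99-left4me.conf", "sysctl --system"))
```

Whole-line matching is slightly stricter than a plain substring check, so `MemoryMax=2G` would not be satisfied by a line reading `MemoryMax=2GB`. Returning the list of missing lines keeps each failing directive identifiable, in the spirit of one assertion per line.
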
## Rollout

Single deploy. Running game servers will not pick up the new directives until each instance is restarted (systemd does not reapply unit changes to already-running services). The web UI's "stop" + "start" cycle is sufficient. Document this in `deploy/README.md`.

## Open questions

None blocking. v2 candidates if measurement justifies them:

- Per-instance `CPUAffinity` driven by a deploy-env knob (`LEFT4ME_INSTANCE_CPUS`).
- Job-worker awareness of "server has active players" to defer builds further than weights alone.
- Optional `left4me-host-perf.service` oneshot that sets governor + NIC tuning under a single env-flag opt-in.

## References

- systemd.exec(5): `Nice=`, `IOSchedulingClass=`, `IOSchedulingPriority=`, `OOMScoreAdjust=`, `LimitNOFILE=`, `LogRateLimitIntervalSec=`.
- systemd.resource-control(5): slice semantics, `Slice=`, `CPUWeight=`, `IOWeight=`, weight competition rules, `MemoryHigh=`, `MemoryMax=`, `TasksMax=`.
- systemd.kill(5): signal handling and `KillSignal=`.
- systemd.service(5): `TimeoutStopSec=`.
- Red Hat Enterprise Linux Network Performance Tuning Guide: `rmem_max`/`wmem_max`/`netdev_max_backlog`/`netdev_budget`.
- LWN "SCHED_FIFO and realtime throttling"; RHEL Real-Time CPU throttling docs: rationale for not shipping RT by default.
- Linux Foundation real-time wiki: `sched_rt_runtime_us` semantics.
- forums.srcds.com / AlliedModders / linuxquestions.org threads: confirmation that srcds is single-threaded per instance.
- Phoronix governor comparisons: performance vs schedutil for sustained workloads.
- Multiple latency-tuning guides: `vm.swappiness=10` consensus.