left4me/docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md

# l4d2 server host perf baseline — design

Date: 2026-05-09
Status: design

## Summary

Apply a host-side performance and resource-isolation baseline to every L4D2 server instance, using systemd unit directives, a slice hierarchy, and host sysctls. The blueprint-level game configuration (tickrate, sv_minrate/maxrate, fps_max, plugins) stays the responsibility of the individual server maintainer and is out of scope.

## Goals

- Game-server processes get measurable scheduling, I/O, and OOM priority over the script-build sandbox and over interactive system traffic.
- One misbehaving server cannot OOM-kill its siblings or the host.
- The kernel's UDP path is sized for sustained Source-engine traffic instead of distro defaults.
- Operators have documented escape hatches for host-specific tuning (CPU pinning, governor, NIC IRQs, real-time scheduling) without any of it being imposed by default.

## Non-goals

- ConVars, blueprint arguments, plugins, tickrate, rate values — owned by the maintainer of each server.
- Real-time (`SCHED_FIFO`/`SCHED_RR`) scheduling for game servers. Documented as opt-in only; see Out-of-scope rationale.
- CPU governor changes. Documented opt-in only.
- Per-instance `CPUAffinity`. Host-specific; documented only.
- NIC ring-buffer / IRQ-pinning changes. Hardware-specific; documented only.
- Job-scheduler awareness ("don't build a script overlay while server X has players"). Cgroup weights cover this in v1; revisit if real-world data disagrees.
- Hardening tightening (`ProtectKernelTunables=yes`, etc.). Security-focused, separate spec.

## Background

Current state (commit `965b67e`):

- `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` runs `srcds_run` as user `left4me` with security hardening (`NoNewPrivileges`, `PrivateTmp`, `PrivateDevices`, `ProtectHome`, `ProtectSystem=strict`, `ReadOnlyPaths`, `ReadWritePaths`, `RestrictSUIDSGID`, `LockPersonality`) but **no scheduling, memory, OOM, kill-signal, or log-rate directives**.
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` runs script-overlay builds via `systemd-run` as a transient service (`--unit=`, no `--scope`) with `CPUQuota=200%` and `RuntimeMaxSec=3600`, but with **no explicit slice**, so it lands in the default cgroup and competes against game servers as an equal sibling under `system.slice`.
- No host sysctls are deployed. Linux defaults (`rmem_max`/`wmem_max` ≈ 128 KB, `netdev_max_backlog=1000`) are below what sustained UDP gameplay across multiple instances expects.

srcds is single-threaded per instance, so multi-instance hosts contend over CPU cycles, kernel softirq budget, and journald rate limits.

## Design

### Slice topology

Flat top-level slices, siblings of `system.slice` and `user.slice`:

```
-.slice
├── system.slice         (default CPUWeight=100, IOWeight=100)
├── user.slice           (default CPUWeight=100, IOWeight=100)
├── l4d2-game.slice      (CPUWeight=1000, IOWeight=1000)
└── l4d2-build.slice     (CPUWeight=10,   IOWeight=10)
```

Rationale:

- 100:1 weight ratio between game and build means: under contention, the build sandbox is squeezed to roughly 1% of the contended CPU and I/O time rather than competing as an equal; when uncontended, the build still gets the full box modulo its own `CPUQuota=200%`.
- Flat (not nested under `system.slice`) so a logged-in admin running a heavy task in `user.slice` cannot steal cycles from a live match.
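
As a rough illustration of the weight arithmetic, the sketch below computes each slice's share of CPU time when a given set of sibling slices is busy at once. It assumes the simplified model that a busy cgroup's share is its weight divided by the sum of its busy siblings' weights; the function name `cpu_shares` is made up for this example and is not part of the deploy code.

```python
# Slice weights from the topology above; all four are siblings under -.slice.
WEIGHTS = {
    "system.slice": 100,
    "user.slice": 100,
    "l4d2-game.slice": 1000,
    "l4d2-build.slice": 10,
}


def cpu_shares(weights, busy):
    """Return the fraction of CPU time each busy slice gets under contention.

    Simplified model: only slices with runnable work compete, and each one
    receives weight / sum(weights of busy siblings). Idle slices take nothing.
    """
    total = sum(weights[name] for name in busy)
    return {name: weights[name] / total for name in busy}


# Game and build both busy: the build is squeezed to roughly 1% of CPU time.
game_vs_build = cpu_shares(WEIGHTS, ["l4d2-game.slice", "l4d2-build.slice"])

# Build alone: it owns the whole pool (still capped by its own CPUQuota=200%).
build_alone = cpu_shares(WEIGHTS, ["l4d2-build.slice"])

# Everything busy at once: game still holds the large majority.
all_busy = cpu_shares(WEIGHTS, list(WEIGHTS))
```

Weights only matter while CPUs are saturated, so none of these fractions are reservations; they describe the split when every listed slice wants more CPU than is available.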

### Per-instance unit additions (`left4me-server@.service`)

Add to `[Service]`:

```
Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0
```

Per-directive justification:

- `Slice=l4d2-game.slice` — places the instance in the high-weight slice.
- `Nice=-5` — modest CFS priority bump. Negative `Nice` set by systemd does not require `CAP_SYS_NICE` because systemd applies the value before dropping to the unit user. SCHED_FIFO is intentionally rejected; see Out-of-scope rationale.
- `IOSchedulingClass=best-effort` + `IOSchedulingPriority=4` — pins the best-effort class at priority 4 (0 is highest, 7 is lowest), which is the usual default within that class rather than a bump above it; setting it explicitly is deterministic and harmless.
- `OOMScoreAdjust=-200` — game servers survive memory pressure; sandbox dies first (see sandbox section).
- `MemoryHigh=1.5G`, `MemoryMax=2G` — soft + hard ceiling. Typical L4D2 srcds runs ~500–800 MB; map-load spikes fit in headroom; a runaway is bounded.
- `TasksMax=256` — bounds thread count well above srcds' steady-state usage; prevents fork-bomb style failures from leaking host-wide.
- `LimitNOFILE=65536` — Valve wiki recommendation; cheap and matches multi-plugin setups.
- `KillSignal=SIGINT` — srcds responds to SIGINT for clean shutdown (writes demos, flushes logs); SIGTERM is harsher.
- `TimeoutStopSec=15s` — gives srcds time to finish flush before SIGKILL.
- `LogRateLimitIntervalSec=0` — disables journald per-unit rate limiting (default `10000 msgs/30s`). srcds + plugins exceed this on busy maps; dropped messages break diagnostics.

Existing security directives are kept verbatim.

### Slice unit files

New file `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice`:

```ini
[Unit]
Description=left4me game-server slice
Before=slices.target

[Slice]
CPUWeight=1000
IOWeight=1000
```

New file `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice`:

```ini
[Unit]
Description=left4me script-sandbox build slice
Before=slices.target

[Slice]
CPUWeight=10
IOWeight=10
```

### Sandbox slice + OOM placement

Edit `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` to add to the `systemd-run` invocation (transient service mode — the existing helper uses `--unit=` without `--scope`):

- `--slice=l4d2-build.slice`
- `-p OOMScoreAdjust=500`

Existing `CPUQuota=200%` and `RuntimeMaxSec=3600` stay. Cgroup weight (slice) and CPU quota (per-unit) compose: weight handles contention, quota handles the absolute ceiling.
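
To make the resulting invocation concrete, here is a minimal sketch of the argument vector the helper would end up passing to `systemd-run`. The function name `sandbox_argv` and its `unit_name` and `command` parameters are hypothetical, and the real helper may pass additional options that are not shown here; only the slice, the three properties, and the use of `--unit=` come from this spec.

```python
def sandbox_argv(unit_name, command):
    """Build a systemd-run argv for a transient build service.

    unit_name and command are illustrative inputs; the slice and property
    values are the ones named in this design.
    """
    return [
        "systemd-run",
        f"--unit={unit_name}",           # transient service, not --scope
        "--slice=l4d2-build.slice",      # new: low-weight build slice
        "-p", "OOMScoreAdjust=500",      # new: killed before game servers
        "-p", "CPUQuota=200%",           # existing: absolute CPU ceiling
        "-p", "RuntimeMaxSec=3600",      # existing: wall-clock limit
        "--",
        *command,
    ]


# Example with a placeholder unit name and build command.
argv = sandbox_argv("left4me-build-example", ["/bin/true"])
```

Keeping the slice and OOM settings as literal strings in the helper also keeps them greppable, which is what the artifact tests rely on.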

### Host sysctls

New file `deploy/files/etc/sysctl.d/99-left4me.conf`:

```
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
vm.swappiness = 10
```

Per-value justification:

- `rmem_max`/`wmem_max = 8 MB` — Linux default of ~128 KB is a known bottleneck for sustained UDP. 8 MB is the standard 1 Gbit recommendation (Red Hat performance guide); enough headroom for ~10 instances on a host without going to 16 MB.
- `rmem_default`/`wmem_default = 512 KB` — protects sockets that don't explicitly call `setsockopt(SO_RCVBUF/SO_SNDBUF)`; harmless when they do.
- `netdev_max_backlog = 5000` — default `1000` overflows under multi-instance UDP burst; the per-CPU softnet queue starts dropping packets once full.
- `netdev_budget = 600` — gives softirq more packet-drain headroom per pass; default `300` is undersized for multi-Gbit-class hosts.
- `vm.swappiness = 10` — commonly recommended for latency-sensitive servers because it makes the kernel less eager to swap out anonymous memory; harmless on swapless hosts.
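
One way an operator could confirm the values actually took effect is to compare the conf file against the live values under `/proc/sys`. The sketch below is an assumption about how such a check might look, not something this spec ships; the helper names `parse_sysctl`, `proc_path`, and `drifted` are made up for the example.

```python
from pathlib import Path

# The sysctl fragment from this section, embedded so the sketch is self-contained.
SYSCTL_CONF = """\
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
vm.swappiness = 10
"""


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blanks and '#' or ';' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def proc_path(key):
    """Map a dotted sysctl key to its /proc/sys path (fine for these keys)."""
    return Path("/proc/sys") / key.replace(".", "/")


def drifted(settings):
    """Return {key: (expected, live)} for keys whose live value differs."""
    out = {}
    for key, expected in settings.items():
        path = proc_path(key)
        if not path.exists():  # not Linux, or the key is unavailable
            continue
        live = path.read_text().strip()
        if live != expected:
            out[key] = (expected, live)
    return out


settings = parse_sysctl(SYSCTL_CONF)
```

Calling `drifted(settings)` on a deployed host should return an empty dict once `sysctl --system` has run; a non-empty result usually means a later file in `/etc/sysctl.d/` overrides one of the keys.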

### Deploy script integration

`deploy/deploy-test-server.sh` must:

1. Copy `etc/sysctl.d/99-left4me.conf` to `/etc/sysctl.d/`.
2. Run `sysctl --system` (or `sysctl -p /etc/sysctl.d/99-left4me.conf`) so values take effect immediately, not on next boot.
3. Copy the two `.slice` files into `/usr/local/lib/systemd/system/`.
4. `systemctl daemon-reload` after unit/slice changes (already done in current deploy flow).
5. No explicit `systemctl start` of the slices is required — they activate on first child reference.
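
The ordering of these steps can be sketched as a plain list of commands. This is only an illustration of the sequence: the function name `deploy_commands`, the use of `install -m 0644`, and the `files_root` default are assumptions, and the real script may copy files differently.

```python
def deploy_commands(files_root="deploy/files"):
    """Return the host-perf deploy steps as ordered argv lists.

    Sketch only: file modes and the copy mechanism are assumptions.
    """
    unit_dir = "usr/local/lib/systemd/system"
    conf = "etc/sysctl.d/99-left4me.conf"
    return [
        # 1. Put the sysctl fragment in place.
        ["install", "-m", "0644", f"{files_root}/{conf}", f"/{conf}"],
        # 2. Apply it now instead of waiting for the next boot.
        ["sysctl", "--system"],
        # 3. Install both slice units.
        ["install", "-m", "0644",
         f"{files_root}/{unit_dir}/l4d2-game.slice", f"/{unit_dir}/l4d2-game.slice"],
        ["install", "-m", "0644",
         f"{files_root}/{unit_dir}/l4d2-build.slice", f"/{unit_dir}/l4d2-build.slice"],
        # 4. Make systemd re-read unit and slice files.
        ["systemctl", "daemon-reload"],
        # 5. No "systemctl start" for the slices: they activate on first use.
    ]


steps = deploy_commands()
```

The one ordering constraint worth preserving is that `sysctl --system` runs after the conf file is copied and `daemon-reload` runs after the slice files are copied.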

### Documented escape hatches (no auto-apply)

Append a "Performance tuning" section to `deploy/README.md`:

- **CPU governor**: `cpupower frequency-set -g performance` if jitter under load matters more than power. Schedutil is acceptable for sustained UDP workloads. Provide the one-liner; do not ship a oneshot service in v1.
- **CPU affinity per instance**: example drop-in at `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf` setting `CPUAffinity=N`. Document the strategy "one instance per core, leave core 0 for system + IRQ".
- **NIC tuning**: example `ethtool -G <iface> rx 4096 tx 4096`, IRQ-pinning hints. Hardware-specific; ops-only.
- **Real-time scheduling opt-in**: example drop-in adding `CPUSchedulingPolicy=fifo`, `CPUSchedulingPriority=10`, `LimitRTPRIO=10`. Include a one-paragraph warning citing RT-throttling defaults (`sched_rt_runtime_us=950000`) and the failure mode if a single instance misbehaves.

These stay pure documentation in v1 — no code paths, no tests asserting them.
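
For the README, the "one instance per core, leave core 0 for system + IRQ" strategy could be illustrated with something like the sketch below, which renders one affinity drop-in per instance. It is an example for operators rather than a shipped code path; the function name `affinity_dropins` and the instance names are hypothetical.

```python
def affinity_dropins(instances, cpu_count):
    """Map each instance name to (drop-in path, drop-in content).

    Core 0 is left for system work and IRQs, so cores 1..cpu_count-1 are
    handed out in order, one per instance. Raises ValueError when there
    are more instances than spare cores.
    """
    spare = list(range(1, cpu_count))
    if len(instances) > len(spare):
        raise ValueError("more instances than spare cores")
    result = {}
    for name, core in zip(instances, spare):
        path = (
            f"/etc/systemd/system/left4me-server@{name}.service.d/affinity.conf"
        )
        result[name] = (path, f"[Service]\nCPUAffinity={core}\n")
    return result


# Two placeholder instances on a 4-core host: they get cores 1 and 2.
dropins = affinity_dropins(["alpha", "bravo"], cpu_count=4)
```

After writing the drop-ins, an operator would still need `systemctl daemon-reload` and a restart of each affected instance for the pinning to apply.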

### Out-of-scope rationale

- **SCHED_FIFO**: a misbehaving srcds at any RT priority can starve kernel threads and produces failure modes that are harder to diagnose than the jitter problem it claims to solve. `Nice=-5` plus the slice weights captures the practical benefit. Ops who need RT can opt in via the documented drop-in.
- **CPU governor auto-set**: Phoronix and Arch comparisons show `schedutil` is within noise of `performance` on sustained workloads like Source UDP; aggressively forcing `performance` would surprise users on power-managed hosts.
- **CPUAffinity in the unit**: the unit template is shared across all instances; a single hard-coded `CPUAffinity=` would pin every instance to the same cores, defeating the purpose. Per-instance pinning needs deploy-time policy that is outside v1's scope.

### Files changed / added

```
deploy/files/usr/local/lib/systemd/system/left4me-server@.service       (modified)
deploy/files/usr/local/lib/systemd/system/l4d2-game.slice               (new)
deploy/files/usr/local/lib/systemd/system/l4d2-build.slice              (new)
deploy/files/etc/sysctl.d/99-left4me.conf                               (new)
deploy/files/usr/local/libexec/left4me/left4me-script-sandbox           (modified)
deploy/deploy-test-server.sh                                            (modified — sysctl --system step)
deploy/README.md                                                        (modified — performance section)
deploy/tests/test_deploy_artifacts.py                                   (modified — assertions)
```

## Tests

`deploy/tests/test_deploy_artifacts.py` additions, following the existing
`assert "key=value" in text` pattern:

- For `left4me-server@.service`, assert every line listed in *Per-instance
  unit additions* is present verbatim. Each is a separate assertion so a
  failing line is identifiable.
- For `l4d2-game.slice`, assert `CPUWeight=1000` and `IOWeight=1000`.
- For `l4d2-build.slice`, assert `CPUWeight=10` and `IOWeight=10`.
- For `99-left4me.conf`, assert every sysctl line listed in *Host sysctls*.
- For `left4me-script-sandbox`, assert the strings `--slice=l4d2-build.slice`
  and `OOMScoreAdjust=500` both appear.
- Assert the deploy script invokes `sysctl --system` (or
  `sysctl -p /etc/sysctl.d/99-left4me.conf`) at least once after copying the
  conf into place.
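
A minimal sketch of how these assertions might be factored is shown below. It is not the existing test code: the helper name `missing_lines` and the inline sample text are assumptions, and only a subset of the directives is listed to keep the example short.

```python
# A few of the directives from "Per-instance unit additions".
SERVER_DIRECTIVES = [
    "Slice=l4d2-game.slice",
    "Nice=-5",
    "OOMScoreAdjust=-200",
    "MemoryMax=2G",
    "KillSignal=SIGINT",
]


def missing_lines(text, required):
    """Return the required lines that do not appear as a whole line in text."""
    present = {line.strip() for line in text.splitlines()}
    return [line for line in required if line not in present]


# Stand-in for the contents of l4d2-game.slice.
slice_text = "[Slice]\nCPUWeight=1000\nIOWeight=1000\n"

# The plain substring pattern also matches a longer value with the same prefix...
assert "CPUWeight=10" in slice_text
# ...whereas the whole-line check reports it as missing.
assert missing_lines(slice_text, ["CPUWeight=10"]) == ["CPUWeight=10"]
assert missing_lines(slice_text, ["CPUWeight=1000", "IOWeight=1000"]) == []
```

The substring pattern is adequate for most lines in this spec, but because `CPUWeight=10` is a prefix of `CPUWeight=1000`, the build-slice assertions are the one place where a whole-line comparison may be worth the extra helper.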

No runtime perf tests in v1 — the spec ships defaults, not measured wins.
Real-world measurement is left to operators with concrete instance counts,
hardware, and player loads.

## Rollout

Single deploy. Running game servers will not pick up the new directives until each instance is restarted (systemd does not reapply unit changes to already-running services). The web UI's "stop" + "start" cycle is sufficient. Document this in `deploy/README.md`.

## Open questions

None blocking. v2 candidates if measurement justifies them:

- Per-instance `CPUAffinity` driven by a deploy-env knob (`LEFT4ME_INSTANCE_CPUS`).
- Job-worker awareness of "server has active players" to defer builds further than weights alone.
- Optional `left4me-host-perf.service` oneshot that sets governor + NIC tuning under a single env-flag opt-in.

## References

- systemd.exec(5) — `Nice=`, `IOSchedulingClass=`, `IOSchedulingPriority=`, `OOMScoreAdjust=`, `LimitNOFILE=`, `LogRateLimitIntervalSec=`.
- systemd.resource-control(5) — slice semantics, `Slice=`, `CPUWeight=`, `IOWeight=`, `MemoryHigh=`, `MemoryMax=`, `TasksMax=`, weight competition rules.
- systemd.kill(5) — signal handling and `KillSignal=`.
- systemd.service(5) — `TimeoutStopSec=`.
- Red Hat Enterprise Linux Network Performance Tuning Guide — `rmem_max`/`wmem_max`/`netdev_max_backlog`/`netdev_budget`.
- LWN "SCHED_FIFO and realtime throttling"; RHEL Real-Time CPU throttling docs — rationale for not shipping RT by default.
- Linux Foundation real-time wiki — `sched_rt_runtime_us` semantics.
- forums.srcds.com / AlliedModders / linuxquestions.org threads — confirmation that srcds is single-threaded per instance.
- Phoronix governor comparisons — performance vs schedutil for sustained workloads.
- Multiple latency-tuning guides — `vm.swappiness=10` consensus.