left4me/docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
mwiegand b6574e308b
docs(specs): perf baseline — fix transient-service phrasing
The existing left4me-script-sandbox helper uses systemd-run in
transient service mode (--unit=, no --scope). Spec wrongly said
'--scope'. No semantic change — the design's --slice= and
-p OOMScoreAdjust= guidance is identical for service vs scope mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 09:39:12 +02:00


l4d2 server host perf baseline — design

Date: 2026-05-09
Status: design

Summary

Apply a host-side performance and resource-isolation baseline to every L4D2 server instance, using systemd unit directives, a slice hierarchy, and host sysctls. The blueprint-level game configuration (tickrate, sv_minrate/maxrate, fps_max, plugins) stays the responsibility of the individual server maintainer and is out of scope.

Goals

  • Game-server processes get measurable scheduling, I/O, and OOM priority over the script-build sandbox and over interactive system traffic.
  • One misbehaving server cannot OOM-kill its siblings or the host.
  • The kernel's UDP path is sized for sustained Source-engine traffic instead of distro defaults.
  • Operators have documented escape hatches for host-specific tuning (CPU pinning, governor, NIC IRQs, real-time scheduling) without any of it being imposed by default.

Non-goals

  • ConVars, blueprint arguments, plugins, tickrate, rate values — owned by the maintainer of each server.
  • Real-time (SCHED_FIFO/SCHED_RR) scheduling for game servers. Documented as opt-in only; see Out-of-scope rationale.
  • CPU governor changes. Documented opt-in only.
  • Per-instance CPUAffinity. Host-specific; documented only.
  • NIC ring-buffer / IRQ-pinning changes. Hardware-specific; documented only.
  • Job-scheduler awareness ("don't build a script overlay while server X has players"). Cgroup weights cover this in v1; revisit if real-world data disagrees.
  • Hardening tightening (ProtectKernelTunables=yes, etc.). Security-focused, separate spec.

Background

Current state (commit 965b67e):

  • deploy/files/usr/local/lib/systemd/system/left4me-server@.service runs srcds_run as user left4me with security hardening (NoNewPrivileges, PrivateTmp, PrivateDevices, ProtectHome, ProtectSystem=strict, ReadOnlyPaths, ReadWritePaths, RestrictSUIDSGID, LockPersonality) but no scheduling, memory, OOM, kill-signal, or log-rate directives.
  • deploy/files/usr/local/libexec/left4me/left4me-script-sandbox runs script-overlay builds via systemd-run in transient service mode (--unit=, no --scope) with CPUQuota=200% and RuntimeMaxSec=3600, but in the default slice, so it competes against game servers as an equal sibling under system.slice.
  • No host sysctls are deployed. Linux defaults (rmem_max/wmem_max in the 128 to 208 KB range depending on kernel version, netdev_max_backlog=1000) are below what sustained UDP gameplay across multiple instances expects.

srcds is single-threaded per instance, so multi-instance hosts contend over CPU cycles, kernel softirq budget, and journald rate limits.

Design

Slice topology

Flat top-level slices, siblings of system.slice and user.slice:

-.slice
├── system.slice         (default CPUWeight=100, IOWeight=100)
├── user.slice           (default CPUWeight=100, IOWeight=100)
├── l4d2-game.slice      (CPUWeight=1000, IOWeight=1000)
└── l4d2-build.slice     (CPUWeight=10,   IOWeight=10)

Rationale:

  • 100:1 weight ratio between game and build means: when both slices are runnable and the CPU is contended, the build sandbox is squeezed to roughly 1% of the time the two share; when uncontended, the build still gets the full box modulo its own CPUQuota=200%.
  • Flat (not nested under system.slice) so a logged-in admin running a heavy task in user.slice cannot steal cycles from a live match.
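
To make the weight ratio concrete, here is a minimal sketch of how proportional cgroup weights turn into CPU shares. It assumes every listed slice is runnable and saturating the CPU at the same time; idle slices drop out of the denominator. The function name contended_shares is illustrative and not part of the deploy tree.

```python
def contended_shares(weights):
    """Return each slice's fraction of CPU time when all listed slices are busy.

    Proportional weights split CPU among active siblings, so a slice's
    share is its weight divided by the sum of the active weights.
    """
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Weights from the slice topology in this section, all four slices busy.
all_busy = contended_shares({
    "system.slice": 100,
    "user.slice": 100,
    "l4d2-game.slice": 1000,
    "l4d2-build.slice": 10,
})
for name, share in all_busy.items():
    print(f"{name:18s} {share:6.1%}")

# Only game and build are busy: the 100:1 ratio applies directly.
game_vs_build = contended_shares({
    "l4d2-game.slice": 1000,
    "l4d2-build.slice": 10,
})
```

With all four slices busy the game slice holds 1000/1210 of the CPU; with only game and build busy the build slice is left with 10/1010, just under 1%.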

Per-instance unit additions (left4me-server@.service)

Add to [Service]:

Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0

Per-directive justification:

  • Slice=l4d2-game.slice — places the instance in the high-weight slice.
  • Nice=-5 — modest CFS priority bump. Negative Nice set by systemd does not require CAP_SYS_NICE because systemd applies the value before dropping to the unit user. SCHED_FIFO is intentionally rejected; see Out-of-scope rationale.
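  • IOSchedulingClass=best-effort + IOSchedulingPriority=4 — pins the class and priority explicitly. Priority 4 is already the default for the best-effort class (range 0 to 7, lower is higher priority), so this is not a bump; it only makes the setting explicit. Deterministic and harmless.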
  • OOMScoreAdjust=-200 — game servers survive memory pressure; sandbox dies first (see sandbox section).
  • MemoryHigh=1.5G, MemoryMax=2G — soft + hard ceiling. Typical L4D2 srcds runs ~500-800 MB; map-load spikes fit in headroom; a runaway is bounded.
  • TasksMax=256 — bounds thread count well above srcds' steady-state usage; prevents fork-bomb style failures from leaking host-wide.
  • LimitNOFILE=65536 — Valve wiki recommendation; cheap and matches multi-plugin setups.
  • KillSignal=SIGINT — srcds responds to SIGINT for clean shutdown (writes demos, flushes logs); SIGTERM is harsher.
  • TimeoutStopSec=15s — gives srcds time to finish flush before SIGKILL.
  • LogRateLimitIntervalSec=0 — disables journald per-unit rate limiting (default 10000 msgs/30s). srcds + plugins exceed this on busy maps; dropped messages break diagnostics.

Existing security directives are kept verbatim.

Slice unit files

New file deploy/files/usr/local/lib/systemd/system/l4d2-game.slice:

[Unit]
Description=left4me game-server slice
Before=slices.target

[Slice]
CPUWeight=1000
IOWeight=1000

New file deploy/files/usr/local/lib/systemd/system/l4d2-build.slice:

[Unit]
Description=left4me script-sandbox build slice
Before=slices.target

[Slice]
CPUWeight=10
IOWeight=10

Sandbox slice + OOM placement

Edit deploy/files/usr/local/libexec/left4me/left4me-script-sandbox to add to the systemd-run invocation (transient service mode — the existing helper uses --unit= without --scope):

  • --slice=l4d2-build.slice
  • -p OOMScoreAdjust=500

Existing CPUQuota=200% and RuntimeMaxSec=3600 stay. Cgroup weight (slice) and CPU quota (per-unit) compose: weight handles contention, quota handles the absolute ceiling.
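
As a rough illustration of the resulting invocation, the sketch below assembles a systemd-run argument list with the two additions next to the existing properties. The unit name and the build command are hypothetical, and the real helper may pass further options that are not shown here.

```python
import shlex


def sandbox_argv(unit_name, command):
    """Build a systemd-run argv for a transient service (no --scope)."""
    return [
        "systemd-run",
        f"--unit={unit_name}",        # transient service mode
        "--slice=l4d2-build.slice",   # new: low-weight build slice
        "-p", "OOMScoreAdjust=500",   # new: sandbox is the preferred OOM victim
        "-p", "CPUQuota=200%",        # existing absolute CPU ceiling
        "-p", "RuntimeMaxSec=3600",   # existing wall-clock limit
        "--",
        *command,
    ]


# Hypothetical unit name and command, for illustration only.
argv = sandbox_argv("left4me-build-example", ["/bin/sh", "build.sh"])
print(shlex.join(argv))
```

Keeping the slice and the OOM score as separate arguments matches the test strategy of asserting both strings appear in the helper.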

Host sysctls

New file deploy/files/etc/sysctl.d/99-left4me.conf:

net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
vm.swappiness = 10

Per-value justification:

  • rmem_max/wmem_max = 8 MB — the Linux default (128 to 208 KB depending on kernel version) is a known bottleneck for sustained UDP. 8 MB is the standard 1 Gbit recommendation (Red Hat performance guide); enough headroom for ~10 instances on a host without going to 16 MB.
  • rmem_default/wmem_default = 512 KB — protects sockets that don't explicitly call setsockopt(SO_RCVBUF/SO_SNDBUF); harmless when they do.
  • netdev_max_backlog = 5000 — default 1000 overflows under multi-instance UDP burst; the per-CPU softnet queue starts dropping packets once full.
  • netdev_budget = 600 — gives softirq more packet-drain headroom per pass; default 300 is undersized for multi-Gbit-class hosts.
  • vm.swappiness = 10 — commonly recommended for latency-sensitive servers; harmless on swapless hosts.
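
An operator who wants to confirm the values actually took effect could compare the file against the live kernel settings. The following is a minimal sketch under the assumption that each key maps to /proc/sys by replacing dots with slashes, which holds for the seven keys listed here. The helper names are illustrative, and the "pretend_live" data is a stand-in so the example runs without touching the host.

```python
from pathlib import PurePosixPath

CONF_TEXT = """\
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
vm.swappiness = 10
"""


def parse_sysctl_conf(text):
    """Parse 'key = value' lines, skipping blanks and # or ; comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def proc_path(key):
    """Map a dotted sysctl key to its /proc/sys path."""
    return str(PurePosixPath("/proc/sys", *key.split(".")))


def read_live(key):
    """Read the current kernel value for a key (Linux only)."""
    with open(proc_path(key)) as handle:
        return handle.read().strip()


def drift(expected, read):
    """Return {key: (wanted, actual)} for every key whose value differs."""
    result = {}
    for key, wanted in expected.items():
        actual = read(key)
        if actual != wanted:
            result[key] = (wanted, actual)
    return result


expected = parse_sysctl_conf(CONF_TEXT)
# Stand-in for a host where one key is still at the default named above.
pretend_live = dict(expected, **{"net.core.netdev_max_backlog": "1000"})
print(drift(expected, pretend_live.get))
```

On a real host, drift(expected, read_live) performs the same comparison against /proc/sys.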

Deploy script integration

deploy/deploy-test-server.sh must:

  1. Copy etc/sysctl.d/99-left4me.conf to /etc/sysctl.d/.
  2. Run sysctl --system (or sysctl -p /etc/sysctl.d/99-left4me.conf) so values take effect immediately, not on next boot.
  3. Copy the two .slice files into /usr/local/lib/systemd/system/.
  4. systemctl daemon-reload after unit/slice changes (already done in current deploy flow).
  5. No explicit systemctl start of the slices is required — they activate on first child reference.
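
The ordering of these steps can be sketched as a dry-run plan. This is an illustration in Python rather than the shell of the actual deploy script; using install -m 0644 for the copies and deploy/files as the source root are assumptions, and the script may copy files differently.

```python
UNIT_DIR = "usr/local/lib/systemd/system"


def host_perf_steps(files_root="deploy/files"):
    """Return the host-perf deploy commands in the order they should run."""
    steps = [
        # 1. Put the sysctl drop-in in place.
        ["install", "-m", "0644",
         f"{files_root}/etc/sysctl.d/99-left4me.conf",
         "/etc/sysctl.d/99-left4me.conf"],
        # 2. Apply it now instead of waiting for the next boot.
        ["sysctl", "--system"],
    ]
    # 3. Copy both slice units next to the service template.
    for slice_name in ("l4d2-game.slice", "l4d2-build.slice"):
        steps.append(["install", "-m", "0644",
                      f"{files_root}/{UNIT_DIR}/{slice_name}",
                      f"/{UNIT_DIR}/{slice_name}"])
    # 4. Make systemd re-read unit files. No start step follows:
    #    the slices activate when the first child unit references them.
    steps.append(["systemctl", "daemon-reload"])
    return steps


for step in host_perf_steps():
    print(" ".join(step))
```

Only the order matters here: sysctl --system runs after the conf file is copied, and daemon-reload runs after the slice files are in place.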

Documented escape hatches (no auto-apply)

Append a "Performance tuning" section to deploy/README.md:

  • CPU governor: cpupower frequency-set -g performance if jitter under load matters more than power. Schedutil is acceptable for sustained UDP workloads. Provide the one-liner; do not ship a oneshot service in v1.
  • CPU affinity per instance: example drop-in at /etc/systemd/system/left4me-server@<name>.service.d/affinity.conf setting CPUAffinity=N. Document the strategy "one instance per core, leave core 0 for system + IRQ".
  • NIC tuning: example ethtool -G <iface> rx 4096 tx 4096, IRQ-pinning hints. Hardware-specific; ops-only.
  • Real-time scheduling opt-in: example drop-in adding CPUSchedulingPolicy=fifo, CPUSchedulingPriority=10, LimitRTPRIO=10. Include a one-paragraph warning citing RT-throttling defaults (sched_rt_runtime_us=950000) and the failure mode if a single instance misbehaves.

These stay pure documentation in v1 — no code paths, no tests asserting them.
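
Purely as an illustration for the README text, and not something v1 ships, the "one instance per core, leave core 0 for system + IRQ" strategy could be sketched as below. The function names, the instance names, and the 4-core host are all hypothetical.

```python
def core_for_instance(instance_index, cpu_count, reserved_cores=1):
    """Map the Nth instance (0-based) to its own core, skipping reserved cores."""
    core = reserved_cores + instance_index
    if core >= cpu_count:
        raise ValueError(
            f"instance {instance_index} needs core {core}, host has {cpu_count}"
        )
    return core


def affinity_dropin(core):
    """Render the drop-in body that pins one instance to one core."""
    return f"[Service]\nCPUAffinity={core}\n"


def dropin_path(instance_name):
    """Path of the per-instance drop-in named in the list above."""
    return (
        "/etc/systemd/system/"
        f"left4me-server@{instance_name}.service.d/affinity.conf"
    )


# Hypothetical instance names on a 4-core host: core 0 stays free.
for index, name in enumerate(["alpha", "bravo", "charlie"]):
    print(dropin_path(name))
    print(affinity_dropin(core_for_instance(index, cpu_count=4)))
```

The ValueError makes the limit of the strategy explicit: a host with N cores can pin at most N-1 instances this way.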

Out-of-scope rationale

  • SCHED_FIFO: a misbehaving srcds at any RT priority can starve kernel threads and produces failure modes that are harder to diagnose than the jitter problem it claims to solve. Nice=-5 plus the slice weights captures the practical benefit. Ops who need RT can opt in via the documented drop-in.
  • CPU governor auto-set: Phoronix and Arch comparisons show schedutil is within noise of performance on sustained workloads like Source UDP; aggressively forcing performance would surprise users on power-managed hosts.
  • CPUAffinity in the unit: the unit template is shared across all instances; a single hard-coded CPUAffinity= would pin every instance to the same cores, defeating the purpose. Per-instance pinning needs deploy-time policy that is outside v1's scope.

Files changed / added

deploy/files/usr/local/lib/systemd/system/left4me-server@.service       (modified)
deploy/files/usr/local/lib/systemd/system/l4d2-game.slice               (new)
deploy/files/usr/local/lib/systemd/system/l4d2-build.slice              (new)
deploy/files/etc/sysctl.d/99-left4me.conf                               (new)
deploy/files/usr/local/libexec/left4me/left4me-script-sandbox           (modified)
deploy/deploy-test-server.sh                                            (modified — sysctl --system step)
deploy/README.md                                                        (modified — performance section)
deploy/tests/test_deploy_artifacts.py                                   (modified — assertions)

Tests

deploy/tests/test_deploy_artifacts.py additions, following the existing assert "key=value" in text pattern:

  • For left4me-server@.service, assert every line listed in Per-instance unit additions is present verbatim. Each is a separate assertion so a failing line is identifiable.
  • For l4d2-game.slice, assert CPUWeight=1000 and IOWeight=1000.
  • For l4d2-build.slice, assert CPUWeight=10 and IOWeight=10.
  • For 99-left4me.conf, assert every sysctl line listed in Host sysctls.
  • For left4me-script-sandbox, assert the strings --slice=l4d2-build.slice and OOMScoreAdjust=500 both appear.
  • Assert the deploy script invokes sysctl --system (or sysctl -p /etc/sysctl.d/99-left4me.conf) at least once after copying the conf into place.
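
One caveat with plain substring checks is worth a sketch: "CPUWeight=10" is a substring of "CPUWeight=1000", so the build-slice assertion would also pass against the game-slice values. The helper below is one possible way to match whole lines instead; it is an assumption about how the assertions could be written, not a description of the existing test file.

```python
def missing_lines(text, expected):
    """Return the expected lines that do not appear as whole lines in text."""
    present = {line.strip() for line in text.splitlines()}
    return [line for line in expected if line not in present]


GAME_SLICE = """\
[Unit]
Description=left4me game-server slice
Before=slices.target

[Slice]
CPUWeight=1000
IOWeight=1000
"""

# Substring pitfall: this passes although the file says 1000, not 10.
assert "CPUWeight=10" in GAME_SLICE

# Whole-line matching tells the two slices apart.
assert missing_lines(GAME_SLICE, ["CPUWeight=1000", "IOWeight=1000"]) == []
assert missing_lines(GAME_SLICE, ["CPUWeight=10", "IOWeight=10"]) == [
    "CPUWeight=10",
    "IOWeight=10",
]
```

Returning the list of missing lines also keeps a failing directive identifiable, in the same spirit as one assertion per line.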

No runtime perf tests in v1 — the spec ships defaults, not measured wins. Real-world measurement is left to operators with concrete instance counts, hardware, and player loads.

Rollout

Single deploy. Running game servers will not pick up the new directives until each instance is restarted (systemd does not reapply unit changes to already-running services). The web UI's "stop" + "start" cycle is sufficient. Document this in deploy/README.md.

Open questions

None blocking. v2 candidates if measurement justifies them:

  • Per-instance CPUAffinity driven by a deploy-env knob (LEFT4ME_INSTANCE_CPUS).
  • Job-worker awareness of "server has active players" to defer builds further than weights alone.
  • Optional left4me-host-perf.service oneshot that sets governor + NIC tuning under a single env-flag opt-in.

References

  • systemd.exec(5) — Nice=, IOSchedulingClass=, IOSchedulingPriority=, OOMScoreAdjust=, LimitNOFILE=, LogRateLimitIntervalSec=.
  • systemd.resource-control(5) — slice semantics, Slice=, CPUWeight=, IOWeight=, MemoryHigh=, MemoryMax=, TasksMax=, weight competition rules.
  • systemd.kill(5) — signal handling and KillSignal=.
  • systemd.service(5) — TimeoutStopSec=.
  • Red Hat Enterprise Linux Network Performance Tuning Guide — rmem_max/wmem_max/netdev_max_backlog/netdev_budget.
  • LWN "SCHED_FIFO and realtime throttling"; RHEL Real-Time CPU throttling docs — rationale for not shipping RT by default.
  • Linux Foundation real-time wiki — sched_rt_runtime_us semantics.
  • forums.srcds.com / AlliedModders / linuxquestions.org threads — confirmation that srcds is single-threaded per instance.
  • Phoronix governor comparisons — performance vs schedutil for sustained workloads.
  • Multiple latency-tuning guides — vm.swappiness=10 consensus.