From 0cc92f2c1748e0e4f041fa4b100ff2889a39215f Mon Sep 17 00:00:00 2001
From: mwiegand
Date: Sun, 10 May 2026 00:05:44 +0200
Subject: [PATCH] =?UTF-8?q?docs(specs):=20l4d2=20network=20shaping=20&=20m?=
 =?UTF-8?q?arking=20=E2=80=94=20design?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CAKE egress shaping (test-deploy oneshot + systemd-networkd [CAKE] block
on prod), nftables uid-based DSCP-EF + skb-priority marking for srcds
UDP, plus rounding sysctls (udp_rmem_min/wmem_min,
default_qdisc=fq_codel, tcp_congestion_control=bbr). Hardware-specific
knobs stay documented escape hatches matching the perf-baseline
boundary.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../2026-05-10-l4d2-network-shaping-design.md | 487 ++++++++++++++++++
 1 file changed, 487 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-05-10-l4d2-network-shaping-design.md

diff --git a/docs/superpowers/specs/2026-05-10-l4d2-network-shaping-design.md b/docs/superpowers/specs/2026-05-10-l4d2-network-shaping-design.md
new file mode 100644
index 0000000..04144ba
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-10-l4d2-network-shaping-design.md
@@ -0,0 +1,487 @@
+# l4d2 network shaping & marking — design
+
+Date: 2026-05-10
+Status: design
+
+## Summary
+
+Add a network-side player-experience baseline alongside the existing host
+perf baseline. Three concerns ship together:
+
+1. **Mark srcds outbound packets** with DSCP `EF` and skb priority `6:0` so
+   any qdisc — host CAKE, ISP gear that honours DSCP, future systems —
+   recognises L4D2 game traffic as latency-sensitive. Marking happens by uid
+   match on the `left4me` user.
+2. **Round out the UDP-socket sysctl baseline** (`udp_rmem_min`,
+   `udp_wmem_min`), set the default qdisc explicitly to `fq_codel`, and
+   switch TCP to `bbr` so coexisting TCP egress (admin, backups, web app,
+   apt) cannot bufferbloat the link the players share.
+3.
**Shape egress with CAKE.** On the test deploy, install a systemd oneshot
+   that applies `tc qdisc replace … cake …` from an operator-edited env
+   file. On production hosts running `systemd-networkd`, document the
+   equivalent `[CAKE]` section in the matching `.network` file as the
+   long-term path.
+
+The intent is "all reasonable measures that do not depend on host-specific
+hardware." Hardware-specific tuning (NIC ring buffers, IRQ pinning, CPU
+governor, real-time scheduling, CPU affinity) remains a documented escape
+hatch — same boundary the existing perf-baseline spec drew. The pieces
+that *are* universally safe ship as defaults.
+
+## Goals
+
+- Game-server UDP packets carry an unambiguous priority signal in DSCP and
+  in `skb->priority`, set on the host before any qdisc inspects them.
+- A coexisting bulk TCP flow on the same host (backup upload, package
+  fetch, web-app response) cannot push the bottleneck queue ahead of game
+  UDP under saturation.
+- An operator who declares uplink bandwidth gets fair-queueing egress
+  shaping with diffserv-aware tin selection — i.e. EF-marked srcds traffic
+  drops into the highest-priority CAKE tin, per-destination-host fairness
+  keeps every connected player on equal footing.
+- A production deployment using `systemd-networkd` has a one-block
+  configuration recipe, no helper script needed.
+- Operators have a documented set of additional knobs (ingress shaping via
+  IFB, `busy_poll`, GRO toggling) for cases the default baseline does not
+  cover. None of these auto-apply.
+
+## Non-goals
+
+- NIC ring-buffer / IRQ pinning / RPS / RFS / hardware timestamping —
+  already declared host-specific in the perf-baseline spec; not
+  re-litigated here.
+- `busy_poll` / `busy_read` as defaults — non-trivial CPU cost; documented
+  as opt-in.
+- Ingress shaping via IFB as a default — only matters if egress CAKE turns
+  out load-bearing and ingress is also saturated; documented as opt-in.
+- Real-time scheduling, governor changes — already declined by the
+  perf-baseline spec.
+- Blueprint-side game settings (`sv_minrate`, `sv_maxrate`, tickrate,
+  `fps_max`) — owned by the server maintainer.
+- Auto-detection or measurement of uplink bandwidth. CAKE only shapes
+  correctly when its declared bandwidth sits below the real bottleneck;
+  the operator must measure once and configure.
+- Iface-flap watchdog. `tc qdisc replace` is idempotent; on prod,
+  `systemd-networkd` reapplies CAKE across iface lifecycle events. On
+  test, `systemctl restart left4me-cake.service` is the documented
+  recovery.
+
+## Background
+
+Current state (commit `62d6d4c` or thereabouts):
+
+- The perf-baseline spec ships `/etc/sysctl.d/99-left4me.conf` with
+  `rmem_max`, `wmem_max`, `rmem_default`, `wmem_default`,
+  `netdev_max_backlog`, `netdev_budget`, `vm.swappiness`. No per-socket
+  UDP minimums, no default-qdisc directive, no TCP congestion-control
+  setting.
+- `srcds_run` runs as system user `left4me`. srcds itself does not set
+  `IP_TOS` or `SO_PRIORITY`, so its UDP packets leave the host with
+  DSCP 0 and priority 0 — indistinguishable from any other UDP traffic to
+  any qdisc.
+- The deploy ships nftables-relevant infrastructure only via package
+  defaults (Debian Trixie ships `nftables` in base, but no `left4me`
+  table is created).
+- No qdisc is explicitly configured. The kernel's per-iface default
+  applies — `fq_codel` on Trixie, but only because Debian's default has
+  been `fq_codel` since Buster.
+- The deploy script already copies sysctl drop-ins and runs
+  `sysctl --system` (`deploy/deploy-test-server.sh:196`).
+
+## Design
+
+### Sysctl additions to `99-left4me.conf`
+
+Append to `deploy/files/etc/sysctl.d/99-left4me.conf`:
+
+```
+# Per-socket UDP buffer floors: protect game-server sockets that don't bump
+# their own SO_RCVBUF/SO_SNDBUF when softirq drains lag briefly.
+net.ipv4.udp_rmem_min = 16384
+net.ipv4.udp_wmem_min = 16384
+
+# Default qdisc for ifaces we don't explicitly shape with CAKE. Debian
+# Trixie already defaults to fq_codel; setting it explicitly is
+# belt-and-suspenders and survives kernel-default churn.
+net.core.default_qdisc = fq_codel
+
+# TCP congestion control: BBR for any bulk TCP egress on the host (admin
+# SSH, backups, package fetches, web-app responses) so a long flow does
+# not push the bottleneck queue ahead of game UDP. UDP srcds is
+# unaffected.
+net.ipv4.tcp_congestion_control = bbr
+```
+
+The deploy already runs `sysctl --system` after copying the conf
+(`deploy/deploy-test-server.sh:198`); no script change required for this
+block.
+
+### nftables packet marking
+
+New file `deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft`:
+
+```nft
+table inet left4me_mark {
+    chain mangle_output {
+        type filter hook output priority mangle; policy accept;
+        meta skuid "left4me" meta l4proto udp ip dscp set ef meta priority set 0006:0000
+        meta skuid "left4me" meta l4proto udp ip6 dscp set ef meta priority set 0006:0000
+    }
+}
+```
+
+Per-element rationale:
+
+- `meta skuid "left4me"` — every srcds instance runs as that user.
+  Combined with the UDP match, nothing else on the host matches: no
+  false positives against the web app (runs as `left4me` too but speaks
+  TCP) or the build sandbox (different uid).
+- `meta l4proto udp` — bypass anything not UDP, including the future
+  RCON/HTTP TCP traffic from the web app.
+- `ip dscp set ef` / `ip6 dscp set ef` — DSCP `EF` (Expedited Forwarding,
+  decimal 46) is the standard low-latency marking. CAKE's `diffserv4`
+  preset routes EF into its highest-priority "Voice" tin. Two rules,
+  one per L3 family, because in an `inet` table the `ip` matcher only
+  fires on v4 and `ip6` only on v6.
+- `meta priority set 0006:0000` — sets `skb->priority` to class `6:0`.
+  Read by qdiscs that classify on skb priority ahead of any DSCP lookup;
+  CAKE does so only when the major number matches its own qdisc handle.
Set inline with the DSCP rule so a single
+  rule-match runs both statements.
+
+The table is named `left4me_mark` and lives in its own `inet` namespace.
+It does not touch, depend on, or conflict with any nftables config the
+operator may run independently. `nft -f` loads the file; `nft delete
+table inet left4me_mark` cleanly removes it.
+
+New unit `deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service`:
+
+```ini
+[Unit]
+Description=left4me nftables packet marking (DSCP EF + priority for srcds)
+After=network-pre.target
+Before=network.target
+Wants=network-pre.target
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+ExecStart=/usr/sbin/nft -f /usr/local/lib/left4me/nft/left4me-mark.nft
+ExecStop=/usr/sbin/nft delete table inet left4me_mark
+
+[Install]
+WantedBy=multi-user.target
+```
+
+`After=network-pre.target` / `Before=network.target` loads the rules
+before `network.target` is reached, so the very first packet srcds
+emits post-boot is already marked.
+
+Deploy script changes:
+
+- Ensure `nftables` is installed (`apt-get install -y nftables`;
+  idempotent — package is in Trixie base).
+- Create `/usr/local/lib/left4me/nft/` and copy `left4me-mark.nft` into
+  it.
+- Copy the unit, `daemon-reload`, `systemctl enable --now
+  left4me-nft-mark.service`.
+
+### CAKE egress shaper — test deploy mechanism
+
+Three files plus deploy-script changes. All operator-tunable knobs go in
+the env file; the helper and unit are static.
+
+**`deploy/files/etc/left4me/cake.env`** (template; deploy installs only
+if absent so operator edits survive re-runs):
+
+```
+# Uplink bandwidth in Mbit/s. Set to ~95% of the smaller of measured
+# upload and measured download. CAKE only shapes correctly when its
+# declared bandwidth sits below the real bottleneck. If unset, the
+# left4me-cake.service unit logs a warning and exits 0 (no shaping).
+LEFT4ME_UPLINK_MBIT=
+
+# Egress interface. If unset, auto-detected from the IPv4 default route.
+LEFT4ME_UPLINK_IFACE=
+```
+
+**`deploy/files/usr/local/libexec/left4me/left4me-apply-cake`** (mode
+`0755`, owner `root:root`). The helper takes a single argument — `apply`
+or `clear` — so the unit's `ExecStart` and `ExecStop` both call the same
+script and the unit file stays free of shell escaping:
+
+```sh
+#!/bin/sh
+set -eu
+
+mode=${1:-apply}
+
+if [ -r /etc/left4me/cake.env ]; then
+    . /etc/left4me/cake.env
+fi
+
+resolve_iface() {
+    if [ -n "${LEFT4ME_UPLINK_IFACE:-}" ]; then
+        printf '%s' "$LEFT4ME_UPLINK_IFACE"
+        return
+    fi
+    ip -4 route show default | awk '/default/ {print $5; exit}'
+}
+
+case "$mode" in
+    apply)
+        if [ -z "${LEFT4ME_UPLINK_MBIT:-}" ]; then
+            echo "left4me-cake: LEFT4ME_UPLINK_MBIT unset; skipping shaper" >&2
+            exit 0
+        fi
+        iface=$(resolve_iface)
+        if [ -z "$iface" ]; then
+            echo "left4me-cake: cannot determine egress iface; skipping" >&2
+            exit 0
+        fi
+        exec tc qdisc replace dev "$iface" root cake \
+            bandwidth "${LEFT4ME_UPLINK_MBIT}mbit" \
+            internet diffserv4 dual-dsthost
+        ;;
+    clear)
+        iface=$(resolve_iface)
+        if [ -z "$iface" ]; then
+            exit 0
+        fi
+        tc qdisc del dev "$iface" root 2>/dev/null || true
+        ;;
+    *)
+        echo "usage: $0 [apply|clear]" >&2
+        exit 2
+        ;;
+esac
+```
+
+`tc qdisc replace` is idempotent: replaces an existing root qdisc on the
+iface, adds one if absent. Re-running the unit any time is safe. `clear`
+swallows the "no such qdisc" error so stop is also idempotent.
+
+Fail-soft on missing config matches the perf-baseline philosophy — the
+deploy does not refuse to boot servers because the operator has not yet
+filled in `LEFT4ME_UPLINK_MBIT`. The journal warning surfaces the gap.
+
+**`deploy/files/usr/local/lib/systemd/system/left4me-cake.service`**:
+
+```ini
+[Unit]
+Description=left4me CAKE egress shaper
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=oneshot
+RemainAfterExit=yes
+EnvironmentFile=-/etc/left4me/cake.env
+ExecStart=/usr/local/libexec/left4me/left4me-apply-cake apply
+ExecStop=/usr/local/libexec/left4me/left4me-apply-cake clear
+
+[Install]
+WantedBy=multi-user.target
+```
+
+Per-flag rationale for the `cake` invocation:
+
+- `bandwidth ${LEFT4ME_UPLINK_MBIT}mbit` — operator-declared, ≈95% of
+  measured uplink. CAKE only shapes if its declared bandwidth is below
+  the real bottleneck; setting it slightly low moves the queue into a
+  place the host controls.
+- `internet` — RTT preset (100 ms, also CAKE's default) that tunes the
+  AQM for typical internet paths. It does not account for link-layer
+  overhead; `tc-cake(8)` lists separate keywords for that.
+- `diffserv4` — four-tier DSCP-aware tin selection. Reads the EF marks
+  set by the nftables rule and routes srcds packets into the
+  highest-priority "Voice" tin. Under `besteffort`, the marks would be
+  ignored.
+- `dual-dsthost` — egress fairness keyed on destination host. With ≥2
+  players connected, each player gets fair share regardless of how
+  chatty the server is to any single client.
+
+Iface-flap behaviour: the kernel keeps the qdisc on an iface across
+link-down/link-up while the iface itself exists. If the iface is
+recreated (e.g., NetworkManager reconfiguration), `systemctl restart
+left4me-cake.service` reapplies. Documented; no auto-watchdog in v1.
+
+Deploy script changes (in `deploy/deploy-test-server.sh`):
+
+- Copy `cake.env` to `/etc/left4me/cake.env` only if absent (do not
+  clobber operator edits).
+- Copy `left4me-apply-cake` to `/usr/local/libexec/left4me/`, mode
+  `0755`, owner `root:root`.
+- Copy `left4me-cake.service` to `/usr/local/lib/systemd/system/`.
+- `systemctl daemon-reload` (already done in the existing flow).
+- `systemctl enable --now left4me-cake.service`.
+
+### CAKE egress shaper — production deployment (systemd-networkd)
+
+On hosts running `systemd-networkd`, the CAKE configuration belongs in
+the matching `.network` file. systemd-networkd reapplies it across iface
+lifecycle events, addressing the only fragility of the test-deploy
+oneshot.
+
+Document in `deploy/README.md` Performance section:
+
+```ini
+# /etc/systemd/network/.network
+[CAKE]
+Bandwidth=480M
+RTTSec=100ms
+PriorityQueueingPreset=diffserv4
+FlowIsolationMode=dual-dst-host
+```
+
+Directive names follow `systemd.network(5)`. Values mirror the test
+deploy's `tc` invocation:
+
+- `Bandwidth=480M` — placeholder; operator sets to ≈95% of measured
+  uplink in their actual `.network`.
+- `RTTSec=100ms` — equivalent of the `internet` keyword.
+- `PriorityQueueingPreset=diffserv4` — equivalent of `diffserv4`.
+- `FlowIsolationMode=dual-dst-host` — equivalent of `dual-dsthost`.
+
+The nftables marking from the previous section ships unchanged on prod;
+it is qdisc-installer-agnostic.
+
+The test-deploy oneshot must NOT be installed on a host running
+`systemd-networkd`. v1 does not enforce that gate — production hosts
+do not run the test-deploy script. If the boundary blurs in the future,
+add a check in `left4me-apply-cake` for `systemctl is-active
+systemd-networkd` and skip cleanly.
+
+### Documented escape hatches
+
+Append to `deploy/README.md` Performance section, alongside the existing
+governor / CPU-affinity / NIC entries:
+
+- **Ingress shaping via IFB.** Egress CAKE alone does not protect srcds
+  receive against ingress saturation (large workshop downloads, package
+  fetches arriving at line rate). One-liner template using `modprobe
+  ifb`, `ip link set ifb0 up`, `tc qdisc add dev ifb0 root cake bandwidth
+  Xmbit ingress diffserv4 dual-srchost`, and a `tc filter` redirect from
+  the uplink iface.
Worth flipping only when measurement shows ingress
+  hurting receive; in v1 we have no such measurement, so it stays
+  documented.
+- **`net.core.busy_poll = 50` / `net.core.busy_read = 50`.** Reduces UDP
+  receive median latency by polling for incoming packets briefly at
+  syscall boundaries. Cost: measurable CPU per syscall under load. Worth
+  flipping if a host is dedicated to game serving and CPU headroom is
+  plentiful.
+- **`ethtool -K <iface> gro off`.** Some Source-engine ops disable
+  generic receive offload to avoid receive-side coalescing latency.
+  Hardware/driver dependent. Document, do not ship.
+
+These three entries follow the existing escape-hatch style: a one-liner
+or short config block, plus one sentence on when it matters.
+
+### Files changed / added
+
+```
+deploy/files/etc/sysctl.d/99-left4me.conf                           (modified — block added)
+deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft             (new)
+deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service  (new)
+deploy/files/etc/left4me/cake.env                                   (new — template, deploy preserves operator edits)
+deploy/files/usr/local/libexec/left4me/left4me-apply-cake           (new)
+deploy/files/usr/local/lib/systemd/system/left4me-cake.service      (new)
+deploy/deploy-test-server.sh                                        (modified — install+enable nft and cake units, conditional copy of cake.env)
+deploy/README.md                                                    (modified — Network shaping subsection + 3 new escape hatches)
+deploy/tests/test_deploy_artifacts.py                               (modified — assertions for all artifacts above)
+```
+
+## Tests
+
+Following the existing `assert "key=value" in text` pattern in
+`deploy/tests/test_deploy_artifacts.py`:
+
+**Sysctl block** (extension of the existing perf-baseline assertions):
+
+- Each of `net.ipv4.udp_rmem_min = 16384`, `net.ipv4.udp_wmem_min =
+  16384`, `net.core.default_qdisc = fq_codel`,
+  `net.ipv4.tcp_congestion_control = bbr` is asserted as a separate line.
+
+**nftables marking artifacts:**
+
+- `left4me-mark.nft` ships with `table inet left4me_mark`, `chain
+  mangle_output`, `meta skuid "left4me"`, `ip dscp set ef`, `ip6 dscp
+  set ef`, and `meta priority set 0006:0000` each asserted as separate
+  substring matches. (DSCP and priority statements appear inline on
+  the same rule per L3 family; substring assertions don't depend on
+  rule layout.)
+- `left4me-nft-mark.service` has `ExecStart=/usr/sbin/nft -f
+  /usr/local/lib/left4me/nft/left4me-mark.nft`, `ExecStop=/usr/sbin/nft
+  delete table inet left4me_mark`, `Type=oneshot`,
+  `RemainAfterExit=yes`, `WantedBy=multi-user.target`.
+- `deploy-test-server.sh` invokes `systemctl enable --now
+  left4me-nft-mark.service` (or equivalent at-deploy enabling step).
+
+**CAKE artifacts:**
+
+- `cake.env` template contains the literal lines `LEFT4ME_UPLINK_MBIT=`
+  and `LEFT4ME_UPLINK_IFACE=` (commented or uncommented; matched as
+  substring).
+- `left4me-apply-cake` contains the literals `tc qdisc replace`, `cake`,
+  `bandwidth`, `internet`, `diffserv4`, `dual-dsthost`,
+  `LEFT4ME_UPLINK_MBIT`, `LEFT4ME_UPLINK_IFACE`.
+- `left4me-apply-cake` is mode `0755` after deploy (asserted via the
+  same mechanism the existing helper-script tests use).
+- `left4me-cake.service` contains
+  `EnvironmentFile=-/etc/left4me/cake.env`,
+  `ExecStart=/usr/local/libexec/left4me/left4me-apply-cake apply`,
+  `ExecStop=/usr/local/libexec/left4me/left4me-apply-cake clear`,
+  `Wants=network-online.target`, `Type=oneshot`,
+  `WantedBy=multi-user.target`.
+- `deploy-test-server.sh` invokes `systemctl enable --now
+  left4me-cake.service`.
+- `deploy-test-server.sh` copies `cake.env` only when target absent
+  (asserted by literal substring of the guarding `[ -e
+  /etc/left4me/cake.env ]` test or equivalent).
+
+No runtime networking tests in v1. The artifacts are static; their
+runtime behaviour requires a real iface and a real bandwidth load,
+which the operator measures.
+
+## Rollout
+
+Single deploy.
After the new sysctl block lands, `sysctl --system`
+applies it immediately (already in the deploy flow). The two new
+systemd units start on `systemctl enable --now`; CAKE without a
+configured `LEFT4ME_UPLINK_MBIT` logs a warning and no-ops, which is
+the expected fresh-deploy state. The operator measures their uplink,
+edits `/etc/left4me/cake.env`, and runs `systemctl restart
+left4me-cake.service`.
+
+Already-running game servers are unaffected by the network changes
+themselves. The marking applies on every emitted packet from the moment
+the nft rule loads; future-emitted packets pick up DSCP+priority without
+restarting any srcds instance.
+
+## Open questions
+
+None blocking. v2 candidates if measurement justifies them:
+
+- A `LEFT4ME_INGRESS_MBIT` knob that flips on the IFB ingress shaper as
+  a default, conditional on the env value being set.
+- A `left4me-net-doctor` helper that reports current qdisc, applied
+  marks, and a one-shot saturation+ping measurement against a local
+  endpoint.
+- A small Python wrapper in `l4d2host` that reads `cake.env` for
+  display in the web UI, so the operator sees in one place whether
+  shaping is active.
+
+## References
+
+- `tc-cake(8)` — keyword semantics: `bandwidth`, `internet`,
+  `diffserv4`, `dual-dsthost`, tin priority mapping.
+- `systemd.network(5)` — `[CAKE]` section directives:
+  `Bandwidth=`, `RTTSec=`, `PriorityQueueingPreset=`,
+  `FlowIsolationMode=`.
+- `nft(8)` — `meta skuid`, `meta priority`, `ip dscp set`, table
+  isolation semantics.
+- RFC 3246 — Expedited Forwarding (EF) PHB.
+- Linux kernel `Documentation/networking/tcp_bbr.txt` — BBR pairs with
+  `fq` / `fq_codel` for correct pacing.
+- `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`
+  — sibling spec; this spec extends `99-left4me.conf` and reuses the
+  same deploy-test-artifact pattern.
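
Reviewer sketch (not part of the patch): the Tests section describes asserting each new sysctl setting as its own line and each nftables statement as a substring. A minimal Python illustration of that pattern follows. The names `SYSCTL_TEXT`, `NFT_TEXT`, `missing_lines` and `missing_substrings` are hypothetical, and the inline strings are stand-ins for the deployed files; the real `deploy/tests/test_deploy_artifacts.py` may be organised differently.

```python
# Hypothetical stand-ins for the shipped artifacts. A real test would read
# deploy/files/etc/sysctl.d/99-left4me.conf and left4me-mark.nft from disk.
SYSCTL_TEXT = """\
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
net.core.default_qdisc = fq_codel
net.ipv4.tcp_congestion_control = bbr
"""

NFT_TEXT = """\
table inet left4me_mark {
    chain mangle_output {
        type filter hook output priority mangle; policy accept;
        meta skuid "left4me" meta l4proto udp ip dscp set ef meta priority set 0006:0000
        meta skuid "left4me" meta l4proto udp ip6 dscp set ef meta priority set 0006:0000
    }
}
"""

# Settings the spec says must each appear as a separate line.
REQUIRED_SYSCTL_LINES = (
    "net.ipv4.udp_rmem_min = 16384",
    "net.ipv4.udp_wmem_min = 16384",
    "net.core.default_qdisc = fq_codel",
    "net.ipv4.tcp_congestion_control = bbr",
)

# Fragments the spec says are matched as substrings, independent of layout.
REQUIRED_NFT_SUBSTRINGS = (
    "table inet left4me_mark",
    "chain mangle_output",
    'meta skuid "left4me"',
    "ip dscp set ef",
    "ip6 dscp set ef",
    "meta priority set 0006:0000",
)


def missing_lines(text, required):
    """Return the required entries that are not present as a whole line."""
    present = {line.strip() for line in text.splitlines()}
    return [entry for entry in required if entry not in present]


def missing_substrings(text, required):
    """Return the required fragments that do not occur anywhere in text."""
    return [fragment for fragment in required if fragment not in text]


def test_sysctl_block():
    assert missing_lines(SYSCTL_TEXT, REQUIRED_SYSCTL_LINES) == []


def test_nft_marking_file():
    assert missing_substrings(NFT_TEXT, REQUIRED_NFT_SUBSTRINGS) == []
```

Returning the list of missing entries, rather than asserting one entry at a time, is a choice made for this sketch: a failing run then names every absent setting at once.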