# l4d2 network shaping & marking: design

Date: 2026-05-10

Status: design

## Summary

Add a network-side player-experience baseline alongside the existing host perf baseline. Three concerns ship together:

1. **Mark srcds outbound packets** with DSCP `EF` and skb priority `6:0` so any qdisc (host CAKE, ISP gear that honours DSCP, future systems) recognises L4D2 game traffic as latency-sensitive. Marking happens by uid match on the `left4me` user.
2. **Round out the UDP-socket sysctl baseline** (`udp_rmem_min`, `udp_wmem_min`), set the default qdisc explicitly to `fq_codel`, and switch TCP to `bbr` so coexisting TCP egress (admin, backups, web app, apt) cannot bufferbloat the link the players share.
3. **Shape egress with CAKE.** On the test deploy, install a systemd oneshot that applies `tc qdisc replace … cake …` from an operator-edited env file. On production hosts running `systemd-networkd`, document the equivalent `[CAKE]` section in the matching `.network` file as the long-term path.

The intent is "all reasonable measures that do not depend on host-specific hardware." Hardware-specific tuning (NIC ring buffers, IRQ pinning, CPU governor, real-time scheduling, CPU affinity) remains a documented escape hatch, the same boundary the existing perf-baseline spec drew. The pieces that *are* universally safe ship as defaults.

## Goals

- Game-server UDP packets carry an unambiguous priority signal in DSCP and in `skb->priority`, set on the host before any qdisc inspects them.
- A coexisting bulk TCP flow on the same host (backup upload, package fetch, web-app response) cannot push the bottleneck queue ahead of game UDP under saturation.
- An operator who declares uplink bandwidth gets fair-queueing egress shaping with diffserv-aware tin selection: EF-marked srcds traffic drops into the highest-priority CAKE tin, and per-destination-host fairness keeps every connected player on equal footing.
- A production deployment using `systemd-networkd` has a one-block configuration recipe, no helper script needed.
- Operators have a documented set of additional knobs (ingress shaping via IFB, `busy_poll`, GRO toggling) for cases the default baseline does not cover. None of these auto-apply.

## Non-goals

- NIC ring-buffer / IRQ pinning / RPS / RFS / hardware timestamping: already declared host-specific in the perf-baseline spec; not re-litigated here.
- `busy_poll` / `busy_read` as defaults: non-trivial CPU cost; documented as opt-in.
- Ingress shaping via IFB as a default: only matters if egress CAKE turns out load-bearing and ingress is also saturated; documented as opt-in.
- Real-time scheduling, governor changes: already declined by the perf-baseline spec.
- Blueprint-side game settings (`sv_minrate`, `sv_maxrate`, tickrate, `fps_max`): owned by the server maintainer.
- Auto-detection or measurement of uplink bandwidth. CAKE only shapes correctly when its declared bandwidth sits below the real bottleneck; the operator must measure once and configure.
- Iface-flap watchdog. `tc qdisc replace` is idempotent; on prod, `systemd-networkd` reapplies CAKE across iface lifecycle events. On test, `systemctl restart left4me-cake.service` is the documented recovery.

## Background

Current state (commit `62d6d4c` or thereabouts):

- The perf-baseline spec ships `/etc/sysctl.d/99-left4me.conf` with `rmem_max`, `wmem_max`, `rmem_default`, `wmem_default`, `netdev_max_backlog`, `netdev_budget`, `vm.swappiness`. No per-socket UDP minimums, no default-qdisc directive, no TCP congestion-control setting.
- `srcds_run` runs as system user `left4me`. srcds itself does not set `IP_TOS` or `SO_PRIORITY`, so its UDP packets leave the host with DSCP 0 and priority 0, indistinguishable from any other UDP traffic to any qdisc.
- The deploy ships nftables-relevant infrastructure only via package defaults (Debian Trixie ships `nftables` in base, but no `left4me` table is created).
- No qdisc is explicitly configured. The kernel's per-iface default applies: `fq_codel` on Trixie, but only because Debian's default has been `fq_codel` since Buster.
- The deploy script already copies sysctl drop-ins and runs `sysctl --system` (`deploy/deploy-test-server.sh:196`).

## Design

### Sysctl additions to `99-left4me.conf`

Append to `deploy/files/etc/sysctl.d/99-left4me.conf`:

```
# Per-socket UDP buffer floors: protect game-server sockets that don't bump
# their own SO_RCVBUF/SO_SNDBUF when softirq drains lag briefly.
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384

# Default qdisc for ifaces we don't explicitly shape with CAKE. Debian
# Trixie already defaults to fq_codel; setting it explicitly is
# belt-and-suspenders and survives kernel-default churn.
net.core.default_qdisc = fq_codel

# TCP congestion control: BBR for any bulk TCP egress on the host (admin
# SSH, backups, package fetches, web-app responses) so a long flow does
# not push the bottleneck queue ahead of game UDP. UDP srcds is
# unaffected.
net.ipv4.tcp_congestion_control = bbr
```

The deploy already runs `sysctl --system` after copying the conf (see Background); no script change required for this block.

### nftables packet marking

New file `deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft`:

```nft
table inet left4me_mark {
    chain mangle_output {
        type filter hook output priority mangle; policy accept;

        meta skuid "left4me" meta l4proto udp ip dscp set ef meta priority set 0006:0000
        meta skuid "left4me" meta l4proto udp ip6 dscp set ef meta priority set 0006:0000
    }
}
```

Per-element rationale:

- `meta skuid "left4me"`: every srcds instance runs as that user, and the uid match is exact. The web app also runs as `left4me` but speaks TCP, so the `l4proto udp` clause keeps it out of the match; the build sandbox runs under a different uid.
- `meta l4proto udp`: bypass anything not UDP, including the future RCON/HTTP TCP traffic from the web app.
- `ip dscp set ef` / `ip6 dscp set ef`: DSCP `EF` (Expedited Forwarding, decimal 46) is the standard low-latency marking. CAKE's `diffserv4` preset routes EF into its highest-priority "Voice" tin. Two rules, one per L3 family, because in an `inet` table the `ip` matcher only fires on v4 and `ip6` only on v6.
- `meta priority set 0006:0000`: sets `skb->priority` to class `6:0`. Read by qdiscs that classify on skb priority. CAKE honours it only when the major number matches its own qdisc handle and the minor number names a tin (see `tc-cake(8)`); otherwise it classifies by DSCP, so under CAKE the EF mark is what selects the tin. Set inline with the DSCP rule so a single rule-match runs both statements.

The table is named `left4me_mark` and lives in its own `inet` namespace. It does not touch, depend on, or conflict with any nftables config the operator may run independently. `nft -f` loads the file; `nft delete table inet left4me_mark` cleanly removes it.

New unit `deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service`:

```ini
[Unit]
Description=left4me nftables packet marking (DSCP EF + priority for srcds)
Wants=network-pre.target
Before=network-pre.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/nft -f /usr/local/lib/left4me/nft/left4me-mark.nft
ExecStop=/usr/sbin/nft delete table inet left4me_mark

[Install]
WantedBy=multi-user.target
```

`Wants=network-pre.target` / `Before=network-pre.target` orders the unit ahead of network management software, which orders itself after that target (see `systemd.special(7)`). The rules are therefore in place before any iface is configured, so the very first packet srcds emits post-boot is already marked.

Deploy script changes:

- Ensure `nftables` is installed (`apt-get install -y nftables`; idempotent, since the package is in Trixie base).
- Create `/usr/local/lib/left4me/nft/` and copy `left4me-mark.nft` into it.
- Copy the unit, `daemon-reload`, `systemctl enable --now left4me-nft-mark.service`.

### CAKE egress shaper: test deploy mechanism

Three files plus deploy-script changes. All operator-tunable knobs go in the env file; the helper and unit are static.
**`deploy/files/etc/left4me/cake.env`** (template; deploy installs only if absent so operator edits survive re-runs):

```
# Uplink bandwidth in Mbit/s. Set to ~95% of the smaller of measured
# upload and measured download. CAKE only shapes correctly when its
# declared bandwidth sits below the real bottleneck. If unset, the
# left4me-cake.service unit logs a warning and exits 0 (no shaping).
LEFT4ME_UPLINK_MBIT=

# Egress interface. If unset, auto-detected from the IPv4 default route.
LEFT4ME_UPLINK_IFACE=
```

**`deploy/files/usr/local/libexec/left4me/left4me-apply-cake`** (mode `0755`, owner `root:root`). The helper takes a single argument, `apply` or `clear`, so the unit's `ExecStart` and `ExecStop` both call the same script and the unit file stays free of shell escaping:

```sh
#!/bin/sh
set -eu

mode=${1:-apply}

# Operator-edited knobs; the file is optional.
if [ -r /etc/left4me/cake.env ]; then
    . /etc/left4me/cake.env
fi

resolve_iface() {
    if [ -n "${LEFT4ME_UPLINK_IFACE:-}" ]; then
        printf '%s' "$LEFT4ME_UPLINK_IFACE"
        return
    fi
    # Print the token after "dev" rather than a fixed field, so default
    # routes without a gateway ("default dev ppp0 ...") resolve too.
    ip -4 route show default |
        awk '/^default/ { for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

case "$mode" in
    apply)
        if [ -z "${LEFT4ME_UPLINK_MBIT:-}" ]; then
            echo "left4me-cake: LEFT4ME_UPLINK_MBIT unset; skipping shaper" >&2
            exit 0
        fi
        iface=$(resolve_iface)
        if [ -z "$iface" ]; then
            echo "left4me-cake: cannot determine egress iface; skipping" >&2
            exit 0
        fi
        exec tc qdisc replace dev "$iface" root cake \
            bandwidth "${LEFT4ME_UPLINK_MBIT}mbit" \
            internet diffserv4 dual-dsthost
        ;;
    clear)
        iface=$(resolve_iface)
        if [ -z "$iface" ]; then
            exit 0
        fi
        # Ignore "no such qdisc" so stop stays idempotent.
        tc qdisc del dev "$iface" root 2>/dev/null || true
        ;;
    *)
        echo "usage: $0 [apply|clear]" >&2
        exit 2
        ;;
esac
```

`tc qdisc replace` is idempotent: it replaces an existing root qdisc on the iface and adds one if absent. Re-running the unit at any time is safe. `clear` swallows the "no such qdisc" error so stop is also idempotent.
Fail-soft on missing config matches the perf-baseline philosophy: the deploy does not refuse to boot servers because the operator has not yet filled in `LEFT4ME_UPLINK_MBIT`. The journal warning surfaces the gap.

**`deploy/files/usr/local/lib/systemd/system/left4me-cake.service`**:

```ini
[Unit]
Description=left4me CAKE egress shaper
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=-/etc/left4me/cake.env
ExecStart=/usr/local/libexec/left4me/left4me-apply-cake apply
ExecStop=/usr/local/libexec/left4me/left4me-apply-cake clear

[Install]
WantedBy=multi-user.target
```

Per-flag rationale for the `cake` invocation:

- `bandwidth ${LEFT4ME_UPLINK_MBIT}mbit`: operator-declared, ≈95% of measured uplink. CAKE only shapes if its declared bandwidth is below the real bottleneck; setting it slightly low moves the queue into a place the host controls.
- `internet`: RTT preset (equivalent to `rtt 100ms`, CAKE's default) that tunes the AQM for typical internet path latencies. It is not an overhead-accounting keyword; link-layer overhead compensation (`docsis`, `pppoe-ptm`, `ethernet`, `conservative`, …) depends on the access technology and is not set here. Conservative default.
- `diffserv4`: four-tier DSCP-aware tin selection. Reads the EF marks set by the nftables rule and routes srcds packets into the highest-priority "Voice" tin. Stated explicitly rather than relying on CAKE's default preset (`diffserv3`); with `besteffort` the marks would be ignored.
- `dual-dsthost`: egress fairness keyed on destination host. With ≥2 players connected, each player gets a fair share regardless of how chatty the server is to any single client.

Iface-flap behaviour: the kernel keeps the qdisc on an iface across link-down/link-up while the iface itself exists. If the iface is recreated (e.g., NetworkManager reconfiguration), `systemctl restart left4me-cake.service` reapplies. Documented; no auto-watchdog in v1.

Deploy script changes (in `deploy/deploy-test-server.sh`):

- Copy `cake.env` to `/etc/left4me/cake.env` only if absent (do not clobber operator edits).
- Copy `left4me-apply-cake` to `/usr/local/libexec/left4me/`, mode `0755`, owner `root:root`.
- Copy `left4me-cake.service` to `/usr/local/lib/systemd/system/`.
- `systemctl daemon-reload` (already done in the existing flow).
- `systemctl enable --now left4me-cake.service`.

### CAKE egress shaper: production deployment (systemd-networkd)

On hosts running `systemd-networkd`, the CAKE configuration belongs in the matching `.network` file. systemd-networkd reapplies it across iface lifecycle events, addressing the only fragility of the test-deploy oneshot.

Document in `deploy/README.md` Performance section:

```ini
# /etc/systemd/network/.network
[CAKE]
Bandwidth=480M
RTTSec=100ms
PriorityQueueingPreset=diffserv4
FlowIsolationMode=dual-dst-host
```

Directive names follow `systemd.network(5)`. Values mirror the test deploy's `tc` invocation:

- `Bandwidth=480M`: placeholder; operator sets to ≈95% of measured uplink in their actual `.network`.
- `RTTSec=100ms`: equivalent of the `internet` keyword, which is CAKE's 100 ms RTT preset.
- `PriorityQueueingPreset=diffserv4`: equivalent of `diffserv4`.
- `FlowIsolationMode=dual-dst-host`: equivalent of `dual-dsthost` on egress.

The nftables marking from the previous section ships unchanged on prod; it is qdisc-installer-agnostic.

The test-deploy oneshot must not be installed on a host running `systemd-networkd`. v1 does not enforce that gate, because production hosts do not run the test-deploy script. If the boundary blurs in the future, add a check in `left4me-apply-cake` for `systemctl is-active systemd-networkd` and skip cleanly.

### Documented escape hatches

Append to `deploy/README.md` Performance section, alongside the existing governor / CPU-affinity / NIC entries:

- **Ingress shaping via IFB.** Egress CAKE alone does not protect srcds receive against ingress saturation (large workshop downloads, package fetches arriving at line rate). One-liner template using `modprobe ifb`, `ip link set ifb0 up`, `tc qdisc add dev ifb0 root cake bandwidth Xmbit ingress diffserv4 dual-srchost`, and a `tc filter` redirect from the uplink iface.
  Worth flipping only when measurement shows ingress hurting receive; in v1 we have no such measurement, so it stays documented.
- **`net.core.busy_poll = 50` / `net.core.busy_read = 50`.** Reduces UDP receive median latency by polling for incoming packets briefly at syscall boundaries. Cost: measurable CPU per syscall under load. Worth flipping if a host is dedicated to game serving and CPU headroom is plentiful.
- **`ethtool -K gro off`.** Some Source-engine ops disable generic receive offload to avoid receive-side coalescing latency. Hardware/driver dependent. Document, do not ship.

These three entries follow the existing escape-hatch style: a one-liner or short config block, plus one sentence on when it matters.

### Files changed / added

```
deploy/files/etc/sysctl.d/99-left4me.conf  (modified: block added)
deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft  (new)
deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service  (new)
deploy/files/etc/left4me/cake.env  (new: template, deploy preserves operator edits)
deploy/files/usr/local/libexec/left4me/left4me-apply-cake  (new)
deploy/files/usr/local/lib/systemd/system/left4me-cake.service  (new)
deploy/deploy-test-server.sh  (modified: install+enable nft and cake units, conditional copy of cake.env)
deploy/README.md  (modified: Network shaping subsection + 3 new escape hatches)
deploy/tests/test_deploy_artifacts.py  (modified: assertions for all artifacts above)
```

## Tests

Following the existing `assert "key=value" in text` pattern in `deploy/tests/test_deploy_artifacts.py`:

**Sysctl block** (extension of the existing perf-baseline assertions):

- Each of `net.ipv4.udp_rmem_min = 16384`, `net.ipv4.udp_wmem_min = 16384`, `net.core.default_qdisc = fq_codel`, `net.ipv4.tcp_congestion_control = bbr` is asserted as a separate line.
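
The sysctl assertions above could take roughly the following shape. This is a minimal sketch, not the repository's actual test: the helper name `missing_sysctl_lines`, the constant `EXPECTED_SYSCTL_LINES`, and the inline `SAMPLE_CONF` text are all hypothetical, and a real test would read `deploy/files/etc/sysctl.d/99-left4me.conf` from disk instead of using a stand-in string.

```python
# Hypothetical sketch of the "one assertion per sysctl line" pattern.
EXPECTED_SYSCTL_LINES = (
    "net.ipv4.udp_rmem_min = 16384",
    "net.ipv4.udp_wmem_min = 16384",
    "net.core.default_qdisc = fq_codel",
    "net.ipv4.tcp_congestion_control = bbr",
)


def missing_sysctl_lines(text, expected=EXPECTED_SYSCTL_LINES):
    """Return the expected settings that do not appear as a whole line."""
    # Compare against stripped whole lines rather than raw substrings, so a
    # commented-out "# net.core.default_qdisc = fq_codel" does not count.
    present = {line.strip() for line in text.splitlines()}
    return [setting for setting in expected if setting not in present]


# Stand-in for the conf file contents (assumption: the real test loads the file).
SAMPLE_CONF = """\
# Per-socket UDP buffer floors
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
net.core.default_qdisc = fq_codel
net.ipv4.tcp_congestion_control = bbr
"""

assert missing_sysctl_lines(SAMPLE_CONF) == []
```

Matching whole lines is slightly stricter than the plain substring pattern; whether that strictness is wanted is a judgment call for the test author.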

**nftables marking artifacts:**

- `left4me-mark.nft` ships with `table inet left4me_mark`, `chain mangle_output`, `meta skuid "left4me"`, `ip dscp set ef`, `ip6 dscp set ef`, and `meta priority set 0006:0000` each asserted as separate substring matches. (DSCP and priority statements appear inline on the same rule per L3 family; substring assertions don't depend on rule layout.)
- `left4me-nft-mark.service` has `ExecStart=/usr/sbin/nft -f /usr/local/lib/left4me/nft/left4me-mark.nft`, `ExecStop=/usr/sbin/nft delete table inet left4me_mark`, `Type=oneshot`, `RemainAfterExit=yes`, `WantedBy=multi-user.target`.
- `deploy-test-server.sh` invokes `systemctl enable --now left4me-nft-mark.service` (or equivalent at-deploy enabling step).

**CAKE artifacts:**

- `cake.env` template contains the literal lines `LEFT4ME_UPLINK_MBIT=` and `LEFT4ME_UPLINK_IFACE=` (commented or uncommented; matched as substring).
- `left4me-apply-cake` contains the literals `tc qdisc replace`, `cake`, `bandwidth`, `internet`, `diffserv4`, `dual-dsthost`, `LEFT4ME_UPLINK_MBIT`, `LEFT4ME_UPLINK_IFACE`.
- `left4me-apply-cake` is mode `0755` after deploy (asserted via the same mechanism the existing helper-script tests use).
- `left4me-cake.service` contains `EnvironmentFile=-/etc/left4me/cake.env`, `ExecStart=/usr/local/libexec/left4me/left4me-apply-cake apply`, `ExecStop=/usr/local/libexec/left4me/left4me-apply-cake clear`, `Wants=network-online.target`, `Type=oneshot`, `WantedBy=multi-user.target`.
- `deploy-test-server.sh` invokes `systemctl enable --now left4me-cake.service`.
- `deploy-test-server.sh` copies `cake.env` only when target absent (asserted by literal substring of the guarding `[ -e /etc/left4me/cake.env ]` test or equivalent).

No runtime networking tests in v1. The artifacts are static; their runtime behaviour requires a real iface and a real bandwidth load, which the operator measures.

## Rollout

Single deploy.
After the new sysctl block lands, `sysctl --system` applies it immediately (already in the deploy flow). The two new systemd units start on `systemctl enable --now`; CAKE without a configured `LEFT4ME_UPLINK_MBIT` logs a warning and no-ops, which is the expected fresh-deploy state. The operator measures their uplink, edits `/etc/left4me/cake.env`, and runs `systemctl restart left4me-cake.service`.

Already-running game servers are unaffected by the network changes themselves. The marking applies to every packet emitted from the moment the nft rule loads, so packets pick up DSCP+priority without restarting any srcds instance.

## Open questions

None blocking. v2 candidates if measurement justifies them:

- A `LEFT4ME_INGRESS_MBIT` knob that flips on the IFB ingress shaper as a default, conditional on the env value being set.
- A `left4me-net-doctor` helper that reports current qdisc, applied marks, and a one-shot saturation+ping measurement against a local endpoint.
- A small Python wrapper in `l4d2host` that reads `cake.env` for display in the web UI, so the operator sees in one place whether shaping is active.

## References

- `tc-cake(8)`: keyword semantics for `bandwidth`, `internet`, `diffserv4`, `dual-dsthost`, and tin priority mapping.
- `systemd.network(5)`: `[CAKE]` section directives `Bandwidth=`, `RTTSec=`, `PriorityQueueingPreset=`, `FlowIsolationMode=`.
- `nft(8)`: `meta skuid`, `meta priority`, `ip dscp set`, table isolation semantics.
- RFC 3246: Expedited Forwarding (EF) PHB.
- Linux kernel `Documentation/networking/tcp_bbr.txt`: BBR pairs with `fq` / `fq_codel` for correct pacing.
- `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`: sibling spec; this spec extends `99-left4me.conf` and reuses the same deploy-test-artifact pattern.