docs(specs): l4d2 network shaping & marking — design

CAKE egress shaping (test-deploy oneshot + systemd-networkd [CAKE] block
on prod), nftables uid-based DSCP-EF + skb-priority marking for srcds
UDP, plus rounding sysctls (udp_rmem_min/wmem_min, default_qdisc=fq_codel,
tcp_congestion_control=bbr). Hardware-specific knobs stay documented
escape hatches matching the perf-baseline boundary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mwiegand 2026-05-10 00:05:44 +02:00
parent 62d6d4cbcd
commit 0cc92f2c17

@ -0,0 +1,487 @@
# l4d2 network shaping & marking — design
Date: 2026-05-10
Status: design
## Summary
Add a network-side player-experience baseline alongside the existing host
perf baseline. Three concerns ship together:
1. **Mark srcds outbound packets** with DSCP `EF` and skb priority `6:0` so
any qdisc — host CAKE, ISP gear that honours DSCP, future systems —
recognises L4D2 game traffic as latency-sensitive. Marking happens by uid
match on the `left4me` user.
2. **Round out the UDP-socket sysctl baseline** (`udp_rmem_min`,
`udp_wmem_min`), set the default qdisc explicitly to `fq_codel`, and
switch TCP to `bbr` so coexisting TCP egress (admin, backups, web app,
apt) cannot bufferbloat the link the players share.
3. **Shape egress with CAKE.** On the test deploy, install a systemd oneshot
that applies `tc qdisc replace … cake …` from an operator-edited env
file. On production hosts running `systemd-networkd`, document the
equivalent `[CAKE]` section in the matching `.network` file as the
long-term path.
The intent is "all reasonable measures that do not depend on host-specific
hardware." Hardware-specific tuning (NIC ring buffers, IRQ pinning, CPU
governor, real-time scheduling, CPU affinity) remains a documented escape
hatch — same boundary the existing perf-baseline spec drew. The pieces
that *are* universally safe ship as defaults.
## Goals
- Game-server UDP packets carry an unambiguous priority signal in DSCP and
in `skb->priority`, set on the host before any qdisc inspects them.
- A coexisting bulk TCP flow on the same host (backup upload, package
fetch, web-app response) cannot build a standing queue at the bottleneck
that game UDP has to wait behind under saturation.
- An operator who declares uplink bandwidth gets fair-queueing egress
shaping with diffserv-aware tin selection: EF-marked srcds traffic
drops into the highest-priority CAKE tin, and per-destination-host
fairness keeps every connected player on equal footing.
- A production deployment using `systemd-networkd` has a one-block
configuration recipe, no helper script needed.
- Operators have a documented set of additional knobs (ingress shaping via
IFB, `busy_poll`, GRO toggling) for cases the default baseline does not
cover. None of these auto-apply.
## Non-goals
- NIC ring-buffer / IRQ pinning / RPS / RFS / hardware timestamping —
already declared host-specific in the perf-baseline spec; not
re-litigated here.
- `busy_poll` / `busy_read` as defaults — non-trivial CPU cost; documented
as opt-in.
- Ingress shaping via IFB as a default — only matters if egress CAKE turns
out load-bearing and ingress is also saturated; documented as opt-in.
- Real-time scheduling, governor changes — already declined by the
perf-baseline spec.
- Blueprint-side game settings (`sv_minrate`, `sv_maxrate`, tickrate,
`fps_max`) — owned by the server maintainer.
- Auto-detection or measurement of uplink bandwidth. CAKE only shapes
correctly when its declared bandwidth sits below the real bottleneck;
the operator must measure once and configure.
- Iface-flap watchdog. `tc qdisc replace` is idempotent; on prod,
`systemd-networkd` reapplies CAKE across iface lifecycle events. On
test, `systemctl restart left4me-cake.service` is the documented
recovery.
## Background
Current state (commit `62d6d4c` or thereabouts):
- The perf-baseline spec ships `/etc/sysctl.d/99-left4me.conf` with
`rmem_max`, `wmem_max`, `rmem_default`, `wmem_default`,
`netdev_max_backlog`, `netdev_budget`, `vm.swappiness`. No per-socket
UDP minimums, no default-qdisc directive, no TCP congestion-control
setting.
- `srcds_run` runs as system user `left4me`. srcds itself does not set
`IP_TOS` or `SO_PRIORITY`, so its UDP packets leave the host with
DSCP 0 and priority 0 — indistinguishable from any other UDP traffic to
any qdisc.
- The deploy ships nftables-relevant infrastructure only via package
defaults (Debian Trixie ships `nftables` in base, but no `left4me`
table is created).
- No qdisc is explicitly configured. The kernel's per-iface default
applies — `fq_codel` on Trixie, but only because Debian's default has
been `fq_codel` since Buster.
- The deploy script already copies sysctl drop-ins and runs
`sysctl --system` (`deploy/deploy-test-server.sh:196`).
## Design
### Sysctl additions to `99-left4me.conf`
Append to `deploy/files/etc/sysctl.d/99-left4me.conf`:
```
# Per-socket UDP buffer floors: protect game-server sockets that don't bump
# their own SO_RCVBUF/SO_SNDBUF when softirq drains lag briefly.
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
# Default qdisc for ifaces we don't explicitly shape with CAKE. Debian
# Trixie already defaults to fq_codel; setting it explicitly is
# belt-and-suspenders and survives kernel-default churn.
net.core.default_qdisc = fq_codel
# TCP congestion control: BBR for any bulk TCP egress on the host (admin
# SSH, backups, package fetches, web-app responses) so a long flow does
# not push the bottleneck queue ahead of game UDP. UDP srcds is
# unaffected.
net.ipv4.tcp_congestion_control = bbr
```
The deploy already runs `sysctl --system` after copying the conf
(`deploy/deploy-test-server.sh:198`); no script change required for this
block.
### nftables packet marking
New file `deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft`:
```nft
table inet left4me_mark {
    chain mangle_output {
        type filter hook output priority mangle; policy accept;

        # One rule per L3 family; each sets DSCP EF and skb priority together.
        meta skuid "left4me" meta l4proto udp ip dscp set ef meta priority set 0006:0000
        meta skuid "left4me" meta l4proto udp ip6 dscp set ef meta priority set 0006:0000
    }
}
```
Per-element rationale:
- `meta skuid "left4me"` — every srcds instance runs as that user, so the
uid match covers all game-server sockets. The web app also runs as
`left4me`, but it speaks TCP and is excluded by the `meta l4proto udp`
match; the build sandbox runs under a different uid.
- `meta l4proto udp` — skip anything that is not UDP, including the future
RCON/HTTP TCP traffic from the web app.
- `ip dscp set ef` / `ip6 dscp set ef` — DSCP `EF` (Expedited Forwarding,
decimal 46) is the standard low-latency marking. CAKE's `diffserv4`
preset routes EF into its highest-priority "Voice" tin. Two rules,
one per L3 family, because in an `inet` table the `ip` matcher only
fires on v4 and `ip6` only on v6.
- `meta priority set 0006:0000` — sets `skb->priority` to class `6:0`.
Read by qdiscs that classify on skb priority (CAKE included) ahead of
any DSCP table lookup. Set inline with the DSCP rule so a single
rule-match runs both statements.
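To make the two marks concrete, the following is a small sketch of the values involved, assuming the standard layout of the IP TOS / Traffic Class byte (6-bit DSCP above 2-bit ECN) and the usual `major:minor` tc handle encoding (hexadecimal, 16 bits each). The function names are illustrative and not part of the deploy.

```python
def dscp_to_tos_byte(dscp: int, ecn: int = 0) -> int:
    """Pack a 6-bit DSCP and a 2-bit ECN field into the 8-bit TOS / Traffic Class byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit field (0..63)")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN is a 2-bit field (0..3)")
    return (dscp << 2) | ecn


def tc_handle_to_u32(handle: str) -> int:
    """Convert a 'major:minor' tc handle (hex, 16 bits each) to its 32-bit value."""
    major, minor = handle.split(":")
    return (int(major, 16) << 16) | int(minor or "0", 16)


DSCP_EF = 46  # Expedited Forwarding, RFC 3246

print(hex(dscp_to_tos_byte(DSCP_EF)))      # 0xb8
print(hex(tc_handle_to_u32("0006:0000")))  # 0x60000
```

So an EF-marked packet carries `0xb8` in its TOS byte (with ECN bits clear), and `0006:0000` is stored in `skb->priority` as `0x00060000`. For comparison, a socket-level `SO_PRIORITY` of 6 would be the plain integer 6, i.e. handle `0:6`; which form a given qdisc acts on is worth confirming against `tc-cake(8)` before relying on it.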
The table is named `left4me_mark` and lives in its own `inet` namespace.
It does not touch, depend on, or conflict with any nftables config the
operator may run independently. `nft -f` loads the file; `nft delete
table inet left4me_mark` cleanly removes it.
New unit `deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service`:
```ini
[Unit]
Description=left4me nftables packet marking (DSCP EF + priority for srcds)
After=network-pre.target
Before=network.target
Wants=network-pre.target
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/nft -f /usr/local/lib/left4me/nft/left4me-mark.nft
ExecStop=/usr/sbin/nft delete table inet left4me_mark
[Install]
WantedBy=multi-user.target
```
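`After=network-pre.target` / `Before=network.target` orders the unit
ahead of `network.target`, so the rules are loaded before the network
stack counts as up, and any srcds instance ordered after `network.target`
has its very first post-boot packet marked. Guaranteeing that the rules
precede interface configuration itself would instead take
`Before=network-pre.target`, the ordering `systemd.special(7)` describes
for firewall units.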
Deploy script changes:
- Ensure `nftables` is installed (`apt-get install -y nftables`;
idempotent — package is in Trixie base).
- Create `/usr/local/lib/left4me/nft/` and copy `left4me-mark.nft` into
it.
- Copy the unit, `daemon-reload`, `systemctl enable --now
left4me-nft-mark.service`.
### CAKE egress shaper — test deploy mechanism
Three files plus deploy-script changes. All operator-tunable knobs go in
the env file; the helper and unit are static.
**`deploy/files/etc/left4me/cake.env`** (template; deploy installs only
if absent so operator edits survive re-runs):
```
# Uplink bandwidth in Mbit/s. Set to ~95% of the smaller of measured
# upload and measured download. CAKE only shapes correctly when its
# declared bandwidth sits below the real bottleneck. If unset, the
# left4me-cake.service unit logs a warning and exits 0 (no shaping).
LEFT4ME_UPLINK_MBIT=
# Egress interface. If unset, auto-detected from the IPv4 default route.
LEFT4ME_UPLINK_IFACE=
```
**`deploy/files/usr/local/libexec/left4me/left4me-apply-cake`** (mode
`0755`, owner `root:root`). The helper takes a single argument — `apply`
or `clear` — so the unit's `ExecStart` and `ExecStop` both call the same
script and the unit file stays free of shell escaping:
```sh
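#!/bin/sh
# Apply or clear the CAKE root qdisc on the uplink interface.
set -eu

mode=${1:-apply}

# Operator-tunable knobs; the file is optional.
if [ -r /etc/left4me/cake.env ]; then
    . /etc/left4me/cake.env
fi

# Print the egress iface: explicit override first, else the device of the
# IPv4 default route. The awk script takes the token after "dev" rather
# than a fixed field, so routes without a "via" gateway also resolve.
resolve_iface() {
    if [ -n "${LEFT4ME_UPLINK_IFACE:-}" ]; then
        printf '%s' "$LEFT4ME_UPLINK_IFACE"
        return
    fi
    ip -4 route show default |
        awk '$1 == "default" { for (i = 2; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

case "$mode" in
    apply)
        # Fail soft: a missing bandwidth or iface logs a warning and exits 0.
        if [ -z "${LEFT4ME_UPLINK_MBIT:-}" ]; then
            echo "left4me-cake: LEFT4ME_UPLINK_MBIT unset; skipping shaper" >&2
            exit 0
        fi
        iface=$(resolve_iface)
        if [ -z "$iface" ]; then
            echo "left4me-cake: cannot determine egress iface; skipping" >&2
            exit 0
        fi
        exec tc qdisc replace dev "$iface" root cake \
            bandwidth "${LEFT4ME_UPLINK_MBIT}mbit" \
            internet diffserv4 dual-dsthost
        ;;
    clear)
        iface=$(resolve_iface)
        if [ -z "$iface" ]; then
            exit 0
        fi
        # Ignore "no such qdisc" so stop stays idempotent.
        tc qdisc del dev "$iface" root 2>/dev/null || true
        ;;
    *)
        echo "usage: $0 [apply|clear]" >&2
        exit 2
        ;;
esac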
`tc qdisc replace` is idempotent: replaces an existing root qdisc on the
iface, adds one if absent. Re-running the unit any time is safe. `clear`
swallows the "no such qdisc" error so stop is also idempotent.
Fail-soft on missing config matches the perf-baseline philosophy — the
deploy does not refuse to boot servers because the operator has not yet
filled in `LEFT4ME_UPLINK_MBIT`. The journal warning surfaces the gap.
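The helper's interface auto-detection reads the device from the IPv4 default route. As an illustration only, the same lookup can be sketched in Python, keyed on the token that follows `dev` rather than on a fixed field position (a default route without a `via` gateway has fewer fields). The function name and the sample route lines are hypothetical, not captured from a real host.

```python
from typing import Optional


def default_route_iface(route_output: str) -> Optional[str]:
    """Return the device of the first default route in `ip -4 route show default` output."""
    for line in route_output.splitlines():
        tokens = line.split()
        if not tokens or tokens[0] != "default":
            continue
        if "dev" in tokens:
            idx = tokens.index("dev")
            if idx + 1 < len(tokens):
                return tokens[idx + 1]
    return None


# Illustrative inputs: a gateway route, a point-to-point route, and no route.
print(default_route_iface("default via 192.0.2.1 dev eth0 proto dhcp metric 100"))
print(default_route_iface("default dev ppp0 scope link"))
print(default_route_iface(""))
```

With these inputs the sketch yields `eth0`, `ppp0`, and `None`. The last case corresponds to the helper's "cannot determine egress iface" branch.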
**`deploy/files/usr/local/lib/systemd/system/left4me-cake.service`**:
```ini
[Unit]
Description=left4me CAKE egress shaper
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=-/etc/left4me/cake.env
ExecStart=/usr/local/libexec/left4me/left4me-apply-cake apply
ExecStop=/usr/local/libexec/left4me/left4me-apply-cake clear
[Install]
WantedBy=multi-user.target
```
Per-flag rationale for the `cake` invocation:
- `bandwidth ${LEFT4ME_UPLINK_MBIT}mbit` — operator-declared, ≈95% of
measured uplink. CAKE only shapes if its declared bandwidth is below
the real bottleneck; setting it slightly low moves the queue into a
place the host controls.
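- `internet` — RTT keyword, not an overhead keyword: it tunes CAKE's AQM
for a typical internet round-trip time of 100 ms, which is also CAKE's
default. Link-layer overhead compensation uses a separate keyword family
in `tc-cake(8)` (`ethernet`, `docsis`, `pppoe-ptm`, `conservative`, ...)
and is not set here.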
- `diffserv4` — four-tier DSCP-aware tin selection. Reads the EF marks
set by the nftables rule and routes srcds packets into the
highest-priority "Voice" tin. Without `diffserv4`, the marks are
ignored.
- `dual-dsthost` — egress fairness keyed on destination host. With ≥2
players connected, each player gets fair share regardless of how
chatty the server is to any single client.
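As a worked example of the bandwidth rule, the sketch below derives the value for `LEFT4ME_UPLINK_MBIT` from measured rates and assembles the same `tc` argument list the helper runs. It assumes the rule stated in the `cake.env` template (95% of the smaller of measured upload and download); the function names and the sample measurements are hypothetical.

```python
from typing import List


def shaper_mbit(measured_up_mbit: int, measured_down_mbit: int, percent: int = 95) -> int:
    """Return the CAKE bandwidth in Mbit/s: `percent` of the smaller measured rate."""
    if measured_up_mbit <= 0 or measured_down_mbit <= 0:
        raise ValueError("measured rates must be positive")
    bottleneck = min(measured_up_mbit, measured_down_mbit)
    # Integer arithmetic avoids float rounding; the result rounds down, which
    # keeps the declared bandwidth below the real bottleneck.
    return bottleneck * percent // 100


def cake_args(iface: str, mbit: int) -> List[str]:
    """Build the argv for the `tc qdisc replace` call used by the helper."""
    return [
        "tc", "qdisc", "replace", "dev", iface, "root", "cake",
        "bandwidth", f"{mbit}mbit",
        "internet", "diffserv4", "dual-dsthost",
    ]


mbit = shaper_mbit(measured_up_mbit=506, measured_down_mbit=940)
print(mbit)                               # 480
print(" ".join(cake_args("eth0", mbit)))
```

A measured 506 Mbit/s upload thus becomes `LEFT4ME_UPLINK_MBIT=480`. Rounding down rather than to nearest is a deliberate choice in this sketch, since overshooting the bottleneck defeats the shaper.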
Iface-flap behaviour: the kernel keeps the qdisc on an iface across
link-down/link-up while the iface itself exists. If the iface is
recreated (e.g., NetworkManager reconfiguration), `systemctl restart
left4me-cake.service` reapplies. Documented; no auto-watchdog in v1.
Deploy script changes (in `deploy/deploy-test-server.sh`):
- Copy `cake.env` to `/etc/left4me/cake.env` only if absent (do not
clobber operator edits).
- Copy `left4me-apply-cake` to `/usr/local/libexec/left4me/`, mode
`0755`, owner `root:root`.
- Copy `left4me-cake.service` to `/usr/local/lib/systemd/system/`.
- `systemctl daemon-reload` (already done in the existing flow).
- `systemctl enable --now left4me-cake.service`.
### CAKE egress shaper — production deployment (systemd-networkd)
On hosts running `systemd-networkd`, the CAKE configuration belongs in
the matching `.network` file. systemd-networkd reapplies it across iface
lifecycle events, addressing the only fragility of the test-deploy
oneshot.
Document in `deploy/README.md` Performance section:
```ini
# /etc/systemd/network/<your-uplink>.network
[CAKE]
Bandwidth=480M
RTTSec=100ms
PriorityQueueingPreset=diffserv4
FlowIsolationMode=dual-dst-host
```
Directive names follow `systemd.network(5)`. Values mirror the test
deploy's `tc` invocation:
- `Bandwidth=480M` — placeholder; operator sets to ≈95% of measured
uplink in their actual `.network`.
- `RTTSec=100ms` — equivalent of the `internet` keyword, which selects a
100 ms RTT target.
- `PriorityQueueingPreset=diffserv4` — equivalent of `diffserv4`.
- `FlowIsolationMode=dual-dst-host` — equivalent of `dual-dsthost` on
egress.
The nftables marking from the previous section ships unchanged on prod;
it is qdisc-installer-agnostic.
The test-deploy oneshot should not be installed on a host running
`systemd-networkd`. v1 does not enforce this with a gate, because
production hosts do not run the test-deploy script. If the boundary
blurs in the future, add a check in `left4me-apply-cake` for
`systemctl is-active systemd-networkd` and skip cleanly.
### Documented escape hatches
Append to `deploy/README.md` Performance section, alongside the existing
governor / CPU-affinity / NIC entries:
- **Ingress shaping via IFB.** Egress CAKE alone does not protect srcds
receive against ingress saturation (large workshop downloads, package
fetches arriving at line rate). One-liner template using `modprobe
ifb`, `ip link set ifb0 up`, `tc qdisc add dev ifb0 root cake bandwidth
Xmbit ingress diffserv4 dual-srchost`, and a `tc filter` redirect from
the uplink iface. Worth flipping only when measurement shows ingress
hurting receive; in v1 we have no such measurement, so it stays
documented.
- **`net.core.busy_poll = 50` / `net.core.busy_read = 50`.** Reduces UDP
receive median latency by polling for incoming packets briefly at
syscall boundaries. Cost: measurable CPU per syscall under load. Worth
flipping if a host is dedicated to game serving and CPU headroom is
plentiful.
- **`ethtool -K <iface> gro off`.** Some Source-engine ops disable
generic receive offload to avoid receive-side coalescing latency.
Hardware/driver dependent. Document, do not ship.
These three entries follow the existing escape-hatch style: a one-liner
or short config block, plus one sentence on when it matters.
### Files changed / added
```
deploy/files/etc/sysctl.d/99-left4me.conf (modified — block added)
deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft (new)
deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service (new)
deploy/files/etc/left4me/cake.env (new — template, deploy preserves operator edits)
deploy/files/usr/local/libexec/left4me/left4me-apply-cake (new)
deploy/files/usr/local/lib/systemd/system/left4me-cake.service (new)
deploy/deploy-test-server.sh (modified — install+enable nft and cake units, conditional copy of cake.env)
deploy/README.md (modified — Network shaping subsection + 3 new escape hatches)
deploy/tests/test_deploy_artifacts.py (modified — assertions for all artifacts above)
```
## Tests
Following the existing `assert "key=value" in text` pattern in
`deploy/tests/test_deploy_artifacts.py`:
**Sysctl block** (extension of the existing perf-baseline assertions):
- Each of `net.ipv4.udp_rmem_min = 16384`, `net.ipv4.udp_wmem_min =
16384`, `net.core.default_qdisc = fq_codel`,
`net.ipv4.tcp_congestion_control = bbr` is asserted as a separate line.
**nftables marking artifacts:**
- `left4me-mark.nft` ships with `table inet left4me_mark`, `chain
mangle_output`, `meta skuid "left4me"`, `ip dscp set ef`, `ip6 dscp
set ef`, and `meta priority set 0006:0000` each asserted as separate
substring matches. (DSCP and priority statements appear inline on
the same rule per L3 family; substring assertions don't depend on
rule layout.)
- `left4me-nft-mark.service` has `ExecStart=/usr/sbin/nft -f
/usr/local/lib/left4me/nft/left4me-mark.nft`, `ExecStop=/usr/sbin/nft
delete table inet left4me_mark`, `Type=oneshot`,
`RemainAfterExit=yes`, `WantedBy=multi-user.target`.
- `deploy-test-server.sh` invokes `systemctl enable --now
left4me-nft-mark.service` (or equivalent at-deploy enabling step).
**CAKE artifacts:**
- `cake.env` template contains the literal lines `LEFT4ME_UPLINK_MBIT=`
and `LEFT4ME_UPLINK_IFACE=` (commented or uncommented; matched as
substring).
- `left4me-apply-cake` contains the literals `tc qdisc replace`, `cake`,
`bandwidth`, `internet`, `diffserv4`, `dual-dsthost`,
`LEFT4ME_UPLINK_MBIT`, `LEFT4ME_UPLINK_IFACE`.
- `left4me-apply-cake` is mode `0755` after deploy (asserted via the
same mechanism the existing helper-script tests use).
- `left4me-cake.service` contains
`EnvironmentFile=-/etc/left4me/cake.env`,
`ExecStart=/usr/local/libexec/left4me/left4me-apply-cake apply`,
`ExecStop=/usr/local/libexec/left4me/left4me-apply-cake clear`,
`Wants=network-online.target`, `Type=oneshot`,
`WantedBy=multi-user.target`.
- `deploy-test-server.sh` invokes `systemctl enable --now
left4me-cake.service`.
- `deploy-test-server.sh` copies `cake.env` only when target absent
(asserted by literal substring of the guarding `[ -e
/etc/left4me/cake.env ]` test or equivalent).
No runtime networking tests in v1. The artifacts are static; their
runtime behaviour requires a real iface and a real bandwidth load,
which the operator measures.
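A minimal sketch of how the substring assertions above might look, in the same `assert "literal" in text` spirit. The helper function, the constant name, and the inline sample text are assumptions for illustration; the real test would read the shipped artifact from the repository rather than an inline string.

```python
from typing import List

# Literals the spec requires in left4me-apply-cake.
CAKE_HELPER_LITERALS = [
    "tc qdisc replace", "cake", "bandwidth", "internet",
    "diffserv4", "dual-dsthost",
    "LEFT4ME_UPLINK_MBIT", "LEFT4ME_UPLINK_IFACE",
]


def missing_literals(text: str, literals: List[str]) -> List[str]:
    """Return the required literals that do not appear in `text`."""
    return [lit for lit in literals if lit not in text]


# Stand-in for the helper's contents (raw string keeps the backslashes).
sample_helper = r'''
iface=${LEFT4ME_UPLINK_IFACE:-}
exec tc qdisc replace dev "$iface" root cake \
    bandwidth "${LEFT4ME_UPLINK_MBIT}mbit" \
    internet diffserv4 dual-dsthost
'''

assert missing_literals(sample_helper, CAKE_HELPER_LITERALS) == []
```

Collecting the misses into a list, instead of one bare `assert` per literal, makes a failing run report every absent literal at once.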
## Rollout
Single deploy. After the new sysctl block lands, `sysctl --system`
applies it immediately (already in the deploy flow). The two new
systemd units start on `systemctl enable --now`; CAKE without a
configured `LEFT4ME_UPLINK_MBIT` logs a warning and no-ops, which is
the expected fresh-deploy state. The operator measures their uplink,
edits `/etc/left4me/cake.env`, and runs `systemctl restart
left4me-cake.service`.
Already-running game servers are unaffected by the network changes
themselves. The marking applies on every emitted packet from the moment
the nft rule loads; future-emitted packets pick up DSCP+priority without
restarting any srcds instance.
## Open questions
None blocking. v2 candidates if measurement justifies them:
- A `LEFT4ME_INGRESS_MBIT` knob that flips on the IFB ingress shaper as
a default, conditional on the env value being set.
- A `left4me-net-doctor` helper that reports current qdisc, applied
marks, and a one-shot saturation+ping measurement against a local
endpoint.
- A small Python wrapper in `l4d2host` that reads `cake.env` for
display in the web UI, so the operator sees in one place whether
shaping is active.
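For the last candidate, a rough sketch of what reading `cake.env` could look like. It assumes the file contains only plain `KEY=VALUE` lines and comments, as in the template; the helper sources the file with the shell, so anything beyond that (expansions, multi-line values) is not handled here. The function names are hypothetical and not part of `l4d2host`.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def shaping_configured(env: Dict[str, str]) -> bool:
    """True when the operator has declared an uplink bandwidth."""
    return bool(env.get("LEFT4ME_UPLINK_MBIT", ""))


template = "# Uplink bandwidth in Mbit/s.\nLEFT4ME_UPLINK_MBIT=\nLEFT4ME_UPLINK_IFACE=\n"
edited = "LEFT4ME_UPLINK_MBIT=480\nLEFT4ME_UPLINK_IFACE=eth0\n"

print(shaping_configured(parse_env(template)))  # False
print(shaping_configured(parse_env(edited)))    # True
```

This only reports whether shaping is configured; confirming that the qdisc is actually installed would additionally need the output of `tc qdisc show`.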
## References
- `tc-cake(8)` — keyword semantics: `bandwidth`, `internet`,
`diffserv4`, `dual-dsthost`, tin priority mapping.
- `systemd.network(5)` — `[CAKE]` section directives:
`Bandwidth=`, `RTTSec=`, `PriorityQueueingPreset=`,
`FlowIsolationMode=`.
- `nft(8)` — `meta skuid`, `meta priority`, `ip dscp set`, table
isolation semantics.
- RFC 3246 — Expedited Forwarding (EF) PHB.
- Linux kernel `Documentation/networking/tcp_bbr.txt` — BBR pairs with
`fq` / `fq_codel` for correct pacing.
- `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`
— sibling spec; this spec extends `99-left4me.conf` and reuses the
same deploy-test-artifact pattern.