left4me/docs/superpowers/specs/2026-05-10-l4d2-network-shaping-design.md
mwiegand 0cc92f2c17
docs(specs): l4d2 network shaping & marking — design
CAKE egress shaping (test-deploy oneshot + systemd-networkd [CAKE] block
on prod), nftables uid-based DSCP-EF + skb-priority marking for srcds
UDP, plus rounding sysctls (udp_rmem_min/wmem_min, default_qdisc=fq_codel,
tcp_congestion_control=bbr). Hardware-specific knobs stay documented
escape hatches matching the perf-baseline boundary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:05:44 +02:00

l4d2 network shaping & marking — design

Date: 2026-05-10
Status: design

Summary

Add a network-side player-experience baseline alongside the existing host perf baseline. Three concerns ship together:

  1. Mark srcds outbound packets with DSCP EF and skb priority 6:0 so any qdisc — host CAKE, ISP gear that honours DSCP, future systems — recognises L4D2 game traffic as latency-sensitive. Marking happens by uid match on the left4me user.
  2. Round out the UDP-socket sysctl baseline (udp_rmem_min, udp_wmem_min), set the default qdisc explicitly to fq_codel, and switch TCP to bbr so coexisting TCP egress (admin, backups, web app, apt) cannot bufferbloat the link the players share.
  3. Shape egress with CAKE. On the test deploy, install a systemd oneshot that applies tc qdisc replace … cake … from an operator-edited env file. On production hosts running systemd-networkd, document the equivalent [CAKE] section in the matching .network file as the long-term path.

The intent is "all reasonable measures that do not depend on host-specific hardware." Hardware-specific tuning (NIC ring buffers, IRQ pinning, CPU governor, real-time scheduling, CPU affinity) remains a documented escape hatch — same boundary the existing perf-baseline spec drew. The pieces that are universally safe ship as defaults.

Goals

  • Game-server UDP packets carry an unambiguous priority signal in DSCP and in skb->priority, set on the host before any qdisc inspects them.
  • A coexisting bulk TCP flow on the same host (backup upload, package fetch, web-app response) cannot push the bottleneck queue ahead of game UDP under saturation.
  • An operator who declares uplink bandwidth gets fair-queueing egress shaping with diffserv-aware tin selection: EF-marked srcds traffic lands in the highest-priority CAKE tin, and per-destination-host fairness keeps every connected player on equal footing.
  • A production deployment using systemd-networkd has a one-block configuration recipe, no helper script needed.
  • Operators have a documented set of additional knobs (ingress shaping via IFB, busy_poll, GRO toggling) for cases the default baseline does not cover. None of these auto-apply.

Non-goals

  • NIC ring-buffer / IRQ pinning / RPS / RFS / hardware timestamping — already declared host-specific in the perf-baseline spec; not re-litigated here.
  • busy_poll / busy_read as defaults — non-trivial CPU cost; documented as opt-in.
  • Ingress shaping via IFB as a default — only matters if egress CAKE turns out load-bearing and ingress is also saturated; documented as opt-in.
  • Real-time scheduling, governor changes — already declined by the perf-baseline spec.
  • Blueprint-side game settings (sv_minrate, sv_maxrate, tickrate, fps_max) — owned by the server maintainer.
  • Auto-detection or measurement of uplink bandwidth. CAKE only shapes correctly when its declared bandwidth sits below the real bottleneck; the operator must measure once and configure.
  • Iface-flap watchdog. tc qdisc replace is idempotent; on prod, systemd-networkd reapplies CAKE across iface lifecycle events. On test, systemctl restart left4me-cake.service is the documented recovery.

Background

Current state (commit 62d6d4c or thereabouts):

  • The perf-baseline spec ships /etc/sysctl.d/99-left4me.conf with rmem_max, wmem_max, rmem_default, wmem_default, netdev_max_backlog, netdev_budget, vm.swappiness. No per-socket UDP minimums, no default-qdisc directive, no TCP congestion-control setting.
  • srcds_run runs as system user left4me. srcds itself does not set IP_TOS or SO_PRIORITY, so its UDP packets leave the host with DSCP 0 and priority 0 — indistinguishable from any other UDP traffic to any qdisc.
  • The deploy ships nftables-relevant infrastructure only via package defaults (Debian Trixie ships nftables in base, but no left4me table is created).
  • No qdisc is explicitly configured. The kernel's per-iface default applies — fq_codel on Trixie, but only because Debian's default has been fq_codel since Buster.
  • The deploy script already copies sysctl drop-ins and runs sysctl --system (deploy/deploy-test-server.sh:196).

Design

Sysctl additions to 99-left4me.conf

Append to deploy/files/etc/sysctl.d/99-left4me.conf:

# Per-socket UDP buffer floors: protect game-server sockets that don't bump
# their own SO_RCVBUF/SO_SNDBUF when softirq drains lag briefly.
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384

# Default qdisc for ifaces we don't explicitly shape with CAKE. Debian
# Trixie already defaults to fq_codel; setting it explicitly is
# belt-and-suspenders and survives kernel-default churn.
net.core.default_qdisc = fq_codel

# TCP congestion control: BBR for any bulk TCP egress on the host (admin
# SSH, backups, package fetches, web-app responses) so a long flow does
# not push the bottleneck queue ahead of game UDP. UDP srcds is
# unaffected.
net.ipv4.tcp_congestion_control = bbr

The deploy already runs sysctl --system after copying the conf (deploy/deploy-test-server.sh:198); no script change required for this block.
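
As a minimal sketch of how an operator might confirm the drop-in carries the new settings before relying on them, the helper below parses a sysctl conf file for one expected key/value pair. The name conf_has_setting is hypothetical and not part of the deploy; the runtime value can be checked separately with sysctl -n <key>.

```shell
# Hypothetical helper: succeed if a sysctl drop-in declares "key = value".
# $1 = conf file, $2 = key, $3 = expected value
conf_has_setting() {
    awk -F= -v key="$2" -v want="$3" '
        /^[ \t]*#/ { next }                      # skip comment lines
        NF >= 2 {
            k = $1; v = $2
            gsub(/^[ \t]+|[ \t]+$/, "", k)       # trim whitespace around key
            gsub(/^[ \t]+|[ \t]+$/, "", v)       # trim whitespace around value
            if (k == key && v == want) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Example use (path taken from this spec):
#   conf_has_setting /etc/sysctl.d/99-left4me.conf net.core.default_qdisc fq_codel
```

A commented-out setting does not count as present, which matches how sysctl --system would treat the file.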

nftables packet marking

New file deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft:

table inet left4me_mark {
    chain mangle_output {
        type filter hook output priority mangle; policy accept;
        meta skuid "left4me" meta l4proto udp ip dscp set ef meta priority set 0006:0000
        meta skuid "left4me" meta l4proto udp ip6 dscp set ef meta priority set 0006:0000
    }
}

Per-element rationale:

  • meta skuid "left4me" — every srcds instance runs as that user. The build sandbox runs under a different uid and never matches; the web app also runs as left4me, but it speaks TCP and is excluded by the protocol match below.
  • meta l4proto udp — skips anything that is not UDP, including the future RCON/HTTP TCP traffic from the web app.
  • ip dscp set ef / ip6 dscp set ef — DSCP EF (Expedited Forwarding, decimal 46) is the standard low-latency marking. CAKE's diffserv4 preset routes EF into its highest-priority "Voice" tin. Two rules, one per L3 family, because in an inet table the ip matcher only fires on v4 and ip6 only on v6.
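  • meta priority set 0006:0000 — sets skb->priority to class 6:0 for qdiscs that classify on skb priority. CAKE treats skb->priority as a tin override only when its major number equals the CAKE qdisc's own handle and its minor number names a tin, so with the auto-assigned handle from tc qdisc replace … root cake, tin selection falls through to the DSCP mark. Set inline with the DSCP rule so a single rule-match runs both statements.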

The table is named left4me_mark and lives in its own inet namespace. It does not touch, depend on, or conflict with any nftables config the operator may run independently. nft -f loads the file; nft delete table inet left4me_mark cleanly removes it.
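
When checking the marks in a packet capture or a qdisc dump, the values usually appear as raw numbers, not as the names ef or 6:0. The sketch below shows the arithmetic, assuming the standard layouts: DSCP occupies the upper six bits of the TOS / Traffic Class byte, and a tc handle packs major and minor into the upper and lower 16 bits of skb->priority. Both function names are illustrative only.

```shell
# dscp_to_tos: shift a 6-bit DSCP codepoint past the two ECN bits to get
# the 8-bit TOS / Traffic Class byte as seen on the wire.
dscp_to_tos() {
    printf '0x%02x\n' $(( $1 << 2 ))
}

# handle_to_u32: pack a tc-style major:minor pair (given as decimal
# arguments) into the 32-bit value stored in skb->priority.
handle_to_u32() {
    printf '0x%08x\n' $(( ($1 << 16) | $2 ))
}

# dscp_to_tos 46      # EF: prints 0xb8
# handle_to_u32 6 0   # 6:0: prints 0x00060000
```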

New unit deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service:

[Unit]
Description=left4me nftables packet marking (DSCP EF + priority for srcds)
Wants=network-pre.target
Before=network-pre.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/nft -f /usr/local/lib/left4me/nft/left4me-mark.nft
ExecStop=/usr/sbin/nft delete table inet left4me_mark

[Install]
WantedBy=multi-user.target

Wants=network-pre.target with Before=network-pre.target is the ordering systemd.special(7) prescribes for firewall-style units: network management services order themselves after network-pre.target, so the rules load before any iface is configured and the very first packet srcds emits post-boot is already marked. Ordering the unit After=network-pre.target would instead let it race with network bring-up.

Deploy script changes:

  • Ensure nftables is installed (apt-get install -y nftables; idempotent — package is in Trixie base).
  • Create /usr/local/lib/left4me/nft/ and copy left4me-mark.nft into it.
  • Copy the unit, daemon-reload, systemctl enable --now left4me-nft-mark.service.

CAKE egress shaper — test deploy mechanism

Three files plus deploy-script changes. All operator-tunable knobs go in the env file; the helper and unit are static.

deploy/files/etc/left4me/cake.env (template; deploy installs only if absent so operator edits survive re-runs):

# Uplink bandwidth in Mbit/s. Set to ~95% of the measured upload rate,
# the direction this egress shaper controls. CAKE only shapes correctly
# when its declared bandwidth sits below the real bottleneck. If unset,
# the left4me-cake.service unit logs a warning and exits 0 (no shaping).
LEFT4ME_UPLINK_MBIT=

# Egress interface. If unset, auto-detected from the IPv4 default route.
LEFT4ME_UPLINK_IFACE=

deploy/files/usr/local/libexec/left4me/left4me-apply-cake (mode 0755, owner root:root). The helper takes a single argument — apply or clear — so the unit's ExecStart and ExecStop both call the same script and the unit file stays free of shell escaping:

#!/bin/sh
set -eu

mode=${1:-apply}

if [ -r /etc/left4me/cake.env ]; then
    . /etc/left4me/cake.env
fi

resolve_iface() {
    if [ -n "${LEFT4ME_UPLINK_IFACE:-}" ]; then
        printf '%s' "$LEFT4ME_UPLINK_IFACE"
        return
    fi
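    # Scan for the "dev" token rather than assuming field 5, so default
    # routes without a "via" hop (e.g. "default dev ppp0") still resolve.
    ip -4 route show default | awk '/^default/ { for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'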
}

case "$mode" in
    apply)
        if [ -z "${LEFT4ME_UPLINK_MBIT:-}" ]; then
            echo "left4me-cake: LEFT4ME_UPLINK_MBIT unset; skipping shaper" >&2
            exit 0
        fi
        iface=$(resolve_iface)
        if [ -z "$iface" ]; then
            echo "left4me-cake: cannot determine egress iface; skipping" >&2
            exit 0
        fi
        exec tc qdisc replace dev "$iface" root cake \
            bandwidth "${LEFT4ME_UPLINK_MBIT}mbit" \
            internet diffserv4 dual-dsthost
        ;;
    clear)
        iface=$(resolve_iface)
        if [ -z "$iface" ]; then
            exit 0
        fi
        tc qdisc del dev "$iface" root 2>/dev/null || true
        ;;
    *)
        echo "usage: $0 [apply|clear]" >&2
        exit 2
        ;;
esac

tc qdisc replace is idempotent: replaces an existing root qdisc on the iface, adds one if absent. Re-running the unit any time is safe. clear swallows the "no such qdisc" error so stop is also idempotent.

Fail-soft on missing config matches the perf-baseline philosophy — the deploy does not refuse to boot servers because the operator has not yet filled in LEFT4ME_UPLINK_MBIT. The journal warning surfaces the gap.
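
The helper passes LEFT4ME_UPLINK_MBIT to tc unchecked, so a typo in the env file (for example 4x5) would surface as a tc parse error, not as the fail-soft warning. As a sketch of one way to extend the same fail-soft behaviour to malformed values, the apply branch could call a guard like the one below before tc. The function name is hypothetical and the spec does not require this check.

```shell
# is_positive_int: succeed only for a non-empty, all-digit string whose
# numeric value is greater than zero.
is_positive_int() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -gt 0 ]
}

# Possible use inside the apply branch:
#   if ! is_positive_int "${LEFT4ME_UPLINK_MBIT:-}"; then
#       echo "left4me-cake: LEFT4ME_UPLINK_MBIT is not a positive integer; skipping" >&2
#       exit 0
#   fi
```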

deploy/files/usr/local/lib/systemd/system/left4me-cake.service:

[Unit]
Description=left4me CAKE egress shaper
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=-/etc/left4me/cake.env
ExecStart=/usr/local/libexec/left4me/left4me-apply-cake apply
ExecStop=/usr/local/libexec/left4me/left4me-apply-cake clear

[Install]
WantedBy=multi-user.target

Per-flag rationale for the cake invocation:

  • bandwidth ${LEFT4ME_UPLINK_MBIT}mbit — operator-declared, ≈95% of measured uplink. CAKE only shapes if its declared bandwidth is below the real bottleneck; setting it slightly low moves the queue into a place the host controls.
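  • internet — RTT preset, not an overhead keyword: it tunes CAKE's AQM for a typical 100 ms internet round trip and is also CAKE's default, so stating it only makes the choice explicit. Link-layer overhead compensation (tc-cake(8) keywords such as ethernet, docsis, pppoe-ptm) is a separate setting that this invocation leaves at its default; add one if the uplink's encapsulation is known.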
  • diffserv4 — four-tier DSCP-aware tin selection. Reads the EF marks set by the nftables rule and routes srcds packets into the highest-priority "Voice" tin. Without diffserv4, the marks are ignored.
  • dual-dsthost — egress fairness keyed on destination host. With ≥2 players connected, each player gets fair share regardless of how chatty the server is to any single client.
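
The ≈95% rule for the bandwidth value is simple integer arithmetic. A minimal sketch, assuming the operator has a measured uplink rate in whole Mbit/s (shaper_mbit is an illustrative name, not a shipped helper):

```shell
# shaper_mbit: print 95% of a measured uplink rate, rounded down, as the
# value to put in LEFT4ME_UPLINK_MBIT.
# $1 = measured uplink in Mbit/s (integer)
shaper_mbit() {
    echo $(( $1 * 95 / 100 ))
}

# shaper_mbit 500   # prints 475
# shaper_mbit 505   # prints 479
```

Rounding down errs on the side of a slightly lower declared bandwidth, which keeps the queue on the host side of the bottleneck.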

Iface-flap behaviour: the kernel keeps the qdisc on an iface across link-down/link-up while the iface itself exists. If the iface is recreated (e.g., NetworkManager reconfiguration), systemctl restart left4me-cake.service reapplies. Documented; no auto-watchdog in v1.
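
To decide whether a restart is needed after an iface event, an operator can check which qdisc currently sits at the root. The sketch below parses the output of tc qdisc show dev <iface> from stdin; the function name is hypothetical, and the sample line in the usage comment is illustrative of the usual output shape, not captured from a real host.

```shell
# root_qdisc_kind: read `tc qdisc show dev <iface>` output on stdin and
# print the kind of the root qdisc (for example "cake" or "fq_codel").
# Prints nothing if no line describes a root qdisc.
root_qdisc_kind() {
    awk '$1 == "qdisc" {
        for (i = 3; i <= NF; i++)
            if ($i == "root") { print $2; exit }
    }'
}

# Example use:
#   tc qdisc show dev eth0 | root_qdisc_kind
# With a line such as "qdisc cake 8001: root refcnt 2 bandwidth 475Mbit ..."
# the function prints "cake".
```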

Deploy script changes (in deploy/deploy-test-server.sh):

  • Copy cake.env to /etc/left4me/cake.env only if absent (do not clobber operator edits).
  • Copy left4me-apply-cake to /usr/local/libexec/left4me/, mode 0755, owner root:root.
  • Copy left4me-cake.service to /usr/local/lib/systemd/system/.
  • systemctl daemon-reload (already done in the existing flow).
  • systemctl enable --now left4me-cake.service.
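
The copy-only-if-absent step is the one piece of this list with logic in it. A minimal sketch of the guard, written as a generic function for illustration; the function name and the 0644 mode are assumptions, and because the artifact test looks for the literal [ -e /etc/left4me/cake.env ] guard, the real script would likely inline the path instead of calling a helper.

```shell
# install_if_absent: copy a template into place only when the destination
# does not exist yet, so operator edits survive repeated deploys.
# $1 = source template, $2 = destination path
install_if_absent() {
    if [ -e "$2" ]; then
        return 0                 # keep the existing (possibly edited) file
    fi
    install -m 0644 "$1" "$2"
}

# Example use with the paths from this spec:
#   install_if_absent deploy/files/etc/left4me/cake.env /etc/left4me/cake.env
```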

CAKE egress shaper — production deployment (systemd-networkd)

On hosts running systemd-networkd, the CAKE configuration belongs in the matching .network file. systemd-networkd reapplies it across iface lifecycle events, addressing the only fragility of the test-deploy oneshot.

Document in deploy/README.md Performance section:

# /etc/systemd/network/<your-uplink>.network
[CAKE]
Bandwidth=480M
RTTSec=100ms
PriorityQueueingPreset=diffserv4
FlowIsolationMode=dual-dst-host

Directive names follow systemd.network(5). Values mirror the test deploy's tc invocation:

  • Bandwidth=480M — placeholder; operator sets to ≈95% of measured uplink in their actual .network.
  • RTTSec=100ms — equivalent of the internet keyword, which is CAKE's 100 ms RTT preset and also its default.
  • PriorityQueueingPreset=diffserv4 — equivalent of diffserv4.
  • FlowIsolationMode=dual-dst-host — equivalent of dual-dsthost.

The nftables marking from the previous section ships unchanged on prod; it is qdisc-installer-agnostic.

The test-deploy oneshot must not be installed on a host running systemd-networkd, where the [CAKE] section owns the qdisc. v1 does not enforce this with a gate because production hosts do not run the test-deploy script. If that boundary blurs in the future, add a check in left4me-apply-cake for systemctl is-active systemd-networkd and skip cleanly.

Documented escape hatches

Append to deploy/README.md Performance section, alongside the existing governor / CPU-affinity / NIC entries:

  • Ingress shaping via IFB. Egress CAKE alone does not protect srcds receive against ingress saturation (large workshop downloads, package fetches arriving at line rate). One-liner template using modprobe ifb, ip link set ifb0 up, tc qdisc add dev ifb0 root cake bandwidth Xmbit ingress diffserv4 dual-srchost, and a tc filter redirect from the uplink iface. Worth flipping only when measurement shows ingress hurting receive; in v1 we have no such measurement, so it stays documented.
  • net.core.busy_poll = 50 / net.core.busy_read = 50. Reduces UDP receive median latency by polling for incoming packets briefly at syscall boundaries. Cost: measurable CPU per syscall under load. Worth flipping if a host is dedicated to game serving and CPU headroom is plentiful.
  • ethtool -K <iface> gro off. Some Source-engine ops disable generic receive offload to avoid receive-side coalescing latency. Hardware/driver dependent. Document, do not ship.

These three entries follow the existing escape-hatch style: a one-liner or short config block, plus one sentence on when it matters.
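
For the IFB entry, the README template could take roughly the following shape. This is a sketch under assumptions, not the shipped text: the function only prints the commands so an operator can review them before running anything as root, ifb0 is assumed to be the device that modprobe ifb creates, and the matchall redirect filter is one common way to steer ingress traffic to the IFB device.

```shell
# ifb_ingress_plan: print (do not run) the commands for an IFB-based
# ingress shaper. $1 = uplink iface, $2 = ingress bandwidth in Mbit/s.
ifb_ingress_plan() {
    cat <<EOF
modprobe ifb
ip link set ifb0 up
tc qdisc add dev ifb0 root cake bandwidth ${2}mbit ingress diffserv4 dual-srchost
tc qdisc add dev $1 handle ffff: ingress
tc filter add dev $1 parent ffff: protocol all matchall action mirred egress redirect dev ifb0
EOF
}

# Example use: review the plan, then pipe it to a root shell if it looks right.
#   ifb_ingress_plan eth0 450
```

Printing the plan keeps the escape hatch opt-in: nothing changes on the host until the operator executes the reviewed commands.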

Files changed / added

deploy/files/etc/sysctl.d/99-left4me.conf                                 (modified — block added)
deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft                   (new)
deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service        (new)
deploy/files/etc/left4me/cake.env                                         (new — template, deploy preserves operator edits)
deploy/files/usr/local/libexec/left4me/left4me-apply-cake                 (new)
deploy/files/usr/local/lib/systemd/system/left4me-cake.service            (new)
deploy/deploy-test-server.sh                                              (modified — install+enable nft and cake units, conditional copy of cake.env)
deploy/README.md                                                          (modified — Network shaping subsection + 3 new escape hatches)
deploy/tests/test_deploy_artifacts.py                                     (modified — assertions for all artifacts above)

Tests

Following the existing assert "key=value" in text pattern in deploy/tests/test_deploy_artifacts.py:

Sysctl block (extension of the existing perf-baseline assertions):

  • Each of net.ipv4.udp_rmem_min = 16384, net.ipv4.udp_wmem_min = 16384, net.core.default_qdisc = fq_codel, net.ipv4.tcp_congestion_control = bbr is asserted as a separate line.

nftables marking artifacts:

  • left4me-mark.nft ships with table inet left4me_mark, chain mangle_output, meta skuid "left4me", ip dscp set ef, ip6 dscp set ef, and meta priority set 0006:0000 each asserted as separate substring matches. (DSCP and priority statements appear inline on the same rule per L3 family; substring assertions don't depend on rule layout.)
  • left4me-nft-mark.service has ExecStart=/usr/sbin/nft -f /usr/local/lib/left4me/nft/left4me-mark.nft, ExecStop=/usr/sbin/nft delete table inet left4me_mark, Type=oneshot, RemainAfterExit=yes, WantedBy=multi-user.target.
  • deploy-test-server.sh invokes systemctl enable --now left4me-nft-mark.service (or equivalent at-deploy enabling step).

CAKE artifacts:

  • cake.env template contains the literal lines LEFT4ME_UPLINK_MBIT= and LEFT4ME_UPLINK_IFACE= (commented or uncommented; matched as substring).
  • left4me-apply-cake contains the literals tc qdisc replace, cake, bandwidth, internet, diffserv4, dual-dsthost, LEFT4ME_UPLINK_MBIT, LEFT4ME_UPLINK_IFACE.
  • left4me-apply-cake is mode 0755 after deploy (asserted via the same mechanism the existing helper-script tests use).
  • left4me-cake.service contains EnvironmentFile=-/etc/left4me/cake.env, ExecStart=/usr/local/libexec/left4me/left4me-apply-cake apply, ExecStop=/usr/local/libexec/left4me/left4me-apply-cake clear, Wants=network-online.target, Type=oneshot, WantedBy=multi-user.target.
  • deploy-test-server.sh invokes systemctl enable --now left4me-cake.service.
  • deploy-test-server.sh copies cake.env only when target absent (asserted by literal substring of the guarding [ -e /etc/left4me/cake.env ] test or equivalent).

No runtime networking tests in v1. The artifacts are static; their runtime behaviour requires a real iface and a real bandwidth load, which the operator measures.

Rollout

Single deploy. After the new sysctl block lands, sysctl --system applies it immediately (already in the deploy flow). The two new systemd units start on systemctl enable --now; CAKE without a configured LEFT4ME_UPLINK_MBIT logs a warning and no-ops, which is the expected fresh-deploy state. The operator measures their uplink, edits /etc/left4me/cake.env, and runs systemctl restart left4me-cake.service.

Already-running game servers need no restart. Marking applies to every packet emitted from the moment the nft rule loads, so existing srcds instances pick up DSCP and priority immediately.

Open questions

None blocking. v2 candidates if measurement justifies them:

  • A LEFT4ME_INGRESS_MBIT knob that flips on the IFB ingress shaper as a default, conditional on the env value being set.
  • A left4me-net-doctor helper that reports current qdisc, applied marks, and a one-shot saturation+ping measurement against a local endpoint.
  • A small Python wrapper in l4d2host that reads cake.env for display in the web UI, so the operator sees in one place whether shaping is active.

References

  • tc-cake(8) — keyword semantics: bandwidth, internet, diffserv4, dual-dsthost, tin priority mapping.
  • systemd.network(5), [CAKE] section directives: Bandwidth=, RTTSec=, PriorityQueueingPreset=, FlowIsolationMode=.
  • nft(8): meta skuid, meta priority, ip dscp set, table isolation semantics.
  • RFC 3246 — Expedited Forwarding (EF) PHB.
  • Linux kernel Documentation/networking/tcp_bbr.txt — BBR pairs with fq / fq_codel for correct pacing.
  • docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md — sibling spec; this spec extends 99-left4me.conf and reuses the same deploy-test-artifact pattern.