left4me/docs/superpowers/specs/2026-05-09-l4d2-cpu-pinning-decision.md
mwiegand b62fc08127
docs(specs): l4d2 cpu pinning — decision record (deferred)
Investigated whether to hard-pin each srcds instance to a single core
within the existing AllowedCPUs=1-7 set. Modern kernels (5.13+) no
longer expose kernel.sched_migration_cost_ns or the other classic CFS
"laziness" tunables, so a global cheap-fix is unavailable. Decision
for now: trust CFS + Nice=-5 + AllowedCPUs=1-7. Per-instance
CPUAffinity= remains an opt-in escape hatch in deploy/README.md.
Documents the revisit triggers and the preferred implementation path
when the time comes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:41:40 +02:00

l4d2 cpu pinning — decision record (deferred)

Date: 2026-05-09
Status: decision (no implementation)

Question

After the lifecycle + drift fix landed (commits 8552c55, 67b5521), the question came up: with AllowedCPUs=1-7 already constraining game servers to cores 1-7, do CFS scheduler migrations within that range still cause meaningful jitter? Should we hard-pin each instance to a single core?

Investigation

The classic "lazy CFS" sysctl knob is gone on modern kernels. Verified on Trixie's running kernel 6.12 (ckn@10.0.4.128):

/sbin/sysctl -a | grep -E "sched_migration_cost|sched_min_granularity|sched_wakeup_granularity|sched_latency"
# (no output)

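kernel.sched_migration_cost_ns and the other classic CFS tunables were dropped from the sysctl interface in 5.13 (they moved to debugfs under /sys/kernel/debug/sched/, which is not a stable tuning interface), and the scheduler internals were later reworked with EEVDF (6.6). Of the classic knobs, only kernel.sched_rt_period_us / sched_rt_runtime_us remain as sysctls. There is no global "be lazy about migrations" sysctl anymore.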
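The same check can be scripted so it is repeatable across hosts. This is a minimal sketch, assuming the sysctls would appear as files under /proc/sys/kernel if they existed; the function name `present_sysctls` and the exact name list are illustrative and not part of the repository.

```python
from pathlib import Path

# Classic CFS tunables that used to live under kernel.* (names are assumptions
# based on the grep pattern above, with the usual _ns suffix).
CLASSIC_CFS_SYSCTLS = (
    "sched_migration_cost_ns",
    "sched_min_granularity_ns",
    "sched_wakeup_granularity_ns",
    "sched_latency_ns",
)


def present_sysctls(names=CLASSIC_CFS_SYSCTLS, root=Path("/proc/sys/kernel")):
    """Return the subset of `names` that exist as files under `root`."""
    return [name for name in names if (Path(root) / name).exists()]


if __name__ == "__main__":
    found = present_sysctls()
    print("still exposed:", ", ".join(found) if found else "none")
```

An empty result corresponds to the "(no output)" case of the grep above.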

Available paths

| Option | Cost | Strictness | Pays off when |
| --- | --- | --- | --- |
| Trust CFS + Nice=-5 + AllowedCPUs=1-7 (current) | None | Soft | ≤ 3 instances on 7 cores; CFS rarely migrates active CPU-bound nice<0 tasks |
| Per-instance CPUAffinity=N drop-in | Web-app machinery to write drop-ins, daemon-reload, modulo or DB-persisted assignment | Strict | ≥ 4 instances (each gets exclusive core), or measured jitter |
| isolcpus=1-7 nohz_full=1-7 rcu_nocbs=1-7 kernel cmdline | GRUB edit + reboot, host-specific | Strongest (also evicts kernel softirqs/RCU/timer ticks from game cores) | Tickrate-128 with measurable kernel-induced jitter |
| SCHED_FIFO per unit | Risky (RT misconfig can stall kernel) | Strict | Already documented as ops-side escape hatch in deploy/README.md |

Why deferring is defensible

  • The slice's AllowedCPUs=1-7 already prevents game servers from running on core 0. The open question is "do they migrate within 1-7?" Yes, CFS can migrate, but for long-running CPU-bound srcds with Nice=-5, migrations are infrequent. CFS prefers cache locality and only migrates when an idle core "steals" work or a periodic load-balance tick detects imbalance.
  • With ≤ 3 instances on 7 game cores, the load balancer rarely sees imbalance to fix.
  • Per-instance hard pinning adds non-trivial machinery (drop-in writer through left4me-systemctl, or extending instance.env + a taskset wrapper in the unit). Not warranted unless we observe a real problem.
  • deploy/README.md already documents the CPUAffinity=N per-instance drop-in as an opt-in escape hatch. An operator who measures jitter can apply it without code changes.
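For orientation, the drop-in that such a writer would have to produce is small. The sketch below is an assumption about its shape, not the content of deploy/README.md: the file name `10-cpu-affinity.conf`, the helper names, and the unit name in the usage note are hypothetical. Only the `[Service]` / `CPUAffinity=` directive and the `<unit>.d/` drop-in directory layout are standard systemd.

```python
from pathlib import Path


def render_affinity_dropin(cpu: int) -> str:
    """Return the text of a systemd drop-in pinning a service to one CPU."""
    if cpu < 0:
        raise ValueError("cpu must be non-negative")
    return f"[Service]\nCPUAffinity={cpu}\n"


def dropin_path(unit: str, systemd_dir: Path = Path("/etc/systemd/system")) -> Path:
    """Return where the drop-in for `unit` would live (file name is hypothetical)."""
    return Path(systemd_dir) / f"{unit}.d" / "10-cpu-affinity.conf"
```

For a hypothetical unit `l4d2-server@1.service`, `render_affinity_dropin(3)` would be written to `dropin_path("l4d2-server@1.service")`, followed by a daemon-reload and a restart of the unit. The chosen core should lie inside the slice's AllowedCPUs=1-7 set.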

Decision

No code change. Keep the current setup:

  • Slice-level AllowedCPUs=1-7 ensures game servers never touch core 0.
  • Nice=-5 keeps active srcds tasks weighted heavily so CFS prefers leaving them alone.
  • The CPUAffinity=N per-instance drop-in remains the documented escape hatch.

Revisit triggers

If any of these signals appears, design and implement strict per-instance pinning:

  • ≥ 4 game-server instances running simultaneously on one host.
  • A specific server reports tickrate dips / rubber-banding correlated with another instance starting or a build sandbox firing.
  • perf stat -e sched:sched_migrate_task -p <srcds-pid> shows > 1 migration/sec under load.
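Where perf is not installed, the migration-rate trigger can be approximated from procfs. This is a rough sketch under two assumptions: the kernel exposes `se.nr_migrations` in /proc/<pid>/sched (this depends on scheduler debug support being built in), and sampling the main thread alone is representative. The function names are illustrative, and this is an alternative to the perf command above, not a replacement the record prescribes.

```python
import re
import time
from pathlib import Path

_NR_MIGRATIONS = re.compile(r"^se\.nr_migrations\s*:\s*(\d+)", re.MULTILINE)


def parse_nr_migrations(sched_text: str) -> int:
    """Extract the se.nr_migrations counter from /proc/<pid>/sched contents."""
    match = _NR_MIGRATIONS.search(sched_text)
    if match is None:
        raise ValueError("se.nr_migrations not found in sched output")
    return int(match.group(1))


def migration_rate(before: int, after: int, seconds: float) -> float:
    """Migrations per second between two counter samples."""
    return (after - before) / seconds


def exceeds_trigger(rate: float, threshold: float = 1.0) -> bool:
    """True when the rate is above the revisit threshold (> 1 migration/sec)."""
    return rate > threshold


def sample_rate(pid: int, seconds: float = 10.0) -> float:
    """Sample the counter twice, `seconds` apart, and return the rate."""
    path = Path(f"/proc/{pid}/sched")
    before = parse_nr_migrations(path.read_text())
    time.sleep(seconds)
    after = parse_nr_migrations(path.read_text())
    return migration_rate(before, after, seconds)
```

Running `exceeds_trigger(sample_rate(pid))` under load gives a yes/no answer for one srcds process. The perf tracepoint remains the more direct measurement.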

When revisiting, there are two implementation paths to choose from:

  1. Modulo assignment in the host library. Read LEFT4ME_GAME_CPUS (or parse the slice's AllowedCPUs= drop-in), pick game_cpus[(int(name) - 1) % len(game_cpus)], write L4D2_CPU=N into instance.env, wrap the unit's ExecStart with taskset -c ${L4D2_CPU}. Stateless, deterministic, no DB column. Preferred.
  2. Persisted assignment. Add Server.cpu_pin column, web app picks at initialize time and stores. Survives LEFT4ME_GAME_CPUS changes (each server keeps its assigned core). Bigger ripple.
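The modulo assignment from path 1 can be sketched as follows. It assumes that instance names are numeric strings starting at "1" (as the `int(name) - 1` expression implies) and that LEFT4ME_GAME_CPUS uses the same list syntax as AllowedCPUs= (ranges and single ids separated by commas or whitespace). The helper names are illustrative.

```python
def parse_cpu_list(spec: str) -> list[int]:
    """Expand a cpuset-style list such as "1-7" or "1-3,6" into sorted CPU ids."""
    cpus: set[int] = set()
    for part in spec.replace(",", " ").split():
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pick_cpu(name: str, game_cpus: list[int]) -> int:
    """Deterministically map a numeric instance name onto one game core."""
    if not game_cpus:
        raise ValueError("no game CPUs configured")
    return game_cpus[(int(name) - 1) % len(game_cpus)]


def env_line(name: str, cpu_spec: str) -> str:
    """Render the L4D2_CPU=N line destined for instance.env."""
    return f"L4D2_CPU={pick_cpu(name, parse_cpu_list(cpu_spec))}"
```

With seven game cores, instances "1" through "7" each get their own core and instance "8" wraps around to share core 1 with instance "1". Exclusivity therefore only holds while the instance count does not exceed the number of game cores.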

Verification (no-op confirmation)

ssh ckn@10.0.4.128 'systemctl show l4d2-game.slice -p AllowedCPUs'
# expect: AllowedCPUs=1-7

ssh ckn@10.0.4.128 'cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective'
# expect: 0   (everything-not-game still pinned to core 0)

# When ≥ 1 server is running:
ssh ckn@10.0.4.128 'for p in $(pgrep srcds); do grep ^Cpus_allowed_list /proc/$p/status; done'
# expect: 1-7   (the scheduler may place the task on any of these cores at a given moment)

References

  • docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md — sibling design that introduced the AllowedCPUs=1-7 slice constraint this record builds on.
  • deploy/README.md "Performance Tuning" section — the CPUAffinity=N per-instance escape hatch.
  • Linux kernel changelog 5.13+ — removal of classic CFS tunable sysctls.