left4me/docs/superpowers/plans/2026-05-09-l4d2-cpu-isolation.md
mwiegand c91c029c38
docs(plans): l4d2 cpu isolation — implementation plan
Two TDD tasks: deploy-script cpuset block + tests, README
"CPU isolation" subsection. Operator-side smoke test in F.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:03:37 +02:00


L4D2 CPU Isolation Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Constrain every cgroup that isn't a live game server to core 0; give game servers cores 1..N-1 exclusively, scaled automatically across host sizes.

Architecture: Four 99-left4me-cpuset.conf drop-ins under /etc/systemd/system/{system,user,l4d2-build,l4d2-game}.slice.d/, generated by the deploy script with printf and written via install. LEFT4ME_SYSTEM_CPUS (default 0) and LEFT4ME_GAME_CPUS (default 1-$((NPROC-1))) are env-var overrides. Single-core hosts skip the cpuset writes with a warning unless either override is set.

Tech Stack: systemd cgroup-v2 AllowedCPUs= directive, shell printf piped into install, Linux nproc(1), pytest text-assertion tests.

Spec: docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md
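
To make the layout in the Architecture paragraph concrete, here is a minimal Python sketch of the four drop-in paths and the content each one would carry. The helper name `render_dropins` is hypothetical and not part of the repo (the deploy script itself is shell); the paths, the file name, and the `[Slice]` / `AllowedCPUs=` body are taken from this plan.

```python
from pathlib import PurePosixPath
from typing import Dict

# Slices pinned to the "system" CPU set; l4d2-game.slice gets the game set.
SYSTEM_SLICES = ("system", "user", "l4d2-build")
GAME_SLICE = "l4d2-game"
DROPIN_NAME = "99-left4me-cpuset.conf"


def render_dropins(system_cpus: str, game_cpus: str) -> Dict[str, str]:
    """Map each drop-in path to the file content the plan describes.

    Hypothetical helper for illustration only.
    """
    base = PurePosixPath("/etc/systemd/system")
    dropins = {}
    for name in SYSTEM_SLICES:
        path = base / f"{name}.slice.d" / DROPIN_NAME
        dropins[str(path)] = f"[Slice]\nAllowedCPUs={system_cpus}\n"
    game_path = base / f"{GAME_SLICE}.slice.d" / DROPIN_NAME
    dropins[str(game_path)] = f"[Slice]\nAllowedCPUs={game_cpus}\n"
    return dropins
```

Calling `render_dropins("0", "1-7")` models the defaults on an 8-core host: three files carrying `AllowedCPUs=0` and one, for `l4d2-game.slice`, carrying `AllowedCPUs=1-7`.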


File Structure

Files to modify:

  • deploy/deploy-test-server.sh — compute NPROC, default LEFT4ME_SYSTEM_CPUS=0 / LEFT4ME_GAME_CPUS=1-$((NPROC-1)), write four drop-in files. Skip when nproc < 2 (with stderr warning) unless either env var is set explicitly.
  • deploy/README.md — append a "CPU isolation" subsection inside the existing "Performance Tuning" section.
  • deploy/tests/test_deploy_artifacts.py — new test functions.

No host library or web app changes.


Pre-flight

  • Step 0a: Verify clean working tree

Run: git status
Expected: nothing to commit, working tree clean

  • Step 0b: Verify the existing deploy tests are at the known-good baseline

Run: cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q
Expected: 35 passed, 1 failed (the pre-existing unrelated test_deploy_script_has_safe_defaults_and_preserves_state).

If the count differs, stop and report the discrepancy; this plan assumes that exact baseline.


Task 1: Deploy-script CPU-isolation block + tests

Write the four drop-ins from the deploy script in one cohesive block. The block computes NPROC once, resolves both env vars (with defaults), guards single-core hosts, and writes each drop-in via the existing install -m 0644 -o root -g root pattern. Tests cover defaults, overrides, single-core skip, and drop-in paths.

Files:

  • Modify: deploy/deploy-test-server.sh

  • Modify: deploy/tests/test_deploy_artifacts.py (new test function)

  • Step 1.1: Add the failing test

Open deploy/tests/test_deploy_artifacts.py and append (after the test_deploy_script_installs_perf_artifacts from the perf-baseline branch):

def test_deploy_script_writes_cpuset_drop_ins():
    script = DEPLOY_SCRIPT.read_text()

    # Reads nproc and binds defaults via ${VAR:-...}.
    assert "nproc" in script
    assert "LEFT4ME_SYSTEM_CPUS" in script
    assert "LEFT4ME_GAME_CPUS" in script
    assert "${LEFT4ME_SYSTEM_CPUS:-0}" in script
    # Default game-core expression: 1-(nproc-1). Match the form the
    # implementer chose; both `1-$((NPROC-1))` and `1-$((nproc-1))` are
    # acceptable as long as the upper bound is computed from nproc.
    assert ("1-$((NPROC-1))" in script) or ("1-$((nproc-1))" in script) \
        or ("LEFT4ME_GAME_CPUS:-1-" in script)

    # All four drop-in paths.
    for slice_name in ("system", "user", "l4d2-build", "l4d2-game"):
        assert f"/etc/systemd/system/{slice_name}.slice.d/99-left4me-cpuset.conf" in script

    # Drop-ins use the existing install pattern.
    assert "install -m 0644 -o root -g root" in script

    # Single-core host: skip with a warning to stderr.
    # Match either an explicit `nproc < 2` / `-lt 2` guard or `[ "$nproc" -ge 2 ]` form.
    assert ("nproc" in script) and (("-lt 2" in script) or ("-ge 2" in script) or ("< 2" in script))
    # The needle must be lowercase too: script.lower() never contains "CPU".
    assert "skipping cpu isolation" in script.lower() or "skip cpu isolation" in script.lower()
  • Step 1.2: Run the new test, verify it fails

Run: cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_writes_cpuset_drop_ins -v
Expected: FAIL, because none of the new strings exist yet.

  • Step 1.3: Edit the deploy script — add the cpuset block

Open deploy/deploy-test-server.sh. Find the block that copies the slice files (added in the perf-baseline branch, around lines 139-140):

$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice

Immediately after that pair, before any of the helper-script copies that follow, insert this block:

# CPU isolation via cgroup-v2 AllowedCPUs= drop-ins. Pin everything that
# isn't a live game server to core 0; give game servers cores 1..N-1.
# See docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md.
NPROC=$(nproc)
SYSTEM_CPUS=${LEFT4ME_SYSTEM_CPUS:-0}
if [ "${LEFT4ME_GAME_CPUS+x}" = x ]; then
    GAME_CPUS=$LEFT4ME_GAME_CPUS
else
    GAME_CPUS="1-$((NPROC-1))"
fi
if [ "$NPROC" -lt 2 ] && [ -z "${LEFT4ME_SYSTEM_CPUS+x}${LEFT4ME_GAME_CPUS+x}" ]; then
    printf 'left4me deploy: skipping CPU isolation (nproc=%s); cpuset drop-ins not written.\n' "$NPROC" >&2
else
    for slice_name in system user l4d2-build; do
        $sudo_cmd mkdir -p "/etc/systemd/system/${slice_name}.slice.d"
        printf '[Slice]\nAllowedCPUs=%s\n' "$SYSTEM_CPUS" \
            | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
              "/etc/systemd/system/${slice_name}.slice.d/99-left4me-cpuset.conf"
    done
    $sudo_cmd mkdir -p "/etc/systemd/system/l4d2-game.slice.d"
    printf '[Slice]\nAllowedCPUs=%s\n' "$GAME_CPUS" \
        | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
          "/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf"
fi

Notes for the implementer:

  • The single-core skip only triggers when neither override is set. If the operator sets either LEFT4ME_SYSTEM_CPUS or LEFT4ME_GAME_CPUS explicitly on a single-core host, honor their intent.

  • install -m 0644 -o root -g root /dev/stdin <dest> is the idiomatic way to install a small generated file from a pipeline (matches the existing pattern for sandbox-resolv.conf, just with /dev/stdin as source).

  • The mkdir -p for each .d directory is required: install (without -D) does not create missing parent directories, and systemd only picks up drop-ins that exist under the unit's .d/ directory.
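
The notes above, together with the shell block, amount to a small decision table. Below is a minimal Python model of it, assuming the environment is passed in as a mapping. `resolve_cpusets` is a hypothetical name used only for illustration and does not exist in the repo.

```python
from typing import Mapping, Optional, Tuple


def resolve_cpusets(nproc: int, env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (system_cpus, game_cpus), or None when the block would skip.

    Mirrors the shell block: ${VAR:-0} treats an empty value as unset,
    while the ${VAR+x} checks only care whether the variable is set at all.
    """
    system_set = "LEFT4ME_SYSTEM_CPUS" in env
    game_set = "LEFT4ME_GAME_CPUS" in env

    # ${LEFT4ME_SYSTEM_CPUS:-0}: fall back to "0" when unset or empty.
    system_cpus = env.get("LEFT4ME_SYSTEM_CPUS") or "0"
    # ${LEFT4ME_GAME_CPUS+x}: any set value (even empty) wins over the default.
    game_cpus = env["LEFT4ME_GAME_CPUS"] if game_set else f"1-{nproc - 1}"

    # Single-core guard: skip only when neither override is set.
    if nproc < 2 and not (system_set or game_set):
        return None
    return system_cpus, game_cpus
```

Under this model, `resolve_cpusets(8, {})` gives `("0", "1-7")` and `resolve_cpusets(1, {})` gives `None`. It also shows an edge case: `resolve_cpusets(1, {"LEFT4ME_SYSTEM_CPUS": "0"})` gives `("0", "1-0")`, so an operator forcing isolation on a single-core host would probably want to set both variables rather than rely on the computed game default.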

  • Step 1.4: Verify shell syntax still parses

Run: sh -n deploy/deploy-test-server.sh
Expected: exit 0, no output.

  • Step 1.5: Run the new test and full deploy test suite

Run: cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q
Expected: 36 passed, 1 failed. The failure is the pre-existing unrelated test; the pass count rises from 35 to 36 because of the new test.

If your specific assertion forms in Step 1.1 don't match the implementation, adjust the test — but only the or branches; do not weaken the contract.
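
One way to keep the contract while tolerating spacing differences inside `$(( ))` is to match the default game-core expression with a regular expression instead of exact substrings. The sketch below is only an assumption about how such a check could look; `has_default_game_expr` is a hypothetical helper, not something in the existing test file.

```python
import re

# Matches 1-$((NPROC-1)) with optional spaces inside the arithmetic
# expansion, accepting either NPROC or nproc as the variable name.
DEFAULT_GAME_EXPR = re.compile(r"1-\$\(\(\s*(?:NPROC|nproc)\s*-\s*1\s*\)\)")


def has_default_game_expr(script_text: str) -> bool:
    """True when the script derives the upper game core from nproc."""
    return DEFAULT_GAME_EXPR.search(script_text) is not None
```

The check still requires the upper bound to be computed from nproc, so a hard-coded range such as `1-7` does not satisfy it.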

  • Step 1.6: Commit
git add deploy/deploy-test-server.sh deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest

Computes NPROC at deploy time. Defaults LEFT4ME_SYSTEM_CPUS=0 and
LEFT4ME_GAME_CPUS=1-(NPROC-1). Single-core hosts skip cpuset writes
with a stderr warning unless an env var override is set. Spec:
docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md
EOF
)"

Task 2: README "CPU isolation" subsection

Append a subsection to deploy/README.md inside the existing "Performance Tuning" section, documenting the layout, the env-var overrides, the single-core skip, and the relationship to the existing per-instance CPUAffinity= escape hatch.

Files:

  • Modify: deploy/README.md

No test for this task — README content is documentation, not contract.

  • Step 2.1: Append the CPU isolation subsection

Open deploy/README.md. Find the existing ### Per-instance CPU affinity subsection (added in the perf-baseline branch). Insert a new subsection immediately before it (so the slice-level isolation is documented before the per-instance refinement that builds on top). The new subsection content:

### CPU isolation (cores)

The deploy script writes four `AllowedCPUs=` drop-ins so that, by default, only `l4d2-game.slice` is allowed to run on cores 1..N-1; `system.slice`, `user.slice`, and `l4d2-build.slice` are pinned to core 0. Game servers thus get the host minus core 0 exclusively, the build sandbox and the web app stay on core 0, and a logged-in admin running CPU-heavy work in their shell can't steal cycles from a live match.

Override the split by setting either env var when running the deploy:

```sh
LEFT4ME_SYSTEM_CPUS="0,1" LEFT4ME_GAME_CPUS="2-7" deploy/deploy-test-server.sh deploy-user@host
```

On single-core hosts the deploy skips the cpuset drop-ins entirely and prints a warning to stderr; the rest of the perf baseline (cgroup weights, sysctls, OOM scores) still applies. To force isolation on a single-core host anyway (rarely useful), set either env var explicitly.

Per-instance `CPUAffinity=` (next subsection) composes on top of this: the per-instance value must be a subset of `l4d2-game.slice`'s `AllowedCPUs=`, which the kernel enforces.


(In the plan source, the README content above sits inside an outer fenced block tagged `markdown`. That outer fence is markdown punctuation around the prompt block and is not part of the README content; the tag is documentation-only. The inner sh code fence DOES need to be written into the README.)
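
The subset relationship stated in the README text above can be checked offline before choosing a per-instance value. The following is a minimal sketch, assuming CPU lists use only comma-separated indices and inclusive ranges (for example `0,2-7`); `parse_cpu_list` and `affinity_fits_slice` are hypothetical helpers, not part of the repo or of systemd.

```python
from typing import Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Parse a list such as "0,2-7" into a set of CPU indices.

    Handles only comma-separated indices and inclusive ranges.
    """
    cpus: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if low > high:
                raise ValueError(f"descending range: {part!r}")
            cpus.update(range(low, high + 1))
        else:
            cpus.add(int(part))
    return cpus


def affinity_fits_slice(instance_affinity: str, slice_allowed: str) -> bool:
    """True when every per-instance CPU lies inside the slice's allowed set."""
    return parse_cpu_list(instance_affinity) <= parse_cpu_list(slice_allowed)
```

For example, with the slice at `1-7`, a per-instance value of `2-3` fits, while `0-1` does not because core 0 lies outside the slice.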

  • Step 2.2: Run the full deploy test suite

Run: cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q
Expected: 36 passed, 1 failed (unchanged; README has no test).

  • Step 2.3: Commit

git add deploy/README.md
git commit -m "$(cat <<'EOF'
docs(deploy): document CPU isolation in performance-tuning section

Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS /
LEFT4ME_GAME_CPUS overrides, the single-core skip, and the
subset-of relationship with per-instance CPUAffinity=.
EOF
)"

Final Verification

  • Step F.1: Full deploy + host + web test sweep

Run: cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q
Expected: deploy 36 passed / 1 failed (pre-existing); host 111 passed / 1 skipped; web 313 passed / 1 skipped.

  • Step F.2: Working tree clean and commits in order

Run: git status && git log --oneline -5
Expected:

  • git status: clean.

  • Top of git log:

    1. docs(deploy): document CPU isolation in performance-tuning section
    2. feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest
    3. docs(plans): l4d2 cpu isolation — implementation plan
    4. docs(specs): l4d2 cpu isolation — design
  • Step F.3: Operator-side smoke test (deferred, not part of this plan)

This plan ships artifacts. Confirming systemd actually enforces AllowedCPUs= on a real Trixie host is operator-side:

deploy/deploy-test-server.sh deploy-user@example-host
ssh deploy-user@example-host '
  systemctl cat system.slice | grep AllowedCPUs
  systemctl cat l4d2-game.slice | grep AllowedCPUs
  cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
  cat /sys/fs/cgroup/l4d2-game.slice/cpuset.cpus.effective
'
# Expect on an 8-core box:
#   system.slice    → AllowedCPUs=0   → cpuset.cpus.effective = 0
#   l4d2-game.slice → AllowedCPUs=1-7 → cpuset.cpus.effective = 1-7

End-to-end behavioural test (manual, ops-side): on a 4-core host, run two L4D2 instances + a script-sandbox build simultaneously. Confirm via htop (with affinity column on) that the srcds processes only ever appear on cores 1, 2, 3 and the sandbox + web stay on core 0.
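
For hosts of other sizes, the expected strings in the smoke test can be derived rather than worked out by hand. This is a minimal sketch, assuming the plan's defaults and assuming the kernel reports `cpuset.cpus.effective` in its usual compact list form; `format_cpu_list` and `expected_effective` are hypothetical helpers for illustration.

```python
from typing import Iterable, Tuple


def format_cpu_list(cpus: Iterable[int]) -> str:
    """Collapse CPU indices into a compact list such as "0,2-4"."""
    ordered = sorted(set(cpus))
    parts = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the run while the next index is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)


def expected_effective(nproc: int) -> Tuple[str, str]:
    """Expected (system, game) strings under the plan's defaults."""
    if nproc < 2:
        raise ValueError("defaults skip CPU isolation on single-core hosts")
    return "0", format_cpu_list(range(1, nproc))
```

With these helpers, an 8-core box yields `("0", "1-7")` and the 4-core host from the behavioural test yields `("0", "1-3")`. On a 2-core host the game side collapses to a single index, `"1"`, even though the drop-in default there is written as `1-1`.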


Out of Scope (do NOT implement here)

  • Kernel isolcpus= / nohz_full= / rcu_nocbs= boot params.
  • NIC IRQ pinning automation.
  • Per-instance CPUAffinity= driven by a deploy-env knob.
  • A separate l4d2-web.slice.
  • Any web-app or host-library code changes.

If you find yourself touching any of these, stop — they belong in a separate spec.