left4me/docs/superpowers/plans/2026-05-09-l4d2-cpu-isolation.md
mwiegand c91c029c38
docs(plans): l4d2 cpu isolation — implementation plan
Two TDD tasks: deploy-script cpuset block + tests, README
"CPU isolation" subsection. Operator-side smoke test in F.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:03:37 +02:00

# L4D2 CPU Isolation Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Constrain every cgroup that isn't a live game server to core 0; give game servers cores 1..N-1 exclusively, scaled automatically across host sizes.
**Architecture:** Four `99-left4me-cpuset.conf` drop-ins under `/etc/systemd/system/{system,user,l4d2-build,l4d2-game}.slice.d/`, written by the deploy script from heredocs. `LEFT4ME_SYSTEM_CPUS` (default `0`) and `LEFT4ME_GAME_CPUS` (default `1-$((NPROC-1))`) are env-var overrides. Single-core hosts skip the cpuset writes with a warning.
**Tech Stack:** systemd cgroup-v2 `AllowedCPUs=` directive, bash heredoc + `install`, Linux `nproc(1)`, pytest text-assertion tests.
**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`
---
## File Structure
Files to modify:
- `deploy/deploy-test-server.sh` — compute `NPROC`, default `LEFT4ME_SYSTEM_CPUS=0` / `LEFT4ME_GAME_CPUS=1-$((NPROC-1))`, write four drop-in files. Skip when `nproc < 2` (with stderr warning) unless either env var is set explicitly.
- `deploy/README.md` — append a "CPU isolation" subsection inside the existing "Performance Tuning" section.
- `deploy/tests/test_deploy_artifacts.py` — new test functions.
No host library or web app changes.
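To make the target layout concrete, the following Python sketch maps a pair of CPU lists to the four drop-in paths and file bodies described above. It is an illustrative model only: `expected_dropins` and the two constants are hypothetical names, not code from the repository. The paths and the `[Slice]` / `AllowedCPUs=` body follow this plan.

```python
# Illustrative model only; these names are hypothetical and do not
# exist in the left4me repository.
SYSTEM_SIDE_SLICES = ("system", "user", "l4d2-build")
GAME_SLICE = "l4d2-game"


def expected_dropins(system_cpus: str, game_cpus: str) -> dict[str, str]:
    """Return {drop-in path: file content} for the four slices."""

    def path(slice_name: str) -> str:
        return f"/etc/systemd/system/{slice_name}.slice.d/99-left4me-cpuset.conf"

    def body(cpus: str) -> str:
        return f"[Slice]\nAllowedCPUs={cpus}\n"

    # Everything that is not a live game server shares the system CPU list.
    dropins = {path(name): body(system_cpus) for name in SYSTEM_SIDE_SLICES}
    # Only the game slice receives the game CPU list.
    dropins[path(GAME_SLICE)] = body(game_cpus)
    return dropins
```

Calling `expected_dropins("0", "1-7")` yields four entries, three carrying `AllowedCPUs=0` and one (`l4d2-game.slice.d`) carrying `AllowedCPUs=1-7`, which is the default split on an 8-core host.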
---
## Pre-flight
- [ ] **Step 0a: Verify clean working tree**
Run: `git status`
Expected: `nothing to commit, working tree clean`
- [ ] **Step 0b: Verify the existing deploy tests are at the known-good baseline**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: 35 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`).
If the count differs, stop and surface — this plan assumes that exact baseline.
---
## Task 1: Deploy-script CPU-isolation block + tests
Write the four drop-ins from the deploy script in one cohesive block. The block computes `NPROC` once, resolves both env vars (with defaults), guards single-core hosts, and writes each drop-in via the existing `install -m 0644 -o root -g root` pattern. Tests cover defaults, overrides, single-core skip, and drop-in paths.
**Files:**
- Modify: `deploy/deploy-test-server.sh`
- Modify: `deploy/tests/test_deploy_artifacts.py` (new test function)
- [ ] **Step 1.1: Add the failing test**
Open `deploy/tests/test_deploy_artifacts.py` and append (after the `test_deploy_script_installs_perf_artifacts` from the perf-baseline branch):
```python
def test_deploy_script_writes_cpuset_drop_ins():
    script = DEPLOY_SCRIPT.read_text()
    # Reads nproc and binds defaults via ${VAR:-...}.
    assert "nproc" in script
    assert "LEFT4ME_SYSTEM_CPUS" in script
    assert "LEFT4ME_GAME_CPUS" in script
    assert "${LEFT4ME_SYSTEM_CPUS:-0}" in script
    # Default game-core expression: 1-(nproc-1). Spaces are stripped before
    # matching, so `1-$((NPROC-1))` and `1-$((NPROC - 1))` both pass as long
    # as the upper bound is computed from nproc.
    compact = script.replace(" ", "")
    assert ("1-$((NPROC-1))" in compact) or ("1-$((nproc-1))" in compact) \
        or ("LEFT4ME_GAME_CPUS:-1-" in script)
    # All four drop-in paths, either spelled out literally or written from a
    # shell loop over ${slice_name}.
    looped = "/etc/systemd/system/${slice_name}.slice.d/99-left4me-cpuset.conf"
    for slice_name in ("system", "user", "l4d2-build", "l4d2-game"):
        literal = f"/etc/systemd/system/{slice_name}.slice.d/99-left4me-cpuset.conf"
        assert (literal in script) or (looped in script and slice_name in script)
    # Drop-ins use the existing install pattern.
    assert "install -m 0644 -o root -g root" in script
    # Single-core host: skip with a warning to stderr.
    # Match either an explicit `nproc < 2` / `-lt 2` guard or `[ "$nproc" -ge 2 ]` form.
    assert ("nproc" in script) and (("-lt 2" in script) or ("-ge 2" in script) or ("< 2" in script))
    # script.lower() is all lowercase, so the needles must be lowercase too.
    assert "skipping cpu isolation" in script.lower() or "skip cpu isolation" in script.lower()
```
- [ ] **Step 1.2: Run the new test, verify it fails**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_writes_cpuset_drop_ins -v`
Expected: FAIL — none of the new strings exist yet.
- [ ] **Step 1.3: Edit the deploy script — add the cpuset block**
Open `deploy/deploy-test-server.sh`. Find the block that copies the slice files (added in the perf-baseline branch, around lines 139-140):
```sh
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice
```
Immediately after that pair, before any of the helper-script copies that follow, insert this block:
```sh
# CPU isolation via cgroup-v2 AllowedCPUs= drop-ins. Pin everything that
# isn't a live game server to core 0; give game servers cores 1..N-1.
# See docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md.
NPROC=$(nproc)
SYSTEM_CPUS=${LEFT4ME_SYSTEM_CPUS:-0}
# An explicitly set LEFT4ME_GAME_CPUS (even if empty) wins over the default.
if [ "${LEFT4ME_GAME_CPUS+x}" = x ]; then
    GAME_CPUS=$LEFT4ME_GAME_CPUS
else
    GAME_CPUS="1-$((NPROC-1))"
fi
# Skip only when the host is single-core AND neither override is set.
if [ "$NPROC" -lt 2 ] && [ -z "${LEFT4ME_SYSTEM_CPUS+x}${LEFT4ME_GAME_CPUS+x}" ]; then
    printf 'left4me deploy: skipping CPU isolation (nproc=%s); cpuset drop-ins not written.\n' "$NPROC" >&2
else
    for slice_name in system user l4d2-build; do
        $sudo_cmd mkdir -p "/etc/systemd/system/${slice_name}.slice.d"
        printf '[Slice]\nAllowedCPUs=%s\n' "$SYSTEM_CPUS" \
            | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
                "/etc/systemd/system/${slice_name}.slice.d/99-left4me-cpuset.conf"
    done
    $sudo_cmd mkdir -p "/etc/systemd/system/l4d2-game.slice.d"
    printf '[Slice]\nAllowedCPUs=%s\n' "$GAME_CPUS" \
        | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
            "/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf"
fi
```
Notes for the implementer:
- The single-core skip only triggers when **neither** override is set. If the operator sets either `LEFT4ME_SYSTEM_CPUS` or `LEFT4ME_GAME_CPUS` explicitly on a single-core host, honor their intent.
- `install -m 0644 -o root -g root /dev/stdin <dest>` is the idiomatic way to install a small generated file from a pipeline (matches the existing pattern for sandbox-resolv.conf, just with `/dev/stdin` as source).
- The `mkdir -p` for each `.d` directory is required: `install` without `-D` does not create missing parent directories, so the write would fail on a host where the drop-in directory does not exist yet.
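The resolution and skip logic of the shell block can be modelled in a few lines of Python, which may help when reasoning about edge cases. This is a sketch, not repository code: `resolve_cpusets` and its `env` parameter are hypothetical names, and the function only mirrors the shell conditions shown in Step 1.3.

```python
from typing import Mapping, Optional, Tuple


def resolve_cpusets(nproc: int, env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (system_cpus, game_cpus), or None when isolation is skipped."""
    # ${LEFT4ME_SYSTEM_CPUS:-0}: unset OR empty falls back to "0".
    system_cpus = env.get("LEFT4ME_SYSTEM_CPUS") or "0"
    # ${LEFT4ME_GAME_CPUS+x}: any set value, even an empty one, is used as-is.
    if "LEFT4ME_GAME_CPUS" in env:
        game_cpus = env["LEFT4ME_GAME_CPUS"]
    else:
        game_cpus = f"1-{nproc - 1}"
    # Skip only on single-core hosts where neither variable is set.
    overridden = "LEFT4ME_SYSTEM_CPUS" in env or "LEFT4ME_GAME_CPUS" in env
    if nproc < 2 and not overridden:
        return None
    return system_cpus, game_cpus
```

For example, `resolve_cpusets(8, {})` gives `("0", "1-7")` and `resolve_cpusets(1, {})` gives `None`. `resolve_cpusets(1, {"LEFT4ME_SYSTEM_CPUS": "0"})` gives `("0", "1-0")`: forcing isolation on a single-core host by setting only the system variable leaves a descending game range, so an operator in that situation would likely want to set both variables.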
- [ ] **Step 1.4: Verify shell syntax still parses**
Run: `sh -n deploy/deploy-test-server.sh`
Expected: exit 0, no output.
- [ ] **Step 1.5: Run the new test and full deploy test suite**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: 36 passed, 1 failed (the same pre-existing unrelated failure; the pass count rises from 35 to 36 because of the new test).
If your specific assertion forms in Step 1.1 don't match the implementation, adjust the test — but only the `or` branches; do not weaken the contract.
- [ ] **Step 1.6: Commit**
```bash
git add deploy/deploy-test-server.sh deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest
Computes NPROC at deploy time. Defaults LEFT4ME_SYSTEM_CPUS=0 and
LEFT4ME_GAME_CPUS=1-(NPROC-1). Single-core hosts skip cpuset writes
with a stderr warning unless an env var override is set. Spec:
docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md
EOF
)"
```
---
## Task 2: README "CPU isolation" subsection
Append a subsection to `deploy/README.md` inside the existing "Performance Tuning" section, documenting the layout, the env-var overrides, the single-core skip, and the relationship to the existing per-instance `CPUAffinity=` escape hatch.
**Files:**
- Modify: `deploy/README.md`
No test for this task — README content is documentation, not contract.
- [ ] **Step 2.1: Append the CPU isolation subsection**
Open `deploy/README.md`. Find the existing `### Per-instance CPU affinity` subsection (added in the perf-baseline branch). Insert a new subsection **immediately before** it (so the slice-level isolation is documented before the per-instance refinement that builds on top). The new subsection content:
```markdown
### CPU isolation (cores)
The deploy script writes four `AllowedCPUs=` drop-ins so that, by default, only `l4d2-game.slice` is allowed to run on cores 1..N-1; `system.slice`, `user.slice`, and `l4d2-build.slice` are pinned to core 0. Game servers thus get the host minus core 0 exclusively, the build sandbox and the web app stay on core 0, and a logged-in admin running CPU-heavy work in their shell can't steal cycles from a live match.
Override the split by setting either env var when running the deploy:
```sh
LEFT4ME_SYSTEM_CPUS="0,1" LEFT4ME_GAME_CPUS="2-7" deploy/deploy-test-server.sh deploy-user@host
```
On single-core hosts the deploy skips the cpuset drop-ins entirely and prints a warning to stderr; the rest of the perf baseline (cgroup weights, sysctls, OOM scores) still applies. To force isolation on a single-core host anyway (rarely useful), set either env var explicitly.
Per-instance `CPUAffinity=` (next subsection) composes on top of this — the per-instance value must be a subset of `l4d2-game.slice`'s `AllowedCPUs=`, which the kernel enforces.
```
(The outer triple-backticks above are markdown punctuation around this prompt block, not part of the README content. Inner code-block fences DO need to be written into the README. The `markdown` language tag on the outer fence in this plan is documentation-only.)
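The subset relationship mentioned in the README text can be checked before a per-instance value is deployed. The sketch below parses CPU lists in the `0,2-7` style used in this plan and tests whether one is contained in the other. `parse_cpu_list` and `affinity_fits_slice` are hypothetical helpers, and the parser handles only single numbers and ascending ranges separated by commas or spaces, which may not cover every syntax systemd accepts.

```python
def parse_cpu_list(spec: str) -> set[int]:
    """Parse a CPU list such as "0,2-4" into {0, 2, 3, 4}."""
    cpus: set[int] = set()
    # Treat spaces like commas so "0 2-4" parses the same as "0,2-4".
    for part in spec.replace(" ", ",").split(","):
        if not part:
            continue
        if "-" in part:
            low, high = (int(x) for x in part.split("-", 1))
            if low > high:
                raise ValueError(f"descending range: {part!r}")
            cpus.update(range(low, high + 1))
        else:
            cpus.add(int(part))
    return cpus


def affinity_fits_slice(instance_affinity: str, slice_allowed: str) -> bool:
    """True when every CPU in the per-instance list is allowed by the slice."""
    return parse_cpu_list(instance_affinity) <= parse_cpu_list(slice_allowed)
```

With the default 8-core split, `affinity_fits_slice("2-3", "1-7")` is true, while `affinity_fits_slice("0", "1-7")` is false because core 0 belongs to the system side.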
- [ ] **Step 2.2: Run the full deploy test suite**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: 36 passed, 1 failed (unchanged; README has no test).
- [ ] **Step 2.3: Commit**
```bash
git add deploy/README.md
git commit -m "$(cat <<'EOF'
docs(deploy): document CPU isolation in performance-tuning section
Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS /
LEFT4ME_GAME_CPUS overrides, the single-core skip, and the
subset-of relationship with per-instance CPUAffinity=.
EOF
)"
```
---
## Final Verification
- [ ] **Step F.1: Full deploy + host + web test sweep**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
Expected: deploy 36 passed / 1 failed (pre-existing); host 111 passed / 1 skipped; web 313 passed / 1 skipped.
- [ ] **Step F.2: Working tree clean and commits in order**
Run: `git status && git log --oneline -5`
Expected:
- `git status`: clean.
- Top of `git log`:
1. `docs(deploy): document CPU isolation in performance-tuning section`
2. `feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest`
3. `docs(plans): l4d2 cpu isolation — implementation plan`
4. `docs(specs): l4d2 cpu isolation — design`
- [ ] **Step F.3: Operator-side smoke test (deferred, not part of this plan)**
This plan ships artifacts. Confirming systemd actually enforces `AllowedCPUs=` on a real Trixie host is operator-side:
```sh
deploy/deploy-test-server.sh deploy-user@example-host
ssh deploy-user@example-host '
systemctl cat system.slice | grep AllowedCPUs
systemctl cat l4d2-game.slice | grep AllowedCPUs
cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
cat /sys/fs/cgroup/l4d2-game.slice/cpuset.cpus.effective
'
# Expect on an 8-core box:
# system.slice → AllowedCPUs=0 → cpuset.cpus.effective = 0
# l4d2-game.slice → AllowedCPUs=1-7 → cpuset.cpus.effective = 1-7
```
End-to-end behavioural test (manual, ops-side): on a 4-core host, run two L4D2 instances + a script-sandbox build simultaneously. Confirm via `htop` (with affinity column on) that the srcds processes only ever appear on cores 1, 2, 3 and the sandbox + web stay on core 0.
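The sysfs half of the smoke test can also be scripted. The following is a minimal sketch, assuming the cgroup-v2 hierarchy is mounted at `/sys/fs/cgroup` as in the commands above. `effective_cpus` and `check_isolation` are hypothetical names, and the `cgroup_root` parameter exists only so the functions can be exercised against a temporary directory.

```python
from pathlib import Path

DEFAULT_CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumed cgroup-v2 mount point


def effective_cpus(slice_name: str, cgroup_root: Path = DEFAULT_CGROUP_ROOT) -> str:
    """Read cpuset.cpus.effective for a slice, e.g. "1-7"."""
    return (cgroup_root / slice_name / "cpuset.cpus.effective").read_text().strip()


def check_isolation(expected: dict[str, str],
                    cgroup_root: Path = DEFAULT_CGROUP_ROOT) -> list[str]:
    """Return human-readable mismatches; an empty list means every slice matches."""
    problems = []
    for slice_name, want in expected.items():
        got = effective_cpus(slice_name, cgroup_root)
        # Plain string comparison: `want` must use the same list notation
        # that the kernel prints (for example "1-7", not "1,2,3,4,5,6,7").
        if got != want:
            problems.append(f"{slice_name}: expected {want!r}, got {got!r}")
    return problems
```

On the 8-core example, `check_isolation({"system.slice": "0", "l4d2-game.slice": "1-7"})` would be expected to return an empty list once the drop-ins are in effect.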
---
## Out of Scope (do NOT implement here)
- Kernel `isolcpus=` / `nohz_full=` / `rcu_nocbs=` boot params.
- NIC IRQ pinning automation.
- Per-instance `CPUAffinity=` driven by a deploy-env knob.
- A separate `l4d2-web.slice`.
- Any web-app or host-library code changes.
If you find yourself touching any of these, stop — they belong in a separate spec.