Commit graph

79 commits

Author SHA1 Message Date
mwiegand
434ee20339
refactor(deploy): venv + steam now under /var/lib/left4me
Sync deployment references for the runtime state relocation
shipped via ckn-bw (commit 6fae2fd). /opt/left4me/ is now a
root-owned deploy-artifact root (just src/); .venv and steamcmd
live at /var/lib/left4me/{.venv,steam}.

Touches:
- deploy/files/.../left4me-web.service: PATH + ExecStart
- deploy/files/.../left4me-workshop-refresh.service: WorkingDirectory
  (was /opt/left4me, now /opt/left4me/src to match the web unit),
  PATH, ExecStart
- scripts/sbin/left4me wrapper: flask path
- deploy/tests/test_example_units.py: PATH + ExecStart assertions
  for the web unit; also fix a pre-existing broken assertion that
  read "Environment=PATH=..." (the unit has Environment=HOME=...
  PATH=... on one line, so "Environment=PATH=" was never present)
  - now reads just "PATH=..."
- deploy/README.md: paths
- l4d2host/tests/test_cli.py: LEFT4ME_STEAMCMD fixture path

Design + as-shipped record:
docs/superpowers/specs/2026-05-15-runtime-state-relocation-design.md.
The original (narrower) prereq spec at
docs/superpowers/specs/2026-05-15-handoff-noneditable-install.md
is marked superseded with a pointer to what shipped + why the
scope grew (setuptools writes egg-info to source during PEP 517
build prep).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 17:56:32 +02:00
mwiegand
6cf4517a88
fix(deploy/files): drop ProcSubset=pid from web reference unit
Mirrors ckn-bw fix: ProcSubset=pid hides /proc/sys/kernel/random/boot_id,
which journalctl needs at startup; web unit invokes journalctl for
live log streaming.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 16:14:40 +02:00
mwiegand
8971b23617
refactor(sandbox): collapse l4d2-sandbox user into left4me
The hardening refactor that just landed closes the same-uid attack
surface (FS view, ptrace, /proc visibility, signals) for the web +
gameserver units via systemd directives plus system-wide
kernel.yama.ptrace_scope=2. Keeping the script-sandbox on a separate
uid was the inconsistent half-step — defense-in-depth only, with
build-time-idmap complexity attached. One principle wins: harden
once, share the uid.

scripts/libexec/left4me-script-sandbox: drop the idmap block (uid
lookups, STAGING setup, cleanup_staging trap, mount --bind
--map-users), switch User=/Group= to left4me, point BindPaths at
\$OVERLAY_DIR directly. Header comment updated to reflect
hardening-not-uid as the same-uid defense. nsenter self-wrap kept —
it's about mount-namespace escape, not uid.

Tests + comments + companion docs updated. Build-time-idmap and
overlay-idmap plans marked SUPERSEDED; user-uid-split spec revised
to "1 user is correct"; one-line update notes on the hardening
specs and the build-overlay-unit-design.

Companion ckn-bw commit removes the l4d2-sandbox user + group and
tightens /var/lib/left4me from 0711 → 0755 (the traverse-only mode
was specifically for the sandbox uid).
2026-05-15 15:50:57 +02:00
mwiegand
8e678b6765
deploy/files: annotate reference units with per-directive hardening comments
Update the educational reference copies of left4me-server@.service and
left4me-web.service to match the new hardening composition from the
ckn-bw reactor (HARDENING_COMMON + HARDENING_SERVER / HARDENING_WEB).
Per-directive comments explain each defense's purpose and the threat
it addresses, so a cold reader of this repo can understand the threat
model from the unit file alone.

Top-of-file note in each reference points at the ckn-bw reactor as
the live source; reference is hand-synced.

gunicorn ExecStart in the web reference uses placeholder
'--workers 4 --threads 4' values; live emission interpolates from
metadata. This is the documented divergence between the reference
and the deployed unit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 14:54:10 +02:00
mwiegand
5284e28af7
refactor: move privileged scripts to scripts/{libexec,sbin}/; deploy/ is reference
Pulls the 5 privileged helpers out of deploy/files/usr/local/{libexec,sbin}/
into top-level scripts/{libexec,sbin}/. They are application-inherent code
(invoked at runtime via sudo from l4d2host/l4d2web), not deploy artifacts —
the previous nesting under deploy/files/ confused source-of-truth with
install-target FHS layout.

deploy/ now means "reference exemplar": README explaining the target
layout, plus example sudoers / sysctl / sandbox-resolv.conf / env
templates / curated systemd units (the ones ckn-bw's reactor emits).
Anyone building a fresh deployment (other than ckn-bw) reads this tree.

Dead static artifacts deleted: left4me-apply-cake helper, left4me-cake
+ left4me-nft-mark service units, cake.env, left4me-mark.nft, and the
superseded deploy-test-server.sh installer.

Tests split to match the new shape:
- scripts/tests/{test_overlay,test_script_sandbox,test_systemctl_helper,
  test_journalctl_helper,test_helpers_use_fixed_paths,test_sudoers_grants}.py
  with shared fixtures in conftest.py
- deploy/tests/test_example_units.py (renamed from test_deploy_artifacts.py)
  — slimmed to lock down the curated example units, sysctl, env templates

l4d2host/tests/test_overlay_helper.py: helper-source path updated to
scripts/libexec/left4me-overlay (was building the path segment-by-segment
under deploy/files/, missed by the path-prefix grep during pre-flight).

Runtime install-target paths (/usr/local/{libexec,sbin}/) unchanged, so
l4d2host/service_control.py, l4d2web/services/overlay_builders.py, the
sudoers grants, and the systemd units all keep their existing path
references.

Requires the matching ckn-bw change to bundles/left4me/items.py
(install_left4me_scripts repointed from /opt/left4me/src/deploy/files/...
to /opt/left4me/src/scripts/...). Left4me lands first so a fresh
git_deploy exposes the new source path before the bundle apply runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 12:05:30 +02:00
mwiegand
7a25c2453c
fix(left4me-script-sandbox): self-wrap into PID 1's mount namespace
The web service runs with PrivateTmp=true, which puts it in its own
mount namespace. Worker invokes the sandbox helper via sudo from there;
the helper's pre-systemd-run `mount --bind --map-users=...` lands in
the web service's namespace. systemd-run then spawns transient units
in PID 1's namespace where the bind is invisible — the BindPaths lookup
finds an empty staging dir owned by root, and the sandbox uid hits
permission-denied on every write.

Mirror the pattern from left4me-overlay's ExecStartPre wrapper: enter
PID 1's mount namespace at the start of the helper via `nsenter
--mount=/proc/1/ns/mnt`. Sentinel env var avoids exec recursion. The
gameserver helper handles this at the unit level; the script helper
doesn't have a unit so we self-wrap.

Diagnosis: 5 failed builds all hit the same EACCES on the first
`mkdir`/`tar mkdir`. Direct SSH-sudo invocations of the same helper
succeeded because SSH-sudo doesn't inherit a private namespace; only
the worker-invoked path is affected.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 01:33:13 +02:00
mwiegand
48381089d3
refactor(left4me-overlay): move uid translation to script-sandbox build
left4me-script-sandbox now pre-creates an idmapped bind staging path
(--map-users=<left4me_uid>:<sandbox_uid>:1) and points the sandbox's
BindPaths at that staging instead of the raw overlay dir. Writes from
inside the sandbox (uid l4d2-sandbox) land on disk as left4me, so all
overlay content is uniformly left4me-owned end-to-end.

left4me-overlay loses ~165 lines of idmap-on-mount logic: the per-
lowerdir stat + idmap-bind setup, the bind-umount loop in teardown,
the uid lookup helpers, the _is_mountpoint /proc/self/mountinfo parser,
and the LEFT4ME_TEST_* env-var stubs. It's back to a simple "validate
lowerdirs, mount overlay" shape; gameserver mount path no longer needs
to know about producer-side ownership decisions.

Verified on kernel 6.12 that the kernel idmap propagates through
systemd-run's plain re-bind of the staging path. Tests dropped 4
idmap-on-mount specs and one deploy-artifact regression check; added
test_script_sandbox_uses_idmap_staging to pin the new staging path
+ map flags + trap cleanup.

The post-build world-read chmod kludge in the sandbox is also dropped:
the web app reads overlay files via its primary uid (left4me).

Existing overlays on the test server are sandbox-owned from prior runs
and need a one-shot `chown -R left4me:left4me /var/lib/left4me/overlays`
during deploy. New overlays produced by the refactored sandbox are
left4me-owned from creation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 01:20:39 +02:00
mwiegand
dd918aca4b
fix(left4me-overlay): use /proc/self/mountinfo to detect bind mounts
os.path.ismount() compares st_dev against the parent dir, which silently
returns False for same-fs bind mounts. The idmap binds at runtime/<n>/
idmap/<basename> are exactly that case, so:

- cmd_umount skipped the bind-umount step every stop, leaving orphan
  binds in PID 1's mount namespace.
- cmd_mount's idempotency check then "didn't see" the orphan and
  re-bound on top, accumulating one mount per start/stop cycle.

Findmnt nesting like
    /var/lib/left4me/runtime/2/idmap/overlays_9
    └─/var/lib/left4me/runtime/2/idmap/overlays_9
is the visible symptom. Reboot wipes everything so the bug is invisible
on a fresh boot — only stop/start cycles accumulate.

Replace both ismount sites with a _is_mountpoint() helper that reads
/proc/self/mountinfo (column 5 is the mount point). Keep os.path.ismount
for the overlay merged check, where it's reliable (distinct fs type).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 01:02:18 +02:00
mwiegand
f5e36eef79
deploy: claim /usr/local/sbin/left4me admin CLI in deploy/files
ckn-bw was shipping the admin CLI wrapper (sudo left4me <flask
subcommand>) verbatim from its own bundle copy. Move ownership of the
file into left4me so ckn-bw's upcoming install-action approach can
deploy it from deploy/files/usr/local/sbin/left4me on the deployed
git checkout, eliminating the cross-repo duplication that masked the
idmap helper update earlier.

Also re-frame deploy/README.md: deploy/files/, deploy/templates/, and
deploy/tests/ are now genuinely canonical (read by ckn-bw via
git_deploy). Only deploy-test-server.sh remains a superseded artifact.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 00:41:06 +02:00
mwiegand
f231ebcb0d
doc(deploy): clarify ckn-bw verbatim-sync workflow for shipped files
Spell out that the deploy step for changes to verbatim-shipped files
(privileged helpers, sudoers, sysctl, …) is just re-syncing the bundle's
copy + bw apply. Removes ambiguity for the idmap helper change and any
future edit within the same set.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-14 23:57:31 +02:00
mwiegand
e4101de7a5
test(deploy): assert left4me-overlay idmaps sandbox-owned lowerdirs
Guards against silent regression of the idmap bind-mount step in the
privileged kernel-overlayfs helper. Asserts --map-users / --map-groups
argv, the runtime/<name>/idmap/ target path, the LEFT4ME_TEST_* stub-
env-var names, and the collision-detection table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-14 23:56:36 +02:00
mwiegand
90531864b3
harden(left4me-overlay): fix idmap collision risk, gate test stubs on PRINT_ONLY, wrap os.stat
Issue #1: idmap target now uses parent+name (overlays_workshop instead of
workshop) to prevent basename collisions across allowlist roots; explicit
die() on collision detected in the loop.

Issue #2: env-var uid stubs (renamed to LEFT4ME_TEST_SANDBOX_UID etc.) are
only honoured when LEFT4ME_OVERLAY_PRINT_ONLY=1, so a misconfigured systemd
unit override cannot influence real uid mapping.

Issue #3: os.stat(lowerdir) is wrapped in try/except OSError with a die()
that shell-quotes the path and includes the exception, matching the helper's
existing error style.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-14 23:53:32 +02:00
mwiegand
2f6a9cfba0
feat(left4me-overlay): idmap bind mounts for l4d2-sandbox-owned lowerdirs
Insert an idmapped bind mount in front of each lowerdir whose top-level
uid matches l4d2-sandbox at overlay-mount time, so that overlayfs copy-up
produces left4me-owned upperdir entries instead of EACCES.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-14 23:48:07 +02:00
mwiegand
674c4df360
deploy: add STEAM_WEB_API_KEY to web.env template
For the live-state panel's Steam profile enrichment (persona names +
avatars). Optional: empty value disables enrichment and the panel falls
back to in-game names + placeholder avatars.

The actual web.env is materialized by the ckn-bw bundle's Mako; the
template here documents the operator-facing shape.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 22:25:03 +02:00
mwiegand
e52219b1e9
deploy: weaken refresh-timer dep on web.service from Requires to Wants 2026-05-11 23:22:42 +02:00
mwiegand
8cc7f84801
deploy: schedule daily workshop refresh via systemd timer
Adds left4me-workshop-refresh.service (oneshot, triggers flask
workshop-refresh) and left4me-workshop-refresh.timer (04:00 daily,
Persistent=true, 15 min jitter). Wires both into the deploy script's
cp and enable blocks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-11 23:20:13 +02:00
mwiegand
0811d22c44
deploy/README: mark as historical reference, point at ckn-bw
ovh.left4me is now provisioned by the ckn-bw bundle bundles/left4me/
(attached via groups/applications/left4me.py); run `bw apply
ovh.left4me` from there.

Keep this directory verbatim as deployment-knowledge reference: what
was configured, what each unit/helper does, why the privileged
boundaries are drawn the way they are. Add a top-of-README
correspondence table marking which files migrated 1:1 vs. which are
obsolete in the new architecture (CAKE moved to systemd-networkd;
nft marking moved into the central nftables bundle; systemd units
are emitted by a metadata reactor; CPU isolation drop-ins are no
longer managed declaratively).

The deploy-test-server.sh stays here too — useful as a concrete walk-
through of the install steps the bundle now performs declaratively.
Just don't run it against an ovh.left4me node managed by ckn-bw; the
two would fight over file ownership, sudoers, and unit definitions.
2026-05-10 18:25:23 +02:00
mwiegand
a987304358
fix(deploy): make iproute2 explicit + document disable recipe
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:29:22 +02:00
mwiegand
9f0b51b455
docs(deploy): document network-shaping defaults + opt-in network knobs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:09:28 +02:00
mwiegand
26f3d270b0
feat(deploy): wire nft marking + CAKE shaper into deploy script
Installs nftables via apt/dnf, copies left4me-mark.nft and left4me-apply-cake
helper into system paths, conditionally seeds cake.env (preserving operator
edits), and enables left4me-nft-mark.service + left4me-cake.service on deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:04:12 +02:00
mwiegand
a9ca90537b
feat(deploy): left4me-cake.service oneshot wrapping apply-cake helper
The CAKE egress shaper now has a systemd unit that wraps the
left4me-apply-cake helper in apply and clear modes. The unit is a
oneshot that starts after network-online and survives service restarts,
allowing the shaper to persist across reboots and be managed by systemd.
The environment file is marked non-fatal (EnvironmentFile=-) to handle
missing or incomplete configurations gracefully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:58:42 +02:00
mwiegand
878639147a
feat(deploy): left4me-apply-cake helper with apply/clear modes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:52:16 +02:00
mwiegand
d783449d05
feat(deploy): cake.env template with documented uplink knobs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:49:08 +02:00
mwiegand
fbb342db87
feat(deploy): systemd unit to load/clear left4me_mark nftables table
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:35:27 +02:00
mwiegand
076bfb72ca
feat(deploy): nftables uid-based DSCP-EF + skb-priority marking for srcds
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:32:53 +02:00
mwiegand
e822e9fbc7
feat(deploy): extend sysctls with udp_*_min, fq_codel default, BBR
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:28:24 +02:00
mwiegand
5eac51a93e
fix(deploy): wrap overlay helper with nsenter so it doesn't pin the unit's mount namespace
systemd's `+` Exec prefix removes sandbox/credentials but does NOT
detach from the unit's per-service mount namespace (created by
PrivateTmp/Protect*). The Python interpreter for the helper was
launched inside that namespace, and even though the helper internally
nsenter'd into PID 1 for the umount syscall, the calling Python
process itself never left the unit's namespace. Its existence pinned
the namespace alive, which kept the slave mount tree alive, which
made PID 1's umount return EBUSY for the entire duration of the
helper's run. The mount became unmountable the moment the helper
exited — empirically verified by polling /proc/*/ns/mnt during stop:
the only PID holding the dying namespace was the helper itself.

Wrap both ExecStartPre and ExecStopPost with `/usr/bin/nsenter
--mount=/proc/1/ns/mnt --` so the helper Python interpreter runs in
PID 1's mount namespace from the start. With the helper out of the
unit's namespace, umount succeeds first try once the cgroup empties.
Reset went from ~25 s with retry/lazy-fallback workarounds to ~0.5 s
clean.

Knock-on cleanups:
- Helper drops internal nsenter for the syscalls (already in PID 1's
  namespace), and drops the eager-retry loop + lazy-umount fallback +
  inner work_inner retry (no race left to ride out).
- Revert TimeoutStopSec=60s back to 15s.
- Tests updated to expect the new argv shapes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:13:59 +02:00
mwiegand
936c8bb81c
fix(deploy): ExecStart srcds_run from merged overlay, not installation/
srcds_run is a shell script that cd's to its own dirname before exec'ing
srcds_linux, so WorkingDirectory has no effect — the binary's path is what
determines where the engine reads gameinfo.txt and addons from. Pointing
at installation/srcds_run resolved everything against the lower layer, so
overlay-provided Metamod/SourceMod plugins and cfgs (zonemod, confogl)
never loaded. Switch to runtime/%i/merged/srcds_run so the engine sees
the merged tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 14:03:12 +02:00
mwiegand
ddf73c4d27
test(deploy): drop stale web.env lifecycle assertions
`test_deploy_script_has_safe_defaults_and_preserves_state` had been red
since commit caa8b83 ("rewrite web.env every deploy with machine-id-
derived SECRET_KEY"). Two assertions encoded the prior model:

- `if [ ! -f /etc/left4me/web.env ]` — the create-only-if-missing guard
  caa8b83 removed in favor of unconditional `install -m 0640 ...`.
- `. /etc/left4me/web.env not in script` — masked by the first failing
  but also stale: the deploy intentionally sources web.env in the
  alembic and seed-script-overlays helper subprocesses so they get
  DATABASE_URL.

Removed both. The full suite now runs 0 failed. The note left in place
points future readers at the live coverage path (install + SECRET_KEY
rewrite + run_left4me_with_env plumbing already asserted nearby).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:33:05 +02:00
mwiegand
59771f91c4
fix(deploy): drop deleted l4d2host.fs from pyproject + use nproc --all
Two bugs surfaced by the previous deploy attempt:

1. l4d2host/pyproject.toml still listed `l4d2host.fs` in the explicit
   packages= list. After deleting the fs/ package, pip install -e fails
   with "package directory './fs' does not exist".

2. The CPU-isolation deploy step uses `nproc` to detect host core count,
   but `nproc` honors Cpus_allowed of the calling shell. On a host that
   already has the cpuset drop-ins applied (system.slice/user.slice →
   AllowedCPUs=0), the SSH login lands constrained to one core and
   `nproc` returns 1 — making subsequent deploys think they're on a
   single-core box and skip the cpuset writes entirely. `nproc --all`
   reports installed processors regardless of affinity, which is what
   the deploy actually wants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:11:19 +02:00
mwiegand
ff6ce7b091
refactor(l4d2-host): unmount via ExecStopPost — single code path mirroring mount
Symmetric with the earlier mount cleanup (commits 519567e..a982995). Until
now, the unit's ExecStartPre handled mount but the Python side still drove
unmount: stop_instance and _purge_instance both called _mounter.unmount,
which wrapped sudo + the helper. Two code paths for two halves of the
same lifecycle.

Move unmount into the unit:

- ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i
  (ExecStopPost, not ExecStop, so it runs after the cgroup is cleared;
  ExecStop runs while srcds is alive and would EBUSY the umount syscall.)
- Helper's umount verb is now idempotent (mirrors mount): if merged
  isn't a mount point, return early. PRINT_ONLY mode bypasses both
  short-circuits so the unit tests still exercise the full nsenter argv.

Drop the dead Python machinery:

- _mounter.unmount(...) calls in stop_instance and _purge_instance
- _mounter global + KernelOverlayFSMounter import
- The whole l4d2host/fs/ package (OverlayMounter ABC + KernelOverlayFSMounter
  class) — no production callers, just self-tests
- l4d2host/tests/test_kernel_overlayfs.py
- test_stop_succeeds_when_unmount_fails / test_delete_succeeds_when_unmount_fails
  (tested Python-side unmount-failure tolerance that no longer exists)
- The l4d2host.fs.kernel_overlayfs.run_command monkeypatches in lifecycle tests

After this, the only thing start_instance does beyond cfg-staging is ask
systemd to enable+start the unit. stop/delete/reset only ask systemd to
disable; the overlay lifecycle lives entirely in the unit file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:09:52 +02:00
mwiegand
fc371711ec
fix(deploy): StartLimit* directives belong in [Unit], not [Service]
systemd 230+ moved StartLimitBurst= and StartLimitIntervalSec= from
[Service] into [Unit] (with the rename from StartLimitInterval=). Putting
them in [Service] makes systemd silently ignore them with a warning to
journalctl: "Unknown key 'StartLimitIntervalSec' in section [Service],
ignoring." — meaning the restart-loop cap I claimed in commit 519567e
wasn't actually applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:56:54 +02:00
mwiegand
a982995d5b
fix(deploy): ExecStartPre runs overlay helper with + prefix, not sudo
The unit has NoNewPrivileges=true (security hardening for srcds), which
blocks sudo's setuid escalation. The previous sudo'd ExecStartPre failed
on every start with "sudo: the 'no new privileges' switch is set, which
prevents sudo from running as root" -> Restart=on-failure loop.

systemd's `+` prefix runs the Exec command as PID 1 (root, no sandbox),
bypassing User=/Group=/NoNewPrivileges=. Equivalent privilege scope to
the sudoers rule the web app already uses for the same helper, just
without the sudo middleman.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:55:16 +02:00
mwiegand
56f5c30296
refactor(l4d2-host): unit's ExecStartPre is the sole code path to the mount
Before this change there were two callers of left4me-overlay mount:
the web app's start_instance (Python, in-process) and the unit's
ExecStartPre (shell, via sudo). The duplication invited divergence; the
helper's recently-added idempotency made both paths technically work
but at the cost of a "first wins" race and dead-code retry logic in
start_instance.

Drop the in-process _mounter.mount() call from start_instance. The web
app now only stages cfg files (which still must happen on the host
filesystem before mount, to avoid overlayfs copy-up changing ownership),
then asks systemd to enable+start the unit; the unit's ExecStartPre
does the mount.

Removed:
- os.path.ismount(merged) refusal in start_instance and its test
  (test_start_refuses_to_double_mount). The race the check guarded
  against is now handled by the helper's idempotency.
- _load_instance_env helper and the `os` import (both became dead).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:54:05 +02:00
mwiegand
3d9b7ef771
fix(deploy): WorkingDirectory= prefix - so ExecStartPre can mount the overlay
systemd applies WorkingDirectory= to every Exec line including ExecStartPre.
With the merged dir not yet existing at boot time (the volatile overlay
mount has been wiped), the chdir into runtime/%i/merged/left4dead2 fails
with status=200/CHDIR before ExecStartPre can run the mount helper.

The `-` prefix makes chdir failure non-fatal: ExecStartPre runs in the
unit's home (cwd doesn't matter for the mount helper); ExecStart re-applies
WorkingDirectory once the mount has landed and chdirs successfully.

Companion to commit 519567e (which added the ExecStartPre mount + helper
idempotency but didn't account for the WorkingDirectory ordering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:51:58 +02:00
mwiegand
519567e156
fix(l4d2-host): mount overlay via ExecStartPre so enabled units boot cleanly
The lifecycle change to systemctl enable --now (commit 8552c55) made
units auto-start at boot. But the kernel-overlayfs mount is volatile
(reboot kills it), and the web app's start_instance only re-mounts in
response to a UI click. Result: at boot, systemd starts the unit, finds
empty merged/, CHDIR fails, Restart=on-failure spins forever (counter
hit 65 on ckn before this fix landed).

Fix:
- Unit gets `ExecStartPre=/usr/bin/sudo -n .../left4me-overlay mount %i`
  so the overlay is established before the main process starts.
- Helper is now idempotent: if merged is already a mount point, exit 0.
  Required because Restart=on-failure re-runs ExecStartPre on each
  cycle, and the web-app's start_instance also calls the helper, so
  both paths would otherwise collide on "already mounted".
- StartLimitBurst=5 + StartLimitIntervalSec=60s caps the restart loop
  instead of letting it spin indefinitely on a fundamental failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:47:20 +02:00
mwiegand
8552c559d3
feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now
Servers started via the web UI now create a WantedBy= symlink under
multi-user.target.wants/, so they auto-start on the next host reboot.
Helper verbs renamed start/stop -> enable/disable; service_control.py
renamed start_service/stop_service -> enable_service/disable_service.
The user-facing l4d2ctl start/stop commands keep their names per the
AGENTS.md contract -- only the implementation changes. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:28:44 +02:00
mwiegand
20604dd79c
docs(deploy): document CPU isolation in performance-tuning section
Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS /
LEFT4ME_GAME_CPUS overrides, the single-core skip, and the
subset-of relationship with per-instance CPUAffinity=.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:06:59 +02:00
mwiegand
af3171102a
feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest
Computes NPROC at deploy time. Defaults LEFT4ME_SYSTEM_CPUS=0 and
LEFT4ME_GAME_CPUS=1-(NPROC-1). Single-core hosts skip cpuset writes
with a stderr warning unless an env var override is set. Spec:
docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:06:34 +02:00
mwiegand
e5126c8c0b
docs(deploy): tighten perf-tuning escape hatches
- RT example: add AmbientCapabilities=CAP_SYS_NICE so the User=left4me
  service can actually enter SCHED_FIFO on Trixie.
- CPU governor: note that linux-cpupower may need apt install.
- CPUAffinity=2: clarify that per-instance values typically increment.
- NIC tuning: note that ethtool may need apt install.
2026-05-09 10:15:45 +02:00
mwiegand
9e0f6f17ef
docs(deploy): performance-tuning escape-hatch section in README
Documents CPU governor, per-instance CPUAffinity, NIC tuning, and
SCHED_FIFO opt-in patterns. None of these are auto-applied; they're
ops-side knobs for measured problems the perf baseline doesn't solve.
2026-05-09 10:09:40 +02:00
mwiegand
928519fa34
feat(deploy): install slice + sysctl artifacts and apply via sysctl --system
Copies l4d2-game.slice and l4d2-build.slice into
/usr/local/lib/systemd/system/, installs 99-left4me.conf into
/etc/sysctl.d/, and runs sysctl --system so the perf baseline is
live this deploy, not on next reboot.
2026-05-09 10:05:41 +02:00
mwiegand
7e4a5691ed
feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500
Builds yield CPU/IO to game-server instances under contention via the
slice's weight=10, and are killed first under memory pressure
(servers have OOMScoreAdjust=-200).
2026-05-09 10:01:38 +02:00
mwiegand
b3fca4772c
feat(deploy): host sysctls for UDP buffers + netdev backlog/budget
99-left4me.conf: rmem_max/wmem_max=8M (with 512K defaults),
netdev_max_backlog=5000, netdev_budget=600, vm.swappiness=10.
2026-05-09 09:53:07 +02:00
mwiegand
66d83a0282
docs(deploy): point slice files at perf baseline spec
Matches the spec-pointer comment Task 1 added to
left4me-server@.service. A future operator running
`systemctl cat l4d2-game.slice` now finds the rationale.
2026-05-09 09:51:48 +02:00
mwiegand
ad7d73608e
feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio
Flat top-level slices. Game wins under contention; build still gets
the box when uncontended. Referenced by left4me-server@.service and
the script-sandbox systemd-run invocation.
2026-05-09 09:48:41 +02:00
mwiegand
7193163488
feat(deploy): perf-baseline directives on left4me-server@.service
Slice=l4d2-game.slice, Nice=-5, IOSchedulingClass=best-effort,
OOMScoreAdjust=-200, MemoryHigh=1.5G, MemoryMax=2G, TasksMax=256,
LimitNOFILE=65536, KillSignal=SIGINT, TimeoutStopSec=15s,
LogRateLimitIntervalSec=0. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
2026-05-09 09:44:12 +02:00
mwiegand
965b67e6fc
fix(l4d2-host): script-sandbox normalizes file perms so web user can read
Cedapug's build script writes .cedapug/manifest.tsv with mode 0600 owned
by l4d2-sandbox; the web service (left4me uid) then 500s when streaming
that file via the download route — PermissionError on open().

Two fixes:
- UMask=0022 on the systemd-run unit so new file writes default to
  0644 / dirs to 0755.
- Post-script chmod o+r/o+rx walk over the overlay dir to backfill any
  stricter modes the script left behind (e.g. shells/tools that ignore
  umask and explicitly create with 0600).

The helper no longer execs systemd-run; it captures the rc, runs the
post-step, and exits with the original rc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:44:26 +02:00
mwiegand
caa8b83cf0
chore(deploy): rewrite web.env every deploy with machine-id-derived SECRET_KEY
Drops the 'only on first creation' guard so newly added env vars reach
existing boxes (today's SESSION_COOKIE_SECURE=false rake). SECRET_KEY
is now sha256(/etc/machine-id) — stable per host, no session
invalidation across redeploys, no state persisted in /etc that the
deploy has to tiptoe around. Single-operator test deployment; the
secret being machine-id-derivable is acceptable per deploy/README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 20:39:02 +02:00
mwiegand
196d2db33e
feat(l4d2-web): seed example script overlays from examples/script-overlays/
Bundles four reference script overlays (cedapug_maps, l4d2center_maps,
competitive_rework, tickrate) and adds a `flask seed-script-overlays`
CLI that upserts each *.sh as a system-wide overlay. Test deploy
invokes it after the orphan-cleanup migration so fresh test servers
come up with the same overlays the user has been maintaining by hand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 18:41:08 +02:00