Commit graph

288 commits

Author SHA1 Message Date
mwiegand
9f0b51b455
docs(deploy): document network-shaping defaults + opt-in network knobs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:09:28 +02:00
mwiegand
26f3d270b0
feat(deploy): wire nft marking + CAKE shaper into deploy script
Installs nftables via apt/dnf, copies left4me-mark.nft and left4me-apply-cake
helper into system paths, conditionally seeds cake.env (preserving operator
edits), and enables left4me-nft-mark.service + left4me-cake.service on deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:04:12 +02:00
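Note on 26f3d270b0: the "conditionally seeds cake.env (preserving operator edits)" step can be pictured as a seed-if-missing copy. The sketch below is a hedged Python illustration only. The real deploy step is a shell script, and the function name, file names, and the `UPLINK` key are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def seed_if_missing(template: Path, target: Path) -> bool:
    """Copy template to target only when target does not exist yet.

    Returns True when the file was seeded, False when an existing file
    (possibly edited by an operator) was left untouched.
    """
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, target)
    return True


# Demo in a throwaway directory; names and contents are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "cake.env.template"
    target = Path(tmp) / "etc" / "cake.env"
    template.write_text("UPLINK=placeholder\n")

    first = seed_if_missing(template, target)      # seeds the file
    target.write_text("UPLINK=operator-value\n")   # operator edits it
    second = seed_if_missing(template, target)     # must not overwrite
    preserved = target.read_text()
```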
mwiegand
a9ca90537b
feat(deploy): left4me-cake.service oneshot wrapping apply-cake helper
The CAKE egress shaper now has a systemd unit that wraps the
left4me-apply-cake helper in apply and clear modes. The unit is a
oneshot that starts after network-online and survives service restarts,
allowing the shaper to persist across reboots and be managed by systemd.
The environment file is marked non-fatal (EnvironmentFile=-) to handle
missing or incomplete configurations gracefully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:58:42 +02:00
mwiegand
878639147a
feat(deploy): left4me-apply-cake helper with apply/clear modes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:52:16 +02:00
mwiegand
d783449d05
feat(deploy): cake.env template with documented uplink knobs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:49:08 +02:00
mwiegand
fbb342db87
feat(deploy): systemd unit to load/clear left4me_mark nftables table
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:35:27 +02:00
mwiegand
076bfb72ca
feat(deploy): nftables uid-based DSCP-EF + skb-priority marking for srcds
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:32:53 +02:00
mwiegand
e822e9fbc7
feat(deploy): extend sysctls with udp_*_min, fq_codel default, BBR
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:28:24 +02:00
mwiegand
e1add4fffa
docs(plans): l4d2 network shaping & marking — implementation plan
Eight TDD tasks: sysctl extension, nftables marking (file + unit), CAKE
shaper (env + helper + unit), deploy-script wiring, README. Each task
adds one artifact with its assertion in test_deploy_artifacts.py and
ends in its own commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:10:40 +02:00
mwiegand
0cc92f2c17
docs(specs): l4d2 network shaping & marking — design
CAKE egress shaping (test-deploy oneshot + systemd-networkd [CAKE] block
on prod), nftables uid-based DSCP-EF + skb-priority marking for srcds
UDP, plus rounding sysctls (udp_rmem_min/wmem_min, default_qdisc=fq_codel,
tcp_congestion_control=bbr). Hardware-specific knobs stay documented
escape hatches matching the perf-baseline boundary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:05:44 +02:00
mwiegand
62d6d4cbcd
ui(files-overlay): label root row as "/" instead of "(overlay root)"
Tighter, more terminal-flavored. Mono font on the label echoes how
paths are rendered elsewhere in the tree. New-folder dialog title
also shows "/" when targeting the root.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 19:50:14 +02:00
mwiegand
2bba1f31d0
fix(files-overlay): post-deploy bug sweep + root-as-row UX
Three bugs surfaced in browser testing, plus one UX request:

1. The Uploads panel and the binary-mode editor sub-panels stayed
   visible after `el.hidden = true` because their `display: flex/grid`
   rules in components.css are author styles, which override the UA's
   `[hidden]{display:none}` in the cascade regardless of specificity.
   Add a targeted `[hidden]!important` rule for the affected classes.

2. Clicking a folder toggle inside a `files` overlay did nothing.
   `file-tree.js` looked for `.file-tree-children` via
   `button.nextElementSibling`, but the files-overlay row template
   inserts a per-row action span between the toggle and the children
   div. Switch to `closest('.file-tree-row').querySelector(':scope >
   .file-tree-children')` so both row variants resolve correctly.

3. Pressing Enter on the new-folder dialog did nothing — the keydown
   handler was attached with `{once:true}` inside `openNewFolder`,
   so the first letter the user typed consumed the listener and Enter
   never fired. Move the listener to module init so it survives
   subsequent keystrokes and dialog reopenings.

UX: render the overlay root as a row inside the tree (label
"(overlay root)") rather than as a separate toolbar. The root row
carries the same `+ new file · + new folder · ⬇ zip` hover-action
column as every other folder row, so drop-on-row, hover-reveal, and
data-target-path semantics are uniform across the tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 19:46:19 +02:00
mwiegand
76cd7ddda0
fix(files-overlay): fall back to getAsFile when webkitGetAsEntry returns null
webkitGetAsEntry() only returns an Entry for real OS-originated drag-drops;
synthetic DragEvents (and some browsers without folder-drop support) get
null back. Per-item fallback to getAsFile() keeps single-file drops working
in those cases without sacrificing the whole-folder upload path on real
OS drops.

Caught while end-to-end testing on the deploy box: a programmatically-
dispatched drop fired the listener and reached preventDefault(), but no
upload row appeared because the file collection loop never enqueued.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 19:11:41 +02:00
mwiegand
2d3c98866a
feat(files-overlay): user-managed file content as a third overlay type
Adds Overlay.type='files' whose source-of-truth IS the overlay directory
itself. Users can:

  * upload arbitrary files / whole folders by dragging from the OS onto a
    folder row in the file tree (one POST per file, queue with
    concurrency 3, per-file progress in a floating Uploads panel)
  * move via drag-and-drop inside the tree (same gesture; the drag source
    distinguishes a move from an upload; moves that would create a cycle
    are refused)
  * create / edit / rename / replace through a single editor modal
    (text flavor for editable files, binary flavor with replace-upload
    for everything else; filename input is the rename surface)
  * mkdir empty folders (slashes allowed for nested intermediates)
  * stream a folder as a zip download
  * delete files and empty folders

Backend is type-agnostic past the new files_routes endpoints, so the
existing mount / spec / overlayfs / expose_server_cfg pipeline is reused
unchanged. is_editable gates the row's edit affordance and the /save
content rules. Three new safe-resolve helpers (write/delete/move) cover
the new operations with the same anchor-and-resolve pattern as listing
and download. FilesBuilder is a no-op so the build subsystem can
dispatch uniformly.

Spec: docs/superpowers/specs/2026-05-09-files-overlay-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 18:59:32 +02:00
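Note on 2d3c98866a: the "anchor-and-resolve pattern" and the "refuses cycles" rule for moves might look roughly like the sketch below. It is a minimal illustration under assumptions, not the project's helpers: `safe_resolve`, `would_create_cycle`, and `UnsafePathError` are hypothetical names.

```python
import tempfile
from pathlib import Path


class UnsafePathError(ValueError):
    """Raised when a user-supplied path would escape the anchor directory."""


def safe_resolve(anchor: Path, relative: str) -> Path:
    """Resolve `relative` under `anchor`, refusing anything that escapes it."""
    anchor_real = anchor.resolve()
    # resolve() follows symlinks and collapses "..", so the containment
    # check below runs on the real location.
    candidate = (anchor_real / relative).resolve()
    if candidate != anchor_real and anchor_real not in candidate.parents:
        raise UnsafePathError(relative)
    return candidate


def would_create_cycle(src: Path, dst_dir: Path) -> bool:
    """True when moving src into dst_dir would place a folder inside itself."""
    src_real, dst_real = src.resolve(), dst_dir.resolve()
    return dst_real == src_real or src_real in dst_real.parents


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "cfg" / "sub").mkdir(parents=True)

    inside_rel = safe_resolve(root, "cfg/server.cfg").relative_to(root.resolve())
    try:
        safe_resolve(root, "../outside.txt")
        escaped = True
    except UnsafePathError:
        escaped = False

    cycle = would_create_cycle(root / "cfg", root / "cfg" / "sub")
    no_cycle = would_create_cycle(root / "cfg" / "sub", root)
```

Resolving both the anchor and the candidate before comparing is what keeps symlinks and absolute paths from slipping past the check.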
mwiegand
36d3d83de6
docs: postmortem for the overlay-umount EBUSY rabbit hole
Captures the symptom (Reset blew up on `umount target busy`), the
false starts (eager retry, lazy fallback, TimeoutStopSec bump — all
shipped briefly and reverted), the actual root cause (the helper's
own Python interpreter inheriting and pinning the unit's mount
namespace), and the fix (nsenter at the systemd Exec line).

The lessons section is the part future-me reads first: a retry loop
is a hint that something we own is the blocker; probe `/proc/*/ns/mnt`
before assuming kernel async; `+` Exec prefix doesn't escape the
unit's mount namespace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:50:41 +02:00
mwiegand
87d56a0910
fix(web): event-delegate modal triggers so HTMX-swapped buttons work
The previous wiring attached click listeners on DOMContentLoaded, so
any [data-modal-open] / [data-modal-close] / dialog.modal element
that came in via a later HTMX partial swap silently lost its
behaviour. The server-detail Actions partial reloads its reset/delete
triggers on every state change, so reset was unclickable after the
first state change post-load.

Switch to a single delegated click handler on document. Same logic,
but matches via Element.closest() so it works regardless of when an
element was added to the DOM. No re-bind needed after HTMX swaps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:18:27 +02:00
mwiegand
5eac51a93e
fix(deploy): wrap overlay helper with nsenter so it doesn't pin the unit's mount namespace
systemd's `+` Exec prefix removes sandbox/credentials but does NOT
detach from the unit's per-service mount namespace (created by
PrivateTmp/Protect*). The Python interpreter for the helper was
launched inside that namespace, and even though the helper internally
nsenter'd into PID 1 for the umount syscall, the calling Python
process itself never left the unit's namespace. Its existence pinned
the namespace alive, which kept the slave mount tree alive, which
made PID 1's umount return EBUSY for the entire duration of the
helper's run. The mount became unmountable the moment the helper
exited — empirically verified by polling /proc/*/ns/mnt during stop:
the only PID holding the dying namespace was the helper itself.

Wrap both ExecStartPre and ExecStopPost with `/usr/bin/nsenter
--mount=/proc/1/ns/mnt --` so the helper Python interpreter runs in
PID 1's mount namespace from the start. With the helper out of the
unit's namespace, umount succeeds first try once the cgroup empties.
Reset went from ~25 s with retry/lazy-fallback workarounds to ~0.5 s
clean.

Knock-on cleanups:
- Helper drops internal nsenter for the syscalls (already in PID 1's
  namespace), and drops the eager-retry loop + lazy-umount fallback +
  inner work_inner retry (no race left to ride out).
- Revert TimeoutStopSec=60s back to 15s.
- Tests updated to expect the new argv shapes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:13:59 +02:00
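Note on 5eac51a93e: the two ideas in this commit, prefixing the helper with nsenter and polling `/proc/*/ns/mnt` to see who holds a mount namespace, can be sketched in Python as follows. This is an illustrative sketch, not the project's code: the function names are hypothetical and the instance name `example` is a placeholder. The probe is Linux-only and silently skips processes it cannot inspect.

```python
import os
from pathlib import Path


def wrap_with_nsenter(argv):
    """Prefix argv so the command starts in PID 1's mount namespace."""
    return ["/usr/bin/nsenter", "--mount=/proc/1/ns/mnt", "--", *argv]


def mount_namespace_of(pid):
    """Return the namespace link target for pid, or None if unreadable."""
    try:
        return os.readlink(f"/proc/{pid}/ns/mnt")
    except OSError:
        # Process exited, /proc is unavailable, or permission was denied.
        return None


def pids_in_namespace(ns_id):
    """List PIDs whose mount namespace matches ns_id (best effort)."""
    pids = []
    for entry in Path("/proc").iterdir():
        if entry.name.isdigit() and mount_namespace_of(entry.name) == ns_id:
            pids.append(int(entry.name))
    return sorted(pids)


own_ns = mount_namespace_of(os.getpid())
holders = pids_in_namespace(own_ns) if own_ns else []
wrapped = wrap_with_nsenter(
    ["/usr/local/libexec/left4me/left4me-overlay", "umount", "example"]
)
```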
mwiegand
936c8bb81c
fix(deploy): ExecStart srcds_run from merged overlay, not installation/
srcds_run is a shell script that cd's to its own dirname before exec'ing
srcds_linux, so WorkingDirectory has no effect — the binary's path is what
determines where the engine reads gameinfo.txt and addons from. Pointing
at installation/srcds_run resolved everything against the lower layer, so
overlay-provided Metamod/SourceMod plugins and cfgs (zonemod, confogl)
never loaded. Switch to runtime/%i/merged/srcds_run so the engine sees
the merged tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 14:03:12 +02:00
mwiegand
ddf73c4d27
test(deploy): drop stale web.env lifecycle assertions
`test_deploy_script_has_safe_defaults_and_preserves_state` had been red
since commit caa8b83 ("rewrite web.env every deploy with machine-id-
derived SECRET_KEY"). Two assertions encoded the prior model:

- `if [ ! -f /etc/left4me/web.env ]`: the create-only-if-missing guard
  that caa8b83 removed in favor of unconditional `install -m 0640 ...`.
- `. /etc/left4me/web.env not in script`: masked by the first failing
  assertion, but also stale. The deploy intentionally sources web.env in
  the alembic and seed-script-overlays helper subprocesses so they get
  DATABASE_URL.

Removed both. The full suite now runs with 0 failures. The note left in place
points future readers at the live coverage path (install + SECRET_KEY
rewrite + run_left4me_with_env plumbing already asserted nearby).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:33:05 +02:00
mwiegand
59771f91c4
fix(deploy): drop deleted l4d2host.fs from pyproject + use nproc --all
Two bugs surfaced by the previous deploy attempt:

1. l4d2host/pyproject.toml still listed `l4d2host.fs` in the explicit
   packages= list. After deleting the fs/ package, pip install -e fails
   with "package directory './fs' does not exist".

2. The CPU-isolation deploy step uses `nproc` to detect host core count,
   but `nproc` honors Cpus_allowed of the calling shell. On a host that
   already has the cpuset drop-ins applied (system.slice/user.slice →
   AllowedCPUs=0), the SSH login lands constrained to one core and
   `nproc` returns 1 — making subsequent deploys think they're on a
   single-core box and skip the cpuset writes entirely. `nproc --all`
   reports installed processors regardless of affinity, which is what
   the deploy actually wants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:11:19 +02:00
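Note on 59771f91c4: the difference between `nproc` and `nproc --all` has a close analogue in Python, shown here as a small sketch. `os.sched_getaffinity` reflects the CPUs the calling process is allowed to run on, while `os.cpu_count` reports the logical CPUs the OS knows about. The mapping to the two `nproc` modes is approximate, and the helper name is hypothetical.

```python
import os


def cpu_counts():
    """Return (allowed, installed) CPU counts for the current process."""
    # Logical CPUs in the system, roughly what `nproc --all` is used for.
    installed = os.cpu_count() or 1
    if hasattr(os, "sched_getaffinity"):
        # CPUs this process may run on, which is what plain `nproc` honors.
        allowed = len(os.sched_getaffinity(0))
    else:
        # Platforms without affinity support: fall back to the total.
        allowed = installed
    return allowed, installed


allowed, installed = cpu_counts()
```

On a host where the login shell is confined to one core, `allowed` would be 1 while `installed` still reflects the whole machine, which is the trap the commit describes.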
mwiegand
ff6ce7b091
refactor(l4d2-host): unmount via ExecStopPost — single code path mirroring mount
Symmetric with the earlier mount cleanup (commits 519567e..a982995). Until
now, the unit's ExecStartPre handled mount but the Python side still drove
unmount: stop_instance and _purge_instance both called _mounter.unmount,
which wrapped sudo + the helper. Two code paths for two halves of the
same lifecycle.

Move unmount into the unit:

- ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i
  (ExecStopPost, not ExecStop, so it runs after the cgroup is cleared;
  ExecStop runs while srcds is alive and would EBUSY the umount syscall.)
- Helper's umount verb is now idempotent (mirrors mount): if merged
  isn't a mount point, return early. PRINT_ONLY mode bypasses both
  short-circuits so the unit tests still exercise the full nsenter argv.

Drop the dead Python machinery:

- _mounter.unmount(...) calls in stop_instance and _purge_instance
- _mounter global + KernelOverlayFSMounter import
- The whole l4d2host/fs/ package (OverlayMounter ABC + KernelOverlayFSMounter
  class) — no production callers, just self-tests
- l4d2host/tests/test_kernel_overlayfs.py
- test_stop_succeeds_when_unmount_fails / test_delete_succeeds_when_unmount_fails
  (tested Python-side unmount-failure tolerance that no longer exists)
- The l4d2host.fs.kernel_overlayfs.run_command monkeypatches in lifecycle tests

After this, the only thing start_instance does beyond cfg-staging is ask
systemd to enable+start the unit. stop/delete/reset only ask systemd to
disable; the overlay lifecycle lives entirely in the unit file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:09:52 +02:00
mwiegand
fc371711ec
fix(deploy): StartLimit* directives belong in [Unit], not [Service]
systemd 230+ moved StartLimitBurst= and StartLimitIntervalSec= from
[Service] into [Unit] (with the rename from StartLimitInterval=). Putting
them in [Service] makes systemd ignore them, logging only a warning to
the journal: "Unknown key 'StartLimitIntervalSec' in section [Service],
ignoring." As a result, the restart-loop cap I claimed in commit 519567e
wasn't actually applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:56:54 +02:00
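Note on fc371711ec: a misplaced StartLimit* key only surfaces as a journal warning, so a static check on the unit text can catch it earlier. The sketch below uses Python's `configparser` as a rough stand-in for systemd's parser. It is an assumption-laden illustration (hypothetical function name, simplified unit text) and will not cope with every real unit file.

```python
import configparser

GOOD_UNIT = """\
[Unit]
Description=example instance %i
StartLimitBurst=5
StartLimitIntervalSec=60s

[Service]
Restart=on-failure
ExecStart=/bin/true
"""

BAD_UNIT = """\
[Unit]
Description=example instance %i

[Service]
Restart=on-failure
StartLimitBurst=5
StartLimitIntervalSec=60s
ExecStart=/bin/true
"""


def misplaced_start_limits(unit_text):
    """Return StartLimit* keys that appear under [Service]."""
    # strict=False tolerates repeated keys; interpolation=None keeps "%i" literal.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep key case as written
    parser.read_string(unit_text)
    if not parser.has_section("Service"):
        return []
    return sorted(k for k in parser["Service"] if k.startswith("StartLimit"))


good_result = misplaced_start_limits(GOOD_UNIT)
bad_result = misplaced_start_limits(BAD_UNIT)
```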
mwiegand
a982995d5b
fix(deploy): ExecStartPre runs overlay helper with + prefix, not sudo
The unit has NoNewPrivileges=true (security hardening for srcds), which
blocks sudo's setuid escalation. The previous sudo'd ExecStartPre failed
on every start with "sudo: the 'no new privileges' switch is set, which
prevents sudo from running as root" -> Restart=on-failure loop.

systemd's `+` prefix runs the Exec command with full privileges (root, no sandbox),
bypassing User=/Group=/NoNewPrivileges=. Equivalent privilege scope to
the sudoers rule the web app already uses for the same helper, just
without the sudo middleman.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:55:16 +02:00
mwiegand
56f5c30296
refactor(l4d2-host): unit's ExecStartPre is the sole code path to the mount
Before this change there were two callers of left4me-overlay mount:
the web app's start_instance (Python, in-process) and the unit's
ExecStartPre (shell, via sudo). The duplication invited divergence; the
helper's recently-added idempotency made both paths technically work
but at the cost of a "first wins" race and dead-code retry logic in
start_instance.

Drop the in-process _mounter.mount() call from start_instance. The web
app now only stages cfg files (which still must happen on the host
filesystem before mount, to avoid overlayfs copy-up changing ownership),
then asks systemd to enable+start the unit; the unit's ExecStartPre
does the mount.

Removed:
- os.path.ismount(merged) refusal in start_instance and its test
  (test_start_refuses_to_double_mount). The race the check guarded
  against is now handled by the helper's idempotency.
- _load_instance_env helper and the `os` import (both became dead).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:54:05 +02:00
mwiegand
3d9b7ef771
fix(deploy): `-` prefix on WorkingDirectory= so ExecStartPre can mount the overlay
systemd applies WorkingDirectory= to every Exec line including ExecStartPre.
With the merged dir not yet existing at boot time (the volatile overlay
mount has been wiped), the chdir into runtime/%i/merged/left4dead2 fails
with status=200/CHDIR before ExecStartPre can run the mount helper.

The `-` prefix makes chdir failure non-fatal: ExecStartPre runs in the
unit's home (cwd doesn't matter for the mount helper); ExecStart re-applies
WorkingDirectory once the mount has landed and chdirs successfully.

Companion to commit 519567e (which added the ExecStartPre mount + helper
idempotency but didn't account for the WorkingDirectory ordering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:51:58 +02:00
mwiegand
519567e156
fix(l4d2-host): mount overlay via ExecStartPre so enabled units boot cleanly
The lifecycle change to systemctl enable --now (commit 8552c55) made
units auto-start at boot. But the kernel-overlayfs mount is volatile
(reboot kills it), and the web app's start_instance only re-mounts in
response to a UI click. Result: at boot, systemd starts the unit, finds
empty merged/, CHDIR fails, Restart=on-failure spins forever (counter
hit 65 on ckn before this fix landed).

Fix:
- Unit gets `ExecStartPre=/usr/bin/sudo -n .../left4me-overlay mount %i`
  so the overlay is established before the main process starts.
- Helper is now idempotent: if merged is already a mount point, exit 0.
  Required because Restart=on-failure re-runs ExecStartPre on each
  cycle, and the web-app's start_instance also calls the helper, so
  both paths would otherwise collide on "already mounted".
- StartLimitBurst=5 + StartLimitIntervalSec=60s caps the restart loop
  instead of letting it spin indefinitely on a fundamental failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:47:20 +02:00
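Note on 519567e156: the idempotency rule ("if merged is already a mount point, exit 0") reduces to a mount-point check before the real work. A minimal Python sketch follows, assuming a hypothetical `ensure_mounted` wrapper and a fake mount callback in place of the real overlayfs mount.

```python
import os
import tempfile


def ensure_mounted(merged, do_mount):
    """Call do_mount(merged) only when merged is not already a mount point."""
    if os.path.ismount(merged):
        return "already-mounted"
    do_mount(merged)
    return "mounted"


calls = []

with tempfile.TemporaryDirectory() as merged_dir:
    # A fresh temporary directory is not a mount point, so the callback runs.
    fresh_result = ensure_mounted(merged_dir, calls.append)
    fresh_calls = list(calls)

# "/" is always a mount point on POSIX systems, so the callback is skipped.
root_result = ensure_mounted("/", calls.append)
```

Because the check is cheap and side-effect free, repeated ExecStartPre runs under Restart=on-failure can call it safely.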
mwiegand
b62fc08127
docs(specs): l4d2 cpu pinning — decision record (deferred)
Investigated whether to hard-pin each srcds instance to a single core
within the existing AllowedCPUs=1-7 set. Modern kernels (5.13+) no
longer expose kernel.sched_migration_cost_ns or the other classic CFS
"laziness" tunables, so a global cheap-fix is unavailable. Decision
for now: trust CFS + Nice=-5 + AllowedCPUs=1-7. Per-instance
CPUAffinity= remains an opt-in escape hatch in deploy/README.md.
Documents the revisit triggers and the preferred implementation path
when the time comes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:41:40 +02:00
mwiegand
67b5521eb6
feat(l4d2-web): periodic state poller refreshes Server.actual_state
A background thread spawned alongside the job workers polls every
server's status every STATE_POLLER_INTERVAL_SECONDS (default 30) and
writes the result via the existing refresh_server_actual_state path.
Servers with in-flight jobs (queued/running/cancelling) are skipped to
avoid racing the post-job refresh. Catches reboot drift, OOM kills,
manual systemctl operations, and any other out-of-band state change.
Spec: docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:31:28 +02:00
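Note on 67b5521eb6: the poller's two rules, run on a fixed interval and skip servers with in-flight jobs, can be sketched as below. This is a hypothetical illustration, not the project's poller. The function names, the data shapes, and the `succeeded` state are assumptions; only the queued/running/cancelling states come from the commit message.

```python
import threading

IN_FLIGHT = {"queued", "running", "cancelling"}


def servers_to_poll(server_ids, job_states_by_server):
    """Return the servers that have no in-flight job."""
    eligible = []
    for server_id in server_ids:
        states = job_states_by_server.get(server_id, [])
        if any(state in IN_FLIGHT for state in states):
            continue  # a post-job refresh will handle this server
        eligible.append(server_id)
    return eligible


def run_poller(stop_event, interval_seconds, poll_once):
    """Call poll_once every interval until stop_event is set."""
    # Event.wait doubles as an interruptible sleep: it returns True once
    # the event is set, which ends the loop promptly on shutdown.
    while not stop_event.wait(interval_seconds):
        poll_once()


results = []
stop = threading.Event()


def poll_once():
    jobs = {"b": ["running"], "c": ["succeeded"]}
    results.append(servers_to_poll(["a", "b", "c"], jobs))
    if len(results) >= 2:
        stop.set()


worker = threading.Thread(target=run_poller, args=(stop, 0.01, poll_once), daemon=True)
worker.start()
worker.join(timeout=5)
```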
mwiegand
8552c559d3
feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now
Servers started via the web UI now create a WantedBy= symlink under
multi-user.target.wants/, so they auto-start on the next host reboot.
Helper verbs renamed start/stop -> enable/disable; service_control.py
renamed start_service/stop_service -> enable_service/disable_service.
The user-facing l4d2ctl start/stop commands keep their names per the
AGENTS.md contract -- only the implementation changes. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:28:44 +02:00
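Note on 8552c559d3: the verb change amounts to a different systemctl argv for the same template unit. The sketch below only builds the argument lists and does not run systemctl. The function names are hypothetical and the instance name is a placeholder; the unit template name is taken from other commits in this log.

```python
def unit_name(instance):
    """Instance name of the left4me-server@.service template."""
    return f"left4me-server@{instance}.service"


def enable_argv(instance):
    """Start the unit now and make it start again on the next boot."""
    return ["systemctl", "enable", "--now", unit_name(instance)]


def disable_argv(instance):
    """Stop the unit now and remove it from the boot sequence."""
    return ["systemctl", "disable", "--now", unit_name(instance)]


enable_cmd = enable_argv("example")
disable_cmd = disable_argv("example")
```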
mwiegand
1dd674714a
docs(specs): perf baseline lifecycle — premise check on system vs user units
Make explicit that the project uses system units (root systemctl, unit
under /usr/local/lib/systemd/system/, WantedBy=multi-user.target), so
`systemctl enable --now` is the correct verb to make instances survive
a host reboot. User units have different lifecycle rules and would not
auto-start at boot without enable-linger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:25:34 +02:00
mwiegand
3b0bde9b50
docs(plans): l4d2 server lifecycle reboot-and-drift — implementation plan
Two TDD tasks: helper+service_control verb rename, then poller code
+ wiring + tests. Operator-side smoke test in F.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:21:59 +02:00
mwiegand
72cd7ca1ef
docs(specs): l4d2 server lifecycle reboot-and-drift — design
Switch lifecycle verbs from systemctl start/stop to enable --now /
disable --now (servers survive host reboot via WantedBy= symlinks),
plus a periodic state poller for runtime drift (OOM kills, manual
systemctl ops, exhausted Restart=on-failure).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:21:59 +02:00
mwiegand
20604dd79c
docs(deploy): document CPU isolation in performance-tuning section
Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS /
LEFT4ME_GAME_CPUS overrides, the single-core skip, and the
subset-of relationship with per-instance CPUAffinity=.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:06:59 +02:00
mwiegand
af3171102a
feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest
Computes NPROC at deploy time. Defaults LEFT4ME_SYSTEM_CPUS=0 and
LEFT4ME_GAME_CPUS=1-(NPROC-1). Single-core hosts skip cpuset writes
with a stderr warning unless an env var override is set. Spec:
docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:06:34 +02:00
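Note on af3171102a: the default split (core 0 for the system, cores 1..NPROC-1 for game) and the single-core skip can be written down as a small pure function. This is a sketch under assumptions, not the deploy script's logic. The function name is hypothetical, and how a lone override behaves on a single-core host is not described in the commit, so the sketch simply passes overrides through in that case.

```python
def cpuset_plan(nproc, env):
    """Return (system_cpus, game_cpus), or None when cpuset writes are skipped."""
    system = env.get("LEFT4ME_SYSTEM_CPUS")
    game = env.get("LEFT4ME_GAME_CPUS")
    if nproc < 2:
        if system is None and game is None:
            return None  # single-core host with no override: skip
        # Assumption: apply only what the operator set explicitly.
        return system, game
    return system or "0", game or f"1-{nproc - 1}"


eight_core = cpuset_plan(8, {})
single_core = cpuset_plan(1, {})
game_override = cpuset_plan(8, {"LEFT4ME_GAME_CPUS": "2-7"})
single_with_override = cpuset_plan(1, {"LEFT4ME_SYSTEM_CPUS": "0"})
```

For an 8-core host the defaults come out as `0` and `1-7`, matching the AllowedCPUs=1-7 set mentioned elsewhere in this log.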
mwiegand
c91c029c38
docs(plans): l4d2 cpu isolation — implementation plan
Two TDD tasks: deploy-script cpuset block + tests, README
"CPU isolation" subsection. Operator-side smoke test in F.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:03:37 +02:00
mwiegand
17b7c2ff10
docs(specs): l4d2 cpu isolation — design
cgroup-v2 AllowedCPUs= drop-ins for system/user/build/game slices.
Defaults: core 0 for everything-not-game, cores 1..N-1 for game,
computed from nproc. LEFT4ME_SYSTEM_CPUS / LEFT4ME_GAME_CPUS
overrides; single-core hosts skip with a warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:03:37 +02:00
mwiegand
e5126c8c0b
docs(deploy): tighten perf-tuning escape hatches
- RT example: add AmbientCapabilities=CAP_SYS_NICE so the User=left4me
  service can actually enter SCHED_FIFO on Trixie.
- CPU governor: note that linux-cpupower may need apt install.
- CPUAffinity=2: clarify that per-instance values typically increment.
- NIC tuning: note that ethtool may need apt install.
2026-05-09 10:15:45 +02:00
mwiegand
9e0f6f17ef
docs(deploy): performance-tuning escape-hatch section in README
Documents CPU governor, per-instance CPUAffinity, NIC tuning, and
SCHED_FIFO opt-in patterns. None of these are auto-applied; they're
ops-side knobs for measured problems the perf baseline doesn't solve.
2026-05-09 10:09:40 +02:00
mwiegand
928519fa34
feat(deploy): install slice + sysctl artifacts and apply via sysctl --system
Copies l4d2-game.slice and l4d2-build.slice into
/usr/local/lib/systemd/system/, installs 99-left4me.conf into
/etc/sysctl.d/, and runs sysctl --system so the perf baseline is
live this deploy, not on next reboot.
2026-05-09 10:05:41 +02:00
mwiegand
7e4a5691ed
feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500
Builds yield CPU/IO to game-server instances under contention via the
slice's weight=10, and are killed first under memory pressure
(servers have OOMScoreAdjust=-200).
2026-05-09 10:01:38 +02:00
mwiegand
b3fca4772c
feat(deploy): host sysctls for UDP buffers + netdev backlog/budget
99-left4me.conf: rmem_max/wmem_max=8M (with 512K defaults),
netdev_max_backlog=5000, netdev_budget=600, vm.swappiness=10.
2026-05-09 09:53:07 +02:00
mwiegand
66d83a0282
docs(deploy): point slice files at perf baseline spec
Matches the spec-pointer comment Task 1 added to
left4me-server@.service. A future operator running
`systemctl cat l4d2-game.slice` now finds the rationale.
2026-05-09 09:51:48 +02:00
mwiegand
ad7d73608e
feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio
Flat top-level slices. Game wins under contention; build still gets
the box when uncontended. Referenced by left4me-server@.service and
the script-sandbox systemd-run invocation.
2026-05-09 09:48:41 +02:00
mwiegand
7193163488
feat(deploy): perf-baseline directives on left4me-server@.service
Slice=l4d2-game.slice, Nice=-5, IOSchedulingClass=best-effort,
OOMScoreAdjust=-200, MemoryHigh=1.5G, MemoryMax=2G, TasksMax=256,
LimitNOFILE=65536, KillSignal=SIGINT, TimeoutStopSec=15s,
LogRateLimitIntervalSec=0. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
2026-05-09 09:44:12 +02:00
mwiegand
851e6629aa
docs(plans): l4d2 server host perf baseline — implementation plan
Six tasks (TDD, one commit each): unit directives, slice files,
sysctl conf, sandbox slice + OOMScoreAdjust, deploy-script wiring,
README escape-hatch section. Final verification step with full
deploy + host + web pytest sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 09:39:12 +02:00
mwiegand
b6574e308b
docs(specs): perf baseline — fix transient-service phrasing
The existing left4me-script-sandbox helper uses systemd-run in
transient service mode (--unit=, no --scope). Spec wrongly said
'--scope'. No semantic change — the design's --slice= and
-p OOMScoreAdjust= guidance is identical for service vs scope mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 09:39:12 +02:00
mwiegand
db3b149045
docs(specs): l4d2 server host perf baseline — design
Approach A: per-instance unit directives (Nice, OOM, Memory caps,
KillSignal=SIGINT, log-rate disable), flat l4d2-game/l4d2-build slice
hierarchy with 100:1 CPU/IO weight ratio, sandbox into build slice with
OOMScoreAdjust=500, host sysctls for UDP buffers + netdev backlog/budget
+ vm.swappiness. SCHED_FIFO, CPU governor, CPUAffinity, NIC tuning are
documented escape hatches, not auto-applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 09:31:05 +02:00
mwiegand
965b67e6fc
fix(l4d2-host): script-sandbox normalizes file perms so web user can read
Cedapug's build script writes .cedapug/manifest.tsv with mode 0600 owned
by l4d2-sandbox; the web service (left4me uid) then 500s when streaming
that file via the download route — PermissionError on open().

Two fixes:
- UMask=0022 on the systemd-run unit so new file writes default to
  0644 / dirs to 0755.
- Post-script chmod o+r/o+rx walk over the overlay dir to backfill any
  stricter modes the script left behind (e.g. shells/tools that ignore
  umask and explicitly create with 0600).

The helper no longer execs systemd-run; it captures the rc, runs the
post-step, and exits with the original rc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:44:26 +02:00
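Note on 965b67e6fc: the post-script step (add o+r to files and o+rx to directories, then exit with the script's original return code) is sketched below in Python. The real helper is not shown in this log, so the function names are hypothetical. The demo command is a stand-in for the build script.

```python
import os
import stat
import subprocess
import sys
import tempfile


def add_mode(path, bits):
    """OR extra permission bits into path's mode; symlinks are skipped."""
    if os.path.islink(path):
        return
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | bits)


def make_world_readable(root):
    """Grant o+rx on directories and o+r on files under root."""
    add_mode(root, stat.S_IROTH | stat.S_IXOTH)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames:
            add_mode(os.path.join(dirpath, name), stat.S_IROTH | stat.S_IXOTH)
        for name in filenames:
            add_mode(os.path.join(dirpath, name), stat.S_IROTH)


def run_then_normalize(argv, root):
    """Run argv, normalize permissions afterwards, return the original rc."""
    rc = subprocess.run(argv).returncode
    make_world_readable(root)
    return rc


with tempfile.TemporaryDirectory() as overlay:
    sub = os.path.join(overlay, ".cedapug")
    os.mkdir(sub)
    manifest = os.path.join(sub, "manifest.tsv")
    with open(manifest, "w") as fh:
        fh.write("example\n")
    os.chmod(manifest, 0o600)  # the strict mode the commit describes
    os.chmod(sub, 0o700)

    # Stand-in for the build script: exits with a non-zero status.
    rc = run_then_normalize([sys.executable, "-c", "raise SystemExit(3)"], overlay)
    file_mode = stat.S_IMODE(os.stat(manifest).st_mode)
    dir_mode = stat.S_IMODE(os.stat(sub).st_mode)
```

Capturing the return code before the permission walk is what lets the helper report the script's own failure rather than the result of the cleanup.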
mwiegand
c16e780283
feat(l4d2-web): server file tree — enable download symmetric with overlay tree
Adds a /servers/<id>/files/download route mirroring the overlay download
endpoint. Same safety rules: real-path must resolve under LEFT4ME_ROOT
(merged view threads through `installation/` and overlay layers, all
already inside the root). The server file-tree partial now renders
download links.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:40:04 +02:00
mwiegand
aacd95012e
feat(l4d2-web): blueprint rename moves to footer modal — matches overlay/server pattern
Drops the inline Name input from the blueprint edit form. A Rename link
sits next to Delete in the page footer; clicking opens a one-line modal
that posts to a new POST /blueprints/<id>/rename route. The main edit
form keeps the current name as a hidden input so its full Save still
works unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 01:37:29 +02:00