`test_deploy_script_has_safe_defaults_and_preserves_state` had been red
since commit caa8b83 ("rewrite web.env every deploy with machine-id-
derived SECRET_KEY"). Two assertions encoded the prior model:
- `if [ ! -f /etc/left4me/web.env ]`: the create-only-if-missing guard
caa8b83 removed in favor of unconditional `install -m 0640 ...`.
- `. /etc/left4me/web.env not in script`: masked by the first failure,
but also stale, because the deploy intentionally sources web.env in the
alembic and seed-script-overlays helper subprocesses so they get
DATABASE_URL.
Removed both. The full suite now runs with 0 failures. The note left in place
points future readers at the live coverage path (install + SECRET_KEY
rewrite + run_left4me_with_env plumbing already asserted nearby).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugs surfaced by the previous deploy attempt:
1. l4d2host/pyproject.toml still listed `l4d2host.fs` in the explicit
packages= list. After deleting the fs/ package, pip install -e fails
with "package directory './fs' does not exist".
2. The CPU-isolation deploy step uses `nproc` to detect host core count,
but `nproc` honors Cpus_allowed of the calling shell. On a host that
already has the cpuset drop-ins applied (system.slice/user.slice →
AllowedCPUs=0), the SSH login lands constrained to one core and
`nproc` returns 1 — making subsequent deploys think they're on a
single-core box and skip the cpuset writes entirely. `nproc --all`
reports installed processors regardless of affinity, which is what
the deploy actually wants.
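The same distinction can be seen from Python. This is only an illustrative sketch, not part of the deploy script: `os.sched_getaffinity` (Linux-only) follows the calling process's affinity the way plain `nproc` does, while `os.cpu_count` ignores affinity and is roughly comparable to `nproc --all`. The function name `cpu_counts` is made up for this example.

```python
import os
from typing import Tuple


def cpu_counts() -> Tuple[int, int]:
    """Return (cpus usable by this process, cpus the OS reports)."""
    # Affinity-blind count; this is the number the deploy should size cpusets from.
    total = os.cpu_count() or 1
    # Affinity-aware count; a shell confined by AllowedCPUs=0 sees 1 here
    # even on a multi-core host. Fall back to total where the call is missing.
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = total
    return usable, total


if __name__ == "__main__":
    usable, total = cpu_counts()
    print(f"usable by this process: {usable}, reported by the OS: {total}")
```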
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symmetric with the earlier mount cleanup (commits 519567e..a982995). Until
now, the unit's ExecStartPre handled mount but the Python side still drove
unmount: stop_instance and _purge_instance both called _mounter.unmount,
which wrapped sudo + the helper. Two code paths for two halves of the
same lifecycle.
Move unmount into the unit:
- ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i
(ExecStopPost, not ExecStop, so it runs after the cgroup is cleared;
ExecStop runs while srcds is still alive, so the umount syscall would fail with EBUSY.)
- Helper's umount verb is now idempotent (mirrors mount): if merged
isn't a mount point, return early. PRINT_ONLY mode bypasses both
short-circuits so the unit tests still exercise the full nsenter argv.
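The idempotency rule above, sketched in Python for illustration only (the real helper is a separate script whose contents are not shown here). The function name `plan_umount` and the nsenter argv are assumptions made for this example, not the helper's actual command line.

```python
import os
from typing import List, Optional


def plan_umount(merged: str, print_only: bool = False) -> Optional[List[str]]:
    """Return the argv to run, or None when there is nothing to unmount."""
    # Idempotency short-circuit: nothing to do if merged is not a mount point.
    # print_only skips the check so tests always get the full argv back.
    if not print_only and not os.path.ismount(merged):
        return None
    # Hypothetical argv: enter PID 1's mount namespace, then unmount.
    return ["nsenter", "--target", "1", "--mount", "umount", merged]
```

A caller would run the returned argv when it is not None and otherwise exit 0, which is what makes repeated ExecStopPost runs harmless.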
Drop the dead Python machinery:
- _mounter.unmount(...) calls in stop_instance and _purge_instance
- _mounter global + KernelOverlayFSMounter import
- The whole l4d2host/fs/ package (OverlayMounter ABC + KernelOverlayFSMounter
class) — no production callers, just self-tests
- l4d2host/tests/test_kernel_overlayfs.py
- test_stop_succeeds_when_unmount_fails / test_delete_succeeds_when_unmount_fails
(tested Python-side unmount-failure tolerance that no longer exists)
- The l4d2host.fs.kernel_overlayfs.run_command monkeypatches in lifecycle tests
After this, the only thing start_instance does beyond cfg-staging is ask
systemd to enable+start the unit. stop/delete/reset only ask systemd to
disable; the overlay lifecycle lives entirely in the unit file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
systemd 230+ moved StartLimitBurst= and StartLimitIntervalSec= from
[Service] into [Unit] (with the rename from StartLimitInterval=). Putting
them in [Service] makes systemd ignore them, leaving only a warning in
the journal: "Unknown key 'StartLimitIntervalSec' in section [Service],
ignoring." As a result, the restart-loop cap I claimed in commit 519567e
wasn't actually applied.
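A check along these lines could catch the misplacement in a test instead of in the journal. It is a minimal sketch under the assumption that unit files are simple `Key=Value` text; the function name `misplaced_start_limits` is hypothetical and no such test exists in the tree as far as this message shows.

```python
from typing import List

UNIT_ONLY_KEYS = {"StartLimitBurst", "StartLimitIntervalSec"}


def misplaced_start_limits(unit_text: str) -> List[str]:
    """Return StartLimit* keys that appear outside the [Unit] section."""
    section = None
    misplaced = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks and both comment styles systemd accepts.
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key = line.split("=", 1)[0].strip()
        if key in UNIT_ONLY_KEYS and section != "Unit":
            misplaced.append(f"{key} in [{section}]")
    return misplaced
```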
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unit has NoNewPrivileges=true (security hardening for srcds), which
blocks sudo's setuid escalation. The previous sudo'd ExecStartPre failed
on every start with "sudo: the 'no new privileges' switch is set, which
prevents sudo from running as root" -> Restart=on-failure loop.
systemd's `+` prefix runs the Exec command with full privileges (root, no sandbox),
bypassing User=/Group=/NoNewPrivileges=. Equivalent privilege scope to
the sudoers rule the web app already uses for the same helper, just
without the sudo middleman.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before this change there were two callers of left4me-overlay mount:
the web app's start_instance (Python, in-process) and the unit's
ExecStartPre (shell, via sudo). The duplication invited divergence; the
helper's recently-added idempotency made both paths technically work
but at the cost of a "first wins" race and dead-code retry logic in
start_instance.
Drop the in-process _mounter.mount() call from start_instance. The web
app now only stages cfg files (which still must happen on the host
filesystem before mount, to avoid overlayfs copy-up changing ownership),
then asks systemd to enable+start the unit; the unit's ExecStartPre
does the mount.
Removed:
- os.path.ismount(merged) refusal in start_instance and its test
(test_start_refuses_to_double_mount). The race the check guarded
against is now handled by the helper's idempotency.
- _load_instance_env helper and the `os` import (both became dead).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
systemd applies WorkingDirectory= to every Exec line including ExecStartPre.
With the merged dir not yet existing at boot time (the volatile overlay
mount has been wiped), the chdir into runtime/%i/merged/left4dead2 fails
with status=200/CHDIR before ExecStartPre can run the mount helper.
The `-` prefix makes chdir failure non-fatal: ExecStartPre runs in the
unit's home (cwd doesn't matter for the mount helper); ExecStart re-applies
WorkingDirectory once the mount has landed and chdirs successfully.
Companion to commit 519567e (which added the ExecStartPre mount + helper
idempotency but didn't account for the WorkingDirectory ordering).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The lifecycle change to systemctl enable --now (commit 8552c55) made
units auto-start at boot. But the kernel-overlayfs mount is volatile
(reboot kills it), and the web app's start_instance only re-mounts in
response to a UI click. Result: at boot, systemd starts the unit, finds
empty merged/, CHDIR fails, Restart=on-failure spins forever (counter
hit 65 on ckn before this fix landed).
Fix:
- Unit gets `ExecStartPre=/usr/bin/sudo -n .../left4me-overlay mount %i`
so the overlay is established before the main process starts.
- Helper is now idempotent: if merged is already a mount point, exit 0.
Required because Restart=on-failure re-runs ExecStartPre on each
cycle, and the web-app's start_instance also calls the helper, so
both paths would otherwise collide on "already mounted".
- StartLimitBurst=5 + StartLimitIntervalSec=60s caps the restart loop
instead of letting it spin indefinitely on a fundamental failure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Investigated whether to hard-pin each srcds instance to a single core
within the existing AllowedCPUs=1-7 set. Modern kernels (5.13+) no
longer expose kernel.sched_migration_cost_ns or the other classic CFS
"laziness" tunables, so a global cheap-fix is unavailable. Decision
for now: trust CFS + Nice=-5 + AllowedCPUs=1-7. Per-instance
CPUAffinity= remains an opt-in escape hatch in deploy/README.md.
Documents the revisit triggers and the preferred implementation path
when the time comes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A background thread spawned alongside the job workers polls every
server's status every STATE_POLLER_INTERVAL_SECONDS (default 30) and
writes the result via the existing refresh_server_actual_state path.
Servers with in-flight jobs (queued/running/cancelling) are skipped to
avoid racing the post-job refresh. Catches reboot drift, OOM kills,
manual systemctl operations, and any other out-of-band state change.
Spec: docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
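The polling loop described above has roughly this shape. This is a sketch, not the committed code: `servers_to_poll`, `run_state_poller`, the `job_state` key, and the callables passed in are stand-ins for the real models and for refresh_server_actual_state.

```python
import logging
import threading
from typing import Callable, Iterable, List

IN_FLIGHT_STATES = {"queued", "running", "cancelling"}


def servers_to_poll(servers: Iterable[dict]) -> List[int]:
    """Return ids of servers whose latest job is not in flight."""
    return [s["id"] for s in servers if s.get("job_state") not in IN_FLIGHT_STATES]


def run_state_poller(
    list_servers: Callable[[], Iterable[dict]],
    refresh: Callable[[int], None],
    stop: threading.Event,
    interval_seconds: float = 30.0,
) -> None:
    """Poll until stop is set; one failing server must not end the loop."""
    while not stop.is_set():
        for server_id in servers_to_poll(list_servers()):
            try:
                refresh(server_id)
            except Exception:
                logging.exception("state refresh failed for server %s", server_id)
        # Event.wait doubles as an interruptible sleep between rounds.
        stop.wait(interval_seconds)
```

Running it as `threading.Thread(target=run_state_poller, args=(...), daemon=True)` next to the job workers keeps shutdown simple: setting the event ends the loop within one interval.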
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Servers started via the web UI now create a WantedBy= symlink under
multi-user.target.wants/, so they auto-start on the next host reboot.
Helper verbs renamed start/stop -> enable/disable; service_control.py
renamed start_service/stop_service -> enable_service/disable_service.
The user-facing l4d2ctl start/stop commands keep their names per the
AGENTS.md contract -- only the implementation changes. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make explicit that the project uses system units (root systemctl, unit
under /usr/local/lib/systemd/system/, WantedBy=multi-user.target), so
`systemctl enable --now` is the correct verb to make instances survive
a host reboot. User units have different lifecycle rules and would not
auto-start at boot without enable-linger.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two TDD tasks: helper+service_control verb rename, then poller code
+ wiring + tests. Operator-side smoke test in F.3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch lifecycle verbs from systemctl start/stop to enable --now /
disable --now (servers survive host reboot via WantedBy= symlinks),
plus a periodic state poller for runtime drift (OOM kills, manual
systemctl ops, exhausted Restart=on-failure).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS /
LEFT4ME_GAME_CPUS overrides, the single-core skip, and the
subset-of relationship with per-instance CPUAffinity=.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Computes NPROC at deploy time. Defaults LEFT4ME_SYSTEM_CPUS=0 and
LEFT4ME_GAME_CPUS=1-(NPROC-1). Single-core hosts skip cpuset writes
with a stderr warning unless an env var override is set. Spec:
docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md
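The defaulting rule, restated as a Python sketch (the deploy script itself is shell). The function name `cpuset_defaults` is hypothetical, and the fallback to "0" when a single-core host sets only one override is an assumption, since the message does not say what happens in that case.

```python
from typing import Mapping, Optional, Tuple


def cpuset_defaults(nproc: int, env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (system_cpus, game_cpus), or None when cpuset writes are skipped."""
    system = env.get("LEFT4ME_SYSTEM_CPUS")
    game = env.get("LEFT4ME_GAME_CPUS")
    if nproc < 2:
        # Single-core host: skip unless the operator set an override.
        if system is None and game is None:
            return None
        # Assumption: an unset variable falls back to the only core there is.
        return system or "0", game or "0"
    # Multi-core host: core 0 for the system, cores 1..NPROC-1 for game servers.
    return system or "0", game or f"1-{nproc - 1}"
```

On an 8-core host with no overrides this yields `("0", "1-7")`, matching the AllowedCPUs=1-7 set mentioned elsewhere in this log.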
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two TDD tasks: deploy-script cpuset block + tests, README
"CPU isolation" subsection. Operator-side smoke test in F.3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cgroup-v2 AllowedCPUs= drop-ins for system/user/build/game slices.
Defaults: core 0 for everything-not-game, cores 1..N-1 for game,
computed from nproc. LEFT4ME_SYSTEM_CPUS / LEFT4ME_GAME_CPUS
overrides; single-core hosts skip with a warning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- RT example: add AmbientCapabilities=CAP_SYS_NICE so the User=left4me
service can actually enter SCHED_FIFO on Trixie.
- CPU governor: note that linux-cpupower may need apt install.
- CPUAffinity=2: clarify that per-instance values typically increment.
- NIC tuning: note that ethtool may need apt install.
Documents CPU governor, per-instance CPUAffinity, NIC tuning, and
SCHED_FIFO opt-in patterns. None of these are auto-applied; they're
ops-side knobs for measured problems the perf baseline doesn't solve.
Copies l4d2-game.slice and l4d2-build.slice into
/usr/local/lib/systemd/system/, installs 99-left4me.conf into
/etc/sysctl.d/, and runs sysctl --system so the perf baseline is
live this deploy, not on next reboot.
Builds yield CPU/IO to game-server instances under contention via the
slice's weight=10, and are killed first under memory pressure
(servers have OOMScoreAdjust=-200).
Matches the spec-pointer comment Task 1 added to
left4me-server@.service. A future operator running
`systemctl cat l4d2-game.slice` now finds the rationale.
Flat top-level slices. Game wins under contention; build still gets
the box when uncontended. Referenced by left4me-server@.service and
the script-sandbox systemd-run invocation.
Six tasks (TDD, one commit each): unit directives, slice files,
sysctl conf, sandbox slice + OOMScoreAdjust, deploy-script wiring,
README escape-hatch section. Final verification step with full
deploy + host + web pytest sweep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing left4me-script-sandbox helper uses systemd-run in
transient service mode (--unit=, no --scope). Spec wrongly said
'--scope'. No semantic change — the design's --slice= and
-p OOMScoreAdjust= guidance is identical for service vs scope mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cedapug's build script writes .cedapug/manifest.tsv with mode 0600 owned
by l4d2-sandbox; the web service (left4me uid) then 500s when streaming
that file via the download route — PermissionError on open().
Two fixes:
- UMask=0022 on the systemd-run unit so new file writes default to
0644 / dirs to 0755.
- Post-script chmod o+r/o+rx walk over the overlay dir to backfill any
stricter modes the script left behind (e.g. shells/tools that ignore
umask and explicitly create with 0600).
The helper no longer execs systemd-run; it captures the rc, runs the
post-step, and exits with the original rc.
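The post-script backfill amounts to the following walk, shown here as a Python sketch rather than the helper's actual shell. `backfill_other_read` and `_add_bits` are illustrative names; skipping symlinks is an assumption about the desired behavior.

```python
import os
import stat


def _add_bits(path: str, bits: int) -> None:
    """OR the given permission bits into path's mode if any are missing."""
    mode = stat.S_IMODE(os.lstat(path).st_mode)
    if mode & bits != bits:
        os.chmod(path, mode | bits)


def backfill_other_read(root: str) -> None:
    """Add o+rx to directories and o+r to regular files under root."""
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        _add_bits(dirpath, stat.S_IROTH | stat.S_IXOTH)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                _add_bits(path, stat.S_IROTH)
```

Only bits are added, never removed, so a 0600 manifest becomes 0604 and files that are already world-readable are left untouched.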
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a /servers/<id>/files/download route mirroring the overlay download
endpoint. Same safety rules: real-path must resolve under LEFT4ME_ROOT
(merged view threads through `installation/` and overlay layers, all
already inside the root). The server file-tree partial now renders
download links.
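The containment rule can be sketched as below. This is an illustration of the check, not the route's real code; `resolve_under_root` is a made-up name and the paths in the usage note are examples.

```python
import os
from typing import Optional


def resolve_under_root(root: str, base: str, rel: str) -> Optional[str]:
    """Resolve base/rel and return it only if the real path stays under root."""
    real_root = os.path.realpath(root)
    # realpath collapses ".." segments and follows symlinks before the check.
    candidate = os.path.realpath(os.path.join(base, rel))
    if os.path.commonpath([real_root, candidate]) != real_root:
        return None
    return candidate
```

With `root="/var/lib/left4me"` and `base` pointing at a server's merged directory, a request for `../../../../../etc/passwd` returns None, while a symlink that lands elsewhere inside the root still resolves.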
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the inline Name input from the blueprint edit form. A Rename link
sits next to Delete in the page footer; clicking opens a one-line modal
that posts to a new POST /blueprints/<id>/rename route. The main edit
form keeps the current name as a hidden input so its full Save still
works unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a Files section at the bottom of the server detail page that lists
the kernel-overlayfs merged view at runtime/<server_id>/merged/. Reuses
the overlay file-tree partial via two new template variables:
- files_base_url: parent passes "/overlays/<id>" or "/servers/<id>"
- download_supported: false for servers (runtime holds large game
binaries; no download endpoint), true for overlays (existing behavior)
New service helper safe_resolve_for_server_listing() rejects path
traversal beyond the merged root and returns None when the overlayfs
mount doesn't exist (server never started or just reset).
New route GET /servers/<id>/files?path=<rel> returns the lazy-load
file-tree fragment, gated to the server owner. No download counterpart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vendors HTMX 2.0.4 (the prior file was a 1-line stub) and uses it to poll
two new partials on a 2s tick while a job is in flight:
- /servers/<id>/actions → state badge, filtered action buttons,
last-job sentence, live job log (SSE) while a Start/Stop/Reset job
is running. When the job is terminal the partial re-renders without
hx-trigger and polling stops.
- /overlays/<id>/build-status → build state badge, last-build
sentence, live job log while a build_overlay job is running. Same
terminal-state stop behavior.
Server detail restructure:
- Editable name moves out of the page body into a Rename modal
triggered from a link next to Delete in the page footer.
- Compact dl with Port (linked as steam://run/550//+connect <host>:<port>)
and Blueprint.
- Actions row: state badge + state-filtered buttons (start/stop, reset)
+ last-job sentence. Drift warning when desired ≠ actual.
- Recent Jobs table removed.
Overlay detail restructure:
- Single panel, dl Type/Scope, no separate Last build row, no Builds
section.
- Script form gets two compound submits: "Save and build" and
"Save, reset and rebuild". Standalone Rebuild/Wipe gone.
- Build status state badge + last-build sentence under the editor;
action buttons hide while a build is in flight.
- Rename modal in the page footer next to Delete.
sse.js binds on htmx:load (covers initial document and post-swap inserts)
and closes EventSources on htmx:beforeCleanupElement to avoid leaking
streams across swaps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Detail panels: softer (color-mix --line-soft) border. h2 sub-section
spacing inside a single outer panel. admin and job_detail collapse to
one panel each.
- Color tokens: --color-button-primary / --color-button-danger stay
saturated in dark mode so white text on filled buttons stays readable.
- Site header: transparent, no full-width bar; aligned with panel-content
width. No more sticky.
- Page-level Delete: low-contrast outline button at the page footer
(left side, justify-content flex-start). Save buttons no longer
full-width (.stack > button { justify-self: end }).
- form-actions-inline helper for right-aligned button rows.
- New service: l4d2web.services.timeago.humanize_delta — used by the
upcoming server / overlay live-status partials.
- Server route: POST /servers/<id> renames the server (mirrors the
overlay update pattern, returns 409 on per-user duplicate).
- Overlay route: POST /overlays/<id>/script handles `action` form value
— `save_build` (default) or `save_reset_build` (wipes overlay dir
before queuing build). Redirect lands on /overlays/<id> instead of
the job page so users see the live status.
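For readers unfamiliar with the timeago helper mentioned above, a function of that kind might look like this. The sketch is hypothetical: the wording, thresholds, and signature of the real humanize_delta are not shown in this message, so the name is suffixed to keep the two apart.

```python
from datetime import datetime


def humanize_delta_sketch(then: datetime, now: datetime) -> str:
    """Render the gap between two datetimes as a short 'N units ago' phrase."""
    seconds = max(0, int((now - then).total_seconds()))
    if seconds < 60:
        return "just now"
    # Largest unit first, so 90 minutes reads as "1 hour ago".
    for unit, size in (("day", 86400), ("hour", 3600)):
        if seconds >= size:
            n = seconds // size
            return f"{n} {unit}{'' if n == 1 else 's'} ago"
    n = seconds // 60
    return f"{n} minute{'' if n == 1 else 's'} ago"
```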
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each linked overlay gets a checkbox on the blueprint detail page that opts
its server.cfg in as exec server_overlay_<id>. The web app builds the
spec with {path, alias} per overlay and prepends exec server_overlay_<id>
lines to the blueprint config in lowest-overlay-first order. The host
stages those copies in the overlayfs upper layer before mounting (avoids
copy-up writes against a sandbox-uid file). A live preview block above the
Config textarea shows what gets auto-executed.
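The prepend step can be pictured like this. It is a sketch under the assumption that the overlay list arrives already sorted lowest layer first and that each entry is a dict with `id` and `expose_server_cfg`; the function name is hypothetical.

```python
from typing import Iterable


def prepend_overlay_execs(config: str, overlays: Iterable[dict]) -> str:
    """Put one exec line per opted-in overlay ahead of the blueprint config."""
    lines = [
        f"exec server_overlay_{o['id']}"
        for o in overlays
        if o.get("expose_server_cfg")
    ]
    # Overlays that did not opt in contribute nothing.
    return "\n".join(lines + [config]) if lines else config
```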
Schema:
- alembic 0007: BlueprintOverlay.expose_server_cfg BOOLEAN
Spec contract:
- l4d2host OverlayRef(path, alias?). load_spec accepts both bare-string
and {path, alias} entries.
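Accepting both entry shapes could look roughly like this. OverlayRef's two fields come from the contract above; the parsing function `parse_overlay_entry` and its error handling are assumptions for illustration, not load_spec's actual code.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class OverlayRef:
    path: str
    alias: Optional[str] = None


def parse_overlay_entry(entry: Any) -> OverlayRef:
    """Accept a bare path string or a {path, alias} mapping."""
    if isinstance(entry, str):
        return OverlayRef(path=entry)
    if isinstance(entry, dict) and isinstance(entry.get("path"), str):
        return OverlayRef(path=entry["path"], alias=entry.get("alias"))
    raise ValueError(f"unsupported overlay entry: {entry!r}")
```

Keeping the bare-string form means specs written before this change still load, with `alias` left as None.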
Side effects folded in (same file in l4d2_facade):
- start_server auto-initializes; the manual Initialize step is no longer
needed before Start.
- initialize_server no longer runs blueprint builders — builds happen on
overlay save, not on every server Start.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make explicit that design specs go in docs/superpowers/specs/ and
implementation plans go in docs/superpowers/plans/, both committed
to git, with the YYYY-MM-DD-<topic>[-design].md naming already used
elsewhere in the tree. The plan-mode scratch file under
~/.claude/plans/ is fine while plan mode is open, but the persisted
artifact must end up inside the repo.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the per-row checkbox + numeric Order table on the blueprint
detail page with a drag-to-reorder list of selected overlays plus a
native <select> for adding more. Removing uses an × button per row;
the option sorted-inserts back into the dropdown alphabetically.
Native HTML5 drag-and-drop, no library, no JS-disabled fallback.
Server contract is unchanged: each list row owns one hidden
<input name="overlay_ids">, DOM order = submission order, and the
existing fallback_position branch in ordered_overlay_ids_from_form
absorbs the now-omitted overlay_position_<id> fields.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace per-row checkbox + numeric Order inputs with a drag-to-reorder
list of selected overlays plus a native <select> for adding more.
Native HTML5 DnD; no library, no JS-disabled fallback. Server contract
unchanged (overlay_ids in DOM order; existing fallback_position branch
absorbs the omitted overlay_position_<id> fields).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bash script, Arguments and Config are all structured text — render them
in a monospace font with tab-size: 4 and resize: vertical via a base
'textarea' rule in components.css. Add rows="8" + spellcheck="false"
to the blueprint Arguments/Config textareas (both edit and create
forms) so they're a sensible size and consistent with each other.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The flex 'gap' shorthand on .file-tree-row was setting row-gap as well
as column-gap, so when the .file-tree-children div wrapped to a new
line the row-gap (--space-s) was added on top of the nested ul's
margin-top (--space-xs), making the button-to-first-child gap visibly
bigger than the sibling-row gap. Switch to 'gap: 0 var(--space-s)' so
only column-gap applies; vertical rhythm is now owned exclusively by
the outer grid gap (--space-xs) and the nested ul margin-top
(--space-xs), both equal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CSS fixes that together turn the rendered file tree from
'everything on one line' into an actual tree:
- .file-tree-children: flex-basis: 100% so an expanded folder's children
wrap to the next line of the parent <li> flex container instead of
flowing inline next to the toggle button.
- .file-tree-row-file: padding-left = chevron width, so file rows align
visually with sibling folder names (folder names are offset by their
chevron; files have no chevron, so without padding they'd start at
the chevron column instead of the name column). Chevron itself
pinned to width: 1ch so rotated/un-rotated states have identical
layout.
Drops the 'only on first creation' guard so newly added env vars reach
existing boxes (today's SESSION_COOKIE_SECURE=false rake). SECRET_KEY
is now sha256(/etc/machine-id) — stable per host, no session
invalidation across redeploys, no state persisted in /etc that the
deploy has to tiptoe around. Single-operator test deployment; the
secret being machine-id-derivable is acceptable per deploy/README.md.
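In Python terms the derivation is a one-liner. This sketch assumes the raw file bytes (trailing newline included) are hashed and the hex digest is used as the key; the deploy script may trim or encode differently, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def secret_key_from_machine_id(machine_id: bytes) -> str:
    """Hash the machine id so the same host always gets the same key."""
    return hashlib.sha256(machine_id).hexdigest()


def read_secret_key(path: str = "/etc/machine-id") -> str:
    """Derive the key from the file on disk; no extra state is written."""
    return secret_key_from_machine_id(Path(path).read_bytes())
```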
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tickrate and other seeded examples whose overlay directory exists but
hasn't been built yet rendered a visually blank Files panel — entries
was [] (not None), so the template fell through to an empty <ul>. Use
'not file_tree_root_entries' so both None (dir missing) and []
(dir empty) trigger the 'No files yet' message.
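The difference between the two checks, shown in plain Python since Jinja follows the same truthiness rules for None and empty lists. The function names are illustrative only.

```python
from typing import List, Optional


def shows_empty_message_old(entries: Optional[List[str]]) -> bool:
    """Old check: only a missing directory (None) counts as empty."""
    return entries is None


def shows_empty_message_new(entries: Optional[List[str]]) -> bool:
    """New check: None and [] are both falsy, so both count as empty."""
    return not entries
```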
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The vendored static/vendor/htmx.min.js turned out to be a 33-byte
placeholder, so the hx-get/hx-target/hx-trigger attributes on the
overlay file tree's folder buttons were inert: clicks rotated the
chevron (own JS) but never fetched. Switch the lazy-load to a
~30-line plain-JS handler in static/js/file-tree.js that fetches
button.dataset.filesUrl on first expand and dedupes via dataset.loaded.
Update the spec/plan to match. Route + partial contracts unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a server-rendered collapsible file tree section to the overlay
detail page so users can verify what their script/workshop overlays
produced and pull individual artifacts (VPKs, configs) without SSH.
HTMX-driven lazy folder expansion with click-to-download via send_file;
symlinks may resolve anywhere under LEFT4ME_ROOT (so workshop addons stream
from the shared cache) but escapes are refused. Same access rule as the
rest of the page (admin or owner). 39 new tests; full web suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the design rationale for the new overlay-detail Files section
(verify build output, click-to-download for individual files via Flask
send_file, HTMX-driven lazy folder expansion) and the paired
implementation plan that produced it. Adds .superpowers/ to .gitignore
so brainstorm session artifacts never sneak into a future commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Host-side identifier (systemd unit name and /var/lib/left4me dirs) is now
str(server.id), centralized in services/server_identity.server_unit_name.
Server.name becomes a free-form display label, required and unique per
user (was [a-z0-9_-]{1,64} and globally unique).
Migration 0006 swaps the old global UNIQUE(name) for UNIQUE(user_id, name).
Web routes already keyed on id; templates only used name for display.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pins to python3.13 to match the Debian Trixie production target.
Documents the dev setup in README and AGENTS.md so a fresh checkout
gets a working `python` via `direnv allow` + editable installs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles four reference script overlays (cedapug_maps, l4d2center_maps,
competitive_rework, tickrate) and adds a `flask seed-script-overlays`
CLI that upserts each *.sh as a system-wide overlay. Test deploy
invokes it after the orphan-cleanup migration so fresh test servers
come up with the same overlays the user has been maintaining by hand.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>