Commit graph

455 commits

Author SHA1 Message Date
mwiegand
d25fb57f30
profile: POST /profile/password validation branches
Implements the change-password endpoint:
- Per-IP rate limit reusing services/rate_limit
- Required fields, mismatched-confirm, policy, wrong-current
  branches each redirect with a specific ?error= key
- Rotates digest + password_changed_at, then re-stamps the
  current session marker so this browser stays logged in
  while other sessions get rejected by load_current_user

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:57:11 +02:00
mwiegand
eef85f36a9
profile: GET /profile page with change-password form
Adds the page reachable from the username link in the header.
Renders the form skeleton; the POST handler lands in the next
commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:55:34 +02:00
mwiegand
e75f379dcb
auth: reject sessions older than user.password_changed_at
load_current_user now treats a session whose pw_changed_at marker
is missing, malformed, or older than the user's current
password_changed_at as logged-out. Same shape as the existing
user.active check.

Forced fan-out updates to every test fixture that forges a session
via session_transaction(): each now stamps a current pw_changed_at
marker. test_deactivated_user_existing_session_invalidated keeps
its meaning — the deactivation still flips the user to inactive,
and load_current_user rejects the session via the user.active
branch before reaching the freshness branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:54:13 +02:00
mwiegand
84dc672180
auth: stamp password_changed_at marker in session on login
login_user now records the user's current password_changed_at on the
session. The next commit will use this marker to invalidate sessions
whose password has been rotated under them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:46:20 +02:00
mwiegand
26a6a9d7b0
rate-limit: extract generic helper, reuse from login
Pulled the per-IP sliding-window check out of auth_routes so the
upcoming /profile/password endpoint can share it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:45:51 +02:00
mwiegand
a5982941df
auth: validate_new_password helper (min length 8)
Single source of truth for the password policy, to be reused by the
upcoming /profile/password endpoint and (optionally) the create-user
CLI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:45:03 +02:00
mwiegand
2353378b23
alembic: add users.password_changed_at column
Backfills existing rows from created_at, then enforces NOT NULL.
Existing sessions without a pw_changed_at marker will be rejected
on next request once the freshness check lands (one-time forced
re-login post-deploy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:44:39 +02:00
mwiegand
eb1f2b82eb
models: add User.password_changed_at
First step of the self-service password-change feature: a timestamp
that backs the per-session freshness check used to invalidate other
sessions on password change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:43:25 +02:00
mwiegand
6eb9bd0ab3
docs: plan for profile page with self-service password change
Adds /profile reachable via header username, with change-password form
as its first section. Industry-standard session semantics: other sessions
invalidated on password change, current session kept, via new
users.password_changed_at column + session marker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 21:41:25 +02:00
mwiegand
e1149704c8
job_worker: don't duplicate streamed stderr on HostCommandError
services/host_commands.run_command pumps each stderr line into JobLog
via on_stderr (job_worker.py:215) before it raises HostCommandError —
appending exc.stderr again as a single row produced a second copy of
the entire traceback truncated at JOB_LOG_LINE_MAX_CHARS (4096), which
was visible as the awkward duplicated/cut-off second block at the end
of failed install logs.

Split the existing `except subprocess.CalledProcessError` into two:

  except HostCommandError: stderr already streamed — just record exit
    code + last error summary on the job/server row. No log append.

  except subprocess.CalledProcessError: catches raw CalledProcessErrors
    raised outside host_commands (no pump ran), so still append stderr
    to the log. Preserves the path test_called_process_error_fails_job
    exercises.

New regression test asserts a HostCommandError with multi-line stderr
doesn't land as a single concatenated JobLog row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 22:52:54 +02:00
mwiegand
363f429c7a
l4d2host: LEFT4ME_STEAMCMD env var for steamcmd path
SteamInstaller defaults to steamcmd="steamcmd" (bare name), which relies
on PATH lookup. Deployments that don't have steamcmd on PATH — or where
steamcmd.sh's `cd "$(dirname "$0")"` breaks under PATH-symlink invocation
(observed with the Valve-shipped script) — can now pin an absolute path
via LEFT4ME_STEAMCMD. Default keeps bare-name lookup for dev/tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 22:46:21 +02:00
mwiegand
c594d4b5e8
tests: admin user management
14 tests covering /admin/users/<id>/{deactivate,activate,delete}:

  - deactivate/activate flips and 404 on unknown user
  - deactivate-self refused (409)
  - deactivated user cannot log in (same 401 as wrong-password)
  - existing sessions stop working after deactivation (load_current_user
    returns None for inactive users → @require_login redirects to /login)
  - delete-self refused (409)
  - delete refuses when user owns Server, Blueprint, or custom Overlay
  - delete on orphan succeeds (302 → /admin/users)
  - delete nulls out Job.user_id (jobs survive as audit trail)
  - delete-other-admin succeeds when more than one admin exists

The "last admin" branch in the delete endpoint is defense-in-depth and
unreachable via normal flow (any path that triggers it is shadowed by
self-delete) — covered by a comment, not a test.
2026-05-10 21:19:03 +02:00
mwiegand
bcea450e98
admin: deactivate/activate/delete endpoints for /admin/users
Three new POST endpoints on the existing admin blueprint, all guarded
by @require_admin and CSRF (per the global before_request hook):

  /admin/users/<id>/deactivate  flips active=False (refuses self)
  /admin/users/<id>/activate    flips active=True
  /admin/users/<id>/delete      hard delete with safeties:
    - refuses self-delete
    - refuses delete-of-the-last-admin
    - refuses if the user owns Servers, Blueprints, or custom
      Overlays (operator deletes those first via existing UIs)
    - nulls out Job.user_id (jobs stay as audit trail; FK is nullable)

admin_users.html grows an Active column + an Actions column with the
appropriate button per row (none for self, Deactivate/Activate
toggle, Delete-with-confirmation modal). Modal pattern mirrors
blueprint_detail.html (same modal-close/modal-open data attrs,
csrf_token hidden field).

Refusal responses are 409 with a plain-text body (matches the
blueprint-in-use refusal at blueprint_routes.py:182). No flash
infrastructure introduced; consistent with the rest of the codebase.

All 367 existing tests still pass.
2026-05-10 21:15:52 +02:00
mwiegand
3490be5fb7
auth: reject inactive users at login + invalidate existing sessions
Two-pronged enforcement so deactivation has effect both for fresh
logins and already-issued sessions:

  - load_current_user(): treat User with active=False as logged-out
    (sets g.user=None). Existing sessions stop working immediately.
  - login(): include `not user.active` in the existing 401 condition,
    so deactivated accounts get the same "invalid credentials"
    response as wrong-password / unknown-user — no timing oracle for
    deactivation status.

Tests still green (12/12 in test_auth.py).
2026-05-10 21:13:31 +02:00
mwiegand
726acfa4ff
models: add User.active column for soft-delete (deactivation)
Default true; server_default '1'. Lets the admin UI deactivate a user
without losing the row or the user's content (servers, blueprints,
overlays). Reactivation flips it back. Migration 0008 adds the column
via op.add_column; downgrade uses batch_alter_table per SQLite ALTER
TABLE semantics, matching the 0007 pattern.
2026-05-10 21:12:27 +02:00
mwiegand
0811d22c44
deploy/README: mark as historical reference, point at ckn-bw
ovh.left4me is now provisioned by the ckn-bw bundle bundles/left4me/
(attached via groups/applications/left4me.py); run `bw apply
ovh.left4me` from there.

Keep this directory verbatim as deployment-knowledge reference: what
was configured, what each unit/helper does, why the privileged
boundaries are drawn the way they are. Add a top-of-README
correspondence table marking which files migrated 1:1 vs. which are
obsolete in the new architecture (CAKE moved to systemd-networkd;
nft marking moved into the central nftables bundle; systemd units
are emitted by a metadata reactor; CPU isolation drop-ins are no
longer managed declaratively).

The deploy-test-server.sh stays here too — useful as a concrete walk-
through of the install steps the bundle now performs declaratively.
Just don't run it against an ovh.left4me node managed by ckn-bw; the
two would fight over file ownership, sudoers, and unit definitions.
2026-05-10 18:25:23 +02:00
mwiegand
a987304358
fix(deploy): make iproute2 explicit + document disable recipe
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:29:22 +02:00
mwiegand
9f0b51b455
docs(deploy): document network-shaping defaults + opt-in network knobs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:09:28 +02:00
mwiegand
26f3d270b0
feat(deploy): wire nft marking + CAKE shaper into deploy script
Installs nftables via apt/dnf, copies left4me-mark.nft and left4me-apply-cake
helper into system paths, conditionally seeds cake.env (preserving operator
edits), and enables left4me-nft-mark.service + left4me-cake.service on deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 01:04:12 +02:00
mwiegand
a9ca90537b
feat(deploy): left4me-cake.service oneshot wrapping apply-cake helper
The CAKE egress shaper now has a systemd unit that wraps the
left4me-apply-cake helper in apply and clear modes. The unit is a
oneshot that starts after network-online and survives service restarts,
allowing the shaper to persist across reboots and be managed by systemd.
The environment file is marked non-fatal (EnvironmentFile=-) to handle
missing or incomplete configurations gracefully.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:58:42 +02:00
mwiegand
878639147a
feat(deploy): left4me-apply-cake helper with apply/clear modes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:52:16 +02:00
mwiegand
d783449d05
feat(deploy): cake.env template with documented uplink knobs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:49:08 +02:00
mwiegand
fbb342db87
feat(deploy): systemd unit to load/clear left4me_mark nftables table
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:35:27 +02:00
mwiegand
076bfb72ca
feat(deploy): nftables uid-based DSCP-EF + skb-priority marking for srcds
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:32:53 +02:00
mwiegand
e822e9fbc7
feat(deploy): extend sysctls with udp_*_min, fq_codel default, BBR
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:28:24 +02:00
mwiegand
e1add4fffa
docs(plans): l4d2 network shaping & marking — implementation plan
Eight TDD tasks: sysctl extension, nftables marking (file + unit), CAKE
shaper (env + helper + unit), deploy-script wiring, README. Each task
adds one artifact with its assertion in test_deploy_artifacts.py and
ends in its own commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:10:40 +02:00
mwiegand
0cc92f2c17
docs(specs): l4d2 network shaping & marking — design
CAKE egress shaping (test-deploy oneshot + systemd-networkd [CAKE] block
on prod), nftables uid-based DSCP-EF + skb-priority marking for srcds
UDP, plus rounding sysctls (udp_rmem_min/wmem_min, default_qdisc=fq_codel,
tcp_congestion_control=bbr). Hardware-specific knobs stay documented
escape hatches matching the perf-baseline boundary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 00:05:44 +02:00
mwiegand
62d6d4cbcd
ui(files-overlay): label root row as "/" instead of "(overlay root)"
Tighter, more terminal-flavored. Mono font on the label echoes how
paths are rendered elsewhere in the tree. New-folder dialog title
also shows "/" when targeting the root.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 19:50:14 +02:00
mwiegand
2bba1f31d0
fix(files-overlay): post-deploy bug sweep + root-as-row UX
Three bugs surfaced in browser testing, plus one UX request:

1. The Uploads panel and the binary-mode editor sub-panels stayed
   visible after `el.hidden = true` because their `display: flex/grid`
   rules in components.css have the same specificity as the UA's
   `[hidden]{display:none}` and come later in cascade. Add a targeted
   `[hidden]!important` rule for the affected classes.

2. Clicking a folder toggle inside a `files` overlay did nothing.
   `file-tree.js` looked for `.file-tree-children` via
   `button.nextElementSibling`, but the files-overlay row template
   inserts a per-row action span between the toggle and the children
   div. Switch to `closest('.file-tree-row').querySelector(':scope >
   .file-tree-children')` so both row variants resolve correctly.

3. Pressing Enter on the new-folder dialog did nothing — the keydown
   handler was attached with `{once:true}` inside `openNewFolder`,
   so the first letter the user typed consumed the listener and Enter
   never fired. Move the listener to module init so it survives
   subsequent keystrokes and dialog reopenings.

UX: render the overlay root as a row inside the tree (label
"(overlay root)") rather than as a separate toolbar. The root row
carries the same `+ new file · + new folder · ⬇ zip` hover-action
column as every other folder row, so drop-on-row, hover-reveal, and
data-target-path semantics are uniform across the tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 19:46:19 +02:00
mwiegand
76cd7ddda0
fix(files-overlay): fall back to getAsFile when webkitGetAsEntry returns null
webkitGetAsEntry() only returns an Entry for real OS-originated drag-drops;
synthetic DragEvents (and some browsers without folder-drop support) get
null back. Per-item fallback to getAsFile() keeps single-file drops working
in those cases without sacrificing the whole-folder upload path on real
OS drops.

Caught while end-to-end testing on the deploy box: a programmatically-
dispatched drop fired the listener and reached preventDefault(), but no
upload row appeared because the file collection loop never enqueued.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 19:11:41 +02:00
mwiegand
2d3c98866a
feat(files-overlay): user-managed file content as a third overlay type
Adds Overlay.type='files' whose source-of-truth IS the overlay directory
itself. Users can:

  * upload arbitrary files / whole folders by dragging from the OS onto a
    folder row in the file tree (one POST per file, queue with
    concurrency 3, per-file progress in a floating Uploads panel)
  * move via drag-and-drop inside the tree (same gesture, source
    distinguishes; refuses cycles)
  * create / edit / rename / replace through a single editor modal
    (text flavor for editable files, binary flavor with replace-upload
    for everything else; filename input is the rename surface)
  * mkdir empty folders (slashes allowed for nested intermediates)
  * stream a folder as a zip download
  * delete files and empty folders

Backend is type-agnostic past the new files_routes endpoints, so the
existing mount / spec / overlayfs / expose_server_cfg pipeline is reused
unchanged. is_editable gates the row's edit affordance and the /save
content rules. Three new safe-resolve helpers (write/delete/move) cover
the new operations with the same anchor-and-resolve pattern as listing
and download. FilesBuilder is a no-op so the build subsystem can
dispatch uniformly.

Spec: docs/superpowers/specs/2026-05-09-files-overlay-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 18:59:32 +02:00
mwiegand
36d3d83de6
docs: postmortem for the overlay-umount EBUSY rabbit hole
Captures the symptom (Reset blew up on `umount target busy`), the
false starts (eager retry, lazy fallback, TimeoutStopSec bump — all
shipped briefly and reverted), the actual root cause (the helper's
own Python interpreter inheriting and pinning the unit's mount
namespace), and the fix (nsenter at the systemd Exec line).

The lessons section is the part future-me reads first: a retry loop
is a hint that something we own is the blocker; probe `/proc/*/ns/mnt`
before assuming kernel async; `+` Exec prefix doesn't escape the
unit's mount namespace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:50:41 +02:00
mwiegand
87d56a0910
fix(web): event-delegate modal triggers so HTMX-swapped buttons work
The previous wiring attached click listeners on DOMContentLoaded, so
any [data-modal-open] / [data-modal-close] / dialog.modal element
that came in via a later HTMX partial swap silently lost its
behaviour. The server-detail Actions partial reloads its reset/delete
triggers on every state change, so reset was unclickable after the
first state change post-load.

Switch to a single delegated click handler on document. Same logic,
but matches via Element.closest() so it works regardless of when an
element was added to the DOM. No re-bind needed after HTMX swaps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:18:27 +02:00
mwiegand
5eac51a93e
fix(deploy): wrap overlay helper with nsenter so it doesn't pin the unit's mount namespace
systemd's `+` Exec prefix removes sandbox/credentials but does NOT
detach from the unit's per-service mount namespace (created by
PrivateTmp/Protect*). The Python interpreter for the helper was
launched inside that namespace, and even though the helper internally
nsenter'd into PID 1 for the umount syscall, the calling Python
process itself never left the unit's namespace. Its existence pinned
the namespace alive, which kept the slave mount tree alive, which
made PID 1's umount return EBUSY for the entire duration of the
helper's run. The mount became unmountable the moment the helper
exited — empirically verified by polling /proc/*/ns/mnt during stop:
the only PID holding the dying namespace was the helper itself.

Wrap both ExecStartPre and ExecStopPost with `/usr/bin/nsenter
--mount=/proc/1/ns/mnt --` so the helper Python interpreter runs in
PID 1's mount namespace from the start. With the helper out of the
unit's namespace, umount succeeds first try once the cgroup empties.
Reset went from ~25 s with retry/lazy-fallback workarounds to ~0.5 s
clean.

Knock-on cleanups:
- Helper drops internal nsenter for the syscalls (already in PID 1's
  namespace), and drops the eager-retry loop + lazy-umount fallback +
  inner work_inner retry (no race left to ride out).
- Revert TimeoutStopSec=60s back to 15s.
- Tests updated to expect the new argv shapes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:13:59 +02:00
mwiegand
936c8bb81c
fix(deploy): ExecStart srcds_run from merged overlay, not installation/
srcds_run is a shell script that cd's to its own dirname before exec'ing
srcds_linux, so WorkingDirectory has no effect — the binary's path is what
determines where the engine reads gameinfo.txt and addons from. Pointing
at installation/srcds_run resolved everything against the lower layer, so
overlay-provided Metamod/SourceMod plugins and cfgs (zonemod, confogl)
never loaded. Switch to runtime/%i/merged/srcds_run so the engine sees
the merged tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 14:03:12 +02:00
mwiegand
ddf73c4d27
test(deploy): drop stale web.env lifecycle assertions
`test_deploy_script_has_safe_defaults_and_preserves_state` had been red
since commit caa8b83 ("rewrite web.env every deploy with machine-id-
derived SECRET_KEY"). Two assertions encoded the prior model:

- `if [ ! -f /etc/left4me/web.env ]` — the create-only-if-missing guard
  caa8b83 removed in favor of unconditional `install -m 0640 ...`.
- `. /etc/left4me/web.env not in script` — masked by the first failing
  but also stale: the deploy intentionally sources web.env in the
  alembic and seed-script-overlays helper subprocesses so they get
  DATABASE_URL.

Removed both. The full suite now runs 0 failed. The note left in place
points future readers at the live coverage path (install + SECRET_KEY
rewrite + run_left4me_with_env plumbing already asserted nearby).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:33:05 +02:00
mwiegand
59771f91c4
fix(deploy): drop deleted l4d2host.fs from pyproject + use nproc --all
Two bugs surfaced by the previous deploy attempt:

1. l4d2host/pyproject.toml still listed `l4d2host.fs` in the explicit
   packages= list. After deleting the fs/ package, pip install -e fails
   with "package directory './fs' does not exist".

2. The CPU-isolation deploy step uses `nproc` to detect host core count,
   but `nproc` honors Cpus_allowed of the calling shell. On a host that
   already has the cpuset drop-ins applied (system.slice/user.slice →
   AllowedCPUs=0), the SSH login lands constrained to one core and
   `nproc` returns 1 — making subsequent deploys think they're on a
   single-core box and skip the cpuset writes entirely. `nproc --all`
   reports installed processors regardless of affinity, which is what
   the deploy actually wants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:11:19 +02:00
mwiegand
ff6ce7b091
refactor(l4d2-host): unmount via ExecStopPost — single code path mirroring mount
Symmetric with the earlier mount cleanup (commits 519567e..a982995). Until
now, the unit's ExecStartPre handled mount but the Python side still drove
unmount: stop_instance and _purge_instance both called _mounter.unmount,
which wrapped sudo + the helper. Two code paths for two halves of the
same lifecycle.

Move unmount into the unit:

- ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i
  (ExecStopPost, not ExecStop, so it runs after the cgroup is cleared;
  ExecStop runs while srcds is alive and would EBUSY the umount syscall.)
- Helper's umount verb is now idempotent (mirrors mount): if merged
  isn't a mount point, return early. PRINT_ONLY mode bypasses both
  short-circuits so the unit tests still exercise the full nsenter argv.

Drop the dead Python machinery:

- _mounter.unmount(...) calls in stop_instance and _purge_instance
- _mounter global + KernelOverlayFSMounter import
- The whole l4d2host/fs/ package (OverlayMounter ABC + KernelOverlayFSMounter
  class) — no production callers, just self-tests
- l4d2host/tests/test_kernel_overlayfs.py
- test_stop_succeeds_when_unmount_fails / test_delete_succeeds_when_unmount_fails
  (tested Python-side unmount-failure tolerance that no longer exists)
- The l4d2host.fs.kernel_overlayfs.run_command monkeypatches in lifecycle tests

After this, the only thing start_instance does beyond cfg-staging is ask
systemd to enable+start the unit. stop/delete/reset only ask systemd to
disable; the overlay lifecycle lives entirely in the unit file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 13:09:52 +02:00
mwiegand
fc371711ec
fix(deploy): StartLimit* directives belong in [Unit], not [Service]
systemd 230+ moved StartLimitBurst= and StartLimitIntervalSec= from
[Service] into [Unit] (with the rename from StartLimitInterval=). Putting
them in [Service] makes systemd silently ignore them with a warning to
journalctl: "Unknown key 'StartLimitIntervalSec' in section [Service],
ignoring." — meaning the restart-loop cap I claimed in commit 519567e
wasn't actually applied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:56:54 +02:00
mwiegand
a982995d5b
fix(deploy): ExecStartPre runs overlay helper with + prefix, not sudo
The unit has NoNewPrivileges=true (security hardening for srcds), which
blocks sudo's setuid escalation. The previous sudo'd ExecStartPre failed
on every start with "sudo: the 'no new privileges' switch is set, which
prevents sudo from running as root" -> Restart=on-failure loop.

systemd's `+` prefix runs the Exec command as PID 1 (root, no sandbox),
bypassing User=/Group=/NoNewPrivileges=. Equivalent privilege scope to
the sudoers rule the web app already uses for the same helper, just
without the sudo middleman.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:55:16 +02:00
mwiegand
56f5c30296
refactor(l4d2-host): unit's ExecStartPre is the sole code path to the mount
Before this change there were two callers of left4me-overlay mount:
the web app's start_instance (Python, in-process) and the unit's
ExecStartPre (shell, via sudo). The duplication invited divergence; the
helper's recently-added idempotency made both paths technically work
but at the cost of a "first wins" race and dead-code retry logic in
start_instance.

Drop the in-process _mounter.mount() call from start_instance. The web
app now only stages cfg files (which still must happen on the host
filesystem before mount, to avoid overlayfs copy-up changing ownership),
then asks systemd to enable+start the unit; the unit's ExecStartPre
does the mount.

Removed:
- os.path.ismount(merged) refusal in start_instance and its test
  (test_start_refuses_to_double_mount). The race the check guarded
  against is now handled by the helper's idempotency.
- _load_instance_env helper and the `os` import (both became dead).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:54:05 +02:00
mwiegand
3d9b7ef771
fix(deploy): WorkingDirectory= prefix - so ExecStartPre can mount the overlay
systemd applies WorkingDirectory= to every Exec line including ExecStartPre.
With the merged dir not yet existing at boot time (the volatile overlay
mount has been wiped), the chdir into runtime/%i/merged/left4dead2 fails
with status=200/CHDIR before ExecStartPre can run the mount helper.

The `-` prefix makes chdir failure non-fatal: ExecStartPre runs in the
unit's home (cwd doesn't matter for the mount helper); ExecStart re-applies
WorkingDirectory once the mount has landed and chdirs successfully.

Companion to commit 519567e (which added the ExecStartPre mount + helper
idempotency but didn't account for the WorkingDirectory ordering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:51:58 +02:00
mwiegand
519567e156
fix(l4d2-host): mount overlay via ExecStartPre so enabled units boot cleanly
The lifecycle change to systemctl enable --now (commit 8552c55) made
units auto-start at boot. But the kernel-overlayfs mount is volatile
(reboot kills it), and the web app's start_instance only re-mounts in
response to a UI click. Result: at boot, systemd starts the unit, finds
empty merged/, CHDIR fails, Restart=on-failure spins forever (counter
hit 65 on ckn before this fix landed).

Fix:
- Unit gets `ExecStartPre=/usr/bin/sudo -n .../left4me-overlay mount %i`
  so the overlay is established before the main process starts.
- Helper is now idempotent: if merged is already a mount point, exit 0.
  Required because Restart=on-failure re-runs ExecStartPre on each
  cycle, and the web-app's start_instance also calls the helper, so
  both paths would otherwise collide on "already mounted".
- StartLimitBurst=5 + StartLimitIntervalSec=60s caps the restart loop
  instead of letting it spin indefinitely on a fundamental failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:47:20 +02:00
mwiegand
b62fc08127
docs(specs): l4d2 cpu pinning — decision record (deferred)
Investigated whether to hard-pin each srcds instance to a single core
within the existing AllowedCPUs=1-7 set. Modern kernels (5.13+) no
longer expose kernel.sched_migration_cost_ns or the other classic CFS
"laziness" tunables, so a global cheap-fix is unavailable. Decision
for now: trust CFS + Nice=-5 + AllowedCPUs=1-7. Per-instance
CPUAffinity= remains an opt-in escape hatch in deploy/README.md.
Documents the revisit triggers and the preferred implementation path
when the time comes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:41:40 +02:00
mwiegand
67b5521eb6
feat(l4d2-web): periodic state poller refreshes Server.actual_state
A background thread spawned alongside the job workers polls every
server's status every STATE_POLLER_INTERVAL_SECONDS (default 30) and
writes the result via the existing refresh_server_actual_state path.
Servers with in-flight jobs (queued/running/cancelling) are skipped to
avoid racing the post-job refresh. Catches reboot drift, OOM kills,
manual systemctl operations, and any other out-of-band state change.
Spec: docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:31:28 +02:00
mwiegand
8552c559d3
feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now
Servers started via the web UI now create a WantedBy= symlink under
multi-user.target.wants/, so they auto-start on the next host reboot.
Helper verbs renamed start/stop -> enable/disable; service_control.py
renamed start_service/stop_service -> enable_service/disable_service.
The user-facing l4d2ctl start/stop commands keep their names per the
AGENTS.md contract -- only the implementation changes. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:28:44 +02:00
mwiegand
1dd674714a
docs(specs): perf baseline lifecycle — premise check on system vs user units
Make explicit that the project uses system units (root systemctl, unit
under /usr/local/lib/systemd/system/, WantedBy=multi-user.target), so
`systemctl enable --now` is the correct verb to make instances survive
a host reboot. User units have different lifecycle rules and would not
auto-start at boot without enable-linger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:25:34 +02:00
mwiegand
3b0bde9b50
docs(plans): l4d2 server lifecycle reboot-and-drift — implementation plan
Two TDD tasks: helper+service_control verb rename, then poller code
+ wiring + tests. Operator-side smoke test in F.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:21:59 +02:00
mwiegand
72cd7ca1ef
docs(specs): l4d2 server lifecycle reboot-and-drift — design
Switch lifecycle verbs from systemctl start/stop to enable --now /
disable --now (servers survive host reboot via WantedBy= symlinks),
plus a periodic state poller for runtime drift (OOM kills, manual
systemctl ops, exhausted Restart=on-failure).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:21:59 +02:00
mwiegand
20604dd79c
docs(deploy): document CPU isolation in performance-tuning section
Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS /
LEFT4ME_GAME_CPUS overrides, the single-core skip, and the
subset-of relationship with per-instance CPUAffinity=.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 11:06:59 +02:00