left4me

Author	SHA1	Message	Date
mwiegand	5eac51a93e	fix(deploy): wrap overlay helper with nsenter so it doesn't pin the unit's mount namespace systemd's `+` Exec prefix removes sandbox/credentials but does NOT detach from the unit's per-service mount namespace (created by PrivateTmp/Protect). The Python interpreter for the helper was launched inside that namespace, and even though the helper internally nsenter'd into PID 1 for the umount syscall, the calling Python process itself never left the unit's namespace. Its existence pinned the namespace alive, which kept the slave mount tree alive, which made PID 1's umount return EBUSY for the entire duration of the helper's run. The mount became unmountable the moment the helper exited — empirically verified by polling /proc//ns/mnt during stop: the only PID holding the dying namespace was the helper itself. Wrap both ExecStartPre and ExecStopPost with `/usr/bin/nsenter --mount=/proc/1/ns/mnt --` so the helper Python interpreter runs in PID 1's mount namespace from the start. With the helper out of the unit's namespace, umount succeeds first try once the cgroup empties. Reset went from ~25 s with retry/lazy-fallback workarounds to ~0.5 s clean. Knock-on cleanups: - Helper drops internal nsenter for the syscalls (already in PID 1's namespace), and drops the eager-retry loop + lazy-umount fallback + inner work_inner retry (no race left to ride out). - Revert TimeoutStopSec=60s back to 15s. - Tests updated to expect the new argv shapes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 15:13:59 +02:00
mwiegand	936c8bb81c	fix(deploy): ExecStart srcds_run from merged overlay, not installation/ srcds_run is a shell script that cd's to its own dirname before exec'ing srcds_linux, so WorkingDirectory has no effect — the binary's path is what determines where the engine reads gameinfo.txt and addons from. Pointing at installation/srcds_run resolved everything against the lower layer, so overlay-provided Metamod/SourceMod plugins and cfgs (zonemod, confogl) never loaded. Switch to runtime/%i/merged/srcds_run so the engine sees the merged tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 14:03:12 +02:00
mwiegand	ddf73c4d27	test(deploy): drop stale web.env lifecycle assertions `test_deploy_script_has_safe_defaults_and_preserves_state` had been red since commit `caa8b83` ("rewrite web.env every deploy with machine-id- derived SECRET_KEY"). Two assertions encoded the prior model: - `if [ ! -f /etc/left4me/web.env ]` — the create-only-if-missing guard `caa8b83` removed in favor of unconditional `install -m 0640 ...`. - `. /etc/left4me/web.env not in script` — masked by the first failing but also stale: the deploy intentionally sources web.env in the alembic and seed-script-overlays helper subprocesses so they get DATABASE_URL. Removed both. The full suite now runs 0 failed. The note left in place points future readers at the live coverage path (install + SECRET_KEY rewrite + run_left4me_with_env plumbing already asserted nearby). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 13:33:05 +02:00
mwiegand	59771f91c4	fix(deploy): drop deleted l4d2host.fs from pyproject + use nproc --all Two bugs surfaced by the previous deploy attempt: 1. l4d2host/pyproject.toml still listed `l4d2host.fs` in the explicit packages= list. After deleting the fs/ package, pip install -e fails with "package directory './fs' does not exist". 2. The CPU-isolation deploy step uses `nproc` to detect host core count, but `nproc` honors Cpus_allowed of the calling shell. On a host that already has the cpuset drop-ins applied (system.slice/user.slice → AllowedCPUs=0), the SSH login lands constrained to one core and `nproc` returns 1 — making subsequent deploys think they're on a single-core box and skip the cpuset writes entirely. `nproc --all` reports installed processors regardless of affinity, which is what the deploy actually wants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 13:11:19 +02:00
mwiegand	ff6ce7b091	refactor(l4d2-host): unmount via ExecStopPost — single code path mirroring mount Symmetric with the earlier mount cleanup (commits 519567e..a982995). Until now, the unit's ExecStartPre handled mount but the Python side still drove unmount: stop_instance and _purge_instance both called _mounter.unmount, which wrapped sudo + the helper. Two code paths for two halves of the same lifecycle. Move unmount into the unit: - ExecStopPost=+/usr/local/libexec/left4me/left4me-overlay umount %i (ExecStopPost, not ExecStop, so it runs after the cgroup is cleared; ExecStop runs while srcds is alive and would EBUSY the umount syscall.) - Helper's umount verb is now idempotent (mirrors mount): if merged isn't a mount point, return early. PRINT_ONLY mode bypasses both short-circuits so the unit tests still exercise the full nsenter argv. Drop the dead Python machinery: - _mounter.unmount(...) calls in stop_instance and _purge_instance - _mounter global + KernelOverlayFSMounter import - The whole l4d2host/fs/ package (OverlayMounter ABC + KernelOverlayFSMounter class) — no production callers, just self-tests - l4d2host/tests/test_kernel_overlayfs.py - test_stop_succeeds_when_unmount_fails / test_delete_succeeds_when_unmount_fails (tested Python-side unmount-failure tolerance that no longer exists) - The l4d2host.fs.kernel_overlayfs.run_command monkeypatches in lifecycle tests After this, the only thing start_instance does beyond cfg-staging is ask systemd to enable+start the unit. stop/delete/reset only ask systemd to disable; the overlay lifecycle lives entirely in the unit file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 13:09:52 +02:00
mwiegand	fc371711ec	fix(deploy): StartLimit* directives belong in [Unit], not [Service] systemd 230+ moved StartLimitBurst= and StartLimitIntervalSec= from [Service] into [Unit] (with the rename from StartLimitInterval=). Putting them in [Service] makes systemd silently ignore them with a warning to journalctl: "Unknown key 'StartLimitIntervalSec' in section [Service], ignoring." — meaning the restart-loop cap I claimed in commit `519567e` wasn't actually applied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:56:54 +02:00
mwiegand	a982995d5b	fix(deploy): ExecStartPre runs overlay helper with `+` prefix, not sudo The unit has NoNewPrivileges=true (security hardening for srcds), which blocks sudo's setuid escalation. The previous sudo'd ExecStartPre failed on every start with "sudo: the 'no new privileges' switch is set, which prevents sudo from running as root" -> Restart=on-failure loop. systemd's `+` prefix runs the Exec command as PID 1 (root, no sandbox), bypassing User=/Group=/NoNewPrivileges=. Equivalent privilege scope to the sudoers rule the web app already uses for the same helper, just without the sudo middleman. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:55:16 +02:00
mwiegand	56f5c30296	refactor(l4d2-host): unit's ExecStartPre is the sole code path to the mount Before this change there were two callers of left4me-overlay mount: the web app's start_instance (Python, in-process) and the unit's ExecStartPre (shell, via sudo). The duplication invited divergence; the helper's recently-added idempotency made both paths technically work but at the cost of a "first wins" race and dead-code retry logic in start_instance. Drop the in-process _mounter.mount() call from start_instance. The web app now only stages cfg files (which still must happen on the host filesystem before mount, to avoid overlayfs copy-up changing ownership), then asks systemd to enable+start the unit; the unit's ExecStartPre does the mount. Removed: - os.path.ismount(merged) refusal in start_instance and its test (test_start_refuses_to_double_mount). The race the check guarded against is now handled by the helper's idempotency. - _load_instance_env helper and the `os` import (both became dead). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:54:05 +02:00
mwiegand	3d9b7ef771	fix(deploy): WorkingDirectory= prefix `-` so ExecStartPre can mount the overlay systemd applies WorkingDirectory= to every Exec line including ExecStartPre. With the merged dir not yet existing at boot time (the volatile overlay mount has been wiped), the chdir into runtime/%i/merged/left4dead2 fails with status=200/CHDIR before ExecStartPre can run the mount helper. The `-` prefix makes chdir failure non-fatal: ExecStartPre runs in the unit's home (cwd doesn't matter for the mount helper); ExecStart re-applies WorkingDirectory once the mount has landed and chdirs successfully. Companion to commit `519567e` (which added the ExecStartPre mount + helper idempotency but didn't account for the WorkingDirectory ordering). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:51:58 +02:00
mwiegand	519567e156	fix(l4d2-host): mount overlay via ExecStartPre so enabled units boot cleanly The lifecycle change to systemctl enable --now (commit `8552c55`) made units auto-start at boot. But the kernel-overlayfs mount is volatile (reboot kills it), and the web app's start_instance only re-mounts in response to a UI click. Result: at boot, systemd starts the unit, finds empty merged/, CHDIR fails, Restart=on-failure spins forever (counter hit 65 on ckn before this fix landed). Fix: - Unit gets `ExecStartPre=/usr/bin/sudo -n .../left4me-overlay mount %i` so the overlay is established before the main process starts. - Helper is now idempotent: if merged is already a mount point, exit 0. Required because Restart=on-failure re-runs ExecStartPre on each cycle, and the web-app's start_instance also calls the helper, so both paths would otherwise collide on "already mounted". - StartLimitBurst=5 + StartLimitIntervalSec=60s caps the restart loop instead of letting it spin indefinitely on a fundamental failure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:47:20 +02:00
mwiegand	8552c559d3	feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now Servers started via the web UI now create a WantedBy= symlink under multi-user.target.wants/, so they auto-start on the next host reboot. Helper verbs renamed start/stop -> enable/disable; service_control.py renamed start_service/stop_service -> enable_service/disable_service. The user-facing l4d2ctl start/stop commands keep their names per the AGENTS.md contract -- only the implementation changes. Spec: docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 12:28:44 +02:00
mwiegand	20604dd79c	docs(deploy): document CPU isolation in performance-tuning section Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS / LEFT4ME_GAME_CPUS overrides, the single-core skip, and the subset-of relationship with per-instance CPUAffinity=. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 11:06:59 +02:00
mwiegand	af3171102a	feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest Computes NPROC at deploy time. Defaults LEFT4ME_SYSTEM_CPUS=0 and LEFT4ME_GAME_CPUS=1-(NPROC-1). Single-core hosts skip cpuset writes with a stderr warning unless an env var override is set. Spec: docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 11:06:34 +02:00
mwiegand	e5126c8c0b	docs(deploy): tighten perf-tuning escape hatches - RT example: add AmbientCapabilities=CAP_SYS_NICE so the User=left4me service can actually enter SCHED_FIFO on Trixie. - CPU governor: note that linux-cpupower may need apt install. - CPUAffinity=2: clarify that per-instance values typically increment. - NIC tuning: note that ethtool may need apt install.	2026-05-09 10:15:45 +02:00
mwiegand	9e0f6f17ef	docs(deploy): performance-tuning escape-hatch section in README Documents CPU governor, per-instance CPUAffinity, NIC tuning, and SCHED_FIFO opt-in patterns. None of these are auto-applied; they're ops-side knobs for measured problems the perf baseline doesn't solve.	2026-05-09 10:09:40 +02:00
mwiegand	928519fa34	feat(deploy): install slice + sysctl artifacts and apply via sysctl --system Copies l4d2-game.slice and l4d2-build.slice into /usr/local/lib/systemd/system/, installs 99-left4me.conf into /etc/sysctl.d/, and runs sysctl --system so the perf baseline is live this deploy, not on next reboot.	2026-05-09 10:05:41 +02:00
mwiegand	7e4a5691ed	feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500 Builds yield CPU/IO to game-server instances under contention via the slice's weight=10, and are killed first under memory pressure (servers have OOMScoreAdjust=-200).	2026-05-09 10:01:38 +02:00
mwiegand	b3fca4772c	feat(deploy): host sysctls for UDP buffers + netdev backlog/budget 99-left4me.conf: rmem_max/wmem_max=8M (with 512K defaults), netdev_max_backlog=5000, netdev_budget=600, vm.swappiness=10.	2026-05-09 09:53:07 +02:00
mwiegand	66d83a0282	docs(deploy): point slice files at perf baseline spec Matches the spec-pointer comment Task 1 added to left4me-server@.service. A future operator running `systemctl cat l4d2-game.slice` now finds the rationale.	2026-05-09 09:51:48 +02:00
mwiegand	ad7d73608e	feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio Flat top-level slices. Game wins under contention; build still gets the box when uncontended. Referenced by left4me-server@.service and the script-sandbox systemd-run invocation.	2026-05-09 09:48:41 +02:00
mwiegand	7193163488	feat(deploy): perf-baseline directives on left4me-server@.service Slice=l4d2-game.slice, Nice=-5, IOSchedulingClass=best-effort, OOMScoreAdjust=-200, MemoryHigh=1.5G, MemoryMax=2G, TasksMax=256, LimitNOFILE=65536, KillSignal=SIGINT, TimeoutStopSec=15s, LogRateLimitIntervalSec=0. Spec: docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md	2026-05-09 09:44:12 +02:00
mwiegand	965b67e6fc	fix(l4d2-host): script-sandbox normalizes file perms so web user can read Cedapug's build script writes .cedapug/manifest.tsv with mode 0600 owned by l4d2-sandbox; the web service (left4me uid) then 500s when streaming that file via the download route — PermissionError on open(). Two fixes: - UMask=0022 on the systemd-run unit so new file writes default to 0644 / dirs to 0755. - Post-script chmod o+r/o+rx walk over the overlay dir to backfill any stricter modes the script left behind (e.g. shells/tools that ignore umask and explicitly create with 0600). The helper no longer execs systemd-run; it captures the rc, runs the post-step, and exits with the original rc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 01:44:26 +02:00
mwiegand	caa8b83cf0	chore(deploy): rewrite web.env every deploy with machine-id-derived SECRET_KEY Drops the 'only on first creation' guard so newly added env vars reach existing boxes (today's SESSION_COOKIE_SECURE=false rake). SECRET_KEY is now sha256(/etc/machine-id) — stable per host, no session invalidation across redeploys, no state persisted in /etc that the deploy has to tiptoe around. Single-operator test deployment; the secret being machine-id-derivable is acceptable per deploy/README.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 20:39:02 +02:00
mwiegand	196d2db33e	feat(l4d2-web): seed example script overlays from examples/script-overlays/ Bundles four reference script overlays (cedapug_maps, l4d2center_maps, competitive_rework, tickrate) and adds a `flask seed-script-overlays` CLI that upserts each *.sh as a system-wide overlay. Test deploy invokes it after the orphan-cleanup migration so fresh test servers come up with the same overlays the user has been maintaining by hand. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 18:41:08 +02:00
mwiegand	ebddb0fab2	chore(deploy): install p7zip + coreutils for script-overlay tooling Script overlays commonly need 7z and md5sum (e.g. the l4d2center map sync recipe). Add p7zip-full to the apt install line, p7zip + p7zip-plugins to dnf, and coreutils explicitly so md5sum is guaranteed even on slim base images. Lock both in with a regression test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:23:23 +02:00
mwiegand	023cc5c9b0	fix(deploy): chown WAL+SHM sidecars too, not just left4me.db SQLite in WAL mode (the default for this app) maintains left4me.db-wal and left4me.db-shm sidecar files alongside the main DB. All three must be writable by the web service uid; if any one is root-owned, SQLite reports "attempt to write a readonly database" on the next INSERT — which surfaced as a 500 on POST /overlays/{id}/script after I'd done ad-hoc root-side sqlite3.connect() inspection earlier and the resulting root-owned WAL/SHM persisted. Loop over all three paths in the deploy chmod step so root-owned sidecars are corrected on every deploy. Idempotent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:11:42 +02:00
mwiegand	f6ca85fc6f	fix(deploy): chown left4me.db to left4me:left4me, not root:left4me The v2 hardening tightened the DB to mode 0640 owned by root:left4me, intending to block reads from the sandbox uid (l4d2-sandbox, not in the left4me group). It did — but it also took away write access from the web service itself, which runs as user left4me. With root owning the file, left4me only had group-read; INSERTs into the jobs table failed with "attempt to write a readonly database" and surfaced as a 500 on POST /overlays/{id}/script. Owner left4me + group left4me + mode 0640 keeps the same external posture (l4d2-sandbox gets nothing via "other") while restoring the web service's read+write access via "owner". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:09:47 +02:00
mwiegand	7e66936d03	feat(deploy): restrict script-sandbox egress to public internet only Adds IPAddressDeny= to the sandbox unit covering loopback (127/8 + ::1), link-local (169.254/16 + fe80::/10), multicast (224/4 + ff00::/8), all RFC1918 v4 (10/8, 172.16/12, 192.168/16), CGNAT (100.64/10), and ULA v6 (fc00::/7). The kernel attaches systemd's sd_fw_egress BPF program to the unit's cgroup; egress packets matching any of the deny prefixes are silently dropped at the cgroup boundary. Important: do NOT pair this with `IPAddressAllow=any`. Documentation claims "more specific rule wins" but on this systemd 257 + kernel 6.12 combo, having both set causes the allow to win unconditionally — the deny gets ignored. Empty IPAddressAllow + populated IPAddressDeny is the correct shape: kernel default "allow all" applies to non-listed addresses, and the listed prefixes are blocked. Because the host's resolv.conf typically points at a private-IP DNS server (10.0.0.1 in the test deploy), blocking RFC1918 also kills DNS. Adds a static /etc/left4me/sandbox-resolv.conf with public resolvers (Cloudflare 1.1.1.1, Google 8.8.8.8) and bind-mounts that into the sandbox at /etc/resolv.conf, replacing the host's resolver inside the sandbox only. Smoke-tested on ckn@10.0.4.128: - public 1.1.1.1:443: CONNECTED - public HTTPS via DNS (steamcommunity.com): 200 - localhost web app 127.0.0.1:8000: blocked (TimeoutError) - localhost sshd 127.0.0.1:22: blocked - private LAN ssh 10.0.4.128:22: blocked - private DNS 10.0.0.1:53: blocked AF_UNIX stays in RestrictAddressFamilies — dropping it would risk breaking NSS / syslog for marginal gain, and the IP-level filter addresses the primary threat (reaching the host's HTTP/SSH services). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:04:57 +02:00
mwiegand	ae443299c8	chore(deploy): drop bubblewrap apt dep + tighten left4me.db mode bubblewrap is no longer used now that left4me-script-sandbox runs as a systemd service unit. Remove it from the apt-get and dnf install lines. Also tighten the application database file mode after the alembic upgrade step: chown root:left4me, chmod 0640. The DB had been created at default 0644 by SQLite's open() call inside the web service, which made it world-readable on the host — i.e. readable by any uid that can traverse /var/lib/left4me, including the sandbox's l4d2-sandbox uid. Smoke-testing the v2 sandbox prototype on ckn@10.0.4.128 surfaced this: the sandbox could read "SQLite format 3" from the DB until the parent dir was masked with TemporaryFileSystem=. Tightening the file mode is the host-level fix; the sandbox-level mask is defense in depth. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:48:26 +02:00
mwiegand	4ee8f6af44	refactor(deploy): rewrite left4me-script-sandbox to systemd-only — drop bwrap Replaces the systemd-run --scope + bwrap composition with systemd-run in service-unit mode (--pipe --wait, transient .service unit). Same cgroup limits and walltime kill, plus the hardening directives that --scope units cannot carry: NoNewPrivileges, ProtectSystem=strict, ProtectHome, ProtectKernel{Tunables,Modules,Logs,ControlGroups}, RestrictNamespaces, RestrictAddressFamilies, RestrictSUIDSGID, LockPersonality, MemoryDenyWriteExecute, SystemCallFilter (seccomp), and an empty CapabilityBoundingSet (drops all caps). UID drop via User=/Group=. The TemporaryFileSystem="/etc /var/lib" pair is the gotcha: ProtectSystem=strict makes /var/lib read-only but visible, so the host DB at /var/lib/left4me/left4me.db (mode 0644) was readable from inside. Masking /var/lib with tmpfs hides the entire subtree; the BindPaths bind to /overlay is at a different path and unaffected. The Python side (ScriptBuilder, run_sandboxed_script, routes) is unchanged — same sudo-helper invocation, same argv shape. Loses PID-namespace isolation (no PrivatePID= directive in systemd). Host PIDs are visible via /proc and ps -ef but not signal-able due to UID mismatch — information disclosure only, not a privilege boundary. Smoke-tested on ckn@10.0.4.128 prior to this commit; all isolation invariants reproduced and the hardening directives provably blocked unshare(2), mount(2), personality(2), bpf(2), and sysctl writes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:47:30 +02:00
mwiegand	cf865d4915	fix(deploy): one-shot cleanup of orphan overlay dirs after globals removal Migration 0005_script_overlays drops the legacy l4d2center_maps / cedapug_maps overlay rows but leaves their /var/lib/left4me/overlays/{id} directories on disk. When the web app subsequently creates a new overlay and AUTOINCREMENT issues an id matching one of those orphans, create_overlay_directory(exist_ok=False) crashes with FileExistsError — which surfaced as a 500 on POST /overlays the first time a script overlay was created on a deployed test box. Adds a sentinel-gated sweep in deploy-test-server.sh that lists overlay ids in the DB, removes any directory under overlays/ whose id has no matching row, and drops the now-unused global_overlay_cache. Mirrors the .kernel-overlay-migrated sentinel pattern so reruns are no-ops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:16:33 +02:00
mwiegand	06ae84fbe4	fix(deploy): script-sandbox helper — UID drop via systemd-run, --unshare-user-try, /etc/alternatives Smoke testing on the test host revealed three issues with the helper as shipped: 1. bwrap 0.11+ rejects --uid without --unshare-user. Switching the UID drop from inside bwrap to systemd-run (--uid=l4d2-sandbox --gid=l4d2-sandbox) sidesteps the userns UID-mapping headaches and keeps file ownership on the bind-mounted /overlay matching l4d2-sandbox on the host (which the wipe path relies on). 2. bwrap running as an unprivileged uid still needs a user namespace to set up its mount-namespace bind-mounts. Adding --unshare-user-try gives it the userns context when needed and is a no-op otherwise. 3. /etc/alternatives wasn't bind-mounted, so symlinked tools like /usr/bin/awk -> /etc/alternatives/awk fell over inside the sandbox. Adds the ro-bind. Also: the helper now chowns the overlay dir to l4d2-sandbox before bwrap (idempotent — needed because the web app creates the dir as left4me), and the deploy script chmods /var/lib/left4me to 0711 so l4d2-sandbox can traverse to the bind-mount source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:12:46 +02:00
mwiegand	1e62a44c16	docs(deploy): replace globals overlay description with script overlays deploy/README.md still described the deleted managed-global overlays as the second overlay surface. Replace with a description of script overlays (bubblewrap + systemd-run sandbox, resource caps). Full test sweep: 367 passing, 2 skipped across l4d2web, l4d2host, deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:56:24 +02:00
mwiegand	e51a4d58a4	chore(deploy): provision l4d2-sandbox + bubblewrap; drop globals refresh timer deploy-test-server.sh: provisions the l4d2-sandbox system user (no home, nologin shell) and installs the bubblewrap apt/dnf package; copies the left4me-script-sandbox helper into /usr/local/libexec/left4me with mode 0755. Drops the global_overlay_cache directory provisioning, the refresh-global-overlays unit installation, and the timer enable. Deletes the orphaned left4me-refresh-global-overlays.{service,timer} files. Trims the matching paragraph from deploy/README.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:54:57 +02:00
mwiegand	75e703e1a4	feat(deploy): left4me-script-sandbox helper + sudoers fragment Privileged bash helper that wraps user-authored scripts in systemd-run --scope (cgroup limits + RuntimeMaxSec=3600) inside a bubblewrap sandbox dropped to the l4d2-sandbox uid. Network is shared with the host so scripts can fetch from Steam / l4d2center / etc.; filesystem is RO except for /overlay (rw bind from /var/lib/left4me/overlays/{id}) and tmpfs /tmp + /run. Adds a sudoers rule allowing the left4me user to invoke this helper without restrictions on its arguments. Strict argument validation is in the helper itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 15:53:21 +02:00
mwiegand	9985ecc56c	chore(deploy): cleanup left4me-web hardening + docs for kernel overlayfs Drop MountFlags=shared (the assumption that it propagated fuse mounts to host was incorrect on systemd 257 with ProtectSystem+ReadWritePaths). Restore PrivateTmp=true (was dropped in `593611e` for fuse propagation that did not work). Rewrite the comment block to describe the new model: mounts go through the left4me-overlay helper which nsenters into PID 1's mount namespace, so the unit's mount-ns layout is no longer load-bearing. Update the three user-facing READMEs (root, l4d2host, deploy) to drop fuse-overlayfs / fusermount3 prereqs and call out the kernel overlayfs mount path through the privileged helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 12:29:49 +02:00
mwiegand	172e574a00	chore(deploy): drop fuse-overlayfs apt dep + one-shot migrate upper/work Drop fuse-overlayfs / fuse3 from the apt/dnf install line — the new mount path is kernel overlayfs via the left4me-overlay helper, no fuse userspace needed. Add a one-shot migration block gated by /var/lib/left4me/.kernel-overlay-migrated that runs before daemon-reload: stop gameservers + web service, force- unmount any leftover fuse or overlay mounts under runtime/, then wipe and recreate empty upper/ and work/ for every instance. fuse-overlayfs running as a non-root user used user.fuseoverlayfs.* xattrs that kernel overlayfs ignores, so a pre-existing upper/ from the fuse era would resurrect "deleted" files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 12:28:00 +02:00
mwiegand	d5b321b557	feat(l4d2-host): KernelOverlayFSMounter + left4me-overlay helper New privileged helper at /usr/local/libexec/left4me/left4me-overlay (Python, system /usr/bin/python3, stdlib only) takes only the instance name, parses instance.env for L4D2_LOWERDIRS, validates each lowerdir against an allowlist (installation/, overlays/, global_overlay_cache/, workshop_cache/), refuses upperdirs tainted with user.fuseoverlayfs.* xattrs from the prior fuse era, and execs `nsenter --mount=/proc/1/ns/mnt -- mount -t overlay ...` so the resulting mount lives in the host namespace. Mirrors the existing left4me-systemctl / left4me-journalctl pattern; sudoers entry is verb-constrained. KernelOverlayFSMounter implements the existing OverlayMounter ABC, deriving the instance name from the merged path. No call sites use it yet — that's the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 12:23:58 +02:00
mwiegand	38548ab0d7	chore(deploy): raise gunicorn thread pool to 32 for SSE headroom Each SSE log-viewer or job-log stream holds a thread for its full lifetime. With --threads 8, a handful of open browser tabs could exhaust the pool. 32 keeps the same single-process scheduler invariant (_claim_lock in job_worker is process-local) while giving SSE plenty of headroom on the test box's user count. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 11:19:03 +02:00
mwiegand	ffc4cdbd7d	refactor(l4d2-web): remove legacy external overlay type The workshop + managed-global overlay surface fully covers the admin-SFTP flow that 'external' was a placeholder for. Drop the type from the model defaults, builder registry, routes, template, and tests, and add migration 0004 that deletes any leftover external rows along with their blueprint and job references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 09:31:04 +02:00
mwiegand	92d6ebbe82	feat(l4d2-web): managed global map overlays with daily refresh Adds two managed system overlays (l4d2center-maps, cedapug-maps) that fetch curated map archives from upstream sources and reconcile addons symlinks for non-Steam maps. A daily systemd timer enqueues a coalesced refresh_global_overlays worker job; downloads, extraction, and rebuilds run in the existing job worker and surface in the job log UI. Schema: GlobalOverlaySource / GlobalOverlayItem / GlobalOverlayItemFile plus nullable Job.user_id so system jobs render as "system" in the UI. The new builder reconciles symlinks against the per-source vpk cache and leaves foreign symlinks untouched. Initialize-time guard refuses to mount a partial overlay if any expected vpk is missing from cache. Refresh service uses shutil.move to handle EXDEV when /tmp and the cache live on different filesystems. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 08:05:14 +02:00
mwiegand	0e83ee07d7	fix(deploy): make test deployments safe to rerun Exclude local agent state from deploy archives, avoid recursive ownership over active runtime mounts, and let Alembic own schema upgrades before app startup.	2026-05-07 17:16:58 +02:00
mwiegand	b2a8d3d5e0	feat(deploy): workshop_cache provisioning Adds /var/lib/left4me/workshop_cache to the deploy mkdir list (owned by the left4me runtime user). Updates deploy/README.md to document the new directory and the workshop overlay layout: web app downloads VPKs into the cache and symlinks them into overlays/{overlay_id}/left4dead2/addons/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 16:53:49 +02:00
mwiegand	1968684c03	fix(deploy): MountFlags=shared on web service for fuse mount propagation ProtectSystem=full + ReadWritePaths implicitly give the unit a private mount namespace (systemd needs to remount /usr read-only). The default namespace propagation is slave, so mounts the worker creates inside never reach the host. The gameserver units (started via systemctl, each with their own namespace) then inherit a host that lacks the overlay, and their CHDIR into /var/lib/left4me/runtime/<name>/merged fails. Set MountFlags=shared so mount events propagate from the worker's namespace back to the host, then onward to gameserver units at their unshare time. Verified on test box: nsenter -t <gunicorn-pid> -m mount showed the fuse-overlayfs mount inside the worker but plain mount on the host did not, while web unit had ProtectSystem=full + ReadWritePaths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 02:01:24 +02:00
mwiegand	593611e194	fix(deploy): drop PrivateTmp on web service so fuse mounts propagate PrivateTmp=true gives the unit a private mount namespace. The worker's fuse-overlayfs mount lives only inside that namespace, so the host cannot see it and the gameserver unit (started via systemctl, with its own namespace inherited from the host) also cannot see it. The gameserver unit then fails CHDIR on /var/lib/left4me/runtime/<name>/merged/left4dead2. The mount must land in the host namespace so the gameserver unit inherits it at unshare time. Remaining hardening: dedicated user, ProtectSystem=full, ReadWritePaths, sudoers allowlist limited to two helper scripts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 01:57:43 +02:00
mwiegand	56b9523d88	fix(deploy): drop NoNewPrivileges on web service so FUSE mounts work The job worker calls fusermount3 (setuid-root) to mount per-instance FUSE overlays and sudo to invoke the privileged systemctl wrapper. NoNewPrivileges=true blocks both, surfacing as "fusermount3: mount failed: Operation not permitted" the first time a server is started. Hardening is still enforced via dedicated user, PrivateTmp, ProtectSystem=full, ReadWritePaths, and the narrow sudoers allowlist limited to two helper scripts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 01:51:39 +02:00
mwiegand	7d9939c71d	fix(deploy): exclude macOS AppleDouble files from deploy archive When tar runs on macOS it embeds ._* resource-fork sidecars next to each file. These ended up under l4d2web/alembic/versions/ on the target and alembic tried to import them as migration modules, failing with "source code string cannot contain null bytes". Set COPYFILE_DISABLE=1 and add an --exclude '._*' so the archive is portable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 00:58:29 +02:00
mwiegand	0210ecd301	config: allow SESSION_COOKIE_SECURE override and disable on test deploy The HTTP-only test deployment binds gunicorn to 0.0.0.0:8000 with no TLS terminator, so a hardcoded SESSION_COOKIE_SECURE=True breaks browser login. Make it opt-out via env (default True outside TESTING) and set SESSION_COOKIE_SECURE=false in the generated web.env so the test box keeps working over HTTP. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 00:56:48 +02:00
mwiegand	3809f85795	fix: load environment variables for alembic upgrade in deploy script to ensure database url is set properly	2026-05-06 21:01:35 +02:00
mwiegand	441c1db79b	fix: change directory before running alembic upgrade in deploy script to avoid pyproject.toml permission issues	2026-05-06 21:00:59 +02:00
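
The CPU-isolation commits in the log (af3171102a, 59771f91c4) describe splitting cores between system and game slices, and explain why the count must ignore the caller's affinity mask. The following is a minimal Python sketch of that arithmetic, not the deploy script itself. The function names are hypothetical, `os.cpu_count()` is only a rough analogue of `nproc --all`, and the single-core override behaviour is an assumption, since the log does not spell it out.

```python
import os
from typing import Mapping, Optional, Tuple


def installed_cpus() -> int:
    """Logical CPUs in the system, ignoring this process's affinity mask."""
    return os.cpu_count() or 1


def affinity_cpus() -> int:
    """CPUs this process may run on, which is what plain `nproc` honors."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux only
        return len(os.sched_getaffinity(0))
    return installed_cpus()


def cpu_split(
    nproc: int, env: Optional[Mapping[str, str]] = None
) -> Optional[Tuple[str, str]]:
    """Return (system_cpus, game_cpus) cpuset strings, or None to skip."""
    env = env or {}
    system = env.get("LEFT4ME_SYSTEM_CPUS")
    game = env.get("LEFT4ME_GAME_CPUS")
    if nproc < 2:
        # Assumption: on a single-core host both overrides are needed,
        # otherwise the cpuset writes are skipped.
        return (system, game) if system and game else None
    # Defaults from the log: system on core 0, game on 1..(NPROC-1).
    # Emitting "1" instead of "1-1" on a two-core host is this sketch's choice.
    default_game = "1" if nproc == 2 else f"1-{nproc - 1}"
    return system or "0", game or default_game


if __name__ == "__main__":
    # On a host already constrained to one core, affinity_cpus() can be 1
    # while installed_cpus() still reports the full count.
    print(affinity_cpus(), installed_cpus(), cpu_split(installed_cpus()))
```

Feeding `affinity_cpus()` into `cpu_split` from a login already pinned to core 0 would reproduce the bug described in 59771f91c4: the split sees one core and returns `None`.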

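Commit caa8b83cf0 states that SECRET_KEY is now sha256(/etc/machine-id), so the value stays stable per host across redeploys. A minimal Python sketch of that derivation follows. The function name is hypothetical, and stripping surrounding whitespace before hashing is an assumption: the actual deploy script may hash the raw file bytes, which would give a different digest.

```python
import hashlib
from pathlib import Path


def derive_secret_key(machine_id: str) -> str:
    """Hex SHA-256 of the machine id; the same input always gives the same key."""
    # Assumption: trailing newline and surrounding whitespace are stripped first.
    return hashlib.sha256(machine_id.strip().encode("utf-8")).hexdigest()


def secret_key_for_host(path: str = "/etc/machine-id") -> str:
    """Read the host's machine id from disk and derive the key from it."""
    return derive_secret_key(Path(path).read_text(encoding="utf-8"))
```

Because the key is a pure function of the machine id, rewriting web.env on every deploy does not invalidate existing sessions, which is the property the commit message relies on.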
53 commits
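
Commit 7e66936d03 lists the prefixes placed in IPAddressDeny= for the script sandbox, together with smoke-test results. The sketch below models only the prefix matching, using Python's standard `ipaddress` module. It does not reproduce systemd's BPF egress filter, and `DENIED_PREFIXES` and `is_denied` are hypothetical names introduced for illustration.

```python
import ipaddress

# Prefixes as listed in the commit message for IPAddressDeny=.
DENIED_PREFIXES = [
    ipaddress.ip_network(prefix)
    for prefix in (
        "127.0.0.0/8", "::1/128",        # loopback
        "169.254.0.0/16", "fe80::/10",   # link-local
        "224.0.0.0/4", "ff00::/8",       # multicast
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC1918
        "100.64.0.0/10",                 # CGNAT
        "fc00::/7",                      # ULA v6
    )
]


def is_denied(address: str) -> bool:
    """True if the address falls inside any denied prefix."""
    addr = ipaddress.ip_address(address)
    # Membership across IP versions is simply False, so one list covers v4 and v6.
    return any(addr in net for net in DENIED_PREFIXES)


if __name__ == "__main__":
    for target in ("1.1.1.1", "127.0.0.1", "10.0.4.128", "10.0.0.1"):
        print(target, "blocked" if is_denied(target) else "allowed")
```

Under this model the public resolver addresses stay reachable while loopback, the LAN host, and the private DNS server are denied, which matches the shape of the smoke-test results reported in the commit.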