Compare commits

..

No commits in common. "master" and "harden-boundary-inputs" have entirely different histories.

334 changed files with 2282 additions and 79350 deletions

1
.envrc

@ -1 +0,0 @@
layout uv

6
.gitignore vendored

@ -1,7 +1,5 @@
.worktrees/ .worktrees/
.claude/
.venv/ .venv/
.direnv/
.pytest_cache/ .pytest_cache/
__pycache__/ __pycache__/
*.pyc *.pyc
@ -9,7 +7,3 @@ __pycache__/
l4d2web.db* l4d2web.db*
# CocoIndex Code (ccc) # CocoIndex Code (ccc)
/.cocoindex_code/ /.cocoindex_code/
.superpowers/
*.db
opencode.json
.tmp/

View file

@ -1 +0,0 @@
3.13

132
AGENTS.md

@ -21,94 +21,6 @@ Do not invent architecture outside these plans unless explicitly requested.
### Workspace and tools ### Workspace and tools
- Do not use git worktrees. - Do not use git worktrees.
- Repo is a uv workspace; Python is pinned to 3.13 via `.python-version`. After fresh checkout: install `uv` (`brew install uv` / `curl -LsSf https://astral.sh/uv/install.sh | sh`), then `direnv allow` (or `uv sync` directly). See README **Local development** for details.
### Modals: inline vs routed
Two coexisting modal mechanisms, one module (`l4d2web/l4d2web/static/js/modals.js`). When adding a new modal, decide which pipeline it belongs to:
**Inline modal** — the dialog markup is pre-rendered into the page HTML. Content is whatever's already there; the JS just calls `showModal()` / `close()` on a specific `<dialog>` by id. Use when:
- It's a confirmation (delete, overwrite, reset)
- It's a transient prompt mid-flow (conflict resolution during upload)
- It's a form whose URL state would be noise (rename, new-folder, new-server)
- The content has no standalone-page equivalent
Hooks: `<button data-inline-modal-open="<dialog-id>">` opens; `<button data-inline-modal-close>` inside the dialog closes; Esc and backdrop click also close. Programmatic: `window.modals.openInline(idOrEl)` / `window.modals.closeInline(idOrEl)`.
**Routed modal** — content is server-rendered from a URL and lands in the persistent `<dialog id="modal-container">` slot. URL gains `?modal=<path>`, refresh + share + back/forward all work. Use when:
- The content has standalone-page meaning (editor, detail view, settings panel)
- "Share this view" or "refresh-stays-here" matters
- The URL state earns its keep
Hooks: `<a data-routed-modal href="<path>">` opens (full-page nav fallback if JS fails); `<button data-routed-modal-dismiss>` inside the swapped content closes. Programmatic: `window.modals.openRouted(path)` / `window.modals.closeRouted()`.
**Conventions for routed-modal templates** (templates that `{% extends base_layout %}`, where `base_layout` resolves to `_modal_partial.html` for `HX-Modal: 1` requests and `base.html` otherwise — see `app.py:inject_base_layout`):
- **The outermost element of `{% block content %}` is a `<div>`, NOT a `<dialog>`.** The persistent slot in `base.html` already provides top-layer + backdrop + focus-trap + Esc-to-close semantics. Nested `<dialog>` collapses to 2 px in every browser.
- **Close buttons use `data-routed-modal-dismiss`** (NOT the inline-modal attribute). `modals.js` delegates at document level.
- **Form-bearing content needs document-level event delegation** for submit/save/delete, gated on `event.target.closest("#modal-content")`. Direct binding to elements in the swapped-in fragment only works in standalone mode — HTMX-swapped content arrives as fresh DOM nodes with no listeners attached. See `static/js/files-overlay/editor.js`'s document-level click listener + the `routedSaveClicked` / `routedReplaceClicked` / `routedDeleteClicked` functions for the canonical pattern (read `data-*` attributes from the swapped DOM, NOT from JS state set during open).
- **CSS classes targeting modal chrome are scoped to the outer slot:** `dialog.modal, div.modal` in `components.css`. The inner content div should NOT carry `class="modal modal-wide"` (the outer dialog owns chrome; otherwise both paint card-in-a-card).
**Reference:** `docs/superpowers/specs/2026-05-17-url-addressable-modals-design.md` (design + verification matrix) and the plan errata at the top of `docs/superpowers/plans/2026-05-17-url-addressable-modals.md`.
### Files overlay: module layout
The file-manager JS for files-type overlays is split across four
modules under `l4d2web/l4d2web/static/js/files-overlay/`, all loaded
with `defer` from `templates/overlay_detail.html`. They cooperate via
the `window.__filesOverlay` action registry that `core.js` sets up:
- **`core.js`** — manager-element detection (`.files-manager` guard),
derived state (`overlayId`, `baseUrl`, `treeRoot`, `csrfToken`),
shared helpers (`joinPath`, `parentOf`, `basename`, `humanSize`,
`fetchJson`, `postJson`, `postForm`, `refreshFolder`,
`findRowByPath`, `cssEscape`, `scheduleRefresh`), and the
document-level click listener that dispatches `[data-action]`
clicks through `__filesOverlay.handleAction(op, path, actionEl)`
into per-feature handlers.
- **`editor.js`** — URL-addressable editor only. Handles the new-file
route (`/files/new?at=...`), edit route for text + binary
(`/files/edit?path=...`), and the save / replace / delete delegated
click handlers scoped to `#modal-content`. Registers `"new-file"`
and `"edit"` into the registry.
- **`dialogs.js`** — the three inline `<dialog>` modals (new-folder,
delete-confirm, conflict). Module-scope state per dialog (one
delegated listener each, no clone-and-rebind). Exposes
`askConflict(path) → Promise<"overwrite"|"keep-both"|"cancel">`
on `__filesOverlay` for use by editor.js + uploads.js. Registers
`"new-folder"` and `"delete"` into the registry.
- **`uploads.js`** — upload queue (concurrency 3, XHR-based progress,
`data-upload-id` delegated cancel), drag-drop on `treeRoot`
(direct-bound — 5 coordinated events share highlight state), and
the `"zip"` registry handler. Exposes
`withCollisionSuffix(path) → suffixedPath` for the upload + save
conflict paths. Drag-drop on `treeRoot` is the **only** direct-bound
listener block in the four modules; everything else is document-level
delegation (see escape-hatch comments in-source).
When adding a new file-row action, the contract is:
1. Render the `<button data-action="my-op" data-target-path="...">` in
`templates/_overlay_file_node.html` (gated on the right capability
flag).
2. Pick the module that owns the action and register a handler:
`fo.registerHandler("my-op", (path, actionEl) => { ... })`.
3. The dispatch wiring in `core.js` takes care of catching the click
and calling the handler. No new listeners needed.
### Dev server and filesystem paths
- **Production paths (`/var/lib/left4me`, `/usr/local/lib/systemd/system`, `/usr/local/libexec/left4me`, `/etc/left4me`) exist only on Linux deploy hosts.** Never create or write to these on a developer machine. They are referenced in `l4d2host/l4d2host/paths.py` and the spec only as the production layout.
- **For local dev, always use `scripts/dev-server.py`.** It sets `LEFT4ME_ROOT=./.tmp/dev-server`, runs migrations, seeds demo content (admin + blueprint + script overlay + files overlay), and starts Flask on port 5051. Reset state with `rm -rf .tmp/dev-server` then re-run. Never invoke `flask run` directly — that leaves `LEFT4ME_ROOT` unset and the app falls back to the production `/var/lib/left4me`, which on macOS surfaces as "route returns 404 / empty modal / file not found" and can be mistaken for a code bug.
- **All ephemeral dev state lives under `.tmp/`** (gitignored). Use `$TMPDIR` only for transient files outside the repo. Do NOT use `/tmp`, `~/Library/Application Support`, or any system path for project state — only `.tmp/` (project-local) or `$TMPDIR` (sandbox-blessed).
- **Symptom-to-cause translation:** if a route returns 404 or behaves as if the filesystem is empty, the first diagnosis is "`LEFT4ME_ROOT` is wrong" (defaulted to the production path), not "code bug." Restart via `scripts/dev-server.py`.
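
The fallback behind that symptom can be sketched in Python. This is a minimal illustration, not the code in `l4d2host/l4d2host/paths.py`: the names `resolve_root` and `diagnose` are hypothetical, and only the `LEFT4ME_ROOT` variable and the `/var/lib/left4me` default are taken from this file.

```python
import os
from pathlib import Path

# Default documented in this file; everything else here is illustrative.
PRODUCTION_ROOT = Path("/var/lib/left4me")


def resolve_root(environ=None):
    """Return (root, is_default) for the given environment mapping."""
    env = os.environ if environ is None else environ
    raw = env.get("LEFT4ME_ROOT", "")
    if raw:
        return Path(raw), False
    # Unset or empty: fall back to the production layout.
    return PRODUCTION_ROOT, True


def diagnose(environ=None):
    """Describe the most likely cause of '404 / empty filesystem' symptoms."""
    root, is_default = resolve_root(environ)
    if is_default:
        return f"LEFT4ME_ROOT is unset; using {root}. Restart via scripts/dev-server.py."
    if not root.exists():
        return f"LEFT4ME_ROOT={root} does not exist; re-run scripts/dev-server.py."
    return f"LEFT4ME_ROOT={root} looks usable."
```

Calling `diagnose()` with no arguments inspects the current process environment, which is a quick first check before suspecting a code bug.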
### Planning artifacts
- Design specs live in `docs/superpowers/specs/` as `YYYY-MM-DD-<topic>-design.md`.
- Implementation plans live in `docs/superpowers/plans/` as `YYYY-MM-DD-<topic>.md` (suffix the topic with `-v1`/`-v2`/etc. if a plan is versioned).
- Commit both to git as soon as the user approves them.
- Do not leave specs or plans outside this repo. The `~/.claude/plans/<slug>.md` plan-mode scratch file is acceptable while plan mode is open; the persisted artifact must end up under `docs/superpowers/` and be committed.
### Naming and boundaries ### Naming and boundaries
@ -129,7 +41,7 @@ When adding a new file-row action, the contract is:
- `logs <name> --lines <n> --follow/--no-follow` - `logs <name> --lines <n> --follow/--no-follow`
- Runtime paths are rooted at `LEFT4ME_ROOT`, defaulting to `/var/lib/left4me`. - Runtime paths are rooted at `LEFT4ME_ROOT`, defaulting to `/var/lib/left4me`.
- Deployment/config management owns global units under `/usr/local/lib/systemd/system` and privileged helpers under `/usr/local/libexec/left4me`. - Deployment/config management owns global units under `/usr/local/lib/systemd/system` and privileged helpers under `/usr/local/libexec/left4me`.
- Overlay directories are populated by the web app (workshop downloads, managed-global refresh). The host library only mounts them. - Overlays are external directories (no overlay content management here).
- Fail-fast subprocess behavior; pass raw stderr; propagate return code. - Fail-fast subprocess behavior; pass raw stderr; propagate return code.
- No lock manager, no rollback, no preflight runtime checks. - No lock manager, no rollback, no preflight runtime checks.
- Delete missing instance/runtime dirs must succeed (no-op). - Delete missing instance/runtime dirs must succeed (no-op).
@ -177,45 +89,3 @@ Typical commands (once components exist):
- If a requested change conflicts with this file, follow explicit user instruction. - If a requested change conflicts with this file, follow explicit user instruction.
- If plans and code diverge, update plans or flag the mismatch clearly. - If plans and code diverge, update plans or flag the mismatch clearly.
## End-to-end tests
The Playwright-based browser tests under `l4d2web/tests/e2e/` need a
chromium binary, fetched on first setup:
```bash
uv run playwright install chromium
```
Always invoke as `uv run pytest -m e2e ...` (excluded from the default
fast suite via the `e2e` marker). Other forms crash Chromium under the
macOS sandbox; only this exact invocation is exempt.
## Editor bundle (CodeMirror 6)
The in-browser code editor on the blueprint config / overlay script /
files-modal textareas is bundled from `l4d2web/scripts/editor-src/`
via esbuild and committed pre-built to
`l4d2web/l4d2web/static/vendor/editor.bundle.js`. Source lives under
`l4d2web/scripts/editor-src/`; design and plan at
`docs/superpowers/specs/2026-05-17-textarea-editor-v2-design.md` and
`docs/superpowers/plans/2026-05-17-textarea-editor-v2.md`.
Rebuild after editing the source:
```bash
./l4d2web/scripts/build-editor.sh
```
Requires `node` + `npm` locally. The script overrides the npm cache to
`$TMPDIR/npm-cache` (set `NPM_CACHE` to override) to dodge root-owned
files in `~/.npm/_cacache/` from older npm versions. Commit the
regenerated `editor.bundle.js`, `editor.bundle.css`, and
`editor.bundle.sha256` alongside any source change.
Regenerate the autocomplete vocab from `./cvar_list` (live L4D2
cvarlist dump committed at repo root) after replacing the dump:
```bash
./l4d2web/scripts/build-vocab.py
```

View file

@ -27,7 +27,7 @@ Implementation plans remain the source of truth for architecture and task sequen
- `logs <name> --lines <n> --follow/--no-follow` - `logs <name> --lines <n> --follow/--no-follow`
- The web app calls host operations through `l4d2ctl`, not direct `l4d2host` imports. - The web app calls host operations through `l4d2ctl`, not direct `l4d2host` imports.
- Deployment uses `/var/lib/left4me` for runtime state, `/opt/left4me` for repository contents and the virtualenv, `/etc/left4me` for environment files, and global units under `/usr/local/lib/systemd/system`. - Deployment uses `/var/lib/left4me` for runtime state, `/opt/left4me` for repository contents and the virtualenv, `/etc/left4me` for environment files, and global units under `/usr/local/lib/systemd/system`.
- Overlay handling is directory-based; the web app populates each overlay (workshop downloads, managed-global refresh). - Overlay handling is directory-based and externally populated.
- No lock manager, no rollback, no preflight checks in host library. - No lock manager, no rollback, no preflight checks in host library.
- CLI propagates subprocess failures via stderr and return code. - CLI propagates subprocess failures via stderr and return code.
- `delete` on missing instance is no-op success. - `delete` on missing instance is no-op success.
@ -50,23 +50,13 @@ Implementation plans remain the source of truth for architecture and task sequen
See `deploy/README.md` for the Linux test deployment contract, including the runtime user, target filesystem layout, systemd units, privileged helpers, sudoers rules, admin bootstrap, and overlay reference rules. See `deploy/README.md` for the Linux test deployment contract, including the runtime user, target filesystem layout, systemd units, privileged helpers, sudoers rules, admin bootstrap, and overlay reference rules.
## Local development
This repo is a [uv](https://docs.astral.sh/uv/) workspace (`l4d2host` + `l4d2web` as members) with a committed `uv.lock` and a `.python-version` pinning Python 3.13 (matching the Debian Trixie production target).
One-time prereq: install `uv` (macOS: `brew install uv`; Linux: `curl -LsSf https://astral.sh/uv/install.sh | sh`, because `uv` is not yet in Debian stable's apt).
1. `direnv allow` once per fresh checkout (and after any `.envrc` change). `.envrc` uses `layout uv`, which runs `uv sync` and activates `.venv/` on `cd`.
2. Without direnv: `uv sync` at the repo root creates `.venv/`, installs both workspace members editable, and pulls in dev deps (pytest) from the lockfile.
3. Tests: `uv run pytest` (or just `pytest` once the venv is on PATH).
## Tech Stack (planned) ## Tech Stack (planned)
- Python 3.13+ (workspace uses uv + hatchling) - Python 3.12+
- Typer, PyYAML, pytest - Typer, PyYAML, pytest
- Flask, SQLAlchemy, Alembic - Flask, SQLAlchemy, Alembic
- HTMX (vendored locally), custom CSS, SSE - HTMX (vendored locally), custom CSS, SSE
- systemd units, kernel overlayfs (mounted via the `left4me-overlay` privileged helper), steamcmd - systemd user units, fuse-overlayfs, steamcmd
## Recommended Implementation Order ## Recommended Implementation Order

2198
cvar_list

File diff suppressed because it is too large


@ -1,133 +1,66 @@
# left4me deploy — reference exemplar # left4me Deployment
> The canonical deploy of `ovh.left4me` is driven by This directory contains the production-like test deployment for a Linux server. It installs the repository into a fixed host layout, configures a dedicated runtime user, installs systemd units, and wires the web app to host operations through privileged helper commands.
> [ckn-bw](https://git.sublimity.de/cronekorkn/ckn-bw)'s `bundles/left4me/`
> (attached via `groups/applications/left4me.py`); run `bw apply ovh.left4me`
> from the ckn-bw repo to deploy.
>
> **`deploy/files/` is the canonical source of truth** for static deployment
> artifacts — sudoers, sysctl drop-in, and hardening drop-ins for the
> systemd service units. ckn-bw delivers these via **target-side symlinks**
> from their on-host paths into `/opt/left4me/src/deploy/files/...` (safe
> because `/opt/left4me/src` is root-owned at runtime; the application cannot
> rewrite its own deployment artifacts).
>
> **`deploy/scripts/` is the canonical source of truth** for privileged
> helpers. ckn-bw creates target-side symlinks from
> `/usr/local/{libexec/left4me,sbin}/` into
> `/opt/left4me/src/deploy/scripts/{libexec,sbin}/` after `git_deploy`.
>
> What remains under `deploy/files/usr/local/lib/systemd/system/` is a set
> of **reference fixtures** — a curated subset of the systemd units ckn-bw's
> reactor emits at apply time. They exist so a fresh consumer (other than
> ckn-bw) can read this tree and understand the live unit shape, and so that
> `deploy/tests/test_example_units.py` can assert the reference matches the
> live form. The live base units are emitted by ckn-bw's `systemd/units`
> reactor with per-host CPU pinning and worker counts; the reference files
> must not include hardening directives (those live in the drop-ins, not the
> base units).
## What's here ## Target Layout
| Path | Role | The deployment uses these paths:
|---|---|
| `files/etc/sudoers.d/left4me` | **Canonical** sudoers grants. Symlinked to `/etc/sudoers.d/left4me`. CI syntax test: `tests/test_sudoers.py`. |
| `files/etc/sysctl.d/99-left4me.conf` | **Canonical** sysctl drop-in (UDP buffers, fq_codel + BBR, `kernel.yama.ptrace_scope=2`). Symlinked to `/etc/sysctl.d/99-left4me.conf`. |
| `files/etc/systemd/system/left4me-web.service.d/10-hardening.conf` | **Canonical** hardening drop-in for `left4me-web.service`. Symlinked to the same on-host path. |
| `files/etc/systemd/system/left4me-server@.service.d/10-hardening.conf` | **Canonical** hardening drop-in for `left4me-server@.service`. Symlinked to the same on-host path. |
| `files/etc/left4me/sandbox-resolv.conf` | Example `/etc/resolv.conf` bound into the script-overlay sandbox (delivered as a bw `files{}` item, not a symlink). |
| `files/usr/local/lib/systemd/system/left4me-web.service` | **Reference fixture** — the web-app unit the reactor emits (per-host worker/thread counts omitted). |
| `files/usr/local/lib/systemd/system/left4me-server@.service` | **Reference fixture** — the per-instance gameserver unit template the reactor emits. |
| `files/usr/local/lib/systemd/system/left4me-workshop-refresh.{service,timer}` | **Reference fixture** — the daily workshop-refresh cron-equivalent. |
| `files/usr/local/lib/systemd/system/l4d2-{game,build}.slice` | **Reference fixture** — slice definitions (CPU/IO weights; reactor fills in `AllowedCPUs=` from host metadata). |
| `scripts/libexec/{left4me-overlay,left4me-systemctl,left4me-journalctl,left4me-script-sandbox}` | **Canonical** privileged helper commands. Symlinked under `/usr/local/libexec/left4me/`. |
| `scripts/sbin/left4me` | **Canonical** admin CLI wrapper. Symlinked to `/usr/local/sbin/left4me`. |
| `templates/etc/left4me/host.env` | Example host-library env (deployment-fixed paths). |
| `templates/etc/left4me/web.env.template` | Example web-app env. ckn-bw renders the real version via the matching Mako template in `bundles/left4me/files/etc/left4me/web.env.mako`. |
| `tests/test_example_units.py` | Locks down the reference units and env templates above; also asserts hardening drop-in shape. |
| `tests/test_sudoers.py` | Runs `visudo -cf` against the sudoers file in CI. |
## Target layout - `/etc/left4me/host.env`: host library environment configuration.
- `/etc/left4me/web.env`: web app environment configuration.
- `/opt/left4me/.venv`: Python virtual environment for deployed commands.
- `/opt/left4me`: deployed repository contents.
- `/var/lib/left4me/left4me.db`: SQLite database used by the web app.
- `/var/lib/left4me/installation`: shared L4D2 installation.
- `/var/lib/left4me/overlays`: externally managed overlay directories.
- `/var/lib/left4me/instances`: rendered instance specifications and per-instance state.
- `/var/lib/left4me/runtime`: per-instance runtime mount directories.
- `/var/lib/left4me/tmp`: temporary files used by deployment/runtime operations.
- `/usr/local/lib/systemd/system`: global systemd unit files, including `left4me-server@.service`.
- `/usr/local/libexec/left4me`: privileged helper commands, including `left4me-systemctl` and `left4me-journalctl`.
- `/etc/sudoers.d/left4me`: sudoers rules allowing the web/runtime commands to call the helpers non-interactively.
The deployment uses these on-host paths (FHS-aligned): Static units are generated for `/var/lib/left4me`. If `LEFT4ME_ROOT` changes, regenerate and reinstall the unit files instead of reusing the existing static units.
- `/etc/left4me/host.env` — host library environment configuration. ## Runtime User
- `/etc/left4me/web.env` — web app environment configuration.
- `/etc/left4me/sandbox-resolv.conf` — DNS resolv.conf bound into the
script-overlay sandbox.
- `/etc/sudoers.d/left4me` — sudoers rules letting the `left4me` uid call
the privileged helpers non-interactively.
- `/etc/sysctl.d/99-left4me.conf` — perf-baseline sysctls.
- `/opt/left4me/src` — deployed repository contents (via ckn-bw
`git_deploy`). Root-owned; read-only at runtime. `/opt/left4me/`
itself is also root-owned and contains only `src/`.
- `/var/lib/left4me/.venv` — Python virtual environment for the web app
(non-editable install of `l4d2host` + `l4d2web`).
- `/var/lib/left4me/steam` — steamcmd install (self-updates).
- `/var/lib/left4me/left4me.db` — SQLite database used by the web app.
- `/var/lib/left4me/installation` — shared L4D2 installation.
- `/var/lib/left4me/overlays` — overlay directories. Each overlay lives
at `${overlay_id}` under here.
- `/var/lib/left4me/workshop_cache` — deduplicated cache of `.vpk` files
downloaded for workshop overlays. One file per Steam item, named
`{steam_id}.vpk`. Workshop overlays symlink into this tree.
- `/var/lib/left4me/instances` — rendered instance specifications and
per-instance state.
- `/var/lib/left4me/runtime` — per-instance runtime mount directories.
- `/var/lib/left4me/tmp` — temporary files used by deployment/runtime
operations (incl. idmap staging binds).
- `/usr/local/lib/systemd/system/` — global systemd unit files emitted
by ckn-bw's `systemd_units` reactor.
- `/usr/local/libexec/left4me/` — privileged helper commands, symlinked
from `deploy/scripts/libexec/`.
- `/usr/local/sbin/left4me` — admin CLI wrapper, symlinked from
`deploy/scripts/sbin/left4me`.
## Runtime users The deployment creates and runs host operations as the dedicated runtime user:
One system user does everything: - Username: `left4me`
- Home: `/var/lib/left4me`
- Shell: `/usr/sbin/nologin`
- **`left4me`** (home `/var/lib/left4me`, shell `/usr/sbin/nologin`): ## Running A Test Deployment
web app, host library, gameserver runtime, and script-overlay
sandbox. The sandbox unit drops privileges via `systemd-run` and
runs the user-authored bash inside a fully hardened transient
service (see `deploy/scripts/libexec/left4me-script-sandbox`). Same-uid
attack surface — sandbox escape reaching `web.env`, the SQLite DB,
or running gameservers — is closed by that hardening profile plus
system-wide `kernel.yama.ptrace_scope=2`, rather than by a uid
boundary.
The user-count decision and its history live in Run the deployment from the repository root:
`docs/superpowers/specs/2026-05-15-user-uid-split-design.md`.
## Deployment ```bash
deploy/deploy-test-server.sh deploy-user@example-host
Production deploy:
```sh
# In the ckn-bw repo:
bw apply ovh.left4me
``` ```
Admin bootstrap is a manual one-time step after the first apply The SSH user must be able to run `sudo` on the target host. The deployment configures system packages, directories, environment files, helper scripts, sudoers rules, Python dependencies, and systemd units.
(ckn-bw deliberately doesn't seed an admin to keep credentials out of
the metadata pipeline):
```sh ## Admin Bootstrap
sudo -u left4me LEFT4ME_ADMIN_USERNAME=admin \
Set the bootstrap credentials in the environment when creating the first admin user:
```bash
LEFT4ME_ADMIN_USERNAME=admin \
LEFT4ME_ADMIN_PASSWORD='change-me' \ LEFT4ME_ADMIN_PASSWORD='change-me' \
/var/lib/left4me/.venv/bin/flask --app l4d2web.app:create_app \ flask create-user "$LEFT4ME_ADMIN_USERNAME" --admin
create-user "$LEFT4ME_ADMIN_USERNAME" --admin
``` ```
Rotate the bootstrap password after first login. Use a strong one-time password and rotate it after first login if needed.
## Overlay references ## Overlay References
Overlay references are relative paths below `${LEFT4ME_ROOT}/overlays`. Overlay references are relative paths below `${LEFT4ME_ROOT}/overlays`. With the default deployment root, they resolve under `/var/lib/left4me/overlays`.
With the default deployment root, they resolve under
`/var/lib/left4me/overlays`. New overlays use `${overlay_id}` as their Valid examples:
path; the digit-only form is the only one created by the web app.
- `standard`
- `competitive/base`
- `users/42/custom`
Invalid references are rejected: Invalid references are rejected:
@ -136,165 +69,4 @@ Invalid references are rejected:
- Empty path components such as `competitive//base`. - Empty path components such as `competitive//base`.
- Symlink escapes that resolve outside `${LEFT4ME_ROOT}/overlays`. - Symlink escapes that resolve outside `${LEFT4ME_ROOT}/overlays`.
The web app currently supports two overlay surfaces: Overlay content is external to the host library and deployment contract. Populate overlay directories separately before referencing them from blueprints or instance specs.
- **`workshop` overlays** (user-owned) — populated by downloading
`.vpk` files from the public Steam Web API into
`${LEFT4ME_ROOT}/workshop_cache/{steam_id}.vpk` and creating absolute
symlinks under
`${LEFT4ME_ROOT}/overlays/{overlay_id}/left4dead2/addons/{steam_id}.vpk`.
- **`script` overlays** — populated by an arbitrary user-authored bash
script that runs inside `systemd-run` as `left4me` (under a fully
hardened transient service unit), with the overlay directory
bind-mounted RW at `/overlay`. Resource caps: 1h walltime, 4 GB RAM,
512 tasks, 200% CPU, 20 GB post-build disk cap.
Both caches and overlay directories are owned by `left4me`. If the web
service ever runs as a different uid, ensure it shares a group with the
host process and that both trees are group-readable.
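
The overlay reference rules in this section can be sketched as a small validator. This is a hedged illustration rather than the host library's implementation: `resolve_overlay_ref` is a hypothetical name, and rejecting absolute paths and `.`/`..` components is an assumption, since only the empty-component and symlink-escape rules are spelled out here.

```python
from pathlib import Path


def resolve_overlay_ref(left4me_root: Path, ref: str) -> Path:
    """Resolve an overlay reference or raise ValueError (hypothetical helper)."""
    overlays_root = (left4me_root / "overlays").resolve()

    # Assumption: references are relative, so absolute paths are rejected.
    if not ref or ref.startswith("/"):
        raise ValueError(f"overlay reference must be a relative path: {ref!r}")

    parts = ref.split("/")
    # Empty components, e.g. "competitive//base" or a trailing slash.
    if any(part == "" for part in parts):
        raise ValueError(f"empty path component in {ref!r}")
    # Assumption: "." and ".." components are rejected outright.
    if any(part in (".", "..") for part in parts):
        raise ValueError(f"relative traversal in {ref!r}")

    # resolve() follows symlinks, so an escape shows up as a path
    # that no longer sits below the overlays root.
    resolved = (overlays_root / ref).resolve()
    if not resolved.is_relative_to(overlays_root):
        raise ValueError(f"{ref!r} resolves outside {overlays_root}")
    return resolved
```

With the default root, `resolve_overlay_ref(Path("/var/lib/left4me"), "42")` would return a path under `/var/lib/left4me/overlays`.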
## Performance tuning
The deployment ships a host-side perf baseline (slices, unit directives,
sysctls). See
`docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`
for design rationale.
The knobs below are documented escape hatches — **not** auto-applied.
Apply only after measuring a need and understanding the failure modes.
### Network shaping
Three pieces of the baseline affect player-experience network behaviour:
1. **Per-flow marking.** ckn-bw's central `bundles/nftables/` consumes
left4me's nftables defaults and marks every UDP packet from uid
`left4me` with DSCP EF and `skb->priority` 6. srcds doesn't set
these itself, so without this rule its UDP is indistinguishable
from any other flow.
2. **Sysctl baseline.** `99-left4me.conf` sets `udp_rmem_min=16384`,
`udp_wmem_min=16384`, `default_qdisc=fq_codel`, and
`tcp_congestion_control=bbr`. Reduces head-of-line blocking when
bulk TCP egress coexists with game UDP.
3. **CAKE egress shaping.** Configured per-interface via systemd-networkd
metadata (`network/<iface>/cake` in ckn-bw's `bundles/network/`),
which reapplies the CAKE qdisc across iface lifecycle events. Set
the declared bandwidth to ≈95% of measured uplink — CAKE only shapes
if its declared bandwidth is *below* the real bottleneck. Idle links
with no competing egress see no visible CAKE effect; the win
materialises under bulk traffic that would otherwise bufferbloat the
link the players share.
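
As a worked example of the ≈95% rule for CAKE, the declared bandwidth can be derived from a measured uplink with integer arithmetic. The helper name `cake_bandwidth_kbit` is hypothetical, and the 95% default is simply the figure quoted above, not a tuned value.

```python
def cake_bandwidth_kbit(measured_uplink_kbit: int, percent: int = 95) -> int:
    """Return the bandwidth to declare to CAKE, in kbit/s.

    CAKE only shapes when the declared rate is below the real
    bottleneck, so the percentage must stay under 100.
    """
    if measured_uplink_kbit <= 0:
        raise ValueError("measured uplink must be positive")
    if not 0 < percent < 100:
        raise ValueError("percent must be between 1 and 99")
    # Integer arithmetic avoids floating-point rounding surprises.
    return measured_uplink_kbit * percent // 100


# A measured 100 Mbit/s uplink (100000 kbit/s) gives 95000 kbit/s.
declared = cake_bandwidth_kbit(100_000)
```

The result would then be written as the bandwidth value in the per-interface CAKE metadata, in whatever unit that metadata expects.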
### CPU governor
The performance governor squeezes a few percent off jitter under bursty
load. `schedutil` is acceptable for sustained UDP workloads.
```sh
sudo cpupower frequency-set -g performance
```
Install via `sudo apt install linux-cpupower` if the binary isn't
present. Persist via your distro's CPU-frequency tooling (e.g.
`/etc/default/cpufrequtils`).
### CPU isolation (cores)
The deploy writes four `AllowedCPUs=` drop-ins so that by default only
`l4d2-game.slice` is allowed to run on cores `1..N-1`; `system.slice`,
`user.slice`, and `l4d2-build.slice` are pinned to core 0. Game servers
get the host minus core 0 exclusively; the build sandbox and the web
app stay on core 0; a logged-in admin running CPU-heavy work in their
shell can't steal cycles from a live match. Single-core hosts skip the
cpuset drop-ins entirely; the rest of the perf baseline (cgroup
weights, sysctls, OOM scores) still applies.
Per-instance `CPUAffinity=` (next subsection) composes on top of this —
the per-instance value must be a subset of `l4d2-game.slice`'s
`AllowedCPUs=`, which the kernel enforces.
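
The default core split described above can be sketched as follows. This is an illustration of the stated policy only: `cpuset_plan` and `affinity_allowed` are hypothetical names, and the real drop-ins are written by the deploy tooling, not by this code.

```python
def cpuset_plan(cpu_count: int) -> dict[str, str]:
    """Map each slice to its AllowedCPUs= value for a host with cpu_count cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    if cpu_count == 1:
        # Single-core hosts skip the cpuset drop-ins entirely.
        return {}
    # Game servers get every core except core 0.
    game_cpus = "1" if cpu_count == 2 else f"1-{cpu_count - 1}"
    return {
        "l4d2-game.slice": game_cpus,
        "system.slice": "0",
        "user.slice": "0",
        "l4d2-build.slice": "0",
    }


def affinity_allowed(cpu: int, cpu_count: int) -> bool:
    """True if a per-instance CPUAffinity= value lies inside the game slice."""
    return 1 <= cpu <= cpu_count - 1
```

On a 4-core host this yields `1-3` for `l4d2-game.slice`, so a per-instance `CPUAffinity=2` is a valid subset while `CPUAffinity=0` is not.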
### Per-instance CPU affinity
`srcds` is single-threaded per instance. On a multi-core host, pinning
each instance to its own core can cut jitter under contention. Drop in
`/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf`:
```ini
[Service]
CPUAffinity=2
```
This pins the instance to CPU 2. A reasonable strategy on an N-core
host: leave core 0 for the kernel + IRQs + system services, then pin
one instance per remaining core.
### NIC tuning
Hardware-specific (install via `sudo apt install ethtool` if not
present). On a host with a single primary interface (replace `eth0`):
```sh
sudo ethtool -G eth0 rx 4096 tx 4096
sudo ethtool -K eth0 gro on lro off
```
If you run a high instance count, also pin the NIC's interrupts off
the cores that game servers occupy (see `/proc/interrupts` and
`/proc/irq/<n>/smp_affinity`).
### Real-time scheduling (advanced, opt-in)
Source-engine servers do not need real-time scheduling, and a
misbehaving `srcds` at any RT priority can starve kernel threads — even
with the default `kernel.sched_rt_runtime_us=950000` throttling 5% of
CPU back. Use only if you have a measured jitter problem that the
baseline does not solve.
`/etc/systemd/system/left4me-server@.service.d/realtime.conf`:
```ini
[Service]
CPUSchedulingPolicy=fifo
CPUSchedulingPriority=10
LimitRTPRIO=10
AmbientCapabilities=CAP_SYS_NICE
```
The `AmbientCapabilities=CAP_SYS_NICE` line is needed because the
service runs as `User=left4me` with `NoNewPrivileges=true`; without it
some kernels/systemd combinations refuse to apply the RT policy.
### Additional opt-in network knobs
- **Ingress shaping via IFB.** Egress CAKE alone does not protect srcds
receive against ingress saturation (large workshop downloads,
package fetches arriving at line rate). Worth flipping only when
measurement shows ingress hurting receive.
sudo modprobe ifb && sudo ip link set ifb0 up
sudo tc qdisc add dev <uplink> handle ffff: ingress
sudo tc filter add dev <uplink> parent ffff: protocol ip u32 \
match u32 0 0 action mirred egress redirect dev ifb0
sudo tc qdisc add dev ifb0 root cake bandwidth Xmbit ingress \
diffserv4 dual-srchost
- **`net.core.busy_poll = 50` / `net.core.busy_read = 50`.** Reduces
UDP receive median latency by polling for incoming packets briefly
at syscall boundaries. Cost: measurable CPU per syscall under load.
Worth flipping if a host is dedicated to game serving and CPU
headroom is plentiful.
- **`ethtool -K <iface> gro off`.** Some Source-engine ops disable
generic receive offload to avoid receive-side coalescing latency.
Hardware/driver dependent; document only.
### Applying changes to running servers
Unit-file changes do not apply to already-running services. After any
change:
```sh
sudo systemctl daemon-reload
# Restart each game server via the web UI's stop + start, or:
sudo systemctl restart 'left4me-server@*.service'
```

181
deploy/deploy-test-server.sh Executable file

@ -0,0 +1,181 @@
#!/bin/sh
set -eu

usage() {
    printf 'Usage: %s <ssh-user@host>\n' "$0" >&2
    exit 2
}

if [ "$#" -ne 1 ]; then
    usage
fi

target=$1
script_dir=$(CDPATH= cd -- "$(dirname -- "$0")" && pwd)
repo_root=$(CDPATH= cd -- "$script_dir/.." && pwd)
tmp_dir=$(mktemp -d)
archive="$tmp_dir/left4me.tar.gz"

cleanup() {
    rm -rf "$tmp_dir"
}
trap cleanup EXIT INT HUP TERM

tar -czf "$archive" \
    --exclude .git \
    --exclude .venv \
    --exclude __pycache__ \
    --exclude .pytest_cache \
    --exclude '*.egg-info' \
    --exclude 'l4d2web.db*' \
    -C "$repo_root" .

remote_tmp=$(ssh "$target" 'mktemp -d')
scp "$archive" "$target:$remote_tmp/left4me.tar.gz"

admin_username_file=
admin_password_file=
if [ "${LEFT4ME_ADMIN_USERNAME+x}" = x ] && [ "${LEFT4ME_ADMIN_PASSWORD+x}" = x ]; then
    admin_username_file="$tmp_dir/admin_username"
    admin_password_file="$tmp_dir/admin_password"
    umask 077
    printf '%s' "$LEFT4ME_ADMIN_USERNAME" > "$admin_username_file"
    printf '%s' "$LEFT4ME_ADMIN_PASSWORD" > "$admin_password_file"
    scp "$admin_username_file" "$target:$remote_tmp/admin_username"
    scp "$admin_password_file" "$target:$remote_tmp/admin_password"
fi

ssh "$target" sh -s -- "$remote_tmp" <<'REMOTE'
set -eu

remote_tmp=$1
archive="$remote_tmp/left4me.tar.gz"
repo_tmp="$remote_tmp/repo"

if [ "$(id -u)" -eq 0 ]; then
    sudo_cmd=
else
    sudo_cmd=sudo
fi

run_as_left4me() {
    sudo -u left4me "$@"
}

run_left4me_with_env() {
    run_as_left4me sh -c 'set -a; . /etc/left4me/host.env; . /etc/left4me/web.env; set +a; exec "$@"' sh "$@"
}

cleanup_remote() {
    rm -rf "$remote_tmp"
}
trap cleanup_remote EXIT INT HUP TERM

if ! id left4me >/dev/null 2>&1; then
    $sudo_cmd useradd --system --home-dir /var/lib/left4me --create-home --shell /usr/sbin/nologin left4me
fi

if command -v apt-get >/dev/null 2>&1; then
    $sudo_cmd apt-get update
    $sudo_cmd apt-get install -y python3 python3-venv python3-pip curl ca-certificates tar gzip fuse-overlayfs fuse3 sudo
elif command -v dnf >/dev/null 2>&1; then
    $sudo_cmd dnf install -y python3 python3-pip curl ca-certificates tar gzip fuse-overlayfs fuse3 sudo
else
    printf 'Unsupported package manager: expected apt-get or dnf\n' >&2
    exit 1
fi

$sudo_cmd mkdir -p \
    /etc/left4me \
    /opt/left4me \
    /usr/local/lib/systemd/system \
    /usr/local/libexec/left4me \
    /var/lib/left4me/installation \
    /var/lib/left4me/overlays \
    /var/lib/left4me/instances \
    /var/lib/left4me/runtime \
    /var/lib/left4me/tmp
$sudo_cmd chown -R left4me:left4me /var/lib/left4me /opt/left4me

mkdir -p "$repo_tmp"
tar -xzf "$archive" -C "$repo_tmp"

if [ -d /opt/left4me/.venv ]; then
    $sudo_cmd mv /opt/left4me/.venv "$remote_tmp/venv"
fi
$sudo_cmd find /opt/left4me -mindepth 1 -maxdepth 1 -exec rm -rf {} +
$sudo_cmd cp -R "$repo_tmp"/. /opt/left4me/
if [ -d "$remote_tmp/venv" ]; then
    $sudo_cmd mv "$remote_tmp/venv" /opt/left4me/.venv
fi
$sudo_cmd chown -R left4me:left4me /opt/left4me

$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-web.service /usr/local/lib/systemd/system/left4me-web.service
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service /usr/local/lib/systemd/system/left4me-server@.service
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-systemctl /usr/local/libexec/left4me/left4me-systemctl
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-journalctl /usr/local/libexec/left4me/left4me-journalctl
$sudo_cmd chmod 0755 /usr/local/libexec/left4me/left4me-systemctl /usr/local/libexec/left4me/left4me-journalctl
$sudo_cmd cp /opt/left4me/deploy/files/etc/sudoers.d/left4me /etc/sudoers.d/left4me
$sudo_cmd chmod 0440 /etc/sudoers.d/left4me
$sudo_cmd visudo -cf /etc/sudoers.d/left4me
$sudo_cmd cp /opt/left4me/deploy/templates/etc/left4me/host.env /etc/left4me/host.env
$sudo_cmd chmod 0644 /etc/left4me/host.env

if [ ! -f /etc/left4me/web.env ]; then
    secret_key=$(python3 -c 'import secrets; print(secrets.token_hex(32))')
    tmp_web_env="$remote_tmp/web.env"
    {
        printf 'DATABASE_URL=sqlite:////var/lib/left4me/left4me.db\n'
        printf 'SECRET_KEY=%s\n' "$secret_key"
        printf 'JOB_WORKER_THREADS=4\n'
    } > "$tmp_web_env"
    $sudo_cmd install -m 0640 -o root -g left4me "$tmp_web_env" /etc/left4me/web.env
fi

if [ ! -x /opt/left4me/.venv/bin/python ]; then
    run_as_left4me python3 -m venv /opt/left4me/.venv
fi
run_as_left4me /opt/left4me/.venv/bin/python -m pip install --upgrade pip
run_as_left4me /opt/left4me/.venv/bin/pip install -e /opt/left4me/l4d2host -e /opt/left4me/l4d2web

run_left4me_with_env env \
    JOB_WORKER_ENABLED=false \
    /opt/left4me/.venv/bin/python -c "from l4d2web.app import create_app; create_app()"

run_as_left4me sh -c "cd /opt/left4me/l4d2web && set -a; . /etc/left4me/host.env; . /etc/left4me/web.env; set +a; env \
    JOB_WORKER_ENABLED=false \
    PYTHONPATH=/opt/left4me \
    /opt/left4me/.venv/bin/alembic -c /opt/left4me/l4d2web/alembic.ini upgrade head"

if [ -f "$remote_tmp/admin_username" ] && [ -f "$remote_tmp/admin_password" ]; then
    LEFT4ME_ADMIN_USERNAME=$(cat "$remote_tmp/admin_username")
    LEFT4ME_ADMIN_PASSWORD=$(cat "$remote_tmp/admin_password")
    if ! create_user_output=$(run_left4me_with_env env \
        JOB_WORKER_ENABLED=false \
        LEFT4ME_ADMIN_PASSWORD="$LEFT4ME_ADMIN_PASSWORD" \
        /opt/left4me/.venv/bin/flask --app l4d2web.app:create_app create-user "$LEFT4ME_ADMIN_USERNAME" --admin 2>&1); then
        case "$create_user_output" in
            *'user already exists'*) printf '%s\n' "$create_user_output" ;;
            *) printf '%s\n' "$create_user_output" >&2; exit 1 ;;
        esac
    else
        printf '%s\n' "$create_user_output"
    fi
fi

$sudo_cmd systemctl daemon-reload
$sudo_cmd systemctl enable --now left4me-web.service
$sudo_cmd systemctl restart left4me-web.service

for attempt in 1 2 3 4 5 6 7 8 9 10; do
    if curl -fsS http://127.0.0.1:8000/health; then
        exit 0
    fi
    sleep 1
done

$sudo_cmd systemctl status left4me-web.service --no-pager >&2 || true
$sudo_cmd journalctl -u left4me-web.service -n 80 --no-pager >&2 || true
exit 1
REMOTE
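
Review note: the ten-attempt health probe at the end of the remote script could be factored into a small reusable helper. The sketch below is one possible shape, not something the script contains; the name `retry` and its argument order are assumptions. Unlike the inline loop, it does not sleep after the final failed attempt.

```sh
#!/bin/sh
# retry <attempts> <delay-seconds> <command...>
# Run the command until it succeeds or the attempt budget is spent.
# Returns 0 on the first success, 1 once all attempts have failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Usage mirroring the deploy script's probe (URL taken from the script):
#   retry 10 1 curl -fsS http://127.0.0.1:8000/health
```

Because POSIX sh has no `local`, the helper's variables (`attempts`, `delay`, `n`) are global; a command passed to it should avoid reusing those names.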

@ -1,6 +0,0 @@
# Sandbox-only resolver config — bind-mounted into script-overlay sandboxes
# at /etc/resolv.conf. The host's resolver (often a private/LAN DNS server)
# is unreachable from inside the sandbox because IPAddressDeny= blocks
# egress to RFC1918 / loopback. Public resolvers keep DNS working.
nameserver 1.1.1.1
nameserver 8.8.8.8

@ -1,5 +1,3 @@
 Defaults:left4me !requiretty
 left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-systemctl *
 left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-journalctl *
-left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-overlay mount *, /usr/local/libexec/left4me/left4me-overlay umount *
-left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox

@ -1,41 +0,0 @@
# Host-side perf baseline for left4me — see
# docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
#
# UDP socket buffers: distro defaults of ~128 KiB are too small for sustained
# Source-engine UDP across multiple instances. 8 MiB matches the standard
# 1 Gbit recommendation; rmem_default/wmem_default protect sockets that don't
# explicitly enlarge their buffers.
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
# Kernel softirq UDP path: the per-CPU backlog queue starts dropping packets
# at the default 1000 under multi-instance burst; 5000 absorbs realistic peaks.
# netdev_budget = 600 gives softirq more drain headroom per pass.
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
# Latency-sensitive default: avoid swap unless the box is really under
# pressure. Harmless on swapless hosts.
vm.swappiness = 10
# Per-socket UDP buffer floors: protect game-server sockets that don't bump
# their own SO_RCVBUF/SO_SNDBUF when softirq drains lag briefly.
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
# Default qdisc for ifaces we don't explicitly shape with CAKE. Debian Trixie
# already defaults to fq_codel; setting it explicitly is belt-and-suspenders
# and survives kernel-default churn.
net.core.default_qdisc = fq_codel
# TCP congestion control: BBR for any bulk TCP egress on the host (admin SSH,
# backups, package fetches, web-app responses) so a long flow does not push
# the bottleneck queue ahead of game UDP. UDP srcds is unaffected.
net.ipv4.tcp_congestion_control = bbr
# Block ptrace except from CAP_SYS_PTRACE holders. Belt-and-braces with
# SystemCallFilter=~@debug + PrivateUsers=true in the gameserver unit.
# See docs/superpowers/specs/2026-05-15-hardening-defenses-survey.md.
kernel.yama.ptrace_scope = 2
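
Review note: a baseline like this is easy to drift from on a live host. The sketch below shows one way to normalise a sysctl drop-in into `key=value` pairs and compare them with the running kernel. Both function names are hypothetical, and the path of the drop-in is left as a parameter because the diff does not show it. The drift check relies on the standard `sysctl -n` command and is only meaningful for single-valued keys such as the ones above.

```sh
#!/bin/sh
# sysctl_conf_pairs <file>
# Print the file's settings as key=value, one per line, with comment lines
# ('#' or ';'), blank lines, and the spaces around '=' removed.
sysctl_conf_pairs() {
    sed -e '/^[[:space:]]*[#;]/d' \
        -e '/^[[:space:]]*$/d' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e 's/[[:space:]]*=[[:space:]]*/=/' "$1"
}

# sysctl_drift <file>
# Report every key whose live value differs from the value in the file.
sysctl_drift() {
    sysctl_conf_pairs "$1" | while IFS='=' read -r key want; do
        have=$(sysctl -n "$key" 2>/dev/null) || have='<unreadable>'
        [ "$have" = "$want" ] || printf '%s: want %s, have %s\n' "$key" "$want" "$have"
    done
}

# Usage (the path is an assumption):
#   sysctl_drift /etc/sysctl.d/<baseline-file>.conf
```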

@ -1,82 +0,0 @@
# Hardening drop-in for left4me-server@.service.
#
# Source of truth: this file (in left4me/deploy/files/). ckn-bw deploys
# it to /etc/systemd/system/left4me-server@.service.d/10-hardening.conf
# via a target-side symlink into the checkout.
#
# Gameserver unit: full hardening profile. No sudo path inside; no
# sudo-incompatibility carve-outs.
[Service]
NoNewPrivileges=true
RestrictSUIDSGID=true
CapabilityBoundingSet=
AmbientCapabilities=
# srcds_linux is i386 (Source 2007 engine). Bare 'native' kills every
# 32-bit syscall and traps srcds_run in a respawn loop.
SystemCallArchitectures=native x86
SystemCallFilter=@system-service
SystemCallFilter=~@debug @mount @raw-io @reboot @swap @cpu-emulation @obsolete @privileged
TemporaryFileSystem=/var/lib /etc /opt /home /root /srv /mnt /media
BindReadOnlyPaths=/var/lib/left4me/installation
BindReadOnlyPaths=/var/lib/left4me/overlays
# Workshop VPKs in overlays are symlinks into workshop_cache;
# without this bind they dangle inside the unit and Source
# silently fails to load the addons.
BindReadOnlyPaths=/var/lib/left4me/workshop_cache
# Steam SDK: srcds dlopen's ~/.steam/sdk32/steamclient.so for
# Steam master-server registration. Without this, SteamAPI_Init
# fails and the server falls back to LAN-only mode regardless
# of sv_lan=0 — clients then get "LAN servers are restricted
# to local clients (class C)". .steam holds symlinks into
# /var/lib/left4me/steam, so both paths need to be bound back
# through TemporaryFileSystem.
BindReadOnlyPaths=/var/lib/left4me/.steam
BindReadOnlyPaths=/var/lib/left4me/steam
BindReadOnlyPaths=/etc/left4me/host.env
BindReadOnlyPaths=/etc/ssl
BindReadOnlyPaths=/etc/ca-certificates
BindReadOnlyPaths=/etc/resolv.conf
BindReadOnlyPaths=/etc/nsswitch.conf
BindReadOnlyPaths=/etc/alternatives
BindPaths=/var/lib/left4me/runtime/%i
ProtectSystem=strict
ProtectHome=true
PrivateUsers=true
# PrivatePIDs is the test-plan amendment that closes D2.b: same-uid
# ProtectProc=invisible cannot hide gunicorn from srcds (both run as
# uid 980); a private PID namespace does.
PrivatePIDs=true
PrivateTmp=true
PrivateDevices=true
PrivateIPC=true
RestrictNamespaces=true
RestrictRealtime=true
ProtectProc=invisible
# ProcSubset=pid intentionally OMITTED — it hides /proc/cpuinfo and
# /proc/sys/*, which breaks Source's tier0/cpu.cpp and (downstream)
# SteamAPI_Init's pipe-creation step. Server then registers as LAN-only
# and rejects external clients with "LAN servers are restricted to
# local clients (class C)". PrivatePIDs=true (kernel PID namespace) is
# the load-bearing peer-process isolation; ProtectProc=invisible is the
# foreign-uid /proc hide. Losing ProcSubset=pid only exposes host kernel
# info (cpuinfo, meminfo, sysctls), which is not sensitive in this
# threat model. See ckn-bw commit 4339289 for the original fix.
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectClock=true
ProtectControlGroups=true
ProtectHostname=true
LockPersonality=true
RemoveIPC=true
KeyringMode=private
UMask=0027
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
# Lock srcds bindable sockets to the game port range. Hard-coded range
# because systemd directive variable substitution is uneven for
# SocketBindAllow.
SocketBindAllow=udp:27000-27999
SocketBindAllow=tcp:27000-27999
# W+X mprotect (text relocations in Source engine i386 .so files) is
# incompatible with the memory-deny-write-execute directive; that
# directive is therefore intentionally absent from this drop-in.
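
Review note: because `SocketBindAllow` hard-codes 27000-27999, an instance configured with a port outside that range would fail to bind only at start-up. A pre-flight check along these lines could catch it earlier. This is a sketch; the function name `port_in_game_range` is an assumption, and the bounds simply restate the range in the drop-in.

```sh
#!/bin/sh
# port_in_game_range <port>
# Succeeds only for a decimal port inside the range the hardening drop-in
# allows srcds to bind (27000-27999).
port_in_game_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or not purely numeric
    esac
    [ "$1" -ge 27000 ] && [ "$1" -le 27999 ]
}

# Example: validate a candidate port before writing it to instance.env.
if port_in_game_range 27015; then
    echo "27015 is bindable under the drop-in"
fi
```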

@ -1,44 +0,0 @@
# Hardening drop-in for left4me-web.service.
#
# Source of truth: this file (in left4me/deploy/files/). ckn-bw deploys
# it to /etc/systemd/system/left4me-web.service.d/10-hardening.conf via a
# target-side symlink into the checkout.
#
# See docs/superpowers/specs/2026-05-15-hardening-defenses-survey.md
# and 2026-05-15-hardening-test-plan.md for the threat model and the
# verification matrix.
#
# This unit is the web app; some sudo-incompatible directives are
# intentionally absent:
# NoNewPrivileges — blocks sudo's setuid escalation
# PrivateUsers — breaks sudo's host-root mapping
# RestrictSUIDSGID — blocks setuid()/setgid()
# CapabilityBoundingSet — empty value would deny sudo's caps
# @privileged exclusion in SystemCallFilter — blocks sudo's setuid syscall
# All of those are unconditional on the gameserver unit (no sudo there).
[Service]
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ProtectProc=invisible
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectClock=true
ProtectControlGroups=true
ProtectHostname=true
LockPersonality=true
# `native x86` (not just `native`) — the install job fork-execs
# steamcmd_linux, a 32-bit binary, which makes i386-numbered syscalls.
# Under `native` alone the kernel SIGSYS-kills it (bash exit 159 =
# 128+SIGSYS). Mirrors the server unit, which needs the same allowance
# for srcds_linux. See deploy/files/etc/systemd/system/left4me-server@.service.d/10-hardening.conf.
SystemCallArchitectures=native x86
SystemCallFilter=@system-service
SystemCallFilter=~@debug @mount @raw-io @reboot @swap @cpu-emulation @obsolete
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=true
RestrictRealtime=true
RemoveIPC=true
KeyringMode=private
UMask=0027

@ -1,8 +0,0 @@
# Perf baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
[Unit]
Description=left4me script-sandbox build slice
Before=slices.target
[Slice]
CPUWeight=10
IOWeight=10

@ -1,8 +0,0 @@
# Perf baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
[Unit]
Description=left4me game-server slice
Before=slices.target
[Slice]
CPUWeight=1000
IOWeight=1000

@ -1,24 +1,7 @@
-# left4me gameserver — system unit, one instance per gameserver.
-#
-# This is the REFERENCE COPY of the deployed unit base body. The live
-# source is the systemd/units reactor at
-# ~/Projekte/ckn-bw/bundles/left4me/metadata.py (look for
-# 'left4me-server@.service').
-#
-# Hardening: see left4me-server@.service.d/10-hardening.conf
-#
-# Threat model: docs/superpowers/specs/2026-05-15-hardening-threat-model.md
-# Defenses survey: docs/superpowers/specs/2026-05-15-hardening-defenses-survey.md
-# Test plan + results: docs/superpowers/specs/2026-05-15-hardening-test-plan.md
 [Unit]
 Description=left4me server instance %i
 After=network-online.target
 Wants=network-online.target
-# Bound the restart loop. Without these, a persistent ExecStartPre or
-# ExecStart failure spins indefinitely.
-StartLimitBurst=5
-StartLimitIntervalSec=60s
 [Service]
 Type=simple
@ -26,38 +9,19 @@ User=left4me
 Group=left4me
 EnvironmentFile=/etc/left4me/host.env
 EnvironmentFile=/var/lib/left4me/instances/%i/instance.env
-# `-` prefix: chdir failure is non-fatal. The merged dir only exists
-# once ExecStartPre's overlay mount succeeds.
-WorkingDirectory=-/var/lib/left4me/runtime/%i/merged/left4dead2
-# `+` prefix runs the helper as PID 1 (root, all caps, host
-# namespaces) — required because the hardening drop-in sets
-# NoNewPrivileges and PrivateUsers; both block sudo's setuid path.
-# nsenter into PID 1's mount namespace ensures the umount in
-# ExecStopPost succeeds without EBUSY from the unit's own
-# slave-mount tree.
-ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
-# Run from the merged overlay, NOT installation/. srcds_run cds to its
-# own dirname before exec'ing srcds_linux; the binary's path determines
-# gameinfo + addons lookup.
-ExecStart=/var/lib/left4me/runtime/%i/merged/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
-ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i
+WorkingDirectory=/var/lib/left4me/runtime/%i/merged/left4dead2
+ExecStart=/var/lib/left4me/installation/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
 Restart=on-failure
 RestartSec=5
-# === Resource control baseline ===
-# See docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
-Slice=l4d2-game.slice
-Nice=-5
-IOSchedulingClass=best-effort
-IOSchedulingPriority=4
-OOMScoreAdjust=-200
-MemoryHigh=1.5G
-MemoryMax=2G
-TasksMax=256
-LimitNOFILE=65536
-KillSignal=SIGINT
-TimeoutStopSec=15s
-LogRateLimitIntervalSec=0
+NoNewPrivileges=true
+PrivateTmp=true
+PrivateDevices=true
+ProtectHome=true
+ProtectSystem=strict
+ReadOnlyPaths=/var/lib/left4me/installation /var/lib/left4me/overlays
+ReadWritePaths=/var/lib/left4me/runtime/%i
+RestrictSUIDSGID=true
+LockPersonality=true
 [Install]
 WantedBy=multi-user.target

@ -1,14 +1,3 @@
-# left4me web application — system unit.
-#
-# This is the REFERENCE COPY of the deployed unit base body. The live
-# source is the systemd/units reactor at
-# ~/Projekte/ckn-bw/bundles/left4me/metadata.py (look for
-# 'left4me-web.service').
-#
-# Hardening: see left4me-web.service.d/10-hardening.conf
-#
-# Threat model + defenses + tests: see docs/superpowers/specs/2026-05-15-hardening-*
 [Unit]
 Description=left4me web application
 After=network-online.target
@ -18,19 +7,17 @@ Wants=network-online.target
 Type=simple
 User=left4me
 Group=left4me
-WorkingDirectory=/opt/left4me/src
-Environment=HOME=/var/lib/left4me PATH=/var/lib/left4me/.venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
+WorkingDirectory=/opt/left4me
+Environment=HOME=/var/lib/left4me
+Environment=PATH=/opt/left4me/.venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
 EnvironmentFile=/etc/left4me/host.env
 EnvironmentFile=/etc/left4me/web.env
-# Placeholder values for --workers / --threads. Live emission interpolates
-# from metadata.get('left4me/gunicorn_workers') and gunicorn_threads.
-ExecStart=/var/lib/left4me/.venv/bin/gunicorn --workers 1 --threads 32 --bind 127.0.0.1:8000 'l4d2web.app:create_app()'
+ExecStart=/opt/left4me/.venv/bin/gunicorn --workers 1 --threads 8 --bind 0.0.0.0:8000 'l4d2web.app:create_app()'
 Restart=on-failure
 RestartSec=3
-# Web writes broadly under /var/lib/left4me (DB, instance configs,
-# overlays, runtime). Kept inline because it's web-specific
-# (server@ uses BindPaths to bind only its instance dir).
+NoNewPrivileges=true
+PrivateTmp=true
+ProtectSystem=full
 ReadWritePaths=/var/lib/left4me
 [Install]

@ -1,15 +0,0 @@
[Unit]
Description=left4me daily workshop refresh (enqueue job)
After=network-online.target left4me-web.service
Wants=left4me-web.service
[Service]
Type=oneshot
User=left4me
Group=left4me
WorkingDirectory=/opt/left4me/src
Environment=HOME=/var/lib/left4me
Environment=PATH=/var/lib/left4me/.venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
EnvironmentFile=/etc/left4me/host.env
EnvironmentFile=/etc/left4me/web.env
ExecStart=/var/lib/left4me/.venv/bin/flask --app l4d2web.app:create_app workshop-refresh

@ -1,11 +0,0 @@
[Unit]
Description=left4me daily workshop refresh
[Timer]
OnCalendar=*-*-* 04:00:00
Persistent=true
RandomizedDelaySec=15min
Unit=left4me-workshop-refresh.service
[Install]
WantedBy=timers.target

@ -37,16 +37,6 @@ case "$follow_flag" in
 esac
 unit="left4me-server@${name}.service"
-if [ -x /bin/systemctl ]; then
-    systemctl=/bin/systemctl
-elif [ -x /usr/bin/systemctl ]; then
-    systemctl=/usr/bin/systemctl
-else
-    printf '%s\n' 'systemctl not found at /bin/systemctl or /usr/bin/systemctl' >&2
-    exit 69
-fi
 if [ -x /bin/journalctl ]; then
     journalctl=/bin/journalctl
 elif [ -x /usr/bin/journalctl ]; then
@ -56,20 +46,8 @@ else
     exit 69
 fi
-# Anchor `--since` to the moment systemd began the unit's current start
-# transaction so the log panel starts at the latest run. Force LC_ALL=C so
-# the day-of-week prefix is in a locale journalctl reliably parses.
-start_time=$(LC_ALL=C "$systemctl" show -p InactiveExitTimestamp --value "$unit" 2>/dev/null || true)
-if [ -n "$start_time" ]; then
-    if [ -n "$follow_arg" ]; then
-        exec "$journalctl" -u "$unit" --since "$start_time" -n "$lines" -o cat "$follow_arg"
-    fi
-    exec "$journalctl" -u "$unit" --since "$start_time" -n "$lines" -o cat
-fi
-# Unit has never run: no --since cutoff. `-f` will attach on first start.
 if [ -n "$follow_arg" ]; then
     exec "$journalctl" -u "$unit" -n "$lines" -o cat "$follow_arg"
 fi
 exec "$journalctl" -u "$unit" -n "$lines" -o cat

@ -2,7 +2,7 @@
 set -eu
 usage() {
-    printf '%s\n' "usage: left4me-systemctl enable|disable|show <server-name>" >&2
+    printf '%s\n' "usage: left4me-systemctl start|stop|show <server-name>" >&2
     exit 2
 }
@ -22,7 +22,7 @@ action=$1
 name=$2
 case "$action" in
-    enable|disable|show) ;;
+    start|stop|show) ;;
     *) usage ;;
 esac
@ -38,7 +38,7 @@ else
 fi
 case "$action" in
-    enable) exec "$systemctl" enable --now "$unit" ;;
-    disable) exec "$systemctl" disable --now "$unit" ;;
+    start) exec "$systemctl" start "$unit" ;;
+    stop) exec "$systemctl" stop "$unit" ;;
     show) exec "$systemctl" show --property=ActiveState --property=SubState "$unit" ;;
 esac

@ -1,244 +0,0 @@
#!/usr/bin/python3
"""Privileged overlay mount helper for left4me.

Invoked from the systemd unit's ExecStartPre / ExecStopPost via
`+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- …`. The unit-level
nsenter is what makes this work: it runs the helper Python interpreter
inside PID 1's mount namespace. Without it, the `+` Exec prefix
removes the sandbox/credentials but does NOT detach from the unit's
per-service mount namespace, and the helper process itself would pin
that namespace alive — turning every umount into a multi-second EBUSY
race with the kernel's deferred namespace cleanup. With the unit-level
nsenter the helper has no such reference and umount succeeds first try.

Validates inputs strictly, then performs `mount -t overlay` /
`umount` directly — no internal nsenter, since the helper is already
running where the syscalls need to take effect.

Verbs:
  mount <name>    Reads ${LEFT4ME_ROOT}/instances/<name>/instance.env
                  for L4D2_LOWERDIRS, validates every lowerdir is
                  under one of installation/overlays/workshop_cache/
                  global_overlay_cache, then mounts the kernel
                  overlay at runtime/<name>/merged.
  umount <name>   Unmounts runtime/<name>/merged and cleans up the
                  kernel-overlayfs `work/work` orphan.

Set LEFT4ME_OVERLAY_PRINT_ONLY=1 to print the would-be argv (one line,
shell-quoted) and exit 0 instead of execv. Used by tests.
"""

import os
import re
import shlex
import shutil
import subprocess
import sys
from pathlib import Path

NAME_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{0,63}$")
DEFAULT_ROOT = "/var/lib/left4me"
LOWERDIR_ALLOWLIST = (
    "installation",
    "overlays",
    "global_overlay_cache",
    "workshop_cache",
)
MAX_LOWERDIRS = 500
MOUNT_BIN = "/bin/mount"
UMOUNT_BIN = "/bin/umount"


def die(msg: str) -> None:
    sys.stderr.write(f"left4me-overlay: {msg}\n")
    sys.exit(1)


def root() -> Path:
    return Path(os.environ.get("LEFT4ME_ROOT") or DEFAULT_ROOT)


def validate_name(name: str) -> str:
    if not NAME_RE.fullmatch(name):
        die(f"invalid instance name: {name!r}")
    return name


def parse_lowerdirs(env_path: Path) -> list[str]:
    if not env_path.is_file():
        die(f"instance.env not found: {env_path}")
    raw = None
    for line in env_path.read_text().splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "L4D2_LOWERDIRS":
            raw = value
            break
    if raw is None:
        die(f"L4D2_LOWERDIRS not set in {env_path}")
    if raw == "":
        die(f"L4D2_LOWERDIRS is empty in {env_path}")
    parts = raw.split(":")
    if any(p == "" for p in parts):
        die(f"L4D2_LOWERDIRS contains an empty entry: {raw!r}")
    if len(parts) > MAX_LOWERDIRS:
        die(f"L4D2_LOWERDIRS has {len(parts)} entries (cap {MAX_LOWERDIRS})")
    return parts


def canonical_under(allowed_roots: list[Path], path: Path) -> Path:
    try:
        canonical = path.resolve(strict=True)
    except (FileNotFoundError, RuntimeError):
        die(f"path does not exist or has a symlink loop: {path}")
    for r in allowed_roots:
        if canonical == r or r in canonical.parents:
            return canonical
    die(f"path is outside the permitted roots: {path} (resolved: {canonical})")


_LISTXATTR = getattr(os, "listxattr", None)


def _entry_has_fuse_xattr(path: str) -> str | None:
    if _LISTXATTR is None:
        return None
    try:
        attrs = _LISTXATTR(path, follow_symlinks=False)
    except OSError:
        return None
    for a in attrs:
        if a.startswith("user.fuseoverlayfs."):
            return a
    return None


def assert_no_fuse_xattrs(upper: Path) -> None:
    if not upper.exists() or _LISTXATTR is None:
        return
    for dirpath, dirnames, filenames in os.walk(upper):
        for entry in (dirpath, *(os.path.join(dirpath, n) for n in dirnames),
                      *(os.path.join(dirpath, n) for n in filenames)):
            tainted = _entry_has_fuse_xattr(entry)
            if tainted:
                die(
                    f"upperdir contains fuse-overlayfs xattr {tainted!r} on {entry}; "
                    "wipe upper/ and work/ before mounting"
                )


def _print_argv(argv: list[str]) -> None:
    """Emit one shell-quoted argv line to stdout (PRINT_ONLY helper, no exit)."""
    print(" ".join(shlex.quote(a) for a in argv))


def exec_or_print(argv: list[str]) -> None:
    if os.environ.get("LEFT4ME_OVERLAY_PRINT_ONLY") == "1":
        _print_argv(argv)
        sys.exit(0)
    os.execv(argv[0], argv)


def cmd_mount(name: str) -> None:
    name = validate_name(name)
    r = root()
    runtime_name_dir = (r / "runtime" / name).resolve(strict=True)
    merged_for_check = (runtime_name_dir / "merged").resolve(strict=True)

    # Idempotency for unit restart cycles: if a previous start mounted
    # successfully but ExecStart failed afterwards (and Restart=on-failure
    # fires another cycle), the second ExecStartPre would otherwise refuse
    # to mount-on-top. Short-circuit here so the second cycle just gets
    # straight to ExecStart. PRINT_ONLY (test mode) bypasses this so the
    # tests can exercise the full mount argv regardless of mount state.
    if (
        os.environ.get("LEFT4ME_OVERLAY_PRINT_ONLY") != "1"
        and os.path.ismount(merged_for_check)
    ):
        return

    instance_env = r / "instances" / name / "instance.env"
    raw_lowerdirs = parse_lowerdirs(instance_env)
    allowed_roots = [(r / sub).resolve() for sub in LOWERDIR_ALLOWLIST]
    canonical_lowerdirs = [str(canonical_under(allowed_roots, Path(p))) for p in raw_lowerdirs]

    upper = (runtime_name_dir / "upper").resolve(strict=True)
    work = (runtime_name_dir / "work").resolve(strict=True)
    merged = merged_for_check
    for label, path in (("upper", upper), ("work", work), ("merged", merged)):
        if path.parent != runtime_name_dir:
            die(f"{label} resolved outside runtime/{name}: {path}")

    assert_no_fuse_xattrs(upper)

    options = f"lowerdir={':'.join(canonical_lowerdirs)},upperdir={upper},workdir={work}"
    argv = [
        MOUNT_BIN,
        "-t", "overlay",
        "overlay",
        "-o", options,
        str(merged),
    ]
    exec_or_print(argv)


def cmd_umount(name: str) -> None:
    name = validate_name(name)
    r = root()
    runtime_name_dir = (r / "runtime" / name).resolve(strict=True)
    merged_path = runtime_name_dir / "merged"
    work_inner = runtime_name_dir / "work" / "work"

    overlay_umount_argv = [
        UMOUNT_BIN,
        # Resolve only if it exists; PRINT_ONLY tests always pre-create it.
        str(merged_path.resolve(strict=True) if merged_path.exists() else merged_path),
    ]
    if os.environ.get("LEFT4ME_OVERLAY_PRINT_ONLY") == "1":
        _print_argv(overlay_umount_argv)
        sys.exit(0)

    if merged_path.exists():
        merged = merged_path.resolve(strict=True)
        if merged.parent != runtime_name_dir:
            die(f"merged resolved outside runtime/{name}: {merged}")
        # Idempotency: only umount if currently a mount point. Mirrors
        # cmd_mount's symmetric check; a redundant cleanup pass — or a
        # call after a partial _purge_instance — must be a no-op.
        #
        # No retry loop here: with the helper running in PID 1's mount
        # namespace (via the unit-level `nsenter --mount=/proc/1/ns/mnt`
        # in ExecStopPost), it holds no reference to the unit's
        # per-service mount namespace, so the cgroup-empty → namespace
        # reaped → umount-clears sequence happens without any race
        # window for us to ride out. EBUSY here is a real error.
        if os.path.ismount(merged):
            subprocess.run(overlay_umount_argv, check=True)

    # Kernel-overlayfs creates work_inner during mount with root:root mode
    # 0/0. After unmount it's an orphan that the unit's User= (left4me)
    # cannot traverse via shutil.rmtree, so reset/delete in instances.py
    # blows up with EACCES on `runtime/<name>/work/work`. The helper is
    # the only code path with root that knows about this directory, so
    # the cleanup belongs here. Safe to nuke — the kernel re-creates it
    # on the next mount. Run unconditionally — covers both "we just
    # unmounted" and "previous teardown didn't finish" cases.
    if work_inner.exists():
        shutil.rmtree(work_inner)


def main(argv: list[str]) -> None:
    if len(argv) != 3 or argv[1] not in ("mount", "umount"):
        sys.stderr.write("usage: left4me-overlay mount|umount <name>\n")
        sys.exit(2)
    if argv[1] == "mount":
        cmd_mount(argv[2])
    else:
        cmd_umount(argv[2])


if __name__ == "__main__":
    main(sys.argv)
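
Review note: the shell wrappers in this change also receive an instance name, so they could apply the same rule as `NAME_RE` above before building a unit name. The sketch below is a POSIX sh rendering of that pattern (`^[a-z0-9][a-z0-9_-]{0,63}$`). The function name `valid_instance_name` is an assumption, and nothing in the diff shows the wrappers using this exact check.

```sh
#!/bin/sh
# Bracket ranges such as [a-z] are locale-sensitive in some shells;
# pin the C locale so they mean ASCII lowercase only.
LC_ALL=C
export LC_ALL

# valid_instance_name <name>
# Succeeds when the name starts with [a-z0-9], continues with [a-z0-9_-],
# and is 1 to 64 characters long.
valid_instance_name() {
    name=$1
    [ "${#name}" -ge 1 ] && [ "${#name}" -le 64 ] || return 1
    case "$name" in
        [!a-z0-9]*) return 1 ;;       # bad first character
        *[!a-z0-9_-]*) return 1 ;;    # bad character anywhere
    esac
    return 0
}

# Example: refuse to build a unit name from an unchecked argument.
if valid_instance_name "alpha-1"; then
    echo "left4me-server@alpha-1.service"
fi
```

Rejecting `/`, `.`, and whitespace at this layer means a hostile name can never reach a path join or a `systemctl` argument, which is the same property the Python helper gets from `fullmatch`.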

@ -1,81 +0,0 @@
#!/bin/bash
# Privileged sandbox launcher for left4me script overlays.
#
# Invoked via sudo by the web user with two arguments:
# <overlay_id> numeric overlay id; bind-mounts /var/lib/left4me/overlays/<id>
# read-write at /overlay inside the sandbox.
# <script_path> absolute path to a bash file already written by the web app;
# bind-mounted read-only at /script.sh inside the sandbox.
#
# The script runs as a transient systemd .service with the full hardening
# surface: cgroup limits + walltime kill, NoNewPrivileges, ProtectSystem,
# ProtectHome, kernel-tunable / -module / -log protection, namespace
# restriction, address-family restriction, capability bounding (empty),
# seccomp filter (@system-service @network-io), MemoryDenyWriteExecute,
# LockPersonality, RestrictSUIDSGID. Network namespace is *not* restricted —
# scripts must reach the public internet to download workshop / l4d2center
# / cedapug content. PID namespace is shared with the host (no
# PrivatePID= directive in systemd); host PIDs are visible via /proc.
# Same-uid attack surface (the sandbox runs as left4me, so do the
# gameservers and the web app) is covered by the hardening profile plus
# system-wide kernel.yama.ptrace_scope=2 — see
# docs/superpowers/specs/2026-05-15-hardening-threat-model.md.
set -euo pipefail
# Self-wrap into PID 1's mount namespace before doing anything mount-related.
# The web app's left4me-web.service has PrivateTmp=true, which gives it a
# private mount namespace. When the worker invokes us via sudo, we inherit
# that namespace; our `mount --bind` would land there. systemd-run below
# spawns transient units in PID 1's namespace (where they don't see the
# private bind), so the sandbox would bind onto an empty staging dir and
# permission-deny on every write. The sentinel env var avoids an exec loop.
if [[ "${L4D2_SANDBOX_IN_PID1_MNT_NS:-}" != "1" ]]; then
    exec env L4D2_SANDBOX_IN_PID1_MNT_NS=1 \
        /usr/bin/nsenter --mount=/proc/1/ns/mnt -- "$0" "$@"
fi
[[ $# -eq 2 ]] || { echo "usage: $0 <overlay_id> <script>" >&2; exit 64; }
OVERLAY_ID=$1
SCRIPT=$2
[[ "$OVERLAY_ID" =~ ^[0-9]+$ ]] || { echo "bad overlay id" >&2; exit 64; }
OVERLAY_DIR=/var/lib/left4me/overlays/$OVERLAY_ID
[[ -d $OVERLAY_DIR ]] || { echo "no overlay dir at $OVERLAY_DIR" >&2; exit 65; }
[[ -f $SCRIPT ]] || { echo "no script at $SCRIPT" >&2; exit 65; }
if [[ "${LEFT4ME_SCRIPT_SANDBOX_DRY_RUN:-}" == "1" ]]; then
echo "DRY RUN: overlay_id=$OVERLAY_ID script=$SCRIPT overlay_dir=$OVERLAY_DIR"
exit 0
fi
SCRIPT_RC=0
systemd-run --quiet --collect --wait --pipe \
--unit="left4me-script-${OVERLAY_ID}-$$" \
--slice=l4d2-build.slice \
-p OOMScoreAdjust=500 \
-p User=left4me -p Group=left4me \
-p UMask=0022 \
-p NoNewPrivileges=yes \
-p ProtectSystem=strict -p ProtectHome=yes \
-p PrivateTmp=yes -p PrivateDevices=yes -p PrivateIPC=yes \
-p ProtectKernelTunables=yes -p ProtectKernelModules=yes \
-p ProtectKernelLogs=yes -p ProtectControlGroups=yes \
-p RestrictNamespaces=yes \
-p RestrictAddressFamilies="AF_INET AF_INET6 AF_UNIX" \
-p RestrictSUIDSGID=yes -p LockPersonality=yes \
-p MemoryDenyWriteExecute=yes \
-p SystemCallFilter="@system-service @network-io" \
-p SystemCallArchitectures=native \
-p CapabilityBoundingSet= -p AmbientCapabilities= \
-p IPAddressDeny="127.0.0.0/8 ::1/128 169.254.0.0/16 fe80::/10 224.0.0.0/4 ff00::/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 100.64.0.0/10 fc00::/7" \
-p TemporaryFileSystem="/etc /var/lib" \
-p BindReadOnlyPaths="/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives ${SCRIPT}:/script.sh" \
-p BindPaths="${OVERLAY_DIR}:/overlay" \
-p WorkingDirectory=/overlay \
-p Environment="HOME=/tmp PATH=/usr/bin:/usr/sbin OVERLAY=/overlay" \
-p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 \
-p CPUQuota=200% -p RuntimeMaxSec=3600 \
-- /bin/bash /script.sh || SCRIPT_RC=$?
exit $SCRIPT_RC
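
The argument checks the helper performs before it reaches `systemd-run` can be sketched in Python as follows. This is a minimal illustration, not code from the repository: the function name `validate_sandbox_args` and the `overlay_root` parameter are hypothetical, and only the numeric-id regex, the exit codes 64/65 and the overlay path are taken from the script above.

```python
import re
from pathlib import Path

# Path and exit codes mirror the bash helper; everything else is illustrative.
OVERLAY_ROOT = Path("/var/lib/left4me/overlays")
EX_USAGE = 64    # wrong argument count or malformed overlay id
EX_DATAERR = 65  # overlay dir or script file missing


def validate_sandbox_args(argv, overlay_root=OVERLAY_ROOT):
    """Return (exit_code, message); exit_code 0 means the arguments pass."""
    if len(argv) != 2:
        return EX_USAGE, "usage: <overlay_id> <script>"
    overlay_id, script = argv
    # fullmatch rather than match with '$': Python's '$' also matches just
    # before a trailing newline, which bash's [[ =~ ^[0-9]+$ ]] rejects.
    if re.fullmatch(r"[0-9]+", overlay_id) is None:
        return EX_USAGE, "bad overlay id"
    overlay_dir = Path(overlay_root) / overlay_id
    if not overlay_dir.is_dir():
        return EX_DATAERR, f"no overlay dir at {overlay_dir}"
    if not Path(script).is_file():
        return EX_DATAERR, f"no script at {script}"
    return 0, "ok"
```

Passing `overlay_root` as a parameter is only there so the check can be exercised against a temporary directory; the real helper hard-codes the path.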

View file

@ -1,17 +0,0 @@
#!/bin/sh
# Run l4d2web flask CLI commands as the left4me user with the deploy env loaded.
# Usage: left4me <flask-subcommand> [args...]
# Examples:
# left4me create-user alice --admin
# left4me seed-script-overlays /opt/left4me/src/examples/script-overlays
# left4me routes
set -eu
exec sudo -u left4me sh -c '
set -a
. /etc/left4me/host.env
. /etc/left4me/web.env
set +a
export JOB_WORKER_ENABLED=false
export PYTHONPATH=/opt/left4me/src
exec /var/lib/left4me/.venv/bin/flask --app l4d2web.app:create_app "$@"
' sh "$@"

View file

@ -1,36 +0,0 @@
"""Shared fixtures and path constants for `deploy/scripts/tests/`."""
import os
from pathlib import Path
ROOT = Path(__file__).resolve().parents[3]
SCRIPTS = ROOT / "deploy" / "scripts"
LIBEXEC = SCRIPTS / "libexec"
SBIN = SCRIPTS / "sbin"
# `deploy/` is also the parent of the scripts/ tree. The sudoers example
# lives at `deploy/files/etc/sudoers.d/left4me` and is the canonical
# statement of which paths sudo grants to the `left4me` uid.
# `deploy/scripts/tests/test_sudoers_grants.py` reads it from there.
DEPLOY = ROOT / "deploy"
def fake_command(tmp_path, command_name):
"""Drop a no-op stub of `command_name` into `tmp_path`. Returns the
marker file the stub writes its args to, so tests can assert that the
helper rejected bad input before invoking the real command.
"""
marker = tmp_path / f"{command_name}.args"
command = tmp_path / command_name
command.write_text(f"#!/bin/sh\nprintf '%s\\n' \"$*\" > '{marker}'\nexit 0\n")
command.chmod(0o755)
return marker
def env_with_fake_commands(tmp_path):
"""Build an environment that prepends `tmp_path` onto PATH so helpers
find the fake commands first.
"""
env = os.environ.copy()
env["PATH"] = f"{tmp_path}{os.pathsep}{env.get('PATH', '')}"
return env
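
A short sketch of how the two fixtures above combine in practice, assuming a POSIX `sh` is available. `fake_command` is copied from the conftest; `run_with_stub` is a hypothetical helper added only for this illustration. It shows that a command resolved through the modified PATH hits the stub, and that the stub records its arguments in the marker file.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def fake_command(tmp_path, command_name):
    """Same stub writer as in conftest.py: records "$*" into a marker file."""
    marker = tmp_path / f"{command_name}.args"
    command = tmp_path / command_name
    command.write_text(f"#!/bin/sh\nprintf '%s\\n' \"$*\" > '{marker}'\nexit 0\n")
    command.chmod(0o755)
    return marker


def run_with_stub(command_name, args):
    """Hypothetical helper: run `command_name` via PATH and return what the stub saw."""
    with tempfile.TemporaryDirectory() as d:
        tmp = Path(d)
        marker = fake_command(tmp, command_name)
        env = os.environ.copy()
        env["PATH"] = f"{tmp}{os.pathsep}{env.get('PATH', '')}"
        # Let the shell do the PATH lookup, as the real helpers would.
        subprocess.run(["sh", "-c", f'{command_name} "$@"', "sh", *args],
                       env=env, check=True)
        return marker.read_text()
```

The tests in this directory rely on the inverse observation: if a helper rejects its input early, the marker file is never created.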

View file

@ -1,15 +0,0 @@
from conftest import LIBEXEC
SYSTEMCTL_HELPER = LIBEXEC / "left4me-systemctl"
JOURNALCTL_HELPER = LIBEXEC / "left4me-journalctl"
def test_helpers_use_fixed_system_tool_paths_not_sudo_path():
systemctl = SYSTEMCTL_HELPER.read_text()
journalctl = JOURNALCTL_HELPER.read_text()
assert "command -v systemctl" not in systemctl
assert "command -v journalctl" not in journalctl
assert "/bin/systemctl" in systemctl or "/usr/bin/systemctl" in systemctl
assert "/bin/journalctl" in journalctl or "/usr/bin/journalctl" in journalctl

View file

@ -1,38 +0,0 @@
import subprocess
from conftest import LIBEXEC, env_with_fake_commands, fake_command
JOURNALCTL_HELPER = LIBEXEC / "left4me-journalctl"
def test_journalctl_helper_passes_shell_syntax_check_and_rejects_bad_args(tmp_path):
subprocess.run(["sh", "-n", str(JOURNALCTL_HELPER)], check=True)
marker = fake_command(tmp_path, "journalctl")
for args in [
["../evil", "--lines", "25", "--no-follow"],
["alpha", "--bad", "25", "--no-follow"],
["alpha", "--lines", "not-number", "--no-follow"],
["alpha", "--lines", "25", "--bad-follow"],
["bad/name", "--lines", "25", "--no-follow"],
]:
result = subprocess.run(
["sh", str(JOURNALCTL_HELPER), *args],
env=env_with_fake_commands(tmp_path),
check=False,
)
assert result.returncode != 0, f"helper accepted bad args: {args!r}"
assert not marker.exists(), f"helper invoked journalctl for: {args!r}"
script = JOURNALCTL_HELPER.read_text()
assert 'unit="left4me-server@${name}.service"' in script
# Anchors `--since` to the unit's most recent start so the panel shows
# the current run (and any post-restart lines until reload).
assert 'InactiveExitTimestamp' in script
assert 'LC_ALL=C' in script
assert 'exec "$journalctl" -u "$unit" --since "$start_time" -n "$lines" -o cat "$follow_arg"' in script
assert 'exec "$journalctl" -u "$unit" --since "$start_time" -n "$lines" -o cat' in script
# Never-started fallback keeps the legacy unit-only form.
assert 'exec "$journalctl" -u "$unit" -n "$lines" -o cat "$follow_arg"' in script
assert 'exec "$journalctl" -u "$unit" -n "$lines" -o cat' in script

View file

@ -1,32 +0,0 @@
from conftest import LIBEXEC
OVERLAY_HELPER = LIBEXEC / "left4me-overlay"
def test_overlay_helper_is_python_with_strict_validation():
text = OVERLAY_HELPER.read_text()
assert text.startswith("#!/usr/bin/python3")
# Validation surface
assert "NAME_RE = re.compile" in text
assert "LOWERDIR_ALLOWLIST" in text
assert "user.fuseoverlayfs." in text
assert "MAX_LOWERDIRS = 500" in text
# Mounts via PID 1's mount namespace
assert "/proc/1/ns/mnt" in text
assert "nsenter" in text
# Verbs are mount and umount (not unmount)
assert '"mount"' in text and '"umount"' in text
assert '"unmount"' not in text
def test_overlay_helper_mount_is_idempotent_when_already_mounted():
"""ExecStartPre runs on every Restart=on-failure cycle. If a previous
start mounted successfully but ExecStart failed afterwards, the next
ExecStartPre would re-mount on top -- which fails. The helper must
short-circuit when merged is already a mount point.
"""
text = OVERLAY_HELPER.read_text()
# Two ismount checks now: one in cmd_mount (skip if mounted),
# one in cmd_umount (skip if not mounted).
assert text.count("os.path.ismount") >= 2

View file

@ -1,146 +0,0 @@
import subprocess
from conftest import LIBEXEC
SCRIPT_SANDBOX_HELPER = LIBEXEC / "left4me-script-sandbox"
def test_script_sandbox_helper_present():
assert SCRIPT_SANDBOX_HELPER.is_file()
assert SCRIPT_SANDBOX_HELPER.read_text().startswith("#!/bin/bash")
mode = SCRIPT_SANDBOX_HELPER.stat().st_mode & 0o777
assert mode == 0o755, f"expected 0755, got {oct(mode)}"
def test_script_sandbox_helper_passes_shell_syntax_check():
subprocess.run(["bash", "-n", str(SCRIPT_SANDBOX_HELPER)], check=True)
def test_script_sandbox_helper_invokes_systemd_run_with_hardening():
text = SCRIPT_SANDBOX_HELPER.read_text()
# systemd-run service mode (no --scope), with synchronous I/O to caller.
assert "systemd-run" in text
assert "--scope" not in text, "v2 uses transient service units, not scopes"
assert "--pipe" in text
assert "--wait" in text
assert "--collect" in text
assert "--unit=" in text
# No bwrap.
assert "bwrap" not in text
assert "bubblewrap" not in text
# UID drop via systemd directives.
assert "User=left4me" in text
assert "Group=left4me" in text
# Cgroup limits unchanged from v1.
assert "MemoryMax=4G" in text
assert "MemorySwapMax=0" in text
assert "TasksMax=512" in text
assert "CPUQuota=200%" in text
assert "RuntimeMaxSec=3600" in text
# Hardening directives that v1 (scope mode) couldn't carry.
assert "NoNewPrivileges=yes" in text
assert "ProtectSystem=strict" in text
assert "ProtectHome=yes" in text
assert "PrivateTmp=yes" in text
assert "PrivateDevices=yes" in text
assert "PrivateIPC=yes" in text
assert "ProtectKernelTunables=yes" in text
assert "ProtectKernelModules=yes" in text
assert "ProtectKernelLogs=yes" in text
assert "ProtectControlGroups=yes" in text
assert "RestrictNamespaces=yes" in text
assert "RestrictSUIDSGID=yes" in text
assert "LockPersonality=yes" in text
assert "MemoryDenyWriteExecute=yes" in text
assert "SystemCallFilter=" in text
assert "@system-service" in text
assert "@network-io" in text
assert "CapabilityBoundingSet=" in text
assert "AmbientCapabilities=" in text
assert 'RestrictAddressFamilies="AF_INET AF_INET6 AF_UNIX"' in text
# Network namespace stays shared with host.
assert "PrivateNetwork=" not in text
# Mount setup: /etc and /var/lib masked with tmpfs; selective binds back.
assert 'TemporaryFileSystem="/etc /var/lib"' in text
assert "BindReadOnlyPaths=" in text
# The resolv.conf bind points at the sandbox-only file (not the host's
# /etc/resolv.conf, which typically references a private-IP DNS server
# that IPAddressDeny= blocks).
assert "/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf" in text
assert "/etc/ssl" in text
assert "/etc/ca-certificates" in text
assert "/etc/nsswitch.conf" in text
assert "/etc/alternatives" in text
assert "${SCRIPT}:/script.sh" in text
assert 'BindPaths="${OVERLAY_DIR}:/overlay"' in text
# IP egress filter: deny localhost / RFC1918 / link-local / multicast /
# CGNAT / ULA; every other (public) address stays reachable.
# IPAddressDeny alone, with no IPAddressAllow=any. systemd checks the
# allow list first, so IPAddressAllow=any would match every address and
# override all deny entries (observed empirically on this host as well).
# With only Deny set, non-listed addresses fall through to the default
# "allow".
assert "IPAddressDeny=" in text
assert "IPAddressAllow=any" not in text
# Explicit CIDRs — systemd-run's -p parser doesn't accept the
# `localhost` / `link-local` / `multicast` shorthand keywords that
# work in unit files (only the full strings parse).
for token in (
"127.0.0.0/8",
"::1/128",
"169.254.0.0/16",
"fe80::/10",
"224.0.0.0/4",
"ff00::/8",
"10.0.0.0/8",
"172.16.0.0/12",
"192.168.0.0/16",
"100.64.0.0/10",
"fc00::/7",
):
assert token in text, f"missing {token!r} in IPAddressDeny set"
def test_script_sandbox_in_build_slice_with_oom_adjust():
text = SCRIPT_SANDBOX_HELPER.read_text()
# Put the transient unit in the low-weight build slice so it yields to
# game-server instances under CPU/IO contention.
assert "--slice=l4d2-build.slice" in text
# Sandbox dies first if the host hits memory pressure; servers
# (OOMScoreAdjust=-200) survive.
assert "-p OOMScoreAdjust=500" in text
def test_script_sandbox_helper_validates_overlay_id():
text = SCRIPT_SANDBOX_HELPER.read_text()
# Numeric-only overlay id
assert '[[ "$OVERLAY_ID" =~ ^[0-9]+$ ]]' in text
# Overlay dir must exist
assert "/var/lib/left4me/overlays/" in text
assert "[[ -d $OVERLAY_DIR ]]" in text
# Script path must exist
assert "[[ -f $SCRIPT ]]" in text
def test_script_sandbox_helper_dry_run_mode(tmp_path):
overlay_root = tmp_path / "var/lib/left4me/overlays/42"
overlay_root.mkdir(parents=True)
fake_script = tmp_path / "fake.sh"
fake_script.write_text("echo hi")
helper_text = SCRIPT_SANDBOX_HELPER.read_text()
# We can't actually exec this without root; just verify the dry-run
# guard short-circuits before systemd-run runs.
assert 'LEFT4ME_SCRIPT_SANDBOX_DRY_RUN' in helper_text
assert 'exit 0' in helper_text
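
The test above only checks that each CIDR string appears in the helper. What the deny set means for a given destination can be sketched with the standard `ipaddress` module. This is an illustration under the assumption that an address is blocked exactly when it falls inside one of the listed networks; the names `DENY_NETWORKS` and `is_denied` are hypothetical and do not exist in the repository.

```python
import ipaddress

# CIDRs copied from the IPAddressDeny= value in left4me-script-sandbox.
DENY_CIDRS = (
    "127.0.0.0/8 ::1/128 169.254.0.0/16 fe80::/10 224.0.0.0/4 ff00::/8 "
    "10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 100.64.0.0/10 fc00::/7"
).split()
DENY_NETWORKS = [ipaddress.ip_network(cidr) for cidr in DENY_CIDRS]


def is_denied(address):
    """True if `address` falls inside any network of the deny set."""
    ip = ipaddress.ip_address(address)
    return any(
        ip.version == net.version and ip in net
        for net in DENY_NETWORKS
    )
```

For example, `is_denied("192.168.1.10")` is true while `is_denied("1.1.1.1")` is false, which is why the sandbox needs its own resolv.conf pointing at public resolvers.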

View file

@ -1,37 +0,0 @@
"""Audit the script→sudoers contract.
The sudoers file in `deploy/files/etc/sudoers.d/left4me` is a reference
example; ckn-bw ships its own verbatim copy under
`bundles/left4me/files/etc/sudoers.d/left4me`. The two are expected to
match. This test lives under `deploy/scripts/tests/` because the contract being
audited is about *scripts* (which paths the `left4me` uid can sudo into).
"""
from conftest import DEPLOY
SUDOERS = DEPLOY / "files/etc/sudoers.d/left4me"
def test_sudoers_allows_only_left4me_helpers_not_raw_system_tools():
sudoers = SUDOERS.read_text()
assert (
"left4me ALL=(root) NOPASSWD: "
"/usr/local/libexec/left4me/left4me-systemctl *"
) in sudoers
assert (
"left4me ALL=(root) NOPASSWD: "
"/usr/local/libexec/left4me/left4me-journalctl *"
) in sudoers
assert "/usr/local/libexec/left4me/left4me-overlay mount *" in sudoers
assert "/usr/local/libexec/left4me/left4me-overlay umount *" in sudoers
assert (
"left4me ALL=(root) NOPASSWD: "
"/usr/local/libexec/left4me/left4me-script-sandbox"
) in sudoers
assert "/bin/systemctl" not in sudoers
assert "/usr/bin/systemctl" not in sudoers
assert "/bin/journalctl" not in sudoers
assert "/usr/bin/journalctl" not in sudoers
assert "/bin/mount" not in sudoers
assert "/bin/umount" not in sudoers

View file

@ -1,39 +0,0 @@
import subprocess
from conftest import LIBEXEC, env_with_fake_commands, fake_command
SYSTEMCTL_HELPER = LIBEXEC / "left4me-systemctl"
def test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args(tmp_path):
subprocess.run(["sh", "-n", str(SYSTEMCTL_HELPER)], check=True)
marker = fake_command(tmp_path, "systemctl")
for args in [
["bad/action", "alpha"],
# `start` and `stop` are no longer accepted verbs — the lifecycle now
# uses `enable`/`disable` for reboot survival via WantedBy= symlinks.
["start", "alpha"],
["stop", "alpha"],
["enable", ""],
["enable", ".hidden"],
["enable", "bad..name"],
["enable", "bad/name"],
["enable", "bad\\name"],
["enable", "bad name"],
]:
result = subprocess.run(
["sh", str(SYSTEMCTL_HELPER), *args],
env=env_with_fake_commands(tmp_path),
check=False,
)
assert result.returncode != 0
assert not marker.exists()
script = SYSTEMCTL_HELPER.read_text()
assert 'unit="left4me-server@${name}.service"' in script
assert 'enable) exec "$systemctl" enable --now "$unit"' in script
assert 'disable) exec "$systemctl" disable --now "$unit"' in script
assert "--property=ActiveState" in script
assert "--property=SubState" in script

View file

@ -1,10 +1,3 @@
DATABASE_URL=sqlite:////var/lib/left4me/left4me.db
SECRET_KEY=replace-with-generated-secret
JOB_WORKER_THREADS=4
# Steam Web API key for ISteamUser/GetPlayerSummaries — used by the
# live-state poller to resolve player Steam IDs to persona names + avatars
# in the server detail panel. Free at https://steamcommunity.com/dev/apikey.
# Optional: if empty, the live-state panel still shows counts/map and the
# in-game name from RCON, just with placeholder avatars.
STEAM_WEB_API_KEY=

View file

@ -0,0 +1,187 @@
import os
import subprocess
from pathlib import Path
ROOT = Path(__file__).resolve().parents[2]
DEPLOY = ROOT / "deploy"
WEB_UNIT = DEPLOY / "files/usr/local/lib/systemd/system/left4me-web.service"
SERVER_UNIT = DEPLOY / "files/usr/local/lib/systemd/system/left4me-server@.service"
SYSTEMCTL_HELPER = DEPLOY / "files/usr/local/libexec/left4me/left4me-systemctl"
JOURNALCTL_HELPER = DEPLOY / "files/usr/local/libexec/left4me/left4me-journalctl"
SUDOERS = DEPLOY / "files/etc/sudoers.d/left4me"
HOST_ENV = DEPLOY / "templates/etc/left4me/host.env"
WEB_ENV_TEMPLATE = DEPLOY / "templates/etc/left4me/web.env.template"
DEPLOY_SCRIPT = DEPLOY / "deploy-test-server.sh"
def test_global_unit_files_exist_at_product_level_paths():
assert WEB_UNIT.is_file()
assert SERVER_UNIT.is_file()
def test_web_unit_contains_required_runtime_contract():
unit = WEB_UNIT.read_text()
assert "User=left4me" in unit
assert "Group=left4me" in unit
assert "WorkingDirectory=/opt/left4me" in unit
assert "Environment=PATH=/opt/left4me/.venv/bin:" in unit
assert "EnvironmentFile=/etc/left4me/host.env" in unit
assert "EnvironmentFile=/etc/left4me/web.env" in unit
assert "ExecStart=/opt/left4me/.venv/bin/gunicorn" in unit
assert "--workers 1" in unit
assert "NoNewPrivileges=true" in unit
assert "PrivateTmp=true" in unit
assert "ProtectSystem=full" in unit
assert "ReadWritePaths=/var/lib/left4me" in unit
def test_server_unit_contains_required_runtime_contract():
unit = SERVER_UNIT.read_text()
assert "User=left4me" in unit
assert "Group=left4me" in unit
assert "EnvironmentFile=/etc/left4me/host.env" in unit
assert "EnvironmentFile=/var/lib/left4me/instances/%i/instance.env" in unit
assert "WorkingDirectory=/var/lib/left4me/runtime/%i/merged/left4dead2" in unit
assert "ExecStart=/var/lib/left4me/installation/srcds_run" in unit
assert "$L4D2_ARGS" in unit
assert "${L4D2_ARGS}" not in unit
assert "NoNewPrivileges=true" in unit
assert "PrivateTmp=true" in unit
assert "PrivateDevices=true" in unit
assert "ProtectHome=true" in unit
assert "ProtectSystem=strict" in unit
assert "ReadOnlyPaths=/var/lib/left4me/installation /var/lib/left4me/overlays" in unit
assert "ReadWritePaths=/var/lib/left4me/runtime/%i" in unit
assert "RestrictSUIDSGID=true" in unit
assert "LockPersonality=true" in unit
def _fake_command(tmp_path, command_name):
marker = tmp_path / f"{command_name}.args"
command = tmp_path / command_name
command.write_text(f"#!/bin/sh\nprintf '%s\n' \"$*\" > '{marker}'\nexit 0\n")
command.chmod(0o755)
return marker
def _env_with_fake_commands(tmp_path):
env = os.environ.copy()
env["PATH"] = f"{tmp_path}{os.pathsep}{env.get('PATH', '')}"
return env
def test_helpers_use_fixed_system_tool_paths_not_sudo_path():
systemctl = SYSTEMCTL_HELPER.read_text()
journalctl = JOURNALCTL_HELPER.read_text()
assert "command -v systemctl" not in systemctl
assert "command -v journalctl" not in journalctl
assert "/bin/systemctl" in systemctl or "/usr/bin/systemctl" in systemctl
assert "/bin/journalctl" in journalctl or "/usr/bin/journalctl" in journalctl
def test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args(tmp_path):
subprocess.run(["sh", "-n", str(SYSTEMCTL_HELPER)], check=True)
marker = _fake_command(tmp_path, "systemctl")
for args in [
["bad/action", "alpha"],
["start", ""],
["start", ".hidden"],
["start", "bad..name"],
["start", "bad/name"],
["start", "bad\\name"],
["start", "bad name"],
]:
result = subprocess.run(["sh", str(SYSTEMCTL_HELPER), *args], env=_env_with_fake_commands(tmp_path), check=False)
assert result.returncode != 0
assert not marker.exists()
script = SYSTEMCTL_HELPER.read_text()
assert 'unit="left4me-server@${name}.service"' in script
assert 'start) exec "$systemctl" start "$unit"' in script
assert 'stop) exec "$systemctl" stop "$unit"' in script
assert "--property=ActiveState" in script
assert "--property=SubState" in script
def test_journalctl_helper_passes_shell_syntax_check_and_rejects_bad_args(tmp_path):
subprocess.run(["sh", "-n", str(JOURNALCTL_HELPER)], check=True)
marker = _fake_command(tmp_path, "journalctl")
for args in [
["../evil", "--lines", "25", "--no-follow"],
["alpha", "--bad", "25", "--no-follow"],
["alpha", "--lines", "not-number", "--no-follow"],
["alpha", "--lines", "25", "--bad-follow"],
["bad/name", "--lines", "25", "--no-follow"],
]:
result = subprocess.run(["sh", str(JOURNALCTL_HELPER), *args], env=_env_with_fake_commands(tmp_path), check=False)
assert result.returncode != 0
assert not marker.exists()
script = JOURNALCTL_HELPER.read_text()
assert 'unit="left4me-server@${name}.service"' in script
assert 'exec "$journalctl" -u "$unit" -n "$lines" -o cat "$follow_arg"' in script
assert 'exec "$journalctl" -u "$unit" -n "$lines" -o cat' in script
def test_sudoers_allows_only_left4me_helpers_not_raw_system_tools():
sudoers = SUDOERS.read_text()
assert (
"left4me ALL=(root) NOPASSWD: "
"/usr/local/libexec/left4me/left4me-systemctl *"
) in sudoers
assert (
"left4me ALL=(root) NOPASSWD: "
"/usr/local/libexec/left4me/left4me-journalctl *"
) in sudoers
assert "/bin/systemctl" not in sudoers
assert "/usr/bin/systemctl" not in sudoers
assert "/bin/journalctl" not in sudoers
assert "/usr/bin/journalctl" not in sudoers
def test_env_templates_contain_required_defaults():
host_env = HOST_ENV.read_text()
assert "Deployment units use fixed /var/lib/left4me paths" in host_env
assert host_env.endswith("LEFT4ME_ROOT=/var/lib/left4me\n")
assert WEB_ENV_TEMPLATE.read_text() == (
"DATABASE_URL=sqlite:////var/lib/left4me/left4me.db\n"
"SECRET_KEY=replace-with-generated-secret\n"
"JOB_WORKER_THREADS=4\n"
)
def test_deploy_script_has_safe_defaults_and_preserves_state() -> None:
script = DEPLOY_SCRIPT.read_text()
assert "useradd --system --home-dir /var/lib/left4me" in script
assert "/var/lib/left4me/installation" in script
assert "/var/lib/left4me/overlays" in script
assert "/var/lib/left4me/instances" in script
assert "/var/lib/left4me/runtime" in script
assert "tar" in script
assert "--exclude .venv" in script
assert "pip install -e /opt/left4me/l4d2host -e /opt/left4me/l4d2web" in script
assert "systemctl enable --now left4me-web.service" in script
assert "for attempt in" in script
assert "/opt/left4me/.venv" in script
assert "visudo -cf /etc/sudoers.d/left4me" in script
assert "if [ ! -f /etc/left4me/web.env ]" in script
assert ". /etc/left4me/web.env\n" not in script
assert "run_left4me_with_env" in script
assert "LEFT4ME_ADMIN_USERNAME" in script
assert "LEFT4ME_ADMIN_PASSWORD" in script
assert "user already exists" in script
assert "deploy/files" in script
def test_deploy_script_shell_syntax() -> None:
subprocess.run(["sh", "-n", str(DEPLOY_SCRIPT)], check=True)

View file

@ -1,330 +0,0 @@
"""Lockdown tests for the curated examples kept under `deploy/files/`.
`deploy/` is reference material. The production units are emitted by
ckn-bw's `systemd_units` reactor in `bundles/left4me/metadata.py`;
when reactor output drifts intentionally, update these examples to match.
"""
from pathlib import Path
ROOT = Path(__file__).resolve().parents[2]
DEPLOY = ROOT / "deploy"
WEB_UNIT = DEPLOY / "files/usr/local/lib/systemd/system/left4me-web.service"
SERVER_UNIT = DEPLOY / "files/usr/local/lib/systemd/system/left4me-server@.service"
GAME_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-game.slice"
BUILD_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-build.slice"
SYSCTL_CONF = DEPLOY / "files/etc/sysctl.d/99-left4me.conf"
SANDBOX_RESOLV_CONF = DEPLOY / "files/etc/left4me/sandbox-resolv.conf"
HOST_ENV = DEPLOY / "templates/etc/left4me/host.env"
WEB_ENV_TEMPLATE = DEPLOY / "templates/etc/left4me/web.env.template"
WEB_HARDENING_DROPIN = DEPLOY / "files/etc/systemd/system/left4me-web.service.d/10-hardening.conf"
SERVER_HARDENING_DROPIN = DEPLOY / "files/etc/systemd/system/left4me-server@.service.d/10-hardening.conf"
def test_global_unit_files_exist_at_product_level_paths():
assert WEB_UNIT.is_file()
assert SERVER_UNIT.is_file()
def test_web_unit_contains_required_runtime_contract():
unit = WEB_UNIT.read_text()
assert "User=left4me" in unit
assert "Group=left4me" in unit
assert "WorkingDirectory=/opt/left4me" in unit
assert "PATH=/var/lib/left4me/.venv/bin:" in unit
assert "EnvironmentFile=/etc/left4me/host.env" in unit
assert "EnvironmentFile=/etc/left4me/web.env" in unit
assert "ExecStart=/var/lib/left4me/.venv/bin/gunicorn" in unit
assert "--workers 1" in unit
assert "--threads 32" in unit
# NoNewPrivileges must remain unset because sudo (used by the overlay,
# systemctl and journalctl helpers) is setuid.
assert "NoNewPrivileges=true" not in unit
assert "ReadWritePaths=/var/lib/left4me" in unit
# Mounts now happen in PID 1's namespace via the left4me-overlay helper,
# so MountFlags propagation is irrelevant — and the previous assumption
# that MountFlags=shared made it work was incorrect.
assert "MountFlags=" not in unit
# Hardening directives belong in the drop-in; must not appear in the base unit.
assert "PrivateTmp=" not in unit
assert "ProtectSystem=" not in unit
def test_server_unit_contains_required_runtime_contract():
unit = SERVER_UNIT.read_text()
assert "User=left4me" in unit
assert "Group=left4me" in unit
assert "EnvironmentFile=/etc/left4me/host.env" in unit
assert "EnvironmentFile=/var/lib/left4me/instances/%i/instance.env" in unit
# `-` prefix: chdir failure is non-fatal so ExecStartPre can run the
# mount helper before the merged dir exists. ExecStart re-applies and
# finds the dir once the mount has landed.
assert "WorkingDirectory=-/var/lib/left4me/runtime/%i/merged/left4dead2" in unit
# ExecStart must invoke srcds_run from the *merged* overlay tree, not
# from installation/. srcds_run cds to its own dirname; if we point at
# installation/, the engine reads gameinfo.txt and addons from the lower
# layer and never sees overlay plugins (Metamod/SourceMod) or cfgs.
assert "ExecStart=/var/lib/left4me/runtime/%i/merged/srcds_run" in unit
assert "$L4D2_ARGS" in unit
assert "${L4D2_ARGS}" not in unit
# Hardening directives belong in the drop-in; must not appear in the base unit.
assert "NoNewPrivileges=" not in unit
assert "PrivateTmp=" not in unit
assert "PrivateDevices=" not in unit
assert "ProtectHome=" not in unit
assert "ProtectSystem=" not in unit
assert "RestrictSUIDSGID=" not in unit
assert "LockPersonality=" not in unit
def test_server_unit_mounts_overlay_via_exec_start_pre():
"""At boot, systemd auto-starts enabled units before the web app gets a
chance to run start_instance's pre-start mount. The unit itself must
re-mount the overlay so reboots are transparent. Pairs with the helper's
idempotency check (test_overlay_helper_mount_is_idempotent_when_already_mounted).
The unit-level `nsenter --mount=/proc/1/ns/mnt --` is what makes
umount fast: without it, the helper Python process would inherit
the unit's per-service mount namespace and pin it alive, blocking
PID 1's umount until the helper exited. Wrapping with nsenter at
the Exec line puts the helper itself in PID 1's namespace.
"""
unit = SERVER_UNIT.read_text()
# `+` prefix: the command runs with full privileges (root, no sandboxing
# applied). Required because the unit has NoNewPrivileges=true, which
# blocks sudo's setuid escalation, and the helper needs root for the
# mount syscall.
assert (
"ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- "
"/usr/local/libexec/left4me/left4me-overlay mount %i"
in unit
)
# Bound the restart loop; without these, a CHDIR-failure (or any other
# pre-start error) spins indefinitely.
assert "StartLimitBurst=5" in unit
assert "StartLimitIntervalSec=60s" in unit
def test_server_unit_unmounts_overlay_via_exec_stop_post():
"""Single source of truth for unmount, mirroring the mount path.
ExecStopPost (not ExecStop) so it runs after srcds has fully exited
and the cgroup is cleared.
Same nsenter-at-Exec-line wrapping as ExecStartPre: without it,
the helper process would itself hold a reference to the unit's
per-service mount namespace, and umount in PID 1 would loop on
EBUSY until the helper gave up. With it, umount succeeds first try.
"""
unit = SERVER_UNIT.read_text()
assert (
"ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- "
"/usr/local/libexec/left4me/left4me-overlay umount %i"
in unit
)
def test_server_unit_contains_perf_baseline_directives():
unit = SERVER_UNIT.read_text()
# Slice membership.
assert "Slice=l4d2-game.slice" in unit
# CFS priority bump (no SCHED_FIFO).
assert "Nice=-5" in unit
assert "CPUSchedulingPolicy=" not in unit
# I/O priority.
assert "IOSchedulingClass=best-effort" in unit
assert "IOSchedulingPriority=4" in unit
# OOM ordering: game servers survive, sandbox dies first.
assert "OOMScoreAdjust=-200" in unit
# Memory caps with headroom for map-load spikes.
assert "MemoryHigh=1.5G" in unit
assert "MemoryMax=2G" in unit
# Bounded fork surface.
assert "TasksMax=256" in unit
# Plenty of fds for plugin-heavy setups.
assert "LimitNOFILE=65536" in unit
# srcds clean shutdown via SIGINT, with time to flush. With the
# helper running in PID 1's mount namespace (via the unit-level
# nsenter on ExecStopPost), umount has no race window and the
# default 15 s is plenty for the whole stop transition.
assert "KillSignal=SIGINT" in unit
assert "TimeoutStopSec=15s" in unit
# Per-unit override of journald rate limiting (default drops srcds output).
assert "LogRateLimitIntervalSec=0" in unit
def test_l4d2_game_slice_exists_with_high_weights():
assert GAME_SLICE.is_file()
text = GAME_SLICE.read_text()
assert "[Slice]" in text
assert "CPUWeight=1000" in text
assert "IOWeight=1000" in text
def test_l4d2_build_slice_exists_with_low_weights():
assert BUILD_SLICE.is_file()
text = BUILD_SLICE.read_text()
assert "[Slice]" in text
assert "CPUWeight=10" in text
assert "IOWeight=10" in text
def test_sysctl_conf_present_with_perf_settings():
assert SYSCTL_CONF.is_file()
text = SYSCTL_CONF.read_text()
for line in (
"net.core.rmem_max = 8388608",
"net.core.wmem_max = 8388608",
"net.core.rmem_default = 524288",
"net.core.wmem_default = 524288",
"net.core.netdev_max_backlog = 5000",
"net.core.netdev_budget = 600",
"vm.swappiness = 10",
"net.ipv4.udp_rmem_min = 16384",
"net.ipv4.udp_wmem_min = 16384",
"net.core.default_qdisc = fq_codel",
"net.ipv4.tcp_congestion_control = bbr",
"kernel.yama.ptrace_scope = 2",
):
assert line in text, f"missing {line!r} in 99-left4me.conf"
def test_env_templates_contain_required_defaults():
host_env = HOST_ENV.read_text()
assert "Deployment units use fixed /var/lib/left4me paths" in host_env
assert host_env.endswith("LEFT4ME_ROOT=/var/lib/left4me\n")
web_env = WEB_ENV_TEMPLATE.read_text()
assert web_env.startswith(
"DATABASE_URL=sqlite:////var/lib/left4me/left4me.db\n"
"SECRET_KEY=replace-with-generated-secret\n"
"JOB_WORKER_THREADS=4\n"
)
assert web_env.rstrip().endswith("STEAM_WEB_API_KEY=")
def test_sandbox_resolv_conf_exists():
assert SANDBOX_RESOLV_CONF.is_file()
text = SANDBOX_RESOLV_CONF.read_text()
nameservers = [
line.split()[1]
for line in text.splitlines()
if line.startswith("nameserver ")
]
assert len(nameservers) >= 2, "expected at least two nameservers for redundancy"
# Sanity: the resolvers must be public (not RFC1918 / loopback). We don't
# pin the exact IPs — Cloudflare/Google/Quad9 are all acceptable.
for ns in nameservers:
assert not ns.startswith("127."), ns
assert not ns.startswith("10."), ns
assert not ns.startswith("192.168."), ns
first_octet = int(ns.split(".")[0])
# Reject 172.16.0.0/12.
if first_octet == 172:
second_octet = int(ns.split(".")[1])
assert not (16 <= second_octet <= 31), ns
def test_web_hardening_dropin_present_with_directives():
assert WEB_HARDENING_DROPIN.is_file()
text = WEB_HARDENING_DROPIN.read_text()
assert "[Service]" in text
# COMMON
for d in (
"ProtectProc=invisible",
"ProtectKernelTunables=true",
"ProtectKernelModules=true",
"ProtectKernelLogs=true",
"ProtectClock=true",
"ProtectControlGroups=true",
"ProtectHostname=true",
"LockPersonality=true",
"ProtectSystem=strict",
"ProtectHome=true",
"PrivateTmp=true",
"RestrictNamespaces=true",
"RestrictRealtime=true",
"RemoveIPC=true",
"KeyringMode=private",
"UMask=0027",
"RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX",
):
assert d in text, f"missing {d!r} in web hardening drop-in"
# WEB-specific
# `native x86` (not `native`) because the install job fork-execs
# steamcmd_linux (32-bit). Plain `native` produces SIGSYS (bash exit 159).
assert "SystemCallArchitectures=native x86" in text
assert "SystemCallFilter=@system-service" in text
assert "SystemCallFilter=~@debug @mount @raw-io @reboot @swap @cpu-emulation @obsolete" in text
# WEB must NOT include the sudo-incompatible directives.
assert "NoNewPrivileges=" not in text
assert "PrivateUsers=" not in text
assert "RestrictSUIDSGID=" not in text
assert "CapabilityBoundingSet=" not in text
assert "~@privileged" not in text


def test_server_hardening_dropin_present_with_directives():
    assert SERVER_HARDENING_DROPIN.is_file()
    text = SERVER_HARDENING_DROPIN.read_text()
    assert "[Service]" in text
    for d in (
        "NoNewPrivileges=true",
        "RestrictSUIDSGID=true",
        "PrivateUsers=true",
        "PrivatePIDs=true",
        "PrivateIPC=true",
        "PrivateDevices=true",
        "CapabilityBoundingSet=",
        "AmbientCapabilities=",
        "SystemCallArchitectures=native x86",
        "TemporaryFileSystem=/var/lib /etc /opt /home /root /srv /mnt /media",
        "BindReadOnlyPaths=/var/lib/left4me/installation",
        "BindReadOnlyPaths=/var/lib/left4me/overlays",
        "BindReadOnlyPaths=/etc/left4me/host.env",
        "BindPaths=/var/lib/left4me/runtime/%i",
        "SocketBindAllow=udp:27000-27999",
        "SocketBindAllow=tcp:27000-27999",
    ):
        assert d in text, f"missing {d!r} in server hardening drop-in"
    assert "SystemCallFilter=~@debug @mount @raw-io @reboot @swap @cpu-emulation @obsolete @privileged" in text
    # MemoryDenyWriteExecute must remain absent (Source engine compat).
    assert "MemoryDenyWriteExecute" not in text
    # ProcSubset=pid must remain absent — hides /proc/cpuinfo and breaks
    # SteamAPI master-server registration (LAN-only fallback). See
    # ckn-bw 4339289 and the comment block in the drop-in itself.
    for line in text.splitlines():
        bare = line.split("#", 1)[0].strip()
        assert bare != "ProcSubset=pid", "ProcSubset=pid must not be active in the server drop-in"


def test_hardening_dropins_agree_on_syscall_architectures():
    # Both units fork-exec a 32-bit binary on critical paths: the web
    # service runs the install job (steamcmd_linux), the server unit runs
    # srcds_linux. Either drop-in without `x86` in SystemCallArchitectures
    # SIGSYS-kills its child on first syscall (bash exit 159). They must
    # agree, and both must include x86 — caught the hard way on
    # 2026-05-15 when web had `native` only and the install job died.
    import re

    pat = re.compile(r"^SystemCallArchitectures=(.+)$", re.MULTILINE)
    web_arch = pat.search(WEB_HARDENING_DROPIN.read_text()).group(1).strip()
    srv_arch = pat.search(SERVER_HARDENING_DROPIN.read_text()).group(1).strip()
    assert web_arch == srv_arch, (
        f"hardening drop-ins disagree on SystemCallArchitectures: "
        f"web={web_arch!r} server={srv_arch!r}. Both must include `x86`."
    )
    assert "x86" in web_arch.split(), (
        f"SystemCallArchitectures missing x86: {web_arch!r}. Required for "
        "steamcmd_linux (install job) and srcds_linux."
    )


@ -1,17 +0,0 @@
"""Syntax-check the sudoers drop-in via visudo before it leaves the repo."""

import shutil
import subprocess
from pathlib import Path

import pytest

SUDOERS = Path(__file__).resolve().parents[2] / "deploy/files/etc/sudoers.d/left4me"


@pytest.mark.skipif(shutil.which("visudo") is None, reason="visudo not installed")
def test_sudoers_parses():
    result = subprocess.run(
        ["visudo", "-cf", str(SUDOERS)],
        capture_output=True, text=True,
    )
    assert result.returncode == 0, f"visudo -cf failed: {result.stdout}{result.stderr}"


@ -1,507 +0,0 @@
# L4D2 server cvar reference
Working notes from the 2026-05 research session on best-practice
L4D2 dedicated server settings. Sources cited inline; some findings
verified empirically on the `left4.me` Trixie test server (kernel
6.12.86). This is reference material, not a settled design.
## Quick lookup
| Topic | Recommended |
|---|---|
| Tickrate (stock) | 30 |
| Tickrate (competitive) | 100, requires Tickrate Enabler plugin |
| `sv_pure` | `2` (strict), or `0`/`1` for modded servers |
| `sv_cheats` | `0` (set to 1 only on private practice servers; disables VAC) |
| `sv_consistency` | `0` (allow custom campaigns) or `1` (strict for competitive) |
| `sv_alltalk` | `0` (no cross-team voice), `1` for casual / fun servers |
| `sv_lan` | `0` (internet server) |
| `sv_voiceenable` | `1` |
| `nb_update_frequency` | `0.033` (safe, no plugin), `0.014` with the SM fix plugin. **Cheat-protected — must be set via `sm_cvar`.** |
| `nb_update_framelimit` | `30` (default `15`). Raises the per-frame bot-AI cap so commons don't lag at high counts. **Cheat-protected.** |
| `fps_max` | `64` for 30-tick, `0` (uncapped) for higher ticks |
| `net_maxcleartime` | `0.0001` — drop choked packets fast instead of stalling. **Cheat-protected.** |
| `sv_tags` | `"coop,custom"` (etc.) — Steam server browser hint |
| `sv_region` | `3` (EU) / `1` (US East) / `255` (any) |
## Copy-paste best practice config
A complete starting config that pairs with the project's existing
`examples/script-overlays/tickrate.sh` overlay (which installs the
Tickrate Enabler plugin) and a SourceMod install. Two files: the
plain `server.cfg` and a SourceMod-only `cfg/sourcemod/sourcemod.cfg`.
For background and per-cvar rationale, see the topic sections below.
### `server.cfg` (vanilla, non-cheat cvars only)
```
// --- Identity & discoverability ---
hostname "your server name here"
sv_tags "coop,custom"
sv_region 3 // 3=EU, 1=US East, 255=any
sv_lan 0
sv_steamgroup "0" // your Steam group ID for reserved slots
sv_search_key "0" // groups your servers in the lobby browser
// --- Security ---
sv_cheats 0
sv_pure 0 // 0/1 for modded servers; 2 = strict
sv_consistency 0 // 0 if hosting custom campaigns; 1 = strict
sv_password ""
sv_allow_lobby_connect_only 0 // let players connect via IP, not just lobby
// --- Voice / chat ---
sv_voiceenable 1
sv_alltalk 0
// --- Player limits (coop) ---
sv_maxplayers 4
sv_visiblemaxplayers 4
// (For versus: 8/8)
// --- Network rates (100-tick; requires Tickrate Enabler) ---
sv_minrate 100000
sv_maxrate 100000
sv_mincmdrate 100
sv_maxcmdrate 100
sv_minupdaterate 100
sv_maxupdaterate 100
sv_client_min_interp_ratio -1
sv_client_max_interp_ratio 2
net_splitpacket_maxrate 50000
net_splitrate 2
fps_max 0
sv_forcepreload 1
// --- Logging (used by left4me's log-streaming feature) ---
sv_logfile 1
sv_logflush 0
sv_logecho 1
sv_logbans 1
```
### `cfg/sourcemod/sourcemod.cfg` (cheat-flagged cvars, set via `sm_cvar`)
```
// --- Network tweaks (cheat-flagged or SM-managed) ---
sm_cvar net_maxcleartime 0.0001
// --- Simulation cadence (more frequent AI ticks; no behaviour change) ---
sm_cvar nb_update_frequency 0.033 // 0.014 if you have the AM fix plugin
sm_cvar nb_update_framelimit 30 // default 15 — raise per-frame bot AI cap
// --- Diagnostics ---
sm_cvar nb_stuck_dump_threshold 5 // log stuck bots ≥5s
```
> **If you're not running SourceMod**, the entire `sm_cvar` block
> above is dead — those cvars are cheat-protected and silently
> ignored from plain `server.cfg`. The vanilla block still applies
> and delivers the bulk of the network-feel improvements. See
> [Cheat-protected cvars and `sm_cvar`](#cheat-protected-cvars-and-sm_cvar).
For tickrates other than 100, see the
[Network rates](#network-rates) section below.
## Network rates
L4D2's default tickrate is **30**. Rate cvars set above what the
current tickrate supports are ignored, so going beyond 30-tick requires the
[Tickrate Enabler plugin](https://github.com/SirPlease/Server4Dead-Project/tree/master/Tickrate%20Enabler).
Rule of thumb: `sv_maxrate = tickrate × 1000`.
### 30-tick (stock)
```
sv_minrate 30000
sv_maxrate 30000
sv_mincmdrate 30
sv_maxcmdrate 30
sv_minupdaterate 30
sv_maxupdaterate 30
net_splitpacket_maxrate 30000
fps_max 64
```
### 60-tick (requires Tickrate Enabler)
```
sv_minrate 60000
sv_maxrate 60000
sv_mincmdrate 60
sv_maxcmdrate 60
sv_minupdaterate 60
sv_maxupdaterate 60
net_splitpacket_maxrate 60000
fps_max 128
```
### 100-tick (competitive, requires Tickrate Enabler)
```
sv_minrate 100000
sv_maxrate 100000
sv_mincmdrate 100
sv_maxcmdrate 100
sv_minupdaterate 100
sv_maxupdaterate 100
net_splitpacket_maxrate 100000
fps_max 0
```
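
The three blocks above follow one pattern, which can be sketched as a small generator. This is only an illustration: `rate_cvars` and `render_cfg` are hypothetical helper names, not part of the project. `fps_max` is left out on purpose because the blocks above do not derive it from a formula (64, 128, 0).

```python
def rate_cvars(tickrate: int) -> dict[str, int]:
    """Return the rate cvars for a tickrate, mirroring the per-tick blocks above."""
    bandwidth = tickrate * 1000  # rule of thumb: sv_maxrate = tickrate x 1000
    return {
        "sv_minrate": bandwidth,
        "sv_maxrate": bandwidth,
        "sv_mincmdrate": tickrate,
        "sv_maxcmdrate": tickrate,
        "sv_minupdaterate": tickrate,
        "sv_maxupdaterate": tickrate,
        "net_splitpacket_maxrate": bandwidth,
    }


def render_cfg(cvars: dict[str, int]) -> str:
    """Render cvars as plain server.cfg lines, one 'name value' per line."""
    return "\n".join(f"{name} {value}" for name, value in cvars.items())


print(render_cfg(rate_cvars(60)))
```

Anything above 30 still needs Tickrate Enabler; the helper only formats values.
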
### sv_min*rate vs. sv_max*rate
- Locking `min == max` (competitive servers do this) ensures every
client sends at the tickrate exactly. Strict — kicks clients
that dip below.
- Leaving a range (e.g. `min=10, max=30` on a 30-tick public
server) tolerates clients on weak connections or loaded CPUs.
- Setting `sv_mincmdrate=0` means *no enforced minimum* — clients
could send as few as 1-2 cmds/sec. Bad. Pick a floor that's
playable (~10 minimum).
## Cheat-protected cvars and `sm_cvar`
Several gameplay-affecting cvars are flagged as "cheat" in L4D2 and
**cannot be set via `server.cfg` unless `sv_cheats 1`** — which
disables VAC and gates achievements. Trying to set them from cfg
silently fails (the value stays at default).
To set them on a real (VAC-protected) server: install SourceMod and
use `sm_cvar <name> <value>` instead of `<name> <value>`. SourceMod
bypasses the cheat protection for *server-side cvar writes only*
(does not grant cheat commands to players).
Cheat-protected cvars worth knowing:
- `nb_update_frequency` — common-infected pathing/state update
rate (see below).
- `director_*` — most director cvars (AI difficulty, panic events,
pacing).
- `z_*` — most zombie-behavior cvars.
`sm_cvar` writes go in `cfg/sourcemod/sourcemod.cfg` (auto-execed
by SM on map change) or in any cfg under `cfg/sourcemod/`. SM
re-applies these on every map change — important because
cheat-protected cvars *reset to defaults on map change* even
within the same server session.
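
A minimal sketch of how a tool could route cvars to the right file. The `CHEAT_PROTECTED` set only lists cvars this document marks as cheat-protected, and `split_config` is a hypothetical helper name; neither exists in the project as far as this document states.

```python
# Cvars this document marks as cheat-protected (not an exhaustive list).
CHEAT_PROTECTED = {
    "nb_update_frequency",
    "nb_update_framelimit",
    "net_maxcleartime",
    "nb_stuck_dump_threshold",
}


def split_config(cvars: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split cvars into (server.cfg lines, cfg/sourcemod/sourcemod.cfg lines)."""
    server_cfg: list[str] = []
    sourcemod_cfg: list[str] = []
    for name, value in cvars.items():
        if name in CHEAT_PROTECTED:
            # Cheat-protected cvars must go through SourceMod's sm_cvar.
            sourcemod_cfg.append(f"sm_cvar {name} {value}")
        else:
            server_cfg.append(f"{name} {value}")
    return server_cfg, sourcemod_cfg


plain, via_sm = split_config({"sv_alltalk": "0", "nb_update_frequency": "0.033"})
print(plain)
print(via_sm)
```
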
## `nb_update_frequency`
Like raising server tickrate, this controls *how often* common
infected and witches get an AI tick — it doesn't change what they
decide, only how quickly the engine asks them. Pure cadence cvar.
Default `0.1` (10 Hz), independent of server tickrate.
| Value | Effect |
|---|---|
| `0.1` (default) | Common-infected look choppy regardless of tickrate |
| `0.033` | ~30 Hz updates; smooth, safe without plugin |
| `0.024` | Lowest "safe" without plugin per community testing |
| `0.014` | ~71 Hz; clients with `cl_interp 0` see jittery commons unless the [nb_update_frequency fix plugin](https://forums.alliedmods.net/showthread.php?t=344019) is installed |
Set via `sm_cvar nb_update_frequency 0.033` in
`cfg/sourcemod/sourcemod.cfg` (or any sm-auto-execed cfg). Without
SourceMod, you cannot reliably set this on a VAC-protected server.
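
The Hz figures in the table are just the reciprocal of the interval. A quick check, with `updates_per_second` as a hypothetical helper name:

```python
def updates_per_second(nb_update_frequency: float) -> float:
    """Convert the cvar's interval (seconds between AI ticks) into a rate in Hz."""
    return 1.0 / nb_update_frequency


for interval in (0.1, 0.033, 0.024, 0.014):
    print(f"{interval}: {updates_per_second(interval):.1f} Hz")
```
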
## NextBot scheduler & diagnostics
`nb_update_frequency` (covered above) is *how often* the scheduler
asks bots to think. Two related cvars are also pure
cadence/throughput — no behaviour change — and one is a passive
diagnostic.
### `nb_update_framelimit`
Default `15`. **Maximum number of NextBots that get an AI tick per
server frame.** Above this cap the engine round-robins bots across
frames, so at 30 commons on a 30-tick server each common gets a
fresh think roughly every other frame — visible as "zombies
hesitate before chasing." Raising this to `30` to `60` lets every bot
think every frame at the cost of linear extra CPU. Does not alter
how bots decide what to do; only how often they get to decide.
This is the most under-documented L4D2 cvar, and its symptoms are
most often misattributed to tickrate or `nb_update_frequency`.
Cheat-protected: use `sm_cvar`.
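
The "every other frame" figure can be reproduced with a toy model. This sketch assumes a strict round-robin over all bots, which is a simplification; the engine's actual scheduler may behave differently. `frames_between_thinks` is a hypothetical name.

```python
import math


def frames_between_thinks(bot_count: int, framelimit: int) -> int:
    """Frames a single bot waits between AI ticks under strict round-robin."""
    if bot_count <= 0:
        return 0  # no bots, nothing to schedule
    return max(1, math.ceil(bot_count / framelimit))


print(frames_between_thinks(30, 15))  # default cap with 30 commons
print(frames_between_thinks(30, 30))  # raised cap
```
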
### `nb_stuck_dump_threshold`
Default `-1` (disabled). Set to `5` to log any bot stuck for ≥5
seconds to the server console. Costs nothing at runtime and is the
single best diagnostic for "why do zombies keep clipping into
geometry on this custom campaign?" tickets. Pure logging — does
not affect bot behaviour. Cheat-protected.
## Lag compensation
Most lag-compensation cvars exist in the engine but are missing from
the truncated `cvar_list` dump. Verify on your own server with
`sm_cvar <name>` (no value) before relying on them.
| Cvar | Default | Notes |
|---|---|---|
| `sv_unlag` | `1` | Enable lag compensation. Keep on. |
| `sv_maxunlag` | `0.5` to `1.0` | Max seconds of lag-comp rewind. Confogl uses `1`. Higher = better for higher-ping shots. |
| `sv_unlag_fixstuck` | `1` | Used by upstream Competitive-Rework. |
| `sv_forcepreload` | `0` | Set to `1` to preload server-side assets at boot. Smoother first map. Confirmed in `cvar_list`. |
## Packet compression & high-entity-count tuning
Relevant when running custom servers with raised `z_common_limit`,
big mob spawns, or many addon entities. At high entity counts,
snapshots routinely exceed the UDP MTU and get split into multiple
packets. Clients perceive this as "lag" — but it's really
*snapshot drops*, visible in `net_graph` as updates/sec dipping
well below `sv_maxupdaterate`. The fix is on the wire, not in the
simulation.
Source: [Lux's L4D2 high-zombie-count discussion (Steam)](https://steamcommunity.com/app/550/discussions/0/2568690416482192538/).
| Cvar | Default | Recommended | Notes |
|---|---|---|---|
| `net_compresspackets` | varies | `1` | Enable LZ-style packet compression. Cheap CPU win for high-entity servers. Verify with `sm_cvar`. |
| `net_compresspackets_minsize` | varies | `2324` | Compress packets ≥ this size — roughly the wire MTU. |
| `net_splitrate` | `1` | `2` | Allow 2 split-packet pieces per net frame; drains queue faster. Confirmed in `cvar_list`. |
| `net_splitpacket_maxrate` | `15000` | `50000`+ | Throughput cap when sending split packets. |
| `net_maxcleartime` | `4.0` | `0.0001` | Don't stall on choke — drop choked packets fast. Confirmed real (RCON-verified 2026-05-20: `sm_cvar net_maxcleartime` returns the set value). |
| `sv_extra_client_connect_time` | varies | `0.0001` | Tiny handshake speedup from the Lux thread. Verify with `sm_cvar`. |
Several of these are missing from the local `cvar_list` dump but
that file is **not exhaustive** — see
[Verifying a cvar actually exists](#verifying-a-cvar-actually-exists)
below. Several of these lines exist verbatim in upstream
Competitive-Rework's `cfg/server.cfg`, which has been running on
public servers for years.
## Server discoverability
Cosmetic but real UX wins for public servers.
| Cvar | Recommended | What it does |
|---|---|---|
| `sv_tags` | `"coop,custom,modded"` (your choice) | Comma-separated tags shown in the Steam server browser. Players filter on these. |
| `sv_region` | `3` (EU), `1` (US East), `255` (any) | Region reported to the master server. Set this and your server appears in the right regional browser. |
| `sv_search_key` | `"left4me"` (or your own string) | When players search from the in-game lobby, only servers with a matching key appear. Useful for grouping a fleet. |
| `sv_steamgroup` | your group's ID | Steam group members get reserved-slot priority (with the appropriate plugin). |
| `sv_lan` | `0` | Set `1` only for local-only play; skips Steam auth (players can't friend-join). |
## Logging hygiene
Relevant because the project's log-streaming feature (the work in
`l4d2web/static/js/files-overlay/editor.js` and adjacent) tails
the server log file. These cvars control what actually gets
written.
| Cvar | Recommended | Notes |
|---|---|---|
| `sv_logfile` | `1` | Server log to disk. Required for log-streaming. |
| `sv_logflush` | `0` | Don't flush after every line — slow. Keep at `0` unless you're debugging crashes. |
| `sv_logecho` | `1` | Mirror log to stdout — needed for any process that tails srcds's console. |
| `sv_logbans` | `1` | Log every `kickid` / `banid` to the same log file. Cheap audit trail. |
| `sv_log_onefile` | `0` | Default — one log per day. `1` rolls everything into a single file (gets large quickly). |
| `sv_logsdir` | `"logs"` | Default. Path is relative to the game directory. |
## Verifying a cvar actually exists
The local `/Users/mwiegand/Projekte/left4me/cvar_list` dump (~2199
entries) is **incomplete** — it's missing several real L4D2 cvars
that upstream Competitive-Rework uses and that have been verified
in-engine via RCON. Likely it was generated via the in-engine
`cvarlist` command, which truncates and filters.
Authoritative existence check via SourceMod console (RCON):
```
sm_cvar <name> # no value → "Value of cvar X: Y" if real,
# "unknown" otherwise
```
The screenshot evidence for `net_maxcleartime` (2026-05-20):
```
> sm_cvar net_maxcleartime
[SM] Value of cvar "net_maxcleartime": "0.0001"
```
Rule of thumb when copying configs from elsewhere:
1. If the cvar is in `cvar_list` → it's definitely real.
2. If it's *not* in `cvar_list` but is in upstream Competitive-
Rework's `server.cfg` → probably real, but verify via `sm_cvar`
before relying on it.
3. If it's in neither and only mentioned in a random forum post →
high probability it's a CSGO/CS:S or HL2 cvar that someone
assumed exists in L4D2.
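
The three rules above can be written as a small triage function. A sketch only: the two sets would have to be loaded from the `cvar_list` dump and the upstream `server.cfg`, and `triage_cvar` is a hypothetical name.

```python
def triage_cvar(name: str, cvar_list: set[str], upstream_cfg: set[str]) -> str:
    """Classify a cvar following the rule of thumb above."""
    if name in cvar_list:
        return "real"
    if name in upstream_cfg:
        return "probably real, verify with sm_cvar"
    return "suspect, likely from another Source game"


# Illustrative data taken from this document.
known = {"net_splitrate", "sv_forcepreload"}
upstream = {"net_maxcleartime"}
for cvar in ("net_splitrate", "net_maxcleartime", "z_update_rate"):
    print(cvar, "->", triage_cvar(cvar, known, upstream))
```
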
## Cvars that DO NOT exist in L4D2 (despite some guides claiming otherwise)
These come up in older guides or are inherited from other Source
games but don't actually exist in L4D2's command set. Verified by
RCON `sm_cvar <name>` returning "unknown":
- `z_resolve_zombie_collision_multiplier` — confirmed unknown in
current L4D2 builds (verified via RCON 2026-05-14). Some
community guides list it; it's not in the binary.
- `z_update_rate` — referenced in older tuning guides but not a
real L4D2 cvar. The actual zombie-AI cadence knob is
`nb_update_frequency`.
If a guide tells you to set one of these in L4D2, the guide is
wrong or out of date.
**Earlier revisions of this doc also listed `net_maxcleartime`
here. That was wrong** — it's a real L4D2 cvar (RCON-verified
2026-05-20 returning `0.0001` on `left4.me`). It just happens to
be missing from the `cvar_list` dump. The lesson: the cvar_list
file is useful as a positive check but unreliable as a negative
check (see
[Verifying a cvar actually exists](#verifying-a-cvar-actually-exists)).
## Security and integrity
```
sv_cheats 0
sv_pure 2 // force Steam-only files (strictest)
sv_consistency 1 // enforce file hashes for critical files
// (set 0 if hosting custom campaigns)
sv_lan 0 // internet server
```
Launch the server with `-secure` to enable VAC. `sv_cheats 1`
requires `-insecure` (no VAC) — only acceptable on private
practice servers.
`sv_pure 2` breaks many workshop maps/mods. Use `sv_pure 0` or `1`
for modded servers.
## Player limits
```
// Co-op / Scavenge
sv_maxplayers 4
sv_visiblemaxplayers 4
// Versus
sv_maxplayers 8
sv_visiblemaxplayers 8
```
## Voice
```
sv_voiceenable 1
sv_alltalk 0 // 1 = cross-team voice (casual / fun servers)
```
## Recommended plugins (SourceMod ecosystem)
| Plugin | Purpose |
|---|---|
| MetaMod:Source + SourceMod | Required foundation for most of the below |
| [Tickrate Enabler](https://github.com/SirPlease/Server4Dead-Project/tree/master/Tickrate%20Enabler) | Unlock >30 tick servers |
| [Little Anti-Cheat](https://github.com/J-Tanzanite/Little-Anti-Cheat) | Aimbot / angle-cheat detection |
| SMAC | Secondary AC layer (older but still works) |
| [ZoneMod](https://github.com/SirPlease/L4D2-Competitive-Rework) | Competitive Versus ruleset (full bundle: ZoneMod + MatchMode + Confogl-style plugins) |
| `l4d2_TKStopper` | Teamkill / griefing control |
| `l4d_sb_fix` | Survivor bot behavior fixes |
| [nb_update_frequency fix](https://forums.alliedmods.net/showthread.php?t=344019) | Eliminates client-side jitter at very low `nb_update_frequency` values |
## MetaMod:Source / SourceMod versioning
- Stable branches are pinned in URL paths: `1.10`, `1.11`, `1.12`,
etc. There is no "latest stable" alias URL — you pick the
branch.
- Within a branch, the `mmsource-latest-linux` and
`sourcemod-latest-linux` text files contain the current build's
filename, e.g. `mmsource-1.12.0-git1219-linux.tar.gz`. Curl the
pointer file, then curl the actual tarball.
- AM bumps stable every ~2-3 years. When 1.13 (or later) is
declared stable, update the `MM_BRANCH=1.12` / `SM_BRANCH=1.12`
pins in the seeded Sourcemod overlay script.
- L4D2 has no special branch — it uses whatever the current
stable supports. L4D2's engine is so stable that SM 1.11 and
1.12 both work.
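
The pointer-file lookup described above can be sketched as follows. The branch URL is passed in by the caller and the `example.invalid` address in the demo is a placeholder, since this document does not state the real drop URLs. `resolve_latest_tarball` and `fetch_text` are hypothetical names.

```python
from typing import Callable
from urllib.request import urlopen


def fetch_text(url: str) -> str:
    """Download a small text file (used for the '*-latest-linux' pointer)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def resolve_latest_tarball(
    branch_url: str,
    pointer_name: str,
    fetch: Callable[[str], str] = fetch_text,
) -> str:
    """Read the pointer file in a branch directory and return the tarball URL."""
    base = branch_url.rstrip("/")
    filename = fetch(f"{base}/{pointer_name}").strip()
    return f"{base}/{filename}"


# Offline demo: a fake fetcher returns the filename quoted in the text above.
demo_url = resolve_latest_tarball(
    "https://example.invalid/drop/1.12",  # placeholder, not the real host
    "mmsource-latest-linux",
    fetch=lambda url: "mmsource-1.12.0-git1219-linux.tar.gz\n",
)
print(demo_url)
```

Injecting `fetch` keeps the lookup testable without network access.
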
Watch for stable announcements at
[Metamod:Source news](https://www.sourcemm.net/) and
[SourceMod releases](https://github.com/alliedmodders/sourcemod/releases).
## Empirically-verified kernel quirk (relevant if you tweak the helpers)
Idmapped bind mounts on kernel 6.12 (Trixie) **do** propagate
through plain `mount --bind` re-binds. Verified end-to-end on
`left4.me` during the 2026-05-15 build-time-idmap refactor: a
sandbox process inside a re-bound idmapped mount can write files,
and those writes land on disk with the idmap-translated uid.
This contradicts some published claims (including a generic
research-agent summary) that idmaps don't propagate through plain
re-bind on this kernel. Our use case is `mount --bind --map-users
src staging` → systemd-run with `BindPaths=staging:/overlay` (a
plain re-bind into the unit's namespace). It works.
The `--map-users <a>:<b>:<count>` direction is **on-disk uid
first**, then in-mount uid. The util-linux man page calls these
`<inner>:<outer>` which is confusing — `<inner>` means "the
filesystem's native uid" (on disk) and `<outer>` means "the uid
exposed outward through the mount." Empirically verified; do not
trust the man page's word choice.
## Project integration (left4me overlays)
The project already ships overlays in `examples/script-overlays/`
that map cleanly onto the recommendations above:
| Overlay | Use it for |
|---|---|
| [`tickrate.sh`](../examples/script-overlays/tickrate.sh) | Drop-in 100-tick foundation: installs the Tickrate Enabler plugin (`tickrate_enabler.dll/.so/.vdf`) and writes the core rate cvars (`sv_minrate/maxrate 100000`, `nb_update_frequency 0.014`, `net_splitpacket_maxrate 50000`, `net_maxcleartime 0.0001`, `fps_max 0`). Required base layer for any of the higher-tick recommendations in this doc. |
| [`competitive_rework.sh`](../examples/script-overlays/competitive_rework.sh) | Pulls the entire SirPlease/L4D2-Competitive-Rework master branch into the overlay. Full confogl bundle — plugins, configs, cfgogl per-mode tuning. Opinionated for tournament versus. Use this *or* `tickrate.sh`, not both. |
| [`cedapug_maps.sh`](../examples/script-overlays/cedapug_maps.sh), [`l4d2center_maps.sh`](../examples/script-overlays/l4d2center_maps.sh) | Competitive map pools (orthogonal to cvars). |
The cvars in the "Copy-paste best practice config" section above
are intended to be applied **on top of `tickrate.sh`** — either by
adding them to an instance's `spec.config` YAML list, or by
creating a new overlay (e.g. `examples/script-overlays/ux_polish.sh`)
that writes them to `$OVERLAY/left4dead2/cfg/server.cfg`.
How `spec.config` becomes `server.cfg` (for reference):
`l4d2host/l4d2host/instances.py:52-54` joins the YAML list with
newlines into `{LEFT4ME_ROOT}/instances/{name}/server.cfg`, then
that file is staged into the runtime upper layer at instance start.
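
In outline, the join step looks roughly like this. It is a sketch of the behaviour described above, not the code in `instances.py`; `write_server_cfg` is a hypothetical name and the temporary directory stands in for `{LEFT4ME_ROOT}/instances/{name}`.

```python
import tempfile
from pathlib import Path


def write_server_cfg(instance_dir: Path, config_lines: list[str]) -> Path:
    """Join the spec.config list with newlines and write it as server.cfg."""
    cfg_path = instance_dir / "server.cfg"
    cfg_path.write_text("\n".join(config_lines))
    return cfg_path


with tempfile.TemporaryDirectory() as tmp:
    written = write_server_cfg(Path(tmp), ["sv_tags \"coop,custom\"", "sv_region 3"])
    cfg_text = written.read_text()

print(cfg_text)
```
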
## Launch parameters (reference)
Typical srcds invocation:
```
./srcds_run -console -game left4dead2 -secure -autoupdate \
+maxplayers 8 -port 27015 +exec server.cfg +log on
```
- `-secure` enables VAC. Don't run public servers without it.
- `-autoupdate` keeps the server patched automatically.
- `+exec server.cfg` runs your config on startup.
- `-tickrate <N>` sets the engine tickrate (requires Tickrate
Enabler for `N > 30`).
## Sources
Primary references used for the recommendations above:
- [L4D2-Competitive-Rework server.cfg](https://github.com/SirPlease/L4D2-Competitive-Rework/blob/master/cfg/server.cfg) — the canonical confogl/competitive cvar block. Many cheat-flagged cvars in this doc are sourced from here.
- [L4D2-Competitive-Rework cvar_tracking.cfg](https://github.com/SirPlease/L4D2-Competitive-Rework/blob/master/cfg/cvar_tracking.cfg) — client-cvar enforcement list (anti-cheat tracking; not directly used here but useful context).
- [Lux's L4D2 high-zombie-count packet compression analysis (Steam Discussions, app/550)](https://steamcommunity.com/app/550/discussions/0/2568690416482192538/) — origin of the `net_compresspackets` / `net_splitrate` / `net_maxcleartime` recommendations.
- [L4D2 Dedicated Server Guide (Steam Community)](https://steamcommunity.com/sharedfiles/filedetails/?id=276173458)
- [L4D2 Dedicated Server Network Tweaks (Steam Discussions)](https://steamcommunity.com/app/550/discussions/1/1839063537784156851/)
- [SirPlease/Server4Dead-Project — Tickrate Enabler](https://github.com/SirPlease/Server4Dead-Project/tree/master/Tickrate%20Enabler)
- [Valve Developer Community — L4D2 console commands](https://developer.valvesoftware.com/wiki/List_of_Left_4_Dead_2_console_commands_and_variables)
- [AlliedModders — nb_update_frequency fix (Experimental)](https://forums.alliedmods.net/showthread.php?t=344019)
- [Source Multiplayer Networking — Valve Developer Community](https://developer.valvesoftware.com/wiki/Source_Multiplayer_Networking)
- [Required Versions (SourceMod wiki)](https://wiki.alliedmods.net/Required_Versions_(SourceMod))
- [MetaMod:Source news](https://www.sourcemm.net/)
- Local: `/Users/mwiegand/Projekte/left4me/cvar_list` — 2199-line dump of L4D2 cvars (positive existence reference; *not* exhaustive — see [Verifying a cvar actually exists](#verifying-a-cvar-actually-exists)).
- Local: `examples/script-overlays/tickrate.sh`, `examples/script-overlays/competitive_rework.sh` — overlay scripts that apply these settings to a left4me instance.

File diff suppressed because it is too large Load diff


@ -1,557 +0,0 @@
# L4D2 Workshop Overlays Implementation Plan
> **Approval gate:** This plan may be written and refined without further approval. Do not implement code changes from this plan until the user explicitly approves implementation.
**Goal:** Implement the workshop overlay feature per `docs/superpowers/specs/2026-05-07-l4d2-workshop-overlays-design.md`. Add a `WorkshopItem` registry, a typed `Overlay.type` column with a builder registry, a workshop builder that downloads from the Steam Web API and manages symlinks into a deduplicated cache, and the supporting routes, templates, jobs, and tests.
**Architecture:** Keep the v1 single-process Flask architecture. New code is additive: a `WorkshopBuilder` class registered in a builder dispatcher, a `steam_workshop` service module for the Steam Web API and downloader, two new database tables and one extended one, and two new job operations on the existing in-process worker. fuse-overlayfs mount handling in `l4d2host` is unchanged — workshop content arrives at overlay paths the same way externals do today.
---
## Locked Decisions
See `docs/superpowers/specs/2026-05-07-l4d2-workshop-overlays-design.md` for the design rationale. Implementation-relevant decisions:
- Typed overlays: `external` (existing rows; no-op builder) and `workshop` (new); future types deferred.
- No JSON `source_config` blob; per-type structured data in proper tables.
- `WorkshopItem` is a global deduplicated registry keyed on `steam_id`. Cache at `/var/lib/left4me/workshop_cache/{steam_id}.vpk`.
- Overlay symlinks are absolute, named `{steam_id}.vpk`; no Steam filename in any on-disk path.
- `overlay_workshop_items` is a pure association; toggle = remove/re-add.
- Collections are atomic UI bulk-imports; DB never tracks collection attribution.
- Single global admin "Refresh all workshop items" button.
- No cache GC in v1.
- `Overlay.user_id` is the scope (NULL = system, set = private); independent of `type`.
- Workshop overlays default to private; existing externals stay system-wide.
- One unified Create-overlay button with type radio; no path field — paths are always `str(overlay_id)`.
- `consumer_app_id == 550` validated at fetch/add; not stored.
- Input field accepts numeric ID, full Workshop URL, or multi-line batch.
- Auto-rebuild after add/remove with build coalescing.
- HTTPS for all Steam Web API calls.
- `Overlay.id` uses `AUTOINCREMENT`; `create_overlay_directory` uses `exist_ok=False`.
- Two partial unique indexes for overlay names: `(name) WHERE user_id IS NULL` and `(name, user_id) WHERE user_id IS NOT NULL`.
---
## Current Gap
- `Overlay` rows have `id`, `name`, `path`, no type, no scope.
- The web app cannot download anything from Steam; users must SFTP `.vpk` files into prepared overlay directories.
- The job worker has no operations for overlay builds or workshop refreshes.
- The mount/build pipeline assumes overlay directories are externally populated.
- There is no UI affordance to add or list workshop content.
---
## Task 1: Extend Tests First — Schema Migration And Models
**Files:**
- Create: `l4d2web/tests/test_workshop_overlay_models.py`
- Modify: `l4d2web/tests/test_models.py` (extend) — partial unique index behavior
Write tests against fresh SQLite schemas asserting:
- An `Overlay` migration round-trip: existing rows acquire `type='external'` and `user_id=NULL`; their `name` values remain unique by partial index.
- After migration, two externals (both `user_id=NULL`) with the same name are rejected by the system partial unique index.
- After migration, two users may both own a workshop overlay named `"my-maps"` (per-user partial unique index).
- `WorkshopItem.steam_id` is unique; concurrent inserts of the same `steam_id` raise integrity errors.
- `overlay_workshop_items` enforces `UNIQUE(overlay_id, workshop_item_id)`.
- `Overlay` deletion cascades `overlay_workshop_items` rows but does not delete `WorkshopItem` rows (`ON DELETE RESTRICT`).
- `Job.overlay_id` is nullable and references `overlays(id)`.
- `Overlay.id` does not reuse a deleted ID after the migration (AUTOINCREMENT).
Verification command:
```bash
pytest l4d2web/tests/test_workshop_overlay_models.py l4d2web/tests/test_models.py -q
```
Expected before implementation: FAIL.
---
## Task 2: Schema Migration And ORM Mappings
**Files:**
- Create: `l4d2web/alembic/versions/0002_workshop_overlays.py`
- Modify: `l4d2web/models.py`
Migration `0002_workshop_overlays` (`down_revision = "b2c684fddbd3"`):
1. `op.batch_alter_table("overlays")`:
- Add `type VARCHAR(16) NOT NULL DEFAULT 'external'` (server_default during migration; remove after backfill).
- Add `user_id INTEGER NULL REFERENCES users(id)`.
- Drop the existing `unique=True` on `name`.
- Add index `ix_overlays_type_user_id` on `(type, user_id)`.
- Switch `id` to `AUTOINCREMENT`.
2. After batch alter, create the two partial unique indexes via raw `op.create_index(..., postgresql_where=..., sqlite_where=...)`:
- `uq_overlay_name_system` on `(name)` `WHERE user_id IS NULL`.
- `uq_overlay_name_per_user` on `(name, user_id)` `WHERE user_id IS NOT NULL`.
3. `op.create_table("workshop_items", ...)` per spec data-model section.
4. `op.create_table("overlay_workshop_items", ...)` with the unique constraint and the reverse-lookup index.
5. `op.batch_alter_table("jobs")`: add `overlay_id INTEGER NULL REFERENCES overlays(id)`.
ORM (`models.py`):
- Extend `Overlay`: add `type`, `user_id`. Drop `unique=True` on `name`. Set `__table_args__` with the two partial indexes and `ix_overlays_type_user_id`.
- Extend `Job`: add `overlay_id` mapped column with FK.
- New `WorkshopItem` and `OverlayWorkshopItem` classes per spec. Set up `Overlay.workshop_items` relationship through the association.
Verification command:
```bash
pytest l4d2web/tests/test_workshop_overlay_models.py l4d2web/tests/test_models.py -q
```
Expected after implementation: PASS.
Run alembic against a fresh test DB to verify upgrade and downgrade succeed.
---
## Task 3: Tests First — Steam Web API And Downloader
**Files:**
- Create: `l4d2web/tests/test_steam_workshop.py`
Mock HTTP with `responses` or `pytest-httpserver`. Cover:
- `parse_workshop_input` accepts a single numeric ID, a single Workshop URL (`steamcommunity.com/sharedfiles/filedetails/?id=N`), and a multi-line whitespace-separated batch of either; returns deduplicated ordered list of digit-only IDs.
- `parse_workshop_input` rejects garbage, paths outside `?id=`, non-digit IDs.
- `resolve_collection` POSTs to the HTTPS endpoint with the form-encoded payload and returns `publishedfileid` children.
- `fetch_metadata_batch` POSTs once with `itemcount=N`; returns parsed `WorkshopMetadata` per item; captures `result != 1` into `last_error`; raises `WorkshopValidationError` when any `consumer_app_id != 550` during user-add; logs and skips during refresh-mode.
- `WorkshopMetadata.preview_url` is captured.
- `download_to_cache` writes `cache_root/{steam_id}.vpk.partial`, then `os.replace` to the final name; sets `os.utime(file, (time_updated, time_updated))`.
- `download_to_cache` is idempotent: a second call where on-disk `(mtime, size)` matches `(time_updated, file_size)` is a no-op (no HTTP request issued).
- `refresh_all` runs downloads via `ThreadPoolExecutor(max_workers=8)` and reports per-item errors without aborting the batch.
- All Steam API URLs use `https://`.
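
One way `parse_workshop_input` could satisfy the first two bullets is sketched below. It is an illustration under assumptions, not the planned implementation: `WorkshopInputError`, the accepted host list, and the choice to accept URLs without a scheme are all hypothetical.

```python
import re
from urllib.parse import parse_qs, urlparse

_DIGITS = re.compile(r"[0-9]+")
_WORKSHOP_HOSTS = {"steamcommunity.com", "www.steamcommunity.com"}  # assumption


class WorkshopInputError(ValueError):
    """Hypothetical error type for input that is not an ID or Workshop URL."""


def _extract_id(token: str) -> str:
    if _DIGITS.fullmatch(token):
        return token
    # Accept URLs pasted without a scheme by assuming https.
    url = token if "://" in token else f"https://{token}"
    try:
        parsed = urlparse(url)
    except ValueError as exc:
        raise WorkshopInputError(f"not a Workshop ID or URL: {token!r}") from exc
    if (
        parsed.scheme in {"http", "https"}
        and parsed.netloc.lower() in _WORKSHOP_HOSTS
        and parsed.path.rstrip("/") == "/sharedfiles/filedetails"
    ):
        ids = parse_qs(parsed.query).get("id", [])
        if len(ids) == 1 and _DIGITS.fullmatch(ids[0]):
            return ids[0]
    raise WorkshopInputError(f"not a Workshop ID or URL: {token!r}")


def parse_workshop_input(raw: str) -> list[str]:
    """Return deduplicated, order-preserving digit-only IDs from raw input."""
    tokens = raw.split()  # splits on any whitespace, including newlines
    if not tokens:
        raise WorkshopInputError("empty input")
    return list(dict.fromkeys(_extract_id(token) for token in tokens))


print(parse_workshop_input(
    "123\nhttps://steamcommunity.com/sharedfiles/filedetails/?id=456 123"
))
```
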
Verification command:
```bash
pytest l4d2web/tests/test_steam_workshop.py -q
```
Expected before implementation: FAIL.
---
## Task 4: Steam Workshop Service Module
**Files:**
- Create: `l4d2web/services/steam_workshop.py`
Public surface:
```python
def parse_workshop_input(raw: str) -> list[str]: ...
def resolve_collection(collection_id: str) -> list[str]: ...
def fetch_metadata_batch(steam_ids: list[str], *, mode: Literal["add","refresh"]) -> list[WorkshopMetadata]: ...
def download_to_cache(meta: WorkshopMetadata, cache_root: Path, *, on_progress=None, should_cancel=None) -> Path: ...
def refresh_all(items: list[WorkshopItem], cache_root: Path, executor_workers: int = 8) -> RefreshReport: ...
```
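The two metadata calls send form-encoded bodies with a count field plus one indexed `publishedfileids[i]` field per ID. A small sketch of how those bodies could be built follows; the helper names `details_payload` and `collection_payload` are hypothetical and not part of the public surface above.

```python
def details_payload(steam_ids: list[str]) -> dict[str, str]:
    """Form fields for GetPublishedFileDetails: itemcount plus indexed IDs."""
    payload = {"itemcount": str(len(steam_ids))}
    payload.update({f"publishedfileids[{i}]": sid for i, sid in enumerate(steam_ids)})
    return payload


def collection_payload(collection_ids: list[str]) -> dict[str, str]:
    """Form fields for GetCollectionDetails: collectioncount plus indexed IDs."""
    payload = {"collectioncount": str(len(collection_ids))}
    payload.update(
        {f"publishedfileids[{i}]": cid for i, cid in enumerate(collection_ids)}
    )
    return payload
```

With `requests`, such a dict would typically be passed as `session.post(url, data=payload, timeout=30)`, which form-encodes it.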
Implementation rules:
- Endpoints are HTTPS:
- `https://api.steampowered.com/ISteamRemoteStorage/GetCollectionDetails/v1/`
- `https://api.steampowered.com/ISteamRemoteStorage/GetPublishedFileDetails/v1/`
- Form-encoded POSTs with `itemcount=N` / `collectioncount=N` and `publishedfileids[i]=…` per index.
- Per-request timeout 30s; per-item ceiling 5min. No retry or backoff in v1.
- `consumer_app_id != 550`:
- In `mode="add"`: raise `WorkshopValidationError` with the offending `steam_id`.
- In `mode="refresh"`: log and skip; do not abort other items.
- `result != 1`: capture Steam's result code in the item's `last_error`; do not download; do not abort siblings.
- Cooperative cancellation: `download_to_cache` checks `should_cancel()` between chunked reads; `refresh_all`'s executor checks before each task.
- `WorkshopMetadata` is a dataclass with `steam_id, title, filename, file_url, file_size, time_updated, preview_url, consumer_app_id, result`.
- `RefreshReport` aggregates per-item outcomes for the caller's job log.
- Use a single `requests.Session` per call site for connection reuse.
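The download rules (write to `.partial`, `os.replace`, stamp the mtime, skip when `(mtime, size)` already match) can be sketched as below. This is not the planned signature: the HTTP layer is replaced by an injected `fetch_chunks` callable so the example runs without network access, `WorkshopMetadata` is trimmed to the fields used here, and `DownloadCancelled` is a hypothetical exception name.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class WorkshopMetadata:
    # Trimmed to the fields this sketch needs.
    steam_id: str
    file_url: str
    file_size: int
    time_updated: int


class DownloadCancelled(Exception):
    """Hypothetical: raised when should_cancel() turns true mid-download."""


def is_cached(meta: WorkshopMetadata, cache_root: Path) -> bool:
    """True when the on-disk (mtime, size) matches (time_updated, file_size)."""
    try:
        st = (cache_root / f"{meta.steam_id}.vpk").stat()
    except FileNotFoundError:
        return False
    return int(st.st_mtime) == meta.time_updated and st.st_size == meta.file_size


def download_to_cache(
    meta: WorkshopMetadata,
    cache_root: Path,
    *,
    fetch_chunks: Callable[[str], Iterable[bytes]],  # stand-in for the HTTP stream
    should_cancel: Optional[Callable[[], bool]] = None,
) -> Path:
    final = cache_root / f"{meta.steam_id}.vpk"
    if is_cached(meta, cache_root):
        return final  # idempotent: no request is issued
    partial = cache_root / f"{meta.steam_id}.vpk.partial"
    try:
        with open(partial, "wb") as fh:
            for chunk in fetch_chunks(meta.file_url):
                # Cooperative cancellation between chunked reads.
                if should_cancel is not None and should_cancel():
                    raise DownloadCancelled(meta.steam_id)
                fh.write(chunk)
        os.replace(partial, final)  # atomic rename within the cache directory
    except BaseException:
        partial.unlink(missing_ok=True)
        raise
    os.utime(final, (meta.time_updated, meta.time_updated))
    return final
```

Stamping the mtime with `time_updated` is what makes the later `(mtime, size)` comparison a cheap freshness check.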
Verification command:
```bash
pytest l4d2web/tests/test_steam_workshop.py -q
```
Expected after implementation: PASS.
---
## Task 5: Tests First — Path Helpers And Overlay Creation
**Files:**
- Create: `l4d2web/tests/test_workshop_paths.py`
- Create: `l4d2web/tests/test_overlay_creation.py`
Cover:
- `workshop_cache_root()` returns `LEFT4ME_ROOT/workshop_cache`.
- `cache_path(steam_id)` returns `cache_root / f"{steam_id}.vpk"` for valid digit strings; rejects non-digits, slashes, dot-dot.
- `generate_overlay_path(overlay_id)` returns `str(overlay_id)`; passes `validate_overlay_ref` from `l4d2host.paths`.
- `create_overlay_directory(overlay)` creates `LEFT4ME_ROOT/overlays/{path}/` with `exist_ok=False`. Calling twice raises (DB/disk drift surfaced loudly).
Verification command:
```bash
pytest l4d2web/tests/test_workshop_paths.py l4d2web/tests/test_overlay_creation.py -q
```
Expected before implementation: FAIL.
---
## Task 6: Path Helpers And Overlay Creation
**Files:**
- Create: `l4d2web/services/workshop_paths.py`
- Create: `l4d2web/services/overlay_creation.py`
`workshop_paths`:
```python
def workshop_cache_root() -> Path: ... # LEFT4ME_ROOT/workshop_cache
def cache_path(steam_id: str) -> Path: ... # validates digits-only; returns cache_root/{steam_id}.vpk
```
`overlay_creation`:
```python
def generate_overlay_path(overlay_id: int) -> str: ... # str(overlay_id) + validate_overlay_ref
def create_overlay_directory(overlay: Overlay) -> None:  # makedirs(..., exist_ok=False)
    ...
```
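A possible shape for the `workshop_paths` helpers is sketched here. It assumes `LEFT4ME_ROOT` is read from an environment variable with `/var/lib/left4me` as the fallback, and that invalid IDs raise `ValueError`; neither detail is stated in the plan.

```python
import os
import re
from pathlib import Path

_STEAM_ID = re.compile(r"[0-9]+")


def left4me_root() -> Path:
    # Assumption: the root comes from the environment, defaulting to the deploy path.
    return Path(os.environ.get("LEFT4ME_ROOT", "/var/lib/left4me"))


def workshop_cache_root() -> Path:
    return left4me_root() / "workshop_cache"


def cache_path(steam_id: str) -> Path:
    """Return cache_root/{steam_id}.vpk, rejecting anything but ASCII digits."""
    if not isinstance(steam_id, str) or not _STEAM_ID.fullmatch(steam_id):
        raise ValueError(f"invalid steam_id: {steam_id!r}")
    return workshop_cache_root() / f"{steam_id}.vpk"
```

Because only digits pass the check, slashes and `..` can never reach the path join.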
Verification command:
```bash
pytest l4d2web/tests/test_workshop_paths.py l4d2web/tests/test_overlay_creation.py -q
```
Expected after implementation: PASS.
---
## Task 7: Tests First — Overlay Builders
**Files:**
- Create: `l4d2web/tests/test_overlay_builders.py`
Cover with `tmp_path`:
- `BUILDERS` dict resolves `"external"` and `"workshop"` to instances; unknown types raise `KeyError` (caller's error).
- `ExternalBuilder.build()` is a no-op: makes the overlay directory if missing, writes one log line, returns. Existing files in the directory are untouched.
- `WorkshopBuilder.build()` against a fixture overlay with three associated `WorkshopItem` rows (two with cache files present, one without):
- Creates `left4dead2/addons/` if missing.
- Creates symlinks `addons/{steam_id_a}.vpk → cache_root/{steam_id_a}.vpk` for items with cache files. Symlinks are absolute.
- Skips the uncached item; emits a warning log line. Does not create a dangling symlink.
- On a re-run with the same associations: no FS changes; logs report `unchanged=2 skipped(uncached)=1`.
- On a re-run after one association is removed: removes the obsolete symlink only; leaves cache files alone.
- On a re-run after one item is added: adds only the new symlink.
- Files in `addons/` that aren't symlinks into the cache are left untouched.
- `should_cancel` mid-build: stops between filesystem ops; partial state is consistent and a re-run heals.
Verification command:
```bash
pytest l4d2web/tests/test_overlay_builders.py -q
```
Expected before implementation: FAIL.
---
## Task 8: Overlay Builders And Dispatcher
**Files:**
- Create: `l4d2web/services/overlay_builders.py`
```python
from typing import Protocol


class OverlayBuilder(Protocol):
    def build(self, overlay: Overlay, *, on_stdout, on_stderr, should_cancel) -> None: ...


class ExternalBuilder: ...
class WorkshopBuilder: ...


BUILDERS: dict[str, OverlayBuilder] = {
    "external": ExternalBuilder(),
    "workshop": WorkshopBuilder(),
}
```
`WorkshopBuilder.build()`:
1. Load the overlay's `WorkshopItem` rows.
2. `os.makedirs(overlay_root / "left4dead2/addons", exist_ok=True)`.
3. Compute `desired = {f"{steam_id}.vpk": cache_path(steam_id)}` for items where `last_downloaded_at IS NOT NULL` and the cache file exists. Skip and warn for items missing a cache file.
4. Inspect existing entries in `addons/` via `os.scandir`: keep entries that are not symlinks into `workshop_cache`; otherwise diff against `desired` and apply changes via `os.unlink` and `os.symlink(absolute_target, link_path)`.
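5. Emit one summary log line with the `created`, `removed`, `unchanged`, and `skipped(uncached)` counts, in the `key=value` form the Task 7 tests assert (for example `unchanged=2 skipped(uncached)=1`).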
6. Check `should_cancel()` between filesystem ops.
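The reconcile steps above could look roughly like the following. `reconcile_addons` is a hypothetical helper name; the sketch leaves out database access, logging, and the `should_cancel()` checks, and it assumes that a non-symlink file already occupying a desired name should be left alone.

```python
import os
from pathlib import Path


def reconcile_addons(
    addons_dir: Path, cache_root: Path, desired: dict[str, Path]
) -> dict[str, int]:
    """Make addons_dir's cache symlinks match `desired`; return change counts."""
    counts = {"created": 0, "removed": 0, "unchanged": 0}
    os.makedirs(addons_dir, exist_ok=True)
    cache_prefix = os.path.join(os.path.abspath(cache_root), "")

    # Collect only the entries we manage: symlinks pointing into the cache.
    managed: dict[str, str] = {}
    with os.scandir(addons_dir) as entries:
        for entry in entries:
            if entry.is_symlink():
                target = os.readlink(entry.path)
                if target.startswith(cache_prefix):
                    managed[entry.name] = target

    # Remove managed links that are obsolete or point at the wrong target.
    for name, target in managed.items():
        want = desired.get(name)
        if want is not None and os.path.abspath(want) == target:
            counts["unchanged"] += 1
        else:
            os.unlink(addons_dir / name)
            counts["removed"] += 1

    # Create whatever is still missing, always with absolute targets.
    for name, want in desired.items():
        link = addons_dir / name
        if os.path.lexists(link):
            continue  # unchanged managed link, or a foreign file we leave alone
        os.symlink(os.path.abspath(want), link)
        counts["created"] += 1
    return counts
```

Running it twice with the same `desired` mapping performs no filesystem changes on the second pass, which is the re-run property the Task 7 tests check.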
Verification command:
```bash
pytest l4d2web/tests/test_overlay_builders.py -q
```
Expected after implementation: PASS.
---
## Task 9: Tests First — Worker Scheduler Truth Table And Coalescing
**Files:**
- Modify: `l4d2web/tests/test_job_worker.py`
Add coverage:
- Truth table for `can_start`:
- `install` not claimed while `refresh_workshop_items`, any `build_overlay`, or any server job is running.
- `refresh_workshop_items` not claimed while `install`, any `build_overlay`, or any server job is running.
- `build_overlay(N)` not claimed while `install`, `refresh_workshop_items`, or another `build_overlay(N)` is running. Two `build_overlay` jobs for **different** overlay IDs claim concurrently.
- Server start/init blocks if `refresh_workshop_items` runs or if any `build_overlay(N)` runs where N ∈ overlays of the server's blueprint.
- `enqueue_build_overlay(overlay_id)`:
- Inserts a new queued job when no pending job exists.
- Returns the existing pending job when one is already queued (coalescing).
- Does not coalesce against running jobs (a new add after build start gets a fresh queued job).
- `refresh_workshop_items` post-completion enqueues `build_overlay` only for overlays whose items had `time_updated` advance or `filename` change; each such enqueue uses the coalescing helper.
Verification command:
```bash
pytest l4d2web/tests/test_job_worker.py -q
```
Expected before implementation: FAIL.
---
## Task 10: Worker Scheduler And New Operations
**Files:**
- Modify: `l4d2web/services/job_worker.py`
Changes:
- Define `OVERLAY_OPERATIONS = {"build_overlay"}` and `GLOBAL_OPERATIONS = {"install", "refresh_workshop_items"}`. Update `malformed_server_job` to allow `server_id IS NULL` for these.
- Extend `SchedulerState` with `running_overlays: set[int]` and `refresh_running: bool`.
- Update `claim_next_job()`:
- Compute `running_overlays` from queries against `running` jobs of operation `build_overlay`.
- Apply the truth-table rules above.
- Continue using `created_at, id` ordering for deterministic claim.
- Add `enqueue_build_overlay(overlay_id: int) -> Job` helper:
- Look for `queued` `build_overlay` job with same `overlay_id`. Return it if present.
- Otherwise insert a new queued job with `overlay_id` set, `server_id=None`, `operation="build_overlay"`.
- Update `run_job` dispatch:
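- `build_overlay` → load `Overlay`, dispatch to `BUILDERS[overlay.type].build(overlay, on_stdout=on_stdout, on_stderr=on_stderr, should_cancel=should_cancel)` (the `OverlayBuilder` protocol declares these callbacks keyword-only).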
- `refresh_workshop_items` → call `steam_workshop.refresh_all(...)`. After completion, for each affected overlay, call `enqueue_build_overlay(overlay_id)`.
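The coalescing behavior of `enqueue_build_overlay` can be illustrated with plain `sqlite3`. The table layout and column names below are assumptions for this sketch, and the helper returns a job ID rather than the `Job` object the plan specifies; the real helper would go through the project's own models.

```python
import sqlite3

# Assumed minimal jobs table, just enough to show coalescing.
SCHEMA = """
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    operation TEXT NOT NULL,
    overlay_id INTEGER,
    server_id INTEGER,
    status TEXT NOT NULL DEFAULT 'queued',
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def enqueue_build_overlay(conn: sqlite3.Connection, overlay_id: int) -> int:
    """Return the ID of the queued build job for this overlay, creating one if needed."""
    row = conn.execute(
        "SELECT id FROM jobs "
        "WHERE operation = 'build_overlay' AND overlay_id = ? AND status = 'queued' "
        "ORDER BY created_at, id LIMIT 1",
        (overlay_id,),
    ).fetchone()
    if row is not None:
        return row[0]  # coalesce onto the job that is already waiting
    cur = conn.execute(
        "INSERT INTO jobs (operation, overlay_id, server_id, status) "
        "VALUES ('build_overlay', ?, NULL, 'queued')",
        (overlay_id,),
    )
    conn.commit()
    return cur.lastrowid
```

Only `queued` rows are matched, so a change that arrives after a build has started still gets a fresh job.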
Verification command:
```bash
pytest l4d2web/tests/test_job_worker.py -q
```
Expected after implementation: PASS.
---
## Task 11: Tests First — Routes, Permissions, And Auto-Rebuild
**Files:**
- Modify: `l4d2web/tests/test_overlays.py`
- Create: `l4d2web/tests/test_workshop_routes.py`
Cover:
- `POST /overlays` with `type='workshop'` and `name` succeeds for any logged-in user; `path` is auto-generated; `user_id` is set; the directory exists at `LEFT4ME_ROOT/overlays/{id}`.
- `POST /overlays` with `type='external'` succeeds only for admins; `user_id` is NULL.
- Duplicate workshop name within the same user is rejected; duplicate names across users are accepted.
- Duplicate external name is rejected.
- Non-admins see `type='external' OR user_id=current_user.id` only when listing overlays.
- `POST /overlays/{id}/items` with one numeric ID adds an association and enqueues a coalesced `build_overlay`. The response is an HTMX fragment of the updated item table.
- `POST /overlays/{id}/items` with a multi-line batch (mix of IDs and URLs) adds all and enqueues one coalesced job for the batch.
- `POST /overlays/{id}/items` with a collection ID resolves members and adds N associations.
- Adding a non-L4D2 item (`consumer_app_id != 550`) returns HTTP 400 with a useful message; no association is created.
- Adding an item already in the overlay returns "already in overlay" (no 500).
- `POST /overlays/{id}/items/{item_id}/delete` removes the association and enqueues a coalesced build.
- `POST /overlays/{id}/build` enqueues the manual rebuild and redirects to the job page.
- `POST /admin/workshop/refresh` is admin-only; non-admins receive 403.
Mock `steam_workshop` HTTP layer for these tests.
Verification command:
```bash
pytest l4d2web/tests/test_overlays.py l4d2web/tests/test_workshop_routes.py -q
```
Expected before implementation: FAIL.
---
## Task 12: Routes And Templates
**Files:**
- Modify: `l4d2web/routes/overlay_routes.py`
- Create: `l4d2web/routes/workshop_routes.py`
- Modify: `l4d2web/routes/page_routes.py`
- Modify: `l4d2web/templates/overlays.html`
- Modify: `l4d2web/templates/overlay_detail.html`
- Create: `l4d2web/templates/_overlay_item_table.html`
- Modify: `l4d2web/templates/admin.html`
- Modify: `l4d2web/app.py` (register the workshop blueprint)
`overlay_routes.py`:
- `create_overlay`: read `type` and `name` from form. No `path` field accepted.
- `type='external'`: admin-only; `user_id=NULL`. After insert, set `path = generate_overlay_path(id)`; call `create_overlay_directory(overlay)`.
- `type='workshop'`: any logged-in user; `user_id=current_user.id`. After insert, set `path = generate_overlay_path(id)`; call `create_overlay_directory(overlay)`.
- `update_overlay`: forbid changing `type` and `path`. Workshop: owner or admin can edit `name`. External: admin-only `name` edits.
- `delete_overlay`: after the row deletes, `shutil.rmtree(LEFT4ME_ROOT/overlays/{path})` only if `overlay.path == str(overlay.id)` (legacy externals are left alone). Cache untouched.
`workshop_routes.py`:
- `POST /overlays/{id}/items`: parse input via `parse_workshop_input`; if a collection ID, resolve members; batch-fetch metadata in `mode="add"`; reject non-550 with HTTP 400; upsert `WorkshopItem` via SQLite `INSERT ... ON CONFLICT DO UPDATE` on `steam_id`; bulk-add associations catching `(overlay_id, workshop_item_id)` unique violations; call `enqueue_build_overlay(overlay_id)`; return rendered `_overlay_item_table.html` fragment.
- `POST /overlays/{id}/items/{item_id}/delete`: ownership check; remove association; call `enqueue_build_overlay(overlay_id)`; return updated fragment.
- `POST /overlays/{id}/build`: ownership check; enqueue (coalesced); redirect to `/jobs/{job_id}`.
- `POST /admin/workshop/refresh`: `@require_admin`; insert a `refresh_workshop_items` queued job; redirect to `/admin/jobs`.
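The upsert and duplicate-association handling described for `POST /overlays/{id}/items` might be sketched with raw `sqlite3` as follows. Table and column names (`workshop_items`, `overlay_workshop_items`) and the helper names are assumptions for illustration; only a subset of the metadata columns is shown.

```python
import sqlite3

SCHEMA = """
CREATE TABLE workshop_items (
    id INTEGER PRIMARY KEY,
    steam_id TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL,
    time_updated INTEGER NOT NULL
);
CREATE TABLE overlay_workshop_items (
    overlay_id INTEGER NOT NULL,
    workshop_item_id INTEGER NOT NULL REFERENCES workshop_items(id),
    UNIQUE (overlay_id, workshop_item_id)
);
"""


def upsert_item(conn: sqlite3.Connection, steam_id: str, title: str, time_updated: int) -> int:
    """Insert the item or refresh its metadata; return its row ID either way."""
    conn.execute(
        "INSERT INTO workshop_items (steam_id, title, time_updated) VALUES (?, ?, ?) "
        "ON CONFLICT(steam_id) DO UPDATE SET "
        "title = excluded.title, time_updated = excluded.time_updated",
        (steam_id, title, time_updated),
    )
    row = conn.execute(
        "SELECT id FROM workshop_items WHERE steam_id = ?", (steam_id,)
    ).fetchone()
    return row[0]


def add_association(conn: sqlite3.Connection, overlay_id: int, item_id: int) -> bool:
    """Return False (instead of raising) when the item is already in the overlay."""
    try:
        conn.execute(
            "INSERT INTO overlay_workshop_items (overlay_id, workshop_item_id) VALUES (?, ?)",
            (overlay_id, item_id),
        )
    except sqlite3.IntegrityError:
        return False
    return True
```

Returning a boolean from `add_association` lets the route report "already in overlay" without a 500.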
`page_routes.py`:
- `overlays()`: admins see all; non-admins see `type='external' OR user_id=current_user.id`.
- `overlay_detail()`: load `WorkshopItem` rows for workshop-type overlays.
Templates:
- `overlays.html`: add Type column. Modal has type radio (External | Workshop) and name field. No path field.
- `overlay_detail.html`: branch on `overlay.type`.
- External view: read-only path display, name edit (admin only).
- Workshop view: a `<textarea>` accepting one or many IDs/URLs plus a radio (Items | Collection); item table with thumbnail (`preview_url`), `steam_id` linked to Steam, title, filename, time_updated, file_size, last_error, Remove; Rebuild button; small status indicator showing the latest related job.
- `_overlay_item_table.html`: renderable standalone for HTMX swaps.
- `admin.html`: add a CSRF-protected "Refresh all workshop items" button.
Verification command:
```bash
pytest l4d2web/tests/test_overlays.py l4d2web/tests/test_workshop_routes.py -q
```
Expected after implementation: PASS.
---
## Task 13: Tests First — Initialize-Time Guard
**Files:**
- Modify: `l4d2web/tests/test_l4d2_facade.py` (or create if missing)
Cover:
- `initialize_server(server_id)` calls `BUILDERS[overlay.type].build()` for each overlay in the blueprint before writing the spec.
- For workshop overlays, when an associated `WorkshopItem` lacks a cache file (`workshop_cache/{steam_id}.vpk` missing), `initialize_server` raises a clear error containing the missing `steam_id`s and the overlay name; the spec is not written; `l4d2ctl initialize` is not invoked.
- For workshop overlays where all items have cache files, the symlinks are present and `l4d2ctl initialize` runs.
Verification command:
```bash
pytest l4d2web/tests/test_l4d2_facade.py -q
```
Expected before implementation: FAIL.
---
## Task 14: Initialize-Time Guard
**Files:**
- Modify: `l4d2web/services/l4d2_facade.py`
Implementation:
- Before writing the temp spec, iterate over the blueprint's overlays and call `BUILDERS[overlay.type].build(...)`.
- For workshop overlays, the builder logs and skips uncached items rather than failing. After all builders run, perform a second pass: query the blueprint's workshop overlays for any associated `WorkshopItem` with no cache file. If any are found, raise an exception whose message names the missing `steam_id`s and points at the overlay page (`Open overlay {name} ({id}) and click Build`).
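The second pass could be as small as the sketch below. The input shape (a list of `(overlay_id, name, steam_ids)` tuples) and the exception name `UncachedWorkshopItemsError` are hypothetical; the real code would query the blueprint's workshop overlays instead.

```python
from pathlib import Path


class UncachedWorkshopItemsError(RuntimeError):
    """Hypothetical error type for the initialize-time guard."""


def assert_workshop_items_cached(
    overlays: list[tuple[int, str, list[str]]], cache_root: Path
) -> None:
    """Raise if any associated item lacks cache_root/{steam_id}.vpk."""
    problems = []
    for overlay_id, name, steam_ids in overlays:
        missing = [sid for sid in steam_ids if not (cache_root / f"{sid}.vpk").is_file()]
        if missing:
            problems.append(
                f"missing cache files for steam_id(s) {', '.join(missing)}. "
                f"Open overlay {name} ({overlay_id}) and click Build"
            )
    if problems:
        raise UncachedWorkshopItemsError("; ".join(problems))
```

Calling this before the temp spec is written keeps the "spec is not written" guarantee from Task 13.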
Verification command:
```bash
pytest l4d2web/tests/test_l4d2_facade.py -q
```
Expected after implementation: PASS.
---
## Task 15: Deploy Provisioning
**Files:**
- Modify: `deploy/install.sh` (or whichever provisioning script creates `/var/lib/left4me/`)
- Modify: `deploy/README.md`
Behavior:
- Provisioning creates `/var/lib/left4me/workshop_cache/` (mode 0755), owned by the web user.
- `deploy/README.md` documents:
- The new directory and its purpose.
- Permission requirement: web user owns; host user reads (shared group with `g+r` if uids differ).
- `LEFT4ME_ROOT` layout updated with the new subtree.
No tests; verify via test deploy.
---
## Task 16: Full Verification And Manual Test Plan
Run focused suites first:
```bash
pytest l4d2web/tests/test_workshop_overlay_models.py -q
pytest l4d2web/tests/test_models.py -q
pytest l4d2web/tests/test_steam_workshop.py -q
pytest l4d2web/tests/test_workshop_paths.py l4d2web/tests/test_overlay_creation.py -q
pytest l4d2web/tests/test_overlay_builders.py -q
pytest l4d2web/tests/test_job_worker.py -q
pytest l4d2web/tests/test_overlays.py l4d2web/tests/test_workshop_routes.py -q
pytest l4d2web/tests/test_l4d2_facade.py -q
```
Then run the full web suite:
```bash
pytest l4d2web/tests -q
```
Manual test plan on the test deploy:
1. Apply migration on a copy of the prod DB; verify all existing overlays read as `type='external'`, `user_id=NULL`; names still unique by partial index; two externals with the same name are rejected.
2. As non-admin, create a workshop overlay. Add a known popular L4D2 addon by URL. Verify the build job auto-enqueues. Verify symlink + cache file. Confirm web UI shows metadata and thumbnail.
3. Paste a multi-line block of item IDs and URLs. Verify all are parsed and added; verify coalescing (only one `build_overlay` job runs).
4. Add a 50-item collection. Verify all 50 metadata rows appear and no UI mention of "from collection". Verify single coalesced build job.
5. Remove an item. Verify auto-rebuild removes the symlink while the cache file remains.
6. As admin, click Refresh All. Verify only items with newer `time_updated` re-download. Verify affected overlays get coalesced `build_overlay` jobs enqueued.
7. Boot an L4D2 server with a workshop overlay attached. Connect locally and confirm the maps appear in the map vote and load.
8. Concurrency probe: enqueue Refresh All while a `build_overlay` is queued; verify scheduler waits per truth table.
9. Initialize-time guard: manually delete a cache file for an item that's in an overlay attached to a server's blueprint. Try to start the server; verify clear error mentioning the missing `steam_id`.
10. Negative: paste a non-L4D2 workshop ID (e.g., a Skyrim mod). Expect HTTP 400 with a clear message; no row inserted.
11. Negative: simulate Steam API down (block egress). Verify add fails with clean error, not 500. Verify refresh job logs the failure.
---
## Commit Strategy
Use small commits after passing relevant tests:
1. `feat(l4d2-web): typed overlays + workshop schema migration`
2. `feat(l4d2-web): steam workshop API client and downloader`
3. `feat(l4d2-web): overlay path helpers and creation`
4. `feat(l4d2-web): overlay builder registry with workshop builder`
5. `feat(l4d2-web): worker support for build_overlay and refresh_workshop_items`
6. `feat(l4d2-web): workshop overlay UI (routes + templates)`
7. `feat(l4d2-web): initialize-time guard for uncached workshop items`
8. `feat(deploy): workshop_cache provisioning`
Do not commit unless the user explicitly asks for commits.
---
## Open Approval Gate
Before modifying implementation files, ask the user for explicit approval to proceed with the workshop-overlays implementation.


@ -1,229 +0,0 @@
# Kernel Overlayfs Helper Implementation Plan
> **Approval status:** User-approved 2026-05-08. Implementation proceeds.
**Goal:** Implement the kernel-overlayfs migration per `docs/superpowers/specs/2026-05-08-kernel-overlayfs-helper-design.md`. Add a Python `left4me-overlay` privileged helper, a `KernelOverlayFSMounter` Python class, wire the existing `OverlayMounter` ABC through `l4d2host/instances.py`, drop `fuse-overlayfs` from the deploy stack, and migrate existing on-disk upper/work directories.
**Architecture:** The web app continues to call `l4d2ctl start|stop|delete <name>`; `l4d2host` continues to expose the same CLI verbs. Internally, `start_instance`/`stop_instance`/`delete_instance` move from a hardcoded subprocess call to `fuse-overlayfs`/`fusermount3` to using `KernelOverlayFSMounter`, which invokes the new sudo helper that mounts in PID 1's namespace via `nsenter`.
---
## Locked Decisions
See `docs/superpowers/specs/2026-05-08-kernel-overlayfs-helper-design.md` for the design rationale. Implementation-relevant summary:
- `left4me-overlay` Python helper in `/usr/local/libexec/left4me/`, owned root, mode 0755, system `/usr/bin/python3`, stdlib only.
- Verbs: `mount <name>`, `umount <name>`.
- Validation in helper: name regex; realpath + allowlist for each lowerdir; exact-prefix check for upper/work/merged; reject upperdir with `user.fuseoverlayfs.*` xattrs; lowerdir count ≤ 500.
- Sudoers verb-constrained: `mount *`, `umount *`.
- `KernelOverlayFSMounter` in `l4d2host/fs/kernel_overlayfs.py` — implements `OverlayMounter`. Derives `name` from the merged path's parent.
- `start_instance` adds `os.path.ismount(merged)` guard before mounting.
- Deploy migration: gated on sentinel file `/var/lib/left4me/.kernel-overlay-migrated`; stops gameservers + web, force-unmounts stale mounts, wipes upper/work, recreates empty.
- Web unit cleanup: drop `MountFlags=shared`, restore `PrivateTmp=true`, rewrite comment block. Keep `NoNewPrivileges` unset.
- Delete `l4d2host/fs/fuse_overlayfs.py` (currently unused — `start_instance` bypasses it).
- AGENTS.md contracts unchanged.
---
## Current Gap
- `l4d2host/instances.py` `start_instance` calls `fuse-overlayfs` directly (lines 85-101); `stop_instance`/`delete_instance` call `fusermount3 -u` directly. The `OverlayMounter` ABC at `l4d2host/fs/base.py` and the `FuseOverlayFSMounter` impl at `l4d2host/fs/fuse_overlayfs.py` exist but are unused.
- Mounts land in the web service's private mount namespace, invisible to host and to gameserver units. `MountFlags=shared` does not fix it.
- No privileged mount helper exists; only `left4me-systemctl` and `left4me-journalctl`.
- Deploy script installs `fuse-overlayfs` apt package and assumes it as a runtime tool.
- Existing `runtime/<name>/upper` directories may carry `user.fuseoverlayfs.*` xattrs that kernel overlayfs would silently ignore (resurrecting "deleted" files).
---
## Task 1: Helper Script + Sudoers + Mounter Class (RED-first)
**Files:**
- Create: `deploy/files/usr/local/libexec/left4me/left4me-overlay` (Python, mode 0755 after deploy)
- Modify: `deploy/files/etc/sudoers.d/left4me`
- Create: `l4d2host/fs/kernel_overlayfs.py`
- Create: `l4d2host/tests/test_kernel_overlayfs.py`
- Create: `l4d2host/tests/test_overlay_helper.py`
- Modify: `deploy/tests/test_deploy_artifacts.py` (assert helper deployed + sudoers entry)
Test plan (RED first):
1. `test_kernel_overlayfs.py::test_mount_invokes_helper_with_name` — mock `run_command`, call `KernelOverlayFSMounter().mount(lowerdirs="/x:/y", upperdir=Path("/var/lib/left4me/runtime/alpha/upper"), workdir=Path("/var/lib/left4me/runtime/alpha/work"), merged=Path("/var/lib/left4me/runtime/alpha/merged"))`, assert argv `["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", "mount", "alpha"]`.
2. `test_kernel_overlayfs.py::test_unmount_invokes_helper_with_umount_verb` — mock + call + assert argv with `umount`.
3. `test_overlay_helper.py` — drives the helper script as a subprocess with `LEFT4ME_OVERLAY_PRINT_ONLY=1` env var (helper prints the would-be `nsenter …` command line and exits 0 instead of execve), and with isolated `LEFT4ME_ROOT=tmp_path`. Cases:
- Valid mount: prints expected `nsenter --mount=/proc/1/ns/mnt -- /bin/mount -t overlay …` line.
- Valid umount: prints expected umount line.
- Bad name (`../escape`, uppercase, empty): exit non-zero, stderr matches.
- Lowerdir traversal (`/etc`, `/var/lib/left4me/../etc`, symlink escape): exit non-zero.
- Missing `instance.env`: exit non-zero.
- Tainted upperdir (with `user.fuseoverlayfs.opaque` xattr): exit non-zero with clear message. (Optional: skip if `setfattr` is unavailable on dev machine; keep test on Linux only via `pytest.mark.skipif`.)
- Lowerdir count > 500: exit non-zero.
4. `test_deploy_artifacts.py` — assert `/usr/local/libexec/left4me/left4me-overlay` is present in deployed files; sudoers includes the new lines.
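Two of the helper's validation rules exercised by the cases above, the name check and the realpath allowlist for lowerdirs, might look like this. The exact name pattern is an assumption (the plan only says "name regex"), as are the function names and the use of `ValueError`; the real helper would exit non-zero with a message on stderr.

```python
import os
import re
from pathlib import Path

# Assumption: lowercase alphanumerics plus "-" and "_", up to 64 characters.
NAME_RE = re.compile(r"[a-z0-9][a-z0-9_-]{0,63}")
MAX_LOWERDIRS = 500


def validate_name(name: str) -> str:
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid instance name: {name!r}")
    return name


def validate_lowerdirs(lowerdirs: str, allowed_roots: list[Path]) -> list[str]:
    """Resolve each colon-separated lowerdir and require it to sit under an allowed root."""
    parts = lowerdirs.split(":")
    if len(parts) > MAX_LOWERDIRS:
        raise ValueError(f"too many lowerdirs: {len(parts)} > {MAX_LOWERDIRS}")
    roots = [Path(os.path.realpath(root)) for root in allowed_roots]
    resolved = []
    for part in parts:
        if not part:
            raise ValueError("empty lowerdir entry")
        # realpath collapses ".." and follows symlinks before the allowlist check.
        real = Path(os.path.realpath(part))
        if not any(real == root or root in real.parents for root in roots):
            raise ValueError(f"lowerdir outside allowed roots: {part!r}")
        resolved.append(str(real))
    return resolved
```

Checking the resolved path, not the raw string, is what defeats both `..` traversal and symlink escapes.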
Implementation:
- Helper script structure: `argparse` for the verb, then path-validation funcs, then `os.execv("/usr/bin/nsenter", [...])` (or printing it under `LEFT4ME_OVERLAY_PRINT_ONLY`).
- `KernelOverlayFSMounter`: `name = merged.parent.name` (with a one-line comment), then `run_command(["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", verb, name], on_stdout=…, on_stderr=…, passthrough=…, should_cancel=…)`.
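The argv construction inside `KernelOverlayFSMounter` is simple enough to isolate as a pure function, which is what the Task 1 tests effectively assert on. `helper_argv` is a hypothetical name; the real class would pass this list to `run_command`.

```python
from pathlib import Path

HELPER = "/usr/local/libexec/left4me/left4me-overlay"


def helper_argv(verb: str, merged: Path) -> list[str]:
    """Build the sudo command line for the overlay helper."""
    if verb not in ("mount", "umount"):
        raise ValueError(f"unsupported verb: {verb!r}")
    # The merged path is runtime/<name>/merged, so its parent directory is the name.
    name = merged.parent.name
    return ["sudo", "-n", HELPER, verb, name]
```

Only the instance name crosses the privilege boundary; the helper re-derives and validates every path itself.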
**Verification:**
```
python3 -m pytest l4d2host/tests/test_kernel_overlayfs.py l4d2host/tests/test_overlay_helper.py deploy/tests/test_deploy_artifacts.py -q
```
Expected before implementation: FAIL on missing class/script. After: all green.
**Commit:** `feat(l4d2-host): KernelOverlayFSMounter + left4me-overlay helper`
---
## Task 2: Wire OverlayMounter Through Lifecycle + Drop Fuse Module
**Files:**
- Modify: `l4d2host/instances.py` (start/stop/delete)
- Modify: `l4d2host/tests/test_lifecycle.py` (update argv assertions, add double-mount guard test)
- Delete: `l4d2host/fs/fuse_overlayfs.py`
- Verify: `l4d2host/fs/__init__.py` does not re-export `FuseOverlayFSMounter`
Test plan (update RED, then GREEN):
1. `test_lifecycle.py::test_start_order` — change assertion: `calls[0]` is now `["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", "mount", "alpha"]`. Adjust setup so the test still creates the merged directory.
2. `test_lifecycle.py::test_stop_succeeds_when_unmount_fails`: assert `cmd[0:5] == ["sudo", "-n", "/usr/local/libexec/left4me/left4me-overlay", "umount", "alpha"]`.
3. `test_lifecycle.py::test_delete_succeeds_when_unmount_fails` — same.
4. NEW `test_lifecycle.py::test_start_refuses_double_mount` — monkeypatch `os.path.ismount` to return True; expect `start_instance` to raise `subprocess.CalledProcessError`; assert NO mount command was issued.
5. `test_lifecycle.py::test_lifecycle_rejects_unsafe_instance_names` — unchanged.
6. `test_lifecycle.py::test_delete_missing_is_noop` — unchanged.
Implementation:
- `instances.py` imports `KernelOverlayFSMounter`. Module-level singleton instance (`_mounter = KernelOverlayFSMounter()`). Replace direct `run_command([...fuse-overlayfs...])` with `_mounter.mount(...)`. Replace direct `run_command([...fusermount3...])` with `_mounter.unmount(...)` (still inside the existing try/except for stop/delete).
- Add the ismount guard at the top of `start_instance` after `runtime_dir` is computed, before `emit_step("mounting runtime overlay...")`. Raise `subprocess.CalledProcessError(returncode=1, cmd=["mount-guard"], stderr="runtime overlay already mounted at <path>; refusing to double-mount")`.
- Delete `l4d2host/fs/fuse_overlayfs.py`.
- Confirm `l4d2host/fs/__init__.py` does not re-export `FuseOverlayFSMounter` (already verified: the file is a single line).
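The double-mount guard described above amounts to a few lines. The function name `refuse_double_mount` is hypothetical; in the plan the check sits inline at the top of `start_instance`.

```python
import os
import subprocess
from pathlib import Path


def refuse_double_mount(merged: Path) -> None:
    """Raise CalledProcessError if something is already mounted at `merged`."""
    if os.path.ismount(merged):
        raise subprocess.CalledProcessError(
            returncode=1,
            cmd=["mount-guard"],
            stderr=f"runtime overlay already mounted at {merged}; refusing to double-mount",
        )
```

Reusing `CalledProcessError` means callers that already handle failed mount commands need no new exception branch.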
**Verification:**
```
python3 -m pytest l4d2host/tests -q
python3 -m pytest l4d2web/tests -q
```
Both green. Web tests: the `"Step: mounting runtime overlay..."` log line is preserved in `start_instance`.
**Commit:** `refactor(l4d2-host): start/stop/delete go through OverlayMounter; drop FuseOverlayFSMounter`
---
## Task 3: Deploy Script Migration (Apt Deps + Wipe Upper/Work)
**Files:**
- Modify: `deploy/deploy-test-server.sh`
- Modify: `deploy/tests/test_deploy_artifacts.py` (assert deploy script contains migration lines; assert `fuse-overlayfs` no longer in apt-get install)
Test plan:
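1. `test_deploy_artifacts.py::test_deploy_script_drops_fuse_overlayfs_apt_dep`: assert that `fuse-overlayfs` no longer appears on the `apt-get`/`dnf` install lines, and `assert "kernel-overlay-migrated" in deploy_script`. A whole-file `assert "fuse-overlayfs" not in deploy_script` would fail, because the migration block below mentions `fuse-overlayfs` in its comment and in the `findmnt -t fuse.fuse-overlayfs` filter.
2. `test_deploy_artifacts.py::test_deploy_script_migration_block_uses_sentinel`: `assert ".kernel-overlay-migrated" in deploy_script`.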
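Because the migration block added by this task itself mentions `fuse-overlayfs`, one way to keep the dependency assertion meaningful is to scope it to the package-install lines. A minimal sketch follows; the helper names and the regular expression used to spot install lines are assumptions about how the script is written.

```python
import re

# Assumption: install lines mention apt-get or dnf followed by "install".
_INSTALL_LINE = re.compile(r"\b(apt-get|dnf)\b.*\binstall\b")


def package_install_lines(script: str) -> list[str]:
    """Return the lines of the deploy script that install packages."""
    return [line for line in script.splitlines() if _INSTALL_LINE.search(line)]


def check_fuse_dependency_dropped(script: str) -> None:
    """Fail if any install line still lists fuse-overlayfs or the sentinel is absent."""
    for line in package_install_lines(script):
        assert "fuse-overlayfs" not in line, line
    assert ".kernel-overlay-migrated" in script
```

In `test_deploy_artifacts.py` this would be called with the text of `deploy/deploy-test-server.sh`.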
Implementation:
In `deploy/deploy-test-server.sh`, drop `fuse-overlayfs` from the apt-get and dnf lines (lines 82, 84). Insert before the existing `systemctl restart left4me-web.service` (line 182):
```sh
# One-time migration: fuse-overlayfs upperdir → kernel overlayfs upperdir.
# fuse-overlayfs running as the left4me user uses user.fuseoverlayfs.* xattrs
# for whiteouts and opaque dirs; kernel overlayfs ignores those, so any
# pre-existing upper/ from the fuse era would resurrect "deleted" files.
sentinel=/var/lib/left4me/.kernel-overlay-migrated
if [ ! -e "$sentinel" ]; then
  $sudo_cmd systemctl stop 'left4me-server@*.service' 2>/dev/null || true
  $sudo_cmd systemctl stop left4me-web.service 2>/dev/null || true
  $sudo_cmd sh -c 'findmnt -t fuse.fuse-overlayfs -o TARGET --noheadings | xargs -r -n1 fusermount3 -u 2>/dev/null || true'
  $sudo_cmd sh -c "findmnt -t overlay -o TARGET --noheadings | grep '/var/lib/left4me/runtime/' | xargs -r -n1 umount 2>/dev/null || true"
  $sudo_cmd sh -c 'for d in /var/lib/left4me/runtime/*/; do [ -d "$d" ] || continue; rm -rf "$d/upper" "$d/work"; mkdir -p "$d/upper" "$d/work"; chown left4me:left4me "$d/upper" "$d/work"; done'
  $sudo_cmd touch "$sentinel"
  $sudo_cmd chown left4me:left4me "$sentinel"
fi
```
**Verification:**
```
python3 -m pytest deploy/tests -q
```
Green.
**Commit:** `chore(deploy): drop fuse-overlayfs apt dep + one-shot migrate upper/work`
---
## Task 4: Web Unit Hardening Cleanup + Docs
**Files:**
- Modify: `deploy/files/usr/local/lib/systemd/system/left4me-web.service`
- Modify: `deploy/tests/test_deploy_artifacts.py`
- Modify: `README.md`
- Modify: `l4d2host/README.md`
- Modify: `deploy/README.md`
Test plan:
1. `test_deploy_artifacts.py::test_web_unit_contains_required_runtime_contract`: replace `assert "MountFlags=shared" in unit` with `assert "MountFlags=" not in unit`; add `assert "PrivateTmp=true" in unit`; add `assert "left4me-overlay" not in unit` (the unit should not reference the helper directly; only the Python code invokes it).
Implementation:
Edit `left4me-web.service`:
- Drop `MountFlags=shared`.
- Restore `PrivateTmp=true`.
- Rewrite the comment block above hardening lines to explain: mounts now go through the `left4me-overlay` helper which `nsenter`s into PID 1's mount namespace, so this unit's namespace is irrelevant to gameserver visibility. `NoNewPrivileges` stays unset because sudo is setuid.
README updates:
- `README.md` (line ~59): drop fuse-overlayfs from tech-stack list; replace with "kernel overlayfs via privileged helper".
- `l4d2host/README.md`: lines 29, 52, 64 reference fuse — update to "kernel overlayfs (mount via the `left4me-overlay` helper deployed to `/usr/local/libexec/left4me/`)".
- `deploy/README.md`: add `/usr/local/libexec/left4me/left4me-overlay` to the privileged-helpers inventory.
**Verification:**
```
python3 -m pytest deploy/tests -q
```
Green. Manual readthrough of the three READMEs confirms no stale fuse references.
**Commit:** `chore(deploy): cleanup left4me-web hardening + docs for kernel overlayfs`
---
## Task 5: End-to-End Verification on `ckn@10.0.4.128`
**Pre-deploy:** the branch is clean, all four prior commits have landed, and all tests are green locally.
**Deploy:**
```
deploy/deploy-test-server.sh ckn@10.0.4.128
```
**Verification commands on the box:**
1. `test -e /var/lib/left4me/.kernel-overlay-migrated && echo migrated` — sentinel created.
2. `systemctl status left4me-web.service --no-pager`: reports `active (running)` with a recent invocation timestamp.
3. From the UI or via `sudo -u left4me /opt/left4me/.venv/bin/l4d2ctl start test-server` — exit 0.
4. `findmnt /var/lib/left4me/runtime/test-server/merged` — shows fstype `overlay` in the host namespace.
5. `systemctl status left4me-server@test-server --no-pager`: reports `active (running)` after the start; **not** in `activating (auto-restart)`. No `status=200/CHDIR` errors in `journalctl -u left4me-server@test-server`.
6. `sudo journalctl -k --since "5 minutes ago" | grep -i apparmor | tail` — no overlay-related denials.
7. Negative test: `sudo -u left4me sudo -n /usr/local/libexec/left4me/left4me-overlay mount '../escape'` — exits non-zero with validation error.
8. Idempotency: `l4d2ctl stop test-server && l4d2ctl stop test-server`: both invocations succeed (the behavior from the prior `fix(l4d2-host): make stop_instance idempotent` commit must still hold).
9. Re-start: `l4d2ctl start test-server` — succeeds, `findmnt` shows the mount again.
10. Double-mount guard (optional; the unit test already covers it): while the server is running, attempt another start via a Python REPL or a second job rather than the UI. `start_instance` raises `CalledProcessError` with the "refusing to double-mount" message.
**On failure of any step:** stop and report. Do NOT push. The deploy script is rerunnable; the migration sentinel stays so wipe doesn't repeat.
---
## Out Of Scope
- See spec's "Out Of Scope" section.
- This plan does not push commits; pushing is a separate user decision after end-to-end verification passes.


@ -1,350 +0,0 @@
# L4D2 Script Overlays Implementation Plan
> **Approval status:** User-approved 2026-05-08. Implementation proceeds.
**Goal:** Implement the `script` overlay type per `docs/superpowers/specs/2026-05-08-l4d2-script-overlays-design.md`. Add an `Overlay.script` TEXT column and `Overlay.last_build_status` enum-string column, a `ScriptBuilder` that runs user bash inside a `bubblewrap` + `systemd-run --scope` sandbox via a new `left4me-script-sandbox` privileged helper, route + UI surface for editing/wiping/rebuilding, and delete the entire managed-globals (`l4d2center_maps`, `cedapug_maps`) subsystem and its daily-refresh timer/CLI.
**Architecture:** The web app continues to enqueue `build_overlay` jobs for any overlay row. The job worker dispatches via `BUILDERS[overlay.type].build(...)`. After this change `BUILDERS = {"workshop": WorkshopBuilder(), "script": ScriptBuilder()}`. The new `ScriptBuilder` writes `overlay.script` to a tmpfile and execs `sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <tmpfile>`, which itself execs `systemd-run --scope --collect ... -- bwrap [namespace flags] /bin/bash /script.sh`. stdout/stderr stream through the existing `run_with_streamed_output` helper into the existing job-log SSE plumbing. The job-completion path writes `Overlay.last_build_status` based on the build outcome. The kernel-overlayfs mount layer (`KernelOverlayFSMounter`) is unchanged.
---
## Locked Decisions
See `docs/superpowers/specs/2026-05-08-l4d2-script-overlays-design.md` for design rationale. Implementation-relevant summary:
- Final overlay type list: `workshop` (unchanged) + `script` (new). Drop `l4d2center_maps`, `cedapug_maps`.
- New columns on `overlays`: `script TEXT NOT NULL DEFAULT ''`, `last_build_status VARCHAR(16) NOT NULL DEFAULT ''`.
- Drop tables (FK order): `global_overlay_item_files`, `global_overlay_items`, `global_overlay_sources`.
- `ScriptBuilder` in `l4d2web/services/overlay_builders.py`, uses existing `run_with_streamed_output`.
- Privileged helper `left4me-script-sandbox` (bash, mode 0755, owned root). `systemd-run --scope --collect -p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 -p CPUQuota=200% -p RuntimeMaxSec=3600 -- bwrap …`. Limits 1 h walltime, 4 GB RAM, 20 GB post-build `du` cap.
- New system user `l4d2-sandbox` (`/usr/sbin/nologin`, no home). New apt dep `bubblewrap`.
- Sudoers verb-unrestricted: `left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox`.
- Daily refresh subsystem deleted: `left4me-refresh-global-overlays.{timer,service}` and `flask refresh-global-overlays` CLI removed. No replacement.
- Wipe is the same sandbox helper invoked with the literal script `find /overlay -mindepth 1 -delete`.
- `auto_refresh` column NOT added in this iteration.
- Test deploy DB is wiped on rollout; migration includes `DELETE FROM overlays WHERE type IN ('l4d2center_maps', 'cedapug_maps')` for safety.
---
## Current Gap
- `l4d2web/models.py` `Overlay` has no `script` or `last_build_status` columns. The 3 globals tables are present.
- `l4d2web/services/overlay_builders.py` `BUILDERS = {"workshop": WorkshopBuilder(), "l4d2center_maps": GlobalMapOverlayBuilder(), "cedapug_maps": GlobalMapOverlayBuilder()}`. No `ScriptBuilder`.
- `l4d2web/services/{global_map_sources,global_overlay_refresh,global_map_cache,global_overlays}.py` exist and are referenced by routes / CLI.
- `l4d2web/services/job_worker.py` carries `refresh_global_overlays_running` plumbing.
- `l4d2web/cli.py` defines `refresh-global-overlays`.
- `l4d2web/routes/overlay_routes.py` has no `/script`, `/wipe`, or `/build` endpoints for non-workshop types.
- `l4d2web/templates/overlays.html` create modal type radio offers only `workshop`.
- `l4d2web/templates/overlay_detail.html` has a global-source block (~lines 34-46) that should not survive.
- `deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.{timer,service}` exist.
- `deploy/deploy-test-server.sh` provisions `global_overlay_cache/` and does not provision `l4d2-sandbox` or install `bubblewrap`.
- Seven `tests/test_global_*.py` files exist and reference removed code.
---
## Task 1: Schema migration (alembic 0005)
**Files:**
- Create: `l4d2web/alembic/versions/0005_script_overlays.py` (revises `0004_drop_legacy_external_overlay_type`).
- Modify: `l4d2web/models.py` (`Overlay` gains `script` and `last_build_status` columns; remove the `GlobalOverlaySource`, `GlobalOverlayItem`, `GlobalOverlayItemFile` model classes).
- Modify: `l4d2web/tests/test_overlay_models.py` (or whichever existing test asserts the Overlay schema; create one if absent) — assert new columns present.
Test plan (RED first):
1. `tests/test_alembic_migrations.py::test_upgrade_0005_adds_script_columns` — apply migrations to a fresh in-memory SQLite, assert `script` and `last_build_status` columns present on `overlays`, assert no `global_overlay_*` tables, assert old data wipe `DELETE FROM overlays WHERE type IN (...)` is part of the upgrade.
2. `tests/test_alembic_migrations.py::test_downgrade_0005_restores_globals` (only if downgrade is supported in the project's migration policy; skip with `pytest.skip` if not — kernel-overlayfs migration is one-way, follow that precedent).
3. `tests/test_overlay_models.py::test_overlay_has_script_columns`: an `Overlay(...)` instance has `script=''` and `last_build_status=''` defaults.
Implementation:
- Migration uses `op.drop_table('global_overlay_item_files')` etc. in correct FK order; uses `op.add_column('overlays', sa.Column('script', sa.Text(), nullable=False, server_default=''))` and similar for `last_build_status` (`sa.String(16)`).
- The `DELETE FROM overlays WHERE type IN ('l4d2center_maps','cedapug_maps')` runs *before* the column additions so the operation is straightforward — these rows do not reference the new columns.
- `models.py`: delete the three globals model classes outright; add the two new columns to `Overlay` with explicit defaults.
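The ordering above (delete retired rows, add columns, drop tables children-first) can be sketched against an in-memory database. This is a rough illustration in plain `sqlite3`, not the Alembic migration itself; the column layout of the three globals tables is an assumption made only so the sketch has something to drop.

```python
import sqlite3


def upgrade(conn: sqlite3.Connection) -> None:
    """Apply the 0005 steps in the order the plan describes (illustrative only)."""
    cur = conn.cursor()
    # 1. Remove rows of the retired overlay types before touching the schema.
    cur.execute("DELETE FROM overlays WHERE type IN ('l4d2center_maps', 'cedapug_maps')")
    # 2. Add the two new NOT NULL columns with empty-string defaults.
    cur.execute("ALTER TABLE overlays ADD COLUMN script TEXT NOT NULL DEFAULT ''")
    cur.execute(
        "ALTER TABLE overlays ADD COLUMN last_build_status VARCHAR(16) NOT NULL DEFAULT ''"
    )
    # 3. Drop the globals tables, child tables before the tables they reference.
    for table in (
        "global_overlay_item_files",
        "global_overlay_items",
        "global_overlay_sources",
    ):
        cur.execute(f"DROP TABLE {table}")
    conn.commit()


def demo() -> tuple[set[str], set[str], int]:
    """Build a toy pre-0005 schema (hypothetical columns), upgrade it, and inspect it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE overlays (id INTEGER PRIMARY KEY, type TEXT NOT NULL);
        CREATE TABLE global_overlay_sources (id INTEGER PRIMARY KEY);
        CREATE TABLE global_overlay_items (
            id INTEGER PRIMARY KEY,
            source_id INTEGER REFERENCES global_overlay_sources(id));
        CREATE TABLE global_overlay_item_files (
            id INTEGER PRIMARY KEY,
            item_id INTEGER REFERENCES global_overlay_items(id));
        INSERT INTO overlays (type)
            VALUES ('workshop'), ('l4d2center_maps'), ('cedapug_maps');
        """
    )
    upgrade(conn)
    columns = {row[1] for row in conn.execute("PRAGMA table_info(overlays)")}
    tables = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    }
    remaining = conn.execute("SELECT COUNT(*) FROM overlays").fetchone()[0]
    return columns, tables, remaining
```

The same three assertions (new columns present, no `global_overlay_*` tables, retired rows gone) are what `test_upgrade_0005_adds_script_columns` is expected to pin against the real migration.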
**Verification:**
```
python3 -m pytest l4d2web/tests/test_alembic_migrations.py l4d2web/tests/test_overlay_models.py -q
```
**Commit:** `feat(l4d2-web): script overlay schema — add overlay.script + last_build_status, drop globals tables`
---
## Task 2: ScriptBuilder + BUILDERS registry update
**Files:**
- Modify: `l4d2web/services/overlay_builders.py` — add `ScriptBuilder`, remove `GlobalMapOverlayBuilder`, change `BUILDERS` dict.
- Rewrite: `l4d2web/tests/test_overlay_builders.py` — drop globals-builder tests, add ScriptBuilder tests.
Test plan (RED first):
1. `test_overlay_builders.py::test_builders_registry`: `set(BUILDERS) == {"workshop", "script"}`. Assert that `"l4d2center_maps"`, `"cedapug_maps"`, and `"external"` are absent.
2. `test_overlay_builders.py::test_script_builder_invokes_helper` — patch `run_with_streamed_output` to capture argv; build an `Overlay(id=42, type='script', script='echo hi')`; assert argv shape `["sudo", "-n", "/usr/local/libexec/left4me/left4me-script-sandbox", "42", <script_path>]` and that the script_path file exists with content `"echo hi"` at invocation time. Verify the tmpfile is unlinked after build.
3. `test_overlay_builders.py::test_script_builder_disk_cap` — fake `subprocess.check_output` for `du` to return `25000000000`; build raises `BuildError("disk-cap-exceeded")` and `on_stderr` was called with the cap message.
4. `test_overlay_builders.py::test_script_builder_streams_output` — fake `run_with_streamed_output` invokes both `on_stdout("hello\n")` and `on_stderr("warn\n")`; both lambda lists capture the lines.
5. `test_overlay_builders.py::test_script_builder_cancel`: `should_cancel` returns True after the first stdout line; assert `run_with_streamed_output` propagated cancellation (that is the existing helper's contract; the test just ensures we pass `should_cancel` through and don't run the disk-budget check on cancel).
6. `test_overlay_builders.py::test_workshop_builder_unchanged` — smoke test that `WorkshopBuilder` still exists and is invokable (regression guard against accidental removal during refactor).
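The argv shape and tmpfile lifetime pinned by `test_script_builder_invokes_helper` could look roughly like the sketch below. The `runner` parameter is a stand-in for `run_with_streamed_output`, whose real signature is not shown in this plan, and `run_script_build` is a hypothetical name; the authoritative `ScriptBuilder` body is the one in the design doc.

```python
import os
import tempfile

HELPER = "/usr/local/libexec/left4me/left4me-script-sandbox"


def build_helper_argv(overlay_id: int, script_path: str) -> list[str]:
    """Return the sudo invocation the plan specifies for the sandbox helper."""
    return ["sudo", "-n", HELPER, str(overlay_id), script_path]


def run_script_build(overlay_id: int, script_text: str, runner):
    """Write the script to a tmpfile, hand the argv to `runner`, always unlink.

    `runner` is a hypothetical callable taking the argv list; it stands in for
    the project's streaming subprocess helper.
    """
    fd, path = tempfile.mkstemp(prefix="overlay-script-", suffix=".sh")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(script_text)
        # The file exists with its final content for the whole runner call.
        return runner(build_helper_argv(overlay_id, path))
    finally:
        # Unlink on success, failure, and cancellation alike.
        os.unlink(path)
```

Putting the unlink in `finally` keeps the "tmpfile is unlinked after build" assertion true even when the runner raises.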
Implementation:
- Add `import os, subprocess, tempfile` at the top of `overlay_builders.py` if not present.
- `ScriptBuilder` exactly as in the spec (verbatim copy from the design doc, §Build Lifecycle).
- Define a small `BuildError` exception class if one doesn't already exist locally; reuse the existing one if `WorkshopBuilder` already raises a similar type.
- `_enforce_disk_budget` calls `subprocess.check_output(["du", "-sb", str(overlay_path(overlay_id))])`; the existing `overlay_path` helper in the module already returns the absolute Path. Parse first whitespace-delimited integer; cap is `20 * 1024**3`.
- Job-completion path: locate the existing path that handles `build_overlay` job success/failure (likely in `services/job_worker.py` or a related orchestration module). Add a single column write: on success `last_build_status='ok'`, on `BuildError` / non-zero exit / cancel `last_build_status='failed'`. Add a `tests/test_job_worker.py::test_build_overlay_writes_last_build_status` covering both branches.
- Remove `GlobalMapOverlayBuilder` class and any helper functions it owns that are not used elsewhere.
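The disk-budget check described above reduces to parsing the first field of `du -sb` output and comparing it with the 20 GiB cap. A minimal sketch, assuming the `du` output has already been decoded to text; `BuildError` here is a local stand-in for whichever exception class the module ends up using, and the message wording is illustrative.

```python
DISK_CAP_BYTES = 20 * 1024**3  # 20 GiB, the cap named in the plan


class BuildError(Exception):
    """Local stand-in for the project's build failure exception."""


def parse_du_bytes(du_output: str) -> int:
    """Return the byte count from `du -sb` output such as '1234\\t/path\\n'."""
    return int(du_output.split()[0])


def enforce_disk_budget(du_output: str, on_stderr) -> int:
    """Raise BuildError('disk-cap-exceeded') when the overlay exceeds the cap."""
    used = parse_du_bytes(du_output)
    if used > DISK_CAP_BYTES:
        on_stderr(f"overlay uses {used} bytes; cap is {DISK_CAP_BYTES} bytes\n")
        raise BuildError("disk-cap-exceeded")
    return used
```

With the value used in `test_script_builder_disk_cap`, 25000000000 bytes exceeds the cap of 21474836480 bytes, so the build is flagged as failed.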
**Verification:**
```
python3 -m pytest l4d2web/tests/test_overlay_builders.py l4d2web/tests/test_job_worker.py -q
```
**Commit:** `feat(l4d2-web): ScriptBuilder + BUILDERS registry update`
---
## Task 3: Delete global-overlay services + CLI command + their tests
**Files:**
- Delete: `l4d2web/services/global_map_sources.py`
- Delete: `l4d2web/services/global_overlay_refresh.py`
- Delete: `l4d2web/services/global_map_cache.py`
- Delete: `l4d2web/services/global_overlays.py`
- Modify: `l4d2web/cli.py`, removing the `refresh-global-overlays` command (lines ~44-55). Drop any imports that become orphaned.
- Delete: `l4d2web/tests/test_global_map_sources.py`
- Delete: `l4d2web/tests/test_global_overlay_models.py`
- Delete: `l4d2web/tests/test_global_overlay_builders.py`
- Delete: `l4d2web/tests/test_global_overlay_cli.py`
- Delete: `l4d2web/tests/test_global_overlay_refresh.py`
- Delete: `l4d2web/tests/test_global_overlays.py`
- Delete: `l4d2web/tests/test_global_map_cache.py`
- Audit & fix: any other module that imports the deleted modules. Likely candidates: `l4d2web/app.py` (CLI registration), `routes/overlay_routes.py`, `routes/page_routes.py`. Resolve by deletion of the dead import / call site, not by stubbing.
- Modify: `pyproject.toml` — drop `py7zr` from dependencies (only used by the deleted globals subsystem).
Test plan:
1. RED-first via grep: `grep -RIn 'global_map_sources\|global_overlay_refresh\|global_map_cache\|global_overlays\|refresh_global_overlays\|GlobalMapOverlayBuilder' l4d2web/ deploy/` — should return zero hits at the end of this task. Add this as `tests/test_no_globals_references.py::test_no_globals_imports` if you want it as a permanent regression guard, otherwise spot-check.
2. Existing `tests/test_cli.py` (or whichever covers Flask CLI) loses any cases for `refresh-global-overlays`; add a `test_refresh_global_overlays_command_removed` that asserts the click command is not registered.
Implementation:
- Delete files via `git rm`.
- In `cli.py`, remove the command function and its `@app.cli.command(...)` decorator. Drop any helper imports that become orphaned.
- Remove `py7zr` from `pyproject.toml` and re-lock if a lockfile is present.
**Verification:**
```
python3 -m pytest l4d2web/tests/ -q
grep -RIn 'global_map_sources\|global_overlay_refresh\|global_map_cache\|global_overlays\|refresh_global_overlays\|GlobalMapOverlayBuilder' l4d2web/ deploy/ || echo "clean"
```
**Commit:** `refactor(l4d2-web): drop global-overlays subsystem in favor of script type`
---
## Task 4: Job worker — drop refresh_global_overlays from scheduler
**Files:**
- Modify: `l4d2web/services/job_worker.py` — remove `"refresh_global_overlays"` from `GLOBAL_OPERATIONS`; remove `refresh_global_overlays_running` field from `SchedulerState` and any references in `can_start()`; check whether `blocked_servers_by_overlay` was added solely for the globals subsystem and remove if so.
- Modify: `l4d2web/tests/test_job_worker.py` — drop `refresh_global_overlays` truth-table rows; add explicit `build_overlay` truth-table cases for `script`-type overlays (mechanically identical to workshop, but pinned by test).
Test plan:
1. `test_job_worker.py::test_global_operations_set`: `GLOBAL_OPERATIONS == {"install", "refresh_workshop_items"}` (or whatever subset remains; pin it).
2. `test_job_worker.py::test_build_overlay_script_type_blocks_per_overlay` — start `build_overlay(overlay_id=7)` for a `script`-type overlay; assert second `build_overlay(overlay_id=7)` cannot start; assert `build_overlay(overlay_id=8)` can.
3. `test_job_worker.py::test_build_overlay_blocks_server_init_on_blueprint_overlay` — existing test, may need re-pinning if it referenced globals.
Implementation:
- Remove the field from the dataclass / TypedDict that backs `SchedulerState`.
- Remove any update sites that flipped the flag (the worker's enqueue / on-start / on-complete paths).
- The remaining mutex rules (`install` / `refresh_workshop_items` are global; `build_overlay` per-overlay; server ops block on overlays in their blueprint) are unchanged structurally.
**Verification:**
```
python3 -m pytest l4d2web/tests/test_job_worker.py -q
```
**Commit:** `refactor(l4d2-web): drop refresh_global_overlays from scheduler`
---
## Task 5: Routes (script update / wipe / build)
**Files:**
- Modify: `l4d2web/routes/overlay_routes.py` — add three POST endpoints.
- Create: `l4d2web/tests/test_script_overlay_routes.py`.
Test plan (RED first):
1. `test_script_overlay_routes.py::test_create_script_overlay` — POST `/overlays` with form `{"name": "x", "type": "script"}` as a regular user → 302 to detail; row exists with `type='script'`, `script=''`, `last_build_status=''`, `user_id=current_user.id`, `path=str(id)`.
2. `test_script_overlay_routes.py::test_admin_creates_system_wide_script_overlay` — admin POST with system-wide flag → row has `user_id=NULL`.
3. `test_script_overlay_routes.py::test_update_script_body_enqueues_build` — POST `/overlays/{id}/script` with `{"script": "echo new"}` → row.script updated; one new `build_overlay` job enqueued for the overlay; second immediate POST coalesces (no second job inserted while first is pending).
4. `test_script_overlay_routes.py::test_manual_rebuild` — POST `/overlays/{id}/build` → enqueues `build_overlay`; coalesces.
5. `test_script_overlay_routes.py::test_wipe_runs_find_delete`: POST `/overlays/{id}/wipe` → invokes `ScriptBuilder.build` (or the underlying helper) with the literal script `find /overlay -mindepth 1 -delete`. After success, `row.last_build_status == ''`. Does not enqueue a `build_overlay` job.
6. `test_script_overlay_routes.py::test_wipe_refuses_during_running_build` — set scheduler state to `build_overlay(overlay_id=7)` running; POST `/overlays/7/wipe` → 409 (or whatever the existing pattern uses for scheduler conflicts), no sandbox invocation.
7. `test_script_overlay_routes.py::test_permissions_non_owner_denied` — user A creates private script overlay; user B POSTs `/overlays/{id}/script` → 403.
8. `test_script_overlay_routes.py::test_permissions_admin_can_edit_any` — admin POSTs `/overlays/{id}/script` for user A's row → 200.
Implementation:
- Mirror the existing `_can_edit_overlay()` permission helper.
- The `/wipe` endpoint can either (a) call `ScriptBuilder` directly with a synthetic `Overlay`-like object whose `.script` is the find command and whose `.id` is the real overlay id, or (b) factor a `_run_sandbox(overlay_id, script_text, on_stdout, on_stderr, should_cancel)` helper out of `ScriptBuilder.build()` and call it from both. (b) is cleaner; do (b).
- Wipe runs **synchronously** in the request thread (small, fast). It does NOT enqueue a job. Surface log output as flash messages or by streaming through the existing log infra — pick whichever matches the existing wipe-equivalent pattern (workshop overlays don't have a wipe; closest analog is the existing delete-overlay flow).
- The `/script` endpoint enqueues via the same `enqueue_build_overlay(overlay_id)` helper used by workshop overlays' add/remove flows. Coalescing is already implemented there.
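The coalescing behaviour the route tests rely on can be sketched with an in-memory queue. This is not the real `enqueue_build_overlay`; the `JobQueue` class, the dict-shaped job rows, and the choice to coalesce only against pending (not running) jobs are assumptions.

```python
from itertools import count


class JobQueue:
    """Toy in-memory queue illustrating build_overlay coalescing."""

    def __init__(self) -> None:
        self.jobs: list[dict] = []
        self._ids = count(1)

    def enqueue_build_overlay(self, overlay_id: int) -> dict:
        """Return the pending build for this overlay, or insert a new one."""
        for job in self.jobs:
            if (
                job["operation"] == "build_overlay"
                and job["overlay_id"] == overlay_id
                and job["status"] == "pending"
            ):
                # A pending build will pick up the latest script anyway.
                return job
        job = {
            "id": next(self._ids),
            "operation": "build_overlay",
            "overlay_id": overlay_id,
            "status": "pending",
        }
        self.jobs.append(job)
        return job
```

Under this reading, two quick POSTs to `/overlays/{id}/script` produce a single job row, which matches the "no second job inserted while first is pending" assertion.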
**Verification:**
```
python3 -m pytest l4d2web/tests/test_script_overlay_routes.py l4d2web/tests/test_overlay_routes.py -q
```
**Commit:** `feat(l4d2-web): script overlay routes (script update / wipe / build)`
---
## Task 6: Templates (overlays.html + overlay_detail.html)
**Files:**
- Modify: `l4d2web/templates/overlays.html`, adding `script` to the create-modal type radio (lines ~29-49).
- Modify: `l4d2web/templates/overlay_detail.html`, adding a `{% if overlay.type == 'script' %}` block with textarea + Save / Rebuild / Wipe buttons + status badge, and deleting the global-source block (lines ~34-46).
- Modify: `l4d2web/tests/test_pages.py` — assert script-section renders for type=`script`, workshop-section renders for type=`workshop`, global-source-section is absent.
Test plan:
1. `test_pages.py::test_overlay_create_modal_offers_script_type` — GET `/overlays`; HTML contains `value="script"` radio.
2. `test_pages.py::test_overlay_detail_script_section` — create script overlay, GET `/overlays/{id}`; HTML contains `<textarea name="script">`, "Rebuild" button, "Wipe" button, status badge element.
3. `test_pages.py::test_overlay_detail_workshop_section_unchanged` — existing workshop detail still has thumbnail grid, add-item form, etc.
4. `test_pages.py::test_overlay_detail_no_global_source_block` — page HTML has no element from the deleted global-source block (check for an attribute or string unique to that block).
Implementation:
- Detail-page wipe button uses a small confirm-modal pattern (copy from the existing delete-overlay confirm modal).
- Status badge: existing CSS classes for ok/warn/error already exist in `static/`; reuse them.
- No new JS deps. Plain `<form method="post">` with HTMX `hx-post` for the script update if a streaming UX is desired (match existing patterns).
**Verification:**
```
python3 -m pytest l4d2web/tests/test_pages.py -q
```
Manual: start dev server (`flask run`), create a script overlay, paste `echo "hi" > foo`, click Save, watch log stream. Then click Wipe; confirm dir is empty. Then click Rebuild; confirm `foo` reappears.
**Commit:** `feat(l4d2-web): script overlay UI`
---
## Task 7: Libexec sandbox helper + sudoers + deploy-artifacts test
**Files:**
- Create: `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` (bash, mode 0755 after deploy, owned root).
- Modify: `deploy/files/etc/sudoers.d/left4me` — append the rule.
- Modify: `deploy/tests/test_deploy_artifacts.py` — assert helper file present + sudoers contains the new line.
Test plan (RED first):
1. `test_deploy_artifacts.py::test_script_sandbox_helper_present` — file exists, mode bits indicate 0755 (or whatever the test framework allows checking pre-deploy), shebang is `#!/bin/bash`.
2. `test_deploy_artifacts.py::test_sudoers_includes_script_sandbox_rule` — sudoers file contains the exact line `left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox`.
3. Optional integration test (skip on non-Linux dev): drive the helper as a subprocess with a synthesized fake `/var/lib/left4me/overlays/1/` and a no-op script, assert `bwrap` invocation happens (use a mock `systemd-run` or `LEFT4ME_SCRIPT_SANDBOX_DRY_RUN=1` env that prints the would-be invocation and exits 0). Mirrors the `LEFT4ME_OVERLAY_PRINT_ONLY=1` pattern from the kernel-overlayfs helper test.
Implementation:
- Helper script verbatim from the spec §Sandbox.
- Sudoers fragment: append (don't replace existing rules). The existing fragment has rules for `left4me-overlay`, `left4me-systemctl`, `left4me-journalctl` — match the same formatting (one rule per line, no trailing whitespace).
**Verification:**
```
python3 -m pytest deploy/tests/test_deploy_artifacts.py -q
bash -n deploy/files/usr/local/libexec/left4me/left4me-script-sandbox
```
**Commit:** `feat(deploy): left4me-script-sandbox helper + sudoers fragment`
---
## Task 8: Deploy script — provision l4d2-sandbox + bubblewrap; drop globals timer
**Files:**
- Modify: `deploy/deploy-test-server.sh` — add `useradd --system ... l4d2-sandbox`, add `apt-get install -y bubblewrap`, ensure helper installation step picks up `left4me-script-sandbox` (likely automatic if it's a glob in `deploy/files/usr/local/libexec/left4me/*`); drop the `mkdir global_overlay_cache` line if present.
- Delete: `deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.timer`
- Delete: `deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.service`
- Modify: `deploy/tests/test_deploy_artifacts.py` — assert the two unit files are absent; assert `useradd l4d2-sandbox` and `apt-get install ... bubblewrap` lines are present in the deploy script.
Test plan:
1. `test_deploy_artifacts.py::test_globals_refresh_units_removed` — files do not exist under `deploy/files/usr/local/lib/systemd/system/`.
2. `test_deploy_artifacts.py::test_deploy_script_provisions_sandbox_user` — grep the deploy script for the useradd line.
3. `test_deploy_artifacts.py::test_deploy_script_installs_bubblewrap` — grep for `bubblewrap` in apt invocations.
Implementation:
- `useradd` line uses `--system --no-create-home --shell /usr/sbin/nologin`. Idempotency: wrap with `id l4d2-sandbox &>/dev/null || useradd ...`.
- `apt-get install`: append `bubblewrap` to whatever package list the script already maintains.
- Globals timer/service deletions: `git rm`.
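The grep-style assertions in the test plan above can be approximated with a small helper that inspects only the package-install lines of the deploy script. A sketch under the assumption that each install command sits on a single line (backslash continuations are not handled); the function names are illustrative, not existing test helpers.

```python
import re

# Matches the package-manager invocations the deploy script is said to use.
INSTALL_RE = re.compile(r"\b(apt-get|dnf)\s+install\b")


def install_lines(script_text: str) -> list[str]:
    """Return the stripped lines that invoke `apt-get install` or `dnf install`."""
    return [
        line.strip() for line in script_text.splitlines() if INSTALL_RE.search(line)
    ]


def installs_package(script_text: str, package: str) -> bool:
    """True when `package` appears as a whole token on any install line."""
    return any(package in line.split() for line in install_lines(script_text))
```

Matching whole tokens avoids a false positive from a comment or an unrelated line that merely mentions the package name.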
**Verification:**
```
python3 -m pytest deploy/tests/ -q
shellcheck deploy/deploy-test-server.sh deploy/files/usr/local/libexec/left4me/left4me-script-sandbox
```
**Commit:** `chore(deploy): provision l4d2-sandbox + bubblewrap; drop globals refresh timer`
---
## Task 9: Full pytest run + drift fixes
**Files:** as needed across the repo.
Test plan: run all three test suites (`l4d2web`, `l4d2host`, `deploy`); chase down any drift caused by removed model classes, dropped imports, or template changes.
```
python3 -m pytest l4d2web/tests/ -q
python3 -m pytest l4d2host/tests/ -q
python3 -m pytest deploy/tests/ -q
```
Implementation: fix what breaks. Common drift sources to expect:
- Tests that imported from deleted modules.
- Tests that asserted exact `BUILDERS` keyset (good — they should have been updated in Task 2).
- Tests that built fixtures with `type='l4d2center_maps'` or `type='cedapug_maps'` — those tests likely belong to the deleted set or need conversion to `type='script'`.
- Template snapshot tests (if any) that captured the deleted global-source block.
**Verification:** all three suites green.
**Commit:** `chore(l4d2-web): test suite drift fixes after script-overlays migration` (only if drift fixes needed; skip if Tasks 1-8 left the suite green)
---
## End-to-end deployment verification (manual, on test host)
After all tasks committed:
1. **Reset deploy:** run `deploy/deploy-test-server.sh` from clean state. Confirm `bubblewrap` installed (`dpkg -l bubblewrap`), `l4d2-sandbox` user exists (`id l4d2-sandbox`), `/usr/local/libexec/left4me/left4me-script-sandbox` is mode 0755 and root-owned, `sudo -ln` as `left4me` shows the new rule.
2. **Sandbox smoke:** as `left4me`, write `/tmp/echo.sh` containing `echo $(whoami) > /overlay/sentinel`. `mkdir -p /var/lib/left4me/overlays/1`. `sudo /usr/local/libexec/left4me/left4me-script-sandbox 1 /tmp/echo.sh`. Confirm `/var/lib/left4me/overlays/1/sentinel` contains `l4d2-sandbox` and is owned by `l4d2-sandbox`. Confirm `/etc/passwd`, `/var/lib/left4me/l4d2web.db`, and `/home` are not visible inside the sandbox by running probe scripts.
3. **Resource limits:**
- `dd if=/dev/zero of=/overlay/big bs=1M count=25000` → succeeds inside sandbox; `ScriptBuilder._enforce_disk_budget` flags the build failed; `last_build_status='failed'`.
- `sleep 7200` → killed at 1 h by `RuntimeMaxSec=3600`.
- Memory hog (`python3 -c "x=' '*(5*1024**3)"`) → OOM at 4 GB.
4. **App-level happy path:** as a non-admin user, create a script overlay via the UI, paste an old `competitive_rework`-style script, Save → build runs, succeeds, addons appear in `overlays/{id}/left4dead2/`. Stack onto a server blueprint, start the server, verify content mounts via the L4D2 admin console (`map workshop/...`).
5. **Wipe:** click Wipe → dir empty (find -delete output in log). Click Rebuild → repopulates. `last_build_status` cycles: `''` → `'ok'`.
6. **Scheduler:** start a server using the script overlay; in another browser tab attempt to Rebuild → 409 / scheduler-blocked. Stop server; rebuild succeeds.
7. **Audit log:** `journalctl --since "5 min ago" | grep run-` shows transient scopes per build with cgroup memory accounting visible.
These are not required for any single commit but should pass before declaring the work done.


@ -1,146 +0,0 @@
# L4D2 Script Sandbox v2 Implementation Plan
> **Approval status:** User-approved 2026-05-08 after smoke-testing the v2 prototype on `ckn@10.0.4.128`.
**Goal:** Replace the bwrap-based sandbox helper with a systemd-only one per `docs/superpowers/specs/2026-05-08-l4d2-script-sandbox-v2-systemd.md`. Drop the `bubblewrap` apt dep. Tighten `left4me.db` file mode to 0640 root:left4me. Update the deploy-artifact tests to assert the new helper shape.
**Architecture:** See spec. Helper invokes `systemd-run --pipe --wait` in service-unit mode with full hardening directives. No bwrap. Web-app side (`ScriptBuilder`, `run_sandboxed_script`, routes) is unchanged.
---
## Locked Decisions
See spec §Locked Decisions for rationale. Implementation summary:
- Helper file at the same path (`deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`) is rewritten in place.
- The sudoers rule is unchanged.
- `bubblewrap` dropped from `apt-get install` / `dnf install` lines.
- `left4me.db` chmod 0640 added to deploy script as a post-init step.
- Sandbox UID, system user, overlay-dir chown logic, and ScriptBuilder API stay the same.
---
## Current Gap
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` invokes `systemd-run --scope ... -- bwrap [namespace flags] /bin/bash /script.sh`.
- `deploy/deploy-test-server.sh` line ~84 installs `bubblewrap` via apt/dnf.
- `deploy/tests/test_deploy_artifacts.py::test_script_sandbox_helper_invokes_systemd_run_and_bwrap` asserts `bwrap`, `--unshare-pid`, `--uid=l4d2-sandbox`, etc.
- `deploy/tests/test_deploy_artifacts.py::test_deploy_script_installs_bubblewrap` asserts `bubblewrap` is in apt/dnf install lines.
- `left4me.db` is created at deploy time with the default 0644 permissions; any host user can read it.
---
## Task 1: Rewrite the sandbox helper to be systemd-only
**Files:**
- Modify: `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` — replace the `systemd-run --scope … bwrap …` invocation with `systemd-run --service --pipe --wait …` carrying the hardening directives.
Test plan:
1. `bash -n` syntax check (already covered by `test_script_sandbox_helper_passes_shell_syntax_check`).
2. `test_deploy_artifacts.py::test_script_sandbox_helper_invokes_systemd_run_and_bwrap` is replaced by a new pin: `test_script_sandbox_helper_invokes_systemd_run_with_hardening`. Asserts:
- No `bwrap` reference remains.
- `systemd-run` is invoked with `--pipe`, `--wait`, `--collect`, `--unit=` (transient service unit form, no `--scope`).
- All hardening directives present: `NoNewPrivileges=yes`, `ProtectSystem=strict`, `ProtectHome=yes`, `PrivateTmp=yes`, `PrivateDevices=yes`, `PrivateIPC=yes`, `ProtectKernelTunables=yes`, `ProtectKernelModules=yes`, `ProtectKernelLogs=yes`, `ProtectControlGroups=yes`, `RestrictNamespaces=yes`, `RestrictSUIDSGID=yes`, `LockPersonality=yes`, `MemoryDenyWriteExecute=yes`, `SystemCallFilter=`, `CapabilityBoundingSet=` (empty), `User=l4d2-sandbox`, `Group=l4d2-sandbox`.
- `TemporaryFileSystem=` covers `/etc` and `/var/lib`.
- `BindReadOnlyPaths=` includes `/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives` and the script bind `${SCRIPT}:/script.sh`.
- `BindPaths=` carries the overlay bind.
- Cgroup limits unchanged (`MemoryMax=4G`, `MemorySwapMax=0`, `TasksMax=512`, `CPUQuota=200%`, `RuntimeMaxSec=3600`).
3. Existing `test_script_sandbox_helper_dry_run_mode` keeps passing — the dry-run guard still short-circuits before `systemd-run`.
4. Existing `test_script_sandbox_helper_validates_overlay_id` keeps passing — argument validation is unchanged.
Implementation: helper body verbatim from the spec §Helper.
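The new pin `test_script_sandbox_helper_invokes_systemd_run_with_hardening` is essentially a substring audit of the helper text. A possible shape is sketched below; `helper_problems` is a hypothetical name, and the bind-path assertions (`TemporaryFileSystem=`, `BindReadOnlyPaths=`, `BindPaths=`) are left out to keep the sketch short.

```python
REQUIRED_FLAGS = ("--pipe", "--wait", "--collect", "--unit=")

REQUIRED_DIRECTIVES = (
    "NoNewPrivileges=yes", "ProtectSystem=strict", "ProtectHome=yes",
    "PrivateTmp=yes", "PrivateDevices=yes", "PrivateIPC=yes",
    "ProtectKernelTunables=yes", "ProtectKernelModules=yes",
    "ProtectKernelLogs=yes", "ProtectControlGroups=yes",
    "RestrictNamespaces=yes", "RestrictSUIDSGID=yes", "LockPersonality=yes",
    "MemoryDenyWriteExecute=yes", "SystemCallFilter=", "CapabilityBoundingSet=",
    "User=l4d2-sandbox", "Group=l4d2-sandbox",
    # Cgroup limits carried over unchanged from v1.
    "MemoryMax=4G", "MemorySwapMax=0", "TasksMax=512", "CPUQuota=200%",
    "RuntimeMaxSec=3600",
)


def helper_problems(helper_text: str) -> list[str]:
    """Return human-readable problems; an empty list means the helper passes."""
    problems = []
    if "bwrap" in helper_text:
        problems.append("bwrap still referenced")
    if "--scope" in helper_text:
        problems.append("--scope still used")
    for token in REQUIRED_FLAGS + REQUIRED_DIRECTIVES:
        if token not in helper_text:
            problems.append(f"missing {token}")
    return problems
```

Returning a list rather than asserting token by token lets a single failing test run report every missing directive at once.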
**Verification:**
```
python3 -m pytest deploy/tests/test_deploy_artifacts.py -q
bash -n deploy/files/usr/local/libexec/left4me/left4me-script-sandbox
```
**Commit:** `refactor(deploy): rewrite left4me-script-sandbox to systemd-only — drop bwrap`
---
## Task 2: Drop bubblewrap apt/dnf dep + tighten left4me.db mode
**Files:**
- Modify: `deploy/deploy-test-server.sh` — remove `bubblewrap` from `apt-get install` / `dnf install` package lists; add a post-init step that ensures `left4me.db` is mode 0640 owned `root:left4me`.
- Modify: `deploy/tests/test_deploy_artifacts.py` — replace `test_deploy_script_installs_bubblewrap` with `test_deploy_script_does_not_install_bubblewrap`; add `test_deploy_script_tightens_left4me_db_permissions`.
Test plan:
1. `test_deploy_script_does_not_install_bubblewrap` — for each `apt-get install` / `dnf install` line, `bubblewrap` is absent.
2. `test_deploy_script_tightens_left4me_db_permissions` — script contains `chmod 0640 /var/lib/left4me/left4me.db` and `chown root:left4me /var/lib/left4me/left4me.db` (in either order).
3. `test_deploy_script_shell_syntax` keeps passing (`sh -n`).
Implementation:
- Remove the bare `bubblewrap` token from the two install lines.
- After the `alembic upgrade head` step (which creates the DB if missing), add:
```
$sudo_cmd chown root:left4me /var/lib/left4me/left4me.db
$sudo_cmd chmod 0640 /var/lib/left4me/left4me.db
```
Idempotent — re-runs are no-ops.
**Verification:**
```
python3 -m pytest deploy/tests/test_deploy_artifacts.py -q
sh -n deploy/deploy-test-server.sh
```
**Commit:** `chore(deploy): drop bubblewrap apt dep + tighten left4me.db mode to 0640 root:left4me`
---
## Task 3: Deploy + smoke-test on the test host
**Files:** none.
This is an operational verification step, not a code change. Run `deploy/deploy-test-server.sh ckn@10.0.4.128`, then on the host re-run the same smoke battery used to validate the prototype:
1. **Identity / privileges**: `id` returns `uid=996 gid=985`; `/proc/self/status` shows `NoNewPrivs: 1` and `CapBnd: 0000000000000000`.
2. **Filesystem isolation**: `/etc/passwd` absent, `/etc/alternatives/awk` present, `/var/lib/left4me/left4me.db` absent, `/home` inaccessible, `/usr` not writable, `/overlay` writable.
3. **Tools + network**: `awk` resolves through `/etc/alternatives`; `curl https://steamcommunity.com/` returns 200.
4. **Cgroup limits**: while a 5s-sleep script runs, `cat /sys/fs/cgroup/.../memory.max` returns `4294967296`; `pids.max` `512`; `cpu.max` `200000 100000`.
5. **Memory cap**: 5 GB Python alloc raises `MemoryError`.
6. **Wipe**: `find /overlay -mindepth 1 -delete` empties the overlay dir.
7. **Seccomp / restriction probes**: `unshare -U`, `mount -t tmpfs`, `setarch -X`, `bpf` setsockopt all fail with EPERM/EINVAL.
8. **Build via web UI**: log in as admin, create a script overlay with `echo "hi" > foo`, click Save, confirm job succeeds and `foo` appears in `/var/lib/left4me/overlays/{id}/foo`.
9. **DB hardening**: `stat -c "%a %U:%G" /var/lib/left4me/left4me.db` returns `640 root:left4me`.
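The expected cgroup readings in check 4 follow directly from the unit properties: `MemoryMax=4G` uses binary gigabytes, and `CPUQuota=200%` is expressed as a quota over a 100 ms period. A small worked calculation, assuming the 100000 µs period shown in the expected `cpu.max` value:

```python
def memory_max_bytes(gib: int) -> int:
    """Convert a systemd size like `4G` (base 1024) to bytes."""
    return gib * 1024**3


def cpu_max(quota_percent: int, period_us: int = 100_000) -> str:
    """Render `CPUQuota=<n>%` the way the cgroup v2 `cpu.max` file shows it."""
    quota_us = quota_percent * period_us // 100
    return f"{quota_us} {period_us}"
```

So `memory_max_bytes(4)` gives 4294967296 and `cpu_max(200)` gives `200000 100000`, matching the values the check expects to read back.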
Mark this task complete only after every check passes on the live host.
**Commit:** none (operational verification — record results in conversation/PR description).
---
## Task 4: Drift sweep + push
**Files:** as needed across the repo.
Run the full test suite for all three packages; chase any drift caused by the helper rewrite or deploy-script changes.
```
python3 -m pytest l4d2web/tests/ -q
python3 -m pytest l4d2host/tests/ -q
python3 -m pytest deploy/tests/ -q
```
Implementation: fix what breaks. Expected: nothing new should break, since the Python-side contract is unchanged. If something does, treat it as a sign of an unintended coupling and address.
Push the commits to `origin/master`.
**Verification:** all three suites green; `git status` clean; commits visible on `git.sublimity.de/cronekorkn/left4me`.
**Commit:** none unless drift fixes are needed.
---
## Rollback plan
If Task 3 surfaces a blocker (a hardening directive breaks a real-world script class, seccomp filter is too narrow, BindPaths semantics differ on the host's systemd version), roll back via `git revert` of Tasks 1+2 and redeploy. Git history preserves both the v1 and v2 helper. The Python side never changed, so reverting only the deploy artifacts is sufficient — no DB migration to undo, no template change to roll back.


@ -1,89 +0,0 @@
# L4D2 Script Sandbox v3 Implementation Plan
> **Approval status:** User-approved 2026-05-08; implemented and pushed in `7e66936`. This plan is recorded retrospectively for symmetry with the v1 / v2 plans.
**Goal:** Restrict the sandbox to public-internet egress per `docs/superpowers/specs/2026-05-08-l4d2-script-sandbox-v3-egress-filter.md`. Bind a static public-resolver `resolv.conf` into the sandbox.
---
## Locked Decisions (see spec for rationale)
- `IPAddressDeny=` only; no `IPAddressAllow=any`.
- Explicit CIDRs (no `localhost` / `link-local` shorthand keywords — `systemd-run -p` parser rejects them).
- Static `nameserver 1.1.1.1` + `nameserver 8.8.8.8` in a sandbox-only resolv.conf.
- `AF_UNIX` left enabled.
---
## Current Gap (at start of this iteration)
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` (v2) shares the host network namespace with no egress filter.
- The helper bind-mounts `/etc/resolv.conf` from the host into the sandbox (which points at private-IP DNS).
- `deploy/deploy-test-server.sh` does not install a sandbox-only resolv.conf.
- No deploy-artifact tests for `IPAddressDeny=` or for the resolv.conf shape.
---
## Task 1: Add `IPAddressDeny=`, swap resolv.conf bind, ship the static file
**Files:**
- Create: `deploy/files/etc/left4me/sandbox-resolv.conf` — two `nameserver` lines + a header comment.
- Modify: `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` — add `-p IPAddressDeny="..."` directive (11 explicit CIDRs); replace the `/etc/resolv.conf:/etc/resolv.conf` token in `BindReadOnlyPaths=` with `/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf`.
- Modify: `deploy/deploy-test-server.sh` — add an `install -m 0644 -o root -g root .../sandbox-resolv.conf /etc/left4me/sandbox-resolv.conf` line near the existing `host.env` install.
- Modify: `deploy/tests/test_deploy_artifacts.py` — extend `test_script_sandbox_helper_invokes_systemd_run_with_hardening` to assert each CIDR is present and that `IPAddressAllow=any` is **absent** (regression guard); update the BindReadOnlyPaths assertion to expect the sandbox-resolv.conf bind; add `test_sandbox_resolv_conf_exists` and `test_deploy_script_installs_sandbox_resolv_conf`.
Test plan (RED-first not used here; the work was driven by smoke-test feedback against a live host):
1. `test_script_sandbox_helper_invokes_systemd_run_with_hardening`: `IPAddressDeny=` present with all 11 CIDRs; no `IPAddressAllow=any`; resolv.conf bind path is `/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf`.
2. `test_sandbox_resolv_conf_exists` — file present, ≥2 nameservers, all in non-private space.
3. `test_deploy_script_installs_sandbox_resolv_conf` — deploy script references both source path under `deploy/files/etc/left4me/sandbox-resolv.conf` and target path `/etc/left4me/sandbox-resolv.conf`.
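The "≥2 nameservers, all in non-private space" check from item 2 could look roughly like the sketch below. It works on the file's text rather than the repo path, and `public_nameservers` is a hypothetical helper name, not the function used in `test_deploy_artifacts.py`.

```python
import ipaddress


def public_nameservers(resolv_conf_text: str) -> list[str]:
    """Return nameserver addresses, raising if any sits outside public space."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            addr = ipaddress.ip_address(fields[1])
            if not addr.is_global:
                raise ValueError(f"non-public resolver: {addr}")
            servers.append(str(addr))
    return servers


# Sample content shaped like the file described in Locked Decisions.
sample = "# sandbox-only resolvers\nnameserver 1.1.1.1\nnameserver 8.8.8.8\n"
assert len(public_nameservers(sample)) >= 2
```

Using `is_global` rather than a hand-written private-range list keeps loopback, link-local, and RFC 1918 addresses out with one check.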
**Verification:**
```
sh -n deploy/deploy-test-server.sh
bash -n deploy/files/usr/local/libexec/left4me/left4me-script-sandbox
python3 -m pytest deploy/tests/ -q
```
**Commit:** `feat(deploy): restrict script-sandbox egress to public internet only`
---
## Task 2: Deploy + smoke-test on `ckn@10.0.4.128`
**Files:** none.
Run `deploy/deploy-test-server.sh ckn@10.0.4.128`. Then on the host, invoke the helper with a probe script that opens TCP connections to:
- `1.1.1.1:443` — must connect (public)
- `127.0.0.1:8000` — must block (web app on loopback)
- `127.0.0.1:22` — must block (sshd on loopback)
- `10.0.4.128:22` — must block (host's external SSH on private LAN)
- `10.0.0.1:53` — must block (LAN DNS resolver)
Plus `curl -m 5 https://steamcommunity.com/` end-to-end (DNS + HTTPS) → 200.
Inside the sandbox, `cat /etc/resolv.conf` must show the two public resolvers.
If any of the localhost / private targets connects, the deny is being silently overridden — see spec §Locked Decisions point 1.
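A probe script for the checklist above might be sketched as follows. The target list comes from this task; `ASSUMED_DENY` is only an illustrative subset, since the helper's real `IPAddressDeny=` list (11 CIDRs) is not reproduced in this plan, and all function names are hypothetical.

```python
import ipaddress
import socket

# ASSUMPTION: illustrative subset of the denied ranges, not the helper's real list.
ASSUMED_DENY = [ipaddress.ip_network(c) for c in ("127.0.0.0/8", "10.0.0.0/8")]

# (host, port, must_connect) taken from the checklist above.
TARGETS = [
    ("1.1.1.1", 443, True),
    ("127.0.0.1", 8000, False),
    ("127.0.0.1", 22, False),
    ("10.0.4.128", 22, False),
    ("10.0.0.1", 53, False),
]


def expected_to_connect(host: str) -> bool:
    """True when the address falls outside every assumed deny range."""
    addr = ipaddress.ip_address(host)
    return not any(addr in net for net in ASSUMED_DENY)


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Attempt a TCP connection; any OSError (refused, filtered, timeout) is False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_probe() -> list[str]:
    """Return one line per target whose outcome differs from the expectation."""
    failures = []
    for host, port, must_connect in TARGETS:
        if can_connect(host, port) != must_connect:
            failures.append(f"{host}:{port} expected connect={must_connect}")
    return failures
```

Run inside the sandbox, an empty list from `run_probe()` would correspond to the pass condition described above.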
**Commit:** none — operational verification.
---
## Lessons surfaced during execution
These belong in the spec but are repeated here as the "things the next person should not have to rediscover":
- **`IPAddressAllow=any` silently overrides every `IPAddressDeny=` rule** on this systemd 257 / kernel 6.12 combo, despite documentation stating "more specific rule wins". The negative test (`IPAddressAllow=any not in text`) locks this in.
- **systemd-run's `-p` parser rejects the `localhost` / `link-local` / `multicast` shorthand keywords** even though they parse fine in unit files. Use explicit CIDRs.
- **`/var/lib/left4me/.../left4me.db` is mode 0644 by default** — writing this file from the web app left it world-readable. Tightening to 0640 root:left4me happens in v2's deploy-script change; v3 does not re-touch it.
- **bpftool ships separately on Debian.** It's not needed for runtime, but `apt-get install bpftool` is useful for inspecting `sd_fw_egress` attach state when debugging filter behaviour.
---
## Rollback
`git revert 7e66936` and redeploy. The change is purely in deploy artifacts; no app code, no DB migration. Reverting reopens the previous v2 reachability.


@ -1,161 +0,0 @@
# Overlay File Tree Implementation Plan
> **Approval status:** User-approved 2026-05-08; implemented + deployed in the same session. This plan is committed retrospectively to record the work.
**Goal:** Build the overlay-detail "Files" section per `docs/superpowers/specs/2026-05-08-overlay-file-tree-design.md` — a server-rendered collapsible tree of `${LEFT4ME_ROOT}/overlays/{overlay.id}/` with HTMX lazy expansion and click-to-download for individual files. Read-only; same access rule as the rest of the overlay detail page.
**Architecture:** A new `files_bp` blueprint exposes two GETs: `/overlays/<id>/files?path=<rel>` returns the listing as an HTML fragment (used both for first paint at the root level via `page_routes.overlay_detail` context, and for HTMX swaps when a folder expands), and `/overlays/<id>/files/download?path=<rel>` streams a single file. Pure helpers live in `l4d2web/services/overlay_files.py`: `safe_resolve_for_listing` (refuses symlink escape from overlay root), `safe_resolve_for_download` (allows symlink targets anywhere under `LEFT4ME_ROOT` — workshop addons stream from the shared cache; absolute symlinks to `/etc/passwd` are still blocked), and `list_directory` (one-level scan, dirs-first sort, 500-entry cap, symlink + broken-symlink markers, resolved size for files). Two Jinja partials (`_overlay_file_tree.html`, `_overlay_file_node.html`) plus a 12-line event-delegated `static/js/file-tree.js` for collapse/re-expand handle the UI; styles append to `static/css/components.css` against existing tokens.
---
## Locked Decisions
See the design doc for rationale. Implementation-relevant summary:
- New blueprint `files_bp` registered in `l4d2web/app.py` next to `overlay_bp`.
- Path resolution chains through `l4d2host.paths.overlay_path()` (already validates the overlay ref + resolves under `LEFT4ME_ROOT/overlays/`) and `l4d2web.services.security.validate_overlay_ref` (rejects empty/`.`/`..`/absolute/whitespace/backslash for the sub-path component).
- Listing rule: target must be a descendant of `overlay_root` after `Path.resolve()`. Download rule: real path must be a descendant of `LEFT4ME_ROOT` after `os.path.realpath()`.
- Tree shape: single recursive partial. `_overlay_file_tree.html` renders `<ul>`; `_overlay_file_node.html` renders one folder or file `<li>`. Folder buttons carry `data-files-url="/overlays/{id}/files?path=…"`. `static/js/file-tree.js` handles every click — toggles `aria-expanded` + `hidden`, fetches once on first expand, dedupes rapid clicks via `dataset.loaded`.
- `DEFAULT_MAX_ENTRIES = 500` in the helper module; re-resolved per call so tests can monkeypatch.
- No changes to `l4d2host`, builders, or workshop/script edit flows.
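The two containment rules above differ only in which root the resolved path is compared against. A minimal sketch, assuming sub-path validation (`validate_overlay_ref`) has already run; the function names here are illustrative and are not the real `safe_resolve_*` helpers:

```python
import os
from pathlib import Path


def is_within(candidate: Path, root: Path) -> bool:
    """True when candidate is root itself or a descendant of root."""
    return candidate == root or root in candidate.parents


def listing_allowed(overlay_root: Path, sub_path: str) -> bool:
    """Listing rule: the resolved target must stay inside the overlay root."""
    root = overlay_root.resolve()
    target = (root / sub_path).resolve(strict=False)
    return is_within(target, root)


def download_allowed(overlay_root: Path, sub_path: str, left4me_root: Path) -> bool:
    """Download rule: the real path may leave the overlay but not LEFT4ME_ROOT."""
    real = Path(os.path.realpath(overlay_root / sub_path))
    return is_within(real, left4me_root.resolve())
```

With this split, a workshop symlink pointing into the shared cache is not listable as a directory target but is still downloadable, while a symlink pointing outside `LEFT4ME_ROOT` fails both checks.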
---
## Task 1: Pure helpers — path safety + directory listing
**Files:**
- Create: `l4d2web/services/overlay_files.py`, defining `safe_resolve_for_listing`, `safe_resolve_for_download`, `list_directory`, `_format_size`, `DEFAULT_MAX_ENTRIES`.
- Create: `l4d2web/tests/test_overlay_files.py` — 20 tests (path safety, listing semantics, symlink + broken-symlink handling, sort order, truncation cap, human-size formatting).
Test plan (RED first):
1. Listing returns overlay root for empty sub-path; joins under root for nested sub-path; rejects `..`, absolute path, empty component (`foo//bar`); rejects symlink escaping the overlay root even when target sits in `workshop_cache/`.
2. Download rejects empty path; returns real path for a regular file; follows a symlink into `workshop_cache/`; rejects a symlink to a path outside `LEFT4ME_ROOT`; rejects `..` and absolute paths.
3. `list_directory`: empty dir → empty list, truncated 0; dirs-first then files, both case-insensitive alphabetical; `kind ∈ {"dir", "file"}`; `rel` is forward-slash relative to overlay root; symlinks marked with `is_symlink=True` and resolved-target size; broken symlinks marked `broken=True` with `size=None`; truncation at supplied cap returns first N + `truncated_count`; `size_human` formats `5 B` and `3.0 MB` correctly.
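The ordering and truncation expectations in item 3 can be illustrated with a reduced sketch that works on `(name, is_dir)` tuples instead of `os.scandir` entries. `sort_and_truncate` is a hypothetical name; the real logic lives inside `list_directory`.

```python
def sort_and_truncate(entries, max_entries=500):
    """entries: iterable of (name, is_dir). Returns (kept, truncated_count)."""
    # Directories first, then files; each group case-insensitive alphabetical.
    ordered = sorted(entries, key=lambda e: (0 if e[1] else 1, e[0].casefold()))
    kept = ordered[:max_entries]
    return kept, len(ordered) - len(kept)


entries = [("b.txt", False), ("Zeta", True), ("alpha", True), ("A.cfg", False)]
kept, truncated = sort_and_truncate(entries, max_entries=3)
print(kept)       # [('alpha', True), ('Zeta', True), ('A.cfg', False)]
print(truncated)  # 1
```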
**Implementation:**
- `safe_resolve_for_listing` calls `l4d2host.paths.overlay_path(overlay_path_value).resolve()` for the overlay root, short-circuits on empty `sub_path`, validates the sub-path via `validate_overlay_ref`, then `(overlay_root / sub_path).resolve(strict=False)` and asserts the result is the overlay root or a descendant.
- `safe_resolve_for_download` rejects empty `sub_path`, validates, builds `overlay_root / sub_path`, applies `os.path.realpath()`, asserts the result is under `get_left4me_root().resolve()`.
- `list_directory(target, overlay_root, *, max_entries=None)` uses `os.scandir` (free `stat` cache, `follow_symlinks` toggle). Per entry: `is_symlink = entry.is_symlink()`; `is_dir = entry.is_dir(follow_symlinks=True)` inside a try (OSError → broken=True, kind=file, size=None); regular files use `entry.stat(follow_symlinks=True).st_size`. `rel` is `"/".join(Path(entry.path).relative_to(overlay_root).parts)`. Sort by `(0 if dir else 1, name.casefold())`. Truncate to `max_entries or DEFAULT_MAX_ENTRIES`.
- `_format_size`: bytes (`N B`, no decimal) up to 1024, then KB/MB/GB/TB at one decimal place.
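One way the `_format_size` rule could be written, as a sketch under the stated formatting rule (plain bytes below 1024, one decimal place from KB upward). The public name `format_size` is used here only to keep the example standalone.

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count as 'N B' below 1024, else one decimal in KB..TB."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    units = ("KB", "MB", "GB", "TB")
    value = num_bytes / 1024
    index = 0
    # Step up a unit while the value still has four or more integer digits.
    while value >= 1024 and index < len(units) - 1:
        value /= 1024
        index += 1
    return f"{value:.1f} {units[index]}"


print(format_size(5))                # 5 B
print(format_size(3 * 1024 * 1024))  # 3.0 MB
```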
**Verification:**
```
pytest l4d2web/tests/test_overlay_files.py -q
```
**Commit:** part of Task 4's bundled `feat` commit.
---
## Task 2: HTTP routes — files_bp blueprint
**Files:**
- Create: `l4d2web/routes/files_routes.py`, defining `files_bp` with `GET /overlays/<id>/files` (fragment) and `GET /overlays/<id>/files/download` (stream).
- Modify: `l4d2web/app.py`: add `from l4d2web.routes.files_routes import bp as files_bp` and `app.register_blueprint(files_bp)` next to `overlay_bp`.
- Create: `l4d2web/tests/test_overlay_files_routes.py` — 16 HTTP-level tests at this stage (3 more added in Task 4).
Test plan (RED first):
- Fragment: 200 + entries for root listing; 200 + entries for sub-directory; 400 on `..`, absolute path, empty component; 404 on unknown overlay; 404 on missing sub-dir; 403 on foreign user's overlay; 200 for admin viewing foreign overlay; truncation cap exposes "+ N more" footer (monkeypatch `DEFAULT_MAX_ENTRIES`); broken symlink rendered with `broken` badge and no `<a>` link.
- Download: 200 + `Content-Disposition: attachment` + exact byte match for regular file; 200 + cache content for workshop-cache symlink; 400 for symlink resolving outside `LEFT4ME_ROOT`; 400 for directory target; 404 for missing file; 403 for foreign user's overlay.
**Implementation:**
- Decorator stack: `@files_bp.get(...)` + `@require_login`. Auth gate inside the handler mirrors `page_routes.overlay_detail:194` (`g.user.admin or overlay.user_id is None or overlay.user_id == g.user.id`).
- Shared `_load_overlay_for_user(overlay_id, user)` does the lookup, the auth gate, and `db.expunge(overlay)` so the route can read scalar attributes after the session closes.
- `ValueError` from either resolver → `Response("invalid path", status=400)`. `target.is_dir()` failure on the listing route → 404. `real.exists()` / `real.is_dir()` failure on the download route → 404 / 400.
- `send_file(str(real), as_attachment=True, download_name=os.path.basename(real))`.
- The fragment renders `_overlay_file_tree.html` only — no `base.html` shell — so HTMX swaps inject just the `<ul>` content.
**Verification:**
```
pytest l4d2web/tests/test_overlay_files_routes.py -q
```
**Commit:** part of Task 4's bundled `feat` commit.
---
## Task 3: Templates + page-routes integration
**Files:**
- Create: `l4d2web/templates/_overlay_file_tree.html`: `<ul class="file-tree" role="group">` + per-entry `_overlay_file_node.html` include + optional truncated-footer `<li>`.
- Create: `l4d2web/templates/_overlay_file_node.html` — folder row (button + HTMX attrs + empty `<div class="file-tree-children" hidden>`) or file row (`<a>` for regular/symlink files; `<span>` for broken symlinks; `link` / `broken link` badges; `size_human`).
- Modify: `l4d2web/templates/overlay_detail.html` — add `<section class="panel"><h2>Files</h2>…</section>` between the type-specific sections and the existing "Used by" section. Renders empty-state `<p class="muted">No files yet — build this overlay to populate it.</p>` when `file_tree_root_entries is none`, else includes the partial.
- Modify: `l4d2web/routes/page_routes.py` — import the helpers, add `_root_file_tree(overlay)` (returns `(entries, truncated_count)` or `(None, 0)` on `ValueError` / missing dir / legacy absolute `overlay.path`), pass `file_tree_root_entries` + `file_tree_truncated` + `file_tree_truncated_count` into `render_template("overlay_detail.html", …)`.
Test plan (RED first, added to `test_overlay_files_routes.py`):
- `test_overlay_detail_renders_files_section_with_tree` — page contains "Files" header + entry names.
- `test_overlay_detail_shows_empty_state_when_overlay_dir_missing` — wipe directory, page shows "No files yet".
- `test_overlay_detail_files_section_present_for_workshop_overlays` — workshop type also gets the section.
**Implementation:**
- Section placement matters: `<section><h2>Files</h2>…</section>` is inserted before the existing "Used by" `<section>`.
- The partial uses `{% set entries = file_tree_root_entries %}` etc. so the same partial works whether called from the page (with full context) or from the HTMX route (rendering directly with named kwargs).
- `_root_file_tree` swallows `ValueError` and missing-dir cases into `(None, 0)`, and the template's `{% if file_tree_root_entries is none %}` renders the empty state.
- Use `overlay.path` (not `str(overlay.id)`) so legacy/seeded rows whose path differs still work correctly when resolvable.
**Verification:**
```
pytest l4d2web/tests/test_overlay_files_routes.py -q -k overlay_detail
pytest l4d2web/tests/ -q # no regressions across the full suite
```
**Commit:** part of Task 4's bundled `feat` commit.
---
## Task 4: CSS + JS + base.html script wiring
**Files:**
- Create: `l4d2web/static/js/file-tree.js` — event-delegated `click` handler that toggles `aria-expanded` on `.file-tree-toggle` and `hidden` on the next `.file-tree-children` sibling, and on first expand fires `fetch(button.dataset.filesUrl)` and innerHTMLs the response. `dataset.loaded` flag dedupes rapid clicks; cleared on error to allow retry.
- Modify: `l4d2web/templates/base.html`: add `<script src="{{ url_for('static', filename='js/file-tree.js') }}"></script>` next to the existing `csrf.js` / `sse.js` / `modal.js` lines.
- Modify: `l4d2web/static/css/components.css` — append `~50` lines: `.file-tree`, `.file-tree-row`, `.file-tree-toggle` (transparent button, inherits color), `.file-tree-toggle .chevron` rotation transform on `aria-expanded="true"`, `.file-tree-children[hidden]`, `.file-tree-badge` + `.file-tree-badge-warn`. All against existing tokens (`--space-xs`, `--space-l`, `--color-surface-muted`, `--color-muted`, `--color-danger`, `--radius-s`).
**Implementation:**
- The JS handler fires on every click. First-expand path: read `button.dataset.filesUrl`, set `dataset.loaded="1"` optimistically, `fetch(url, {credentials: "same-origin"})`, replace `.file-tree-children` innerHTML with the response. Subsequent clicks just toggle `aria-expanded` + `hidden` — no re-fetch since `dataset.loaded` is set. On fetch error: `delete dataset.loaded` so a future click retries.
- The CSS chevron is a Unicode glyph inside a `<span class="chevron">`; rotated 90° when expanded via `transform: rotate(90deg)` with a 120ms transition.
**Verification:**
```
pytest l4d2web/tests/ -q # 293 passed, 1 skipped
```
Manual smoke (post-deploy on `ckn@10.0.4.128`):
- Navigate to an overlay detail page with a populated runtime directory.
- Confirm the "Files" section renders the root level.
- Click a folder: HTMX request fires once, children appear, chevron rotates.
- Click again: children hide; no second request in DevTools network tab.
- Click a file: browser downloads it with the correct filename.
- Visit another user's overlay as a non-admin: 403.
**Commit:** `feat(l4d2-web): overlay detail Files section with HTMX file tree + downloads` — covers all four tasks (services helper + routes + templates + CSS/JS), since the feature is small and the tasks share a single set of integration tests.
---
## End-to-end verification
After all tasks committed:
```
pytest l4d2web/tests/ -q # 293 passed, 1 skipped
deploy/deploy-test-server.sh ckn@10.0.4.128
ssh ckn@10.0.4.128 'systemctl status left4me-web --no-pager | head -10'
curl -s http://10.0.4.128:8000/health # {"status":"ok"}
```
Then exercise the manual smoke checklist from Task 4 against the deployed instance.


@ -1,140 +0,0 @@
# Per-overlay `server.cfg` aliases — opt-in via blueprint checkbox
## Context
L4D2 overlays stack via kernel overlayfs. When two overlays both ship `left4dead2/cfg/server.cfg`, the topmost wins; lower-layer copies become unreachable. On top of that, the blueprint's own `server.cfg` is copied into `merged/.../cfg/server.cfg` at instance start (`l4d2host/instances.py:112-115`), so the merged view's `server.cfg` is always the blueprint's.
We want a per-blueprint opt-in mechanism: for each linked overlay, the blueprint owner can check a box to expose that overlay's `server.cfg` as a reloadable alias under a known name. The alias is identified by overlay id (`server_overlay_<id>.cfg`), so it's stable across overlay renames and namespaced.
Trade-off accepted: only checked overlays are addressable in the in-game console. That's intentional — explicit opt-in beats automatic exposure of every overlay's config.
Earlier rounds of this plan considered (and rejected):
- Doing it in each script overlay: too easy to forget, doesn't scale.
- A name-as-slug constraint with auto-aliasing for every overlay: more invasive (regex on names, blueprint-wide collision checks) and exposes everything by default.
## Approach
- New boolean column `expose_server_cfg` on `BlueprintOverlay` (per-blueprint, per-overlay state).
- Blueprint detail page: each linked overlay row gets a checkbox labeled with its alias (`exec server_overlay_<id>`).
- Spec yaml carries an optional `alias` per overlay; web app sets it to `overlay_<id>` when the box is checked, otherwise omits it.
- Host copies `<lowerdir>/left4dead2/cfg/server.cfg` to `merged/left4dead2/cfg/server_<alias>.cfg` at instance start, only for entries with `alias` set and an existing source. A pre-sweep removes stale aliases from prior starts.
- **Auto-inject `exec` lines into the blueprint's final `server.cfg`**: for each opted-in overlay, prepend `exec server_overlay_<id>` to the config list, in `BlueprintOverlay.position` ascending order (lowest overlay first, highest last), with the user's custom config lines appended after. Source-style cfg semantics: later lines override earlier ones, so this gives "lowest overlay's settings → higher overlay's settings → blueprint customizations" in the right precedence.
No constraint on `Overlay.name`. No alias / slug column on `Overlay`. The previously-added manual `cp` in `competitive_rework.sh` gets reverted (the framework will do it when checked).
## Changes
### 1. Schema
`l4d2web/models.py`: `BlueprintOverlay` gets:
```python
expose_server_cfg: Mapped[bool] = mapped_column(Boolean, nullable=False, default=False, server_default=text("0"))
```
`l4d2web/alembic/versions/0007_blueprint_overlay_expose_server_cfg.py` — new Alembic migration:
- `op.add_column("blueprint_overlays", sa.Column("expose_server_cfg", sa.Boolean(), nullable=False, server_default=sa.text("0")))`
- Downgrade drops the column.
### 2. Spec contract (host ↔ web)
`l4d2host/spec.py`: replace `overlays: list[str]` with typed refs.
```python
from dataclasses import dataclass, field


@dataclass(slots=True)
class OverlayRef:
    path: str
    alias: str | None = None  # if set, copy server.cfg to server_<alias>.cfg in merged


@dataclass(slots=True)
class InstanceSpec:
    port: int
    overlays: list[OverlayRef] = field(default_factory=list)
    arguments: list[str] = field(default_factory=list)
    config: list[str] = field(default_factory=list)
```
`load_spec` accepts both shapes per overlay entry: a bare string is treated as `OverlayRef(path=string)` (back-compat for hand-written specs and existing tests); a dict carries `path` and optional `alias`.
`l4d2web/services/l4d2_facade.py`:
- `load_server_blueprint_bundle`: change select to `select(Overlay.id, Overlay.path, BlueprintOverlay.expose_server_cfg)`, ordered by `BlueprintOverlay.position` ascending (already the case). Returns the raw list of (id, path, expose) tuples to the caller.
- `build_server_spec_payload`:
- Emit overlays as dicts: `{"path": p}` if not exposed, `{"path": p, "alias": f"overlay_{i}"}` if exposed.
- Build `exec_lines = [f"exec server_overlay_{i}" for i, _, expose in rows if expose]` — same ordering as overlays (lowest first).
- Set `config = exec_lines + json.loads(blueprint.config)`. Net effect: `exec` lines appear at the top of the written `instance_dir/server.cfg`, blueprint custom lines follow.
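Put together, the payload shaping described in these bullets might look like the sketch below. It takes the `(id, path, expose)` rows and the blueprint's JSON config string directly; `build_overlay_payload` is an illustrative name and covers only the `overlays` and `config` keys, not the full `build_server_spec_payload`.

```python
import json


def build_overlay_payload(rows, blueprint_config_json):
    """rows: (overlay_id, path, expose) tuples in BlueprintOverlay.position order."""
    overlays = []
    exec_lines = []
    for overlay_id, path, expose in rows:
        entry = {"path": path}
        if expose:
            # Alias and exec line share the same id-based name.
            entry["alias"] = f"overlay_{overlay_id}"
            exec_lines.append(f"exec server_overlay_{overlay_id}")
        overlays.append(entry)
    # exec lines first, blueprint custom lines after, so custom lines win.
    return {"overlays": overlays, "config": exec_lines + json.loads(blueprint_config_json)}


rows = [(6, "6", True), (7, "7", False), (8, "8", True)]
payload = build_overlay_payload(rows, '["sv_cheats 0"]')
print(payload["config"])
# ['exec server_overlay_6', 'exec server_overlay_8', 'sv_cheats 0']
```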
### 3. Lowerdir construction (host)
`l4d2host/instances.py:44`:
```python
lowerdirs = [str(overlay_path(o.path, root=root)) for o in spec.overlays]
```
### 4. Per-overlay copy in `start_instance` (host)
`l4d2host/instances.py`, after the existing main `server.cfg` copy. New block:
```python
emit_step("copying per-overlay server.cfg aliases...", on_stdout, passthrough)
cfg_dir = runtime_dir / "merged" / "left4dead2" / "cfg"
for stale in cfg_dir.glob("server_*.cfg"):
    stale.unlink()
for o in spec.overlays:
    if not o.alias:
        continue
    src = root / "overlays" / o.path / "left4dead2" / "cfg" / "server.cfg"
    if not src.exists():
        continue
    shutil.copy2(src, cfg_dir / f"server_{o.alias}.cfg")
```
- Sweep first: prevents orphans when a checkbox is unticked or an overlay is removed from the blueprint.
- Skip overlays with no `alias` (not opted in) and overlays whose lower dir has no `server.cfg` (workshop overlays etc.).
- Writes go to the upper layer of the overlayfs mount; lower dirs untouched.
### 5. Blueprint detail UI
`l4d2web/templates/blueprint_detail.html` — extend each linked-overlay `<li>` with a checkbox + label showing the alias inline.
`l4d2web/routes/page_routes.py` `blueprint_page`: also pass an `overlay_expose_state: dict[int, bool]` keyed by overlay_id so the template can read the current `expose_server_cfg` value.
`l4d2web/routes/blueprint_routes.py` (`replace_blueprint_overlays` and its callers): also read `expose_server_cfg_ids` from the form (`request.form.getlist("expose_server_cfg_ids")`), convert to `set[int]`, and set `BlueprintOverlay.expose_server_cfg = (overlay_id in expose_set)` per row.
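The form handling amounts to turning the submitted id strings into a set and deriving one boolean per linked overlay. A minimal sketch with hypothetical helper names; silently ignoring non-numeric values is an assumption, not something this plan specifies.

```python
def parse_expose_ids(raw_values):
    """Convert form strings (e.g. from request.form.getlist) to a set of ints."""
    ids = set()
    for raw in raw_values:
        try:
            ids.add(int(raw))
        except ValueError:
            continue  # ASSUMPTION: malformed values are dropped, not rejected
    return ids


def expose_flags(linked_overlay_ids, expose_set):
    """Map every linked overlay id to its new expose_server_cfg value."""
    return {overlay_id: overlay_id in expose_set for overlay_id in linked_overlay_ids}


flags = expose_flags([6, 7, 8], parse_expose_ids(["6", "8"]))
print(flags)  # {6: True, 7: False, 8: True}
```

Deriving the flag for every linked row, rather than only the submitted ids, is what resets unchecked rows to false.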
### 6. Revert the manual cp
`examples/script-overlays/competitive_rework.sh`: remove the `cp "$DEST/cfg/server.cfg" "$DEST/cfg/server_competitive.cfg"` block added in the previous round. The framework handles this on demand now.
## Critical files
- `l4d2web/models.py`: `BlueprintOverlay.expose_server_cfg`
- `l4d2web/alembic/versions/0007_*.py` — new Alembic migration
- `l4d2web/routes/blueprint_routes.py` — read checkbox set on save, persist expose flag
- `l4d2web/routes/page_routes.py` — pass overlay state map to template
- `l4d2web/templates/blueprint_detail.html` — checkbox + alias display
- `l4d2web/services/l4d2_facade.py` — emit alias per overlay in spec payload + prepend exec lines
- `l4d2host/spec.py`: `OverlayRef` dataclass + spec deserialization
- `l4d2host/instances.py` — lowerdir construction + per-overlay copy step + sweep
- `examples/script-overlays/competitive_rework.sh` — remove manual `cp`
## Out of scope
- No constraint on `Overlay.name`.
- No `cfg_alias` / `slug` column on `Overlay`.
- No per-blueprint custom alias text (id-based naming is fixed: `overlay_<id>`).
- No automatic detection of which overlays ship a `server.cfg` to gate the checkbox in UI — checkbox is always available; the host silently skips at start time if the source doesn't exist.
## Verification
1. **Unit tests**:
- `l4d2host/tests/`: `start_instance` test where two overlays exist on disk — one with `server.cfg`, one without; spec marks both with aliases; assert only the one with a source produces `server_<alias>.cfg` in merged. Pre-existing `server_old.cfg` in merged is swept.
- `l4d2host/tests/`: spec yaml round-trip test for `OverlayRef` with and without `alias`; back-compat test for bare-string entries.
- `l4d2web/tests/`: blueprint payload build asserts overlays without `expose_server_cfg` produce no `alias`; with, produce `overlay_<id>`.
- `l4d2web/tests/`: blueprint payload `config` field equals `["exec server_overlay_<id_low>", "exec server_overlay_<id_high>", *blueprint_custom_lines]`, with `exec` lines in `BlueprintOverlay.position` ascending order, custom lines last, and no exec lines for unchecked overlays.
- `l4d2web/tests/`: form submit with `expose_server_cfg_ids=[6, 8]` updates the matching `BlueprintOverlay` rows; unchecked rows reset to false.
- Run: `pytest l4d2host/tests -q`, `pytest l4d2web/tests -q`.
2. **End-to-end on the test server (`ckn@10.0.4.128`)**:
- Deploy via `deploy/deploy-test-server.sh`.
- Blueprint detail: each linked overlay shows a checkbox with its alias label.
- Tick the box for `competitive_rework`; save; reload; checkbox stays checked.
- Start a server using that blueprint: `ls /var/lib/left4me/runtime/<name>/merged/left4dead2/cfg/server_*.cfg` → shows `server_overlay_<id>.cfg` for the checked overlay only.
- Inspect the written `server.cfg`: `head -n 5 /var/lib/left4me/instances/<name>/server.cfg` → top lines are `exec server_overlay_<id>` for each checked overlay in lowest-first order, followed by the blueprint's custom lines.
- In-game console: server boot should auto-load the per-overlay configs.
- Untick the box, restart the server → `server_overlay_<id>.cfg` no longer present in merged, and the corresponding `exec` line is no longer in the written `server.cfg`.
3. **Negative**: tick an overlay that doesn't ship a `server.cfg` (e.g. a workshop overlay) → start succeeds, no alias file produced (host skipped silently).


@ -1,260 +0,0 @@
# L4D2 CPU Isolation Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Constrain every cgroup that isn't a live game server to core 0; give game servers cores 1..N-1 exclusively, scaled automatically across host sizes.
**Architecture:** Four `99-left4me-cpuset.conf` drop-ins under `/etc/systemd/system/{system,user,l4d2-build,l4d2-game}.slice.d/`, written by the deploy script from heredocs. `LEFT4ME_SYSTEM_CPUS` (default `0`) and `LEFT4ME_GAME_CPUS` (default `1-$((NPROC-1))`) are env-var overrides. Single-core hosts skip the cpuset writes with a warning.
**Tech Stack:** systemd cgroup-v2 `AllowedCPUs=` directive, bash heredoc + `install`, Linux `nproc(1)`, pytest text-assertion tests.
**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`
---
## File Structure
Files to modify:
- `deploy/deploy-test-server.sh` — compute `NPROC`, default `LEFT4ME_SYSTEM_CPUS=0` / `LEFT4ME_GAME_CPUS=1-$((NPROC-1))`, write four drop-in files. Skip when `nproc < 2` (with stderr warning) unless either env var is set explicitly.
- `deploy/README.md` — append a "CPU isolation" subsection inside the existing "Performance Tuning" section.
- `deploy/tests/test_deploy_artifacts.py` — new test functions.
No host library or web app changes.
---
## Pre-flight
- [ ] **Step 0a: Verify clean working tree**
Run: `git status`
Expected: `nothing to commit, working tree clean`
- [ ] **Step 0b: Verify the existing deploy tests are at the known-good baseline**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: 35 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`).
If the count differs, stop and surface — this plan assumes that exact baseline.
---
## Task 1: Deploy-script CPU-isolation block + tests
Write the four drop-ins from the deploy script in one cohesive block. The block computes `NPROC` once, resolves both env vars (with defaults), guards single-core hosts, and writes each drop-in via the existing `install -m 0644 -o root -g root` pattern. Tests cover defaults, overrides, single-core skip, and drop-in paths.
**Files:**
- Modify: `deploy/deploy-test-server.sh`
- Modify: `deploy/tests/test_deploy_artifacts.py` (new test function)
- [ ] **Step 1.1: Add the failing test**
Open `deploy/tests/test_deploy_artifacts.py` and append (after the `test_deploy_script_installs_perf_artifacts` from the perf-baseline branch):
```python
def test_deploy_script_writes_cpuset_drop_ins():
    script = DEPLOY_SCRIPT.read_text()
    # Reads nproc and binds defaults via ${VAR:-...}.
    assert "nproc" in script
    assert "LEFT4ME_SYSTEM_CPUS" in script
    assert "LEFT4ME_GAME_CPUS" in script
    assert "${LEFT4ME_SYSTEM_CPUS:-0}" in script
    # Default game-core expression: 1-(nproc-1). Match the form the
    # implementer chose; `1-$((NPROC-1))`, `1-$((NPROC - 1))` and
    # `1-$((nproc-1))` are all acceptable as long as the upper bound is
    # computed from nproc.
    assert ("1-$((NPROC-1))" in script) or ("1-$((NPROC - 1))" in script) \
        or ("1-$((nproc-1))" in script) or ("LEFT4ME_GAME_CPUS:-1-" in script)
    # All four drop-in paths. The script may write several of them from a
    # `for slice_name in ...` loop, so accept either the literal path or the
    # `${slice_name}` template together with the slice name itself.
    looped = "/etc/systemd/system/${slice_name}.slice.d/99-left4me-cpuset.conf"
    for slice_name in ("system", "user", "l4d2-build", "l4d2-game"):
        literal = f"/etc/systemd/system/{slice_name}.slice.d/99-left4me-cpuset.conf"
        assert literal in script or (looped in script and slice_name in script)
    # Drop-ins use the existing install pattern.
    assert "install -m 0644 -o root -g root" in script
    # Single-core host: skip with a warning to stderr.
    # Match either an explicit `nproc < 2` / `-lt 2` guard or `[ "$nproc" -ge 2 ]` form.
    assert ("nproc" in script) and (("-lt 2" in script) or ("-ge 2" in script) or ("< 2" in script))
    # Compare against lowercase needles, since the haystack is lowercased.
    assert "skipping cpu isolation" in script.lower() or "skip cpu isolation" in script.lower()
```
- [ ] **Step 1.2: Run the new test, verify it fails**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_writes_cpuset_drop_ins -v`
Expected: FAIL — none of the new strings exist yet.
- [ ] **Step 1.3: Edit the deploy script — add the cpuset block**
Open `deploy/deploy-test-server.sh`. Find the block that copies the slice files (added in the perf-baseline branch, around lines 139-140):
```sh
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice
```
Immediately after that pair, before any of the helper-script copies that follow, insert this block:
```sh
# CPU isolation via cgroup-v2 AllowedCPUs= drop-ins. Pin everything that
# isn't a live game server to core 0; give game servers cores 1..N-1.
# See docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md.
NPROC=$(nproc)
SYSTEM_CPUS=${LEFT4ME_SYSTEM_CPUS:-0}
if [ "${LEFT4ME_GAME_CPUS+x}" = x ]; then
    GAME_CPUS=$LEFT4ME_GAME_CPUS
else
    GAME_CPUS="1-$((NPROC - 1))"
fi
if [ "$NPROC" -lt 2 ] && [ -z "${LEFT4ME_SYSTEM_CPUS+x}${LEFT4ME_GAME_CPUS+x}" ]; then
    printf 'left4me deploy: skipping CPU isolation (nproc=%s); cpuset drop-ins not written.\n' "$NPROC" >&2
else
    for slice_name in system user l4d2-build; do
        $sudo_cmd mkdir -p "/etc/systemd/system/${slice_name}.slice.d"
        printf '[Slice]\nAllowedCPUs=%s\n' "$SYSTEM_CPUS" \
            | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
                "/etc/systemd/system/${slice_name}.slice.d/99-left4me-cpuset.conf"
    done
    $sudo_cmd mkdir -p "/etc/systemd/system/l4d2-game.slice.d"
    printf '[Slice]\nAllowedCPUs=%s\n' "$GAME_CPUS" \
        | $sudo_cmd install -m 0644 -o root -g root /dev/stdin \
            "/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf"
fi
```
Notes for the implementer:
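- The single-core skip only triggers when **neither** override is set. If the operator sets either `LEFT4ME_SYSTEM_CPUS` or `LEFT4ME_GAME_CPUS` explicitly on a single-core host, honor their intent. Note that setting only `LEFT4ME_SYSTEM_CPUS` there leaves `GAME_CPUS` at the computed default `1-0`, an empty range, so an operator forcing isolation on one core should set `LEFT4ME_GAME_CPUS` as well.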
- `install -m 0644 -o root -g root /dev/stdin <dest>` is the idiomatic way to install a small generated file from a pipeline (matches the existing pattern for sandbox-resolv.conf, just with `/dev/stdin` as source).
- The `mkdir -p` for each `.d` directory is required: `install` without `-D` does not create missing parent directories, and the `*.slice.d` directories may not exist yet on a fresh host.
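The decision logic in the shell block above can be restated as a small Python sketch, which may help when reasoning about the edge cases. It is an illustration only and not part of the deploy script; `plan_cpusets` is a hypothetical name, and the environment is passed in as a plain mapping rather than read from `os.environ`.

```python
from typing import Mapping, Optional, Tuple


def plan_cpusets(nproc: int, env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: return (system_cpus, game_cpus), or None when skipped."""
    system_set = "LEFT4ME_SYSTEM_CPUS" in env
    game_set = "LEFT4ME_GAME_CPUS" in env

    # The single-core skip applies only when neither override is present.
    if nproc < 2 and not (system_set or game_set):
        return None

    # Mirrors ${LEFT4ME_SYSTEM_CPUS:-0}: unset or empty falls back to "0".
    system_cpus = env.get("LEFT4ME_SYSTEM_CPUS") or "0"
    # Mirrors the ${LEFT4ME_GAME_CPUS+x} test: a set value is used verbatim.
    game_cpus = env["LEFT4ME_GAME_CPUS"] if game_set else f"1-{nproc - 1}"
    return system_cpus, game_cpus
```

Under these assumptions, an 8-core host with no overrides yields `("0", "1-7")`, and a single-core host with no overrides yields `None`.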
- [ ] **Step 1.4: Verify shell syntax still parses**
Run: `sh -n deploy/deploy-test-server.sh`
Expected: exit 0, no output.
- [ ] **Step 1.5: Run the new test and full deploy test suite**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: 36 passed, 1 failed. The failure is the pre-existing unrelated test; the pass count goes from 35→36 because of the new test.
If your specific assertion forms in Step 1.1 don't match the implementation, adjust the test — but only the `or` branches; do not weaken the contract.
- [ ] **Step 1.6: Commit**
```bash
git add deploy/deploy-test-server.sh deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest
Computes NPROC at deploy time. Defaults LEFT4ME_SYSTEM_CPUS=0 and
LEFT4ME_GAME_CPUS=1-(NPROC-1). Single-core hosts skip cpuset writes
with a stderr warning unless an env var override is set. Spec:
docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md
EOF
)"
```
---
## Task 2: README "CPU isolation" subsection
Append a subsection to `deploy/README.md` inside the existing "Performance Tuning" section, documenting the layout, the env-var overrides, the single-core skip, and the relationship to the existing per-instance `CPUAffinity=` escape hatch.
**Files:**
- Modify: `deploy/README.md`
No test for this task — README content is documentation, not contract.
- [ ] **Step 2.1: Append the CPU isolation subsection**
Open `deploy/README.md`. Find the existing `### Per-instance CPU affinity` subsection (added in the perf-baseline branch). Insert a new subsection **immediately before** it (so the slice-level isolation is documented before the per-instance refinement that builds on top). The new subsection content:
```markdown
### CPU isolation (cores)
The deploy script writes four `AllowedCPUs=` drop-ins so that, by default, only `l4d2-game.slice` is allowed to run on cores 1..N-1; `system.slice`, `user.slice`, and `l4d2-build.slice` are pinned to core 0. Game servers thus get the host minus core 0 exclusively, the build sandbox and the web app stay on core 0, and a logged-in admin running CPU-heavy work in their shell can't steal cycles from a live match.
Override the split by setting either env var when running the deploy:
```sh
LEFT4ME_SYSTEM_CPUS="0,1" LEFT4ME_GAME_CPUS="2-7" deploy/deploy-test-server.sh deploy-user@host
```
On single-core hosts the deploy skips the cpuset drop-ins entirely and prints a warning to stderr; the rest of the perf baseline (cgroup weights, sysctls, OOM scores) still applies. To force isolation on a single-core host anyway (rarely useful), set either env var explicitly.
Per-instance `CPUAffinity=` (next subsection) composes on top of this — the per-instance value must be a subset of `l4d2-game.slice`'s `AllowedCPUs=`, which the kernel enforces.
```
(The outer triple-backticks above are markdown punctuation around this prompt block, not part of the README content. Inner code-block fences DO need to be written into the README. The `markdown` language tag on the outer fence in this plan is documentation-only.)
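The subset relationship between a per-instance `CPUAffinity=` and the slice's `AllowedCPUs=` can be checked offline with a short sketch like the one below. The helper names `parse_cpu_list` and `affinity_fits_slice` are hypothetical, and the parser only covers the comma, whitespace, and `a-b` range forms used in this plan.

```python
from typing import Set


def parse_cpu_list(spec: str) -> Set[int]:
    """Expand a CPU list such as "0,2-3" into a set of CPU indices."""
    cpus: Set[int] = set()
    for part in spec.replace(",", " ").split():
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def affinity_fits_slice(cpu_affinity: str, slice_allowed: str) -> bool:
    """True when every CPU in cpu_affinity is also allowed by the slice."""
    return parse_cpu_list(cpu_affinity) <= parse_cpu_list(slice_allowed)
```

For example, `affinity_fits_slice("2", "1-7")` holds, while pinning an instance to core 0 against a `1-7` game slice does not.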
- [ ] **Step 2.2: Run the full deploy test suite**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: 36 passed, 1 failed (unchanged; README has no test).
- [ ] **Step 2.3: Commit**
```bash
git add deploy/README.md
git commit -m "$(cat <<'EOF'
docs(deploy): document CPU isolation in performance-tuning section
Explains the core-0-vs-game-cores split, the LEFT4ME_SYSTEM_CPUS /
LEFT4ME_GAME_CPUS overrides, the single-core skip, and the
subset-of relationship with per-instance CPUAffinity=.
EOF
)"
```
---
## Final Verification
- [ ] **Step F.1: Full deploy + host + web test sweep**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
Expected: deploy 36 passed / 1 failed (pre-existing); host 111 passed / 1 skipped; web 313 passed / 1 skipped.
- [ ] **Step F.2: Working tree clean and commits in order**
Run: `git status && git log --oneline -5`
Expected:
- `git status`: clean.
- Top of `git log`:
1. `docs(deploy): document CPU isolation in performance-tuning section`
2. `feat(deploy): cgroup-v2 cpuset drop-ins pin system to core 0, game to rest`
3. `docs(plans): l4d2 cpu isolation — implementation plan`
4. `docs(specs): l4d2 cpu isolation — design`
- [ ] **Step F.3: Operator-side smoke test (deferred, not part of this plan)**
This plan ships artifacts. Confirming systemd actually enforces `AllowedCPUs=` on a real Trixie host is operator-side:
```sh
deploy/deploy-test-server.sh deploy-user@example-host
ssh deploy-user@example-host '
systemctl cat system.slice | grep AllowedCPUs
systemctl cat l4d2-game.slice | grep AllowedCPUs
cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective
cat /sys/fs/cgroup/l4d2-game.slice/cpuset.cpus.effective
'
# Expect on an 8-core box:
# system.slice → AllowedCPUs=0 → cpuset.cpus.effective = 0
# l4d2-game.slice → AllowedCPUs=1-7 → cpuset.cpus.effective = 1-7
```
End-to-end behavioural test (manual, ops-side): on a 4-core host, run two L4D2 instances + a script-sandbox build simultaneously. Confirm via `htop` (with affinity column on) that the srcds processes only ever appear on cores 1, 2, 3 and the sandbox + web stay on core 0.
---
## Out of Scope (do NOT implement here)
- Kernel `isolcpus=` / `nohz_full=` / `rcu_nocbs=` boot params.
- NIC IRQ pinning automation.
- Per-instance `CPUAffinity=` driven by a deploy-env knob.
- A separate `l4d2-web.slice`.
- Any web-app or host-library code changes.
If you find yourself touching any of these, stop — they belong in a separate spec.


@ -1,686 +0,0 @@
# L4D2 Server Host Perf Baseline Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Apply a host-side performance and resource-isolation baseline (systemd directives, slice hierarchy, host sysctls) to every L4D2 server instance, leaving game ConVars to the maintainer.
**Architecture:** Add resource-control directives to `left4me-server@.service`; introduce two flat top-level slices (`l4d2-game.slice` weight 1000, `l4d2-build.slice` weight 10) so the build sandbox is starved by the kernel under contention; ship `/etc/sysctl.d/99-left4me.conf` for UDP buffer and netdev tuning; place the script-sandbox transient unit into `l4d2-build.slice` with `OOMScoreAdjust=500`. RT scheduling, CPU governor, CPUAffinity, NIC tuning are documentation-only escape hatches.
**Tech Stack:** systemd unit files (service + slice), `systemd-run` properties, Linux sysctl, bash deploy script, pytest text-assertion tests under `deploy/tests/test_deploy_artifacts.py`.
**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`
---
## File Structure
Files to create:
- `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice` — high-weight slice for game-server instances.
- `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice` — low-weight slice for sandboxed script-overlay builds.
- `deploy/files/etc/sysctl.d/99-left4me.conf` — host UDP/netdev/swap sysctls.
Files to modify:
- `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` — add resource-control directives (`Slice`, `Nice`, `IOSchedulingClass`, `IOSchedulingPriority`, `OOMScoreAdjust`, `MemoryHigh`, `MemoryMax`, `TasksMax`, `LimitNOFILE`, `KillSignal`, `TimeoutStopSec`, `LogRateLimitIntervalSec`).
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` — add `--slice=l4d2-build.slice` and `-p OOMScoreAdjust=500` to the `systemd-run` invocation.
- `deploy/deploy-test-server.sh` — copy the two slice files and the sysctl conf during deploy; run `sysctl --system` so values take effect immediately.
- `deploy/README.md` — append a "Performance tuning" section with the four documented escape hatches.
- `deploy/tests/test_deploy_artifacts.py` — new tests for each artifact above (text assertions following the existing `assert "X" in text` style).
No application code (Python, Flask, host library) is touched.
---
## Pre-flight
- [ ] **Step 0a: Verify clean working tree**
Run: `git status`
Expected: `nothing to commit, working tree clean`
- [ ] **Step 0b: Verify the existing deploy tests pass**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: all green.
If any test is already red, stop and surface — this plan assumes the baseline is green.
---
## Task 1: Per-Instance Unit Resource-Control Directives
Add the per-instance baseline to `left4me-server@.service`. This task is self-contained even though `Slice=l4d2-game.slice` references a slice that doesn't exist yet — systemd does not validate the reference until the unit is actually started, and the deploy artifact tests are pure text checks.
**Files:**
- Modify: `deploy/files/usr/local/lib/systemd/system/left4me-server@.service`
- Test: `deploy/tests/test_deploy_artifacts.py` (new test function)
- [ ] **Step 1.1: Add the failing test**
Open `deploy/tests/test_deploy_artifacts.py` and append (after `test_server_unit_contains_required_runtime_contract`):
```python
def test_server_unit_contains_perf_baseline_directives():
    unit = SERVER_UNIT.read_text()
    # Slice membership.
    assert "Slice=l4d2-game.slice" in unit
    # CFS priority bump (no SCHED_FIFO).
    assert "Nice=-5" in unit
    assert "CPUSchedulingPolicy=" not in unit
    # I/O priority.
    assert "IOSchedulingClass=best-effort" in unit
    assert "IOSchedulingPriority=4" in unit
    # OOM ordering: game servers survive, sandbox dies first.
    assert "OOMScoreAdjust=-200" in unit
    # Memory caps with headroom for map-load spikes.
    assert "MemoryHigh=1.5G" in unit
    assert "MemoryMax=2G" in unit
    # Bounded fork surface.
    assert "TasksMax=256" in unit
    # Plenty of fds for plugin-heavy setups.
    assert "LimitNOFILE=65536" in unit
    # srcds clean shutdown via SIGINT, with time to flush.
    assert "KillSignal=SIGINT" in unit
    assert "TimeoutStopSec=15s" in unit
    # Per-unit override of journald rate limiting (default drops srcds output).
    assert "LogRateLimitIntervalSec=0" in unit
```
- [ ] **Step 1.2: Run the new test, verify it fails**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_server_unit_contains_perf_baseline_directives -v`
Expected: FAIL — first failing assert is on `Slice=l4d2-game.slice`.
- [ ] **Step 1.3: Edit the unit file**
Open `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` and replace its contents with:
```ini
[Unit]
Description=left4me server instance %i
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=left4me
Group=left4me
EnvironmentFile=/etc/left4me/host.env
EnvironmentFile=/var/lib/left4me/instances/%i/instance.env
WorkingDirectory=/var/lib/left4me/runtime/%i/merged/left4dead2
ExecStart=/var/lib/left4me/installation/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
Restart=on-failure
RestartSec=5
# Resource control baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0
# Hardening (unchanged from previous baseline).
NoNewPrivileges=true
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
ProtectSystem=strict
ReadOnlyPaths=/var/lib/left4me/installation /var/lib/left4me/overlays
ReadWritePaths=/var/lib/left4me/runtime/%i
RestrictSUIDSGID=true
LockPersonality=true
[Install]
WantedBy=multi-user.target
```
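As a sanity check on the memory caps, the sketch below converts the `MemoryHigh=` and `MemoryMax=` values to bytes and confirms the throttle threshold sits below the hard limit. It assumes base-1024 suffixes, which is how systemd documents its size values; `parse_size` and `unit_values` are hypothetical helpers, not part of the test suite.

```python
import re

# Base-1024 multipliers for the suffixes used in this plan.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(value: str) -> int:
    """Convert a size such as "1.5G" into bytes (base 1024)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", value.strip())
    if match is None:
        raise ValueError(f"unsupported size: {value!r}")
    number, suffix = match.groups()
    return int(float(number) * _UNITS[suffix])


def unit_values(unit_text: str, key: str) -> list:
    """Return every value assigned to key; repeated keys are all kept."""
    prefix = key + "="
    return [
        line.strip()[len(prefix):]
        for line in unit_text.splitlines()
        if line.strip().startswith(prefix)
    ]
```

With the unit above, `MemoryHigh=1.5G` is 1,610,612,736 bytes and `MemoryMax=2G` is 2,147,483,648 bytes, leaving 512 MiB between throttling and the hard cap.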
- [ ] **Step 1.4: Run the new test, verify it passes**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_server_unit_contains_perf_baseline_directives -v`
Expected: PASS.
- [ ] **Step 1.5: Re-run the existing server-unit test, verify still passes**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_server_unit_contains_required_runtime_contract -v`
Expected: PASS — the existing assertions (`User=left4me`, `Group=left4me`, hardening directives, etc.) still match.
- [ ] **Step 1.6: Commit**
```bash
git add deploy/files/usr/local/lib/systemd/system/left4me-server@.service deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): perf-baseline directives on left4me-server@.service
Slice=l4d2-game.slice, Nice=-5, IOSchedulingClass=best-effort,
OOMScoreAdjust=-200, MemoryHigh=1.5G, MemoryMax=2G, TasksMax=256,
LimitNOFILE=65536, KillSignal=SIGINT, TimeoutStopSec=15s,
LogRateLimitIntervalSec=0. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
EOF
)"
```
---
## Task 2: Slice Unit Files
Create the two slice unit files. After this task the perf unit's `Slice=l4d2-game.slice` reference is satisfied.
**Files:**
- Create: `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice`
- Create: `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice`
- Test: `deploy/tests/test_deploy_artifacts.py` (new constants + new test functions)
- [ ] **Step 2.1: Add path constants and failing tests**
Open `deploy/tests/test_deploy_artifacts.py`. After the existing `SERVER_UNIT = ...` line, add:
```python
GAME_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-game.slice"
BUILD_SLICE = DEPLOY / "files/usr/local/lib/systemd/system/l4d2-build.slice"
```
After the new `test_server_unit_contains_perf_baseline_directives`, append:
```python
def test_l4d2_game_slice_exists_with_high_weights():
    assert GAME_SLICE.is_file()
    text = GAME_SLICE.read_text()
    assert "[Slice]" in text
    assert "CPUWeight=1000" in text
    assert "IOWeight=1000" in text


def test_l4d2_build_slice_exists_with_low_weights():
    assert BUILD_SLICE.is_file()
    text = BUILD_SLICE.read_text()
    assert "[Slice]" in text
    assert "CPUWeight=10" in text
    assert "IOWeight=10" in text
```
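One caveat with the substring style used in these tests: `"CPUWeight=10" in text` is also true for a file containing `CPUWeight=1000`, so the build-slice test would pass against the game slice's contents. If a stricter check is wanted, a line-exact comparison is one option. The sketch below uses a hypothetical `directive_lines` helper and is not part of the existing test module.

```python
def directive_lines(unit_text: str) -> set:
    """Return the stripped, non-empty, non-comment lines of a unit file."""
    lines = set()
    for raw in unit_text.splitlines():
        line = raw.strip()
        if line and not line.startswith(("#", ";")):
            lines.add(line)
    return lines
```

A test could then assert `"CPUWeight=10" in directive_lines(BUILD_SLICE.read_text())`, which no longer matches a `CPUWeight=1000` line.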
- [ ] **Step 2.2: Run the new tests, verify they fail**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_l4d2_game_slice_exists_with_high_weights deploy/tests/test_deploy_artifacts.py::test_l4d2_build_slice_exists_with_low_weights -v`
Expected: FAIL on `assert GAME_SLICE.is_file()` (file does not exist).
- [ ] **Step 2.3: Create the game slice file**
Create `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice` with:
```ini
[Unit]
Description=left4me game-server slice
Before=slices.target
[Slice]
CPUWeight=1000
IOWeight=1000
```
- [ ] **Step 2.4: Create the build slice file**
Create `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice` with:
```ini
[Unit]
Description=left4me script-sandbox build slice
Before=slices.target
[Slice]
CPUWeight=10
IOWeight=10
```
- [ ] **Step 2.5: Run the new tests, verify they pass**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_l4d2_game_slice_exists_with_high_weights deploy/tests/test_deploy_artifacts.py::test_l4d2_build_slice_exists_with_low_weights -v`
Expected: PASS.
- [ ] **Step 2.6: Commit**
```bash
git add deploy/files/usr/local/lib/systemd/system/l4d2-game.slice deploy/files/usr/local/lib/systemd/system/l4d2-build.slice deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio
Flat top-level slices. Game wins under contention; build still gets
the box when uncontended. Referenced by left4me-server@.service and
the script-sandbox systemd-run invocation.
EOF
)"
```
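To make the 100:1 ratio concrete, the sketch below computes the CPU share each sibling slice would receive when all of them are saturated at once. It assumes shares are proportional to `CPUWeight=` among siblings that are all busy; `contended_shares` is a hypothetical name. With only the two left4me slices competing, the game slice gets 1000/1010, about 99.0%, and the build slice about 1.0%. If `system.slice` and `user.slice` are also saturated at systemd's default weight of 100, the game slice's share becomes 1000/1210, about 82.6%.

```python
from typing import Dict


def contended_shares(weights: Dict[str, int]) -> Dict[str, float]:
    """Fraction of CPU each sibling gets when every sibling is saturated."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Only the two left4me slices competing for CPU.
two_way = contended_shares({"l4d2-game.slice": 1000, "l4d2-build.slice": 10})

# Worst case sketch: system.slice and user.slice busy too (default weight 100).
four_way = contended_shares({
    "l4d2-game.slice": 1000,
    "l4d2-build.slice": 10,
    "system.slice": 100,
    "user.slice": 100,
})
```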
---
## Task 3: Host Sysctls
Ship a `/etc/sysctl.d/` drop-in for UDP buffers, netdev backlog, netdev budget, and `vm.swappiness`.
**Files:**
- Create: `deploy/files/etc/sysctl.d/99-left4me.conf`
- Test: `deploy/tests/test_deploy_artifacts.py` (new constant + new test function)
- [ ] **Step 3.1: Add path constant and failing test**
Open `deploy/tests/test_deploy_artifacts.py`. After the slice constants, add:
```python
SYSCTL_CONF = DEPLOY / "files/etc/sysctl.d/99-left4me.conf"
```
Append a new test:
```python
def test_sysctl_conf_present_with_perf_settings():
    assert SYSCTL_CONF.is_file()
    text = SYSCTL_CONF.read_text()
    for line in (
        "net.core.rmem_max = 8388608",
        "net.core.wmem_max = 8388608",
        "net.core.rmem_default = 524288",
        "net.core.wmem_default = 524288",
        "net.core.netdev_max_backlog = 5000",
        "net.core.netdev_budget = 600",
        "vm.swappiness = 10",
    ):
        assert line in text, f"missing {line!r} in 99-left4me.conf"
```
- [ ] **Step 3.2: Run the new test, verify it fails**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_sysctl_conf_present_with_perf_settings -v`
Expected: FAIL on `assert SYSCTL_CONF.is_file()`.
- [ ] **Step 3.3: Create the sysctl conf file**
Create `deploy/files/etc/sysctl.d/99-left4me.conf` with:
```
# Host-side perf baseline for left4me — see
# docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
#
# UDP socket buffers: distro defaults of ~208 KiB are too small for sustained
# Source-engine UDP across multiple instances. 8 MiB matches the standard
# 1 Gbit recommendation; rmem_default/wmem_default protect sockets that don't
# explicitly enlarge their buffers.
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
# Kernel softirq UDP path: the per-CPU backlog queue starts dropping packets
# at the default 1000 under multi-instance burst; 5000 absorbs realistic peaks.
# netdev_budget = 600 gives softirq more drain headroom per pass.
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
# Latency-sensitive default: avoid swap unless the box is really under
# pressure. Harmless on swapless hosts.
vm.swappiness = 10
```
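If the values need to be compared programmatically, for instance against what the kernel reports under `/proc/sys` after a deploy, the file can be parsed into a dictionary first. The following is a minimal sketch with a hypothetical `parse_sysctl_conf` helper; it handles only the `key = value` and comment forms used in this file.

```python
from typing import Dict


def parse_sysctl_conf(text: str) -> Dict[str, str]:
    """Parse "key = value" lines, skipping blanks and # or ; comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings
```

The parsed mapping makes the intended sizes easy to verify by arithmetic: 8388608 is 8 × 1024 × 1024 (8 MiB) and 524288 is 512 × 1024 (512 KiB).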
- [ ] **Step 3.4: Run the new test, verify it passes**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_sysctl_conf_present_with_perf_settings -v`
Expected: PASS.
- [ ] **Step 3.5: Commit**
```bash
git add deploy/files/etc/sysctl.d/99-left4me.conf deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): host sysctls for UDP buffers + netdev backlog/budget
99-left4me.conf: rmem_max/wmem_max=8M (with 512K defaults),
netdev_max_backlog=5000, netdev_budget=600, vm.swappiness=10.
EOF
)"
```
---
## Task 4: Sandbox in Build Slice
Place the script-sandbox transient unit into `l4d2-build.slice` and give it `OOMScoreAdjust=500` so it dies first under memory pressure.
**Files:**
- Modify: `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`
- Test: `deploy/tests/test_deploy_artifacts.py` (new test function)
- [ ] **Step 4.1: Add the failing test**
Open `deploy/tests/test_deploy_artifacts.py`. Append:
```python
def test_script_sandbox_in_build_slice_with_oom_adjust():
    text = SCRIPT_SANDBOX_HELPER.read_text()
    # Put the transient unit in the low-weight build slice so it yields to
    # game-server instances under CPU/IO contention.
    assert "--slice=l4d2-build.slice" in text
    # Sandbox dies first if the host hits memory pressure; servers
    # (OOMScoreAdjust=-200) survive.
    assert "-p OOMScoreAdjust=500" in text
```
- [ ] **Step 4.2: Run the new test, verify it fails**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_script_sandbox_in_build_slice_with_oom_adjust -v`
Expected: FAIL — neither string is in the helper yet.
- [ ] **Step 4.3: Edit the sandbox helper**
Open `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`. Locate the `systemd-run` invocation that begins with:
```
systemd-run --quiet --collect --wait --pipe \
--unit="left4me-script-${OVERLAY_ID}-$$" \
```
Insert two new lines immediately after the `--unit=` line, before `-p User=l4d2-sandbox`. The block becomes:
```
systemd-run --quiet --collect --wait --pipe \
--unit="left4me-script-${OVERLAY_ID}-$$" \
--slice=l4d2-build.slice \
-p OOMScoreAdjust=500 \
-p User=l4d2-sandbox -p Group=l4d2-sandbox \
```
Leave every other `-p` line untouched.
- [ ] **Step 4.4: Verify shell syntax still parses**
Run: `bash -n deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`
Expected: exit 0, no output.
- [ ] **Step 4.5: Run the new test and the existing sandbox-helper tests, verify they pass**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_script_sandbox_in_build_slice_with_oom_adjust deploy/tests/test_deploy_artifacts.py::test_script_sandbox_helper_invokes_systemd_run_with_hardening deploy/tests/test_deploy_artifacts.py::test_script_sandbox_helper_passes_shell_syntax_check -v`
Expected: PASS for all three. The hardening test still matches because it only checks for substring presence; we added strings, didn't remove any.
- [ ] **Step 4.6: Commit**
```bash
git add deploy/files/usr/local/libexec/left4me/left4me-script-sandbox deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500
Builds yield CPU/IO to game-server instances under contention via the
slice's weight=10, and are killed first under memory pressure
(servers have OOMScoreAdjust=-200).
EOF
)"
```
---
## Task 5: Deploy Script Installs Slice + Sysctl Artifacts
Wire the new artifacts into `deploy-test-server.sh` so a fresh deploy actually puts them on disk and applies the sysctls.
**Files:**
- Modify: `deploy/deploy-test-server.sh`
- Test: `deploy/tests/test_deploy_artifacts.py` (new test function)
- [ ] **Step 5.1: Add the failing test**
Open `deploy/tests/test_deploy_artifacts.py`. Append:
```python
def test_deploy_script_installs_perf_artifacts():
    script = DEPLOY_SCRIPT.read_text()
    # Slice files copied into the system-wide systemd unit dir.
    assert "/usr/local/lib/systemd/system/l4d2-game.slice" in script
    assert "/usr/local/lib/systemd/system/l4d2-build.slice" in script
    # Sysctl drop-in installed under /etc/sysctl.d/.
    assert "/etc/sysctl.d/99-left4me.conf" in script
    # Values applied immediately, not on next boot.
    assert "sysctl --system" in script
```
- [ ] **Step 5.2: Run the new test, verify it fails**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_installs_perf_artifacts -v`
Expected: FAIL on the first assertion.
- [ ] **Step 5.3: Edit the deploy script — copy the slice + sysctl files**
Open `deploy/deploy-test-server.sh`. Find the block that copies unit files (currently around line 138):
```sh
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-web.service /usr/local/lib/systemd/system/left4me-web.service
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service /usr/local/lib/systemd/system/left4me-server@.service
```
Add two new lines immediately after the `left4me-server@.service` copy line, so the block becomes:
```sh
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-web.service /usr/local/lib/systemd/system/left4me-web.service
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service /usr/local/lib/systemd/system/left4me-server@.service
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-game.slice /usr/local/lib/systemd/system/l4d2-game.slice
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/l4d2-build.slice /usr/local/lib/systemd/system/l4d2-build.slice
```
- [ ] **Step 5.4: Edit the deploy script — install the sysctl conf and apply it**
In `deploy/deploy-test-server.sh`, find the block that installs `/etc/left4me/sandbox-resolv.conf` (currently around lines 153-155):
```sh
$sudo_cmd install -m 0644 -o root -g root \
/opt/left4me/deploy/files/etc/left4me/sandbox-resolv.conf \
/etc/left4me/sandbox-resolv.conf
```
Immediately after that block, add:
```sh
# Host perf-baseline sysctls. Apply with `sysctl --system` so values
# take effect this deploy, not on next reboot.
$sudo_cmd install -m 0644 -o root -g root \
/opt/left4me/deploy/files/etc/sysctl.d/99-left4me.conf \
/etc/sysctl.d/99-left4me.conf
$sudo_cmd sysctl --system >/dev/null
```
- [ ] **Step 5.5: Verify the deploy script's shell syntax still parses**
Run: `sh -n deploy/deploy-test-server.sh`
Expected: exit 0, no output.
- [ ] **Step 5.6: Run the new test and the existing deploy-script tests, verify they pass**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_installs_perf_artifacts deploy/tests/test_deploy_artifacts.py::test_deploy_script_has_safe_defaults_and_preserves_state deploy/tests/test_deploy_artifacts.py::test_deploy_script_shell_syntax -v`
Expected: PASS for all three.
- [ ] **Step 5.7: Commit**
```bash
git add deploy/deploy-test-server.sh deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(deploy): install slice + sysctl artifacts and apply via sysctl --system
Copies l4d2-game.slice and l4d2-build.slice into
/usr/local/lib/systemd/system/, installs 99-left4me.conf into
/etc/sysctl.d/, and runs sysctl --system so the perf baseline is
live this deploy, not on next reboot.
EOF
)"
```
---
## Task 6: Performance-Tuning Section in deploy/README.md
Document the four escape hatches the spec lists as opt-in: CPU governor, per-instance `CPUAffinity`, NIC tuning, and SCHED_FIFO.
**Files:**
- Modify: `deploy/README.md`
No test for this task — README content is documentation, not contract.
- [ ] **Step 6.1: Append the Performance Tuning section**
Open `deploy/README.md`. Append (after the existing final paragraph) a new section:
```markdown
## Performance Tuning
The deployment ships a host-side perf baseline (slices, unit directives, sysctls). See `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md` for design rationale.
The following knobs are documented escape hatches — they are **not** auto-applied. Apply only if you have measured a need and understand the failure modes.
### CPU governor
The performance governor squeezes a few percent off jitter under bursty load. `schedutil` is acceptable for sustained UDP workloads.
```sh
sudo cpupower frequency-set -g performance
```
Persist via your distro's CPU-frequency tooling (e.g. `/etc/default/cpufrequtils`).
### Per-instance CPU affinity
`srcds` is single-threaded per instance. On a multi-core host, pinning each instance to its own core can cut jitter under contention. Drop in `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf`:
```ini
[Service]
CPUAffinity=2
```
A reasonable strategy on an N-core host: leave core 0 for the kernel + IRQs + system services, then pin one instance per remaining core.
### NIC tuning
Hardware-specific. On a host with a single primary interface (replace `eth0`):
```sh
sudo ethtool -G eth0 rx 4096 tx 4096
sudo ethtool -K eth0 gro on lro off
```
If you run a high instance count, also pin the NIC's interrupts off the cores that game servers occupy (see `/proc/interrupts` and `/proc/irq/<n>/smp_affinity`).
### Real-time scheduling (advanced, opt-in)
Source-engine servers do not need real-time scheduling, and a misbehaving `srcds` at any RT priority can starve kernel threads, even though the default `kernel.sched_rt_runtime_us=950000` reserves 5% of each scheduling period for non-RT tasks. Use only if you have a measured jitter problem that the baseline does not solve.
`/etc/systemd/system/left4me-server@.service.d/realtime.conf`:
```ini
[Service]
CPUSchedulingPolicy=fifo
CPUSchedulingPriority=10
LimitRTPRIO=10
```
### Applying changes to running servers
Unit-file changes do not apply to already-running services. After any change:
```sh
sudo systemctl daemon-reload
# Restart each game server via the web UI's stop + start, or:
sudo systemctl restart 'left4me-server@*.service'
```
```
- [ ] **Step 6.2: Run the full deploy test suite and verify it stays green**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py -q`
Expected: all green. README changes have no test, but should not break any existing tests.
- [ ] **Step 6.3: Commit**
```bash
git add deploy/README.md
git commit -m "$(cat <<'EOF'
docs(deploy): performance-tuning escape-hatch section in README
Documents CPU governor, per-instance CPUAffinity, NIC tuning, and
SCHED_FIFO opt-in patterns. None of these are auto-applied; they're
ops-side knobs for measured problems the perf baseline doesn't solve.
EOF
)"
```
---
## Final Verification
- [ ] **Step F.1: Full deploy test suite green**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ -q`
Expected: all green.
- [ ] **Step F.2: Host library + web tests still green (regression check)**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2host/tests -q && pytest l4d2web/tests -q`
Expected: all green. Nothing in this plan touches host or web Python code, but a clean run rules out accidental import-time damage.
- [ ] **Step F.3: Working tree clean and commits in order**
Run: `git status && git log --oneline -8`
Expected:
- `git status`: `nothing to commit, working tree clean`.
- `git log`: six new commits in this order, top-most first:
1. `docs(deploy): performance-tuning escape-hatch section in README`
2. `feat(deploy): install slice + sysctl artifacts and apply via sysctl --system`
3. `feat(deploy): script-sandbox runs in l4d2-build.slice + OOMScoreAdjust=500`
4. `feat(deploy): host sysctls for UDP buffers + netdev backlog/budget`
5. `feat(deploy): l4d2-game.slice + l4d2-build.slice with 100:1 weight ratio`
6. `feat(deploy): perf-baseline directives on left4me-server@.service`
If any step is missing or out of order, do not amend — diagnose, fix, and create new commits.
- [ ] **Step F.4: Manual deploy smoke test (deferred, ops-side)**
This plan ships artifacts. Confirming that systemd actually accepts and applies them on a real host requires running the deploy script against a test target. That validation is operator-side, not part of this implementation:
```sh
deploy/deploy-test-server.sh deploy-user@example-host
ssh deploy-user@example-host 'systemctl cat l4d2-game.slice'
ssh deploy-user@example-host 'sysctl net.core.rmem_max' # expect 8388608
ssh deploy-user@example-host 'systemd-analyze verify /usr/local/lib/systemd/system/left4me-server@.service'
```
Document any deploy-time problems back into the spec or this plan as v1.x corrections. Do not invent fixes that go beyond the spec.
---
## Out of Scope (do NOT implement here)
Listed in the spec — repeated for clarity:
- ConVars / blueprint arguments / tickrate / sv_minrate.
- SCHED_FIFO auto-apply.
- CPU governor auto-apply.
- Per-instance `CPUAffinity` auto-apply.
- NIC ring-buffer / IRQ-pinning code.
- Job-scheduler awareness ("don't build while server X has players").
- Hardening tightening (`ProtectKernelTunables=yes`, etc.).
If you find yourself touching any of these, stop — they belong in a separate spec.


@ -1,584 +0,0 @@
# L4D2 Server Lifecycle: Reboot-Safe + Drift Reconciliation Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Make L4D2 server instances survive a host reboot (Part A) and converge `Server.actual_state` to systemd reality every ~30s for out-of-band drift (Part B).
**Architecture:** Helper script + `service_control.py` switch from `systemctl start/stop` to `systemctl enable --now / disable --now`. A new background thread spawned with the job workers polls every server's status periodically and writes the result via the existing `refresh_server_actual_state()` path. Skip servers with in-flight jobs to avoid racing with the post-job refresh.
**Tech Stack:** bash helper script + sudoers; Python `subprocess` via `l4d2host.service_control.systemctl_command`; SQLAlchemy via `session_scope()`; threading; pytest.
**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md`
---
## File Structure
Files to modify (Part A — lifecycle verb change):
- `deploy/files/usr/local/libexec/left4me/left4me-systemctl` — accept verbs `enable`/`disable`/`show` (drop `start`/`stop`).
- `l4d2host/service_control.py` — rename `start_service` → `enable_service`, `stop_service` → `disable_service`. Action tokens become `"enable"` / `"disable"`.
- `l4d2host/instances.py` — call `enable_service` from `start_instance`; call `disable_service` from `stop_instance` and `_purge_instance`.
- `l4d2host/tests/test_lifecycle.py` — update mock-call expectations.
- `l4d2host/tests/test_service_control.py` — new file with direct unit tests for `enable_service` / `disable_service`.
- `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args` — update the verb assertions.
Files to modify (Part B — poller):
- `l4d2web/services/job_worker.py` — add `start_state_poller`, `state_poller_loop`, `poll_all_servers`.
- `l4d2web/app.py` — call `start_state_poller(app)` next to `start_job_workers(app)`.
- `l4d2web/config.py` — default `STATE_POLLER_INTERVAL_SECONDS = 30`.
- `l4d2web/tests/test_job_worker.py` — four new tests for the poller.
No host-library, web-app facade, or CLI surface signatures change. The `l4d2ctl start <name>` / `l4d2ctl stop <name>` commands keep their names (per `AGENTS.md`).
---
## Pre-flight
- [ ] **Step 0a: Verify clean working tree**
Run: `git status`
Expected: `nothing to commit, working tree clean`
- [ ] **Step 0b: Verify the existing test suite is at the known-good baseline**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
Expected: 460 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`), 2 skipped.
If the count differs, stop and surface — this plan assumes that exact baseline.
---
## Task 1: Part A — Switch lifecycle verbs to `enable --now` / `disable --now`
This task changes the helper script, the Python wrapper, and the instance lifecycle in one cohesive commit. The change is end-to-end vertical — splitting it across commits would leave broken intermediate states (helper accepting verbs that no caller uses, or callers using verbs the helper rejects).
**Files:**
- Modify: `deploy/files/usr/local/libexec/left4me/left4me-systemctl`
- Modify: `l4d2host/service_control.py`
- Modify: `l4d2host/instances.py`
- Modify: `l4d2host/tests/test_lifecycle.py`
- Create: `l4d2host/tests/test_service_control.py`
- Modify: `deploy/tests/test_deploy_artifacts.py`
### Step 1.1: Update the deploy artifact test for the helper
Open `deploy/tests/test_deploy_artifacts.py`. Find `test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`.
Replace the assertions that check the helper's case-statement bodies. Currently the test asserts something like:
```python
assert 'start) exec "$systemctl" start "$unit"' in script
assert 'stop) exec "$systemctl" stop "$unit"' in script
```
Update to:
```python
assert 'enable)' in script
assert 'enable --now' in script
assert 'disable)' in script
assert 'disable --now' in script
```
Keep the `--property=ActiveState` and `--property=SubState` assertions for the `show` action (unchanged).
The rejected-action examples list (currently includes things like `["bad/action", "alpha"]`) is unchanged — those are still bad. If the test currently asserts that `start` and `stop` are accepted (e.g., a positive case), drop those — `start`/`stop` are now rejected verbs, not accepted ones.
### Step 1.2: Run the updated artifact test to verify it fails
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args -v`
Expected: FAIL — the helper script still has `start)`/`stop)` cases, not `enable)`/`disable)`.
### Step 1.3: Edit the helper script
Open `deploy/files/usr/local/libexec/left4me/left4me-systemctl`. Find the case-statement (currently around lines 24-27). Replace:
```sh
case "$action" in
start) exec "$systemctl" start "$unit" ;;
stop) exec "$systemctl" stop "$unit" ;;
show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
*) ...
esac
```
with:
```sh
case "$action" in
enable) exec "$systemctl" enable --now "$unit" ;;
disable) exec "$systemctl" disable --now "$unit" ;;
show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
*) ...
esac
```
Keep the rest of the script (shebang, name validation, `*)` reject-and-exit branch) unchanged. The exact form of the `*)` reject case in the existing helper should be preserved.
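As a cross-check on the new case statement, the verb dispatch can be modelled in a few lines of Python. This is an illustrative sketch only: `helper_argv` and the `SYSTEMCTL` constant are hypothetical names that exist neither in the helper nor in `l4d2host`, and the real script resolves its own `systemctl` path.

```python
# Illustrative model of the helper's case statement. `helper_argv` is a
# hypothetical name; nothing in the repository defines it.
SYSTEMCTL = "systemctl"  # assumption: the real helper uses its own "$systemctl"


def helper_argv(action: str, unit: str) -> list[str]:
    """Return the argv the helper would exec for an accepted verb."""
    if action == "enable":
        return [SYSTEMCTL, "enable", "--now", unit]
    if action == "disable":
        return [SYSTEMCTL, "disable", "--now", unit]
    if action == "show":
        return [SYSTEMCTL, "show", unit,
                "--property=ActiveState", "--property=SubState"]
    # After this change start/stop are rejected like any other unknown verb.
    raise ValueError(f"rejected action: {action!r}")
```

The point of the model is that `start` and `stop` now fall through to the reject branch, which is what the updated artifact test relies on.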
### Step 1.4: Verify the helper script still parses
Run: `sh -n deploy/files/usr/local/libexec/left4me/left4me-systemctl`
Expected: exit 0, no output.
### Step 1.5: Run the artifact test, verify it passes
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args -v`
Expected: PASS.
### Step 1.6: Update `service_control.py`
Open `l4d2host/service_control.py`. Replace:
```python
def start_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("start", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )


def stop_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("stop", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )
```
with:
```python
def enable_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("enable", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )


def disable_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("disable", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )
```
`show_service`, `stream_command`, `stream_journal`, and the `systemctl_command` / `journalctl_command` helpers are unchanged.
### Step 1.7: Update `instances.py` to call the new names
Open `l4d2host/instances.py`. Replace the import:
```python
from l4d2host.service_control import start_service, stop_service
```
with:
```python
from l4d2host.service_control import disable_service, enable_service
```
Inside `start_instance`, find the `start_service(...)` call (around line 137 in current source) and replace with `enable_service(...)`. Inside `stop_instance` (line 159) and `_purge_instance` (line 194), replace `stop_service(...)` with `disable_service(...)`. Keep all keyword arguments identical — only the function name changes.
### Step 1.8: Update `test_lifecycle.py`
Open `l4d2host/tests/test_lifecycle.py`. Search for every assertion that references the `start` or `stop` action token in mock-call expectations against `service_control.run_command` or `systemctl_command`. The tests typically look for argument lists like `["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "start", "<name>"]`.
Update each occurrence:
- `"start"` → `"enable"` (in the `start_instance` test paths)
- `"stop"` → `"disable"` (in the `stop_instance`, `delete_instance`, `reset_instance`, and `_purge_instance` test paths)
Some tests may import `start_service` / `stop_service` directly. Update those imports to `enable_service` / `disable_service`.
### Step 1.9: Create direct unit tests for `enable_service` / `disable_service`
Create `l4d2host/tests/test_service_control.py` with:
```python
from unittest.mock import patch

from l4d2host.service_control import (
    SYSTEMCTL_HELPER,
    disable_service,
    enable_service,
)


@patch("l4d2host.service_control.run_command")
def test_enable_service_invokes_helper_with_enable_action(mock_run):
    enable_service("instance-7")
    args, _ = mock_run.call_args
    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "enable", "instance-7"]


@patch("l4d2host.service_control.run_command")
def test_disable_service_invokes_helper_with_disable_action(mock_run):
    disable_service("instance-7")
    args, _ = mock_run.call_args
    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "disable", "instance-7"]
```
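The assertions above pin down the argv shape the wrappers are expected to hand to `run_command`. A minimal sketch of a builder that would satisfy them follows, assuming the helper path quoted in Step 1.8; `sketch_systemctl_command` and `ALLOWED_ACTIONS` are hypothetical names, and the real `systemctl_command` may validate or assemble its argv differently.

```python
# Hypothetical stand-in for l4d2host.service_control.systemctl_command.
# The real function may differ; this only mirrors what the tests assert.
SYSTEMCTL_HELPER = "/usr/local/libexec/left4me/left4me-systemctl"
ALLOWED_ACTIONS = ("enable", "disable", "show")  # assumed allow-list


def sketch_systemctl_command(action: str, name: str) -> list[str]:
    """Build the sudo argv for one helper invocation."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    # `sudo -n` fails immediately instead of prompting for a password.
    return ["sudo", "-n", SYSTEMCTL_HELPER, action, name]
```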
### Step 1.10: Run the host-library tests
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2host/tests -q`
Expected: all green (110 or 111 passing depending on whether `test_service_control.py` already existed; `+2` from the new direct tests).
If anything is red, fix the test expectations, not the implementation; the implementation matches the spec exactly. The most likely failure mode is a test in `test_lifecycle.py` that you missed updating: search for any remaining string literal `"start"` or `"stop"` in helper-arg-list contexts.
### Step 1.11: Run the deploy artifact test suite
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ -q`
Expected: 36 passed, 1 failed (the pre-existing unrelated test).
### Step 1.12: Commit
```bash
git add deploy/files/usr/local/libexec/left4me/left4me-systemctl \
l4d2host/service_control.py l4d2host/instances.py \
l4d2host/tests/test_lifecycle.py \
l4d2host/tests/test_service_control.py \
deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now
Servers started via the web UI now create a WantedBy= symlink under
multi-user.target.wants/, so they auto-start on the next host reboot.
Helper verbs renamed start/stop -> enable/disable; service_control.py
renamed start_service/stop_service -> enable_service/disable_service.
The user-facing l4d2ctl start/stop commands keep their names per the
AGENTS.md contract — only the implementation changes. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
EOF
)"
```
---
## Task 2: Part B — Periodic state poller
This task adds the poller code, wires it into the Flask startup, exposes its config knob, and tests four behaviors. One cohesive commit.
**Files:**
- Modify: `l4d2web/services/job_worker.py`
- Modify: `l4d2web/app.py`
- Modify: `l4d2web/config.py`
- Modify: `l4d2web/tests/test_job_worker.py`
### Step 2.1: Add the failing tests
Open `l4d2web/tests/test_job_worker.py`. Append after the existing tests:
```python
def test_state_poller_refreshes_each_server(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Server

        with session_scope() as db:
            db.add_all([
                Server(id=11, name="alpha", port=27015, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
                Server(id=12, name="beta", port=27016, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
            ])

    refreshed = []
    monkeypatch.setattr(jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid))

    with app.app_context():
        jw.poll_all_servers()

    assert sorted(refreshed) == [11, 12]


def test_state_poller_skips_servers_with_inflight_jobs(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Job, Server

        with session_scope() as db:
            db.add(Server(id=21, name="gamma", port=27017, blueprint_id=None,
                          desired_state="running", actual_state="running"))
            db.add(Job(server_id=21, operation="stop", state="running"))

    refreshed = []
    monkeypatch.setattr(jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid))

    with app.app_context():
        jw.poll_all_servers()

    assert refreshed == []


def test_state_poller_swallows_per_server_exceptions(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Server

        with session_scope() as db:
            db.add_all([
                Server(id=31, name="bad", port=27018, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
                Server(id=32, name="good", port=27019, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
            ])

    refreshed = []

    def fake_refresh(sid):
        if sid == 31:
            raise RuntimeError("simulated host failure")
        refreshed.append(sid)

    monkeypatch.setattr(jw, "refresh_server_actual_state", fake_refresh)

    with app.app_context():
        jw.poll_all_servers()  # must not raise

    assert refreshed == [32]


def test_state_poller_disabled_when_job_workers_disabled(monkeypatch):
    """create_app must not spawn the poller thread when JOB_WORKER_ENABLED=False."""
    import threading
    from l4d2web.app import create_app

    spawned = []
    real_thread_init = threading.Thread.__init__

    def tracking_init(self, *args, **kwargs):
        if kwargs.get("name") == "left4me-state-poller":
            spawned.append(True)
        real_thread_init(self, *args, **kwargs)

    monkeypatch.setattr(threading.Thread, "__init__", tracking_init)

    create_app({"TESTING": True, "JOB_WORKER_ENABLED": False})
    assert not spawned
```
(The tests assume the existing `app` fixture from `conftest.py`. If your project uses a different fixture name, adjust accordingly. The polling tests run `poll_all_servers()` synchronously to avoid testing the loop's `time.sleep`.)
### Step 2.2: Run the new tests, verify they fail
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server l4d2web/tests/test_job_worker.py::test_state_poller_skips_servers_with_inflight_jobs l4d2web/tests/test_job_worker.py::test_state_poller_swallows_per_server_exceptions l4d2web/tests/test_job_worker.py::test_state_poller_disabled_when_job_workers_disabled -v`
Expected: FAIL — `poll_all_servers` and `start_state_poller` don't exist yet.
### Step 2.3: Add the poller code to `job_worker.py`
Open `l4d2web/services/job_worker.py`. Add at the bottom of the file:
```python
def start_state_poller(app):
    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
    thread = threading.Thread(
        target=state_poller_loop,
        args=(app, interval),
        daemon=True,
        name="left4me-state-poller",
    )
    thread.start()


def state_poller_loop(app, interval: float) -> None:
    while True:
        try:
            with app.app_context():
                poll_all_servers()
        except Exception:
            pass
        time.sleep(interval)


def poll_all_servers() -> None:
    with session_scope() as db:
        active_server_ids = set(db.scalars(
            select(Job.server_id).where(Job.state.in_(("queued", "running")))
        ).all())
        server_ids = [
            sid for sid in db.scalars(select(Server.id)).all()
            if sid not in active_server_ids
        ]
    for sid in server_ids:
        try:
            refresh_server_actual_state(sid)
        except Exception:
            pass
```
`Server`, `Job`, `select`, `session_scope`, `threading`, `time`, and `refresh_server_actual_state` are already imported in this file. Verify by scanning the existing imports; if any are missing (unlikely for `select`/`Server`/`Job` since the worker uses them), add them.
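The skip rule inside `poll_all_servers()` can be reasoned about without a database. The sketch below restates it as a pure function; `pollable_server_ids` and `IN_FLIGHT_STATES` are hypothetical names used for illustration and are not part of `job_worker.py`.

```python
# Database-free restatement of the skip rule in poll_all_servers().
# `pollable_server_ids` is a hypothetical helper for illustration only.
IN_FLIGHT_STATES = ("queued", "running")


def pollable_server_ids(server_ids, jobs):
    """Return the server ids that have no queued or running job.

    `jobs` is an iterable of (server_id, state) pairs.
    """
    busy = {sid for sid, state in jobs if state in IN_FLIGHT_STATES}
    return [sid for sid in server_ids if sid not in busy]
```

For example, with servers 11, 12 and 21 and a single running job on server 21, only 11 and 12 would be refreshed; a finished job on 12 does not exclude it.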
### Step 2.4: Wire the poller into `create_app`
Open `l4d2web/app.py`. Find the existing `start_job_workers(app)` call (around line 91, inside the `if should_start_workers:` block). Add `start_state_poller(app)` immediately after it:
```python
if should_start_workers:
    recover_stale_jobs()
    start_job_workers(app)
    start_state_poller(app)
```
Also update the import:
```python
from l4d2web.services.job_worker import (
    recover_stale_jobs,
    start_job_workers,
    start_state_poller,
)
```
(If the existing import is single-line `from ... import recover_stale_jobs, start_job_workers`, just add `start_state_poller` to the list.)
### Step 2.5: Add the config default
Open `l4d2web/config.py`. Find the dict literal that contains other defaults like `JOB_WORKER_THREADS`, `PORT_RANGE_START`, etc. Add:
```python
"STATE_POLLER_INTERVAL_SECONDS": 30,
```
In the env-var-loading section (where `LEFT4ME_PORT_RANGE_START` etc. are read), add:
```python
"STATE_POLLER_INTERVAL_SECONDS": float(os.getenv("LEFT4ME_STATE_POLLER_INTERVAL_SECONDS", "30")),
```
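Note that a bare `float(os.getenv(...))` raises `ValueError` on a malformed value and accepts `0`, which would turn the poller into a busy loop. If that matters for your deployment, a more defensive parse could look like the sketch below. It is an optional variation, not what the spec prescribes; `poller_interval` is a hypothetical helper name.

```python
import math
import os


def poller_interval(environ=os.environ, default: float = 30.0) -> float:
    """Read LEFT4ME_STATE_POLLER_INTERVAL_SECONDS with a safe fallback.

    Hypothetical helper: the plan's config.py calls float() directly.
    """
    raw = environ.get("LEFT4ME_STATE_POLLER_INTERVAL_SECONDS")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    # Reject zero, negative, NaN and infinite intervals.
    if not math.isfinite(value) or value <= 0:
        return default
    return value
```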
### Step 2.6: Run the four new tests, verify they pass
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server l4d2web/tests/test_job_worker.py::test_state_poller_skips_servers_with_inflight_jobs l4d2web/tests/test_job_worker.py::test_state_poller_swallows_per_server_exceptions l4d2web/tests/test_job_worker.py::test_state_poller_disabled_when_job_workers_disabled -v`
Expected: PASS for all four.
### Step 2.7: Run the full web test suite
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests -q`
Expected: 317 passed, 1 skipped (313 + 4 new tests).
### Step 2.8: Commit
```bash
git add l4d2web/services/job_worker.py l4d2web/app.py l4d2web/config.py l4d2web/tests/test_job_worker.py
git commit -m "$(cat <<'EOF'
feat(l4d2-web): periodic state poller refreshes Server.actual_state
A background thread spawned alongside the job workers polls every
server's status every STATE_POLLER_INTERVAL_SECONDS (default 30) and
writes the result via the existing refresh_server_actual_state path.
Servers with in-flight jobs are skipped to avoid racing the post-job
refresh. Catches reboot drift, OOM kills, manual systemctl operations,
and any other out-of-band state change. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
EOF
)"
```
---
## Final Verification
- [ ] **Step F.1: Full test sweep**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
Expected: ~466 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`), 2 skipped.
- [ ] **Step F.2: Working tree clean and commit shape**
Run: `git status && git log --oneline -5`
Expected:
- `git status`: clean.
- Top of `git log`:
1. `feat(l4d2-web): periodic state poller refreshes Server.actual_state`
2. `feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now`
3. `docs(plans): l4d2 server lifecycle reboot-and-drift — implementation plan`
4. `docs(specs): l4d2 server lifecycle reboot-and-drift — design`
- [ ] **Step F.3: Operator-side smoke test (deferred, not part of this plan)**
End-to-end on `ckn@10.0.4.128` after deploy:
```sh
deploy/deploy-test-server.sh ckn@10.0.4.128
# Confirm the helper now drives enable/disable
ssh ckn@10.0.4.128 'cat /usr/local/libexec/left4me/left4me-systemctl | grep -E "enable|disable"'
# expect: enable) exec "$systemctl" enable --now "$unit"
# disable) exec "$systemctl" disable --now "$unit"
# Click "start" in the web UI for a server. Then:
ssh ckn@10.0.4.128 'systemctl is-enabled left4me-server@1.service'
# expect: enabled
# Reboot the host:
ssh ckn@10.0.4.128 'sudo systemctl reboot'
# wait for it to come back, then:
ssh ckn@10.0.4.128 'systemctl is-active left4me-server@1.service && pgrep -fa srcds'
# expect: active, srcds running with no UI intervention
# Confirm the poller corrects out-of-band drift
ssh ckn@10.0.4.128 'sudo systemctl disable --now left4me-server@1.service'
# Within ~30s the web UI's actual_state for server 1 flips from "running" to "stopped".
ssh ckn@10.0.4.128 'sudo -u left4me /opt/left4me/.venv/bin/python -c "
import sqlite3
c = sqlite3.connect(\"/var/lib/left4me/left4me.db\")
print(c.execute(\"SELECT id, actual_state, actual_state_updated_at FROM servers WHERE id=1\").fetchone())
"'
# expect: actual_state='stopped' with a fresh updated_at.
```
---
## Out of Scope (do NOT implement here)
- Auto-restart on `desired_state=running && actual_state=stopped`.
- UI banners for stale-state warnings.
- Reconciliation of orphan systemd units.
- Per-server poll intervals.
- Replacing `Restart=on-failure`.
- Touching the pre-existing red test (`test_deploy_script_has_safe_defaults_and_preserves_state`).
If you find yourself touching any of these, stop — they belong in a separate spec.


@ -1,161 +0,0 @@
# Overlay umount helper was pinning the unit's mount namespace alive
> **Status:** fixed in `5eac51a` (helper nsenter wrap) and `87d56a0`
> (modal delegation). This doc is a postmortem so future maintainers
> don't walk the same path.
## Symptom
After commit `936c8bb` ("ExecStart srcds_run from merged overlay,
not installation/"), every Reset job started failing:
```
OSError: [Errno 16] Device or resource busy:
'/var/lib/left4me/runtime/<id>/merged'
```
`shutil.rmtree(runtime_dir)` in `_purge_instance` tripped on the
still-mounted `merged/`. The unit's `ExecStopPost` had run the umount
helper, the helper had returned non-zero, the unit went `failed`, and
the rmtree downstream couldn't proceed.
## False starts (don't repeat these)
We initially modeled this as an unavoidable kernel-level race between
ExecStopPost and the deferred reaping of the unit's per-service mount
namespace. The "fixes" applied in that frame:
1. **Eager-retry loop in `cmd_umount`** (started at 4 s deadline,
bumped to 12 s, then 25 s). Each bump worked sometimes and broke
sometimes — because we were timing the helper's own life, not the
kernel's reaping (see root cause).
2. **Lazy-umount (`umount -l`) fallback** if eager retries exhausted.
This *would* have made the unit not go `failed`, but it left
`work/work` half-finalized and just moved the EBUSY downstream.
3. **`TimeoutStopSec=15s` → `60s`** to give ExecStopPost more retry
room. This made Stop sit in "stopping" for tens of seconds.
All three workarounds shipped to the test box and were reverted in
`5eac51a` once we found the actual cause.
## Root cause
A live empirical probe (`/tmp/probe-umount2.sh` on the test box,
polling `/proc/*/ns/mnt` while a stop was in flight) showed:
```
[t= 0.00] mounted=Y holders=[]
[t= 2.27] mounted=Y holders=[35259(left4me-overlay) ]
[t= 4.53] mounted=Y holders=[35259(left4me-overlay) ]
[t= … ] (steady for ~22 s)
[t=22.97] mounted=Y holders=[35259(left4me-overlay) ]
[t=25.22] mounted=N holders=[] ← helper finally exited
```
The single PID holding a reference to the unit's dying mount namespace
was **our own umount helper** running as ExecStopPost. The EBUSY
window matched the helper's retry budget exactly. The mount became
unmountable the moment the helper exited.
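The original probe script is not reproduced here. As a rough illustration of the same idea, the sketch below lists the PIDs whose `/proc/<pid>/ns/mnt` link matches a given namespace identifier. `mount_ns_holders` is a hypothetical name, and the `proc_root` parameter exists only so the function can be exercised against a fake tree.

```python
import os


def mount_ns_holders(target_ns: str, proc_root: str = "/proc") -> list[int]:
    """Return the PIDs whose mount-namespace link equals `target_ns`.

    `target_ns` is the readlink value, for example "mnt:[4026532512]".
    """
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            ns = os.readlink(os.path.join(proc_root, entry, "ns", "mnt"))
        except OSError:
            # The process exited mid-scan, or we may not inspect it.
            continue
        if ns == target_ns:
            holders.append(int(entry))
    return sorted(holders)
```

Called in a loop while a stop is in flight, a function like this would produce the `holders=[...]` column shown in the trace above. Reading other processes' namespace links generally requires root.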
### Why the helper was holding the namespace
systemd's `+` Exec prefix removes sandbox & credentials, but does
**not** detach from the unit's per-service mount namespace (created
by `PrivateTmp=true` + `Protect*` directives). The Python interpreter
that runs `left4me-overlay` was launched inside the unit's namespace.
Inside the helper we did:
```python
subprocess.run([NSENTER, "--mount=/proc/1/ns/mnt", "--", UMOUNT_BIN, ...])
```
That nsenter put the *child process* (the umount syscall) in PID 1's
namespace — but the *parent process* (the helper Python interpreter)
never left the unit's namespace. As long as the helper was alive, it
held a reference to that namespace, which kept the slave-mount tree
alive, which made `umount` in PID 1 return EBUSY (mount-propagation
can't reconcile a slave that still has open references).
Self-defeating loop: the helper tried to umount the namespace it was
holding open. The mount only released when the helper gave up.
### Why this didn't surface before commit `936c8bb`
Before that commit, `ExecStart` invoked `srcds_run` from the
`installation/` lower layer. Srcds processes had cwd / mmaps in
`installation/`, **not** in the overlay mount. The unit's namespace
still existed and the helper still pinned it, but the kernel didn't
need to reconcile any references inside the overlay — so `umount` in
PID 1 found nothing busy and succeeded immediately.
Once srcds started running from inside `merged/`, the unit's namespace
gained file references inside the overlay, and the helper's
namespace-pin became the thing keeping those references in place.
## Fix
**One change at the systemd Exec line, two consequential cleanups.**
### `deploy/files/usr/local/lib/systemd/system/left4me-server@.service`
Wrap both helper invocations with nsenter at the unit level:
```ini
ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i
```
`nsenter` runs in the unit's namespace momentarily, switches its own
mount namespace to PID 1's, then `execve`s the helper. From that
point the helper Python interpreter — *the long-lived parent process*
— lives in PID 1's namespace and holds no reference to the unit's
namespace.
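One way to confirm the invariant this fix relies on is to compare the helper's own mount-namespace link with PID 1's. The sketch below is an assumption-laden illustration, not code from `left4me-overlay`: `same_mount_ns` is a hypothetical name, and reading `/proc/1/ns/mnt` normally needs root, which the `+` prefix provides.

```python
import os


def same_mount_ns(pid_a="self", pid_b="1", proc_root="/proc") -> bool:
    """Return True if both processes share one mount namespace."""
    link_a = os.readlink(os.path.join(proc_root, str(pid_a), "ns", "mnt"))
    link_b = os.readlink(os.path.join(proc_root, str(pid_b), "ns", "mnt"))
    # The kernel exposes the namespace identity as "mnt:[<inode>]".
    return link_a == link_b
```

With the nsenter wrap at the Exec line, a check like this run from inside the helper would be expected to return `True`; before the fix it would have returned `False`.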
`TimeoutStopSec` reverts to `15s`.
### `deploy/files/usr/local/libexec/left4me/left4me-overlay`
With the helper already in PID 1's namespace, internal nsenter is
redundant. Removed:
- `nsenter --mount=/proc/1/ns/mnt --` prefix on the mount/umount argv.
- `cmd_umount`'s eager-retry loop (no race left to ride out).
- Lazy-umount (`umount -l`) fallback (no fallback needed; eager
succeeds first try).
- `work_inner` cleanup retry (no kernel-finalisation residual after a
successful eager umount).
- `import time`.
Kept: input validation, idempotency guards (`os.path.ismount`),
`work_inner` rmtree (the kernel-overlayfs orphan dir is unrelated to
the namespace issue and still needs cleaning up).
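The idempotency guard that was kept is small enough to sketch. The version below is illustrative only: `umount_if_mounted` is a hypothetical name, the bare `"umount"` argv stands in for whatever binary path the real helper uses, and the injectable `run` and `ismount` parameters exist so the logic can be exercised without touching real mounts.

```python
import os
import subprocess


def umount_if_mounted(path, run=subprocess.run, ismount=os.path.ismount) -> bool:
    """Unmount `path` once; return False if it was not mounted."""
    if not ismount(path):
        # Already unmounted: a repeated call is a no-op, not an error.
        return False
    # Single eager attempt; no retry loop and no lazy fallback.
    run(["umount", path], check=True)
    return True
```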
## Verification
After deploy on the test box:
| Metric | Before fix | After fix |
|---|---|---|
| Reset duration (`l4d2ctl reset 3`) | ~25 s | ~0.5 s |
| `holders=` of dying namespace | `[helper_pid]` for ~25 s | `[]` immediately |
| Unit state after Stop | `failed` | `inactive` |
| ExecStopPost exit code | 32 (EBUSY) | 0 |
UI flow (`/servers/3` → Start → Reset): job `#164 reset succeeded`
in 1.3 s end-to-end. No `failed` rows on subsequent resets.
## Lessons
- **A retry loop is a hint, not a fix.** If you find yourself reaching
for "retry until kernel finishes," check whether *your own process*
is what's blocking the kernel from finishing. nsenter at the
syscall level looks right, but only escapes the namespace for the
child process; the parent still pins it.
- **Probe for the holder, don't assume async.** `/proc/*/ns/mnt` plus
a tight polling loop quickly tells you who's actually holding a
namespace alive. We jumped to "task_work_add reaping" as the
explanation and burned a round of workarounds before checking.
- **`+` prefix only escapes sandbox & credentials.** Mount namespace
inheritance is unaffected; if you need PID 1's namespace, do
`nsenter` yourself at the Exec line.


@ -1,895 +0,0 @@
# L4D2 Network Shaping & Marking Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Ship a network-side player-experience baseline alongside the existing host perf baseline: nftables uid-based DSCP-EF + skb-priority marking for srcds UDP, supplementary sysctls (`udp_rmem_min`/`udp_wmem_min`, `default_qdisc=fq_codel`, `tcp_congestion_control=bbr`), and CAKE egress shaping via a systemd oneshot driven by an operator-edited env file. Production hosts running `systemd-networkd` consume an equivalent `[CAKE]` section documented in the README.
**Architecture:** Eight ship-ready artifacts under `deploy/files/...`, wired into `deploy-test-server.sh`, asserted in `deploy/tests/test_deploy_artifacts.py`, and documented in `deploy/README.md`. Each artifact is a separate, independently-testable file. The CAKE helper takes an `apply`/`clear` mode argument so the unit's `ExecStart`/`ExecStop` are clean shell calls without escape soup.
**Tech Stack:** sysctl, nftables (`inet` table, output hook, mangle priority), tc-cake, systemd oneshot units, POSIX `/bin/sh` for the helper, pytest substring assertions.
**Spec:** `docs/superpowers/specs/2026-05-10-l4d2-network-shaping-design.md`.
---
## File Structure
**New files (`deploy/files/...`):**
- `usr/local/lib/left4me/nft/left4me-mark.nft` — nftables ruleset, own `inet` table.
- `usr/local/lib/systemd/system/left4me-nft-mark.service` — applies/removes the table.
- `etc/left4me/cake.env` — operator-edited template (deploy preserves edits).
- `usr/local/libexec/left4me/left4me-apply-cake` — POSIX shell helper, `apply`/`clear` modes.
- `usr/local/lib/systemd/system/left4me-cake.service` — runs the helper at network-online, clears on stop.
**Modified files:**
- `deploy/files/etc/sysctl.d/99-left4me.conf` — append four new directives.
- `deploy/deploy-test-server.sh` — add `nftables iproute2` to apt/dnf install lines, copy the new artifacts, conditional cake.env copy, enable the two new units.
- `deploy/README.md` — Network shaping subsection + three new escape hatches (IFB ingress, busy_poll, GRO).
- `deploy/tests/test_deploy_artifacts.py` — add path constants and assertions.
Each task adds (or extends) one artifact and the matching test, ending in a commit. Order matters: sysctl extension first (smallest, isolated), then the nftables pair, then the CAKE pair, then deploy-script wiring (depends on every prior task), then README.
---
### Task 1: Sysctl additions to `99-left4me.conf`
**Files:**
- Modify: `deploy/files/etc/sysctl.d/99-left4me.conf` (append block)
- Modify: `deploy/tests/test_deploy_artifacts.py:199-211` (extend existing `test_sysctl_conf_present_with_perf_settings`)
- [ ] **Step 1: Extend the existing sysctl test with the new lines.**
In `deploy/tests/test_deploy_artifacts.py`, edit `test_sysctl_conf_present_with_perf_settings` to append four lines to the tuple it already iterates:
```python
def test_sysctl_conf_present_with_perf_settings():
    assert SYSCTL_CONF.is_file()
    text = SYSCTL_CONF.read_text()
    for line in (
        "net.core.rmem_max = 8388608",
        "net.core.wmem_max = 8388608",
        "net.core.rmem_default = 524288",
        "net.core.wmem_default = 524288",
        "net.core.netdev_max_backlog = 5000",
        "net.core.netdev_budget = 600",
        "vm.swappiness = 10",
        "net.ipv4.udp_rmem_min = 16384",
        "net.ipv4.udp_wmem_min = 16384",
        "net.core.default_qdisc = fq_codel",
        "net.ipv4.tcp_congestion_control = bbr",
    ):
        assert line in text, f"missing {line!r} in 99-left4me.conf"
```
- [ ] **Step 2: Run the test to verify it fails.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_sysctl_conf_present_with_perf_settings -v
```
Expected: FAIL — `AssertionError: missing 'net.ipv4.udp_rmem_min = 16384' in 99-left4me.conf`.
- [ ] **Step 3: Append the new block to `99-left4me.conf`.**
Open `deploy/files/etc/sysctl.d/99-left4me.conf` and append (after the existing `vm.swappiness = 10` line):
```
# Per-socket UDP buffer floors: protect game-server sockets that don't bump
# their own SO_RCVBUF/SO_SNDBUF when softirq drains lag briefly.
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
# Default qdisc for ifaces we don't explicitly shape with CAKE. Debian Trixie
# already defaults to fq_codel; setting it explicitly is belt-and-suspenders
# and survives kernel-default churn.
net.core.default_qdisc = fq_codel
# TCP congestion control: BBR for any bulk TCP egress on the host (admin SSH,
# backups, package fetches, web-app responses) so a long flow does not push
# the bottleneck queue ahead of game UDP. UDP srcds is unaffected.
net.ipv4.tcp_congestion_control = bbr
```
- [ ] **Step 4: Run the test again to verify it passes.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_sysctl_conf_present_with_perf_settings -v
```
Expected: PASS.
- [ ] **Step 5: Commit.**
```
git add deploy/files/etc/sysctl.d/99-left4me.conf deploy/tests/test_deploy_artifacts.py
git commit -m "feat(deploy): extend sysctls with udp_*_min, fq_codel default, BBR"
```
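
The test in Task 1 only checks the file contents. On a deployed host, one way to confirm the running kernel agrees with the file is to compare each directive against `/proc/sys`. The sketch below is illustrative only: `parse_sysctl_conf`, `proc_path`, and `mismatches` are hypothetical helper names that do not exist in the repo, and the parser assumes the simple `key = value` layout used in `99-left4me.conf`.

```python
from pathlib import Path


def parse_sysctl_conf(text: str) -> dict[str, str]:
    """Parse `key = value` lines, skipping blank lines and # / ; comments."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def proc_path(key: str) -> Path:
    """Map a dotted sysctl key to its /proc/sys path.

    Assumption: no path component contains a literal dot, which holds for
    the keys in this plan but not for every per-interface sysctl.
    """
    return Path("/proc/sys") / key.replace(".", "/")


def mismatches(text: str) -> dict[str, tuple[str, str]]:
    """Return {key: (wanted, actual)} for keys whose live value differs."""
    out: dict[str, tuple[str, str]] = {}
    for key, wanted in parse_sysctl_conf(text).items():
        path = proc_path(key)
        actual = path.read_text().strip() if path.exists() else "<missing>"
        # Compare token-wise so tab-separated multi-value sysctls still match.
        if actual.split() != wanted.split():
            out[key] = (wanted, actual)
    return out
```

Calling `mismatches(Path("/etc/sysctl.d/99-left4me.conf").read_text())` on the target host should return an empty dict once `sysctl --system` has run; a `bbr` mismatch usually means the congestion-control module is not available on that kernel.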
---
### Task 2: nftables marking file
**Files:**
- Create: `deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft`
- Modify: `deploy/tests/test_deploy_artifacts.py` (add path constant + new test function)
- [ ] **Step 1: Add the path constant and a failing test.**
In `deploy/tests/test_deploy_artifacts.py`, add the constant near the existing path constants block (around line 26, after `DEPLOY_SCRIPT`):
```python
NFT_MARK_FILE = DEPLOY / "files/usr/local/lib/left4me/nft/left4me-mark.nft"
```
Append this test function to the bottom of the file:
```python
def test_nft_mark_file_marks_left4me_udp_with_dscp_ef_and_priority():
    assert NFT_MARK_FILE.is_file()
    text = NFT_MARK_FILE.read_text()
    # Own table in the inet family so it cannot conflict with operator nftables config.
    assert "table inet left4me_mark" in text
    assert "chain mangle_output" in text
    assert "type filter hook output priority mangle" in text
    # Match by uid (every srcds runs as `left4me`) restricted to UDP.
    assert 'meta skuid "left4me"' in text
    assert "meta l4proto udp" in text
    # DSCP EF for both L3 families; in `inet` tables, `ip` only fires on v4
    # and `ip6` only on v6.
    assert "ip dscp set ef" in text
    assert "ip6 dscp set ef" in text
    # skb->priority class 6:0, set inline alongside DSCP.
    assert "meta priority set 0006:0000" in text
```
- [ ] **Step 2: Run the new test and confirm it fails.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_nft_mark_file_marks_left4me_udp_with_dscp_ef_and_priority -v
```
Expected: FAIL — `AssertionError: assert False` on `NFT_MARK_FILE.is_file()`.
- [ ] **Step 3: Create the directory and write the nftables file.**
```
mkdir -p deploy/files/usr/local/lib/left4me/nft
```
Write `deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft`:
```nft
# left4me — uid-based DSCP/priority marking for srcds UDP egress.
# Loaded by left4me-nft-mark.service into its own `inet` table so it cannot
# conflict with whatever the operator already runs in /etc/nftables.conf.
# See docs/superpowers/specs/2026-05-10-l4d2-network-shaping-design.md.
table inet left4me_mark {
    chain mangle_output {
        type filter hook output priority mangle; policy accept;
        meta skuid "left4me" meta l4proto udp ip dscp set ef meta priority set 0006:0000
        meta skuid "left4me" meta l4proto udp ip6 dscp set ef meta priority set 0006:0000
    }
}
```
- [ ] **Step 4: Re-run the test and confirm it passes.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_nft_mark_file_marks_left4me_udp_with_dscp_ef_and_priority -v
```
Expected: PASS.
- [ ] **Step 5: Commit.**
```
git add deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft deploy/tests/test_deploy_artifacts.py
git commit -m "feat(deploy): nftables uid-based DSCP-EF + skb-priority marking for srcds"
```
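
When checking the marking from Task 2 on the wire (for example in a packet capture), it helps to know what the two marks look like numerically. The sketch below is a worked conversion, not project code: `dscp_to_tos_byte` and `tc_handle_to_priority` are hypothetical names. It relies on two general facts: DSCP sits in the upper six bits of the IPv4 TOS / IPv6 Traffic Class byte, and a tc `major:minor` handle is packed as `(major << 16) | minor`.

```python
DSCP_EF = 46  # Expedited Forwarding code point (RFC 3246)


def dscp_to_tos_byte(dscp: int) -> int:
    """Return the TOS / Traffic Class byte for a DSCP value (ECN bits zero)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit value")
    return dscp << 2


def tc_handle_to_priority(handle: str) -> int:
    """Convert a `major:minor` tc handle (hex fields) to a 32-bit priority."""
    major, _, minor = handle.partition(":")
    return (int(major or "0", 16) << 16) | int(minor or "0", 16)


if __name__ == "__main__":
    print(hex(dscp_to_tos_byte(DSCP_EF)))            # TOS byte for EF
    print(hex(tc_handle_to_priority("0006:0000")))   # skb->priority for 6:0
```

So a marked srcds packet should carry a TOS byte of `0xb8`, and the `0006:0000` in the rule corresponds to the priority value `0x60000`.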
---
### Task 3: nftables systemd unit
**Files:**
- Create: `deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service`
- Modify: `deploy/tests/test_deploy_artifacts.py` (path constant + test)
- [ ] **Step 1: Add the path constant and a failing test.**
Append the constant near the existing systemd-unit constants (around line 16):
```python
NFT_MARK_UNIT = DEPLOY / "files/usr/local/lib/systemd/system/left4me-nft-mark.service"
```
Append the test:
```python
def test_nft_mark_unit_loads_and_clears_left4me_table():
    assert NFT_MARK_UNIT.is_file()
    text = NFT_MARK_UNIT.read_text()
    # Loads the rules early so the very first packet srcds emits is marked.
    assert "After=network-pre.target" in text
    assert "Before=network.target" in text
    assert "Wants=network-pre.target" in text
    # Oneshot lifecycle: load on start, drop on stop.
    assert "Type=oneshot" in text
    assert "RemainAfterExit=yes" in text
    assert (
        "ExecStart=/usr/sbin/nft -f /usr/local/lib/left4me/nft/left4me-mark.nft"
        in text
    )
    assert "ExecStop=/usr/sbin/nft delete table inet left4me_mark" in text
    assert "WantedBy=multi-user.target" in text
```
- [ ] **Step 2: Run the test and confirm FAIL.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_nft_mark_unit_loads_and_clears_left4me_table -v
```
Expected: FAIL — `assert False` on `NFT_MARK_UNIT.is_file()`.
- [ ] **Step 3: Write the unit file.**
`deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service`:
```ini
[Unit]
Description=left4me nftables packet marking (DSCP EF + priority for srcds)
After=network-pre.target
Before=network.target
Wants=network-pre.target
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/nft -f /usr/local/lib/left4me/nft/left4me-mark.nft
ExecStop=/usr/sbin/nft delete table inet left4me_mark
[Install]
WantedBy=multi-user.target
```
- [ ] **Step 4: Re-run the test and confirm PASS.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_nft_mark_unit_loads_and_clears_left4me_table -v
```
Expected: PASS.
- [ ] **Step 5: Commit.**
```
git add deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service deploy/tests/test_deploy_artifacts.py
git commit -m "feat(deploy): systemd unit to load/clear left4me_mark nftables table"
```
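
The Task 3 test asserts on raw substrings, which is what the existing suite does. If a later change needs to check which section a key lives in, a stricter variant could parse the unit first. This is a sketch under assumptions, not a proposed change to the test file: `parse_unit` is a hypothetical helper, and `configparser` only approximates systemd's unit syntax (with `strict=False`, a repeated key such as a second `ExecStart=` silently keeps the last value).

```python
import configparser


def parse_unit(text: str) -> configparser.ConfigParser:
    """Parse a simple systemd unit file into sections and keys."""
    parser = configparser.ConfigParser(
        interpolation=None,          # leave % specifiers untouched
        strict=False,                # tolerate repeated keys (last one wins)
        delimiters=("=",),
        comment_prefixes=("#", ";"),
    )
    parser.optionxform = str         # keep systemd's CamelCase keys as written
    parser.read_string(text)
    return parser


if __name__ == "__main__":
    unit = parse_unit(
        "[Unit]\nAfter=network-pre.target\n"
        "[Service]\nType=oneshot\nRemainAfterExit=yes\n"
        "[Install]\nWantedBy=multi-user.target\n"
    )
    print(unit["Service"]["Type"])
```

With this, an assertion such as `unit["Service"]["Type"] == "oneshot"` fails if the key is accidentally placed under `[Unit]`, which a substring check would not catch.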
---
### Task 4: CAKE env template
**Files:**
- Create: `deploy/files/etc/left4me/cake.env`
- Modify: `deploy/tests/test_deploy_artifacts.py` (path constant + test)
- [ ] **Step 1: Add path constant and failing test.**
Append the constant near the other `/etc/left4me` constants (around line 22):
```python
CAKE_ENV = DEPLOY / "files/etc/left4me/cake.env"
```
Append the test:
```python
def test_cake_env_template_documents_required_knobs():
    assert CAKE_ENV.is_file()
    text = CAKE_ENV.read_text()
    # Both knobs are documented and present (commented OK; the deploy preserves
    # operator edits, so the template must not bake in a wrong value).
    assert "LEFT4ME_UPLINK_MBIT" in text
    assert "LEFT4ME_UPLINK_IFACE" in text
    # Empty defaults: shaper unit no-ops with a journal warning when unset.
    assert "LEFT4ME_UPLINK_MBIT=" in text
    assert "LEFT4ME_UPLINK_IFACE=" in text
```
- [ ] **Step 2: Run and confirm FAIL.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_cake_env_template_documents_required_knobs -v
```
Expected: FAIL on `CAKE_ENV.is_file()`.
- [ ] **Step 3: Write the env template.**
`deploy/files/etc/left4me/cake.env`:
```
# left4me — CAKE egress shaper config. Consumed by left4me-cake.service via
# its EnvironmentFile=. Edit then `systemctl restart left4me-cake.service`.
# See docs/superpowers/specs/2026-05-10-l4d2-network-shaping-design.md.
# Uplink bandwidth in Mbit/s. Set to ~95% of the smaller of measured upload
# and measured download. CAKE only shapes correctly when its declared
# bandwidth sits below the real bottleneck. If unset, the shaper unit logs
# a warning and exits 0 (no shaping).
LEFT4ME_UPLINK_MBIT=
# Egress interface. If unset, auto-detected from the IPv4 default route.
LEFT4ME_UPLINK_IFACE=
```
- [ ] **Step 4: Re-run and confirm PASS.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_cake_env_template_documents_required_knobs -v
```
Expected: PASS.
- [ ] **Step 5: Commit.**
```
git add deploy/files/etc/left4me/cake.env deploy/tests/test_deploy_artifacts.py
git commit -m "feat(deploy): cake.env template with documented uplink knobs"
```
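
The `cake.env` comment in Task 4 asks for roughly 95% of the smaller of measured upload and download. As a worked example of that arithmetic, the sketch below computes a candidate `LEFT4ME_UPLINK_MBIT`. The function name `recommended_uplink_mbit` and the integer-percent parameter are assumptions made for illustration; nothing in the deploy calls this.

```python
from typing import Optional


def recommended_uplink_mbit(
    upload_mbit: int,
    download_mbit: Optional[int] = None,
    percent: int = 95,
) -> int:
    """Return `percent`% of the smaller measured rate, rounded down to whole Mbit/s."""
    basis = upload_mbit if download_mbit is None else min(upload_mbit, download_mbit)
    if basis <= 0:
        raise ValueError("measured bandwidth must be positive")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer arithmetic avoids floating-point surprises near whole numbers.
    return basis * percent // 100


if __name__ == "__main__":
    print(recommended_uplink_mbit(500, 1000))
```

For a link measured at 500 Mbit/s up and 1000 Mbit/s down this gives 475, which is the value to write into `/etc/left4me/cake.env` before restarting `left4me-cake.service`.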
---
### Task 5: CAKE helper script
**Files:**
- Create: `deploy/files/usr/local/libexec/left4me/left4me-apply-cake`
- Modify: `deploy/tests/test_deploy_artifacts.py` (path constant + tests)
- [ ] **Step 1: Add path constant and failing tests.**
Append the constant near the libexec helper constants (around line 21):
```python
APPLY_CAKE_HELPER = DEPLOY / "files/usr/local/libexec/left4me/left4me-apply-cake"
```
Append two test functions:
```python
def test_apply_cake_helper_supports_apply_and_clear_modes():
    assert APPLY_CAKE_HELPER.is_file()
    text = APPLY_CAKE_HELPER.read_text()
    assert text.startswith("#!/bin/sh")
    # Both knobs are read from the env file.
    assert "LEFT4ME_UPLINK_MBIT" in text
    assert "LEFT4ME_UPLINK_IFACE" in text
    assert ". /etc/left4me/cake.env" in text
    # Iface fallback to default route.
    assert "ip -4 route show default" in text
    # Two modes; default to apply.
    assert "mode=${1:-apply}" in text
    assert 'apply)' in text and 'clear)' in text
    # Apply: idempotent `tc qdisc replace` with the documented flags.
    assert "tc qdisc replace" in text
    assert "cake" in text
    assert "bandwidth" in text
    assert "internet" in text
    assert "diffserv4" in text
    assert "dual-dsthost" in text
    # Clear: tolerates a missing qdisc.
    assert "tc qdisc del" in text
    assert "|| true" in text
    # Fail-soft on missing config.
    assert "LEFT4ME_UPLINK_MBIT unset" in text


def test_apply_cake_helper_passes_shell_syntax_check():
    subprocess.run(["sh", "-n", str(APPLY_CAKE_HELPER)], check=True)
```
- [ ] **Step 2: Run and confirm FAIL.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_apply_cake_helper_supports_apply_and_clear_modes deploy/tests/test_deploy_artifacts.py::test_apply_cake_helper_passes_shell_syntax_check -v
```
Expected: both FAIL.
- [ ] **Step 3: Write the helper.**
`deploy/files/usr/local/libexec/left4me/left4me-apply-cake`:
```sh
#!/bin/sh
# left4me — apply or clear CAKE egress shaper on the configured uplink.
# Driven by left4me-cake.service. See spec
# docs/superpowers/specs/2026-05-10-l4d2-network-shaping-design.md.
set -eu

mode=${1:-apply}

if [ -r /etc/left4me/cake.env ]; then
    . /etc/left4me/cake.env
fi

resolve_iface() {
    if [ -n "${LEFT4ME_UPLINK_IFACE:-}" ]; then
        printf '%s' "$LEFT4ME_UPLINK_IFACE"
        return
    fi
    ip -4 route show default | awk '/default/ {print $5; exit}'
}

case "$mode" in
    apply)
        if [ -z "${LEFT4ME_UPLINK_MBIT:-}" ]; then
            echo "left4me-cake: LEFT4ME_UPLINK_MBIT unset; skipping shaper" >&2
            exit 0
        fi
        iface=$(resolve_iface)
        if [ -z "$iface" ]; then
            echo "left4me-cake: cannot determine egress iface; skipping" >&2
            exit 0
        fi
        exec tc qdisc replace dev "$iface" root cake \
            bandwidth "${LEFT4ME_UPLINK_MBIT}mbit" \
            internet diffserv4 dual-dsthost
        ;;
    clear)
        iface=$(resolve_iface)
        if [ -z "$iface" ]; then
            exit 0
        fi
        tc qdisc del dev "$iface" root 2>/dev/null || true
        ;;
    *)
        echo "usage: $0 [apply|clear]" >&2
        exit 2
        ;;
esac
```
Make it executable in the repo (the deploy script also `chmod 0755`s the destination, but executable mode in the source tree is conventional here):
```
chmod 0755 deploy/files/usr/local/libexec/left4me/left4me-apply-cake
```
- [ ] **Step 4: Re-run and confirm PASS.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_apply_cake_helper_supports_apply_and_clear_modes deploy/tests/test_deploy_artifacts.py::test_apply_cake_helper_passes_shell_syntax_check -v
```
Expected: both PASS.
- [ ] **Step 5: Commit.**
```
git add deploy/files/usr/local/libexec/left4me/left4me-apply-cake deploy/tests/test_deploy_artifacts.py
git commit -m "feat(deploy): left4me-apply-cake helper with apply/clear modes"
```
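
The helper in Task 5 takes the fifth whitespace-separated field of the default-route line, which matches the common `default via <gateway> dev <iface> ...` shape. A route without a `via` hop (for example a point-to-point link) puts the interface in a different field. The sketch below shows a token-based alternative for illustration; `default_route_iface` is a hypothetical name and this is not what the shipped helper does.

```python
from typing import Optional


def default_route_iface(route_output: str) -> Optional[str]:
    """Return the interface named after `dev` on the first default route line."""
    for line in route_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != "default":
            continue
        if "dev" in fields:
            idx = fields.index("dev")
            if idx + 1 < len(fields):
                return fields[idx + 1]
    return None


if __name__ == "__main__":
    print(default_route_iface("default via 192.0.2.1 dev eth0 proto dhcp metric 100"))
    print(default_route_iface("default dev ppp0 scope link"))
```

Setting `LEFT4ME_UPLINK_IFACE` explicitly in `cake.env` sidesteps the detection entirely, which is the simpler option on hosts with unusual routing.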
---
### Task 6: CAKE systemd unit
**Files:**
- Create: `deploy/files/usr/local/lib/systemd/system/left4me-cake.service`
- Modify: `deploy/tests/test_deploy_artifacts.py` (path constant + test)
- [ ] **Step 1: Add path constant and failing test.**
Append the constant near the existing systemd-unit constants (around line 16):
```python
CAKE_UNIT = DEPLOY / "files/usr/local/lib/systemd/system/left4me-cake.service"
```
Append the test:
```python
def test_cake_unit_runs_helper_in_apply_and_clear_modes():
    assert CAKE_UNIT.is_file()
    text = CAKE_UNIT.read_text()
    assert "After=network-online.target" in text
    assert "Wants=network-online.target" in text
    assert "Type=oneshot" in text
    assert "RemainAfterExit=yes" in text
    # `-` prefix: missing env file is non-fatal (deploy ships one, but be safe).
    assert "EnvironmentFile=-/etc/left4me/cake.env" in text
    assert (
        "ExecStart=/usr/local/libexec/left4me/left4me-apply-cake apply" in text
    )
    assert (
        "ExecStop=/usr/local/libexec/left4me/left4me-apply-cake clear" in text
    )
    assert "WantedBy=multi-user.target" in text
```
- [ ] **Step 2: Run and confirm FAIL.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_cake_unit_runs_helper_in_apply_and_clear_modes -v
```
Expected: FAIL on `CAKE_UNIT.is_file()`.
- [ ] **Step 3: Write the unit.**
`deploy/files/usr/local/lib/systemd/system/left4me-cake.service`:
```ini
[Unit]
Description=left4me CAKE egress shaper
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=-/etc/left4me/cake.env
ExecStart=/usr/local/libexec/left4me/left4me-apply-cake apply
ExecStop=/usr/local/libexec/left4me/left4me-apply-cake clear
[Install]
WantedBy=multi-user.target
```
- [ ] **Step 4: Re-run and confirm PASS.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_cake_unit_runs_helper_in_apply_and_clear_modes -v
```
Expected: PASS.
- [ ] **Step 5: Commit.**
```
git add deploy/files/usr/local/lib/systemd/system/left4me-cake.service deploy/tests/test_deploy_artifacts.py
git commit -m "feat(deploy): left4me-cake.service oneshot wrapping apply-cake helper"
```
---
### Task 7: Wire artifacts into `deploy-test-server.sh`
**Files:**
- Modify: `deploy/deploy-test-server.sh`
- Modify: `deploy/tests/test_deploy_artifacts.py` (new test)
This task adds: `nftables` to apt/dnf install lines, copies the four new artifact files into their target paths, conditionally copies `cake.env` only if absent, and `systemctl enable --now`s the two new units. Each piece gets its own assertion in a single new test function.
- [ ] **Step 1: Add the new test.**
Append to `deploy/tests/test_deploy_artifacts.py`:
```python
def test_deploy_script_installs_network_shaping_artifacts():
    script = DEPLOY_SCRIPT.read_text()
    # nftables: package install on both apt and dnf paths.
    apt_lines = [l for l in script.splitlines() if "apt-get install" in l]
    dnf_lines = [l for l in script.splitlines() if "dnf install" in l]
    assert apt_lines and dnf_lines
    for line in apt_lines:
        assert "nftables" in line, line
    for line in dnf_lines:
        assert "nftables" in line, line
    # nft rules + unit copied to system paths.
    assert "/usr/local/lib/left4me/nft/left4me-mark.nft" in script
    assert (
        "/usr/local/lib/systemd/system/left4me-nft-mark.service" in script
    )
    assert "systemctl enable --now left4me-nft-mark.service" in script
    # CAKE helper + unit copied; helper made executable.
    assert "/usr/local/libexec/left4me/left4me-apply-cake" in script
    assert (
        "/usr/local/lib/systemd/system/left4me-cake.service" in script
    )
    assert "chmod 0755" in script and "left4me-apply-cake" in script
    assert "systemctl enable --now left4me-cake.service" in script
    # cake.env: copied only if absent (operator edits survive re-deploys).
    # The guard added in Step 3(f) is negated, so the match must include the `!`.
    assert "/etc/left4me/cake.env" in script
    assert "[ ! -e /etc/left4me/cake.env ]" in script
```
- [ ] **Step 2: Run and confirm FAIL.**
```
pytest deploy/tests/test_deploy_artifacts.py::test_deploy_script_installs_network_shaping_artifacts -v
```
Expected: FAIL on the first missing string.
- [ ] **Step 3: Edit `deploy-test-server.sh`.**
Make these targeted edits — do not rewrite the script.
(a) **Append `nftables` to both package-install lines (line 88 and line 90 in the current file).**
Old (line 88):
```
$sudo_cmd apt-get install -y python3 python3-venv python3-pip curl ca-certificates tar gzip util-linux sudo coreutils p7zip-full
```
New:
```
$sudo_cmd apt-get install -y python3 python3-venv python3-pip curl ca-certificates tar gzip util-linux sudo coreutils p7zip-full nftables
```
Old (line 90):
```
$sudo_cmd dnf install -y python3 python3-pip curl ca-certificates tar gzip util-linux sudo coreutils p7zip p7zip-plugins
```
New:
```
$sudo_cmd dnf install -y python3 python3-pip curl ca-certificates tar gzip util-linux sudo coreutils p7zip p7zip-plugins nftables
```
(b) **Add the nft-rules-dir creation to the `mkdir -p` block (currently lines 96-106).**
Append `/usr/local/lib/left4me/nft` to the existing `mkdir -p` invocation:
Old (lines 96-106):
```
$sudo_cmd mkdir -p \
/etc/left4me \
/opt/left4me \
/usr/local/lib/systemd/system \
/usr/local/libexec/left4me \
/var/lib/left4me/installation \
/var/lib/left4me/overlays \
/var/lib/left4me/instances \
/var/lib/left4me/runtime \
/var/lib/left4me/workshop_cache \
/var/lib/left4me/tmp
```
New (insert one line after `/usr/local/libexec/left4me`):
```
$sudo_cmd mkdir -p \
/etc/left4me \
/opt/left4me \
/usr/local/lib/systemd/system \
/usr/local/libexec/left4me \
/usr/local/lib/left4me/nft \
/var/lib/left4me/installation \
/var/lib/left4me/overlays \
/var/lib/left4me/instances \
/var/lib/left4me/runtime \
/var/lib/left4me/workshop_cache \
/var/lib/left4me/tmp
```
(c) **Copy the new systemd units alongside the existing ones (after line 140's `l4d2-build.slice` copy).**
Insert immediately after the `l4d2-build.slice` copy (the existing line that ends `l4d2-build.slice`):
```
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service /usr/local/lib/systemd/system/left4me-nft-mark.service
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/lib/systemd/system/left4me-cake.service /usr/local/lib/systemd/system/left4me-cake.service
```
(d) **Copy the nftables rules file alongside the existing `install`-mode copies (next to the sandbox-resolv.conf install at lines 189-191).**
Insert after the sandbox-resolv install block:
```
# Network packet marking + shaping. See spec
# docs/superpowers/specs/2026-05-10-l4d2-network-shaping-design.md.
$sudo_cmd install -m 0644 -o root -g root \
/opt/left4me/deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft \
/usr/local/lib/left4me/nft/left4me-mark.nft
```
(e) **Copy the CAKE helper alongside the other libexec helpers (after the existing `cp` block at lines 175-179).**
Find the existing `cp` block that copies `left4me-systemctl`, `left4me-journalctl`, `left4me-overlay`, `left4me-script-sandbox`. Add a new `cp` line for `left4me-apply-cake`, and add it to the `chmod 0755` line on line 179:
Old (line 178):
```
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-script-sandbox /usr/local/libexec/left4me/left4me-script-sandbox
```
After it, insert:
```
$sudo_cmd cp /opt/left4me/deploy/files/usr/local/libexec/left4me/left4me-apply-cake /usr/local/libexec/left4me/left4me-apply-cake
```
Old (line 179):
```
$sudo_cmd chmod 0755 /usr/local/libexec/left4me/left4me-systemctl /usr/local/libexec/left4me/left4me-journalctl /usr/local/libexec/left4me/left4me-overlay /usr/local/libexec/left4me/left4me-script-sandbox
```
New (append `left4me-apply-cake`):
```
$sudo_cmd chmod 0755 /usr/local/libexec/left4me/left4me-systemctl /usr/local/libexec/left4me/left4me-journalctl /usr/local/libexec/left4me/left4me-overlay /usr/local/libexec/left4me/left4me-script-sandbox /usr/local/libexec/left4me/left4me-apply-cake
```
(f) **Conditionally copy `cake.env` (after the existing sysctl install/apply block at lines 193-198).**
Insert immediately after `$sudo_cmd sysctl --system >/dev/null`:
```
# CAKE config: ship the template only if the operator hasn't created one
# (their LEFT4ME_UPLINK_MBIT value must survive re-deploys).
if [ ! -e /etc/left4me/cake.env ]; then
    $sudo_cmd install -m 0644 -o root -g root \
        /opt/left4me/deploy/files/etc/left4me/cake.env \
        /etc/left4me/cake.env
fi
```
(g) **Enable the new units alongside the existing `systemctl enable --now left4me-web.service`.**
Find the existing block (around line 315-316):
```
$sudo_cmd systemctl daemon-reload
$sudo_cmd systemctl enable --now left4me-web.service
```
Insert two lines between them:
```
$sudo_cmd systemctl daemon-reload
$sudo_cmd systemctl enable --now left4me-nft-mark.service
$sudo_cmd systemctl enable --now left4me-cake.service
$sudo_cmd systemctl enable --now left4me-web.service
```
- [ ] **Step 4: Re-run all existing tests + the new one to make sure nothing regressed.**
```
pytest deploy/tests/test_deploy_artifacts.py -v
```
Expected: every test passes, including the new `test_deploy_script_installs_network_shaping_artifacts` and the unmodified `test_deploy_script_shell_syntax` (the latter validates `sh -n` on the modified script).
- [ ] **Step 5: Commit.**
```
git add deploy/deploy-test-server.sh deploy/tests/test_deploy_artifacts.py
git commit -m "feat(deploy): wire nft marking + CAKE shaper into deploy script"
```
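
The `cake.env` handling in Task 7 is an install-if-absent pattern: the template is copied on first deploy and never overwritten afterwards. The sketch below restates that behaviour in Python so the intent is easy to test in isolation. `install_if_absent` is a hypothetical function, not part of the deploy script, and it does not set ownership the way `install -o root -g root` does.

```python
import shutil
from pathlib import Path


def install_if_absent(template: Path, target: Path, mode: int = 0o644) -> bool:
    """Copy `template` to `target` unless `target` already exists.

    Returns True when a copy happened, False when the existing file was kept.
    """
    if target.exists():
        return False  # operator's file wins; never overwrite
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, target)
    target.chmod(mode)
    return True
```

The second deploy therefore leaves an edited `LEFT4ME_UPLINK_MBIT` in place, at the cost that template changes (new comments, new knobs) never reach hosts that already have the file.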
---
### Task 8: README documentation
**Files:**
- Modify: `deploy/README.md`
This is documentation only — no test asserts the README contents. Run an `sh -n` of the deploy script one more time after editing, just as a hygiene check (the README change can't affect it, but the test suite is fast).
- [ ] **Step 1: Open `deploy/README.md` and locate the existing Performance tuning section.**
The previous perf-baseline spec added a "Performance tuning" section (entries for CPU governor, CPU affinity, NIC tuning, and real-time scheduling opt-in). Find it.
- [ ] **Step 2: Add a "Network shaping" subsection.**
Add this subsection at the top of "Performance tuning" (before the existing entries; network-shaping covers the universal artifacts that ship by default, while the existing entries are escape hatches):
```markdown
### Network shaping

The deploy ships three things that affect player-experience network behaviour:

1. **Per-flow marking.** `left4me-nft-mark.service` loads a small nftables
   table (`inet left4me_mark`) that marks every UDP packet from uid `left4me`
   with DSCP EF and `skb->priority` 6. srcds doesn't set these itself, so
   without this rule its UDP is indistinguishable from any other flow.

2. **Sysctl baseline.** `99-left4me.conf` sets `udp_rmem_min=16384`,
   `udp_wmem_min=16384`, `default_qdisc=fq_codel`, and
   `tcp_congestion_control=bbr`. Reduces head-of-line blocking when bulk
   TCP egress (backups, package fetches, web responses) coexists with
   game UDP.

3. **CAKE egress shaping.** `left4me-cake.service` runs
   `tc qdisc replace dev <iface> root cake bandwidth Xmbit internet
   diffserv4 dual-dsthost` from `/etc/left4me/cake.env`. CAKE only shapes
   if its declared bandwidth is **below** the real bottleneck, so set
   `LEFT4ME_UPLINK_MBIT` to ≈95% of measured uplink:

       sudoedit /etc/left4me/cake.env
       # set LEFT4ME_UPLINK_MBIT=480 (or whatever ~95% of your uplink is)
       sudo systemctl restart left4me-cake.service

   `LEFT4ME_UPLINK_IFACE` is auto-detected from the IPv4 default route;
   override it only on multi-homed hosts.

   On an idle 500 Mbit link with no competing egress, CAKE shapes nothing;
   that's expected, not a bug. The win materialises when bulk traffic on the
   same uplink would otherwise bufferbloat the link the players share.

**Production hosts running `systemd-networkd`** should NOT use the
`left4me-cake.service` oneshot. Instead, configure the equivalent in the
matching `.network` file, which systemd-networkd reapplies across iface
lifecycle events:

    # /etc/systemd/network/<your-uplink>.network
    [CAKE]
    Bandwidth=480M
    RTTSec=100ms
    PriorityQueueingPreset=diffserv4
    FlowIsolationMode=dual-dst-host

The nftables marking from (1) is qdisc-installer-agnostic and ships
unchanged on production.
```
- [ ] **Step 3: Append the three new escape hatches to the existing Performance tuning section.**
Add after the existing escape-hatch entries (CPU governor / CPU affinity / NIC tuning / real-time scheduling):
```markdown
### Additional opt-in network knobs

- **Ingress shaping via IFB.** Egress CAKE alone does not protect srcds
  receive against ingress saturation (large workshop downloads, package
  fetches arriving at line rate). Sketch:

      sudo modprobe ifb && sudo ip link set ifb0 up
      sudo tc qdisc add dev <uplink> handle ffff: ingress
      sudo tc filter add dev <uplink> parent ffff: protocol ip u32 \
          match u32 0 0 action mirred egress redirect dev ifb0
      sudo tc qdisc add dev ifb0 root cake bandwidth Xmbit ingress \
          diffserv4 dual-srchost

  Worth flipping only when measurement shows ingress hurting receive.

- **`net.core.busy_poll = 50` / `net.core.busy_read = 50`.** Reduces UDP
  receive median latency by polling for incoming packets briefly at
  syscall boundaries. Cost: measurable CPU per syscall under load. Worth
  flipping if a host is dedicated to game serving and CPU headroom is
  plentiful.

- **`ethtool -K <iface> gro off`.** Some Source-engine ops disable
  generic receive offload to avoid receive-side coalescing latency.
  Hardware/driver dependent; documented here only, not applied by the deploy.
```
- [ ] **Step 4: Re-run the full test suite.**
```
pytest deploy/tests/test_deploy_artifacts.py -v
```
Expected: every test passes, including `test_deploy_script_shell_syntax`.
- [ ] **Step 5: Commit.**
```
git add deploy/README.md
git commit -m "docs(deploy): document network-shaping defaults + opt-in network knobs"
```
---
## Final verification
After all eight tasks land, run the whole suite once more and verify the new files are tracked:
```
pytest deploy/tests/test_deploy_artifacts.py -v
git status
git log --oneline -10
```
Every test should pass. `git status` should be clean. The last 8 commits should match the eight tasks above.
The new files in the tree:
```
deploy/files/etc/left4me/cake.env
deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft
deploy/files/usr/local/lib/systemd/system/left4me-cake.service
deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service
deploy/files/usr/local/libexec/left4me/left4me-apply-cake
```
Modified files:
```
deploy/files/etc/sysctl.d/99-left4me.conf
deploy/deploy-test-server.sh
deploy/README.md
deploy/tests/test_deploy_artifacts.py
```


View file

@ -1,131 +0,0 @@
# RCON Password Display Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Show the RCON password on the server detail page with a show/hide toggle.
**Architecture:** Three-file change. An external JS file (`password-reveal.js`) provides the reveal/hide interaction via event delegation on `[data-password-toggle]` attributes — no inline handlers or HTML event attributes. The template adds a row to the existing `.server-info` definition list with a masked span, value span, and toggle button. Base.html adds the script include alongside existing JS files.
**Tech Stack:** Vanilla JS, Jinja2 templates, Flask
---
## File Structure
| File | Responsibility |
|------|---------------|
| `l4d2web/static/js/password-reveal.js` | New. Delegated click listener for show/hide toggle on `[data-password-toggle]` |
| `l4d2web/templates/server_detail.html` | Add one `<div>` row to `.server-info` DL |
| `l4d2web/templates/base.html` | Add `<script src="...password-reveal.js">` |
---
### Task 1: Create the reveal/hide JS
**Files:**
- Create: `l4d2web/static/js/password-reveal.js`
- [ ] **Step 1: Create `password-reveal.js`**
```js
document.addEventListener('click', (e) => {
  const btn = e.target.closest('[data-password-toggle]');
  if (!btn) return;
  const id = btn.dataset.passwordToggle;
  const mask = document.querySelector(`[data-password-field="${id}"].password-mask`);
  const value = document.querySelector(`[data-password-field="${id}"].password-value`);
  if (!mask || !value) return;
  const hidden = value.hidden;
  value.hidden = !hidden;
  mask.hidden = hidden;
  btn.textContent = hidden ? 'hide' : 'show';
  btn.setAttribute('aria-label', hidden ? 'Hide RCON password' : 'Show RCON password');
});
```
- [ ] **Step 2: Verify the file exists**
Run: `ls -la l4d2web/static/js/password-reveal.js`
Expected: File exists and is non-empty (roughly 600 bytes, depending on indentation)
- [ ] **Step 3: Commit**
```bash
git add l4d2web/static/js/password-reveal.js
git commit -m "feat: add password reveal toggle JS"
```
---
### Task 2: Add RCON password row to server detail template
**Files:**
- Modify: `l4d2web/templates/server_detail.html:13`
- [ ] **Step 1: Add the RCON password row after the blueprint row**
Insert after line 13 (`</dd></div>` for blueprint):
```html
<div><dt>RCON Password</dt><dd><span class="password-mask" data-password-field="{{ server.id }}">••••••••••••</span><span class="password-value" data-password-field="{{ server.id }}" hidden>{{ server.rcon_password }}</span> <button class="link-button" data-password-toggle="{{ server.id }}" aria-label="Show RCON password">show</button></dd></div>
```
Expected result: the `.server-info` DL now shows three rows: Port, Blueprint, RCON Password.
- [ ] **Step 2: Verify template renders**
Run: `python -c "from jinja2 import Environment; env=Environment(); env.parse(open('l4d2web/templates/server_detail.html').read()); print('parse ok')"`
Expected: `parse ok`
- [ ] **Step 3: Commit**
```bash
git add l4d2web/templates/server_detail.html
git commit -m "feat: add RCON password row to server detail page"
```
---
### Task 3: Include the script in base template
**Files:**
- Modify: `l4d2web/templates/base.html:44`
- [ ] **Step 1: Add the script include**
Insert after line 43 (`<script src="...file-tree.js">`):
```html
<script src="{{ url_for('static', filename='js/password-reveal.js') }}"></script>
```
Expected result: `base.html` now has 6 script includes: htmx, csrf.js, sse.js, modal.js, file-tree.js, password-reveal.js.
- [ ] **Step 2: Verify the app starts**
Run: `cd l4d2web && python -c "from l4d2web.app import create_app; app=create_app(); print('app created ok')"` (or similar smoke test)
Expected: App initializes without import/template errors.
- [ ] **Step 3: Commit**
```bash
git add l4d2web/templates/base.html
git commit -m "feat: include password-reveal.js in base template"
```
---
### Task 4: Run tests
**Files:** None
- [ ] **Step 1: Run existing test suite**
Run: `pytest l4d2web/tests -q`
Expected: All tests pass (no regressions from this purely-presentational change)
- [ ] **Step 2: If any tests fail, investigate and fix**
Run: `pytest l4d2web/tests -q --tb=long`
Expected: Clear failure report to debug

View file

@ -1,408 +0,0 @@
# Server Hostname (Source `hostname` cvar) Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add a `hostname` column to the Server model so users can set the Source `hostname` cvar (server browser/MOTD name), with an ephemeral `"<username> <server.name>"` fallback resolved at deploy time.
**Architecture:** New `hostname VARCHAR(128)` column on `servers` table (default `""`). Empty = auto-generate at deploy. The `build_server_spec_payload()` function gains a `resolved_hostname` kwarg; `initialize_server()` resolves the fallback ephemerally. The server detail page shows an inline form under RCON password. Same `POST /servers/<id>` endpoint handles saving.
**Tech Stack:** Python 3.12+, Flask, SQLAlchemy, Alembic, pytest.
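
The fallback rule described in the Goal can be sketched as a small pure function. This is an illustration of the stated behaviour only: `resolve_hostname` is a hypothetical name (the plan itself passes the result as the `resolved_hostname` kwarg), and treating only the empty string as "unset" is an assumption taken from the column default of `""`.

```python
def resolve_hostname(stored: str, username: str, server_name: str) -> str:
    """Return the stored hostname, or the "<username> <server.name>" fallback.

    The fallback is computed on demand and never written back, so a later
    rename of the user or server changes the generated hostname.
    """
    if stored:
        return stored
    return f"{username} {server_name}"


if __name__ == "__main__":
    print(resolve_hostname("", "alice", "versus-1"))
    print(resolve_hostname("Friday Night Versus", "alice", "versus-1"))
```

Keeping the fallback out of the database means an empty column always reads as "auto", which is why the migration uses `server_default=""` rather than backfilling generated names.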
---
### Task 1: Add `hostname` column to Server model and migration
**Files:**
- Modify: `l4d2web/models.py`
- Create: `l4d2web/alembic/versions/0011_server_hostname.py`
- [ ] **Step 1: Add hostname column to model**
Add to the `Server` class in `l4d2web/models.py:131`:
```python
hostname: Mapped[str] = mapped_column(String(128), default="", nullable=False)
```
Place it after `rcon_password` (line 148) and before `created_at` (line 149).
- [ ] **Step 2: Create the migration**
Create `l4d2web/alembic/versions/0011_server_hostname.py`:
```python
"""add hostname column to servers

Revision ID: 0011_server_hostname
Revises: 0010_server_live_state
Create Date: 2026-05-13
"""
from __future__ import annotations

from typing import Sequence, Union

import sqlalchemy as sa
from alembic import op

revision: str = "0011_server_hostname"
down_revision: Union[str, Sequence[str], None] = "0010_server_live_state"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    with op.batch_alter_table("servers") as batch_op:
        batch_op.add_column(
            sa.Column("hostname", sa.String(length=128), nullable=False, server_default="")
        )


def downgrade() -> None:
    with op.batch_alter_table("servers") as batch_op:
        batch_op.drop_column("hostname")
```
- [ ] **Step 3: Verify migration applies cleanly**
Run: `cd l4d2web && alembic upgrade head`
Expected: runs `0011_server_hostname` migration, adds the column.
Run: `cd l4d2web && alembic downgrade -1`
Expected: drops the column.
Run: `cd l4d2web && alembic upgrade head`
Expected: re-adds the column.
- [ ] **Step 4: Commit model + migration**
```bash
git add l4d2web/models.py l4d2web/alembic/versions/0011_server_hostname.py
git commit -m "feat(l4d2-web): add hostname column to Server model"
```
---
### Task 2: Accept and save `hostname` on server update
**Files:**
- Modify: `l4d2web/routes/server_routes.py`
- Test: `l4d2web/tests/test_servers.py`
- [ ] **Step 1: Write failing hostname update tests**
Add to `l4d2web/tests/test_servers.py`:
```python
def test_create_server_hostname_defaults_empty(user_client_with_blueprints) -> None:
    from sqlalchemy import select

    from l4d2web.models import Server

    client, data = user_client_with_blueprints
    response = client.post(
        "/servers",
        data={"name": "alpha", "port": "27015", "blueprint_id": str(data["blueprint_id"])},
        headers={"X-CSRF-Token": "test-token"},
    )
    assert response.status_code == 302
    with session_scope() as session:
        server = session.scalar(select(Server).where(Server.name == "alpha"))
        assert server is not None
        assert server.hostname == ""


def test_update_server_hostname_via_form(user_client_with_blueprints) -> None:
    from sqlalchemy import select

    from l4d2web.models import Server

    client, data = user_client_with_blueprints
    create = client.post(
        "/servers",
        data={"name": "alpha", "port": "27015", "blueprint_id": str(data["blueprint_id"])},
        headers={"X-CSRF-Token": "test-token"},
    )
    server_id = create.headers["Location"].rsplit("/", 1)[1]
    update = client.post(
        f"/servers/{server_id}",
        data={"name": "alpha", "hostname": "My Cool Server"},
        headers={"X-CSRF-Token": "test-token"},
    )
    assert update.status_code == 302
    with session_scope() as session:
        server = session.scalar(select(Server).where(Server.name == "alpha"))
        assert server is not None
        assert server.hostname == "My Cool Server"


def test_update_server_clears_hostname(user_client_with_blueprints) -> None:
    from sqlalchemy import select

    from l4d2web.models import Server

    client, data = user_client_with_blueprints
    create = client.post(
        "/servers",
        data={"name": "alpha", "port": "27015", "blueprint_id": str(data["blueprint_id"])},
        headers={"X-CSRF-Token": "test-token"},
    )
    server_id = create.headers["Location"].rsplit("/", 1)[1]
    # Set hostname first
    client.post(
        f"/servers/{server_id}",
        data={"name": "alpha", "hostname": "My Cool Server"},
        headers={"X-CSRF-Token": "test-token"},
    )
    # Clear it
    client.post(
        f"/servers/{server_id}",
        data={"name": "alpha", "hostname": ""},
        headers={"X-CSRF-Token": "test-token"},
    )
    with session_scope() as session:
        server = session.scalar(select(Server).where(Server.name == "alpha"))
        assert server is not None
        assert server.hostname == ""
```
- [ ] **Step 2: Run tests to verify failure**
Run: `pytest l4d2web/tests/test_servers.py -k "hostname" -v`
Expected: FAIL — `Server` object has no attribute `hostname`.
- [ ] **Step 3: Save hostname from form in update route**
In `l4d2web/routes/server_routes.py`, modify `update_server_form` (around line 130) to save hostname:
```python
server.name = name
server.hostname = request.form.get("hostname", "")
```
- [ ] **Step 4: Run tests to verify pass**
Run: `pytest l4d2web/tests/test_servers.py -k "hostname" -v`
Expected: PASS.
- [ ] **Step 5: Commit hostname update support**
```bash
git add l4d2web/routes/server_routes.py l4d2web/tests/test_servers.py
git commit -m "feat(l4d2-web): accept hostname on server update, default empty on create"
```
---
### Task 3: Emit `hostname` in spec payload with ephemeral fallback
**Files:**
- Modify: `l4d2web/services/l4d2_facade.py`
- Test: `l4d2web/tests/test_l4d2_facade.py`
- [ ] **Step 1: Write failing hostname spec tests**
Add to `l4d2web/tests/test_l4d2_facade.py`:
```python
def test_build_server_spec_payload_injects_hostname() -> None:
    from l4d2web.services.l4d2_facade import build_server_spec_payload

    bp = Blueprint(id=1, user_id=1, name="bp", arguments="[]", config='["sv_consistency 1"]')
    srv = Server(id=1, user_id=1, blueprint_id=1, name="alpha", port=27015, rcon_password="sekret")
    spec = build_server_spec_payload(srv, bp, [], resolved_hostname="My Server")
    cfg = spec["config"]
    assert "hostname \"My Server\"" in cfg
    assert cfg[-1] == "rcon_password \"sekret\""


def test_build_server_spec_payload_omits_hostname_when_empty() -> None:
    from l4d2web.services.l4d2_facade import build_server_spec_payload

    bp = Blueprint(id=1, user_id=1, name="bp", arguments="[]", config="[]")
    srv = Server(id=1, user_id=1, blueprint_id=1, name="alpha", port=27015, rcon_password="sekret")
    spec = build_server_spec_payload(srv, bp, [])
    for line in spec["config"]:
        assert not line.startswith("hostname ")


def test_initialize_server_resolves_fallback_hostname(
    monkeypatch: pytest.MonkeyPatch, server_with_blueprint,
) -> None:
    """When server.hostname is empty, deploy emits hostname "<username> <server.name>"."""
    from l4d2web.services.l4d2_facade import initialize_server

    spec_contents: list[str] = []

    def fake_run_command(cmd, **kwargs):
        nonlocal spec_contents
        spec_path = cmd[cmd.index("-f") + 1]
        spec_contents.append(Path(spec_path).read_text())
        return CommandResult(returncode=0, stdout="", stderr="")

    monkeypatch.setattr("l4d2web.services.host_commands.run_command", fake_run_command)
    server_id, _ = server_with_blueprint
    initialize_server(server_id)
    assert len(spec_contents) == 1
    assert "hostname" in spec_contents[0]
    # The fixture creates user "alice" and server named "alpha"
    assert '"alice alpha"' in spec_contents[0]
```
- [ ] **Step 2: Run tests to verify failure**
Run: `pytest l4d2web/tests/test_l4d2_facade.py -k "hostname" -v`
Expected: FAIL — `build_server_spec_payload()` got unexpected keyword `resolved_hostname`.
- [ ] **Step 3: Add `resolved_hostname` kwarg and emit line**
In `l4d2web/services/l4d2_facade.py`, modify `build_server_spec_payload` signature and add the hostname injection before `rcon_password`:
```python
def build_server_spec_payload(
    server: Server,
    blueprint: Blueprint,
    overlay_rows: list[tuple[int, str, bool]],
    *,
    resolved_hostname: str = "",
) -> dict:
```
Inside the function, after building `config_lines` and before the `if server.rcon_password:` block, add:
```python
if resolved_hostname:
    config_lines.append(f'hostname "{resolved_hostname}"')
```
Then in `initialize_server`, resolve the fallback. Add the `User` import at the top:
```python
from l4d2web.models import (
    Blueprint,
    BlueprintOverlay,
    Overlay,
    OverlayWorkshopItem,
    Server,
    User,
    WorkshopItem,
)
```
In `initialize_server`, after `load_server_blueprint_bundle(server_id)`, add:
```python
# Resolve hostname: explicit override or ephemeral fallback
if server.hostname:
    resolved_hostname = server.hostname
else:
    with session_scope() as db:
        user = db.get(User, server.user_id)
        resolved_hostname = f"{user.username} {server.name}"
```
Then pass it to `build_server_spec_payload`:
```python
spec_path = write_temp_spec(build_server_spec_payload(
    server, blueprint, overlay_rows, resolved_hostname=resolved_hostname,
))
```
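The fallback rule and the emitted config line can also be read as two small pure functions. This is a minimal sketch, not the code in `l4d2_facade.py`: `resolve_hostname` and `hostname_config_line` are hypothetical names used only for illustration, and the sketch assumes the username and server name are already loaded.

```python
from __future__ import annotations


def resolve_hostname(explicit: str, username: str, server_name: str) -> str:
    """Return the stored hostname, or the "<username> <server.name>" fallback.

    Hypothetical helper: the plan inlines this logic in initialize_server().
    """
    return explicit if explicit else f"{username} {server_name}"


def hostname_config_line(resolved_hostname: str) -> str | None:
    """Build the `hostname "..."` line, or None when there is nothing to emit."""
    if not resolved_hostname:
        return None
    return f'hostname "{resolved_hostname}"'


print(hostname_config_line(resolve_hostname("", "alice", "alpha")))
```

The sketch does not handle a hostname that itself contains a double quote, which would likely end the quoted value early in the generated config; whether to reject or strip such input is left open here.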
- [ ] **Step 4: Run tests to verify pass**
Run: `pytest l4d2web/tests/test_l4d2_facade.py -k "hostname" -v`
Expected: PASS.
Also run full suite to check nothing broken: `pytest l4d2web/tests/test_l4d2_facade.py -v`
- [ ] **Step 5: Commit hostname spec payload**
```bash
git add l4d2web/services/l4d2_facade.py l4d2web/tests/test_l4d2_facade.py
git commit -m "feat(l4d2-web): emit hostname in spec config with ephemeral fallback"
```
---
### Task 4: Add hostname form to server detail page
**Files:**
- Modify: `l4d2web/templates/server_detail.html`
- [ ] **Step 1: Verify the current template renders correctly first**
Run: `pytest l4d2web/tests -q`
Expected: PASS (baseline).
- [ ] **Step 2: Add hostname form under RCON password**
In `l4d2web/templates/server_detail.html`, after the RCON password `<dd>` block (closing `</dd>` at line 14) and before the closing `</dl>` (line 15), add:
```html
<div><dt>Hostname</dt>
  <dd>
    <form method="post" action="/servers/{{ server.id }}" class="inline-save">
      <input type="hidden" name="csrf_token" value="{{ session.get('csrf_token', '') }}">
      <input name="hostname" value="{{ server.hostname }}" placeholder="{{ user.username }} {{ server.name }}" maxlength="128">
      <button type="submit">Save</button>
      <span class="field-hint">Leave empty for auto: "{{ user.username }} {{ server.name }}"</span>
    </form>
  </dd>
</div>
```
The `user` variable is already available in the template context (the server detail route passes it through the auth mechanism).
- [ ] **Step 3: Run full test suite to verify nothing broken**
Run: `pytest l4d2web/tests -q`
Expected: PASS.
- [ ] **Step 4: Commit template change**
```bash
git add l4d2web/templates/server_detail.html
git commit -m "feat(l4d2-web): add hostname edit form to server detail page"
```
---
### Task 5: Final integration verification
**Files:**
- Run full test suites
- [ ] **Step 1: Run all tests**
Run: `pytest l4d2web/tests -q`
Expected: PASS.
Run: `pytest l4d2host/tests -q` (host lib must not be affected)
Expected: PASS.
- [ ] **Step 2: Run alembic check to ensure migration is the latest**
Run: `cd l4d2web && alembic check`
Expected: "No new upgrade operations detected."
- [ ] **Step 3: Commit any final touches needed**
```bash
git add -A
git commit -m "chore: finalize server hostname feature"
```
---
## Self-Review
- [ ] Spec coverage: model column, migration, update route saves hostname, spec payload emits hostname line, ephemeral fallback resolved in initialize_server, template has inline form.
- [ ] Placeholder scan: no TODOs or TBDs.
- [ ] Type/name consistency: `resolved_hostname` kwarg matches usage in both caller and callee.
- [ ] Verification: each task has exact test commands and expected outcomes.


@ -1,331 +0,0 @@
# Idmapped lowerdirs for left4me kernel-overlayfs
> **SUPERSEDED 2026-05-15** by the uid-collapse refactor
> ([`2026-05-15-uid-collapse.md`](2026-05-15-uid-collapse.md)). With
> `l4d2-sandbox` collapsed into `left4me`, all overlay content is
> uniformly `left4me`-owned end-to-end and no idmap is needed at
> mount time either. Kept for design-evolution context.
## Context
Kernel-overlayfs copy-up preserves the lower-layer file's owner and mode in the
upperdir. Script overlays today are built by `left4me-script-sandbox` running as
uid `l4d2-sandbox`, and the helper finalizes them as `l4d2-sandbox:l4d2-sandbox
0755`. When the L4D2 server (uid `left4me`) tries to write into a directory
that exists only in the lower layer — e.g. SourceMod's `addons/sourcemod/logs/`
for log rotation — copy-up succeeds but the result is `l4d2-sandbox`-owned, so
the write fails `EACCES`. Workshop overlays are unaffected because the web app
(uid `left4me`) builds them as `left4me`-owned with symlinks into a `left4me`-
owned cache.
We considered four fixes (chown-flip on every rebuild, shared group, collapse
sandbox uid into `left4me`, idmapped lowerdir bind mounts). The user chose the
idmap path: disk state stays untouched, the mount stack remaps `l4d2-sandbox →
left4me` at mount time, kernel-overlayfs sees a `left4me`-owned lower layer,
and copy-up creates upperdir entries owned by `left4me` naturally. No
ownership flipping, no shared group, no security regression.
Outcome: `sm_cvar` writes succeed, SM logs land in `runtime/<n>/upper/...`,
and any future "writes into a sandbox-built lower layer" works without
left4me-specific plumbing.
## Environment confirmed
Test server: Debian Trixie, kernel `6.12.86+deb13-amd64`, util-linux supports
`mount --map-users <on_disk>:<in_mount>:<count>`. Idmapped lowerdirs for
overlayfs landed mainline in 6.6, so 6.12 is fine. Verified end-to-end on
`/var/lib/left4me/` (ext4) in a temp dir on 2026-05-14:
1. Source dir owned `l4d2-sandbox:l4d2-sandbox` (uid 981).
2. `mount --bind --map-users=981:980:1 --map-groups=981:980:1 src dst` → the
`dst` view shows uid 980 (left4me).
3. Overlay mount with the idmapped path as `lowerdir=` — merged view also
shows uid 980.
4. `sudo -u left4me touch merged/addons/sourcemod/logs/L_test.log` — succeeds.
`sudo -u left4me bash -c "echo x >> merged/file-from-sandbox.txt"` also
succeeds (copy-up of existing file).
5. `upper/` after writes is entirely `left4me`-owned (uid 980).
Caveat surfaced during testing: `--map-users` direction is **on-disk uid
first**, not "inner-namespace uid". The util-linux man page calls it
`<inner>:<outer>:<count>` but `<inner>` means "the filesystem's native view"
(on disk) and `<outer>` means "what the mount exposes outward". Easy to get
wrong; do not trust the man page wording.
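To keep the argument order from being flipped by accident, the flag strings could be built by one small function whose parameter names spell out the direction. This is a sketch only; `idmap_flags` is a hypothetical name, and the uids 981 and 980 are the ones observed on the test server above.

```python
def idmap_flags(
    on_disk_uid: int,
    in_mount_uid: int,
    on_disk_gid: int,
    in_mount_gid: int,
    count: int = 1,
) -> list[str]:
    """Build the --map-users / --map-groups flags, on-disk id first.

    Hypothetical helper: the first number is what the filesystem stores,
    the second is what the idmapped mount exposes.
    """
    return [
        f"--map-users={on_disk_uid}:{in_mount_uid}:{count}",
        f"--map-groups={on_disk_gid}:{in_mount_gid}:{count}",
    ]


# l4d2-sandbox (981 on disk) should appear as left4me (980) in the mount.
print(idmap_flags(981, 980, 981, 980))
```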
## Approach
The privileged mount helper grows one step before the overlay mount: for each
lowerdir whose owning uid is `l4d2-sandbox`, create an idmapped bind mount at
`runtime/<n>/idmap/<basename>` that remaps that uid to `left4me`. Use the
idmapped paths (instead of raw paths) in the `lowerdir=` string passed to
`mount -t overlay`. On umount, tear the idmap binds down after the overlay
itself is unmounted.
Lowerdirs already owned by `left4me` (workshop builds, `installation/`,
caches) bypass the idmap step and are used as-is, so workshop overlays keep
working without behavior change.
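The approach above can be sketched as a pure planning function that decides, per lowerdir, whether a bind is needed and what the final `lowerdir=` string looks like. This is an illustration under assumptions, not the helper's code: `plan_lowerdirs` and its `owner_of` parameter are hypothetical, and the sketch assumes lowerdir basenames are unique within one instance.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Callable


def plan_lowerdirs(
    lowerdirs: list[Path],
    idmap_root: Path,
    *,
    sandbox_uid: int,
    left4me_uid: int,
    sandbox_gid: int,
    left4me_gid: int,
    owner_of: Callable[[Path], int] = lambda p: os.stat(p).st_uid,
) -> tuple[list[list[str]], str]:
    """Return (bind-mount argvs, lowerdir string) without touching the system.

    Only the top-level directory's owner is checked, as the plan describes.
    """
    binds: list[list[str]] = []
    resolved: list[Path] = []
    for lower in lowerdirs:
        if owner_of(lower) == sandbox_uid:
            target = idmap_root / lower.name  # assumes unique basenames
            binds.append([
                "mount", "--bind",
                f"--map-users={sandbox_uid}:{left4me_uid}:1",
                f"--map-groups={sandbox_gid}:{left4me_gid}:1",
                str(lower), str(target),
            ])
            resolved.append(target)
        else:
            resolved.append(lower)  # left4me-owned: used as-is
    return binds, ":".join(str(p) for p in resolved)
```

Keeping the decision separate from the `mount` exec is one way to make the PRINT_ONLY tests cheap: the argv lists can be asserted on directly.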
## Execution shape
Implementation will be driven by `superpowers:subagent-driven-development`:
fresh implementer subagent per task, followed by a spec-compliance reviewer
then a code-quality reviewer. The project's `AGENTS.md` forbids git
worktrees, so all work happens in the live tree. Commits land directly on
`master` per the user-confirmed project pattern.
Tasks ordered for review-friendly progression. Each task is independently
committable; the deploy/verify step at the end exercises the whole chain on
the real test server.
### Task 1 — Idmap bind mounts in `left4me-overlay`
Edit `deploy/files/usr/local/libexec/left4me/left4me-overlay` and
`l4d2host/tests/test_overlay_helper.py` together (TDD: write failing
PRINT_ONLY-mode tests first, then make them pass).
Behavior to add:
- Resolve `l4d2_sandbox_uid` and `left4me_uid` (and gids) via `pwd.getpwnam`
/ `grp.getgrnam`. Hard fail with a clear message if either is missing.
- On `mount <name>`: before constructing the `lowerdir=` string, for each
resolved lowerdir, stat it; if the top-level dir's `st_uid` equals
`l4d2_sandbox_uid`, create `runtime/<n>/idmap/<basename>` (mode `0700`,
root-owned), and if it's not already a mountpoint, exec `mount --bind
--map-users=<l4d2_sandbox_uid>:<left4me_uid>:1
--map-groups=<l4d2_sandbox_gid>:<left4me_gid>:1 <src> <target>`. Use
numeric uids/gids in the argv. Substitute the idmap path into the
`lowerdir=` colon string in place of the original path.
- On `umount <name>`: after the existing `umount` of `merged`, iterate
`runtime/<n>/idmap/*`, `umount` each that is a mountpoint, then
`shutil.rmtree(runtime/<n>/idmap, ignore_errors=True)`. Idempotent.
- PRINT_ONLY mode emits the bind-mount argv (one line per bind) before the
overlay-mount argv, same shell-quoting style.
Tests to add to `test_overlay_helper.py` (reuse PRINT_ONLY harness):
- `test_mount_idmaps_sandbox_owned_lowerdir` — tmp lower owned by
faked-sandbox uid, assert helper emits `mount --bind --map-users=...`
argv and the overlay `lowerdir=` references the idmap path.
- `test_mount_skips_idmap_for_left4me_owned_lowerdir` — assert no bind
argv, raw path in `lowerdir=`.
- `test_umount_unwinds_idmap_binds` — pre-seed an idmap subdir as a
mountpoint sentinel; assert the umount sequence in PRINT_ONLY includes
the bind teardown after the overlay umount.
Uid lookup in tests: monkeypatch `pwd.getpwnam` to return synthetic uids
matching what the test's `chown` set up. (No root required.)
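A rough illustration of the uid-lookup patching follows. It uses `unittest.mock.patch` as a stand-in for pytest's `monkeypatch` fixture so that it runs on its own; `resolve_uids`, `fake_getpwnam`, and the uid values are hypothetical and chosen only for the example.

```python
import pwd
from types import SimpleNamespace
from unittest import mock


def resolve_uids(names=("l4d2-sandbox", "left4me")) -> dict:
    """Look up each required user; fail hard with a clear message if missing."""
    uids = {}
    for name in names:
        try:
            uids[name] = pwd.getpwnam(name).pw_uid
        except KeyError:
            raise SystemExit(f"required user {name!r} does not exist") from None
    return uids


# Synthetic passwd entries standing in for the real accounts.
FAKE_USERS = {"l4d2-sandbox": 981, "left4me": 980}


def fake_getpwnam(name):
    if name not in FAKE_USERS:
        raise KeyError(name)
    return SimpleNamespace(pw_uid=FAKE_USERS[name])


def resolve_with_fake_users() -> dict:
    # Patch the attribute on the pwd module so resolve_uids sees the fake.
    with mock.patch("pwd.getpwnam", side_effect=fake_getpwnam):
        return resolve_uids()


print(resolve_with_fake_users())
```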
### Task 2 — Deploy-artifact regression test
Edit `deploy/tests/test_deploy_artifacts.py`. Add a single test that opens
`deploy/files/usr/local/libexec/left4me/left4me-overlay` and asserts the
strings `--map-users` and `runtime/` followed by `idmap` (or similar
identifying marker) are present. Cheap guard against silent regression of
the deploy artifact.
### Task 3 — Deploy README mirror note
Edit `deploy/README.md`. Add one line under the existing ckn-bw mirror
notes flagging that the helper file change must be picked up by
`bundles/left4me/` in ckn-bw (no new group, user, or unit needed).
### Task 4 — Persist the plan in-repo
Per `AGENTS.md`: "the persisted artifact must end up under
`docs/superpowers/` and be committed." Copy this scratch plan to
`docs/superpowers/plans/2026-05-14-overlay-idmap.md` and commit it. Do
this as a separate commit from Task 1 so the plan lands before the
implementation.
### Task 5 — Deploy and verify on `left4.me`
Out-of-band, after the code tasks land:
1. ckn-bw apply (or scp the helper into place on the test server) to get
the new helper deployed.
2. Stop server 2: `sudo systemctl stop left4me-server@2`.
3. Clear the stale `l4d2-sandbox`-owned upperdir SM dirs:
`sudo rm -rf /var/lib/left4me/runtime/2/upper/left4dead2/addons/sourcemod/{logs,data}`.
4. Start server 2: `sudo systemctl start left4me-server@2`.
5. Confirm `journalctl -u left4me-server@2 -o cat -n 50` shows the new
`mount --bind --map-users=...` line.
6. RCON `sm_cvar nb_update_frequency 0.0333`. Expect no `Platform returned
error: "Permission denied"` log line.
7. `sudo ls -ln /var/lib/left4me/runtime/2/upper/left4dead2/addons/sourcemod/logs/`.
Expect uid 980 (left4me).
8. Restart again to confirm idempotency: idmap binds set up fresh, no
leftover mounts from prior start.
## Files to modify
### 1. `deploy/files/usr/local/libexec/left4me/left4me-overlay`
Single privileged code path; everything else flows from here.
**Add a helper function** to decide whether a path needs idmapping. Stat the
directory; if its `st_uid` matches the resolved `l4d2-sandbox` uid, return the
idmapped path under `runtime/<n>/idmap/`; otherwise return the input path
unchanged.
**In `cmd_mount`**, before constructing `lowerdir=`:
1. `os.makedirs(runtime_dir / "idmap", exist_ok=True)` (root-owned, mode
`0o700`; only the helper writes here).
2. Resolve `l4d2_sandbox_uid = pwd.getpwnam("l4d2-sandbox").pw_uid` and
`left4me_uid = pwd.getpwnam("left4me").pw_uid`. Cache. Fail fast with a
clear message if either user is missing.
3. For each `lowerdir` in the resolved list, compute
`idmapped_path(lowerdir)`. If remapping is required:
- Create the target directory under `runtime/<n>/idmap/<basename>` if
missing.
- Skip the bind if it's already mounted there (`os.path.ismount`).
- Otherwise exec `mount --bind
--map-users=<l4d2_sandbox_uid>:<left4me_uid>:1
--map-groups=<l4d2_sandbox_gid>:<left4me_gid>:1 <src> <target>`.
**Direction note**: first arg is the on-disk uid, second arg is the
uid the mount exposes. We verified empirically that this is what the
kernel honors despite ambiguous man-page wording.
4. Pass the (possibly idmapped) paths into the `lowerdir=` colon string in
the same order.
**In `cmd_umount`**, after the existing overlay umount:
- For each subdirectory under `runtime/<n>/idmap/`, if it's a mountpoint,
`umount` it.
- `shutil.rmtree(runtime_dir / "idmap", ignore_errors=True)` after all binds
are gone.
- Idempotent: re-running after the dir is gone is a no-op.
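The teardown order described above might look like the following sketch. `plan_idmap_teardown` and `remove_idmap_root` are hypothetical names, and the `is_mount` parameter exists only so the logic can be exercised without real mounts.

```python
from __future__ import annotations

import os
import shutil
from pathlib import Path
from typing import Callable


def plan_idmap_teardown(
    idmap_root: Path,
    is_mount: Callable[[Path], bool] = os.path.ismount,
) -> list[list[str]]:
    """Return one `umount` argv per idmap bind that is still mounted.

    Returns an empty list when the idmap directory is already gone, so a
    second run is a no-op.
    """
    if not idmap_root.is_dir():
        return []
    return [
        ["umount", str(child)]
        for child in sorted(idmap_root.iterdir())
        if is_mount(child)
    ]


def remove_idmap_root(idmap_root: Path) -> None:
    """Remove the idmap directory once every bind has been unmounted."""
    shutil.rmtree(idmap_root, ignore_errors=True)
```

In the helper, the overlay `umount` of `merged` would run first, then the argvs returned here, then `remove_idmap_root`.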
**PRINT_ONLY mode**: emit the bind-mount argv before the overlay-mount argv,
on separate lines, so test assertions can match. Same shell-quoting.
**Allowlist**: no change needed — the idmap binds land under `runtime/`,
which is already write-permitted for the helper.
### 2. `l4d2host/tests/test_overlay_helper.py`
Add tests using the existing PRINT_ONLY harness:
- `test_mount_idmaps_sandbox_owned_lowerdir`: create a tmp lowerdir, `chown`
it to a fake-`l4d2-sandbox` (use `monkeypatch` on the uid lookup if running
unprivileged), run helper in PRINT_ONLY, assert a `mount --bind
--map-users=...` line appears and the `lowerdir=` string references the
idmap path.
- `test_mount_skips_idmap_for_left4me_owned_lowerdir`: tmp lowerdir owned by
the test user, assert no bind-mount argv emitted and `lowerdir=`
references the raw path.
- `test_umount_unwinds_idmap_binds`: pre-create `runtime/<n>/idmap/foo` as a
mountpoint sentinel, assert the PRINT_ONLY umount sequence includes the
bind-mount teardown after the overlay umount (overlay first, then binds,
matching the helper order).
Reuse the existing `LEFT4ME_OVERLAY_PRINT_ONLY=1` plumbing rather than
inventing a new mode.
### 3. `deploy/tests/test_deploy_artifacts.py`
Add a grep-style assertion that the helper file contains the strings
`--map-users` and `idmap` so the deploy artifact can't silently regress.
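A grep-style guard of this kind could be as small as the sketch below. `missing_markers` is a hypothetical helper name; the real test would point it at `deploy/files/usr/local/libexec/left4me/left4me-overlay` and assert that the returned list is empty.

```python
from pathlib import Path

# Strings the deployed helper is expected to contain.
REQUIRED_MARKERS = ("--map-users", "idmap")


def missing_markers(helper_path: Path, markers=REQUIRED_MARKERS) -> list:
    """Return the markers that do not appear anywhere in the helper file."""
    text = helper_path.read_text(encoding="utf-8")
    return [marker for marker in markers if marker not in text]
```

Returning the missing markers, rather than a bare boolean, keeps the failure message specific about what regressed.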
### 4. `deploy/README.md`
One-line mirror note: ckn-bw's `bundles/left4me/` ships the helper verbatim,
so no new bundle-side change is needed beyond updating the file. No new
group, no new user, no new systemd unit. Flag this explicitly so the next
deploy-to-prod step is "rebuild the helper file in ckn-bw, `bw apply
ovh.left4me`".
## Migration
No on-disk schema change. Existing overlays keep their current ownership
(`l4d2-sandbox`-owned for script builds, `left4me`-owned for workshop
builds). The mount helper picks the right path per lowerdir at next
`systemctl start <instance>`.
Already-running instances on the test server pick up the change after a
service restart. Live SourceMod sessions whose `addons/sourcemod/logs/`
copy-up is already broken in upperdir need a fix too: the upperdir entries
are `l4d2-sandbox:l4d2-sandbox 0755` from the previous broken copy-up. The
helper doesn't touch upper/ on mount, so those stale entries persist.
Two safe migration options:
1. Manual: on the test server, stop server 2, `rm -rf
runtime/2/upper/left4dead2/addons/sourcemod/logs runtime/2/upper/left4dead2/addons/sourcemod/data`,
start it again. Copy-up will redo with idmapped lower → `left4me`-owned
upper.
2. Automatic: have `start_instance` proactively delete known-SM writable
dirs from `upper/` if their uid is `l4d2-sandbox`. Out of scope for this
change unless we hit it again.
Recommend option 1 — one-shot, no code change.
## Verification
End-to-end test on `left4.me`:
1. `bw apply ovh.left4me` (or scp the updated helper into place).
2. Stop server 2: `sudo systemctl stop left4me-server@2`.
3. Clean the stale broken SM upperdir: `sudo rm -rf
/var/lib/left4me/runtime/2/upper/left4dead2/addons/sourcemod/{logs,data}`.
4. Start server 2: `sudo systemctl start left4me-server@2`.
5. In the logged `left4me-overlay mount` output (check `journalctl -u
left4me-server@2 -o cat -n 50`), confirm `mount --bind --map-users=...`
was executed.
6. RCON to server 2, `sm_cvar nb_update_frequency 0.0333`. Expect no
`Platform returned error: "Permission denied"` log line.
7. `ls -ln
/var/lib/left4me/runtime/2/upper/left4dead2/addons/sourcemod/logs/`.
Expect files owned by `left4me`'s numeric uid.
8. `sudo umount /var/lib/left4me/runtime/2/merged` should still work; then
verify `runtime/2/idmap/` is cleaned up by `ExecStopPost`. Restart and
confirm idempotent (no leftover binds).
Local tests:
- `pytest l4d2host/tests/test_overlay_helper.py -q`
- `pytest deploy/tests/test_deploy_artifacts.py -q`
## Risks and edge cases
- **Workshop overlay misidentification**: a workshop overlay with a
`l4d2-sandbox`-owned subdir somehow (e.g. partial migration) would get
idmapped despite containing `left4me`-owned files. Files with other uids
through an idmap appear as the overflow uid (`nobody`/65534), which would
break reads. Mitigation: check ownership of the **top-level overlay
directory** as the trigger, not file-by-file. If the top is sandbox-owned,
trust the whole tree; if the top is left4me-owned, no idmap. This matches
what each builder actually produces.
- **`installation/` and caches**: always `left4me`-owned, never idmapped.
- **Symlinks inside script overlays**: idmap operates at the mount level,
not per-inode. Symlink ownership translates the same as files. Targets
inside the overlay resolve through the same mount. Targets outside (none
in script overlays today; workshop ones don't take this code path) would
not be affected.
- **Mount namespace**: the helper runs in PID 1's mount namespace via the
unit's `ExecStartPre=+nsenter ...`. Bind mounts created there persist
until the matching `ExecStopPost` umount, exactly like the overlay mount
itself.
- **Crash mid-build**: idmap binds are created only at `mount` time, not at
build time. A crashed build leaves no orphan mounts.
- **Crash mid-start (ExecStartPre fails between bind and overlay mount)**:
systemd's `Restart=on-failure` re-invokes ExecStartPre. The helper checks
`os.path.ismount` on each idmap target and skips already-mounted ones.
Idempotent.
- **`runtime/<n>/idmap/` cleanup on `_purge_instance`**: existing
`shutil.rmtree(runtime_dir)` after `disable_service` already triggers the
helper's umount sequence, which removes the idmap dir. No new code.
- **util-linux flag form**: prefer `--map-users <inner>:<outer>:<count>` and
`--map-groups` (numeric uids/gids resolved by the helper) over the
`X-mount.idmap=` mount-option syntax — clearer and easier to log.
## Out of scope
- Web app uid split (`l4d2-web` separate from `left4me`) — orthogonal,
rejected for this change.
- Gameserver uid split (separating the gameserver-runtime uid from
`left4me`) — planned for a later session. **One forward-compat
coupling**: the helper looks up `pwd.getpwnam("left4me")` as the in-mount
target uid. When the gameserver moves to its own user (e.g. `l4d2-game`),
change that one string. Everything else (script-sandbox uid, workshop
builder uid, file-tree endpoint, idmap cleanup) is uid-agnostic.
- Replacing `l4d2-sandbox` with a different uid scheme — kept as defense in
depth.
- Spec doc updates (including the bubblewrap → systemd-run wording
correction in the existing script-overlay spec): dropped per user
decision; this change ships plan-only.


@ -1,387 +0,0 @@
# Add an RCON console to the server detail page
## Context
The server detail page (`/servers/<id>`) already exposes the RCON password,
live state polling, log streaming, and start/stop actions, but to send any
arbitrary command (`changelevel`, `sm_kick`, `mp_*`, `say`, etc.) the user
has to open a separate RCON client and reconnect. Adding an inline console
turns the web UI into a complete operator tool for the owner of a server:
type a command, see the reply, recall earlier commands via persisted
history.
Scope is intentionally narrow:
- One server, one user (the owner). Multi-user shared console = not now.
- Per-user history persisted across reloads.
- No blocklist — owner already has the RCON password and can run anything
via any RCON client; the UI is a thin wrapper.
## Design decisions (already settled)
| Topic | Choice |
|---|---|
| UI placement | Panel on `server_detail.html`, between **Live State** and **Files**. |
| Output transport | **HTMX append swap**, not SSE. RCON is request/response — SSE adds no value. Matches existing inline-form / `hx-swap` patterns in the codebase. |
| Safety | Owner-of-server check only (`Server.user_id == current_user.id`). No command blocklist. **No admin override** — admins can already SSH if needed; an unaudited UI backdoor isn't worth the asymmetry. |
| History | New `command_history` table, scoped per (user, server). Stores **command + reply + error flag** so the full transcript can be replayed on page reload. |
| Transcript on page load | **Replays the last 50 rows** for this user+server, rendered server-side into the transcript via the same `_console_line.html` partial used for live additions. Visually identical to live lines (no "old vs new" distinction — the whole point is page-reload continuity). |
| Transcript height | Fixed max-height ~400 px, internal vertical scroll. New lines auto-scroll to the bottom on add AND on initial load. Page layout below stays stable. |
| Clear button | None. Reload doesn't help (it replays). If anyone wants to drop history, that's a separate concern handled later. |
| RCON timeout | **30s per command.** Comfortably covers a cold map load with custom add-ons (community-observed worst case ~25s on modest hardware). 3× the python-valve default. Far below `director_transition_timeout` (120s) so no aliasing. If a command exceeds 30s, the RCON exec packet was already sent — the server still did the work; the user just doesn't see the textual reply but sees the effect in the Server Log SSE panel above. |
| Worker model | Rely on `gunicorn --threads N` (or whatever the existing deployment uses for the long-lived SSE log streams). Threads share memory; one stuck `changelevel` holds a thread, not a process. Don't scale processes — adding hundreds of workers wastes RAM (~100 MB each); threads cost nothing. |
## Server-side changes
### 1. Extend `l4d2web/services/rcon.py`
The wire-protocol layer already exists (`l4d2web/services/rcon.py:64`).
Add a generic command executor with **multi-packet response handling**:
```python
def execute_command(
    host: str, port: int, password: str, command: str, *, timeout: float = 30.0
) -> str:
    """Authenticate, send a single command, return the joined reply body.

    Implements the trailing-marker pattern: after the exec packet we
    immediately send an empty SERVERDATA_RESPONSE_VALUE packet with a
    sentinel req_id. We then read response packets, concatenating bodies,
    until we see the sentinel echo back. This is the only reliable way
    to detect end-of-output, because Source RCON splits replies >4096 B
    across multiple packets with no length header.
    """
```
Implementation notes:
- Factor `_connect_and_auth(sock, password)` out of `query_status` so
both functions share the auth dance.
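- Use req_id `0xDEADBEEF` (or any constant ≠ the exec req_id) for the
sentinel; read packets until one comes back with that req_id. The id
field is a signed 32-bit integer on the wire, so if the packet encoder
packs it with `struct` format `<i`, pick a constant ≤ `0x7FFFFFFF`;
also avoid `-1`, which the server uses to signal auth failure.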
- Input validation **inside this function** (not just at the route):
- Reject empty / whitespace-only `command` → `ValueError`.
- Reject embedded `\x00` bytes (would corrupt the null-terminated
wire format) → `ValueError`.
- Cap length at 1000 chars (RCON packet limit is 4096 incl. headers;
no real command needs more). Longer → `ValueError`.
- Trim trailing whitespace from the joined body. Otherwise return verbatim.
- Existing `RconError` / `RconAuthError` exception types are reused.
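The trailing-marker read described in these notes can be sketched on raw bytes, without a socket. This is an illustration only: `encode_packet`, `decode_packets`, `join_until_sentinel`, and the sentinel value `0x7EADBEEF` are hypothetical, and the existing helpers in `rcon.py` may be shaped differently. The framing follows the Source RCON layout: a little-endian size field, then id, type, a null-terminated body, and one more null byte.

```python
import struct

SERVERDATA_RESPONSE_VALUE = 0
SERVERDATA_EXECCOMMAND = 2
SENTINEL_ID = 0x7EADBEEF  # assumption: any id that fits a signed int32


def encode_packet(req_id: int, ptype: int, body: str) -> bytes:
    """Frame one packet: size, id, type, body, two trailing null bytes."""
    payload = struct.pack("<ii", req_id, ptype) + body.encode("utf-8") + b"\x00\x00"
    return struct.pack("<i", len(payload)) + payload


def decode_packets(buf: bytes):
    """Yield (req_id, ptype, body) for each complete packet in buf."""
    offset = 0
    while offset + 4 <= len(buf):
        (size,) = struct.unpack_from("<i", buf, offset)
        end = offset + 4 + size
        if end > len(buf):
            break  # incomplete packet; a socket reader would wait for more
        req_id, ptype = struct.unpack_from("<ii", buf, offset + 4)
        body = buf[offset + 12 : end - 2].decode("utf-8", errors="replace")
        yield req_id, ptype, body
        offset = end


def join_until_sentinel(packets, sentinel_id: int = SENTINEL_ID) -> str:
    """Concatenate bodies until the sentinel id echoes back."""
    parts = []
    for req_id, _ptype, body in packets:
        if req_id == sentinel_id:
            return "".join(parts).rstrip()
        parts.append(body)
    raise ValueError("sentinel packet never arrived")
```

A socket-based `execute_command` would feed received bytes through the same decode loop, buffering partial packets between `recv` calls.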
Tests in `l4d2web/tests/test_rcon.py` (extend the `FakeRconServer` to
support multi-packet replies):
- happy path: single-packet response
- multi-packet response (synthesize a >4096 B reply)
- empty reply (server replies only with the sentinel — case for `say`)
- bad password → `RconAuthError`
- timeout (fake server sleeps longer than the test timeout)
- input validation: empty / null byte / oversized → `ValueError`
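The input-validation cases in this list could be exercised against a validator along these lines. It is a sketch under the limits stated in the implementation notes (non-empty, no `\x00`, at most 1000 characters); `validate_command` and `MAX_COMMAND_CHARS` are hypothetical names.

```python
MAX_COMMAND_CHARS = 1000  # cap taken from the implementation notes


def validate_command(command: str) -> str:
    """Return the command unchanged, or raise ValueError if it is unsafe."""
    if not command.strip():
        raise ValueError("command is empty or whitespace-only")
    if "\x00" in command:
        # A NUL would terminate the body early in the wire format.
        raise ValueError("command contains a NUL byte")
    if len(command) > MAX_COMMAND_CHARS:
        raise ValueError(f"command exceeds {MAX_COMMAND_CHARS} characters")
    return command
```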
### 2. New `CommandHistory` model (`l4d2web/models.py`)
Append at the bottom of `models.py`:
```python
class CommandHistory(Base):
    __tablename__ = "command_history"
    __table_args__ = (
        Index("ix_cmdhist_user_server_id", "user_id", "server_id", "id"),
    )

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    user_id: Mapped[int] = mapped_column(ForeignKey("users.id"), nullable=False)
    server_id: Mapped[int] = mapped_column(ForeignKey("servers.id", ondelete="CASCADE"), nullable=False)
    command: Mapped[str] = mapped_column(Text, nullable=False)
    reply: Mapped[str] = mapped_column(Text, nullable=False, default="", server_default="")
    is_error: Mapped[bool] = mapped_column(Boolean, nullable=False, default=False, server_default=text("0"))
    created_at: Mapped[datetime] = mapped_column(DateTime, default=now_utc, nullable=False)
```
Index `(user_id, server_id, id)` because every lookup is "latest N for
this user+server", ordered by `id DESC`.
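
The lookup this index serves can be sketched with the standard-library `sqlite3` module. The app itself goes through SQLAlchemy, so treat this as an illustration only; the helper name `latest_history` and the sample rows are made up.

```python
import sqlite3


def latest_history(conn, user_id, server_id, n=50):
    """Newest n rows for one (user, server), returned oldest-first for rendering."""
    rows = conn.execute(
        "SELECT id, command FROM command_history "
        "WHERE user_id = ? AND server_id = ? "
        "ORDER BY id DESC LIMIT ?",
        (user_id, server_id, n),
    ).fetchall()
    return rows[::-1]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE command_history ("
    "id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, "
    "server_id INTEGER NOT NULL, command TEXT NOT NULL)"
)
conn.execute(
    "CREATE INDEX ix_cmdhist_user_server_id "
    "ON command_history (user_id, server_id, id)"
)
conn.executemany(
    "INSERT INTO command_history (user_id, server_id, command) VALUES (?, ?, ?)",
    [(1, 7, "status"), (2, 7, "say hi"), (1, 7, "cvarlist"), (1, 8, "kick x")],
)
print(latest_history(conn, user_id=1, server_id=7))
# [(1, 'status'), (3, 'cvarlist')]
```

Because the equality columns come first and `id` last, the `ORDER BY id DESC LIMIT n` can be answered by walking the index backwards instead of sorting.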
A row is persisted on **every** RCON outcome — successful reply,
empty reply, and error (auth fail, connect refused, `RconError`). The
`is_error` flag drives the red styling on replay, so the transcript
looks identical after a page reload.
**Storage cost**: most replies are <500 B; `status` ~1 KB;
`sm plugins list` a few KB; `cvarlist` can be 50 KB+. A power user
running 100 commands/day at an average ~2 KB → ~73 MB/year. SQLite
handles that without complaint; a trim job (cap N per user/server,
e.g. last 5000) can be added if anyone ever notices.
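
The ~73 MB/year figure follows directly from the stated assumptions; a quick check (decimal megabytes, reply size only, ignoring row and index overhead):

```python
COMMANDS_PER_DAY = 100
AVG_REPLY_KB = 2
DAYS_PER_YEAR = 365

kb_per_year = COMMANDS_PER_DAY * AVG_REPLY_KB * DAYS_PER_YEAR
mb_per_year = kb_per_year / 1000
print(f"{mb_per_year:.0f} MB/year")  # prints "73 MB/year"
```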
**Privacy note for the implementer**: replies from `status` include
player names (user-controlled strings from random Steam users) and
SteamID64s. Treat them as untrusted text on output (handled by Jinja
auto-escaping — see §5) and don't surface them outside this user's
session.
### 3. New alembic migration `0012_command_history.py`
Mirror `l4d2web/alembic/versions/0011_server_hostname.py`:
- `revision = "0012_command_history"`
- `down_revision = "0011_server_hostname"`
- `upgrade()`: `op.create_table("command_history", …)` with columns
`id`, `user_id`, `server_id`, `command (Text)`, `reply (Text, server_default="")`,
`is_error (Boolean, server_default="0")`, `created_at`; plus
`op.create_index("ix_cmdhist_user_server_id", ...)`.
- `downgrade()`: drop index then table.
- `test_alembic_migrations.py` auto-discovers revisions (skim once to
confirm; no edit if so).
### 4. New route module `l4d2web/routes/console_routes.py`
Two endpoints, both `@require_login`, both verify ownership with
**404** on miss (matches the existing pattern at
`page_routes.py:303` — no admin backdoor).
**`POST /servers/<id>/console`** — submit a command.
- CSRF-checked (form field `csrf_token`).
- Form field `command`. Validation happens twice: at the route (return a
user-facing error fragment for empty / oversized) and inside
`execute_command` (defence in depth — never trust a single layer).
- Calls
`rcon.execute_command("127.0.0.1", server.port, server.rcon_password, command)`.
- **Every outcome persists a `CommandHistory` row** (so the transcript
fully reconstructs on page reload):
- Success with reply → `command`, `reply`, `is_error=False`.
- Success with empty reply (e.g. `say`) → `command`, `reply=""`,
`is_error=False`. Template renders `(no reply)` in muted text.
- `RconAuthError` / `RconError` / connect-failed → `command`,
`reply=<exception message>`, `is_error=True`. Red styling on render.
- On `ValueError` from input validation (empty / null byte / oversized):
render an error fragment, **do not** insert history (the command
never reached the wire — nothing happened to remember).
- Returns 200 in all cases (errors are rendered, not raised) so HTMX
appends them to the transcript like any other line.
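
The outcome table above reduces to one small decision function. The sketch below is framework-free and makes several assumptions: the exception classes are stand-ins for the existing ones (including the guess that `RconAuthError` subclasses `RconError`), connect failures are assumed to surface as `OSError`, and `run_and_record` / `HistoryRow` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


class RconError(Exception):
    """Stand-in for the existing exception type."""


class RconAuthError(RconError):
    """Stand-in; assumed here to subclass RconError."""


@dataclass
class HistoryRow:
    command: str
    reply: str
    is_error: bool


def run_and_record(
    execute: Callable[[str], str], command: str
) -> Tuple[Optional[HistoryRow], bool, str]:
    """Return (row to persist or None, error flag for the fragment, fragment text)."""
    try:
        reply = execute(command)
    except ValueError as exc:
        # Validation failure: the command never reached the wire, so no row.
        return None, True, str(exc)
    except (RconError, OSError) as exc:
        # Auth failure, protocol error, or connect failure: persist as an error.
        row = HistoryRow(command, str(exc), True)
        return row, True, row.reply
    # Success, including the empty-reply case.
    row = HistoryRow(command, reply, False)
    return row, False, reply
```

The route would persist the returned row when it is not `None` and render the fragment with the flag and text, returning 200 either way.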
**`GET /servers/<id>/console/history?before=<id>&limit=50`** — paged
history for up-arrow navigation.
- Returns JSON `[{"id": …, "command": …}, …]` ordered newest-first.
- The client owns the input state; this stays JSON, not HTML.
- `limit` clamped to ≤200.
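
The paging contract (newest-first, `before` as an exclusive upper bound on `id`, `limit` clamped) can be modelled over plain dicts. The function name `page_history` is hypothetical, and the lower clamp of 1 is an assumption; the plan only specifies the upper bound of 200.

```python
from typing import Dict, List, Optional

MAX_LIMIT = 200


def page_history(
    rows: List[Dict], before: Optional[int] = None, limit: int = 50
) -> List[Dict]:
    """rows: dicts with "id" and "command". Returns newest-first, at most MAX_LIMIT."""
    limit = max(1, min(limit, MAX_LIMIT))
    candidates = [r for r in rows if before is None or r["id"] < before]
    candidates.sort(key=lambda r: r["id"], reverse=True)
    return [{"id": r["id"], "command": r["command"]} for r in candidates[:limit]]
```

A client fetches the first page without `before`, then passes the smallest `id` it holds as `before` to get the next older page; an empty list means there is nothing further back.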
Register the blueprint in `l4d2web/app.py` alongside the other
`*_routes` modules.
**Also extend `server_detail()` in `page_routes.py`** to fetch the last
50 `CommandHistory` rows for this `(user, server)`, ordered oldest-first
(so they iterate naturally in the template), and pass as
`console_history` in the render context. Use the same `session_scope`
block that already loads `server` and `blueprint` (`page_routes.py:301`)
— one extra `db.scalars(select(CommandHistory)…)` call, no new round
trip cost.
### 5. Template fragment `templates/_console_line.html`
```jinja2
<div class="console-line{% if error %} console-error{% endif %}">
<div class="console-prompt">&gt; {{ command }}</div>
{% if reply %}
<pre class="console-reply">{{ reply }}</pre>
{% else %}
<div class="console-reply muted">(no reply)</div>
{% endif %}
</div>
```
**XSS reminder for the implementer:** `reply` originates from the game
server's RCON output — we do not trust it. **Never use `|safe`**, never
`{{ reply|markdown }}`, never anything that bypasses Jinja's default
HTML escaping. The existing `{{ reply }}` is the right call.
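
To see why the default escaping matters, here is a standalone illustration using the standard library's `html.escape` as a stand-in for Jinja's autoescaping (Jinja uses MarkupSafe internally; the function name `render_reply` and the sample player name are made up):

```python
from html import escape


def render_reply(reply: str) -> str:
    """Escape untrusted RCON output before placing it in markup."""
    return f'<pre class="console-reply">{escape(reply)}</pre>'


hostile_name = '<script>alert("x")</script>'
print(render_reply(f"# 2 1 {hostile_name} STEAM_1:0:1"))
```

The player-controlled name arrives in the page as inert text (`&lt;script&gt;…`) rather than as markup, which is exactly what `|safe` would undo.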
### 6. Console panel in `templates/server_detail.html`
Insert between the existing live-state section (lines 33–37) and the
Files section (line 39):
```jinja2
<h2 class="section-title">Console</h2>
<section class="panel console-panel">
<div id="console-transcript-{{ server.id }}"
class="console-transcript"
data-autoscroll>
{% for h in console_history %}
{# Map the row onto the names _console_line.html expects #}
{% with command=h.command, reply=h.reply, error=h.is_error %}
{% include "_console_line.html" %}
{% endwith %}
{% endfor %}
</div>
<form hx-post="/servers/{{ server.id }}/console"
hx-target="#console-transcript-{{ server.id }}"
hx-swap="beforeend"
hx-indicator=".console-spinner"
hx-on::after-request="this.command.value=''; this.command.focus(); this.closest('section').querySelector('[data-autoscroll]').scrollTop = 1e9"
class="console-input-form"
data-console-form data-server-id="{{ server.id }}">
<input type="hidden" name="csrf_token" value="{{ session.get('csrf_token', '') }}">
<span class="console-prompt-glyph">&gt;</span>
<input name="command" autocomplete="off" spellcheck="false" maxlength="1000"
placeholder="status, changelevel c1m1_hotel, sm_kick …">
<span class="console-spinner" aria-hidden="true"></span>
<button type="submit">Send</button>
</form>
</section>
```
- Transcript is server-side rendered with the last 50 history rows on
page load. `_console_line.html` is the single source of truth for
line layout — same template, same look, whether the line came from
this session or last week.
- `hx-indicator` gives visible feedback during slow commands (a
`changelevel` can sit at ~10s+).
- `maxlength="1000"` on the input mirrors the server-side cap.
- The `hx-on::after-request` inline scrolls the transcript to the
bottom after each new line. On initial page load, the JS module
scrolls to the bottom once after the DOM is ready (so the most
recent history is visible, not the oldest).
**Cross-feature interaction (do not "fix"):** Silent or slow commands
(`say`, `kick`, `changelevel`) will produce empty or terse RCON replies
in this transcript. The actual game-side effect is already visible in
the **Server Log** SSE panel right above. A future implementer should
NOT try to mirror server-log lines back into the console transcript —
that's a redundancy, not a feature.
### 7. New `static/js/console-history.js`
Tiny module bound to `[data-console-form]`:
- **On DOM ready**: scroll each `[data-autoscroll]` transcript to the
bottom so the most recent replayed lines are visible. This is the
initial-load equivalent of the `hx-on::after-request` scroll.
- **On first focus** of the input: lazy-fetch
`/servers/<id>/console/history?limit=50` and cache the array in
memory. (Distinct from the rendered-on-load transcript: this cache
is *just commands* for up/down recall — replies don't matter for
navigation, so the JSON endpoint stays narrow.)
- **ArrowUp / ArrowDown**: walk the cached array, set `input.value`.
- ArrowUp from a non-history state: snapshot the current value so
ArrowDown can restore it.
- **ArrowUp past the end**: fetch next page using
`?before=<oldest_cached_id>`. If empty, stop.
- **After a successful submit** (`htmx:afterRequest` with 2xx):
prepend the just-sent command to the in-memory cache so it's
instantly recallable.
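
The recall behaviour is a small state machine. The model below is written in Python purely to pin down the semantics (the real module is JavaScript); the class name `HistoryCursor` is hypothetical, and paging past the cached end is reduced to a comment.

```python
from typing import List, Optional


class HistoryCursor:
    """Model of up/down recall; `commands` is newest-first, as the endpoint returns."""

    def __init__(self, commands: List[str]):
        self.commands = list(commands)
        self.index: Optional[int] = None  # None means "not browsing history"
        self.draft = ""                   # snapshot of what the user was typing

    def up(self, current_value: str) -> str:
        if self.index is None:
            if not self.commands:
                return current_value
            self.draft = current_value    # remember the in-progress input
            self.index = 0
        elif self.index + 1 < len(self.commands):
            self.index += 1
        # else: at the oldest cached entry; the real module would fetch
        # the next page with ?before=<oldest_cached_id> here.
        return self.commands[self.index]

    def down(self, current_value: str) -> str:
        if self.index is None:
            return current_value
        if self.index == 0:
            self.index = None
            return self.draft             # restore the snapshot
        self.index -= 1
        return self.commands[self.index]

    def submitted(self, command: str) -> None:
        """Prepend the just-sent command and leave history mode."""
        self.commands.insert(0, command)
        self.index = None
        self.draft = ""
```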
Loaded via a `<script defer>` line in `base.html` next to the other
small static JS modules (same pattern as `sse.js`).
### 8. Concurrency sanity (no code, just verifying the design)
`live_state_poller.py` already opens fresh RCON connections every 5s
against the same port. SrcDS handles concurrent RCON sessions cleanly
(each is independently auth'd, no shared state). The console adds at
most one more concurrent connection per active user — well within
limits. No locking needed.
### 9. Minimal CSS in `static/css/`
Monospace transcript, dark background, `console-error` styled like the
existing error pills. Match the visual weight of the existing log-stream
`<pre>` block on the detail page — no new design system.
## Files to touch
| File | Change |
|---|---|
| `l4d2web/services/rcon.py` | Add `execute_command()` with multi-packet handling + input validation; extract `_connect_and_auth()` |
| `l4d2web/tests/test_rcon.py` | Extend `FakeRconServer` for multi-packet; add success / multi-packet / empty / bad-pw / timeout / validation tests |
| `l4d2web/models.py` | Add `CommandHistory` (with `reply`, `is_error`) |
| `l4d2web/alembic/versions/0012_command_history.py` | New migration |
| `l4d2web/routes/console_routes.py` | **NEW** — POST + GET endpoints |
| `l4d2web/routes/page_routes.py` | Extend `server_detail()` to fetch last 50 history rows and pass `console_history` |
| `l4d2web/app.py` | Register the new blueprint |
| `l4d2web/templates/_console_line.html` | **NEW** fragment |
| `l4d2web/templates/server_detail.html` | Insert console panel section (with server-rendered replay loop) |
| `l4d2web/static/js/console-history.js` | **NEW** up/down history nav + initial scroll-to-bottom |
| `l4d2web/templates/base.html` | `<script defer src="…/console-history.js">` |
| `l4d2web/static/css/*.css` | Console panel styling (fixed-height scroll transcript, error variant) |
| `l4d2web/tests/test_console_routes.py` | **NEW** route tests |
## Tests to write explicitly
**`test_rcon.py`** (extending existing file):
- `execute_command` happy path, single-packet reply
- `execute_command` multi-packet reply (>4096 B) reassembled in order
- `execute_command` empty reply (server returns only the sentinel)
- `execute_command` bad password → `RconAuthError`
- `execute_command` socket timeout → `RconError`
- Input validation: empty / whitespace-only / null-byte / oversized → `ValueError`
**`test_console_routes.py`** (new):
- not logged in → 302 to login
- logged in but not server owner → **404** (not 403 — match
`page_routes.py:303`)
- valid command → 200, fragment HTML rendered, `CommandHistory` row
inserted with `reply` populated and `is_error=False`
- empty RCON reply → 200, fragment renders `(no reply)`, history row
inserted with `reply=""`, `is_error=False`
- RCON error (mock `execute_command` to raise) → 200, error fragment,
history row inserted with `is_error=True` and the exception message
in `reply`
- empty/oversized command (validation error before wire) → 200, error
fragment, **no** history row
- CSRF token missing → rejected
- `GET /console/history` returns newest-first
- `GET /console/history?before=<id>` paginates correctly
- `GET /console/history?limit=10000` is clamped to ≤200
**`test_page_routes.py`** (extend existing if present, otherwise add):
- `server_detail` returns the last 50 `CommandHistory` rows for the
viewing user only, oldest-first in the rendered page (newest at the
bottom of the transcript)
- a history row belonging to another user for the same server is **not**
visible (ownership scoping is by `user_id`, not just `server_id`)
## What we are deliberately NOT doing
- No command blocklist or admin gate — owner already has the password.
- **No admin override** to console other users' servers (admins can SSH if
they truly need to; UI backdoor would be unaudited and asymmetric).
- No shared multi-user view of the same console.
- No streaming output (RCON doesn't stream; replies are one-shot).
- No autocomplete of cvars — out of scope; up-arrow history is enough.
- No "Clear transcript" button — the transcript replays on every page
load by design. Discarding history is a different concern (delete
rows from the DB) and is out of scope for v1.
- No history-trim job — file an issue if anyone hits >100k rows; not
worth pre-empting at this scale.
- No mirroring of server-log lines into the console transcript — the
Server Log panel above already serves that purpose.
## Verification
1. `pytest l4d2web/tests/test_rcon.py l4d2web/tests/test_console_routes.py l4d2web/tests/test_alembic_migrations.py` — unit + migration tests pass.
2. Boot the web app locally, log in, open a server detail page for a
running server, send `status` — multi-line reply renders in the
transcript; the input clears and refocuses; spinner shows during
the request; the transcript scrolls to the new line at the bottom.
3. Send `cvarlist` — a large multi-packet response — and confirm the
full output reassembles, not truncated.
4. Send `say hello` — transcript shows `> say hello` followed by
`(no reply)` in muted text; the line appears in the Server Log
panel above.
5. Send `changelevel c1m1_hotel` — request takes ~10–20s, spinner
visible the whole time, then a (likely empty) reply appears, and
the live-state panel updates to the new map within 5s.
6. Send an invalid command (e.g. `nonsense_cvar`) — reply renders
normally (RCON tolerates unknown commands).
7. Send a command with embedded null bytes (via curl, since the
browser strips them) — returns 200 with an error fragment, no
history row.
8. Send a 2000-char command — rejected with an error fragment, no
history row.
9. **Reload the page** — the transcript reappears identical to before,
showing the same `> status`, `> say hello`, `> nonsense_cvar` lines
with their replies, scrolled to the bottom. Errors are still red.
10. Focus the input, press ArrowUp — the previous command reappears.
ArrowDown restores the empty state.
11. Send 60+ commands, then ArrowUp past the in-memory page boundary —
older commands load on demand.
12. Stop the server, try to send a command — surfaces as a styled
`console-error` line ("connect failed") rather than a 500; **a
history row IS inserted** with `is_error=True`, so the error
replays on next page load.
13. Log in as a different user, visit `/servers/<other-user's-server-id>`
→ 404, no console rendered. A POST to that server's `/console`
endpoint also returns 404. The other user's transcript is not visible.
14. Confirm that a `cvarlist`-class large reply persists fully in the
DB (`SELECT length(reply) FROM command_history ORDER BY id DESC LIMIT 1;`)
and replays in full on page reload.


@ -1,270 +0,0 @@
# Build-time idmap: move the uid translation from the gameserver mount into the script sandbox
> **SUPERSEDED 2026-05-15** by the uid-collapse refactor
> ([`2026-05-15-uid-collapse.md`](2026-05-15-uid-collapse.md)). The
> idmap pattern this plan introduced is removed because source uid
> (`left4me`) now equals target uid (`left4me`) — the translation is
> a no-op. Kept for design-evolution context.
## Context
The current idmap implementation translates uids at **gameserver mount
time**: `left4me-overlay` stats each lowerdir, creates a per-lowerdir
idmapped bind under `runtime/<n>/idmap/<basename>` for the sandbox-
owned ones, then uses those bind paths in the overlay's `lowerdir=`.
On stop, the binds get torn down. Works correctly today, but spreads
the idmap concern across two helpers and adds mount lifecycle code on
every gameserver start.
Cleaner alternative: do the idmap translation at **script-sandbox
build time**, so files land on disk as `left4me`-owned. The on-disk
state then matches workshop-built overlays (also left4me-owned), and
the gameserver mount path becomes uniform — no per-lowerdir stat,
no idmap binds, no extra cleanup.
This plan switches the architecture to the build-time approach and
reverts the gameserver-mount idmap code.
## Verified mechanism
Tested end-to-end on `left4.me` (Trixie, kernel 6.12.86, ext4) on
2026-05-15:
1. `/source/` dir owned by `left4me` on disk.
2. `mount --bind --map-users=980:981:1 --map-groups=980:981:1
/source /idmapped` — inside `/idmapped`, files appear as uid 981
(sandbox view).
3. `mount --bind /idmapped /rebound` — a plain second bind. The idmap
**propagates** to `/rebound` (rebound view also shows uid 981).
This is what `BindPaths=` in the sandbox unit does.
4. `sudo -u l4d2-sandbox touch /rebound/x.txt` — write **succeeds**.
The file lands on disk owned by `left4me` (uid 980).
Map direction is the inverse of the gameserver-side map:
`--map-users=<disk_uid>:<mount_uid>:1` where disk is `left4me` and
mount-side is `l4d2-sandbox`. Inside the bind, the sandbox uid sees
its own uid as itself; writes from that uid get translated back to
the disk-side (left4me) for storage.
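
Since the argument order is easy to get backwards, a tiny helper that builds the `mount` arguments in `<disk_id>:<mount_id>:<count>` order may help when reading the shell. The function name `idmap_bind_args` is hypothetical; the uid/gid values are the ones from the verified run above.

```python
from typing import List


def idmap_bind_args(
    disk_uid: int, disk_gid: int, mount_uid: int, mount_gid: int
) -> List[str]:
    """Build the mount(8) arguments in <disk_id>:<mount_id>:<count> order."""
    return [
        "--bind",
        f"--map-users={disk_uid}:{mount_uid}:1",
        f"--map-groups={disk_gid}:{mount_gid}:1",
    ]


# left4me = 980 on disk, l4d2-sandbox = 981 inside the bind
print(idmap_bind_args(980, 980, 981, 981))
# ['--bind', '--map-users=980:981:1', '--map-groups=980:981:1']
```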
## Approach
### Script-sandbox helper (`deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`)
Pre-create an idmapped bind staging path, point the sandbox's
BindPaths at it, clean up on exit. Concretely:
1. **Remove** the existing `chown -R l4d2-sandbox:l4d2-sandbox
"$OVERLAY_DIR"` and `chmod 0755` lines. The overlay dir stays
`left4me`-owned (web app's creation default).
2. **Add** a setup block before `systemd-run`:
```bash
STAGING=/var/lib/left4me/tmp/sandbox-idmap-${OVERLAY_ID}
trap 'umount "$STAGING" 2>/dev/null || true; rmdir "$STAGING" 2>/dev/null || true' EXIT
mkdir -p "$STAGING"
mount --bind \
--map-users=$(id -u left4me):$(id -u l4d2-sandbox):1 \
--map-groups=$(id -g left4me):$(id -g l4d2-sandbox):1 \
"$OVERLAY_DIR" "$STAGING"
```
3. **Change** the systemd-run line:
- `BindPaths="${OVERLAY_DIR}:/overlay"` → `BindPaths="${STAGING}:/overlay"`
4. **Remove** the post-build `find ... chmod o+r` block. Files end up
left4me-owned, web app reads them via its primary uid. The
world-read kludge was only needed because of the old sandbox-
owned files; with this change it's obsolete.
`trap` ensures the staging bind is unmounted on errors and on catchable
signals. Re-running the helper is safe: `mkdir -p` tolerates an existing
staging dir, and if a bind from an earlier run is still mounted there,
`mount --bind` stacks a new bind on top, which the exit trap then
removes. Stacked binds on the same path work on kernel 6.12 (verified
during the recent gameserver-side idmap fix).
### Gameserver-mount helper (`deploy/files/usr/local/libexec/left4me/left4me-overlay`)
Revert the idmap logic added in commit `2f6a9cf` (+ fix in `9053186`,
+ mountpoint-detection fix in `dd918ac`). Specifically:
1. **Remove** the per-lowerdir stat + idmap-decision loop in `cmd_mount`.
`lowerdir=` becomes the simple colon-join of resolved lowerdirs
(the pre-2f6a9cf shape).
2. **Remove** the bind-umount loop in `cmd_umount` and the
`shutil.rmtree(idmap_dir, ...)` line.
3. **Remove** the `_is_mountpoint`, `_lookup_uid`, and `_get_user_ids`
helpers — no longer used. (Keep `os.path.ismount` for the merged
overlay check; that one's reliable.)
4. **Remove** the `LEFT4ME_TEST_*_UID/GID` test-only env-var stubs.
5. **Remove** the idmap PRINT_ONLY emission.
The helper shrinks back to the pre-idmap size (~242 lines from current 381).
### Tests
In `l4d2host/tests/test_overlay_helper.py`:
1. **Remove** `test_mount_idmaps_sandbox_owned_lowerdir`.
2. **Remove** `test_mount_skips_idmap_for_left4me_owned_lowerdir`.
3. **Remove** `test_umount_unwinds_idmap_binds`.
4. **Remove** `test_is_mountpoint_detects_same_fs_bind_mount` and the
`_load_helper_module` helper.
5. **Remove** `_setup_instance_with_uid` and the `FAKE_*_UID/GID`
constants.
6. **Remove** the `LEFT4ME_TEST_*` env-var injection in `_run`.
In `deploy/tests/test_deploy_artifacts.py`:
1. **Remove** `test_overlay_helper_idmaps_sandbox_owned_lowerdirs`
(the regression test for the soon-removed feature).
2. **Add** a new test `test_script_sandbox_uses_idmap_staging` that
asserts the sandbox helper contains:
- `--map-users=` and `--map-groups=` strings (the bind setup),
- `/var/lib/left4me/tmp/sandbox-idmap-` (the staging path prefix),
- `BindPaths="${STAGING}:/overlay"` (or close equivalent — point
the bind at the idmapped staging path, not at OVERLAY_DIR).
- A `trap` for cleanup.
3. **Remove** the existing `chown -R l4d2-sandbox` assertion in the
sandbox-helper test (if any).
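
The new assertion is a plain substring check over the helper's source. A sketch of its core, assuming the test reads the helper file into a string first; the names `REQUIRED_SNIPPETS`, `missing_snippets`, and `uses_idmap_staging` are hypothetical:

```python
from typing import List

REQUIRED_SNIPPETS = (
    "--map-users=",
    "--map-groups=",
    "/var/lib/left4me/tmp/sandbox-idmap-",
    'BindPaths="${STAGING}:/overlay"',
    "trap ",
)


def missing_snippets(helper_text: str) -> List[str]:
    """Return the required snippets that do not appear in the helper source."""
    return [s for s in REQUIRED_SNIPPETS if s not in helper_text]


def uses_idmap_staging(helper_text: str) -> bool:
    """True when all idmap-staging markers are present and the old chown is gone."""
    return not missing_snippets(helper_text) and "chown -R l4d2-sandbox" not in helper_text
```

Reporting the missing snippets separately keeps the pytest failure message specific about which part of the contract regressed.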
### Migration
Existing overlays under `/var/lib/left4me/overlays/<id>/` are a mix:
- Workshop-built: already `left4me`-owned (no migration needed).
- Script-built (e.g. server 2's overlays 4 and 9): currently
`l4d2-sandbox`-owned from the prior helper version. **Need chown to
`left4me:left4me`.**
One-shot migration command on the test server (run before deploying
the new helpers, OR after — both work because the new script-sandbox
also expects left4me-owned dirs):
```bash
sudo chown -R left4me:left4me /var/lib/left4me/overlays/
```
That's safe — overlays/* are all overlay content, no other tenants.
The workshop ones are already left4me; the chown is a no-op for them.
The script-built ones get flipped to the new ownership model.
Running gameservers using the old idmap-bind setup will keep working
on the old overlays/<id> files (which they bind via the now-orphan
idmap bind that's already in place). The next stop/start cycle picks
up the new helper, which:
- Doesn't create any new idmap binds (gameserver-side helper has
none),
- Cleans up the legacy idmap binds it finds (the existing umount loop
in the current helper handles this on the way out).
After the first stop/start cycle, no more idmap binds exist anywhere
in the system. Steady state.
### ckn-bw bundle
No changes needed. The `install_left4me_scripts` action picks up the
new helper contents from `/opt/left4me/src/deploy/files/usr/local/...`
on the next `git_deploy` apply. ckn-bw itself is content-agnostic
about the helper internals.
## Files to modify
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` — add
idmap bind setup + trap cleanup; remove old chown; switch BindPaths.
- `deploy/files/usr/local/libexec/left4me/left4me-overlay` — revert the
~140 lines of idmap-handling code; remove uid lookup, mountinfo
helper, test-stub env vars; drop the idmap PRINT_ONLY emission.
- `l4d2host/tests/test_overlay_helper.py` — drop idmap tests and
helpers.
- `deploy/tests/test_deploy_artifacts.py` — flip the asserted
invariant (helper has idmap → sandbox has idmap).
## Verification
End-to-end on `left4.me`:
1. Push left4me commit, `bw apply ovh.left4me`.
2. `sudo chown -R left4me:left4me /var/lib/left4me/overlays/` (one-shot
migration).
3. `sudo systemctl restart left4me-server@2`.
4. `sudo findmnt --task 1 -o TARGET | grep runtime/2` — expect *only*
`runtime/2/merged`, no `idmap/*` subdirs.
5. `sudo ls -ln /var/lib/left4me/overlays/9/` and a couple of other
script overlays — expect `left4me:left4me`.
6. Trigger an overlay rebuild from the web UI on a script overlay.
Confirm the build succeeds and the resulting files are
left4me-owned on disk.
7. `sudo -u left4me touch
/var/lib/left4me/runtime/2/merged/left4dead2/addons/sourcemod/logs/test.log`
— expect write to succeed (verifies SM logging path still works).
8. RCON `sm_cvar nb_update_frequency 0.0333` — no permission-denied
line in `journalctl -u left4me-server@2`.
Local tests:
```
pytest l4d2host/tests/test_overlay_helper.py -q
pytest deploy/tests/test_deploy_artifacts.py -q
```
Both should pass with reduced test count (removed idmap-on-mount
tests, added one sandbox-helper assertion).
## Risks
- **Kernel version dependency**: idmap propagation through plain
re-bind was verified on 6.12.86. Older kernels may behave
differently. ovh.left4me is on Trixie's 6.12, so we're fine; future
hosts on older kernels would need verification. Document the kernel
floor (≥ 6.6 for overlayfs+idmap, but ≥ 6.x for the propagation —
we have no exact lower bound documented).
- **Stale idmap binds during migration**: server 2 currently has two
active gameserver-side idmap binds (`runtime/2/idmap/overlays_4`
and `overlays_9`). The first stop after deploy uses the existing
helper code (with `_is_mountpoint` fix) to umount them. Verified
in the recent fix cycle. New starts won't create new binds.
- **Sandbox migration of in-flight builds**: if a script-overlay
build is running during the deploy + chown migration, the chown
could happen mid-write. Mitigation: don't run the chown while a
build is active; check via `systemctl list-units
'left4me-script-*'` first.
- **The trap-based cleanup in bash**: if the helper is hit with
SIGKILL, the trap doesn't fire and the staging bind leaks. Same
exposure as today's leaks (gameserver-side stale binds on similar
scenarios). Acceptable; the next sandbox run for the same overlay
id `umount`s the leftover bind first via the trap setup pattern
(`umount; rmdir; mkdir -p; mount --bind` is idempotent).
## Why this is worth doing despite the working current solution
Today's idmap-on-mount works and is correct. The reasons to refactor:
- **Architectural locality**: the uid translation is a build-time
concern (the sandbox creates files); having it as a mount-time
concern means the gameserver path needs to know about a producer-
side decision.
- **Code reduction**: helper shrinks by ~140 lines; tests by ~150.
Removed code is removed bug surface.
- **On-disk consistency**: all overlay content becomes `left4me`-
owned. Easier to reason about (no two-tier ownership), easier to
manually inspect (no per-overlay-type ownership).
- **Mount lifecycle simplification**: no per-instance idmap dir
creation, no per-start uid lookups, no per-stop bind teardown, no
stacked-bind regression hazard from the same-fs `os.path.ismount`
trap (we already fixed that once).
- **Web app read path**: drops the world-read chmod kludge in the
sandbox helper. File-tree download endpoint reads via primary uid.
The cost (refactor + migration) is paid once; the benefit is
permanent.
## Out of scope
- Splitting the web-app uid from the gameserver uid (future change
noted in earlier plans).
- Rewriting shell helpers in Python.
- `left4me-apply-cake` cleanup (still drifting along in the install
glob).
- Re-examining whether `l4d2-sandbox` should exist as a separate uid
at all (this plan keeps it, but the cost-benefit might shift
later).

View file

@ -1,198 +0,0 @@
# Deploy-dir architecture rethink — implementation plan
## Context
Resolves the open questions in `docs/superpowers/specs/2026-05-15-deploy-dir-rethink-design.md`. After the 2026-05-15 script-consolidation work, `deploy/` ended up half-canonical / half-historical: the privileged scripts were treated as load-bearing source-of-truth there, while sudoers/sysctl/env-templates stayed duplicated against ckn-bw, and the obsolete `deploy-test-server.sh` plus a pile of dead static unit files lingered. The shape worked but couldn't be described in two sentences.
This plan commits to the framing the user picked: **`deploy/` is a reference exemplar** — readable enough that a fresh consumer (ckn-bw today, hypothetical docker/ansible/manual tomorrow) could build a deployment from it, but not the live source of truth for installed binaries. The privileged scripts are **application-inherent code** and move out of `deploy/` to top-level `scripts/{libexec,sbin}/`. Dead code is deleted in the same pass. ckn-bw is updated to read scripts from the new location. The intended outcome: `deploy/` shrinks to README + example configs + a couple of curated example units, the rules for "what goes here" fit in two sentences, and the cross-repo install path becomes self-explanatory.
## End state
```
left4me/
scripts/
libexec/
left4me-overlay # 244-line Python helper (mount/umount)
left4me-script-sandbox # 109-line bash (systemd-run sandbox)
left4me-systemctl # 44-line sh wrapper
left4me-journalctl # 53-line sh wrapper
sbin/
left4me # 17-line admin CLI wrapper
tests/
test_overlay.py
test_script_sandbox.py
test_systemctl_helper.py
test_journalctl_helper.py
test_sudoers_grants.py # tests the contract between scripts and sudoers
deploy/ # REFERENCE ONLY — see deploy/README.md
README.md # rewritten: explains target layout, points at scripts/
files/
etc/
sudoers.d/left4me # example, ckn-bw ships its own verbatim copy
sysctl.d/99-left4me.conf # example
left4me/sandbox-resolv.conf # example
usr/local/lib/systemd/system/
left4me-server@.service # curated example of what ckn-bw's reactor emits
left4me-web.service # curated example
left4me-workshop-refresh.service # curated example
left4me-workshop-refresh.timer # curated example
l4d2-game.slice # curated example
l4d2-build.slice # curated example
templates/etc/left4me/
host.env # example, ckn-bw renders its own mako version
web.env.template
tests/
test_example_units.py # slimmed: just locks down the curated examples
l4d2host/ # unchanged
l4d2web/ # unchanged
docs/
```
## Step-by-step
### 1. Create `scripts/` and move helpers
- `mkdir -p scripts/libexec scripts/sbin scripts/tests`
- `git mv` the four live helpers and the admin CLI:
- `deploy/files/usr/local/libexec/left4me/left4me-overlay` → `scripts/libexec/left4me-overlay`
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` → `scripts/libexec/left4me-script-sandbox`
- `deploy/files/usr/local/libexec/left4me/left4me-systemctl` → `scripts/libexec/left4me-systemctl`
- `deploy/files/usr/local/libexec/left4me/left4me-journalctl` → `scripts/libexec/left4me-journalctl`
- `deploy/files/usr/local/sbin/left4me` → `scripts/sbin/left4me`
- The scripts' contents are unchanged. Every install-target path inside them (`/usr/local/libexec/left4me/...`, `/etc/left4me/...`, `/var/lib/left4me/...`) stays exactly as is — those are runtime paths, not source-tree paths.
### 2. Delete dead code
- `git rm` (truly obsolete; replacements live elsewhere or feature was retired):
- `deploy/files/usr/local/libexec/left4me/left4me-apply-cake` — CAKE migrated to systemd-networkd via `network/<iface>/cake` node metadata in ckn-bw.
- `deploy/files/usr/local/lib/systemd/system/left4me-cake.service` — same reason.
- `deploy/files/etc/left4me/cake.env` — bandwidth lives in node metadata, not an env file.
- `deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service` — central `bundles/nftables/` consumes the rules now.
- `deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft` — same. After the delete, the now-empty `deploy/files/usr/local/lib/left4me/` and its `nft/` child disappear (git doesn't track empty dirs).
- `deploy/deploy-test-server.sh` — superseded by `bw apply`; content survives in git history.
- **Do NOT delete** `deploy/files/usr/local/lib/systemd/system/left4me-workshop-refresh.{service,timer}`. The workshop-refresh job is live (invokes `flask workshop-refresh`, defined in `l4d2web/cli.py`); ckn-bw's reactor emits these on production. They stay as curated examples, same category as `left4me-server@.service` / `left4me-web.service` / the slices. (This corrects the framing in `docs/superpowers/specs/2026-05-15-deploy-dir-rethink-design.md` and item 2 of `docs/superpowers/specs/2026-05-15-janitorial-cleanup.md`, both of which lumped workshop-refresh together with truly-dead units.)
- Stale `__pycache__` dirs under `deploy/files/usr/local/libexec/left4me/` are deleted by the moves in step 1.
### 3. Split and relocate `deploy/tests/test_deploy_artifacts.py`
The current file (~880 lines) is doing four jobs. Split as follows; do not duplicate tests across files.
**Concrete sequence to preserve git history where it counts**:
1. `git mv deploy/tests/test_deploy_artifacts.py deploy/tests/test_example_units.py` — single rename, history follows via `git log --follow`.
2. In the renamed file, delete every test except the "Keep in `deploy/tests/test_example_units.py`" list below. The kept tests track the unit/sysctl/env-template examples, which is what `deploy/tests/` will mean afterwards.
3. Create new `scripts/tests/*.py` files (and `conftest.py`) by writing them fresh — pasting the relevant test functions across. The extracted tests lose direct rename history, but blame against the new files still resolves to the originals one git ref back; acceptable tradeoff.
**Move to `scripts/tests/`** (tests of script behavior + the sudoers contract that gates the scripts):
- `scripts/tests/test_overlay.py`: `test_overlay_helper_is_python_with_strict_validation`, `test_overlay_helper_mount_is_idempotent_when_already_mounted`
- `scripts/tests/test_script_sandbox.py`: `test_script_sandbox_helper_present`, `test_script_sandbox_helper_passes_shell_syntax_check`, `test_script_sandbox_helper_invokes_systemd_run_with_hardening`, `test_script_sandbox_uses_idmap_staging`, `test_script_sandbox_in_build_slice_with_oom_adjust`, `test_script_sandbox_helper_validates_overlay_id`, `test_script_sandbox_helper_dry_run_mode`
- `scripts/tests/test_systemctl_helper.py`: `test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`
- `scripts/tests/test_journalctl_helper.py`: `test_journalctl_helper_passes_shell_syntax_check_and_rejects_bad_args`
- `scripts/tests/test_helpers_use_fixed_paths.py`: `test_helpers_use_fixed_system_tool_paths_not_sudo_path`
- `scripts/tests/test_sudoers_grants.py`: `test_sudoers_allows_only_left4me_helpers_not_raw_system_tools` (still reads `deploy/files/etc/sudoers.d/left4me` as the canonical example; comment why)
The `ROOT/DEPLOY` path-prefix constants in each file get rewritten so `SCRIPTS = Path(__file__).resolve().parents[2] / "scripts"` and helpers resolve to `SCRIPTS / "libexec/left4me-overlay"` etc. Shared helpers (`_fake_command`, `_env_with_fake_commands`) move into `scripts/tests/conftest.py`.
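A minimal sketch of what the shared `scripts/tests/conftest.py` could look like after the split. The `parents[2]` arithmetic comes from the plan; `scripts_root` is a hypothetical name, and the bodies of `fake_command` and `env_with_fake_commands` are assumptions for illustration (the existing `_fake_command` / `_env_with_fake_commands` helpers may take different arguments).

```python
import os
import stat
from pathlib import Path


def scripts_root(test_file: Path) -> Path:
    """Resolve <repo>/scripts from a file living in <repo>/scripts/tests/."""
    # parents[0] = scripts/tests, parents[1] = scripts, parents[2] = repo root
    return test_file.resolve().parents[2] / "scripts"


def fake_command(bin_dir: Path, name: str, body: str = "exit 0") -> Path:
    """Write an executable /bin/sh stub called `name` into bin_dir (hypothetical helper)."""
    bin_dir.mkdir(parents=True, exist_ok=True)
    path = bin_dir / name
    path.write_text(f"#!/bin/sh\n{body}\n")
    # Add the execute bits on top of whatever mode the file was created with.
    path.chmod(path.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path


def env_with_fake_commands(bin_dir: Path) -> dict[str, str]:
    """Copy the current environment with bin_dir first on PATH (hypothetical helper)."""
    env = dict(os.environ)
    env["PATH"] = f"{bin_dir}{os.pathsep}{env.get('PATH', '')}"
    return env
```

In a test module this would be used as `SCRIPTS = scripts_root(Path(__file__))`, with helpers then resolving to `SCRIPTS / "libexec/left4me-overlay"` and so on.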
**Keep in `deploy/tests/test_example_units.py`** (locks down the curated examples; renamed from the current file):
- `test_global_unit_files_exist_at_product_level_paths`
- `test_web_unit_contains_required_runtime_contract`
- `test_server_unit_contains_required_runtime_contract`
- `test_server_unit_mounts_overlay_via_exec_start_pre`
- `test_server_unit_unmounts_overlay_via_exec_stop_post`
- `test_server_unit_contains_perf_baseline_directives`
- `test_l4d2_game_slice_exists_with_high_weights`
- `test_l4d2_build_slice_exists_with_low_weights`
- `test_sysctl_conf_present_with_perf_settings`
- `test_env_templates_contain_required_defaults`
- `test_sandbox_resolv_conf_exists`
Add a top-of-file docstring: *"These tests lock down the curated examples kept in `deploy/files/` for reference. The production units are emitted by ckn-bw's reactor in `bundles/left4me/metadata.py`; when reactor output drifts intentionally, update the examples here too."*
**Delete entirely** (target removed or no longer load-bearing):
- All `test_deploy_script_*` tests (12 tests; `deploy-test-server.sh` is gone)
- `test_globals_refresh_units_removed` — file already deleted; nothing to lock down
- `test_nft_mark_file_marks_left4me_udp_with_dscp_ef_and_priority`, `test_nft_mark_unit_loads_and_clears_left4me_table` — nft-mark moved to central nftables bundle
- `test_cake_env_template_documents_required_knobs`, `test_apply_cake_helper_supports_apply_and_clear_modes`, `test_apply_cake_helper_passes_shell_syntax_check`, `test_cake_unit_runs_helper_in_apply_and_clear_modes` — CAKE moved to systemd-networkd
- `test_deploy_script_installs_overlay_helper_with_executable_mode`, `test_deploy_script_installs_script_sandbox_helper` — install responsibility now lives in ckn-bw's bundle, not in any left4me-side script
Final file count: `scripts/tests/` gets 6 test files plus `conftest.py`, `deploy/tests/test_example_units.py` is one file, `deploy/tests/test_deploy_artifacts.py` is gone (renamed).
### 4. Rewrite `deploy/README.md`
Reframe the top of the file as: *"This directory is a reference exemplar. The canonical deploy is [ckn-bw](https://git.sublimity.de/cronekorkn/ckn-bw)'s `bundles/left4me/` (run `bw apply ovh.left4me`). Files under `deploy/files/` and `deploy/templates/` are readable examples — not the binaries / configs ckn-bw actually installs. Read them to understand the target layout if you're building a fresh deployment by other means."*
Update the file/status table:
- Drop rows for files that no longer exist (apply-cake, cake.service, cake.env, nft-mark.*). Keep the workshop-refresh.* rows: those units are not deleted and stay as curated examples.
- Drop the `deploy-test-server.sh` row.
- For the privileged-scripts rows, change `files/usr/local/libexec/left4me/...` → `(moved to scripts/libexec/, installed by ckn-bw's install_left4me_scripts action)`; same for the sbin row.
- Mark the remaining `files/etc/...` and `files/usr/local/lib/systemd/system/...` entries explicitly as **example**: ckn-bw ships its own verbatim copies of the configs, its reactor emits the units.
Keep the "Target Layout" / "Runtime User" / "Overlay References" / "Performance Tuning" sections — they're useful reference prose. Strip the "Running A Test Deployment" / "Admin Bootstrap" sections that refer to the deleted shell installer; replace with a one-paragraph pointer to ckn-bw.
### 5. ckn-bw cross-repo update
The `install_left4me_scripts` action in `bundles/left4me/items.py` currently reads from `/opt/left4me/src/deploy/files/usr/local/{libexec,sbin}/`. Update it to read from `/opt/left4me/src/scripts/{libexec,sbin}/`. The install target is unchanged (`/usr/local/libexec/left4me/`, `/usr/local/sbin/left4me`), so nothing on the deployed host moves.
This is a separate PR in the ckn-bw repo. It must land **at the same time** as the left4me move — the install action depends on the source paths existing. Coordination:
1. Open both PRs simultaneously.
2. Merge order: left4me first (scripts exist at the new path in `/opt/left4me/src/` only after a fresh `git_deploy`), then ckn-bw, then `bw apply ovh.left4me`.
3. Alternative: have the ckn-bw PR fall back to the old path if the new path doesn't exist (one extra glob); decide during ckn-bw review whether the complexity is worth the looser coupling. Default: no fallback, coordinate the merges.
Verification on the deploy target: after `bw apply`, the files under `/usr/local/libexec/left4me/` and `/usr/local/sbin/left4me` should be byte-identical to before. Sudoers, services, the web app: all unchanged.
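One way to make the "byte-identical to before" check concrete is to hash the install directory before and after the apply and compare the two maps. This is a sketch only: `tree_digest` and `changed_files` are hypothetical names, and it assumes the target is a directory (a single file can be hashed directly).

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to the SHA-256 of its bytes."""
    digests: dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return relative paths that were added, removed, or whose content changed."""
    return sorted(p for p in before.keys() | after.keys() if before.get(p) != after.get(p))
```

Capturing `tree_digest(Path("/usr/local/libexec/left4me"))` before and after `bw apply` and asserting that `changed_files(...)` is empty would confirm that only the source path moved.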
### 6. Mark adjacent specs / docs as resolved
- `docs/superpowers/specs/2026-05-15-deploy-dir-rethink-design.md`: prepend a `**Resolved 2026-05-15 by docs/superpowers/plans/…</plan-name>.md.**` line at the top. Leave the body intact for archaeology.
- `docs/superpowers/specs/2026-05-15-janitorial-cleanup.md`: cross out items 1, 5, 6 (now handled here). Item 2 needs a rewrite — the framing "all static unit files are obsolete drift" was wrong; the live reactor-emitted set (`server@`, `web`, `workshop-refresh.{service,timer}`, `l4d2-{game,build}.slice`) stays in `deploy/files/` as curated examples. The truly-dead two (`left4me-cake.service`, `left4me-nft-mark.service`) are already deleted by this plan, so item 2 collapses to "no remaining work."
- No memory file changes needed; the project state captured here is structural and re-derivable from `deploy/README.md` after the rewrite lands.
### 7. Rollback notes
If `bw apply ovh.left4me` against the test server breaks something after the cross-repo merge:
1. Revert the ckn-bw `install_left4me_scripts` action change to the old source path (`/opt/left4me/src/deploy/files/usr/local/{libexec,sbin}/`). Re-apply.
2. The left4me side never needs reverting in isolation — the scripts at the new path are byte-identical to the old ones, so a stale ckn-bw install action against a *new* left4me checkout would fail at `install -t` (source path missing). That failure is loud and safe: nothing on the deployed system gets modified.
3. The only foot-gun is **partial rollout**: ckn-bw updated but left4me not yet checked out at the right revision. The `git_deploy` step pins the revision, so as long as the two PRs reference compatible commits, the deployed `/opt/left4me/src/` always matches the action's expectation.
## What does NOT change
- Runtime install-target paths (`/usr/local/libexec/left4me/...`, `/usr/local/sbin/left4me`) — every reference inside `l4d2host/service_control.py:7-8`, `l4d2web/services/overlay_builders.py:34`, the sudoers file, and the systemd units stays the same.
- The Python packages `l4d2host/` and `l4d2web/`.
- ckn-bw's bundles for sudoers / sysctl / sandbox-resolv.conf — those keep their own verbatim copies (the user picked "deploy/ keeps configs as examples; duplication-with-ckn-bw is OK because deploy/ is explicitly reference"). Janitoring the duplication is *not* in scope for this plan.
- The Mako env templates in ckn-bw — they stay where they are, since they need bw's metadata access for rendering.
- The recent overlay-idmap / script-sandbox idmap-staging work — untouched.
## Critical files (jump points for the implementor)
- `deploy/tests/test_deploy_artifacts.py` — the source for the test split (lines 20-32 are the path constants; tests grouped roughly by helper from line 138 onward)
- `deploy/README.md` — full rewrite of the top section, partial rewrite of the table
- `l4d2host/service_control.py:7-8` — verify install-target paths unchanged (sanity)
- `l4d2web/services/overlay_builders.py:34` — same
- `deploy/files/etc/sudoers.d/left4me` — sanity-check that no path inside changed
- `deploy/files/usr/local/lib/systemd/system/{left4me-server@.service,left4me-web.service,l4d2-{game,build}.slice}` — survive as curated examples
- ckn-bw repo: `bundles/left4me/items.py` — the `install_left4me_scripts` action (separate PR)
## Verification
End-to-end:
1. **Source-tree consistency.** `find scripts deploy -type f | sort` matches the layout in "End state" above (modulo `__pycache__`).
2. **All tests pass locally.** From the repo root: `pytest scripts/tests/ deploy/tests/ l4d2host/tests/ l4d2web/tests/` — every test passes. Specifically verify `scripts/tests/test_sudoers_grants.py` still reads `deploy/files/etc/sudoers.d/left4me` correctly (path constant points across the dir boundary).
3. **Shell syntax checks.** The split tests should still run `sh -n` / `bash -n` against the moved scripts; no script edits means no syntax regressions, but the test paths must resolve.
4. **No accidental application breakage.** `grep -rn '/usr/local/libexec/left4me\|/usr/local/sbin/left4me' l4d2host l4d2web` returns the same hits as before (paths are install-target, source moves don't affect them).
5. **ckn-bw dry-run.** Once the ckn-bw PR is up, `bw apply --dry-run ovh.left4me` from the ckn-bw repo: the diff should show **no changes** to files under `/usr/local/libexec/left4me/` or `/usr/local/sbin/left4me` (byte-identical content via the new path).
6. **Production apply.** `bw apply ovh.left4me` against the real test server. After apply: `systemctl status left4me-web.service` is green, starting a game server via the web UI still works (overlay mount → srcds_run → unmount on stop), running an overlay build script through the sandbox still works.
## Out of scope (handled elsewhere or deferred)
- The Mako template duplication in ckn-bw — separate cleanup; the templates legitimately need bw's metadata access.
- The 1/2/3-user uid-split decision — `docs/superpowers/specs/2026-05-15-user-uid-split-design.md`.
- The script-sandbox → systemd template unit refactor — `docs/superpowers/specs/2026-05-15-build-overlay-unit-design.md`.
- Remaining janitorial items: item 3 (bubblewrap→systemd-run doc drift), item 4 (stale gameserver-side idmap binds), calendar reminder for SM 1.13 stable. Items 1, 2 (partial — see step 6), 5, 6 are subsumed here.
- Rewriting the shell helpers in Python / packaging them as console_scripts — explicitly rejected in the recent script-consolidation plan (egg-info + TOCTOU privilege concerns).
- Historical references inside `docs/superpowers/plans/*` and `docs/superpowers/specs/*` to `deploy/files/...` or `deploy-test-server.sh` paths. Those are time-stamped snapshots of past sessions; they don't get rewritten when the underlying tree moves.

@ -1,198 +0,0 @@
# Plan: scope Server Log to the current unit invocation
## Context
Today the Server Log panel on `/servers/<id>` shows the last 200 lines of the
unit's **entire** journal — i.e. across every prior `start` / `stop` /
`reset` cycle — then follows. That means a freshly-started server can mix
lines from the current boot with leftovers from yesterday, which makes the
log harder to reason about. The user wants the panel to begin at the most
recent unit start.
The right systemd primitive is `_SYSTEMD_INVOCATION_ID`: systemd assigns a
fresh 128-bit ID to every (re)start of a unit, queryable via
`systemctl show -p InvocationID --value <unit>`. Filtering
`journalctl _SYSTEMD_INVOCATION_ID=<id>` gives exactly that one run.
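As a rough illustration of the lookup, the sketch below parses the `Key=Value` lines that `systemctl show --property=...` prints by default and pulls out `InvocationID`. The function names are hypothetical; the project's real parser lives in `status.py` and may differ.

```python
def parse_show_output(text: str) -> dict[str, str]:
    """Parse `systemctl show` output (one Key=Value pair per line) into a dict."""
    props: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines that carry no "="
            props[key] = value
    return props


def invocation_id_from(text: str) -> str:
    """Return the InvocationID, or "" when the property is missing or empty."""
    return parse_show_output(text).get("InvocationID", "")
```

An empty return value is the signal the plan uses for "no current invocation".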
User decisions (already confirmed):
- **Scope** — always last invocation; no toggle, no historical view.
- **Empty case** (unit has never been started) — SSE stays open with
keepalives, yields no data lines, attaches when first invocation appears.
- **Mid-stream restart** — backend force-disconnects when the InvocationID
changes. `EventSource` reconnects on its own and the next request picks
up the new run.
## Architecture
Three layers, smallest blast radius first:
```
browser ─SSE─► l4d2web routes/log_routes.py
                  │
                  ▼
               l4d2web services/l4d2_facade.stream_server_logs
                  │   (l4d2ctl logs <unit> --lines N --follow)
                  ▼
               l4d2host CLI logs → service_control.stream_journal
                  │
     ┌────────────┴──────────────────┐
     ▼                               ▼
sudo left4me-systemctl show     sudo left4me-journalctl
   -p ActiveState <unit>           --invocation-id <hex32>
   -p SubState                     --lines N --follow|--no-follow
   -p InvocationID ←NEW
```
`l4d2host.service_control.stream_journal` becomes the orchestrator:
1. Resolve `InvocationID` via `show_service` (already returns a parsed
`key=value` dict in `status.py:32-37` — adding a property is harmless).
2. **Empty `InvocationID`** (unit never ran):
- `follow=False` → return `iter(())`.
- `follow=True` → yield `""` every ~10 s as a keepalive nudge; poll
`InvocationID` every ~3 s; once it appears, fall through to step 3.
3. **Non-empty `InvocationID`** → `Popen` the journalctl helper with
`--invocation-id <id>`. Start a daemon thread that re-reads the unit's
`InvocationID` every ~5 s; if it changes, call `proc.terminate()`.
The generator's normal end-of-stream path then closes the SSE response,
the browser's `EventSource` reconnects, and the next call picks up the
new ID.
`lines` cap is preserved (journalctl `-n N` inside the invocation), so a
long-running server doesn't dump tens of thousands of lines on page-load.
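The control flow of steps 1-3 can be sketched as a generator. This is an illustration under stated assumptions, not the planned implementation: `stream_journal_sketch` is a hypothetical name, `get_invocation_id` and `open_journal` are injected callables standing in for the helper calls, and with the default intervals the keepalive fires on the first poll at or past 10 s (so at 12 s), which only approximates the "~10 s" in the plan.

```python
import time
from typing import Callable, Iterable, Iterator


def stream_journal_sketch(
    get_invocation_id: Callable[[], str],
    open_journal: Callable[[str], Iterable[str]],
    *,
    follow: bool = True,
    poll_every: float = 3.0,
    keepalive_every: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    # Step 1: resolve the current invocation.
    invocation_id = get_invocation_id()

    # Step 2a: nothing to show and nothing to wait for.
    if not invocation_id and not follow:
        return

    # Step 2b: idle wait; emit "" as a keepalive nudge until an ID appears.
    waited = 0.0
    while not invocation_id:
        sleep(poll_every)
        waited += poll_every
        if waited >= keepalive_every:
            waited = 0.0
            yield ""
        invocation_id = get_invocation_id()

    # Step 3: hand over to the journal stream for exactly this invocation.
    yield from open_journal(invocation_id)
```

Injecting `sleep` keeps the idle-wait branch testable without real delays.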
## Files to change
### Helpers (`deploy/scripts/libexec/`)
- **`left4me-systemctl`** — extend `show` to also request
`--property=InvocationID`. One-line change at `:43`:
```sh
show) exec "$systemctl" show \
    --property=ActiveState \
    --property=SubState \
    --property=InvocationID \
    "$unit" ;;
```
- **`left4me-journalctl`** — replace the unit-based filter with an
invocation-id-based one. New CLI signature:
```
left4me-journalctl <name> --invocation-id <hex32> --lines <n> --follow|--no-follow
```
Validate `<hex32>` against `^[0-9a-f]{32}$` (32 lowercase hex chars), same
defensive style as the existing name validation. Exec:
```sh
exec "$journalctl" \
    _SYSTEMD_INVOCATION_ID="$invocation_id" \
    -n "$lines" -o cat $follow_arg
```
No `-u <unit>` — the invocation ID is globally unique, the predicate is
enough. Old `<name> --lines <n> --follow` shape is removed (no callers
remain after the host layer change).
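The helper itself is a shell script, but the same validation rule can be mirrored in Python, for example as a host-side pre-check or as a reference in tests. The sketch below is illustrative and `is_valid_invocation_id` is a hypothetical name. It uses `re.fullmatch` because in Python a pattern ending in `$` would also accept a value with a trailing newline.

```python
import re

# Exactly 32 lowercase hex characters, as the plan specifies.
_INVOCATION_ID = re.compile(r"[0-9a-f]{32}")


def is_valid_invocation_id(value: str) -> bool:
    """True only if the whole string is 32 lowercase hex characters."""
    return _INVOCATION_ID.fullmatch(value) is not None
```

The rejected cases below match the ones the plan wants the shell test to cover: too short, non-hex, uppercase, and an embedded slash.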
### Host (`l4d2host/`)
- **`l4d2host/service_control.py`**
- `journalctl_command(name, *, invocation_id, lines, follow)` → builds
the new arg list. Drop the `-u`-based form.
- `stream_journal(name, *, lines=200, follow=True)` → orchestrator from
Architecture steps 1-3 above. Helpers in this file:
- `get_invocation_id(name) -> str` (parses `show_service` output;
returns `""` if unset).
- `_stream_with_restart_guard(invocation_id, lines, follow)` →
`Popen` + daemon poller thread.
- Keep `stream_command` as-is (still consumed by `host_commands`).
- **`l4d2host/logs.py`** — no signature change; just forwards.
- **`l4d2host/cli.py`** — `logs` command keeps `--lines`/`--follow` flags
unchanged. No CLI-surface break.
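A possible shape for the restart guard, shown as a sketch rather than the planned code. The signature differs from the plan's `_stream_with_restart_guard(invocation_id, lines, follow)`: here the command line (`argv`) and the ID lookup (`get_invocation_id`) are passed in explicitly so the example stays self-contained, and both parameters are assumptions.

```python
import subprocess
import threading
from typing import Callable, Iterator, Sequence


def stream_with_restart_guard(
    argv: Sequence[str],
    invocation_id: str,
    get_invocation_id: Callable[[], str],
    *,
    poll_every: float = 5.0,
) -> Iterator[str]:
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    stop = threading.Event()

    def watch() -> None:
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(poll_every):
            if get_invocation_id() != invocation_id:
                proc.terminate()  # ends the stream; the client reconnects
                return

    threading.Thread(target=watch, daemon=True).start()
    try:
        assert proc.stdout is not None
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        # Runs on normal end-of-stream and when the consumer closes the generator.
        stop.set()
        if proc.poll() is None:
            proc.terminate()
        proc.wait()
        if proc.stdout is not None:
            proc.stdout.close()
```

Using `threading.Event.wait` as the sleep lets the poller exit promptly once the stream ends, instead of lingering for a full poll interval.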
### Web (`l4d2web/`)
- **`services/l4d2_facade.py`** — `stream_server_logs` keeps its signature.
The behavior change is fully inherited from the host layer.
- **`routes/log_routes.py`** — unchanged. The existing keepalive logic at
`:33` (`if line == "": yield ": keepalive\n\n"`) already handles the
empty-line nudges the host yields during the idle wait.
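For reference, the keepalive contract between host and route can be sketched as a small generator. Only the `""` to `": keepalive\n\n"` mapping is taken from the plan; the `data:` framing for regular lines follows the standard Server-Sent Events format and is an assumption about what the route emits, and `sse_events` is a hypothetical name.

```python
from typing import Iterable, Iterator


def sse_events(lines: Iterable[str]) -> Iterator[str]:
    """Translate host log lines into SSE frames."""
    for line in lines:
        if line == "":
            # SSE comment line: keeps the connection alive, fires no client event.
            yield ": keepalive\n\n"
        else:
            yield f"data: {line}\n\n"
```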
### Tests
- **`l4d2host/tests/test_logs.py`**
- Update `test_stream_instance_logs_uses_journalctl_helper` for the new
arg shape: `["sudo", "-n", "/usr/local/libexec/left4me/left4me-journalctl", "alpha", "--invocation-id", "<32hex>", "--lines", "25", "--no-follow"]`.
Stub out `get_invocation_id` to return a known ID.
- Add: empty InvocationID + `follow=False` → empty iterator (no
journalctl call).
- Add: empty InvocationID + `follow=True` → yields `""` then the next
`get_invocation_id` returns a real ID and the journalctl helper is
called once.
- Add: invocation changes mid-stream → poller calls `proc.terminate()`.
- **`l4d2host/tests/test_cli.py`** — `test_logs_command_streams_lines`:
update expected helper invocation, or stub at the `stream_instance_logs`
level (it's already monkeypatched in similar tests).
- **`deploy/scripts/tests/test_journalctl_helper.py`** — update existing
shell-syntax & argument-validation test for the new CLI signature.
Assert rejection of malformed invocation IDs (too short, non-hex,
uppercase, embedded slash).
- **`l4d2web/tests/test_status_and_server_logs.py`** — should pass
unchanged (the SSE shape and route surface haven't moved).
## Critical files
- `deploy/scripts/libexec/left4me-systemctl` (extend `show`)
- `deploy/scripts/libexec/left4me-journalctl` (rewrite CLI shape)
- `l4d2host/l4d2host/service_control.py` (orchestrator + helpers)
- `l4d2host/tests/test_logs.py`, `l4d2host/tests/test_cli.py`
- `deploy/scripts/tests/test_journalctl_helper.py`
- (No changes:) `l4d2web/services/l4d2_facade.py`, `l4d2web/routes/log_routes.py`
## Why not alternative approaches
- **`--since <ActiveEnterTimestamp>`** — works in the happy path but is
fragile to clock skew, system suspend, and units that restart inside
the same second. `_SYSTEMD_INVOCATION_ID` was added to systemd
specifically for this filter.
- **String-match the systemd `Started …` marker** — locale-dependent,
breaks with systemd-message changes, can't survive `Restart=`.
- **Toggle in the UI** — user explicitly opted out; YAGNI.
## Verification
1. **Unit tests** (sandboxed):
```
uv run --package l4d2host pytest l4d2host/tests/
uv run --package l4d2web pytest l4d2web/tests/
uv run pytest deploy/scripts/tests/ deploy/tests/
```
2. **Manual on the host** (`ckn@10.0.4.128`):
```
# the unit is running
l4d2ctl logs vanilla --no-follow | head -5
# → first lines should be from this run's start, not yesterday
# never-started case (pick an unstarted server, or stop first)
l4d2ctl stop vanilla && l4d2ctl logs vanilla --no-follow
# → empty output, exit 0
```
3. **End-to-end in browser**:
- Open `/servers/1`. Confirm log starts at this run's first line.
- Click Stop. Stream goes quiet. Click Start. SSE auto-reconnects and
shows the new run from line one.
- Open a fresh server that has never been started: log panel is empty
but connection is alive; clicking Start makes log appear within seconds.

@ -1,226 +0,0 @@
# UID collapse — remove `l4d2-sandbox` user
## Context
The hardening refactor that landed earlier today
(`docs/superpowers/plans/2026-05-15-hardening-refactor.md`) deployed
the systemd-directive composition that covers all same-uid attack
vectors for the gameserver + web units running as `left4me`.
The script-sandbox unit still runs as a separate uid `l4d2-sandbox`
(981) with a build-time idmap (`mount --bind --map-users=980:981:1`)
translating sandbox-side writes to land on disk as `left4me`. After
the hardening refactor, the same-uid attack vectors the sandbox uid
defends against (FS-view access, ptrace, /proc, signals) are
already closed by the sandbox's own systemd-run hardening profile.
The separate uid is now defense-in-depth only — and it's
inconsistent with the decision *not* to split the web/server uid.
Pick one principle. Option C from the discussion: **one user**.
Delete `l4d2-sandbox`, simplify the sandbox helper, remove the
idmap. Architecture gets smaller (one fewer uid, no idmap binds,
~30 lines deleted from the helper). Trade: if sandbox hardening
regresses, kernel uid boundary no longer helps — consistent with
what we already accepted for server/web.
## Approach
1. **Edit `scripts/libexec/left4me-script-sandbox`** (left4me repo):
delete the idmap block (lines 49-78 per Phase 1 exploration —
the `LEFT4ME_UID`/`SANDBOX_UID` lookups, `STAGING` setup,
`cleanup_staging` trap, `mount --bind --map-users=…` call).
Change `User=l4d2-sandbox -p Group=l4d2-sandbox` (line 85)
to `User=left4me -p Group=left4me`. Change
`BindPaths="${STAGING}:/overlay"` (line 102) to
`BindPaths="${OVERLAY_DIR}:/overlay"`. Keep the
`nsenter --mount=/proc/1/ns/mnt` self-wrap at the top — it's
about namespace escape, not uid.
2. **Update `scripts/tests/test_script_sandbox.py`** (left4me repo):
- Lines 36-37: change `User=l4d2-sandbox`/`Group=l4d2-sandbox`
assertions → `User=left4me`/`Group=left4me`.
- Delete `test_script_sandbox_uses_idmap_staging` (lines 114-133)
entirely — it asserts the idmap and staging exist; after
refactor neither does.
- Update the comments at lines 165-166 to drop the sandbox-uid reference.
3. **Update inline comments** referencing the sandbox uid:
- `l4d2web/services/overlay_builders.py:342` (or near 100 — agents
reported different lines; locate via grep) — "as l4d2-sandbox"
→ "as left4me".
- `l4d2host/instances.py:80` — comment about l4d2-sandbox-owned
lower-layer files → reflect that all overlay content is now
left4me-owned end-to-end.
4. **Mark the build-time-idmap plan superseded**:
`docs/superpowers/plans/2026-05-15-build-time-idmap.md` — add a
top-line status note: "SUPERSEDED 2026-05-15 by the uid-collapse
refactor (this plan). The idmap pattern this plan introduced is
removed because source uid (`left4me`) now equals target uid
(`left4me`) — translation is a no-op." Same one-line treatment
for `docs/superpowers/plans/2026-05-14-overlay-idmap.md`.
5. **Update the user-uid-split spec's existing superseded header**:
`docs/superpowers/specs/2026-05-15-user-uid-split-design.md`
currently says "2 users (current state) is correct"; revise to
say "1 user (after uid-collapse refactor) is correct" and update
the reasoning paragraph.
6. **Light-touch updates to other docs** that reference
`l4d2-sandbox` for accuracy. Pragmatic scope — add a top-line
note instead of rewriting body content:
- `deploy/README.md` — drop the `l4d2-sandbox` bullet (line 84),
fix the paragraph at line 141 to reflect no-idmap state.
- `docs/superpowers/specs/2026-05-15-hardening-refactor-design.md`
and `2026-05-15-hardening-threat-model.md` — add a one-line
"Updated 2026-05-15: l4d2-sandbox collapsed into left4me; see
plans/2026-05-15-uid-collapse.md" note in the relevant context
section.
- `docs/superpowers/specs/2026-05-15-build-overlay-unit-design.md`
— same one-line note (the spec's hardening profile sketch
references the old `User=l4d2-sandbox`; the new build-overlay-unit
refactor when it lands will inherit `User=left4me` from this
change).
- **Leave the 2026-05-08-* design specs alone.** They describe
historical design at the time; rewriting them obscures the
evolution. Anyone reading them sees the date and the
superseded-note chain leads forward.
7. **Remove `l4d2-sandbox` from the ckn-bw bundle**
(`~/Projekte/ckn-bw/bundles/left4me/items.py`):
- Delete the `l4d2-sandbox` entry from the `users` dict
(lines 54-58 per Phase 1).
- Delete the `l4d2-sandbox` entry from the `groups` dict
(line 44).
- Update the `/var/lib/left4me` mode comment + decide whether to
change `0711` → `0755`. The `0711` was specifically to let
`l4d2-sandbox` traverse (not list) the dir; with sandbox gone,
`0755` is the natural choice. Pick `0755`.
8. **On-host pre-flight**: before `bw apply`, chown any remaining
uid-981 files to `left4me`:
```bash
ssh left4.me 'sudo find /var/lib/left4me /opt/left4me -uid 981 -print | head -50'
# If any results, chown them:
ssh left4.me 'sudo find /var/lib/left4me /opt/left4me -uid 981 -exec chown left4me:left4me {} +'
```
Per the build-time-idmap plan that landed earlier, new sandbox
writes already land as `left4me`, so the result should be small
or empty. The chown catches any stragglers.
9. **Cross-repo push + bw apply**:
- Commit left4me changes (helper, tests, doc updates) on master.
- Commit ckn-bw changes (users/groups deletion, mode change) on
master.
- Push both.
- `bw apply ovh.left4me`.
10. **Verify**:
- `getent passwd l4d2-sandbox` on the host → no result (user
removed).
- `sudo find /var/lib/left4me /opt/left4me -uid 981 -print` →
empty.
- Trigger a sandbox build via the web UI; observe in
`journalctl -u 'left4me-script-*'` that the transient unit
runs as `left4me`, completes successfully, and the resulting
overlay files in `/var/lib/left4me/overlays/<id>/` are
`left4me:left4me`.
- `pytest scripts/tests/test_script_sandbox.py` locally passes
with updated assertions.
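The mode change in step 7 comes down to the "other" permission bits on the directory: `0711` grants execute (traverse) without read (list), while `0755` grants both. A small sketch using the standard `stat` module makes the difference visible; the function names are hypothetical.

```python
import stat


def describe_dir_mode(mode: int) -> str:
    """Render permission bits the way `ls -l` would for a directory."""
    return stat.filemode(stat.S_IFDIR | mode)


def others_can_list(mode: int) -> bool:
    """Read bit for 'other': needed to enumerate directory entries."""
    return bool(mode & stat.S_IROTH)


def others_can_traverse(mode: int) -> bool:
    """Execute bit for 'other': needed to access entries by known name."""
    return bool(mode & stat.S_IXOTH)
```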
## Files to modify
**Left4me repo (`~/Projekte/left4me`):**
- `scripts/libexec/left4me-script-sandbox`: helper changes (step 1)
- `scripts/tests/test_script_sandbox.py`: test updates (step 2)
- `l4d2web/services/overlay_builders.py`: comment update (step 3)
- `l4d2host/instances.py`: comment update (step 3)
- `docs/superpowers/plans/2026-05-15-build-time-idmap.md`:
SUPERSEDED header (step 4)
- `docs/superpowers/plans/2026-05-14-overlay-idmap.md`:
SUPERSEDED header (step 4)
- `docs/superpowers/specs/2026-05-15-user-uid-split-design.md`:
update existing superseded header (step 5)
- `docs/superpowers/specs/2026-05-15-hardening-refactor-design.md`:
one-line note (step 6)
- `docs/superpowers/specs/2026-05-15-hardening-threat-model.md`:
one-line note (step 6)
- `docs/superpowers/specs/2026-05-15-build-overlay-unit-design.md`:
one-line note (step 6)
- `deploy/README.md`: drop sandbox bullet, update idmap paragraph
(step 6)
**Ckn-bw repo (`~/Projekte/ckn-bw`):**
- `bundles/left4me/items.py` — drop `l4d2-sandbox` user + group;
tighten mode (step 7)
**Host actions (no commits):**
- pre-flight chown of orphan-981 files (step 8)
- `bw apply ovh.left4me` (step 9)
## Verification
End-to-end on `left4.me`:
```bash
# User removed
ssh left4.me 'getent passwd l4d2-sandbox; getent group l4d2-sandbox'
# Expect: empty (both)
# No orphan-uid files
ssh left4.me 'sudo find /var/lib/left4me /opt/left4me -uid 981 -print 2>/dev/null'
# Expect: empty
# Sandbox build runs as left4me end-to-end
# (Trigger via web UI; then check)
ssh left4.me 'sudo journalctl --since "5 minutes ago" -u "left4me-script-*" | head -30'
# Expect: clean run, no permission errors
ssh left4.me 'sudo ls -ln /var/lib/left4me/overlays/<id>/ | head -5'
# Expect: uid 980 (left4me), not 981
# Local tests
cd ~/Projekte/left4me && pytest scripts/tests/test_script_sandbox.py -q
# Expect: all green (one fewer test — the idmap test was deleted)
```
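The orphan-uid check above can also be scripted, for instance to assert on it from a test or a post-apply hook. The sketch below mirrors `find <root> -uid <uid> -print` under the assumption that it runs with enough privilege to stat every entry; `files_owned_by` is a hypothetical name.

```python
import os
from pathlib import Path
from typing import Iterator


def files_owned_by(root: Path, uid: int) -> Iterator[Path]:
    """Yield root and every entry below it whose owner is `uid`."""
    # lstat inspects symlinks themselves instead of following them, like find does.
    if os.lstat(root).st_uid == uid:
        yield root
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            if os.lstat(path).st_uid == uid:
                yield path
```

After the pre-flight chown, `list(files_owned_by(Path("/var/lib/left4me"), 981))` would be expected to come back empty.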
## Rollback
If the deploy goes wrong:
- `git revert` the left4me commits + the ckn-bw commit, push,
`bw apply` again.
- ckn-bw will recreate the `l4d2-sandbox` user on the host.
- The old helper script comes back via `git_deploy`.
- Any files chown'd from 981→980 in the pre-flight stay at 980 —
that's fine because the new helper would have written them as 980
anyway.
## Risks
- **Sandbox build running during `bw apply`**: ckn-bw's user-removal
step might fail if a `l4d2-sandbox`-uid process is alive.
Mitigation: don't apply during a build. Quick check before apply:
`ssh left4.me 'sudo systemctl list-units --type=service "left4me-script-*"'`
→ expect "0 loaded units".
- **Orphan files not caught by the pre-flight find**: if any uid-981
file exists outside `/var/lib/left4me` or `/opt/left4me`, the user
removal succeeds but the file becomes orphan-uid. Practically these
paths are exhaustive; if paranoid, expand the find to `/`.
- **The `nsenter` self-wrap exists only because the web unit sets
`PrivateTmp=true`.** If the web unit's PrivateTmp ever goes away, the
wrap becomes unnecessary. Not affected by this refactor; flag for
future cleanup.
## Out of scope
- Renaming `left4me` to something else (e.g., `l4d2-app`). Cosmetic
only; not worth the migration cost.
- The broader configmgmt responsibility reshape (drop-ins owned by
left4me, ckn-bw as thin file-shipper). Deferred per the
hardening-refactor design.
- `build-overlay-unit` template refactor
(`docs/superpowers/specs/2026-05-15-build-overlay-unit-design.md`)
— still queued; will inherit `User=left4me` cleanly from this work.
- Rewriting historical 2026-05-08-* design specs.

@ -1,408 +0,0 @@
# Plan — collapse left4me venv chain into uv workspace + `uv sync`
**Status:** executed (left4me side). ckn-bw side queued — see
`~/Projekte/ckn-bw/bundles/left4me/` and the matching section below.
**Notable deviations from the original handoff
(`docs/superpowers/specs/2026-05-15-handoff-uv-workspace.md`):**
- Handoff assumed `pkg_apt: uv` works on Debian Trixie. It does not — uv
is in `experimental`/`sid` only. Replaced with a `left4me_install_uv`
action that downloads a pinned 0.11.8 tarball from astral-sh/uv
releases, SHA256-verifies, installs to `/usr/local/bin/`.
- Handoff assumed the existing layout (`l4d2host/pyproject.toml` with
`package-dir = "."`) was workspace-compatible. It was not — setuptools
writes `egg-info/` to source during any build, which fails on the
root-owned `/opt/left4me/src` tree. Required layout restructure to
`l4d2host/l4d2host/` (package source nested) plus a switch from
setuptools to hatchling.
- `git` is not installed on the prod host (bw drives git from the
control machine). Verification check #1 uses `find` for build
artifacts instead of `git status`.
## Context
The production deploy of left4me to `ovh.left4me` currently uses a 5-action
chain in `ckn-bw/bundles/left4me/items.py` that builds out a Python venv
under `/var/lib/left4me/.venv` by chaining `python3 -m venv` → `pip upgrade` →
`pip install` (with an 8-line tempdir-copy dance because the source at
`/opt/left4me/src` is root-owned and setuptools wants to write `.egg-info/`
into it) → `alembic upgrade` → `seed_overlays`. The chain has three
problems:
1. **Non-deterministic prod deploys.** `pip install` resolves whatever is
latest at apply time. A transitive CVE-relevant bump between two
`bw apply` runs is invisible until something breaks.
2. **Cognitive cost.** The tempdir-copy in `left4me_pip_install` is the
single longest, gnarliest action in the bundle.
3. **Implicit cross-package dep.** `l4d2web` imports from `l4d2host.paths`
in 5 files but doesn't declare the dependency — today's setup works
only because both get `pip install -e`'d side-by-side.
This plan migrates the repo to a uv workspace with a committed `uv.lock`,
replacing the 5-action chain with `left4me_install_uv` (download +
SHA256 verify, idempotent — only re-runs on version change) plus
`left4me_uv_sync`. On the steady-state path (uv already pinned at
0.11.8), only `uv_sync` fires per deploy. Both sides of the change
(left4me repo and the ckn-bw `left4me` bundle) ship together. The plan
executes the migration sequence already documented in
`/Users/mwiegand/Projekte/left4me/docs/superpowers/specs/2026-05-15-handoff-uv-workspace.md`
— treat that handoff as the design document. This plan adds the
empirically-verified ground truth, resolves the small open questions, and
encodes the executable sequence.
## Source of truth
- **Design**: `docs/superpowers/specs/2026-05-15-handoff-uv-workspace.md`
(in the left4me repo) — read this first; do not duplicate its content
here.
- **Sibling context** (don't dive in):
`docs/superpowers/specs/2026-05-15-deployment-responsibility-design.md`
(just-shipped; left the venv chain alone),
`docs/superpowers/specs/2026-05-15-runtime-state-relocation-design.md`
(made `/opt/left4me/src` root-owned, which is *why* the current
tempdir-copy dance exists).
## Resolved questions (from planning)
- **Branch flow**: direct-to-master on both sides. (Matches left4me's
recent workflow, e.g. `b13d164`, `55b0138`. ckn-bw side committed
but NOT pushed — operator pushes manually.)
- **Python version alignment**: align all three pyprojects (root + both
members) to `requires-python = ">=3.13"`. Matches `.envrc` and the
production host. Removes the workspace-vs-member skew.
- **Spike test scope**: extend beyond the handoff to also dry-run a
`uv sync --frozen` shape against a root-owned source — the production
command path is `sync`, not `build`, and they're different code paths.
- **Scope handoff at `git push`**: agent's deliverable is two ready-to-deploy
commits (left4me pushed; ckn-bw committed but unpushed). The user runs
`bw apply ovh.left4me`, the post-apply restart, and the 6-check
verification matrix themselves. (Per session memory:
`feedback_left4me_deploy_workflow` — supersedes the original prompt's
ask to drive apply + verify end-to-end.) The spike test remains agent
work — it's information gathering, and the one-shot direct install
fits the "one-shot via direct command" rule from the same memory.
- **uv install vector**: direct GitHub tarball download + SHA256 verify
against the official `.sha256` sibling, install to `/usr/local/bin/`.
The handoff doc's `pkg_apt: uv` assumption was wrong — uv is not in
Debian Trixie's apt archive (in `experimental`/`sid` only). Astral's
canonical methods are curl-pipe-sh and direct tarball; we chose
tarball for auditability and pattern-match with the existing
`left4me_install_steamcmd` action. Pin to **uv 0.11.8** to match
the local brew-installed version, eliminating the lockfile-format-skew
risk between dev and prod.
## Ground-truth from exploration
- **Cross-package imports confirmed**: 5 files in `l4d2web/` import
`from l4d2host.paths`:
- `l4d2web/routes/overlay_routes.py`
- `l4d2web/services/overlay_creation.py`
- `l4d2web/services/overlay_builders.py`
- `l4d2web/services/overlay_files.py`
- `l4d2web/services/workshop_paths.py`
- **Layout compatibility**: both members use `[tool.setuptools.package-dir]
{name} = "."` (pyproject lives inside the package directory). uv
workspace `members = ["l4d2host", "l4d2web"]` handles this fine — uv
uses the pyproject as the project root regardless of the package-dir
mapping.
- **`.gitignore` already covers** `*.egg-info/`, `.venv/`, `__pycache__/`,
etc. No `.gitignore` changes needed.
- **No `pytest.ini` / `[tool.pytest.ini_options]` exists** — pytest
defaults work; `uv run pytest` from repo root will discover tests in
`l4d2host/tests/` and `l4d2web/tests/`.
- **Bundle action conventions** (from `ckn-bw/bundles/left4me/items.py`
and neighbors): every action sets `cascade_skip: False` explicitly.
Action keys in use: `command`, `triggered`, `cascade_skip`, `unless`,
`needs`, `triggers`, `comment`.
- **Additional `git_deploy` consumer**: `left4me_chmod_scripts` at
`items.py:324` also `needs: 'git_deploy:/opt/left4me/src'`. Untouched
by this refactor, but listed here so it's not missed during review.
- **Bundle README §"deploy-flow"**: lines 84-90 of
`bundles/left4me/README.md` document the pip_install tempdir dance.
This is the prose to rewrite (not vague — those exact lines).
- **`apt.packages`** declaration: `metadata.py:29-49`. Currently lists
`python3`, `python3-venv`, `python3-pip`, `python3-dev`, plus i386
multiarch entries.
- **uv NOT in Debian Trixie apt archive** (verified via
`apt-cache search "^uv$"` and `apt-cache policy uv` on the live host
— both return nothing for the actual `uv` package). Handoff doc's
assumption was wrong on this point.
- **`git` is NOT installed on the production host** (verified via
`command -v git` on prod returning empty; `/usr/bin/git` doesn't
exist). The bw `git_deploy` item operates from the *control* machine
(dev laptop), pushing files to prod via SSH — prod itself needs no
git. Implication: the handoff's verification check #1
(`sudo git -C /opt/left4me/src status --porcelain`) cannot be used.
Replace with `find /opt/left4me/src \( -name '*.egg-info' -o -name
build -o -name dist \) -print`.
- **ckn-bw is currently EVEN with `origin/master`** (verified via
`git status -sb` showing `## master...origin/master` with empty
log). The original prompt's "7 commits ahead" was stale — the
operator has since pushed. After our ckn-bw commit lands locally,
the repo will be 1 commit ahead (not 8).
- **Prod arch**: `x86_64` / `amd64`. **Prod curl**: 8.14.1 at
`/usr/bin/curl`. **Prod tar**: GNU tar 1.35. **Prod install**: GNU
coreutils 9.7. **`/usr/local/bin`** exists, root-owned, currently
contains only the `downtime` binary.
- **Current prod venv state**: `/var/lib/left4me/.venv/` exists, owned
by `left4me:left4me`, contains `python3.13`, `pip`, `alembic`,
`flask`, `gunicorn`, `l4d2ctl`. `pip show l4d2host` / `pip show
l4d2web` both report version 0.1.0. So uv will be adopting a venv
that already has working installs of both members + their deps.
- **Local dev environment**: `uv 0.11.8` (brew), `direnv 2.37.1`
(supports `use uv`), `python 3.13.13`. No `.venv` exists locally yet
— clean slate.
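
The cross-package import finding above can be rechecked mechanically before and after the migration. The following is a minimal sketch using only the standard library; `find_importers` is a hypothetical helper name that does not exist in either repo, and the default module name is taken from the list above.

```python
import ast
from pathlib import Path


def find_importers(root: Path, module: str = "l4d2host.paths") -> list[Path]:
    """Return the .py files under root that import `module` (hypothetical helper)."""
    parent, _, leaf = module.rpartition(".")
    hits: list[Path] = []
    for path in sorted(root.rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue  # skip files that do not parse
        for node in ast.walk(tree):
            if isinstance(node, ast.ImportFrom) and node.level == 0:
                # `from l4d2host.paths import X`
                direct = node.module == module
                # `from l4d2host import paths`
                via_parent = node.module == parent and any(
                    alias.name == leaf for alias in node.names
                )
                if direct or via_parent:
                    hits.append(path)
                    break
            elif isinstance(node, ast.Import):
                # `import l4d2host.paths`
                if any(alias.name == module for alias in node.names):
                    hits.append(path)
                    break
    return hits
```

Running it against the `l4d2web/` directory should list the five files named above if the finding still holds; a different count would suggest the dependency surface has changed since the exploration.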
## Critical files
### left4me repo
- **NEW** `/Users/mwiegand/Projekte/left4me/pyproject.toml` — workspace root
- **NEW** `/Users/mwiegand/Projekte/left4me/uv.lock` — generated via `uv lock`
- `l4d2host/pyproject.toml:10` — bump `requires-python` to `>=3.13`
- `l4d2web/pyproject.toml:10-18` — bump `requires-python`, add
`"l4d2host"` to `dependencies`, add `[tool.uv.sources] l4d2host = { workspace = true }`
- `.envrc` — replace `layout python python3.13` with `use uv` (with
fallback if direnv stdlib is too old)
- `README.md`, `AGENTS.md`, `l4d2web/README.md` — update install
instructions
### ckn-bw repo (`~/Projekte/ckn-bw/`)
- `bundles/left4me/metadata.py:29-49`: **ensure** `'curl': {}` is in
`apt.packages` (required by the new install action; verify it's not
already inherited from a base bundle). **Drop** `'python3-pip'` (uv
replaces pip; bundle has no other consumer). **Drop** `'python3-venv'`
(the chain no longer uses `python3 -m venv`; uv creates its own venv
via `UV_PROJECT_ENVIRONMENT`). **Keep** `'python3'`, `'python3-dev'`,
and the i386 multiarch entries.
**Do NOT add** `'uv': {}` — uv is not in Trixie's apt archive.
- `bundles/left4me/items.py:285-305` — update `git_deploy:/opt/left4me/src`
triggers: replace `action:left4me_pip_install` with
`action:left4me_uv_sync`
- `bundles/left4me/items.py:328-340`: **DELETE** `left4me_create_venv`
- `bundles/left4me/items.py:342-352`: **DELETE** `left4me_pip_upgrade`
- `bundles/left4me/items.py:354-382`: **DELETE** `left4me_pip_install`
(replaced by `left4me_uv_sync` below)
- `bundles/left4me/items.py:384-407` (`left4me_alembic_upgrade`):
update `needs:` (or `triggered_by:` equivalent) to point at
`action:left4me_uv_sync` instead of `action:left4me_pip_install`
- `bundles/left4me/items.py`: **ADD** two new actions:
- `left4me_install_uv`: download pinned 0.11.8 tarball from
github.com/astral-sh/uv/releases/, SHA256-verify, install to
/usr/local/bin/. Idempotent via `unless: '/usr/local/bin/uv --version
| grep -qx "uv 0.11.8"'`. `needs: ['pkg_apt:curl']`,
`triggers: ['action:left4me_uv_sync']`. (Body matches the approved
preview, with `unless:` refined to `grep -qx` for BRE portability.)
- `left4me_uv_sync`: `sudo -u left4me env
UV_PROJECT_ENVIRONMENT=/var/lib/left4me/.venv /usr/local/bin/uv
sync --frozen --project /opt/left4me/src`. `triggered: True`,
`cascade_skip: False`, `needs:` includes
`'git_deploy:/opt/left4me/src'`, `'action:left4me_install_uv'`,
`'directory:/var/lib/left4me'`, `'user:left4me'`. `triggers:
['action:left4me_alembic_upgrade']`.
- `bundles/left4me/README.md:84-90` — rewrite the deploy-flow description
to mention the install_uv + uv_sync chain instead of the tempdir-dance
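
The two new actions described above might take roughly the following shape in `items.py`. This is a sketch, not the approved action body: the install command is an assumption reconstructed from the Step 0 spike, the constant names are hypothetical, and only the action keys already in use in the bundle (`command`, `triggered`, `cascade_skip`, `unless`, `needs`, `triggers`) appear.

```python
# Hypothetical constants; the values come from this plan.
UV_VERSION = "0.11.8"
UV_BIN = "/usr/local/bin/uv"
UVX_BIN = "/usr/local/bin/uvx"
VENV = "/var/lib/left4me/.venv"
SRC = "/opt/left4me/src"
BASE_URL = f"https://github.com/astral-sh/uv/releases/download/{UV_VERSION}"
TARBALL = "uv-x86_64-unknown-linux-gnu.tar.gz"

# Assumption: mirrors the Step 0 spike (download, SHA256-verify, install).
install_command = " && ".join([
    "tmpdir=$(mktemp -d)",
    "trap 'rm -rf \"$tmpdir\"' EXIT",
    f'curl -fsSL -o "$tmpdir/{TARBALL}" {BASE_URL}/{TARBALL}',
    f'curl -fsSL -o "$tmpdir/{TARBALL}.sha256" {BASE_URL}/{TARBALL}.sha256',
    f'(cd "$tmpdir" && sha256sum -c {TARBALL}.sha256)',
    f'tar -xzf "$tmpdir/{TARBALL}" -C "$tmpdir" --strip-components=1',
    f'install -m 0755 "$tmpdir/uv" {UV_BIN}',
    f'install -m 0755 "$tmpdir/uvx" {UVX_BIN}',
])

actions = {
    "left4me_install_uv": {
        "command": install_command,
        # Copied from the plan. Worth confirming against the real
        # `uv --version` output, which may carry a build suffix.
        "unless": f'{UV_BIN} --version 2>/dev/null | grep -qx "uv {UV_VERSION}"',
        "triggered": False,
        "cascade_skip": False,
        "needs": ["pkg_apt:curl"],
        "triggers": ["action:left4me_uv_sync"],
    },
    "left4me_uv_sync": {
        "command": (
            f"sudo -u left4me env UV_PROJECT_ENVIRONMENT={VENV} "
            f"{UV_BIN} sync --frozen --project {SRC}"
        ),
        "triggered": True,
        "cascade_skip": False,
        "needs": [
            f"git_deploy:{SRC}",
            "action:left4me_install_uv",
            "directory:/var/lib/left4me",
            "user:left4me",
        ],
        "triggers": ["action:left4me_alembic_upgrade"],
    },
}
```

Keeping the version in one constant means a future uv bump touches a single line, and the `unless` guard follows automatically.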
## Execution steps
### Step 0 — Spike test (extended) — DO FIRST
Verify the architectural assumption empirically on the live host.
Uses the SAME install vector the production action will use (direct
tarball + SHA256 verify), so the spike doubles as a smoke test for
the install action itself.
```bash
# A. Install pinned uv on prod (one-shot via direct command; matches
# what the future bw action will do).
ssh ckn@left4.me '
set -e
tmpdir=$(mktemp -d); trap "rm -rf $tmpdir" EXIT
base=https://github.com/astral-sh/uv/releases/download/0.11.8
tar=uv-x86_64-unknown-linux-gnu.tar.gz
curl -fsSL -o $tmpdir/$tar $base/$tar
curl -fsSL -o $tmpdir/$tar.sha256 $base/$tar.sha256
(cd $tmpdir && sha256sum -c $tar.sha256)
tar -xzf $tmpdir/$tar -C $tmpdir --strip-components=1
sudo install -m 0755 $tmpdir/uv /usr/local/bin/uv
sudo install -m 0755 $tmpdir/uvx /usr/local/bin/uvx
/usr/local/bin/uv --version
'
# B. uv build against root-owned source: source must stay clean.
ssh ckn@left4.me '
sudo -u left4me sh -c "
wheels=\$(mktemp -d)
/usr/local/bin/uv build --wheel --sdist /opt/left4me/src/l4d2host --out-dir \$wheels
ls \$wheels
"
'
# Cleanliness probe — git not on prod, so use find for build artifacts.
# Expected: only-existing egg-info dirs (the ones already on disk from
# the current pip install -e flow); NO NEW artifacts from this run.
# Capture a baseline BEFORE the build, compare AFTER.
ssh ckn@left4.me 'sudo find /opt/left4me/src \( -name "*.egg-info" -o -name build -o -name dist -o -name "__pycache__" \) -printf "%T@ %p\n" | sort'
# C. Extended sync-shape check — dry-run `uv sync --frozen` against a
# root-owned workspace mock in /tmp. Verify the project root stays
# clean (no .python-version written, no transient files left over).
# This validates that `uv sync` (not just `uv build`) is safe against
# a read-only project tree, which is the actual production code path.
```
**Decision gate**:
- Source stays clean across B and C → proceed with full plan.
- New `*.egg-info` / `build/` / `dist/` directories appear in
`/opt/left4me/src` after `uv build` → fall back to **Medium scope**
(handoff §"Empirical spike" → fallback). Update the handoff doc to
record the fallback decision and re-plan.
- `uv sync` writes into the project root during step C → also fall back
to Medium scope. Same handoff update.
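
The decision gate depends on comparing the source tree before and after each spike step. Below is a minimal sketch of that baseline comparison, assuming it runs somewhere with read access to the tree; `snapshot` and `new_artifacts` are hypothetical names, and the artifact patterns mirror the `find` probe in Step 0.

```python
import os
from pathlib import Path

# Names treated as build artifacts, matching the Step 0 `find` probe.
ARTIFACT_NAMES = {"build", "dist", "__pycache__"}


def snapshot(root: Path) -> set[str]:
    """Collect root-relative paths of build artifacts under root."""
    found: set[str] = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if name in ARTIFACT_NAMES or name.endswith(".egg-info"):
                full = os.path.join(dirpath, name)
                found.add(os.path.relpath(full, root))
    return found


def new_artifacts(before: set[str], after: set[str]) -> list[str]:
    """Artifacts present after the run that were absent in the baseline."""
    return sorted(after - before)
```

An empty result from `new_artifacts` after steps B and C corresponds to the "source stays clean" branch of the gate; any entry corresponds to a fallback branch.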
### Step 1 — left4me workspace setup (local)
1. Write `/Users/mwiegand/Projekte/left4me/pyproject.toml` (workspace root)
— see handoff §"What changes — left4me side / New: pyproject.toml"
2. Bump `l4d2host/pyproject.toml:10` to `requires-python = ">=3.13"`
3. Update `l4d2web/pyproject.toml`: bump `requires-python`, add
`"l4d2host"` to `dependencies`, append `[tool.uv.sources]` block
4. `uv lock` at the repo root → produces `uv.lock`
5. `uv sync` → creates `.venv/`, installs both members editable + pytest
6. `uv run pytest` → all green
7. Update `.envrc`: replace `layout python python3.13` with `use uv`
(fallback to `uv sync >/dev/null && source .venv/bin/activate` if
the dev's direnv version doesn't ship `use uv`)
8. Update `README.md`, `AGENTS.md`, `l4d2web/README.md`: replace the
`pip install -e ...` invocation with `uv sync` and add the one-time
prereq line about installing uv. Mention macOS (`brew install uv`)
and Linux (curl-pipe-sh from astral.sh) — **do NOT** suggest
`apt install uv`, as it's not in Debian's apt archive yet (only
`experimental`/`sid`).
### Step 2 — left4me commit + push
Single commit using the suggested message from the handoff
(§"Commit messages — left4me side"). Push to `origin` (gitlab on
sublimity.de — confirmed safe-publish-exempt per memory). The commit
makes the workspace and lockfile available to ckn-bw's `git_deploy`.
### Step 3 — ckn-bw bundle refactor
1. Edit `bundles/left4me/metadata.py:29-49`:
- Ensure `'curl': {}` is in `apt.packages` (verify it's not already
inherited from a base bundle; if not, add it explicitly).
- Drop `'python3-pip'` (uv replaces pip; bundle has no other
consumer — grep the bundle to confirm).
- Drop `'python3-venv'` (chain no longer uses `python3 -m venv`).
- Keep `'python3'`, `'python3-dev'`, and the i386 multiarch entries.
- **Do NOT add `'uv': {}`** — not in Trixie's apt.
2. Edit `bundles/left4me/items.py`:
- Delete `left4me_create_venv`, `left4me_pip_upgrade`,
`left4me_pip_install` blocks (lines 328-382 inclusive).
- Add `left4me_install_uv` action: downloads pinned uv 0.11.8 tarball
from github.com/astral-sh/uv/releases/, SHA256-verifies against the
official `.sha256` sibling, installs to `/usr/local/bin/{uv,uvx}`.
Idempotent via `unless: '/usr/local/bin/uv --version 2>/dev/null
| grep -qx "uv 0.11.8"'`. `needs: ['pkg_apt:curl']`,
`triggers: ['action:left4me_uv_sync']`, `triggered: False`,
`cascade_skip: False`.
- Add `left4me_uv_sync` action: `sudo -u left4me env
UV_PROJECT_ENVIRONMENT=/var/lib/left4me/.venv /usr/local/bin/uv
sync --frozen --project /opt/left4me/src`. `triggered: True`,
`cascade_skip: False`. `needs:` includes
`'git_deploy:/opt/left4me/src'`, `'action:left4me_install_uv'`,
`'directory:/var/lib/left4me'`, `'user:left4me'`. `triggers:
['action:left4me_alembic_upgrade']`.
- Update `git_deploy:/opt/left4me/src` triggers (lines 285-305):
replace `'action:left4me_pip_install'` with
`'action:left4me_uv_sync'`. Keep `left4me_alembic_upgrade` and
`left4me_daemon_reload` triggers.
- Update `left4me_alembic_upgrade` (lines 384-407): its dependency
on `left4me_pip_install` must now point at `left4me_uv_sync`.
3. Rewrite `bundles/left4me/README.md:84-90` to describe the new
`install_uv → uv_sync → alembic_upgrade → seed_overlays + restart`
chain (drop the pip + tempdir-dance prose).
4. `(cd ~/Projekte/ckn-bw && .venv/bin/bw test)` → must pass clean.
### Step 4 — ckn-bw commit (DO NOT PUSH)
Single commit using the suggested message from the handoff
(§"Commit messages — ckn-bw side"). Do **not** `git push`. Per
verified state today, ckn-bw is currently EVEN with `origin/master`
(not 7 ahead as the original prompt claimed — the operator pushed
since the prompt was written). After this commit lands locally, the
repo will be 1 commit ahead of origin.
### Step 5 — Report to operator (handoff to user for deploy)
Agent's work ends here. Brief summary to the user including:
- Spike outcome (full uv-workspace path confirmed, or Medium-scope
fallback taken — including any handoff doc updates if the latter).
- What's committed and where it sits: left4me pushed to `origin/master`;
ckn-bw committed locally, now 1 commit ahead of origin (unpushed).
- The `bw apply ovh.left4me` invocation for the user to run, with the
expected output (left4me_install_uv runs the download+verify, three
old actions removed from the graph, two new actions present
(install_uv + uv_sync), alembic+seed+restart cascade fires).
- The 6-check verification matrix from handoff §"Verification
(end-to-end)" for the user to walk through after apply — with
check #1 amended: use
`sudo find /opt/left4me/src \( -name '*.egg-info' -o -name build
-o -name dist \) -newer <baseline>` instead of `git status`,
because git isn't installed on prod.
- Recovery path if uv refuses to adopt the existing venv: one-shot
`ssh ckn@left4.me 'sudo -u left4me rm -rf /var/lib/left4me/.venv'`,
then re-apply.
- Open follow-ups (uv version pinning policy — bump cadence, signing,
etc; direnv `use uv` fallback applied or not; whether to add a
separate `pkg_apt: curl` if it wasn't already declared).
**Do NOT run `bw apply`, the verification matrix, or the gameserver
round-trip — those are explicitly user-side per session memory.**
## Plan storage after approval
Per the user's global AGENTS.md (`~/.claude/agents/AGENTS.md`): specs
and plans live in the repo they describe, typically under `docs/`. After
ExitPlanMode and approval, this plan should be copied to
`/Users/mwiegand/Projekte/left4me/docs/superpowers/plans/2026-05-15-uv-workspace-execution.md`
as a peer to the design handoff, then committed alongside the left4me
changes in Step 2.
## What does NOT change (out of scope)
- Source ownership: `/opt/left4me/src` stays root-owned.
- Venv location: `/var/lib/left4me/.venv` stays where it is, owned by
the `left4me` user, accessed via `UV_PROJECT_ENVIRONMENT`.
- Hardening drop-ins, sudoers, sysctl, helpers — all stable from the
deployment-responsibility migration.
- systemd unit shapes — reactor-emitted, unchanged.
- `alembic_upgrade` and `seed_overlays` shell bodies — same commands,
just triggered from `uv_sync` instead of `pip_install`.
- `pkg_apt: python3` and `python3-dev` — kept (uv shells out to system
Python).
- Other ckn-bw bundles — this is left4me-specific.
- The build-overlay-unit refactor — separate queued thread.
- CI — none currently exists.
## Risks (carried from handoff, sized empirically)
1. **Spike test failure** → fall back to Medium scope. Graceful.
2. ~~Lockfile format skew between dev and prod~~ → **MITIGATED** by
pinning prod uv to 0.11.8 (same as local brew). The lockfile generated
by dev's uv 0.11.8 is consumed by the same uv version on prod, so the
format matches exactly. The risk returns only if dev's brew bumps uv
independently; track this in the pinning-policy follow-up.
3. **direnv `use uv` availability** → local direnv is 2.37.1 (`use uv`
added in 2.34+, so we're fine). Fallback snippet documented in case
another dev has an older direnv.
4. **`alembic`/`flask` binary paths** → uv installs the same
`console_scripts` entrypoints as pip, so paths under
`/var/lib/left4me/.venv/bin/` are identical. Verify in verification
matrix.
5. **`--force-reinstall` semantics** → no longer needed; `uv sync` is
lockfile-aware, not package-version-aware.
6. **uv release artifact availability** → if github.com/astral-sh/uv
takes down release 0.11.8 (extremely unlikely but theoretically
possible), the install action would fail. Mitigation: pin a recent
stable release, monitor astral's deprecation cadence; if needed,
mirror the artifact to an internal location for future-proofing
(out of scope for this migration).
7. **SHA256 of the tarball** → we trust the `.sha256` sibling fetched
from the same github release. A future hardening pass could embed
the checksum in the bundle source for offline verification, but the
current trust model matches steamcmd's (also github-sourced).
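
The hardening idea in risk 7, embedding the checksum in the bundle source, could look roughly like the sketch below. It is illustrative only: `sha256_of` and `verify_pinned` are hypothetical names, and the pinned digest would have to be copied by hand from the release page rather than fetched alongside the tarball.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned(path: Path, pinned_hex: str) -> bool:
    """Compare a file's digest against a checksum committed to the bundle."""
    return sha256_of(path) == pinned_hex.strip().lower()
```

The difference from the current model is the trust anchor: a pinned digest lives in version control and is reviewed once, whereas the `.sha256` sibling is fetched from the same origin as the artifact on every install.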

@ -1,747 +0,0 @@
# timeago Shared Display Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Unify all user-facing datetime rendering in `l4d2web` behind a single `timeago` Jinja filter that returns a `<time>` element with a relative label and a precise UTC tooltip.
**Architecture:** Two callables in `l4d2web/l4d2web/services/timeago.py``humanize_delta` (pure text, source of truth for the relative-label ladder) and `format_time_html` (wraps the text in a `<time>` element). The latter is registered as Jinja filter `timeago` in the Flask app factory. Templates and routes migrate from raw datetime repr and bespoke inline math to `{{ ts | timeago }}`.
**Tech Stack:** Python 3.13, Flask, Jinja2, `markupsafe.Markup`, pytest.
**Reference spec:** `docs/superpowers/specs/2026-05-16-timeago-shared-display-design.md`
---
## File Structure
| File | Action | Responsibility |
|---|---|---|
| `l4d2web/l4d2web/services/timeago.py` | Rewrite | `humanize_delta` (new symmetric ladder) + new `format_time_html` |
| `l4d2web/l4d2web/app.py` | Modify | Register `timeago` filter in `create_app` |
| `l4d2web/tests/test_timeago.py` | Create | Unit tests for both helpers + Flask smoke test |
| `l4d2web/l4d2web/templates/admin_users.html` | Modify | Use filter for `created_at` / `updated_at` |
| `l4d2web/l4d2web/templates/blueprints.html` | Modify | Use filter for `created_at` / `updated_at` |
| `l4d2web/l4d2web/templates/_job_table.html` | Modify | Use filter for `created_at` / `finished_at` (with None guard) |
| `l4d2web/l4d2web/templates/job_detail.html` | Modify | Use filter for `created_at` / `started_at` / `finished_at` |
| `l4d2web/l4d2web/templates/_live_state.html` | Modify | Replace inline `(now - x).total_seconds()` with filter |
| `l4d2web/l4d2web/templates/_server_actions.html` | Modify | Switch from `latest_job_when` (string) to `latest_job_at \| timeago` |
| `l4d2web/l4d2web/templates/_overlay_build_status.html` | Modify | Switch from `latest_build_when` to `latest_build_at \| timeago` |
| `l4d2web/l4d2web/routes/page_routes.py` | Modify | Drop `humanize_delta` imports; pass raw datetime as `latest_job_at` / `latest_build_at` |
| `l4d2web/l4d2web/routes/server_routes.py` | Modify | Remove now-dead `now=` kwarg from `_live_state.html` render call |
---
## Task 1: Rewrite `humanize_delta` with the new ladder
**Files:**
- Modify: `l4d2web/l4d2web/services/timeago.py`
- Create: `l4d2web/tests/test_timeago.py`
The current ladder returns `just now` under 45s and clamps future deltas. The new ladder is symmetric, has second precision, and falls back to a day-month date (with the year appended when it differs from the current year) for deltas of 7 days or more. The full ladder is tabulated in the spec section "Ladder (long form, symmetric for past and future)".
- [ ] **Step 1: Create the test file with parameterised boundary tests**
Create `l4d2web/tests/test_timeago.py` with:
```python
from datetime import UTC, datetime, timedelta
import pytest
from l4d2web.services.timeago import humanize_delta
NOW = datetime(2026, 5, 16, 12, 0, 0, tzinfo=UTC)
@pytest.mark.parametrize(
("delta", "expected"),
[
# zero
(timedelta(0), "now"),
# past, seconds
(timedelta(seconds=1), "1 second ago"),
(timedelta(seconds=2), "2 seconds ago"),
(timedelta(seconds=59), "59 seconds ago"),
# past, minutes
(timedelta(seconds=60), "1 minute ago"),
(timedelta(minutes=1), "1 minute ago"),
(timedelta(minutes=2), "2 minutes ago"),
(timedelta(minutes=59), "59 minutes ago"),
# past, hours
(timedelta(minutes=60), "1 hour ago"),
(timedelta(hours=1), "1 hour ago"),
(timedelta(hours=2), "2 hours ago"),
(timedelta(hours=23), "23 hours ago"),
# past, days
(timedelta(hours=24), "1 day ago"),
(timedelta(days=1), "1 day ago"),
(timedelta(days=2), "2 days ago"),
(timedelta(days=6), "6 days ago"),
# past, date fallback same year (now = 16 May 2026)
(timedelta(days=7), "9 May"),
(timedelta(days=30), "16 Apr"),
(timedelta(days=120), "16 Jan"),
# past, date fallback different year
(timedelta(days=365), "16 May 2025"),
(timedelta(days=400), "11 Apr 2025"),
],
)
def test_humanize_delta_past(delta, expected):
then = NOW - delta
assert humanize_delta(then, now=NOW) == expected
@pytest.mark.parametrize(
("delta", "expected"),
[
# future, seconds
(timedelta(seconds=1), "in 1 second"),
(timedelta(seconds=2), "in 2 seconds"),
(timedelta(seconds=59), "in 59 seconds"),
# future, minutes
(timedelta(seconds=60), "in 1 minute"),
(timedelta(minutes=2), "in 2 minutes"),
(timedelta(minutes=59), "in 59 minutes"),
# future, hours
(timedelta(hours=1), "in 1 hour"),
(timedelta(hours=23), "in 23 hours"),
# future, days
(timedelta(days=1), "in 1 day"),
(timedelta(days=6), "in 6 days"),
# future, date fallback same year
(timedelta(days=7), "23 May"),
(timedelta(days=30), "15 Jun"),
# future, date fallback different year
(timedelta(days=365), "16 May 2027"),
],
)
def test_humanize_delta_future(delta, expected):
then = NOW + delta
assert humanize_delta(then, now=NOW) == expected
def test_humanize_delta_accepts_naive_input_as_utc():
then_naive = (NOW - timedelta(minutes=5)).replace(tzinfo=None)
assert humanize_delta(then_naive, now=NOW) == "5 minutes ago"
def test_humanize_delta_accepts_naive_now_as_utc():
then = NOW - timedelta(minutes=5)
now_naive = NOW.replace(tzinfo=None)
assert humanize_delta(then, now=now_naive) == "5 minutes ago"
def test_humanize_delta_default_now_is_datetime_now_utc():
then = datetime.now(UTC) - timedelta(seconds=3)
assert humanize_delta(then) in {"3 seconds ago", "2 seconds ago", "4 seconds ago"}
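def test_humanize_delta_year_boundary_includes_year_when_years_differ():
    # The delta must be >= 7 days, otherwise the relative ladder ("N days ago")
    # applies and the date fallback is never reached.
    now = datetime(2026, 1, 10, 12, 0, 0, tzinfo=UTC)
    then = datetime(2025, 12, 30, 12, 0, 0, tzinfo=UTC)
    assert humanize_delta(then, now=now) == "30 Dec 2025"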
```
- [ ] **Step 2: Run the new tests to verify they fail against the current implementation**
Run: `pytest l4d2web/tests/test_timeago.py -v`
Expected: most past tests FAIL (current implementation returns `just now` under 45s, no singular `1 second ago`); all future tests FAIL (current clamps to 0 → `just now`); date-fallback tests FAIL (current returns an ISO date such as `2026-05-09`, not `9 May`).
- [ ] **Step 3: Rewrite `humanize_delta` to satisfy the tests**
Replace the entire contents of `l4d2web/l4d2web/services/timeago.py` with:
```python
from datetime import UTC, datetime
_MONTHS = (
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
)
def _ensure_utc(dt: datetime) -> datetime:
if dt.tzinfo is None:
return dt.replace(tzinfo=UTC)
return dt
def _format_date(then: datetime, now: datetime) -> str:
month = _MONTHS[then.month - 1]
if then.year == now.year:
return f"{then.day} {month}"
return f"{then.day} {month} {then.year}"
def _relative_label(seconds: int, past: bool) -> str:
if seconds < 60:
unit, n = "second", seconds
elif seconds < 3600:
unit, n = "minute", seconds // 60
elif seconds < 86400:
unit, n = "hour", seconds // 3600
else:
unit, n = "day", seconds // 86400
plural = "" if n == 1 else "s"
if past:
return f"{n} {unit}{plural} ago"
return f"in {n} {unit}{plural}"
def humanize_delta(then: datetime, now: datetime | None = None) -> str:
if now is None:
now = datetime.now(UTC)
then = _ensure_utc(then)
now = _ensure_utc(now)
delta_seconds = int((now - then).total_seconds())
abs_seconds = abs(delta_seconds)
if abs_seconds == 0:
return "now"
if abs_seconds >= 7 * 86400:
return _format_date(then, now)
return _relative_label(abs_seconds, past=(delta_seconds > 0))
```
- [ ] **Step 4: Run the tests to verify they pass**
Run: `pytest l4d2web/tests/test_timeago.py -v`
Expected: all tests PASS.
- [ ] **Step 5: Run the full test suite to check for regressions in callers of `humanize_delta`**
Run: `pytest l4d2web/tests -q`
Expected: all tests pass. If any pre-existing test asserts on the legacy "just now" / 7-day ISO fallback strings via `latest_job_when` rendering, update those assertions to match the new format (e.g. "1 second ago", "9 May"). Note in commit message which tests were updated and why.
- [ ] **Step 6: Commit**
```bash
git add l4d2web/l4d2web/services/timeago.py l4d2web/tests/test_timeago.py
git commit -m "feat(timeago): symmetric ladder with second precision and date fallback
Rewrite humanize_delta as a symmetric past/future ladder with
sub-minute precision. Replace the bare ISO date fallback after 7 days
with a day-month form (year suppressed when same as now). Refs spec
docs/superpowers/specs/2026-05-16-timeago-shared-display-design.md."
```
---
## Task 2: Add `format_time_html` returning a `<time>` element
**Files:**
- Modify: `l4d2web/l4d2web/services/timeago.py`
- Modify: `l4d2web/tests/test_timeago.py`
- [ ] **Step 1: Append tests for `format_time_html` to the test file**
Append to `l4d2web/tests/test_timeago.py`:
```python
from markupsafe import Markup
from l4d2web.services.timeago import format_time_html
def test_format_time_html_returns_markup():
then = NOW - timedelta(minutes=5)
out = format_time_html(then, now=NOW)
assert isinstance(out, Markup)
def test_format_time_html_contains_time_element_with_attrs():
then = datetime(2026, 5, 16, 14, 32, 11, tzinfo=UTC)
now = then + timedelta(minutes=5)
out = str(format_time_html(then, now=now))
assert out.startswith("<time ")
assert out.endswith("</time>")
assert 'datetime="2026-05-16T14:32:11+00:00"' in out
assert 'title="2026-05-16 14:32:11 UTC"' in out
assert ">5 minutes ago<" in out
def test_format_time_html_label_matches_humanize_delta():
then = NOW - timedelta(hours=2)
label = humanize_delta(then, now=NOW)
out = str(format_time_html(then, now=NOW))
assert f">{label}<" in out
def test_format_time_html_normalises_naive_input_to_utc():
then_naive = datetime(2026, 5, 16, 14, 32, 11)
now = datetime(2026, 5, 16, 14, 37, 11, tzinfo=UTC)
out = str(format_time_html(then_naive, now=now))
assert 'datetime="2026-05-16T14:32:11+00:00"' in out
assert 'title="2026-05-16 14:32:11 UTC"' in out
```
- [ ] **Step 2: Run the new tests to verify they fail**
Run: `pytest l4d2web/tests/test_timeago.py -v -k format_time_html`
Expected: FAIL with `ImportError: cannot import name 'format_time_html'`.
- [ ] **Step 3: Implement `format_time_html` in `timeago.py`**
Append to `l4d2web/l4d2web/services/timeago.py`:
```python
from markupsafe import Markup, escape
def format_time_html(then: datetime, now: datetime | None = None) -> Markup:
if now is None:
now = datetime.now(UTC)
then_utc = _ensure_utc(then).astimezone(UTC)
now = _ensure_utc(now)
label = humanize_delta(then_utc, now=now)
iso = then_utc.isoformat()
title = then_utc.strftime("%Y-%m-%d %H:%M:%S UTC")
return Markup(
f'<time datetime="{escape(iso)}" title="{escape(title)}">'
f"{escape(label)}</time>"
)
```
Note: place the `from markupsafe import Markup, escape` import at the top of the file alongside the existing `from datetime import ...` line — don't leave it inline as written above.
- [ ] **Step 4: Run the tests to verify they pass**
Run: `pytest l4d2web/tests/test_timeago.py -v`
Expected: all tests PASS.
- [ ] **Step 5: Commit**
```bash
git add l4d2web/l4d2web/services/timeago.py l4d2web/tests/test_timeago.py
git commit -m "feat(timeago): add format_time_html returning a <time> element
Wrap humanize_delta in an HTML <time> element with datetime= and
title= attributes carrying the precise UTC value, so hovering surfaces
the exact timestamp regardless of the relative label."
```
---
## Task 3: Register `timeago` Jinja filter in the Flask app factory
**Files:**
- Modify: `l4d2web/l4d2web/app.py:37-58`
- Modify: `l4d2web/tests/test_timeago.py`
- [ ] **Step 1: Add a Flask smoke test for the filter**
There is no shared `app` fixture in this codebase — each test instantiates `create_app` directly (see `l4d2web/tests/test_health.py` for the minimal pattern). Append to `l4d2web/tests/test_timeago.py`:
```python
from flask import render_template_string
from l4d2web.app import create_app
def test_timeago_filter_registered_on_app():
app = create_app({"TESTING": True, "SECRET_KEY": "test"})
with app.app_context():
rendered = render_template_string(
"{{ ts | timeago }}",
ts=datetime.now(UTC) - timedelta(minutes=3),
)
assert "<time " in rendered
assert "&lt;time" not in rendered
assert "3 minutes ago" in rendered
```
- [ ] **Step 2: Run the smoke test to verify it fails**
Run: `pytest l4d2web/tests/test_timeago.py::test_timeago_filter_registered_on_app -v`
Expected: FAIL with a Jinja `TemplateSyntaxError: No filter named 'timeago'` (or similar), confirming the filter is not yet registered.
- [ ] **Step 3: Register the filter in `create_app`**
In `l4d2web/l4d2web/app.py`:
Add the import near the other `from l4d2web...` imports at the top:
```python
from l4d2web.services.timeago import format_time_html
```
Inside `create_app`, register the filter immediately after `init_db()` runs and before the `@app.before_request` definitions. Add a single line:
```python
app.add_template_filter(format_time_html, "timeago")
```
- [ ] **Step 4: Run the smoke test to verify it passes**
Run: `pytest l4d2web/tests/test_timeago.py::test_timeago_filter_registered_on_app -v`
Expected: PASS.
- [ ] **Step 5: Run the full test suite to confirm nothing else broke**
Run: `pytest l4d2web/tests -q`
Expected: all tests pass.
- [ ] **Step 6: Commit**
```bash
git add l4d2web/l4d2web/app.py l4d2web/tests/test_timeago.py
git commit -m "feat(app): register timeago Jinja filter
Templates can now call {{ ts | timeago }} directly without route-side
precomputation."
```
---
## Task 4: Migrate `admin_users.html` and `blueprints.html`
**Files:**
- Modify: `l4d2web/l4d2web/templates/admin_users.html:25-26`
- Modify: `l4d2web/l4d2web/templates/blueprints.html:17-18`
Both templates render `created_at` / `updated_at` as raw Python `datetime` repr. No `None` guard needed — these columns are `nullable=False` in `models.py`.
- [ ] **Step 1: Modify `admin_users.html`**
Replace lines 25-26 of `l4d2web/l4d2web/templates/admin_users.html`:
```jinja
<td>{{ user.created_at }}</td>
<td>{{ user.updated_at }}</td>
```
with:
```jinja
<td>{{ user.created_at | timeago }}</td>
<td>{{ user.updated_at | timeago }}</td>
```
- [ ] **Step 2: Modify `blueprints.html`**
Replace lines 17-18 of `l4d2web/l4d2web/templates/blueprints.html`:
```jinja
<td>{{ blueprint.created_at }}</td>
<td>{{ blueprint.updated_at }}</td>
```
with:
```jinja
<td>{{ blueprint.created_at | timeago }}</td>
<td>{{ blueprint.updated_at | timeago }}</td>
```
- [ ] **Step 3: Run the existing tests for these pages**
Run: `pytest l4d2web/tests/test_admin_users.py l4d2web/tests/test_blueprints.py -q`
Expected: all tests pass. If a test asserts on the raw datetime string in the rendered HTML, update it to assert the presence of `<time ` for the same row instead.
- [ ] **Step 4: Commit**
```bash
git add l4d2web/l4d2web/templates/admin_users.html l4d2web/l4d2web/templates/blueprints.html
git commit -m "refactor(templates): use timeago filter for admin/blueprint timestamps"
```
---
## Task 5: Migrate `_job_table.html` and `job_detail.html` (with `None` guards)
**Files:**
- Modify: `l4d2web/l4d2web/templates/_job_table.html:22-23`
- Modify: `l4d2web/l4d2web/templates/job_detail.html:24-26`
In `models.py`, `Job.started_at` and `Job.finished_at` are nullable; `Job.created_at` is not. Preserve the existing `-` placeholder for nullable columns.
- [ ] **Step 1: Modify `_job_table.html`**
Replace lines 22-23 of `l4d2web/l4d2web/templates/_job_table.html`:
```jinja
<td>{{ job.created_at }}</td>
<td>{{ job.finished_at or "-" }}</td>
```
with:
```jinja
<td>{{ job.created_at | timeago }}</td>
<td>{% if job.finished_at %}{{ job.finished_at | timeago }}{% else %}-{% endif %}</td>
```
- [ ] **Step 2: Modify `job_detail.html`**
Replace lines 24-26 of `l4d2web/l4d2web/templates/job_detail.html`:
```jinja
<tr><th>Created</th><td>{{ job.created_at }}</td></tr>
<tr><th>Started</th><td>{{ job.started_at or "-" }}</td></tr>
<tr><th>Finished</th><td>{{ job.finished_at or "-" }}</td></tr>
```
with:
```jinja
<tr><th>Created</th><td>{{ job.created_at | timeago }}</td></tr>
<tr><th>Started</th><td>{% if job.started_at %}{{ job.started_at | timeago }}{% else %}-{% endif %}</td></tr>
<tr><th>Finished</th><td>{% if job.finished_at %}{{ job.finished_at | timeago }}{% else %}-{% endif %}</td></tr>
```
- [ ] **Step 3: Run the job-related tests**
Run: `pytest l4d2web/tests/test_job_logs.py l4d2web/tests/test_pages.py -q`
Expected: all tests pass. Update assertions that pin raw-datetime substrings to instead assert `<time `; the `-` placeholder for nullable fields must still render in the absence of `started_at` / `finished_at`.
- [ ] **Step 4: Commit**
```bash
git add l4d2web/l4d2web/templates/_job_table.html l4d2web/l4d2web/templates/job_detail.html
git commit -m "refactor(templates): use timeago filter for job timestamps

Preserves the existing '-' placeholder for nullable started_at /
finished_at columns."
```
---
## Task 6: Migrate `_live_state.html`
**Files:**
- Modify: `l4d2web/l4d2web/templates/_live_state.html:9-11, 30-33, 53-56`
Three call sites; all use bespoke `(now - x).total_seconds() // …` math. Replace with the filter. The `now` template variable becomes unused inside this file after the rewrite.
- [ ] **Step 1: Replace the `polled Ns ago` line (lines 9-11)**
In `l4d2web/l4d2web/templates/_live_state.html`, find:
```jinja
<small class="muted">
polled {{ ((now - snapshot.last_seen_at).total_seconds() | int) }}s ago
</small>
```
Replace with:
```jinja
<small class="muted">
polled {{ snapshot.last_seen_at | timeago }}
</small>
```
- [ ] **Step 2: Replace the `joined Nm ago` line (line 31)**
Find:
```jinja
<span class="meta">
joined {{ ((now - session.joined_at).total_seconds() // 60) | int }}m ago
· ping {{ session.min_ping }}-{{ session.max_ping }}ms
</span>
```
Replace with:
```jinja
<span class="meta">
joined {{ session.joined_at | timeago }}
· ping {{ session.min_ping }}-{{ session.max_ping }}ms
</span>
```
- [ ] **Step 3: Replace the `last seen Nm ago` line (line 55)**
Find:
```jinja
<span class="meta">
last seen {{ ((now - row.last_seen).total_seconds() // 60) | int }}m ago
</span>
```
Replace with:
```jinja
<span class="meta">
last seen {{ row.last_seen | timeago }}
</span>
```
- [ ] **Step 4: Run the live-state tests**
Run: `pytest l4d2web/tests/test_servers.py -q`
Expected: tests pass. The two tests `test_servers_index_renders_live_state_badge` and `test_live_state_fragment_renders_current_and_recent` (server_routes.py:449, 513) render this fragment. If they assert on `Nm ago` substrings, replace those assertions with checks for `<time ` or for the new long-form label inside the element (e.g. `5 minutes ago`). Do not assert on `joined 5 minutes ago` as a single substring: the `<time …>` opening tag now sits between `joined` and the label.
- [ ] **Step 5: Commit**
```bash
git add l4d2web/l4d2web/templates/_live_state.html
git commit -m "refactor(templates): use timeago filter in _live_state.html

Replaces three bespoke (now - x).total_seconds() expressions with the
shared filter, unifying vocabulary (no more '0m ago' inside the first
minute) and adding the UTC tooltip."
```
---
## Task 7: Migrate `_server_actions.html` + `_overlay_build_status.html` + `page_routes.py`
**Files:**
- Modify: `l4d2web/l4d2web/routes/page_routes.py:240-305, 442-484`
- Modify: `l4d2web/l4d2web/templates/_server_actions.html:25-32`
- Modify: `l4d2web/l4d2web/templates/_overlay_build_status.html:7-14`
The two route helpers currently precompute a string via `humanize_delta`. Replace that string with the raw `datetime`, passed under a new key, and let the template apply the filter.
- [ ] **Step 1: Update `_build_server_actions_context` in `page_routes.py`**
In `l4d2web/l4d2web/routes/page_routes.py`, replace the block at lines 239-305 (function body of `_build_server_actions_context`) so that:
- Line 240 — remove `from l4d2web.services.timeago import humanize_delta`.
- Line 284 — rename `latest_job_when: str | None = None` to `latest_job_at: datetime | None = None`.
- Line 294 — replace `latest_job_when = humanize_delta(ref_time)` with `latest_job_at = ref_time`.
- Line 303 — update the returned dict key from `"latest_job_when": latest_job_when` to `"latest_job_at": latest_job_at`.
`datetime` is already imported at `page_routes.py:2` (`from datetime import UTC, datetime, timedelta`) — no import change needed.
- [ ] **Step 2: Update `_build_overlay_build_status_context` in `page_routes.py`**
In the same file, replace the block at lines 442-484 (function body of `_build_overlay_build_status_context`) so that:
- Line 443 — remove `from l4d2web.services.timeago import humanize_delta`.
- Line 467 — rename `latest_build_when: str | None = None` to `latest_build_at: datetime | None = None`.
- Line 475 — replace `latest_build_when = humanize_delta(ref_time)` with `latest_build_at = ref_time`.
- Line 481 — update the returned dict key from `"latest_build_when": latest_build_when` to `"latest_build_at": latest_build_at`.
- [ ] **Step 3: Update `_server_actions.html`**
In `l4d2web/l4d2web/templates/_server_actions.html`, line 29, replace:
```jinja
{{ latest_job_when }}
```
with:
```jinja
{{ latest_job_at | timeago }}
```
- [ ] **Step 4: Update `_overlay_build_status.html`**
In `l4d2web/l4d2web/templates/_overlay_build_status.html`, line 11, replace:
```jinja
{{ latest_build_when }}
```
with:
```jinja
{{ latest_build_at | timeago }}
```
- [ ] **Step 5: Run the test suite to catch context-key mismatches**
Run: `pytest l4d2web/tests -q`
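Expected: tests pass. The most likely failure point is tests that check the rendered server actions fragment (`test_servers.py`) or overlay build status fragment. If any test asserts the old `latest_job_when` string output, update it to look for `<time ` or the new long-form output (e.g. `12 minutes ago`). Also check that neither template still references `latest_job_when` / `latest_build_when` elsewhere (for example in an enclosing `{% if %}` guard): with Jinja's default undefined handling, a stale name renders as empty instead of raising, so the timestamp would silently disappear.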
- [ ] **Step 6: Commit**
```bash
git add l4d2web/l4d2web/routes/page_routes.py l4d2web/l4d2web/templates/_server_actions.html l4d2web/l4d2web/templates/_overlay_build_status.html
git commit -m "refactor(page_routes): pass datetime to templates for timeago filter

Drop the inline humanize_delta imports and string-precomputation; pass
the raw datetime as latest_job_at / latest_build_at and let the
template apply the timeago filter. One fewer code path computing
relative-time strings."
```
---
## Task 8: Drop the dead `now=` kwarg from `_live_state.html` render call
**Files:**
- Modify: `l4d2web/l4d2web/routes/server_routes.py:266-275`
After Task 6, `_live_state.html` no longer reads `now`. Remove the kwarg from the only `render_template` call that passes it.
- [ ] **Step 1: Confirm no other template uses the `now` context variable**
Run: `grep -rn "\\bnow\\b" l4d2web/l4d2web/templates/`
Inspect the output. Expected: no remaining `(now - …)` or bare `{{ now }}` references in any template. Any hits that remain should be incidental uses of the word `now` in text, not reads of the context variable.
- [ ] **Step 2: Remove the `now=` kwarg**
In `l4d2web/l4d2web/routes/server_routes.py`, at line 273 inside the `render_template("_live_state.html", …)` call, remove the line:
```python
now=datetime.now(UTC).replace(tzinfo=None),
```
- [ ] **Step 3: Check whether `datetime` and `UTC` are still used in the file**
If lines 210 and 234 still reference `datetime.now(UTC).replace(tzinfo=None)` (for the `cutoff` and `recent_cutoff` variables), the imports stay. Don't remove them speculatively.
- [ ] **Step 4: Run the test suite**
Run: `pytest l4d2web/tests -q`
Expected: all tests pass. If a test passes a fake `now` into the live-state context expecting it to be respected, that test relied on dead code and should be updated to assert against `<time ` output relative to a real `datetime.now(UTC)` reference.
- [ ] **Step 5: Commit**
```bash
git add l4d2web/l4d2web/routes/server_routes.py
git commit -m "refactor(server_routes): drop unused 'now' kwarg from _live_state render

After the timeago migration, the live-state template no longer reads
'now'; it computes relative labels through the filter, which derives
its own reference time."
```
---
## Task 9: End-to-end verification
**Files:** none — verification only.
- [ ] **Step 1: Run the entire test suite**
Run: `pytest l4d2web/tests -q`
Expected: all tests pass.
- [ ] **Step 2: Run ruff if it's part of the project's check workflow**
Run: `ruff check l4d2web/`
Expected: no new violations. The `.ruff_cache/` directory at the project root suggests ruff is in active use.
- [ ] **Step 3: Confirm no remaining raw-datetime renders or bespoke inline-time math**
Run: `grep -rn -E "\\{\\{ [a-z_.]+\\.(created_at|updated_at|started_at|finished_at|joined_at|last_seen|last_seen_at)" l4d2web/l4d2web/templates/`
Expected: every match is followed by `| timeago` or `| timeago }}{% else %}…{% endif %}`. No bare `{{ x.created_at }}` should remain.
Run: `grep -rn "(now -" l4d2web/l4d2web/templates/`
Expected: no matches.
- [ ] **Step 4: Manual UI smoke (developer-side, optional but recommended)**
Start the dev server (see `README.md` for the exact command) and log in:
- Visit `/admin/users`: the `Created` / `Updated` columns render `<time>` elements; hovering shows UTC.
- Visit `/blueprints` — same.
- Visit `/jobs` and a single job detail — `Created` / `Started` / `Finished` use the filter; null `Finished` shows `-`.
- Open a server with live state — `polled N seconds ago`, `joined N minutes ago`, `last seen N minutes ago`; check that page-source shows `<time` markup, not literal `&lt;time&gt;`.
- [ ] **Step 5: No-op commit not required — work is already committed across Tasks 1-8.**
End of plan.

@ -1,725 +0,0 @@
# Console Command Autocomplete Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Add command/cvar autocomplete to the runtime console input on `server_detail.html`, sharing the editor's ranking algorithm via a pure-JS module compiled to a tiny additional bundle, with a vanilla dropdown that does not collide with the existing ArrowUp/Down history recall.
**Architecture:** Extract the editor's inlined ranking logic into a pure ES module `editor-src/vocab-rank.js`. The editor imports it directly; for the console, a second esbuild entry point bundles it into a small `static/vendor/vocab-rank.bundle.js` that exposes `window.__rankVocab`. A new `static/js/console-autocomplete.js` builds a vanilla dropdown (positioned absolutely under the console input), lazy-fetches `srccfg-vocab.json` on first focus, hides the dropdown once the user types past the first token, and binds Tab/Shift+Tab/Esc only — leaving ArrowUp/Down/Enter untouched for `console-history.js`.
**Tech Stack:** Vanilla JS (no framework), esbuild (IIFE bundles), CodeMirror 6 (editor-side only — console is plain `<input>`), HTMX (existing — for form submission and dynamic page-fragment swap), CSS variables defined in `tokens.css`/`editor.css`. Tests use Node's built-in `node:test` runner (no extra deps).
**Reference Spec:** `docs/superpowers/specs/2026-05-17-console-command-autocomplete-design.md`
---
## File Structure
**New files:**
- `l4d2web/scripts/editor-src/vocab-rank.js` — pure ranking module (ES, exports `rankVocab`)
- `l4d2web/scripts/editor-src/vocab-rank-entry.js` — IIFE entry that assigns `rankVocab` to `window.__rankVocab`
- `l4d2web/scripts/editor-src/vocab-rank.test.js` — Node `node:test` unit tests for the ranker
- `l4d2web/l4d2web/static/js/console-autocomplete.js` — vanilla dropdown, lazy fetch, key handling
- `l4d2web/l4d2web/static/css/console-autocomplete.css` — dropdown styling using existing CSS tokens
**Modified files:**
- `l4d2web/scripts/editor-src/autocomplete.js` — replace inlined `rank()` + scoring with `import { rankVocab } from "./vocab-rank.js"`
- `l4d2web/scripts/editor-src/package.json` — add `build:vocab-rank` script; chain into `build`
- `l4d2web/l4d2web/templates/base.html` — add `<script defer>` for `vocab-rank.bundle.js` and `console-autocomplete.js`; add `<link>` for `console-autocomplete.css`
**Build artifacts (regenerated, do not hand-edit):**
- `l4d2web/l4d2web/static/vendor/editor.bundle.js` — rebuilt because `autocomplete.js` changed
- `l4d2web/l4d2web/static/vendor/vocab-rank.bundle.js` — new tiny bundle
---
## Task 1: Extract `rankVocab` into a pure module (TDD)
**Goal:** Move the editor's inlined ranking logic into a standalone, testable, dependency-free function.
**Files:**
- Create: `l4d2web/scripts/editor-src/vocab-rank.js`
- Create: `l4d2web/scripts/editor-src/vocab-rank.test.js`
- [ ] **Step 1: Write the failing test file**
Create `l4d2web/scripts/editor-src/vocab-rank.test.js`:
```javascript
import { test } from "node:test";
import assert from "node:assert/strict";
import { rankVocab } from "./vocab-rank.js";
const vocab = {
cvars: [
{ name: "sv_cheats", desc: "Allow cheats" },
{ name: "sv_gravity" },
{ name: "mp_friendlyfire", desc: "Toggle FF" },
],
commands: [
{ name: "kick", desc: "Kick a player" },
{ name: "kickall", desc: "Kick everyone" },
{ name: "changelevel", desc: "Change map" },
],
};
test("exact match comes first", () => {
const out = rankVocab("kick", vocab);
assert.equal(out[0].name, "kick");
assert.equal(out[1].name, "kickall");
});
test("prefix matches beat substring matches", () => {
const out = rankVocab("sv_", vocab);
assert.equal(out[0].name, "sv_cheats");
assert.equal(out[1].name, "sv_gravity");
// mp_friendlyfire contains no "sv_" → should not appear
assert.ok(!out.some(e => e.name === "mp_friendlyfire"));
});
test("substring matches included after prefix matches", () => {
// "iendly" is a substring of mp_friendlyfire but a prefix of nothing
const out = rankVocab("iendly", vocab);
assert.equal(out.length, 1);
assert.equal(out[0].name, "mp_friendlyfire");
});
test("kind is preserved on each result", () => {
const out = rankVocab("kick", vocab);
assert.equal(out[0].kind, "command");
const sv = rankVocab("sv_cheats", vocab);
assert.equal(sv[0].kind, "cvar");
});
test("desc is preserved when present", () => {
const out = rankVocab("kick", vocab);
assert.equal(out[0].desc, "Kick a player");
});
test("desc is undefined when source had no desc", () => {
const out = rankVocab("sv_gravity", vocab);
assert.equal(out[0].desc, undefined);
});
test("results are capped at the configured limit", () => {
const big = { cvars: [], commands: [] };
for (let i = 0; i < 200; i++) big.commands.push({ name: `cmd${i}` });
const out = rankVocab("cmd", big, { limit: 50 });
assert.equal(out.length, 50);
});
test("default limit is 50", () => {
const big = { cvars: [], commands: [] };
for (let i = 0; i < 200; i++) big.commands.push({ name: `cmd${i}` });
const out = rankVocab("cmd", big);
assert.equal(out.length, 50);
});
test("empty query returns no results", () => {
const out = rankVocab("", vocab);
assert.equal(out.length, 0);
});
test("case-insensitive match", () => {
const out = rankVocab("KICK", vocab);
assert.equal(out[0].name, "kick");
});
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
cd l4d2web/scripts/editor-src && node --test vocab-rank.test.js
```
Expected: FAIL with `Cannot find module './vocab-rank.js'` or `ERR_MODULE_NOT_FOUND`.
- [ ] **Step 3: Create the ranker module**
Create `l4d2web/scripts/editor-src/vocab-rank.js`:
```javascript
// Pure, dependency-free ranking of a vocabulary against a query string.
// Used by both the CodeMirror editor (via autocomplete.js) and the
// runtime console (via the vocab-rank bundle exposed on window).
//
// Score (lower = better):
//   exact match     → 0
//   prefix match    → 1 + label.length  (shorter prefix matches win)
//   substring match → 10000 + indexOf   (earlier substring beats later)
//   no match        → -1                (excluded)
function score(query, label) {
  if (label === query) return 0;
  if (label.startsWith(query)) return 1 + label.length;
  const i = label.indexOf(query);
  if (i !== -1) return 10000 + i;
  return -1;
}

export function rankVocab(query, vocab, { limit = 50 } = {}) {
  if (!query) return [];
  const q = query.toLowerCase();
  // Tag each entry with its kind; cvars are scanned before commands.
  const entries = [
    ...vocab.cvars.map(e => ({ ...e, kind: "cvar" })),
    ...vocab.commands.map(e => ({ ...e, kind: "command" })),
  ];
  const scored = [];
  for (const e of entries) {
    const s = score(q, e.name.toLowerCase());
    if (s === -1) continue;
    scored.push([s, e]);
    // Early exit bounds the work on large vocabularies. Trade-off: entries
    // later in the scan are never scored, even if they would rank higher.
    if (scored.length > limit * 4) break;
  }
  // Array.prototype.sort is stable, so equal scores keep scan order.
  scored.sort((a, b) => a[0] - b[0]);
  return scored.slice(0, limit).map(([, e]) => e);
}
```
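As a rough illustration of the scoring table in the module header, the sketch below repeats the `score` function so it runs on its own and applies it to a handful of labels. The labels `sm_kick` and `kickid` are made up for the example and are not taken from the real vocabulary. It only shows how the three tiers order results; it is not one of the plan's files.

```javascript
// Same scoring rule as vocab-rank.js, repeated so this snippet runs standalone.
function score(query, label) {
  if (label === query) return 0;
  if (label.startsWith(query)) return 1 + label.length;
  const i = label.indexOf(query);
  if (i !== -1) return 10000 + i;
  return -1;
}

// Hypothetical labels chosen to hit every branch for the query "kick".
const labels = ["sm_kick", "kickall", "changelevel", "kick", "kickid"];

const scoredLabels = labels
  .map((label) => ({ label, score: score("kick", label) }))
  .filter((entry) => entry.score !== -1) // drop non-matches
  .sort((a, b) => a.score - b.score);    // lower score ranks first

console.log(scoredLabels);
```

With these inputs the exact match `kick` sorts first, the two prefix matches follow in order of length (`kickid`, then `kickall`), and the substring match `sm_kick` comes last; `changelevel` is dropped.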
- [ ] **Step 4: Run tests to verify they pass**
```bash
cd l4d2web/scripts/editor-src && node --test vocab-rank.test.js
```
Expected: PASS — 10 tests passing.
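One property of the loop as written is worth keeping in mind: the scan stops once more than `limit * 4` matches have been collected, and cvars are scanned before commands, so a better-scoring command can be skipped when many cvars match first. The sketch below copies the ranking logic (minus the `export`) and feeds it a deliberately crowded, hypothetical vocabulary. Whether the real vocabulary is large enough for this to matter is not something the plan states.

```javascript
// Copy of the ranking logic from vocab-rank.js, without the ES module export,
// so this snippet runs as a plain script.
function score(query, label) {
  if (label === query) return 0;
  if (label.startsWith(query)) return 1 + label.length;
  const i = label.indexOf(query);
  if (i !== -1) return 10000 + i;
  return -1;
}

function rankVocab(query, vocab, { limit = 50 } = {}) {
  if (!query) return [];
  const q = query.toLowerCase();
  const entries = [
    ...vocab.cvars.map((e) => ({ ...e, kind: "cvar" })),
    ...vocab.commands.map((e) => ({ ...e, kind: "command" })),
  ];
  const scored = [];
  for (const e of entries) {
    const s = score(q, e.name.toLowerCase());
    if (s === -1) continue;
    scored.push([s, e]);
    if (scored.length > limit * 4) break; // early exit after 201 matches by default
  }
  scored.sort((a, b) => a[0] - b[0]);
  return scored.slice(0, limit).map(([, e]) => e);
}

// Hypothetical data: 300 cvars that merely contain "k", then one exact command.
const crowded = { cvars: [], commands: [{ name: "k" }] };
for (let i = 0; i < 300; i++) crowded.cvars.push({ name: `sk_${i}` });

const results = rankVocab("k", crowded);
// The exact match "k" is never reached because the scan stopped among the cvars.
console.log(results.length, results.some((e) => e.name === "k"));
```

If this turns out to matter in practice, one option is to score every entry and rely on the final `slice` alone; that changes behavior, so it would belong in a separate change rather than in this extraction.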
- [ ] **Step 5: Commit**
```bash
git add l4d2web/scripts/editor-src/vocab-rank.js \
l4d2web/scripts/editor-src/vocab-rank.test.js
git commit -m "feat(editor): extract pure rankVocab module + tests"
```
---
## Task 2: Refactor `autocomplete.js` to use the shared ranker
**Goal:** Replace the inlined `rank()` and scoring loop in `autocomplete.js` with a call to `rankVocab`, with no behavior change.
**Files:**
- Modify: `l4d2web/scripts/editor-src/autocomplete.js`
- [ ] **Step 1: Rewrite `autocomplete.js`**
Replace the entire file contents with:
```javascript
import { autocompletion } from "@codemirror/autocomplete";
import { rankVocab } from "./vocab-rank.js";

const WORD_RE = /[A-Za-z0-9_]{2,}/;

export function vocabCompletions(vocab) {
  // vocab: { cvars: [{name, desc?}, …], commands: [{name, desc?}, …] }
  return (context) => {
    const word = context.matchBefore(WORD_RE);
    if (!word || (word.from === word.to && !context.explicit)) return null;
    const ranked = rankVocab(word.text, vocab);
    const options = ranked.map(e => ({
      label: e.name,
      info: e.desc || e.kind,
      type: e.kind === "command" ? "function" : "variable",
    }));
    return { from: word.from, options, validFor: WORD_RE };
  };
}

export function autocompleteExtension(vocab) {
  return autocompletion({
    override: [vocabCompletions(vocab)],
    activateOnTyping: true,
    maxRenderedOptions: 8,
  });
}
```
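The only translation `vocabCompletions` performs is from ranked vocabulary entries to completion options. A minimal standalone sketch of that mapping, with no CodeMirror dependency, might look like this; `toCompletionOption` is a hypothetical name introduced here, and the sample entries are illustrative.

```javascript
// Standalone copy of the mapping used inside vocabCompletions.
function toCompletionOption(entry) {
  return {
    label: entry.name,
    // Fall back to the kind ("cvar" / "command") when no description exists.
    info: entry.desc || entry.kind,
    // Commands use the "function" completion type, cvars use "variable".
    type: entry.kind === "command" ? "function" : "variable",
  };
}

// Entries shaped like rankVocab output (name, optional desc, kind).
const rankedSample = [
  { name: "kick", desc: "Kick a player", kind: "command" },
  { name: "sv_gravity", kind: "cvar" },
];

console.log(rankedSample.map(toCompletionOption));
```

The second entry has no `desc`, so its `info` falls back to the string `cvar`.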
- [ ] **Step 2: Rebuild the editor bundle**
```bash
cd l4d2web/scripts/editor-src && npm run build
```
Expected: `editor.bundle.js` regenerated in `l4d2web/l4d2web/static/vendor/`. No esbuild warnings or errors.
- [ ] **Step 3: Manually verify editor autocomplete still works (regression check)**
```bash
cd l4d2web && python ../scripts/dev-server.py
```
(Note: the dev server is `scripts/dev-server.py` at the repo root, not `flask run`.) Then in a browser:
1. Open a server-detail page with a config file editor visible, or navigate to any `.cfg` file edit view.
2. In the editor, type `sv_` — autocomplete dropdown appears with cvars (e.g. `sv_cheats`, `sv_gravity`).
3. Type `sv_cheats` exactly — `sv_cheats` is first in the list.
4. Press Tab — completion is accepted.
Stop the dev server (Ctrl+C).
- [ ] **Step 4: Commit**
```bash
git add l4d2web/scripts/editor-src/autocomplete.js \
l4d2web/l4d2web/static/vendor/editor.bundle.js
git commit -m "refactor(editor): use shared rankVocab in autocomplete"
```
---
## Task 3: Build a standalone ranker bundle for the console
**Goal:** Produce `vocab-rank.bundle.js` — a tiny IIFE that exposes `window.__rankVocab` — so the non-bundled console-autocomplete.js can call the same ranker.
**Files:**
- Create: `l4d2web/scripts/editor-src/vocab-rank-entry.js`
- Modify: `l4d2web/scripts/editor-src/package.json`
- [ ] **Step 1: Create the IIFE entry point**
Create `l4d2web/scripts/editor-src/vocab-rank-entry.js`:
```javascript
import { rankVocab } from "./vocab-rank.js";
// Expose as a global function so plain (non-module) scripts on
// server_detail.html can call window.__rankVocab(query, vocab).
window.__rankVocab = rankVocab;
```
- [ ] **Step 2: Add a build script for it in `package.json`**
Open `l4d2web/scripts/editor-src/package.json` and replace the `"scripts"` block with:
```json
"scripts": {
"build:editor": "esbuild editor-entry.js --bundle --minify --format=iife --global-name=__editor_pkg --outfile=../../l4d2web/static/vendor/editor.bundle.js --metafile=meta.json",
"build:vocab-rank": "esbuild vocab-rank-entry.js --bundle --minify --format=iife --outfile=../../l4d2web/static/vendor/vocab-rank.bundle.js",
"build": "npm run build:editor && npm run build:vocab-rank"
},
```
- [ ] **Step 3: Run the build**
```bash
cd l4d2web/scripts/editor-src && npm run build
```
Expected: two output files updated/created. Verify from the repository root (the build command above changed directory) with:
```bash
ls -la l4d2web/l4d2web/static/vendor/editor.bundle.js l4d2web/l4d2web/static/vendor/vocab-rank.bundle.js
```
Expected: `vocab-rank.bundle.js` exists (should be ~1-3 KB).
- [ ] **Step 4: Smoke-test the bundle from Node**
Quickly check that the bundle is well-formed (no syntax errors), again from the repository root:
```bash
node -e 'const fs = require("fs"); const code = fs.readFileSync("l4d2web/l4d2web/static/vendor/vocab-rank.bundle.js", "utf8"); new Function("window", code)({}); console.log("ok");'
```
Expected: prints `ok` (means the IIFE parsed and ran).
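The one-liner above confirms that the bundle parses and runs, but not that it actually assigned the global. A slightly stricter check could evaluate the bundle source against a stand-in `window` and assert on `__rankVocab`. The sketch below shows the idea; `loadRankerFromBundle` is a hypothetical helper, and the stub string stands in for the real file contents, which would normally be read with `fs.readFileSync`.

```javascript
// Evaluates bundle source against a stand-in `window` and returns the ranker
// it exposed, or throws if the global was never assigned.
function loadRankerFromBundle(bundleSource) {
  const fakeWindow = {};
  new Function("window", bundleSource)(fakeWindow);
  if (typeof fakeWindow.__rankVocab !== "function") {
    throw new Error("bundle did not assign window.__rankVocab");
  }
  return fakeWindow.__rankVocab;
}

// Stand-in for the contents of vocab-rank.bundle.js (assumption: the real
// bundle is an IIFE that assigns window.__rankVocab, as the entry file does).
const stubBundle = "(() => { window.__rankVocab = (query, vocab) => []; })();";

const ranker = loadRankerFromBundle(stubBundle);
console.log(typeof ranker);
```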
- [ ] **Step 5: Commit**
```bash
git add l4d2web/scripts/editor-src/vocab-rank-entry.js \
l4d2web/scripts/editor-src/package.json \
l4d2web/l4d2web/static/vendor/vocab-rank.bundle.js
git commit -m "feat(editor): build standalone vocab-rank bundle for console"
```
---
## Task 4: Build the console-autocomplete module
**Goal:** Create the vanilla-JS module that renders the dropdown, handles keyboard interaction, and binds to console forms (including HTMX-injected ones).
**Files:**
- Create: `l4d2web/l4d2web/static/js/console-autocomplete.js`
- [ ] **Step 1: Write the module**
Create `l4d2web/l4d2web/static/js/console-autocomplete.js`:
```javascript
// console-autocomplete.js
// Vanilla dropdown autocomplete for [data-console-form] inputs.
// Reads ranked completions from window.__rankVocab (loaded via
// vocab-rank.bundle.js). Owns: Tab, Shift+Tab, Esc, mouse events.
// Leaves: ArrowUp, ArrowDown, Enter (console-history.js owns those).
//
// First-token only: the dropdown is hidden as soon as the cursor
// is past the first space in the input.
const VOCAB_URL = "/static/data/srccfg-vocab.json";
const MAX_RENDERED = 8;
let vocabPromise = null;
function loadVocab() {
if (vocabPromise) return vocabPromise;
vocabPromise = fetch(VOCAB_URL, { credentials: "same-origin" })
.then(r => r.ok ? r.json() : Promise.reject(new Error("vocab fetch failed: " + r.status)))
.catch(err => { console.warn("[console-autocomplete] vocab load failed", err); return null; });
return vocabPromise;
}
function firstTokenSlice(value, caret) {
// Returns { token, from, to } for the first token when the caret is at or
// before the first space (or no space has been typed); otherwise null.
const spaceIdx = value.indexOf(" ");
if (spaceIdx === -1) {
return { token: value, from: 0, to: value.length };
}
if (caret > spaceIdx) return null;
return { token: value.slice(0, spaceIdx), from: 0, to: spaceIdx };
}
function bindConsoleAutocomplete(form) {
if (form.dataset.consoleAutocompleteBound === "true") return;
form.dataset.consoleAutocompleteBound = "true";
const input = form.querySelector("input[name='command']");
if (!input) return;
// --- Dropdown DOM (created lazily on first show) ---
let dropdown = null;
let items = []; // current ranked items
let highlightIdx = 0; // index of currently-highlighted row
let vocab = null;
function ensureDropdown() {
if (dropdown) return dropdown;
dropdown = document.createElement("div");
dropdown.className = "console-autocomplete-dropdown";
dropdown.setAttribute("role", "listbox");
dropdown.style.display = "none";
document.body.appendChild(dropdown);
return dropdown;
}
function position() {
if (!dropdown) return;
const rect = input.getBoundingClientRect();
dropdown.style.left = `${rect.left + window.scrollX}px`;
dropdown.style.top = `${rect.bottom + window.scrollY}px`;
dropdown.style.minWidth = `${rect.width}px`;
}
function close() {
if (!dropdown) return;
dropdown.style.display = "none";
items = [];
highlightIdx = 0;
}
function render() {
ensureDropdown();
if (items.length === 0) { close(); return; }
const rows = items.slice(0, MAX_RENDERED).map((e, i) => {
const selected = i === highlightIdx ? " aria-selected='true'" : "";
const kindClass = e.kind === "command" ? "kind-command" : "kind-cvar";
const desc = e.desc ? `<span class="console-autocomplete-desc">${escapeHtml(e.desc)}</span>` : "";
return `<div class="console-autocomplete-row ${kindClass}"${selected} role="option" data-idx="${i}"><span class="console-autocomplete-name">${escapeHtml(e.name)}</span>${desc}</div>`;
}).join("");
dropdown.innerHTML = rows;
dropdown.style.display = "block";
position();
}
function escapeHtml(s) {
return String(s).replace(/[&<>"']/g, c => ({
"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&#39;",
}[c]));
}
function acceptHighlighted() {
if (items.length === 0) return;
const chosen = items[highlightIdx];
const slice = firstTokenSlice(input.value, input.selectionStart || 0);
if (!slice) return;
const before = input.value.slice(0, slice.from);
const after = input.value.slice(slice.to);
input.value = before + chosen.name + after;
// Place caret at end of inserted name
const caret = before.length + chosen.name.length;
input.setSelectionRange(caret, caret);
recompute();
}
function recompute() {
if (!vocab) return;
const slice = firstTokenSlice(input.value, input.selectionStart || 0);
if (!slice || !slice.token) { close(); return; }
items = window.__rankVocab(slice.token, vocab);
if (items.length === 0) { close(); return; }
highlightIdx = 0;
render();
}
// --- Lazy vocab fetch on first focus ---
input.addEventListener("focus", async () => {
if (!vocab) {
vocab = await loadVocab();
}
}, { once: true });
input.addEventListener("input", () => {
if (!vocab) return; // fetch may not have resolved yet; next input will recompute
recompute();
});
input.addEventListener("keydown", (event) => {
if (event.key === "Tab" && !event.shiftKey) {
if (items.length > 0) {
event.preventDefault();
acceptHighlighted();
}
} else if (event.key === "Tab" && event.shiftKey) {
if (items.length > 0) {
event.preventDefault();
highlightIdx = (highlightIdx - 1 + Math.min(items.length, MAX_RENDERED))
% Math.min(items.length, MAX_RENDERED);
render();
}
} else if (event.key === "Escape") {
if (dropdown && dropdown.style.display !== "none") {
event.preventDefault();
close();
}
}
// ArrowUp/ArrowDown/Enter intentionally NOT handled here.
});
input.addEventListener("blur", () => {
// Delay close so a click on a dropdown row can fire first.
setTimeout(close, 100);
});
// Mouse click on a row → accept that row.
document.addEventListener("mousedown", (event) => {
if (!dropdown || dropdown.style.display === "none") return;
const row = event.target.closest(".console-autocomplete-row");
if (!row || !dropdown.contains(row)) return;
event.preventDefault();
highlightIdx = parseInt(row.dataset.idx, 10) || 0;
acceptHighlighted();
input.focus();
});
// HTMX form submission clears the input; close on submit.
form.addEventListener("htmx:beforeRequest", close);
// Reposition on resize/scroll while dropdown is open.
window.addEventListener("resize", () => { if (dropdown && dropdown.style.display !== "none") position(); });
window.addEventListener("scroll", () => { if (dropdown && dropdown.style.display !== "none") position(); }, true);
}
function bindAll(root) {
if (!root) return;
const scope = root.matches && root.matches("[data-console-form]") ? [root] : [];
if (root.querySelectorAll) {
root.querySelectorAll("[data-console-form]").forEach((el) => scope.push(el));
}
scope.forEach(bindConsoleAutocomplete);
}
document.addEventListener("DOMContentLoaded", () => bindAll(document));
document.addEventListener("htmx:load", (event) => bindAll(event.detail.elt));
```
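The first-token rule is carried entirely by `firstTokenSlice`, so its edge cases are worth seeing in isolation. The sketch below copies the function so it runs standalone and prints the result for a few input/caret pairs; the command strings are only examples.

```javascript
// Copy of firstTokenSlice from console-autocomplete.js so the snippet runs standalone.
function firstTokenSlice(value, caret) {
  const spaceIdx = value.indexOf(" ");
  if (spaceIdx === -1) {
    return { token: value, from: 0, to: value.length };
  }
  if (caret > spaceIdx) return null;
  return { token: value.slice(0, spaceIdx), from: 0, to: spaceIdx };
}

// [input value, caret position] pairs.
const cases = [
  ["sv_ch", 5],    // still typing the first token, no space yet
  ["kick Bob", 2], // caret moved back inside the first token
  ["kick Bob", 4], // caret directly before the space
  ["kick Bob", 6], // caret inside the argument
];

for (const [value, caret] of cases) {
  console.log(JSON.stringify(value), caret, firstTokenSlice(value, caret));
}
```

Only the last case returns `null`, which is what hides the dropdown once the user is editing arguments. Note that an input with a leading space has `spaceIdx === 0`, so any caret position past the start also returns `null`.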
- [ ] **Step 2: Commit (no template/CSS wire-up yet — module is not yet loaded)**
```bash
git add l4d2web/l4d2web/static/js/console-autocomplete.js
git commit -m "feat(console): add vanilla autocomplete dropdown module"
```
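The lazy fetch in `loadVocab` works by caching the promise itself rather than the parsed result, so every console form on the page shares a single request. A minimal sketch of that pattern, with the network call replaced by an injected function, is shown here; `makeCachedLoader` and `fakeFetchJson` are hypothetical names used only for illustration.

```javascript
// Wraps an async producer so it is invoked at most once; all callers share
// the same promise.
function makeCachedLoader(fetchJson) {
  let pending = null; // the in-flight or settled promise
  return function load() {
    if (!pending) {
      // Failures resolve to null, mirroring loadVocab, so callers never
      // have to handle a rejection.
      pending = fetchJson().catch(() => null);
    }
    return pending;
  };
}

// Stand-in for fetch(VOCAB_URL).then(r => r.json()); counts invocations.
let fetchCalls = 0;
function fakeFetchJson() {
  fetchCalls += 1;
  return Promise.resolve({ cvars: [], commands: [] });
}

const loadOnce = makeCachedLoader(fakeFetchJson);
const first = loadOnce();
const second = loadOnce();
console.log(first === second, fetchCalls);
```

One consequence, shared with `loadVocab` as written: a failed fetch is cached too, so the vocabulary is not retried until the page is reloaded.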
---
## Task 5: Add dropdown stylesheet
**Goal:** Provide minimal CSS so the dropdown is positioned, themed via existing CSS tokens, and visually consistent with the editor's autocomplete popup.
**Files:**
- Create: `l4d2web/l4d2web/static/css/console-autocomplete.css`
- [ ] **Step 1: Write the stylesheet**
Create `l4d2web/l4d2web/static/css/console-autocomplete.css`:
```css
/* Console autocomplete dropdown.
Positioned absolutely under the console input by JS; visuals match
the editor's tooltip styling (var(--cm-*) tokens defined in
tokens.css and editor.css). */
.console-autocomplete-dropdown {
position: absolute;
z-index: 1000;
max-height: calc(8 * 2.4rem);
overflow-y: auto;
background-color: var(--cm-bg, #1e1e1e);
color: var(--cm-fg, #e0e0e0);
border: 1px solid var(--border-strong, #444);
border-radius: 4px;
font-family: var(--font-mono, ui-monospace, SFMono-Regular, Menlo, monospace);
font-size: 13px;
box-shadow: 0 4px 12px rgba(0, 0, 0, 0.3);
}
.console-autocomplete-row {
display: flex;
align-items: baseline;
gap: 0.75em;
padding: 0.3em 0.6em;
cursor: pointer;
white-space: nowrap;
}
.console-autocomplete-row[aria-selected="true"] {
background-color: var(--cm-selection, #264f78);
}
.console-autocomplete-row:hover {
background-color: var(--cm-selection, #264f78);
}
.console-autocomplete-name {
font-weight: 600;
}
.console-autocomplete-row.kind-cvar .console-autocomplete-name {
color: var(--cm-keyword, #569cd6);
}
.console-autocomplete-row.kind-command .console-autocomplete-name {
color: var(--cm-string, #ce9178);
}
.console-autocomplete-desc {
color: var(--fg-muted, #888);
font-size: 0.9em;
overflow: hidden;
text-overflow: ellipsis;
max-width: 40em;
}
```
- [ ] **Step 2: Commit**
```bash
git add l4d2web/l4d2web/static/css/console-autocomplete.css
git commit -m "feat(console): add autocomplete dropdown stylesheet"
```
---
## Task 6: Wire up in `base.html`
**Goal:** Load the ranker bundle, the console-autocomplete script, and the stylesheet — placed alongside the existing `console-history.js` tag so loading order matches.
**Files:**
- Modify: `l4d2web/l4d2web/templates/base.html`
- [ ] **Step 1: Read the current head/body script section**
Open `l4d2web/l4d2web/templates/base.html` and find the line that currently loads `console-history.js`:
```html
<script defer src="{{ url_for('static', filename='js/console-history.js') }}"></script>
```
- [ ] **Step 2: Add the new tags directly after it**
Add immediately after the `console-history.js` script tag:
```html
<script defer src="{{ url_for('static', filename='vendor/vocab-rank.bundle.js') }}"></script>
<script defer src="{{ url_for('static', filename='js/console-autocomplete.js') }}"></script>
```
And add to the `<head>` section (alongside other `<link rel="stylesheet">` tags — search for existing ones in `base.html`):
```html
<link rel="stylesheet" href="{{ url_for('static', filename='css/console-autocomplete.css') }}">
```
- [ ] **Step 3: Sanity-check the template renders without syntax errors**
```bash
cd l4d2web && python -c "from l4d2web.app import create_app; create_app(); print('ok')"
```
Expected: prints `ok` (Flask app boots; templates are valid Jinja).
- [ ] **Step 4: Commit**
```bash
git add l4d2web/l4d2web/templates/base.html
git commit -m "feat(console): wire up autocomplete bundle + stylesheet in base.html"
```
---
## Task 7: End-to-end smoke test
**Goal:** Verify the full feature works in the browser against the dev server.
- [ ] **Step 1: Start the dev server**
```bash
cd l4d2web && python ../scripts/dev-server.py
```
Expected: server starts on `http://localhost:5000` (or whatever address the script reports). The script sets `LEFT4ME_ROOT` to `.tmp/dev-server` and seeds it with demo content.
- [ ] **Step 2: Run through the smoke-test checklist in a browser**
Open a server-detail page (one of the demo servers seeded by the dev server). Then verify each:
1. **Vocab fetch is lazy.** Open DevTools → Network → filter `srccfg-vocab`. Reload page. **Expected:** no request yet.
2. **Click into the console input.** **Expected:** one `srccfg-vocab.json` request fires.
3. **Type `sv_`.** **Expected:** dropdown appears showing cvars starting with `sv_`. Top row highlighted.
4. **Press Tab.** **Expected:** first token replaced with the highlighted suggestion (e.g. `sv_cheats`). Dropdown updates with matches for the new query.
5. **Press Shift+Tab.** **Expected:** highlight moves up; or wraps to bottom if at top.
6. **Press Esc.** **Expected:** dropdown closes. Input value unchanged.
7. **Type a space then `god`.** **Expected:** dropdown stays hidden (we're past the first token).
8. **Press ArrowUp.** **Expected:** history recall works — input is replaced with a previously submitted command. No interference from autocomplete.
9. **Clear the input. Type `sv_che`.** Verify `sv_cheats` is highlighted in the dropdown. **Press Enter.** **Expected:** the server console receives `sv_che` (the typed text), not `sv_cheats`. Confirm in the console transcript.
10. **Refocus the input.** **Expected:** no second `srccfg-vocab.json` request (cached in module-scope promise).
11. **Click on a dropdown row with the mouse.** **Expected:** that row's command is inserted into the input.
12. **Editor regression check.** Navigate to a `.cfg` file in the editor (files view). Type `sv_`. **Expected:** editor's autocomplete still works exactly as before.
If all 12 pass, the feature is complete.
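The first-token rule behind checklist items 3, 4, and 7 can be modelled in a few lines. The sketch below is Python rather than the browser JavaScript, and every name in it (`completion_query`, `prefix_matches`, `accept_suggestion`) is hypothetical. The real `console-autocomplete.js` ranks matches through the `vocab-rank` bundle, which this stand-in replaces with a plain prefix filter. Treating empty input and leading whitespace as "no dropdown" is an assumption, not something the plan states.

```python
from typing import List, Optional


def completion_query(value: str) -> Optional[str]:
    """Return the text to complete, or None when the dropdown should stay hidden."""
    # Assumption: empty input shows nothing, and any whitespace means the
    # cursor has moved past the first token (checklist item 7).
    if not value or any(ch.isspace() for ch in value):
        return None
    return value


def prefix_matches(query: str, vocab: List[str]) -> List[str]:
    """Plain prefix filter standing in for the real ranker."""
    return [name for name in vocab if name.startswith(query)]


def accept_suggestion(value: str, suggestion: str) -> str:
    """Tab replaces the first token with the suggestion (checklist item 4)."""
    return suggestion if completion_query(value) is not None else value
```

With a vocabulary of `sv_cheats`, `sv_gravity`, and `god`, the query `sv_` keeps the first two entries, while `sv_cheats 1` yields no query at all, so the dropdown stays hidden.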
- [ ] **Step 3: Stop dev server (Ctrl+C) and confirm final commit state**
```bash
git log --oneline -10
git status
```
Expected: 6 new commits ahead of the pre-feature state; working tree clean.
---
## Verification Summary
- **Unit tests:** `cd l4d2web/scripts/editor-src && node --test vocab-rank.test.js` — 10 passing tests for the ranker.
- **Manual editor regression:** Editor autocomplete still works on `.cfg` files.
- **Manual console smoke test:** 12-point checklist in Task 7 Step 2.
- **No new runtime JS dependencies** added (vocab-rank.test.js uses only `node:test` + `node:assert/strict`, which are built into Node ≥ 18).
## What's Explicitly Out of Scope
- Argument value completion (player names, map names) — would require runtime data, not in `srccfg-vocab.json`.
- Fuzzy / typo-tolerant matching.
- Replacing CodeMirror's editor dropdown with a custom widget.
- Cross-browser e2e automation (no Playwright/Cypress in the codebase; not adding one as part of this work).

View file

@ -1,243 +0,0 @@
# Files-overlay E2E test handoff
## Context
The files-overlay rewrite (commits `4fa3964..8dc14f0`, May 2026)
moved all editor flows behind URL-addressable modals and split the
1091-line `files-overlay.js` monolith into four focused modules under
`l4d2web/l4d2web/static/js/files-overlay/`. Behavior was verified
step-by-step in Chromium during the rewrite, but there is no automated
browser regression coverage for the editor / dialog / upload flows.
The existing Playwright suite (`l4d2web/tests/e2e/test_editor.py`)
covers only the CodeMirror 6 controller — autocomplete, form-bridge,
copy/paste — invoked through a blueprint detail page. Nothing
exercises the file manager UI.
This handoff specifies what to add: fixture extensions, the test
cases worth writing, and the patterns / pitfalls a future implementer
should know before starting. Estimated effort: a focused half-day for
the seven critical cases, a full day for the full matrix.
## Goal
Lock down the user-visible behavior of the four files-overlay modules
against future regressions. The rewrite proved each module works in
isolation; e2e proves they cooperate over real DOM, real HTTP, real
HTMX, and real CodeMirror.
## Out of scope
- Re-testing pure CodeMirror behavior (the existing `test_editor.py`
covers this on a non-files page; the controller is the same one).
- Replacing the existing pytest route tests (`tests/test_overlay_files_routes.py`,
`tests/test_url_addressable_modals.py`). E2E adds *integration*
coverage on top of those, not in place of them.
- Performance / load testing of the upload queue (concurrency 3 is
the current behavior; testing it would need 4+ simultaneous uploads
and is high-flake low-value).
- The drag-drop-from-OS path. Playwright can't synthesize a real OS
drag (`webkitGetAsEntry` returns `null` for synthetic drops, so the
fallback `getAsFile` branch always runs). The internal-drag path
(row → folder) is testable; the external drag fallback is covered
enough by the route tests.
## Fixture work
`l4d2web/tests/e2e/conftest.py` currently seeds only a `User` and a
`Blueprint`. The files-overlay tests need a files-type overlay with a
working filesystem root. Add a new fixture (or extend `live_server`):
```python
# tests/e2e/conftest.py
@pytest.fixture(scope="function")
def files_overlay_server(tmp_path, monkeypatch):
    """live_server + a files-type Overlay seeded with a small fixture
    set: one editable text file, one binary file, one nested folder
    with one file inside.

    Returns {base_url, user_id, overlay_id, overlay_root: Path}.
    """
    # Same boot as live_server (extract a helper to avoid duplication).
    # Set LEFT4ME_ROOT to tmp_path before create_app() so the files
    # overlay's path resolution lands under tmp_path.
    monkeypatch.setenv("LEFT4ME_ROOT", str(tmp_path))
    ...
    with session_scope() as session:
        user = User(username="alice", password_digest=hash_password("secret"), admin=False)
        session.add(user); session.flush()
        overlay = Overlay(name="cfgs", path="", type="files", user_id=user.id)
        session.add(overlay); session.flush()
        overlay.path = str(overlay.id)
        overlay_root = tmp_path / "overlays" / str(overlay.id)
        overlay_root.mkdir(parents=True)
        (overlay_root / "server.cfg").write_text("hostname \"left4me\"\n")
        (overlay_root / "icon.png").write_bytes(b"\x89PNG\r\n\x1a\n" + b"\x00" * 60)
        (overlay_root / "cfg").mkdir()
        (overlay_root / "cfg" / "admins.txt").write_text("STEAM_1:0:1\n")
        user_id, overlay_id = user.id, overlay.id
    ...
    yield {
        "base_url": ...,
        "user_id": user_id,
        "overlay_id": overlay_id,
        "overlay_root": overlay_root,
    }
```
The `LEFT4ME_ROOT` env-var monkey-patch is critical — without it,
`overlay_files.resolve_overlay_root` falls back to the production
`/var/lib/left4me` path (per the `AGENTS.md` "symptom-to-cause"
note) and every route returns 404. Set it BEFORE `create_app()`.
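Why the ordering matters can be shown with a small stand-in. The real `overlay_files.resolve_overlay_root` is not reproduced in this handoff, so the functions below (`left4me_root`, `overlay_root_for`) are hypothetical; they only mirror the two facts stated here: the `/var/lib/left4me` fallback and the `<root>/overlays/<id>` layout the fixture writes to.

```python
import os
from pathlib import Path
from typing import Mapping

# Fallback named in the text above.
PRODUCTION_ROOT = Path("/var/lib/left4me")


def left4me_root(environ: Mapping[str, str] = os.environ) -> Path:
    """Pick the data root: LEFT4ME_ROOT if set, else the production path."""
    value = environ.get("LEFT4ME_ROOT")
    return Path(value) if value else PRODUCTION_ROOT


def overlay_root_for(overlay_path: str, environ: Mapping[str, str] = os.environ) -> Path:
    """Assumed layout: <root>/overlays/<overlay.path>, as in the fixture."""
    return left4me_root(environ) / "overlays" / overlay_path
```

If the variable is missing when the app resolves a path, the lookup lands under the production root, where the seeded files do not exist, which is consistent with the 404s described above.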
## Test cases to add
Suggested file: `l4d2web/tests/e2e/test_files_overlay.py`. Pattern
each test like the existing `test_editor.py`: log in via the form,
navigate to `/overlays/<id>`, drive the UI through Playwright `page`
locators, assert on DOM state + filesystem state under
`overlay_root`.
### Tier 1 — critical paths (write these first)
1. **`test_edit_text_file_save_round_trip`**
- Click `server.cfg` filename. Wait for `#modal-content
textarea[data-rel-path="server.cfg"]`. URL should contain
`?modal=%2Foverlays%2F<id>%2Ffiles%2Fedit%3Fpath%3Dserver.cfg`.
- Modify content via Playwright `page.fill` on the textarea (or
via the `__filesEditor.setContent` controller for the CM6 case
— the existing `test_editor.py` shows both approaches).
- Click `.files-editor-save`. Modal closes (modal-container
`aria-modal` gone / `open` false).
- Assert `overlay_root / "server.cfg"` on disk has the new content.
2. **`test_create_new_file_routed`**
- Click `+ new file` on the overlay-root row. Wait for
`#modal-content textarea[data-rel-path=""]` and save button
labeled `Create`.
- Type a filename and content. Click Create.
- Assert file appears on disk + the file tree refreshes to show
the new row.
3. **`test_create_new_file_409_askConflict_keep_both`**
- Click `+ new file`. Type `cfg` as the filename (collides with
the seeded directory). Click Create.
- Wait for `#files-conflict-modal[open]`. Its
`.files-conflict-path` should read `cfg`.
- Click `[data-files-conflict-action="keep-both"]`.
- Assert the file `cfg (1)` appears on disk and the routed modal
closes.
- This is the path F4 (`8dc14f0`) added; without coverage it can
regress silently.
4. **`test_open_binary_file_renders_replace_ui`**
- Click `icon.png`. Modal opens.
- Assert `#modal-content .files-editor-binary[data-rel-path="icon.png"]`
exists, save button reads `Replace` and is disabled,
`.files-editor-replace-zone` and the download anchor are present.
5. **`test_binary_replace_via_browse_writes_new_bytes`**
- Open `icon.png` editor (as above).
- Click `.files-editor-replace-browse`. Use Playwright's
`page.expect_file_chooser()` to attach a small File buffer.
- Save button enables. Click it. Modal closes.
- Assert the file's bytes on disk are the new content.
6. **`test_new_folder_then_delete`**
- Click `+ new folder` on the overlay root. Inline dialog opens.
- Type a name, press Enter (keydown path). Dialog closes.
- Assert folder exists on disk + appears in tree.
- Click the folder's `✕`. Delete-confirm dialog opens with the
folder name. Click `.files-delete-confirm`.
- Assert folder gone from disk + from tree.
7. **`test_filename_rename_on_save`**
- Open `server.cfg`. Change the filename input to
`server-renamed.cfg`. Click Save.
- Assert disk has the new name + old name gone + tree row updated.
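Several tier-1 cases assert on the `?modal=` URL, so a small helper that builds the expected value may keep those assertions readable. This is a sketch under assumptions: `modal_url` and `editor_modal_url` are hypothetical names, and the helper only reproduces the encoding shown in case 1 for flat filenames such as `server.cfg`. How the router encodes nested paths or unusual characters is not covered here.

```python
from urllib.parse import quote


def modal_url(page_path: str, modal_path: str) -> str:
    """Compose <page>?modal=<percent-encoded modal path>."""
    # safe="" also encodes "/", "?" and "=", matching the URL in case 1.
    return f"{page_path}?modal={quote(modal_path, safe='')}"


def editor_modal_url(overlay_id: int, rel_path: str) -> str:
    """Expected URL after opening the routed editor for a flat filename."""
    inner = f"/overlays/{overlay_id}/files/edit?path={rel_path}"
    return modal_url(f"/overlays/{overlay_id}", inner)
```

For overlay 7 and `server.cfg`, this produces `/overlays/7?modal=%2Foverlays%2F7%2Ffiles%2Fedit%3Fpath%3Dserver.cfg`, the same shape as the URL quoted in case 1.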
### Tier 2 — round out the matrix
8. **`test_drag_row_to_folder_moves_file`** — internal drag.
Playwright's `locator.drag_to()` can move a row onto a folder.
Assert the move via `/files/move` succeeded and disk reflects it.
9. **`test_upload_queue_progress`** — drop a single file onto the
tree root. The progress panel becomes visible; the row enters
`data-state="active"`, then `data-state="done"`. Assert the
uploaded file is on disk. (Skip the 409 / conflict / cancel
permutations — they're covered by the route tests.)
10. **`test_modal_close_on_escape_preserves_no_state`** — open the
routed editor, type some content but don't save, press Escape.
Modal closes. Reopen — content is fresh (no stale buffer),
`routedReplacement` cleared.
11. **`test_share_url_deep_link_reopens_editor`** — navigate
directly to `/overlays/<id>?modal=%2Foverlays%2F<id>%2Ffiles%2Fedit%3Fpath%3Dserver.cfg`.
Modal should auto-open on DOMContentLoaded (the bootstrap path
from `modals.js`). This is the URL-addressable spec's central
promise; without coverage it regresses easily.
### Tier 3 — nice to have
12. Server detail page hover-download button (the F0 prefactor):
seed a server, navigate to `/servers/<id>`, hover a file row,
click the `⬇` button, assert a file download initiates.
## Patterns to follow / pitfalls
- **The existing `test_editor.py` is the closest pattern.** Read it
end-to-end before starting. The login helper, the `live_server`
fixture shape, the `expect`-based assertions, and the way
Playwright interacts with the CM6 controller (`page.evaluate(...)`
on `window.__filesEditor`) all transfer.
- **Run with `uv run pytest -m e2e tests/e2e/test_files_overlay.py`.**
Anything else crashes Chromium under the macOS sandbox.
Run `uv run playwright install chromium` once per fresh checkout.
- **Routed modals load via `htmx.ajax` — they're async.** Don't assert
immediately after the click. Use `expect(page.locator(...)).to_be_visible()`
with a timeout (Playwright's default 5s is fine).
- **Reading the file tree after a refresh is also async.** The JS
`scheduleRefresh` debounces by 50ms then fetches the directory
partial via HTMX. Use `expect(page.locator(".file-tree-row-file[data-target-path='...']")).to_be_visible()`
rather than polling DOM directly.
- **`data-rel-path` lives on the textarea in text mode and on the
binary panel in binary mode.** Tests asserting "the editor opened
for X" should query whichever matches — or use the fragment
wrapper `#files-editor-fragment` as a stable container.
- **The conflict dialog is inline, not routed.** Don't expect URL
changes when it opens. The decision tree is a single question:
- "Did the URL change?" If yes, it is a routed modal (editor); if
no, it is an inline modal (new-folder, conflict, delete-confirm).
- **`SESSION_COOKIE_SECURE=0` is non-optional.** The fixture must set
it; otherwise the browser drops the session cookie over http and
every test redirects back to `/login`. The existing `conftest.py`
has the right pattern at line 39.
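The "did the URL change?" question can be answered from the page URL alone. The helper below is a minimal sketch with a hypothetical name (`routed_modal_path`); it assumes the routed pipeline always records the open modal in a `modal` query parameter, as described in this handoff, and it could be fed Playwright's `page.url` in a test.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def routed_modal_path(url: str) -> Optional[str]:
    """Return the decoded ?modal= path, or None when no routed modal is open."""
    values = parse_qs(urlsplit(url).query).get("modal")
    return values[0] if values else None
```

A result of `None` after a dialog opens indicates the inline pipeline (new-folder, conflict, delete-confirm); a path indicates the routed editor.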
## Verification
Per AGENTS.md: `uv run pytest -m e2e tests/e2e/test_files_overlay.py -v`.
The seven tier-1 cases should pass in under 60s on a warm Chromium.
The full matrix (12 cases) should finish in under 2 minutes.
When wiring CI / pre-push hooks: the e2e marker is excluded from the
default fast suite, so the existing 580-passing `uv run pytest tests/`
run remains the quick check. The e2e suite runs explicitly when
`-m e2e` is set.
## References
- `l4d2web/tests/e2e/test_editor.py` — pattern model
- `l4d2web/tests/e2e/conftest.py:39`: `SESSION_COOKIE_SECURE` note
- `l4d2web/tests/test_url_addressable_modals.py` — non-browser route
tests that already cover the server-side contract (200/404/415/400
on edit, new, save). E2E shouldn't duplicate these.
- `l4d2web/l4d2web/static/js/files-overlay/{core,editor,dialogs,uploads}.js`
— read each file's module header comment for the listener layout
before writing assertions.
- `AGENTS.md` "Files overlay: module layout" — high-level orientation.
- `AGENTS.md` "Modals: inline vs routed" — decision tree the test
matrix follows.

View file

@ -1,973 +0,0 @@
# URL-Addressable Modals Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Pilot the swift3-style URL-addressable modal pattern in left4me by migrating the file editor's open/render flow. Same URL renders as a full page or a layoutless fragment based on an `HX-Modal: 1` request header. Save flow stays AJAX.
**Architecture:** Approach C (Hybrid). Custom ~50-line `modal-router.js` owns click intercept, `?modal=<path>` URL composition, history, and native `<dialog>` open/close. HTMX (already loaded) owns fetch + swap + loading state. Jinja `inject_base_layout` context processor switches between `base.html` and `_modal_partial.html` based on the header.
**Tech Stack:** Flask 3.x + Jinja2, HTMX 2.0.4, native `<dialog>`, CodeMirror 6 (already bundled as `editor.bundle.js`), pytest for backend tests, Chromium for frontend verification.
**Spec reference:** `docs/superpowers/specs/2026-05-17-url-addressable-modals-design.md`
---
## Errata (post-execution)
The plan shipped in 14 commits over the course of 2026-05-17. Code review during execution caught three defects in the verbatim plan code; if you re-run this plan, watch for them:
1. **Task 1, Step 4 — context processor needs a `has_request_context()` guard.** Plan code reads `request.headers.get("HX-Modal")` unconditionally, but `tests/test_timeago.py` renders templates inside `app.app_context()` only (no request context). Without the guard the processor crashes with `RuntimeError: Working outside of request context`. Fix: `is_modal = has_request_context() and request.headers.get("HX-Modal") == "1"` (lazy import `from flask import has_request_context` is fine). Shipped in commit `82c3f04`.
2. **Task 3, Step 1 — test fixture must respect `LEFT4ME_ROOT`.** Plan code uses `path=str(overlay_root)` (absolute filesystem path) on the `Overlay` model. The codebase resolves `overlay.path` relative to `LEFT4ME_ROOT` via `validate_overlay_ref` and rejects absolute paths. Fix: `monkeypatch.setenv("LEFT4ME_ROOT", str(tmp_path))`, write files to `tmp_path/overlays/<id>/`, set `overlay.path = str(overlay.id)`. Mirrors `tests/test_overlay_files_routes.py`'s convention. Shipped in commit `60e7968`.
3. **Task 9, Step 2 — "save flow unchanged" was wrong.** The legacy save/delete handlers in `files-overlay.js` are direct-bound to `editorEls.saveBtn` / `editorEls.deleteBtn` (the inline dialog's specific elements), not document-delegated. The new server-rendered modal's identical-class buttons get no handler. Fix: add document-level event delegation for `.files-editor-save` and `.files-editor-delete` clicks gated on `modalContent.contains(btn)`, read `data-rel-path` from the textarea (NOT from a JS var the now-deleted open path used to set), use `window.__filesEditor.getValue()`, POST + `closeModal()` + `scheduleRefresh(parentOf(path))`. Also support rename: read filename input, compose `payload.new_path = parent/filename` when changed, handle 409 with alert + keep modal open. Shipped across commits `64cf203` and `33a2e52`.
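The guard from defect 1 can be restated as a pure function, which makes the three cases easy to see without a Flask app. This is a sketch, not the shipped code: `pick_layout` is a hypothetical name, and passing `None` stands in for "no request context" (the situation `has_request_context()` detects in the real processor).

```python
from typing import Mapping, Optional


def pick_layout(headers: Optional[Mapping[str, str]]) -> str:
    """Choose the base template the way the guarded context processor does."""
    # headers is None when rendering outside a request (app context only).
    is_modal = headers is not None and headers.get("HX-Modal") == "1"
    return "_modal_partial.html" if is_modal else "base.html"
```

Rendering outside a request falls back to `base.html` instead of raising, and a plain `HX-Request` header still does not switch the layout.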
## Tasks added during execution
Three tasks were inserted that weren't in the original plan:
- **Task 8.5 (commit `f6b8ecf`)**: `overlay_file_editor.html`'s `<dialog open>` nested inside `<dialog id="modal-container">` collapses to 2 px tall in browsers. Replaced with `<div role="document">`. Bundled with CM6 `controller.destroy()` on modal close (a memory leak fix: every open/close cycle had been orphaning an `EditorView` and a `matchMedia` listener) and a `mountOne` idempotency guard. CSS broadened: `dialog.modal, div.modal`.
- **Task 8.5b (commit `7829d1c`)** — the broadened CSS caused double-card painting (outer dialog + inner div both matched the `.modal` styling). Dropped `class="modal modal-wide"` and `role="document"` from the inner div; the outer dialog owns the chrome.
- **Task 9b (commit `33a2e52`)** — see defect #3 above for rename-on-save support.
## Design refinement during execution (Task 6 superseded)
Task 6's original "every close source updates state directly" code was replaced with a close-event-centric design: every close source (Esc cancel, backdrop click, `[data-modal-dismiss]`, browser back, `htmx:responseError`, programmatic close) just calls `dialog.close()`, and a single `close`-event listener clears `currentModalPath` and removes `?modal=` from the URL. This kills two latent bugs simultaneously: (a) the legacy `modal.js:31-33` backdrop handler closes `dialog.modal` without clearing URL, and (b) HTMX's `htmx.ajax` resolves on 4xx so plain `.then(() => showModal())` would open a modal on error responses. Shipped in commit `6e66375`. The revised design is in that commit's diff.
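The shape of the close-event-centric design can be illustrated with a toy model. The real implementation is browser JavaScript in `modal-router.js`; the Python class below is only a language-neutral sketch, and all of its names (`ModalSlot`, `url_modal_param`, the `on_*` methods) are hypothetical. It models the close side only and does not attempt the error-response handling.

```python
from typing import Optional


class ModalSlot:
    """Toy model of the persistent modal slot; all names are hypothetical."""

    def __init__(self) -> None:
        self.is_open = False
        self.current_modal_path: Optional[str] = None
        self.url_modal_param: Optional[str] = None

    def open(self, path: str) -> None:
        self.is_open = True
        self.current_modal_path = path
        self.url_modal_param = path

    def close(self) -> None:
        # Like a native <dialog>, closing an already-closed slot is a no-op.
        if not self.is_open:
            return
        self.is_open = False
        self._on_close_event()

    def _on_close_event(self) -> None:
        # The only place that clears router state.
        self.current_modal_path = None
        self.url_modal_param = None

    # Every close source delegates to close() and touches no state itself.
    def on_escape(self) -> None:
        self.close()

    def on_backdrop_click(self) -> None:
        self.close()

    def on_browser_back(self) -> None:
        self.close()
```

Because state is cleared in one place, a close source added later cannot forget to reset the URL, which is the failure mode the legacy backdrop handler had.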
## Post-pilot polish (commits 5dc4xx after Task 10)
- Removed dangling `aria-labelledby="modal-content-title"` from `#modal-container` in `base.html` (referenced an id that never existed).
- Renamed the new editor template's outer `<div>` id from `files-editor-modal` to `files-editor-fragment` to resolve a duplicate-id W3C violation with the legacy inline `<dialog id="files-editor-modal">` in `overlay_detail.html`. Updated `editor.js`'s `closest()` to match both selectors so auto-language detection works for both modal pipelines.
---
## File Structure
| Path | New / Modify | Responsibility |
|------|--------------|----------------|
| `l4d2web/l4d2web/app.py` | Modify (insert ~5 lines after `add_template_filter`) | Register `inject_base_layout` context processor |
| `l4d2web/l4d2web/templates/_modal_partial.html` | New (1 line) | Layoutless base template — just `{% block content %}{% endblock %}` |
| `l4d2web/l4d2web/templates/overlay_file_editor.html` | New | Editor markup lifted from `overlay_detail.html:165-228`, content pre-filled, extends `base_layout` |
| `l4d2web/l4d2web/routes/files_routes.py` | Modify (add one route, ~30 lines) | `GET /overlays/<id>/files/edit?path=<rel>` |
| `l4d2web/l4d2web/templates/base.html` | Modify (insert ~3 lines) | Persistent `<dialog id="modal-container">` slot + `modal-router.js` script include |
| `l4d2web/l4d2web/static/js/modal-router.js` | New (~60 lines) | Click intercept, URL composition, history, open/close, bootstrap |
| `l4d2web/l4d2web/static/js/editor.js` | Modify (expose `initEditors(root)`, add `htmx:afterSwap` listener) | CM6 re-init after HTMX swap |
| `l4d2web/l4d2web/static/js/files-overlay.js` | Modify (change one code path) | Replace inline-dialog populate-and-show with `window.openModal(url)` |
| `l4d2web/l4d2web/templates/overlay_detail.html` | Modify (remove `<dialog id="files-editor-modal">` block at lines 165-228) | Delete the old inline editor dialog |
| `l4d2web/tests/test_url_addressable_modals.py` | New | pytest coverage for context processor + new edit route |
---
## Task 1: Layout context processor + partial template
**Files:**
- Create: `l4d2web/l4d2web/templates/_modal_partial.html`
- Modify: `l4d2web/l4d2web/app.py` (insert after `app.add_template_filter(format_time_html, "timeago")` on line 62)
- Test: `l4d2web/tests/test_url_addressable_modals.py` (new)
- [ ] **Step 1: Write the failing test**
Create `l4d2web/tests/test_url_addressable_modals.py`:
```python
from flask import render_template_string

from l4d2web.app import create_app


def _make_app(tmp_path, monkeypatch, db_name: str):
    db_url = f"sqlite:///{tmp_path/db_name}"
    monkeypatch.setenv("DATABASE_URL", db_url)
    return create_app({"TESTING": True, "DATABASE_URL": db_url, "SECRET_KEY": "test"})


def test_base_layout_is_modal_partial_when_hx_modal_header_set(tmp_path, monkeypatch):
    app = _make_app(tmp_path, monkeypatch, "layout-modal.db")
    with app.test_request_context("/", headers={"HX-Modal": "1"}):
        assert render_template_string("{{ base_layout }}") == "_modal_partial.html"


def test_base_layout_is_base_html_for_normal_request(tmp_path, monkeypatch):
    app = _make_app(tmp_path, monkeypatch, "layout-default.db")
    with app.test_request_context("/"):
        assert render_template_string("{{ base_layout }}") == "base.html"


def test_base_layout_does_not_react_to_plain_hx_request_header(tmp_path, monkeypatch):
    # HTMX sets HX-Request on every request including the build-status poll;
    # only HX-Modal should switch the layout.
    app = _make_app(tmp_path, monkeypatch, "layout-hxreq.db")
    with app.test_request_context("/", headers={"HX-Request": "true"}):
        assert render_template_string("{{ base_layout }}") == "base.html"
```
- [ ] **Step 2: Run test to verify it fails**
Run: `cd l4d2web && uv run pytest tests/test_url_addressable_modals.py -v`
Expected: 3 failures (all asserting that `base_layout` resolves to something — currently undefined, so render fails with `UndefinedError` or returns empty string).
- [ ] **Step 3: Create the partial template**
Create `l4d2web/l4d2web/templates/_modal_partial.html` with exactly this content:
```jinja
{% block content %}{% endblock %}
```
- [ ] **Step 4: Register the context processor**
In `l4d2web/l4d2web/app.py`, insert immediately after line 62 (`app.add_template_filter(format_time_html, "timeago")`):
```python
    @app.context_processor
    def inject_base_layout() -> dict[str, str]:
        is_modal = request.headers.get("HX-Modal") == "1"
        return {"base_layout": "_modal_partial.html" if is_modal else "base.html"}
```
`request` is already imported at the top of the file.
- [ ] **Step 5: Run tests to verify pass**
Run: `cd l4d2web && uv run pytest tests/test_url_addressable_modals.py -v`
Expected: 3 passed.
- [ ] **Step 6: Commit**
```bash
git add l4d2web/l4d2web/app.py l4d2web/l4d2web/templates/_modal_partial.html l4d2web/tests/test_url_addressable_modals.py
git commit -m "$(cat <<'EOF'
feat(modals): layout context processor for HX-Modal header

Switches the Jinja base layout to _modal_partial.html (yield-only) when
the HX-Modal:1 request header is set, otherwise base.html. Foundation
for URL-addressable modals (spec 2026-05-17-url-addressable-modals).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 2: Editor template (file editor as standalone page)
**Files:**
- Create: `l4d2web/l4d2web/templates/overlay_file_editor.html`
- Test: covered by Task 3's route tests (template is unreachable until then)
This task is a lift-and-shift of the editor markup from `overlay_detail.html:165-228` into its own template with server-side content variables substituted in.
- [ ] **Step 1: Read the source markup to lift**
Run: `sed -n '164,228p' l4d2web/l4d2web/templates/overlay_detail.html`
Note the surrounding `{% if files_can_edit %}` guard — that gating moves to the route (only `files` overlays expose the link). The template itself unconditionally renders the editor.
- [ ] **Step 2: Create the new template**
Create `l4d2web/l4d2web/templates/overlay_file_editor.html`:
```jinja
{% extends base_layout %}
{% block title %}Edit {{ rel_path }} · {{ overlay.name }}{% endblock %}
{% block extra_head %}{% include "_editor_assets.html" %}{% endblock %}
{% block content %}
<dialog id="files-editor-modal" class="modal modal-wide" open aria-labelledby="files-editor-title">
<div class="modal-header">
<h2 id="files-editor-title" class="files-editor-path">
<span class="files-editor-title-text">{{ rel_path }}</span>
</h2>
<button type="button" class="modal-close" data-modal-dismiss aria-label="Close">&times;</button>
</div>
<div class="modal-body">
<label class="files-editor-field">
<span class="files-field-label">Filename</span>
<input type="text" class="files-editor-filename" data-editor-filename autocomplete="off" spellcheck="false" value="{{ rel_path }}">
</label>
<p class="files-editor-rename-hint" hidden>↻ Save will rename <code class="files-rename-from"></code><code class="files-rename-to"></code>.</p>
<div class="files-editor-text">
<label class="files-editor-field files-editor-language-field">
<span class="files-field-label">Language</span>
<select data-editor-language-select aria-label="Editor language">
<option value="auto">auto (from filename)</option>
<option value="srccfg">srccfg (.cfg)</option>
<option value="bash">bash (.sh)</option>
<option value="plain">plain</option>
</select>
</label>
<label class="files-editor-field">
<span class="files-field-label">Content</span>
<div class="editor-mount" style="--editor-rows: 14"><textarea class="files-editor-content" rows="14" spellcheck="false" data-editor-language="auto" data-overlay-id="{{ overlay.id }}" data-rel-path="{{ rel_path }}">{{ content }}</textarea></div>
</label>
<div class="files-editor-meta muted">
<span class="files-editor-byte-count">UTF-8 · {{ byte_count }} bytes</span>
<span>Ctrl+S to save</span>
</div>
</div>
</div>
<div class="modal-footer files-editor-footer">
<button type="button" class="danger-outline files-editor-delete">Delete</button>
<span class="files-editor-footer-spacer"></span>
<a class="button-secondary files-editor-download" href="/overlays/{{ overlay.id }}/files/download?path={{ rel_path|urlencode }}">⬇ Download</a>
<button type="button" class="button-secondary" data-modal-dismiss>Cancel</button>
<button type="button" class="files-editor-save">Save</button>
</div>
</dialog>
{% endblock %}
```
Notes baked into the markup:
- `{% extends base_layout %}` — picks `_modal_partial.html` or `base.html` based on the request header
- `<dialog … open>` for the full-page render — when standalone, the dialog stays open without `showModal()`. When fragment-rendered into the modal slot, `modal-router.js` calls `showModal()` on the *outer* `#modal-container` (not this inner dialog — see Task 4)
- `data-modal-dismiss` on close buttons — picked up by modal-router (deferred to Task 6)
- `data-overlay-id` + `data-rel-path` on the textarea — so the AJAX save in `files-overlay.js` can find its target without depending on global state
- Binary-file replacement UI from `overlay_detail.html:204-219` is **omitted** from this pilot template. Editable-only files reach this route (the route returns 415 for non-editable per Task 3). Binary replace stays inline-modal for now (out of pilot scope)
- [ ] **Step 3: Commit**
```bash
git add l4d2web/l4d2web/templates/overlay_file_editor.html
git commit -m "$(cat <<'EOF'
feat(modals): editor template that extends base_layout

Lifts the file editor markup out of overlay_detail.html into its own
template with server-side filename, content, byte count, and download
URL pre-filled. Uses {% extends base_layout %} so the same template
renders as either a full page or a layoutless modal fragment.

Binary replace UI deferred — pilot scope is editable text files only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 3: New GET `/overlays/<id>/files/edit` route
**Files:**
- Modify: `l4d2web/l4d2web/routes/files_routes.py` (add one route, ~35 lines)
- Test: `l4d2web/tests/test_url_addressable_modals.py` (extend)
The route mirrors the existing `overlay_file_content` at `files_routes.py:203-234`: resolves the path, checks editability, reads UTF-8 content. Difference: returns HTML (via `overlay_file_editor.html`) instead of JSON.
- [ ] **Step 1: Write the failing tests**
Append to `l4d2web/tests/test_url_addressable_modals.py`:
```python
from datetime import UTC, datetime

from l4d2web.auth import hash_password
from l4d2web.db import init_db, session_scope
from l4d2web.models import Overlay, User


def _auth_client_with_files_overlay(tmp_path, monkeypatch, db_name: str):
    db_url = f"sqlite:///{tmp_path/db_name}"
    monkeypatch.setenv("DATABASE_URL", db_url)
    app = create_app({"TESTING": True, "DATABASE_URL": db_url, "SECRET_KEY": "test"})
    init_db()

    overlay_root = tmp_path / "overlay_root"
    overlay_root.mkdir()
    (overlay_root / "server.cfg").write_text("hostname \"left4me\"\nrcon_password \"hunter2\"\n", encoding="utf-8")

    with session_scope() as session:
        user = User(username="alice", password_digest=hash_password("secret"), admin=False)
        session.add(user)
        session.flush()
        overlay = Overlay(name="cfgs", path=str(overlay_root), type="files", user_id=user.id)
        session.add(overlay)
        session.flush()
        user_id = user.id
        overlay_id = overlay.id

    client = app.test_client()
    with client.session_transaction() as sess:
        sess["user_id"] = user_id
        sess["pw_changed_at"] = datetime.now(UTC).isoformat()
    return client, overlay_id


def test_edit_route_renders_full_page_without_modal_header(tmp_path, monkeypatch):
    client, overlay_id = _auth_client_with_files_overlay(tmp_path, monkeypatch, "edit-full.db")
    response = client.get(f"/overlays/{overlay_id}/files/edit?path=server.cfg")
    text = response.get_data(as_text=True)
    assert response.status_code == 200
    assert "<!doctype html>" in text.lower()  # full base.html rendered
    assert 'href="/dashboard"' in text  # nav present
    assert 'class="files-editor-content"' in text
    assert 'rcon_password' in text  # content pre-filled


def test_edit_route_renders_fragment_with_modal_header(tmp_path, monkeypatch):
    client, overlay_id = _auth_client_with_files_overlay(tmp_path, monkeypatch, "edit-fragment.db")
    response = client.get(
        f"/overlays/{overlay_id}/files/edit?path=server.cfg",
        headers={"HX-Modal": "1"},
    )
    text = response.get_data(as_text=True)
    assert response.status_code == 200
    assert "<html" not in text  # layoutless
    assert 'class="primary-nav"' not in text
    assert 'class="files-editor-content"' in text
    assert "hostname" in text  # content pre-filled


def test_edit_route_404s_for_missing_file(tmp_path, monkeypatch):
    client, overlay_id = _auth_client_with_files_overlay(tmp_path, monkeypatch, "edit-404.db")
    response = client.get(f"/overlays/{overlay_id}/files/edit?path=nonexistent.cfg")
    assert response.status_code == 404


def test_edit_route_415s_for_non_editable_file(tmp_path, monkeypatch):
    client, overlay_id = _auth_client_with_files_overlay(tmp_path, monkeypatch, "edit-415.db")
    # Forge a non-editable file by writing binary garbage.
    with session_scope() as s:
        overlay = s.query(Overlay).filter_by(id=overlay_id).one()
        from pathlib import Path
        Path(overlay.path).joinpath("blob.bin").write_bytes(b"\x00\x01\x02\x03" * 1024)
    response = client.get(f"/overlays/{overlay_id}/files/edit?path=blob.bin")
    assert response.status_code == 415


def test_edit_route_400s_for_path_traversal(tmp_path, monkeypatch):
    client, overlay_id = _auth_client_with_files_overlay(tmp_path, monkeypatch, "edit-400.db")
    response = client.get(f"/overlays/{overlay_id}/files/edit?path=../../etc/passwd")
    assert response.status_code == 400


def test_edit_route_404s_for_non_files_overlay(tmp_path, monkeypatch):
    db_url = f"sqlite:///{tmp_path/'edit-script-overlay.db'}"
    monkeypatch.setenv("DATABASE_URL", db_url)
    app = create_app({"TESTING": True, "DATABASE_URL": db_url, "SECRET_KEY": "test"})
    init_db()
    with session_scope() as s:
        user = User(username="alice", password_digest=hash_password("secret"), admin=False)
        s.add(user)
        s.flush()
        overlay = Overlay(name="scripted", path=str(tmp_path), type="script", user_id=user.id)
        s.add(overlay)
        s.flush()
        user_id = user.id
        overlay_id = overlay.id
    client = app.test_client()
    with client.session_transaction() as sess:
        sess["user_id"] = user_id
        sess["pw_changed_at"] = datetime.now(UTC).isoformat()
    response = client.get(f"/overlays/{overlay_id}/files/edit?path=anything.cfg")
    assert response.status_code == 404
```
- [ ] **Step 2: Run tests to verify they fail**
Run: `cd l4d2web && uv run pytest tests/test_url_addressable_modals.py -v -k edit_route`
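Expected: 4 failures. The route doesn't exist yet, so every request returns 404; the two tests that assert 404 (`404s_for_missing_file`, `404s_for_non_files_overlay`) pass by coincidence until the route is added.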
- [ ] **Step 3: Add the route**
In `l4d2web/l4d2web/routes/files_routes.py`, append immediately after the `overlay_file_content` function (line 234):
```python
@bp.get("/overlays/<int:overlay_id>/files/edit")
@require_login
def overlay_file_edit_page(overlay_id: int):
    """Server-rendered editor page. Renders full-page by default or as a
    layoutless modal fragment when the HX-Modal header is set (see the
    inject_base_layout context processor in app.py)."""
    user = current_user()
    assert user is not None
    sub_path = request.args.get("path", "")

    result = _load_files_overlay(overlay_id, user)
    if isinstance(result, Response):
        return result
    overlay = result

    try:
        target = safe_resolve_for_listing(overlay.path, sub_path)
    except ValueError:
        return Response("invalid path", status=400)
    if not target.exists() or not target.is_file():
        return Response(status=404)
    if not is_editable(target):
        return Response("not editable", status=415)

    try:
        content = target.read_text(encoding="utf-8")
    except OSError:
        return Response("read failed", status=500)
    except UnicodeDecodeError:
        return Response("not editable", status=415)

    return render_template(
        "overlay_file_editor.html",
        overlay=overlay,
        rel_path=sub_path,
        content=content,
        byte_count=len(content.encode("utf-8")),
    )
```
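The 400 branch relies on `safe_resolve_for_listing` raising `ValueError` for paths that escape the overlay root. That helper's body is not shown in this plan, so the following is only a minimal stand-in, assuming it resolves the joined path and checks containment. The name `resolve_inside` is hypothetical.

```python
from pathlib import Path


def resolve_inside(root: str, sub_path: str) -> Path:
    """Hypothetical stand-in for safe_resolve_for_listing: resolve sub_path
    under root and raise ValueError if the result escapes root."""
    base = Path(root).resolve()
    candidate = (base / sub_path).resolve()
    # Path.is_relative_to() is available on Python 3.9+.
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes overlay root: {sub_path!r}")
    return candidate
```

Resolving before comparing means `..` segments and absolute `sub_path` values are both rejected by the same containment check.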
- [ ] **Step 4: Run tests to verify pass**
Run: `cd l4d2web && uv run pytest tests/test_url_addressable_modals.py -v`
Expected: 9 passed (3 from Task 1 + 6 new).
- [ ] **Step 5: Smoke-test the existing test suite for regressions**
Run: `cd l4d2web && uv run pytest tests/ -v --tb=short -q`
Expected: all tests pass. The context processor adds `base_layout` to every template render; existing templates ignore it (they all use `{% extends "base.html" %}` literally), so behavior is unchanged.
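The layout decision behind that context processor can be reduced to a pure function. This is a sketch under assumptions: `choose_base_layout` and the `modal_fragment.html` template name are illustrative and not taken from the codebase; only `base.html` and the `HX-Modal` header come from the plan.

```python
def choose_base_layout(headers: dict[str, str]) -> str:
    """Return the layout template for a request (template names assumed).

    Only the custom HX-Modal header selects the layoutless variant. htmx's
    built-in HX-Request header, which is sent on every htmx request
    including polls, must not.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    if normalised.get("hx-modal") == "1":
        return "modal_fragment.html"  # assumed name for the layoutless base
    return "base.html"
```

Keeping the decision in one function makes the "polls must not be misclassified" rule testable without rendering a template.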
- [ ] **Step 6: Commit**
```bash
git add l4d2web/l4d2web/routes/files_routes.py l4d2web/tests/test_url_addressable_modals.py
git commit -m "$(cat <<'EOF'
feat(modals): GET /overlays/<id>/files/edit route

Server-renders the file editor as a real page. With HX-Modal:1 returns
a layoutless fragment for modal embedding; without it returns the full
standalone page. Mirrors overlay_file_content's path/editability checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 4: Persistent modal slot in base.html
**Files:**
- Modify: `l4d2web/l4d2web/templates/base.html`
The slot is a sibling of `<main>`, sitting at body scope so backdrop renders over everything.
- [ ] **Step 1: Add the slot and script include**
In `l4d2web/l4d2web/templates/base.html`, modify the body section. After the closing `</main>` (currently line 39), insert the modal slot. After the `<script src="…/modal.js">` line (currently line 43), add the modal-router include.
The body section should look like this after the edit:
```html
  <main class="container">
    {% block content %}{% endblock %}
  </main>
  <dialog id="modal-container" class="modal modal-wide" aria-labelledby="modal-content-title">
    <div id="modal-content"></div>
  </dialog>
  <script src="{{ url_for('static', filename='vendor/htmx.min.js') }}"></script>
  <script src="{{ url_for('static', filename='js/csrf.js') }}"></script>
  <script src="{{ url_for('static', filename='js/sse.js') }}"></script>
  <script src="{{ url_for('static', filename='js/modal.js') }}"></script>
  <script src="{{ url_for('static', filename='js/modal-router.js') }}"></script>
  <script src="{{ url_for('static', filename='js/file-tree.js') }}"></script>
  <script src="{{ url_for('static', filename='js/password-reveal.js') }}"></script>
  <script defer src="{{ url_for('static', filename='js/console-history.js') }}"></script>
</body>
```
- [ ] **Step 2: Create an empty modal-router.js stub**
So the new `<script src>` doesn't 404. Create `l4d2web/l4d2web/static/js/modal-router.js`:
```javascript
// URL-addressable modal router (see docs/superpowers/specs/2026-05-17-url-addressable-modals-design.md).
// Implementation lands in subsequent tasks; this stub keeps base.html's
// script include from 404'ing during the staged rollout.
(function () {
  "use strict";
})();
```
- [ ] **Step 3: Run existing tests for regressions**
Run: `cd l4d2web && uv run pytest tests/test_pages.py -v -q`
Expected: all pass. The added `<dialog>` is closed by default (no `open` attribute), so it's invisible and inert.
- [ ] **Step 4: Commit**
```bash
git add l4d2web/l4d2web/templates/base.html l4d2web/l4d2web/static/js/modal-router.js
git commit -m "$(cat <<'EOF'
feat(modals): persistent modal slot + router script stub in base.html

Adds <dialog id="modal-container"> with #modal-content slot at body
scope. Script stub created so the include doesn't 404; logic follows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 5: `modal-router.js` — click intercept, open, fetch, show
**Files:**
- Modify: `l4d2web/l4d2web/static/js/modal-router.js`
This task wires the click → URL → fetch → show pipeline. Close/popstate/bootstrap come in Tasks 6 and 7.
- [ ] **Step 1: Implement click intercept + openModal + fetchAndShow**
Replace `l4d2web/l4d2web/static/js/modal-router.js` with:
```javascript
// URL-addressable modal router (see spec 2026-05-17-url-addressable-modals).
// Click intercept on a[data-modal] → ?modal=<path> in URL → htmx swap into
// #modal-content → showModal(). Close/popstate/bootstrap in later tasks.
(function () {
  "use strict";

  let currentModalPath = null;  // race-guard against stale swaps

  function openModal(path) {
    const url = new URL(window.location.href);
    url.searchParams.set("modal", path);
    history.pushState({ modal: path }, "", url.toString());
    fetchAndShow(path);
  }

  function fetchAndShow(path) {
    currentModalPath = path;
    if (typeof window.htmx === "undefined") {
      console.error("[modal-router] htmx not loaded; cannot fetch modal");
      return;
    }
    window.htmx.ajax("GET", path, {
      target: "#modal-content",
      swap: "innerHTML",
      headers: { "HX-Modal": "1" },
    }).then(() => {
      // Race guard: if the user clicked again during the fetch, abandon
      // this swap; the newer click will win.
      if (currentModalPath !== path) return;
      const dlg = document.getElementById("modal-container");
      if (dlg && !dlg.open) dlg.showModal();
    }).catch((err) => {
      console.error("[modal-router] fetch failed", err);
    });
  }

  document.addEventListener("click", (event) => {
    const link = event.target.closest("a[data-modal]");
    if (!link) return;
    if (event.metaKey || event.ctrlKey || event.shiftKey || event.altKey) return;
    if (event.button !== 0) return;
    const href = link.getAttribute("href");
    if (!href) return;
    event.preventDefault();
    openModal(href);
  });

  // Public API: used by files-overlay.js to open the editor from row clicks
  // that aren't a literal <a data-modal> (existing event delegation).
  window.openModal = openModal;
})();
```
- [ ] **Step 2: Chromium verification**
Steps to verify manually (or via the user's Chromium tooling). The new editor route is reachable but not yet linked from the file tree — use a temporary `<a>` for the smoke test.
1. Start the dev server: `cd l4d2web && uv run flask --app l4d2web.app:create_app run --debug --port 5000` (or whatever the project's dev-server idiom is — confirm at implementation time).
2. Log in and create a `files` overlay with a `.cfg` file in it (or use an existing one).
3. Open dev tools → Console.
4. In the console, run: `window.openModal('/overlays/<id>/files/edit?path=server.cfg')` (substitute real id).
5. **Expected:** URL gains `?modal=/overlays/<id>/files/edit?path=server.cfg`. Modal opens with the editor markup inside. (CodeMirror not yet mounted — that's Task 8 — so you'll see the raw `<textarea>` with content.)
6. Network tab: confirm the request to `/overlays/<id>/files/edit?path=server.cfg` carries the header `HX-Modal: 1`.
7. Negative check: confirm the build-status poll (`/overlays/<id>/build-status` every 2s) does **not** carry `HX-Modal: 1`.
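Because the modal path carries its own query string, the value is percent-encoded when it is stored in `?modal=`, so the address bar will likely show an encoded form rather than the literal text in step 5. A rough Python equivalent using `urllib.parse` follows; browser `URLSearchParams` serialisation can differ in small details, and the overlay id `7` is made up.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

modal_path = "/overlays/7/files/edit?path=server.cfg"

# Nest the modal path (with its own query string) inside ?modal=.
page_url = "/overlays/7?" + urlencode({"modal": modal_path})

# Reading it back yields the original path, inner query string intact.
recovered = parse_qs(urlsplit(page_url).query)["modal"][0]
```

The inner `?` and `=` are escaped, so the outer URL still has exactly one query string and the inner `path` parameter is not mistaken for a parameter of the overlay page.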
- [ ] **Step 3: Commit**
```bash
git add l4d2web/l4d2web/static/js/modal-router.js
git commit -m "$(cat <<'EOF'
feat(modals): click intercept + openModal + fetchAndShow

a[data-modal] clicks push ?modal=<path> to URL and trigger htmx.ajax
into #modal-content with the HX-Modal header. window.openModal exposed
for non-<a> trigger sites (files-overlay row clicks). Race guard via
currentModalPath token. Close/popstate/bootstrap follow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 6: `modal-router.js` — close, popstate, dismiss, Esc
**Files:**
- Modify: `l4d2web/l4d2web/static/js/modal-router.js`
- [ ] **Step 1: Add close, popstate, dismiss-click, dialog cancel handlers**
In `modal-router.js`, extend the IIFE body: insert the new function and listeners after `fetchAndShow` and before the `document.addEventListener("click", …)` handler that opens modals:
```javascript
  function closeModal() {
    currentModalPath = null;
    const dlg = document.getElementById("modal-container");
    if (dlg && dlg.open) dlg.close();
    const url = new URL(window.location.href);
    if (url.searchParams.has("modal")) {
      url.searchParams.delete("modal");
      history.pushState({}, "", url.toString());
    }
  }

  window.addEventListener("popstate", () => {
    const path = new URL(window.location.href).searchParams.get("modal");
    if (path) {
      fetchAndShow(path);
    } else {
      currentModalPath = null;
      const dlg = document.getElementById("modal-container");
      if (dlg && dlg.open) dlg.close();
    }
  });

  // Dismiss triggers inside modal content
  document.addEventListener("click", (event) => {
    if (event.target.closest("[data-modal-dismiss]")) {
      event.preventDefault();
      closeModal();
    }
  });

  // Esc key fires the dialog's 'cancel' event; sync URL.
  document.addEventListener("DOMContentLoaded", () => {
    const dlg = document.getElementById("modal-container");
    if (dlg) {
      dlg.addEventListener("cancel", (event) => {
        event.preventDefault();  // prevent default close so we control URL sync
        closeModal();
      });
      // Backdrop click: native <dialog> doesn't dismiss on backdrop; replicate.
      dlg.addEventListener("click", (event) => {
        if (event.target === dlg) closeModal();
      });
    }
  });

  window.closeModal = closeModal;
```
- [ ] **Step 2: Chromium verification**
1. From the prior task's setup (modal open via `window.openModal(...)`):
2. Click the `[data-modal-dismiss]` Cancel button (or the × in the modal header). **Expected:** modal closes, URL loses `?modal=…`, underlying overlay page intact and still polling build-status.
3. Open the modal again. Press Esc. **Expected:** same close behavior.
4. Open the modal again. Click on the backdrop outside the dialog content. **Expected:** same close behavior.
5. Open the modal again. Click browser back button. **Expected:** modal closes, URL clears.
6. Now click forward. **Expected:** modal re-opens with the same file's content.
- [ ] **Step 3: Commit**
```bash
git add l4d2web/l4d2web/static/js/modal-router.js
git commit -m "$(cat <<'EOF'
feat(modals): close, popstate, dismiss-click, Esc, backdrop-click

closeModal pops ?modal= from URL via pushState. popstate handler reacts
to back/forward by fetching or closing. [data-modal-dismiss] click,
native dialog 'cancel' (Esc), and backdrop click all funnel to
closeModal. window.closeModal exposed for callers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 7: `modal-router.js` — initial-load bootstrap
**Files:**
- Modify: `l4d2web/l4d2web/static/js/modal-router.js`
This makes refreshing on a `?modal=…` URL reopen the modal automatically — the headline feature.
- [ ] **Step 1: Add bootstrap on DOMContentLoaded**
Extend the existing `DOMContentLoaded` listener in `modal-router.js` (added in Task 6) so it also bootstraps from URL. Replace the block:
```javascript
  document.addEventListener("DOMContentLoaded", () => {
    const dlg = document.getElementById("modal-container");
    if (dlg) {
      dlg.addEventListener("cancel", (event) => {
        event.preventDefault();
        closeModal();
      });
      dlg.addEventListener("click", (event) => {
        if (event.target === dlg) closeModal();
      });
    }
    const initialPath = new URL(window.location.href).searchParams.get("modal");
    if (initialPath) {
      fetchAndShow(initialPath);
    }
  });
```
- [ ] **Step 2: Chromium verification**
1. Open the modal via `window.openModal('/overlays/<id>/files/edit?path=server.cfg')`.
2. Hit the browser refresh button. **Expected:** page reloads, modal re-opens automatically with the same file's content. URL retains `?modal=…`.
3. Copy the full URL. Open a new incognito window, log in, paste the URL. **Expected:** lands on the overlay page with the modal already open.
4. Negative: visit `/overlays/<id>` (no `?modal=`). **Expected:** modal does not open; underlying page renders normally.
- [ ] **Step 3: Commit**
```bash
git add l4d2web/l4d2web/static/js/modal-router.js
git commit -m "$(cat <<'EOF'
feat(modals): DOMContentLoaded bootstrap reopens modal from ?modal= URL

Refresh and share-link flows both work — the modal-state URL is the
canonical shareable artifact for "this overlay with this file open."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 8: CodeMirror re-init after HTMX swap
**Files:**
- Modify: `l4d2web/l4d2web/static/js/editor.js`
`editor.js` currently mounts CM6 once at `DOMContentLoaded`, so a `<textarea>` that arrives later via modal swap-in never gets an editor mounted. Fix: expose `initEditors(root)` and call it from an `htmx:afterSwap` listener.
- [ ] **Step 1: Refactor `init` to accept a root**
In `l4d2web/l4d2web/static/js/editor.js`, replace the existing `init` function (currently around lines 93-100) and the bootstrap-at-end (lines 101-105) with:
```javascript
function initEditors(root) {
  const scope = root || document;
  for (const ta of scope.querySelectorAll("textarea[data-editor-language]")) {
    mountOne(ta).catch(err => {
      console.error("[editor] mount failed", err);
      unhideTextarea(ta);
    });
  }
}

if (document.readyState === "loading") {
  document.addEventListener("DOMContentLoaded", () => initEditors(document));
} else {
  initEditors(document);
}

// Re-init editors that arrive via HTMX swap (modal content, etc.).
document.body.addEventListener("htmx:afterSwap", (event) => {
  if (event.target && event.target.id === "modal-content") {
    initEditors(event.target);
  }
});

// Expose for callers that need to re-mount imperatively.
if (window.__editor) {
  window.__editor.initEditors = initEditors;
}
```
- [ ] **Step 2: Chromium verification**
1. Open the editor modal via the URL flow from Task 7 (`/overlays/<id>?modal=/overlays/<id>/files/edit?path=server.cfg`).
2. **Expected:** CM6 renders inside the modal — syntax-highlighted content, NOT a raw textarea. Byte count matches actual content size. Language dropdown reflects auto-detected language (srccfg for .cfg, bash for .sh).
3. Type into the editor. **Expected:** edits are reflected; UI is responsive.
4. Close the modal, re-open. **Expected:** CM6 re-mounts cleanly each time. No duplicate editor instances visible (only one rendered).
5. Open dev tools → Console and confirm there are no errors mentioning `mount failed` or duplicate-init warnings.
- [ ] **Step 3: Commit**
```bash
git add l4d2web/l4d2web/static/js/editor.js
git commit -m "$(cat <<'EOF'
feat(editor): re-init CM6 on htmx:afterSwap into #modal-content

editor.js exposes initEditors(root) and listens for htmx:afterSwap so
editor textareas that arrive via modal swap get CM6 mounted. The
DOMContentLoaded path remains for first-paint mounting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 9: Wire file-tree edit triggers to use `window.openModal`
**Files:**
- Modify: `l4d2web/l4d2web/static/js/files-overlay.js` (specific code path; rest unchanged)
`files-overlay.js` currently populates the empty inline `#files-editor-modal` dialog when a file row is clicked. Replace that code path with a call to `window.openModal(editUrl)`. The save flow (also in this file) stays untouched.
- [ ] **Step 1: Locate the "open editor" entry point**
Run: `grep -n "files-editor-modal\|showModal\|filesEditor\|getValue\|files-editor-content" l4d2web/l4d2web/static/js/files-overlay.js`
Identify the function that handles a file-row click and currently calls `showModal()` on `#files-editor-modal`, plus the code that stuffs filename + content + language into the empty markup. That whole code path becomes a single `window.openModal(editUrl)` call.
- [ ] **Step 2: Replace the inline-dialog open path**
In that function, replace the block that populates and shows the inline dialog with:
```javascript
const editUrl = `/overlays/${overlayId}/files/edit?path=${encodeURIComponent(relPath)}`;
if (typeof window.openModal === "function") {
  window.openModal(editUrl);
} else {
  // Graceful fallback if modal-router didn't load: full-page navigation
  // still hits the same route and renders the standalone editor page.
  window.location.href = editUrl;
}
```
Delete the code that previously read the file via `/files/content` JSON endpoint and set `filenameInput.value`, the language dropdown, byte-count text, `controller.setValue(...)`, and called `showModal()`. The new route delivers all of that as server-rendered HTML.
Keep untouched:
- The **save** handler (POSTs to `/overlays/<id>/files/save` reading `window.__filesEditor.getValue()` — still works inside the modal because CM6 re-init from Task 8 sets `window.__filesEditor` on the new instance)
- The **delete** button handler (POSTs to `/overlays/<id>/files/delete`)
- The **download** link (now a server-rendered `<a href>` in the template)
- The rename hint, replace-file flow, and any other in-modal interactions — these continue to bind on the editor element inside `#modal-content` via the existing event delegation
- [ ] **Step 3: Chromium verification**
1. On `/overlays/<id>` (a `files` overlay's page), click an editable file (e.g. `server.cfg`).
2. **Expected:** URL updates to `?modal=/overlays/<id>/files/edit?path=server.cfg`. Modal opens with CM6 editor, content pre-filled, language auto-detected.
3. Edit content and click **Save**. **Expected:** save succeeds (network request to `/overlays/<id>/files/save` returns 200), file persists.
4. Refresh the page (still on the `?modal=` URL). **Expected:** modal reopens with the *saved* (updated) content.
5. Click Cancel. **Expected:** modal closes; URL loses `?modal=`.
6. Race test: click file A, then immediately click file B before A's swap arrives. **Expected:** modal ends in file B's state, not file A's.
- [ ] **Step 4: Commit**
```bash
git add l4d2web/l4d2web/static/js/files-overlay.js
git commit -m "$(cat <<'EOF'
feat(files): file-row click opens editor via URL-addressable modal

files-overlay.js no longer fetches /files/content JSON and populates
the inline <dialog>; it calls window.openModal(<edit-url>) which the
modal-router handles end-to-end. Save flow unchanged — CM6 re-init on
htmx:afterSwap re-binds window.__filesEditor on the new instance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Task 10: Remove the dead inline editor dialog + final verification
**Files:**
- Modify: `l4d2web/l4d2web/templates/overlay_detail.html` (delete lines 164-228 — the `{% if files_can_edit %} <dialog id="files-editor-modal">…</dialog> {% endif %}` block)
- Modify: `l4d2web/tests/test_url_addressable_modals.py` (optional: add a test that the inline dialog is gone)
- [ ] **Step 1: Write a "dialog removed" assertion test**
Append to `l4d2web/tests/test_url_addressable_modals.py`:
```python
def test_overlay_detail_no_longer_includes_inline_editor_dialog(tmp_path, monkeypatch):
    client, overlay_id = _auth_client_with_files_overlay(tmp_path, monkeypatch, "no-inline-dialog.db")
    response = client.get(f"/overlays/{overlay_id}")
    text = response.get_data(as_text=True)
    assert response.status_code == 200
    # The inline editor dialog is replaced by the URL-addressable route.
    assert 'id="files-editor-modal"' not in text
    # The persistent modal-container slot from base.html *is* present.
    assert 'id="modal-container"' in text
```
- [ ] **Step 2: Run test to verify it fails**
Run: `cd l4d2web && uv run pytest tests/test_url_addressable_modals.py::test_overlay_detail_no_longer_includes_inline_editor_dialog -v`
Expected: FAIL — `id="files-editor-modal"` is still in `overlay_detail.html`.
- [ ] **Step 3: Remove the inline dialog**
In `l4d2web/l4d2web/templates/overlay_detail.html`, delete lines 164-228 inclusive — the entire `{% if files_can_edit %} … <dialog id="files-editor-modal"> … </dialog> … {% endif %}` block.
- [ ] **Step 4: Run all backend tests for regressions**
Run: `cd l4d2web && uv run pytest tests/ -v --tb=short -q`
Expected: all tests pass. The new assertion passes; nothing else regresses.
- [ ] **Step 5: Run the full Chromium verification matrix from the spec**
Walk through all 10 checks from the Verification section of `docs/superpowers/specs/2026-05-17-url-addressable-modals-design.md`:
1. Direct link works as full page — paste `/overlays/<id>/files/edit?path=server.cfg` in a new tab, no `?modal=`, full-page editor renders, save works.
2. Modal open from overlay — click edit in the file tree, modal opens, URL gets `?modal=`.
3. Refresh in modal state — F5 reopens modal on the same overlay with build-status polling resumed.
4. Share URL — paste in incognito, lands with modal open.
5. Back button — closes modal, URL clears, underlying page intact.
6. Forward button — reopens modal with same content.
7. Esc to close — URL syncs.
8. Race on rapid clicks — final state is the last-clicked file.
9. No HTMX poll misclassification — build-status polls don't carry `HX-Modal:1`.
10. Existing inline dialogs unaffected — rename, delete, new-folder, conflict-resolution still open from `[data-modal-open]` triggers (these don't use `[data-modal]`).
- [ ] **Step 6: Commit**
```bash
git add l4d2web/l4d2web/templates/overlay_detail.html l4d2web/tests/test_url_addressable_modals.py
git commit -m "$(cat <<'EOF'
feat(modals): remove inline editor dialog, complete pilot migration

overlay_detail.html no longer carries the empty <dialog
id="files-editor-modal"> placeholder — content lives at
/overlays/<id>/files/edit?path=… and renders via the URL-addressable
modal pipeline. Pilot complete; spec follow-ups (save→hx-post, other
modals, server-side URL composition) deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EOF
)"
```
---
## Self-review notes (against the spec)
- **Architecture (Approach C):** Tasks 5, 6, 7 implement the JS module exactly as specified — ~60 lines including comments and exposed API.
- **Layout switch via `HX-Modal: 1` header:** Task 1 implements it as a context processor; Task 1's third test explicitly guards against misclassifying HTMX's built-in `HX-Request` header.
- **`<dialog>` for show/hide:** Task 4 adds the persistent slot; Task 5 uses `showModal()`; Tasks 6/7 use `close()` and native `cancel` event.
- **Editor as a real page:** Tasks 2 + 3 cover this — separate template, separate route, dual-mode rendering.
- **CodeMirror re-init:** Task 8 covers `initEditors(root)` exposure + `htmx:afterSwap` listener.
- **Save flow stays AJAX:** Task 9 preserves the save path while replacing the *open* path.
- **Race guard, dismiss attrs, Esc, backdrop click, popstate, bootstrap:** all in Tasks 5-7.
- **Out of scope items** (binary replace, other modals, save-flow migration, server-side URL composition): not touched by any task — matches the spec's deferral.
No placeholders, no `TODO`s, no "implement appropriately." Every step has exact paths and exact code.


@ -1,350 +0,0 @@
# Enable srcds log streaming + temp UDP capture listener
## Context
We want to start gathering data about what actually happens on our L4D2
servers (round boundaries, kills, team selection, lobby arrivals) so we can
later build round/match tracking and visualizations. The Source engine's
HL Log Standard UDP streaming (`logaddress_add`) is the right primary
source — it's built into srcds, no plugin required, well-documented (see
[HL Log Standard](https://developer.valvesoftware.com/wiki/HL_Log_Standard)).
This change does **two** things:
1. **Make every managed L4D2 server stream its logs** to a known UDP
endpoint, by auto-injecting `log on`, `mp_logdetail 3`, and
`logaddress_add <addr>` into generated `server.cfg` — alongside
`rcon_password`, where users can't accidentally break it.
2. **Stand up a deliberately disposable UDP listener** in the web app
that writes raw log lines to flat files (one per source address),
so we can observe a few days of real traffic before committing to
any schema or reducer design.
The listener is explicitly a Phase-1 *capture-only* tool. It does **not**
parse, reduce, store in DB, or render anything. That's the next plan,
once we have evidence of what L4D2 actually emits on our servers.
## Scope (in / out)
**In scope**
- Inject 3 cvars into the generated `server.cfg` at facade level.
- Config-driven listener address (env var, sensible default).
- UDP listener daemon thread, sibling of `live_state_poller`.
- Flat-file capture, one file per `(srcip, srcport)`.
- Dev-server integration (capture dir under `LEFT4ME_ROOT`).
- Production wiring (capture dir under `/var/lib/left4me/`, systemd
unit changes if any).
- Minimal smoke test for the listener.
**Out of scope (Phase 2+)**
- Parsing log lines into structured events.
- DB schema (`RawLogLine`, `LogEvent`, `MatchSession`, `Round`, etc.).
- Mapping source addr → `Server` row reliably (we have the data
in the flat-file name; we don't *need* to resolve it yet).
- `sv_logsecret` authentication (single-host loopback, defer).
- Any UI.
## Design
### 1. Auto-injected cvars in `server.cfg`
**File:** `l4d2web/l4d2web/services/l4d2_facade.py` (around line 41-48,
where `rcon_password` is appended after the user blueprint config).
After the existing `rcon_password` append, add:
```python
config_lines.append("log on")
config_lines.append("mp_logdetail 3")
log_addr = current_app.config["LOG_LISTENER_ADDR"] # e.g. "127.0.0.1:27800"
config_lines.append(f"logaddress_add {log_addr}")
```
Notes:
- Order matters: cvars must come *after* anything in the user
blueprint so users can't override them.
- `log on` is idempotent; safe to re-issue.
- `logaddress_add` is *additive* — re-running it just re-registers
the same destination. srcds tolerates this.
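The ordering rule can be captured in a small helper that is easy to unit-test. `append_log_cvars` is a hypothetical name; the plan itself appends the lines inline in the facade.

```python
def append_log_cvars(config_lines: list[str], log_addr: str) -> list[str]:
    """Return a copy of config_lines with the three logging cvars appended last.

    Hypothetical helper for illustration. Appending at the end means any
    conflicting line from the user blueprint is executed first and then
    superseded.
    """
    return [
        *config_lines,
        "log on",
        "mp_logdetail 3",
        f"logaddress_add {log_addr}",
    ]


# A blueprint that tries to disable logging is superseded by the trailing lines.
lines = append_log_cvars(
    ['hostname "left4me"', "log off", 'rcon_password "x"'],
    "127.0.0.1:27800",
)
```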
### 2. Listener address configuration
**File:** `l4d2web/l4d2web/config.py`
Add to `DEFAULT_CONFIG` (and the env-var loader):
```python
"LOG_LISTENER_ADDR": "127.0.0.1:27800", # what srcds logs to
"LOG_LISTENER_BIND": "127.0.0.1:27800", # what our listener binds
"LOG_LISTENER_ENABLED": True,
"LOG_CAPTURE_DIR": "/var/lib/left4me/captures", # overridden in dev
```
Port **27800** chosen to avoid:
- SRCDS server range 27015-27050
- Steam client range 27000-27015
- Steam master server range 27010-27050
Override env vars: `LEFT4ME_LOG_LISTENER_ADDR`, `LEFT4ME_LOG_LISTENER_BIND`,
`LEFT4ME_LOG_CAPTURE_DIR`.
**Dev override:** `scripts/dev-server.py` already sets
`LEFT4ME_ROOT=.tmp/dev-server` — extend it to also set
`LEFT4ME_LOG_CAPTURE_DIR=$LEFT4ME_ROOT/captures` so dev captures live
under `.tmp/` and don't pollute `/var/lib/left4me`.
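One way the env-var overlay and address parsing could look is sketched below. `load_log_config` and `split_host_port` are hypothetical names; the real loader in `config.py` is not shown in this plan and may be structured differently.

```python
import os

DEFAULTS = {
    "LOG_LISTENER_ADDR": "127.0.0.1:27800",
    "LOG_LISTENER_BIND": "127.0.0.1:27800",
    "LOG_CAPTURE_DIR": "/var/lib/left4me/captures",
}


def load_log_config(environ=os.environ) -> dict[str, str]:
    """Overlay LEFT4ME_-prefixed environment variables on the defaults."""
    return {
        key: environ.get(f"LEFT4ME_{key}", default)
        for key, default in DEFAULTS.items()
    }


def split_host_port(addr: str) -> tuple[str, int]:
    """Split 'host:port' on the last colon and validate the port range."""
    host, _, port_text = addr.rpartition(":")
    port = int(port_text)  # raises ValueError for non-numeric ports
    if not host or not 0 < port < 65536:
        raise ValueError(f"invalid listener address: {addr!r}")
    return host, port
```

Validating the address at load time surfaces a malformed override at startup instead of inside the listener thread.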
### 3. The listener
**New file:** `l4d2web/l4d2web/services/log_listener.py`
Pattern: copy the daemon-thread shape from
`live_state_poller.py:230-245`. Single global guard; one thread.
Sketch:
```python
import socket
import threading
from pathlib import Path

_started = False  # module-level guard (match poller pattern)


def start_log_listener(app):
    global _started
    if not app.config["LOG_LISTENER_ENABLED"]:
        return
    if _started:
        return
    _started = True
    bind = app.config["LOG_LISTENER_BIND"]
    capture_dir = Path(app.config["LOG_CAPTURE_DIR"])
    capture_dir.mkdir(parents=True, exist_ok=True)
    t = threading.Thread(
        target=_listener_loop,
        args=(bind, capture_dir),
        name="left4me-log-listener",
        daemon=True,
    )
    t.start()


def _listener_loop(bind: str, capture_dir: Path) -> None:
    host, port = bind.rsplit(":", 1)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, int(port)))
    while True:
        data, (srcip, srcport) = sock.recvfrom(4096)
        # srcds log packets: 0xFF 0xFF 0xFF 0xFF 'R' or 'S', then body, then trailing 0x00 0x0A
        # See HL Log Standard. We strip nothing on first pass; write raw.
        path = capture_dir / f"{srcip}-{srcport}.log"
        with path.open("ab") as f:
            f.write(data)
```
Wiring (`l4d2web/l4d2web/app.py:157`, alongside the poller):
```python
if not app.config.get("TESTING"):
    start_live_state_poller(app)
    start_log_listener(app)
```
### 4. UDP packet structure note
srcds log packets aren't bare text — they have a small header:
- 4 bytes of `0xFF` (out-of-band marker)
- 1 byte type: `'R'` (no secret) or `'S'` (with `sv_logsecret`)
- If `'S'`: 4-byte little-endian secret
- Body: ASCII log line including the `L mm/dd/yyyy ...` prefix
- Trailing `0x00 0x0A` (null + newline)
For capture-only we just **write the raw bytes**; later parsers will
strip the header. The header is also useful diagnostic info (it tells
us whether `sv_logsecret` made it through).
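A later parser's first step could look like the following sketch. It interprets only the 4-byte marker and the type byte as described above; for `'S'` packets it leaves the secret in place instead of guessing at its encoding. `split_log_packet` is a hypothetical name.

```python
OOB_MARKER = b"\xff\xff\xff\xff"


def split_log_packet(packet: bytes) -> tuple[str, bytes]:
    """Split a captured datagram into (type, remainder).

    Only the marker and the type byte are interpreted. For 'S' packets the
    remainder still starts with the secret; this sketch does not decode it.
    """
    if len(packet) < 5 or not packet.startswith(OOB_MARKER):
        raise ValueError("not an out-of-band log packet")
    kind = chr(packet[4])
    if kind not in ("R", "S"):
        raise ValueError(f"unexpected packet type {kind!r}")
    # Drop trailing NUL/newline bytes in whatever order they appear.
    remainder = packet[5:].rstrip(b"\x00\n")
    return kind, remainder
```

Because the capture files keep the raw bytes, a function like this can be written and revised after real traffic has been inspected.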
### 5. Capture file format
One file per `(srcip, srcport)`, name `<ip>-<port>.log`, append mode,
unbuffered byte writes. Rotation is **out of scope** — these are
short-lived (days), and operators can `rm` them manually. If a file
gets surprisingly large, that itself is data.
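The per-source naming and raw append can be exercised without srcds by sending one datagram over loopback. This is a sketch for a smoke test, not the listener itself; `capture_one` and `loopback_roundtrip` are hypothetical names, and the listener binds an OS-assigned port so it cannot collide with 27800.

```python
import socket
import tempfile
from pathlib import Path


def capture_one(sock: socket.socket, capture_dir: Path) -> Path:
    """Receive one datagram and append it, unmodified, to its per-source file."""
    data, (srcip, srcport) = sock.recvfrom(4096)
    path = capture_dir / f"{srcip}-{srcport}.log"
    with path.open("ab") as f:
        f.write(data)
    return path


def loopback_roundtrip(payload: bytes) -> bytes:
    """Send payload to a throwaway listener on 127.0.0.1; return the captured bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as listener, \
                socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
            listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
            listener.settimeout(2.0)         # never hang a test run
            sender.sendto(payload, listener.getsockname())
            path = capture_one(listener, Path(tmp))
            return path.read_bytes()
```

The file name is derived from the sender's ephemeral port, which is why each srcds process ends up with its own capture file.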
### 6. Restart implications
Cvars in `server.cfg` are read by srcds at instance startup
(`instances.py:54,87`). Existing running servers **will not pick up
the new log destination** until they are restarted via
`l4d2ctl initialize` or a process restart. Document this in the
verification section — don't try to be clever about live-applying.
## Critical files
**To modify:**
- `l4d2web/l4d2web/services/l4d2_facade.py` — inject 3 cvars after `rcon_password`
- `l4d2web/l4d2web/config.py` — add 4 new config keys with env overrides
- `l4d2web/l4d2web/app.py` — start listener thread next to poller
- `scripts/dev-server.py` — set `LEFT4ME_LOG_CAPTURE_DIR` under `.tmp/`
**To create:**
- `l4d2web/l4d2web/services/log_listener.py` — UDP listener thread
**To reference (read-only):**
- `l4d2web/l4d2web/services/live_state_poller.py:230-245` — thread pattern
- `l4d2web/l4d2web/services/l4d2_facade.py:41-48` — rcon_password injection pattern
- `l4d2host/l4d2host/instances.py:54,87` — confirms server.cfg is generated at init time
- `deploy/files/usr/local/lib/systemd/system/left4me-web.service` — may need a
read-write path entry if `ReadWritePaths=` is set restrictively (check before
modifying)
## Verification
End-to-end smoke test, on the dev box first:
1. **Static checks**
```bash
cd /Users/mwiegand/Projekte/left4me
uv run --project l4d2web ruff check l4d2web
uv run --project l4d2web pytest l4d2web/tests -k "facade or config" -x
```
2. **Cvar injection unit test**
- Add a test in `l4d2web/tests/` that calls `build_server_spec_payload`
and asserts the generated config lines include `log on`,
`mp_logdetail 3`, and a `logaddress_add` line whose target matches
`LOG_LISTENER_ADDR`.
3. **Listener smoke test**
- Run dev server: `scripts/dev-server.py`
- In another shell, fake an srcds log packet:
```bash
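# Header and body must be a single format string: printf ignores a second
# string argument when the format contains no conversion specifier.
printf '\xff\xff\xff\xffRL 05/19/2026 - 14:23:11: "Test<1><STEAM_1:0:1><>" connected, address "127.0.0.1:27015"\x00\x0a' \
  | nc -u -w1 127.0.0.1 27800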
```
- Confirm `ls .tmp/dev-server/captures/` shows a `127.0.0.1-<srcport>.log`.
- Confirm the file contains the bytes (use `xxd` to inspect the
header + body shape).
4. **Live end-to-end on one production server**
- Pick **one** server (least-busy), trigger a re-init via the web
UI so `server.cfg` is regenerated with the new cvars.
- Verify via RCON: `logaddress_list` should show our address.
- Connect to the server, run around. Confirm a file appears in
`/var/lib/left4me/captures/` with the server's source IP.
- `tail -f` and verify HL-Log-Standard lines (`L ... : "..."
entered the game`, `Loading map`, `World triggered "round_start"`).
5. **What we are explicitly NOT verifying yet**
- Parsing correctness — there's no parser.
- Reliable server-id mapping — we have srcip/srcport in the
filename, that's enough for now.
- Long-running stability past a few days — listener is temp.
## Post-deploy verification findings (2026-05-20)
The plan above describes what we INTENDED to ship. This section
documents what we ACTUALLY learned, including a wrong intermediate
conclusion that was later corrected.
### What worked
Listener deployed and confirmed bound (`ss -ulnp` showed
`gunicorn:28000`). Servers re-initialized; `server.cfg` had
`log on` / `logaddress_add 127.0.0.1:28000` after `rcon_password`.
Listener proven healthy end-to-end with a local `nc` probe writing
a capture file.
### What didn't (and why)
Captures stayed empty even during active play. Symptoms:
1. `logaddress_list` via RCON → `1 entry: 127.0.0.1:28000`
2. `log` via RCON → `currently logging to: file, console, udp`
3. Local `.log` files in
`/var/lib/left4me/runtime/<n>/merged/left4dead2/logs/` grow
normally; rich gameplay events show in `journalctl -u
left4me-server@<n>` (bot connect, team join, spawn, character
pick — full HL-Log-Standard verbosity) ✓
4. `tcpdump -i lo udp port 28000` during rcon say/echo/status bursts →
**0 packets** ✗
5. `tcpdump -i any host 127.0.0.1 and udp` → still 0 ✗
6. Toggle `log off` / `log on` live, `sv_logflush 1` → no effect ✗
7. Tested `SocketBindAllow=udp:32768-60999` drop-in (suspected the
ephemeral source-port bind was being rejected) → still 0
packets. Drop-in rolled back ✗
8. `strace -p <srcds_linux> -e sendto,sendmsg` → **zero sendto
calls toward the destination** ✗
A premature conclusion was reached and committed as `46bba0d`
("L4D2 logaddress UDP emit is dead"). User pushed back, asked for
verification. Research found multiple production HLstatsX:CE
instances running L4D2 stats successfully — disproving the
engine-bug theory.
### Real cause
Re-registered logaddress at a non-loopback IP (`172.30.0.5:28000`
on the wireguard interface) and reran the test. **8 packets in 12
seconds**, each a properly framed HL-Log-Standard payload —
including `Console<0>" say "wg-test-1"`, all the rcon-from lines,
and live poller status calls.
**The Source engine silently drops `logaddress` destinations in
`127.0.0.0/8`.** Registration succeeds (data-structure op), the
cvar API reports "logging to: udp", but the engine's send loop
filters out loopback destinations and never calls sendto for them.
This is presumably an anti-self-loop / anti-amplification measure
that I have not seen documented in any Valve or community source.
Everyone else using `logaddress` for stat tracking puts the
collector on a *separate host* or at minimum a different interface
IP — they never hit this. We're the unusual case of co-locating
the listener with the gameserver and naively pointing at
`127.0.0.1`.
### Fix
- `LOG_LISTENER_BIND` default → `0.0.0.0:28000` (accept on any
interface).
- `LOG_LISTENER_ADDR` default → `""` (empty). Production env file
MUST set this to a non-loopback IP. Dev gets a safe no-op
(cfg injector skips emitting log cvars when the address is
empty).
- Production `web.env`: set
`LOG_LISTENER_ADDR=<host-non-loopback-ip>:28000`. The host's
public interface IP works; the kernel's same-host routing
optimization keeps the actual traffic on `lo` internally, but the
*destination IP* in the packet header is non-loopback so Source's
filter passes.
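The empty-address no-op and the loopback pitfall can be sketched together as a small guard in the cfg injector. This is an illustrative sketch only: `log_cvar_lines` is a hypothetical helper name, and rejecting loopback addresses outright (rather than just warning) is an assumption, not something the fix above mandates.

```python
import ipaddress


def log_cvar_lines(listener_addr: str) -> list[str]:
    """Return the server.cfg lines for UDP logging, or [] when disabled."""
    addr = listener_addr.strip()
    if not addr:
        return []  # dev default: emit no log cvars at all
    host, _, port = addr.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected <ip>:<port>, got {addr!r}")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # a hostname; it cannot be classified without resolving it
    if ip is not None and ip.is_loopback:
        # Registration would succeed, but no packets would ever be sent.
        raise ValueError(f"loopback logaddress {addr!r} is silently dropped by srcds")
    return ["log on", f"logaddress_add {addr}"]
```

Failing loudly at config-generation time would have surfaced the problem before any server was re-initialized; a softer variant could log a warning and still emit the lines.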
### Implications for the roadmap (revised)
- The vanilla UDP logaddress mechanism works in L4D2. **No SourceMod
bridge is required for Phase 1** — we were going to add one to
work around a non-existent engine bug.
- `mp_logdetail` was correctly removed from the cfg injection: it
is a CS-only cvar; L4D2 prints `Unknown command` at startup.
- A future SourceMod plugin in `l4d2host/` may still be useful for
L4D-specific events the engine doesn't auto-log (kill events with
weapon detail, special-infected spawns) — see
`SMILEWHENYOUDIE/HLstatsX-CE`'s `superlogs-l4d.sp` for prior
art. That's a Phase 2 enhancement, not a Phase 1 prerequisite.
### Lessons (filed under "validate before implementing")
- I anchored hard on "L4D2 engine bug" after disproving two
reasonable alternative hypotheses (hibernation, SocketBindAllow).
The third alternative — destination-IP filter — wasn't tested
before committing the wrong-conclusion docs. Should have tried a
non-loopback destination as the very first test.
- The journal evidence of rich game events going to console/file
was real and significant, but I misread it as proof of an
engine-level UDP stub instead of evidence the engine's line
generator works fine and only the *destination filter* was at
play.
- The user's push-back ("maybe do research if it's really broken?")
forced the right next step. Worth internalizing: a single
contrarian search query found HLstatsX-on-L4D2 production
instances within seconds and would have prevented the
wrong-conclusion commit.
## Follow-ups (separate plans)
- **After a few days of capture data**: design
`RawLogLine` / `LogEvent` / `MatchSession` / `Round` schema.
- Build a reducer.
- Consider a SourceMod bridge for richer L4D2-specific events
(kill weapons, special-infected spawns, finale outcomes).
- `sv_logsecret` if the listener is ever moved off-host.


@ -1,540 +0,0 @@
# Server detail — console + log autoscroll Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Make the Log and Console transcripts on `/servers/<id>` stay pinned to their bottom on initial load, tab activation, and command append; cap the inline Console to the 20 newest entries while the modal keeps all 50.
**Architecture:** Single opt-in `data-autoscroll` attribute on any scroll-pinned region. One helper (`scrollAutoscrollTargets`) handles root, descendants, *and* ancestors so HTMX `beforeend` swaps that fire `htmx:load` on the inserted child still find and scroll the parent transcript. `tabs.js` calls the helper after activating a tab. `modals.js` already dispatches `modal:opened` on `showModal()`, so the Console modal hooks that event to scroll on first open.
**Tech Stack:** Flask + Jinja2 templates, vanilla JS, HTMX, Playwright for e2e, Claude-in-Chrome for live-browser verification.
**Reference spec:** `docs/superpowers/specs/2026-05-20-server-console-log-autoscroll-design.md`
---
## File map
| Path | Action | Responsibility |
|------|--------|----------------|
| `l4d2web/l4d2web/routes/page_routes.py` | modify (~L318-345) | Add `console_history_overview = console_history[-20:]` to the render context |
| `l4d2web/l4d2web/templates/server_detail.html` | modify (L60-73, L101, L111-117, L159) | Inline loop uses `console_history_overview`; add `data-autoscroll` to both `<pre class="log-stream">`; Console modal transcript wires `modal:opened` → scroll |
| `l4d2web/l4d2web/static/js/console-history.js` | modify (L159-179) | Rename and generalise `scrollConsolesToBottom` → `scrollAutoscrollTargets` with ancestor walk; expose on `window` |
| `l4d2web/l4d2web/static/js/tabs.js` | modify (~L9-19) | After `activateTab` toggles `hidden`, scroll any `[data-autoscroll]` in the newly-active pane |
| `l4d2web/tests/test_servers.py` | extend | Server-side: assert inline pane caps at 20 and modal keeps 50 |
| `l4d2web/tests/e2e/test_server_detail.py` | extend | E2E: Console tab pinned to bottom on activation; Console pane pinned to bottom after command submit |
---
### Task 1: Server-side slice for inline Console history
**Files:**
- Modify: `l4d2web/l4d2web/routes/page_routes.py:318-347`
- Test: `l4d2web/tests/test_servers.py`
- [ ] **Step 1: Write the failing test**
`test_servers.py` uses the `user_client_with_blueprints` fixture (returns `(client, data)` where `data` carries `user_id` and `blueprint_id`) plus direct DB writes via `session_scope()`. Mirror that pattern:
Append to `l4d2web/tests/test_servers.py`:
```python
def test_server_detail_inline_console_caps_at_20_modal_keeps_all(user_client_with_blueprints) -> None:
    """When > 20 CommandHistory rows exist for the server, the inline
    Console transcript renders only the 20 newest (chronological order),
    while the modal transcript renders the full set (capped at the
    route-level 50)."""
    import re
    from datetime import UTC, datetime, timedelta

    from l4d2web.models import CommandHistory, Server

    client, data = user_client_with_blueprints
    with session_scope() as db:
        server = Server(
            user_id=data["user_id"],
            blueprint_id=data["blueprint_id"],
            name="console-cap",
            port=27123,
            rcon_password="x",
        )
        db.add(server)
        db.flush()
        sid = server.id
        for i in range(35):
            db.add(CommandHistory(
                user_id=data["user_id"],
                server_id=sid,
                command=f"cmd_{i:02d}",
                reply=f"reply {i}",
                is_error=False,
                created_at=datetime.now(UTC) - timedelta(minutes=40 - i),
            ))

    resp = client.get(f"/servers/{sid}")
    assert resp.status_code == 200
    body = resp.get_data(as_text=True)

    inline_match = re.search(
        rf'<div id="console-transcript-inline-{sid}"[^>]*>(.*?)</div>\s*<form',
        body,
        re.DOTALL,
    )
    assert inline_match, "inline transcript container not found"
    inline_lines = inline_match.group(1).count('class="console-line')
    assert inline_lines == 20, f"inline expected 20, got {inline_lines}"

    modal_match = re.search(
        rf'<div id="console-transcript-modal-{sid}"[^>]*>(.*?)</div>\s*<form',
        body,
        re.DOTALL,
    )
    assert modal_match, "modal transcript container not found"
    modal_lines = modal_match.group(1).count('class="console-line')
    assert modal_lines == 35, f"modal expected 35, got {modal_lines}"
```
- [ ] **Step 2: Run the test to verify it fails**
```bash
cd l4d2web && pytest tests/test_servers.py::test_server_detail_inline_console_caps_at_20_modal_keeps_all -v
```
Expected: FAIL (inline returns 35, not 20)
- [ ] **Step 3: Modify the route**
In `l4d2web/l4d2web/routes/page_routes.py`, locate `server_detail` (~L306) and the `console_history = list(reversed(...))` block (~L318-330). After it, add:
```python
console_history_overview = console_history[-20:]
```
Then in the `return render_template("server_detail.html", …)` call (~L335-347), add the new kwarg:
```python
return render_template(
    "server_detail.html",
    server=server,
    blueprint=blueprint,
    connect_host=connect_host,
    file_tree_root_entries=file_tree_root_entries,
    file_tree_truncated=file_tree_truncated_count > 0
    if file_tree_root_entries is not None
    else False,
    file_tree_truncated_count=file_tree_truncated_count,
    console_history=console_history,
    console_history_overview=console_history_overview,
    **ctx,
)
```
- [ ] **Step 4: Update the template to use the new variable**
In `l4d2web/l4d2web/templates/server_detail.html` at the **inline** Console pane (~L65-71, inside `<div role="tabpanel" data-tab="console">`):
Change:
```jinja
{% for h in console_history %}
```
to:
```jinja
{% for h in console_history_overview %}
```
Leave the **modal** Console transcript (~L112-116) iterating over `console_history` unchanged.
- [ ] **Step 5: Run the test to verify it passes**
```bash
cd l4d2web && pytest tests/test_servers.py::test_server_detail_inline_console_caps_at_20_modal_keeps_all -v
```
Expected: PASS
- [ ] **Step 6: Commit**
```bash
git add l4d2web/l4d2web/routes/page_routes.py \
l4d2web/l4d2web/templates/server_detail.html \
l4d2web/tests/test_servers.py
git commit -m "feat(server-detail): cap inline console to 20 newest; modal keeps 50"
```
---
### Task 2: Generalise `scrollConsolesToBottom` to walk ancestors
**Files:**
- Modify: `l4d2web/l4d2web/static/js/console-history.js:159-179`
The current helper only matches `root` and its descendants. When `htmx:load` fires after `hx-swap="beforeend"`, `event.detail.elt` is the newly inserted child line — neither it nor its descendants match `[data-autoscroll]`, so the transcript never scrolls. Adding an ancestor walk fixes this case without affecting the existing one.
- [ ] **Step 1: Rewrite the helper**
Replace lines 159-179 of `l4d2web/l4d2web/static/js/console-history.js` with:
```js
function scrollAutoscrollTargets(root) {
  if (!root) return;
  const targets = [];
  // Case 1: root itself opts in.
  if (root.matches && root.matches("[data-autoscroll]")) {
    targets.push(root);
  }
  // Case 2: descendants opt in.
  if (root.querySelectorAll) {
    root.querySelectorAll("[data-autoscroll]").forEach((el) => targets.push(el));
  }
  // Case 3: neither, so walk up. Handles htmx:load firing with the inserted
  // child as the root after hx-swap="beforeend" on a console line.
  if (targets.length === 0 && root.closest) {
    const up = root.closest("[data-autoscroll]");
    if (up) targets.push(up);
  }
  targets.forEach((el) => {
    el.scrollTop = el.scrollHeight;
  });
}

// Expose for tabs.js (and any future cross-module consumer). The script
// is `defer`red in base.html, so it runs before DOMContentLoaded and the
// global is defined by the time tabs.js's DCL-deferred initStrips runs.
window.scrollAutoscrollTargets = scrollAutoscrollTargets;

document.addEventListener("DOMContentLoaded", () => {
  scrollAutoscrollTargets(document);
  bindAllConsoleForms(document);
});

document.addEventListener("htmx:load", (event) => {
  scrollAutoscrollTargets(event.detail.elt);
  bindAllConsoleForms(event.detail.elt);
});
```
That is: rename the function, add the third (ancestor-walk) case, expose on `window`, and update the two listeners to call the new name. Behavior preserved for the existing two cases.
- [ ] **Step 2: Smoke-verify in a browser**
Start the dev server (or use a running one). Open `/servers/<id>` and run in the devtools console:
```js
typeof window.scrollAutoscrollTargets
```
Expected: `"function"`.
Then open a Console tab that already has more than `clientHeight` of content. The transcript will still sit at the top at this point: `tabs.js` doesn't call the helper yet (that's Task 3). The smoke check here is only that the helper is defined and reachable.
- [ ] **Step 3: Commit**
```bash
git add l4d2web/l4d2web/static/js/console-history.js
git commit -m "feat(console): scrollAutoscrollTargets walks ancestors; expose on window"
```
---
### Task 3: `tabs.js` — scroll on tab activation
**Files:**
- Modify: `l4d2web/l4d2web/static/js/tabs.js:9-19`
- [ ] **Step 1: Edit `activateTab`**
In `l4d2web/l4d2web/static/js/tabs.js`, at the end of the `activateTab(strip, name)` function, after `strip.dataset.activeTab = name;`, append:
```js
  // Pin any scroll-locked regions (log streams, console transcripts) in
  // the newly-visible pane to the bottom. While the pane was hidden,
  // their scrollHeight was 0 so previous appends couldn't anchor.
  const activePane = strip.querySelector('[role="tabpanel"]:not([hidden])');
  if (activePane && window.scrollAutoscrollTargets) {
    window.scrollAutoscrollTargets(activePane);
  }
```
The full block now looks like:
```js
function activateTab(strip, name) {
  strip.querySelectorAll('[role="tab"]').forEach((t) => {
    const on = t.dataset.tab === name;
    t.setAttribute("aria-selected", on ? "true" : "false");
    t.tabIndex = on ? 0 : -1;
  });
  strip.querySelectorAll('[role="tabpanel"]').forEach((p) => {
    p.hidden = p.dataset.tab !== name;
  });
  strip.dataset.activeTab = name;
  const activePane = strip.querySelector('[role="tabpanel"]:not([hidden])');
  if (activePane && window.scrollAutoscrollTargets) {
    window.scrollAutoscrollTargets(activePane);
  }
}
```
- [ ] **Step 2: Add `data-autoscroll` to the log-stream elements**
In `l4d2web/l4d2web/templates/server_detail.html`:
- Inline log (~L61):
```jinja
<pre class="log-stream" data-autoscroll data-sse-url="/servers/{{ server.id }}/logs/stream"></pre>
```
- Modal log (~L101):
```jinja
<pre class="log-stream tall" data-autoscroll data-sse-url="/servers/{{ server.id }}/logs/stream"></pre>
```
- Modal job-log (~L159):
```jinja
<pre class="log-stream tall" data-autoscroll data-sse-url="/jobs/{{ latest_job.id }}/stream"></pre>
```
The Console transcripts already carry `data-autoscroll` and don't need editing.
- [ ] **Step 3: Live-browser verification**
This is the moment to confirm the user-visible bug is fixed. Start (or reuse) the dev server and seed > 20 console rows for `demo-server` (the dev seed has only 9; bring that to 30+):
```bash
sqlite3 .tmp/dev-server/l4d2web.db "
WITH RECURSIVE seq(n) AS (SELECT 1 UNION ALL SELECT n+1 FROM seq WHERE n<30)
INSERT INTO command_history (user_id, server_id, command, reply, is_error, created_at)
SELECT 1, 1, 'verify_cmd_' || printf('%02d', n),
'reply ' || n, 0, datetime('now', '-'||(35-n)||' minutes') FROM seq;"
```
Then in a browser, log into `/login` as `dev` / `devdevdev`, open `/servers/1`, click the **Console** tab, and run in devtools:
```js
(() => {
  const t = document.querySelector('[id^="console-transcript-inline-"]');
  return {
    scrollTop: t.scrollTop,
    scrollHeight: t.scrollHeight,
    clientHeight: t.clientHeight,
    bottomDistance: t.scrollHeight - t.scrollTop - t.clientHeight,
    inline_lines: t.querySelectorAll('.console-line').length,
  };
})();
```
Expected: `bottomDistance < 2` (pinned to bottom), `inline_lines == 20`.
Same exercise on the **Log** tab: switch to Console, then back to Log; the SSE-streamed log should be pinned to bottom. If the log stream has no content yet (server isn't running and never has been), this assertion vacuously passes (`scrollHeight === clientHeight`).
- [ ] **Step 4: Commit**
```bash
git add l4d2web/l4d2web/static/js/tabs.js \
l4d2web/l4d2web/templates/server_detail.html
git commit -m "feat(server-detail): pin transcripts/logs to bottom on tab activation"
```
---
### Task 4: Pin Console-modal transcript on `modal:opened`
**Files:**
- Modify: `l4d2web/l4d2web/templates/server_detail.html:105-120` (Console modal)
When the Console modal opens via `data-inline-modal-open="console-modal"`, the dialog goes from `display:none` to displayed. The transcript inside it has `scrollHeight=0` while hidden, so any earlier autoscroll attempt did nothing. `modals.js:35` dispatches a `modal:opened` CustomEvent on `dialog.showModal()`; we listen for it on the dialog and re-pin.
- [ ] **Step 1: Add a `modal:opened` listener in the template (no JS file changes)**
An inline handler attribute on the Console modal opening tag (~L105) in `l4d2web/l4d2web/templates/server_detail.html` looks tempting:
```jinja
<dialog id="console-modal" class="modal" aria-labelledby="console-modal-title"
        onmodal-opened="if(window.scrollAutoscrollTargets){window.scrollAutoscrollTargets(this)}">
```
It does not work. Browsers only wire `on<event>` content attributes for the standard events they know about, so a `modal:opened` CustomEvent sent through `dialog.dispatchEvent` never reaches an `onmodal-opened` attribute; it needs an `addEventListener` registration. So instead add a small inline script hook inside the template (one-off, not worth a new JS file).
Keep the opening `<dialog id="console-modal" …>` line free of handler attributes:
```jinja
<dialog id="console-modal" class="modal" aria-labelledby="console-modal-title">
```
…and **immediately before `{% endblock %}`** (end of file), add:
```jinja
<script>
  // Pin the Console modal transcript to its bottom each time the modal
  // opens. While the <dialog> is closed, its descendants have scrollHeight=0,
  // so neither the page-load autoscroll nor htmx:load can anchor them.
  // The 'modal:opened' CustomEvent is dispatched by modals.js on
  // dialog.showModal().
  (() => {
    const dlg = document.getElementById("console-modal");
    if (!dlg) return;
    dlg.addEventListener("modal:opened", () => {
      if (window.scrollAutoscrollTargets) {
        window.scrollAutoscrollTargets(dlg);
      }
    });
  })();
</script>
```
- [ ] **Step 2: Live-browser verification**
Reload `/servers/1` (after seeding from Task 3 Step 3). Click the ⛶ expand button on the Console tab to open `#console-modal`. In devtools:
```js
(() => {
  const t = document.querySelector('[id^="console-transcript-modal-"]');
  return {
    scrollTop: t.scrollTop,
    scrollHeight: t.scrollHeight,
    clientHeight: t.clientHeight,
    bottomDistance: t.scrollHeight - t.scrollTop - t.clientHeight,
  };
})();
```
Expected: `bottomDistance < 2`.
- [ ] **Step 3: Commit**
```bash
git add l4d2web/l4d2web/templates/server_detail.html
git commit -m "feat(server-detail): pin Console-modal transcript on modal:opened"
```
---
### Task 5: e2e — Console tab pinned on activation; pinned after submit
**Files:**
- Modify: `l4d2web/tests/e2e/test_server_detail.py` (append)
- Possibly: `l4d2web/tests/e2e/conftest.py` (add seeded-history fixture if helpful)
The dev server seed has too few rows to trigger overflow; the e2e fixture seeds its own. Reusing the `server_with_files` fixture is fine — it already builds a Server; we just need to insert `CommandHistory` rows before navigating.
- [ ] **Step 1: Add a fixture that seeds console history**
Append to `l4d2web/tests/e2e/conftest.py`:
```python
@pytest.fixture(scope="function")
def server_with_console_history(server_with_files):
    """server_with_files + 30 seeded CommandHistory rows for that server,
    so the inline Console transcript exceeds its visible height and the
    autoscroll behaviour is observable."""
    from datetime import UTC, datetime, timedelta

    from l4d2web.models import CommandHistory

    sid = server_with_files["server_id"]
    uid = server_with_files["user_id"]
    with session_scope() as session:
        for i in range(30):
            session.add(CommandHistory(
                user_id=uid,
                server_id=sid,
                command=f"seed_{i:02d}",
                reply=f"reply {i}",
                is_error=False,
                created_at=datetime.now(UTC) - timedelta(minutes=35 - i),
            ))
    return server_with_files
```
- [ ] **Step 2: Write the failing e2e tests**
Append to `l4d2web/tests/e2e/test_server_detail.py`:
```python
def test_console_tab_pinned_to_bottom_on_activation(page: Page, server_with_console_history) -> None:
    """Clicking the Console tab leaves the transcript scrolled to its
    bottom: the newest seeded command must be visible, not the oldest."""
    base = server_with_console_history["base_url"]
    sid = server_with_console_history["server_id"]
    login(page, base)
    page.goto(f"{base}/servers/{sid}")
    strip = page.locator("[data-tab-strip]")
    strip.locator('[role="tab"][data-tab="console"]').click()
    transcript = page.locator(f"#console-transcript-inline-{sid}")
    expect(transcript).to_be_visible()
    # Pinned to bottom: |scrollHeight - scrollTop - clientHeight| < 2
    bottom_distance = transcript.evaluate(
        "(el) => el.scrollHeight - el.scrollTop - el.clientHeight"
    )
    assert abs(bottom_distance) < 2, f"transcript not pinned to bottom: {bottom_distance}px"
    # Inline pane caps at 20 lines.
    line_count = transcript.locator(".console-line").count()
    assert line_count == 20, f"inline expected 20 lines, got {line_count}"


def test_console_pane_pinned_after_command_submit(page: Page, server_with_console_history) -> None:
    """After submitting a command, the transcript scrolls so the new line
    is visible at the bottom.

    The dev server has no live RCON, but the POST still records a
    CommandHistory row and HTMX appends a console-line to the transcript;
    that's enough to exercise the autoscroll path.
    """
    base = server_with_console_history["base_url"]
    sid = server_with_console_history["server_id"]
    login(page, base)
    page.goto(f"{base}/servers/{sid}")
    strip = page.locator("[data-tab-strip]")
    strip.locator('[role="tab"][data-tab="console"]').click()
    transcript = page.locator(f"#console-transcript-inline-{sid}")
    pane = page.locator('[role="tabpanel"][data-tab="console"]')
    cmd_input = pane.locator('input[name="command"]')
    cmd_input.fill("verify_submit")
    cmd_input.press("Enter")
    # Wait for the new line to appear in the DOM.
    expect(transcript.locator(".console-line", has_text="verify_submit")).to_be_visible()
    bottom_distance = transcript.evaluate(
        "(el) => el.scrollHeight - el.scrollTop - el.clientHeight"
    )
    assert abs(bottom_distance) < 2, f"transcript not pinned after submit: {bottom_distance}px"
```
- [ ] **Step 3: Run e2e**
```bash
cd l4d2web && pytest tests/e2e/test_server_detail.py -m e2e -v
```
If the dev machine doesn't have Chromium installed, run `playwright install chromium` first.
Expected: PASS for both new tests; existing tests still pass.
- [ ] **Step 4: Commit**
```bash
git add l4d2web/tests/e2e/conftest.py l4d2web/tests/e2e/test_server_detail.py
git commit -m "test(e2e): console transcript pinned to bottom on tab + submit"
```
---
## Final verification (live browser)
After all tasks merge:
1. Reset / re-seed the dev DB if you used the `verify_cmd_*` seed above:
```bash
sqlite3 .tmp/dev-server/l4d2web.db "DELETE FROM command_history WHERE command LIKE 'verify_cmd_%' OR command LIKE 'seed_cmd_%';"
```
2. Restart `scripts/dev-server.py`.
3. In a browser, log in as `dev` / `devdevdev`, open `/servers/1`.
4. Click **Console** — transcript shows ≤20 most-recent commands, scrolled to bottom.
5. Submit any command (e.g. `status`). The response, an error on the dev server since it has no live RCON, appends as a new line and the transcript scrolls to keep it visible.
6. Click ⛶ to open the Console modal — modal transcript shows all 50 most-recent (or however many exist), scrolled to bottom.
7. Switch to **Log**, then between Console and Log a few times — each transition leaves the active tab's transcript pinned to its bottom.
## Spec coverage check
- ✅ Server-side slice to 20 newest inline; modal keeps 50 — **Task 1**
- ✅ Both log-stream `<pre>`s get `data-autoscroll` — **Task 3 Step 2**
- ✅ Helper walks ancestors to handle htmx:load on appended child — **Task 2**
- ✅ Helper exposed on `window` — **Task 2**
- ✅ `tabs.js` pins on activation — **Task 3 Step 1**
- ✅ Console-modal pin on `modal:opened` — **Task 4**
- ✅ Unit/template coverage for cap — **Task 1**
- ✅ e2e coverage for tab activation + submit pin — **Task 5**


@ -1,353 +0,0 @@
# L4D2 Global Map Overlays Design
**Goal:** Add two managed, system-wide map overlays, `l4d2center-maps` and `cedapug-maps`, populated from upstream map sources and refreshed daily through the existing job system.
**Approval status:** User-approved design direction. Implementation must not start until this spec is reviewed and an implementation plan is written.
## Context
`left4me` already has typed overlays, a builder registry, global overlays through `Overlay.user_id = NULL`, and queued overlay build jobs. Steam Workshop overlays use a cache plus symlinks into `left4dead2/addons/`, and server initialization already runs overlay builders before calling `l4d2ctl initialize`.
Global map sources fit the same model. The host library remains unchanged: it receives overlay refs and mounts directories. The web app owns map-source fetching, cache management, reconciliation, and job logs.
The two upstream sources are:
- `https://l4d2center.com/maps/servers/index.csv`
- `https://cedapug.com/custom`
## Locked Decisions
1. **One general operation.** Use `refresh_global_overlays`, not source-specific cron operations.
2. **Systemd owns time.** A systemd timer runs daily and invokes a Flask CLI command. The CLI only enqueues work; the existing worker performs downloads and writes logs.
3. **System jobs are nullable-owner jobs.** `jobs.user_id` becomes nullable. `NULL` means the job was created by the system. UI displays owner as `system`. Only admins can access system jobs.
4. **Managed global overlays are auto-seeded.** The app creates or repairs exactly one `l4d2center-maps` overlay and exactly one `cedapug-maps` overlay.
5. **Global overlays are normal system overlays for users.** `Overlay.user_id = NULL` makes them visible to every authenticated user and selectable in every user's blueprint editor.
6. **Managed types are not user-creatable.** Normal overlay creation does not offer `l4d2center_maps` or `cedapug_maps`. The seeder is the only code path that creates those types.
7. **Exact reconciliation.** Refresh makes each managed overlay match its upstream manifest. Removed upstream maps are removed from the managed overlay symlink set. Foreign files are left alone and logged.
8. **No initialize-time downloads.** `initialize_server()` may run builders to repair symlinks, but it must not fetch remote manifests or download large archives. Missing cache content fails clearly.
9. **Separate cache from Workshop.** Non-Steam global maps use `${LEFT4ME_ROOT}/global_overlay_cache`, not `${LEFT4ME_ROOT}/workshop_cache`.
10. **Source-specific parsing stays explicit.** Do not introduce a generic arbitrary HTTP source framework in this phase.
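
Decision 7 (exact reconciliation) can be illustrated with a small symlink reconciler. This is a sketch under assumptions, not the planned builder: `reconcile_symlinks` is a hypothetical name, and "managed" is assumed to mean "a symlink whose target lies under the global overlay cache"; anything else in the directory counts as foreign.

```python
import os
from pathlib import Path


def _link_target(entry: Path) -> Path:
    """Absolute, normalized target of a symlink (the target need not exist)."""
    return Path(os.path.abspath(entry.parent / os.readlink(entry)))


def reconcile_symlinks(addons_dir: Path, cache_root: Path, desired: dict[str, Path]) -> list[str]:
    """Make managed symlinks in addons_dir match `desired`; return log lines."""
    log: list[str] = []
    cache_root = Path(os.path.abspath(cache_root))
    wanted = {name: Path(os.path.abspath(p)) for name, p in desired.items()}
    addons_dir.mkdir(parents=True, exist_ok=True)

    for entry in sorted(addons_dir.iterdir()):
        managed = entry.is_symlink() and cache_root in _link_target(entry).parents
        if not managed:
            # Foreign files are never touched, only reported.
            log.append(f"foreign entry left alone: {entry.name}")
        elif wanted.get(entry.name) != _link_target(entry):
            # Removed upstream, or pointing at an outdated cache path.
            entry.unlink()
            log.append(f"removed stale link: {entry.name}")

    for name, target in sorted(wanted.items()):
        link = addons_dir / name
        if link.is_symlink() or link.exists():
            continue  # already correct, or blocked by a foreign entry
        link.symlink_to(target)
        log.append(f"linked: {name}")
    return log
```

Classifying by link target rather than by file name keeps operator-placed files safe even if they happen to end in `.vpk`.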
## Architecture
The design extends the existing overlay-builder registry:
```python
BUILDERS = {
"external": ExternalBuilder(),
"workshop": WorkshopBuilder(),
"l4d2center_maps": GlobalMapOverlayBuilder(),
"cedapug_maps": GlobalMapOverlayBuilder(),
}
```
Both global map overlay types share the same filesystem builder. Source-specific code lives in refresh services that know how to fetch and parse upstream manifests.
High-level flow:
```text
systemd timer
-> flask refresh-global-overlays
-> ensure_global_overlays()
-> enqueue refresh_global_overlays job (coalesced)
-> worker fetches manifests
-> worker downloads/extracts cache files
-> worker records desired VPK files
-> worker rebuilds overlay symlinks directly
```
Auto-seeded overlay rows use fixed names, managed types, `user_id = NULL`, and web-generated paths:
```text
name=l4d2center-maps, type=l4d2center_maps, user_id=NULL, path=str(id)
name=cedapug-maps, type=cedapug_maps, user_id=NULL, path=str(id)
```
## Data Model
### `jobs`
Change `jobs.user_id` from required to nullable.
`NULL` means a system-created job. Authorization rules become:
- Admins can view, stream, and cancel every job, including system jobs.
- Non-admins can access only jobs where `job.user_id == current_user.id`.
- System jobs are not visible to non-admins through direct job URLs.
Job list/detail pages use outer joins to `users` and render missing owners as `system`.
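
The authorization rules above reduce to a small predicate. The following is a sketch with stand-in dataclasses; `User`, `Job`, `can_access_job`, and `owner_label` are illustrative names, not the app's actual models or helpers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    id: int
    is_admin: bool = False


@dataclass
class Job:
    id: int
    user_id: Optional[int]  # None marks a system-created job


def can_access_job(user: User, job: Job) -> bool:
    """Admins see every job; everyone else sees only their own."""
    if user.is_admin:
        return True
    # A system job (user_id is None) never matches a real user id.
    return job.user_id is not None and job.user_id == user.id


def owner_label(job: Job, usernames: dict[int, str]) -> str:
    """Render the owner column; a missing owner is shown as 'system'."""
    if job.user_id is None:
        return "system"
    return usernames.get(job.user_id, "system")
```

The explicit `is not None` check is redundant with the equality test for integer ids, but it documents the system-job case at the point where it matters.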
### `global_overlay_sources`
One row per managed global source overlay:
```text
id INTEGER PRIMARY KEY
overlay_id INTEGER NOT NULL UNIQUE REFERENCES overlays(id) ON DELETE CASCADE
source_key VARCHAR(64) NOT NULL UNIQUE -- l4d2center-maps | cedapug-maps
source_type VARCHAR(32) NOT NULL -- l4d2center_csv | cedapug_custom_page
source_url TEXT NOT NULL
last_manifest_hash VARCHAR(64) NOT NULL DEFAULT ''
last_refreshed_at DATETIME NULL
last_error TEXT NOT NULL DEFAULT ''
created_at DATETIME NOT NULL
updated_at DATETIME NOT NULL
```
`source_key` is stable and used by the seeder to repair missing rows.
### `global_overlay_items`
One row per manifest item belonging to a global overlay source:
```text
id INTEGER PRIMARY KEY
source_id INTEGER NOT NULL REFERENCES global_overlay_sources(id) ON DELETE CASCADE
item_key VARCHAR(255) NOT NULL -- stable per source
display_name VARCHAR(255) NOT NULL DEFAULT ''
download_url TEXT NOT NULL
expected_vpk_name VARCHAR(255) NOT NULL DEFAULT ''
expected_size BIGINT NULL
expected_md5 VARCHAR(32) NOT NULL DEFAULT ''
etag VARCHAR(255) NOT NULL DEFAULT ''
last_modified VARCHAR(255) NOT NULL DEFAULT ''
content_length BIGINT NULL
last_downloaded_at DATETIME NULL
last_error TEXT NOT NULL DEFAULT ''
created_at DATETIME NOT NULL
updated_at DATETIME NOT NULL
UNIQUE(source_id, item_key)
```
For `l4d2center`, `item_key` and `expected_vpk_name` come from the CSV `Name` column, and `expected_size` / `expected_md5` come from the CSV.
For `cedapug`, `item_key` is the basename of the direct download URL path, normalized without query parameters. CEDAPUG does not publish checksums in the observed page, so integrity checking relies on HTTP metadata when available and on archive extraction checks.
### `global_overlay_item_files`
One row per extracted VPK file that should appear in an overlay:
```text
id INTEGER PRIMARY KEY
item_id INTEGER NOT NULL REFERENCES global_overlay_items(id) ON DELETE CASCADE
vpk_name VARCHAR(255) NOT NULL
cache_path TEXT NOT NULL -- relative path under global_overlay_cache
size BIGINT NOT NULL
md5 VARCHAR(32) NOT NULL DEFAULT ''
created_at DATETIME NOT NULL
updated_at DATETIME NOT NULL
UNIQUE(item_id, vpk_name)
```
This extra file table handles archives that contain more than one `.vpk` without overloading the item row.
## Filesystem Layout
Use a cache separate from Steam Workshop:
```text
${LEFT4ME_ROOT}/
  global_overlay_cache/
    l4d2center-maps/
      archives/
      vpks/
    cedapug-maps/
      archives/
      vpks/
  overlays/
    {overlay_id}/
      left4dead2/addons/
        *.vpk -> absolute symlink to global_overlay_cache/.../vpks/*.vpk
```
Cache file writes are atomic: download to `*.partial`, extract to a temporary directory, verify, then `os.replace()` final VPK files.
Symlink targets are absolute, matching the existing Workshop overlay design.
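The atomic publish step for cache files might look like the sketch below. It assumes the VPK has already been extracted and verified elsewhere; the function name `publish_vpk` is hypothetical.

```python
import os
from pathlib import Path


def publish_vpk(verified_src: Path, final_path: Path) -> None:
    """Move a verified VPK into the cache so readers never see a partial file."""
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # os.replace is atomic only within one filesystem, so stage next to the target.
    staged = final_path.with_name(final_path.name + ".partial")
    with open(verified_src, "rb") as src, open(staged, "wb") as dst:
        while chunk := src.read(1 << 20):
            dst.write(chunk)
        dst.flush()
        os.fsync(dst.fileno())  # make the bytes durable before the rename
    os.replace(staged, final_path)  # the previous file stays visible until here
```

Because the final name changes only in the `os.replace()` call, a failed download or verification leaves the prior cached file untouched.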
## Source Parsing
### L4D2Center
Fetch `https://l4d2center.com/maps/servers/index.csv` with a normal HTTP timeout.
The CSV is semicolon-delimited and contains:
```text
Name;Size;md5;Download link
```
Each item produces:
- `item_key = Name`
- `expected_vpk_name = Name`
- `expected_size = Size`
- `expected_md5 = md5`
- `download_url = Download link`
Downloads are `.7z` archives. Extraction uses a Python 7z implementation such as `py7zr` so tests do not depend on a system `7z` binary. After extraction, the expected VPK file must exist and match both size and md5. A mismatch fails that item and leaves the prior cached file in place.
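A minimal sketch of the manifest parsing and post-extraction check described above. It assumes the `Size` column holds a byte count as a plain integer; the function names are hypothetical, and the 7z extraction itself is left out.

```python
import csv
import hashlib
import io
from pathlib import Path


def parse_l4d2center_csv(text: str) -> list[dict]:
    """Turn the semicolon-delimited manifest into item dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    items = []
    for row in reader:
        name = (row.get("Name") or "").strip()
        if not name:
            continue  # skip blank or malformed rows
        items.append({
            "item_key": name,
            "expected_vpk_name": name,
            "expected_size": int(row["Size"]),  # assumption: integer byte count
            "expected_md5": row["md5"].strip().lower(),
            "download_url": row["Download link"].strip(),
        })
    return items


def vpk_matches(path: Path, expected_size: int, expected_md5: str) -> bool:
    """Check both size and md5; either mismatch fails the item."""
    if path.stat().st_size != expected_size:
        return False
    digest = hashlib.md5(usedforsecurity=False)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

Checking the size first avoids hashing a large file that is already known to be wrong.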
### CEDAPUG
Fetch `https://cedapug.com/custom` and parse the embedded `renderCustomMapDownloads([...])` data.
Only direct download links are managed in v1:
- Relative links like `/maps/FatalFreight.zip` are converted to absolute `https://cedapug.com/maps/FatalFreight.zip`.
- External `http` links are logged and skipped in v1.
- Entries without a download link are built-in campaigns and skipped.
Downloads are `.zip` archives extracted with Python's standard `zipfile`. Every `.vpk` in the archive becomes a managed output file for that item. If no `.vpk` is present, the item fails and the prior cached files remain in place.
Because CEDAPUG does not publish checksums in the observed page, refresh detects changes using `ETag`, `Last-Modified`, `Content-Length`, and local extracted file metadata when available. A manual refresh can force revalidation by clearing item metadata in a later maintenance path; no force-refresh UI is included in this design.
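The link-handling rules for CEDAPUG entries can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, and any link carrying its own scheme or host is treated as external and skipped.

```python
import posixpath
from typing import Optional
from urllib.parse import urljoin, urlsplit

CEDAPUG_PAGE = "https://cedapug.com/custom"


def resolve_cedapug_download(link: Optional[str]) -> Optional[tuple[str, str]]:
    """Return (item_key, absolute_url) for a managed link, else None."""
    if not link:
        return None  # built-in campaign: nothing to download
    parts = urlsplit(link)
    if parts.scheme or parts.netloc:
        return None  # external link: logged and skipped in v1
    absolute = urljoin(CEDAPUG_PAGE, link)
    # urlsplit keeps the query separate, so the basename excludes parameters.
    item_key = posixpath.basename(urlsplit(absolute).path)
    if not item_key:
        return None
    return item_key, absolute
```

Deriving `item_key` from the path alone keeps it stable if the site later appends cache-busting query parameters.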
## Refresh Job
`refresh_global_overlays` is a global worker operation.
Behavior:
1. Ensure both managed global overlays and source rows exist.
2. Fetch both manifests.
3. Upsert manifest items.
4. Mark items absent from the manifest as no longer desired by deleting their item rows; cascading deletes remove their file rows.
5. Download and extract new or changed items.
6. Keep prior cache files when an item download or verification fails, but record `last_error`.
7. Rebuild symlinks for changed sources directly through the same builder interface used by `build_overlay`.
8. Emit clear job logs: manifest counts, downloads, skips, removals, verification failures, and build summaries.
`refresh_global_overlays` does not enqueue child `build_overlay` jobs. Direct builder invocation keeps the overlay in sync before the refresh job releases its global mutex, so a server job cannot start against updated cache metadata but stale overlay symlinks.
Coalescing:
- If a `refresh_global_overlays` job is queued or running, CLI/admin requests return the existing job instead of inserting a duplicate.
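Coalescing could be implemented along these lines. The sketch uses the standard-library `sqlite3` module with a cut-down `jobs` table; the column names `operation` and `status` and the function name are assumptions, and a real implementation would also need to guard against two processes checking at the same moment.

```python
import sqlite3

ACTIVE_STATUSES = ("queued", "running")


def enqueue_refresh_global_overlays(conn: sqlite3.Connection) -> int:
    """Return the id of the active refresh job, inserting one if none exists."""
    with conn:  # commits the insert on success, rolls back on error
        row = conn.execute(
            "SELECT id FROM jobs WHERE operation = ? AND status IN (?, ?) "
            "ORDER BY id LIMIT 1",
            ("refresh_global_overlays", *ACTIVE_STATUSES),
        ).fetchone()
        if row is not None:
            return row[0]  # coalesce: reuse the queued or running job
        cur = conn.execute(
            "INSERT INTO jobs (operation, status, user_id) "
            "VALUES (?, 'queued', NULL)",  # NULL user_id marks a system job
            ("refresh_global_overlays",),
        )
        return cur.lastrowid


# Hypothetical minimal schema, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, operation TEXT NOT NULL, "
    "status TEXT NOT NULL, user_id INTEGER NULL)"
)
```

Both the CLI command and the admin action could call the same helper, so the timer and a manual click never produce two concurrent refreshes.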
## Builder Reconciliation
`GlobalMapOverlayBuilder` reads desired file rows for the overlay's source and reconciles only symlinks it manages.
Managed symlink rule:
- A symlink in `left4dead2/addons/` is managed if its resolved target is under `${LEFT4ME_ROOT}/global_overlay_cache/{source_key}/vpks/`.
- Managed symlinks absent from desired files are removed.
- Desired files missing from cache are skipped and logged as errors.
- Non-symlink files and symlinks outside the source cache are left untouched and logged as foreign entries.
This mirrors `WorkshopBuilder` behavior and keeps manual files safe.
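The managed-symlink rule can be illustrated with a short reconcile function. This is a sketch, not the builder's actual code: the function name, the report structure, and the `desired` mapping of VPK name to cache file are assumptions.

```python
import os
from pathlib import Path


def reconcile_addons(addons_dir: Path, vpks_dir: Path,
                     desired: dict[str, Path]) -> dict[str, list[str]]:
    """Sync managed symlinks in addons_dir with desired {vpk_name: cache_file}."""
    report: dict[str, list[str]] = {
        "removed": [], "linked": [], "missing": [], "foreign": [],
    }
    managed_root = os.path.realpath(vpks_dir)
    addons_dir.mkdir(parents=True, exist_ok=True)

    for entry in sorted(addons_dir.iterdir()):
        target = os.path.realpath(entry)
        # Managed means: a symlink whose resolved target lies under the source cache.
        managed = entry.is_symlink() and (
            os.path.commonpath([managed_root, target]) == managed_root
        )
        if not managed:
            report["foreign"].append(entry.name)  # left untouched
        elif entry.name not in desired:
            entry.unlink()
            report["removed"].append(entry.name)

    for name, cache_file in sorted(desired.items()):
        if name in report["foreign"]:
            continue  # never overwrite a manual file or outside symlink
        if not cache_file.is_file():
            report["missing"].append(name)  # skipped; the caller logs an error
            continue
        link = addons_dir / name
        target = os.path.abspath(cache_file)
        if link.is_symlink():
            if os.readlink(link) == target:
                continue  # already correct
            link.unlink()
        os.symlink(target, link)  # absolute target, as the layout requires
        report["linked"].append(name)
    return report
```

Returning a report rather than logging inline keeps the function easy to test and lets the job decide how to word its build summary.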
## Scheduler Rules
`refresh_global_overlays` joins the existing global mutex group.
It must not run concurrently with:
- `install`
- `refresh_workshop_items`
- any `build_overlay`
- any server job (`initialize`, `start`, `stop`, `delete`)
No server or overlay job may start while `refresh_global_overlays` is running.
This conservative rule is acceptable because daily map refreshes are rare and large downloads should not race runtime changes.
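As a rough illustration of the global mutex, the admission check might reduce to the predicate below. It is a deliberate simplification: it models only the global-operation rule, ignores the per-overlay and per-server rules, and uses hypothetical names throughout.

```python
GLOBAL_MUTEX_OPS = {"install", "refresh_workshop_items", "refresh_global_overlays"}


def can_start(candidate: str, running: set[str]) -> bool:
    """Decide whether `candidate` may start given the operations already running."""
    if candidate in GLOBAL_MUTEX_OPS:
        # A global operation needs the worker to itself.
        return not running
    # Server and build jobs must wait for any running global operation.
    return not (running & GLOBAL_MUTEX_OPS)
```

A truth-table test over every pair of operations is a natural fit for a function of this shape.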
## CLI And Systemd Timer
Add a Flask CLI command:
```text
flask refresh-global-overlays
```
The command:
- Loads app config and DB.
- Ensures global overlays exist.
- Enqueues or returns the existing `refresh_global_overlays` job.
- Prints the job id.
- Does not run downloads itself.
Add deployment units:
```text
left4me-refresh-global-overlays.service
left4me-refresh-global-overlays.timer
```
Service command:
```text
/opt/left4me/.venv/bin/flask --app l4d2web.app:create_app refresh-global-overlays
```
Timer policy:
```text
OnCalendar=daily
Persistent=true
```
The service runs as the `left4me` user with `/etc/left4me/host.env` and `/etc/left4me/web.env`, matching `left4me-web.service`.
## Permissions And UI
Overlay list behavior:
- Admins see all overlays, including managed global map overlays.
- Non-admin users see system overlays and their own private workshop overlays.
- Managed global overlays appear in blueprint overlay selection for every user.
Creation behavior:
- Non-admin users can create only user-creatable types, currently `workshop`.
- Admins can create normal admin-creatable types, currently `external` and `workshop`.
- No user-facing create form offers `l4d2center_maps` or `cedapug_maps`.
- Auto-seeding is the only creation path for managed global map overlay types.
Admin controls:
- Add a manual "Refresh global overlays" action in the admin area.
- The action enqueues the same coalesced `refresh_global_overlays` job as the timer.
- Managed overlay detail pages show source type, source URL, last refresh time, last error, item count, and latest related jobs.
## Error Handling
- A manifest fetch failure fails the job outright if no source can be processed. If one source succeeds and one fails, the job still finishes in the failed state, logs the partial success, and preserves prior content for the failed source.
- Per-item download failures do not abort sibling items.
- Verification failures keep prior cached files and record `last_error` on the item.
- Extraction rejects path traversal entries and ignores non-VPK files.
- Unsupported CEDAPUG external links are skipped with a warning.
- Initialize-time checks fail if desired global map files are missing from cache, naming the overlay and missing VPK names.
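The extraction rule (reject path traversal, ignore non-VPK files) could be sketched for zip archives as below. The function name is hypothetical, and flattening every `.vpk` to its basename is an assumption; two members with the same basename would overwrite each other.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath


def extract_vpks(archive: Path, dest: Path) -> list[str]:
    """Extract every .vpk member into dest and return the sorted file names."""
    with zipfile.ZipFile(archive) as zf:
        wanted = []
        for info in zf.infolist():
            if info.is_dir():
                continue
            member = PurePosixPath(info.filename.replace("\\", "/"))
            # Validate every entry before writing anything to disk.
            if member.is_absolute() or ".." in member.parts:
                raise ValueError(f"path traversal entry: {info.filename!r}")
            if member.suffix.lower() == ".vpk":
                wanted.append((info, member.name))
        if not wanted:
            raise ValueError("archive contains no .vpk file")
        dest.mkdir(parents=True, exist_ok=True)
        for info, name in wanted:
            # Write under the basename only, never under the archive's own path.
            with zf.open(info) as src, open(dest / name, "wb") as out:
                shutil.copyfileobj(src, out)
    return sorted(name for _, name in wanted)
```

Raising before any file is written means a hostile or empty archive fails the item without disturbing previously cached files.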
## Tests
Test coverage should include:
- Auto-seeding creates exactly one source overlay per source and repairs missing source rows.
- `jobs.user_id` nullable behavior, outer joins, and `system` display.
- Non-admins cannot access system jobs directly.
- CLI coalesces queued/running `refresh_global_overlays` jobs.
- Scheduler truth table for the new global operation.
- L4D2Center CSV parser with semicolon-delimited fixture data.
- CEDAPUG embedded JavaScript parser with fixture HTML.
- L4D2Center download/extract verifies VPK size and md5.
- CEDAPUG download/extract records every VPK in a zip archive.
- Reconcile removes obsolete managed symlinks and leaves foreign files alone.
- Overlay create UI rejects managed singleton types.
- Blueprint overlay selection includes managed global overlays for all users.
- Deployment tests cover the service and timer artifacts.
## Out Of Scope
- User-created global map source overlays.
- Arbitrary configurable HTTP manifest sources.
- Force-refresh UI for CEDAPUG items.
- Cache garbage collection for unreferenced archive files.
- Client-side map download UX.
- Steam Workshop links discovered on the CEDAPUG page; those are skipped rather than imported into workshop overlays.
- Host-library awareness of managed overlay types.
## Implementation Boundaries
- `l4d2host` remains unchanged.
- The web app continues to call host operations only through `l4d2ctl`.
- Existing blueprint semantics remain unchanged: overlays are live-linked, ordered, and first overlay has highest precedence.
- Existing workshop overlay behavior remains unchanged except scheduler interactions with the new global operation.


@ -1,226 +0,0 @@
# L4D2 Workshop Overlays Design
**Goal:** Let users add Steam Workshop content (.vpk addons and maps) to L4D2 servers from the web UI. Workshop downloads run as a new typed overlay that fits the existing `Overlay` + `BlueprintOverlay` model, downloaded via the public Steam Web API and exposed through the existing fuse-overlayfs mount layer.
**Approval status:** User-approved design direction. Implementation proceeds in lockstep with the companion plan at `docs/superpowers/plans/2026-05-07-l4d2-workshop-overlays.md`.
## Context
`left4me` users today add `.vpk` content to a server only by SFTP-ing files into a manually-prepared overlay directory or by maintaining shell scripts (`competitive_rework`, `workshop_maps`, `tickrate`, etc.) that wrap `curl`/`steamcmd`. The web app exposes overlay rows but offers no way for users to populate them.
This spec adds **workshop overlays**: a user-private overlay type that downloads `.vpk` files via the public `ISteamRemoteStorage` API and surfaces them through the existing mount layer. Users keep composing blueprints by stacking overlays — workshop overlays become another row alongside today's externally-managed ones.
This is the first *typed* overlay. The design adds a `type` column and a builder-registry so future overlay types (tarball, inline, manual upload) plug in without schema churn or workflow changes.
Steam Workshop content for L4D2 (consumer_app_id 550) is downloadable via two anonymous-POST endpoints with no Steam Web API key required: `GetCollectionDetails` resolves a collection ID to its child item IDs, and `GetPublishedFileDetails` returns per-item metadata including a public `file_url` for the `.vpk`. This is the same API the user's existing `steam-workshop-download` script uses.
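The request and response handling for `GetPublishedFileDetails` might be sketched as follows, without performing any network call. The form-field and response-field names reflect the public API as commonly documented and should be verified against Steam's reference before use; the helper names are hypothetical.

```python
GET_DETAILS_URL = (
    "https://api.steampowered.com/ISteamRemoteStorage/GetPublishedFileDetails/v1/"
)
L4D2_APP_ID = 550


def details_form(steam_ids: list[str]) -> dict[str, str]:
    """Build the form body for one batched metadata POST."""
    form = {"itemcount": str(len(steam_ids))}
    for index, steam_id in enumerate(steam_ids):
        form[f"publishedfileids[{index}]"] = steam_id
    return form


def split_details(payload: dict) -> tuple[list[dict], list[str]]:
    """Separate usable L4D2 items from rejected IDs in a decoded response."""
    accepted: list[dict] = []
    rejected: list[str] = []
    details = payload.get("response", {}).get("publishedfiledetails", [])
    for entry in details:
        ok = entry.get("result") == 1 and entry.get("consumer_app_id") == L4D2_APP_ID
        if ok:
            accepted.append(entry)
        else:
            rejected.append(str(entry.get("publishedfileid", "")))
    return accepted, rejected
```

The form dictionary can be passed as the POST body by any HTTP client; keeping the builders pure makes them testable without touching Steam.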
L4D2-specific player-side pain points (sv_consistency / RestrictAddons configuration gotchas, the inability to push workshop content via `sv_downloadurl`) are documented in **Out of scope** and tracked as separate follow-ups. This spec stays strictly on workshop content acquisition.
## Locked Decisions
1. **Typed overlays.** `Overlay.type` joins `external` (existing rows; admin-managed; no-op builder) and `workshop` (new). Future types — tarball, inline, manual upload — slot in via the same builder registry without schema churn.
2. **No JSON `source_config` blob.** Per-type structured data lives in proper relational tables. JSON is reserved for genuinely opaque diagnostic payloads.
3. **Central deduplicated `WorkshopItem` registry** keyed on `steam_id`. Cache lives at `/var/lib/left4me/workshop_cache/{steam_id}.vpk`. Multiple overlays referencing the same Steam item share the same cache file.
4. **Symlinks, not copies.** Overlay directories contain `left4dead2/addons/{steam_id}.vpk` symlinks pointing into the cache. Both the cache file and the symlink are named by `{steam_id}` only — no Steam filename in any on-disk path, so Steam can rename the upstream `.vpk` without breaking lookup.
5. **Many-to-many association is pure** (no `enabled` flag). Toggle a workshop item by removing or re-adding the association. The shared cache makes this cheap.
6. **Collections are atomic UI bulk-imports.** Pasting a collection URL/ID resolves member items and creates N item associations. The DB never tracks "this came from a collection." Re-importing a collection is idempotent on existing items and additive for new ones.
7. **Single global admin "Refresh all workshop items" button.** One Steam metadata batch call, then re-download items whose `time_updated` advanced. No per-item, per-overlay, or scheduled refresh in v1.
8. **No cache GC in v1.** Cache grows monotonically. Reference-counted cleanup is a follow-up.
9. **Globality is independent of overlay type.** `Overlay.user_id` is the scope (NULL = system-wide, set = private to that user). v1 defaults newly-created workshop overlays to private and leaves existing external overlays as system-wide. A future "publish/share" button will let owners toggle `user_id` without changing type.
10. **One unified "Create overlay" UI button.** Modal has a type radio (External | Workshop). No path field — the web app generates the path for every new overlay.
11. **Strict scope.** v1 ships only the workshop type. L4D2 server-config gotchas, client-subscription helpers, other recipe types — all deferred to follow-up specs.
12. **`consumer_app_id == 550` validation** on every Steam API response at fetch/add time; non-L4D2 items are rejected and never get a row. The value is a fixed precondition, not data.
13. **Input field accepts numeric ID, full Workshop URL, or a multi-line batch** of either. Pasting `123456` and pasting `steamcommunity.com/sharedfiles/filedetails/?id=123456` produce the same result; pasting many of either at once works too.
14. **Web-managed overlay paths.** All new overlays (any type) get `path = str(overlay_id)` at insert time. The user never picks a path. Existing legacy external overlay rows keep their current path values; migrating them to the ID-based scheme is a follow-up. `Overlay.id` uses SQLite `AUTOINCREMENT` so deleted IDs are never reused.
15. **Auto-rebuild on item change.** Adding or removing items from a workshop overlay automatically enqueues a `build_overlay` job. The "Rebuild" button on the detail page is for manual recovery only. New build jobs for an overlay coalesce with any pending one for the same overlay (don't queue duplicates).
16. **HTTPS** for all Steam Web API calls. The reference downloader uses HTTP; we don't.
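Decision 13's input handling (numeric ID, full Workshop URL, or a multi-line batch of either) could be sketched like this. The function name is hypothetical, and the sketch does not check the URL's host, which a real implementation might want to do.

```python
import re
from urllib.parse import parse_qs, urlsplit

_ID_RE = re.compile(r"[0-9]{1,20}")


def parse_workshop_input(text: str) -> list[str]:
    """Extract Steam IDs from pasted input, preserving first-seen order."""
    ids: list[str] = []
    for raw in text.splitlines():
        token = raw.strip()
        if not token:
            continue  # tolerate blank lines in a pasted batch
        if _ID_RE.fullmatch(token):
            steam_id = token
        else:
            # Works with or without a scheme: the query follows the first '?'.
            values = parse_qs(urlsplit(token).query).get("id", [])
            if not values or not _ID_RE.fullmatch(values[0]):
                raise ValueError(f"not a Workshop ID or URL: {token!r}")
            steam_id = values[0]
        if steam_id not in ids:
            ids.append(steam_id)  # drop duplicates within one paste
    return ids
```

IDs stay as strings throughout, matching the `steam_id` text column.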
## Architecture
```
Overlay row (type=workshop)
└─refs─▶ overlay_workshop_items
└─▶ WorkshopItem (global, by steam_id)
▼ download (Steam GetPublishedFileDetails + HTTP GET)
workshop_cache/{steam_id}.vpk
overlay_dir/left4dead2/addons/{steam_id}.vpk ─symlink─┘
```
Build dispatch via a registry:
```python
BUILDERS = {"external": ExternalBuilder(), "workshop": WorkshopBuilder()}


def build_overlay(overlay_id):
    overlay = db.get(Overlay, overlay_id)
    BUILDERS[overlay.type].build(overlay, on_stdout, on_stderr, should_cancel)
```
`ExternalBuilder` is a no-op for legacy admin-managed dirs. `WorkshopBuilder` performs an idempotent diff-apply of `addons/` symlinks against the current associations. Future types add their own builders without changing the dispatcher, the mount layer, or the blueprint editor.
## Data Model
### `Overlay` (extended)
```
id INTEGER PK AUTOINCREMENT
name VARCHAR(255) NOT NULL
path VARCHAR(255) NOT NULL -- new overlays: str(id); legacy externals: existing values
type VARCHAR(16) NOT NULL -- 'external' | 'workshop' (extensible)
user_id INTEGER NULL REFERENCES users(id) -- NULL = system-wide
created_at, updated_at
UNIQUE INDEX on (name) WHERE user_id IS NULL -- system overlays globally unique by name
UNIQUE INDEX on (name, user_id) WHERE user_id IS NOT NULL -- per-user namespace
INDEX on (type, user_id)
```
Two partial unique indexes are required because a naive composite `UNIQUE(name, user_id)` doesn't constrain externals — SQLite treats NULL as distinct in unique constraints, so two externals could share a name. Partial indexes preserve the prior global-uniqueness invariant for system rows.
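The NULL behaviour and the two partial indexes can be demonstrated with the standard-library `sqlite3` module. The table is cut down to the relevant columns, and the index names and the `try_insert` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE overlays (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT NOT NULL,
        user_id INTEGER NULL
    );
    -- System overlays: globally unique by name.
    CREATE UNIQUE INDEX uq_overlays_system_name
        ON overlays (name) WHERE user_id IS NULL;
    -- Private overlays: unique per owner.
    CREATE UNIQUE INDEX uq_overlays_user_name
        ON overlays (name, user_id) WHERE user_id IS NOT NULL;
    """
)


def try_insert(name: str, user_id) -> bool:
    """Insert an overlay row; return False if a unique index rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO overlays (name, user_id) VALUES (?, ?)",
                (name, user_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With only a composite `UNIQUE(name, user_id)`, the second system-overlay insert of the same name would succeed, which is the gap the first partial index closes.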
### `WorkshopItem` (new)
```
id INTEGER PK
steam_id VARCHAR(20) NOT NULL UNIQUE -- 64-bit, store as text
title VARCHAR(255) NOT NULL DEFAULT ''
filename VARCHAR(255) NOT NULL DEFAULT '' -- upstream Steam filename, display only
file_url TEXT NOT NULL DEFAULT ''
file_size BIGINT NOT NULL DEFAULT 0
time_updated INTEGER NOT NULL DEFAULT 0 -- Steam epoch
preview_url TEXT NOT NULL DEFAULT '' -- thumbnail URL hot-linked from Steam
last_downloaded_at DATETIME NULL
last_error TEXT NOT NULL DEFAULT ''
created_at, updated_at
```
`consumer_app_id` is **not** stored. It's validated at fetch time and the row never exists for non-L4D2 items.
### `overlay_workshop_items` (new, pure association)
```
id INTEGER PK
overlay_id INTEGER NOT NULL REFERENCES overlays(id) ON DELETE CASCADE
workshop_item_id INTEGER NOT NULL REFERENCES workshop_items(id) ON DELETE RESTRICT
UNIQUE (overlay_id, workshop_item_id)
INDEX (workshop_item_id) -- reverse lookup for refresh
```
No `enabled` column — toggle is remove/add, which is cheap because the cache survives.
### `Job` (extended)
Add `overlay_id INTEGER NULL REFERENCES overlays(id)` for `build_overlay` jobs.
## Filesystem Layout
```
/var/lib/left4me/
  overlays/
    {overlay_id}/                 # flat — same shape for every type
      left4dead2/addons/
        {steam_id}.vpk -> /var/lib/left4me/workshop_cache/{steam_id}.vpk
  workshop_cache/
    {steam_id}.vpk                # one file per Steam item
```
- Every new overlay (workshop, future tarball/inline/manual) lives at `overlays/{overlay_id}/`. Legacy external overlays keep their pre-migration paths (e.g. `overlays/standard/`).
- `workshop_cache/` is created during deploy provisioning, not lazily — avoids races between concurrent first downloads.
- Web user owns both trees (mode 0755). Host user (`l4d2ctl`) needs read on both. If web and host are different users, they share a group.
- Symlink targets are absolute. Relative targets resolve in the merged-mount namespace and break across the host/web boundary.
- The builder never creates a dangling symlink. If a `WorkshopItem` lacks a cache file at build time, the builder logs a warning and skips it — fuse-overlayfs surfaces broken links to L4D2 as opaque addon-scan failures.
## UI
A single "Create overlay" button on `/overlays` opens a modal with type radio (External | Workshop) and a name field. No path field. The web app generates `path = str(overlay_id)` after insert.
Workshop overlay detail page (`/overlays/{id}` when `type='workshop'`) shows:
- A multi-line input plus a radio (Items | Collection). Pasting one or many IDs/URLs adds them in order; pasting a collection ID resolves its members.
- An item table with: thumbnail (`preview_url`), `steam_id` linking to Steam, title, filename, last-updated, size, last-error if any, Remove.
- A manual "Rebuild" button (for recovery only — every add/remove auto-enqueues a coalesced `build_overlay` job).
- Status indicator pulled from the latest related `Job` row.
External overlay detail page is unchanged in shape: read-only path display, name edit (admin only). The "External" type retains the existing admin-only SFTP-to-disk workflow until a future "manual upload" type replaces it.
The blueprint editor is unchanged in structure. Workshop overlays appear alongside externals in the user's overlay picker; ordering and stacking semantics are identical.
Admin section gets one new control: "Refresh all workshop items" button on the admin landing or workshop subsection. Pressing it enqueues a single `refresh_workshop_items` job.
### Routes
| Method | Path | Purpose |
|---|---|---|
| GET | `/overlays` | List with Type column, filtered by user permissions |
| POST | `/overlays` | Create; reads `type` and `name` only |
| GET | `/overlays/{id}` | Type-aware detail page |
| POST | `/overlays/{id}/items` | Add items or collection; auto-enqueues coalesced `build_overlay` |
| POST | `/overlays/{id}/items/{item_id}/delete` | Remove association; auto-enqueues coalesced `build_overlay` |
| POST | `/overlays/{id}/build` | Manual rebuild (recovery) |
| POST | `/admin/workshop/refresh` | Admin only; enqueue `refresh_workshop_items` |
HTMX usage stays minimal: only the add-item form and per-row delete swap a fragment. Everything else is full-page POST/redirect/GET.
## Job Operations
Two new operations join the existing job worker:
- **`build_overlay(overlay_id)`** — `Job.overlay_id` is set; `server_id` is NULL. Dispatches to `BUILDERS[overlay.type].build(...)`. Cancellation between filesystem operations.
- **`refresh_workshop_items()`** — admin-only. Both `server_id` and `overlay_id` are NULL. Phases: fetch all metadata in one batched call, download items where `time_updated` advanced, enqueue (coalesced) `build_overlay` for affected overlays. v1 doesn't wait on child builds; the admin sees them in the jobs list.
### Scheduler rules
- `install` and `refresh_workshop_items` are mutually exclusive with each other, with all `build_overlay`s, and with all server jobs.
- `build_overlay(overlay_id=N)` blocks if `install_running`, `refresh_running`, or another build for the same `overlay_id` is running. Builds for *different* overlays may run concurrently.
- Server start/init blocks if `refresh_running` or any `build_overlay` for an overlay referenced by the server's blueprint is running.
Coalescing: a new `build_overlay` for an overlay that already has a queued (not-yet-running) build returns the existing job instead of inserting a new row.
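The scheduler rules for builds and server jobs can be written as two small predicates. This is a sketch under assumptions: the `Running` record and both function names are hypothetical, and `running` stands for whatever the worker knows about in-flight jobs.

```python
from dataclasses import dataclass
from typing import Optional

GLOBAL_OPS = ("install", "refresh_workshop_items")


@dataclass(frozen=True)
class Running:  # hypothetical view of one in-flight job
    operation: str
    overlay_id: Optional[int] = None


def can_start_build(overlay_id: int, running: list[Running]) -> bool:
    """A build waits for global operations and for a build of the same overlay."""
    for job in running:
        if job.operation in GLOBAL_OPS:
            return False
        if job.operation == "build_overlay" and job.overlay_id == overlay_id:
            return False
    return True


def can_start_server(blueprint_overlay_ids: set[int], running: list[Running]) -> bool:
    """Server start/init waits for global operations and for builds it depends on."""
    for job in running:
        if job.operation in GLOBAL_OPS:
            return False
        if job.operation == "build_overlay" and job.overlay_id in blueprint_overlay_ids:
            return False
    return True
```

Builds for different overlays pass the first predicate simultaneously, which is what allows them to run concurrently.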
`initialize_server` synchronously calls each overlay's builder before writing the spec for `l4d2ctl initialize`. If a workshop overlay references uncached items (no file in `workshop_cache/`), `initialize_server` fails fast with a clear error naming the missing IDs and pointing the user at the overlay page. It never silently mounts a partial overlay.
## Permissions
- **External overlays**: admin-only create/edit. Visible to all authenticated users (system-wide).
- **Workshop overlays**: any logged-in user can create. Owner or admin can edit and delete. Visible to the owner and admins.
- **Admin refresh**: admin-only.
The `Overlay` listing query for non-admins becomes: `type='external' OR user_id=current_user.id`.
## Risks
- **Broken symlinks across host/web boundary** — mitigated by absolute targets, build-time pre-check skipping uncached items, and `deploy/` documenting permission requirements.
- **Initialize against uncached items** — would silently mount overlays missing maps. Mitigated by `initialize_server`'s fail-fast check; tested.
- **Steam API rate limits** — refresh of 100 items is one metadata POST plus 100 downloads at 8-way parallelism. No retry/backoff in v1; 429s surface verbatim in the job log.
- **Partial failure during refresh** — each item is independent; per-item errors land on the row. Re-running refresh retries failures.
- **Concurrent same-ID adds** — `WorkshopItem.steam_id` unique handles cache dedup. `(overlay_id, workshop_item_id)` unique catches double-association; the route returns "already in overlay" rather than 500.
- **Build coalescing missed** — would enqueue dozens of redundant builds during multi-item adds. Mitigated by the `enqueue_build_overlay` helper; tested.
- **Worker concurrency rule miss** — the truth-table test in `test_job_worker.py` is the only way to trust the new scheduler logic; written before dispatch.
- **DB/disk drift** — a stray directory left by a prior failed delete could shadow a fresh overlay. Mitigated by `AUTOINCREMENT` (no ID reuse) and `os.makedirs(exist_ok=False)` (loud failure on collision).
- **Partial unique gap on SQLite** — naive composite `UNIQUE(name, user_id)` doesn't constrain externals because NULL is distinct. Mitigated by two partial unique indexes; tested explicitly.
- **Cache growth without GC** — accepted v1 trade-off.
- **Item removed from Steam** — refresh marks `result != 1`; row keeps last good cache file; UI surfaces error string. Operator decides removal.
- **L4D2 containerized run** — symlink absolute targets break if the server runs in a different mount namespace. Re-evaluate when containerization comes up.
## Out Of Scope
These came up in research and dialog but stay out of v1:
- **Publish / share button on overlays.** Lets owners flip `Overlay.user_id` between their own ID and NULL without changing type. The schema already supports it; only the UI is deferred.
- **Migrate legacy external overlay paths to the ID-based scheme.** Existing external rows keep their pre-migration paths in v1; a follow-up migration moves the directories on disk and updates the rows.
- **Switch from fuse-overlayfs to kernel overlayfs via a privileged helper.** Matches the existing systemd / steam-install sudoers helper pattern under `/usr/local/libexec/left4me/`. Workshop overlays would work identically under either mount engine — symlinks resolve through normal VFS in both.
- **`sv_consistency` / `addonconfig.cfg RestrictAddons` auto-handling.** When a workshop overlay attaches to a blueprint, surface a banner with a one-click fix. Most-cited L4D2 player pain.
- **Shareable Steam Workshop collection link for clients.** Server cannot push workshop content via `sv_downloadurl`; clients must subscribe themselves. A panel-generated collection makes that one click for players. Requires Steam OAuth.
- **Other overlay types.** `tarball` (covers the old `competitive_rework` GitHub-tarball recipe), `inline` (covers `tickrate`'s inline `server.cfg`), `manual` (file manager / upload, replaces the admin-SFTP external workflow). All slot in via the builder registry without schema churn.
- **Cache GC.** Reference-counted delete or admin "Clear unreferenced" page.
- **Per-item / per-overlay / scheduled refresh.** v1 has one global admin button; revisit if users want finer control.
- **Update-aware server restart UX.** Notify users when a running server's overlay content has been refreshed underneath it.
## Implementation Boundaries
- The host library contract is unchanged. Workshop content arrives in overlay directories the same way externals do today; `l4d2host` doesn't know overlays have types.
- The job-execution model is preserved: same workers, same logs, same cancel callbacks. Only the operations table grows.
- The blueprint privacy model and desired-vs-actual server state model are unchanged.
- No new frontend dependencies. Vendored HTMX + custom CSS + small inline JS.
- No new Steam Web API key required; both endpoints used accept anonymous POSTs.
- The companion implementation plan governs task ordering and verification commands. Implementation must not start without explicit user approval per that plan's gate.

View file

@ -1,80 +0,0 @@
# Kernel Overlayfs Helper Design
**Goal:** Replace the per-instance `fuse-overlayfs` mount with kernel-native overlayfs invoked through a privileged sudo helper that mounts in PID 1's mount namespace. Restores host-namespace visibility of the merged overlay so gameserver units (`left4me-server@%i.service`) can `chdir` into it at unshare time.
**Approval status:** User-approved design direction. Implementation proceeds in lockstep with the companion plan at `docs/superpowers/plans/2026-05-08-kernel-overlayfs-helper.md`.
## Context
**Symptom.** After redeploys, starting a gameserver leaves the systemd unit in `activating (auto-restart)` with `status=200/CHDIR — Changing to the requested working directory failed: No such file or directory`. Investigation showed:
- `fuse-overlayfs` running as `left4me` user mounts in `left4me-web.service`'s mount namespace.
- `ProtectSystem=full` + `ReadWritePaths=/var/lib/left4me` forces `PrivateMounts=yes` on the unit (`systemd-analyze security` confirms).
- The unit's bind of `/var/lib/left4me` shows `shared:471 master:1` in `/proc/<pid>/mountinfo` — slave-receive-only — so mounts created beneath it never propagate back to host.
- `MountFlags=shared` (added in commit `1968684` to fix this) sets only the unit's *root* propagation; it does not override the slave-direction propagation that `ProtectSystem`/`ReadWritePaths` apply to their bind mounts. The gameserver unit, on unshare, inherits *host* mounts and sees nothing at the merged path → CHDIR fails.
The system *appeared* to work for ~1d8h before this investigation because the prior fuse daemon happened to land in the host namespace via some transient state. The mechanism documented in `1968684` does not reliably work on systemd 257 with this hardening shape.
**Out-of-scope item now in scope.** The 2026-05-07 workshop-overlays spec already lists this transition at line 211: *"Switch from fuse-overlayfs to kernel overlayfs via a privileged helper. Matches the existing systemd / steam-install sudoers helper pattern under `/usr/local/libexec/left4me/`."* The mount-propagation bug is the trigger to do it now.
## Locked Decisions
1. **Privileged helper does the mount.** New `left4me-overlay` script under `/usr/local/libexec/left4me/`, invoked via `sudo -n`. Mirrors the existing `left4me-systemctl` and `left4me-journalctl` pattern. The helper enters PID 1's mount namespace via `nsenter --mount=/proc/1/ns/mnt` and then calls `/bin/mount -t overlay …` or `/bin/umount`. Result: all overlay mounts live in the host namespace, visible to gameserver units.
2. **Kernel-native overlayfs, not fuse.** Once a privileged helper exists, fuse-overlayfs's rootless-mount-via-setuid-`fusermount3` advantage disappears. Kernel overlayfs is faster, has no long-running daemon, simpler unmount, and one fewer runtime dep.
3. **Helper is Python, not shell.** Path canonicalization, env-file parsing, and lowerdir prefix-allowlist validation are too brittle in shell. Uses system `/usr/bin/python3` (never the venv) and stdlib only. Owned by root, mode 0755.
4. **Verbs are `mount` and `umount`.** Matches the kernel/userspace utility names; reduces cognitive friction over `unmount`.
5. **Helper takes only the instance name as input.** It reads `${LEFT4ME_ROOT:-/var/lib/left4me}/instances/<name>/instance.env` for `L4D2_LOWERDIRS=` and computes `upper`/`work`/`merged` from the runtime root. Equivalent in security to taking lowerdirs as args (the user already controls instance.env), and produces a one-line audit trail in `journalctl _COMM=sudo`.
6. **Strict path validation in the helper.**
- Instance name matches `^[a-z0-9][a-z0-9_-]{0,63}$` (mirrors `validate_instance_name` in `l4d2host/paths.py`).
- Each lowerdir from `L4D2_LOWERDIRS` is `os.path.realpath`'d and must resolve under one of an allowlist: `installation/`, `overlays/`, `global_overlay_cache/`, `workshop_cache/`. Empty entries and traversals are rejected.
- `upper`/`work`/`merged` must resolve exactly to `runtime/<name>/{upper,work,merged}`.
- Lowerdir count ≤ 500 (kernel overlayfs hard cap; was 64 before kernel 5.2).
7. **Whiteout-format guard.** `fuse-overlayfs` running as non-root uses `user.fuseoverlayfs.*` xattrs for whiteouts and opaque dirs, which kernel overlayfs ignores entirely. Before mounting, the helper walks `upperdir` once and refuses if any such xattr is present. Defensive; catches a stale fuse-era upperdir that wasn't wiped during migration.
8. **One-time migration: wipe existing `upper/` and `work/`.** The deploy script runs a gated migration (sentinel file `/var/lib/left4me/.kernel-overlay-migrated`) that stops gameservers, stops the web service, unmounts any stale fuse/overlay mounts, and recreates empty `upper`/`work` dirs for every instance. Players' in-place edits to merged content are sacrificed; v1 accepts this for a test deployment.
9. **Sudoers verb constraints.** `left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-overlay mount *, /usr/local/libexec/left4me/left4me-overlay umount *`. Defense in depth (real validation lives in the helper); makes `sudo -l` output self-documenting.
10. **Wire the existing `OverlayMounter` ABC through.** `start_instance`/`stop_instance`/`delete_instance` today bypass the abstraction at `l4d2host/fs/base.py`. The new `KernelOverlayFSMounter` replaces the unused `FuseOverlayFSMounter` AND becomes the only path through `instances.py`. `FuseOverlayFSMounter` and the `fuse_overlayfs.py` module are deleted.
11. **Double-mount guard in `start_instance`.** Kernel mounts persist when the web worker dies (unlike fuse daemons, which die with their cgroup). `start_instance` checks `os.path.ismount(merged)` and refuses with a clear error rather than double-mounting.
12. **Hardening cleanup on `left4me-web.service`.** Drop `MountFlags=shared` (no longer the mechanism). Restore `PrivateTmp=true` (was dropped in commit `593611e` for fuse propagation that did not work). Keep `NoNewPrivileges` unset (sudo still requires setuid). Update the comment block to reflect the new model.
13. **AGENTS.md contracts unchanged.** The host library's CLI surface (`install`, `initialize`, `start`, `stop`, `delete`, `status`, `logs`) is unchanged. The web app continues to drive operations via `l4d2ctl`. The fuse-overlayfs implementation detail was never part of the public contract.
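The validation rules in decision 6 can be sketched roughly as follows. This is a minimal illustration, not the helper's actual code: it assumes `L4D2_LOWERDIRS` is colon-separated (the spec does not state the separator), and the function and constant names are hypothetical.

```python
import os
import re

# Regex and allowlist are taken from decision 6; the names are hypothetical.
INSTANCE_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{0,63}$")
ALLOWED_LOWERDIR_ROOTS = ("installation", "overlays", "global_overlay_cache", "workshop_cache")
MAX_LOWERDIRS = 500


def validate_instance_name(name: str) -> str:
    # fullmatch() also rejects a trailing newline, which "$" alone would let through.
    if not INSTANCE_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid instance name: {name!r}")
    return name


def validate_lowerdirs(raw: str, root: str) -> list[str]:
    # ASSUMPTION: entries are colon-separated, as in the overlayfs lowerdir= option.
    entries = raw.split(":")
    if len(entries) > MAX_LOWERDIRS:
        raise ValueError("too many lowerdirs")
    real_root = os.path.realpath(root)
    allowed = [os.path.join(real_root, d) for d in ALLOWED_LOWERDIR_ROOTS]
    resolved = []
    for entry in entries:
        if not entry:
            raise ValueError("empty lowerdir entry")
        # realpath() collapses ".." and symlinks before the prefix check.
        real = os.path.realpath(entry)
        if not any(real == a or real.startswith(a + os.sep) for a in allowed):
            raise ValueError(f"lowerdir outside allowlist: {entry!r}")
        resolved.append(real)
    return resolved
```

Comparing against `a + os.sep` rather than the bare prefix keeps a sibling such as `overlays-evil/` from passing as if it were under `overlays/`.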
## Architecture
```
left4me-web.service (hardened, private mount namespace)
│ start_instance(name=…)
l4d2host.instances.start_instance
│ KernelOverlayFSMounter().mount(merged=…)
sudo -n /usr/local/libexec/left4me/left4me-overlay mount <name>
│ • validate name (regex)
│ • parse instance.env → L4D2_LOWERDIRS
│ • realpath each lowerdir, prefix-allowlist check
│ • compute upper/work/merged under runtime/<name>/
│ • walk upperdir, refuse if any user.fuseoverlayfs.* xattr
nsenter --mount=/proc/1/ns/mnt -- \
/bin/mount -t overlay overlay \
-o "lowerdir=…,upperdir=…,workdir=…" \
/var/lib/left4me/runtime/<name>/merged
host mount namespace now has the overlay; gameserver unit, on
unshare, inherits it and CHDIRs into …/merged/left4dead2 successfully.
```
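The final step of the flow above, assembling the `nsenter` + `mount` command, might look like the sketch below. The function name and signature are hypothetical; it assumes the lowerdirs have already been validated, and it additionally rejects `,` and `:` in paths because those characters would corrupt the `-o` option string.

```python
import os


def overlay_mount_argv(root: str, name: str, lowerdirs: list[str]) -> list[str]:
    """Build the argv for mounting an instance's overlay in PID 1's mount namespace."""
    runtime = os.path.join(root, "runtime", name)
    upper = os.path.join(runtime, "upper")
    work = os.path.join(runtime, "work")
    merged = os.path.join(runtime, "merged")
    for path in [*lowerdirs, upper, work]:
        # "," separates mount options and ":" separates lowerdirs.
        if "," in path or ":" in path:
            raise ValueError(f"unsafe character in path: {path!r}")
    options = ",".join([
        "lowerdir=" + ":".join(lowerdirs),
        "upperdir=" + upper,
        "workdir=" + work,
    ])
    # Returned as a list so it can be passed to subprocess without a shell.
    return [
        "nsenter", "--mount=/proc/1/ns/mnt", "--",
        "/bin/mount", "-t", "overlay", "overlay",
        "-o", options,
        merged,
    ]
```

Passing the result to `subprocess.run(argv, check=True)` avoids shell quoting entirely, so the only string that needs care is the option list itself.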
## Operational Notes
- **Migration ordering on the test box (test-server, …).** The deploy script must, in order: (1) stop all `left4me-server@*.service`, (2) stop `left4me-web.service` (kills any lingering fuse-overlayfs daemons by reaping their cgroup), (3) `findmnt` + force-unmount any leftover fuse/overlay mounts under `/var/lib/left4me/runtime/`, (4) wipe and recreate `upper`/`work` for every instance, (5) deploy + start the new code. The sentinel file `/var/lib/left4me/.kernel-overlay-migrated` gates reruns.
- **Filesystem.** `/var/lib/left4me` is btrfs on the test box. Kernel overlayfs on btrfs is supported on kernel ≥ 5.10; the box is on 6.12 — fine. AppArmor ships enabled on Debian Trixie; verify no overlay-related denials in `journalctl -k` after first start.
- **Concurrency.** Two threads racing on `start_instance` for the same name is a latent issue unaffected by this change. The double-mount guard partly mitigates: the loser hits the existing mount and errors cleanly.
## Out Of Scope
- **Replace `sudo` with `AmbientCapabilities=CAP_SYS_ADMIN`** on a dedicated helper unit. Broader blast radius than the wrapper-script approach.
- **A `systemd-mount` per-instance mount unit.** Considered as the alternative architectural fix but adds more moving parts than the helper-script approach. The helper matches the established privileged-helper pattern in this codebase.
- **Re-enable `NoNewPrivileges` on `left4me-web.service`.** Requires removing sudo; not feasible while the helper invocation pattern stays.
- **Multi-process job-worker-claim safety.** The `_claim_lock` in `l4d2host/services/job_worker.py:131-138` is process-local; correctness depends on `--workers 1`. This change doesn't touch it.
- **Replicating the migration on production deployments.** v1 covers only the test-server deployment shape.


@ -1,118 +0,0 @@
# L4D2 Blueprint Overlay Picker Design
**Goal:** Replace the checkbox + numeric-Order table on the blueprint detail page with a drag-to-reorder list and a single dropdown to add overlays. Drag-and-drop is the primary reorder mechanic; per-row Order text inputs are removed.
**Approval status:** User-approved. No companion implementation plan — small surface, implemented directly.
## Context
`templates/blueprint_detail.html:14-28` currently renders one HTML table for the blueprint's overlays. Each row carries a `Use` checkbox, a numeric `Order` text input, and the overlay name. To enable an overlay you check it; to reorder you type integers into per-row text fields. Adding a new overlay between existing ones means renumbering everything below it by hand.
This spec replaces that table with a single ordered list of *selected* overlays plus a `<select>` dropdown for adding more. Drag-to-reorder is the only reorder interaction. A ✕ button on each row removes it (returning it to the dropdown). Picking an entry from the dropdown appends it to the list (and removes it from the dropdown).
The change is intentionally scoped small: no two-panel layout, no filter widget, no touch / keyboard reorder support, no JS-disabled fallback. The native `<select>` element supplies typeahead-by-letter and keyboard navigation for free, which covers the no-drag path. The page is desktop-primary.
## Locked Decisions
1. **Single ordered list of selected overlays only.** No second pane. The "available" set lives in the `<select>`. Adding via dropdown is one click; removing via ✕ is one click; reordering is one drag.
2. **Native HTML5 drag-and-drop.** No vendored library, no polyfill. Touch-screen drag is unsupported on Android and rough on iOS — accepted because the page is desktop-primary. Add and remove still work on touch via the `<select>` and the `<button>`.
3. **JS-required UI.** If JS does not load, the page is unusable. No degradation to the old checkbox table.
4. **Server contract unchanged.** Each list row owns one `<input type="hidden" name="overlay_ids" value="{id}">`. Form-submission order = DOM order. The existing `ordered_overlay_ids_from_form` handler in `routes/blueprint_routes.py` already falls back to enumerate index when no `overlay_position_<id>` field is present, so it accepts the new shape with no Python edit.
5. **Dropdown re-sorted alphabetically on remove.** When ✕ removes a row, the corresponding `<option>` is sorted-inserted back into the `<select>` (case-insensitive name compare). The dropdown stays predictable.
6. **Drop-indicator visual.** A 2px focus-color bar drawn via `box-shadow … inset` on the row under the cursor: top-bar = "drop will land before this row", bottom-bar = "drop will land after this row". The hover side is computed by comparing `event.clientY` to the row's vertical midpoint.
7. **Drop on empty space inside the list = append.** Drop directly on the dragged row (or with no movement) = no-op. Escape during drag triggers `dragend`, which clears all visual classes.
8. **Out of scope:** keyboard reorder, ARIA live announcements, touch DnD polyfill, server-side cleanup of the now-unused `overlay_position_<id>` form-field path.
## Architecture
```
GET /blueprints/<id>
page_routes.blueprint_page
├─▶ selected_overlays (ordered by BlueprintOverlay.position)
└─▶ available_overlays = all_overlays \ selected_overlays
(alphabetical)
templates/blueprint_detail.html
<ol data-overlay-list> ← drag target, hidden inputs
<li data-overlay-id draggable> × ⋮⋮ name </li>
</ol>
<select data-overlay-add> ← add path
<option>Pick a name…</option>
<option value=overlay.id>name</option> ← available_overlays
</select>
static/js/blueprint-overlay-picker.js
├─ dragstart/over/leave/drop/end → reorder DOM under [data-overlay-list]
├─ click [data-action="remove"] → remove row + sorted-insert <option>
├─ change [data-overlay-add] → append <li>, remove <option>
└─ refreshEmpty() → toggle [data-overlay-empty][hidden]
POST /blueprints/<id>
form-encoded body: overlay_ids=<id>&overlay_ids=<id>&… (in DOM order)
blueprint_routes.update_blueprint_form
→ ordered_overlay_ids_from_form (existing; fallback_position branch)
→ replace_blueprint_overlays (existing)
```
## Form-contract details
The new template emits one hidden input per selected row, colocated as a child of the `<li>`:
```html
<li data-overlay-id="3" draggable="true">
  <span class="overlay-picker-handle">⋮⋮</span>
  <span class="overlay-picker-name">workshop_maps</span>
  <button type="button" data-action="remove">×</button>
  <input type="hidden" name="overlay_ids" value="3">
</li>
```
Browser form serialization preserves DOM order across multiple inputs that share a `name`. Werkzeug's `request.form.getlist("overlay_ids")` returns them in submission order. `ordered_overlay_ids_from_form` then assigns each id its enumerate-index position via the `fallback_position` branch (lines 19-31 of `routes/blueprint_routes.py`) and feeds the result to `replace_blueprint_overlays`.
The JSON path (`POST /blueprints` with `application/json`) already takes `overlay_ids` list order at line 64 of the same file — this spec does not affect it.
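The ordering guarantee can be illustrated with the standard library alone. The sketch below is a stand-in for the fallback branch, not the real `ordered_overlay_ids_from_form`: the function name and the `(overlay_id, position)` return shape are assumptions, and it parses a raw form body instead of a Werkzeug `request.form`.

```python
from urllib.parse import parse_qsl


def ordered_overlay_ids(body: str) -> list[tuple[int, int]]:
    """Return (overlay_id, position) pairs in submission order."""
    # parse_qsl keeps repeated keys in the order they appear in the body.
    ids = [int(value) for key, value in parse_qsl(body) if key == "overlay_ids"]
    # With no overlay_position_<id> fields, position is the enumerate index.
    return [(overlay_id, position) for position, overlay_id in enumerate(ids)]
```

Because position is derived purely from order, reordering rows in the DOM is sufficient to change what gets stored; no extra field has to be kept in sync on the client.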
## UI / UX details
- **Empty state.** When no overlays are selected, a `[data-overlay-empty]` paragraph reads "No overlays selected. Pick one below to add." JS toggles its `hidden` attribute on every list mutation.
- **Drag handle.** Visual only (`⋮⋮` glyph). The whole row is `draggable="true"`; the user does not have to grab the handle specifically.
- **Drop indicator math.** During `dragover`, compute `event.clientY < rect.top + rect.height/2`; that boolean picks `drop-before` (bar at top) vs `drop-after` (bar at bottom). On `drop`, read which class is set and `insertBefore` or `insertBefore(…, target.nextSibling)` accordingly.
- **Sorted insert on remove.** Walk `<select>` children comparing `option.dataset.overlayName` (lowercased) against the removed name; `insertBefore` the new option ahead of the first option whose name sorts later, or append if none.
- **Reset select after add.** Set `select.value = ""` so the placeholder reappears after each add.
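The picker itself is JavaScript, but the two small algorithms above are language-neutral. The Python model below is only a sketch of the logic under that assumption; the function names are hypothetical, and the list of option names excludes the placeholder option.

```python
import bisect


def drop_side(client_y: float, rect_top: float, rect_height: float) -> str:
    """Pick the indicator class from the cursor position relative to the row midpoint."""
    return "drop-before" if client_y < rect_top + rect_height / 2 else "drop-after"


def sorted_insert(option_names: list[str], name: str) -> int:
    """Insert name ahead of the first option that sorts later, ignoring case."""
    keys = [existing.lower() for existing in option_names]
    # bisect_right returns the index of the first key greater than the new one,
    # or len(keys) when there is none, which is the "append" case.
    index = bisect.bisect_right(keys, name.lower())
    option_names.insert(index, name)
    return index
```

A cursor exactly on the midpoint counts as the lower half here, matching the strict `<` comparison described for `dragover`.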
## Files
| Path | Change |
|---|---|
| `l4d2web/routes/page_routes.py` | Compute `available_overlays`; pass to template. |
| `l4d2web/templates/blueprint_detail.html` | Replace overlay table with `<ol>` + `<select>`; add `<script defer>`. |
| `l4d2web/static/css/components.css` | Append `.overlay-picker-*` rules. Reuse existing tokens. |
| `l4d2web/static/js/blueprint-overlay-picker.js` | New IIFE. ~150 LOC. |
| `l4d2web/tests/test_blueprints.py` | Two new GET-page assertions. |
| `l4d2web/tests/test_pages.py` | Update `test_blueprint_detail_has_ordered_overlay_form` to match new shape. |
## Verification
Manual browser flow (`/blueprints/<id>`):
1. Initial render shows the saved selection in saved order; dropdown holds the rest. No console errors.
2. Drag a row up/down. Focus-colored bar appears at the top or bottom of the hover-target row (depending on which half is hovered). On drop, the row moves; hidden inputs reflect the new order.
3. Click ✕ on a row. Row vanishes; the same name appears in the dropdown in alphabetical position.
4. Pick from the dropdown. New row appears at the end of the list; the option leaves the dropdown; the placeholder is reselected.
5. Save the blueprint, reload. Order survives the round-trip.
6. Press Escape mid-drag. Drop indicators clear; source row regains opacity; nothing moved.
Test commands:
```
pytest l4d2web/tests/test_blueprints.py -q
pytest l4d2web/tests -q
```
## Out of scope / future follow-ups
- **Drop the `overlay_position_<id>` server-side path.** Once no client emits those fields, `ordered_overlay_ids_from_form` collapses to `[int(v) for v in request.form.getlist("overlay_ids")]`. Test `test_form_update_preserves_ordered_overlays_and_multiline_fields` (`l4d2web/tests/test_blueprints.py:220`) gets simplified accordingly.
- **Touch-friendly DnD.** Vendor a polyfill (`drag-drop-touch`) or rewrite the picker on pointer events if mobile editing becomes a real use case.
- **Keyboard reorder.** Space-to-grab + arrow-keys + ARIA live announcements. Currently only add/remove are keyboard-accessible.
- **Filter on the selected list.** Not needed at v1's overlay counts; revisit if blueprints commonly carry 20+ overlays.

View file

@ -1,332 +0,0 @@
# L4D2 Script Overlays Design
> **Sandbox engine superseded by [`2026-05-08-l4d2-script-sandbox-v2-systemd.md`](2026-05-08-l4d2-script-sandbox-v2-systemd.md).**
> The v1 design below specifies `bubblewrap` + `systemd-run --scope` as the
> sandbox engine. The v2 design (approved 2026-05-08, same day) replaced that
> with `systemd-run` in service-unit mode and dropped `bubblewrap` entirely.
> The current implementation in `deploy/scripts/libexec/left4me-script-sandbox`
> follows v2; this v1 design is preserved for archaeology. The rest of the
> design (overlay-type unification, resource caps, helper auth model, etc.)
> still applies — only the sandbox-engine choice changed.
**Goal:** Add a single new overlay type, `script`, that lets users author arbitrary build recipes as bash and runs them inside a `bubblewrap` + `systemd-run --scope` sandbox. The new type subsumes the existing `l4d2center_maps` and `cedapug_maps` managed-globals overlay types, both of which are removed in the same change. After this work the overlay type list is exactly `workshop` (unchanged) and `script` (new).
**Approval status:** User-approved design direction. Implementation proceeds in lockstep with the companion plan at `docs/superpowers/plans/2026-05-08-l4d2-script-overlays.md`.
## Context
`left4me` users today have two ways to add content to a server: workshop overlays (rich UI for Steam Workshop items via `WorkshopBuilder`) and a pair of managed global-map overlay types (`l4d2center_maps`, `cedapug_maps`) with bespoke parsers, per-item DB rows, ETag-based change detection, and a daily refresh timer. They cannot author arbitrary build recipes.
The user's previous setup at `ckn-bw/bundles/left4dead2/files/scripts/overlays/` expressed every recipe as a small bash file: `competitive_rework` (GitHub tarball download), `tickrate` (inline `server.cfg` + addon DLL fetch), `standard` (workshop items + admin-list write), `workshop_maps` (workshop collection import), `l4d2center_maps` (CSV-driven map sync). All five fit naturally into a single "run a sandboxed bash script that populates the overlay dir" model.
The two managed global-map types in the current codebase are over-engineered for what they do — each is essentially "fetch a manifest, download archives, extract VPKs, place in `addons/`." Folding them into the new `script` type eliminates three database tables, two source-parser modules, the `GlobalMapOverlayBuilder`, the `py7zr` dependency, the global-overlay cache root, and the managed-singleton machinery, while letting an admin paste the equivalent shell code (which the user already wrote years ago) into a normal admin-owned, system-wide script overlay.
The trust model for the sandbox is "semi-public deployment, registered users." The threat surface is one user reading another user's overlay, the application DB, or arbitrary host secrets, plus runaway scripts exhausting disk/CPU/RAM. Network access is *not* restricted — scripts must be able to download from arbitrary URLs (GitHub, l4d2center, Steam CDN). Sandbox boundaries are namespace-based (mount, PID, IPC, UTS, cgroup), not command-allowlist-based; binary-allowlist sandboxing of bash is theatre because of `eval` and `exec`.
The test deploy DB is wiped as part of rollout; no data migration is performed. Existing user blueprints that reference `l4d2center_maps` or `cedapug_maps` overlay rows do not survive the change in the test environment.
A scheduled-refresh feature (the daily timer that today drives the global-map types) is intentionally **out of scope for this iteration**. The two existing systemd units and the `flask refresh-global-overlays` CLI command are deleted with no replacement. Refresh is reintroduced in a later iteration designed against concrete needs.
## Locked Decisions
1. **Single new overlay type: `script`.** Replaces both managed-globals types. Final type list: `workshop` + `script`. No `tarball`/`inline`/`manual` types — all of those collapse into `script` (with UI templates as a future ergonomics improvement).
2. **`Overlay.script` is a DB `TEXT` column** holding the raw bash. No file storage, no revision history in v1. Empty string for `workshop` rows.
3. **Build idempotency contract: script runs against the existing overlay dir.** No automatic wipe between builds. Users write `test -f … || curl …`-style guards if they want bandwidth efficiency. A manual "Wipe overlay" button on the detail page resets the dir to empty.
4. **No left4me-aware helpers in the sandbox.** The script sees pure bash plus whatever's in `/usr` (RO bind-mount of the host). Workshop items are not exposed via a helper — users wanting workshop content create a `workshop`-type overlay, which has its own first-class UX (thumbnails, collection paste, dedup cache, refresh).
5. **Sandbox engine: `bubblewrap` (`bwrap`) inside `systemd-run --scope --collect`.** `systemd-run` provides cgroup v2 limits + walltime kill via `RuntimeMaxSec`; `bwrap` provides the namespace isolation. Both are stable, well-audited, in-tree on Debian.
6. **Resource limits (system-wide, not per-overlay):** 1 hour walltime (`RuntimeMaxSec=3600`), 4 GB RAM (`MemoryMax=4G`, `MemorySwapMax=0`), 512 tasks, 200% CPU quota, post-build 20 GB disk cap on `du -sb` of the overlay dir.
7. **Network: host-shared.** No `--unshare-net`. Scripts have full outbound. Egress filtering is not in v1; the sandbox prevents reading internal state but does not prevent talking to internal IPs. Acceptable for the current trust model.
8. **No auto-seeding of "default" overlays.** Admin manually creates the equivalents of the old `l4d2center-maps`/`cedapug-maps` post-deploy by pasting the bash. The deploy script does not insert overlay rows.
9. **Daily/scheduled refresh: out of scope for this iteration.** No `auto_refresh` flag, no timer, no CLI command. Manual rebuild via the detail-page button is the only build trigger after this change.
10. **Permissions mirror workshop overlays.** Any logged-in user can create a private (`user_id = me`) script overlay. Admin can create system-wide (`user_id = NULL`). Owner or admin can edit/delete.
11. **Failure semantics via `Overlay.last_build_status`** (`'' | 'ok' | 'failed'`). Drives a "rebuild required" badge on the list and detail pages. Server initialization does **not** auto-block on `failed` (matches workshop's current behavior).
12. **Wipe is just another sandbox invocation.** The wipe endpoint runs the literal script `find /overlay -mindepth 1 -delete` through the same `left4me-script-sandbox` helper. No second helper, no privilege/UID puzzle (files are owned by `l4d2-sandbox`, who runs the wipe). After a successful wipe, `last_build_status` is reset to `''`. Wipe does **not** auto-enqueue a rebuild — the user decides.
13. **Privileged helper: `/usr/local/libexec/left4me/left4me-script-sandbox`.** Same pattern as the existing `left4me-overlay`, `left4me-systemctl`, `left4me-journalctl` helpers. Bash, owned root, mode 0755. The web user invokes it via `sudo -n` per a sudoers fragment. Root is needed to set up the namespaces; bwrap drops to the unprivileged `l4d2-sandbox` UID immediately.
14. **Dedicated sandbox UID `l4d2-sandbox`** (system user, `/usr/sbin/nologin`, no home). Owns nothing on the host outside what bwrap binds in. UID-drop happens inside the bwrap invocation via `--uid`/`--gid`.
15. **Strict argument validation in the helper.** Overlay id matches `^[0-9]+$`; overlay dir must exist under `/var/lib/left4me/overlays/`; script path must exist. Defense in depth — the real authorization check lives in the web app.
16. **Streaming I/O via the existing `run_with_streamed_output` helper.** Same plumbing `WorkshopBuilder` already uses for `steamcmd`/`curl` invocations. No new SSE/log path.
## Architecture
```text
Overlay row (type=script, script=TEXT, last_build_status)
▼ build_overlay(overlay_id) job
▼ BUILDERS["script"].build(overlay, on_stdout, on_stderr, should_cancel)
▼ ScriptBuilder writes overlay.script → tmpfile, then:
│ sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <tmpfile>
▼ Helper validates args, then exec()s:
│ systemd-run --scope --collect
│ -p MemoryMax=4G -p MemorySwapMax=0
│ -p TasksMax=512 -p CPUQuota=200%
│ -p RuntimeMaxSec=3600
│ -- bwrap [namespace flags...] /bin/bash /script.sh
▼ Inside the sandbox the script sees:
│ /overlay ← /var/lib/left4me/overlays/{id} RW (the build target)
│ /tmp,/run ← fresh tmpfs RW (ephemeral)
│ /usr,/lib,/lib64,/etc/{ssl,resolv.conf,nsswitch} RO (host-curated)
│ /proc,/dev ← fresh
│ network ← shared with host
│ UID/GID ← l4d2-sandbox (no_new_privs implicit in bwrap)
▼ stdout/stderr → run_with_streamed_output → existing job-log SSE stream
▼ After exit:
│ exit 0 ∧ du -sb /overlay ≤ 20 GB → last_build_status='ok'
│ any other outcome → last_build_status='failed'
```
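The "after exit" rule at the bottom of the diagram reduces to one predicate. The sketch below is illustrative only: the names are hypothetical, and `tree_size_bytes` is a rough pure-Python approximation of `du -sb` (it sums file sizes and ignores directory entries), not what the builder actually runs.

```python
import os

DISK_CAP_BYTES = 20 * 1024**3  # the 20 GB cap, computed as in the builder's check


def tree_size_bytes(path: str) -> int:
    """Approximate apparent size of a tree by summing file sizes (symlinks not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for filename in filenames:
            total += os.lstat(os.path.join(dirpath, filename)).st_size
    return total


def build_status(exit_code: int, overlay_size: int, cap: int = DISK_CAP_BYTES) -> str:
    """'ok' only when the script exited 0 and the overlay fits under the cap."""
    return "ok" if exit_code == 0 and overlay_size <= cap else "failed"
```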
The host library (`l4d2host`) is unchanged. The `KernelOverlayFSMounter` already mounts whatever's at `overlays/{id}/` regardless of how it got there. The Job model and worker model are essentially unchanged — `script` is just another overlay type for the same `build_overlay` operation that today supports `workshop`.
```python
BUILDERS = {
    "workshop": WorkshopBuilder(),
    "script": ScriptBuilder(),
}
```
## Data Model
### `Overlay` (modified)
```text
id INTEGER PK AUTOINCREMENT
name VARCHAR(255) NOT NULL
path VARCHAR(255) NOT NULL -- str(id) for new rows
type VARCHAR(16) NOT NULL -- 'workshop' | 'script'
user_id INTEGER NULL REFERENCES users(id) -- NULL = system-wide
script TEXT NOT NULL DEFAULT '' -- new; meaningful for type='script'
last_build_status VARCHAR(16) NOT NULL DEFAULT '' -- new; '' | 'ok' | 'failed'
created_at, updated_at
UNIQUE INDEX on (name) WHERE user_id IS NULL
UNIQUE INDEX on (name, user_id) WHERE user_id IS NOT NULL
INDEX on (type, user_id)
```
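The two partial unique indexes are what let a system-wide overlay and several private overlays share a name. A minimal SQLite sketch of that behavior follows; the table is trimmed to the relevant columns, and the index and function names are hypothetical rather than taken from the real migration.

```python
import sqlite3

# Trimmed schema: path, created_at, updated_at and the FK are omitted for brevity.
SCHEMA = """
CREATE TABLE overlays (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR(255) NOT NULL,
    type VARCHAR(16) NOT NULL,
    user_id INTEGER NULL,
    script TEXT NOT NULL DEFAULT '',
    last_build_status VARCHAR(16) NOT NULL DEFAULT ''
);
CREATE UNIQUE INDEX uq_overlay_system_name
    ON overlays (name) WHERE user_id IS NULL;
CREATE UNIQUE INDEX uq_overlay_user_name
    ON overlays (name, user_id) WHERE user_id IS NOT NULL;
"""


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_overlay(conn, name, type_, user_id=None) -> int:
    cur = conn.execute(
        "INSERT INTO overlays (name, type, user_id) VALUES (?, ?, ?)",
        (name, type_, user_id),
    )
    return cur.lastrowid
```

A single `UNIQUE (name, user_id)` constraint would not be enough, because SQLite treats `NULL` values as distinct and would allow duplicate system-wide names.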
### Tables removed
- `global_overlay_item_files`
- `global_overlay_items`
- `global_overlay_sources`
Drop order matters for the SQLite migration: drop `_item_files` first (FK to `_items`), then `_items` (FK to `_sources`), then `_sources` (FK to `overlays`).
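The child-first drop order can be expressed as a short migration step. This is a sketch under assumptions: the function name is hypothetical, and the real migration may run inside whatever migration framework the project uses.

```python
import sqlite3

# Children before parents: _item_files -> _items -> _sources.
DROP_ORDER = (
    "global_overlay_item_files",
    "global_overlay_items",
    "global_overlay_sources",
)


def drop_global_overlay_tables(conn: sqlite3.Connection) -> None:
    """Drop the managed-globals tables so no drop hits a still-referenced parent."""
    for table in DROP_ORDER:
        # Table names come from the constant above, never from user input.
        conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.commit()
```

With foreign keys enforced, SQLite performs an implicit delete of a table's rows before dropping it, so removing each child first means no referencing rows remain when its parent is dropped.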
### Unchanged
`WorkshopItem`, `overlay_workshop_items`, `Job` (including `Job.overlay_id` and nullable `Job.user_id`), `Server`, `Blueprint`, etc.
## Filesystem Layout
```text
${LEFT4ME_ROOT}/
overlays/
{overlay_id}/ # script writes here; mounted by host
left4dead2/... # whatever the script produces
workshop_cache/{steam_id}.vpk # workshop type only — unchanged
# removed:
# global_overlay_cache/ # was used by managed-globals types
```
Single tree per overlay. No per-overlay scratch cache (the chosen idempotency model is "script runs against existing dir," so any caching the user wants lives inside the overlay dir and is preserved between builds).
The sandbox bind-mounts `${LEFT4ME_ROOT}/overlays/{id}/` to `/overlay` (RW). Nothing else under `${LEFT4ME_ROOT}` is visible inside the sandbox.
## Sandbox
### Helper script
`deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`, mode 0755, owned root:
```bash
#!/bin/bash
# args: <overlay_id> <script_path>
set -euo pipefail
[[ $# -eq 2 ]] || { echo "usage: $0 <overlay_id> <script>" >&2; exit 64; }
OVERLAY_ID=$1; SCRIPT=$2
[[ "$OVERLAY_ID" =~ ^[0-9]+$ ]] || { echo "bad overlay id" >&2; exit 64; }
OVERLAY_DIR=/var/lib/left4me/overlays/$OVERLAY_ID
[[ -d $OVERLAY_DIR ]] || { echo "no overlay dir" >&2; exit 65; }
[[ -f $SCRIPT ]] || { echo "no script" >&2; exit 65; }
SBX_UID=$(id -u l4d2-sandbox); SBX_GID=$(id -g l4d2-sandbox)
exec systemd-run --quiet --scope --collect \
  -p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 \
  -p CPUQuota=200% -p RuntimeMaxSec=3600 \
  -- bwrap \
    --die-with-parent --new-session \
    --unshare-pid --unshare-ipc --unshare-uts --unshare-cgroup \
    --uid "$SBX_UID" --gid "$SBX_GID" \
    --proc /proc --dev /dev --tmpfs /tmp --tmpfs /run \
    --ro-bind /usr /usr --ro-bind /lib /lib --ro-bind /lib64 /lib64 \
    --symlink usr/bin /bin --symlink usr/sbin /sbin \
    --ro-bind /etc/resolv.conf /etc/resolv.conf \
    --ro-bind /etc/ssl /etc/ssl \
    --ro-bind /etc/ca-certificates /etc/ca-certificates \
    --ro-bind /etc/nsswitch.conf /etc/nsswitch.conf \
    --bind "$OVERLAY_DIR" /overlay \
    --chdir /overlay \
    --setenv HOME /tmp --setenv PATH /usr/bin:/usr/sbin \
    --setenv OVERLAY /overlay \
    --ro-bind "$SCRIPT" /script.sh \
    /bin/bash /script.sh
```
Network is *not* unshared (no `--unshare-net`); the sandbox shares the host network namespace. Every transient unit is visible via `systemctl list-units --type=scope` while running and journaled afterward (`journalctl --user-unit=run-…scope` or system journal depending on invocation).
### Sudoers fragment
Append to `deploy/files/etc/sudoers.d/left4me`:
```
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox
```
### System user
Provisioned in `deploy/deploy-test-server.sh`:
```bash
useradd --system --no-create-home --shell /usr/sbin/nologin l4d2-sandbox
apt-get install -y bubblewrap
```
## Build Lifecycle
`ScriptBuilder` lives in `l4d2web/services/overlay_builders.py` next to `WorkshopBuilder`:
```python
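import os
import subprocess
import tempfile

# run_with_streamed_output, overlay_path and BuildError come from the
# surrounding l4d2web code and are not shown here.


class ScriptBuilder:
    def build(self, overlay, *, on_stdout, on_stderr, should_cancel):
        # Persist the script to a temp file so the privileged helper can bind it.
        with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
            f.write(overlay.script or "")
            script_path = f.name
        try:
            cmd = [
                "sudo", "-n",
                "/usr/local/libexec/left4me/left4me-script-sandbox",
                str(overlay.id), script_path,
            ]
            run_with_streamed_output(cmd, on_stdout, on_stderr, should_cancel)
            self._enforce_disk_budget(overlay.id, on_stderr)
        finally:
            # Remove the temp script whether the build succeeded or not.
            os.unlink(script_path)

    def _enforce_disk_budget(self, overlay_id, on_stderr):
        # du -sb prints "<bytes>\t<path>"; the first field is the apparent size.
        size = subprocess.check_output(["du", "-sb", overlay_path(overlay_id)])
        if int(size.split()[0]) > 20 * 1024**3:
            on_stderr("overlay exceeded 20 GB disk cap")
            raise BuildError("disk-cap-exceeded")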
```
`run_with_streamed_output` is the existing helper used by `WorkshopBuilder` for `steamcmd`/`curl` invocations. The `should_cancel` callback fires `kill -TERM` on the sudo-`systemd-run` process tree; cgroup-collect tears down the whole scope on exit.
The job worker's existing job-completion path writes `Overlay.last_build_status = 'ok'` on success and `'failed'` on any non-zero exit / `BuildError` / cancel. This is a single column update inside the existing transaction; no new infrastructure.
## UI
### Create modal (`templates/overlays.html`)
The existing modal grows one option in the type radio: `Workshop | Script`. Name field unchanged. After insert, the web app generates `path = str(overlay_id)` for new rows (existing pattern).
### Detail page when `type='script'` (`templates/overlay_detail.html`)
- Plain styled `<textarea>` for `overlay.script` with a Save button → `POST /overlays/{id}/script`. No CodeMirror dependency in v1 (out of scope; keep frontend dep-light).
- "Rebuild" button → `POST /overlays/{id}/build`. Existing pattern from workshop overlays.
- "Wipe overlay" button (red, confirm-modal) → `POST /overlays/{id}/wipe`.
- `last_build_status` indicator badge: empty / "ok" / "failed".
- Live build log via existing SSE plumbing on the related Job row.
### Detail page when `type='workshop'`: unchanged.
### Sections removed
The global-source detail block (`overlay_detail.html` lines 34–46) is deleted along with the managed-globals subsystem.
## Routes
`l4d2web/routes/overlay_routes.py` adds:
| Method | Path | Purpose |
|---|---|---|
| POST | `/overlays/{id}/script` | Update `script` text. Auto-enqueue coalesced `build_overlay` job. |
| POST | `/overlays/{id}/wipe` | Invoke `left4me-script-sandbox` with the literal script `find /overlay -mindepth 1 -delete`. Owner/admin only. Refuses if a `build_overlay` for this overlay is running. After success, set `last_build_status=''`. Does not auto-enqueue a rebuild. |
| POST | `/overlays/{id}/build` | Manual rebuild — same pattern as today's workshop overlay manual rebuild. |
Existing `POST /overlays` accepts `type=script` and an optional initial `script` body.
## Permissions
| Action | Who |
|---|---|
| Create script overlay (private, `user_id = me`) | Any authenticated user |
| Create script overlay (system-wide, `user_id = NULL`) | Admin |
| Edit (script body, name) | Owner or admin |
| Wipe / Rebuild | Owner or admin |
| Delete | Owner or admin |
| View | Owner, admin, or any user when `user_id IS NULL` |
These match the existing rules for workshop overlays.
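The table above maps onto three small predicates. The sketch below uses stand-in dataclasses rather than the real ORM models, and all names in it are hypothetical; it only shows how the rules compose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    id: int
    is_admin: bool = False


@dataclass(frozen=True)
class Overlay:
    id: int
    user_id: Optional[int]  # None means system-wide


def can_create(user: User, system_wide: bool) -> bool:
    # Anyone may create a private overlay; only admins create system-wide ones.
    return user.is_admin or not system_wide


def can_view(user: User, overlay: Overlay) -> bool:
    return user.is_admin or overlay.user_id is None or overlay.user_id == user.id


def can_modify(user: User, overlay: Overlay) -> bool:
    # Covers edit, wipe, rebuild and delete: owner or admin.
    return user.is_admin or (overlay.user_id is not None and overlay.user_id == user.id)
```

Note that a system-wide overlay is viewable by everyone but modifiable only by admins, since it has no owner.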
## Job Worker / Scheduler
`services/job_worker.py` drops `"refresh_global_overlays"` from `GLOBAL_OPERATIONS` and removes the corresponding `refresh_global_overlays_running` and `blocked_servers_by_overlay` plumbing that exists only for the global-maps subsystem. The remaining mutex rules already cover:
- `build_overlay` per overlay (one running build per overlay).
- `install` and `refresh_workshop_items` as global mutexes.
- Server start/init blocks if any `build_overlay` for an overlay in the server's blueprint is running.
No new rules are needed for `script` — its build is mechanically identical to a `workshop` build from the scheduler's perspective.
## Daily Refresh — Removed
This iteration deletes the daily-refresh subsystem entirely:
- `deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.timer` and `.service` — deleted.
- `flask refresh-global-overlays` CLI command in `l4d2web/cli.py` — deleted.
- No replacement timer, no replacement CLI, no `auto_refresh` column on `Overlay`.
The only build trigger after this change is the user clicking Rebuild on the detail page (or the auto-enqueue when they Save the script body). A scheduled-refresh feature is reintroduced in a future iteration designed against concrete operational needs.
## Risks
- **Sandbox escape via kernel bug.** `bwrap` has a strong track record but is not invulnerable. Mitigated by running as `l4d2-sandbox` (no privileged capabilities), no setuid binaries reachable, `no_new_privs` implicit. A successful escape would land in an unprivileged UID with no host secrets reachable.
- **Disk fill via runaway script.** A script that writes a 20 GB+ payload to `/overlay` succeeds inside the sandbox and only fails afterward at the post-build `du` check, so the oversized payload lands on disk transiently. Not mitigated in v1: cgroup IO controls throttle bandwidth and IOPS but do not cap total bytes written, so there is no good limit to apply while the script runs. Accepted as a v1 trade-off; a future improvement is overlay-dir-on-its-own-filesystem with a quota.
- **Network exfiltration.** Script can connect to anything outbound, including internal IPs. Acceptable for the current trust model (semi-public; users have credentials). Egress firewall is out of scope.
- **Build-mid-server-running.** The scheduler refuses `build_overlay` for an overlay attached to a starting/running server (existing rule, unchanged). Good. A user can still rebuild while a server using a *different* blueprint runs concurrently.
- **Wipe race with running build.** The wipe endpoint refuses if a `build_overlay` for the overlay is running. Without this check, a wipe could blow away files mid-script and produce undefined results.
- **Stale `last_build_status`.** A row inserted via direct DB write or restored from backup could carry an `'ok'` status that no longer reflects reality. Treated as cosmetic; users can rebuild to refresh.
- **Sudoers misconfig.** A typo in the sudoers fragment could grant `left4me` more than intended. Mitigated by deploy-artifact tests asserting the exact expected lines.
- **DB row deletion racing the sandbox.** A user deleting an overlay while its build runs would invalidate the bind-mount target. Mitigated by the existing scheduler rule that tracks running overlays; delete should refuse if a build is running. (Existing pattern for workshop overlays; reuse.)
- **Migration drops globals tables.** Acceptable for the test deploy. Production rollout would need a different migration story; this spec explicitly assumes test-deploy DB wipe.
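
The post-build size check named in the disk-fill risk can be sketched as follows. This is a minimal illustration, not the project's `_enforce_disk_budget`: the function names, the exception class, and the constant are hypothetical, and it sums apparent file sizes rather than allocated blocks as `du` would.

```python
import os
from pathlib import Path

# Assumption: the 20 GB figure is the envelope quoted in this spec.
# The constant name is made up for this sketch.
DISK_BUDGET_BYTES = 20 * 1024**3


class DiskBudgetExceeded(RuntimeError):
    """Raised when an overlay directory is larger than its budget."""


def directory_size_bytes(root: Path) -> int:
    """Sum apparent file sizes under root without following symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except FileNotFoundError:
                # The file vanished between listing and stat; skip it.
                continue
    return total


def enforce_disk_budget(root: Path, budget: int = DISK_BUDGET_BYTES) -> int:
    """Return the measured size, or raise if it exceeds the budget."""
    size = directory_size_bytes(root)
    if size > budget:
        raise DiskBudgetExceeded(f"{root} uses {size} bytes, budget is {budget}")
    return size
```

Because the check runs only after the script exits, it bounds what is kept on disk, not what is written during the build.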
## Out Of Scope
- **Scheduled / daily refresh.** Intentionally removed in this iteration. Reintroduced later, designed against the use cases that emerge.
- **Per-overlay resource overrides.** All script overlays share the same 1 h / 4 GB / 20 GB envelope. If a real overlay needs more (l4d2center mirror at peak), revisit.
- **CodeMirror or other rich script editor.** Plain `<textarea>` in v1.
- **Egress allowlist / proxy.** No network restrictions on the sandbox in v1.
- **`$CACHE` scratch dir** persisted across builds. Users cache inside the overlay dir if they want; idempotency model is "script runs against existing dir."
- **Multi-tenant cgroup tree per user.** All sandboxes share the same cgroup-quota envelope.
- **Revision history on `script` column.** No `overlay_script_revisions` table; whatever's in the row is the current script.
- **Auto-seeding of l4d2center / cedapug equivalents.** Admin pastes the script post-deploy.
- **Migration that preserves existing global-map overlay rows.** Test deploy DB is wiped.
- **Container-per-build (podman / docker).** Heavier than `bwrap`; revisit only if multi-tenant escalates to "fully public sign-up."
- **left4me-aware helpers** (`workshop`, `download`, `extract`) inside the sandbox. Pure bash + host `/usr` only.
## Implementation Boundaries
- **`l4d2host` is unchanged.** The host library has no concept of overlay types and the mount layer (`KernelOverlayFSMounter`) doesn't care how the overlay dir got populated.
- **The `OverlayBuilder` Protocol is unchanged** — same `build(overlay, *, on_stdout, on_stderr, should_cancel)` signature. `ScriptBuilder` plugs into the existing registry.
- **The job worker model is unchanged.** Same operations, same logs, same SSE plumbing, same scheduler rules (minus the refresh_global_overlays entry).
- **No new application-level dependencies.** Vendored HTMX, no new Python packages. Two new system dependencies: `bubblewrap` apt package and the `l4d2-sandbox` system user.
- **No new config keys.** Same env files (`/etc/left4me/host.env`, `/etc/left4me/web.env`).
- **DB migration is destructive for global-maps overlay rows.** This is acceptable per the test-deploy assumption; a production-rollout follow-up would need to address it.
- The companion implementation plan governs task ordering and verification commands. Implementation must not start without explicit user approval per that plan's gate.


@ -1,138 +0,0 @@
# L4D2 Script Sandbox v2 — Systemd-Only
**Goal:** Replace the bwrap-based `left4me-script-sandbox` helper with one that uses `systemd-run` in **service-unit mode** alone. Drop `bubblewrap` as a system dependency. Gain capability bounding, seccomp filtering, kernel-tunable / -module / -log protection, address-family restriction, `LockPersonality`, `MemoryDenyWriteExecute`, and `RestrictSUIDSGID` — none of which the bwrap+systemd-run-scope composition could provide. Lose PID-namespace isolation (no `PrivatePID=` directive in systemd) — judged acceptable for the current trust model.
**Approval status:** User-approved 2026-05-08 after smoke testing on `ckn@10.0.4.128`.
## Context
The v1 sandbox (see `2026-05-08-l4d2-script-overlays-design.md`) layers `bubblewrap` for namespacing inside `systemd-run --scope` for cgroup limits. That works, but `--scope` units register an existing process tree and so cannot accept service-only directives like `NoNewPrivileges=`, `ProtectSystem=`, `SystemCallFilter=`, `CapabilityBoundingSet=`, etc. Smoke testing on the deployed host confirmed bwrap covers mount/PID/IPC/UTS namespacing well, but leaves capability bounding, seccomp, and kernel-surface protection unenforced.
A switch to `systemd-run` in default (transient service) mode unlocks the full hardening surface. Smoke testing of a v2 prototype against the deployed test host confirmed:
- Every isolation invariant the bwrap version provides (filesystem masking, UID drop, network reachability, `/overlay` RW bind, host-side `l4d2-sandbox` ownership, host secret hiding) is reproducible with systemd directives.
- All cgroup limits (`memory.max=4G`, `memory.swap.max=0`, `pids.max=512`, `cpu.max=200%`, `RuntimeMaxSec=3600`) apply identically.
- `MemoryError` fires at the 4 GB cap (cgroup-enforced).
- The wipe path (`find /overlay -mindepth 1 -delete`) succeeds.
- Hardening directives the v1 design couldn't express enforce real syscall blocks: `unshare(CLONE_NEWUSER)`, `mount(2)`, `personality(2)`, `bpf(2)`, `swapoff(2)`, `sysctl -w` are all blocked.
The single behavioral regression: host process IDs are visible via `/proc` and `ps -ef` because systemd has no `PrivatePID=` directive. Sending signals to those processes is still blocked by the kernel's UID-mismatch check (`l4d2-sandbox` cannot signal `root`-owned processes). Information disclosure is the only leak; the signal barrier remains intact.
## Locked Decisions
1. **Replace the helper body wholesale.** No `bwrap` invocation. `systemd-run` in service mode does both isolation and resource limits.
2. **Helper path, sudoers rule, ScriptBuilder API, and `l4d2-sandbox` UID are unchanged.** The Python side (`run_sandboxed_script`, route handlers, tests) does not change.
3. **`bubblewrap` apt dependency dropped from `deploy-test-server.sh`.**
4. **`left4me.db` file mode tightened to 0640 root:left4me at deploy time.** This is a host-hygiene fix that is independent of the sandbox change but was surfaced by smoke testing — without it, *any* host user (and, transitively, the sandbox) could read the application database.
5. **`TemporaryFileSystem=/var/lib` is required.** `ProtectSystem=strict` makes `/var/lib/left4me` read-only but visible; the only way to reliably hide its contents from the unit is to mask the parent with a tmpfs. The `BindPaths=…/overlays/{id}:/overlay` mount is unaffected because `/overlay` is at a different path.
6. **`PrivatePID=` is not configured.** systemd has no such directive. `ps -ef` from inside the sandbox shows host processes. The kernel's UID-based signal restriction blocks any actual interaction with them. Acceptable for the current trust model.
7. **Walltime kill remains `RuntimeMaxSec=3600`.** Same as v1.
8. **Network namespace remains shared with the host.** No `PrivateNetwork=`. Scripts must reach Steam / l4d2center / GitHub / etc.
9. **`SystemCallFilter=@system-service @network-io`** is the seccomp baseline. systemd's curated `@system-service` group is "everything a normal service does"; adding `@network-io` is explicit even though it overlaps. Build failures revealing missing syscall classes are surfaced via `journalctl` and addressed by widening the filter (`@process`, etc.) on demand.
10. **Single helper file replaces v1.** Not adding a `-v2` variant. The v1 implementation is removed in the same change.
## Architecture
```text
sudo helper
  └─ systemd-run --service (default) --pipe --wait
       (transient .service unit, full hardening directives)
       └─ /bin/bash /script.sh
```
systemd-run in service mode:
- Requests a transient service unit from the service manager over the system bus.
- Applies all `-p` properties as the unit's exec context.
- Forks; the child sets up the unit's namespaces (mount, IPC, user), drops privileges to `User=l4d2-sandbox`, applies the seccomp filter, and `execve()`s `/bin/bash /script.sh`.
- `--pipe` connects the unit's stdin/stdout/stderr to the calling helper's stdio (so the existing `run_command` harness in `ScriptBuilder` continues to capture line-by-line).
- `--wait` blocks until the unit terminates and propagates the exit code.
- `--collect` removes the unit on exit even if it failed.
- The cgroup carries the resource limits; the service manager enforces `RuntimeMaxSec=3600` and terminates the unit when the limit is reached.
### Helper
`deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`, mode 0755, owned root:
```bash
#!/bin/bash
set -euo pipefail

# Validate arguments: a numeric overlay id and an existing script file.
[[ $# -eq 2 ]] || { echo "usage: $0 <overlay_id> <script>" >&2; exit 64; }
OVERLAY_ID=$1; SCRIPT=$2
[[ "$OVERLAY_ID" =~ ^[0-9]+$ ]] || { echo "bad overlay id" >&2; exit 64; }
OVERLAY_DIR=/var/lib/left4me/overlays/$OVERLAY_ID
[[ -d $OVERLAY_DIR ]] || { echo "no overlay dir at $OVERLAY_DIR" >&2; exit 65; }
[[ -f $SCRIPT ]] || { echo "no script at $SCRIPT" >&2; exit 65; }

# Dry-run mode: report the resolved inputs and exit before changing anything.
if [[ "${LEFT4ME_SCRIPT_SANDBOX_DRY_RUN:-}" == "1" ]]; then
    echo "DRY RUN: overlay_id=$OVERLAY_ID script=$SCRIPT overlay_dir=$OVERLAY_DIR"
    exit 0
fi

# Hand the overlay directory to the sandbox user before each run.
chown -R l4d2-sandbox:l4d2-sandbox "$OVERLAY_DIR"
chmod 0755 "$OVERLAY_DIR"

# Run the script in a transient service unit with hardening and cgroup limits.
exec systemd-run --quiet --collect --wait --pipe \
    --unit="left4me-script-${OVERLAY_ID}-$$" \
    -p User=l4d2-sandbox -p Group=l4d2-sandbox \
    -p NoNewPrivileges=yes \
    -p ProtectSystem=strict -p ProtectHome=yes \
    -p PrivateTmp=yes -p PrivateDevices=yes -p PrivateIPC=yes \
    -p ProtectKernelTunables=yes -p ProtectKernelModules=yes \
    -p ProtectKernelLogs=yes -p ProtectControlGroups=yes \
    -p RestrictNamespaces=yes \
    -p RestrictAddressFamilies="AF_INET AF_INET6 AF_UNIX" \
    -p RestrictSUIDSGID=yes -p LockPersonality=yes \
    -p MemoryDenyWriteExecute=yes \
    -p SystemCallFilter="@system-service @network-io" \
    -p SystemCallArchitectures=native \
    -p CapabilityBoundingSet= -p AmbientCapabilities= \
    -p TemporaryFileSystem="/etc /var/lib" \
    -p BindReadOnlyPaths="/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives ${SCRIPT}:/script.sh" \
    -p BindPaths="${OVERLAY_DIR}:/overlay" \
    -p WorkingDirectory=/overlay \
    -p Environment="HOME=/tmp PATH=/usr/bin:/usr/sbin OVERLAY=/overlay" \
    -p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 \
    -p CPUQuota=200% -p RuntimeMaxSec=3600 \
    -- /bin/bash /script.sh
```
### Sudoers fragment
Unchanged from v1: `left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox`.
### System user
Unchanged from v1: `l4d2-sandbox` (`useradd --system --no-create-home --shell /usr/sbin/nologin`).
### Filesystem expectations
- `/var/lib/left4me` must be mode 0711 (left4me-owned). Already provisioned by v1 deploy script.
- `/var/lib/left4me/left4me.db` mode 0640 root:left4me. **New** — added by this change.
- Overlay directory `/var/lib/left4me/overlays/{id}/` chowned to `l4d2-sandbox:l4d2-sandbox` 0755 by the helper before each run. Unchanged from v1.
## Build Lifecycle (unchanged from v1)
`ScriptBuilder.build()` writes the script to a 0644 tmpfile, exec's `sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <tmpfile>` via `run_command`, then runs `_enforce_disk_budget`. The helper's internal mechanism changes; the wrapper API is identical. `Overlay.last_build_status` is written by the job worker on completion.
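
The wrapper side of this lifecycle might look roughly like the sketch below. It is not the project's `ScriptBuilder` or `run_command`: `build_helper_argv`, `run_script_build`, and the injectable `run` callable are hypothetical names, and the disk-budget step is left out. The 0644 mode is presumably needed because the file is bind-mounted read-only into the unit and read there by `l4d2-sandbox`.

```python
import os
import subprocess
import tempfile

# Helper path as given in this spec.
HELPER = "/usr/local/libexec/left4me/left4me-script-sandbox"


def build_helper_argv(overlay_id: int, script_path: str) -> list:
    """Build the sudo command line; -n makes sudo fail instead of prompting."""
    return ["sudo", "-n", HELPER, str(overlay_id), script_path]


def run_script_build(overlay_id: int, script_body: str, run=subprocess.call) -> int:
    """Write the script to a 0644 temp file, run the helper, return its exit code.

    `run` is injectable so the flow can be exercised without sudo (assumption:
    the real code routes this through its own run_command harness instead).
    """
    fd, path = tempfile.mkstemp(prefix="left4me-script-", suffix=".sh")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(script_body)
        os.chmod(path, 0o644)  # mkstemp creates the file as 0600
        return run(build_helper_argv(overlay_id, path))
    finally:
        os.unlink(path)  # the temp file is only needed for the duration of the run
```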
## Risks
- **systemd CVE landing in our directive set.** Single-tool migration removes one isolation layer. Mitigated by uid drop + cgroup limits + `NoNewPrivileges=yes` (kernel-enforced state independent of namespace setup). The escape would be an unprivileged process with no filesystem isolation but still capped on resources; same severity envelope as a hypothetical bwrap CVE in v1. The trust model (registered users) makes a single isolation layer acceptable.
- **`SystemCallFilter` rejecting a syscall a user script unexpectedly needs.** Symptom: build fails with SIGSYS. Diagnosis: `journalctl --since "1 min ago" | grep SECCOMP`. Resolution: widen the filter (`+@process`, `+@privileged` if the script genuinely needs more than a normal service). v1 had no syscall filter, so this is a new failure class.
- **`ProtectSystem=strict` masking something a script wanted to write to.** Only `/overlay`, `/tmp`, `/run` are writable inside the sandbox. Same as v1.
- **Host PID visibility (no `PrivatePID=`).** Information disclosure; not a privilege boundary.
- **`MemoryDenyWriteExecute=yes` blocking JITs.** A script that launches `node` / a JIT runtime would fail because W+X mappings are blocked. None of the recipe set the user has historically used (curl + tar + cp) needs a JIT; revisit if a real script trips this.
- **`RestrictAddressFamilies` blocking some download tools.** `curl`, `wget`, and `git` over HTTPS use `AF_INET`/`AF_INET6`; `getent hosts` uses `AF_UNIX` (nss). Smoke-tested as working. A script that wanted raw sockets (`AF_PACKET`) or netlink (`AF_NETLINK`) would fail; neither is plausible for build recipes.
## Out Of Scope
- **Per-overlay UID isolation.** Cross-script-overlay write access is still possible after a hypothetical sandbox bypass (every script overlay's dir is owned by `l4d2-sandbox`). A per-overlay UID pool was discussed as the next-step hardening but is deferred.
- **`PrivateNetwork=` / egress filtering.** No change from v1.
- **systemd-nspawn or LXC.** Researched; both are heavier than necessary for transient bash builds.
- **`PrivatePID=` workaround via `unshare`.** Not pursued — would require re-introducing a wrapper inside the unit, defeating the simplification.
## Implementation Boundaries
- **Web app code is unchanged.** `ScriptBuilder`, `run_sandboxed_script`, route handlers, models, migrations — all untouched. The migration is purely in the deployed helper script and adjacent deploy artifacts.
- **`bubblewrap` apt package removed.** Nothing on the production path invokes it after this change, and the deploy script no longer installs it.
- **No new systemd unit files.** Each invocation is a transient unit named `left4me-script-{overlay_id}-{pid}.service`.
- **No application-level dependency changes.** No new Python packages, no template changes, no DB migration.

View file

@ -1,113 +0,0 @@
# L4D2 Script Sandbox v3 — Egress Filter (Public Internet Only)
**Goal:** Restrict the script-overlay sandbox to public-internet egress only. Block reachability to the host's own services (localhost), the LAN, and any private RFC1918 / link-local / multicast / CGNAT / ULA addresses. Public DNS is preserved by bind-mounting a sandbox-only `resolv.conf` pointing at Cloudflare + Google.
**Approval status:** User-approved 2026-05-08. Implemented and smoke-tested on `ckn@10.0.4.128`.
## Context
After the v2 (systemd-only) migration, the sandbox still shared the host's network namespace. A live probe demonstrated the script could:
- Reach the web app on `127.0.0.1:8000` (HTTP 200 from `/health`).
- Reach the host's SSH daemon on `127.0.0.1:22` (banner returned).
- Reach the host on the LAN at `10.0.4.128:22` (banner returned).
- Reach the LAN gateway / DNS server at `10.0.0.1`.
- See Unix sockets in `/run` (`AF_UNIX` allowed).
The threat model says the sandbox should reach the public internet to download Workshop / l4d2center / GitHub content, but should **not** be able to talk to the host or LAN. systemd's `IPAddressDeny=` BPF cgroup egress filter is the right tool. It attaches a BPF program (`sd_fw_egress`) to the unit's cgroup; matching packets are silently dropped at send time.
A complication: the host's `/etc/resolv.conf` typically points at a private-IP DNS server (10.0.0.1 in the test deploy). Naively blocking `10.0.0.0/8` kills DNS, which kills outbound HTTP. The fix is to give the sandbox a static `resolv.conf` with public resolvers; DNS traffic then targets allowed public IPs.
## Locked Decisions
1. **`IPAddressDeny=` alone — no `IPAddressAllow=any`.** The systemd documentation claims "more specific rule wins" when both are set, but on systemd 257 + kernel 6.12 (and likely other combos), `IPAddressAllow=any` silently overrides every `IPAddressDeny=` rule. Verified empirically. With only `IPAddressDeny=` set, the kernel's default "allow all" applies to non-listed addresses; the listed CIDRs are dropped at the egress hook. **This must not be regressed** — adding back `IPAddressAllow=any` reopens every blocked range.
2. **Explicit CIDRs, no shorthand keywords.** systemd's unit-file parser accepts `localhost`, `link-local`, `multicast` shortcuts, but the `systemd-run -p` parser rejects them with `Failed to parse IP address prefix: localhost`. Use the CIDRs directly: `127.0.0.0/8 ::1/128 169.254.0.0/16 fe80::/10 224.0.0.0/4 ff00::/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 100.64.0.0/10 fc00::/7`.
3. **Static `/etc/left4me/sandbox-resolv.conf` with public resolvers** (Cloudflare 1.1.1.1, Google 8.8.8.8). Bind-mounted into the sandbox at `/etc/resolv.conf` via `BindReadOnlyPaths=/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf`. Two nameservers for redundancy. Picking other public resolvers (Quad9, OpenDNS) would also be acceptable; the file is the source of truth, not the helper.
4. **`AF_UNIX` stays in `RestrictAddressFamilies=`.** Dropping it would risk breaking NSS / syslog / D-Bus introspection paths for marginal gain — the IP-level filter handles the actual threat (reaching host TCP services). The Unix-socket surface (D-Bus system bus, systemd notify) is uid-gated and `l4d2-sandbox` has no special D-Bus permissions.
5. **No `PrivateNetwork=`.** That would block all networking, including the public internet. The whole point of script overlays is reaching public download sources.
6. **No DNS-over-HTTPS or DNSSEC.** Plain UDP-53 to public resolvers is sufficient; the threat is "egress targeting", not "DNS hijacking". Revisit if the trust model relaxes.
## Architecture
```text
sudo helper (root)
  └─ chown overlay dir to l4d2-sandbox
  └─ systemd-run --service [...all v2 directives...]
       -p IPAddressDeny="<11 CIDRs>"
       -p BindReadOnlyPaths="/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf [...]"
       └─ /bin/bash /script.sh
            (egress to listed CIDRs dropped at sd_fw_egress BPF hook;
             DNS goes to 1.1.1.1 / 8.8.8.8; everything else
             reaches the public internet normally)
```
`IPAddressDeny=` blocks egress to:
| CIDR | Coverage |
|---|---|
| `127.0.0.0/8` | IPv4 loopback |
| `::1/128` | IPv6 loopback |
| `169.254.0.0/16` | IPv4 link-local (incl. AWS metadata, DHCP fallback) |
| `fe80::/10` | IPv6 link-local |
| `224.0.0.0/4` | IPv4 multicast |
| `ff00::/8` | IPv6 multicast |
| `10.0.0.0/8` | RFC1918 private |
| `172.16.0.0/12` | RFC1918 private |
| `192.168.0.0/16` | RFC1918 private |
| `100.64.0.0/10` | CGNAT (RFC6598) |
| `fc00::/7` | IPv6 ULA |
Public IPv4 / IPv6 destinations are unaffected.
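
To reason about which destinations the table covers, the deny list can be modelled with Python's standard `ipaddress` module. This is only a user-space illustration of the CIDR matching; `DENIED_CIDRS` and `is_denied` are names invented for the sketch, and it does not model how the kernel filter treats edge cases such as IPv4-mapped IPv6 addresses.

```python
import ipaddress

# The eleven CIDRs listed in the table above.
DENIED_CIDRS = [
    "127.0.0.0/8", "::1/128",
    "169.254.0.0/16", "fe80::/10",
    "224.0.0.0/4", "ff00::/8",
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "100.64.0.0/10", "fc00::/7",
]
_NETWORKS = [ipaddress.ip_network(cidr) for cidr in DENIED_CIDRS]


def is_denied(address: str) -> bool:
    """Return True if the address falls inside any denied CIDR."""
    ip = ipaddress.ip_address(address)
    # Compare only within the same IP version to keep the check explicit.
    return any(ip.version == net.version and ip in net for net in _NETWORKS)
```

For example, `is_denied("10.0.4.128")` is true for the test host's LAN address, while `is_denied("1.1.1.1")` is false for the public resolver.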
## Files
- `deploy/files/etc/left4me/sandbox-resolv.conf` *(new)*: `nameserver 1.1.1.1` + `nameserver 8.8.8.8`. Mode 0644 root-owned at deploy time.
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`: `IPAddressDeny=` directive added; `BindReadOnlyPaths=` references the sandbox-resolv.conf instead of `/etc/resolv.conf`.
- `deploy/deploy-test-server.sh`: `install -m 0644 -o root -g root .../sandbox-resolv.conf /etc/left4me/sandbox-resolv.conf`.
- `deploy/tests/test_deploy_artifacts.py` — assert all of the above + the **negative assertion `IPAddressAllow=any not in text`** (regression guard).
The web app, ScriptBuilder, routes, models, and migrations are all unchanged. Same as v2.
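
The regression guard on the helper text could be expressed along these lines. This is a sketch, not the contents of `test_deploy_artifacts.py`: `check_helper_text` and the two tuples are hypothetical, and the required snippets are limited to strings that appear in this document.

```python
# Strings the helper is expected to contain after this change.
REQUIRED_SNIPPETS = (
    "IPAddressDeny=",
    "/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf",
)
# Strings that must never reappear; this is the negative assertion.
FORBIDDEN_SNIPPETS = ("IPAddressAllow=any",)


def check_helper_text(text: str) -> list:
    """Return a list of human-readable problems; empty means the helper passes."""
    problems = []
    for snippet in REQUIRED_SNIPPETS:
        if snippet not in text:
            problems.append(f"missing: {snippet}")
    for snippet in FORBIDDEN_SNIPPETS:
        if snippet in text:
            problems.append(f"forbidden: {snippet}")
    return problems
```

A pytest case would then read the deployed helper file and assert that `check_helper_text(text) == []`.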
## Verification
Smoke battery on the deployed host (probe script invoked through the helper as root):
| Target | Expected | Actual |
|---|---|---|
| `1.1.1.1:443` | connected | ✓ CONNECTED |
| `https://steamcommunity.com/` (DNS + HTTPS) | 200 | ✓ 200 |
| `127.0.0.1:8000` (web app) | blocked | ✓ TimeoutError |
| `127.0.0.1:22` (sshd) | blocked | ✓ TimeoutError |
| `10.0.4.128:22` (host LAN ssh) | blocked | ✓ TimeoutError |
| `10.0.0.1:53` (host's DNS resolver) | blocked | ✓ TimeoutError |
| `cat /etc/resolv.conf` inside | shows 1.1.1.1 + 8.8.8.8 | ✓ |
`bpftool cgroup show` against the unit's cgroup confirms `sd_fw_egress` and `sd_fw_ingress` are attached.
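
The probe script itself is not reproduced in this document. A minimal TCP probe that would yield results in the same shape as the table's "Actual" column might look like this; the `probe` function and its default timeout are assumptions, and the exception name reported for a dropped packet depends on the Python version (`TimeoutError` on 3.10 and later).

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connect; return "CONNECTED" or the name of the exception raised."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "CONNECTED"
    except OSError as exc:
        # Silently dropped packets surface as a timeout rather than a refusal.
        return type(exc).__name__
```

Run inside the sandbox, a blocked destination is expected to time out because the egress filter drops packets instead of rejecting them.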
## Risks
- **`IPAddressAllow=` accidentally added back.** Reopens every blocked range silently. Mitigation: explicit negative test in `test_deploy_artifacts.py` plus a comment in the helper.
- **Public DNS resolver outage.** If 1.1.1.1 and 8.8.8.8 are both down, DNS in the sandbox fails and builds fail. Using two resolvers from independent operators makes this very unlikely. An operator who prefers different resolvers can edit `/etc/left4me/sandbox-resolv.conf`; the helper picks up the change on the next invocation.
- **Public DNS resolver privacy.** Cloudflare and Google see hostnames the scripts query. Acceptable for the workload (Steam Workshop, GitHub, etc. are public anyway); switch to Quad9 or self-hosted if this is a concern.
- **Future kernel/systemd that flips the documented "more specific wins" semantics.** If a future systemd version actually implements the documented behavior, a unit with only `IPAddressDeny=` continues to work; the negative test on `IPAddressAllow=any` keeps the regression-safe configuration locked in. Re-test on each major systemd upgrade.
- **Scripts that legitimately need a private IP.** E.g., a self-hosted internal mirror at 10.x. Not a use case today; if it arises, expose specific IPs via a future `IPAddressAllow=10.x.y.z/32` for that one host (not blanket).
## Out Of Scope
- **Per-overlay UID isolation.** Cross-script-overlay write access via the shared `l4d2-sandbox` UID is still possible after a hypothetical sandbox bypass. Deferred from earlier discussions.
- **Egress allowlist by hostname / domain.** Would require a forward proxy (Squid, mitmproxy). Heavier than warranted for the trust model.
- **Dropping `AF_UNIX` from `RestrictAddressFamilies=`.** Tangential to IP-level egress; risks breaking NSS / syslog.
- **DNSSEC / DoH.** Threat model is egress targeting, not DNS hijacking.
- **Network-namespace isolation (`PrivateNetwork=` + custom netns + NAT).** Heavier than `IPAddressDeny=` for equivalent outcome.
## Implementation Boundaries
- **No app code change.** Helper-side only.
- **No new systemd units.** Same transient `left4me-script-{id}-{pid}.service` pattern.
- **No new apt deps.** `bpftool` was used during smoke testing but is not required at runtime.
- **One new deploy artifact.** `sandbox-resolv.conf` shipped under `deploy/files/etc/left4me/`.


@ -1,115 +0,0 @@
# Overlay File Tree Section Design
**Goal:** Add a "Files" section to the overlay detail page (`/overlays/<id>`) that renders a collapsible tree of the overlay's runtime directory at `${LEFT4ME_ROOT}/overlays/{overlay.id}/`, with lazy expansion of folders (one fetch per first-time expand) and click-to-download for individual files. Same access rule as the rest of the overlay detail page (admin or `overlay.user_id == g.user.id`). Read-only; no rename/delete/upload in v1.
**Approval status:** User-approved 2026-05-08 (visual companion brainstorm + plan-mode review). Implemented + deployed in the same session. The lazy-load originally targeted HTMX (vendored in `base.html`), but the post-deploy smoke uncovered that `static/vendor/htmx.min.js` was a 33-byte placeholder — the real library was never vendored. Rather than vendoring full HTMX for one feature, the lazy-load was switched to plain JS using the same fetch + innerHTML pattern (~30 lines in `static/js/file-tree.js`). The route + partial contracts are unchanged.
## Context
Today, the overlay detail page shows the row's metadata (name, type, scope, path, last build status), a workshop-items table or script editor depending on `overlay.type`, and links to the build-job stream. It never shows what's actually inside the overlay directory on disk. To verify "did my script actually produce what I expected?" or "did the right VPKs land in `addons/`?" the user has to SSH into the host and `ls /var/lib/left4me/overlays/{id}/`.
Click-to-download is a secondary nice-to-have: workshop overlays' `addons/*.vpk` are absolute symlinks into the shared `${LEFT4ME_ROOT}/workshop_cache/`, and pulling a single VPK to a dev box otherwise means scping with the right path translation.
## Locked Decisions
1. **All overlay types show the section.** Script + workshop + system/managed. Consistency over a tighter scope; even workshop's predictable `addons/*.vpk` layout is worth confirming.
2. **Collapsible tree, lazy load on expand.** Tree can get large; only the root level is rendered server-side at first paint. Each folder click fires `GET /overlays/<id>/files?path=<rel>` and innerHTMLs the response into that folder's `.file-tree-children` div. The no-JS path still shows the root level (the same partial is server-rendered) — folders just won't expand.
3. **Single delegated JS handler.** `static/js/file-tree.js` listens for `click` on `document`, finds the closest `.file-tree-toggle` button, toggles `aria-expanded` + `hidden`, and on first expand fires a `fetch()` against the URL in the button's `data-files-url`. Subsequent toggles never re-fetch (`button.dataset.loaded` flag, set optimistically before the fetch to dedupe rapid clicks; cleared on error to allow retry).
4. **Single-file download in v1.** No bulk archive (e.g., "download whole overlay as `.tar.gz`"). Files are streamed via Flask `send_file(..., as_attachment=True)`. No size cap: VPKs are commonly 100–500 MB, and that's the whole point.
5. **No auto-refresh.** The tree reflects what was on disk at page render. After a build, the user reloads the page. Polling/SSE would duplicate the existing live-log mechanism on the build-job page for negligible benefit.
6. **Same access rule as the rest of the page.** `g.user.admin or overlay.user_id is None or overlay.user_id == g.user.id`. GETs need no CSRF (`l4d2web/app.py:56`).
7. **`overlay.path` not `overlay.id`.** The runtime directory is reached via `overlay.path` (current creation flow guarantees `path == str(id)`, but legacy/seeded rows may differ). Path resolution happens through the existing `l4d2host.paths.overlay_path()` helper, which already validates the ref string and resolves+verifies it stays under `${LEFT4ME_ROOT}/overlays/`.
8. **Empty / unresolvable → empty state.** If the overlay's path is unresolvable (legacy absolute-path rows) or the directory doesn't exist (overlay never built), the section renders "No files yet — build this overlay to populate it." rather than crashing.
9. **500-entry cap per folder.** Folders with more than 500 children render the alphabetical-first 500 plus a `+ M more (truncated)` footer. Tunable at runtime via `l4d2web.services.overlay_files.DEFAULT_MAX_ENTRIES` (re-resolved per call so tests can monkeypatch).
10. **Hidden files shown.** No filtering of `.git`, `.DS_Store`, etc. Users want ground truth.
11. **One dedicated blueprint, `files_bp`.** Not folded into `overlay_routes.py` (which is exclusively POST mutations) or `page_routes.py` (top-level pages, not embedded fragments). `files_bp` owns both the tree fragment and the download endpoint.
## Architecture
```text
GET /overlays/<id> (page_routes.overlay_detail)
  ▼ computes (file_tree_root_entries, truncated_count) via
  │   _root_file_tree(overlay) → safe_resolve_for_listing(overlay.path, "")
  │   → list_directory(overlay_root, overlay_root)
  ▼ renders overlay_detail.html, which includes _overlay_file_tree.html
    for the root level (or the empty-state <p>).

GET /overlays/<id>/files?path=<rel> (files_routes.overlay_files_fragment)
  ▼ auth gate (admin or owner)
  ▼ safe_resolve_for_listing(overlay.path, rel) → Path under overlay_root
  ▼ list_directory(target, overlay_root) → entries[], truncated_count
  ▼ renders _overlay_file_tree.html (partial only, no base.html)

GET /overlays/<id>/files/download?path=<rel> (files_routes.overlay_files_download)
  ▼ auth gate
  ▼ safe_resolve_for_download(overlay.path, rel) → real Path under LEFT4ME_ROOT
  │   (follows symlinks; allows targets anywhere in LEFT4ME_ROOT, e.g. workshop_cache)
  ▼ Flask send_file(real, as_attachment=True, download_name=basename(real))
```
### File tree fragment shape
`_overlay_file_tree.html` produces a `<ul class="file-tree">` containing one `_overlay_file_node.html` row per entry plus an optional truncated-footer `<li>`. A folder row is a `<button class="file-tree-toggle" data-files-url="…">` followed by an empty `<div class="file-tree-children" hidden>` that becomes the fetch target. A file row is an `<a href=".../files/download?path=…">` (or a plain `<span>` for broken symlinks) plus optional badges (`link`, `broken link`) and the resolved size.
Nesting after expand:
```html
<li class="file-tree-row file-tree-row-dir">
  <button class="file-tree-toggle" aria-expanded="true" data-files-url="…"></button>
  <div class="file-tree-children">
    <ul class="file-tree" role="group"></ul> <!-- inserted by file-tree.js -->
  </div>
</li>
```
## Path Safety
Two resolvers in `l4d2web/services/overlay_files.py`:
- `safe_resolve_for_listing(overlay.path, sub_path)` — resolves `overlay_root / sub_path`, applies `Path.resolve(strict=False)`, requires the result to be the overlay root or a descendant. Used by the tree-fragment route. **Refuses to recurse through symlinks that leave the overlay root**, including symlinks pointing into `workshop_cache/` — listing has no need to follow them, since workshop addons are leaf files, not directories we'd descend into.
- `safe_resolve_for_download(overlay.path, sub_path)` — resolves the candidate path, applies `os.path.realpath()`, requires the result to be under `${LEFT4ME_ROOT}` (anywhere — overlay dir, `workshop_cache/`, future siblings). This is the relaxed gate that lets workshop addons stream from the shared cache while still blocking absolute symlinks to `/etc/passwd` planted by a malicious script overlay.
Both resolvers re-use `l4d2host.paths.overlay_path()` (which itself calls `validate_overlay_ref`) for the overlay-root resolution, and `l4d2web.services.security.validate_overlay_ref` for the sub-path component (rejects empty / `.` / `..` / absolute / whitespace / backslash). Empty `sub_path` is valid for listing (means "the overlay root") and invalid for download.
Listing: `target.is_dir()` check after resolution; non-directory → 404.
Download: `real.exists()` check (404), `real.is_dir()` rejection (400 — "not a file").
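
The difference between the two gates can be illustrated with a simplified sketch. It is not the code in `l4d2web/services/overlay_files.py`: the function names, the `UnsafePath` exception, and the explicit `left4me_root` parameter are assumptions, and the separate sub-path validation described above is omitted.

```python
import os
from pathlib import Path


class UnsafePath(ValueError):
    """Raised when a requested path escapes its allowed root."""


def _is_within(child: Path, parent: Path) -> bool:
    return child == parent or parent in child.parents


def resolve_for_listing(overlay_root: Path, sub_path: str) -> Path:
    """Strict gate: the resolved target must stay inside the overlay root."""
    root = overlay_root.resolve(strict=False)
    target = (root / sub_path).resolve(strict=False)
    if not _is_within(target, root):
        raise UnsafePath(f"{sub_path!r} leaves the overlay root")
    return target


def resolve_for_download(overlay_root: Path, sub_path: str, left4me_root: Path) -> Path:
    """Relaxed gate: symlinks are followed, but only to targets under left4me_root."""
    if not sub_path:
        raise UnsafePath("a download needs a file path")
    base = Path(os.path.realpath(left4me_root))
    real = Path(os.path.realpath(overlay_root / sub_path))
    if not _is_within(real, base):
        raise UnsafePath(f"{sub_path!r} resolves outside {base}")
    return real
```

With this shape, a symlink from `addons/` into `workshop_cache/` passes the download gate but not the listing gate, while a symlink to `/etc/passwd` fails both.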
## Symlink Behaviour
`list_directory` uses `os.scandir()` with explicit `follow_symlinks` flags:
- `is_symlink = entry.is_symlink()`
- `kind`: `entry.is_dir(follow_symlinks=True)` inside a try block. Raised `OSError` → broken symlink, treated as `kind="file"` with `broken=True` and no `<a>` download link.
- `size`: `entry.stat(follow_symlinks=True).st_size` for files (resolved target's size — what users care about for VPKs); `None` for dirs and broken symlinks.
Symlinked directories pointing inside the overlay root are rendered as folders and remain expandable; the listing-time safety check rejects expansion if the symlink resolves outside the overlay root.
Concurrent build vs listing race: a build mid-symlink-rewrite can yield a transient broken-symlink view. Acceptable — page is a snapshot; the visible "broken link" badge tells the user to refresh.
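
The scan described above could be sketched as follows. This is a simplified stand-in, not the project's `list_directory` (which also takes the overlay root): the `Entry` dataclass and the `list_directory_sketch` name are hypothetical, and the dirs-first ordering is taken from the test description rather than from the implementation. In this sketch a broken symlink is detected when `stat()` fails on the missing target.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional, Tuple

DEFAULT_MAX_ENTRIES = 500


@dataclass
class Entry:
    name: str
    kind: str            # "dir" or "file"
    is_symlink: bool
    broken: bool
    size: Optional[int]  # resolved target size for files, None otherwise


def list_directory_sketch(target: Path, max_entries: Optional[int] = None) -> Tuple[List[Entry], int]:
    """Return (entries, truncated_count) for one directory level."""
    if max_entries is None:
        max_entries = DEFAULT_MAX_ENTRIES  # looked up per call so tests can patch it
    entries = []
    with os.scandir(target) as scan:
        for item in scan:
            is_symlink = item.is_symlink()
            is_dir, broken, size = False, False, None
            try:
                is_dir = item.is_dir(follow_symlinks=True)
                if not is_dir:
                    size = item.stat(follow_symlinks=True).st_size
            except OSError:
                # The symlink target is missing or unreadable.
                is_dir, broken, size = False, True, None
            kind = "dir" if is_dir else "file"
            entries.append(Entry(item.name, kind, is_symlink, broken, size))
    # Directories first, then alphabetical within each group.
    entries.sort(key=lambda e: (e.kind != "dir", e.name))
    truncated = max(0, len(entries) - max_entries)
    return entries[:max_entries], truncated
```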
## Test Strategy
Two test modules, both following existing fixture patterns (`tests/test_script_overlay_routes.py` style — `monkeypatch.setenv("LEFT4ME_ROOT", str(tmp_path))`, app fixture with `TESTING=True`).
- `tests/test_overlay_files.py` — pure-helper unit tests (Flask-free): listing-resolver happy/sad paths (root, sub-path, `..`, absolute, empty component, symlink-out-of-overlay), download-resolver happy/sad paths (regular file, workshop-cache symlink, outside-LEFT4ME_ROOT symlink, traversal, absolute, empty), `list_directory` behaviour (empty, dir-first sort, kind detection, rel paths, symlink markers, broken-symlink markers, truncation cap, human-size formatting).
- `tests/test_overlay_files_routes.py` — HTTP integration tests: tree-fragment 200 / 400 / 403 / 404 across the same axes; download 200 / 400 / 403 / 404 + content-disposition + byte-exact body for both regular files and workshop-cache symlinks; admin-can-view-foreign overlay; truncation-via-route (monkeypatching `DEFAULT_MAX_ENTRIES`); broken-symlink rendering omits the `<a>` download link; the page-level `overlay_detail` integration shows the section with entries when populated and the empty state when the directory is missing.
39 tests total. The full web suite (`pytest l4d2web/tests/ -q`) must remain green.
## Out of Scope
- Bulk download (e.g., "download overlay as tar.gz").
- Inline file preview (text peek, image thumbnail).
- File deletion / rename / upload from the UI.
- Auto-refresh while a build is active.
- Filtering hidden files or applying a `.gitignore`-style rule.
- Reusable file-tree component for things outside overlays.


@ -1,55 +0,0 @@
# Server ID as Host Identifier Design
**Goal:** Decouple the user-facing server label from the host-side identifier. The systemd unit name and on-disk paths become functions of `Server.id`; `Server.name` becomes a free-form display label.
**Approval status:** User-approved 2026-05-08.
## Context
`Server.name` was doing two unrelated jobs. It was the human label rendered in the UI *and* the literal string fed to `l4d2ctl`, which became the systemd unit instance (`left4me-server@<name>.service`) and the directories under `/var/lib/left4me/{instances,runtime}/<name>/`. To stay safe as a unit-template parameter and a path component, the name was forced through `[a-z0-9][a-z0-9_-]{0,63}` and held globally unique. The cost was a UI that demanded machine-friendly slugs, no rename support, and an awkward divergence from overlays — which already separate identity (`id`) from label (`name`).
This change moves servers onto the same model as overlays. Web URLs already key on `id` (`/servers/<int:server_id>`), so the change is mostly local: pick an id-derived host identifier, pass that everywhere `server.name` was passed, and relax the `name` constraints.
## Locked Decisions
1. **Host-side identifier = plain numeric id.** `left4me-server@42.service`, `/var/lib/left4me/instances/42/`, `/var/lib/left4me/runtime/42/`. The host CLI's `validate_instance_name` regex (`[a-z0-9][a-z0-9_-]{0,63}`), the systemctl helper's argument check (`[A-Za-z0-9_.-]`), and the unit template (`%i`) all already accept digit-only strings — no host-side change.
2. **Name = free-form display label, unique per user, required (≤128 chars).** Whitespace is stripped on save. Two users can both have a server named "Practice"; one user cannot.
3. **No data preservation.** Dev-only deploy. Existing servers on the test host are not migrated; their old `left4me-server@<old-name>.service` units and `<old-name>/` directories become orphans and are cleaned up manually.
4. **Single source of truth for the id-to-host-name rule.** A one-line helper (`server_unit_name(server_id) -> str(server_id)`) lives in `l4d2web/services/server_identity.py`. Every callsite that used to pass `server.name` to `l4d2ctl` or `journalctl` calls this. Future format tweaks (e.g. `srv-{id}`) are a one-line edit.
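A minimal sketch of the helper from decision 4, together with the unit and path shapes from decision 1. Only `server_unit_name` is named in this spec; `systemd_unit`, `instance_dir`, and `INSTANCE_NAME_RE` are hypothetical names added for illustration.

```python
import re

# Regex quoted in decision 1 for the host CLI's instance-name check.
INSTANCE_NAME_RE = re.compile(r"[a-z0-9][a-z0-9_-]{0,63}")


def server_unit_name(server_id: int) -> str:
    """Single place that maps a Server.id to the host-side identifier."""
    return str(server_id)


def systemd_unit(server_id: int) -> str:
    """Hypothetical helper: the templated unit name for a server."""
    return f"left4me-server@{server_unit_name(server_id)}.service"


def instance_dir(server_id: int) -> str:
    """Hypothetical helper: the on-disk instance directory for a server."""
    return f"/var/lib/left4me/instances/{server_unit_name(server_id)}/"
```

Because every caller goes through `server_unit_name`, a later switch to something like `srv-{id}` would only touch that one function.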
## Schema
`servers` (Alembic 0006):
- Drop the (unnamed) global `UNIQUE (name)` from the original 0001 schema.
- Add `UNIQUE (user_id, name)` as `uq_servers_user_name`.
- Column stays `name VARCHAR(128) NOT NULL`.
The migration uses `batch_alter_table(recreate="always")` with a `naming_convention` so the originally-anonymous unique can be referenced as `uq_servers_name` for `drop_constraint`.
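The end state of the schema can be illustrated with the standard-library `sqlite3` module. This is only a sketch of the resulting constraint, not the Alembic migration itself; the reduced column set and the `make_db` / `try_insert` helpers are assumptions for the example.

```python
import sqlite3
from typing import Optional


def make_db() -> sqlite3.Connection:
    """Create an in-memory table carrying only the per-user unique constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE servers (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL,
            name VARCHAR(128) NOT NULL,
            CONSTRAINT uq_servers_user_name UNIQUE (user_id, name)
        )
        """
    )
    return conn


def try_insert(conn: sqlite3.Connection, user_id: int, name: str) -> Optional[str]:
    """Insert a row; return None on success or the SQLite error text on conflict."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO servers (user_id, name) VALUES (?, ?)",
                (user_id, name),
            )
    except sqlite3.IntegrityError as exc:
        return str(exc)
    return None
```

Two users can each insert "Practice"; a second "Practice" for the same user returns the `UNIQUE constraint failed` message and writes no row.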
## Code touchpoints
- `l4d2web/services/server_identity.py` (new)
- `l4d2web/models.py` — drop `unique=True` on `Server.name`; add `__table_args__` with the per-user unique.
- `l4d2web/alembic/versions/0006_server_name_per_user.py` (new)
- `l4d2web/services/l4d2_facade.py` — five `l4d2ctl` invocations switched to `server_unit_name(server.id)`. Parameter renamed to `unit_name` on `server_status` / `stream_server_logs`.
- `l4d2web/services/job_worker.py` — status refresh uses `server_unit_name(server.id)`. The `server_name` log-label variable still holds `server.name` (the display label); that's correct now and shows up in job logs as e.g. "starting initialize for My Practice".
- `l4d2web/routes/log_routes.py` — SSE log stream feeds `server_unit_name(server.id)` to `journalctl`.
- `l4d2web/routes/server_routes.py` — replace `validate_instance_name` with `_validate_display_name` (strip + non-empty + length ≤128). Broaden the `IntegrityError` handler to disambiguate `servers.name` (409 "name already in use") from `servers.port` (409 "port already in use") via the underlying SQLite error string.
- `l4d2web/services/security.py`: `validate_instance_name` deleted (no remaining callers).
- `l4d2web/templates/servers.html` — name input gains `maxlength="128"`.
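The `server_routes.py` bullet above describes two small pieces of logic: display-name validation and telling a name conflict from a port conflict. A hedged sketch of both follows; the function names, error messages, and the fallback string are assumptions, and the real route code may differ.

```python
MAX_NAME_LENGTH = 128


def validate_display_name(raw: str) -> str:
    """Strip whitespace, then require a non-empty name of at most 128 chars."""
    name = raw.strip()
    if not name:
        raise ValueError("name is required")
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"name must be at most {MAX_NAME_LENGTH} characters")
    return name


def conflict_message(sqlite_error: str) -> str:
    """Map the SQLite IntegrityError text to a user-facing 409 message."""
    if "servers.name" in sqlite_error:
        return "name already in use"
    if "servers.port" in sqlite_error:
        return "port already in use"
    return "conflict"  # assumed fallback, not specified in this document
```

Matching on the error string is brittle by nature, so the sketch keeps an explicit fallback rather than assuming one of the two columns always appears.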
## Failure modes
- **Name with shell metacharacters reaches a host command.** Cannot happen — the host call now receives only `str(server.id)` (digits). The display name is never passed through `l4d2ctl`.
- **Two servers under the same user with the same name.** Blocked at the DB layer (`uq_servers_user_name`); surfaced as a 409 "name already in use" with no row written.
- **Migration on a DB with existing servers.** `batch_alter_table(recreate="always")` rebuilds the table preserving rows; the new per-user constraint is satisfied trivially since the old global constraint already enforced strict uniqueness.
## Verification
1. `python -m pytest l4d2web l4d2host deploy` from the repo root — green.
2. Stepwise migration on a fresh sqlite (upgrade to 0005, insert two users + a server, upgrade to 0006): row preserved, second user can take the same name, same user cannot (UNIQUE constraint failed: servers.user_id, servers.name).
3. Post-deploy on the test host: create a server named `"My Practice"` (with the space), confirm the systemd unit is `left4me-server@<id>.service`, confirm `/var/lib/left4me/runtime/<id>/merged` is mounted on start, confirm log streaming still works.
## Operator note
After deploy, on the test host: stop and remove any pre-existing `left4me-server@<old-name>.service` units and their `/var/lib/left4me/{instances,runtime}/<old-name>/` directories. The web app no longer references them.


@ -1,220 +0,0 @@
# Files overlay (user-managed file content)
## Context
In the prior `ckn-bw` setup, per-server config-style files (`admins.txt`, `motd.txt`, mapcycle, etc.) lived under `bundles/left4dead2/files/scripts/overlays/standard`. `left4me` has no equivalent: today an overlay's contents come from either Steam Workshop (`workshop` type) or a user-authored bash build script (`script` type). Both have an external source-of-truth, so neither is the right home for files the user owns directly. The user wants both online editing of text files *and* arbitrary file upload, and we unify them into a single mechanism.
## Goal
Add a third overlay type `files` whose source-of-truth IS the overlay directory itself. Provide a web UI to:
- **Upload** any file or whole folder by dragging it onto a folder row in the tree (drag from the OS).
- **Move** files and folders by dragging rows inside the tree (internal drag).
- **Create / edit / rename / replace** files through a single modal editor, opened from row buttons. Modal adapts to text or binary content.
- **Download** files (or zip an entire folder).
- **Delete** files and empty folders.
- **Create new folders** explicitly (including nested intermediates in one shot).
Reuse the existing overlayfs / spec / mount / `expose_server_cfg` pipeline unchanged: a `files` overlay is a normal overlay attached to blueprints.
## Non-goals (v1)
- Per-server overrides (servers still bind to a blueprint without per-instance file changes).
- Concurrency policing when an overlay is in use by a running server. Overlayfs technically calls lower-layer mutation undefined behavior, but L4D2 reads most config at boot, so "edits visible on next start" is acceptable.
- Versioning / undo / history.
- Syntax highlighting (CodeMirror-style). Plain `<textarea>`; can add later.
- "Save As" copy. The filename input *is* Save-As.
- Recursive directory delete from the UI.
- Multi-file drop into the binary "replace" zone (single file only).
## Approach
### Data model
`Overlay.type` accepts a new value: `"files"` (in addition to `"workshop"` and `"script"`). No schema change needed — `Overlay.type` is already `String(16)`. The `script` column stays empty for files overlays; `last_build_status` is set to `"ok"` on creation and not otherwise managed. Privacy follows the existing `user_id` rules unchanged.
`BlueprintOverlay` and the `expose_server_cfg` checkbox keep working as-is: a `files` overlay containing a `server.cfg` is exposed via the same alias mechanism the 2026-05-08 plan introduced.
### Filesystem layout
A files overlay lives at `${LEFT4ME_ROOT}/overlays/{overlay.path}/` like every other overlay. Example contents:
```
overlays/{id}/
  left4dead2/
    cfg/
      server.cfg
    motd.txt
    mapcycle.txt
    addons/
      sourcemod/configs/admins_simple.ini
      custom_map.vpk
```
The `InstanceSpec` / `OverlayRef` shape already supports this. The spec builder in `l4d2web/services/l4d2_facade.py` doesn't need to learn about overlay types, only to keep emitting `path` (and `alias` when `expose_server_cfg` is set).
### Builder registration
`l4d2web/services/overlay_builders.py::BUILDERS` gains a `"files"` entry whose `build()` is a no-op that ensures `_overlay_root(overlay)` exists. The route layer also short-circuits: there is no "rebuild" concept for a files overlay — every save / upload / move / mkdir / delete is immediately authoritative.
### Safety helpers
`l4d2web/services/overlay_files.py` already has `safe_resolve_for_listing` and `safe_resolve_for_download` (anchor-and-resolve, refuse `..` traversal and symlink-target escapes). Add three siblings using the same pattern:
- `safe_resolve_for_write(overlay_path_value, sub_path) -> Path` — destination path. Refuses empty `sub_path`, refuses any escape, refuses to overwrite an existing symlink, refuses a path whose parent resolves to a non-directory.
- `safe_resolve_for_delete(overlay_path_value, sub_path) -> Path` — same root-escape rules; allows deleting files and empty directories. Non-empty directory delete returns an error.
- `safe_resolve_for_move(overlay_path_value, src, dst) -> tuple[Path, Path]` — both endpoints inside the overlay root. Refuses `dst` inside `src` (cycle). Refuses if `src` doesn't exist. Refuses if `dst` parent is missing or not a directory. Refuses overwriting a symlink at `dst`.
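As a rough illustration of the anchor-and-resolve pattern, here is a sketch of the move helper. It takes an already-built overlay root `Path` instead of `overlay_path_value`, and the `UnsafePath` exception and `_anchor` helper are hypothetical names; the existing helpers in `overlay_files.py` may be structured differently.

```python
import os
from pathlib import Path


class UnsafePath(ValueError):
    """Raised when a requested path must be refused (hypothetical name)."""


def _anchor(root: Path, sub_path: str) -> Path:
    """Resolve sub_path under root, leaving the final component unresolved."""
    parts = Path(sub_path).parts
    if not sub_path or Path(sub_path).is_absolute() or ".." in parts:
        raise UnsafePath(f"refused path: {sub_path!r}")
    root = root.resolve()
    candidate = root.joinpath(*parts)
    # Resolving the parent follows symlinked directories on the way down,
    # so a link that points outside the overlay fails the containment check.
    parent = candidate.parent.resolve()
    if not parent.is_relative_to(root):
        raise UnsafePath(f"path escapes overlay root: {sub_path!r}")
    return parent / candidate.name


def safe_resolve_for_move(overlay_root: Path, src: str, dst: str) -> tuple[Path, Path]:
    """Validate both endpoints of a move and return them as absolute paths."""
    src_path = _anchor(overlay_root, src)
    dst_path = _anchor(overlay_root, dst)
    if not os.path.lexists(src_path):
        raise UnsafePath("source does not exist")
    # Also refuses dst == src, which the UI already rejects client-side.
    if dst_path.is_relative_to(src_path):
        raise UnsafePath("destination is inside the source")
    if not dst_path.parent.is_dir():
        raise UnsafePath("destination parent is missing or not a directory")
    if dst_path.is_symlink():
        raise UnsafePath("refusing to overwrite a symlink")
    return src_path, dst_path
```

Keeping the final component unresolved is what lets the helper notice a symlink sitting at `dst` instead of silently following it.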
Plus a small predicate:
- `is_editable(path: Path) -> bool` — true iff `path` is a regular file (not symlink), size ≤ 1 MiB, and first 8 KiB decodes as strict UTF-8. Surfaced via `_entry_dict` in listings as `editable: bool`.
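A possible shape for the predicate, shown as a sketch under the limits stated above. The constant names are assumptions, and the use of an incremental decoder is a design choice of this example: it avoids rejecting a valid file just because the 8 KiB window cuts a multi-byte character in half.

```python
import codecs
from pathlib import Path

MAX_EDITABLE_BYTES = 1024 * 1024  # 1 MiB cap from the spec
SNIFF_BYTES = 8 * 1024            # only the first 8 KiB is inspected


def is_editable(path: Path) -> bool:
    """True for a small regular file whose leading bytes are strict UTF-8."""
    if path.is_symlink() or not path.is_file():
        return False
    size = path.stat().st_size
    if size > MAX_EDITABLE_BYTES:
        return False
    with path.open("rb") as handle:
        head = handle.read(SNIFF_BYTES)
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # final=False tolerates a character split at the window edge;
        # when the whole file fits in the window, demand a complete decode.
        decoder.decode(head, final=size <= SNIFF_BYTES)
    except UnicodeDecodeError:
        return False
    return True
```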
### UI design
The file-manager lives inside the existing overlay detail page, only when `overlay.type == "files"`. Layout follows the existing `<ul class="file-tree">` pattern, extended as below.
#### Tree row buttons (hover-reveal, CSS `:hover`)
| Row | Buttons (left-to-right) | Click on row body | Draggable |
|---|---|---|---|
| Folder (incl. overlay root) | `+ new file` · `+ new folder` · `⬇ zip` · `✕` | toggle expand/collapse | yes (move subtree) |
| File (any) | `edit` · `⬇` · `✕` | nothing | yes (move file) |
Files always show `edit` regardless of editability — the modal adapts. Touch devices fall back to always-visible buttons via a `(hover: none)` media query.
#### Drag-and-drop on tree rows — single gesture, source distinguishes
| Drag source | Action | Visual on hovered row | Endpoint |
|---|---|---|---|
| OS file/folder (`dataTransfer.files` / `webkitGetAsEntry`) | upload | green outline + `↑ Release to upload N items here` | `POST /overlays/{id}/files/upload` |
| Tree row (file or folder) | move | green outline + `↦ Move {name} here` | `POST /overlays/{id}/files/move` |
Refused drops (UI rejects without server round-trip): drop on self, drop on own ancestor (cycle), drop where parent doesn't exist. Conflict at destination → server returns 409 → overwrite/keep-both modal.
#### Upload progress panel
Each dropped item becomes one `POST /files/upload` request (one file part, `target_path` set to the dropped row's path, `webkitRelativePath` preserved). A floating "Uploads" panel docks to the bottom-right of the page while there is at least one in-flight or queued upload, and auto-collapses when the queue is empty.
- **Per-file rows** in the panel: filename, target path (subtle), progress bar driven by `XMLHttpRequest.upload.onprogress`, queue position, per-file cancel button.
- **Concurrency:** at most 3 uploads in flight; remainder queue. Drop-while-uploading appends to the queue with no special UI.
- **Cancel mid-flight:** aborts the XHR; server cleans up any partial file in a `finally` block.
- **Conflicts:** a 409 on an individual file pauses just that upload (panel row shows "conflict — overwrite / keep both") and opens the existing overwrite/keep-both modal scoped to that one path. The rest of the queue keeps running.
- **Errors:** per-file error states (413 too large, 415 bad content, 422 path validation, 5xx) stay sticky in the panel until the user dismisses them. The panel has a "clear done" toggle.
- **Tree refresh:** when an upload finishes, the affected parent folder's listing partial is re-fetched (`hx-get` on the folder row). Debounced (50 ms) so many siblings finishing in one tick coalesce into one fetch.
#### Editor modal — single `<dialog>` with two flavors
The editor modal opens via the row's `edit` button or the folder's `+ new file` button.
**Common chrome (both flavors):**
- **Title** = full path (e.g. `left4dead2/cfg/motd.txt`). For new files: `addons/sourcemod/configs/…new file`.
- **Filename input** — single line, slashes rejected. Diverging from the original shows an inline `↻ Save will rename foo.txt → bar.txt` hint.
- **Footer**: `Delete` on the left (only for existing files), then `⬇ Download`, `Cancel`, `Save`/`Create` on the right.
**Text flavor** (file is editable, or new file):
- Content `<textarea>`, 1 MiB cap on save, UTF-8 only.
- Footer hint: `UTF-8 · {n} bytes` + `Ctrl+S to save`.
**Binary flavor** (existing file is not editable):
- Replaces the textarea with a "Replace file" panel: a label noting `⛌ Inline editing not available · {size} · binary content`, plus a drop zone (`↑ Drop a file here to replace`) with a `browse` link as fallback. Single file only.
- Once a replacement is queued, the drop zone shows `↻ {newName} · {size} · queued` with an `✕` to clear the queue.
**Save semantics** (atomic per call; rename + content change happen in one server operation):
| Mode | Filename unchanged | Filename changed |
|---|---|---|
| Text | write content | rename + write content |
| Binary, no replacement queued | (Save disabled) | rename only |
| Binary, replacement queued | overwrite content | rename + overwrite content |
Rename target collision → 409 → overwrite/keep-both modal (same modal as upload conflicts).
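The save-semantics table can be read as a small decision function. The sketch below encodes it that way for illustration; `plan_save` and its step strings are hypothetical and do not correspond to a named function in this spec.

```python
from typing import Optional


def plan_save(
    flavor: str,
    filename_changed: bool,
    replacement_queued: bool = False,
) -> Optional[list[str]]:
    """Return the ordered steps for one Save, or None when Save is disabled."""
    if flavor == "text":
        steps = ["write content"]
    elif flavor == "binary":
        # Without a queued replacement there is no content to write.
        steps = ["overwrite content"] if replacement_queued else []
    else:
        raise ValueError(f"unknown flavor: {flavor!r}")
    if filename_changed:
        steps.insert(0, "rename")
    return steps or None
```

Whatever the step list is, the spec requires the server to apply it as one atomic operation per call.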
#### `+ new folder` dialog
A small dedicated `<dialog>` separate from the editor. Single text input for the folder name. Slashes allowed → creates intermediate dirs (`mkdir(parents=True, exist_ok=False)`).
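The `mkdir(parents=True, exist_ok=False)` behaviour can be seen in a short sketch. The `make_folder` name is an assumption, and the example leaves out the safe-resolve step that the real route would run first.

```python
from pathlib import Path


def make_folder(parent: Path, name: str) -> Path:
    """Create `name` under `parent`, including any missing intermediates."""
    target = parent / name
    # parents=True creates intermediate directories; exist_ok=False makes an
    # existing final directory raise FileExistsError instead of passing silently.
    target.mkdir(parents=True, exist_ok=False)
    return target
```

A caller would typically translate `FileExistsError` into the 409 response that the verification list expects for a collision.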
#### `+ new file` flow
Reuses the editor modal in text flavor with empty content; the filename input is empty and focused, the title shows the source folder + `…new file`.
### Web routes
In `l4d2web/routes/files_routes.py` (alongside the existing `overlay_files_fragment` and `download` endpoints):
| Method | Path | Body | Purpose |
|---|---|---|---|
| GET | `/overlays/{id}/files/content` | `?path=` | Returns `{path, content}` for an editable file. 415 if not editable. |
| POST | `/overlays/{id}/files/save` | JSON `{path, content, new_path?}` | Text-mode save. Optional `new_path` performs rename atomically with the write. |
| POST | `/overlays/{id}/files/replace` | multipart `path`, `file`, optional `new_path` | Binary-mode replace. Optional `new_path` performs rename atomically. |
| POST | `/overlays/{id}/files/upload` | multipart `target_path`, single `file` part (carrying `webkitRelativePath`) | OS-drag upload, one file per request. Creates intermediate dirs via `mkdir(parents=True)`. Cleans up partial writes on cancel via `finally`. 200 on success, 409 on conflict, 413/415/422 on validation failure. |
| POST | `/overlays/{id}/files/move` | JSON `{src, dst}` | Internal drag move (and plain rename when same parent). |
| POST | `/overlays/{id}/files/mkdir` | JSON `{path}` | Create empty folder; slashes in `path` produce nested intermediates. |
| POST | `/overlays/{id}/files/delete` | form `path` | Delete file or empty folder. |
| GET | `/overlays/{id}/files/download_zip` | `?path=` | Stream a zip of the folder's contents. |
Existing `GET /overlays/{id}/files?path=...` and `GET /overlays/{id}/files/download?path=...` stay as-is. The listing endpoint additionally returns `editable` per file row.
All new routes:
- 404 when `overlay.type != "files"`.
- Require `overlay.user_id == current_user.id` (or admin).
- Use the new safe-resolve helpers.
- CSRF via the existing `csrf.js` injection (multipart endpoints included).
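For the `download_zip` row, a minimal sketch of building the archive with the standard library is shown below. It buffers the zip in memory rather than streaming it, and it skips symlinks; both are simplifications assumed for the example, and `zip_folder` is a hypothetical name.

```python
import io
import zipfile
from pathlib import Path


def zip_folder(folder: Path) -> bytes:
    """Return a zip archive of the regular files below `folder`."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            # Assumption: symlinks are left out so the archive cannot leak
            # content from outside the overlay root.
            if path.is_symlink() or not path.is_file():
                continue
            archive.write(path, arcname=path.relative_to(folder).as_posix())
    return buffer.getvalue()
```

A streaming variant would write into the response incrementally instead of a `BytesIO`, which matters for folders holding large VPKs.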
### Tech stack
Stay inside the project's established stack — Flask + Jinja2 + HTMX + tiny vanilla JS in `static/js/` + custom CSS with tokens, no build step:
- **Templates:** Jinja2 partials, returned as HTMX swaps where appropriate (subtree refresh after upload/move/mkdir/delete).
- **Modals:** native `<dialog>` with the existing `data-modal-open` / `data-modal-close` event-delegated handlers.
- **JS:** vanilla. Extend `static/js/file-tree.js` (or add a sibling `files-overlay.js`) covering: `dragstart` on rows, `dragover` highlight + source-discrimination (`dataTransfer.types.includes("Files")` vs internal MIME), `webkitGetAsEntry()` walk for whole-folder OS drops, editor modal open/save (Ctrl+S, fetch POST), binary replace-zone drop handler, conflict-modal flow, new-folder dialog, upload queue + floating progress panel (XHR per file, concurrency 3, abort on cancel, debounced tree-refresh on completion).
- **CSS:** extend `tokens.css` and `components.css` with file-manager-specific rules — drop-target outline, hover-reveal action column, editor modal sizing, replace-zone styling.
No external libraries (no Dropzone, no jsTree, no CodeMirror) — adding one would be a meaningful departure from the project's "no build step, vendored libs only" posture.
### Creation flow for new overlays
The "create overlay" UI gains a third radio option: `Files`. Selecting it skips the type-specific fields (no Steam Workshop selector, no script editor) and creates an empty `Overlay` row with `type="files"`, `last_build_status="ok"`, and an empty directory.
### Host-side
No changes. The mount helper, instance lifecycle, and srcds startup don't care what produced the contents of an overlay directory.
### Migration / Alembic
None. `Overlay.type` already stores arbitrary strings; introducing a new value is data-only.
## Critical files
| Layer | File | Change |
|---|---|---|
| Models | `l4d2web/models.py` | None (Overlay.type already String) |
| Builders | `l4d2web/services/overlay_builders.py` | Register `FilesBuilder` (no-op `build`) |
| Safety | `l4d2web/services/overlay_files.py` | Add `safe_resolve_for_write`, `safe_resolve_for_delete`, `safe_resolve_for_move`; add `is_editable` and surface it via `_entry_dict` |
| Routes | `l4d2web/routes/files_routes.py` | Add `content`, `save`, `replace`, `upload`, `move`, `mkdir`, `delete`, `download_zip` endpoints |
| Templates | `l4d2web/templates/overlay_detail.html`, `l4d2web/templates/_overlay_file_tree.html` | Hover-reveal action buttons; `data-target-path` on folder rows; `draggable="true"` on file/folder rows; editor modal `<dialog>` with both flavors; new-folder modal `<dialog>`; conflict modal `<dialog>` |
| Static JS | `l4d2web/static/js/file-tree.js` (extend) or new `files-overlay.js` | Drag-drop wiring, modal save, binary replace, mkdir, conflict flow, upload queue + panel |
| Static CSS | `l4d2web/static/css/components.css` | Drop-target outline, hover action column, editor modal sizing, replace-zone, upload panel |
| Create form | overlay creation template + route | Add `files` option to the type radio |
| Spec / facade | `l4d2web/services/l4d2_facade.py` | None — already type-agnostic |
| Host spec | `l4d2host/spec.py`, `l4d2host/instances.py` | None |
| Tests | adjacent to each touched module | safe-resolve refusals; `is_editable` heuristic; CRUD round-trip; ownership; non-files-type 404s; multipart with `webkitRelativePath`; move refuses cycles; conflict (409); zip stream; mkdir parents |
## Verification
1. **Safety unit tests**: `safe_resolve_for_write`, `_for_delete`, `_for_move` reject `..` traversal, absolute paths, symlink-target escapes, attempts to overwrite a symlink, non-empty-dir delete, and `dst` inside `src`.
2. **Editability heuristic**: `is_editable` returns false for files > 1 MiB, symlinks, and files with non-UTF-8 bytes in their first 8 KiB.
3. **Editor round-trip (text)** — from a folder row, "+ new file" → modal → save creates `left4dead2/cfg/admins.txt`; row appears with `edit` button; edit; rename via filename input; delete.
4. **Editor round-trip (binary)** — upload a `.vpk`, click `edit`, queue a replacement file via drop, change filename, Save → rename + replace happen atomically.
5. **Upload single file** — drag a file from the OS onto `left4dead2/cfg/`; appears with size and download link.
6. **Upload whole folder** — drag `addons/sourcemod/` from the OS onto the overlay root; nested structure preserved; intermediate directories auto-created.
7. **Conflict on upload** — drop a file with a colliding name; overwrite/keep-both modal; both choices behave correctly.
8. **Move within tree** — drag `motd.txt` onto `addons/`; file moves; tree refreshes.
9. **Move refusals** — drag a folder onto itself or a descendant; UI rejects without server round-trip.
10. **mkdir**: `+ new folder` with name `sourcemod/configs` creates both intermediates; collision returns 409.
11. **Zip download**: `⬇ zip` on `addons/` streams a valid zip containing the subtree.
12. **Mount integration** — attach the files overlay to a blueprint, start a server, confirm the files appear under `runtime/{server_id}/merged/...`.
13. **server.cfg alias** — with `expose_server_cfg=true` and a `server.cfg` in the files overlay, `exec server_overlay_{id}` is auto-injected into the merged `server.cfg`.
14. **Type isolation** — every new endpoint returns 404 for `workshop` and `script` overlays.
15. **Browser smoke test** — Chromium and Firefox: drag a folder containing nested files into a row; confirm `webkitRelativePath` arrives correctly.
16. **Upload progress panel** — drop 5 files of mixed sizes; panel shows 3 in flight, 2 queued; per-file progress bars advance; canceling one file aborts that XHR cleanly without affecting the others; partial file is removed server-side; tree refreshes once per parent folder (debounced) when uploads finish.
17. **End-to-end on the real test box** — deploy the branch to `ckn@10.0.4.128` via the project's deploy path, then drive the running web UI through the `claude-in-chrome` MCP tools end-to-end: create a `files` overlay, attach to a blueprint, exercise every CRUD path, boot a server, confirm the files materialize in the merged mount. Iterate until all paths work without errors.


@ -1,131 +0,0 @@
# l4d2 cpu isolation — design
Date: 2026-05-09
Status: design
## Summary
Constrain every cgroup that isn't a live game server to core 0; give game servers cores 1..N-1 exclusively. Implementation is systemd cgroup-v2 `AllowedCPUs=` drop-ins, computed at deploy time from `nproc`, overridable via env vars. Lands on top of the perf baseline shipped in `851e662..e5126c8`.
## Goals
- A logged-in admin doing CPU-heavy work, the script-build sandbox, and the Flask web app cannot steal cycles from a live match.
- Layout scales automatically across host sizes (4-core, 8-core, 16-core) without per-host edits.
- Operator can override the default `0` / `1..N-1` split for NUMA boxes or hyperthread quirks.
- Single-core hosts degrade gracefully: skip CPU isolation, keep the rest of the perf baseline.
## Non-goals
- Kernel `isolcpus=` / `nohz_full=` / `rcu_nocbs=` boot parameters. True core isolation (eviction of softirqs, RCU, timer ticks) requires GRUB edits + reboot + per-host tuning. cgroup cpuset is sufficient for L4D2 tickrates; document as a future opt-in if measurement justifies it.
- NIC IRQ pinning. Hardware-specific; already documented as an escape hatch in `deploy/README.md`.
- Per-instance pinning *within* the game-core set. The slice-level cpuset is the floor; the existing per-instance `CPUAffinity=` drop-in escape hatch (already in `deploy/README.md`) composes on top — the kernel enforces "per-instance value must be a subset of slice's allowed set."
- A separate `l4d2-web.slice`. The web app is light; living in `system.slice` on core 0 is fine.
- Web-app or host-library code changes. Pure deploy-side artifact work.
## Background
The perf baseline (commit range `851e662..e5126c8`) introduced two slices (`l4d2-game.slice` weight 1000, `l4d2-build.slice` weight 10), per-instance unit directives (Nice, OOM, memory caps), and host sysctls. None of those constrain *which* CPUs cgroups run on. Under the kernel CFS, every task can move to any core; the build sandbox, ssh sessions, the web app, and game servers all compete for the same cores.
## Design
### Topology
```
                    core 0           cores 1..N-1
                    ─────────        ────────────
system.slice        AllowedCPUs=0
user.slice          AllowedCPUs=0
l4d2-build.slice    AllowedCPUs=0
l4d2-game.slice                      AllowedCPUs=1-(N-1)
```
Everything that isn't a live game server (Flask web app, ssh sessions, journald, script-sandbox builds, cron, systemd housekeeping) is funneled to core 0. Game servers get cores 1..N-1 exclusively.
### Why slice-level `AllowedCPUs=`, not per-instance `CPUAffinity=`
- **Hierarchy does the work for free.** A cpuset on `l4d2-game.slice` propagates to every `left4me-server@*.service` automatically. No per-instance drop-ins to manage; no logic in the web app to pick cores.
- **Hot-applied.** cgroup-v2 cpuset changes apply to running cgroups; existing servers move next time the kernel schedules them. No need to restart instances after a deploy.
- **Composable.** A future operator who wants per-instance pinning *within* the game cores adds `CPUAffinity=N` via `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf` (already documented). The slice constraint and per-instance pin compose; the kernel enforces subset-of.
### Why drop-ins, not edits to the existing `.slice` files
The two slice files we ship today (`l4d2-game.slice`, `l4d2-build.slice`) are static text and host-portable. `AllowedCPUs=1-7` is true on an 8-core host and wrong on a 4-core host. Drop-ins under `<unit>.d/*.conf` are the standard systemd pattern for host-specific overrides. We already use `99-` prefixing for the sysctl drop-in so it lex-orders last; reuse that.
### Operator override
Two env vars consumed by the deploy script:
- `LEFT4ME_SYSTEM_CPUS` — defaults to `0`. Goes into `system.slice`, `user.slice`, `l4d2-build.slice` drop-ins.
- `LEFT4ME_GAME_CPUS` — defaults to `1-$((NPROC-1))`. Goes into `l4d2-game.slice` drop-in.
Operators with NUMA boxes, hyperthread quirks, or "I want core 0 *and* core 1 for system" set the vars explicitly. Defaults handle the typical case.
### Single-core fallback
If `nproc < 2`, skip CPU isolation entirely (write no drop-ins). Print a warning to stderr explaining the deploy is leaving cpuset unset. The rest of the perf baseline still applies (weights, sysctls, OOM scores).
If `LEFT4ME_GAME_CPUS` or `LEFT4ME_SYSTEM_CPUS` is set explicitly on a single-core host, honor the operator's intent and write the drop-ins anyway; they presumably know what they're doing.
### Drop-in layout
Four files written to `/etc/systemd/system/`, each named `99-left4me-cpuset.conf`:
```
/etc/systemd/system/system.slice.d/99-left4me-cpuset.conf
/etc/systemd/system/user.slice.d/99-left4me-cpuset.conf
/etc/systemd/system/l4d2-build.slice.d/99-left4me-cpuset.conf
/etc/systemd/system/l4d2-game.slice.d/99-left4me-cpuset.conf
```
Each file contains:
```ini
[Slice]
AllowedCPUs=<resolved value>
```
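The override, fallback, and drop-in rules above can be sketched as follows. The deploy script itself is shell; this Python version only mirrors the logic for illustration, and `resolve_cpu_layout`, `dropin_path`, and `render_dropin` are hypothetical names. The spec does not say what happens when only one variable is set on a single-core host, so the sketch simply applies the usual default for the other one.

```python
import sys
from typing import Optional

SYSTEM_SLICES = ("system.slice", "user.slice", "l4d2-build.slice")
GAME_SLICE = "l4d2-game.slice"


def resolve_cpu_layout(nproc: int, env: dict[str, str]) -> Optional[dict[str, str]]:
    """Return {slice: AllowedCPUs value}, or None when isolation is skipped."""
    system = env.get("LEFT4ME_SYSTEM_CPUS") or None
    game = env.get("LEFT4ME_GAME_CPUS") or None
    explicit = system is not None or game is not None
    if nproc < 2 and not explicit:
        print("warning: single-core host, leaving cpuset unset", file=sys.stderr)
        return None
    system = system or "0"
    # Assumption: the default may not be meaningful on a single-core host.
    game = game or f"1-{nproc - 1}"
    layout = {name: system for name in SYSTEM_SLICES}
    layout[GAME_SLICE] = game
    return layout


def dropin_path(slice_name: str) -> str:
    """Path of the cpuset drop-in for one slice."""
    return f"/etc/systemd/system/{slice_name}.d/99-left4me-cpuset.conf"


def render_dropin(cpus: str) -> str:
    """Body of one drop-in file."""
    return f"[Slice]\nAllowedCPUs={cpus}\n"
```

On an 8-core host with no overrides this yields `0` for the three system-side slices and `1-7` for `l4d2-game.slice`.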
### systemd compatibility
`AllowedCPUs=` is systemd 244+. Debian Trixie ships systemd 256+. Cgroup-v2 cpuset controller is enabled by default on Trixie; systemd auto-enables the controller when `AllowedCPUs=` is set on a unit. No additional machinery.
### Files changed / added
```
deploy/deploy-test-server.sh (modified — compute layout, write four drop-ins)
deploy/README.md (modified — new "CPU isolation" subsection inside Performance Tuning)
deploy/tests/test_deploy_artifacts.py (modified — new tests)
```
## Tests
`deploy/tests/test_deploy_artifacts.py` additions, following the existing
`assert "X" in script` pattern:
- For `deploy-test-server.sh`, assert:
- All four drop-in paths (`/etc/systemd/system/{system,user,l4d2-build,l4d2-game}.slice.d/99-left4me-cpuset.conf`) appear.
- The script reads `nproc` (substring `nproc` plus a default-binding form for `LEFT4ME_GAME_CPUS`).
- The script honors `LEFT4ME_SYSTEM_CPUS` and `LEFT4ME_GAME_CPUS` env-var overrides (substrings present, default-binding form like `${LEFT4ME_SYSTEM_CPUS:-...}`).
- The script has a single-core fallback (substring guarding `nproc -lt 2` or equivalent, with a warning to stderr).
- Each drop-in is written via the existing `install -m 0644 -o root -g root` heredoc pattern.
No runtime tests in this spec — verifying that systemd actually enforces `AllowedCPUs=` is operator-side via `cat /sys/fs/cgroup/<slice>/cpuset.cpus.effective` after deploy.
## Rollout
Single deploy. cgroup-v2 cpuset changes apply to running cgroups, so already-running servers move next time the kernel reschedules them — no instance restarts required. The `daemon-reload` already in the deploy script picks up the new drop-ins.
If something goes wrong (cpuset too narrow, a slice can't run any process), `systemctl status <slice>` will show the error and the operator can either fix the env vars and redeploy or `rm /etc/systemd/system/<slice>.slice.d/99-left4me-cpuset.conf` followed by `systemctl daemon-reload` to revert.
## Open questions
None blocking. Possible v2 candidates if measurement justifies them:
- Pair this with kernel `isolcpus=` boot params for true core isolation.
- Auto-pin NIC IRQs to core 0 (would compose with this isolation).
- Per-instance `CPUAffinity=` driven by a deploy-env knob, partitioning the game-core set across instances deterministically.
## References
- systemd.resource-control(5) — `AllowedCPUs=` semantics.
- Linux Documentation/admin-guide/cgroup-v2.rst — cpuset controller behavior on `cpuset.cpus` / `cpuset.cpus.effective`.
- Existing perf-baseline spec: `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md` — sibling work that introduced the slices this spec extends.


@ -1,83 +0,0 @@
# l4d2 cpu pinning — decision record (deferred)
Date: 2026-05-09
Status: decision (no implementation)
## Question
After the lifecycle + drift fix landed (commits `8552c55`, `67b5521`), the
question came up: with `AllowedCPUs=1-7` already constraining game servers
to cores 1-7, do CFS scheduler migrations *within* that range still cause
meaningful jitter? Should we hard-pin each instance to a single core?
## Investigation
The classic "lazy CFS" sysctl knob is **gone** on modern kernels. Verified
on Trixie's running kernel 6.12 (`ckn@10.0.4.128`):
```
/sbin/sysctl -a | grep -E "sched_migration_cost|sched_min_granularity|sched_wakeup_granularity|sched_latency"
# (no output)
```
`kernel.sched_migration_cost_ns` and the other classic CFS tunables were
removed from the sysctl interface in 5.13 (they moved to debugfs under
`/sys/kernel/debug/sched/`), ahead of the scheduler rework that culminated
in EEVDF (6.6). Only `kernel.sched_rt_period_us` / `sched_rt_runtime_us`
remain. There is no global "be lazy about migrations" sysctl knob anymore.
### Available paths
| Option | Cost | Strictness | Pays off when |
|---|---|---|---|
| Trust CFS + `Nice=-5` + `AllowedCPUs=1-7` (current) | None | Soft | ≤ 3 instances on 7 cores; CFS rarely migrates active CPU-bound nice<0 tasks |
| Per-instance `CPUAffinity=N` drop-in | Web-app machinery to write drop-ins, daemon-reload, modulo or DB-persisted assignment | Strict | ≥ 4 instances (each gets exclusive core), or measured jitter |
| `isolcpus=1-7 nohz_full=1-7 rcu_nocbs=1-7` kernel cmdline | GRUB edit + reboot, host-specific | Strongest (also evicts kernel softirqs/RCU/timer ticks from game cores) | Tickrate-128 with measurable kernel-induced jitter |
| `SCHED_FIFO` per unit | Risky (RT misconfig can stall kernel) | Strict | Already documented as ops-side escape hatch in `deploy/README.md` |
### Why deferring is defensible
- The slice's `AllowedCPUs=1-7` already prevents game servers from running on core 0. The open question is "do they migrate within 1-7?" Yes, CFS can migrate, but for long-running CPU-bound `srcds` with `Nice=-5`, migrations are infrequent. CFS prefers cache locality and only migrates when an idle core "steals" or a periodic load-balance tick detects imbalance.
- With ≤ 3 instances on 7 game cores, the load balancer rarely sees imbalance to fix.
- Per-instance hard pinning adds non-trivial machinery (drop-in writer through `left4me-systemctl`, or extending `instance.env` + a `taskset` wrapper in the unit). Not warranted unless we observe a real problem.
- `deploy/README.md` already documents the `CPUAffinity=N` per-instance drop-in as an opt-in escape hatch. An operator who measures jitter can apply it without code changes.
## Decision
**No code change.** Keep the current setup:
- Slice-level `AllowedCPUs=1-7` ensures game servers never touch core 0.
- `Nice=-5` keeps active srcds tasks weighted heavily so CFS prefers leaving them alone.
- The `CPUAffinity=N` per-instance drop-in remains the documented escape hatch.
## Revisit triggers
If any of these signals appears, design and implement strict per-instance pinning:
- ≥ 4 game-server instances running simultaneously on one host.
- A specific server reports tickrate dips / rubber-banding correlated with another instance starting or a build sandbox firing.
- `perf stat -e sched:sched_migrate_task -p <srcds-pid>` shows > 1 migration/sec under load.
When revisiting, two implementation paths to choose from:
1. **Modulo assignment in the host library.** Read `LEFT4ME_GAME_CPUS` (or parse the slice's `AllowedCPUs=` drop-in), pick `game_cpus[(int(name) - 1) % len(game_cpus)]`, write `L4D2_CPU=N` into `instance.env`, wrap the unit's `ExecStart` with `taskset -c ${L4D2_CPU}`. Stateless, deterministic, no DB column. **Preferred.**
2. **Persisted assignment.** Add `Server.cpu_pin` column, web app picks at initialize time and stores. Survives `LEFT4ME_GAME_CPUS` changes (each server keeps its assigned core). Bigger ripple.
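A minimal sketch of the modulo assignment from path 1, assuming instance names are numeric strings and that the game-core set is written in cpuset list syntax such as `1-7`. The helper names `parse_cpu_list` and `pick_cpu` are hypothetical and do not exist in the host library.

```python
def parse_cpu_list(spec: str) -> list[int]:
    """Expand a cpuset-style list such as "1-7" or "1-3,6" into sorted CPU ids."""
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pick_cpu(instance_name: str, game_cpus: list[int]) -> int:
    """Map a 1-based numeric instance name onto a game core, wrapping around."""
    return game_cpus[(int(instance_name) - 1) % len(game_cpus)]


# Assumed value of LEFT4ME_GAME_CPUS (hypothetical format: cpuset list).
game_cpus = parse_cpu_list("1-7")
assignments = {name: pick_cpu(name, game_cpus) for name in ("1", "2", "7", "8")}
print(assignments)
```

With seven game cores, instance 8 wraps back onto core 1, so exclusivity only holds while the instance count stays at or below the number of game cores.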
## Verification (no-op confirmation)
```sh
ssh ckn@10.0.4.128 'systemctl show l4d2-game.slice -p AllowedCPUs'
# expect: AllowedCPUs=1-7
ssh ckn@10.0.4.128 'cat /sys/fs/cgroup/system.slice/cpuset.cpus.effective'
# expect: 0 (everything-not-game still pinned to core 0)
# When ≥ 1 server is running:
ssh ckn@10.0.4.128 'for p in $(pgrep srcds); do grep ^Cpus_allowed_list /proc/$p/status; done'
# expect: 1-7 (CFS picks whichever of those is hottest at any given moment)
```
## References
- `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md` — sibling design that introduced the `AllowedCPUs=1-7` slice constraint this record builds on.
- `deploy/README.md` "Performance Tuning" section — the `CPUAffinity=N` per-instance escape hatch.
- Linux kernel changelog 5.13+ — removal of classic CFS tunable sysctls.

View file

@ -1,230 +0,0 @@
# l4d2 server host perf baseline — design
Date: 2026-05-09
Status: design
## Summary
Apply a host-side performance and resource-isolation baseline to every L4D2 server instance, using systemd unit directives, a slice hierarchy, and host sysctls. The blueprint-level game configuration (tickrate, sv_minrate/maxrate, fps_max, plugins) stays the responsibility of the individual server maintainer and is out of scope.
## Goals
- Game-server processes get measurable scheduling, I/O, and OOM priority over the script-build sandbox and over interactive system traffic.
- One misbehaving server cannot OOM-kill its siblings or the host.
- The kernel's UDP path is sized for sustained Source-engine traffic instead of distro defaults.
- Operators have documented escape hatches for host-specific tuning (CPU pinning, governor, NIC IRQs, real-time scheduling) without any of it being imposed by default.
## Non-goals
- ConVars, blueprint arguments, plugins, tickrate, rate values — owned by the maintainer of each server.
- Real-time (`SCHED_FIFO`/`SCHED_RR`) scheduling for game servers. Documented as opt-in only; see Out-of-scope rationale.
- CPU governor changes. Documented opt-in only.
- Per-instance `CPUAffinity`. Host-specific; documented only.
- NIC ring-buffer / IRQ-pinning changes. Hardware-specific; documented only.
- Job-scheduler awareness ("don't build a script overlay while server X has players"). Cgroup weights cover this in v1; revisit if real-world data disagrees.
- Hardening tightening (`ProtectKernelTunables=yes`, etc.). Security-focused, separate spec.
## Background
Current state (commit `965b67e`):
- `deploy/files/usr/local/lib/systemd/system/left4me-server@.service` runs `srcds_run` as user `left4me` with security hardening (`NoNewPrivileges`, `PrivateTmp`, `PrivateDevices`, `ProtectHome`, `ProtectSystem=strict`, `ReadOnlyPaths`, `ReadWritePaths`, `RestrictSUIDSGID`, `LockPersonality`) but **no scheduling, memory, OOM, kill-signal, or log-rate directives**.
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` runs script-overlay builds via `systemd-run --scope` with `CPUQuota=200%` and `RuntimeMaxSec=3600`, but in the **default cgroup** — it competes against game servers as an equal sibling under `system.slice`.
- No host sysctls are deployed. Linux defaults (`rmem_max`/`wmem_max` ≈ 208 KB, `netdev_max_backlog=1000`) are below what sustained UDP gameplay across multiple instances expects.
srcds is single-threaded per instance, so multi-instance hosts contend over CPU cycles, kernel softirq budget, and journald rate limits.
## Design
### Slice topology
Flat top-level slices, siblings of `system.slice` and `user.slice`:
```
-.slice
├── system.slice (default CPUWeight=100, IOWeight=100)
├── user.slice (default CPUWeight=100, IOWeight=100)
├── l4d2-game.slice (CPUWeight=1000, IOWeight=1000)
└── l4d2-build.slice (CPUWeight=10, IOWeight=10)
```
Rationale:
- 100:1 weight ratio between game and build means: under contention, the build sandbox is starved; when uncontended, the build still gets the full box modulo its own `CPUQuota=200%`.
- Flat (not nested under `system.slice`) so a logged-in admin running a heavy task in `user.slice` cannot steal cycles from a live match.
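The weight arithmetic can be checked with a short sketch. It assumes the usual cgroup v2 behavior where contended CPU time is split among busy sibling slices in proportion to their weights; `cpu_shares` is a hypothetical helper, not part of the deploy tooling.

```python
def cpu_shares(weights: dict[str, int]) -> dict[str, float]:
    """Fraction of contended CPU time each busy sibling slice receives."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Worst case: every top-level slice is busy at the same time.
all_busy = cpu_shares({
    "system.slice": 100,
    "user.slice": 100,
    "l4d2-game.slice": 1000,
    "l4d2-build.slice": 10,
})

# Only game servers and a build are competing.
game_vs_build = cpu_shares({"l4d2-game.slice": 1000, "l4d2-build.slice": 10})

for name, share in all_busy.items():
    print(f"{name}: {share:.1%}")
```

Under full contention the game slice keeps roughly 83% of CPU time and the build slice under 1%. Slices that are idle drop out of the sum, which is why an uncontended build is limited only by its own `CPUQuota=200%`.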
### Per-instance unit additions (`left4me-server@.service`)
Add to `[Service]`:
```
Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0
```
Per-directive justification:
- `Slice=l4d2-game.slice` — places the instance in the high-weight slice.
- `Nice=-5` — modest CFS priority bump. Negative `Nice` set by systemd does not require `CAP_SYS_NICE` because systemd applies the value before dropping to the unit user. SCHED_FIFO is intentionally rejected; see Out-of-scope rationale.
- `IOSchedulingClass=best-effort` + `IOSchedulingPriority=4` — explicit best-effort at priority 4, which is already the default level within that class on most distros; deterministic and harmless.
- `OOMScoreAdjust=-200` — game servers survive memory pressure; sandbox dies first (see sandbox section).
- `MemoryHigh=1.5G`, `MemoryMax=2G` — soft + hard ceiling. Typical L4D2 srcds runs ~500-800 MB; map-load spikes fit in headroom; a runaway is bounded.
- `TasksMax=256` — bounds thread count well above srcds' steady-state usage; prevents fork-bomb style failures from leaking host-wide.
- `LimitNOFILE=65536` — Valve wiki recommendation; cheap and matches multi-plugin setups.
- `KillSignal=SIGINT` — srcds responds to SIGINT for clean shutdown (writes demos, flushes logs); SIGTERM is harsher.
- `TimeoutStopSec=15s` — gives srcds time to finish flush before SIGKILL.
- `LogRateLimitIntervalSec=0` — disables journald per-unit rate limiting (default `10000 msgs/30s`). srcds + plugins exceed this on busy maps; dropped messages break diagnostics.
Existing security directives are kept verbatim.
### Slice unit files
New file `deploy/files/usr/local/lib/systemd/system/l4d2-game.slice`:
```ini
[Unit]
Description=left4me game-server slice
Before=slices.target
[Slice]
CPUWeight=1000
IOWeight=1000
```
New file `deploy/files/usr/local/lib/systemd/system/l4d2-build.slice`:
```ini
[Unit]
Description=left4me script-sandbox build slice
Before=slices.target
[Slice]
CPUWeight=10
IOWeight=10
```
### Sandbox slice + OOM placement
Edit `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` to add to the `systemd-run` invocation (transient service mode — the existing helper uses `--unit=` without `--scope`):
- `--slice=l4d2-build.slice`
- `-p OOMScoreAdjust=500`
Existing `CPUQuota=200%` and `RuntimeMaxSec=3600` stay. Cgroup weight (slice) and CPU quota (per-unit) compose: weight handles contention, quota handles the absolute ceiling.
### Host sysctls
New file `deploy/files/etc/sysctl.d/99-left4me.conf`:
```
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
vm.swappiness = 10
```
Per-value justification:
- `rmem_max`/`wmem_max = 8 MB` — Linux default of ~208 KB (212992 bytes) is a known bottleneck for sustained UDP. 8 MB is the standard 1 Gbit recommendation (Red Hat performance guide); enough headroom for ~10 instances on a host without going to 16 MB.
- `rmem_default`/`wmem_default = 512 KB` — protects sockets that don't explicitly call `setsockopt(SO_RCVBUF/SO_SNDBUF)`; harmless when they do.
- `netdev_max_backlog = 5000` — default `1000` overflows under multi-instance UDP burst; the per-CPU softnet queue starts dropping packets once full.
- `netdev_budget = 600` — gives softirq more packet-drain headroom per pass; default `300` is undersized for multi-Gbit-class hosts.
- `vm.swappiness = 10` — universally recommended for latency-sensitive servers; harmless on swapless hosts.
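As a sanity check on the byte values, here is a small sketch that parses the conf text and converts the buffer sizes to human units. `parse_sysctl` is a hypothetical helper; the embedded text mirrors the file listed above.

```python
SYSCTL_CONF = """\
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.rmem_default = 524288
net.core.wmem_default = 524288
net.core.netdev_max_backlog = 5000
net.core.netdev_budget = 600
vm.swappiness = 10
"""


def parse_sysctl(text: str) -> dict[str, int]:
    """Parse simple `key = integer` lines, skipping blanks and comments."""
    values: dict[str, int] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = int(value.strip())
    return values


values = parse_sysctl(SYSCTL_CONF)
rmem_max_mib = values["net.core.rmem_max"] / (1024 * 1024)
rmem_default_kib = values["net.core.rmem_default"] / 1024
print(rmem_max_mib, rmem_default_kib)
```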
### Deploy script integration
`deploy/deploy-test-server.sh` must:
1. Copy `etc/sysctl.d/99-left4me.conf` to `/etc/sysctl.d/`.
2. Run `sysctl --system` (or `sysctl -p /etc/sysctl.d/99-left4me.conf`) so values take effect immediately, not on next boot.
3. Copy the two `.slice` files into `/usr/local/lib/systemd/system/`.
4. `systemctl daemon-reload` after unit/slice changes (already done in current deploy flow).
5. No explicit `systemctl start` of the slices is required — they activate on first child reference.
### Documented escape hatches (no auto-apply)
Append a "Performance tuning" section to `deploy/README.md`:
- **CPU governor**: `cpupower frequency-set -g performance` if jitter under load matters more than power. Schedutil is acceptable for sustained UDP workloads. Provide the one-liner; do not ship a oneshot service in v1.
- **CPU affinity per instance**: example drop-in at `/etc/systemd/system/left4me-server@<name>.service.d/affinity.conf` setting `CPUAffinity=N`. Document the strategy "one instance per core, leave core 0 for system + IRQ".
- **NIC tuning**: example `ethtool -G <iface> rx 4096 tx 4096`, IRQ-pinning hints. Hardware-specific; ops-only.
- **Real-time scheduling opt-in**: example drop-in adding `CPUSchedulingPolicy=fifo`, `CPUSchedulingPriority=10`, `LimitRTPRIO=10`. Include a one-paragraph warning citing RT-throttling defaults (`sched_rt_runtime_us=950000`) and the failure mode if a single instance misbehaves.
These stay pure documentation in v1 — no code paths, no tests asserting them.
### Out-of-scope rationale
- **SCHED_FIFO**: a misbehaving srcds at any RT priority can starve kernel threads and produces failure modes that are harder to diagnose than the jitter problem it claims to solve. `Nice=-5` plus the slice weights captures the practical benefit. Ops who need RT can opt in via the documented drop-in.
- **CPU governor auto-set**: Phoronix and Arch comparisons show `schedutil` is within noise of `performance` on sustained workloads like Source UDP; aggressively forcing `performance` would surprise users on power-managed hosts.
- **CPUAffinity in the unit**: the unit template is shared across all instances; a single hard-coded `CPUAffinity=` would pin every instance to the same cores, defeating the purpose. Per-instance pinning needs deploy-time policy that is outside v1's scope.
### Files changed / added
```
deploy/files/usr/local/lib/systemd/system/left4me-server@.service (modified)
deploy/files/usr/local/lib/systemd/system/l4d2-game.slice (new)
deploy/files/usr/local/lib/systemd/system/l4d2-build.slice (new)
deploy/files/etc/sysctl.d/99-left4me.conf (new)
deploy/files/usr/local/libexec/left4me/left4me-script-sandbox (modified)
deploy/deploy-test-server.sh (modified — sysctl --system step)
deploy/README.md (modified — performance section)
deploy/tests/test_deploy_artifacts.py (modified — assertions)
```
## Tests
`deploy/tests/test_deploy_artifacts.py` additions, following the existing
`assert "key=value" in text` pattern:
- For `left4me-server@.service`, assert every line listed in *Per-instance
unit additions* is present verbatim. Each is a separate assertion so a
failing line is identifiable.
- For `l4d2-game.slice`, assert `CPUWeight=1000` and `IOWeight=1000`.
- For `l4d2-build.slice`, assert `CPUWeight=10` and `IOWeight=10`.
- For `99-left4me.conf`, assert every sysctl line listed in *Host sysctls*.
- For `left4me-script-sandbox`, assert the strings `--slice=l4d2-build.slice`
and `OOMScoreAdjust=500` both appear.
- Assert the deploy script invokes `sysctl --system` (or
`sysctl -p /etc/sysctl.d/99-left4me.conf`) at least once after copying the
conf into place.
No runtime perf tests in v1 — the spec ships defaults, not measured wins.
Real-world measurement is left to operators with concrete instance counts,
hardware, and player loads.
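One possible shape for the per-directive assertions, sketched under assumptions: `missing_lines` is a hypothetical helper, and it matches whole lines rather than the substring check the existing tests use, so `MemoryMax=2G` would not be satisfied by a line such as `MemoryMax=2GB`.

```python
EXPECTED_UNIT_LINES = [
    "Slice=l4d2-game.slice",
    "Nice=-5",
    "IOSchedulingClass=best-effort",
    "IOSchedulingPriority=4",
    "OOMScoreAdjust=-200",
    "MemoryHigh=1.5G",
    "MemoryMax=2G",
    "TasksMax=256",
    "LimitNOFILE=65536",
    "KillSignal=SIGINT",
    "TimeoutStopSec=15s",
    "LogRateLimitIntervalSec=0",
]


def missing_lines(text: str, expected: list[str]) -> list[str]:
    """Return the expected directives that do not appear as a whole line."""
    present = {line.strip() for line in text.splitlines()}
    return [line for line in expected if line not in present]


# Stand-in for the real unit file contents; a test would read the file instead.
sample_unit = "[Service]\n" + "\n".join(EXPECTED_UNIT_LINES) + "\n"
broken_unit = sample_unit.replace("Nice=-5\n", "")

print(missing_lines(sample_unit, EXPECTED_UNIT_LINES))
print(missing_lines(broken_unit, EXPECTED_UNIT_LINES))
```

Returning the list of missing directives keeps a failure message specific, which serves the same purpose as one assertion per line.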
## Rollout
Single deploy. Running game servers will not pick up the new directives until each instance is restarted (systemd does not reapply unit changes to already-running services). The web UI's "stop" + "start" cycle is sufficient. Document this in `deploy/README.md`.
## Open questions
None blocking. v2 candidates if measurement justifies them:
- Per-instance `CPUAffinity` driven by a deploy-env knob (`LEFT4ME_INSTANCE_CPUS`).
- Job-worker awareness of "server has active players" to defer builds further than weights alone.
- Optional `left4me-host-perf.service` oneshot that sets governor + NIC tuning under a single env-flag opt-in.
## References
- systemd.exec(5) — `Nice=`, `IOSchedulingClass=`, `OOMScoreAdjust=`, `LimitNOFILE=`, `LogRateLimitIntervalSec=`.
- systemd.resource-control(5) — slice semantics, `CPUWeight=`, `IOWeight=`, `MemoryHigh=`, `MemoryMax=`, `TasksMax=`, weight competition rules.
- systemd.kill(5) — signal handling and `KillSignal=`.
- systemd.service(5) — `TimeoutStopSec=`.
- Red Hat Enterprise Linux Network Performance Tuning Guide — `rmem_max`/`wmem_max`/`netdev_max_backlog`/`netdev_budget`.
- LWN "SCHED_FIFO and realtime throttling"; RHEL Real-Time CPU throttling docs — rationale for not shipping RT by default.
- Linux Foundation real-time wiki — `sched_rt_runtime_us` semantics.
- forums.srcds.com / AlliedModders / linuxquestions.org threads — confirmation that srcds is single-threaded per instance.
- Phoronix governor comparisons — performance vs schedutil for sustained workloads.
- Multiple latency-tuning guides — `vm.swappiness=10` consensus.

View file

@ -1,217 +0,0 @@
# l4d2 server lifecycle: reboot-safe + drift reconciliation — design
Date: 2026-05-09
Status: design
## Summary
Make L4D2 server instances survive a host reboot by switching their lifecycle verbs from `systemctl start`/`stop` to `systemctl enable --now`/`disable --now`. Pair this with a periodic background poller that refreshes `Server.actual_state` so out-of-band state changes (OOM kills, manual `systemctl stop`, crashes that exhaust `Restart=on-failure`) no longer leave the web UI showing stale "running" indicators.
## Goals
- An L4D2 server started via the web UI (or `l4d2ctl start`) automatically comes back up after a host reboot, with no operator action.
- The web app's `Server.actual_state` converges to systemd's actual state within ~30 seconds of any out-of-band change.
- The single-source-of-truth for "this server should be running" lives in systemd's wants-symlinks, not in a SQLite row that systemd has no awareness of.
- Migration from the existing `systemctl start`-based fleet is a no-op: the next stop+start cycle through the UI converts each server to the enable-based model.
## Non-goals
- **Auto-restart on detected drift.** When the poller observes `desired_state=running` but `actual_state=stopped`, this spec does not re-enqueue a start job. That's a v2 UX/policy decision.
- **UI surfacing of stale-state warnings.** Once the poller is reliable, the dashboard could show "DB believes X, but actual_state was last refreshed N seconds ago." Out of scope.
- **Reconciliation of orphan systemd units.** Units enabled on disk but not represented by any `Server` row (e.g., from a crashed delete) — separate cleanup spec.
- **Per-server poller intervals.** A single global cadence is sufficient.
- **Replacing `Restart=on-failure`** with anything more elaborate. The unit's existing restart policy stays.
- **Reactive-style state propagation.** No SSE/websocket pushes to the UI when actual_state changes. The next page render reads the fresh value from the DB.
## Premise check: system units, not user units
`systemctl --user enable --now` has different lifecycle rules — auto-start only at user login (unless `loginctl enable-linger <user>` is set), symlinks land in `~/.config/systemd/user/<target>.wants/`. It would be wrong here.
This project uses **system units**, confirmed by:
- Unit path: `/usr/local/lib/systemd/system/left4me-server@.service` is the system search path; user units live in `/etc/systemd/user/` or `~/.config/systemd/user/`.
- The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl:31-44`) calls plain `systemctl` (no `--user` flag) and runs as **root** via the sudoers rule at `deploy/files/etc/sudoers.d/left4me:2`.
- The unit's `[Install] WantedBy=multi-user.target` (line 43 of the unit) is a system target; user units would use `default.target`.
- The same machinery is already in production for `left4me-web.service`: `deploy-test-server.sh` runs `sudo systemctl enable --now left4me-web.service`, and that's how the web service came back automatically after today's reboot. We're applying the same pattern to the game-server template instances.
`systemctl enable left4me-server@1.service` will create `/etc/systemd/system/multi-user.target.wants/left4me-server@1.service` symlinked to `/usr/local/lib/systemd/system/left4me-server@.service`. systemd handles the template instantiation via the `@` syntax automatically.
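The path mapping described above can be written down as a sketch. The function names are hypothetical; the directories are the ones named in this section.

```python
from pathlib import PurePosixPath

UNIT_DIR = PurePosixPath("/usr/local/lib/systemd/system")
WANTS_ROOT = PurePosixPath("/etc/systemd/system")


def template_of(unit: str) -> str:
    """Map an instance such as "left4me-server@1.service" to its template unit."""
    stem, _, suffix = unit.rpartition(".")
    prefix, at, _instance = stem.partition("@")
    if not at:
        return unit  # not a template instance
    return f"{prefix}@.{suffix}"


def wants_link(unit: str, target: str = "multi-user.target") -> PurePosixPath:
    """Path of the symlink that `systemctl enable` creates for the unit."""
    return WANTS_ROOT / f"{target}.wants" / unit


def link_destination(unit: str) -> PurePosixPath:
    """Unit file the symlink points at (the shared template)."""
    return UNIT_DIR / template_of(unit)


unit = "left4me-server@1.service"
print(wants_link(unit), "->", link_destination(unit))
```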
## Background
Today's behavior, confirmed by forensics on `ckn@10.0.4.128` after the operator ran `sudo systemctl poweroff` at 11:48:02 CEST:
- The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`) accepts the verbs `start`, `stop`, and `show`, each invoking the literal `systemctl` action.
- `l4d2host/service_control.py` exposes `start_service(name)` and `stop_service(name)` that build `systemctl_command("start"/"stop", name)`.
- `l4d2host/instances.py` `start_instance` and `stop_instance` call those functions.
- `systemctl start` is a transient activation. systemd creates **no** symlink under `multi-user.target.wants/`, so the unit doesn't auto-start on next boot.
- After the host poweroff at 11:48:02, both running instances were cleanly shut down. The host rebooted; `left4me-web.service` came back (it *is* `enable`d); the game instances did not.
- The web app's `Server.actual_state` is only ever written by `refresh_server_actual_state_after_job()` in `l4d2web/services/job_worker.py:581`, called solely after a job completes. With no jobs in flight after the reboot, the row's `actual_state="running"` from yesterday remained the displayed truth.
## Design
### Part A — Switch lifecycle verbs to `enable --now` / `disable --now`
**Helper script** (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`):
Rename the action verbs the helper accepts: drop `start`/`stop`, add `enable`/`disable`. The bodies become:
```sh
case "$action" in
  enable)  exec "$systemctl" enable --now "$unit" ;;
  disable) exec "$systemctl" disable --now "$unit" ;;
  show)    exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
  *)       reject ;;
esac
```
The existing instance-name validation regex (currently lines 12-17) is unchanged; it constrains the `<name>` argument, not the action. The sudoers rule at `deploy/files/etc/sudoers.d/left4me`:
```
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-systemctl *
```
already passes any args; no sudoers update needed.
**Python wrapper** (`l4d2host/service_control.py`):
Rename `start_service` → `enable_service` and `stop_service` → `disable_service`. Each builds `systemctl_command("enable", name)` / `systemctl_command("disable", name)`. The existing `show_service` is unchanged.
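A minimal sketch of how the renamed wrappers might look. It assumes `systemctl_command` returns the `sudo -n <helper> <action> <name>` argv described in the Tests section and that the command runner can be injected; the real module's signatures may differ.

```python
from typing import Callable

# Helper path as deployed; the constant name is an assumption for illustration.
HELPER_PATH = "/usr/local/libexec/left4me/left4me-systemctl"

Runner = Callable[[list[str]], object]


def systemctl_command(action: str, name: str) -> list[str]:
    """Build the argv that invokes the root helper through sudo."""
    return ["sudo", "-n", HELPER_PATH, action, name]


def enable_service(name: str, run_command: Runner) -> object:
    """Start the instance now and on every boot (helper runs `enable --now`)."""
    return run_command(systemctl_command("enable", name))


def disable_service(name: str, run_command: Runner) -> object:
    """Stop the instance now and drop its boot symlink (`disable --now`)."""
    return run_command(systemctl_command("disable", name))


# Record the argv instead of executing it, as a unit test would.
calls: list[list[str]] = []
enable_service("1", calls.append)
disable_service("1", calls.append)
print(calls)
```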
**Instance lifecycle** (`l4d2host/instances.py`):
- `start_instance` — replace the `start_service(...)` call with `enable_service(...)`.
- `stop_instance` — replace `stop_service(...)` with `disable_service(...)`.
- `_purge_instance` (called by `delete_instance` and `reset_instance`) — replace `stop_service(...)` with `disable_service(...)`. For a unit that is already stopped, `disable --now` is a runtime no-op but still removes any leftover wants-symlink, which is the desired idempotent behavior.
**CLI surface** (`l4d2host/cli.py`):
`l4d2ctl start <name>` and `l4d2ctl stop <name>` keep their names per the contract in `AGENTS.md` ("Host CLI write commands are fixed to: install, initialize, start, stop, delete"). The semantics now genuinely match the verb at the operator level: `start` = "ensure running, now and after reboot." Internal call paths route through `start_instance` → `enable_service` as renamed above.
**Web facade** (`l4d2web/services/l4d2_facade.py`):
Unchanged. Still invokes `["l4d2ctl", "start", ...]` / `["l4d2ctl", "stop", ...]`.
### Part B — Periodic state poller
Add a single background thread spawned alongside the existing job-worker threads in `l4d2web/services/job_worker.py:start_job_workers`:
```python
import threading
import time

# select() comes from SQLAlchemy; session_scope, Job, Server, and
# refresh_server_actual_state are the project's existing names.


def start_state_poller(app):
    """Spawn the daemon thread that periodically refreshes actual_state."""
    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
    thread = threading.Thread(
        target=state_poller_loop,
        args=(app, interval),
        daemon=True,
        name="left4me-state-poller",
    )
    thread.start()


def state_poller_loop(app, interval):
    while True:
        try:
            with app.app_context():
                poll_all_servers()
        except Exception:
            pass  # never let a single failure kill the loop
        time.sleep(interval)


def poll_all_servers():
    # Collect servers with queued/running jobs so they can be skipped.
    with session_scope() as db:
        active_server_ids = set(db.scalars(
            select(Job.server_id).where(Job.state.in_(("queued", "running")))
        ).all())
        server_ids = [
            sid for sid in db.scalars(select(Server.id)).all()
            if sid not in active_server_ids
        ]
    # One failing server must not block the refresh of the others.
    for sid in server_ids:
        try:
            refresh_server_actual_state(sid)
        except Exception:
            pass
```
**Why skip in-flight servers:** the job worker's success path also calls `refresh_server_actual_state`. Two writers touching the same row at overlapping times cannot corrupt it (SQLite serializes writes), but a poller observing transient state mid-job (e.g., the brief window where the unit is being enabled but `srcds` hasn't fully bound the port yet) could write a misleading value that the worker's post-completion refresh then overwrites. Skipping is simpler than reasoning about the orderings.
**Wiring in startup** (`l4d2web/app.py:create_app`): call `start_state_poller(app)` adjacent to `start_job_workers(app)`, gated by the same `should_start_workers` predicate (existing lines 84-88: `JOB_WORKER_ENABLED && not TESTING && not _in_flask_cli_context()`).
**First-tick latency:** the loop runs `poll_all_servers()` once before the first `time.sleep(interval)`, so the DB catches up to systemd reality within milliseconds of app boot (one `systemctl show` per server). A separate startup-reconcile path is not needed.
**Concurrency:** the poller and the workers all use `session_scope()` (`l4d2web/db.py:44-58`), which commits on success and rolls back on exception. SQLite WAL mode (configured by the deploy script per `deploy-test-server.sh:188-198`) handles concurrent reads + serialized writes. No new locking primitives.
### Why both parts
Either part alone is insufficient:
- **Part A alone** survives reboots but doesn't catch OOM kills, manual `systemctl disable --now <unit>` from a shell, or crashes that exhaust `Restart=on-failure`. The DB still drifts in those cases.
- **Part B alone** keeps the DB honest but doesn't bring servers back after a reboot — the operator would still be looking at `actual_state=stopped` on a server they expected to come back, with the only recourse being to click start again.
Together: enable-based lifecycle keeps systemd as the source of truth; the poller keeps the DB honest about whatever systemd reports.
### Migration on running hosts
No one-shot migration is needed. After this lands, a server currently running via the old `systemctl start` (so: started but not enabled) keeps running through the deploy. The next time the operator clicks stop in the UI, `systemctl disable --now` runs: `disable` is a no-op for an already-not-enabled unit, but `--now` still stops the live process. The next start runs `systemctl enable --now`, which enables + starts. From that point on the unit survives reboot.
The poller's first tick after deploy refreshes every server's `actual_state` to whatever systemd reports. If the test box's two stale "running" rows still claim running but no unit is loaded, that tick flips them to `stopped`.
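The migration path can be traced with a deliberately simplified state model. It tracks only the two facts that matter here (enabled, active) and ignores restart policy and failures; all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitState:
    enabled: bool  # a wants-symlink exists, so the unit starts at boot
    active: bool   # the unit is running right now


def enable_now(state: UnitState) -> UnitState:
    """Model of `systemctl enable --now`: enabled and running."""
    return UnitState(enabled=True, active=True)


def disable_now(state: UnitState) -> UnitState:
    """Model of `systemctl disable --now`: not enabled and stopped."""
    return UnitState(enabled=False, active=False)


def reboot(state: UnitState) -> UnitState:
    """After a reboot only enabled units are running again."""
    return UnitState(enabled=state.enabled, active=state.enabled)


legacy = UnitState(enabled=False, active=True)  # started via old `systemctl start`
migrated = enable_now(disable_now(legacy))      # one stop + start through the UI

print(reboot(legacy).active, reboot(migrated).active)
```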
### Files changed / added
```
deploy/files/usr/local/libexec/left4me/left4me-systemctl (Part A — verbs)
l4d2host/service_control.py (Part A — rename)
l4d2host/instances.py (Part A — call new names)
l4d2host/tests/test_lifecycle.py (Part A — test updates)
l4d2host/tests/test_service_control.py (Part A — new direct unit tests, create if absent)
deploy/tests/test_deploy_artifacts.py (Part A — helper assertions)
l4d2web/services/job_worker.py (Part B — poller code)
l4d2web/app.py (Part B — wire start_state_poller)
l4d2web/config.py (Part B — STATE_POLLER_INTERVAL_SECONDS default)
l4d2web/tests/test_job_worker.py (Part B — poller tests)
```
## Tests
### Part A
- `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`: update body assertions to expect `enable)` / `disable)` / `show)`. Add an assertion that `enable)` body contains `enable --now` and `disable)` body contains `disable --now`. Update rejected-action examples (drop `start`/`stop` since they're no longer accepted).
- `l4d2host/tests/test_lifecycle.py`: every assertion that mocks `run_command` and inspects the systemctl-helper invocation needs the action token updated from `start` → `enable` and `stop` → `disable`. The `_purge_instance` paths exercised by `delete_instance` and `reset_instance` flip from `stop` to `disable`.
- New direct unit tests in `l4d2host/tests/test_service_control.py` (create the file if it doesn't exist already): exercise `enable_service` and `disable_service` with a mocked `run_command` and assert they emit `["sudo", "-n", helper_path, "enable"|"disable", name]`.
### Part B
- `l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server` (new): seed two `Server` rows with `actual_state="unknown"`; monkey-patch `refresh_server_actual_state` to record calls; run one iteration of `poll_all_servers()`; assert it was called once per server in any order.
- `test_state_poller_skips_servers_with_inflight_jobs` (new): seed a `Server` row + a `Job` with `state="running"` for that server; run `poll_all_servers()`; assert `refresh_server_actual_state` was NOT called for that server.
- `test_state_poller_swallows_per_server_exceptions` (new): make `refresh_server_actual_state` raise for one server; assert other servers are still polled and the loop function returns normally.
- `test_state_poller_disabled_when_job_workers_disabled` (new): create app with `JOB_WORKER_ENABLED=False`; assert `start_state_poller` is not invoked (or that no `left4me-state-poller` thread is alive after `create_app`).
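The skip and swallow behaviors these tests target can be shown without a database. The sketch below uses a hypothetical `poll_servers` function with the refresh callable injected; the real `poll_all_servers` reads ids from the DB instead.

```python
from typing import Callable, Iterable


def poll_servers(
    server_ids: Iterable[int],
    inflight_ids: set[int],
    refresh: Callable[[int], None],
) -> list[int]:
    """Refresh every server without an in-flight job; return the ids that succeeded."""
    refreshed: list[int] = []
    for sid in server_ids:
        if sid in inflight_ids:
            continue  # the job worker refreshes this one after the job completes
        try:
            refresh(sid)
        except Exception:
            continue  # one failing server must not stop the others
        refreshed.append(sid)
    return refreshed


attempted: list[int] = []


def fake_refresh(sid: int) -> None:
    attempted.append(sid)
    if sid == 2:
        raise RuntimeError("systemctl show failed")


refreshed = poll_servers([1, 2, 3, 4], inflight_ids={3}, refresh=fake_refresh)
print(attempted, refreshed)
```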
### CI sanity
`pytest deploy/tests/ l4d2host/tests l4d2web/tests -q` is green except the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state` (stale since `caa8b83`, out of scope).
## Rollout
Single deploy. After deploy:
1. The poller's first tick (within seconds of `left4me-web.service` starting) refreshes every server's `actual_state` to systemd reality. Any servers stuck on stale "running" flip to "stopped" automatically. **No operator UI clicks required.**
2. Servers currently `running` (started via the old `systemctl start`) keep running, but they're not yet `enabled`. The operator's next stop+start through the UI converts them to enable-based and from that point onwards they're reboot-safe.
3. Newly-started servers (`l4d2ctl start <name>` or web UI start) are enable-based from the first invocation.
If something goes wrong — e.g., the helper rejects a previously-valid invocation or the poller floods the journal — the helper script + `service_control.py` change can be reverted independently of the poller, and vice versa.
## Open questions
None blocking. v2 candidates:
- Auto-restart on `desired_state=running && actual_state=stopped` (separate UX decision).
- Per-server poll intervals or backoff for repeatedly-failing servers.
- A "drift" badge in the UI when `actual_state_updated_at` is older than 2× the poll interval (proxy for "the poller isn't running" or "the host is unreachable").
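The drift-badge condition from the last candidate reduces to a timestamp comparison. A sketch, assuming `actual_state_updated_at` is stored as a timezone-aware datetime; `is_stale` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def is_stale(
    updated_at: datetime,
    now: datetime,
    poll_interval_seconds: float = 30.0,
) -> bool:
    """True when the last refresh is older than twice the poll interval."""
    return (now - updated_at) > timedelta(seconds=2 * poll_interval_seconds)


now = datetime(2026, 5, 9, 12, 0, 0, tzinfo=timezone.utc)
fresh = is_stale(now - timedelta(seconds=59), now)
stale = is_stale(now - timedelta(seconds=61), now)
print(fresh, stale)
```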
## References
- systemd.unit(5) — `WantedBy=`, `Install` section semantics.
- systemctl(1) — `enable --now` / `disable --now` flags.
- Existing perf-baseline spec: `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`.
- Existing CPU-isolation spec: `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`.
- `AGENTS.md` — Host CLI write-command set is fixed; this spec preserves that contract.

View file

@ -1,487 +0,0 @@
# l4d2 network shaping & marking — design
Date: 2026-05-10
Status: design
## Summary
Add a network-side player-experience baseline alongside the existing host
perf baseline. Three concerns ship together:
1. **Mark srcds outbound packets** with DSCP `EF` and skb priority `6:0` so
any qdisc — host CAKE, ISP gear that honours DSCP, future systems —
recognises L4D2 game traffic as latency-sensitive. Marking happens by uid
match on the `left4me` user.
2. **Round out the UDP-socket sysctl baseline** (`udp_rmem_min`,
`udp_wmem_min`), set the default qdisc explicitly to `fq_codel`, and
switch TCP to `bbr` so coexisting TCP egress (admin, backups, web app,
apt) cannot bufferbloat the link the players share.
3. **Shape egress with CAKE.** On the test deploy, install a systemd oneshot
that applies `tc qdisc replace … cake …` from an operator-edited env
file. On production hosts running `systemd-networkd`, document the
equivalent `[CAKE]` section in the matching `.network` file as the
long-term path.
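For orientation, the DSCP `EF` class corresponds to code point 46, which occupies the upper six bits of the IPv4 TOS (or IPv6 traffic class) byte. The sketch below only shows that arithmetic; it is not how the spec applies the mark, which happens on the host by uid match. `dscp_to_tos` is a hypothetical helper.

```python
DSCP_EF = 46  # Expedited Forwarding code point


def dscp_to_tos(dscp: int) -> int:
    """Shift a 6-bit DSCP code point into the 8-bit TOS / traffic-class byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    return dscp << 2  # the low two bits are reserved for ECN


tos = dscp_to_tos(DSCP_EF)
print(tos, hex(tos))
```

A packet capture of marked srcds traffic would therefore show a TOS byte of `0xb8` when ECN bits are clear.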
The intent is "all reasonable measures that do not depend on host-specific
hardware." Hardware-specific tuning (NIC ring buffers, IRQ pinning, CPU
governor, real-time scheduling, CPU affinity) remains a documented escape
hatch — same boundary the existing perf-baseline spec drew. The pieces
that *are* universally safe ship as defaults.
## Goals
- Game-server UDP packets carry an unambiguous priority signal in DSCP and
in `skb->priority`, set on the host before any qdisc inspects them.
- A coexisting bulk TCP flow on the same host (backup upload, package
fetch, web-app response) cannot push the bottleneck queue ahead of game
UDP under saturation.
- An operator who declares uplink bandwidth gets fair-queueing egress
shaping with diffserv-aware tin selection — i.e. EF-marked srcds traffic
drops into the highest-priority CAKE tin, per-destination-host fairness
keeps every connected player on equal footing.
- A production deployment using `systemd-networkd` has a one-block
configuration recipe, no helper script needed.
- Operators have a documented set of additional knobs (ingress shaping via
IFB, `busy_poll`, GRO toggling) for cases the default baseline does not
cover. None of these auto-apply.
## Non-goals
- NIC ring-buffer / IRQ pinning / RPS / RFS / hardware timestamping —
already declared host-specific in the perf-baseline spec; not
re-litigated here.
- `busy_poll` / `busy_read` as defaults — non-trivial CPU cost; documented
as opt-in.
- Ingress shaping via IFB as a default — only matters if egress CAKE turns
out load-bearing and ingress is also saturated; documented as opt-in.
- Real-time scheduling, governor changes — already declined by the
perf-baseline spec.
- Blueprint-side game settings (`sv_minrate`, `sv_maxrate`, tickrate,
`fps_max`) — owned by the server maintainer.
- Auto-detection or measurement of uplink bandwidth. CAKE only shapes
correctly when its declared bandwidth sits below the real bottleneck;
the operator must measure once and configure.
- Iface-flap watchdog. `tc qdisc replace` is idempotent; on prod,
`systemd-networkd` reapplies CAKE across iface lifecycle events. On
test, `systemctl restart left4me-cake.service` is the documented
recovery.
## Background
Current state (commit `62d6d4c` or thereabouts):
- The perf-baseline spec ships `/etc/sysctl.d/99-left4me.conf` with
`rmem_max`, `wmem_max`, `rmem_default`, `wmem_default`,
`netdev_max_backlog`, `netdev_budget`, `vm.swappiness`. No per-socket
UDP minimums, no default-qdisc directive, no TCP congestion-control
setting.
- `srcds_run` runs as system user `left4me`. srcds itself does not set
`IP_TOS` or `SO_PRIORITY`, so its UDP packets leave the host with
DSCP 0 and priority 0 — indistinguishable from any other UDP traffic to
any qdisc.
- The deploy ships nftables-relevant infrastructure only via package
defaults (Debian Trixie ships `nftables` in base, but no `left4me`
table is created).
- No qdisc is explicitly configured. The kernel's per-iface default
applies — `fq_codel` on Trixie, but only because Debian's default has
been `fq_codel` since Buster.
- The deploy script already copies sysctl drop-ins and runs
`sysctl --system` (`deploy/deploy-test-server.sh:196`).
## Design
### Sysctl additions to `99-left4me.conf`
Append to `deploy/files/etc/sysctl.d/99-left4me.conf`:
```
# Per-socket UDP buffer floors: protect game-server sockets that don't bump
# their own SO_RCVBUF/SO_SNDBUF when softirq drains lag briefly.
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
# Default qdisc for ifaces we don't explicitly shape with CAKE. Debian
# Trixie already defaults to fq_codel; setting it explicitly is
# belt-and-suspenders and survives kernel-default churn.
net.core.default_qdisc = fq_codel
# TCP congestion control: BBR for any bulk TCP egress on the host (admin
# SSH, backups, package fetches, web-app responses) so a long flow does
# not push the bottleneck queue ahead of game UDP. UDP srcds is
# unaffected.
net.ipv4.tcp_congestion_control = bbr
```
The deploy already runs `sysctl --system` after copying the conf
(`deploy/deploy-test-server.sh:198`); no script change required for this
block.
### nftables packet marking
New file `deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft`:
```nft
table inet left4me_mark {
    chain mangle_output {
        type filter hook output priority mangle; policy accept;
        meta skuid "left4me" meta l4proto udp ip dscp set ef meta priority set 0006:0000
        meta skuid "left4me" meta l4proto udp ip6 dscp set ef meta priority set 0006:0000
    }
}
```
Per-element rationale:
- `meta skuid "left4me"`: every srcds instance runs as that user. The
web app also runs as `left4me`, so the uid match alone is not
exclusive; combined with the UDP match below, only srcds game traffic
is marked. The build sandbox runs under a different uid and never
matches.
- `meta l4proto udp`: bypass anything not UDP, including the future
RCON/HTTP TCP traffic from the web app.
- `ip dscp set ef` / `ip6 dscp set ef`: DSCP `EF` (Expedited Forwarding,
decimal 46) is the standard low-latency marking. CAKE's `diffserv4`
preset routes EF into its highest-priority "Voice" tin. Two rules,
one per L3 family, because in an `inet` table the `ip` matcher only
fires on v4 and `ip6` only on v6.
- `meta priority set 0006:0000`: sets `skb->priority` to class `6:0`.
Classful qdiscs whose handle matches the major number can classify on
it directly. CAKE consults `skb->priority` only when the major number
equals its own qdisc handle and the minor number selects a tin, so
with `6:0` it falls back to the DSCP lookup; under CAKE the EF mark is
what selects the tin. Set inline with the DSCP rule so a single
rule-match runs both statements.
The table is named `left4me_mark` and lives in its own `inet` namespace.
It does not touch, depend on, or conflict with any nftables config the
operator may run independently. `nft -f` loads the file; `nft delete
table inet left4me_mark` cleanly removes it.
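To make the two marks concrete, the sketch below works out the values they put on the wire and in the kernel. It is an illustration only: the helper names `dscp_to_tos` and `tc_handle` are hypothetical and are not part of the deploy.

```python
# Illustration only: helper names are hypothetical, not part of the deploy.

DSCP_EF = 46  # Expedited Forwarding (RFC 3246)


def dscp_to_tos(dscp: int) -> int:
    """Return the IPv4 TOS / IPv6 Traffic Class byte for a DSCP value.

    DSCP occupies the upper six bits of the byte; the lower two are ECN.
    """
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a 6-bit field")
    return dscp << 2


def tc_handle(major: int, minor: int) -> int:
    """Pack a tc-style major:minor handle into a 32-bit skb->priority value."""
    if not (0 <= major <= 0xFFFF and 0 <= minor <= 0xFFFF):
        raise ValueError("major and minor are 16-bit fields")
    return (major << 16) | minor


if __name__ == "__main__":
    print(hex(dscp_to_tos(DSCP_EF)))  # TOS byte carried by EF-marked packets
    print(hex(tc_handle(6, 0)))       # numeric form of the 0006:0000 handle
```

Once the table is loaded, a packet capture on the uplink should show this TOS byte on srcds UDP traffic, which is a quick way to confirm the rule matches.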
New unit `deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service`:
```ini
[Unit]
Description=left4me nftables packet marking (DSCP EF + priority for srcds)
Wants=network-pre.target
Before=network-pre.target
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/nft -f /usr/local/lib/left4me/nft/left4me-mark.nft
ExecStop=/usr/sbin/nft delete table inet left4me_mark
[Install]
WantedBy=multi-user.target
```
`Wants=network-pre.target` with `Before=network-pre.target` is the
ordering `systemd.special(7)` documents for firewall-style units:
network management software orders itself after `network-pre.target`,
so the rules are in place before any iface comes up and the very first
packet srcds emits post-boot is already marked.
Deploy script changes:
- Ensure `nftables` is installed (`apt-get install -y nftables`;
idempotent — package is in Trixie base).
- Create `/usr/local/lib/left4me/nft/` and copy `left4me-mark.nft` into
it.
- Copy the unit, `daemon-reload`, `systemctl enable --now
left4me-nft-mark.service`.
### CAKE egress shaper — test deploy mechanism
Three files plus deploy-script changes. All operator-tunable knobs go in
the env file; the helper and unit are static.
**`deploy/files/etc/left4me/cake.env`** (template; deploy installs only
if absent so operator edits survive re-runs):
```
# Uplink bandwidth in Mbit/s. Set to ~95% of the smaller of measured
# upload and measured download. CAKE only shapes correctly when its
# declared bandwidth sits below the real bottleneck. If unset, the
# left4me-cake.service unit logs a warning and exits 0 (no shaping).
LEFT4ME_UPLINK_MBIT=
# Egress interface. If unset, auto-detected from the IPv4 default route.
LEFT4ME_UPLINK_IFACE=
```
**`deploy/files/usr/local/libexec/left4me/left4me-apply-cake`** (mode
`0755`, owner `root:root`). The helper takes a single argument — `apply`
or `clear` — so the unit's `ExecStart` and `ExecStop` both call the same
script and the unit file stays free of shell escaping:
```sh
#!/bin/sh
# Apply or clear the CAKE egress shaper described by /etc/left4me/cake.env.
set -eu
mode=${1:-apply}
if [ -r /etc/left4me/cake.env ]; then
    . /etc/left4me/cake.env
fi
# Print the egress iface: explicit override first, else the IPv4 default route.
resolve_iface() {
    if [ -n "${LEFT4ME_UPLINK_IFACE:-}" ]; then
        printf '%s' "$LEFT4ME_UPLINK_IFACE"
        return
    fi
    # Take the token after "dev" so routes without a "via" gateway also resolve.
    ip -4 route show default |
        awk '$1 == "default" { for (i = 2; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}
case "$mode" in
    apply)
        if [ -z "${LEFT4ME_UPLINK_MBIT:-}" ]; then
            echo "left4me-cake: LEFT4ME_UPLINK_MBIT unset; skipping shaper" >&2
            exit 0
        fi
        iface=$(resolve_iface)
        if [ -z "$iface" ]; then
            echo "left4me-cake: cannot determine egress iface; skipping" >&2
            exit 0
        fi
        exec tc qdisc replace dev "$iface" root cake \
            bandwidth "${LEFT4ME_UPLINK_MBIT}mbit" \
            internet diffserv4 dual-dsthost
        ;;
    clear)
        iface=$(resolve_iface)
        if [ -z "$iface" ]; then
            exit 0
        fi
        # Ignore "no such qdisc" so stopping the unit stays idempotent.
        tc qdisc del dev "$iface" root 2>/dev/null || true
        ;;
    *)
        echo "usage: $0 [apply|clear]" >&2
        exit 2
        ;;
esac
```
`tc qdisc replace` is idempotent: replaces an existing root qdisc on the
iface, adds one if absent. Re-running the unit any time is safe. `clear`
swallows the "no such qdisc" error so stop is also idempotent.
Fail-soft on missing config matches the perf-baseline philosophy — the
deploy does not refuse to boot servers because the operator has not yet
filled in `LEFT4ME_UPLINK_MBIT`. The journal warning surfaces the gap.
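The iface auto-detection is the part of the helper most likely to surprise on unusual routing tables, so here is a minimal Python sketch of the same idea as a pure function over the text that `ip -4 route show default` prints. The function name `default_route_iface` is hypothetical and the sample route lines in the usage note are illustrative. The sketch looks for the token after `dev` rather than a fixed column, which also covers default routes that have no `via` gateway.

```python
from typing import Optional


def default_route_iface(route_output: str) -> Optional[str]:
    """Return the device named on the first `default` route line, or None.

    `route_output` is the text printed by `ip -4 route show default`.
    """
    for line in route_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != "default":
            continue
        if "dev" in fields:
            idx = fields.index("dev")
            if idx + 1 < len(fields):
                return fields[idx + 1]
    return None


if __name__ == "__main__":
    sample = "default via 192.0.2.1 dev eth0 proto dhcp metric 100\n"
    print(default_route_iface(sample))
```

Returning `None` for an empty routing table corresponds to the helper's "cannot determine egress iface; skipping" branch.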
**`deploy/files/usr/local/lib/systemd/system/left4me-cake.service`**:
```ini
[Unit]
Description=left4me CAKE egress shaper
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=-/etc/left4me/cake.env
ExecStart=/usr/local/libexec/left4me/left4me-apply-cake apply
ExecStop=/usr/local/libexec/left4me/left4me-apply-cake clear
[Install]
WantedBy=multi-user.target
```
Per-flag rationale for the `cake` invocation:
- `bandwidth ${LEFT4ME_UPLINK_MBIT}mbit`: operator-declared, ≈95% of
measured uplink. CAKE only shapes if its declared bandwidth is below
the real bottleneck; setting it slightly low moves the queue into a
place the host controls.
- `internet`: RTT preset (100 ms) and CAKE's default, which tunes the
AQM for typical Internet path lengths. It is not an overhead keyword;
link-layer compensation for DOCSIS, PPPoE or similar encapsulation
needs one of the overhead keywords from `tc-cake(8)` (for example
`docsis` or `conservative`) and stays an operator decision.
- `diffserv4`: four-tier DSCP-aware tin selection. Reads the EF marks
set by the nftables rule and routes srcds packets into the
highest-priority "Voice" tin. Without a diffserv preset, the marks
are ignored.
- `dual-dsthost`: egress fairness keyed on destination host. With ≥2
players connected, each player gets a fair share regardless of how
chatty the server is to any single client.
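As a worked example of the bandwidth rule and the resulting command line, the sketch below derives the declared bandwidth from a measured uplink and builds the argv the helper would run. Both function names are hypothetical, and the 95% headroom is the figure from the rationale above expressed as an integer percentage to keep the arithmetic exact.

```python
from typing import List


def shaper_mbit(measured_mbit: int, headroom_percent: int = 95) -> int:
    """Bandwidth to declare to CAKE: a percentage of the measured uplink, rounded down."""
    if measured_mbit <= 0:
        raise ValueError("measured bandwidth must be positive")
    if not 0 < headroom_percent <= 100:
        raise ValueError("headroom_percent must be in 1..100")
    return max(1, measured_mbit * headroom_percent // 100)


def cake_argv(iface: str, uplink_mbit: int) -> List[str]:
    """argv equivalent to the helper's `tc qdisc replace` invocation."""
    return [
        "tc", "qdisc", "replace", "dev", iface, "root", "cake",
        "bandwidth", f"{uplink_mbit}mbit",
        "internet", "diffserv4", "dual-dsthost",
    ]


if __name__ == "__main__":
    # A measured 40 Mbit/s uplink is declared as 38 Mbit/s.
    print(" ".join(cake_argv("eth0", shaper_mbit(40))))
```

The value written to `LEFT4ME_UPLINK_MBIT` is the output of the first function; the helper itself does no measuring.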
Iface-flap behaviour: the kernel keeps the qdisc on an iface across
link-down/link-up while the iface itself exists. If the iface is
recreated (e.g., NetworkManager reconfiguration), `systemctl restart
left4me-cake.service` reapplies. Documented; no auto-watchdog in v1.
Deploy script changes (in `deploy/deploy-test-server.sh`):
- Copy `cake.env` to `/etc/left4me/cake.env` only if absent (do not
clobber operator edits).
- Copy `left4me-apply-cake` to `/usr/local/libexec/left4me/`, mode
`0755`, owner `root:root`.
- Copy `left4me-cake.service` to `/usr/local/lib/systemd/system/`.
- `systemctl daemon-reload` (already done in the existing flow).
- `systemctl enable --now left4me-cake.service`.
### CAKE egress shaper — production deployment (systemd-networkd)
On hosts running `systemd-networkd`, the CAKE configuration belongs in
the matching `.network` file. systemd-networkd reapplies it across iface
lifecycle events, addressing the only fragility of the test-deploy
oneshot.
Document in `deploy/README.md` Performance section:
```ini
# /etc/systemd/network/<your-uplink>.network
[CAKE]
Bandwidth=480M
RTTSec=100ms
PriorityQueueingPreset=diffserv4
FlowIsolationMode=dual-dst-host
```
Directive names follow `systemd.network(5)`. Values mirror the test
deploy's `tc` invocation:
- `Bandwidth=480M`: placeholder; operator sets to ≈95% of measured
uplink in their actual `.network`.
- `RTTSec=100ms`: equivalent of the `internet` RTT keyword (also
CAKE's default).
- `PriorityQueueingPreset=diffserv4`: equivalent of `diffserv4`.
- `FlowIsolationMode=dual-dst-host`: equivalent of `dual-dsthost` on
egress.
The nftables marking from the previous section ships unchanged on prod;
it is qdisc-installer-agnostic.
The test-deploy oneshot must not be installed on a host running
`systemd-networkd`, because both would then manage the root qdisc on
the same iface. v1 does not implement that gate, since production
hosts do not run the test-deploy script. If the boundary blurs in the
future, add a check in `left4me-apply-cake` for `systemctl is-active
systemd-networkd` and skip cleanly.
### Documented escape hatches
Append to `deploy/README.md` Performance section, alongside the existing
governor / CPU-affinity / NIC entries:
- **Ingress shaping via IFB.** Egress CAKE alone does not protect srcds
receive against ingress saturation (large workshop downloads, package
fetches arriving at line rate). One-liner template using `modprobe
ifb`, `ip link set ifb0 up`, `tc qdisc add dev ifb0 root cake bandwidth
Xmbit ingress diffserv4 dual-srchost`, and a `tc filter` redirect from
the uplink iface. Worth flipping only when measurement shows ingress
hurting receive; in v1 we have no such measurement, so it stays
documented.
- **`net.core.busy_poll = 50` / `net.core.busy_read = 50`.** Reduces UDP
receive median latency by polling for incoming packets briefly at
syscall boundaries. Cost: measurable CPU per syscall under load. Worth
flipping if a host is dedicated to game serving and CPU headroom is
plentiful.
- **`ethtool -K <iface> gro off`.** Some Source-engine ops disable
generic receive offload to avoid receive-side coalescing latency.
Hardware/driver dependent. Document, do not ship.
These three entries follow the existing escape-hatch style: a one-liner
or short config block, plus one sentence on when it matters.
### Files changed / added
```
deploy/files/etc/sysctl.d/99-left4me.conf (modified — block added)
deploy/files/usr/local/lib/left4me/nft/left4me-mark.nft (new)
deploy/files/usr/local/lib/systemd/system/left4me-nft-mark.service (new)
deploy/files/etc/left4me/cake.env (new — template, deploy preserves operator edits)
deploy/files/usr/local/libexec/left4me/left4me-apply-cake (new)
deploy/files/usr/local/lib/systemd/system/left4me-cake.service (new)
deploy/deploy-test-server.sh (modified — install+enable nft and cake units, conditional copy of cake.env)
deploy/README.md (modified — Network shaping subsection + 3 new escape hatches)
deploy/tests/test_deploy_artifacts.py (modified — assertions for all artifacts above)
```
## Tests
Following the existing `assert "key=value" in text` pattern in
`deploy/tests/test_deploy_artifacts.py`:
**Sysctl block** (extension of the existing perf-baseline assertions):
- Each of `net.ipv4.udp_rmem_min = 16384`, `net.ipv4.udp_wmem_min =
16384`, `net.core.default_qdisc = fq_codel`,
`net.ipv4.tcp_congestion_control = bbr` is asserted as a separate line.
**nftables marking artifacts:**
- `left4me-mark.nft` ships with `table inet left4me_mark`, `chain
mangle_output`, `meta skuid "left4me"`, `ip dscp set ef`, `ip6 dscp
set ef`, and `meta priority set 0006:0000` each asserted as separate
substring matches. (DSCP and priority statements appear inline on
the same rule per L3 family; substring assertions don't depend on
rule layout.)
- `left4me-nft-mark.service` has `ExecStart=/usr/sbin/nft -f
/usr/local/lib/left4me/nft/left4me-mark.nft`, `ExecStop=/usr/sbin/nft
delete table inet left4me_mark`, `Type=oneshot`,
`RemainAfterExit=yes`, `WantedBy=multi-user.target`.
- `deploy-test-server.sh` invokes `systemctl enable --now
left4me-nft-mark.service` (or equivalent at-deploy enabling step).
**CAKE artifacts:**
- `cake.env` template contains the literal lines `LEFT4ME_UPLINK_MBIT=`
and `LEFT4ME_UPLINK_IFACE=` (commented or uncommented; matched as
substring).
- `left4me-apply-cake` contains the literals `tc qdisc replace`, `cake`,
`bandwidth`, `internet`, `diffserv4`, `dual-dsthost`,
`LEFT4ME_UPLINK_MBIT`, `LEFT4ME_UPLINK_IFACE`.
- `left4me-apply-cake` is mode `0755` after deploy (asserted via the
same mechanism the existing helper-script tests use).
- `left4me-cake.service` contains
`EnvironmentFile=-/etc/left4me/cake.env`,
`ExecStart=/usr/local/libexec/left4me/left4me-apply-cake apply`,
`ExecStop=/usr/local/libexec/left4me/left4me-apply-cake clear`,
`Wants=network-online.target`, `Type=oneshot`,
`WantedBy=multi-user.target`.
- `deploy-test-server.sh` invokes `systemctl enable --now
left4me-cake.service`.
- `deploy-test-server.sh` copies `cake.env` only when target absent
(asserted by literal substring of the guarding `[ -e
/etc/left4me/cake.env ]` test or equivalent).
No runtime networking tests in v1. The artifacts are static; their
runtime behaviour requires a real iface and a real bandwidth load,
which the operator measures.
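A minimal sketch of the assertion style for the sysctl block, assuming the conf file is read as text and each directive is checked as its own line. The inline string stands in for the real `99-left4me.conf`, and `missing_lines` and the test function name are hypothetical.

```python
from typing import Iterable, List

# Stand-in for the contents of deploy/files/etc/sysctl.d/99-left4me.conf.
SYSCTL_BLOCK = """\
# Per-socket UDP buffer floors
net.ipv4.udp_rmem_min = 16384
net.ipv4.udp_wmem_min = 16384
net.core.default_qdisc = fq_codel
net.ipv4.tcp_congestion_control = bbr
"""

EXPECTED_SYSCTL_LINES = (
    "net.ipv4.udp_rmem_min = 16384",
    "net.ipv4.udp_wmem_min = 16384",
    "net.core.default_qdisc = fq_codel",
    "net.ipv4.tcp_congestion_control = bbr",
)


def missing_lines(text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected directives that do not appear as a whole line."""
    present = {line.strip() for line in text.splitlines()}
    return [line for line in expected if line not in present]


def test_sysctl_block_has_network_baseline() -> None:
    assert missing_lines(SYSCTL_BLOCK, EXPECTED_SYSCTL_LINES) == []
```

Matching whole lines rather than substrings keeps a commented-out directive such as `# net.core.default_qdisc = fq_codel` from passing by accident.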
## Rollout
Single deploy. After the new sysctl block lands, `sysctl --system`
applies it immediately (already in the deploy flow). The two new
systemd units start on `systemctl enable --now`; CAKE without a
configured `LEFT4ME_UPLINK_MBIT` logs a warning and no-ops, which is
the expected fresh-deploy state. The operator measures their uplink,
edits `/etc/left4me/cake.env`, and runs `systemctl restart
left4me-cake.service`.
Already-running game servers are unaffected by the network changes
themselves. The marking applies on every emitted packet from the moment
the nft rule loads; future-emitted packets pick up DSCP+priority without
restarting any srcds instance.
## Open questions
None blocking. v2 candidates if measurement justifies them:
- A `LEFT4ME_INGRESS_MBIT` knob that flips on the IFB ingress shaper as
a default, conditional on the env value being set.
- A `left4me-net-doctor` helper that reports current qdisc, applied
marks, and a one-shot saturation+ping measurement against a local
endpoint.
- A small Python wrapper in `l4d2host` that reads `cake.env` for
display in the web UI, so the operator sees in one place whether
shaping is active.
## References
- `tc-cake(8)`: keyword semantics for `bandwidth`, `internet`,
`diffserv4`, `dual-dsthost`, and the tin priority mapping.
- `systemd.network(5)`: `[CAKE]` section directives (`Bandwidth=`,
`PriorityQueueingPreset=`, and the RTT, overhead and flow-isolation
settings).
- `nft(8)`: `meta skuid`, `meta priority`, `ip dscp set`, table
isolation semantics.
- RFC 3246: Expedited Forwarding (EF) PHB.
- Linux kernel `Documentation/networking/tcp_bbr.txt`: BBR pairs with
`fq` / `fq_codel` for correct pacing.
- `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`:
sibling spec; this spec extends `99-left4me.conf` and reuses the
same deploy-test-artifact pattern.


@ -1,96 +0,0 @@
# Profile page with self-service password change — design
## Context
The web app has login/logout (`l4d2web/auth.py`, `l4d2web/routes/auth_routes.py`) and admin user management (activate/deactivate/delete), but no way for a logged-in user to change their own password. The header (`l4d2web/templates/base.html:27`) renders the username as a non-clickable `<span class="muted">`.
This design adds a `/profile` page reachable by clicking the username, with "Change password" as its first (and only) section. Future profile fields can slot into the same page as new sections without rework.
## Goals
- Logged-in users can change their own password from a self-service page.
- Follow the industry-standard session model: a successful password change **invalidates every other active session for the user** and **keeps the current session signed in**, so the current browser needs no re-login.
- Single password policy enforced everywhere a password is set (web flow today, CLI `create-user` for consistency).
## Non-goals (out of scope for v1)
- Admin password reset for other users. A separate feature; no rework needed to add later.
- Password recovery / email reset flow.
- Other profile fields (display name, email, etc.). The page is structured to grow but ships with one section.
## Decisions
- **URLs.** `GET /profile` for the page, `POST /profile/password` for the form submission.
- **Form fields.** `current_password`, `new_password`, `confirm_new_password`. All three required.
- **Password policy.** Not empty, minimum 8 characters. Same rule applies to the CLI `create-user` so policy lives in one place.
- **Session policy.** Invalidate other sessions on success; keep the current session signed in.
- **Rate limit.** Per-IP, sliding window. Re-uses the same primitive as `/login`.
- **CSRF.** Standard hidden-token pattern shared with the rest of the app.
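For the rate-limit decision, a per-key sliding window can be sketched in a few lines. This is an illustration under assumptions, not the primitive shared with `/login`: the class name, the `limit` and `window_seconds` parameters, and the injectable `clock` are all hypothetical, and the spec does not state the actual limits.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` hits per key within the last `window_seconds`."""

    def __init__(
        self,
        limit: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        """Record a hit for `key` and report whether it is within the limit."""
        now = self.clock()
        hits = self._hits[key]
        # Drop hits that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True

    def reset(self) -> None:
        """Forget all recorded hits (useful between tests)."""
        self._hits.clear()
```

Keying on the client IP gives the per-IP behaviour described above, and a `reset()` hook is one way to clear the bucket between tests.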
## Session-invalidation mechanism
A new `password_changed_at: DateTime NOT NULL` column on `users`. Two checkpoints:
1. **On login.** `login_user` stashes `session["pw_changed_at"] = user.password_changed_at.isoformat()`.
2. **On every request.** `load_current_user` rejects the session — same shape as the existing `user.active` check — when the marker is missing, malformed, or strictly older than `user.password_changed_at`.
On successful password change:
- Rotate `user.password_digest` and bump `user.password_changed_at` to "now".
- Re-stamp `session["pw_changed_at"]` to the new value so this browser keeps working.
- Other browsers carry the old marker and get logged out the next time they hit a `@require_login` route.
This mirrors the established `g.user = None if not user.active else user` pattern, so the surface area added to the auth path is small and the behavior is easy to reason about.
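The freshness rule can be expressed as a small pure function. The sketch below is an assumption about shape, not the code in `load_current_user`: the name `session_is_fresh` is hypothetical, and it treats a naive/aware datetime mismatch as a malformed marker.

```python
from datetime import datetime
from typing import Optional


def session_is_fresh(marker: Optional[str], password_changed_at: datetime) -> bool:
    """Return True when the session marker is present, parseable and not
    strictly older than the user's password_changed_at."""
    if not marker:
        return False  # missing marker
    try:
        stamped = datetime.fromisoformat(marker)
    except (TypeError, ValueError):
        return False  # malformed marker
    try:
        return stamped >= password_changed_at
    except TypeError:
        # Naive vs timezone-aware values cannot be compared; reject.
        return False
```

A session stamped at exactly `password_changed_at` stays valid, which is what lets the browser that performed the change keep working after the re-stamp.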
## Validation branches (POST /profile/password)
In order:
1. All three fields present → otherwise `error=fields_required`.
2. `new_password == confirm_new_password` → otherwise `error=mismatch`.
3. `validate_new_password(new_password)` passes → otherwise `error=empty` or `error=too_short`.
4. `verify_password(current_password, user.password_digest)` succeeds → otherwise `error=wrong_current`.
5. Rotate, re-stamp, redirect to `/profile?success=1`.
Errors are surfaced inline on the next render of `/profile` via a small `?error=<key>` → human-readable message map in the route. No flash storage required.
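The ordering of the branches can be sketched as a function that returns the first failing error key. `validate_new_password` and `MIN_PASSWORD_LENGTH` are named in this spec, but the bodies and return conventions here are assumptions; `first_error` and the `password_matches` callback are hypothetical stand-ins for the route body and `verify_password`.

```python
from typing import Callable, Mapping, Optional

MIN_PASSWORD_LENGTH = 8


def validate_new_password(password: str) -> Optional[str]:
    """Return an error key for a policy violation, or None when acceptable."""
    if password == "":
        return "empty"
    if len(password) < MIN_PASSWORD_LENGTH:
        return "too_short"
    return None


def first_error(
    form: Mapping[str, str],
    password_matches: Callable[[str], bool],
) -> Optional[str]:
    """Apply the validation branches in order; return the first error key."""
    current = form.get("current_password", "")
    new = form.get("new_password", "")
    confirm = form.get("confirm_new_password", "")
    if not (current and new and confirm):
        return "fields_required"
    if new != confirm:
        return "mismatch"
    policy_error = validate_new_password(new)
    if policy_error is not None:
        return policy_error
    if not password_matches(current):
        return "wrong_current"
    return None
```

Checking the current password last means the cheap checks run before the password-hash verification. In this sketch the `empty` key is only reachable from callers that skip the presence check, such as the CLI.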
## Migration story
Adding `password_changed_at` to `users` requires a migration:
- Add the column nullable.
- Backfill existing rows with their `created_at` so historical data has a sane marker.
- Alter to `NOT NULL`.
Effect on existing live sessions: any cookie that predates the migration lacks `pw_changed_at` and is rejected on first request after deploy. Users log in once more. Acceptable for v1 deployment.
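The first two migration steps can be illustrated with the standard-library `sqlite3` module. This is a sketch of the data movement only; the real change would be the Alembic revision listed under "Surface area". SQLite as the backing store, the `TEXT` column type and the function name `add_and_backfill` are assumptions.

```python
import sqlite3


def add_and_backfill(conn: sqlite3.Connection) -> int:
    """Add the nullable column and backfill it from created_at.

    Returns the number of rows still NULL afterwards; that count must be
    zero before the column can be tightened to NOT NULL.
    """
    conn.execute("ALTER TABLE users ADD COLUMN password_changed_at TEXT")
    conn.execute(
        "UPDATE users SET password_changed_at = created_at "
        "WHERE password_changed_at IS NULL"
    )
    (remaining,) = conn.execute(
        "SELECT COUNT(*) FROM users WHERE password_changed_at IS NULL"
    ).fetchone()
    return remaining
```

SQLite cannot add a `NOT NULL` constraint to an existing column in place, so the third step would typically go through a table rebuild, which is left out of this sketch.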
## Surface area
**New files**
- `l4d2web/alembic/versions/0009_user_password_changed_at.py`
- `l4d2web/services/rate_limit.py`
- `l4d2web/routes/profile_routes.py`
- `l4d2web/templates/profile.html`
- `l4d2web/tests/test_profile.py`
**Modified files**
- `l4d2web/models.py`: column.
- `l4d2web/auth.py`: `MIN_PASSWORD_LENGTH`, `validate_new_password`, `login_user` signature, freshness check in `load_current_user`.
- `l4d2web/routes/auth_routes.py`: pass marker to `login_user`; use the generic rate-limit helper.
- `l4d2web/templates/base.html`: username `<span>` becomes `<a href="/profile">`.
- `l4d2web/app.py`: register the new blueprint, reset its rate-limit bucket in TESTING.
- `l4d2web/cli.py`: apply `validate_new_password` for parity with the web flow.
## Reused utilities
- `hash_password`, `verify_password`: `l4d2web/auth.py`
- `require_login`: `l4d2web/auth.py`
- `session_scope`: `l4d2web/db.py`
- `now_utc`: `l4d2web/models.py`
- CSRF hidden-token pattern: see `templates/admin_users.html`, `routes/auth_routes.py`
## Open questions resolved during brainstorming
- *Should the current session also be invalidated?* No — industry consensus (Django `update_session_auth_hash`, Rails Devise, GitHub/Google behaviour, OWASP / NIST SP 800-63B implications) is to keep the current session and rotate other sessions. Forcing re-login on a session that just proved knowledge of the current password adds friction without security gain.
- *Should we add `password_changed_at` or use a `session_version` counter?* Timestamp is enough; the comparison is unambiguous and avoids an extra integer field with arbitrary meaning.
- *Admin reset?* Deferred. The current model has no rework debt for adding it later.


@ -1,326 +0,0 @@
# Workshop Auto-Download — Design
## Problem
When a user adds workshop items to an overlay (`POST /overlays/{id}/items`), the route saves `WorkshopItem` metadata and enqueues a `build_overlay` job. The build symlinks already-cached `.vpk` files and emits `skipped: not yet downloaded` to stderr for everything else. The only thing that actually pulls bytes from Steam is the admin-only `refresh_workshop_items` job, which is a global mutex blocking all server starts, all builds, and installs.
In practice, this means freshly-added items never appear in the overlay until an admin presses a button. That isn't workable.
## Goals
1. Newly added items get downloaded without admin action.
2. Items that authors update on Steam get re-downloaded automatically on a daily cadence.
3. Overlay owners can manually re-check / re-pull their own overlay's items.
## Non-Goals
See "Out of Scope" at the end. In particular: the `refresh_workshop_items` global mutex stays; there is no cache GC; no per-item retry inside `download_to_cache`; no update-aware server-restart prompt.
## Architecture
Three changes layered onto the existing scheduler. None introduce a new job type or new scheduler rule.
```
┌─────────────────────────────────────────────────────────────────────┐
│  User adds items                                                    │
│    POST /overlays/{id}/items                                        │
│      ↳ fetch metadata batch (mode=add)                              │
│      ↳ upsert WorkshopItem rows                                     │
│      ↳ enqueue_build_overlay           ◀── already happens today    │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│  build_overlay job (per-overlay; not a global mutex)                │
│    WorkshopBuilder.build():                                         │
│      1. query overlay's items                                       │
│      2. for each item where cache miss / stale:    ◀── NEW          │
│         download_to_cache(meta) with retry+backoff                  │
│         stamp WorkshopItem.last_downloaded_at                       │
│      3. apply symlinks (existing logic)                             │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│  Owner re-checks one overlay                                        │
│    POST /overlays/{id}/refresh    ◀── NEW                           │
│      ↳ fetch metadata batch for this overlay only (mode=refresh)    │
│      ↳ update WorkshopItem rows                                     │
│      ↳ enqueue_build_overlay (does the download)                    │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│  Daily global update                                                │
│    systemd timer → l4d2web workshop-refresh CLI    ◀── NEW          │
│      ↳ inserts Job(operation='refresh_workshop_items')              │
│      ↳ worker picks it up; existing global-mutex rule still applies │
│      ↳ existing _run_refresh_workshop_items code unchanged          │
└─────────────────────────────────────────────────────────────────────┘
```
Key invariant: **on-add downloads run inside the per-overlay `build_overlay` job, so they do not block server starts globally.** Only the daily global refresh keeps the existing global-mutex semantics.
## Component 1 — Auto-download inside `WorkshopBuilder.build`
The builder gets a new download phase between "query items" and "apply symlinks". Today's behavior (skip-uncached with stderr warning) is replaced.
### Decision logic
For each item bound to the overlay:
1. **Skip with warning** if `file_url == ""` (Steam returned `result != 1` last time we asked — delisted, private, or hidden). Emit one stderr line `workshop item {steam_id} skipped: no file_url (steam result: {last_error})`. Do **not** fail the build — these items quietly fall out of the symlink set because they never produce a cache file. An owner can investigate via the overlay detail page where `last_error` is shown.
2. Otherwise, **download** when any of:
- `last_downloaded_at IS NULL`, or
- cache file `{steam_id}.vpk` missing, or
- cache file `(mtime, size)` doesn't match `(time_updated, file_size)` from the row.
3. Otherwise, leave the item alone (its cache file is current).
`steam_workshop.download_to_cache` already does the `(mtime, size)` check internally and short-circuits when the cache is current, so the builder can call it unconditionally for items in the "maybe download" set and trust the helper for idempotence.
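The three-way decision can be sketched as a pure function over a row and the cache directory. The `ItemRow` dataclass, the function name `download_decision`, and the reading of `time_updated` as whole epoch seconds are assumptions made for the illustration; the real `WorkshopItem` model and the check inside `download_to_cache` may differ.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Optional


@dataclass
class ItemRow:
    """Hypothetical stand-in for the WorkshopItem columns the builder reads."""
    steam_id: str
    file_url: str
    file_size: int
    time_updated: int  # assumed: epoch seconds
    last_downloaded_at: Optional[datetime]


def download_decision(item: ItemRow, cache_root: Path) -> str:
    """Return 'skip', 'download' or 'cached' for one overlay item."""
    if item.file_url == "":
        return "skip"  # Steam cannot serve this item
    if item.last_downloaded_at is None:
        return "download"
    cache_file = cache_root / f"{item.steam_id}.vpk"
    try:
        st = cache_file.stat()
    except FileNotFoundError:
        return "download"
    if (int(st.st_mtime), st.st_size) != (item.time_updated, item.file_size):
        return "download"  # cache is stale relative to the row
    return "cached"
```

Only items that come back as `download` need to reach the retry wrapper; `skip` items produce the stderr warning and `cached` items go straight to the symlink phase.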
### Stamping
- On success per item: `WorkshopItem.last_downloaded_at = now()`, `last_error = ""`.
- On failure per item (after retry exhaustion): `last_error` records the final exception string; the builder raises → `last_build_status='failed'`.
### What the builder does NOT do
It does not fetch fresh Steam metadata. Metadata is the responsibility of the add route, the per-overlay refresh route, and the daily refresh job. The builder is a pure function of DB state — this keeps it cheap and predictable, and lets builds run without any outbound metadata call.
### Concurrency
Items are downloaded sequentially within one builder run. Different overlays' builds run in parallel under existing scheduler rules; when two overlays share an item and race, the existing `download_to_cache` idempotence handles it — the loser sees a fresh file and skips. `last_downloaded_at` writes from two concurrent builds collapse to one timestamp; no real race.
### Cancellation
The builder threads `should_cancel` into `download_to_cache` (the helper already accepts it). Cancelled mid-download deletes the `.partial` file; the symlink phase doesn't run. Cancellation during the inter-attempt sleep wakes up within ~250 ms (see retry section).
### Logging
Each item's download start / finish / error emits one line. Counts are reported in the existing summary line:
```
workshop overlay 'mycollection': downloaded=3 cached=12 skipped=1 created=14 removed=1 unchanged=11 errors=0
```
`skipped` now means "Steam can't serve this item (no file_url)" instead of the old "uncached" meaning. Uncached items get downloaded.
## Component 2 — Retry & backoff
Wraps each `download_to_cache(meta, ...)` call inside the builder.
```
attempts = 3
delays = [1s, 2s, 4s]               # exponential; slept between attempts
for n in 1..attempts:
    try:
        download_to_cache(meta, cache_root, should_cancel=should_cancel)
        break
    except InterruptedError:        # cancellation
        raise                       # propagate immediately
    except (requests.RequestException, OSError) as exc:
        if n == attempts: raise     # final attempt: bubble up → job fails
        on_stderr(f"workshop {meta.steam_id} attempt {n}/{attempts} failed: {exc}")
        sleep_with_cancel(delays[n-1], should_cancel)
```
### Notes
- `sleep_with_cancel` is a small helper that polls `should_cancel` every ~250 ms during the sleep so a cancel does not wait out the full backoff window.
- The retry loop lives in the builder (`overlay_builders.py`), not in `steam_workshop.download_to_cache`. The downloader stays a single-shot primitive; retry policy is a caller concern. Keeps the helper testable without time-mocking.
- HTTP 4xx responses raised by `raise_for_status()` are `requests.HTTPError` (a `RequestException`), so they are retried too. That is intentional — 404 / 410 will fail three times quickly and surface; the cost of three failed attempts is negligible compared to the cost of users having to guess why a single transient blip killed the job.
- On final failure the job fails with the per-item error string and overlay `last_build_status='failed'`, matching the existing "never silently mount a partial overlay" rule.
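The retry loop and the cancellable sleep can be sketched as runnable Python. This is an illustration under assumptions rather than the builder's code: `download_with_retry`, the injectable `sleep` parameter, and the choice to raise `InterruptedError` when a cancel arrives during backoff are hypothetical. It catches `OSError` only, which keeps the sketch free of third-party imports; `requests.RequestException` derives from `IOError`, so the same clause would cover it. With three attempts only the first two delays are ever slept.

```python
import time
from typing import Callable, Sequence


def sleep_with_cancel(
    seconds: float,
    should_cancel: Callable[[], bool],
    poll: float = 0.25,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Sleep in short slices, raising InterruptedError if cancelled."""
    remaining = seconds
    while remaining > 0:
        if should_cancel():
            raise InterruptedError("cancelled during backoff")
        step = min(poll, remaining)
        sleep(step)
        remaining -= step


def download_with_retry(
    download: Callable[[], None],
    should_cancel: Callable[[], bool],
    on_stderr: Callable[[str], None],
    attempts: int = 3,
    delays: Sequence[float] = (1.0, 2.0, 4.0),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call download() until it succeeds; return the attempt that succeeded."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for n in range(1, attempts + 1):
        try:
            download()
            return n
        except InterruptedError:
            # Cancellation. Must come first: InterruptedError is an OSError.
            raise
        except OSError as exc:
            if n == attempts:
                raise  # final attempt: let the job fail
            on_stderr(f"attempt {n}/{attempts} failed: {exc}")
            sleep_with_cancel(delays[n - 1], should_cancel, sleep=sleep)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter lets a test record the requested delays instead of waiting, so the loop can be exercised without time-mocking.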
## Component 3 — Per-overlay refresh
New route `POST /overlays/{id}/refresh`. Mirrors the add route's metadata-fetch path but scoped to the items already in this overlay.
### Route sketch
```python
@bp.post("/overlays/<int:overlay_id>/refresh")
@require_login
def refresh_overlay(overlay_id: int) -> Response:
    user = current_user()
    with session_scope() as db:
        overlay, err = _check_workshop_overlay_access(overlay_id, user, db)
        if err is not None: return err
        steam_ids = db.scalars(
            select(WorkshopItem.steam_id)
            .join(OverlayWorkshopItem, OverlayWorkshopItem.workshop_item_id == WorkshopItem.id)
            .where(OverlayWorkshopItem.overlay_id == overlay_id)
        ).all()
    if not steam_ids:
        return Response("overlay has no items", status=400)
    try:
        metas = steam_workshop.fetch_metadata_batch(steam_ids, mode="refresh")
    except Exception as exc:
        return Response(f"steam api error: {exc}", status=502)
    with session_scope() as db:
        overlay, err = _check_workshop_overlay_access(overlay_id, user, db)
        if err is not None: return err
        metas_by_id = {m.steam_id: m for m in metas}
        for steam_id in steam_ids:
            wi = db.scalar(select(WorkshopItem).where(WorkshopItem.steam_id == steam_id))
            meta = metas_by_id.get(steam_id)
            if wi is None: continue
            if meta is None:
                wi.last_error = "steam returned no entry for this item"
                continue
            wi.title = meta.title
            wi.filename = meta.filename
            wi.file_url = meta.file_url
            wi.file_size = meta.file_size
            wi.time_updated = meta.time_updated
            wi.preview_url = meta.preview_url
            wi.last_error = "" if meta.result == 1 else f"steam result {meta.result}"
        job = enqueue_build_overlay(db, overlay_id=overlay_id, user_id=user.id)
        job_id = job.id
    return redirect(f"/jobs/{job_id}")
```
### Behavior notes
- Permission: same `_check_workshop_overlay_access` used by add/remove — owner or admin.
- `mode="refresh"` (not `"add"`): non-L4D2 items silently drop from the batch instead of raising. An item whose `consumer_app_id` somehow changed after add will not break refresh.
- The metadata write does **not** stamp `last_downloaded_at`. That field stays bound to actual file presence — the builder's download phase stamps it after the bytes land. A refresh that finds `time_updated` advanced therefore leaves `last_downloaded_at` pointing at the prior version; the `(mtime, size)` check in `download_to_cache` sees the mismatch and the builder re-downloads. Correct by construction.
- One Steam metadata POST per click, owner-gated. No new rate-limit concern.
### UI
A "Refresh" button next to "Add items" on the overlay detail page (workshop type only). Submits the POST; redirects to the job page like everything else.
## Component 4 — Periodic global refresh (CLI + systemd timer)
The existing `_run_refresh_workshop_items` job is complete and correct — it fetches all metadata, downloads what advanced, re-enqueues `build_overlay` for affected overlays. We only need a way to enqueue it on a schedule.
### CLI subcommand
In `l4d2web/cli.py`:
```python
@cli.command("workshop-refresh")
def workshop_refresh() -> None:
    """Enqueue a global workshop refresh job. Idempotent: if one is already
    queued or running, prints its id and exits 0."""
    with session_scope() as db:
        existing = db.scalar(
            select(Job).where(
                Job.operation == "refresh_workshop_items",
                Job.state.in_(("queued", "running", "cancelling")),
            ).order_by(Job.id.desc()).limit(1)
        )
        if existing is not None:
            click.echo(f"refresh_workshop_items job {existing.id} already {existing.state}")
            return
        job = Job(
            user_id=None,
            server_id=None,
            operation="refresh_workshop_items",
            state="queued",
        )
        db.add(job)
        db.flush()
        click.echo(f"enqueued refresh_workshop_items job {job.id}")
```
### Schema follow-up
`Job.user_id = None` for system-enqueued refreshes. The implementation plan must verify whether the column is currently nullable; if it is `NOT NULL`, the plan either (a) relaxes it to nullable (preferred — "system" is a real category) or (b) records the lowest-id admin user as the actor. The design assumes (a).
### systemd units in `deploy/`
```ini
# left4me-workshop-refresh.service
[Unit]
Description=Left4me — enqueue daily workshop refresh
After=network-online.target left4me-web.service
Requires=left4me-web.service
[Service]
Type=oneshot
User=left4me
ExecStart=/opt/left4me/bin/l4d2web workshop-refresh
```
```ini
# left4me-workshop-refresh.timer
[Unit]
Description=Left4me — daily workshop refresh
[Timer]
OnCalendar=*-*-* 04:00:00
Persistent=true
RandomizedDelaySec=15min
[Install]
WantedBy=timers.target
```
### Operator notes
- The timer enqueues; the worker decides when to actually run. The existing scheduler will defer the refresh if a server start, install, or build is in progress. Worst case the refresh starts after the conflicting job finishes — the intended behavior.
- `Persistent=true` handles "host was down at 04:00" — the unit runs on next boot. The CLI's idempotence check prevents pile-up if it fires twice.
- Deployment wires this into the existing `deploy/` install flow (in scope for the implementation plan).
## Testing
Layered against the existing test files. No new test infrastructure.
### `tests/test_overlay_builders.py` — bulk of new coverage
- `test_workshop_build_downloads_uncached_items` — item with `last_downloaded_at=None` and no cache file → patched `download_to_cache` is called → file appears → symlink created → `last_downloaded_at` stamped.
- `test_workshop_build_skips_already_cached_items` — item with cache file matching `(time_updated, size)` → `download_to_cache` returns immediately (its existing idempotence) → no network → symlink created.
- `test_workshop_build_redownloads_stale_cache` — cache file exists but `(mtime, size)` mismatches the DB row → re-download happens.
- `test_workshop_build_retry_succeeds` — patched downloader fails twice then succeeds → builder finishes ok, retry messages on stderr, `last_downloaded_at` stamped. Backoff sleep monkey-patched to zero for speed.
- `test_workshop_build_retry_exhausted_fails_job` — downloader fails all three attempts → builder raises → `last_build_status='failed'`, `last_error` populated on the WorkshopItem.
- `test_workshop_build_cancellation_during_download` — `should_cancel` flips true mid-download → builder returns early, `.partial` cleaned up by `download_to_cache`, symlink phase did not run.
- `test_workshop_build_cancellation_during_backoff` — cancel flips true while sleeping between retries → wakes up within ~250 ms of the cancel.
- `test_workshop_build_skips_items_with_no_file_url` — item with `file_url=""` and `last_error="steam result 9"` → builder writes one stderr line, does NOT call `download_to_cache`, build succeeds with `last_build_status='ok'`, item is absent from the symlink set.
### `tests/test_workshop_routes.py` — new per-overlay refresh route
- `test_overlay_refresh_owner_allowed` — owner POST → `fetch_metadata_batch` called with exactly that overlay's steam_ids → WorkshopItem rows updated → `build_overlay` enqueued → 302 to /jobs/{id}.
- `test_overlay_refresh_other_user_forbidden` — non-owner non-admin → 403.
- `test_overlay_refresh_admin_can_refresh_any` — admin POST on someone else's overlay → 200/302.
- `test_overlay_refresh_steam_api_error_502` — `fetch_metadata_batch` raises → response is 502, no job enqueued.
- `test_overlay_refresh_empty_overlay_400` — overlay has no items → 400, no Steam call.
- `test_overlay_refresh_drops_missing_items_gracefully` — Steam returns nothing for one ID → that row gets `last_error="steam returned no entry…"`, build still enqueued.
### `tests/test_cli.py` — new CLI subcommand
- `test_workshop_refresh_enqueues_job` — CLI invocation inserts a queued `Job(operation='refresh_workshop_items')` and prints its id.
- `test_workshop_refresh_idempotent_when_queued` — pre-existing queued/running refresh job → second invocation prints the existing id and does not insert a duplicate.
### `tests/test_job_worker.py`
No new tests. Scheduler rules and `_run_refresh_workshop_items` are unchanged. Existing coverage holds.
### Out of test scope
The systemd timer. Validating it requires a host; smoke it on the dev host post-deploy.
## Out of Scope
- **Replacing the global mutex on `refresh_workshop_items`.** Daily refresh still blocks server starts/builds during its run. Scheduled at 04:00 with `Persistent=true`; revisit only if it observably hurts.
- **Per-item retry policy in `download_to_cache`.** Retry stays in the builder.
- **Cache GC.** Cache still grows monotonically — same as the v1 spec.
- **Steam API rate-limit handling for the metadata endpoint.** No backoff for metadata calls. Retries apply only to per-item file downloads.
- **Update-aware server restart UX.** When the daily refresh re-downloads an item mounted by a running server, the running server keeps its old mount. Notifying the user / offering a "restart to pick up updates" prompt stays in the backlog.
- **Per-overlay refresh on non-workshop overlay types.** Only workshop overlays get the Refresh button.
## Affected Files
Implementation will touch roughly:
- `l4d2web/services/overlay_builders.py` — WorkshopBuilder download phase, retry helper.
- `l4d2web/routes/workshop_routes.py` — new `/overlays/{id}/refresh` route.
- `l4d2web/templates/...` — Refresh button on overlay detail page.
- `l4d2web/cli.py` — new `workshop-refresh` subcommand.
- `l4d2web/models.py` and `alembic/versions/...` — possibly relax `Job.user_id` to nullable (TBD per schema check).
- `deploy/` — systemd `.service` + `.timer` units, wired into the install flow.
- `l4d2web/tests/test_overlay_builders.py`, `test_workshop_routes.py`, `test_cli.py` — new test cases per the testing section.
The implementation plan will turn these into ordered steps with explicit checkpoints.


@ -1,396 +0,0 @@
# Server live-state display (counts, map, roster, avatars, history)
## Context
The l4d2web UI currently shows systemd lifecycle state per game server (running/stopped/unknown) but nothing about what's happening *inside* the game: player count, current map, whether the server is hibernating, who is connected. To know any of that, users have to context-switch (open the game, query externally).
The goal is a **read-side live-state display**: counts + map + hibernating on the server list, plus a server-detail panel showing the current player roster (avatars + names) and a "recent players" section for who's been on lately. Backed by a persistent history table so we get count-over-time graphs and player-presence history (foundation for future ban UX) for free.
**Source: RCON exclusively.** A2S_INFO (UDP, anonymous) was investigated and discarded — it can't deliver Steam IDs, hibernating flag, or interactive commands, so anything beyond raw counts re-routes through RCON anyway. Both transports were verified working against prod `left4.me`. Going RCON-only means one transport, one set of tests, no throwaway scaffolding.
**Avatars: Steam Web API.** RCON gives Steam IDs; `ISteamUser/GetPlayerSummaries` resolves them to persona names + avatar URLs hot-linked from Steam's CDN. API key already obtained.
**Commands are deferred** to a separate plan. This plan is read-only.
---
## Architecture
```
                       ┌─────────────────────────────┐
                       │  left4me-web (Flask)        │
┌──────────────┐ RCON  │  ┌───────────────────────┐  │
│ srcds 27016  │◄──────┼──┤ live-state poller     │  │
└──────────────┘ TCP   │  │ (daemon thread)       │  │
                       │  └───────┬───────────────┘  │
┌──────────────┐ RCON  │          │ writes           │
│ srcds 27021  │◄──────┤          ▼                  │
└──────────────┘       │  ┌───────────────────────┐  │
                       │  │ server_live_state     │  │
       Steam Web API   │  │ server_player_session │  │
       ┌────────────┐  │  │ steam_user_profile    │  │
       │ Steam CDN  │◄─┼──┤                       │  │
       │ avatars... │  │  └───────┬───────────────┘  │
       └────────────┘  │          │ reads            │
              ▲        │          ▼                  │
              │        │  ┌───────────────────────┐  │
              └────────┼──┤ /servers, /servers/N  │  │
       <img src=...>   │  │ (HTMX 5s refresh)     │  │
                       │  └───────────────────────┘  │
                       └─────────────────────────────┘
```
Single daemon thread (modeled on the existing `start_state_poller` in `l4d2web/services/job_worker.py:617-647`), inside the Flask process, polls every `LIVE_STATE_POLL_SECONDS` (default 5). Per poll, per running server with a configured RCON password:
1. TCP connect to `127.0.0.1:<port>`, auth, send `status`, parse response.
2. Compare server-level state (players/map/hibernating/etc.) to the latest `server_live_state` row for this server. If unchanged, bump `last_seen_at`. If changed, insert a new row.
3. Reconcile open sessions (`server_player_session` rows where `left_at IS NULL`) with the current `status` roster: open new sessions for new players (backfilling `joined_at` from RCON's `connected` field), close sessions for players no longer present, update `min_ping`/`max_ping` for continuing sessions.
4. Collect Steam IDs that are missing from `steam_user_profile` or have `fetched_at` older than 24h; batch them into a single `GetPlayerSummaries` call; upsert results.
5. Trim `server_live_state` and closed sessions older than retention.
---
## Schema (one new alembic migration)
### New column: `servers.rcon_password`
```python
rcon_password: Mapped[str] = mapped_column(
    String(64), nullable=False, default="", server_default=""
)
```
Empty string = "no password configured yet" (poller skips). Migration backfills every existing row with `secrets.token_urlsafe(32)` (~43 chars, URL-safe character set so the literal `"..."` cfg-quoting needs no escaping).
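A quick way to sanity-check the length and character-set claim is sketched below. `new_rcon_password` and `URLSAFE_ALPHABET` are illustrative names, not identifiers from the codebase; only `secrets.token_urlsafe` is the real stdlib call.

```python
import secrets
import string

# Characters that unpadded URL-safe base64 can produce.
URLSAFE_ALPHABET = set(string.ascii_letters + string.digits + "-_")


def new_rcon_password() -> str:
    """Generate a password from 32 random bytes (hypothetical helper name)."""
    return secrets.token_urlsafe(32)


password = new_rcon_password()
# 32 bytes encode to 43 base64 characters once the padding is stripped,
# so the value fits the String(64) column with room to spare.
print(len(password), set(password) <= URLSAFE_ALPHABET)
```

Because the alphabet contains no double quote, backslash, or whitespace, the value can be placed between literal quotes in a cfg line without escaping.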
### `server_live_state` — run-length-encoded snapshots
```sql
CREATE TABLE server_live_state (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,
  server_id     INTEGER NOT NULL REFERENCES servers(id) ON DELETE CASCADE,
  started_at    DATETIME NOT NULL,   -- when this exact state first appeared
  last_seen_at  DATETIME NOT NULL,   -- most recent poll where it still held
  players       INTEGER NOT NULL,
  max_players   INTEGER NOT NULL,
  bots          INTEGER NOT NULL,
  map           VARCHAR(64) NOT NULL,
  hibernating   BOOLEAN NOT NULL
);
CREATE INDEX ix_sls_server_started ON server_live_state(server_id, started_at DESC);
```
- "State" = the tuple `(players, max_players, bots, map, hibernating)`. Ping/loss are deliberately not stored at server-level, so they don't churn rows.
- Idle hibernating server collapses from one-row-per-poll to one-row-per-state-change (≈17,280× compression for a 24h-idle server).
- Latest snapshot for a server: `ORDER BY started_at DESC LIMIT 1`. UI staleness check: `last_seen_at > now - LIVE_STATE_STALE_SECONDS` (default 30).
- Retention: trim rows where `last_seen_at < now - LIVE_STATE_HISTORY_DAYS` (default 30).
- Failed polls produce no DB write; the staleness check on `last_seen_at` handles UI degradation cleanly.
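The run-length encoding and the staleness rule above can be illustrated with an in-memory sketch. It stands in for the real table with a plain list; `Snapshot`, `record_poll`, and `is_stale` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# (players, max_players, bots, map, hibernating)
State = tuple[int, int, int, str, bool]


@dataclass
class Snapshot:
    started_at: datetime
    last_seen_at: datetime
    state: State


def record_poll(rows: list[Snapshot], state: State, now: datetime) -> None:
    """Extend the newest row if the state is unchanged, else start a new row."""
    if rows and rows[-1].state == state:
        rows[-1].last_seen_at = now
    else:
        rows.append(Snapshot(started_at=now, last_seen_at=now, state=state))


def is_stale(rows: list[Snapshot], now: datetime, stale_seconds: int = 30) -> bool:
    """Fresh means last_seen_at > now - stale_seconds; anything else is stale."""
    if not rows:
        return True
    return not (rows[-1].last_seen_at > now - timedelta(seconds=stale_seconds))


# A server idle for 24h, polled every 5s: 17,280 polls collapse into one row.
rows: list[Snapshot] = []
start = datetime(2025, 1, 1)
idle: State = (0, 4, 0, "c1m1_hotel", True)
for i in range(24 * 3600 // 5):
    record_poll(rows, idle, start + timedelta(seconds=5 * i))
print(len(rows))
```

A failed poll simply never calls `record_poll`, so `last_seen_at` stops advancing and `is_stale` turns true on its own once the threshold passes.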
### `server_player_session` — interval per connection
```sql
CREATE TABLE server_player_session (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,
  server_id     INTEGER NOT NULL REFERENCES servers(id) ON DELETE CASCADE,
  steam_id_64   VARCHAR(20) NOT NULL,
  joined_at     DATETIME NOT NULL,
  left_at       DATETIME NULL,       -- NULL = currently in-game
  name_at_join  VARCHAR(64) NOT NULL,
  min_ping      INTEGER NOT NULL,
  max_ping      INTEGER NOT NULL
);
CREATE INDEX ix_sps_server_open ON server_player_session(server_id, left_at);
CREATE INDEX ix_sps_steam_history ON server_player_session(steam_id_64, joined_at);
```
- `joined_at` is **backfilled from RCON's `connected` duration** on first sighting (`joined_at = now - connected_seconds`). This heals brief polling gaps and survives web restarts: even if we just started polling, we know when the still-connected players actually joined.
- A player who disconnects and rejoins gets two rows, not one merged interval.
- Bots are excluded — rows with a non-`STEAM_X:Y:Z` uniqueid are skipped.
- `min_ping`/`max_ping` updated only when a new poll pushes the range, to avoid noise writes.
- On poller startup, close any open sessions not confirmed by the current RCON output. In addition, close open sessions once their server has gone `STUCK_SESSION_SECONDS` (default 60, roughly 12 polls at the 5s interval) without a successful poll.
- Retention: trim closed sessions where `left_at < now - LIVE_STATE_HISTORY_DAYS` (default 30). Open sessions are never trimmed.
### `steam_user_profile` — cached profile data (24h TTL)
```sql
CREATE TABLE steam_user_profile (
  steam_id_64   VARCHAR(20) PRIMARY KEY,
  persona_name  VARCHAR(64) NOT NULL,
  avatar_url    TEXT NOT NULL,       -- avatarmedium from Steam Web API
  fetched_at    DATETIME NOT NULL
);
```
- Cache is global, not per-server (one profile per Steam ID).
- Refreshed when `fetched_at < now - 24h` or when entry is missing.
- Soft-fail: if the Steam API key is unset, the API is down, or a profile is private, we just leave the cache as-is and the UI falls back to `name_at_join` + placeholder avatar.
### Bind-rendered queries
**Current players on server X:**
```sql
SELECT sp.steam_id_64, sp.joined_at, sp.name_at_join,
sp.min_ping, sp.max_ping,
p.persona_name, p.avatar_url
FROM server_player_session sp
LEFT JOIN steam_user_profile p USING (steam_id_64)
WHERE sp.server_id = ? AND sp.left_at IS NULL
ORDER BY sp.joined_at;
```
**Recent players on server X (last 30 days, excluding currently in-game):**
```sql
SELECT sp.steam_id_64, MAX(sp.left_at) AS last_seen,
       p.persona_name, p.avatar_url
FROM server_player_session sp
LEFT JOIN steam_user_profile p USING (steam_id_64)
WHERE sp.server_id = ?
  AND sp.left_at IS NOT NULL
  AND sp.left_at > datetime('now', '-30 days')
  AND sp.steam_id_64 NOT IN (
    SELECT steam_id_64 FROM server_player_session
    WHERE server_id = ? AND left_at IS NULL
  )
GROUP BY sp.steam_id_64, p.persona_name, p.avatar_url
ORDER BY last_seen DESC
LIMIT 20;
```
---
## Modules
### `l4d2web/services/rcon.py` (new)
Pure stdlib (`socket`, `struct`), no new dependency. Source RCON protocol:
```python
@dataclass(slots=True, frozen=True)
class PlayerRow:
    steam_id_64: str  # converted from STEAM_X:Y:Z
    name: str
    connected_seconds: int
    ping: int

@dataclass(slots=True, frozen=True)
class StatusResponse:
    map: str
    players: int  # humans
    max_players: int
    bots: int
    hibernating: bool
    roster: list[PlayerRow]

class RconError(Exception): ...
class RconAuthError(RconError): ...

def query_status(host: str, port: int, password: str, *, timeout: float = 2.0) -> StatusResponse: ...
```
Implementation notes:
- Auth handshake quirk verified live: server sends a `type=0` empty-body packet **before** the `type=2` auth response. Consume both. `req_id == -1` on the auth response = bad password.
- Single TCP connection per query (loopback, ~10-20ms total round-trip — pooling not worth it at this scale).
- Header regex on `map :` and `players :` lines (the `(hibernating|not hibernating)` token is in `players :`).
- Roster regex: split lines starting with `#`, skip the column-header line, robustly extract the quoted name + the `STEAM_X:Y:Z` token + `MM:SS` or `HH:MM:SS` connected duration + ping. Tolerate the two-numeric-prefix L4D2 variant (`# 2 1 "Crone" STEAM_1:0:...`).
- Steam ID conversion: `STEAM_X:Y:Z` → `76561197960265728 + (Z * 2) + Y` (Y is the low bit; returned as string).
### `l4d2web/services/steam_users.py` (new)
Modeled directly on `l4d2web/services/steam_workshop.py:17-43` (single `requests.Session`, 30s timeout). Unlike the workshop client's anonymous form-encoded POST, `GetPlayerSummaries` is a GET that carries the `key=` and `steamids=` parameters in the query string.
```python
@dataclass(slots=True, frozen=True)
class SteamProfile:
    steam_id_64: str
    persona_name: str
    avatar_url: str  # avatarmedium

def fetch_profiles_batch(steam_ids: Iterable[str], *, api_key: str) -> list[SteamProfile]: ...
```
- Endpoint: `GET https://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0002/?key=<key>&steamids=<csv>`.
- Up to 100 IDs per call; caller batches.
- Returns only successful resolutions (private/deleted accounts simply absent from the response — fine, they stay uncached and the UI falls back).
- Raises on transport errors; caller decides whether to surface.
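Since the caller is responsible for batching, a chunking helper along these lines could sit in the poller. `batched_ids` and `steamids_param` are hypothetical names, and the de-duplication step is an extra assumption rather than something the notes require.

```python
from collections.abc import Iterable, Iterator

MAX_IDS_PER_CALL = 100  # ceiling stated in the notes above


def batched_ids(steam_ids: Iterable[str], size: int = MAX_IDS_PER_CALL) -> Iterator[list[str]]:
    """Yield de-duplicated Steam IDs in lists of at most `size` entries."""
    seen: set[str] = set()
    batch: list[str] = []
    for steam_id in steam_ids:
        if steam_id in seen:
            continue
        seen.add(steam_id)
        batch.append(steam_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def steamids_param(batch: list[str]) -> str:
    """Build the comma-separated value for the `steamids` query parameter."""
    return ",".join(batch)
```

Each yielded batch would then map to one `fetch_profiles_batch` call, keeping the 100-ID limit out of the HTTP client itself.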
### `l4d2web/services/live_state_poller.py` (new)
Modeled on `start_state_poller` / `state_poller_loop` in `l4d2web/services/job_worker.py:617-647`.
```python
def start_live_state_poller(app) -> None: ... # spawns daemon thread, skipped under TESTING
def live_state_poller_loop(app, interval: float) -> None: ...
def poll_once() -> None:  # one full pass over running servers
    ...
```
Per-server algorithm:
1. RCON `status` → `StatusResponse` (or skip on auth/timeout, logged via `app.logger`).
2. **Server-level RLE upsert**: load newest `server_live_state` row for this server. If `(players, max_players, bots, map, hibernating)` matches → `UPDATE last_seen_at = now()`. Else → `INSERT` new row.
3. **Session reconciliation** in a single transaction:
- Load open sessions for this server.
- For each player in `response.roster` not in open sessions: `INSERT` new session with `joined_at = now - connected_seconds`, `name_at_join = roster.name`, `min_ping = max_ping = roster.ping`.
- For each open session whose player is in the roster: if `roster.ping < min_ping` or `> max_ping`, `UPDATE` the range. Otherwise skip the write.
- For each open session whose player is *not* in the roster: `UPDATE left_at = now()`.
4. **Profile enrichment**: collect Steam IDs from the roster where the cached profile is missing or `fetched_at < now - 24h`. Skip if `STEAM_WEB_API_KEY` unset. Batch into one Steam API call. Upsert results.
Periodic (every Nth cycle, e.g. once a minute):
- Trim `server_live_state` and closed sessions past retention.
- Close any open sessions whose `server_id` hasn't had a successful RCON response in the last `STUCK_SESSION_SECONDS` (default 60).
### Modify: `l4d2web/services/l4d2_facade.py:28-52`
`build_server_spec_payload` **appends** `f'rcon_password "{server.rcon_password}"'` as the *last* entry in the returned `config` list, only if the password is non-empty. Appending (not prepending) matters: Source's cfg semantics are last-wins, so putting our line after both the overlay `exec` lines and the user's blueprint config guarantees no overlay or blueprint can silently clobber the password and break the poller. `l4d2host/instances.py:40-58` already writes `spec.config` lines verbatim to `server.cfg`, so **no host-side change is needed**.
### Modify: server-create route
Wherever the server-create form handler lives (`l4d2web/routes/server_routes.py` or similar — confirm during implementation): before commit, generate `rcon_password = secrets.token_urlsafe(32)`.
---
## Web UI
### Server list (template TBD: `ls l4d2web/templates/` during implementation)
Add an inline live-state cell per server row:
- Stopped server: `—`
- Stale (no row newer than `LIVE_STATE_STALE_SECONDS`): dim `?` with tooltip "no data"
- Hibernating: `0/4 · idle · c1m1_hotel`
- Active: `2/4 · c1m2_streets`
No HTMX on the list page; page reload picks up the latest snapshot.
### Server detail (`l4d2web/templates/server_detail.html`)
New section, HTMX-refreshed every `LIVE_STATE_POLL_SECONDS` (default 5):
```html
<section class="panel"
         hx-get="/servers/{{ server.id }}/live-state"
         hx-trigger="every 5s"
         hx-swap="outerHTML">
  <!-- rendered from l4d2web/templates/_live_state.html -->
</section>
```
The partial renders three blocks:
1. **Summary**: `players/max_players · map · idle?` plus a small "polled Ns ago" caption.
2. **Current players** (only if non-empty): grid of cards, each `<img src="{{ profile.avatar_url or placeholder }}" /> {{ profile.persona_name or session.name_at_join }} · {{ joined_relative }} · ping {{ min }}-{{ max }}ms`.
3. **Recent players** (last 30 days, excluding current; only if non-empty): smaller cards, `{{ avatar }} {{ persona_name or name_at_join }} · last seen {{ last_seen_relative }}`.
New route: `GET /servers/<id>/live-state` returns the partial. Composition mirrors the existing build-status pattern at `l4d2web/templates/_overlay_build_status.html:1-5`.
Avatar `<img>` tags point straight at Steam CDN URLs (`avatars.cloudflare.steamstatic.com` / `avatars.akamai.steamstatic.com`). No proxying. Same approach as `WorkshopItem.preview_url`. Note: confirm the existing CSP allows these hosts; if not, extend it.
No JS framework added — HTMX only.
---
## Config keys
In `l4d2web/config.py`, plus documented defaults in `deploy/templates/etc/left4me/web.env` where applicable:
| key | default | purpose |
|---|---|---|
| `LIVE_STATE_POLL_SECONDS` | `5` | poll interval |
| `LIVE_STATE_QUERY_TIMEOUT_SECONDS` | `2.0` | per-RCON-query timeout |
| `LIVE_STATE_POLL_WORKERS` | `4` | thread-pool size for parallel per-server polls |
| `LIVE_STATE_STALE_SECONDS` | `30` | UI staleness threshold |
| `LIVE_STATE_HISTORY_DAYS` | `30` | retention for snapshots + closed sessions |
| `STUCK_SESSION_SECONDS` | `60` | close open sessions whose server has been unreachable for this long |
| `STEAM_PROFILE_TTL_SECONDS` | `86400` | profile cache TTL |
| `STEAM_WEB_API_KEY` | `""` | from `web.env`; empty disables enrichment |
---
## Tests
- `l4d2web/tests/test_rcon.py` — protocol handshake against an in-process TCP fixture: auth-success, auth-failure (`req_id == -1`), header parse (incl. `(hibernating)` and `(reserved <token>)` variants), roster parse (incl. the two-numeric-prefix L4D2 variant), Steam ID conversion.
- `l4d2web/tests/test_steam_users.py` — request shape (key in querystring, batched ids, 100-per-call ceiling), response parsing, partial response (some IDs missing).
- `l4d2web/tests/test_live_state_poller.py` — mirror `test_state_poller_*` at `l4d2web/tests/test_job_worker.py:882-952`. Cover: iterates only running servers with non-empty `rcon_password`, RLE upsert (matching state → `last_seen_at` bump only; differing state → new row), session open with backfilled `joined_at`, session close on disappearance, ping range expansion, stuck-session close after N failures, auth failures are logged and skipped without raising or writing rows, respects retention.
- `l4d2web/tests/test_server_routes.py` (extend) — `/servers/<id>/live-state` fragment route renders summary/current/recent blocks correctly; stale rendering when latest snapshot is old; soft-fail rendering when no profile cached.
- `l4d2web/tests/test_l4d2_facade.py` (extend) — `build_server_spec_payload` appends `rcon_password "..."` as the last config line when password is set; omits the line when empty; appears after both the overlay `exec` lines and the blueprint config lines.
- Migration test — existing rows backfilled with non-empty 43-char passwords; tables created with correct indexes.
---
## Critical files
**New:**
- `l4d2web/services/rcon.py` — Source RCON client + status parser
- `l4d2web/services/steam_users.py` — Steam Web API client (mirrors `steam_workshop.py`)
- `l4d2web/services/live_state_poller.py` — background thread + poll loop + session reconciler
- `l4d2web/alembic/versions/00XX_server_live_state.py` — migration: new column, three new tables, password backfill
- `l4d2web/templates/_live_state.html` — HTMX-refreshed fragment (summary + current + recent)
- `l4d2web/tests/test_rcon.py`, `l4d2web/tests/test_steam_users.py`, `l4d2web/tests/test_live_state_poller.py`
**Modify:**
- `l4d2web/models.py` — add `ServerLiveState`, `ServerPlayerSession`, `SteamUserProfile`; add `rcon_password` to `Server` (after line 137)
- `l4d2web/services/l4d2_facade.py:28-52` — `build_server_spec_payload` appends `rcon_password "..."` as the last config line when set
- `l4d2web/app.py` — call `start_live_state_poller(app)` next to existing `start_state_poller`
- `l4d2web/routes/server_routes.py` (or equivalent — confirm) — generate `rcon_password` in create handler; add `GET /servers/<id>/live-state`
- `l4d2web/templates/server_detail.html` — include `_live_state.html`
- `l4d2web/templates/<server-list>.html` — confirm filename; add inline badge column
- `l4d2web/config.py` — register the eight new config keys
- `deploy/templates/etc/left4me/web.env` — add `STEAM_WEB_API_KEY=` and any tunables we expose
**Reused without changes:**
- `l4d2web/services/job_worker.py:617-647` — daemon-thread / poll-loop pattern reference
- `l4d2web/services/steam_workshop.py:17-43` — `requests.Session` + form-POST pattern for Steam Web API
- `l4d2host/instances.py:40-58` — already writes `spec.config` verbatim, so no host-side change for password injection
- `l4d2web/templates/_overlay_build_status.html` — HTMX polling pattern reference
---
## Verification
1. **Unit tests**:
```
pytest l4d2web/tests/test_rcon.py l4d2web/tests/test_steam_users.py l4d2web/tests/test_live_state_poller.py -v
pytest l4d2web/tests -q # full regression
```
2. **Migration check**:
```
alembic upgrade head
sqlite3 l4d2web.db "SELECT id, name, length(rcon_password) FROM servers;" # every row ~43
sqlite3 l4d2web.db ".schema server_live_state server_player_session steam_user_profile"
```
3. **End-to-end against prod** (`left4.me`):
- Deploy. Confirm `systemctl status left4me-web.service` shows no crash-loop and the journal logs `start_live_state_poller` once.
- Restart both existing game servers so they pick up the injected password.
- SQL sanity (web-host shell):
```
sqlite3 l4d2web.db "SELECT server_id, started_at, last_seen_at, players, map, hibernating
FROM server_live_state ORDER BY server_id, started_at DESC LIMIT 10;"
```
Expect a single recent row per server while idle; new rows when players come/go.
- Connect to one server from the L4D2 client; within 5s, `/servers/<id>` shows a card with your avatar + persona name + ping range. Disconnect; within 5s the card moves to "recent."
- `sqlite3 l4d2web.db "SELECT * FROM server_player_session WHERE left_at IS NULL;"` — empty when nobody's connected; one row per current player when someone is.
- `sqlite3 l4d2web.db "SELECT count(*), MIN(fetched_at), MAX(fetched_at) FROM steam_user_profile;"` — at least one row after a player has been resolved.
4. **Failure-path checks**:
- Manually corrupt `servers.rcon_password` for one server; confirm the journal logs auth failure and the row's badge goes stale within `LIVE_STATE_STALE_SECONDS`; other servers unaffected.
- Unset `STEAM_WEB_API_KEY` in `web.env`, restart web; confirm display still works (in-game names + placeholder avatars), no errors in journal.
- `nft` drop the loopback TCP on one server's port; confirm rows stop appearing, open sessions close after `STUCK_SESSION_SECONDS`, badge goes stale.
---
## Open implementation questions
- **Server-list template filename**: confirm with `ls l4d2web/templates/` once implementation starts.
- **Server-create route location**: confirm path (likely `l4d2web/routes/server_routes.py`).
- **CSP allowlist for Steam avatar CDNs**: check `l4d2web/app.py` (or wherever security headers live) — extend `img-src` to include `avatars.cloudflare.steamstatic.com`, `avatars.akamai.steamstatic.com`, `avatars.steamstatic.com` if a CSP is enforced.
- **Adaptive backoff** for hibernating servers: defer; start with fixed 5s and revisit only if load becomes a concern (which it won't at current server count).
- **Migration data step**: SQLite alembic batch operation with a Python data step that iterates rows and generates `secrets.token_urlsafe(32)` per row — confirm pattern against existing migrations under `l4d2web/alembic/versions/`.
---
## Deferred to a separate plan
- Generic RCON command execution (`changelevel`, `kick`, `say`, `sm_ban`, ...)
- Web UI buttons mapped to those commands with CSRF + admin authz
- Audit log table for issued commands
- Player-count history graphs (data already accumulating from this plan)
- Ban UX (lookup by Steam ID, search across `server_player_session`)


@ -1,61 +0,0 @@
# RCON Password Display on Server Detail Page — Design
**Goal:** Show the RCON password on the server detail page with a show/hide toggle.
**Architecture:** Presentational change only. The `server.rcon_password` field already exists in the database and is rendered via Jinja2 autoescaping into the template. A small external JS file provides the reveal/hide interaction via delegated click on `[data-password-toggle]` attributes — no inline handlers.
**Files touched:**
- `l4d2web/static/js/password-reveal.js` — new, ~15 lines
- `l4d2web/templates/server_detail.html` — add one row to `.server-info` DL
- `l4d2web/templates/base.html` — add script include
- `l4d2web/static/css/components.css` — optional, add `.password-mask` letter-spacing if default renders poorly
---
## Template
Add after the blueprint row in `server_detail.html` (line 13):
```html
<div>
  <dt>RCON Password</dt>
  <dd>
    <span class="password-mask" data-password-field="{{ server.id }}">••••••••••••</span>
    <span class="password-value" data-password-field="{{ server.id }}" hidden>{{ server.rcon_password }}</span>
    <button class="link-button" data-password-toggle="{{ server.id }}" aria-label="Show RCON password">show</button>
  </dd>
</div>
```
## JavaScript (`password-reveal.js`)
Delegated click listener on `[data-password-toggle]`. Toggles `hidden` between the mask span and value span, updates button text and aria-label.
```js
document.addEventListener('click', (e) => {
  const btn = e.target.closest('[data-password-toggle]');
  if (!btn) return;
  const id = btn.dataset.passwordToggle;
  const mask = document.querySelector(`[data-password-field="${id}"].password-mask`);
  const value = document.querySelector(`[data-password-field="${id}"].password-value`);
  const hidden = value.hidden;
  value.hidden = !hidden;
  mask.hidden = hidden;
  btn.textContent = hidden ? 'hide' : 'show';
  btn.setAttribute('aria-label', hidden ? 'Hide RCON password' : 'Show RCON password');
});
```
## CSS
Reuse existing `.link-button` for the toggle button. If the bullet characters render inconsistently across browsers (spacing, baseline), add a simple `.password-mask { letter-spacing: 0.15em; }` class — but likely unnecessary.
## Security
- Password is server-rendered via Jinja2 autoescaping — no XSS vector.
- Visible in page source to the server owner (consistent with existing auth model: user must own the server).
- No copy-to-clipboard functionality (per requirements).
## Testing
No new tests required — purely presentational change. Existing `test_create_server_generates_rcon_password` in `test_servers.py` already covers password generation.

View file

@ -1,77 +0,0 @@
# Server Hostname (Source `hostname` cvar) Design
**Goal:** Allow users to set the L4D2 server name (`hostname` cvar) that players see in the server browser and MOTD, with an ephemeral auto-generated fallback.
**Architecture:** A new `hostname VARCHAR(128)` column on the `servers` table. Empty string means "auto-generate at deploy time." The fallback is resolved ephemerally in `initialize_server` — computed fresh from `user.username + server.name` on each deploy, never persisted. Explicit overrides are stored and emitted verbatim.
---
## Model
Add one column to `Server` in `l4d2web/models.py`:
```python
hostname: Mapped[str] = mapped_column(String(128), default="", nullable=False)
```
Default `""` means auto-generate. Non-empty means explicit override.
## Behavior
| `hostname` value | Deploy result |
|---|---|
| `""` (empty) | Emit `hostname "<username> <server.name>"` — computed fresh each deploy, never written to DB |
| `"My Server"` | Emit `hostname "My Server"` verbatim |
| User clears the field | Resets to `""`, next deploy auto-generates |
The fallback is ephemeral — `initialize_server` resolves it in-memory for the spec YAML. The DB row stays empty. This means renames auto-propagate to the hostname on the next deploy without manual updates.
## Spec Payload
`build_server_spec_payload()` gains an optional `resolved_hostname: str = ""` keyword parameter. When non-empty, a `hostname "..."` line is inserted into the config array, before the `rcon_password` line (so rcon remains last-wins).
`initialize_server()` resolves the hostname:
```python
with session_scope() as db:
    user = db.get(User, server.user_id)
    resolved = server.hostname or f"{user.username} {server.name}"
```
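The fallback rule and the emitted config line can be sketched as two pure functions. This is a minimal illustration, not the project's code: `resolve_hostname` and `hostname_config_line` are hypothetical names, and dropping double quotes and non-printable characters is an assumption about what a quoted cvar value can safely carry.

```python
def resolve_hostname(stored: str, username: str, server_name: str) -> str:
    """Return the explicit override, or the ephemeral fallback when empty."""
    return stored or f"{username} {server_name}"


def hostname_config_line(resolved: str) -> str:
    """Build the `hostname "..."` config line (hypothetical helper).

    Assumption: double quotes and non-printable characters (including
    newlines) are dropped so the value cannot close the quoted string or
    start a new config line.
    """
    cleaned = "".join(ch for ch in resolved if ch != '"' and ch.isprintable())
    return f'hostname "{cleaned}"'
```

Keeping both steps free of database access would make the empty-versus-explicit cases from the behavior table easy to unit test.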
## UI
On `server_detail.html`, a new row in the info `<dl>` block, placed after the RCON password row:
```
Hostname: [ _______________ ] [Save]
Leave empty for auto: "alice alpha"
```
- Input `name="hostname"`, `maxlength="128"`
- `value="{{ server.hostname }}"` (empty when not set)
- `placeholder="{{ user.username }} {{ server.name }}"` (previews auto-generated value)
- Form submits to `POST /servers/<id>` — same endpoint as the rename form
- No hostname field in the create-server modal; new servers always start with `hostname=""`
## Routes
**`POST /servers/<int:server_id>`** (update_server_form) — unchanged signature; just also saves `request.form.get("hostname", "")` to `server.hostname`.
**`POST /servers`** (create_server) — unchanged; `hostname` defaults to `""` from the model default.
## Files Touched
| File | Change |
|---|---|
| `l4d2web/models.py` | Add `hostname` column to `Server` |
| `l4d2web/alembic/versions/0011_server_hostname.py` | Migration — `ADD COLUMN hostname VARCHAR(128) NOT NULL DEFAULT ''` |
| `l4d2web/routes/server_routes.py` | `update_server_form` saves `hostname` from form |
| `l4d2web/services/l4d2_facade.py` | `build_server_spec_payload` accepts `resolved_hostname=`, emits `hostname "..."` line. `initialize_server` resolves fallback. |
| `l4d2web/templates/server_detail.html` | Hostname form row in info `<dl>` |
| `l4d2web/tests/test_servers.py` | Tests for create default, update, clear |
| `l4d2web/tests/test_l4d2_facade.py` | Tests for hostname in spec, fallback resolution |
## Open / Closed
- **Explicit vs ephemeral:** Explicit overrides persist; empty means auto at deploy time. No toggle, no "locked" mode needed in v1.
- **No hostname in create modal:** Simplifies the form. Hostname is configured post-creation on the detail page.

View file

@ -1,579 +0,0 @@
# Build-overlay template unit — refactor the script-sandbox helper
**Status: open question, not settled design.** This is a handoff
document prompted by the build-time idmap landing on 2026-05-15. The
current `left4me-script-sandbox` shell helper works but has accumulated
several layers of complexity (idmap bind setup, trap cleanup, nsenter
self-wrap) that a systemd template unit would handle declaratively.
The same pattern is already established in the codebase for
gameservers (`left4me-server@.service`). A future session should
evaluate whether to refactor and, if so, follow the steps below.
> **Updated 2026-05-15:** `l4d2-sandbox` was collapsed into `left4me`
> — see `docs/superpowers/plans/2026-05-15-uid-collapse.md`. The
> idmap bind setup + trap cleanup are gone, so the remaining
> complexity in the helper is just the nsenter self-wrap. References
> below to `User=l4d2-sandbox` should read as `User=left4me`; the
> template refactor will inherit that cleanly.
## Why this came up
While verifying the build-time idmap refactor, the first 5 build jobs
failed with `mkdir: Permission denied` on `/overlay/...`. Root cause:
- `left4me-web.service` runs with `PrivateTmp=true`, which puts the
web app (and anything it sudoes into) in a private mount namespace.
- The script-sandbox helper, invoked via `sudo` from the web app,
inherits that namespace.
- The helper's `mount --bind --map-users=...` pre-creates the idmap
staging path *in the web app's namespace*.
- `systemd-run` (called by the helper) spawns a transient unit in
PID 1's mount namespace.
- The transient unit's `BindPaths=...:/overlay` resolves the staging
path in PID 1's namespace — where the bind doesn't exist. It sees
an empty root-owned dir at the staging path (mkdir'd by the helper
before the bind) and binds *that* to `/overlay`.
- Sandbox uid hits EACCES on every write.
We fixed it (commit `f1aa05d`) by self-wrapping the helper into
PID 1's mount namespace at the top of the script:
```bash
if [[ "${L4D2_SANDBOX_IN_PID1_MNT_NS:-}" != "1" ]]; then
    exec env L4D2_SANDBOX_IN_PID1_MNT_NS=1 \
        /usr/bin/nsenter --mount=/proc/1/ns/mnt -- "$0" "$@"
fi
```
That works. But it's a band-aid for an architectural friction:
**helper invocation via `sudo` from a hardened service forces us to
manually escape the caller's namespace before any mount syscall**.
If the helper were *itself* a systemd unit started by PID 1, the
namespace would be correct by default.
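When debugging this class of bug, it helps to check whether a process actually shares PID 1's mount namespace. The sketch below is a diagnostic aid only, not part of the helper. `same_mount_namespace` is a hypothetical name, the `proc` parameter exists only so the function can be exercised against a fake tree, and reading `/proc/1/ns/mnt` normally requires root.

```python
import os


def same_mount_namespace(pid_a: str = "self", pid_b: str = "1", proc: str = "/proc") -> bool:
    """True when both processes are in the same mount namespace.

    Each /proc/<pid>/ns/mnt symlink resolves to a string such as
    "mnt:[4026531841]"; identical strings mean the same namespace.
    """
    link_a = os.readlink(os.path.join(proc, pid_a, "ns", "mnt"))
    link_b = os.readlink(os.path.join(proc, pid_b, "ns", "mnt"))
    return link_a == link_b
```

Run from inside a service with `PrivateTmp=true`, a check like this would be expected to return `False`, which is the condition the nsenter self-wrap works around.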
The gameserver helper handles this at the unit level. Its
ExecStartPre is:
```
ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
```
i.e. wrapped in nsenter *at the unit*. The unit is started by PID 1,
so it already has PID 1's namespace; the nsenter is belt-and-braces.
Mirror that pattern for builds: introduce `build-overlay@.service` as
a template unit, have the worker activate it instead of forking a
helper.
## Current state (the thing being replaced)
Files:
- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`:
the bash helper. ~100 lines. Self-wraps in nsenter, does pre-bind
with `--map-users`, invokes `systemd-run --quiet --collect --wait
--pipe -p ... -- /bin/bash /script.sh`, cleans up via trap.
- `l4d2web/services/overlay_builders.py:run_sandboxed_script` — the
worker entry point. Writes script content to
`/var/lib/left4me/sandbox-scripts/<uniqued>.sh`, invokes
`sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id>
<path>`, streams stdout/stderr via `subprocess.Popen` + the existing
`run_command` plumbing.
- `deploy/files/etc/sudoers.d/left4me` — grants `left4me` NOPASSWD to
the helper path.
What the helper actually does:
1. nsenter into PID 1's mount ns (the band-aid)
2. validate args + overlay dir exists
3. compute `STAGING=/var/lib/left4me/tmp/sandbox-idmap-${OVERLAY_ID}`
4. `trap` cleanup; pre-emptive `umount` of stale staging; `mkdir -p`
the staging
5. `mount --bind --map-users=$(id -u left4me):$(id -u l4d2-sandbox):1
--map-groups=... $OVERLAY_DIR $STAGING`
6. `systemd-run` with the full hardening profile, `BindPaths=$STAGING:/overlay`
7. Wait for completion, propagate exit code
8. trap fires: `umount $STAGING; rmdir $STAGING`
## Proposed design
Replace the bash helper with a systemd template unit, emitted from
ckn-bw's existing `systemd_units` reactor and placed in the existing
`l4d2-build.slice`, plus a small worker rewrite.
### `build-overlay@.service` (template unit)
```ini
[Unit]
Description=Sandboxed overlay build for instance %i
DefaultDependencies=no
After=local-fs.target
RequiresMountsFor=/var/lib/left4me/overlays/%i
ConditionPathIsDirectory=/var/lib/left4me/overlays/%i
ConditionPathExists=/var/lib/left4me/sandbox-scripts/%i.sh
[Service]
Type=oneshot
User=l4d2-sandbox
Group=l4d2-sandbox
Slice=l4d2-build.slice
# Idmap bind: disk uid 980 (left4me) ↔ mount uid 981 (sandbox), so writes
# from the sandbox land on disk as left4me. + prefix runs as root before
# the User= drop (mount syscall requires CAP_SYS_ADMIN).
ExecStartPre=+/usr/bin/mkdir -p /run/left4me/idmap/%i
ExecStartPre=+/usr/bin/mount --bind \
--map-users=980:981:1 --map-groups=980:981:1 \
/var/lib/left4me/overlays/%i /run/left4me/idmap/%i
ExecStart=/bin/bash /script.sh
ExecStopPost=+-/usr/bin/umount /run/left4me/idmap/%i
ExecStopPost=+-/usr/bin/rmdir /run/left4me/idmap/%i
# Hardening — all the -p flags from the current bash helper, declared
# declaratively here instead of as systemd-run -p arguments.
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
PrivateDevices=yes
PrivateIPC=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectKernelLogs=yes
ProtectControlGroups=yes
RestrictNamespaces=yes
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictSUIDSGID=yes
LockPersonality=yes
MemoryDenyWriteExecute=yes
SystemCallFilter=@system-service @network-io
SystemCallArchitectures=native
CapabilityBoundingSet=
AmbientCapabilities=
IPAddressDeny=127.0.0.0/8 ::1/128 169.254.0.0/16 fe80::/10 224.0.0.0/4 ff00::/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 100.64.0.0/10 fc00::/7
TemporaryFileSystem=/etc /var/lib
BindReadOnlyPaths=/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives /var/lib/left4me/sandbox-scripts/%i.sh:/script.sh
BindPaths=/run/left4me/idmap/%i:/overlay
WorkingDirectory=/overlay
Environment=HOME=/tmp PATH=/usr/bin:/usr/sbin OVERLAY=/overlay
UMask=0022
OOMScoreAdjust=500
MemoryMax=4G
MemorySwapMax=0
TasksMax=512
CPUQuota=200%
RuntimeMaxSec=3600
TimeoutStartSec=1h
TimeoutStopSec=30s
```
Notes:
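- `Type=oneshot` makes `systemctl start` block until ExecStart exits.
systemd documents `RuntimeMaxSec=` as having no effect on oneshot
services, so `TimeoutStartSec=1h` is the setting that actually bounds
build time.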
- `ConditionPath*` provides early failure if the overlay dir or script
doesn't exist (avoids running the unit at all in those cases).
- `RequiresMountsFor=/var/lib/left4me/overlays/%i` ensures the parent
fs is mounted before this unit runs (`/` and `/var/lib` if it's a
separate mount point).
- `ExecStopPost` uses `+-` (root, ignore failures) — the bind might
already be torn down if the unit is restarting.
- `BindReadOnlyPaths=...:/script.sh` makes the per-overlay script
available at `/script.sh` inside the sandbox, picked from the
predictable path `/var/lib/left4me/sandbox-scripts/%i.sh`.
### Script source: filesystem vs. DB
**Critical design decision the future session must make.** The current
plan in the unit sketch above assumes the worker writes the script
content to `/var/lib/left4me/sandbox-scripts/<id>.sh` before calling
`systemctl start`. But the script *already lives in the DB* (the
`overlays.script` column), and the unit instance name `%i` is the
overlay row id. The filesystem copy is redundant unless we want it.
Three options:
**Option A — worker writes the script (the unit sketch above).**
Worker queries DB, writes `<id>.sh` to a known path, then
`systemctl start`. Unit reads via `BindReadOnlyPaths`. Simple, no DB
access from the unit, the existing `_sandbox_script_dir()` plumbing
mostly works. Cost: redundant on-disk copy; stale files between
builds if you don't clean them.
**Option B — unit fetches the script from the DB itself.** A small
root-side helper installed as
`/usr/local/libexec/left4me/left4me-fetch-script` does:
```python
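#!/usr/bin/python3
import sqlite3
import sys

overlay_id = int(sys.argv[1])
conn = sqlite3.connect("/var/lib/left4me/left4me.db")
try:
    row = conn.execute(
        "SELECT script FROM overlays WHERE id = ?", (overlay_id,)
    ).fetchone()
finally:
    conn.close()
if row is None:
    # Fail ExecStartPre instead of silently running an empty script.
    sys.exit(f"no overlay row with id {overlay_id}")
sys.stdout.write(row[0] or "")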
```
Unit's ExecStartPre runs it as root (the `+` prefix), pipes the
output to a runtime path that ExecStart reads:
```ini
RuntimeDirectory=left4me/sandbox-scripts
RuntimeDirectoryMode=0700
ExecStartPre=+/bin/sh -c '/usr/local/libexec/left4me/left4me-fetch-script %i \
> /run/left4me/sandbox-scripts/%i.sh && chmod 0644 /run/left4me/sandbox-scripts/%i.sh'
BindReadOnlyPaths=/run/left4me/sandbox-scripts/%i.sh:/script.sh
```
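(`RuntimeDirectory=` auto-creates `/run/left4me/sandbox-scripts/` on
start and removes it on stop, including the file inside. The directory
is shared by every instance of the template, so if builds for different
overlays can overlap, a per-instance directory such as
`RuntimeDirectory=left4me/sandbox-scripts/%i` avoids one instance's stop
removing another instance's script.)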
The fetch script doesn't need sudoers — it runs from ExecStartPre with
root privileges already. It only reads the DB; no writes. The DB is
`root:left4me 0640` so root can read it.
Worker becomes a one-liner: `sudo systemctl start build-overlay@<id>`.
No FS prep, no tmpfile cleanup.
**Option C — pipe the script content directly into bash stdin.** The
unit's ExecStart is something like
`/bin/sh -c "fetch-script %i | /bin/bash"`. Pros: no on-disk file at
all. Cons: `/bin/bash` runs without a file path, so `$0` is `bash` and
error messages look weird; harder to debug a failing script when there's
no file to inspect.
**Recommendation**: Option B. Decouples script storage (DB) from
sandbox transport (a /run/ runtime file). RuntimeDirectory= handles
cleanup. Worker becomes trivially small. The fetch-script helper is
~10 lines and stays in deploy/files/usr/local/libexec/left4me/.
If Option A is chosen instead, plan to track the script tmpfiles
explicitly so they don't accumulate. With Option B, RuntimeDirectory
auto-cleans on stop.
### Worker invocation
Replace `run_sandboxed_script` in
`l4d2web/services/overlay_builders.py`. The code below is the **Option
A** shape (worker writes the script). For **Option B** (recommended),
drop the `script_dir`/`script_path`/`write_text`/`chmod` lines — the
unit's ExecStartPre fetches from the DB. The signature can also drop
`script_text` since the worker doesn't need to pass content anymore.
```python
import os
import subprocess
import threading


def run_sandboxed_script(
    overlay_id: int,
    script_text: str,  # remove this param if Option B
    *,
    on_stdout: LogSink,
    on_stderr: LogSink,
    should_cancel: CancelCheck,
) -> None:
    # The five lines below are Option A only; delete for Option B.
    script_dir = _sandbox_script_dir()
    script_dir.mkdir(parents=True, exist_ok=True)
    script_path = script_dir / f"{overlay_id}.sh"
    script_path.write_text(script_text or "")
    os.chmod(script_path, 0o644)

    unit = f"build-overlay@{overlay_id}.service"

    # Tail the unit's journal as a sidecar so output streams into job-logs
    # while the unit runs. journalctl --follow does not exit on its own, so
    # it is terminated below once systemctl returns. journalctl output does
    # not separate stdout from stderr, so everything goes to on_stdout.
    journal = subprocess.Popen(
        ["journalctl", "--unit", unit, "--output=cat", "--follow",
         "--since=now", "--no-pager"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )

    def pump_journal() -> None:
        # Forward lines as they arrive; ends when journalctl's stdout closes.
        for line in journal.stdout or []:
            on_stdout(line.rstrip("\n"))

    reader = threading.Thread(target=pump_journal, daemon=True)
    reader.start()

    try:
        # Start the unit (sudoers permits this exact verb pattern).
        # Type=oneshot makes this block until ExecStart returns.
        # should_cancel is not wired up in this sketch.
        rc = subprocess.run(
            ["sudo", "-n", "/bin/systemctl", "start", unit],
            check=False,
        ).returncode
    finally:
        # Stop the tail and always reap it. Lines journalctl had not
        # printed by the time systemctl returns may be lost.
        journal.terminate()
        try:
            journal.wait(timeout=5)
        except subprocess.TimeoutExpired:
            journal.kill()
            journal.wait()
        reader.join(timeout=5)

    # Read exit code from the unit. ExecMainStatus is the script's rc;
    # Result is "success" / "failed" / "timeout" etc. Parse Key=Value
    # lines rather than relying on the output order of --value.
    show = subprocess.check_output(
        ["systemctl", "show", unit, "-p", "ExecMainStatus", "-p", "Result"],
        text=True,
    )
    props = dict(line.split("=", 1) for line in show.splitlines() if "=" in line)
    exec_main_status = int(props.get("ExecMainStatus", "-1"))
    result = props.get("Result", "unknown")

    if rc != 0 or result != "success":
        raise BuildError(
            f"build-overlay@{overlay_id} failed: "
            f"systemctl rc={rc} unit result={result} script exit={exec_main_status}"
        )
```
That's ~30 lines vs. ~50 today, and the helper script disappears
entirely.
**Two refinements to consider:**
1. **Cancel semantics**: today the worker's `should_cancel` callback
triggers a SIGTERM via the existing `run_command` plumbing. With
systemctl-start, you'd issue `systemctl stop build-overlay@<id>`
in a parallel thread when `should_cancel()` returns True. Wire
that up.
2. **Journal streaming race**: `journalctl --follow --since=now`
started *after* `systemctl start` may miss the first few lines.
Two fixes:
- Start the journal tail before systemctl-start (the unit doesn't
exist yet, so journalctl waits silently — verify this behaviour
on Trixie).
- Or use `journalctl --cursor` machinery: snapshot the cursor
before start, then read with `--cursor=` after.
Start-before is simpler and likely sufficient for L4D2 build
verbosity, where the first second of output isn't critical.
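The cancel refinement could be wired up with a small watcher thread along these lines. This is a sketch under stated assumptions: `start_cancel_watcher` and its parameters are hypothetical names, and `stop_unit` stands in for whatever callable runs `sudo -n /bin/systemctl stop <unit>`.

```python
import threading
from typing import Callable


def start_cancel_watcher(
    should_cancel: Callable[[], bool],
    stop_unit: Callable[[], None],
    done: threading.Event,
    poll_interval: float = 1.0,
) -> threading.Thread:
    """Poll should_cancel() until the build finishes; stop the unit at most once."""

    def watch() -> None:
        # Event.wait returns True as soon as `done` is set, ending the loop.
        while not done.wait(poll_interval):
            if should_cancel():
                stop_unit()
                return

    thread = threading.Thread(target=watch, daemon=True)
    thread.start()
    return thread
```

The worker would create the `done` event before `systemctl start`, set it in the `finally` block once systemctl returns, and then join the thread so no watcher outlives the build.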
### Sudoers
Replace:
```
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox
```
with:
```
left4me ALL=(root) NOPASSWD: /bin/systemctl start build-overlay@*.service
left4me ALL=(root) NOPASSWD: /bin/systemctl stop build-overlay@*.service
```
(Tighter — verb-prefixed and instance-globbed. No script path passed.)
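Because the sudoers rule globs the instance name, the worker can also refuse to build anything but a purely numeric instance before it reaches sudo. A minimal sketch, assuming overlay ids are non-negative integers; `build_unit_name` is a hypothetical helper, not existing project code.

```python
import re

_UNIT_RE = re.compile(r"build-overlay@[0-9]+\.service")


def build_unit_name(overlay_id: int) -> str:
    """Return the template instance name for an overlay row id."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(overlay_id, bool) or not isinstance(overlay_id, int) or overlay_id < 0:
        raise ValueError(f"invalid overlay id: {overlay_id!r}")
    unit = f"build-overlay@{overlay_id}.service"
    # Defensive re-check: only digits may sit between "@" and ".service".
    if not _UNIT_RE.fullmatch(unit):
        raise ValueError(f"unexpected unit name: {unit!r}")
    return unit
```

This does not replace a tight sudoers pattern; it only narrows what this particular caller will ever pass.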
### Slice
`l4d2-build.slice` already exists (per today's gameserver/sandbox
configuration). Reuse it; no change needed.
### Sandbox script tmpfile cleanup
Currently `run_sandboxed_script` writes a per-invocation
`tempfile.NamedTemporaryFile` with a random suffix and unlinks it in a
`finally`. With template-unit lookup, the script path is **predictable
per overlay id** (`/var/lib/left4me/sandbox-scripts/<id>.sh`).
Implications:
- Two concurrent builds for the *same* overlay id would clobber the
script file. The job queue already serializes per-overlay (per
`l4d2web/services/job_worker.py:OVERLAY_OPERATIONS`), so this
is OK.
- Scripts persist between builds (no auto-cleanup). Either accept
that (the next build overwrites) or delete after the unit goes
inactive. Recommend: leave them — small, useful for debugging.
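With a predictable per-overlay path, writing the script through a temporary file and a rename keeps a half-written file from ever being visible at `<id>.sh`. The sketch below is one possible shape for Option A. `write_script_atomically` is a hypothetical name, and it assumes the temporary file lives in the same directory (and so the same filesystem) as the final path.

```python
import os
import tempfile
from pathlib import Path


def write_script_atomically(script_dir: Path, overlay_id: int, script_text: str) -> Path:
    """Write <id>.sh via a temp file plus rename so readers never see a partial file."""
    script_dir.mkdir(parents=True, exist_ok=True)
    final_path = script_dir / f"{overlay_id}.sh"
    fd, tmp_name = tempfile.mkstemp(dir=script_dir, prefix=f".{overlay_id}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(script_text or "")
        os.chmod(tmp_name, 0o644)
        os.replace(tmp_name, final_path)  # atomic within one filesystem
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path
```

The next build for the same overlay simply replaces the file, which matches the "leave them" recommendation above.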
## Migration
In order:
1. **Add the unit emission to ckn-bw's `bundles/left4me/metadata.py`
systemd_units reactor.** Mirror the pattern used for
`left4me-server@.service`. Drop in the template-unit content as
another reactor entry.
2. **Update sudoers** (`bundles/left4me/files/etc/sudoers.d/left4me`)
to permit `systemctl start/stop build-overlay@*.service` and
remove the script-sandbox grant.
3. **Replace `run_sandboxed_script` in left4me.** Add the new
journalctl-based output streaming, exit-code reading, and cancel
handling. Keep the function signature stable so callers
(`ScriptBuilder.build`, the wipe route) are unchanged.
4. **Delete `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`.**
5. **Update tests:**
- `deploy/tests/test_deploy_artifacts.py`:
- Drop `test_script_sandbox_uses_idmap_staging` and any other
tests that read SCRIPT_SANDBOX_HELPER.
- Add tests that assert the new unit emission in ckn-bw's
reactor output. (But that's in the other repo — left4me's
deploy tests can't directly cover it.)
- Add a test that asserts the worker invokes
`sudo systemctl start build-overlay@*` (grep
`overlay_builders.py`).
- `l4d2web/tests/test_overlay_builders.py` (if it exists):
update mocks for `run_sandboxed_script` to expect the new
subprocess shape.
6. **Test on `left4.me`:**
- Push left4me, `bw apply ovh.left4me`. Apply also picks up the
new unit emission and the sudoers change.
- Trigger a script-overlay rebuild via the web UI or the
enqueue API path used in this session (see test history in
git log around 2026-05-15).
- Inspect: `journalctl -u build-overlay@9.service`,
`systemctl status build-overlay@9.service`.
- Verify on-disk state: overlay files end up `left4me`-owned;
idmap bind cleanly torn down (`findmnt | grep idmap` empty).
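For the test step, any assertion over the new unit needs to read directives out of unit text, including keys that repeat such as `ExecStartPre=`. A minimal sketch of such a parser follows. `parse_unit` is a hypothetical helper, and it covers only the subset of unit-file syntax used in the template above (sections, `Key=Value` lines, comments, backslash continuations).

```python
from collections import defaultdict


def parse_unit(text: str) -> dict[str, dict[str, list[str]]]:
    """Parse unit-file text into {section: {key: [values, ...]}}.

    Repeated keys are kept in order; backslash line continuations are
    joined with a single space; comment and blank lines are skipped.
    """
    sections: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    current = None
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not pending and (not line or line.startswith(("#", ";"))):
            continue
        if line.endswith("\\"):
            pending += line[:-1].rstrip() + " "
            continue
        line, pending = pending + line, ""
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            sections[current][key.strip()].append(value.strip())
    return sections
```

A test could then assert, for example, that `parse_unit(unit_text)["Service"]["Type"] == ["oneshot"]` and that the expected hardening keys are present.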
## Open decisions for the future session
0. **Script source: filesystem (Option A) vs. DB-fetched in ExecStartPre
(Option B) vs. piped to stdin (Option C).** See the "Script source"
section above. This is the highest-impact decision because it
shapes the worker, the unit's ExecStartPre, and whether you need
a fetch-script helper binary at all. Recommendation: Option B.
1. **`/run/left4me/idmap/%i` vs. `/var/lib/left4me/tmp/sandbox-idmap-%i`** —
`/run` is tmpfs and wiped on reboot, more correct for transient
mount paths. But it requires the dir to exist (created by
ExecStartPre). Either works.
2. **What to do with the existing `left4me-apply-cake` dead code**:
irrelevant to this refactor; flagged in the other handoff doc.
3. **Whether to drop the post-build `chmod o+r` in the sandbox helper**:
already gone in the build-time-idmap commit. (Verify in the new
unit nothing equivalent is needed; files are left4me-owned, web
reads via primary uid.)
4. **`Type=oneshot` vs. `Type=exec`** — oneshot blocks `systemctl
start`. exec doesn't. With oneshot we don't need the
`journalctl --follow` workaround if we read journal *after*
completion. But for live progress (which the existing builds
stream), `--follow` is still needed. Stick with oneshot.
5. **Should the unit set `KillMode=mixed`** to ensure children die on
stop? Worth checking: the existing systemd-run line doesn't set it
explicitly, and the default `KillMode=control-group` already signals
every process in the unit's cgroup, so the default likely suffices.
6. **`StateDirectory=` vs. explicit `mkdir -p`** — systemd has
StateDirectory and RuntimeDirectory directives that auto-create
per-unit directories. Could replace the `mkdir -p /run/left4me/idmap/%i`
ExecStartPre with `RuntimeDirectory=left4me/idmap/%i`. Cleaner;
gets auto-cleanup on stop too. Recommend doing this — both the
mkdir and the rmdir ExecStopPost would go away.
## Verification
End-to-end smoke test on `left4.me` after the deploy:
```bash
# unit is installed and template-parseable
systemctl list-unit-files 'build-overlay@.service' # STATE should read "static"
sudo systemd-analyze verify build-overlay@1.service
# enqueue a build via the web app's worker path (mimic the
# enqueue_build_overlay pattern from this session's job 64 onwards)
# then watch:
sudo journalctl -u build-overlay@9.service -f
# on completion:
systemctl show build-overlay@9.service -p Result -p ExecMainStatus
# expect: Result=success, ExecMainStatus=0
# disk state
sudo find /var/lib/left4me/overlays/9 -uid 981 # should be empty
sudo find /run/left4me/idmap # should not exist or be empty
# pid 1 mount table — no orphan idmap binds
sudo findmnt --task 1 -o TARGET | grep idmap # empty
```
## Risks
- **Worker cancel-during-build**: today's `should_cancel` callback
signals via `run_command`'s child process. With the unit, the
worker needs a separate path: spawn a thread that polls
`should_cancel()` and calls `sudo systemctl stop build-overlay@<id>`
when triggered. Without this, a user-cancelled build keeps running
until it finishes or systemd's own time limit kills it.
- **Journal lag at unit start**: `journalctl --follow` started before
`systemctl start` should pick up all output. If not, may need
cursor-based streaming. Test with a script that prints immediately
(`echo hello; exit 0`) — if "hello" appears in the job log, race
is handled.
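- **Sudoers globbing**: `systemctl start build-overlay@*.service`
permits any instance id including weird strings like `../etc-passwd`.
sudoers matches command-line arguments as one concatenated string, so
`*` also matches across spaces and could admit extra unit names after
the first. Use a tighter pattern if possible (e.g.,
`build-overlay@[0-9]*.service`, which still allows arbitrary text after
the first digit). Test that sudoers rejects unexpected instance names
and extra arguments.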
- **Type=oneshot return semantics**: confirm that `systemctl start
build-overlay@<id>` on a Type=oneshot unit returns rc=3 (or
similar) when the unit's ExecStart fails, so the worker can detect
failure without re-querying `systemctl show`.
- **Builds running across a reboot**: a build that is still running when
the system goes down is killed. That's identical to today's
behavior with systemd-run. Acceptable.
- **The journalctl sidecar process accumulates as a zombie if not
reaped properly.** The proposed code does `journal.wait(timeout=5)`
— handle the timeout case (force-kill).
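The force-kill fallback for the journal sidecar can be sketched as a small helper. `stop_sidecar` is a hypothetical name and the 5-second grace period mirrors the `journal.wait(timeout=5)` in the proposed code.

```python
import subprocess


def stop_sidecar(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """Terminate a sidecar process and always reap it; return its exit code."""
    if proc.poll() is None:
        proc.terminate()  # polite SIGTERM first
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be ignored
        return proc.wait()
```

Calling this from the worker's `finally` block means the sidecar is reaped on every exit path, so no zombie is left behind.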
## Pointers
Reference files (with line numbers if applicable):
- **Current helper to be removed**:
`deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`
- **Current worker invoker**:
`l4d2web/services/overlay_builders.py:run_sandboxed_script` (~ln 324)
- **Current job-worker dispatch**:
`l4d2web/services/job_worker.py` (build_overlay operation)
- **Sudoers**:
`deploy/files/etc/sudoers.d/left4me` (matched verbatim in
`ckn-bw/bundles/left4me/files/etc/sudoers.d/left4me`)
- **Sample template unit pattern** (the model to copy):
`left4me-server@.service` emission in ckn-bw's
`bundles/left4me/metadata.py` systemd_units reactor.
- **Existing slice declaration** (already correct):
`l4d2-build.slice` in ckn-bw's reactor.
Recent commits that touched this surface:
- `4838108` — moved idmap to build time (the refactor that surfaced
the namespace bug)
- `f1aa05d` — added nsenter self-wrap (the band-aid this refactor
removes)
- `2f6a9cf`, `9053186`, `dd918ac` — earlier idmap-on-mount approach
that was reverted
Related design docs:
- `docs/superpowers/plans/2026-05-15-build-time-idmap.md` — the
plan whose architecture this refactor builds on
- `docs/superpowers/specs/2026-05-15-deploy-dir-rethink-design.md`:
unrelated open questions about deploy/ layout
## What's NOT in scope
- Rewriting the sandbox in Python / packaging differently.
- Changing the security hardening profile (the unit duplicates the
current set verbatim — adjust later if needed).
- Splitting the gameserver uid from the web app uid (noted in earlier
handoff doc).
- Re-evaluating whether `l4d2-sandbox` should exist as a separate
uid (kept; defense in depth).
- Touching the `left4me-overlay` gameserver helper (it already uses
the pattern; only the sandbox helper is being refactored to match).
## Estimate
Rough breakdown for the future session:
- Unit file design + ckn-bw reactor change: 1-2 hours
- Worker rewrite (run_sandboxed_script): 1-2 hours
- Tests: 1 hour
- Deploy + verify on test server: 30 min
- Bug-fix and iteration buffer: 1 hour
~5 hours of focused work, assuming no surprises with journalctl
streaming or sudoers semantics.
## Decision criteria for whether to do this
Do it if:
- You're about to make any other change to the sandbox hardening,
build lifecycle, or sandbox uid story.
- You're frustrated by debugging the existing helper.
- You want to remove the nsenter band-aid for hygiene.
Skip if:
- The sandbox is stable and you're not planning related changes.
- You'd rather invest the time in higher-value work elsewhere.
The current solution is fine; this refactor is upgrade-not-fix.

View file

@ -1,252 +0,0 @@
# Deploy directory architecture — open questions
**Resolved 2026-05-15 by [`docs/superpowers/plans/2026-05-15-deploy-dir-rethink.md`](../plans/2026-05-15-deploy-dir-rethink.md).**
Decision summary: `deploy/` is reference material; privileged scripts moved
to top-level `scripts/{libexec,sbin}/`; `deploy-test-server.sh` deleted;
dead static units (cake.service, nft-mark.service) deleted; reactor-emitted
units (server@, web, workshop-refresh.{service,timer}, slices) retained as
curated examples; ckn-bw `install_left4me_scripts` action repointed to the
new source paths. Body below preserved for archaeology.
---
**Status: open questions, not a settled design.** This is a thinking-aloud
handoff prompted by the script-consolidation change on 2026-05-15. Decisions
deferred; a future session should pick this up, talk through the options,
and commit to one shape.
## What happened on 2026-05-15 that prompted this
Two changes landed in quick succession:
1. `left4me-overlay` grew idmap bind-mount support so kernel-overlayfs copy-up
from `l4d2-sandbox`-owned lowerdirs produces `left4me`-owned upperdir
entries (commits `2f6a9cf` + `9053186`).
2. Consolidated all five privileged scripts (4 libexec helpers + 1 sbin
admin CLI) so left4me owns the source of truth and ckn-bw `install`s
them from `/opt/left4me/src/deploy/files/usr/local/{libexec,sbin}/`
after `git_deploy` (left4me `f5e36ee`, ckn-bw `3ccaa91`).
During (2), several architectural assumptions got revised mid-flight rather
than thought through fully:
- `deploy/README.md` flipped from "Status: superseded, historical reference"
to "deploy/files/ is canonical, only `deploy-test-server.sh` is historical."
- The scripts kept their existing deeply-nested paths under
`deploy/files/usr/local/libexec/left4me/*` rather than moving to a
cleaner top-level layout (an earlier draft of the plan proposed `bin/`,
but the user pushed back on mixing the admin CLI with the helpers).
- The resulting state works but several things feel half-finished. This
document enumerates them so they don't rot.
## Current state to look at before deciding anything
- `deploy/files/usr/local/libexec/left4me/{left4me-systemctl,journalctl,overlay,script-sandbox,apply-cake}`
- `deploy/files/usr/local/sbin/left4me`
- `deploy/files/usr/local/lib/systemd/system/{left4me-server@.service,left4me-web.service,...}`: **NOT** deployed; ckn-bw emits via reactor. Currently dead-but-kept-for-reference.
- `deploy/files/etc/{sudoers.d/left4me,sysctl.d/99-left4me.conf,left4me/sandbox-resolv.conf,left4me/cake.env}`: `sudoers.d/left4me`, `sysctl.d/99-left4me.conf`, and `left4me/sandbox-resolv.conf` are shipped (verbatim, from ckn-bw's own copies, so **still duplicated!**). `cake.env` is dead code.
- `deploy/templates/etc/left4me/{host.env,web.env.template}` — Mako-rendered by ckn-bw's `bundles/left4me/files/etc/left4me/{host.env.mako,web.env.mako}` (its own copies, **also duplicated**).
- `deploy/deploy-test-server.sh` — superseded one-shot bash installer.
- `deploy/tests/test_deploy_artifacts.py` — pytest assertions over the
files above. Currently canonical / load-bearing.
The script consolidation only handled `usr/local/libexec/left4me/*` and
`usr/local/sbin/left4me`. The other duplicated items above were not in
scope.
## Open question 1: what does `deploy/` mean?
Four framings, not mutually exclusive but each implies different next moves:
- **A. "Files to install onto the target"** — single source of truth for
every deployable artifact (scripts, configs, sudoers, sysctl, units,
env templates). ckn-bw becomes pure orchestration: users, groups,
dirs, apt, venv, install actions reading from deploy/.
- **B. "Deploy-mechanism artifacts only"** — installer scripts, runbook
docs, env-template *examples*. Real project executables live elsewhere
in the repo.
- **C. "Reference documentation of deploy decisions"** — historical-flavored.
Real source-of-truth lives in ckn-bw. This was the framing before
2026-05-15.
- **D. "Configuration for the deploy target"** — sudoers, sysctl,
sandbox-resolv.conf, env. Executables live elsewhere.
Today we drifted into **A** for the scripts, **C** lingering for the
systemd units, partial-A-partial-C for /etc/ stuff, and we promoted the
templates section without changing its actual role. Inconsistent.
Pick one and lean in.
## Open question 2: should the scripts live in deploy/ at all?
Argument for keeping them where they are:
- Source path = deploy target. Self-documenting.
- Zero churn from the just-landed consolidation.
Argument for moving them out (top-level `libexec/`, `sbin/`, or `bin/`):
- `deploy/` has historically meant "deploy mechanism." Putting a 381-line
Python program (`left4me-overlay`) there mixes "deploy artifacts" with
"core project logic." `left4me-overlay` is real software; it has
tests, it gets edited like any other code.
- Nesting is deep: `deploy/files/usr/local/libexec/left4me/left4me-overlay`
is 5 levels of dir before the actual file.
- Shorter paths make Python constants more readable (the test file uses
`OVERLAY_HELPER = DEPLOY / "files/usr/local/libexec/left4me/left4me-overlay"`).
Counter to the move:
- The user pushed back on a flat `bin/` because it mixes admin CLI
(`left4me`, sbin role) with internal helpers (`left4me-overlay` et al.,
libexec role). A two-dir top-level layout (`libexec/` + `sbin/`) avoids
that mix at the cost of two top-level dirs.
Open variants:
- Flat top-level `bin/` (mixed roles, simplest)
- Top-level `libexec/` + `sbin/` (role-separated, two top-level dirs)
- Top-level `scripts/` with `libexec/` and `sbin/` subdirs (one umbrella)
- Stay in `deploy/files/usr/local/{libexec,sbin}/` (current)
## Open question 3: what to do with `deploy-test-server.sh`
The script duplicates ckn-bw's install logic in bash form. ckn-bw is
authoritative now; the script is at best stale documentation, at worst
actively misleading (the user came close to running it against an
ovh.left4me node during one of the recent debugging passes).
Options:
- **Delete entirely.** ckn-bw is the deploy. Script's content survives
in git history if anyone wants to reference it.
- **Relocate to `docs/`** as a readable "what does deploy do?" walkthrough.
Drop the executable bit, mark it explicitly as docs-only.
- **Keep as-is.** README already says superseded; one extra warning in
the script header would suffice. Lowest churn, ongoing rot risk.
If we go with the consolidation direction (everything canonical in
left4me), keeping a `deploy-test-server.sh` that doesn't match the
canonical paths becomes a documentation bug. Maintaining it in sync
with ckn-bw's items.py is overhead nobody wants.
## Open question 4: bw responsibilities vs. file installs
Today's split:
- **bw owns:** users, groups, dirs, env files (Mako-templated with node
metadata), sudoers + sysctl + sandbox-resolv.conf (verbatim, **its own
copies**), systemd units (reactor-emitted from `metadata.py`), apt
packages, venv creation, pip install, alembic, seed-overlays, the
install action for privileged scripts.
- **left4me owns:** privileged scripts (via the install action reading
from `/opt/left4me/src/deploy/files/usr/local/{libexec,sbin}/`).
The split is inconsistent. ckn-bw ships its own copies of:
- `bundles/left4me/files/etc/sudoers.d/left4me`
- `bundles/left4me/files/etc/sysctl.d/99-left4me.conf`
- `bundles/left4me/files/etc/left4me/sandbox-resolv.conf`
- `bundles/left4me/files/etc/left4me/{host.env.mako,web.env.mako}`
And **left4me also has copies** of the first three at
`deploy/files/etc/{sudoers.d/left4me,sysctl.d/99-left4me.conf,left4me/sandbox-resolv.conf}`.
Either ckn-bw's are the source of truth (in which case left4me's are
stale/historical), or left4me's are (in which case we should extend the
install-from-checkout pattern to these too).
Mako-templated env files genuinely need bw's metadata access — those
probably stay in ckn-bw as the authoritative renderer. But the
templates themselves could live in left4me with placeholders that bw
substitutes. We're not far from that today.
The clean version of "left4me canonical" would have:
- Verbatim files (sudoers, sysctl, sandbox-resolv.conf, scripts) all in
`deploy/files/...` in left4me. ckn-bw's bundle files/ directory holds
nothing but the Mako env templates (which need bw's metadata).
- Sudoers gets `test_with: visudo -cf {}` — currently a property of
ckn-bw's files item. To preserve this when the file moves to install-
via-action, the action itself would need to run `visudo -cf
/opt/left4me/src/deploy/files/etc/sudoers.d/left4me` before the install
step. Doable but adds complexity.
The clean version of "split-by-purpose" would have:
- Verbatim files stay in ckn-bw (config bundles are bundles' jobs).
- Scripts in left4me, exactly as today.
- left4me's `deploy/files/etc/` becomes pure reference — and we should
either keep it explicitly labeled as such, or delete it to avoid
duplication drift.
Both are coherent. Today we have neither — half-and-half.
## Open question 5: dead-code cleanup
These files exist in `deploy/files/` but serve no live purpose:
- `usr/local/lib/systemd/system/{left4me-cake.service,left4me-nft-mark.service}` — units replaced by ckn-bw's reactor / nftables bundle.
- `usr/local/lib/systemd/system/{left4me-server@.service,left4me-web.service,left4me-workshop-refresh.{service,timer},l4d2-game.slice,l4d2-build.slice}` — also reactor-emitted, not installed from these files.
- `usr/local/libexec/left4me/left4me-apply-cake` — dead since CAKE moved to networkd. Currently ships via the new install glob (harmless extra file on `/usr/local/libexec/left4me/`).
- `usr/local/lib/left4me/nft/left4me-mark.nft` — central nftables bundle replaced this.
- `etc/left4me/cake.env` — replaced by node metadata.
Each one of these is a self-contained delete-when-someone-feels-like-it
job. Cumulatively they add up to enough noise that future readers will
get confused about what's load-bearing.
Probably worth a "deploy/ janitorial pass" PR that just deletes the
documented-as-obsolete files. Out of scope for whatever architectural
shift you commit to, but mention it as adjacent cleanup.
## Adjacent thing the script consolidation introduced
The `install_left4me_scripts` action in ckn-bw ships *everything* in
`deploy/files/usr/local/libexec/left4me/` to `/usr/local/libexec/left4me/`
via `install -t DEST .../left4me/*`. This is what makes the action
filename-agnostic. Side effect: `left4me-apply-cake` (dead code) gets
installed too. It sits inert on disk because no unit references it.
Three escape hatches:
- Delete the file from `deploy/files/...` (clean — kills dead code).
- Move the file out of the install path (e.g. to `docs/historical/`).
- Filter the glob (introduces a named exclusion; user explicitly didn't
want filename-naming in the action).
If the broader "open question 5" cleanup happens, this resolves itself.
## Recommended structure for the followup session
When picking this up:
1. Read `deploy/README.md` (current shape) and this doc.
2. Pick a position on **open question 1**: what does `deploy/` mean?
The answer constrains everything else.
3. Once 1 is settled, **open questions 2 and 4 fall out**: where do
scripts live, where do config files live.
4. **Open question 3** (`deploy-test-server.sh` fate) is independent of
the others and can be decided in isolation.
5. **Open question 5** (dead-code cleanup) is independent too;
probably worth doing alongside whatever else lands.
6. End state should be: the rules for "what goes in deploy/" can be
written in two sentences. Today they take a paragraph plus
exceptions.
## Pointers
- `deploy/README.md` documents the current canonical/historical split.
- ckn-bw's bundle: `git.sublimity.de/cronekorkn/ckn-bw`,
`bundles/left4me/items.py`. The `install_left4me_scripts` action and
the files dict are the relevant entry points.
- Plan that landed the recent change:
`docs/superpowers/plans/2026-05-14-overlay-idmap.md` (idmap helper) and
the ~/.claude/plans scratch file for the script consolidation.
- Recent commit history that touched this surface:
- `f5e36ee` deploy: claim /usr/local/sbin/left4me admin CLI in deploy/files
- `2f6a9cf` + `9053186` left4me-overlay idmap support
- ckn-bw `3ccaa91` left4me: install privileged scripts from git_deploy artifact
## What I don't think is in scope here
- Rewriting the shell helpers in Python / packaging them as
console_scripts. Considered and rejected in the script-consolidation
plan because of the egg-info / TOCTOU privilege concern around
left4me-uid-writable bin dirs.
- Switching to a kernel-overlayfs alternative.
- Splitting the gameserver uid from the web app uid. Separate planned
change.

View file

@ -1,349 +0,0 @@
# Deployment responsibility — design
## Status
**Shipped 2026-05-15.** All five migration steps landed and verified on
ovh.left4me. Implementation plan:
`docs/superpowers/plans/2026-05-15-deployment-responsibility.md`.
## Context
Trace: `2026-05-06-left4me-deployment-design.md` established the original
model — left4me's `deploy/files/` mirrors target filesystem paths;
ckn-bw integrates. The hardening refactor
(`2026-05-15-hardening-refactor-design.md`) landed *inline-in-reactor*
as an explicit tradeoff and queued the responsibility question for this
brainstorm (handoff: `2026-05-15-handoff-deployment-responsibility.md`).
The runtime-state relocation
(`2026-05-15-runtime-state-relocation-design.md`) made
`/opt/left4me/src` root-owned, which is the prerequisite that makes
target-side symlinks into the checkout safe — left4me cannot rewrite
its own deployment artifacts at runtime.
This design picks a narrow, conservative line. Application-shape
artifacts that are static across hosts move to left4me's `deploy/`
tree and are delivered to the target via **target-side symlinks**.
Per-host shape (CPU pinning, gunicorn workers, env file values) stays
bw-managed. The base systemd unit bodies stay bw-managed too — they
encode per-host values (workers, threads, CPU set) that are awkward to
parameterize cleanly, and ckn-bw is already the right place for that
computation.
The wedge between "moves" and "stays" is **threat model knowledge vs.
host shape**. The hardening profile is the security knowledge of the
application; the base unit body is the operational shape of the host.
Different repos.
## Scope
### Moves to left4me/deploy/, delivered via target-side symlinks
| Artifact | Source path | Symlink target |
|---|---|---|
| Hardening drop-in for `left4me-web` | `deploy/files/etc/systemd/system/left4me-web.service.d/10-hardening.conf` (NEW) | `/etc/systemd/system/left4me-web.service.d/10-hardening.conf` |
| Hardening drop-in for `left4me-server@` | `deploy/files/etc/systemd/system/left4me-server@.service.d/10-hardening.conf` (NEW) | same pattern |
| Sudoers | `deploy/files/etc/sudoers.d/left4me` (exists) | `/etc/sudoers.d/left4me` |
| Sysctl drop-in (absorbs `ptrace_scope`) | `deploy/files/etc/sysctl.d/99-left4me.conf` (exists; one line added) | `/etc/sysctl.d/99-left4me.conf` |
| Privileged helpers (`libexec/`) | `deploy/scripts/libexec/*` (relocated from `scripts/libexec/`) | `/usr/local/libexec/left4me/<name>` |
| Privileged helpers (`sbin/`) | `deploy/scripts/sbin/*` (relocated from `scripts/sbin/`) | `/usr/local/sbin/<name>` |
All symlinks are created by bw `symlinks{}` items in
`bundles/left4me/items.py`. `git_deploy:/opt/left4me/src` triggers
`systemctl daemon-reload` (for unit drop-ins) and `sysctl --system`
(for sysctl) so changes to the symlink-target content propagate even
though the symlink path itself doesn't change.
### Stays bw-managed
- **Base unit bodies** (`left4me-web.service`, `left4me-server@.service`):
emitted by the `systemd/units` reactor in
`bundles/left4me/metadata.py`. These encode per-host values
(gunicorn workers/threads, CPU pinning, instance bind paths). Pulling
them into left4me would require either templating or env-var
parameterization that doesn't cleanly cover everything (systemd
doesn't substitute env vars in non-Exec directives like
`SocketBindAllow=`).
- **Slice units** (`l4d2-game.slice`, `l4d2-build.slice`) and cpuset
drop-ins (`system.slice.d/99-left4me-cpuset.conf`,
`user.slice.d/99-left4me-cpuset.conf`): all encode per-host CPU
pinning. Reactor stays.
- **`host.env.mako`, `web.env.mako`**: per-host secret + scalar
templating. Stays.
- **`nginx/vhosts`, `nftables/input`, `nftables/output`**: bundle
abstractions (letsencrypt auto-population, set-merge) add real value
over raw files.
- **`systemd-timers/left4me-workshop-refresh`**: same — bundle
synthesizes the `.timer` + `.service` from the metadata dict.
- **Action chains**: `git_deploy`, `pip_install`, `alembic_upgrade`,
`seed_overlays`, `create_venv`, `pip_upgrade`, `install_steamcmd`.
Stays.
- **`directory`, `user`, `group`** items: must exist before
`git_deploy` runs.
- **`apt/packages`, `backup/paths`** defaults. Stays.
### Stays in left4me as reference fixtures (no change)
`deploy/files/usr/local/lib/systemd/system/*.{service,slice}`:
reference units matched against the live form by
`deploy/tests/test_deploy_artifacts.py`. Base units stay bw-emitted,
so the reference-vs-live assertion stays valid. Reference units should
**not** include hardening directives once the drop-in extraction
lands; the live form's hardening lives in the drop-in, not the base
unit.
## Repo layout (left4me)
```
deploy/
files/
etc/sudoers.d/left4me
etc/sysctl.d/99-left4me.conf
etc/systemd/system/left4me-web.service.d/10-hardening.conf # NEW
etc/systemd/system/left4me-server@.service.d/10-hardening.conf # NEW
etc/left4me/sandbox-resolv.conf # unchanged
usr/local/lib/systemd/system/*.{service,slice} # reference (unchanged shape)
scripts/ # moves in from scripts/
libexec/{left4me-overlay,left4me-systemctl,left4me-journalctl,left4me-script-sandbox}
sbin/<wrappers>
tests/
```
## Mechanism: target-side symlinks
bw `symlinks{}` item type. One entry per artifact:
```python
symlinks = {
'/etc/sudoers.d/left4me': {
'target': '/opt/left4me/src/deploy/files/etc/sudoers.d/left4me',
'owner': 'root',
'group': 'root',
'needs': ['git_deploy:/opt/left4me/src'],
},
'/etc/sysctl.d/99-left4me.conf': {
'target': '/opt/left4me/src/deploy/files/etc/sysctl.d/99-left4me.conf',
'owner': 'root',
'group': 'root',
'needs': ['git_deploy:/opt/left4me/src'],
'triggers': ['action:left4me_sysctl_reload'],
},
'/etc/systemd/system/left4me-web.service.d/10-hardening.conf': {
'target': '/opt/left4me/src/deploy/files/etc/systemd/system/left4me-web.service.d/10-hardening.conf',
'needs': [
'directory:/etc/systemd/system/left4me-web.service.d',
'git_deploy:/opt/left4me/src',
],
'triggers': ['action:systemd_daemon_reload'],
},
# …same for left4me-server@.service.d/10-hardening.conf
# …same for each script in /usr/local/{libexec/left4me,sbin}/
}
```
Drop-in directories (`*.service.d/`) need explicit `directory:` items
in `items.py` (the bw systemd bundle does not create them
automatically for symlink-only drop-ins). Mode `0755`, owner
`root:root`.
bw fires the symlink's `triggers:` when **the symlink itself
changes** (path/target update). It does *not* fire when the symlink's
*target content* changes — that's still a `git_deploy:` event. So
both wirings are needed: every symlink declares
`needs: ['git_deploy:/opt/left4me/src']`, and ckn-bw declares
`triggered_by: [git_deploy:/opt/left4me/src]` actions for the global
reloads (`daemon-reload`, `sysctl --system`).
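
The per-script entries elided in the block above could be generated rather than written out by hand. The following is a minimal sketch, not the shipped `items.py`: the libexec names are taken from the repo layout above, the `sbin` list is an assumption (the layout only says `sbin/<wrappers>`), and `script_symlinks` is a hypothetical helper. The `triggered` / `triggered_by` attribute names follow the wording of this section and should be checked against the bundlewrap version in use.

```python
from pathlib import PurePosixPath

CHECKOUT = PurePosixPath('/opt/left4me/src')
GIT_DEPLOY = f'git_deploy:{CHECKOUT}'

# libexec names come from the repo layout above; the sbin list is an
# assumption for illustration (the layout only says `sbin/<wrappers>`).
LIBEXEC_HELPERS = [
    'left4me-overlay',
    'left4me-systemctl',
    'left4me-journalctl',
    'left4me-script-sandbox',
]
SBIN_WRAPPERS = ['left4me']


def script_symlinks(libexec=LIBEXEC_HELPERS, sbin=SBIN_WRAPPERS):
    """Build one symlinks{} entry per helper script (hypothetical helper)."""
    items = {}
    for subdir, dest, names in (
        ('libexec', PurePosixPath('/usr/local/libexec/left4me'), libexec),
        ('sbin', PurePosixPath('/usr/local/sbin'), sbin),
    ):
        for name in names:
            items[str(dest / name)] = {
                'target': str(CHECKOUT / 'deploy/scripts' / subdir / name),
                'owner': 'root',
                'group': 'root',
                # The checkout must exist before the link can point into it.
                'needs': [GIT_DEPLOY],
            }
    return items


symlinks = script_symlinks()

# Content changes behind a symlink arrive as a git_deploy event, so the
# global reload is keyed on the checkout rather than on the symlink.
# Assumed shape; the daemon-reload action would follow the same pattern.
actions = {
    'left4me_sysctl_reload': {
        'command': 'sysctl --system',
        'triggered': True,
        'triggered_by': [GIT_DEPLOY],
    },
}
```

Generating the entries keeps the `needs:` wiring uniform across helpers; the trade-off is that the helper list becomes a second place to update when a script is added.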
## Per-artifact details
### Hardening drop-ins
Extract from `HARDENING_COMMON`, `HARDENING_SERVER`, `HARDENING_WEB`
Python dicts in `bundles/left4me/metadata.py` into static `.conf`
files:
```ini
# deploy/files/etc/systemd/system/left4me-web.service.d/10-hardening.conf
[Service]
ProtectProc=invisible
ProcSubset=pid
ProtectKernelTunables=true
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@debug @mount @raw-io @reboot @swap @cpu-emulation @obsolete
```
Per-directive comments documenting *why* each directive is set the
way it is (sudo-incompatibility carve-outs for web; i386 amendment
and `PrivatePIDs` rationale for server@) should live inline as `#`
comments. Today those rationale comments live in the Python source;
they need to come along.
After extraction, the reactor's emitted unit bodies drop the
`**HARDENING_WEB` / `**HARDENING_SERVER` splat. The reactor still
emits the base unit and is responsible for everything except the
hardening profile.
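
One way to do the one-time extraction is to render the dicts mechanically and then hand-edit the comments. This is a sketch under assumptions: the real shape of `HARDENING_WEB` is not shown in this doc, so the example dict below is a hypothetical stand-in, and `render_dropin` is an illustrative name.

```python
def render_dropin(directives, comments=None):
    """Render a [Service] drop-in from a directive dict.

    A list value becomes one line per element, which is how systemd
    expresses repeated settings such as SystemCallFilter=.
    """
    comments = comments or {}
    lines = ['[Service]']
    for key, value in directives.items():
        if key in comments:
            # Rationale carried over from the Python source as a '#' comment.
            lines.append(f'# {comments[key]}')
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            if isinstance(item, bool):
                item = 'true' if item else 'false'
            lines.append(f'{key}={item}')
    return '\n'.join(lines) + '\n'


# Hypothetical stand-in for HARDENING_WEB; the real dict lives in ckn-bw.
HARDENING_WEB_EXAMPLE = {
    'ProtectProc': 'invisible',
    'ProtectKernelTunables': True,
    'SystemCallFilter': ['@system-service', '~@debug @mount'],
}

if __name__ == '__main__':
    print(render_dropin(HARDENING_WEB_EXAMPLE), end='')
```

The output is meant as a starting point for the static file; once the drop-in is checked in, the renderer has no further role.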
### Sudoers
Today: identical content in
`left4me/deploy/files/etc/sudoers.d/left4me` and
`ckn-bw/bundles/left4me/files/etc/sudoers.d/left4me`. The bw item
sources from the ckn-bw copy.
After: bw `symlinks{}` item; delete the ckn-bw copy and the bw
`files{}` entry. The `test_with: 'visudo -cf {}'` semantics don't
apply to symlinks, so the file is validated at commit time in left4me CI
instead (a `test_sudoers.py` that runs `visudo -cf` against the
checked-in file).
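
The commit-time check could look roughly like the sketch below. It is an assumption about how `test_sudoers.py` might be structured, not its actual contents: `visudo_command` and `check_sudoers` are hypothetical names, and the check reports `None` rather than failing when `visudo` is not installed on the CI runner.

```python
import shutil
import subprocess
from pathlib import Path

# Path relative to the left4me repo root, as named in this design.
SUDOERS = Path('deploy/files/etc/sudoers.d/left4me')


def visudo_command(path):
    """Return the syntax-check invocation for one sudoers file."""
    # -c: check-only mode, -f: check this file instead of /etc/sudoers.
    return ['visudo', '-c', '-f', str(path)]


def check_sudoers(path=SUDOERS):
    """Return True if valid, False if invalid, None if the check cannot run."""
    if shutil.which('visudo') is None or not Path(path).is_file():
        return None
    result = subprocess.run(
        visudo_command(path), capture_output=True, text=True
    )
    return result.returncode == 0
```

A test wrapper would treat `False` as a failure and decide separately whether `None` should skip or fail; failing is the stricter choice if the CI image is expected to ship `sudo`.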
### Sysctl drop-in + ptrace_scope absorption
Today: same dual-copy story as sudoers, plus `kernel.yama.ptrace_scope`
exists as a metadata default (`sysctl/kernel/yama/ptrace_scope: '2'`)
that gets deployed via `bundles/sysctl/` into a separate file.
After: append `kernel.yama.ptrace_scope = 2` to
`deploy/files/etc/sysctl.d/99-left4me.conf`. Delete the metadata
entry. Delete the bw `files{}` entry + ckn-bw mirror; replace with
symlink. `bundles/sysctl/` no longer renders anything for left4me;
all left4me sysctl tuning lives in the one drop-in.
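
Since the drop-in becomes the single source for left4me's sysctl tuning, a small repo-side test can guard against the `ptrace_scope` line being dropped. The parser below is a minimal sketch that handles only plain `key = value` lines and comments; `parse_sysctl` is a hypothetical helper, and the sample text is an illustrative excerpt rather than the real file.

```python
def parse_sysctl(text):
    """Parse sysctl.d-style text into a {key: value} dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # sysctl.d treats lines starting with '#' or ';' as comments.
        if not line or line.startswith(('#', ';')):
            continue
        key, sep, value = line.partition('=')
        if not sep:
            continue
        settings[key.strip()] = value.strip()
    return settings


# Illustrative excerpt only; the real file carries more settings.
EXAMPLE = """\
# absorbed from the former metadata default
kernel.yama.ptrace_scope = 2
"""

assert parse_sysctl(EXAMPLE)['kernel.yama.ptrace_scope'] == '2'
```

In a real test the text would be read from `deploy/files/etc/sysctl.d/99-left4me.conf` instead of the inline sample.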
### Privileged scripts
Done (Task 4): the helpers moved to `deploy/scripts/libexec/` and
`deploy/scripts/sbin/`, under `deploy/` for layout consistency.
`install_left4me_scripts` copy-action replaced by target-side symlinks
from `/usr/local/libexec/left4me/` and `/usr/local/sbin/` into the
checkout at `/opt/left4me/src/deploy/scripts/{libexec,sbin}/`.
Sudo follows symlinks. With `/opt/left4me/src` root-owned, the
symlink target is root-owned, and sudo's `Cmnd_Alias` path matching
sees the original `/usr/local/{libexec,sbin}/<name>` path.
### Reference units in `deploy/files/`
No structural change. Remove hardening directives from the reference
files in lockstep with extracting them into the drop-ins (otherwise
`test_deploy_artifacts.py` sees the reference unit with hardening
inline but the live unit without). The reference file then represents
"the base unit ckn-bw emits"; the drop-in represents "the hardening
profile left4me ships".
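
The lockstep requirement can be checked mechanically: no directive that appears in a hardening drop-in should also appear in the matching reference unit. The sketch below assumes simple one-line directives (no backslash continuations), and both function names are hypothetical.

```python
def unit_directives(text):
    """Return the set of directive names used in a unit file."""
    names = set()
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and section headers such as [Service].
        if not line or line.startswith(('#', ';', '[')):
            continue
        name, sep, _ = line.partition('=')
        if sep:
            names.add(name.strip())
    return names


def leaked_hardening(reference_text, dropin_text):
    """Directives present in both the reference unit and the drop-in."""
    return unit_directives(reference_text) & unit_directives(dropin_text)


reference = "[Service]\nUser=left4me\nProtectProc=invisible\n"
dropin = "[Service]\nProtectProc=invisible\nProcSubset=pid\n"
assert leaked_hardening(reference, dropin) == {'ProtectProc'}
```

An empty result for every reference/drop-in pair means the extraction left nothing behind in the reference units.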
## Migration order
Each step is an independent landable PR.
1. **Canary — sysctl consolidation.**
- Add `kernel.yama.ptrace_scope = 2` to
`deploy/files/etc/sysctl.d/99-left4me.conf`.
- Delete `defaults['sysctl']['kernel']['yama']['ptrace_scope']`
from `bundles/left4me/metadata.py`.
- Delete `bundles/left4me/files/etc/sysctl.d/99-left4me.conf`
(the verbatim mirror).
- Replace the bw `files{}` entry with a `symlinks{}` entry
pointing at the checkout.
- Verify: `sysctl kernel.yama.ptrace_scope` reads `2`;
`bw apply` idempotent.
2. **Hardening drop-ins.**
- Create `deploy/files/etc/systemd/system/left4me-web.service.d/10-hardening.conf`
and `…/left4me-server@.service.d/10-hardening.conf` from the
`HARDENING_*` dicts.
- Remove `**HARDENING_WEB` / `**HARDENING_SERVER` splats from the
reactor; delete the three constants.
- Remove hardening directives from the reference units in
`deploy/files/usr/local/lib/systemd/system/`.
- Add `directory:/etc/systemd/system/left4me-{web,server@}.service.d`
items + symlinks for the drop-ins.
- Wire `systemctl daemon-reload` to fire on
`git_deploy:/opt/left4me/src`.
- Verify: `systemctl show -p ProtectSystem,ProtectKernelTunables,PrivateUsers,…
left4me-web.service left4me-server@1.service` matches the
pre-extraction values (full hardening test plan rerun is the
gold standard).
3. **Sudoers.**
- Replace bw `files{}` entry with `symlinks{}`.
- Delete `bundles/left4me/files/etc/sudoers.d/left4me`.
- Add left4me CI test running `visudo -cf` on the file.
- Verify: `sudo -l -U left4me` lists the expected commands;
gameserver start via the web app still works.
4. **Privileged scripts.**
- `git mv left4me/scripts left4me/deploy/scripts`.
- Update any references (commit hooks, docs).
- Replace `actions['install_left4me_scripts']` with `symlinks{}`
items, one per script. Drop the action.
- Update `git_deploy:` `triggers:` to remove
`action:install_left4me_scripts`.
- Verify: `sudo /usr/local/libexec/left4me/left4me-overlay status 1`
still works; gameserver lifecycle (start/stop) still works.
5. **Cleanup.**
- Prune `gunicorn_workers` / `gunicorn_threads` metadata defaults
if they end up referenced only by `web.env.mako` (they do today;
keep the metadata, they're real per-host values).
- Update `deploy/README.md` to describe the new layout
(deploy/files = symlink source-of-truth; deploy/scripts = same
for helpers).
- Update `bundles/left4me/README.md` to describe the new
symlink-based delivery model.
## Sequence vs. build-overlay-unit refactor
This design lands **before** the build-overlay-unit refactor
(`2026-05-15-build-overlay-unit-design.md`). Reasons:
- build-overlay-unit introduces a dispatcher unit template; its
hardening profile should live as a drop-in alongside the dispatcher
from the start, using the pattern this design establishes.
- The reactor surgery in step 2 (removing `HARDENING_*` splats) is
cleaner against today's reactor than against a reactor that's also
being reshaped for the build-overlay-unit work.
## Verification (end-to-end)
After all five steps land and `bw apply` is idempotent on
`ovh.left4me`:
1. `systemctl show -p ProtectSystem,PrivateUsers,SystemCallFilter,…
left4me-web.service left4me-server@1.service` matches the
hardening test plan's reference values (run the relevant tests
from `docs/superpowers/specs/2026-05-15-hardening-test-plan.md`).
2. `sysctl kernel.yama.ptrace_scope net.core.rmem_max
net.ipv4.tcp_congestion_control` returns expected values.
3. `sudo -l -U left4me` reports the same allowed commands as before.
4. `ls -la /etc/sudoers.d/left4me /etc/sysctl.d/99-left4me.conf
/etc/systemd/system/left4me-*.service.d/10-hardening.conf
/usr/local/libexec/left4me/* /usr/local/sbin/<wrappers>` shows
symlinks into `/opt/left4me/src/deploy/...`.
5. A gameserver round-trip (start via web app → cvar inspect → stop)
succeeds.
6. `bw verify ovh.left4me` reports no drift.
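
Check 4 above can be scripted instead of eyeballing `ls -la` output. The following is a sketch under assumptions: `points_into` is a hypothetical helper, it normalizes the link target lexically instead of resolving it (so it also works where the target does not exist), and `Path.is_relative_to` needs Python 3.9 or newer.

```python
import os
from pathlib import Path

CHECKOUT_DEPLOY = Path('/opt/left4me/src/deploy')


def points_into(path, root=CHECKOUT_DEPLOY):
    """True if `path` is a symlink whose target lies under `root`."""
    path = Path(path)
    if not path.is_symlink():
        return False
    target = Path(os.readlink(path))
    if not target.is_absolute():
        # Relative link targets are interpreted from the link's directory.
        target = path.parent / target
    # Collapse '..' segments without touching the filesystem.
    target = Path(os.path.normpath(target))
    return target.is_relative_to(root)
```

Run on the target host over the paths listed in check 4, any `False` marks an artifact that is still a copied file or points somewhere unexpected.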
## Out of scope
- Moving base unit bodies into left4me. Per-host shape stays
reactor-emitted.
- AppArmor profiles (deferred from the defenses survey).
- Reshaping the bw `files{}` items for `host.env.mako` /
`web.env.mako` — they need mako templating with metadata context,
which ckn-bw is the right place for.
- The build-overlay-unit refactor itself. Lands separately on top of
this.
## Pointers
- Handoff (this brainstorm's framing):
`docs/superpowers/specs/2026-05-15-handoff-deployment-responsibility.md`
- Prereq (runtime state relocation + non-editable install, shipped):
`docs/superpowers/specs/2026-05-15-runtime-state-relocation-design.md`
- Original deployment design (the model being reaffirmed for
application-shape artifacts):
`docs/superpowers/specs/2026-05-06-left4me-deployment-design.md`
- Hardening refactor design (the inline-in-reactor approach this
design supersedes for hardening):
`docs/superpowers/specs/2026-05-15-hardening-refactor-design.md`
- Hardening test plan (reference for step-2 verification):
`docs/superpowers/specs/2026-05-15-hardening-test-plan.md`
- ckn-bw left4me bundle: `~/Projekte/ckn-bw/bundles/left4me/`

View file

@ -1,244 +0,0 @@
# Handoff — brainstorm deployment responsibility (left4me vs. ckn-bw)
## Status
**Resolved 2026-05-15** — the brainstorming session happened and produced
`docs/superpowers/specs/2026-05-15-deployment-responsibility-design.md`.
Read that for the answer. The runtime-state relocation
(`2026-05-15-runtime-state-relocation-design.md`) shipped as a prereq;
the design lands hardening drop-ins, sudoers, sysctl, and helpers as
symlinks into the (now root-owned) `/opt/left4me/src/deploy/...`
checkout, while base unit bodies and per-host shape stay bw-managed.
This doc is kept as the historical framing — the question that opened
the brainstorm, the operator's leaning, and the candidate options that
got evaluated. The actual landed answer is the design doc.
## The question
How should left4me and ckn-bw split responsibility for the host's
deployment?
**Not a fresh question.** The original deployment design at
`docs/superpowers/specs/2026-05-06-left4me-deployment-design.md`
already laid out the canonical shape: `deploy/files/` in the left4me
repo mirrors target filesystem paths for root-owned deployment
artifacts (systemd units, sudoers, helpers, env templates);
"production config management can own both env files directly"
(line 91). The implicit model: **left4me defines the deployment
artifacts; ckn-bw integrates them onto the host.** That spec also
defined a self-contained `deploy/deploy-test-server.sh` so the
deployment could be exercised without ckn-bw at all.
Over time, more and more of those artifacts migrated *into* ckn-bw's
`bundles/left4me/` — specifically:
- systemd unit definitions are now emitted by the
`systemd/units` reactor in `~/Projekte/ckn-bw/bundles/left4me/metadata.py`
(the hardening refactor we just landed reinforced this).
- sysctl options ended up in ckn-bw `bundles/left4me/metadata.py`
`defaults` (just landed too).
- sudoers exists in *both* repos (left4me `deploy/files/.../sudoers.d/left4me`
+ ckn-bw verbatim mirror).
- Privileged helpers moved BACK to left4me as part of deploy-dir-rethink
(commit `5284e28`) — `scripts/{libexec,sbin}/`. Pattern works:
left4me defines, ckn-bw deploys via `install_left4me_scripts`.
So the trajectory has been mixed: helpers re-converged on left4me
(good, matches 2026-05-06); systemd units + sysctl drifted into
ckn-bw (away from 2026-05-06). The brainstorm reconciles this.
**The question**: should we return to the 2026-05-06 model
end-to-end — every deployment artifact lives in left4me's
`deploy/files/`, ckn-bw becomes a thin integrator — or is the
current mixed shape the right answer for some artifact classes?
## Operator's leaning
Security-related artifacts belong **in the left4me repo**, owned by
the project; ckn-bw is responsible for **integrating** them into the
host (deploying them to the right paths, restarting affected units,
etc.) but doesn't *author* them.
Concretely the operator's preference (from session
2026-05-15): "security-related stuff should be bundled in this repo
and ckn-bw is responsible for integrating it into the server."
## Why we're doing this
Background from the hardening-refactor session
(`docs/superpowers/specs/2026-05-15-hardening-refactor-design.md`,
"Approach" section). We considered two shapes for the hardening
landing:
- **A** — hardening directives inline in ckn-bw's `systemd/units`
reactor (the path we took)
- **B** — hardening as drop-in `.conf` files living in left4me's
`deploy/files/etc/systemd/system/<unit>.d/`, ckn-bw deploys them
(consistent with 2026-05-06's `deploy/files/` model)
We picked A for the hardening refactor because B implied a broader
configmgmt responsibility reshape that deserved its own session.
That session is this one.
The motivating arguments for B (this brainstorming session evaluates
them seriously):
1. **Hardening is application knowledge.** Knowing srcds is i386,
that `MemoryDenyWriteExecute=true` breaks Source's text
relocations, that web's sudo path is incompatible with
`PrivateUsers=true` — all of this is left4me's domain, not
ckn-bw's. ckn-bw shouldn't need to understand the threat model.
2. **Test-artifact = production-artifact.** The Test 7 drop-in from
the hardening test plan literally is the file we'd want
deployed. With B, there's no translation step.
3. **Repo self-containment for security review.** A reviewer of
left4me sees the threat model in code form without needing to
read the configmgmt repo.
4. **Easier coordination with the `build-overlay-unit` refactor**
(queued). That unit's hardening profile can ship in its own
drop-in inline with the unit template.
The counter-argument:
- **Coupling cost.** A change to a directive may require redeploying
via ckn-bw, which means a cross-repo coordination cycle (edit
left4me → commit → push → ckn-bw `bw apply`). The same is true today
(edit ckn-bw → push → apply); only *which* repo gets edited changes.
## What "security-related" likely means
Enumerate during the brainstorm. Initial candidates:
- **systemd unit hardening directives** — currently in
ckn-bw `bundles/left4me/metadata.py` `HARDENING_COMMON` /
`HARDENING_SERVER` / `HARDENING_WEB`. Strong candidate for left4me.
- **sysctl drop-ins** — currently `kernel.yama.ptrace_scope=2` in
ckn-bw's left4me bundle `defaults` (`sysctl/kernel/yama/ptrace_scope`).
Strong candidate for left4me.
- **sudoers** — already in `left4me/deploy/files/etc/sudoers.d/left4me`
+ a verbatim mirror in `ckn-bw/bundles/left4me/files/etc/sudoers.d/left4me`.
Already mostly left4me-owned; redundancy worth resolving.
- **Privileged helper scripts** — already in `left4me/scripts/{libexec,sbin}/`,
ckn-bw deploys them via `install_left4me_scripts`. Already
left4me-owned. The pattern works.
- **systemd unit BASE definitions** (`User=`, `ExecStart=`, `Restart=`,
resource limits) — currently in ckn-bw's reactor. **Open question:**
is this application knowledge or infrastructure knowledge? They
depend on the application's binary paths, env files, restart
semantics — all application knowledge. Probably also belongs to
left4me.
- **AppArmor profiles** (if we add them later — deferred from the
defenses survey). Application knowledge.
- **`/etc/left4me/host.env` / `web.env` templating** — ckn-bw owns
these today because they're templated via mako from node metadata
(per-host overrides). Probably stays in ckn-bw.
- **User/group creation** — kernel-side infrastructure, no
application knowledge needed. Stays in ckn-bw.
- **Package installation** (apt). Stays in ckn-bw.
- **Firewall rules** — depend on per-instance port ranges
(`LEFT4ME_PORT_RANGE_*`); could be either. Worth discussing.
- **Nginx vhost** — same: depends on app-specific routes.
## Mechanism: how does ckn-bw "integrate"?
Brainstorm the deploy mechanism. Candidates (already partially
sketched in the hardening-refactor design doc's earlier draft, before
it was reverted to the inline-in-reactor approach):
- **Symlinks.** ckn-bw creates symlinks like
`/etc/systemd/system/left4me-server@.service.d/10-hardening.conf` →
`/opt/left4me/src/deploy/files/etc/systemd/system/.../10-hardening.conf`.
Editing the file in the repo + `systemctl daemon-reload` picks it
up. Cleanest for "ckn-bw doesn't author."
- **File copy via `files` entries.** ckn-bw `files = {...}` reads
from `/opt/left4me/src/deploy/files/...` (post-git_deploy) and
copies to the target. Standard idiom. Two-place state.
- **Glob-walker action.** A small ckn-bw action walks `deploy/files/`
tree and mirrors paths to root.
- **Bundle inclusion / left4me-as-bundle.** Left4me's `deploy/`
becomes its own bundlewrap bundle that ckn-bw imports. Strongest
decoupling; requires bundlewrap bundle conventions.
Each has different implications for: triggers (which units restart
when which files change), drift detection, rollback semantics.
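
To make the glob-walker candidate concrete, the path mapping it relies on can be sketched as below. This only computes the source-to-destination plan and copies nothing; `mirror_plan` is a hypothetical name, and a real action would still have to decide owners, modes, and which reloads to trigger.

```python
from pathlib import Path


def mirror_plan(files_root, dest_root='/'):
    """Map each file under files_root to its mirrored path under dest_root."""
    files_root = Path(files_root)
    dest_root = Path(dest_root)
    plan = {}
    for source in sorted(files_root.rglob('*')):
        if source.is_file():
            # deploy/files/etc/foo -> /etc/foo
            plan[source] = dest_root / source.relative_to(files_root)
    return plan
```

The same mapping underlies the symlink and file-copy candidates; they differ in what is done with each pair, not in how the pairs are derived.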
## Migration / coexistence path
Brainstorm: how do we get from the current state to the new state
without breaking things?
- Inventory: every artifact ckn-bw currently emits/ships for left4me
(the `systemd/units` reactor entries, sysctl defaults, sudoers
mirror, file deploy actions, etc.).
- For each: stays, moves, or split (some in each).
- Mechanism rollout: pick one (symlinks vs. file copy vs. ...) and
apply it consistently.
- Test-driven: pick one artifact as the canary (probably the sysctl
drop-in — smallest), validate the mechanism end-to-end, then
migrate the others.
## Key sub-questions for the brainstorm
1. **Is the unit's BASE definition application knowledge?** If yes,
ckn-bw's `systemd/units` reactor shrinks dramatically — to maybe
one line per unit ("ckn-bw, deploy this file as a unit"). If no,
we have a more delicate split.
2. **What about the user/group definitions?** Infrastructure-side
today. But the application defines that `left4me` (uid 980)
exists; ckn-bw just creates it. Could move.
3. **Per-host configuration** (gunicorn worker count, port ranges,
CPU pinning): these are per-host overrides ckn-bw computes from
node metadata. Stays in ckn-bw (or whatever owns deployment-time
parameterization).
4. **Test infrastructure**: `deploy/tests/test_deploy_artifacts.py`
asserts left4me's reference units match the deployed form. If
left4me starts owning the deployed form, those tests get
stronger (no longer "reference vs. live" drift; the file in
`deploy/files/` *is* the live form).
5. **Drift / observability**: how do we know the deployed state
matches the repo? Today `bw apply` + git diff is the source of
truth. Same applies; mechanism details vary.
6. **Rollback semantics**: removing a drop-in is one `rm` away; the
base unit is preserved. Same applies to reverting the
left4me-side commit and re-applying.
## Prereqs (must land before this brainstorming session)
- **uid-collapse refactor** — queued in
`docs/superpowers/plans/2026-05-15-uid-collapse.md`. Settles the
user model first so the deployment-responsibility brainstorm
doesn't have to juggle a moving user definition.
## Out of scope for the brainstorm
- The hardening composition itself (already settled, deployed,
verified).
- The `build-overlay-unit` template unit refactor
(`docs/superpowers/specs/2026-05-15-build-overlay-unit-design.md`)
— both this brainstorm *and* the build-overlay-unit refactor
benefit from settling responsibility first. Sequencing TBD; the
brainstorm should consider whether to land before or after
build-overlay-unit.
- The application code itself (`l4d2web`, `l4d2host`) — that's
always been left4me-owned.
## Pointers
- **Original deployment design (the model to revisit):**
`docs/superpowers/specs/2026-05-06-left4me-deployment-design.md`
- Hardening refactor design (motivation; the deferred reshape):
`docs/superpowers/specs/2026-05-15-hardening-refactor-design.md`
- Hardening refactor plan (what got landed):
`docs/superpowers/plans/2026-05-15-hardening-refactor.md`
- Defenses survey (mentions AppArmor, deferred):
`docs/superpowers/specs/2026-05-15-hardening-defenses-survey.md`
- Test plan + executed results:
`docs/superpowers/specs/2026-05-15-hardening-test-plan.md`
- uid-collapse plan (prereq):
`docs/superpowers/plans/2026-05-15-uid-collapse.md`
- deploy-dir-rethink (recent reshape that moved scripts into left4me;
background on the current `deploy/` tree):
`docs/superpowers/plans/2026-05-15-deploy-dir-rethink.md` (or
`2026-05-15-deploy-dir-rethink-design.md`)
- Live ckn-bw bundle (the thing being rethought):
`~/Projekte/ckn-bw/bundles/left4me/`


@ -1,285 +0,0 @@
# Handoff — non-editable install + root-owned `/opt/left4me/src`
## Status
**Superseded 2026-05-15** by what actually shipped — see
`docs/superpowers/specs/2026-05-15-runtime-state-relocation-design.md`.
The narrow approach proposed here (just flip `/opt/left4me/src` to
root, switch `pip install -e` → `pip install`) doesn't work as
described: `setuptools.build_meta` writes `<pkg>.egg-info/` into the
source dir during `get_requires_for_build_wheel`, which fails against
a root-owned source. The shipped fix copies source to a writable
tempdir before building, and (since that one-shot copy was needed
anyway) also relocates `.venv` + `steam` to `/var/lib/left4me/`.
The original prereq goal — making target-side symlinks of deployment
artifacts safe — is still met; the realized shape is just bigger than
this doc sketched.
This doc is kept as the historical record of the originally-proposed
approach and why it didn't work.
## The task
Change ckn-bw's `bundles/left4me/` so that:
1. The production install uses **non-editable** pip installs
(`pip install /opt/left4me/src/l4d2host /opt/left4me/src/l4d2web`),
not `pip install -e …`.
2. `/opt/left4me/src/` is **owned by root:root**, not left4me:left4me.
3. The `left4me_chown_src` action and the `/opt/left4me/src` directory
item's `owner`/`group` flip accordingly.
4. The pip-install action moves from "runs every apply" to "triggered
by `git_deploy:/opt/left4me/src`" — non-editable installs always
rebuild a wheel, so running unconditionally is wasteful.
Local-development install flows (direnv + `pip install -e ./l4d2host
-e ./l4d2web`) are **unchanged**. Editable installs remain correct on
developer machines; only the production install model on the host
changes.
## Why
Two reasons, listed in priority order.
**Security.** The deployment-responsibility brainstorm wants to make
`left4me/deploy/files/` the live source of truth for systemd units,
drop-ins, sudoers, sysctl, and helpers, delivered by ckn-bw via
target-side symlinks (`/etc/foo` → `/opt/left4me/src/deploy/files/...`).
If the symlink target sits inside a left4me-writable directory, the
service can rewrite its own hardening drop-in and escape the sandbox
on next restart. Making `/opt/left4me/src/` root-owned closes that
hole at the filesystem layer, before symlinks ever come into the
picture. This is defense in depth, and it costs nothing the production
workflow actually relies on.
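The filesystem-layer argument can be checked mechanically. The sketch below assumes plain POSIX mode bits are the only access control in play (no ACLs, no capabilities) and reports whether a given uid could write to a path or to any of its ancestor directories. The helper names `can_write` and `writable_ancestors` are hypothetical and exist only for illustration.

```python
import os
import stat
from pathlib import Path


def can_write(st: os.stat_result, uid: int, gids: set[int]) -> bool:
    """Return True if uid/gids may write according to plain mode bits."""
    if uid == 0:
        return True  # root bypasses mode bits
    if st.st_uid == uid:
        # Owner class applies exclusively when the uid matches.
        return bool(st.st_mode & stat.S_IWUSR)
    if st.st_gid in gids:
        return bool(st.st_mode & stat.S_IWGRP)
    return bool(st.st_mode & stat.S_IWOTH)


def writable_ancestors(target: Path, stop: Path, uid: int, gids: set[int]) -> list[Path]:
    """List target and every ancestor up to `stop` that uid could write to.

    A writable parent directory is as dangerous as a writable file,
    because the file can be replaced by renaming over it.
    """
    hits = []
    current = target
    while True:
        if can_write(current.stat(), uid, gids):
            hits.append(current)
        if current == stop or current == current.parent:
            break
        current = current.parent
    return hits
```

For a symlink target under `/opt/left4me/src/deploy/files/`, an empty result for the `left4me` uid (walking up to `/opt/left4me/src`) is the property this change is after.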
**Operational honesty.** The only reason `/opt/left4me/src/` is
user-owned today is that `pip install -e` writes `.egg-info` into the
source tree. No production workflow ever edits files under
`/opt/left4me/src/` directly — code updates always come through
`git_deploy` + `pip_install`. Editable mode buys nothing on the host;
non-editable matches what the deploy actually does (rebuild + reinstall
wheel from new source).
## What changes — concretely
All edits are in `~/Projekte/ckn-bw/bundles/left4me/`.
### `items.py`
**Directory items** (`items.py:7-42`) — flip `/opt/left4me/src` to root:
```python
directories = {
    '/opt/left4me': {
        'owner': 'root',
        'group': 'root',
    },
    '/opt/left4me/src': {
        'owner': 'root',
        'group': 'root',
        # Was left4me:left4me before the non-editable install switch;
        # production now installs wheels, so the source tree is read-only
        # at runtime. Keeps left4me from being able to rewrite its own
        # hardening drop-ins / unit files (see deployment-responsibility
        # handoff for the full argument).
    },
    # /var/lib/left4me/* and /opt/left4me/{steam,.venv} stay left4me:left4me.
    # ... (remaining directory items unchanged)
}
```
**`left4me_pip_install` action** (`items.py:247-263`) — drop `-e`,
become triggered:
```python
actions['left4me_pip_install'] = {
    # Non-editable install: builds wheels from the checkout, installs
    # into the venv's site-packages. Source tree is no longer mutated by
    # pip, so /opt/left4me/src/ stays root:root with read-only access for
    # left4me at runtime.
    'command': 'sudo -u left4me /opt/left4me/.venv/bin/pip install /opt/left4me/src/l4d2host /opt/left4me/src/l4d2web',
    'triggered': True,  # was: ran every apply
    'cascade_skip': False,
    'needs': [
        'git_deploy:/opt/left4me/src',
        'action:left4me_create_venv',
        # action:left4me_chown_src removed (deleted below).
    ],
    'triggers': [
        'action:left4me_alembic_upgrade',
    ],
}
```
**`left4me_chown_src` action** (`items.py:207-219`) — **delete**.
The action exists to repair file ownership: each `git_deploy` extracts
the tarball as root, while the old model needed the tree owned by
left4me. With the new model, root is the target ownership, which is
what `git_deploy` already produces. The action becomes a no-op; remove it.
**`git_deploy` triggers** (`items.py:157-183`) — ensure
`action:left4me_pip_install` is in `triggers`. Currently triggers
`left4me_alembic_upgrade` and `install_left4me_scripts`; add
`left4me_pip_install` so that a fresh checkout always rebuilds the
wheel and reinstalls.
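As a rough sketch of the resulting item, assuming the bundle declares `git_deploy` as a plain dict and that both existing triggers are actions (the `action:` prefixes are an assumption, since the current item body is not quoted here). The `repo` and `rev` attributes are deliberately left out because this change does not touch them.

```python
# Sketch only: 'repo' and 'rev' are omitted; they stay as they are today.
git_deploy = {
    '/opt/left4me/src': {
        'triggers': [
            'action:left4me_alembic_upgrade',   # existing
            'action:install_left4me_scripts',   # existing
            'action:left4me_pip_install',       # NEW: rebuild + reinstall on fresh checkout
        ],
    },
}
```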
### `metadata.py`
No changes. The `systemd/units` reactor's `WorkingDirectory` and
timer `working_dir` still point at `/opt/left4me/src` — that path is
still readable as left4me regardless of ownership (it's
world-readable by default after `git_deploy` extracts as root).
### `README.md`
Line 48 mentions `pip install -e`. Update to reflect non-editable
production install and add a one-line note that local dev still uses
`-e`. Two lines of edits.
### `l4d2web.egg-info/`, `l4d2host.egg-info/` on the live host
These directories exist today inside `/opt/left4me/src/l4d2{host,web}/`
as a side-effect of editable installs. After the switch they become
stale (pip installs a fresh wheel into the venv; the in-source egg-info
is unused). Clean-up options:
- **Leave them**: harmless, ignored by Python. Eventually removed by
whoever next refactors the source layout.
- **One-shot remove on the live host**: `sudo find /opt/left4me/src
-name "*.egg-info" -type d -exec rm -rf {} +`. Cosmetic; do whatever.
Either's fine. Document the choice in the commit message.
## What does NOT change
- **`l4d2host/` and `l4d2web/` `pyproject.toml`** — both already declare
`[build-system] requires = ["setuptools>=68", "wheel"]` and use the
flat `package-dir = {l4d2host = "."}` layout. Non-editable install
works out of the box; no packaging edits needed.
- **`alembic.ini` + migrations** — alembic reads
`/opt/left4me/src/l4d2web/alembic/versions/*.py` at runtime. Root
ownership + world-readable means left4me can still read; no change.
- **`examples/script-overlays/`** — same; read-only access by left4me
at seed time.
- **`/opt/left4me/.venv/`** — stays left4me:left4me (pip writes here
during the install action, run as left4me via sudo).
- **`/opt/left4me/steam/`** — stays left4me:left4me (steamcmd
self-updates).
- **`/var/lib/left4me/`** and all subdirs — stays left4me:left4me
(application runtime state).
- **Local-dev install instructions** in `README.md`, `AGENTS.md`,
`l4d2web/README.md` — keep `-e`. Developer machines need editable.
- **`install_left4me_scripts` action** — already copies from src as
root, target paths under `/usr/local/{libexec,sbin}/`. Source can be
root-owned now (no change in behavior).
- **Hardening composition + every deployed unit / drop-in / sudoers /
sysctl file** — out of scope for this change. Those move in the
deployment-responsibility brainstorm, after this lands.
## Verification
Run on left4.me (the production host) after `bw apply`:
1. **Source ownership**:
```
stat -c '%U:%G %a %n' /opt/left4me/src /opt/left4me/.venv /opt/left4me/steam /var/lib/left4me
```
Expected: `/opt/left4me/src` → `root:root`; `.venv`, `steam`, and
`/var/lib/left4me` → `left4me:left4me`.
2. **Wheel installed, not editable**:
```
sudo -u left4me /opt/left4me/.venv/bin/pip show l4d2web l4d2host
```
Expected: `Location:` points inside
`/opt/left4me/.venv/lib/python*/site-packages/`, NOT inside
`/opt/left4me/src/`. (Editable installs report the source path as
`Location:`; non-editable reports site-packages.)
3. **App runs**:
```
systemctl status left4me-web.service
```
Active, recent logs clean.
4. **Alembic can still read migrations**:
```
sudo -u left4me sh -c 'cd /opt/left4me/src/l4d2web && /opt/left4me/.venv/bin/alembic current'
```
Returns the current head without errors.
5. **A gameserver starts**:
```
sudo /usr/local/libexec/left4me/left4me-systemctl start left4me-server@test
journalctl -u left4me-server@test -n 50
```
srcds_run starts cleanly. Stop it after verification.
6. **Idempotent `bw apply`**:
Run `bw apply left4.me` a second time. Should report zero changes —
no chown action drifting back, no pip install re-firing.
## Out of scope
- **The deployment-responsibility reshape itself.** That brainstorm
resumes after this prereq lands on left4.me. Do not touch
`deploy/files/`, hardening drop-ins, sudoers location, etc. — those
are the *next* session's work.
- **Removing the `bundles/left4me/files/etc/{sudoers.d,sysctl.d}/`
verbatim mirrors.** Same; that's the deployment-responsibility
session.
- **Moving `scripts/{libexec,sbin}/` into `deploy/scripts/`.** Same.
- **Reviewing whether the editable install pattern should change for
developer machines.** It should not — local dev wants editable for
fast iteration; only the host install model changes.
## Pointers
- **Deployment-responsibility brainstorm handoff** (the parent
context): `docs/superpowers/specs/2026-05-15-handoff-deployment-responsibility.md`
- **ckn-bw left4me bundle**:
`~/Projekte/ckn-bw/bundles/left4me/`
- `items.py:7-42` (directories)
- `items.py:157-183` (git_deploy)
- `items.py:207-219` (left4me_chown_src — delete)
- `items.py:247-263` (left4me_pip_install)
- `README.md:48` (docs update)
- **pyproject.toml layouts**:
`l4d2host/pyproject.toml`, `l4d2web/pyproject.toml`. Flat
`package-dir = {<pkg> = "."}` layout. Non-editable wheel build works
with this layout without further changes.
- **Hardening test plan** (motivates the security argument):
`docs/superpowers/specs/2026-05-15-hardening-test-plan.md`
- **Original deployment design** (the shape we're working toward):
`docs/superpowers/specs/2026-05-06-left4me-deployment-design.md`
## Commit messages (suggested)
ckn-bw side (the actual change):
```
refactor(left4me): non-editable install + root-owned /opt/left4me/src
Drop `pip install -e` for the production install; switch to wheel
install (`pip install /opt/left4me/src/l4d2{host,web}`). Source tree no
longer needs to be writable by left4me, so flip /opt/left4me/src to
root:root and delete the left4me_chown_src action.
Prereq for the deployment-responsibility reshape: makes target-side
symlinks from /etc/... into /opt/left4me/src/deploy/files/... safe by
construction (left4me cannot rewrite its own hardening profile).
Verified on left4.me: bw apply idempotent; pip show reports
site-packages location; web + gameserver units run clean.
```
left4me side (this handoff doc):
```
spec(noneditable-install): handoff for the install refactor prereq
Self-contained spec for the next agent to land the editable→
non-editable install switch and the root-ownership flip on
/opt/left4me/src. Prereq for the deployment-responsibility brainstorm.
```


@ -1,467 +0,0 @@
# Handoff — collapse venv chain into uv workspace + `uv sync`
## Status
**Executed (left4me side) — see
`docs/superpowers/plans/2026-05-15-uv-workspace-execution.md` for what
actually shipped and what diverged from the assumptions below.** Three
load-bearing assumptions in this doc turned out to be wrong (no
`pkg_apt: uv` on Trixie; existing layout incompatible with read-only
source builds via setuptools; no `git` on prod). The executed plan
records the corrections.
## Goal
Replace the current five-action venv chain in `bundles/left4me/items.py`
with a single `uv sync --frozen` action driven by a committed
`uv.lock` at the left4me repo root. Eliminate the tempdir-copy dance
in `pip_install` (8 lines of shell working around setuptools writing
`<pkg>.egg-info/` into a root-owned source tree).
Net change: 5 actions → 3 actions; deterministic deploys via locked
dep versions; single command in dev and prod; one new build-time
dependency (`uv`) on the host.
## Why
Three motivations, listed in priority order.
**Deterministic prod deploys.** Today's chain installs whatever pip
resolves at apply time. A transitive dep getting a CVE-relevant bump
between two `bw apply` runs is invisible until it breaks something.
`uv sync --frozen` against a committed `uv.lock` makes the installed
version set reproducible from git history alone.
**Lower cognitive cost in `items.py`.** The `pip_install` action is
the longest, gnarliest action in the bundle — it does its own
tempdir/cleanup-trap/cp-r dance because the obvious `pip install
/opt/left4me/src/...` would write egg-info to a root-owned source
tree. uv's `sdist-then-wheel-from-tarball` build path makes this
problem go away: the source is read-only throughout.
**Workspace declares what's actually true.** `l4d2web` already imports
from `l4d2host` (5 files use `from l4d2host.paths import ...`).
Today's setup happens to work because both packages get installed
side-by-side via `pip install -e ./l4d2host -e ./l4d2web`, but the
dependency relationship is implicit. A uv workspace makes it explicit
via `[tool.uv.sources] l4d2host = { workspace = true }`.
## Current state — the 5-action chain
(All in `~/Projekte/ckn-bw/bundles/left4me/items.py`, ~lines 285-425.)
```
git_deploy:/opt/left4me/src
├── triggers → left4me_pip_install
│   ├── needs ← left4me_create_venv (always-on, gated unless)
│   │   └── triggers → left4me_pip_upgrade
│   └── triggers → left4me_alembic_upgrade
│       ├── triggers → left4me_seed_overlays
│       └── triggers → svc_systemd:left4me-web.service:restart
├── triggers → left4me_alembic_upgrade (belt-and-braces direct trigger)
└── triggers → left4me_daemon_reload
```
`left4me_pip_install` body (the part that simplifies):
```sh
sudo -u left4me sh -c '
  set -e
  tmpdir=$(mktemp -d -t left4me-build-XXXXXX)
  trap "rm -rf \"$tmpdir\"" EXIT
  cp -r /opt/left4me/src/l4d2host /opt/left4me/src/l4d2web "$tmpdir/"
  /var/lib/left4me/.venv/bin/pip install --force-reinstall "$tmpdir/l4d2host" "$tmpdir/l4d2web"
'
```
## Target state — uv workspace + single sync action
Three actions instead of five:
```
git_deploy:/opt/left4me/src
├── triggers → left4me_uv_sync
│   └── triggers → left4me_alembic_upgrade
│       ├── triggers → left4me_seed_overlays
│       └── triggers → svc_systemd:left4me-web.service:restart
├── triggers → left4me_alembic_upgrade (belt-and-braces)
└── triggers → left4me_daemon_reload
```
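To see which items a fresh checkout ends up touching, the target graph can be walked as a small adjacency map. This is only a model of the diagram, assuming every triggered item runs successfully; it says nothing about the order in which the deploy tool actually executes them, and the `action:` prefixes are assumed.

```python
from collections import deque

# Adjacency map transcribed from the target-state diagram.
TRIGGERS = {
    "git_deploy:/opt/left4me/src": [
        "action:left4me_uv_sync",
        "action:left4me_alembic_upgrade",  # belt-and-braces direct trigger
        "action:left4me_daemon_reload",
    ],
    "action:left4me_uv_sync": ["action:left4me_alembic_upgrade"],
    "action:left4me_alembic_upgrade": [
        "action:left4me_seed_overlays",
        "svc_systemd:left4me-web.service:restart",
    ],
}


def cascade(start: str) -> list[str]:
    """Return every item reachable from `start` via triggers, each once."""
    seen = [start]
    queue = deque([start])
    while queue:
        item = queue.popleft()
        for nxt in TRIGGERS.get(item, []):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen[1:]  # drop the starting item itself
```

`cascade("git_deploy:/opt/left4me/src")` lists five items; `left4me_alembic_upgrade` appears once even though two edges point at it, which is why the belt-and-braces trigger is harmless.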
`left4me_uv_sync` body:
```python
actions['left4me_uv_sync'] = {
    'command': (
        'sudo -u left4me '
        'env UV_PROJECT_ENVIRONMENT=/var/lib/left4me/.venv '
        'uv sync --frozen --project /opt/left4me/src'
    ),
    'triggered': True,
    'cascade_skip': False,
    'needs': [
        'git_deploy:/opt/left4me/src',
        'pkg_apt:uv',
        'directory:/var/lib/left4me',
        'user:left4me',
    ],
    'triggers': [
        'action:left4me_alembic_upgrade',
    ],
}
```
`UV_PROJECT_ENVIRONMENT` redirects uv's default venv path (`<project>/.venv`)
to our writable runtime location at `/var/lib/left4me/.venv` (the source
at `/opt/left4me/src` is root-owned, so the default would be a permission
error).
`--frozen` requires `uv.lock` to be present and consistent with
`pyproject.toml` — refuses to silently update the lockfile during deploy.
## Empirical spike — do this FIRST
Before touching anything, verify the architectural assumption that
`uv` actually keeps a root-owned source directory pristine during
build. ~5 minute test on the live host:
```bash
ssh ckn@left4.me 'sudo apt-get install -y uv'
ssh ckn@left4.me '
  sudo -u left4me sh -c "
    wheels=\$(mktemp -d)
    uv build --wheel --sdist /opt/left4me/src/l4d2host --out-dir \$wheels
    ls \$wheels
    sudo git -C /opt/left4me/src status --porcelain
  "
'
```
Expected: the wheel + sdist exist in the tempdir, AND `git status`
reports the source tree clean (no new `l4d2host.egg-info/` directory).
If the source stays clean, proceed with the full migration.
If the source picks up `l4d2host.egg-info/` (uv's build invoked
setuptools.build_meta directly on the source instead of via the sdist
intermediate), fall back to **Medium scope**: keep the tempdir-copy
dance but use `uv pip install` in place of `pip install` (1:1 swap,
no workspace, smaller change). Update this handoff with the fallback
decision.
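The "source stays clean" criterion can also be checked without `git`, by scanning the tree for egg-info directories before and after the build. A minimal sketch, with the helper name `find_egg_info` being hypothetical:

```python
from pathlib import Path


def find_egg_info(source_root: Path) -> list[Path]:
    """Return every *.egg-info directory below source_root, sorted."""
    return sorted(p for p in source_root.rglob("*.egg-info") if p.is_dir())


if __name__ == "__main__":
    # Path taken from this handoff; run once before and once after the build.
    for hit in find_egg_info(Path("/opt/left4me/src")):
        print(hit)
```

If the list is the same before and after `uv build`, the build did not write into the source tree.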
## What changes — left4me side
### New: `/Users/mwiegand/Projekte/left4me/pyproject.toml`
Workspace root. Short:
```toml
[project]
name = "left4me"
version = "0.0.0"
description = "Workspace root; packaging lives in the members."
requires-python = ">=3.13"
[tool.uv.workspace]
members = ["l4d2host", "l4d2web"]
# Dev-only dependencies (pytest, etc.) for the workspace.
[dependency-groups]
dev = [
"pytest",
]
```
### Modified: `l4d2host/pyproject.toml` and `l4d2web/pyproject.toml`
No real change to declared deps. `l4d2web` adds the workspace cross-dep:
```toml
# l4d2web/pyproject.toml
[project]
dependencies = [
"Flask>=3.0",
"SQLAlchemy>=2.0",
"alembic>=1.13",
"PyYAML>=6.0",
"gunicorn>=22.0",
"requests>=2.31",
"l4d2host", # NEW: declares the import relationship
]
[tool.uv.sources]
l4d2host = { workspace = true } # NEW: resolves to the in-workspace member
```
This makes explicit what's already true: `l4d2web/routes/overlay_routes.py`,
`l4d2web/services/overlay_creation.py`, and three other files import
from `l4d2host.paths`.
### New: `/Users/mwiegand/Projekte/left4me/uv.lock`
Generated by `uv lock` at the repo root. Committed to git. Pins every
transitive dep version.
### Modified: `/Users/mwiegand/Projekte/left4me/.envrc`
Today:
```
layout python python3.13
```
New (direnv hands off to uv for venv management):
```
# direnv's stdlib uv helper creates .venv via `uv sync` and activates it.
# Equivalent to: uv sync && source .venv/bin/activate
use uv
```
If `use uv` isn't available in this direnv version (it's a stdlib
function added in direnv 2.34+), fall back to:
```
uv sync >/dev/null
source .venv/bin/activate
```
### Modified: `README.md`, `AGENTS.md`, `l4d2web/README.md`
Update install instructions from:
```
pip install -e ./l4d2host -e ./l4d2web pytest
```
to:
```
uv sync # creates .venv, installs members editable, installs dev deps
```
One-time prereq for developers: install uv (`brew install uv` on
macOS, `apt install uv` on Debian Trixie+, or curl-pipe-sh from
astral.sh for older distros).
### Modified: `.gitignore`
Probably no change needed. uv's caches default to `~/.cache/uv` (not
in-repo). The `.venv` is already ignored.
## What changes — ckn-bw side
All edits in `~/Projekte/ckn-bw/bundles/left4me/`.
### `metadata.py`
Add `uv` to `apt.packages`:
```python
'apt': {
'packages': {
...
'uv': {}, # Required by left4me_uv_sync for production install.
...
},
},
```
Drop `python3-pip` if nothing else needs it (uv replaces pip). Keep
`python3-venv` if anything else on the host uses `python3 -m venv`; if
not, drop it too. `python3` and `python3-dev` stay (uv invokes them).
### `items.py`
Delete three actions:
- `left4me_create_venv`
- `left4me_pip_upgrade`
- `left4me_pip_install`
Add one action: `left4me_uv_sync` (body in the "Target state" section
above).
Update `git_deploy:/opt/left4me/src` triggers:
- Remove: `action:left4me_pip_install`
- Add: `action:left4me_uv_sync`
- Keep: `action:left4me_alembic_upgrade`, `action:left4me_daemon_reload`
`alembic_upgrade` and `seed_overlays` are unchanged — they invoke the
venv's `alembic` and `flask` binaries by absolute path, which `uv sync`
ensures exist. Update their `needs:` lists to point at
`action:left4me_uv_sync` instead of `action:left4me_pip_install`.
### `README.md`
Update the bundle README's deploy-flow description to mention `uv sync`
instead of `pip install -e`, matching the new shape.
## Migration order
1. **Spike test** (above): confirm uv preserves source cleanliness.
If fails, retreat to Medium scope.
2. **left4me-side preparation** (independent PR, can land first):
- Add root `pyproject.toml`, declare workspace
- Add `l4d2host` to `l4d2web`'s deps + workspace source
- Run `uv lock`, commit `uv.lock`
- Update `.envrc`
- Update local-dev docs
- Run `uv sync` locally, run `pytest` — all green
- Commit + push
3. **ckn-bw-side install** (depends on step 2):
- Add `pkg_apt: uv` to bundle defaults
- Delete the three old actions, add `uv_sync`
- Update `git_deploy` triggers and downstream `needs:`
- `bw test` clean
4. **First apply to ovh.left4me**:
- Expect: `pkg_apt: uv` installed, three old actions removed from
the graph, new `uv_sync` action fires (because git_deploy fires
with the new commit), runs `uv sync --frozen` against the new
workspace, alembic_upgrade + seed_overlays + web restart cascade.
- The existing `/var/lib/left4me/.venv` (created by
`python3 -m venv`) is structurally a uv-compatible venv; uv
should adopt it without recreation. If uv refuses to adopt
(incompatible metadata), one-shot fix on the host:
```
sudo -u left4me rm -rf /var/lib/left4me/.venv
# bw apply will recreate via `uv sync`
```
5. **Idempotency check + verification matrix**:
- `bw apply` idempotent (`0 fixed, 0 failed`)
- `pip show l4d2{host,web}` reports the locked version
- Web service active, gameserver round-trip works
6. **Commit ckn-bw side, do not push** (operator pushes manually).
## What does NOT change
- **Source ownership**: `/opt/left4me/src` stays `root:root` (the
runtime-state relocation made it so; uv reads it as world-readable).
- **Venv location**: `/var/lib/left4me/.venv` stays where it is, owned
by `left4me`, accessed via `UV_PROJECT_ENVIRONMENT`.
- **Hardening drop-ins, sudoers, sysctl, helpers**: all stable from
the deployment-responsibility migration. uv migration is independent.
- **systemd unit shapes**: reactor-emitted, per-host parameters
unchanged.
- **`alembic_upgrade` and `seed_overlays`**: same shell, same
triggering, same binaries (just from a uv-managed venv).
- **`pkg_apt: python3` and `python3-dev`**: kept (uv shells out to
the system Python interpreter).
- **CI workflows**: no CI currently exists; nothing to update.
## Out of scope
- Merging `l4d2host` and `l4d2web` into a single package. They stay
as separate workspace members.
- Switching to a non-direnv-based dev flow. `direnv` + `use uv` stays.
- Migrating other ckn-bw bundles to uv. This is left4me-specific.
- Pinning the host's `uv` version below the apt-current. If lockfile
format issues surface, address as a follow-up (e.g., apt-pin or
switch to astral.sh-installed uv).
## Risks
1. **Spike test failure**: uv build isn't actually source-clean → falls
back to Medium scope. Captured above; this is a graceful degradation.
2. **Lockfile format skew**: dev's brew-installed uv (latest) ahead of
prod's apt-installed uv (Trixie's version) → lockfile produced in
dev rejected in prod. Mitigation: stick to features supported by
the apt-installed version; if needed, switch prod to a pinned
astral.sh install.
3. **`alembic` invocation path**: today the action calls
`/var/lib/left4me/.venv/bin/alembic`. After uv sync, this path
should still exist (uv installs the same console_scripts entrypoint
as pip). Verify in step 4.
4. **direnv `use uv` availability**: `use uv` was added to direnv's
stdlib relatively recently. If the dev's direnv is older, use the
fallback `.envrc` snippet (`uv sync >/dev/null && source .venv/bin/activate`).
5. **`--force-reinstall` semantics gone**: today's chain uses
`pip install --force-reinstall` to work around the static
`0.1.0` version in pyproject.toml — without it pip would skip on
no-op. `uv sync --frozen` is version-aware via the lockfile, not
the package version string, so this concern goes away.
## Verification (end-to-end)
After ckn-bw apply:
1. **Source still clean**:
```
ssh ckn@left4.me 'sudo git -C /opt/left4me/src status --porcelain'
```
Empty output.
2. **Venv has the workspace members installed**:
```
ssh ckn@left4.me 'sudo -u left4me /var/lib/left4me/.venv/bin/python -c "import l4d2host; import l4d2web; print(l4d2host.__file__, l4d2web.__file__)"'
```
Both paths point inside `/var/lib/left4me/.venv/lib/python3.13/site-packages/`.
3. **Pinned versions match the lockfile**:
```
ssh ckn@left4.me 'sudo -u left4me /var/lib/left4me/.venv/bin/pip show flask | grep Version'
```
Matches the Flask version in `uv.lock`.
4. **Web service health**:
```
ssh ckn@left4.me 'sudo systemctl is-active left4me-web.service'
```
`active`.
5. **Idempotent apply**:
```
(cd ~/Projekte/ckn-bw && .venv/bin/bw apply ovh.left4me)
```
`0 fixed, 0 failed`.
6. **Gameserver round-trip**: start a verify instance via
`left4me-systemctl enable verify`, check journal for clean
srcds_run startup behaviour (modulo any missing instance dir),
disable.
## Pointers
- Deployment-responsibility design (just shipped; the venv chain it
did NOT touch is what this handoff replaces):
`docs/superpowers/specs/2026-05-15-deployment-responsibility-design.md`
- Runtime state relocation (made `/opt/left4me/src` root-owned, which
is why the current `pip_install` needs the tempdir dance):
`docs/superpowers/specs/2026-05-15-runtime-state-relocation-design.md`
- ckn-bw left4me bundle:
`~/Projekte/ckn-bw/bundles/left4me/`
- `items.py:285-306`: `git_deploy` triggers
- `items.py:328-340`: `left4me_create_venv`
- `items.py:342-352`: `left4me_pip_upgrade`
- `items.py:354-382`: `left4me_pip_install` (the tempdir dance)
- `items.py:384-407`: `left4me_alembic_upgrade`
- `items.py:409-424`: `left4me_seed_overlays`
- uv docs: https://docs.astral.sh/uv/ — workspace, `uv sync`,
`UV_PROJECT_ENVIRONMENT`.
## Commit messages (suggested)
left4me side (root pyproject + lockfile + member deps + .envrc + docs):
```
refactor(repo): uv workspace + lockfile
Declare the repo as a uv workspace with l4d2host and l4d2web as
members. Add uv.lock for deterministic dep resolution. l4d2web now
declares its cross-dep on l4d2host explicitly via tool.uv.sources.
Local-dev install switches from `pip install -e ./l4d2host -e ./l4d2web`
to `uv sync` (creates venv, installs members editable, installs dev
deps from one source). .envrc uses direnv's `use uv` helper.
Prereq for the ckn-bw bundle uv-sync action (handoff:
docs/superpowers/specs/2026-05-15-handoff-uv-workspace.md).
```
ckn-bw side (drop chain, install uv, single sync action):
```
refactor(left4me): collapse venv chain into uv sync
Replace left4me_create_venv + left4me_pip_upgrade + left4me_pip_install
(the tempdir-copy dance) with a single left4me_uv_sync action driven
by left4me's committed uv.lock. Deterministic dep versions, no source
mutation during build, three actions instead of five.
pkg_apt: uv added. python3-pip removed (uv replaces it).
Per docs/superpowers/specs/2026-05-15-handoff-uv-workspace.md (in the
left4me repo).
```

Some files were not shown because too many files have changed in this diff