left4me/docs/superpowers/specs/2026-05-08-l4d2-script-overlays-design.md
mwiegand 78ead0b41d
docs(specs): script overlay type — design + implementation plan
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:27:14 +02:00

# L4D2 Script Overlays Design
**Goal:** Add a single new overlay type, `script`, that lets users author arbitrary build recipes as bash and runs them inside a `bubblewrap` + `systemd-run --scope` sandbox. The new type subsumes the existing `l4d2center_maps` and `cedapug_maps` managed-globals overlay types, both of which are removed in the same change. After this work the overlay type list is exactly `workshop` (unchanged) and `script` (new).
**Approval status:** User-approved design direction. Implementation proceeds in lockstep with the companion plan at `docs/superpowers/plans/2026-05-08-l4d2-script-overlays.md`.
## Context
`left4me` users today have two ways to add content to a server: workshop overlays (rich UI for Steam Workshop items via `WorkshopBuilder`) and a pair of managed global-map overlay types (`l4d2center_maps`, `cedapug_maps`) with bespoke parsers, per-item DB rows, ETag-based change detection, and a daily refresh timer. They cannot author arbitrary build recipes.
The user's previous setup at `ckn-bw/bundles/left4dead2/files/scripts/overlays/` expressed every recipe as a small bash file: `competitive_rework` (GitHub tarball download), `tickrate` (inline `server.cfg` + addon DLL fetch), `standard` (workshop items + admin-list write), `workshop_maps` (workshop collection import), `l4d2center_maps` (CSV-driven map sync). All five fit naturally into a single "run a sandboxed bash script that populates the overlay dir" model.
The two managed global-map types in the current codebase are over-engineered for what they do — each is essentially "fetch a manifest, download archives, extract VPKs, place in `addons/`." Folding them into the new `script` type eliminates three database tables, two source-parser modules, the `GlobalMapOverlayBuilder`, the `py7zr` dependency, the global-overlay cache root, and the managed-singleton machinery, while letting an admin paste the equivalent shell code (which the user already wrote years ago) into a normal admin-owned, system-wide script overlay.
The trust model for the sandbox is "semi-public deployment, registered users." The threat surface is one user reading another user's overlay, the application DB, or arbitrary host secrets, plus runaway scripts exhausting disk/CPU/RAM. Network access is *not* restricted — scripts must be able to download from arbitrary URLs (GitHub, l4d2center, Steam CDN). Sandbox boundaries are namespace-based (mount, PID, IPC, UTS, cgroup), not command-allowlist-based; binary-allowlist sandboxing of bash is theatre because of `eval` and `exec`.
The test deploy DB is wiped as part of rollout; no data migration is performed. Existing user blueprints that reference `l4d2center_maps` or `cedapug_maps` overlay rows do not survive the change in the test environment.
A scheduled-refresh feature (the daily timer that today drives the global-map types) is intentionally **out of scope for this iteration**. The two existing systemd units and the `flask refresh-global-overlays` CLI command are deleted with no replacement. Refresh is reintroduced in a later iteration designed against concrete needs.
## Locked Decisions
1. **Single new overlay type: `script`.** Replaces both managed-globals types. Final type list: `workshop` + `script`. No `tarball`/`inline`/`manual` types — all of those collapse into `script` (with UI templates as a future ergonomics improvement).
2. **`Overlay.script` is a DB `TEXT` column** holding the raw bash. No file storage, no revision history in v1. Empty string for `workshop` rows.
3. **Build idempotency contract: script runs against the existing overlay dir.** No automatic wipe between builds. Users write `test -f … || curl …`-style guards if they want bandwidth efficiency. A manual "Wipe overlay" button on the detail page resets the dir to empty.
4. **No left4me-aware helpers in the sandbox.** The script sees pure bash plus whatever's in `/usr` (RO bind-mount of the host). Workshop items are not exposed via a helper — users wanting workshop content create a `workshop`-type overlay, which has its own first-class UX (thumbnails, collection paste, dedup cache, refresh).
5. **Sandbox engine: `bubblewrap` (`bwrap`) inside `systemd-run --scope --collect`.** `systemd-run` provides cgroup v2 limits and the walltime kill via `RuntimeMaxSec`; `bwrap` provides the namespace isolation. Both are stable, well-audited, and packaged in Debian.
6. **Resource limits (system-wide, not per-overlay):** 1 hour walltime (`RuntimeMaxSec=3600`), 4 GB RAM (`MemoryMax=4G`, `MemorySwapMax=0`), 512 tasks, 200% CPU quota, post-build 20 GB disk cap on `du -sb` of the overlay dir.
7. **Network: host-shared.** No `--unshare-net`. Scripts have full outbound. Egress filtering is not in v1; the sandbox prevents reading internal state but does not prevent talking to internal IPs. Acceptable for the current trust model.
8. **No auto-seeding of "default" overlays.** Admin manually creates the equivalents of the old `l4d2center-maps`/`cedapug-maps` post-deploy by pasting the bash. The deploy script does not insert overlay rows.
9. **Daily/scheduled refresh: out of scope for this iteration.** No `auto_refresh` flag, no timer, no CLI command. Manual rebuild via the detail-page button is the only build trigger after this change.
10. **Permissions mirror workshop overlays.** Any logged-in user can create a private (`user_id = me`) script overlay. Admin can create system-wide (`user_id = NULL`). Owner or admin can edit/delete.
11. **Failure semantics via `Overlay.last_build_status`** (`'' | 'ok' | 'failed'`). Drives a "rebuild required" badge on the list and detail pages. Server initialization does **not** auto-block on `failed` (matches workshop's current behavior).
12. **Wipe is just another sandbox invocation.** The wipe endpoint runs the literal script `find /overlay -mindepth 1 -delete` through the same `left4me-script-sandbox` helper. No second helper, no privilege/UID puzzle (files are owned by `l4d2-sandbox`, who runs the wipe). After a successful wipe, `last_build_status` is reset to `''`. Wipe does **not** auto-enqueue a rebuild — the user decides.
13. **Privileged helper: `/usr/local/libexec/left4me/left4me-script-sandbox`.** Same pattern as the existing `left4me-overlay`, `left4me-systemctl`, `left4me-journalctl` helpers. Bash, owned root, mode 0755. The web user invokes it via `sudo -n` per a sudoers fragment. Root is needed to set up the namespaces; bwrap drops to the unprivileged `l4d2-sandbox` UID immediately.
14. **Dedicated sandbox UID `l4d2-sandbox`** (system user, `/usr/sbin/nologin`, no home). Owns nothing on the host outside what bwrap binds in. UID-drop happens inside the bwrap invocation via `--uid`/`--gid`.
15. **Strict argument validation in the helper.** Overlay id matches `^[0-9]+$`; overlay dir must exist under `/var/lib/left4me/overlays/`; script path must exist. Defense in depth — the real authorization check lives in the web app.
16. **Streaming I/O via the existing `run_with_streamed_output` helper.** Same plumbing `WorkshopBuilder` already uses for `steamcmd`/`curl` invocations. No new SSE/log path.
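
The checks in decision 15 are small enough to restate outside bash. The following is a minimal Python sketch of the same three rules, for example as a pre-flight check in the web app before it calls `sudo`. The function name `validate_sandbox_args` and the `overlays_root` parameter are assumptions made for illustration; they are not part of the existing codebase.

```python
import re
from pathlib import Path

# Same pattern the helper enforces: one or more ASCII digits, nothing else.
OVERLAY_ID_RE = re.compile(r"[0-9]+")


def validate_sandbox_args(overlay_id: str, script_path: str, overlays_root: Path) -> Path:
    """Return the overlay directory, or raise ValueError if any check fails.

    Hypothetical helper mirroring the bash checks; names are illustrative.
    """
    if not OVERLAY_ID_RE.fullmatch(overlay_id):
        raise ValueError("bad overlay id")
    overlay_dir = overlays_root / overlay_id
    if not overlay_dir.is_dir():
        raise ValueError("no overlay dir")
    if not Path(script_path).is_file():
        raise ValueError("no script")
    return overlay_dir
```

Because the id is restricted to digits before it is joined onto the root, values such as `../7` never reach the filesystem checks.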
## Architecture
```text
Overlay row (type=script, script=TEXT, last_build_status)
▼ build_overlay(overlay_id) job
▼ BUILDERS["script"].build(overlay, on_stdout, on_stderr, should_cancel)
▼ ScriptBuilder writes overlay.script → tmpfile, then:
│ sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <tmpfile>
▼ Helper validates args, then exec()s:
│ systemd-run --scope --collect
│ -p MemoryMax=4G -p MemorySwapMax=0
│ -p TasksMax=512 -p CPUQuota=200%
│ -p RuntimeMaxSec=3600
│ -- bwrap [namespace flags...] /bin/bash /script.sh
▼ Inside the sandbox the script sees:
│ /overlay ← /var/lib/left4me/overlays/{id} RW (the build target)
│ /tmp,/run ← fresh tmpfs RW (ephemeral)
│ /usr,/lib,/lib64,/etc/{ssl,ca-certificates,resolv.conf,nsswitch.conf} RO (host-curated)
│ /proc,/dev ← fresh
│ network ← shared with host
│ UID/GID ← l4d2-sandbox (no_new_privs implicit in bwrap)
▼ stdout/stderr → run_with_streamed_output → existing job-log SSE stream
▼ After exit:
│ exit 0 ∧ du -sb /overlay ≤ 20 GB → last_build_status='ok'
│ any other outcome → last_build_status='failed'
```
The host library (`l4d2host`) is unchanged. The `KernelOverlayFSMounter` already mounts whatever's at `overlays/{id}/` regardless of how it got there. The Job model and worker model are essentially unchanged — `script` is just another overlay type for the same `build_overlay` operation that today supports `workshop`.
```python
BUILDERS = {
    "workshop": WorkshopBuilder(),
    "script": ScriptBuilder(),
}
```
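
To show how a registry like this is typically consumed, here is a self-contained sketch of the dispatch step. The `build(overlay, *, on_stdout, on_stderr, should_cancel)` signature comes from this spec; the `Overlay` dataclass, `EchoBuilder`, and the `build_overlay` function body are stand-ins invented for the example and do not reflect the real job code.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Overlay:
    """Hypothetical stand-in for the ORM row."""
    id: int
    type: str
    script: str = ""


class OverlayBuilder(Protocol):
    def build(self, overlay: Overlay, *, on_stdout: Callable[[str], None],
              on_stderr: Callable[[str], None],
              should_cancel: Callable[[], bool]) -> None: ...


class EchoBuilder:
    """Toy builder that only reports what it would do."""

    def __init__(self, label: str) -> None:
        self.label = label

    def build(self, overlay, *, on_stdout, on_stderr, should_cancel):
        if should_cancel():
            on_stderr("cancelled")
            return
        on_stdout(f"{self.label}: building overlay {overlay.id}")


# Toy registry with the same keys as the real one.
BUILDERS = {"workshop": EchoBuilder("workshop"), "script": EchoBuilder("script")}


def build_overlay(overlay, *, on_stdout, on_stderr, should_cancel):
    """Look up the builder by overlay type and delegate to it."""
    try:
        builder = BUILDERS[overlay.type]
    except KeyError:
        raise ValueError(f"unknown overlay type: {overlay.type!r}") from None
    builder.build(overlay, on_stdout=on_stdout, on_stderr=on_stderr,
                  should_cancel=should_cancel)
```

With the managed-globals types removed from the registry, a leftover row of type `l4d2center_maps` would fail at the lookup rather than silently doing nothing.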
## Data Model
### `Overlay` (modified)
```text
id                 INTEGER PK AUTOINCREMENT
name               VARCHAR(255) NOT NULL
path               VARCHAR(255) NOT NULL              -- str(id) for new rows
type               VARCHAR(16) NOT NULL               -- 'workshop' | 'script'
user_id            INTEGER NULL REFERENCES users(id)  -- NULL = system-wide
script             TEXT NOT NULL DEFAULT ''           -- new; meaningful for type='script'
last_build_status  VARCHAR(16) NOT NULL DEFAULT ''    -- new; '' | 'ok' | 'failed'
created_at, updated_at

UNIQUE INDEX on (name) WHERE user_id IS NULL
UNIQUE INDEX on (name, user_id) WHERE user_id IS NOT NULL
INDEX on (type, user_id)
```
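
The two partial unique indexes exist because SQLite treats `NULL` values as distinct in a unique index, so a single `UNIQUE (name, user_id)` would not stop two system-wide overlays from sharing a name. The sketch below demonstrates the behaviour with an in-memory database; the index names and the trimmed-down column list are assumptions for the example.

```python
import sqlite3

# Trimmed-down schema; index names are illustrative.
SCHEMA = """
CREATE TABLE overlays (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    name    VARCHAR(255) NOT NULL,
    type    VARCHAR(16)  NOT NULL,
    user_id INTEGER NULL
);
CREATE UNIQUE INDEX ux_overlays_system_name
    ON overlays (name) WHERE user_id IS NULL;
CREATE UNIQUE INDEX ux_overlays_user_name
    ON overlays (name, user_id) WHERE user_id IS NOT NULL;
"""


def try_insert(db: sqlite3.Connection, name: str, user_id) -> bool:
    """Insert a script overlay; return False if a unique index rejects it."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO overlays (name, type, user_id) VALUES (?, 'script', ?)",
                (name, user_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
```

A system-wide overlay and a private overlay may share a name, and two different users may each own an overlay with the same name; only duplicates within the same scope are rejected.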
### Tables removed
- `global_overlay_item_files`
- `global_overlay_items`
- `global_overlay_sources`
Drop order matters for the SQLite migration: drop `_item_files` first (FK to `_items`), then `_items` (FK to `_sources`), then `_sources` (FK to `overlays`).
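
A child-first drop can be sketched as follows. With foreign keys enforced, SQLite's `DROP TABLE` performs an implicit `DELETE` first, which is why a parent table that still has referencing rows should go last. Only the table names and the FK direction come from this spec; the column names (`overlay_id`, `source_id`, `item_id`) are placeholders.

```python
import sqlite3

# Child-first order, as stated in the spec.
DROP_ORDER = [
    "global_overlay_item_files",
    "global_overlay_items",
    "global_overlay_sources",
]


def drop_global_tables(db: sqlite3.Connection) -> None:
    """Drop the managed-globals tables, children before parents."""
    for table in DROP_ORDER:
        db.execute(f"DROP TABLE IF EXISTS {table}")
    db.commit()


# Demo schema with placeholder columns and one row per table.
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
CREATE TABLE overlays (id INTEGER PRIMARY KEY);
CREATE TABLE global_overlay_sources (
    id INTEGER PRIMARY KEY,
    overlay_id INTEGER NOT NULL REFERENCES overlays(id));
CREATE TABLE global_overlay_items (
    id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES global_overlay_sources(id));
CREATE TABLE global_overlay_item_files (
    id INTEGER PRIMARY KEY,
    item_id INTEGER NOT NULL REFERENCES global_overlay_items(id));
INSERT INTO overlays VALUES (1);
INSERT INTO global_overlay_sources VALUES (1, 1);
INSERT INTO global_overlay_items VALUES (1, 1);
INSERT INTO global_overlay_item_files VALUES (1, 1);
""")
drop_global_tables(db)
```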
### Unchanged
`WorkshopItem`, `overlay_workshop_items`, `Job` (including `Job.overlay_id` and nullable `Job.user_id`), `Server`, `Blueprint`, etc.
## Filesystem Layout
```text
${LEFT4ME_ROOT}/
  overlays/
    {overlay_id}/                  # script writes here; mounted by host
      left4dead2/...               # whatever the script produces
  workshop_cache/{steam_id}.vpk    # workshop type only, unchanged
  # removed:
  # global_overlay_cache/          # was used by managed-globals types
```
Single tree per overlay. No per-overlay scratch cache (the chosen idempotency model is "script runs against existing dir," so any caching the user wants lives inside the overlay dir and is preserved between builds).
The sandbox bind-mounts `${LEFT4ME_ROOT}/overlays/{id}/` to `/overlay` (RW). Nothing else under `${LEFT4ME_ROOT}` is visible inside the sandbox.
## Sandbox
### Helper script
`deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`, mode 0755, owned root:
```bash
#!/bin/bash
# args: <overlay_id> <script_path>
set -euo pipefail

[[ $# -eq 2 ]] || { echo "usage: $0 <overlay_id> <script>" >&2; exit 64; }
OVERLAY_ID=$1; SCRIPT=$2
[[ "$OVERLAY_ID" =~ ^[0-9]+$ ]] || { echo "bad overlay id" >&2; exit 64; }

OVERLAY_DIR=/var/lib/left4me/overlays/$OVERLAY_ID
[[ -d $OVERLAY_DIR ]] || { echo "no overlay dir" >&2; exit 65; }
[[ -f $SCRIPT ]] || { echo "no script" >&2; exit 65; }

SBX_UID=$(id -u l4d2-sandbox); SBX_GID=$(id -g l4d2-sandbox)

exec systemd-run --quiet --scope --collect \
  -p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 \
  -p CPUQuota=200% -p RuntimeMaxSec=3600 \
  -- bwrap \
    --die-with-parent --new-session \
    --unshare-pid --unshare-ipc --unshare-uts --unshare-cgroup \
    --uid "$SBX_UID" --gid "$SBX_GID" \
    --proc /proc --dev /dev --tmpfs /tmp --tmpfs /run \
    --ro-bind /usr /usr --ro-bind /lib /lib --ro-bind /lib64 /lib64 \
    --symlink usr/bin /bin --symlink usr/sbin /sbin \
    --ro-bind /etc/resolv.conf /etc/resolv.conf \
    --ro-bind /etc/ssl /etc/ssl \
    --ro-bind /etc/ca-certificates /etc/ca-certificates \
    --ro-bind /etc/nsswitch.conf /etc/nsswitch.conf \
    --bind "$OVERLAY_DIR" /overlay \
    --chdir /overlay \
    --setenv HOME /tmp --setenv PATH /usr/bin:/usr/sbin \
    --setenv OVERLAY /overlay \
    --ro-bind "$SCRIPT" /script.sh \
    /bin/bash /script.sh
```
Network is *not* unshared (no `--unshare-net`); the sandbox shares the host network namespace. Every transient unit is visible via `systemctl list-units --type=scope` while running and journaled afterward (`journalctl --user-unit=run-…scope` or system journal depending on invocation).
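
For readers more used to the Python side of the codebase, the mapping from the locked resource limits to `systemd-run -p` properties can be written out as a small sketch. This is purely illustrative: the spec keeps the limits hard-coded in the bash helper, and `SandboxLimits` and `systemd_run_prefix` are hypothetical names that exist nowhere in the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandboxLimits:
    """System-wide limits from Locked Decision 6 (hypothetical container)."""
    memory_max: str = "4G"
    memory_swap_max: str = "0"
    tasks_max: int = 512
    cpu_quota: str = "200%"
    runtime_max_sec: int = 3600


def systemd_run_prefix(limits: SandboxLimits) -> list[str]:
    """Build the argv prefix that precedes the bwrap command."""
    props = {
        "MemoryMax": limits.memory_max,
        "MemorySwapMax": limits.memory_swap_max,
        "TasksMax": limits.tasks_max,
        "CPUQuota": limits.cpu_quota,
        "RuntimeMaxSec": limits.runtime_max_sec,
    }
    argv = ["systemd-run", "--quiet", "--scope", "--collect"]
    for key, value in props.items():
        argv += ["-p", f"{key}={value}"]
    argv.append("--")  # everything after this is the sandboxed command
    return argv
```

Building the command as a list rather than a shell string keeps each property a separate argument, so no quoting is needed for values such as `200%`.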
### Sudoers fragment
Append to `deploy/files/etc/sudoers.d/left4me`:
```
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox
```
### System user
Provisioned in `deploy/deploy-test-server.sh`:
```bash
useradd --system --no-create-home --shell /usr/sbin/nologin l4d2-sandbox
apt-get install -y bubblewrap
```
## Build Lifecycle
`ScriptBuilder` lives in `l4d2web/services/overlay_builders.py` next to `WorkshopBuilder`:
```python
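import os
import subprocess
import tempfile

# run_with_streamed_output, overlay_path and BuildError are existing project
# helpers; their imports are omitted here.


class ScriptBuilder:
    def build(self, overlay, *, on_stdout, on_stderr, should_cancel):
        # Write the script body to a temp file the sandbox helper can bind-mount.
        with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
            f.write(overlay.script or "")
            script_path = f.name
        try:
            # Assumption: the temp file is created 0600 and bash reads it inside
            # the sandbox as l4d2-sandbox, so it must be readable by that UID.
            os.chmod(script_path, 0o644)
            cmd = [
                "sudo", "-n",
                "/usr/local/libexec/left4me/left4me-script-sandbox",
                str(overlay.id), script_path,
            ]
            run_with_streamed_output(cmd, on_stdout, on_stderr, should_cancel)
            self._enforce_disk_budget(overlay.id, on_stderr)
        finally:
            os.unlink(script_path)

    def _enforce_disk_budget(self, overlay_id, on_stderr):
        # du -sb prints "<bytes>\t<path>"; compare the byte count to the cap.
        size = subprocess.check_output(["du", "-sb", overlay_path(overlay_id)])
        if int(size.split()[0]) > 20 * 1024**3:
            on_stderr("overlay exceeded 20 GB disk cap")
            raise BuildError("disk-cap-exceeded")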
```
`run_with_streamed_output` is the existing helper used by `WorkshopBuilder` for `steamcmd`/`curl` invocations. The `should_cancel` callback fires `kill -TERM` on the sudo-`systemd-run` process tree; cgroup-collect tears down the whole scope on exit.
The job worker's existing job-completion path writes `Overlay.last_build_status = 'ok'` on success and `'failed'` on any non-zero exit / `BuildError` / cancel. This is a single column update inside the existing transaction; no new infrastructure.
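
The outcome rule from the architecture diagram (exit 0 and overlay size within 20 GB gives `'ok'`, anything else gives `'failed'`) can be expressed as a pure function, which makes it easy to unit-test without a sandbox. The sketch below is an assumption-laden illustration: `derive_build_status` and `tree_size_bytes` are hypothetical names, and the pure-Python size walk only approximates `du -sb` (it ignores directory entries and does not de-duplicate hard links).

```python
import os

DISK_CAP_BYTES = 20 * 1024**3  # 20 GB cap from Locked Decision 6


def tree_size_bytes(root: str) -> int:
    """Sum the apparent size of all files under root (approximation of du -sb)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # lstat so symlinks count as the link itself, not the target.
            total += os.lstat(os.path.join(dirpath, name)).st_size
    return total


def derive_build_status(exit_code: int, overlay_bytes: int,
                        cancelled: bool = False,
                        cap: int = DISK_CAP_BYTES) -> str:
    """Map a finished build to the value stored in last_build_status."""
    if cancelled or exit_code != 0 or overlay_bytes > cap:
        return "failed"
    return "ok"
```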
## UI
### Create modal (`templates/overlays.html`)
The existing modal grows one option in the type radio: `Workshop | Script`. Name field unchanged. After insert, the web app generates `path = str(overlay_id)` for new rows (existing pattern).
### Detail page when `type='script'` (`templates/overlay_detail.html`)
- Plain styled `<textarea>` for `overlay.script` with a Save button → `POST /overlays/{id}/script`. No CodeMirror dependency in v1 (out of scope; keep frontend dep-light).
- "Rebuild" button → `POST /overlays/{id}/build`. Existing pattern from workshop overlays.
- "Wipe overlay" button (red, confirm-modal) → `POST /overlays/{id}/wipe`.
- `last_build_status` indicator badge: empty / "ok" / "failed".
- Live build log via existing SSE plumbing on the related Job row.
### Detail page when `type='workshop'`: unchanged.
### Sections removed
The global-source detail block (`overlay_detail.html` lines 34-46) is deleted along with the managed-globals subsystem.
## Routes
`l4d2web/routes/overlay_routes.py` adds:
| Method | Path | Purpose |
|---|---|---|
| POST | `/overlays/{id}/script` | Update `script` text. Auto-enqueue coalesced `build_overlay` job. |
| POST | `/overlays/{id}/wipe` | Invoke `left4me-script-sandbox` with the literal script `find /overlay -mindepth 1 -delete`. Owner/admin only. Refuses if a `build_overlay` for this overlay is running. After success, set `last_build_status=''`. Does not auto-enqueue a rebuild. |
| POST | `/overlays/{id}/build` | Manual rebuild — same pattern as today's workshop overlay manual rebuild. |
Existing `POST /overlays` accepts `type=script` and an optional initial `script` body.
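
The wipe endpoint's preconditions and its effect on `last_build_status` can be sketched independently of the web framework. The wipe script text and the helper path are taken from this spec; `plan_wipe`, `status_after_wipe`, and `WipeRefused` are hypothetical names, and leaving the status untouched after a failed wipe is an assumption, since the spec only defines the success case.

```python
WIPE_SCRIPT = "find /overlay -mindepth 1 -delete\n"
SANDBOX_HELPER = "/usr/local/libexec/left4me/left4me-script-sandbox"


class WipeRefused(Exception):
    """Raised when the wipe must not run (hypothetical exception type)."""


def plan_wipe(overlay_id: int, *, may_edit: bool, build_running: bool,
              script_path: str) -> list[str]:
    """Return the helper argv for a wipe, or raise WipeRefused.

    script_path is a temp file the caller has already filled with WIPE_SCRIPT.
    """
    if not may_edit:
        raise WipeRefused("owner or admin only")
    if build_running:
        raise WipeRefused("a build_overlay job is running for this overlay")
    return ["sudo", "-n", SANDBOX_HELPER, str(overlay_id), script_path]


def status_after_wipe(exit_code: int, previous_status: str) -> str:
    """Reset the status on success; keep it otherwise (assumption)."""
    return "" if exit_code == 0 else previous_status
```

Note that nothing here enqueues a rebuild: after a successful wipe the overlay is simply empty with a blank status until the user rebuilds.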
## Permissions
| Action | Who |
|---|---|
| Create script overlay (private, `user_id = me`) | Any authenticated user |
| Create script overlay (system-wide, `user_id = NULL`) | Admin |
| Edit (script body, name) | Owner or admin |
| Wipe / Rebuild | Owner or admin |
| Delete | Owner or admin |
| View | Owner, admin, or any user when `user_id IS NULL` |
These match the existing rules for workshop overlays.
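
The permission table reduces to three predicates. The following sketch encodes them with plain dataclasses; `Actor`, `OverlayRef`, and the function names are invented for the example and are not the project's actual authorization API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Actor:
    """Hypothetical stand-in for the logged-in user."""
    id: int
    is_admin: bool = False


@dataclass(frozen=True)
class OverlayRef:
    """Only the ownership column matters here; None means system-wide."""
    user_id: Optional[int]


def can_create(actor: Actor, *, system_wide: bool) -> bool:
    """Any authenticated user may create private overlays; admins may create system-wide ones."""
    return actor.is_admin if system_wide else True


def can_modify(actor: Actor, overlay: OverlayRef) -> bool:
    """Edit, wipe, rebuild, delete: owner or admin."""
    return actor.is_admin or overlay.user_id == actor.id


def can_view(actor: Actor, overlay: OverlayRef) -> bool:
    """Owner, admin, or anyone when the overlay is system-wide."""
    return overlay.user_id is None or can_modify(actor, overlay)
```

A system-wide overlay has no owner (`user_id` is `NULL`), so for non-admins `can_modify` is false while `can_view` is true.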
## Job Worker / Scheduler
`services/job_worker.py` drops `"refresh_global_overlays"` from `GLOBAL_OPERATIONS` and removes the corresponding `refresh_global_overlays_running` and `blocked_servers_by_overlay` plumbing that exists only for the global-maps subsystem. The remaining mutex rules already cover:
- `build_overlay` per overlay (one running build per overlay).
- `install` and `refresh_workshop_items` as global mutexes.
- Server start/init blocks if any `build_overlay` for an overlay in the server's blueprint is running.
No new rules are needed for `script` — its build is mechanically identical to a `workshop` build from the scheduler's perspective.
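
Two of the rules above, the per-overlay build mutex and the server-start block, can be illustrated with a short sketch. `RunningJob` and both function names are hypothetical, and the global mutexes for `install` and `refresh_workshop_items` are deliberately not modelled because their exact semantics are not described in this spec.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RunningJob:
    """Hypothetical view of a running Job row."""
    operation: str
    overlay_id: Optional[int] = None


def can_start_build(overlay_id: int, running: Iterable[RunningJob]) -> bool:
    """At most one running build_overlay per overlay."""
    return not any(
        job.operation == "build_overlay" and job.overlay_id == overlay_id
        for job in running
    )


def server_start_blocked(blueprint_overlay_ids: Iterable[int],
                         running: Iterable[RunningJob]) -> bool:
    """A server start waits while any overlay in its blueprint is being built."""
    wanted = set(blueprint_overlay_ids)
    return any(
        job.operation == "build_overlay" and job.overlay_id in wanted
        for job in running
    )
```

Neither function looks at the overlay type, which is the point made above: a `script` build and a `workshop` build are indistinguishable to the scheduler.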
## Daily Refresh — Removed
This iteration deletes the daily-refresh subsystem entirely:
- `deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.timer` and `.service` — deleted.
- `flask refresh-global-overlays` CLI command in `l4d2web/cli.py` — deleted.
- No replacement timer, no replacement CLI, no `auto_refresh` column on `Overlay`.
The only build trigger after this change is the user clicking Rebuild on the detail page (or the auto-enqueue when they Save the script body). A scheduled-refresh feature is reintroduced in a future iteration designed against concrete operational needs.
## Risks
- **Sandbox escape via kernel bug.** `bwrap` has a strong track record but is not invulnerable. Mitigated by running as `l4d2-sandbox` (no privileged capabilities), no setuid binaries reachable, `no_new_privs` implicit. A successful escape would land in an unprivileged UID with no host secrets reachable.
- **Disk fill via runaway script.** A script that writes a 20 GB+ payload to `/overlay` succeeds inside the sandbox and fails only afterward, at the post-build `du` check, so the data lands on disk transiently. This is not mitigated in v1: per-cgroup IO controls are unaware of file size and limit throughput rather than total bytes written, so there is no good in-flight limit. Accepted as a v1 trade-off; a future improvement is overlay-dir-on-its-own-filesystem with a quota.
- **Network exfiltration.** Script can connect to anything outbound, including internal IPs. Acceptable for the current trust model (semi-public; users have credentials). Egress firewall is out of scope.
- **Build-mid-server-running.** The scheduler refuses `build_overlay` for an overlay attached to a starting/running server (existing rule, unchanged), which covers this case. A user can still rebuild while a server using a *different* blueprint runs concurrently.
- **Wipe race with running build.** The wipe endpoint refuses if a `build_overlay` for the overlay is running. Without this check, a wipe could blow away files mid-script and produce undefined results.
- **Stale `last_build_status`.** A row inserted via direct DB write or restored from backup could carry an `'ok'` status that no longer reflects reality. Treated as cosmetic; users can rebuild to refresh.
- **Sudoers misconfig.** A typo in the sudoers fragment could grant `left4me` more than intended. Mitigated by deploy-artifact tests asserting the exact expected lines.
- **DB row deletion racing the sandbox.** A user deleting an overlay while its build runs would invalidate the bind-mount target. Mitigated by the existing scheduler rule that tracks running overlays; delete should refuse if a build is running. (Existing pattern for workshop overlays; reuse.)
- **Migration drops globals tables.** Acceptable for the test deploy. Production rollout would need a different migration story; this spec explicitly assumes test-deploy DB wipe.
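
The sudoers mitigation above relies on a deploy-artifact test that asserts exact lines. A minimal sketch of such a check follows; the rule text is the one from the Sudoers fragment section, while `sudoers_rules` and `check_fragment` are hypothetical helpers. The parser is deliberately naive: it treats every line starting with `#` as a comment, which would also skip `#include`-style directives.

```python
EXPECTED_RULE = (
    "left4me ALL=(root) NOPASSWD: "
    "/usr/local/libexec/left4me/left4me-script-sandbox"
)


def sudoers_rules(text: str) -> list[str]:
    """Return non-empty, non-comment lines with surrounding whitespace stripped."""
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]


def check_fragment(text: str, expected_rules: list[str]) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) rules for a sudoers fragment."""
    rules = sudoers_rules(text)
    missing = [rule for rule in expected_rules if rule not in rules]
    unexpected = [rule for rule in rules if rule not in expected_rules]
    return missing, unexpected
```

In a real test the expected list would also contain the lines for the existing helpers; a fragment passes only when both returned lists are empty, so an overly broad rule such as a wildcard path shows up as unexpected.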
## Out Of Scope
- **Scheduled / daily refresh.** Intentionally removed in this iteration. Reintroduced later, designed against the use cases that emerge.
- **Per-overlay resource overrides.** All script overlays share the same 1 h / 4 GB / 20 GB envelope. If a real overlay needs more (l4d2center mirror at peak), revisit.
- **CodeMirror or other rich script editor.** Plain `<textarea>` in v1.
- **Egress allowlist / proxy.** No network restrictions on the sandbox in v1.
- **`$CACHE` scratch dir** persisted across builds. Users cache inside the overlay dir if they want; idempotency model is "script runs against existing dir."
- **Multi-tenant cgroup tree per user.** All sandboxes share the same cgroup-quota envelope.
- **Revision history on `script` column.** No `overlay_script_revisions` table; whatever's in the row is the current script.
- **Auto-seeding of l4d2center / cedapug equivalents.** Admin pastes the script post-deploy.
- **Migration that preserves existing global-map overlay rows.** Test deploy DB is wiped.
- **Container-per-build (podman / docker).** Heavier than `bwrap`; revisit only if multi-tenant escalates to "fully public sign-up."
- **left4me-aware helpers** (`workshop`, `download`, `extract`) inside the sandbox. Pure bash + host `/usr` only.
## Implementation Boundaries
- **`l4d2host` is unchanged.** The host library has no concept of overlay types and the mount layer (`KernelOverlayFSMounter`) doesn't care how the overlay dir got populated.
- **The `OverlayBuilder` Protocol is unchanged** — same `build(overlay, *, on_stdout, on_stderr, should_cancel)` signature. `ScriptBuilder` plugs into the existing registry.
- **The job worker model is unchanged.** Same operations, same logs, same SSE plumbing, same scheduler rules (minus the refresh_global_overlays entry).
- **No new application-level dependencies.** Vendored HTMX, no new Python packages. Two new system dependencies: `bubblewrap` apt package and the `l4d2-sandbox` system user.
- **No new config keys.** Same env files (`/etc/left4me/host.env`, `/etc/left4me/web.env`).
- **DB migration is destructive for global-maps overlay rows.** This is acceptable per the test-deploy assumption; a production-rollout follow-up would need to address it.
- The companion implementation plan governs task ordering and verification commands. Implementation must not start without explicit user approval per that plan's gate.