docs(specs): script overlay type — design + implementation plan

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
mwiegand 2026-05-08 15:27:14 +02:00
parent 9985ecc56c
commit 78ead0b41d
No known key found for this signature in database
2 changed files with 673 additions and 0 deletions

View file

@ -0,0 +1,350 @@
# L4D2 Script Overlays Implementation Plan
> **Approval status:** User-approved 2026-05-08. Implementation proceeds.
**Goal:** Implement the `script` overlay type per `docs/superpowers/specs/2026-05-08-l4d2-script-overlays-design.md`. Add an `Overlay.script` TEXT column and `Overlay.last_build_status` enum-string column, a `ScriptBuilder` that runs user bash inside a `bubblewrap` + `systemd-run --scope` sandbox via a new `left4me-script-sandbox` privileged helper, route + UI surface for editing/wiping/rebuilding, and delete the entire managed-globals (`l4d2center_maps`, `cedapug_maps`) subsystem and its daily-refresh timer/CLI.
**Architecture:** The web app continues to enqueue `build_overlay` jobs for any overlay row. The job worker dispatches via `BUILDERS[overlay.type].build(...)`. After this change `BUILDERS = {"workshop": WorkshopBuilder(), "script": ScriptBuilder()}`. The new `ScriptBuilder` writes `overlay.script` to a tmpfile and execs `sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <tmpfile>`, which itself execs `systemd-run --scope --collect ... -- bwrap [namespace flags] /bin/bash /script.sh`. stdout/stderr stream through the existing `run_with_streamed_output` helper into the existing job-log SSE plumbing. The job-completion path writes `Overlay.last_build_status` based on the build outcome. The kernel-overlayfs mount layer (`KernelOverlayFSMounter`) is unchanged.
---
## Locked Decisions
See `docs/superpowers/specs/2026-05-08-l4d2-script-overlays-design.md` for design rationale. Implementation-relevant summary:
- Final overlay type list: `workshop` (unchanged) + `script` (new). Drop `l4d2center_maps`, `cedapug_maps`.
- New columns on `overlays`: `script TEXT NOT NULL DEFAULT ''`, `last_build_status VARCHAR(16) NOT NULL DEFAULT ''`.
- Drop tables (FK order): `global_overlay_item_files`, `global_overlay_items`, `global_overlay_sources`.
- `ScriptBuilder` in `l4d2web/services/overlay_builders.py`, uses existing `run_with_streamed_output`.
- Privileged helper `left4me-script-sandbox` (bash, mode 0755, owned root). `systemd-run --scope --collect -p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 -p CPUQuota=200% -p RuntimeMaxSec=3600 -- bwrap …`. Limits 1 h walltime, 4 GB RAM, 20 GB post-build `du` cap.
- New system user `l4d2-sandbox` (`/usr/sbin/nologin`, no home). New apt dep `bubblewrap`.
- Sudoers verb-unrestricted: `left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox`.
- Daily refresh subsystem deleted: `left4me-refresh-global-overlays.{timer,service}` and `flask refresh-global-overlays` CLI removed. No replacement.
- Wipe is the same sandbox helper invoked with the literal script `find /overlay -mindepth 1 -delete`.
- `auto_refresh` column NOT added in this iteration.
- Test deploy DB is wiped on rollout; migration includes `DELETE FROM overlays WHERE type IN ('l4d2center_maps', 'cedapug_maps')` for safety.
---
## Current Gap
- `l4d2web/models.py` `Overlay` has no `script` or `last_build_status` columns. The 3 globals tables are present.
- `l4d2web/services/overlay_builders.py` `BUILDERS = {"workshop": WorkshopBuilder(), "l4d2center_maps": GlobalMapOverlayBuilder(), "cedapug_maps": GlobalMapOverlayBuilder()}`. No `ScriptBuilder`.
- `l4d2web/services/{global_map_sources,global_overlay_refresh,global_map_cache,global_overlays}.py` exist and are referenced by routes / CLI.
- `l4d2web/services/job_worker.py` carries `refresh_global_overlays_running` plumbing.
- `l4d2web/cli.py` defines `refresh-global-overlays`.
- `l4d2web/routes/overlay_routes.py` has no `/script`, `/wipe`, or `/build` endpoints for non-workshop types.
- `l4d2web/templates/overlays.html` create modal type radio offers only `workshop`.
- `l4d2web/templates/overlay_detail.html` has a global-source block (~lines 3446) that should not survive.
- `deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.{timer,service}` exist.
- `deploy/deploy-test-server.sh` provisions `global_overlay_cache/` and does not provision `l4d2-sandbox` or install `bubblewrap`.
- Seven `tests/test_global_*.py` files exist and reference removed code.
---
## Task 1: Schema migration (alembic 0005)
**Files:**
- Create: `l4d2web/alembic/versions/0005_script_overlays.py` (revises `0004_drop_legacy_external_overlay_type`).
- Modify: `l4d2web/models.py``Overlay` gains `script` and `last_build_status` columns; remove `GlobalOverlaySource`, `GlobalOverlayItem`, `GlobalOverlayItemFile` model classes.
- Modify: `l4d2web/tests/test_overlay_models.py` (or whichever existing test asserts the Overlay schema; create one if absent) — assert new columns present.
Test plan (RED first):
1. `tests/test_alembic_migrations.py::test_upgrade_0005_adds_script_columns` — apply migrations to a fresh in-memory SQLite, assert `script` and `last_build_status` columns present on `overlays`, assert no `global_overlay_*` tables, assert old data wipe `DELETE FROM overlays WHERE type IN (...)` is part of the upgrade.
2. `tests/test_alembic_migrations.py::test_downgrade_0005_restores_globals` (only if downgrade is supported in the project's migration policy; skip with `pytest.skip` if not — kernel-overlayfs migration is one-way, follow that precedent).
3. `tests/test_overlay_models.py::test_overlay_has_script_columns``Overlay(...)` instance has `script=''` and `last_build_status=''` defaults.
Implementation:
- Migration uses `op.drop_table('global_overlay_item_files')` etc. in correct FK order; uses `op.add_column('overlays', sa.Column('script', sa.Text(), nullable=False, server_default=''))` and similar for `last_build_status` (`sa.String(16)`).
- The `DELETE FROM overlays WHERE type IN ('l4d2center_maps','cedapug_maps')` runs *before* the column additions so the operation is straightforward — these rows do not reference the new columns.
- `models.py`: delete the three globals model classes outright; add the two new columns to `Overlay` with explicit defaults.
**Verification:**
```
python3 -m pytest l4d2web/tests/test_alembic_migrations.py l4d2web/tests/test_overlay_models.py -q
```
**Commit:** `feat(l4d2-web): script overlay schema — add overlay.script + last_build_status, drop globals tables`
---
## Task 2: ScriptBuilder + BUILDERS registry update
**Files:**
- Modify: `l4d2web/services/overlay_builders.py` — add `ScriptBuilder`, remove `GlobalMapOverlayBuilder`, change `BUILDERS` dict.
- Rewrite: `l4d2web/tests/test_overlay_builders.py` — drop globals-builder tests, add ScriptBuilder tests.
Test plan (RED first):
1. `test_overlay_builders.py::test_builders_registry``set(BUILDERS) == {"workshop", "script"}`. Assert `"l4d2center_maps"` and `"cedapug_maps"` and `"external"` are absent.
2. `test_overlay_builders.py::test_script_builder_invokes_helper` — patch `run_with_streamed_output` to capture argv; build an `Overlay(id=42, type='script', script='echo hi')`; assert argv shape `["sudo", "-n", "/usr/local/libexec/left4me/left4me-script-sandbox", "42", <script_path>]` and that the script_path file exists with content `"echo hi"` at invocation time. Verify the tmpfile is unlinked after build.
3. `test_overlay_builders.py::test_script_builder_disk_cap` — fake `subprocess.check_output` for `du` to return `25000000000`; build raises `BuildError("disk-cap-exceeded")` and `on_stderr` was called with the cap message.
4. `test_overlay_builders.py::test_script_builder_streams_output` — fake `run_with_streamed_output` invokes both `on_stdout("hello\n")` and `on_stderr("warn\n")`; both lambda lists capture the lines.
5. `test_overlay_builders.py::test_script_builder_cancel``should_cancel` returns True after the first stdout line; assert `run_with_streamed_output` propagated cancellation (the existing helper's contract — the test just ensures we pass `should_cancel` through and don't run the disk-budget check on cancel).
6. `test_overlay_builders.py::test_workshop_builder_unchanged` — smoke test that `WorkshopBuilder` still exists and is invokable (regression guard against accidental removal during refactor).
Implementation:
- Add `import os, subprocess, tempfile` at the top of `overlay_builders.py` if not present.
- `ScriptBuilder` exactly as in the spec (verbatim copy from the design doc, §Build Lifecycle).
- Define a small `BuildError` exception class if one doesn't already exist locally; reuse the existing one if `WorkshopBuilder` already raises a similar type.
- `_enforce_disk_budget` calls `subprocess.check_output(["du", "-sb", str(overlay_path(overlay_id))])`; the existing `overlay_path` helper in the module already returns the absolute Path. Parse first whitespace-delimited integer; cap is `20 * 1024**3`.
- Job-completion path: locate the existing path that handles `build_overlay` job success/failure (likely in `services/job_worker.py` or a related orchestration module). Add a single column write: on success `last_build_status='ok'`, on `BuildError` / non-zero exit / cancel `last_build_status='failed'`. Add a `tests/test_job_worker.py::test_build_overlay_writes_last_build_status` covering both branches.
- Remove `GlobalMapOverlayBuilder` class and any helper functions it owns that are not used elsewhere.
**Verification:**
```
python3 -m pytest l4d2web/tests/test_overlay_builders.py l4d2web/tests/test_job_worker.py -q
```
**Commit:** `feat(l4d2-web): ScriptBuilder + BUILDERS registry update`
---
## Task 3: Delete global-overlay services + CLI command + their tests
**Files:**
- Delete: `l4d2web/services/global_map_sources.py`
- Delete: `l4d2web/services/global_overlay_refresh.py`
- Delete: `l4d2web/services/global_map_cache.py`
- Delete: `l4d2web/services/global_overlays.py`
- Modify: `l4d2web/cli.py` — remove `refresh-global-overlays` command (lines ~4455). Drop any imports that go orphaned.
- Delete: `l4d2web/tests/test_global_map_sources.py`
- Delete: `l4d2web/tests/test_global_overlay_models.py`
- Delete: `l4d2web/tests/test_global_overlay_builders.py`
- Delete: `l4d2web/tests/test_global_overlay_cli.py`
- Delete: `l4d2web/tests/test_global_overlay_refresh.py`
- Delete: `l4d2web/tests/test_global_overlays.py`
- Delete: `l4d2web/tests/test_global_map_cache.py`
- Audit & fix: any other module that imports the deleted modules. Likely candidates: `l4d2web/app.py` (CLI registration), `routes/overlay_routes.py`, `routes/page_routes.py`. Resolve by deletion of the dead import / call site, not by stubbing.
- Modify: `pyproject.toml` — drop `py7zr` from dependencies (only used by the deleted globals subsystem).
Test plan:
1. RED-first via grep: `grep -RIn 'global_map_sources\|global_overlay_refresh\|global_map_cache\|global_overlays\|refresh_global_overlays\|GlobalMapOverlayBuilder' l4d2web/ deploy/` — should return zero hits at the end of this task. Add this as `tests/test_no_globals_references.py::test_no_globals_imports` if you want it as a permanent regression guard, otherwise spot-check.
2. Existing `tests/test_cli.py` (or whichever covers Flask CLI) loses any cases for `refresh-global-overlays`; add a `test_refresh_global_overlays_command_removed` that asserts the click command is not registered.
Implementation:
- Delete files via `git rm`.
- In `cli.py`, remove the command function and its `@app.cli.command(...)` decorator. Drop any helper imports that become orphaned.
- Remove `py7zr` from `pyproject.toml` and re-lock if a lockfile is present.
**Verification:**
```
python3 -m pytest l4d2web/tests/ -q
grep -RIn 'global_map_sources\|global_overlay_refresh\|global_map_cache\|global_overlays\|refresh_global_overlays\|GlobalMapOverlayBuilder' l4d2web/ deploy/ || echo "clean"
```
**Commit:** `refactor(l4d2-web): drop global-overlays subsystem in favor of script type`
---
## Task 4: Job worker — drop refresh_global_overlays from scheduler
**Files:**
- Modify: `l4d2web/services/job_worker.py` — remove `"refresh_global_overlays"` from `GLOBAL_OPERATIONS`; remove `refresh_global_overlays_running` field from `SchedulerState` and any references in `can_start()`; check whether `blocked_servers_by_overlay` was added solely for the globals subsystem and remove if so.
- Modify: `l4d2web/tests/test_job_worker.py` — drop `refresh_global_overlays` truth-table rows; add explicit `build_overlay` truth-table cases for `script`-type overlays (mechanically identical to workshop, but pinned by test).
Test plan:
1. `test_job_worker.py::test_global_operations_set``GLOBAL_OPERATIONS == {"install", "refresh_workshop_items"}` (or whatever subset remains; pin it).
2. `test_job_worker.py::test_build_overlay_script_type_blocks_per_overlay` — start `build_overlay(overlay_id=7)` for a `script`-type overlay; assert second `build_overlay(overlay_id=7)` cannot start; assert `build_overlay(overlay_id=8)` can.
3. `test_job_worker.py::test_build_overlay_blocks_server_init_on_blueprint_overlay` — existing test, may need re-pinning if it referenced globals.
Implementation:
- Remove the field from the dataclass / TypedDict that backs `SchedulerState`.
- Remove any update sites that flipped the flag (the worker's enqueue / on-start / on-complete paths).
- The remaining mutex rules (`install` / `refresh_workshop_items` are global; `build_overlay` per-overlay; server ops block on overlays in their blueprint) are unchanged structurally.
**Verification:**
```
python3 -m pytest l4d2web/tests/test_job_worker.py -q
```
**Commit:** `refactor(l4d2-web): drop refresh_global_overlays from scheduler`
---
## Task 5: Routes (script update / wipe / build)
**Files:**
- Modify: `l4d2web/routes/overlay_routes.py` — add three POST endpoints.
- Create: `l4d2web/tests/test_script_overlay_routes.py`.
Test plan (RED first):
1. `test_script_overlay_routes.py::test_create_script_overlay` — POST `/overlays` with form `{"name": "x", "type": "script"}` as a regular user → 302 to detail; row exists with `type='script'`, `script=''`, `last_build_status=''`, `user_id=current_user.id`, `path=str(id)`.
2. `test_script_overlay_routes.py::test_admin_creates_system_wide_script_overlay` — admin POST with system-wide flag → row has `user_id=NULL`.
3. `test_script_overlay_routes.py::test_update_script_body_enqueues_build` — POST `/overlays/{id}/script` with `{"script": "echo new"}` → row.script updated; one new `build_overlay` job enqueued for the overlay; second immediate POST coalesces (no second job inserted while first is pending).
4. `test_script_overlay_routes.py::test_manual_rebuild` — POST `/overlays/{id}/build` → enqueues `build_overlay`; coalesces.
5. `test_script_overlay_routes.py::test_wipe_runs_find_delete` — POST `/overlays/{id}/wipe` → invokes `ScriptBuilder.build` (or the underlying helper) with the literal script `find /overlay -mindepth 1 -delete`. After success, row.last_build_status `==''`. Does not enqueue a `build_overlay`.
6. `test_script_overlay_routes.py::test_wipe_refuses_during_running_build` — set scheduler state to `build_overlay(overlay_id=7)` running; POST `/overlays/7/wipe` → 409 (or whatever the existing pattern uses for scheduler conflicts), no sandbox invocation.
7. `test_script_overlay_routes.py::test_permissions_non_owner_denied` — user A creates private script overlay; user B POSTs `/overlays/{id}/script` → 403.
8. `test_script_overlay_routes.py::test_permissions_admin_can_edit_any` — admin POSTs `/overlays/{id}/script` for user A's row → 200.
Implementation:
- Mirror the existing `_can_edit_overlay()` permission helper.
- The `/wipe` endpoint can either (a) call `ScriptBuilder` directly with a synthetic `Overlay`-like object whose `.script` is the find command and whose `.id` is the real overlay id, or (b) factor a `_run_sandbox(overlay_id, script_text, on_stdout, on_stderr, should_cancel)` helper out of `ScriptBuilder.build()` and call it from both. (b) is cleaner; do (b).
- Wipe runs **synchronously** in the request thread (small, fast). It does NOT enqueue a job. Surface log output as flash messages or by streaming through the existing log infra — pick whichever matches the existing wipe-equivalent pattern (workshop overlays don't have a wipe; closest analog is the existing delete-overlay flow).
- The `/script` endpoint enqueues via the same `enqueue_build_overlay(overlay_id)` helper used by workshop overlays' add/remove flows. Coalescing is already implemented there.
**Verification:**
```
python3 -m pytest l4d2web/tests/test_script_overlay_routes.py l4d2web/tests/test_overlay_routes.py -q
```
**Commit:** `feat(l4d2-web): script overlay routes (script update / wipe / build)`
---
## Task 6: Templates (overlays.html + overlay_detail.html)
**Files:**
- Modify: `l4d2web/templates/overlays.html` — add `script` to the create-modal type radio (lines ~2949).
- Modify: `l4d2web/templates/overlay_detail.html` — add a `{% if overlay.type == 'script' %}` block with textarea + Save / Rebuild / Wipe buttons + status badge; delete the global-source block (lines ~3446).
- Modify: `l4d2web/tests/test_pages.py` — assert script-section renders for type=`script`, workshop-section renders for type=`workshop`, global-source-section is absent.
Test plan:
1. `test_pages.py::test_overlay_create_modal_offers_script_type` — GET `/overlays`; HTML contains `value="script"` radio.
2. `test_pages.py::test_overlay_detail_script_section` — create script overlay, GET `/overlays/{id}`; HTML contains `<textarea name="script">`, "Rebuild" button, "Wipe" button, status badge element.
3. `test_pages.py::test_overlay_detail_workshop_section_unchanged` — existing workshop detail still has thumbnail grid, add-item form, etc.
4. `test_pages.py::test_overlay_detail_no_global_source_block` — page HTML has no element from the deleted global-source block (check for an attribute or string unique to that block).
Implementation:
- Detail-page wipe button uses a small confirm-modal pattern (copy from the existing delete-overlay confirm modal).
- Status badge: existing CSS classes for ok/warn/error already exist in `static/`; reuse them.
- No new JS deps. Plain `<form method="post">` with HTMX `hx-post` for the script update if a streaming UX is desired (match existing patterns).
**Verification:**
```
python3 -m pytest l4d2web/tests/test_pages.py -q
```
Manual: start dev server (`flask run`), create a script overlay, paste `echo "hi" > foo`, click Save, watch log stream. Then click Wipe; confirm dir is empty. Then click Rebuild; confirm `foo` reappears.
**Commit:** `feat(l4d2-web): script overlay UI`
---
## Task 7: Libexec sandbox helper + sudoers + deploy-artifacts test
**Files:**
- Create: `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` (bash, mode 0755 after deploy, owned root).
- Modify: `deploy/files/etc/sudoers.d/left4me` — append the rule.
- Modify: `deploy/tests/test_deploy_artifacts.py` — assert helper file present + sudoers contains the new line.
Test plan (RED first):
1. `test_deploy_artifacts.py::test_script_sandbox_helper_present` — file exists, mode bits indicate 0755 (or whatever the test framework allows checking pre-deploy), shebang is `#!/bin/bash`.
2. `test_deploy_artifacts.py::test_sudoers_includes_script_sandbox_rule` — sudoers file contains the exact line `left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox`.
3. Optional integration test (skip on non-Linux dev): drive the helper as a subprocess with a synthesized fake `/var/lib/left4me/overlays/1/` and a no-op script, assert `bwrap` invocation happens (use a mock `systemd-run` or `LEFT4ME_SCRIPT_SANDBOX_DRY_RUN=1` env that prints the would-be invocation and exits 0). Mirrors the `LEFT4ME_OVERLAY_PRINT_ONLY=1` pattern from the kernel-overlayfs helper test.
Implementation:
- Helper script verbatim from the spec §Sandbox.
- Sudoers fragment: append (don't replace existing rules). The existing fragment has rules for `left4me-overlay`, `left4me-systemctl`, `left4me-journalctl` — match the same formatting (one rule per line, no trailing whitespace).
**Verification:**
```
python3 -m pytest deploy/tests/test_deploy_artifacts.py -q
bash -n deploy/files/usr/local/libexec/left4me/left4me-script-sandbox
```
**Commit:** `feat(deploy): left4me-script-sandbox helper + sudoers fragment`
---
## Task 8: Deploy script — provision l4d2-sandbox + bubblewrap; drop globals timer
**Files:**
- Modify: `deploy/deploy-test-server.sh` — add `useradd --system ... l4d2-sandbox`, add `apt-get install -y bubblewrap`, ensure helper installation step picks up `left4me-script-sandbox` (likely automatic if it's a glob in `deploy/files/usr/local/libexec/left4me/*`); drop the `mkdir global_overlay_cache` line if present.
- Delete: `deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.timer`
- Delete: `deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.service`
- Modify: `deploy/tests/test_deploy_artifacts.py` — assert the two unit files are absent; assert `useradd l4d2-sandbox` and `apt-get install ... bubblewrap` lines are present in the deploy script.
Test plan:
1. `test_deploy_artifacts.py::test_globals_refresh_units_removed` — files do not exist under `deploy/files/usr/local/lib/systemd/system/`.
2. `test_deploy_artifacts.py::test_deploy_script_provisions_sandbox_user` — grep the deploy script for the useradd line.
3. `test_deploy_artifacts.py::test_deploy_script_installs_bubblewrap` — grep for `bubblewrap` in apt invocations.
Implementation:
- `useradd` line uses `--system --no-create-home --shell /usr/sbin/nologin`. Idempotency: wrap with `id l4d2-sandbox &>/dev/null || useradd ...`.
- `apt-get install`: append `bubblewrap` to whatever package list the script already maintains.
- Globals timer/service deletions: `git rm`.
**Verification:**
```
python3 -m pytest deploy/tests/ -q
shellcheck deploy/deploy-test-server.sh deploy/files/usr/local/libexec/left4me/left4me-script-sandbox
```
**Commit:** `chore(deploy): provision l4d2-sandbox + bubblewrap; drop globals refresh timer`
---
## Task 9: Full pytest run + drift fixes
**Files:** as needed across the repo.
Test plan: run the full test suite for both packages; chase down any drift caused by removed model classes, dropped imports, or template changes.
```
python3 -m pytest l4d2web/tests/ -q
python3 -m pytest l4d2host/tests/ -q
python3 -m pytest deploy/tests/ -q
```
Implementation: fix what breaks. Common drift sources to expect:
- Tests that imported from deleted modules.
- Tests that asserted exact `BUILDERS` keyset (good — they should have been updated in Task 2).
- Tests that built fixtures with `type='l4d2center_maps'` or `type='cedapug_maps'` — those tests likely belong to the deleted set or need conversion to `type='script'`.
- Template snapshot tests (if any) that captured the deleted global-source block.
**Verification:** all three suites green.
**Commit:** `chore(l4d2-web): test suite drift fixes after script-overlays migration` (only if drift fixes needed; skip if Tasks 18 left the suite green)
---
## End-to-end deployment verification (manual, on test host)
After all tasks committed:
1. **Reset deploy:** run `deploy/deploy-test-server.sh` from clean state. Confirm `bubblewrap` installed (`dpkg -l bubblewrap`), `l4d2-sandbox` user exists (`id l4d2-sandbox`), `/usr/local/libexec/left4me/left4me-script-sandbox` is mode 0755 and root-owned, `sudo -ln` as `left4me` shows the new rule.
2. **Sandbox smoke:** as `left4me`, write `/tmp/echo.sh` containing `echo $(whoami) > /overlay/sentinel`. `mkdir -p /var/lib/left4me/overlays/1`. `sudo /usr/local/libexec/left4me/left4me-script-sandbox 1 /tmp/echo.sh`. Confirm `/var/lib/left4me/overlays/1/sentinel` contains `l4d2-sandbox` and is owned by `l4d2-sandbox`. Confirm `/etc/passwd`, `/var/lib/left4me/l4d2web.db`, and `/home` are not visible inside the sandbox by running probe scripts.
3. **Resource limits:**
- `dd if=/dev/zero of=/overlay/big bs=1M count=25000` → succeeds inside sandbox; `ScriptBuilder._enforce_disk_budget` flags the build failed; `last_build_status='failed'`.
- `sleep 7200` → killed at 1 h by `RuntimeMaxSec=3600`.
- Memory hog (`python3 -c "x=' '*(5*1024**3)"`) → OOM at 4 GB.
4. **App-level happy path:** as a non-admin user, create a script overlay via the UI, paste an old `competitive_rework`-style script, Save → build runs, succeeds, addons appear in `overlays/{id}/left4dead2/`. Stack onto a server blueprint, start the server, verify content mounts via the L4D2 admin console (`map workshop/...`).
5. **Wipe:** click Wipe → dir empty (find -delete output in log). Click Rebuild → repopulates. `last_build_status` cycles: `''``'ok'`.
6. **Scheduler:** start a server using the script overlay; in another browser tab attempt to Rebuild → 409 / scheduler-blocked. Stop server; rebuild succeeds.
7. **Audit log:** `journalctl --since "5 min ago" | grep run-` shows transient scopes per build with cgroup memory accounting visible.
These are not required for any single commit but should pass before declaring the work done.

View file

@ -0,0 +1,323 @@
# L4D2 Script Overlays Design
**Goal:** Add a single new overlay type, `script`, that lets users author arbitrary build recipes as bash and runs them inside a `bubblewrap` + `systemd-run --scope` sandbox. The new type subsumes the existing `l4d2center_maps` and `cedapug_maps` managed-globals overlay types, both of which are removed in the same change. After this work the overlay type list is exactly `workshop` (unchanged) and `script` (new).
**Approval status:** User-approved design direction. Implementation proceeds in lockstep with the companion plan at `docs/superpowers/plans/2026-05-08-l4d2-script-overlays.md`.
## Context
`left4me` users today have two ways to add content to a server: workshop overlays (rich UI for Steam Workshop items via `WorkshopBuilder`) and a pair of managed global-map overlay types (`l4d2center_maps`, `cedapug_maps`) with bespoke parsers, per-item DB rows, ETag-based change detection, and a daily refresh timer. They cannot author arbitrary build recipes.
The user's previous setup at `ckn-bw/bundles/left4dead2/files/scripts/overlays/` expressed every recipe as a small bash file: `competitive_rework` (GitHub tarball download), `tickrate` (inline `server.cfg` + addon DLL fetch), `standard` (workshop items + admin-list write), `workshop_maps` (workshop collection import), `l4d2center_maps` (CSV-driven map sync). All five fit naturally into a single "run a sandboxed bash script that populates the overlay dir" model.
The two managed global-map types in the current codebase are over-engineered for what they do — each is essentially "fetch a manifest, download archives, extract VPKs, place in `addons/`." Folding them into the new `script` type eliminates three database tables, two source-parser modules, the `GlobalMapOverlayBuilder`, the `py7zr` dependency, the global-overlay cache root, and the managed-singleton machinery, while letting an admin paste the equivalent shell code (which the user already wrote years ago) into a normal admin-owned, system-wide script overlay.
The trust model for the sandbox is "semi-public deployment, registered users." The threat surface is one user reading another user's overlay, the application DB, or arbitrary host secrets, plus runaway scripts exhausting disk/CPU/RAM. Network access is *not* restricted — scripts must be able to download from arbitrary URLs (GitHub, l4d2center, Steam CDN). Sandbox boundaries are namespace-based (mount, PID, IPC, UTS, cgroup), not command-allowlist-based; binary-allowlist sandboxing of bash is theatre because of `eval` and `exec`.
The test deploy DB is wiped as part of rollout; no data migration is performed. Existing user blueprints that reference `l4d2center_maps` or `cedapug_maps` overlay rows do not survive the change in the test environment.
A scheduled-refresh feature (the daily timer that today drives the global-map types) is intentionally **out of scope for this iteration**. The two existing systemd units and the `flask refresh-global-overlays` CLI command are deleted with no replacement. Refresh is reintroduced in a later iteration designed against concrete needs.
## Locked Decisions
1. **Single new overlay type: `script`.** Replaces both managed-globals types. Final type list: `workshop` + `script`. No `tarball`/`inline`/`manual` types — all of those collapse into `script` (with UI templates as a future ergonomics improvement).
2. **`Overlay.script` is a DB `TEXT` column** holding the raw bash. No file storage, no revision history in v1. Empty string for `workshop` rows.
3. **Build idempotency contract: script runs against the existing overlay dir.** No automatic wipe between builds. Users write `test -f … || curl …`-style guards if they want bandwidth efficiency. A manual "Wipe overlay" button on the detail page resets the dir to empty.
4. **No left4me-aware helpers in the sandbox.** The script sees pure bash plus whatever's in `/usr` (RO bind-mount of the host). Workshop items are not exposed via a helper — users wanting workshop content create a `workshop`-type overlay, which has its own first-class UX (thumbnails, collection paste, dedup cache, refresh).
5. **Sandbox engine: `bubblewrap` (`bwrap`) inside `systemd-run --scope --collect`.** `systemd-run` provides cgroup v2 limits + walltime kill via `RuntimeMaxSec`; `bwrap` provides the namespace isolation. Both are stable, well-audited, in-tree on Debian.
6. **Resource limits (system-wide, not per-overlay):** 1 hour walltime (`RuntimeMaxSec=3600`), 4 GB RAM (`MemoryMax=4G`, `MemorySwapMax=0`), 512 tasks, 200% CPU quota, post-build 20 GB disk cap on `du -sb` of the overlay dir.
7. **Network: host-shared.** No `--unshare-net`. Scripts have full outbound. Egress filtering is not in v1; the sandbox prevents reading internal state but does not prevent talking to internal IPs. Acceptable for the current trust model.
8. **No auto-seeding of "default" overlays.** Admin manually creates the equivalents of the old `l4d2center-maps`/`cedapug-maps` post-deploy by pasting the bash. The deploy script does not insert overlay rows.
9. **Daily/scheduled refresh: out of scope for this iteration.** No `auto_refresh` flag, no timer, no CLI command. Manual rebuild via the detail-page button is the only build trigger after this change.
10. **Permissions mirror workshop overlays.** Any logged-in user can create a private (`user_id = me`) script overlay. Admin can create system-wide (`user_id = NULL`). Owner or admin can edit/delete.
11. **Failure semantics via `Overlay.last_build_status`** (`'' | 'ok' | 'failed'`). Drives a "rebuild required" badge on the list and detail pages. Server initialization does **not** auto-block on `failed` (matches workshop's current behavior).
12. **Wipe is just another sandbox invocation.** The wipe endpoint runs the literal script `find /overlay -mindepth 1 -delete` through the same `left4me-script-sandbox` helper. No second helper, no privilege/UID puzzle (files are owned by `l4d2-sandbox`, who runs the wipe). After a successful wipe, `last_build_status` is reset to `''`. Wipe does **not** auto-enqueue a rebuild — the user decides.
13. **Privileged helper: `/usr/local/libexec/left4me/left4me-script-sandbox`.** Same pattern as the existing `left4me-overlay`, `left4me-systemctl`, `left4me-journalctl` helpers. Bash, owned root, mode 0755. The web user invokes it via `sudo -n` per a sudoers fragment. Root is needed to set up the namespaces; bwrap drops to the unprivileged `l4d2-sandbox` UID immediately.
14. **Dedicated sandbox UID `l4d2-sandbox`** (system user, `/usr/sbin/nologin`, no home). Owns nothing on the host outside what bwrap binds in. UID-drop happens inside the bwrap invocation via `--uid`/`--gid`.
15. **Strict argument validation in the helper.** Overlay id matches `^[0-9]+$`; overlay dir must exist under `/var/lib/left4me/overlays/`; script path must exist. Defense in depth — the real authorization check lives in the web app.
16. **Streaming I/O via the existing `run_with_streamed_output` helper.** Same plumbing `WorkshopBuilder` already uses for `steamcmd`/`curl` invocations. No new SSE/log path.
## Architecture
```text
Overlay row (type=script, script=TEXT, last_build_status)
▼ build_overlay(overlay_id) job
▼ BUILDERS["script"].build(overlay, on_stdout, on_stderr, should_cancel)
▼ ScriptBuilder writes overlay.script → tmpfile, then:
│ sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <tmpfile>
▼ Helper validates args, then exec()s:
│ systemd-run --scope --collect
│ -p MemoryMax=4G -p MemorySwapMax=0
│ -p TasksMax=512 -p CPUQuota=200%
│ -p RuntimeMaxSec=3600
│ -- bwrap [namespace flags...] /bin/bash /script.sh
▼ Inside the sandbox the script sees:
│ /overlay ← /var/lib/left4me/overlays/{id} RW (the build target)
│ /tmp,/run ← fresh tmpfs RW (ephemeral)
│ /usr,/lib,/lib64,/etc/{ssl,resolv.conf,nsswitch} RO (host-curated)
│ /proc,/dev ← fresh
│ network ← shared with host
│ UID/GID ← l4d2-sandbox (no_new_privs implicit in bwrap)
▼ stdout/stderr → run_with_streamed_output → existing job-log SSE stream
▼ After exit:
│ exit 0 ∧ du -sb /overlay ≤ 20 GB → last_build_status='ok'
│ any other outcome → last_build_status='failed'
```
The host library (`l4d2host`) is unchanged. The `KernelOverlayFSMounter` already mounts whatever's at `overlays/{id}/` regardless of how it got there. The Job model and worker model are essentially unchanged — `script` is just another overlay type for the same `build_overlay` operation that today supports `workshop`.
```python
BUILDERS = {
"workshop": WorkshopBuilder(),
"script": ScriptBuilder(),
}
```
## Data Model
### `Overlay` (modified)
```text
id INTEGER PK AUTOINCREMENT
name VARCHAR(255) NOT NULL
path VARCHAR(255) NOT NULL -- str(id) for new rows
type VARCHAR(16) NOT NULL -- 'workshop' | 'script'
user_id INTEGER NULL REFERENCES users(id) -- NULL = system-wide
script TEXT NOT NULL DEFAULT '' -- new; meaningful for type='script'
last_build_status VARCHAR(16) NOT NULL DEFAULT '' -- new; '' | 'ok' | 'failed'
created_at, updated_at
UNIQUE INDEX on (name) WHERE user_id IS NULL
UNIQUE INDEX on (name, user_id) WHERE user_id IS NOT NULL
INDEX on (type, user_id)
```
### Tables removed
- `global_overlay_item_files`
- `global_overlay_items`
- `global_overlay_sources`
Drop order matters for the SQLite migration: drop `_item_files` first (FK to `_items`), then `_items` (FK to `_sources`), then `_sources` (FK to `overlays`).
### Unchanged
`WorkshopItem`, `overlay_workshop_items`, `Job` (including `Job.overlay_id` and nullable `Job.user_id`), `Server`, `Blueprint`, etc.
## Filesystem Layout
```text
${LEFT4ME_ROOT}/
overlays/
{overlay_id}/ # script writes here; mounted by host
left4dead2/... # whatever the script produces
workshop_cache/{steam_id}.vpk # workshop type only — unchanged
# removed:
# global_overlay_cache/ # was used by managed-globals types
```
Single tree per overlay. No per-overlay scratch cache (the chosen idempotency model is "script runs against existing dir," so any caching the user wants lives inside the overlay dir and is preserved between builds).
The sandbox bind-mounts `${LEFT4ME_ROOT}/overlays/{id}/` to `/overlay` (RW). Nothing else under `${LEFT4ME_ROOT}` is visible inside the sandbox.
## Sandbox
### Helper script
`deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`, mode 0755, owned root:
```bash
#!/bin/bash
# args: <overlay_id> <script_path>
set -euo pipefail
[[ $# -eq 2 ]] || { echo "usage: $0 <overlay_id> <script>" >&2; exit 64; }
OVERLAY_ID=$1; SCRIPT=$2
[[ "$OVERLAY_ID" =~ ^[0-9]+$ ]] || { echo "bad overlay id" >&2; exit 64; }
OVERLAY_DIR=/var/lib/left4me/overlays/$OVERLAY_ID
[[ -d $OVERLAY_DIR ]] || { echo "no overlay dir" >&2; exit 65; }
[[ -f $SCRIPT ]] || { echo "no script" >&2; exit 65; }
SBX_UID=$(id -u l4d2-sandbox); SBX_GID=$(id -g l4d2-sandbox)
exec systemd-run --quiet --scope --collect \
-p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 \
-p CPUQuota=200% -p RuntimeMaxSec=3600 \
-- bwrap \
--die-with-parent --new-session \
--unshare-pid --unshare-ipc --unshare-uts --unshare-cgroup \
--uid "$SBX_UID" --gid "$SBX_GID" \
--proc /proc --dev /dev --tmpfs /tmp --tmpfs /run \
--ro-bind /usr /usr --ro-bind /lib /lib --ro-bind /lib64 /lib64 \
--symlink usr/bin /bin --symlink usr/sbin /sbin \
--ro-bind /etc/resolv.conf /etc/resolv.conf \
--ro-bind /etc/ssl /etc/ssl \
--ro-bind /etc/ca-certificates /etc/ca-certificates \
--ro-bind /etc/nsswitch.conf /etc/nsswitch.conf \
--bind "$OVERLAY_DIR" /overlay \
--chdir /overlay \
--setenv HOME /tmp --setenv PATH /usr/bin:/usr/sbin \
--setenv OVERLAY /overlay \
--ro-bind "$SCRIPT" /script.sh \
/bin/bash /script.sh
```
Network is *not* unshared (no `--unshare-net`); the sandbox shares the host network namespace. Every transient unit is visible via `systemctl list-units --type=scope` while running and journaled afterward (`journalctl --user-unit=run-…scope` or system journal depending on invocation).
### Sudoers fragment
Append to `deploy/files/etc/sudoers.d/left4me`:
```
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox
```
### System user
Provisioned in `deploy/deploy-test-server.sh`:
```bash
useradd --system --no-create-home --shell /usr/sbin/nologin l4d2-sandbox
apt-get install -y bubblewrap
```
## Build Lifecycle
`ScriptBuilder` lives in `l4d2web/services/overlay_builders.py` next to `WorkshopBuilder`:
```python
class ScriptBuilder:
def build(self, overlay, *, on_stdout, on_stderr, should_cancel):
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
f.write(overlay.script or "")
script_path = f.name
try:
cmd = [
"sudo", "-n",
"/usr/local/libexec/left4me/left4me-script-sandbox",
str(overlay.id), script_path,
]
run_with_streamed_output(cmd, on_stdout, on_stderr, should_cancel)
self._enforce_disk_budget(overlay.id, on_stderr)
finally:
os.unlink(script_path)
def _enforce_disk_budget(self, overlay_id, on_stderr):
size = subprocess.check_output(["du", "-sb", overlay_path(overlay_id)])
if int(size.split()[0]) > 20 * 1024**3:
on_stderr("overlay exceeded 20 GB disk cap")
raise BuildError("disk-cap-exceeded")
```
`run_with_streamed_output` is the existing helper used by `WorkshopBuilder` for `steamcmd`/`curl` invocations. The `should_cancel` callback fires `kill -TERM` on the sudo-`systemd-run` process tree; cgroup-collect tears down the whole scope on exit.
The job worker's existing job-completion path writes `Overlay.last_build_status = 'ok'` on success and `'failed'` on any non-zero exit / `BuildError` / cancel. This is a single column update inside the existing transaction; no new infrastructure.
## UI
### Create modal (`templates/overlays.html`)
The existing modal grows one option in the type radio: `Workshop | Script`. Name field unchanged. After insert, the web app generates `path = str(overlay_id)` for new rows (existing pattern).
### Detail page when `type='script'` (`templates/overlay_detail.html`)
- Plain styled `<textarea>` for `overlay.script` with a Save button → `POST /overlays/{id}/script`. No CodeMirror dependency in v1 (out of scope; keep frontend dep-light).
- "Rebuild" button → `POST /overlays/{id}/build`. Existing pattern from workshop overlays.
- "Wipe overlay" button (red, confirm-modal) → `POST /overlays/{id}/wipe`.
- `last_build_status` indicator badge: empty / "ok" / "failed".
- Live build log via existing SSE plumbing on the related Job row.
### Detail page when `type='workshop'`: unchanged.
### Sections removed
The global-source detail block (`overlay_detail.html` lines 3446) is deleted along with the managed-globals subsystem.
## Routes
`l4d2web/routes/overlay_routes.py` adds:
| Method | Path | Purpose |
|---|---|---|
| POST | `/overlays/{id}/script` | Update `script` text. Auto-enqueue coalesced `build_overlay` job. |
| POST | `/overlays/{id}/wipe` | Invoke `left4me-script-sandbox` with the literal script `find /overlay -mindepth 1 -delete`. Owner/admin only. Refuses if a `build_overlay` for this overlay is running. After success, set `last_build_status=''`. Does not auto-enqueue a rebuild. |
| POST | `/overlays/{id}/build` | Manual rebuild — same pattern as today's workshop overlay manual rebuild. |
Existing `POST /overlays` accepts `type=script` and an optional initial `script` body.
## Permissions
| Action | Who |
|---|---|
| Create script overlay (private, `user_id = me`) | Any authenticated user |
| Create script overlay (system-wide, `user_id = NULL`) | Admin |
| Edit (script body, name) | Owner or admin |
| Wipe / Rebuild | Owner or admin |
| Delete | Owner or admin |
| View | Owner, admin, or any user when `user_id IS NULL` |
These match the existing rules for workshop overlays.
## Job Worker / Scheduler
`services/job_worker.py` drops `"refresh_global_overlays"` from `GLOBAL_OPERATIONS` and removes the corresponding `refresh_global_overlays_running` and `blocked_servers_by_overlay` plumbing that exists only for the global-maps subsystem. The remaining mutex rules already cover:
- `build_overlay` per overlay (one running build per overlay).
- `install` and `refresh_workshop_items` as global mutexes.
- Server start/init blocks if any `build_overlay` for an overlay in the server's blueprint is running.
No new rules are needed for `script` — its build is mechanically identical to a `workshop` build from the scheduler's perspective.
## Daily Refresh — Removed
This iteration deletes the daily-refresh subsystem entirely:
- `deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.timer` and `.service` — deleted.
- `flask refresh-global-overlays` CLI command in `l4d2web/cli.py` — deleted.
- No replacement timer, no replacement CLI, no `auto_refresh` column on `Overlay`.
The only build trigger after this change is the user clicking Rebuild on the detail page (or the auto-enqueue when they Save the script body). A scheduled-refresh feature is reintroduced in a future iteration designed against concrete operational needs.
## Risks
- **Sandbox escape via kernel bug.** `bwrap` has a strong track record but is not invulnerable. Mitigated by running as `l4d2-sandbox` (no privileged capabilities), no setuid binaries reachable, `no_new_privs` implicit. A successful escape would land in an unprivileged UID with no host secrets reachable.
- **Disk fill via runaway script.** A script that writes a 20 GB+ payload to `/overlay` succeeds inside the sandbox and only fails afterward at the post-build `du` check. The 20 GB lands on disk transiently. Mitigated by the kernel's per-cgroup IO accounting being unaware of file size (no good IO-time limit), accepting this as a v1 trade-off; a future improvement is overlay-dir-on-its-own-filesystem with a quota.
- **Network exfiltration.** Script can connect to anything outbound, including internal IPs. Acceptable for the current trust model (semi-public; users have credentials). Egress firewall is out of scope.
- **Build-mid-server-running.** The scheduler refuses `build_overlay` for an overlay attached to a starting/running server (existing rule, unchanged). Good. A user can still rebuild while a server using a *different* blueprint runs concurrently.
- **Wipe race with running build.** The wipe endpoint refuses if a `build_overlay` for the overlay is running. Without this check, a wipe could blow away files mid-script and produce undefined results.
- **Stale `last_build_status`.** A row inserted via direct DB write or restored from backup could carry an `'ok'` status that no longer reflects reality. Treated as cosmetic; users can rebuild to refresh.
- **Sudoers misconfig.** A typo in the sudoers fragment could grant `left4me` more than intended. Mitigated by deploy-artifact tests asserting the exact expected lines.
- **DB row deletion racing the sandbox.** A user deleting an overlay while its build runs would invalidate the bind-mount target. Mitigated by the existing scheduler rule that tracks running overlays; delete should refuse if a build is running. (Existing pattern for workshop overlays; reuse.)
- **Migration drops globals tables.** Acceptable for the test deploy. Production rollout would need a different migration story; this spec explicitly assumes test-deploy DB wipe.
## Out Of Scope
- **Scheduled / daily refresh.** Intentionally removed in this iteration. Reintroduced later, designed against the use cases that emerge.
- **Per-overlay resource overrides.** All script overlays share the same 1 h / 4 GB / 20 GB envelope. If a real overlay needs more (l4d2center mirror at peak), revisit.
- **CodeMirror or other rich script editor.** Plain `<textarea>` in v1.
- **Egress allowlist / proxy.** No network restrictions on the sandbox in v1.
- **`$CACHE` scratch dir** persisted across builds. Users cache inside the overlay dir if they want; idempotency model is "script runs against existing dir."
- **Multi-tenant cgroup tree per user.** All sandboxes share the same cgroup-quota envelope.
- **Revision history on `script` column.** No `overlay_script_revisions` table; whatever's in the row is the current script.
- **Auto-seeding of l4d2center / cedapug equivalents.** Admin pastes the script post-deploy.
- **Migration that preserves existing global-map overlay rows.** Test deploy DB is wiped.
- **Container-per-build (podman / docker).** Heavier than `bwrap`; revisit only if multi-tenant escalates to "fully public sign-up."
- **left4me-aware helpers** (`workshop`, `download`, `extract`) inside the sandbox. Pure bash + host `/usr` only.
## Implementation Boundaries
- **`l4d2host` is unchanged.** The host library has no concept of overlay types and the mount layer (`KernelOverlayFSMounter`) doesn't care how the overlay dir got populated.
- **The `OverlayBuilder` Protocol is unchanged** — same `build(overlay, *, on_stdout, on_stderr, should_cancel)` signature. `ScriptBuilder` plugs into the existing registry.
- **The job worker model is unchanged.** Same operations, same logs, same SSE plumbing, same scheduler rules (minus the refresh_global_overlays entry).
- **No new application-level dependencies.** Vendored HTMX, no new Python packages. Two new system dependencies: `bubblewrap` apt package and the `l4d2-sandbox` system user.
- **No new config keys.** Same env files (`/etc/left4me/host.env`, `/etc/left4me/web.env`).
- **DB migration is destructive for global-maps overlay rows.** This is acceptable per the test-deploy assumption; a production-rollout follow-up would need to address it.
- The companion implementation plan governs task ordering and verification commands. Implementation must not start without explicit user approval per that plan's gate.