left4me/docs/superpowers/plans/2026-05-08-l4d2-script-overlays.md
mwiegand 78ead0b41d
docs(specs): script overlay type — design + implementation plan
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 15:27:14 +02:00


L4D2 Script Overlays Implementation Plan

Approval status: User-approved 2026-05-08. Implementation proceeds.

Goal: Implement the script overlay type per docs/superpowers/specs/2026-05-08-l4d2-script-overlays-design.md. Add an Overlay.script TEXT column and Overlay.last_build_status enum-string column, a ScriptBuilder that runs user bash inside a bubblewrap + systemd-run --scope sandbox via a new left4me-script-sandbox privileged helper, route + UI surface for editing/wiping/rebuilding, and delete the entire managed-globals (l4d2center_maps, cedapug_maps) subsystem and its daily-refresh timer/CLI.

Architecture: The web app continues to enqueue build_overlay jobs for any overlay row. The job worker dispatches via BUILDERS[overlay.type].build(...). After this change BUILDERS = {"workshop": WorkshopBuilder(), "script": ScriptBuilder()}. The new ScriptBuilder writes overlay.script to a tmpfile and execs sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <tmpfile>, which itself execs systemd-run --scope --collect ... -- bwrap [namespace flags] /bin/bash /script.sh. stdout/stderr stream through the existing run_with_streamed_output helper into the existing job-log SSE plumbing. The job-completion path writes Overlay.last_build_status based on the build outcome. The kernel-overlayfs mount layer (KernelOverlayFSMounter) is unchanged.
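
As a rough illustration of the dispatch and status write described above, here is a minimal sketch. The `Overlay` dataclass, the `EchoBuilder` placeholder, the `run_build_overlay_job` name, and the `build(overlay, on_stdout)` signature are all assumptions made so the example runs on its own; the real builders and job-completion path live in the project and may look different.

```python
from dataclasses import dataclass
from typing import Callable, Dict


class BuildError(Exception):
    """Raised by a builder when the build fails (hypothetical name)."""


@dataclass
class Overlay:
    # Stand-in for the real model; only the fields this sketch touches.
    id: int
    type: str
    script: str = ""
    last_build_status: str = ""


class EchoBuilder:
    """Placeholder builder so the sketch runs without a sandbox."""

    def build(self, overlay: Overlay, on_stdout: Callable[[str], None]) -> None:
        if not overlay.script:
            raise BuildError("empty script")
        on_stdout(f"building overlay {overlay.id}\n")


# After this change the registry holds exactly these two keys.
BUILDERS: Dict[str, EchoBuilder] = {"workshop": EchoBuilder(), "script": EchoBuilder()}


def run_build_overlay_job(overlay: Overlay, on_stdout: Callable[[str], None]) -> str:
    """Dispatch on overlay.type, then record the outcome on the row."""
    try:
        BUILDERS[overlay.type].build(overlay, on_stdout)
    except BuildError:
        overlay.last_build_status = "failed"
    else:
        overlay.last_build_status = "ok"
    return overlay.last_build_status
```

The point of the registry is that the worker never branches on the overlay type itself; adding or removing a type is a one-line change to `BUILDERS`.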


Locked Decisions

See docs/superpowers/specs/2026-05-08-l4d2-script-overlays-design.md for design rationale. Implementation-relevant summary:

  • Final overlay type list: workshop (unchanged) + script (new). Drop l4d2center_maps, cedapug_maps.
  • New columns on overlays: script TEXT NOT NULL DEFAULT '', last_build_status VARCHAR(16) NOT NULL DEFAULT ''.
  • Drop tables (FK order): global_overlay_item_files, global_overlay_items, global_overlay_sources.
  • ScriptBuilder in l4d2web/services/overlay_builders.py, uses existing run_with_streamed_output.
  • Privileged helper left4me-script-sandbox (bash, mode 0755, owned root). systemd-run --scope --collect -p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 -p CPUQuota=200% -p RuntimeMaxSec=3600 -- bwrap …. Limits 1 h walltime, 4 GB RAM, 20 GB post-build du cap.
  • New system user l4d2-sandbox (/usr/sbin/nologin, no home). New apt dep bubblewrap.
  • Sudoers verb-unrestricted: left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox.
  • Daily refresh subsystem deleted: left4me-refresh-global-overlays.{timer,service} and flask refresh-global-overlays CLI removed. No replacement.
  • Wipe is the same sandbox helper invoked with the literal script find /overlay -mindepth 1 -delete.
  • auto_refresh column NOT added in this iteration.
  • Test deploy DB is wiped on rollout; migration includes DELETE FROM overlays WHERE type IN ('l4d2center_maps', 'cedapug_maps') for safety.

Current Gap

  • l4d2web/models.py Overlay has no script or last_build_status columns. The 3 globals tables are present.
  • l4d2web/services/overlay_builders.py BUILDERS = {"workshop": WorkshopBuilder(), "l4d2center_maps": GlobalMapOverlayBuilder(), "cedapug_maps": GlobalMapOverlayBuilder()}. No ScriptBuilder.
  • l4d2web/services/{global_map_sources,global_overlay_refresh,global_map_cache,global_overlays}.py exist and are referenced by routes / CLI.
  • l4d2web/services/job_worker.py carries refresh_global_overlays_running plumbing.
  • l4d2web/cli.py defines refresh-global-overlays.
  • l4d2web/routes/overlay_routes.py has no /script, /wipe, or /build endpoints for non-workshop types.
  • l4d2web/templates/overlays.html create modal type radio offers only workshop.
  • l4d2web/templates/overlay_detail.html has a global-source block (~lines 34–46) that should not survive.
  • deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.{timer,service} exist.
  • deploy/deploy-test-server.sh provisions global_overlay_cache/ and does not provision l4d2-sandbox or install bubblewrap.
  • Seven tests/test_global_*.py files exist and reference removed code.

Task 1: Schema migration (alembic 0005)

Files:

  • Create: l4d2web/alembic/versions/0005_script_overlays.py (revises 0004_drop_legacy_external_overlay_type).
  • Modify: l4d2web/models.py (Overlay gains script and last_build_status columns; remove GlobalOverlaySource, GlobalOverlayItem, GlobalOverlayItemFile model classes).
  • Modify: l4d2web/tests/test_overlay_models.py (or whichever existing test asserts the Overlay schema; create one if absent) — assert new columns present.

Test plan (RED first):

  1. tests/test_alembic_migrations.py::test_upgrade_0005_adds_script_columns — apply migrations to a fresh in-memory SQLite, assert script and last_build_status columns present on overlays, assert no global_overlay_* tables, assert old data wipe DELETE FROM overlays WHERE type IN (...) is part of the upgrade.
  2. tests/test_alembic_migrations.py::test_downgrade_0005_restores_globals: only if the project's migration policy supports downgrade. Otherwise skip with pytest.skip; the kernel-overlayfs migration is one-way, so follow that precedent.
  3. tests/test_overlay_models.py::test_overlay_has_script_columns: a new Overlay(...) instance has script='' and last_build_status='' defaults.

Implementation:

  • Migration uses op.drop_table('global_overlay_item_files') etc. in correct FK order; uses op.add_column('overlays', sa.Column('script', sa.Text(), nullable=False, server_default='')) and similar for last_build_status (sa.String(16)).
  • The DELETE FROM overlays WHERE type IN ('l4d2center_maps','cedapug_maps') runs before the column additions so the operation is straightforward — these rows do not reference the new columns.
  • models.py: delete the three globals model classes outright; add the two new columns to Overlay with explicit defaults.
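
The ordering of the migration steps can be tried out against an in-memory SQLite database before writing the Alembic revision. The sketch below uses raw SQL through the standard-library `sqlite3` module rather than Alembic's `op.*` calls, and the pre-migration schema in `make_pre_0005_db` is a simplified guess (the real tables have more columns; the FK direction is inferred from the drop order given above).

```python
import sqlite3


def upgrade(conn: sqlite3.Connection) -> None:
    """Apply the 0005 steps as raw SQL, in the order the plan describes."""
    # 1. Safety wipe of rows whose type no longer has a builder.
    conn.execute(
        "DELETE FROM overlays WHERE type IN ('l4d2center_maps', 'cedapug_maps')"
    )
    # 2. Drop the globals tables, children first.
    for table in (
        "global_overlay_item_files",
        "global_overlay_items",
        "global_overlay_sources",
    ):
        conn.execute(f"DROP TABLE {table}")
    # 3. Add the new columns; NOT NULL needs a default for existing rows.
    conn.execute("ALTER TABLE overlays ADD COLUMN script TEXT NOT NULL DEFAULT ''")
    conn.execute(
        "ALTER TABLE overlays ADD COLUMN last_build_status "
        "VARCHAR(16) NOT NULL DEFAULT ''"
    )
    conn.commit()


def make_pre_0005_db() -> sqlite3.Connection:
    """Build a toy pre-migration schema (column sets are simplified guesses)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE overlays (id INTEGER PRIMARY KEY, name TEXT, type TEXT);
        CREATE TABLE global_overlay_sources (id INTEGER PRIMARY KEY);
        CREATE TABLE global_overlay_items (
            id INTEGER PRIMARY KEY,
            source_id INTEGER REFERENCES global_overlay_sources(id));
        CREATE TABLE global_overlay_item_files (
            id INTEGER PRIMARY KEY,
            item_id INTEGER REFERENCES global_overlay_items(id));
        INSERT INTO overlays (name, type)
            VALUES ('a', 'workshop'), ('b', 'cedapug_maps');
        """
    )
    return conn
```

Running `upgrade(make_pre_0005_db())` and then inspecting `PRAGMA table_info(overlays)` gives the same assertions the first test in the plan asks for.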

Verification:

python3 -m pytest l4d2web/tests/test_alembic_migrations.py l4d2web/tests/test_overlay_models.py -q

Commit: feat(l4d2-web): script overlay schema — add overlay.script + last_build_status, drop globals tables


Task 2: ScriptBuilder + BUILDERS registry update

Files:

  • Modify: l4d2web/services/overlay_builders.py — add ScriptBuilder, remove GlobalMapOverlayBuilder, change BUILDERS dict.
  • Rewrite: l4d2web/tests/test_overlay_builders.py — drop globals-builder tests, add ScriptBuilder tests.

Test plan (RED first):

  1. test_overlay_builders.py::test_builders_registry: set(BUILDERS) == {"workshop", "script"}. Assert "l4d2center_maps" and "cedapug_maps" and "external" are absent.
  2. test_overlay_builders.py::test_script_builder_invokes_helper — patch run_with_streamed_output to capture argv; build an Overlay(id=42, type='script', script='echo hi'); assert argv shape ["sudo", "-n", "/usr/local/libexec/left4me/left4me-script-sandbox", "42", <script_path>] and that the script_path file exists with content "echo hi" at invocation time. Verify the tmpfile is unlinked after build.
  3. test_overlay_builders.py::test_script_builder_disk_cap — fake subprocess.check_output for du to return 25000000000; build raises BuildError("disk-cap-exceeded") and on_stderr was called with the cap message.
  4. test_overlay_builders.py::test_script_builder_streams_output — fake run_with_streamed_output invokes both on_stdout("hello\n") and on_stderr("warn\n"); both lambda lists capture the lines.
  5. test_overlay_builders.py::test_script_builder_cancel: should_cancel returns True after the first stdout line; assert run_with_streamed_output propagated cancellation (this is the existing helper's contract; the test only ensures we pass should_cancel through and don't run the disk-budget check on cancel).
  6. test_overlay_builders.py::test_workshop_builder_unchanged — smoke test that WorkshopBuilder still exists and is invokable (regression guard against accidental removal during refactor).
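
The disk-cap behaviour exercised by test 3 might look roughly like the sketch below. The function names and the `BuildError` class are placeholders, not the spec's code; only the `du -sb` call, the first-integer parse, and the `20 * 1024**3` cap come from this plan.

```python
import subprocess
from pathlib import Path
from typing import Callable

DISK_CAP_BYTES = 20 * 1024**3  # 20 GiB, per the plan


class BuildError(Exception):
    """Hypothetical local exception; reuse the project's own if one exists."""


def parse_du_bytes(output: bytes) -> int:
    """Return the first whitespace-delimited integer of `du -sb` output."""
    return int(output.split()[0])


def enforce_disk_budget(path: Path, on_stderr: Callable[[str], None]) -> int:
    """Raise BuildError if the tree at `path` exceeds the cap."""
    # Looked up on the module at call time, so tests can patch check_output.
    used = parse_du_bytes(subprocess.check_output(["du", "-sb", str(path)]))
    if used > DISK_CAP_BYTES:
        on_stderr(f"overlay uses {used} bytes, cap is {DISK_CAP_BYTES}\n")
        raise BuildError("disk-cap-exceeded")
    return used
```

With the faked value from test 3, 25000000000 bytes is above the cap of 21474836480 bytes, so the check raises.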

Implementation:

  • Add import os, subprocess, tempfile at the top of overlay_builders.py if not present.
  • ScriptBuilder exactly as in the spec (verbatim copy from the design doc, §Build Lifecycle).
  • Define a small BuildError exception class if one doesn't already exist locally; reuse the existing one if WorkshopBuilder already raises a similar type.
  • _enforce_disk_budget calls subprocess.check_output(["du", "-sb", str(overlay_path(overlay_id))]); the existing overlay_path helper in the module already returns the absolute Path. Parse first whitespace-delimited integer; cap is 20 * 1024**3.
  • Job-completion path: locate the existing path that handles build_overlay job success/failure (likely in services/job_worker.py or a related orchestration module). Add a single column write: on success last_build_status='ok', on BuildError / non-zero exit / cancel last_build_status='failed'. Add a tests/test_job_worker.py::test_build_overlay_writes_last_build_status covering both branches.
  • Remove GlobalMapOverlayBuilder class and any helper functions it owns that are not used elsewhere.
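
For orientation only (the authoritative ScriptBuilder is the one in the design doc), the tmpfile-and-helper step could be sketched as follows. The `run_sandbox` name and the one-argument `runner` callable are simplifications introduced here; the real `run_with_streamed_output` also takes output and cancellation callbacks whose exact signature is not shown in this plan.

```python
import os
import tempfile
from typing import Callable, List

HELPER = "/usr/local/libexec/left4me/left4me-script-sandbox"

# Hypothetical runner signature: takes argv, returns the exit code.
Runner = Callable[[List[str]], int]


def run_sandbox(overlay_id: int, script_text: str, runner: Runner) -> int:
    """Write the script to a tmpfile, run the helper on it, always clean up."""
    fd, script_path = tempfile.mkstemp(prefix="left4me-script-", suffix=".sh")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(script_text)
        argv = ["sudo", "-n", HELPER, str(overlay_id), script_path]
        return runner(argv)
    finally:
        # Runs on success, failure, and cancellation alike.
        os.unlink(script_path)
```

Putting the unlink in `finally` is what lets test 2 assert both that the file exists while the helper runs and that it is gone afterwards.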

Verification:

python3 -m pytest l4d2web/tests/test_overlay_builders.py l4d2web/tests/test_job_worker.py -q

Commit: feat(l4d2-web): ScriptBuilder + BUILDERS registry update


Task 3: Delete global-overlay services + CLI command + their tests

Files:

  • Delete: l4d2web/services/global_map_sources.py
  • Delete: l4d2web/services/global_overlay_refresh.py
  • Delete: l4d2web/services/global_map_cache.py
  • Delete: l4d2web/services/global_overlays.py
  • Modify: l4d2web/cli.py (remove the refresh-global-overlays command, lines ~44–55; drop any imports that become orphaned).
  • Delete: l4d2web/tests/test_global_map_sources.py
  • Delete: l4d2web/tests/test_global_overlay_models.py
  • Delete: l4d2web/tests/test_global_overlay_builders.py
  • Delete: l4d2web/tests/test_global_overlay_cli.py
  • Delete: l4d2web/tests/test_global_overlay_refresh.py
  • Delete: l4d2web/tests/test_global_overlays.py
  • Delete: l4d2web/tests/test_global_map_cache.py
  • Audit & fix: any other module that imports the deleted modules. Likely candidates: l4d2web/app.py (CLI registration), routes/overlay_routes.py, routes/page_routes.py. Resolve by deletion of the dead import / call site, not by stubbing.
  • Modify: pyproject.toml — drop py7zr from dependencies (only used by the deleted globals subsystem).

Test plan:

  1. RED-first via grep: grep -RIn 'global_map_sources\|global_overlay_refresh\|global_map_cache\|global_overlays\|refresh_global_overlays\|GlobalMapOverlayBuilder' l4d2web/ deploy/ — should return zero hits at the end of this task. Add this as tests/test_no_globals_references.py::test_no_globals_imports if you want it as a permanent regression guard, otherwise spot-check.
  2. Existing tests/test_cli.py (or whichever covers Flask CLI) loses any cases for refresh-global-overlays; add a test_refresh_global_overlays_command_removed that asserts the click command is not registered.
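
If the grep is kept as a permanent guard, a pure-Python scan avoids depending on a `grep` binary in CI. The sketch below is one possible shape; `find_globals_references` and its `skip` parameter are made-up names, and the `skip` list exists because the guard file itself necessarily contains the forbidden strings.

```python
import re
from pathlib import Path
from typing import Iterable, List, Tuple

# Same alternatives as the grep pattern in the plan.
FORBIDDEN = re.compile(
    r"global_map_sources|global_overlay_refresh|global_map_cache"
    r"|global_overlays|refresh_global_overlays|GlobalMapOverlayBuilder"
)


def find_globals_references(
    roots: Iterable[Path], skip: Iterable[Path] = ()
) -> List[Tuple[Path, int, str]]:
    """Return (path, line number, line) for every forbidden reference."""
    skipped = {p.resolve() for p in skip}
    hits = []
    for root in roots:
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.resolve() in skipped:
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # binary or unreadable; grep -I skips binaries too
            for lineno, line in enumerate(text.splitlines(), start=1):
                if FORBIDDEN.search(line):
                    hits.append((path, lineno, line.strip()))
    return hits
```

A test would then assert `find_globals_references([Path("l4d2web"), Path("deploy")], skip=[Path(__file__)]) == []`.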

Implementation:

  • Delete files via git rm.
  • In cli.py, remove the command function and its @app.cli.command(...) decorator. Drop any helper imports that become orphaned.
  • Remove py7zr from pyproject.toml and re-lock if a lockfile is present.

Verification:

python3 -m pytest l4d2web/tests/ -q
grep -RIn 'global_map_sources\|global_overlay_refresh\|global_map_cache\|global_overlays\|refresh_global_overlays\|GlobalMapOverlayBuilder' l4d2web/ deploy/ || echo "clean"

Commit: refactor(l4d2-web): drop global-overlays subsystem in favor of script type


Task 4: Job worker — drop refresh_global_overlays from scheduler

Files:

  • Modify: l4d2web/services/job_worker.py — remove "refresh_global_overlays" from GLOBAL_OPERATIONS; remove refresh_global_overlays_running field from SchedulerState and any references in can_start(); check whether blocked_servers_by_overlay was added solely for the globals subsystem and remove if so.
  • Modify: l4d2web/tests/test_job_worker.py — drop refresh_global_overlays truth-table rows; add explicit build_overlay truth-table cases for script-type overlays (mechanically identical to workshop, but pinned by test).

Test plan:

  1. test_job_worker.py::test_global_operations_set: GLOBAL_OPERATIONS == {"install", "refresh_workshop_items"} (or whatever subset remains; pin it).
  2. test_job_worker.py::test_build_overlay_script_type_blocks_per_overlay — start build_overlay(overlay_id=7) for a script-type overlay; assert second build_overlay(overlay_id=7) cannot start; assert build_overlay(overlay_id=8) can.
  3. test_job_worker.py::test_build_overlay_blocks_server_init_on_blueprint_overlay — existing test, may need re-pinning if it referenced globals.

Implementation:

  • Remove the field from the dataclass / TypedDict that backs SchedulerState.
  • Remove any update sites that flipped the flag (the worker's enqueue / on-start / on-complete paths).
  • The remaining mutex rules (install / refresh_workshop_items are global; build_overlay per-overlay; server ops block on overlays in their blueprint) are unchanged structurally.
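
To make the remaining rules concrete, here is a toy version of the truth table. The field names on `SchedulerState`, the `blueprint_overlay_ids` parameter, and the reading of "global" as "excludes everything else" are assumptions; the real `can_start()` may track more state, for example running servers blocking a rebuild.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set

GLOBAL_OPERATIONS = {"install", "refresh_workshop_items"}


@dataclass
class SchedulerState:
    # Hypothetical shape; note there is no refresh_global_overlays_running.
    global_running: bool = False
    building_overlays: Set[int] = field(default_factory=set)


def can_start(
    state: SchedulerState,
    operation: str,
    overlay_id: Optional[int] = None,
    blueprint_overlay_ids: Iterable[int] = (),
) -> bool:
    if state.global_running:
        return False  # assumed: a global operation excludes everything
    if operation in GLOBAL_OPERATIONS:
        return not state.building_overlays
    if operation == "build_overlay":
        return overlay_id not in state.building_overlays  # per-overlay mutex
    # Server operations wait for builds of overlays in their blueprint.
    return not (state.building_overlays & set(blueprint_overlay_ids))
```

Nothing in this table depends on the overlay type, which is why script overlays behave like workshop overlays here.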

Verification:

python3 -m pytest l4d2web/tests/test_job_worker.py -q

Commit: refactor(l4d2-web): drop refresh_global_overlays from scheduler


Task 5: Routes (script update / wipe / build)

Files:

  • Modify: l4d2web/routes/overlay_routes.py — add three POST endpoints.
  • Create: l4d2web/tests/test_script_overlay_routes.py.

Test plan (RED first):

  1. test_script_overlay_routes.py::test_create_script_overlay — POST /overlays with form {"name": "x", "type": "script"} as a regular user → 302 to detail; row exists with type='script', script='', last_build_status='', user_id=current_user.id, path=str(id).
  2. test_script_overlay_routes.py::test_admin_creates_system_wide_script_overlay — admin POST with system-wide flag → row has user_id=NULL.
  3. test_script_overlay_routes.py::test_update_script_body_enqueues_build — POST /overlays/{id}/script with {"script": "echo new"} → row.script updated; one new build_overlay job enqueued for the overlay; second immediate POST coalesces (no second job inserted while first is pending).
  4. test_script_overlay_routes.py::test_manual_rebuild — POST /overlays/{id}/build → enqueues build_overlay; coalesces.
  5. test_script_overlay_routes.py::test_wipe_runs_find_delete: POST /overlays/{id}/wipe → invokes ScriptBuilder.build (or the underlying helper) with the literal script find /overlay -mindepth 1 -delete. After success, row.last_build_status == ''. Does not enqueue a build_overlay.
  6. test_script_overlay_routes.py::test_wipe_refuses_during_running_build — set scheduler state to build_overlay(overlay_id=7) running; POST /overlays/7/wipe → 409 (or whatever the existing pattern uses for scheduler conflicts), no sandbox invocation.
  7. test_script_overlay_routes.py::test_permissions_non_owner_denied — user A creates private script overlay; user B POSTs /overlays/{id}/script → 403.
  8. test_script_overlay_routes.py::test_permissions_admin_can_edit_any — admin POSTs /overlays/{id}/script for user A's row → 200.

Implementation:

  • Mirror the existing _can_edit_overlay() permission helper.
  • The /wipe endpoint can either (a) call ScriptBuilder directly with a synthetic Overlay-like object whose .script is the find command and whose .id is the real overlay id, or (b) factor a _run_sandbox(overlay_id, script_text, on_stdout, on_stderr, should_cancel) helper out of ScriptBuilder.build() and call it from both. (b) is cleaner; do (b).
  • Wipe runs synchronously in the request thread (small, fast). It does NOT enqueue a job. Surface log output as flash messages or by streaming through the existing log infra — pick whichever matches the existing wipe-equivalent pattern (workshop overlays don't have a wipe; closest analog is the existing delete-overlay flow).
  • The /script endpoint enqueues via the same enqueue_build_overlay(overlay_id) helper used by workshop overlays' add/remove flows. Coalescing is already implemented there.
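
The coalescing that tests 3 and 4 rely on already exists in the project; the following in-memory stand-in only illustrates the expected behaviour. `Job`, `JobQueue`, and the status strings are invented for the example, and whether a running (as opposed to pending) job also coalesces is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    operation: str
    overlay_id: int
    status: str = "pending"  # assumed lifecycle: pending, running, done


@dataclass
class JobQueue:
    # In-memory stand-in for the jobs table.
    jobs: List[Job] = field(default_factory=list)

    def enqueue_build_overlay(self, overlay_id: int) -> Job:
        """Return the already-pending build for this overlay, else add one."""
        for job in self.jobs:
            key = (job.operation, job.overlay_id, job.status)
            if key == ("build_overlay", overlay_id, "pending"):
                return job
        job = Job("build_overlay", overlay_id)
        self.jobs.append(job)
        return job
```

Two quick saves of the script body therefore produce one build, and that build picks up whichever script text is stored when it starts.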

Verification:

python3 -m pytest l4d2web/tests/test_script_overlay_routes.py l4d2web/tests/test_overlay_routes.py -q

Commit: feat(l4d2-web): script overlay routes (script update / wipe / build)


Task 6: Templates (overlays.html + overlay_detail.html)

Files:

  • Modify: l4d2web/templates/overlays.html (add script to the create-modal type radio, lines ~29–49).
  • Modify: l4d2web/templates/overlay_detail.html (add a {% if overlay.type == 'script' %} block with textarea + Save / Rebuild / Wipe buttons + status badge; delete the global-source block, lines ~34–46).
  • Modify: l4d2web/tests/test_pages.py — assert script-section renders for type=script, workshop-section renders for type=workshop, global-source-section is absent.

Test plan:

  1. test_pages.py::test_overlay_create_modal_offers_script_type — GET /overlays; HTML contains value="script" radio.
  2. test_pages.py::test_overlay_detail_script_section — create script overlay, GET /overlays/{id}; HTML contains <textarea name="script">, "Rebuild" button, "Wipe" button, status badge element.
  3. test_pages.py::test_overlay_detail_workshop_section_unchanged — existing workshop detail still has thumbnail grid, add-item form, etc.
  4. test_pages.py::test_overlay_detail_no_global_source_block — page HTML has no element from the deleted global-source block (check for an attribute or string unique to that block).

Implementation:

  • Detail-page wipe button uses a small confirm-modal pattern (copy from the existing delete-overlay confirm modal).
  • Status badge: existing CSS classes for ok/warn/error already exist in static/; reuse them.
  • No new JS deps. Plain <form method="post"> with HTMX hx-post for the script update if a streaming UX is desired (match existing patterns).

Verification:

python3 -m pytest l4d2web/tests/test_pages.py -q

Manual: start dev server (flask run), create a script overlay, paste echo "hi" > foo, click Save, watch log stream. Then click Wipe; confirm dir is empty. Then click Rebuild; confirm foo reappears.

Commit: feat(l4d2-web): script overlay UI


Task 7: Libexec sandbox helper + sudoers + deploy-artifacts test

Files:

  • Create: deploy/files/usr/local/libexec/left4me/left4me-script-sandbox (bash, mode 0755 after deploy, owned root).
  • Modify: deploy/files/etc/sudoers.d/left4me — append the rule.
  • Modify: deploy/tests/test_deploy_artifacts.py — assert helper file present + sudoers contains the new line.

Test plan (RED first):

  1. test_deploy_artifacts.py::test_script_sandbox_helper_present — file exists, mode bits indicate 0755 (or whatever the test framework allows checking pre-deploy), shebang is #!/bin/bash.
  2. test_deploy_artifacts.py::test_sudoers_includes_script_sandbox_rule — sudoers file contains the exact line left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox.
  3. Optional integration test (skip on non-Linux dev): drive the helper as a subprocess with a synthesized fake /var/lib/left4me/overlays/1/ and a no-op script, assert bwrap invocation happens (use a mock systemd-run or LEFT4ME_SCRIPT_SANDBOX_DRY_RUN=1 env that prints the would-be invocation and exits 0). Mirrors the LEFT4ME_OVERLAY_PRINT_ONLY=1 pattern from the kernel-overlayfs helper test.

Implementation:

  • Helper script verbatim from the spec §Sandbox.
  • Sudoers fragment: append (don't replace existing rules). The existing fragment has rules for left4me-overlay, left4me-systemctl, left4me-journalctl — match the same formatting (one rule per line, no trailing whitespace).
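
The two deploy-artifact assertions could be backed by small helpers like the ones below. The function names are invented for illustration. The sketch checks only the owner-executable bit rather than an exact 0755, since git records just the executable bit and the other mode bits in a checkout depend on the umask.

```python
import stat
from pathlib import Path
from typing import List

SUDOERS_RULE = (
    "left4me ALL=(root) NOPASSWD: "
    "/usr/local/libexec/left4me/left4me-script-sandbox"
)


def helper_problems(helper: Path) -> List[str]:
    """List what is wrong with the helper file; empty means it looks fine."""
    if not helper.is_file():
        return ["missing"]
    problems = []
    if not stat.S_IMODE(helper.stat().st_mode) & stat.S_IXUSR:
        problems.append("not executable")
    lines = helper.read_text(encoding="utf-8").splitlines()
    if lines[:1] != ["#!/bin/bash"]:
        problems.append("unexpected shebang")
    return problems


def sudoers_has_rule(sudoers: Path) -> bool:
    """True if the fragment contains the rule as an exact, whole line."""
    return SUDOERS_RULE in sudoers.read_text(encoding="utf-8").splitlines()
```

Matching the whole line, not a substring, also catches trailing whitespace or an accidentally appended argument restriction.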

Verification:

python3 -m pytest deploy/tests/test_deploy_artifacts.py -q
bash -n deploy/files/usr/local/libexec/left4me/left4me-script-sandbox

Commit: feat(deploy): left4me-script-sandbox helper + sudoers fragment


Task 8: Deploy script — provision l4d2-sandbox + bubblewrap; drop globals timer

Files:

  • Modify: deploy/deploy-test-server.sh — add useradd --system ... l4d2-sandbox, add apt-get install -y bubblewrap, ensure helper installation step picks up left4me-script-sandbox (likely automatic if it's a glob in deploy/files/usr/local/libexec/left4me/*); drop the mkdir global_overlay_cache line if present.
  • Delete: deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.timer
  • Delete: deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.service
  • Modify: deploy/tests/test_deploy_artifacts.py — assert the two unit files are absent; assert useradd l4d2-sandbox and apt-get install ... bubblewrap lines are present in the deploy script.

Test plan:

  1. test_deploy_artifacts.py::test_globals_refresh_units_removed — files do not exist under deploy/files/usr/local/lib/systemd/system/.
  2. test_deploy_artifacts.py::test_deploy_script_provisions_sandbox_user — grep the deploy script for the useradd line.
  3. test_deploy_artifacts.py::test_deploy_script_installs_bubblewrap — grep for bubblewrap in apt invocations.

Implementation:

  • useradd line uses --system --no-create-home --shell /usr/sbin/nologin. Idempotency: wrap with id l4d2-sandbox &>/dev/null || useradd ....
  • apt-get install: append bubblewrap to whatever package list the script already maintains.
  • Globals timer/service deletions: git rm.

Verification:

python3 -m pytest deploy/tests/ -q
shellcheck deploy/deploy-test-server.sh deploy/files/usr/local/libexec/left4me/left4me-script-sandbox

Commit: chore(deploy): provision l4d2-sandbox + bubblewrap; drop globals refresh timer


Task 9: Full pytest run + drift fixes

Files: as needed across the repo.

Test plan: run the full test suite for both packages; chase down any drift caused by removed model classes, dropped imports, or template changes.

python3 -m pytest l4d2web/tests/ -q
python3 -m pytest l4d2host/tests/ -q
python3 -m pytest deploy/tests/ -q

Implementation: fix what breaks. Common drift sources to expect:

  • Tests that imported from deleted modules.
  • Tests that asserted exact BUILDERS keyset (good — they should have been updated in Task 2).
  • Tests that built fixtures with type='l4d2center_maps' or type='cedapug_maps' — those tests likely belong to the deleted set or need conversion to type='script'.
  • Template snapshot tests (if any) that captured the deleted global-source block.

Verification: all three suites green.

Commit: chore(l4d2-web): test suite drift fixes after script-overlays migration (only if drift fixes needed; skip if Tasks 1–8 left the suite green)


End-to-end deployment verification (manual, on test host)

After all tasks committed:

  1. Reset deploy: run deploy/deploy-test-server.sh from clean state. Confirm bubblewrap installed (dpkg -l bubblewrap), l4d2-sandbox user exists (id l4d2-sandbox), /usr/local/libexec/left4me/left4me-script-sandbox is mode 0755 and root-owned, sudo -ln as left4me shows the new rule.
  2. Sandbox smoke: as left4me, write /tmp/echo.sh containing echo $(whoami) > /overlay/sentinel. mkdir -p /var/lib/left4me/overlays/1. sudo /usr/local/libexec/left4me/left4me-script-sandbox 1 /tmp/echo.sh. Confirm /var/lib/left4me/overlays/1/sentinel contains l4d2-sandbox and is owned by l4d2-sandbox. Confirm /etc/passwd, /var/lib/left4me/l4d2web.db, and /home are not visible inside the sandbox by running probe scripts.
  3. Resource limits:
    • dd if=/dev/zero of=/overlay/big bs=1M count=25000 → succeeds inside sandbox; ScriptBuilder._enforce_disk_budget flags the build failed; last_build_status='failed'.
    • sleep 7200 → killed at 1 h by RuntimeMaxSec=3600.
    • Memory hog (python3 -c "x=' '*(5*1024**3)") → OOM at 4 GB.
  4. App-level happy path: as a non-admin user, create a script overlay via the UI, paste an old competitive_rework-style script, Save → build runs, succeeds, addons appear in overlays/{id}/left4dead2/. Stack onto a server blueprint, start the server, verify content mounts via the L4D2 admin console (map workshop/...).
  5. Wipe: click Wipe → dir empty (find -delete output in log). Click Rebuild → repopulates. last_build_status cycles: '' → 'ok'.
  6. Scheduler: start a server using the script overlay; in another browser tab attempt to Rebuild → 409 / scheduler-blocked. Stop server; rebuild succeeds.
  7. Audit log: journalctl --since "5 min ago" | grep run- shows transient scopes per build with cgroup memory accounting visible.

These are not required for any single commit but should pass before declaring the work done.