docs(specs): script overlay type — design + implementation plan

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-08 15:27:14 +02:00


L4D2 Script Overlays Design

Goal: Add a single new overlay type, script, that lets users author arbitrary build recipes as bash and runs them inside a bubblewrap + systemd-run --scope sandbox. The new type subsumes the existing l4d2center_maps and cedapug_maps managed-globals overlay types, both of which are removed in the same change. After this work the overlay type list is exactly workshop (unchanged) and script (new).

Approval status: User-approved design direction. Implementation proceeds in lockstep with the companion plan at docs/superpowers/plans/2026-05-08-l4d2-script-overlays.md.

Context

left4me users today have two ways to add content to a server: workshop overlays (rich UI for Steam Workshop items via WorkshopBuilder) and a pair of managed global-map overlay types (l4d2center_maps, cedapug_maps) with bespoke parsers, per-item DB rows, ETag-based change detection, and a daily refresh timer. They cannot author arbitrary build recipes.

The user's previous setup at ckn-bw/bundles/left4dead2/files/scripts/overlays/ expressed every recipe as a small bash file: competitive_rework (GitHub tarball download), tickrate (inline server.cfg + addon DLL fetch), standard (workshop items + admin-list write), workshop_maps (workshop collection import), l4d2center_maps (CSV-driven map sync). All five fit naturally into a single "run a sandboxed bash script that populates the overlay dir" model.

The two managed global-map types in the current codebase are over-engineered for what they do — each is essentially "fetch a manifest, download archives, extract VPKs, place in addons/." Folding them into the new script type eliminates three database tables, two source-parser modules, the GlobalMapOverlayBuilder, the py7zr dependency, the global-overlay cache root, and the managed-singleton machinery, while letting an admin paste the equivalent shell code (which the user already wrote years ago) into a normal admin-owned, system-wide script overlay.

The trust model for the sandbox is "semi-public deployment, registered users." The threat surface is one user reading another user's overlay, the application DB, or arbitrary host secrets, plus runaway scripts exhausting disk/CPU/RAM. Network access is not restricted — scripts must be able to download from arbitrary URLs (GitHub, l4d2center, Steam CDN). Sandbox boundaries are namespace-based (mount, PID, IPC, UTS, cgroup), not command-allowlist-based; binary-allowlist sandboxing of bash is theatre because of eval and exec.

The test deploy DB is wiped as part of rollout; no data migration is performed. Existing user blueprints that reference l4d2center_maps or cedapug_maps overlay rows do not survive the change in the test environment.

A scheduled-refresh feature (the daily timer that today drives the global-map types) is intentionally out of scope for this iteration. The two existing systemd units and the flask refresh-global-overlays CLI command are deleted with no replacement. Refresh is reintroduced in a later iteration designed against concrete needs.

Locked Decisions

Single new overlay type: script. Replaces both managed-globals types. Final type list: workshop + script. No tarball/inline/manual types — all of those collapse into script (with UI templates as a future ergonomics improvement).
Overlay.script is a DB TEXT column holding the raw bash. No file storage, no revision history in v1. Empty string for workshop rows.
Build idempotency contract: script runs against the existing overlay dir. No automatic wipe between builds. Users write test -f … || curl …-style guards if they want bandwidth efficiency. A manual "Wipe overlay" button on the detail page resets the dir to empty.
No left4me-aware helpers in the sandbox. The script sees pure bash plus whatever's in /usr (RO bind-mount of the host). Workshop items are not exposed via a helper — users wanting workshop content create a workshop-type overlay, which has its own first-class UX (thumbnails, collection paste, dedup cache, refresh).
Sandbox engine: bubblewrap (bwrap) inside systemd-run --scope --collect. systemd-run provides the cgroup v2 limits and the walltime kill via RuntimeMaxSec; bwrap provides the namespace isolation. Both are stable, well-audited, and packaged in Debian.
Resource limits (system-wide, not per-overlay): 1 hour walltime (RuntimeMaxSec=3600), 4 GB RAM (MemoryMax=4G, MemorySwapMax=0), 512 tasks, 200% CPU quota, post-build 20 GB disk cap on du -sb of the overlay dir.
Network: host-shared. No --unshare-net. Scripts have full outbound. Egress filtering is not in v1; the sandbox prevents reading internal state but does not prevent talking to internal IPs. Acceptable for the current trust model.
No auto-seeding of "default" overlays. Admin manually creates the equivalents of the old l4d2center-maps/cedapug-maps post-deploy by pasting the bash. The deploy script does not insert overlay rows.
Daily/scheduled refresh: out of scope for this iteration. No auto_refresh flag, no timer, no CLI command. Manual rebuild via the detail-page button is the only build trigger after this change.
Permissions mirror workshop overlays. Any logged-in user can create a private (user_id = me) script overlay. Admin can create system-wide (user_id = NULL). Owner or admin can edit/delete.
Failure semantics via Overlay.last_build_status ('' | 'ok' | 'failed'). Drives a "rebuild required" badge on the list and detail pages. Server initialization does not auto-block on failed (matches workshop's current behavior).
Wipe is just another sandbox invocation. The wipe endpoint runs the literal script find /overlay -mindepth 1 -delete through the same left4me-script-sandbox helper. No second helper, no privilege/UID puzzle (files are owned by l4d2-sandbox, who runs the wipe). After a successful wipe, last_build_status is reset to ''. Wipe does not auto-enqueue a rebuild — the user decides.
Privileged helper: /usr/local/libexec/left4me/left4me-script-sandbox. Same pattern as the existing left4me-overlay, left4me-systemctl, left4me-journalctl helpers. Bash, owned root, mode 0755. The web user invokes it via sudo -n per a sudoers fragment. Root is needed to set up the namespaces; bwrap drops to the unprivileged l4d2-sandbox UID immediately.
Dedicated sandbox UID l4d2-sandbox (system user, /usr/sbin/nologin, no home). Owns nothing on the host outside what bwrap binds in. UID-drop happens inside the bwrap invocation via --uid/--gid.
Strict argument validation in the helper. Overlay id matches ^[0-9]+$; overlay dir must exist under /var/lib/left4me/overlays/; script path must exist. Defense in depth — the real authorization check lives in the web app.
Streaming I/O via the existing run_with_streamed_output helper. Same plumbing WorkshopBuilder already uses for steamcmd/curl invocations. No new SSE/log path.
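
The helper's three argument checks could also be mirrored in the web app before sudo is ever invoked. The following is a minimal sketch under stated assumptions: the function name validate_sandbox_args and the overlays_root parameter (standing in for /var/lib/left4me/overlays) are hypothetical and not taken from the codebase.

```python
import re
from pathlib import Path

# Same pattern the helper uses for the overlay id.
OVERLAY_ID_RE = re.compile(r"^[0-9]+$")


def validate_sandbox_args(overlay_id: str, script_path: str, overlays_root: str) -> Path:
    """Return the overlay directory if all three checks pass, else raise ValueError."""
    # fullmatch is used because "$" alone would accept a trailing newline in Python.
    if not OVERLAY_ID_RE.fullmatch(overlay_id):
        raise ValueError("bad overlay id")
    overlay_dir = Path(overlays_root) / overlay_id
    if not overlay_dir.is_dir():
        raise ValueError("no overlay dir")
    if not Path(script_path).is_file():
        raise ValueError("no script")
    return overlay_dir
```

Because the id is restricted to digits, path traversal such as ../7 is rejected before any filesystem access happens.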

Architecture

Overlay row (type=script, script=TEXT, last_build_status)
   │
   ▼  build_overlay(overlay_id) job
   │
   ▼  BUILDERS["script"].build(overlay, on_stdout, on_stderr, should_cancel)
   │
   ▼  ScriptBuilder writes overlay.script → tmpfile, then:
   │    sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <tmpfile>
   │
   ▼  Helper validates args, then exec()s:
   │    systemd-run --scope --collect
   │      -p MemoryMax=4G -p MemorySwapMax=0
   │      -p TasksMax=512 -p CPUQuota=200%
   │      -p RuntimeMaxSec=3600
   │      -- bwrap [namespace flags...] /bin/bash /script.sh
   │
   ▼  Inside the sandbox the script sees:
   │    /overlay   ← /var/lib/left4me/overlays/{id}     RW (the build target)
   │    /tmp,/run  ← fresh tmpfs                         RW (ephemeral)
   │    /usr,/lib,/lib64,/etc/{ssl,ca-certificates,resolv.conf,nsswitch.conf} RO (host-curated)
   │    /proc,/dev ← fresh
   │    network    ← shared with host
   │    UID/GID    ← l4d2-sandbox (no_new_privs implicit in bwrap)
   │
   ▼  stdout/stderr → run_with_streamed_output → existing job-log SSE stream
   ▼  After exit:
   │    exit 0  ∧  du -sb /overlay ≤ 20 GB  →  last_build_status='ok'
   │    any other outcome                    →  last_build_status='failed'

The host library (l4d2host) is unchanged. The KernelOverlayFSMounter already mounts whatever's at overlays/{id}/ regardless of how it got there. The Job model and worker model are essentially unchanged — script is just another overlay type for the same build_overlay operation that today supports workshop.

BUILDERS = {
    "workshop": WorkshopBuilder(),
    "script":   ScriptBuilder(),
}
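
How a job might resolve its builder through this registry can be sketched as follows. The Protocol mirrors the build(overlay, *, on_stdout, on_stderr, should_cancel) signature stated in this spec; the dispatch_build function and its error handling are assumptions made for illustration only.

```python
from typing import Callable, Dict, Protocol


class OverlayBuilder(Protocol):
    def build(
        self,
        overlay,
        *,
        on_stdout: Callable[[str], None],
        on_stderr: Callable[[str], None],
        should_cancel: Callable[[], bool],
    ) -> None: ...


def dispatch_build(builders: Dict[str, OverlayBuilder], overlay, **callbacks) -> None:
    """Look up the builder for overlay.type and run it; unknown types are an error."""
    try:
        builder = builders[overlay.type]
    except KeyError:
        raise ValueError(f"unsupported overlay type: {overlay.type!r}") from None
    builder.build(overlay, **callbacks)
```

With this shape, removing the two managed-globals types amounts to deleting their registry entries; any leftover row of a removed type fails loudly rather than silently doing nothing.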

Data Model

`Overlay` (modified)

id INTEGER PK AUTOINCREMENT
name VARCHAR(255) NOT NULL
path VARCHAR(255) NOT NULL                   -- str(id) for new rows
type VARCHAR(16) NOT NULL                    -- 'workshop' | 'script'
user_id INTEGER NULL REFERENCES users(id)    -- NULL = system-wide

script TEXT NOT NULL DEFAULT ''              -- new; meaningful for type='script'
last_build_status VARCHAR(16) NOT NULL DEFAULT ''  -- new; '' | 'ok' | 'failed'

created_at, updated_at

UNIQUE INDEX on (name) WHERE user_id IS NULL
UNIQUE INDEX on (name, user_id) WHERE user_id IS NOT NULL
INDEX on (type, user_id)
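
The effect of the two partial unique indexes can be tried out with the standard sqlite3 module. This is a sketch, not the project's migration: the index names and the add_overlay helper are hypothetical, created_at/updated_at are left out, and the users foreign key is omitted because no users table exists in the sketch.

```python
import sqlite3

SCHEMA = """
CREATE TABLE overlays (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name VARCHAR(255) NOT NULL,
    path VARCHAR(255) NOT NULL,
    type VARCHAR(16) NOT NULL,
    user_id INTEGER NULL,
    script TEXT NOT NULL DEFAULT '',
    last_build_status VARCHAR(16) NOT NULL DEFAULT ''
);
CREATE UNIQUE INDEX ux_overlays_system_name ON overlays (name) WHERE user_id IS NULL;
CREATE UNIQUE INDEX ux_overlays_user_name ON overlays (name, user_id) WHERE user_id IS NOT NULL;
CREATE INDEX ix_overlays_type_user ON overlays (type, user_id);
"""


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_overlay(conn, name, type_, user_id=None) -> int:
    """Insert a row, then set path = str(id) as described for new rows."""
    cur = conn.execute(
        "INSERT INTO overlays (name, path, type, user_id) VALUES (?, '', ?, ?)",
        (name, type_, user_id),
    )
    conn.execute("UPDATE overlays SET path = ? WHERE id = ?", (str(cur.lastrowid), cur.lastrowid))
    return cur.lastrowid
```

The split into two indexes matters because SQLite treats NULLs as distinct in a unique index: a single UNIQUE (name, user_id) would allow two system-wide rows with the same name.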

Tables removed

global_overlay_item_files
global_overlay_items
global_overlay_sources

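Drop order matters for the SQLite migration because of the foreign-key chain: drop global_overlay_item_files first (FK to global_overlay_items), then global_overlay_items (FK to global_overlay_sources), then global_overlay_sources (FK to overlays).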
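
The child-before-parent order can be sketched with sqlite3 as below. Only the three table names and their FK direction come from this spec; the column names in DEMO_SCHEMA are invented so the sketch has something to drop.

```python
import sqlite3

# Hypothetical minimal schema reproducing only the FK chain described above.
DEMO_SCHEMA = """
CREATE TABLE overlays (id INTEGER PRIMARY KEY);
CREATE TABLE global_overlay_sources (
    id INTEGER PRIMARY KEY,
    overlay_id INTEGER NOT NULL REFERENCES overlays(id)
);
CREATE TABLE global_overlay_items (
    id INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES global_overlay_sources(id)
);
CREATE TABLE global_overlay_item_files (
    id INTEGER PRIMARY KEY,
    item_id INTEGER NOT NULL REFERENCES global_overlay_items(id)
);
"""

# Children first, so no table is dropped while another still references it.
DROP_ORDER = (
    "global_overlay_item_files",
    "global_overlay_items",
    "global_overlay_sources",
)


def drop_global_overlay_tables(conn: sqlite3.Connection) -> None:
    for table in DROP_ORDER:
        # Table names are fixed constants above, never user input.
        conn.execute(f"DROP TABLE IF EXISTS {table}")
```

IF EXISTS keeps the step re-runnable on a database where the tables are already gone.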

Unchanged

WorkshopItem, overlay_workshop_items, Job (including Job.overlay_id and nullable Job.user_id), Server, Blueprint, etc.

Filesystem Layout

${LEFT4ME_ROOT}/
  overlays/
    {overlay_id}/                            # script writes here; mounted by host
      left4dead2/...                         # whatever the script produces
  workshop_cache/{steam_id}.vpk              # workshop type only — unchanged

# removed:
#   global_overlay_cache/                    # was used by managed-globals types

Single tree per overlay. No per-overlay scratch cache (the chosen idempotency model is "script runs against existing dir," so any caching the user wants lives inside the overlay dir and is preserved between builds).

The sandbox bind-mounts ${LEFT4ME_ROOT}/overlays/{id}/ to /overlay (RW). Nothing else under ${LEFT4ME_ROOT} is visible inside the sandbox.
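
The layout reduces to a single path rule per overlay. A minimal sketch of such a resolver follows; the real overlay_path used by the builders is not shown in this spec, so the signature, the environment lookup, and the /var/lib/left4me fallback (taken from the helper's hard-coded path) are assumptions.

```python
import os
from pathlib import Path
from typing import Optional


def overlay_path(overlay_id: int, root: Optional[str] = None) -> Path:
    """Return <root>/overlays/<id>, the only host path bound RW into the sandbox."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(overlay_id, bool) or not isinstance(overlay_id, int) or overlay_id < 0:
        raise ValueError("overlay id must be a non-negative integer")
    base = Path(root or os.environ.get("LEFT4ME_ROOT", "/var/lib/left4me"))
    return base / "overlays" / str(overlay_id)
```

Accepting only integers means the returned path can never escape the overlays directory.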

Sandbox

Helper script

deploy/files/usr/local/libexec/left4me/left4me-script-sandbox, mode 0755, owned root:

#!/bin/bash
# args: <overlay_id> <script_path>
set -euo pipefail
[[ $# -eq 2 ]] || { echo "usage: $0 <overlay_id> <script>" >&2; exit 64; }
OVERLAY_ID=$1; SCRIPT=$2
[[ "$OVERLAY_ID" =~ ^[0-9]+$ ]] || { echo "bad overlay id" >&2; exit 64; }
OVERLAY_DIR=/var/lib/left4me/overlays/$OVERLAY_ID
[[ -d $OVERLAY_DIR ]] || { echo "no overlay dir" >&2; exit 65; }
[[ -f $SCRIPT ]] || { echo "no script" >&2; exit 65; }

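# Resolve the dedicated sandbox account to numeric ids for bwrap.
SBX_UID=$(id -u l4d2-sandbox); SBX_GID=$(id -g l4d2-sandbox)

# systemd-run supplies the cgroup limits and the walltime kill; bwrap supplies
# the namespaces. bwrap documents --uid/--gid as requiring --unshare-user, so
# that flag is included below.
exec systemd-run --quiet --scope --collect \
  -p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 \
  -p CPUQuota=200% -p RuntimeMaxSec=3600 \
  -- bwrap \
    --die-with-parent --new-session \
    --unshare-user --unshare-pid --unshare-ipc --unshare-uts --unshare-cgroup \
    --uid "$SBX_UID" --gid "$SBX_GID" \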
    --proc /proc --dev /dev --tmpfs /tmp --tmpfs /run \
    --ro-bind /usr /usr --ro-bind /lib /lib --ro-bind /lib64 /lib64 \
    --symlink usr/bin /bin --symlink usr/sbin /sbin \
    --ro-bind /etc/resolv.conf /etc/resolv.conf \
    --ro-bind /etc/ssl /etc/ssl \
    --ro-bind /etc/ca-certificates /etc/ca-certificates \
    --ro-bind /etc/nsswitch.conf /etc/nsswitch.conf \
    --bind "$OVERLAY_DIR" /overlay \
    --chdir /overlay \
    --setenv HOME /tmp --setenv PATH /usr/bin:/usr/sbin \
    --setenv OVERLAY /overlay \
    --ro-bind "$SCRIPT" /script.sh \
    /bin/bash /script.sh

Network is not unshared (no --unshare-net); the sandbox shares the host network namespace. Every transient unit is visible via systemctl list-units --type=scope while running and journaled afterward (journalctl --user-unit=run-…scope or system journal depending on invocation).

Sudoers fragment

Append to deploy/files/etc/sudoers.d/left4me:

left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox
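
A deploy-artifact test could assert that the fragment contains exactly the expected rules and nothing else. The sketch below assumes hypothetical function names and takes the full allowed rule set as a parameter, since the other lines of the real sudoers file are not shown in this spec. It does not handle backslash line continuations.

```python
EXPECTED_RULE = (
    "left4me ALL=(root) NOPASSWD: "
    "/usr/local/libexec/left4me/left4me-script-sandbox"
)


def active_rules(sudoers_text):
    """Return non-blank, non-comment lines with whitespace normalised."""
    rules = []
    for raw in sudoers_text.splitlines():
        line = " ".join(raw.split())
        if not line:
            continue
        # "#include" and "#includedir" are directives in sudoers, not comments.
        if line.startswith("#") and not line.startswith("#include"):
            continue
        rules.append(line)
    return rules


def unexpected_rules(sudoers_text, allowed):
    """Return every active line that is not in the allowed set."""
    return [r for r in active_rules(sudoers_text) if r not in set(allowed)]
```

An empty result from unexpected_rules, together with EXPECTED_RULE appearing in active_rules, is the condition such a test would assert.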

System user

Provisioned in deploy/deploy-test-server.sh:

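apt-get install -y bubblewrap
# Create the sandbox user only if it is missing, so the deploy script can be re-run.
id -u l4d2-sandbox >/dev/null 2>&1 || \
  useradd --system --no-create-home --shell /usr/sbin/nologin l4d2-sandbox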

Build Lifecycle

ScriptBuilder lives in l4d2web/services/overlay_builders.py next to WorkshopBuilder:

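import os
import subprocess
import tempfile

# overlay_path, run_with_streamed_output and BuildError come from the existing codebase.

DISK_CAP_BYTES = 20 * 1024**3


class ScriptBuilder:
    def build(self, overlay, *, on_stdout, on_stderr, should_cancel):
        # delete=False keeps the file after the `with` block so the helper can read it.
        with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
            f.write(overlay.script or "")
            script_path = f.name
        try:
            cmd = [
                "sudo", "-n",
                "/usr/local/libexec/left4me/left4me-script-sandbox",
                str(overlay.id), script_path,
            ]
            run_with_streamed_output(cmd, on_stdout, on_stderr, should_cancel)
            self._enforce_disk_budget(overlay.id, on_stderr)
        finally:
            # Always remove the temp script, whether the build succeeded or not.
            os.unlink(script_path)

    def _enforce_disk_budget(self, overlay_id, on_stderr):
        # `du -sb` prints "<bytes>\t<path>"; the first field is the total size.
        out = subprocess.check_output(["du", "-sb", overlay_path(overlay_id)])
        if int(out.split()[0]) > DISK_CAP_BYTES:
            on_stderr("overlay exceeded 20 GB disk cap")
            raise BuildError("disk-cap-exceeded")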

run_with_streamed_output is the existing helper used by WorkshopBuilder for steamcmd/curl invocations. When should_cancel reports a cancellation, the runner sends SIGTERM to the sudo/systemd-run process tree; --collect ensures the transient scope unit is unloaded once its processes have exited, even after a failure.
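
For readers unfamiliar with the streaming contract, the sketch below shows one way a runner with these three callbacks could work. It is not the project's run_with_streamed_output, whose implementation is not shown here; the name stream_process, the polling interval, and the return value are assumptions.

```python
import subprocess
import threading


def stream_process(cmd, on_stdout, on_stderr, should_cancel, poll_interval=0.1):
    """Run cmd, forward each output line to a callback, and terminate it on cancel.

    Returns (returncode, cancelled).
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )

    def pump(stream, callback):
        # Iterating a text pipe yields one line at a time until EOF.
        for line in stream:
            callback(line.rstrip("\n"))

    pumps = [
        threading.Thread(target=pump, args=(proc.stdout, on_stdout)),
        threading.Thread(target=pump, args=(proc.stderr, on_stderr)),
    ]
    for t in pumps:
        t.start()

    cancelled = False
    while True:
        try:
            proc.wait(timeout=poll_interval)
            break
        except subprocess.TimeoutExpired:
            # Poll the cancel flag between waits; signal only once.
            if not cancelled and should_cancel():
                proc.terminate()  # SIGTERM on POSIX
                cancelled = True

    for t in pumps:
        t.join()
    return proc.returncode, cancelled
```

Reading stdout and stderr on separate threads avoids the deadlock that occurs when one pipe fills while the caller is blocked on the other.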

The job worker's existing job-completion path writes Overlay.last_build_status = 'ok' on success and 'failed' on any non-zero exit / BuildError / cancel. This is a single column update inside the existing transaction; no new infrastructure.
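
The status rule is small enough to state as a pure function. This is a sketch with a hypothetical name and signature; in the spec the disk-cap breach surfaces as a BuildError, which is modelled here simply as the measured size exceeding the cap.

```python
DISK_CAP_BYTES = 20 * 1024**3  # same constant as the post-build du check


def build_status(exit_code: int, overlay_bytes: int, cancelled: bool = False) -> str:
    """Map a finished build to the value stored in Overlay.last_build_status."""
    if cancelled or exit_code != 0 or overlay_bytes > DISK_CAP_BYTES:
        return "failed"
    return "ok"
```

The empty string is never produced here: it is reserved for rows that have not been built yet or have just been wiped.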

UI

Create modal (`templates/overlays.html`)

The existing modal grows one option in the type radio: Workshop | Script. Name field unchanged. After insert, the web app generates path = str(overlay_id) for new rows (existing pattern).

Detail page when `type='script'` (`templates/overlay_detail.html`)

Plain styled <textarea> for overlay.script with a Save button → POST /overlays/{id}/script. No CodeMirror dependency in v1 (out of scope; keep frontend dep-light).
"Rebuild" button → POST /overlays/{id}/build. Existing pattern from workshop overlays.
"Wipe overlay" button (red, confirm-modal) → POST /overlays/{id}/wipe.
last_build_status indicator badge: empty / "ok" / "failed".
Live build log via existing SSE plumbing on the related Job row.

Detail page when `type='workshop'`: unchanged.

Sections removed

The global-source detail block (overlay_detail.html lines 34–46) is deleted along with the managed-globals subsystem.

Routes

l4d2web/routes/overlay_routes.py adds:

| Method | Path | Purpose |
| --- | --- | --- |
| POST | `/overlays/{id}/script` | Update `script` text. Auto-enqueue coalesced `build_overlay` job. |
| POST | `/overlays/{id}/wipe` | Invoke `left4me-script-sandbox` with the literal script `find /overlay -mindepth 1 -delete`. Owner/admin only. Refuses if a `build_overlay` for this overlay is running. After success, set `last_build_status=''`. Does not auto-enqueue a rebuild. |
| POST | `/overlays/{id}/build` | Manual rebuild, same pattern as today's workshop overlay manual rebuild. |

Existing POST /overlays accepts type=script and an optional initial script body.
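
The wipe route's rules can be sketched independently of the web framework. Everything below is illustrative: the Overlay dataclass is a stand-in for the real model, may_modify represents the outcome of the owner-or-admin check, and run_sandbox is a hypothetical callable that invokes the helper with a script body and returns its exit code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

WIPE_SCRIPT = "find /overlay -mindepth 1 -delete"


@dataclass
class Overlay:
    id: int
    user_id: Optional[int]
    last_build_status: str = ""


class WipeRefused(Exception):
    """Raised when the wipe must not run or did not succeed."""


def wipe_overlay(
    overlay: Overlay,
    may_modify: bool,
    build_running: bool,
    run_sandbox: Callable[[int, str], int],
) -> None:
    if not may_modify:
        raise WipeRefused("owner or admin only")
    if build_running:
        raise WipeRefused("a build_overlay job is running for this overlay")
    if run_sandbox(overlay.id, WIPE_SCRIPT) != 0:
        raise WipeRefused("sandbox wipe failed")
    # Reset the status only after a successful wipe; no rebuild is enqueued.
    overlay.last_build_status = ""
```

Keeping the status untouched on a failed wipe means the badge continues to describe the last build rather than a half-emptied directory.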

Permissions

| Action | Who |
| --- | --- |
| Create script overlay (private, `user_id = me`) | Any authenticated user |
| Create script overlay (system-wide, `user_id = NULL`) | Admin |
| Edit (script body, name) | Owner or admin |
| Wipe / Rebuild | Owner or admin |
| Delete | Owner or admin |
| View | Owner, admin, or any user when `user_id IS NULL` |

These match the existing rules for workshop overlays.
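
The permission table translates into three small predicates. The function names and the flat argument style are assumptions; the existing workshop checks are not shown in this spec and may be structured differently. Whether an admin may create a private overlay on behalf of another user is not specified, so the sketch disallows it.

```python
from typing import Optional


def can_create(target_user_id: Optional[int], user_id: int, is_admin: bool) -> bool:
    """System-wide rows (target_user_id None) need admin; private rows must be your own."""
    if target_user_id is None:
        return is_admin
    return target_user_id == user_id


def can_modify(overlay_user_id: Optional[int], user_id: int, is_admin: bool) -> bool:
    """Edit, wipe, rebuild, delete: owner or admin."""
    if is_admin:
        return True
    return overlay_user_id is not None and overlay_user_id == user_id


def can_view(overlay_user_id: Optional[int], user_id: int, is_admin: bool) -> bool:
    """Owner, admin, or anyone when the overlay is system-wide."""
    return is_admin or overlay_user_id is None or overlay_user_id == user_id
```

Note that a non-admin can view a system-wide overlay but cannot modify it, which is why can_modify checks for None explicitly.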

Job Worker / Scheduler

services/job_worker.py drops "refresh_global_overlays" from GLOBAL_OPERATIONS and removes the corresponding refresh_global_overlays_running and blocked_servers_by_overlay plumbing that exists only for the global-maps subsystem. The remaining mutex rules already cover:

build_overlay per overlay (one running build per overlay).
install and refresh_workshop_items as global mutexes.
Server start/init blocks if any build_overlay for an overlay in the server's blueprint is running.

No new rules are needed for script — its build is mechanically identical to a workshop build from the scheduler's perspective.
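
One possible reading of these mutex rules is sketched below. The Job dataclass, the SERVER_OPERATIONS names, and the interpretation of "global mutex" as "runs only when nothing else is running, and blocks everything while it runs" are all assumptions; the real scheduler in services/job_worker.py may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

GLOBAL_OPERATIONS = {"install", "refresh_workshop_items"}
SERVER_OPERATIONS = {"start", "init"}  # hypothetical operation names


@dataclass(frozen=True)
class Job:
    operation: str
    overlay_id: Optional[int] = None
    # Overlays in the server's blueprint; only meaningful for server operations.
    blueprint_overlay_ids: frozenset = frozenset()


def can_start(job: Job, running: Iterable[Job]) -> bool:
    running = list(running)
    # A running global operation blocks every other job.
    if any(r.operation in GLOBAL_OPERATIONS for r in running):
        return False
    if job.operation in GLOBAL_OPERATIONS:
        return not running
    building = {r.overlay_id for r in running if r.operation == "build_overlay"}
    if job.operation == "build_overlay":
        # At most one running build per overlay.
        return job.overlay_id not in building
    if job.operation in SERVER_OPERATIONS:
        # Server start/init waits for builds of any overlay in its blueprint.
        return not (building & job.blueprint_overlay_ids)
    return True
```

Nothing in the function inspects the overlay type, which is the sense in which a script build is indistinguishable from a workshop build for the scheduler.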

Daily Refresh — Removed

This iteration deletes the daily-refresh subsystem entirely:

deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.timer and .service — deleted.
flask refresh-global-overlays CLI command in l4d2web/cli.py — deleted.
No replacement timer, no replacement CLI, no auto_refresh column on Overlay.

The only build trigger after this change is the user clicking Rebuild on the detail page (or the auto-enqueue when they Save the script body). A scheduled-refresh feature is reintroduced in a future iteration designed against concrete operational needs.

Risks

Sandbox escape via kernel bug. bwrap has a strong track record but is not invulnerable. Mitigated by running as l4d2-sandbox (no privileged capabilities), no setuid binaries reachable, no_new_privs implicit. A successful escape would land in an unprivileged UID with no host secrets reachable.
Disk fill via runaway script. A script that writes a 20 GB+ payload to /overlay succeeds inside the sandbox and only fails afterward at the post-build du check, so the payload lands on disk transiently. This is not mitigated in v1: cgroup IO controls throttle bandwidth rather than total bytes written, so the scope's limits cannot cap file size. Accepted as a v1 trade-off; a future improvement is to put each overlay dir on its own filesystem with a quota.
Network exfiltration. Script can connect to anything outbound, including internal IPs. Acceptable for the current trust model (semi-public; users have credentials). Egress firewall is out of scope.
Build while a server is running. The scheduler refuses build_overlay for an overlay attached to a starting or running server (existing rule, unchanged). Rebuilding remains possible while a server using a different blueprint is running.
Wipe race with running build. The wipe endpoint refuses if a build_overlay for the overlay is running. Without this check, a wipe could blow away files mid-script and produce undefined results.
Stale last_build_status. A row inserted via direct DB write or restored from backup could carry an 'ok' status that no longer reflects reality. Treated as cosmetic; users can rebuild to refresh.
Sudoers misconfig. A typo in the sudoers fragment could grant left4me more than intended. Mitigated by deploy-artifact tests asserting the exact expected lines.
DB row deletion racing the sandbox. A user deleting an overlay while its build runs would invalidate the bind-mount target. Mitigated by the existing scheduler rule that tracks running overlays; delete should refuse if a build is running. (Existing pattern for workshop overlays; reuse.)
Migration drops globals tables. Acceptable for the test deploy. Production rollout would need a different migration story; this spec explicitly assumes test-deploy DB wipe.
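
For the disk-fill risk above, the post-build size check can be approximated in pure Python when shelling out to du is not wanted, for example in unit tests. The function name is hypothetical, and the result is not identical to du -sb: this sketch sums file sizes only, whereas du also counts the directories themselves and counts hard-linked files once.

```python
import os


def tree_size_bytes(root: str) -> int:
    """Sum the apparent sizes of all non-directory entries under root."""
    total = 0
    # os.walk does not follow directory symlinks by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # lstat measures a symlink itself rather than its target.
            total += os.lstat(os.path.join(dirpath, name)).st_size
    return total
```

Like the du check, this only runs after the script has finished, so it detects an oversized overlay but does not prevent the transient disk usage.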

Out Of Scope

Scheduled / daily refresh. Intentionally removed in this iteration. Reintroduced later, designed against the use cases that emerge.
Per-overlay resource overrides. All script overlays share the same 1 h / 4 GB / 20 GB envelope. If a real overlay needs more (l4d2center mirror at peak), revisit.
CodeMirror or other rich script editor. Plain <textarea> in v1.
Egress allowlist / proxy. No network restrictions on the sandbox in v1.
$CACHE scratch dir persisted across builds. Users cache inside the overlay dir if they want; idempotency model is "script runs against existing dir."
Multi-tenant cgroup tree per user. All sandboxes share the same cgroup-quota envelope.
Revision history on script column. No overlay_script_revisions table; whatever's in the row is the current script.
Auto-seeding of l4d2center / cedapug equivalents. Admin pastes the script post-deploy.
Migration that preserves existing global-map overlay rows. Test deploy DB is wiped.
Container-per-build (podman / docker). Heavier than bwrap; revisit only if multi-tenant escalates to "fully public sign-up."
left4me-aware helpers (workshop, download, extract) inside the sandbox. Pure bash + host /usr only.

Implementation Boundaries

l4d2host is unchanged. The host library has no concept of overlay types and the mount layer (KernelOverlayFSMounter) doesn't care how the overlay dir got populated.
The OverlayBuilder Protocol is unchanged — same build(overlay, *, on_stdout, on_stderr, should_cancel) signature. ScriptBuilder plugs into the existing registry.
The job worker model is unchanged. Same operations, same logs, same SSE plumbing, same scheduler rules (minus the refresh_global_overlays entry).
No new application-level dependencies. Vendored HTMX, no new Python packages. Two new system dependencies: bubblewrap apt package and the l4d2-sandbox system user.
No new config keys. Same env files (/etc/left4me/host.env, /etc/left4me/web.env).
DB migration is destructive for global-maps overlay rows. This is acceptable per the test-deploy assumption; a production-rollout follow-up would need to address it.
The companion implementation plan governs task ordering and verification commands. Implementation must not start without explicit user approval per that plan's gate.
