L4D2 Script Overlays Design
Goal: Add a single new overlay type, script, that lets users author arbitrary build recipes as bash and runs them inside a bubblewrap + systemd-run --scope sandbox. The new type subsumes the existing l4d2center_maps and cedapug_maps managed-globals overlay types, both of which are removed in the same change. After this work the overlay type list is exactly workshop (unchanged) and script (new).
Approval status: User-approved design direction. Implementation proceeds in lockstep with the companion plan at docs/superpowers/plans/2026-05-08-l4d2-script-overlays.md.
Context
left4me users today have two ways to add content to a server: workshop overlays (rich UI for Steam Workshop items via WorkshopBuilder) and a pair of managed global-map overlay types (l4d2center_maps, cedapug_maps) with bespoke parsers, per-item DB rows, ETag-based change detection, and a daily refresh timer. They cannot author arbitrary build recipes.
The user's previous setup at ckn-bw/bundles/left4dead2/files/scripts/overlays/ expressed every recipe as a small bash file: competitive_rework (GitHub tarball download), tickrate (inline server.cfg + addon DLL fetch), standard (workshop items + admin-list write), workshop_maps (workshop collection import), l4d2center_maps (CSV-driven map sync). All five fit naturally into a single "run a sandboxed bash script that populates the overlay dir" model.
The two managed global-map types in the current codebase are over-engineered for what they do — each is essentially "fetch a manifest, download archives, extract VPKs, place in addons/." Folding them into the new script type eliminates three database tables, two source-parser modules, the GlobalMapOverlayBuilder, the py7zr dependency, the global-overlay cache root, and the managed-singleton machinery, while letting an admin paste the equivalent shell code (which the user already wrote years ago) into a normal admin-owned, system-wide script overlay.
The trust model for the sandbox is "semi-public deployment, registered users." The threat surface is one user reading another user's overlay, the application DB, or arbitrary host secrets, plus runaway scripts exhausting disk/CPU/RAM. Network access is not restricted — scripts must be able to download from arbitrary URLs (GitHub, l4d2center, Steam CDN). Sandbox boundaries are namespace-based (mount, PID, IPC, UTS, cgroup), not command-allowlist-based; binary-allowlist sandboxing of bash is theatre because of eval and exec.
The test deploy DB is wiped as part of rollout; no data migration is performed. Existing user blueprints that reference l4d2center_maps or cedapug_maps overlay rows do not survive the change in the test environment.
A scheduled-refresh feature (the daily timer that today drives the global-map types) is intentionally out of scope for this iteration. The two existing systemd units and the flask refresh-global-overlays CLI command are deleted with no replacement. Refresh is reintroduced in a later iteration designed against concrete needs.
Locked Decisions
- Single new overlay type: `script`. Replaces both managed-globals types. Final type list: `workshop` + `script`. No `tarball`/`inline`/`manual` types — all of those collapse into `script` (with UI templates as a future ergonomics improvement). `Overlay.script` is a DB `TEXT` column holding the raw bash. No file storage, no revision history in v1. Empty string for `workshop` rows.
- Build idempotency contract: the script runs against the existing overlay dir. No automatic wipe between builds. Users write `test -f … || curl …`-style guards if they want bandwidth efficiency. A manual "Wipe overlay" button on the detail page resets the dir to empty.
- No left4me-aware helpers in the sandbox. The script sees pure bash plus whatever's in `/usr` (RO bind-mount of the host). Workshop items are not exposed via a helper — users wanting workshop content create a `workshop`-type overlay, which has its own first-class UX (thumbnails, collection paste, dedup cache, refresh).
- Sandbox engine: `bubblewrap` (`bwrap`) inside `systemd-run --scope --collect`. `systemd-run` provides cgroup v2 limits + walltime kill via `RuntimeMaxSec`; `bwrap` provides the namespace isolation. Both are stable, well-audited, and in-tree on Debian.
- Resource limits (system-wide, not per-overlay): 1 hour walltime (`RuntimeMaxSec=3600`), 4 GB RAM (`MemoryMax=4G`, `MemorySwapMax=0`), 512 tasks, 200% CPU quota, post-build 20 GB disk cap on `du -sb` of the overlay dir.
- Network: host-shared. No `--unshare-net`. Scripts have full outbound access. Egress filtering is not in v1; the sandbox prevents reading internal state but does not prevent talking to internal IPs. Acceptable for the current trust model.
- No auto-seeding of "default" overlays. Admin manually creates the equivalents of the old `l4d2center-maps`/`cedapug-maps` post-deploy by pasting the bash. The deploy script does not insert overlay rows.
- Daily/scheduled refresh: out of scope for this iteration. No `auto_refresh` flag, no timer, no CLI command. Manual rebuild via the detail-page button is the only build trigger after this change.
- Permissions mirror workshop overlays. Any logged-in user can create a private (`user_id = me`) script overlay. Admin can create system-wide (`user_id = NULL`). Owner or admin can edit/delete.
- Failure semantics via `Overlay.last_build_status` (`''` | `'ok'` | `'failed'`). Drives a "rebuild required" badge on the list and detail pages. Server initialization does not auto-block on `failed` (matches workshop's current behavior).
- Wipe is just another sandbox invocation. The wipe endpoint runs the literal script `find /overlay -mindepth 1 -delete` through the same `left4me-script-sandbox` helper. No second helper, no privilege/UID puzzle (files are owned by `l4d2-sandbox`, who runs the wipe). After a successful wipe, `last_build_status` is reset to `''`. Wipe does not auto-enqueue a rebuild — the user decides.
- Privileged helper: `/usr/local/libexec/left4me/left4me-script-sandbox`. Same pattern as the existing `left4me-overlay`, `left4me-systemctl`, `left4me-journalctl` helpers. Bash, owned root, mode 0755. The web user invokes it via `sudo -n` per a sudoers fragment. Root is needed to set up the namespaces; bwrap drops to the unprivileged `l4d2-sandbox` UID immediately.
- Dedicated sandbox UID `l4d2-sandbox` (system user, `/usr/sbin/nologin`, no home). Owns nothing on the host outside what bwrap binds in. The UID drop happens inside the bwrap invocation via `--uid`/`--gid`.
- Strict argument validation in the helper. Overlay id matches `^[0-9]+$`; overlay dir must exist under `/var/lib/left4me/overlays/`; script path must exist. Defense in depth — the real authorization check lives in the web app.
- Streaming I/O via the existing `run_with_streamed_output` helper. Same plumbing `WorkshopBuilder` already uses for `steamcmd`/`curl` invocations. No new SSE/log path.
Architecture
Overlay row (type=script, script=TEXT, last_build_status)
│
▼ build_overlay(overlay_id) job
│
▼ BUILDERS["script"].build(overlay, on_stdout, on_stderr, should_cancel)
│
▼ ScriptBuilder writes overlay.script → tmpfile, then:
│ sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <tmpfile>
│
▼ Helper validates args, then exec()s:
│ systemd-run --scope --collect
│ -p MemoryMax=4G -p MemorySwapMax=0
│ -p TasksMax=512 -p CPUQuota=200%
│ -p RuntimeMaxSec=3600
│ -- bwrap [namespace flags...] /bin/bash /script.sh
│
▼ Inside the sandbox the script sees:
│ /overlay ← /var/lib/left4me/overlays/{id} RW (the build target)
│ /tmp,/run ← fresh tmpfs RW (ephemeral)
│ /usr,/lib,/lib64,/etc/{ssl,resolv.conf,nsswitch} RO (host-curated)
│ /proc,/dev ← fresh
│ network ← shared with host
│ UID/GID ← l4d2-sandbox (no_new_privs implicit in bwrap)
│
▼ stdout/stderr → run_with_streamed_output → existing job-log SSE stream
▼ After exit:
│ exit 0 ∧ du -sb /overlay ≤ 20 GB → last_build_status='ok'
│ any other outcome → last_build_status='failed'
The host library (l4d2host) is unchanged. The KernelOverlayFSMounter already mounts whatever's at overlays/{id}/ regardless of how it got there. The Job model and worker model are essentially unchanged — script is just another overlay type for the same build_overlay operation that today supports workshop.
BUILDERS = {
"workshop": WorkshopBuilder(),
"script": ScriptBuilder(),
}
Data Model
Overlay (modified)
id INTEGER PK AUTOINCREMENT
name VARCHAR(255) NOT NULL
path VARCHAR(255) NOT NULL -- str(id) for new rows
type VARCHAR(16) NOT NULL -- 'workshop' | 'script'
user_id INTEGER NULL REFERENCES users(id) -- NULL = system-wide
script TEXT NOT NULL DEFAULT '' -- new; meaningful for type='script'
last_build_status VARCHAR(16) NOT NULL DEFAULT '' -- new; '' | 'ok' | 'failed'
created_at, updated_at
UNIQUE INDEX on (name) WHERE user_id IS NULL
UNIQUE INDEX on (name, user_id) WHERE user_id IS NOT NULL
INDEX on (type, user_id)
Tables removed
- `global_overlay_item_files`
- `global_overlay_items`
- `global_overlay_sources`

Drop order matters for the SQLite migration: drop `_item_files` first (FK to `_items`), then `_items` (FK to `_sources`), then `_sources` (FK to `overlays`).
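A sketch of the corresponding migration fragment, assuming the table backing `Overlay` is named `overlays` (an assumption; the real DDL lives with the existing migrations):

```sql
-- Drop in FK order: files → items → sources.
DROP TABLE global_overlay_item_files;
DROP TABLE global_overlay_items;
DROP TABLE global_overlay_sources;

-- New Overlay columns (SQLite adds one column per ALTER TABLE):
ALTER TABLE overlays ADD COLUMN script TEXT NOT NULL DEFAULT '';
ALTER TABLE overlays ADD COLUMN last_build_status VARCHAR(16) NOT NULL DEFAULT '';
```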
Unchanged
WorkshopItem, overlay_workshop_items, Job (including Job.overlay_id and nullable Job.user_id), Server, Blueprint, etc.
Filesystem Layout
${LEFT4ME_ROOT}/
overlays/
{overlay_id}/ # script writes here; mounted by host
left4dead2/... # whatever the script produces
workshop_cache/{steam_id}.vpk # workshop type only — unchanged
# removed:
# global_overlay_cache/ # was used by managed-globals types
Single tree per overlay. No per-overlay scratch cache (the chosen idempotency model is "script runs against existing dir," so any caching the user wants lives inside the overlay dir and is preserved between builds).
The sandbox bind-mounts ${LEFT4ME_ROOT}/overlays/{id}/ to /overlay (RW). Nothing else under ${LEFT4ME_ROOT} is visible inside the sandbox.
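To make that caching contract concrete, here is a hypothetical recipe in the style the `script` type is designed for — the file names, hostname, and the stubbed download are invented for illustration; inside the sandbox, `$OVERLAY` is set by the helper and the working directory is already `/overlay`:

```shell
#!/bin/bash
# Hypothetical script-overlay recipe (sketch). Expensive fetches are guarded
# so rebuilds against the existing overlay dir are cheap; small generated
# files are rewritten unconditionally so they always track the script.
set -euo pipefail
OVERLAY=${OVERLAY:-$(mktemp -d)}   # in the sandbox the helper sets OVERLAY=/overlay
cd "$OVERLAY"

mkdir -p left4dead2/addons left4dead2/cfg

# Guarded "download": skipped when the artifact survived a previous build.
# A real recipe would run e.g.:  curl -fsSL "$URL" -o left4dead2/addons/maps.vpk
# (stubbed with a local write here so the sketch runs offline).
if ! test -f left4dead2/addons/maps.vpk; then
  printf 'placeholder VPK payload' > left4dead2/addons/maps.vpk
fi

# Tiny inline config: regenerated on every build.
cat > left4dead2/cfg/server.cfg <<'EOF'
hostname "script-overlay example"
EOF
```

On a rebuild the guarded fetch is a no-op, while the inline config is rewritten so it always matches the script body.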
Sandbox
Helper script
deploy/files/usr/local/libexec/left4me/left4me-script-sandbox, mode 0755, owned root:
#!/bin/bash
# args: <overlay_id> <script_path>
set -euo pipefail
[[ $# -eq 2 ]] || { echo "usage: $0 <overlay_id> <script>" >&2; exit 64; }
OVERLAY_ID=$1; SCRIPT=$2
[[ "$OVERLAY_ID" =~ ^[0-9]+$ ]] || { echo "bad overlay id" >&2; exit 64; }
OVERLAY_DIR=/var/lib/left4me/overlays/$OVERLAY_ID
[[ -d $OVERLAY_DIR ]] || { echo "no overlay dir" >&2; exit 65; }
[[ -f $SCRIPT ]] || { echo "no script" >&2; exit 65; }
SBX_UID=$(id -u l4d2-sandbox); SBX_GID=$(id -g l4d2-sandbox)
# Note: --unshare-user is required for bwrap's --uid/--gid to take effect.
exec systemd-run --quiet --scope --collect \
    -p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 \
    -p CPUQuota=200% -p RuntimeMaxSec=3600 \
    -- bwrap \
    --die-with-parent --new-session \
    --unshare-user --unshare-pid --unshare-ipc --unshare-uts --unshare-cgroup \
    --uid "$SBX_UID" --gid "$SBX_GID" \
    --proc /proc --dev /dev --tmpfs /tmp --tmpfs /run \
    --ro-bind /usr /usr --ro-bind /lib /lib --ro-bind /lib64 /lib64 \
    --symlink usr/bin /bin --symlink usr/sbin /sbin \
    --ro-bind /etc/resolv.conf /etc/resolv.conf \
    --ro-bind /etc/ssl /etc/ssl \
    --ro-bind /etc/ca-certificates /etc/ca-certificates \
    --ro-bind /etc/nsswitch.conf /etc/nsswitch.conf \
    --bind "$OVERLAY_DIR" /overlay \
    --chdir /overlay \
    --setenv HOME /tmp --setenv PATH /usr/bin:/usr/sbin \
    --setenv OVERLAY /overlay \
    --ro-bind "$SCRIPT" /script.sh \
    /bin/bash /script.sh
Network is not unshared (no --unshare-net); the sandbox shares the host network namespace. Every transient unit is visible via systemctl list-units --type=scope while running and journaled afterward (journalctl --user-unit=run-…scope or system journal depending on invocation).
Sudoers fragment
Append to deploy/files/etc/sudoers.d/left4me:
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox
System user
Provisioned in deploy/deploy-test-server.sh:
useradd --system --no-create-home --shell /usr/sbin/nologin l4d2-sandbox
apt-get install -y bubblewrap
Build Lifecycle
ScriptBuilder lives in l4d2web/services/overlay_builders.py next to WorkshopBuilder:
class ScriptBuilder:
    """Builds a script-type overlay by running its bash inside the sandbox.

    run_with_streamed_output, overlay_path, and BuildError are existing
    helpers in this module/package; os, subprocess, and tempfile are
    stdlib imports at module level.
    """

    def build(self, overlay, *, on_stdout, on_stderr, should_cancel):
        with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
            f.write(overlay.script or "")
            script_path = f.name
        # NamedTemporaryFile creates the file 0600 owned by the web user;
        # the sandbox UID must be able to read it through the ro-bind.
        os.chmod(script_path, 0o644)
        try:
            cmd = [
                "sudo", "-n",
                "/usr/local/libexec/left4me/left4me-script-sandbox",
                str(overlay.id), script_path,
            ]
            run_with_streamed_output(cmd, on_stdout, on_stderr, should_cancel)
            self._enforce_disk_budget(overlay.id, on_stderr)
        finally:
            os.unlink(script_path)

    def _enforce_disk_budget(self, overlay_id, on_stderr):
        size = subprocess.check_output(["du", "-sb", overlay_path(overlay_id)])
        if int(size.split()[0]) > 20 * 1024**3:
            on_stderr("overlay exceeded 20 GB disk cap")
            raise BuildError("disk-cap-exceeded")
run_with_streamed_output is the existing helper used by WorkshopBuilder for steamcmd/curl invocations. The should_cancel callback fires kill -TERM on the sudo-systemd-run process tree; cgroup-collect tears down the whole scope on exit.
The job worker's existing job-completion path writes Overlay.last_build_status = 'ok' on success and 'failed' on any non-zero exit / BuildError / cancel. This is a single column update inside the existing transaction; no new infrastructure.
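The post-build check can be sanity-tested outside the builder; this is a shell equivalent of `_enforce_disk_budget` (a sketch — the real check lives in the Python above; the cap is a parameter here only so the failure path is demonstrable):

```shell
#!/bin/bash
# Compare `du -sb` of an overlay dir against a byte cap, as the builder does.
set -euo pipefail

check_disk_budget() {
  local dir=$1
  local cap=${2:-$((20 * 1024 ** 3))}   # default: 20 GB in bytes
  local size
  size=$(du -sb "$dir" | cut -f1)
  if (( size > cap )); then
    echo "overlay exceeded disk cap ($size > $cap bytes)" >&2
    return 1
  fi
}

demo=$(mktemp -d)
echo 'small addon' > "$demo/addon.txt"
check_disk_budget "$demo" && echo "within budget"
check_disk_budget "$demo" 1 || echo "over budget detected"
```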
UI
Create modal (templates/overlays.html)
The existing modal grows one option in the type radio: Workshop | Script. Name field unchanged. After insert, the web app generates path = str(overlay_id) for new rows (existing pattern).
Detail page when type='script' (templates/overlay_detail.html)
- Plain styled `<textarea>` for `overlay.script` with a Save button → `POST /overlays/{id}/script`. No CodeMirror dependency in v1 (out of scope; keep the frontend dep-light).
- "Rebuild" button → `POST /overlays/{id}/build`. Existing pattern from workshop overlays.
- "Wipe overlay" button (red, confirm-modal) → `POST /overlays/{id}/wipe`.
- `last_build_status` indicator badge: empty / "ok" / "failed".
- Live build log via existing SSE plumbing on the related Job row.
Detail page when type='workshop': unchanged.
Sections removed
The global-source detail block (overlay_detail.html lines 34–46) is deleted along with the managed-globals subsystem.
Routes
l4d2web/routes/overlay_routes.py adds:
| Method | Path | Purpose |
|---|---|---|
| POST | `/overlays/{id}/script` | Update the script text. Auto-enqueues a coalesced `build_overlay` job. |
| POST | `/overlays/{id}/wipe` | Invokes `left4me-script-sandbox` with the literal script `find /overlay -mindepth 1 -delete`. Owner/admin only. Refuses if a `build_overlay` for this overlay is running. On success, sets `last_build_status=''`. Does not auto-enqueue a rebuild. |
| POST | `/overlays/{id}/build` | Manual rebuild — same pattern as today's workshop overlay manual rebuild. |
Existing POST /overlays accepts type=script and an optional initial script body.
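The wipe contract can be sanity-checked outside the sandbox (a temp dir standing in for `/overlay`): `find … -mindepth 1 -delete` removes everything inside the dir while the dir itself survives as a valid bind-mount target.

```shell
#!/bin/bash
# Demonstrate the wipe semantics on a throwaway dir.
set -euo pipefail
overlay=$(mktemp -d)
mkdir -p "$overlay/left4dead2/addons"
echo data > "$overlay/left4dead2/addons/maps.vpk"

find "$overlay" -mindepth 1 -delete

# The dir itself is kept; only its contents are gone.
echo "entries left: $(find "$overlay" -mindepth 1 | wc -l)"   # prints: entries left: 0
```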
Permissions
| Action | Who |
|---|---|
| Create script overlay (private, `user_id = me`) | Any authenticated user |
| Create script overlay (system-wide, `user_id = NULL`) | Admin |
| Edit (script body, name) | Owner or admin |
| Wipe / Rebuild | Owner or admin |
| Delete | Owner or admin |
| View | Owner, admin, or any user when `user_id IS NULL` |
These match the existing rules for workshop overlays.
Job Worker / Scheduler
`services/job_worker.py` drops `"refresh_global_overlays"` from `GLOBAL_OPERATIONS` and removes the corresponding `refresh_global_overlays_running` and `blocked_servers_by_overlay` plumbing that exists only for the global-maps subsystem. The remaining mutex rules already cover:

- `build_overlay` per overlay (one running build per overlay).
- `install` and `refresh_workshop_items` as global mutexes.
- Server start/init blocks if any `build_overlay` for an overlay in the server's blueprint is running.
No new rules are needed for script — its build is mechanically identical to a workshop build from the scheduler's perspective.
Daily Refresh — Removed
This iteration deletes the daily-refresh subsystem entirely:
- `deploy/files/usr/local/lib/systemd/system/left4me-refresh-global-overlays.timer` and `.service` — deleted.
- `flask refresh-global-overlays` CLI command in `l4d2web/cli.py` — deleted.
- No replacement timer, no replacement CLI, no `auto_refresh` column on `Overlay`.
The only build trigger after this change is the user clicking Rebuild on the detail page (or the auto-enqueue when they Save the script body). A scheduled-refresh feature is reintroduced in a future iteration designed against concrete operational needs.
Risks
- Sandbox escape via kernel bug. `bwrap` has a strong track record but is not invulnerable. Mitigated by running as `l4d2-sandbox` (no privileged capabilities), no setuid binaries reachable, and `no_new_privs` implicit in bwrap. A successful escape would land in an unprivileged UID with no host secrets reachable.
- Disk fill via runaway script. A script that writes a 20 GB+ payload to `/overlay` succeeds inside the sandbox and only fails afterward at the post-build `du` check, so the 20 GB lands on disk transiently. There is no good in-flight limit — the kernel's per-cgroup IO accounting is unaware of file sizes — so this is accepted as a v1 trade-off; a future improvement is putting the overlay dir on its own filesystem with a quota.
- Network exfiltration. A script can connect to anything outbound, including internal IPs. Acceptable for the current trust model (semi-public; users have credentials). An egress firewall is out of scope.
- Build while a server is running. The scheduler refuses `build_overlay` for an overlay attached to a starting/running server (existing rule, unchanged). A user can still rebuild while a server using a different blueprint runs concurrently.
- Wipe race with a running build. The wipe endpoint refuses if a `build_overlay` for the overlay is running. Without this check, a wipe could blow away files mid-script and produce undefined results.
- Stale `last_build_status`. A row inserted via direct DB write or restored from backup could carry an `'ok'` status that no longer reflects reality. Treated as cosmetic; users can rebuild to refresh.
- Sudoers misconfig. A typo in the sudoers fragment could grant `left4me` more than intended. Mitigated by deploy-artifact tests asserting the exact expected lines.
- DB row deletion racing the sandbox. A user deleting an overlay while its build runs would invalidate the bind-mount target. Mitigated by the existing scheduler rule that tracks running overlays; delete refuses if a build is running (existing pattern for workshop overlays; reuse).
- Migration drops globals tables. Acceptable for the test deploy. Production rollout would need a different migration story; this spec explicitly assumes test-deploy DB wipe.
Out Of Scope
- Scheduled / daily refresh. Intentionally removed in this iteration. Reintroduced later, designed against the use cases that emerge.
- Per-overlay resource overrides. All script overlays share the same 1 h / 4 GB / 20 GB envelope. If a real overlay needs more (l4d2center mirror at peak), revisit.
- CodeMirror or other rich script editor. Plain `<textarea>` in v1.
- Egress allowlist / proxy. No network restrictions on the sandbox in v1.
- `$CACHE` scratch dir persisted across builds. Users cache inside the overlay dir if they want; the idempotency model is "script runs against existing dir."
- Multi-tenant cgroup tree per user. All sandboxes share the same cgroup-quota envelope.
- Revision history on the `script` column. No `overlay_script_revisions` table; whatever's in the row is the current script.
- Auto-seeding of l4d2center / cedapug equivalents. Admin pastes the script post-deploy.
- Migration that preserves existing global-map overlay rows. Test deploy DB is wiped.
- Container-per-build (podman / docker). Heavier than `bwrap`; revisit only if multi-tenant escalates to "fully public sign-up."
- left4me-aware helpers (`workshop`, `download`, `extract`) inside the sandbox. Pure bash + host `/usr` only.
Implementation Boundaries
- `l4d2host` is unchanged. The host library has no concept of overlay types, and the mount layer (`KernelOverlayFSMounter`) doesn't care how the overlay dir got populated.
- The `OverlayBuilder` protocol is unchanged — same `build(overlay, *, on_stdout, on_stderr, should_cancel)` signature. `ScriptBuilder` plugs into the existing registry.
- The job worker model is unchanged. Same operations, same logs, same SSE plumbing, same scheduler rules (minus the `refresh_global_overlays` entry).
- No new application-level dependencies. Vendored HTMX, no new Python packages. Two new system dependencies: the `bubblewrap` apt package and the `l4d2-sandbox` system user.
- No new config keys. Same env files (`/etc/left4me/host.env`, `/etc/left4me/web.env`).
- DB migration is destructive for global-maps overlay rows. This is acceptable per the test-deploy assumption; a production-rollout follow-up would need to address it.
- The companion implementation plan governs task ordering and verification commands. Implementation must not start without explicit user approval per that plan's gate.