Build-overlay template unit — refactor the script-sandbox helper
Status: open question, not settled design. This is a handoff
document prompted by the build-time idmap landing on 2026-05-15. The
current left4me-script-sandbox shell helper works but has accumulated
several layers of complexity (idmap bind setup, trap cleanup, nsenter
self-wrap) that a systemd template unit would handle declaratively.
The same pattern is already established in the codebase for
gameservers (left4me-server@.service). A future session should
evaluate whether to refactor and, if so, follow the steps below.
Why this came up
While verifying the build-time idmap refactor, the first 5 build jobs
failed with mkdir: Permission denied on /overlay/.... Root cause:
- left4me-web.service runs with PrivateTmp=true, which puts the web app (and anything it sudoes into) in a private mount namespace.
- The script-sandbox helper, invoked via sudo from the web app, inherits that namespace.
- The helper's mount --bind --map-users=... pre-creates the idmap staging path in the web app's namespace.
- systemd-run (called by the helper) spawns a transient unit in PID 1's mount namespace.
- The transient unit's BindPaths=...:/overlay resolves the staging path in PID 1's namespace, where the bind doesn't exist. It sees an empty root-owned dir at the staging path (mkdir'd by the helper before the bind) and binds that to /overlay.
- Sandbox uid hits EACCES on every write.
We fixed it (commit f1aa05d) by self-wrapping the helper into
PID 1's mount namespace at the top of the script:
if [[ "${L4D2_SANDBOX_IN_PID1_MNT_NS:-}" != "1" ]]; then
    exec env L4D2_SANDBOX_IN_PID1_MNT_NS=1 \
        /usr/bin/nsenter --mount=/proc/1/ns/mnt -- "$0" "$@"
fi
That works. But it's a band-aid for an architectural friction:
helper invocation via sudo from a hardened service forces us to
manually escape the caller's namespace before any mount syscall.
If the helper were itself a systemd unit started by PID 1, the
namespace would be correct by default.
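Whether a process shares PID 1's mount namespace can be checked by comparing the namespace links under /proc. The following is a minimal Python sketch of that check, assuming a Linux /proc layout; the function names are illustrative and not part of left4me.

```python
import os
import re


def ns_inode(link_target: str) -> int:
    """Extract the inode from a namespace link such as 'mnt:[4026531840]'."""
    match = re.fullmatch(r"[a-z_]+:\[(\d+)\]", link_target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {link_target!r}")
    return int(match.group(1))


def mount_ns_of(pid: str) -> int:
    """Return the mount-namespace inode for a pid ('self' or a number)."""
    # Reading another process's ns link (e.g. /proc/1/ns/mnt) generally
    # requires root, so this is meant to run from a privileged context.
    return ns_inode(os.readlink(f"/proc/{pid}/ns/mnt"))


def in_pid1_mount_ns() -> bool:
    """True if the calling process shares PID 1's mount namespace."""
    return mount_ns_of("self") == mount_ns_of("1")
```

Run from the helper before and after the nsenter self-wrap, in_pid1_mount_ns() would be expected to flip from False to True when the caller is a PrivateTmp= service.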
The gameserver helper handles this at the unit level. Its ExecStartPre is:
ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
i.e. wrapped in nsenter at the unit level. The unit is started by PID 1,
so it already has PID 1's namespace; the nsenter is belt-and-braces.
Mirror that pattern for builds: introduce build-overlay@.service as
a template unit, have the worker activate it instead of forking a
helper.
Current state (the thing being replaced)
Files:
- deploy/files/usr/local/libexec/left4me/left4me-script-sandbox: the bash helper. ~100 lines. Self-wraps in nsenter, does pre-bind with --map-users, invokes systemd-run --quiet --collect --wait --pipe -p ... -- /bin/bash /script.sh, cleans up via trap.
- l4d2web/services/overlay_builders.py:run_sandboxed_script is the worker entry point. Writes script content to /var/lib/left4me/sandbox-scripts/<uniqued>.sh, invokes sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <path>, streams stdout/stderr via subprocess.Popen + the existing run_command plumbing.
- deploy/files/etc/sudoers.d/left4me: grants left4me NOPASSWD to the helper path.
What the helper actually does:
- nsenter into PID 1's mount ns (the band-aid)
- validate args + overlay dir exists
- compute STAGING=/var/lib/left4me/tmp/sandbox-idmap-${OVERLAY_ID}
- trap cleanup; pre-emptive umount of stale staging; mkdir -p the staging
- mount --bind --map-users=$(id -u left4me):$(id -u l4d2-sandbox):1 --map-groups=... $OVERLAY_DIR $STAGING
- systemd-run with the full hardening profile, BindPaths=$STAGING:/overlay
- Wait for completion, propagate exit code
- trap fires: umount $STAGING; rmdir $STAGING
Proposed design
Replace the bash helper with a systemd template unit, placed in the
existing l4d2-build.slice and emitted from ckn-bw's existing
systemd_units reactor, plus a small worker rewrite.
build-overlay@.service (template unit)
[Unit]
Description=Sandboxed overlay build for instance %i
DefaultDependencies=no
After=local-fs.target
RequiresMountsFor=/var/lib/left4me/overlays/%i
ConditionPathIsDirectory=/var/lib/left4me/overlays/%i
ConditionPathExists=/var/lib/left4me/sandbox-scripts/%i.sh
[Service]
Type=oneshot
User=l4d2-sandbox
Group=l4d2-sandbox
Slice=l4d2-build.slice
# Idmap bind: disk uid 980 (left4me) ↔ mount uid 981 (sandbox), so writes
# from the sandbox land on disk as left4me. + prefix runs as root before
# the User= drop (mount syscall requires CAP_SYS_ADMIN).
ExecStartPre=+/usr/bin/mkdir -p /run/left4me/idmap/%i
ExecStartPre=+/usr/bin/mount --bind \
--map-users=980:981:1 --map-groups=980:981:1 \
/var/lib/left4me/overlays/%i /run/left4me/idmap/%i
ExecStart=/bin/bash /script.sh
ExecStopPost=+-/usr/bin/umount /run/left4me/idmap/%i
ExecStopPost=+-/usr/bin/rmdir /run/left4me/idmap/%i
# Hardening — all the -p flags from the current bash helper, declared
# declaratively here instead of as systemd-run -p arguments.
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
PrivateDevices=yes
PrivateIPC=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectKernelLogs=yes
ProtectControlGroups=yes
RestrictNamespaces=yes
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictSUIDSGID=yes
LockPersonality=yes
MemoryDenyWriteExecute=yes
SystemCallFilter=@system-service @network-io
SystemCallArchitectures=native
CapabilityBoundingSet=
AmbientCapabilities=
IPAddressDeny=127.0.0.0/8 ::1/128 169.254.0.0/16 fe80::/10 224.0.0.0/4 ff00::/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 100.64.0.0/10 fc00::/7
TemporaryFileSystem=/etc /var/lib
BindReadOnlyPaths=/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives /var/lib/left4me/sandbox-scripts/%i.sh:/script.sh
BindPaths=/run/left4me/idmap/%i:/overlay
WorkingDirectory=/overlay
Environment=HOME=/tmp PATH=/usr/bin:/usr/sbin OVERLAY=/overlay
UMask=0022
OOMScoreAdjust=500
MemoryMax=4G
MemorySwapMax=0
TasksMax=512
CPUQuota=200%
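# RuntimeMaxSec= has no effect on Type=oneshot services (they never reach
# the "active" state while ExecStart runs); TimeoutStartSec= is the limit
# that actually bounds the build.
RuntimeMaxSec=3600
TimeoutStartSec=1h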
TimeoutStopSec=30s
Notes:
- Type=oneshot makes systemctl start block until ExecStart exits.
- ConditionPath* skips the unit early if the overlay dir or script doesn't exist (avoids running the unit at all in those cases). A failed condition is not reported as a failure, so systemctl start still returns 0; AssertPathIsDirectory= / AssertPathExists= fail the start job instead, if the worker should see an error.
- RequiresMountsFor=/var/lib/left4me/overlays/%i ensures the parent fs is mounted before this unit runs (/ and /var/lib if it's a separate mount point).
- ExecStopPost uses +- (root, ignore failures): the bind might already be torn down if the unit is restarting.
- BindReadOnlyPaths=...:/script.sh makes the per-overlay script available at /script.sh inside the sandbox, picked from the predictable path /var/lib/left4me/sandbox-scripts/%i.sh.
Script source: filesystem vs. DB
Critical design decision the future session must make. The current
plan in the unit sketch above assumes the worker writes the script
content to /var/lib/left4me/sandbox-scripts/<id>.sh before calling
systemctl start. But the script already lives in the DB (the
overlays.script column), and the unit instance name %i is the
overlay row id. The filesystem copy is redundant unless we want it.
Three options:
Option A — worker writes the script (the unit sketch above).
Worker queries DB, writes <id>.sh to a known path, then
systemctl start. Unit reads via BindReadOnlyPaths. Simple, no DB
access from the unit, the existing _sandbox_script_dir() plumbing
mostly works. Cost: redundant on-disk copy; stale files between
builds if you don't clean them.
Option B — unit fetches the script from the DB itself. A small
root-side helper installed as
/usr/local/libexec/left4me/left4me-fetch-script does:
#!/usr/bin/python3
# Print the script stored for one overlay row. Prints nothing if the row
# is missing or its script column is NULL.
import sqlite3
import sys

overlay_id = int(sys.argv[1])
conn = sqlite3.connect("/var/lib/left4me/left4me.db")
row = conn.execute(
    "SELECT script FROM overlays WHERE id = ?", (overlay_id,)
).fetchone()
sys.stdout.write((row[0] if row else "") or "")
Unit's ExecStartPre runs it as root (the + prefix), pipes the
output to a runtime path that ExecStart reads:
RuntimeDirectory=left4me/sandbox-scripts
RuntimeDirectoryMode=0700
ExecStartPre=+/bin/sh -c '/usr/local/libexec/left4me/left4me-fetch-script %i \
> /run/left4me/sandbox-scripts/%i.sh && chmod 0644 /run/left4me/sandbox-scripts/%i.sh'
BindReadOnlyPaths=/run/left4me/sandbox-scripts/%i.sh:/script.sh
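(RuntimeDirectory= auto-creates /run/left4me/sandbox-scripts/ on
start and removes it on stop, including the file inside. As written,
the directory is shared by every instance of the template, so if builds
for different overlays can overlap, one instance stopping would remove
the others' scripts; RuntimeDirectory=left4me/sandbox-scripts/%i, with
the paths adjusted to match, would give each instance its own.)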
The fetch script doesn't need sudoers — it runs from ExecStartPre with
root privileges already. It only reads the DB; no writes. The DB is
root:left4me 0640 so root can read it.
Worker becomes a one-liner: sudo systemctl start build-overlay@<id>.
No FS prep, no tmpfile cleanup.
Option C — pipe the script content directly into bash stdin. The
unit's ExecStart is something like
/bin/sh -c "fetch-script %i | /bin/bash". Pros: no on-disk file at
all. Cons: /bin/bash runs without a file path, so $0 is bash and
error messages look weird; harder to debug a failing script when there's
no file to inspect.
Recommendation: Option B. Decouples script storage (DB) from sandbox transport (a /run/ runtime file). RuntimeDirectory= handles cleanup. Worker becomes trivially small. The fetch-script helper is ~10 lines and stays in deploy/files/usr/local/libexec/left4me/.
If Option A is chosen instead, plan to track the script tmpfiles explicitly so they don't accumulate. With Option B, RuntimeDirectory auto-cleans on stop.
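One gap in the Option B helper as sketched is that a missing row yields an empty script, which the unit would then run as a successful no-op. The variant below is a hedged sketch of a stricter lookup, assuming the overlays(id, script) schema described above; fetch_script and open_readonly are illustrative names, not existing left4me functions.

```python
import sqlite3


def open_readonly(path: str) -> sqlite3.Connection:
    """Open the SQLite file read-only via a URI (no accidental writes)."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def fetch_script(conn: sqlite3.Connection, overlay_id: int) -> str:
    """Return the stored script for an overlay row.

    Raises LookupError when the row does not exist, so the caller can
    exit non-zero instead of handing bash an empty script. A NULL script
    column is returned as an empty string.
    """
    row = conn.execute(
        "SELECT script FROM overlays WHERE id = ?", (overlay_id,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no overlay with id {overlay_id}")
    return row[0] or ""
```

A helper built on this would catch LookupError, print a message to stderr, and exit non-zero, which fails ExecStartPre and therefore the whole unit start.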
Worker invocation
Replace run_sandboxed_script in
l4d2web/services/overlay_builders.py. The code below is the Option
A shape (worker writes the script). For Option B (recommended),
drop the script_dir/script_path/write_text/chmod lines — the
unit's ExecStartPre fetches from the DB. The signature can also drop
script_text since the worker doesn't need to pass content anymore.
import os
import subprocess
import threading
import time

def run_sandboxed_script(
    overlay_id: int,
    script_text: str,  # remove this param if Option B
    *,
    on_stdout: LogSink,
    on_stderr: LogSink,
    should_cancel: CancelCheck,
) -> None:
    # The five lines below are Option A only; delete them for Option B.
    script_dir = _sandbox_script_dir()
    script_dir.mkdir(parents=True, exist_ok=True)
    script_path = script_dir / f"{overlay_id}.sh"
    script_path.write_text(script_text or "")
    os.chmod(script_path, 0o644)

    unit = f"build-overlay@{overlay_id}.service"

    # Tail the unit's journal as a sidecar so output streams into job-logs
    # while the unit runs. journalctl --follow never exits on its own, so
    # it is terminated explicitly once systemctl returns.
    journal = subprocess.Popen(
        ["journalctl", "--unit", unit, "--output=cat", "--follow",
         "--since=now", "--no-pager"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )

    def pump() -> None:
        # Forward lines as they arrive. The loop ends at EOF, i.e. once
        # journalctl has exited and the pipe is drained.
        for line in journal.stdout or []:
            on_stdout(line.rstrip("\n"))

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()
    try:
        # Start the unit (sudoers permits this exact verb pattern).
        # Type=oneshot makes this block until ExecStart returns.
        rc = subprocess.run(
            ["sudo", "-n", "/bin/systemctl", "start", unit],
            check=False,
        ).returncode
    finally:
        # journalctl may lag slightly behind the unit. Give it a moment
        # (the 0.5 s is a guess), then stop it and let the reader drain.
        time.sleep(0.5)
        journal.terminate()
        try:
            journal.wait(timeout=5)
        except subprocess.TimeoutExpired:
            journal.kill()
            journal.wait()
        reader.join(timeout=5)

    # Read the outcome from the unit. ExecMainStatus is the script's rc;
    # Result is "success" / "exit-code" / "timeout" etc. Parse KEY=VALUE
    # lines rather than relying on the output order of --value.
    show = subprocess.check_output(
        ["systemctl", "show", unit, "-p", "ExecMainStatus", "-p", "Result"],
        text=True,
    )
    props = dict(line.split("=", 1) for line in show.splitlines() if "=" in line)
    exec_main_status = int(props.get("ExecMainStatus") or 0)
    result = props.get("Result", "")
    if rc != 0 or result != "success":
        raise BuildError(
            f"build-overlay@{overlay_id} failed: "
            f"systemctl rc={rc} unit result={result} script exit={exec_main_status}"
        )

Option B drops the five script-writing lines, and in either case the helper script disappears entirely.
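The exit-status handling is the part of the worker that is easiest to unit-test in isolation. As a hedged sketch, the parsing of systemctl show output could be factored into two small pure functions; parse_show and build_succeeded are hypothetical names, and the sample text in the usage note is hand-written rather than captured from a real unit.

```python
def parse_show(output: str) -> dict[str, str]:
    """Parse `systemctl show` KEY=VALUE output into a dict.

    Parsing by key avoids depending on the order in which systemctl
    prints the requested properties.
    """
    props: dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key] = value
    return props


def build_succeeded(systemctl_rc: int, props: dict[str, str]) -> bool:
    """A build counts as successful only if both signals agree."""
    return systemctl_rc == 0 and props.get("Result") == "success"
```

For example, parse_show("Result=exit-code\nExecMainStatus=2\n") gives {"Result": "exit-code", "ExecMainStatus": "2"}, and build_succeeded(1, ...) on that dict is False.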
Two refinements to consider:
- Cancel semantics: today the worker's should_cancel callback triggers a SIGTERM via the existing run_command plumbing. With systemctl-start, you'd issue systemctl stop build-overlay@<id> in a parallel thread when should_cancel() returns True. Wire that up.
- Journal streaming race: journalctl --follow --since=now started after systemctl start may miss the first few lines. Two fixes:
  - Start the journal tail before systemctl-start (the unit doesn't exist yet, so journalctl waits silently; verify this behaviour on Trixie).
  - Or use journalctl's cursor machinery: snapshot the cursor before start, then read with --after-cursor=.
  Start-before is simpler and likely sufficient for L4D2 build verbosity, where the first second of output isn't critical.
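The cancel path can be sketched as a small watcher thread that polls the cancel callback and invokes a stop action once. This is an illustrative sketch only: start_cancel_watcher and its parameters are hypothetical names, and the stop action is passed in as a callable so the polling logic can be tested without systemd.

```python
import threading
from typing import Callable


def start_cancel_watcher(
    should_cancel: Callable[[], bool],
    stop_unit: Callable[[], None],
    done: threading.Event,
    poll_interval: float = 1.0,
) -> threading.Thread:
    """Poll should_cancel() until `done` is set; call stop_unit() at most once."""

    def watch() -> None:
        # Event.wait returns False on timeout and True once `done` is set,
        # so the loop exits promptly when the build finishes on its own.
        while not done.wait(poll_interval):
            if should_cancel():
                stop_unit()
                return

    thread = threading.Thread(target=watch, daemon=True)
    thread.start()
    return thread
```

In the worker, stop_unit would wrap something like subprocess.run(["sudo", "-n", "/bin/systemctl", "stop", unit], check=False), and done.set() would go in the finally block after systemctl start returns.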
Sudoers
Replace:
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox
with:
left4me ALL=(root) NOPASSWD: /bin/systemctl start build-overlay@*.service
left4me ALL=(root) NOPASSWD: /bin/systemctl stop build-overlay@*.service
(Tighter — verb-prefixed and instance-globbed. No script path passed.)
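Since the instance name is interpolated into a command run through sudo, the worker can also refuse anything that is not a plain positive integer before it builds the unit name. A minimal sketch, assuming overlay ids are positive integer row ids; build_unit_name is a hypothetical helper.

```python
def build_unit_name(overlay_id: object) -> str:
    """Return the template instance name for an overlay id.

    Rejects anything that is not a positive int, so strings such as
    '../etc-passwd' never reach the systemctl command line.
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(overlay_id, bool) or not isinstance(overlay_id, int):
        raise ValueError(f"overlay id must be an int, got {overlay_id!r}")
    if overlay_id <= 0:
        raise ValueError(f"overlay id must be positive, got {overlay_id}")
    return f"build-overlay@{overlay_id}.service"
```

This complements the sudoers rule rather than replacing it: sudoers is still the boundary that matters if the web app itself is compromised.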
Slice
l4d2-build.slice already exists (per today's gameserver/sandbox
configuration). Reuse it; no change needed.
Sandbox script tmpfile cleanup
Currently run_sandboxed_script writes a per-invocation
tempfile.NamedTemporaryFile with a random suffix and unlinks it in a
finally. With template-unit lookup, the script path is predictable
per overlay id (/var/lib/left4me/sandbox-scripts/<id>.sh).
Implications:
- Two concurrent builds for the same overlay id would clobber the script file. The job queue already serializes per-overlay (per l4d2web/services/job_worker.py:OVERLAY_OPERATIONS), so this is OK.
- Scripts persist between builds (no auto-cleanup). Either accept that (the next build overwrites) or delete after the unit goes inactive. Recommend: leave them; they are small and useful for debugging.
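If the predictable path is kept (Option A), writing the script through a temporary file and renaming it into place avoids a half-written file ever being visible at the path the unit binds. A hedged sketch using only the standard library; write_script_atomically is a hypothetical name and the directory is passed in rather than taken from _sandbox_script_dir().

```python
import os
import tempfile
from pathlib import Path


def write_script_atomically(script_dir: Path, overlay_id: int, script_text: str) -> Path:
    """Write <overlay_id>.sh via a temp file in the same directory, then rename."""
    script_dir.mkdir(parents=True, exist_ok=True)
    final_path = script_dir / f"{overlay_id}.sh"
    # The temp file must live in the same directory so that os.replace is
    # a same-filesystem rename, which is atomic on POSIX.
    fd, tmp_name = tempfile.mkstemp(
        dir=script_dir, prefix=f".{overlay_id}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(script_text or "")
        os.chmod(tmp_name, 0o644)
        os.replace(tmp_name, final_path)
    except BaseException:
        # Do not leave stray temp files behind on failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path
```

Because the next build overwrites the same path, this also keeps the "leave them for debugging" policy: the file on disk is always the complete script of the most recent build.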
Migration
In order:
- Add the unit emission to ckn-bw's bundles/left4me/metadata.py systemd_units reactor. Mirror the pattern used for left4me-server@.service. Drop in the template-unit content as another reactor entry.
- Update sudoers (bundles/left4me/files/etc/sudoers.d/left4me) to permit systemctl start/stop build-overlay@*.service and remove the script-sandbox grant.
- Replace run_sandboxed_script in left4me. Add the new journalctl-based output streaming, exit-code reading, and cancel handling. Keep the function signature stable so callers (ScriptBuilder.build, the wipe route) are unchanged.
- Delete deploy/files/usr/local/libexec/left4me/left4me-script-sandbox.
- Update tests:
  - deploy/tests/test_deploy_artifacts.py:
    - Drop test_script_sandbox_uses_idmap_staging and any other tests that read SCRIPT_SANDBOX_HELPER.
    - Add tests that assert the new unit emission in ckn-bw's reactor output. (But that's in the other repo; left4me's deploy tests can't directly cover it.)
    - Add a test that asserts the worker invokes sudo systemctl start build-overlay@* (grep overlay_builders.py).
  - l4d2web/tests/test_overlay_builders.py (if it exists): update mocks for run_sandboxed_script to expect the new subprocess shape.
- Test on left4.me:
  - Push left4me, bw apply ovh.left4me. Apply also picks up the new unit emission and the sudoers change.
  - Trigger a script-overlay rebuild via the web UI or the enqueue API path used in this session (see test history in git log around 2026-05-15).
  - Inspect: journalctl -u build-overlay@9.service, systemctl status build-overlay@9.service.
  - Verify on-disk state: overlay files end up left4me-owned; idmap bind cleanly torn down (findmnt | grep idmap empty).
Open decisions for the future session
- Script source: filesystem (Option A) vs. DB-fetched in ExecStartPre (Option B) vs. piped to stdin (Option C). See the "Script source" section above. This is the highest-impact decision because it shapes the worker, the unit's ExecStartPre, and whether you need a fetch-script helper binary at all. Recommendation: Option B.
- /run/left4me/idmap/%i vs. /var/lib/left4me/tmp/sandbox-idmap-%i: /run is tmpfs and wiped on reboot, more correct for transient mount paths. But it requires the dir to exist (created by ExecStartPre). Either works.
- What to do with the existing left4me-apply-cake dead code: irrelevant to this refactor; flagged in the other handoff doc.
- Whether to drop the post-build chmod o+r in the sandbox helper: already gone in the build-time-idmap commit. (Verify in the new unit nothing equivalent is needed; files are left4me-owned, web reads via primary uid.)
- Type=oneshot vs. Type=exec: oneshot blocks systemctl start. exec doesn't. With oneshot we don't need the journalctl --follow workaround if we read journal after completion. But for live progress (which the existing builds stream), --follow is still needed. Stick with oneshot.
- Should the unit set KillMode=mixed to ensure children die on stop? Worth checking; the existing systemd-run line doesn't set it explicitly, and defaults usually suffice.
- StateDirectory= vs. explicit mkdir -p: systemd has StateDirectory and RuntimeDirectory directives that auto-create per-unit directories. Could replace the mkdir -p /run/left4me/idmap/%i ExecStartPre with RuntimeDirectory=left4me/idmap/%i. Cleaner; gets auto-cleanup on stop too. Recommend doing this; both the mkdir and the rmdir ExecStopPost would go away.
Verification
End-to-end smoke test on left4.me after the deploy:
# unit is installed and template-parseable
systemctl list-unit-files 'build-overlay@.service' # should list the template as "static" (systemctl status needs an instance name)
sudo systemd-analyze verify build-overlay@1.service
# enqueue a build via the web app's worker path (mimic the
# enqueue_build_overlay pattern from this session's job 64 onwards)
# then watch:
sudo journalctl -u build-overlay@9.service -f
# on completion:
systemctl show build-overlay@9.service -p Result -p ExecMainStatus
# expect: Result=success, ExecMainStatus=0
# disk state
sudo find /var/lib/left4me/overlays/9 -uid 981 # should be empty
sudo find /run/left4me/idmap # should not exist or be empty
# pid 1 mount table — no orphan idmap binds
sudo findmnt --task 1 -o TARGET | grep idmap # empty
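The orphan-mount check can also be automated by reading PID 1's mount table directly. The sketch below parses text in /proc/<pid>/mountinfo format and returns any mount points under the idmap staging prefix; idmap_mounts is a hypothetical helper and the sample lines in the usage note are hand-written for illustration.

```python
def idmap_mounts(mountinfo: str, prefix: str = "/run/left4me/idmap/") -> list[str]:
    """Return mount points under `prefix` found in mountinfo-format text.

    In /proc/<pid>/mountinfo the fifth whitespace-separated field is the
    mount point, as seen from that process's root.
    """
    targets = []
    for line in mountinfo.splitlines():
        fields = line.split()
        if len(fields) > 4 and fields[4].startswith(prefix):
            targets.append(fields[4])
    return targets
```

A post-deploy check would read /proc/1/mountinfo as root and expect idmap_mounts(text) to be an empty list once no build is running.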
Risks
- Worker cancel-during-build: today's should_cancel callback signals via run_command's child process. With the unit, the worker needs a separate path: spawn a thread that polls should_cancel() and calls sudo systemctl stop build-overlay@<id> when triggered. Without this, user-cancelled builds won't terminate promptly. (Builds that exceed the unit's time limit are stopped by systemd itself.)
- Journal lag at unit start: journalctl --follow started before systemctl start should pick up all output. If not, may need cursor-based streaming. Test with a script that prints immediately (echo hello; exit 0); if "hello" appears in the job log, the race is handled.
- Sudoers globbing: systemctl start build-overlay@*.service permits any instance id including weird strings like ../etc-passwd. sudoers also matches the arguments as one concatenated string, so * can span spaces and let extra unit names through after the first. Use a tighter glob if possible (e.g., build-overlay@[0-9]*.service), bearing in mind that the trailing * there has the same property. Test that sudoers rejects unexpected instance names and extra arguments.
- Type=oneshot return semantics: confirm that systemctl start build-overlay@<id> on a Type=oneshot unit returns a non-zero rc when the unit's ExecStart fails (the exact value needs checking), so the worker can detect failure without re-querying systemctl show.
- Builds running across a reboot: a build that's running when the system goes down is killed. That's identical to today's behavior with systemd-run. Acceptable.
- The journalctl sidecar process lingers as a zombie if not reaped properly. Whatever the worker does when journal.wait(timeout=5) expires, it must end in a force-kill and a final wait.
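The sidecar reaping described in the last item can be sketched as a terminate-then-kill helper. This is an illustrative sketch; stop_sidecar is a hypothetical name and the 5-second grace period simply mirrors the timeout mentioned above.

```python
import subprocess


def stop_sidecar(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """Stop a helper process and always reap it.

    Sends SIGTERM first; if the process has not exited within `grace`
    seconds, sends SIGKILL and waits again so no zombie is left behind.
    """
    proc.terminate()
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()
```

Calling this from the finally block guarantees the journalctl process has been waited on before the worker moves on to reading the unit's result.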
Pointers
Reference files (with line numbers if applicable):
- Current helper to be removed: deploy/files/usr/local/libexec/left4me/left4me-script-sandbox
- Current worker invoker: l4d2web/services/overlay_builders.py:run_sandboxed_script (~ln 324)
- Current job-worker dispatch: l4d2web/services/job_worker.py (build_overlay operation)
- Sudoers: deploy/files/etc/sudoers.d/left4me (matched verbatim in ckn-bw/bundles/left4me/files/etc/sudoers.d/left4me)
- Sample template unit pattern (the model to copy): left4me-server@.service emission in ckn-bw's bundles/left4me/metadata.py systemd_units reactor.
- Existing slice declaration (already correct): l4d2-build.slice in ckn-bw's reactor.
Recent commits that touched this surface:
- 4838108: moved idmap to build time (the refactor that surfaced the namespace bug)
- f1aa05d: added nsenter self-wrap (the band-aid this refactor removes)
- 2f6a9cf, 9053186, dd918ac: earlier idmap-on-mount approach that was reverted
Related design docs:
- docs/superpowers/plans/2026-05-15-build-time-idmap.md: the plan whose architecture this refactor builds on
- docs/superpowers/specs/2026-05-15-deploy-dir-rethink-design.md: unrelated open questions about deploy/ layout
What's NOT in scope
- Rewriting the sandbox in Python / packaging differently.
- Changing the security hardening profile (the unit duplicates the current set verbatim — adjust later if needed).
- Splitting the gameserver uid from the web app uid (noted in earlier handoff doc).
- Re-evaluating whether l4d2-sandbox should exist as a separate uid (kept; defense in depth).
- Touching the left4me-overlay gameserver helper (it already uses the pattern; only the sandbox helper is being refactored to match).
Estimate
Rough breakdown for the future session:
- Unit file design + ckn-bw reactor change: 1-2 hours
- Worker rewrite (run_sandboxed_script): 1-2 hours
- Tests: 1 hour
- Deploy + verify on test server: 30 min
- Bug-fix and iteration buffer: 1 hour
~5 hours of focused work, assuming no surprises with journalctl streaming or sudoers semantics.
Decision criteria for whether to do this
Do it if:
- You're about to make any other change to the sandbox hardening, build lifecycle, or sandbox uid story.
- You're frustrated by debugging the existing helper.
- You want to remove the nsenter band-aid for hygiene.
Skip if:
- The sandbox is stable and you're not planning related changes.
- You'd rather invest the time in higher-value work elsewhere.
The current solution is fine; this refactor is upgrade-not-fix.