Build-overlay template unit — refactor the script-sandbox helper
Status: open question, not settled design. This is a handoff
document prompted by the build-time idmap landing on 2026-05-15. The
current left4me-script-sandbox shell helper works but has accumulated
several layers of complexity (idmap bind setup, trap cleanup, nsenter
self-wrap) that a systemd template unit would handle declaratively.
The same pattern is already established in the codebase for
gameservers (left4me-server@.service). A future session should
evaluate whether to refactor and, if so, follow the steps below.
Why this came up
While verifying the build-time idmap refactor, the first 5 build jobs
failed with mkdir: Permission denied on /overlay/.... Root cause:
- left4me-web.service runs with PrivateTmp=true, which puts the web app (and anything it sudoes into) in a private mount namespace.
- The script-sandbox helper, invoked via sudo from the web app, inherits that namespace.
- The helper's mount --bind --map-users=... pre-creates the idmap staging path in the web app's namespace.
- systemd-run (called by the helper) spawns a transient unit in PID 1's mount namespace.
- The transient unit's BindPaths=...:/overlay resolves the staging path in PID 1's namespace, where the bind doesn't exist. It sees an empty root-owned dir at the staging path (mkdir'd by the helper before the bind) and binds that to /overlay.
- Sandbox uid hits EACCES on every write.
We fixed it (commit f1aa05d) by self-wrapping the helper into
PID 1's mount namespace at the top of the script:
if [[ "${L4D2_SANDBOX_IN_PID1_MNT_NS:-}" != "1" ]]; then
    exec env L4D2_SANDBOX_IN_PID1_MNT_NS=1 \
        /usr/bin/nsenter --mount=/proc/1/ns/mnt -- "$0" "$@"
fi
That works. But it's a band-aid for an architectural friction:
helper invocation via sudo from a hardened service forces us to
manually escape the caller's namespace before any mount syscall.
If the helper were itself a systemd unit started by PID 1, the
namespace would be correct by default.
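When debugging this class of problem it helps to check which mount namespace a process is actually in. The following is a minimal sketch, assuming a Linux /proc filesystem; the function names are hypothetical and not part of the left4me codebase.

```python
import os


def mount_ns_id(pid: str = "self") -> str:
    """Return the mount-namespace identifier of a pid, e.g. 'mnt:[4026531841]'."""
    # /proc/<pid>/ns/mnt is a magic symlink whose target names the namespace.
    return os.readlink(f"/proc/{pid}/ns/mnt")


def in_same_mount_ns(pid_a: str = "self", pid_b: str = "1") -> bool:
    """True when both pids share a mount namespace."""
    return mount_ns_id(pid_a) == mount_ns_id(pid_b)
```

Called as in_same_mount_ns() from inside the helper, this would be expected to return False when the helper is invoked via sudo from the PrivateTmp'd web service and True after the nsenter self-wrap. Reading /proc/1/ns/mnt normally requires root.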
The gameserver helper handles this at the unit level. Its ExecStartPre is:
ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
i.e. the nsenter wrap lives in the unit file. The unit is started by PID 1
and the + prefix exempts the command from the unit's own sandboxing, so it
already runs in PID 1's mount namespace; the nsenter is belt-and-braces.
Mirror that pattern for builds: introduce build-overlay@.service as
a template unit, have the worker activate it instead of forking a
helper.
Current state (the thing being replaced)
Files:
- deploy/files/usr/local/libexec/left4me/left4me-script-sandbox: the bash helper. ~100 lines. Self-wraps in nsenter, does pre-bind with --map-users, invokes systemd-run --quiet --collect --wait --pipe -p ... -- /bin/bash /script.sh, cleans up via trap.
- l4d2web/services/overlay_builders.py:run_sandboxed_script: the worker entry point. Writes script content to /var/lib/left4me/sandbox-scripts/<uniqued>.sh, invokes sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <path>, streams stdout/stderr via subprocess.Popen + the existing run_command plumbing.
- deploy/files/etc/sudoers.d/left4me: grants left4me NOPASSWD to the helper path.
What the helper actually does:
- nsenter into PID 1's mount ns (the band-aid)
- validate args + overlay dir exists
- compute STAGING=/var/lib/left4me/tmp/sandbox-idmap-${OVERLAY_ID}
- trap cleanup; pre-emptive umount of stale staging; mkdir -p the staging
- mount --bind --map-users=$(id -u left4me):$(id -u l4d2-sandbox):1 --map-groups=... $OVERLAY_DIR $STAGING
- systemd-run with the full hardening profile, BindPaths=$STAGING:/overlay
- Wait for completion, propagate exit code
- trap fires: umount $STAGING; rmdir $STAGING
Proposed design
Replace the bash helper with a systemd template unit, placed in the
existing l4d2-build.slice and emitted from ckn-bw's existing
systemd_units reactor, plus a small worker rewrite.
build-overlay@.service (template unit)
[Unit]
Description=Sandboxed overlay build for instance %i
DefaultDependencies=no
After=local-fs.target
RequiresMountsFor=/var/lib/left4me/overlays/%i
ConditionPathIsDirectory=/var/lib/left4me/overlays/%i
ConditionPathExists=/var/lib/left4me/sandbox-scripts/%i.sh
[Service]
Type=oneshot
User=l4d2-sandbox
Group=l4d2-sandbox
Slice=l4d2-build.slice
# Idmap bind: disk uid 980 (left4me) ↔ mount uid 981 (sandbox), so writes
# from the sandbox land on disk as left4me. + prefix runs as root before
# the User= drop (mount syscall requires CAP_SYS_ADMIN).
ExecStartPre=+/usr/bin/mkdir -p /run/left4me/idmap/%i
ExecStartPre=+/usr/bin/mount --bind \
    --map-users=980:981:1 --map-groups=980:981:1 \
    /var/lib/left4me/overlays/%i /run/left4me/idmap/%i
ExecStart=/bin/bash /script.sh
ExecStopPost=+-/usr/bin/umount /run/left4me/idmap/%i
ExecStopPost=+-/usr/bin/rmdir /run/left4me/idmap/%i
# Hardening — all the -p flags from the current bash helper, declared
# declaratively here instead of as systemd-run -p arguments.
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
PrivateDevices=yes
PrivateIPC=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectKernelLogs=yes
ProtectControlGroups=yes
RestrictNamespaces=yes
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictSUIDSGID=yes
LockPersonality=yes
MemoryDenyWriteExecute=yes
SystemCallFilter=@system-service @network-io
SystemCallArchitectures=native
CapabilityBoundingSet=
AmbientCapabilities=
IPAddressDeny=127.0.0.0/8 ::1/128 169.254.0.0/16 fe80::/10 224.0.0.0/4 ff00::/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 100.64.0.0/10 fc00::/7
TemporaryFileSystem=/etc /var/lib
BindReadOnlyPaths=/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives /var/lib/left4me/sandbox-scripts/%i.sh:/script.sh
BindPaths=/run/left4me/idmap/%i:/overlay
WorkingDirectory=/overlay
Environment=HOME=/tmp PATH=/usr/bin:/usr/sbin OVERLAY=/overlay
UMask=0022
OOMScoreAdjust=500
MemoryMax=4G
MemorySwapMax=0
TasksMax=512
CPUQuota=200%
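# Note: RuntimeMaxSec= has no effect on Type=oneshot units; TimeoutStartSec=
# is the setting that actually bounds the build.
RuntimeMaxSec=3600
TimeoutStartSec=1h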
TimeoutStopSec=30s
Notes:
- Type=oneshot makes systemctl start block until ExecStart exits.
- ConditionPath* skips the unit if the overlay dir or script doesn't exist (avoids running the unit at all in those cases). Note that an unmet Condition is a silent skip, not a failure: systemctl start still returns 0. If a missing dir or script should fail the build, use the Assert* variants (AssertPathIsDirectory=, AssertPathExists=) instead.
- RequiresMountsFor=/var/lib/left4me/overlays/%i ensures the parent fs is mounted before this unit runs (/ and /var/lib if it's a separate mount point).
- ExecStopPost uses +- (root, ignore failures): the bind might already be torn down if the unit is restarting.
- BindReadOnlyPaths=...:/script.sh makes the per-overlay script available at /script.sh inside the sandbox, picked from the predictable path /var/lib/left4me/sandbox-scripts/%i.sh.
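Because systemd treats an unmet Condition*= as a skip rather than a failure, the worker may want to check the same two paths itself before starting the unit. A minimal sketch, assuming the paths named in the unit above; BuildInputError and check_build_inputs are hypothetical names for illustration.

```python
from pathlib import Path


class BuildInputError(Exception):
    """Hypothetical error type for this sketch."""


def check_build_inputs(
    overlay_id: int,
    overlays_root: Path = Path("/var/lib/left4me/overlays"),
    scripts_root: Path = Path("/var/lib/left4me/sandbox-scripts"),
) -> None:
    """Mirror the unit's ConditionPath* checks on the worker side."""
    overlay_dir = overlays_root / str(overlay_id)
    script = scripts_root / f"{overlay_id}.sh"
    if not overlay_dir.is_dir():
        raise BuildInputError(f"overlay dir missing: {overlay_dir}")
    if not script.is_file():
        raise BuildInputError(f"sandbox script missing: {script}")
```

Calling check_build_inputs(overlay_id) right after writing the script file would turn a silently skipped unit into an explicit worker-side error.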
Worker invocation
Replace run_sandboxed_script in
l4d2web/services/overlay_builders.py:
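import os
import subprocess
import threading


def run_sandboxed_script(
    overlay_id: int,
    script_text: str,
    *,
    on_stdout: LogSink,
    on_stderr: LogSink,
    should_cancel: CancelCheck,
) -> None:
    # NOTE: should_cancel is not wired up in this sketch (see "Cancel
    # semantics" below). on_stderr is unused because the journal merges
    # the unit's stdout and stderr into one stream.
    script_dir = _sandbox_script_dir()
    script_dir.mkdir(parents=True, exist_ok=True)
    script_path = script_dir / f"{overlay_id}.sh"
    script_path.write_text(script_text or "")
    os.chmod(script_path, 0o644)

    unit = f"build-overlay@{overlay_id}.service"

    # Tail the unit's journal as a sidecar, started before the unit so early
    # output isn't missed. journalctl --follow never exits on its own; it is
    # terminated below once systemctl returns. Assumes the worker user may
    # read the system journal (e.g. via the systemd-journal group).
    journal = subprocess.Popen(
        ["journalctl", "--unit", unit, "--output=cat", "--follow",
         "--since=now", "--no-pager"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )

    def pump() -> None:
        # Forward journal lines while the unit runs; the loop ends at EOF.
        # Assumes on_stdout is safe to call from a helper thread.
        for line in journal.stdout or []:
            on_stdout(line.rstrip("\n"))

    pump_thread = threading.Thread(target=pump, daemon=True)
    pump_thread.start()

    try:
        # Start the unit (sudoers permits this exact verb pattern).
        # Type=oneshot makes this block until ExecStart returns.
        rc = subprocess.run(
            ["sudo", "-n", "/bin/systemctl", "start", unit],
            check=False,
        ).returncode
    finally:
        # Stop the sidecar and let the pump drain what is already buffered.
        # Lines still in flight at this point may be lost.
        journal.terminate()
        pump_thread.join(timeout=5)
        try:
            journal.wait(timeout=5)
        except subprocess.TimeoutExpired:
            journal.kill()
            journal.wait()

    # Read exit code from the unit. ExecMainStatus is the script's rc;
    # Result is "success" / "failed" / "timeout" etc. Parse Key=Value lines
    # rather than --value output, whose order is not the order requested.
    show = subprocess.check_output(
        ["systemctl", "show", unit, "-p", "ExecMainStatus", "-p", "Result"],
        text=True,
    )
    props = dict(line.split("=", 1) for line in show.splitlines() if "=" in line)
    exec_main_status = int(props.get("ExecMainStatus") or -1)
    result = props.get("Result", "")

    if rc != 0 or result != "success":
        raise BuildError(
            f"build-overlay@{overlay_id} failed: "
            f"systemctl rc={rc} unit result={result} script exit={exec_main_status}"
        )
That's roughly the size of today's ~50-line version, and the helper script disappears entirely.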
Two refinements to consider:
- Cancel semantics: today the worker's should_cancel callback triggers a SIGTERM via the existing run_command plumbing. With systemctl-start, you'd issue systemctl stop build-overlay@<id> in a parallel thread when should_cancel() returns True. Wire that up.
- Journal streaming race: journalctl --follow --since=now started after systemctl start may miss the first few lines. Two fixes:
  - Start the journal tail before systemctl-start (the unit doesn't exist yet, so journalctl waits silently; verify this behaviour on Trixie).
  - Or use journalctl's cursor machinery: snapshot the cursor before start (--show-cursor), then read with --after-cursor=.

  Start-before is simpler and likely sufficient for L4D2 build verbosity, where the first second of output isn't critical.
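The cancel path could be wired up roughly as follows. This is a sketch under stated assumptions, not the worker's actual code: start_cancel_watcher, the done event, and the injectable run parameter are hypothetical names introduced here so the logic can be exercised without sudo.

```python
import subprocess
import threading
from typing import Callable, Sequence


def start_cancel_watcher(
    unit: str,
    should_cancel: Callable[[], bool],
    done: threading.Event,
    *,
    poll_seconds: float = 1.0,
    run: Callable[[Sequence[str]], object] = (
        lambda argv: subprocess.run(argv, check=False)
    ),
) -> threading.Thread:
    """Poll should_cancel() and stop the unit once it returns True."""

    def watch() -> None:
        # Event.wait() returns True as soon as `done` is set, ending the loop.
        while not done.wait(poll_seconds):
            if should_cancel():
                run(["sudo", "-n", "/bin/systemctl", "stop", unit])
                return

    thread = threading.Thread(target=watch, name=f"cancel-{unit}", daemon=True)
    thread.start()
    return thread
```

In the worker, the watcher would be started just before the blocking systemctl start call, and done.set() followed by join() would go in the finally block so the thread never outlives the build.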
Sudoers
Replace:
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox
with:
left4me ALL=(root) NOPASSWD: /bin/systemctl start build-overlay@*.service
left4me ALL=(root) NOPASSWD: /bin/systemctl stop build-overlay@*.service
(Tighter — verb-prefixed and instance-globbed. No script path passed.)
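One caveat worth checking before relying on the glob: sudoers matches the command's arguments as a single space-joined string, so a wildcard can span more than one argument. The sketch below uses Python's fnmatch as an approximation of that matching (it is not sudo's own matcher) and adds a worker-side check on the unit name. glob_allows and safe_unit_name are hypothetical helpers.

```python
import fnmatch
import re

SUDOERS_GLOB = "start build-overlay@*.service"
TIGHTER_GLOB = "start build-overlay@[0-9]*.service"
UNIT_RE = re.compile(r"build-overlay@[0-9]+\.service")


def glob_allows(args: str, pattern: str) -> bool:
    """Approximate sudoers matching: arguments are matched as one string."""
    return fnmatch.fnmatchcase(args, pattern)


def safe_unit_name(overlay_id: int) -> str:
    """Build the unit name and refuse anything but a plain decimal id."""
    unit = f"build-overlay@{int(overlay_id)}.service"
    if not UNIT_RE.fullmatch(unit):
        raise ValueError(f"unexpected unit name: {unit!r}")
    return unit
```

Under this approximation both globs accept "start build-overlay@9.service sshd.service", so the glob alone should not be treated as the only guard. Newer sudo releases also accept regular expressions in sudoers, which may allow an exact match; that is worth verifying against the sudo version on the target host.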
Slice
l4d2-build.slice already exists (per today's gameserver/sandbox
configuration). Reuse it; no change needed.
Sandbox script tmpfile cleanup
Currently run_sandboxed_script writes a per-invocation
tempfile.NamedTemporaryFile with a random suffix and unlinks it in a
finally. With template-unit lookup, the script path is predictable
per overlay id (/var/lib/left4me/sandbox-scripts/<id>.sh).
Implications:
- Two concurrent builds for the same overlay id would clobber the script file. The job queue already serializes per-overlay (per l4d2web/services/job_worker.py:OVERLAY_OPERATIONS), so this is OK.
- Scripts persist between builds (no auto-cleanup). Either accept that (the next build overwrites) or delete after the unit goes inactive. Recommend: leave them. They are small and useful for debugging.
Migration
In order:
- Add the unit emission to ckn-bw's bundles/left4me/metadata.py systemd_units reactor. Mirror the pattern used for left4me-server@.service. Drop in the template-unit content as another reactor entry.
- Update sudoers (bundles/left4me/files/etc/sudoers.d/left4me) to permit systemctl start/stop build-overlay@*.service and remove the script-sandbox grant.
- Replace run_sandboxed_script in left4me. Add the new journalctl-based output streaming, exit-code reading, and cancel handling. Keep the function signature stable so callers (ScriptBuilder.build, the wipe route) are unchanged.
- Delete deploy/files/usr/local/libexec/left4me/left4me-script-sandbox.
- Update tests:
  - deploy/tests/test_deploy_artifacts.py:
    - Drop test_script_sandbox_uses_idmap_staging and any other tests that read SCRIPT_SANDBOX_HELPER.
    - Add tests that assert the new unit emission in ckn-bw's reactor output. (But that's in the other repo; left4me's deploy tests can't directly cover it.)
    - Add a test that asserts the worker invokes sudo systemctl start build-overlay@* (grep overlay_builders.py).
  - l4d2web/tests/test_overlay_builders.py (if it exists): update mocks for run_sandboxed_script to expect the new subprocess shape.
- Test on left4.me:
  - Push left4me, bw apply ovh.left4me. Apply also picks up the new unit emission and the sudoers change.
  - Trigger a script-overlay rebuild via the web UI or the enqueue API path used in this session (see test history in git log around 2026-05-15).
  - Inspect: journalctl -u build-overlay@9.service, systemctl status build-overlay@9.service.
  - Verify on-disk state: overlay files end up left4me-owned; idmap bind cleanly torn down (findmnt | grep idmap empty).
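For the test update, a mock-based check of the "new subprocess shape" might look like the sketch below. start_build_unit is a hypothetical stand-in for the part of run_sandboxed_script that starts the unit; the real test would patch subprocess.run around the actual worker function instead.

```python
import subprocess
from unittest import mock


def start_build_unit(overlay_id: int) -> int:
    """Stand-in for the worker code that starts the template unit."""
    unit = f"build-overlay@{overlay_id}.service"
    return subprocess.run(
        ["sudo", "-n", "/bin/systemctl", "start", unit], check=False
    ).returncode


def test_worker_starts_template_unit() -> None:
    # Replace subprocess.run so no real sudo/systemctl call is made.
    fake_run = mock.Mock(return_value=mock.Mock(returncode=0))
    with mock.patch("subprocess.run", fake_run):
        assert start_build_unit(9) == 0
    fake_run.assert_called_once_with(
        ["sudo", "-n", "/bin/systemctl", "start", "build-overlay@9.service"],
        check=False,
    )
```

Asserting on the exact argv keeps the test aligned with the sudoers entry, which only permits this verb and unit-name pattern.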
Open decisions for the future session
- /run/left4me/idmap/%i vs. /var/lib/left4me/tmp/sandbox-idmap-%i: /run is tmpfs and wiped on reboot, more correct for transient mount paths. But it requires the dir to exist (created by ExecStartPre). Either works.
- What to do with the existing left4me-apply-cake dead code: irrelevant to this refactor; flagged in the other handoff doc.
- Whether to drop the post-build chmod o+r in the sandbox helper: already gone in the build-time-idmap commit. (Verify that nothing equivalent is needed in the new unit; files are left4me-owned, web reads via primary uid.)
- Type=oneshot vs. Type=exec: oneshot blocks systemctl start; exec doesn't. With oneshot we don't need the journalctl --follow workaround if we read the journal after completion. But for live progress (which the existing builds stream), --follow is still needed. Stick with oneshot.
- Should the unit set KillMode=mixed to ensure children die on stop? Worth checking. The existing systemd-run line doesn't set it explicitly, and the default (KillMode=control-group) already signals every process in the unit's cgroup, so the default should suffice.
- StateDirectory= vs. explicit mkdir -p: systemd has StateDirectory and RuntimeDirectory directives that auto-create per-unit directories. Could replace the mkdir -p /run/left4me/idmap/%i ExecStartPre with RuntimeDirectory=left4me/idmap/%i. Cleaner; gets auto-cleanup on stop too. Recommend doing this: both the mkdir and the rmdir ExecStopPost would go away. Before relying on it, confirm that the umount ExecStopPost always runs ahead of systemd's directory cleanup.
Verification
End-to-end smoke test on left4.me after the deploy:
# unit is installed and template-parseable
systemctl list-unit-files 'build-overlay@.service' # STATE column should read "static"
sudo systemd-analyze verify build-overlay@1.service
# enqueue a build via the web app's worker path (mimic the
# enqueue_build_overlay pattern from this session's job 64 onwards)
# then watch:
sudo journalctl -u build-overlay@9.service -f
# on completion:
systemctl show build-overlay@9.service -p Result -p ExecMainStatus
# expect: Result=success, ExecMainStatus=0
# disk state
sudo find /var/lib/left4me/overlays/9 -uid 981 # should be empty
sudo find /run/left4me/idmap # should not exist or be empty
# pid 1 mount table — no orphan idmap binds
sudo findmnt --task 1 -o TARGET | grep idmap # empty
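The orphan-bind check can also be scripted. A minimal sketch, assuming the staging root is /run/left4me/idmap as in the proposed unit (pass a different root if the /var/lib/left4me/tmp variant is chosen); both function names are hypothetical.

```python
import subprocess

IDMAP_ROOT = "/run/left4me/idmap"


def orphan_idmap_mounts(findmnt_output: str, root: str = IDMAP_ROOT) -> list[str]:
    """Return mount targets at or below the idmap staging root."""
    targets = [line.strip() for line in findmnt_output.splitlines()]
    return [t for t in targets if t == root or t.startswith(root + "/")]


def list_pid1_mount_targets() -> str:
    """Flat list of mount targets in PID 1's namespace (usually needs root)."""
    return subprocess.check_output(
        ["findmnt", "--task", "1", "--raw", "--noheadings", "-o", "TARGET"],
        text=True,
    )
```

After a finished build, orphan_idmap_mounts(list_pid1_mount_targets()) should be an empty list; anything else points at a bind that ExecStopPost failed to tear down.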
Risks
- Worker cancel-during-build: today's should_cancel callback signals via run_command's child process. With the unit, the worker needs a separate path: spawn a thread that polls should_cancel() and calls sudo systemctl stop build-overlay@<id> when triggered. Without this, a user cancel won't terminate the build promptly; it would keep running until TimeoutStartSec= expires (RuntimeMaxSec= has no effect on Type=oneshot units).
- Journal lag at unit start: journalctl --follow started before systemctl start should pick up all output. If not, may need cursor-based streaming. Test with a script that prints immediately (echo hello; exit 0). If "hello" appears in the job log, the race is handled.
- Sudoers globbing: systemctl start build-overlay@*.service permits any instance id including weird strings like ../etc-passwd. sudoers wildcards also match across argument boundaries, so the glob can accept extra arguments after the first unit name. Use a tighter glob if possible (e.g., build-overlay@[0-9]*.service), validate the id as an integer in the worker, and test that sudoers rejects unexpected instance names and extra arguments.
- Type=oneshot return semantics: confirm that systemctl start build-overlay@<id> on a Type=oneshot unit returns a non-zero rc when the unit's ExecStart fails (and note the exact value), so the worker can detect failure without re-querying systemctl show.
- Running across a reboot: a build that's running across a reboot is killed when the system goes down. That's identical to today's behavior with systemd-run. Acceptable.
- Sidecar reaping: the journalctl sidecar lingers as a zombie if it is not reaped properly. Wherever the worker calls journal.wait(timeout=5), handle the timeout case (force-kill, then wait again).
Pointers
Reference files (with line numbers if applicable):
- Current helper to be removed: deploy/files/usr/local/libexec/left4me/left4me-script-sandbox
- Current worker invoker: l4d2web/services/overlay_builders.py:run_sandboxed_script (~ln 324)
- Current job-worker dispatch: l4d2web/services/job_worker.py (build_overlay operation)
- Sudoers: deploy/files/etc/sudoers.d/left4me (matched verbatim in ckn-bw/bundles/left4me/files/etc/sudoers.d/left4me)
- Sample template unit pattern (the model to copy): left4me-server@.service emission in ckn-bw's bundles/left4me/metadata.py systemd_units reactor.
- Existing slice declaration (already correct): l4d2-build.slice in ckn-bw's reactor.
Recent commits that touched this surface:
- 4838108: moved idmap to build time (the refactor that surfaced the namespace bug)
- f1aa05d: added nsenter self-wrap (the band-aid this refactor removes)
- 2f6a9cf, 9053186, dd918ac: earlier idmap-on-mount approach that was reverted
Related design docs:
- docs/superpowers/plans/2026-05-15-build-time-idmap.md: the plan whose architecture this refactor builds on
- docs/superpowers/specs/2026-05-15-deploy-dir-rethink-design.md: unrelated open questions about deploy/ layout
What's NOT in scope
- Rewriting the sandbox in Python / packaging differently.
- Changing the security hardening profile (the unit duplicates the current set verbatim; adjust later if needed).
- Splitting the gameserver uid from the web app uid (noted in earlier handoff doc).
- Re-evaluating whether l4d2-sandbox should exist as a separate uid (kept; defense in depth).
- Touching the left4me-overlay gameserver helper (it already uses the pattern; only the sandbox helper is being refactored to match).
Estimate
Rough breakdown for the future session:
- Unit file design + ckn-bw reactor change: 1-2 hours
- Worker rewrite (run_sandboxed_script): 1-2 hours
- Tests: 1 hour
- Deploy + verify on test server: 30 min
- Bug-fix and iteration buffer: 1 hour
~5 hours of focused work, assuming no surprises with journalctl streaming or sudoers semantics.
Decision criteria for whether to do this
Do it if:
- You're about to make any other change to the sandbox hardening, build lifecycle, or sandbox uid story.
- You're frustrated by debugging the existing helper.
- You want to remove the nsenter band-aid for hygiene.
Skip if:
- The sandbox is stable and you're not planning related changes.
- You'd rather invest the time in higher-value work elsewhere.
The current solution is fine; this refactor is upgrade-not-fix.