# Build-overlay template unit — refactor the script-sandbox helper **Status: open question, not settled design.** This is a handoff document prompted by the build-time idmap landing on 2026-05-15. The current `left4me-script-sandbox` shell helper works but has accumulated several layers of complexity (idmap bind setup, trap cleanup, nsenter self-wrap) that a systemd template unit would handle declaratively. The same pattern is already established in the codebase for gameservers (`left4me-server@.service`). A future session should evaluate whether to refactor and, if so, follow the steps below. ## Why this came up While verifying the build-time idmap refactor, the first 5 build jobs failed with `mkdir: Permission denied` on `/overlay/...`. Root cause: - `left4me-web.service` runs with `PrivateTmp=true`, which puts the web app (and anything it sudoes into) in a private mount namespace. - The script-sandbox helper, invoked via `sudo` from the web app, inherits that namespace. - The helper's `mount --bind --map-users=...` pre-creates the idmap staging path *in the web app's namespace*. - `systemd-run` (called by the helper) spawns a transient unit in PID 1's mount namespace. - The transient unit's `BindPaths=...:/overlay` resolves the staging path in PID 1's namespace — where the bind doesn't exist. It sees an empty root-owned dir at the staging path (mkdir'd by the helper before the bind) and binds *that* to `/overlay`. - Sandbox uid hits EACCES on every write. We fixed it (commit `f1aa05d`) by self-wrapping the helper into PID 1's mount namespace at the top of the script: ```bash if [[ "${L4D2_SANDBOX_IN_PID1_MNT_NS:-}" != "1" ]]; then exec env L4D2_SANDBOX_IN_PID1_MNT_NS=1 \ /usr/bin/nsenter --mount=/proc/1/ns/mnt -- "$0" "$@" fi ``` That works. But it's a band-aid for an architectural friction: **helper invocation via `sudo` from a hardened service forces us to manually escape the caller's namespace before any mount syscall**. If the helper were *itself* a systemd unit started by PID 1, the namespace would be correct by default. The gameserver helper handles this at the unit level. Its ExecStartPre is: ``` ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i ``` i.e. wrapped in nsenter *at the unit*. The unit is started by PID 1, so it has PID 1's namespace, then nsenter is a belt-and-braces. Mirror that pattern for builds: introduce `build-overlay@.service` as a template unit, have the worker activate it instead of forking a helper. ## Current state (the thing being replaced) Files: - `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` — the bash helper. ~100 lines. Self-wraps in nsenter, does pre-bind with `--map-users`, invokes `systemd-run --quiet --collect --wait --pipe -p ... -- /bin/bash /script.sh`, cleans up via trap. - `l4d2web/services/overlay_builders.py:run_sandboxed_script` — the worker entry point. Writes script content to `/var/lib/left4me/sandbox-scripts/.sh`, invokes `sudo -n /usr/local/libexec/left4me/left4me-script-sandbox `, streams stdout/stderr via `subprocess.Popen` + the existing `run_command` plumbing. - `deploy/files/etc/sudoers.d/left4me` — grants `left4me` NOPASSWD to the helper path. What the helper actually does: 1. nsenter into PID 1's mount ns (the band-aid) 2. validate args + overlay dir exists 3. compute `STAGING=/var/lib/left4me/tmp/sandbox-idmap-${OVERLAY_ID}` 4. `trap` cleanup; pre-emptive `umount` of stale staging; `mkdir -p` the staging 5. `mount --bind --map-users=$(id -u left4me):$(id -u l4d2-sandbox):1 --map-groups=... $OVERLAY_DIR $STAGING` 6. `systemd-run` with the full hardening profile, `BindPaths=$STAGING:/overlay` 7. Wait for completion, propagate exit code 8. trap fires: `umount $STAGING; rmdir $STAGING` ## Proposed design Replace the bash helper with two systemd units (template + a slice) emitted from ckn-bw's existing `systemd_units` reactor, plus a small worker rewrite. ### `build-overlay@.service` (template unit) ```ini [Unit] Description=Sandboxed overlay build for instance %i DefaultDependencies=no After=local-fs.target RequiresMountsFor=/var/lib/left4me/overlays/%i ConditionPathIsDirectory=/var/lib/left4me/overlays/%i ConditionPathExists=/var/lib/left4me/sandbox-scripts/%i.sh [Service] Type=oneshot User=l4d2-sandbox Group=l4d2-sandbox Slice=l4d2-build.slice # Idmap bind: disk uid 980 (left4me) ↔ mount uid 981 (sandbox), so writes # from the sandbox land on disk as left4me. + prefix runs as root before # the User= drop (mount syscall requires CAP_SYS_ADMIN). ExecStartPre=+/usr/bin/mkdir -p /run/left4me/idmap/%i ExecStartPre=+/usr/bin/mount --bind \ --map-users=980:981:1 --map-groups=980:981:1 \ /var/lib/left4me/overlays/%i /run/left4me/idmap/%i ExecStart=/bin/bash /script.sh ExecStopPost=+-/usr/bin/umount /run/left4me/idmap/%i ExecStopPost=+-/usr/bin/rmdir /run/left4me/idmap/%i # Hardening — all the -p flags from the current bash helper, declared # declaratively here instead of as systemd-run -p arguments. NoNewPrivileges=yes ProtectSystem=strict ProtectHome=yes PrivateTmp=yes PrivateDevices=yes PrivateIPC=yes ProtectKernelTunables=yes ProtectKernelModules=yes ProtectKernelLogs=yes ProtectControlGroups=yes RestrictNamespaces=yes RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX RestrictSUIDSGID=yes LockPersonality=yes MemoryDenyWriteExecute=yes SystemCallFilter=@system-service @network-io SystemCallArchitectures=native CapabilityBoundingSet= AmbientCapabilities= IPAddressDeny=127.0.0.0/8 ::1/128 169.254.0.0/16 fe80::/10 224.0.0.0/4 ff00::/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 100.64.0.0/10 fc00::/7 TemporaryFileSystem=/etc /var/lib BindReadOnlyPaths=/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives /var/lib/left4me/sandbox-scripts/%i.sh:/script.sh BindPaths=/run/left4me/idmap/%i:/overlay WorkingDirectory=/overlay Environment=HOME=/tmp PATH=/usr/bin:/usr/sbin OVERLAY=/overlay UMask=0022 OOMScoreAdjust=500 MemoryMax=4G MemorySwapMax=0 TasksMax=512 CPUQuota=200% RuntimeMaxSec=3600 TimeoutStartSec=1h TimeoutStopSec=30s ``` Notes: - `Type=oneshot` makes `systemctl start` block until ExecStart exits. - `ConditionPath*` provides early failure if the overlay dir or script doesn't exist (avoids running the unit at all in those cases). - `RequiresMountsFor=/var/lib/left4me/overlays/%i` ensures the parent fs is mounted before this unit runs (`/` and `/var/lib` if it's a separate mount point). - `ExecStopPost` uses `+-` (root, ignore failures) — the bind might already be torn down if the unit is restarting. - `BindReadOnlyPaths=...:/script.sh` makes the per-overlay script available at `/script.sh` inside the sandbox, picked from the predictable path `/var/lib/left4me/sandbox-scripts/%i.sh`. ### Worker invocation Replace `run_sandboxed_script` in `l4d2web/services/overlay_builders.py`: ```python def run_sandboxed_script( overlay_id: int, script_text: str, *, on_stdout: LogSink, on_stderr: LogSink, should_cancel: CancelCheck, ) -> None: script_dir = _sandbox_script_dir() script_dir.mkdir(parents=True, exist_ok=True) script_path = script_dir / f"{overlay_id}.sh" script_path.write_text(script_text or "") os.chmod(script_path, 0o644) unit = f"build-overlay@{overlay_id}.service" # Tail the unit's journal as a sidecar so output streams into job-logs # while the unit runs. --follow exits when the unit reaches "inactive". journal = subprocess.Popen( ["journalctl", "--unit", unit, "--output=cat", "--follow", "--since=now", "--no-pager"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, ) try: # Start the unit (sudoers permits this exact verb pattern). # Type=oneshot makes this block until ExecStart returns. rc = subprocess.run( ["sudo", "-n", "/bin/systemctl", "start", unit], check=False, ).returncode finally: # Drain remaining journal lines (journalctl --follow may not have # printed everything yet by the time systemctl returns). journal.terminate() try: for line in journal.stdout or []: on_stdout(line.rstrip("\n")) finally: journal.wait(timeout=5) # Read exit code from the unit. ExecMainStatus is the script's rc; # Result is "success" / "failed" / "timeout" etc. show = subprocess.check_output( ["systemctl", "show", unit, "-p", "ExecMainStatus", "-p", "Result", "--value"], text=True, ).split() exec_main_status = int(show[0]) result = show[1] if rc != 0 or result != "success": raise BuildError( f"build-overlay@{overlay_id} failed: " f"systemctl rc={rc} unit result={result} script exit={exec_main_status}" ) ``` That's ~30 lines vs. ~50 today, and the helper script disappears entirely. **Two refinements to consider:** 1. **Cancel semantics**: today the worker's `should_cancel` callback triggers a SIGTERM via the existing `run_command` plumbing. With systemctl-start, you'd issue `systemctl stop build-overlay@` in a parallel thread when `should_cancel()` returns True. Wire that up. 2. **Journal streaming race**: `journalctl --follow --since=now` started *after* `systemctl start` may miss the first few lines. Two fixes: - Start the journal tail before systemctl-start (the unit doesn't exist yet, so journalctl waits silently — verify this behaviour on Trixie). - Or use `journalctl --cursor` machinery: snapshot the cursor before start, then read with `--cursor=` after. Start-before is simpler and likely sufficient for L4D2 build verbosity, where the first second of output isn't critical. ### Sudoers Replace: ``` left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox ``` with: ``` left4me ALL=(root) NOPASSWD: /bin/systemctl start build-overlay@*.service left4me ALL=(root) NOPASSWD: /bin/systemctl stop build-overlay@*.service ``` (Tighter — verb-prefixed and instance-globbed. No script path passed.) ### Slice `l4d2-build.slice` already exists (per the gameserver/sandbox today's configuration). Reuse it — no change needed. ### Sandbox script tmpfile cleanup Currently `run_sandboxed_script` writes a per-invocation `tempfile.NamedTemporaryFile` with a random suffix and unlinks it in a `finally`. With template-unit lookup, the script path is **predictable per overlay id** (`/var/lib/left4me/sandbox-scripts/.sh`). Implications: - Two concurrent builds for the *same* overlay id would clobber the script file. The job queue already serializes per-overlay (per `l4d2web/services/job_worker.py:OVERLAY_OPERATIONS`), so this is OK. - Scripts persist between builds (no auto-cleanup). Either accept that (the next build overwrites) or delete after the unit goes inactive. Recommend: leave them — small, useful for debugging. ## Migration In order: 1. **Add the unit emission to ckn-bw's `bundles/left4me/metadata.py` systemd_units reactor.** Mirror the pattern used for `left4me-server@.service`. Drop in the template-unit content as another reactor entry. 2. **Update sudoers** (`bundles/left4me/files/etc/sudoers.d/left4me`) to permit `systemctl start/stop build-overlay@*.service` and remove the script-sandbox grant. 3. **Replace `run_sandboxed_script` in left4me.** Add the new journalctl-based output streaming, exit-code reading, and cancel handling. Keep the function signature stable so callers (`ScriptBuilder.build`, the wipe route) are unchanged. 4. **Delete `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`.** 5. **Update tests:** - `deploy/tests/test_deploy_artifacts.py`: - Drop `test_script_sandbox_uses_idmap_staging` and any other tests that read SCRIPT_SANDBOX_HELPER. - Add tests that assert the new unit emission in ckn-bw's reactor output. (But that's in the other repo — left4me's deploy tests can't directly cover it.) - Add a test that asserts the worker invokes `sudo systemctl start build-overlay@*` (grep `overlay_builders.py`). - `l4d2web/tests/test_overlay_builders.py` (if it exists): update mocks for `run_sandboxed_script` to expect the new subprocess shape. 6. **Test on `left4.me`:** - Push left4me, `bw apply ovh.left4me`. Apply also picks up the new unit emission and the sudoers change. - Trigger a script-overlay rebuild via the web UI or the enqueue API path used in this session (see test history in git log around 2026-05-15). - Inspect: `journalctl -u build-overlay@9.service`, `systemctl status build-overlay@9.service`. - Verify on-disk state: overlay files end up `left4me`-owned; idmap bind cleanly torn down (`findmnt | grep idmap` empty). ## Open decisions for the future session 1. **`/run/left4me/idmap/%i` vs. `/var/lib/left4me/tmp/sandbox-idmap-%i`** — `/run` is tmpfs and wiped on reboot, more correct for transient mount paths. But it requires the dir to exist (created by ExecStartPre). Either works. 2. **What to do with the existing `left4me-apply-cake` dead code** — irrelevant to this refactor; flagged in the other handoff doc. 3. **Whether to drop the post-build `chmod o+r` in the sandbox helper** — already gone in the build-time-idmap commit. (Verify in the new unit nothing equivalent is needed; files are left4me-owned, web reads via primary uid.) 4. **`Type=oneshot` vs. `Type=exec`** — oneshot blocks `systemctl start`. exec doesn't. With oneshot we don't need the `journalctl --follow` workaround if we read journal *after* completion. But for live progress (which the existing builds stream), `--follow` is still needed. Stick with oneshot. 5. **Should the unit set `KillMode=mixed`** to ensure children die on stop? Worth checking — the existing systemd-run line doesn't set it explicitly; defaults usually suffice. 6. **`StateDirectory=` vs. explicit `mkdir -p`** — systemd has StateDirectory and RuntimeDirectory directives that auto-create per-unit directories. Could replace the `mkdir -p /run/left4me/idmap/%i` ExecStartPre with `RuntimeDirectory=left4me/idmap/%i`. Cleaner; gets auto-cleanup on stop too. Recommend doing this — both the mkdir and the rmdir ExecStopPost would go away. ## Verification End-to-end smoke test on `left4.me` after the deploy: ```bash # unit is installed and template-parseable systemctl status build-overlay@.service # should show "loaded; static" sudo systemd-analyze verify build-overlay@1.service # enqueue a build via the web app's worker path (mimic the # enqueue_build_overlay pattern from this session's job 64 onwards) # then watch: sudo journalctl -u build-overlay@9.service -f # on completion: systemctl show build-overlay@9.service -p Result -p ExecMainStatus # expect: Result=success, ExecMainStatus=0 # disk state sudo find /var/lib/left4me/overlays/9 -uid 981 # should be empty sudo find /run/left4me/idmap # should not exist or be empty # pid 1 mount table — no orphan idmap binds sudo findmnt --task 1 -o TARGET | grep idmap # empty ``` ## Risks - **Worker cancel-during-build**: today's `should_cancel` callback signals via `run_command`'s child process. With the unit, the worker needs a separate path: spawn a thread that polls `should_cancel()` and calls `sudo systemctl stop build-overlay@` when triggered. Without this, builds that exceed `RuntimeMaxSec` or hit user-cancel won't terminate promptly. - **Journal lag at unit start**: `journalctl --follow` started before `systemctl start` should pick up all output. If not, may need cursor-based streaming. Test with a script that prints immediately (`echo hello; exit 0`) — if "hello" appears in the job log, race is handled. - **Sudoers globbing**: `systemctl start build-overlay@*.service` permits any instance id including weird strings like `../etc-passwd`. Use a tighter glob if possible (e.g., `build-overlay@[0-9]*.service`). Test that sudoers rejects unexpected instance names. - **Type=oneshot return semantics**: confirm that `systemctl start build-overlay@` on a Type=oneshot unit returns rc=3 (or similar) when the unit's ExecStart fails, so the worker can detect failure without re-querying `systemctl show`. - **Idle running over reboot**: a build that's running across a reboot is killed when the system goes down. That's identical to today's behavior with systemd-run. Acceptable. - **The journalctl sidecar process accumulates as a zombie if not reaped properly.** The proposed code does `journal.wait(timeout=5)` — handle the timeout case (force-kill). ## Pointers Reference files (with line numbers if applicable): - **Current helper to be removed**: `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` - **Current worker invoker**: `l4d2web/services/overlay_builders.py:run_sandboxed_script` (~ln 324) - **Current job-worker dispatch**: `l4d2web/services/job_worker.py` (build_overlay operation) - **Sudoers**: `deploy/files/etc/sudoers.d/left4me` (matched verbatim in `ckn-bw/bundles/left4me/files/etc/sudoers.d/left4me`) - **Sample template unit pattern** (the model to copy): `left4me-server@.service` emission in ckn-bw's `bundles/left4me/metadata.py` systemd_units reactor. - **Existing slice declaration** (already correct): `l4d2-build.slice` in ckn-bw's reactor. Recent commits that touched this surface: - `4838108` — moved idmap to build time (the refactor that surfaced the namespace bug) - `f1aa05d` — added nsenter self-wrap (the band-aid this refactor removes) - `2f6a9cf`, `9053186`, `dd918ac` — earlier idmap-on-mount approach that was reverted Related design docs: - `docs/superpowers/plans/2026-05-15-build-time-idmap.md` — the plan whose architecture this refactor builds on - `docs/superpowers/specs/2026-05-15-deploy-dir-rethink-design.md` — unrelated open questions about deploy/ layout ## What's NOT in scope - Rewriting the sandbox in Python / packaging differently. - Changing the security hardening profile (the unit duplicates the current set verbatim — adjust later if needed). - Splitting the gameserver uid from the web app uid (noted in earlier handoff doc). - Re-evaluating whether `l4d2-sandbox` should exist as a separate uid (kept; defense in depth). - Touching the `left4me-overlay` gameserver helper (it already uses the pattern; only the sandbox helper is being refactored to match). ## Estimate Rough breakdown for the future session: - Unit file design + ckn-bw reactor change: 1-2 hours - Worker rewrite (run_sandboxed_script): 1-2 hours - Tests: 1 hour - Deploy + verify on test server: 30 min - Bug-fix and iteration buffer: 1 hour ~5 hours of focused work, assuming no surprises with journalctl streaming or sudoers semantics. ## Decision criteria for whether to do this Do it if: - You're about to make any other change to the sandbox hardening, build lifecycle, or sandbox uid story. - You're frustrated by debugging the existing helper. - You want to remove the nsenter band-aid for hygiene. Skip if: - The sandbox is stable and you're not planning related changes. - You'd rather invest the time in higher-value work elsewhere. The current solution is fine; this refactor is upgrade-not-fix.