diff --git a/docs/superpowers/specs/2026-05-15-build-overlay-unit-design.md b/docs/superpowers/specs/2026-05-15-build-overlay-unit-design.md new file mode 100644 index 0000000..5163a8d --- /dev/null +++ b/docs/superpowers/specs/2026-05-15-build-overlay-unit-design.md @@ -0,0 +1,491 @@ +# Build-overlay template unit — refactor the script-sandbox helper + +**Status: open question, not settled design.** This is a handoff +document prompted by the build-time idmap landing on 2026-05-15. The +current `left4me-script-sandbox` shell helper works but has accumulated +several layers of complexity (idmap bind setup, trap cleanup, nsenter +self-wrap) that a systemd template unit would handle declaratively. +The same pattern is already established in the codebase for +gameservers (`left4me-server@.service`). A future session should +evaluate whether to refactor and, if so, follow the steps below. + +## Why this came up + +While verifying the build-time idmap refactor, the first 5 build jobs +failed with `mkdir: Permission denied` on `/overlay/...`. Root cause: + +- `left4me-web.service` runs with `PrivateTmp=true`, which puts the + web app (and anything it sudoes into) in a private mount namespace. +- The script-sandbox helper, invoked via `sudo` from the web app, + inherits that namespace. +- The helper's `mount --bind --map-users=...` pre-creates the idmap + staging path *in the web app's namespace*. +- `systemd-run` (called by the helper) spawns a transient unit in + PID 1's mount namespace. +- The transient unit's `BindPaths=...:/overlay` resolves the staging + path in PID 1's namespace — where the bind doesn't exist. It sees + an empty root-owned dir at the staging path (mkdir'd by the helper + before the bind) and binds *that* to `/overlay`. +- Sandbox uid hits EACCES on every write. + +We fixed it (commit `f1aa05d`) by self-wrapping the helper into +PID 1's mount namespace at the top of the script: + +```bash +if [[ "${L4D2_SANDBOX_IN_PID1_MNT_NS:-}" != "1" ]]; then + exec env L4D2_SANDBOX_IN_PID1_MNT_NS=1 \ + /usr/bin/nsenter --mount=/proc/1/ns/mnt -- "$0" "$@" +fi +``` + +That works. But it's a band-aid for an architectural friction: +**helper invocation via `sudo` from a hardened service forces us to +manually escape the caller's namespace before any mount syscall**. +If the helper were *itself* a systemd unit started by PID 1, the +namespace would be correct by default. + +The gameserver helper handles this at the unit level. Its +ExecStartPre is: + +``` +ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i +``` + +i.e. wrapped in nsenter *at the unit*. The unit is started by PID 1, +so it has PID 1's namespace, then nsenter is a belt-and-braces. +Mirror that pattern for builds: introduce `build-overlay@.service` as +a template unit, have the worker activate it instead of forking a +helper. + +## Current state (the thing being replaced) + +Files: + +- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` — + the bash helper. ~100 lines. Self-wraps in nsenter, does pre-bind + with `--map-users`, invokes `systemd-run --quiet --collect --wait + --pipe -p ... -- /bin/bash /script.sh`, cleans up via trap. +- `l4d2web/services/overlay_builders.py:run_sandboxed_script` — the + worker entry point. Writes script content to + `/var/lib/left4me/sandbox-scripts/.sh`, invokes + `sudo -n /usr/local/libexec/left4me/left4me-script-sandbox + `, streams stdout/stderr via `subprocess.Popen` + the existing + `run_command` plumbing. +- `deploy/files/etc/sudoers.d/left4me` — grants `left4me` NOPASSWD to + the helper path. + +What the helper actually does: +1. nsenter into PID 1's mount ns (the band-aid) +2. validate args + overlay dir exists +3. compute `STAGING=/var/lib/left4me/tmp/sandbox-idmap-${OVERLAY_ID}` +4. `trap` cleanup; pre-emptive `umount` of stale staging; `mkdir -p` + the staging +5. `mount --bind --map-users=$(id -u left4me):$(id -u l4d2-sandbox):1 + --map-groups=... $OVERLAY_DIR $STAGING` +6. `systemd-run` with the full hardening profile, `BindPaths=$STAGING:/overlay` +7. Wait for completion, propagate exit code +8. trap fires: `umount $STAGING; rmdir $STAGING` + +## Proposed design + +Replace the bash helper with two systemd units (template + a slice) +emitted from ckn-bw's existing `systemd_units` reactor, plus a small +worker rewrite. + +### `build-overlay@.service` (template unit) + +```ini +[Unit] +Description=Sandboxed overlay build for instance %i +DefaultDependencies=no +After=local-fs.target +RequiresMountsFor=/var/lib/left4me/overlays/%i +ConditionPathIsDirectory=/var/lib/left4me/overlays/%i +ConditionPathExists=/var/lib/left4me/sandbox-scripts/%i.sh + +[Service] +Type=oneshot +User=l4d2-sandbox +Group=l4d2-sandbox +Slice=l4d2-build.slice + +# Idmap bind: disk uid 980 (left4me) ↔ mount uid 981 (sandbox), so writes +# from the sandbox land on disk as left4me. + prefix runs as root before +# the User= drop (mount syscall requires CAP_SYS_ADMIN). +ExecStartPre=+/usr/bin/mkdir -p /run/left4me/idmap/%i +ExecStartPre=+/usr/bin/mount --bind \ + --map-users=980:981:1 --map-groups=980:981:1 \ + /var/lib/left4me/overlays/%i /run/left4me/idmap/%i + +ExecStart=/bin/bash /script.sh + +ExecStopPost=+-/usr/bin/umount /run/left4me/idmap/%i +ExecStopPost=+-/usr/bin/rmdir /run/left4me/idmap/%i + +# Hardening — all the -p flags from the current bash helper, declared +# declaratively here instead of as systemd-run -p arguments. +NoNewPrivileges=yes +ProtectSystem=strict +ProtectHome=yes +PrivateTmp=yes +PrivateDevices=yes +PrivateIPC=yes +ProtectKernelTunables=yes +ProtectKernelModules=yes +ProtectKernelLogs=yes +ProtectControlGroups=yes +RestrictNamespaces=yes +RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX +RestrictSUIDSGID=yes +LockPersonality=yes +MemoryDenyWriteExecute=yes +SystemCallFilter=@system-service @network-io +SystemCallArchitectures=native +CapabilityBoundingSet= +AmbientCapabilities= +IPAddressDeny=127.0.0.0/8 ::1/128 169.254.0.0/16 fe80::/10 224.0.0.0/4 ff00::/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 100.64.0.0/10 fc00::/7 +TemporaryFileSystem=/etc /var/lib +BindReadOnlyPaths=/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives /var/lib/left4me/sandbox-scripts/%i.sh:/script.sh +BindPaths=/run/left4me/idmap/%i:/overlay +WorkingDirectory=/overlay +Environment=HOME=/tmp PATH=/usr/bin:/usr/sbin OVERLAY=/overlay +UMask=0022 +OOMScoreAdjust=500 +MemoryMax=4G +MemorySwapMax=0 +TasksMax=512 +CPUQuota=200% +RuntimeMaxSec=3600 +TimeoutStartSec=1h +TimeoutStopSec=30s +``` + +Notes: +- `Type=oneshot` makes `systemctl start` block until ExecStart exits. +- `ConditionPath*` provides early failure if the overlay dir or script + doesn't exist (avoids running the unit at all in those cases). +- `RequiresMountsFor=/var/lib/left4me/overlays/%i` ensures the parent + fs is mounted before this unit runs (`/` and `/var/lib` if it's a + separate mount point). +- `ExecStopPost` uses `+-` (root, ignore failures) — the bind might + already be torn down if the unit is restarting. +- `BindReadOnlyPaths=...:/script.sh` makes the per-overlay script + available at `/script.sh` inside the sandbox, picked from the + predictable path `/var/lib/left4me/sandbox-scripts/%i.sh`. + +### Worker invocation + +Replace `run_sandboxed_script` in +`l4d2web/services/overlay_builders.py`: + +```python +def run_sandboxed_script( + overlay_id: int, + script_text: str, + *, + on_stdout: LogSink, + on_stderr: LogSink, + should_cancel: CancelCheck, +) -> None: + script_dir = _sandbox_script_dir() + script_dir.mkdir(parents=True, exist_ok=True) + script_path = script_dir / f"{overlay_id}.sh" + script_path.write_text(script_text or "") + os.chmod(script_path, 0o644) + + unit = f"build-overlay@{overlay_id}.service" + + # Tail the unit's journal as a sidecar so output streams into job-logs + # while the unit runs. --follow exits when the unit reaches "inactive". + journal = subprocess.Popen( + ["journalctl", "--unit", unit, "--output=cat", "--follow", + "--since=now", "--no-pager"], + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + text=True, + ) + + try: + # Start the unit (sudoers permits this exact verb pattern). + # Type=oneshot makes this block until ExecStart returns. + rc = subprocess.run( + ["sudo", "-n", "/bin/systemctl", "start", unit], + check=False, + ).returncode + finally: + # Drain remaining journal lines (journalctl --follow may not have + # printed everything yet by the time systemctl returns). + journal.terminate() + try: + for line in journal.stdout or []: + on_stdout(line.rstrip("\n")) + finally: + journal.wait(timeout=5) + + # Read exit code from the unit. ExecMainStatus is the script's rc; + # Result is "success" / "failed" / "timeout" etc. + show = subprocess.check_output( + ["systemctl", "show", unit, + "-p", "ExecMainStatus", "-p", "Result", "--value"], + text=True, + ).split() + exec_main_status = int(show[0]) + result = show[1] + + if rc != 0 or result != "success": + raise BuildError( + f"build-overlay@{overlay_id} failed: " + f"systemctl rc={rc} unit result={result} script exit={exec_main_status}" + ) +``` + +That's ~30 lines vs. ~50 today, and the helper script disappears +entirely. + +**Two refinements to consider:** + +1. **Cancel semantics**: today the worker's `should_cancel` callback + triggers a SIGTERM via the existing `run_command` plumbing. With + systemctl-start, you'd issue `systemctl stop build-overlay@` + in a parallel thread when `should_cancel()` returns True. Wire + that up. +2. **Journal streaming race**: `journalctl --follow --since=now` + started *after* `systemctl start` may miss the first few lines. + Two fixes: + - Start the journal tail before systemctl-start (the unit doesn't + exist yet, so journalctl waits silently — verify this behaviour + on Trixie). + - Or use `journalctl --cursor` machinery: snapshot the cursor + before start, then read with `--cursor=` after. + + Start-before is simpler and likely sufficient for L4D2 build + verbosity, where the first second of output isn't critical. + +### Sudoers + +Replace: +``` +left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox +``` + +with: +``` +left4me ALL=(root) NOPASSWD: /bin/systemctl start build-overlay@*.service +left4me ALL=(root) NOPASSWD: /bin/systemctl stop build-overlay@*.service +``` + +(Tighter — verb-prefixed and instance-globbed. No script path passed.) + +### Slice + +`l4d2-build.slice` already exists (per the gameserver/sandbox today's +configuration). Reuse it — no change needed. + +### Sandbox script tmpfile cleanup + +Currently `run_sandboxed_script` writes a per-invocation +`tempfile.NamedTemporaryFile` with a random suffix and unlinks it in a +`finally`. With template-unit lookup, the script path is **predictable +per overlay id** (`/var/lib/left4me/sandbox-scripts/.sh`). +Implications: + +- Two concurrent builds for the *same* overlay id would clobber the + script file. The job queue already serializes per-overlay (per + `l4d2web/services/job_worker.py:OVERLAY_OPERATIONS`), so this + is OK. +- Scripts persist between builds (no auto-cleanup). Either accept + that (the next build overwrites) or delete after the unit goes + inactive. Recommend: leave them — small, useful for debugging. + +## Migration + +In order: + +1. **Add the unit emission to ckn-bw's `bundles/left4me/metadata.py` + systemd_units reactor.** Mirror the pattern used for + `left4me-server@.service`. Drop in the template-unit content as + another reactor entry. +2. **Update sudoers** (`bundles/left4me/files/etc/sudoers.d/left4me`) + to permit `systemctl start/stop build-overlay@*.service` and + remove the script-sandbox grant. +3. **Replace `run_sandboxed_script` in left4me.** Add the new + journalctl-based output streaming, exit-code reading, and cancel + handling. Keep the function signature stable so callers + (`ScriptBuilder.build`, the wipe route) are unchanged. +4. **Delete `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`.** +5. **Update tests:** + - `deploy/tests/test_deploy_artifacts.py`: + - Drop `test_script_sandbox_uses_idmap_staging` and any other + tests that read SCRIPT_SANDBOX_HELPER. + - Add tests that assert the new unit emission in ckn-bw's + reactor output. (But that's in the other repo — left4me's + deploy tests can't directly cover it.) + - Add a test that asserts the worker invokes + `sudo systemctl start build-overlay@*` (grep + `overlay_builders.py`). + - `l4d2web/tests/test_overlay_builders.py` (if it exists): + update mocks for `run_sandboxed_script` to expect the new + subprocess shape. +6. **Test on `left4.me`:** + - Push left4me, `bw apply ovh.left4me`. Apply also picks up the + new unit emission and the sudoers change. + - Trigger a script-overlay rebuild via the web UI or the + enqueue API path used in this session (see test history in + git log around 2026-05-15). + - Inspect: `journalctl -u build-overlay@9.service`, + `systemctl status build-overlay@9.service`. + - Verify on-disk state: overlay files end up `left4me`-owned; + idmap bind cleanly torn down (`findmnt | grep idmap` empty). + +## Open decisions for the future session + +1. **`/run/left4me/idmap/%i` vs. `/var/lib/left4me/tmp/sandbox-idmap-%i`** — + `/run` is tmpfs and wiped on reboot, more correct for transient + mount paths. But it requires the dir to exist (created by + ExecStartPre). Either works. +2. **What to do with the existing `left4me-apply-cake` dead code** — + irrelevant to this refactor; flagged in the other handoff doc. +3. **Whether to drop the post-build `chmod o+r` in the sandbox helper** — + already gone in the build-time-idmap commit. (Verify in the new + unit nothing equivalent is needed; files are left4me-owned, web + reads via primary uid.) +4. **`Type=oneshot` vs. `Type=exec`** — oneshot blocks `systemctl + start`. exec doesn't. With oneshot we don't need the + `journalctl --follow` workaround if we read journal *after* + completion. But for live progress (which the existing builds + stream), `--follow` is still needed. Stick with oneshot. +5. **Should the unit set `KillMode=mixed`** to ensure children die on + stop? Worth checking — the existing systemd-run line doesn't set + it explicitly; defaults usually suffice. +6. **`StateDirectory=` vs. explicit `mkdir -p`** — systemd has + StateDirectory and RuntimeDirectory directives that auto-create + per-unit directories. Could replace the `mkdir -p /run/left4me/idmap/%i` + ExecStartPre with `RuntimeDirectory=left4me/idmap/%i`. Cleaner; + gets auto-cleanup on stop too. Recommend doing this — both the + mkdir and the rmdir ExecStopPost would go away. + +## Verification + +End-to-end smoke test on `left4.me` after the deploy: + +```bash +# unit is installed and template-parseable +systemctl status build-overlay@.service # should show "loaded; static" +sudo systemd-analyze verify build-overlay@1.service + +# enqueue a build via the web app's worker path (mimic the +# enqueue_build_overlay pattern from this session's job 64 onwards) +# then watch: +sudo journalctl -u build-overlay@9.service -f + +# on completion: +systemctl show build-overlay@9.service -p Result -p ExecMainStatus +# expect: Result=success, ExecMainStatus=0 + +# disk state +sudo find /var/lib/left4me/overlays/9 -uid 981 # should be empty +sudo find /run/left4me/idmap # should not exist or be empty + +# pid 1 mount table — no orphan idmap binds +sudo findmnt --task 1 -o TARGET | grep idmap # empty +``` + +## Risks + +- **Worker cancel-during-build**: today's `should_cancel` callback + signals via `run_command`'s child process. With the unit, the + worker needs a separate path: spawn a thread that polls + `should_cancel()` and calls `sudo systemctl stop build-overlay@` + when triggered. Without this, builds that exceed `RuntimeMaxSec` or + hit user-cancel won't terminate promptly. +- **Journal lag at unit start**: `journalctl --follow` started before + `systemctl start` should pick up all output. If not, may need + cursor-based streaming. Test with a script that prints immediately + (`echo hello; exit 0`) — if "hello" appears in the job log, race + is handled. +- **Sudoers globbing**: `systemctl start build-overlay@*.service` + permits any instance id including weird strings like `../etc-passwd`. + Use a tighter glob if possible (e.g., + `build-overlay@[0-9]*.service`). Test that sudoers rejects + unexpected instance names. +- **Type=oneshot return semantics**: confirm that `systemctl start + build-overlay@` on a Type=oneshot unit returns rc=3 (or + similar) when the unit's ExecStart fails, so the worker can detect + failure without re-querying `systemctl show`. +- **Idle running over reboot**: a build that's running across a reboot + is killed when the system goes down. That's identical to today's + behavior with systemd-run. Acceptable. +- **The journalctl sidecar process accumulates as a zombie if not + reaped properly.** The proposed code does `journal.wait(timeout=5)` + — handle the timeout case (force-kill). + +## Pointers + +Reference files (with line numbers if applicable): + +- **Current helper to be removed**: + `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` +- **Current worker invoker**: + `l4d2web/services/overlay_builders.py:run_sandboxed_script` (~ln 324) +- **Current job-worker dispatch**: + `l4d2web/services/job_worker.py` (build_overlay operation) +- **Sudoers**: + `deploy/files/etc/sudoers.d/left4me` (matched verbatim in + `ckn-bw/bundles/left4me/files/etc/sudoers.d/left4me`) +- **Sample template unit pattern** (the model to copy): + `left4me-server@.service` emission in ckn-bw's + `bundles/left4me/metadata.py` systemd_units reactor. +- **Existing slice declaration** (already correct): + `l4d2-build.slice` in ckn-bw's reactor. + +Recent commits that touched this surface: +- `4838108` — moved idmap to build time (the refactor that surfaced + the namespace bug) +- `f1aa05d` — added nsenter self-wrap (the band-aid this refactor + removes) +- `2f6a9cf`, `9053186`, `dd918ac` — earlier idmap-on-mount approach + that was reverted + +Related design docs: +- `docs/superpowers/plans/2026-05-15-build-time-idmap.md` — the + plan whose architecture this refactor builds on +- `docs/superpowers/specs/2026-05-15-deploy-dir-rethink-design.md` — + unrelated open questions about deploy/ layout + +## What's NOT in scope + +- Rewriting the sandbox in Python / packaging differently. +- Changing the security hardening profile (the unit duplicates the + current set verbatim — adjust later if needed). +- Splitting the gameserver uid from the web app uid (noted in earlier + handoff doc). +- Re-evaluating whether `l4d2-sandbox` should exist as a separate + uid (kept; defense in depth). +- Touching the `left4me-overlay` gameserver helper (it already uses + the pattern; only the sandbox helper is being refactored to match). + +## Estimate + +Rough breakdown for the future session: +- Unit file design + ckn-bw reactor change: 1-2 hours +- Worker rewrite (run_sandboxed_script): 1-2 hours +- Tests: 1 hour +- Deploy + verify on test server: 30 min +- Bug-fix and iteration buffer: 1 hour + +~5 hours of focused work, assuming no surprises with journalctl +streaming or sudoers semantics. + +## Decision criteria for whether to do this + +Do it if: +- You're about to make any other change to the sandbox hardening, + build lifecycle, or sandbox uid story. +- You're frustrated by debugging the existing helper. +- You want to remove the nsenter band-aid for hygiene. + +Skip if: +- The sandbox is stable and you're not planning related changes. +- You'd rather invest the time in higher-value work elsewhere. + +The current solution is fine; this refactor is upgrade-not-fix.