left4me/docs/superpowers/specs/2026-05-15-build-overlay-unit-design.md

# Build-overlay template unit — refactor the script-sandbox helper

**Status: open question, not settled design.** This is a handoff
document prompted by the build-time idmap landing on 2026-05-15. The
current `left4me-script-sandbox` shell helper works but has accumulated
several layers of complexity (idmap bind setup, trap cleanup, nsenter
self-wrap) that a systemd template unit would handle declaratively.
The same pattern is already established in the codebase for
gameservers (`left4me-server@.service`). A future session should
evaluate whether to refactor and, if so, follow the steps below.

## Why this came up

While verifying the build-time idmap refactor, the first 5 build jobs
failed with `mkdir: Permission denied` on `/overlay/...`. Root cause:

- `left4me-web.service` runs with `PrivateTmp=true`, which puts the
  web app (and anything it sudoes into) in a private mount namespace.
- The script-sandbox helper, invoked via `sudo` from the web app,
  inherits that namespace.
- The helper's `mount --bind --map-users=...` pre-creates the idmap
  staging path *in the web app's namespace*.
- `systemd-run` (called by the helper) spawns a transient unit in
  PID 1's mount namespace.
- The transient unit's `BindPaths=...:/overlay` resolves the staging
  path in PID 1's namespace — where the bind doesn't exist. It sees
  an empty root-owned dir at the staging path (mkdir'd by the helper
  before the bind) and binds *that* to `/overlay`.
- Sandbox uid hits EACCES on every write.

We fixed it (commit `f1aa05d`) by self-wrapping the helper into
PID 1's mount namespace at the top of the script:

```bash
if [[ "${L4D2_SANDBOX_IN_PID1_MNT_NS:-}" != "1" ]]; then
    exec env L4D2_SANDBOX_IN_PID1_MNT_NS=1 \
        /usr/bin/nsenter --mount=/proc/1/ns/mnt -- "$0" "$@"
fi
```

That works. But it's a band-aid for an architectural friction:
**helper invocation via `sudo` from a hardened service forces us to
manually escape the caller's namespace before any mount syscall**.
If the helper were *itself* a systemd unit started by PID 1, the
namespace would be correct by default.

The gameserver helper handles this at the unit level. Its
ExecStartPre is:

```
ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
```

i.e. wrapped in nsenter *at the unit*. The unit is started by PID 1,
so it has PID 1's namespace, then nsenter is a belt-and-braces.
Mirror that pattern for builds: introduce `build-overlay@.service` as
a template unit, have the worker activate it instead of forking a
helper.

## Current state (the thing being replaced)

Files:

- `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox` —
  the bash helper. ~100 lines. Self-wraps in nsenter, does pre-bind
  with `--map-users`, invokes `systemd-run --quiet --collect --wait
  --pipe -p ... -- /bin/bash /script.sh`, cleans up via trap.
- `l4d2web/services/overlay_builders.py:run_sandboxed_script` — the
  worker entry point. Writes script content to
  `/var/lib/left4me/sandbox-scripts/<uniqued>.sh`, invokes
  `sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id>
  <path>`, streams stdout/stderr via `subprocess.Popen` + the existing
  `run_command` plumbing.
- `deploy/files/etc/sudoers.d/left4me` — grants `left4me` NOPASSWD to
  the helper path.

What the helper actually does:
1. nsenter into PID 1's mount ns (the band-aid)
2. validate args + overlay dir exists
3. compute `STAGING=/var/lib/left4me/tmp/sandbox-idmap-${OVERLAY_ID}`
4. `trap` cleanup; pre-emptive `umount` of stale staging; `mkdir -p`
   the staging
5. `mount --bind --map-users=$(id -u left4me):$(id -u l4d2-sandbox):1
   --map-groups=... $OVERLAY_DIR $STAGING`
6. `systemd-run` with the full hardening profile, `BindPaths=$STAGING:/overlay`
7. Wait for completion, propagate exit code
8. trap fires: `umount $STAGING; rmdir $STAGING`

## Proposed design

Replace the bash helper with two systemd units (template + a slice)
emitted from ckn-bw's existing `systemd_units` reactor, plus a small
worker rewrite.

### `build-overlay@.service` (template unit)

```ini
[Unit]
Description=Sandboxed overlay build for instance %i
DefaultDependencies=no
After=local-fs.target
RequiresMountsFor=/var/lib/left4me/overlays/%i
ConditionPathIsDirectory=/var/lib/left4me/overlays/%i
ConditionPathExists=/var/lib/left4me/sandbox-scripts/%i.sh

[Service]
Type=oneshot
User=l4d2-sandbox
Group=l4d2-sandbox
Slice=l4d2-build.slice

# Idmap bind: disk uid 980 (left4me) ↔ mount uid 981 (sandbox), so writes
# from the sandbox land on disk as left4me. + prefix runs as root before
# the User= drop (mount syscall requires CAP_SYS_ADMIN).
ExecStartPre=+/usr/bin/mkdir -p /run/left4me/idmap/%i
ExecStartPre=+/usr/bin/mount --bind \
    --map-users=980:981:1 --map-groups=980:981:1 \
    /var/lib/left4me/overlays/%i /run/left4me/idmap/%i

ExecStart=/bin/bash /script.sh

ExecStopPost=+-/usr/bin/umount /run/left4me/idmap/%i
ExecStopPost=+-/usr/bin/rmdir /run/left4me/idmap/%i

# Hardening — all the -p flags from the current bash helper, declared
# declaratively here instead of as systemd-run -p arguments.
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
PrivateDevices=yes
PrivateIPC=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectKernelLogs=yes
ProtectControlGroups=yes
RestrictNamespaces=yes
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictSUIDSGID=yes
LockPersonality=yes
MemoryDenyWriteExecute=yes
SystemCallFilter=@system-service @network-io
SystemCallArchitectures=native
CapabilityBoundingSet=
AmbientCapabilities=
IPAddressDeny=127.0.0.0/8 ::1/128 169.254.0.0/16 fe80::/10 224.0.0.0/4 ff00::/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 100.64.0.0/10 fc00::/7
TemporaryFileSystem=/etc /var/lib
BindReadOnlyPaths=/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives /var/lib/left4me/sandbox-scripts/%i.sh:/script.sh
BindPaths=/run/left4me/idmap/%i:/overlay
WorkingDirectory=/overlay
Environment=HOME=/tmp PATH=/usr/bin:/usr/sbin OVERLAY=/overlay
UMask=0022
OOMScoreAdjust=500
MemoryMax=4G
MemorySwapMax=0
TasksMax=512
CPUQuota=200%
RuntimeMaxSec=3600
TimeoutStartSec=1h
TimeoutStopSec=30s
```

Notes:
- `Type=oneshot` makes `systemctl start` block until ExecStart exits.
- `ConditionPath*` provides early failure if the overlay dir or script
  doesn't exist (avoids running the unit at all in those cases).
- `RequiresMountsFor=/var/lib/left4me/overlays/%i` ensures the parent
  fs is mounted before this unit runs (`/` and `/var/lib` if it's a
  separate mount point).
- `ExecStopPost` uses `+-` (root, ignore failures) — the bind might
  already be torn down if the unit is restarting.
- `BindReadOnlyPaths=...:/script.sh` makes the per-overlay script
  available at `/script.sh` inside the sandbox, picked from the
  predictable path `/var/lib/left4me/sandbox-scripts/%i.sh`.

### Script source: filesystem vs. DB

**Critical design decision the future session must make.** The current
plan in the unit sketch above assumes the worker writes the script
content to `/var/lib/left4me/sandbox-scripts/<id>.sh` before calling
`systemctl start`. But the script *already lives in the DB* (the
`overlays.script` column), and the unit instance name `%i` is the
overlay row id. The filesystem copy is redundant unless we want it.

Three options:

**Option A — worker writes the script (the unit sketch above).**
Worker queries DB, writes `<id>.sh` to a known path, then
`systemctl start`. Unit reads via `BindReadOnlyPaths`. Simple, no DB
access from the unit, the existing `_sandbox_script_dir()` plumbing
mostly works. Cost: redundant on-disk copy; stale files between
builds if you don't clean them.

**Option B — unit fetches the script from the DB itself.** A small
root-side helper installed as
`/usr/local/libexec/left4me/left4me-fetch-script` does:

```python
#!/usr/bin/python3
import sqlite3, sys
overlay_id = int(sys.argv[1])
conn = sqlite3.connect("/var/lib/left4me/left4me.db")
row = conn.execute(
    "SELECT script FROM overlays WHERE id = ?", (overlay_id,)
).fetchone()
sys.stdout.write((row[0] if row else "") or "")
```

Unit's ExecStartPre runs it as root (the `+` prefix), pipes the
output to a runtime path that ExecStart reads:

```ini
RuntimeDirectory=left4me/sandbox-scripts
RuntimeDirectoryMode=0700
ExecStartPre=+/bin/sh -c '/usr/local/libexec/left4me/left4me-fetch-script %i \
    > /run/left4me/sandbox-scripts/%i.sh && chmod 0644 /run/left4me/sandbox-scripts/%i.sh'
BindReadOnlyPaths=/run/left4me/sandbox-scripts/%i.sh:/script.sh
```

(`RuntimeDirectory=` auto-creates `/run/left4me/sandbox-scripts/` on
start and removes it on stop, including the file inside.)

The fetch script doesn't need sudoers — it runs from ExecStartPre with
root privileges already. It only reads the DB; no writes. The DB is
`root:left4me 0640` so root can read it.

Worker becomes a one-liner: `sudo systemctl start build-overlay@<id>`.
No FS prep, no tmpfile cleanup.

**Option C — pipe the script content directly into bash stdin.** The
unit's ExecStart is something like
`/bin/sh -c "fetch-script %i | /bin/bash"`. Pros: no on-disk file at
all. Cons: `/bin/bash` runs without a file path, so `$0` is `bash` and
error messages look weird; harder to debug a failing script when there's
no file to inspect.

**Recommendation**: Option B. Decouples script storage (DB) from
sandbox transport (a /run/ runtime file). RuntimeDirectory= handles
cleanup. Worker becomes trivially small. The fetch-script helper is
~10 lines and stays in deploy/files/usr/local/libexec/left4me/.

If Option A is chosen instead, plan to track the script tmpfiles
explicitly so they don't accumulate. With Option B, RuntimeDirectory
auto-cleans on stop.

### Worker invocation

Replace `run_sandboxed_script` in
`l4d2web/services/overlay_builders.py`. The code below is the **Option
A** shape (worker writes the script). For **Option B** (recommended),
drop the `script_dir`/`script_path`/`write_text`/`chmod` lines — the
unit's ExecStartPre fetches from the DB. The signature can also drop
`script_text` since the worker doesn't need to pass content anymore.

```python
def run_sandboxed_script(
    overlay_id: int,
    script_text: str,  # remove this param if Option B
    *,
    on_stdout: LogSink,
    on_stderr: LogSink,
    should_cancel: CancelCheck,
) -> None:
    # The four lines below are Option A only — delete for Option B.
    script_dir = _sandbox_script_dir()
    script_dir.mkdir(parents=True, exist_ok=True)
    script_path = script_dir / f"{overlay_id}.sh"
    script_path.write_text(script_text or "")
    os.chmod(script_path, 0o644)

    unit = f"build-overlay@{overlay_id}.service"

    # Tail the unit's journal as a sidecar so output streams into job-logs
    # while the unit runs. --follow exits when the unit reaches "inactive".
    journal = subprocess.Popen(
        ["journalctl", "--unit", unit, "--output=cat", "--follow",
         "--since=now", "--no-pager"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )

    try:
        # Start the unit (sudoers permits this exact verb pattern).
        # Type=oneshot makes this block until ExecStart returns.
        rc = subprocess.run(
            ["sudo", "-n", "/bin/systemctl", "start", unit],
            check=False,
        ).returncode
    finally:
        # Drain remaining journal lines (journalctl --follow may not have
        # printed everything yet by the time systemctl returns).
        journal.terminate()
        try:
            for line in journal.stdout or []:
                on_stdout(line.rstrip("\n"))
        finally:
            journal.wait(timeout=5)

    # Read exit code from the unit. ExecMainStatus is the script's rc;
    # Result is "success" / "failed" / "timeout" etc.
    show = subprocess.check_output(
        ["systemctl", "show", unit,
         "-p", "ExecMainStatus", "-p", "Result", "--value"],
        text=True,
    ).split()
    exec_main_status = int(show[0])
    result = show[1]

    if rc != 0 or result != "success":
        raise BuildError(
            f"build-overlay@{overlay_id} failed: "
            f"systemctl rc={rc} unit result={result} script exit={exec_main_status}"
        )
```

That's ~30 lines vs. ~50 today, and the helper script disappears
entirely.

**Two refinements to consider:**

1. **Cancel semantics**: today the worker's `should_cancel` callback
   triggers a SIGTERM via the existing `run_command` plumbing. With
   systemctl-start, you'd issue `systemctl stop build-overlay@<id>`
   in a parallel thread when `should_cancel()` returns True. Wire
   that up.
2. **Journal streaming race**: `journalctl --follow --since=now`
   started *after* `systemctl start` may miss the first few lines.
   Two fixes:
   - Start the journal tail before systemctl-start (the unit doesn't
     exist yet, so journalctl waits silently — verify this behaviour
     on Trixie).
   - Or use `journalctl --cursor` machinery: snapshot the cursor
     before start, then read with `--cursor=` after.

   Start-before is simpler and likely sufficient for L4D2 build
   verbosity, where the first second of output isn't critical.

### Sudoers

Replace:
```
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox
```

with:
```
left4me ALL=(root) NOPASSWD: /bin/systemctl start build-overlay@*.service
left4me ALL=(root) NOPASSWD: /bin/systemctl stop build-overlay@*.service
```

(Tighter — verb-prefixed and instance-globbed. No script path passed.)

### Slice

`l4d2-build.slice` already exists (per the gameserver/sandbox today's
configuration). Reuse it — no change needed.

### Sandbox script tmpfile cleanup

Currently `run_sandboxed_script` writes a per-invocation
`tempfile.NamedTemporaryFile` with a random suffix and unlinks it in a
`finally`. With template-unit lookup, the script path is **predictable
per overlay id** (`/var/lib/left4me/sandbox-scripts/<id>.sh`).
Implications:

- Two concurrent builds for the *same* overlay id would clobber the
  script file. The job queue already serializes per-overlay (per
  `l4d2web/services/job_worker.py:OVERLAY_OPERATIONS`), so this
  is OK.
- Scripts persist between builds (no auto-cleanup). Either accept
  that (the next build overwrites) or delete after the unit goes
  inactive. Recommend: leave them — small, useful for debugging.

## Migration

In order:

1. **Add the unit emission to ckn-bw's `bundles/left4me/metadata.py`
   systemd_units reactor.** Mirror the pattern used for
   `left4me-server@.service`. Drop in the template-unit content as
   another reactor entry.
2. **Update sudoers** (`bundles/left4me/files/etc/sudoers.d/left4me`)
   to permit `systemctl start/stop build-overlay@*.service` and
   remove the script-sandbox grant.
3. **Replace `run_sandboxed_script` in left4me.** Add the new
   journalctl-based output streaming, exit-code reading, and cancel
   handling. Keep the function signature stable so callers
   (`ScriptBuilder.build`, the wipe route) are unchanged.
4. **Delete `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`.**
5. **Update tests:**
   - `deploy/tests/test_deploy_artifacts.py`:
     - Drop `test_script_sandbox_uses_idmap_staging` and any other
       tests that read SCRIPT_SANDBOX_HELPER.
     - Add tests that assert the new unit emission in ckn-bw's
       reactor output. (But that's in the other repo — left4me's
       deploy tests can't directly cover it.)
     - Add a test that asserts the worker invokes
       `sudo systemctl start build-overlay@*` (grep
       `overlay_builders.py`).
   - `l4d2web/tests/test_overlay_builders.py` (if it exists):
     update mocks for `run_sandboxed_script` to expect the new
     subprocess shape.
6. **Test on `left4.me`:**
   - Push left4me, `bw apply ovh.left4me`. Apply also picks up the
     new unit emission and the sudoers change.
   - Trigger a script-overlay rebuild via the web UI or the
     enqueue API path used in this session (see test history in
     git log around 2026-05-15).
   - Inspect: `journalctl -u build-overlay@9.service`,
     `systemctl status build-overlay@9.service`.
   - Verify on-disk state: overlay files end up `left4me`-owned;
     idmap bind cleanly torn down (`findmnt | grep idmap` empty).

## Open decisions for the future session

0. **Script source: filesystem (Option A) vs. DB-fetched in ExecStartPre
   (Option B) vs. piped to stdin (Option C).** See the "Script source"
   section above. This is the highest-impact decision because it
   shapes the worker, the unit's ExecStartPre, and whether you need
   a fetch-script helper binary at all. Recommendation: Option B.

1. **`/run/left4me/idmap/%i` vs. `/var/lib/left4me/tmp/sandbox-idmap-%i`** —
   `/run` is tmpfs and wiped on reboot, more correct for transient
   mount paths. But it requires the dir to exist (created by
   ExecStartPre). Either works.
2. **What to do with the existing `left4me-apply-cake` dead code** —
   irrelevant to this refactor; flagged in the other handoff doc.
3. **Whether to drop the post-build `chmod o+r` in the sandbox helper** —
   already gone in the build-time-idmap commit. (Verify in the new
   unit nothing equivalent is needed; files are left4me-owned, web
   reads via primary uid.)
4. **`Type=oneshot` vs. `Type=exec`** — oneshot blocks `systemctl
   start`. exec doesn't. With oneshot we don't need the
   `journalctl --follow` workaround if we read journal *after*
   completion. But for live progress (which the existing builds
   stream), `--follow` is still needed. Stick with oneshot.
5. **Should the unit set `KillMode=mixed`** to ensure children die on
   stop? Worth checking — the existing systemd-run line doesn't set
   it explicitly; defaults usually suffice.
6. **`StateDirectory=` vs. explicit `mkdir -p`** — systemd has
   StateDirectory and RuntimeDirectory directives that auto-create
   per-unit directories. Could replace the `mkdir -p /run/left4me/idmap/%i`
   ExecStartPre with `RuntimeDirectory=left4me/idmap/%i`. Cleaner;
   gets auto-cleanup on stop too. Recommend doing this — both the
   mkdir and the rmdir ExecStopPost would go away.

## Verification

End-to-end smoke test on `left4.me` after the deploy:

```bash
# unit is installed and template-parseable
systemctl status build-overlay@.service           # should show "loaded; static"
sudo systemd-analyze verify build-overlay@1.service

# enqueue a build via the web app's worker path (mimic the
# enqueue_build_overlay pattern from this session's job 64 onwards)
# then watch:
sudo journalctl -u build-overlay@9.service -f

# on completion:
systemctl show build-overlay@9.service -p Result -p ExecMainStatus
# expect: Result=success, ExecMainStatus=0

# disk state
sudo find /var/lib/left4me/overlays/9 -uid 981  # should be empty
sudo find /run/left4me/idmap                    # should not exist or be empty

# pid 1 mount table — no orphan idmap binds
sudo findmnt --task 1 -o TARGET | grep idmap     # empty
```

## Risks

- **Worker cancel-during-build**: today's `should_cancel` callback
  signals via `run_command`'s child process. With the unit, the
  worker needs a separate path: spawn a thread that polls
  `should_cancel()` and calls `sudo systemctl stop build-overlay@<id>`
  when triggered. Without this, builds that exceed `RuntimeMaxSec` or
  hit user-cancel won't terminate promptly.
- **Journal lag at unit start**: `journalctl --follow` started before
  `systemctl start` should pick up all output. If not, may need
  cursor-based streaming. Test with a script that prints immediately
  (`echo hello; exit 0`) — if "hello" appears in the job log, race
  is handled.
- **Sudoers globbing**: `systemctl start build-overlay@*.service`
  permits any instance id including weird strings like `../etc-passwd`.
  Use a tighter glob if possible (e.g.,
  `build-overlay@[0-9]*.service`). Test that sudoers rejects
  unexpected instance names.
- **Type=oneshot return semantics**: confirm that `systemctl start
  build-overlay@<id>` on a Type=oneshot unit returns rc=3 (or
  similar) when the unit's ExecStart fails, so the worker can detect
  failure without re-querying `systemctl show`.
- **Idle running over reboot**: a build that's running across a reboot
  is killed when the system goes down. That's identical to today's
  behavior with systemd-run. Acceptable.
- **The journalctl sidecar process accumulates as a zombie if not
  reaped properly.** The proposed code does `journal.wait(timeout=5)`
  — handle the timeout case (force-kill).

## Pointers

Reference files (with line numbers if applicable):

- **Current helper to be removed**:
  `deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`
- **Current worker invoker**:
  `l4d2web/services/overlay_builders.py:run_sandboxed_script` (~ln 324)
- **Current job-worker dispatch**:
  `l4d2web/services/job_worker.py` (build_overlay operation)
- **Sudoers**:
  `deploy/files/etc/sudoers.d/left4me` (matched verbatim in
  `ckn-bw/bundles/left4me/files/etc/sudoers.d/left4me`)
- **Sample template unit pattern** (the model to copy):
  `left4me-server@.service` emission in ckn-bw's
  `bundles/left4me/metadata.py` systemd_units reactor.
- **Existing slice declaration** (already correct):
  `l4d2-build.slice` in ckn-bw's reactor.

Recent commits that touched this surface:
- `4838108` — moved idmap to build time (the refactor that surfaced
  the namespace bug)
- `f1aa05d` — added nsenter self-wrap (the band-aid this refactor
  removes)
- `2f6a9cf`, `9053186`, `dd918ac` — earlier idmap-on-mount approach
  that was reverted

Related design docs:
- `docs/superpowers/plans/2026-05-15-build-time-idmap.md` — the
  plan whose architecture this refactor builds on
- `docs/superpowers/specs/2026-05-15-deploy-dir-rethink-design.md` —
  unrelated open questions about deploy/ layout

## What's NOT in scope

- Rewriting the sandbox in Python / packaging differently.
- Changing the security hardening profile (the unit duplicates the
  current set verbatim — adjust later if needed).
- Splitting the gameserver uid from the web app uid (noted in earlier
  handoff doc).
- Re-evaluating whether `l4d2-sandbox` should exist as a separate
  uid (kept; defense in depth).
- Touching the `left4me-overlay` gameserver helper (it already uses
  the pattern; only the sandbox helper is being refactored to match).

## Estimate

Rough breakdown for the future session:
- Unit file design + ckn-bw reactor change: 1-2 hours
- Worker rewrite (run_sandboxed_script): 1-2 hours
- Tests: 1 hour
- Deploy + verify on test server: 30 min
- Bug-fix and iteration buffer: 1 hour

~5 hours of focused work, assuming no surprises with journalctl
streaming or sudoers semantics.

## Decision criteria for whether to do this

Do it if:
- You're about to make any other change to the sandbox hardening,
  build lifecycle, or sandbox uid story.
- You're frustrated by debugging the existing helper.
- You want to remove the nsenter band-aid for hygiene.

Skip if:
- The sandbox is stable and you're not planning related changes.
- You'd rather invest the time in higher-value work elsewhere.

The current solution is fine; this refactor is upgrade-not-fix.