left4me/docs/superpowers/specs/2026-05-15-build-overlay-unit-design.md
mwiegand 28b0ff951b
spec(build-overlay-unit): flag DB-fetch-in-ExecStartPre as an option
The script content lives in the overlays.script DB column and the
unit's %i is the row id, so the worker-writes-script-to-fs step in
the original sketch is duplication. Document three options (worker
writes / unit fetches via helper / pipe to stdin) and recommend the
unit-fetches variant with RuntimeDirectory= auto-cleanup. Promote
this to the top of the open-decisions list since it shapes the
worker, the unit, and whether a fetch-script helper is added.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 01:54:41 +02:00

23 KiB

Build-overlay template unit — refactor the script-sandbox helper

Status: open question, not settled design. This is a handoff document prompted by the build-time idmap landing on 2026-05-15. The current left4me-script-sandbox shell helper works but has accumulated several layers of complexity (idmap bind setup, trap cleanup, nsenter self-wrap) that a systemd template unit would handle declaratively. The same pattern is already established in the codebase for gameservers (left4me-server@.service). A future session should evaluate whether to refactor and, if so, follow the steps below.

Why this came up

While verifying the build-time idmap refactor, the first 5 build jobs failed with mkdir: Permission denied on /overlay/.... Root cause:

  • left4me-web.service runs with PrivateTmp=true, which puts the web app (and anything it sudoes into) in a private mount namespace.
  • The script-sandbox helper, invoked via sudo from the web app, inherits that namespace.
  • The helper's mount --bind --map-users=... pre-creates the idmap staging path in the web app's namespace.
  • systemd-run (called by the helper) spawns a transient unit in PID 1's mount namespace.
  • The transient unit's BindPaths=...:/overlay resolves the staging path in PID 1's namespace — where the bind doesn't exist. It sees an empty root-owned dir at the staging path (mkdir'd by the helper before the bind) and binds that to /overlay.
  • Sandbox uid hits EACCES on every write.

We fixed it (commit f1aa05d) by self-wrapping the helper into PID 1's mount namespace at the top of the script:

if [[ "${L4D2_SANDBOX_IN_PID1_MNT_NS:-}" != "1" ]]; then
    exec env L4D2_SANDBOX_IN_PID1_MNT_NS=1 \
        /usr/bin/nsenter --mount=/proc/1/ns/mnt -- "$0" "$@"
fi

That works. But it's a band-aid for an architectural friction: helper invocation via sudo from a hardened service forces us to manually escape the caller's namespace before any mount syscall. If the helper were itself a systemd unit started by PID 1, the namespace would be correct by default.

The gameserver helper handles this at the unit level. Its ExecStartPre is:

ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i

i.e. wrapped in nsenter at the unit. The unit is started by PID 1, so it has PID 1's namespace, then nsenter is a belt-and-braces. Mirror that pattern for builds: introduce build-overlay@.service as a template unit, have the worker activate it instead of forking a helper.

Current state (the thing being replaced)

Files:

  • deploy/files/usr/local/libexec/left4me/left4me-script-sandbox — the bash helper. ~100 lines. Self-wraps in nsenter, does pre-bind with --map-users, invokes systemd-run --quiet --collect --wait --pipe -p ... -- /bin/bash /script.sh, cleans up via trap.
  • l4d2web/services/overlay_builders.py:run_sandboxed_script — the worker entry point. Writes script content to /var/lib/left4me/sandbox-scripts/<uniqued>.sh, invokes sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <path>, streams stdout/stderr via subprocess.Popen + the existing run_command plumbing.
  • deploy/files/etc/sudoers.d/left4me — grants left4me NOPASSWD to the helper path.

What the helper actually does:

  1. nsenter into PID 1's mount ns (the band-aid)
  2. validate args + overlay dir exists
  3. compute STAGING=/var/lib/left4me/tmp/sandbox-idmap-${OVERLAY_ID}
  4. trap cleanup; pre-emptive umount of stale staging; mkdir -p the staging
  5. mount --bind --map-users=$(id -u left4me):$(id -u l4d2-sandbox):1 --map-groups=... $OVERLAY_DIR $STAGING
  6. systemd-run with the full hardening profile, BindPaths=$STAGING:/overlay
  7. Wait for completion, propagate exit code
  8. trap fires: umount $STAGING; rmdir $STAGING

Proposed design

Replace the bash helper with two systemd units (template + a slice) emitted from ckn-bw's existing systemd_units reactor, plus a small worker rewrite.

build-overlay@.service (template unit)

[Unit]
Description=Sandboxed overlay build for instance %i
DefaultDependencies=no
After=local-fs.target
RequiresMountsFor=/var/lib/left4me/overlays/%i
ConditionPathIsDirectory=/var/lib/left4me/overlays/%i
ConditionPathExists=/var/lib/left4me/sandbox-scripts/%i.sh

[Service]
Type=oneshot
User=l4d2-sandbox
Group=l4d2-sandbox
Slice=l4d2-build.slice

# Idmap bind: disk uid 980 (left4me) ↔ mount uid 981 (sandbox), so writes
# from the sandbox land on disk as left4me. + prefix runs as root before
# the User= drop (mount syscall requires CAP_SYS_ADMIN).
ExecStartPre=+/usr/bin/mkdir -p /run/left4me/idmap/%i
ExecStartPre=+/usr/bin/mount --bind \
    --map-users=980:981:1 --map-groups=980:981:1 \
    /var/lib/left4me/overlays/%i /run/left4me/idmap/%i

ExecStart=/bin/bash /script.sh

ExecStopPost=+-/usr/bin/umount /run/left4me/idmap/%i
ExecStopPost=+-/usr/bin/rmdir /run/left4me/idmap/%i

# Hardening — all the -p flags from the current bash helper, declared
# declaratively here instead of as systemd-run -p arguments.
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
PrivateDevices=yes
PrivateIPC=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectKernelLogs=yes
ProtectControlGroups=yes
RestrictNamespaces=yes
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictSUIDSGID=yes
LockPersonality=yes
MemoryDenyWriteExecute=yes
SystemCallFilter=@system-service @network-io
SystemCallArchitectures=native
CapabilityBoundingSet=
AmbientCapabilities=
IPAddressDeny=127.0.0.0/8 ::1/128 169.254.0.0/16 fe80::/10 224.0.0.0/4 ff00::/8 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 100.64.0.0/10 fc00::/7
TemporaryFileSystem=/etc /var/lib
BindReadOnlyPaths=/etc/left4me/sandbox-resolv.conf:/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives /var/lib/left4me/sandbox-scripts/%i.sh:/script.sh
BindPaths=/run/left4me/idmap/%i:/overlay
WorkingDirectory=/overlay
Environment=HOME=/tmp PATH=/usr/bin:/usr/sbin OVERLAY=/overlay
UMask=0022
OOMScoreAdjust=500
MemoryMax=4G
MemorySwapMax=0
TasksMax=512
CPUQuota=200%
RuntimeMaxSec=3600
TimeoutStartSec=1h
TimeoutStopSec=30s

Notes:

  • Type=oneshot makes systemctl start block until ExecStart exits.
  • ConditionPath* provides early failure if the overlay dir or script doesn't exist (avoids running the unit at all in those cases).
  • RequiresMountsFor=/var/lib/left4me/overlays/%i ensures the parent fs is mounted before this unit runs (/ and /var/lib if it's a separate mount point).
  • ExecStopPost uses +- (root, ignore failures) — the bind might already be torn down if the unit is restarting.
  • BindReadOnlyPaths=...:/script.sh makes the per-overlay script available at /script.sh inside the sandbox, picked from the predictable path /var/lib/left4me/sandbox-scripts/%i.sh.

Script source: filesystem vs. DB

Critical design decision the future session must make. The current plan in the unit sketch above assumes the worker writes the script content to /var/lib/left4me/sandbox-scripts/<id>.sh before calling systemctl start. But the script already lives in the DB (the overlays.script column), and the unit instance name %i is the overlay row id. The filesystem copy is redundant unless we want it.

Three options:

Option A — worker writes the script (the unit sketch above). Worker queries DB, writes <id>.sh to a known path, then systemctl start. Unit reads via BindReadOnlyPaths. Simple, no DB access from the unit, the existing _sandbox_script_dir() plumbing mostly works. Cost: redundant on-disk copy; stale files between builds if you don't clean them.

Option B — unit fetches the script from the DB itself. A small root-side helper installed as /usr/local/libexec/left4me/left4me-fetch-script does:

#!/usr/bin/python3
import sqlite3, sys
overlay_id = int(sys.argv[1])
conn = sqlite3.connect("/var/lib/left4me/left4me.db")
row = conn.execute(
    "SELECT script FROM overlays WHERE id = ?", (overlay_id,)
).fetchone()
sys.stdout.write((row[0] if row else "") or "")

Unit's ExecStartPre runs it as root (the + prefix), pipes the output to a runtime path that ExecStart reads:

RuntimeDirectory=left4me/sandbox-scripts
RuntimeDirectoryMode=0700
ExecStartPre=+/bin/sh -c '/usr/local/libexec/left4me/left4me-fetch-script %i \
    > /run/left4me/sandbox-scripts/%i.sh && chmod 0644 /run/left4me/sandbox-scripts/%i.sh'
BindReadOnlyPaths=/run/left4me/sandbox-scripts/%i.sh:/script.sh

(RuntimeDirectory= auto-creates /run/left4me/sandbox-scripts/ on start and removes it on stop, including the file inside.)

The fetch script doesn't need sudoers — it runs from ExecStartPre with root privileges already. It only reads the DB; no writes. The DB is root:left4me 0640 so root can read it.

Worker becomes a one-liner: sudo systemctl start build-overlay@<id>. No FS prep, no tmpfile cleanup.

Option C — pipe the script content directly into bash stdin. The unit's ExecStart is something like /bin/sh -c "fetch-script %i | /bin/bash". Pros: no on-disk file at all. Cons: /bin/bash runs without a file path, so $0 is bash and error messages look weird; harder to debug a failing script when there's no file to inspect.

Recommendation: Option B. Decouples script storage (DB) from sandbox transport (a /run/ runtime file). RuntimeDirectory= handles cleanup. Worker becomes trivially small. The fetch-script helper is ~10 lines and stays in deploy/files/usr/local/libexec/left4me/.

If Option A is chosen instead, plan to track the script tmpfiles explicitly so they don't accumulate. With Option B, RuntimeDirectory auto-cleans on stop.

Worker invocation

Replace run_sandboxed_script in l4d2web/services/overlay_builders.py. The code below is the Option A shape (worker writes the script). For Option B (recommended), drop the script_dir/script_path/write_text/chmod lines — the unit's ExecStartPre fetches from the DB. The signature can also drop script_text since the worker doesn't need to pass content anymore.

def run_sandboxed_script(
    overlay_id: int,
    script_text: str,  # remove this param if Option B
    *,
    on_stdout: LogSink,
    on_stderr: LogSink,
    should_cancel: CancelCheck,
) -> None:
    # The four lines below are Option A only — delete for Option B.
    script_dir = _sandbox_script_dir()
    script_dir.mkdir(parents=True, exist_ok=True)
    script_path = script_dir / f"{overlay_id}.sh"
    script_path.write_text(script_text or "")
    os.chmod(script_path, 0o644)

    unit = f"build-overlay@{overlay_id}.service"

    # Tail the unit's journal as a sidecar so output streams into job-logs
    # while the unit runs. --follow exits when the unit reaches "inactive".
    journal = subprocess.Popen(
        ["journalctl", "--unit", unit, "--output=cat", "--follow",
         "--since=now", "--no-pager"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )

    try:
        # Start the unit (sudoers permits this exact verb pattern).
        # Type=oneshot makes this block until ExecStart returns.
        rc = subprocess.run(
            ["sudo", "-n", "/bin/systemctl", "start", unit],
            check=False,
        ).returncode
    finally:
        # Drain remaining journal lines (journalctl --follow may not have
        # printed everything yet by the time systemctl returns).
        journal.terminate()
        try:
            for line in journal.stdout or []:
                on_stdout(line.rstrip("\n"))
        finally:
            journal.wait(timeout=5)

    # Read exit code from the unit. ExecMainStatus is the script's rc;
    # Result is "success" / "failed" / "timeout" etc.
    show = subprocess.check_output(
        ["systemctl", "show", unit,
         "-p", "ExecMainStatus", "-p", "Result", "--value"],
        text=True,
    ).split()
    exec_main_status = int(show[0])
    result = show[1]

    if rc != 0 or result != "success":
        raise BuildError(
            f"build-overlay@{overlay_id} failed: "
            f"systemctl rc={rc} unit result={result} script exit={exec_main_status}"
        )

That's ~30 lines vs. ~50 today, and the helper script disappears entirely.

Two refinements to consider:

  1. Cancel semantics: today the worker's should_cancel callback triggers a SIGTERM via the existing run_command plumbing. With systemctl-start, you'd issue systemctl stop build-overlay@<id> in a parallel thread when should_cancel() returns True. Wire that up.

  2. Journal streaming race: journalctl --follow --since=now started after systemctl start may miss the first few lines. Two fixes:

    • Start the journal tail before systemctl-start (the unit doesn't exist yet, so journalctl waits silently — verify this behaviour on Trixie).
    • Or use journalctl --cursor machinery: snapshot the cursor before start, then read with --cursor= after.

    Start-before is simpler and likely sufficient for L4D2 build verbosity, where the first second of output isn't critical.

Sudoers

Replace:

left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox

with:

left4me ALL=(root) NOPASSWD: /bin/systemctl start build-overlay@*.service
left4me ALL=(root) NOPASSWD: /bin/systemctl stop build-overlay@*.service

(Tighter — verb-prefixed and instance-globbed. No script path passed.)

Slice

l4d2-build.slice already exists (per the gameserver/sandbox today's configuration). Reuse it — no change needed.

Sandbox script tmpfile cleanup

Currently run_sandboxed_script writes a per-invocation tempfile.NamedTemporaryFile with a random suffix and unlinks it in a finally. With template-unit lookup, the script path is predictable per overlay id (/var/lib/left4me/sandbox-scripts/<id>.sh). Implications:

  • Two concurrent builds for the same overlay id would clobber the script file. The job queue already serializes per-overlay (per l4d2web/services/job_worker.py:OVERLAY_OPERATIONS), so this is OK.
  • Scripts persist between builds (no auto-cleanup). Either accept that (the next build overwrites) or delete after the unit goes inactive. Recommend: leave them — small, useful for debugging.

Migration

In order:

  1. Add the unit emission to ckn-bw's bundles/left4me/metadata.py systemd_units reactor. Mirror the pattern used for left4me-server@.service. Drop in the template-unit content as another reactor entry.
  2. Update sudoers (bundles/left4me/files/etc/sudoers.d/left4me) to permit systemctl start/stop build-overlay@*.service and remove the script-sandbox grant.
  3. Replace run_sandboxed_script in left4me. Add the new journalctl-based output streaming, exit-code reading, and cancel handling. Keep the function signature stable so callers (ScriptBuilder.build, the wipe route) are unchanged.
  4. Delete deploy/files/usr/local/libexec/left4me/left4me-script-sandbox.
  5. Update tests:
    • deploy/tests/test_deploy_artifacts.py:
      • Drop test_script_sandbox_uses_idmap_staging and any other tests that read SCRIPT_SANDBOX_HELPER.
      • Add tests that assert the new unit emission in ckn-bw's reactor output. (But that's in the other repo — left4me's deploy tests can't directly cover it.)
      • Add a test that asserts the worker invokes sudo systemctl start build-overlay@* (grep overlay_builders.py).
    • l4d2web/tests/test_overlay_builders.py (if it exists): update mocks for run_sandboxed_script to expect the new subprocess shape.
  6. Test on left4.me:
    • Push left4me, bw apply ovh.left4me. Apply also picks up the new unit emission and the sudoers change.
    • Trigger a script-overlay rebuild via the web UI or the enqueue API path used in this session (see test history in git log around 2026-05-15).
    • Inspect: journalctl -u build-overlay@9.service, systemctl status build-overlay@9.service.
    • Verify on-disk state: overlay files end up left4me-owned; idmap bind cleanly torn down (findmnt | grep idmap empty).

Open decisions for the future session

  1. Script source: filesystem (Option A) vs. DB-fetched in ExecStartPre (Option B) vs. piped to stdin (Option C). See the "Script source" section above. This is the highest-impact decision because it shapes the worker, the unit's ExecStartPre, and whether you need a fetch-script helper binary at all. Recommendation: Option B.

  2. /run/left4me/idmap/%i vs. /var/lib/left4me/tmp/sandbox-idmap-%i/run is tmpfs and wiped on reboot, more correct for transient mount paths. But it requires the dir to exist (created by ExecStartPre). Either works.

  3. What to do with the existing left4me-apply-cake dead code — irrelevant to this refactor; flagged in the other handoff doc.

  4. Whether to drop the post-build chmod o+r in the sandbox helper — already gone in the build-time-idmap commit. (Verify in the new unit nothing equivalent is needed; files are left4me-owned, web reads via primary uid.)

  5. Type=oneshot vs. Type=exec — oneshot blocks systemctl start. exec doesn't. With oneshot we don't need the journalctl --follow workaround if we read journal after completion. But for live progress (which the existing builds stream), --follow is still needed. Stick with oneshot.

  6. Should the unit set KillMode=mixed to ensure children die on stop? Worth checking — the existing systemd-run line doesn't set it explicitly; defaults usually suffice.

  7. StateDirectory= vs. explicit mkdir -p — systemd has StateDirectory and RuntimeDirectory directives that auto-create per-unit directories. Could replace the mkdir -p /run/left4me/idmap/%i ExecStartPre with RuntimeDirectory=left4me/idmap/%i. Cleaner; gets auto-cleanup on stop too. Recommend doing this — both the mkdir and the rmdir ExecStopPost would go away.

Verification

End-to-end smoke test on left4.me after the deploy:

# unit is installed and template-parseable
systemctl status build-overlay@.service           # should show "loaded; static"
sudo systemd-analyze verify build-overlay@1.service

# enqueue a build via the web app's worker path (mimic the
# enqueue_build_overlay pattern from this session's job 64 onwards)
# then watch:
sudo journalctl -u build-overlay@9.service -f

# on completion:
systemctl show build-overlay@9.service -p Result -p ExecMainStatus
# expect: Result=success, ExecMainStatus=0

# disk state
sudo find /var/lib/left4me/overlays/9 -uid 981  # should be empty
sudo find /run/left4me/idmap                    # should not exist or be empty

# pid 1 mount table — no orphan idmap binds
sudo findmnt --task 1 -o TARGET | grep idmap     # empty

Risks

  • Worker cancel-during-build: today's should_cancel callback signals via run_command's child process. With the unit, the worker needs a separate path: spawn a thread that polls should_cancel() and calls sudo systemctl stop build-overlay@<id> when triggered. Without this, builds that exceed RuntimeMaxSec or hit user-cancel won't terminate promptly.
  • Journal lag at unit start: journalctl --follow started before systemctl start should pick up all output. If not, may need cursor-based streaming. Test with a script that prints immediately (echo hello; exit 0) — if "hello" appears in the job log, race is handled.
  • Sudoers globbing: systemctl start build-overlay@*.service permits any instance id including weird strings like ../etc-passwd. Use a tighter glob if possible (e.g., build-overlay@[0-9]*.service). Test that sudoers rejects unexpected instance names.
  • Type=oneshot return semantics: confirm that systemctl start build-overlay@<id> on a Type=oneshot unit returns rc=3 (or similar) when the unit's ExecStart fails, so the worker can detect failure without re-querying systemctl show.
  • Idle running over reboot: a build that's running across a reboot is killed when the system goes down. That's identical to today's behavior with systemd-run. Acceptable.
  • The journalctl sidecar process accumulates as a zombie if not reaped properly. The proposed code does journal.wait(timeout=5) — handle the timeout case (force-kill).

Pointers

Reference files (with line numbers if applicable):

  • Current helper to be removed: deploy/files/usr/local/libexec/left4me/left4me-script-sandbox
  • Current worker invoker: l4d2web/services/overlay_builders.py:run_sandboxed_script (~ln 324)
  • Current job-worker dispatch: l4d2web/services/job_worker.py (build_overlay operation)
  • Sudoers: deploy/files/etc/sudoers.d/left4me (matched verbatim in ckn-bw/bundles/left4me/files/etc/sudoers.d/left4me)
  • Sample template unit pattern (the model to copy): left4me-server@.service emission in ckn-bw's bundles/left4me/metadata.py systemd_units reactor.
  • Existing slice declaration (already correct): l4d2-build.slice in ckn-bw's reactor.

Recent commits that touched this surface:

  • 4838108 — moved idmap to build time (the refactor that surfaced the namespace bug)
  • f1aa05d — added nsenter self-wrap (the band-aid this refactor removes)
  • 2f6a9cf, 9053186, dd918ac — earlier idmap-on-mount approach that was reverted

Related design docs:

  • docs/superpowers/plans/2026-05-15-build-time-idmap.md — the plan whose architecture this refactor builds on
  • docs/superpowers/specs/2026-05-15-deploy-dir-rethink-design.md — unrelated open questions about deploy/ layout

What's NOT in scope

  • Rewriting the sandbox in Python / packaging differently.
  • Changing the security hardening profile (the unit duplicates the current set verbatim — adjust later if needed).
  • Splitting the gameserver uid from the web app uid (noted in earlier handoff doc).
  • Re-evaluating whether l4d2-sandbox should exist as a separate uid (kept; defense in depth).
  • Touching the left4me-overlay gameserver helper (it already uses the pattern; only the sandbox helper is being refactored to match).

Estimate

Rough breakdown for the future session:

  • Unit file design + ckn-bw reactor change: 1-2 hours
  • Worker rewrite (run_sandboxed_script): 1-2 hours
  • Tests: 1 hour
  • Deploy + verify on test server: 30 min
  • Bug-fix and iteration buffer: 1 hour

~5 hours of focused work, assuming no surprises with journalctl streaming or sudoers semantics.

Decision criteria for whether to do this

Do it if:

  • You're about to make any other change to the sandbox hardening, build lifecycle, or sandbox uid story.
  • You're frustrated by debugging the existing helper.
  • You want to remove the nsenter band-aid for hygiene.

Skip if:

  • The sandbox is stable and you're not planning related changes.
  • You'd rather invest the time in higher-value work elsewhere.

The current solution is fine; this refactor is upgrade-not-fix.