left4me/docs/superpowers/specs/2026-05-08-l4d2-script-sandbox-v2-systemd.md
mwiegand efaaf84cd9
docs(specs): script sandbox v2 — systemd-only design + plan
Spec captures the v2 architecture (systemd-run service mode with full
hardening directives, no bwrap), the two surfaces in scope (helper
rewrite + bubblewrap dep removal + left4me.db mode tightening), and the
gotchas surfaced by smoke-testing the prototype on ckn@10.0.4.128:
- ProtectSystem=strict makes /var/lib/left4me visible (not invisible);
  must add TemporaryFileSystem=/var/lib to mask it.
- Script bind via BindReadOnlyPaths uses ${SCRIPT}:/script.sh syntax.
- No PrivatePID= directive in systemd; host PIDs visible via /proc.
  Information disclosure only — kernel UID-mismatch blocks signals.

Plan breaks the migration into 4 tasks (helper rewrite, deploy-script
deps + DB mode, host smoke-test, drift sweep) with explicit rollback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:46:13 +02:00

L4D2 Script Sandbox v2 — Systemd-Only

Goal: Replace the bwrap-based left4me-script-sandbox helper with one that uses systemd-run in service-unit mode alone. Drop bubblewrap as a system dependency. Gain capability bounding, seccomp filtering, kernel-tunable / -module / -log protection, address-family restriction, LockPersonality, MemoryDenyWriteExecute, and RestrictSUIDSGID — none of which the bwrap+systemd-run-scope composition could provide. Lose PID-namespace isolation (no PrivatePID= directive in systemd) — judged acceptable for the current trust model.

Approval status: User-approved 2026-05-08 after smoke testing on ckn@10.0.4.128.

Context

The v1 sandbox (see 2026-05-08-l4d2-script-overlays-design.md) layers bubblewrap for namespacing inside systemd-run --scope for cgroup limits. That works, but --scope units register an existing process tree and so cannot accept service-only directives like NoNewPrivileges=, ProtectSystem=, SystemCallFilter=, CapabilityBoundingSet=, etc. Smoke testing on the deployed host confirmed bwrap covers mount/PID/IPC/UTS namespacing well, but leaves capability bounding, seccomp, and kernel-surface protection unenforced.

A switch to systemd-run in default (transient service) mode unlocks the full hardening surface. Smoke testing of a v2 prototype against the deployed test host confirmed:

  • Every isolation invariant the bwrap version provides (filesystem masking, UID drop, network reachability, /overlay RW bind, host-side l4d2-sandbox ownership, host secret hiding) is reproducible with systemd directives.
  • All cgroup limits (memory.max=4G, memory.swap.max=0, pids.max=512, cpu.max=200%, RuntimeMaxSec=3600) apply identically.
  • MemoryError fires at the 4 GB cap (cgroup-enforced).
  • The wipe path (find /overlay -mindepth 1 -delete) succeeds.
  • Hardening directives the v1 design couldn't express enforce real syscall blocks: unshare(CLONE_NEWUSER), mount(2), personality(2), bpf(2), swapoff(2), sysctl -w are all blocked.
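
The blocked-operation checks above could be repeated with a small probe run inside the sandbox. This is a minimal sketch, not the smoke test that was actually used: the function names and the probe list are hypothetical. A probe that fails because the tool is missing from PATH is also reported as blocked, and some probes would fail for an unprivileged user even without the hardening directives.

```shell
#!/bin/bash
# Hypothetical probe helper; names are illustrative, not part of the helper script.

# expect_blocked CMD...: run CMD and report whether it was refused.
# Prints "blocked: <cmd>" and returns 0 when CMD fails,
# prints "ALLOWED: <cmd>" and returns 1 when CMD succeeds.
expect_blocked() {
    if "$@" >/dev/null 2>&1; then
        echo "ALLOWED: $*"
        return 1
    fi
    echo "blocked: $*"
}

# run_probes: intended to be called from a script running inside the sandbox.
# Returns the number of probes that were unexpectedly allowed.
run_probes() {
    local failures=0
    expect_blocked unshare --user true        || failures=$((failures + 1))
    expect_blocked sysctl -w vm.swappiness=60 || failures=$((failures + 1))
    expect_blocked swapoff -a                 || failures=$((failures + 1))
    return "$failures"
}
```

Calling run_probes at the top of a throwaway overlay script prints one line per probe into the build log.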

The single behavioral regression: host process IDs are visible via /proc and ps -ef because systemd has no PrivatePID= directive. Sending signals to those processes is still blocked by the kernel's UID-mismatch check (l4d2-sandbox cannot signal processes owned by root or by any other UID). The leak is limited to information disclosure; the sandbox gains no ability to signal host processes.
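
The difference between seeing a process and being able to signal it can be checked with kill -0, which delivers no signal and only runs the kernel's existence and permission checks. A minimal sketch, with hypothetical function names:

```shell
#!/bin/bash
# Hypothetical helpers for checking visibility versus signal permission.

# pid_visible PID: succeeds when PID appears in /proc.
pid_visible() {
    [ -d "/proc/$1" ]
}

# can_signal PID: succeeds when the calling user is allowed to signal PID.
# Fails both when the process does not exist and when permission is denied.
can_signal() {
    kill -0 "$1" 2>/dev/null
}
```

Inside the sandbox, `pid_visible 1` is expected to succeed while `can_signal 1` is expected to fail, since PID 1 runs as root and the caller is l4d2-sandbox.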

Locked Decisions

  1. Replace the helper body wholesale. No bwrap invocation. systemd-run in service mode does both isolation and resource limits.
  2. Helper path, sudoers rule, ScriptBuilder API, and l4d2-sandbox UID are unchanged. The Python side (run_sandboxed_script, route handlers, tests) does not change.
  3. bubblewrap apt dependency dropped from deploy-test-server.sh.
  4. left4me.db file mode tightened to 0640 root:left4me at deploy time. This is a host-hygiene fix that is independent of the sandbox change but was surfaced by smoke testing — without it, any host user (and, transitively, the sandbox) could read the application database.
  5. TemporaryFileSystem=/var/lib is required. ProtectSystem=strict makes /var/lib/left4me read-only but visible; the only way to reliably hide its contents from the unit is to mask the parent with a tmpfs. The BindPaths=…/overlays/{id}:/overlay mount is unaffected because /overlay is at a different path.
  6. PrivatePID= is not configured. systemd has no such directive. ps -ef from inside the sandbox shows host processes. The kernel's UID-based signal restriction blocks any actual interaction with them. Acceptable for the current trust model.
  7. Walltime kill remains RuntimeMaxSec=3600. Same as v1.
  8. Network namespace remains shared with the host. No PrivateNetwork=. Scripts must reach Steam / l4d2center / GitHub / etc.
  9. SystemCallFilter=@system-service @network-io is the seccomp baseline. systemd's curated @system-service group is "everything a normal service does"; adding @network-io is explicit even though it overlaps. Build failures caused by a denied syscall are surfaced via journalctl and addressed on demand by adding that syscall, or the group that contains it, to the filter. (@process is already part of @system-service, so adding it would change nothing.)
  10. Single helper file replaces v1. Not adding a -v2 variant. The v1 implementation is removed in the same change.

Architecture

sudo helper
  └─ systemd-run --pipe --wait   (service mode, the default when --scope is not given)
       (transient .service unit, full hardening directives)
       └─ /bin/bash /script.sh

systemd-run in service mode:

  • Asks the system service manager, over the system bus, to create and start a transient service unit.
  • Applies all -p properties as the unit's exec context.
  • The service manager forks; the child sets up the unit's namespaces (mount and IPC; no user namespace, because PrivateUsers= is not set), drops privileges to User=l4d2-sandbox, applies the seccomp filter, and execve()s /bin/bash /script.sh.
  • --pipe connects the unit's stdin/stdout/stderr to the calling helper's stdio (so the existing run_command harness in ScriptBuilder continues to capture line-by-line).
  • --wait blocks until the unit terminates and propagates the exit code.
  • --collect removes the unit on exit even if it failed.
  • The cgroup carries the resource limits; the service manager enforces RuntimeMaxSec=3600 by terminating the unit once the limit is reached.
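
Because the limits live in the unit's cgroup, they can be read back from inside a running unit. The sketch below assumes a cgroup v2 host with the unified hierarchy mounted at /sys/fs/cgroup; the function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers for reading the cgroup v2 limits of the current unit.

# cgroup_path_from_line LINE: strip the "0::" prefix of a cgroup v2 entry
# from /proc/self/cgroup, leaving the cgroup path.
cgroup_path_from_line() {
    local line=$1
    printf '%s\n' "${line#0::}"
}

# read_limit CGROUP_DIR FILE: print one control file, e.g. memory.max.
read_limit() {
    cat "$1/$2"
}

# show_limits: print the limits of the cgroup the current shell runs in.
# With the helper's properties the values should read:
#   memory.max=4294967296 (MemoryMax=4G), memory.swap.max=0,
#   pids.max=512, cpu.max=200000 100000 (CPUQuota=200%).
show_limits() {
    local dir f
    dir=/sys/fs/cgroup$(cgroup_path_from_line "$(grep '^0::' /proc/self/cgroup)")
    for f in memory.max memory.swap.max pids.max cpu.max; do
        printf '%s=%s\n' "$f" "$(read_limit "$dir" "$f")"
    done
}
```

ProtectControlGroups=yes makes the cgroup hierarchy read-only for the unit, so reading these files should still work.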

Helper

deploy/files/usr/local/libexec/left4me/left4me-script-sandbox, mode 0755, owned root:

#!/bin/bash
set -euo pipefail
[[ $# -eq 2 ]] || { echo "usage: $0 <overlay_id> <script>" >&2; exit 64; }
OVERLAY_ID=$1; SCRIPT=$2
[[ "$OVERLAY_ID" =~ ^[0-9]+$ ]] || { echo "bad overlay id" >&2; exit 64; }
OVERLAY_DIR=/var/lib/left4me/overlays/$OVERLAY_ID
[[ -d $OVERLAY_DIR ]] || { echo "no overlay dir at $OVERLAY_DIR" >&2; exit 65; }
[[ -f $SCRIPT ]] || { echo "no script at $SCRIPT" >&2; exit 65; }

if [[ "${LEFT4ME_SCRIPT_SANDBOX_DRY_RUN:-}" == "1" ]]; then
    echo "DRY RUN: overlay_id=$OVERLAY_ID script=$SCRIPT overlay_dir=$OVERLAY_DIR"
    exit 0
fi

# Hand the overlay tree to the sandbox user so the unit can write to /overlay.
chown -R l4d2-sandbox:l4d2-sandbox "$OVERLAY_DIR"
chmod 0755 "$OVERLAY_DIR"

exec systemd-run --quiet --collect --wait --pipe \
    --unit="left4me-script-${OVERLAY_ID}-$$" \
    -p User=l4d2-sandbox -p Group=l4d2-sandbox \
    -p NoNewPrivileges=yes \
    -p ProtectSystem=strict -p ProtectHome=yes \
    -p PrivateTmp=yes -p PrivateDevices=yes -p PrivateIPC=yes \
    -p ProtectKernelTunables=yes -p ProtectKernelModules=yes \
    -p ProtectKernelLogs=yes -p ProtectControlGroups=yes \
    -p RestrictNamespaces=yes \
    -p RestrictAddressFamilies="AF_INET AF_INET6 AF_UNIX" \
    -p RestrictSUIDSGID=yes -p LockPersonality=yes \
    -p MemoryDenyWriteExecute=yes \
    -p SystemCallFilter="@system-service @network-io" \
    -p SystemCallArchitectures=native \
    -p CapabilityBoundingSet= -p AmbientCapabilities= \
    -p TemporaryFileSystem="/etc /var/lib" \
    -p BindReadOnlyPaths="/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives ${SCRIPT}:/script.sh" \
    -p BindPaths="${OVERLAY_DIR}:/overlay" \
    -p WorkingDirectory=/overlay \
    -p Environment="HOME=/tmp PATH=/usr/bin:/usr/sbin OVERLAY=/overlay" \
    -p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 \
    -p CPUQuota=200% -p RuntimeMaxSec=3600 \
    -- /bin/bash /script.sh
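
One possible hardening of the helper is not part of the approved design. BindReadOnlyPaths= takes a whitespace-separated list of source[:destination] entries, so a script path that is relative or that contains whitespace or a colon would likely be misparsed when interpolated as ${SCRIPT}:/script.sh. A guard along these lines could sit next to the existing argument checks; the function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical guard for paths interpolated into BindReadOnlyPaths= / BindPaths=.

# safe_bind_source PATH: succeed only for absolute paths that contain
# neither whitespace nor ':' (both are separators in the bind syntax).
safe_bind_source() {
    local path=$1
    case $path in
        /*) ;;                 # absolute: keep checking
        *)  return 1 ;;        # relative paths are rejected
    esac
    case $path in
        *[[:space:]]*|*:*) return 1 ;;
    esac
    return 0
}
```

In the helper this would read, for example, `safe_bind_source "$SCRIPT" || { echo "bad script path" >&2; exit 64; }`.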

Sudoers fragment

Unchanged from v1: left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox.

System user

Unchanged from v1: l4d2-sandbox (useradd --system --no-create-home --shell /usr/sbin/nologin).

Filesystem expectations

  • /var/lib/left4me must be mode 0711 (left4me-owned). Already provisioned by v1 deploy script.
  • /var/lib/left4me/left4me.db mode 0640 root:left4me. New — added by this change.
  • Overlay directory /var/lib/left4me/overlays/{id}/ chowned to l4d2-sandbox:l4d2-sandbox 0755 by the helper before each run. Unchanged from v1.
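
These expectations could be verified on the host after a deploy. A minimal sketch, assuming GNU stat; the function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical post-deploy checks for the filesystem expectations.

# mode_is PATH EXPECTED: succeed when PATH has exactly the octal mode EXPECTED.
# GNU stat's %a prints the permission bits in octal without a leading zero.
mode_is() {
    [ "$(stat -c '%a' "$1")" = "$2" ]
}

# owner_is PATH USER:GROUP: succeed when PATH is owned by USER and GROUP.
owner_is() {
    [ "$(stat -c '%U:%G' "$1")" = "$2" ]
}

# check_layout: verify the modes and ownership listed in this section.
check_layout() {
    mode_is  /var/lib/left4me 711 &&
    mode_is  /var/lib/left4me/left4me.db 640 &&
    owner_is /var/lib/left4me/left4me.db root:left4me
}
```

Running `check_layout || echo "layout drift"` as root fits the host smoke-test step of the plan.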

Build Lifecycle (unchanged from v1)

ScriptBuilder.build() writes the script to a 0644 tmpfile, exec's sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <tmpfile> via run_command, then runs _enforce_disk_budget. The helper's internal mechanism changes; the wrapper API is identical. Overlay.last_build_status is written by the job worker on completion.

Risks

  • systemd CVE landing in our directive set. Single-tool migration removes one isolation layer. Mitigated by uid drop + cgroup limits + NoNewPrivileges=yes (kernel-enforced state independent of namespace setup). The escape would be an unprivileged process with no filesystem isolation but still capped on resources; same severity envelope as a hypothetical bwrap CVE in v1. The trust model (registered users) makes a single isolation layer acceptable.
  • SystemCallFilter rejecting a syscall a user script unexpectedly needs. Symptom: build fails with SIGSYS. Diagnosis: journalctl --since "1 min ago" | grep SECCOMP. Resolution: widen the filter with the denied syscall or the group that contains it (for example +@privileged, if the script genuinely needs more than a normal service; @process is already included in @system-service). v1 had no syscall filter, so this is a new failure class.
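  • ProtectSystem=strict rejecting a write a script expects to succeed. The directive makes the file-system hierarchy read-only rather than hiding it, and /run stays read-only under it; inside the sandbox only /overlay and the private /tmp (from PrivateTmp=yes) are expected to be writable. As in v1, scripts should write only under /overlay and /tmp.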
  • Host PID visibility (no PrivatePID=). Information disclosure; not a privilege boundary.
  • MemoryDenyWriteExecute=yes blocking JITs. A script that launches node or another JIT runtime would fail because W+X mappings are blocked. None of the recipes the user has historically used (curl + tar + cp) needs a JIT; revisit if a real script trips this.
  • RestrictAddressFamilies blocking some download tools. curl, wget, git over https use AF_INET/AF_INET6; getent hosts uses AF_UNIX (nss). Smoke-tested as working. A script that wanted raw sockets (AF_PACKET) or netlink (AF_NETLINK) would fail; neither is plausible for build recipes.
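
For the SystemCallFilter risk, the journalctl diagnosis can be narrowed to the syscall numbers that were denied. This sketch assumes the denial shows up either as a line containing SECCOMP or as a kernel audit record of type 1326 (AUDIT_SECCOMP), and that the record carries a syscall=<number> field; the function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical filter for seccomp denials in log output.

# denied_syscalls: read log lines on stdin and print the distinct syscall
# numbers found in seccomp audit records, one per line, sorted numerically.
denied_syscalls() {
    grep -E 'SECCOMP|type=1326' \
        | grep -o 'syscall=[0-9]*' \
        | cut -d= -f2 \
        | sort -un
}
```

Usage: `journalctl --since "1 min ago" | denied_syscalls`. The numbers are architecture-specific; `systemd-analyze syscall-filter` lists which named syscalls each group contains, which helps pick the group to add.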

Out Of Scope

  • Per-overlay UID isolation. Cross-script-overlay write access is still possible after a hypothetical sandbox bypass (every script overlay's dir is owned by l4d2-sandbox). A per-overlay UID pool was discussed as the next-step hardening but is deferred.
  • PrivateNetwork= / egress filtering. No change from v1.
  • systemd-nspawn or LXC. Researched; both are heavier than necessary for transient bash builds.
  • PrivatePID= workaround via unshare. Not pursued — would require re-introducing a wrapper inside the unit, defeating the simplification.

Implementation Boundaries

  • Web app code is unchanged. ScriptBuilder, run_sandboxed_script, route handlers, models, migrations — all untouched. The migration is purely in the deployed helper script and adjacent deploy artifacts.
  • bubblewrap apt package removed. After this change no production path invokes bwrap, and the deploy script no longer installs it.
  • No new systemd unit files. Each invocation is a transient unit named left4me-script-{overlay_id}-{pid}.service.
  • No application-level dependency changes. No new Python packages, no template changes, no DB migration.
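
Because each run gets a predictable transient unit name, the logs of a past build can be looked up even though --collect removes the unit itself. A small sketch of the naming scheme; the function name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper that mirrors the helper's --unit= naming scheme.

# unit_name OVERLAY_ID PID: print the transient unit name for one invocation.
unit_name() {
    printf 'left4me-script-%s-%s.service\n' "$1" "$2"
}
```

For example, `journalctl -u "$(unit_name 42 1234)"` would show the journal entries for overlay 42 run by helper PID 1234, and `systemctl list-units 'left4me-script-*'` lists builds that are currently running.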