Spec captures the v2 architecture (systemd-run service mode with full
hardening directives, no bwrap), the two surfaces in scope (helper
rewrite + bubblewrap dep removal + left4me.db mode tightening), and the
gotchas surfaced by smoke-testing the prototype on ckn@10.0.4.128:
- ProtectSystem=strict makes /var/lib/left4me visible (not invisible);
must add TemporaryFileSystem=/var/lib to mask it.
- Script bind via BindReadOnlyPaths uses ${SCRIPT}:/script.sh syntax.
- No PrivatePID= directive in systemd; host PIDs visible via /proc.
Information disclosure only — kernel UID-mismatch blocks signals.
Plan breaks the migration into 4 tasks (helper rewrite, deploy-script
deps + DB mode, host smoke-test, drift sweep) with explicit rollback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
L4D2 Script Sandbox v2 — Systemd-Only
Goal: Replace the bwrap-based left4me-script-sandbox helper with one that uses systemd-run in service-unit mode alone. Drop bubblewrap as a system dependency. Gain capability bounding, seccomp filtering, kernel-tunable / -module / -log protection, address-family restriction, LockPersonality, MemoryDenyWriteExecute, and RestrictSUIDSGID — none of which the bwrap+systemd-run-scope composition could provide. Lose PID-namespace isolation (no PrivatePID= directive in systemd) — judged acceptable for the current trust model.
Approval status: User-approved 2026-05-08 after smoke testing on ckn@10.0.4.128.
Context
The v1 sandbox (see 2026-05-08-l4d2-script-overlays-design.md) layers bubblewrap for namespacing inside systemd-run --scope for cgroup limits. That works, but --scope units register an existing process tree and so cannot accept service-only directives like NoNewPrivileges=, ProtectSystem=, SystemCallFilter=, CapabilityBoundingSet=, etc. Smoke testing on the deployed host confirmed bwrap covers mount/PID/IPC/UTS namespacing well, but leaves capability bounding, seccomp, and kernel-surface protection unenforced.
A switch to systemd-run in default (transient service) mode unlocks the full hardening surface. Smoke testing of a v2 prototype against the deployed test host confirmed:
- Every isolation invariant the bwrap version provides (filesystem masking, UID drop, network reachability, /overlay RW bind, host-side l4d2-sandbox ownership, host secret hiding) is reproducible with systemd directives.
- All cgroup limits (memory.max=4G, memory.swap.max=0, pids.max=512, cpu.max=200%, RuntimeMaxSec=3600) apply identically. MemoryError fires at the 4 GB cap (cgroup-enforced).
- The wipe path (find /overlay -mindepth 1 -delete) succeeds.
- Hardening directives the v1 design couldn't express enforce real syscall blocks: unshare(CLONE_NEWUSER), mount(2), personality(2), bpf(2), swapoff(2), and sysctl -w are all blocked.
The single behavioral regression: host process IDs are visible via /proc and ps -ef because systemd has no PrivatePID= directive. Sending signals to those processes is still blocked by the kernel's UID-mismatch check (l4d2-sandbox cannot signal processes owned by root or by any other UID). Information disclosure is the only leak; the signal restriction remains intact.
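The smoke-test results above can be approximated with a small probe harness run from inside the sandboxed script. This is a minimal sketch, not the prototype's actual test script: the names probe, can_signal and run_probes are illustrative, and the commands chosen to exercise each syscall are assumptions.

```shell
# probe LABEL CMD...: run CMD quietly and report whether it was allowed.
# A command killed by the seccomp filter also exits non-zero, so it
# is reported as blocked as well.
probe() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "ALLOWED $label"
    else
        echo "blocked $label"
    fi
}

# can_signal PID: kill -0 delivers no signal; it only asks the kernel
# whether the caller would be permitted to signal PID.
can_signal() {
    if kill -0 "$1" 2>/dev/null; then echo yes; else echo no; fi
}

# Hypothetical probe set. Call this only from inside the sandbox:
# on an unrestricted root shell these commands would really take effect.
run_probes() {
    probe "unshare(CLONE_NEWUSER)" unshare --user true
    probe "mount(2)" mount -t tmpfs none /mnt
    echo "can signal PID 1: $(can_signal 1)"
}
```

Inside the sandbox, every probe line would be expected to read "blocked" and the PID 1 check to print "no", even though PID 1 is visible in /proc.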
Locked Decisions
- Replace the helper body wholesale. No bwrap invocation. systemd-run in service mode does both isolation and resource limits.
- Helper path, sudoers rule, ScriptBuilder API, and l4d2-sandbox UID are unchanged. The Python side (run_sandboxed_script, route handlers, tests) does not change.
- bubblewrap apt dependency dropped from deploy-test-server.sh.
- left4me.db file mode tightened to 0640 root:left4me at deploy time. This is a host-hygiene fix that is independent of the sandbox change but was surfaced by smoke testing; without it, any host user (and, transitively, the sandbox) could read the application database.
- TemporaryFileSystem=/var/lib is required. ProtectSystem=strict makes /var/lib/left4me read-only but visible; the only way to reliably hide its contents from the unit is to mask the parent with a tmpfs. The BindPaths=…/overlays/{id}:/overlay mount is unaffected because /overlay is at a different path.
- PrivatePID= is not configured. systemd has no such directive. ps -ef from inside the sandbox shows host processes. The kernel's UID-based signal restriction blocks any actual interaction with them. Acceptable for the current trust model.
- Walltime kill remains RuntimeMaxSec=3600. Same as v1.
- Network namespace remains shared with the host. No PrivateNetwork=. Scripts must reach Steam / l4d2center / GitHub / etc.
- SystemCallFilter=@system-service @network-io is the seccomp baseline. systemd's curated @system-service group is "everything a normal service does"; adding @network-io is explicit even though it overlaps. Build failures revealing missing syscall classes are surfaced via journalctl and addressed by widening the filter (@process, etc.) on demand.
- Single helper file replaces v1. Not adding a -v2 variant. The v1 implementation is removed in the same change.
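The TemporaryFileSystem=/var/lib decision can be checked from inside the unit by confirming that the masked directory lists no entries. The following is a minimal sketch; is_masked is a hypothetical helper name, not part of the spec.

```shell
# is_masked DIR: succeeds only if DIR exists and lists no entries, which is
# what a fresh tmpfs mounted over it looks like from inside the unit.
# Caveat: an unreadable directory also passes, since nothing can be listed.
is_masked() {
    [ -d "$1" ] && [ -z "$(ls -A "$1" 2>/dev/null)" ]
}
```

A sandboxed script could run `is_masked /var/lib && echo "host state hidden"`. The check is not meaningful for /etc, because the helper binds several files into that tmpfs.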
Architecture
sudo helper
 └─ systemd-run (service mode, the default) --pipe --wait
      (transient .service unit, full hardening directives)
      └─ /bin/bash /script.sh
systemd-run in service mode:
- Opens a transient service unit on the system bus.
- Applies all -p properties as the unit's exec context.
- Forks; the child sets up the unit's namespaces (mount and IPC; no user namespace is requested by the directive set), drops privileges to User=l4d2-sandbox, applies the seccomp filter, and execve()s /bin/bash /script.sh.
- --pipe connects the unit's stdin/stdout/stderr to the calling helper's stdio (so the existing run_command harness in ScriptBuilder continues to capture line-by-line).
- --wait blocks until the unit terminates and propagates the exit code.
- --collect removes the unit on exit even if it failed.
- The cgroup carries the resource limits; systemd itself enforces RuntimeMaxSec=3600 by terminating the unit when the limit is reached.
Helper
deploy/files/usr/local/libexec/left4me/left4me-script-sandbox, mode 0755, owned by root:
#!/bin/bash
set -euo pipefail

# Validate arguments: a numeric overlay id and an existing script file.
[[ $# -eq 2 ]] || { echo "usage: $0 <overlay_id> <script>" >&2; exit 64; }
OVERLAY_ID=$1; SCRIPT=$2
[[ "$OVERLAY_ID" =~ ^[0-9]+$ ]] || { echo "bad overlay id" >&2; exit 64; }
OVERLAY_DIR=/var/lib/left4me/overlays/$OVERLAY_ID
[[ -d $OVERLAY_DIR ]] || { echo "no overlay dir at $OVERLAY_DIR" >&2; exit 65; }
[[ -f $SCRIPT ]] || { echo "no script at $SCRIPT" >&2; exit 65; }

# Dry-run mode: report the resolved inputs without touching the host.
if [[ "${LEFT4ME_SCRIPT_SANDBOX_DRY_RUN:-}" == "1" ]]; then
    echo "DRY RUN: overlay_id=$OVERLAY_ID script=$SCRIPT overlay_dir=$OVERLAY_DIR"
    exit 0
fi

# Hand the overlay directory to the sandbox user before each run.
chown -R l4d2-sandbox:l4d2-sandbox "$OVERLAY_DIR"
chmod 0755 "$OVERLAY_DIR"

# Run the script as a transient service unit. The -p properties cover, in order:
# identity, filesystem and kernel hardening, seccomp and capabilities,
# mounts, environment, and cgroup limits.
exec systemd-run --quiet --collect --wait --pipe \
    --unit="left4me-script-${OVERLAY_ID}-$$" \
    -p User=l4d2-sandbox -p Group=l4d2-sandbox \
    -p NoNewPrivileges=yes \
    -p ProtectSystem=strict -p ProtectHome=yes \
    -p PrivateTmp=yes -p PrivateDevices=yes -p PrivateIPC=yes \
    -p ProtectKernelTunables=yes -p ProtectKernelModules=yes \
    -p ProtectKernelLogs=yes -p ProtectControlGroups=yes \
    -p RestrictNamespaces=yes \
    -p RestrictAddressFamilies="AF_INET AF_INET6 AF_UNIX" \
    -p RestrictSUIDSGID=yes -p LockPersonality=yes \
    -p MemoryDenyWriteExecute=yes \
    -p SystemCallFilter="@system-service @network-io" \
    -p SystemCallArchitectures=native \
    -p CapabilityBoundingSet= -p AmbientCapabilities= \
    -p TemporaryFileSystem="/etc /var/lib" \
    -p BindReadOnlyPaths="/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives ${SCRIPT}:/script.sh" \
    -p BindPaths="${OVERLAY_DIR}:/overlay" \
    -p WorkingDirectory=/overlay \
    -p Environment="HOME=/tmp PATH=/usr/bin:/usr/sbin OVERLAY=/overlay" \
    -p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 \
    -p CPUQuota=200% -p RuntimeMaxSec=3600 \
    -- /bin/bash /script.sh
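A caller may want to tell the helper's own validation failures apart from failures of the sandboxed script. The sketch below maps the exit statuses used in the helper above (64 and 65) to short descriptions. describe_helper_exit is a hypothetical name, and the mapping is only a hint: a user script can itself exit with 64 or 65, and --wait would propagate that status unchanged.

```shell
# describe_helper_exit STATUS: describe an exit status returned by
# left4me-script-sandbox. 64 and 65 come from the helper's argument checks;
# any other non-zero status comes from the transient unit (or systemd-run).
describe_helper_exit() {
    case "$1" in
        0)  echo "ok" ;;
        64) echo "usage error or bad overlay id" ;;
        65) echo "overlay directory or script missing" ;;
        *)  echo "script or unit failed (status $1)" ;;
    esac
}
```

For example, `describe_helper_exit 65` prints "overlay directory or script missing".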
Sudoers fragment
Unchanged from v1: left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox.
System user
Unchanged from v1: l4d2-sandbox (useradd --system --no-create-home --shell /usr/sbin/nologin).
Filesystem expectations
- /var/lib/left4me must be mode 0711 (left4me-owned). Already provisioned by v1 deploy script.
- /var/lib/left4me/left4me.db mode 0640 root:left4me. New in this change.
- Overlay directory /var/lib/left4me/overlays/{id}/ chowned to l4d2-sandbox:l4d2-sandbox 0755 by the helper before each run. Unchanged from v1.
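These expectations lend themselves to a quick post-deploy check. The following is a minimal sketch that compares the permission string reported by ls against the expected one. mode_of and expect_mode are hypothetical names, and ownership is not checked here.

```shell
# mode_of PATH: print the ten-character type and permission string from ls,
# e.g. "-rw-r-----" for a regular file with mode 0640.
mode_of() {
    ls -ld "$1" | cut -c1-10
}

# expect_mode PATH WANT: report whether PATH carries the expected string.
expect_mode() {
    got=$(mode_of "$1")
    if [ "$got" = "$2" ]; then
        echo "ok $1"
    else
        echo "MISMATCH $1: $got (want $2)"
    fi
}
```

On the deployed host this could be run as `expect_mode /var/lib/left4me drwx--x--x` (0711) and `expect_mode /var/lib/left4me/left4me.db -rw-r-----` (0640).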
Build Lifecycle (unchanged from v1)
ScriptBuilder.build() writes the script to a 0644 tmpfile, exec's sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <tmpfile> via run_command, then runs _enforce_disk_budget. The helper's internal mechanism changes; the wrapper API is identical. Overlay.last_build_status is written by the job worker on completion.
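The wrapper flow can be illustrated with a shell stand-in. The real implementation is the Python ScriptBuilder, so everything here is an assumption made for illustration: build_overlay and the HELPER override are hypothetical, and the disk-budget step is omitted. HELPER exists only so the flow can be exercised with a stub instead of sudo.

```shell
# build_overlay ID: read a script on stdin, stage it in a 0644 temp file,
# run the sandbox helper on it, and clean up. Returns the helper's status.
build_overlay() {
    id=$1
    tmp=$(mktemp) || return 1
    cat >"$tmp"
    chmod 0644 "$tmp"
    rc=0
    # Left unquoted on purpose so the default expands to several words.
    ${HELPER:-sudo -n /usr/local/libexec/left4me/left4me-script-sandbox} "$id" "$tmp" || rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

Setting HELPER=echo and running `printf 'echo hi\n' | build_overlay 7` prints the overlay id and the staged temp-file path that the helper would have received.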
Risks
- systemd CVE landing in our directive set. Single-tool migration removes one isolation layer. Mitigated by UID drop + cgroup limits + NoNewPrivileges=yes (kernel-enforced state independent of namespace setup). The escape would be an unprivileged process with no filesystem isolation but still capped on resources; same severity envelope as a hypothetical bwrap CVE in v1. The trust model (registered users) makes a single isolation layer acceptable.
- SystemCallFilter rejecting a syscall a user script unexpectedly needs. Symptom: build fails with SIGSYS. Diagnosis: journalctl --since "1 min ago" | grep SECCOMP. Resolution: widen the filter (+@process, +@privileged if the script genuinely needs more than a normal service). v1 had no syscall filter, so this is a new failure class.
- ProtectSystem=strict masking something a script wanted to write to. Only /overlay, /tmp, /run are writable inside the sandbox. Same as v1.
- Host PID visibility (no PrivatePID=). Information disclosure; not a privilege boundary.
- MemoryDenyWriteExecute=yes blocking JITs. A script that launches node or another JIT runtime would fail because W+X mappings are blocked. None of the recipes the user has historically used (curl + tar + cp) needs a JIT; revisit if a real script trips this.
- RestrictAddressFamilies blocking some download tools. curl, wget, and git over https use AF_INET/AF_INET6; getent hosts uses AF_UNIX (nss). Smoke-tested as working. A script that wanted raw sockets (AF_PACKET) or netlink (AF_NETLINK) would fail; neither is plausible for build recipes.
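If the syscall filter does need widening, keeping the baseline in one place makes the change easy to review. This is a minimal sketch of composing the SystemCallFilter= value; syscall_filter and BASE_FILTER are hypothetical names, and the helper shown earlier hard-codes the value instead.

```shell
# Seccomp baseline from the Locked Decisions section.
BASE_FILTER="@system-service @network-io"

# syscall_filter [GROUP...]: print the baseline followed by any extra
# systemd syscall groups passed as arguments.
syscall_filter() {
    filter=$BASE_FILTER
    for group in "$@"; do
        filter="$filter $group"
    done
    echo "$filter"
}
```

The helper could then pass `-p SystemCallFilter="$(syscall_filter @process)"` when a build is known to need the extra group.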
Out Of Scope
- Per-overlay UID isolation. Cross-script-overlay write access is still possible after a hypothetical sandbox bypass (every script overlay's dir is owned by l4d2-sandbox). A per-overlay UID pool was discussed as the next-step hardening but is deferred.
- PrivateNetwork= / egress filtering. No change from v1.
- systemd-nspawn or LXC. Researched; both are heavier than necessary for transient bash builds.
- PrivatePID= workaround via unshare. Not pursued: it would require re-introducing a wrapper inside the unit, defeating the simplification.
Implementation Boundaries
- Web app code is unchanged. ScriptBuilder, run_sandboxed_script, route handlers, models, and migrations are all untouched. The migration is purely in the deployed helper script and adjacent deploy artifacts.
- bubblewrap apt package removed. No production code path invokes it after this change; deploy script updated.
- No new systemd unit files. Each invocation is a transient unit named left4me-script-{overlay_id}-{pid}.service.
- No application-level dependency changes. No new Python packages, no template changes, no DB migration.
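Because the unit name is derived from the overlay id and the helper's PID, it can be reconstructed when inspecting logs. The sketch below assumes the caller knows both values; unit_name is a hypothetical helper.

```shell
# unit_name OVERLAY_ID PID: print the transient unit name the helper would
# have used for a given overlay id and helper PID.
unit_name() {
    echo "left4me-script-$1-$2.service"
}
```

For example, `journalctl -u "$(unit_name 42 12345)"` would show the journal entries for that run, which remain available after --collect has removed the unit itself.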