left4me/docs/superpowers/specs/2026-05-08-l4d2-script-sandbox-v2-systemd.md
mwiegand efaaf84cd9
docs(specs): script sandbox v2 — systemd-only design + plan
Spec captures the v2 architecture (systemd-run service mode with full
hardening directives, no bwrap), the two surfaces in scope (helper
rewrite + bubblewrap dep removal + left4me.db mode tightening), and the
gotchas surfaced by smoke-testing the prototype on ckn@10.0.4.128:
- ProtectSystem=strict makes /var/lib/left4me visible (not invisible);
  must add TemporaryFileSystem=/var/lib to mask it.
- Script bind via BindReadOnlyPaths uses ${SCRIPT}:/script.sh syntax.
- No PrivatePID= directive in systemd; host PIDs visible via /proc.
  Information disclosure only — kernel UID-mismatch blocks signals.

Plan breaks the migration into 4 tasks (helper rewrite, deploy-script
deps + DB mode, host smoke-test, drift sweep) with explicit rollback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:46:13 +02:00


# L4D2 Script Sandbox v2 — Systemd-Only
**Goal:** Replace the bwrap-based `left4me-script-sandbox` helper with one that uses `systemd-run` in **service-unit mode** alone. Drop `bubblewrap` as a system dependency. Gain capability bounding, seccomp filtering, kernel-tunable / -module / -log protection, address-family restriction, `LockPersonality`, `MemoryDenyWriteExecute`, and `RestrictSUIDSGID` — none of which the bwrap+systemd-run-scope composition could provide. Lose PID-namespace isolation (no `PrivatePID=` directive in systemd) — judged acceptable for the current trust model.
**Approval status:** User-approved 2026-05-08 after smoke testing on `ckn@10.0.4.128`.
## Context
The v1 sandbox (see `2026-05-08-l4d2-script-overlays-design.md`) layers `bubblewrap` for namespacing inside `systemd-run --scope` for cgroup limits. That works, but `--scope` units register an existing process tree and so cannot accept service-only directives like `NoNewPrivileges=`, `ProtectSystem=`, `SystemCallFilter=`, `CapabilityBoundingSet=`, etc. Smoke testing on the deployed host confirmed bwrap covers mount/PID/IPC/UTS namespacing well, but leaves capability bounding, seccomp, and kernel-surface protection unenforced.
A switch to `systemd-run` in default (transient service) mode unlocks the full hardening surface. Smoke testing of a v2 prototype against the deployed test host confirmed:
- Every isolation invariant the bwrap version provides (filesystem masking, UID drop, network reachability, `/overlay` RW bind, host-side `l4d2-sandbox` ownership, host secret hiding) is reproducible with systemd directives.
- All cgroup limits (`memory.max=4G`, `memory.swap.max=0`, `pids.max=512`, `cpu.max=200%`) and the `RuntimeMaxSec=3600` walltime cap apply identically.
- `MemoryError` fires at the 4 GB cap (cgroup-enforced).
- The wipe path (`find /overlay -mindepth 1 -delete`) succeeds.
- Hardening directives the v1 design couldn't express enforce real syscall blocks: `unshare(CLONE_NEWUSER)`, `mount(2)`, `personality(2)`, `bpf(2)`, `swapoff(2)`, `sysctl -w` are all blocked.
The single behavioral regression: host process IDs are visible via `/proc` and `ps -ef` because systemd has no `PrivatePID=` directive. Sending signals to those processes is still blocked by the kernel's UID-mismatch check (`l4d2-sandbox` cannot signal `root`-owned processes). Information disclosure is the only leak; the signal restriction remains intact.
## Locked Decisions
1. **Replace the helper body wholesale.** No `bwrap` invocation. `systemd-run` in service mode does both isolation and resource limits.
2. **Helper path, sudoers rule, ScriptBuilder API, and `l4d2-sandbox` UID are unchanged.** The Python side (`run_sandboxed_script`, route handlers, tests) does not change.
3. **`bubblewrap` apt dependency dropped from `deploy-test-server.sh`.**
4. **`left4me.db` file mode tightened to 0640 root:left4me at deploy time.** This is a host-hygiene fix that is independent of the sandbox change but was surfaced by smoke testing — without it, *any* host user (and, transitively, the sandbox) could read the application database.
5. **`TemporaryFileSystem=/var/lib` is required.** `ProtectSystem=strict` makes `/var/lib/left4me` read-only but visible; the only way to reliably hide its contents from the unit is to mask the parent with a tmpfs. The `BindPaths=…/overlays/{id}:/overlay` mount is unaffected because `/overlay` is at a different path.
6. **`PrivatePID=` is not configured.** systemd has no such directive. `ps -ef` from inside the sandbox shows host processes. The kernel's UID-based signal restriction blocks any actual interaction with them. Acceptable for the current trust model.
7. **Walltime kill remains `RuntimeMaxSec=3600`.** Same as v1.
8. **Network namespace remains shared with the host.** No `PrivateNetwork=`. Scripts must reach Steam / l4d2center / GitHub / etc.
9. **`SystemCallFilter=@system-service @network-io`** is the seccomp baseline. systemd's curated `@system-service` group is "everything a normal service does"; adding `@network-io` is explicit even though it overlaps. Build failures revealing missing syscall classes are surfaced via `journalctl` and addressed by widening the filter (`@process`, etc.) on demand.
10. **Single helper file replaces v1.** Not adding a `-v2` variant. The v1 implementation is removed in the same change.
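
Decision 5 can be spot-checked from inside the unit. The following is a minimal sketch, assuming a hypothetical `is_masked` function (not part of the helper) that treats a directory as masked when it is missing, unreadable, or has no entries:

```bash
#!/bin/bash
# Hypothetical smoke-check function; not part of the shipped sandbox helper.
# Succeeds when the directory is absent, unreadable, or contains no entries,
# which is what a tmpfs mask over /var/lib should look like from inside the unit.
is_masked() {
    local dir=$1
    [[ -d $dir ]] || return 0
    # GNU find prints the first entry and quits; empty output means nothing is visible.
    [[ -z "$(find "$dir" -mindepth 1 -maxdepth 1 -print -quit 2>/dev/null)" ]]
}
```

Run as part of a test script inside the sandbox, `is_masked /var/lib` would be expected to succeed when the `TemporaryFileSystem=` mask is in place, and to fail if `/var/lib/left4me` were still visible.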
## Architecture
```text
sudo helper
└─ systemd-run --pipe --wait   (service mode is the default; no --scope)
   (transient .service unit, full hardening directives)
   └─ /bin/bash /script.sh
```
systemd-run in service mode:
- Opens a transient service unit on the system bus.
- Applies all `-p` properties as the unit's exec context.
- The service manager forks; the child sets up the unit's namespaces (mount and IPC; no user namespace is configured), drops privileges to `User=l4d2-sandbox`, applies the seccomp filter, and `execve()`s `/bin/bash /script.sh`.
- `--pipe` connects the unit's stdin/stdout/stderr to the calling helper's stdio (so the existing `run_command` harness in `ScriptBuilder` continues to capture line-by-line).
- `--wait` blocks until the unit terminates and propagates the exit code.
- `--collect` removes the unit on exit even if it failed.
- The cgroup carries the resource limits; systemd enforces `RuntimeMaxSec=3600` by terminating the unit once the runtime limit is reached.
### Helper
`deploy/files/usr/local/libexec/left4me/left4me-script-sandbox`, mode 0755, owned root:
```bash
#!/bin/bash
# left4me-script-sandbox: run one overlay build script inside a transient,
# hardened systemd service unit.
set -euo pipefail

# Argument validation. The numeric check keeps OVERLAY_DIR under overlays/.
[[ $# -eq 2 ]] || { echo "usage: $0 <overlay_id> <script>" >&2; exit 64; }
OVERLAY_ID=$1; SCRIPT=$2
[[ "$OVERLAY_ID" =~ ^[0-9]+$ ]] || { echo "bad overlay id" >&2; exit 64; }
OVERLAY_DIR=/var/lib/left4me/overlays/$OVERLAY_ID
[[ -d $OVERLAY_DIR ]] || { echo "no overlay dir at $OVERLAY_DIR" >&2; exit 65; }
[[ -f $SCRIPT ]] || { echo "no script at $SCRIPT" >&2; exit 65; }

# Dry-run mode: report the resolved inputs and exit before changing anything.
if [[ "${LEFT4ME_SCRIPT_SANDBOX_DRY_RUN:-}" == "1" ]]; then
    echo "DRY RUN: overlay_id=$OVERLAY_ID script=$SCRIPT overlay_dir=$OVERLAY_DIR"
    exit 0
fi

# The sandbox user must own the overlay tree it writes to.
chown -R l4d2-sandbox:l4d2-sandbox "$OVERLAY_DIR"
chmod 0755 "$OVERLAY_DIR"

# Service mode (the systemd-run default) accepts the full hardening set.
# /etc and /var/lib are masked with tmpfs; only the listed paths are bound back in.
exec systemd-run --quiet --collect --wait --pipe \
    --unit="left4me-script-${OVERLAY_ID}-$$" \
    -p User=l4d2-sandbox -p Group=l4d2-sandbox \
    -p NoNewPrivileges=yes \
    -p ProtectSystem=strict -p ProtectHome=yes \
    -p PrivateTmp=yes -p PrivateDevices=yes -p PrivateIPC=yes \
    -p ProtectKernelTunables=yes -p ProtectKernelModules=yes \
    -p ProtectKernelLogs=yes -p ProtectControlGroups=yes \
    -p RestrictNamespaces=yes \
    -p RestrictAddressFamilies="AF_INET AF_INET6 AF_UNIX" \
    -p RestrictSUIDSGID=yes -p LockPersonality=yes \
    -p MemoryDenyWriteExecute=yes \
    -p SystemCallFilter="@system-service @network-io" \
    -p SystemCallArchitectures=native \
    -p CapabilityBoundingSet= -p AmbientCapabilities= \
    -p TemporaryFileSystem="/etc /var/lib" \
    -p BindReadOnlyPaths="/etc/resolv.conf /etc/ssl /etc/ca-certificates /etc/nsswitch.conf /etc/alternatives ${SCRIPT}:/script.sh" \
    -p BindPaths="${OVERLAY_DIR}:/overlay" \
    -p WorkingDirectory=/overlay \
    -p Environment="HOME=/tmp PATH=/usr/bin:/usr/sbin OVERLAY=/overlay" \
    -p MemoryMax=4G -p MemorySwapMax=0 -p TasksMax=512 \
    -p CPUQuota=200% -p RuntimeMaxSec=3600 \
    -- /bin/bash /script.sh
```
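
The numeric check on the overlay ID is what keeps `OVERLAY_DIR` inside `/var/lib/left4me/overlays`. The sketch below shows that check in isolation. The function name `valid_overlay_id` and the sample inputs are illustrative and do not appear in the helper:

```bash
#!/bin/bash
# Illustrative only: the same regex the helper applies to its first argument.
valid_overlay_id() {
    [[ "$1" =~ ^[0-9]+$ ]]
}

# Try a plain ID, a traversal attempt, an ID with a space, and an empty string.
for candidate in 42 ../etc "7 8" ""; do
    if valid_overlay_id "$candidate"; then
        echo "accept: '$candidate'"
    else
        echo "reject: '$candidate'"
    fi
done
```

Of the four sample inputs, only `42` passes. Anything containing a slash, a space, or no digits at all is rejected before a path is ever built from it.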
### Sudoers fragment
Unchanged from v1: `left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-script-sandbox`.
### System user
Unchanged from v1: `l4d2-sandbox` (`useradd --system --no-create-home --shell /usr/sbin/nologin`).
### Filesystem expectations
- `/var/lib/left4me` must be mode 0711 (left4me-owned). Already provisioned by v1 deploy script.
- `/var/lib/left4me/left4me.db` mode 0640 root:left4me. **New** — added by this change.
- Overlay directory `/var/lib/left4me/overlays/{id}/` chowned to `l4d2-sandbox:l4d2-sandbox` 0755 by the helper before each run. Unchanged from v1.
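
These expectations lend themselves to a post-deploy check. The following is a rough sketch, assuming GNU `stat` and a hypothetical `check_mode` function that is not part of the deploy script:

```bash
#!/bin/bash
# Hypothetical post-deploy check; the function name is an assumption.
# Compares a path's octal permission bits (GNU stat -c '%a') with the expected
# value, reports a mismatch on stderr, and returns non-zero on failure.
check_mode() {
    local path=$1 want=$2 got
    got=$(stat -c '%a' "$path") || return 2
    if [[ $got != "$want" ]]; then
        echo "mode mismatch: $path is $got, want $want" >&2
        return 1
    fi
}

# Example (on the deployed host):
#   check_mode /var/lib/left4me 711
#   check_mode /var/lib/left4me/left4me.db 640
```

The sketch covers permission bits only. Ownership (`root:left4me` for the database) would need a separate `stat -c '%U:%G'` comparison.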
## Build Lifecycle (unchanged from v1)
`ScriptBuilder.build()` writes the script to a 0644 tmpfile, exec's `sudo -n /usr/local/libexec/left4me/left4me-script-sandbox <id> <tmpfile>` via `run_command`, then runs `_enforce_disk_budget`. The helper's internal mechanism changes; the wrapper API is identical. `Overlay.last_build_status` is written by the job worker on completion.
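
For manual testing outside the web app, the same flow can be approximated from a shell. This is a sketch only: `stage_script` is a hypothetical stand-in for the Python-side tmpfile step, not code from `ScriptBuilder`. The 0644 mode matters because the script is later read by `bash` running as `l4d2-sandbox`.

```bash
#!/bin/bash
# Shell stand-in for the Python-side flow; stage_script is a hypothetical name.
# Reads the script body from stdin, stores it in a 0644 temp file, and prints the path.
stage_script() {
    local tmp
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    chmod 0644 "$tmp"
    printf '%s\n' "$tmp"
}

# Example (requires the sudoers rule and helper from this spec; 42 is a made-up overlay ID):
#   script=$(printf 'echo hello > "$OVERLAY/marker"\n' | stage_script)
#   sudo -n /usr/local/libexec/left4me/left4me-script-sandbox 42 "$script"
```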
## Risks
- **systemd CVE landing in our directive set.** Single-tool migration removes one isolation layer. Mitigated by UID drop + cgroup limits + `NoNewPrivileges=yes` (kernel-enforced state independent of namespace setup). The escape would be an unprivileged process with no filesystem isolation but still capped on resources; same severity envelope as a hypothetical bwrap CVE in v1. The trust model (registered users) makes a single isolation layer acceptable.
- **`SystemCallFilter` rejecting a syscall a user script unexpectedly needs.** Symptom: build fails with SIGSYS. Diagnosis: `journalctl --since "1 min ago" | grep SECCOMP`. Resolution: widen the filter by appending further groups to the directive (`@process`, or `@privileged` if the script genuinely needs more than a normal service). v1 had no syscall filter, so this is a new failure class.
- **`ProtectSystem=strict` masking something a script wanted to write to.** Only `/overlay`, `/tmp`, `/run` are writable inside the sandbox. Same as v1.
- **Host PID visibility (no `PrivatePID=`).** Information disclosure; not a privilege boundary.
- **`MemoryDenyWriteExecute=yes` blocking JITs.** A script that launches `node` or another JIT runtime would fail because W+X mappings are blocked. None of the recipes the user has historically used (curl + tar + cp) needs a JIT; revisit if a real script trips this.
- **`RestrictAddressFamilies` blocking some download tools.** `curl`, `wget`, and `git` over HTTPS use `AF_INET`/`AF_INET6`; `getent hosts` uses `AF_UNIX` (nss). Smoke-tested as working. A script that wanted raw sockets (`AF_PACKET`) or netlink (`AF_NETLINK`) would fail; neither is plausible for build recipes.
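
If the seccomp filter does need widening, one way to keep the baseline and the additions in one place is a small builder for the property value. This is a sketch under assumptions: `syscall_filter_prop` is a hypothetical function, and the current helper hardcodes the value instead.

```bash
#!/bin/bash
# Hypothetical function: builds the SystemCallFilter= property from the spec's
# baseline groups plus any extra syscall groups passed as arguments.
syscall_filter_prop() {
    local groups=("@system-service" "@network-io" "$@")
    # "${groups[*]}" joins the elements with spaces (the default IFS).
    printf 'SystemCallFilter=%s\n' "${groups[*]}"
}

# Example: pass the result to systemd-run as
#   -p "$(syscall_filter_prop @privileged)"
```

Called with no arguments, it reproduces the baseline from Locked Decision 9, so the default behavior would be unchanged.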
## Out Of Scope
- **Per-overlay UID isolation.** Cross-script-overlay write access is still possible after a hypothetical sandbox bypass (every script overlay's dir is owned by `l4d2-sandbox`). A per-overlay UID pool was discussed as the next-step hardening but is deferred.
- **`PrivateNetwork=` / egress filtering.** No change from v1.
- **systemd-nspawn or LXC.** Researched; both are heavier than necessary for transient bash builds.
- **`PrivatePID=` workaround via `unshare`.** Not pursued — would require re-introducing a wrapper inside the unit, defeating the simplification.
## Implementation Boundaries
- **Web app code is unchanged.** `ScriptBuilder`, `run_sandboxed_script`, route handlers, models, migrations — all untouched. The migration is purely in the deployed helper script and adjacent deploy artifacts.
- **`bubblewrap` apt package removed.** No production path invokes it after this change; the deploy script is updated to stop installing it.
- **No new systemd unit files.** Each invocation is a transient unit named `left4me-script-{overlay_id}-{pid}.service`.
- **No application-level dependency changes.** No new Python packages, no template changes, no DB migration.
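
Because each run gets a predictable transient unit name, its journal can be looked up after the fact. The sketch below assumes a hypothetical `unit_name` function that mirrors the helper's `--unit=` value; the overlay ID and PID in the example are made up.

```bash
#!/bin/bash
# Illustrative: mirrors the --unit= value in the helper, where $$ is the helper's PID.
# systemd-run appends ".service" when the given name has no unit suffix.
unit_name() {
    local overlay_id=$1 pid=${2:-$$}
    printf 'left4me-script-%s-%s.service\n' "$overlay_id" "$pid"
}

# Example: inspect a finished build's log with
#   journalctl -u "$(unit_name 42 1234)"
```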