# left4me application hardening — defenses survey
**Status:** living spec. Companion to `2026-05-15-hardening-threat-model.md`
and `2026-05-15-hardening-test-plan.md`.
This document catalogs the Linux + systemd defense primitives applicable
to left4me, evaluates each against this codebase's needs, and proposes a
candidate composition. Each candidate is *testable* — the test plan
exercises it before commit.
Reference: the threat model defines defenses D1-D7. This document maps
primitives to those defenses.
## Section 1 — Linux kernel primitives
### Namespaces (`man 7 namespaces`)
| NS | Isolates | Relevance |
|---|---|---|
| **mount** | filesystem hierarchy view | Core. Gives `TemporaryFileSystem=` + bind primitives. |
| **user** | uid/gid mapping | Big for D2/D4 (cross-uid ptrace block). |
| **pid** | PID 1, /proc visibility | Pairs with `ProcSubset=pid` for D2. |
| **net** | netifs, ports, routes | Breaks gameservers; do **not** apply to server@. |
| **ipc** | SysV IPC + POSIX MQ + abstract sockets | Hygienic; `PrivateIPC=true`. |
| **uts** | hostname | Cosmetic; doesn't matter for us. |
| **time** | CLOCK_MONOTONIC offset | Irrelevant for us. |
| **cgroup** | cgroup view | Defense-in-depth against cgroup escape. |
**For left4me:** mount + user + pid + ipc on `left4me-server@.service`.
The web unit can use the same minus user-ns (incompatible with sudo).
### Capabilities (`man 7 capabilities`)
Per-process, granted at exec via file caps or by systemd at unit start.
Bounding set = upper bound; ambient = inherited across non-setuid exec.
- **CapabilityBoundingSet=** empty drops everything. Neither srcds nor
gunicorn needs any capability after they start (no raw sockets, no
mount, no module load, no setuid).
- **AmbientCapabilities=** empty (default).
Sharp edge: with `+`-prefixed ExecStartPre, the helper is spawned by
PID 1 with full privileges (root, all caps), unaffected by these. That's
how we get the privileged overlay mount without breaking the unit's caps.
### Seccomp-bpf (`man 2 seccomp`)
Filter syscall set. Per-process. Composes with the AND of all filters
loaded. The systemd `SystemCallFilter=` wraps it.
For us, two filter strategies:
- **Allow-list base** (`@system-service`): permissive enough for srcds
+ gunicorn; subtract dangerous groups.
- **Deny-list**: simpler but easier to leave holes.
Strategy: allow-list with subtractions.
Critical subtractions for D2:
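- `~@debug`: drops `ptrace(2)` and `perf_event_open(2)`. **Single most
important syscall block** for our threat model. In recent systemd
releases `process_vm_readv/writev(2)` sit in `@ipc` rather than
`@debug`; check with `systemd-analyze syscall-filter` and, if so, deny
them (and `process_madvise(2)`) by name in the same `~` line.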
- `~@mount`: `mount`, `umount2`, `pivot_root` (gameserver doesn't need;
helper does, and helper runs as root via `+` prefix).
- `~@privileged` — anything requiring CAP_*; redundant with empty
bounding set but defense-in-depth.
- `~@reboot`, `~@swap`, `~@cpu-emulation`, `~@obsolete` — cheap removal.
Sharp edges:
- `SystemCallFilter=` lines compose left-to-right by union (first line
sets allow-list; subsequent `~` lines subtract).
- A `~` subtract on a group not in the allow-list is a no-op.
- `SystemCallArchitectures=native` blocks 32-bit syscall entries that
bypass the filter. Always set this.
- `SystemCallErrorNumber=EPERM` vs. the default kill action (the process
is terminated with `SIGSYS`): `EPERM` is gentler for non-essential
paths; termination is loud and obvious. Start with the default for a
clear signal, switch to `EPERM` if a benign caller trips it (e.g., a
library probing for capabilities).
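Set membership differs between systemd releases, so before relying on a
`~` subtraction it helps to check which set a syscall actually belongs
to. A minimal sketch, assuming `systemd-analyze syscall-filter` prints
each set name unindented, followed by indented `#` description lines and
indented member lines. That layout is an assumption to confirm on the
target host, and the helper names are illustrative.

```python
import subprocess


def parse_syscall_sets(text: str) -> dict[str, set[str]]:
    """Map each set name (e.g. "@debug") to the members listed under it."""
    sets: dict[str, set[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and description/comment lines
        if not raw[0].isspace():
            current = line  # unindented line starts a new set
            sets[current] = set()
        elif current is not None:
            sets[current].add(line)  # indented line is a member
    return sets


def groups_containing(syscall: str, text: str) -> list[str]:
    """Return the sets that list `syscall` directly (nested sets are not expanded)."""
    return sorted(
        name for name, members in parse_syscall_sets(text).items()
        if syscall in members
    )


def dump_syscall_sets(*names: str) -> str:
    """Run systemd-analyze and return its raw output (requires systemd)."""
    result = subprocess.run(
        ["systemd-analyze", "syscall-filter", *names],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

For example, `groups_containing("ptrace", dump_syscall_sets())` lists
every set that directly names that call on the host where it runs.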
### Yama LSM — `kernel.yama.ptrace_scope`
System-wide sysctl. Values:
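- 0: classic rules; any process may ptrace another with the same uid
- 1: restricted; a process may only ptrace its descendants, unless the
tracee opts in via `prctl(PR_SET_PTRACER)` (Ubuntu default; Debian
kernels default to 0, check with `sysctl kernel.yama.ptrace_scope`)
- 2: requires `CAP_SYS_PTRACE` (admin only)
- 3: ptrace attach disabled entirely; cannot be lowered again until reboot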
For left4me: setting to 2 system-wide is cheap and removes the same-uid
ptrace path entirely. Set via `/etc/sysctl.d/99-left4me.conf` (or
extend an existing file). Doesn't affect debuggability — if you ever
need to ptrace, do it as root.
Caveat: Yama is enforced AT THE TIME of `ptrace` call. With seccomp
blocking the syscall entirely (`~@debug`), Yama becomes belt-and-braces;
keep both for defense-in-depth.
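A test-plan check for the sysctl can be very small. The sketch below
reads the value from procfs and compares it to the target of 2; the
function names are hypothetical, and the path is the standard procfs
location for this sysctl.

```python
from pathlib import Path

YAMA_PATH = Path("/proc/sys/kernel/yama/ptrace_scope")


def read_ptrace_scope(path: Path = YAMA_PATH) -> int:
    """Return the current Yama ptrace_scope value as an integer."""
    return int(path.read_text().strip())


def yama_meets_target(value: int, target: int = 2) -> bool:
    """True when the running value is at least as strict as the target."""
    return value >= target
```

On the host, `yama_meets_target(read_ptrace_scope())` should hold once
the sysctl drop-in has been applied.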
### LSMs other than Yama
| LSM | Status on Debian Trixie | Fit for us |
|---|---|---|
| **AppArmor** | Enabled by default since Debian 10 (buster) | Could write profiles for srcds + gunicorn. Per-unit profile via `AppArmorProfile=` on systemd. Moderate effort. |
| **SELinux** | Available; not enabled by default | Heavy. Not worth the operational cost on a single-host VPS. |
| **landlock** | Kernel ≥5.13; available | Process-local sandboxing. Apps must opt in via the `landlock_*` syscalls (`man 7 landlock`). Python doesn't have a stdlib binding; need to call via ctypes or a wrapper. For us: would need to retrofit gunicorn or write a wrapper. Defer. |
| **BPF LSM** | Kernel ≥5.7; available | Programmable LSM hooks. Bleeding edge for personal infra. Defer. |
| **Tomoyo** | Available; not Debian-enabled | Path-based MAC. Niche. Skip. |
**For left4me:** Yama yes. AppArmor *maybe*, as a follow-up — a profile
limited to "deny path X" patterns for srcds would be small but adds an
audit/rollback surface. Skip in the first pass; revisit if test results
show systemd directives alone leave gaps.
### Filesystem ACLs and modes
POSIX permissions, supplementary groups, ACLs (`setfacl`), extended
attrs (`xattr`).
For us:
- DB and `web.env` already use `root:left4me 0640`. If we go uid-split,
ownership changes; if we go hardening-only, mode is fine — what
matters is *whether the unit's FS view contains them at all*.
- `setfacl` for fine-grained sharing (e.g., one supplementary group
used by both web and game). Doable but adds complexity; consider
only if uid split goes ahead.
### File attributes (chattr)
`chattr +i` (immutable) and `chattr +a` (append-only).
For us:
- `chattr +i /opt/left4me/src/**` — prevents post-deploy tampering by
anything short of root removing the attr. But: `pip install -e`
creates `*.egg-info` files in the tree; deploy of new code would need
to `chattr -R -i ...` first. Too much friction. Skip.
- `chattr +i /etc/left4me/web.env` — keeps the env file from being
rewritten by a malicious uid. Works because the env file is rewritten
rarely (rotate SECRET_KEY explicitly via ckn-bw apply, which is root
and can `chattr -i` first). Worth considering as a small extra.
### cgroups v2
Not a security primitive (not confidentiality/integrity), but a
**resource ceiling**. Already in use:
- `Slice=l4d2-game.slice`, `MemoryMax`, `TasksMax` — keep.
`MemoryDenyWriteExecute=true` is a kernel-level prctl + seccomp, not a
cgroup, but listed here because it's resource-adjacent. See systemd
section.
### Sudo / setuid
Sudoers grants narrow what a unit's uid can do as root. For us, the
helpers (`deploy/scripts/libexec/left4me-*`) already validate inputs tightly
(verified in audit). Two design options for the future:
- **Keep sudo path**, narrow the grants (per-uid via 3-user split, or
per-action via tighter sudoers).
- **Replace sudo with systemctl-managed transient units triggered via
dbus / `systemctl start`** — the build-overlay-unit spec already
proposes this for the script-sandbox.
The web app needs to invoke the helpers somehow. `NoNewPrivileges=true`
on the web unit would break sudo's setuid. If we move to
systemctl-triggered units (no setuid involved), we can also tighten the
web unit. Sequenced in the implementation plan, not this survey.
## Section 2 — systemd unit-config primitives
### Identity
- **`User=` / `Group=`** — drop privileges. Already set.
- **`DynamicUser=true`** — transient uid per run, persisted across runs
via `StateDirectory=`. Strong default. **Bad fit for us** because
multiple units share `/var/lib/left4me/` cross-unit; DynamicUser's
per-unit `StateDirectory=` model fights that.
- **`SupplementaryGroups=`** — extra groups. Used if we add a shared
read-only group (e.g., `l4d2-overlay-readers`).
### Filesystem virtualization
The lever the operator asked about ("can systemd have a fully virtual
filesystem"). Yes — composition:
- **`RootDirectory=path`** — chroot. Full FS substitution. Heavy;
requires populating libs/binaries. Skip for the first pass.
- **`RootImage=path`** — same but from a disk image. Way too heavy.
- **`TemporaryFileSystem=path[:opts]`** — empty tmpfs at `path`.
Cheap. Composes with bind paths.
- **`BindReadOnlyPaths=src[:dst]`** — RO bind. Composes over
TemporaryFileSystem.
- **`BindPaths=src[:dst]`** — RW bind. Composes over TemporaryFileSystem.
- **`InaccessiblePaths=path`** — masks a path with an empty file/dir.
Legacy; Bind* is cleaner.
- **`NoExecPaths=path`** / **`ExecPaths=path`** — restrict
executable paths. Strong but easy to misconfigure.
Composition pattern (the one we want for srcds):
```ini
TemporaryFileSystem=/var/lib /etc /opt /home /root /srv
BindReadOnlyPaths=/var/lib/left4me/installation
BindReadOnlyPaths=/var/lib/left4me/overlays
BindReadOnlyPaths=/etc/left4me/host.env
BindReadOnlyPaths=/etc/ssl /etc/ca-certificates /etc/resolv.conf
BindReadOnlyPaths=/etc/nsswitch.conf /etc/alternatives
BindPaths=/var/lib/left4me/runtime/%i
```
Result: srcds has no DB, no `web.env`, no `/opt/left4me/src/` in its FS
view. Files outside the bound list are simply not there from srcds's
perspective — `open()` returns ENOENT, not EACCES.
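The ENOENT-versus-EACCES distinction is easy to assert from inside the
unit. A minimal sketch of a probe the test plan could run as the unit's
user under the same directives; `classify_access` is a hypothetical
helper, not existing project code.

```python
import errno


def classify_access(path: str) -> str:
    """Classify what opening `path` for reading looks like to this process."""
    try:
        with open(path, "rb"):
            return "readable"
    except FileNotFoundError:
        return "absent"   # ENOENT: the path is not in the FS view at all
    except PermissionError:
        return "denied"   # EACCES/EPERM: the path exists but is blocked
    except OSError as exc:
        # Anything else (e.g. EISDIR for a directory) is reported by name.
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
```

Under the composition above, the DB and `web.env` paths would be
expected to come back as `absent` rather than `denied`.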
Sharp edges:
- `TemporaryFileSystem=` size defaults to half RAM; clamp via
`:size=NNM,nr_inodes=NN`.
- Bind paths must exist on disk; ENOENT prevents unit start.
- `BindReadOnlyPaths=` and `BindPaths=` reorder semantics: bind-mounts
applied in order; later wins.
- `RuntimeDirectory=` integrates with `TemporaryFileSystem=` cleanly:
`RuntimeDirectory=left4me/foo` creates `/run/left4me/foo` and binds
it in, auto-cleaning on stop.
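Because a missing bind source prevents the unit from starting, a
pre-flight check before deploy is cheap insurance. The sketch below
extracts bind sources from unit text and reports the ones that do not
exist. It assumes simple whitespace-separated `src[:dst]` entries, an
optional `-` prefix for sources that may be absent, and only the `%i`
specifier; quoting and other specifiers are not handled, and the
function names are illustrative.

```python
import os

BIND_KEYS = ("BindPaths", "BindReadOnlyPaths")


def bind_sources(unit_text: str, instance: str = "") -> list[tuple[str, bool]]:
    """Return (source_path, optional) pairs for every bind directive."""
    sources = []
    for line in unit_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep or key not in BIND_KEYS:
            continue
        for item in value.split():
            optional = item.startswith("-")  # "-" marks a source that may be missing
            src = item.lstrip("-").split(":", 1)[0].replace("%i", instance)
            sources.append((src, optional))
    return sources


def missing_bind_sources(unit_text: str, instance: str = "",
                         exists=os.path.exists) -> list[str]:
    """List required bind sources that are absent on this host."""
    return [
        src for src, optional in bind_sources(unit_text, instance)
        if not optional and not exists(src)
    ]
```

Passing `exists` as a parameter keeps the check testable without
touching the real filesystem.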
### Namespaces (systemd wrappers)
- **`PrivateTmp=true`** — already set.
- **`PrivateDevices=true`** — already set. Drops most of `/dev`.
- **`PrivateNetwork=true`** — **don't** for gameservers (breaks UDP).
- **`PrivateIPC=true`** — private SysV/POSIX IPC namespace; cheap win.
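- **`PrivateUsers=true`**: own userns. Root and the configured
`User=left4me` are mapped to themselves; every other host uid maps to
`nobody`. The unit's processes hold no capabilities in the host user
namespace, even if they gain root inside (defense for D2/D4). From the
host they still appear as `left4me`, so the same-uid ptrace path has to
be confirmed closed in the test plan rather than assumed.
Sharp edge: incompatible with `sudo` from inside the unit (setuid +
userns mapping = no host-root).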
- **`PrivateMounts=true`** — own mount ns (default-implicit with most
Protect* / Private* directives).
### `/proc` and `/sys` protection
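- **`ProtectProc=invisible|noaccess|ptraceable|default`**:
`invisible` hides `/proc/<pid>/*` of processes owned by other users.
**D2**, but processes running under the same uid (web and game both run
as `left4me` today) remain visible.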
- **`ProcSubset=pid|all`** — `pid` restricts `/proc/` to PID entries;
hides `/proc/kallsyms`, `/proc/cpuinfo`, etc. Cheap.
- **`ProtectKernelTunables=true`** — `/proc/sys`, `/sys` read-only.
- **`ProtectKernelModules=true`** — block `init_module`, `delete_module`.
- **`ProtectKernelLogs=true`** — block `/dev/kmsg`, syslog().
- **`ProtectClock=true`** — block `clock_settime`, `settimeofday`.
- **`ProtectControlGroups=true`** — `/sys/fs/cgroup` read-only.
- **`ProtectHostname=true`** — block `sethostname`/`setdomainname`.
All of `ProtectKernel*`, `ProtectClock`, `ProtectControlGroups`,
`ProtectHostname` are cheap and have no downside for srcds or gunicorn.
Add all of them.
### Filesystem protection (legacy / not Bind*)
- **`ProtectSystem=false|true|full|strict`**: increasingly stringent
RO of system paths. `strict` mounts the entire file system hierarchy
RO, except `/dev`, `/proc`, `/sys`, and explicitly writable paths.
- **`ProtectHome=false|true|read-only|tmpfs`** — `tmpfs` masks `/home`,
`/root`, `/run/user` with empty tmpfs.
For us: `ProtectSystem=strict` + `ProtectHome=tmpfs` is the baseline.
But once we adopt `TemporaryFileSystem=` for the relevant trees, these
become secondary — TemporaryFileSystem fully supersedes them in the
covered subtrees. Keep both as defense-in-depth (cheap).
### Syscall filtering
- **`SystemCallFilter=expr`** — discussed in Linux section.
- **`SystemCallArchitectures=native`** — always set.
- **`SystemCallLog=expr`** — opt-in logging without enforcement;
useful for diagnosing what gets called before tightening.
- **`SystemCallErrorNumber=EPERM`**: soft denial instead of terminating
the process. The default terminates with SIGSYS; switch later if a
benign caller trips.
### Capabilities
- **`CapabilityBoundingSet=`** — empty drops all. Use it.
- **`AmbientCapabilities=`** — empty (default).
- **`NoNewPrivileges=true`** — prevents setuid escalation. **Required
on srcds**, **incompatible with sudo on web** until sudo is replaced.
### Network restrictions
- **`RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX`** — for srcds.
AF_UNIX needed for journald socket access.
- **`IPAddressAllow=` / `IPAddressDeny=`**: uses cgroup BPF; filters
both ingress and egress traffic of the unit. For srcds: probably
overcomplicates; the firewall already controls ingress. Skip for first
pass.
- **`SocketBindAllow=` / `SocketBindDeny=`** — restricts which ports a
unit can `bind()`. For srcds, allow only the configured game port
range. Adds value but couples to config. Defer to a follow-up.
### Resource restrictions
- **`MemoryMax`**, **`TasksMax`**, **`LimitNOFILE`** — already set.
- **`OOMScoreAdjust`** — already set (favor killing the gameserver
before system processes if memory tight).
- **`MemoryDenyWriteExecute=true`**: blocks memory mappings that are
writable and executable at once, and `mprotect()` calls that add
`PROT_EXEC`. Defends against shellcode in JIT memory. **Source engine
likely fine** (no JIT in the binary; the Squirrel script engine is an
interpreter, not JIT). **Sourcemod plugins**: compiled to bytecode and
run on the SourcePawn VM, which ships an x86 JIT
(`sourcepawn.jit.x86.so`) alongside its interpreter, so plugins may
break under this directive. Verify in test.
### IPC and process hygiene
- **`RemoveIPC=true`** — clean up SysV IPC on unit stop.
- **`KeyringMode=private`** — own kernel keyring; no host-key access.
- **`LockPersonality=true`** — block `personality(2)` calls (no x86 vs
x86-64 mode toggle). Already set.
- **`RestrictRealtime=true`** — block real-time scheduling. srcds may
use SCHED_OTHER + nice; no realtime needed.
- **`RestrictNamespaces=true`** — block `unshare(2)` / `clone(CLONE_NEW*)`.
- **`RestrictSUIDSGID=true`** — already set.
- **`UMask=0027`** — narrow default umask.
### Capabilities of the `+` prefix
`ExecStartPre=+cmd` runs `cmd` as root in PID 1's namespaces, bypassing
the unit's User= and almost all Protect*/Private*/Restrict* directives.
This is how the existing overlay-mount helper runs. Critical to verify
in test:
- Does `+` preserve the bypass when `PrivateUsers=true` is set?
(Expected: yes — the userns is set up around the unit's processes;
`+` puts the helper outside it.)
### State management (per-unit)
- **`StateDirectory=path`** — creates `/var/lib/<path>` owned by User=.
- **`RuntimeDirectory=path`** — creates `/run/<path>`, auto-deleted on
stop.
- **`LogsDirectory=path`** — `/var/log/<path>`.
- **`CacheDirectory=path`** — `/var/cache/<path>`.
- **`ConfigurationDirectory=path`** — `/etc/<path>`.
Useful for cleanup hygiene if we redesign storage layout. Not required
for first pass.
### `systemd-analyze security`
`systemd-analyze security <unit>` produces a security score per unit
(lower = more secure). Output lists each directive with a ✓/✗.
Useful as:
- Regression check (record baseline, ensure score drops after refactor).
- Discovery tool ("which directives haven't I set?").
Baseline scores (to capture during test plan):
- `left4me-server@1.service` before refactor
- `left4me-web.service` before refactor
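Recording the baseline is easier to repeat if the score is parsed rather
than copied by hand. A minimal sketch, assuming the command's last line
has the form `Overall exposure level for <unit>: <score> <label>`; that
line format is an assumption based on current systemd output and should
be checked against the installed version. The function names are
illustrative.

```python
import re

_OVERALL = re.compile(r"Overall exposure level for (\S+): (\d+(?:\.\d+)?)")


def parse_exposure(output: str) -> tuple[str, float]:
    """Extract (unit, score) from `systemd-analyze security <unit>` output."""
    match = _OVERALL.search(output)
    if match is None:
        raise ValueError("no overall exposure line found")
    return match.group(1), float(match.group(2))


def exposure_regressed(baseline: float, current: float) -> bool:
    """Lower is more secure, so a higher score than baseline is a regression."""
    return current > baseline
```

The captured baselines can then live next to the test plan and be
compared after each unit change.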
### Composability lookups
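The systemd docs define predefined syscall sets, a few of which are
worth knowing:
- **`@privileged`** (syscall set) ⊃ `@chown`, `@clock`, `@module`,
`@raw-io`, `@reboot`, `@swap`, plus individual calls such as `setuid`
and `chroot`. It does not cover `ptrace`, which lives in `@debug`. List
the exact members with `systemd-analyze syscall-filter @privileged`.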
- **`@system-service`** is the recommended base for "I want a normal
service to work."
- Subtracting `~@privileged` is broad; `~@debug @mount @raw-io` is
surgical.
## Section 3 — Application-level options
### AppArmor profile for srcds
If systemd directives leave gaps, an AppArmor profile would let us
deny specific paths or operations beyond what systemd's directives
cover. E.g., "deny network for srcds to a specific IP range" via
`network inet stream...` deny rules; or "deny mounting" beyond
`SystemCallFilter`.
Effort:
- Confirm AppArmor is active (`aa-status`); Debian enables it by
default, but a VPS image or custom kernel cmdline may have turned it off.
- Write a profile (e.g., `/etc/apparmor.d/usr.bin.srcds_linux`).
- Reference via systemd `AppArmorProfile=` per unit.
Skip for the first pass; revisit if test results show the systemd
directives alone leave a gap.
### landlock for the web app
Python web app could call `landlock_create_ruleset` / `landlock_add_rule`
/ `landlock_restrict_self` via ctypes. Restricts FS access at runtime.
For us:
- Could restrict gunicorn to `/var/lib/left4me/` + `/etc/left4me/web.env`
+ `/opt/left4me/.venv` + `/tmp`.
- Symmetric to `TemporaryFileSystem=` + `Bind*` but at the
application layer (no systemd reach).
Skip; systemd directives are simpler. Reconsider if we move to a
DynamicUser-style world later.
### File-integrity tooling (Aide, Tripwire)
Out of scope for prevention; useful for detection. Not in this design.
### Custom seccomp profile (bypassing systemd)
The web app could call `seccomp(2)` from inside Python via libseccomp
+ ctypes to tighten its own filter beyond what systemd applies.
Symmetric to landlock; skip for the same reason.
## Section 4 — Per-defense mapping
For each defense from the threat model, the primitives that implement
it, in priority order:
### D1 — Gameserver RCE cannot exfiltrate DB or `web.env`
| Primitive | Strength | Notes |
|---|---|---|
| `TemporaryFileSystem=/var/lib /etc` + minimal bind set | Strong | The files simply aren't in the unit's FS view. ENOENT, not EACCES. |
| 3-user split (DB owned by `l4d2-web`) | Strong | Kernel-enforced; survives unit-config errors. |
| `BindReadOnlyPaths=/dev/null:/var/lib/left4me/left4me.db` | Medium | Masks the path; brittle (paths can move). |
| Filesystem ACLs (DB mode 0600) | Weak | Kernel still allows `left4me` group; only fixed by uid split. |
**Composition chosen:** `TemporaryFileSystem=` + Bind* (primary).
3-user split as defense-in-depth or deferred.
### D2 — Gameserver RCE cannot ptrace web app or peers
| Primitive | Strength | Notes |
|---|---|---|
| `SystemCallFilter=~@debug` | Strong | Blocks `ptrace`. `process_vm_readv/writev` must be denied by name if the installed systemd groups them outside `@debug`. |
| `kernel.yama.ptrace_scope=2` | Strong | Belt-and-braces at the kernel level. |
| `CapabilityBoundingSet=` empty | Strong | No CAP_SYS_PTRACE. |
| `PrivateUsers=true` | Strong | Cross-userns ptrace requires CAP_SYS_PTRACE. |
| 3-user split | Strong | Different uids; same-uid path doesn't exist. |
**Composition chosen:** All four (syscall + yama + caps + userns)
together; they compose redundantly.
### D3 — Gameserver RCE cannot use sudo helpers
| Primitive | Strength | Notes |
|---|---|---|
| `NoNewPrivileges=true` | Strong | Blocks sudo's setuid. Already set on server@. |
| `PrivateUsers=true` | Strong | sudo across userns boundary impossible. |
| Sudoers grants scoped to `l4d2-web` (uid split) | Strong | Different uid means sudo grant doesn't apply. |
| `RestrictSUIDSGID=true` | Strong | Already set. |
**Composition chosen:** NoNewPrivileges (already) + PrivateUsers (new)
+ RestrictSUIDSGID (already). The sudo path that a 3-user split would
close is already covered by NNP + PrivateUsers; the uid split would be
defense-in-depth.
### D4 — Web app RCE cannot ptrace gameservers
| Primitive | Strength | Notes |
|---|---|---|
| `SystemCallFilter=~@debug` on **web** | Strong | Symmetric to D2 but applied to web. |
| `kernel.yama.ptrace_scope=2` | Strong | System-wide, helps both directions. |
| 3-user split | Strong | Different uids. |
**Composition chosen:** SystemCallFilter on web + yama=2 system-wide.
PrivateUsers cannot be applied to web (sudo incompatibility). 3-user
split as defense-in-depth or deferred.
### D5 — Cross-server contamination
Each `left4me-server@<n>.service` is a separate unit instance. With
`PrivateUsers=true`, each gets its own user namespace. Cross-namespace
ptrace fails. With `TemporaryFileSystem=` and per-instance
`BindPaths=/var/lib/left4me/runtime/%i`, neither instance can read the
other's `runtime/<n>/` or attach to its process.
**Composition chosen:** PrivateUsers + per-instance Bind* (above).
Per-instance uids out of scope.
### D6 — Persistent compromise of `/opt/left4me/src/` blocked from gameserver
Already covered by `ProtectSystem=strict` on server@.service. With
`TemporaryFileSystem=/opt`, the path simply isn't visible to srcds.
**Stronger and redundant — both can stay.**
### D7 — Defenses survive a unit-config refactor in the wrong direction
`deploy/tests/test_deploy_artifacts.py` asserts the directives' presence
in the deployed unit. Add hardening invariants as test cases. Survives
because the test fails CI before deploy.
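The invariant check itself can be a small pure function that the
existing test module calls with the rendered unit text. A sketch under
stated assumptions: the `REQUIRED` set below is an illustrative subset,
and how the unit text is loaded is left to the project's own fixtures.
Since systemd has no trailing comments, a value like `true # NEW` is
deliberately treated as a mismatch.

```python
REQUIRED = {
    "NoNewPrivileges": "true",
    "PrivateUsers": "true",
    "ProtectProc": "invisible",
    "SystemCallArchitectures": "native",
    "CapabilityBoundingSet": "",  # must be present and empty
}


def parse_directives(unit_text: str) -> dict[str, list[str]]:
    """Collect Key=Value lines, skipping comments and section headers."""
    directives: dict[str, list[str]] = {}
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;[" or "=" not in line:
            continue
        key, _, value = line.partition("=")
        directives.setdefault(key.strip(), []).append(value.strip())
    return directives


def missing_invariants(unit_text: str, required=REQUIRED) -> list[str]:
    """Return directive names whose required value is not set exactly."""
    found = parse_directives(unit_text)
    return sorted(
        key for key, want in required.items()
        if want not in found.get(key, [])
    )
```

A test case then reduces to `assert missing_invariants(unit_text) == []`,
which fails with the list of dropped directives in the message.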
## Section 5 — Candidate composition
**For testing, not commitment.** Test plan validates each piece.
### `left4me-server@.service`
```ini
[Service]
User=left4me
Group=left4me
# (existing)
Type=simple
WorkingDirectory=-/var/lib/left4me/runtime/%i/merged/left4dead2
EnvironmentFile=/etc/left4me/host.env
EnvironmentFile=/var/lib/left4me/instances/%i/instance.env
ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
ExecStart=/var/lib/left4me/runtime/%i/merged/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i
Restart=on-failure
RestartSec=5
# Resource control (existing)
Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0
# systemd does not support trailing comments, so NEW markers sit on
# their own comment lines.
# Hardening: identity
NoNewPrivileges=true
RestrictSUIDSGID=true
# Hardening: namespaces (NEW: PrivateUsers)
PrivateTmp=true
PrivateDevices=true
PrivateIPC=true
PrivateUsers=true
ProtectHome=true
# Hardening: filesystem view
# NEW: TemporaryFileSystem= and the /etc binds. The /var/lib binds
# replace the old ReadOnlyPaths= and ReadWritePaths= lines, which are
# removed (superseded).
TemporaryFileSystem=/var/lib /etc /opt /home /root /srv /mnt /media
BindReadOnlyPaths=/var/lib/left4me/installation
BindReadOnlyPaths=/var/lib/left4me/overlays
BindReadOnlyPaths=/etc/left4me/host.env
BindReadOnlyPaths=/etc/ssl /etc/ca-certificates
BindReadOnlyPaths=/etc/resolv.conf /etc/nsswitch.conf /etc/alternatives
BindPaths=/var/lib/left4me/runtime/%i
ProtectSystem=strict
# Hardening: /proc, /sys, kernel (all NEW except LockPersonality)
ProtectProc=invisible
ProcSubset=pid
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectClock=true
ProtectControlGroups=true
ProtectHostname=true
LockPersonality=true
# Hardening: caps + syscall (all NEW)
CapabilityBoundingSet=
AmbientCapabilities=
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@debug @mount @raw-io @reboot @swap @cpu-emulation @obsolete @privileged
# Hardening: network (NEW; AF_UNIX for journald)
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
# Hardening: namespaces, realtime, IPC (all NEW)
RestrictNamespaces=true
RestrictRealtime=true
RemoveIPC=true
KeyringMode=private
UMask=0027
# Deferred until test (MAY break sourcemod / Source engine; test first):
# MemoryDenyWriteExecute=true
```
### `left4me-web.service`
```ini
[Service]
User=left4me
Group=left4me
# (existing)
Type=simple
WorkingDirectory=/opt/left4me/src
Environment=HOME=/var/lib/left4me PATH=/opt/left4me/.venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
EnvironmentFile=/etc/left4me/host.env
EnvironmentFile=/etc/left4me/web.env
ExecStart=/opt/left4me/.venv/bin/gunicorn --workers ... --threads ... --bind 127.0.0.1:8000 'l4d2web.app:create_app()'
Restart=on-failure
RestartSec=3
# Hardening
# systemd does not support trailing comments, so annotations sit on
# their own comment lines.
PrivateTmp=true
# tightened from =full
ProtectSystem=strict
ProtectHome=true
# web needs broad write access there
ReadWritePaths=/var/lib/left4me
# NoNewPrivileges intentionally NOT set (sudo)
# PrivateUsers intentionally NOT set (sudo)
# /proc + kernel hardening (sudo-compatible; all NEW)
ProtectProc=invisible
ProcSubset=pid
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectClock=true
ProtectControlGroups=true
ProtectHostname=true
LockPersonality=true
# Syscall filter (all NEW): allow @system-service minus debug-class;
# keep @privileged because sudo needs setuid, chown, etc.
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@debug @mount @raw-io @reboot @swap @cpu-emulation @obsolete
# Network (NEW)
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
# Misc hygiene (all NEW)
RestrictRealtime=true
RestrictNamespaces=true
RemoveIPC=true
UMask=0027
# Deferred for sudo-removal future work:
# NoNewPrivileges=true
# CapabilityBoundingSet=
# PrivateUsers=true
```
### Host sysctl
`/etc/sysctl.d/99-left4me.conf` (or merge into existing):
```
kernel.yama.ptrace_scope=2
```
System-wide. Means: even if a unit-level config slips, host-level
ptrace is admin-only. Cost: zero for our use case (no debugging in
prod).
## Section 6 — Trade-offs and known sharp edges
To verify in the test plan:
1. **`PrivateUsers=true` + `+`-prefixed ExecStartPre**: expected to
work (the `+` runs outside the unit's namespaces). Sharp if it
doesn't — the overlay mount would fail and srcds wouldn't start.
2. **`TemporaryFileSystem=/etc` and missing files**: srcds and its
dependencies (libstdc++ runtime, libssl, libcurl) may read files
from `/etc` we haven't bound. Watch journalctl for ENOENT during
first start.
3. **`SystemCallFilter=~@privileged` and Source engine**: srcds is C++
and uses syscalls beyond the obvious. A `~@privileged` may trip
something. Mitigation: test with `SystemCallLog=` instead of
`SystemCallFilter=` first; observe what would have been blocked;
then narrow.
4. **`MemoryDenyWriteExecute=true` and sourcemod**: SourcePawn bytecode
runs on a VM that includes an x86 JIT, and JIT-compiled code needs
memory that is written and then executed. Plugins may break unless
the VM can run interpreter-only. Test before enabling.
5. **`RestrictAddressFamilies=` without AF_UNIX**: journald socket
needs it. Always include AF_UNIX.
6. **`ProcSubset=pid` and Python**: gunicorn shouldn't break (uses
/proc/self/* + signal-based ipc). Verify.
7. **sysctl `kernel.yama.ptrace_scope=2`**: blocks the operator's own
unprivileged `gdb` / `strace -p` against any running service. If
you need to debug, attach as root (which holds `CAP_SYS_PTRACE`), or
temporarily lower the value via sysctl, then revert.
8. **`ProtectSystem=strict` on web**: was `=full`. Tighter; might
break a write the web app does to a path outside `/var/lib/left4me`.
Audit `l4d2web/*` for `os.makedirs` or `open(...'w')` outside that
root.
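The audit in item 8 can start from a static scan. The sketch below uses
the standard `ast` module to flag `open()` calls with a literal write
mode and any `makedirs` / `mkdir` calls. It is a rough first pass under
stated assumptions: it does not resolve paths, follow variables that
hold the mode, or catch helpers such as `Path.write_text`, so hits still
need manual review. `find_write_sites` is a hypothetical name.

```python
import ast

WRITE_MODE_CHARS = set("wax+")


def find_write_sites(source: str) -> list[tuple[int, str]]:
    """Return sorted (line_number, kind) pairs for likely filesystem writes."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr in ("makedirs", "mkdir"):
            hits.append((node.lineno, func.attr))
        elif isinstance(func, ast.Name) and func.id == "open":
            # The mode is the second positional argument or the mode= keyword.
            mode = node.args[1] if len(node.args) >= 2 else None
            for keyword in node.keywords:
                if keyword.arg == "mode":
                    mode = keyword.value
            if (isinstance(mode, ast.Constant) and isinstance(mode.value, str)
                    and WRITE_MODE_CHARS & set(mode.value)):
                hits.append((node.lineno, "open"))
    return sorted(hits)
```

Running it over each file under `l4d2web/` gives a short list of call
sites to check against the `/var/lib/left4me` root.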
## Open questions for the implementer
(After test plan results come back, finalize these.)
1. Do we adopt `MemoryDenyWriteExecute=true` if it works for srcds?
(Probably yes, defense-in-depth at low cost.)
2. Do we set `SocketBindAllow=` on srcds to lock the port range?
(Depends on whether `instance.env` exposes the range cleanly to a
unit directive.)
3. Do we deploy AppArmor profiles as a follow-up?
(Probably no — operational complexity exceeds the marginal gain on
single-host infra.)
4. Do we keep both `BindReadOnlyPaths=` and the legacy
`ReadOnlyPaths=` declarations, or simplify? (Simplify — use Bind*
exclusively once `TemporaryFileSystem=` is in place.)
5. Do we proceed with 3-user split as a follow-up, or close the spec
as "addressed by hardening"? Depends on operator's residual-risk
tolerance after Phase A lands and we observe.
## Pointers
- Threat model: `docs/superpowers/specs/2026-05-15-hardening-threat-model.md`
- Test plan: `docs/superpowers/specs/2026-05-15-hardening-test-plan.md`
- Original uid-split spec (still open): `docs/superpowers/specs/2026-05-15-user-uid-split-design.md`
- Live unit source (ckn-bw reactor): `~/Projekte/ckn-bw/bundles/left4me/metadata.py:150+`
- Reference units (deploy-dir-rethink reference-only): `deploy/files/usr/local/lib/systemd/system/`
- systemd docs (latest, systemd 256+ on Trixie):
`man systemd.exec`, `man systemd.unit`, `man systemd-analyze`.
- L4D2 / Source engine docs:
- SourcePawn (bytecode VM with an x86 JIT): https://wiki.alliedmods.net/SourcePawn
- srcds is a Source 2007 engine binary; closed-source, expect surprises.