Reframe the queued uid-split decision into a broader hardening analysis. Audit found the same-uid attack surface (DB readable from srcds, ptrace allowed, RCON stored plaintext) is closable by either uid split or systemd directive composition; the three specs ground that choice in a threat model, survey the defenses, and lay out a self-contained test plan to run on left4.me next. uid-split spec deferred pending results. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# left4me application hardening: defenses survey

**Status:** living spec. Companion to `2026-05-15-hardening-threat-model.md`
and `2026-05-15-hardening-test-plan.md`.

This document catalogs the Linux + systemd defense primitives applicable
to left4me, evaluates each against this codebase's needs, and proposes a
candidate composition. Each candidate is *testable*: the test plan
exercises it before commit.

Reference: the threat model defines defenses D1-D7. This document maps
primitives to those defenses.

## Section 1: Linux kernel primitives

### Namespaces (`man 7 namespaces`)

| NS | Isolates | Relevance |
|---|---|---|
| **mount** | filesystem hierarchy view | Core. Gives `TemporaryFileSystem=` + bind primitives. |
| **user** | uid/gid mapping | Big for D2/D4 (cross-userns ptrace block). |
| **pid** | PID numbering, `/proc` visibility | Pairs with `ProcSubset=pid` for D2. |
| **net** | netifs, ports, routes, abstract Unix sockets | Breaks gameservers; do **not** apply to server@. |
| **ipc** | SysV IPC + POSIX MQ | Hygienic; `PrivateIPC=true`. |
| **uts** | hostname | Cosmetic; doesn't matter for us. |
| **time** | `CLOCK_MONOTONIC` / `CLOCK_BOOTTIME` offsets | Irrelevant for us. |
| **cgroup** | cgroup view | Defense-in-depth against cgroup escape. |

**For left4me:** mount + user + pid + ipc on `left4me-server@.service`.
The web unit can use the same minus user-ns (incompatible with sudo).

### Capabilities (`man 7 capabilities`)

Per-thread, granted at exec via file caps or by systemd at unit start.
Bounding set = upper bound; ambient = inherited across non-setuid exec.

- **CapabilityBoundingSet=** empty drops everything. Neither srcds nor
  gunicorn itself needs any capability after start (no raw sockets, no
  mount, no module load, no setuid). The exception is the web unit's
  sudo path, which is why Section 5 defers this directive on web.
- **AmbientCapabilities=** empty (default).

Sharp edge: a `+`-prefixed ExecStartPre helper runs with full privileges
(root, all caps), unaffected by these. That's how we get the privileged
overlay mount without breaking the unit's caps.

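
One way to confirm that an empty bounding set actually took effect is to read the `Cap*` masks from `/proc/<pid>/status` of the running service. The helper below is a minimal sketch under that assumption; the function name is hypothetical and it only parses text that the caller has already read.

```python
def parse_cap_masks(status_text: str) -> dict:
    """Return the Cap* hex masks found in the text of /proc/<pid>/status."""
    masks = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        # Capability lines look like "CapBnd:\t000001ffffffffff".
        if sep and key.startswith("Cap"):
            masks[key] = int(value.strip(), 16)
    return masks


def has_no_capabilities(masks: dict) -> bool:
    """True when the bounding, effective and permitted sets are all empty."""
    return all(masks.get(name, 0) == 0 for name in ("CapBnd", "CapEff", "CapPrm"))
```

Feeding it the contents of `/proc/<MainPID>/status` for a server instance gives a quick yes/no on whether `CapabilityBoundingSet=` was applied as intended.
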
### Seccomp-bpf (`man 2 seccomp`)

Filters the syscall set. Per-thread, inherited across fork and exec. When
several filters are loaded, all of them are evaluated and the most
restrictive result wins. The systemd `SystemCallFilter=` wraps it.

For us, two filter strategies:
- **Allow-list base** (`@system-service`): permissive enough for srcds
  + gunicorn; subtract dangerous groups.
- **Deny-list**: simpler but easier to leave holes.

Strategy: allow-list with subtractions.

Critical subtractions for D2:
- `~@debug`: drops `ptrace(2)` (along with `perf_event_open(2)` and
  `pidfd_getfd(2)`). **Single most important syscall block** for our
  threat model. Caveat: current systemd lists
  `process_vm_readv/writev(2)` and `process_madvise(2)` under `@ipc`,
  not `@debug`, so `~@debug` alone likely leaves them allowed. Confirm
  with `systemd-analyze syscall-filter @debug @ipc` on the target host
  and deny them by name if so.
- `~@mount`: `mount`, `umount2`, `pivot_root` (gameserver doesn't need;
  helper does, and helper runs as root via `+` prefix).
- `~@privileged`: calls that normally require elevated capabilities;
  redundant with an empty bounding set but defense-in-depth.
- `~@reboot`, `~@swap`, `~@cpu-emulation`, `~@obsolete`: cheap removal.

Sharp edges:
- Multiple `SystemCallFilter=` lines merge. The first line fixes the
  mode (allow-list here); later plain lines add to the set and later
  `~` lines remove from it.
- A `~` subtract on a group not in the allow-list is a no-op.
- `SystemCallArchitectures=native` blocks non-native syscall ABIs (for
  example 32-bit entries on amd64) that could bypass the filter. Set it
  wherever the workload is native. Check `file srcds_linux` first: if
  srcds is a 32-bit (i386) ELF, as L4D2's dedicated server has
  historically been, `native` on an amd64 host would block srcds itself
  and server@ would need `native x86` instead.
- `SystemCallErrorNumber=EPERM` vs. the default action (the process is
  terminated with `SIGSYS`): `EPERM` is gentler for non-essential
  paths; the default kill is loud and obvious. Start with the default
  for a clear signal, switch to `EPERM` if a benign caller trips it
  (e.g., a library probing for capabilities).

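
The merge rule for repeated `SystemCallFilter=` lines is easy to get backwards, so here is a small model of it. This is a sketch of the documented behaviour, not systemd's implementation; the function name is hypothetical and the group contents passed in are toy data, not real group membership.

```python
def merge_syscall_filters(lines, groups):
    """Model how successive SystemCallFilter= values merge.

    lines:  values of the SystemCallFilter= lines, in file order.
    groups: mapping of "@group" name to a set of syscall names (toy data).
    Returns (mode, names): mode is "allow" or "deny" as fixed by the first
    line, names is the resulting allow-list or deny-list.
    """
    mode = None
    names = set()
    for line in lines:
        inverted = line.startswith("~")
        expanded = set()
        for item in line.lstrip("~").split():
            # Unknown items are treated as single syscall names.
            expanded |= groups.get(item, {item})
        if mode is None:
            # The first line decides whether we are allow- or deny-listing.
            mode = "deny" if inverted else "allow"
            names = expanded
        elif (mode == "allow") != inverted:
            # Same polarity as the first line: grow the set.
            names |= expanded
        else:
            # Opposite polarity: remove from the set.
            names -= expanded
    return mode, names
```

With an allow-list first line, a later `~` line only removes names that are actually in the set, which is why subtracting a group that was never allowed changes nothing.
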
### Yama LSM: `kernel.yama.ptrace_scope`

System-wide sysctl. Values:
- 0: classic permissions; any same-uid process can ptrace
- 1: restricted; only a direct ancestor of the target (or a tracer the
  target declared via `PR_SET_PTRACER`) can attach. Distro defaults
  differ, so read the host's current value with
  `sysctl kernel.yama.ptrace_scope` rather than assuming it.
- 2: attach requires `CAP_SYS_PTRACE` (admin only)
- 3: ptrace attach disabled entirely; cannot be lowered again without a
  reboot

For left4me: setting to 2 system-wide is cheap and removes the same-uid
ptrace path entirely. Set via `/etc/sysctl.d/99-left4me.conf` (or
extend an existing file). Doesn't affect debuggability for the operator:
if you ever need to ptrace, do it as root.

Caveat: Yama is enforced at the time of the `ptrace` attach. With seccomp
blocking the syscall entirely (`~@debug`), Yama becomes belt-and-braces;
keep both for defense-in-depth.

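
A deploy-time check can read the sysctl back and compare it with the level proposed here. The sketch below assumes the standard procfs path for the Yama sysctl; the function names and the `minimum=2` default are illustrative choices, not part of any existing tooling.

```python
from pathlib import Path

# Short labels for the four Yama levels described above.
PTRACE_SCOPE_MEANING = {
    0: "classic: same-uid attach allowed",
    1: "restricted: ancestors or declared tracers only",
    2: "admin-only: CAP_SYS_PTRACE required",
    3: "no attach",
}


def read_ptrace_scope(path="/proc/sys/kernel/yama/ptrace_scope"):
    """Return the current Yama scope as an int, or None if it is unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (FileNotFoundError, PermissionError):
        return None


def scope_is_hardened(scope, minimum=2):
    """True when the scope meets the level this survey proposes (>= 2)."""
    return scope is not None and scope >= minimum
```

A `None` result means Yama is not available on that kernel, in which case the seccomp filter is the only ptrace block and the test plan should note it.
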
### LSMs other than Yama

| LSM | Status on Debian Trixie | Fit for us |
|---|---|---|
| **AppArmor** | Enabled by default (since Debian 10) | Could write profiles for srcds + gunicorn. Per-unit profile via `AppArmorProfile=` on systemd. Moderate effort. |
| **SELinux** | Available; not enabled by default | Heavy. Not worth the operational cost on a single-host VPS. |
| **landlock** | Kernel ≥5.13; available | Process-local sandboxing. Apps must opt in via the `landlock_*` syscalls (`man 7 landlock`). Python doesn't have a stdlib binding; need to call via ctypes or a wrapper. For us: would need to retrofit gunicorn or write a wrapper. Defer. |
| **BPF LSM** | Kernel ≥5.7; available | Programmable LSM hooks. Bleeding edge for personal infra. Defer. |
| **Tomoyo** | Available; not Debian-enabled | Path-based MAC. Niche. Skip. |

**For left4me:** Yama yes. AppArmor *maybe*, as a follow-up: a profile
limited to "deny path X" patterns for srcds would be small but adds an
audit/rollback surface. Skip in the first pass; revisit if test results
show systemd directives alone leave gaps.

### Filesystem ACLs and modes

POSIX permissions, supplementary groups, ACLs (`setfacl`), extended
attrs (`xattr`).

For us:
- DB and `web.env` already use `root:left4me 0640`. If we go uid-split,
  ownership changes; if we go hardening-only, mode is fine. What
  matters is *whether the unit's FS view contains them at all*.
- `setfacl` for fine-grained sharing (e.g., one supplementary group
  used by both web and game). Doable but adds complexity; consider
  only if uid split goes ahead.

### File attributes (chattr)

`chattr +i` (immutable) and `chattr +a` (append-only).

For us:
- `chattr +i /opt/left4me/src/**`: prevents post-deploy tampering by
  anything short of root removing the attr. But: `pip install -e`
  creates `*.egg-info` files in the tree; deploy of new code would need
  to `chattr -R -i ...` first. Too much friction. Skip.
- `chattr +i /etc/left4me/web.env`: keeps the env file from being
  rewritten by a malicious uid. Works because the env file is rewritten
  rarely (rotate SECRET_KEY explicitly via ckn-bw apply, which is root
  and can `chattr -i` first). Worth considering as a small extra.

### cgroups v2

Not a security primitive (not confidentiality/integrity), but a
**resource ceiling**. Already in use:
- `Slice=l4d2-game.slice`, `MemoryMax`, `TasksMax`: keep.

`MemoryDenyWriteExecute=true` is a kernel-level prctl + seccomp, not a
cgroup, but listed here because it's resource-adjacent. See systemd
section.

### Sudo / setuid

Sudoers grants narrow what a unit's uid can do as root. For us, the
helpers (`scripts/libexec/left4me-*`) already validate inputs tightly
(verified in audit). Two design options for the future:

- **Keep sudo path**, narrow the grants (per-uid via 3-user split, or
  per-action via tighter sudoers).
- **Replace sudo with systemctl-managed transient units triggered via
  dbus / `systemctl start`**: the build-overlay-unit spec already
  proposes this for the script-sandbox.

The web app needs to invoke the helpers somehow. `NoNewPrivileges=true`
on the web unit would break sudo's setuid. If we move to
systemctl-triggered units (no setuid involved), we can also tighten the
web unit. Sequenced in the implementation plan, not this survey.

## Section 2: systemd unit-config primitives

### Identity

- **`User=` / `Group=`**: drop privileges. Already set.
- **`DynamicUser=true`**: transient uid per run, persisted across runs
  via `StateDirectory=`. Strong default. **Bad fit for us** because
  multiple units share `/var/lib/left4me/` cross-unit; DynamicUser's
  per-unit `StateDirectory=` model fights that.
- **`SupplementaryGroups=`**: extra groups. Used if we add a shared
  read-only group (e.g., `l4d2-overlay-readers`).

### Filesystem virtualization

The lever the operator asked about ("can systemd have a fully virtual
filesystem"). Yes, by composition:

- **`RootDirectory=path`**: chroot. Full FS substitution. Heavy;
  requires populating libs/binaries. Skip for the first pass.
- **`RootImage=path`**: same but from a disk image. Way too heavy.
- **`TemporaryFileSystem=path[:opts]`**: empty tmpfs at `path`.
  Cheap. Composes with bind paths.
- **`BindReadOnlyPaths=src[:dst]`**: RO bind. Composes over
  TemporaryFileSystem.
- **`BindPaths=src[:dst]`**: RW bind. Composes over TemporaryFileSystem.
- **`InaccessiblePaths=path`**: masks a path with an empty file/dir.
  Legacy; Bind* is cleaner.
- **`NoExecPaths=path`** / **`ExecPaths=path`**: restrict
  executable paths. Strong but easy to misconfigure.

Composition pattern (the one we want for srcds):
```ini
TemporaryFileSystem=/var/lib /etc /opt /home /root /srv
BindReadOnlyPaths=/var/lib/left4me/installation
BindReadOnlyPaths=/var/lib/left4me/overlays
BindReadOnlyPaths=/etc/left4me/host.env
BindReadOnlyPaths=/etc/ssl /etc/ca-certificates /etc/resolv.conf
BindReadOnlyPaths=/etc/nsswitch.conf /etc/alternatives
BindPaths=/var/lib/left4me/runtime/%i
```

Result: srcds has no DB, no `web.env`, no `/opt/left4me/src/` in its FS
view. Files outside the bound list are simply not there from srcds's
perspective: `open()` returns ENOENT, not EACCES.

Sharp edges:
- `TemporaryFileSystem=` size defaults to half RAM; clamp via
  `:size=NNM,nr_inodes=NN`.
- Bind source paths must exist on disk; a missing source prevents unit
  start unless the entry is prefixed with `-`.
- Ordering between `BindReadOnlyPaths=` and `BindPaths=`: bind mounts
  are expected to apply in order, with later entries winning on
  overlap. Avoid overlapping entries so the result doesn't depend on it.
- `RuntimeDirectory=` integrates with `TemporaryFileSystem=` cleanly:
  `RuntimeDirectory=left4me/foo` creates `/run/left4me/foo` and binds
  it in, auto-cleaning on stop.

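
Because a missing bind source stops the unit from starting, it can be worth checking the sources before a deploy. The following is a minimal sketch under simplifying assumptions: it only understands `BindPaths=` and `BindReadOnlyPaths=`, expands only the `%i` specifier, and the function name is hypothetical.

```python
import os

BIND_KEYS = ("BindPaths", "BindReadOnlyPaths")


def missing_bind_sources(unit_text, instance="1"):
    """List bind-mount source paths named in unit_text that do not exist.

    Sources prefixed with "-" are skipped, since systemd tolerates those
    being absent. Only the %i specifier is expanded.
    """
    missing = []
    for raw in unit_text.splitlines():
        key, sep, value = raw.strip().partition("=")
        if not sep or key.strip() not in BIND_KEYS:
            continue
        for entry in value.split():
            # An entry is "src" or "src:dst[:opts]"; only src must exist.
            source = entry.split(":", 1)[0].replace("%i", instance)
            if source.startswith("-"):
                continue
            if not os.path.exists(source):
                missing.append(source)
    return missing
```

Running it against the candidate unit for each configured instance number gives a list of paths to create (or to prefix with `-`) before the first hardened start.
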
### Namespaces (systemd wrappers)

- **`PrivateTmp=true`**: already set.
- **`PrivateDevices=true`**: already set. Drops most of `/dev`.
- **`PrivateNetwork=true`**: **don't** for gameservers (breaks UDP).
- **`PrivateIPC=true`**: private SysV/POSIX IPC namespace; cheap win.
- **`PrivateUsers=true`**: own userns. Root and the configured
  `User=left4me` are identity-mapped; every other uid/gid maps to
  `nobody`. The unit's processes hold no capabilities in the host user
  namespace (defense for D2/D4: anything needing `CAP_SYS_PTRACE` fails
  across the boundary). Because `left4me` maps to itself, don't assume
  this alone stops same-uid ptrace; the test plan should confirm it.
  Sharp edge: incompatible with `sudo` from inside the unit (setuid +
  userns mapping = no host-root).
- **`PrivateMounts=true`**: own mount ns (default-implicit with most
  Protect* / Private* directives).

### `/proc` and `/sys` protection

- **`ProtectProc=invisible|noaccess|ptraceable|default`**:
  `invisible` hides `/proc/<pid>/*` for processes owned by other users.
  **D2**, with the caveat that processes under the same uid (today: web
  and every srcds instance) stay visible to each other.
- **`ProcSubset=pid|all`**: `pid` restricts `/proc/` to PID entries;
  hides `/proc/kallsyms`, `/proc/cpuinfo`, etc. Cheap.
- **`ProtectKernelTunables=true`**: `/proc/sys`, `/sys` read-only.
- **`ProtectKernelModules=true`**: block `init_module`, `delete_module`.
- **`ProtectKernelLogs=true`**: block `/dev/kmsg`, `syslog(2)`.
- **`ProtectClock=true`**: block `clock_settime`, `settimeofday`.
- **`ProtectControlGroups=true`**: `/sys/fs/cgroup` read-only.
- **`ProtectHostname=true`**: block `sethostname`/`setdomainname`.

All of `ProtectKernel*`, `ProtectClock`, `ProtectControlGroups`,
`ProtectHostname` are cheap and have no downside for srcds or gunicorn.
Add all of them.

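
To see what a given `ProtectProc=` / `ProcSubset=` combination actually hides, a probe run from inside the unit can list the PID entries it sees and who owns them. This is a sketch with hypothetical function names; it relies only on directory listing and `stat`, and takes the `/proc` root as a parameter so it can be pointed at a fixture.

```python
import os


def visible_pids(proc_root="/proc"):
    """Return the sorted PIDs that have an entry under proc_root."""
    return sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())


def foreign_pids(proc_root="/proc", own_uid=None):
    """Return visible PIDs whose /proc entry is owned by a different uid."""
    own_uid = os.getuid() if own_uid is None else own_uid
    result = []
    for pid in visible_pids(proc_root):
        try:
            owner = os.stat(os.path.join(proc_root, str(pid))).st_uid
        except FileNotFoundError:
            continue  # the process exited between listdir and stat
        if owner != own_uid:
            result.append(pid)
    return result
```

Under `ProtectProc=invisible` the expectation is that `foreign_pids()` comes back empty, while `visible_pids()` still lists same-uid processes; that difference is the caveat noted for D2.
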
### Filesystem protection (legacy / not Bind*)

- **`ProtectSystem=false|true|full|strict`**: increasingly stringent
  RO of system paths. `strict` mounts the entire file system hierarchy
  read-only (except `/dev`, `/proc`, `/sys`), apart from explicitly
  writable paths.
- **`ProtectHome=false|true|read-only|tmpfs`**: `tmpfs` masks `/home`,
  `/root`, `/run/user` with empty tmpfs.

For us: `ProtectSystem=strict` + `ProtectHome=tmpfs` is the baseline.
But once we adopt `TemporaryFileSystem=` for the relevant trees, these
become secondary: TemporaryFileSystem fully supersedes them in the
covered subtrees. Keep both as defense-in-depth (cheap).

### Syscall filtering

- **`SystemCallFilter=expr`**: discussed in Linux section.
- **`SystemCallArchitectures=native`**: set wherever the workload is
  native (see the 32-bit caveat in the Linux section).
- **`SystemCallLog=expr`**: opt-in logging without enforcement;
  useful for diagnosing what gets called before tightening.
- **`SystemCallErrorNumber=EPERM`**: soft denial vs. termination.
  Default is to terminate the process with `SIGSYS`; switch later if a
  benign caller trips.

### Capabilities

- **`CapabilityBoundingSet=`**: empty drops all. Use it.
- **`AmbientCapabilities=`**: empty (default).
- **`NoNewPrivileges=true`**: prevents setuid escalation. **Required
  on srcds**, **incompatible with sudo on web** until sudo is replaced.

### Network restrictions

- **`RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX`**: for srcds.
  AF_UNIX needed for journald socket access.
- **`IPAddressAllow=` / `IPAddressDeny=`**: uses cgroup BPF; filters
  both ingress and egress traffic. For srcds: probably overcomplicates;
  the firewall already controls ingress. Skip for first pass.
- **`SocketBindAllow=` / `SocketBindDeny=`**: restricts which ports a
  unit can `bind()`. For srcds, allow only the configured game port
  range. Adds value but couples to config. Defer to a follow-up.

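
A quick way to check `RestrictAddressFamilies=` from inside a unit is to try creating one socket per family and record which ones are refused. The sketch below uses only the standard `socket` module; the function names and the default family list are illustrative assumptions, and a refusal can also come from missing capabilities rather than from the directive.

```python
import errno
import socket

# errno values treated as "family refused" rather than as a real failure.
REFUSED = (errno.EAFNOSUPPORT, errno.EPERM, errno.EACCES)


def family_allowed(family, kind=socket.SOCK_DGRAM):
    """Return True if socket() succeeds for family, False if it is refused."""
    try:
        sock = socket.socket(family, kind)
    except OSError as exc:
        if exc.errno in REFUSED:
            return False
        raise
    sock.close()
    return True


def probe_families(names=("AF_INET", "AF_INET6", "AF_UNIX", "AF_NETLINK")):
    """Map each family name known to this Python build to allowed/refused."""
    results = {}
    for name in names:
        family = getattr(socket, name, None)
        if family is not None:
            results[name] = family_allowed(family)
    return results
```

With the candidate directive in place, the expectation is that `AF_INET`, `AF_INET6` and `AF_UNIX` report allowed and `AF_NETLINK` reports refused; anything else is worth a line in the test-plan results.
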
### Resource restrictions

- **`MemoryMax`**, **`TasksMax`**, **`LimitNOFILE`**: already set.
- **`OOMScoreAdjust`**: already set, to `-200`. Note that a negative
  value makes the gameserver *less* likely to be OOM-killed; if the
  intent is to sacrifice the gameserver before system processes when
  memory is tight, that needs a positive value. Confirm the intent.
- **`MemoryDenyWriteExecute=true`**: blocks creating mappings that are
  both writable and executable, and making writable memory executable
  via `mprotect(PROT_EXEC)`. Defends against shellcode in JIT memory.
  **Source engine likely fine** (no JIT in the binary; the Squirrel
  script engine is an interpreter, not JIT). **SourceMod plugins**:
  compiled to bytecode and run on the SourcePawn VM, which ships a JIT
  backend for x86 (`sourcepawn.jit.x86`); if that backend is active,
  this directive will likely break plugins. Verify in test.

### IPC and process hygiene

- **`RemoveIPC=true`**: clean up SysV IPC on unit stop.
- **`KeyringMode=private`**: own kernel keyring; no host-key access.
- **`LockPersonality=true`**: block `personality(2)` calls (no x86 vs
  x86-64 mode toggle). Already set.
- **`RestrictRealtime=true`**: block real-time scheduling. srcds may
  use SCHED_OTHER + nice; no realtime needed.
- **`RestrictNamespaces=true`**: block `unshare(2)` / `clone(CLONE_NEW*)`.
- **`RestrictSUIDSGID=true`**: already set.
- **`UMask=0027`**: narrow default umask.

### Capabilities of the `+` prefix

`ExecStartPre=+cmd` runs `cmd` as root in the host's namespaces, bypassing
the unit's User= and almost all Protect*/Private*/Restrict* directives.
This is how the existing overlay-mount helper runs. Critical to verify
in test:
- Does `+` preserve the bypass when `PrivateUsers=true` is set?
  (Expected: yes. The userns is set up around the unit's processes;
  `+` puts the helper outside it.)

### State management (per-unit)

- **`StateDirectory=path`**: creates `/var/lib/<path>` owned by User=.
- **`RuntimeDirectory=path`**: creates `/run/<path>`, auto-deleted on
  stop.
- **`LogsDirectory=path`**: `/var/log/<path>`.
- **`CacheDirectory=path`**: `/var/cache/<path>`.
- **`ConfigurationDirectory=path`**: `/etc/<path>`.

Useful for cleanup hygiene if we redesign storage layout. Not required
for first pass.

### `systemd-analyze security`

`systemd-analyze security <unit>` produces an exposure score per unit
(lower = more secure). Output lists each directive with a ✓/✗.
Useful as:
- Regression check (record baseline, ensure score drops after refactor).
- Discovery tool ("which directives haven't I set?").

Baseline scores (to capture during test plan):
- `left4me-server@1.service` before refactor
- `left4me-web.service` before refactor

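
For the regression check, the score can be pulled out of the tool's text output and compared against the recorded baseline. The sketch below assumes the summary line has the shape `Overall exposure level for <unit>: <score> <RATING>`; the function names are hypothetical, and the regex should be adjusted if the systemd version on the host prints something different.

```python
import re

# Assumed shape of the summary line printed by `systemd-analyze security <unit>`.
_EXPOSURE_RE = re.compile(r"Overall exposure level for (\S+): (\d+(?:\.\d+)?) (\w+)")


def parse_exposure(output):
    """Extract (unit, score, rating) from the tool's text output."""
    match = _EXPOSURE_RE.search(output)
    if match is None:
        raise ValueError("no overall exposure line found")
    unit, score, rating = match.groups()
    return unit, float(score), rating


def regressed(baseline, current, tolerance=0.0):
    """True if the exposure score got worse (higher) than the baseline."""
    return current > baseline + tolerance
```

The captured baseline scores can then live next to the deploy tests, with `regressed()` failing the run if a refactor pushes exposure back up.
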
### Composability lookups

The systemd docs define predefined syscall sets that are worth knowing:

- **`@privileged`** (syscall set) ⊃ `@chown`, `@clock`, `@module`,
  `@raw-io`, `@reboot`, `@swap`, plus individual calls such as
  `setuid`. List the exact contents with
  `systemd-analyze syscall-filter @privileged`.
- **`@system-service`** is the recommended base for "I want a normal
  service to work."
- Subtracting `~@privileged` is broad; `~@debug @mount @raw-io` is
  surgical.

## Section 3: Application-level options

### AppArmor profile for srcds

If systemd directives leave gaps, an AppArmor profile would let us
deny specific paths or operations beyond what systemd's directives
cover. E.g., deny socket families or types for srcds via `network`
rules (check whether the installed AppArmor can also mediate by
address, which is not a given); or "deny mounting" beyond
`SystemCallFilter`.

Effort:
- Confirm AppArmor is active on the host (`aa-status`); enable it via
  kernel cmdline + boot config if it is not.
- Write a profile (e.g., `/etc/apparmor.d/usr.bin.srcds_linux`).
- Reference via systemd `AppArmorProfile=` per unit.

Skip for the first pass; revisit if test results show the systemd
directives alone leave a gap.

### landlock for the web app

Python web app could call `landlock_create_ruleset` / `landlock_add_rule`
/ `landlock_restrict_self` via ctypes. Restricts FS access at runtime.

For us:
- Could restrict gunicorn to `/var/lib/left4me/` + `/etc/left4me/web.env`
  + `/opt/left4me/.venv` + `/tmp`.
- Symmetric to `TemporaryFileSystem=` + `Bind*` but at the
  application layer (no systemd reach).

Skip; systemd directives are simpler. Reconsider if we move to a
DynamicUser-style world later.

### File-integrity tooling (Aide, Tripwire)

Out of scope for prevention; useful for detection. Not in this design.

### Custom seccomp profile (bypassing systemd)

The web app could call `seccomp(2)` from inside Python via libseccomp
+ ctypes to tighten its own filter beyond what systemd applies.
Symmetric to landlock; skip for the same reason.

## Section 4: Per-defense mapping

For each defense from the threat model, the primitives that implement
it, in priority order:

### D1: Gameserver RCE cannot exfiltrate DB or `web.env`

| Primitive | Strength | Notes |
|---|---|---|
| `TemporaryFileSystem=/var/lib /etc` + minimal bind set | Strong | The files simply aren't in the unit's FS view. ENOENT, not EACCES. |
| 3-user split (DB owned by `l4d2-web`) | Strong | Kernel-enforced; survives unit-config errors. |
| `BindReadOnlyPaths=/dev/null:/var/lib/left4me/left4me.db` | Medium | Masks the path; brittle (paths can move). |
| Filesystem ACLs (DB mode 0600) | Weak | While srcds and web share the `left4me` identity, any mode that lets web read the DB also lets srcds read it; only fixed by uid split. |

**Composition chosen:** `TemporaryFileSystem=` + Bind* (primary).
3-user split as defense-in-depth or deferred.

### D2: Gameserver RCE cannot ptrace web app or peers

| Primitive | Strength | Notes |
|---|---|---|
| `SystemCallFilter=~@debug` | Strong | Blocks `ptrace`. `process_vm_readv/writev` must be denied by name if `@debug` doesn't include them on the target systemd. |
| `kernel.yama.ptrace_scope=2` | Strong | Belt-and-braces at the kernel level. |
| `CapabilityBoundingSet=` empty | Strong | No CAP_SYS_PTRACE. |
| `PrivateUsers=true` | Strong (verify) | Cross-userns ptrace requires CAP_SYS_PTRACE, but tracer and target both map to `left4me`; confirm in test that attach fails. |
| 3-user split | Strong | Different uids; same-uid path doesn't exist. |

**Composition chosen:** All four (syscall + yama + caps + userns)
together; they compose redundantly.

### D3: Gameserver RCE cannot use sudo helpers

| Primitive | Strength | Notes |
|---|---|---|
| `NoNewPrivileges=true` | Strong | Blocks sudo's setuid. Already set on server@. |
| `PrivateUsers=true` | Strong | sudo across userns boundary impossible. |
| Sudoers grants scoped to `l4d2-web` (uid split) | Strong | Different uid means sudo grant doesn't apply. |
| `RestrictSUIDSGID=true` | Strong | Already set. |

**Composition chosen:** NoNewPrivileges (already) + PrivateUsers (new)
+ RestrictSUIDSGID (already). What a 3-user split would add here is
already covered by NNP + PrivateUsers; uid split would be
defense-in-depth.

### D4: Web app RCE cannot ptrace gameservers

| Primitive | Strength | Notes |
|---|---|---|
| `SystemCallFilter=~@debug` on **web** | Strong | Symmetric to D2 but applied to web. |
| `kernel.yama.ptrace_scope=2` | Strong | System-wide, helps both directions. |
| 3-user split | Strong | Different uids. |

**Composition chosen:** SystemCallFilter on web + yama=2 system-wide.
PrivateUsers cannot be applied to web (sudo incompatibility). 3-user
split as defense-in-depth or deferred.

### D5: Cross-server contamination

Each `left4me-server@<n>.service` is a separate unit instance. With
`PrivateUsers=true`, each gets its own user namespace; together with the
seccomp and Yama blocks from D2, attaching to a peer instance is
expected to fail. With `TemporaryFileSystem=` and per-instance
`BindPaths=/var/lib/left4me/runtime/%i`, neither instance can read the
other's `runtime/<n>/`.

**Composition chosen:** PrivateUsers + per-instance Bind* (above).
Per-instance uids out of scope.

### D6: Persistent compromise of `/opt/left4me/src/` blocked from gameserver

Already covered by `ProtectSystem=strict` on server@.service. With
`TemporaryFileSystem=/opt`, the path simply isn't visible to srcds.
**Stronger and redundant; both can stay.**

### D7: Defenses survive a unit-config refactor in the wrong direction

`deploy/tests/test_deploy_artifacts.py` asserts the directives' presence
in the deployed unit. Add hardening invariants as test cases. Survives
because the test fails CI before deploy.

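
A hardening invariant for D7 can be as small as "these directives must be present with these values". The sketch below is one possible shape for such a check, not the contents of the existing test file; the function names and the `REQUIRED_ON_SERVER` selection are assumptions, and empty-assignment resets are deliberately not modeled.

```python
def parse_service_directives(unit_text):
    """Collect directives from the [Service] section; repeated keys accumulate."""
    directives = {}
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank lines and whole-line comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, sep, value = line.partition("=")
        if sep and section == "Service":
            directives.setdefault(key.strip(), []).append(value.strip())
    return directives


# Hypothetical invariant set for server@; extend once the test plan settles.
REQUIRED_ON_SERVER = {
    "NoNewPrivileges": "true",
    "PrivateUsers": "true",
    "ProtectProc": "invisible",
    "CapabilityBoundingSet": "",
}


def violated_invariants(unit_text, required):
    """Return the sorted directive names that are missing or have another value."""
    found = parse_service_directives(unit_text)
    return sorted(key for key, want in required.items() if want not in found.get(key, []))
```

In a pytest case this reduces to asserting that `violated_invariants(rendered_unit, REQUIRED_ON_SERVER)` is an empty list, which gives a readable failure naming the directive that went missing.
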
## Section 5: Candidate composition

**For testing, not commitment.** Test plan validates each piece.

### `left4me-server@.service`

```ini
[Service]
User=left4me
Group=left4me

# (existing)
Type=simple
WorkingDirectory=-/var/lib/left4me/runtime/%i/merged/left4dead2
EnvironmentFile=/etc/left4me/host.env
EnvironmentFile=/var/lib/left4me/instances/%i/instance.env
ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
ExecStart=/var/lib/left4me/runtime/%i/merged/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i
Restart=on-failure
RestartSec=5

# Resource control (existing)
Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0

# NOTE: systemd only treats whole lines starting with # or ; as comments,
# so the NEW / "was ..." markers below sit on their own lines.

# Hardening: identity (existing)
NoNewPrivileges=true
RestrictSUIDSGID=true

# Hardening: namespaces
PrivateTmp=true
PrivateDevices=true
PrivateIPC=true
ProtectHome=true
# NEW
PrivateUsers=true

# Hardening: filesystem view
# NEW
TemporaryFileSystem=/var/lib /etc /opt /home /root /srv /mnt /media
# was ReadOnlyPaths
BindReadOnlyPaths=/var/lib/left4me/installation
BindReadOnlyPaths=/var/lib/left4me/overlays
# NEW
BindReadOnlyPaths=/etc/left4me/host.env
BindReadOnlyPaths=/etc/ssl /etc/ca-certificates
BindReadOnlyPaths=/etc/resolv.conf /etc/nsswitch.conf /etc/alternatives
# was ReadWritePaths
BindPaths=/var/lib/left4me/runtime/%i
ProtectSystem=strict
# (remove old ReadOnlyPaths= and ReadWritePaths= lines; superseded)

# Hardening: /proc, /sys, kernel (all NEW except LockPersonality)
ProtectProc=invisible
ProcSubset=pid
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectClock=true
ProtectControlGroups=true
ProtectHostname=true
LockPersonality=true

# Hardening: caps + syscall (all NEW)
CapabilityBoundingSet=
AmbientCapabilities=
# If srcds is a 32-bit ELF on an amd64 host, this needs "native x86".
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@debug @mount @raw-io @reboot @swap @cpu-emulation @obsolete @privileged

# Hardening: network (NEW; AF_UNIX for journald)
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX

# Hardening: namespaces, realtime, IPC (all NEW)
RestrictNamespaces=true
RestrictRealtime=true
RemoveIPC=true
KeyringMode=private
UMask=0027

# Deferred until test (MAY break sourcemod / Source engine; test first):
# MemoryDenyWriteExecute=true
```

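
systemd unit files only treat a line as a comment when it starts with `#` or `;`, so an annotation appended after a value becomes part of the value. A small lint can catch annotation markers that leak into a rendered unit. This is a heuristic sketch with a hypothetical function name: it flags any assignment whose value contains ` #`, which can also match a legitimate argument.

```python
def trailing_comment_lines(unit_text):
    """Return (line_number, text) for assignments whose value contains ' #'."""
    hits = []
    for number, raw in enumerate(unit_text.splitlines(), start=1):
        line = raw.strip()
        # Skip blanks, whole-line comments and section headers.
        if not line or line.startswith(("#", ";", "[")):
            continue
        _key, sep, value = line.partition("=")
        if sep and " #" in value:
            hits.append((number, line))
    return hits
```

Wiring this into the deploy tests means a marker such as `# NEW` copied from a spec into a live unit fails CI instead of failing at unit load.
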
### `left4me-web.service`

```ini
[Service]
User=left4me
Group=left4me

# (existing)
Type=simple
WorkingDirectory=/opt/left4me/src
Environment=HOME=/var/lib/left4me PATH=/opt/left4me/.venv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
EnvironmentFile=/etc/left4me/host.env
EnvironmentFile=/etc/left4me/web.env
ExecStart=/opt/left4me/.venv/bin/gunicorn --workers ... --threads ... --bind 127.0.0.1:8000 'l4d2web.app:create_app()'
Restart=on-failure
RestartSec=3

# NOTE: systemd only treats whole lines starting with # or ; as comments,
# so annotations below sit on their own lines.

# Hardening
PrivateTmp=true
# tightened from =full
ProtectSystem=strict
ProtectHome=true
# web needs broad write access there
ReadWritePaths=/var/lib/left4me
# NoNewPrivileges intentionally NOT set (sudo)
# PrivateUsers intentionally NOT set (sudo)

# /proc + kernel hardening (all NEW; intended to be sudo-compatible)
# Verify sudo still works: systemd.exec documents that several of these
# imply NoNewPrivileges=yes for services that lack CAP_SYS_ADMIN.
ProtectProc=invisible
ProcSubset=pid
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectKernelLogs=true
ProtectClock=true
ProtectControlGroups=true
ProtectHostname=true
LockPersonality=true

# Syscall filter (all NEW): allow @system-service minus debug-class; keep
# @privileged because sudo needs setuid, chown, etc.
SystemCallArchitectures=native
SystemCallFilter=@system-service
SystemCallFilter=~@debug @mount @raw-io @reboot @swap @cpu-emulation @obsolete

# Network (NEW)
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX

# Misc hygiene (all NEW)
RestrictRealtime=true
RestrictNamespaces=true
RemoveIPC=true
UMask=0027

# Deferred for sudo-removal future work:
# NoNewPrivileges=true
# CapabilityBoundingSet=
# PrivateUsers=true
```

### Host sysctl

`/etc/sysctl.d/99-left4me.conf` (or merge into existing):
```
kernel.yama.ptrace_scope=2
```

System-wide. Means: even if a unit-level config slips, host-level
ptrace is admin-only. Cost: zero for our use case (no debugging in
prod).

## Section 6: Trade-offs and known sharp edges

To verify in the test plan:

1. **`PrivateUsers=true` + `+`-prefixed ExecStartPre**: expected to
   work (the `+` runs outside the unit's namespaces). Sharp if it
   doesn't: the overlay mount would fail and srcds wouldn't start.
2. **`TemporaryFileSystem=/etc` and missing files**: srcds and its
   dependencies (libstdc++ runtime, libssl, libcurl) may read files
   from `/etc` we haven't bound. Watch journalctl for ENOENT during
   first start.
3. **`SystemCallFilter=~@privileged` and Source engine**: srcds is C++
   and uses syscalls beyond the obvious. A `~@privileged` may trip
   something. Mitigation: test with `SystemCallLog=` instead of
   `SystemCallFilter=` first; observe what would have been blocked;
   then narrow.
4. **`MemoryDenyWriteExecute=true` and sourcemod**: plugins are
   compiled to bytecode, but the SourcePawn VM ships an x86 JIT backend
   that needs writable + executable memory. Expect breakage if that
   backend is active. Test before enabling.
5. **`RestrictAddressFamilies=` without AF_UNIX**: journald socket
   needs it. Always include AF_UNIX.
6. **`ProcSubset=pid` and Python**: gunicorn shouldn't break (uses
   /proc/self/* + signal-based ipc). Verify.
7. **sysctl `kernel.yama.ptrace_scope=2`**: blocks unprivileged
   `gdb` / `strace -p` attach against any running service. If you need
   to debug, attach as root (which holds `CAP_SYS_PTRACE`) rather than
   lowering the sysctl.
8. **`ProtectSystem=strict` on web**: was `=full`. Tighter; might
   break a write the web app does to a path outside `/var/lib/left4me`.
   Audit `l4d2web/*` for `os.makedirs` or `open(...'w')` outside that
   root.

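
The audit of `l4d2web/*` for filesystem writes can be partly mechanized with Python's `ast` module. The sketch below is a starting point rather than a complete audit: the function name is hypothetical, it only recognizes `open()` with a literal write mode and calls named `makedirs` / `mkdir`, and it cannot tell which path is being written.

```python
import ast

# Mode characters that imply writing in open()'s mode string.
WRITE_MODE_CHARS = set("wax+")


def find_write_calls(source, filename="<source>"):
    """Return sorted (line, description) for calls that may write to disk.

    Flags any call to an attribute named makedirs or mkdir, and open(...)
    whose mode is a string literal containing w, a, x or +. Modes built at
    runtime are not resolved.
    """
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr in ("makedirs", "mkdir"):
            hits.append((node.lineno, func.attr))
        elif isinstance(func, ast.Name) and func.id == "open":
            mode = node.args[1] if len(node.args) >= 2 else None
            for keyword in node.keywords:
                if keyword.arg == "mode":
                    mode = keyword.value
            if (
                isinstance(mode, ast.Constant)
                and isinstance(mode.value, str)
                and WRITE_MODE_CHARS & set(mode.value)
            ):
                hits.append((node.lineno, f"open(mode={mode.value!r})"))
    return sorted(hits)
```

Each hit still needs a manual look at the path argument; the point is to shrink the list of call sites to review before switching web to `ProtectSystem=strict`.
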
## Open questions for the implementer

(After test plan results come back, finalize these.)

1. Do we adopt `MemoryDenyWriteExecute=true` if it works for srcds?
   (Probably yes, defense-in-depth at low cost.)
2. Do we set `SocketBindAllow=` on srcds to lock the port range?
   (Depends on whether `instance.env` exposes the range cleanly to a
   unit directive.)
3. Do we deploy AppArmor profiles as a follow-up?
   (Probably no: operational complexity exceeds the marginal gain on
   single-host infra.)
4. Do we keep both `BindReadOnlyPaths=` and the legacy
   `ReadOnlyPaths=` declarations, or simplify? (Simplify: use Bind*
   exclusively once `TemporaryFileSystem=` is in place.)
5. Do we proceed with 3-user split as a follow-up, or close the spec
   as "addressed by hardening"? Depends on operator's residual-risk
   tolerance after Phase A lands and we observe.

## Pointers

- Threat model: `docs/superpowers/specs/2026-05-15-hardening-threat-model.md`
- Test plan: `docs/superpowers/specs/2026-05-15-hardening-test-plan.md`
- Original uid-split spec (still open): `docs/superpowers/specs/2026-05-15-user-uid-split-design.md`
- Live unit source (ckn-bw reactor): `~/Projekte/ckn-bw/bundles/left4me/metadata.py:150+`
- Reference units (deploy-dir-rethink reference-only): `deploy/files/usr/local/lib/systemd/system/`
- systemd docs (latest, systemd 256+ on Trixie):
  `man systemd.exec`, `man systemd.unit`, `man systemd-analyze`.
- L4D2 / Source engine docs:
  - SourcePawn (bytecode VM; check whether its x86 JIT is active): https://wiki.alliedmods.net/SourcePawn
  - srcds is a Source 2007 engine binary; closed-source, expect surprises.