left4me/deploy/files/usr/local/lib/systemd/system/left4me-server@.service
mwiegand 5eac51a93e
fix(deploy): wrap overlay helper with nsenter so it doesn't pin the unit's mount namespace
systemd's `+` Exec prefix removes sandbox/credentials but does NOT
detach from the unit's per-service mount namespace (created by
PrivateTmp/Protect*). The Python interpreter for the helper was
launched inside that namespace, and even though the helper internally
nsenter'd into PID 1 for the umount syscall, the calling Python
process itself never left the unit's namespace. Its existence pinned
the namespace alive, which kept the slave mount tree alive, which
made PID 1's umount return EBUSY for the entire duration of the
helper's run. The mount became unmountable the moment the helper
exited — empirically verified by polling /proc/*/ns/mnt during stop:
the only PID holding the dying namespace was the helper itself.
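
The polling described above can be approximated with a small script. This is a minimal sketch, not the tool actually used: the function names `mount_ns_holders` and `foreign_holders` are hypothetical, and it assumes the script runs as root so that every `/proc/<pid>/ns/mnt` link is readable.

```python
import os
from collections import defaultdict


def mount_ns_holders(proc_root="/proc"):
    """Map each mount-namespace id (e.g. 'mnt:[4026531841]') to the PIDs in it."""
    holders = defaultdict(list)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as 'self' or 'meminfo'
        try:
            ns_id = os.readlink(os.path.join(proc_root, entry, "ns", "mnt"))
        except OSError:
            continue  # process exited, or we lack permission to inspect it
        holders[ns_id].append(int(entry))
    return {ns: sorted(pids) for ns, pids in holders.items()}


def foreign_holders(proc_root="/proc"):
    """Return only the namespaces that differ from PID 1's."""
    holders = mount_ns_holders(proc_root)
    host_ns = next((ns for ns, pids in holders.items() if 1 in pids), None)
    return {ns: pids for ns, pids in holders.items() if ns != host_ns}
```

Calling `foreign_holders()` in a loop while the unit stops should show which PIDs still sit in the unit's namespace; a namespace can also be pinned by bind mounts or open file descriptors, which this sketch does not detect.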

Wrap both ExecStartPre and ExecStopPost with `/usr/bin/nsenter
--mount=/proc/1/ns/mnt --` so the helper Python interpreter runs in
PID 1's mount namespace from the start. With the helper out of the
unit's namespace, umount succeeds first try once the cgroup empties.
Reset went from ~25 s with retry/lazy-fallback workarounds to ~0.5 s
clean.

Knock-on cleanups:
- Helper drops internal nsenter for the syscalls (already in PID 1's
  namespace), and drops the eager-retry loop + lazy-umount fallback +
  inner work_inner retry (no race left to ride out).
- Revert TimeoutStopSec=60s back to 15s.
- Tests updated to expect the new argv shapes.
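
A check on the unit-level argv shape might look roughly like the following. This is an illustrative sketch only, not the project's actual test code: `exec_argv` and `is_wrapped` are hypothetical helpers, and the parsing covers just the simple single-line Exec entries used in this unit.

```python
import shlex

# The wrapper every privileged helper invocation is expected to start with.
NSENTER_PREFIX = ["/usr/bin/nsenter", "--mount=/proc/1/ns/mnt", "--"]


def exec_argv(unit_text, key):
    """Return (prefix_chars, argv) for the first `key=` line in a unit file."""
    for line in unit_text.splitlines():
        line = line.strip()
        if line.startswith(key + "="):
            value = line[len(key) + 1:]
            # systemd's special executable prefixes (such as `+` or `-`)
            # come before the executable path; split them off first.
            stripped = value.lstrip("@-:+!")
            prefix = value[: len(value) - len(stripped)]
            return prefix, shlex.split(stripped)
    raise KeyError(key)


def is_wrapped(unit_text, key):
    """True if the Exec line runs with `+` and is wrapped in nsenter."""
    prefix, argv = exec_argv(unit_text, key)
    return "+" in prefix and argv[: len(NSENTER_PREFIX)] == NSENTER_PREFIX
```

With the unit file's text loaded into a string, `is_wrapped(text, "ExecStartPre")` and `is_wrapped(text, "ExecStopPost")` would both be expected to hold, while `ExecStart` stays unwrapped.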

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 15:13:59 +02:00


[Unit]
Description=left4me server instance %i
After=network-online.target
Wants=network-online.target
# Bound the restart loop. Without these, a persistent ExecStartPre or
# ExecStart failure spins indefinitely. Note: these are [Unit]-section
# directives (systemd 230+), not [Service].
StartLimitBurst=5
StartLimitIntervalSec=60s
[Service]
Type=simple
User=left4me
Group=left4me
EnvironmentFile=/etc/left4me/host.env
EnvironmentFile=/var/lib/left4me/instances/%i/instance.env
# `-` prefix: chdir failure is non-fatal. systemd applies WorkingDirectory
# before every Exec line, including ExecStartPre, but the merged dir only
# exists once ExecStartPre's overlay mount succeeds. With `-`, ExecStartPre
# runs without the chdir (cwd doesn't matter for the mount helper);
# ExecStart then re-applies WorkingDirectory after the mount and finds the dir.
WorkingDirectory=-/var/lib/left4me/runtime/%i/merged/left4dead2
# Single source of truth for the kernel-overlayfs mount lifecycle: the web
# app's start_instance only stages cfg files and asks systemd to enable+
# start this unit; the actual `mount -t overlay` lives here so reboot
# auto-start works the same as a UI-driven start. ExecStopPost mirrors it
# so the unmount lives in the same place — no Python-side _mounter needed
# in stop/delete/reset paths. Both helper verbs are idempotent.
#
# `+` prefix runs the helper with full privileges (root, no sandbox),
# ignoring User=/Group=. Required because the unit has NoNewPrivileges=true,
# which blocks sudo's setuid escalation, and the helper itself needs root
# for the mount/umount syscalls.
#
# `nsenter --mount=/proc/1/ns/mnt --` runs the helper Python interpreter
# in PID 1's mount namespace. Without this, the `+` prefix removes the
# sandbox/credentials but does NOT detach from the unit's per-service
# mount namespace (created by PrivateTmp/Protect*) — so the helper
# process itself would hold a reference to that namespace, keeping the
# slave-mount tree alive after the cgroup empties, and umount in PID 1
# would return EBUSY for as long as the helper ran. Putting nsenter at
# the unit level (as opposed to inside the helper, where only the
# umount syscall escaped) is what actually frees the namespace. Once
# the helper is in PID 1's namespace, ExecStopPost's umount succeeds
# on the first try with no retry/race window. ExecStopPost (not
# ExecStop) so the unmount runs after the cgroup is cleared; ExecStop runs
# while srcds is still alive, so an umount there would fail with EBUSY.
ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
# Run from the merged overlay, NOT installation/. srcds_run is a shell
# script that `cd`s to its own dirname before exec'ing srcds_linux, so the
# binary's path determines where the engine reads gameinfo.txt and addons
# from — WorkingDirectory has no effect. Invoking installation/srcds_run
# would resolve everything against the lower layer and never see overlay-
# provided plugins (Metamod/SourceMod) or cfgs (zonemod, confogl).
ExecStart=/var/lib/left4me/runtime/%i/merged/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i
Restart=on-failure
RestartSec=5
# Resource control baseline — see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0
# Hardening (unchanged from previous baseline).
NoNewPrivileges=true
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
ProtectSystem=strict
ReadOnlyPaths=/var/lib/left4me/installation /var/lib/left4me/overlays
ReadWritePaths=/var/lib/left4me/runtime/%i
RestrictSUIDSGID=true
LockPersonality=true
[Install]
WantedBy=multi-user.target