systemd's `+` Exec prefix removes sandbox/credentials but does NOT detach from the unit's per-service mount namespace (created by PrivateTmp/Protect*). The Python interpreter for the helper was launched inside that namespace, and even though the helper internally nsenter'd into PID 1 for the umount syscall, the calling Python process itself never left the unit's namespace. Its existence pinned the namespace alive, which kept the slave mount tree alive, which made PID 1's umount return EBUSY for the entire duration of the helper's run. The umount could only succeed once the helper had exited. This was verified empirically by polling /proc/*/ns/mnt during stop: the only PID holding the dying namespace was the helper itself.

Wrap both ExecStartPre and ExecStopPost with `/usr/bin/nsenter --mount=/proc/1/ns/mnt --` so the helper Python interpreter runs in PID 1's mount namespace from the start. With the helper out of the unit's namespace, umount succeeds on the first try once the cgroup empties. Reset went from ~25 s with retry/lazy-fallback workarounds to ~0.5 s clean.

Knock-on cleanups:
- Helper drops internal nsenter for the syscalls (already in PID 1's namespace), and drops the eager-retry loop + lazy-umount fallback + inner work_inner retry (no race left to ride out).
- Revert TimeoutStopSec=60s back to 15s.
- Tests updated to expect the new argv shapes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
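
The /proc/*/ns/mnt polling described in the commit message can be sketched roughly as follows. This is a hypothetical diagnostic, not the script that was actually used: the function names `mntns_of` and `mntns_holders` are invented for illustration, and the sketch assumes a Linux /proc and enough privilege (typically root) to read other processes' namespace links.

```shell
#!/bin/sh
# Hypothetical helpers (names are assumptions, not part of the repository).

# Print the mount-namespace link of PID $1, e.g. "mnt:[4026531840]".
mntns_of() {
    readlink "/proc/$1/ns/mnt"
}

# Print every PID whose mount namespace link equals $1, one per line.
mntns_holders() {
    target=$1
    for ns in /proc/[0-9]*/ns/mnt; do
        # readlink fails for processes that already exited or that we are
        # not allowed to inspect; skip those instead of aborting.
        link=$(readlink "$ns" 2>/dev/null) || continue
        if [ "$link" = "$target" ]; then
            pid=${ns#/proc/}   # strip the leading "/proc/"
            pid=${pid%%/*}     # strip the trailing "/ns/mnt"
            printf '%s\n' "$pid"
        fi
    done
}

# Possible use, as root, with $UNIT set to the instantiated unit name:
#   ns=$(mntns_of "$(systemctl show -p MainPID --value "$UNIT")")
#   systemctl stop "$UNIT" &
#   while sleep 0.2; do mntns_holders "$ns"; echo ---; done
```

Capturing the namespace link before the stop matters because the main PID is gone once the cgroup empties; any PID still printed after that point is what keeps the namespace (and its slave mounts) alive.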
[Unit]
Description=left4me server instance %i
After=network-online.target
Wants=network-online.target
# Bound the restart loop. Without these, a persistent ExecStartPre or
# ExecStart failure spins indefinitely. Note: these are [Unit]-section
# directives (systemd 230+), not [Service].
StartLimitBurst=5
StartLimitIntervalSec=60s

[Service]
Type=simple
User=left4me
Group=left4me
EnvironmentFile=/etc/left4me/host.env
EnvironmentFile=/var/lib/left4me/instances/%i/instance.env
# `-` prefix: chdir failure is non-fatal. systemd applies WorkingDirectory
# before every Exec line, including ExecStartPre, but the merged dir only
# exists once ExecStartPre's overlay mount succeeds. With `-`, ExecStartPre
# runs without the chdir (cwd doesn't matter for the mount helper);
# ExecStart then re-applies WorkingDirectory after the mount and finds
# the dir.
WorkingDirectory=-/var/lib/left4me/runtime/%i/merged/left4dead2
# Single source of truth for the kernel-overlayfs mount lifecycle: the web
# app's start_instance only stages cfg files and asks systemd to enable and
# start this unit; the actual `mount -t overlay` lives here so reboot
# auto-start works the same as a UI-driven start. ExecStopPost mirrors it
# so the unmount lives in the same place; no Python-side _mounter is needed
# in stop/delete/reset paths. Both helper verbs are idempotent.
#
# `+` prefix runs the helper with full privileges (root, no sandbox).
# Required because the unit has NoNewPrivileges=true, which blocks sudo's
# setuid escalation, and the helper itself needs root for the mount/umount
# syscalls.
#
# `nsenter --mount=/proc/1/ns/mnt --` runs the helper Python interpreter
# in PID 1's mount namespace. Without this, the `+` prefix removes the
# sandbox/credentials but does NOT detach from the unit's per-service
# mount namespace (created by PrivateTmp/Protect*), so the helper
# process itself would hold a reference to that namespace, keeping the
# slave-mount tree alive after the cgroup empties, and umount in PID 1's
# namespace would return EBUSY for as long as the helper ran. Putting
# nsenter at the unit level (as opposed to inside the helper, where only
# the umount syscall escaped) is what actually frees the namespace. Once
# the helper is in PID 1's namespace, ExecStopPost's umount succeeds
# on the first try with no retry/race window. ExecStopPost (not
# ExecStop) so unmount runs after the cgroup is cleared; ExecStop runs
# while srcds is still alive and would EBUSY.
ExecStartPre=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay mount %i
# Run from the merged overlay, NOT installation/. srcds_run is a shell
# script that `cd`s to its own dirname before exec'ing srcds_linux, so the
# script's path determines where the engine reads gameinfo.txt and addons
# from; WorkingDirectory has no effect. Invoking installation/srcds_run
# would resolve everything against the lower layer and never see
# overlay-provided plugins (Metamod/SourceMod) or cfgs (zonemod, confogl).
ExecStart=/var/lib/left4me/runtime/%i/merged/srcds_run -game left4dead2 +hostport ${L4D2_PORT} $L4D2_ARGS
ExecStopPost=+/usr/bin/nsenter --mount=/proc/1/ns/mnt -- /usr/local/libexec/left4me/left4me-overlay umount %i
Restart=on-failure
RestartSec=5

# Resource control baseline; see docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md
Slice=l4d2-game.slice
Nice=-5
IOSchedulingClass=best-effort
IOSchedulingPriority=4
OOMScoreAdjust=-200
MemoryHigh=1.5G
MemoryMax=2G
TasksMax=256
LimitNOFILE=65536
KillSignal=SIGINT
TimeoutStopSec=15s
LogRateLimitIntervalSec=0

# Hardening (unchanged from previous baseline).
NoNewPrivileges=true
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
ProtectSystem=strict
ReadOnlyPaths=/var/lib/left4me/installation /var/lib/left4me/overlays
ReadWritePaths=/var/lib/left4me/runtime/%i
RestrictSUIDSGID=true
LockPersonality=true

[Install]
WantedBy=multi-user.target
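
One way to confirm from the host namespace that ExecStopPost really released the overlay is to look for the mount point in /proc/self/mountinfo after a stop. The sketch below is an illustration only: `is_mountpoint` and `wait_unmounted` are hypothetical names, and it assumes the overlay is mounted at /var/lib/left4me/runtime/<instance>/merged, which the unit's paths suggest but do not state outright.

```shell
#!/bin/sh
# Hypothetical checks (names are assumptions, not part of the repository).

# Succeed if $1 is a mount point in the caller's mount namespace.
# Field 5 of /proc/self/mountinfo is the mount point path.
is_mountpoint() {
    awk -v mp="$1" '$5 == mp { found = 1 } END { exit !found }' \
        /proc/self/mountinfo
}

# Poll until $1 is no longer a mount point, checking at most $2 times,
# one second apart. Returns 0 once unmounted, 1 if it is still mounted.
wait_unmounted() {
    tries=$2
    while is_mountpoint "$1"; do
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] || return 1
        sleep 1
    done
    return 0
}

# Possible use after stopping an instance (instance name is a placeholder):
#   wait_unmounted "/var/lib/left4me/runtime/$INSTANCE/merged" 5 \
#       || echo "overlay still mounted" >&2
```

Running the check from a root shell on the host, rather than from inside the unit, keeps it in PID 1's mount namespace, which is the view that matters for the umount.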