Switch lifecycle verbs from systemctl start/stop to enable --now / disable --now (servers survive host reboot via WantedBy= symlinks), plus a periodic state poller for runtime drift (OOM kills, manual systemctl ops, exhausted Restart=on-failure). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
14 KiB
l4d2 server lifecycle: reboot-safe + drift reconciliation — design
Date: 2026-05-09 Status: design
Summary
Make L4D2 server instances survive a host reboot by switching their lifecycle verbs from systemctl start/stop to systemctl enable --now/disable --now. Pair this with a periodic background poller that refreshes Server.actual_state so out-of-band state changes (OOM kills, manual systemctl stop, crashes that exhaust Restart=on-failure) no longer leave the web UI showing stale "running" indicators.
Goals
- An L4D2 server started via the web UI (or
l4d2ctl start) automatically comes back up after a host reboot, with no operator action. - The web app's
Server.actual_stateconverges to systemd's actual state within ~30 seconds of any out-of-band change. - The single-source-of-truth for "this server should be running" lives in systemd's wants-symlinks, not in a SQLite row that systemd has no awareness of.
- Migration from the existing
systemctl start-based fleet is a no-op: the next stop+start cycle through the UI converts each server to the enable-based model.
Non-goals
- Auto-restart on detected drift. When the poller observes
desired_state=runningbutactual_state=stopped, this spec does not re-enqueue a start job. That's a v2 UX/policy decision. - UI surfacing of stale-state warnings. Once the poller is reliable, the dashboard could show "DB believes X, but actual_state was last refreshed N seconds ago." Out of scope.
- Reconciliation of orphan systemd units. Units enabled on disk but not represented by any
Serverrow (e.g., from a crashed delete) — separate cleanup spec. - Per-server poller intervals. A single global cadence is sufficient.
- Replacing
Restart=on-failurewith anything more elaborate. The unit's existing restart policy stays. - Reactive-style state propagation. No SSE/websocket pushes to the UI when actual_state changes. The next page render reads the fresh value from the DB.
Background
Today's behavior, confirmed by forensics on ckn@10.0.4.128 after the operator ran sudo systemctl poweroff at 11:48:02 CEST:
- The
left4me-systemctlhelper (deploy/files/usr/local/libexec/left4me/left4me-systemctl) accepts the verbsstart,stop, andshow, each invoking the literalsystemctlaction. l4d2host/service_control.pyexposesstart_service(name)andstop_service(name)that buildsystemctl_command("start"/"stop", name).l4d2host/instances.pystart_instanceandstop_instancecall those functions.systemctl startis a transient activation. systemd creates noWantedBy=multi-user.target.wants/symlink, so the unit doesn't auto-start on next boot.- After the host poweroff at 11:48:02, both running instances were cleanly shut down. The host rebooted;
left4me-web.servicecame back (it isenabled); the game instances did not. - The web app's
Server.actual_stateis only ever written byrefresh_server_actual_state_after_job()inl4d2web/services/job_worker.py:581, called solely after a job completes. With no jobs in flight after the reboot, the row'sactual_state="running"from yesterday remained the displayed truth.
Design
Part A — Switch lifecycle verbs to enable --now / disable --now
Helper script (deploy/files/usr/local/libexec/left4me/left4me-systemctl):
Rename the action verbs the helper accepts: drop start/stop, add enable/disable. The bodies become:
case "$action" in
enable) exec "$systemctl" enable --now "$unit" ;;
disable) exec "$systemctl" disable --now "$unit" ;;
show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
*) reject ;;
esac
The existing instance-name validation regex (currently lines 12–17) is unchanged — it constrains the <name> argument, not the action. The sudoers rule at deploy/files/etc/sudoers.d/left4me:
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-systemctl *
already passes any args; no sudoers update needed.
Python wrapper (l4d2host/service_control.py):
Rename start_service → enable_service and stop_service → disable_service. Each builds systemctl_command("enable", name) / systemctl_command("disable", name). The existing show_service is unchanged.
Instance lifecycle (l4d2host/instances.py):
start_instance— replace thestart_service(...)call withenable_service(...).stop_instance— replacestop_service(...)withdisable_service(...)._purge_instance(called bydelete_instanceandreset_instance) — replacestop_service(...)withdisable_service(...). A disabled-but-not-running unit'sdisable --nowis a no-op for the runtime + still removes any leftover wants-symlink, which is the desired idempotent behavior.
CLI surface (l4d2host/cli.py):
l4d2ctl start <name> and l4d2ctl stop <name> keep their names per the contract in AGENTS.md ("Host CLI write commands are fixed to: install, initialize, start, stop, delete"). The semantics now genuinely match the verb at the operator level: start = "ensure running, now and after reboot." Internal call paths route through start_instance → enable_service as renamed above.
Web facade (l4d2web/services/l4d2_facade.py):
Unchanged. Still invokes ["l4d2ctl", "start", ...] / ["l4d2ctl", "stop", ...].
Part B — Periodic state poller
Add a single background thread spawned alongside the existing job-worker threads in l4d2web/services/job_worker.py:start_job_workers:
def start_state_poller(app):
interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
thread = threading.Thread(
target=state_poller_loop,
args=(app, interval),
daemon=True,
name="left4me-state-poller",
)
thread.start()
def state_poller_loop(app, interval):
while True:
try:
with app.app_context():
poll_all_servers()
except Exception:
pass # never let a single failure kill the loop
time.sleep(interval)
def poll_all_servers():
with session_scope() as db:
active_server_ids = set(db.scalars(
select(Job.server_id).where(Job.state.in_(("queued", "running")))
).all())
server_ids = [
sid for sid in db.scalars(select(Server.id)).all()
if sid not in active_server_ids
]
for sid in server_ids:
try:
refresh_server_actual_state(sid)
except Exception:
pass
Why skip in-flight servers: the job worker's success path also calls refresh_server_actual_state. Both writers touching the same row at overlapping times produces no kernel-level race (SQLite WAL serializes writes), but a poller observing transient state mid-job — e.g., the brief window where the unit is being enabled but srcds hasn't fully bound the port yet — could write a misleading value that the worker's post-completion refresh then overwrites. Skipping is simpler than reasoning about the orderings.
Wiring in startup (l4d2web/app.py:create_app): call start_state_poller(app) adjacent to start_job_workers(app), gated by the same should_start_workers predicate (existing lines 84–88: JOB_WORKER_ENABLED && not TESTING && not _in_flask_cli_context()).
First-tick latency: the loop runs poll_all_servers() once before the first time.sleep(interval), so the DB catches up to systemd reality within milliseconds of app boot (one systemctl show per server). A separate startup-reconcile path is not needed.
Concurrency: the poller and the workers all use session_scope() (l4d2web/db.py:44–58) which commits-on-success / rolls-back-on-exception. SQLite WAL mode (configured by the deploy script per deploy-test-server.sh:188-198) handles concurrent reads + serialized writes. No new locking primitives.
Why both parts
Either part alone is insufficient:
- Part A alone survives reboots but doesn't catch OOM kills, manual
systemctl disable --now <unit>from a shell, or crashes that exhaustRestart=on-failure. The DB still drifts in those cases. - Part B alone keeps the DB honest but doesn't bring servers back after a reboot — the operator would still be looking at
actual_state=stoppedon a server they expected to come back, with the only recourse being to click start again.
Together: enable-based lifecycle keeps systemd as the source of truth; the poller keeps the DB honest about whatever systemd reports.
Migration on running hosts
Zero one-shot needed. After this lands, a server currently running via the old systemctl start (so: started but not enabled) keeps running through the deploy. The next time the operator clicks stop in the UI, systemctl disable --now runs — disable is a no-op for an already-not-enabled unit, but --now still kills the live process. The next start runs systemctl enable --now, which enables + starts. From that point on the unit survives reboot.
The poller's first tick after deploy will refresh every server's actual_state to whatever systemd reports — if the test box's two stale "running" rows still claim running but no unit is loaded, the next tick flips them to stopped.
Files changed / added
deploy/files/usr/local/libexec/left4me/left4me-systemctl (Part A — verbs)
l4d2host/service_control.py (Part A — rename)
l4d2host/instances.py (Part A — call new names)
l4d2host/tests/test_lifecycle.py (Part A — test updates)
l4d2host/tests/test_service_control.py (Part A — new direct unit tests, create if absent)
deploy/tests/test_deploy_artifacts.py (Part A — helper assertions)
l4d2web/services/job_worker.py (Part B — poller code)
l4d2web/app.py (Part B — wire start_state_poller)
l4d2web/config.py (Part B — STATE_POLLER_INTERVAL_SECONDS default)
l4d2web/tests/test_job_worker.py (Part B — poller tests)
Tests
Part A
deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args: update body assertions to expectenable)/disable)/show). Add an assertion thatenable)body containsenable --nowanddisable)body containsdisable --now. Update rejected-action examples (dropstart/stopsince they're no longer accepted).l4d2host/tests/test_lifecycle.py: every assertion that mocksrun_commandand inspects the systemctl-helper invocation needs the action token updated fromstart→enableandstop→disable. The_purge_instancepaths exercised bydelete_instanceandreset_instanceflip fromstoptodisable.- New direct unit tests in
l4d2host/tests/test_service_control.py(create the file if it doesn't exist already): exerciseenable_serviceanddisable_servicewith a mockedrun_commandand assert they emit["sudo", "-n", helper_path, "enable"|"disable", name].
Part B
l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server(new): seed twoServerrows withactual_state="unknown"; monkey-patchrefresh_server_actual_stateto record calls; run one iteration ofpoll_all_servers(); assert it was called once per server in any order.test_state_poller_skips_servers_with_inflight_jobs(new): seed aServerrow + aJobwithstate="running"for that server; runpoll_all_servers(); assertrefresh_server_actual_statewas NOT called for that server.test_state_poller_swallows_per_server_exceptions(new): makerefresh_server_actual_stateraise for one server; assert other servers are still polled and the loop function returns normally.test_state_poller_disabled_when_job_workers_disabled(new): create app withJOB_WORKER_ENABLED=False; assertstart_state_polleris not invoked (or that noleft4me-state-pollerthread is alive aftercreate_app).
CI sanity
pytest deploy/tests/ l4d2host/tests l4d2web/tests -q is green except the pre-existing unrelated test_deploy_script_has_safe_defaults_and_preserves_state (stale since caa8b83, out of scope).
Rollout
Single deploy. After deploy:
- The poller's first tick (within seconds of
left4me-web.servicestarting) refreshes every server'sactual_stateto systemd reality. Any servers stuck on stale "running" flip to "stopped" automatically. No operator UI clicks required. - Servers currently
running(started via the oldsystemctl start) keep running, but they're not yetenabled. The operator's next stop+start through the UI converts them to enable-based and from that point onwards they're reboot-safe. - Newly-started servers (
l4d2ctl start <name>or web UI start) are enable-based from the first invocation.
If something goes wrong — e.g., the helper rejects a previously-valid invocation or the poller floods the journal — the helper script + service_control.py change can be reverted independently of the poller, and vice versa.
Open questions
None blocking. v2 candidates:
- Auto-restart on
desired_state=running && actual_state=stopped(separate UX decision). - Per-server poll intervals or backoff for repeatedly-failing servers.
- A "drift" badge in the UI when
actual_state_updated_atis older than 2× the poll interval (proxy for "the poller isn't running" or "the host is unreachable").
References
- systemd.unit(5) —
WantedBy=,Installsection semantics. - systemctl(1) —
enable --now/disable --nowflags. - Existing perf-baseline spec:
docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md. - Existing CPU-isolation spec:
docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md. AGENTS.md— Host CLI write-command set is fixed; this spec preserves that contract.