# l4d2 server lifecycle: reboot-safe + drift reconciliation — design Date: 2026-05-09 Status: design ## Summary Make L4D2 server instances survive a host reboot by switching their lifecycle verbs from `systemctl start`/`stop` to `systemctl enable --now`/`disable --now`. Pair this with a periodic background poller that refreshes `Server.actual_state` so out-of-band state changes (OOM kills, manual `systemctl stop`, crashes that exhaust `Restart=on-failure`) no longer leave the web UI showing stale "running" indicators. ## Goals - An L4D2 server started via the web UI (or `l4d2ctl start`) automatically comes back up after a host reboot, with no operator action. - The web app's `Server.actual_state` converges to systemd's actual state within ~30 seconds of any out-of-band change. - The single-source-of-truth for "this server should be running" lives in systemd's wants-symlinks, not in a SQLite row that systemd has no awareness of. - Migration from the existing `systemctl start`-based fleet is a no-op: the next stop+start cycle through the UI converts each server to the enable-based model. ## Non-goals - **Auto-restart on detected drift.** When the poller observes `desired_state=running` but `actual_state=stopped`, this spec does not re-enqueue a start job. That's a v2 UX/policy decision. - **UI surfacing of stale-state warnings.** Once the poller is reliable, the dashboard could show "DB believes X, but actual_state was last refreshed N seconds ago." Out of scope. - **Reconciliation of orphan systemd units.** Units enabled on disk but not represented by any `Server` row (e.g., from a crashed delete) — separate cleanup spec. - **Per-server poller intervals.** A single global cadence is sufficient. - **Replacing `Restart=on-failure`** with anything more elaborate. The unit's existing restart policy stays. - **Reactive-style state propagation.** No SSE/websocket pushes to the UI when actual_state changes. The next page render reads the fresh value from the DB. ## Premise check: system units, not user units `systemctl --user enable --now` has different lifecycle rules — auto-start only at user login (unless `loginctl enable-linger ` is set), symlinks land in `~/.config/systemd/user/.wants/`. It would be wrong here. This project uses **system units**, confirmed by: - Unit path: `/usr/local/lib/systemd/system/left4me-server@.service` is the system search path; user units live in `/etc/systemd/user/` or `~/.config/systemd/user/`. - The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl:31-44`) calls plain `systemctl` (no `--user` flag) and runs as **root** via the sudoers rule at `deploy/files/etc/sudoers.d/left4me:2`. - The unit's `[Install] WantedBy=multi-user.target` (line 43 of the unit) is a system target; user units would use `default.target`. - The same machinery is already in production for `left4me-web.service` — `deploy-test-server.sh` runs `sudo systemctl enable --now left4me-web.service`, and that's how the web service auto-came-back after today's reboot. We're applying the same pattern to the game-server template instances. `systemctl enable left4me-server@1.service` will create `/etc/systemd/system/multi-user.target.wants/left4me-server@1.service` symlinked to `/usr/local/lib/systemd/system/left4me-server@.service`. systemd handles the template instantiation via the `@` syntax automatically. ## Background Today's behavior, confirmed by forensics on `ckn@10.0.4.128` after the operator ran `sudo systemctl poweroff` at 11:48:02 CEST: - The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`) accepts the verbs `start`, `stop`, and `show`, each invoking the literal `systemctl` action. - `l4d2host/service_control.py` exposes `start_service(name)` and `stop_service(name)` that build `systemctl_command("start"/"stop", name)`. - `l4d2host/instances.py` `start_instance` and `stop_instance` call those functions. - `systemctl start` is a transient activation. systemd creates **no** `WantedBy=multi-user.target.wants/` symlink, so the unit doesn't auto-start on next boot. - After the host poweroff at 11:48:02, both running instances were cleanly shut down. The host rebooted; `left4me-web.service` came back (it *is* `enable`d); the game instances did not. - The web app's `Server.actual_state` is only ever written by `refresh_server_actual_state_after_job()` in `l4d2web/services/job_worker.py:581`, called solely after a job completes. With no jobs in flight after the reboot, the row's `actual_state="running"` from yesterday remained the displayed truth. ## Design ### Part A — Switch lifecycle verbs to `enable --now` / `disable --now` **Helper script** (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`): Rename the action verbs the helper accepts: drop `start`/`stop`, add `enable`/`disable`. The bodies become: ```sh case "$action" in enable) exec "$systemctl" enable --now "$unit" ;; disable) exec "$systemctl" disable --now "$unit" ;; show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;; *) reject ;; esac ``` The existing instance-name validation regex (currently lines 12–17) is unchanged — it constrains the `` argument, not the action. The sudoers rule at `deploy/files/etc/sudoers.d/left4me`: ``` left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-systemctl * ``` already passes any args; no sudoers update needed. **Python wrapper** (`l4d2host/service_control.py`): Rename `start_service` → `enable_service` and `stop_service` → `disable_service`. Each builds `systemctl_command("enable", name)` / `systemctl_command("disable", name)`. The existing `show_service` is unchanged. **Instance lifecycle** (`l4d2host/instances.py`): - `start_instance` — replace the `start_service(...)` call with `enable_service(...)`. - `stop_instance` — replace `stop_service(...)` with `disable_service(...)`. - `_purge_instance` (called by `delete_instance` and `reset_instance`) — replace `stop_service(...)` with `disable_service(...)`. A disabled-but-not-running unit's `disable --now` is a no-op for the runtime + still removes any leftover wants-symlink, which is the desired idempotent behavior. **CLI surface** (`l4d2host/cli.py`): `l4d2ctl start ` and `l4d2ctl stop ` keep their names per the contract in `AGENTS.md` ("Host CLI write commands are fixed to: install, initialize, start, stop, delete"). The semantics now genuinely match the verb at the operator level: `start` = "ensure running, now and after reboot." Internal call paths route through `start_instance` → `enable_service` as renamed above. **Web facade** (`l4d2web/services/l4d2_facade.py`): Unchanged. Still invokes `["l4d2ctl", "start", ...]` / `["l4d2ctl", "stop", ...]`. ### Part B — Periodic state poller Add a single background thread spawned alongside the existing job-worker threads in `l4d2web/services/job_worker.py:start_job_workers`: ```python def start_state_poller(app): interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30)) thread = threading.Thread( target=state_poller_loop, args=(app, interval), daemon=True, name="left4me-state-poller", ) thread.start() def state_poller_loop(app, interval): while True: try: with app.app_context(): poll_all_servers() except Exception: pass # never let a single failure kill the loop time.sleep(interval) def poll_all_servers(): with session_scope() as db: active_server_ids = set(db.scalars( select(Job.server_id).where(Job.state.in_(("queued", "running"))) ).all()) server_ids = [ sid for sid in db.scalars(select(Server.id)).all() if sid not in active_server_ids ] for sid in server_ids: try: refresh_server_actual_state(sid) except Exception: pass ``` **Why skip in-flight servers:** the job worker's success path also calls `refresh_server_actual_state`. Both writers touching the same row at overlapping times produces no kernel-level race (SQLite WAL serializes writes), but a poller observing transient state mid-job — e.g., the brief window where the unit is being enabled but `srcds` hasn't fully bound the port yet — could write a misleading value that the worker's post-completion refresh then overwrites. Skipping is simpler than reasoning about the orderings. **Wiring in startup** (`l4d2web/app.py:create_app`): call `start_state_poller(app)` adjacent to `start_job_workers(app)`, gated by the same `should_start_workers` predicate (existing lines 84–88: `JOB_WORKER_ENABLED && not TESTING && not _in_flask_cli_context()`). **First-tick latency:** the loop runs `poll_all_servers()` once before the first `time.sleep(interval)`, so the DB catches up to systemd reality within milliseconds of app boot (one `systemctl show` per server). A separate startup-reconcile path is not needed. **Concurrency:** the poller and the workers all use `session_scope()` (`l4d2web/db.py:44–58`) which commits-on-success / rolls-back-on-exception. SQLite WAL mode (configured by the deploy script per `deploy-test-server.sh:188-198`) handles concurrent reads + serialized writes. No new locking primitives. ### Why both parts Either part alone is insufficient: - **Part A alone** survives reboots but doesn't catch OOM kills, manual `systemctl disable --now ` from a shell, or crashes that exhaust `Restart=on-failure`. The DB still drifts in those cases. - **Part B alone** keeps the DB honest but doesn't bring servers back after a reboot — the operator would still be looking at `actual_state=stopped` on a server they expected to come back, with the only recourse being to click start again. Together: enable-based lifecycle keeps systemd as the source of truth; the poller keeps the DB honest about whatever systemd reports. ### Migration on running hosts Zero one-shot needed. After this lands, a server currently running via the old `systemctl start` (so: started but not enabled) keeps running through the deploy. The next time the operator clicks stop in the UI, `systemctl disable --now` runs — `disable` is a no-op for an already-not-enabled unit, but `--now` still kills the live process. The next start runs `systemctl enable --now`, which enables + starts. From that point on the unit survives reboot. The poller's first tick after deploy will refresh every server's `actual_state` to whatever systemd reports — if the test box's two stale "running" rows still claim running but no unit is loaded, the next tick flips them to `stopped`. ### Files changed / added ``` deploy/files/usr/local/libexec/left4me/left4me-systemctl (Part A — verbs) l4d2host/service_control.py (Part A — rename) l4d2host/instances.py (Part A — call new names) l4d2host/tests/test_lifecycle.py (Part A — test updates) l4d2host/tests/test_service_control.py (Part A — new direct unit tests, create if absent) deploy/tests/test_deploy_artifacts.py (Part A — helper assertions) l4d2web/services/job_worker.py (Part B — poller code) l4d2web/app.py (Part B — wire start_state_poller) l4d2web/config.py (Part B — STATE_POLLER_INTERVAL_SECONDS default) l4d2web/tests/test_job_worker.py (Part B — poller tests) ``` ## Tests ### Part A - `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`: update body assertions to expect `enable)` / `disable)` / `show)`. Add an assertion that `enable)` body contains `enable --now` and `disable)` body contains `disable --now`. Update rejected-action examples (drop `start`/`stop` since they're no longer accepted). - `l4d2host/tests/test_lifecycle.py`: every assertion that mocks `run_command` and inspects the systemctl-helper invocation needs the action token updated from `start` → `enable` and `stop` → `disable`. The `_purge_instance` paths exercised by `delete_instance` and `reset_instance` flip from `stop` to `disable`. - New direct unit tests in `l4d2host/tests/test_service_control.py` (create the file if it doesn't exist already): exercise `enable_service` and `disable_service` with a mocked `run_command` and assert they emit `["sudo", "-n", helper_path, "enable"|"disable", name]`. ### Part B - `l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server` (new): seed two `Server` rows with `actual_state="unknown"`; monkey-patch `refresh_server_actual_state` to record calls; run one iteration of `poll_all_servers()`; assert it was called once per server in any order. - `test_state_poller_skips_servers_with_inflight_jobs` (new): seed a `Server` row + a `Job` with `state="running"` for that server; run `poll_all_servers()`; assert `refresh_server_actual_state` was NOT called for that server. - `test_state_poller_swallows_per_server_exceptions` (new): make `refresh_server_actual_state` raise for one server; assert other servers are still polled and the loop function returns normally. - `test_state_poller_disabled_when_job_workers_disabled` (new): create app with `JOB_WORKER_ENABLED=False`; assert `start_state_poller` is not invoked (or that no `left4me-state-poller` thread is alive after `create_app`). ### CI sanity `pytest deploy/tests/ l4d2host/tests l4d2web/tests -q` is green except the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state` (stale since `caa8b83`, out of scope). ## Rollout Single deploy. After deploy: 1. The poller's first tick (within seconds of `left4me-web.service` starting) refreshes every server's `actual_state` to systemd reality. Any servers stuck on stale "running" flip to "stopped" automatically. **No operator UI clicks required.** 2. Servers currently `running` (started via the old `systemctl start`) keep running, but they're not yet `enabled`. The operator's next stop+start through the UI converts them to enable-based and from that point onwards they're reboot-safe. 3. Newly-started servers (`l4d2ctl start ` or web UI start) are enable-based from the first invocation. If something goes wrong — e.g., the helper rejects a previously-valid invocation or the poller floods the journal — the helper script + `service_control.py` change can be reverted independently of the poller, and vice versa. ## Open questions None blocking. v2 candidates: - Auto-restart on `desired_state=running && actual_state=stopped` (separate UX decision). - Per-server poll intervals or backoff for repeatedly-failing servers. - A "drift" badge in the UI when `actual_state_updated_at` is older than 2× the poll interval (proxy for "the poller isn't running" or "the host is unreachable"). ## References - systemd.unit(5) — `WantedBy=`, `Install` section semantics. - systemctl(1) — `enable --now` / `disable --now` flags. - Existing perf-baseline spec: `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`. - Existing CPU-isolation spec: `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`. - `AGENTS.md` — Host CLI write-command set is fixed; this spec preserves that contract.