From 72cd7ca1ef17303ba61e95d341bded2ec2c18e23 Mon Sep 17 00:00:00 2001 From: mwiegand Date: Sat, 9 May 2026 12:21:59 +0200 Subject: [PATCH] =?UTF-8?q?docs(specs):=20l4d2=20server=20lifecycle=20rebo?= =?UTF-8?q?ot-and-drift=20=E2=80=94=20design?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Switch lifecycle verbs from systemctl start/stop to enable --now / disable --now (servers survive host reboot via WantedBy= symlinks), plus a periodic state poller for runtime drift (OOM kills, manual systemctl ops, exhausted Restart=on-failure). Co-Authored-By: Claude Opus 4.7 (1M context) --- ...erver-lifecycle-reboot-and-drift-design.md | 204 ++++++++++++++++++ 1 file changed, 204 insertions(+) create mode 100644 docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md diff --git a/docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md b/docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md new file mode 100644 index 0000000..66638b7 --- /dev/null +++ b/docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md @@ -0,0 +1,204 @@ +# l4d2 server lifecycle: reboot-safe + drift reconciliation — design + +Date: 2026-05-09 +Status: design + +## Summary + +Make L4D2 server instances survive a host reboot by switching their lifecycle verbs from `systemctl start`/`stop` to `systemctl enable --now`/`disable --now`. Pair this with a periodic background poller that refreshes `Server.actual_state` so out-of-band state changes (OOM kills, manual `systemctl stop`, crashes that exhaust `Restart=on-failure`) no longer leave the web UI showing stale "running" indicators. + +## Goals + +- An L4D2 server started via the web UI (or `l4d2ctl start`) automatically comes back up after a host reboot, with no operator action. +- The web app's `Server.actual_state` converges to systemd's actual state within ~30 seconds of any out-of-band change. +- The single-source-of-truth for "this server should be running" lives in systemd's wants-symlinks, not in a SQLite row that systemd has no awareness of. +- Migration from the existing `systemctl start`-based fleet is a no-op: the next stop+start cycle through the UI converts each server to the enable-based model. + +## Non-goals + +- **Auto-restart on detected drift.** When the poller observes `desired_state=running` but `actual_state=stopped`, this spec does not re-enqueue a start job. That's a v2 UX/policy decision. +- **UI surfacing of stale-state warnings.** Once the poller is reliable, the dashboard could show "DB believes X, but actual_state was last refreshed N seconds ago." Out of scope. +- **Reconciliation of orphan systemd units.** Units enabled on disk but not represented by any `Server` row (e.g., from a crashed delete) — separate cleanup spec. +- **Per-server poller intervals.** A single global cadence is sufficient. +- **Replacing `Restart=on-failure`** with anything more elaborate. The unit's existing restart policy stays. +- **Reactive-style state propagation.** No SSE/websocket pushes to the UI when actual_state changes. The next page render reads the fresh value from the DB. + +## Background + +Today's behavior, confirmed by forensics on `ckn@10.0.4.128` after the operator ran `sudo systemctl poweroff` at 11:48:02 CEST: + +- The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`) accepts the verbs `start`, `stop`, and `show`, each invoking the literal `systemctl` action. +- `l4d2host/service_control.py` exposes `start_service(name)` and `stop_service(name)` that build `systemctl_command("start"/"stop", name)`. +- `l4d2host/instances.py` `start_instance` and `stop_instance` call those functions. +- `systemctl start` is a transient activation. systemd creates **no** `WantedBy=multi-user.target.wants/` symlink, so the unit doesn't auto-start on next boot. +- After the host poweroff at 11:48:02, both running instances were cleanly shut down. The host rebooted; `left4me-web.service` came back (it *is* `enable`d); the game instances did not. +- The web app's `Server.actual_state` is only ever written by `refresh_server_actual_state_after_job()` in `l4d2web/services/job_worker.py:581`, called solely after a job completes. With no jobs in flight after the reboot, the row's `actual_state="running"` from yesterday remained the displayed truth. + +## Design + +### Part A — Switch lifecycle verbs to `enable --now` / `disable --now` + +**Helper script** (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`): + +Rename the action verbs the helper accepts: drop `start`/`stop`, add `enable`/`disable`. The bodies become: + +```sh +case "$action" in + enable) exec "$systemctl" enable --now "$unit" ;; + disable) exec "$systemctl" disable --now "$unit" ;; + show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;; + *) reject ;; +esac +``` + +The existing instance-name validation regex (currently lines 12–17) is unchanged — it constrains the `` argument, not the action. The sudoers rule at `deploy/files/etc/sudoers.d/left4me`: + +``` +left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-systemctl * +``` + +already passes any args; no sudoers update needed. + +**Python wrapper** (`l4d2host/service_control.py`): + +Rename `start_service` → `enable_service` and `stop_service` → `disable_service`. Each builds `systemctl_command("enable", name)` / `systemctl_command("disable", name)`. The existing `show_service` is unchanged. + +**Instance lifecycle** (`l4d2host/instances.py`): + +- `start_instance` — replace the `start_service(...)` call with `enable_service(...)`. +- `stop_instance` — replace `stop_service(...)` with `disable_service(...)`. +- `_purge_instance` (called by `delete_instance` and `reset_instance`) — replace `stop_service(...)` with `disable_service(...)`. A disabled-but-not-running unit's `disable --now` is a no-op for the runtime + still removes any leftover wants-symlink, which is the desired idempotent behavior. + +**CLI surface** (`l4d2host/cli.py`): + +`l4d2ctl start ` and `l4d2ctl stop ` keep their names per the contract in `AGENTS.md` ("Host CLI write commands are fixed to: install, initialize, start, stop, delete"). The semantics now genuinely match the verb at the operator level: `start` = "ensure running, now and after reboot." Internal call paths route through `start_instance` → `enable_service` as renamed above. + +**Web facade** (`l4d2web/services/l4d2_facade.py`): + +Unchanged. Still invokes `["l4d2ctl", "start", ...]` / `["l4d2ctl", "stop", ...]`. + +### Part B — Periodic state poller + +Add a single background thread spawned alongside the existing job-worker threads in `l4d2web/services/job_worker.py:start_job_workers`: + +```python +def start_state_poller(app): + interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30)) + thread = threading.Thread( + target=state_poller_loop, + args=(app, interval), + daemon=True, + name="left4me-state-poller", + ) + thread.start() + + +def state_poller_loop(app, interval): + while True: + try: + with app.app_context(): + poll_all_servers() + except Exception: + pass # never let a single failure kill the loop + time.sleep(interval) + + +def poll_all_servers(): + with session_scope() as db: + active_server_ids = set(db.scalars( + select(Job.server_id).where(Job.state.in_(("queued", "running"))) + ).all()) + server_ids = [ + sid for sid in db.scalars(select(Server.id)).all() + if sid not in active_server_ids + ] + for sid in server_ids: + try: + refresh_server_actual_state(sid) + except Exception: + pass +``` + +**Why skip in-flight servers:** the job worker's success path also calls `refresh_server_actual_state`. Both writers touching the same row at overlapping times produces no kernel-level race (SQLite WAL serializes writes), but a poller observing transient state mid-job — e.g., the brief window where the unit is being enabled but `srcds` hasn't fully bound the port yet — could write a misleading value that the worker's post-completion refresh then overwrites. Skipping is simpler than reasoning about the orderings. + +**Wiring in startup** (`l4d2web/app.py:create_app`): call `start_state_poller(app)` adjacent to `start_job_workers(app)`, gated by the same `should_start_workers` predicate (existing lines 84–88: `JOB_WORKER_ENABLED && not TESTING && not _in_flask_cli_context()`). + +**First-tick latency:** the loop runs `poll_all_servers()` once before the first `time.sleep(interval)`, so the DB catches up to systemd reality within milliseconds of app boot (one `systemctl show` per server). A separate startup-reconcile path is not needed. + +**Concurrency:** the poller and the workers all use `session_scope()` (`l4d2web/db.py:44–58`) which commits-on-success / rolls-back-on-exception. SQLite WAL mode (configured by the deploy script per `deploy-test-server.sh:188-198`) handles concurrent reads + serialized writes. No new locking primitives. + +### Why both parts + +Either part alone is insufficient: + +- **Part A alone** survives reboots but doesn't catch OOM kills, manual `systemctl disable --now ` from a shell, or crashes that exhaust `Restart=on-failure`. The DB still drifts in those cases. +- **Part B alone** keeps the DB honest but doesn't bring servers back after a reboot — the operator would still be looking at `actual_state=stopped` on a server they expected to come back, with the only recourse being to click start again. + +Together: enable-based lifecycle keeps systemd as the source of truth; the poller keeps the DB honest about whatever systemd reports. + +### Migration on running hosts + +Zero one-shot needed. After this lands, a server currently running via the old `systemctl start` (so: started but not enabled) keeps running through the deploy. The next time the operator clicks stop in the UI, `systemctl disable --now` runs — `disable` is a no-op for an already-not-enabled unit, but `--now` still kills the live process. The next start runs `systemctl enable --now`, which enables + starts. From that point on the unit survives reboot. + +The poller's first tick after deploy will refresh every server's `actual_state` to whatever systemd reports — if the test box's two stale "running" rows still claim running but no unit is loaded, the next tick flips them to `stopped`. + +### Files changed / added + +``` +deploy/files/usr/local/libexec/left4me/left4me-systemctl (Part A — verbs) +l4d2host/service_control.py (Part A — rename) +l4d2host/instances.py (Part A — call new names) +l4d2host/tests/test_lifecycle.py (Part A — test updates) +l4d2host/tests/test_service_control.py (Part A — new direct unit tests, create if absent) +deploy/tests/test_deploy_artifacts.py (Part A — helper assertions) + +l4d2web/services/job_worker.py (Part B — poller code) +l4d2web/app.py (Part B — wire start_state_poller) +l4d2web/config.py (Part B — STATE_POLLER_INTERVAL_SECONDS default) +l4d2web/tests/test_job_worker.py (Part B — poller tests) +``` + +## Tests + +### Part A + +- `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`: update body assertions to expect `enable)` / `disable)` / `show)`. Add an assertion that `enable)` body contains `enable --now` and `disable)` body contains `disable --now`. Update rejected-action examples (drop `start`/`stop` since they're no longer accepted). +- `l4d2host/tests/test_lifecycle.py`: every assertion that mocks `run_command` and inspects the systemctl-helper invocation needs the action token updated from `start` → `enable` and `stop` → `disable`. The `_purge_instance` paths exercised by `delete_instance` and `reset_instance` flip from `stop` to `disable`. +- New direct unit tests in `l4d2host/tests/test_service_control.py` (create the file if it doesn't exist already): exercise `enable_service` and `disable_service` with a mocked `run_command` and assert they emit `["sudo", "-n", helper_path, "enable"|"disable", name]`. + +### Part B + +- `l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server` (new): seed two `Server` rows with `actual_state="unknown"`; monkey-patch `refresh_server_actual_state` to record calls; run one iteration of `poll_all_servers()`; assert it was called once per server in any order. +- `test_state_poller_skips_servers_with_inflight_jobs` (new): seed a `Server` row + a `Job` with `state="running"` for that server; run `poll_all_servers()`; assert `refresh_server_actual_state` was NOT called for that server. +- `test_state_poller_swallows_per_server_exceptions` (new): make `refresh_server_actual_state` raise for one server; assert other servers are still polled and the loop function returns normally. +- `test_state_poller_disabled_when_job_workers_disabled` (new): create app with `JOB_WORKER_ENABLED=False`; assert `start_state_poller` is not invoked (or that no `left4me-state-poller` thread is alive after `create_app`). + +### CI sanity + +`pytest deploy/tests/ l4d2host/tests l4d2web/tests -q` is green except the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state` (stale since `caa8b83`, out of scope). + +## Rollout + +Single deploy. After deploy: + +1. The poller's first tick (within seconds of `left4me-web.service` starting) refreshes every server's `actual_state` to systemd reality. Any servers stuck on stale "running" flip to "stopped" automatically. **No operator UI clicks required.** +2. Servers currently `running` (started via the old `systemctl start`) keep running, but they're not yet `enabled`. The operator's next stop+start through the UI converts them to enable-based and from that point onwards they're reboot-safe. +3. Newly-started servers (`l4d2ctl start ` or web UI start) are enable-based from the first invocation. + +If something goes wrong — e.g., the helper rejects a previously-valid invocation or the poller floods the journal — the helper script + `service_control.py` change can be reverted independently of the poller, and vice versa. + +## Open questions + +None blocking. v2 candidates: + +- Auto-restart on `desired_state=running && actual_state=stopped` (separate UX decision). +- Per-server poll intervals or backoff for repeatedly-failing servers. +- A "drift" badge in the UI when `actual_state_updated_at` is older than 2× the poll interval (proxy for "the poller isn't running" or "the host is unreachable"). + +## References + +- systemd.unit(5) — `WantedBy=`, `Install` section semantics. +- systemctl(1) — `enable --now` / `disable --now` flags. +- Existing perf-baseline spec: `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`. +- Existing CPU-isolation spec: `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`. +- `AGENTS.md` — Host CLI write-command set is fixed; this spec preserves that contract.