docs(specs): l4d2 server lifecycle reboot-and-drift — design
Switch lifecycle verbs from systemctl start/stop to enable --now / disable --now (servers survive host reboot via WantedBy= symlinks), plus a periodic state poller for runtime drift (OOM kills, manual systemctl ops, exhausted Restart=on-failure). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
20604dd79c
commit
72cd7ca1ef
1 changed files with 204 additions and 0 deletions
|
|
@ -0,0 +1,204 @@
|
|||
# l4d2 server lifecycle: reboot-safe + drift reconciliation — design
|
||||
|
||||
Date: 2026-05-09
|
||||
Status: design
|
||||
|
||||
## Summary
|
||||
|
||||
Make L4D2 server instances survive a host reboot by switching their lifecycle verbs from `systemctl start`/`stop` to `systemctl enable --now`/`disable --now`. Pair this with a periodic background poller that refreshes `Server.actual_state` so out-of-band state changes (OOM kills, manual `systemctl stop`, crashes that exhaust `Restart=on-failure`) no longer leave the web UI showing stale "running" indicators.
|
||||
|
||||
## Goals
|
||||
|
||||
- An L4D2 server started via the web UI (or `l4d2ctl start`) automatically comes back up after a host reboot, with no operator action.
|
||||
- The web app's `Server.actual_state` converges to systemd's actual state within ~30 seconds of any out-of-band change.
|
||||
- The single-source-of-truth for "this server should be running" lives in systemd's wants-symlinks, not in a SQLite row that systemd has no awareness of.
|
||||
- Migration from the existing `systemctl start`-based fleet is a no-op: the next stop+start cycle through the UI converts each server to the enable-based model.
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **Auto-restart on detected drift.** When the poller observes `desired_state=running` but `actual_state=stopped`, this spec does not re-enqueue a start job. That's a v2 UX/policy decision.
|
||||
- **UI surfacing of stale-state warnings.** Once the poller is reliable, the dashboard could show "DB believes X, but actual_state was last refreshed N seconds ago." Out of scope.
|
||||
- **Reconciliation of orphan systemd units.** Units enabled on disk but not represented by any `Server` row (e.g., from a crashed delete) — separate cleanup spec.
|
||||
- **Per-server poller intervals.** A single global cadence is sufficient.
|
||||
- **Replacing `Restart=on-failure`** with anything more elaborate. The unit's existing restart policy stays.
|
||||
- **Reactive-style state propagation.** No SSE/websocket pushes to the UI when actual_state changes. The next page render reads the fresh value from the DB.
|
||||
|
||||
## Background
|
||||
|
||||
Today's behavior, confirmed by forensics on `ckn@10.0.4.128` after the operator ran `sudo systemctl poweroff` at 11:48:02 CEST:
|
||||
|
||||
- The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`) accepts the verbs `start`, `stop`, and `show`, each invoking the literal `systemctl` action.
|
||||
- `l4d2host/service_control.py` exposes `start_service(name)` and `stop_service(name)` that build `systemctl_command("start"/"stop", name)`.
|
||||
- `l4d2host/instances.py` `start_instance` and `stop_instance` call those functions.
|
||||
- `systemctl start` is a transient activation. systemd creates **no** `WantedBy=multi-user.target.wants/` symlink, so the unit doesn't auto-start on next boot.
|
||||
- After the host poweroff at 11:48:02, both running instances were cleanly shut down. The host rebooted; `left4me-web.service` came back (it *is* `enable`d); the game instances did not.
|
||||
- The web app's `Server.actual_state` is only ever written by `refresh_server_actual_state_after_job()` in `l4d2web/services/job_worker.py:581`, called solely after a job completes. With no jobs in flight after the reboot, the row's `actual_state="running"` from yesterday remained the displayed truth.
|
||||
|
||||
## Design
|
||||
|
||||
### Part A — Switch lifecycle verbs to `enable --now` / `disable --now`
|
||||
|
||||
**Helper script** (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`):
|
||||
|
||||
Rename the action verbs the helper accepts: drop `start`/`stop`, add `enable`/`disable`. The bodies become:
|
||||
|
||||
```sh
|
||||
case "$action" in
|
||||
enable) exec "$systemctl" enable --now "$unit" ;;
|
||||
disable) exec "$systemctl" disable --now "$unit" ;;
|
||||
show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
|
||||
*) reject ;;
|
||||
esac
|
||||
```
|
||||
|
||||
The existing instance-name validation regex (currently lines 12–17) is unchanged — it constrains the `<name>` argument, not the action. The sudoers rule at `deploy/files/etc/sudoers.d/left4me`:
|
||||
|
||||
```
|
||||
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-systemctl *
|
||||
```
|
||||
|
||||
already passes any args; no sudoers update needed.
|
||||
|
||||
**Python wrapper** (`l4d2host/service_control.py`):
|
||||
|
||||
Rename `start_service` → `enable_service` and `stop_service` → `disable_service`. Each builds `systemctl_command("enable", name)` / `systemctl_command("disable", name)`. The existing `show_service` is unchanged.
|
||||
|
||||
**Instance lifecycle** (`l4d2host/instances.py`):
|
||||
|
||||
- `start_instance` — replace the `start_service(...)` call with `enable_service(...)`.
|
||||
- `stop_instance` — replace `stop_service(...)` with `disable_service(...)`.
|
||||
- `_purge_instance` (called by `delete_instance` and `reset_instance`) — replace `stop_service(...)` with `disable_service(...)`. A disabled-but-not-running unit's `disable --now` is a no-op for the runtime + still removes any leftover wants-symlink, which is the desired idempotent behavior.
|
||||
|
||||
**CLI surface** (`l4d2host/cli.py`):
|
||||
|
||||
`l4d2ctl start <name>` and `l4d2ctl stop <name>` keep their names per the contract in `AGENTS.md` ("Host CLI write commands are fixed to: install, initialize, start, stop, delete"). The semantics now genuinely match the verb at the operator level: `start` = "ensure running, now and after reboot." Internal call paths route through `start_instance` → `enable_service` as renamed above.
|
||||
|
||||
**Web facade** (`l4d2web/services/l4d2_facade.py`):
|
||||
|
||||
Unchanged. Still invokes `["l4d2ctl", "start", ...]` / `["l4d2ctl", "stop", ...]`.
|
||||
|
||||
### Part B — Periodic state poller
|
||||
|
||||
Add a single background thread spawned alongside the existing job-worker threads in `l4d2web/services/job_worker.py:start_job_workers`:
|
||||
|
||||
```python
|
||||
def start_state_poller(app):
|
||||
interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
|
||||
thread = threading.Thread(
|
||||
target=state_poller_loop,
|
||||
args=(app, interval),
|
||||
daemon=True,
|
||||
name="left4me-state-poller",
|
||||
)
|
||||
thread.start()
|
||||
|
||||
|
||||
def state_poller_loop(app, interval):
|
||||
while True:
|
||||
try:
|
||||
with app.app_context():
|
||||
poll_all_servers()
|
||||
except Exception:
|
||||
pass # never let a single failure kill the loop
|
||||
time.sleep(interval)
|
||||
|
||||
|
||||
def poll_all_servers():
|
||||
with session_scope() as db:
|
||||
active_server_ids = set(db.scalars(
|
||||
select(Job.server_id).where(Job.state.in_(("queued", "running")))
|
||||
).all())
|
||||
server_ids = [
|
||||
sid for sid in db.scalars(select(Server.id)).all()
|
||||
if sid not in active_server_ids
|
||||
]
|
||||
for sid in server_ids:
|
||||
try:
|
||||
refresh_server_actual_state(sid)
|
||||
except Exception:
|
||||
pass
|
||||
```
|
||||
|
||||
**Why skip in-flight servers:** the job worker's success path also calls `refresh_server_actual_state`. Both writers touching the same row at overlapping times produces no kernel-level race (SQLite WAL serializes writes), but a poller observing transient state mid-job — e.g., the brief window where the unit is being enabled but `srcds` hasn't fully bound the port yet — could write a misleading value that the worker's post-completion refresh then overwrites. Skipping is simpler than reasoning about the orderings.
|
||||
|
||||
**Wiring in startup** (`l4d2web/app.py:create_app`): call `start_state_poller(app)` adjacent to `start_job_workers(app)`, gated by the same `should_start_workers` predicate (existing lines 84–88: `JOB_WORKER_ENABLED && not TESTING && not _in_flask_cli_context()`).
|
||||
|
||||
**First-tick latency:** the loop runs `poll_all_servers()` once before the first `time.sleep(interval)`, so the DB catches up to systemd reality within milliseconds of app boot (one `systemctl show` per server). A separate startup-reconcile path is not needed.
|
||||
|
||||
**Concurrency:** the poller and the workers all use `session_scope()` (`l4d2web/db.py:44–58`) which commits-on-success / rolls-back-on-exception. SQLite WAL mode (configured by the deploy script per `deploy-test-server.sh:188-198`) handles concurrent reads + serialized writes. No new locking primitives.
|
||||
|
||||
### Why both parts
|
||||
|
||||
Either part alone is insufficient:
|
||||
|
||||
- **Part A alone** survives reboots but doesn't catch OOM kills, manual `systemctl disable --now <unit>` from a shell, or crashes that exhaust `Restart=on-failure`. The DB still drifts in those cases.
|
||||
- **Part B alone** keeps the DB honest but doesn't bring servers back after a reboot — the operator would still be looking at `actual_state=stopped` on a server they expected to come back, with the only recourse being to click start again.
|
||||
|
||||
Together: enable-based lifecycle keeps systemd as the source of truth; the poller keeps the DB honest about whatever systemd reports.
|
||||
|
||||
### Migration on running hosts
|
||||
|
||||
Zero one-shot needed. After this lands, a server currently running via the old `systemctl start` (so: started but not enabled) keeps running through the deploy. The next time the operator clicks stop in the UI, `systemctl disable --now` runs — `disable` is a no-op for an already-not-enabled unit, but `--now` still kills the live process. The next start runs `systemctl enable --now`, which enables + starts. From that point on the unit survives reboot.
|
||||
|
||||
The poller's first tick after deploy will refresh every server's `actual_state` to whatever systemd reports — if the test box's two stale "running" rows still claim running but no unit is loaded, the next tick flips them to `stopped`.
|
||||
|
||||
### Files changed / added
|
||||
|
||||
```
|
||||
deploy/files/usr/local/libexec/left4me/left4me-systemctl (Part A — verbs)
|
||||
l4d2host/service_control.py (Part A — rename)
|
||||
l4d2host/instances.py (Part A — call new names)
|
||||
l4d2host/tests/test_lifecycle.py (Part A — test updates)
|
||||
l4d2host/tests/test_service_control.py (Part A — new direct unit tests, create if absent)
|
||||
deploy/tests/test_deploy_artifacts.py (Part A — helper assertions)
|
||||
|
||||
l4d2web/services/job_worker.py (Part B — poller code)
|
||||
l4d2web/app.py (Part B — wire start_state_poller)
|
||||
l4d2web/config.py (Part B — STATE_POLLER_INTERVAL_SECONDS default)
|
||||
l4d2web/tests/test_job_worker.py (Part B — poller tests)
|
||||
```
|
||||
|
||||
## Tests
|
||||
|
||||
### Part A
|
||||
|
||||
- `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`: update body assertions to expect `enable)` / `disable)` / `show)`. Add an assertion that `enable)` body contains `enable --now` and `disable)` body contains `disable --now`. Update rejected-action examples (drop `start`/`stop` since they're no longer accepted).
|
||||
- `l4d2host/tests/test_lifecycle.py`: every assertion that mocks `run_command` and inspects the systemctl-helper invocation needs the action token updated from `start` → `enable` and `stop` → `disable`. The `_purge_instance` paths exercised by `delete_instance` and `reset_instance` flip from `stop` to `disable`.
|
||||
- New direct unit tests in `l4d2host/tests/test_service_control.py` (create the file if it doesn't exist already): exercise `enable_service` and `disable_service` with a mocked `run_command` and assert they emit `["sudo", "-n", helper_path, "enable"|"disable", name]`.
|
||||
|
||||
### Part B
|
||||
|
||||
- `l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server` (new): seed two `Server` rows with `actual_state="unknown"`; monkey-patch `refresh_server_actual_state` to record calls; run one iteration of `poll_all_servers()`; assert it was called once per server in any order.
|
||||
- `test_state_poller_skips_servers_with_inflight_jobs` (new): seed a `Server` row + a `Job` with `state="running"` for that server; run `poll_all_servers()`; assert `refresh_server_actual_state` was NOT called for that server.
|
||||
- `test_state_poller_swallows_per_server_exceptions` (new): make `refresh_server_actual_state` raise for one server; assert other servers are still polled and the loop function returns normally.
|
||||
- `test_state_poller_disabled_when_job_workers_disabled` (new): create app with `JOB_WORKER_ENABLED=False`; assert `start_state_poller` is not invoked (or that no `left4me-state-poller` thread is alive after `create_app`).
|
||||
|
||||
### CI sanity
|
||||
|
||||
`pytest deploy/tests/ l4d2host/tests l4d2web/tests -q` is green except the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state` (stale since `caa8b83`, out of scope).
|
||||
|
||||
## Rollout
|
||||
|
||||
Single deploy. After deploy:
|
||||
|
||||
1. The poller's first tick (within seconds of `left4me-web.service` starting) refreshes every server's `actual_state` to systemd reality. Any servers stuck on stale "running" flip to "stopped" automatically. **No operator UI clicks required.**
|
||||
2. Servers currently `running` (started via the old `systemctl start`) keep running, but they're not yet `enabled`. The operator's next stop+start through the UI converts them to enable-based and from that point onwards they're reboot-safe.
|
||||
3. Newly-started servers (`l4d2ctl start <name>` or web UI start) are enable-based from the first invocation.
|
||||
|
||||
If something goes wrong — e.g., the helper rejects a previously-valid invocation or the poller floods the journal — the helper script + `service_control.py` change can be reverted independently of the poller, and vice versa.
|
||||
|
||||
## Open questions
|
||||
|
||||
None blocking. v2 candidates:
|
||||
|
||||
- Auto-restart on `desired_state=running && actual_state=stopped` (separate UX decision).
|
||||
- Per-server poll intervals or backoff for repeatedly-failing servers.
|
||||
- A "drift" badge in the UI when `actual_state_updated_at` is older than 2× the poll interval (proxy for "the poller isn't running" or "the host is unreachable").
|
||||
|
||||
## References
|
||||
|
||||
- systemd.unit(5) — `WantedBy=`, `Install` section semantics.
|
||||
- systemctl(1) — `enable --now` / `disable --now` flags.
|
||||
- Existing perf-baseline spec: `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`.
|
||||
- Existing CPU-isolation spec: `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`.
|
||||
- `AGENTS.md` — Host CLI write-command set is fixed; this spec preserves that contract.
|
||||
Loading…
Reference in a new issue