left4me/docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
mwiegand 1dd674714a
docs(specs): perf baseline lifecycle — premise check on system vs user units
Make explicit that the project uses system units (root systemctl, unit
under /usr/local/lib/systemd/system/, WantedBy=multi-user.target), so
`systemctl enable --now` is the correct verb to make instances survive
a host reboot. User units have different lifecycle rules and would not
auto-start at boot without enable-linger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:25:34 +02:00

217 lines
15 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# l4d2 server lifecycle: reboot-safe + drift reconciliation — design
Date: 2026-05-09
Status: design
## Summary
Make L4D2 server instances survive a host reboot by switching their lifecycle verbs from `systemctl start`/`stop` to `systemctl enable --now`/`disable --now`. Pair this with a periodic background poller that refreshes `Server.actual_state` so out-of-band state changes (OOM kills, manual `systemctl stop`, crashes that exhaust `Restart=on-failure`) no longer leave the web UI showing stale "running" indicators.
## Goals
- An L4D2 server started via the web UI (or `l4d2ctl start`) automatically comes back up after a host reboot, with no operator action.
- The web app's `Server.actual_state` converges to systemd's actual state within ~30 seconds of any out-of-band change.
- The single-source-of-truth for "this server should be running" lives in systemd's wants-symlinks, not in a SQLite row that systemd has no awareness of.
- Migration from the existing `systemctl start`-based fleet is a no-op: the next stop+start cycle through the UI converts each server to the enable-based model.
## Non-goals
- **Auto-restart on detected drift.** When the poller observes `desired_state=running` but `actual_state=stopped`, this spec does not re-enqueue a start job. That's a v2 UX/policy decision.
- **UI surfacing of stale-state warnings.** Once the poller is reliable, the dashboard could show "DB believes X, but actual_state was last refreshed N seconds ago." Out of scope.
- **Reconciliation of orphan systemd units.** Units enabled on disk but not represented by any `Server` row (e.g., from a crashed delete) — separate cleanup spec.
- **Per-server poller intervals.** A single global cadence is sufficient.
- **Replacing `Restart=on-failure`** with anything more elaborate. The unit's existing restart policy stays.
- **Reactive-style state propagation.** No SSE/websocket pushes to the UI when actual_state changes. The next page render reads the fresh value from the DB.
## Premise check: system units, not user units
`systemctl --user enable --now` has different lifecycle rules — auto-start only at user login (unless `loginctl enable-linger <user>` is set), symlinks land in `~/.config/systemd/user/<target>.wants/`. It would be wrong here.
This project uses **system units**, confirmed by:
- Unit path: `/usr/local/lib/systemd/system/left4me-server@.service` is the system search path; user units live in `/etc/systemd/user/` or `~/.config/systemd/user/`.
- The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl:31-44`) calls plain `systemctl` (no `--user` flag) and runs as **root** via the sudoers rule at `deploy/files/etc/sudoers.d/left4me:2`.
- The unit's `[Install] WantedBy=multi-user.target` (line 43 of the unit) is a system target; user units would use `default.target`.
- The same machinery is already in production for `left4me-web.service``deploy-test-server.sh` runs `sudo systemctl enable --now left4me-web.service`, and that's how the web service auto-came-back after today's reboot. We're applying the same pattern to the game-server template instances.
`systemctl enable left4me-server@1.service` will create `/etc/systemd/system/multi-user.target.wants/left4me-server@1.service` symlinked to `/usr/local/lib/systemd/system/left4me-server@.service`. systemd handles the template instantiation via the `@` syntax automatically.
## Background
Today's behavior, confirmed by forensics on `ckn@10.0.4.128` after the operator ran `sudo systemctl poweroff` at 11:48:02 CEST:
- The `left4me-systemctl` helper (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`) accepts the verbs `start`, `stop`, and `show`, each invoking the literal `systemctl` action.
- `l4d2host/service_control.py` exposes `start_service(name)` and `stop_service(name)` that build `systemctl_command("start"/"stop", name)`.
- `l4d2host/instances.py` `start_instance` and `stop_instance` call those functions.
- `systemctl start` is a transient activation. systemd creates **no** `WantedBy=multi-user.target.wants/` symlink, so the unit doesn't auto-start on next boot.
- After the host poweroff at 11:48:02, both running instances were cleanly shut down. The host rebooted; `left4me-web.service` came back (it *is* `enable`d); the game instances did not.
- The web app's `Server.actual_state` is only ever written by `refresh_server_actual_state_after_job()` in `l4d2web/services/job_worker.py:581`, called solely after a job completes. With no jobs in flight after the reboot, the row's `actual_state="running"` from yesterday remained the displayed truth.
## Design
### Part A — Switch lifecycle verbs to `enable --now` / `disable --now`
**Helper script** (`deploy/files/usr/local/libexec/left4me/left4me-systemctl`):
Rename the action verbs the helper accepts: drop `start`/`stop`, add `enable`/`disable`. The bodies become:
```sh
case "$action" in
enable) exec "$systemctl" enable --now "$unit" ;;
disable) exec "$systemctl" disable --now "$unit" ;;
show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
*) reject ;;
esac
```
The existing instance-name validation regex (currently lines 1217) is unchanged — it constrains the `<name>` argument, not the action. The sudoers rule at `deploy/files/etc/sudoers.d/left4me`:
```
left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-systemctl *
```
already passes any args; no sudoers update needed.
**Python wrapper** (`l4d2host/service_control.py`):
Rename `start_service``enable_service` and `stop_service``disable_service`. Each builds `systemctl_command("enable", name)` / `systemctl_command("disable", name)`. The existing `show_service` is unchanged.
**Instance lifecycle** (`l4d2host/instances.py`):
- `start_instance` — replace the `start_service(...)` call with `enable_service(...)`.
- `stop_instance` — replace `stop_service(...)` with `disable_service(...)`.
- `_purge_instance` (called by `delete_instance` and `reset_instance`) — replace `stop_service(...)` with `disable_service(...)`. A disabled-but-not-running unit's `disable --now` is a no-op for the runtime + still removes any leftover wants-symlink, which is the desired idempotent behavior.
**CLI surface** (`l4d2host/cli.py`):
`l4d2ctl start <name>` and `l4d2ctl stop <name>` keep their names per the contract in `AGENTS.md` ("Host CLI write commands are fixed to: install, initialize, start, stop, delete"). The semantics now genuinely match the verb at the operator level: `start` = "ensure running, now and after reboot." Internal call paths route through `start_instance``enable_service` as renamed above.
**Web facade** (`l4d2web/services/l4d2_facade.py`):
Unchanged. Still invokes `["l4d2ctl", "start", ...]` / `["l4d2ctl", "stop", ...]`.
### Part B — Periodic state poller
Add a single background thread spawned alongside the existing job-worker threads in `l4d2web/services/job_worker.py:start_job_workers`:
```python
def start_state_poller(app):
interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
thread = threading.Thread(
target=state_poller_loop,
args=(app, interval),
daemon=True,
name="left4me-state-poller",
)
thread.start()
def state_poller_loop(app, interval):
while True:
try:
with app.app_context():
poll_all_servers()
except Exception:
pass # never let a single failure kill the loop
time.sleep(interval)
def poll_all_servers():
with session_scope() as db:
active_server_ids = set(db.scalars(
select(Job.server_id).where(Job.state.in_(("queued", "running")))
).all())
server_ids = [
sid for sid in db.scalars(select(Server.id)).all()
if sid not in active_server_ids
]
for sid in server_ids:
try:
refresh_server_actual_state(sid)
except Exception:
pass
```
**Why skip in-flight servers:** the job worker's success path also calls `refresh_server_actual_state`. Both writers touching the same row at overlapping times produces no kernel-level race (SQLite WAL serializes writes), but a poller observing transient state mid-job — e.g., the brief window where the unit is being enabled but `srcds` hasn't fully bound the port yet — could write a misleading value that the worker's post-completion refresh then overwrites. Skipping is simpler than reasoning about the orderings.
**Wiring in startup** (`l4d2web/app.py:create_app`): call `start_state_poller(app)` adjacent to `start_job_workers(app)`, gated by the same `should_start_workers` predicate (existing lines 8488: `JOB_WORKER_ENABLED && not TESTING && not _in_flask_cli_context()`).
**First-tick latency:** the loop runs `poll_all_servers()` once before the first `time.sleep(interval)`, so the DB catches up to systemd reality within milliseconds of app boot (one `systemctl show` per server). A separate startup-reconcile path is not needed.
**Concurrency:** the poller and the workers all use `session_scope()` (`l4d2web/db.py:4458`) which commits-on-success / rolls-back-on-exception. SQLite WAL mode (configured by the deploy script per `deploy-test-server.sh:188-198`) handles concurrent reads + serialized writes. No new locking primitives.
### Why both parts
Either part alone is insufficient:
- **Part A alone** survives reboots but doesn't catch OOM kills, manual `systemctl disable --now <unit>` from a shell, or crashes that exhaust `Restart=on-failure`. The DB still drifts in those cases.
- **Part B alone** keeps the DB honest but doesn't bring servers back after a reboot — the operator would still be looking at `actual_state=stopped` on a server they expected to come back, with the only recourse being to click start again.
Together: enable-based lifecycle keeps systemd as the source of truth; the poller keeps the DB honest about whatever systemd reports.
### Migration on running hosts
Zero one-shot needed. After this lands, a server currently running via the old `systemctl start` (so: started but not enabled) keeps running through the deploy. The next time the operator clicks stop in the UI, `systemctl disable --now` runs — `disable` is a no-op for an already-not-enabled unit, but `--now` still kills the live process. The next start runs `systemctl enable --now`, which enables + starts. From that point on the unit survives reboot.
The poller's first tick after deploy will refresh every server's `actual_state` to whatever systemd reports — if the test box's two stale "running" rows still claim running but no unit is loaded, the next tick flips them to `stopped`.
### Files changed / added
```
deploy/files/usr/local/libexec/left4me/left4me-systemctl (Part A — verbs)
l4d2host/service_control.py (Part A — rename)
l4d2host/instances.py (Part A — call new names)
l4d2host/tests/test_lifecycle.py (Part A — test updates)
l4d2host/tests/test_service_control.py (Part A — new direct unit tests, create if absent)
deploy/tests/test_deploy_artifacts.py (Part A — helper assertions)
l4d2web/services/job_worker.py (Part B — poller code)
l4d2web/app.py (Part B — wire start_state_poller)
l4d2web/config.py (Part B — STATE_POLLER_INTERVAL_SECONDS default)
l4d2web/tests/test_job_worker.py (Part B — poller tests)
```
## Tests
### Part A
- `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`: update body assertions to expect `enable)` / `disable)` / `show)`. Add an assertion that `enable)` body contains `enable --now` and `disable)` body contains `disable --now`. Update rejected-action examples (drop `start`/`stop` since they're no longer accepted).
- `l4d2host/tests/test_lifecycle.py`: every assertion that mocks `run_command` and inspects the systemctl-helper invocation needs the action token updated from `start``enable` and `stop``disable`. The `_purge_instance` paths exercised by `delete_instance` and `reset_instance` flip from `stop` to `disable`.
- New direct unit tests in `l4d2host/tests/test_service_control.py` (create the file if it doesn't exist already): exercise `enable_service` and `disable_service` with a mocked `run_command` and assert they emit `["sudo", "-n", helper_path, "enable"|"disable", name]`.
### Part B
- `l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server` (new): seed two `Server` rows with `actual_state="unknown"`; monkey-patch `refresh_server_actual_state` to record calls; run one iteration of `poll_all_servers()`; assert it was called once per server in any order.
- `test_state_poller_skips_servers_with_inflight_jobs` (new): seed a `Server` row + a `Job` with `state="running"` for that server; run `poll_all_servers()`; assert `refresh_server_actual_state` was NOT called for that server.
- `test_state_poller_swallows_per_server_exceptions` (new): make `refresh_server_actual_state` raise for one server; assert other servers are still polled and the loop function returns normally.
- `test_state_poller_disabled_when_job_workers_disabled` (new): create app with `JOB_WORKER_ENABLED=False`; assert `start_state_poller` is not invoked (or that no `left4me-state-poller` thread is alive after `create_app`).
### CI sanity
`pytest deploy/tests/ l4d2host/tests l4d2web/tests -q` is green except the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state` (stale since `caa8b83`, out of scope).
## Rollout
Single deploy. After deploy:
1. The poller's first tick (within seconds of `left4me-web.service` starting) refreshes every server's `actual_state` to systemd reality. Any servers stuck on stale "running" flip to "stopped" automatically. **No operator UI clicks required.**
2. Servers currently `running` (started via the old `systemctl start`) keep running, but they're not yet `enabled`. The operator's next stop+start through the UI converts them to enable-based and from that point onwards they're reboot-safe.
3. Newly-started servers (`l4d2ctl start <name>` or web UI start) are enable-based from the first invocation.
If something goes wrong — e.g., the helper rejects a previously-valid invocation or the poller floods the journal — the helper script + `service_control.py` change can be reverted independently of the poller, and vice versa.
## Open questions
None blocking. v2 candidates:
- Auto-restart on `desired_state=running && actual_state=stopped` (separate UX decision).
- Per-server poll intervals or backoff for repeatedly-failing servers.
- A "drift" badge in the UI when `actual_state_updated_at` is older than 2× the poll interval (proxy for "the poller isn't running" or "the host is unreachable").
## References
- systemd.unit(5) — `WantedBy=`, `Install` section semantics.
- systemctl(1) — `enable --now` / `disable --now` flags.
- Existing perf-baseline spec: `docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md`.
- Existing CPU-isolation spec: `docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md`.
- `AGENTS.md` — Host CLI write-command set is fixed; this spec preserves that contract.