left4me/docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
mwiegand 1dd674714a
docs(specs): perf baseline lifecycle — premise check on system vs user units
Make explicit that the project uses system units (root systemctl, unit
under /usr/local/lib/systemd/system/, WantedBy=multi-user.target), so
`systemctl enable --now` is the correct verb to make instances survive
a host reboot. User units have different lifecycle rules and would not
auto-start at boot without enable-linger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:25:34 +02:00


l4d2 server lifecycle: reboot-safe + drift reconciliation — design

Date: 2026-05-09 Status: design

Summary

Make L4D2 server instances survive a host reboot by switching their lifecycle verbs from systemctl start/stop to systemctl enable --now/disable --now. Pair this with a periodic background poller that refreshes Server.actual_state so out-of-band state changes (OOM kills, manual systemctl stop, crashes that exhaust Restart=on-failure) no longer leave the web UI showing stale "running" indicators.

Goals

  • An L4D2 server started via the web UI (or l4d2ctl start) automatically comes back up after a host reboot, with no operator action.
  • The web app's Server.actual_state converges to systemd's actual state within ~30 seconds of any out-of-band change.
  • The single-source-of-truth for "this server should be running" lives in systemd's wants-symlinks, not in a SQLite row that systemd has no awareness of.
  • Migration from the existing systemctl start-based fleet is a no-op: the next stop+start cycle through the UI converts each server to the enable-based model.

Non-goals

  • Auto-restart on detected drift. When the poller observes desired_state=running but actual_state=stopped, this spec does not re-enqueue a start job. That's a v2 UX/policy decision.
  • UI surfacing of stale-state warnings. Once the poller is reliable, the dashboard could show "DB believes X, but actual_state was last refreshed N seconds ago." Out of scope.
  • Reconciliation of orphan systemd units. Units enabled on disk but not represented by any Server row (e.g., from a crashed delete) — separate cleanup spec.
  • Per-server poller intervals. A single global cadence is sufficient.
  • Replacing Restart=on-failure with anything more elaborate. The unit's existing restart policy stays.
  • Reactive-style state propagation. No SSE/websocket pushes to the UI when actual_state changes. The next page render reads the fresh value from the DB.

Premise check: system units, not user units

systemctl --user enable --now has different lifecycle rules — auto-start only at user login (unless loginctl enable-linger <user> is set), symlinks land in ~/.config/systemd/user/<target>.wants/. It would be wrong here.

This project uses system units, confirmed by:

  • Unit path: /usr/local/lib/systemd/system/left4me-server@.service is the system search path; user units live in /etc/systemd/user/ or ~/.config/systemd/user/.
  • The left4me-systemctl helper (deploy/files/usr/local/libexec/left4me/left4me-systemctl:31-44) calls plain systemctl (no --user flag) and runs as root via the sudoers rule at deploy/files/etc/sudoers.d/left4me:2.
  • The unit's [Install] WantedBy=multi-user.target (line 43 of the unit) is a system target; user units would use default.target.
  • The same machinery is already in production for left4me-web.service: deploy-test-server.sh runs sudo systemctl enable --now left4me-web.service, and that's how the web service came back automatically after today's reboot. We're applying the same pattern to the game-server template instances.

systemctl enable left4me-server@1.service will create /etc/systemd/system/multi-user.target.wants/left4me-server@1.service symlinked to /usr/local/lib/systemd/system/left4me-server@.service. systemd handles the template instantiation via the @ syntax automatically.

Background

Today's behavior, confirmed by forensics on ckn@10.0.4.128 after the operator ran sudo systemctl poweroff at 11:48:02 CEST:

  • The left4me-systemctl helper (deploy/files/usr/local/libexec/left4me/left4me-systemctl) accepts the verbs start, stop, and show, each invoking the literal systemctl action.
  • l4d2host/service_control.py exposes start_service(name) and stop_service(name) that build systemctl_command("start"/"stop", name).
  • l4d2host/instances.py start_instance and stop_instance call those functions.
  • systemctl start is a transient activation. systemd creates no symlink under multi-user.target.wants/, so the unit doesn't auto-start on the next boot.
  • After the host poweroff at 11:48:02, both running instances were cleanly shut down. The host rebooted; left4me-web.service came back (it is enabled); the game instances did not.
  • The web app's Server.actual_state is only ever written by refresh_server_actual_state_after_job() in l4d2web/services/job_worker.py:581, called solely after a job completes. With no jobs in flight after the reboot, the row's actual_state="running" from yesterday remained the displayed truth.

Design

Part A — Switch lifecycle verbs to enable --now / disable --now

Helper script (deploy/files/usr/local/libexec/left4me/left4me-systemctl):

Rename the action verbs the helper accepts: drop start/stop, add enable/disable. The bodies become:

case "$action" in
    enable)  exec "$systemctl" enable --now "$unit" ;;
    disable) exec "$systemctl" disable --now "$unit" ;;
    show)    exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
    *)       reject ;;
esac

The existing instance-name validation regex (currently lines 12-17) is unchanged; it constrains the <name> argument, not the action. The sudoers rule at deploy/files/etc/sudoers.d/left4me:

left4me ALL=(root) NOPASSWD: /usr/local/libexec/left4me/left4me-systemctl *

already passes any args; no sudoers update needed.
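The show verb prints one KEY=VALUE line per requested property (for example ActiveState=active and SubState=running). A minimal sketch of how a caller might turn that output into an actual_state value follows. The function names and the mapping to "running" / "stopped" / "unknown" are assumptions for illustration; the spec does not define the mapping.

```python
def parse_show_output(text: str) -> dict[str, str]:
    """Parse `systemctl show --property=...` output into a dict."""
    props: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines
            props[key.strip()] = value.strip()
    return props


def to_actual_state(props: dict[str, str]) -> str:
    """Hypothetical mapping from systemd's ActiveState to actual_state."""
    active = props.get("ActiveState")
    if active is None:
        return "unknown"  # helper failed or property missing
    return "running" if active == "active" else "stopped"
```

Collapsing every non-active state (inactive, failed, activating, and so on) to "stopped" is the simplest choice; a real implementation may want finer distinctions.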

Python wrapper (l4d2host/service_control.py):

Rename start_service → enable_service and stop_service → disable_service. They build systemctl_command("enable", name) and systemctl_command("disable", name) respectively. The existing show_service is unchanged.
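A rough sketch of what the renamed wrappers could look like, assuming the helper is invoked as sudo -n <helper> <action> <name> (the argv shape the tests in this spec assert). The run parameter is a hypothetical injection point standing in for the project's run_command; the real signatures in service_control.py may differ.

```python
from typing import Callable, Sequence

HELPER_PATH = "/usr/local/libexec/left4me/left4me-systemctl"


def systemctl_command(action: str, name: str) -> list[str]:
    """Build the argv for the privileged helper."""
    return ["sudo", "-n", HELPER_PATH, action, name]


def enable_service(name: str, run: Callable[[Sequence[str]], object]) -> object:
    """Enable and start the unit (helper runs `systemctl enable --now`)."""
    return run(systemctl_command("enable", name))


def disable_service(name: str, run: Callable[[Sequence[str]], object]) -> object:
    """Disable and stop the unit (helper runs `systemctl disable --now`)."""
    return run(systemctl_command("disable", name))
```

Passing the runner in keeps the wrappers trivially testable: a test can hand in list.append and inspect the recorded argv.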

Instance lifecycle (l4d2host/instances.py):

  • start_instance — replace the start_service(...) call with enable_service(...).
  • stop_instance — replace stop_service(...) with disable_service(...).
  • _purge_instance (called by delete_instance and reset_instance) likewise replaces stop_service(...) with disable_service(...). On a unit that is already disabled and not running, disable --now leaves the runtime untouched and still removes any leftover wants-symlink, which is the desired idempotent behavior.

CLI surface (l4d2host/cli.py):

l4d2ctl start <name> and l4d2ctl stop <name> keep their names per the contract in AGENTS.md ("Host CLI write commands are fixed to: install, initialize, start, stop, delete"). The semantics now genuinely match the verb at the operator level: start = "ensure running, now and after reboot." Internal call paths route through start_instance → enable_service as renamed above.

Web facade (l4d2web/services/l4d2_facade.py):

Unchanged. Still invokes ["l4d2ctl", "start", ...] / ["l4d2ctl", "stop", ...].

Part B — Periodic state poller

Add a single background thread spawned alongside the existing job-worker threads in l4d2web/services/job_worker.py:start_job_workers:

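import logging
import threading
import time

from sqlalchemy import select  # SQLAlchemy 2.x style, as the queries below assume

# Job, Server, session_scope and refresh_server_actual_state are existing
# project names; their imports are omitted here.

log = logging.getLogger(__name__)


def start_state_poller(app):
    """Spawn the single daemon thread that runs the poll loop."""
    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
    thread = threading.Thread(
        target=state_poller_loop,
        args=(app, interval),
        daemon=True,
        name="left4me-state-poller",
    )
    thread.start()


def state_poller_loop(app, interval):
    """Poll once immediately, then once per interval, forever."""
    while True:
        try:
            with app.app_context():
                poll_all_servers()
        except Exception:
            # Never let a single failure kill the loop, but leave a trace.
            log.exception("state poller tick failed")
        time.sleep(interval)


def poll_all_servers():
    """Refresh actual_state for every server that has no in-flight job."""
    with session_scope() as db:
        # Servers with a queued or running job are left to the job worker.
        active_server_ids = set(db.scalars(
            select(Job.server_id).where(Job.state.in_(("queued", "running")))
        ).all())
        server_ids = [
            sid for sid in db.scalars(select(Server.id)).all()
            if sid not in active_server_ids
        ]
    for sid in server_ids:
        try:
            refresh_server_actual_state(sid)
        except Exception:
            # One failing server must not block the others.
            log.exception("state refresh failed for server %s", sid)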

Why skip in-flight servers: the job worker's success path also calls refresh_server_actual_state. Two writers touching the same row at overlapping times cannot corrupt it (SQLite WAL serializes writes), but a poller observing transient state mid-job (e.g., the brief window where the unit is being enabled but srcds hasn't fully bound the port yet) could write a misleading value that the worker's post-completion refresh then overwrites. Skipping is simpler than reasoning about the orderings.
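The skip rule reduces to a set difference that preserves server order. A standalone sketch, with the hypothetical name servers_to_poll, isolating that logic from the database queries:

```python
from typing import Iterable


def servers_to_poll(
    all_server_ids: Iterable[int],
    inflight_server_ids: Iterable[int],
) -> list[int]:
    """Return the servers the poller may refresh on this tick.

    Any server that owns a queued or running job is left alone; the job
    worker refreshes it when the job completes.
    """
    busy = set(inflight_server_ids)
    return [sid for sid in all_server_ids if sid not in busy]
```

For example, with servers 1, 2 and 3 and a running job on server 2, only servers 1 and 3 are refreshed on that tick; server 2 is picked up again on the first tick after its job finishes.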

Wiring in startup (l4d2web/app.py:create_app): call start_state_poller(app) adjacent to start_job_workers(app), gated by the same should_start_workers predicate (existing lines 84-88: JOB_WORKER_ENABLED && not TESTING && not _in_flask_cli_context()).
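The gating predicate can be read as a small pure function. The sketch below is an assumption about its shape, not the code in app.py: the function signature, the in_cli_context argument, and the defaults used when a key is missing are all hypothetical.

```python
from typing import Mapping


def should_start_workers(config: Mapping[str, object], in_cli_context: bool) -> bool:
    """Decide whether background threads (workers and poller) may start.

    Mirrors the predicate quoted above:
    JOB_WORKER_ENABLED and not TESTING and not running under the Flask CLI.
    Missing keys are treated as False here, which is an assumption.
    """
    enabled = bool(config.get("JOB_WORKER_ENABLED", False))
    testing = bool(config.get("TESTING", False))
    return enabled and not testing and not in_cli_context
```

Sharing one predicate means the poller can never run in a process where the job workers are off, which is what test_state_poller_disabled_when_job_workers_disabled checks.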

First-tick latency: the loop runs poll_all_servers() once before the first time.sleep(interval), so the DB catches up to systemd reality within seconds of app boot (one systemctl show per server, run sequentially). A separate startup-reconcile path is not needed.
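The poll-before-sleep ordering can be demonstrated with a stoppable variant of the loop. This sketch swaps time.sleep for threading.Event.wait so the loop can be ended from a test; the function name and the stop parameter are hypothetical and not part of the spec's design.

```python
import threading
from typing import Callable


def poller_loop(poll: Callable[[], None], interval: float, stop: threading.Event) -> int:
    """Run poll() immediately, then once per interval until stop is set.

    Returns the number of ticks executed, which is convenient in tests.
    """
    ticks = 0
    while not stop.is_set():
        try:
            poll()  # first tick happens before any waiting
        except Exception:
            pass  # as in the spec: one failed tick must not end the loop
        ticks += 1
        stop.wait(interval)  # returns early as soon as stop is set
    return ticks
```

Because the wait comes after the poll, a test can set the stop event from inside the first poll and observe exactly one tick without waiting for the interval to elapse.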

Concurrency: the poller and the workers all use session_scope() (l4d2web/db.py:44-58), which commits on success and rolls back on exception. SQLite WAL mode (configured by the deploy script per deploy-test-server.sh:188-198) handles concurrent reads and serialized writes. No new locking primitives are needed.

Why both parts

Either part alone is insufficient:

  • Part A alone survives reboots but doesn't catch OOM kills, manual systemctl disable --now <unit> from a shell, or crashes that exhaust Restart=on-failure. The DB still drifts in those cases.
  • Part B alone keeps the DB honest but doesn't bring servers back after a reboot — the operator would still be looking at actual_state=stopped on a server they expected to come back, with the only recourse being to click start again.

Together: enable-based lifecycle keeps systemd as the source of truth; the poller keeps the DB honest about whatever systemd reports.

Migration on running hosts

No one-shot migration is needed. After this lands, a server currently running via the old systemctl start (started but not enabled) keeps running through the deploy. The next time the operator clicks stop in the UI, systemctl disable --now runs: disable is a no-op for a unit that is not enabled, but --now still stops the live process. The next start runs systemctl enable --now, which enables and starts the unit. From that point on the unit survives reboot.
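The stop+start conversion can be traced with a toy state model. This is a deliberately simplified sketch: it tracks only whether a unit is enabled and whether it is active, ignores failures and restart policy, and every name in it is hypothetical.

```python
from typing import NamedTuple


class UnitState(NamedTuple):
    enabled: bool  # wants-symlink present
    active: bool   # process currently running


def apply_verb(state: UnitState, verb: str) -> UnitState:
    """Model the effect of one systemctl verb on a unit."""
    if verb == "start":
        return UnitState(state.enabled, True)
    if verb == "stop":
        return UnitState(state.enabled, False)
    if verb == "enable --now":
        return UnitState(True, True)
    if verb == "disable --now":
        return UnitState(False, False)
    raise ValueError(f"unknown verb: {verb}")


def after_reboot(state: UnitState) -> UnitState:
    """After a reboot, only enabled units are started again."""
    return UnitState(state.enabled, state.enabled)
```

A legacy server is in the state (enabled=False, active=True) and is down after a reboot. One disable --now followed by one enable --now moves it to (enabled=True, active=True), which after_reboot preserves.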

The poller's first tick after deploy will refresh every server's actual_state to whatever systemd reports — if the test box's two stale "running" rows still claim running but no unit is loaded, the next tick flips them to stopped.

Files changed / added

deploy/files/usr/local/libexec/left4me/left4me-systemctl    (Part A — verbs)
l4d2host/service_control.py                                  (Part A — rename)
l4d2host/instances.py                                        (Part A — call new names)
l4d2host/tests/test_lifecycle.py                             (Part A — test updates)
l4d2host/tests/test_service_control.py                       (Part A — new direct unit tests, create if absent)
deploy/tests/test_deploy_artifacts.py                        (Part A — helper assertions)

l4d2web/services/job_worker.py                               (Part B — poller code)
l4d2web/app.py                                               (Part B — wire start_state_poller)
l4d2web/config.py                                            (Part B — STATE_POLLER_INTERVAL_SECONDS default)
l4d2web/tests/test_job_worker.py                             (Part B — poller tests)

Tests

Part A

  • deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args: update body assertions to expect enable) / disable) / show). Add an assertion that enable) body contains enable --now and disable) body contains disable --now. Update rejected-action examples (drop start/stop since they're no longer accepted).
  • l4d2host/tests/test_lifecycle.py: every assertion that mocks run_command and inspects the systemctl-helper invocation needs the action token updated from start → enable and stop → disable. The _purge_instance paths exercised by delete_instance and reset_instance flip from stop to disable.
  • New direct unit tests in l4d2host/tests/test_service_control.py (create the file if it doesn't exist already): exercise enable_service and disable_service with a mocked run_command and assert they emit ["sudo", "-n", helper_path, "enable"|"disable", name].

Part B

  • l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server (new): seed two Server rows with actual_state="unknown"; monkey-patch refresh_server_actual_state to record calls; run one iteration of poll_all_servers(); assert it was called once per server in any order.
  • test_state_poller_skips_servers_with_inflight_jobs (new): seed a Server row + a Job with state="running" for that server; run poll_all_servers(); assert refresh_server_actual_state was NOT called for that server.
  • test_state_poller_swallows_per_server_exceptions (new): make refresh_server_actual_state raise for one server; assert other servers are still polled and the loop function returns normally.
  • test_state_poller_disabled_when_job_workers_disabled (new): create app with JOB_WORKER_ENABLED=False; assert start_state_poller is not invoked (or that no left4me-state-poller thread is alive after create_app).

CI sanity

pytest deploy/tests/ l4d2host/tests l4d2web/tests -q is green except the pre-existing unrelated test_deploy_script_has_safe_defaults_and_preserves_state (stale since caa8b83, out of scope).

Rollout

Single deploy. After deploy:

  1. The poller's first tick (within seconds of left4me-web.service starting) refreshes every server's actual_state to systemd reality. Any servers stuck on stale "running" flip to "stopped" automatically. No operator UI clicks required.
  2. Servers currently running (started via the old systemctl start) keep running, but they're not yet enabled. The operator's next stop+start through the UI converts them to enable-based and from that point onwards they're reboot-safe.
  3. Newly-started servers (l4d2ctl start <name> or web UI start) are enable-based from the first invocation.

If something goes wrong — e.g., the helper rejects a previously-valid invocation or the poller floods the journal — the helper script + service_control.py change can be reverted independently of the poller, and vice versa.

Open questions

None blocking. v2 candidates:

  • Auto-restart on desired_state=running && actual_state=stopped (separate UX decision).
  • Per-server poll intervals or backoff for repeatedly-failing servers.
  • A "drift" badge in the UI when actual_state_updated_at is older than 2× the poll interval (proxy for "the poller isn't running" or "the host is unreachable").
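For the drift-badge candidate, the staleness rule is a single comparison. A sketch under the assumption that actual_state_updated_at is stored as a timezone-aware datetime; the function name is_stale and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_stale(
    updated_at: datetime,
    now: datetime,
    poll_interval_seconds: float = 30.0,
    factor: float = 2.0,
) -> bool:
    """True if the last refresh is older than factor x the poll interval."""
    threshold = timedelta(seconds=poll_interval_seconds * factor)
    return (now - updated_at) > threshold
```

With the default 30-second cadence, a row is flagged once more than 60 seconds have passed since its last refresh, i.e. after two missed ticks.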

References

  • systemd.unit(5) — WantedBy=, Install section semantics.
  • systemctl(1) — enable --now / disable --now flags.
  • Existing perf-baseline spec: docs/superpowers/specs/2026-05-09-l4d2-server-host-perf-baseline-design.md.
  • Existing CPU-isolation spec: docs/superpowers/specs/2026-05-09-l4d2-cpu-isolation-design.md.
  • AGENTS.md — Host CLI write-command set is fixed; this spec preserves that contract.