left4me/docs/superpowers/plans/2026-05-09-l4d2-server-lifecycle-reboot-and-drift.md
mwiegand 3b0bde9b50
docs(plans): l4d2 server lifecycle reboot-and-drift — implementation plan
Two TDD tasks: helper+service_control verb rename, then poller code
+ wiring + tests. Operator-side smoke test in F.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:21:59 +02:00


L4D2 Server Lifecycle: Reboot-Safe + Drift Reconciliation Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Make L4D2 server instances survive a host reboot (Part A) and converge Server.actual_state to systemd reality every ~30s for out-of-band drift (Part B).

Architecture: Helper script + service_control.py switch from systemctl start/stop to systemctl enable --now / disable --now. A new background thread spawned with the job workers polls every server's status periodically and writes the result via the existing refresh_server_actual_state() path. Skip servers with in-flight jobs to avoid racing with the post-job refresh.

Tech Stack: bash helper script + sudoers; Python subprocess via l4d2host.service_control.systemctl_command; SQLAlchemy via session_scope(); threading; pytest.

Spec: docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md


File Structure

Files to modify (Part A — lifecycle verb change):

  • deploy/files/usr/local/libexec/left4me/left4me-systemctl — accept verbs enable/disable/show (drop start/stop).
  • l4d2host/service_control.py — rename start_service → enable_service, stop_service → disable_service. Action tokens become "enable" / "disable".
  • l4d2host/instances.py — call enable_service from start_instance; call disable_service from stop_instance and _purge_instance.
  • l4d2host/tests/test_lifecycle.py — update mock-call expectations.
  • l4d2host/tests/test_service_control.py — new file with direct unit tests for enable_service / disable_service.
  • deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args — update the verb assertions.

Files to modify (Part B — poller):

  • l4d2web/services/job_worker.py — add start_state_poller, state_poller_loop, poll_all_servers.
  • l4d2web/app.py — call start_state_poller(app) next to start_job_workers(app).
  • l4d2web/config.py — default STATE_POLLER_INTERVAL_SECONDS = 30.
  • l4d2web/tests/test_job_worker.py — four new tests for the poller.

No host-library, web-app facade, or CLI surface signatures change. The l4d2ctl start <name> / l4d2ctl stop <name> commands keep their names (per AGENTS.md).


Pre-flight

  • Step 0a: Verify clean working tree

Run: git status
Expected: nothing to commit, working tree clean

  • Step 0b: Verify the existing test suite is at the known-good baseline

Run: cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q
Expected: 460 passed, 1 failed (the pre-existing unrelated test_deploy_script_has_safe_defaults_and_preserves_state), 2 skipped.

If the counts differ, stop and report the discrepancy; this plan assumes that exact baseline.
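The baseline check can also be done mechanically. The following is a minimal sketch, not part of the plan's deliverables: it parses the final summary line that pytest -q prints and compares it with the counts from Step 0b. The names EXPECTED_BASELINE, parse_pytest_summary, and matches_baseline are hypothetical.

```python
import re

# Counts taken from Step 0b; the dictionary layout is an illustrative assumption.
EXPECTED_BASELINE = {"passed": 460, "failed": 1, "skipped": 2}

# Outcomes that matter for the comparison; warnings are deliberately ignored.
_OUTCOME = re.compile(r"(\d+) (passed|failed|skipped|errors?|xfailed|xpassed)\b")


def parse_pytest_summary(line: str) -> dict[str, int]:
    """Extract outcome counts from a summary line such as
    '1 failed, 460 passed, 2 skipped in 41.20s'."""
    return {outcome: int(number) for number, outcome in _OUTCOME.findall(line)}


def matches_baseline(line: str, expected: dict[str, int] = EXPECTED_BASELINE) -> bool:
    """True only when every counted outcome equals the expected baseline."""
    return parse_pytest_summary(line) == expected
```

Any extra outcome (for example a collection error) makes the comparison fail, which matches the "stop if the counts differ" rule above.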


Task 1: Part A — Switch lifecycle verbs to enable --now / disable --now

This task changes the helper script, the Python wrapper, and the instance lifecycle in one cohesive commit. The change is end-to-end vertical — splitting it across commits would leave broken intermediate states (helper accepting verbs that no caller uses, or callers using verbs the helper rejects).

Files:

  • Modify: deploy/files/usr/local/libexec/left4me/left4me-systemctl
  • Modify: l4d2host/service_control.py
  • Modify: l4d2host/instances.py
  • Modify: l4d2host/tests/test_lifecycle.py
  • Create: l4d2host/tests/test_service_control.py
  • Modify: deploy/tests/test_deploy_artifacts.py

Step 1.1: Update the deploy artifact test for the helper

Open deploy/tests/test_deploy_artifacts.py. Find test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args.

Replace the assertions that check the helper's case-statement bodies. Currently the test asserts something like:

assert 'start) exec "$systemctl" start "$unit"' in script
assert 'stop) exec "$systemctl" stop "$unit"' in script

Update to:

assert 'enable)' in script
assert 'enable --now' in script
assert 'disable)' in script
assert 'disable --now' in script

Keep the --property=ActiveState and --property=SubState assertions for the show action (unchanged).

The rejected-action examples list (currently includes things like ["bad/action", "alpha"]) is unchanged — those are still bad. If the test currently asserts that start and stop are accepted (e.g., a positive case), drop those — start/stop are now rejected verbs, not accepted ones.

Step 1.2: Run the updated artifact test to verify it fails

Run: cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args -v
Expected: FAIL, because the helper script still has start)/stop) cases, not enable)/disable).

Step 1.3: Edit the helper script

Open deploy/files/usr/local/libexec/left4me/left4me-systemctl. Find the case-statement (currently around lines 24-27). Replace:

case "$action" in
    start) exec "$systemctl" start "$unit" ;;
    stop) exec "$systemctl" stop "$unit" ;;
    show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
    *) ...
esac

with:

case "$action" in
    enable) exec "$systemctl" enable --now "$unit" ;;
    disable) exec "$systemctl" disable --now "$unit" ;;
    show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
    *) ...
esac

Keep the rest of the script (shebang, name validation, *) reject-and-exit branch) unchanged. The exact form of the *) reject case in the existing helper should be preserved.
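To make the accepted and rejected verbs explicit, the new case-statement can be modeled in Python. This is only an illustrative model of the shell helper, not code to add to the repository; SYSTEMCTL, ACTIONS, and build_argv are hypothetical names, and the systemctl path is an assumption.

```python
# Illustrative Python model of the helper's case-statement.
# The real helper is the shell script edited in Step 1.3.
SYSTEMCTL = "/usr/bin/systemctl"  # assumed path, for illustration only

# Verbs that map to "<verb> --now <unit>".
ACTIONS = {
    "enable": ["enable", "--now"],
    "disable": ["disable", "--now"],
}


def build_argv(action: str, unit: str) -> list[str]:
    """Return the systemctl argv the helper would exec for a given action."""
    if action in ACTIONS:
        return [SYSTEMCTL, *ACTIONS[action], unit]
    if action == "show":
        return [SYSTEMCTL, "show", unit,
                "--property=ActiveState", "--property=SubState"]
    # Mirrors the *) branch: anything else, including start/stop, is rejected.
    raise ValueError(f"unsupported action: {action!r}")
```

In this model, build_argv("start", ...) raises, which is the behavior the updated artifact test expects from the helper once start and stop are dropped.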

Step 1.4: Verify the helper script still parses

Run: sh -n deploy/files/usr/local/libexec/left4me/left4me-systemctl
Expected: exit 0, no output.

Step 1.5: Run the artifact test, verify it passes

Run: cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args -v
Expected: PASS.

Step 1.6: Update service_control.py

Open l4d2host/service_control.py. Replace:

def start_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("start", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )


def stop_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("stop", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )

with:

def enable_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("enable", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )


def disable_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("disable", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )

show_service, stream_command, stream_journal, and the systemctl_command / journalctl_command helpers are unchanged.
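The two renamed wrappers differ only in the action token they hand to systemctl_command. The sketch below shows that relationship in isolation. It uses a stand-in systemctl_command whose argv shape is an assumption inferred from the assertions in Step 1.9, and a recording runner in place of run_command; make_wrapper and recording_run are hypothetical names.

```python
from typing import Any, Callable

SYSTEMCTL_HELPER = "/usr/local/libexec/left4me/left4me-systemctl"


def systemctl_command(action: str, name: str) -> list[str]:
    # Stand-in: assumed to match the argv asserted in the unit tests.
    return ["sudo", "-n", SYSTEMCTL_HELPER, action, name]


def make_wrapper(action: str, run: Callable[..., Any]) -> Callable[..., Any]:
    """Build a wrapper that forwards keyword arguments unchanged to `run`."""
    def wrapper(name: str, **kwargs: Any) -> Any:
        return run(systemctl_command(action, name), **kwargs)
    return wrapper


calls: list[tuple[list[str], dict[str, Any]]] = []


def recording_run(argv: list[str], **kwargs: Any) -> int:
    # Records the call instead of spawning a process.
    calls.append((argv, kwargs))
    return 0


enable_service = make_wrapper("enable", recording_run)
disable_service = make_wrapper("disable", recording_run)

enable_service("instance-7", passthrough=True)
disable_service("instance-7")
```

After these two calls, the recorded argv lists differ only at index 3 ("enable" versus "disable"), and the keyword arguments pass through untouched.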

Step 1.7: Update instances.py to call the new names

Open l4d2host/instances.py. Replace the import:

from l4d2host.service_control import start_service, stop_service

with:

from l4d2host.service_control import disable_service, enable_service

Inside start_instance, find the start_service(...) call (around line 137 in current source) and replace with enable_service(...). Inside stop_instance (line 159) and _purge_instance (line 194), replace stop_service(...) with disable_service(...). Keep all keyword arguments identical — only the function name changes.

Step 1.8: Update test_lifecycle.py

Open l4d2host/tests/test_lifecycle.py. Search for every assertion that references the start or stop action token in mock-call expectations against service_control.run_command or systemctl_command. The tests typically look for argument lists like ["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "start", "<name>"].

Update each occurrence:

  • "start" → "enable" (in the start_instance test paths)
  • "stop" → "disable" (in the stop_instance, delete_instance, reset_instance, and _purge_instance test paths)

Some tests may import start_service / stop_service directly. Update those imports to enable_service / disable_service.

Step 1.9: Create direct unit tests for enable_service / disable_service

Create l4d2host/tests/test_service_control.py with:

from unittest.mock import patch

from l4d2host.service_control import (
    SYSTEMCTL_HELPER,
    disable_service,
    enable_service,
)


@patch("l4d2host.service_control.run_command")
def test_enable_service_invokes_helper_with_enable_action(mock_run):
    enable_service("instance-7")
    args, _ = mock_run.call_args
    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "enable", "instance-7"]


@patch("l4d2host.service_control.run_command")
def test_disable_service_invokes_helper_with_disable_action(mock_run):
    disable_service("instance-7")
    args, _ = mock_run.call_args
    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "disable", "instance-7"]

Step 1.10: Run the host-library tests

Run: cd /Users/mwiegand/Projekte/left4me && pytest l4d2host/tests -q
Expected: all green (110 or 111 passing depending on whether test_service_control.py already existed; +2 from the new direct tests).

If anything is red: fix the test expectations, not the implementation. The implementation matches the spec exactly. Most likely failure mode: a test in test_lifecycle.py you missed updating; search for any remaining string literal "start" or "stop" in helper-arg-list contexts.
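The search for leftover verbs can be narrowed to argument lists instead of every string in the file. The following is a rough sketch under one assumption: that the helper argv lists in the tests are written as list literals containing the constant "sudo". The function name find_stale_verb_lists is hypothetical.

```python
import ast

STALE_VERBS = {"start", "stop"}


def find_stale_verb_lists(source: str) -> list[tuple[int, str]]:
    """Return (line, verb) pairs for list literals that contain the
    constant "sudo" together with a stale "start"/"stop" token."""
    hits: list[tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.List):
            continue
        # Only string constants are inspected; names such as a helper-path
        # variable are ignored.
        strings = {
            elt.value
            for elt in node.elts
            if isinstance(elt, ast.Constant) and isinstance(elt.value, str)
        }
        if "sudo" in strings:
            hits.extend((node.lineno, verb) for verb in sorted(strings & STALE_VERBS))
    return sorted(hits)
```

Calling it on the text of l4d2host/tests/test_lifecycle.py would list the lines that still need the rename. Argument lists built any other way (tuples, concatenation) are not detected by this sketch.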

Step 1.11: Run the deploy artifact test suite

Run: cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ -q
Expected: 36 passed, 1 failed (the pre-existing unrelated test).

Step 1.12: Commit

git add deploy/files/usr/local/libexec/left4me/left4me-systemctl \
        l4d2host/service_control.py l4d2host/instances.py \
        l4d2host/tests/test_lifecycle.py \
        l4d2host/tests/test_service_control.py \
        deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now

Servers started via the web UI now create a WantedBy= symlink under
multi-user.target.wants/, so they auto-start on the next host reboot.
Helper verbs renamed start/stop -> enable/disable; service_control.py
renamed start_service/stop_service -> enable_service/disable_service.
The user-facing l4d2ctl start/stop commands keep their names per the
AGENTS.md contract — only the implementation changes. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
EOF
)"

Task 2: Part B — Periodic state poller

This task adds the poller code, wires it into the Flask startup, exposes its config knob, and tests four behaviors. One cohesive commit.

Files:

  • Modify: l4d2web/services/job_worker.py
  • Modify: l4d2web/app.py
  • Modify: l4d2web/config.py
  • Modify: l4d2web/tests/test_job_worker.py

Step 2.1: Add the failing tests

Open l4d2web/tests/test_job_worker.py. Append after the existing tests:

def test_state_poller_refreshes_each_server(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Server
        with session_scope() as db:
            db.add_all([
                Server(id=11, name="alpha", port=27015, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
                Server(id=12, name="beta", port=27016, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
            ])

    refreshed = []
    monkeypatch.setattr(jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid))

    with app.app_context():
        jw.poll_all_servers()

    assert sorted(refreshed) == [11, 12]


def test_state_poller_skips_servers_with_inflight_jobs(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Job, Server
        with session_scope() as db:
            db.add(Server(id=21, name="gamma", port=27017, blueprint_id=None,
                          desired_state="running", actual_state="running"))
            db.add(Job(server_id=21, operation="stop", state="running"))

    refreshed = []
    monkeypatch.setattr(jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid))

    with app.app_context():
        jw.poll_all_servers()

    assert refreshed == []


def test_state_poller_swallows_per_server_exceptions(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Server
        with session_scope() as db:
            db.add_all([
                Server(id=31, name="bad", port=27018, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
                Server(id=32, name="good", port=27019, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
            ])

    refreshed = []

    def fake_refresh(sid):
        if sid == 31:
            raise RuntimeError("simulated host failure")
        refreshed.append(sid)

    monkeypatch.setattr(jw, "refresh_server_actual_state", fake_refresh)

    with app.app_context():
        jw.poll_all_servers()  # must not raise

    assert refreshed == [32]


def test_state_poller_disabled_when_job_workers_disabled(monkeypatch):
    """create_app must not spawn the poller thread when JOB_WORKER_ENABLED=False."""
    import threading

    from l4d2web.app import create_app

    spawned = []
    real_thread_init = threading.Thread.__init__

    def tracking_init(self, *args, **kwargs):
        if kwargs.get("name") == "left4me-state-poller":
            spawned.append(True)
        real_thread_init(self, *args, **kwargs)

    monkeypatch.setattr(threading.Thread, "__init__", tracking_init)
    create_app({"TESTING": True, "JOB_WORKER_ENABLED": False})
    assert not spawned

(The tests assume the existing app fixture from conftest.py. If your project uses a different fixture name, adjust accordingly. The polling tests run poll_all_servers() synchronously to avoid testing the loop's time.sleep.)
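The last test detects the poller by intercepting threading.Thread.__init__ and inspecting the name keyword. The standalone sketch below shows that mechanism outside pytest. One property worth knowing is that it only sees names passed as name=..., which is how start_state_poller passes it in this plan. count_named_threads, spawn_poller, and spawn_nothing are hypothetical names.

```python
import threading
from typing import Callable


def count_named_threads(name: str, action: Callable[[], None]) -> int:
    """Run `action` and count Thread constructions whose name= keyword
    equals `name`. Names set positionally or after construction are not seen."""
    seen = []
    original_init = threading.Thread.__init__

    def tracking_init(self, *args, **kwargs):
        if kwargs.get("name") == name:
            seen.append(True)
        original_init(self, *args, **kwargs)

    threading.Thread.__init__ = tracking_init
    try:
        action()
    finally:
        # Always restore the real constructor, even if `action` raises.
        threading.Thread.__init__ = original_init
    return len(seen)


def spawn_poller() -> None:
    # Constructed but never started; construction alone is what gets counted.
    threading.Thread(target=lambda: None, daemon=True, name="left4me-state-poller")


def spawn_nothing() -> None:
    pass
```

In the pytest version, monkeypatch.setattr performs the restore step automatically at teardown.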

Step 2.2: Run the new tests, verify they fail

Run: cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server l4d2web/tests/test_job_worker.py::test_state_poller_skips_servers_with_inflight_jobs l4d2web/tests/test_job_worker.py::test_state_poller_swallows_per_server_exceptions l4d2web/tests/test_job_worker.py::test_state_poller_disabled_when_job_workers_disabled -v
Expected: FAIL, because poll_all_servers and start_state_poller don't exist yet.

Step 2.3: Add the poller code to job_worker.py

Open l4d2web/services/job_worker.py. Add at the bottom of the file:

def start_state_poller(app):
    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
    thread = threading.Thread(
        target=state_poller_loop,
        args=(app, interval),
        daemon=True,
        name="left4me-state-poller",
    )
    thread.start()


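def state_poller_loop(app, interval: float) -> None:
    # Runs forever on a daemon thread; a failed poll must not end the loop.
    while True:
        try:
            with app.app_context():
                poll_all_servers()
        except Exception:
            # Swallow the error and try again on the next tick.
            pass
        time.sleep(interval)


def poll_all_servers() -> None:
    # Collect the IDs inside the session; the refresh loop runs after it closes.
    with session_scope() as db:
        # Servers with queued or running jobs are skipped so the poller
        # does not race the post-job refresh.
        active_server_ids = set(db.scalars(
            select(Job.server_id).where(Job.state.in_(("queued", "running")))
        ).all())
        server_ids = [
            sid for sid in db.scalars(select(Server.id)).all()
            if sid not in active_server_ids
        ]
    for sid in server_ids:
        try:
            refresh_server_actual_state(sid)
        except Exception:
            # One failing server must not block the remaining servers.
            pass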

Server, Job, select, session_scope, threading, time, and refresh_server_actual_state are already imported in this file. Verify by scanning the existing imports; if any are missing (unlikely for select/Server/Job since the worker uses them), add them.
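The two rules inside poll_all_servers (skip servers with in-flight jobs, and keep going when one refresh fails) can be seen without a database. The sketch below restates them over plain Python values. pollable_ids and refresh_all are hypothetical names, and returning the failed IDs is an addition for illustration; the plan's version discards them.

```python
from typing import Callable, Iterable


def pollable_ids(server_ids: Iterable[int], inflight_server_ids: Iterable[int]) -> list[int]:
    """Drop every server that has a queued or running job."""
    busy = set(inflight_server_ids)
    return [sid for sid in server_ids if sid not in busy]


def refresh_all(server_ids: Iterable[int], refresh: Callable[[int], None]) -> list[int]:
    """Refresh each server in turn; collect failures instead of raising."""
    failed = []
    for sid in server_ids:
        try:
            refresh(sid)
        except Exception:
            # A failure for one server must not stop the remaining servers.
            failed.append(sid)
    return failed
```

With server IDs [11, 12, 21] and an in-flight job on 21, pollable_ids returns [11, 12], which mirrors what the second test in Step 2.1 checks against the real function.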

Step 2.4: Wire the poller into create_app

Open l4d2web/app.py. Find the existing start_job_workers(app) call (around line 91, inside the if should_start_workers: block). Add start_state_poller(app) immediately after it:

if should_start_workers:
    recover_stale_jobs()
    start_job_workers(app)
    start_state_poller(app)

Also update the import:

from l4d2web.services.job_worker import (
    recover_stale_jobs,
    start_job_workers,
    start_state_poller,
)

(If the existing import is single-line from ... import recover_stale_jobs, start_job_workers, just add start_state_poller to the list.)

Step 2.5: Add the config default

Open l4d2web/config.py. Find the dict literal that contains other defaults like JOB_WORKER_THREADS, PORT_RANGE_START, etc. Add:

"STATE_POLLER_INTERVAL_SECONDS": 30,

In the env-var-loading section (where LEFT4ME_PORT_RANGE_START etc. are read), add:

"STATE_POLLER_INTERVAL_SECONDS": float(os.getenv("LEFT4ME_STATE_POLLER_INTERVAL_SECONDS", "30")),

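The one-line float(os.getenv(...)) form raises ValueError at startup if the variable holds a non-numeric value, and it accepts zero or negative intervals. If stricter handling is wanted, a defensive reader could look like the sketch below. This is an optional variant, not what the plan specifies; read_poll_interval and DEFAULT_INTERVAL are hypothetical names, and falling back to the default on bad input is an assumption about the desired behavior.

```python
import math
import os
from typing import Mapping

DEFAULT_INTERVAL = 30.0


def read_poll_interval(env: Mapping[str, str] = os.environ) -> float:
    """Read LEFT4ME_STATE_POLLER_INTERVAL_SECONDS, falling back to the
    default when the value is missing, non-numeric, non-finite, or <= 0."""
    raw = env.get("LEFT4ME_STATE_POLLER_INTERVAL_SECONDS", "")
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_INTERVAL
    if not math.isfinite(value) or value <= 0:
        return DEFAULT_INTERVAL
    return value
```

Passing the environment as a parameter keeps the function easy to test without touching os.environ.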
Step 2.6: Run the four new tests, verify they pass

Run: cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server l4d2web/tests/test_job_worker.py::test_state_poller_skips_servers_with_inflight_jobs l4d2web/tests/test_job_worker.py::test_state_poller_swallows_per_server_exceptions l4d2web/tests/test_job_worker.py::test_state_poller_disabled_when_job_workers_disabled -v
Expected: PASS for all four.

Step 2.7: Run the full web test suite

Run: cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests -q
Expected: 317 passed, 1 skipped (313 + 4 new tests).

Step 2.8: Commit

git add l4d2web/services/job_worker.py l4d2web/app.py l4d2web/config.py l4d2web/tests/test_job_worker.py
git commit -m "$(cat <<'EOF'
feat(l4d2-web): periodic state poller refreshes Server.actual_state

A background thread spawned alongside the job workers polls every
server's status every STATE_POLLER_INTERVAL_SECONDS (default 30) and
writes the result via the existing refresh_server_actual_state path.
Servers with in-flight jobs are skipped to avoid racing the post-job
refresh. Catches reboot drift, OOM kills, manual systemctl operations,
and any other out-of-band state change. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
EOF
)"

Final Verification

  • Step F.1: Full test sweep

Run: cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q
Expected: ~466 passed, 1 failed (the pre-existing unrelated test_deploy_script_has_safe_defaults_and_preserves_state), 2 skipped.

  • Step F.2: Working tree clean and commit shape

Run: git status && git log --oneline -5
Expected:

  • git status: clean.

  • Top of git log:

    1. feat(l4d2-web): periodic state poller refreshes Server.actual_state
    2. feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now
    3. docs(plans): l4d2 server lifecycle reboot-and-drift — implementation plan
    4. docs(specs): l4d2 server lifecycle reboot-and-drift — design
  • Step F.3: Operator-side smoke test (deferred, not part of this plan)

End-to-end on ckn@10.0.4.128 after deploy:

deploy/deploy-test-server.sh ckn@10.0.4.128

# Confirm the helper now drives enable/disable
ssh ckn@10.0.4.128 'cat /usr/local/libexec/left4me/left4me-systemctl | grep -E "enable|disable"'
# expect:  enable) exec "$systemctl" enable --now "$unit"
#          disable) exec "$systemctl" disable --now "$unit"

# Click "start" in the web UI for a server. Then:
ssh ckn@10.0.4.128 'systemctl is-enabled left4me-server@1.service'
# expect: enabled

# Reboot the host:
ssh ckn@10.0.4.128 'sudo systemctl reboot'
# wait for it to come back, then:
ssh ckn@10.0.4.128 'systemctl is-active left4me-server@1.service && pgrep -fa srcds'
# expect: active, srcds running with no UI intervention

# Confirm the poller corrects out-of-band drift
ssh ckn@10.0.4.128 'sudo systemctl disable --now left4me-server@1.service'
# Within ~30s the web UI's actual_state for server 1 flips from "running" to "stopped".
ssh ckn@10.0.4.128 'sudo -u left4me /opt/left4me/.venv/bin/python -c "
import sqlite3
c = sqlite3.connect(\"/var/lib/left4me/left4me.db\")
print(c.execute(\"SELECT id, actual_state, actual_state_updated_at FROM servers WHERE id=1\").fetchone())
"'
# expect: actual_state='stopped' with a fresh updated_at.
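Because the poller needs up to one interval to notice the change, the drift check is easier to run as a wait-with-deadline than as a single query. The following is a minimal sketch that reuses the servers table and actual_state column from the query above. wait_for_state is a hypothetical name, and the 45-second default (one 30-second interval plus margin) is an assumption.

```python
import sqlite3
import time


def wait_for_state(
    conn: sqlite3.Connection,
    server_id: int,
    expected: str,
    timeout: float = 45.0,
    poll_every: float = 1.0,
) -> bool:
    """Poll servers.actual_state until it equals `expected` or the
    deadline passes. Returns True on a match, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        row = conn.execute(
            "SELECT actual_state FROM servers WHERE id = ?", (server_id,)
        ).fetchone()
        if row is not None and row[0] == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_every)
```

On the test host this would be called with a connection to /var/lib/left4me/left4me.db, for example wait_for_state(conn, 1, "stopped").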

Out of Scope (do NOT implement here)

  • Auto-restart on desired_state=running && actual_state=stopped.
  • UI banners for stale-state warnings.
  • Reconciliation of orphan systemd units.
  • Per-server poll intervals.
  • Replacing Restart=on-failure.
  • Touching the pre-existing red test (test_deploy_script_has_safe_defaults_and_preserves_state).

If you find yourself touching any of these, stop — they belong in a separate spec.