left4me/docs/superpowers/plans/2026-05-09-l4d2-server-lifecycle-reboot-and-drift.md
mwiegand 3b0bde9b50
docs(plans): l4d2 server lifecycle reboot-and-drift — implementation plan
Two TDD tasks: helper+service_control verb rename, then poller code
+ wiring + tests. Operator-side smoke test in F.3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 12:21:59 +02:00

# L4D2 Server Lifecycle: Reboot-Safe + Drift Reconciliation Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Make L4D2 server instances survive a host reboot (Part A) and converge `Server.actual_state` to systemd reality every ~30s for out-of-band drift (Part B).

**Architecture:** Helper script + `service_control.py` switch from `systemctl start/stop` to `systemctl enable --now / disable --now`. A new background thread spawned with the job workers polls every server's status periodically and writes the result via the existing `refresh_server_actual_state()` path. Skip servers with in-flight jobs to avoid racing with the post-job refresh.

**Tech Stack:** bash helper script + sudoers; Python `subprocess` via `l4d2host.service_control.systemctl_command`; SQLAlchemy via `session_scope()`; threading; pytest.

**Spec:** `docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md`

---
## File Structure
Files to modify (Part A — lifecycle verb change):
- `deploy/files/usr/local/libexec/left4me/left4me-systemctl` — accept verbs `enable`/`disable`/`show` (drop `start`/`stop`).
- `l4d2host/service_control.py` — rename `start_service` → `enable_service`, `stop_service` → `disable_service`. Action tokens become `"enable"` / `"disable"`.
- `l4d2host/instances.py` — call `enable_service` from `start_instance`; call `disable_service` from `stop_instance` and `_purge_instance`.
- `l4d2host/tests/test_lifecycle.py` — update mock-call expectations.
- `l4d2host/tests/test_service_control.py` — new file with direct unit tests for `enable_service` / `disable_service`.
- `deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args` — update the verb assertions.
Files to modify (Part B — poller):
- `l4d2web/services/job_worker.py` — add `start_state_poller`, `state_poller_loop`, `poll_all_servers`.
- `l4d2web/app.py` — call `start_state_poller(app)` next to `start_job_workers(app)`.
- `l4d2web/config.py` — default `STATE_POLLER_INTERVAL_SECONDS = 30`.
- `l4d2web/tests/test_job_worker.py` — four new tests for the poller.
No host-library, web-app facade, or CLI surface signatures change. The `l4d2ctl start <name>` / `l4d2ctl stop <name>` commands keep their names (per `AGENTS.md`).
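
The split between user-facing operation names and helper verbs can be pictured as a small lookup. The sketch below is illustrative only: `HELPER_VERB_FOR_OPERATION` and `helper_argv` are hypothetical names that do not exist in `l4d2host`, and the argv shape is assumed from the argument lists that this plan's lifecycle tests assert against.

```python
# Hypothetical illustration; these names do not exist in l4d2host.
HELPER_PATH = "/usr/local/libexec/left4me/left4me-systemctl"

# User-facing operation -> verb passed to the helper after this change.
HELPER_VERB_FOR_OPERATION = {"start": "enable", "stop": "disable"}


def helper_argv(operation: str, name: str) -> list[str]:
    """Build the sudo argv for a user-facing operation (assumed shape)."""
    verb = HELPER_VERB_FOR_OPERATION[operation]  # KeyError for unknown operations
    return ["sudo", "-n", HELPER_PATH, verb, name]
```

With this mapping, `helper_argv("start", "alpha")` ends in `"enable", "alpha"`, while the operator still types `start`.
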
---
## Pre-flight
- [ ] **Step 0a: Verify clean working tree**
Run: `git status`
Expected: `nothing to commit, working tree clean`
- [ ] **Step 0b: Verify the existing test suite is at the known-good baseline**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
Expected: 460 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`), 2 skipped.
If the count differs, stop and surface — this plan assumes that exact baseline.
---
## Task 1: Part A — Switch lifecycle verbs to `enable --now` / `disable --now`
This task changes the helper script, the Python wrapper, and the instance lifecycle in one cohesive commit. The change is end-to-end vertical — splitting it across commits would leave broken intermediate states (helper accepting verbs that no caller uses, or callers using verbs the helper rejects).
**Files:**
- Modify: `deploy/files/usr/local/libexec/left4me/left4me-systemctl`
- Modify: `l4d2host/service_control.py`
- Modify: `l4d2host/instances.py`
- Modify: `l4d2host/tests/test_lifecycle.py`
- Create: `l4d2host/tests/test_service_control.py`
- Modify: `deploy/tests/test_deploy_artifacts.py`
### Step 1.1: Update the deploy artifact test for the helper
Open `deploy/tests/test_deploy_artifacts.py`. Find `test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args`.
Replace the assertions that check the helper's case-statement bodies. Currently the test asserts something like:
```python
assert 'start) exec "$systemctl" start "$unit"' in script
assert 'stop) exec "$systemctl" stop "$unit"' in script
```
Update to:
```python
assert 'enable)' in script
assert 'enable --now' in script
assert 'disable)' in script
assert 'disable --now' in script
```
Keep the `--property=ActiveState` and `--property=SubState` assertions for the `show` action (unchanged).
The rejected-action examples list (currently includes things like `["bad/action", "alpha"]`) is unchanged — those are still bad. If the test currently asserts that `start` and `stop` are accepted (e.g., a positive case), drop those — `start`/`stop` are now rejected verbs, not accepted ones.
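
The accept/reject contract this test pins down can be modelled in a few lines of Python. This is a sketch of the intended behaviour, not the helper itself: `expand_helper_action` is a hypothetical name, and raising `ValueError` merely stands in for the helper's existing reject branch, whose exact form is not reproduced here.

```python
def expand_helper_action(action: str, unit: str) -> list[str]:
    """Return the systemctl argv the helper is expected to exec for an action."""
    if action == "enable":
        return ["systemctl", "enable", "--now", unit]
    if action == "disable":
        return ["systemctl", "disable", "--now", unit]
    if action == "show":
        return ["systemctl", "show", unit,
                "--property=ActiveState", "--property=SubState"]
    # After the change, start/stop land here together with any other bad verb.
    raise ValueError(f"unsupported action: {action!r}")
```
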
### Step 1.2: Run the updated artifact test to verify it fails
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args -v`
Expected: FAIL — the helper script still has `start)`/`stop)` cases, not `enable)`/`disable)`.
### Step 1.3: Edit the helper script
Open `deploy/files/usr/local/libexec/left4me/left4me-systemctl`. Find the case-statement (currently around lines 24-27). Replace:
```sh
case "$action" in
  start) exec "$systemctl" start "$unit" ;;
  stop) exec "$systemctl" stop "$unit" ;;
  show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
  *) ...
esac
```
with:
```sh
case "$action" in
  enable) exec "$systemctl" enable --now "$unit" ;;
  disable) exec "$systemctl" disable --now "$unit" ;;
  show) exec "$systemctl" show "$unit" --property=ActiveState --property=SubState ;;
  *) ...
esac
```
Keep the rest of the script (shebang, name validation, `*)` reject-and-exit branch) unchanged. The exact form of the `*)` reject case in the existing helper should be preserved.
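
For orientation, `systemctl show` with `--property=` flags prints one `Key=Value` line per requested property, so the `show` branch yields two lines. A minimal sketch of parsing that output follows; `parse_show_output` is a hypothetical name and is not claimed to match how `show_service` or `refresh_server_actual_state` consume the output.

```python
def parse_show_output(text: str) -> dict[str, str]:
    """Parse `Key=Value` lines as printed by `systemctl show --property=...`."""
    props: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip blank or malformed lines that contain no '='
            props[key.strip()] = value.strip()
    return props
```
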
### Step 1.4: Verify the helper script still parses
Run: `sh -n deploy/files/usr/local/libexec/left4me/left4me-systemctl`
Expected: exit 0, no output.
### Step 1.5: Run the artifact test, verify it passes
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/test_deploy_artifacts.py::test_systemctl_helper_passes_shell_syntax_check_and_rejects_bad_args -v`
Expected: PASS.
### Step 1.6: Update `service_control.py`
Open `l4d2host/service_control.py`. Replace:
```python
def start_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("start", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )


def stop_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("stop", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )
```
with:
```python
def enable_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("enable", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )


def disable_service(
    name: str,
    *,
    on_stdout: Callable[[str], None] | None = None,
    on_stderr: Callable[[str], None] | None = None,
    passthrough: bool = False,
    should_cancel: Callable[[], bool] | None = None,
) -> CommandResult:
    return run_command(
        systemctl_command("disable", name),
        on_stdout=on_stdout,
        on_stderr=on_stderr,
        passthrough=passthrough,
        should_cancel=should_cancel,
    )
```
`show_service`, `stream_command`, `stream_journal`, and the `systemctl_command` / `journalctl_command` helpers are unchanged.
### Step 1.7: Update `instances.py` to call the new names
Open `l4d2host/instances.py`. Replace the import:
```python
from l4d2host.service_control import start_service, stop_service
```
with:
```python
from l4d2host.service_control import disable_service, enable_service
```
Inside `start_instance`, find the `start_service(...)` call (around line 137 in current source) and replace with `enable_service(...)`. Inside `stop_instance` (line 159) and `_purge_instance` (line 194), replace `stop_service(...)` with `disable_service(...)`. Keep all keyword arguments identical — only the function name changes.
### Step 1.8: Update `test_lifecycle.py`
Open `l4d2host/tests/test_lifecycle.py`. Search for every assertion that references the `start` or `stop` action token in mock-call expectations against `service_control.run_command` or `systemctl_command`. The tests typically look for argument lists like `["sudo", "-n", "/usr/local/libexec/left4me/left4me-systemctl", "start", "<name>"]`.
Update each occurrence:
- `"start"` → `"enable"` (in the `start_instance` test paths)
- `"stop"` → `"disable"` (in the `stop_instance`, `delete_instance`, `reset_instance`, and `_purge_instance` test paths)
Some tests may import `start_service` / `stop_service` directly. Update those imports to `enable_service` / `disable_service`.
### Step 1.9: Create direct unit tests for `enable_service` / `disable_service`
Create `l4d2host/tests/test_service_control.py` with:
```python
from unittest.mock import patch

from l4d2host.service_control import (
    SYSTEMCTL_HELPER,
    disable_service,
    enable_service,
)


@patch("l4d2host.service_control.run_command")
def test_enable_service_invokes_helper_with_enable_action(mock_run):
    enable_service("instance-7")
    args, _ = mock_run.call_args
    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "enable", "instance-7"]


@patch("l4d2host.service_control.run_command")
def test_disable_service_invokes_helper_with_disable_action(mock_run):
    disable_service("instance-7")
    args, _ = mock_run.call_args
    assert args[0] == ["sudo", "-n", SYSTEMCTL_HELPER, "disable", "instance-7"]
```
### Step 1.10: Run the host-library tests
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2host/tests -q`
Expected: all green (110 or 111 passing depending on whether `test_service_control.py` already existed; `+2` from the new direct tests).
If anything red: fix the test expectations, not the implementation. The implementation matches the spec exactly. Most likely failure mode: a test in `test_lifecycle.py` you missed updating; search for any remaining string literal `"start"` or `"stop"` in helper-arg-list contexts.
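
The search for leftover action tokens can be scripted. The sketch below is one possible approach, with hypothetical names (`LEFTOVER_VERB`, `find_leftover_verbs`); it flags any line that contains `"start"` or `"stop"` as a double-quoted literal and leaves the judgement about context to the reader.

```python
import re

# Matches the old action tokens as double-quoted string literals.
LEFTOVER_VERB = re.compile(r'"(?:start|stop)"')


def find_leftover_verbs(source: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line with an old verb literal."""
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(source.splitlines(), start=1)
        if LEFTOVER_VERB.search(line)
    ]
```

It could be run over the text of `l4d2host/tests/test_lifecycle.py`, for example via `pathlib.Path(...).read_text()`. Every hit is only a candidate and still needs a manual look before it is changed.
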
### Step 1.11: Run the deploy artifact test suite
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ -q`
Expected: 36 passed, 1 failed (the pre-existing unrelated test).
### Step 1.12: Commit
```bash
git add deploy/files/usr/local/libexec/left4me/left4me-systemctl \
        l4d2host/service_control.py l4d2host/instances.py \
        l4d2host/tests/test_lifecycle.py \
        l4d2host/tests/test_service_control.py \
        deploy/tests/test_deploy_artifacts.py
git commit -m "$(cat <<'EOF'
feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now

Servers started via the web UI now create a WantedBy= symlink under
multi-user.target.wants/, so they auto-start on the next host reboot.
Helper verbs renamed start/stop -> enable/disable; service_control.py
renamed start_service/stop_service -> enable_service/disable_service.
The user-facing l4d2ctl start/stop commands keep their names per the
AGENTS.md contract — only the implementation changes. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
EOF
)"
```
---
## Task 2: Part B — Periodic state poller
This task adds the poller code, wires it into the Flask startup, exposes its config knob, and tests four behaviors. One cohesive commit.
**Files:**
- Modify: `l4d2web/services/job_worker.py`
- Modify: `l4d2web/app.py`
- Modify: `l4d2web/config.py`
- Modify: `l4d2web/tests/test_job_worker.py`
### Step 2.1: Add the failing tests
Open `l4d2web/tests/test_job_worker.py`. Append after the existing tests:
```python
def test_state_poller_refreshes_each_server(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Server
        with session_scope() as db:
            db.add_all([
                Server(id=11, name="alpha", port=27015, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
                Server(id=12, name="beta", port=27016, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
            ])

    refreshed = []
    monkeypatch.setattr(jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid))

    with app.app_context():
        jw.poll_all_servers()

    assert sorted(refreshed) == [11, 12]


def test_state_poller_skips_servers_with_inflight_jobs(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Job, Server
        with session_scope() as db:
            db.add(Server(id=21, name="gamma", port=27017, blueprint_id=None,
                          desired_state="running", actual_state="running"))
            db.add(Job(server_id=21, operation="stop", state="running"))

    refreshed = []
    monkeypatch.setattr(jw, "refresh_server_actual_state", lambda sid: refreshed.append(sid))

    with app.app_context():
        jw.poll_all_servers()

    assert refreshed == []


def test_state_poller_swallows_per_server_exceptions(app, monkeypatch):
    from l4d2web.services import job_worker as jw

    with app.app_context():
        from l4d2web.db import session_scope
        from l4d2web.models import Server
        with session_scope() as db:
            db.add_all([
                Server(id=31, name="bad", port=27018, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
                Server(id=32, name="good", port=27019, blueprint_id=None,
                       desired_state="running", actual_state="unknown"),
            ])

    refreshed = []

    def fake_refresh(sid):
        if sid == 31:
            raise RuntimeError("simulated host failure")
        refreshed.append(sid)

    monkeypatch.setattr(jw, "refresh_server_actual_state", fake_refresh)

    with app.app_context():
        jw.poll_all_servers()  # must not raise

    assert refreshed == [32]


def test_state_poller_disabled_when_job_workers_disabled(monkeypatch):
    """create_app must not spawn the poller thread when JOB_WORKER_ENABLED=False."""
    import threading
    from l4d2web.app import create_app

    spawned = []
    real_thread_init = threading.Thread.__init__

    def tracking_init(self, *args, **kwargs):
        if kwargs.get("name") == "left4me-state-poller":
            spawned.append(True)
        real_thread_init(self, *args, **kwargs)

    monkeypatch.setattr(threading.Thread, "__init__", tracking_init)
    create_app({"TESTING": True, "JOB_WORKER_ENABLED": False})
    assert not spawned
```
(The tests assume the existing `app` fixture from `conftest.py`. If your project uses a different fixture name, adjust accordingly. The polling tests run `poll_all_servers()` synchronously to avoid testing the loop's `time.sleep`.)
### Step 2.2: Run the new tests, verify they fail
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server l4d2web/tests/test_job_worker.py::test_state_poller_skips_servers_with_inflight_jobs l4d2web/tests/test_job_worker.py::test_state_poller_swallows_per_server_exceptions l4d2web/tests/test_job_worker.py::test_state_poller_disabled_when_job_workers_disabled -v`
Expected: FAIL — `poll_all_servers` and `start_state_poller` don't exist yet.
### Step 2.3: Add the poller code to `job_worker.py`
Open `l4d2web/services/job_worker.py`. Add at the bottom of the file:
```python
def start_state_poller(app):
    interval = float(app.config.get("STATE_POLLER_INTERVAL_SECONDS", 30))
    thread = threading.Thread(
        target=state_poller_loop,
        args=(app, interval),
        daemon=True,  # do not block interpreter shutdown
        name="left4me-state-poller",
    )
    thread.start()


def state_poller_loop(app, interval: float) -> None:
    while True:
        try:
            with app.app_context():
                poll_all_servers()
        except Exception:
            # Keep the loop alive; a failed sweep is retried on the next tick.
            pass
        time.sleep(interval)


def poll_all_servers() -> None:
    # Collect the ids in one short session, then refresh each server afterwards.
    with session_scope() as db:
        active_server_ids = set(db.scalars(
            select(Job.server_id).where(Job.state.in_(("queued", "running")))
        ).all())
        server_ids = [
            sid for sid in db.scalars(select(Server.id)).all()
            if sid not in active_server_ids
        ]
    for sid in server_ids:
        try:
            refresh_server_actual_state(sid)
        except Exception:
            # One failing server must not stop the rest of the sweep.
            pass
```
`Server`, `Job`, `select`, `session_scope`, `threading`, `time`, and `refresh_server_actual_state` are already imported in this file. Verify by scanning the existing imports; if any are missing (unlikely for `select`/`Server`/`Job` since the worker uses them), add them.
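
The selection and error-isolation logic in `poll_all_servers` can be exercised without a database. The following sketch restates it as a pure function; `poll_once` and its return value (the ids whose refresh raised) are illustrative additions and not part of the plan's code.

```python
from typing import Callable, Iterable


def poll_once(
    server_ids: Iterable[int],
    active_server_ids: set[int],
    refresh: Callable[[int], None],
) -> list[int]:
    """Refresh every server without an in-flight job; return ids that failed."""
    failed: list[int] = []
    for sid in server_ids:
        if sid in active_server_ids:
            continue  # the queued/running job will refresh this server itself
        try:
            refresh(sid)
        except Exception:
            failed.append(sid)  # isolate the failure and carry on with the sweep
    return failed
```

Returning the failed ids is one way to make swallowed errors observable, for instance by logging them, without changing the rule that a single bad server must not abort the sweep.
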
### Step 2.4: Wire the poller into `create_app`
Open `l4d2web/app.py`. Find the existing `start_job_workers(app)` call (around line 91, inside the `if should_start_workers:` block). Add `start_state_poller(app)` immediately after it:
```python
    if should_start_workers:
        recover_stale_jobs()
        start_job_workers(app)
        start_state_poller(app)
```
Also update the import:
```python
from l4d2web.services.job_worker import (
    recover_stale_jobs,
    start_job_workers,
    start_state_poller,
)
```
(If the existing import is single-line `from ... import recover_stale_jobs, start_job_workers`, just add `start_state_poller` to the list.)
### Step 2.5: Add the config default
Open `l4d2web/config.py`. Find the dict literal that contains other defaults like `JOB_WORKER_THREADS`, `PORT_RANGE_START`, etc. Add:
```python
"STATE_POLLER_INTERVAL_SECONDS": 30,
```
In the env-var-loading section (where `LEFT4ME_PORT_RANGE_START` etc. are read), add:
```python
"STATE_POLLER_INTERVAL_SECONDS": float(os.getenv("LEFT4ME_STATE_POLLER_INTERVAL_SECONDS", "30")),
```
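
The plain `float(os.getenv(...))` form raises at startup on a non-numeric value, and a negative value would make `time.sleep` raise inside the poller thread. If that matters for a deployment, a guarded reader is one option. The sketch below is optional hardening that goes beyond what this plan specifies; `read_poller_interval` is a hypothetical name, and falling back to the default silently is an assumption rather than a decision taken in the spec.

```python
import math
import os
from collections.abc import Mapping


def read_poller_interval(
    env: Mapping[str, str] = os.environ,
    default: float = 30.0,
) -> float:
    """Read LEFT4ME_STATE_POLLER_INTERVAL_SECONDS, falling back on bad input."""
    raw = env.get("LEFT4ME_STATE_POLLER_INTERVAL_SECONDS")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default  # non-numeric input
    # Reject zero, negatives, NaN and infinity, none of which make a usable sleep.
    return value if math.isfinite(value) and value > 0 else default
```
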
### Step 2.6: Run the four new tests, verify they pass
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests/test_job_worker.py::test_state_poller_refreshes_each_server l4d2web/tests/test_job_worker.py::test_state_poller_skips_servers_with_inflight_jobs l4d2web/tests/test_job_worker.py::test_state_poller_swallows_per_server_exceptions l4d2web/tests/test_job_worker.py::test_state_poller_disabled_when_job_workers_disabled -v`
Expected: PASS for all four.
### Step 2.7: Run the full web test suite
Run: `cd /Users/mwiegand/Projekte/left4me && pytest l4d2web/tests -q`
Expected: 317 passed, 1 skipped (313 + 4 new tests).
### Step 2.8: Commit
```bash
git add l4d2web/services/job_worker.py l4d2web/app.py l4d2web/config.py l4d2web/tests/test_job_worker.py
git commit -m "$(cat <<'EOF'
feat(l4d2-web): periodic state poller refreshes Server.actual_state

A background thread spawned alongside the job workers polls every
server's status every STATE_POLLER_INTERVAL_SECONDS (default 30) and
writes the result via the existing refresh_server_actual_state path.
Servers with in-flight jobs are skipped to avoid racing the post-job
refresh. Catches reboot drift, OOM kills, manual systemctl operations,
and any other out-of-band state change. Spec:
docs/superpowers/specs/2026-05-09-l4d2-server-lifecycle-reboot-and-drift-design.md
EOF
)"
```
---
## Final Verification
- [ ] **Step F.1: Full test sweep**
Run: `cd /Users/mwiegand/Projekte/left4me && pytest deploy/tests/ l4d2host/tests l4d2web/tests -q`
Expected: ~466 passed, 1 failed (the pre-existing unrelated `test_deploy_script_has_safe_defaults_and_preserves_state`), 2 skipped.
- [ ] **Step F.2: Working tree clean and commit shape**
Run: `git status && git log --oneline -5`
Expected:
- `git status`: clean.
- Top of `git log`:
  1. `feat(l4d2-web): periodic state poller refreshes Server.actual_state`
  2. `feat(l4d2-host): server lifecycle uses systemctl enable --now / disable --now`
  3. `docs(plans): l4d2 server lifecycle reboot-and-drift — implementation plan`
  4. `docs(specs): l4d2 server lifecycle reboot-and-drift — design`
- [ ] **Step F.3: Operator-side smoke test (deferred, not part of this plan)**
End-to-end on `ckn@10.0.4.128` after deploy:
```sh
deploy/deploy-test-server.sh ckn@10.0.4.128
# Confirm the helper now drives enable/disable
ssh ckn@10.0.4.128 'cat /usr/local/libexec/left4me/left4me-systemctl | grep -E "enable|disable"'
# expect: enable) exec "$systemctl" enable --now "$unit"
# disable) exec "$systemctl" disable --now "$unit"
# Click "start" in the web UI for a server. Then:
ssh ckn@10.0.4.128 'systemctl is-enabled left4me-server@1.service'
# expect: enabled
# Reboot the host:
ssh ckn@10.0.4.128 'sudo systemctl reboot'
# wait for it to come back, then:
ssh ckn@10.0.4.128 'systemctl is-active left4me-server@1.service && pgrep -fa srcds'
# expect: active, srcds running with no UI intervention
# Confirm the poller corrects out-of-band drift
ssh ckn@10.0.4.128 'sudo systemctl disable --now left4me-server@1.service'
# Within ~30s the web UI's actual_state for server 1 flips from "running" to "stopped".
ssh ckn@10.0.4.128 'sudo -u left4me /opt/left4me/.venv/bin/python -c "
import sqlite3
c = sqlite3.connect(\"/var/lib/left4me/left4me.db\")
print(c.execute(\"SELECT id, actual_state, actual_state_updated_at FROM servers WHERE id=1\").fetchone())
"'
# expect: actual_state='stopped' with a fresh updated_at.
```
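
The "within ~30s" check in the smoke test amounts to polling until a value converges or a deadline passes. A minimal sketch of such a wait loop follows; `wait_for_state` is a hypothetical helper, and `read_state` is whatever callable the operator supplies, for example one wrapping the SQLite query shown in the smoke test.

```python
import time
from typing import Callable


def wait_for_state(
    read_state: Callable[[], str],
    expected: str,
    timeout: float = 45.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll read_state() until it returns expected or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        if read_state() == expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A timeout somewhat above the poll interval, such as the 45 seconds assumed here for a 30-second poller, leaves room for one full sweep plus the refresh itself.
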
---
## Out of Scope (do NOT implement here)
- Auto-restart on `desired_state=running && actual_state=stopped`.
- UI banners for stale-state warnings.
- Reconciliation of orphan systemd units.
- Per-server poll intervals.
- Replacing `Restart=on-failure`.
- Touching the pre-existing red test (`test_deploy_script_has_safe_defaults_and_preserves_state`).
If you find yourself touching any of these, stop — they belong in a separate spec.