left4me/docs/superpowers/plans/2026-05-06-l4d2-web-queue-worker.md


# L4D2 Web Queue Worker Implementation Plan
> **Approval gate:** This plan may be written and refined without further approval. Do not implement code changes from this plan until the user explicitly approves implementation.

**Goal:** Complete the `l4d2web` async lifecycle queue so queued jobs are claimed, executed through the `l4d2ctl` host command boundary, logged to `job_logs`, reflected in server state, and streamed live to the UI.

**Architecture:** Keep the v1 single-process Flask architecture. Use DB-backed queued jobs as the durable source of truth, worker threads inside the Flask process, SQLite-safe process-local locks, and direct imports through `l4d2web.services.l4d2_facade`. Do not shell out to `l4d2ctl` from the web app.

---
## Current Gap
- Server lifecycle routes create `Job(state="queued")` rows.
- `l4d2web.services.job_worker` has scheduler helpers, stale recovery, command-log append, and actual-state refresh helpers.
- No worker claims queued jobs.
- No code dispatches queued operations to `l4d2_facade`.
- No command callbacks persist live stdout/stderr while jobs run.
- Job-log SSE currently replays existing rows once and does not live-follow new rows.
- Job-log SSE emits `stdout`/`stderr` custom events, while `static/js/sse.js` only handles default messages.
- No web route currently enqueues global `install` jobs.
---
## Locked Decisions
- Queue execution uses direct Python imports through `l4d2web.services.l4d2_facade`.
- The queue is DB-backed, not an in-memory `queue.Queue`.
- Worker threads are in-process daemon threads.
- SQLite concurrency is protected with process-local locks; no distributed lock manager is added.
- Workers are not started during normal tests.
- `POST /admin/install` is added as the admin-only runtime install/update entry point.
- `install` jobs have `server_id=None` and are globally exclusive.
- Server-specific jobs do not overlap on the same `server_id`.
- Different server jobs can run concurrently when no install job is running.
- A web `start` job applies the live-linked blueprint before start by running `initialize_server(server_id)` and then `start_server(server_id)`. This satisfies “blueprint updates apply on next action.”
- `delete` removes the host instance/runtime through `l4d2host`; it does not delete the web `Server` row in v1.
- Command log rows are retained indefinitely.
---
## Task 1: Extend Worker Tests First
**Files:**

- Modify: `l4d2web/tests/test_job_worker.py`
- Modify as needed: `l4d2web/tests/test_job_logs.py`

Add tests that verify the worker behavior without touching real systemd, Steam, or `/opt/l4d2`. Use monkeypatched `l4d2web.services.l4d2_facade` functions.

Required coverage:
- `run_worker_once()` claims the oldest runnable queued job.
- A successful server job transitions `queued -> running -> succeeded` and sets `exit_code=0`, `started_at`, `finished_at`, and `updated_at`.
- A successful job persists stdout/stderr callback lines in `job_logs`.
- A `subprocess.CalledProcessError` transitions the job to `failed` and stores `exit_code=exc.returncode`.
- An unexpected exception transitions the job to `failed` with `exit_code=1`.
- Same-server jobs do not overlap.
- Different-server jobs can be claimed concurrently by separate worker passes.
- An `install` job is not claimed while any server job is running.
- Server jobs are not claimed while an `install` job is running.
- Startup recovery marks stale `running` jobs as `failed`.
- Actual server state is refreshed after server-specific lifecycle jobs.
- `Server.last_error` is cleared on success and set on failure.
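
As one illustration of the monkeypatch approach, a fake facade can record calls and drive the stdout/stderr callbacks. The `on_line` callback signature here is an assumption for illustration, not the confirmed facade interface.

```python
class FakeFacade:
    """Test double for l4d2web.services.l4d2_facade (illustrative only)."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, int]] = []

    def start_server(self, server_id: int, on_line=None) -> None:
        # Record the call so tests can assert dispatch order.
        self.calls.append(("start_server", server_id))
        # Simulate the host command boundary streaming output to the worker.
        if on_line is not None:
            on_line("stdout", f"starting server {server_id}")
```
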

Verification command:

```bash
pytest l4d2web/tests/test_job_worker.py -q
```

Expected before implementation: FAIL.

---
## Task 2: Implement Queue Claiming And Job Execution
**Files:**

- Modify: `l4d2web/services/job_worker.py`

Add worker-core functions:
- `build_scheduler_state(session) -> SchedulerState`
- `claim_next_job() -> int | None`
- `run_worker_once() -> bool`
- `run_job(job_id: int) -> None`
- `finish_job(job_id: int, state: str, exit_code: int | None, error: str = "") -> None`
- `append_job_log_line(job_id: int, stream: str, line: str, max_chars: int = 4096) -> int`

Implementation rules:
- Use a module-level claim lock around scheduler-state construction, queued-job selection, and `queued -> running` transition.
- Commit the `running` transition before executing any host operation.
- Do not keep a DB session open while a host operation runs.
- Use a module-level log lock around `append_job_log()` so concurrent stdout/stderr callback threads cannot duplicate `seq` values.
- Recompute scheduler state from `running` jobs in the DB, not from only in-memory state.
- Select queued jobs by `created_at`, then `id` for deterministic order.
- Skip malformed server operations with no `server_id` by failing the job cleanly.
- Treat unknown operations as failed jobs, not worker-thread crashes.

Operation dispatch:
```text
install -> l4d2_facade.install_runtime(...)
initialize -> l4d2_facade.initialize_server(server_id, ...)
start -> l4d2_facade.initialize_server(server_id, ...), then l4d2_facade.start_server(server_id, ...)
stop -> l4d2_facade.stop_server(server_id, ...)
delete -> l4d2_facade.delete_server(server_id, ...)
```
Failure handling:
- `subprocess.CalledProcessError`: append any stderr output not already streamed through the callback, then fail with `exit_code=returncode`.
- Any other exception: append exception text to stderr, fail with `exit_code=1`.
- Never let a job exception kill the worker loop.

Verification command:

```bash
pytest l4d2web/tests/test_job_worker.py -q
```

Expected after implementation: PASS.

---
## Task 3: Add Worker Thread Startup
**Files:**
- Modify: `l4d2web/config.py`
- Modify: `l4d2web/app.py`
- Modify: `l4d2web/services/job_worker.py`
- Modify: `l4d2web/tests/test_job_worker.py`

Add config:
```python
"JOB_WORKER_ENABLED": True,
"JOB_WORKER_POLL_SECONDS": 1,
```
Add worker lifecycle functions:
- `start_job_workers(app) -> None`
- `worker_loop(app, poll_seconds: float) -> None`

Startup behavior:
- `create_app()` still calls `recover_stale_jobs()`.
- After recovery, `create_app()` starts workers only when enabled and not in `TESTING`.
- Guard against duplicate worker startup in the same process.
- Worker threads run as daemon threads.
- Each worker loop uses `app.app_context()` around `run_worker_once()`.
- If no job was run, sleep for `JOB_WORKER_POLL_SECONDS`.

Testing requirements:
- Tests should not accidentally start real background workers.
- Add a focused startup test with monkeypatched `start_job_workers` if needed.

Verification command:
```bash
pytest l4d2web/tests/test_job_worker.py -q
```
---
## Task 4: Make Job Log SSE Live-Follow
**Files:**
- Modify: `l4d2web/routes/job_routes.py`
- Modify: `l4d2web/static/js/sse.js`
- Modify: `l4d2web/tests/test_job_logs.py`

Route behavior:
- Authorize the job before streaming.
- Replay rows with `seq > last_seq` up to `JOB_LOG_REPLAY_LIMIT`.
- Continue polling for new rows while the job is not terminal.
- Close the stream after all available logs are sent and the job state is terminal.
- Keep emitting `id: <seq>` so EventSource can resume.
- Keep `event: stdout` and `event: stderr` for job logs.
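
The replay-then-follow control flow can be sketched as a generator over `(seq, stream, line)` rows. DB access and per-poll sleeping are abstracted behind injected callables; this sketches the control flow only, not the final route code.

```python
def format_sse_event(seq: int, stream: str, line: str) -> str:
    """Frame one job_logs row as an SSE event: id for resume, event for stream."""
    return f"id: {seq}\nevent: {stream}\ndata: {line}\n\n"


def follow_job_logs(fetch_rows, is_terminal, last_seq: int = 0):
    """Yield SSE frames, live-following until the job reaches a terminal state.

    fetch_rows(last_seq) returns [(seq, stream, line), ...] with seq > last_seq;
    is_terminal() reports whether the job has finished. The real route would
    also sleep between polls and enforce JOB_LOG_REPLAY_LIMIT.
    """
    while True:
        # Check terminal state *before* fetching so rows written just as the
        # job finishes are still drained on the final pass.
        done = is_terminal()
        for seq, stream, line in fetch_rows(last_seq):
            last_seq = seq
            yield format_sse_event(seq, stream, line)
        if done:
            return
```
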

JS behavior:
- Keep handling default server-log messages via `source.onmessage`.
- Also register `stdout` and `stderr` listeners that append job-log lines to the same element.
- Prefix custom job-log events with the stream name only if useful for readability.

Terminal states:
```text
succeeded
failed
cancelled
```
`cancelled` is reserved for future use and does not require cancellation support in this task.

Verification command:
```bash
pytest l4d2web/tests/test_job_logs.py -q
```
---
## Task 5: Add Admin Runtime Install Action
**Files:**
- Modify: `l4d2web/routes/page_routes.py`
- Modify: `l4d2web/templates/admin.html`
- Modify: `l4d2web/tests/test_pages.py` or add a focused admin route test

Behavior:
- `POST /admin/install` requires `@require_admin`.
- Creates `Job(user_id=current_admin.id, server_id=None, operation="install", state="queued")`.
- Redirects to `/admin/jobs`.
- Non-admin logged-in users receive `403`.
- Anonymous users are redirected to login.
- Admin page shows a CSRF-protected form/button for runtime install/update.
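
The authorization outcomes can be sketched as a plain decision function. Flask wiring, CSRF handling, and the real `Job` model are omitted; the `is_admin` attribute and the redirect paths are assumptions taken from the behavior list above.

```python
def install_action_response(user, enqueue_install) -> tuple[int, str]:
    """Decide the POST /admin/install response (sketch; `user` is None for
    anonymous requests, otherwise carries an is_admin flag)."""
    if user is None:
        return 302, "/login"  # anonymous: redirect to login
    if not user.is_admin:
        return 403, ""  # logged in but not admin
    # Admin: enqueue a global install job (server_id=None) and redirect.
    enqueue_install(user)
    return 302, "/admin/jobs"
```
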

Verification command:
```bash
pytest l4d2web/tests/test_pages.py -q
```
---
## Task 6: Full Verification And Review
Run focused suites first:
```bash
pytest l4d2web/tests/test_job_worker.py -q
pytest l4d2web/tests/test_job_logs.py -q
pytest l4d2web/tests/test_pages.py -q
```
Then run the full web suite:
```bash
pytest l4d2web/tests -q
```
Refresh the code index after implementation:
```bash
ccc index
```
Request a final read-only review focused on:
- queue claiming races
- duplicate worker startup
- job-log sequence ordering
- error handling and `last_error`
- live SSE behavior
- `start` applying blueprint updates before host start
---
## Commit Strategy
Use small commits after passing relevant tests:
1. `feat(l4d2-web): execute queued lifecycle jobs`
2. `feat(l4d2-web): live-follow queued job logs`
3. `feat(l4d2-web): add admin runtime install job`

Do not commit unless the user explicitly asks for commits.

---
## Open Approval Gate
Before modifying implementation files, ask the user for explicit approval to proceed with the queue-worker implementation.