# L4D2 Web Queue Worker Implementation Plan

> **Approval gate:** This plan may be written and refined without further approval. Do not implement code changes from this plan until the user explicitly approves implementation.

**Goal:** Complete the `l4d2web` async lifecycle queue so queued jobs are claimed, executed through the `l4d2ctl` host command boundary, logged to `job_logs`, reflected in server state, and streamed live to the UI.

**Architecture:** Keep the v1 single-process Flask architecture. Use DB-backed queued jobs as the durable source of truth, worker threads inside the Flask process, SQLite-safe process-local locks, and direct imports through `l4d2web.services.l4d2_facade`. Do not shell out to `l4d2ctl` from the web app.

---

## Current Gap

- Server lifecycle routes create `Job(state="queued")` rows.
- `l4d2web.services.job_worker` has scheduler helpers, stale recovery, command-log append, and actual-state refresh helpers.
- No worker claims queued jobs.
- No code dispatches queued operations to `l4d2_facade`.
- No command callbacks persist live stdout/stderr while jobs run.
- Job-log SSE currently replays existing rows once and does not live-follow new rows.
- Job-log SSE emits `stdout`/`stderr` custom events, while `static/js/sse.js` only handles default messages.
- No web route currently enqueues global `install` jobs.

---

## Locked Decisions

- Queue execution uses direct Python imports through `l4d2web.services.l4d2_facade`.
- The queue is DB-backed, not an in-memory `queue.Queue`.
- Worker threads are in-process daemon threads.
- SQLite concurrency is protected with process-local locks; no distributed lock manager is added.
- Workers are not started during normal tests.
- `POST /admin/install` is added as the admin-only runtime install/update entry point.
- `install` jobs have `server_id=None` and are globally exclusive.
- Server-specific jobs do not overlap on the same `server_id`.
- Different server jobs can run concurrently when no install job is running.
- A web `start` job applies the live-linked blueprint before start by running `initialize_server(server_id)` and then `start_server(server_id)`. This satisfies “blueprint updates apply on next action.”
- `delete` removes the host instance/runtime through `l4d2host`; it does not delete the web `Server` row in v1.
- Command log rows are retained indefinitely.

---

## Task 1: Extend Worker Tests First

**Files:**
- Modify: `l4d2web/tests/test_job_worker.py`
- Modify as needed: `l4d2web/tests/test_job_logs.py`

Add tests that verify the worker behavior without touching real systemd, Steam, or `/opt/l4d2`. Use monkeypatched `l4d2web.services.l4d2_facade` functions.

Required coverage:

- `run_worker_once()` claims the oldest runnable queued job.
- A successful server job transitions `queued -> running -> succeeded` and sets `exit_code=0`, `started_at`, `finished_at`, and `updated_at`.
- A successful job persists stdout/stderr callback lines in `job_logs`.
- A `subprocess.CalledProcessError` transitions the job to `failed` and stores `exit_code=exc.returncode`.
- An unexpected exception transitions the job to `failed` with `exit_code=1`.
- Same-server jobs do not overlap.
- Different-server jobs can be claimed concurrently by separate worker passes.
- An `install` job is not claimed while any server job is running.
- Server jobs are not claimed while an `install` job is running.
- Startup recovery marks stale `running` jobs as `failed`.
- Actual server state is refreshed after server-specific lifecycle jobs.
- `Server.last_error` is cleared on success and set on failure.
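
One possible way to build the monkeypatched facade functions these tests need is a small factory like the one below. It is a sketch, not the project's test helper: `make_fake_operation` is a hypothetical name, and the `on_stdout`/`on_stderr` keyword callbacks are an assumption about the facade signature, which this plan does not spell out. In a test it could be installed with pytest's `monkeypatch.setattr(l4d2_facade, "start_server", make_fake_operation(["ok"]))`.

```python
import subprocess


def make_fake_operation(stdout_lines, returncode=0):
    """Build a stand-in for one facade function.

    Assumption: facade functions accept `on_stdout`/`on_stderr` keyword
    callbacks; adjust the names to match the real signature.
    """
    calls = []

    def fake(*args, on_stdout=None, on_stderr=None, **kwargs):
        calls.append(args)  # record positional args so tests can assert on them
        for line in stdout_lines:
            if on_stdout is not None:
                on_stdout(line)
        if returncode != 0:
            if on_stderr is not None:
                on_stderr("boom")
            # Mimic a failing host command.
            raise subprocess.CalledProcessError(returncode, "fake-host-command")

    fake.calls = calls
    return fake
```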

Verification command:

```bash
pytest l4d2web/tests/test_job_worker.py -q
```

Expected before implementation: FAIL.

---

## Task 2: Implement Queue Claiming And Job Execution

**Files:**
- Modify: `l4d2web/services/job_worker.py`

Add worker-core functions:

- `build_scheduler_state(session) -> SchedulerState`
- `claim_next_job() -> int | None`
- `run_worker_once() -> bool`
- `run_job(job_id: int) -> None`
- `finish_job(job_id: int, state: str, exit_code: int | None, error: str = "") -> None`
- `append_job_log_line(job_id: int, stream: str, line: str, max_chars: int = 4096) -> int`

Implementation rules:

- Use a module-level claim lock around scheduler-state construction, queued-job selection, and the `queued -> running` transition.
- Commit the `running` transition before executing any host operation.
- Do not keep a DB session open while a host operation runs.
- Use a module-level log lock around `append_job_log()` so concurrent stdout/stderr callback threads cannot duplicate `seq` values.
- Recompute scheduler state from `running` jobs in the DB, not from only in-memory state.
- Select queued jobs by `created_at`, then `id` for deterministic order.
- Fail malformed server operations that have no `server_id` cleanly instead of dispatching them.
- Treat unknown operations as failed jobs, not worker-thread crashes.

Operation dispatch:

```text
install -> l4d2_facade.install_runtime(...)
initialize -> l4d2_facade.initialize_server(server_id, ...)
start -> l4d2_facade.initialize_server(server_id, ...), then l4d2_facade.start_server(server_id, ...)
stop -> l4d2_facade.stop_server(server_id, ...)
delete -> l4d2_facade.delete_server(server_id, ...)
```

Failure handling:

- `subprocess.CalledProcessError`: append remaining stderr if useful, fail with `exit_code=returncode`.
- Any other exception: append exception text to stderr, fail with `exit_code=1`.
- Never let a job exception kill the worker loop.

Verification command:

```bash
pytest l4d2web/tests/test_job_worker.py -q
```

Expected after implementation: PASS.

---

## Task 3: Add Worker Thread Startup

**Files:**
- Modify: `l4d2web/config.py`
- Modify: `l4d2web/app.py`
- Modify: `l4d2web/services/job_worker.py`
- Modify: `l4d2web/tests/test_job_worker.py`

Add config:

```python
"JOB_WORKER_ENABLED": True,
"JOB_WORKER_POLL_SECONDS": 1,
```

Add worker lifecycle functions:

- `start_job_workers(app) -> None`
- `worker_loop(app, poll_seconds: float) -> None`

Startup behavior:

- `create_app()` still calls `recover_stale_jobs()`.
- After recovery, `create_app()` starts workers only when enabled and not in `TESTING`.
- Guard against duplicate worker startup in the same process.
- Worker threads run as daemon threads.
- Each worker loop uses `app.app_context()` around `run_worker_once()`.
- If no job was run, sleep for `JOB_WORKER_POLL_SECONDS`.

Testing requirements:

- Tests should not accidentally start real background workers.
- Add a focused startup test with monkeypatched `start_job_workers` if needed.

Verification command:

```bash
pytest l4d2web/tests/test_job_worker.py -q
```

---

## Task 4: Make Job Log SSE Live-Follow

**Files:**
- Modify: `l4d2web/routes/job_routes.py`
- Modify: `l4d2web/static/js/sse.js`
- Modify: `l4d2web/tests/test_job_logs.py`

Route behavior:

- Authorize the job before streaming.
- Replay rows with `seq > last_seq` up to `JOB_LOG_REPLAY_LIMIT`.
- Continue polling for new rows while the job is not terminal.
- Close the stream after all available logs are sent and the job state is terminal.
- Keep emitting `id: <seq>` so EventSource can resume.
- Keep `event: stdout` and `event: stderr` for job logs.
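
The frame format and replay filter described above can be illustrated as follows. This is a minimal sketch with hypothetical helper names (`format_sse`, `replay`); it assumes each log row is a single line of text and leaves out authorization and the live-follow polling loop. On reconnect, EventSource sends the last received `id` back in the `Last-Event-ID` request header, which is where `last_seq` would come from.

```python
from typing import Iterable, Iterator


def format_sse(seq: int, stream: str, line: str) -> str:
    """Format one job-log row as a Server-Sent Events frame."""
    # `stream` is "stdout" or "stderr"; a blank line terminates the frame.
    return f"id: {seq}\nevent: {stream}\ndata: {line}\n\n"


def replay(rows: Iterable[tuple[int, str, str]], last_seq: int,
           limit: int) -> Iterator[str]:
    """Yield frames for rows with seq > last_seq, at most `limit` of them."""
    sent = 0
    for seq, stream, line in rows:
        if seq <= last_seq:
            continue  # the client has already seen this row
        if sent >= limit:
            break
        yield format_sse(seq, stream, line)
        sent += 1
```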

JS behavior:

- Keep handling default server-log messages via `source.onmessage`.
- Also register `stdout` and `stderr` listeners that append job-log lines to the same element.
- Prefix custom job-log events with the stream name only if useful for readability.

Terminal states:

```text
succeeded
failed
cancelled
```

`cancelled` is reserved for future use and does not require cancellation support in this task.

Verification command:

```bash
pytest l4d2web/tests/test_job_logs.py -q
```

---

## Task 5: Add Admin Runtime Install Action

**Files:**
- Modify: `l4d2web/routes/page_routes.py`
- Modify: `l4d2web/templates/admin.html`
- Modify: `l4d2web/tests/test_pages.py` or add a focused admin route test

Behavior:

- `POST /admin/install` requires `@require_admin`.
- Creates `Job(user_id=current_admin.id, server_id=None, operation="install", state="queued")`.
- Redirects to `/admin/jobs`.
- Non-admin logged-in users receive `403`.
- Anonymous users are redirected to login.
- Admin page shows a CSRF-protected form/button for runtime install/update.

Verification command:

```bash
pytest l4d2web/tests/test_pages.py -q
```

---

## Task 6: Full Verification And Review

Run focused suites first:

```bash
pytest l4d2web/tests/test_job_worker.py -q
pytest l4d2web/tests/test_job_logs.py -q
pytest l4d2web/tests/test_pages.py -q
```

Then run the full web suite:

```bash
pytest l4d2web/tests -q
```

Refresh the code index after implementation:

```bash
ccc index
```

Request a final read-only review focused on:

- queue claiming races
- duplicate worker startup
- job-log sequence ordering
- error handling and `last_error`
- live SSE behavior
- `start` applying blueprint updates before host start

---

## Commit Strategy

Use small commits after passing relevant tests:

1. `feat(l4d2-web): execute queued lifecycle jobs`
2. `feat(l4d2-web): live-follow queued job logs`
3. `feat(l4d2-web): add admin runtime install job`

Do not commit unless the user explicitly asks for commits.

---

## Open Approval Gate

Before modifying implementation files, ask the user for explicit approval to proceed with the queue-worker implementation.