# L4D2 Web Queue Worker Implementation Plan

> **Approval gate:** This plan may be written and refined without further approval. Do not implement code changes from this plan until the user explicitly approves implementation.

**Goal:** Complete the `l4d2web` async lifecycle queue so queued jobs are claimed, executed through the `l4d2ctl` host command boundary, logged to `job_logs`, reflected in server state, and streamed live to the UI.

**Architecture:** Keep the v1 single-process Flask architecture. Use DB-backed queued jobs as the durable source of truth, worker threads inside the Flask process, SQLite-safe process-local locks, and direct imports through `l4d2web.services.l4d2_facade`. Do not shell out to `l4d2ctl` from the web app.

---

## Current Gap

- Server lifecycle routes create `Job(state="queued")` rows.
- `l4d2web.services.job_worker` has scheduler helpers, stale recovery, command-log append, and actual-state refresh helpers.
- No worker claims queued jobs.
- No code dispatches queued operations to `l4d2_facade`.
- No command callbacks persist live stdout/stderr while jobs run.
- Job-log SSE currently replays existing rows once and does not live-follow new rows.
- Job-log SSE emits `stdout`/`stderr` custom events, while `static/js/sse.js` only handles default messages.
- No web route currently enqueues global `install` jobs.

---

## Locked Decisions

- Queue execution uses direct Python imports through `l4d2web.services.l4d2_facade`.
- The queue is DB-backed, not an in-memory `queue.Queue`.
- Worker threads are in-process daemon threads.
- SQLite concurrency is protected with process-local locks; no distributed lock manager is added.
- Workers are not started during normal tests.
- `POST /admin/install` is added as the admin-only runtime install/update entry point.
- `install` jobs have `server_id=None` and are globally exclusive.
- Server-specific jobs do not overlap on the same `server_id`.
- Different server jobs can run concurrently when no install job is running.
- A web `start` job applies the live-linked blueprint before start by running `initialize_server(server_id)` and then `start_server(server_id)`. This satisfies “blueprint updates apply on next action.”
- `delete` removes the host instance/runtime through `l4d2host`; it does not delete the web `Server` row in v1.
- Command log rows are retained indefinitely.

---

## Task 1: Extend Worker Tests First

**Files:**

- Modify: `l4d2web/tests/test_job_worker.py`
- Modify as needed: `l4d2web/tests/test_job_logs.py`

Add tests that verify the worker behavior without touching real systemd, Steam, or `/opt/l4d2`. Use monkeypatched `l4d2web.services.l4d2_facade` functions.

Required coverage:

- `run_worker_once()` claims the oldest runnable queued job.
- A successful server job transitions `queued -> running -> succeeded` and sets `exit_code=0`, `started_at`, `finished_at`, and `updated_at`.
- A successful job persists stdout/stderr callback lines in `job_logs`.
- A `subprocess.CalledProcessError` transitions the job to `failed` and stores `exit_code=exc.returncode`.
- An unexpected exception transitions the job to `failed` with `exit_code=1`.
- Same-server jobs do not overlap.
- Different-server jobs can be claimed concurrently by separate worker passes.
- An `install` job is not claimed while any server job is running.
- Server jobs are not claimed while an `install` job is running.
- Startup recovery marks stale `running` jobs as `failed`.
- Actual server state is refreshed after server-specific lifecycle jobs.
- `Server.last_error` is cleared on success and set on failure.

Verification command:

```bash
pytest l4d2web/tests/test_job_worker.py -q
```

Expected before implementation: FAIL.
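As an illustration of the monkeypatching approach above, a recording stand-in for the facade lets a test assert dispatch order without touching systemd or Steam. This is a minimal sketch: `l4d2_facade`, `initialize_server`, and `start_server` are names from this plan, while `calls` and `run_start_job` are hypothetical helpers standing in for the real worker dispatch and the pytest `monkeypatch` fixture.

```python
from types import SimpleNamespace

# Recording fake: in the real suite, monkeypatch.setattr would replace the
# facade functions that job_worker imports; here a SimpleNamespace stands in.
calls = []

def fake_initialize_server(server_id, **kwargs):
    calls.append(("initialize", server_id))

def fake_start_server(server_id, **kwargs):
    calls.append(("start", server_id))

l4d2_facade = SimpleNamespace(
    initialize_server=fake_initialize_server,
    start_server=fake_start_server,
)

def run_start_job(server_id):
    # Sketch of the `start` dispatch from Locked Decisions: initialize applies
    # the live-linked blueprint, then the server is started.
    l4d2_facade.initialize_server(server_id)
    l4d2_facade.start_server(server_id)
```

A test would then run `run_start_job(7)` and assert `calls == [("initialize", 7), ("start", 7)]`, which is exactly the ordering the blueprint-on-start decision requires.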
---

## Task 2: Implement Queue Claiming And Job Execution

**Files:**

- Modify: `l4d2web/services/job_worker.py`

Add worker-core functions:

- `build_scheduler_state(session) -> SchedulerState`
- `claim_next_job() -> int | None`
- `run_worker_once() -> bool`
- `run_job(job_id: int) -> None`
- `finish_job(job_id: int, state: str, exit_code: int | None, error: str = "") -> None`
- `append_job_log_line(job_id: int, stream: str, line: str, max_chars: int = 4096) -> int`

Implementation rules:

- Use a module-level claim lock around scheduler-state construction, queued-job selection, and the `queued -> running` transition.
- Commit the `running` transition before executing any host operation.
- Do not keep a DB session open while a host operation runs.
- Use a module-level log lock around `append_job_log()` so concurrent stdout/stderr callback threads cannot duplicate `seq` values.
- Recompute scheduler state from `running` jobs in the DB, not from in-memory state alone.
- Select queued jobs by `created_at`, then `id`, for deterministic order.
- Handle malformed server operations that have no `server_id` by failing the job cleanly.
- Treat unknown operations as failed jobs, not worker-thread crashes.

Operation dispatch:

```text
install    -> l4d2_facade.install_runtime(...)
initialize -> l4d2_facade.initialize_server(server_id, ...)
start      -> l4d2_facade.initialize_server(server_id, ...), then
              l4d2_facade.start_server(server_id, ...)
stop       -> l4d2_facade.stop_server(server_id, ...)
delete     -> l4d2_facade.delete_server(server_id, ...)
```

Failure handling:

- `subprocess.CalledProcessError`: append remaining stderr if useful, then fail with `exit_code=returncode`.
- Any other exception: append the exception text to stderr, then fail with `exit_code=1`.
- Never let a job exception kill the worker loop.

Verification command:

```bash
pytest l4d2web/tests/test_job_worker.py -q
```

Expected after implementation: PASS.
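The claim rules above can be sketched with an in-memory job list standing in for the `Job` table. This is a sketch only: `claim_next_job` is the name from this plan, while `JOBS`, `_runnable`, and the dict-shaped rows are assumptions replacing the real SQLAlchemy queries. The real implementation would also commit the `running` transition before dispatching.

```python
import threading

# In-memory stand-in for the Job table; the real code queries SQLAlchemy rows.
# Each job: {"id", "server_id", "operation", "state", "created_at"}.
JOBS = []
_claim_lock = threading.Lock()

def _runnable(job, running):
    """Apply the exclusivity rules from Locked Decisions."""
    if job["operation"] == "install":
        # install is globally exclusive: claimable only when nothing runs.
        return not running
    if any(r["operation"] == "install" for r in running):
        return False  # no server jobs while an install job is running
    # same-server jobs must not overlap; different servers may run concurrently
    return all(r["server_id"] != job["server_id"] for r in running)

def claim_next_job():
    """Claim the oldest runnable queued job, or return None.

    Scheduler state is rebuilt from running jobs on every pass, and the whole
    select-and-transition step sits under one module-level lock so two worker
    threads can never claim the same job.
    """
    with _claim_lock:
        running = [j for j in JOBS if j["state"] == "running"]
        queued = sorted(
            (j for j in JOBS if j["state"] == "queued"),
            key=lambda j: (j["created_at"], j["id"]),  # deterministic order
        )
        for job in queued:
            if _runnable(job, running):
                job["state"] = "running"
                return job["id"]
    return None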
---

## Task 3: Add Worker Thread Startup

**Files:**

- Modify: `l4d2web/config.py`
- Modify: `l4d2web/app.py`
- Modify: `l4d2web/services/job_worker.py`
- Modify: `l4d2web/tests/test_job_worker.py`

Add config:

```python
"JOB_WORKER_ENABLED": True
"JOB_WORKER_POLL_SECONDS": 1
```

Add worker lifecycle functions:

- `start_job_workers(app) -> None`
- `worker_loop(app, poll_seconds: float) -> None`

Startup behavior:

- `create_app()` still calls `recover_stale_jobs()`.
- After recovery, `create_app()` starts workers only when enabled and not in `TESTING`.
- Guard against duplicate worker startup in the same process.
- Worker threads run as daemon threads.
- Each worker loop iteration wraps `run_worker_once()` in `app.app_context()`.
- If no job was run, sleep for `JOB_WORKER_POLL_SECONDS`.

Testing requirements:

- Tests must not accidentally start real background workers.
- Add a focused startup test with a monkeypatched `start_job_workers` if needed.

Verification command:

```bash
pytest l4d2web/tests/test_job_worker.py -q
```

---

## Task 4: Make Job Log SSE Live-Follow

**Files:**

- Modify: `l4d2web/routes/job_routes.py`
- Modify: `l4d2web/static/js/sse.js`
- Modify: `l4d2web/tests/test_job_logs.py`

Route behavior:

- Authorize the job before streaming.
- Replay rows with `seq > last_seq`, up to `JOB_LOG_REPLAY_LIMIT`.
- Continue polling for new rows while the job is not terminal.
- Close the stream after all available logs are sent and the job state is terminal.
- Keep emitting `id: ` so EventSource can resume.
- Keep `event: stdout` and `event: stderr` for job logs.

JS behavior:

- Keep handling default server-log messages via `source.onmessage`.
- Also register `stdout` and `stderr` listeners that append job-log lines to the same element.
- Prefix custom job-log events with the stream name only if it helps readability.

Terminal states:

```text
succeeded
failed
cancelled
```

`cancelled` is reserved for future use and does not require cancellation support in this task.
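The live-follow route behavior above can be sketched as a plain generator producing SSE frames. This is a sketch under stated assumptions: `stream_job_logs`, `sse_frame`, `get_rows`, and `get_state` are hypothetical stand-ins for the real DB queries in `job_routes.py`; in Flask the generator would be wrapped in a `Response` with mimetype `text/event-stream`.

```python
import time

def sse_frame(event, seq, line):
    # SSE wire format: the id field lets EventSource resume via Last-Event-ID.
    return f"id: {seq}\nevent: {event}\ndata: {line}\n\n"

def stream_job_logs(get_rows, get_state, last_seq=0, poll=0.05):
    """Yield SSE frames for a job's log rows, live-following until terminal.

    get_rows(after_seq) returns [(seq, stream, line), ...] with seq > after_seq;
    get_state() returns the job state. Both stand in for DB queries.
    """
    terminal = {"succeeded", "failed", "cancelled"}
    while True:
        # Read state BEFORE rows so a final flush written between the two
        # reads is picked up on the next pass instead of being dropped.
        state = get_state()
        rows = get_rows(last_seq)
        for seq, stream, line in rows:
            last_seq = seq
            yield sse_frame(stream, seq, line)
        if state in terminal and not rows:
            return  # all available logs sent and the job is finished
        if not rows:
            time.sleep(poll)  # JOB_WORKER-style polling between passes
```

Reading the state before the rows is the detail that makes "close after all logs are sent and the job is terminal" safe: the stream only ends on a pass where the job was already terminal and no new rows appeared.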
Verification command:

```bash
pytest l4d2web/tests/test_job_logs.py -q
```

---

## Task 5: Add Admin Runtime Install Action

**Files:**

- Modify: `l4d2web/routes/page_routes.py`
- Modify: `l4d2web/templates/admin.html`
- Modify: `l4d2web/tests/test_pages.py`, or add a focused admin route test

Behavior:

- `POST /admin/install` requires `@require_admin`.
- Creates `Job(user_id=current_admin.id, server_id=None, operation="install", state="queued")`.
- Redirects to `/admin/jobs`.
- Non-admin logged-in users receive `403`.
- Anonymous users are redirected to login.
- The admin page shows a CSRF-protected form/button for runtime install/update.

Verification command:

```bash
pytest l4d2web/tests/test_pages.py -q
```

---

## Task 6: Full Verification And Review

Run focused suites first:

```bash
pytest l4d2web/tests/test_job_worker.py -q
pytest l4d2web/tests/test_job_logs.py -q
pytest l4d2web/tests/test_pages.py -q
```

Then run the full web suite:

```bash
pytest l4d2web/tests -q
```

Refresh the code index after implementation:

```bash
ccc index
```

Request a final read-only review focused on:

- queue claiming races
- duplicate worker startup
- job-log sequence ordering
- error handling and `last_error`
- live SSE behavior
- `start` applying blueprint updates before host start

---

## Commit Strategy

Use small commits after passing relevant tests:

1. `feat(l4d2-web): execute queued lifecycle jobs`
2. `feat(l4d2-web): live-follow queued job logs`
3. `feat(l4d2-web): add admin runtime install job`

Do not commit unless the user explicitly asks for commits.

---

## Open Approval Gate

Before modifying implementation files, ask the user for explicit approval to proceed with the queue-worker implementation.