left4me/docs/superpowers/plans/2026-05-06-l4d2-web-queue-worker.md


L4D2 Web Queue Worker Implementation Plan

Approval gate: This plan may be written and refined without further approval. Do not implement code changes from this plan until the user explicitly approves implementation.

Goal: Complete the l4d2web async lifecycle queue so queued jobs are claimed, executed through the l4d2ctl host command boundary, logged to job_logs, reflected in server state, and streamed live to the UI.

Architecture: Keep the v1 single-process Flask architecture. Use DB-backed queued jobs as the durable source of truth, worker threads inside the Flask process, SQLite-safe process-local locks, and direct imports through l4d2web.services.l4d2_facade. Do not shell out to l4d2ctl from the web app.


Current Gap

  • Server lifecycle routes create Job(state="queued") rows.
  • l4d2web.services.job_worker has scheduler helpers, stale recovery, command-log append, and actual-state refresh helpers.
  • No worker claims queued jobs.
  • No code dispatches queued operations to l4d2_facade.
  • No command callbacks persist live stdout/stderr while jobs run.
  • Job-log SSE currently replays existing rows once and does not live-follow new rows.
  • Job-log SSE emits stdout/stderr custom events, while static/js/sse.js only handles default messages.
  • No web route currently enqueues global install jobs.

Locked Decisions

  • Queue execution uses direct Python imports through l4d2web.services.l4d2_facade.
  • The queue is DB-backed, not an in-memory queue.Queue.
  • Worker threads are in-process daemon threads.
  • SQLite concurrency is protected with process-local locks; no distributed lock manager is added.
  • Workers are not started during normal tests.
  • POST /admin/install is added as the admin-only runtime install/update entry point.
  • install jobs have server_id=None and are globally exclusive.
  • Server-specific jobs do not overlap on the same server_id.
  • Different server jobs can run concurrently when no install job is running.
  • A web start job applies the live-linked blueprint before start by running initialize_server(server_id) and then start_server(server_id). This satisfies “blueprint updates apply on next action.”
  • delete removes the host instance/runtime through l4d2host; it does not delete the web Server row in v1.
  • Command log rows are retained indefinitely.

Task 1: Extend Worker Tests First

Files:

  • Modify: l4d2web/tests/test_job_worker.py
  • Modify as needed: l4d2web/tests/test_job_logs.py

Add tests that verify the worker behavior without touching real systemd, Steam, or /opt/l4d2. Use monkeypatched l4d2web.services.l4d2_facade functions.

Required coverage:

  • run_worker_once() claims the oldest runnable queued job.
  • A successful server job transitions queued -> running -> succeeded and sets exit_code=0, started_at, finished_at, and updated_at.
  • A successful job persists stdout/stderr callback lines in job_logs.
  • A subprocess.CalledProcessError transitions the job to failed and stores exit_code=exc.returncode.
  • An unexpected exception transitions the job to failed with exit_code=1.
  • Same-server jobs do not overlap.
  • Different-server jobs can be claimed concurrently by separate worker passes.
  • An install job is not claimed while any server job is running.
  • Server jobs are not claimed while an install job is running.
  • Startup recovery marks stale running jobs as failed.
  • Actual server state is refreshed after server-specific lifecycle jobs.
  • Server.last_error is cleared on success and set on failure.
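The state-transition coverage above can be sketched as a stand-alone harness. Everything here is a hypothetical stand-in: the real test would use pytest's monkeypatch to replace `l4d2web.services.l4d2_facade` attributes and assert on actual `Job` rows, whereas this sketch uses an in-memory dict and list so the test's shape is visible without touching systemd or Steam.

```python
# In-memory stand-ins for the DB-backed Job row and the job_logs table.
job = {"state": "queued", "operation": "start", "server_id": 1, "exit_code": None}
logs = []

def fake_start_server(server_id, on_line=None):
    # Stand-in for a monkeypatched l4d2_facade function; it only emits output.
    if on_line:
        on_line("stdout", "srcds started")

def run_worker_once():
    # Stand-in for the real claim/execute/finish cycle under test.
    if job["state"] != "queued":
        return False
    job["state"] = "running"
    fake_start_server(job["server_id"], on_line=lambda s, l: logs.append((s, l)))
    job["state"] = "succeeded"
    job["exit_code"] = 0
    return True

assert run_worker_once() is True
assert (job["state"], job["exit_code"]) == ("succeeded", 0)
assert ("stdout", "srcds started") in logs
assert run_worker_once() is False  # nothing left to claim
```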

Verification command:

pytest l4d2web/tests/test_job_worker.py -q

Expected before implementation: FAIL.


Task 2: Implement Queue Claiming And Job Execution

Files:

  • Modify: l4d2web/services/job_worker.py

Add worker-core functions:

  • build_scheduler_state(session) -> SchedulerState
  • claim_next_job() -> int | None
  • run_worker_once() -> bool
  • run_job(job_id: int) -> None
  • finish_job(job_id: int, state: str, exit_code: int | None, error: str = "") -> None
  • append_job_log_line(job_id: int, stream: str, line: str, max_chars: int = 4096) -> int

Implementation rules:

  • Use a module-level claim lock around scheduler-state construction, queued-job selection, and queued -> running transition.
  • Commit the running transition before executing any host operation.
  • Do not keep a DB session open while a host operation runs.
  • Use a module-level log lock around append_job_log() so concurrent stdout/stderr callback threads cannot duplicate seq values.
  • Recompute scheduler state from running jobs in the DB, not from only in-memory state.
  • Select queued jobs by created_at, then id for deterministic order.
  • Handle malformed server operations that lack a server_id by failing the job cleanly rather than crashing the worker.
  • Treat unknown operations as failed jobs, not worker-thread crashes.
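The claiming rules above can be sketched with stdlib sqlite3 standing in for the real SQLAlchemy session; the `jobs` table and column names here are illustrative assumptions, not the project's actual schema:

```python
import sqlite3
import threading

_claim_lock = threading.Lock()  # serializes state build, selection, and commit

def claim_next_job(conn):
    """Claim the oldest runnable queued job; return its id or None."""
    with _claim_lock:
        # Recompute scheduler state from running jobs in the DB, not memory.
        running_servers = {row[0] for row in conn.execute(
            "SELECT server_id FROM jobs WHERE state = 'running'")}
        if None in running_servers:
            return None  # an install job is running: globally exclusive
        for job_id, server_id, op in conn.execute(
                "SELECT id, server_id, operation FROM jobs "
                "WHERE state = 'queued' ORDER BY created_at, id"):
            if op == "install" and running_servers:
                continue  # install waits until no server job is running
            if server_id is not None and server_id in running_servers:
                continue  # same-server jobs must not overlap
            conn.execute(
                "UPDATE jobs SET state = 'running' WHERE id = ?", (job_id,))
            conn.commit()  # commit the running transition before any host op
            return job_id
    return None
```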

Operation dispatch:

install    -> l4d2_facade.install_runtime(...)
initialize -> l4d2_facade.initialize_server(server_id, ...)
start      -> l4d2_facade.initialize_server(server_id, ...), then l4d2_facade.start_server(server_id, ...)
stop       -> l4d2_facade.stop_server(server_id, ...)
delete     -> l4d2_facade.delete_server(server_id, ...)

Failure handling:

  • subprocess.CalledProcessError: append any remaining stderr to the job log, then fail with exit_code=exc.returncode.
  • Any other exception: append exception text to stderr, fail with exit_code=1.
  • Never let a job exception kill the worker loop.
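The dispatch table and failure rules can be sketched together. This is a hypothetical shape only: the real `run_job` imports `l4d2_facade` directly and persists through the DB, whereas here the facade and the `finish`/`log` helpers are passed in so the sketch stays self-contained.

```python
import subprocess

def run_job(job, facade, finish, log):
    op, sid = job["operation"], job["server_id"]
    dispatch = {
        "install":    lambda: facade.install_runtime(),
        "initialize": lambda: facade.initialize_server(sid),
        # start applies the live-linked blueprint first, then starts.
        "start":      lambda: (facade.initialize_server(sid),
                               facade.start_server(sid)),
        "stop":       lambda: facade.stop_server(sid),
        "delete":     lambda: facade.delete_server(sid),
    }
    try:
        if op != "install" and sid is None:
            raise ValueError("server operation without server_id")
        handler = dispatch.get(op)
        if handler is None:
            raise ValueError(f"unknown operation: {op}")
        handler()
    except subprocess.CalledProcessError as exc:
        log(job["id"], "stderr", str(exc))
        finish(job["id"], "failed", exc.returncode)
    except Exception as exc:  # never let a job exception kill the worker loop
        log(job["id"], "stderr", f"{type(exc).__name__}: {exc}")
        finish(job["id"], "failed", 1)
    else:
        finish(job["id"], "succeeded", 0)
```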

Verification command:

pytest l4d2web/tests/test_job_worker.py -q

Expected after implementation: PASS.


Task 3: Add Worker Thread Startup

Files:

  • Modify: l4d2web/config.py
  • Modify: l4d2web/app.py
  • Modify: l4d2web/services/job_worker.py
  • Modify: l4d2web/tests/test_job_worker.py

Add config:

"JOB_WORKER_ENABLED": True
"JOB_WORKER_POLL_SECONDS": 1

Add worker lifecycle functions:

  • start_job_workers(app) -> None
  • worker_loop(app, poll_seconds: float) -> None

Startup behavior:

  • create_app() still calls recover_stale_jobs().
  • After recovery, create_app() starts workers only when enabled and not in TESTING.
  • Guard against duplicate worker startup in the same process.
  • Worker threads run as daemon threads.
  • Each worker loop uses app.app_context() around run_worker_once().
  • If no job was run, sleep for JOB_WORKER_POLL_SECONDS.
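The startup behavior above can be sketched as follows. The `run_once` and `stop_event` parameters are sketch additions for testability (the real `worker_loop` would call `run_worker_once()` and run until process exit); `app` is any object with a `.config` dict and an `.app_context()` context manager, which Flask provides.

```python
import threading

_started = False  # module-level guard against duplicate worker startup

def worker_loop(app, run_once, poll_seconds, stop_event):
    while not stop_event.is_set():
        with app.app_context():
            ran = run_once()
        if not ran:
            stop_event.wait(poll_seconds)  # idle: sleep before polling again

def start_job_workers(app, run_once, stop_event):
    global _started
    if _started or not app.config.get("JOB_WORKER_ENABLED", True) \
            or app.config.get("TESTING"):
        return None
    _started = True
    t = threading.Thread(
        target=worker_loop,
        args=(app, run_once,
              app.config.get("JOB_WORKER_POLL_SECONDS", 1), stop_event),
        daemon=True)  # daemon thread: does not block interpreter exit
    t.start()
    return t
```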

Testing requirements:

  • Tests should not accidentally start real background workers.
  • Add a focused startup test with monkeypatched start_job_workers if needed.

Verification command:

pytest l4d2web/tests/test_job_worker.py -q

Task 4: Make Job Log SSE Live-Follow

Files:

  • Modify: l4d2web/routes/job_routes.py
  • Modify: l4d2web/static/js/sse.js
  • Modify: l4d2web/tests/test_job_logs.py

Route behavior:

  • Authorize the job before streaming.
  • Replay rows with seq > last_seq up to JOB_LOG_REPLAY_LIMIT.
  • Continue polling for new rows while the job is not terminal.
  • Close the stream after all available logs are sent and the job state is terminal.
  • Keep emitting id: <seq> so EventSource can resume.
  • Keep event: stdout and event: stderr for job logs.
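The replay-then-follow loop can be sketched as a framework-free generator; `fetch_logs` and `job_state` are hypothetical stand-ins for the real DB queries, and the `JOB_LOG_REPLAY_LIMIT` cap is omitted for brevity:

```python
import time

TERMINAL = {"succeeded", "failed", "cancelled"}

def sse_format(seq, stream, line):
    # "event: stdout"/"event: stderr" plus "id:" so EventSource can resume
    # from Last-Event-ID after a reconnect.
    return f"id: {seq}\nevent: {stream}\ndata: {line}\n\n"

def follow_job_logs(fetch_logs, job_state, last_seq=0, poll=0.1):
    while True:
        rows = fetch_logs(last_seq)  # rows with seq > last_seq, ordered by seq
        for seq, stream, line in rows:
            last_seq = seq
            yield sse_format(seq, stream, line)
        if job_state() in TERMINAL and not rows:
            return  # all available logs sent and the job is terminal: close
        if not rows:
            time.sleep(poll)
```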

JS behavior:

  • Keep handling default server-log messages via source.onmessage.
  • Also register stdout and stderr listeners that append job-log lines to the same element.
  • Prefix custom job-log events with the stream name only if useful for readability.

Terminal states:

succeeded
failed
cancelled

cancelled is reserved for future use and does not require cancellation support in this task.

Verification command:

pytest l4d2web/tests/test_job_logs.py -q

Task 5: Add Admin Runtime Install Action

Files:

  • Modify: l4d2web/routes/page_routes.py
  • Modify: l4d2web/templates/admin.html
  • Modify: l4d2web/tests/test_pages.py or add a focused admin route test

Behavior:

  • POST /admin/install requires @require_admin.
  • Creates Job(user_id=current_admin.id, server_id=None, operation="install", state="queued").
  • Redirects to /admin/jobs.
  • Non-admin logged-in users receive 403.
  • Anonymous users are redirected to login.
  • Admin page shows a CSRF-protected form/button for runtime install/update.
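The authorization and enqueue logic can be sketched framework-free; the real route would use Flask's `redirect()` and the project's `@require_admin` decorator, and the `/auth/login` URL here is an assumption:

```python
def admin_install(user, create_job, redirect, abort, login_url="/auth/login"):
    """Hypothetical handler logic for POST /admin/install."""
    if user is None:
        return redirect(login_url)  # anonymous users are redirected to login
    if not user.get("is_admin"):
        return abort(403)           # logged-in non-admins receive 403
    create_job(user_id=user["id"], server_id=None,
               operation="install", state="queued")
    return redirect("/admin/jobs")
```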

Verification command:

pytest l4d2web/tests/test_pages.py -q

Task 6: Full Verification And Review

Run focused suites first:

pytest l4d2web/tests/test_job_worker.py -q
pytest l4d2web/tests/test_job_logs.py -q
pytest l4d2web/tests/test_pages.py -q

Then run the full web suite:

pytest l4d2web/tests -q

Refresh the code index after implementation:

ccc index

Request a final read-only review focused on:

  • queue claiming races
  • duplicate worker startup
  • job-log sequence ordering
  • error handling and last_error
  • live SSE behavior
  • start applying blueprint updates before host start

Commit Strategy

Use small commits after passing relevant tests:

  1. feat(l4d2-web): execute queued lifecycle jobs
  2. feat(l4d2-web): live-follow queued job logs
  3. feat(l4d2-web): add admin runtime install job

Do not commit unless the user explicitly asks for commits.


Open Approval Gate

Before modifying implementation files, ask the user for explicit approval to proceed with the queue-worker implementation.