L4D2 Web Queue Worker Implementation Plan
Approval gate: This plan may be written and refined without further approval. Do not implement code changes from this plan until the user explicitly approves implementation.
Goal: Complete the l4d2web async lifecycle queue so queued jobs are claimed, executed through the l4d2ctl host command boundary, logged to job_logs, reflected in server state, and streamed live to the UI.
Architecture: Keep the v1 single-process Flask architecture. Use DB-backed queued jobs as the durable source of truth, worker threads inside the Flask process, SQLite-safe process-local locks, and direct imports through l4d2web.services.l4d2_facade. Do not shell out to l4d2ctl from the web app.
Current Gap
- Server lifecycle routes create Job(state="queued") rows.
- l4d2web.services.job_worker has scheduler helpers, stale recovery, command-log append, and actual-state refresh helpers.
- No worker claims queued jobs.
- No code dispatches queued operations to l4d2_facade.
- No command callbacks persist live stdout/stderr while jobs run.
- Job-log SSE currently replays existing rows once and does not live-follow new rows.
- Job-log SSE emits stdout/stderr custom events, while static/js/sse.js only handles default messages.
- No web route currently enqueues global install jobs.
Locked Decisions
- Queue execution uses direct Python imports through l4d2web.services.l4d2_facade.
- The queue is DB-backed, not an in-memory queue.Queue.
- Worker threads are in-process daemon threads.
- SQLite concurrency is protected with process-local locks; no distributed lock manager is added.
- Workers are not started during normal tests.
- POST /admin/install is added as the admin-only runtime install/update entry point.
- install jobs have server_id=None and are globally exclusive.
- Server-specific jobs do not overlap on the same server_id.
- Different server jobs can run concurrently when no install job is running.
- A web start job applies the live-linked blueprint before start by running initialize_server(server_id) and then start_server(server_id). This satisfies “blueprint updates apply on next action.”
- delete removes the host instance/runtime through l4d2host; it does not delete the web Server row in v1.
- Command log rows are retained indefinitely.
Task 1: Extend Worker Tests First
Files:
- Modify: l4d2web/tests/test_job_worker.py
- Modify as needed: l4d2web/tests/test_job_logs.py
Add tests that verify the worker behavior without touching real systemd, Steam, or /opt/l4d2. Use monkeypatched l4d2web.services.l4d2_facade functions.
Required coverage:
- run_worker_once() claims the oldest runnable queued job.
- A successful server job transitions queued -> running -> succeeded and sets exit_code=0, started_at, finished_at, and updated_at.
- A successful job persists stdout/stderr callback lines in job_logs.
- A subprocess.CalledProcessError transitions the job to failed and stores exit_code=exc.returncode.
- An unexpected exception transitions the job to failed with exit_code=1.
- Same-server jobs do not overlap.
- Different-server jobs can be claimed concurrently by separate worker passes.
- An install job is not claimed while any server job is running.
- Server jobs are not claimed while an install job is running.
- Startup recovery marks stale running jobs as failed.
- Actual server state is refreshed after server-specific lifecycle jobs.
- Server.last_error is cleared on success and set on failure.
Verification command:
pytest l4d2web/tests/test_job_worker.py -q
Expected before implementation: FAIL.
Task 2: Implement Queue Claiming And Job Execution
Files:
- Modify: l4d2web/services/job_worker.py
Add worker-core functions:
- build_scheduler_state(session) -> SchedulerState
- claim_next_job() -> int | None
- run_worker_once() -> bool
- run_job(job_id: int) -> None
- finish_job(job_id: int, state: str, exit_code: int | None, error: str = "") -> None
- append_job_log_line(job_id: int, stream: str, line: str, max_chars: int = 4096) -> int
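As an illustration of the last signature, here is a minimal sketch of append_job_log_line. It makes several assumptions that the plan does not state: the returned int is a per-job sequence number, lines longer than max_chars are truncated, the job_logs table has the columns shown, and a sqlite3 connection is passed in explicitly (the real function would obtain its own session). The process-local lock keeps two callback threads from computing the same seq.

```python
import sqlite3
import threading

# Process-local lock: serializes seq allocation inside this one process.
_log_lock = threading.Lock()


def append_job_log_line(conn: sqlite3.Connection, job_id: int, stream: str,
                        line: str, max_chars: int = 4096) -> int:
    """Append one log line and return its per-job sequence number."""
    # Assumed policy: drop the trailing newline and truncate long lines.
    text = line.rstrip("\n")[:max_chars]
    with _log_lock:
        row = conn.execute(
            "SELECT COALESCE(MAX(seq), 0) FROM job_logs WHERE job_id = ?",
            (job_id,),
        ).fetchone()
        seq = row[0] + 1
        conn.execute(
            "INSERT INTO job_logs (job_id, seq, stream, line) VALUES (?, ?, ?, ?)",
            (job_id, seq, stream, text),
        )
        # Commit inside the lock so the next caller sees this row.
        conn.commit()
    return seq
```

Reading MAX(seq) and inserting inside the same locked region is what makes the numbering gap-free and duplicate-free within one process; it offers no protection across processes, which matches the single-process architecture above.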
Implementation rules:
- Use a module-level claim lock around scheduler-state construction, queued-job selection, and queued -> running transition.
- Commit the running transition before executing any host operation.
- Do not keep a DB session open while a host operation runs.
- Use a module-level log lock around append_job_log() so concurrent stdout/stderr callback threads cannot duplicate seq values.
- Recompute scheduler state from running jobs in the DB, not from only in-memory state.
- Select queued jobs by created_at, then id for deterministic order.
- Skip malformed server operations with no server_id by failing the job cleanly.
- Treat unknown operations as failed jobs, not worker-thread crashes.
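The claiming rules can be sketched against a plain sqlite3 connection. This is an illustration under assumptions, not the module's actual code: the jobs table is reduced to the columns shown, the connection is passed in explicitly, and a queued install that is currently blocked is skipped rather than holding back later server jobs (the plan does not say which is intended).

```python
import sqlite3
import threading
from typing import Optional

# Process-local lock around state construction, selection, and transition.
_claim_lock = threading.Lock()


def claim_next_job(conn: sqlite3.Connection) -> Optional[int]:
    """Mark the oldest runnable queued job as running and return its id."""
    with _claim_lock:
        # Scheduler state is recomputed from the DB on every claim.
        running = conn.execute(
            "SELECT operation, server_id FROM jobs WHERE state = 'running'"
        ).fetchall()
        if any(op == "install" for op, _ in running):
            return None  # install is globally exclusive
        busy = {sid for _, sid in running if sid is not None}
        queued = conn.execute(
            "SELECT id, operation, server_id FROM jobs "
            "WHERE state = 'queued' ORDER BY created_at, id"
        ).fetchall()
        for job_id, op, sid in queued:
            if op == "install" and running:
                continue  # install waits until no server job is running
            if op != "install" and sid in busy:
                continue  # same-server jobs must not overlap
            conn.execute(
                "UPDATE jobs SET state = 'running' WHERE id = ?", (job_id,)
            )
            conn.commit()  # commit before any host operation starts
            return job_id
    return None
```

A server job with no server_id is still claimable in this sketch, so the execution step can fail it cleanly instead of leaving it queued forever.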
Operation dispatch:
install -> l4d2_facade.install_runtime(...)
initialize -> l4d2_facade.initialize_server(server_id, ...)
start -> l4d2_facade.initialize_server(server_id, ...), then l4d2_facade.start_server(server_id, ...)
stop -> l4d2_facade.stop_server(server_id, ...)
delete -> l4d2_facade.delete_server(server_id, ...)
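One way to keep this mapping in a single place is a small lookup table. The sketch below is hypothetical: plan_steps is not a name from the plan, the facade is passed in as an object so it can be replaced in tests, and the elided arguments (...) are not reproduced. Malformed and unknown operations raise ValueError, which the caller is expected to turn into a failed job.

```python
from typing import Callable, List, Optional

# Ordered facade function names per server-specific operation.
_SERVER_STEPS = {
    "initialize": ["initialize_server"],
    "start": ["initialize_server", "start_server"],  # blueprint first
    "stop": ["stop_server"],
    "delete": ["delete_server"],
}


def plan_steps(operation: str, server_id: Optional[int],
               facade) -> List[Callable[[], None]]:
    """Return the ordered facade calls for one job, without running them."""
    if operation == "install":
        return [lambda: facade.install_runtime()]
    if operation not in _SERVER_STEPS:
        raise ValueError(f"unknown operation {operation!r}")
    if server_id is None:
        raise ValueError(f"operation {operation!r} requires a server_id")
    # Bind each name as a default argument so every lambda keeps its own step.
    return [
        lambda name=name: getattr(facade, name)(server_id)
        for name in _SERVER_STEPS[operation]
    ]
```

Returning callables instead of calling the facade directly keeps validation separate from execution, so a bad job is rejected before any host operation runs.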
Failure handling:
- subprocess.CalledProcessError: append remaining stderr if useful, fail with exit_code=returncode.
- Any other exception: append exception text to stderr, fail with exit_code=1.
- Never let a job exception kill the worker loop.
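These rules amount to a wrapper that converts every outcome into a (state, exit_code) pair and never raises. A minimal sketch, assuming a hypothetical helper name run_guarded and a log_stderr callback that stands in for appending a stderr row to job_logs:

```python
import subprocess
from typing import Callable, Tuple


def run_guarded(work: Callable[[], None],
                log_stderr: Callable[[str], None]) -> Tuple[str, int]:
    """Run work() and map its outcome to (state, exit_code) without raising."""
    try:
        work()
    except subprocess.CalledProcessError as exc:
        stderr = exc.stderr
        if isinstance(stderr, bytes):
            stderr = stderr.decode("utf-8", errors="replace")
        if stderr:
            log_stderr(stderr)  # only log when there is leftover stderr
        return "failed", exc.returncode
    except Exception as exc:
        # Deliberately broad: one bad job must not end the worker loop.
        log_stderr(f"{type(exc).__name__}: {exc}")
        return "failed", 1
    return "succeeded", 0
```

The returned pair could then be handed to finish_job(job_id, state, exit_code).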
Verification command:
pytest l4d2web/tests/test_job_worker.py -q
Expected after implementation: PASS.
Task 3: Add Worker Thread Startup
Files:
- Modify: l4d2web/config.py
- Modify: l4d2web/app.py
- Modify: l4d2web/services/job_worker.py
- Modify: l4d2web/tests/test_job_worker.py
Add config:
"JOB_WORKER_ENABLED": True
"JOB_WORKER_POLL_SECONDS": 1
Add worker lifecycle functions:
- start_job_workers(app) -> None
- worker_loop(app, poll_seconds: float) -> None
Startup behavior:
- create_app() still calls recover_stale_jobs().
- After recovery, create_app() starts workers only when enabled and not in TESTING.
- Guard against duplicate worker startup in the same process.
- Worker threads run as daemon threads.
- Each worker loop uses app.app_context() around run_worker_once().
- If no job was run, sleep for JOB_WORKER_POLL_SECONDS.
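The startup guard and polling loop can be sketched without Flask. The signature below deliberately differs from the planned start_job_workers(app): it takes a run_once callable and plain flags so the example is self-contained, and it starts a single thread. In the real module the loop body would wrap run_worker_once() in app.app_context() and read the flags from app.config.

```python
import threading
import time
from typing import Callable

_start_lock = threading.Lock()
_started = False  # process-wide flag guarding against duplicate startup


def start_job_workers(run_once: Callable[[], bool], poll_seconds: float,
                      enabled: bool = True, testing: bool = False) -> bool:
    """Start one daemon worker thread; return True only if it was started."""
    global _started
    if not enabled or testing:
        return False
    with _start_lock:
        if _started:
            return False  # a second create_app() call must not add workers
        _started = True

    def worker_loop() -> None:
        while True:
            try:
                ran = run_once()
            except Exception:
                ran = False  # a failing pass must not end the loop
            if not ran:
                time.sleep(poll_seconds)  # idle: wait before polling again

    threading.Thread(target=worker_loop, name="job-worker", daemon=True).start()
    return True
```

Because the thread is a daemon, it does not block interpreter shutdown; any job still running at that point is left for stale-job recovery on the next start.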
Testing requirements:
- Tests should not accidentally start real background workers.
- Add a focused startup test with monkeypatched start_job_workers if needed.
Verification command:
pytest l4d2web/tests/test_job_worker.py -q
Task 4: Make Job Log SSE Live-Follow
Files:
- Modify: l4d2web/routes/job_routes.py
- Modify: l4d2web/static/js/sse.js
- Modify: l4d2web/tests/test_job_logs.py
Route behavior:
- Authorize the job before streaming.
- Replay rows with seq > last_seq up to JOB_LOG_REPLAY_LIMIT.
- Continue polling for new rows while the job is not terminal.
- Close the stream after all available logs are sent and the job state is terminal.
- Keep emitting id: <seq> so EventSource can resume.
- Keep event: stdout and event: stderr for job logs.
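The live-follow behavior can be sketched as a generator that a streaming response would iterate. The names follow_job_logs, fetch_rows, and get_state are hypothetical stand-ins for the real DB queries, and authorization is assumed to have happened before the generator is created. The job state is read before the rows so that lines written just before the job finished are still delivered.

```python
import time
from typing import Callable, Iterator, List, Tuple

TERMINAL_STATES = {"succeeded", "failed", "cancelled"}
LogRow = Tuple[int, str, str]  # (seq, stream, line)


def format_sse(seq: int, stream: str, line: str) -> str:
    """Format one log row as an SSE event; line must not contain newlines."""
    return f"id: {seq}\nevent: {stream}\ndata: {line}\n\n"


def follow_job_logs(fetch_rows: Callable[[int], List[LogRow]],
                    get_state: Callable[[], str],
                    last_seq: int = 0,
                    poll_seconds: float = 0.5) -> Iterator[str]:
    """Yield SSE events for rows with seq > last_seq until the job is done."""
    while True:
        state = get_state()          # read state first, then rows
        rows = fetch_rows(last_seq)  # caller applies the replay limit
        for seq, stream, line in rows:
            yield format_sse(seq, stream, line)
            last_seq = seq
        if not rows:
            if state in TERMINAL_STATES:
                return  # nothing left to send and the job is finished
            time.sleep(poll_seconds)
```

On reconnect, EventSource sends the last received id in the Last-Event-ID request header, which the route can pass in as last_seq.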
JS behavior:
- Keep handling default server-log messages via source.onmessage.
- Also register stdout and stderr listeners that append job-log lines to the same element.
- Prefix custom job-log events with the stream name only if useful for readability.
Terminal states:
succeeded
failed
cancelled
cancelled is reserved for future use and does not require cancellation support in this task.
Verification command:
pytest l4d2web/tests/test_job_logs.py -q
Task 5: Add Admin Runtime Install Action
Files:
- Modify: l4d2web/routes/page_routes.py
- Modify: l4d2web/templates/admin.html
- Modify: l4d2web/tests/test_pages.py or add a focused admin route test
Behavior:
- POST /admin/install requires @require_admin.
- Creates Job(user_id=current_admin.id, server_id=None, operation="install", state="queued").
- Redirects to /admin/jobs.
- Non-admin logged-in users receive 403.
- Anonymous users are redirected to login.
- Admin page shows a CSRF-protected form/button for runtime install/update.
Verification command:
pytest l4d2web/tests/test_pages.py -q
Task 6: Full Verification And Review
Run focused suites first:
pytest l4d2web/tests/test_job_worker.py -q
pytest l4d2web/tests/test_job_logs.py -q
pytest l4d2web/tests/test_pages.py -q
Then run the full web suite:
pytest l4d2web/tests -q
Refresh the code index after implementation:
ccc index
Request a final read-only review focused on:
- queue claiming races
- duplicate worker startup
- job-log sequence ordering
- error handling and last_error
- live SSE behavior
- start applying blueprint updates before host start
Commit Strategy
Use small commits after passing relevant tests:
- feat(l4d2-web): execute queued lifecycle jobs
- feat(l4d2-web): live-follow queued job logs
- feat(l4d2-web): add admin runtime install job
Do not commit unless the user explicitly asks for commits.
Open Approval Gate
Before modifying implementation files, ask the user for explicit approval to proceed with the queue-worker implementation.