diff --git a/docs/superpowers/specs/2026-05-11-workshop-auto-download-design.md b/docs/superpowers/specs/2026-05-11-workshop-auto-download-design.md
new file mode 100644
index 0000000..0aebef7
--- /dev/null
+++ b/docs/superpowers/specs/2026-05-11-workshop-auto-download-design.md
@@ -0,0 +1,326 @@
+# Workshop Auto-Download — Design
+
+## Problem
+
+When a user adds workshop items to an overlay (`POST /overlays/{id}/items`), the route saves `WorkshopItem` metadata and enqueues a `build_overlay` job. The build symlinks already-cached `.vpk` files and emits `skipped: not yet downloaded` to stderr for everything else. The only thing that actually pulls bytes from Steam is the admin-only `refresh_workshop_items` job, which is a global mutex blocking all server starts, all builds, and installs.
+
+In practice, this means freshly-added items never appear in the overlay until an admin presses a button. That isn't workable.
+
+## Goals
+
+1. Newly added items get downloaded without admin action.
+2. Items that authors update on Steam get re-downloaded automatically on a daily cadence.
+3. Overlay owners can manually re-check / re-pull their own overlay's items.
+
+## Non-Goals
+
+See "Out of Scope" at the end. In particular: the `refresh_workshop_items` global mutex stays; there is no cache GC; no per-item retry inside `download_to_cache`; no update-aware server-restart prompt.
+
+## Architecture
+
+Three changes layered onto the existing scheduler. None introduce a new job type or new scheduler rule.
+
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│ User adds items                                                     │
+│ POST /overlays/{id}/items                                           │
+│   ↳ fetch metadata batch (mode=add)                                 │
+│   ↳ upsert WorkshopItem rows                                        │
+│   ↳ enqueue_build_overlay          ◀── already happens today        │
+└─────────────────────────────────────────────────────────────────────┘
+                                   │
+                                   ▼
+┌─────────────────────────────────────────────────────────────────────┐
+│ build_overlay job (per-overlay; not a global mutex)                 │
+│ WorkshopBuilder.build():                                            │
+│   1. query overlay's items                                          │
+│   2. for each item where cache miss / stale:      ◀── NEW           │
+│        download_to_cache(meta) with retry+backoff                   │
+│        stamp WorkshopItem.last_downloaded_at                        │
+│   3. apply symlinks (existing logic)                                │
+└─────────────────────────────────────────────────────────────────────┘
+
+┌─────────────────────────────────────────────────────────────────────┐
+│ Owner re-checks one overlay                                         │
+│ POST /overlays/{id}/refresh                       ◀── NEW           │
+│   ↳ fetch metadata batch for this overlay only (mode=refresh)       │
+│   ↳ update WorkshopItem rows                                        │
+│   ↳ enqueue_build_overlay (does the download)                       │
+└─────────────────────────────────────────────────────────────────────┘
+
+┌─────────────────────────────────────────────────────────────────────┐
+│ Daily global update                                                 │
+│ systemd timer → l4d2web workshop-refresh CLI      ◀── NEW           │
+│   ↳ inserts Job(operation='refresh_workshop_items')                 │
+│   ↳ worker picks it up; existing global-mutex rule still applies    │
+│   ↳ existing _run_refresh_workshop_items code unchanged             │
+└─────────────────────────────────────────────────────────────────────┘
+```
+
+Key invariant: **on-add downloads run inside the per-overlay `build_overlay` job, so they do not block server starts globally.** Only the daily global refresh keeps the existing global-mutex semantics.
+
+## Component 1 — Auto-download inside `WorkshopBuilder.build`
+
+The builder gets a new download phase between "query items" and "apply symlinks". Today's behavior (skip-uncached with stderr warning) is replaced.
+
+### Decision logic
+
+For each item bound to the overlay:
+
+1. **Skip with warning** if `file_url == ""` (Steam returned `result != 1` last time we asked — delisted, private, or hidden). Emit one stderr line `workshop item {steam_id} skipped: no file_url (steam result: {last_error})`. Do **not** fail the build — these items quietly fall out of the symlink set because they never produce a cache file. An owner can investigate via the overlay detail page where `last_error` is shown.
+2. Otherwise, **download** when any of:
+   - `last_downloaded_at IS NULL`, or
+   - cache file `{steam_id}.vpk` missing, or
+   - cache file `(mtime, size)` doesn't match `(time_updated, file_size)` from the row.
+3. Otherwise, leave the item alone (its cache file is current).
+
+`steam_workshop.download_to_cache` already does the `(mtime, size)` check internally and short-circuits when the cache is current, so the builder can call it unconditionally for items in the "maybe download" set and trust the helper for idempotence.
+
+### Stamping
+
+- On success per item: `WorkshopItem.last_downloaded_at = now()`, `last_error = ""`.
+- On failure per item (after retry exhaustion): `last_error` records the final exception string; the builder raises → `last_build_status='failed'`.
+
+### What the builder does NOT do
+
+It does not fetch fresh Steam metadata. Metadata is the responsibility of the add route, the per-overlay refresh route, and the daily refresh job. The builder is a pure function of DB state — this keeps it cheap and predictable, and lets builds run without any outbound metadata call.
+
+### Concurrency
+
+Items are downloaded sequentially within one builder run. Different overlays' builds run in parallel under existing scheduler rules; when two overlays share an item and race, the existing `download_to_cache` idempotence handles it — the loser sees a fresh file and skips. `last_downloaded_at` writes from two concurrent builds collapse to one timestamp; no real race.
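
The three decision rules in Component 1 can be sketched as a small pure function. This is only an illustration under stated assumptions: `ItemRow` and `classify` are hypothetical names standing in for the `WorkshopItem` columns and whatever the builder actually does, and comparing the cache file's mtime as whole seconds against `time_updated` is an assumption about how the cache is stamped.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Optional


@dataclass
class ItemRow:
    """Hypothetical stand-in for the WorkshopItem columns the builder reads."""
    steam_id: int
    file_url: str
    file_size: int
    time_updated: int  # assumed to be Unix seconds
    last_downloaded_at: Optional[datetime] = None


def classify(item: ItemRow, cache_root: Path) -> str:
    """Return 'skip', 'download', or 'cached' for one overlay item."""
    if item.file_url == "":
        return "skip"  # rule 1: Steam cannot serve this item
    cache_file = cache_root / f"{item.steam_id}.vpk"
    if item.last_downloaded_at is None or not cache_file.exists():
        return "download"  # rule 2: never downloaded, or cache file missing
    st = cache_file.stat()
    if (int(st.st_mtime), st.st_size) != (item.time_updated, item.file_size):
        return "download"  # rule 2: cache file is stale
    return "cached"  # rule 3: cache file is current
```

Because `download_to_cache` is described as idempotent, the builder could skip the explicit mtime/size comparison and call the helper for every item that is not in the `skip` bucket; a classification like this would mainly help produce the `downloaded` / `cached` / `skipped` counts in the summary line.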
+
+### Cancellation
+
+The builder threads `should_cancel` into `download_to_cache` (the helper already accepts it). Cancelled mid-download deletes the `.partial` file; the symlink phase doesn't run. Cancellation during the inter-attempt sleep wakes up within ~250 ms (see retry section).
+
+### Logging
+
+Each item's download start / finish / error emits one line. Counts are reported in the existing summary line:
+
+```
+workshop overlay 'mycollection': downloaded=3 cached=12 skipped=1 created=14 removed=1 unchanged=11 errors=0
+```
+
+`skipped` now means "Steam can't serve this item (no file_url)" instead of the old "uncached" meaning. Uncached items get downloaded.
+
+## Component 2 — Retry & backoff
+
+Wraps each `download_to_cache(meta, ...)` call inside the builder.
+
+```
+attempts = 3
+delays = [1s, 2s, 4s]   # exponential; slept between attempts
+
+for n in 1..attempts:
+    try:
+        download_to_cache(meta, cache_root, should_cancel=should_cancel)
+        break
+    except InterruptedError:           # cancellation
+        raise                          # propagate immediately
+    except (requests.RequestException, OSError) as exc:
+        if n == attempts: raise        # final attempt: bubble up → job fails
+        on_stderr(f"workshop {meta.steam_id} attempt {n}/{attempts} failed: {exc}")
+        sleep_with_cancel(delays[n-1], should_cancel)
+```
+
+### Notes
+
+- `sleep_with_cancel` is a small helper that polls `should_cancel` every ~250 ms during the sleep so a cancel does not wait out the full backoff window.
+- The retry loop lives in the builder (`overlay_builders.py`), not in `steam_workshop.download_to_cache`. The downloader stays a single-shot primitive; retry policy is a caller concern. Keeps the helper testable without time-mocking.
+- HTTP 4xx responses raised by `raise_for_status()` are `requests.HTTPError` (a `RequestException`), so they are retried too.
+  That is intentional — 404 / 410 will fail three times quickly and surface; the cost of three failed attempts is negligible compared to the cost of users having to guess why a single transient blip killed the job.
+- On final failure the job fails with the per-item error string and overlay `last_build_status='failed'`, matching the existing "never silently mount a partial overlay" rule.
+
+## Component 3 — Per-overlay refresh
+
+New route `POST /overlays/{id}/refresh`. Mirrors the add route's metadata-fetch path but scoped to the items already in this overlay.
+
+### Route sketch
+
+```python
+@bp.post("/overlays/<int:overlay_id>/refresh")
+@require_login
+def refresh_overlay(overlay_id: int) -> Response:
+    user = current_user()
+    with session_scope() as db:
+        overlay, err = _check_workshop_overlay_access(overlay_id, user, db)
+        if err is not None: return err
+        steam_ids = db.scalars(
+            select(WorkshopItem.steam_id)
+            .join(OverlayWorkshopItem, OverlayWorkshopItem.workshop_item_id == WorkshopItem.id)
+            .where(OverlayWorkshopItem.overlay_id == overlay_id)
+        ).all()
+
+    if not steam_ids:
+        return Response("overlay has no items", status=400)
+
+    try:
+        metas = steam_workshop.fetch_metadata_batch(steam_ids, mode="refresh")
+    except Exception as exc:
+        return Response(f"steam api error: {exc}", status=502)
+
+    with session_scope() as db:
+        overlay, err = _check_workshop_overlay_access(overlay_id, user, db)
+        if err is not None: return err
+        metas_by_id = {m.steam_id: m for m in metas}
+        for steam_id in steam_ids:
+            wi = db.scalar(select(WorkshopItem).where(WorkshopItem.steam_id == steam_id))
+            meta = metas_by_id.get(steam_id)
+            if wi is None: continue
+            if meta is None:
+                wi.last_error = "steam returned no entry for this item"
+                continue
+            wi.title = meta.title
+            wi.filename = meta.filename
+            wi.file_url = meta.file_url
+            wi.file_size = meta.file_size
+            wi.time_updated = meta.time_updated
+            wi.preview_url = meta.preview_url
+            wi.last_error = "" if meta.result == 1 else f"steam result {meta.result}"
+        job = enqueue_build_overlay(db, overlay_id=overlay_id, user_id=user.id)
+        job_id = job.id
+    return redirect(f"/jobs/{job_id}")
+```
+
+### Behavior notes
+
+- Permission: same `_check_workshop_overlay_access` used by add/remove — owner or admin.
+- `mode="refresh"` (not `"add"`): non-L4D2 items silently drop from the batch instead of raising. An item whose `consumer_app_id` somehow changed after add will not break refresh.
+- The metadata write does **not** stamp `last_downloaded_at`. That field stays bound to actual file presence — the builder's download phase stamps it after the bytes land. A refresh that finds `time_updated` advanced therefore leaves `last_downloaded_at` pointing at the prior version; the `(mtime, size)` check in `download_to_cache` sees the mismatch and the builder re-downloads. Correct by construction.
+- One Steam metadata POST per click, owner-gated. No new rate-limit concern.
+
+### UI
+
+A "Refresh" button next to "Add items" on the overlay detail page (workshop type only). Submits the POST; redirects to the job page like everything else.
+
+## Component 4 — Periodic global refresh (CLI + systemd timer)
+
+The existing `_run_refresh_workshop_items` job is complete and correct — it fetches all metadata, downloads what advanced, re-enqueues `build_overlay` for affected overlays. We only need a way to enqueue it on a schedule.
+
+### CLI subcommand
+
+In `l4d2web/cli.py`:
+
+```python
+@cli.command("workshop-refresh")
+def workshop_refresh() -> None:
+    """Enqueue a global workshop refresh job.
+    Idempotent: if one is already
+    queued or running, prints its id and exits 0."""
+    with session_scope() as db:
+        existing = db.scalar(
+            select(Job).where(
+                Job.operation == "refresh_workshop_items",
+                Job.state.in_(("queued", "running", "cancelling")),
+            ).order_by(Job.id.desc()).limit(1)
+        )
+        if existing is not None:
+            click.echo(f"refresh_workshop_items job {existing.id} already {existing.state}")
+            return
+        job = Job(
+            user_id=None,
+            server_id=None,
+            operation="refresh_workshop_items",
+            state="queued",
+        )
+        db.add(job)
+        db.flush()
+        click.echo(f"enqueued refresh_workshop_items job {job.id}")
+```
+
+### Schema follow-up
+
+`Job.user_id = None` for system-enqueued refreshes. The implementation plan must verify whether the column is currently nullable; if it is `NOT NULL`, the plan either (a) relaxes it to nullable (preferred — "system" is a real category) or (b) records the lowest-id admin user as the actor. The design assumes (a).
+
+### systemd units in `deploy/`
+
+```ini
+# left4me-workshop-refresh.service
+[Unit]
+Description=Left4me — enqueue daily workshop refresh
+After=network-online.target left4me-web.service
+Requires=left4me-web.service
+
+[Service]
+Type=oneshot
+User=left4me
+ExecStart=/opt/left4me/bin/l4d2web workshop-refresh
+```
+
+```ini
+# left4me-workshop-refresh.timer
+[Unit]
+Description=Left4me — daily workshop refresh
+
+[Timer]
+OnCalendar=*-*-* 04:00:00
+Persistent=true
+RandomizedDelaySec=15min
+
+[Install]
+WantedBy=timers.target
+```
+
+### Operator notes
+
+- The timer enqueues; the worker decides when to actually run. The existing scheduler will defer the refresh if a server start, install, or build is in progress. Worst case the refresh starts after the conflicting job finishes — the intended behavior.
+- `Persistent=true` handles "host was down at 04:00" — the unit runs on next boot. The CLI's idempotence check prevents pile-up if it fires twice.
+- Deployment wires this into the existing `deploy/` install flow (in scope for the implementation plan).
+
+## Testing
+
+Layered against the existing test files. No new test infrastructure.
+
+### `tests/test_overlay_builders.py` — bulk of new coverage
+
+- `test_workshop_build_downloads_uncached_items` — item with `last_downloaded_at=None` and no cache file → patched `download_to_cache` is called → file appears → symlink created → `last_downloaded_at` stamped.
+- `test_workshop_build_skips_already_cached_items` — item with cache file matching `(time_updated, size)` → `download_to_cache` returns immediately (its existing idempotence) → no network → symlink created.
+- `test_workshop_build_redownloads_stale_cache` — cache file exists but `(mtime, size)` mismatches the DB row → re-download happens.
+- `test_workshop_build_retry_succeeds` — patched downloader fails twice then succeeds → builder finishes ok, retry messages on stderr, `last_downloaded_at` stamped. Backoff sleep monkey-patched to zero for speed.
+- `test_workshop_build_retry_exhausted_fails_job` — downloader fails all three attempts → builder raises → `last_build_status='failed'`, `last_error` populated on the WorkshopItem.
+- `test_workshop_build_cancellation_during_download` — `should_cancel` flips true mid-download → builder returns early, `.partial` cleaned up by `download_to_cache`, symlink phase did not run.
+- `test_workshop_build_cancellation_during_backoff` — cancel flips true while sleeping between retries → wakes up within ~250 ms of the cancel.
+- `test_workshop_build_skips_items_with_no_file_url` — item with `file_url=""` and `last_error="steam result 9"` → builder writes one stderr line, does NOT call `download_to_cache`, build succeeds with `last_build_status='ok'`, item is absent from the symlink set.
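
The retry and backoff tests listed above are easiest to write when the retry wrapper takes its sleep function as a parameter. Below is a minimal Python sketch of the Component 2 pseudocode under stated assumptions: `download_with_retry` is a hypothetical name, the injectable `sleep` argument is an illustration rather than part of the design, and raising `InterruptedError` from `sleep_with_cancel` on cancel is an assumption (the spec only says the helper polls `should_cancel`). Catching `OSError` alone also covers `requests.RequestException`, which subclasses it, so the sketch runs without `requests` installed.

```python
import time
from typing import Callable, Sequence


def sleep_with_cancel(seconds: float,
                      should_cancel: Callable[[], bool],
                      poll: float = 0.25,
                      sleep: Callable[[float], None] = time.sleep) -> None:
    """Sleep in ~250 ms slices; raise InterruptedError once a cancel is seen."""
    remaining = seconds
    while remaining > 0:
        if should_cancel():
            raise InterruptedError("cancelled during backoff")
        step = min(poll, remaining)
        sleep(step)
        remaining -= step


def download_with_retry(download: Callable[[], None],
                        should_cancel: Callable[[], bool],
                        on_stderr: Callable[[str], None],
                        delays: Sequence[float] = (1.0, 2.0, 4.0),
                        attempts: int = 3,
                        sleep: Callable[[float], None] = time.sleep) -> None:
    """Call download() up to `attempts` times, backing off between failures."""
    for n in range(1, attempts + 1):
        try:
            download()
            return
        except InterruptedError:
            raise  # cancellation: propagate immediately, never retry
        except OSError as exc:
            if n == attempts:
                raise  # final attempt: let the job fail
            on_stderr(f"attempt {n}/{attempts} failed: {exc}")
            sleep_with_cancel(delays[n - 1], should_cancel, sleep=sleep)
```

A test can pass `sleep=lambda s: None` to skip real waiting, or a sleep stub that flips the cancel flag to exercise cancellation during backoff. Note that with `attempts = 3` only the first two delays are ever slept; the third entry would matter only if the attempt count were raised.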
+
+### `tests/test_workshop_routes.py` — new per-overlay refresh route
+
+- `test_overlay_refresh_owner_allowed` — owner POST → `fetch_metadata_batch` called with exactly that overlay's steam_ids → WorkshopItem rows updated → `build_overlay` enqueued → 302 to /jobs/{id}.
+- `test_overlay_refresh_other_user_forbidden` — non-owner non-admin → 403.
+- `test_overlay_refresh_admin_can_refresh_any` — admin POST on someone else's overlay → 200/302.
+- `test_overlay_refresh_steam_api_error_502` — `fetch_metadata_batch` raises → response is 502, no job enqueued.
+- `test_overlay_refresh_empty_overlay_400` — overlay has no items → 400, no Steam call.
+- `test_overlay_refresh_drops_missing_items_gracefully` — Steam returns nothing for one ID → that row gets `last_error="steam returned no entry…"`, build still enqueued.
+
+### `tests/test_cli.py` — new CLI subcommand
+
+- `test_workshop_refresh_enqueues_job` — CLI invocation inserts a queued `Job(operation='refresh_workshop_items')` and prints its id.
+- `test_workshop_refresh_idempotent_when_queued` — pre-existing queued/running refresh job → second invocation prints the existing id and does not insert a duplicate.
+
+### `tests/test_job_worker.py`
+
+No new tests. Scheduler rules and `_run_refresh_workshop_items` are unchanged. Existing coverage holds.
+
+### Out of test scope
+
+The systemd timer. Validating it requires a host; smoke it on the dev host post-deploy.
+
+## Out of Scope
+
+- **Replacing the global mutex on `refresh_workshop_items`.** Daily refresh still blocks server starts/builds during its run. Scheduled at 04:00 with `Persistent=true`; revisit only if it observably hurts.
+- **Per-item retry policy in `download_to_cache`.** Retry stays in the builder.
+- **Cache GC.** Cache still grows monotonically — same as the v1 spec.
+- **Steam API rate-limit handling for the metadata endpoint.** No backoff for metadata calls. Retries apply only to per-item file downloads.
+- **Update-aware server restart UX.** When the daily refresh re-downloads an item mounted by a running server, the running server keeps its old mount. Notifying the user / offering a "restart to pick up updates" prompt stays in the backlog.
+- **Per-overlay refresh on non-workshop overlay types.** Only workshop overlays get the Refresh button.
+
+## Affected Files
+
+Implementation will touch roughly:
+
+- `l4d2web/services/overlay_builders.py` — WorkshopBuilder download phase, retry helper.
+- `l4d2web/routes/workshop_routes.py` — new `/overlays/{id}/refresh` route.
+- `l4d2web/templates/...` — Refresh button on overlay detail page.
+- `l4d2web/cli.py` — new `workshop-refresh` subcommand.
+- `l4d2web/models.py` and `alembic/versions/...` — possibly relax `Job.user_id` to nullable (TBD per schema check).
+- `deploy/` — systemd `.service` + `.timer` units, wired into the install flow.
+- `l4d2web/tests/test_overlay_builders.py`, `test_workshop_routes.py`, `test_cli.py` — new test cases per the testing section.
+
+The implementation plan will turn these into ordered steps with explicit checkpoints.