# Workshop Auto-Download: Design

## Problem

When a user adds workshop items to an overlay (`POST /overlays/{id}/items`), the route saves `WorkshopItem` metadata and enqueues a `build_overlay` job. The build symlinks already-cached `.vpk` files and emits `skipped: not yet downloaded` to stderr for everything else. The only thing that actually pulls bytes from Steam is the admin-only `refresh_workshop_items` job, which holds a global mutex that blocks all server starts, builds, and installs.

In practice, this means freshly added items never appear in the overlay until an admin presses a button. That isn't workable.

## Goals

1. Newly added items get downloaded without admin action.
2. Items that authors update on Steam get re-downloaded automatically on a daily cadence.
3. Overlay owners can manually re-check / re-pull their own overlay's items.

## Non-Goals

See "Out of Scope" at the end. In particular: the `refresh_workshop_items` global mutex stays; there is no cache GC; no per-item retry inside `download_to_cache`; no update-aware server-restart prompt.

## Architecture

Three changes layered onto the existing scheduler. None of them introduces a new job type or a new scheduler rule.

```
┌─────────────────────────────────────────────────────────────────────┐
│ User adds items                                                     │
│   POST /overlays/{id}/items                                         │
│     ↳ fetch metadata batch (mode=add)                               │
│     ↳ upsert WorkshopItem rows                                      │
│     ↳ enqueue_build_overlay  ◀── already happens today              │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│ build_overlay job (per-overlay; not a global mutex)                 │
│   WorkshopBuilder.build():                                          │
│     1. query overlay's items                                        │
│     2. for each item where cache miss / stale:  ◀── NEW             │
│          download_to_cache(meta) with retry+backoff                 │
│          stamp WorkshopItem.last_downloaded_at                      │
│     3. apply symlinks (existing logic)                              │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Owner re-checks one overlay                                         │
│   POST /overlays/{id}/refresh  ◀── NEW                              │
│     ↳ fetch metadata batch for this overlay only (mode=refresh)     │
│     ↳ update WorkshopItem rows                                      │
│     ↳ enqueue_build_overlay (does the download)                     │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Daily global update                                                 │
│   systemd timer → l4d2web workshop-refresh CLI  ◀── NEW             │
│     ↳ inserts Job(operation='refresh_workshop_items')               │
│     ↳ worker picks it up; existing global-mutex rule still applies  │
│     ↳ existing _run_refresh_workshop_items code unchanged           │
└─────────────────────────────────────────────────────────────────────┘
```

Key invariant: **on-add downloads run inside the per-overlay `build_overlay` job, so they do not block server starts globally.** Only the daily global refresh keeps the existing global-mutex semantics.

## Component 1: Auto-download inside `WorkshopBuilder.build`

The builder gets a new download phase between "query items" and "apply symlinks". Today's behavior (skip uncached items with a stderr warning) is replaced.

### Decision logic

For each item bound to the overlay:

1. **Skip with warning** if `file_url == ""` (Steam returned `result != 1` last time we asked: delisted, private, or hidden). Emit one stderr line `workshop item {steam_id} skipped: no file_url (steam result: {last_error})`. Do **not** fail the build: these items quietly fall out of the symlink set because they never produce a cache file. An owner can investigate via the overlay detail page, where `last_error` is shown.
2. Otherwise, **download** when any of the following holds:
   - `last_downloaded_at IS NULL`, or
   - cache file `{steam_id}.vpk` is missing, or
   - cache file `(mtime, size)` doesn't match `(time_updated, file_size)` from the row.
3. Otherwise, leave the item alone (its cache file is current).

`steam_workshop.download_to_cache` already does the `(mtime, size)` check internally and short-circuits when the cache is current, so the builder can call it unconditionally for items in the "maybe download" set and trust the helper for idempotence.

### Stamping

- On success per item: `WorkshopItem.last_downloaded_at = now()`, `last_error = ""`.
- On failure per item (after retry exhaustion): `last_error` records the final exception string; the builder raises → `last_build_status='failed'`.

### What the builder does NOT do

It does not fetch fresh Steam metadata. Metadata is the responsibility of the add route, the per-overlay refresh route, and the daily refresh job. The builder works only from DB state and the local cache, which keeps it predictable and lets builds run without any outbound metadata call.

### Concurrency

Items are downloaded sequentially within one builder run. Different overlays' builds run in parallel under existing scheduler rules; when two overlays share an item and race, the existing `download_to_cache` idempotence handles it: the loser sees a fresh file and skips. If two concurrent builds both write `last_downloaded_at`, the last write wins and either timestamp is acceptable, so there is no real race.

### Cancellation

The builder threads `should_cancel` into `download_to_cache` (the helper already accepts it). A cancel mid-download deletes the `.partial` file, and the symlink phase doesn't run. A cancel during the inter-attempt sleep takes effect within ~250 ms (see the retry section).

### Logging

Each item's download start / finish / error emits one line. Counts are reported in the existing summary line:

```
workshop overlay 'mycollection': downloaded=3 cached=12 skipped=1 created=14 removed=1 unchanged=11 errors=0
```

`skipped` now means "Steam can't serve this item (no file_url)" instead of the old "uncached" meaning. Uncached items get downloaded.
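To make the download phase above more concrete, here is a minimal sketch of two of its pieces: the per-item decision and the cancellable sleep mentioned under Cancellation. It is an illustration under stated assumptions, not the project's code. `ItemRow` and `classify` are hypothetical names, `time_updated` is assumed to be Unix seconds, `should_cancel` is assumed to be a zero-argument callable, and raising `InterruptedError` on cancel is an assumption chosen to match the cancellation exception named in the retry section.

```python
import time
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Callable, Optional


@dataclass
class ItemRow:
    """Hypothetical stand-in for the WorkshopItem columns the decision reads."""
    steam_id: str
    file_url: str
    file_size: int
    time_updated: int  # assumption: Unix seconds, mirrored into the file mtime
    last_downloaded_at: Optional[datetime] = None


def classify(item: ItemRow, cache_root: Path) -> str:
    """Return 'skip', 'download', or 'current' for one overlay item."""
    if item.file_url == "":
        return "skip"  # Steam cannot serve it: warn, do not fail the build
    if item.last_downloaded_at is None:
        return "download"  # never downloaded
    cache_file = cache_root / f"{item.steam_id}.vpk"
    try:
        st = cache_file.stat()
    except FileNotFoundError:
        return "download"  # cache file missing
    if (int(st.st_mtime), st.st_size) != (item.time_updated, item.file_size):
        return "download"  # cache is stale relative to the DB row
    return "current"


def sleep_with_cancel(seconds: float, should_cancel: Callable[[], bool],
                      poll: float = 0.25) -> None:
    """Sleep up to `seconds`, checking `should_cancel` every `poll` seconds.

    Assumption: a cancel is reported by raising InterruptedError.
    """
    deadline = time.monotonic() + seconds
    while True:
        if should_cancel():
            raise InterruptedError("cancelled during backoff")
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return
        time.sleep(min(poll, remaining))
```

Because `download_to_cache` already performs the `(mtime, size)` check itself, a real builder would likely need only the `skip` branch of `classify` and could call the helper for everything else; the full three-way split is shown here only to mirror the numbered list.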
## Component 2: Retry & backoff

Wraps each `download_to_cache(meta, ...)` call inside the builder.

```python
import requests

ATTEMPTS = 3
# Exponential delays in seconds, slept between attempts.
# With three attempts only the first two values are ever used.
DELAYS = (1.0, 2.0, 4.0)


# Sketch only: `download_with_retry` is an illustrative name for the builder-side wrapper.
def download_with_retry(meta, cache_root, should_cancel, on_stderr) -> None:
    for n in range(1, ATTEMPTS + 1):
        try:
            download_to_cache(meta, cache_root, should_cancel=should_cancel)
            return
        except InterruptedError:  # cancellation; subclasses OSError, so it must be caught first
            raise  # propagate immediately
        except (requests.RequestException, OSError) as exc:
            if n == ATTEMPTS:
                raise  # final attempt: bubble up → job fails
            on_stderr(f"workshop {meta.steam_id} attempt {n}/{ATTEMPTS} failed: {exc}")
            sleep_with_cancel(DELAYS[n - 1], should_cancel)
```

### Notes

- `sleep_with_cancel` is a small helper that polls `should_cancel` every ~250 ms during the sleep so a cancel does not wait out the full backoff window.
- The retry loop lives in the builder (`overlay_builders.py`), not in `steam_workshop.download_to_cache`. The downloader stays a single-shot primitive; retry policy is a caller concern. This keeps the helper testable without time-mocking.
- HTTP 4xx responses raised by `raise_for_status()` are `requests.HTTPError` (a `RequestException`), so they are retried too. That is intentional: a 404 / 410 will fail all three attempts within a few seconds and then surface. The cost of three failed attempts is negligible compared to the cost of users having to guess why a single transient blip killed the job.
- On final failure the job fails with the per-item error string and the overlay gets `last_build_status='failed'`, matching the existing "never silently mount a partial overlay" rule.

## Component 3: Per-overlay refresh

New route `POST /overlays/{id}/refresh`. It mirrors the add route's metadata-fetch path but is scoped to the items already in this overlay.
### Route sketch

```python
@bp.post("/overlays/<int:overlay_id>/refresh")
@require_login
def refresh_overlay(overlay_id: int) -> Response:
    user = current_user()

    # First session: permission check and collection of this overlay's steam ids.
    with session_scope() as db:
        overlay, err = _check_workshop_overlay_access(overlay_id, user, db)
        if err is not None:
            return err
        steam_ids = db.scalars(
            select(WorkshopItem.steam_id)
            .join(OverlayWorkshopItem, OverlayWorkshopItem.workshop_item_id == WorkshopItem.id)
            .where(OverlayWorkshopItem.overlay_id == overlay_id)
        ).all()

    if not steam_ids:
        return Response("overlay has no items", status=400)

    # The Steam call happens outside any DB session.
    try:
        metas = steam_workshop.fetch_metadata_batch(steam_ids, mode="refresh")
    except Exception as exc:
        return Response(f"steam api error: {exc}", status=502)

    # Second session: re-check access, write metadata, enqueue the build.
    with session_scope() as db:
        overlay, err = _check_workshop_overlay_access(overlay_id, user, db)
        if err is not None:
            return err
        metas_by_id = {m.steam_id: m for m in metas}
        for steam_id in steam_ids:
            wi = db.scalar(select(WorkshopItem).where(WorkshopItem.steam_id == steam_id))
            meta = metas_by_id.get(steam_id)
            if wi is None:
                continue
            if meta is None:
                wi.last_error = "steam returned no entry for this item"
                continue
            wi.title = meta.title
            wi.filename = meta.filename
            wi.file_url = meta.file_url
            wi.file_size = meta.file_size
            wi.time_updated = meta.time_updated
            wi.preview_url = meta.preview_url
            wi.last_error = "" if meta.result == 1 else f"steam result {meta.result}"
        job = enqueue_build_overlay(db, overlay_id=overlay_id, user_id=user.id)
        job_id = job.id

    return redirect(f"/jobs/{job_id}")
```

### Behavior notes

- Permission: same `_check_workshop_overlay_access` used by add/remove (owner or admin).
- `mode="refresh"` (not `"add"`): non-L4D2 items silently drop from the batch instead of raising. An item whose `consumer_app_id` somehow changed after add will not break refresh.
- The metadata write does **not** stamp `last_downloaded_at`. That field stays bound to actual file presence: the builder's download phase stamps it after the bytes land. A refresh that finds `time_updated` advanced therefore leaves `last_downloaded_at` pointing at the prior version; the `(mtime, size)` check in `download_to_cache` sees the mismatch and the builder re-downloads. Correct by construction.
- One Steam metadata POST per click, owner-gated. No new rate-limit concern.

### UI

A "Refresh" button next to "Add items" on the overlay detail page (workshop type only). It submits the POST and redirects to the job page like everything else.

## Component 4: Periodic global refresh (CLI + systemd timer)

The existing `_run_refresh_workshop_items` job is complete and correct: it fetches all metadata, downloads what advanced, and re-enqueues `build_overlay` for affected overlays. We only need a way to enqueue it on a schedule.

### CLI subcommand

In `l4d2web/cli.py`:

```python
@cli.command("workshop-refresh")
def workshop_refresh() -> None:
    """Enqueue a global workshop refresh job.

    Idempotent: if one is already queued or running, prints its id and exits 0.
    """
    with session_scope() as db:
        existing = db.scalar(
            select(Job).where(
                Job.operation == "refresh_workshop_items",
                Job.state.in_(("queued", "running", "cancelling")),
            ).order_by(Job.id.desc()).limit(1)
        )
        if existing is not None:
            click.echo(f"refresh_workshop_items job {existing.id} already {existing.state}")
            return
        job = Job(
            user_id=None,
            server_id=None,
            operation="refresh_workshop_items",
            state="queued",
        )
        db.add(job)
        db.flush()  # assigns job.id before it is printed
        click.echo(f"enqueued refresh_workshop_items job {job.id}")
```

### Schema follow-up

`Job.user_id = None` for system-enqueued refreshes. The implementation plan must verify whether the column is currently nullable; if it is `NOT NULL`, the plan either (a) relaxes it to nullable (preferred, since "system" is a real category) or (b) records the lowest-id admin user as the actor. The design assumes (a).
### systemd units in `deploy/`

```ini
# left4me-workshop-refresh.service
[Unit]
Description=Left4me: enqueue daily workshop refresh
After=network-online.target left4me-web.service
Requires=left4me-web.service

[Service]
Type=oneshot
User=left4me
ExecStart=/opt/left4me/bin/l4d2web workshop-refresh
```

```ini
# left4me-workshop-refresh.timer
[Unit]
Description=Left4me: daily workshop refresh

[Timer]
OnCalendar=*-*-* 04:00:00
Persistent=true
RandomizedDelaySec=15min

[Install]
WantedBy=timers.target
```

### Operator notes

- The timer enqueues; the worker decides when to actually run. The existing scheduler will defer the refresh if a server start, install, or build is in progress. In the worst case the refresh starts after the conflicting job finishes, which is the intended behavior.
- `Persistent=true` handles "host was down at 04:00": the unit runs on next boot. The CLI's idempotence check prevents pile-up if it fires twice.
- Deployment wires this into the existing `deploy/` install flow (in scope for the implementation plan).

## Testing

Layered against the existing test files. No new test infrastructure.

### `tests/test_overlay_builders.py`: bulk of new coverage

- `test_workshop_build_downloads_uncached_items`: item with `last_downloaded_at=None` and no cache file → patched `download_to_cache` is called → file appears → symlink created → `last_downloaded_at` stamped.
- `test_workshop_build_skips_already_cached_items`: item with cache file matching `(time_updated, size)` → `download_to_cache` returns immediately (its existing idempotence) → no network → symlink created.
- `test_workshop_build_redownloads_stale_cache`: cache file exists but `(mtime, size)` mismatches the DB row → re-download happens.
- `test_workshop_build_retry_succeeds`: patched downloader fails twice then succeeds → builder finishes ok, retry messages on stderr, `last_downloaded_at` stamped. Backoff sleep monkey-patched to zero for speed.
- `test_workshop_build_retry_exhausted_fails_job`: downloader fails all three attempts → builder raises → `last_build_status='failed'`, `last_error` populated on the WorkshopItem.
- `test_workshop_build_cancellation_during_download`: `should_cancel` flips true mid-download → builder returns early, `.partial` cleaned up by `download_to_cache`, symlink phase did not run.
- `test_workshop_build_cancellation_during_backoff`: cancel flips true while sleeping between retries → wakes up within ~250 ms of the cancel.
- `test_workshop_build_skips_items_with_no_file_url`: item with `file_url=""` and `last_error="steam result 9"` → builder writes one stderr line, does NOT call `download_to_cache`, build succeeds with `last_build_status='ok'`, item is absent from the symlink set.

### `tests/test_workshop_routes.py`: new per-overlay refresh route

- `test_overlay_refresh_owner_allowed`: owner POST → `fetch_metadata_batch` called with exactly that overlay's steam_ids → WorkshopItem rows updated → `build_overlay` enqueued → 302 to /jobs/{id}.
- `test_overlay_refresh_other_user_forbidden`: non-owner non-admin → 403.
- `test_overlay_refresh_admin_can_refresh_any`: admin POST on someone else's overlay → 200/302.
- `test_overlay_refresh_steam_api_error_502`: `fetch_metadata_batch` raises → response is 502, no job enqueued.
- `test_overlay_refresh_empty_overlay_400`: overlay has no items → 400, no Steam call.
- `test_overlay_refresh_drops_missing_items_gracefully`: Steam returns nothing for one ID → that row gets `last_error="steam returned no entry…"`, build still enqueued.

### `tests/test_cli.py`: new CLI subcommand

- `test_workshop_refresh_enqueues_job`: CLI invocation inserts a queued `Job(operation='refresh_workshop_items')` and prints its id.
- `test_workshop_refresh_idempotent_when_queued`: pre-existing queued/running refresh job → second invocation prints the existing id and does not insert a duplicate.

### `tests/test_job_worker.py`

No new tests.
Scheduler rules and `_run_refresh_workshop_items` are unchanged. Existing coverage holds.

### Out of test scope

The systemd timer. Validating it requires a host; smoke-test it on the dev host post-deploy.

## Out of Scope

- **Replacing the global mutex on `refresh_workshop_items`.** Daily refresh still blocks server starts/builds during its run. Scheduled at 04:00 with `Persistent=true`; revisit only if it observably hurts.
- **Per-item retry policy in `download_to_cache`.** Retry stays in the builder.
- **Cache GC.** Cache still grows monotonically, same as in the v1 spec.
- **Steam API rate-limit handling for the metadata endpoint.** No backoff for metadata calls. Retries apply only to per-item file downloads.
- **Update-aware server restart UX.** When the daily refresh re-downloads an item mounted by a running server, the running server keeps its old mount. Notifying the user / offering a "restart to pick up updates" prompt stays in the backlog.
- **Per-overlay refresh on non-workshop overlay types.** Only workshop overlays get the Refresh button.

## Affected Files

Implementation will touch roughly:

- `l4d2web/services/overlay_builders.py`: WorkshopBuilder download phase, retry helper.
- `l4d2web/routes/workshop_routes.py`: new `/overlays/{id}/refresh` route.
- `l4d2web/templates/...`: Refresh button on overlay detail page.
- `l4d2web/cli.py`: new `workshop-refresh` subcommand.
- `l4d2web/models.py` and `alembic/versions/...`: possibly relax `Job.user_id` to nullable (TBD per schema check).
- `deploy/`: systemd `.service` + `.timer` units, wired into the install flow.
- `l4d2web/tests/test_overlay_builders.py`, `test_workshop_routes.py`, `test_cli.py`: new test cases per the testing section.

The implementation plan will turn these into ordered steps with explicit checkpoints.