left4me/docs/superpowers/specs/2026-05-11-workshop-auto-download-design.md
mwiegand fef8cc4ea6
docs: design for workshop auto-download
Closes the gap where added workshop items never reach disk until an
admin presses the global refresh button. Downloads piggyback on the
per-overlay build_overlay job; daily updates come from a systemd
timer + CLI subcommand that enqueues the existing refresh job.
2026-05-11 22:28:20 +02:00

Workshop Auto-Download — Design

Problem

When a user adds workshop items to an overlay (POST /overlays/{id}/items), the route saves WorkshopItem metadata and enqueues a build_overlay job. The build symlinks already-cached .vpk files and, for everything else, emits "skipped: not yet downloaded" to stderr. The only code path that actually pulls bytes from Steam is the admin-only refresh_workshop_items job, which holds a global mutex that blocks all server starts, builds, and installs.

In practice, this means freshly-added items never appear in the overlay until an admin presses a button. That isn't workable.

Goals

  1. Newly added items get downloaded without admin action.
  2. Items that authors update on Steam get re-downloaded automatically on a daily cadence.
  3. Overlay owners can manually re-check / re-pull their own overlay's items.

Non-Goals

See "Out of Scope" at the end. In particular: the refresh_workshop_items global mutex stays; there is no cache GC; no per-item retry inside download_to_cache; no update-aware server-restart prompt.

Architecture

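Three changes are layered onto the existing scheduler (Components 1 and 2 below together form the first one, the builder's download phase). None of them introduces a new job type or a new scheduler rule.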

┌─────────────────────────────────────────────────────────────────────┐
│ User adds items                                                     │
│   POST /overlays/{id}/items                                         │
│     ↳ fetch metadata batch (mode=add)                               │
│     ↳ upsert WorkshopItem rows                                      │
│     ↳ enqueue_build_overlay         ◀── already happens today       │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│ build_overlay job  (per-overlay; not a global mutex)                │
│   WorkshopBuilder.build():                                          │
│     1. query overlay's items                                        │
│     2. for each item where cache miss / stale:        ◀── NEW       │
│          download_to_cache(meta) with retry+backoff                 │
│          stamp WorkshopItem.last_downloaded_at                      │
│     3. apply symlinks (existing logic)                              │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Owner re-checks one overlay                                         │
│   POST /overlays/{id}/refresh         ◀── NEW                       │
│     ↳ fetch metadata batch for this overlay only (mode=refresh)     │
│     ↳ update WorkshopItem rows                                      │
│     ↳ enqueue_build_overlay (does the download)                     │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│ Daily global update                                                 │
│   systemd timer → l4d2web workshop-refresh CLI    ◀── NEW           │
│     ↳ inserts Job(operation='refresh_workshop_items')               │
│     ↳ worker picks it up; existing global-mutex rule still applies  │
│     ↳ existing _run_refresh_workshop_items code unchanged           │
└─────────────────────────────────────────────────────────────────────┘

Key invariant: on-add downloads run inside the per-overlay build_overlay job, so they do not block server starts globally. Only the daily global refresh keeps the existing global-mutex semantics.

Component 1 — Auto-download inside WorkshopBuilder.build

The builder gets a new download phase between "query items" and "apply symlinks". Today's behavior (skip-uncached with stderr warning) is replaced.

Decision logic

For each item bound to the overlay:

  1. Skip with warning if file_url == "" (Steam returned result != 1 last time we asked — delisted, private, or hidden). Emit one stderr line workshop item {steam_id} skipped: no file_url (steam result: {last_error}). Do not fail the build — these items quietly fall out of the symlink set because they never produce a cache file. An owner can investigate via the overlay detail page where last_error is shown.
  2. Otherwise, download when any of:
    • last_downloaded_at IS NULL, or
    • cache file {steam_id}.vpk missing, or
    • cache file (mtime, size) doesn't match (time_updated, file_size) from the row.
  3. Otherwise, leave the item alone (its cache file is current).

steam_workshop.download_to_cache already does the (mtime, size) check internally and short-circuits when the cache is current, so the builder can call it unconditionally for items in the "maybe download" set and trust the helper for idempotence.
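
The decision rules above can be sketched as a small pure function. This is a minimal sketch rather than the project's code: ItemRow, classify_item, and the string verdicts are hypothetical names, and it assumes time_updated is a Unix timestamp that is compared against the cache file's integer mtime.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Optional


@dataclass
class ItemRow:
    """Hypothetical stand-in for the WorkshopItem columns the builder reads."""
    steam_id: str
    file_url: str
    file_size: int
    time_updated: int                      # assumed: Unix seconds from Steam
    last_downloaded_at: Optional[datetime] = None


def classify_item(row: ItemRow, cache_root: Path) -> str:
    """Return 'skip', 'download', or 'current' for one overlay item."""
    if row.file_url == "":
        return "skip"                      # rule 1: Steam cannot serve the item
    if row.last_downloaded_at is None:
        return "download"                  # rule 2a: never downloaded
    cache_file = cache_root / f"{row.steam_id}.vpk"
    try:
        st = cache_file.stat()
    except FileNotFoundError:
        return "download"                  # rule 2b: cache file missing
    if (int(st.st_mtime), st.st_size) != (row.time_updated, row.file_size):
        return "download"                  # rule 2c: stale (mtime, size)
    return "current"                       # rule 3: leave the item alone
```

Because download_to_cache repeats the (mtime, size) check itself, a builder could also treat every non-skipped item as "maybe download" and let the helper decide; the explicit classification is mainly useful for the downloaded/cached/skipped counters.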

Stamping

  • On success per item: WorkshopItem.last_downloaded_at = now(), last_error = "".
  • On failure per item (after retry exhaustion): last_error records the final exception string; the builder raises → last_build_status='failed'.

What the builder does NOT do

It does not fetch fresh Steam metadata. Metadata is the responsibility of the add route, the per-overlay refresh route, and the daily refresh job. The builder is a pure function of DB state — this keeps it cheap and predictable, and lets builds run without any outbound metadata call.

Concurrency

Items are downloaded sequentially within one builder run. Different overlays' builds run in parallel under existing scheduler rules. When two overlays share an item and race, the existing download_to_cache idempotence handles it: the loser sees a fresh file and skips. Both builds then stamp last_downloaded_at with nearly identical timestamps, so whichever write lands last is harmless.

Cancellation

The builder threads should_cancel into download_to_cache (the helper already accepts it). A cancel that arrives mid-download deletes the .partial file, and the symlink phase does not run. A cancel that arrives during the inter-attempt sleep is noticed within ~250 ms (see the retry section).

Logging

Each item's download start / finish / error emits one line. Counts are reported in the existing summary line:

workshop overlay 'mycollection': downloaded=3 cached=12 skipped=1 created=14 removed=1 unchanged=11 errors=0

skipped now means "Steam can't serve this item (no file_url)" instead of the old "uncached" meaning. Uncached items get downloaded.

Component 2 — Retry & backoff

Wraps each download_to_cache(meta, ...) call inside the builder.

attempts = 3
delays   = (1.0, 2.0, 4.0)   # seconds, exponential; slept between attempts, so
                             # with 3 attempts only the first two are ever used

for n in range(1, attempts + 1):
    try:
        download_to_cache(meta, cache_root, should_cancel=should_cancel)
        break
    except InterruptedError:                 # cancellation; subclass of OSError,
        raise                                # so this clause must come first
    except (requests.RequestException, OSError) as exc:
        if n == attempts:
            raise                            # final attempt: bubble up → job fails
        on_stderr(f"workshop {meta.steam_id} attempt {n}/{attempts} failed: {exc}")
        sleep_with_cancel(delays[n - 1], should_cancel)

Notes

  • sleep_with_cancel is a small helper that polls should_cancel every ~250 ms during the sleep so a cancel does not wait out the full backoff window.
  • The retry loop lives in the builder (overlay_builders.py), not in steam_workshop.download_to_cache. The downloader stays a single-shot primitive; retry policy is a caller concern. Keeps the helper testable without time-mocking.
  • HTTP 4xx responses raised by raise_for_status() are requests.HTTPError (a RequestException), so they are retried too. That is intentional: a 404 or 410 fails all three attempts within a few seconds (1 s + 2 s of backoff) and then surfaces, and that cost is negligible compared with users having to guess why a single transient blip killed the job.
  • On final failure the job fails with the per-item error string and overlay last_build_status='failed', matching the existing "never silently mount a partial overlay" rule.
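
The first note describes sleep_with_cancel only in words. A minimal sketch of such a helper follows, under two assumptions that the source does not confirm: should_cancel is a zero-argument callable returning bool, and a cancel is signalled by raising InterruptedError (the same exception the retry loop treats as cancellation). The poll_interval parameter is a hypothetical addition for testability.

```python
import time
from typing import Callable


def sleep_with_cancel(
    seconds: float,
    should_cancel: Callable[[], bool],
    poll_interval: float = 0.25,
) -> None:
    """Sleep for `seconds`, checking `should_cancel` about every 250 ms.

    Raises InterruptedError as soon as a cancel is seen (assumed convention).
    """
    deadline = time.monotonic() + seconds
    while True:
        if should_cancel():
            raise InterruptedError("cancelled during backoff")
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return
        # Never sleep past the deadline, and never longer than one poll.
        time.sleep(min(poll_interval, remaining))
```

Using time.monotonic() rather than wall-clock time keeps the backoff window stable if the system clock is adjusted while the job sleeps.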

Component 3 — Per-overlay refresh

New route POST /overlays/{id}/refresh. Mirrors the add route's metadata-fetch path but scoped to the items already in this overlay.

Route sketch

@bp.post("/overlays/<int:overlay_id>/refresh")
@require_login
def refresh_overlay(overlay_id: int) -> Response:
    user = current_user()
    with session_scope() as db:
        overlay, err = _check_workshop_overlay_access(overlay_id, user, db)
        if err is not None: return err
        steam_ids = db.scalars(
            select(WorkshopItem.steam_id)
            .join(OverlayWorkshopItem, OverlayWorkshopItem.workshop_item_id == WorkshopItem.id)
            .where(OverlayWorkshopItem.overlay_id == overlay_id)
        ).all()

    if not steam_ids:
        return Response("overlay has no items", status=400)

    try:
        metas = steam_workshop.fetch_metadata_batch(steam_ids, mode="refresh")
    except Exception as exc:
        return Response(f"steam api error: {exc}", status=502)

    with session_scope() as db:
        overlay, err = _check_workshop_overlay_access(overlay_id, user, db)
        if err is not None: return err
        metas_by_id = {m.steam_id: m for m in metas}
        for steam_id in steam_ids:
            wi = db.scalar(select(WorkshopItem).where(WorkshopItem.steam_id == steam_id))
            meta = metas_by_id.get(steam_id)
            if wi is None: continue
            if meta is None:
                wi.last_error = "steam returned no entry for this item"
                continue
            wi.title = meta.title
            wi.filename = meta.filename
            wi.file_url = meta.file_url
            wi.file_size = meta.file_size
            wi.time_updated = meta.time_updated
            wi.preview_url = meta.preview_url
            wi.last_error = "" if meta.result == 1 else f"steam result {meta.result}"
        job = enqueue_build_overlay(db, overlay_id=overlay_id, user_id=user.id)
        job_id = job.id
    return redirect(f"/jobs/{job_id}")

Behavior notes

  • Permission: same _check_workshop_overlay_access used by add/remove — owner or admin.
  • mode="refresh" (not "add"): non-L4D2 items silently drop from the batch instead of raising. An item whose consumer_app_id somehow changed after add will not break refresh.
  • The metadata write does not stamp last_downloaded_at. That field stays bound to actual file presence — the builder's download phase stamps it after the bytes land. A refresh that finds time_updated advanced therefore leaves last_downloaded_at pointing at the prior version; the (mtime, size) check in download_to_cache sees the mismatch and the builder re-downloads. Correct by construction.
  • One Steam metadata POST per click, owner-gated. No new rate-limit concern.
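
The note about last_downloaded_at can be illustrated with a pure function that mirrors the route's per-row update. This is a sketch under assumptions: Meta and Row are hypothetical dataclasses standing in for a fetch_metadata_batch entry and a WorkshopItem row, and only a subset of the fields is shown.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Meta:
    """Hypothetical shape of one fetch_metadata_batch entry."""
    steam_id: str
    title: str
    file_url: str
    file_size: int
    time_updated: int
    result: int


@dataclass
class Row:
    """Hypothetical subset of WorkshopItem columns."""
    steam_id: str
    title: str = ""
    file_url: str = ""
    file_size: int = 0
    time_updated: int = 0
    last_error: str = ""
    last_downloaded_at: Optional[datetime] = None


def apply_refresh(row: Row, meta: Optional[Meta]) -> None:
    """Copy fresh metadata onto the row; never touch last_downloaded_at."""
    if meta is None:
        row.last_error = "steam returned no entry for this item"
        return
    row.title = meta.title
    row.file_url = meta.file_url
    row.file_size = meta.file_size
    row.time_updated = meta.time_updated
    row.last_error = "" if meta.result == 1 else f"steam result {meta.result}"
```

After a refresh that advances time_updated, the row no longer matches the cache file's (mtime, size), which is what makes the next build re-download the item.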

UI

A "Refresh" button next to "Add items" on the overlay detail page (workshop type only). Submits the POST; redirects to the job page like everything else.

Component 4 — Periodic global refresh (CLI + systemd timer)

The existing _run_refresh_workshop_items job is complete and correct: it fetches all metadata, downloads whatever advanced, and re-enqueues build_overlay for affected overlays. We only need a way to enqueue it on a schedule.

CLI subcommand

In l4d2web/cli.py:

@cli.command("workshop-refresh")
def workshop_refresh() -> None:
    """Enqueue a global workshop refresh job. Idempotent: if one is already
    queued or running, prints its id and exits 0."""
    with session_scope() as db:
        existing = db.scalar(
            select(Job).where(
                Job.operation == "refresh_workshop_items",
                Job.state.in_(("queued", "running", "cancelling")),
            ).order_by(Job.id.desc()).limit(1)
        )
        if existing is not None:
            click.echo(f"refresh_workshop_items job {existing.id} already {existing.state}")
            return
        job = Job(
            user_id=None,
            server_id=None,
            operation="refresh_workshop_items",
            state="queued",
        )
        db.add(job)
        db.flush()
        click.echo(f"enqueued refresh_workshop_items job {job.id}")
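
The check-then-insert idempotence can be tried in isolation. The sketch below swaps SQLAlchemy for the standard-library sqlite3 module so that it runs standalone; the jobs table, its columns, and enqueue_refresh are hypothetical and only approximate the real Job model.

```python
import sqlite3
from typing import Tuple

ACTIVE_STATES = ("queued", "running", "cancelling")


def enqueue_refresh(conn: sqlite3.Connection) -> Tuple[int, bool]:
    """Return (job_id, created). Reuses an active refresh job if one exists."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE operation = ? AND state IN (?, ?, ?) "
        "ORDER BY id DESC LIMIT 1",
        ("refresh_workshop_items", *ACTIVE_STATES),
    ).fetchone()
    if row is not None:
        return row[0], False
    cur = conn.execute(
        "INSERT INTO jobs (user_id, operation, state) VALUES (NULL, ?, 'queued')",
        ("refresh_workshop_items",),
    )
    conn.commit()
    return cur.lastrowid, True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "operation TEXT NOT NULL, state TEXT NOT NULL)"
)
first = enqueue_refresh(conn)    # inserts a new queued job
second = enqueue_refresh(conn)   # finds the queued job, inserts nothing
```

Note that a check followed by an insert is not atomic across two processes that run at the same instant. With a single timer firing once a day that window looks acceptable, but it is a property of this sketch worth keeping in mind.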

Schema follow-up

System-enqueued refreshes set Job.user_id = None. The implementation plan must verify whether the column is currently nullable. If it is NOT NULL, the plan either (a) relaxes it to nullable (preferred, since "system" is a real category of actor) or (b) records the lowest-id admin user as the actor. The design assumes (a).

systemd units in deploy/

# left4me-workshop-refresh.service
[Unit]
Description=Left4me — enqueue daily workshop refresh
After=network-online.target left4me-web.service
Wants=network-online.target
Requires=left4me-web.service

[Service]
Type=oneshot
User=left4me
ExecStart=/opt/left4me/bin/l4d2web workshop-refresh

# left4me-workshop-refresh.timer
[Unit]
Description=Left4me — daily workshop refresh

[Timer]
OnCalendar=*-*-* 04:00:00
Persistent=true
RandomizedDelaySec=15min

[Install]
WantedBy=timers.target

Operator notes

  • The timer enqueues; the worker decides when to actually run. The existing scheduler will defer the refresh if a server start, install, or build is in progress. Worst case the refresh starts after the conflicting job finishes — the intended behavior.
  • Persistent=true handles "host was down at 04:00" — the unit runs on next boot. The CLI's idempotence check prevents pile-up if it fires twice.
  • Deployment wires this into the existing deploy/ install flow (in scope for the implementation plan).

Testing

Layered against the existing test files. No new test infrastructure.

tests/test_overlay_builders.py — bulk of new coverage

  • test_workshop_build_downloads_uncached_items — item with last_downloaded_at=None and no cache file → patched download_to_cache is called → file appears → symlink created → last_downloaded_at stamped.
  • test_workshop_build_skips_already_cached_items — item with cache file matching (time_updated, size) → download_to_cache returns immediately (its existing idempotence) → no network → symlink created.
  • test_workshop_build_redownloads_stale_cache — cache file exists but (mtime, size) mismatches the DB row → re-download happens.
  • test_workshop_build_retry_succeeds — patched downloader fails twice then succeeds → builder finishes ok, retry messages on stderr, last_downloaded_at stamped. Backoff sleep monkey-patched to zero for speed.
  • test_workshop_build_retry_exhausted_fails_job — downloader fails all three attempts → builder raises → last_build_status='failed', last_error populated on the WorkshopItem.
  • test_workshop_build_cancellation_during_download — should_cancel flips true mid-download → builder returns early, .partial cleaned up by download_to_cache, symlink phase did not run.
  • test_workshop_build_cancellation_during_backoff — cancel flips true while sleeping between retries → wakes up within ~250 ms of the cancel.
  • test_workshop_build_skips_items_with_no_file_url — item with file_url="" and last_error="steam result 9" → builder writes one stderr line, does NOT call download_to_cache, build succeeds with last_build_status='ok', item is absent from the symlink set.
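
The retry tests above need a downloader that fails a fixed number of times. One possible shape for such a test double is sketched here. FlakyDownloader and run_with_retry are hypothetical names, and run_with_retry is a simplified stand-in for the builder's retry loop, not the real WorkshopBuilder code.

```python
from typing import Callable, List


class FlakyDownloader:
    """Test double: raises OSError `failures` times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self, steam_id: str) -> None:
        self.calls += 1
        if self.calls <= self.failures:
            raise OSError(f"simulated failure {self.calls}")


def run_with_retry(
    download: Callable[[str], None],
    steam_id: str,
    attempts: int = 3,
    sleep: Callable[[float], None] = lambda _s: None,  # zeroed backoff for tests
) -> List[str]:
    """Simplified stand-in for the builder's retry loop; returns stderr lines."""
    stderr: List[str] = []
    delays = (1.0, 2.0, 4.0)
    for n in range(1, attempts + 1):
        try:
            download(steam_id)
            return stderr
        except InterruptedError:
            raise                      # cancellation is never retried
        except OSError as exc:
            if n == attempts:
                raise                  # retries exhausted: let the job fail
            stderr.append(f"workshop {steam_id} attempt {n}/{attempts} failed: {exc}")
            sleep(delays[n - 1])
    return stderr


def test_retry_succeeds_after_two_failures() -> None:
    dl = FlakyDownloader(failures=2)
    lines = run_with_retry(dl, "123")
    assert dl.calls == 3
    assert len(lines) == 2


def test_retry_exhausted_raises() -> None:
    dl = FlakyDownloader(failures=3)
    try:
        run_with_retry(dl, "123")
    except OSError:
        assert dl.calls == 3
    else:
        raise AssertionError("expected OSError")
```

Injecting the sleep function, instead of monkey-patching a module-level one, is an alternative way to get the zero-delay backoff that the retry tests call for.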

tests/test_workshop_routes.py — new per-overlay refresh route

  • test_overlay_refresh_owner_allowed — owner POST → fetch_metadata_batch called with exactly that overlay's steam_ids → WorkshopItem rows updated → build_overlay enqueued → 302 to /jobs/{id}.
  • test_overlay_refresh_other_user_forbidden — non-owner non-admin → 403.
  • test_overlay_refresh_admin_can_refresh_any — admin POST on someone else's overlay → 302 to /jobs/{id}.
  • test_overlay_refresh_steam_api_error_502 — fetch_metadata_batch raises → response is 502, no job enqueued.
  • test_overlay_refresh_empty_overlay_400 — overlay has no items → 400, no Steam call.
  • test_overlay_refresh_drops_missing_items_gracefully — Steam returns nothing for one ID → that row gets last_error="steam returned no entry…", build still enqueued.

tests/test_cli.py — new CLI subcommand

  • test_workshop_refresh_enqueues_job — CLI invocation inserts a queued Job(operation='refresh_workshop_items') and prints its id.
  • test_workshop_refresh_idempotent_when_queued — pre-existing queued/running refresh job → second invocation prints the existing id and does not insert a duplicate.

tests/test_job_worker.py

No new tests. Scheduler rules and _run_refresh_workshop_items are unchanged. Existing coverage holds.

Out of test scope

The systemd timer. Validating it requires a host; smoke-test it on the dev host post-deploy.

Out of Scope

  • Replacing the global mutex on refresh_workshop_items. Daily refresh still blocks server starts/builds during its run. Scheduled at 04:00 with Persistent=true; revisit only if it observably hurts.
  • Per-item retry policy in download_to_cache. Retry stays in the builder.
  • Cache GC. Cache still grows monotonically — same as the v1 spec.
  • Steam API rate-limit handling for the metadata endpoint. No backoff for metadata calls. Retries apply only to per-item file downloads.
  • Update-aware server restart UX. When the daily refresh re-downloads an item mounted by a running server, the running server keeps its old mount. Notifying the user / offering a "restart to pick up updates" prompt stays in the backlog.
  • Per-overlay refresh on non-workshop overlay types. Only workshop overlays get the Refresh button.

Affected Files

Implementation will touch roughly:

  • l4d2web/services/overlay_builders.py — WorkshopBuilder download phase, retry helper.
  • l4d2web/routes/workshop_routes.py — new /overlays/{id}/refresh route.
  • l4d2web/templates/... — Refresh button on overlay detail page.
  • l4d2web/cli.py — new workshop-refresh subcommand.
  • l4d2web/models.py and alembic/versions/... — possibly relax Job.user_id to nullable (TBD per schema check).
  • deploy/ — systemd .service + .timer units, wired into the install flow.
  • l4d2web/tests/test_overlay_builders.py, test_workshop_routes.py, test_cli.py — new test cases per the testing section.

The implementation plan will turn these into ordered steps with explicit checkpoints.