Closes the gap where added workshop items never reach disk until an admin presses the global refresh button. Downloads piggyback on the per-overlay build_overlay job; daily updates come from a systemd timer + CLI subcommand that enqueues the existing refresh job.
Workshop Auto-Download — Design
Problem
When a user adds workshop items to an overlay (POST /overlays/{id}/items), the route saves WorkshopItem metadata and enqueues a build_overlay job. The build symlinks already-cached .vpk files and emits "skipped: not yet downloaded" to stderr for everything else. The only thing that actually pulls bytes from Steam is the admin-only refresh_workshop_items job, which runs under a global mutex that blocks all server starts, builds, and installs.
In practice, this means freshly-added items never appear in the overlay until an admin presses a button. That isn't workable.
Goals
- Newly added items get downloaded without admin action.
- Items that authors update on Steam get re-downloaded automatically on a daily cadence.
- Overlay owners can manually re-check / re-pull their own overlay's items.
Non-Goals
See "Out of Scope" at the end. In particular: the refresh_workshop_items global mutex stays; there is no cache GC; no per-item retry inside download_to_cache; no update-aware server-restart prompt.
Architecture
Three changes layered onto the existing scheduler. None introduce a new job type or new scheduler rule.
┌─────────────────────────────────────────────────────────────────────┐
│ User adds items │
│ POST /overlays/{id}/items │
│ ↳ fetch metadata batch (mode=add) │
│ ↳ upsert WorkshopItem rows │
│ ↳ enqueue_build_overlay ◀── already happens today │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ build_overlay job (per-overlay; not a global mutex) │
│ WorkshopBuilder.build(): │
│ 1. query overlay's items │
│ 2. for each item where cache miss / stale: ◀── NEW │
│ download_to_cache(meta) with retry+backoff │
│ stamp WorkshopItem.last_downloaded_at │
│ 3. apply symlinks (existing logic) │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Owner re-checks one overlay │
│ POST /overlays/{id}/refresh ◀── NEW │
│ ↳ fetch metadata batch for this overlay only (mode=refresh) │
│ ↳ update WorkshopItem rows │
│ ↳ enqueue_build_overlay (does the download) │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Daily global update │
│ systemd timer → l4d2web workshop-refresh CLI ◀── NEW │
│ ↳ inserts Job(operation='refresh_workshop_items') │
│ ↳ worker picks it up; existing global-mutex rule still applies │
│ ↳ existing _run_refresh_workshop_items code unchanged │
└─────────────────────────────────────────────────────────────────────┘
Key invariant: on-add downloads run inside the per-overlay build_overlay job, so they do not block server starts globally. Only the daily global refresh keeps the existing global-mutex semantics.
Component 1 — Auto-download inside WorkshopBuilder.build
The builder gets a new download phase between "query items" and "apply symlinks". Today's behavior (skip-uncached with stderr warning) is replaced.
Decision logic
For each item bound to the overlay:
- Skip with warning if file_url == "" (Steam returned result != 1 last time we asked: delisted, private, or hidden). Emit one stderr line: workshop item {steam_id} skipped: no file_url (steam result: {last_error}). Do not fail the build; these items quietly fall out of the symlink set because they never produce a cache file. An owner can investigate via the overlay detail page, where last_error is shown.
- Otherwise, download when any of:
  - last_downloaded_at IS NULL, or
  - cache file {steam_id}.vpk missing, or
  - cache file (mtime, size) doesn't match (time_updated, file_size) from the row.
- Otherwise, leave the item alone (its cache file is current).
steam_workshop.download_to_cache already does the (mtime, size) check internally and short-circuits when the cache is current, so the builder can call it unconditionally for items in the "maybe download" set and trust the helper for idempotence.
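The three-way decision above can be sketched as a small pure function. This is only an illustration: ItemRow and classify are hypothetical names, the row is a stand-in for the WorkshopItem columns the decision reads, and it assumes time_updated is stored as Unix seconds and compared against the cache file's mtime truncated to whole seconds.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Optional


@dataclass
class ItemRow:
    """Hypothetical stand-in for the WorkshopItem columns the decision reads."""
    steam_id: str
    file_url: str
    file_size: int
    time_updated: int  # assumption: Unix seconds
    last_downloaded_at: Optional[datetime] = None


def classify(item: ItemRow, cache_root: Path) -> str:
    """Return 'skip', 'download', or 'cached' for one overlay item."""
    if item.file_url == "":
        return "skip"  # Steam cannot serve it; warn and keep building
    if item.last_downloaded_at is None:
        return "download"  # never downloaded
    cache_file = cache_root / f"{item.steam_id}.vpk"
    try:
        st = cache_file.stat()
    except FileNotFoundError:
        return "download"  # cache file missing
    if (int(st.st_mtime), st.st_size) != (item.time_updated, item.file_size):
        return "download"  # cache file is stale
    return "cached"
```

Because download_to_cache repeats the (mtime, size) check itself, a builder could equally collapse the 'download' and 'cached' outcomes and call the helper for every item that is not skipped.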
Stamping
- On success per item: WorkshopItem.last_downloaded_at = now(), last_error = "".
- On failure per item (after retry exhaustion): last_error records the final exception string; the builder raises → last_build_status='failed'.
What the builder does NOT do
It does not fetch fresh Steam metadata. Metadata is the responsibility of the add route, the per-overlay refresh route, and the daily refresh job. The builder is a pure function of DB state — this keeps it cheap and predictable, and lets builds run without any outbound metadata call.
Concurrency
Items are downloaded sequentially within one builder run. Different overlays' builds run in parallel under existing scheduler rules; when two overlays share an item and race, the existing download_to_cache idempotence handles it — the loser sees a fresh file and skips. last_downloaded_at writes from two concurrent builds collapse to one timestamp; no real race.
Cancellation
The builder threads should_cancel into download_to_cache (the helper already accepts it). A cancel mid-download deletes the .partial file, and the symlink phase does not run. A cancel during the inter-attempt sleep is noticed within ~250 ms (see retry section).
Logging
Each item's download start / finish / error emits one line. Counts are reported in the existing summary line:
workshop overlay 'mycollection': downloaded=3 cached=12 skipped=1 created=14 removed=1 unchanged=11 errors=0
skipped now means "Steam can't serve this item (no file_url)" instead of the old "uncached" meaning. Uncached items get downloaded.
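A minimal sketch of how the summary line could be assembled from per-phase counters. The function name summary_line and the use of a Counter are assumptions for illustration; only the output format comes from the example above.

```python
from collections import Counter

# Field order follows the example summary line in this section.
SUMMARY_FIELDS = (
    "downloaded", "cached", "skipped",
    "created", "removed", "unchanged", "errors",
)


def summary_line(overlay_name: str, counts: Counter) -> str:
    """Format one summary line; counters that were never bumped print as 0."""
    body = " ".join(f"{name}={counts[name]}" for name in SUMMARY_FIELDS)
    return f"workshop overlay '{overlay_name}': {body}"
```

A Counter returns 0 for keys that were never incremented, so a build with no errors still prints errors=0 without special-casing.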
Component 2 — Retry & backoff
Wraps each download_to_cache(meta, ...) call inside the builder.
attempts = 3
delays = [1, 2, 4]  # seconds, exponential; slept between attempts, so 3 attempts use only the first two
for n in range(1, attempts + 1):
    try:
        download_to_cache(meta, cache_root, should_cancel=should_cancel)
        break
    except InterruptedError:  # cancellation; must precede OSError, its base class
        raise  # propagate immediately
    except (requests.RequestException, OSError) as exc:
        if n == attempts:
            raise  # final attempt: bubble up → job fails
        on_stderr(f"workshop {meta.steam_id} attempt {n}/{attempts} failed: {exc}")
        sleep_with_cancel(delays[n - 1], should_cancel)
Notes
- sleep_with_cancel is a small helper that polls should_cancel every ~250 ms during the sleep so a cancel does not wait out the full backoff window.
- The retry loop lives in the builder (overlay_builders.py), not in steam_workshop.download_to_cache. The downloader stays a single-shot primitive; retry policy is a caller concern. This keeps the helper testable without time-mocking.
- HTTP 4xx responses raised by raise_for_status() are requests.HTTPError (a RequestException), so they are retried too. That is intentional: 404 / 410 will fail three times quickly and surface; the cost of three failed attempts is negligible compared to the cost of users having to guess why a single transient blip killed the job.
- On final failure the job fails with the per-item error string and overlay last_build_status='failed', matching the existing "never silently mount a partial overlay" rule.
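The retry loop and the cancellable sleep can be sketched together as runnable code. This is a sketch under stated assumptions, not the planned implementation: download_with_retry is a hypothetical wrapper name, the download is passed in as a zero-argument callable, and a cancel during backoff is assumed to surface as InterruptedError. Only OSError is caught here to keep the block dependency-free; requests.RequestException derives from OSError, so the same clause would cover it.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def sleep_with_cancel(seconds: float, should_cancel: Callable[[], bool],
                      poll: float = 0.25) -> None:
    """Sleep up to `seconds`, checking should_cancel about every `poll` seconds.

    Assumption: a cancel is reported by raising InterruptedError.
    """
    deadline = time.monotonic() + seconds
    while True:
        if should_cancel():
            raise InterruptedError("cancelled during backoff")
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return
        time.sleep(min(poll, remaining))


def download_with_retry(
    download: Callable[[], T],
    *,
    on_stderr: Callable[[str], None],
    should_cancel: Callable[[], bool],
    attempts: int = 3,
    delays: Sequence[float] = (1.0, 2.0, 4.0),
    sleep: Callable[[float, Callable[[], bool]], None] = sleep_with_cancel,
) -> T:
    """Call download() up to `attempts` times, backing off between failures."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for n in range(1, attempts + 1):
        try:
            return download()
        except InterruptedError:
            # Listed first because InterruptedError is a subclass of OSError.
            raise
        except OSError as exc:
            if n == attempts:
                raise  # final attempt: let the caller fail the job
            on_stderr(f"attempt {n}/{attempts} failed: {exc}")
            sleep(delays[n - 1], should_cancel)
    raise AssertionError("unreachable")
```

Taking the sleep function as a parameter is one way to let tests replace the backoff with a no-op or a recorder instead of patching time.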
Component 3 — Per-overlay refresh
New route POST /overlays/{id}/refresh. Mirrors the add route's metadata-fetch path but scoped to the items already in this overlay.
Route sketch
@bp.post("/overlays/<int:overlay_id>/refresh")
@require_login
def refresh_overlay(overlay_id: int) -> Response:
    user = current_user()
    # First session: permission check and collect this overlay's steam ids.
    with session_scope() as db:
        overlay, err = _check_workshop_overlay_access(overlay_id, user, db)
        if err is not None: return err
        steam_ids = db.scalars(
            select(WorkshopItem.steam_id)
            .join(OverlayWorkshopItem, OverlayWorkshopItem.workshop_item_id == WorkshopItem.id)
            .where(OverlayWorkshopItem.overlay_id == overlay_id)
        ).all()
    if not steam_ids:
        return Response("overlay has no items", status=400)
    # The Steam call happens outside any DB session.
    try:
        metas = steam_workshop.fetch_metadata_batch(steam_ids, mode="refresh")
    except Exception as exc:
        return Response(f"steam api error: {exc}", status=502)
    # Second session: re-check access, write metadata, enqueue the build.
    with session_scope() as db:
        overlay, err = _check_workshop_overlay_access(overlay_id, user, db)
        if err is not None: return err
        metas_by_id = {m.steam_id: m for m in metas}
        for steam_id in steam_ids:
            wi = db.scalar(select(WorkshopItem).where(WorkshopItem.steam_id == steam_id))
            meta = metas_by_id.get(steam_id)
            if wi is None: continue
            if meta is None:
                wi.last_error = "steam returned no entry for this item"
                continue
            wi.title = meta.title
            wi.filename = meta.filename
            wi.file_url = meta.file_url
            wi.file_size = meta.file_size
            wi.time_updated = meta.time_updated
            wi.preview_url = meta.preview_url
            wi.last_error = "" if meta.result == 1 else f"steam result {meta.result}"
        job = enqueue_build_overlay(db, overlay_id=overlay_id, user_id=user.id)
        job_id = job.id
    return redirect(f"/jobs/{job_id}")
Behavior notes
- Permission: same _check_workshop_overlay_access used by add/remove (owner or admin).
- mode="refresh" (not "add"): non-L4D2 items silently drop from the batch instead of raising. An item whose consumer_app_id somehow changed after add will not break refresh.
- The metadata write does not stamp last_downloaded_at. That field stays bound to actual file presence; the builder's download phase stamps it after the bytes land. A refresh that finds time_updated advanced therefore leaves last_downloaded_at pointing at the prior version; the (mtime, size) check in download_to_cache sees the mismatch and the builder re-downloads. Correct by construction.
- One Steam metadata POST per click, owner-gated. No new rate-limit concern.
UI
A "Refresh" button next to "Add items" on the overlay detail page (workshop type only). Submits the POST; redirects to the job page like everything else.
Component 4 — Periodic global refresh (CLI + systemd timer)
The existing _run_refresh_workshop_items job is complete and correct — it fetches all metadata, downloads what advanced, re-enqueues build_overlay for affected overlays. We only need a way to enqueue it on a schedule.
CLI subcommand
In l4d2web/cli.py:
@cli.command("workshop-refresh")
def workshop_refresh() -> None:
    """Enqueue a global workshop refresh job. Idempotent: if one is already
    queued or running, prints its id and exits 0."""
    with session_scope() as db:
        existing = db.scalar(
            select(Job).where(
                Job.operation == "refresh_workshop_items",
                Job.state.in_(("queued", "running", "cancelling")),
            ).order_by(Job.id.desc()).limit(1)
        )
        if existing is not None:
            click.echo(f"refresh_workshop_items job {existing.id} already {existing.state}")
            return
        job = Job(
            user_id=None,
            server_id=None,
            operation="refresh_workshop_items",
            state="queued",
        )
        db.add(job)
        db.flush()
        click.echo(f"enqueued refresh_workshop_items job {job.id}")
Schema follow-up
Job.user_id = None for system-enqueued refreshes. The implementation plan must verify whether the column is currently nullable; if it is NOT NULL, the plan either (a) relaxes it to nullable (preferred — "system" is a real category) or (b) records the lowest-id admin user as the actor. The design assumes (a).
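To make option (a) and the idempotence check concrete, here is a self-contained sketch using the standard-library sqlite3 module in place of the project's ORM. The table name jobs, its column types, and the helper names make_db and enqueue_refresh are assumptions for illustration; the real schema and session handling live in the application.

```python
import sqlite3
from typing import Tuple

ACTIVE_STATES = ("queued", "running", "cancelling")


def make_db() -> sqlite3.Connection:
    """In-memory stand-in for the jobs table, with user_id nullable (option a)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE jobs (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            user_id   INTEGER NULL,   -- NULL marks a system-enqueued job
            server_id INTEGER NULL,
            operation TEXT NOT NULL,
            state     TEXT NOT NULL
        )
        """
    )
    return conn


def enqueue_refresh(conn: sqlite3.Connection) -> Tuple[int, bool]:
    """Return (job_id, created); reuse an active refresh job if one exists."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE operation = ? AND state IN (?, ?, ?) "
        "ORDER BY id DESC LIMIT 1",
        ("refresh_workshop_items", *ACTIVE_STATES),
    ).fetchone()
    if row is not None:
        return row[0], False
    cur = conn.execute(
        "INSERT INTO jobs (user_id, server_id, operation, state) "
        "VALUES (NULL, NULL, ?, 'queued')",
        ("refresh_workshop_items",),
    )
    conn.commit()
    return cur.lastrowid, True
```

Calling enqueue_refresh twice in a row returns the same id with created=False the second time; a new row is inserted only once the earlier job has left the active states.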
systemd units in deploy/
# left4me-workshop-refresh.service
[Unit]
Description=Left4me — enqueue daily workshop refresh
After=network-online.target left4me-web.service
Requires=left4me-web.service
[Service]
Type=oneshot
User=left4me
ExecStart=/opt/left4me/bin/l4d2web workshop-refresh
# left4me-workshop-refresh.timer
[Unit]
Description=Left4me — daily workshop refresh
[Timer]
OnCalendar=*-*-* 04:00:00
Persistent=true
RandomizedDelaySec=15min
[Install]
WantedBy=timers.target
Operator notes
- The timer enqueues; the worker decides when to actually run. The existing scheduler will defer the refresh if a server start, install, or build is in progress. Worst case, the refresh starts after the conflicting job finishes, which is the intended behavior.
- Persistent=true handles "host was down at 04:00": the unit runs on next boot. The CLI's idempotence check prevents pile-up if it fires twice.
- Deployment wires this into the existing deploy/ install flow (in scope for the implementation plan).
Testing
Layered against the existing test files. No new test infrastructure.
tests/test_overlay_builders.py — bulk of new coverage
- test_workshop_build_downloads_uncached_items: item with last_downloaded_at=None and no cache file → patched download_to_cache is called → file appears → symlink created → last_downloaded_at stamped.
- test_workshop_build_skips_already_cached_items: item with cache file matching (time_updated, size) → download_to_cache returns immediately (its existing idempotence) → no network → symlink created.
- test_workshop_build_redownloads_stale_cache: cache file exists but (mtime, size) mismatches the DB row → re-download happens.
- test_workshop_build_retry_succeeds: patched downloader fails twice then succeeds → builder finishes ok, retry messages on stderr, last_downloaded_at stamped. Backoff sleep monkey-patched to zero for speed.
- test_workshop_build_retry_exhausted_fails_job: downloader fails all three attempts → builder raises → last_build_status='failed', last_error populated on the WorkshopItem.
- test_workshop_build_cancellation_during_download: should_cancel flips true mid-download → builder returns early, .partial cleaned up by download_to_cache, symlink phase did not run.
- test_workshop_build_cancellation_during_backoff: cancel flips true while sleeping between retries → wakes up within ~250 ms of the cancel.
- test_workshop_build_skips_items_with_no_file_url: item with file_url="" and last_error="steam result 9" → builder writes one stderr line, does NOT call download_to_cache, build succeeds with last_build_status='ok', item is absent from the symlink set.
tests/test_workshop_routes.py — new per-overlay refresh route
- test_overlay_refresh_owner_allowed: owner POST → fetch_metadata_batch called with exactly that overlay's steam_ids → WorkshopItem rows updated → build_overlay enqueued → 302 to /jobs/{id}.
- test_overlay_refresh_other_user_forbidden: non-owner non-admin → 403.
- test_overlay_refresh_admin_can_refresh_any: admin POST on someone else's overlay → 200/302.
- test_overlay_refresh_steam_api_error_502: fetch_metadata_batch raises → response is 502, no job enqueued.
- test_overlay_refresh_empty_overlay_400: overlay has no items → 400, no Steam call.
- test_overlay_refresh_drops_missing_items_gracefully: Steam returns nothing for one ID → that row gets last_error="steam returned no entry…", build still enqueued.
tests/test_cli.py — new CLI subcommand
- test_workshop_refresh_enqueues_job: CLI invocation inserts a queued Job(operation='refresh_workshop_items') and prints its id.
- test_workshop_refresh_idempotent_when_queued: pre-existing queued/running refresh job → second invocation prints the existing id and does not insert a duplicate.
tests/test_job_worker.py
No new tests. Scheduler rules and _run_refresh_workshop_items are unchanged. Existing coverage holds.
Out of test scope
The systemd timer. Validating it requires a host; smoke it on the dev host post-deploy.
Out of Scope
- Replacing the global mutex on refresh_workshop_items. Daily refresh still blocks server starts/builds during its run. Scheduled at 04:00 with Persistent=true; revisit only if it observably hurts.
- Per-item retry policy in download_to_cache. Retry stays in the builder.
- Cache GC. Cache still grows monotonically, same as the v1 spec.
- Steam API rate-limit handling for the metadata endpoint. No backoff for metadata calls. Retries apply only to per-item file downloads.
- Update-aware server restart UX. When the daily refresh re-downloads an item mounted by a running server, the running server keeps its old mount. Notifying the user / offering a "restart to pick up updates" prompt stays in the backlog.
- Per-overlay refresh on non-workshop overlay types. Only workshop overlays get the Refresh button.
Affected Files
Implementation will touch roughly:
- l4d2web/services/overlay_builders.py: WorkshopBuilder download phase, retry helper.
- l4d2web/routes/workshop_routes.py: new /overlays/{id}/refresh route.
- l4d2web/templates/...: Refresh button on overlay detail page.
- l4d2web/cli.py: new workshop-refresh subcommand.
- l4d2web/models.py and alembic/versions/...: possibly relax Job.user_id to nullable (TBD per schema check).
- deploy/: systemd .service + .timer units, wired into the install flow.
- l4d2web/tests/test_overlay_builders.py, test_workshop_routes.py, test_cli.py: new test cases per the testing section.
The implementation plan will turn these into ordered steps with explicit checkpoints.