docs: design for workshop auto-download

Closes the gap where added workshop items never reach disk until an
admin presses the global refresh button. Downloads piggyback on the
per-overlay build_overlay job; daily updates come from a systemd
timer + CLI subcommand that enqueues the existing refresh job.
mwiegand 2026-05-11 22:28:20 +02:00
parent c5758487a9
commit fef8cc4ea6

# Workshop Auto-Download — Design
## Problem
When a user adds workshop items to an overlay (`POST /overlays/{id}/items`), the route saves `WorkshopItem` metadata and enqueues a `build_overlay` job. The build symlinks already-cached `.vpk` files and emits `skipped: not yet downloaded` to stderr for everything else. The only thing that actually pulls bytes from Steam is the admin-only `refresh_workshop_items` job, which holds a global mutex that blocks all server starts, builds, and installs while it runs.
In practice, this means freshly-added items never appear in the overlay until an admin presses a button. That isn't workable.
## Goals
1. Newly added items get downloaded without admin action.
2. Items that authors update on Steam get re-downloaded automatically on a daily cadence.
3. Overlay owners can manually re-check / re-pull their own overlay's items.
## Non-Goals
See "Out of Scope" at the end. In particular: the `refresh_workshop_items` global mutex stays; there is no cache GC; no per-item retry inside `download_to_cache`; no update-aware server-restart prompt.
## Architecture
Three changes layered onto the existing scheduler. None introduce a new job type or new scheduler rule.
```
┌─────────────────────────────────────────────────────────────────────┐
│ User adds items │
│ POST /overlays/{id}/items │
│ ↳ fetch metadata batch (mode=add) │
│ ↳ upsert WorkshopItem rows │
│ ↳ enqueue_build_overlay ◀── already happens today │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ build_overlay job (per-overlay; not a global mutex) │
│ WorkshopBuilder.build(): │
│ 1. query overlay's items │
│ 2. for each item where cache miss / stale: ◀── NEW │
│ download_to_cache(meta) with retry+backoff │
│ stamp WorkshopItem.last_downloaded_at │
│ 3. apply symlinks (existing logic) │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Owner re-checks one overlay │
│ POST /overlays/{id}/refresh ◀── NEW │
│ ↳ fetch metadata batch for this overlay only (mode=refresh) │
│ ↳ update WorkshopItem rows │
│ ↳ enqueue_build_overlay (does the download) │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ Daily global update │
│ systemd timer → l4d2web workshop-refresh CLI ◀── NEW │
│ ↳ inserts Job(operation='refresh_workshop_items') │
│ ↳ worker picks it up; existing global-mutex rule still applies │
│ ↳ existing _run_refresh_workshop_items code unchanged │
└─────────────────────────────────────────────────────────────────────┘
```
Key invariant: **on-add downloads run inside the per-overlay `build_overlay` job, so they do not block server starts globally.** Only the daily global refresh keeps the existing global-mutex semantics.
## Component 1 — Auto-download inside `WorkshopBuilder.build`
The builder gets a new download phase between "query items" and "apply symlinks". Today's behavior (skip-uncached with stderr warning) is replaced.
### Decision logic
For each item bound to the overlay:
1. **Skip with warning** if `file_url == ""` (Steam returned `result != 1` last time we asked — delisted, private, or hidden). Emit one stderr line `workshop item {steam_id} skipped: no file_url (steam result: {last_error})`. Do **not** fail the build — these items quietly fall out of the symlink set because they never produce a cache file. An owner can investigate via the overlay detail page where `last_error` is shown.
2. Otherwise, **download** when any of:
   - `last_downloaded_at IS NULL`, or
   - cache file `{steam_id}.vpk` missing, or
   - cache file `(mtime, size)` doesn't match `(time_updated, file_size)` from the row.
3. Otherwise, leave the item alone (its cache file is current).
`steam_workshop.download_to_cache` already does the `(mtime, size)` check internally and short-circuits when the cache is current, so the builder can call it unconditionally for items in the "maybe download" set and trust the helper for idempotence.
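The decision logic above can be sketched as a small classifier. This is a minimal illustration, not the builder's actual code: `ItemRow` and `classify` are hypothetical names, and the sketch assumes `time_updated` is a Unix timestamp in whole seconds that can be compared directly against the cache file's mtime.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from pathlib import Path


@dataclass
class ItemRow:
    """Hypothetical stand-in for the WorkshopItem columns the builder reads."""
    steam_id: str
    file_url: str
    file_size: int
    time_updated: int  # assumption: Unix seconds, comparable to file mtime
    last_downloaded_at: datetime | None = None


def classify(item: ItemRow, cache_root: Path) -> str:
    """Return 'skip', 'download', or 'current' for one overlay item."""
    if item.file_url == "":
        return "skip"  # Steam cannot serve this item: warn, do not fail
    if item.last_downloaded_at is None:
        return "download"  # never fetched
    cache_file = cache_root / f"{item.steam_id}.vpk"
    try:
        st = cache_file.stat()
    except FileNotFoundError:
        return "download"  # cache file missing
    if (int(st.st_mtime), st.st_size) != (item.time_updated, item.file_size):
        return "download"  # cache file is stale
    return "current"
```

In the design as written, the builder may skip the last two checks and rely on `download_to_cache` to short-circuit; the sketch only spells out what "current" means.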
### Stamping
- On success per item: `WorkshopItem.last_downloaded_at = now()`, `last_error = ""`.
- On failure per item (after retry exhaustion): `last_error` records the final exception string; the builder raises → `last_build_status='failed'`.
### What the builder does NOT do
It does not fetch fresh Steam metadata. Metadata is the responsibility of the add route, the per-overlay refresh route, and the daily refresh job. The builder is a pure function of DB state — this keeps it cheap and predictable, and lets builds run without any outbound metadata call.
### Concurrency
Items are downloaded sequentially within one builder run. Different overlays' builds run in parallel under existing scheduler rules. When two overlays share an item and race, the existing `download_to_cache` idempotence handles it: the loser sees a fresh file and skips. If both builds stamp `last_downloaded_at`, the last write wins and either timestamp is acceptable, so the race is harmless.
### Cancellation
The builder threads `should_cancel` into `download_to_cache` (the helper already accepts it). A cancel that arrives mid-download deletes the `.partial` file, and the symlink phase doesn't run. A cancel during the inter-attempt sleep takes effect within ~250 ms (see the retry section).
### Logging
Each item's download start / finish / error emits one line. Counts are reported in the existing summary line:
```
workshop overlay 'mycollection': downloaded=3 cached=12 skipped=1 created=14 removed=1 unchanged=11 errors=0
```
`skipped` now means "Steam can't serve this item (no file_url)" instead of the old "uncached" meaning. Uncached items get downloaded.
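One way to keep the summary line stable is to render it from a single tally object. The following is a sketch only; `BuildCounts` and `summary_line` are hypothetical names, and the field order simply mirrors the example line above.

```python
from dataclasses import dataclass


@dataclass
class BuildCounts:
    """Hypothetical tally; field names mirror the keys in the summary line."""
    downloaded: int = 0
    cached: int = 0
    skipped: int = 0  # items Steam cannot serve (no file_url)
    created: int = 0
    removed: int = 0
    unchanged: int = 0
    errors: int = 0


def summary_line(overlay_name: str, counts: BuildCounts) -> str:
    """Render the one-line build summary in a fixed key order."""
    keys = ("downloaded", "cached", "skipped", "created",
            "removed", "unchanged", "errors")
    body = " ".join(f"{key}={getattr(counts, key)}" for key in keys)
    return f"workshop overlay '{overlay_name}': {body}"
```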
## Component 2 — Retry & backoff
Wraps each `download_to_cache(meta, ...)` call inside the builder.
```
attempts = 3
delays   = [1s, 2s, 4s]   # exponential; slept between attempts
for n in 1..attempts:
    try:
        download_to_cache(meta, cache_root, should_cancel=should_cancel)
        break
    except InterruptedError:            # cancellation
        raise                           # propagate immediately
    except (requests.RequestException, OSError) as exc:
        if n == attempts: raise         # final attempt: bubble up → job fails
        on_stderr(f"workshop {meta.steam_id} attempt {n}/{attempts} failed: {exc}")
        sleep_with_cancel(delays[n-1], should_cancel)
```
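A runnable version of the loop above might look like the following. It is a sketch under stated assumptions: `download_with_retry` is a hypothetical name, the download is passed in as a zero-argument callable, and the sleep function is injectable so tests can skip the backoff. Two details are worth spelling out. `InterruptedError` is a subclass of `OSError`, so the cancellation clause has to come before the `OSError` handler. `requests.RequestException` also derives from `OSError`, so catching `OSError` alone covers both. With three attempts only two sleeps ever happen, so the sketch takes two delays and derives the attempt count from them.

```python
import time
from typing import Callable, Sequence


def download_with_retry(
    download: Callable[[], None],
    on_stderr: Callable[[str], None],
    label: str,
    delays: Sequence[float] = (1.0, 2.0),
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call download() up to len(delays) + 1 times, sleeping between failures."""
    attempts = len(delays) + 1
    for n in range(1, attempts + 1):
        try:
            download()
            return
        except InterruptedError:
            # Cancellation: never retried. Must precede the OSError clause
            # because InterruptedError is itself an OSError.
            raise
        except OSError as exc:
            if n == attempts:
                raise  # final attempt: let the job fail
            on_stderr(f"workshop {label} attempt {n}/{attempts} failed: {exc}")
            sleep(delays[n - 1])
```

In the builder, `sleep` would be a cancel-aware sleep bound to the job's `should_cancel` rather than `time.sleep`.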
### Notes
- `sleep_with_cancel` is a small helper that polls `should_cancel` every ~250 ms during the sleep so a cancel does not wait out the full backoff window.
- The retry loop lives in the builder (`overlay_builders.py`), not in `steam_workshop.download_to_cache`. The downloader stays a single-shot primitive; retry policy is a caller concern. Keeps the helper testable without time-mocking.
- HTTP 4xx responses raised by `raise_for_status()` are `requests.HTTPError` (a `RequestException`), so they are retried too. That is intentional — 404 / 410 will fail three times quickly and surface; the cost of three failed attempts is negligible compared to the cost of users having to guess why a single transient blip killed the job.
- On final failure the job fails with the per-item error string and overlay `last_build_status='failed'`, matching the existing "never silently mount a partial overlay" rule.
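The cancel-aware sleep described in the notes can be sketched as below. The source only says the helper polls `should_cancel` roughly every 250 ms; raising `InterruptedError` on cancel is an assumption made here so that the retry loop's existing cancellation path handles it, and the `poll_interval` parameter is hypothetical.

```python
import time
from typing import Callable


def sleep_with_cancel(
    seconds: float,
    should_cancel: Callable[[], bool],
    poll_interval: float = 0.25,
) -> None:
    """Sleep for up to `seconds`, checking should_cancel between short naps."""
    deadline = time.monotonic() + seconds
    while True:
        if should_cancel():
            # Assumption: signal cancellation the same way the downloader does.
            raise InterruptedError("cancelled during backoff")
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return
        time.sleep(min(poll_interval, remaining))
```

Using `time.monotonic()` for the deadline keeps the total sleep correct even if the wall clock is adjusted mid-backoff.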
## Component 3 — Per-overlay refresh
New route `POST /overlays/{id}/refresh`. Mirrors the add route's metadata-fetch path but scoped to the items already in this overlay.
### Route sketch
```python
@bp.post("/overlays/<int:overlay_id>/refresh")
@require_login
def refresh_overlay(overlay_id: int) -> Response:
    user = current_user()
    with session_scope() as db:
        overlay, err = _check_workshop_overlay_access(overlay_id, user, db)
        if err is not None: return err
        steam_ids = db.scalars(
            select(WorkshopItem.steam_id)
            .join(OverlayWorkshopItem, OverlayWorkshopItem.workshop_item_id == WorkshopItem.id)
            .where(OverlayWorkshopItem.overlay_id == overlay_id)
        ).all()
    if not steam_ids:
        return Response("overlay has no items", status=400)
    try:
        metas = steam_workshop.fetch_metadata_batch(steam_ids, mode="refresh")
    except Exception as exc:
        return Response(f"steam api error: {exc}", status=502)
    with session_scope() as db:
        overlay, err = _check_workshop_overlay_access(overlay_id, user, db)
        if err is not None: return err
        metas_by_id = {m.steam_id: m for m in metas}
        for steam_id in steam_ids:
            wi = db.scalar(select(WorkshopItem).where(WorkshopItem.steam_id == steam_id))
            meta = metas_by_id.get(steam_id)
            if wi is None: continue
            if meta is None:
                wi.last_error = "steam returned no entry for this item"
                continue
            wi.title = meta.title
            wi.filename = meta.filename
            wi.file_url = meta.file_url
            wi.file_size = meta.file_size
            wi.time_updated = meta.time_updated
            wi.preview_url = meta.preview_url
            wi.last_error = "" if meta.result == 1 else f"steam result {meta.result}"
        job = enqueue_build_overlay(db, overlay_id=overlay_id, user_id=user.id)
        job_id = job.id
    return redirect(f"/jobs/{job_id}")
```
### Behavior notes
- Permission: same `_check_workshop_overlay_access` used by add/remove — owner or admin.
- `mode="refresh"` (not `"add"`): non-L4D2 items silently drop from the batch instead of raising. An item whose `consumer_app_id` somehow changed after add will not break refresh.
- The metadata write does **not** stamp `last_downloaded_at`. That field stays bound to actual file presence — the builder's download phase stamps it after the bytes land. A refresh that finds `time_updated` advanced therefore leaves `last_downloaded_at` pointing at the prior version; the `(mtime, size)` check in `download_to_cache` sees the mismatch and the builder re-downloads. Correct by construction.
- One Steam metadata POST per click, owner-gated. No new rate-limit concern.
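The metadata write in the notes above can be isolated as a small function, which makes the "refresh never stamps `last_downloaded_at`" rule easy to test without a database. This is a sketch: `Meta`, `Row`, and `apply_metadata` are hypothetical stand-ins for the batch entry, the `WorkshopItem` row, and the loop body in the route.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Meta:
    """Hypothetical shape of one fetch_metadata_batch entry."""
    steam_id: str
    result: int
    title: str = ""
    filename: str = ""
    file_url: str = ""
    file_size: int = 0
    time_updated: int = 0
    preview_url: str = ""


@dataclass
class Row:
    """Hypothetical stand-in for a WorkshopItem row."""
    steam_id: str
    title: str = ""
    filename: str = ""
    file_url: str = ""
    file_size: int = 0
    time_updated: int = 0
    preview_url: str = ""
    last_error: str = ""
    last_downloaded_at: Optional[datetime] = None


COPIED_FIELDS = ("title", "filename", "file_url", "file_size",
                 "time_updated", "preview_url")


def apply_metadata(row: Row, meta: Optional[Meta]) -> None:
    """Copy fresh Steam metadata onto the row; never touch last_downloaded_at."""
    if meta is None:
        row.last_error = "steam returned no entry for this item"
        return
    for name in COPIED_FIELDS:
        setattr(row, name, getattr(meta, name))
    row.last_error = "" if meta.result == 1 else f"steam result {meta.result}"
```

After a refresh that advances `time_updated`, the row's `last_downloaded_at` still describes the previous file, which is what lets the next build detect the stale cache.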
### UI
A "Refresh" button next to "Add items" on the overlay detail page (workshop type only). Submits the POST; redirects to the job page like everything else.
## Component 4 — Periodic global refresh (CLI + systemd timer)
The existing `_run_refresh_workshop_items` job is complete and correct — it fetches all metadata, downloads what advanced, re-enqueues `build_overlay` for affected overlays. We only need a way to enqueue it on a schedule.
### CLI subcommand
In `l4d2web/cli.py`:
```python
@cli.command("workshop-refresh")
def workshop_refresh() -> None:
    """Enqueue a global workshop refresh job. Idempotent: if one is already
    queued or running, prints its id and exits 0."""
    with session_scope() as db:
        existing = db.scalar(
            select(Job).where(
                Job.operation == "refresh_workshop_items",
                Job.state.in_(("queued", "running", "cancelling")),
            ).order_by(Job.id.desc()).limit(1)
        )
        if existing is not None:
            click.echo(f"refresh_workshop_items job {existing.id} already {existing.state}")
            return
        job = Job(
            user_id=None,
            server_id=None,
            operation="refresh_workshop_items",
            state="queued",
        )
        db.add(job)
        db.flush()
        click.echo(f"enqueued refresh_workshop_items job {job.id}")
```
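The idempotence check can be exercised in isolation with the standard-library `sqlite3` module. The sketch below is illustrative only: the `jobs` table name, its columns, and the `enqueue_refresh` helper are assumptions, not the project's schema. It also assumes `user_id` is nullable, which the schema follow-up still has to confirm.

```python
from __future__ import annotations

import sqlite3

ACTIVE_STATES = ("queued", "running", "cancelling")


def create_schema(conn: sqlite3.Connection) -> None:
    """Hypothetical minimal jobs table; user_id is deliberately nullable."""
    conn.execute(
        "CREATE TABLE jobs ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, server_id INTEGER, "
        "operation TEXT NOT NULL, state TEXT NOT NULL)"
    )


def enqueue_refresh(conn: sqlite3.Connection) -> tuple[int, bool]:
    """Return (job_id, created); reuse an active refresh job if one exists."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE operation = ? AND state IN (?, ?, ?) "
        "ORDER BY id DESC LIMIT 1",
        ("refresh_workshop_items", *ACTIVE_STATES),
    ).fetchone()
    if row is not None:
        return row[0], False
    cur = conn.execute(
        "INSERT INTO jobs (user_id, server_id, operation, state) "
        "VALUES (NULL, NULL, ?, 'queued')",
        ("refresh_workshop_items",),
    )
    conn.commit()
    return cur.lastrowid, True
```

The check and the insert are two separate statements, so two invocations at the same instant could both insert. With a single daily timer that window is very small, and a duplicate refresh job would only repeat work.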
### Schema follow-up
`Job.user_id = None` for system-enqueued refreshes. The implementation plan must verify whether the column is currently nullable; if it is `NOT NULL`, the plan either (a) relaxes it to nullable (preferred — "system" is a real category) or (b) records the lowest-id admin user as the actor. The design assumes (a).
### systemd units in `deploy/`
```ini
# left4me-workshop-refresh.service
[Unit]
Description=Left4me — enqueue daily workshop refresh
After=network-online.target left4me-web.service
Requires=left4me-web.service
[Service]
Type=oneshot
User=left4me
ExecStart=/opt/left4me/bin/l4d2web workshop-refresh
```
```ini
# left4me-workshop-refresh.timer
[Unit]
Description=Left4me — daily workshop refresh
[Timer]
OnCalendar=*-*-* 04:00:00
Persistent=true
RandomizedDelaySec=15min
[Install]
WantedBy=timers.target
```
### Operator notes
- The timer enqueues; the worker decides when to actually run. The existing scheduler will defer the refresh if a server start, install, or build is in progress. Worst case the refresh starts after the conflicting job finishes — the intended behavior.
- `Persistent=true` handles "host was down at 04:00" — the unit runs on next boot. The CLI's idempotence check prevents pile-up if it fires twice.
- Deployment wires this into the existing `deploy/` install flow (in scope for the implementation plan).
## Testing
Layered against the existing test files. No new test infrastructure.
### `tests/test_overlay_builders.py` — bulk of new coverage
- `test_workshop_build_downloads_uncached_items` — item with `last_downloaded_at=None` and no cache file → patched `download_to_cache` is called → file appears → symlink created → `last_downloaded_at` stamped.
- `test_workshop_build_skips_already_cached_items` — item with cache file matching `(time_updated, size)` → `download_to_cache` returns immediately (its existing idempotence) → no network → symlink created.
- `test_workshop_build_redownloads_stale_cache` — cache file exists but `(mtime, size)` mismatches the DB row → re-download happens.
- `test_workshop_build_retry_succeeds` — patched downloader fails twice then succeeds → builder finishes ok, retry messages on stderr, `last_downloaded_at` stamped. Backoff sleep monkey-patched to zero for speed.
- `test_workshop_build_retry_exhausted_fails_job` — downloader fails all three attempts → builder raises → `last_build_status='failed'`, `last_error` populated on the WorkshopItem.
- `test_workshop_build_cancellation_during_download` — `should_cancel` flips true mid-download → builder returns early, `.partial` cleaned up by `download_to_cache`, symlink phase did not run.
- `test_workshop_build_cancellation_during_backoff` — cancel flips true while sleeping between retries → wakes up within ~250 ms of the cancel.
- `test_workshop_build_skips_items_with_no_file_url` — item with `file_url=""` and `last_error="steam result 9"` → builder writes one stderr line, does NOT call `download_to_cache`, build succeeds with `last_build_status='ok'`, item is absent from the symlink set.
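For the retry tests, the patched downloader could be a small callable test double along these lines. It is a sketch: `FlakyDownloader` is a hypothetical name, and the call signature only approximates `download_to_cache(meta, cache_root, should_cancel=...)` as described in this document.

```python
from pathlib import Path
from typing import Callable, Optional


class FlakyDownloader:
    """Test double: fail the first `failures` calls, then write the cache file."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(
        self,
        meta,  # any object with a .steam_id attribute
        cache_root: Path,
        should_cancel: Optional[Callable[[], bool]] = None,
    ) -> Path:
        self.calls += 1
        if self.calls <= self.failures:
            raise OSError(f"simulated network failure #{self.calls}")
        target = Path(cache_root) / f"{meta.steam_id}.vpk"
        target.write_bytes(b"vpk")  # placeholder payload
        return target
```

`FlakyDownloader(2)` covers the "fails twice then succeeds" case, and a `failures` value of three or more covers retry exhaustion.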
### `tests/test_workshop_routes.py` — new per-overlay refresh route
- `test_overlay_refresh_owner_allowed` — owner POST → `fetch_metadata_batch` called with exactly that overlay's steam_ids → WorkshopItem rows updated → `build_overlay` enqueued → 302 to /jobs/{id}.
- `test_overlay_refresh_other_user_forbidden` — non-owner non-admin → 403.
- `test_overlay_refresh_admin_can_refresh_any` — admin POST on someone else's overlay → succeeds (302 to the job page, or 200 if the test client follows the redirect).
- `test_overlay_refresh_steam_api_error_502` — `fetch_metadata_batch` raises → response is 502, no job enqueued.
- `test_overlay_refresh_empty_overlay_400` — overlay has no items → 400, no Steam call.
- `test_overlay_refresh_drops_missing_items_gracefully` — Steam returns nothing for one ID → that row gets `last_error="steam returned no entry…"`, build still enqueued.
### `tests/test_cli.py` — new CLI subcommand
- `test_workshop_refresh_enqueues_job` — CLI invocation inserts a queued `Job(operation='refresh_workshop_items')` and prints its id.
- `test_workshop_refresh_idempotent_when_queued` — pre-existing queued/running refresh job → second invocation prints the existing id and does not insert a duplicate.
### `tests/test_job_worker.py`
No new tests. Scheduler rules and `_run_refresh_workshop_items` are unchanged. Existing coverage holds.
### Out of test scope
The systemd timer. Validating it requires a host; smoke-test it on the dev host post-deploy.
## Out of Scope
- **Replacing the global mutex on `refresh_workshop_items`.** Daily refresh still blocks server starts/builds during its run. Scheduled at 04:00 with `Persistent=true`; revisit only if it observably hurts.
- **Per-item retry policy in `download_to_cache`.** Retry stays in the builder.
- **Cache GC.** Cache still grows monotonically — same as the v1 spec.
- **Steam API rate-limit handling for the metadata endpoint.** No backoff for metadata calls. Retries apply only to per-item file downloads.
- **Update-aware server restart UX.** When the daily refresh re-downloads an item mounted by a running server, the running server keeps its old mount. Notifying the user / offering a "restart to pick up updates" prompt stays in the backlog.
- **Per-overlay refresh on non-workshop overlay types.** Only workshop overlays get the Refresh button.
## Affected Files
Implementation will touch roughly:
- `l4d2web/services/overlay_builders.py` — WorkshopBuilder download phase, retry helper.
- `l4d2web/routes/workshop_routes.py` — new `/overlays/{id}/refresh` route.
- `l4d2web/templates/...` — Refresh button on overlay detail page.
- `l4d2web/cli.py` — new `workshop-refresh` subcommand.
- `l4d2web/models.py` and `alembic/versions/...` — possibly relax `Job.user_id` to nullable (TBD per schema check).
- `deploy/` — systemd `.service` + `.timer` units, wired into the install flow.
- `l4d2web/tests/test_overlay_builders.py`, `test_workshop_routes.py`, `test_cli.py` — new test cases per the testing section.
The implementation plan will turn these into ordered steps with explicit checkpoints.