left4me/docs/superpowers/specs/2026-05-12-server-live-state-display-design.md
mwiegand e25e7098f6
refactor(live-state): drop redundant ix_sps_server_recent index
The two indexes ix_sps_server_open and ix_sps_server_recent were
byte-identical because SQLAlchemy's Index(name, *cols) form drops the
DESC ordering the spec intended. Rather than reach for text("left_at
DESC"), drop the second index entirely — SQLite scans the ASC index
backwards at no measurable cost. Spec and plan updated to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 21:27:01 +02:00

# Server live-state display (counts, map, roster, avatars, history)
## Context
The l4d2web UI currently shows systemd lifecycle state per game server (running/stopped/unknown) but nothing about what's happening *inside* the game: player count, current map, whether the server is hibernating, who is connected. To know any of that, users have to context-switch (open the game, query externally).
The goal is a **read-side live-state display**: counts + map + hibernating on the server list, plus a server-detail panel showing the current player roster (avatars + names) and a "recent players" section for who's been on lately. Backed by a persistent history table so we get count-over-time graphs and player-presence history (foundation for future ban UX) for free.
**Source: RCON exclusively.** A2S_INFO (UDP, anonymous) was investigated and discarded — it can't deliver Steam IDs, hibernating flag, or interactive commands, so anything beyond raw counts re-routes through RCON anyway. Both transports were verified working against prod `left4.me`. Going RCON-only means one transport, one set of tests, no throwaway scaffolding.
**Avatars: Steam Web API.** RCON gives Steam IDs; `ISteamUser/GetPlayerSummaries` resolves them to persona names + avatar URLs hot-linked from Steam's CDN. API key already obtained.
**Commands are deferred** to a separate plan. This plan is read-only.
---
## Architecture
```
                      ┌─────────────────────────────┐
                      │  left4me-web (Flask)        │
┌──────────────┐ RCON │  ┌───────────────────────┐  │
│ srcds 27016  │◄─────┼──┤ live-state poller     │  │
└──────────────┘ TCP  │  │ (daemon thread)       │  │
                      │  └───────┬───────────────┘  │
┌──────────────┐ RCON │          │ writes           │
│ srcds 27021  │◄─────┤          ▼                  │
└──────────────┘      │  ┌───────────────────────┐  │
                      │  │ server_live_state     │  │
   Steam Web API      │  │ server_player_session │  │
  ┌────────────┐      │  │ steam_user_profile    │  │
  │ Steam CDN  │◄─────┼──┤                       │  │
  │ avatars... │      │  └───────┬───────────────┘  │
  └────────────┘      │          │ reads            │
        ▲             │          ▼                  │
        │             │  ┌───────────────────────┐  │
        └─────────────┼──┤ /servers, /servers/N  │  │
      <img src=...>   │  │ (HTMX 5s refresh)     │  │
                      │  └───────────────────────┘  │
                      └─────────────────────────────┘
```
Single daemon thread (modeled on the existing `start_state_poller` in `l4d2web/services/job_worker.py:617-647`), inside the Flask process, polls every `LIVE_STATE_POLL_SECONDS` (default 5). Per poll, per running server with a configured RCON password:
1. TCP connect to `127.0.0.1:<port>`, auth, send `status`, parse response.
2. Compare server-level state (players/map/hibernating/etc.) to the latest `server_live_state` row for this server. If unchanged, bump `last_seen_at`. If changed, insert a new row.
3. Reconcile open sessions (`server_player_session` rows where `left_at IS NULL`) with the current `status` roster: open new sessions for new players (backfilling `joined_at` from RCON's `connected` field), close sessions for players no longer present, update `min_ping`/`max_ping` for continuing sessions.
4. Collect Steam IDs that are missing from `steam_user_profile` or have `fetched_at` older than 24h; batch them into a single `GetPlayerSummaries` call; upsert results.
5. Trim `server_live_state` and closed sessions older than retention.
---
## Schema (one new alembic migration)
### New column: `servers.rcon_password`
```python
rcon_password: Mapped[str] = mapped_column(
String(64), nullable=False, default="", server_default=""
)
```
Empty string = "no password configured yet" (poller skips). Migration backfills every existing row with `secrets.token_urlsafe(32)` (~43 chars, URL-safe character set so the literal `"..."` cfg-quoting needs no escaping).
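As a minimal sketch of the password format described above (the helper name `new_rcon_password` is hypothetical and not taken from the codebase), the following shows why the generated value fits the `String(64)` column and can be double-quoted in a cfg line without escaping:

```python
import secrets
import string

# Characters that secrets.token_urlsafe can emit (unpadded base64url).
URLSAFE_ALPHABET = set(string.ascii_letters + string.digits + "-_")


def new_rcon_password() -> str:
    """Hypothetical helper: 32 random bytes, base64url-encoded without padding."""
    return secrets.token_urlsafe(32)


password = new_rcon_password()
# The value contains no quotes, spaces or backslashes, so it can be wrapped
# in double quotes as-is.
cfg_line = f'rcon_password "{password}"'
print(len(password), cfg_line)
```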
### `server_live_state` — run-length-encoded snapshots
```sql
CREATE TABLE server_live_state (
id INTEGER PRIMARY KEY AUTOINCREMENT,
server_id INTEGER NOT NULL REFERENCES servers(id) ON DELETE CASCADE,
started_at DATETIME NOT NULL, -- when this exact state first appeared
last_seen_at DATETIME NOT NULL, -- most recent poll where it still held
players INTEGER NOT NULL,
max_players INTEGER NOT NULL,
bots INTEGER NOT NULL,
map VARCHAR(64) NOT NULL,
hibernating BOOLEAN NOT NULL
);
CREATE INDEX ix_sls_server_started ON server_live_state(server_id, started_at DESC);
```
- "State" = the tuple `(players, max_players, bots, map, hibernating)`. Ping/loss are deliberately not stored at server-level, so they don't churn rows.
- Idle hibernating server collapses from one-row-per-poll to one-row-per-state-change (≈17,280× compression for a 24h-idle server).
- Latest snapshot for a server: `ORDER BY started_at DESC LIMIT 1`. UI staleness check: `last_seen_at > now - LIVE_STATE_STALE_SECONDS` (default 30).
- Retention: trim rows where `last_seen_at < now - LIVE_STATE_HISTORY_DAYS` (default 30).
- Failed polls produce no DB write; the staleness check on `last_seen_at` handles UI degradation cleanly.
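The run-length encoding described above can be sketched with the stdlib `sqlite3` module. This is an illustration under stated assumptions, not the planned implementation: the demo table omits the foreign key, timestamps are plain text, and `record_state` is a hypothetical name.

```python
import sqlite3

# Simplified demo table: same columns as server_live_state, minus the FK.
DDL = """
CREATE TABLE server_live_state (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    server_id INTEGER NOT NULL,
    started_at TEXT NOT NULL,
    last_seen_at TEXT NOT NULL,
    players INTEGER NOT NULL,
    max_players INTEGER NOT NULL,
    bots INTEGER NOT NULL,
    map TEXT NOT NULL,
    hibernating INTEGER NOT NULL
)
"""


def record_state(conn, server_id, state, now):
    """Bump last_seen_at if `state` equals the newest row, else insert a new row.

    `state` is the tuple (players, max_players, bots, map, hibernating).
    """
    row = conn.execute(
        "SELECT id, players, max_players, bots, map, hibernating "
        "FROM server_live_state WHERE server_id = ? "
        "ORDER BY started_at DESC LIMIT 1",
        (server_id,),
    ).fetchone()
    if row is not None and tuple(row[1:]) == tuple(state):
        conn.execute(
            "UPDATE server_live_state SET last_seen_at = ? WHERE id = ?",
            (now, row[0]),
        )
        return "bumped"
    conn.execute(
        "INSERT INTO server_live_state (server_id, started_at, last_seen_at, "
        "players, max_players, bots, map, hibernating) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (server_id, now, now, *state),
    )
    return "inserted"


conn = sqlite3.connect(":memory:")
conn.execute(DDL)
idle = (0, 4, 0, "c1m1_hotel", True)
results = [
    record_state(conn, 1, idle, "2026-05-12 21:00:00"),
    record_state(conn, 1, idle, "2026-05-12 21:00:05"),  # same state: bump only
    record_state(conn, 1, (1, 4, 0, "c1m1_hotel", False), "2026-05-12 21:00:10"),
]
```

Two identical polls leave a single row whose `last_seen_at` advances; only the third poll, where a player has joined, adds a second row.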
### `server_player_session` — interval per connection
```sql
CREATE TABLE server_player_session (
id INTEGER PRIMARY KEY AUTOINCREMENT,
server_id INTEGER NOT NULL REFERENCES servers(id) ON DELETE CASCADE,
steam_id_64 VARCHAR(20) NOT NULL,
joined_at DATETIME NOT NULL,
left_at DATETIME NULL, -- NULL = currently in-game
name_at_join VARCHAR(64) NOT NULL,
min_ping INTEGER NOT NULL,
max_ping INTEGER NOT NULL
);
CREATE INDEX ix_sps_server_open ON server_player_session(server_id, left_at);
CREATE INDEX ix_sps_steam_history ON server_player_session(steam_id_64, joined_at);
```
- `joined_at` is **backfilled from RCON's `connected` duration** on first sighting (`joined_at = now - connected_seconds`). This heals brief polling gaps and survives web restarts: even if we just started polling, we know when the still-connected players actually joined.
- A player who disconnects and rejoins gets two rows, not one merged interval.
- Bots are excluded — rows with a non-`STEAM_X:Y:Z` uniqueid are skipped.
- `min_ping`/`max_ping` updated only when a new poll pushes the range, to avoid noise writes.
- On poller startup, close any open sessions that the current RCON output does not confirm, i.e. the player is no longer in that server's roster. Also close open sessions once their server has had no successful poll for `STUCK_SESSION_SECONDS` (default 60, about 12 consecutive failed polls at the 5s interval).
- Retention: trim closed sessions where `left_at < now - LIVE_STATE_HISTORY_DAYS` (default 30). Open sessions are never trimmed.
### `steam_user_profile` — cached profile data (24h TTL)
```sql
CREATE TABLE steam_user_profile (
steam_id_64 VARCHAR(20) PRIMARY KEY,
persona_name VARCHAR(64) NOT NULL,
avatar_url TEXT NOT NULL, -- avatarmedium from Steam Web API
fetched_at DATETIME NOT NULL
);
```
- Cache is global, not per-server (one profile per Steam ID).
- Refreshed when `fetched_at < now - 24h` or when entry is missing.
- Soft-fail: if the Steam API key is unset, the API is down, or a profile is private, we just leave the cache as-is and the UI falls back to `name_at_join` + placeholder avatar.
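The refresh rule above (missing entry, or `fetched_at` older than 24h) reduces to a small pure function. A sketch, assuming the cache has already been loaded into a dict keyed by Steam ID; `ids_needing_refresh` is a hypothetical name:

```python
from datetime import datetime, timedelta, timezone

PROFILE_TTL = timedelta(hours=24)


def ids_needing_refresh(roster_ids, fetched_at_by_id, now, ttl=PROFILE_TTL):
    """Return roster IDs whose cached profile is missing or older than `ttl`."""
    cutoff = now - ttl
    stale = []
    for steam_id in roster_ids:
        fetched_at = fetched_at_by_id.get(steam_id)
        if fetched_at is None or fetched_at < cutoff:
            stale.append(steam_id)
    return stale


now = datetime(2026, 5, 12, 21, 0, tzinfo=timezone.utc)
cache = {
    "76561197960287930": now - timedelta(hours=1),   # fresh
    "76561197960265730": now - timedelta(hours=25),  # expired
}
roster = ["76561197960287930", "76561197960265730", "76561197960265731"]
to_fetch = ids_needing_refresh(roster, cache, now)
```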
### Bind-rendered queries
**Current players on server X:**
```sql
SELECT sp.steam_id_64, sp.joined_at, sp.name_at_join,
sp.min_ping, sp.max_ping,
p.persona_name, p.avatar_url
FROM server_player_session sp
LEFT JOIN steam_user_profile p USING (steam_id_64)
WHERE sp.server_id = ? AND sp.left_at IS NULL
ORDER BY sp.joined_at;
```
**Recent players on server X (last 30 days, excluding currently in-game):**
```sql
SELECT sp.steam_id_64, MAX(sp.left_at) AS last_seen,
p.persona_name, p.avatar_url
FROM server_player_session sp
LEFT JOIN steam_user_profile p USING (steam_id_64)
WHERE sp.server_id = ?
AND sp.left_at IS NOT NULL
AND sp.left_at > datetime('now', '-30 days')
AND sp.steam_id_64 NOT IN (
SELECT steam_id_64 FROM server_player_session
WHERE server_id = ? AND left_at IS NULL
)
GROUP BY sp.steam_id_64, p.persona_name, p.avatar_url
ORDER BY last_seen DESC
LIMIT 20;
```
---
## Modules
### `l4d2web/services/rcon.py` (new)
Pure stdlib (`socket`, `struct`), no new dependency. Source RCON protocol:
```python
@dataclass(slots=True, frozen=True)
class PlayerRow:
steam_id_64: str # converted from STEAM_X:Y:Z
name: str
connected_seconds: int
ping: int
@dataclass(slots=True, frozen=True)
class StatusResponse:
map: str
players: int # humans
max_players: int
bots: int
hibernating: bool
roster: list[PlayerRow]
class RconError(Exception): ...
class RconAuthError(RconError): ...
def query_status(host: str, port: int, password: str, *, timeout: float = 2.0) -> StatusResponse: ...
```
Implementation notes:
- Auth handshake quirk verified live: server sends a `type=0` empty-body packet **before** the `type=2` auth response. Consume both. `req_id == -1` on the auth response = bad password.
- Single TCP connection per query (loopback, ~10-20ms total round-trip — pooling not worth it at this scale).
- Header regex on `map :` and `players :` lines (the `(hibernating|not hibernating)` token is in `players :`).
- Roster regex: split lines starting with `#`, skip the column-header line, robustly extract the quoted name + the `STEAM_X:Y:Z` token + `MM:SS` or `HH:MM:SS` connected duration + ping. Tolerate the two-numeric-prefix L4D2 variant (`# 2 1 "Crone" STEAM_1:0:...`).
- Steam ID conversion: `STEAM_X:Y:Z` → `76561197960265728 + (Z * 2) + Y` (returned as string). `Z` is the account number and `Y` is the low bit (0 or 1).
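A few of these notes can be sketched in isolation. The block below is a hedged illustration, not the planned `rcon.py`: it covers the Source RCON packet framing (little-endian size, request id, type, body, two NUL terminators), the Steam ID conversion in its standard form (`Z` is the account number, `Y` the low bit), and the `MM:SS` / `HH:MM:SS` duration. All function names are hypothetical.

```python
import re
import struct

SERVERDATA_AUTH = 3         # client -> server: password
SERVERDATA_EXECCOMMAND = 2  # client -> server: command (2 is also the auth response type)
STEAM64_BASE = 76561197960265728
_STEAM2_RE = re.compile(r"^STEAM_[0-5]:([01]):(\d+)$")


def encode_packet(req_id: int, ptype: int, body: str) -> bytes:
    """Frame one RCON packet: int32 size, int32 id, int32 type, body, 2x NUL."""
    payload = struct.pack("<ii", req_id, ptype) + body.encode("utf-8") + b"\x00\x00"
    return struct.pack("<i", len(payload)) + payload


def decode_packet(data: bytes) -> tuple[int, int, str]:
    """Decode a single complete packet into (req_id, type, body)."""
    (size,) = struct.unpack_from("<i", data, 0)
    req_id, ptype = struct.unpack_from("<ii", data, 4)
    body = data[12:4 + size - 2].decode("utf-8", errors="replace")
    return req_id, ptype, body


def steam2_to_steam64(steam2: str) -> str:
    """Convert STEAM_X:Y:Z to the 64-bit ID as a string."""
    match = _STEAM2_RE.match(steam2)
    if match is None:
        raise ValueError(f"not a STEAM_X:Y:Z id: {steam2!r}")
    y, z = int(match.group(1)), int(match.group(2))
    return str(STEAM64_BASE + z * 2 + y)


def parse_connected(text: str) -> int:
    """Convert 'MM:SS' or 'HH:MM:SS' to seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


auth = encode_packet(1, SERVERDATA_AUTH, "secret")
print(steam2_to_steam64("STEAM_0:0:11101"), parse_connected("01:02:03"))
```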
### `l4d2web/services/steam_users.py` (new)
Modeled on `l4d2web/services/steam_workshop.py:17-43` (single `requests.Session`, 30s timeout). Unlike the workshop client's anonymous form-encoded POST, `GetPlayerSummaries` is a GET request that carries `key=` and `steamids=` in the query string.
```python
@dataclass(slots=True, frozen=True)
class SteamProfile:
steam_id_64: str
persona_name: str
avatar_url: str # avatarmedium
def fetch_profiles_batch(steam_ids: Iterable[str], *, api_key: str) -> list[SteamProfile]: ...
```
- Endpoint: `GET https://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0002/?key=<key>&steamids=<csv>`.
- Up to 100 IDs per call; caller batches.
- Returns only successful resolutions (private/deleted accounts simply absent from the response — fine, they stay uncached and the UI falls back).
- Raises on transport errors; caller decides whether to surface.
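The batching and response handling can be sketched without any network access. The helper names below are hypothetical, and the sample payload is made-up illustration data that follows the documented `response.players[]` shape of `GetPlayerSummaries` (`steamid`, `personaname`, `avatarmedium`):

```python
from itertools import islice

MAX_IDS_PER_CALL = 100


def chunked(ids, size=MAX_IDS_PER_CALL):
    """Yield lists of at most `size` IDs."""
    it = iter(ids)
    while batch := list(islice(it, size)):
        yield batch


def build_params(batch, api_key):
    """Query-string parameters for one GetPlayerSummaries call."""
    return {"key": api_key, "steamids": ",".join(batch)}


def parse_summaries(payload):
    """Extract (steam_id_64, persona_name, avatar_url) for every returned player."""
    players = payload.get("response", {}).get("players", [])
    return [
        (p["steamid"], p.get("personaname", ""), p.get("avatarmedium", ""))
        for p in players
    ]


ids = [str(76561197960265728 + n) for n in range(250)]
batches = list(chunked(ids))
params = build_params(batches[0][:2], api_key="EXAMPLE_KEY")

# Made-up response: only one of the requested IDs was resolved.
sample = {"response": {"players": [
    {"steamid": ids[0], "personaname": "Crone", "avatarmedium": "https://example.invalid/a.jpg"},
]}}
profiles = parse_summaries(sample)
```

IDs absent from the response simply do not appear in `profiles`, which matches the soft-fail behaviour described above.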
### `l4d2web/services/live_state_poller.py` (new)
Modeled on `start_state_poller` / `state_poller_loop` in `l4d2web/services/job_worker.py:617-647`.
```python
def start_live_state_poller(app) -> None: ... # spawns daemon thread, skipped under TESTING
def live_state_poller_loop(app, interval: float) -> None: ...
def poll_once() -> None: # one full pass over running servers
...
```
Per-server algorithm:
1. RCON `status` → `StatusResponse` (or skip on auth/timeout, logged via `app.logger`).
2. **Server-level RLE upsert**: load newest `server_live_state` row for this server. If `(players, max_players, bots, map, hibernating)` matches → `UPDATE last_seen_at = now()`. Else → `INSERT` new row.
3. **Session reconciliation** in a single transaction:
- Load open sessions for this server.
- For each player in `response.roster` not in open sessions: `INSERT` new session with `joined_at = now - connected_seconds`, `name_at_join = roster.name`, `min_ping = max_ping = roster.ping`.
- For each open session whose player is in the roster: if `roster.ping < min_ping` or `> max_ping`, `UPDATE` the range. Otherwise skip the write.
- For each open session whose player is *not* in the roster: `UPDATE left_at = now()`.
4. **Profile enrichment**: collect Steam IDs from the roster where the cached profile is missing or `fetched_at < now - 24h`. Skip if `STEAM_WEB_API_KEY` unset. Batch into one Steam API call. Upsert results.
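Step 3 is easiest to test when the diff is computed separately from the writes. The following is a sketch under that assumption; `RosterEntry`, `OpenSession` and `reconcile` are hypothetical names, and the caller would apply the three result lists inside one transaction:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RosterEntry:
    steam_id_64: str
    name: str
    connected_seconds: int
    ping: int


@dataclass(frozen=True)
class OpenSession:
    steam_id_64: str
    min_ping: int
    max_ping: int


def reconcile(open_sessions, roster, now):
    """Return (to_open, to_update, to_close) without touching the database."""
    open_by_id = {s.steam_id_64: s for s in open_sessions}
    roster_by_id = {p.steam_id_64: p for p in roster}

    # New players: backfill joined_at from the RCON connected duration.
    to_open = [
        {
            "steam_id_64": p.steam_id_64,
            "joined_at": now - timedelta(seconds=p.connected_seconds),
            "name_at_join": p.name,
            "min_ping": p.ping,
            "max_ping": p.ping,
        }
        for p in roster
        if p.steam_id_64 not in open_by_id
    ]

    # Continuing players: write only when the ping range actually widens.
    to_update = []
    for sid, session in open_by_id.items():
        p = roster_by_id.get(sid)
        if p is not None and (p.ping < session.min_ping or p.ping > session.max_ping):
            to_update.append((sid, min(session.min_ping, p.ping), max(session.max_ping, p.ping)))

    # Players no longer present: their sessions get left_at = now.
    to_close = [sid for sid in open_by_id if sid not in roster_by_id]
    return to_open, to_update, to_close


now = datetime(2026, 5, 12, 21, 0, tzinfo=timezone.utc)
open_sessions = [OpenSession("A", 40, 60), OpenSession("B", 30, 30), OpenSession("C", 50, 55)]
roster = [
    RosterEntry("A", "alice", 600, 45),  # inside 40-60: no write
    RosterEntry("B", "bob", 300, 80),    # widens the range to 30-80
    RosterEntry("D", "dave", 90, 25),    # new session
]
to_open, to_update, to_close = reconcile(open_sessions, roster, now)
```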
Periodic (every Nth cycle, e.g. once a minute):
- Trim `server_live_state` and closed sessions past retention.
- Close any open sessions whose `server_id` hasn't had a successful RCON response in the last `STUCK_SESSION_SECONDS` (default 60).
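The periodic work above can be folded into the poll loop with a cycle counter. A rough sketch, assuming hypothetical `poll_once` and `trim` callables and a `threading.Event` for shutdown (the event and `trim_every` are assumptions, not part of the spec); at a 5s interval, `trim_every=12` gives roughly once a minute:

```python
import threading


def poller_loop(poll_once, trim, stop, interval, trim_every=12):
    """Call poll_once every `interval` seconds and trim every `trim_every` cycles."""
    cycle = 0
    while not stop.is_set():
        try:
            poll_once()
            cycle += 1
            if cycle % trim_every == 0:
                trim()
        except Exception as exc:  # one bad pass must not kill the daemon thread
            print(f"live-state poll failed: {exc!r}")
        stop.wait(interval)  # sleeps, but returns immediately once stop is set


# Demo: run 24 cycles with no delay, then stop.
stop = threading.Event()
counts = {"polls": 0, "trims": 0}


def fake_poll():
    counts["polls"] += 1
    if counts["polls"] == 24:
        stop.set()


def fake_trim():
    counts["trims"] += 1


poller_loop(fake_poll, fake_trim, stop, interval=0)
```

Using `stop.wait(interval)` instead of `time.sleep` is a design choice for the sketch: it lets tests and shutdown hooks end the loop without waiting out a full interval.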
### Modify: `l4d2web/services/l4d2_facade.py:28-52`
`build_server_spec_payload` **appends** `f'rcon_password "{server.rcon_password}"'` as the *last* entry in the returned `config` list, only if the password is non-empty. Appending (not prepending) matters: Source's cfg semantics are last-wins, so putting our line after both the overlay `exec` lines and the user's blueprint config guarantees no overlay or blueprint can silently clobber the password and break the poller. `l4d2host/instances.py:40-58` already writes `spec.config` lines verbatim to `server.cfg`, so **no host-side change is needed**.
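The append-last rule can be illustrated in a few lines. This is a sketch only; `with_rcon_password` is a hypothetical helper, and the real change lives inside `build_server_spec_payload`:

```python
def with_rcon_password(config_lines, rcon_password):
    """Return a copy of config_lines with the rcon_password line appended last."""
    lines = list(config_lines)
    if rcon_password:  # empty string means "not configured": emit nothing
        lines.append(f'rcon_password "{rcon_password}"')
    return lines


# A blueprint that sets its own password is overridden, because ours comes last.
config = ["exec overlay.cfg", 'rcon_password "from-blueprint"', "sv_cheats 0"]
rendered = with_rcon_password(config, "s3cr3t")
```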
### Modify: server-create route
Wherever the server-create form handler lives (`l4d2web/routes/server_routes.py` or similar — confirm during implementation): before commit, generate `rcon_password = secrets.token_urlsafe(32)`.
---
## Web UI
### Server list (template TBD: `ls l4d2web/templates/` during implementation)
Add an inline live-state cell per server row:
- Stopped server: `—`
- Stale (no row newer than `LIVE_STATE_STALE_SECONDS`): dim `?` with tooltip "no data"
- Hibernating: `0/4 · idle · c1m1_hotel`
- Active: `2/4 · c1m2_streets`
No HTMX on the list page; page reload picks up the latest snapshot.
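The four cell states above can be expressed as one formatting function. A sketch, assuming the latest snapshot is passed in as a dict (or `None` when no row exists); `live_state_badge` is a hypothetical name and the real rendering may live in the template instead:

```python
from datetime import datetime, timedelta, timezone


def live_state_badge(running, snapshot, now, stale_seconds=30):
    """Render the list-page cell text for one server."""
    if not running:
        return "\u2014"  # dash placeholder for a stopped server
    if snapshot is None or snapshot["last_seen_at"] <= now - timedelta(seconds=stale_seconds):
        return "?"  # stale: no snapshot newer than the threshold
    parts = [f'{snapshot["players"]}/{snapshot["max_players"]}']
    if snapshot["hibernating"]:
        parts.append("idle")
    parts.append(snapshot["map"])
    return " \u00b7 ".join(parts)  # middle dot separator


now = datetime(2026, 5, 12, 21, 0, tzinfo=timezone.utc)
fresh = now - timedelta(seconds=5)
idle = {"last_seen_at": fresh, "players": 0, "max_players": 4,
        "map": "c1m1_hotel", "hibernating": True}
active = {"last_seen_at": fresh, "players": 2, "max_players": 4,
          "map": "c1m2_streets", "hibernating": False}
old = dict(active, last_seen_at=now - timedelta(seconds=90))
badges = [
    live_state_badge(False, None, now),
    live_state_badge(True, old, now),
    live_state_badge(True, idle, now),
    live_state_badge(True, active, now),
]
```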
### Server detail (`l4d2web/templates/server_detail.html`)
New section, HTMX-refreshed every `LIVE_STATE_POLL_SECONDS` (default 5):
```html
<section class="panel"
hx-get="/servers/{{ server.id }}/live-state"
hx-trigger="every 5s"
hx-swap="outerHTML">
<!-- rendered from l4d2web/templates/_live_state.html -->
</section>
```
The partial renders three blocks:
1. **Summary**: `players/max_players · map · idle?` plus a small "polled Ns ago" caption.
2. **Current players** (only if non-empty): grid of cards, each `<img src="{{ profile.avatar_url or placeholder }}" /> {{ profile.persona_name or session.name_at_join }} · {{ joined_relative }} · ping {{ min }}-{{ max }}ms`.
3. **Recent players** (last 30 days, excluding current; only if non-empty): smaller cards, `{{ avatar }} {{ persona_name or name_at_join }} · last seen {{ last_seen_relative }}`.
New route: `GET /servers/<id>/live-state` returns the partial. Composition mirrors the existing build-status pattern at `l4d2web/templates/_overlay_build_status.html:1-5`.
Avatar `<img>` tags point straight at Steam CDN URLs (`avatars.cloudflare.steamstatic.com` / `avatars.akamai.steamstatic.com`). No proxying. Same approach as `WorkshopItem.preview_url`. Note: confirm the existing CSP allows these hosts; if not, extend it.
No JS framework added — HTMX only.
---
## Config keys
In `l4d2web/config.py`, plus documented defaults in `deploy/templates/etc/left4me/web.env` where applicable:
| key | default | purpose |
|---|---|---|
| `LIVE_STATE_POLL_SECONDS` | `5` | poll interval |
| `LIVE_STATE_QUERY_TIMEOUT_SECONDS` | `2.0` | per-RCON-query timeout |
| `LIVE_STATE_POLL_WORKERS` | `4` | thread-pool size for parallel per-server polls |
| `LIVE_STATE_STALE_SECONDS` | `30` | UI staleness threshold |
| `LIVE_STATE_HISTORY_DAYS` | `30` | retention for snapshots + closed sessions |
| `STUCK_SESSION_SECONDS` | `60` | close open sessions whose server has been unreachable for this long |
| `STEAM_PROFILE_TTL_SECONDS` | `86400` | profile cache TTL |
| `STEAM_WEB_API_KEY` | `""` | from `web.env`; empty disables enrichment |
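One way to read these keys with their defaults is sketched below. It assumes values arrive as strings from the environment and that each key takes the type of its default; `load_live_state_config` is a hypothetical name, and the real `l4d2web/config.py` may be structured differently.

```python
import os

DEFAULTS = {
    "LIVE_STATE_POLL_SECONDS": 5,
    "LIVE_STATE_QUERY_TIMEOUT_SECONDS": 2.0,
    "LIVE_STATE_POLL_WORKERS": 4,
    "LIVE_STATE_STALE_SECONDS": 30,
    "LIVE_STATE_HISTORY_DAYS": 30,
    "STUCK_SESSION_SECONDS": 60,
    "STEAM_PROFILE_TTL_SECONDS": 86400,
    "STEAM_WEB_API_KEY": "",
}


def load_live_state_config(env=None):
    """Read each key from `env`, coercing to the type of its default."""
    env = os.environ if env is None else env
    config = {}
    for key, default in DEFAULTS.items():
        raw = env.get(key)
        # Missing or empty values fall back to the default.
        config[key] = default if raw in (None, "") else type(default)(raw)
    return config


config = load_live_state_config({"LIVE_STATE_POLL_SECONDS": "10", "STEAM_WEB_API_KEY": "abc"})
```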
---
## Tests
- `l4d2web/tests/test_rcon.py` — protocol handshake against an in-process TCP fixture: auth-success, auth-failure (`req_id == -1`), header parse (incl. `(hibernating)` and `(reserved <token>)` variants), roster parse (incl. the two-numeric-prefix L4D2 variant), Steam ID conversion.
- `l4d2web/tests/test_steam_users.py` — request shape (key in querystring, batched ids, 100-per-call ceiling), response parsing, partial response (some IDs missing).
- `l4d2web/tests/test_live_state_poller.py` — mirror `test_state_poller_*` at `l4d2web/tests/test_job_worker.py:882-952`. Cover: iterates only running servers with non-empty `rcon_password`, RLE upsert (matching state → `last_seen_at` bump only; differing state → new row), session open with backfilled `joined_at`, session close on disappearance, ping range expansion, stuck-session close after N failures, drops auth failures silently, respects retention.
- `l4d2web/tests/test_server_routes.py` (extend) — `/servers/<id>/live-state` fragment route renders summary/current/recent blocks correctly; stale rendering when latest snapshot is old; soft-fail rendering when no profile cached.
- `l4d2web/tests/test_l4d2_facade.py` (extend) — `build_server_spec_payload` appends `rcon_password "..."` as the last config line when password is set; omits the line when empty; appears after both the overlay `exec` lines and the blueprint config lines.
- Migration test — existing rows backfilled with non-empty 43-char passwords; tables created with correct indexes.
---
## Critical files
**New:**
- `l4d2web/services/rcon.py` — Source RCON client + status parser
- `l4d2web/services/steam_users.py` — Steam Web API client (mirrors `steam_workshop.py`)
- `l4d2web/services/live_state_poller.py` — background thread + poll loop + session reconciler
- `l4d2web/alembic/versions/00XX_server_live_state.py` — migration: new column, three new tables, password backfill
- `l4d2web/templates/_live_state.html` — HTMX-refreshed fragment (summary + current + recent)
- `l4d2web/tests/test_rcon.py`, `l4d2web/tests/test_steam_users.py`, `l4d2web/tests/test_live_state_poller.py`
**Modify:**
- `l4d2web/models.py` — add `ServerLiveState`, `ServerPlayerSession`, `SteamUserProfile`; add `rcon_password` to `Server` (after line 137)
- `l4d2web/services/l4d2_facade.py:28-52`: `build_server_spec_payload` appends `rcon_password "..."` as the last config line when set
- `l4d2web/app.py` — call `start_live_state_poller(app)` next to existing `start_state_poller`
- `l4d2web/routes/server_routes.py` (or equivalent — confirm) — generate `rcon_password` in create handler; add `GET /servers/<id>/live-state`
- `l4d2web/templates/server_detail.html` — include `_live_state.html`
- `l4d2web/templates/<server-list>.html` — confirm filename; add inline badge column
- `l4d2web/config.py` — register the eight new config keys
- `deploy/templates/etc/left4me/web.env` — add `STEAM_WEB_API_KEY=` and any tunables we expose
**Reused without changes:**
- `l4d2web/services/job_worker.py:617-647` — daemon-thread / poll-loop pattern reference
- `l4d2web/services/steam_workshop.py:17-43`: `requests.Session` pattern for Steam Web API clients (the workshop client itself uses form-POST)
- `l4d2host/instances.py:40-58` — already writes `spec.config` verbatim, so no host-side change for password injection
- `l4d2web/templates/_overlay_build_status.html` — HTMX polling pattern reference
---
## Verification
1. **Unit tests**:
```
pytest l4d2web/tests/test_rcon.py l4d2web/tests/test_steam_users.py l4d2web/tests/test_live_state_poller.py -v
pytest l4d2web/tests -q # full regression
```
2. **Migration check**:
```
alembic upgrade head
sqlite3 l4d2web.db "SELECT id, name, length(rcon_password) FROM servers;" # every row ~43
sqlite3 l4d2web.db ".schema server_live_state server_player_session steam_user_profile"
```
3. **End-to-end against prod** (`left4.me`):
- Deploy. Confirm `systemctl status left4me-web.service` shows no crash-loop and the journal logs `start_live_state_poller` once.
- Restart both existing game servers so they pick up the injected password.
- SQL sanity (web-host shell):
```
sqlite3 l4d2web.db "SELECT server_id, started_at, last_seen_at, players, map, hibernating
FROM server_live_state ORDER BY server_id, started_at DESC LIMIT 10;"
```
Expect a single recent row per server while idle; new rows when players come/go.
- Connect to one server from the L4D2 client; within 5s, `/servers/<id>` shows a card with your avatar + persona name + ping range. Disconnect; within 5s the card moves to "recent."
- `sqlite3 l4d2web.db "SELECT * FROM server_player_session WHERE left_at IS NULL;"` — empty when nobody's connected; one row per current player when someone is.
- `sqlite3 l4d2web.db "SELECT count(*), MIN(fetched_at), MAX(fetched_at) FROM steam_user_profile;"` — at least one row after a player has been resolved.
4. **Failure-path checks**:
- Manually corrupt `servers.rcon_password` for one server; confirm the journal logs auth failure and the row's badge goes stale within `LIVE_STATE_STALE_SECONDS`; other servers unaffected.
- Unset `STEAM_WEB_API_KEY` in `web.env`, restart web; confirm display still works (in-game names + placeholder avatars), no errors in journal.
- `nft` drop the loopback TCP on one server's port; confirm rows stop appearing, open sessions close after `STUCK_SESSION_SECONDS`, badge goes stale.
---
## Open implementation questions
- **Server-list template filename**: confirm with `ls l4d2web/templates/` once implementation starts.
- **Server-create route location**: confirm path (likely `l4d2web/routes/server_routes.py`).
- **CSP allowlist for Steam avatar CDNs**: check `l4d2web/app.py` (or wherever security headers live) — extend `img-src` to include `avatars.cloudflare.steamstatic.com`, `avatars.akamai.steamstatic.com`, `avatars.steamstatic.com` if a CSP is enforced.
- **Adaptive backoff** for hibernating servers: defer; start with fixed 5s and revisit only if load becomes a concern (which it won't at current server count).
- **Migration data step**: SQLite alembic batch operation with a Python data step that iterates rows and generates `secrets.token_urlsafe(32)` per row — confirm pattern against existing migrations under `l4d2web/alembic/versions/`.
---
## Deferred to a separate plan
- Generic RCON command execution (`changelevel`, `kick`, `say`, `sm_ban`, ...)
- Web UI buttons mapped to those commands with CSRF + admin authz
- Audit log table for issued commands
- Player-count history graphs (data already accumulating from this plan)
- Ban UX (lookup by Steam ID, search across `server_player_session`)