left4me/docs/superpowers/specs/2026-05-05-l4d2-host-smoke-test-design.md
2026-05-05 23:23:26 +02:00

# L4D2 Host Smoke Test Design
**Goal:** Validate the implemented `l4d2host` library and `l4d2ctl` CLI on the disposable Linux server `ckn@10.0.4.128` before continuing web-app lifecycle job wiring.
**Target Host:** `ckn@10.0.4.128`
**Access Assumption:** SSH access as `ckn` with sudo privileges.
**Primary Constraint:** Ask for explicit user approval before every server-touching step.
## Context
The repository now contains both planned components:
- `components/l4d2-host-lib`: Python host library and `l4d2ctl` CLI.
- `components/l4d2-web-app`: Flask app for users, blueprints, servers, jobs, and logs.
The web app depends on the host library for real lifecycle behavior. Before wiring web lifecycle jobs end-to-end, the host contract should be proven on an actual Linux machine with `steamcmd`, `fuse-overlayfs`, systemd user services, and journald available.
## Scope
The smoke test verifies these host-lib behaviors:
- SSH connectivity and sudo access to `ckn@10.0.4.128`.
- Required runtime tools are present or can be installed: `steamcmd`, `fuse-overlayfs`, `fusermount3`, `systemctl --user`, `journalctl --user`, and Python packaging tooling.
- `/opt/l4d2` exists with permissions that allow the `ckn` user to run the v1 host workflow.
- `l4d2ctl install` downloads or updates the L4D2 dedicated server into `/opt/l4d2/installation`.
- `l4d2ctl initialize smoke -f spec.yaml` writes instance and runtime state under `/opt/l4d2`.
- `l4d2ctl start smoke` mounts the runtime overlay, copies `server.cfg`, and starts the systemd user service.
- `get_instance_status("smoke")` reports an interpretable status.
- `stream_instance_logs("smoke")` can read journald output.
- `l4d2ctl stop smoke` stops the user service and unmounts the runtime overlay.
- `l4d2ctl delete smoke` removes the instance/runtime directories.
- Re-running `l4d2ctl delete smoke` succeeds as a no-op.
## Out Of Scope
- Web-app job execution or UI changes.
- Long-running game-server operations beyond a short start/status/log/stop check.
- Workshop mod management or web-managed overlay file content.
- Production hardening for the disposable test server.
## Execution Strategy
The smoke test is intentionally gated. Each step must stop after reporting evidence and wait for user approval before moving to the next step.
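The gating rule can be sketched as a small driver loop. This is a minimal illustration and not part of `l4d2host`; the names `run_gated`, `Step`, `approve`, and `report` are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical shape: (step name, action that returns an evidence summary).
Step = Tuple[str, Callable[[], str]]


def run_gated(
    steps: Iterable[Step],
    approve: Callable[[str], bool],
    report: Callable[[str], None] = print,
) -> List[str]:
    """Run steps in order, asking for approval before each one."""
    evidence: List[str] = []
    for name, action in steps:
        if not approve(name):
            # No approval: stop here without touching the server.
            break
        summary = f"{name}: {action()}"
        report(summary)  # report evidence before asking about the next step
        evidence.append(summary)
    return evidence
```

In an interactive run, `approve` could wrap `input()` and return `True` only for an explicit "yes".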
### Step 1: Read-Only Server Inspection
Purpose: understand the target host without changing it.
Allowed actions:
- SSH into `ckn@10.0.4.128`.
- Inspect OS, package manager, current user, sudo availability, Python version, systemd user availability, user lingering status (as reported by `loginctl`), existing `/opt/l4d2` state, and relevant runtime tools.
Not allowed in this step:
- Installing packages.
- Creating or modifying files.
- Starting or stopping services.
- Mounting or unmounting filesystems.
Checkpoint: report findings and ask before any setup changes.
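One read-only way to check for the runtime tools is a PATH lookup, sketched below. The tool list mirrors the Scope section; `find_tools` and `missing_tools` are hypothetical helper names, and the script is assumed to run on the target host itself. A missing entry only means the binary is not on `PATH`, not necessarily that it is absent.

```python
import shutil
from typing import Callable, Dict, Iterable, List, Optional

# Binaries behind the tools named in the Scope section.
REQUIRED_TOOLS = (
    "steamcmd",
    "fuse-overlayfs",
    "fusermount3",
    "systemctl",
    "journalctl",
)


def find_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Dict[str, Optional[str]]:
    """Map each tool to its resolved path, or None if not on PATH."""
    return {tool: which(tool) for tool in tools}


def missing_tools(found: Dict[str, Optional[str]]) -> List[str]:
    """Return the sorted names of tools that were not found."""
    return sorted(tool for tool, path in found.items() if path is None)
```

`shutil.which` only reads the filesystem, so this stays within the limits of Step 1.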
### Step 2: Server Preparation
Purpose: make the disposable server capable of running the host-lib workflow.
Allowed actions after approval:
- Install missing packages needed for the host workflow.
- Create `/opt/l4d2` if missing.
- Set ownership/permissions so `ckn` can run the smoke workflow.
- Configure systemd user prerequisites (for example, enabling lingering for `ckn`) if required for `systemctl --user`.
Checkpoint: report exact changes and ask before deploying code.
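After preparation, the result can be checked from the `ckn` account with a short script. This is a sketch under the assumption that "can run the smoke workflow" means read, write, and traverse access to the base directory; `can_use_base_dir` is a hypothetical helper.

```python
import os
from pathlib import Path


def can_use_base_dir(base_dir: str = "/opt/l4d2") -> bool:
    """True if base_dir exists and the current user can read, write, and enter it."""
    path = Path(base_dir)
    return path.is_dir() and os.access(path, os.R_OK | os.W_OK | os.X_OK)
```

The check reflects the effective permissions of whoever runs it, so it should be executed as `ckn` rather than through sudo.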
### Step 3: Deploy Current Host Lib
Purpose: install the current repository implementation on the target host without inventing new packaging.
Allowed actions after approval:
- Copy or archive the current `components/l4d2-host-lib` source to the server.
- Install it using its existing `pyproject.toml`, preferably into an isolated virtual environment.
- Verify that `l4d2ctl --help` exposes the fixed v1 command surface.
Checkpoint: report command evidence and ask before downloading server files.
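The deployment commands for an isolated virtual environment might look like the sketch below. The directory paths are placeholders, `deploy_commands` is a hypothetical helper, and it is assumed that the existing `pyproject.toml` installs `l4d2ctl` as a console script into the environment's `bin` directory.

```python
from pathlib import PurePosixPath
from typing import List


def deploy_commands(source_dir: str, venv_dir: str) -> List[List[str]]:
    """Build the argv lists for a venv-based install; nothing is executed here."""
    venv_bin = PurePosixPath(venv_dir) / "bin"
    return [
        # Create the isolated environment with the standard-library venv module.
        ["python3", "-m", "venv", venv_dir],
        # Install the copied source tree using its existing pyproject.toml.
        [str(venv_bin / "pip"), "install", source_dir],
        # Confirm the CLI entry point exists and lists its commands.
        [str(venv_bin / "l4d2ctl"), "--help"],
    ]
```

For example, `deploy_commands("/home/ckn/l4d2-host-lib", "/home/ckn/l4d2-venv")` yields three commands to run in order on the target host, each only after approval.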
### Step 4: Run `l4d2ctl install`
Purpose: validate the install/update command against real `steamcmd` behavior.
Allowed actions after approval:
- Run `l4d2ctl install` on the target host.
- Capture stdout, stderr, and exit code.
- Inspect `/opt/l4d2/installation` closely enough to confirm that the expected installation files are present.
Checkpoint: report evidence and ask before creating a smoke instance.
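Capturing stdout, stderr, and the exit code together can be done with `subprocess.run`. The sketch below is one possible wrapper; `CommandEvidence` and `run_and_capture` are hypothetical names, not part of the host library.

```python
import subprocess
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class CommandEvidence:
    command: Tuple[str, ...]
    exit_code: int
    stdout: str
    stderr: str


def run_and_capture(
    command: Sequence[str], timeout: Optional[float] = None
) -> CommandEvidence:
    """Run a command without raising on failure and keep all of its output."""
    completed = subprocess.run(
        list(command),
        capture_output=True,  # collect stdout and stderr separately
        text=True,            # decode output to str
        timeout=timeout,
        check=False,          # a non-zero exit code is evidence, not an exception
    )
    return CommandEvidence(
        command=tuple(command),
        exit_code=completed.returncode,
        stdout=completed.stdout,
        stderr=completed.stderr,
    )
```

On the target host this would be called as `run_and_capture(["l4d2ctl", "install"])`. A real `steamcmd` download can take a long time, so any `timeout` should be chosen generously or left unset.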
### Step 5: Run Instance Lifecycle Smoke Test
Purpose: validate initialize/start/status/logs/stop/delete against the real runtime.
Allowed actions after approval:
- Create a minimal spec file for instance name `smoke`.
- Run `l4d2ctl initialize smoke -f spec.yaml`.
- Run `l4d2ctl start smoke`.
- Check `systemctl --user status l4d2@smoke.service`.
- Check mount state for `/opt/l4d2/runtime/smoke/merged`.
- Call `get_instance_status("smoke")` from Python.
- Call `stream_instance_logs("smoke", lines=50, follow=False)` from Python.
- Run `l4d2ctl stop smoke`.
- Run `l4d2ctl delete smoke`.
- Run `l4d2ctl delete smoke` again to verify no-op success.
Checkpoint: report command evidence and ask what to do with remaining artifacts.
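The mount-state check can be done by reading `/proc/mounts`, whose whitespace-separated fields are device, mount point, filesystem type, and options. The helper below is a hypothetical sketch; it takes the file contents as a string so it can be exercised without a real mount.

```python
from typing import Optional


def mount_fstype(mount_point: str, mounts_text: str) -> Optional[str]:
    """Return the filesystem type mounted at mount_point, or None if unmounted."""
    target = mount_point.rstrip("/") or "/"
    fstype: Optional[str] = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == target:
            # Keep the last match: later mounts shadow earlier ones.
            fstype = fields[2]
    return fstype
```

On the host, `mount_fstype("/opt/l4d2/runtime/smoke/merged", open("/proc/mounts").read())` would be expected to return a filesystem type after `l4d2ctl start smoke` and `None` after `l4d2ctl stop smoke`.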
### Step 6: Cleanup Decision
Purpose: preserve useful diagnostics or remove smoke-test state based on user preference.
Allowed actions after approval:
- Remove copied source archives or virtual environments.
- Remove smoke spec files.
- Leave `/opt/l4d2/installation` intact if useful for later web-app testing, or remove it if requested.
Checkpoint: report final target-host state.
## Failure Handling
Any failure stops the smoke-test flow immediately. The report must include:
- command that failed
- exit code if available
- relevant stdout and stderr
- likely category: environment issue, host-lib bug, packaging/deploy issue, or unclear
- recommended next action
No automatic destructive cleanup should happen after a failure. If a failure leaves `/opt/l4d2`, a mounted overlay, copied files, or a systemd service behind, that state should be preserved for inspection until the user approves cleanup.
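The report fields listed above map naturally onto a small record type. The following is a sketch only; `FailureReport` and `FailureCategory` are hypothetical names, and the rendering format is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureCategory(Enum):
    ENVIRONMENT = "environment issue"
    HOST_LIB_BUG = "host-lib bug"
    PACKAGING = "packaging/deploy issue"
    UNCLEAR = "unclear"


@dataclass(frozen=True)
class FailureReport:
    command: str
    exit_code: Optional[int]  # None when no exit code is available
    stdout: str
    stderr: str
    category: FailureCategory
    next_action: str

    def render(self) -> str:
        """Format the report as plain text, one field per line."""
        code = "n/a" if self.exit_code is None else str(self.exit_code)
        return "\n".join(
            [
                f"command: {self.command}",
                f"exit code: {code}",
                f"stdout: {self.stdout.strip() or '(empty)'}",
                f"stderr: {self.stderr.strip() or '(empty)'}",
                f"category: {self.category.value}",
                f"next action: {self.next_action}",
            ]
        )
```

Keeping the category as an enum limits it to the four values named in the list, with `UNCLEAR` as the fallback when the cause cannot be determined from the captured output.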
## Evidence Requirements
Each completed step should report fresh command evidence. Suitable evidence includes:
- exact commands run
- exit code or clear command success/failure status
- key stdout/stderr lines
- relevant filesystem paths
- service status summaries
- mount state
- journal/log snippets
No step should be called successful without current evidence from that step.
## Next Phase After Smoke Test
If the host-lib smoke test succeeds, continue with web-app lifecycle job wiring:
- enqueue lifecycle jobs from routes/UI
- run jobs through worker threads
- call `l4d2web.services.l4d2_facade`
- persist callback output to `job_logs`
- live-follow job logs through SSE
- update server desired and actual state
If the smoke test fails due to host-lib behavior, fix the host library before continuing web-app lifecycle work.