bundlewrap/docs/superpowers/specs/2026-05-10-ckn-bw-agents-md-refactor-round-2-design.md
CroneKorkN 0e88c4967e
docs/specs: round-2 agents-md refactor design (gaps 7-12)
Continuation of round 1. Five commits: two new bundles/AGENTS.md
Pitfalls (file: source basename, git_deploy gotchas) and three
bundle READMEs (letsencrypt operational, bind apply-both, nginx
new file). Diverges from the handoff on placement: gaps 7-9 go
in bundles/AGENTS.md not items/AGENTS.md, since items/AGENTS.md
is scoped to custom item types only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:39:40 +02:00

286 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Round 2 — agent-doc refactor (gaps 712)
## Why
Continuation of round 1 (spec at
`2026-05-10-ckn-bw-agents-md-refactor-round-1-design.md`). Round 1
landed the cross-cutting lessons (read-only allowlist, bundle
validation needs a node, nodes-carry-only-node-specific-metadata,
reactors-must-read-metadata, triggers/triggered:True invariant,
self-healing pattern). Round 2 covers the remaining six gaps: built-in
item-type gotchas and three bundle READMEs.
## Scope
In:
- Gap 7 — `file:`'s `source` defaults to the basename of the destination.
- Gap 8 — `git_deploy` extracts as the connecting user (root after
sudo); chown action needed for non-root downstream consumers.
- Gap 9 — `git_deploy` URL form: `://` triggers per-apply clone, no `://`
requires a `git_deploy_repos` map at the repo root.
- Gap 10 — `bundles/letsencrypt`: first-apply behaviour, DNS-01
prerequisites, negative-cache penalty.
- Gap 11 — `bundles/bind`: applying changes to a `master_node`-linked
pair needs `bw apply` on both ends.
- Gap 12 — `bundles/nginx`: how port 80 is served, `vm/cores`
requirement.
Out:
- Bundle behaviour changes. Pure docs.
- `bw apply` / `bw run` — not authorised this session.
## Placement decision (diverges from the handoff)
The handoff suggests `items/AGENTS.md` for gaps 7, 8, 9. But
`items/AGENTS.md` is scoped to **custom** item types in the `items/`
directory — its first sentence: *"Custom item types — each `*.py` is
a `bundlewrap.items.Item` subclass…"*. Built-in gotchas (`file:`,
`git_deploy:`) don't fit there.
Round-1 lessons about built-in mechanics (reactors must read metadata,
`triggers` invariant, self-healing pattern) all landed in
`bundles/AGENTS.md` Pitfalls. Gaps 7, 8, 9 are the same shape, so
they go in the same place.
## Validation findings
- Gap 7: well-known bw built-in semantics. Trusting the handoff.
- Gap 8: confirmed at `.venv/src/bundlewrap/bundlewrap/items/git_deploy.py`'s
`fix()` method — uses `self.node.upload(...)` which writes as the sudo
user (root). Files end up root-owned.
- Gap 9: confirmed in round 1 (`git_deploy.py:103` —
`if "://" in self.attributes['repo']:`).
- Gap 10: confirmed `/etc/dehydrated/letsencrypt-ensure-some-certificate`
exists in the bundle; runs on every domain with idempotent `unless`.
Daily timer at `/usr/bin/dehydrated --cron --accept-terms --challenge dns-01`.
- Gap 11: nuanced. The bundle DOES set `bind/type = 'slave'` and renders
different named.conf.local for slaves, so bind itself may AXFR at
runtime. But the slave's *bw-managed* zone files are statically
rendered from the master's metadata at slave-apply time
(`bundles/bind/items.py:100`). The practical workflow rule — "apply
both" — is correct regardless. I'll frame the README as the workflow
rule, not the absolute "not AXFR slaving" claim from the handoff.
- Gap 12: confirmed `nginx.conf:42` includes `/etc/nginx/sites-enabled/*`;
`nginx/items.py:35` reads `node.metadata.get('vm/cores')` with no
default. README does not exist.
## Existing README states
- `bundles/letsencrypt/README.md` — 9 lines: upstream link + nsupdate
snippet. Reshape into an operational README; keep the nsupdate snippet.
- `bundles/bind/README.md` — does not exist. Create.
- `bundles/nginx/README.md` — does not exist. Create.
## Commits
### Commit 7 — `file:` source defaults to destination basename (Gap 7)
`bundles/AGENTS.md` Pitfalls — new bullet:
```markdown
- **`file:` `source` defaults to the destination basename.** For a
destination of `/etc/foo/bar.conf` with no `source` key, bw looks for
`bundles/<bundle>/files/bar.conf`. Only declare `source` explicitly
when the basename you want differs (e.g. shipping a Mako template
named `bar.conf.mako` to a destination of `/etc/foo/bar.conf`).
```
### Commit 8 — `git_deploy` gotchas (Gaps 8 + 9)
`bundles/AGENTS.md` Pitfalls — two new bullets.
```markdown
- **`git_deploy` extracts as the connecting (sudo) user — files end up
root-owned.** A downstream action that runs as a non-root app user
(typical: editable pip install, Rails bundle install) will hit
`Permission denied` on `.egg-info` or similar. The fix is a
self-healing chown action between `git_deploy` and the downstream
action:
```python
actions['<bundle>_chown_src'] = {
'command': 'chown -R <user>:<group> <path>',
'unless': 'test -z "$(find <path> ! -user <user> -print -quit)"',
'cascade_skip': False,
'needs': ['git_deploy:<path>', 'user:<user>', 'group:<group>'],
}
```
See `bundles/left4me/items.py` for an in-tree example.
- **`git_deploy` URL form matters.** A URL containing `://` (HTTP/HTTPS,
`ssh://`) makes bw clone to a temp dir per-apply — no operator-side
state needed. Without `://` (SCP-style `git@host:path`), bw expects a
`git_deploy_repos` map file at the repo root pointing at a long-lived
local clone, and raises `RepositoryError('missing repo map for
git_deploy')` if it isn't there. For HTTPS-reachable repos use the
HTTPS form; for SSH-only, prefer the explicit `ssh://user@host/path`
form so the map isn't needed.
```
### Commit 9 — letsencrypt README (Gap 10)
Reshape `bundles/letsencrypt/README.md`. Keep the upstream link and
nsupdate snippet at the top; add three structured sections.
```markdown
# letsencrypt
Issues and renews Let's Encrypt certs via [dehydrated][upstream] with
DNS-01 against the in-house bind-acme server.
[upstream]: https://github.com/dehydrated-io/dehydrated/wiki/example-dns-01-nsupdate-script
## First-apply behaviour
Immediately after `bw apply <node>`, nginx serves a **self-signed
cert** for each declared domain — generated by
`/etc/dehydrated/letsencrypt-ensure-some-certificate` so nginx has
something to start with. The real Let's Encrypt cert arrives at most
24h later when the systemd timer fires
(`/usr/bin/dehydrated --cron --accept-terms --challenge dns-01`). To
shortcut the wait:
```sh
ssh <node> 'sudo /usr/bin/dehydrated --cron --accept-terms --challenge dns-01'
ssh <node> 'sudo systemctl reload nginx'
```
## DNS-01 prerequisites
`hook.sh` does `nsupdate` against the bind-acme server (referenced
by `letsencrypt/acme_node`). For the challenge to succeed:
1. The acme node must be in the same metadata graph (so
`bw metadata <node> -k letsencrypt/acme_node` resolves).
2. **All NS servers** for the validated domain must serve the
`_acme-challenge.<domain>` CNAME — Let's Encrypt validates from
primary AND secondary geographic regions; both authoritative
servers must agree. If a secondary NS is also a bw-managed node,
`bw apply` it after adding the domain (see e.g. `ovh.secondary`).
3. The bind-acme node's TSIG key must be reachable. `hook.sh` is
rendered with the bind-acme server's `network/internal/ipv4`
for clients outside that LAN, the route must exist (typically via
wireguard `s2s` peer membership).
## Negative-cache penalty
If the first DNS-01 attempt fails (e.g. zone not yet applied to the
secondary NS), Let's Encrypt's resolvers cache NXDOMAIN for the SOA's
negative TTL (often 900s = 15 min). Subsequent attempts during that
window also fail and refresh the cache. Combined with LE's rate limit
of **5 failed authorisations per domain per hour**, recovery requires
you to **stop retrying** for ~15 minutes after fixing the DNS, then
make at most one attempt.
## nsupdate sample
For interactive testing of the bind-acme TSIG path:
```sh
printf "server 127.0.0.1
zone acme.resolver.name.
update add _acme-challenge.ckn.li.acme.resolver.name. 600 IN TXT \"hello\"
send
" | nsupdate -y hmac-sha512:acme:<TSIG_KEY_REDACTED>
```
```
### Commit 10 — bind README (Gap 11, reframed)
Create `bundles/bind/README.md`. Frame as the workflow rule, not the
absolute "not AXFR" claim.
```markdown
# bind
Authoritative DNS — primary plus optional `bind/master_node` slaves.
## Applying changes needs both nodes
The slave's bw-managed zone files are rendered from the master's
metadata at slave-apply time (see `bundles/bind/items.py:100`). When
you change a record on the master (adding a `letsencrypt/domains`
entry, a new vhost, etc.), the change is only published once you
apply BOTH:
```sh
bw apply htz.mails # primary (where the source records live)
bw apply ovh.secondary # secondary (renders its own zone files)
```
Until both have been applied, `bw verify ovh.secondary` will show
stale zones and consumers that hit the secondary (Let's Encrypt's
secondary-region validators in particular) will see NXDOMAIN. Even
though the slave's named.conf.local declares `type slave;`, don't
rely on bind's own AXFR catching up — the bw-rendered file on disk
is what `bw verify` measures.
## See also
- `bundles/bind-acme/` — the in-house ACME-update receiver.
- `bundles/letsencrypt/README.md` — DNS-01 prerequisites and the
negative-cache penalty (the most common operational consequence of
forgetting to apply the secondary).
```
### Commit 11 — nginx README (Gap 12)
Create `bundles/nginx/README.md`.
```markdown
# nginx
Webserver. Per-node vhosts in `nginx/vhosts`; per-vhost templates in
`data/nginx/*.conf`.
## How port 80 is served
The bundle ships a fixed `80.conf` to
`/etc/nginx/sites-available/80.conf` (picked up by the
`sites-enabled/` symlink) that handles **all** port-80 traffic
across vhosts:
1. ACME HTTP-01 challenges (`/.well-known/acme-challenge/`) are
served from `/var/lib/dehydrated/acme-challenges/`.
2. All other port-80 requests are 301-redirected to
`https://$host$request_uri`.
Per-vhost templates only declare `listen 443 ssl http2;`, so they
don't need their own port-80 server blocks. If you need vhost-
specific port-80 behaviour (e.g. plain-HTTP without redirect), you'll
need to override 80.conf or add a per-vhost block.
## Required metadata
- `vm/cores` — read directly by `items.py` for `worker_processes`.
No default; `bw items <node>` raises at item-build time if missing.
Typically supplied by the `vm` bundle / hetzner-vm group; double-
check on bare-metal hosts.
- `nginx/vhosts` — dict of vhost-name → vhost-config.
- `nginx/modules` — list of dynamic modules to load.
## Cross-namespace
`items.py` reads `letsencrypt/domains` to skip emitting a per-vhost
HTTPS block when LE hasn't declared the domain yet — keeps the bundle
loadable on a node where letsencrypt isn't fully wired up.
```
## Out of scope
- Bundle behaviour changes. Pure docs.
- `bw apply` / `bw run`.
- Reformatting the existing two-line bundle READMEs into the new
shape — bundles/AGENTS.md explicitly says don't do that
("uneven quality is part of what we accept in exchange for not
blocking other work").
## Constraints
- Don't echo decrypted secrets. The TSIG-key example in the
letsencrypt nsupdate snippet uses `<TSIG_KEY_REDACTED>`.
- After each commit, `.venv/bin/bw test` must pass.
- No push.