bundlewrap/docs/superpowers/specs/2026-05-10-ckn-bw-agents-md-refactor-round-2-design.md
CroneKorkN 0e88c4967e
docs/specs: round-2 agents-md refactor design (gaps 7-12)
Continuation of round 1. Five commits: two new bundles/AGENTS.md
Pitfalls (file: source basename, git_deploy gotchas) and three
bundle READMEs (letsencrypt operational, bind apply-both, nginx
new file). Diverges from the handoff on placement: gaps 7-9 go
in bundles/AGENTS.md not items/AGENTS.md, since items/AGENTS.md
is scoped to custom item types only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 20:39:40 +02:00

11 KiB
Raw Blame History

Round 2 — agent-doc refactor (gaps 712)

Why

Continuation of round 1 (spec at 2026-05-10-ckn-bw-agents-md-refactor-round-1-design.md). Round 1 landed the cross-cutting lessons (read-only allowlist, bundle validation needs a node, nodes-carry-only-node-specific-metadata, reactors-must-read-metadata, triggers/triggered:True invariant, self-healing pattern). Round 2 covers the remaining six gaps: built-in item-type gotchas and three bundle READMEs.

Scope

In:

  • Gap 7 — file:'s source defaults to the basename of the destination.
  • Gap 8 — git_deploy extracts as the connecting user (root after sudo); chown action needed for non-root downstream consumers.
  • Gap 9 — git_deploy URL form: :// triggers per-apply clone, no :// requires a git_deploy_repos map at the repo root.
  • Gap 10 — bundles/letsencrypt: first-apply behaviour, DNS-01 prerequisites, negative-cache penalty.
  • Gap 11 — bundles/bind: applying changes to a master_node-linked pair needs bw apply on both ends.
  • Gap 12 — bundles/nginx: how port 80 is served, vm/cores requirement.

Out:

  • Bundle behaviour changes. Pure docs.
  • bw apply / bw run — not authorised this session.

Placement decision (diverges from the handoff)

The handoff suggests items/AGENTS.md for gaps 7, 8, 9. But items/AGENTS.md is scoped to custom item types in the items/ directory — its first sentence: "Custom item types — each *.py is a bundlewrap.items.Item subclass…". Built-in gotchas (file:, git_deploy:) don't fit there.

Round-1 lessons about built-in mechanics (reactors must read metadata, triggers invariant, self-healing pattern) all landed in bundles/AGENTS.md Pitfalls. Gaps 7, 8, 9 are the same shape, so they go in the same place.

Validation findings

  • Gap 7: well-known bw built-in semantics. Trusting the handoff.
  • Gap 8: confirmed at .venv/src/bundlewrap/bundlewrap/items/git_deploy.py's fix() method — uses self.node.upload(...) which writes as the sudo user (root). Files end up root-owned.
  • Gap 9: confirmed in round 1 (git_deploy.py:103if "://" in self.attributes['repo']:).
  • Gap 10: confirmed /etc/dehydrated/letsencrypt-ensure-some-certificate exists in the bundle; runs on every domain with idempotent unless. Daily timer at /usr/bin/dehydrated --cron --accept-terms --challenge dns-01.
  • Gap 11: nuanced. The bundle DOES set bind/type = 'slave' and renders different named.conf.local for slaves, so bind itself may AXFR at runtime. But the slave's bw-managed zone files are statically rendered from the master's metadata at slave-apply time (bundles/bind/items.py:100). The practical workflow rule — "apply both" — is correct regardless. I'll frame the README as the workflow rule, not the absolute "not AXFR slaving" claim from the handoff.
  • Gap 12: confirmed nginx.conf:42 includes /etc/nginx/sites-enabled/*; nginx/items.py:35 reads node.metadata.get('vm/cores') with no default. README does not exist.

Existing README states

  • bundles/letsencrypt/README.md — 9 lines: upstream link + nsupdate snippet. Reshape into an operational README; keep the nsupdate snippet.
  • bundles/bind/README.md — does not exist. Create.
  • bundles/nginx/README.md — does not exist. Create.

Commits

Commit 7 — file: source defaults to destination basename (Gap 7)

bundles/AGENTS.md Pitfalls — new bullet:

- **`file:` `source` defaults to the destination basename.** For a
  destination of `/etc/foo/bar.conf` with no `source` key, bw looks for
  `bundles/<bundle>/files/bar.conf`. Only declare `source` explicitly
  when the basename you want differs (e.g. shipping a Mako template
  named `bar.conf.mako` to a destination of `/etc/foo/bar.conf`).

Commit 8 — git_deploy gotchas (Gaps 8 + 9)

bundles/AGENTS.md Pitfalls — two new bullets.

- **`git_deploy` extracts as the connecting (sudo) user — files end up
  root-owned.** A downstream action that runs as a non-root app user
  (typical: editable pip install, Rails bundle install) will hit
  `Permission denied` on `.egg-info` or similar. The fix is a
  self-healing chown action between `git_deploy` and the downstream
  action:

  ```python
  actions['<bundle>_chown_src'] = {
      'command': 'chown -R <user>:<group> <path>',
      'unless': 'test -z "$(find <path> ! -user <user> -print -quit)"',
      'cascade_skip': False,
      'needs': ['git_deploy:<path>', 'user:<user>', 'group:<group>'],
  }

See bundles/left4me/items.py for an in-tree example.

  • git_deploy URL form matters. A URL containing :// (HTTP/HTTPS, ssh://) makes bw clone to a temp dir per-apply — no operator-side state needed. Without :// (SCP-style git@host:path), bw expects a git_deploy_repos map file at the repo root pointing at a long-lived local clone, and raises RepositoryError('missing repo map for git_deploy') if it isn't there. For HTTPS-reachable repos use the HTTPS form; for SSH-only, prefer the explicit ssh://user@host/path form so the map isn't needed.

### Commit 9 — letsencrypt README (Gap 10)

Reshape `bundles/letsencrypt/README.md`. Keep the upstream link and
nsupdate snippet at the top; add three structured sections.

```markdown
# letsencrypt

Issues and renews Let's Encrypt certs via [dehydrated][upstream] with
DNS-01 against the in-house bind-acme server.

[upstream]: https://github.com/dehydrated-io/dehydrated/wiki/example-dns-01-nsupdate-script

## First-apply behaviour

Immediately after `bw apply <node>`, nginx serves a **self-signed
cert** for each declared domain — generated by
`/etc/dehydrated/letsencrypt-ensure-some-certificate` so nginx has
something to start with. The real Let's Encrypt cert arrives at most
24h later when the systemd timer fires
(`/usr/bin/dehydrated --cron --accept-terms --challenge dns-01`). To
shortcut the wait:

```sh
ssh <node> 'sudo /usr/bin/dehydrated --cron --accept-terms --challenge dns-01'
ssh <node> 'sudo systemctl reload nginx'

DNS-01 prerequisites

hook.sh does nsupdate against the bind-acme server (referenced by letsencrypt/acme_node). For the challenge to succeed:

  1. The acme node must be in the same metadata graph (so bw metadata <node> -k letsencrypt/acme_node resolves).
  2. All NS servers for the validated domain must serve the _acme-challenge.<domain> CNAME — Let's Encrypt validates from primary AND secondary geographic regions; both authoritative servers must agree. If a secondary NS is also a bw-managed node, bw apply it after adding the domain (see e.g. ovh.secondary).
  3. The bind-acme node's TSIG key must be reachable. hook.sh is rendered with the bind-acme server's network/internal/ipv4 — for clients outside that LAN, the route must exist (typically via wireguard s2s peer membership).

Negative-cache penalty

If the first DNS-01 attempt fails (e.g. zone not yet applied to the secondary NS), Let's Encrypt's resolvers cache NXDOMAIN for the SOA's negative TTL (often 900s = 15 min). Subsequent attempts during that window also fail and refresh the cache. Combined with LE's rate limit of 5 failed authorisations per domain per hour, recovery requires you to stop retrying for ~15 minutes after fixing the DNS, then make at most one attempt.

nsupdate sample

For interactive testing of the bind-acme TSIG path:

printf "server 127.0.0.1
zone acme.resolver.name.
update add _acme-challenge.ckn.li.acme.resolver.name. 600 IN TXT \"hello\"
send
" | nsupdate -y hmac-sha512:acme:<TSIG_KEY_REDACTED>

### Commit 10 — bind README (Gap 11, reframed)

Create `bundles/bind/README.md`. Frame as the workflow rule, not the
absolute "not AXFR" claim.

```markdown
# bind

Authoritative DNS — primary plus optional `bind/master_node` slaves.

## Applying changes needs both nodes

The slave's bw-managed zone files are rendered from the master's
metadata at slave-apply time (see `bundles/bind/items.py:100`). When
you change a record on the master (adding a `letsencrypt/domains`
entry, a new vhost, etc.), the change is only published once you
apply BOTH:

```sh
bw apply htz.mails        # primary (where the source records live)
bw apply ovh.secondary    # secondary (renders its own zone files)

Until both have been applied, bw verify ovh.secondary will show stale zones and consumers that hit the secondary (Let's Encrypt's secondary-region validators in particular) will see NXDOMAIN. Even though the slave's named.conf.local declares type slave;, don't rely on bind's own AXFR catching up — the bw-rendered file on disk is what bw verify measures.

See also

  • bundles/bind-acme/ — the in-house ACME-update receiver.
  • bundles/letsencrypt/README.md — DNS-01 prerequisites and the negative-cache penalty (the most common operational consequence of forgetting to apply the secondary).

### Commit 11 — nginx README (Gap 12)

Create `bundles/nginx/README.md`.

```markdown
# nginx

Webserver. Per-node vhosts in `nginx/vhosts`; per-vhost templates in
`data/nginx/*.conf`.

## How port 80 is served

The bundle ships a fixed `80.conf` to
`/etc/nginx/sites-available/80.conf` (picked up by the
`sites-enabled/` symlink) that handles **all** port-80 traffic
across vhosts:

1. ACME HTTP-01 challenges (`/.well-known/acme-challenge/`) are
   served from `/var/lib/dehydrated/acme-challenges/`.
2. All other port-80 requests are 301-redirected to
   `https://$host$request_uri`.

Per-vhost templates only declare `listen 443 ssl http2;`, so they
don't need their own port-80 server blocks. If you need vhost-
specific port-80 behaviour (e.g. plain-HTTP without redirect), you'll
need to override 80.conf or add a per-vhost block.

## Required metadata

- `vm/cores` — read directly by `items.py` for `worker_processes`.
  No default; `bw items <node>` raises at item-build time if missing.
  Typically supplied by the `vm` bundle / hetzner-vm group; double-
  check on bare-metal hosts.
- `nginx/vhosts` — dict of vhost-name → vhost-config.
- `nginx/modules` — list of dynamic modules to load.

## Cross-namespace

`items.py` reads `letsencrypt/domains` to skip emitting a per-vhost
HTTPS block when LE hasn't declared the domain yet — keeps the bundle
loadable on a node where letsencrypt isn't fully wired up.

Out of scope

  • Bundle behaviour changes. Pure docs.
  • bw apply / bw run.
  • Reformatting the existing two-line bundle READMEs into the new shape — bundles/AGENTS.md explicitly says don't do that ("uneven quality is part of what we accept in exchange for not blocking other work").

Constraints

  • Don't echo decrypted secrets. The TSIG-key example in the letsencrypt nsupdate snippet uses <TSIG_KEY_REDACTED>.
  • After each commit, .venv/bin/bw test must pass.
  • No push.