[Bug] B8: netbird app doesn't work (LOW PRIORITY) #15

lfg2025 · 2026-06-15T15:18:55Z

lfg2025 commented

2026-06-15 15:18:55 +00:00

netbird app still doesn't work. Low priority / much later.

Tracked in repo: tests/production-quality/TRACKER.md

lfg2025 commented

2026-06-17 19:09:05 +00:00

Code-side review done; full diagnosis still needs podman logs from a node (LOW PRIORITY, kept open).

NetBird installs as a 3-container stack (stacks.rs): netbird-server (management/signal/relay), netbird-dashboard, and an nginx proxy on :8087. Config and secrets are generated, exposedAddress/AUTH_AUTHORITY are set to the detected host IP, and nginx routes /api, /oauth2, /signalexchange, and /management correctly.

Findings without a live node:

- The server and dashboard request neither NET_ADMIN nor /dev/net/tun, but that is expected: those are needed by the NetBird client agent (not installed here), not by the management service or dashboard, so this is unlikely to be the failure.
- Most probable cause: the dashboard auth flow (AUTH_AUTHORITY=.../oauth2) depends on the management service's embedded IDP being reachable and seeded; if it is not coming up, login fails and the app "doesn't work".

Next diagnostic step (needs a reachable node): podman logs netbird-server, netbird-dashboard, and the nginx proxy after install, and confirm all three are actually running. Holding off on speculative code changes that can't be verified live.

lfg2025 commented

2026-06-17 20:34:02 +00:00

Diagnosed live on .198 and fixed in code — netbird now works on the LAN IP.

Three distinct problems were stacked here:

1. nginx cached a container IP → permanent 502. The proxy used proxy_pass http://netbird-dashboard:80 (a literal name), so nginx resolved it once at startup and kept hitting a dead IP after the container's IP changed on restart/reboot. Fix: point the nginx resolver at the netbird-net gateway (Podman's aardvark DNS) and use variable upstreams so it re-resolves at request time. (Note: resolver local=on and the ${NGINX_LOCAL_RESOLVERS} template var both proved unsupported on this image; the explicit gateway is what works, and the install now reads it back from podman network inspect.)
2. Dashboard/server hit "conmon died without writing exit file": podman reported them "Up 29h" while they were actually dead and refusing connections. A clean recreate fixed it. (This was largely caused by my manual podman commands over SSH interfering with the archipelago-managed stack; linger is enabled, so archipelago's own containers are stable. There is still a general lifecycle gap, since a non-serving container should be detected and recreated by the reconciler; that is tracked separately.)
3. Auth failed with CORS / "Unauthenticated". detect_netbird_public_host_ip() baked the Tailscale 100.x address as the dashboard's API/OIDC origin, so loading the UI via the LAN IP made every /api and /oauth2 fetch cross-origin and therefore browser-blocked. Fix: prefer the LAN IP as the canonical origin; add CORS (reflect Origin + preflight) on /api and /oauth2; and list the LAN origin in the OIDC redirect URIs too, so any access IP works.
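
For illustration, the first fix (explicit resolver plus variable upstreams) could be rendered roughly as below. This is a minimal sketch, not the actual template in stacks.rs: the helper name `render_proxy_conf`, the `valid=10s` cache time, the variable names, and the server port parameter are all assumptions. It emits only a fragment meant to sit inside an nginx `server` block.

```rust
/// Hypothetical helper (not from stacks.rs): renders the resolver line and two
/// location blocks. `gateway` is the netbird-net gateway address read back from
/// `podman network inspect`; `server_port` is an assumed parameter because the
/// issue does not state the port netbird-server listens on.
fn render_proxy_conf(gateway: &str, server_port: u16) -> String {
    format!(
        r#"resolver {gateway} valid=10s;

location / {{
    # A variable in proxy_pass makes nginx resolve the name at request time
    # through the resolver above, instead of once at startup.
    set $dashboard_upstream http://netbird-dashboard:80;
    proxy_pass $dashboard_upstream;
}}

location /api {{
    set $server_upstream http://netbird-server:{server_port};
    proxy_pass $server_upstream;
}}
"#
    )
}
```

Calling `render_proxy_conf("10.89.3.1", 8080)` (both values are placeholders) yields a fragment in which no `proxy_pass` carries a literal hostname, which is what avoids the stale-IP 502 after a container restart.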

Validated live on .198 (config regenerated with the LAN origin + containers recreated): dashboard 200, /api/instance 200, OIDC discovery 200, and the dashboard JS is now baked to http://192.168.1.198:8087. cargo check + vue-tsc pass.
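
The LAN-first origin choice from the third fix could look something like the sketch below. It is not the real detect_netbird_public_host_ip(); `pick_canonical_host` and `is_cgnat` are hypothetical names, and the candidate list is assumed to contain only host interface addresses (Podman bridge addresses such as 10.89.x.x would need to be filtered out beforehand, since they are also private).

```rust
use std::net::Ipv4Addr;

/// True for 100.64.0.0/10, the shared address space that Tailscale 100.x
/// addresses are drawn from.
fn is_cgnat(ip: Ipv4Addr) -> bool {
    let o = ip.octets();
    o[0] == 100 && (o[1] & 0xC0) == 64
}

/// Hypothetical selection order: an RFC 1918 LAN address first, then any other
/// non-loopback, non-Tailscale address, and the Tailscale address only as a
/// last resort.
fn pick_canonical_host(candidates: &[Ipv4Addr]) -> Option<Ipv4Addr> {
    candidates
        .iter()
        .copied()
        .find(|ip| ip.is_private())
        .or_else(|| {
            candidates
                .iter()
                .copied()
                .find(|ip| !ip.is_loopback() && !is_cgnat(*ip))
        })
        .or_else(|| candidates.iter().copied().find(|ip| is_cgnat(*ip)))
}
```

With candidates 100.101.102.103 and 192.168.1.198 in either order, this picks 192.168.1.198, so the origin baked into the dashboard matches the address users type on the LAN.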

Rollout: these are in stacks.rs config generation, so a binary update + netbird reinstall regenerates the correct config on each node. The conmon-died lifecycle hardening is broader and is filed as a follow-up.

lfg2025 closed this issue

2026-06-17 20:34:02 +00:00

lfg2025 referenced this issue

2026-06-17 20:34:02 +00:00

[Tech debt] Migrate netbird from hardcoded binary stack to a self-contained registry app #52

lfg2025 referenced this issue

2026-06-17 20:34:02 +00:00

[Bug] Reconciler trusts podman 'Up' for conmon-died containers (netbird up-but-not-serving) #53

lfg2025 commented

2026-06-17 20:50:09 +00:00

Follow-up — final root cause + end-to-end fix confirmed.

After the resolver/CORS/LAN-origin fixes, login still failed with "Unauthenticated". Deeper diagnosis on .198 showed /oauth2/.well-known/openid-configuration returning 502, with nginx logging "netbird-server could not be resolved (3: Host not found)". The server container was Running=true but had no IP on netbird-net (serverIP= empty), so Podman's aardvark DNS had no record for it. netbird-server runs an embedded Dex IdP at {origin}/oauth2; with the server unresolvable, the dashboard couldn't fetch the OIDC discovery document, hence "Unauthenticated".

Cause: corrupted netavark/aardvark network state (from repeated manual container churn while debugging). Fix = a clean network + ordered container creation — which is exactly what a fresh package.install does. Validated by tearing down and recreating cleanly on .198: all three containers got IPs (10.89.3.2/3/4) and dashboard / api / oauth2 all return 200, stable across repeated polls.

Net for all nodes: a fresh install via the fixed binary (LAN-IP origin + gateway-derived nginx resolver + variable upstreams + CORS + multi-origin redirect URIs + explicit dashboard network-alias) creates the network cleanly and brings netbird up working. Existing/manually-churned nodes need a netbird reinstall (clean network) to recover. cargo check passes.

After updating a node, hard-refresh the browser — the dashboard bakes its API origin into JS at container start, so a cached bundle pointing at the old origin must be reloaded.

lfg2025 commented

2026-06-17 21:19:08 +00:00

Real final root cause of "Unauthenticated": a stale netbird store.

After the resolver/CORS/LAN-origin fixes the infra was fully healthy (all endpoints 200, OIDC discovery valid), yet the dashboard still showed "Unauthenticated" instantly with no login page. Server logs showed only requests to /api/instance, answered with setup_required: false, while the server also logged "single account mode … accounts number 0". That is a contradiction: the store said "already set up" but had zero accounts, so the dashboard tried to auth, found no session, and errored without ever redirecting to the embedded Dex login.
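
The contradiction can be stated as a one-line check. This is only a sketch: `store_looks_half_initialized` is a hypothetical name, and how the two inputs are obtained (the /api/instance response and the account count from the server log) is left out because the thread does not show it.

```rust
/// Hypothetical check: a store that reports setup as complete while holding
/// zero accounts matches the broken state described above.
fn store_looks_half_initialized(setup_required: bool, account_count: u64) -> bool {
    !setup_required && account_count == 0
}
```

A fresh store (setup_required = true, zero accounts) and a normally used one (setup_required = false, at least one account) both return false.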

Cause: the netbird data dir on .198 held a store.db dated May 20 — initialized by an earlier install under the old (Tailscale) issuer. Reconfiguring to the LAN origin while reusing that store left it in a half-initialized state.

Fix: wiped the store (kept the GeoLite DBs) and let netbird re-init. /api/instance now correctly returns {"setup_required": true} → the dashboard shows the create-admin/getting-started page. netbird is working on .198.

Rollout implication: a fresh package.install creates a fresh data dir, so new nodes are fine. Existing nodes that already had netbird installed need a clean reinstall (uninstall WITHOUT preserving data, then install) so the store re-initializes under the LAN-IP issuer — otherwise the stale store keeps them in this broken state. Worth considering whether netbird uninstall should always wipe its data dir to make this automatic.

lfg2025 commented

2026-06-17 22:08:52 +00:00

THE root cause: netbird's dashboard requires a secure context (HTTPS).

Browser console on the failing login showed "Uncaught Error: window.crypto.subtle is unavailable". window.crypto.subtle (which react-oidc uses for OIDC PKCE) is only exposed in a secure context, meaning HTTPS or localhost. Over plain http://<LAN-IP>:8087 it's undefined, so the dashboard's auth init threw before it ever redirected to login, which is why we saw "Unauthenticated" with dead buttons and no /oauth2/auth request. All the earlier fixes (nginx resolver, LAN-origin, /nb-auth SPA fallback, conmon-died recreate, fresh store) were real and necessary, but HTTPS was the missing foundation.
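
A rough approximation of the browser rule, useful as an install-time sanity check on the configured origin, might look like this. It is a simplified sketch of the "potentially trustworthy origin" idea (HTTPS, or HTTP to localhost / loopback), not a full implementation of the spec; the function names are hypothetical and userinfo in the URL is not handled.

```rust
use std::net::Ipv4Addr;

/// Extracts the host from "host", "host:port", or a bracketed IPv6 literal.
fn host_of(authority: &str) -> &str {
    if let Some(end) = authority.find(']') {
        &authority[..=end] // e.g. "[::1]" from "[::1]:8087"
    } else {
        authority.split(':').next().unwrap_or(authority)
    }
}

/// Simplified check: true when a browser would be expected to treat the origin
/// as a secure context, so that window.crypto.subtle is available.
fn is_potentially_trustworthy(origin: &str) -> bool {
    let Some((scheme, rest)) = origin.split_once("://") else {
        return false;
    };
    if scheme.eq_ignore_ascii_case("https") {
        return true;
    }
    if !scheme.eq_ignore_ascii_case("http") {
        return false;
    }
    // Plain HTTP only qualifies for localhost names and loopback addresses.
    let authority = rest.split('/').next().unwrap_or("");
    let host = host_of(authority);
    host == "localhost"
        || host.ends_with(".localhost")
        || host == "[::1]"
        || host.parse::<Ipv4Addr>().map_or(false, |ip| ip.is_loopback())
}
```

Under this check, http://192.168.1.198:8087 is rejected while https://192.168.1.198:8087 and http://localhost:8087 are accepted, which matches the behaviour seen on .198.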

Shipped (option A — code complete, compiles, validated live on .198):

- stacks.rs: proxy now terminates TLS (self-signed cert generated at install via openssl, SAN = LAN IP + 127.0.0.1 + localhost), listen 443 ssl, published 8087:443; all origins (exposedAddress / issuer / dashboard endpoints / redirect URIs) are now https://.
- Frontend appLauncher.ts: netbird added to NEW_TAB_APP_IDS and served via https:// (a self-signed-HTTPS iframe is blocked, because you can't accept a cert warning inside a frame), so it opens in a real tab where the user accepts the cert once.
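
As an illustration of the certificate step, the openssl invocation could be assembled along these lines. This is a sketch under assumptions: the helper name, the RSA 2048 key, the 365-day validity, and the CN are not stated in the thread; only the SAN contents (LAN IP, 127.0.0.1, localhost) come from it. The `-addext` flag requires OpenSSL 1.1.1 or newer.

```rust
/// Hypothetical helper: builds the arguments for an `openssl req` call that
/// writes a self-signed certificate and an unencrypted key.
fn self_signed_cert_args(lan_ip: &str, key_path: &str, cert_path: &str) -> Vec<String> {
    // SAN entries as described in the thread: LAN IP, 127.0.0.1, localhost.
    let san = format!("subjectAltName=IP:{lan_ip},IP:127.0.0.1,DNS:localhost");
    // CN is an assumption; browsers match on the SAN, not the CN.
    let subject = format!("/CN={lan_ip}");
    [
        "req", "-x509", "-newkey", "rsa:2048", "-nodes",
        "-keyout", key_path,
        "-out", cert_path,
        "-days", "365", // assumed validity period
        "-subj", subject.as_str(),
        "-addext", san.as_str(),
    ]
    .iter()
    .map(|s| s.to_string())
    .collect()
}
```

The result can be passed to `std::process::Command::new("openssl").args(&args)`; keeping argument construction separate from execution makes the SAN list easy to unit-test without openssl installed.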

Validated on .198: https://192.168.1.198:8087 loads, registration + login work.

Caveats / follow-ups:

- Self-signed → the user accepts a cert warning once per browser. Trusted-cert / serve-via-system-HTTPS so it works embedded in the iframe = separate issue (option B).
- Existing nodes with an old netbird store need a clean reinstall (the store carries the old issuer); fresh installs are clean.
- Peer/relay connectivity over self-signed TLS (rels://) for actual VPN clients is untested; this fix targets the admin dashboard.

lfg2025 referenced this issue

2026-06-17 22:08:52 +00:00

[Tech debt] netbird: serve via trusted HTTPS so it works in the iframe (no cert warning) #56

lfg2025 referenced this issue from a commit

2026-06-23 18:08:50 +00:00

feat(netbird): manifest-driven migration via reusable orchestrator primitives

lfg2025 referenced this issue from a commit

2026-06-23 22:07:44 +00:00

docs(master-plan): §8b — uninstall fix deployed+live-verifying, #15 guardian resolved

lfg2025 referenced this issue from a commit

2026-06-25 22:15:30 +00:00

refactor(netbird): delete legacy Rust installer — #20 ph4 (manifest-driven only)