Gateways that sit in front of agents and tools need two different questions answered: should we restart this process, and should we send it user traffic? Kubernetes popularized liveness and readiness probes; on a bare-metal Mac mini M4 running an OpenClaw-style gateway under launchd, you implement the same split with HTTP endpoints or exec checks, then attach SLOs so on-call can tell normal post-deploy flapping from a systemic outage.
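A minimal sketch of that endpoint split, assuming a single-process gateway; `probe_response` and `check_ready` are illustrative names, not part of any real gateway API:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def probe_response(path, ready):
    """Map a probe path to an HTTP (status, body) pair. Liveness is
    unconditional: reaching this code at all proves the process is up."""
    if path == "/livez":
        return 200, b"ok"
    if path == "/readyz":
        return (200, b"ok") if ready else (503, b"not ready")
    return 404, b"not found"

def check_ready():
    # Placeholder: real checks cover auth store, model backends, disk.
    return True

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path, check_ready())
        self.send_response(status)
        self.end_headers()
        self.wfile.write(body)

# To serve on loopback:
# HTTPServer(("127.0.0.1", 8081), ProbeHandler).serve_forever()
```

Keeping the path-to-status mapping in a pure function makes the restart-versus-traffic semantics testable without standing up the server.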
Probe responsibility matrix
| Probe | Checks | Failure action |
|---|---|---|
| Liveness | Process up; event loop not wedged; port bind succeeds | launchd restart after threshold |
| Readiness | Auth store reachable; model/router deps OK; disk > 10% free | Remove from load balancer / agent roster; no restart |
| Startup | Migrations, cert load, warm caches | Block readiness until complete or bounded timeout |
SLO starter table
| Signal | Target | Burn alert |
|---|---|---|
| Monthly availability | 99.5% internal / 99.9% if customer-facing | 2× error budget in 1 h |
| Probe p95 | < 200 ms loopback | Sustained > 500 ms for 15 min |
| Readiness flaps | < 3 per day outside deploy | > 10 in 30 min |
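The burn-alert column follows from simple error-budget arithmetic; a sketch, assuming a request-ratio SLI over the monthly window:

```python
def burn_rate(error_ratio, slo_target=0.995):
    # Budget for a 99.5% target is 0.5% of requests; burn rate 1.0
    # spends it exactly over the window, 2.0 spends it in half the time.
    budget = 1.0 - slo_target
    return error_ratio / budget

# A sustained 1% error ratio against a 99.5% SLO burns at roughly 2x,
# matching the paging condition in the table above.
```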
macOS nuance: after sleep/wake or network change, readiness should fail briefly while DNS and VPN settle; tune ThrottleInterval in launchd so you do not restart a healthy gateway during transient NIC churn.
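A minimal sketch of the relevant launchd keys; the label and the 30-second interval are placeholders to adapt:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
 "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.example.gateway</string>  <!-- placeholder label -->
  <key>KeepAlive</key>
  <true/>
  <key>ThrottleInterval</key>
  <!-- minimum seconds between respawns; raise it so sleep/wake and
       NIC churn do not trigger rapid restart cycles -->
  <integer>30</integer>
</dict>
</plist>
```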
Eight-step implementation checklist
- Define HTTP paths (e.g. `/livez` vs `/readyz`) and document their semantics in the runbook.
- Keep liveness cheap (no external calls); readiness may call dependency health with short timeouts.
- Wire external probes (reverse proxy, k8s sidecar, or synthetic monitor) to readiness for traffic decisions.
- Log probe failures at WARN with reason codes to correlate with agent disconnects.
- Dashboard error budget burn alongside CPU and open file descriptors on M4.
- Game-day: kill upstream dependency and confirm readiness fails without liveness restart storm.
- Post-deploy: temporarily relax flap alerts for a 30 min canary window.
- Review thresholds quarterly against actual incident data.
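The flap thresholds in the SLO table can be enforced with a sliding-window transition counter; a sketch, where `FlapCounter` and its defaults are assumptions mirroring the table:

```python
import time
from collections import deque

class FlapCounter:
    """Count readiness transitions in a sliding window (default: 30 min,
    alert past 10 transitions, matching the SLO table)."""

    def __init__(self, window_s=1800, threshold=10):
        self.window_s = window_s
        self.threshold = threshold
        self.transitions = deque()
        self.last_state = None

    def observe(self, ready, now=None):
        """Record one readiness sample; return True if the flap alert
        should fire."""
        now = time.monotonic() if now is None else now
        if self.last_state is not None and ready != self.last_state:
            self.transitions.append(now)
        self.last_state = ready
        # Drop transitions that have aged out of the window.
        while self.transitions and now - self.transitions[0] > self.window_s:
            self.transitions.popleft()
        return len(self.transitions) > self.threshold
```

Feeding it each probe result lets you log the WARN reason code and the current flap count together, which simplifies correlating flaps with agent disconnects.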
FAQ
What is the difference between liveness and readiness for a gateway?
Liveness answers whether the process should be restarted; readiness answers whether it should receive traffic. A gateway can be alive but not ready if upstream auth, model backends, or disk are unhealthy.
What SLO thresholds are reasonable for a single-host M4 gateway?
Many teams target 99.5% monthly availability for internal gateways, with p95 probe latency under 200 ms on loopback and readiness flaps capped at a few per day after deploy windows.
Exercise probe and SLO wiring on rented Mac mini M4 hosts before production cutover. NodeMac provides dedicated Apple Silicon in Hong Kong, Japan, Korea, Singapore, and the United States with SSH/VNC so SREs can mirror launchd units and load generators without buying spare metal.