When Your AI's Alarm Clock Fails, Who Watches the Watchdog?
A morning of failed cron jobs, a model fallback that didn't fire, and why the heartbeat is a better watchdog than a dedicated one.
Monday morning. I noticed Hank hadn't sent a morning brief. I checked Telegram — nothing. A few minutes later I saw the error that had been delivered to a system topic: "The AI service is temporarily overloaded. Please try again in a moment." The 6 AM cron job had hit Anthropic's API at the wrong time, gotten back a 529, and died. No retry. No fallback. Just silence.
Then I checked the improvement sprint that runs at 10 AM. Same error. Same result.
Two cron jobs down in the same morning window, both for the same reason, neither of which I knew about until I went looking. That's the kind of failure that makes you want to build something.
The First Instinct: A Watchdog Cron Job
The obvious first move was a watchdog. A separate cron job that runs at 6:30 AM, checks if today's daily note exists, and reruns the daily-note job if it doesn't. Simple. Targeted. Done in five minutes.
We built it. It worked. And then we immediately talked ourselves out of it.
The problem with a dedicated watchdog cron job is that it only watches one thing. As soon as I add another cron job I care about — and I keep adding them — I have to remember to also add a watchdog for it. The watchdog and the watched job have to stay in sync forever. That's maintenance overhead that compounds quietly in the background until the day you forget and it bites you.
There's also a subtler issue: a watchdog cron job is a fixed solution to a dynamic problem. It checks for a specific file. If the failure mode changes — different job, different error, different recovery path — the watchdog doesn't adapt. You'd need to rewrite it.
What About Model Fallback?
Before scrapping the watchdog entirely, I wanted to understand why the overload error didn't trigger a fallback to a different model. OpenClaw supports model fallback via agents.defaults.model.fallbacks — if a provider fails, it moves to the next model in the list.
Reading the docs carefully, fallback triggers on: auth failures, rate limits (HTTP 429), and timeouts that exhausted profile rotation. The overload error is an HTTP 529 — a server-side capacity error, not a rate limit. The docs are explicit: "other errors do not advance fallback."
HTTP 429 — Rate Limited — Your request was rejected because you've exceeded your quota. Fallback fires.
HTTP 529 — Overloaded — The provider's infrastructure is at capacity. Classified as 'other error.' Fallback does not fire.
This is an important distinction. Model fallback is not a general resilience mechanism — it's specifically for auth and rate limit scenarios. If you're counting on fallback to protect you against provider outages or capacity crunches, you're going to be surprised the first time one happens.
Model fallback is still worth configuring. It just doesn't solve the overload case.
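For reference, a fallback chain lives in the OpenClaw config and looks roughly like this — the exact schema is in the OpenClaw docs, and the model ids here are illustrative, not a recommendation:

```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4",
        "fallbacks": [
          "openai/gpt-4o",
          "google/gemini-pro"
        ]
      }
    }
  }
}
```

Worth having in place for the auth and rate-limit cases it does cover — just don't mistake it for outage protection.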
Failure Alerts: The Fast Layer
The first concrete fix was simple: turn on failure alerts for every critical job. OpenClaw supports --failure-alert on cron jobs — when a job errors, it sends a Telegram message immediately.
```shell
openclaw cron edit <id> \
  --failure-alert \
  --failure-alert-after 1 \
  --failure-alert-channel telegram \
  --failure-alert-to <chat-id>
```
This doesn't prevent failures. It just means I know within seconds instead of hours. That's valuable on its own — if I'd had failure alerts this morning, I would have seen the overload error before I'd even finished my first cup of coffee and could have manually triggered a retry.
Failure alerts are the fast layer. They're reactive, not preventive. But awareness is the prerequisite for everything else.
The Better Answer: Heartbeat Health Check
The real fix came from stepping back and asking a better question. I already have a heartbeat system — a periodic poll that runs every 30 minutes and checks on things. The heartbeat reads HEARTBEAT.md and follows instructions. It's already running. It already has judgment. It already knows how to message me.
Why add a dedicated watchdog cron at all?
Instead, we added a cron health check section to HEARTBEAT.md. Every heartbeat, Hank runs openclaw cron list --json — which returns every cron job with its last run status — and checks for failures. If a job errored and was scheduled to have run today, Hank decides what to do:
Transient error (overload, timeout, network) — Retry immediately via openclaw cron run. Then message Paul with what happened and that it's been re-triggered.
Non-transient error (edit conflict, script failure, logic error) — Don't retry blindly. Message Paul with the error and an assessment of what went wrong. Offer to investigate.
Already retried recently — Check heartbeat-state.json. If a retry happened less than 2 hours ago, skip — don't hammer a failing job.
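In HEARTBEAT.md this lives as natural-language instructions that Hank interprets, not code. But the decision procedure is roughly the following sketch — the JSON field names, error-string markers, and state-file shape are all assumptions for illustration:

```python
import json
import subprocess
import time

STATE_FILE = "heartbeat-state.json"   # retry bookkeeping (assumed shape)
RETRY_COOLDOWN = 2 * 60 * 60          # don't re-retry within 2 hours
TRANSIENT_MARKERS = ("overload", "timeout", "timed out", "network", "529")


def classify(error: str) -> str:
    """Transient errors are safe to retry; everything else gets escalated."""
    msg = error.lower()
    return "transient" if any(m in msg for m in TRANSIENT_MARKERS) else "non-transient"


def check_cron_health(now=None):
    """Scan every cron job and decide an action for each failed one.

    Returns (job_id, action) pairs: 'retry', 'escalate', or 'skip'.
    The --json field names here are guesses at OpenClaw's output shape.
    """
    now = now or time.time()
    with open(STATE_FILE) as f:
        state = json.load(f)
    jobs = json.loads(subprocess.run(
        ["openclaw", "cron", "list", "--json"],
        capture_output=True, text=True, check=True).stdout)
    actions = []
    for job in jobs:
        if job.get("lastStatus") != "error":
            continue
        last_retry = state.get("retries", {}).get(job["id"], 0)
        if now - last_retry < RETRY_COOLDOWN:
            actions.append((job["id"], "skip"))      # don't hammer a failing job
        elif classify(job.get("lastError", "")) == "transient":
            actions.append((job["id"], "retry"))     # openclaw cron run <id>
        else:
            actions.append((job["id"], "escalate"))  # message Paul instead
    return actions
```

The classification step is the judgment call; everything else is bookkeeping.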
The key detail: openclaw cron list --json returns all jobs. Not a hardcoded list — everything. Every cron job I ever add is automatically covered by the health check with zero changes to HEARTBEAT.md. The watchdog scales with the system.
Pros and Cons
This approach has real advantages, but it's worth being honest about the tradeoffs.
✅ Pro — Self-scaling — New cron jobs are covered automatically. No watchdog to create, no list to maintain.
✅ Pro — Faster recovery — Heartbeat runs every 30 minutes. A dedicated 6:30 AM watchdog only helps once a day.
✅ Pro — Intelligent triage — Heartbeat can distinguish transient from non-transient errors and respond differently to each.
✅ Pro — One place to maintain — All recovery logic lives in HEARTBEAT.md, not scattered across N watchdog cron jobs.
⚠️ Con — Same provider dependency — The heartbeat runs on the same Anthropic model that might be overloaded. If the outage is widespread, the heartbeat itself could be affected.
⚠️ Con — Adds complexity to the heartbeat — HEARTBEAT.md is already doing a lot — weather, calendar, WHOOP, news. Adding cron health checks makes it longer and heavier.
⚠️ Con — Triage can be wrong — The heartbeat classifies errors before retrying — transient gets retried, non-transient gets escalated. But automated classification isn't perfect. A logic error that surfaces as a generic failure message could get misread as transient and retried when it shouldn't be. To be fair, a dedicated watchdog cron would have the same problem — this is a limitation of automated error triage in general, not specific to the heartbeat approach.
The single-provider dependency is the one I think about most. If Anthropic has a broad outage and both the cron job and the heartbeat are hitting the same endpoint, the heartbeat can't save what it can't reach. In practice, the failure alerts are the safety net for that scenario — I'd see the alert on my phone even if the heartbeat couldn't process it.
What We Shipped
By the end of the conversation, two additions had shipped and one thing had been deleted:
Failure alerts — All critical cron jobs now alert immediately on the first error. Fast awareness layer.
HEARTBEAT.md cron health check — Every heartbeat checks all cron jobs, retries transients, escalates non-transients, tracks retries in heartbeat-state.json.
The watchdog cron job we initially built got deleted. The heartbeat does the job better.
The Bigger Pattern
What I keep finding with this system is that the instinct to build a dedicated tool for every specific problem is usually wrong. The better move is to extend the ambient infrastructure that already exists.
A watchdog cron job is specific. The heartbeat is general. General wins, because the problem space always expands.
The same logic applies to model fallback. It's tempting to treat it as a resilience catch-all — configure it and assume you're covered. But fallback is a specific mechanism for specific failure modes. Understanding exactly what it protects against (rate limits, auth failures) and what it doesn't (overload, logic errors) is what lets you build the right supplementary layers around it.
Automation is only as reliable as your understanding of its failure modes.
Tools: OpenClaw · Telegram · HEARTBEAT.md · openclaw cron
Previous post: My AI Sends Me a Morning Brief →
Originally published at https://www.paulbrennaman.me/lab/cron-health-watchdog