OpsDevAI
Insights
Infrastructure 6 min read

Self-healing clusters: the boring version of an exciting idea

Auto-remediation isn't an AI fantasy — it's a tight feedback loop between detection, attribution, and policy. Here's how we built ours.

MI
Mumitul Islam Mumit
Founder, OpsDevAI

Every vendor with a logo claims self-healing. Most of them mean 'we'll restart the pod.' That's not self-healing — that's hope, dressed up.

The three loops

A real self-healing cluster runs three loops at once. A detection loop that watches the right signals (not just CPU and memory — request shape, dependency latency, eviction reason). An attribution loop that decides who caused the problem. And a policy loop that decides what's safe to do about it.

Detection is mostly about restraint

The hard part of detection isn't catching incidents — it's not catching everything. A cluster with a noisy detector becomes a cluster that ignores its own alarms. We tune detection budgets the way you'd tune SLOs: how many remediations per hour is the cluster allowed to attempt before a human is paged?

Attribution is the actual product

'Who caused this?' sounds boring. It isn't. Bad attribution causes the wrong workload to get evicted, the wrong region to get cordoned, the wrong oncall to get paged. Good attribution turns a six-page postmortem into a one-line changelog entry.

Policy is where you bake in your taste

Every team has a different appetite for risk. The policy loop is the lever you give them: which remediations are auto-applied, which are proposed and require a click, which never run without a human. OpsDevAI ships defaults — and gets out of the way as you tighten them.

Read next

Predictive autoscaling, explained without the buzzwords

Reactive autoscaling is a thermostat. Predictive autoscaling is a weather forecast. Here's why that distinction shows up in your p99.