What are auto-rollback monitors?

Auto-rollback monitors are monitors that observe the health of a service or its hosts. Typically they watch metrics such as CPU utilization, error rate, or any combination of metrics that indicates whether a service is operating as expected. They can be as simple or as complex as desired, but they should go into an alarm state when something is wrong in the service.
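To make the idea of an alarm state concrete, here is a minimal sketch of a monitor that combines two metrics. The names (HealthSample, in_alarm) and the thresholds are assumptions chosen for illustration, not values from any particular monitoring system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HealthSample:
    """One data point of service health (hypothetical shape)."""
    cpu_utilization: float  # percent, 0-100
    error_rate: float       # fraction of failed requests, 0.0-1.0


def in_alarm(
    samples: Sequence[HealthSample],
    max_cpu: float = 90.0,          # assumed threshold
    max_error_rate: float = 0.05,   # assumed threshold
    breaches_required: int = 3,     # assumed number of consecutive bad samples
) -> bool:
    """Return True when the most recent samples all breach a threshold.

    Requiring several consecutive breaches keeps a single noisy data
    point from triggering a rollback.
    """
    recent = samples[-breaches_required:]
    if len(recent) < breaches_required:
        return False  # not enough data yet to decide
    return all(
        s.cpu_utilization > max_cpu or s.error_rate > max_error_rate
        for s in recent
    )
```

A real monitor would usually be defined in the monitoring system itself rather than in application code, but the logic is the same: thresholds on metrics, evaluated over a window, producing a binary alarm state.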

How are they used?

As the name implies, auto-rollback monitors are used to trigger rollbacks of deployments automatically. When changes are deployed to hosts, it’s important that those changes don’t cause a regression in the service. Ideally regressions would be caught by tests, but testing is not foolproof. So, while changes are being deployed to hosts, and ideally for a set amount of time afterwards, the auto-rollback monitors watch the health of the service. If a monitor goes into alarm, it automatically triggers a rollback.
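The flow described above can be sketched roughly as follows. This is not how any specific deployment tool implements it. The function name and the callbacks (deploy_to_host, roll_back, monitor_in_alarm) are hypothetical stand-ins for whatever your pipeline provides, and the timing defaults are assumptions.

```python
import time
from typing import Callable, List, Sequence


def deploy_with_auto_rollback(
    hosts: Sequence[str],
    deploy_to_host: Callable[[str], None],     # hypothetical: pushes the change to one host
    roll_back: Callable[[List[str]], None],    # hypothetical: reverts the given hosts
    monitor_in_alarm: Callable[[], bool],      # hypothetical: reads the monitor state
    bake_seconds: float = 300.0,               # assumed watch period after deployment
    poll_seconds: float = 10.0,                # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> bool:
    """Deploy host by host and roll back if the monitor goes into alarm.

    Returns True if the deployment survived the watch period, False if
    it was rolled back.
    """
    deployed: List[str] = []

    # Phase 1: check the monitor after each host receives the change.
    for host in hosts:
        deploy_to_host(host)
        deployed.append(host)
        if monitor_in_alarm():
            roll_back(deployed)
            return False

    # Phase 2: keep watching for a set amount of time afterwards.
    deadline = now() + bake_seconds
    while now() < deadline:
        if monitor_in_alarm():
            roll_back(deployed)
            return False
        sleep(poll_seconds)

    return True
```

The sleep and now parameters are injectable only so the loop can be exercised in a test without actually waiting.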

Why are these monitors important?

They are the last line of defense before deploying bugs to all of production

As described earlier, testing does not catch every bug or regression. Anyone who has been involved in software development is sadly aware of how poor test coverage can be. Auto-rollback monitors act as a last line of defense before regressions can affect all hosts. Monitoring the performance and health of a service while changes are deploying is in some ways another form of testing.

They free up engineer bandwidth

It is unreasonable to expect engineers to actively monitor health metrics as their changes are deployed. There are far too many ways engineers can better spend their time. With continuous deployment pipelines, there are multiple changes flowing through the pipeline at any given point in time. It just doesn’t make sense to have a human monitor the metrics. It would be a full-time job! Auto-rollback monitors give engineers the peace of mind to focus on other, more productive tasks, knowing that the monitors will roll back the change if there is any issue. The engineers get notified of the rollback and can address the faulty change without worrying about mitigation.

They mitigate impact faster than a human could

Along the same lines, even if an engineer is watching a deployment, there is no guarantee that they will notice a regression early, or at all. The impact of a regression can grow the longer it stays deployed in production. Because these monitors initiate a rollback automatically as soon as they go into an alarm state, the impact of the regression is mitigated as soon as possible.

Rules of Thumb

These are a few rules of thumb that I’ve created for working with auto-rollback monitors:

  1. Deployments to any environment that is serving production traffic should have auto-rollback monitors configured for them.
  2. Deployments should have a small bake time window during which the changes have been deployed successfully but the auto-rollback monitors can still initiate a rollback if the health of the service degrades.
  3. Auto-rollback monitors should monitor metrics only for the environments that they would roll back. In other words, an auto-rollback monitor for a North American environment should not monitor metrics for all regions, just the North American fleet.
  4. Rollbacks can fail. Be sure to check the status of the fleet after being notified of a rollback. There are times when hosts can be stuck in a failed state and need to be replaced.
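
Rules 3 and 4 lend themselves to a small sketch. The data shapes below (a per-region map of request counts and a HostStatus record) are assumptions made up for illustration. Substitute whatever your metrics and fleet tooling actually return.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class HostStatus:
    """Hypothetical snapshot of one host after a rollback."""
    host: str
    region: str
    version: str
    healthy: bool


def regional_error_rate(
    request_counts: Dict[str, Tuple[int, int]],  # region -> (failed, total)
    region: str,
) -> float:
    """Rule 3: compute the error rate for one region only.

    A problem in another region must not roll back this region, and a
    healthy global average must not hide a problem here.
    """
    failed, total = request_counts[region]
    return failed / total if total else 0.0


def hosts_needing_attention(
    fleet: Sequence[HostStatus],
    region: str,
    expected_version: str,
) -> List[str]:
    """Rule 4: after a rollback, list hosts in the region that are
    unhealthy or not running the version the rollback should restore."""
    return [
        h.host
        for h in fleet
        if h.region == region
        and (not h.healthy or h.version != expected_version)
    ]
```

Any host returned by hosts_needing_attention is a candidate for a manual look or for replacement, since the rollback did not leave it in the expected state.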

Fun Fact

This post was inspired by an incident that occurred on 04/19/2024, in which a pipeline did not have an auto-rollback monitor associated with it and a change made by another team caused a regression. Around 1.3 million notifications failed to send to customers before the issue was noticed and mitigated by the on-call engineer (me). The lack of auto-rollback monitors was a contributing factor to this incident.