Perfecting your team's alerts, but watch your back

July 16, 2025


I created and own the alerts and strategy for Square’s web checkout, which processes over $7 billion annually for sellers all over the world. I have been responsible for saving customers tens of millions of dollars, all the while ruining Christmas, birthdays, and any peace and quiet you may hold dear. If I were on your team, you might hate what I do, but that’s just the job description of owning alerts.

Let’s get you started on your alerting journey.

Assign an owner

Firstly, if your team does not have a DRI (Directly Responsible Individual—also known as “the owner”) for your alerts, you need one. You can never have accurate and quiet alerts if there is no one to take charge and learn from their mistakes. Better yet, create documentation of your journey and onboard team members to your alerting strategy to reduce the all-important bus factor; but in the end, someone must care, be responsible, and learn.

What to measure

You may feel overwhelmed by the amount of data you need to measure. Should you measure application health upstream, downstream, or maybe memory, CPU utilization, or number of coffees ingested?

First, back up and identify all P0 and P1 Jobs To Be Done (JTBD) that are critical to your users, such as “take a payment”, “add to cart”, “withdraw money”. Write them all down. Now for something controversial.

We need to measure degradation of these JTBD; however, we should not measure memory usage, CPU utilization, or network throughput. Instead, measure the rate of successful job events — the ratio of successful events to valid events; this will most often come from API responses. CPU, memory, and the network—even if they are hard dependencies—don't tell you when users are actually unable to finish their job. Start and end with user impact; everything else is a distraction until you have perfected your alerts.

Ensure you capture the full flow of P0 jobs your user may need. For ecommerce you may have the jobs: “page load” → “add to cart” → “validate order” → “take payment”.

How to measure

Each metric from your JTBD will have different volume and characteristics; some may have such low volume that you should question whether they are still P0. For the sake of your sanity, put these jobs and their metrics in a sheet and note which metrics are reliable, such as paying for an item, and which ones are questionable, such as “add to cart”. This sheet will inform how aggressively you can set your alerts and should serve as your primary source of truth.
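A rough sketch of that classification, with hypothetical volume thresholds (yours will depend on your traffic):

```python
def metric_reliability(events_per_hour: float) -> str:
    """Classify how trustworthy a JTBD metric is for alerting, by volume.

    Low-volume metrics produce jumpy success ratios, so alerts on them
    must use wider thresholds and longer evaluation windows.
    """
    if events_per_hour >= 1000:
        return "reliable"      # tight thresholds, page immediately
    if events_per_hour >= 100:
        return "questionable"  # wider thresholds, longer windows
    return "unreliable"        # consider a ticket instead of a page

# Example sheet built from observed (hypothetical) hourly volumes
jobs = {"take_payment": 4200, "add_to_cart": 150, "withdraw_money": 12}
sheet = {job: metric_reliability(volume) for job, volume in jobs.items()}
```

The output of something like this is exactly the sheet described above: one row per job, with a reliability label that sets how aggressive its alerts can be.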

How to alert

Alerting on a metric is not about a single alert but a series of alerts that capture the behaviors of the metric; for example, your alerts might be accurate during the day but noisy at night. Because of this, your alerts and strategies will differ from mine. However, one thing will always be the same.
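To make the day/night example concrete, here is a sketch of a pair of alerts on one metric, with hypothetical thresholds and assumed business hours; night traffic is low, so the night alert tolerates a deeper dip and demands more evidence before paging:

```python
from datetime import datetime, timezone

def should_page(success_rate: float, valid_events: int, now: datetime) -> bool:
    """Two alerts on the same metric: a tight daytime one, a looser nighttime one.

    Thresholds and hours here are illustrative, not recommendations.
    """
    daytime = 8 <= now.astimezone(timezone.utc).hour < 22  # assumed business hours
    if daytime:
        # High volume: a small dip is real. Page early.
        return success_rate < 0.99 and valid_events >= 100
    # Low volume: ratios are jumpy. Require a deeper dip and more events.
    return success_rate < 0.95 and valid_events >= 20
```

In practice each condition would be a separate alert in your monitoring tool, scoped to its time window; the point is that one metric earns several alerts, each tuned to how the metric behaves in that window.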

ALERTS MUST BE ACTIONABLE

If an alert goes off and there is not a slight sense of “oh fuck”, the alerts are too noisy. Alerts should only fire if there is some action to take—whether that be fixing the alert, filing a bug, or declaring a SEV and getting all hands on deck.

My alerting strategy

todo