# Alerting

## **Purpose of Alerting**

* **Goal**: Turn SLOs (Service Level Objectives) into actionable alerts.
* **Benefit**: Respond to issues before consuming too much of the error budget.

***

## **Alerting Considerations**

* **Precision**: The proportion of alerts that correspond to significant events; low precision means many false alerts.
* **Recall**: The proportion of significant events that trigger an alert.
* **Detection Time**: How long it takes to alert once an issue starts; shorter is better.
* **Reset Time**: How long an alert keeps firing after the issue is fixed; shorter is better.
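
Precision and recall are ratios, so a little arithmetic makes them concrete. The sketch below is illustrative only: the function names and the counts in the usage line are assumptions, not figures from these notes.

```python
def precision(true_alerts: int, false_alerts: int) -> float:
    """Share of fired alerts that matched a significant event."""
    fired = true_alerts + false_alerts
    # With no alerts fired, there were no false alerts either.
    return true_alerts / fired if fired else 1.0


def recall(true_alerts: int, missed_events: int) -> float:
    """Share of significant events that produced an alert."""
    events = true_alerts + missed_events
    return true_alerts / events if events else 1.0


if __name__ == "__main__":
    # Hypothetical month: 8 useful alerts, 2 false alarms, 8 missed events.
    print(f"precision={precision(8, 2):.2f} recall={recall(8, 8):.2f}")
```

In this made-up month the alert is fairly precise (8 of 10 alerts were real) but recalls only half of the significant events.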

***

## **Six Approaches to Alerting**

**1. Target Error Rate ≥ SLO Threshold**

* **Implementation**: Alert when the error rate over a short window (e.g., 10 minutes) reaches the error rate the SLO allows (e.g., 0.1% for a 99.9% SLO).
* **Pros**: Simple, quick detection.
* **Cons**: Poor precision: the alert fires on minor events that do not threaten the SLO.
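
A minimal sketch of this first approach, assuming the error and request counts for the window are already available. The function names and the 30-day SLO period are assumptions made for illustration.

```python
def error_rate_alert(errors: int, requests: int, slo: float) -> bool:
    """Fire when the windowed error rate reaches the rate the SLO allows."""
    if requests == 0:
        return False  # no traffic in the window, nothing to judge
    allowed = 1.0 - slo
    return errors / requests >= allowed


def budget_spent(window_minutes: float, period_days: float = 30.0) -> float:
    """Fraction of the period's error budget consumed when the error rate
    sits exactly at the threshold for one whole window."""
    return window_minutes / (period_days * 24 * 60)


if __name__ == "__main__":
    print(error_rate_alert(errors=2, requests=1000, slo=0.999))
    print(f"{budget_spent(10):.4%} of the budget per 10-minute window")
```

Ten minutes at the threshold error rate uses roughly 0.02% of a 30-day budget, which is why this alert is noisy: it pages for events that barely dent the budget.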

**2. Increased Alert Window**

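* **Implementation**: Increase the alert window (e.g., to 36 hours, which is 5% of a 30-day SLO period) so the alert fires only when a meaningful share of the budget is at risk.
* **Pros**: Better precision than short windows.
* **Cons**: Very slow reset time (a short outage keeps the alert firing until it ages out of the window), and computing rates over long windows costs more memory or I/O.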

**3. Incrementing Alert Duration**

* **Implementation**: Only trigger alerts if the error rate stays above the threshold for a set duration.
* **Pros**: Better precision for sustained issues.
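* **Cons**: Poor recall and slow detection for severe issues, because the required duration does not scale with severity: a total outage waits as long as a marginal one.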

**4. Alert on Burn Rate**

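* **Definition**: Burn rate = how fast the service consumes the error budget relative to the SLO. A burn rate of 1 uses up the budget exactly at the end of the SLO period.
* **Implementation**: Alert when the burn rate over a window exceeds a critical threshold (e.g., a burn rate of 36 over 1 hour, which consumes 5% of a 30-day budget).
* **Pros**: Good precision and detection time.
* **Cons**: Low recall: a sustained burn rate just below the threshold never alerts, yet it can still exhaust the budget.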
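
The burn rate arithmetic can be sketched as follows. The function names are hypothetical, and a 30-day (720-hour) SLO period is assumed.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1.0 - slo)


def hours_until_exhausted(rate: float, period_hours: float = 720.0) -> float:
    """Time for a constant burn rate to use up the whole budget."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return period_hours / rate


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 720.0) -> float:
    """Fraction of the budget used by burning at `rate` for `window_hours`."""
    return rate * window_hours / period_hours


if __name__ == "__main__":
    rate = burn_rate(error_rate=0.036, slo=0.999)
    print(f"burn rate ~{rate:.1f}, "
          f"exhausted in ~{hours_until_exhausted(rate):.1f} h, "
          f"{budget_consumed(rate, 1):.1%} of budget per hour")
```

At a burn rate of 36 the budget lasts 20 hours, and one hour at that rate costs 5% of it. That is the link between the threshold of 36 and the budget it protects.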

**5. Multiple Burn Rate Alerts**

* **Implementation**: Use multiple burn rates for different error severities (e.g., 2% of the budget in 1 hour and 5% in 6 hours, which for a 30-day SLO period are burn rates of 14.4 and 6).
* **Pros**: Adaptive alerting for both fast and slow errors; prioritizes alerts based on severity.
* **Cons**: More complex to manage multiple burn rates and windows.
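
Each "X% of the budget in Y hours" target translates into a burn rate threshold. The sketch below shows that conversion under an assumed 30-day (720-hour) SLO period; `threshold_for` and `TARGETS` are hypothetical names.

```python
def threshold_for(budget_fraction: float, window_hours: float,
                  period_hours: float = 720.0) -> float:
    """Burn rate that consumes `budget_fraction` of the budget
    within `window_hours`."""
    return budget_fraction * period_hours / window_hours


# (budget fraction, window in hours) pairs taken from the example above.
TARGETS = [(0.02, 1.0), (0.05, 6.0)]

if __name__ == "__main__":
    for fraction, window in TARGETS:
        rate = threshold_for(fraction, window)
        print(f"{fraction:.0%} in {window:g} h -> burn rate {rate:.1f}")
```

With a 30-day period these come out to burn rates of 14.4 and 6. A different SLO period changes the thresholds, so it is worth deriving them rather than copying numbers.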

**6. Multiwindow, Multi-Burn-Rate Alerts (Recommended)**

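* **Implementation**: Combine short and long windows with multiple burn rates (e.g., alert only if both the 1-hour and the 5-minute window exceed the burn rate threshold). The short window confirms the budget is still being consumed, so the alert stops soon after the errors stop.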
* **Pros**: Best approach for managing precision, recall, detection, and reset times; reduces false positives.
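
A minimal sketch of the two-window condition, assuming burn rates per window have already been computed elsewhere. The `Rule` type, the window labels, and the threshold of 14.4 (2% of a 30-day budget in 1 hour) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    long_window: str    # e.g. "1h"
    short_window: str   # e.g. "5m"
    threshold: float    # burn rate both windows must exceed


def fires(rule: Rule, burn_by_window: dict[str, float]) -> bool:
    """Alert only when the long AND the short window are above threshold."""
    return (burn_by_window[rule.long_window] > rule.threshold
            and burn_by_window[rule.short_window] > rule.threshold)


PAGE = Rule(long_window="1h", short_window="5m", threshold=14.4)

if __name__ == "__main__":
    ongoing = {"1h": 20.0, "5m": 25.0}    # incident still in progress
    recovered = {"1h": 20.0, "5m": 0.5}   # errors stopped minutes ago
    print(fires(PAGE, ongoing), fires(PAGE, recovered))
```

The second case shows the reset-time benefit: the 1-hour window still looks bad, but the 5-minute window has recovered, so the alert no longer fires.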

***

## **Handling Low-Traffic Services**

* **Problem**: With few requests, a single failure produces a high error rate, so low-traffic services are prone to false alerts.
* **Solutions**:
  * **Generate artificial traffic** to simulate user activity.
  * **Combine smaller services** into a larger group for monitoring, so the combined traffic smooths out individual failures.
  * **Modify product design** to reduce impact of individual failed requests.
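
The sensitivity problem is easy to see with numbers. The sketch below estimates what one failed request does to an hourly window; the traffic figures, the 99.9% SLO, and the 30-day period are illustrative assumptions.

```python
def single_failure_impact(requests_per_hour: int, slo: float,
                          period_hours: float = 720.0) -> tuple[float, float, float]:
    """Return (error rate, burn rate, budget fraction consumed) when exactly
    one request fails in an otherwise healthy hour.

    The budget scales with traffic: it is (1 - slo) of all requests in the
    period, and traffic is assumed constant.
    """
    error_rate = 1 / requests_per_hour
    rate = error_rate / (1.0 - slo)
    consumed = rate * 1.0 / period_hours
    return error_rate, rate, consumed


if __name__ == "__main__":
    for rph in (10, 10_000):
        err, rate, used = single_failure_impact(rph, slo=0.999)
        print(f"{rph:>6} req/h: error rate {err:.2%}, "
              f"burn rate {rate:.1f}, budget used {used:.2%}")
```

At 10 requests per hour, one failure is a 10% error rate and a burn rate of about 100, enough to page under most thresholds. At 10,000 requests per hour the same failure is a burn rate of about 0.1.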

***

## **Handling Extreme Availability Goals**

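* **Low Availability**: E.g., 90% availability. The error budget is so large that a burn rate threshold can become unreachable: at 90%, even a 100% outage is only a burn rate of 10, so an alert set at a higher burn rate never fires.
* **High Availability**: E.g., 99.999% availability. A 100% outage depletes a 30-day error budget in about 26 seconds, faster than anyone can respond to an alert.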
  * **Solution**: Design systems to avoid 100% outages or implement gradual rollouts (e.g., canarying).
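
Both extremes follow from simple arithmetic on the SLO. The sketch below assumes a 30-day SLO period; the function names are hypothetical.

```python
def seconds_to_exhaust_in_total_outage(slo: float,
                                       period_days: float = 30.0) -> float:
    """How long a 100% outage takes to use up the whole error budget."""
    return (1.0 - slo) * period_days * 24 * 3600


def max_burn_rate(slo: float) -> float:
    """Highest possible burn rate, reached when every request fails."""
    return 1.0 / (1.0 - slo)


if __name__ == "__main__":
    print(f"99.999%: budget gone in "
          f"{seconds_to_exhaust_in_total_outage(0.99999):.1f} s of total outage")
    print(f"90%: burn rate can never exceed {max_burn_rate(0.90):.1f}")
```

For a 90% target, any alert threshold above a burn rate of 10 can never fire, so thresholds have to be chosen with the ceiling in mind.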

***

## **Scalable Alerting Framework**

* **Avoid Custom Parameters**: Don't hand-tune alert parameters for every individual microservice.
* **Group Requests into Buckets**:
  * **Critical**: E.g., login requests (99.99% availability).
  * **High Priority**: E.g., user interaction (99.9% availability).
  * **Low Priority**: Non-urgent requests with minimal user impact.
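
One way to express the bucket idea is a shared table that maps request classes to buckets, and buckets to targets, so alert parameters are defined once per bucket rather than per service. This is a sketch under stated assumptions: the request class names are hypothetical, and the 99% target for the low-priority bucket is a placeholder, since these notes do not give one.

```python
from enum import Enum


class Bucket(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    LOW = "low"


# Availability target per bucket. LOW is an assumed placeholder value.
AVAILABILITY_TARGET = {
    Bucket.CRITICAL: 0.9999,
    Bucket.HIGH: 0.999,
    Bucket.LOW: 0.99,
}

# Hypothetical request classes; a real system would define its own.
REQUEST_CLASS_BUCKET = {
    "login": Bucket.CRITICAL,
    "user_interaction": Bucket.HIGH,
    "batch_export": Bucket.LOW,
}


def target_for(request_class: str) -> float:
    """Look up the availability target for a request class.

    Unknown classes fall back to LOW here; that default is a design
    choice of this sketch, not a rule from the notes.
    """
    bucket = REQUEST_CLASS_BUCKET.get(request_class, Bucket.LOW)
    return AVAILABILITY_TARGET[bucket]


if __name__ == "__main__":
    for name in ("login", "user_interaction", "batch_export"):
        print(f"{name}: {target_for(name):.2%}")
```

Adding a new service then means classifying its requests into existing buckets, not inventing new alert thresholds.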

***

## **Conclusion**

* **Best Strategy**: Multiwindow, multi-burn-rate alerting is the most reliable way to defend SLOs.
* **Objective**: Alert only on actionable, significant events that threaten the error budget.
