# Alerting

## **Purpose of Alerting**

* **Goal**: Turn SLOs (Service Level Objectives) into actionable alerts.
* **Benefit**: Respond to issues before consuming too much of the error budget.

***

## **Alerting Considerations**

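* **Precision**: The proportion of alerts that correspond to significant events; high precision means few false alerts.
* **Recall**: The proportion of significant events that trigger an alert; high recall means few missed events.
* **Detection Time**: How long it takes to send a notification once an issue starts; shorter is better.
* **Reset Time**: How long an alert keeps firing after the issue is resolved; shorter is better.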
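
Precision and recall can be made concrete with a quick calculation. The sketch below assumes that, over some review period, you have counted the alerts that matched real incidents, the alerts that were noise, and the incidents that produced no alert. The function names and sample counts are illustrative and do not come from any particular tool.

```python
def precision(true_alerts: int, false_alerts: int) -> float:
    """Fraction of fired alerts that corresponded to a significant event."""
    fired = true_alerts + false_alerts
    return true_alerts / fired if fired else 1.0


def recall(true_alerts: int, missed_events: int) -> float:
    """Fraction of significant events that produced an alert."""
    events = true_alerts + missed_events
    return true_alerts / events if events else 1.0


# Hypothetical month: 8 alerts matched real incidents, 2 were noise,
# and 1 incident never triggered an alert.
print(precision(true_alerts=8, false_alerts=2))  # 0.8
print(f"{recall(true_alerts=8, missed_events=1):.3f}")
```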

***

## **Six Approaches to Alerting**

**1. Target Error Rate ≥ SLO Threshold**

* **Implementation**: Trigger an alert if the error rate over a short window (e.g., 10 minutes) reaches the SLO threshold, i.e., the allowed error rate of 1 - SLO.
* **Pros**: Simple, quick detection.
* **Cons**: Poor precision: fires on minor issues that consume only a tiny fraction of the error budget.
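
A minimal sketch of this check, assuming a 30-day SLO period and request counts taken from a single short window. The helper names and the sample numbers are hypothetical; the second function is included only to show how little budget such a window can represent.

```python
PERIOD_MINUTES = 30 * 24 * 60  # assumption: 30-day SLO period


def should_alert(errors: int, total: int, slo: float) -> bool:
    """Approach 1: fire when the windowed error rate reaches 1 - SLO."""
    if total == 0:
        return False
    return errors / total >= 1.0 - slo


def budget_fraction_consumed(errors: int, total: int, slo: float,
                             window_minutes: float) -> float:
    """Share of the period's error budget spent during this one window."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo) * window_minutes / PERIOD_MINUTES


# Hypothetical 10-minute window: 12 failures out of 10,000 requests, SLO 99.9%.
print(should_alert(12, 10_000, slo=0.999))  # True
print(f"{budget_fraction_consumed(12, 10_000, 0.999, window_minutes=10):.4%}")
```

In this example the alert fires even though the window spent well under 0.1% of the budget, which is the precision problem noted in the cons.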

**2. Increased Alert Window**

* **Implementation**: Increase the alert window (e.g., 36 hours) to improve precision.
* **Pros**: Better precision than short windows.
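* **Cons**: Long reset times (the alert can keep firing for most of the window after recovery) and higher memory or I/O costs for computing rates over long windows.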

**3. Incrementing Alert Duration**

* **Implementation**: Only trigger alerts if the error rate stays above the threshold for a set duration.
* **Pros**: Better precision for sustained issues.
* **Cons**: Poor recall and slow detection time for severe issues.
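
The duration condition can be sketched as a streak check over recent samples. This assumes one error-rate sample per minute and a hypothetical 99.9% SLO; `sustained_breach` is an illustrative name, not part of any monitoring system.

```python
from typing import Sequence


def sustained_breach(samples: Sequence[float], threshold: float,
                     required: int) -> bool:
    """Fire only if the latest `required` samples are all at or above `threshold`."""
    if required <= 0 or len(samples) < required:
        return False
    return all(rate >= threshold for rate in samples[-required:])


THRESHOLD = 0.001  # allowed error rate for a hypothetical 99.9% SLO

# Ten consecutive one-minute samples above the threshold: the alert fires.
print(sustained_breach([0.002] * 10, THRESHOLD, required=10))  # True

# A total outage lasting five minutes, then recovery: the alert never fires.
print(sustained_breach([1.0] * 5 + [0.0] * 5, THRESHOLD, required=10))  # False

# A single good sample in the middle resets the streak.
print(sustained_breach([0.002] * 5 + [0.0] + [0.002] * 9, THRESHOLD, required=10))  # False
```

The second case shows the recall problem: a severe but short outage is ignored, and even a sustained one is detected no sooner than the required duration.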

**4. Alert on Burn Rate**

* **Definition**: Burn rate = how fast the service consumes the error budget.
* **Implementation**: Alert when the burn rate exceeds a critical threshold (e.g., a burn rate of 36 over a 1-hour window, which spends 5% of a 30-day error budget).
* **Pros**: Good precision and detection time.
* **Cons**: Low recall for slower burns: a sustained burn rate just below the threshold never alerts, yet still exhausts the error budget.
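
The burn rate arithmetic can be sketched as follows, assuming a 30-day SLO period. A burn rate of 1 spends the budget exactly over the period; the function names here are illustrative.

```python
PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period


def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the service is failing."""
    return error_rate / (1.0 - slo)


def burn_rate_for_budget(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget within `window_hours`."""
    return budget_fraction * PERIOD_HOURS / window_hours


def hours_to_exhaustion(rate: float) -> float:
    """Time until the whole budget is gone if the burn rate stays constant."""
    return PERIOD_HOURS / rate


# Spending 5% of the budget in one hour corresponds to a burn rate of 36.
print(round(burn_rate_for_budget(0.05, window_hours=1), 6))  # 36.0

# At that rate the full 30-day budget lasts 20 hours.
print(round(hours_to_exhaustion(36), 6))  # 20.0

# With a 99.9% SLO, a 3.6% error rate is a burn rate of about 36.
print(round(burn_rate(0.036, slo=0.999), 6))  # 36.0
```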

**5. Multiple Burn Rate Alerts**

* **Implementation**: Use multiple burn rates for different error severities (e.g., 2% budget in 1 hour, 5% in 6 hours).
* **Pros**: Adaptive alerting for both fast and slow errors; prioritizes alerts based on severity.
* **Cons**: More complex to manage multiple burn rates and windows.
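
One way to manage several burn rates is to keep them in a small rule table and derive each threshold from the budget share and window. The sketch assumes a 30-day SLO period. The first two rules mirror the example above; the third (10% in 3 days, routed to a ticket) and the `page`/`ticket` actions follow the Google SRE Workbook's suggested defaults and are assumptions here, not part of these notes.

```python
from dataclasses import dataclass
from typing import Dict, List

PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period


@dataclass(frozen=True)
class BurnRateRule:
    budget_fraction: float  # share of the error budget spent within the window
    window_hours: int
    action: str             # hypothetical routing label

    @property
    def threshold(self) -> float:
        """Burn rate that spends `budget_fraction` within `window_hours`."""
        return self.budget_fraction * PERIOD_HOURS / self.window_hours


RULES = [
    BurnRateRule(0.02, 1, "page"),     # burn rate of about 14.4
    BurnRateRule(0.05, 6, "page"),     # burn rate of about 6
    BurnRateRule(0.10, 72, "ticket"),  # burn rate of about 1 (assumed rule)
]


def fired_rules(measured: Dict[int, float]) -> List[BurnRateRule]:
    """Return rules whose threshold is met; `measured` maps window hours to burn rate."""
    return [r for r in RULES if measured.get(r.window_hours, 0.0) >= r.threshold]


# A fast burn trips only the 1-hour rule; a slow burn trips only the ticket rule.
print([r.action for r in fired_rules({1: 20.0, 6: 3.0, 72: 0.5})])
print([r.action for r in fired_rules({1: 2.0, 6: 2.0, 72: 2.0})])
```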

**6. Multiwindow, Multi-Burn-Rate Alerts (Recommended)**

* **Implementation**: Combine short and long windows with multiple burn rates (e.g., alert if both 1-hour and 5-minute windows exceed burn rate thresholds).
* **Pros**: Best approach for managing precision, recall, detection, and reset times; reduces false positives.
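
A minimal sketch of the two-window condition for a single burn rate, assuming a 99.9% SLO and a threshold of 14.4 (2% of a 30-day budget in 1 hour). Each window is passed as an `(errors, total)` pair; the function names and request counts are hypothetical.

```python
from typing import Tuple

Window = Tuple[int, int]  # (errors, total requests) observed in the window


def burn_rate(errors: int, total: int, slo: float) -> float:
    """Error rate in the window divided by the allowed error rate."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)


def multiwindow_alert(long_window: Window, short_window: Window,
                      slo: float, threshold: float) -> bool:
    """Fire only when both windows burn at or above the threshold."""
    return (burn_rate(*long_window, slo) >= threshold
            and burn_rate(*short_window, slo) >= threshold)


SLO, THRESHOLD = 0.999, 14.4

# Ongoing incident: both the 1-hour and the 5-minute window are burning fast.
print(multiwindow_alert((1200, 60_000), (110, 5_000), SLO, THRESHOLD))  # True

# Recovered: the 1-hour window is still high, but the last 5 minutes are clean.
print(multiwindow_alert((1200, 60_000), (1, 5_000), SLO, THRESHOLD))  # False

# Brief spike: the 5-minute window is hot, but the hour as a whole is not.
print(multiwindow_alert((150, 60_000), (150, 5_000), SLO, THRESHOLD))  # False
```

The short window is what shortens reset time: once errors stop, it drops below the threshold within minutes even though the long window still contains the incident.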

***

## **Handling Low-Traffic Services**

* **Problem**: In a low-traffic service, a single failed request can produce a very high error rate, so alerts fire on events that are not significant.
* **Solutions**:
  * **Generate artificial traffic** to simulate user activity.
  * **Combine smaller services** into a single alerting system.
  * **Modify product design** to reduce impact of individual failed requests.
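
The sensitivity problem, and the effect of combining services, can be shown with a little arithmetic. The sketch assumes a 99.9% SLO over a 30-day period; the per-service request counts are made up for illustration.

```python
from typing import Iterable, Tuple

PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period


def burn_rate(errors: int, total: int, slo: float) -> float:
    """Error rate divided by the allowed error rate (1 - SLO)."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)


def combined_burn_rate(services: Iterable[Tuple[int, int]], slo: float) -> float:
    """Burn rate over the pooled (errors, total) counts of several services."""
    pairs = list(services)
    return burn_rate(sum(e for e, _ in pairs), sum(t for _, t in pairs), slo)


SLO = 0.999

# Hypothetical service with 10 requests in the last hour, one of which failed.
alone = burn_rate(1, 10, SLO)
print(f"burn rate alone: {alone:.1f}")
print(f"budget spent in that hour: {alone * 1 / PERIOD_HOURS:.1%}")

# The same service pooled with two busier, hypothetical neighbours.
pooled = combined_burn_rate([(1, 10), (0, 400), (2, 5_590)], SLO)
print(f"burn rate pooled: {pooled:.2f}")
```

Pooling dilutes the single failure, but it also means a complete failure of one small service may no longer register as significant, so grouping only makes sense for services that tend to fail together.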

***

## **Handling Extreme Availability Goals**

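* **Low Availability**: E.g., a 90% availability target. The error budget is so large that even a 100% outage may burn it too slowly to cross burn-rate thresholds chosen for stricter SLOs, so errors can go unnoticed.
* **High Availability**: E.g., a 99.999% availability target. A 100% outage can deplete the error budget in seconds (about 26 seconds for a 30-day budget).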
  * **Solution**: Design systems to avoid 100% outages or implement gradual rollouts (e.g., canarying).
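
Both extremes follow from the size of the error budget. The sketch below assumes a 30-day SLO period and uses the 2%-in-1-hour paging rule as the comparison point; the function names are illustrative.

```python
PERIOD_HOURS = 30 * 24             # assumption: 30-day SLO period
PERIOD_SECONDS = PERIOD_HOURS * 3600


def seconds_of_full_outage_allowed(slo: float) -> float:
    """How long a 100% outage can last before the whole budget is spent."""
    return (1.0 - slo) * PERIOD_SECONDS


def max_budget_fraction_in_window(slo: float, window_hours: float) -> float:
    """Largest share of the budget a 100% outage can consume in the window."""
    full_outage_burn_rate = 1.0 / (1.0 - slo)
    return min(1.0, full_outage_burn_rate * window_hours / PERIOD_HOURS)


# 99.999%: the budget covers only about 26 seconds of total outage.
print(f"{seconds_of_full_outage_allowed(0.99999):.2f} s")

# 90%: a full outage spends only about 1.4% of the budget in an hour,
# so a rule requiring 2% in 1 hour can never fire.
worst_case = max_budget_fraction_in_window(0.90, window_hours=1)
print(f"{worst_case:.2%}", worst_case >= 0.02)
```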

***

## **Scalable Alerting Framework**

* **Avoid Custom Parameters**: Don't specify alert parameters for every microservice.
* **Group Requests into Buckets**:
  * **Critical**: E.g., login requests (99.99% availability).
  * **High Priority**: E.g., user interaction (99.9% availability).
  * **Low Priority**: Non-urgent requests with minimal user impact.
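
A bucket scheme can be sketched as two small lookup tables: one mapping buckets to SLOs, one mapping request classes to buckets. The 99.99% and 99.9% targets come from the list above; the 99% target for the low-priority bucket, the request class names, and the default burn rate of 14.4 are assumptions made for illustration.

```python
BUCKET_SLO = {
    "critical": 0.9999,
    "high": 0.999,
    "low": 0.99,  # assumption: the notes give no number for this bucket
}

# Hypothetical request classes; a real service would define its own.
REQUEST_CLASS_BUCKET = {
    "login": "critical",
    "user_interaction": "high",
    "report_export": "low",
}


def bucket_for(request_class: str) -> str:
    """Look up the bucket; unknown classes fall back to 'low' (an assumption)."""
    return REQUEST_CLASS_BUCKET.get(request_class, "low")


def alert_error_rate(request_class: str, burn_rate: float = 14.4) -> float:
    """Error rate at which the given burn rate is reached for this class."""
    slo = BUCKET_SLO[bucket_for(request_class)]
    return burn_rate * (1.0 - slo)


for name in ("login", "user_interaction", "report_export"):
    print(name, bucket_for(name), f"{alert_error_rate(name):.4%}")
```

Alert parameters are then defined once per bucket, and onboarding a new microservice only requires classifying its requests.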

***

## **Conclusion**

* **Best Strategy**: Multiwindow, multi-burn-rate alerting is the most reliable way to defend SLOs.
* **Objective**: Alert only on actionable, significant events that threaten the error budget.

