Alerting

Purpose of Alerting

  • Goal: Turn SLOs (Service Level Objectives) into actionable alerts.

  • Benefit: Respond to issues before consuming too much of the error budget.


Alerting Considerations

  • Precision: Detect significant events without false alerts.

  • Recall: Ensure all significant events trigger alerts.

  • Detection Time: Minimize the time taken to detect issues.

  • Reset Time: Ensure alerts resolve quickly once issues are fixed.


Six Approaches to Alerting

1. Target Error Rate ≥ SLO Threshold

  • Implementation: Trigger an alert when the error rate measured over a short window (e.g., 10 minutes) meets or exceeds the SLO threshold.

  • Pros: Simple, quick detection.

  • Cons: Poor precision; minor, short-lived issues produce too many false positives.
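
The check described above can be sketched in a few lines. This is a minimal illustration, not a production rule: the function names and the request counts are hypothetical, and the 99.9% SLO is an assumed example value.

```python
def error_rate(errors: int, total: int) -> float:
    """Fraction of failed requests in a window; 0.0 when there is no traffic."""
    return errors / total if total else 0.0


def should_alert(errors: int, total: int, slo: float) -> bool:
    """Fire when the window's error rate reaches the allowed rate (1 - SLO)."""
    return error_rate(errors, total) >= (1.0 - slo)


# Hypothetical 10-minute window against an assumed 99.9% SLO:
# 12 failures out of 10,000 requests is a 0.12% error rate.
print(should_alert(errors=12, total=10_000, slo=0.999))
```

Because the window is short, a brief blip that barely dents the error budget still trips the check, which is the precision problem noted above.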

2. Increased Alert Window

  • Implementation: Increase the alert window (e.g., 36 hours) to improve precision.

  • Pros: Better precision than short windows.

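  • Cons: Very long reset time (the alert keeps firing well after the issue is fixed) and higher memory or I/O cost to compute rates over a long window.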

3. Incrementing Alert Duration

  • Implementation: Only trigger alerts if the error rate stays above the threshold for a set duration.

  • Pros: Better precision for sustained issues.

  • Cons: Poor recall and slow detection time for severe issues.
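
A rough sketch of the duration condition, assuming error rates are sampled at a fixed interval (for example once per minute). The function name and the sample values are hypothetical.

```python
from typing import Sequence


def sustained_breach(rates: Sequence[float], threshold: float, duration: int) -> bool:
    """True when the most recent `duration` samples all exceed `threshold`."""
    if duration <= 0 or len(rates) < duration:
        return False
    return all(rate > threshold for rate in rates[-duration:])


# Assumed threshold of 0.1% errors, required for 3 consecutive samples.
steady = [0.02, 0.02, 0.02, 0.02]
flapping = [0.02, 0.02, 0.0005, 0.02]  # one clean sample resets the streak

print(sustained_breach(steady, threshold=0.001, duration=3))
print(sustained_breach(flapping, threshold=0.001, duration=3))
```

The second series shows the recall problem: a single sample below the threshold resets the streak, so an outage that flaps may never alert, and even a total outage must wait out the full duration before anyone is notified.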

4. Alert on Burn Rate

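  • Definition: Burn rate = how fast the service consumes the error budget relative to the SLO. A burn rate of 1 uses up exactly the whole budget over the SLO period.

  • Implementation: Alert when the burn rate exceeds a critical threshold (e.g., a burn rate of 36, which consumes 5% of a 30-day error budget in one hour).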

  • Pros: Good precision and detection time.

  • Cons: Low recall; slower burns that stay below the threshold never alert, even though they can still exhaust the budget.
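
The arithmetic behind a burn-rate threshold can be sketched as follows. The 30-day SLO period and the 99.9% SLO are assumptions made for the example, and the function names are hypothetical.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1.0 - slo)


def budget_consumed(rate: float, window_hours: float, period_hours: float = 30 * 24) -> float:
    """Fraction of the error budget spent by sustaining `rate` for `window_hours`."""
    return rate * window_hours / period_hours


# Assumed 99.9% SLO: a 3.6% error rate is a burn rate of about 36.
print(burn_rate(error_rate=0.036, slo=0.999))

# Sustaining a burn rate of 36 for one hour spends about 5% of a 30-day budget.
print(budget_consumed(rate=36, window_hours=1))
```

Read the other way round, a burn rate of 36 would empty a 30-day budget in 20 hours, which is why it is treated as a critical threshold.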

5. Multiple Burn Rate Alerts

  • Implementation: Use multiple burn rates and windows for different error severities (e.g., alert when 2% of the budget is consumed in 1 hour, or 5% in 6 hours).

  • Pros: Adaptive alerting for both fast and slow errors; prioritizes alerts based on severity.

  • Cons: More complex to manage multiple burn rates and windows.
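
Each "budget fraction in a window" pair translates into a burn-rate threshold. The sketch below assumes a 30-day SLO period, under which 2% in 1 hour and 5% in 6 hours correspond to burn rates of roughly 14.4 and 6. The names `TIERS` and `firing_tiers` are hypothetical.

```python
from typing import Dict, List, Tuple

PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget within `window_hours`."""
    return budget_fraction * PERIOD_HOURS / window_hours


# (budget fraction, window in hours) pairs taken from the example above.
TIERS: List[Tuple[float, float]] = [(0.02, 1), (0.05, 6)]


def firing_tiers(measured: Dict[float, float]) -> List[float]:
    """Return the windows (in hours) whose measured burn rate exceeds its threshold.

    `measured` maps a window length in hours to the burn rate observed over it.
    """
    return [
        window
        for fraction, window in TIERS
        if measured.get(window, 0.0) > burn_rate_threshold(fraction, window)
    ]


# A slow burn: too low for the 1-hour tier, high enough for the 6-hour tier.
print(firing_tiers({1: 3.0, 6: 8.0}))
```

A slow burn like the one in the example is exactly the case a single high threshold would miss.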

6. Multiwindow, Multi-Burn-Rate Alerts (Recommended)

  • Implementation: Combine short and long windows with multiple burn rates (e.g., alert if both 1-hour and 5-minute windows exceed burn rate thresholds).

  • Pros: Best approach for managing precision, recall, detection, and reset times; reduces false positives.
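
A minimal sketch of the combined condition, using the 1-hour and 5-minute windows from the example above. The threshold of 14.4 assumes a 30-day SLO period (2% of the budget in 1 hour), and the class name, function name and measured values are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str
    short_window: str
    threshold: float


def rule_fires(rule: BurnRateRule, burn_rates: Mapping[str, float]) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    return (
        burn_rates[rule.long_window] > rule.threshold
        and burn_rates[rule.short_window] > rule.threshold
    )


# 14.4 corresponds to 2% of an assumed 30-day budget spent in 1 hour.
PAGE_RULE = BurnRateRule(long_window="1h", short_window="5m", threshold=14.4)

ongoing = {"1h": 20.0, "5m": 25.0}    # errors are still happening
recovered = {"1h": 20.0, "5m": 0.5}   # long window still elevated, short window clean

print(rule_fires(PAGE_RULE, ongoing))
print(rule_fires(PAGE_RULE, recovered))
```

The short window acts as a "still happening" check: once errors stop, it drops below the threshold within minutes and the alert resets, even though the long window remains elevated for up to an hour.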


Handling Low-Traffic Services

  • Problem: With few requests, a single failure produces a large error rate, so low-traffic services are prone to false alerts.

  • Solutions:

    • Generate artificial traffic to simulate user activity.

    • Combine smaller, related services into one larger group for alerting so that errors are measured against more traffic.

    • Modify product design to reduce impact of individual failed requests.
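
To see why low traffic is a problem, and why pooling traffic helps, consider the illustrative calculation below. The request counts and the 99.9% SLO are made-up example values, and the function name is hypothetical.

```python
def burn_rate_from_counts(errors: int, total: int, slo: float) -> float:
    """Burn rate implied by `errors` failures out of `total` requests."""
    if total == 0:
        return 0.0
    return (errors / total) / (1.0 - slo)


SLO = 0.999  # assumed target

# One failed request in an hour with only 10 requests: a burn rate of about 100.
print(burn_rate_from_counts(errors=1, total=10, slo=SLO))

# The same single failure measured against 10,000 pooled requests: about 0.1.
print(burn_rate_from_counts(errors=1, total=10_000, slo=SLO))
```

Artificial traffic and combined services both work by enlarging the denominator, so that one failed request no longer looks like an outage.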


Handling Extreme Availability Goals

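  • Low Availability: E.g., 90% availability. The error budget is so large that sustained errors can go unnoticed, and because the burn rate cannot exceed 10 at this target, alert thresholds above 10 never fire.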

  • High Availability: E.g., 99.999% availability. A 100% outage depletes the error budget in seconds (about 26 seconds for a 30-day window), often faster than monitoring can detect it.

    • Solution: Design systems to avoid 100% outages or implement gradual rollouts (e.g., canarying).
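
The time a full outage takes to exhaust the budget follows directly from the availability target. This sketch assumes a 30-day SLO period, and the function name is hypothetical.

```python
def seconds_to_exhaust(slo: float, error_rate: float = 1.0, period_days: int = 30) -> float:
    """Seconds until the error budget is gone at a constant `error_rate`."""
    if error_rate <= 0:
        raise ValueError("error_rate must be positive")
    budget_seconds = (1.0 - slo) * period_days * 24 * 3600
    return budget_seconds / error_rate


# Full outage against a 99.999% target: roughly 26 seconds of budget.
print(round(seconds_to_exhaust(0.99999), 2))

# Full outage against a 90% target: roughly 3 days of budget.
print(round(seconds_to_exhaust(0.90) / 86400, 2))
```

With only seconds of budget, no human can respond in time, so the defense has to come from system design and gradual rollouts rather than from alerting.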


Scalable Alerting Framework

  • Avoid Custom Parameters: Don't hand-tune alert parameters separately for every microservice; this does not scale.

  • Group Requests into Buckets:

    • Critical: E.g., login requests (99.99% availability).

    • High Priority: E.g., user interaction (99.9% availability).

    • Low Priority: Non-urgent requests with minimal user impact.
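
One way to express the buckets is a small shared lookup that every service reuses. This is only a sketch: the request type names and the mapping are hypothetical, and the low-priority target is left as None because the notes above do not give a number for it.

```python
from typing import Dict, Optional

# Availability target per bucket; None means "no target stated in these notes".
BUCKET_TARGETS: Dict[str, Optional[float]] = {
    "critical": 0.9999,
    "high": 0.999,
    "low": None,
}

# Hypothetical mapping from request type to bucket.
REQUEST_BUCKETS: Dict[str, str] = {
    "login": "critical",
    "user_interaction": "high",
    "report_export": "low",
}


def error_budget_for(request_type: str) -> Optional[float]:
    """Allowed error fraction for a request type, or None if no target is set."""
    target = BUCKET_TARGETS[REQUEST_BUCKETS[request_type]]
    return None if target is None else 1.0 - target


for name in REQUEST_BUCKETS:
    print(name, error_budget_for(name))
```

New request types only need a bucket assignment; the alert parameters stay attached to the bucket rather than to each service.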


Conclusion

  • Best Strategy: Multiwindow, multi-burn-rate alerting is the most reliable way to defend SLOs.

  • Objective: Configure alerts to fire only on actionable, significant events that threaten the error budget.
