Mike's Notes

Service Level Terminology

Effective service management requires understanding which behaviors matter and how to measure them.

Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) help define and deliver the desired level of service.

Metrics should guide appropriate actions when issues arise, ensuring the service remains healthy.


Service Level Terminology

  • SLI (Service Level Indicator): Quantitative measures of service performance (e.g., latency, error rate, throughput, availability).

  • SLO (Service Level Objective): Target values for SLIs, specifying the desired level of service performance.

  • SLA (Service Level Agreement): Contracts with users outlining the consequences if SLOs aren't met (e.g., financial penalties).


Service Level Indicators

  • SLIs are specific metrics that indicate service health (e.g., request latency, error rate).

  • Availability is a common SLI, often expressed in "nines" (e.g., 99.9% availability is "three nines").

  • Some SLIs may only be proxies for actual user experience (e.g., server-side latency vs. client-side latency).
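
As a rough illustration of the availability SLI above, the following minimal sketch computes availability from request counts and shows how much downtime a "three nines" target permits. The function names, the request counts, and the 30-day period are assumptions chosen for the example, not part of any particular monitoring tool.

```python
def availability(successful: int, total: int) -> float:
    """Fraction of requests served successfully (a request-based availability SLI)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total


def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Minutes of full outage that a time-based availability target permits."""
    return (1.0 - target) * period_days * 24 * 60


# Hypothetical request counts for one measurement window.
sli = availability(successful=999_412, total=1_000_000)
print(f"availability: {sli:.4%}")

# "Three nines" over an assumed 30-day period.
print(f"99.9% allows {allowed_downtime_minutes(0.999):.1f} minutes of downtime")
```

Under these assumptions, three nines leaves roughly 43 minutes of downtime in a 30-day period, which gives a concrete sense of how strict the target is.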


Service Level Objectives

  • SLOs set expectations for service performance; publishing them tells users what they can rely on, which helps reduce complaints.

  • Example: a latency SLO such as "average request latency < 100 ms".

  • Choosing SLOs is complex and should reflect both user expectations and system capabilities.

  • Higher load often increases latency, so SLOs should account for this relationship.
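
To make the relationship between an SLI and its SLO concrete, here is a minimal sketch that checks the example latency SLO (average request latency under 100 ms) against two measurement windows. The `LatencySLO` class and the sample latencies are hypothetical; the busy window simply illustrates how higher load can push a service past its target.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class LatencySLO:
    """Target for a latency SLI: the mean must stay below threshold_ms."""
    threshold_ms: float

    def is_met(self, latencies_ms: list[float]) -> bool:
        if not latencies_ms:
            raise ValueError("no measurements in window")
        return mean(latencies_ms) < self.threshold_ms


slo = LatencySLO(threshold_ms=100.0)

# Hypothetical request latencies (ms) under light and heavy load.
quiet_window = [42.0, 55.0, 61.0, 48.0]
busy_window = [95.0, 120.0, 140.0, 88.0]

print(slo.is_met(quiet_window))  # True  (mean is 51.5 ms)
print(slo.is_met(busy_window))   # False (mean is 110.75 ms)
```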


Service Level Agreements

  • SLAs are formal agreements between the service provider and users, typically involving penalties for unmet SLOs.

  • SLAs are driven mainly by business and product decisions; SREs focus on meeting the underlying SLOs so that SLA penalties are not triggered.


Indicators in Practice

Focus on a handful of meaningful SLIs that matter to users, such as:

  • User-facing systems: Availability, latency, throughput.

  • Storage systems: Latency, availability, durability.

  • Big data systems: Throughput, end-to-end latency.

  • All systems: Correctness (accuracy of returned data).


Collecting and Aggregating Indicators

  • Metrics can be collected server-side or client-side, depending on the aspect of user experience being measured.

  • Aggregating metrics (e.g., average latency) can obscure important details, such as tail latencies.

  • Percentiles (e.g., 99th percentile latency) offer a clearer view of performance extremes.
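
The difference between an average and a percentile is easiest to see with numbers. The sketch below uses made-up latency data and a simple nearest-rank percentile (one of several common percentile definitions, chosen here for brevity) to show how a small fraction of slow requests hides behind a reasonable-looking mean.

```python
import math
from statistics import mean


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


# Hypothetical window: 95 fast requests and 5 very slow ones (milliseconds).
latencies_ms = [20.0] * 95 + [2000.0] * 5

print(mean(latencies_ms))            # 119.0
print(percentile(latencies_ms, 50))  # 20.0
print(percentile(latencies_ms, 99))  # 2000.0
```

The mean of 119 ms describes no request that actually occurred: the typical request took 20 ms, while the slowest 5% took two seconds, which only the high percentile reveals.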


Best Practices

  • Use percentiles rather than averages to capture the distribution of performance, especially for latency.

  • Standardize SLIs across services to simplify monitoring and ensure consistency.
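
One possible way to standardize SLIs is a shared template that every service fills in, so that aggregation method, window, and measurement point are stated the same way everywhere. The `SLIDefinition` class, its fields, and its defaults below are hypothetical choices for illustration, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLIDefinition:
    """Hypothetical shared template so every service describes an SLI the same way."""
    name: str
    unit: str
    aggregation: str = "p99"     # default to a percentile rather than an average
    window_seconds: int = 60     # aggregation interval (assumed default)
    measured_at: str = "server"  # "server" or "client"

    def describe(self) -> str:
        return (f"{self.name}: {self.aggregation} over {self.window_seconds}s, "
                f"in {self.unit}, measured at the {self.measured_at}")


# Two hypothetical services reuse the same defaults and override only what differs.
checkout_latency = SLIDefinition(name="checkout request latency", unit="ms")
search_latency = SLIDefinition(name="search request latency", unit="ms",
                               measured_at="client")

print(checkout_latency.describe())
print(search_latency.describe())
```

Because the defaults live in one place, a team only has to document where a service deviates from them.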


Conclusion

  • Defining and managing SLIs, SLOs, and SLAs is key to delivering a reliable service.

  • Prioritize metrics that matter most to users and align with your system's goals for optimal service management.

Last updated 9 months ago
