Service Level Terminology

Effective service management requires understanding which behaviors matter and how to measure them.

Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) help define and deliver the desired level of service.

Well-chosen metrics show when something is wrong and point to the appropriate action, so the service stays healthy.


Service Level Terminology

  • SLI (Service Level Indicator): A quantitative measure of some aspect of service performance (e.g., latency, error rate, throughput, availability).

  • SLO (Service Level Objective): A target value or range for an SLI, specifying the desired level of service performance.

  • SLA (Service Level Agreement): A contract with users that spells out the consequences if SLOs aren’t met (e.g., financial penalties).


Service Level Indicators

  • SLIs are specific metrics that indicate service health (e.g., request latency, error rate).

  • Availability is a common SLI and is often expressed in "nines" (e.g., 99.9% availability is "three nines").

  • Some SLIs may only be proxies for actual user experience (e.g., server-side latency vs. client-side latency).
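
As a rough illustration of the availability figures above, the sketch below computes a request-based availability SLI and the downtime that a "three nines" target would allow. The function names, the request counts, and the 30-day period are assumptions made for this example, not values from a real service.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (a request-based availability SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def allowed_downtime_minutes(availability_target: float, period_days: float = 30) -> float:
    """Downtime permitted by a time-based availability target over a period."""
    return (1.0 - availability_target) * period_days * 24 * 60


# Hypothetical request counts, chosen only for illustration.
sli = availability(successful_requests=999_250, total_requests=1_000_000)
print(f"availability SLI: {sli:.4%}")

# "Three nines" (99.9%) over an assumed 30-day period.
print(f"allowed downtime: {allowed_downtime_minutes(0.999):.1f} minutes")
```

The same ratio could be computed from client-side counts instead of server-side counts, which may give a value closer to what users actually experience.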


Service Level Objectives

  • Publishing SLOs sets user expectations about how the service will perform, which helps reduce complaints.

  • Example latency SLO: average request latency < 100 ms.

  • Choosing SLOs is complex and should reflect both user expectations and system capabilities.

  • Higher load often increases latency, so SLOs should account for this relationship.
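
The example latency objective above (average request latency under 100 ms) could be checked along the following lines. This is a minimal sketch: the `LatencySLO` class and both sample sets are hypothetical, and the samples are made up to show how the same objective can be met under light load and missed under heavy load.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical objective: mean request latency must stay below a threshold."""
    name: str
    threshold_ms: float

    def is_met(self, latencies_ms: Sequence[float]) -> bool:
        if not latencies_ms:
            raise ValueError("need at least one latency sample")
        return mean(latencies_ms) < self.threshold_ms


slo = LatencySLO(name="average request latency", threshold_ms=100.0)

# Made-up samples (milliseconds) for the same service at two load levels.
low_load_ms = [40.0, 55.0, 60.0, 70.0, 80.0]
high_load_ms = [90.0, 110.0, 120.0, 130.0, 150.0]

for label, samples in (("low load", low_load_ms), ("high load", high_load_ms)):
    status = "met" if slo.is_met(samples) else "missed"
    print(f"{label}: mean {mean(samples):.0f} ms, SLO {status}")
```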


Service Level Agreements

  • SLAs are formal agreements between the service provider and users, typically involving penalties for unmet SLOs.

  • SLAs are driven mainly by business decisions, while SREs focus on meeting SLOs so that SLA penalties are not triggered.


Indicators in Practice

Focus on a handful of meaningful SLIs that matter to users, such as:

  • User-facing systems: Availability, latency, throughput.

  • Storage systems: Latency, availability, durability.

  • Big data systems: Throughput, end-to-end latency.

  • All systems: Correctness (accuracy of returned data).


Collecting and Aggregating Indicators

  • Metrics can be collected server-side or client-side, depending on the aspect of user experience being measured.

  • Aggregating metrics (e.g., average latency) can obscure important details, such as tail latencies.

  • Percentiles (e.g., 99th percentile latency) offer a clearer view of performance extremes.
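
To make the difference between averages and percentiles concrete, here is a small sketch using a made-up latency sample in which most requests are fast and a few are very slow. The `percentile` helper uses the nearest-rank method, which is one of several common definitions; both the helper and the data are assumptions for this example.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up data: 95 fast requests (50 ms) and 5 slow ones (2000 ms).
latencies_ms = [50.0] * 95 + [2000.0] * 5

print(f"mean: {mean(latencies_ms)} ms")
print(f"p50:  {percentile(latencies_ms, 50)} ms")
print(f"p99:  {percentile(latencies_ms, 99)} ms")
```

In this sample the mean sits between the two groups and describes neither of them, while the 50th and 99th percentiles report the typical request and the slow tail separately.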


Best Practices

  • Use percentiles rather than averages to capture the distribution of performance, especially for latency.

  • Standardize SLIs across services to simplify monitoring and ensure consistency.
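
One possible way to standardize SLIs is a shared template that every service reuses instead of defining its own variant. The sketch below is purely illustrative: `SLITemplate`, its fields (including the aggregation window), the default values, and the service names are all assumptions, not an established schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SLITemplate:
    """Hypothetical shared definition of how an SLI is measured."""
    metric: str          # name of the measured quantity
    percentile: float    # percentile reported instead of an average
    window_seconds: int  # aggregation window (assumed field)
    measured_at: str     # "server" or "client"


# Assumed organization-wide default for latency SLIs.
DEFAULT_LATENCY_SLI = SLITemplate(
    metric="request_latency_ms",
    percentile=99.0,
    window_seconds=60,
    measured_at="server",
)


def sli_for(service: str, template: SLITemplate = DEFAULT_LATENCY_SLI) -> dict:
    """Attach a service name to the shared template without redefining it."""
    return {"service": service, **asdict(template)}


# Two hypothetical services end up with identical measurement rules.
print(sli_for("checkout"))
print(sli_for("search"))
```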


Conclusion

  • Defining and managing SLIs, SLOs, and SLAs is key to delivering a reliable service.

  • Prioritize metrics that matter most to users and align with your system's goals for optimal service management.
