Monitoring
Principles and Definitions
Google’s SRE teams follow a set of core principles for successful monitoring and alerting. Chief among them is deciding which issues warrant notifying a human, and how to handle issues that are not severe enough to need immediate attention.
Key Definitions:
Monitoring: Collecting and displaying real-time data about system behavior, including errors, query types, and server performance.
White-box Monitoring: Monitoring based on metrics exposed by the internals of the system, such as logs or a JVM profiling interface.
Black-box Monitoring: Measuring external system behavior as users experience it.
Dashboard: A summary view of critical service metrics, often web-based.
Alert: A notification, typically a page, email, or ticket, indicating a potential issue.
Root Cause: The underlying reason for a failure, which must be fixed to prevent recurrence.
Node and Machine: Used interchangeably to refer to a single instance of a running kernel, whether on a physical server, in a virtual machine, or in a container.
Push: Any update to a service’s software or configuration.
Why Monitor?
Monitoring is crucial for:
Long-term trend analysis: Helps assess growth rates, like database size or user engagement.
Performance comparison: Evaluates the impact of changes, like new database software or additional infrastructure.
Alerting: Identifies current or potential problems requiring immediate or near-term attention.
Dashboarding: Provides a quick view of a system's health using key metrics like the "Four Golden Signals" (explained later).
Debugging: Retrospective analysis of unexpected changes, such as latency spikes.
Effective Monitoring and Alerting
Monitoring lets a system tell us when it is broken, or about to break. While critical issues should alert a human, paging is disruptive and expensive. Too many alerts cause alert fatigue, in which genuinely important pages may be skimmed or ignored. To prevent this, alerting should be tuned to deliver high signal with minimal noise.
Setting Expectations for Monitoring
Even with sophisticated tools, monitoring complex systems requires significant resources. Google’s SRE teams typically dedicate one or two members to maintaining these systems. The aim is to keep monitoring simple and scalable, avoiding unnecessarily complex or fragile systems.
Complex dependency-based rules (e.g., alert for website issues only if the database is fine) are rare at Google due to the infrastructure's continuous evolution. Keeping monitoring rules simple and robust is essential for minimizing noise and ensuring clarity when problems arise.
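As a rough illustration of what a simple, symptom-based rule can look like, the sketch below pages purely on the user-visible error ratio, with no dependency logic underneath. The metric names, five-minute window, and 1% threshold are illustrative assumptions, not values from the chapter.

```python
# Minimal sketch of a simple, symptom-based alerting rule.
# The window length and 1% threshold are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class WindowCounts:
    """Request counts observed over the evaluation window (e.g., 5 minutes)."""
    total: int
    errors: int

def should_page(counts: WindowCounts, error_ratio_threshold: float = 0.01) -> bool:
    """Page only on the user-visible symptom: too many failing requests.

    No dependency logic ("only if the database looks fine"), so the rule
    stays robust as the infrastructure underneath keeps evolving.
    """
    if counts.total == 0:
        return False
    return counts.errors / counts.total > error_ratio_threshold

# Example: 12,000 requests in the window, 180 of them failed -> 1.5% error ratio.
print(should_page(WindowCounts(total=12_000, errors=180)))  # True
```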
Symptoms vs. Causes
Monitoring systems should differentiate between symptoms ("what’s broken") and causes ("why it’s broken"). For example, a slow website might be due to a slow database, which is a symptom for the web team but a cause for the database team. This distinction is key to effective monitoring.
Black-box vs. White-box Monitoring
Both types of monitoring are necessary. Black-box monitoring identifies immediate, user-visible problems, while white-box monitoring detects underlying causes or imminent failures by observing internal system behavior. White-box monitoring is essential for debugging and detecting masked failures, while black-box monitoring ensures alerting is focused on active, impactful issues.
The Four Golden Signals
When monitoring a user-facing system, four critical metrics should be tracked:
Latency: The time it takes to service a request. Track the latency of successful and failed requests separately, since fast errors can make overall latency look misleadingly good.
Traffic: The demand on the system, measured in system-specific units (e.g., HTTP requests per second).
Errors: The rate of failed requests, whether explicit (e.g., HTTP 500s), implicit (e.g., an HTTP 200 with the wrong content), or by policy (e.g., any request slower than an agreed threshold counts as an error).
Saturation: How close the system is to its full capacity, with a focus on the most constrained resources (e.g., CPU, I/O).
If these four metrics are monitored and alerts are triggered when any become problematic, the system should be well covered.
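As a rough sketch of what tracking these signals can look like, the snippet below keeps simple in-process counters and latency lists. The class name, fields, and the idea of expressing saturation against an assumed QPS capacity are illustrative assumptions, not the book’s implementation; real services usually export such metrics to a monitoring system as counters and histograms.

```python
# Minimal in-process sketch of the four golden signals.
# Names and the QPS-based saturation measure are illustrative assumptions;
# production services typically export counters/histograms instead.

class GoldenSignals:
    def __init__(self, capacity_qps: float):
        self.request_count = 0      # traffic: requests observed in the window
        self.error_count = 0        # errors: failed requests in the window
        self.ok_latencies = []      # latency of successful requests (seconds)
        self.err_latencies = []     # latency of failed requests, tracked separately
        self.capacity_qps = capacity_qps

    def observe(self, latency_s: float, ok: bool) -> None:
        self.request_count += 1
        (self.ok_latencies if ok else self.err_latencies).append(latency_s)
        if not ok:
            self.error_count += 1

    def error_ratio(self) -> float:
        return self.error_count / self.request_count if self.request_count else 0.0

    def saturation(self, window_s: float) -> float:
        """Load relative to the most constrained resource; here, an assumed QPS capacity."""
        return (self.request_count / window_s) / self.capacity_qps
```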
Tail Latencies and Measurement Granularity
Monitoring the average performance of a system can hide issues experienced by a minority of users. To avoid this, it’s important to focus on the "tail" of the performance distribution (e.g., the slowest 1% of requests). Using latency buckets instead of averages allows for a clearer view of performance issues.
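The small sketch below shows why buckets beat averages: the bucket boundaries are illustrative assumptions, but the effect is general, a slow 1% tail barely moves the mean while standing out clearly in a histogram.

```python
# Sketch: latency buckets expose the slow tail that an average hides.
# Bucket boundaries below are illustrative assumptions.

import bisect

BUCKET_BOUNDS_MS = [10, 30, 100, 300, 1000, 3000]  # upper bounds; final bucket is overflow

def bucket_latencies(latencies_ms):
    counts = [0] * (len(BUCKET_BOUNDS_MS) + 1)
    for latency in latencies_ms:
        counts[bisect.bisect_left(BUCKET_BOUNDS_MS, latency)] += 1
    return counts

# 990 fast requests plus a slow 1% tail:
samples = [20] * 990 + [2500] * 10
print(sum(samples) / len(samples))   # average ~44.8 ms -- looks healthy
print(bucket_latencies(samples))     # ...but 10 requests land in the 1s-3s bucket
```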
Additionally, different system metrics should be measured with appropriate granularity. For instance, CPU load might need to be measured more frequently than storage availability, depending on the system’s uptime requirements.
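One way to get fine-grained CPU data without an expensive collection pipeline is to sample frequently but aggregate locally before shipping anything. The sketch below is a hedged illustration of that idea, assuming a hypothetical read_cpu_utilization() helper and 5% buckets aggregated once per minute.

```python
# Sketch: sample CPU once per second, bucket at 5% granularity,
# and ship only the aggregated buckets each minute.

import random
import time

def read_cpu_utilization() -> float:
    """Placeholder for a real per-second CPU reading in the range 0.0-1.0."""
    return random.random()

def collect_one_minute() -> list[int]:
    buckets = [0] * 20                        # 5% buckets: [0-5%), [5-10%), ..., [95-100%]
    for _ in range(60):                       # sample once per second
        utilization = read_cpu_utilization()
        buckets[min(int(utilization * 20), 19)] += 1
        time.sleep(1)
    return buckets                            # 20 counters per minute instead of 60 raw samples
```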
By focusing on simple, robust monitoring and keeping noise low, SRE teams can respond more effectively to system issues and minimize disruptions to both services and on-call engineers.