On-Call
Purpose of On-Call
Goal: Be available and respond to production incidents with appropriate urgency.
Responsibilities:
Diagnose, mitigate, fix, or escalate incidents.
Perform non-urgent production tasks when not handling incidents.
Key Considerations for On-Call
Balance: Engineers should not sacrifice health for reliability.
Goal: SREs should handle a healthy mix of project work and on-call duties (aim for at least 50% of time on project work).
Psychological Safety: Support systems and escalation procedures must exist to reduce stress.
Compensation: Offer time off or cash compensation for on-call work to help prevent burnout.
Example On-Call Setups
Google’s Approach:
Training Roadmap: Checklist of focus areas (e.g., monitoring systems, debugging, handling traffic).
Deep Dives: Focus on gaining expertise in services.
Onboarding: Mentoring, shadowing on-call shifts, and practice with “Wheel of Misfortune” disaster exercises.
Playbooks: Documentation to handle alerts and reduce Mean Time to Repair (MTTR).
Evernote’s Cloud Migration:
Reframing Alerts: Shift focus from infrastructure-level events to user impact (e.g., API responsiveness).
On-Call Rotation: Classify events into three categories:
P1: Immediate action needed, page on-call.
P2: Handle next business day, send email.
P3: Informational only.
Post-Incident Review: Every P1 and P2 event has a postmortem and continuous improvement cycle.
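The three-way classification above can be sketched as a small routing table. This is a minimal illustration, not Evernote's actual tooling: the names `Priority`, `ROUTES`, `route_event`, and `needs_postmortem` are hypothetical, and the "log" channel for P3 is an assumption (the notes only say P3 is informational).

```python
from enum import Enum


class Priority(Enum):
    P1 = "P1"  # immediate action needed
    P2 = "P2"  # handle next business day
    P3 = "P3"  # informational only


# Hypothetical channel names. The notes name "page" and "email";
# "log" for P3 is an assumption made for this sketch.
ROUTES = {
    Priority.P1: "page",
    Priority.P2: "email",
    Priority.P3: "log",
}


def route_event(priority: Priority) -> str:
    """Return the notification channel for an event of this priority."""
    return ROUTES[priority]


def needs_postmortem(priority: Priority) -> bool:
    """Per the notes, every P1 and P2 event gets a postmortem."""
    return priority in (Priority.P1, Priority.P2)
```

Keeping the mapping in one table makes it easy to review which events are allowed to page a human.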
Managing Pager Load
Definition: Number of paging incidents during a shift.
Techniques to Manage Pager Load:
Ensure only actionable issues trigger pages.
Automate routine fixes, and route lower-priority issues to tickets instead of pages.
Keep the signal-to-noise ratio of alerts high.
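One way to track these techniques is to record each page from a shift and measure how many were actionable. The sketch below is illustrative only: the `Page` record and both function names are hypothetical, and treating "actionable pages divided by total pages" as the signal-to-noise measure is an assumption, not a definition from the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    alert_name: str
    actionable: bool  # did the responder actually have to do something?


def pager_load(pages: List[Page]) -> int:
    """Pager load: the number of paging incidents during a shift."""
    return len(pages)


def actionable_ratio(pages: List[Page]) -> float:
    """Fraction of pages that required action (assumed proxy for signal)."""
    if not pages:
        return 1.0  # no pages at all means no noise
    return sum(p.actionable for p in pages) / len(pages)
```

A low ratio suggests that some alerts should be tuned, automated away, or downgraded to tickets.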
Appropriate Response Times:
Critical (5 min): Requires immediate action (e.g., revenue-impacting outages).
Moderate (30 min): Less critical issues.
Low (during work hours): Non-urgent issues, such as failing backups for a pre-launch service.
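The response-time tiers above can be written down as data so that tooling can compute a deadline for each alert. A minimal sketch, assuming hypothetical severity keys (`"critical"`, `"moderate"`, `"low"`) and a hypothetical helper `response_deadline`; low-severity work has no paging deadline here because it is handled during work hours.

```python
from datetime import datetime, timedelta
from typing import Optional

# Targets taken from the tiers above; the key names are assumptions.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=5),
    "moderate": timedelta(minutes=30),
    "low": None,  # handled during work hours, so no paging deadline
}


def response_deadline(severity: str, alerted_at: datetime) -> Optional[datetime]:
    """Return the time by which someone should respond, or None for low severity."""
    target = RESPONSE_TARGETS[severity]
    if target is None:
        return None
    return alerted_at + target
```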
Reducing Pager Load
Identifying Causes:
Existing production bugs.
New bugs introduced into production.
Alerting thresholds and human processes.
Strategies:
Fix existing bugs before releasing new features.
Automate production changes to reduce human error.
Rollback strategy: Detect the problem, roll back, fix the bug, then roll the fix forward.
Best Practices for Alerts
Alerts should be immediately actionable and not overwhelm engineers.
SLO-based alerts: Page only when the error budget is being consumed fast enough to threaten the SLO.
Team Review: New alerts should be reviewed by the team and trialed in production before they are allowed to page.
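One common way to implement SLO-based paging is to compare the observed error ratio with the error budget and page only when the budget is burning quickly. The sketch below assumes a request-based SLO; the function names and the default threshold are illustrative choices, not values from the source. A burn rate of 14.4 sustained for one hour consumes 2% of a 30-day budget (14.4 × 1 h / 720 h = 0.02).

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A value of 1.0 means the error budget would be used up exactly at the
    end of the SLO window; higher values exhaust it sooner.
    """
    if requests == 0:
        return 0.0  # no traffic, so nothing is burning the budget
    error_budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(errors: int, requests: int, slo: float,
                threshold: float = 14.4) -> bool:
    """Page only when the burn rate reaches the (assumed) threshold."""
    return burn_rate(errors, requests, slo) >= threshold
```

For example, 20 errors in 1,000 requests against a 99.9% SLO is a burn rate of about 20, which would page under this threshold; 1 error in 1,000 requests would not.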
Rigor of Follow-Up
Root Cause Analysis:
Identify and prevent repeat incidents.
Focus on improving systems, not just patching immediate bugs.
Systemic Fixes:
Aim for systemic improvements (e.g., automation, better monitoring).
On-Call Flexibility
Shift Length: 12-hour shifts are preferred for sustainable on-call work.
Flexibility: Accommodate personal life changes (e.g., part-time, temporary breaks).
Automated Scheduling: Use automated tools to handle complex schedules fairly.
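As a rough illustration of automated scheduling, the sketch below assigns each shift to the available engineer who has taken the fewest shifts so far. It is a simplified, hypothetical example (the function `build_rotation` and its parameters are not from the source), and real scheduling tools handle many more constraints such as time zones and shift swaps.

```python
from typing import Dict, List, Set


def build_rotation(engineers: List[str], num_shifts: int,
                   unavailable: Dict[int, Set[str]]) -> List[str]:
    """Assign one engineer per shift, balancing the number of shifts taken.

    `unavailable` maps a shift index to the engineers who cannot take it.
    """
    counts = {name: 0 for name in engineers}
    schedule = []
    for shift in range(num_shifts):
        away = unavailable.get(shift, set())
        candidates = [name for name in engineers if name not in away]
        if not candidates:
            raise ValueError(f"no one is available for shift {shift}")
        # Pick the least-loaded engineer; ties go to the earlier list entry.
        chosen = min(candidates, key=lambda name: counts[name])
        counts[chosen] += 1
        schedule.append(chosen)
    return schedule
```

Usage: `build_rotation(["ana", "ben", "chi"], 6, {1: {"ben"}})` skips "ben" for shift 1 and still gives each engineer two shifts overall.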
Team Dynamics
Building Positive Relations:
Encourage team bonding (e.g., offsite activities, lunches).
Co-locate on-call teams to foster collaboration.
Empowerment:
SRE Ownership: Make SREs responsible for site reliability, working alongside feature developers.
Conclusion
On-call is critical to site reliability but must be structured to prevent burnout and foster team collaboration.
Systematic improvement in on-call processes and team dynamics leads to healthier, more effective teams.