On-Call

Purpose of On-Call

Goal: Be available and respond to production incidents with appropriate urgency.
Responsibilities:
- Diagnose, mitigate, fix, or escalate incidents.
- Perform non-urgent production tasks when not handling incidents.

Balance: Engineers should not sacrifice health for reliability.
Goal: SREs should split their time between project work and on-call duties, with at least 50% of it going to project work.
Psychological Safety: Support systems and escalation procedures must exist to reduce stress.
Compensation: Offer time off or cash compensation for on-call work to help prevent burnout.

Google’s Approach:
- Training Roadmap: Checklist of focus areas (e.g., monitoring systems, debugging, handling traffic).
- Deep Dives: Focus on gaining expertise in services.
- Onboarding: Mentoring, shadowing on-call shifts, and practice with “Wheel of Misfortune” disaster exercises.
- Playbooks: Documentation to handle alerts and reduce Mean Time to Repair (MTTR).
Evernote’s Cloud Migration:
- Reframing Alerts: Shift focus from infrastructure-level events to user impact (e.g., API responsiveness).
- On-Call Rotation: Classify events into three categories:
  - P1: Immediate action needed, page on-call.
  - P2: Handle next business day, send email.
  - P3: Informational only.
- Post-Incident Review: Every P1 and P2 event has a postmortem and continuous improvement cycle.
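The three-way classification above can be sketched as a small routing table. This is only an illustration: the `Priority` enum, the channel names ("page", "email", "log"), and both helper functions are hypothetical and are not taken from Evernote's tooling. In particular, sending P3 events to a log is an assumption, since the notes only say they are informational.

```python
from enum import Enum


class Priority(Enum):
    P1 = "P1"  # immediate action needed
    P2 = "P2"  # handle next business day
    P3 = "P3"  # informational only


# Hypothetical channel names, chosen only for this sketch.
ROUTES = {
    Priority.P1: "page",
    Priority.P2: "email",
    Priority.P3: "log",  # assumption: informational events are only recorded
}


def route_event(priority: Priority) -> str:
    """Return the notification channel for an event of the given priority."""
    return ROUTES[priority]


def needs_review(priority: Priority) -> bool:
    """P1 and P2 events get a post-incident review; P3 events do not."""
    return priority in (Priority.P1, Priority.P2)
```

Keeping the mapping in one table makes it easy to audit which events are allowed to wake someone up.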

Definition: Pager load is the number of paging incidents an on-call engineer receives during a shift.
Techniques to Manage Pager Load:
- Ensure only actionable issues trigger pages.
- Automate routine fixes, and route lower-priority issues to tickets instead of pages.
- Maintain a high signal-to-noise ratio for alerts.
Appropriate Response Times:
- Critical (5 min): Requires immediate action (e.g., revenue-impacting outages).
- Moderate (30 min): Less critical issues.
- Low (during work hours): Non-urgent tasks like pre-launch backups.
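As a rough illustration, pager load and the response-time tiers above could be tracked with a few lines of code. The function names, the severity labels used as dictionary keys, and the choice to treat the "low" tier as having no clock-based deadline are all assumptions made for this sketch.

```python
from datetime import datetime, timedelta

# Response-time targets for each tier; None means "during work hours",
# which this sketch treats as having no clock-based deadline.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=5),
    "moderate": timedelta(minutes=30),
    "low": None,
}


def pager_load(page_times, shift_start, shift_end):
    """Count the pages received in the half-open interval [shift_start, shift_end)."""
    return sum(1 for t in page_times if shift_start <= t < shift_end)


def met_target(severity, paged_at, acked_at):
    """Return True if the page was acknowledged within its tier's target."""
    target = RESPONSE_TARGETS[severity]
    if target is None:
        return True
    return acked_at - paged_at <= target


shift_start = datetime(2024, 1, 1, 8, 0)
shift_end = shift_start + timedelta(hours=12)
pages = [
    datetime(2024, 1, 1, 9, 15),
    datetime(2024, 1, 1, 14, 40),
    datetime(2024, 1, 1, 21, 5),  # falls outside the 12-hour shift
]
load = pager_load(pages, shift_start, shift_end)
```

Tracking the count per shift over time shows whether pager load is trending toward an unsustainable level.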

Identifying Causes of High Pager Load:
- Existing production bugs.
- New bugs introduced into production.
- Poorly tuned alerting thresholds and error-prone human processes.
Strategies:
- Fix existing bugs before releasing new features.
- Automate production changes to reduce human error.
- Rollback strategy: Detect the problem, roll back, fix the bug, and then roll the corrected change forward.

Alerts should be immediately actionable and not overwhelm engineers.
SLO-based alerts: Page only when the error budget is being burned fast enough to threaten the SLO.
Team Review: New alerts should be thoroughly reviewed and tested against production traffic before they are allowed to page.
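One common way to express an SLO-based paging condition is a burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). A burn rate of 1.0 consumes the budget exactly over the SLO period, and higher values consume it proportionally faster. The sketch below assumes simple request and error counts over a single window; the function names and the threshold value used in the example are illustrative assumptions, not values prescribed by these notes.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means errors arrive at exactly the rate the SLO allows.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if requests == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_ratio = errors / requests
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(errors: int, requests: int, slo_target: float, threshold: float) -> bool:
    """Page only when the burn rate in the window reaches the threshold."""
    return burn_rate(errors, requests, slo_target) >= threshold


# Example: a 99.9% SLO with an assumed paging threshold of 14.4.
quiet = should_page(errors=10, requests=1000, slo_target=0.999, threshold=14.4)
noisy = should_page(errors=20, requests=1000, slo_target=0.999, threshold=14.4)
```

Paging on burn rate rather than on raw error counts ties the alert to user impact, so a brief blip that barely dents the budget does not wake anyone.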

Root Cause Analysis:
- Identify and prevent repeat incidents.
- Focus on improving systems, not just patching immediate bugs.
Systemic Fixes:
- Aim for systemic improvements (e.g., automation, better monitoring).

Shift Length: 12-hour shifts are preferred for sustainable on-call work.
Flexibility: Accommodate personal life changes (e.g., part-time, temporary breaks).
Automated Scheduling: Use automated tools to handle complex schedules fairly.
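A minimal sketch of fair automated scheduling, assuming a greedy rule: each shift goes to the available engineer who has worked the fewest shifts so far. The function name, the engineer names, and the `unavailable` mapping are hypothetical. Real scheduling tools handle many more constraints, such as time zones, shift swaps, and rest periods.

```python
def build_rotation(engineers, shifts, unavailable=None):
    """Assign each shift to the available engineer with the fewest shifts so far.

    `unavailable` maps an engineer's name to the set of shifts they cannot take.
    Ties are broken by the order of `engineers`.
    """
    unavailable = unavailable or {}
    counts = {name: 0 for name in engineers}
    schedule = {}
    for shift in shifts:
        candidates = [e for e in engineers if shift not in unavailable.get(e, set())]
        if not candidates:
            raise ValueError(f"no engineer available for shift {shift!r}")
        chosen = min(candidates, key=lambda e: counts[e])
        counts[chosen] += 1
        schedule[shift] = chosen
    return schedule


engineers = ["ana", "bo", "cy"]
shifts = ["mon", "tue", "wed", "thu", "fri", "sat"]
schedule = build_rotation(engineers, shifts, unavailable={"bo": {"tue"}})
```

In this example "bo" is skipped on Tuesday but still ends up with the same number of shifts as everyone else, which is the fairness property the greedy rule aims for.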

Building Positive Relations:
- Encourage team bonding (e.g., offsite activities, lunches).
- Co-locate on-call teams to foster collaboration.
Empowerment:
- SRE Ownership: Make SREs responsible for site reliability, working alongside feature developers.

On-call is critical to site reliability but must be structured to prevent burnout and foster team collaboration.
Systematic improvement in on-call processes and team dynamics leads to healthier, more effective teams.
