On-Call

Purpose of On-Call

  • Goal: Be available and respond to production incidents with appropriate urgency.

  • Responsibilities:

    • Diagnose, mitigate, fix, or escalate incidents.

    • Perform non-urgent production tasks when not handling incidents.


Key Considerations for On-Call

  • Balance: Engineers should not sacrifice health for reliability.

  • Goal: SREs should handle a healthy mix of project work and on-call duties (aim for at least 50% of time on project work).

  • Psychological Safety: Support systems and escalation procedures must exist to reduce stress.

  • Compensation: Offer time off or cash for on-call work to help prevent burnout.


Example On-Call Setups

  • Google’s Approach:

    • Training Roadmap: Checklist of focus areas (e.g., monitoring systems, debugging, handling traffic).

    • Deep Dives: Focus on gaining expertise in services.

    • Onboarding: Mentoring, shadowing on-call shifts, and practice with “Wheel of Misfortune” disaster exercises.

    • Playbooks: Documentation to handle alerts and reduce Mean Time to Repair (MTTR).

  • Evernote’s Cloud Migration:

    • Reframing Alerts: Shift focus from infrastructure-level events to user impact (e.g., API responsiveness).

    • On-Call Rotation: Classify events into three categories:

      • P1: Immediate action needed, page on-call.

      • P2: Handle next business day, send email.

      • P3: Informational only.

    • Post-Incident Review: Every P1 and P2 event has a postmortem and continuous improvement cycle.
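
The three-way classification above can be sketched as a small routing table. This is only an illustration: the action names (`page_on_call`, `email_next_business_day`, `log_only`) and the function names are hypothetical and are not taken from Evernote's tooling.

```python
from enum import Enum


class Priority(Enum):
    P1 = "P1"  # immediate action needed
    P2 = "P2"  # handle next business day
    P3 = "P3"  # informational only


# Hypothetical action names; the notes only say "page", "send email",
# and "informational".
ROUTES = {
    Priority.P1: "page_on_call",
    Priority.P2: "email_next_business_day",
    Priority.P3: "log_only",
}


def route(priority: Priority) -> str:
    """Return the notification action for an event of this priority."""
    return ROUTES[priority]


def needs_postmortem(priority: Priority) -> bool:
    """Per the notes, every P1 and P2 event gets a post-incident review."""
    return priority in (Priority.P1, Priority.P2)
```

Keeping the mapping in one table makes it easy to audit which events are allowed to wake someone up.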


Managing Pager Load

  • Definition: Number of paging incidents during a shift.

  • Techniques to Manage Pager Load:

    • Ensure only actionable issues trigger pages.

    • Automate routine fixes, and route lower-priority issues to tickets instead of pages.

    • Focus on high signal-to-noise ratio for alerts.

  • Appropriate Response Times:

    • Critical (5 min): Requires immediate action (e.g., revenue-impacting outages).

    • Moderate (30 min): Less critical issues.

    • Low (during work hours): Non-urgent tasks like pre-launch backups.
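
A minimal sketch of how the response-time tiers above could be turned into a deadline. The severity labels, the 09:00 to 17:00 Monday-to-Friday working hours, and the function name `respond_by` are assumptions made for illustration; only the 5-minute and 30-minute targets come from the notes.

```python
from datetime import datetime, timedelta

# Targets from the notes; the dictionary keys are hypothetical labels.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=5),
    "moderate": timedelta(minutes=30),
}

# Assumed working hours (local time, Monday to Friday).
WORKDAY_START = 9
WORKDAY_END = 17


def respond_by(severity: str, alerted_at: datetime) -> datetime:
    """Return the latest time by which someone should start responding."""
    if severity in RESPONSE_TARGETS:
        return alerted_at + RESPONSE_TARGETS[severity]
    if severity != "low":
        raise ValueError(f"unknown severity: {severity}")

    t = alerted_at
    # Already inside working hours on a weekday: can be picked up now.
    if t.weekday() < 5 and WORKDAY_START <= t.hour < WORKDAY_END:
        return t
    # After hours or on a weekend: move to the next day first.
    if t.weekday() >= 5 or t.hour >= WORKDAY_END:
        t += timedelta(days=1)
    t = t.replace(hour=WORKDAY_START, minute=0, second=0, microsecond=0)
    # Skip Saturday and Sunday.
    while t.weekday() >= 5:
        t += timedelta(days=1)
    return t
```

For example, a low-severity alert raised on a Saturday morning would not be due until 09:00 the following Monday, while a critical alert is due five minutes after it fires regardless of the day.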


Reducing Pager Load

  • Identifying Causes:

    • Existing production bugs.

    • New bugs introduced into production.

    • Poorly tuned alerting thresholds and error-prone human processes.

  • Strategies:

    • Fix existing bugs before releasing new features.

    • Automate production changes to reduce human error.

    • Rollback strategy: Detect the problem, roll back, fix the bug, then redeploy the fixed change.
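
Before choosing a strategy, it helps to know which cause dominates. The sketch below tallies pages per shift (the pager load defined earlier) and per cause. The record format, a list of `(shift_id, cause)` tuples, and the cause labels are assumptions for illustration.

```python
from collections import Counter


def pager_load(pages):
    """Summarize pages by shift and by cause.

    `pages` is a list of (shift_id, cause) tuples, a hypothetical export
    format from whatever paging tool is in use.
    """
    per_shift = Counter(shift for shift, _ in pages)
    per_cause = Counter(cause for _, cause in pages)
    return per_shift, per_cause


# Hypothetical sample data.
pages = [
    ("mon-day", "existing_bug"),
    ("mon-day", "alert_threshold"),
    ("mon-night", "new_bug"),
    ("tue-day", "alert_threshold"),
]
per_shift, per_cause = pager_load(pages)
print(per_shift.most_common(1))  # busiest shift
print(per_cause.most_common(1))  # most frequent cause
```

With this sample, the busiest shift is `mon-day` with two pages and the most frequent cause is `alert_threshold`, which would point toward tuning alerts before anything else.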


Best Practices for Alerts

  • Alerts should be immediately actionable and not overwhelm engineers.

  • SLO-based alerts: Page only when the error budget is being consumed fast enough to threaten the SLO.

  • Team Review: New alerts should be reviewed by the team and tested against production before they are allowed to page.
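
One common way to make SLO-based paging concrete is a burn-rate check: compare the observed error rate with the error rate the SLO allows. The sketch below is illustrative only. The function names are hypothetical, and the default threshold of 14.4 is an assumption (it corresponds to spending 2% of a 30-day error budget in one hour); pick a value that fits your own SLO window.

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO permits; higher values mean it will run out early.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if requests == 0:
        return 0.0  # no traffic, so no budget is being spent
    allowed_error_rate = 1.0 - slo
    return (errors / requests) / allowed_error_rate


def should_page(errors: int, requests: int, slo: float,
                threshold: float = 14.4) -> bool:
    """Page only when the budget is burning faster than `threshold`."""
    return burn_rate(errors, requests, slo) >= threshold
```

With a 99.9% SLO, 10 errors in 1,000 requests gives a burn rate of about 10 and does not page at this threshold, while 20 errors in 1,000 requests gives about 20 and does.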


Rigor of Follow-Up

  • Root Cause Analysis:

    • Identify and prevent repeat incidents.

    • Focus on improving systems, not just patching immediate bugs.

  • Systemic Fixes:

    • Aim for systemic improvements (e.g., automation, better monitoring).


On-Call Flexibility

  • Shift Length: 12-hour shifts are preferred for sustainable on-call work.

  • Flexibility: Accommodate personal life changes (e.g., part-time, temporary breaks).

  • Automated Scheduling: Use automated tools to handle complex schedules fairly.
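
As a rough illustration of automated, fair scheduling, the sketch below assigns each shift to the available engineer who has covered the fewest shifts so far. The function name, the input shapes, and the tie-breaking rule (earliest in the list wins) are all assumptions; real scheduling tools handle many more constraints.

```python
def build_schedule(engineers, shifts, unavailable=None):
    """Greedy fair assignment of shifts to engineers.

    engineers:   list of names, in tie-break order
    shifts:      list of shift identifiers, in chronological order
    unavailable: optional dict mapping a name to the set of shifts
                 that person cannot take (e.g. leave, part-time days)
    """
    unavailable = unavailable or {}
    counts = {name: 0 for name in engineers}
    schedule = {}
    for shift in shifts:
        candidates = [
            name for name in engineers
            if shift not in unavailable.get(name, set())
        ]
        if not candidates:
            raise ValueError(f"nobody is available for shift {shift!r}")
        # Pick whoever has the lightest load so far.
        chosen = min(candidates, key=lambda name: counts[name])
        counts[chosen] += 1
        schedule[shift] = chosen
    return schedule


# Hypothetical team and week.
schedule = build_schedule(
    engineers=["ana", "bo", "cy"],
    shifts=["mon", "tue", "wed", "thu"],
    unavailable={"ana": {"mon"}},
)
print(schedule)
```

Here `ana` is skipped on Monday, and the load still evens out: the schedule comes out as `bo`, `ana`, `cy`, `ana` for Monday through Thursday.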


Team Dynamics

  • Building Positive Relations:

    • Encourage team bonding (e.g., offsite activities, lunches).

    • Co-locate on-call teams to foster collaboration.

  • Empowerment:

    • SRE Ownership: Make SREs responsible for site reliability, working alongside feature developers.


Conclusion

  • On-call is critical to site reliability but must be structured to prevent burnout and foster team collaboration.

  • Systematic improvement in on-call processes and team dynamics leads to healthier, more effective teams.
