# On-Call

## **Purpose of On-Call**

* **Goal**: Be available and respond to production incidents with appropriate urgency.
* **Responsibilities**:
  * Diagnose, mitigate, fix, or escalate incidents.
  * Perform non-urgent production tasks when not handling incidents.

***

## **Key Considerations for On-Call**

* **Balance**: Engineers should not sacrifice health for reliability.
* **Goal**: SREs should handle a healthy mix of project work and on-call duties (at least 50% of time on project work is the target).
* **Psychological Safety**: Support systems and escalation procedures must exist to reduce stress.
* **Compensation**: Offer time off or cash compensation for on-call work to help prevent burnout.

***

## **Example On-Call Setups**

* **Google’s Approach**:
  * **Training Roadmap**: Checklist of focus areas (e.g., monitoring systems, debugging, handling traffic).
  * **Deep Dives**: Focus on gaining expertise in services.
  * **Onboarding**: Mentoring, shadowing on-call shifts, and practice with “Wheel of Misfortune” disaster exercises.
  * **Playbooks**: Documentation to handle alerts and reduce Mean Time to Repair (MTTR).
* **Evernote’s Cloud Migration**:
  * **Reframing Alerts**: Shift focus from infrastructure-level events to user impact (e.g., API responsiveness).
  * **On-Call Rotation**: Classify events into three categories:
    * **P1**: Immediate action needed, page on-call.
    * **P2**: Handle next business day, send email.
    * **P3**: Informational only.
  * **Post-Incident Review**: Every P1 and P2 event has a postmortem and continuous improvement cycle.
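
The three-tier classification above could be encoded as a small routing table. This is a minimal sketch: the `Priority` enum, the `route` and `needs_review` functions, and the channel names (especially `"log"` for P3) are hypothetical and chosen for illustration, not taken from Evernote's actual tooling.

```python
from enum import Enum


class Priority(Enum):
    P1 = 1  # immediate action needed
    P2 = 2  # handle next business day
    P3 = 3  # informational only


# Assumed mapping from priority to notification channel.
ROUTES = {
    Priority.P1: "page",   # page the on-call engineer
    Priority.P2: "email",  # send email for next-business-day handling
    Priority.P3: "log",    # record only (channel name is an assumption)
}


def route(priority: Priority) -> str:
    """Return the notification channel for an event of this priority."""
    return ROUTES[priority]


def needs_review(priority: Priority) -> bool:
    """P1 and P2 events get a post-incident review; P3 events do not."""
    return priority in (Priority.P1, Priority.P2)
```

Keeping the mapping in one table makes it easy to audit which events are allowed to page.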

***

## **Managing Pager Load**

* **Definition**: Number of paging incidents an on-call engineer receives during a shift.
* **Techniques to Manage Pager Load**:
  * Ensure only actionable issues trigger pages.
  * Use automation for routine fixes, and downgrade lower-priority issues to tickets instead of pages.
  * Focus on **high signal-to-noise ratio** for alerts.
* **Appropriate Response Times**:
  * **Critical (5 min)**: Requires immediate action (e.g., revenue-impacting outages).
  * **Moderate (30 min)**: Less critical issues.
  * **Low (during work hours)**: Non-urgent tasks like pre-launch backups.
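
The response-time tiers above can be checked mechanically. The sketch below assumes a hypothetical `RESPONSE_TARGETS` table and helper names. The "low" tier has no fixed deadline here because it is handled during work hours.

```python
from datetime import datetime, timedelta
from typing import Optional

# Targets taken from the list above; None means "during work hours".
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=5),
    "moderate": timedelta(minutes=30),
    "low": None,
}


def ack_deadline(severity: str, paged_at: datetime) -> Optional[datetime]:
    """Latest acceptable acknowledgement time, or None for the 'low' tier."""
    target = RESPONSE_TARGETS[severity]
    return None if target is None else paged_at + target


def met_target(severity: str, paged_at: datetime, acked_at: datetime) -> bool:
    """True if the page was acknowledged within the tier's target."""
    deadline = ack_deadline(severity, paged_at)
    return deadline is None or acked_at <= deadline
```

For example, `met_target("critical", paged_at, acked_at)` is false whenever the acknowledgement arrives more than five minutes after the page.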

***

## **Reducing Pager Load**

* **Identifying Causes**:
  * Existing production bugs.
  * New bugs introduced into production.
  * Alerting (e.g., poorly tuned thresholds) and human processes.
* **Strategies**:
  * **Fix existing bugs** before releasing new features.
  * **Automate** production changes to reduce human error.
  * **Rollback strategy**: Detect the bad change, roll it back, fix the bug, then roll the corrected change forward.
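
To decide which of the causes above deserves attention first, one option is to tag each page with its cause and count them per shift. This is a minimal sketch. The page log, the cause tags, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical page log for one shift: (alert name, cause tag).
pages = [
    ("disk_full", "existing_bug"),
    ("api_5xx", "new_bug"),
    ("disk_full", "existing_bug"),
    ("latency_warn", "alert_threshold"),
]


def pager_load(pages):
    """Pager load: the number of paging incidents in the shift."""
    return len(pages)


def causes_by_frequency(pages):
    """Cause tags ordered from most to least frequent."""
    return Counter(cause for _, cause in pages).most_common()
```

Here the most frequent cause is an existing bug, which suggests fixing it before shipping new features.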

***

## **Best Practices for Alerts**

* Alerts should be **immediately actionable** and not overwhelm engineers.
* **SLO-based alerts**: Page only when the error budget is being burned fast enough to threaten the SLO.
* **Team Review**: New alerts should be thoroughly reviewed by the team and tested before they are allowed to page.
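
SLO-based paging is often expressed as a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes a request-based SLO. The function names and the default threshold of 14.4 are illustrative assumptions (a burn rate of 14.4 sustained for one hour consumes 2% of a 30-day budget, since 14.4 / 720 = 0.02).

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if requests == 0:
        return 0.0  # no traffic, so no budget is consumed
    error_budget = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(errors: int, requests: int, slo: float,
                threshold: float = 14.4) -> bool:
    """Page only when the budget burns at or above the threshold rate."""
    return burn_rate(errors, requests, slo) >= threshold
```

A burn rate of 1 means the budget would be exactly used up over the SLO window. Values well above 1 are what justify waking someone up.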

***

## **Rigor of Follow-Up**

* **Root Cause Analysis**:
  * Identify and prevent repeat incidents.
  * Focus on improving systems, not just patching immediate bugs.
* **Systemic Fixes**:
  * Aim for systemic improvements (e.g., automation, better monitoring).

***

## **On-Call Flexibility**

* **Shift Length**: 12-hour shifts are preferred for sustainable on-call work.
* **Flexibility**: Accommodate personal life changes (e.g., part-time, temporary breaks).
* **Automated Scheduling**: Use automated tools to handle complex schedules fairly.
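
As an illustration of automated scheduling, the sketch below assigns each shift to the available engineer who has the fewest shifts so far. It is one simple fairness heuristic, not a description of any particular scheduling tool. The function name, the engineer names, and the `unavailable` format are hypothetical.

```python
def build_schedule(engineers, shifts, unavailable=None):
    """Assign each shift to the least-loaded engineer who is available.

    unavailable maps an engineer's name to the set of shifts they cannot take.
    """
    unavailable = unavailable or {}
    load = {name: 0 for name in engineers}
    schedule = {}
    for shift in shifts:
        candidates = [e for e in engineers
                      if shift not in unavailable.get(e, set())]
        if not candidates:
            raise ValueError(f"no one is available for shift {shift!r}")
        # min() keeps list order on ties, so the result is deterministic.
        chosen = min(candidates, key=lambda e: load[e])
        schedule[shift] = chosen
        load[chosen] += 1
    return schedule


schedule = build_schedule(
    ["ana", "ben", "chen"],
    ["mon", "tue", "wed", "thu"],
    unavailable={"ana": {"mon"}},
)
```

Personal constraints such as part-time work or temporary breaks can be expressed through `unavailable` without hand-editing the rotation.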

***

## **Team Dynamics**

* **Building Positive Relations**:
  * Encourage team bonding (e.g., offsite activities, lunches).
  * Co-locate on-call teams to foster collaboration.
* **Empowerment**:
  * **SRE Ownership**: Make SREs responsible for site reliability, working alongside feature developers.

***

## **Conclusion**

* On-call is critical to site reliability but must be structured to prevent burnout and foster team collaboration.
* Systematic improvement in on-call processes and team dynamics leads to healthier, more effective teams.
