# On-Call

## **Purpose of On-Call**

* **Goal**: Be available and respond to production incidents with appropriate urgency.
* **Responsibilities**:
  * Diagnose, mitigate, fix, or escalate incidents.
  * Perform non-urgent production tasks when not handling incidents.

***

## **Key Considerations for On-Call**

* **Balance**: Engineers should not sacrifice health for reliability.
* **Goal**: SREs should keep a healthy mix of project work and on-call duties, with at least 50% of their time going to project work.
* **Psychological Safety**: Support systems and escalation procedures must exist to reduce stress.
* **Compensation**: Offer time off or cash incentives for on-call work to help prevent burnout.

***

## **Example On-Call Setups**

* **Google’s Approach**:
  * **Training Roadmap**: Checklist of focus areas (e.g., monitoring systems, debugging, handling traffic).
  * **Deep Dives**: Focus on gaining expertise in services.
  * **Onboarding**: Mentoring, shadowing on-call shifts, and practice with “Wheel of Misfortune” disaster exercises.
  * **Playbooks**: Documentation to handle alerts and reduce Mean Time to Repair (MTTR).
* **Evernote’s Cloud Migration**:
  * **Reframing Alerts**: Shift focus from infrastructure-level events to user impact (e.g., API responsiveness).
  * **On-Call Rotation**: Classify events into three categories:
    * **P1**: Immediate action needed, page on-call.
    * **P2**: Handle next business day, send email.
    * **P3**: Informational only.
  * **Post-Incident Review**: Every P1 and P2 event has a postmortem and continuous improvement cycle.
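
The three event categories above can be sketched as a small routing table. This is only an illustration: the `Severity`, `Routing`, and `route` names are hypothetical, and the `"log"` channel for P3 is an assumption, since the notes say only that P3 is informational.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    P1 = "P1"  # immediate action needed
    P2 = "P2"  # handle next business day
    P3 = "P3"  # informational only


@dataclass(frozen=True)
class Routing:
    channel: str            # how the event is delivered (illustrative names)
    needs_postmortem: bool  # whether a post-incident review is required


# Mapping derived from the categories above. "log" for P3 is an assumption.
ROUTING = {
    Severity.P1: Routing(channel="page", needs_postmortem=True),
    Severity.P2: Routing(channel="email", needs_postmortem=True),
    Severity.P3: Routing(channel="log", needs_postmortem=False),
}


def route(severity: Severity) -> Routing:
    """Return how an event of the given severity should be delivered."""
    return ROUTING[severity]
```

Keeping the mapping in one table makes it easy for the team to review which events are allowed to page.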

***

## **Managing Pager Load**

* **Definition**: Number of paging incidents during a shift.
* **Techniques to Manage Pager Load**:
  * Ensure only actionable issues trigger pages.
  * Automate routine fixes, and downgrade lower-priority issues to tickets instead of pages.
  * Focus on **high signal-to-noise ratio** for alerts.
* **Appropriate Response Times**:
  * **Critical (5 min)**: Requires immediate action (e.g., revenue-impacting outages).
  * **Moderate (30 min)**: Less critical issues.
  * **Low (during work hours)**: Non-urgent tasks like pre-launch backups.
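
As a rough sketch of the definitions above, pager load can be counted per shift and each urgency level mapped to its response target. The function names, the urgency keys, and the shift identifiers are hypothetical; only the 5-minute and 30-minute targets come from the list.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Optional

# Response targets from the list above. None means "handle during work hours".
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=5),
    "moderate": timedelta(minutes=30),
    "low": None,
}


def response_deadline(urgency: str, paged_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable response time, or None if there is no fixed deadline."""
    target = RESPONSE_TARGETS[urgency]
    return None if target is None else paged_at + target


def pager_load(page_shift_ids: list[str]) -> Counter:
    """Count paging incidents per shift, given the shift id of each page."""
    return Counter(page_shift_ids)
```

For example, `pager_load(["mon-day", "mon-night", "mon-night"])` reports two pages for the `mon-night` shift, which a team could compare against whatever per-shift limit it has agreed on.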

***

## **Reducing Pager Load**

* **Identifying Causes**:
  * Existing production bugs.
  * New bugs introduced into production.
  * Poorly tuned alerting thresholds and error-prone human processes.
* **Strategies**:
  * **Fix existing bugs** before releasing new features.
  * **Automate** production changes to reduce human error.
  * **Rollback strategy**: Detect the bad change, roll it back, fix it, and then reintroduce the corrected change.

***

## **Best Practices for Alerts**

* Alerts should be **immediately actionable** and not overwhelm engineers.
* **SLO-based alerts**: Page only when the error budget is burning fast enough to threaten the SLO.
* **Team Review**: New alerts should be thoroughly reviewed by the team and tested before they are allowed to page.
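
One common way to express an SLO-based alert is a burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes a request-based SLO; the function names and the paging threshold are illustrative choices, not values taken from these notes.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed relative to the allowed rate.

    A burn rate of 1.0 spends exactly the whole budget over the SLO period.
    """
    budget = 1.0 - slo_target  # allowed fraction of failed requests
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(error_ratio: float, slo_target: float, threshold: float) -> bool:
    """Page only when the budget is burning at or above the chosen threshold."""
    return burn_rate(error_ratio, slo_target) >= threshold
```

As a worked example of choosing a threshold: with a 30-day (720-hour) SLO period, a burn rate of 14.4 sustained for one hour spends 14.4 / 720 = 2% of the budget. Each team would pick its own threshold and measurement window.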

***

## **Rigor of Follow-Up**

* **Root Cause Analysis**:
  * Identify and prevent repeat incidents.
  * Focus on improving systems, not just patching immediate bugs.
* **Systemic Fixes**:
  * Aim for systemic improvements (e.g., automation, better monitoring).

***

## **On-Call Flexibility**

* **Shift Length**: 12-hour shifts are preferred for sustainable on-call work.
* **Flexibility**: Accommodate personal life changes (e.g., part-time, temporary breaks).
* **Automated Scheduling**: Use automated tools to handle complex schedules fairly.
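
A minimal sketch of automated scheduling, assuming a simple round-robin over 12-hour shifts that skips anyone marked unavailable. The `build_rotation` function and its parameters are hypothetical; real scheduling tools also weigh holidays, time zones, and past load, which this example does not attempt.

```python
from datetime import datetime, timedelta
from itertools import cycle


def build_rotation(engineers, start, shifts, shift_hours=12, unavailable=None):
    """Assign consecutive shifts round-robin, skipping unavailable engineers.

    unavailable maps an engineer name to a set of shift start times to skip.
    """
    unavailable = unavailable or {}
    order = cycle(engineers)
    schedule = []
    for i in range(shifts):
        shift_start = start + timedelta(hours=shift_hours * i)
        # Try each engineer at most once for this shift.
        for _ in range(len(engineers)):
            candidate = next(order)
            if shift_start not in unavailable.get(candidate, set()):
                schedule.append((shift_start, candidate))
                break
        else:
            raise ValueError(f"nobody available for shift starting {shift_start}")
    return schedule
```

Called with `["ana", "bo", "cy"]` and four shifts, this assigns the shifts in that order and wraps back to `ana`. An engineer who is skipped simply waits for their next turn in the cycle, so a fairer version would track how many shifts each person has already covered.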

***

## **Team Dynamics**

* **Building Positive Relations**:
  * Encourage team bonding (e.g., offsite activities, lunches).
  * Co-locate on-call teams to foster collaboration.
* **Empowerment**:
  * **SRE Ownership**: Make SREs responsible for site reliability, working alongside feature developers.

***

## **Conclusion**

* On-call is critical to site reliability but must be structured to prevent burnout and foster team collaboration.
* Systematic improvement in on-call processes and team dynamics leads to healthier, more effective teams.

