Mike's Notes

Toil

Toil in SRE

  • Toil refers to operational work that is manual, repetitive, automatable, tactical, and devoid of long-term value.

  • It is tied to running production services and scales linearly with service growth.

Characteristics of Toil

  • Manual: Tasks that require human intervention.

  • Repetitive: Tasks done repeatedly over time.

  • Automatable: Tasks that could be done by machines.

  • Tactical: Interrupt-driven, reactive work (e.g., handling pager alerts).

  • No enduring value: Tasks that don't result in permanent improvement.

  • Scales linearly: Effort increases with service size or usage.
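The checklist above can be sketched as a rough triage helper. This is illustrative only: the `Task` record, its field names, and `toil_score` are hypothetical, and the score is just a count of matching characteristics. The more traits a task matches, the more likely it is toil; a task does not need all six to count.

```python
from dataclasses import dataclass

# The six characteristics from the list above, as hypothetical field names.
TRAITS = (
    "manual",
    "repetitive",
    "automatable",
    "tactical",
    "no_enduring_value",
    "scales_linearly",
)


@dataclass
class Task:
    """Hypothetical record; each flag mirrors one characteristic of toil."""
    name: str
    manual: bool = False
    repetitive: bool = False
    automatable: bool = False
    tactical: bool = False
    no_enduring_value: bool = False
    scales_linearly: bool = False


def toil_score(task: Task) -> int:
    """Count how many of the six toil characteristics the task matches."""
    return sum(getattr(task, trait) for trait in TRAITS)


# A manual restart done on every page matches every trait.
restart = Task(
    "restart stuck worker",
    manual=True,
    repetitive=True,
    automatable=True,
    tactical=True,
    no_enduring_value=True,
    scales_linearly=True,
)

# Design work matches none of them.
design = Task("write autoscaler design doc")

print(restart.name, toil_score(restart))
print(design.name, toil_score(design))
```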

Why Reducing Toil Matters

  • SREs aim to keep toil under 50% of their time to focus on long-term engineering projects.

  • Excessive toil leads to:

    • Burnout and low morale.

    • Stagnation in career growth.

    • Slower progress and productivity loss.

    • Confusion about SRE's role as an engineering organization.

    • Risk of attrition among top engineers.
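One way to track the 50% target is to compare logged toil hours against total working hours. A minimal sketch, assuming hours are recorded per person per period; the function names and the sample numbers are made up for illustration.

```python
TOIL_BUDGET = 0.50  # target from these notes: keep toil under 50% of time


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Return the share of working time spent on toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_budget(toil_hours: float, total_hours: float,
                budget: float = TOIL_BUDGET) -> bool:
    """True when toil reaches or exceeds the budget (the goal is to stay under it)."""
    return toil_fraction(toil_hours, total_hours) >= budget


# Hypothetical week: 26 of 40 hours went to tickets, pages, and manual releases.
print(toil_fraction(26, 40), over_budget(26, 40))
# Hypothetical week: 12 of 40 hours of toil.
print(toil_fraction(12, 40), over_budget(12, 40))
```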

Calculating Toil

  • On-call duty alone sets a floor of roughly 25%-33% of an SRE's time, depending on rotation size.

  • Interrupts, urgent responses, and manual processes (e.g., releases) contribute significantly to toil.
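A rough way to see where the 25%-33% floor comes from. The sketch assumes each engineer covers two on-call weeks per rotation cycle (one primary, one secondary), which is the setup described in Google's SRE book; these notes do not spell out the rotation details, and the function and parameter names are illustrative.

```python
def on_call_floor(rotation_size: int, on_call_weeks_per_cycle: int = 2) -> float:
    """Lower bound on the share of time spent on call.

    Assumes one cycle lasts `rotation_size` weeks and each engineer is on call
    for `on_call_weeks_per_cycle` of them (assumed: primary + secondary).
    """
    if rotation_size < on_call_weeks_per_cycle:
        raise ValueError("rotation too small for the assumed on-call weeks")
    return on_call_weeks_per_cycle / rotation_size


for size in (6, 8):
    print(f"{size}-person rotation: {on_call_floor(size):.0%} floor")
```

Under these assumptions a 6-person rotation gives 2/6 (about 33%) and an 8-person rotation gives 2/8 (25%), so larger rotations lower the floor.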

Engineering vs. Toil

  • Engineering work: Novel, strategic, and produces lasting value.

    • Includes coding, creating automation tools, and system configuration.

  • Overhead: Administrative work such as HR tasks or team meetings. It is not toil, but it is not engineering either.

Toil's Impact on Teams

  • Toil is not always bad; small amounts can be calming and provide quick wins.

  • However, too much toil leads to inefficiency, slower feature delivery, and lower morale.

Conclusion

  • Reducing toil through automation and engineering helps scale services more efficiently and enables SREs to focus on high-value, strategic work.


Last updated 8 months ago
