Toil

Toil in SRE

  • Toil refers to operational work that is manual, repetitive, automatable, tactical, and devoid of long-term value.

  • It is tied to running production services and scales linearly with service growth.

Characteristics of Toil

  • Manual: Tasks that require human intervention.

  • Repetitive: Tasks performed over and over, as opposed to one-off or novel problems.

  • Automatable: Tasks a machine could do just as well, with no human judgment required.

  • Tactical: Interrupt-driven, reactive work (e.g., handling pager alerts).

  • No enduring value: Tasks that don’t result in permanent improvement.

  • Scales linearly: Effort grows in proportion to service size, traffic, or usage.

Why Reducing Toil Matters

  • SREs aim to keep toil below 50% of their time so that the rest can go to long-term engineering projects.

  • Excessive toil leads to:

    • Burnout and low morale.

    • Stagnation in career growth.

    • Slower progress and productivity loss.

    • Confusion about SRE’s role as an engineering organization.

    • Risk of attrition among top engineers.
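The 50% target above can be checked against a simple time log. The following is a minimal sketch, not an established tool: the `TimeEntry` type, the category names, and the choice to count overhead hours in the denominator are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Target from the notes above: toil should stay below 50% of an SRE's time.
TOIL_CAP = 0.50


@dataclass
class TimeEntry:
    """One logged block of work (hypothetical schema)."""
    category: str  # assumed labels: "toil", "engineering", "overhead"
    hours: float


def toil_share(entries: list[TimeEntry]) -> float:
    """Return toil hours as a fraction of all logged hours."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0  # nothing logged, so no toil to report
    toil = sum(e.hours for e in entries if e.category == "toil")
    return toil / total


# Example week: 14 h toil, 20 h engineering, 6 h overhead.
week = [
    TimeEntry("toil", 14),
    TimeEntry("engineering", 20),
    TimeEntry("overhead", 6),
]
share = toil_share(week)
print(f"toil share: {share:.0%}, over cap: {share > TOIL_CAP}")
```

Tracking the share per engineer over several weeks, instead of a single week, gives a steadier signal, since on-call weeks will naturally spike.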

Calculating Toil

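  • On-call sets a floor on toil: if each engineer takes one primary and one secondary on-call week per rotation cycle, on-call alone takes at least 33% of an SRE's time in a 6-person rotation (2 of 6 weeks) and 25% in an 8-person rotation (2 of 8 weeks).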

  • Interrupts, urgent responses, and manual processes (e.g., releases) contribute significantly to toil.
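The 25%-33% floor can be reproduced with simple arithmetic. This sketch assumes each engineer spends two weeks on call per rotation cycle (one primary, one secondary) and that the cycle length in weeks equals the rotation size; the function name and parameters are illustrative.

```python
from fractions import Fraction


def on_call_floor(rotation_size: int, on_call_weeks_per_cycle: int = 2) -> Fraction:
    """Lower bound on the share of an engineer's time spent on call.

    Assumes a cycle of `rotation_size` weeks in which each engineer is
    on call for `on_call_weeks_per_cycle` of them.
    """
    if rotation_size <= 0:
        raise ValueError("rotation_size must be positive")
    if not 0 <= on_call_weeks_per_cycle <= rotation_size:
        raise ValueError("on-call weeks must fit within the cycle")
    return Fraction(on_call_weeks_per_cycle, rotation_size)


for size in (6, 8):
    floor = on_call_floor(size)
    print(f"{size}-person rotation: at least {float(floor):.0%} on call")
```

Under these assumptions a 6-person rotation gives 2/6 (about 33%) and an 8-person rotation gives 2/8 (25%), so larger rotations lower the floor but never remove it.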

Engineering vs. Toil

  • Engineering work: Novel, strategic work that produces lasting value.

    • Includes coding, creating automation tools, and system configuration.

  • Overhead: Administrative tasks like HR work or team meetings, which aren't considered toil but also don't involve engineering.

Toil's Impact on Teams

  • Toil is not always bad; small amounts can be calming and provide quick wins.

  • However, too much toil leads to inefficiency, slower feature delivery, and lower morale.

Conclusion

  • Reducing toil through automation and engineering helps scale services more efficiently and enables SREs to focus on high-value, strategic work.
