Toil
Toil in SRE:
Toil is operational work tied to running a production service that tends to be manual, repetitive, automatable, tactical, and devoid of enduring value.
It scales linearly as the service grows.
Characteristics of Toil
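Manual: Work a human has to perform by hand, including manually running a script that automates some of the steps.
Repetitive: Work done over and over, as opposed to a one-off task or a novel problem.
Automatable: Work a machine could do just as well, with no human judgment required.
Tactical: Interrupt-driven, reactive work (e.g., handling pager alerts), as opposed to strategy-driven, proactive work.
No enduring value: Work that leaves the service in the same state once it is finished.
Scales linearly: Work that grows in proportion to service size, traffic, or user count.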
Why Reducing Toil Matters
SRE teams aim to keep toil below 50% of each engineer’s time, so that at least half is left for long-term engineering projects.
Excessive toil leads to:
Burnout and low morale.
Stagnation in career growth.
Slower progress and productivity loss.
Confusion about SRE’s role as an engineering organization.
Risk of attrition among top engineers.
Calculating Toil
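On-call duty sets a floor on toil: with one primary and one secondary on-call week per cycle, it takes at least 2/8 = 25% of an SRE’s time in an 8-person rotation and 2/6 ≈ 33% in a 6-person rotation.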
Interrupts, urgent responses, and manual processes (e.g., releases) contribute significantly to toil.
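The 25%-33% range can be reproduced with a small calculation. The sketch below assumes each engineer carries two on-call weeks (one primary, one secondary) per cycle, and that a cycle lasts as many weeks as there are people in the rotation. The function name and parameters are illustrative, not taken from any SRE tooling.

```python
from fractions import Fraction


def on_call_toil_floor(rotation_size: int, on_call_weeks_per_cycle: int = 2) -> Fraction:
    """Return the minimum share of an engineer's time spent on call.

    Assumption (for illustration): a cycle is `rotation_size` weeks long and
    each engineer is on call for `on_call_weeks_per_cycle` of those weeks.
    """
    if rotation_size <= 0 or on_call_weeks_per_cycle < 0:
        raise ValueError("rotation size must be positive and weeks non-negative")
    if on_call_weeks_per_cycle > rotation_size:
        raise ValueError("on-call weeks cannot exceed the cycle length")
    return Fraction(on_call_weeks_per_cycle, rotation_size)


if __name__ == "__main__":
    for size in (6, 8):
        floor = on_call_toil_floor(size)
        print(f"{size}-person rotation: at least {float(floor):.0%} of time on call")
```

Under these assumptions a 6-person rotation gives a floor of 1/3 and an 8-person rotation a floor of 1/4. Smaller rotations push the floor higher, which leaves less room under the 50% target for other sources of toil.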
Engineering vs. Toil
Engineering work: Novel, strategic work that produces lasting value.
It includes writing code, building automation tools, and configuring systems in ways that yield a lasting improvement.
Overhead: Administrative work not tied directly to running a service, such as HR tasks or team meetings. It counts as neither toil nor engineering.
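One way to see where a team stands against the 50% target is to tag logged work with the three categories above and compare the shares. The following is a minimal sketch: the work log, its hours, and the names `WORK_LOG`, `TOIL_CAP`, and `share_by_category` are all hypothetical, invented for illustration.

```python
from collections import defaultdict

TOIL_CAP = 0.50  # target from the text: toil below 50% of time

# Hypothetical week of logged work: (description, category, hours).
WORK_LOG = [
    ("Handle pager alerts", "toil", 9.0),
    ("Manual release push", "toil", 6.0),
    ("Write rollout automation", "engineering", 14.0),
    ("Rework service configuration", "engineering", 5.0),
    ("Team meetings", "overhead", 4.0),
    ("HR tasks", "overhead", 2.0),
]


def share_by_category(log):
    """Return each category's fraction of the total logged hours."""
    totals = defaultdict(float)
    for _description, category, hours in log:
        totals[category] += hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: hours / grand_total for category, hours in totals.items()}


shares = share_by_category(WORK_LOG)

if __name__ == "__main__":
    for category, share in sorted(shares.items()):
        print(f"{category:<12}{share:.1%}")
    status = "within" if shares.get("toil", 0.0) < TOIL_CAP else "over"
    print(f"Toil is {status} the {TOIL_CAP:.0%} cap")
```

The hard part in practice is the tagging, not the arithmetic: the same task can be toil when done by hand every week and engineering when it is the one-time effort that removes the weekly task.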
Toil's Impact on Teams
Toil is not always bad; small amounts can be calming and provide quick wins.
However, too much toil leads to inefficiency, slower feature delivery, and lower morale.
Conclusion
Reducing toil through automation and engineering helps scale services more efficiently and enables SREs to focus on high-value, strategic work.