Preface | xiii | |
1 Site Reliability Engineering in Six Words | 2 | (2) |
2 Do We Know Why We Really Want Reliability? | 4 | (2) |
3 Building Self-Regulating Processes | 6 | (2) |
4 Four Engineers of an SRE Seder | 8 | (2) |
| 10 | (2) |
6 Infrastructure: It's Where the Power Is | 12 | (2) |
7 Thinking About Resilience | 14 | (2) |
8 Observability in the Development Cycle | 16 | (2) |
| 18 | (2) |
10 How Wikipedia Is Served to You | 20 | (2) |
11 Why You Should Understand (a Little) About TCP | 22 | (2) |
12 The Importance of a Management Interface | 24 | (2) |
13 When It Comes to Storage, Think Distributed | 26 | (2) |
14 The Role of Cardinality | 28 | (2) |
15 Security Is like an Onion | 30 | (2) |
| 32 | (2) |
| 34 | (2) |
| 36 | (2) |
19 Sustainability and Burnout | 38 | (2) |
20 Don't Take Advice from Graybeards | 40 | (2) |
21 Facing That First Page | 42 | (3) |
22 SRE, at Any Size, Is Cultural | 45 | (2) |
23 Everyone Is an SRE in a Small Organization | 47 | (2) |
24 Auditing Your Environment for Improvements | 49 | (2) |
25 With Incident Response, Start Small | 51 | (2) |
26 Solo SRE: Effecting Large-Scale Change as a Single Individual | 53 | (2) |
27 Design Goals for SLO Measurement | 55 | (2) |
28 I Have an Error Budget-Now What? | 57 | (2) |
| 59 | (2) |
30 Methodological Debugging | 61 | (2) |
31 How Startups Can Build an SRE Mindset | 63 | (2) |
32 Bootstrapping SRE in Enterprises | 65 | (2) |
33 It's Okay Not to Know, and It's Okay to Be Wrong | 67 | (2) |
34 Storytelling Is a Superpower | 69 | (2) |
35 Get Your Work Recognized: Write a Brag Document | 71 | (3) |
| 74 | (2) |
37 An Overlooked Engineering Skill | 76 | (2) |
38 Unpacking the On-Call Divide | 78 | (2) |
39 The Maestros of Incident Response | 80 | (2) |
40 Effortless Incident Management | 82 | (2) |
41 If You're Doing Runbooks, Do Them Well | 84 | (2) |
42 Why I Hate Our Playbooks | 86 | (2) |
| 88 | (2) |
44 Integrating Empathy into SRE Tools | 90 | (3) |
45 Using ChatOps to Implement Empathy | 93 | (2) |
46 Move Fast to Unbreak Things | 95 | (2) |
47 You Don't Know for Sure Until It Runs in Production | 97 | (2) |
48 Sometimes the Fix Is the Problem | 99 | (2) |
| 101 | (2) |
50 Metrics Are Not SLIs (The Measure Everything Trap) | 103 | (2) |
51 When SLOs Attack: Pathological SLOs and How to Fix Them | 105 | (2) |
52 Holistic Approach to Product Reliability | 107 | (2) |
53 In Search of the Lost Time | 109 | (2) |
54 Unexpected Lessons from Office Hours | 111 | (2) |
55 Building Tools for Internal Customers that They Actually Want to Use | 113 | (2) |
56 It's About the Individuals and Interactions | 115 | (2) |
57 The Human Baseline in SRE | 117 | (2) |
58 Remotely Productive or Productively Remote | 119 | (2) |
59 Of Margins and Individuals | 121 | (2) |
60 The Importance of Margins in Systems | 123 | (2) |
61 Fewer Spreadsheets, More Napkins | 125 | (2) |
62 Sneaking in Your DevOps Deliciously | 127 | (2) |
63 Effecting SRE Cultural Changes in Enterprises | 129 | (2) |
64 To All the SREs I've Loved | 131 | (2) |
65 Complex: The Most Overloaded Word in Technology | 133 | (3) |
66 The Best Advice I Can Give to Teams | 136 | (2) |
67 Create Your Supporting Artifacts | 138 | (2) |
68 The Order of Operations for Getting SLO Buy-In | 140 | (2) |
69 Heroes Are Necessary, but Hero Culture Is Not | 142 | (2) |
70 On-Call Rotations that People Want to Join | 144 | (2) |
71 Study of Human Factors and Team Culture to Improve Pager Fatigue | 146 | (2) |
72 Optimize for MTTBTB (Mean Time to Back to Bed) | 148 | (2) |
73 Mitigating and Preventing Cascading Failures | 150 | (2) |
74 On-Call Health: The Metric You Could Be Measuring | 152 | (2) |
75 Helping Leaders Prioritize On-Call Health | 154 | (2) |
| 156 | (2) |
77 The Forward-Deployed SRE | 158 | (2) |
78 Test Your Disaster Plan | 160 | (2) |
79 Why Training Matters to an SRE Practice and SRE Matters to Your Training Program | 162 | (2) |
80 The Power of Uniformity | 164 | (2) |
| 166 | (2) |
82 Make Your Engineering Blog a Priority | 168 | (2) |
83 Don't Let Anyone Run Code in Your Context | 170 | (2) |
84 Trading Places: SRE and Product | 172 | (2) |
85 You See Teams, I See Product | 174 | (2) |
86 The Performance Emergency Fund | 176 | (2) |
87 Important but Not Urgent: Roadmaps for SREs | 178 | (3) |
| 181 | (2) |
89 Following the Path of Safety-Critical Systems | 183 | (2) |
90 Applicable and Achievable Static Analysis | 185 | (2) |
91 The Importance of Formal Specification | 187 | (2) |
92 Risk and Rot in Sociotechnical Systems | 189 | (2) |
| 191 | (2) |
94 Expected Risk Limitations | 193 | (2) |
95 Beyond Local Risk: Accounting for Angry Birds | 195 | (2) |
96 A Word from Software Safety Nerds | 197 | (2) |
97 Incidents: A Window into Gaps | 199 | (2) |
| 201 | (2) |
Contributors | 203 | (22) |
Index | 225 | (7) |
About the Editors | 232 | |