Foreword I |
|
xvii | |
Foreword II |
|
xix | |
Preface |
|
xxiii | |
1 How SRE Relates to DevOps |
|
1 | (16) |
|
|
2 | (2) |
|
|
2 | (1) |
|
|
3 | (1) |
|
|
3 | (1) |
|
Tooling and Culture Are Interrelated |
|
|
4 | (1) |
|
|
4 | (1) |
|
|
4 | (4) |
|
Operations Is a Software Problem |
|
|
5 | (1) |
|
Manage by Service Level Objectives (SLOs) |
|
|
5 | (1) |
|
|
5 | (1) |
|
Automate This Year's Job Away |
|
|
6 | (1) |
|
Move Fast by Reducing the Cost of Failure |
|
|
6 | (1) |
|
Share Ownership with Developers |
|
|
7 | (1) |
|
Use the Same Tooling, Regardless of Function or Job Title |
|
|
7 | (1) |
|
|
8 | (1) |
|
Organizational Context and Fostering Successful Adoption |
|
|
9 | (8) |
|
Narrow, Rigid Incentives Narrow Your Success |
|
|
10 | (1) |
|
It's Better to Fix It Yourself; Don't Blame Someone Else |
|
|
10 | (1) |
|
Consider Reliability Work as a Specialized Role |
|
|
11 | (1) |
|
When Can Substitute for Whether |
|
|
12 | (1) |
|
Strive for Parity of Esteem: Career and Financial |
|
|
12 | (5) |
Part I. Foundations |
|
|
|
17 | (26) |
|
|
17 | (1) |
|
|
18 | (5) |
|
Reliability Targets and Error Budgets |
|
|
19 | (1) |
|
What to Measure: Using SLIs |
|
|
20 | (3) |
|
|
23 | (6) |
|
Moving from SLI Specification to SLI Implementation |
|
|
25 | (1) |
|
|
26 | (2) |
|
Using the SLIs to Calculate Starter SLOs |
|
|
28 | (1) |
|
Choosing an Appropriate Time Window |
|
|
29 | (1) |
|
Getting Stakeholder Agreement |
|
|
30 | (4) |
|
Establishing an Error Budget Policy |
|
|
31 | (1) |
|
Documenting the SLO and Error Budget Policy |
|
|
32 | (1) |
|
|
33 | (1) |
|
Continuous Improvement of SLO Targets |
|
|
34 | (3) |
|
Improving the Quality of Your SLO |
|
|
35 | (2) |
|
Decision Making Using SLOs and Error Budgets |
|
|
37 | (1) |
|
|
38 | (4) |
|
|
39 | (1) |
|
Grading Interaction Importance |
|
|
39 | (1) |
|
|
40 | (1) |
|
Experimenting with Relaxing Your SLOs |
|
|
41 | (1) |
|
|
42 | (1) |
|
3 SLO Engineering Case Studies |
|
|
43 | (18) |
|
|
43 | (6) |
|
Why Did Evernote Adopt the SRE Model? |
|
|
44 | (1) |
|
Introduction of SLOs: A Journey in Progress |
|
|
45 | (3) |
|
Breaking Down the SLO Wall Between Customer and Cloud Provider |
|
|
48 | (1) |
|
|
49 | (1) |
|
The Home Depot's SLO Story |
|
|
49 | (11) |
|
|
50 | (2) |
|
|
52 | (2) |
|
|
54 | (1) |
|
Automating VALET Data Collection |
|
|
55 | (2) |
|
The Proliferation of SLOs |
|
|
57 | (1) |
|
Applying VALET to Batch Applications |
|
|
57 | (1) |
|
|
58 | (1) |
|
|
58 | (1) |
|
|
59 | (1) |
|
|
60 | (1) |
|
|
61 | (14) |
|
Desirable Features of a Monitoring Strategy |
|
|
62 | (2) |
|
|
62 | (1) |
|
|
62 | (1) |
|
|
63 | (1) |
|
|
64 | (1) |
|
Sources of Monitoring Data |
|
|
64 | (3) |
|
|
65 | (2) |
|
Managing Your Monitoring System |
|
|
67 | (2) |
|
Treat Your Configuration as Code |
|
|
67 | (1) |
|
|
68 | (1) |
|
|
68 | (1) |
|
|
69 | (3) |
|
|
70 | (1) |
|
|
70 | (1) |
|
|
71 | (1) |
|
|
72 | (1) |
|
Implementing Purposeful Metrics |
|
|
72 | (1) |
|
|
72 | (1) |
|
|
73 | (2) |
|
|
75 | (18) |
|
|
75 | (1) |
|
Ways to Alert on Significant Events |
|
|
76 | (10) |
|
1: Target Error Rate ≥ SLO Threshold |
|
|
76 | (2) |
|
|
78 | (1) |
|
3: Incrementing Alert Duration |
|
|
79 | (1) |
|
|
80 | (2) |
|
5: Multiple Burn Rate Alerts |
|
|
82 | (2) |
|
6: Multiwindow, Multi-Burn-Rate Alerts |
|
|
84 | (2) |
|
Low-Traffic Services and Error Budget Alerting |
|
|
86 | (3) |
|
Generating Artificial Traffic |
|
|
87 | (1) |
|
|
87 | (1) |
|
Making Service and Infrastructure Changes |
|
|
87 | (1) |
|
Lowering the SLO or Increasing the Window |
|
|
88 | (1) |
|
Extreme Availability Goals |
|
|
89 | (1) |
|
|
89 | (2) |
|
|
91 | (2) |
|
|
93 | (38) |
|
|
94 | (2) |
|
|
96 | (2) |
|
|
98 | (3) |
|
|
98 | (1) |
|
|
99 | (1) |
|
|
99 | (1) |
|
|
99 | (1) |
|
Cost Engineering and Capacity Planning |
|
|
100 | (1) |
|
Troubleshooting for Opaque Architectures |
|
|
100 | (1) |
|
Toil Management Strategies |
|
|
101 | (5) |
|
Identify and Measure Toil |
|
|
101 | (1) |
|
Engineer Toil Out of the System |
|
|
101 | (1) |
|
|
101 | (1) |
|
|
102 | (1) |
|
Start with Human-Backed Interfaces |
|
|
102 | (1) |
|
Provide Self-Service Methods |
|
|
102 | (1) |
|
Get Support from Management and Colleagues |
|
|
103 | (1) |
|
Promote Toil Reduction as a Feature |
|
|
103 | (1) |
|
Start Small and Then Improve |
|
|
103 | (1) |
|
|
103 | (1) |
|
Assess Risk Within Automation |
|
|
104 | (1) |
|
|
104 | (1) |
|
Use Open Source and Third-Party Tools |
|
|
105 | (1) |
|
|
105 | (1) |
|
|
106 | (1) |
|
Case Study 1: Reducing Toil in the Datacenter with Automation |
|
|
107 | (14) |
|
|
107 | (3) |
|
|
110 | (1) |
|
|
110 | (1) |
|
Design First Effort: Saturn Line-Card Repair |
|
|
110 | (1) |
|
|
111 | (2) |
|
Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair |
|
|
113 | (1) |
|
|
114 | (4) |
|
|
118 | (3) |
|
Case Study 2: Decommissioning Filer-Backed Home Directories |
|
|
121 | (1) |
|
|
121 | (1) |
|
|
121 | (1) |
|
|
122 | (1) |
|
Design and Implementation |
|
|
123 | (1) |
|
|
124 | (3) |
|
|
127 | (2) |
|
|
129 | (2) |
|
|
131 | (16) |
|
|
131 | (2) |
|
Simplicity Is End-to-End, and SREs Are Good for That |
|
|
133 | (2) |
|
Case Study 1: End-to-End API Simplicity |
|
|
134 | (1) |
|
Case Study 2: Project Lifecycle Complexity |
|
|
134 | (1) |
|
|
135 | (6) |
|
Case Study 3: Simplification of the Display Ads Spiderweb |
|
|
137 | (2) |
|
Case Study 4: Running Hundreds of Microservices on a Shared Platform |
|
|
139 | (1) |
|
Case Study 5: pDNS No Longer Depends on Itself |
|
|
140 | (1) |
|
|
141 | (6) |
Part II. Practices |
|
|
|
147 | (28) |
|
Recap of "Being On-Call" Chapter of First SRE Book |
|
|
148 | (1) |
|
Example On-Call Setups Within Google and Outside Google |
|
|
149 | (7) |
|
Google: Forming a New Team |
|
|
149 | (4) |
|
Evernote: Finding Our Feet in the Cloud |
|
|
153 | (3) |
|
Practical Implementation Details |
|
|
156 | (17) |
|
|
156 | (11) |
|
|
167 | (4) |
|
|
171 | (2) |
|
|
173 | (2) |
|
|
175 | (20) |
|
Incident Management at Google |
|
|
176 | (1) |
|
|
176 | (1) |
|
Main Roles in Incident Response |
|
|
177 | (1) |
|
|
177 | (14) |
|
Case Study 1: Software Bug - The Lights Are On but No One's (Google) Home |
|
|
177 | (3) |
|
Case Study 2: Service Fault - Cache Me If You Can |
|
|
180 | (5) |
|
Case Study 3: Power Outage - Lightning Never Strikes Twice...Until It Does |
|
|
185 | (3) |
|
Case Study 4: Incident Response at PagerDuty |
|
|
188 | (3) |
|
Putting Best Practices into Practice |
|
|
191 | (3) |
|
Incident Response Training |
|
|
191 | (1) |
|
|
192 | (1) |
|
|
193 | (1) |
|
|
194 | (1) |
|
10 Postmortem Culture: Learning from Failure |
|
|
195 | (30) |
|
|
196 | (1) |
|
|
197 | (6) |
|
Why Is This Postmortem Bad? |
|
|
199 | (4) |
|
|
203 | (11) |
|
Why Is This Postmortem Better? |
|
|
212 | (2) |
|
Organizational Incentives |
|
|
214 | (6) |
|
Model and Enforce Blameless Behavior |
|
|
214 | (1) |
|
Reward Postmortem Outcomes |
|
|
215 | (2) |
|
|
217 | (1) |
|
Respond to Postmortem Culture Failures |
|
|
218 | (2) |
|
|
220 | (3) |
|
|
220 | (1) |
|
|
221 | (2) |
|
|
223 | (2) |
|
|
225 | (20) |
|
Google Cloud Load Balancing |
|
|
225 | (11) |
|
|
226 | (1) |
|
|
227 | (2) |
|
Global Software Load Balancer |
|
|
229 | (1) |
|
|
229 | (1) |
|
|
230 | (1) |
|
|
231 | (1) |
|
Case Study 1: Pokémon GO on GCLB |
|
|
231 | (5) |
|
|
236 | (3) |
|
Handling Unhealthy Machines |
|
|
236 | (1) |
|
Working with Stateful Systems |
|
|
237 | (1) |
|
Configuring Conservatively |
|
|
237 | (1) |
|
|
238 | (1) |
|
Including Kill Switches and Manual Overrides |
|
|
238 | (1) |
|
Avoiding Overloading Backends |
|
|
238 | (1) |
|
Avoiding Traffic Imbalance |
|
|
239 | (1) |
|
Combining Strategies to Manage Load |
|
|
239 | (4) |
|
Case Study 2: When Load Shedding Attacks |
|
|
240 | (3) |
|
|
243 | (2) |
|
12 Introducing Non-Abstract Large System Design |
|
|
245 | (18) |
|
|
245 | (1) |
|
|
246 | (1) |
|
|
246 | (14) |
|
|
246 | (1) |
|
|
247 | (1) |
|
|
248 | (3) |
|
|
251 | (9) |
|
|
260 | (3) |
|
13 Data Processing Pipelines |
|
|
263 | (38) |
|
|
264 | (4) |
|
Event Processing/Data Transformation to Order or Structure Data |
|
|
264 | (1) |
|
|
265 | (1) |
|
|
265 | (3) |
|
|
268 | (9) |
|
Define and Measure Service Level Objectives |
|
|
268 | (2) |
|
Plan for Dependency Failure |
|
|
270 | (1) |
|
Create and Maintain Pipeline Documentation |
|
|
271 | (1) |
|
Map Your Development Lifecycle |
|
|
272 | (3) |
|
Reduce Hotspotting and Workload Patterns |
|
|
275 | (1) |
|
Implement Autoscaling and Resource Planning |
|
|
276 | (1) |
|
Adhere to Access Control and Security Policies |
|
|
277 | (1) |
|
|
277 | (1) |
|
Pipeline Requirements and Design |
|
|
277 | (7) |
|
What Features Do You Need? |
|
|
278 | (1) |
|
Idempotent and Two-Phase Mutations |
|
|
279 | (1) |
|
|
280 | (1) |
|
|
280 | (1) |
|
Pipeline Production Readiness |
|
|
281 | (3) |
|
Pipeline Failures: Prevention and Response |
|
|
284 | (3) |
|
|
284 | (2) |
|
|
286 | (1) |
|
|
287 | (12) |
|
|
288 | (1) |
|
Event Delivery System Design and Architecture |
|
|
289 | (1) |
|
Event Delivery System Operation |
|
|
290 | (3) |
|
Customer Integration and Support |
|
|
293 | (5) |
|
|
298 | (1) |
|
|
299 | (2) |
|
14 Configuration Design and Best Practices |
|
|
301 | (14) |
|
|
301 | (2) |
|
Configuration and Reliability |
|
|
302 | (1) |
|
Separating Philosophy and Mechanics |
|
|
303 | (1) |
|
|
303 | (5) |
|
Configuration Asks Users Questions |
|
|
305 | (1) |
|
Questions Should Be Close to User Goals |
|
|
305 | (1) |
|
Mandatory and Optional Questions |
|
|
306 | (2) |
|
|
308 | (1) |
|
Mechanics of Configuration |
|
|
308 | (5) |
|
Separate Configuration and Resulting Data |
|
|
308 | (2) |
|
|
310 | (2) |
|
Ownership and Change Tracking |
|
|
312 | (1) |
|
Safe Configuration Change Application |
|
|
312 | (1) |
|
|
313 | (2) |
|
15 Configuration Specifics |
|
|
315 | (20) |
|
Configuration-Induced Toil |
|
|
315 | (1) |
|
Reducing Configuration-Induced Toil |
|
|
316 | (1) |
|
Critical Properties and Pitfalls of Configuration Systems |
|
|
317 | (3) |
|
Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem |
|
|
317 | (1) |
|
Pitfall 2: Designing Accidental or Ad Hoc Language Features |
|
|
318 | (1) |
|
Pitfall 3: Building Too Much Domain-Specific Optimization |
|
|
318 | (1) |
|
Pitfall 4: Interleaving "Configuration Evaluation" with "Side Effects" |
|
|
319 | (1) |
|
Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua |
|
|
319 | (1) |
|
Integrating a Configuration Language |
|
|
320 | (2) |
|
Generating Config in Specific Formats |
|
|
320 | (1) |
|
Driving Multiple Applications |
|
|
321 | (1) |
|
Integrating an Existing Application: Kubernetes |
|
|
322 | (4) |
|
|
322 | (1) |
|
Example Kubernetes Config |
|
|
322 | (1) |
|
Integrating the Configuration Language |
|
|
323 | (3) |
|
Integrating Custom Applications (In-House Software) |
|
|
326 | (3) |
|
Effectively Operating a Configuration System |
|
|
329 | (2) |
|
|
329 | (1) |
|
|
330 | (1) |
|
|
330 | (1) |
|
|
330 | (1) |
|
When to Evaluate Configuration |
|
|
331 | (2) |
|
Very Early: Checking in the JSON |
|
|
331 | (1) |
|
Middle of the Road: Evaluate at Build Time |
|
|
332 | (1) |
|
Late: Evaluate at Runtime |
|
|
332 | (1) |
|
Guarding Against Abusive Configuration |
|
|
333 | (1) |
|
|
334 | (1) |
|
|
335 | (20) |
|
Release Engineering Principles |
|
|
336 | (1) |
|
Balancing Release Velocity and Reliability |
|
|
337 | (1) |
|
|
338 | (1) |
|
Release Engineering and Canarying |
|
|
338 | (2) |
|
Requirements of a Canary Process |
|
|
339 | (1) |
|
|
339 | (1) |
|
A Roll Forward Deployment Versus a Simple Canary Deployment |
|
|
340 | (2) |
|
|
342 | (3) |
|
Minimizing Risk to SLOs and the Error Budget |
|
|
343 | (1) |
|
Choosing a Canary Population and Duration |
|
|
343 | (2) |
|
Selecting and Evaluating Metrics |
|
|
345 | (3) |
|
Metrics Should Indicate Problems |
|
|
345 | (1) |
|
Metrics Should Be Representative and Attributable |
|
|
346 | (1) |
|
Before/After Evaluation Is Risky |
|
|
347 | (1) |
|
Use a Gradual Canary for Better Metric Selection |
|
|
347 | (1) |
|
Dependencies and Isolation |
|
|
348 | (1) |
|
Canarying in Noninteractive Systems |
|
|
348 | (1) |
|
Requirements on Monitoring Data |
|
|
349 | (1) |
|
|
350 | (1) |
|
|
350 | (1) |
|
Artificial Load Generation |
|
|
350 | (1) |
|
|
351 | (1) |
|
|
351 | (4) |
Part III. Processes |
|
|
17 Identifying and Recovering from Overload |
|
|
355 | (16) |
|
|
356 | (2) |
|
Case Study 1: Work Overload When Half a Team Leaves |
|
|
358 | (2) |
|
|
358 | (1) |
|
|
358 | (1) |
|
|
359 | (1) |
|
|
359 | (1) |
|
|
360 | (1) |
|
Case Study 2: Perceived Overload After Organizational and Workload Changes |
|
|
360 | (6) |
|
|
360 | (1) |
|
|
361 | (1) |
|
|
362 | (1) |
|
|
363 | (2) |
|
|
365 | (1) |
|
|
365 | (1) |
|
Strategies for Mitigating Overload |
|
|
366 | (3) |
|
Recognizing the Symptoms of Overload |
|
|
366 | (1) |
|
Reducing Overload and Restoring Team Health |
|
|
367 | (2) |
|
|
369 | (2) |
|
|
371 | (20) |
|
|
372 | (3) |
|
Phase 1: Architecture and Design |
|
|
372 | (1) |
|
Phase 2: Active Development |
|
|
373 | (1) |
|
Phase 3: Limited Availability |
|
|
373 | (1) |
|
Phase 4: General Availability |
|
|
374 | (1) |
|
|
374 | (1) |
|
|
374 | (1) |
|
|
374 | (1) |
|
Setting Up the Relationship |
|
|
375 | (5) |
|
Communicating Business and Production Priorities |
|
|
375 | (1) |
|
|
375 | (1) |
|
|
375 | (4) |
|
|
379 | (1) |
|
|
379 | (1) |
|
Sustaining an Effective Ongoing Relationship |
|
|
380 | (2) |
|
Investing Time in Working Better Together |
|
|
380 | (1) |
|
Maintaining an Open Line of Communication |
|
|
380 | (1) |
|
Performing Regular Service Reviews |
|
|
381 | (1) |
|
Reassessing When Ground Rules Start to Slip |
|
|
381 | (1) |
|
Adjusting Priorities According to Your SLOs and Error Budget |
|
|
381 | (1) |
|
Handling Mistakes Appropriately |
|
|
382 | (1) |
|
Scaling SRE to Larger Environments |
|
|
382 | (3) |
|
Supporting Multiple Services with a Single SRE Team |
|
|
382 | (1) |
|
Structuring a Multiple SRE Team Environment |
|
|
383 | (1) |
|
Adapting SRE Team Structures to Changing Circumstances |
|
|
384 | (1) |
|
Running Cohesive Distributed SRE Teams |
|
|
384 | (1) |
|
|
385 | (4) |
|
|
385 | (2) |
|
Case Study 2: Data Analysis Pipeline |
|
|
387 | (2) |
|
|
389 | (2) |
|
19 SRE: Reaching Beyond Your Walls |
|
|
391 | (8) |
|
Truths We Hold to Be Self-Evident |
|
|
391 | (3) |
|
Reliability Is the Most Important Feature |
|
|
392 | (1) |
|
Your Users, Not Your Monitoring, Decide Your Reliability |
|
|
392 | (1) |
|
If You Run a Platform, Then Reliability Is a Partnership |
|
|
392 | (1) |
|
Everything Important Eventually Becomes a Platform |
|
|
393 | (1) |
|
When Your Customers Have a Hard Time, You Have to Slow Down |
|
|
393 | (1) |
|
You Will Need to Practice SRE with Your Customers |
|
|
393 | (1) |
|
How to: SRE with Your Customers |
|
|
394 | (4) |
|
Step 1: SLOs and SLIs Are How You Speak |
|
|
394 | (1) |
|
Step 2: Audit the Monitoring and Build Shared Dashboards |
|
|
395 | (1) |
|
Step 3: Measure and Renegotiate |
|
|
396 | (1) |
|
Step 4: Design Reviews and Risk Analysis |
|
|
396 | (1) |
|
Step 5: Practice, Practice, Practice |
|
|
397 | (1) |
|
Be Thoughtful and Disciplined |
|
|
397 | (1) |
|
|
398 | (1) |
|
|
399 | (24) |
|
SRE Practices Without SREs |
|
|
399 | (1) |
|
|
400 | (3) |
|
|
400 | (1) |
|
|
401 | (1) |
|
Bootstrapping Your First SRE |
|
|
402 | (1) |
|
|
403 | (1) |
|
|
403 | (10) |
|
|
404 | (1) |
|
|
405 | (3) |
|
|
408 | (3) |
|
|
411 | (2) |
|
|
413 | (5) |
|
|
413 | (1) |
|
|
414 | (1) |
|
|
415 | (3) |
|
Suggested Practices for Running Many Teams |
|
|
418 | (4) |
|
|
419 | (1) |
|
|
419 | (1) |
|
|
419 | (1) |
|
|
419 | (1) |
|
|
420 | (1) |
|
|
420 | (1) |
|
Launch Coordination Engineering Teams |
|
|
421 | (1) |
|
|
421 | (1) |
|
|
421 | (1) |
|
|
422 | (1) |
|
21 Organizational Change Management in SRE |
|
|
423 | (18) |
|
|
423 | (1) |
|
Introduction to Change Management |
|
|
424 | (3) |
|
Lewin's Three-Stage Model |
|
|
424 | (1) |
|
|
424 | (1) |
|
Kotter's Eight-Step Process for Leading Change |
|
|
425 | (1) |
|
|
425 | (1) |
|
|
426 | (1) |
|
|
426 | (1) |
|
How These Theories Apply to SRE |
|
|
427 | (1) |
|
Case Study 1: Scaling Waze - From Ad Hoc to Planned Change |
|
|
427 | (5) |
|
|
427 | (1) |
|
The Messaging Queue: Replacing a System While Maintaining Reliability |
|
|
427 | (2) |
|
The Next Cycle of Change: Improving the Deployment Process |
|
|
429 | (2) |
|
|
431 | (1) |
|
Case Study 2: Common Tooling Adoption in SRE |
|
|
432 | (7) |
|
|
432 | (1) |
|
|
433 | (1) |
|
|
434 | (1) |
|
|
434 | (2) |
|
Implementation: Monitoring |
|
|
436 | (1) |
|
|
436 | (3) |
|
|
439 | (2) |
Conclusion |
|
441 | (14) |
|
|
445 | (4) |
|
B Example Error Budget Policy |
|
|
449 | (4) |
|
C Results of Postmortem Analysis |
|
|
453 | (2) |
Index |
|
455 | |