| Foreword |
|
xiii | |
| Preface |
|
xv | |
|
|
|
|
|
|
1 | (14) |
|
|
|
2 | (1) |
|
|
|
2 | (6) |
|
|
|
5 | (1) |
|
|
|
6 | (1) |
|
|
|
7 | (1) |
|
|
|
8 | (4) |
|
|
|
9 | (3) |
|
|
|
12 | (1) |
|
|
|
12 | (1) |
|
SLOs Are a Process, Not a Project |
|
|
12 | (1) |
|
|
|
13 | (1) |
|
|
|
13 | (1) |
|
|
|
13 | (1) |
|
|
|
13 | (2) |
|
2 How to Think About Reliability |
|
|
15 | (12) |
|
|
|
16 | (1) |
|
Past Performance and Your Users |
|
|
17 | (4) |
|
|
|
18 | (1) |
|
|
|
18 | (1) |
|
A Worked Example of Reliability |
|
|
19 | (2) |
|
How Reliable Should You Be? |
|
|
21 | (5) |
|
|
|
22 | (2) |
|
|
|
24 | (1) |
|
How to Think About Reliability |
|
|
25 | (1) |
|
|
|
26 | (1) |
|
3 Developing Meaningful Service Level Indicators |
|
|
27 | (16) |
|
What Meaningful SLIs Provide |
|
|
28 | (2) |
|
|
|
28 | (1) |
|
|
|
29 | (1) |
|
|
|
30 | (1) |
|
|
|
30 | (5) |
|
A Request and Response Service |
|
|
32 | (1) |
|
Measuring Many Things by Measuring Only a Few |
|
|
33 | (1) |
|
|
|
34 | (1) |
|
|
|
35 | (5) |
|
Measuring Complex Service User Reliability |
|
|
37 | (2) |
|
|
|
39 | (1) |
|
Business Alignment and SLIs |
|
|
40 | (1) |
|
|
|
40 | (3) |
|
4 Choosing Good Service Level Objectives |
|
|
43 | (24) |
|
|
|
44 | (5) |
|
|
|
44 | (1) |
|
The Problem of Being Too Reliable |
|
|
45 | (1) |
|
The Problem with the Number Nine |
|
|
46 | (2) |
|
The Problem with Too Many SLOs |
|
|
48 | (1) |
|
Service Dependencies and Components |
|
|
49 | (4) |
|
|
|
49 | (3) |
|
|
|
52 | (1) |
|
Reliability for Things You Don't Own |
|
|
53 | (3) |
|
Open Source or Hosted Services |
|
|
54 | (1) |
|
|
|
54 | (2) |
|
|
|
56 | (10) |
|
|
|
56 | (1) |
|
|
|
57 | (4) |
|
|
|
61 | (3) |
|
|
|
64 | (1) |
|
What to Do Without a History |
|
|
65 | (1) |
|
|
|
66 | (1) |
|
5 How to Use Error Budgets |
|
|
67 | (28) |
|
Error Budgets in Practice |
|
|
68 | (8) |
|
To Release New Features or Not? |
|
|
69 | (1) |
|
|
|
70 | (1) |
|
|
|
71 | (1) |
|
Experimentation and Chaos Engineering |
|
|
72 | (1) |
|
|
|
73 | (1) |
|
|
|
74 | (1) |
|
|
|
75 | (1) |
|
|
|
75 | (1) |
|
|
|
76 | (16) |
|
Establishing Error Budgets |
|
|
76 | (10) |
|
|
|
86 | (2) |
|
|
|
88 | (4) |
|
|
|
92 | (3) |
|
Part II SLO Implementation |
|
|
|
|
|
95 | (16) |
|
Engineering Is More than Code |
|
|
95 | (1) |
|
|
|
96 | (5) |
|
|
|
96 | (1) |
|
|
|
97 | (1) |
|
|
|
98 | (1) |
|
|
|
98 | (1) |
|
|
|
99 | (1) |
|
|
|
100 | (1) |
|
|
|
101 | (7) |
|
|
|
101 | (1) |
|
Common Objections and How to Overcome Them |
|
|
102 | (4) |
|
Your First Error Budget Policy (and Your First Critical Test) |
|
|
106 | (2) |
|
Lessons Learned the Hard Way |
|
|
108 | (1) |
|
|
|
109 | (2) |
|
7 Measuring SLIs and SLOs |
|
|
111 | (18) |
|
|
|
111 | (3) |
|
|
|
112 | (1) |
|
|
|
112 | (1) |
|
|
|
112 | (1) |
|
|
|
113 | (1) |
|
|
|
113 | (1) |
|
Organizational Constraints |
|
|
114 | (1) |
|
|
|
114 | (8) |
|
Centralized Time Series Statistics (Metrics) |
|
|
114 | (5) |
|
Structured Event Databases (Logging) |
|
|
119 | (3) |
|
|
|
122 | (4) |
|
Latency-Sensitive Request Processing |
|
|
122 | (2) |
|
Low-Lag, High-Throughput Batch Processing |
|
|
124 | (1) |
|
|
|
125 | (1) |
|
|
|
126 | (1) |
|
|
|
127 | (1) |
|
Integration with Distributed Tracing |
|
|
127 | (1) |
|
SLI and SLO Discoverability |
|
|
128 | (1) |
|
|
|
128 | (1) |
|
8 SLO Monitoring and Alerting |
|
|
129 | (24) |
|
Motivation: What Is SLO Alerting, and Why Should You Do It? |
|
|
130 | (8) |
|
The Shortcomings of Simple Threshold Alerting |
|
|
130 | (8) |
|
|
|
138 | (1) |
|
|
|
138 | (12) |
|
|
|
139 | (2) |
|
Error Budgets and Response Time |
|
|
141 | (1) |
|
|
|
142 | (1) |
|
|
|
143 | (2) |
|
|
|
145 | (2) |
|
Troubleshooting with SLO Alerting |
|
|
147 | (1) |
|
|
|
148 | (1) |
|
SLO Alerting in a Brownfield Setup |
|
|
149 | (1) |
|
|
|
150 | (2) |
|
|
|
152 | (1) |
|
9 Probability and Statistics for SLIs and SLOs |
|
|
153 | (56) |
|
|
|
155 | (19) |
|
SLI Example: Availability |
|
|
156 | (6) |
|
|
|
162 | (12) |
|
|
|
174 | (29) |
|
Maximum Likelihood Estimation |
|
|
174 | (3) |
|
|
|
177 | (8) |
|
|
|
185 | (5) |
|
SLI Example: Queueing Latency |
|
|
190 | (6) |
|
|
|
196 | (7) |
|
|
|
203 | (5) |
|
|
|
208 | (1) |
|
|
|
208 | (1) |
|
10 Architecting for Reliability |
|
|
209 | (18) |
|
Example System: Image-Serving Service |
|
|
211 | (13) |
|
Architectural Considerations: Hardware |
|
|
213 | (3) |
|
Architectural Considerations: Monolith or Microservices |
|
|
216 | (1) |
|
Architectural Considerations: Anticipating Failure Modes |
|
|
217 | (1) |
|
Architectural Considerations: Three Types of Requests |
|
|
218 | (2) |
|
Systems and Building Blocks |
|
|
220 | (2) |
|
Quantitative Analysis of Systems |
|
|
222 | (1) |
|
Instrumentation! The System Also Needs Instrumentation! |
|
|
223 | (1) |
|
Architectural Considerations: Hardware, Revisited |
|
|
224 | (1) |
|
SLOs as a Result of System SLIs |
|
|
225 | (1) |
|
The Importance of Identifying and Understanding Dependencies |
|
|
225 | (1) |
|
|
|
226 | (1) |
|
|
|
227 | (30) |
|
|
|
227 | (2) |
|
Designing Data Applications |
|
|
228 | (1) |
|
|
|
229 | (1) |
|
Setting Measurable Data Objectives |
|
|
230 | (22) |
|
Data and Data Application Reliability |
|
|
231 | (2) |
|
|
|
233 | (12) |
|
Data Application Properties |
|
|
245 | (7) |
|
|
|
252 | (2) |
|
Data Application Failures |
|
|
252 | (1) |
|
|
|
253 | (1) |
|
|
|
254 | (1) |
|
|
|
255 | (2) |
|
|
|
257 | (22) |
|
|
|
258 | (3) |
|
|
|
259 | (1) |
|
|
|
260 | (1) |
|
SLIs and SLOs as User Journeys |
|
|
261 | (14) |
|
Customers: Finding and Browsing Products |
|
|
262 | (3) |
|
Other Services as Users: Buying Products |
|
|
265 | (3) |
|
|
|
268 | (5) |
|
|
|
273 | (2) |
|
|
|
275 | (4) |
|
|
|
|
13 Building an SLO Culture |
|
|
279 | (14) |
|
|
|
280 | (1) |
|
Strategies for Shifting Culture |
|
|
281 | (1) |
|
Path to a Culture of SLOs |
|
|
282 | (10) |
|
|
|
283 | (1) |
|
|
|
283 | (2) |
|
|
|
285 | (1) |
|
|
|
286 | (1) |
|
|
|
287 | (1) |
|
|
|
287 | (2) |
|
|
|
289 | (1) |
|
Determining When Your SLOs Are Good Enough |
|
|
290 | (1) |
|
Advocating for Others to Use SLOs |
|
|
291 | (1) |
|
|
|
292 | (1) |
|
|
|
293 | (18) |
|
|
|
294 | (2) |
|
|
|
294 | (1) |
|
|
|
294 | (1) |
|
|
|
295 | (1) |
|
|
|
296 | (3) |
|
Increased Utilization Changes |
|
|
296 | (1) |
|
Decreased Utilization Changes |
|
|
297 | (1) |
|
Functional Utilization Changes |
|
|
298 | (1) |
|
|
|
299 | (3) |
|
Service Dependency Changes |
|
|
299 | (2) |
|
|
|
301 | (1) |
|
Dependency Introduction or Retirement |
|
|
301 | (1) |
|
|
|
302 | (1) |
|
User Expectation and Requirement Changes |
|
|
302 | (2) |
|
|
|
303 | (1) |
|
|
|
304 | (1) |
|
|
|
304 | (2) |
|
|
|
304 | (1) |
|
|
|
305 | (1) |
|
|
|
306 | (1) |
|
Setting Aspirational SLOs |
|
|
306 | (1) |
|
Identifying Incorrect SLOs |
|
|
307 | (1) |
|
Listening to Users (Redux) |
|
|
307 | (1) |
|
Paying Attention to Failures |
|
|
308 | (1) |
|
|
|
308 | (1) |
|
|
|
308 | (1) |
|
|
|
309 | (2) |
|
15 Discoverable and Understandable SLOs |
|
|
311 | (14) |
|
|
|
311 | (8) |
|
|
|
312 | (6) |
|
|
|
318 | (1) |
|
|
|
319 | (4) |
|
|
|
319 | (1) |
|
|
|
320 | (1) |
|
|
|
320 | (1) |
|
|
|
321 | (2) |
|
|
|
323 | (2) |
|
|
|
325 | (16) |
|
|
|
327 | (8) |
|
|
|
327 | (1) |
|
|
|
328 | (1) |
|
Create Your Supporting Artifacts |
|
|
329 | (3) |
|
Run Your First Training and Workshop |
|
|
332 | (1) |
|
Implement an SLO Pilot with a Single Service |
|
|
333 | (1) |
|
|
|
333 | (1) |
|
Learn How to Handle Challenges |
|
|
334 | (1) |
|
|
|
335 | (4) |
|
Work with Early Adopters to Implement SLOs for More Services |
|
|
335 | (1) |
|
Celebrate Achievements and Build Confidence |
|
|
336 | (1) |
|
Create a Library of Case Studies |
|
|
336 | (1) |
|
Scale Your Training Program by Adding More Trainers |
|
|
337 | (1) |
|
Scale Your Communications |
|
|
338 | (1) |
|
|
|
339 | (1) |
|
Share Your Library of SLO Case Studies |
|
|
339 | (1) |
|
Create a Community of SLO Experts |
|
|
339 | (1) |
|
|
|
339 | (1) |
|
|
|
340 | (1) |
|
|
|
341 | (16) |
|
|
|
342 | (11) |
|
|
|
343 | (1) |
|
|
|
344 | (2) |
|
The Problem with Mean Time to X |
|
|
346 | (4) |
|
|
|
350 | (3) |
|
|
|
353 | (3) |
|
|
|
353 | (2) |
|
|
|
355 | (1) |
|
|
|
356 | (1) |
| A SLO Definition Template |
|
357 | (4) |
B Proofs for Chapter 9 |
|
361 | (8) |
| Index |
|
369 | |