Foreword |
|
xi | |
Preface |
|
xv | |
|
Part I The Path to Observability |
|
|
|
|
3 | (16) |
|
The Mathematical Definition of Observability |
|
|
4 | (1) |
|
Applying Observability to Software Systems |
|
|
4 | (3) |
|
Mischaracterizations About Observability for Software |
|
|
7 | (1) |
|
Why Observability Matters Now |
|
|
8 | (1) |
|
Is This Really the Best Way? |
|
|
9 | (1) |
|
Why Are Metrics and Monitoring Not Enough? |
|
|
9 | (2) |
|
Debugging with Metrics Versus Observability |
|
|
11 | (2) |
|
|
13 | (1) |
|
The Role of Dimensionality |
|
|
14 | (2) |
|
Debugging with Observability |
|
|
16 | (1) |
|
Observability Is for Modern Systems |
|
|
17 | (1) |
|
|
17 | (2) |
|
2 How Debugging Practices Differ Between Observability and Monitoring |
|
|
19 | (10) |
|
How Monitoring Data Is Used for Debugging |
|
|
19 | (2) |
|
Troubleshooting Behaviors When Using Dashboards |
|
|
21 | (2) |
|
The Limitations of Troubleshooting by Intuition |
|
|
23 | (1) |
|
Traditional Monitoring Is Fundamentally Reactive |
|
|
24 | (2) |
|
How Observability Enables Better Debugging |
|
|
26 | (2) |
|
|
28 | (1) |
|
3 Lessons from Scaling Without Observability |
|
|
29 | (14) |
|
|
29 | (2) |
|
|
31 | (2) |
|
The Evolution Toward Modern Systems |
|
|
33 | (3) |
|
The Evolution Toward Modern Practices |
|
|
36 | (2) |
|
Shifting Practices at Parse |
|
|
38 | (3) |
|
|
41 | (2) |
|
4 How Observability Relates to DevOps, SRE, and Cloud Native |
|
|
43 | (8) |
|
Cloud Native, DevOps, and SRE in a Nutshell |
|
|
43 | (2) |
|
Observability: Debugging Then Versus Now |
|
|
45 | (1) |
|
Observability Empowers DevOps and SRE Practices |
|
|
46 | (2) |
|
|
48 | (3) |
|
Part II Fundamentals of Observability |
|
|
|
5 Structured Events Are the Building Blocks of Observability |
|
|
51 | (10) |
|
Debugging with Structured Events |
|
|
52 | (1) |
|
The Limitations of Metrics as a Building Block |
|
|
53 | (2) |
|
The Limitations of Traditional Logs as a Building Block |
|
|
55 | (1) |
|
|
55 | (1) |
|
|
56 | (1) |
|
Properties of Events That Are Useful in Debugging |
|
|
57 | (2) |
|
|
59 | (2) |
|
6 Stitching Events into Traces |
|
|
61 | (12) |
|
Distributed Tracing and Why It Matters Now |
|
|
61 | (2) |
|
The Components of Tracing |
|
|
63 | (2) |
|
Instrumenting a Trace the Hard Way |
|
|
65 | (3) |
|
Adding Custom Fields into Trace Spans |
|
|
68 | (2) |
|
Stitching Events into Traces |
|
|
70 | (1) |
|
|
71 | (2) |
|
7 Instrumentation with OpenTelemetry |
|
|
73 | (10) |
|
A Brief Introduction to Instrumentation |
|
|
74 | (1) |
|
Open Instrumentation Standards |
|
|
74 | (1) |
|
Instrumentation Using Code-Based Examples |
|
|
75 | (1) |
|
Start with Automatic Instrumentation |
|
|
76 | (2) |
|
Add Custom Instrumentation |
|
|
78 | (2) |
|
Send Instrumentation Data to a Backend System |
|
|
80 | (2) |
|
|
82 | (1) |
|
8 Analyzing Events to Achieve Observability |
|
|
83 | (12) |
|
Debugging from Known Conditions |
|
|
84 | (1) |
|
Debugging from First Principles |
|
|
85 | (1) |
|
Using the Core Analysis Loop |
|
|
86 | (2) |
|
Automating the Brute-Force Portion of the Core Analysis Loop |
|
|
88 | (3) |
|
This Misleading Promise of AIOps |
|
|
91 | (1) |
|
|
92 | (3) |
|
9 How Observability and Monitoring Come Together |
|
|
95 | (12) |
|
|
96 | (1) |
|
|
97 | (1) |
|
System Versus Software Considerations |
|
|
97 | (2) |
|
Assessing Your Organizational Needs |
|
|
99 | (2) |
|
Exceptions: Infrastructure Monitoring That Can't Be Ignored |
|
|
101 | (1) |
|
|
101 | (2) |
|
|
103 | (4) |
|
Part III Observability for Teams |
|
|
|
10 Applying Observability Practices in Your Team |
|
|
107 | (10) |
|
|
107 | (2) |
|
Start with the Biggest Pain Points |
|
|
109 | (1) |
|
|
109 | (2) |
|
Flesh Out Your Instrumentation Iteratively |
|
|
111 | (1) |
|
Look for Opportunities to Leverage Existing Efforts |
|
|
112 | (2) |
|
Prepare for the Hardest Last Push |
|
|
114 | (1) |
|
|
115 | (2) |
|
11 Observability-Driven Development |
|
|
117 | (10) |
|
|
117 | (1) |
|
Observability in the Development Cycle |
|
|
118 | (1) |
|
Determining Where to Debug |
|
|
119 | (1) |
|
Debugging in the Time of Microservices |
|
|
120 | (1) |
|
How Instrumentation Drives Observability |
|
|
121 | (2) |
|
Shifting Observability Left |
|
|
123 | (1) |
|
Using Observability to Speed Up Software Delivery |
|
|
123 | (2) |
|
|
125 | (2) |
|
12 Using Service-Level Objectives for Reliability |
|
|
127 | (12) |
|
Traditional Monitoring Approaches Create Dangerous Alert Fatigue |
|
|
127 | (2) |
|
Threshold Alerting Is for Known-Unknowns Only |
|
|
129 | (2) |
|
User Experience Is a North Star |
|
|
131 | (1) |
|
What Is a Service-Level Objective? |
|
|
132 | (1) |
|
Reliable Alerting with SLOs |
|
|
133 | (2) |
|
Changing Culture Toward SLO-Based Alerts: A Case Study |
|
|
135 | (3) |
|
|
138 | (1) |
|
13 Acting on and Debugging SLO-Based Alerts |
|
|
139 | (18) |
|
Alerting Before Your Error Budget Is Empty |
|
|
139 | (2) |
|
Framing Time as a Sliding Window |
|
|
141 | (1) |
|
Forecasting to Create a Predictive Burn Alert |
|
|
142 | (2) |
|
|
144 | (7) |
|
|
151 | (1) |
|
Acting on SLO Burn Alerts |
|
|
152 | (2) |
|
Using Observability Data for SLOs Versus Time-Series Data |
|
|
154 | (2) |
|
|
156 | (1) |
|
14 Observability and the Software Supply Chain |
|
|
157 | (16) |
|
Why Slack Needed Observability |
|
|
159 | (2) |
|
Instrumentation: Shared Client Libraries and Dimensions |
|
|
161 | (3) |
|
Case Studies: Operationalizing the Supply Chain |
|
|
164 | (1) |
|
Understanding Context Through Tooling |
|
|
164 | (2) |
|
Embedding Actionable Alerting |
|
|
166 | (2) |
|
Understanding What Changed |
|
|
168 | (2) |
|
|
170 | (3) |
|
Part IV Observability at Scale |
|
|
|
15 Build Versus Buy and Return on Investment |
|
|
173 | (12) |
|
How to Analyze the ROI of Observability |
|
|
174 | (1) |
|
The Real Costs of Building Your Own |
|
|
175 | (1) |
|
The Hidden Costs of Using "Free" Software |
|
|
175 | (1) |
|
The Benefits of Building Your Own |
|
|
176 | (1) |
|
The Risks of Building Your Own |
|
|
177 | (2) |
|
The Real Costs of Buying Software |
|
|
179 | (1) |
|
The Hidden Financial Costs of Commercial Software |
|
|
179 | (1) |
|
The Hidden Nonfinancial Costs of Commercial Software |
|
|
180 | (1) |
|
The Benefits of Buying Commercial Software |
|
|
181 | (1) |
|
The Risks of Buying Commercial Software |
|
|
182 | (1) |
|
Buy Versus Build Is Not a Binary Choice |
|
|
182 | (1) |
|
|
183 | (2) |
|
16 Efficient Data Storage |
|
|
185 | (22) |
|
The Functional Requirements for Observability |
|
|
185 | (2) |
|
Time-Series Databases Are Inadequate for Observability |
|
|
187 | (2) |
|
Other Possible Data Stores |
|
|
189 | (1) |
|
|
190 | (3) |
|
Case Study: The Implementation of Honeycomb's Retriever |
|
|
193 | (1) |
|
Partitioning Data by Time |
|
|
194 | (1) |
|
Storing Data by Column Within Segments |
|
|
195 | (2) |
|
Performing Query Workloads |
|
|
197 | (2) |
|
|
199 | (1) |
|
Querying Data in Real Time |
|
|
200 | (1) |
|
Making It Affordable with Tiering |
|
|
200 | (1) |
|
Making It Fast with Parallelism |
|
|
201 | (1) |
|
Dealing with High Cardinality |
|
|
202 | (1) |
|
Scaling and Durability Strategies |
|
|
202 | (2) |
|
Notes on Building Your Own Efficient Data Store |
|
|
204 | (1) |
|
|
205 | (2) |
|
17 Cheap and Accurate Enough: Sampling |
|
|
207 | (18) |
|
Sampling to Refine Your Data Collection |
|
|
207 | (2) |
|
Using Different Approaches to Sampling |
|
|
209 | (1) |
|
Constant-Probability Sampling |
|
|
209 | (1) |
|
Sampling on Recent Traffic Volume |
|
|
210 | (1) |
|
Sampling Based on Event Content (Keys) |
|
|
210 | (1) |
|
Combining per Key and Historical Methods |
|
|
211 | (1) |
|
Choosing Dynamic Sampling Options |
|
|
211 | (1) |
|
When to Make a Sampling Decision for Traces |
|
|
211 | (1) |
|
Translating Sampling Strategies into Code |
|
|
212 | (1) |
|
|
212 | (1) |
|
|
213 | (1) |
|
Recording the Sample Rate |
|
|
213 | (2) |
|
|
215 | (1) |
|
|
216 | (2) |
|
Having More Than One Static Sample Rate |
|
|
218 | (1) |
|
Sampling by Key and Target Rate |
|
|
218 | (2) |
|
Sampling with Dynamic Rates on Arbitrarily Many Keys |
|
|
220 | (2) |
|
Putting It All Together: Head and Tail per Key Target Rate Sampling |
|
|
222 | (1) |
|
|
223 | (2) |
|
18 Telemetry Management with Pipelines |
|
|
225 | (18) |
|
Attributes of Telemetry Pipelines |
|
|
226 | (1) |
|
|
226 | (1) |
|
|
227 | (1) |
|
|
227 | (1) |
|
|
228 | (1) |
|
|
228 | (1) |
|
Data Filtering and Augmentation |
|
|
229 | (1) |
|
|
230 | (1) |
|
Ensuring Data Quality and Consistency |
|
|
230 | (1) |
|
Managing a Telemetry Pipeline: Anatomy |
|
|
231 | (2) |
|
Challenges When Managing a Telemetry Pipeline |
|
|
233 | (1) |
|
|
233 | (1) |
|
|
233 | (1) |
|
|
233 | (1) |
|
|
234 | (1) |
|
|
234 | (1) |
|
|
234 | (1) |
|
Use Case: Telemetry Management at Slack |
|
|
235 | (1) |
|
|
235 | (1) |
|
|
236 | (2) |
|
|
238 | (1) |
|
Managing a Telemetry Pipeline: Build Versus Buy |
|
|
239 | (1) |
|
|
240 | (3) |
|
Part V Spreading Observability Culture |
|
|
|
19 The Business Case for Observability |
|
|
243 | (12) |
|
The Reactive Approach to Introducing Change |
|
|
243 | (2) |
|
The Return on Investment of Observability |
|
|
245 | (1) |
|
The Proactive Approach to Introducing Change |
|
|
246 | (2) |
|
Introducing Observability as a Practice |
|
|
248 | (1) |
|
Using the Appropriate Tools |
|
|
249 | (1) |
|
|
250 | (1) |
|
Data Storage and Analytics |
|
|
250 | (1) |
|
Rolling Out Tools to Your Teams |
|
|
251 | (1) |
|
Knowing When You Have Enough Observability |
|
|
252 | (1) |
|
|
253 | (2) |
|
20 Observability's Stakeholders and Allies |
|
|
255 | (12) |
|
Recognizing Nonengineering Observability Needs |
|
|
255 | (3) |
|
Creating Observability Allies in Practice |
|
|
258 | (1) |
|
|
258 | (1) |
|
Customer Success and Product Teams |
|
|
259 | (1) |
|
Sales and Executive Teams |
|
|
260 | (1) |
|
Using Observability Versus Business Intelligence Tools |
|
|
261 | (1) |
|
|
262 | (1) |
|
|
262 | (1) |
|
|
262 | (1) |
|
|
263 | (1) |
|
|
263 | (1) |
|
|
264 | (1) |
|
Using Observability and BI Tools Together in Practice |
|
|
264 | (1) |
|
|
265 | (2) |
|
21 An Observability Maturity Model |
|
|
267 | (12) |
|
A Note About Maturity Models |
|
|
267 | (1) |
|
Why Observability Needs a Maturity Model |
|
|
268 | (1) |
|
About the Observability Maturity Model |
|
|
269 | (1) |
|
Capabilities Referenced in the OMM |
|
|
270 | (1) |
|
Respond to System Failure with Resilience |
|
|
271 | (2) |
|
Deliver High-Quality Code |
|
|
273 | (1) |
|
Manage Complexity and Technical Debt |
|
|
274 | (1) |
|
Release on a Predictable Cadence |
|
|
275 | (1) |
|
|
276 | (1) |
|
Using the OMM for Your Organization |
|
|
277 | (1) |
|
|
277 | (2) |
|
|
279 | (8) |
|
Observability, Then Versus Now |
|
|
279 | (2) |
|
|
281 | (1) |
|
Predictions for Where Observability Is Going |
|
|
282 | (5) |
Index |
|
287 | |