
Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets [Paperback]

  • Format: Paperback / softback, 350 pages, height x width: 233x178 mm
  • Publication date: 25-Sep-2020
  • Publisher: O'Reilly Media
  • ISBN-10: 1492076813
  • ISBN-13: 9781492076810
  • Paperback
  • Price: 59,63 €*
  • * the price is final, i.e. no further discounts apply
  • Regular price: 74,54 €
  • You save 20%
  • Delivery of the book from the publisher takes approximately 3-4 weeks
  • Free delivery
  • Order lead time: 2-4 weeks

Although service-level objectives (SLOs) continue to grow in importance, there's a distinct lack of information about how to implement them. Practical advice that does exist usually assumes that your team already has the infrastructure, tooling, and culture in place. In this book, recognized SLO expert Alex Hidalgo explains how to build an SLO culture from the ground up.

Ideal as a primer and daily reference for anyone creating both the culture and tooling necessary for SLO-based approaches to reliability, this guide provides detailed analysis of advanced SLO and service-level indicator (SLI) techniques. Armed with mathematical models and statistical knowledge to help you get the most out of an SLO-based approach, you'll learn how to build systems capable of measuring meaningful SLIs with buy-in across all departments of your organization.

  • Define SLIs that meaningfully measure the reliability of a service from a user's perspective
  • Choose appropriate SLO targets, including how to perform statistical and probabilistic analysis
  • Use error budgets to help your team have better discussions and make better data-driven decisions
  • Build supportive tooling and resources required for an SLO-based approach
  • Use SLO data to present meaningful reports to leadership and your users
Foreword
Preface
Part I SLO Development
1 The Reliability Stack
  • Service Truths
  • The Reliability Stack
  • Service Level Indicators
  • Service Level Objectives
  • Error Budgets
  • What Is a Service?
  • Example Services
  • Things to Keep in Mind
  • SLOs Are Just Data
  • SLOs Are a Process, Not a Project
  • Iterate Over Everything
  • The World Will Change
  • It's All About Humans
  • Summary
2 How to Think About Reliability
  • Reliability Engineering
  • Past Performance and Your Users
  • Implied Agreements
  • Making Agreements
  • A Worked Example of Reliability
  • How Reliable Should You Be?
  • 100% Isn't Necessary
  • Reliability Is Expensive
  • How to Think About Reliability
  • Summary
3 Developing Meaningful Service Level Indicators
  • What Meaningful SLIs Provide
  • Happier Users
  • Happier Engineers
  • A Happier Business
  • Caring About Many Things
  • A Request and Response Service
  • Measuring Many Things by Measuring Only a Few
  • A Written Example
  • Something More Complex
  • Measuring Complex Service User Reliability
  • Another Written Example
  • Business Alignment and SLIs
  • Summary
4 Choosing Good Service Level Objectives
  • Reliability Targets
  • User Happiness
  • The Problem of Being Too Reliable
  • The Problem with the Number Nine
  • The Problem with Too Many SLOs
  • Service Dependencies and Components
  • Service Dependencies
  • Service Components
  • Reliability for Things You Don't Own
  • Open Source or Hosted Services
  • Measuring Hardware
  • Choosing Targets
  • Past Performance
  • Basic Statistics
  • Metric Attributes
  • Percentile Thresholds
  • What to Do Without a History
  • Summary
5 How to Use Error Budgets
  • Error Budgets in Practice
  • To Release New Features or Not?
  • Project Focus
  • Examining Risk Factors
  • Experimentation and Chaos Engineering
  • Load and Stress Tests
  • Blackhole Exercises
  • Purposely Burning Budget
  • Error Budgets for Humans
  • Error Budget Measurement
  • Establishing Error Budgets
  • Decision Making
  • Error Budget Policies
  • Summary
Part II SLO Implementation
6 Getting Buy-In
  • Engineering Is More than Code
  • Key Stakeholders
  • Engineering
  • Product
  • Operations
  • QA
  • Legal
  • Executive Leadership
  • Making It So
  • Order of Operation
  • Common Objections and How to Overcome Them
  • Your First Error Budget Policy (and Your First Critical Test)
  • Lessons Learned the Hard Way
  • Summary
7 Measuring SLIs and SLOs
  • Design Goals
  • Flexible Targets
  • Testable Targets
  • Freshness
  • Cost
  • Reliability
  • Organizational Constraints
  • Common Machinery
  • Centralized Time Series Statistics (Metrics)
  • Structured Event Databases (Logging)
  • Common Cases
  • Latency-Sensitive Request Processing
  • Low-Lag, High-Throughput Batch Processing
  • Mobile and Web Clients
  • The General Case
  • Other Considerations
  • Integration with Distributed Tracing
  • SLI and SLO Discoverability
  • Summary
8 SLO Monitoring and Alerting
  • Motivation: What Is SLO Alerting, and Why Should You Do It?
  • The Shortcomings of Simple Threshold Alerting
  • A Better Way
  • How to Do SLO Alerting
  • Choosing a Target
  • Error Budgets and Response Time
  • Error Budget Burn Rate
  • Rolling Windows
  • Putting It Together
  • Troubleshooting with SLO Alerting
  • Corner Cases
  • SLO Alerting in a Brownfield Setup
  • Parting Recommendations
  • Summary
9 Probability and Statistics for SLIs and SLOs
  • On Probability
  • SLI Example: Availability
  • SLI Example: Low QPS
  • On Statistics
  • Maximum Likelihood Estimation
  • Maximum a Posteriori
  • Bayesian Inference
  • SLI Example: Queueing Latency
  • Batch Latency
  • SLI Example: Durability
  • Further Reading
  • Summary
10 Architecting for Reliability
  • Example System: Image-Serving Service
  • Architectural Considerations: Hardware
  • Architectural Considerations: Monolith or Microservices
  • Architectural Considerations: Anticipating Failure Modes
  • Architectural Considerations: Three Types of Requests
  • Systems and Building Blocks
  • Quantitative Analysis of Systems
  • Instrumentation! The System Also Needs Instrumentation!
  • Architectural Considerations: Hardware, Revisited
  • SLOs as a Result of System SLIs
  • The Importance of Identifying and Understanding Dependencies
  • Summary
11 Data Reliability
  • Data Services
  • Designing Data Applications
  • Users of Data Services
  • Setting Measurable Data Objectives
  • Data and Data Application Reliability
  • Data Properties
  • Data Application Properties
  • System Design Concerns
  • Data Application Failures
  • Other Qualities
  • Data Lineage
  • Summary
12 A Worked Example
  • Dogs Deserve Clothes
  • How a Service Grows
  • The Design of a Service
  • SLIs and SLOs as User Journeys
  • Customers: Finding and Browsing Products
  • Other Services as Users: Buying Products
  • Internal Users
  • Platforms as Services
  • Summary
Part III SLO Culture
13 Building an SLO Culture
  • A Culture of No SLOs
  • Strategies for Shifting Culture
  • Path to a Culture of SLOs
  • Getting Buy-in
  • Prioritizing SLO Work
  • Implementing Your SLO
  • What Will Your SLIs Be?
  • What Will Your SLOs Be?
  • Using Your SLO
  • Iterating on Your SLO
  • Determining When Your SLOs Are Good Enough
  • Advocating for Others to Use SLOs
  • Summary
14 SLO Evolution
  • SLO Genesis
  • The First Pass
  • Listening to Users
  • Periodic Revisits
  • Usage Changes
  • Increased Utilization Changes
  • Decreased Utilization Changes
  • Functional Utilization Changes
  • Dependency Changes
  • Service Dependency Changes
  • Platform Changes
  • Dependency Introduction or Retirement
  • Failure-Induced Changes
  • User Expectation and Requirement Changes
  • User Expectation Changes
  • User Requirement Changes
  • Tooling Changes
  • Measurement Changes
  • Calculation Changes
  • Intuition-Based Changes
  • Setting Aspirational SLOs
  • Identifying Incorrect SLOs
  • Listening to Users (Redux)
  • Paying Attention to Failures
  • How to Change SLOs
  • Revisit Schedules
  • Summary
15 Discoverable and Understandable SLOs
  • Understandability
  • SLO Definition Documents
  • Phraseology
  • Discoverability
  • Document Repositories
  • Discoverability Tooling
  • SLO Reports
  • Dashboards
  • Summary
16 SLO Advocacy
  • Crawl
  • Do Your Research
  • Prepare Your Sales Pitch
  • Create Your Supporting Artifacts
  • Run Your First Training and Workshop
  • Implement an SLO Pilot with a Single Service
  • Spread Your Message
  • Learn How to Handle Challenges
  • Walk
  • Work with Early Adopters to Implement SLOs for More Services
  • Celebrate Achievements and Build Confidence
  • Create a Library of Case Studies
  • Scale Your Training Program by Adding More Trainers
  • Scale Your Communications
  • Run
  • Share Your Library of SLO Case Studies
  • Create a Community of SLO Experts
  • Continuously Improve
  • Summary
17 Reliability Reporting
  • Basic Reporting
  • Counting Incidents
  • Severity Levels
  • The Problem with Mean Time to X
  • SLOs for Basic Reporting
  • Advanced Reporting
  • SLO Status
  • Error Budget Status
  • Summary
A SLO Definition Template
B Proofs for Chapter 9
Index
Alex Hidalgo is a Site Reliability Engineer and expert at all things related to Service Level Objectives. He developed an interest in computers at a young age, started writing his first BASIC programs at around the age of nine, and remembers the Internet when it was all still text. He eventually turned his hobby into a career, working in various capacities as a network engineer, security engineer, and systems administrator and in many roles within the world of IT support. After moving to New York, he joined Admeld as a Technical Operations Engineer, only to find himself employed by Google a few months later due to acquisition.

At Google, Alex was first introduced to the discipline of Site Reliability Engineering, which connected so closely with him that he wonders how he ever did anything else. Eventually, he found his other calling as an educator, writer, and speaker, traveling all over the world training other Site Reliability Engineers, becoming one of the primary developers of the Coursera Google IT Professional Certification, and contributing to multiple chapters of The Site Reliability Workbook -- most notably "Implementing SLOs" and "SLO Engineering Case Studies."

Recently, he has joined Squarespace, where his focus is now on spreading the concepts of SLO-based approaches to service reliability -- both internally and across the entire industry. When not sharing his passion for error budgets with others, you can find him scuba diving or watching college basketball. He lives in Park Slope, Brooklyn, with his partner Jen and a rescue dog named Taco. He thinks about SLOs so much he once had a dream about defining some for Taco. Twitter handle: @ahidalgosre