Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets [Paperback]

4.10/5 (145 ratings from Goodreads)

Alex Hidalgo

Format: Paperback / softback, 350 pages, height x width: 233x178 mm
Publication date: 25-Sep-2020
Publisher: O'Reilly Media
ISBN-10: 1492076813
ISBN-13: 9781492076810

Other books on the subject:

Distributed systems

Paperback
Price: 59,63 €*
* the price is final, i.e. no further discounts apply
Regular price: 74,54 €
You save 20%
Delivery of the book from the publisher takes approximately 3-4 weeks

Permalink: https://www.kriso.ee/db/9781492076810.html

Although service-level objectives (SLOs) continue to grow in importance, there's a distinct lack of information about how to implement them. Practical advice that does exist usually assumes that your team already has the infrastructure, tooling, and culture in place. In this book, recognized SLO expert Alex Hidalgo explains how to build an SLO culture from the ground up.

Ideal as a primer and daily reference for anyone creating both the culture and tooling necessary for SLO-based approaches to reliability, this guide provides detailed analysis of advanced SLO and service-level indicator (SLI) techniques. Armed with mathematical models and statistical knowledge to help you get the most out of an SLO-based approach, you'll learn how to build systems capable of measuring meaningful SLIs with buy-in across all departments of your organization.

Define SLIs that meaningfully measure the reliability of a service from a user's perspective
Choose appropriate SLO targets, including how to perform statistical and probabilistic analysis
Use error budgets to help your team have better discussions and make better data-driven decisions
Build supportive tooling and resources required for an SLO-based approach
Use SLO data to present meaningful reports to leadership and your users
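
As a rough illustration of how the three concepts in the list above relate, the following Python sketch computes an SLI as the ratio of good events to total events, compares it with an SLO target, and derives the error budget for the window. It is not taken from the book: the class name `SloWindow`, its fields, and the sample numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Event counts for one SLO evaluation window (hypothetical structure)."""

    good_events: int   # events that met the reliability criterion
    total_events: int  # all valid events observed in the window
    target: float      # SLO target as a fraction, e.g. 0.999 for 99.9%

    def sli(self) -> float:
        """Measured SLI: fraction of events that were good."""
        if self.total_events == 0:
            return 1.0  # assumption: no traffic counts as fully reliable
        return self.good_events / self.total_events

    def meets_target(self) -> bool:
        """True when the measured SLI is at or above the SLO target."""
        return self.sli() >= self.target

    def error_budget(self) -> float:
        """Number of bad events the target tolerates in this window."""
        return (1.0 - self.target) * self.total_events

    def budget_remaining(self) -> float:
        """Error budget left after subtracting the bad events seen so far."""
        bad_events = self.total_events - self.good_events
        return self.error_budget() - bad_events


if __name__ == "__main__":
    # Illustrative numbers only: one million requests, 500 of them bad.
    window = SloWindow(good_events=999_500, total_events=1_000_000, target=0.999)
    print(f"SLI: {window.sli():.4%}")
    print(f"Meets target: {window.meets_target()}")
    print(f"Error budget (events): {window.error_budget():.0f}")
    print(f"Budget remaining (events): {window.budget_remaining():.0f}")
```

With these sample numbers a 99.9% target over one million events tolerates roughly 1,000 bad events, so 500 bad events leave about half of the budget unspent. Counting events rather than time is only one possible choice; a time-based budget would follow the same arithmetic with minutes in place of events.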

Foreword
Preface

Part I SLO Development

1 The Reliability Stack
  Service Truths
  The Reliability Stack
  Service Level Indicators
  Service Level Objectives
  Error Budgets
  What Is a Service?
  Example Services
  Things to Keep in Mind
  SLOs Are Just Data
  SLOs Are a Process, Not a Project
  Iterate Over Everything
  The World Will Change
  It's All About Humans
  Summary

2 How to Think About Reliability
  Reliability Engineering
  Past Performance and Your Users
  Implied Agreements
  Making Agreements
  A Worked Example of Reliability
  How Reliable Should You Be?
  100% Isn't Necessary
  Reliability Is Expensive
  How to Think About Reliability
  Summary

3 Developing Meaningful Service Level Indicators
  What Meaningful SLIs Provide
  Happier Users
  Happier Engineers
  A Happier Business
  Caring About Many Things
  A Request and Response Service
  Measuring Many Things by Measuring Only a Few
  A Written Example
  Something More Complex
  Measuring Complex Service User Reliability
  Another Written Example
  Business Alignment and SLIs
  Summary

4 Choosing Good Service Level Objectives
  Reliability Targets
  User Happiness
  The Problem of Being Too Reliable
  The Problem with the Number Nine
  The Problem with Too Many SLOs
  Service Dependencies and Components
  Service Dependencies
  Service Components
  Reliability for Things You Don't Own
  Open Source or Hosted Services
  Measuring Hardware
  Choosing Targets
  Past Performance
  Basic Statistics
  Metric Attributes
  Percentile Thresholds
  What to Do Without a History
  Summary

5 How to Use Error Budgets
  Error Budgets in Practice
  To Release New Features or Not?
  Project Focus
  Examining Risk Factors
  Experimentation and Chaos Engineering
  Load and Stress Tests
  Blackhole Exercises
  Purposely Burning Budget
  Error Budgets for Humans
  Error Budget Measurement
  Establishing Error Budgets
  Decision Making
  Error Budget Policies
  Summary

Part II SLO Implementation

6 Getting Buy-In
  Engineering Is More than Code
  Key Stakeholders
  Engineering
  Product
  Operations
  Legal
  Executive Leadership
  Making It So
  Order of Operation
  Common Objections and How to Overcome Them
  Your First Error Budget Policy (and Your First Critical Test)
  Lessons Learned the Hard Way
  Summary

7 Measuring SLIs and SLOs
  Design Goals
  Flexible Targets
  Testable Targets
  Freshness
  Cost
  Reliability
  Organizational Constraints
  Common Machinery
  Centralized Time Series Statistics (Metrics)
  Structured Event Databases (Logging)
  Common Cases
  Latency-Sensitive Request Processing
  Low-Lag, High-Throughput Batch Processing
  Mobile and Web Clients
  The General Case
  Other Considerations
  Integration with Distributed Tracing
  SLI and SLO Discoverability
  Summary

8 SLO Monitoring and Alerting
  Motivation: What Is SLO Alerting, and Why Should You Do It?
  The Shortcomings of Simple Threshold Alerting
  A Better Way
  How to Do SLO Alerting
  Choosing a Target
  Error Budgets and Response Time
  Error Budget Burn Rate
  Rolling Windows
  Putting It Together
  Troubleshooting with SLO Alerting
  Corner Cases
  SLO Alerting in a Brownfield Setup
  Parting Recommendations
  Summary

9 Probability and Statistics for SLIs and SLOs
  On Probability
  SLI Example: Availability
  SLI Example: Low QPS
  On Statistics
  Maximum Likelihood Estimation
  Maximum a Posteriori
  Bayesian Inference
  SLI Example: Queueing Latency
  Batch Latency
  SLI Example: Durability
  Further Reading
  Summary

10 Architecting for Reliability
  Example System: Image-Serving Service
  Architectural Considerations: Hardware
  Architectural Considerations: Monolith or Microservices
  Architectural Considerations: Anticipating Failure Modes
  Architectural Considerations: Three Types of Requests
  Systems and Building Blocks
  Quantitative Analysis of Systems
  Instrumentation! The System Also Needs Instrumentation!
  Architectural Considerations: Hardware, Revisited
  SLOs as a Result of System SLIs
  The Importance of Identifying and Understanding Dependencies
  Summary

11 Data Reliability
  Data Services
  Designing Data Applications
  Users of Data Services
  Setting Measurable Data Objectives
  Data and Data Application Reliability
  Data Properties
  Data Application Properties
  System Design Concerns
  Data Application Failures
  Other Qualities
  Data Lineage
  Summary

12 A Worked Example
  Dogs Deserve Clothes
  How a Service Grows
  The Design of a Service
  SLIs and SLOs as User Journeys
  Customers: Finding and Browsing Products
  Other Services as Users: Buying Products
  Internal Users
  Platforms as Services
  Summary

Part III SLO Culture

13 Building an SLO Culture
  A Culture of No SLOs
  Strategies for Shifting Culture
  Path to a Culture of SLOs
  Getting Buy-in
  Prioritizing SLO Work
  Implementing Your SLO
  What Will Your SLIs Be?
  What Will Your SLOs Be?
  Using Your SLO
  Iterating on Your SLO
  Determining When Your SLOs Are Good Enough
  Advocating for Others to Use SLOs
  Summary

14 SLO Evolution
  SLO Genesis
  The First Pass
  Listening to Users
  Periodic Revisits
  Usage Changes
  Increased Utilization Changes
  Decreased Utilization Changes
  Functional Utilization Changes
  Dependency Changes
  Service Dependency Changes
  Platform Changes
  Dependency Introduction or Retirement
  Failure-Induced Changes
  User Expectation and Requirement Changes
  User Expectation Changes
  User Requirement Changes
  Tooling Changes
  Measurement Changes
  Calculation Changes
  Intuition-Based Changes
  Setting Aspirational SLOs
  Identifying Incorrect SLOs
  Listening to Users (Redux)
  Paying Attention to Failures
  How to Change SLOs
  Revisit Schedules
  Summary

15 Discoverable and Understandable SLOs
  Understandability
  SLO Definition Documents
  Phraseology
  Discoverability
  Document Repositories
  Discoverability Tooling
  SLO Reports
  Dashboards
  Summary

16 SLO Advocacy
  Crawl
  Do Your Research
  Prepare Your Sales Pitch
  Create Your Supporting Artifacts
  Run Your First Training and Workshop
  Implement an SLO Pilot with a Single Service
  Spread Your Message
  Learn How to Handle Challenges
  Walk
  Work with Early Adopters to Implement SLOs for More Services
  Celebrate Achievements and Build Confidence
  Create a Library of Case Studies
  Scale Your Training Program by Adding More Trainers
  Scale Your Communications
  Run
  Share Your Library of SLO Case Studies
  Create a Community of SLO Experts
  Continuously Improve
  Summary

17 Reliability Reporting
  Basic Reporting
  Counting Incidents
  Severity Levels
  The Problem with Mean Time to X
  SLOs for Basic Reporting
  Advanced Reporting
  SLO Status
  Error Budget Status
  Summary

Appendix A: SLO Definition Template
Appendix B: Proofs for Chapter 9
Index

Alex Hidalgo is a Site Reliability Engineer and expert at all things related to Service Level Objectives. He developed an interest in computers at a young age, started writing his first BASIC programs at around the age of nine, and remembers the Internet when it was all still text. He eventually turned his hobby into a career, working in various capacities as a network engineer, security engineer, and systems administrator and in many roles within the world of IT support. After moving to New York, he joined Admeld as a Technical Operations Engineer, only to find himself employed by Google a few months later due to acquisition.

At Google, Alex was first introduced to the discipline of Site Reliability Engineering, which connected so closely with him that he wonders how he ever did anything else. Eventually, he found his other calling as an educator, writer, and speaker, traveling all over the world training other Site Reliability Engineers, becoming one of the primary developers of the Coursera Google IT Professional Certification, and contributing to multiple chapters of The Site Reliability Workbook -- most notably "Implementing SLOs" and "SLO Engineering Case Studies."

Recently, he has joined Squarespace, where his focus is now on spreading the concepts of SLO-based approaches to service reliability -- both internally and across the entire industry. When not sharing his passion for error budgets with others, you can find him scuba diving or watching college basketball. He lives in Park Slope, Brooklyn, with his partner Jen and a rescue dog named Taco. He thinks about SLOs so much he once had a dream about defining some for Taco. Twitter handle: @ahidalgosre
