
The Site Reliability Workbook: Practical Ways to Implement SRE [Paperback]

  • Format: Paperback / softback, 500 pages, height x width x thickness: 250x150x15 mm, weight: 666 g
  • Publication date: 31-Jul-2018
  • Publisher: O'Reilly Media
  • ISBN-10: 1492029505
  • ISBN-13: 9781492029502
  • Paperback
  • Price: 57,45 €*
  • * the price is final, i.e. no further discounts apply
  • Regular price: 67,59 €
  • You save 15%
  • Delivery from the publisher takes approximately 2-4 weeks
  • Free shipping
  • Delivery time 2-4 weeks

In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.

This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Target, Home Depot, The New York Times, and other companies outline their hard-won experience of what worked for them and what didn’t.

Dive into this workbook and learn how to flesh out your own SRE framework, no matter what size your company is.

You’ll learn:

  • How to run reliable services in environments you don't completely control, such as the cloud
  • Practical examples of how to create, monitor, and run your services via Service Level Objectives (see the sketch after this list)
  • How to convert existing ops teams to SRE, including how to dig out of operational overload
  • Methods for starting SRE from either greenfield or brownfield
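The bullets above center on Service Level Objectives and error budgets; the short sketch below illustrates the underlying arithmetic. It is not taken from the book: the 99.9% availability target, the traffic numbers, and the function names are all hypothetical assumptions, chosen only to show how an error budget and a burn rate are typically computed.

# Illustrative sketch (not from the book). The SLO target, traffic numbers,
# and function names below are hypothetical assumptions.

SLO_TARGET = 0.999  # assumed availability SLO: 99.9% of requests succeed


def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is blown."""
    allowed_failures = (1 - SLO_TARGET) * total_requests
    return 1 - failed_requests / allowed_failures


def burn_rate(total_requests: int, failed_requests: int) -> float:
    """Observed error rate relative to the rate the SLO allows.

    1.0 means burning exactly on budget; above 1.0, the budget runs out
    before the SLO window ends.
    """
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / (1 - SLO_TARGET)


if __name__ == "__main__":
    # Hypothetical traffic for the SLO window so far: 10 million requests, 4,000 failures.
    total, failed = 10_000_000, 4_000
    print(f"Error budget remaining: {error_budget_remaining(total, failed):.1%}")  # 60.0%
    print(f"Burn rate: {burn_rate(total, failed):.2f}x")                           # 0.40x

Chapters 2 and 5 of the workbook develop exactly these ideas into worked SLO examples and multiwindow, multi-burn-rate alerting.
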
Table of Contents

Foreword I
Foreword II
Preface

1 How SRE Relates to DevOps
  Background on DevOps
  No More Silos
  Accidents Are Normal
  Change Should Be Gradual
  Tooling and Culture Are Interrelated
  Measurement Is Crucial
  Background on SRE
  Operations Is a Software Problem
  Manage by Service Level Objectives (SLOs)
  Work to Minimize Toil
  Automate This Year's Job Away
  Move Fast by Reducing the Cost of Failure
  Share Ownership with Developers
  Use the Same Tooling, Regardless of Function or Job Title
  Compare and Contrast
  Organizational Context and Fostering Successful Adoption
  Narrow, Rigid Incentives Narrow Your Success
  It's Better to Fix It Yourself; Don't Blame Someone Else
  Consider Reliability Work as a Specialized Role
  When Can Substitute for Whether
  Strive for Parity of Esteem: Career and Financial

Part I. Foundations

2 Implementing SLOs
  Why SREs Need SLOs
  Getting Started
  Reliability Targets and Error Budgets
  What to Measure: Using SLIs
  A Worked Example
  Moving from SLI Specification to SLI Implementation
  Measuring the SLIs
  Using the SLIs to Calculate Starter SLOs
  Choosing an Appropriate Time Window
  Getting Stakeholder Agreement
  Establishing an Error Budget Policy
  Documenting the SLO and Error Budget Policy
  Dashboards and Reports
  Continuous Improvement of SLO Targets
  Improving the Quality of Your SLO
  Decision Making Using SLOs and Error Budgets
  Advanced Topics
  Modeling User Journeys
  Grading Interaction Importance
  Modeling Dependencies
  Experimenting with Relaxing Your SLOs
  Conclusion

3 SLO Engineering Case Studies
  Evernote's SLO Story
  Why Did Evernote Adopt the SRE Model?
  Introduction of SLOs: A Journey in Progress
  Breaking Down the SLO Wall Between Customer and Cloud Provider
  Current State
  The Home Depot's SLO Story
  The SLO Culture Project
  Our First Set of SLOs
  Evangelizing SLOs
  Automating VALET Data Collection
  The Proliferation of SLOs
  Applying VALET to Batch Applications
  Using VALET in Testing
  Future Aspirations
  Summary
  Conclusion

4 Monitoring
  Desirable Features of a Monitoring Strategy
  Speed
  Calculations
  Interfaces
  Alerts
  Sources of Monitoring Data
  Examples
  Managing Your Monitoring System
  Treat Your Configuration as Code
  Encourage Consistency
  Prefer Loose Coupling
  Metrics with Purpose
  Intended Changes
  Dependencies
  Saturation
  Status of Served Traffic
  Implementing Purposeful Metrics
  Testing Alerting Logic
  Conclusion

5 Alerting on SLOs
  Alerting Considerations
  Ways to Alert on Significant Events
  1: Target Error Rate ≥ SLO Threshold
  2: Increased Alert Window
  3: Incrementing Alert Duration
  4: Alert on Burn Rate
  5: Multiple Burn Rate Alerts
  6: Multiwindow, Multi-Burn-Rate Alerts
  Low-Traffic Services and Error Budget Alerting
  Generating Artificial Traffic
  Combining Services
  Making Service and Infrastructure Changes
  Lowering the SLO or Increasing the Window
  Extreme Availability Goals
  Alerting at Scale
  Conclusion

6 Eliminating Toil
  What Is Toil?
  Measuring Toil
  Toil Taxonomy
  Business Processes
  Production Interrupts
  Release Shepherding
  Migrations
  Cost Engineering and Capacity Planning
  Troubleshooting for Opaque Architectures
  Toil Management Strategies
  Identify and Measure Toil
  Engineer Toil Out of the System
  Reject the Toil
  Use SLOs to Reduce Toil
  Start with Human-Backed Interfaces
  Provide Self-Service Methods
  Get Support from Management and Colleagues
  Promote Toil Reduction as a Feature
  Start Small and Then Improve
  Increase Uniformity
  Assess Risk Within Automation
  Automate Toil Response
  Use Open Source and Third-Party Tools
  Use Feedback to Improve
  Case Studies
  Case Study 1: Reducing Toil in the Datacenter with Automation
  Background
  Problem Statement
  What We Decided to Do
  Design First Effort: Saturn Line-Card Repair
  Implementation
  Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair
  Implementation
  Lessons Learned
  Case Study 2: Decommissioning Filer-Backed Home Directories
  Background
  Problem Statement
  What We Decided to Do
  Design and Implementation
  Key Components
  Lessons Learned
  Conclusion

7 Simplicity
  Measuring Complexity
  Simplicity Is End-to-End, and SREs Are Good for That
  Case Study 1: End-to-End API Simplicity
  Case Study 2: Project Lifecycle Complexity
  Regaining Simplicity
  Case Study 3: Simplification of the Display Ads Spiderweb
  Case Study 4: Running Hundreds of Microservices on a Shared Platform
  Case Study 5: pDNS No Longer Depends on Itself
  Conclusion

Part II. Practices

8 On-Call
  Recap of "Being On-Call" Chapter of First SRE Book
  Example On-Call Setups Within Google and Outside Google
  Google: Forming a New Team
  Evernote: Finding Our Feet in the Cloud
  Practical Implementation Details
  Anatomy of Pager Load
  On-Call Flexibility
  On-Call Team Dynamics
  Conclusion

9 Incident Response
  Incident Management at Google
  Incident Command System
  Main Roles in Incident Response
  Case Studies
  Case Study 1: Software Bug - The Lights Are On but No One's (Google) Home
  Case Study 2: Service Fault - Cache Me If You Can
  Case Study 3: Power Outage - Lightning Never Strikes Twice...Until It Does
  Case Study 4: Incident Response at PagerDuty
  Putting Best Practices into Practice
  Incident Response Training
  Prepare Beforehand
  Drills
  Conclusion

10 Postmortem Culture: Learning from Failure
  Case Study
  Bad Postmortem
  Why Is This Postmortem Bad?
  Good Postmortem
  Why Is This Postmortem Better?
  Organizational Incentives
  Model and Enforce Blameless Behavior
  Reward Postmortem Outcomes
  Share Postmortems Openly
  Respond to Postmortem Culture Failures
  Tools and Templates
  Postmortem Templates
  Postmortem Tooling
  Conclusion

11 Managing Load
  Google Cloud Load Balancing
  Anycast
  Maglev
  Global Software Load Balancer
  Google Front End
  GCLB: Low Latency
  GCLB: High Availability
  Case Study 1: Pokémon GO on GCLB
  Autoscaling
  Handling Unhealthy Machines
  Working with Stateful Systems
  Configuring Conservatively
  Setting Constraints
  Including Kill Switches and Manual Overrides
  Avoiding Overloading Backends
  Avoiding Traffic Imbalance
  Combining Strategies to Manage Load
  Case Study 2: When Load Shedding Attacks
  Conclusion

12 Introducing Non-Abstract Large System Design
  What Is NALSD?
  Why "Non-Abstract"?
  AdWords Example
  Design Process
  Initial Requirements
  One Machine
  Distributed System
  Conclusion

13 Data Processing Pipelines
  Pipeline Applications
  Event Processing/Data Transformation to Order or Structure Data
  Data Analytics
  Machine Learning
  Pipeline Best Practices
  Define and Measure Service Level Objectives
  Plan for Dependency Failure
  Create and Maintain Pipeline Documentation
  Map Your Development Lifecycle
  Reduce Hotspotting and Workload Patterns
  Implement Autoscaling and Resource Planning
  Adhere to Access Control and Security Policies
  Plan Escalation Paths
  Pipeline Requirements and Design
  What Features Do You Need?
  Idempotent and Two-Phase Mutations
  Checkpointing
  Code Patterns
  Pipeline Production Readiness
  Pipeline Failures: Prevention and Response
  Potential Failure Modes
  Potential Causes
  Case Study: Spotify
  Event Delivery
  Event Delivery System Design and Architecture
  Event Delivery System Operation
  Customer Integration and Support
  Summary
  Conclusion

14 Configuration Design and Best Practices
  What Is Configuration?
  Configuration and Reliability
  Separating Philosophy and Mechanics
  Configuration Philosophy
  Configuration Asks Users Questions
  Questions Should Be Close to User Goals
  Mandatory and Optional Questions
  Escaping Simplicity
  Mechanics of Configuration
  Separate Configuration and Resulting Data
  Importance of Tooling
  Ownership and Change Tracking
  Safe Configuration Change Application
  Conclusion

15 Configuration Specifics
  Configuration-Induced Toil
  Reducing Configuration-Induced Toil
  Critical Properties and Pitfalls of Configuration Systems
  Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
  Pitfall 2: Designing Accidental or Ad Hoc Language Features
  Pitfall 3: Building Too Much Domain-Specific Optimization
  Pitfall 4: Interleaving "Configuration Evaluation" with "Side Effects"
  Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
  Integrating a Configuration Language
  Generating Config in Specific Formats
  Driving Multiple Applications
  Integrating an Existing Application: Kubernetes
  What Kubernetes Provides
  Example Kubernetes Config
  Integrating the Configuration Language
  Integrating Custom Applications (In-House Software)
  Effectively Operating a Configuration System
  Versioning
  Source Control
  Tooling
  Testing
  When to Evaluate Configuration
  Very Early: Checking in the JSON
  Middle of the Road: Evaluate at Build Time
  Late: Evaluate at Runtime
  Guarding Against Abusive Configuration
  Conclusion

16 Canarying Releases
  Release Engineering Principles
  Balancing Release Velocity and Reliability
  What Is Canarying?
  Release Engineering and Canarying
  Requirements of a Canary Process
  Our Example Setup
  A Roll Forward Deployment Versus a Simple Canary Deployment
  Canary Implementation
  Minimizing Risk to SLOs and the Error Budget
  Choosing a Canary Population and Duration
  Selecting and Evaluating Metrics
  Metrics Should Indicate Problems
  Metrics Should Be Representative and Attributable
  Before/After Evaluation Is Risky
  Use a Gradual Canary for Better Metric Selection
  Dependencies and Isolation
  Canarying in Noninteractive Systems
  Requirements on Monitoring Data
  Related Concepts
  Blue/Green Deployment
  Artificial Load Generation
  Traffic Teeing
  Conclusion

Part III. Processes

17 Identifying and Recovering from Overload
  From Load to Overload
  Case Study 1: Work Overload When Half a Team Leaves
  Background
  Problem Statement
  What We Decided to Do
  Implementation
  Lessons Learned
  Case Study 2: Perceived Overload After Organizational and Workload Changes
  Background
  Problem Statement
  What We Decided to Do
  Implementation
  Effects
  Lessons Learned
  Strategies for Mitigating Overload
  Recognizing the Symptoms of Overload
  Reducing Overload and Restoring Team Health
  Conclusion

18 SRE Engagement Model
  The Service Lifecycle
  Phase 1: Architecture and Design
  Phase 2: Active Development
  Phase 3: Limited Availability
  Phase 4: General Availability
  Phase 5: Deprecation
  Phase 6: Abandoned
  Phase 7: Unsupported
  Setting Up the Relationship
  Communicating Business and Production Priorities
  Identifying Risks
  Aligning Goals
  Setting Ground Rules
  Planning and Executing
  Sustaining an Effective Ongoing Relationship
  Investing Time in Working Better Together
  Maintaining an Open Line of Communication
  Performing Regular Service Reviews
  Reassessing When Ground Rules Start to Slip
  Adjusting Priorities According to Your SLOs and Error Budget
  Handling Mistakes Appropriately
  Scaling SRE to Larger Environments
  Supporting Multiple Services with a Single SRE Team
  Structuring a Multiple SRE Team Environment
  Adapting SRE Team Structures to Changing Circumstances
  Running Cohesive Distributed SRE Teams
  Ending the Relationship
  Case Study 1: Ares
  Case Study 2: Data Analysis Pipeline
  Conclusion

19 SRE: Reaching Beyond Your Walls
  Truths We Hold to Be Self-Evident
  Reliability Is the Most Important Feature
  Your Users, Not Your Monitoring, Decide Your Reliability
  If You Run a Platform, Then Reliability Is a Partnership
  Everything Important Eventually Becomes a Platform
  When Your Customers Have a Hard Time, You Have to Slow Down
  You Will Need to Practice SRE with Your Customers
  How to: SRE with Your Customers
  Step 1: SLOs and SLIs Are How You Speak
  Step 2: Audit the Monitoring and Build Shared Dashboards
  Step 3: Measure and Renegotiate
  Step 4: Design Reviews and Risk Analysis
  Step 5: Practice, Practice, Practice
  Be Thoughtful and Disciplined
  Conclusion

20 SRE Team Lifecycles
  SRE Practices Without SREs
  Starting an SRE Role
  Finding Your First SRE
  Placing Your First SRE
  Bootstrapping Your First SRE
  Distributed SREs
  Your First SRE Team
  Forming
  Storming
  Norming
  Performing
  Making More SRE Teams
  Service Complexity
  SRE Rollout
  Geographical Splits
  Suggested Practices for Running Many Teams
  Mission Control
  SRE Exchange
  Training
  Horizontal Projects
  SRE Mobility
  Travel
  Launch Coordination Engineering Teams
  Production Excellence
  SRE Funding and Hiring
  Conclusion

21 Organizational Change Management in SRE
  SRE Embraces Change
  Introduction to Change Management
  Lewin's Three-Stage Model
  McKinsey's 7-S Model
  Kotter's Eight-Step Process for Leading Change
  The Prosci ADKAR Model
  Emotion-Based Models
  The Deming Cycle
  How These Theories Apply to SRE
  Case Study 1: Scaling Waze - From Ad Hoc to Planned Change
  Background
  The Messaging Queue: Replacing a System While Maintaining Reliability
  The Next Cycle of Change: Improving the Deployment Process
  Lessons Learned
  Case Study 2: Common Tooling Adoption in SRE
  Background
  Problem Statement
  What We Decided to Do
  Design
  Implementation: Monitoring
  Lessons Learned
  Conclusion

Conclusion
A. Example SLO Document
B. Example Error Budget Policy
C. Results of Postmortem Analysis
Index

Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University.

Niall Murphy has been working in Internet infrastructure for twenty years. He is a company founder, a published author, a photographer, and holds degrees in Computer Science & Mathematics and Poetry Studies.

Dave Rensin is a Google SRE Director, previous O'Reilly author, and serial entrepreneur. He holds a degree in Statistics.

Kent Kawahara is a Program Manager for Google's Site Reliability Engineering team focused on Google Cloud Platform customers and is based in Sunnyvale, CA. In previous Google roles, he managed technical and design teams to develop advertising support tools and worked with large advertisers and agencies on strategic advertising initiatives. Prior to Google, he worked in Product Management, Software QA, and Professional Services at two successful telecommunications startups. He holds a BS in Electrical Engineering and Computer Science from the University of California at Berkeley.

Stephen Thorne is a Senior Site Reliability Engineer at Google. He currently works in Customer Reliability Engineering, helping to integrate Google's Cloud customers' operations with Google SRE. Stephen learned how to be an SRE on the team that runs Google's advertiser and publisher user interfaces, and later worked on App Engine. Before his time at Google, he fought against spam and viruses in his home country of Australia, where he also earned his B.S. in Computer Science.