
The Site Reliability Workbook: Practical Ways to Implement SRE [Paperback]

  • Format: Paperback / softback, 500 pages, height x width x thickness: 250x150x15 mm, weight: 666 g
  • Publication date: 31-Jul-2018
  • Publisher: O'Reilly Media
  • ISBN-10: 1492029505
  • ISBN-13: 9781492029502
  • Paperback
  • Price: 57,45 €*
  • * the price is final, i.e. no further discounts apply
  • Regular price: 67,59 €
  • You save 15%
  • Delivery from the publisher takes approximately 2-4 weeks
  • Free shipping
  • Delivery time 2-4 weeks

In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.

This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Target, Home Depot, The New York Times, and other companies outline their hard-won experience of what worked for them and what didn’t.

Dive into this workbook and learn how to flesh out your own SRE framework, no matter what size your company is.

You’ll learn:

  • How to run reliable services in environments you don't completely control, such as the cloud
  • Practical examples of how to create, monitor, and run your services via Service Level Objectives (see the sketch after this list)
  • How to convert existing ops teams to SRE, including how to dig out of operational overload
  • Methods for starting SRE from either greenfield or brownfield
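The bullets above center on Service Level Objectives and error budgets; the short sketch below illustrates the underlying arithmetic. It is not taken from the book: the 99.9% availability target, the traffic numbers, and the function names are all hypothetical assumptions, chosen only to show how an error budget and a burn rate are typically computed.

# Illustrative sketch (not from the book). The SLO target, traffic numbers,
# and function names below are hypothetical assumptions.

SLO_TARGET = 0.999  # assumed availability SLO: 99.9% of requests succeed


def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is blown."""
    allowed_failures = (1 - SLO_TARGET) * total_requests
    return 1 - failed_requests / allowed_failures


def burn_rate(total_requests: int, failed_requests: int) -> float:
    """Observed error rate relative to the rate the SLO allows.

    1.0 means burning exactly on budget; above 1.0, the budget runs out
    before the SLO window ends.
    """
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / (1 - SLO_TARGET)


if __name__ == "__main__":
    # Hypothetical traffic for the SLO window so far: 10 million requests, 4,000 failures.
    total, failed = 10_000_000, 4_000
    print(f"Error budget remaining: {error_budget_remaining(total, failed):.1%}")  # 60.0%
    print(f"Burn rate: {burn_rate(total, failed):.2f}x")                           # 0.40x

Chapters 2 and 5 of the workbook develop exactly these ideas into worked SLO examples and multiwindow, multi-burn-rate alerting.
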
Table of Contents

Foreword I
Foreword II
Preface

1 How SRE Relates to DevOps
  Background on DevOps
  No More Silos
  Accidents Are Normal
  Change Should Be Gradual
  Tooling and Culture Are Interrelated
  Measurement Is Crucial
  Background on SRE
  Operations Is a Software Problem
  Manage by Service Level Objectives (SLOs)
  Work to Minimize Toil
  Automate This Year's Job Away
  Move Fast by Reducing the Cost of Failure
  Share Ownership with Developers
  Use the Same Tooling, Regardless of Function or Job Title
  Compare and Contrast
  Organizational Context and Fostering Successful Adoption
  Narrow, Rigid Incentives Narrow Your Success
  It's Better to Fix It Yourself; Don't Blame Someone Else
  Consider Reliability Work as a Specialized Role
  When Can Substitute for Whether
  Strive for Parity of Esteem: Career and Financial

Part I. Foundations

2 Implementing SLOs
  Why SREs Need SLOs
  Getting Started
  Reliability Targets and Error Budgets
  What to Measure: Using SLIs
  A Worked Example
  Moving from SLI Specification to SLI Implementation
  Measuring the SLIs
  Using the SLIs to Calculate Starter SLOs
  Choosing an Appropriate Time Window
  Getting Stakeholder Agreement
  Establishing an Error Budget Policy
  Documenting the SLO and Error Budget Policy
  Dashboards and Reports
  Continuous Improvement of SLO Targets
  Improving the Quality of Your SLO
  Decision Making Using SLOs and Error Budgets
  Advanced Topics
  Modeling User Journeys
  Grading Interaction Importance
  Modeling Dependencies
  Experimenting with Relaxing Your SLOs
  Conclusion

3 SLO Engineering Case Studies
  Evernote's SLO Story
  Why Did Evernote Adopt the SRE Model?
  Introduction of SLOs: A Journey in Progress
  Breaking Down the SLO Wall Between Customer and Cloud Provider
  Current State
  The Home Depot's SLO Story
  The SLO Culture Project
  Our First Set of SLOs
  Evangelizing SLOs
  Automating VALET Data Collection
  The Proliferation of SLOs
  Applying VALET to Batch Applications
  Using VALET in Testing
  Future Aspirations
  Summary
  Conclusion

4 Monitoring
  Desirable Features of a Monitoring Strategy
  Speed
  Calculations
  Interfaces
  Alerts
  Sources of Monitoring Data
  Examples
  Managing Your Monitoring System
  Treat Your Configuration as Code
  Encourage Consistency
  Prefer Loose Coupling
  Metrics with Purpose
  Intended Changes
  Dependencies
  Saturation
  Status of Served Traffic
  Implementing Purposeful Metrics
  Testing Alerting Logic
  Conclusion

5 Alerting on SLOs
  Alerting Considerations
  Ways to Alert on Significant Events
  1: Target Error Rate ≥ SLO Threshold
  2: Increased Alert Window
  3: Incrementing Alert Duration
  4: Alert on Burn Rate
  5: Multiple Burn Rate Alerts
  6: Multiwindow, Multi-Burn-Rate Alerts
  Low-Traffic Services and Error Budget Alerting
  Generating Artificial Traffic
  Combining Services
  Making Service and Infrastructure Changes
  Lowering the SLO or Increasing the Window
  Extreme Availability Goals
  Alerting at Scale
  Conclusion

6 Eliminating Toil
  What Is Toil?
  Measuring Toil
  Toil Taxonomy
  Business Processes
  Production Interrupts
  Release Shepherding
  Migrations
  Cost Engineering and Capacity Planning
  Troubleshooting for Opaque Architectures
  Toil Management Strategies
  Identify and Measure Toil
  Engineer Toil Out of the System
  Reject the Toil
  Use SLOs to Reduce Toil
  Start with Human-Backed Interfaces
  Provide Self-Service Methods
  Get Support from Management and Colleagues
  Promote Toil Reduction as a Feature
  Start Small and Then Improve
  Increase Uniformity
  Assess Risk Within Automation
  Automate Toil Response
  Use Open Source and Third-Party Tools
  Use Feedback to Improve
  Case Studies
  Case Study 1: Reducing Toil in the Datacenter with Automation
  Background
  Problem Statement
  What We Decided to Do
  Design First Effort: Saturn Line-Card Repair
  Implementation
  Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair
  Implementation
  Lessons Learned
  Case Study 2: Decommissioning Filer-Backed Home Directories
  Background
  Problem Statement
  What We Decided to Do
  Design and Implementation
  Key Components
  Lessons Learned
  Conclusion

7 Simplicity
  Measuring Complexity
  Simplicity Is End-to-End, and SREs Are Good for That
  Case Study 1: End-to-End API Simplicity
  Case Study 2: Project Lifecycle Complexity
  Regaining Simplicity
  Case Study 3: Simplification of the Display Ads Spiderweb
  Case Study 4: Running Hundreds of Microservices on a Shared Platform
  Case Study 5: pDNS No Longer Depends on Itself
  Conclusion

Part II. Practices

8 On-Call
  Recap of "Being On-Call" Chapter of First SRE Book
  Example On-Call Setups Within Google and Outside Google
  Google: Forming a New Team
  Evernote: Finding Our Feet in the Cloud
  Practical Implementation Details
  Anatomy of Pager Load
  On-Call Flexibility
  On-Call Team Dynamics
  Conclusion

9 Incident Response
  Incident Management at Google
  Incident Command System
  Main Roles in Incident Response
  Case Studies
  Case Study 1: Software Bug - The Lights Are On but No One's (Google) Home
  Case Study 2: Service Fault - Cache Me If You Can
  Case Study 3: Power Outage - Lightning Never Strikes Twice...Until It Does
  Case Study 4: Incident Response at PagerDuty
  Putting Best Practices into Practice
  Incident Response Training
  Prepare Beforehand
  Drills
  Conclusion

10 Postmortem Culture: Learning from Failure
  Case Study
  Bad Postmortem
  Why Is This Postmortem Bad?
  Good Postmortem
  Why Is This Postmortem Better?
  Organizational Incentives
  Model and Enforce Blameless Behavior
  Reward Postmortem Outcomes
  Share Postmortems Openly
  Respond to Postmortem Culture Failures
  Tools and Templates
  Postmortem Templates
  Postmortem Tooling
  Conclusion

11 Managing Load
  Google Cloud Load Balancing
  Anycast
  Maglev
  Global Software Load Balancer
  Google Front End
  GCLB: Low Latency
  GCLB: High Availability
  Case Study 1: Pokémon GO on GCLB
  Autoscaling
  Handling Unhealthy Machines
  Working with Stateful Systems
  Configuring Conservatively
  Setting Constraints
  Including Kill Switches and Manual Overrides
  Avoiding Overloading Backends
  Avoiding Traffic Imbalance
  Combining Strategies to Manage Load
  Case Study 2: When Load Shedding Attacks
  Conclusion

12 Introducing Non-Abstract Large System Design
  What Is NALSD?
  Why "Non-Abstract"?
  AdWords Example
  Design Process
  Initial Requirements
  One Machine
  Distributed System
  Conclusion

13 Data Processing Pipelines
  Pipeline Applications
  Event Processing/Data Transformation to Order or Structure Data
  Data Analytics
  Machine Learning
  Pipeline Best Practices
  Define and Measure Service Level Objectives
  Plan for Dependency Failure
  Create and Maintain Pipeline Documentation
  Map Your Development Lifecycle
  Reduce Hotspotting and Workload Patterns
  Implement Autoscaling and Resource Planning
  Adhere to Access Control and Security Policies
  Plan Escalation Paths
  Pipeline Requirements and Design
  What Features Do You Need?
  Idempotent and Two-Phase Mutations
  Checkpointing
  Code Patterns
  Pipeline Production Readiness
  Pipeline Failures: Prevention and Response
  Potential Failure Modes
  Potential Causes
  Case Study: Spotify
  Event Delivery
  Event Delivery System Design and Architecture
  Event Delivery System Operation
  Customer Integration and Support
  Summary
  Conclusion

14 Configuration Design and Best Practices
  What Is Configuration?
  Configuration and Reliability
  Separating Philosophy and Mechanics
  Configuration Philosophy
  Configuration Asks Users Questions
  Questions Should Be Close to User Goals
  Mandatory and Optional Questions
  Escaping Simplicity
  Mechanics of Configuration
  Separate Configuration and Resulting Data
  Importance of Tooling
  Ownership and Change Tracking
  Safe Configuration Change Application
  Conclusion

15 Configuration Specifics
  Configuration-Induced Toil
  Reducing Configuration-Induced Toil
  Critical Properties and Pitfalls of Configuration Systems
  Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
  Pitfall 2: Designing Accidental or Ad Hoc Language Features
  Pitfall 3: Building Too Much Domain-Specific Optimization
  Pitfall 4: Interleaving "Configuration Evaluation" with "Side Effects"
  Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
  Integrating a Configuration Language
  Generating Config in Specific Formats
  Driving Multiple Applications
  Integrating an Existing Application: Kubernetes
  What Kubernetes Provides
  Example Kubernetes Config
  Integrating the Configuration Language
  Integrating Custom Applications (In-House Software)
  Effectively Operating a Configuration System
  Versioning
  Source Control
  Tooling
  Testing
  When to Evaluate Configuration
  Very Early: Checking in the JSON
  Middle of the Road: Evaluate at Build Time
  Late: Evaluate at Runtime
  Guarding Against Abusive Configuration
  Conclusion

16 Canarying Releases
  Release Engineering Principles
  Balancing Release Velocity and Reliability
  What Is Canarying?
  Release Engineering and Canarying
  Requirements of a Canary Process
  Our Example Setup
  A Roll Forward Deployment Versus a Simple Canary Deployment
  Canary Implementation
  Minimizing Risk to SLOs and the Error Budget
  Choosing a Canary Population and Duration
  Selecting and Evaluating Metrics
  Metrics Should Indicate Problems
  Metrics Should Be Representative and Attributable
  Before/After Evaluation Is Risky
  Use a Gradual Canary for Better Metric Selection
  Dependencies and Isolation
  Canarying in Noninteractive Systems
  Requirements on Monitoring Data
  Related Concepts
  Blue/Green Deployment
  Artificial Load Generation
  Traffic Teeing
  Conclusion

Part III. Processes

17 Identifying and Recovering from Overload
  From Load to Overload
  Case Study 1: Work Overload When Half a Team Leaves
  Background
  Problem Statement
  What We Decided to Do
  Implementation
  Lessons Learned
  Case Study 2: Perceived Overload After Organizational and Workload Changes
  Background
  Problem Statement
  What We Decided to Do
  Implementation
  Effects
  Lessons Learned
  Strategies for Mitigating Overload
  Recognizing the Symptoms of Overload
  Reducing Overload and Restoring Team Health
  Conclusion

18 SRE Engagement Model
  The Service Lifecycle
  Phase 1: Architecture and Design
  Phase 2: Active Development
  Phase 3: Limited Availability
  Phase 4: General Availability
  Phase 5: Deprecation
  Phase 6: Abandoned
  Phase 7: Unsupported
  Setting Up the Relationship
  Communicating Business and Production Priorities
  Identifying Risks
  Aligning Goals
  Setting Ground Rules
  Planning and Executing
  Sustaining an Effective Ongoing Relationship
  Investing Time in Working Better Together
  Maintaining an Open Line of Communication
  Performing Regular Service Reviews
  Reassessing When Ground Rules Start to Slip
  Adjusting Priorities According to Your SLOs and Error Budget
  Handling Mistakes Appropriately
  Scaling SRE to Larger Environments
  Supporting Multiple Services with a Single SRE Team
  Structuring a Multiple SRE Team Environment
  Adapting SRE Team Structures to Changing Circumstances
  Running Cohesive Distributed SRE Teams
  Ending the Relationship
  Case Study 1: Ares
  Case Study 2: Data Analysis Pipeline
  Conclusion

19 SRE: Reaching Beyond Your Walls
  Truths We Hold to Be Self-Evident
  Reliability Is the Most Important Feature
  Your Users, Not Your Monitoring, Decide Your Reliability
  If You Run a Platform, Then Reliability Is a Partnership
  Everything Important Eventually Becomes a Platform
  When Your Customers Have a Hard Time, You Have to Slow Down
  You Will Need to Practice SRE with Your Customers
  How to: SRE with Your Customers
  Step 1: SLOs and SLIs Are How You Speak
  Step 2: Audit the Monitoring and Build Shared Dashboards
  Step 3: Measure and Renegotiate
  Step 4: Design Reviews and Risk Analysis
  Step 5: Practice, Practice, Practice
  Be Thoughtful and Disciplined
  Conclusion

20 SRE Team Lifecycles
  SRE Practices Without SREs
  Starting an SRE Role
  Finding Your First SRE
  Placing Your First SRE
  Bootstrapping Your First SRE
  Distributed SREs
  Your First SRE Team
  Forming
  Storming
  Norming
  Performing
  Making More SRE Teams
  Service Complexity
  SRE Rollout
  Geographical Splits
  Suggested Practices for Running Many Teams
  Mission Control
  SRE Exchange
  Training
  Horizontal Projects
  SRE Mobility
  Travel
  Launch Coordination Engineering Teams
  Production Excellence
  SRE Funding and Hiring
  Conclusion

21 Organizational Change Management in SRE
  SRE Embraces Change
  Introduction to Change Management
  Lewin's Three-Stage Model
  McKinsey's 7-S Model
  Kotter's Eight-Step Process for Leading Change
  The Prosci ADKAR Model
  Emotion-Based Models
  The Deming Cycle
  How These Theories Apply to SRE
  Case Study 1: Scaling Waze - From Ad Hoc to Planned Change
  Background
  The Messaging Queue: Replacing a System While Maintaining Reliability
  The Next Cycle of Change: Improving the Deployment Process
  Lessons Learned
  Case Study 2: Common Tooling Adoption in SRE
  Background
  Problem Statement
  What We Decided to Do
  Design
  Implementation: Monitoring
  Lessons Learned
  Conclusion

Conclusion
A. Example SLO Document
B. Example Error Budget Policy
C. Results of Postmortem Analysis
Index

Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University.

Niall Murphy has been working in Internet infrastructure for twenty years. He is a company founder, a published author, a photographer, and holds degrees in Computer Science & Mathematics and Poetry Studies.

Dave Rensin is a Google SRE Director, previous O'Reilly author, and serial entrepreneur. He holds a degree in Statistics.

Kent Kawahara is a Program Manager for Google's Site Reliability Engineering team focused on Google Cloud Platform customers and is based in Sunnyvale, CA. In previous Google roles, he managed technical and design teams to develop advertising support tools and worked with large advertisers and agencies on strategic advertising initiatives. Prior to Google, he worked in Product Management, Software QA, and Professional Services at two successful telecommunications startups. He holds a BS in Electrical Engineering and Computer Science from the University of California at Berkeley.

Stephen Thorne is a Senior Site Reliability Engineer at Google. He currently works in Customer Reliability Engineering, helping to integrate Google's Cloud customers' operations with Google SRE. Stephen learned how to be an SRE on the team that runs Google's advertiser and publisher user interfaces, and later worked on App Engine. Before his time at Google, he fought against spam and viruses in his home country of Australia, where he also earned his B.S. in Computer Science.