|
|
xvii | |
|
|
xxi | |
Equations |
|
xxiii | |
Introduction |
|
xxv | |
|
|
1 | (62) |
|
|
3 | (13) |
|
1.1 Essential Cloud Characteristics |
|
|
4 | (2) |
|
1.1.1 On-Demand Self-Service |
|
|
4 | (1) |
|
1.1.2 Broad Network Access |
|
|
4 | (1) |
|
|
5 | (1) |
|
|
5 | (1) |
|
|
6 | (1) |
|
1.2 Common Cloud Characteristics |
|
|
6 | (1) |
|
1.3 But What, Exactly, Is Cloud Computing? |
|
|
7 | (2) |
|
1.3.1 What Is a Data Center? |
|
|
8 | (1) |
|
1.3.2 How Does Cloud Computing Differ from Traditional Data Centers? |
|
|
9 | (1) |
|
|
9 | (2) |
|
1.5 Cloud Deployment Models |
|
|
11 | (1) |
|
1.6 Roles in Cloud Computing |
|
|
12 | (2) |
|
1.7 Benefits of Cloud Computing |
|
|
14 | (1) |
|
1.8 Risks of Cloud Computing |
|
|
15 | (1) |
|
|
16 | (13) |
|
|
16 | (1) |
|
2.2 What Is Virtualization? |
|
|
17 | (2) |
|
2.2.1 Types of Hypervisors |
|
|
18 | (1) |
|
2.2.2 Virtualization and Emulation |
|
|
19 | (1) |
|
2.3 Server Virtualization |
|
|
19 | (4) |
|
2.3.1 Full Virtualization |
|
|
20 | (1) |
|
|
21 | (1) |
|
|
22 | (1) |
|
|
22 | (1) |
|
|
23 | (5) |
|
|
26 | (1) |
|
|
26 | (2) |
|
2.4.3 High Availability Mechanisms |
|
|
28 | (1) |
|
2.5 Reliability and Availability Risks of Virtualization |
|
|
28 | (1) |
|
3 Service Reliability and Service Availability |
|
|
29 | (34) |
|
|
30 | (1) |
|
3.2 Eight-Ingredient Framework |
|
|
31 | (3) |
|
|
34 | (9) |
|
3.3.1 Service Availability Metric |
|
|
35 | (1) |
|
|
36 | (1) |
|
3.3.3 Service and Network Element Impact Outages |
|
|
37 | (1) |
|
|
38 | (2) |
|
3.3.5 Availability Ratings |
|
|
40 | (1) |
|
3.3.6 Outage Attributability |
|
|
41 | (1) |
|
3.3.7 Planned or Scheduled Downtime |
|
|
42 | (1) |
|
|
43 | (3) |
|
3.4.1 Service Reliability Metrics |
|
|
44 | (1) |
|
3.4.2 Defective Transactions |
|
|
45 | (1) |
|
|
46 | (4) |
|
3.6 Redundancy and High Availability |
|
|
50 | (6) |
|
|
51 | (2) |
|
|
53 | (3) |
|
3.7 High Availability and Disaster Recovery |
|
|
56 | (2) |
|
|
58 | (4) |
|
3.8.1 Control and Data Planes |
|
|
58 | (1) |
|
3.8.2 Service Quality Metrics |
|
|
59 | (1) |
|
|
60 | (1) |
|
3.8.4 Latency Expectations |
|
|
60 | (1) |
|
3.8.5 Streaming Quality Impairments |
|
|
61 | (1) |
|
3.9 Reliability and Availability Risks of Cloud Computing |
|
|
62 | (1) |
|
|
63 | (120) |
|
4 Analyzing Cloud Reliability and Availability |
|
|
65 | (25) |
|
4.1 Expectations for Service Reliability and Availability |
|
|
65 | (1) |
|
4.2 Risks of Essential Cloud Characteristics |
|
|
66 | (4) |
|
4.2.1 On-Demand Self-Service |
|
|
66 | (1) |
|
4.2.2 Broad Network Access |
|
|
67 | (1) |
|
|
67 | (1) |
|
|
67 | (2) |
|
|
69 | (1) |
|
4.3 Impacts of Common Cloud Characteristics |
|
|
70 | (2) |
|
|
70 | (1) |
|
4.3.2 Geographic Distribution |
|
|
70 | (1) |
|
4.3.3 Resilient Computing |
|
|
71 | (1) |
|
|
71 | (1) |
|
|
71 | (1) |
|
|
71 | (1) |
|
4.4 Risks of Service Models |
|
|
72 | (2) |
|
4.4.1 Traditional Accountability |
|
|
72 | (1) |
|
4.4.2 Cloud-Based Application Accountability |
|
|
73 | (1) |
|
4.5 IT Service Management and Availability Risks |
|
|
74 | (6) |
|
|
74 | (1) |
|
|
75 | (1) |
|
|
76 | (1) |
|
|
77 | (1) |
|
|
77 | (1) |
|
4.5.6 Continual Service Improvement |
|
|
78 | (1) |
|
4.5.7 IT Service Management Summary |
|
|
79 | (1) |
|
4.5.8 Risks of Service Orchestration |
|
|
79 | (1) |
|
4.5.9 IT Service Management Risks |
|
|
80 | (1) |
|
4.6 Outage Risks by Process Area |
|
|
80 | (3) |
|
4.6.1 Validating Outage Attributability |
|
|
82 | (1) |
|
4.7 Failure Detection Considerations |
|
|
83 | (4) |
|
|
83 | (2) |
|
|
85 | (1) |
|
4.7.3 Data Inconsistency and Errors |
|
|
85 | (1) |
|
|
86 | (1) |
|
4.7.5 System Power Failures |
|
|
86 | (1) |
|
|
86 | (1) |
|
4.7.7 Application Protocol Errors |
|
|
86 | (1) |
|
4.8 Risks of Deployment Models |
|
|
87 | (1) |
|
4.9 Expectations of IaaS Data Centers |
|
|
87 | (3) |
|
5 Reliability Analysis of Virtualization |
|
|
90 | (26) |
|
5.1 Reliability Analysis Techniques |
|
|
90 | (5) |
|
5.1.1 Reliability Block Diagrams |
|
|
90 | (2) |
|
5.1.2 Single Point of Failure Analysis |
|
|
92 | (1) |
|
5.1.3 Failure Mode Effects Analysis |
|
|
92 | (3) |
|
5.2 Reliability Analysis of Virtualization Techniques |
|
|
95 | (5) |
|
5.2.1 Analysis of Full Virtualization |
|
|
95 | (1) |
|
5.2.2 Analysis of OS Virtualization |
|
|
95 | (1) |
|
5.2.3 Analysis of Paravirtualization |
|
|
96 | (1) |
|
5.2.4 Analysis of VM Coresidency |
|
|
97 | (2) |
|
|
99 | (1) |
|
5.3 Software Failure Rate Analysis |
|
|
100 | (1) |
|
5.3.1 Virtualization and Software Failure Rate |
|
|
100 | (1) |
|
5.3.2 Hypervisor Failure Rate |
|
|
101 | (1) |
|
5.3.3 Miscellaneous Software Risks of Virtualization and Cloud |
|
|
101 | (1) |
|
|
101 | (7) |
|
5.4.1 Traditional Recovery Options |
|
|
101 | (1) |
|
5.4.2 Virtualized Recovery Options |
|
|
102 | (5) |
|
|
107 | (1) |
|
5.5 Application Architecture Strategies |
|
|
108 | (2) |
|
5.5.1 On-Demand Single-User Model |
|
|
108 | (1) |
|
5.5.2 Single-User Daemon Model |
|
|
109 | (1) |
|
5.5.3 Multiuser Server Model |
|
|
109 | (1) |
|
5.5.4 Consolidated Server Model |
|
|
109 | (1) |
|
5.6 Availability Modeling of Virtualized Recovery Options |
|
|
110 | (6) |
|
5.6.1 Availability of Virtualized Simplex Architecture |
|
|
111 | (1) |
|
5.6.2 Availability of Virtualized Redundant Architecture |
|
|
111 | (1) |
|
5.6.3 Critical Failure Rate |
|
|
112 | (1) |
|
|
113 | (1) |
|
5.6.5 Failure Detection Latency |
|
|
113 | (1) |
|
|
113 | (1) |
|
5.6.7 Switchover Success Probability |
|
|
114 | (1) |
|
5.6.8 Modeling and "Fast Failure" |
|
|
114 | (1) |
|
5.6.9 Comparison of Native and Virtualized Deployments |
|
|
115 | (1) |
|
6 Hardware Reliability, Virtualization, and Service Availability |
|
|
116 | (16) |
|
6.1 Hardware Downtime Expectations |
|
|
116 | (1) |
|
|
117 | (2) |
|
6.3 Hardware Failure Rate |
|
|
119 | (2) |
|
6.4 Hardware Failure Detection |
|
|
121 | (1) |
|
6.5 Hardware Failure Containment |
|
|
122 | (1) |
|
6.6 Hardware Failure Mitigation |
|
|
122 | (2) |
|
6.7 Mitigating Hardware Failures via Virtualization |
|
|
124 | (3) |
|
|
124 | (1) |
|
|
125 | (1) |
|
|
126 | (1) |
|
|
127 | (2) |
|
6.8.1 Virtual Network Interface Cards |
|
|
127 | (1) |
|
6.8.2 Virtual Local Area Networks |
|
|
128 | (1) |
|
6.8.3 Virtual IP Addresses |
|
|
129 | (1) |
|
6.8.4 Virtual Private Networks |
|
|
129 | (1) |
|
6.9 MTTR of Virtualized Hardware |
|
|
129 | (2) |
|
|
131 | (1) |
|
7 Capacity and Elasticity |
|
|
132 | (32) |
|
|
132 | (3) |
|
7.1.1 Extraordinary Event Considerations |
|
|
134 | (1) |
|
|
134 | (1) |
|
7.2 Overload, Service Reliability, and Service Availability |
|
|
135 | (1) |
|
7.3 Traditional Capacity Planning |
|
|
136 | (1) |
|
|
137 | (7) |
|
7.4.1 Nominal Cloud Capacity Model |
|
|
138 | (3) |
|
7.4.2 Elasticity Expectations |
|
|
141 | (3) |
|
7.5 Managing Online Capacity |
|
|
144 | (3) |
|
7.5.1 Capacity Planning Assumptions of Cloud Computing |
|
|
145 | (2) |
|
7.6 Capacity-Related Service Risks |
|
|
147 | (6) |
|
7.6.1 Elasticity and Elasticity Failure |
|
|
147 | (2) |
|
7.6.2 Partial Capacity Failure |
|
|
149 | (1) |
|
7.6.3 Service Latency Risk |
|
|
150 | (2) |
|
7.6.4 Capacity Impairments and Service Reliability |
|
|
152 | (1) |
|
7.7 Capacity Management Risks |
|
|
153 | (4) |
|
7.7.1 Brittle Application Architecture |
|
|
154 | (1) |
|
7.7.2 Faulty or Inadequate Monitoring Data |
|
|
155 | (1) |
|
7.7.3 Faulty Capacity Decisions |
|
|
155 | (1) |
|
7.7.4 Unreliable Capacity Growth |
|
|
155 | (1) |
|
7.7.5 Unreliable Capacity Degrowth |
|
|
156 | (1) |
|
7.7.6 Inadequate Slew Rate |
|
|
156 | (1) |
|
7.7.7 Tardy Capacity Management Decisions |
|
|
156 | (1) |
|
7.7.8 Resource Stock Out Not Covered |
|
|
157 | (1) |
|
|
157 | (1) |
|
7.7.10 Policy Constraints |
|
|
157 | (1) |
|
7.8 Security and Service Availability |
|
|
157 | (5) |
|
7.8.1 Security Risk to Service Availability |
|
|
157 | (2) |
|
7.8.2 Denial of Service Attacks |
|
|
159 | (1) |
|
7.8.3 Defending against DoS Attacks |
|
|
160 | (1) |
|
7.8.4 Quantifying Service Availability Impact of Security Attacks |
|
|
161 | (1) |
|
|
162 | (1) |
|
7.9 Architecting for Elastic Growth and Degrowth |
|
|
162 | (2) |
|
8 Service Orchestration Analysis |
|
|
164 | (10) |
|
8.1 Service Orchestration Definition |
|
|
164 | (2) |
|
8.2 Policy-Based Management |
|
|
166 | (2) |
|
|
167 | (1) |
|
8.2.2 Service Reliability and Availability Measurements |
|
|
168 | (1) |
|
|
168 | (1) |
|
8.3.1 Role of Rapid Elasticity in Cloud Management |
|
|
169 | (1) |
|
8.3.2 Role of Cloud Bursting in Cloud Management |
|
|
169 | (1) |
|
8.4 Service Orchestration's Role in Risk Mitigation |
|
|
169 | (3) |
|
|
170 | (1) |
|
|
170 | (1) |
|
|
171 | (1) |
|
|
171 | (1) |
|
|
172 | (2) |
|
9 Geographic Distribution, Georedundancy, and Disaster Recovery |
|
|
174 | (9) |
|
9.1 Geographic Distribution versus Georedundancy |
|
|
175 | (1) |
|
9.2 Traditional Disaster Recovery |
|
|
175 | (2) |
|
9.3 Virtualization and Disaster Recovery |
|
|
177 | (1) |
|
9.4 Cloud Computing and Disaster Recovery |
|
|
178 | (2) |
|
9.5 Georedundancy Recovery Models |
|
|
180 | (1) |
|
9.6 Cloud and Traditional Collateral Benefits of Georedundancy |
|
|
180 | (2) |
|
9.6.1 Reduced Planned Downtime |
|
|
180 | (1) |
|
9.6.2 Mitigate Catastrophic Network Element Failures |
|
|
181 | (1) |
|
9.6.3 Mitigate Extended Uncovered and Duplex Failure Outages |
|
|
181 | (1) |
|
|
182 | (1) |
|
|
183 | (128) |
|
10 Applications, Solutions, and Accountability |
|
|
185 | (24) |
|
10.1 Application Configuration Scenarios |
|
|
185 | (2) |
|
10.2 Application Deployment Scenario |
|
|
187 | (1) |
|
10.3 System Downtime Budgets |
|
|
188 | (9) |
|
10.3.1 Traditional System Downtime Budget |
|
|
189 | (1) |
|
10.3.2 Virtualized Application Downtime Budget |
|
|
189 | (2) |
|
10.3.3 IaaS Hardware Downtime Expectations |
|
|
191 | (2) |
|
10.3.4 Cloud-Based Application Downtime Budget |
|
|
193 | (2) |
|
|
195 | (2) |
|
10.4 End-to-End Solutions Considerations |
|
|
197 | (4) |
|
10.4.1 What Is an End-to-End Solution? |
|
|
197 | (1) |
|
10.4.2 Consumer-Specific Architectures |
|
|
198 | (1) |
|
10.4.3 Data Center Redundancy |
|
|
199 | (2) |
|
10.5 Attributability for Service Impairments |
|
|
201 | (3) |
|
10.6 Solution Service Measurement |
|
|
204 | (3) |
|
10.6.1 Service Availability Measurement Points |
|
|
204 | (3) |
|
10.7 Managing Reliability and Service of Cloud Computing |
|
|
207 | (2) |
|
11 Recommendations for Architecting a Reliable System |
|
|
209 | (35) |
|
11.1 Architecting for Virtualization and Cloud |
|
|
209 | (7) |
|
11.1.1 Mapping Software into VMs |
|
|
210 | (1) |
|
11.1.2 Service Load Distribution |
|
|
210 | (1) |
|
|
211 | (1) |
|
11.1.4 Software Redundancy and High Availability Mechanisms |
|
|
212 | (2) |
|
|
214 | (1) |
|
|
214 | (1) |
|
|
215 | (1) |
|
|
215 | (1) |
|
11.1.9 Isochronal Applications |
|
|
216 | (1) |
|
|
216 | (1) |
|
11.3 IT Service Management Considerations |
|
|
217 | (7) |
|
11.3.1 Software Upgrade and Patch |
|
|
217 | (1) |
|
11.3.2 Service Transition Activity Effect Analysis |
|
|
218 | (1) |
|
11.3.3 Mitigating Service Transition Activity Effects via VM Migration |
|
|
219 | (2) |
|
11.3.4 Testing Service Transition Activities |
|
|
221 | (1) |
|
11.3.5 Minimizing Procedural Errors |
|
|
221 | (2) |
|
11.3.6 Service Orchestration Considerations |
|
|
223 | (1) |
|
11.4 Many Distributed Clouds versus Fewer Huge Clouds |
|
|
224 | (1) |
|
11.5 Minimizing Hardware-Attributed Downtime |
|
|
225 | (6) |
|
11.5.1 Hardware Downtime in Traditional High Availability Configurations |
|
|
226 | (5) |
|
11.6 Architectural Optimizations |
|
|
231 | (13) |
|
11.6.1 Reliability and Availability Criteria |
|
|
232 | (1) |
|
11.6.2 Optimizing Accessibility |
|
|
233 | (2) |
|
11.6.3 Optimizing High Availability, Retainability, Reliability, and Quality |
|
|
235 | (1) |
|
11.6.4 Optimizing Disaster Recovery |
|
|
235 | (1) |
|
11.6.5 Operational Considerations |
|
|
236 | (1) |
|
|
236 | (5) |
|
11.6.7 Theoretically Optimal Application Architecture |
|
|
241 | (3) |
|
12 Design for Reliability of Virtualized Applications |
|
|
244 | (27) |
|
12.1 Design for Reliability |
|
|
244 | (2) |
|
12.2 Tailoring DfR for Virtualized Applications |
|
|
246 | (2) |
|
12.2.1 Hardware Independence Usage Scenario |
|
|
246 | (1) |
|
12.2.2 Server Consolidation Usage Scenario |
|
|
247 | (1) |
|
12.2.3 Multitenant Usage Scenario |
|
|
248 | (1) |
|
12.2.4 Virtual Appliance Usage Scenario |
|
|
248 | (1) |
|
12.2.5 Cloud Deployment Usage Scenario |
|
|
248 | (1) |
|
12.3 Reliability Requirements |
|
|
248 | (8) |
|
12.3.1 General Availability Requirements |
|
|
249 | (1) |
|
12.3.2 Service Reliability and Latency Requirements |
|
|
250 | (1) |
|
12.3.3 Overload Requirements |
|
|
251 | (2) |
|
12.3.4 Online Capacity Growth and Degrowth |
|
|
253 | (1) |
|
12.3.5 (Virtualization) Live Migration Requirements |
|
|
253 | (1) |
|
12.3.6 System Transition Activity Requirements |
|
|
254 | (1) |
|
12.3.7 Georedundancy and Service Continuity Requirements |
|
|
255 | (1) |
|
12.4 Qualitative Reliability Analysis |
|
|
256 | (3) |
|
12.4.1 SPOF Analysis for Virtualized Applications |
|
|
256 | (2) |
|
12.4.2 Failure Mode Effects Analysis for Virtualized Applications |
|
|
258 | (1) |
|
12.4.3 Capacity Growth and Degrowth Analysis |
|
|
258 | (1) |
|
12.5 Quantitative Reliability Budgeting and Modeling |
|
|
259 | (1) |
|
12.5.1 Availability (Downtime) Modeling |
|
|
259 | (1) |
|
12.5.2 Converging Downtime Budgets and Targets |
|
|
260 | (1) |
|
12.5.3 Managing Maintenance Budget Allocation |
|
|
260 | (1) |
|
|
260 | (7) |
|
12.6.1 Baseline Robustness Testing |
|
|
261 | (4) |
|
12.6.2 Advanced Topic: Can Virtualization Enable Better Robustness Testing? |
|
|
265 | (2) |
|
|
267 | (1) |
|
12.8 Field Performance Analysis |
|
|
268 | (1) |
|
|
269 | (1) |
|
12.10 Hardware Reliability |
|
|
270 | (1) |
|
13 Design for Reliability of Cloud Solutions |
|
|
271 | (25) |
|
13.1 Solution Design for Reliability |
|
|
271 | (2) |
|
13.2 Solution Scope and Expectations |
|
|
273 | (2) |
|
13.3 Reliability Requirements |
|
|
275 | (4) |
|
13.3.1 Solution Availability Requirements |
|
|
275 | (1) |
|
13.3.2 Solution Reliability Requirements |
|
|
276 | (1) |
|
13.3.3 Disaster Recovery Requirements |
|
|
277 | (1) |
|
13.3.4 Elasticity Requirements |
|
|
277 | (1) |
|
13.3.5 Specifying Configuration Parameters |
|
|
278 | (1) |
|
13.4 Solution Modeling and Analysis |
|
|
279 | (6) |
|
13.4.1 Reliability Block Diagram of Cloud Data Center Deployment |
|
|
279 | (1) |
|
13.4.2 Solution Failure Mode Effects Analysis |
|
|
280 | (1) |
|
13.4.3 Solution Service Transition Activity Effects Analysis |
|
|
280 | (1) |
|
13.4.4 Cloud Data Center Service Availability (MP 2) Analysis |
|
|
280 | (1) |
|
13.4.5 Aggregate Service Availability (MP 3) Modeling |
|
|
281 | (4) |
|
13.4.6 Recovery Point Objective Analysis |
|
|
285 | (1) |
|
13.5 Element Reliability Diligence |
|
|
285 | (1) |
|
13.6 Solution Testing and Validation |
|
|
285 | (3) |
|
13.6.1 Robustness Testing |
|
|
286 | (1) |
|
13.6.2 Service Reliability Testing |
|
|
286 | (1) |
|
13.6.3 Georedundancy Testing |
|
|
286 | (1) |
|
13.6.4 Elasticity and Orchestration Testing |
|
|
287 | (1) |
|
|
287 | (1) |
|
13.6.6 In-Service Testing |
|
|
288 | (1) |
|
13.7 Track and Analyze Field Performance |
|
|
288 | (4) |
|
13.7.1 Cloud Service Measurements |
|
|
289 | (2) |
|
13.7.2 Solution Reliability Roadmapping |
|
|
291 | (1) |
|
13.8 Other Solution Reliability Diligence Topics |
|
|
292 | (4) |
|
13.8.1 Service-Level Agreements |
|
|
292 | (1) |
|
13.8.2 Cloud Service Provider Selection |
|
|
293 | (1) |
|
13.8.3 Written Reliability Plan |
|
|
293 | (3) |
|
|
296 | (15) |
|
14.1 Service Reliability and Service Availability |
|
|
297 | (2) |
|
14.2 Failure Accountability and Cloud Computing |
|
|
299 | (2) |
|
14.3 Factoring Service Downtime |
|
|
301 | (2) |
|
14.4 Service Availability Measurement Points |
|
|
303 | (3) |
|
14.5 Cloud Capacity and Elasticity Considerations |
|
|
306 | (1) |
|
14.6 Maximizing Service Availability |
|
|
306 | (3) |
|
14.6.1 Reducing Product Attributable Downtime |
|
|
307 | (1) |
|
14.6.2 Reducing Data Center Attributable Downtime |
|
|
307 | (1) |
|
14.6.3 Reducing IT Service Management Downtime |
|
|
307 | (1) |
|
14.6.4 Reducing Disaster Recovery Downtime |
|
|
308 | (1) |
|
14.6.5 Optimal Cloud Service Availability |
|
|
308 | (1) |
|
14.7 Reliability Diligence |
|
|
309 | (1) |
|
|
310 | (1) |
Abbreviations |
|
311 | (3) |
References |
|
314 | (4) |
About the Authors |
|
318 | (1) |
Index |
|
319 | |