E-book: Reliability and Availability of Cloud Computing [Wiley Online]

Eric Bauer, Randee Adams (Alcatel-Lucent Reliability)
  • Format: 352 pages
  • Publication date: 04-Sep-2012
  • Publisher: Wiley-IEEE Press
  • ISBN-10: 1118393996
  • ISBN-13: 9781118393994
  • Wiley Online
  • Price: 94.05 €*
  • * price that provides access for an unlimited number of simultaneous users for an unlimited time
A holistic approach to service reliability and availability of cloud computing

Reliability and Availability of Cloud Computing provides IS/IT system and solution architects, developers, and engineers with the knowledge needed to assess the impact of virtualization and cloud computing on service reliability and availability. It reveals how to select the most appropriate design for reliability diligence to assure that user expectations are met.

Organized in three parts (basics, risk analysis, and recommendations), this resource is accessible to readers of diverse backgrounds and experience levels. Numerous examples and more than 100 figures throughout the book help readers visualize problems to better understand the topic, and the authors present risks and options in bulleted lists that can be applied directly to specific applications/problems.

Special features of this book include:

  • Rigorous analysis of the reliability and availability risks that are inherent in cloud computing
  • Simple formulas that explain the quantitative aspects of reliability and availability
  • Enlightening discussions of the ways in which virtualized applications and cloud deployments differ from traditional system implementations and deployments
  • Specific recommendations for developing reliable virtualized applications and cloud-based solutions

Reliability and Availability of Cloud Computing is the guide for IS/IT staff in business, government, academia, and non-governmental organizations who are moving their applications to the cloud. It is also an important reference for professionals in technical sales, product management, and quality management, as well as software and quality engineers looking to broaden their expertise.
Figures
Tables
Equations
Introduction
I BASICS
1 Cloud Computing
1.1 Essential Cloud Characteristics
1.1.1 On-Demand Self-Service
1.1.2 Broad Network Access
1.1.3 Resource Pooling
1.1.4 Rapid Elasticity
1.1.5 Measured Service
1.2 Common Cloud Characteristics
1.3 But What, Exactly, Is Cloud Computing?
1.3.1 What Is a Data Center?
1.3.2 How Does Cloud Computing Differ from Traditional Data Centers?
1.4 Service Models
1.5 Cloud Deployment Models
1.6 Roles in Cloud Computing
1.7 Benefits of Cloud Computing
1.8 Risks of Cloud Computing
2 Virtualization
2.1 Background
2.2 What Is Virtualization?
2.2.1 Types of Hypervisors
2.2.2 Virtualization and Emulation
2.3 Server Virtualization
2.3.1 Full Virtualization
2.3.2 Paravirtualization
2.3.3 OS Virtualization
2.3.4 Discussion
2.4 VM Lifecycle
2.4.1 VM Snapshot
2.4.2 Cloning VMs
2.4.3 High Availability Mechanisms
2.5 Reliability and Availability Risks of Virtualization
3 Service Reliability and Service Availability
3.1 Errors and Failures
3.2 Eight-Ingredient Framework
3.3 Service Availability
3.3.1 Service Availability Metric
3.3.2 MTBF and MTTR
3.3.3 Service and Network Element Impact Outages
3.3.4 Partial Outages
3.3.5 Availability Ratings
3.3.6 Outage Attributability
3.3.7 Planned or Scheduled Downtime
3.4 Service Reliability
3.4.1 Service Reliability Metrics
3.4.2 Defective Transactions
3.5 Service Latency
3.6 Redundancy and High Availability
3.6.1 Redundancy
3.6.2 High Availability
3.7 High Availability and Disaster Recovery
3.8 Streaming Services
3.8.1 Control and Data Planes
3.8.2 Service Quality Metrics
3.8.3 Isochronal Data
3.8.4 Latency Expectations
3.8.5 Streaming Quality Impairments
3.9 Reliability and Availability Risks of Cloud Computing
II ANALYSIS
4 Analyzing Cloud Reliability and Availability
4.1 Expectations for Service Reliability and Availability
4.2 Risks of Essential Cloud Characteristics
4.2.1 On-Demand Self-Service
4.2.2 Broad Network Access
4.2.3 Resource Pooling
4.2.4 Rapid Elasticity
4.2.5 Measured Service
4.3 Impacts of Common Cloud Characteristics
4.3.1 Virtualization
4.3.2 Geographic Distribution
4.3.3 Resilient Computing
4.3.4 Advanced Security
4.3.5 Massive Scale
4.3.6 Homogeneity
4.4 Risks of Service Models
4.4.1 Traditional Accountability
4.4.2 Cloud-Based Application Accountability
4.5 IT Service Management and Availability Risks
4.5.1 ITIL Overview
4.5.2 Service Strategy
4.5.3 Service Design
4.5.4 Service Transition
4.5.5 Service Operation
4.5.6 Continual Service Improvement
4.5.7 IT Service Management Summary
4.5.8 Risks of Service Orchestration
4.5.9 IT Service Management Risks
4.6 Outage Risks by Process Area
4.6.1 Validating Outage Attributability
4.7 Failure Detection Considerations
4.7.1 Hardware Failures
4.7.2 Programming Errors
4.7.3 Data Inconsistency and Errors
4.7.4 Redundancy Errors
4.7.5 System Power Failures
4.7.6 Network Errors
4.7.7 Application Protocol Errors
4.8 Risks of Deployment Models
4.9 Expectations of IaaS Data Centers
5 Reliability Analysis of Virtualization
5.1 Reliability Analysis Techniques
5.1.1 Reliability Block Diagrams
5.1.2 Single Point of Failure Analysis
5.1.3 Failure Mode Effects Analysis
5.2 Reliability Analysis of Virtualization Techniques
5.2.1 Analysis of Full Virtualization
5.2.2 Analysis of OS Virtualization
5.2.3 Analysis of Paravirtualization
5.2.4 Analysis of VM Coresidency
5.2.5 Discussion
5.3 Software Failure Rate Analysis
5.3.1 Virtualization and Software Failure Rate
5.3.2 Hypervisor Failure Rate
5.3.3 Miscellaneous Software Risks of Virtualization and Cloud
5.4 Recovery Models
5.4.1 Traditional Recovery Options
5.4.2 Virtualized Recovery Options
5.4.3 Discussion
5.5 Application Architecture Strategies
5.5.1 On-Demand Single-User Model
5.5.2 Single-User Daemon Model
5.5.3 Multiuser Server Model
5.5.4 Consolidated Server Model
5.6 Availability Modeling of Virtualized Recovery Options
5.6.1 Availability of Virtualized Simplex Architecture
5.6.2 Availability of Virtualized Redundant Architecture
5.6.3 Critical Failure Rate
5.6.4 Failure Coverage
5.6.5 Failure Detection Latency
5.6.6 Switchover Latency
5.6.7 Switchover Success Probability
5.6.8 Modeling and "Fast Failure"
5.6.9 Comparison of Native and Virtualized Deployments
6 Hardware Reliability, Virtualization, and Service Availability
6.1 Hardware Downtime Expectations
6.2 Hardware Failures
6.3 Hardware Failure Rate
6.4 Hardware Failure Detection
6.5 Hardware Failure Containment
6.6 Hardware Failure Mitigation
6.7 Mitigating Hardware Failures via Virtualization
6.7.1 Virtual CPU
6.7.2 Virtual Memory
6.7.3 Virtual Storage
6.8 Virtualized Networks
6.8.1 Virtual Network Interface Cards
6.8.2 Virtual Local Area Networks
6.8.3 Virtual IP Addresses
6.8.4 Virtual Private Networks
6.9 MTTR of Virtualized Hardware
6.10 Discussion
7 Capacity and Elasticity
7.1 System Load Basics
7.1.1 Extraordinary Event Considerations
7.1.2 Slashdot Effect
7.2 Overload, Service Reliability, and Service Availability
7.3 Traditional Capacity Planning
7.4 Cloud and Capacity
7.4.1 Nominal Cloud Capacity Model
7.4.2 Elasticity Expectations
7.5 Managing Online Capacity
7.5.1 Capacity Planning Assumptions of Cloud Computing
7.6 Capacity-Related Service Risks
7.6.1 Elasticity and Elasticity Failure
7.6.2 Partial Capacity Failure
7.6.3 Service Latency Risk
7.6.4 Capacity Impairments and Service Reliability
7.7 Capacity Management Risks
7.7.1 Brittle Application Architecture
7.7.2 Faulty or Inadequate Monitoring Data
7.7.3 Faulty Capacity Decisions
7.7.4 Unreliable Capacity Growth
7.7.5 Unreliable Capacity Degrowth
7.7.6 Inadequate Slew Rate
7.7.7 Tardy Capacity Management Decisions
7.7.8 Resource Stock Out Not Covered
7.7.9 Cloud Burst Fails
7.7.10 Policy Constraints
7.8 Security and Service Availability
7.8.1 Security Risk to Service Availability
7.8.2 Denial of Service Attacks
7.8.3 Defending against DoS Attacks
7.8.4 Quantifying Service Availability Impact of Security Attacks
7.8.5 Recommendations
7.9 Architecting for Elastic Growth and Degrowth
8 Service Orchestration Analysis
8.1 Service Orchestration Definition
8.2 Policy-Based Management
8.2.1 The Role of SLRs
8.2.2 Service Reliability and Availability Measurements
8.3 Cloud Management
8.3.1 Role of Rapid Elasticity in Cloud Management
8.3.2 Role of Cloud Bursting in Cloud Management
8.4 Service Orchestration's Role in Risk Mitigation
8.4.1 Latency
8.4.2 Reliability
8.4.3 Regulatory
8.4.4 Security
8.5 Summary
9 Geographic Distribution, Georedundancy, and Disaster Recovery
9.1 Geographic Distribution versus Georedundancy
9.2 Traditional Disaster Recovery
9.3 Virtualization and Disaster Recovery
9.4 Cloud Computing and Disaster Recovery
9.5 Georedundancy Recovery Models
9.6 Cloud and Traditional Collateral Benefits of Georedundancy
9.6.1 Reduced Planned Downtime
9.6.2 Mitigate Catastrophic Network Element Failures
9.6.3 Mitigate Extended Uncovered and Duplex Failure Outages
9.7 Discussion
III RECOMMENDATIONS
10 Applications, Solutions, and Accountability
10.1 Application Configuration Scenarios
10.2 Application Deployment Scenario
10.3 System Downtime Budgets
10.3.1 Traditional System Downtime Budget
10.3.2 Virtualized Application Downtime Budget
10.3.3 IaaS Hardware Downtime Expectations
10.3.4 Cloud-Based Application Downtime Budget
10.3.5 Summary
10.4 End-to-End Solutions Considerations
10.4.1 What Is an End-to-End Solution?
10.4.2 Consumer-Specific Architectures
10.4.3 Data Center Redundancy
10.5 Attributability for Service Impairments
10.6 Solution Service Measurement
10.6.1 Service Availability Measurement Points
10.7 Managing Reliability and Service of Cloud Computing
11 Recommendations for Architecting a Reliable System
11.1 Architecting for Virtualization and Cloud
11.1.1 Mapping Software into VMs
11.1.2 Service Load Distribution
11.1.3 Data Management
11.1.4 Software Redundancy and High Availability Mechanisms
11.1.5 Rapid Elasticity
11.1.6 Overload Control
11.1.7 Coresidency
11.1.8 Multitenancy
11.1.9 Isochronal Applications
11.2 Disaster Recovery
11.3 IT Service Management Considerations
11.3.1 Software Upgrade and Patch
11.3.2 Service Transition Activity Effect Analysis
11.3.3 Mitigating Service Transition Activity Effects via VM Migration
11.3.4 Testing Service Transition Activities
11.3.5 Minimizing Procedural Errors
11.3.6 Service Orchestration Considerations
11.4 Many Distributed Clouds versus Fewer Huge Clouds
11.5 Minimizing Hardware-Attributed Downtime
11.5.1 Hardware Downtime in Traditional High Availability Configurations
11.6 Architectural Optimizations
11.6.1 Reliability and Availability Criteria
11.6.2 Optimizing Accessibility
11.6.3 Optimizing High Availability, Retainability, Reliability, and Quality
11.6.4 Optimizing Disaster Recovery
11.6.5 Operational Considerations
11.6.6 Case Study
11.6.7 Theoretically Optimal Application Architecture
12 Design for Reliability of Virtualized Applications
12.1 Design for Reliability
12.2 Tailoring DfR for Virtualized Applications
12.2.1 Hardware Independence Usage Scenario
12.2.2 Server Consolidation Usage Scenario
12.2.3 Multitenant Usage Scenario
12.2.4 Virtual Appliance Usage Scenario
12.2.5 Cloud Deployment Usage Scenario
12.3 Reliability Requirements
12.3.1 General Availability Requirements
12.3.2 Service Reliability and Latency Requirements
12.3.3 Overload Requirements
12.3.4 Online Capacity Growth and Degrowth
12.3.5 (Virtualization) Live Migration Requirements
12.3.6 System Transition Activity Requirements
12.3.7 Georedundancy and Service Continuity Requirements
12.4 Qualitative Reliability Analysis
12.4.1 SPOF Analysis for Virtualized Applications
12.4.2 Failure Mode Effects Analysis for Virtualized Applications
12.4.3 Capacity Growth and Degrowth Analysis
12.5 Quantitative Reliability Budgeting and Modeling
12.5.1 Availability (Downtime) Modeling
12.5.2 Converging Downtime Budgets and Targets
12.5.3 Managing Maintenance Budget Allocation
12.6 Robustness Testing
12.6.1 Baseline Robustness Testing
12.6.2 Advanced Topic: Can Virtualization Enable Better Robustness Testing?
12.7 Stability Testing
12.8 Field Performance Analysis
12.9 Reliability Roadmap
12.10 Hardware Reliability
13 Design for Reliability of Cloud Solutions
13.1 Solution Design for Reliability
13.2 Solution Scope and Expectations
13.3 Reliability Requirements
13.3.1 Solution Availability Requirements
13.3.2 Solution Reliability Requirements
13.3.3 Disaster Recovery Requirements
13.3.4 Elasticity Requirements
13.3.5 Specifying Configuration Parameters
13.4 Solution Modeling and Analysis
13.4.1 Reliability Block Diagram of Cloud Data Center Deployment
13.4.2 Solution Failure Mode Effects Analysis
13.4.3 Solution Service Transition Activity Effects Analysis
13.4.4 Cloud Data Center Service Availability (MP 2) Analysis
13.4.5 Aggregate Service Availability (MP 3) Modeling
13.4.6 Recovery Point Objective Analysis
13.5 Element Reliability Diligence
13.6 Solution Testing and Validation
13.6.1 Robustness Testing
13.6.2 Service Reliability Testing
13.6.3 Georedundancy Testing
13.6.4 Elasticity and Orchestration Testing
13.6.5 Stability Testing
13.6.6 In Service Testing
13.7 Track and Analyze Field Performance
13.7.1 Cloud Service Measurements
13.7.2 Solution Reliability Roadmapping
13.8 Other Solution Reliability Diligence Topics
13.8.1 Service-Level Agreements
13.8.2 Cloud Service Provider Selection
13.8.3 Written Reliability Plan
14 Summary
14.1 Service Reliability and Service Availability
14.2 Failure Accountability and Cloud Computing
14.3 Factoring Service Downtime
14.4 Service Availability Measurement Points
14.5 Cloud Capacity and Elasticity Considerations
14.6 Maximizing Service Availability
14.6.1 Reducing Product Attributable Downtime
14.6.2 Reducing Data Center Attributable Downtime
14.6.3 Reducing IT Service Management Downtime
14.6.4 Reducing Disaster Recovery Downtime
14.6.5 Optimal Cloud Service Availability
14.7 Reliability Diligence
14.8 Concluding Remarks
Abbreviations
References
About the Authors
Index
ERIC BAUER is a reliability engineering manager in the Software, Solutions and Services Group of Alcatel-Lucent. The holder of more than a dozen U.S. patents, he is the author of Design for Reliability: Information and Computer-Based Systems, Beyond Redundancy: How Geographic Redundancy Can Improve Service Availability and Reliability of Computer-Based Systems, and Practical System Reliability, also available from Wiley-IEEE Press.

RANDEE ADAMS is a consulting member of technical staff in the Software, Solutions and Services Group of Alcatel-Lucent and the coauthor of Beyond Redundancy: How Geographic Redundancy Can Improve Service Availability and Reliability of Computer-Based Systems.