E-book: Beyond Redundancy: How Geographic Redundancy Can Improve Service Availability and Reliability of Computer-Based Systems [Wiley Online]

Eric Bauer (Alcatel-Lucent Reliability), Randee Adams, Daniel Eustace
  • Format: 330 pages, Photos: 55 B&W, 0 Color; Drawings: 50 B&W, 0 Color
  • Publication date: 02-Dec-2011
  • Publisher: Wiley-IEEE Press
  • ISBN-10: 1118104919
  • ISBN-13: 9781118104910
  • Wiley Online
  • Price: 108,85 €*
  • * Price that provides access for an unlimited number of simultaneous users for an unlimited time
While geographic redundancy can obviously be a huge benefit for disaster recovery, it is far less obvious what benefit is feasible and likely for more typical non-catastrophic hardware, software, and human failures. Georedundancy and Service Availability provides both a theoretical and practical treatment of the feasible and likely benefits of geographic redundancy for both service availability and service reliability. The text provides network/system planners, IS/IT operations folks, system architects, system engineers, developers, testers, and other industry practitioners with a general discussion about the capital expense/operating expense tradeoff that frames system redundancy and georedundancy.
Figures
Tables
Equations
Preface and Acknowledgments
Audience
Organization
Acknowledgments
PART 1 BASICS
1 Service, Risk, and Business Continuity
1.1 Service Criticality and Availability Expectations
1.2 The Eight-Ingredient Model
1.3 Catastrophic Failures and Geographic Redundancy
1.4 Geographically Separated Recovery Site
1.5 Managing Risk
1.5.1 Risk Identification
1.5.2 Risk Treatments
1.6 Business Continuity Planning
1.7 Disaster Recovery Planning
1.8 Human Factors
1.9 Recovery Objectives
1.10 Disaster Recovery Strategies
2 Service Availability and Service Reliability
2.1 Availability and Reliability
2.1.1 Service Availability
2.1.2 Service Reliability
2.1.3 Reliability, Availability, and Failures
2.2 Measuring Service Availability
2.2.1 Total and Partial Outages
2.2.2 Minimum Chargeable Disruption Duration
2.2.3 Outage Attributability
2.2.4 Systems and Network Elements
2.2.5 Service Impact and Element Impact Outages
2.2.6 Treatment of Planned Events
2.3 Measuring Service Reliability
PART 2 MODELING AND ANALYSIS OF REDUNDANCY
3 Understanding Redundancy
3.1 Types of Redundancy
3.1.1 Simplex Configuration
3.1.2 Redundancy
3.1.3 Single Point of Failure
3.2 Modeling Availability of Internal Redundancy
3.2.1 Modeling Active-Active Redundancy
3.2.2 Modeling Active Standby Redundancy
3.2.3 Service Availability Comparison
3.3 Evaluating High-Availability Mechanisms
3.3.1 Recovery Time Objective (or Nominal Outage Duration)
3.3.2 Recovery Point Objective
3.3.3 Nominal Success Probability
3.3.4 Capital Expense
3.3.5 Operating Expense
3.3.6 Discussion
4 Overview of External Redundancy
4.1 Generic External Redundancy Model
4.1.1 Failure Detection
4.1.2 Triggering Recovery Action
4.1.3 Traffic Redirection
4.1.4 Service Context Preservation
4.1.5 Graceful Service Migration
4.2 Technical Distinctions between Georedundancy and Co-Located Redundancy
4.3 Manual Graceful Switchover and Switchback
5 External Redundancy Strategy Options
5.1 Redundancy Strategies
5.2 Data Recovery Strategies
5.3 External Recovery Strategies
5.4 Manually Controlled Recovery
5.4.1 Manually Controlled Example: Provisioning System for a Database
5.4.2 Manually Controlled Example: Performance Management Systems
5.5 System-Driven Recovery
5.5.1 System-Driven Recovery Examples
5.6 Client-Initiated Recovery
5.6.1 Client-Initiated Recovery Overview
5.6.2 Failure Detection by Client
5.6.3 Client-Initiated Recovery Example: Automatic Teller Machine (ATM)
5.6.4 Client-Initiated Recovery Example: A Web Browser Querying a Web Server
5.6.5 Client-Initiated Recovery Example: A Pool of DNS Servers
6 Modeling Service Availability with External System Redundancy
6.1 The Simplistic Answer
6.2 Framing Service Availability of Standalone Systems
6.3 Generic Markov Availability Model of Georedundant Recovery
6.3.1 Simplifying Assumptions
6.3.2 Standalone High-Availability Model
6.3.3 Manually Controlled Georedundant Recovery
6.3.4 System-Driven Georedundant Recovery
6.3.5 Client-Initiated Georedundancy Recovery
6.3.6 Complex Georedundant Recovery
6.3.7 Comparing the Generic Georedundancy Model to the Simplistic Model
6.4 Solving the Generic Georedundancy Model
6.4.1 Manually Controlled Georedundant Recovery Model
6.4.2 System-Driven Georedundant Recovery Model
6.4.3 Client-Initiated Georedundant Recovery Model
6.4.4 Conclusion
6.5 Practical Modeling of Georedundancy
6.5.1 Practical Modeling of Manually Controlled External System Recovery
6.5.2 Practical Modeling of System-Driven Georedundant Recovery
6.5.3 Practical Modeling of Client-Initiated Recovery
6.6 Estimating Availability Benefit for Planned Activities
6.7 Estimating Availability Benefit for Disasters
7 Understanding Recovery Timing Parameters
7.1 Detecting Implicit Failures
7.1.1 Understanding and Optimizing Ttimeout
7.1.2 Understanding and Optimizing Tkeepalive
7.1.3 Understanding and Optimizing Tclient
7.1.4 Timer Impact on Service Reliability
7.2 Understanding and Optimizing RTO
7.2.1 RTO for Manually Controlled Recovery
7.2.2 RTO for System-Driven Recovery
7.2.3 RTO for Client-Initiated Recovery
7.2.4 Comparing External Redundancy Strategies
8 Case Study of Client-Initiated Recovery
8.1 Overview of DNS
8.2 Mapping DNS onto Practical Client-Initiated Recovery Model
8.2.1 Modeling Normal Operation
8.2.2 Modeling Server Failure
8.2.3 Modeling Timeout Failure
8.2.4 Modeling Abnormal Server Failure
8.2.5 Modeling Multiple Server Failure
8.3 Estimating Input Parameters
8.3.1 Server Failure Rate
8.3.2 Fexplicit Parameter
8.3.3 μclient Parameter
8.3.4 μtimeout Parameter
8.3.5 μclientsfd Parameter
8.3.6 μclient Parameter
8.3.7 Acluster-1 Parameter
8.3.8 μclient Parameter
8.3.9 μgrecover and μmigration Parameters
8.3.10 μduplex Parameter
8.3.11 Parameter Summary
8.4 Predicted Results
8.4.1 Sensitivity Analysis
8.5 Discussion of Predicted Results
9 Solution and Cluster Recovery
9.1 Understanding Solutions
9.1.1 Solution Users
9.1.2 Solution Architecture
9.2 Estimating Solution Availability
9.3 Cluster versus Element Recovery
9.4 Element Failure and Cluster Recovery Case Study
9.5 Comparing Element and Cluster Recovery
9.5.1 Failure Detection
9.5.2 Triggering Recovery Action
9.5.3 Traffic Redirection
9.5.4 Service Context Preservation
9.5.5 Graceful Migration
9.6 Modeling Cluster Recovery
9.6.1 Cluster Recovery Modeling Parameters
9.6.2 Estimating λsuperelement
9.6.3 Example of Super Element Recovery Modeling
PART 3 RECOMMENDATIONS
10 Georedundancy Strategy
10.1 Why Support Multiple Sites?
10.2 Recovery Realms
10.2.1 Choosing Site Locations
10.3 Recovery Strategies
10.4 Limp-Along Architectures
10.5 Site Redundancy Options
10.5.1 Standby Sites
10.5.2 N + K Load Sharing
10.5.3 Discussion
10.6 Virtualization, Cloud Computing, and Standby Sites
10.7 Recommended Design Methodology
11 Maximizing Service Availability via Georedundancy
11.1 Theoretically Optimal External Redundancy
11.2 Practically Optimal Recovery Strategies
11.2.1 Internal versus External Redundancy
11.2.2 Client-Initiated Recovery as Optimal External Recovery Strategy
11.2.3 Multi-Site Strategy
11.2.4 Active-Active Server Operation
11.2.5 Optimizing Timeout and Retry Parameters
11.2.6 Rapid Relogon
11.2.7 Rapid Context Restoration
11.2.8 Automatic Switchback
11.2.9 Overload Control
11.2.10 Network Element versus Cluster-Level Recovery
11.3 Other Considerations
11.3.1 Architecting to Facilitate Planned Maintenance Activities
11.3.2 Procedural Considerations
12 Georedundancy Requirements
12.1 Internal Redundancy Requirements
12.1.1 Standalone Network Element Redundancy Requirements
12.1.2 Basic Solution Redundancy Requirements
12.2 External Redundancy Requirements
12.3 Manually Controlled Redundancy Requirements
12.3.1 Manual Failover Requirements
12.3.2 Graceful Switchover Requirements
12.3.3 Switchback Requirements
12.4 Automatic External Recovery Requirements
12.4.1 System-Driven Recovery
12.4.2 Client-Initiated Recovery
12.5 Operational Requirements
13 Georedundancy Testing
13.1 Georedundancy Testing Strategy
13.1.1 Network Element Level Testing
13.1.2 End-to-End Testing
13.1.3 Deployment Testing
13.1.4 Operational Testing
13.2 Test Cases for External Redundancy
13.3 Verifying Georedundancy Requirements
13.3.1 Test Cases for Standalone Elements
13.3.2 Test Cases for Manually Controlled Recovery
13.3.3 Test Cases for System-Driven Recovery
13.3.4 Test Cases for Client-Initiated Recovery
13.3.5 Test Cases at the Solution Level
13.3.6 Test Cases for Operational Testing
13.4 Summary
14 Solution Georedundancy Case Study
14.1 The Hypothetical Solution
14.1.1 Key Quality Indicators
14.2 Standalone Solution Analysis
14.2.1 Construct Reliability Block Diagrams
14.2.2 Network Element Configuration in Standalone Solution
14.2.3 Service Availability Offered by Standalone Solution
14.2.4 Discussion of Standalone Solution Analysis
14.3 Georedundant Solution Analysis
14.3.1 Identify Factors Constraining Recovery Realm Design
14.3.2 Define Recovery Realms
14.3.3 Define Recovery Strategies
14.3.4 Set Recovery Objectives
14.3.5 Architecting Site Redundancy
14.4 Availability of the Georedundant Solution
14.5 Requirements of Hypothetical Solution
14.5.1 Internal Redundancy Requirements
14.5.2 External Redundancy Requirements
14.5.3 Manual Failover Requirements
14.5.4 Automatic External Recovery Requirements
14.5.5 Operational Requirements
14.6 Testing of Hypothetical Solution
14.6.1 Testing Strategy
14.6.2 Standalone Network Element Testing
14.6.3 Automatic External Recovery Requirements Testing
14.6.4 End-to-End Testing
14.6.5 Deployment Testing
14.6.6 Operational Testing
Summary
Appendix: Markov Modeling of Service Availability
Acronyms
References
About the Authors
Index
Eric Bauer is Reliability Engineering Manager in the IMS Solutions Organization of Alcatel-Lucent, where he focuses on reliability of Alcatel-Lucent's IMS solution and the network elements that comprise the IMS solution. He has written Design for Reliability: Information and Computer-Based Systems and Practical System Reliability.

Randee Adams is a Consulting Member of Technical Staff in the Applications Group of Alcatel-Lucent. Currently, she is focusing on reliability for Alcatel-Lucent's software applications.

Daniel Eustace is a Distinguished Member of Technical Staff in the IMS Solutions Organization of Alcatel-Lucent. Currently, he is a solution architect focusing on reliability, key quality indicators, geographical redundancy, and call processing.