|
|
xv | |
|
|
xix | |
Equations |
|
xxi | |
Preface and Acknowledgments |
|
xxiii | |
Audience |
|
xxiv | |
Organization |
|
xxiv | |
Acknowledgments |
|
xxvi | |
|
|
1 | (34) |
|
1 Service, Risk, And Business Continuity |
|
|
3 | (17) |
|
1.1 Service Criticality and Availability Expectations |
|
|
3 | (1) |
|
1.2 The Eight-Ingredient Model |
|
|
4 | (3) |
|
1.3 Catastrophic Failures and Geographic Redundancy |
|
|
7 | (4) |
|
1.4 Geographically Separated Recovery Site |
|
|
11 | (1) |
|
|
12 | (2) |
|
1.5.1 Risk Identification |
|
|
13 | (1) |
|
|
13 | (1) |
|
1.6 Business Continuity Planning |
|
|
14 | (1) |
|
1.7 Disaster Recovery Planning |
|
|
15 | (2) |
|
|
17 | (1) |
|
|
17 | (1) |
|
1.10 Disaster Recovery Strategies |
|
|
18 | (2) |
|
2 Service Availability And Service Reliability |
|
|
20 | (15) |
|
2.1 Availability and Reliability |
|
|
20 | (5) |
|
2.1.1 Service Availability |
|
|
20 | (1) |
|
2.1.2 Service Reliability |
|
|
21 | (1) |
|
2.1.3 Reliability, Availability, and Failures |
|
|
22 | (3) |
|
2.2 Measuring Service Availability |
|
|
25 | (8) |
|
2.2.1 Total and Partial Outages |
|
|
26 | (1) |
|
2.2.2 Minimum Chargeable Disruption Duration |
|
|
27 | (1) |
|
2.2.3 Outage Attributability |
|
|
28 | (2) |
|
2.2.4 Systems and Network Elements |
|
|
30 | (1) |
|
2.2.5 Service Impact and Element Impact Outages |
|
|
30 | (2) |
|
2.2.6 Treatment of Planned Events |
|
|
32 | (1) |
|
2.3 Measuring Service Reliability |
|
|
33 | (2) |
|
PART 2 MODELING AND ANALYSIS OF REDUNDANCY |
|
|
35 | (166) |
|
3 Understanding Redundancy |
|
|
37 | (22) |
|
|
37 | (7) |
|
3.1.1 Simplex Configuration |
|
|
39 | (2) |
|
|
41 | (2) |
|
3.1.3 Single Point of Failure |
|
|
43 | (1) |
|
3.2 Modeling Availability of Internal Redundancy |
|
|
44 | (8) |
|
3.2.1 Modeling Active-Active Redundancy |
|
|
45 | (4) |
|
3.2.2 Modeling Active-Standby Redundancy |
|
|
49 | (2) |
|
3.2.3 Service Availability Comparison |
|
|
51 | (1) |
|
3.3 Evaluating High-Availability Mechanisms |
|
|
52 | (7) |
|
3.3.1 Recovery Time Objective (or Nominal Outage Duration) |
|
|
54 | (1) |
|
3.3.2 Recovery Point Objective |
|
|
54 | (1) |
|
3.3.3 Nominal Success Probability |
|
|
55 | (1) |
|
|
55 | (1) |
|
|
56 | (1) |
|
|
56 | (3) |
|
4 Overview Of External Redundancy |
|
|
59 | (18) |
|
4.1 Generic External Redundancy Model |
|
|
59 | (15) |
|
|
64 | (2) |
|
4.1.2 Triggering Recovery Action |
|
|
66 | (1) |
|
4.1.3 Traffic Redirection |
|
|
67 | (4) |
|
4.1.4 Service Context Preservation |
|
|
71 | (3) |
|
4.1.5 Graceful Service Migration |
|
|
74 | (1) |
|
4.2 Technical Distinctions between Georedundancy and Co-Located Redundancy |
|
|
74 | (1) |
|
4.3 Manual Graceful Switchover and Switchback |
|
|
75 | (2) |
|
5 External Redundancy Strategy Options |
|
|
77 | (21) |
|
5.1 Redundancy Strategies |
|
|
77 | (2) |
|
5.2 Data Recovery Strategies |
|
|
79 | (1) |
|
5.3 External Recovery Strategies |
|
|
80 | (1) |
|
5.4 Manually Controlled Recovery |
|
|
81 | (2) |
|
5.4.1 Manually Controlled Example: Provisioning System for a Database |
|
|
83 | (1) |
|
5.4.2 Manually Controlled Example: Performance Management Systems |
|
|
83 | (1) |
|
5.5 System-Driven Recovery |
|
|
83 | (2) |
|
5.5.1 System-Driven Recovery Examples |
|
|
85 | (1) |
|
5.6 Client-Initiated Recovery |
|
|
85 | (13) |
|
5.6.1 Client-Initiated Recovery Overview |
|
|
86 | (2) |
|
5.6.2 Failure Detection by Client |
|
|
88 | (7) |
|
5.6.3 Client-Initiated Recovery Example: Automatic Teller Machine (ATM) |
|
|
95 | (1) |
|
5.6.4 Client-Initiated Recovery Example: A Web Browser Querying a Web Server |
|
|
96 | (1) |
|
5.6.5 Client-Initiated Recovery Example: A Pool of DNS Servers |
|
|
97 | (1) |
|
6 Modeling Service Availability With External System Redundancy |
|
|
98 | (35) |
|
6.1 The Simplistic Answer |
|
|
98 | (1) |
|
6.2 Framing Service Availability of Standalone Systems |
|
|
99 | (4) |
|
6.3 Generic Markov Availability Model of Georedundant Recovery |
|
|
103 | (12) |
|
6.3.1 Simplifying Assumptions |
|
|
103 | (1) |
|
6.3.2 Standalone High-Availability Model |
|
|
104 | (3) |
|
6.3.3 Manually Controlled Georedundant Recovery |
|
|
107 | (3) |
|
6.3.4 System-Driven Georedundant Recovery |
|
|
110 | (1) |
|
6.3.5 Client-Initiated Georedundancy Recovery |
|
|
111 | (2) |
|
6.3.6 Complex Georedundant Recovery |
|
|
113 | (1) |
|
6.3.7 Comparing the Generic Georedundancy Model to the Simplistic Model |
|
|
114 | (1) |
|
6.4 Solving the Generic Georedundancy Model |
|
|
115 | (6) |
|
6.4.1 Manually Controlled Georedundant Recovery Model |
|
|
118 | (2) |
|
6.4.2 System-Driven Georedundant Recovery Model |
|
|
120 | (1) |
|
6.4.3 Client-Initiated Georedundant Recovery Model |
|
|
120 | (1) |
|
|
121 | (1) |
|
6.5 Practical Modeling of Georedundancy |
|
|
121 | (9) |
|
6.5.1 Practical Modeling of Manually Controlled External System Recovery |
|
|
122 | (2) |
|
6.5.2 Practical Modeling of System-Driven Georedundant Recovery |
|
|
124 | (1) |
|
6.5.3 Practical Modeling of Client-Initiated Recovery |
|
|
125 | (5) |
|
6.6 Estimating Availability Benefit for Planned Activities |
|
|
130 | (1) |
|
6.7 Estimating Availability Benefit for Disasters |
|
|
131 | (2) |
|
7 Understanding Recovery Timing Parameters |
|
|
133 | (14) |
|
7.1 Detecting Implicit Failures |
|
|
134 | (7) |
|
7.1.1 Understanding and Optimizing Ttimeout |
|
|
134 | (3) |
|
7.1.2 Understanding and Optimizing Tkeepalive |
|
|
137 | (2) |
|
7.1.3 Understanding and Optimizing Tclient |
|
|
139 | (1) |
|
7.1.4 Timer Impact on Service Reliability |
|
|
140 | (1) |
|
7.2 Understanding and Optimizing RTO |
|
|
141 | (6) |
|
7.2.1 RTO for Manually Controlled Recovery |
|
|
141 | (2) |
|
7.2.2 RTO for System-Driven Recovery |
|
|
143 | (2) |
|
7.2.3 RTO for Client-Initiated Recovery |
|
|
145 | (1) |
|
7.2.4 Comparing External Redundancy Strategies |
|
|
146 | (1) |
|
8 Case Study Of Client-Initiated Recovery |
|
|
147 | (27) |
|
|
147 | (1) |
|
8.2 Mapping DNS onto Practical Client-Initiated Recovery Model |
|
|
148 | (6) |
|
8.2.1 Modeling Normal Operation |
|
|
150 | (1) |
|
8.2.2 Modeling Server Failure |
|
|
151 | (1) |
|
8.2.3 Modeling Timeout Failure |
|
|
151 | (2) |
|
8.2.4 Modeling Abnormal Server Failure |
|
|
153 | (1) |
|
8.2.5 Modeling Multiple Server Failure |
|
|
154 | (1) |
|
8.3 Estimating Input Parameters |
|
|
154 | (11) |
|
8.3.1 Server Failure Rate |
|
|
155 | (3) |
|
8.3.2 Fexplicit Parameter |
|
|
158 | (1) |
|
|
158 | (2) |
|
|
160 | (1) |
|
8.3.5 μclientsfd Parameter |
|
|
160 | (1) |
|
|
161 | (1) |
|
8.3.7 Acluster-1 Parameter |
|
|
162 | (1) |
|
|
162 | (2) |
|
8.3.9 μgrecover and μmigration Parameters |
|
|
164 | (1) |
|
8.3.10 μduplex Parameter |
|
|
165 | (1) |
|
|
165 | (1) |
|
|
165 | (7) |
|
8.4.1 Sensitivity Analysis |
|
|
168 | (4) |
|
8.5 Discussion of Predicted Results |
|
|
172 | (2) |
|
9 Solution And Cluster Recovery |
|
|
174 | (27) |
|
9.1 Understanding Solutions |
|
|
174 | (3) |
|
|
175 | (1) |
|
9.1.2 Solution Architecture |
|
|
176 | (1) |
|
9.2 Estimating Solution Availability |
|
|
177 | (2) |
|
9.3 Cluster versus Element Recovery |
|
|
179 | (3) |
|
9.4 Element Failure and Cluster Recovery Case Study |
|
|
182 | (4) |
|
9.5 Comparing Element and Cluster Recovery |
|
|
186 | (1) |
|
|
186 | (1) |
|
9.5.2 Triggering Recovery Action |
|
|
186 | (1) |
|
9.5.3 Traffic Redirection |
|
|
186 | (1) |
|
9.5.4 Service Context Preservation |
|
|
187 | (1) |
|
|
187 | (1) |
|
9.6 Modeling Cluster Recovery |
|
|
187 | (14) |
|
9.6.1 Cluster Recovery Modeling Parameters |
|
|
190 | (3) |
|
9.6.2 Estimating λsuperelement |
|
|
193 | (3) |
|
9.6.3 Example of Super Element Recovery Modeling |
|
|
196 | (5) |
|
|
201 | (84) |
|
10 Georedundancy Strategy |
|
|
203 | (16) |
|
10.1 Why Support Multiple Sites? |
|
|
203 | (1) |
|
|
204 | (2) |
|
10.2.1 Choosing Site Locations |
|
|
205 | (1) |
|
|
206 | (1) |
|
10.4 Limp-Along Architectures |
|
|
207 | (1) |
|
10.5 Site Redundancy Options |
|
|
208 | (8) |
|
|
208 | (3) |
|
10.5.2 N + K Load Sharing |
|
|
211 | (4) |
|
|
215 | (1) |
|
10.6 Virtualization, Cloud Computing, and Standby Sites |
|
|
216 | (1) |
|
10.7 Recommended Design Methodology |
|
|
217 | (2) |
|
11 Maximizing Service Availability Via Georedundancy |
|
|
219 | (11) |
|
11.1 Theoretically Optimal External Redundancy |
|
|
219 | (1) |
|
11.2 Practically Optimal Recovery Strategies |
|
|
220 | (8) |
|
11.2.1 Internal versus External Redundancy |
|
|
220 | (2) |
|
11.2.2 Client-Initiated Recovery as Optimal External Recovery Strategy |
|
|
222 | (1) |
|
11.2.3 Multi-Site Strategy |
|
|
223 | (1) |
|
11.2.4 Active-Active Server Operation |
|
|
223 | (1) |
|
11.2.5 Optimizing Timeout and Retry Parameters |
|
|
224 | (1) |
|
|
225 | (1) |
|
11.2.7 Rapid Context Restoration |
|
|
225 | (1) |
|
11.2.8 Automatic Switchback |
|
|
226 | (1) |
|
|
226 | (1) |
|
11.2.10 Network Element versus Cluster-Level Recovery |
|
|
226 | (2) |
|
11.3 Other Considerations |
|
|
228 | (2) |
|
11.3.1 Architecting to Facilitate Planned Maintenance Activities |
|
|
228 | (1) |
|
11.3.2 Procedural Considerations |
|
|
229 | (1) |
|
12 Georedundancy Requirements |
|
|
230 | (13) |
|
12.1 Internal Redundancy Requirements |
|
|
230 | (3) |
|
12.1.1 Standalone Network Element Redundancy Requirements |
|
|
231 | (1) |
|
12.1.2 Basic Solution Redundancy Requirements |
|
|
232 | (1) |
|
12.2 External Redundancy Requirements |
|
|
233 | (2) |
|
12.3 Manually Controlled Redundancy Requirements |
|
|
235 | (2) |
|
12.3.1 Manual Failover Requirements |
|
|
235 | (1) |
|
12.3.2 Graceful Switchover Requirements |
|
|
236 | (1) |
|
12.3.3 Switchback Requirements |
|
|
236 | (1) |
|
12.4 Automatic External Recovery Requirements |
|
|
237 | (5) |
|
12.4.1 System-Driven Recovery |
|
|
237 | (2) |
|
12.4.2 Client-Initiated Recovery |
|
|
239 | (3) |
|
12.5 Operational Requirements |
|
|
242 | (1) |
|
|
243 | (13) |
|
13.1 Georedundancy Testing Strategy |
|
|
243 | (3) |
|
13.1.1 Network Element Level Testing |
|
|
244 | (1) |
|
13.1.2 End-to-End Testing |
|
|
245 | (1) |
|
13.1.3 Deployment Testing |
|
|
245 | (1) |
|
13.1.4 Operational Testing |
|
|
246 | (1) |
|
13.2 Test Cases for External Redundancy |
|
|
246 | (1) |
|
13.3 Verifying Georedundancy Requirements |
|
|
247 | (7) |
|
13.3.1 Test Cases for Standalone Elements |
|
|
248 | (1) |
|
13.3.2 Test Cases for Manually Controlled Recovery |
|
|
248 | (1) |
|
13.3.3 Test Cases for System-Driven Recovery |
|
|
249 | (1) |
|
13.3.4 Test Cases for Client-Initiated Recovery |
|
|
250 | (2) |
|
13.3.5 Test Cases at the Solution Level |
|
|
252 | (1) |
|
13.3.6 Test Cases for Operational Testing |
|
|
253 | (1) |
|
|
254 | (2) |
|
14 Solution Georedundancy Case Study |
|
|
256 | (29) |
|
14.1 The Hypothetical Solution |
|
|
256 | (3) |
|
14.1.1 Key Quality Indicators |
|
|
258 | (1) |
|
14.2 Standalone Solution Analysis |
|
|
259 | (4) |
|
14.2.1 Construct Reliability Block Diagrams |
|
|
260 | (1) |
|
14.2.2 Network Element Configuration in Standalone Solution |
|
|
260 | (1) |
|
14.2.3 Service Availability Offered by Standalone Solution |
|
|
261 | (1) |
|
14.2.4 Discussion of Standalone Solution Analysis |
|
|
261 | (2) |
|
14.3 Georedundant Solution Analysis |
|
|
263 | (6) |
|
14.3.1 Identify Factors Constraining Recovery Realm Design |
|
|
263 | (1) |
|
14.3.2 Define Recovery Realms |
|
|
264 | (1) |
|
14.3.3 Define Recovery Strategies |
|
|
264 | (1) |
|
14.3.4 Set Recovery Objectives |
|
|
265 | (3) |
|
14.3.5 Architecting Site Redundancy |
|
|
268 | (1) |
|
14.4 Availability of the Georedundant Solution |
|
|
269 | (1) |
|
14.5 Requirements of Hypothetical Solution |
|
|
269 | (8) |
|
14.5.1 Internal Redundancy Requirements |
|
|
269 | (3) |
|
14.5.2 External Redundancy Requirements |
|
|
272 | (1) |
|
14.5.3 Manual Failover Requirements |
|
|
273 | (1) |
|
14.5.4 Automatic External Recovery Requirements |
|
|
274 | (3) |
|
14.5.5 Operational Requirements |
|
|
277 | (1) |
|
14.6 Testing of Hypothetical Solution |
|
|
277 | (8) |
|
|
277 | (1) |
|
14.6.2 Standalone Network Element Testing |
|
|
278 | (1) |
|
14.6.3 Automatic External Recovery Requirements Testing |
|
|
279 | (3) |
|
14.6.4 End-to-End Testing |
|
|
282 | (1) |
|
14.6.5 Deployment Testing |
|
|
283 | (1) |
|
14.6.6 Operational Testing |
|
|
284 | (1) |
Summary |
|
285 | (7) |
Appendix: Markov Modeling of Service Availability |
|
292 | (4) |
Acronyms |
|
296 | (2) |
References |
|
298 | (2) |
About the Authors |
|
300 | (2) |
Index |
|
302 | |