E-book: Beyond Redundancy: How Geographic Redundancy Can Improve Service Availability and Reliability of Computer-Based Systems [Wiley Online]

Eric Bauer (Alcatel-Lucent Reliability), Daniel Eustace, Randee Adams

Format: 330 pages, Photos: 55 B&W, 0 Color; Drawings: 50 B&W, 0 Color
Publication date: 02-Dec-2011
Publisher: Wiley-IEEE Press
ISBN-10: 1118104919
ISBN-13: 9781118104910

Wiley Online
Price: 108,85 €*
* price that provides access for an unlimited number of simultaneous users for an unlimited time

Book homepage: https://onlinelibrary.wiley.com/doi/book/10.1002/9781118104910

While geographic redundancy can obviously be a huge benefit for disaster recovery, it is far less obvious what benefit is feasible and likely for more typical non-catastrophic hardware, software, and human failures. Georedundancy and Service Availability provides both a theoretical and practical treatment of the feasible and likely benefits of geographic redundancy for both service availability and service reliability. The text provides network/system planners, IS/IT operations folks, system architects, system engineers, developers, testers, and other industry practitioners with a general discussion about the capital expense/operating expense tradeoff that frames system redundancy and georedundancy.

Figures

Tables

Equations

Preface and Acknowledgments

Audience

Organization

Acknowledgments

PART 1 BASICS

1 Service, Risk, and Business Continuity

1.1 Service Criticality and Availability Expectations

1.2 The Eight-Ingredient Model

1.3 Catastrophic Failures and Geographic Redundancy

1.4 Geographically Separated Recovery Site

1.5 Managing Risk

1.5.1 Risk Identification

1.5.2 Risk Treatments

1.6 Business Continuity Planning

1.7 Disaster Recovery Planning

1.8 Human Factors

1.9 Recovery Objectives

1.10 Disaster Recovery Strategies

2 Service Availability and Service Reliability

2.1 Availability and Reliability

2.1.1 Service Availability

2.1.2 Service Reliability

2.1.3 Reliability, Availability, and Failures

2.2 Measuring Service Availability

2.2.1 Total and Partial Outages

2.2.2 Minimum Chargeable Disruption Duration

2.2.3 Outage Attributability

2.2.4 Systems and Network Elements

2.2.5 Service Impact and Element Impact Outages

2.2.6 Treatment of Planned Events

2.3 Measuring Service Reliability

PART 2 MODELING AND ANALYSIS OF REDUNDANCY

3 Understanding Redundancy

3.1 Types of Redundancy

3.1.1 Simplex Configuration

3.1.2 Redundancy

3.1.3 Single Point of Failure

3.2 Modeling Availability of Internal Redundancy

3.2.1 Modeling Active-Active Redundancy

3.2.2 Modeling Active-Standby Redundancy

3.2.3 Service Availability Comparison

3.3 Evaluating High-Availability Mechanisms

3.3.1 Recovery Time Objective (or Nominal Outage Duration)

3.3.2 Recovery Point Objective

3.3.3 Nominal Success Probability

3.3.4 Capital Expense

3.3.5 Operating Expense

3.3.6 Discussion

4 Overview of External Redundancy

4.1 Generic External Redundancy Model

4.1.1 Failure Detection

4.1.2 Triggering Recovery Action

4.1.3 Traffic Redirection

4.1.4 Service Context Preservation

4.1.5 Graceful Service Migration

4.2 Technical Distinctions between Georedundancy and Co-Located Redundancy

4.3 Manual Graceful Switchover and Switchback

5 External Redundancy Strategy Options

5.1 Redundancy Strategies

5.2 Data Recovery Strategies

5.3 External Recovery Strategies

5.4 Manually Controlled Recovery

5.4.1 Manually Controlled Example: Provisioning System for a Database

5.4.2 Manually Controlled Example: Performance Management Systems

5.5 System-Driven Recovery

5.5.1 System-Driven Recovery Examples

5.6 Client-Initiated Recovery

5.6.1 Client-Initiated Recovery Overview

5.6.2 Failure Detection by Client

5.6.3 Client-Initiated Recovery Example: Automatic Teller Machine (ATM)

5.6.4 Client-Initiated Recovery Example: A Web Browser Querying a Web Server

5.6.5 Client-Initiated Recovery Example: A Pool of DNS Servers

6 Modeling Service Availability with External System Redundancy

6.1 The Simplistic Answer

6.2 Framing Service Availability of Standalone Systems

6.3 Generic Markov Availability Model of Georedundant Recovery

6.3.1 Simplifying Assumptions

6.3.2 Standalone High-Availability Model

6.3.3 Manually Controlled Georedundant Recovery

6.3.4 System-Driven Georedundant Recovery

6.3.5 Client-Initiated Georedundancy Recovery

6.3.6 Complex Georedundant Recovery

6.3.7 Comparing the Generic Georedundancy Model to the Simplistic Model

6.4 Solving the Generic Georedundancy Model

6.4.1 Manually Controlled Georedundant Recovery Model

6.4.2 System-Driven Georedundant Recovery Model

6.4.3 Client-Initiated Georedundant Recovery Model

6.4.4 Conclusion

6.5 Practical Modeling of Georedundancy

6.5.1 Practical Modeling of Manually Controlled External System Recovery

6.5.2 Practical Modeling of System-Driven Georedundant Recovery

6.5.3 Practical Modeling of Client-Initiated Recovery

6.6 Estimating Availability Benefit for Planned Activities

6.7 Estimating Availability Benefit for Disasters

7 Understanding Recovery Timing Parameters

7.1 Detecting Implicit Failures

7.1.1 Understanding and Optimizing Ttimeout

7.1.2 Understanding and Optimizing Tkeepalive

7.1.3 Understanding and Optimizing Tclient

7.1.4 Timer Impact on Service Reliability

7.2 Understanding and Optimizing RTO

7.2.1 RTO for Manually Controlled Recovery

7.2.2 RTO for System-Driven Recovery

7.2.3 RTO for Client-Initiated Recovery

7.2.4 Comparing External Redundancy Strategies

8 Case Study of Client-Initiated Recovery

8.1 Overview of DNS

8.2 Mapping DNS onto Practical Client-Initiated Recovery Model

8.2.1 Modeling Normal Operation

8.2.2 Modeling Server Failure

8.2.3 Modeling Timeout Failure

8.2.4 Modeling Abnormal Server Failure

8.2.5 Modeling Multiple Server Failure

8.3 Estimating Input Parameters

8.3.1 Server Failure Rate

8.3.2 Fexplicit Parameter

8.3.3 μclient Parameter

8.3.4 μtimeout Parameter

8.3.5 μclientsfd Parameter

8.3.6 μclient Parameter

8.3.7 Acluster-1 Parameter

8.3.8 μclient Parameter

8.3.9 μgrecover and μmigration Parameters

8.3.10 μduplex Parameter

8.3.11 Parameter Summary

8.4 Predicted Results

8.4.1 Sensitivity Analysis

8.5 Discussion of Predicted Results

9 Solution and Cluster Recovery

9.1 Understanding Solutions

9.1.1 Solution Users

9.1.2 Solution Architecture

9.2 Estimating Solution Availability

9.3 Cluster versus Element Recovery

9.4 Element Failure and Cluster Recovery Case Study

9.5 Comparing Element and Cluster Recovery

9.5.1 Failure Detection

9.5.2 Triggering Recovery Action

9.5.3 Traffic Redirection

9.5.4 Service Context Preservation

9.5.5 Graceful Migration

9.6 Modeling Cluster Recovery

9.6.1 Cluster Recovery Modeling Parameters

9.6.2 Estimating λsuperelement

9.6.3 Example of Super Element Recovery Modeling

PART 3 RECOMMENDATIONS

10 Georedundancy Strategy

10.1 Why Support Multiple Sites?

10.2 Recovery Realms

10.2.1 Choosing Site Locations

10.3 Recovery Strategies

10.4 Limp-Along Architectures

10.5 Site Redundancy Options

10.5.1 Standby Sites

10.5.2 N + K Load Sharing

10.5.3 Discussion

10.6 Virtualization, Cloud Computing, and Standby Sites

10.7 Recommended Design Methodology

11 Maximizing Service Availability via Georedundancy

11.1 Theoretically Optimal External Redundancy

11.2 Practically Optimal Recovery Strategies

11.2.1 Internal versus External Redundancy

11.2.2 Client-Initiated Recovery as Optimal External Recovery Strategy

11.2.3 Multi-Site Strategy

11.2.4 Active-Active Server Operation

11.2.5 Optimizing Timeout and Retry Parameters

11.2.6 Rapid Relogon

11.2.7 Rapid Context Restoration

11.2.8 Automatic Switchback

11.2.9 Overload Control

11.2.10 Network Element versus Cluster-Level Recovery

11.3 Other Considerations

11.3.1 Architecting to Facilitate Planned Maintenance Activities

11.3.2 Procedural Considerations

12 Georedundancy Requirements

12.1 Internal Redundancy Requirements

12.1.1 Standalone Network Element Redundancy Requirements

12.1.2 Basic Solution Redundancy Requirements

12.2 External Redundancy Requirements

12.3 Manually Controlled Redundancy Requirements

12.3.1 Manual Failover Requirements

12.3.2 Graceful Switchover Requirements

12.3.3 Switchback Requirements

12.4 Automatic External Recovery Requirements

12.4.1 System-Driven Recovery

12.4.2 Client-Initiated Recovery

12.5 Operational Requirements

13 Georedundancy Testing

13.1 Georedundancy Testing Strategy

13.1.1 Network Element Level Testing

13.1.2 End-to-End Testing

13.1.3 Deployment Testing

13.1.4 Operational Testing

13.2 Test Cases for External Redundancy

13.3 Verifying Georedundancy Requirements

13.3.1 Test Cases for Standalone Elements

13.3.2 Test Cases for Manually Controlled Recovery

13.3.3 Test Cases for System-Driven Recovery

13.3.4 Test Cases for Client-Initiated Recovery

13.3.5 Test Cases at the Solution Level

13.3.6 Test Cases for Operational Testing

13.4 Summary

14 Solution Georedundancy Case Study

14.1 The Hypothetical Solution

14.1.1 Key Quality Indicators

14.2 Standalone Solution Analysis

14.2.1 Construct Reliability Block Diagrams

14.2.2 Network Element Configuration in Standalone Solution

14.2.3 Service Availability Offered by Standalone Solution

14.2.4 Discussion of Standalone Solution Analysis

14.3 Georedundant Solution Analysis

14.3.1 Identify Factors Constraining Recovery Realm Design

14.3.2 Define Recovery Realms

14.3.3 Define Recovery Strategies

14.3.4 Set Recovery Objectives

14.3.5 Architecting Site Redundancy

14.4 Availability of the Georedundant Solution

14.5 Requirements of Hypothetical Solution

14.5.1 Internal Redundancy Requirements

14.5.2 External Redundancy Requirements

14.5.3 Manual Failover Requirements

14.5.4 Automatic External Recovery Requirements

14.5.5 Operational Requirements

14.6 Testing of Hypothetical Solution

14.6.1 Testing Strategy

14.6.2 Standalone Network Element Testing

14.6.3 Automatic External Recovery Requirements Testing

14.6.4 End-to-End Testing

14.6.5 Deployment Testing

14.6.6 Operational Testing

Summary

Appendix: Markov Modeling of Service Availability

Acronyms

References

About the Authors

Index

Eric Bauer is Reliability Engineering Manager in the IMS Solutions Organization of Alcatel-Lucent, where he focuses on reliability of Alcatel-Lucent's IMS solution and the network elements that comprise the IMS solution. He has written Design for Reliability: Information and Computer-Based Systems and Practical System Reliability. Randee Adams is a Consulting Member of Technical Staff in the Applications Group of Alcatel-Lucent. Currently, she is focusing on reliability for Alcatel-Lucent's software applications.

Daniel Eustace is a Distinguished Member of Technical Staff in the IMS Solutions Organization of Alcatel-Lucent. Currently, he is a solution architect focusing on reliability, key quality indicators, geographical redundancy, and call processing.

Permalink: https://www.kriso.ee/db/9781118104910_pe.html
