Preface to the Second Edition |
|
xi | |
Acknowledgments |
|
xiii | |
Chapter 1 Preliminaries |
|
1 | (10) |
|
|
1 | (2) |
|
|
3 | (1) |
|
1.3 Basic Measures of Fault Tolerance |
|
|
4 | (2) |
|
1.3.1 Traditional Measures |
|
|
4 | (2) |
|
|
6 | (1) |
|
|
6 | (2) |
|
|
8 | (1) |
|
|
9 | (2) |
Chapter 2 Hardware Fault Tolerance |
|
11 | (48) |
|
2.1 The Rate of Hardware Failure |
|
|
11 | (2) |
|
2.2 Failure Rate, Reliability, and Mean Time to Failure |
|
|
13 | (2) |
|
2.3 Hardware Failure Mechanisms |
|
|
15 | (4) |
|
|
16 | (1) |
|
|
16 | (1) |
|
2.3.3 Negative Bias Temperature Instability |
|
|
17 | (1) |
|
2.3.4 Hot Carrier Injection |
|
|
17 | (1) |
|
2.3.5 Time-Dependent Dielectric Breakdown |
|
|
18 | (1) |
|
2.3.6 Putting It All Together |
|
|
18 | (1) |
|
|
19 | (1) |
|
2.5 Canonical and Resilient Structures |
|
|
20 | (13) |
|
2.5.1 Series and Parallel Systems |
|
|
20 | (1) |
|
2.5.2 Nonseries/Parallel Systems |
|
|
21 | (3) |
|
|
24 | (2) |
|
|
26 | (1) |
|
2.5.5 Variations on N-Modular Redundancy |
|
|
27 | (3) |
|
|
30 | (3) |
|
2.6 Other Reliability Evaluation Techniques |
|
|
33 | (6) |
|
|
33 | (2) |
|
|
35 | (4) |
|
2.7 Fault-Tolerance Processor-Level Techniques |
|
|
39 | (4) |
|
|
39 | (2) |
|
2.7.2 Simultaneous Multithreading for Fault Tolerance |
|
|
41 | (2) |
|
2.8 Timing Fault Tolerance |
|
|
43 | (2) |
|
2.9 Tolerance of Byzantine Failures |
|
|
45 | (5) |
|
2.9.1 Byzantine Agreement With Message Authentication |
|
|
49 | (1) |
|
|
50 | (1) |
|
|
51 | (4) |
|
|
55 | (4) |
Chapter 3 Information Redundancy |
|
59 | (56) |
|
|
59 | (27) |
|
|
61 | (5) |
|
|
66 | (2) |
|
|
68 | (1) |
|
|
68 | (1) |
|
|
69 | (6) |
|
|
75 | (4) |
|
3.1.7 Local Hard and Soft Decisions |
|
|
79 | (7) |
|
3.2 Resilient Disk Systems |
|
|
86 | (11) |
|
|
86 | (1) |
|
|
87 | (1) |
|
|
88 | (1) |
|
|
89 | (1) |
|
|
90 | (1) |
|
|
91 | (1) |
|
3.2.7 Modeling Correlated Failures |
|
|
92 | (4) |
|
3.2.8 RAID With Solid-State Disks |
|
|
96 | (1) |
|
|
97 | (9) |
|
3.3.1 Voting: Nonhierarchical Organization |
|
|
98 | (5) |
|
3.3.2 Voting: Hierarchical Organization |
|
|
103 | (1) |
|
3.3.3 Primary-Backup Approach |
|
|
104 | (2) |
|
3.4 Algorithm-Based Fault Tolerance |
|
|
106 | (2) |
|
|
108 | (1) |
|
|
109 | (3) |
|
|
112 | (3) |
Chapter 4 Fault-Tolerant Networks |
|
115 | (46) |
|
4.1 Measures of Resilience |
|
|
116 | (1) |
|
4.1.1 Graph Theoretical Measures |
|
|
116 | (1) |
|
4.1.2 Computer Networks Measures |
|
|
116 | (1) |
|
4.2 Common Network Topologies and Their Resilience |
|
|
117 | (22) |
|
4.2.1 Multistage and Extra-Stage Networks |
|
|
118 | (5) |
|
|
123 | (2) |
|
4.2.3 Rectangular Mesh and Interstitial Mesh |
|
|
125 | (2) |
|
|
127 | (4) |
|
4.2.5 Cube-Connected Cycles Networks |
|
|
131 | (1) |
|
|
132 | (2) |
|
|
134 | (2) |
|
4.2.8 Ad Hoc Point-to-Point Networks |
|
|
136 | (3) |
|
4.3 Fault-Tolerant Routing |
|
|
139 | (5) |
|
4.3.1 Hypercube Fault-Tolerant Routing |
|
|
139 | (2) |
|
4.3.2 Origin-Based Routing in the Mesh |
|
|
141 | (3) |
|
|
144 | (5) |
|
4.4.1 Router Fault Tolerance |
|
|
145 | (2) |
|
|
147 | (1) |
|
4.4.3 Routing in the Presence of Failure |
|
|
148 | (1) |
|
4.5 Wireless Sensor Networks |
|
|
149 | (4) |
|
|
149 | (1) |
|
4.5.2 Sensor Network Failures |
|
|
150 | (1) |
|
4.5.3 Sensor Network Fault Tolerance |
|
|
150 | (3) |
|
|
153 | (1) |
|
|
154 | (3) |
|
|
157 | (4) |
Chapter 5 Software Fault Tolerance |
|
161 | (42) |
|
|
161 | (2) |
|
5.2 Single-Version Fault Tolerance |
|
|
163 | (10) |
|
|
163 | (2) |
|
5.2.2 Software Rejuvenation |
|
|
165 | (4) |
|
|
169 | (2) |
|
5.2.4 Software-Implemented Hardware Fault Tolerance (SIHFT) |
|
|
171 | (2) |
|
5.3 N-Version Programming |
|
|
173 | (8) |
|
5.3.1 Consistent Comparison Problem |
|
|
174 | (1) |
|
5.3.2 Version Independence |
|
|
175 | (4) |
|
5.3.3 Other Issues in N-Version Programming |
|
|
179 | (2) |
|
5.4 Recovery Block Approach |
|
|
181 | (4) |
|
|
181 | (1) |
|
5.4.2 Success Probability Calculation |
|
|
182 | (1) |
|
5.4.3 Distributed Recovery Blocks |
|
|
183 | (2) |
|
5.5 Preconditions, Postconditions, and Assertions |
|
|
185 | (1) |
|
|
185 | (4) |
|
5.6.1 Requirements From Exception Handlers |
|
|
186 | (1) |
|
5.6.2 Basics of Exceptions and Exception Handling |
|
|
186 | (3) |
|
|
189 | (1) |
|
5.7 Software Reliability Models |
|
|
189 | (5) |
|
5.7.1 Jelinski-Moranda Model |
|
|
189 | (1) |
|
5.7.2 Littlewood-Verrall Model |
|
|
190 | (1) |
|
|
191 | (1) |
|
5.7.4 Ostrand-Weyuker-Bell (OWB) Fault Model |
|
|
192 | (1) |
|
5.7.5 Model Selection and Parameter Estimation |
|
|
193 | (1) |
|
5.8 Fault-Tolerant Remote Procedure Calls |
|
|
194 | (2) |
|
5.8.1 Primary-Backup Approach |
|
|
194 | (1) |
|
5.8.2 The Circus Approach |
|
|
194 | (2) |
|
|
196 | (1) |
|
|
197 | (2) |
|
|
199 | (4) |
Chapter 6 Checkpointing |
|
203 | (34) |
|
6.1 What Is Checkpointing? |
|
|
205 | (2) |
|
6.1.1 Why Is Checkpointing Nontrivial? |
|
|
206 | (1) |
|
|
207 | (1) |
|
6.3 Optimal Checkpointing: an Analytical Model |
|
|
207 | (6) |
|
6.3.1 Time Between Checkpoints-a First-Order Approximation |
|
|
208 | (1) |
|
6.3.2 Optimal Checkpoint Placement |
|
|
209 | (1) |
|
6.3.3 Time Between Checkpoints: a More Accurate Model |
|
|
210 | (1) |
|
|
211 | (1) |
|
|
212 | (1) |
|
6.4 Cache-Aided Rollback Error Recovery (CARER) |
|
|
213 | (1) |
|
6.5 Checkpointing in Distributed Systems |
|
|
214 | (9) |
|
6.5.1 The Domino Effect and Livelock |
|
|
215 | (2) |
|
6.5.2 A Coordinated Checkpointing Algorithm |
|
|
217 | (1) |
|
6.5.3 Time-Based Synchronization |
|
|
218 | (1) |
|
6.5.4 Diskless Checkpointing |
|
|
219 | (1) |
|
|
219 | (4) |
|
6.6 Checkpointing in Shared-Memory Systems |
|
|
223 | (2) |
|
6.6.1 Bus-Based Coherence Protocol |
|
|
223 | (1) |
|
6.6.2 Directory-Based Protocol |
|
|
224 | (1) |
|
6.7 Checkpointing in Real-Time Systems |
|
|
225 | (3) |
|
6.8 Checkpointing While Using Cloud Computing Utilities |
|
|
228 | (1) |
|
6.9 Emerging Challenges: Petascale and Exascale Computing |
|
|
228 | (1) |
|
6.10 Other Uses of Checkpointing |
|
|
229 | (1) |
|
|
230 | (1) |
|
|
231 | (2) |
|
|
233 | (4) |
Chapter 7 Cyber-Physical Systems |
|
237 | (26) |
|
7.1 Structure of a Cyber-Physical System |
|
|
238 | (2) |
|
7.2 The Controlled Plant State Space |
|
|
240 | (2) |
|
|
242 | (10) |
|
|
244 | (1) |
|
7.3.2 Detecting Faulty Sensors |
|
|
245 | (5) |
|
7.3.3 Confidence Measures for Intervals |
|
|
250 | (2) |
|
|
252 | (4) |
|
|
253 | (2) |
|
|
255 | (1) |
|
|
256 | (1) |
|
|
256 | (3) |
|
|
259 | (1) |
|
|
260 | (1) |
|
|
261 | (2) |
Chapter 8 Case Studies |
|
263 | (28) |
|
|
263 | (3) |
|
8.1.1 Protecting Against Radiation |
|
|
263 | (1) |
|
8.1.2 Flight Control System: Boeing 777 |
|
|
264 | (2) |
|
|
266 | (6) |
|
|
267 | (2) |
|
8.2.2 Maintenance and Repair Aids |
|
|
269 | (1) |
|
|
269 | (1) |
|
8.2.4 Modifications to the NonStop Architecture |
|
|
270 | (2) |
|
|
272 | (2) |
|
8.4 Cassini Command and Data Subsystem |
|
|
274 | (2) |
|
|
276 | (1) |
|
|
277 | (1) |
|
|
278 | (2) |
|
|
280 | (4) |
|
|
280 | (2) |
|
|
282 | (2) |
|
8.9 Oracle SPARC M8 Server |
|
|
284 | (1) |
|
|
285 | (2) |
|
8.10.1 Checkpointing in Response to Spot Pricing |
|
|
285 | (1) |
|
8.10.2 Proactive Virtual Machine Migration |
|
|
286 | (1) |
|
8.10.3 Fault Tolerance as a Service |
|
|
286 | (1) |
|
|
287 | (1) |
|
|
288 | (3) |
Chapter 9 Simulation Techniques |
|
291 | (50) |
|
9.1 Writing a Simulation Program |
|
|
291 | (3) |
|
|
294 | (11) |
|
9.2.1 Point Versus Interval Estimation |
|
|
294 | (1) |
|
|
295 | (2) |
|
9.2.3 Method of Maximum Likelihood |
|
|
297 | (3) |
|
9.2.4 The Bayesian Approach to Parameter Estimation |
|
|
300 | (1) |
|
9.2.5 Confidence Intervals |
|
|
301 | (4) |
|
9.3 Variance Reduction Methods |
|
|
305 | (11) |
|
9.3.1 Antithetic Variables |
|
|
305 | (2) |
|
9.3.2 Using Control Variables |
|
|
307 | (1) |
|
9.3.3 Stratified Sampling |
|
|
307 | (2) |
|
9.3.4 Importance Sampling |
|
|
309 | (7) |
|
|
316 | (5) |
|
9.5 Random Number Generation |
|
|
321 | (11) |
|
9.5.1 Uniformly Distributed Random Number Generators |
|
|
321 | (3) |
|
9.5.2 Testing Uniform Random Number Generators |
|
|
324 | (3) |
|
9.5.3 Generating Other Distributions |
|
|
327 | (5) |
|
|
332 | (3) |
|
9.6.1 Types of Fault Injection Techniques |
|
|
332 | (2) |
|
9.6.2 Fault Injection Application and Tools |
|
|
334 | (1) |
|
|
335 | (1) |
|
|
336 | (2) |
|
|
338 | (3) |
Chapter 10 Defect Tolerance in VLSI Circuits |
|
341 | (32) |
|
10.1 Manufacturing Defects and Circuit Faults |
|
|
341 | (2) |
|
10.2 Probability of Failure and Critical Area |
|
|
343 | (2) |
|
|
345 | (4) |
|
10.3.1 The Poisson and Compound Poisson Yield Models |
|
|
345 | (2) |
|
10.3.2 Variations on the Simple Yield Models |
|
|
347 | (2) |
|
10.4 Yield Enhancement Through Redundancy |
|
|
349 | (15) |
|
10.4.1 Yield Projection for Chips With Redundancy |
|
|
349 | (4) |
|
10.4.2 Memory Arrays With Redundancy |
|
|
353 | (6) |
|
10.4.3 Logic Integrated Circuits With Redundancy |
|
|
359 | (2) |
|
10.4.4 Modifying the Floorplan |
|
|
361 | (3) |
|
|
364 | (2) |
|
|
366 | (3) |
|
|
369 | (4) |
Chapter 11 Fault Detection in Cryptographic Systems |
|
373 | (22) |
|
|
373 | (10) |
|
11.1.1 Symmetric Key Ciphers |
|
|
374 | (7) |
|
11.1.2 Public Key Ciphers |
|
|
381 | (2) |
|
11.2 Security Attacks Through Fault Injection |
|
|
383 | (2) |
|
11.2.1 Fault Attacks on Symmetric Key Ciphers |
|
|
384 | (1) |
|
11.2.2 Fault Attacks on Public (Asymmetric) Key Ciphers |
|
|
385 | (1) |
|
|
385 | (7) |
|
11.3.1 Spatial and Temporal Duplication |
|
|
386 | (1) |
|
11.3.2 Error-Detecting Codes |
|
|
386 | (3) |
|
11.3.3 Are These Countermeasure Sufficient? |
|
|
389 | (2) |
|
|
391 | (1) |
|
|
392 | (1) |
|
|
392 | (1) |
|
|
393 | (2) |
Index |
|
395 | |