I Parallel Computing Concepts and Terminology

1 Introduction  3
1.1 Parallel Computing in Quantum Chemistry: Past and Present  4
1.2 Trends in Hardware Development  5
1.2.1 Moore's Law  5
1.2.2 Clock Speed and Performance  6
1.2.3 Bandwidth and Latency  7
1.2.4 Supercomputer Performance  8
1.3 Trends in Parallel Software Development  10
1.3.1 Responding to Changes in Hardware  10
1.3.2 New Algorithms and Methods  10
1.3.3 New Programming Models  12
|
|
2 Parallel Computer Architectures  17
2.1 Flynn's Classification Scheme  17
2.1.1 Single-Instruction, Single-Data  17
2.1.2 Single-Instruction, Multiple-Data  18
2.1.3 Multiple-Instruction, Multiple-Data  18
2.2 Networks  19
2.2.1 Direct and Indirect Networks  19
2.2.3 Network Performance  23
2.4 MIMD System Architecture  34
2.4.2.3 Trends in Storage  38
2.4.4 Homogeneity and Heterogeneity  39
2.4.5 Commodity versus Custom Computers  40
|
3 Communication via Message-Passing  45
3.1 Point-to-Point Communication Operations  46
3.1.1 Blocking Point-to-Point Operations  46
3.1.2 Non-Blocking Point-to-Point Operations  47
3.2 Collective Communication Operations  49
3.2.1 One-to-All Broadcast  50
3.2.2 All-to-All Broadcast  51
3.2.3 All-to-One Reduction and All-Reduce  54
3.3 One-Sided Communication Operations  55
|
|
4 Multi-Threading  59
4.1 Pitfalls of Multi-Threading  61
4.3 Comparison of Multi-Threading and Message-Passing  65
|
5 Parallel Performance Evaluation  71
5.1 Network Performance Characteristics  71
5.2 Performance Measures for Parallel Programs  74
5.2.1 Speedup and Efficiency  74
5.3 Performance Modeling  80
5.3.1 Modeling the Execution Time  80
5.3.2 Performance Model Example: Matrix-Vector Multiplication  83
5.4 Presenting and Evaluating Performance Data: A Few Caveats  86
|
6 Parallel Program Design  93
6.1 Distribution of Work  94
6.1.1 Static Task Distribution  95
6.1.1.1 Round-Robin and Recursive Task Distributions  96
6.1.2 Dynamic Task Distribution  99
6.1.2.1 Manager-Worker Model  99
6.1.2.2 Decentralized Task Distribution  101
6.2 Distribution of Data  101
6.3 Designing a Communication Scheme  104
6.3.1 Using Collective Communication  104
6.3.2 Using Point-to-Point Communication  105
6.4 Design Example: Matrix-Vector Multiplication  107
6.4.1 Using a Row-Distributed Matrix  108
6.4.2 Using a Block-Distributed Matrix  109
6.5 Summary of Key Points of Parallel Program Design  112
|
|
II Applications of Parallel Programming in Quantum Chemistry

7 Two-Electron Integral Evaluation  117
7.1 Basics of Integral Computation  117
7.2 Parallel Implementation Using Static Load Balancing  119
7.2.1 Parallel Algorithms Distributing Shell Quartets and Pairs  119
7.2.2 Performance Analysis  121
7.2.2.1 Determination of the Load Imbalance Factor k(p)  122
7.2.2.2 Determination of μ and σ for Integral Computation  123
7.2.2.3 Predicted and Measured Efficiencies  124
7.3 Parallel Implementation Using Dynamic Load Balancing  125
7.3.1 Parallel Algorithm Distributing Shell Pairs  126
7.3.2 Performance Analysis  128
7.3.2.2 Communication Time  128
7.3.2.3 Predicted and Measured Efficiencies  129
|
|
8 The Hartree–Fock Method  131
8.1 The Hartree–Fock Equations  131
8.2 The Hartree–Fock Procedure  133
8.3 Parallel Fock Matrix Formation with Replicated Data  135
8.4 Parallel Fock Matrix Formation with Distributed Data  138
|
9 Second-Order Møller–Plesset Perturbation Theory  147
9.1 The Canonical MP2 Equations  147
9.2 A Scalar Direct MP2 Algorithm  149
9.3 Parallelization with Minimal Modifications  151
9.4 High-Performance Parallelization  154
9.5 Performance of the Parallel Algorithms  158
|
10 Local Møller–Plesset Perturbation Theory  167
10.1 The LMP2 Equations  167
10.2 A Scalar LMP2 Algorithm  169
10.3 A Parallel LMP2 Algorithm  170
10.3.1 Two-Electron Integral Transformation  171
10.3.2 Computation of the Residual  173
10.3.3 Parallel Performance  174
|
|
Appendices

A A Brief Introduction to MPI  181
B Pthreads: Explicit Use of Threads  189
C OpenMP: Compiler Extensions for Multi-Threading  195

Index  205