| List of Figures |
|
xix | |
| List of Tables |
|
xxv | |
| Foreword |
|
xxvii | |
| Preface |
|
xxix | |
| Acknowledgments |
|
xxxi | |
| Contributors |
|
xxxix | |
| I Parallel I/O in Practice |
|
1 | (88) |
|
1 Parallel I/O at HPC Facilities |
|
|
3 | (2) |
|
|
|
2 National Energy Research Scientific Computing Center |
|
|
5 | (12) |
|
|
|
|
|
5 | (1) |
|
|
|
6 | (6) |
|
2.2.1 Local Scratch File Systems |
|
|
7 | (2) |
|
|
|
9 | (1) |
|
2.2.3 The NERSC Global File Systems |
|
|
10 | (1) |
|
|
|
11 | (1) |
|
2.3 Workflows, Workloads, and Applications |
|
|
12 | (2) |
|
|
|
14 | (3) |
|
3 National Center for Supercomputing Applications |
|
|
17 | (16) |
|
|
|
|
|
|
|
|
|
|
|
3.1 The Blue Waters Computational and Analysis Subsystems |
|
|
18 | (1) |
|
3.2 Blue Waters On-line Storage Subsystem |
|
|
19 | (5) |
|
3.2.1 On-line Storage Performance |
|
|
22 | (2) |
|
3.3 Blue Waters Near-line Storage Subsystem and External Server Subsystem |
|
|
24 | (4) |
|
3.4 Blue Waters Applications |
|
|
28 | (3) |
|
3.4.1 Science and Engineering Team Application I/O Requirements |
|
|
29 | (2) |
|
|
|
31 | (2) |
|
4 Argonne Leadership Computing Facility |
|
|
33 | (18) |
|
|
|
|
|
|
|
34 | (1) |
|
|
|
34 | (1) |
|
|
|
35 | (1) |
|
4.2 Overview of I/O at ALCF |
|
|
35 | (1) |
|
|
|
36 | (5) |
|
4.3.1 Intrepid: ALCF Blue Gene/P System |
|
|
37 | (2) |
|
4.3.2 Mira: ALCF Blue Gene/Q System |
|
|
39 | (2) |
|
|
|
41 | (3) |
|
|
|
41 | (1) |
|
|
|
41 | (1) |
|
|
|
42 | (1) |
|
|
|
42 | (1) |
|
|
|
43 | (1) |
|
|
|
43 | (1) |
|
4.5 Workloads/Applications |
|
|
44 | (3) |
|
|
|
46 | (1) |
|
4.6 Future I/O Plans at ALCF |
|
|
47 | (4) |
|
5 Livermore Computing Center |
|
|
51 | (14) |
|
|
|
|
|
|
|
51 | (2) |
|
5.2 The Lustre® Parallel File System: Early Developments |
|
|
53 | (1) |
|
5.3 Sequoia, Lustre® 2.0, and ZFS |
|
|
54 | (1) |
|
5.4 IBM Blue Gene Systems |
|
|
55 | (2) |
|
5.5 Sequoia File System Hardware |
|
|
57 | (2) |
|
5.6 Experience with ZFS-Based Lustre® and Sequoia in Production |
|
|
59 | (1) |
|
5.7 Sequoia I/O in Practice |
|
|
60 | (3) |
|
|
|
60 | (1) |
|
5.7.2 Recommendations to Application Developers |
|
|
61 | (1) |
|
5.7.3 SILO: LLNL's I/O Library |
|
|
62 | (1) |
|
5.7.4 Scalable Checkpoint/Restart |
|
|
62 | (1) |
|
|
|
63 | (2) |
|
6 Los Alamos National Laboratory |
|
|
65 | (14) |
|
|
|
|
|
65 | (1) |
|
6.1.1 Facilities and Environments |
|
|
66 | (1) |
|
|
|
66 | (6) |
|
6.2.1 Storage Environment |
|
|
67 | (1) |
|
6.2.2 Storage Area Networks |
|
|
67 | (1) |
|
6.2.3 Global Parallel Scratch File Systems |
|
|
68 | (2) |
|
6.2.4 The Curse of the Burst: Economic Thinking behind Burst Buffers |
|
|
70 | (2) |
|
6.3 Workloads and Applications |
|
|
72 | (3) |
|
6.3.1 Applications and Their Use of Storage |
|
|
72 | (1) |
|
6.3.2 I/O Patterns and the Quest for Performance without Giving Up |
|
|
73 | (1) |
|
6.3.3 Defeating N-to-1 Strided |
|
|
73 | (2) |
|
|
|
75 | (4) |
|
7 Texas Advanced Computing Center |
|
|
79 | (10) |
|
|
|
|
|
79 | (1) |
|
|
|
80 | (6) |
|
|
|
82 | (2) |
|
7.2.2 Parallel File Systems—A Shared Resource |
|
|
84 | (2) |
|
|
|
86 | (3) |
| II File Systems |
|
89 | (60) |
|
|
|
91 | (16) |
|
|
|
|
|
|
|
91 | (1) |
|
8.2 Design and Architecture |
|
|
92 | (11) |
|
|
|
92 | (1) |
|
|
|
93 | (1) |
|
|
|
93 | (1) |
|
|
|
94 | (2) |
|
8.2.3 Distributed Lock Manager |
|
|
96 | (1) |
|
|
|
97 | (1) |
|
|
|
98 | (1) |
|
8.2.6 Object Storage Server |
|
|
99 | (1) |
|
|
|
100 | (1) |
|
|
|
100 | (2) |
|
|
|
102 | (1) |
|
|
|
103 | (1) |
|
|
|
104 | (3) |
|
|
|
107 | (12) |
|
|
|
|
|
|
|
107 | (1) |
|
9.2 Design and Architecture |
|
|
108 | (8) |
|
9.2.1 Shared Storage Model |
|
|
108 | (2) |
|
|
|
110 | (1) |
|
9.2.3 Distributed Locking and Metadata Management |
|
|
111 | (1) |
|
9.2.3.1 The Distributed Lock Manager |
|
|
111 | (1) |
|
9.2.3.2 Metadata Management |
|
|
112 | (1) |
|
9.2.3.3 Concurrent Directory Updates |
|
|
113 | (1) |
|
9.2.4 Advanced Data Management |
|
|
114 | (1) |
|
|
|
114 | (1) |
|
9.2.4.2 Information Lifecycle Management |
|
|
114 | (1) |
|
9.2.4.3 Wide-Area Caching and Replication |
|
|
115 | (1) |
|
|
|
116 | (1) |
|
|
|
117 | (1) |
|
|
|
117 | (2) |
|
|
|
119 | (16) |
|
|
|
|
|
|
|
120 | (1) |
|
|
|
120 | (1) |
|
|
|
121 | (1) |
|
|
|
121 | (1) |
|
10.2 Design and Architecture |
|
|
121 | (10) |
|
|
|
121 | (1) |
|
10.2.2 OrangeFS Request Protocol |
|
|
122 | (1) |
|
10.2.3 File Structure Representation |
|
|
122 | (1) |
|
|
|
123 | (1) |
|
10.2.4 Bulk Messaging Interface |
|
|
123 | (1) |
|
|
|
124 | (1) |
|
|
|
124 | (1) |
|
10.2.7 Request State Machines |
|
|
124 | (1) |
|
10.2.8 Distributed File Metadata |
|
|
125 | (1) |
|
10.2.9 Distributed Directory Entry Metadata |
|
|
125 | (1) |
|
10.2.10 Capability-Based Security |
|
|
126 | (1) |
|
10.2.11 Clients and Interfaces |
|
|
126 | (4) |
|
10.2.12 Features under Development |
|
|
130 | (1) |
|
|
|
131 | (2) |
|
10.3.1 Cluster Shared Scratch |
|
|
132 | (1) |
|
10.3.2 Cluster Node Scratch |
|
|
132 | (1) |
|
10.3.3 Amazon Web Services |
|
|
133 | (1) |
|
|
|
133 | (2) |
|
|
|
135 | (14) |
|
|
|
|
|
135 | (1) |
|
|
|
136 | (2) |
|
|
|
137 | (1) |
|
|
|
138 | (1) |
|
|
|
138 | (1) |
|
|
|
138 | (1) |
|
11.3.3 Complete Cluster View |
|
|
139 | (1) |
|
11.4 OneFS Software Overview |
|
|
139 | (3) |
|
|
|
139 | (1) |
|
11.4.2 File System Structure |
|
|
139 | (2) |
|
|
|
141 | (1) |
|
|
|
142 | (3) |
|
|
|
142 | (1) |
|
|
|
142 | (1) |
|
|
|
143 | (1) |
|
11.5.4 N + M Data Protection |
|
|
143 | (2) |
|
11.6 Dynamic Scale/Scale on Demand |
|
|
145 | (2) |
|
11.6.1 Performance and Capacity |
|
|
145 | (2) |
|
|
|
147 | (2) |
| III I/O Libraries |
|
149 | (76) |
|
12 I/O Libraries: Past, Present and Future |
|
|
151 | (4) |
|
|
|
|
|
151 | (1) |
|
12.2 A Recent History of I/O Libraries, by Example |
|
|
152 | (1) |
|
12.3 What Is the Future of I/O Libraries? |
|
|
153 | (2) |
|
|
|
155 | (14) |
|
|
|
|
|
|
|
155 | (2) |
|
|
|
156 | (1) |
|
13.1.2 Parallel I/O in Practice |
|
|
156 | (1) |
|
13.2 Using MPI for Simple I/O |
|
|
157 | (2) |
|
13.2.1 Three Ways of File Access |
|
|
158 | (1) |
|
13.2.2 Blocking and Nonblocking I/O |
|
|
159 | (1) |
|
13.3 File Access with User Intent |
|
|
159 | (6) |
|
|
|
160 | (1) |
|
|
|
161 | (2) |
|
|
|
163 | (2) |
|
|
|
165 | (1) |
|
|
|
165 | (4) |
|
14 PLFS: Software-Defined Storage for HPC |
|
|
169 | (8) |
|
|
|
|
|
169 | (1) |
|
|
|
170 | (3) |
|
14.2.1 PLFS Shared File Mode |
|
|
170 | (2) |
|
14.2.2 PLFS Flat File Mode |
|
|
172 | (1) |
|
14.2.3 PLFS Small File Mode |
|
|
172 | (1) |
|
14.3 Deployment, Usage, and Applications |
|
|
173 | (2) |
|
|
|
174 | (1) |
|
14.3.2 Cloud File Systems for HPC |
|
|
174 | (1) |
|
|
|
175 | (2) |
|
|
|
177 | (8) |
|
|
|
|
|
177 | (2) |
|
15.2 History and Background |
|
|
179 | (1) |
|
15.3 Design and Architecture |
|
|
179 | (1) |
|
15.4 Deployment and Usage |
|
|
180 | (1) |
|
|
|
181 | (1) |
|
|
|
182 | (1) |
|
15.7 Additional Resources |
|
|
183 | (2) |
|
|
|
185 | (18) |
|
|
|
|
|
|
|
|
|
|
|
|
|
186 | (1) |
|
16.2 History and Background |
|
|
186 | (1) |
|
16.3 Design and Architecture |
|
|
187 | (5) |
|
16.3.1 The HDF5 Data Model |
|
|
188 | (3) |
|
|
|
191 | (1) |
|
16.3.3 The HDF5 File Format |
|
|
192 | (1) |
|
16.4 Usage and Applications |
|
|
192 | (7) |
|
|
|
193 | (1) |
|
|
|
193 | (1) |
|
|
|
193 | (1) |
|
|
|
194 | (1) |
|
|
|
194 | (1) |
|
|
|
194 | (1) |
|
|
|
195 | (1) |
|
|
|
195 | (1) |
|
|
|
197 | (2) |
|
|
|
199 | (1) |
|
16.6 Additional Resources |
|
|
199 | (4) |
|
|
|
203 | (12) |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
203 | (1) |
|
17.2 Design and Architecture |
|
|
204 | (1) |
|
17.3 Deployment, Usage, and Applications |
|
|
205 | (6) |
|
17.3.1 Checkpoint/Restart |
|
|
205 | (1) |
|
|
|
206 | (1) |
|
|
|
207 | (1) |
|
|
|
208 | (2) |
|
|
|
210 | (1) |
|
|
|
210 | (1) |
|
|
|
211 | (4) |
|
|
|
215 | (10) |
|
|
|
|
|
|
|
|
|
|
|
215 | (1) |
|
18.2 Design and Architecture |
|
|
216 | (4) |
|
18.2.1 Exploiting Network Topology and Reduced Synchronization for I/O |
|
|
217 | (1) |
|
18.2.2 Leveraging Application Data Semantics |
|
|
218 | (1) |
|
18.2.3 Asynchronous Data Staging |
|
|
219 | (1) |
|
18.2.4 Compression and Subfiling |
|
|
219 | (1) |
|
18.3 Deployment, Usage, and Applications |
|
|
220 | (3) |
|
18.3.1 Checkpoint, Restart, and Analysis I/O for HACC Cosmology |
|
|
220 | (1) |
|
18.3.2 Data Staging for FLASH Astrophysics |
|
|
221 | (1) |
|
18.3.3 Co-Visualization for PHASTA CFD Simulation |
|
|
222 | (1) |
|
|
|
223 | (2) |
| IV I/O Case Studies |
|
225 | (52) |
|
19 Parallel I/O for a Trillion-Particle Plasma Physics Simulation |
|
|
227 | (12) |
|
|
|
|
|
|
|
|
|
|
|
227 | (1) |
|
|
|
228 | (1) |
|
|
|
229 | (1) |
|
19.4 Software and Hardware |
|
|
229 | (1) |
|
|
|
229 | (1) |
|
|
|
230 | (1) |
|
19.5 Parallel I/O in VPIC |
|
|
230 | (2) |
|
|
|
232 | (4) |
|
19.6.1 Tuning Write Performance |
|
|
232 | (1) |
|
|
|
233 | (1) |
|
19.6.3 Tuning Lustre File System and MPI-I/O Parameters |
|
|
233 | (3) |
|
|
|
236 | (1) |
|
|
|
236 | (3) |
|
20 Stochastic Simulation Data Management |
|
|
239 | (10) |
|
|
|
|
|
239 | (1) |
|
|
|
240 | (1) |
|
|
|
241 | (2) |
|
20.4 Using HDF5 in Industrial Stochastic Simulations |
|
|
243 | (2) |
|
20.4.1 Data Model and Versioning |
|
|
244 | (1) |
|
|
|
244 | (1) |
|
|
|
244 | (1) |
|
|
|
244 | (1) |
|
|
|
245 | (1) |
|
20.4.6 Process and Thread Synchronization |
|
|
245 | (1) |
|
20.5 A (Near) Efficient Architecture Using HDF5 |
|
|
245 | (2) |
|
|
|
247 | (1) |
|
|
|
248 | (1) |
|
21 Silo: A General-Purpose API and Scientific Database |
|
|
249 | (10) |
|
|
|
21.1 Canonical Use Case: ALE3D Restart and VisIt Visualization Workflow |
|
|
250 | (1) |
|
21.2 Software, Hardware, and Performance |
|
|
251 | (1) |
|
21.3 MIF and SSF Scalable I/O Paradigms |
|
|
252 | (4) |
|
21.4 Successes with HDF5 as Middleware |
|
|
256 | (1) |
|
|
|
257 | (2) |
|
22 Scaling Up Parallel I/O in S3D to 100-K Cores with ADIOS |
|
|
259 | (12) |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
259 | (1) |
|
22.2 Software and Hardware |
|
|
260 | (9) |
|
|
|
261 | (1) |
|
22.2.2 Staged Write Method |
|
|
262 | (1) |
|
22.2.3 Group-Based Hierarchical I/O Control |
|
|
262 | (1) |
|
22.2.4 Aggregation and Subfiling |
|
|
263 | (3) |
|
|
|
266 | (1) |
|
22.2.6 Staged Read Method |
|
|
266 | (1) |
|
|
|
266 | (1) |
|
|
|
267 | (1) |
|
|
|
268 | (1) |
|
|
|
269 | (2) |
|
23 In-Transit Processing: Data Analysis Using Burst Buffers |
|
|
271 | (6) |
|
|
|
|
|
|
|
|
|
271 | (2) |
|
|
|
273 | (1) |
|
23.3 Systems Prototypes Related to Burst Buffers |
|
|
274 | (1) |
|
|
|
275 | (2) |
| V I/O Profiling Tools |
|
277 | (46) |
|
24 Overview of I/O Benchmarking |
|
|
279 | (10) |
|
|
|
|
|
|
|
279 | (1) |
|
|
|
280 | (3) |
|
24.3 Why Profile I/O in Scientific Applications? |
|
|
283 | (1) |
|
24.4 Brief Introduction to I/O Profilers |
|
|
283 | (1) |
|
24.5 I/O Profiling at NERSC |
|
|
284 | (3) |
|
24.5.1 Application Profiling Case Studies |
|
|
284 | (1) |
|
24.5.1.1 Checkpointing Too Frequently |
|
|
285 | (1) |
|
24.5.1.2 Reading Small Input Files from Every Rank |
|
|
286 | (1) |
|
24.5.1.3 Using the Wrong File System |
|
|
286 | (1) |
|
|
|
287 | (2) |
|
|
|
289 | (8) |
|
|
|
|
|
|
|
289 | (1) |
|
|
|
290 | (2) |
|
25.2.1 MPI-IO Instrumentation |
|
|
291 | (1) |
|
25.2.2 Runtime Preloading of Instrumented Library |
|
|
291 | (1) |
|
25.2.3 Linker-Based Instrumentation |
|
|
291 | (1) |
|
25.2.4 Instrumented External I/O Libraries |
|
|
292 | (1) |
|
|
|
292 | (2) |
|
|
|
294 | (3) |
|
26 Integrated Performance Monitoring |
|
|
297 | (12) |
|
|
|
|
|
297 | (4) |
|
|
|
301 | (4) |
|
26.2.1 Chombo's ftruncate |
|
|
301 | (1) |
|
26.2.2 MADBENCH and File System Health |
|
|
302 | (1) |
|
|
|
303 | (1) |
|
26.2.4 HPC Workload Studies |
|
|
304 | (1) |
|
|
|
305 | (4) |
|
|
|
309 | (8) |
|
|
|
|
|
309 | (2) |
|
|
|
311 | (2) |
|
|
|
313 | (4) |
|
|
|
317 | (6) |
|
|
|
|
|
|
|
|
|
317 | (1) |
|
|
|
318 | (3) |
|
|
|
321 | (2) |
| VI Future Trends |
|
323 | (62) |
|
29 Parallel Computing Trends for the Coming Decade |
|
|
325 | (8) |
|
|
|
|
|
326 | (3) |
|
29.1.1 Classical Scaling Period (1965-2004) |
|
|
326 | (1) |
|
29.1.2 End of Classical Scaling (2004) |
|
|
326 | (2) |
|
29.1.3 Toward Data-Centric Computing (2014-2022) |
|
|
328 | (1) |
|
29.2 Implications for the Future of Storage Systems |
|
|
329 | (2) |
|
|
|
331 | (2) |
|
30 Storage Models: Past, Present, and Future |
|
|
333 | (12) |
|
|
|
|
|
|
|
334 | (1) |
|
30.2 The Current HPC Storage Model |
|
|
335 | (3) |
|
30.2.1 The POSIX HPC I/O Extensions |
|
|
335 | (2) |
|
|
|
337 | (1) |
|
30.2.3 Object Storage Model |
|
|
337 | (1) |
|
|
|
338 | (3) |
|
|
|
338 | (1) |
|
30.3.2 Object Abstractions in HPC |
|
|
339 | (2) |
|
|
|
341 | (1) |
|
|
|
341 | (4) |
|
|
|
345 | (8) |
|
|
|
|
|
|
|
346 | (2) |
|
31.1.1 Getting the Correct Answer |
|
|
347 | (1) |
|
|
|
348 | (2) |
|
|
|
350 | (3) |
|
|
|
353 | (10) |
|
|
|
|
|
|
|
|
|
|
|
|
|
353 | (1) |
|
32.2 Storage I/O at Present |
|
|
354 | (1) |
|
32.3 Storage I/O in the Near Future |
|
|
355 | (1) |
|
32.4 Challenges and Solutions |
|
|
356 | (4) |
|
|
|
356 | (1) |
|
32.4.2 Improving I/O Caching Efficiency |
|
|
357 | (2) |
|
32.4.3 Dynamic I/O Scheduler Selection |
|
|
359 | (1) |
|
|
|
360 | (3) |
|
33 Storage Networks and Interconnects |
|
|
363 | (6) |
|
|
|
|
|
33.1 Current State of Technology |
|
|
363 | (2) |
|
|
|
365 | (1) |
|
33.3 Challenges and Solutions |
|
|
366 | (1) |
|
|
|
367 | (2) |
|
|
|
369 | (16) |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
370 | (1) |
|
34.2 Power Use in Recent and Current Supercomputers |
|
|
370 | (7) |
|
|
|
371 | (1) |
|
|
|
371 | (2) |
|
|
|
373 | (1) |
|
|
|
374 | (1) |
|
34.2.5 Overall Survey Results |
|
|
374 | (3) |
|
34.2.6 Extrapolation to Exascale |
|
|
377 | (1) |
|
34.3 How I/O Changes at Exascale |
|
|
377 | (3) |
|
34.3.1 Introducing More Asynchrony in the File System |
|
|
378 | (1) |
|
34.3.1.1 The Burst Buffer |
|
|
378 | (1) |
|
34.3.1.2 Sirocco: A File System for Heterogeneous Media |
|
|
378 | (1) |
|
34.3.2 Guarding against Single-Node Failures and Soft Errors |
|
|
379 | (1) |
|
|
|
380 | (5) |
| Index |
|
385 | |