Foreword |
|
xxvii | |
Preface |
|
xxix | |
Acknowledgments |
|
xxxv | |
About the Author |
|
xxxvii | |
I Introduction to Hadoop-Architecture and Hadoop Clusters |
|
1 | (126) |
|
1 Introduction to Hadoop and Its Environment |
|
|
3 | (30) |
|
|
4 | (8) |
|
Unique Features of Hadoop |
|
|
5 | (1) |
|
|
5 | (2) |
|
A Typical Scenario for Using Hadoop |
|
|
7 | (1) |
|
Traditional Database Systems |
|
|
7 | (2) |
|
|
9 | (2) |
|
Big Data, Science and Hadoop |
|
|
11 | (1) |
|
Cluster Computing and Hadoop Clusters |
|
|
12 | (3) |
|
|
12 | (1) |
|
|
13 | (2) |
|
Hadoop Components and the Hadoop Ecosphere |
|
|
15 | (3) |
|
What Do Hadoop Administrators Do? |
|
|
18 | (3) |
|
Hadoop Administration-A New Paradigm |
|
|
18 | (2) |
|
What You Need to Know to Administer Hadoop |
|
|
20 | (1) |
|
The Hadoop Administrator's Toolset |
|
|
21 | (1) |
|
Key Differences between Hadoop 1 and Hadoop 2 |
|
|
21 | (3) |
|
Architectural Differences |
|
|
22 | (1) |
|
High-Availability Features |
|
|
22 | (1) |
|
Multiple Processing Engines |
|
|
23 | (1) |
|
Separation of Processing and Scheduling |
|
|
23 | (1) |
|
Resource Allocation in Hadoop 1 and Hadoop 2 |
|
|
24 | (1) |
|
Distributed Data Processing: MapReduce and Spark, Hive and Pig |
|
|
24 | (3) |
|
|
24 | (1) |
|
|
25 | (1) |
|
|
26 | (1) |
|
|
26 | (1) |
|
Data Integration: Apache Sqoop, Apache Flume and Apache Kafka |
|
|
27 | (1) |
|
Key Areas of Hadoop Administration |
|
|
28 | (3) |
|
Managing the Cluster Storage |
|
|
28 | (1) |
|
Allocating the Cluster Resources |
|
|
28 | (1) |
|
|
29 | (1) |
|
|
30 | (1) |
|
|
31 | (2) |
|
2 An Introduction to the Architecture of Hadoop |
|
|
33 | (26) |
|
Distributed Computing and Hadoop |
|
|
33 | (1) |
|
|
34 | (3) |
|
|
35 | (1) |
|
|
36 | (1) |
|
|
36 | (1) |
|
Data Storage-The Hadoop Distributed File System |
|
|
37 | (11) |
|
|
37 | (1) |
|
|
38 | (2) |
|
|
40 | (3) |
|
|
43 | (5) |
|
Data Processing with YARN, the Hadoop Operating System |
|
|
48 | (9) |
|
|
49 | (4) |
|
How the ApplicationMaster Works with the ResourceManager to Allocate Resources |
|
|
53 | (4) |
|
|
57 | (2) |
|
3 Creating and Configuring a Simple Hadoop Cluster |
|
|
59 | (32) |
|
Hadoop Distributions and Installation Types |
|
|
60 | (2) |
|
|
60 | (1) |
|
Hadoop Installation Types |
|
|
61 | (1) |
|
Setting Up a Pseudo-Distributed Hadoop Cluster |
|
|
62 | (9) |
|
Meeting the Operating System Requirements |
|
|
63 | (1) |
|
Modifying Kernel Parameters |
|
|
64 | (4) |
|
|
68 | (1) |
|
|
69 | (1) |
|
Installing the Hadoop Software |
|
|
70 | (1) |
|
Creating the Necessary Hadoop Users |
|
|
70 | (1) |
|
Creating the Necessary Directories |
|
|
71 | (1) |
|
Performing the Initial Hadoop Configuration |
|
|
71 | (15) |
|
Environment Configuration Files |
|
|
73 | (1) |
|
Read-Only Default Configuration Files |
|
|
74 | (1) |
|
Site-Specific Configuration Files |
|
|
74 | (1) |
|
Other Hadoop-Related Configuration Files |
|
|
74 | (2) |
|
Precedence among the Configuration Files |
|
|
76 | (2) |
|
Variable Expansion and Configuration Parameters |
|
|
78 | (1) |
|
Configuring the Hadoop Daemons Environment |
|
|
79 | (2) |
|
Configuring Core Hadoop Properties (with the core-site.xml File) |
|
|
81 | (1) |
|
Configuring MapReduce (with the mapred-site.xml File) |
|
|
82 | (1) |
|
Configuring YARN (with the yarn-site.xml File) |
|
|
83 | (3) |
|
Operating the New Hadoop Cluster |
|
|
86 | (4) |
|
Formatting the Distributed File System |
|
|
86 | (1) |
|
Setting the Environment Variables |
|
|
87 | (1) |
|
Starting the HDFS and YARN Services |
|
|
87 | (2) |
|
Verifying the Service Startup |
|
|
89 | (1) |
|
Shutting Down the Services |
|
|
90 | (1) |
|
|
90 | (1) |
|
4 Planning for and Creating a Fully Distributed Cluster |
|
|
91 | (36) |
|
Planning Your Hadoop Cluster |
|
|
92 | (3) |
|
General Cluster Planning Considerations |
|
|
92 | (2) |
|
|
94 | (1) |
|
Criteria for Choosing the Nodes |
|
|
94 | (1) |
|
Going from a Single Rack to Multiple Racks |
|
|
95 | (7) |
|
|
96 | (1) |
|
General Principles Governing the Choice of CPU, Memory and Storage |
|
|
96 | (3) |
|
Special Treatment for the Master Nodes |
|
|
99 | (1) |
|
Recommendations for Sizing the Servers |
|
|
100 | (1) |
|
|
101 | (1) |
|
Guidelines for Large Clusters |
|
|
101 | (1) |
|
Creating a Multinode Cluster |
|
|
102 | (4) |
|
How the Test Cluster Is Set Up |
|
|
102 | (4) |
|
Modifying the Hadoop Configuration |
|
|
106 | (8) |
|
Changing the HDFS Configuration (hdfs-site.xml file) |
|
|
106 | (3) |
|
Changing the YARN Configuration |
|
|
109 | (4) |
|
Changing the MapReduce Configuration |
|
|
113 | (1) |
|
|
114 | (5) |
|
Starting Up and Shutting Down the Cluster with Scripts |
|
|
116 | (2) |
|
Performing a Quick Check of the New Cluster's File System |
|
|
118 | (1) |
|
Configuring Hadoop Services, Web Interfaces and Ports |
|
|
119 | (7) |
|
Service Configuration and Web Interfaces |
|
|
119 | (3) |
|
Setting Port Numbers for Hadoop Services |
|
|
122 | (2) |
|
|
124 | (2) |
|
|
126 | (1) |
II Hadoop Application Frameworks |
|
127 | (76) |
|
5 Running Applications in a Cluster-The MapReduce Framework (and Hive and Pig) |
|
|
129 | (18) |
|
|
129 | (12) |
|
|
130 | (1) |
|
|
131 | (2) |
|
|
133 | (2) |
|
A Simple MapReduce Program |
|
|
135 | (1) |
|
Understanding Hadoop's Job Processing-Running a WordCount Program |
|
|
136 | (1) |
|
MapReduce Input and Output Directories |
|
|
137 | (1) |
|
How Hadoop Shows You the Job Details |
|
|
137 | (2) |
|
|
139 | (2) |
|
|
141 | (3) |
|
|
142 | (1) |
|
|
142 | (1) |
|
|
142 | (1) |
|
|
143 | (1) |
|
|
144 | (1) |
|
|
144 | (1) |
|
|
145 | (1) |
|
|
145 | (2) |
|
6 Running Applications in a Cluster-The Spark Framework |
|
|
147 | (22) |
|
|
148 | (1) |
|
|
149 | (4) |
|
|
149 | (2) |
|
Ease of Use and Accessibility |
|
|
151 | (1) |
|
General-Purpose Framework |
|
|
152 | (1) |
|
|
153 | (1) |
|
|
153 | (2) |
|
|
155 | (3) |
|
|
157 | (1) |
|
Key Spark Files and Directories |
|
|
157 | (1) |
|
Compiling the Spark Binaries |
|
|
157 | (1) |
|
Reducing Spark's Verbosity |
|
|
158 | (1) |
|
|
158 | (1) |
|
|
158 | (1) |
|
|
158 | (1) |
|
Understanding the Cluster Managers |
|
|
159 | (5) |
|
The Standalone Cluster Manager |
|
|
159 | (2) |
|
|
161 | (1) |
|
|
162 | (1) |
|
How YARN and Spark Work Together |
|
|
163 | (1) |
|
Setting Up Spark on a Hadoop Cluster |
|
|
163 | (1) |
|
|
164 | (3) |
|
Loading Data from the Linux File System |
|
|
164 | (1) |
|
|
164 | (2) |
|
Loading Data from a Relational Database |
|
|
166 | (1) |
|
|
167 | (2) |
|
7 Running Spark Applications |
|
|
169 | (34) |
|
The Spark Programming Model |
|
|
169 | (4) |
|
Spark Programming and RDDs |
|
|
169 | (3) |
|
|
172 | (1) |
|
|
173 | (6) |
|
|
174 | (1) |
|
|
174 | (2) |
|
|
176 | (3) |
|
|
179 | (1) |
|
Architecture of a Spark Application |
|
|
179 | (2) |
|
|
180 | (1) |
|
Components of a Spark Application |
|
|
180 | (1) |
|
Running Spark Applications Interactively |
|
|
181 | (4) |
|
Spark Shell and Spark Applications |
|
|
181 | (1) |
|
A Bit about the Spark Shell |
|
|
182 | (1) |
|
|
182 | (3) |
|
Overview of Spark Cluster Execution |
|
|
185 | (1) |
|
Creating and Submitting Spark Applications |
|
|
185 | (7) |
|
Building the Spark Application |
|
|
186 | (1) |
|
Running an Application in the Standalone Spark Cluster |
|
|
186 | (1) |
|
Using spark-submit to Execute Applications |
|
|
187 | (2) |
|
Running Spark Applications on Mesos |
|
|
189 | (1) |
|
Running Spark Applications in a YARN-Managed Hadoop Cluster |
|
|
189 | (2) |
|
Using the JDBC/ODBC Server |
|
|
191 | (1) |
|
Configuring Spark Applications |
|
|
192 | (2) |
|
Spark Configuration Properties |
|
|
192 | (1) |
|
Specifying Configuration when Running spark-submit |
|
|
193 | (1) |
|
Monitoring Spark Applications |
|
|
194 | (1) |
|
Handling Streaming Data with Spark Streaming |
|
|
194 | (4) |
|
How Spark Streaming Works |
|
|
195 | (2) |
|
A Spark Streaming Example-WordCount Again! |
|
|
197 | (1) |
|
Using Spark SQL for Handling Structured Data |
|
|
198 | (3) |
|
|
198 | (1) |
|
HiveContext and SQLContext |
|
|
198 | (1) |
|
|
199 | (1) |
|
|
200 | (1) |
|
|
201 | (2) |
III Managing and Protecting Hadoop Data and High Availability |
|
203 | (150) |
|
8 The Role of the NameNode and How HDFS Works |
|
|
205 | (38) |
|
HDFS-The Interaction between the NameNode and the DataNodes |
|
|
205 | (4) |
|
Interaction between the Clients and HDFS |
|
|
206 | (1) |
|
NameNode and DataNode Communications |
|
|
207 | (2) |
|
Rack Awareness and Topology |
|
|
209 | (3) |
|
How to Configure Rack Awareness in Your Cluster |
|
|
210 | (1) |
|
Finding Your Cluster's Rack Information |
|
|
210 | (2) |
|
|
212 | (6) |
|
HDFS Data Organization and Data Blocks |
|
|
213 | (1) |
|
|
213 | (3) |
|
|
216 | (2) |
|
How Clients Read and Write HDFS Data |
|
|
218 | (6) |
|
How Clients Read HDFS Data |
|
|
219 | (1) |
|
How Clients Write Data to HDFS |
|
|
220 | (4) |
|
Understanding HDFS Recovery Processes |
|
|
224 | (3) |
|
|
224 | (1) |
|
|
224 | (2) |
|
|
226 | (1) |
|
|
226 | (1) |
|
Centralized Cache Management in HDFS |
|
|
227 | (5) |
|
Hadoop and OS Page Caching |
|
|
228 | (1) |
|
The Key Principles Behind Centralized Cache Management |
|
|
228 | (1) |
|
How Centralized Cache Management Works |
|
|
229 | (1) |
|
|
229 | (1) |
|
|
230 | (1) |
|
|
230 | (1) |
|
|
231 | (1) |
|
Hadoop Archival Storage, SSD and Memory (Heterogeneous Storage) |
|
|
232 | (9) |
|
Performance Characteristics of Storage Types |
|
|
233 | (1) |
|
The Need for Heterogeneous HDFS Storage |
|
|
233 | (1) |
|
Changes in the Storage Architecture |
|
|
234 | (1) |
|
Storage Preferences for Files |
|
|
235 | (1) |
|
Setting Up Archival Storage |
|
|
235 | (4) |
|
Managing Storage Policies |
|
|
239 | (1) |
|
|
239 | (1) |
|
Implementing Archival Storage |
|
|
240 | (1) |
|
|
241 | (2) |
|
9 HDFS Commands, HDFS Permissions and HDFS Storage |
|
|
243 | (34) |
|
Managing HDFS through the HDFS Shell Commands |
|
|
243 | (8) |
|
Using the hdfs dfs Utility to Manage HDFS |
|
|
245 | (2) |
|
Listing HDFS Files and Directories |
|
|
247 | (2) |
|
Creating an HDFS Directory |
|
|
249 | (1) |
|
Removing HDFS Files and Directories |
|
|
249 | (1) |
|
Changing File and Directory Ownership and Groups |
|
|
250 | (1) |
|
Using the dfsadmin Utility to Perform HDFS Operations |
|
|
251 | (4) |
|
The dfsadmin report Command |
|
|
252 | (3) |
|
Managing HDFS Permissions and Users |
|
|
255 | (5) |
|
|
255 | (2) |
|
HDFS Users and Super Users |
|
|
257 | (3) |
|
|
260 | (7) |
|
|
260 | (3) |
|
Allocating HDFS Space Quotas |
|
|
263 | (4) |
|
|
267 | (7) |
|
Reasons for HDFS Data Imbalance |
|
|
268 | (1) |
|
Running the Balancer Tool to Balance HDFS Data |
|
|
268 | (3) |
|
Using hdfs dfsadmin to Make Things Easier |
|
|
271 | (2) |
|
|
273 | (1) |
|
|
274 | (2) |
|
Removing Files and Directories |
|
|
274 | (1) |
|
Decreasing the Replication Factor |
|
|
274 | (2) |
|
|
276 | (1) |
|
10 Data Protection, File Formats and Accessing HDFS |
|
|
277 | (40) |
|
|
278 | (11) |
|
Using HDFS Trash to Prevent Accidental Data Deletion |
|
|
278 | (2) |
|
Using HDFS Snapshots to Protect Important Data |
|
|
280 | (4) |
|
Ensuring Data Integrity with File System Checks |
|
|
284 | (5) |
|
|
289 | (6) |
|
Common Compression Formats |
|
|
290 | (1) |
|
Evaluating the Various Compression Schemes |
|
|
291 | (1) |
|
Compression at Various Stages for MapReduce |
|
|
291 | (4) |
|
|
295 | (1) |
|
|
295 | (1) |
|
|
295 | (13) |
|
Criteria for Determining the Right File Format |
|
|
296 | (2) |
|
File Formats Supported by Hadoop |
|
|
298 | (4) |
|
|
302 | (1) |
|
The Hadoop Small Files Problem and Merging Files |
|
|
303 | (1) |
|
Using a Federated NameNode to Overcome the Small Files Problem |
|
|
304 | (1) |
|
Using Hadoop Archives to Manage Many Small Files |
|
|
304 | (3) |
|
Handling the Performance Impact of Small Files |
|
|
307 | (1) |
|
Using Hadoop WebHDFS and HttpFS |
|
|
308 | (7) |
|
WebHDFS-The Hadoop REST API |
|
|
308 | (1) |
|
|
309 | (1) |
|
Understanding the WebHDFS Commands |
|
|
310 | (3) |
|
Using HttpFS Gateway to Access HDFS from Behind a Firewall |
|
|
313 | (2) |
|
|
315 | (2) |
|
11 NameNode Operations, High Availability and Federation |
|
|
317 | (36) |
|
Understanding NameNode Operations |
|
|
318 | (5) |
|
|
319 | (2) |
|
The NameNode Startup Process |
|
|
321 | (1) |
|
How the NameNode and the DataNodes Work Together |
|
|
322 | (1) |
|
The Checkpointing Process |
|
|
323 | (6) |
|
Secondary, Checkpoint, Backup and Standby Nodes |
|
|
324 | (1) |
|
Configuring the Checkpointing Frequency |
|
|
325 | (2) |
|
Managing Checkpoint Performance |
|
|
327 | (1) |
|
The Mechanics of Checkpointing |
|
|
327 | (2) |
|
NameNode Safe Mode Operations |
|
|
329 | (5) |
|
Automatic Safe Mode Operations |
|
|
329 | (1) |
|
Placing the NameNode in Safe Mode |
|
|
330 | (1) |
|
How the NameNode Transitions Through Safe Mode |
|
|
331 | (1) |
|
Backing Up and Recovering the NameNode Metadata |
|
|
332 | (2) |
|
Configuring HDFS High Availability |
|
|
334 | (15) |
|
NameNode HA Architecture (QJM) |
|
|
335 | (2) |
|
Setting Up an HDFS HA Quorum Cluster |
|
|
337 | (5) |
|
Deploying the High-Availability NameNodes |
|
|
342 | (3) |
|
Managing an HA NameNode Setup |
|
|
345 | (1) |
|
HA Manual and Automatic Failover |
|
|
346 | (3) |
|
|
349 | (2) |
|
Architecture of a Federated NameNode |
|
|
350 | (1) |
|
|
351 | (2) |
IV Moving Data, Allocating Resources, Scheduling Jobs and Security |
|
353 | (174) |
|
12 Moving Data Into and Out of Hadoop |
|
|
355 | (52) |
|
Introduction to Hadoop Data Transfer Tools |
|
|
355 | (1) |
|
Loading Data into HDFS from the Command Line |
|
|
356 | (5) |
|
Using the -cat Command to Dump a File's Contents |
|
|
356 | (1) |
|
|
357 | (1) |
|
Copying and Moving Files from and to HDFS |
|
|
358 | (1) |
|
Using the -get Command to Move Files |
|
|
359 | (1) |
|
Moving Files from and to HDFS |
|
|
360 | (1) |
|
Using the -tail and -head Commands |
|
|
360 | (1) |
|
Copying HDFS Data between Clusters with DistCp |
|
|
361 | (4) |
|
How to Use the DistCp Command to Move Data |
|
|
361 | (2) |
|
|
363 | (2) |
|
Ingesting Data from Relational Databases with Sqoop |
|
|
365 | (23) |
|
|
366 | (1) |
|
|
367 | (1) |
|
|
368 | (1) |
|
Importing Data with Sqoop |
|
|
368 | (11) |
|
|
379 | (2) |
|
Exporting Data with Sqoop |
|
|
381 | (7) |
|
Ingesting Data from External Sources with Flume |
|
|
388 | (10) |
|
Flume Architecture in a Nutshell |
|
|
389 | (2) |
|
Configuring the Flume Agent |
|
|
391 | (1) |
|
|
392 | (2) |
|
Using Flume to Move Data to HDFS |
|
|
394 | (1) |
|
A More Complex Flume Example |
|
|
395 | (3) |
|
Ingesting Data with Kafka |
|
|
398 | (8) |
|
Benefits Offered by Kafka |
|
|
398 | (1) |
|
|
399 | (2) |
|
Setting Up an Apache Kafka Cluster |
|
|
401 | (3) |
|
Integrating Kafka with Hadoop and Storm |
|
|
404 | (2) |
|
|
406 | (1) |
|
13 Resource Allocation in a Hadoop Cluster |
|
|
407 | (30) |
|
Resource Allocation in Hadoop |
|
|
407 | (3) |
|
Managing Cluster Workloads |
|
|
408 | (1) |
|
Hadoop's Resource Schedulers |
|
|
409 | (1) |
|
|
410 | (1) |
|
|
411 | (15) |
|
|
412 | (6) |
|
How the Cluster Allocates Resources |
|
|
418 | (3) |
|
|
421 | (1) |
|
Enabling the Capacity Scheduler |
|
|
422 | (1) |
|
A Typical Capacity Scheduler |
|
|
422 | (4) |
|
|
426 | (9) |
|
|
427 | (1) |
|
Configuring the Fair Scheduler |
|
|
428 | (2) |
|
How Jobs Are Placed into Queues |
|
|
430 | (1) |
|
Application Preemption in the Fair Scheduler |
|
|
431 | (1) |
|
Security and Resource Pools |
|
|
432 | (1) |
|
A Sample fair-scheduler.xml File |
|
|
432 | (2) |
|
Submitting Jobs to the Scheduler |
|
|
434 | (1) |
|
Moving Applications between Queues |
|
|
434 | (1) |
|
Monitoring the Fair Scheduler |
|
|
434 | (1) |
|
Comparing the Capacity Scheduler and the Fair Scheduler |
|
|
435 | (1) |
|
Similarities between the Two Schedulers |
|
|
435 | (1) |
|
Differences between the Two Schedulers |
|
|
435 | (1) |
|
|
436 | (1) |
|
14 Working with Oozie to Manage Job Workflows |
|
|
437 | (40) |
|
Using Apache Oozie to Schedule Jobs |
|
|
437 | (2) |
|
|
439 | (2) |
|
|
439 | (1) |
|
|
440 | (1) |
|
|
440 | (1) |
|
Deploying Oozie in Your Cluster |
|
|
441 | (5) |
|
Installing and Configuring Oozie |
|
|
442 | (2) |
|
Configuring Hadoop for Oozie |
|
|
444 | (2) |
|
Understanding Oozie Workflows |
|
|
446 | (3) |
|
Workflows, Control Flow, and Nodes |
|
|
446 | (1) |
|
Defining the Workflows with the workflow.xml File |
|
|
447 | (2) |
|
|
449 | (5) |
|
Configuring the Action Nodes |
|
|
449 | (5) |
|
Creating an Oozie Workflow |
|
|
454 | (7) |
|
Configuring the Control Nodes |
|
|
456 | (4) |
|
|
460 | (1) |
|
Running an Oozie Workflow Job |
|
|
461 | (3) |
|
Specifying the Job Properties |
|
|
461 | (2) |
|
|
463 | (1) |
|
Creating Dynamic Workflows |
|
|
463 | (1) |
|
|
464 | (6) |
|
|
465 | (2) |
|
|
467 | (2) |
|
Time-and-Data-Based Coordinators |
|
|
469 | (1) |
|
Submitting the Oozie Coordinator from the Command Line |
|
|
469 | (1) |
|
Managing and Administering Oozie |
|
|
470 | (5) |
|
Common Oozie Commands and How to Run Them |
|
|
471 | (2) |
|
|
473 | (1) |
|
Oozie cron Scheduling and Oozie Service Level Agreements |
|
|
474 | (1) |
|
|
475 | (2) |
|
|
477 | (50) |
|
Hadoop Security-An Overview |
|
|
478 | (3) |
|
Authentication, Authorization and Accounting |
|
|
480 | (1) |
|
Hadoop Authentication with Kerberos |
|
|
481 | (24) |
|
Kerberos and How It Works |
|
|
482 | (1) |
|
The Kerberos Authentication Process |
|
|
483 | (1) |
|
|
484 | (1) |
|
|
485 | (1) |
|
Adding Kerberos Authorization to Your Cluster |
|
|
486 | (4) |
|
Setting Up Kerberos for Hadoop |
|
|
490 | (5) |
|
Securing a Hadoop Cluster with Kerberos |
|
|
495 | (6) |
|
How Kerberos Authenticates Users and Services |
|
|
501 | (1) |
|
Managing a Kerberized Hadoop Cluster |
|
|
501 | (4) |
|
|
505 | (13) |
|
|
505 | (5) |
|
Service Level Authorization |
|
|
510 | (2) |
|
Role-Based Authorization with Apache Sentry |
|
|
512 | (6) |
|
|
518 | (2) |
|
|
519 | (1) |
|
|
519 | (1) |
|
|
520 | (4) |
|
HDFS Transparent Encryption |
|
|
520 | (3) |
|
Encrypting Data in Transition |
|
|
523 | (1) |
|
Other Hadoop-Related Security Initiatives |
|
|
524 | (1) |
|
Securing a Hadoop Infrastructure with Apache Knox Gateway |
|
|
524 | (1) |
|
Apache Ranger for Security Administration |
|
|
525 | (1) |
|
|
525 | (2) |
V Monitoring, Optimization and Troubleshooting |
|
527 | (220) |
|
16 Managing Jobs, Using Hue and Performing Routine Tasks |
|
|
529 | (40) |
|
Using the YARN Commands to Manage Hadoop Jobs |
|
|
530 | (5) |
|
Viewing YARN Applications |
|
|
531 | (1) |
|
Checking the Status of an Application |
|
|
532 | (1) |
|
Killing a Running Application |
|
|
532 | (1) |
|
Checking the Status of the Nodes |
|
|
533 | (1) |
|
|
533 | (1) |
|
Getting the Application Logs |
|
|
533 | (1) |
|
YARN Administrative Commands |
|
|
534 | (1) |
|
Decommissioning and Recommissioning Nodes |
|
|
535 | (6) |
|
Including and Excluding Hosts |
|
|
536 | (1) |
|
Decommissioning DataNodes and NodeManagers |
|
|
537 | (2) |
|
|
539 | (1) |
|
Things to Remember about Decommissioning and Recommissioning |
|
|
539 | (1) |
|
Adding a New DataNode and/or a NodeManager |
|
|
540 | (1) |
|
ResourceManager High Availability |
|
|
541 | (4) |
|
ResourceManager High-Availability Architecture |
|
|
541 | (1) |
|
Setting Up ResourceManager High Availability |
|
|
542 | (1) |
|
|
543 | (2) |
|
Using the ResourceManager High-Availability Commands |
|
|
545 | (1) |
|
Performing Common Management Tasks |
|
|
545 | (3) |
|
Moving the NameNode to a Different Host |
|
|
545 | (1) |
|
Managing High-Availability NameNodes |
|
|
546 | (1) |
|
Using a Shutdown/Startup Script to Manage Your Cluster |
|
|
546 | (1) |
|
|
546 | (1) |
|
Balancing the Storage on the DataNodes |
|
|
547 | (1) |
|
Managing the MySQL Database |
|
|
548 | (3) |
|
Configuring a MySQL Database |
|
|
548 | (1) |
|
Configuring MySQL High Availability |
|
|
549 | (2) |
|
Backing Up Important Cluster Data |
|
|
551 | (2) |
|
|
552 | (1) |
|
Backing Up the Metastore Databases |
|
|
553 | (1) |
|
Using Hue to Administer Your Cluster |
|
|
553 | (9) |
|
Allowing Your Users to Use Hue |
|
|
554 | (2) |
|
|
556 | (1) |
|
Configuring Your Cluster to Work with Hue |
|
|
557 | (4) |
|
|
561 | (1) |
|
|
561 | (1) |
|
Implementing Specialized HDFS Features |
|
|
562 | (5) |
|
Deploying HDFS and YARN in a Multihomed Network |
|
|
562 | (1) |
|
Short-Circuit Local Reads |
|
|
563 | (1) |
|
|
564 | (2) |
|
Using an NFS Gateway for Mounting HDFS to a Local File System |
|
|
566 | (1) |
|
|
567 | (2) |
|
17 Monitoring, Metrics and Hadoop Logging |
|
|
569 | (42) |
|
|
570 | (6) |
|
Basics of Linux System Monitoring |
|
|
570 | (2) |
|
Monitoring Tools for Linux Systems |
|
|
572 | (4) |
|
|
576 | (3) |
|
|
577 | (1) |
|
|
578 | (1) |
|
Capturing Metrics to a File System |
|
|
578 | (1) |
|
Using Ganglia for Monitoring |
|
|
579 | (3) |
|
|
580 | (1) |
|
Setting Up the Ganglia and Hadoop Integration |
|
|
580 | (2) |
|
Setting Up the Hadoop Metrics |
|
|
582 | (1) |
|
Understanding Hadoop Logging |
|
|
582 | (17) |
|
|
583 | (1) |
|
Daemon and Application Logs and How to View Them |
|
|
584 | (1) |
|
How Application Logging Works |
|
|
585 | (2) |
|
How Hadoop Uses HDFS Staging Directories and Local Directories During a Job Run |
|
|
587 | (1) |
|
How the NodeManager Uses the Local Directories |
|
|
588 | (4) |
|
Storing Job Logs in HDFS through Log Aggregation |
|
|
592 | (5) |
|
Working with the Hadoop Daemon Logs |
|
|
597 | (2) |
|
Using Hadoop's Web UIs for Monitoring |
|
|
599 | (10) |
|
Monitoring Jobs with the ResourceManager Web UI |
|
|
599 | (7) |
|
The JobHistoryServer Web UI |
|
|
606 | (2) |
|
Monitoring with the NameNode Web UI |
|
|
608 | (1) |
|
Monitoring Other Hadoop Components |
|
|
609 | (1) |
|
|
609 | (1) |
|
|
610 | (1) |
|
|
610 | (1) |
|
18 Tuning the Cluster Resources, Optimizing MapReduce Jobs and Benchmarking |
|
|
611 | (48) |
|
How to Allocate YARN Memory and CPU |
|
|
612 | (9) |
|
|
612 | (8) |
|
Configuring the Number of CPU Cores |
|
|
620 | (1) |
|
Relationship between Memory and CPU Vcores |
|
|
621 | (1) |
|
Configuring Efficient Performance |
|
|
621 | (4) |
|
|
621 | (3) |
|
Reducing the I/O Load on the System |
|
|
624 | (1) |
|
Tuning Map and Reduce Tasks-What the Administrator Can Do |
|
|
625 | (10) |
|
|
626 | (1) |
|
|
627 | (3) |
|
|
630 | (2) |
|
Tuning the MapReduce Shuffle Process |
|
|
632 | (3) |
|
Optimizing Pig and Hive Jobs |
|
|
635 | (3) |
|
|
635 | (2) |
|
|
637 | (1) |
|
Benchmarking Your Cluster |
|
|
638 | (9) |
|
Using TestDFSIO for Testing I/O Performance |
|
|
638 | (2) |
|
Benchmarking with TeraSort |
|
|
640 | (3) |
|
Using Hadoop's Rumen and GridMix for Benchmarking |
|
|
643 | (4) |
|
|
647 | (5) |
|
|
649 | (1) |
|
|
649 | (1) |
|
MapReduce Framework Counters |
|
|
650 | (1) |
|
|
651 | (1) |
|
Limiting the Number of Counters |
|
|
651 | (1) |
|
|
652 | (6) |
|
Map-Only versus Map and Reduce Jobs |
|
|
652 | (1) |
|
How Combiners Improve MapReduce Performance |
|
|
652 | (2) |
|
Using a Partitioner to Improve Performance |
|
|
654 | (1) |
|
Compressing Data During the MapReduce Process |
|
|
654 | (1) |
|
Too Many Mappers or Reducers? |
|
|
655 | (3) |
|
|
658 | (1) |
|
19 Configuring and Tuning Apache Spark on YARN |
|
|
659 | (32) |
|
Configuring Resource Allocation for Spark on YARN |
|
|
659 | (17) |
|
|
660 | (1) |
|
|
660 | (1) |
|
How Resources Are Allocated to Spark |
|
|
660 | (1) |
|
Limits on the Resource Allocation to Spark Applications |
|
|
661 | (2) |
|
Allocating Resources to the Driver |
|
|
663 | (3) |
|
Configuring Resources for the Executors |
|
|
666 | (4) |
|
How Spark Uses Its Memory |
|
|
670 | (2) |
|
|
672 | (2) |
|
|
674 | (2) |
|
Configuring Spark-Related Network Parameters |
|
|
676 | (1) |
|
Dynamic Resource Allocation when Running Spark on YARN |
|
|
676 | (2) |
|
Dynamic and Static Resource Allocation |
|
|
676 | (1) |
|
How Spark Manages Dynamic Resource Allocation |
|
|
677 | (1) |
|
Enabling Dynamic Resource Allocation |
|
|
677 | (1) |
|
Storage Formats and Compressing Data |
|
|
678 | (3) |
|
|
679 | (1) |
|
|
680 | (1) |
|
|
680 | (1) |
|
Monitoring Spark Applications |
|
|
681 | (5) |
|
Using the Spark Web UI to Understand Performance |
|
|
682 | (2) |
|
Spark System and the Metrics REST API |
|
|
684 | (1) |
|
The Spark History Server on YARN |
|
|
684 | (2) |
|
Tracking Jobs from the Command Line |
|
|
686 | (1) |
|
Tuning Garbage Collection |
|
|
686 | (2) |
|
The Mechanics of Garbage Collection |
|
|
687 | (1) |
|
How to Collect GC Statistics |
|
|
687 | (1) |
|
Tuning Spark Streaming Applications |
|
|
688 | (1) |
|
Reducing Batch Processing Time |
|
|
688 | (1) |
|
Setting the Right Batch Interval |
|
|
689 | (1) |
|
Tuning Memory and Garbage Collection |
|
|
689 | (1) |
|
|
689 | (2) |
|
20 Optimizing Spark Applications |
|
|
691 | (34) |
|
Revisiting the Spark Execution Model |
|
|
692 | (2) |
|
The Spark Execution Model |
|
|
692 | (2) |
|
Shuffle Operations and How to Minimize Them |
|
|
694 | (9) |
|
A WordCount Example to Our Rescue Again |
|
|
695 | (1) |
|
Impact of a Shuffle Operation |
|
|
696 | (1) |
|
Configuring the Shuffle Parameters |
|
|
697 | (6) |
|
Partitioning and Parallelism (Number of Tasks) |
|
|
703 | (7) |
|
|
704 | (2) |
|
Problems with Too Few Tasks |
|
|
706 | (1) |
|
Setting the Default Number of Partitions |
|
|
706 | (1) |
|
How to Increase the Number of Partitions |
|
|
707 | (1) |
|
Using the Repartition and Coalesce Operators to Change the Number of Partitions in an RDD |
|
|
708 | (1) |
|
Two Types of Partitioners |
|
|
709 | (1) |
|
Data Partitioning and How It Can Avoid a Shuffle |
|
|
709 | (1) |
|
Optimizing Data Serialization and Compression |
|
|
710 | (2) |
|
|
710 | (1) |
|
|
711 | (1) |
|
Understanding Spark's SQL Query Optimizer |
|
|
712 | (5) |
|
Understanding the Optimizer Steps |
|
|
712 | (2) |
|
Spark's Speculative Execution Feature |
|
|
714 | (1) |
|
The Importance of Data Locality |
|
|
715 | (2) |
|
|
717 | (6) |
|
Fault-Tolerance Due to Caching |
|
|
718 | (1) |
|
|
718 | (5) |
|
|
723 | (2) |
|
21 Troubleshooting Hadoop-A Sampler |
|
|
725 | (18) |
|
|
725 | (6) |
|
Dealing with a 100 Percent Full Linux File System |
|
|
726 | (1) |
|
|
727 | (1) |
|
Local and Log Directories Out of Free Space |
|
|
727 | (2) |
|
Disk Volume Failure Toleration |
|
|
729 | (2) |
|
Handling YARN Jobs That Are Stuck |
|
|
731 | (1) |
|
JVM Memory-Allocation and Garbage-Collection Strategies |
|
|
732 | (5) |
|
Understanding JVM Garbage Collection |
|
|
732 | (1) |
|
Optimizing Garbage Collection |
|
|
733 | (1) |
|
|
734 | (1) |
|
|
734 | (1) |
|
ApplicationMaster Memory Issues |
|
|
735 | (2) |
|
Handling Different Types of Failures |
|
|
737 | (2) |
|
|
737 | (1) |
|
Starting Failures for Hadoop Daemons |
|
|
737 | (1) |
|
|
738 | (1) |
|
Troubleshooting Spark Jobs |
|
|
739 | (1) |
|
Spark's Fault Tolerance Mechanism |
|
|
740 | (1) |
|
|
740 | (1) |
|
Maximum Attempts for a Job |
|
|
740 | (1) |
|
|
740 | (1) |
|
Debugging Spark Applications |
|
|
740 | (2) |
|
Viewing Logs with Log Aggregation |
|
|
740 | (1) |
|
Viewing Logs When Log Aggregation Is Not Enabled |
|
|
741 | (1) |
|
Reviewing the Launch Environment |
|
|
741 | (1) |
|
|
742 | (1) |
|
22 Installing VirtualBox and Linux and Cloning the Virtual Machines |
|
|
743 | (4) |
|
Installing Oracle VirtualBox |
|
|
744 | (1) |
|
Installing Oracle Enterprise Linux |
|
|
745 | (1) |
|
|
745 | (2) |
Index |
|
747 | |