
Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS [Paperback]

  • Format: Paperback / softback, 848 pages, height x width x thickness: 236x172x40 mm, weight: 1326 g
  • Series: Addison-Wesley Data & Analytics Series
  • Publication date: 19-Jan-2017
  • Publisher: Addison Wesley
  • ISBN-10: 0134597192
  • ISBN-13: 9780134597195

Stop searching the web for out-of-date, fragmentary, and unreliable information about running Hadoop! Now, there's a single source for all the authoritative knowledge and trustworthy procedures you need: Expert Hadoop 2 Administration: Managing Spark, YARN, and MapReduce.

 

Pioneering Hadoop/Big Data administrator Sam R. Alapati shares step-by-step procedures for confidently performing every important task involved in creating, configuring, securing, managing, and optimizing production Hadoop clusters. The only Hadoop administration guide written by a working Hadoop administrator, Expert Hadoop 2 Administration covers an unmatched range of topics and offers an unparalleled collection of realistic examples. Alapati shares proven answers to the complex configuration, management, and performance tuning problems Hadoop administrators constantly encounter, along with expert guidance for customizing Hadoop 2's intensely complex environment. Throughout, he integrates action-oriented advice with carefully researched explanations of both problems and solutions. Coverage includes:

  • Indispensable Hadoop 2 concepts, including architecture, clusters, and application frameworks
  • Configuring high-reliability, high-performance Hadoop environments
  • Managing and protecting Hadoop data and high availability, including HDFS management, compression, data formats, and NameNode operations
  • Moving data, allocating resources and scheduling jobs with YARN, and managing job workflows with Oozie and Hue
  • Hadoop 2 security, monitoring, logging, and benchmarking
  • Troubleshooting root causes of severe performance slowdowns
  • Preventing trouble by proactively maintaining healthy Hadoop environments
  • Installing Hadoop 2 virtual environments, and more
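To give a flavor of the configuration work the book walks through (for instance, setting core Hadoop properties in core-site.xml), here is a minimal sketch of that file for a pseudo-distributed cluster. The hostname, port, and directory values are illustrative assumptions, not values taken from the book:

```xml
<?xml version="1.0"?>
<!-- Illustrative core-site.xml sketch for a pseudo-distributed cluster.
     The host, port, and temp directory below are assumed example values. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
    <description>URI of the default file system (the NameNode endpoint).</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>Base directory for Hadoop's local temporary files.</description>
  </property>
</configuration>
```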
Foreword xxvii
Preface xxix
Acknowledgments xxxv
About the Author xxxvii
I Introduction to Hadoop-Architecture and Hadoop Clusters 1(126)
1 Introduction to Hadoop and Its Environment 3(30)
Hadoop-An Introduction 4(8)
Unique Features of Hadoop 5(1)
Big Data and Hadoop 5(2)
A Typical Scenario for Using Hadoop 7(1)
Traditional Database Systems 7(2)
Data Lake 9(2)
Big Data Science and Hadoop 11(1)
Cluster Computing and Hadoop Clusters 12(3)
Cluster Computing 12(1)
Hadoop Clusters 13(2)
Hadoop Components and the Hadoop Ecosphere 15(3)
What Do Hadoop Administrators Do? 18(3)
Hadoop Administration-A New Paradigm 18(2)
What You Need to Know to Administer Hadoop 20(1)
The Hadoop Administrator's Toolset 21(1)
Key Differences between Hadoop 1 and Hadoop 2 21(3)
Architectural Differences 22(1)
High-Availability Features 22(1)
Multiple Processing Engines 23(1)
Separation of Processing and Scheduling 23(1)
Resource Allocation in Hadoop 1 and Hadoop 2 24(1)
Distributed Data Processing: MapReduce and Spark, Hive and Pig 24(3)
MapReduce 24(1)
Apache Spark 25(1)
Apache Hive 26(1)
Apache Pig 26(1)
Data Integration: Apache Sqoop, Apache Flume and Apache Kafka 27(1)
Key Areas of Hadoop Administration 28(3)
Managing the Cluster Storage 28(1)
Allocating the Cluster Resources 28(1)
Scheduling Jobs 29(1)
Securing Hadoop Data 30(1)
Summary 31(2)
2 An Introduction to the Architecture of Hadoop 33(26)
Distributed Computing and Hadoop 33(1)
Hadoop Architecture 34(3)
A Hadoop Cluster 35(1)
Master and Worker Nodes 36(1)
Hadoop Services 36(1)
Data Storage-The Hadoop Distributed File System 37(11)
HDFS Unique Features 37(1)
HDFS Architecture 38(2)
The HDFS File System 40(3)
NameNode Operations 43(5)
Data Processing with YARN, the Hadoop Operating System 48(9)
Architecture of YARN 49(4)
How the ApplicationMaster Works with the ResourceManager to Allocate Resources 53(4)
Summary 57(2)
3 Creating and Configuring a Simple Hadoop Cluster 59(32)
Hadoop Distributions and Installation Types 60(2)
Hadoop Distributions 60(1)
Hadoop Installation Types 61(1)
Setting Up a Pseudo-Distributed Hadoop Cluster 62(9)
Meeting the Operating System Requirements 63(1)
Modifying Kernel Parameters 64(4)
Setting Up SSH 68(1)
Java Requirements 69(1)
Installing the Hadoop Software 70(1)
Creating the Necessary Hadoop Users 70(1)
Creating the Necessary Directories 71(1)
Performing the Initial Hadoop Configuration 71(15)
Environment Configuration Files 73(1)
Read-Only Default Configuration Files 74(1)
Site-Specific Configuration Files 74(1)
Other Hadoop-Related Configuration Files 74(2)
Precedence among the Configuration Files 76(2)
Variable Expansion and Configuration Parameters 78(1)
Configuring the Hadoop Daemons Environment 79(2)
Configuring Core Hadoop Properties (with the core-site.xml File) 81(1)
Configuring MapReduce (with the mapred-site.xml File) 82(1)
Configuring YARN (with the yarn-site.xml File) 83(3)
Operating the New Hadoop Cluster 86(4)
Formatting the Distributed File System 86(1)
Setting the Environment Variables 87(1)
Starting the HDFS and YARN Services 87(2)
Verifying the Service Startup 89(1)
Shutting Down the Services 90(1)
Summary 90(1)
4 Planning for and Creating a Fully Distributed Cluster 91(36)
Planning Your Hadoop Cluster 92(3)
General Cluster Planning Considerations 92(2)
Server Form Factors 94(1)
Criteria for Choosing the Nodes 94(1)
Going from a Single Rack to Multiple Racks 95(7)
Sizing a Hadoop Cluster 96(1)
General Principles Governing the Choice of CPU, Memory and Storage 96(3)
Special Treatment for the Master Nodes 99(1)
Recommendations for Sizing the Servers 100(1)
Growing a Cluster 101(1)
Guidelines for Large Clusters 101(1)
Creating a Multinode Cluster 102(4)
How the Test Cluster Is Set Up 102(4)
Modifying the Hadoop Configuration 106(8)
Changing the HDFS Configuration (hdfs-site.xml file) 106(3)
Changing the YARN Configuration 109(4)
Changing the MapReduce Configuration 113(1)
Starting Up the Cluster 114(5)
Starting Up and Shutting Down the Cluster with Scripts 116(2)
Performing a Quick Check of the New Cluster's File System 118(1)
Configuring Hadoop Services, Web Interfaces and Ports 119(7)
Service Configuration and Web Interfaces 119(3)
Setting Port Numbers for Hadoop Services 122(2)
Hadoop Clients 124(2)
Summary 126(1)
II Hadoop Application Frameworks 127(76)
5 Running Applications in a Cluster-The MapReduce Framework (and Hive and Pig) 129(18)
The MapReduce Framework 129(12)
The MapReduce Model 130(1)
How MapReduce Works 131(2)
MapReduce Job Processing 133(2)
A Simple MapReduce Program 135(1)
Understanding Hadoop's Job Processing-Running a WordCount Program 136(1)
MapReduce Input and Output Directories 137(1)
How Hadoop Shows You the Job Details 137(2)
Hadoop Streaming 139(2)
Apache Hive 141(3)
Hive Data Organization 142(1)
Working with Hive Tables 142(1)
Loading Data into Hive 142(1)
Querying with Hive 143(1)
Apache Pig 144(1)
Pig Execution Modes 144(1)
A Simple Pig Example 145(1)
Summary 145(2)
6 Running Applications in a Cluster-The Spark Framework 147(22)
What Is Spark? 148(1)
Why Spark? 149(4)
Speed 149(2)
Ease of Use and Accessibility 151(1)
General-Purpose Framework 152(1)
Spark and Hadoop 153(1)
The Spark Stack 153(2)
Installing Spark 155(3)
Spark Examples 157(1)
Key Spark Files and Directories 157(1)
Compiling the Spark Binaries 157(1)
Reducing Spark's Verbosity 158(1)
Spark Run Modes 158(1)
Local Mode 158(1)
Cluster Mode 158(1)
Understanding the Cluster Managers 159(5)
The Standalone Cluster Manager 159(2)
Spark on Apache Mesos 161(1)
Spark on YARN 162(1)
How YARN and Spark Work Together 163(1)
Setting Up Spark on a Hadoop Cluster 163(1)
Spark and Data Access 164(3)
Loading Data from the Linux File System 164(1)
Loading Data from HDFS 164(2)
Loading Data from a Relational Database 166(1)
Summary 167(2)
7 Running Spark Applications 169(34)
The Spark Programming Model 169(4)
Spark Programming and RDDs 169(3)
Programming Spark 172(1)
Spark Applications 173(6)
Basics of RDDs 174(1)
Creating an RDD 174(2)
RDD Operations 176(3)
RDD Persistence 179(1)
Architecture of a Spark Application 179(2)
Spark Terminology 180(1)
Components of a Spark Application 180(1)
Running Spark Applications Interactively 181(4)
Spark Shell and Spark Applications 181(1)
A Bit about the Spark Shell 182(1)
Using the Spark Shell 182(3)
Overview of Spark Cluster Execution 185(1)
Creating and Submitting Spark Applications 185(7)
Building the Spark Application 186(1)
Running an Application in the Standalone Spark Cluster 186(1)
Using spark-submit to Execute Applications 187(2)
Running Spark Applications on Mesos 189(1)
Running Spark Applications in a YARN-Managed Hadoop Cluster 189(2)
Using the JDBC/ODBC Server 191(1)
Configuring Spark Applications 192(2)
Spark Configuration Properties 192(1)
Specifying Configuration when Running spark-submit 193(1)
Monitoring Spark Applications 194(1)
Handling Streaming Data with Spark Streaming 194(4)
How Spark Streaming Works 195(2)
A Spark Streaming Example-WordCount Again! 197(1)
Using Spark SQL for Handling Structured Data 198(3)
DataFrames 198(1)
HiveContext and SQLContext 198(1)
Working with Spark SQL 199(1)
Creating DataFrames 200(1)
Summary 201(2)
III Managing and Protecting Hadoop Data and High Availability 203(150)
8 The Role of the NameNode and How HDFS Works 205(38)
HDFS-The Interaction between the NameNode and the DataNodes 205(4)
Interaction between the Clients and HDFS 206(1)
NameNode and DataNode Communications 207(2)
Rack Awareness and Topology 209(3)
How to Configure Rack Awareness in Your Cluster 210(1)
Finding Your Cluster's Rack Information 210(2)
HDFS Data Replication 212(6)
HDFS Data Organization and Data Blocks 213(1)
Data Replication 213(3)
Block and Replica States 216(2)
How Clients Read and Write HDFS Data 218(6)
How Clients Read HDFS Data 219(1)
How Clients Write Data to HDFS 220(4)
Understanding HDFS Recovery Processes 224(3)
Generation Stamp 224(1)
Lease Recovery 224(2)
Block Recovery 226(1)
Pipeline Recovery 226(1)
Centralized Cache Management in HDFS 227(5)
Hadoop and OS Page Caching 228(1)
The Key Principles Behind Centralized Cache Management 228(1)
How Centralized Cache Management Works 229(1)
Configuring Caching 229(1)
Cache Directives 230(1)
Cache Pools 230(1)
Using the Cache 231(1)
Hadoop Archival Storage, SSD and Memory (Heterogeneous Storage) 232(9)
Performance Characteristics of Storage Types 233(1)
The Need for Heterogeneous HDFS Storage 233(1)
Changes in the Storage Architecture 234(1)
Storage Preferences for Files 235(1)
Setting Up Archival Storage 235(4)
Managing Storage Policies 239(1)
Moving Data Around 239(1)
Implementing Archival Storage 240(1)
Summary 241(2)
9 HDFS Commands, HDFS Permissions and HDFS Storage 243(34)
Managing HDFS through the HDFS Shell Commands 243(8)
Using the hdfs dfs Utility to Manage HDFS 245(2)
Listing HDFS Files and Directories 247(2)
Creating an HDFS Directory 249(1)
Removing HDFS Files and Directories 249(1)
Changing File and Directory Ownership and Groups 250(1)
Using the dfsadmin Utility to Perform HDFS Operations 251(4)
The dfsadmin report Command 252(3)
Managing HDFS Permissions and Users 255(5)
HDFS File Permissions 255(2)
HDFS Users and Super Users 257(3)
Managing HDFS Storage 260(7)
Checking HDFS Disk Usage 260(3)
Allocating HDFS Space Quotas 263(4)
Rebalancing HDFS Data 267(7)
Reasons for HDFS Data Imbalance 268(1)
Running the Balancer Tool to Balance HDFS Data 268(3)
Using hdfs dfsadmin to Make Things Easier 271(2)
When to Run the Balancer 273(1)
Reclaiming HDFS Space 274(2)
Removing Files and Directories 274(1)
Decreasing the Replication Factor 274(2)
Summary 276(1)
10 Data Protection, File Formats and Accessing HDFS 277(40)
Safeguarding Data 278(11)
Using HDFS Trash to Prevent Accidental Data Deletion 278(2)
Using HDFS Snapshots to Protect Important Data 280(4)
Ensuring Data Integrity with File System Checks 284(5)
Data Compression 289(6)
Common Compression Formats 290(1)
Evaluating the Various Compression Schemes 291(1)
Compression at Various Stages for MapReduce 291(4)
Compression for Spark 295(1)
Data Serialization 295(1)
Hadoop File Formats 295(13)
Criteria for Determining the Right File Format 296(2)
File Formats Supported by Hadoop 298(4)
The Ideal File Format 302(1)
The Hadoop Small Files Problem and Merging Files 303(1)
Using a Federated NameNode to Overcome the Small Files Problem 304(1)
Using Hadoop Archives to Manage Many Small Files 304(3)
Handling the Performance Impact of Small Files 307(1)
Using Hadoop WebHDFS and HttpFS 308(7)
WebHDFS-The Hadoop REST API 308(1)
Using the WebHDFS API 309(1)
Understanding the WebHDFS Commands 310(3)
Using HttpFS Gateway to Access HDFS from Behind a Firewall 313(2)
Summary 315(2)
11 NameNode Operations, High Availability and Federation 317(36)
Understanding NameNode Operations 318(5)
HDFS Metadata 319(2)
The NameNode Startup Process 321(1)
How the NameNode and the DataNodes Work Together 322(1)
The Checkpointing Process 323(6)
Secondary, Checkpoint, Backup and Standby Nodes 324(1)
Configuring the Checkpointing Frequency 325(2)
Managing Checkpoint Performance 327(1)
The Mechanics of Checkpointing 327(2)
NameNode Safe Mode Operations 329(5)
Automatic Safe Mode Operations 329(1)
Placing the NameNode in Safe Mode 330(1)
How the NameNode Transitions Through Safe Mode 331(1)
Backing Up and Recovering the NameNode Metadata 332(2)
Configuring HDFS High Availability 334(15)
NameNode HA Architecture (QJM) 335(2)
Setting Up an HDFS HA Quorum Cluster 337(5)
Deploying the High-Availability NameNodes 342(3)
Managing an HA NameNode Setup 345(1)
HA Manual and Automatic Failover 346(3)
HDFS Federation 349(2)
Architecture of a Federated NameNode 350(1)
Summary 351(2)
IV Moving Data, Allocating Resources, Scheduling Jobs and Security 353(174)
12 Moving Data Into and Out of Hadoop 355(52)
Introduction to Hadoop Data Transfer Tools 355(1)
Loading Data into HDFS from the Command Line 356(5)
Using the -cat Command to Dump a File's Contents 356(1)
Testing HDFS Files 357(1)
Copying and Moving Files from and to HDFS 358(1)
Using the -get Command to Move Files 359(1)
Moving Files from and to HDFS 360(1)
Using the -tail and -head Commands 360(1)
Copying HDFS Data between Clusters with DistCp 361(4)
How to Use the DistCp Command to Move Data 361(2)
DistCp Options 363(2)
Ingesting Data from Relational Databases with Sqoop 365(23)
Sqoop Architecture 366(1)
Deploying Sqoop 367(1)
Using Sqoop to Move Data 368(1)
Importing Data with Sqoop 368(11)
Importing Data into Hive 379(2)
Exporting Data with Sqoop 381(7)
Ingesting Data from External Sources with Flume 388(10)
Flume Architecture in a Nutshell 389(2)
Configuring the Flume Agent 391(1)
A Simple Flume Example 392(2)
Using Flume to Move Data to HDFS 394(1)
A More Complex Flume Example 395(3)
Ingesting Data with Kafka 398(8)
Benefits Offered by Kafka 398(1)
How Kafka Works 399(2)
Setting Up an Apache Kafka Cluster 401(3)
Integrating Kafka with Hadoop and Storm 404(2)
Summary 406(1)
13 Resource Allocation in a Hadoop Cluster 407(30)
Resource Allocation in Hadoop 407(3)
Managing Cluster Workloads 408(1)
Hadoop's Resource Schedulers 409(1)
The FIFO Scheduler 410(1)
The Capacity Scheduler 411(15)
Queues and Subqueues 412(6)
How the Cluster Allocates Resources 418(3)
Preempting Applications 421(1)
Enabling the Capacity Scheduler 422(1)
A Typical Capacity Scheduler 422(4)
The Fair Scheduler 426(9)
Queues 427(1)
Configuring the Fair Scheduler 428(2)
How Jobs Are Placed into Queues 430(1)
Application Preemption in the Fair Scheduler 431(1)
Security and Resource Pools 432(1)
A Sample fair-scheduler.xml File 432(2)
Submitting Jobs to the Scheduler 434(1)
Moving Applications between Queues 434(1)
Monitoring the Fair Scheduler 434(1)
Comparing the Capacity Scheduler and the Fair Scheduler 435(1)
Similarities between the Two Schedulers 435(1)
Differences between the Two Schedulers 435(1)
Summary 436(1)
14 Working with Oozie to Manage Job Workflows 437(40)
Using Apache Oozie to Schedule Jobs 437(2)
Oozie Architecture 439(2)
The Oozie Server 439(1)
The Oozie Client 440(1)
The Oozie Database 440(1)
Deploying Oozie in Your Cluster 441(5)
Installing and Configuring Oozie 442(2)
Configuring Hadoop for Oozie 444(2)
Understanding Oozie Workflows 446(3)
Workflows, Control Flow, and Nodes 446(1)
Defining the Workflows with the workflow.xml File 447(2)
How Oozie Runs an Action 449(5)
Configuring the Action Nodes 449(5)
Creating an Oozie Workflow 454(7)
Configuring the Control Nodes 456(4)
Configuring the Job 460(1)
Running an Oozie Workflow Job 461(3)
Specifying the Job Properties 461(2)
Deploying Oozie Jobs 463(1)
Creating Dynamic Workflows 463(1)
Oozie Coordinators 464(6)
Time-Based Coordinators 465(2)
Data-Based Coordinators 467(2)
Time-and-Data-Based Coordinators 469(1)
Submitting the Oozie Coordinator from the Command Line 469(1)
Managing and Administering Oozie 470(5)
Common Oozie Commands and How to Run Them 471(2)
Troubleshooting Oozie 473(1)
Oozie cron Scheduling and Oozie Service Level Agreements 474(1)
Summary 475(2)
15 Securing Hadoop 477(50)
Hadoop Security-An Overview 478(3)
Authentication, Authorization and Accounting 480(1)
Hadoop Authentication with Kerberos 481(24)
Kerberos and How It Works 482(1)
The Kerberos Authentication Process 483(1)
Kerberos Trusts 484(1)
A Special Principal 485(1)
Adding Kerberos Authorization to Your Cluster 486(4)
Setting Up Kerberos for Hadoop 490(5)
Securing a Hadoop Cluster with Kerberos 495(6)
How Kerberos Authenticates Users and Services 501(1)
Managing a Kerberized Hadoop Cluster 501(4)
Hadoop Authorization 505(13)
HDFS Permissions 505(5)
Service Level Authorization 510(2)
Role-Based Authorization with Apache Sentry 512(6)
Auditing Hadoop 518(2)
Auditing HDFS Operations 519(1)
Auditing YARN Operations 519(1)
Securing Hadoop Data 520(4)
HDFS Transparent Encryption 520(3)
Encrypting Data in Transition 523(1)
Other Hadoop-Related Security Initiatives 524(1)
Securing a Hadoop Infrastructure with Apache Knox Gateway 524(1)
Apache Ranger for Security Administration 525(1)
Summary 525(2)
V Monitoring, Optimization and Troubleshooting 527(220)
16 Managing Jobs, Using Hue and Performing Routine Tasks 529(40)
Using the YARN Commands to Manage Hadoop Jobs 530(5)
Viewing YARN Applications 531(1)
Checking the Status of an Application 532(1)
Killing a Running Application 532(1)
Checking the Status of the Nodes 533(1)
Checking YARN Queues 533(1)
Getting the Application Logs 533(1)
YARN Administrative Commands 534(1)
Decommissioning and Recommissioning Nodes 535(6)
Including and Excluding Hosts 536(1)
Decommissioning DataNodes and NodeManagers 537(2)
Recommissioning Nodes 539(1)
Things to Remember about Decommissioning and Recommissioning 539(1)
Adding a New DataNode and/or a NodeManager 540(1)
ResourceManager High Availability 541(4)
ResourceManager High-Availability Architecture 541(1)
Setting Up ResourceManager High Availability 542(1)
ResourceManager Failover 543(2)
Using the ResourceManager High-Availability Commands 545(1)
Performing Common Management Tasks 545(3)
Moving the NameNode to a Different Host 545(1)
Managing High-Availability NameNodes 546(1)
Using a Shutdown/Startup Script to Manage Your Cluster 546(1)
Balancing HDFS 546(1)
Balancing the Storage on the DataNodes 547(1)
Managing the MySQL Database 548(3)
Configuring a MySQL Database 548(1)
Configuring MySQL High Availability 549(2)
Backing Up Important Cluster Data 551(2)
Backing Up HDFS Metadata 552(1)
Backing Up the Metastore Databases 553(1)
Using Hue to Administer Your Cluster 553(9)
Allowing Your Users to Use Hue 554(2)
Installing Hue 556(1)
Configuring Your Cluster to Work with Hue 557(4)
Managing Hue 561(1)
Working with Hue 561(1)
Implementing Specialized HDFS Features 562(5)
Deploying HDFS and YARN in a Multihomed Network 562(1)
Short-Circuit Local Reads 563(1)
Mountable HDFS 564(2)
Using an NFS Gateway for Mounting HDFS to a Local File System 566(1)
Summary 567(2)
17 Monitoring, Metrics and Hadoop Logging 569(42)
Monitoring Linux Servers 570(6)
Basics of Linux System Monitoring 570(2)
Monitoring Tools for Linux Systems 572(4)
Hadoop Metrics 576(3)
Hadoop Metric Types 577(1)
Using the Hadoop Metrics 578(1)
Capturing Metrics to a File System 578(1)
Using Ganglia for Monitoring 579(3)
Ganglia Architecture 580(1)
Setting Up the Ganglia and Hadoop Integration 580(2)
Setting Up the Hadoop Metrics 582(1)
Understanding Hadoop Logging 582(17)
Hadoop Log Messages 583(1)
Daemon and Application Logs and How to View Them 584(1)
How Application Logging Works 585(2)
How Hadoop Uses HDFS Staging Directories and Local Directories During a Job Run 587(1)
How the NodeManager Uses the Local Directories 588(4)
Storing Job Logs in HDFS through Log Aggregation 592(5)
Working with the Hadoop Daemon Logs 597(2)
Using Hadoop's Web UIs for Monitoring 599(10)
Monitoring Jobs with the ResourceManager Web UI 599(7)
The JobHistoryServer Web UI 606(2)
Monitoring with the NameNode Web UI 608(1)
Monitoring Other Hadoop Components 609(1)
Monitoring Hive 609(1)
Monitoring Spark 610(1)
Summary 610(1)
18 Tuning the Cluster Resources, Optimizing MapReduce Jobs and Benchmarking 611(48)
How to Allocate YARN Memory and CPU 612(9)
Allocating Memory 612(8)
Configuring the Number of CPU Cores 620(1)
Relationship between Memory and CPU Vcores 621(1)
Configuring Efficient Performance 621(4)
Speculative Execution 621(3)
Reducing the I/O Load on the System 624(1)
Tuning Map and Reduce Tasks-What the Administrator Can Do 625(10)
Tuning the Map Tasks 626(1)
Input and Output 627(3)
Tuning the Reduce Tasks 630(2)
Tuning the MapReduce Shuffle Process 632(3)
Optimizing Pig and Hive Jobs 635(3)
Optimizing Hive Jobs 635(2)
Optimizing Pig Jobs 637(1)
Benchmarking Your Cluster 638(9)
Using TestDFSIO for Testing I/O Performance 638(2)
Benchmarking with TeraSort 640(3)
Using Hadoop's Rumen and GridMix for Benchmarking 643(4)
Hadoop Counters 647(5)
File System Counters 649(1)
Job Counters 649(1)
MapReduce Framework Counters 650(1)
Custom Java Counters 651(1)
Limiting the Number of Counters 651(1)
Optimizing MapReduce 652(6)
Map-Only versus Map and Reduce Jobs 652(1)
How Combiners Improve MapReduce Performance 652(2)
Using a Partitioner to Improve Performance 654(1)
Compressing Data During the MapReduce Process 654(1)
Too Many Mappers or Reducers? 655(3)
Summary 658(1)
19 Configuring and Tuning Apache Spark on YARN 659(32)
Configuring Resource Allocation for Spark on YARN 659(17)
Allocating CPU 660(1)
Allocating Memory 660(1)
How Resources Are Allocated to Spark 660(1)
Limits on the Resource Allocation to Spark Applications 661(2)
Allocating Resources to the Driver 663(3)
Configuring Resources for the Executors 666(4)
How Spark Uses Its Memory 670(2)
Things to Remember 672(2)
Cluster or Client Mode? 674(2)
Configuring Spark-Related Network Parameters 676(1)
Dynamic Resource Allocation when Running Spark on YARN 676(2)
Dynamic and Static Resource Allocation 676(1)
How Spark Manages Dynamic Resource Allocation 677(1)
Enabling Dynamic Resource Allocation 677(1)
Storage Formats and Compressing Data 678(3)
Storage Formats 679(1)
File Sizes 680(1)
Compression 680(1)
Monitoring Spark Applications 681(5)
Using the Spark Web UI to Understand Performance 682(2)
Spark System and the Metrics REST API 684(1)
The Spark History Server on YARN 684(2)
Tracking Jobs from the Command Line 686(1)
Tuning Garbage Collection 686(2)
The Mechanics of Garbage Collection 687(1)
How to Collect GC Statistics 687(1)
Tuning Spark Streaming Applications 688(1)
Reducing Batch Processing Time 688(1)
Setting the Right Batch Interval 689(1)
Tuning Memory and Garbage Collection 689(1)
Summary 689(2)
20 Optimizing Spark Applications 691(34)
Revisiting the Spark Execution Model 692(2)
The Spark Execution Model 692(2)
Shuffle Operations and How to Minimize Them 694(9)
A WordCount Example to Our Rescue Again 695(1)
Impact of a Shuffle Operation 696(1)
Configuring the Shuffle Parameters 697(6)
Partitioning and Parallelism (Number of Tasks) 703(7)
Level of Parallelism 704(2)
Problems with Too Few Tasks 706(1)
Setting the Default Number of Partitions 706(1)
How to Increase the Number of Partitions 707(1)
Using the Repartition and Coalesce Operators to Change the Number of Partitions in an RDD 708(1)
Two Types of Partitioners 709(1)
Data Partitioning and How It Can Avoid a Shuffle 709(1)
Optimizing Data Serialization and Compression 710(2)
Data Serialization 710(1)
Configuring Compression 711(1)
Understanding Spark's SQL Query Optimizer 712(5)
Understanding the Optimizer Steps 712(2)
Spark's Speculative Execution Feature 714(1)
The Importance of Data Locality 715(2)
Caching Data 717(6)
Fault-Tolerance Due to Caching 718(1)
How to Specify Caching 718(5)
Summary 723(2)
21 Troubleshooting Hadoop-A Sampler 725(18)
Space-Related Issues 725(6)
Dealing with a 100 Percent Full Linux File System 726(1)
HDFS Space Issues 727(1)
Local and Log Directories Out of Free Space 727(2)
Disk Volume Failure Toleration 729(2)
Handling YARN Jobs That Are Stuck 731(1)
JVM Memory-Allocation and Garbage-Collection Strategies 732(5)
Understanding JVM Garbage Collection 732(1)
Optimizing Garbage Collection 733(1)
Analyzing Memory Usage 734(1)
Out of Memory Errors 734(1)
ApplicationMaster Memory Issues 735(2)
Handling Different Types of Failures 737(2)
Handling Daemon Failures 737(1)
Starting Failures for Hadoop Daemons 737(1)
Task and Job Failures 738(1)
Troubleshooting Spark Jobs 739(1)
Spark's Fault Tolerance Mechanism 740(1)
Killing Spark Jobs 740(1)
Maximum Attempts for a Job 740(1)
Maximum Failures per Job 740(1)
Debugging Spark Applications 740(2)
Viewing Logs with Log Aggregation 740(1)
Viewing Logs When Log Aggregation Is Not Enabled 741(1)
Reviewing the Launch Environment 741(1)
Summary 742(1)
22 Installing VirtualBox and Linux and Cloning the Virtual Machines 743(4)
Installing Oracle VirtualBox 744(1)
Installing Oracle Enterprise Linux 745(1)
Cloning the Linux Server 745(2)
Index 747
Sam R. Alapati has been working with various aspects of the Hadoop environment for the past six years. He is currently the principal Hadoop administrator at Sabre Corporation in Westlake, Texas, where he works daily with multiple large Hadoop 2 clusters. In addition to being the point person for all Hadoop administration at Sabre, Sam manages multiple critical data-science- and data-analysis-related Hadoop job flows and is also an expert Oracle database administrator. His deep knowledge of relational databases and SQL informs his work on Hadoop-related projects. Sam's recognition in the database and middleware area includes 18 well-received books published over the past 14 years, mostly on Oracle Database administration and Oracle WebLogic Server. His experience with numerous configuration, architectural, and performance-related Hadoop issues over the years led him to realize that many working Hadoop administrators and developers would appreciate a handy reference such as this book when creating, managing, securing, and optimizing their Hadoop infrastructure.