
Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS [Paperback]

  • Format: Paperback / softback, 848 pages, height x width x thickness: 236x172x40 mm, weight: 1326 g
  • Series: Addison-Wesley Data & Analytics Series
  • Publication date: 19-Jan-2017
  • Publisher: Addison Wesley
  • ISBN-10: 0134597192
  • ISBN-13: 9780134597195

Stop searching the web for out-of-date, fragmentary, and unreliable information about running Hadoop! Now, there's a single source for all the authoritative knowledge and trustworthy procedures you need: Expert Hadoop 2 Administration: Managing Spark, YARN, and MapReduce.

 

Pioneering Hadoop/Big Data administrator Sam R. Alapati shares step-by-step procedures for confidently performing every important task involved in creating, configuring, securing, managing, and optimizing production Hadoop clusters. The only Hadoop administration guide written by a working Hadoop administrator, Expert Hadoop 2 Administration covers an unmatched range of topics and offers an unparalleled collection of realistic examples. Alapati shares proven answers to the complex configuration, management, and performance tuning problems Hadoop administrators constantly encounter, along with expert guidance for customizing Hadoop 2's intensely complex environment. Throughout, he integrates action-oriented advice with carefully researched explanations of both problems and solutions. Coverage includes:

  • Indispensable Hadoop 2 concepts, including architecture, clusters, and application frameworks
  • Configuring high-reliability, high-performance Hadoop environments
  • Managing and protecting Hadoop data and high availability, including HDFS management, compression, data formats, and NameNode operations
  • Moving data, allocating resources and scheduling jobs with YARN, and managing job workflows with Oozie and Hue
  • Hadoop 2 security, monitoring, logging, and benchmarking
  • Troubleshooting root causes of severe performance slowdowns
  • Preventing trouble by proactively maintaining healthy Hadoop environments
  • Installing Hadoop 2 virtual environments, and more
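To give a flavor of the configuration work the book walks through (for instance, setting core Hadoop properties in core-site.xml), here is a minimal sketch of that file for a pseudo-distributed cluster. The hostname, port, and directory values are illustrative assumptions, not values taken from the book:

```xml
<?xml version="1.0"?>
<!-- Illustrative core-site.xml sketch for a pseudo-distributed cluster.
     The host, port, and temp directory below are assumed example values. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
    <description>URI of the default file system (the NameNode endpoint).</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>Base directory for Hadoop's local temporary files.</description>
  </property>
</configuration>
```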
Foreword xxvii
Preface xxix
Acknowledgments xxxv
About the Author xxxvii
I Introduction to Hadoop-Architecture and Hadoop Clusters 1(126)
1 Introduction to Hadoop and Its Environment 3(30)
Hadoop-An Introduction 4(8)
Unique Features of Hadoop 5(1)
Big Data and Hadoop 5(2)
A Typical Scenario for Using Hadoop 7(1)
Traditional Database Systems 7(2)
Data Lake 9(2)
Big Data Science and Hadoop 11(1)
Cluster Computing and Hadoop Clusters 12(3)
Cluster Computing 12(1)
Hadoop Clusters 13(2)
Hadoop Components and the Hadoop Ecosphere 15(3)
What Do Hadoop Administrators Do? 18(3)
Hadoop Administration-A New Paradigm 18(2)
What You Need to Know to Administer Hadoop 20(1)
The Hadoop Administrator's Toolset 21(1)
Key Differences between Hadoop 1 and Hadoop 2 21(3)
Architectural Differences 22(1)
High-Availability Features 22(1)
Multiple Processing Engines 23(1)
Separation of Processing and Scheduling 23(1)
Resource Allocation in Hadoop 1 and Hadoop 2 24(1)
Distributed Data Processing: MapReduce and Spark, Hive and Pig 24(3)
MapReduce 24(1)
Apache Spark 25(1)
Apache Hive 26(1)
Apache Pig 26(1)
Data Integration: Apache Sqoop, Apache Flume and Apache Kafka 27(1)
Key Areas of Hadoop Administration 28(3)
Managing the Cluster Storage 28(1)
Allocating the Cluster Resources 28(1)
Scheduling Jobs 29(1)
Securing Hadoop Data 30(1)
Summary 31(2)
2 An Introduction to the Architecture of Hadoop 33(26)
Distributed Computing and Hadoop 33(1)
Hadoop Architecture 34(3)
A Hadoop Cluster 35(1)
Master and Worker Nodes 36(1)
Hadoop Services 36(1)
Data Storage-The Hadoop Distributed File System 37(11)
HDFS Unique Features 37(1)
HDFS Architecture 38(2)
The HDFS File System 40(3)
NameNode Operations 43(5)
Data Processing with YARN, the Hadoop Operating System 48(9)
Architecture of YARN 49(4)
How the ApplicationMaster Works with the ResourceManager to Allocate Resources 53(4)
Summary 57(2)
3 Creating and Configuring a Simple Hadoop Cluster 59(32)
Hadoop Distributions and Installation Types 60(2)
Hadoop Distributions 60(1)
Hadoop Installation Types 61(1)
Setting Up a Pseudo-Distributed Hadoop Cluster 62(9)
Meeting the Operating System Requirements 63(1)
Modifying Kernel Parameters 64(4)
Setting Up SSH 68(1)
Java Requirements 69(1)
Installing the Hadoop Software 70(1)
Creating the Necessary Hadoop Users 70(1)
Creating the Necessary Directories 71(1)
Performing the Initial Hadoop Configuration 71(15)
Environment Configuration Files 73(1)
Read-Only Default Configuration Files 74(1)
Site-Specific Configuration Files 74(1)
Other Hadoop-Related Configuration Files 74(2)
Precedence among the Configuration Files 76(2)
Variable Expansion and Configuration Parameters 78(1)
Configuring the Hadoop Daemons Environment 79(2)
Configuring Core Hadoop Properties (with the core-site.xml File) 81(1)
Configuring MapReduce (with the mapred-site.xml File) 82(1)
Configuring YARN (with the yarn-site.xml File) 83(3)
Operating the New Hadoop Cluster 86(4)
Formatting the Distributed File System 86(1)
Setting the Environment Variables 87(1)
Starting the HDFS and YARN Services 87(2)
Verifying the Service Startup 89(1)
Shutting Down the Services 90(1)
Summary 90(1)
4 Planning for and Creating a Fully Distributed Cluster 91(36)
Planning Your Hadoop Cluster 92(3)
General Cluster Planning Considerations 92(2)
Server Form Factors 94(1)
Criteria for Choosing the Nodes 94(1)
Going from a Single Rack to Multiple Racks 95(7)
Sizing a Hadoop Cluster 96(1)
General Principles Governing the Choice of CPU, Memory and Storage 96(3)
Special Treatment for the Master Nodes 99(1)
Recommendations for Sizing the Servers 100(1)
Growing a Cluster 101(1)
Guidelines for Large Clusters 101(1)
Creating a Multinode Cluster 102(4)
How the Test Cluster Is Set Up 102(4)
Modifying the Hadoop Configuration 106(8)
Changing the HDFS Configuration (hdfs-site.xml file) 106(3)
Changing the YARN Configuration 109(4)
Changing the MapReduce Configuration 113(1)
Starting Up the Cluster 114(5)
Starting Up and Shutting Down the Cluster with Scripts 116(2)
Performing a Quick Check of the New Cluster's File System 118(1)
Configuring Hadoop Services, Web Interfaces and Ports 119(7)
Service Configuration and Web Interfaces 119(3)
Setting Port Numbers for Hadoop Services 122(2)
Hadoop Clients 124(2)
Summary 126(1)
II Hadoop Application Frameworks 127(76)
5 Running Applications in a Cluster-The MapReduce Framework (and Hive and Pig) 129(18)
The MapReduce Framework 129(12)
The MapReduce Model 130(1)
How MapReduce Works 131(2)
MapReduce Job Processing 133(2)
A Simple MapReduce Program 135(1)
Understanding Hadoop's Job Processing-Running a WordCount Program 136(1)
MapReduce Input and Output Directories 137(1)
How Hadoop Shows You the Job Details 137(2)
Hadoop Streaming 139(2)
Apache Hive 141(3)
Hive Data Organization 142(1)
Working with Hive Tables 142(1)
Loading Data into Hive 142(1)
Querying with Hive 143(1)
Apache Pig 144(1)
Pig Execution Modes 144(1)
A Simple Pig Example 145(1)
Summary 145(2)
6 Running Applications in a Cluster-The Spark Framework 147(22)
What Is Spark? 148(1)
Why Spark? 149(4)
Speed 149(2)
Ease of Use and Accessibility 151(1)
General-Purpose Framework 152(1)
Spark and Hadoop 153(1)
The Spark Stack 153(2)
Installing Spark 155(3)
Spark Examples 157(1)
Key Spark Files and Directories 157(1)
Compiling the Spark Binaries 157(1)
Reducing Spark's Verbosity 158(1)
Spark Run Modes 158(1)
Local Mode 158(1)
Cluster Mode 158(1)
Understanding the Cluster Managers 159(5)
The Standalone Cluster Manager 159(2)
Spark on Apache Mesos 161(1)
Spark on YARN 162(1)
How YARN and Spark Work Together 163(1)
Setting Up Spark on a Hadoop Cluster 163(1)
Spark and Data Access 164(3)
Loading Data from the Linux File System 164(1)
Loading Data from HDFS 164(2)
Loading Data from a Relational Database 166(1)
Summary 167(2)
7 Running Spark Applications 169(34)
The Spark Programming Model 169(4)
Spark Programming and RDDs 169(3)
Programming Spark 172(1)
Spark Applications 173(6)
Basics of RDDs 174(1)
Creating an RDD 174(2)
RDD Operations 176(3)
RDD Persistence 179(1)
Architecture of a Spark Application 179(2)
Spark Terminology 180(1)
Components of a Spark Application 180(1)
Running Spark Applications Interactively 181(4)
Spark Shell and Spark Applications 181(1)
A Bit about the Spark Shell 182(1)
Using the Spark Shell 182(3)
Overview of Spark Cluster Execution 185(1)
Creating and Submitting Spark Applications 185(7)
Building the Spark Application 186(1)
Running an Application in the Standalone Spark Cluster 186(1)
Using spark-submit to Execute Applications 187(2)
Running Spark Applications on Mesos 189(1)
Running Spark Applications in a YARN-Managed Hadoop Cluster 189(2)
Using the JDBC/ODBC Server 191(1)
Configuring Spark Applications 192(2)
Spark Configuration Properties 192(1)
Specifying Configuration when Running spark-submit 193(1)
Monitoring Spark Applications 194(1)
Handling Streaming Data with Spark Streaming 194(4)
How Spark Streaming Works 195(2)
A Spark Streaming Example-WordCount Again! 197(1)
Using Spark SQL for Handling Structured Data 198(3)
DataFrames 198(1)
HiveContext and SQLContext 198(1)
Working with Spark SQL 199(1)
Creating DataFrames 200(1)
Summary 201(2)
III Managing and Protecting Hadoop Data and High Availability 203(150)
8 The Role of the NameNode and How HDFS Works 205(38)
HDFS-The Interaction between the NameNode and the DataNodes 205(4)
Interaction between the Clients and HDFS 206(1)
NameNode and DataNode Communications 207(2)
Rack Awareness and Topology 209(3)
How to Configure Rack Awareness in Your Cluster 210(1)
Finding Your Cluster's Rack Information 210(2)
HDFS Data Replication 212(6)
HDFS Data Organization and Data Blocks 213(1)
Data Replication 213(3)
Block and Replica States 216(2)
How Clients Read and Write HDFS Data 218(6)
How Clients Read HDFS Data 219(1)
How Clients Write Data to HDFS 220(4)
Understanding HDFS Recovery Processes 224(3)
Generation Stamp 224(1)
Lease Recovery 224(2)
Block Recovery 226(1)
Pipeline Recovery 226(1)
Centralized Cache Management in HDFS 227(5)
Hadoop and OS Page Caching 228(1)
The Key Principles Behind Centralized Cache Management 228(1)
How Centralized Cache Management Works 229(1)
Configuring Caching 229(1)
Cache Directives 230(1)
Cache Pools 230(1)
Using the Cache 231(1)
Hadoop Archival Storage, SSD and Memory (Heterogeneous Storage) 232(9)
Performance Characteristics of Storage Types 233(1)
The Need for Heterogeneous HDFS Storage 233(1)
Changes in the Storage Architecture 234(1)
Storage Preferences for Files 235(1)
Setting Up Archival Storage 235(4)
Managing Storage Policies 239(1)
Moving Data Around 239(1)
Implementing Archival Storage 240(1)
Summary 241(2)
9 HDFS Commands, HDFS Permissions and HDFS Storage 243(34)
Managing HDFS through the HDFS Shell Commands 243(8)
Using the hdfs dfs Utility to Manage HDFS 245(2)
Listing HDFS Files and Directories 247(2)
Creating an HDFS Directory 249(1)
Removing HDFS Files and Directories 249(1)
Changing File and Directory Ownership and Groups 250(1)
Using the dfsadmin Utility to Perform HDFS Operations 251(4)
The dfsadmin report Command 252(3)
Managing HDFS Permissions and Users 255(5)
HDFS File Permissions 255(2)
HDFS Users and Super Users 257(3)
Managing HDFS Storage 260(7)
Checking HDFS Disk Usage 260(3)
Allocating HDFS Space Quotas 263(4)
Rebalancing HDFS Data 267(7)
Reasons for HDFS Data Imbalance 268(1)
Running the Balancer Tool to Balance HDFS Data 268(3)
Using hdfs dfsadmin to Make Things Easier 271(2)
When to Run the Balancer 273(1)
Reclaiming HDFS Space 274(2)
Removing Files and Directories 274(1)
Decreasing the Replication Factor 274(2)
Summary 276(1)
10 Data Protection, File Formats and Accessing HDFS 277(40)
Safeguarding Data 278(11)
Using HDFS Trash to Prevent Accidental Data Deletion 278(2)
Using HDFS Snapshots to Protect Important Data 280(4)
Ensuring Data Integrity with File System Checks 284(5)
Data Compression 289(6)
Common Compression Formats 290(1)
Evaluating the Various Compression Schemes 291(1)
Compression at Various Stages for MapReduce 291(4)
Compression for Spark 295(1)
Data Serialization 295(1)
Hadoop File Formats 295(13)
Criteria for Determining the Right File Format 296(2)
File Formats Supported by Hadoop 298(4)
The Ideal File Format 302(1)
The Hadoop Small Files Problem and Merging Files 303(1)
Using a Federated NameNode to Overcome the Small Files Problem 304(1)
Using Hadoop Archives to Manage Many Small Files 304(3)
Handling the Performance Impact of Small Files 307(1)
Using Hadoop WebHDFS and HttpFS 308(7)
WebHDFS-The Hadoop REST API 308(1)
Using the WebHDFS API 309(1)
Understanding the WebHDFS Commands 310(3)
Using HttpFS Gateway to Access HDFS from Behind a Firewall 313(2)
Summary 315(2)
11 NameNode Operations, High Availability and Federation 317(36)
Understanding NameNode Operations 318(5)
HDFS Metadata 319(2)
The NameNode Startup Process 321(1)
How the NameNode and the DataNodes Work Together 322(1)
The Checkpointing Process 323(6)
Secondary, Checkpoint, Backup and Standby Nodes 324(1)
Configuring the Checkpointing Frequency 325(2)
Managing Checkpoint Performance 327(1)
The Mechanics of Checkpointing 327(2)
NameNode Safe Mode Operations 329(5)
Automatic Safe Mode Operations 329(1)
Placing the NameNode in Safe Mode 330(1)
How the NameNode Transitions Through Safe Mode 331(1)
Backing Up and Recovering the NameNode Metadata 332(2)
Configuring HDFS High Availability 334(15)
NameNode HA Architecture (QJM) 335(2)
Setting Up an HDFS HA Quorum Cluster 337(5)
Deploying the High-Availability NameNodes 342(3)
Managing an HA NameNode Setup 345(1)
HA Manual and Automatic Failover 346(3)
HDFS Federation 349(2)
Architecture of a Federated NameNode 350(1)
Summary 351(2)
IV Moving Data, Allocating Resources, Scheduling Jobs and Security 353(174)
12 Moving Data Into and Out of Hadoop 355(52)
Introduction to Hadoop Data Transfer Tools 355(1)
Loading Data into HDFS from the Command Line 356(5)
Using the -cat Command to Dump a File's Contents 356(1)
Testing HDFS Files 357(1)
Copying and Moving Files from and to HDFS 358(1)
Using the -get Command to Move Files 359(1)
Moving Files from and to HDFS 360(1)
Using the -tail and -head Commands 360(1)
Copying HDFS Data between Clusters with DistCp 361(4)
How to Use the DistCp Command to Move Data 361(2)
DistCp Options 363(2)
Ingesting Data from Relational Databases with Sqoop 365(23)
Sqoop Architecture 366(1)
Deploying Sqoop 367(1)
Using Sqoop to Move Data 368(1)
Importing Data with Sqoop 368(11)
Importing Data into Hive 379(2)
Exporting Data with Sqoop 381(7)
Ingesting Data from External Sources with Flume 388(10)
Flume Architecture in a Nutshell 389(2)
Configuring the Flume Agent 391(1)
A Simple Flume Example 392(2)
Using Flume to Move Data to HDFS 394(1)
A More Complex Flume Example 395(3)
Ingesting Data with Kafka 398(8)
Benefits Offered by Kafka 398(1)
How Kafka Works 399(2)
Setting Up an Apache Kafka Cluster 401(3)
Integrating Kafka with Hadoop and Storm 404(2)
Summary 406(1)
13 Resource Allocation in a Hadoop Cluster 407(30)
Resource Allocation in Hadoop 407(3)
Managing Cluster Workloads 408(1)
Hadoop's Resource Schedulers 409(1)
The FIFO Scheduler 410(1)
The Capacity Scheduler 411(15)
Queues and Subqueues 412(6)
How the Cluster Allocates Resources 418(3)
Preempting Applications 421(1)
Enabling the Capacity Scheduler 422(1)
A Typical Capacity Scheduler 422(4)
The Fair Scheduler 426(9)
Queues 427(1)
Configuring the Fair Scheduler 428(2)
How Jobs Are Placed into Queues 430(1)
Application Preemption in the Fair Scheduler 431(1)
Security and Resource Pools 432(1)
A Sample fair-scheduler.xml File 432(2)
Submitting Jobs to the Scheduler 434(1)
Moving Applications between Queues 434(1)
Monitoring the Fair Scheduler 434(1)
Comparing the Capacity Scheduler and the Fair Scheduler 435(1)
Similarities between the Two Schedulers 435(1)
Differences between the Two Schedulers 435(1)
Summary 436(1)
14 Working with Oozie to Manage Job Workflows 437(40)
Using Apache Oozie to Schedule Jobs 437(2)
Oozie Architecture 439(2)
The Oozie Server 439(1)
The Oozie Client 440(1)
The Oozie Database 440(1)
Deploying Oozie in Your Cluster 441(5)
Installing and Configuring Oozie 442(2)
Configuring Hadoop for Oozie 444(2)
Understanding Oozie Workflows 446(3)
Workflows, Control Flow, and Nodes 446(1)
Defining the Workflows with the workflow.xml File 447(2)
How Oozie Runs an Action 449(5)
Configuring the Action Nodes 449(5)
Creating an Oozie Workflow 454(7)
Configuring the Control Nodes 456(4)
Configuring the Job 460(1)
Running an Oozie Workflow Job 461(3)
Specifying the Job Properties 461(2)
Deploying Oozie Jobs 463(1)
Creating Dynamic Workflows 463(1)
Oozie Coordinators 464(6)
Time-Based Coordinators 465(2)
Data-Based Coordinators 467(2)
Time-and-Data-Based Coordinators 469(1)
Submitting the Oozie Coordinator from the Command Line 469(1)
Managing and Administering Oozie 470(5)
Common Oozie Commands and How to Run Them 471(2)
Troubleshooting Oozie 473(1)
Oozie cron Scheduling and Oozie Service Level Agreements 474(1)
Summary 475(2)
15 Securing Hadoop 477(50)
Hadoop Security-An Overview 478(3)
Authentication, Authorization and Accounting 480(1)
Hadoop Authentication with Kerberos 481(24)
Kerberos and How It Works 482(1)
The Kerberos Authentication Process 483(1)
Kerberos Trusts 484(1)
A Special Principal 485(1)
Adding Kerberos Authorization to Your Cluster 486(4)
Setting Up Kerberos for Hadoop 490(5)
Securing a Hadoop Cluster with Kerberos 495(6)
How Kerberos Authenticates Users and Services 501(1)
Managing a Kerberized Hadoop Cluster 501(4)
Hadoop Authorization 505(13)
HDFS Permissions 505(5)
Service Level Authorization 510(2)
Role-Based Authorization with Apache Sentry 512(6)
Auditing Hadoop 518(2)
Auditing HDFS Operations 519(1)
Auditing YARN Operations 519(1)
Securing Hadoop Data 520(4)
HDFS Transparent Encryption 520(3)
Encrypting Data in Transition 523(1)
Other Hadoop-Related Security Initiatives 524(1)
Securing a Hadoop Infrastructure with Apache Knox Gateway 524(1)
Apache Ranger for Security Administration 525(1)
Summary 525(2)
V Monitoring, Optimization and Troubleshooting 527(220)
16 Managing Jobs, Using Hue and Performing Routine Tasks 529(40)
Using the YARN Commands to Manage Hadoop Jobs 530(5)
Viewing YARN Applications 531(1)
Checking the Status of an Application 532(1)
Killing a Running Application 532(1)
Checking the Status of the Nodes 533(1)
Checking YARN Queues 533(1)
Getting the Application Logs 533(1)
YARN Administrative Commands 534(1)
Decommissioning and Recommissioning Nodes 535(6)
Including and Excluding Hosts 536(1)
Decommissioning DataNodes and NodeManagers 537(2)
Recommissioning Nodes 539(1)
Things to Remember about Decommissioning and Recommissioning 539(1)
Adding a New DataNode and/or a NodeManager 540(1)
ResourceManager High Availability 541(4)
ResourceManager High-Availability Architecture 541(1)
Setting Up ResourceManager High Availability 542(1)
ResourceManager Failover 543(2)
Using the ResourceManager High-Availability Commands 545(1)
Performing Common Management Tasks 545(3)
Moving the NameNode to a Different Host 545(1)
Managing High-Availability NameNodes 546(1)
Using a Shutdown/Startup Script to Manage Your Cluster 546(1)
Balancing HDFS 546(1)
Balancing the Storage on the DataNodes 547(1)
Managing the MySQL Database 548(3)
Configuring a MySQL Database 548(1)
Configuring MySQL High Availability 549(2)
Backing Up Important Cluster Data 551(2)
Backing Up HDFS Metadata 552(1)
Backing Up the Metastore Databases 553(1)
Using Hue to Administer Your Cluster 553(9)
Allowing Your Users to Use Hue 554(2)
Installing Hue 556(1)
Configuring Your Cluster to Work with Hue 557(4)
Managing Hue 561(1)
Working with Hue 561(1)
Implementing Specialized HDFS Features 562(5)
Deploying HDFS and YARN in a Multihomed Network 562(1)
Short-Circuit Local Reads 563(1)
Mountable HDFS 564(2)
Using an NFS Gateway for Mounting HDFS to a Local File System 566(1)
Summary 567(2)
17 Monitoring, Metrics and Hadoop Logging 569(42)
Monitoring Linux Servers 570(6)
Basics of Linux System Monitoring 570(2)
Monitoring Tools for Linux Systems 572(4)
Hadoop Metrics 576(3)
Hadoop Metric Types 577(1)
Using the Hadoop Metrics 578(1)
Capturing Metrics to a File System 578(1)
Using Ganglia for Monitoring 579(3)
Ganglia Architecture 580(1)
Setting Up the Ganglia and Hadoop Integration 580(2)
Setting Up the Hadoop Metrics 582(1)
Understanding Hadoop Logging 582(17)
Hadoop Log Messages 583(1)
Daemon and Application Logs and How to View Them 584(1)
How Application Logging Works 585(2)
How Hadoop Uses HDFS Staging Directories and Local Directories During a Job Run 587(1)
How the NodeManager Uses the Local Directories 588(4)
Storing Job Logs in HDFS through Log Aggregation 592(5)
Working with the Hadoop Daemon Logs 597(2)
Using Hadoop's Web UIs for Monitoring 599(10)
Monitoring Jobs with the ResourceManager Web UI 599(7)
The JobHistoryServer Web UI 606(2)
Monitoring with the NameNode Web UI 608(1)
Monitoring Other Hadoop Components 609(1)
Monitoring Hive 609(1)
Monitoring Spark 610(1)
Summary 610(1)
18 Tuning the Cluster Resources, Optimizing MapReduce Jobs and Benchmarking 611(48)
How to Allocate YARN Memory and CPU 612(9)
Allocating Memory 612(8)
Configuring the Number of CPU Cores 620(1)
Relationship between Memory and CPU Vcores 621(1)
Configuring Efficient Performance 621(4)
Speculative Execution 621(3)
Reducing the I/O Load on the System 624(1)
Tuning Map and Reduce Tasks-What the Administrator Can Do 625(10)
Tuning the Map Tasks 626(1)
Input and Output 627(3)
Tuning the Reduce Tasks 630(2)
Tuning the MapReduce Shuffle Process 632(3)
Optimizing Pig and Hive Jobs 635(3)
Optimizing Hive Jobs 635(2)
Optimizing Pig Jobs 637(1)
Benchmarking Your Cluster 638(9)
Using TestDFSIO for Testing I/O Performance 638(2)
Benchmarking with TeraSort 640(3)
Using Hadoop's Rumen and GridMix for Benchmarking 643(4)
Hadoop Counters 647(5)
File System Counters 649(1)
Job Counters 649(1)
MapReduce Framework Counters 650(1)
Custom Java Counters 651(1)
Limiting the Number of Counters 651(1)
Optimizing MapReduce 652(6)
Map-Only versus Map and Reduce Jobs 652(1)
How Combiners Improve MapReduce Performance 652(2)
Using a Partitioner to Improve Performance 654(1)
Compressing Data During the MapReduce Process 654(1)
Too Many Mappers or Reducers? 655(3)
Summary 658(1)
19 Configuring and Tuning Apache Spark on YARN 659(32)
Configuring Resource Allocation for Spark on YARN 659(17)
Allocating CPU 660(1)
Allocating Memory 660(1)
How Resources Are Allocated to Spark 660(1)
Limits on the Resource Allocation to Spark Applications 661(2)
Allocating Resources to the Driver 663(3)
Configuring Resources for the Executors 666(4)
How Spark Uses Its Memory 670(2)
Things to Remember 672(2)
Cluster or Client Mode? 674(2)
Configuring Spark-Related Network Parameters 676(1)
Dynamic Resource Allocation when Running Spark on YARN 676(2)
Dynamic and Static Resource Allocation 676(1)
How Spark Manages Dynamic Resource Allocation 677(1)
Enabling Dynamic Resource Allocation 677(1)
Storage Formats and Compressing Data 678(3)
Storage Formats 679(1)
File Sizes 680(1)
Compression 680(1)
Monitoring Spark Applications 681(5)
Using the Spark Web UI to Understand Performance 682(2)
Spark System and the Metrics REST API 684(1)
The Spark History Server on YARN 684(2)
Tracking Jobs from the Command Line 686(1)
Tuning Garbage Collection 686(2)
The Mechanics of Garbage Collection 687(1)
How to Collect GC Statistics 687(1)
Tuning Spark Streaming Applications 688(1)
Reducing Batch Processing Time 688(1)
Setting the Right Batch Interval 689(1)
Tuning Memory and Garbage Collection 689(1)
Summary 689(2)
20 Optimizing Spark Applications 691(34)
Revisiting the Spark Execution Model 692(2)
The Spark Execution Model 692(2)
Shuffle Operations and How to Minimize Them 694(9)
A WordCount Example to Our Rescue Again 695(1)
Impact of a Shuffle Operation 696(1)
Configuring the Shuffle Parameters 697(6)
Partitioning and Parallelism (Number of Tasks) 703(7)
Level of Parallelism 704(2)
Problems with Too Few Tasks 706(1)
Setting the Default Number of Partitions 706(1)
How to Increase the Number of Partitions 707(1)
Using the Repartition and Coalesce Operators to Change the Number of Partitions in an RDD 708(1)
Two Types of Partitioners 709(1)
Data Partitioning and How It Can Avoid a Shuffle 709(1)
Optimizing Data Serialization and Compression 710(2)
Data Serialization 710(1)
Configuring Compression 711(1)
Understanding Spark's SQL Query Optimizer 712(5)
Understanding the Optimizer Steps 712(2)
Spark's Speculative Execution Feature 714(1)
The Importance of Data Locality 715(2)
Caching Data 717(6)
Fault-Tolerance Due to Caching 718(1)
How to Specify Caching 718(5)
Summary 723(2)
21 Troubleshooting Hadoop-A Sampler 725(18)
Space-Related Issues 725(6)
Dealing with a 100 Percent Full Linux File System 726(1)
HDFS Space Issues 727(1)
Local and Log Directories Out of Free Space 727(2)
Disk Volume Failure Toleration 729(2)
Handling YARN Jobs That Are Stuck 731(1)
JVM Memory-Allocation and Garbage-Collection Strategies 732(5)
Understanding JVM Garbage Collection 732(1)
Optimizing Garbage Collection 733(1)
Analyzing Memory Usage 734(1)
Out of Memory Errors 734(1)
ApplicationMaster Memory Issues 735(2)
Handling Different Types of Failures 737(2)
Handling Daemon Failures 737(1)
Starting Failures for Hadoop Daemons 737(1)
Task and Job Failures 738(1)
Troubleshooting Spark Jobs 739(1)
Spark's Fault Tolerance Mechanism 740(1)
Killing Spark Jobs 740(1)
Maximum Attempts for a Job 740(1)
Maximum Failures per Job 740(1)
Debugging Spark Applications 740(2)
Viewing Logs with Log Aggregation 740(1)
Viewing Logs When Log Aggregation Is Not Enabled 741(1)
Reviewing the Launch Environment 741(1)
Summary 742(1)
22 Installing VirtualBox and Linux and Cloning the Virtual Machines 743(4)
Installing Oracle VirtualBox 744(1)
Installing Oracle Enterprise Linux 745(1)
Cloning the Linux Server 745(2)
Index 747
Sam R. Alapati has been working with various aspects of the Hadoop environment for the past six years. He is currently the principal Hadoop administrator at Sabre Corporation in Westlake, Texas, where he works daily with multiple large Hadoop 2 clusters. In addition to being the point person for all Hadoop administration at Sabre, Sam manages multiple critical data-science- and data-analysis-related Hadoop job flows and is also an expert Oracle database administrator. His deep knowledge of relational databases and SQL informs his work on Hadoop-related projects. Sam's recognition in the database and middleware area includes 18 well-received books published over the past 14 years, mostly on Oracle Database administration and Oracle WebLogic Server. His experience with numerous configuration, architectural, and performance-related Hadoop issues over the years led him to realize that many working Hadoop administrators and developers would appreciate a handy reference such as this book when creating, managing, securing, and optimizing their Hadoop infrastructure.