E-book: Practical Enterprise Data Lake Insights: Handle Data-Driven Challenges in an Enterprise Big Data Lake

4.00/5 (31 ratings from Goodreads)

Venkata Giri, Saurabh Gupta

Format: EPUB+DRM
Publication date: 27-Jun-2018
Publisher: Apress
Language: eng
ISBN-13: 9781484235225

Price: 55.56 €*
* The price is final; no further discounts apply.
This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

Copying (copy/paste): not allowed
Printing: not allowed
Usage:

Digital rights management (DRM)
The publisher has issued this e-book in encrypted form, which means you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

Required software
To read on a mobile device (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

To read on a PC or Mac, you must install Adobe Digital Editions (a free application designed specifically for reading e-books; it is not to be confused with Adobe Reader, which is probably already installed on your computer).

This e-book cannot be read on an Amazon Kindle.

Use this practical guide to successfully handle the challenges encountered when designing an enterprise data lake and learn industry best practices to resolve issues.

When designing an enterprise data lake you often hit a roadblock when you must leave the comfort of the relational world and learn the nuances of handling non-relational data. Starting from sourcing data into the Hadoop ecosystem, you will go through stages that can bring up tough questions such as data processing, data querying, and security. Concepts such as change data capture and data streaming are covered. The book takes an end-to-end solution approach in a data lake environment that includes data security, high availability, data processing, data streaming, and more.

Each chapter includes application of a concept, code snippets, and use case demonstrations to provide you with a practical approach. You will learn the concept, scope, application, and starting point.

What You'll Learn

Get to know data lake architecture and design principles
Implement data capture and streaming strategies
Implement data processing strategies in Hadoop
Understand the data lake security framework and availability model

Who This Book Is For

Big data architects and solution architects

About the Authors
About the Technical Reviewer
Acknowledgments
Foreword

Chapter 1 Introduction to Enterprise Data Lakes

Data explosion: the beginning
Big data ecosystem
Hadoop and MapReduce - Early days
Evolution of Hadoop
History of Data Lake
Data Lake: the concept
Data lake architecture
Why Data Lake?
Data Lake Characteristics
Data lake vs. Data warehouse
How to achieve success with Data Lake?
Data governance and data operations
Data democratization with data lake
Fast Data - Life beyond Big Data
Conclusion

Chapter 2 Data lake ingestion strategies

What is data ingestion?
Understand the data sources
Structured vs. Semi-structured vs. Unstructured data
Data ingestion framework parameters
ETL vs. ELT
Big Data Integration with Data Lake
Hadoop Distributed File System (HDFS)
Copy files directly into HDFS
Batched data ingestion
Challenges and design considerations
Design considerations
Commercial ETL tools
Real-time ingestion
CDC design considerations
Example of CDC pipeline: Databus, LinkedIn's open-source solution
Apache Sqoop
Sqoop 1
Sqoop 2
How Sqoop works?
Sqoop design considerations
Native ingestion utilities
Oracle copyToBDA
Greenplum gphdfs utility
Data transfer from Greenplum to using gpfdist
Ingest unstructured data into Hadoop
Apache Flume
Tiered architecture for convergent flow of events
Features and design considerations
Conclusion

Chapter 3 Capture Streaming Data with Change-Data-Capture

Change Data Capture Concepts
Strategies for Data Capture
Retention and Replay
Retention Period
Types of CDC
Incremental
Bulk
Hybrid
CDC - Trade-offs
CDC Tools
Challenges
Downstream Propagation
Use Case
Centralization of Change Data
Analyzing a Centralized Data Store
Metadata: Data about Data
Structure of Data
Privacy/Sensitivity Information
Special Fields
Data Formats
Delimited Format
Avro File Format
Consumption and Checkpointing
Simple Checkpoint Mechanism
Parallelism
Merging and Consolidation
Design Considerations for Merge and Consolidate
Data Quality
Challenges
Design Aspects
Operational Aspects
Publishing to Kafka
Schema and Data
Sample Schema
Schema Repository
Multiple Topics and Partitioning
Sizing and Scaling
Tools
Conclusion

Chapter 4 Data Processing Strategies in Data Lakes

MapReduce Processing Framework
Motivation: Why MapReduce?
MapReduce V1 Refresher and Design Considerations
Yet Another Resource Negotiator - YARN
Hive
Hive - Quick Refresher
Hive Metastore (a.k.a. HCatalog)
Hive - Design Considerations
Hive LLAP
Apache Pig
Pig Execution Architecture
Apache Spark
Why Spark?
Resilient Distributed Datasets (RDD)
RDD Runtime Components
RDD Composition
Datasets and DataFrames
Deployment Modes of Spark Application
Design Considerations
Caching and Persistence of an RDD in Spark
RDD Shared Variables
SQL on Hadoop
Presto
Oracle Big Data SQL
Design Considerations
Conclusion

Chapter 5 Data Archiving Strategies in Data Lakes

The Act of Data Governance
Data lake vs. Data swamp
Introduction to Data Archival
Data Lifecycle Management (DLM)
DLM policy actions
DLM strategies
DLM design considerations
Amazon S3 and Glacier storage classes
Design considerations
DLM Case Study - Archiving with Amazon
Conclusion

Chapter 6 Data Security in Data Lakes

System Architecture
Network Security
Hadoop Roles within a cluster
Host Firewalls for operating system security
Data in Motion
Communication Problem
Data at Rest
Procedure to generate and verify key in LUKS
Access flow for the user
Performance using LUKS
Multiple passphrases with LUKS
Kerberos
Kerberos Protocol overview
Kerberos components
Kerberos flow
Kerberos commands
HDFS ACL
HDFS Authorization with Apache Ranger
What Ranger does?
Conclusion

Chapter 7 Ensure High Availability of Data Lake

Scale Hadoop through HDFS federation
High availability of Hadoop components
Hive metastore
HiveServer2 and Zookeeper integration
Setup HA for Kerberos
NameNode high availability
Architecture
Data Center disaster recovery strategies
Data replication strategies
Active-passive data center replication
Active-active data center replication
Conclusion

Chapter 8 Managing Data Lake Operations

Monitoring Architecture
Hadoop metrics architecture
Identification of source components
YARN metrics
MapReduce metrics
HDFS
Metric collection tools
Metrics and log storage
Logs and Metrics visualization
Kibana
Apache Ambari
Data lake operationalization
Conclusion

Index

Saurabh K. Gupta is a technology leader, published author, and database enthusiast with more than 11 years of industry experience in data architecture, engineering, development, and administration. As Manager, Data & Analytics at GE Transportation, he focuses on data lake analytics programs that build digital solutions for business stakeholders. In the past, he has worked extensively with Oracle database design and development, PaaS and IaaS cloud service models, consolidation, and in-memory technologies. He has authored two books on advanced PL/SQL for Oracle versions 11g and 12c. He is a frequent speaker at conferences organized by the user community and technical institutions. He tweets at @saurabhkg and blogs at sbhoracle.wordpress.com.

Venkata Giri currently works with GE Digital and has been involved with building resilient distributed services at a massive scale. He has worked on the big data tech stack, relational databases, high availability, and performance tuning. With over 20 years of experience in data technologies, he has in-depth knowledge of big data ecosystems, complex data ingestion pipelines, data engineering, data processing, and operations. Prior to working at GE, he worked with the data teams at LinkedIn and Yahoo.

Permalink: https://www.kriso.ee/db/97814842352256e.html
