
E-book: Practical Enterprise Data Lake Insights: Handle Data-Driven Challenges in an Enterprise Big Data Lake

  • Format: EPUB+DRM
  • Publication date: 27-Jun-2018
  • Publisher: APress
  • Language: English
  • ISBN-13: 9781484235225
  • Price: 55.56 €*
  • * the price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital Rights Management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you need to install special software in order to read it. You also need to create an Adobe ID. More information here. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, install Adobe Digital Editions. (This is a free application designed specifically for reading e-books. It should not be confused with Adobe Reader, which is probably already installed on your computer.)

    This e-book cannot be read on an Amazon Kindle.

Use this practical guide to successfully handle the challenges encountered when designing an enterprise data lake and learn industry best practices to resolve issues.

When designing an enterprise data lake, you often hit a roadblock when you must leave the comfort of the relational world and learn the nuances of handling non-relational data. Starting with sourcing data into the Hadoop ecosystem, you will work through stages that raise tough questions about data processing, data querying, and security. Concepts such as change data capture and data streaming are covered. The book takes an end-to-end solution approach in a data lake environment that includes data security, high availability, data processing, data streaming, and more.

Each chapter includes application of a concept, code snippets, and use case demonstrations to provide you with a practical approach. You will learn the concept, scope, application, and starting point.
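To give a flavor of the kind of concept the book walks through, the change-data-capture and checkpointing topics listed in the contents can be sketched in a few lines of Python. This sketch is not taken from the book; the function name, the change-record shape, and the `seq` field are hypothetical illustrations of an incremental CDC pull with a simple checkpoint:

```python
# Minimal illustration of incremental change data capture (CDC) with a
# checkpoint. A change record carries a monotonically increasing sequence
# number; the checkpoint remembers the highest sequence already consumed.

def capture_changes(rows, last_checkpoint):
    """Return rows newer than the checkpoint, plus the updated checkpoint."""
    new_rows = [r for r in rows if r["seq"] > last_checkpoint]
    new_checkpoint = max((r["seq"] for r in new_rows), default=last_checkpoint)
    return new_rows, new_checkpoint

# A hypothetical change stream from a source table.
source = [
    {"seq": 1, "op": "INSERT", "id": 101},
    {"seq": 2, "op": "UPDATE", "id": 101},
    {"seq": 3, "op": "INSERT", "id": 102},
]

# First pull: everything after checkpoint 0 is captured.
batch, ckpt = capture_changes(source, 0)

# Second pull with the saved checkpoint: nothing new, checkpoint unchanged.
batch2, ckpt2 = capture_changes(source, ckpt)
```

On each pull, only rows whose sequence number exceeds the stored checkpoint are captured, so a consumer can resume after a failure without reprocessing earlier changes.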

What You'll Learn
  • Get to know data lake architecture and design principles
  • Implement data capture and streaming strategies
  • Implement data processing strategies in Hadoop
  • Understand the data lake security framework and availability model
Who This Book Is For

Big data architects and solution architects
Table of Contents

About the Authors
About the Technical Reviewer
Acknowledgments
Foreword

Chapter 1: Introduction to Enterprise Data Lakes
  • Data explosion: the beginning
  • Big data ecosystem
  • Hadoop and MapReduce - early days
  • Evolution of Hadoop
  • History of Data Lake
  • Data Lake: the concept
  • Data lake architecture
  • Why Data Lake?
  • Data Lake Characteristics
  • Data lake vs. Data warehouse
  • How to achieve success with Data Lake?
  • Data governance and data operations
  • Data democratization with data lake
  • Fast Data - life beyond Big Data
  • Conclusion

Chapter 2: Data Lake Ingestion Strategies
  • What is data ingestion?
  • Understand the data sources
  • Structured vs. Semi-structured vs. Unstructured data
  • Data ingestion framework parameters
  • ETL vs. ELT
  • Big Data Integration with Data Lake
  • Hadoop Distributed File System (HDFS)
  • Copy files directly into HDFS
  • Batched data ingestion
  • Challenges and design considerations
  • Design considerations
  • Commercial ETL tools
  • Real-time ingestion
  • CDC design considerations
  • Example of a CDC pipeline: Databus, LinkedIn's open-source solution
  • Apache Sqoop
  • Sqoop 1
  • Sqoop 2
  • How Sqoop Works
  • Sqoop design considerations
  • Native ingestion utilities
  • Oracle copyToBDA
  • Greenplum gphdfs utility
  • Data transfer from Greenplum using gpfdist
  • Ingest unstructured data into Hadoop
  • Apache Flume
  • Tiered architecture for convergent flow of events
  • Features and design considerations
  • Conclusion

Chapter 3: Capture Streaming Data with Change-Data-Capture
  • Change Data Capture Concepts
  • Strategies for Data Capture
  • Retention and Replay
  • Retention Period
  • Types of CDC
  • Incremental
  • Bulk
  • Hybrid
  • CDC Trade-offs
  • CDC Tools
  • Challenges
  • Downstream Propagation
  • Use Case
  • Centralization of Change Data
  • Analyzing a Centralized Data Store
  • Metadata: Data about Data
  • Structure of Data
  • Privacy/Sensitivity Information
  • Special Fields
  • Data Formats
  • Delimited Format
  • Avro File Format
  • Consumption and Checkpointing
  • Simple Checkpoint Mechanism
  • Parallelism
  • Merging and Consolidation
  • Design Considerations for Merge and Consolidate
  • Data Quality
  • Challenges
  • Design Aspects
  • Operational Aspects
  • Publishing to Kafka
  • Schema and Data
  • Sample Schema
  • Schema Repository
  • Multiple Topics and Partitioning
  • Sizing and Scaling
  • Tools
  • Conclusion

Chapter 4: Data Processing Strategies in Data Lakes
  • MapReduce Processing Framework
  • Motivation: Why MapReduce?
  • MapReduce V1 Refresher and Design Considerations
  • Yet Another Resource Negotiator - YARN
  • Hive
  • Hive - Quick Refresher
  • Hive Metastore (a.k.a. HCatalog)
  • Hive - Design Considerations
  • Hive LLAP
  • Apache Pig
  • Pig Execution Architecture
  • Apache Spark
  • Why Spark?
  • Resilient Distributed Datasets (RDD)
  • RDD Runtime Components
  • RDD Composition
  • Datasets and DataFrames
  • Deployment Modes of Spark Application
  • Design Considerations
  • Caching and Persistence of an RDD in Spark
  • RDD Shared Variables
  • SQL on Hadoop
  • Presto
  • Oracle Big Data SQL
  • Design Considerations
  • Conclusion

Chapter 5: Data Archiving Strategies in Data Lakes
  • The Act of Data Governance
  • Data lake vs. Data swamp
  • Introduction to Data Archival
  • Data Lifecycle Management (DLM)
  • DLM policy actions
  • DLM strategies
  • DLM design considerations
  • Amazon S3 and Glacier storage classes
  • Design considerations
  • DLM Case Study - Archiving with Amazon
  • Conclusion

Chapter 6: Data Security in Data Lakes
  • System Architecture
  • Network Security
  • Hadoop Roles within a cluster
  • Host Firewalls for operating system security
  • Data in Motion
  • Communication Problem
  • Data at Rest
  • Procedure to generate and verify a key in LUKS
  • Access flow for the user
  • Performance using LUKS
  • Multiple passphrases with LUKS
  • Kerberos
  • Kerberos Protocol overview
  • Kerberos components
  • Kerberos flow
  • Kerberos commands
  • HDFS ACL
  • HDFS Authorization with Apache Ranger
  • What Ranger Does
  • Conclusion

Chapter 7: Ensure High Availability of Data Lake
  • Scale Hadoop through HDFS federation
  • High availability of Hadoop components
  • Hive metastore
  • HiveServer2 and Zookeeper integration
  • Setup HA for Kerberos
  • NameNode high availability
  • Architecture
  • Data Center disaster recovery strategies
  • Data replication strategies
  • Active-passive data center replication
  • Active-active data center replication
  • Conclusion

Chapter 8: Managing Data Lake Operations
  • Monitoring Architecture
  • Hadoop metrics architecture
  • Identification of source components
  • YARN metrics
  • MapReduce metrics
  • HDFS
  • Metric collection tools
  • Metrics and log storage
  • Logs and Metrics visualization
  • Kibana
  • Apache Ambari
  • Data lake operationalization
  • Conclusion

Index
About the Authors

Saurabh K. Gupta is a technology leader, published author, and database enthusiast with more than 11 years of industry experience in data architecture, engineering, development, and administration. As a Manager, Data & Analytics at GE Transportation, he focuses on data lake analytics programs that build digital solutions for business stakeholders. In the past, he worked extensively with Oracle database design and development, PaaS and IaaS cloud service models, consolidation, and in-memory technologies. He has authored two books on advanced PL/SQL for Oracle versions 11g and 12c, and he is a frequent speaker at conferences organized by the user community and technical institutions. He tweets at @saurabhkg and blogs at sbhoracle.wordpress.com.

Venkata Giri currently works with GE Digital and has been involved in building resilient distributed services at massive scale. He has worked on the big data tech stack, relational databases, high availability, and performance tuning. With over 20 years of experience in data technologies, he has in-depth knowledge of big data ecosystems, complex data ingestion pipelines, data engineering, data processing, and operations. Prior to GE, he worked with the data teams at LinkedIn and Yahoo.