Designing Cloud Data Platforms [Paperback]

  • Format: Paperback / softback, 336 pages, height x width x thickness: 235x187x24 mm, weight: 600 g
  • Publication date: 11-Jun-2021
  • Publisher: Manning Publications
  • ISBN-10: 1617296449
  • ISBN-13: 9781617296444
Well-designed pipelines, storage systems, and APIs eliminate the complicated scaling and maintenance required with on-prem data centers. Once you learn the patterns for designing cloud data platforms, you'll maximize performance no matter which cloud vendor you use. In Designing Cloud Data Platforms, Danil Zburivsky and Lynda Partner reveal a six-layer approach that increases flexibility and reduces costs. Discover patterns for ingesting data from a variety of sources, then learn to harness pre-built services provided by cloud vendors.

Summary
Centralized data warehouses, the long-time de facto standard for housing data for analytics, are rapidly giving way to multi-faceted cloud data platforms. Companies that embrace modern cloud data platforms benefit from an integrated view of their business using all of their data and can take advantage of advanced analytic practices to drive predictions and as-yet-unimagined data services. Designing Cloud Data Platforms is a hands-on guide to envisioning and designing a modern scalable data platform that takes full advantage of the flexibility of the cloud. As you read, you'll learn the core components of a cloud data platform design, along with the role of key technologies like Spark and Kafka Streams. You'll also explore setting up processes to manage cloud-based data, keeping it secure, and using advanced analytic and BI tools to analyze it.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the technology
Well-designed pipelines, storage systems, and APIs eliminate the complicated scaling and maintenance required with on-prem data centers. Once you learn the patterns for designing cloud data platforms, you'll maximize performance no matter which cloud vendor you use.

About the book
In Designing Cloud Data Platforms, Danil Zburivsky and Lynda Partner reveal a six-layer approach that increases flexibility and reduces costs. Discover patterns for ingesting data from a variety of sources, then learn to harness pre-built services provided by cloud vendors.

What's inside
    Best practices for structured and unstructured data sets
    Cloud-ready machine learning tools
    Metadata and real-time analytics
    Defensive architecture, access, and security

About the reader
For data professionals familiar with the basics of cloud computing, and Hadoop or Spark.

About the authors
Danil Zburivsky has over 10 years of experience designing and supporting large-scale data infrastructure for enterprises across the globe. Lynda Partner is the VP of Analytics-as-a-Service at Pythian, and has been on the business side of data for over 20 years.

Table of Contents
1 Introducing the data platform
2 Why a data platform and not just a data warehouse
3 Getting bigger and leveraging the Big 3: Amazon, Microsoft Azure, and Google
4 Getting data into the platform
5 Organizing and processing data
6 Real-time data processing and analytics
7 Metadata layer architecture
8 Schema management
9 Data access and security
10 Fueling business value with data platforms
Preface
Acknowledgments
About this book
About the authors
About the cover illustration
1 Introducing the data platform
    1.1 The trends behind the change from data warehouses to data platforms
    1.2 Data warehouses struggle with data variety, volume, and velocity
        Variety
        Volume
        Velocity
        All the V's at once
    1.3 Data lakes to the rescue?
    1.4 Along came the cloud
    1.5 Cloud, data lakes, and data warehouses: The emergence of cloud data platforms
    1.6 Building blocks of a cloud data platform
        Ingestion layer
        Storage layer
        Processing layer
        Serving layer
    1.7 How the cloud data platform deals with the three V's
        Variety
        Volume
        Velocity
        Two more V's
    1.8 Common use cases
2 Why a data platform and not just a data warehouse
    2.1 Cloud data platforms and cloud data warehouses: The practical aspects
        A closer look at the data sources
        An example cloud data warehouse-only architecture
        An example cloud data platform architecture
    2.2 Ingesting data
        Ingesting data directly into Azure Synapse
        Ingesting data into an Azure data platform
        Managing changes in upstream data sources
    2.3 Processing data
        Processing data in the warehouse
        Processing data in the data platform
    2.4 Accessing data
    2.5 Cloud cost considerations
    2.6 Exercise answers
3 Getting bigger and leveraging the Big 3: Amazon, Microsoft Azure, and Google
    3.1 Cloud data platform layered architecture
        Data ingestion layer
        Fast and slow storage
        Processing layer
        Technical metadata layer
        The serving layer and data consumers
        Orchestration and ETL overlay layers
    3.2 The importance of layers in a data platform architecture
    3.3 Mapping cloud data platform layers to specific tools
        AWS
        Google Cloud
        Azure
    3.4 Open source and commercial alternatives
        Batch data ingestion
        Streaming data ingestion and real-time analytics
        Orchestration layer
    3.5 Exercise answers
4 Getting data into the platform
    4.1 Databases, files, APIs, and streams
        Relational databases
        Files
        SaaS data via API
        Streams
    4.2 Ingesting data from relational databases
        Ingesting data from RDBMSs using a SQL interface
        Full-table ingestion
        Incremental table ingestion
        Change data capture (CDC)
        CDC vendors overview
        Datatype conversion
        Ingesting data from NoSQL databases
        Capturing important metadata for RDBMS or NoSQL ingestion pipelines
    4.3 Ingesting data from files
        Tracking ingested files
        Capturing file ingestion metadata
    4.4 Ingesting data from streams
        Differences between batch and streaming ingestion
        Capturing streaming pipeline metadata
    4.5 Ingesting data from SaaS applications
        No standard approach to API design
        No standard way to deal with full vs. incremental data exports
        Resulting data is typically highly nested JSON
    4.6 Network and security considerations for data ingestion into the cloud
        Connecting other networks to your cloud data platform
    4.7 Exercise answers
5 Organizing and processing data
    5.1 Processing as a separate layer in the data platform
    5.2 Data processing stages
    5.3 Organizing your cloud storage
        Cloud storage containers and folders
    5.4 Common data processing steps
        File format conversion
        Data deduplication
        Data quality checks
    5.5 Configurable pipelines
    5.6 Exercise answers
6 Real-time data processing and analytics
    6.1 Real-time ingestion vs. real-time processing
    6.2 Use cases for real-time data processing
        Retail use case: Real-time ingestion
        Online gaming use case: Real-time ingestion and real-time processing
        Summary of real-time ingestion vs. real-time processing
    6.3 When should you use real-time ingestion and/or real-time processing?
    6.4 Organizing data for real-time use
        The anatomy of fast storage
        How does fast storage scale?
        Organizing data in the real-time storage
    6.5 Common data transformations in real time
        Causes of duplicates in real-time systems
        Deduplicating data in real-time systems
        Converting message formats in real-time pipelines
        Real-time data quality checks
        Combining batch and real-time data
    6.6 Cloud services for real-time data processing
        AWS real-time processing services
        Google Cloud real-time processing services
        Azure real-time processing services
    6.7 Exercise answers
7 Metadata layer architecture
    7.1 What we mean by metadata
        Business metadata
        Data platform internal metadata or "pipeline metadata"
    7.2 Taking advantage of pipeline metadata
    7.3 Metadata model
        Metadata domains
    7.4 Metadata layer implementation options
        Metadata layer as a collection of configuration files
        Metadata database
        Metadata API
    7.5 Overview of existing solutions
        Cloud metadata services
        Open source metadata layer implementations
    7.6 Exercise answers
8 Schema management
    8.1 Why schema management
        Schema changes in a traditional data warehouse architecture
        Schema-on-read approach
    8.2 Schema-management approaches
        Schema as a contract
        Schema management in the data platform
        Monitoring schema changes
    8.3 Schema Registry implementation
        Apache Avro schemas
        Existing Schema Registry implementations
        Schema Registry as part of a Metadata layer
    8.4 Schema evolution scenarios
        Schema compatibility rules
        Schema evolution and data transformation pipelines
    8.5 Schema evolution and data warehouses
        Schema-management features of cloud data warehouses
    8.6 Exercise answers
9 Data access and security
    9.1 Different types of data consumers
    9.2 Cloud data warehouses
        AWS Redshift
        Azure Synapse
        Google BigQuery
        Choosing the right data warehouse
    9.3 Application data access
        Cloud relational databases
        Cloud key/value data stores
        Full-text search services
        In-memory cache
    9.4 Machine learning on the data platform
        Machine learning model lifecycle on a cloud data platform
        ML cloud collaboration tools
    9.5 Business intelligence and reporting tools
        Traditional BI tools and cloud data platform integration
        Using Excel as a BI tool
        BI tools that are external to the cloud provider
    9.6 Data security
        Users, groups, and roles
        Credentials and configuration management
        Data encryption
        Network boundaries
    9.7 Exercise answers
10 Fueling business value with data platforms
    10.1 Why you need a data strategy
    10.2 The analytics maturity journey
        SEE: Getting insights from data
        PREDICT: Using data to predict what to do
        DO: Making your analytics actionable
        CREATE: Going beyond analytics into products
    10.3 The data platform: The engine that powers analytics maturity
    10.4 Platform project stoppers
        Time does indeed kill
        User adoption
        User trust and the need for data governance
        Operating in a platform silo
        The dollar dance
Index