Fundamentals of Data Engineering: Plan and Build Robust Data Systems [Paperback]

4.19/5 (1509 ratings from Goodreads)

Contributions by Matt Housley, Joe Reis

Format: Paperback / softback, 446 pages
Publication date: 05-Jul-2022
Publisher: O'Reilly Media
ISBN-10: 1098108302
ISBN-13: 9781098108304

Paperback
Price: 75,81 €*
* the price is final, i.e. no other discounts apply
List price: 89,19 €
You save 15%
Delivery of the book from the publisher takes approximately 2-4 weeks

Permalink: https://www.kriso.ee/db/9781098108304.html

Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you will learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available in the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You will understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, governance, and deployment that are critical in any data environment regardless of the underlying technology.

This book will help you:

- Assess data engineering problems using an end-to-end data framework of best practices
- Cut through marketing hype when choosing data technologies, architecture, and processes
- Use the data engineering lifecycle to design and build a robust architecture
- Incorporate data governance and security across the data engineering lifecycle

Preface

Part I Foundation and Building Blocks

1 Data Engineering Described
- What Is Data Engineering?
- Data Engineering Defined
- The Data Engineering Lifecycle
- Evolution of the Data Engineer
- Data Engineering and Data Science
- Data Engineering Skills and Activities
- Data Maturity and the Data Engineer
- The Background and Skills of a Data Engineer
- Business Responsibilities
- Technical Responsibilities
- The Continuum of Data Engineering Roles, from A to B
- Data Engineers Inside an Organization
- Internal-Facing Versus External-Facing Data Engineers
- Data Engineers and Other Technical Roles
- Data Engineers and Business Leadership
- Conclusion
- Additional Resources

2 The Data Engineering Lifecycle
- What Is the Data Engineering Lifecycle?
- The Data Lifecycle Versus the Data Engineering Lifecycle
- Generation: Source Systems
- Storage
- Ingestion
- Transformation
- Serving Data
- Major Undercurrents Across the Data Engineering Lifecycle
- Security
- Data Management
- DataOps
- Data Architecture
- Orchestration
- Software Engineering
- Conclusion
- Additional Resources

3 Designing Good Data Architecture
- What Is Data Architecture?
- Enterprise Architecture Defined
- Data Architecture Defined
- "Good" Data Architecture
- Principles of Good Data Architecture
- Principle 1: Choose Common Components Wisely
- Principle 2: Plan for Failure
- Principle 3: Architect for Scalability
- Principle 4: Architecture Is Leadership
- Principle 5: Always Be Architecting
- Principle 6: Build Loosely Coupled Systems
- Principle 7: Make Reversible Decisions
- Principle 8: Prioritize Security
- Principle 9: Embrace FinOps
- Major Architecture Concepts
- Domains and Services
- Distributed Systems, Scalability, and Designing for Failure
- Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
- User Access: Single Versus Multitenant
- Event-Driven Architecture
- Brownfield Versus Greenfield Projects
- Examples and Types of Data Architecture
- Data Warehouse
- Data Lake
- Convergence, Next-Generation Data Lakes, and the Data Platform
- Modern Data Stack
- Lambda Architecture
- Kappa Architecture
- The Dataflow Model and Unified Batch and Streaming
- Architecture for IoT
- Data Mesh
- Other Data Architecture Examples
- Who's Involved with Designing a Data Architecture?
- Conclusion
- Additional Resources

4 Choosing Technologies Across the Data Engineering Lifecycle
- Team Size and Capabilities
- Speed to Market
- Interoperability
- Cost Optimization and Business Value
- Total Cost of Ownership
- Total Opportunity Cost of Ownership
- FinOps
- Today Versus the Future: Immutable Versus Transitory Technologies
- Our Advice
- Location
- On Premises
- Cloud
- Hybrid Cloud
- Multicloud
- Decentralized: Blockchain and the Edge
- Our Advice
- Cloud Repatriation Arguments
- Build Versus Buy
- Open Source Software
- Proprietary Walled Gardens
- Our Advice
- Monolith Versus Modular
- Monolith
- Modularity
- The Distributed Monolith Pattern
- Our Advice
- Serverless Versus Servers
- Serverless
- Containers
- How to Evaluate Server Versus Serverless
- Our Advice
- Optimization, Performance, and the Benchmark Wars
- Big Data...for the 1990s
- Nonsensical Cost Comparisons
- Asymmetric Optimization
- Caveat Emptor
- Undercurrents and Their Impacts on Choosing Technologies
- Data Management
- DataOps
- Data Architecture
- Orchestration Example: Airflow
- Software Engineering
- Conclusion
- Additional Resources

Part II The Data Engineering Lifecycle in Depth

5 Data Generation in Source Systems
- Sources of Data: How Is Data Created?
- Source Systems: Main Ideas
- Files and Unstructured Data
- APIs
- Application Databases (OLTP Systems)
- Online Analytical Processing System
- Change Data Capture
- Logs
- Database Logs
- CRUD
- Insert-Only
- Messages and Streams
- Types of Time
- Source System Practical Details
- Databases
- APIs
- Data Sharing
- Third-Party Data Sources
- Message Queues and Event-Streaming Platforms
- Whom You'll Work With
- Undercurrents and Their Impact on Source Systems
- Security
- Data Management
- DataOps
- Data Architecture
- Orchestration
- Software Engineering
- Conclusion
- Additional Resources

6 Storage
- Raw Ingredients of Data Storage
- Magnetic Disk Drive
- Solid-State Drive
- Random Access Memory
- Networking and CPU
- Serialization
- Compression
- Caching
- Data Storage Systems
- Single Machine Versus Distributed Storage
- Eventual Versus Strong Consistency
- File Storage
- Block Storage
- Object Storage
- Cache and Memory-Based Storage Systems
- The Hadoop Distributed File System
- Streaming Storage
- Indexes, Partitioning, and Clustering
- Data Engineering Storage Abstractions
- The Data Warehouse
- The Data Lake
- The Data Lakehouse
- Data Platforms
- Stream-to-Batch Storage Architecture
- Big Ideas and Trends in Storage
- Data Catalog
- Data Sharing
- Schema
- Separation of Compute from Storage
- Data Storage Lifecycle and Data Retention
- Single-Tenant Versus Multitenant Storage
- Whom You'll Work With
- Undercurrents
- Security
- Data Management
- DataOps
- Data Architecture
- Orchestration
- Software Engineering
- Conclusion
- Additional Resources

7 Ingestion
- What Is Data Ingestion?
- Key Engineering Considerations for the Ingestion Phase
- Bounded Versus Unbounded Data
- Frequency
- Synchronous Versus Asynchronous Ingestion
- Serialization and Deserialization
- Throughput and Scalability
- Reliability and Durability
- Payload
- Push Versus Pull Versus Poll Patterns
- Batch Ingestion Considerations
- Snapshot or Differential Extraction
- File-Based Export and Ingestion
- ETL Versus ELT
- Inserts, Updates, and Batch Size
- Data Migration
- Message and Stream Ingestion Considerations
- Schema Evolution
- Late-Arriving Data
- Ordering and Multiple Delivery
- Replay
- Time to Live
- Message Size
- Error Handling and Dead-Letter Queues
- Consumer Pull and Push
- Location
- Ways to Ingest Data
- Direct Database Connection
- Change Data Capture
- APIs
- Message Queues and Event-Streaming Platforms
- Managed Data Connectors
- Moving Data with Object Storage
- EDI
- Databases and File Export
- Practical Issues with Common File Formats
- Shell
- SSH
- SFTP and SCP
- Webhooks
- Web Interface
- Web Scraping
- Transfer Appliances for Data Migration
- Data Sharing
- Whom You'll Work With
- Upstream Stakeholders
- Downstream Stakeholders
- Undercurrents
- Security
- Data Management
- DataOps
- Orchestration
- Software Engineering
- Conclusion
- Additional Resources

8 Queries, Modeling, and Transformation
- Queries
- What Is a Query?
- The Life of a Query
- The Query Optimizer
- Improving Query Performance
- Queries on Streaming Data
- Data Modeling
- What Is a Data Model?
- Conceptual, Logical, and Physical Data Models
- Normalization
- Techniques for Modeling Batch Analytical Data
- Modeling Streaming Data
- Transformations
- Batch Transformations
- Materialized Views, Federation, and Query Virtualization
- Streaming Transformations and Processing
- Whom You'll Work With
- Upstream Stakeholders
- Downstream Stakeholders
- Undercurrents
- Security
- Data Management
- DataOps
- Data Architecture
- Orchestration
- Software Engineering
- Conclusion
- Additional Resources

9 Serving Data for Analytics, Machine Learning, and Reverse ETL
- General Considerations for Serving Data
- Trust
- What's the Use Case, and Who's the User?
- Data Products
- Self-Service or Not?
- Data Definitions and Logic
- Data Mesh
- Analytics
- Business Analytics
- Operational Analytics
- Embedded Analytics
- Machine Learning
- What a Data Engineer Should Know About ML
- Ways to Serve Data for Analytics and ML
- File Exchange
- Databases
- Streaming Systems
- Query Federation
- Data Sharing
- Semantic and Metrics Layers
- Serving Data in Notebooks
- Reverse ETL
- Whom You'll Work With
- Undercurrents
- Security
- Data Management
- DataOps
- Data Architecture
- Orchestration
- Software Engineering
- Conclusion
- Additional Resources

Part III Security, Privacy, and the Future of Data Engineering

10 Security and Privacy
- People
- The Power of Negative Thinking
- Always Be Paranoid
- Processes
- Security Theater Versus Security Habit
- Active Security
- The Principle of Least Privilege
- Shared Responsibility in the Cloud
- Always Back Up Your Data
- An Example Security Policy
- Technology
- Patch and Update Systems
- Encryption
- Logging, Monitoring, and Alerting
- Network Access
- Security for Low-Level Data Engineering
- Conclusion
- Additional Resources

11 The Future of Data Engineering
- The Data Engineering Lifecycle Isn't Going Away
- The Decline of Complexity and the Rise of Easy-to-Use Data Tools
- The Cloud-Scale Data OS and Improved Interoperability
- "Enterprisey" Data Engineering
- Titles and Responsibilities Will Morph...
- Moving Beyond the Modern Data Stack, Toward the Live Data Stack
- The Live Data Stack
- Streaming Pipelines and Real-Time Analytical Databases
- The Fusion of Data with Applications
- The Tight Feedback Between Applications and ML
- Dark Matter Data and the Rise of...Spreadsheets?!
- Conclusion

Appendix A: Serialization and Compression Technical Details

Appendix B: Cloud Networking

Index

Joe Reis is a business-minded data nerd who's worked in the data industry for 20 years, with responsibilities ranging from statistical modeling, forecasting, and machine learning to data engineering, data architecture, and almost everything else in between. Joe is the CEO and cofounder of Ternary Data, a data engineering and architecture consulting firm based in Salt Lake City, Utah. In addition, he volunteers with several technology groups and teaches at the University of Utah. In his spare time, Joe likes to rock climb, produce electronic music, and take his kids on crazy adventures.

Matt Housley is a data engineering consultant and cloud specialist. After some early programming experience with Logo, Basic, and 6502 assembly, he completed a PhD in mathematics at the University of Utah. Matt then began working in data science, eventually specializing in cloud-based data engineering. He cofounded Ternary Data with Joe Reis, where he leverages his teaching experience to train future data engineers and advise teams on robust data architecture. Matt and Joe also pontificate on all things data on The Monday Morning Data Chat.
