E-book: Self-Service Data Roadmap

  • Format: 286 pages
  • Publication date: 10-Sep-2020
  • Publisher: O'Reilly Media
  • Language: eng
  • ISBN-13: 9781492075202
  • Format: EPUB+DRM
  • Price: 40,37 €*
  • * The price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on mobile devices (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions (a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

Data-driven insights are a key competitive advantage for any industry today, but deriving insights from raw data can still take days or weeks. Most organizations can't scale data science teams fast enough to keep up with the growing amounts of data to transform. What's the answer? Self-service data.

With this practical book, data engineers, data scientists, and team managers will learn how to build a self-service data science platform that helps anyone in your organization extract insights from data. Sandeep Uttamchandani provides a scorecard to track and address bottlenecks that slow down time to insight across data discovery, transformation, processing, and production. This book bridges the gap between data scientists bottlenecked by engineering realities and data engineers unclear about ways to make self-service work.

  • Build a self-service portal to support data discovery, quality, lineage, and governance
  • Select the best approach for each self-service capability using open source cloud technologies
  • Tailor self-service for the people, processes, and technology maturity of your data platform
  • Implement capabilities to democratize data and reduce time to insight
  • Scale your self-service portal to support a large number of users within your organization

Table of contents

Preface
1 Introduction
  Journey Map from Raw Data to Insights
    Discover
    Prep
    Build
    Operationalize
  Defining Your Time-to-Insight Scorecard
  Build Your Self-Service Data Roadmap
Part I Self-Service Data Discovery
2 Metadata Catalog Service
  Journey Map
    Understanding Datasets
    Analyzing Datasets
    Knowledge Scaling
  Minimizing Time to Interpret
    Extracting Technical Metadata
    Extracting Operational Metadata
    Gathering Team Knowledge
  Defining Requirements
    Technical Metadata Extractor Requirements
    Operational Metadata Requirements
    Team Knowledge Aggregator Requirements
  Implementation Patterns
    Source-Specific Connectors Pattern
    Lineage Correlation Pattern
    Team Knowledge Pattern
  Summary
3 Search Service
  Journey Map
    Determining Feasibility of the Business Problem
    Selecting Relevant Datasets for Data Prep
    Reusing Existing Artifacts for Prototyping
  Minimizing Time to Find
    Indexing Datasets and Artifacts
    Ranking Results
    Access Control
  Defining Requirements
    Indexer Requirements
    Ranking Requirements
    Access Control Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Push-Pull Indexer Pattern
    Hybrid Search Ranking Pattern
    Catalog Access Control Pattern
  Summary
4 Feature Store Service
  Journey Map
    Finding Available Features
    Training Set Generation
    Feature Pipeline for Online Inference
  Minimize Time to Featurize
    Feature Computation
    Feature Serving
  Defining Requirements
    Feature Computation
    Feature Serving
    Nonfunctional Requirements
  Implementation Patterns
    Hybrid Feature Computation Pattern
    Feature Registry Pattern
  Summary
5 Data Movement Service
  Journey Map
    Aggregating Data Across Sources
    Moving Raw Data to Specialized Query Engines
    Moving Processed Data to Serving Stores
    Exploratory Analysis Across Sources
  Minimizing Time to Data Availability
    Data Ingestion Configuration and Change Management
    Compliance
    Data Quality Verification
  Defining Requirements
    Ingestion Requirements
    Transformation Requirements
    Compliance Requirements
    Verification Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Batch Ingestion Pattern
    Change Data Capture Ingestion Pattern
    Event Aggregation Pattern
  Summary
6 Clickstream Tracking Service
  Journey Map
  Minimizing Time to Click Metrics
    Managing Instrumentation
    Event Enrichment
    Building Insights
  Defining Requirements
    Instrumentation Requirements Checklist
    Enrichment Requirements Checklist
  Implementation Patterns
    Instrumentation Pattern
    Rule-Based Enrichment Patterns
    Consumption Patterns
  Summary
Part II Self-Service Data Prep
7 Data Lake Management Service
  Journey Map
    Primitive Life Cycle Management
    Managing Data Updates
    Managing Batching and Streaming Data Flows
  Minimizing Time to Data Lake Management
    Requirements
  Implementation Patterns
    Data Life Cycle Primitives Pattern
    Transactional Pattern
    Advanced Data Management Pattern
  Summary
8 Data Wrangling Service
  Journey Map
  Minimizing Time to Wrangle
    Defining Requirements
    Curating Data
    Operational Monitoring
  Defining Requirements
  Implementation Patterns
    Exploratory Data Analysis Patterns
    Analytical Transformation Patterns
  Summary
9 Data Rights Governance Service
  Journey Map
    Executing Data Rights Requests
    Discovery of Datasets
    Model Retraining
  Minimizing Time to Comply
    Tracking the Customer Data Life Cycle
    Executing Customer Data Rights Requests
    Limiting Data Access
  Defining Requirements
    Current Pain Point Questionnaire
    Interop Checklist
    Functional Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Sensitive Data Discovery and Classification Pattern
    Data Lake Deletion Pattern
    Use Case-Dependent Access Control
  Summary
Part III Self-Service Build
10 Data Virtualization Service
  Journey Map
    Exploring Data Sources
    Picking a Processing Cluster
  Minimizing Time to Query
    Picking the Execution Environment
    Formulating Polyglot Queries
    Joining Data Across Silos
  Defining Requirements
    Current Pain Point Analysis
    Operational Requirements
    Functional Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Automatic Query Routing Pattern
    Unified Query Pattern
    Federated Query Pattern
  Summary
11 Data Transformation Service
  Journey Map
    Production Dashboard and ML Pipelines
    Data-Driven Storytelling
  Minimizing Time to Transform
    Transformation Implementation
    Transformation Execution
    Transformation Operations
  Defining Requirements
    Current State Questionnaire
    Functional Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Implementation Pattern
    Execution Patterns
  Summary
12 Model Training Service
  Journey Map
    Model Prototyping
    Continuous Training
    Model Debugging
  Minimizing Time to Train
    Training Orchestration
    Tuning
    Continuous Training
  Defining Requirements
    Training Orchestration
    Tuning
    Continuous Training
    Nonfunctional Requirements
  Implementation Patterns
    Distributed Training Orchestrator Pattern
    Automated Tuning Pattern
    Data-Aware Continuous Training
  Summary
13 Continuous Integration Service
  Journey Map
    Collaborating on an ML Pipeline
    Integrating ETL Changes
    Validating Schema Changes
  Minimizing Time to Integrate
    Experiment Tracking
    Reproducible Deployment
    Testing Validation
  Defining Requirements
    Experiment Tracking Module
    Pipeline Packaging Module
    Testing Automation Module
  Implementation Patterns
    Programmable Tracking Pattern
    Reproducible Project Pattern
  Summary
14 A/B Testing Service
  Journey Map
  Minimizing Time to A/B Test
    Experiment Design
    Execution at Scale
    Experiment Optimization
  Implementation Patterns
    Experiment Specification Pattern
    Metrics Definition Pattern
    Automated Experiment Optimization
  Summary
Part IV Self-Service Operationalize
15 Query Optimization Service
  Journey Map
    Avoiding Cluster Clogs
    Resolving Runtime Query Issues
    Speeding Up Applications
  Minimizing Time to Optimize
    Aggregating Statistics
    Analyzing Statistics
    Optimizing Jobs
  Defining Requirements
    Current Pain Points Questionnaire
    Interop Requirements
    Functionality Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Avoidance Pattern
    Operational Insights Pattern
    Automated Tuning Pattern
  Summary
16 Pipeline Orchestration Service
  Journey Map
    Invoke Exploratory Pipelines
    Run SLA-Bound Pipelines
  Minimizing Time to Orchestrate
    Defining Job Dependencies
    Distributed Execution
    Production Monitoring
  Defining Requirements
    Current Pain Points Questionnaire
    Operational Requirements
    Functional Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Dependency Authoring Patterns
    Orchestration Observability Patterns
    Distributed Execution Pattern
  Summary
17 Model Deploy Service
  Journey Map
    Model Deployment in Production
    Model Maintenance and Upgrade
  Minimizing Time to Deploy
    Deployment Orchestration
    Performance Scaling
    Drift Monitoring
  Defining Requirements
    Orchestration
    Model Scaling and Performance
    Drift Verification
    Nonfunctional Requirements
  Implementation Patterns
    Universal Deployment Pattern
    Autoscaling Deployment Pattern
    Model Drift Tracking Pattern
  Summary
18 Quality Observability Service
  Journey Map
    Daily Data Quality Monitoring Reports
    Debugging Quality Issues
    Handling Low-Quality Data Records
  Minimizing Time to Insight Quality
    Verify the Accuracy of the Data
    Detect Quality Anomalies
    Prevent Data Quality Issues
  Defining Requirements
    Detection and Handling Data Quality Issues
    Functional Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Accuracy Models Pattern
    Profiling-Based Anomaly Detection Pattern
    Avoidance Pattern
  Summary
19 Cost Management Service
  Journey Map
    Monitoring Cost Usage
    Continuous Cost Optimization
  Minimizing Time to Optimize Cost
    Expenditure Observability
    Matching Supply and Demand
    Continuous Cost Optimization
  Defining Requirements
    Pain Points Questionnaire
    Functional Requirements
    Nonfunctional Requirements
  Implementation Patterns
    Continuous Cost Monitoring Pattern
    Automated Scaling Pattern
    Cost Advisor Pattern
  Summary
Index

Dr. Sandeep Uttamchandani is the Chief Data Officer and VP of Product Engineering at Unravel Data Systems. He brings nearly two decades of experience building enterprise data products and running petabyte-scale data platforms for business-critical analytics and ML applications. Most recently he was at Intuit, where he ran the data platform team powering analytics and ML for Intuit's financial accounting, payroll, and payments products. Earlier in his career, Sandeep was co-founder and CEO of a startup using ML to manage security vulnerabilities in open-source products. He has held engineering leadership roles at VMware and IBM for 15+ years.

Sandeep holds more than 40 issued patents, has 25+ publications in key technical conferences, and has received several product innovation and management excellence awards. He is a regular speaker at data conferences and a guest lecturer at universities. He advises startups and has served as a program/steering committee member for several conferences, including as co-chair of Gartner's SF CDO Executive Summit and the USENIX Operational ML (OpML) conference. Sandeep holds a Ph.D. and a Master's in Computer Science from the University of Illinois at Urbana-Champaign.