
E-book: Learning Apache Drill: Query and Analyze Distributed Data Sources with SQL

  • Length: 332 pages
  • Publication date: 02-Nov-2018
  • Publisher: O'Reilly Media
  • Language: English
  • ISBN-13: 9781492032755
  • Format: EPUB+DRM
  • Price: €47.96*
  • * The price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste): not allowed

  • Printing: not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. More information here. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on mobile devices (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions (this is a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

Get up to speed with Apache Drill, an extensible distributed SQL query engine that reads massive datasets in many popular file formats such as Parquet, JSON, and CSV. Drill reads data in HDFS or in cloud-native storage such as S3 and works with Hive metastores along with distributed databases such as HBase, MongoDB, and relational databases. Drill works everywhere: on your laptop or in your largest cluster.

In this practical book, Drill committers Charles Givre and Paul Rogers show analysts and data scientists how to query and analyze raw data using this powerful tool. Data scientists today spend about 80% of their time just gathering and cleaning data. With this book, you'll learn how Drill helps you analyze data more effectively to drive down time to insight.

  • Use Drill to clean, prepare, and summarize delimited data for further analysis
  • Query file types including logfiles, Parquet, JSON, and other complex formats
  • Query Hadoop, relational databases, MongoDB, and Kafka with standard SQL
  • Connect to Drill programmatically using a variety of languages
  • Use Drill even with challenging or ambiguous file formats
  • Perform sophisticated analysis by extending Drill's functionality with user-defined functions
  • Facilitate data analysis for network security, image metadata, and machine learning
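Connecting to Drill programmatically can be as simple as an HTTP call. As a minimal sketch (assuming a Drill instance running locally with its web server on the default port 8047; the CSV file path is hypothetical), the following Python snippet submits a SQL statement through Drill's REST interface, which accepts a JSON POST to /query.json:

```python
import json
from urllib import request

# Drill's web server listens on port 8047 by default; the /query.json
# endpoint accepts SQL statements posted as JSON.
DRILL_URL = "http://localhost:8047/query.json"

def build_query_payload(sql: str) -> dict:
    """Build the JSON body that Drill's REST interface expects."""
    return {"queryType": "SQL", "query": sql}

def run_query(sql: str) -> dict:
    """POST a SQL statement to a running Drill instance and return
    the decoded JSON response."""
    body = json.dumps(build_query_payload(sql)).encode("utf-8")
    req = request.Request(
        DRILL_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires a running Drill instance; the file path is hypothetical):
# result = run_query("SELECT * FROM dfs.`/tmp/data/customers.csv` LIMIT 10")
# print(result["rows"])
```

Libraries such as pydrill (covered in Chapter 7 of the book) wrap this REST interface in a friendlier API; JDBC and ODBC connections are also available.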
Table of Contents

Preface
1. Introduction to Apache Drill
   What Is Apache Drill?
   Drill Is Versatile
   Drill Is Easy to Use
   A Word About Drill's Performance
   A Very Brief History of Big Data
   Drill in the Big Data Ecosystem
   Comparing Drill with Similar Tools
2. Installing and Running Drill
   Preparing Your Machine for Drill
   Special Configuration Instructions for Windows Installations
   Installing Drill on Windows
   Starting Drill on a Windows Machine
   Installing Drill in Embedded Mode on macOS or Linux
   Starting Drill on macOS or Linux in Embedded Mode
   Installing Drill in Distributed Mode on macOS or Linux
   Preparing Your Cluster for Drill
   Starting Drill in Distributed Mode
   Connecting to the Cluster
   Conclusion
3. Overview of Apache Drill
   The Apache Hadoop Ecosystem
   Drill Is a Low-Latency Query Engine
   Distributed Processing with HDFS
   Elements of a Drill System
   Drill Operation: The 30,000-Foot View
   Drill Is a Query Engine, Not a Database
   Drill Operation Overview
   Drill Components
   SQL Session State
   Statement Preparation
   Statement Execution
   Low-Latency Features
   Conclusion
4. Querying Delimited Data
   Ways of Querying Data with Drill
   Other Interfaces
   Drill SQL Query Format
   Choosing a Data Source
   Defining a Workspace
   Specifying a Default Data Source
   Accessing Columns in a Query
   Delimited Data with Column Headers
   Table Functions
   Querying Directories
   Understanding Drill Data Types
   Cleaning and Preparing Data Using String Manipulation Functions
   Complex Data Conversion Functions
   Working with Dates and Times in Drill
   Converting Strings to Dates
   Reformatting Dates
   Date Arithmetic and Manipulation
   Date and Time Functions in Drill
   Creating Views
   Data Analysis Using Drill
   Summarizing Data with Aggregate Functions
   Common Problems in Querying Delimited Data
   Spaces in Column Names
   Illegal Characters in Column Headers
   Reserved Words in Column Names
   Conclusion
5. Analyzing Complex and Nested Data
   Arrays and Maps
   Arrays in Drill
   Accessing Maps (Key-Value Pairs) in Drill
   Querying Nested Data
   Analyzing Log Files with Drill
   Configuring Drill to Read HTTPD Web Server Logs
   Querying Web Server Logs
   Other Log Analysis with Drill
   Conclusion
6. Connecting Drill to Data Sources
   Querying Multiple Data Sources
   Configuring a New Storage Plug-in
   Connecting Drill to a Relational Database
   Querying Data in Hadoop from Drill
   Connecting to and Querying HBase from Drill
   Querying Hive Data from Drill
   Connecting to and Querying Streaming Data with Drill and Kafka
   Connecting to and Querying Kudu
   Connecting to and Querying MongoDB from Drill
   Connecting Drill to Cloud Storage
   Querying Time Series Data from Drill and OpenTSDB
   Conclusion
7. Connecting to Drill
   Understanding Drill's Interfaces
   JDBC and Drill
   ODBC and Drill
   Drill's REST Interface
   Connecting to Drill with Python
   Using drillpy to Query Drill
   Connecting to Drill Using pydrill
   Other Ways of Connecting to Drill from Python
   Connecting to Drill Using R
   Querying Drill from R Using sergeant
   Connecting to Drill Using Java
   Querying Drill with PHP
   Using the Connector
   Querying Drill from PHP
   Interacting with Drill from PHP
   Querying Drill Using Node.js
   Using Drill as a Data Source in BI Tools
   Exploring Data with Apache Zeppelin and Drill
   Exploring Data with Apache Superset
   Conclusion
8. Data Engineering with Drill
   Schema-on-Read
   The SQL Relational Model
   Data Life Cycle: Data Exploration to Production
   Schema Inference
   Data Source Inference
   Storage Plug-ins
   Storage Configurations
   Workspaces
   Querying Directories
   Default Schema
   File Type Inference
   Format Plug-ins and Format Configuration
   Format Inference
   File Format Variations
   Schema Inference Overview
   Distributed File Scans
   Schema Inference for Delimited Data
   CSV Summary
   Schema Inference for JSON
   Ambiguous Numeric Schemas
   Aligning Schemas Across Files
   JSON Objects
   JSON Lists in Drill
   JSON Summary
   Using Drill with the Parquet File Format
   Schema Evolution in Parquet
   Partitioning Data Directories
   Defining a Table Workspace
   Working with Queries in Production
   Capturing Schema Mapping in Views
   Running Challenging Queries in Scripts
   Conclusion
9. Deploying Drill in Production
   Installing Drill
   Prerequisites
   Production Installation
   Configuring ZooKeeper
   Configuring Memory
   Configuring Logging
   Testing the Installation
   Distributing Drill Binaries and Configuration
   Starting the Drill Cluster
   Configuring Storage
   Working with Apache Hadoop HDFS
   Working with Amazon S3
   Admission Control
   Additional Configuration
   User-Defined Functions and Custom Plug-ins
   Security
   Logging Levels
   Controlling CPU Usage
   Monitoring
   Monitoring the Drill Process
   Monitoring JMX Metrics
   Monitoring Queries
   Other Deployment Options
   MapR Installer
   Drill-on-YARN
   Docker
   Conclusion
10. Setting Up Your Development Environment
   Installing Maven
   Creating the Drill Build Environment
   Setting Up Git and Getting the Source Code
   Building Drill from Source
   Installing the IDE
   Conclusion
11. Writing Drill User-Defined Functions
   Use Case: Finding and Filtering Valid Credit Card Numbers
   How User-Defined Functions Work in Drill
   Structure of a Simple Drill UDF
   The pom.xml File
   The Function File
   The Simple Function API
   Putting It All Together
   Building and Installing Your UDF
   Statically Installing a UDF
   Dynamically Installing a UDF
   Complex Functions: UDFs That Return Maps or Arrays
   Example: Extracting User Agent Metadata
   The ComplexWriter
   Writing Aggregate User-Defined Functions
   The Aggregate Function API
   Example Aggregate UDF: Kendall's Rank Correlation Coefficient
   Conclusion
12. Writing a Format Plug-in
   The Example Regex Format Plug-in
   Creating the "Easy" Format Plug-in
   Creating the Maven pom.xml File
   Creating the Plug-in Package
   Drill Module Configuration
   Format Plug-in Configuration
   Cautions Before Getting Started
   Creating the Regex Plug-in Configuration Class
   Copyright Headers and Code Format
   Testing the Configuration
   Fixing Configuration Problems
   Troubleshooting
   Creating the Format Plug-in Class
   Creating a Test File
   Configuring RAT
   Efficient Debugging
   Creating the Unit Test
   How Drill Finds Your Plug-in
   The Record Reader
   Testing the Reader Shell
   Logging
   Error Handling
   Setup
   Regex Parsing
   Defining Column Names
   Projection
   Column Projection Accounting
   Project None
   Project All
   Project Some
   Opening the File
   Record Batches
   Drill's Columnar Structure
   Defining Vectors
   Reading Data
   Loading Data into Vectors
   Releasing Resources
   Testing the Reader
   Testing the Wildcard Case
   Testing Explicit Projection
   Testing Empty Projection
   Scaling Up
   Additional Details
   File Chunks
   Default Format Configuration
   Next Steps
   Production Build
   Contributing to Drill: The Pull Request
   Maintaining Your Branch
   Create a Plug-In Project
   Conclusion
13. Unique Uses of Drill
   Finding Photos Taken Within a Geographic Region
   Drilling Excel Files
   The pom.xml File
   The Excel Custom Record Reader
   Using the Excel Format Plug-in
   Network Packet Analysis (PCAP) with Drill
   Examples of Queries Using PCAP Data Files
   Analyzing Twitter Data with Drill
   Using Drill in a Machine Learning Pipeline
   Making Predictions Within Drill
   Building and Serializing a Model
   Writing the UDF Wrapper
   Making Predictions Using the UDF
   Conclusion
Appendix A. List of Drill Functions
Appendix B. Drill Formatting Strings
Index
Charles Givre is an Apache Drill committer and has worked as a Senior Lead Data Scientist at Booz Allen Hamilton for the last six years, working at the intersection of cybersecurity and data science. Mr. Givre is passionate about teaching data science and analytic skills and has taught data science classes all over the world at conferences, at universities, and for clients. Most recently, he taught a data science class at the Black Hat conference in Las Vegas and at the Center for Research in Applied Cryptography and Cyber Security at Bar-Ilan University. He is a sought-after speaker and has delivered presentations at major industry conferences such as Strata + Hadoop World, Black Hat, and the Open Data Science Conference.

Paul Rogers is an Apache Drill committer at MapR, where he focuses on Drill's execution engine. Paul has worked as a software architect at a number of database and BI companies, including Oracle, Actuate, and Informix. He was an early architect of the Eclipse BIRT project. His interests include making Drill even easier to use for end users and plug-in developers.