
E-book: Learning Apache Drill: Query and Analyze Distributed Data Sources with SQL

  • Length: 332 pages
  • Publication date: 02-Nov-2018
  • Publisher: O'Reilly Media
  • Language: English
  • ISBN-13: 9781492032755
  • Format: EPUB+DRM
  • Price: €47.96*
  • * The price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste): not allowed

  • Printing: not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. More information here. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on mobile devices (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions (this is a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

Get up to speed with Apache Drill, an extensible distributed SQL query engine that reads massive datasets in many popular file formats such as Parquet, JSON, and CSV. Drill reads data in HDFS or in cloud-native storage such as S3 and works with Hive metastores along with distributed databases such as HBase, MongoDB, and relational databases. Drill works everywhere: on your laptop or in your largest cluster.

In this practical book, Drill committers Charles Givre and Paul Rogers show analysts and data scientists how to query and analyze raw data using this powerful tool. Data scientists today spend about 80% of their time just gathering and cleaning data. With this book, you'll learn how Drill helps you analyze data more effectively to drive down time to insight.

  • Use Drill to clean, prepare, and summarize delimited data for further analysis
  • Query file types including logfiles, Parquet, JSON, and other complex formats
  • Query Hadoop, relational databases, MongoDB, and Kafka with standard SQL
  • Connect to Drill programmatically using a variety of languages
  • Use Drill even with challenging or ambiguous file formats
  • Perform sophisticated analysis by extending Drill's functionality with user-defined functions
  • Facilitate data analysis for network security, image metadata, and machine learning
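Connecting to Drill programmatically can be as simple as an HTTP call. As a minimal sketch (assuming a Drill instance running locally with its web server on the default port 8047; the CSV file path is hypothetical), the following Python snippet submits a SQL statement through Drill's REST interface, which accepts a JSON POST to /query.json:

```python
import json
from urllib import request

# Drill's web server listens on port 8047 by default; the /query.json
# endpoint accepts SQL statements posted as JSON.
DRILL_URL = "http://localhost:8047/query.json"

def build_query_payload(sql: str) -> dict:
    """Build the JSON body that Drill's REST interface expects."""
    return {"queryType": "SQL", "query": sql}

def run_query(sql: str) -> dict:
    """POST a SQL statement to a running Drill instance and return
    the decoded JSON response."""
    body = json.dumps(build_query_payload(sql)).encode("utf-8")
    req = request.Request(
        DRILL_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires a running Drill instance; the file path is hypothetical):
# result = run_query("SELECT * FROM dfs.`/tmp/data/customers.csv` LIMIT 10")
# print(result["rows"])
```

Libraries such as pydrill (covered in Chapter 7 of the book) wrap this REST interface in a friendlier API; JDBC and ODBC connections are also available.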
Table of Contents

Preface
1. Introduction to Apache Drill
   What Is Apache Drill?
   Drill Is Versatile
   Drill Is Easy to Use
   A Word About Drill's Performance
   A Very Brief History of Big Data
   Drill in the Big Data Ecosystem
   Comparing Drill with Similar Tools
2. Installing and Running Drill
   Preparing Your Machine for Drill
   Special Configuration Instructions for Windows Installations
   Installing Drill on Windows
   Starting Drill on a Windows Machine
   Installing Drill in Embedded Mode on macOS or Linux
   Starting Drill on macOS or Linux in Embedded Mode
   Installing Drill in Distributed Mode on macOS or Linux
   Preparing Your Cluster for Drill
   Starting Drill in Distributed Mode
   Connecting to the Cluster
   Conclusion
3. Overview of Apache Drill
   The Apache Hadoop Ecosystem
   Drill Is a Low-Latency Query Engine
   Distributed Processing with HDFS
   Elements of a Drill System
   Drill Operation: The 30,000-Foot View
   Drill Is a Query Engine, Not a Database
   Drill Operation Overview
   Drill Components
   SQL Session State
   Statement Preparation
   Statement Execution
   Low-Latency Features
   Conclusion
4. Querying Delimited Data
   Ways of Querying Data with Drill
   Other Interfaces
   Drill SQL Query Format
   Choosing a Data Source
   Defining a Workspace
   Specifying a Default Data Source
   Accessing Columns in a Query
   Delimited Data with Column Headers
   Table Functions
   Querying Directories
   Understanding Drill Data Types
   Cleaning and Preparing Data Using String Manipulation Functions
   Complex Data Conversion Functions
   Working with Dates and Times in Drill
   Converting Strings to Dates
   Reformatting Dates
   Date Arithmetic and Manipulation
   Date and Time Functions in Drill
   Creating Views
   Data Analysis Using Drill
   Summarizing Data with Aggregate Functions
   Common Problems in Querying Delimited Data
   Spaces in Column Names
   Illegal Characters in Column Headers
   Reserved Words in Column Names
   Conclusion
5. Analyzing Complex and Nested Data
   Arrays and Maps
   Arrays in Drill
   Accessing Maps (Key-Value Pairs) in Drill
   Querying Nested Data
   Analyzing Log Files with Drill
   Configuring Drill to Read HTTPD Web Server Logs
   Querying Web Server Logs
   Other Log Analysis with Drill
   Conclusion
6. Connecting Drill to Data Sources
   Querying Multiple Data Sources
   Configuring a New Storage Plug-in
   Connecting Drill to a Relational Database
   Querying Data in Hadoop from Drill
   Connecting to and Querying HBase from Drill
   Querying Hive Data from Drill
   Connecting to and Querying Streaming Data with Drill and Kafka
   Connecting to and Querying Kudu
   Connecting to and Querying MongoDB from Drill
   Connecting Drill to Cloud Storage
   Querying Time Series Data from Drill and OpenTSDB
   Conclusion
7. Connecting to Drill
   Understanding Drill's Interfaces
   JDBC and Drill
   ODBC and Drill
   Drill's REST Interface
   Connecting to Drill with Python
   Using drillpy to Query Drill
   Connecting to Drill Using pydrill
   Other Ways of Connecting to Drill from Python
   Connecting to Drill Using R
   Querying Drill from R Using sergeant
   Connecting to Drill Using Java
   Querying Drill with PHP
   Using the Connector
   Querying Drill from PHP
   Interacting with Drill from PHP
   Querying Drill Using Node.js
   Using Drill as a Data Source in BI Tools
   Exploring Data with Apache Zeppelin and Drill
   Exploring Data with Apache Superset
   Conclusion
8. Data Engineering with Drill
   Schema-on-Read
   The SQL Relational Model
   Data Life Cycle: Data Exploration to Production
   Schema Inference
   Data Source Inference
   Storage Plug-ins
   Storage Configurations
   Workspaces
   Querying Directories
   Default Schema
   File Type Inference
   Format Plug-ins and Format Configuration
   Format Inference
   File Format Variations
   Schema Inference Overview
   Distributed File Scans
   Schema Inference for Delimited Data
   CSV Summary
   Schema Inference for JSON
   Ambiguous Numeric Schemas
   Aligning Schemas Across Files
   JSON Objects
   JSON Lists in Drill
   JSON Summary
   Using Drill with the Parquet File Format
   Schema Evolution in Parquet
   Partitioning Data Directories
   Defining a Table Workspace
   Working with Queries in Production
   Capturing Schema Mapping in Views
   Running Challenging Queries in Scripts
   Conclusion
9. Deploying Drill in Production
   Installing Drill
   Prerequisites
   Production Installation
   Configuring ZooKeeper
   Configuring Memory
   Configuring Logging
   Testing the Installation
   Distributing Drill Binaries and Configuration
   Starting the Drill Cluster
   Configuring Storage
   Working with Apache Hadoop HDFS
   Working with Amazon S3
   Admission Control
   Additional Configuration
   User-Defined Functions and Custom Plug-ins
   Security
   Logging Levels
   Controlling CPU Usage
   Monitoring
   Monitoring the Drill Process
   Monitoring JMX Metrics
   Monitoring Queries
   Other Deployment Options
   MapR Installer
   Drill-on-YARN
   Docker
   Conclusion
10. Setting Up Your Development Environment
   Installing Maven
   Creating the Drill Build Environment
   Setting Up Git and Getting the Source Code
   Building Drill from Source
   Installing the IDE
   Conclusion
11. Writing Drill User-Defined Functions
   Use Case: Finding and Filtering Valid Credit Card Numbers
   How User-Defined Functions Work in Drill
   Structure of a Simple Drill UDF
   The pom.xml File
   The Function File
   The Simple Function API
   Putting It All Together
   Building and Installing Your UDF
   Statically Installing a UDF
   Dynamically Installing a UDF
   Complex Functions: UDFs That Return Maps or Arrays
   Example: Extracting User Agent Metadata
   The ComplexWriter
   Writing Aggregate User-Defined Functions
   The Aggregate Function API
   Example Aggregate UDF: Kendall's Rank Correlation Coefficient
   Conclusion
12. Writing a Format Plug-in
   The Example Regex Format Plug-in
   Creating the "Easy" Format Plug-in
   Creating the Maven pom.xml File
   Creating the Plug-in Package
   Drill Module Configuration
   Format Plug-in Configuration
   Cautions Before Getting Started
   Creating the Regex Plug-in Configuration Class
   Copyright Headers and Code Format
   Testing the Configuration
   Fixing Configuration Problems
   Troubleshooting
   Creating the Format Plug-in Class
   Creating a Test File
   Configuring RAT
   Efficient Debugging
   Creating the Unit Test
   How Drill Finds Your Plug-in
   The Record Reader
   Testing the Reader Shell
   Logging
   Error Handling
   Setup
   Regex Parsing
   Defining Column Names
   Projection
   Column Projection Accounting
   Project None
   Project All
   Project Some
   Opening the File
   Record Batches
   Drill's Columnar Structure
   Defining Vectors
   Reading Data
   Loading Data into Vectors
   Releasing Resources
   Testing the Reader
   Testing the Wildcard Case
   Testing Explicit Projection
   Testing Empty Projection
   Scaling Up
   Additional Details
   File Chunks
   Default Format Configuration
   Next Steps
   Production Build
   Contributing to Drill: The Pull Request
   Maintaining Your Branch
   Create a Plug-In Project
   Conclusion
13. Unique Uses of Drill
   Finding Photos Taken Within a Geographic Region
   Drilling Excel Files
   The pom.xml File
   The Excel Custom Record Reader
   Using the Excel Format Plug-in
   Network Packet Analysis (PCAP) with Drill
   Examples of Queries Using PCAP Data Files
   Analyzing Twitter Data with Drill
   Using Drill in a Machine Learning Pipeline
   Making Predictions Within Drill
   Building and Serializing a Model
   Writing the UDF Wrapper
   Making Predictions Using the UDF
   Conclusion
Appendix A. List of Drill Functions
Appendix B. Drill Formatting Strings
Index
Charles Givre is an Apache Drill committer and has worked as a Senior Lead Data Scientist at Booz Allen Hamilton for the last six years, working at the intersection of cybersecurity and data science. Mr. Givre is passionate about teaching data science and analytic skills and has taught data science classes all over the world at conferences, at universities, and for clients. Most recently, he taught a data science class at the Black Hat conference in Las Vegas and at the Center for Research in Applied Cryptography and Cyber Security at Bar-Ilan University. He is a sought-after speaker and has delivered presentations at major industry conferences such as Strata + Hadoop World, Black Hat, and the Open Data Science Conference.

Paul Rogers is an Apache Drill committer at MapR, where he focuses on Drill's execution engine. Paul has worked as a software architect at a number of database and BI companies, including Oracle, Actuate, and Informix. He was an early architect of the Eclipse BIRT project. His interests include making Drill even easier to use for end users and plug-in developers.