Data Science at the Command Line: Obtain, Scrub, Explore, and Model Data with Unix Power Tools, 2nd edition [Paperback]

  • Format: Paperback / softback, 250 pages, height x width: 232x178 mm
  • Publication date: 27-Aug-2021
  • Publisher: O'Reilly Media
  • ISBN-10: 1492087912
  • ISBN-13: 9781492087915
  • Paperback
  • Price: 63,19 €*
  • * the price is final, i.e. no further discounts apply
  • Regular price: 74,34 €
  • You save 15%
  • Free shipping
  • Delivery from the publisher takes approximately 2-4 weeks

This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 80 tools--useful whether you work with Windows, macOS, or Linux.

You'll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you're comfortable processing data with Python or R, you'll learn how to greatly improve your data science workflow by leveraging the command line's power. This book is ideal for data scientists, analysts, and engineers; software and machine learning engineers; and system administrators.

  • Obtain data from websites, APIs, databases, and spreadsheets
  • Perform scrub operations on text, CSV, HTML, XML, and JSON files
  • Explore data, compute descriptive statistics, and create visualizations
  • Manage your data science workflow
  • Create reusable command-line tools from one-liners and existing Python or R code
  • Parallelize and distribute data-intensive pipelines
  • Model data with dimensionality reduction, clustering, regression, and classification algorithms
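As a taste of the workflow the book advocates, the obtain-scrub-explore cycle can be sketched as a single Unix pipeline. This is an illustrative sketch using only standard tools (`printf`, `tail`, `sort`, `uniq`), not an excerpt from the book; the sample data is made up:

```shell
# Count how many times each category appears in a small CSV,
# sorted by frequency, entirely on the command line.
printf 'category\nfruit\nfruit\nveg\nfruit\nveg\n' |
  tail -n +2 |   # scrub: drop the header line
  sort |         # group identical values together
  uniq -c |      # explore: count occurrences per value
  sort -rn       # most frequent value first
```

Each tool does one small job, and the pipe composes them; this is the style of working the book builds on.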
Foreword xiii
Preface xv
1 Introduction 1(10)
  Data Science Is OSEMN 2(2)
  Obtaining Data 3(1)
  Scrubbing Data 3(1)
  Exploring Data 3(1)
  Modeling Data 4(1)
  Interpreting Data 4(1)
  Intermezzo Chapters 4(1)
  What Is the Command Line? 5(2)
  Why Data Science at the Command Line? 7(3)
  The Command Line Is Agile 7(1)
  The Command Line Is Augmenting 8(1)
  The Command Line Is Scalable 8(1)
  The Command Line Is Extensible 9(1)
  The Command Line Is Ubiquitous 9(1)
  Summary 10(1)
  For Further Exploration 10(1)
2 Getting Started 11(24)
  Getting the Data 11(1)
  Installing the Docker Image 12(1)
  Essential Unix Concepts 13(20)
  The Environment 14(1)
  Executing a Command-Line Tool 15(1)
  Five Types of Command-Line Tools 16(4)
  Combining Command-Line Tools 20(2)
  Redirecting Input and Output 22(4)
  Working with Files and Directories 26(2)
  Managing Output 28(2)
  Help! 30(3)
  Summary 33(1)
  For Further Exploration 33(2)
3 Obtaining Data 35(18)
  Overview 36(1)
  Copying Local Files to the Docker Container 36(1)
  Downloading from the Internet 37(4)
  Introducing curl 37(1)
  Saving 38(1)
  Other Protocols 39(1)
  Following Redirects 39(2)
  Decompressing Files 41(2)
  Converting Microsoft Excel Spreadsheets to CSV 43(3)
  Querying Relational Databases 46(1)
  Calling Web APIs 47(5)
  Authentication 48(2)
  Streaming APIs 50(2)
  Summary 52(1)
  For Further Exploration 52(1)
4 Creating Command-Line Tools 53(24)
  Overview 54(1)
  Converting One-Liners into Shell Scripts 55(14)
  Step 1: Create a File 58(3)
  Step 2: Give Permission to Execute 61(1)
  Step 3: Define a Shebang 62(3)
  Step 4: Remove the Fixed Input 65(1)
  Step 5: Add Arguments 66(2)
  Step 6: Extend Your PATH 68(1)
  Creating Command-Line Tools with Python and R 69(5)
  Porting the Shell Script 70(2)
  Processing Streaming Data from Standard Input 72(2)
  Summary 74(1)
  For Further Exploration 74(3)
5 Scrubbing Data 77(30)
  Overview 78(1)
  Transformations, Transformations Everywhere 78(3)
  Plain Text 81(9)
  Filtering Lines 81(5)
  Extracting Values 86(2)
  Replacing and Deleting Values 88(2)
  CSV 90(11)
  Bodies and Headers and Columns, Oh My! 90(3)
  Performing SQL Queries on CSV 93(1)
  Extracting and Reordering Columns 94(1)
  Filtering Rows 95(1)
  Merging Columns 96(3)
  Combining Multiple CSV Files 99(2)
  Working with XML/HTML and JSON 101(3)
  Summary 104(1)
  For Further Exploration 105(2)
6 Project Management with Make 107(12)
  Overview 108(1)
  Introducing Make 109(1)
  Running Tasks 109(3)
  Building, for Real 112(1)
  Adding Dependencies 113(5)
  Summary 118(1)
  For Further Exploration 118(1)
7 Exploring Data 119(34)
  Overview 120(1)
  Inspecting Data and Its Properties 120(6)
  Header or Not, Here I Come 120(1)
  Inspect All the Data 121(1)
  Feature Names and Data Types 122(2)
  Unique Identifiers, Continuous Variables, and Factors 124(2)
  Computing Descriptive Statistics 126(7)
  Column Statistics 126(3)
  R One-Liners on the Shell 129(4)
  Creating Visualizations 133(19)
  Displaying Images from the Command Line 133(5)
  Plotting in a Rush 138(2)
  Creating Bar Charts 140(2)
  Creating Histograms 142(1)
  Creating Density Plots 143(1)
  Happy Little Accidents 144(2)
  Creating Scatter Plots 146(1)
  Creating Trend Lines 147(2)
  Creating Box Plots 149(1)
  Adding Labels 150(2)
  Going Beyond Basic Plots 152(1)
  Summary 152(1)
  For Further Exploration 152(1)
8 Parallel Pipelines 153(24)
  Overview 154(1)
  Serial Processing 154(4)
  Looping Over Numbers 155(1)
  Looping Over Lines 156(1)
  Looping Over Files 157(1)
  Parallel Processing 158(9)
  Introducing GNU Parallel 160(2)
  Specifying Input 162(2)
  Controlling the Number of Concurrent Jobs 164(1)
  Logging and Output 164(2)
  Creating Parallel Tools 166(1)
  Distributed Processing 167(7)
  Get List of Running AWS EC2 Instances 167(2)
  Running Commands on Remote Machines 169(1)
  Distributing Local Data Among Remote Machines 170(1)
  Processing Files on Remote Machines 171(3)
  Summary 174(1)
  For Further Exploration 175(2)
9 Modeling Data 177(22)
  Overview 178(1)
  More Wine, Please! 178(4)
  Dimensionality Reduction with Tapkee 182(5)
  Introducing Tapkee 183(1)
  Linear and Nonlinear Mappings 183(4)
  Regression with Vowpal Wabbit 187(6)
  Preparing the Data 187(1)
  Training the Model 188(2)
  Testing the Model 190(3)
  Classification with SciKit-Learn Laboratory 193(4)
  Preparing the Data 193(1)
  Running the Experiment 194(1)
  Parsing the Results 195(2)
  Summary 197(1)
  For Further Exploration 198(1)
10 Polyglot Data Science 199(14)
  Overview 200(1)
  Jupyter 200(3)
  Python 203(2)
  R 205(2)
  RStudio 207(1)
  Apache Spark 208(2)
  Summary 210(1)
  For Further Exploration 211(2)
11 Conclusion 213(6)
  Let's Recap 213(1)
  Three Pieces of Advice 214(1)
  Be Patient 214(1)
  Be Creative 215(1)
  Be Practical 215(1)
  Where to Go from Here 215(2)
  The Command Line 216(1)
  Shell Programming 216(1)
  Python, R, and SQL 216(1)
  APIs 216(1)
  Machine Learning 217(1)
  Getting in Touch 217(2)
List of Command-Line Tools 219(30)
Index 249
Jeroen Janssens teaches data science: often through training and coaching, occasionally through speaking, and infrequently through writing. His interests include visualizing data, building machine learning models, and automating things using Python, R, or Bash. He is the author of Data Science at the Command Line, published by O'Reilly Media. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. Previously, he was an assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and at various startups in New York City. Currently, Jeroen is the CEO of Data Science Workshops, which organises open-enrollment workshops, in-company courses, inspiration sessions, hackathons, and meetups, all related to data science, of course.