
Data Science at the Command Line [Paperback]

  • Format: Paperback / softback, 212 pages, height x width x thickness: 233x176x16 mm, weight: 28 g
  • Publication date: 07-Oct-2014
  • Publisher: O'Reilly Media
  • ISBN-10: 1491947853
  • ISBN-13: 9781491947852

This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.
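The kind of pipeline the book builds chains a handful of small standard Unix tools with pipes. As a minimal sketch (an illustrative example, not one taken from the book), here is an obtain-scrub-explore one-liner that ranks the most frequent words in a text stream:

```shell
# Split text into one word per line, tally occurrences, rank by count.
echo "to be or not to be that is the question" |
  tr ' ' '\n' |   # scrub: one word per line
  sort |          # group identical words together
  uniq -c |       # count each distinct word
  sort -rn |      # rank by count, descending
  head -n 3       # keep the top three
```

Each tool does one small job; the pipe operator composes them, which is exactly the flexibility the book is about.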

To get you started—whether you’re on Windows, OS X, or Linux—author Jeroen Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with over 80 command-line tools.
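Getting the Data Science Toolbox running comes down to a few Vagrant commands once VirtualBox and Vagrant are installed; the box name below is the one published for the first edition (a setup sketch, to be checked against Chapter 2):

```shell
# Initialize, boot, and log in to the Data Science Toolbox VM.
vagrant init data-science-toolbox/dst   # create a Vagrantfile for the box
vagrant up                              # download the box and boot the VM
vagrant ssh                             # log in to the running VM
```

When you are done, `vagrant halt` shuts the VM down without destroying it.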

Discover why the command line is an agile, scalable, and extensible technology. Even if you’re already comfortable processing data with, say, Python or R, you’ll greatly improve your data science workflow by also leveraging the power of the command line.

  • Obtain data from websites, APIs, databases, and spreadsheets
  • Perform scrub operations on plain text, CSV, HTML/XML, and JSON
  • Explore data, compute descriptive statistics, and create visualizations
  • Manage your data science workflow using Drake
  • Create reusable tools from one-liners and existing Python or R code
  • Parallelize and distribute data-intensive pipelines using GNU Parallel
  • Model data with dimensionality reduction, clustering, regression, and classification algorithms
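To give a flavor of the "reusable tools from one-liners" item: Chapter 4 walks through wrapping a pipeline in a script, making it executable, and parameterizing it. A condensed sketch in the spirit of the book's word-counting example (the exact script contents here are abbreviated, not the book's):

```shell
# Promote a word-counting one-liner to a reusable, parameterized tool.
cat > top-words.sh << 'EOF'
#!/usr/bin/env bash
# Usage: top-words.sh [NUM] -- print the NUM most frequent words on stdin.
NUM="${1:-5}"
tr '[:upper:]' '[:lower:]' |   # normalize case
  tr -cs '[:alpha:]' '\n' |    # one word per line
  sort | uniq -c | sort -rn |  # tally and rank
  head -n "$NUM"
EOF
chmod +x top-words.sh          # step 2: add permission to execute
echo "Bye bye big data" | ./top-words.sh 2
```

Because the script reads standard input and takes its limit as an argument, it composes with any other pipeline just like the built-in tools do.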
Preface xi
1 Introduction 1(12)
  Overview 2(1)
  Data Science Is OSEMN 2(2)
    Obtaining Data 2(1)
    Scrubbing Data 3(1)
    Exploring Data 3(1)
    Modeling Data 3(1)
    Interpreting Data 4(1)
  Intermezzo Chapters 4(1)
  What Is the Command Line? 5(2)
  Why Data Science at the Command Line? 7(2)
    The Command Line Is Agile 7(1)
    The Command Line Is Augmenting 7(1)
    The Command Line Is Scalable 8(1)
    The Command Line Is Extensible 8(1)
    The Command Line Is Ubiquitous 9(1)
  A Real-World Use Case 9(3)
  Further Reading 12(1)
2 Getting Started 13(16)
  Overview 13(1)
  Setting Up Your Data Science Toolbox 13(4)
    Step 1: Download and Install VirtualBox 14(1)
    Step 2: Download and Install Vagrant 14(1)
    Step 3: Download and Start the Data Science Toolbox 15(1)
    Step 4: Log In (on Linux and Mac OS X) 16(1)
    Step 4: Log In (on Microsoft Windows) 17(1)
    Step 5: Shut Down or Start Anew 17(1)
  Essential Concepts and Tools 17(10)
    The Environment 18(1)
    Executing a Command-Line Tool 19(1)
    Five Types of Command-Line Tools 20(3)
    Combining Command-Line Tools 23(1)
    Redirecting Input and Output 24(1)
    Working with Files 24(1)
    Help! 25(2)
  Further Reading 27(2)
3 Obtaining Data 29(12)
  Overview 29(1)
  Copying Local Files to the Data Science Toolbox 30(1)
    Local Version of Data Science Toolbox 30(1)
    Remote Version of Data Science Toolbox 30(1)
  Decompressing Files 31(1)
  Converting Microsoft Excel Spreadsheets 32(2)
  Querying Relational Databases 34(1)
  Downloading from the Internet 35(2)
  Calling Web APIs 37(2)
  Further Reading 39(2)
4 Creating Reusable Command-Line Tools 41(14)
  Overview 42(1)
  Converting One-Liners into Shell Scripts 42(7)
    Step 1: Copy and Paste 44(1)
    Step 2: Add Permission to Execute 45(1)
    Step 3: Define Shebang 46(1)
    Step 4: Remove Fixed Input 47(1)
    Step 5: Parameterize 47(1)
    Step 6: Extend Your PATH 48(1)
  Creating Command-Line Tools with Python and R 49(4)
    Porting the Shell Script 50(2)
    Processing Streaming Data from Standard Input 52(1)
  Further Reading 53(2)
5 Scrubbing Data 55(26)
  Overview 56(1)
  Common Scrub Operations for Plain Text 56(6)
    Filtering Lines 57(3)
    Extracting Values 60(2)
    Replacing and Deleting Values 62(1)
  Working with CSV 62(5)
    Bodies and Headers and Columns, Oh My! 62(5)
    Performing SQL Queries on CSV 67(1)
  Working with HTML/XML and JSON 67(5)
  Common Scrub Operations for CSV 72(8)
    Extracting and Reordering Columns 72(1)
    Filtering Lines 73(2)
    Merging Columns 75(2)
    Combining Multiple CSV Files 77(3)
  Further Reading 80(1)
6 Managing Your Data Workflow 81(10)
  Overview 82(1)
  Introducing Drake 82(1)
  Installing Drake 82(2)
  Obtain Top Ebooks from Project Gutenberg 84(1)
  Every Workflow Starts with a Single Step 85(2)
  Well, That Depends 87(2)
  Rebuilding Specific Targets 89(1)
  Discussion 90(1)
  Further Reading 90(1)
7 Exploring Data 91(24)
  Overview 92(1)
  Inspecting Data and Its Properties 92(4)
    Header or Not, Here I Come 92(1)
    Inspect All the Data 92(1)
    Feature Names and Data Types 93(2)
    Unique Identifiers, Continuous Variables, and Factors 95(1)
  Computing Descriptive Statistics 96(6)
    Using csvstat 96(3)
    Using R from the Command Line with Rio 99(3)
  Creating Visualizations 102(12)
    Introducing Gnuplot and feedgnuplot 102(2)
    Introducing ggplot2 104(3)
    Histograms 107(1)
    Bar Plots 108(2)
    Density Plots 110(1)
    Box Plots 111(1)
    Scatter Plots 112(1)
    Line Graphs 113(1)
    Summary 114(1)
  Further Reading 114(1)
8 Parallel Pipelines 115(20)
  Overview 116(1)
  Serial Processing 116(3)
    Looping Over Numbers 116(1)
    Looping Over Lines 117(1)
    Looping Over Files 118(1)
  Parallel Processing 119(6)
    Introducing GNU Parallel 121(1)
    Specifying Input 122(1)
    Controlling the Number of Concurrent Jobs 123(1)
    Logging and Output 123(1)
    Creating Parallel Tools 124(1)
  Distributed Processing 125(7)
    Get a List of Running AWS EC2 Instances 126(1)
    Running Commands on Remote Machines 127(1)
    Distributing Local Data Among Remote Machines 128(1)
    Processing Files on Remote Machines 129(3)
  Discussion 132(1)
  Further Reading 133(2)
9 Modeling Data 135(24)
  Overview 136(1)
  More Wine, Please! 136(3)
  Dimensionality Reduction with Tapkee 139(3)
    Introducing Tapkee 140(1)
    Installing Tapkee 140(1)
    Linear and Nonlinear Mappings 141(1)
  Clustering with Weka 142(8)
    Introducing Weka 143(1)
    Taming Weka on the Command Line 143(4)
    Converting Between CSV and ARFF 147(1)
    Comparing Three Clustering Algorithms 147(3)
  Regression with SciKit-Learn Laboratory 150(3)
    Preparing the Data 150(1)
    Running the Experiment 151(1)
    Parsing the Results 151(2)
  Classification with BigML 153(3)
    Creating Balanced Train and Test Data Sets 153(2)
    Calling the API 155(1)
    Inspecting the Results 155(1)
  Conclusion 156(1)
  Further Reading 156(3)
10 Conclusion 159(6)
  Let's Recap 159(1)
  Three Pieces of Advice 160(1)
    Be Patient 160(1)
    Be Creative 161(1)
    Be Practical 161(1)
  Where to Go from Here? 161(1)
    APIs 161(1)
    Shell Programming 162(1)
    Python, R, and SQL 162(1)
    Interpreting Data 162(1)
  Getting in Touch 162(3)
A List of Command-Line Tools 165(18)
B Bibliography 183(4)
Index 187
Jeroen Janssens is a senior data scientist at YPlan in New York City, specializing in machine learning, anomaly detection, and data visualization. He is passionate about building open source tools for doing data science. He obtained a B.Sc. in Life Sciences and an M.Sc. in Artificial Intelligence, both cum laude, from Maastricht University in the Netherlands, and completed his Ph.D. in Machine Learning at the Tilburg center for Cognition and Communication, Tilburg University. Outside of work, you may find him biking across the Brooklyn Bridge, beatboxing, or eating stroopwafels.