
Pandas for Everyone: Python Data Analysis [Paperback]

The Hands-On, Example-Rich Introduction to Pandas Data Analysis in Python

 

Today, analysts must manage data characterized by extraordinary variety, velocity, and volume. Using the open source Pandas library, you can use Python to rapidly automate and perform virtually any data analysis task, no matter how large or complex. Pandas can help you ensure the veracity of your data, visualize it for effective decision-making, and reliably reproduce analyses across multiple datasets.

 

Pandas for Everyone brings together practical knowledge and insight for solving real problems with Pandas, even if you're new to Python data analysis. Daniel Y. Chen introduces key concepts through simple but practical examples, incrementally building on them to solve more difficult, real-world problems.

 

Chen gives you a jumpstart on using Pandas with a realistic dataset and covers combining datasets, handling missing data, and structuring datasets for easier analysis and visualization. He demonstrates powerful data cleaning techniques, from basic string manipulation to applying functions simultaneously across dataframes.

 

Once your data is ready, Chen guides you through fitting models for prediction, clustering, inference, and exploration. He provides tips on performance and scalability, and introduces you to the wider Python data analysis ecosystem. 





Work with DataFrames and Series, and import or export data
Create plots with matplotlib, seaborn, and pandas
Combine datasets and handle missing data
Reshape, tidy, and clean datasets so they're easier to work with
Convert data types and manipulate text strings
Apply functions to scale data manipulations
Aggregate, transform, and filter large datasets with groupby
Leverage Pandas' advanced date and time capabilities
Fit linear models using statsmodels and scikit-learn libraries
Use generalized linear modeling to fit models with different response variables
Compare multiple models to select the best
Regularize to overcome overfitting and improve performance
Use clustering in unsupervised machine learning
Foreword xix
Preface xxi
Acknowledgments xxvii
About the Author xxxi
I Introduction
1(90)
1 Pandas Data Frame Basics
3(22)
1.1 Introduction
3(1)
1.2 Loading Your First Data Set
4(3)
1.3 Looking at Columns, Rows, and Cells
7(11)
1.3.1 Subsetting Columns
7(1)
1.3.2 Subsetting Rows
8(4)
1.3.3 Mixing It Up
12(6)
1.4 Grouped and Aggregated Calculations
18(5)
1.4.1 Grouped Means
19(4)
1.4.2 Grouped Frequency, Counts
23(1)
1.5 Basic Plot
23(1)
1.6 Conclusion
24(1)
2 Pandas Data Structures
25(24)
2.1 Introduction
25(1)
2.2 Creating Your Own Data
26(2)
2.2.1 Creating a Series
26(1)
2.2.2 Creating a DataFrame
27(1)
2.3 The Series
28(8)
2.3.1 The Series Is ndarray-like
30(1)
2.3.2 Boolean Subsetting
30(3)
2.3.3 Operations Are Aligned and Vectorized (Broadcasting)
33(3)
2.4 The DataFrame
36(2)
2.4.1 Boolean Subsetting: DataFrames
36(1)
2.4.2 Operations Are Automatically Aligned and Vectorized (Broadcasting)
37(1)
2.5 Making Changes to Series and DataFrames
38(5)
2.5.1 Add Additional Columns
38(1)
2.5.2 Directly Change a Column
39(4)
2.5.3 Dropping Values
43(1)
2.6 Exporting and Importing Data
43(4)
2.6.1 pickle
43(2)
2.6.2 CSV
45(1)
2.6.3 Excel
46(1)
2.6.4 Feather Format to Interface With R
47(1)
2.6.5 Other Data Output Types
47(1)
2.7 Conclusion
47(2)
3 Introduction to Plotting
49(42)
3.1 Introduction
49(2)
3.2 Matplotlib
51(5)
3.3 Statistical Graphics Using matplotlib
56(5)
3.3.1 Univariate
57(1)
3.3.2 Bivariate
58(1)
3.3.3 Multivariate Data
59(2)
3.4 Seaborn
61(22)
3.4.1 Univariate
62(3)
3.4.2 Bivariate Data
65(8)
3.4.3 Multivariate Data
73(10)
3.5 Pandas Objects
83(3)
3.5.1 Histograms
84(1)
3.5.2 Density Plot
85(1)
3.5.3 Scatterplot
85(1)
3.5.4 Hexbin Plot
86(1)
3.5.5 Boxplot
86(1)
3.6 Seaborn Themes and Styles
86(4)
3.7 Conclusion
90(1)
II Data Manipulation
91(52)
4 Data Assembly
93(16)
4.1 Introduction
93(1)
4.2 Tidy Data
93(1)
4.2.1 Combining Data Sets
94(1)
4.3 Concatenation
94(8)
4.3.1 Adding Rows
94(4)
4.3.2 Adding Columns
98(1)
4.3.3 Concatenation With Different Indices
99(3)
4.4 Merging Multiple Data Sets
102(5)
4.4.1 One-to-One Merge
104(1)
4.4.2 Many-to-One Merge
105(1)
4.4.3 Many-to-Many Merge
105(2)
4.5 Conclusion
107(2)
5 Missing Data
109(14)
5.1 Introduction
109(1)
5.2 What Is a NaN Value?
109(2)
5.3 Where Do Missing Values Come From?
111(5)
5.3.1 Load Data
111(1)
5.3.2 Merged Data
112(2)
5.3.3 User Input Values
114(1)
5.3.4 Re-indexing
114(2)
5.4 Working With Missing Data
116(5)
5.4.1 Find and Count Missing Data
116(2)
5.4.2 Cleaning Missing Data
118(2)
5.4.3 Calculations With Missing Data
120(1)
5.5 Conclusion
121(2)
6 Tidy Data
123(20)
6.1 Introduction
123(1)
6.2 Columns Contain Values, Not Variables
124(4)
6.2.1 Keep One Column Fixed
124(2)
6.2.2 Keep Multiple Columns Fixed
126(2)
6.3 Columns Contain Multiple Variables
128(5)
6.3.1 Split and Add Columns Individually (Simple Method)
129(2)
6.3.2 Split and Combine in a Single Step (Simple Method)
131(1)
6.3.3 Split and Combine in a Single Step (More Complicated Method)
132(1)
6.4 Variables in Both Rows and Columns
133(1)
6.5 Multiple Observational Units in a Table (Normalization)
134(3)
6.6 Observational Units Across Multiple Tables
137(4)
6.6.1 Load Multiple Files Using a Loop
139(1)
6.6.2 Load Multiple Files Using a List Comprehension
140(1)
6.7 Conclusion
141(2)
III Data Munging
143(98)
7 Data Types
145(10)
7.1 Introduction
145(1)
7.2 Data Types
145(1)
7.3 Converting Types
146(6)
7.3.1 Converting to String Objects
146(1)
7.3.2 Converting to Numeric Values
147(5)
7.4 Categorical Data
152(1)
7.4.1 Convert to Category
152(1)
7.4.2 Manipulating Categorical Data
153(1)
7.5 Conclusion
153(2)
8 Strings and Text Data
155(16)
8.1 Introduction
155(1)
8.2 Strings
155(3)
8.2.1 Subsetting and Slicing Strings
155(2)
8.2.2 Getting the Last Character in a String
157(1)
8.3 String Methods
158(2)
8.4 More String Methods
160(1)
8.4.1 Join
160(1)
8.4.2 Splitlines
160(1)
8.5 String Formatting
161(3)
8.5.1 Custom String Formatting
161(1)
8.5.2 Formatting Character Strings
162(1)
8.5.3 Formatting Numbers
162(1)
8.5.4 C printf Style Formatting
163(1)
8.5.5 Formatted Literal Strings in Python 3.6+
163(1)
8.6 Regular Expressions (RegEx)
164(6)
8.6.1 Match a Pattern
164(4)
8.6.2 Find a Pattern
168(1)
8.6.3 Substituting a Pattern
168(1)
8.6.4 Compiling a Pattern
169(1)
8.7 The regex Library
170(1)
8.8 Conclusion
170(1)
9 Apply
171(18)
9.1 Introduction
171(1)
9.2 Functions
171(1)
9.3 Apply (Basics)
172(5)
9.3.1 Apply Over a Series
173(1)
9.3.2 Apply Over a DataFrame
174(3)
9.4 Apply (More Advanced)
177(5)
9.4.1 Column-wise Operations
178(2)
9.4.2 Row-wise Operations
180(2)
9.5 Vectorized Functions
182(3)
9.5.1 Using numpy
184(1)
9.5.2 Using numba
185(1)
9.6 Lambda Functions
185(2)
9.7 Conclusion
187(2)
10 Groupby Operations: Split-Apply-Combine
189(24)
10.1 Introduction
189(1)
10.2 Aggregate
190(7)
10.2.1 Basic One-Variable Grouped Aggregation
190(1)
10.2.2 Built-in Aggregation Methods
191(1)
10.2.3 Aggregation Functions
192(3)
10.2.4 Multiple Functions Simultaneously
195(1)
10.2.5 Using a dict in agg/aggregate
195(2)
10.3 Transform
197(4)
10.3.1 z-Score Example
197(4)
10.4 Filter
201(1)
10.5 The pandas.core.groupby.DataFrameGroupBy Object
202(5)
10.5.1 Groups
202(1)
10.5.2 Group Calculations Involving Multiple Variables
203(1)
10.5.3 Selecting a Group
204(1)
10.5.4 Iterating Through Groups
204(2)
10.5.5 Multiple Groups
206(1)
10.5.6 Flattening the Results
206(1)
10.6 Working With a MultiIndex
207(4)
10.7 Conclusion
211(2)
11 The datetime Data Type
213(28)
11.1 Introduction
213(1)
11.2 Python's datetime Object
213(1)
11.3 Converting to datetime
214(3)
11.4 Loading Data That Include Dates
217(1)
11.5 Extracting Date Components
217(3)
11.6 Date Calculations and Timedeltas
220(1)
11.7 Datetime Methods
221(3)
11.8 Getting Stock Data
224(1)
11.9 Subsetting Data Based on Dates
225(2)
11.9.1 The Datetime Index Object
225(1)
11.9.2 The TimedeltaIndex Object
226(1)
11.10 Date Ranges
227(3)
11.10.1 Frequencies
228(1)
11.10.2 Offsets
229(1)
11.11 Shifting Values
230(7)
11.12 Resampling
237(1)
11.13 Time Zones
238(2)
11.14 Conclusion
240(1)
IV Data Modeling
241(62)
12 Linear Models
243(10)
12.1 Introduction
243(1)
12.2 Simple Linear Regression
243(4)
12.2.1 Using statsmodels
243(2)
12.2.2 Using sklearn
245(2)
12.3 Multiple Regression
247(4)
12.3.1 Using statsmodels
247(1)
12.3.2 Using statsmodels With Categorical Variables
248(1)
12.3.3 Using sklearn
249(1)
12.3.4 Using sklearn With Categorical Variables
250(1)
12.4 Keeping Index Labels From sklearn
251(1)
12.5 Conclusion
252(1)
13 Generalized Linear Models
253(12)
13.1 Introduction
253(1)
13.2 Logistic Regression
253(4)
13.2.1 Using statsmodels
255(1)
13.2.2 Using sklearn
256(1)
13.3 Poisson Regression
257(3)
13.3.1 Using statsmodels
258(1)
13.3.2 Negative Binomial Regression for Overdispersion
259(1)
13.4 More Generalized Linear Models
260(1)
13.5 Survival Analysis
260(4)
13.5.1 Testing the Cox Model Assumptions
263(1)
13.6 Conclusion
264(1)
14 Model Diagnostics
265(14)
14.1 Introduction
265(1)
14.2 Residuals
265(5)
14.2.1 Q-Q Plots
268(2)
14.3 Comparing Multiple Models
270(5)
14.3.1 Working With Linear Models
270(3)
14.3.2 Working With GLM Models
273(2)
14.4 k-Fold Cross-validation
275(3)
14.5 Conclusion
278(1)
15 Regularization
279(12)
15.1 Introduction
279(1)
15.2 Why Regularize?
279(2)
15.3 LASSO Regression
281(2)
15.4 Ridge Regression
283(2)
15.5 Elastic Net
285(4)
15.6 Cross-Validation
287(2)
15.7 Conclusion
289(2)
16 Clustering
291(12)
16.1 Introduction
291(1)
16.2 k-Means
291(6)
16.2.1 Dimension Reduction With PCA
294(3)
16.3 Hierarchical Clustering
297(4)
16.3.1 Complete Clustering
298(1)
16.3.2 Single Clustering
298(1)
16.3.3 Average Clustering
299(1)
16.3.4 Centroid Clustering
299(1)
16.3.5 Manually Setting the Threshold
299(2)
16.4 Conclusion
301(2)
V Conclusion
303(10)
17 Life Outside of Pandas
305(4)
17.1 The (Scientific) Computing Stack
305(1)
17.2 Performance
306(1)
17.2.1 Timing Your Code
306(1)
17.2.2 Profiling Your Code
307(1)
17.3 Going Bigger and Faster
307(2)
18 Toward a Self-Directed Learner
309(4)
18.1 It's Dangerous to Go Alone!
309(1)
18.2 Local Meetups
309(1)
18.3 Conferences
309(1)
18.4 The Internet
310(1)
18.5 Podcasts
310(1)
18.6 Conclusion
311(2)
VI Appendixes
313(2)
A Installation
315(2)
A.1 Installing Anaconda
315(1)
A.1.1 Windows
315(1)
A.1.2 Mac
316(1)
A.1.3 Linux
316(1)
A.2 Uninstall Anaconda
316(1)
B Command Line
317(2)
B.1 Installation
317(1)
B.1.1 Windows
317(1)
B.1.2 Mac
317(1)
B.1.3 Linux
318(1)
B.2 Basics
318(1)
C Project Templates
319(2)
D Using Python
321(4)
D.1 Command Line and Text Editor
321(1)
D.2 Python and IPython
322(1)
D.3 Jupyter
322(1)
D.4 Integrated Development Environments (IDEs)
322(3)
E Working Directories
325(2)
F Environments
327(2)
G Install Packages
329(2)
G.1 Updating Packages
330(1)
H Importing Libraries
331(2)
I Lists
333(2)
J Tuples
335(2)
K Dictionaries
337(2)
L Slicing Values
339(2)
M Loops
341(2)
N Comprehensions
343(2)
O Functions
345(4)
O.1 Default Parameters
347(1)
O.2 Arbitrary Parameters
347(2)
O.2.1 *args
347(1)
O.2.2 **kwargs
348(1)
P Ranges and Generators
349(2)
Q Multiple Assignment
351(2)
R numpy ndarray
353(2)
S Classes
355(2)
T Odo: The Shapeshifter
357(2)
Index 359
Daniel Chen is a graduate student in the interdisciplinary PhD program in Genetics, Bioinformatics & Computational Biology (GBCB) at Virginia Tech. He is involved with Software Carpentry as an instructor and lesson maintainer. He completed his master's degree in public health at Columbia University Mailman School of Public Health in Epidemiology, and currently works at the Social and Decision Analytics Laboratory under the Biocomplexity Institute of Virginia Tech, where he works with data to inform policy decision-making. He is the author of Pandas for Everyone and Pandas Data Analysis with Python Fundamentals LiveLessons.