Pandas for Everyone: Python Data Analysis [Paperback]

4.06/5 (87 ratings from Goodreads)

Daniel Chen

Format: Paperback / softback, 416 pages
Series: Addison-Wesley Data & Analytics Series
Publication date: 12-Feb-2018
Publisher: Addison Wesley
ISBN-10: 0134546938
ISBN-13: 9780134546933

Other books on this subject:

Web programming

Paperback
Price: 42.61 €
Delivery from the publisher takes approximately 2-4 weeks
Free delivery

Permalink: https://www.kriso.ee/db/9780134546933.html

The Hands-On, Example-Rich Introduction to Pandas Data Analysis in Python

Today, analysts must manage data characterized by extraordinary variety, velocity, and volume. Using the open source Pandas library, you can use Python to rapidly automate and perform virtually any data analysis task, no matter how large or complex. Pandas can help you ensure the veracity of your data, visualize it for effective decision-making, and reliably reproduce analyses across multiple datasets.

Pandas for Everyone brings together practical knowledge and insight for solving real problems with Pandas, even if you're new to Python data analysis. Daniel Y. Chen introduces key concepts through simple but practical examples, incrementally building on them to solve more difficult, real-world problems.

Chen gives you a jumpstart on using Pandas with a realistic dataset and covers combining datasets, handling missing data, and structuring datasets for easier analysis and visualization. He demonstrates powerful data cleaning techniques, from basic string manipulation to applying functions simultaneously across dataframes.

Once your data is ready, Chen guides you through fitting models for prediction, clustering, inference, and exploration. He provides tips on performance and scalability, and introduces you to the wider Python data analysis ecosystem.

- Work with DataFrames and Series, and import or export data
- Create plots with matplotlib, seaborn, and pandas
- Combine datasets and handle missing data
- Reshape, tidy, and clean datasets so they're easier to work with
- Convert data types and manipulate text strings
- Apply functions to scale data manipulations
- Aggregate, transform, and filter large datasets with groupby
- Leverage Pandas' advanced date and time capabilities
- Fit linear models using statsmodels and scikit-learn libraries
- Use generalized linear modeling to fit models with different response variables
- Compare multiple models to select the best
- Regularize to overcome overfitting and improve performance
- Use clustering in unsupervised machine learning

Foreword
Preface
Acknowledgments
About the Author

I Introduction

1 Pandas Data Frame Basics
  1.1 Introduction
  1.2 Loading Your First Data Set
  1.3 Looking at Columns, Rows, and Cells
    1.3.1 Subsetting Columns
    1.3.2 Subsetting Rows
    1.3.3 Mixing It Up
  1.4 Grouped and Aggregated Calculations
    1.4.1 Grouped Means
    1.4.2 Grouped Frequency Counts
  1.5 Basic Plot
  1.6 Conclusion

2 Pandas Data Structures
  2.1 Introduction
  2.2 Creating Your Own Data
    2.2.1 Creating a Series
    2.2.2 Creating a DataFrame
  2.3 The Series
    2.3.1 The Series Is ndarray-like
    2.3.2 Boolean Subsetting
    2.3.3 Operations Are Aligned and Vectorized (Broadcasting)
  2.4 The DataFrame
    2.4.1 Boolean Subsetting: DataFrames
    2.4.2 Operations Are Automatically Aligned and Vectorized (Broadcasting)
  2.5 Making Changes to Series and DataFrames
    2.5.1 Add Additional Columns
    2.5.2 Directly Change a Column
    2.5.3 Dropping Values
  2.6 Exporting and Importing Data
    2.6.1 pickle
    2.6.2 CSV
    2.6.3 Excel
    2.6.4 Feather Format to Interface With R
    2.6.5 Other Data Output Types
  2.7 Conclusion

3 Introduction to Plotting
  3.1 Introduction
  3.2 Matplotlib
  3.3 Statistical Graphics Using matplotlib
    3.3.1 Univariate
    3.3.2 Bivariate
    3.3.3 Multivariate Data
  3.4 Seaborn
    3.4.1 Univariate
    3.4.2 Bivariate Data
    3.4.3 Multivariate Data
  3.5 Pandas Objects
    3.5.1 Histograms
    3.5.2 Density Plot
    3.5.3 Scatterplot
    3.5.4 Hexbin Plot
    3.5.5 Boxplot
  3.6 Seaborn Themes and Styles
  3.7 Conclusion

II Data Manipulation

4 Data Assembly
  4.1 Introduction
  4.2 Tidy Data
    4.2.1 Combining Data Sets
  4.3 Concatenation
    4.3.1 Adding Rows
    4.3.2 Adding Columns
    4.3.3 Concatenation With Different Indices
  4.4 Merging Multiple Data Sets
    4.4.1 One-to-One Merge
    4.4.2 Many-to-One Merge
    4.4.3 Many-to-Many Merge
  4.5 Conclusion

5 Missing Data
  5.1 Introduction
  5.2 What Is a NaN Value?
  5.3 Where Do Missing Values Come From?
    5.3.1 Load Data
    5.3.2 Merged Data
    5.3.3 User Input Values
    5.3.4 Re-indexing
  5.4 Working With Missing Data
    5.4.1 Find and Count Missing Data
    5.4.2 Cleaning Missing Data
    5.4.3 Calculations With Missing Data
  5.5 Conclusion

6 Tidy Data
  6.1 Introduction
  6.2 Columns Contain Values, Not Variables
    6.2.1 Keep One Column Fixed
    6.2.2 Keep Multiple Columns Fixed
  6.3 Columns Contain Multiple Variables
    6.3.1 Split and Add Columns Individually (Simple Method)
    6.3.2 Split and Combine in a Single Step (Simple Method)
    6.3.3 Split and Combine in a Single Step (More Complicated Method)
  6.4 Variables in Both Rows and Columns
  6.5 Multiple Observational Units in a Table (Normalization)
  6.6 Observational Units Across Multiple Tables
    6.6.1 Load Multiple Files Using a Loop
    6.6.2 Load Multiple Files Using a List Comprehension
  6.7 Conclusion

III Data Munging

7 Data Types
  7.1 Introduction
  7.2 Data Types
  7.3 Converting Types
    7.3.1 Converting to String Objects
    7.3.2 Converting to Numeric Values
  7.4 Categorical Data
    7.4.1 Convert to Category
    7.4.2 Manipulating Categorical Data
  7.5 Conclusion

8 Strings and Text Data
  8.1 Introduction
  8.2 Strings
    8.2.1 Subsetting and Slicing Strings
    8.2.2 Getting the Last Character in a String
  8.3 String Methods
  8.4 More String Methods
    8.4.1 Join
    8.4.2 Splitlines
  8.5 String Formatting
    8.5.1 Custom String Formatting
    8.5.2 Formatting Character Strings
    8.5.3 Formatting Numbers
    8.5.4 C printf Style Formatting
    8.5.5 Formatted Literal Strings in Python 3.6+
  8.6 Regular Expressions (RegEx)
    8.6.1 Match a Pattern
    8.6.2 Find a Pattern
    8.6.3 Substituting a Pattern
    8.6.4 Compiling a Pattern
  8.7 The regex Library
  8.8 Conclusion

9 Apply
  9.1 Introduction
  9.2 Functions
  9.3 Apply (Basics)
    9.3.1 Apply Over a Series
    9.3.2 Apply Over a DataFrame
  9.4 Apply (More Advanced)
    9.4.1 Column-wise Operations
    9.4.2 Row-wise Operations
  9.5 Vectorized Functions
    9.5.1 Using numpy
    9.5.2 Using numba
  9.6 Lambda Functions
  9.7 Conclusion

10 Groupby Operations: Split-Apply-Combine
  10.1 Introduction
  10.2 Aggregate
    10.2.1 Basic One-Variable Grouped Aggregation
    10.2.2 Built-in Aggregation Methods
    10.2.3 Aggregation Functions
    10.2.4 Multiple Functions Simultaneously
    10.2.5 Using a dict in agg/aggregate
  10.3 Transform
    10.3.1 z-Score Example
  10.4 Filter
  10.5 The pandas.core.groupby.DataFrameGroupBy Object
    10.5.1 Groups
    10.5.2 Group Calculations Involving Multiple Variables
    10.5.3 Selecting a Group
    10.5.4 Iterating Through Groups
    10.5.5 Multiple Groups
    10.5.6 Flattening the Results
  10.6 Working With a MultiIndex
  10.7 Conclusion

11 The datetime Data Type
  11.1 Introduction
  11.2 Python's datetime Object
  11.3 Converting to datetime
  11.4 Loading Data That Include Dates
  11.5 Extracting Date Components
  11.6 Date Calculations and Timedeltas
  11.7 Datetime Methods
  11.8 Getting Stock Data
  11.9 Subsetting Data Based on Dates
    11.9.1 The DatetimeIndex Object
    11.9.2 The TimedeltaIndex Object
  11.10 Date Ranges
    11.10.1 Frequencies
    11.10.2 Offsets
  11.11 Shifting Values
  11.12 Resampling
  11.13 Time Zones
  11.14 Conclusion

IV Data Modeling

12 Linear Models
  12.1 Introduction
  12.2 Simple Linear Regression
    12.2.1 Using statsmodels
    12.2.2 Using sklearn
  12.3 Multiple Regression
    12.3.1 Using statsmodels
    12.3.2 Using statsmodels With Categorical Variables
    12.3.3 Using sklearn
    12.3.4 Using sklearn With Categorical Variables
  12.4 Keeping Index Labels From sklearn
  12.5 Conclusion

13 Generalized Linear Models
  13.1 Introduction
  13.2 Logistic Regression
    13.2.1 Using statsmodels
    13.2.2 Using sklearn
  13.3 Poisson Regression
    13.3.1 Using statsmodels
    13.3.2 Negative Binomial Regression for Overdispersion
  13.4 More Generalized Linear Models
  13.5 Survival Analysis
    13.5.1 Testing the Cox Model Assumptions
  13.6 Conclusion

14 Model Diagnostics
  14.1 Introduction
  14.2 Residuals
    14.2.1 Q-Q Plots
  14.3 Comparing Multiple Models
    14.3.1 Working With Linear Models
    14.3.2 Working With GLM Models
  14.4 k-Fold Cross-Validation
  14.5 Conclusion

15 Regularization
  15.1 Introduction
  15.2 Why Regularize?
  15.3 LASSO Regression
  15.4 Ridge Regression
  15.5 Elastic Net
  15.6 Cross-Validation
  15.7 Conclusion

16 Clustering
  16.1 Introduction
  16.2 k-Means
    16.2.1 Dimension Reduction With PCA
  16.3 Hierarchical Clustering
    16.3.1 Complete Clustering
    16.3.2 Single Clustering
    16.3.3 Average Clustering
    16.3.4 Centroid Clustering
    16.3.5 Manually Setting the Threshold
  16.4 Conclusion

V Conclusion

17 Life Outside of Pandas
  17.1 The (Scientific) Computing Stack
  17.2 Performance
    17.2.1 Timing Your Code
    17.2.2 Profiling Your Code
  17.3 Going Bigger and Faster

18 Toward a Self-Directed Learner
  18.1 It's Dangerous to Go Alone!
  18.2 Local Meetups
  18.3 Conferences
  18.4 The Internet
  18.5 Podcasts
  18.6 Conclusion

VI Appendixes

A Installation
  A.1 Installing Anaconda
    A.1.1 Windows
    A.1.2 Mac
    A.1.3 Linux
  A.2 Uninstall Anaconda

B Command Line
  B.1 Installation
    B.1.1 Windows
    B.1.2 Mac
    B.1.3 Linux
  B.2 Basics

C Project Templates

D Using Python
  D.1 Command Line and Text Editor
  D.2 Python and IPython
  D.3 Jupyter
  D.4 Integrated Development Environments (IDEs)

E Working Directories

F Environments

G Install Packages
  G.1 Updating Packages

H Importing Libraries

I Lists

J Tuples

K Dictionaries

L Slicing Values

M Loops

N Comprehensions

O Functions
  O.1 Default Parameters
  O.2 Arbitrary Parameters
    O.2.1 *args
    O.2.2 **kwargs

P Ranges and Generators

Q Multiple Assignment

R numpy ndarray

S Classes

T Odo: The Shapeshifter

Index

Daniel Chen is a graduate student in the interdisciplinary PhD program in Genetics, Bioinformatics & Computational Biology (GBCB) at Virginia Tech. He is involved with Software Carpentry as an instructor and lesson maintainer. He completed his master's degree in public health, in Epidemiology, at Columbia University Mailman School of Public Health, and currently works at the Social and Decision Analytics Laboratory under the Biocomplexity Institute of Virginia Tech, where he works with data to inform policy decision-making. He is the author of Pandas for Everyone and Pandas Data Analysis with Python Fundamentals LiveLessons.
