Preface |
|
xvii | |
|
|
1 | (8) |
|
|
1 | (1) |
|
|
1 | (1) |
|
|
2 | (1) |
|
1.4 What are development and conservation? |
|
|
2 | (1) |
|
1.5 Science and decision making |
|
|
3 | (1) |
|
1.6 Why data science is important |
|
|
4 | (2) |
|
1.6.1 Monitoring and evaluation |
|
|
4 | (1) |
|
1.6.2 Projects versus programmes |
|
|
5 | (1) |
|
1.6.3 Project delivery versus research projects |
|
|
5 | (1) |
|
1.7 The goal of this book |
|
|
6 | (1) |
|
1.8 How this book is organised |
|
|
6 | (1) |
|
1.9 How code is organised in this book |
|
|
7 | (2) |
|
|
9 | (2) |
|
|
11 | (6) |
|
|
11 | (1) |
|
|
12 | (2) |
|
|
14 | (1) |
|
2.4 What makes good data? |
|
|
14 | (1) |
|
2.5 Recommended resources |
|
|
15 | (1) |
|
|
15 | (2) |
|
3 Data integration in project management |
|
|
17 | (12) |
|
3.1 Adaptive management cycles |
|
|
17 | (1) |
|
|
17 | (9) |
|
|
18 | (1) |
|
3.2.1.1 Development of a project strategy and proposal |
|
|
18 | (2) |
|
3.2.1.2 Proposal submission process |
|
|
20 | (1) |
|
3.2.1.3 What is a logframe? |
|
|
21 | (1) |
|
3.2.1.4 Logframe terminology |
|
|
21 | (1) |
|
3.2.1.5 Pre-implementation planning |
|
|
21 | (1) |
|
|
22 | (1) |
|
|
23 | (1) |
|
|
23 | (2) |
|
|
25 | (1) |
|
|
26 | (1) |
|
3.4 Recommended resources |
|
|
26 | (1) |
|
|
27 | (2) |
|
|
29 | (14) |
|
|
29 | (1) |
|
|
30 | (1) |
|
|
30 | (4) |
|
|
31 | (1) |
|
4.3.2 Version information |
|
|
31 | (1) |
|
4.3.3 Writing code in the console |
|
|
32 | (1) |
|
|
33 | (1) |
|
4.3.5 Using the default script editor |
|
|
33 | (1) |
|
|
34 | (1) |
|
|
34 | (1) |
|
|
35 | (4) |
|
|
36 | (1) |
|
|
36 | (1) |
|
4.5.2.1 Getting help on functions |
|
|
37 | (1) |
|
|
37 | (2) |
|
4.5.3.1 Getting help on packages |
|
|
39 | (1) |
|
4.6 Writing meaningful code |
|
|
39 | (1) |
|
|
40 | (2) |
|
4.8 Recommended resources |
|
|
42 | (1) |
|
|
42 | (1) |
|
5 Introduction to data frames |
|
|
43 | (16) |
|
|
43 | (3) |
|
5.2 Importing a data frame |
|
|
46 | (1) |
|
|
47 | (1) |
|
5.4 Investigating a data frame |
|
|
47 | (2) |
|
5.5 Other functions to examine an R object |
|
|
49 | (1) |
|
5.6 Subsetting using the '[' and ']' operators |
|
|
50 | (2) |
|
5.7 Descriptive statistics |
|
|
52 | (1) |
|
|
53 | (1) |
|
5.9 Making a reproducible example |
|
|
54 | (2) |
|
5.9.1 Reproducible example steps |
|
|
54 | (2) |
|
5.10 Recommended resources |
|
|
56 | (1) |
|
|
57 | (2) |
|
|
59 | (6) |
|
|
59 | (1) |
|
6.1.1 Why evidence is important |
|
|
60 | (1) |
|
|
60 | (3) |
|
6.2.1 Description of condev data sets |
|
|
62 | (1) |
|
6.3 Recommended resources |
|
|
63 | (1) |
|
|
63 | (2) |
|
|
65 | (132) |
|
7 ggplot2: graphing with the tidyverse |
|
|
67 | (16) |
|
|
67 | (1) |
|
7.2 The tidyverse package |
|
|
68 | (1) |
|
|
68 | (1) |
|
|
69 | (12) |
|
|
69 | (1) |
|
|
70 | (1) |
|
|
71 | (2) |
|
|
73 | (2) |
|
|
75 | (1) |
|
|
76 | (2) |
|
|
78 | (3) |
|
|
81 | (1) |
|
7.6 Recommended resources |
|
|
81 | (1) |
|
|
82 | (1) |
|
|
83 | (1) |
|
8.1 Why customise a ggplot? |
|
|
83 | (1) |
|
|
83 | (1) |
|
|
84 | (1) |
|
|
84 | (1) |
|
8.5 Aesthetics properties |
|
|
84 | (5) |
|
8.5.1 Setting aesthetics |
|
|
85 | (2) |
|
8.5.2 A quick note about colour |
|
|
87 | (1) |
|
8.5.3 Using aesthetics to distinguish groups |
|
|
87 | (1) |
|
8.5.4 Using faceting to distinguish groups |
|
|
88 | (1) |
|
8.6 Improving crowded graphs |
|
|
89 | (2) |
|
|
91 | (1) |
|
|
92 | (2) |
|
8.9 Using the theme() function |
|
|
94 | (5) |
|
|
95 | (1) |
|
|
96 | (1) |
|
8.9.3 Spacing between axis and graph |
|
|
97 | (1) |
|
|
98 | (1) |
|
|
99 | (5) |
|
|
99 | (2) |
|
|
101 | (1) |
|
8.10.3 Forcing a common origin |
|
|
102 | (1) |
|
|
102 | (1) |
|
8.10.5 Forcing a plot to be square |
|
|
103 | (1) |
|
8.10.6 Log scales and large numbers |
|
|
103 | (1) |
|
|
104 | (1) |
|
8.12 Recommended resources |
|
|
105 | (1) |
|
|
105 | (2) |
|
|
107 | (1) |
|
9.1 What is data wrangling? |
|
|
107 | (1) |
|
|
108 | (1) |
|
|
108 | (1) |
|
|
108 | (2) |
|
9.5 Tibbles versus data frame |
|
|
110 | (2) |
|
|
112 | (3) |
|
|
112 | (1) |
|
|
113 | (2) |
|
|
115 | (3) |
|
|
115 | (1) |
|
|
115 | (1) |
|
|
116 | (1) |
|
|
117 | (1) |
|
|
118 | (3) |
|
|
118 | (2) |
|
|
120 | (1) |
|
|
121 | (1) |
|
|
121 | (1) |
|
|
121 | (1) |
|
|
122 | (1) |
|
9.11 Recommended resources |
|
|
123 | (1) |
|
|
123 | (2) |
|
|
125 | (20) |
|
10.1 Cleaning is more than correcting mistakes |
|
|
125 | (1) |
|
|
125 | (1) |
|
|
126 | (1) |
|
|
127 | (3) |
|
|
127 | (1) |
|
|
127 | (1) |
|
|
128 | (1) |
|
|
128 | (2) |
|
10.5 Fixing missing values |
|
|
130 | (4) |
|
|
131 | (1) |
|
|
131 | (1) |
|
|
132 | (1) |
|
|
132 | (1) |
|
10.5.5 Cleaning a whole data set |
|
|
133 | (1) |
|
10.6 Adding and dropping factor levels |
|
|
134 | (2) |
|
|
134 | (1) |
|
|
135 | (1) |
|
10.6.3 Keeping empty levels in ggplot |
|
|
135 | (1) |
|
10.7 Fusing duplicate columns |
|
|
136 | (1) |
|
|
137 | (1) |
|
10.8 Organising factor levels |
|
|
137 | (3) |
|
|
138 | (1) |
|
|
139 | (1) |
|
|
139 | (1) |
|
10.9 Anonymisation and pseudonymisation |
|
|
140 | (3) |
|
|
141 | (2) |
|
10.10 Recommended resources |
|
|
143 | (1) |
|
|
143 | (2) |
|
11 Working with dates and time |
|
|
145 | (16) |
|
|
145 | (1) |
|
|
145 | (1) |
|
|
146 | (1) |
|
|
146 | (3) |
|
11.4.1 Formatting dates with lubridate |
|
|
147 | (1) |
|
11.4.2 Formatting dates with base R |
|
|
148 | (1) |
|
|
148 | (1) |
|
|
149 | (2) |
|
|
151 | (1) |
|
|
151 | (3) |
|
11.7.1 The importance of time zones |
|
|
152 | (1) |
|
11.7.2 Same times in different time zones |
|
|
153 | (1) |
|
11.8 Replacing missing date components |
|
|
154 | (1) |
|
11.9 Graphing: a worked example |
|
|
154 | (5) |
|
11.9.1 Reordering a variable by a date |
|
|
155 | (1) |
|
11.9.2 Summarising date-based data |
|
|
156 | (2) |
|
11.9.3 Date labels with scale_x_date() |
|
|
158 | (1) |
|
11.10 Recommended resources |
|
|
159 | (1) |
|
|
159 | (2) |
|
12 Working with spatial data |
|
|
161 | (28) |
|
12.1 The importance of maps |
|
|
161 | (1) |
|
|
161 | (2) |
|
|
163 | (1) |
|
12.4 What is spatial data? |
|
|
163 | (1) |
|
12.5 Introduction to the sf package |
|
|
163 | (3) |
|
12.5.1 Reading data: st_read() |
|
|
164 | (1) |
|
12.5.2 Converting data: st_as_sf() |
|
|
164 | (1) |
|
12.5.3 Polygon area: st_area() |
|
|
164 | (1) |
|
12.5.4 Plotting maps: geom_sf() |
|
|
165 | (1) |
|
12.5.5 Extracting coordinates: st_coordinates() |
|
|
166 | (1) |
|
12.6 Plotting a world map |
|
|
166 | (1) |
|
12.6.1 Filtering with filter() |
|
|
167 | (1) |
|
12.7 Coordinate reference systems |
|
|
167 | (4) |
|
12.7.1 Finding the CRS of an object with st_crs() |
|
|
169 | (1) |
|
12.7.2 Transform the CRS with st_transform() |
|
|
170 | (1) |
|
12.7.3 Cropping with coord_sf() |
|
|
170 | (1) |
|
12.8 Adding reference information |
|
|
171 | (3) |
|
12.8.1 Adding a scale bar and north arrow |
|
|
171 | (1) |
|
12.8.2 Positioning names with centroids |
|
|
172 | (1) |
|
12.8.3 Adding names with geom_text() |
|
|
173 | (1) |
|
12.9 Making a choropleth |
|
|
174 | (1) |
|
|
175 | (1) |
|
12.11 Saving with st_write() |
|
|
176 | (1) |
|
12.12 Rasters with the raster package |
|
|
177 | (10) |
|
|
177 | (1) |
|
|
177 | (1) |
|
|
178 | (2) |
|
12.12.4 Basic raster calculations |
|
|
180 | (1) |
|
|
181 | (1) |
|
12.12.6 Extracting raster data from points |
|
|
182 | (1) |
|
12.12.7 Turning data frames into rasters |
|
|
183 | (1) |
|
12.12.8 Calculating distances |
|
|
183 | (1) |
|
|
184 | (1) |
|
|
185 | (1) |
|
|
185 | (1) |
|
12.12.12 Changing to a data frame |
|
|
186 | (1) |
|
12.13 Recommended resources |
|
|
187 | (1) |
|
|
187 | (2) |
|
13 Common R code mistakes and quirks |
|
|
189 | (8) |
|
|
189 | (1) |
|
|
190 | (1) |
|
|
190 | (1) |
|
13.4 Capitalisation mistakes |
|
|
190 | (1) |
|
|
191 | (1) |
|
13.6 Forgetting quotation marks |
|
|
191 | (1) |
|
|
192 | (1) |
|
13.8 Forgetting '+' in a ggplot |
|
|
192 | (1) |
|
13.9 Forgetting to call a ggplot object |
|
|
193 | (1) |
|
13.10 Piping but not making an object |
|
|
193 | (1) |
|
13.11 Changing a factor to a number |
|
|
194 | (1) |
|
13.12 Strings automatically read as factors |
|
|
195 | (1) |
|
|
196 | (1) |
|
|
197 | (158) |
|
14 Basic statistical concepts |
|
|
199 | (26) |
|
14.1 Variables and statistics |
|
|
199 | (1) |
|
|
199 | (1) |
|
|
199 | (1) |
|
14.4 Describing things which are variable |
|
|
200 | (6) |
|
|
200 | (1) |
|
|
200 | (1) |
|
|
201 | (2) |
|
14.4.2 Describing variability |
|
|
203 | (1) |
|
|
203 | (1) |
|
14.4.2.2 Standard deviation |
|
|
203 | (1) |
|
14.4.2.3 Percentile range |
|
|
204 | (1) |
|
14.4.3 Reporting central tendency and variability |
|
|
205 | (1) |
|
|
206 | (1) |
|
14.5 Introducing probability |
|
|
206 | (1) |
|
14.6 Probability distributions |
|
|
207 | (8) |
|
14.6.1 Binomial distribution |
|
|
207 | (1) |
|
14.6.1.1 Bernoulli distribution |
|
|
208 | (2) |
|
14.6.2 Poisson distribution |
|
|
210 | (3) |
|
14.6.3 Normal distribution |
|
|
213 | (2) |
|
|
215 | (4) |
|
14.7.1 Simple random sampling |
|
|
216 | (1) |
|
14.7.2 Stratified random sampling |
|
|
217 | (2) |
|
14.8 Modelling approaches |
|
|
219 | (4) |
|
14.8.1 Null hypothesis testing |
|
|
219 | (2) |
|
14.8.2 Information-theoretics |
|
|
221 | (1) |
|
14.8.3 Bayesian approaches |
|
|
222 | (1) |
|
|
222 | (1) |
|
14.9 Under-fitting and over-fitting |
|
|
223 | (1) |
|
14.10 Recommended resources |
|
|
224 | (1) |
|
|
224 | (1) |
|
15 Understanding linear models |
|
|
225 | (42) |
|
15.1 Regression versus classification |
|
|
225 | (1) |
|
|
226 | (1) |
|
|
226 | (1) |
|
15.4 Graphing a y variable |
|
|
227 | (1) |
|
15.5 What is a linear model? |
|
|
227 | (2) |
|
15.5.1 How to draw a linear model from an equation |
|
|
228 | (1) |
|
15.6 Predicting the response variable |
|
|
229 | (2) |
|
15.7 Formulating hypotheses |
|
|
231 | (1) |
|
|
232 | (4) |
|
|
233 | (1) |
|
|
234 | (2) |
|
15.9 Making a linear model in R |
|
|
236 | (1) |
|
15.10 Introduction to model selection |
|
|
237 | (1) |
|
15.10.1 Estimating the number of parameters: K |
|
|
237 | (1) |
|
15.10.2 Goodness of fit: L |
|
|
238 | (1) |
|
15.11 Doing model selection in R |
|
|
238 | (4) |
|
15.11.1 Interpreting an AIC Table |
|
|
240 | (1) |
|
15.11.1.1 Evidence ratios |
|
|
241 | (1) |
|
|
241 | (1) |
|
15.12 Understanding coefficients |
|
|
242 | (2) |
|
15.13 Model equations and prediction |
|
|
244 | (4) |
|
15.13.1 Dummy variables and a design matrix |
|
|
245 | (1) |
|
15.13.2 Plotting a prediction with geom_abline() |
|
|
246 | (1) |
|
15.13.3 Automatic prediction |
|
|
247 | (1) |
|
15.14 Understanding a model summary |
|
|
248 | (2) |
|
15.15 Standard errors and confidence intervals |
|
|
250 | (2) |
|
15.15.1 Confidence intervals for model predictions |
|
|
250 | (2) |
|
|
252 | (4) |
|
|
256 | (1) |
|
15.17 Log transformations |
|
|
256 | (5) |
|
15.17.1 What are logarithms? |
|
|
257 | (3) |
|
15.17.2 Logarithms and zero |
|
|
260 | (1) |
|
|
261 | (4) |
|
15.18.1 Making a for() loop |
|
|
262 | (1) |
|
15.18.2 Example simulation |
|
|
262 | (3) |
|
15.19 Reporting modelling results |
|
|
265 | (1) |
|
|
266 | (1) |
|
16 Extensions to linear models |
|
|
267 | (42) |
|
16.1 Building upon linear models |
|
|
267 | (1) |
|
|
267 | (1) |
|
|
268 | (1) |
|
|
268 | (9) |
|
16.4.1 Additive versus interaction models |
|
|
269 | (1) |
|
16.4.2 Visualising multiple regression |
|
|
270 | (1) |
|
16.4.2.1 Visualising continuous variables |
|
|
271 | (3) |
|
16.4.2.2 A visualisation trick |
|
|
274 | (2) |
|
16.4.3 Collinearity and multicollinearity |
|
|
276 | (1) |
|
16.5 Most statistical tests are linear models |
|
|
277 | (4) |
|
16.5.0.1 One sample t-test |
|
|
278 | (1) |
|
16.5.0.2 Independent t-test |
|
|
279 | (1) |
|
|
279 | (1) |
|
|
280 | (1) |
|
|
281 | (1) |
|
16.6 Generalised linear models |
|
|
281 | (17) |
|
16.6.0.1 The importance of link functions |
|
|
282 | (1) |
|
16.6.1 Gaussian distribution (normal distribution) |
|
|
283 | (1) |
|
16.6.2 Binomial distribution for logistic regression |
|
|
283 | (2) |
|
16.6.2.1 Confidence intervals with link functions |
|
|
285 | (2) |
|
16.6.3 Poisson regression for count data |
|
|
287 | (2) |
|
16.6.3.1 Diagnostics for GLM |
|
|
289 | (2) |
|
16.6.3.2 Under and over-dispersion |
|
|
291 | (1) |
|
|
292 | (1) |
|
16.6.4.1 Data preparation |
|
|
293 | (1) |
|
16.6.4.2 Chi-squared goodness-of-fit test |
|
|
294 | (1) |
|
16.6.4.3 Test of independence |
|
|
295 | (3) |
|
16.7 Other related modelling approaches |
|
|
298 | (9) |
|
|
298 | (2) |
|
|
300 | (1) |
|
16.7.2 Cumulative link models |
|
|
301 | (3) |
|
|
304 | (1) |
|
16.7.4 Non-parametric approaches |
|
|
305 | (2) |
|
16.7.5 Advantages of non-parametric tests |
|
|
307 | (1) |
|
16.8 Recommended resources |
|
|
307 | (1) |
|
|
307 | (2) |
|
17 Introduction to clustering and classification |
|
|
309 | (30) |
|
17.1 Clustering and Classification |
|
|
309 | (1) |
|
|
309 | (1) |
|
|
310 | (1) |
|
17.4 Supervised versus unsupervised learning |
|
|
310 | (1) |
|
17.4.1 Why learn classification and clustering |
|
|
311 | (1) |
|
|
311 | (14) |
|
17.5.1 Hierarchical clustering |
|
|
312 | (5) |
|
17.5.2 Dimension reduction |
|
|
317 | (8) |
|
|
325 | (12) |
|
17.6.1 Classification trees |
|
|
326 | (1) |
|
17.6.2 How classification trees work |
|
|
327 | (1) |
|
|
328 | (2) |
|
|
330 | (3) |
|
17.6.4 k-fold cross-validation |
|
|
333 | (3) |
|
17.6.5 Accuracy versus interpretability |
|
|
336 | (1) |
|
17.7 Recommended resources |
|
|
337 | (1) |
|
|
337 | (2) |
|
18 Reporting and worked examples |
|
|
339 | (14) |
|
18.1 Writing the project report |
|
|
339 | (1) |
|
|
340 | (1) |
|
|
340 | (1) |
|
18.4 Reporting training data |
|
|
341 | (4) |
|
|
345 | (3) |
|
|
348 | (5) |
|
|
353 | (2) |
|
A Appendix: step-wise statistical calculations |
|
|
355 | (8) |
|
A.1 How to approach an equation |
|
|
355 | (1) |
|
|
355 | (1) |
|
A.3 The standard deviation |
|
|
356 | (1) |
|
|
356 | (7) |
|
|
357 | (1) |
|
|
358 | (1) |
|
A.4.3 r2 (Pearson's correlation coefficient) |
|
|
359 | (1) |
|
|
360 | (3) |
Index |
|
363 | |