Preface  xiii
|
|
1 Introduction  1
1.1 What is machine learning?  1
1.2 What kind of problems can be tackled using machine learning?  2
1.3 Some standard learning tasks  3
1.4 Learning stages  4
1.5 Learning scenarios  6
1.6 Generalization  7
|
2 The PAC Learning Framework  9
2.1 The PAC learning model  9
2.2 Guarantees for finite hypothesis sets --- consistent case  15
2.3 Guarantees for finite hypothesis sets --- inconsistent case  19
2.4 Generalities  21
2.4.1 Deterministic versus stochastic scenarios  21
2.4.2 Bayes error and noise  22
2.5 Chapter notes  23
2.6 Exercises  23
|
3 Rademacher Complexity and VC-Dimension  29
3.1 Rademacher complexity  30
3.2 Growth function  34
3.3 VC-dimension  36
3.4 Lower bounds  43
3.5 Chapter notes  48
3.6 Exercises  50
|
|
4 Model Selection  61
4.1 Estimation and approximation errors  61
4.2 Empirical risk minimization (ERM)  62
4.3 Structural risk minimization (SRM)  64
4.4 Cross-validation  68
4.5 n-Fold cross-validation  71
4.6 Regularization-based algorithms  72
4.7 Convex surrogate losses  73
4.8 Chapter notes  77
4.9 Exercises  78
|
5 Support Vector Machines  79
5.1 Linear classification  79
5.2 Separable case  80
5.2.1 Primal optimization problem  81
5.2.2 Support vectors  83
5.2.3 Dual optimization problem  83
5.2.4 Leave-one-out analysis  85
5.3 Non-separable case  87
5.3.1 Primal optimization problem  88
5.3.2 Support vectors  89
5.3.3 Dual optimization problem  90
5.4 Margin theory  91
5.5 Chapter notes  100
5.6 Exercises  100
|
|
6 Kernel Methods  105
6.1 Introduction  105
6.2 Positive definite symmetric kernels  108
6.2.1 Definitions  108
6.2.2 Reproducing kernel Hilbert space  110
6.2.3 Properties  112
6.3 Kernel-based algorithms  116
6.3.1 SVMs with PDS kernels  116
6.3.2 Representer theorem  117
6.3.3 Learning guarantees  117
6.4 Negative definite symmetric kernels  119
6.5 Sequence kernels  121
6.5.1 Weighted transducers  122
6.5.2 Rational kernels  126
6.6 Approximate kernel feature maps  130
6.7 Chapter notes  135
6.8 Exercises  137
|
|
7 Boosting  145
7.1 Introduction  145
7.2 AdaBoost  146
7.2.1 Bound on the empirical error  149
7.2.2 Relationship with coordinate descent  150
7.2.3 Practical use  154
7.3 Theoretical results  154
7.3.1 VC-dimension-based analysis  154
7.3.2 L1-geometric margin  155
7.3.3 Margin-based analysis  157
7.3.4 Margin maximization  161
7.3.5 Game-theoretic interpretation  162
7.4 L1-regularization  165
7.5 Discussion  167
7.6 Chapter notes  168
7.7 Exercises  170
|
|
8 On-Line Learning  177
8.1 Introduction  178
8.2 Prediction with expert advice  178
8.2.1 Mistake bounds and Halving algorithm  179
8.2.2 Weighted majority algorithm  181
8.2.3 Randomized weighted majority algorithm  183
8.2.4 Exponential weighted average algorithm  186
8.3 Linear classification  190
8.3.1 Perceptron algorithm  190
8.3.2 Winnow algorithm  198
8.4 On-line to batch conversion  201
8.5 Game-theoretic connection  204
8.6 Chapter notes  205
8.7 Exercises  206
|
9 Multi-Class Classification  213
9.1 Multi-class classification problem  213
9.2 Generalization bounds  215
9.3 Uncombined multi-class algorithms  221
9.3.1 Multi-class SVMs  221
9.3.2 Multi-class boosting algorithms  222
9.3.3 Decision trees  224
9.4 Aggregated multi-class algorithms  228
9.4.1 One-versus-all  229
9.4.2 One-versus-one  229
9.4.3 Error-correcting output codes  231
9.5 Structured prediction algorithms  233
9.6 Chapter notes  235
9.7 Exercises  237
|
|
10 Ranking  239
10.1 The problem of ranking  240
10.2 Generalization bound  241
10.3 Ranking with SVMs  243
10.4 RankBoost  244
10.4.1 Bound on the empirical error  246
10.4.2 Relationship with coordinate descent  248
10.4.3 Margin bound for ensemble methods in ranking  250
10.5 Bipartite ranking  251
10.5.1 Boosting in bipartite ranking  252
10.5.2 Area under the ROC curve  255
10.6 Preference-based setting  257
10.6.1 Second-stage ranking problem  257
10.6.2 Deterministic algorithm  259
10.6.3 Randomized algorithm  260
10.6.4 Extension to other loss functions  262
10.7 Other ranking criteria  262
10.8 Chapter notes  263
10.9 Exercises  264
|
|
11 Regression  267
11.1 The problem of regression  267
11.2 Generalization bounds  268
11.2.1 Finite hypothesis sets  268
11.2.2 Rademacher complexity bounds  269
11.2.3 Pseudo-dimension bounds  271
11.3 Regression algorithms  275
11.3.1 Linear regression  275
11.3.2 Kernel ridge regression  276
11.3.3 Support vector regression  281
11.3.4 Lasso  285
11.3.5 Group norm regression algorithms  289
11.3.6 On-line regression algorithms  289
11.4 Chapter notes  290
11.5 Exercises  292
|
12 Maximum Entropy Models  295
12.1 Density estimation problem  295
12.1.1 Maximum Likelihood (ML) solution  296
12.1.2 Maximum a Posteriori (MAP) solution  297
12.2 Density estimation problem augmented with features  297
12.3 Maxent principle  298
12.4 Maxent models  299
12.5 Dual problem  299
12.6 Generalization bound  303
12.7 Coordinate descent algorithm  304
12.8 Extensions  306
12.9 L2-regularization  308
12.10 Chapter notes  312
12.11 Exercises  313
|
13 Conditional Maximum Entropy Models  315
13.1 Learning problem  315
13.2 Conditional Maxent principle  316
13.3 Conditional Maxent models  316
13.4 Dual problem  317
13.5 Properties  319
13.5.1 Optimization problem  320
13.5.2 Feature vectors  320
13.5.3 Prediction  321
13.6 Generalization bounds  321
13.7 Logistic regression  325
13.7.1 Optimization problem  325
13.7.2 Logistic model  325
13.8 L2-regularization  326
13.9 Proof of the duality theorem  328
13.10 Chapter notes  330
13.11 Exercises  331
|
|
14 Algorithmic Stability  333
14.1 Definitions  333
14.2 Stability-based generalization guarantee  334
14.3 Stability of kernel-based regularization algorithms  336
14.3.1 Application to regression algorithms: SVR and KRR  339
14.3.2 Application to classification algorithms: SVMs  341
14.3.3 Discussion  342
14.4 Chapter notes  342
14.5 Exercises  343
|
15 Dimensionality Reduction  347
15.1 Principal component analysis  348
15.2 Kernel principal component analysis (KPCA)  349
15.3 KPCA and manifold learning  351
15.3.1 Isomap  351
15.3.2 Laplacian eigenmaps  352
15.3.3 Locally linear embedding (LLE)  353
15.4 Johnson-Lindenstrauss lemma  354
15.5 Chapter notes  356
15.6 Exercises  356
|
16 Learning Automata and Languages  359
16.1 Introduction  359
16.2 Finite automata  360
16.3 Efficient exact learning  361
16.3.1 Passive learning  362
16.3.2 Learning with queries  363
16.3.3 Learning automata with queries  364
16.4 Identification in the limit  369
16.4.1 Learning reversible automata  370
16.5 Chapter notes  375
16.6 Exercises  376
|
17 Reinforcement Learning  379
17.1 Learning scenario  379
17.2 Markov decision process model  380
17.3 Policy  381
17.3.1 Definition  381
17.3.2 Policy value  382
17.3.3 Optimal policies  382
17.3.4 Policy evaluation  385
17.4 Planning algorithms  387
17.4.1 Value iteration  387
17.4.2 Policy iteration  390
17.4.3 Linear programming  392
17.5 Learning algorithms  393
17.5.1 Stochastic approximation  394
17.5.2 TD(0) algorithm  397
17.5.3 Q-learning algorithm  398
17.5.4 SARSA  402
17.5.5 TD(λ) algorithm  402
17.5.6 Large state space  403
17.6 Chapter notes  405

Conclusion  407
|
|
A Linear Algebra Review  409
A.1 Vectors and norms  409
A.1.1 Norms  409
A.1.2 Dual norms  410
A.1.3 Relationship between norms  411
A.2 Matrices  411
A.2.1 Matrix norms  411
A.2.2 Singular value decomposition  412
A.2.3 Symmetric positive semidefinite (SPSD) matrices  412
|
|
B Convex Optimization  415
B.1 Differentiation and unconstrained optimization  415
B.2 Convexity  415
B.3 Constrained optimization  419
B.4 Fenchel duality  422
B.4.1 Subgradients  422
B.4.2 Core  423
B.4.3 Conjugate functions  423
B.5 Chapter notes  426
B.6 Exercises  427
|
|
C Probability Review  429
C.1 Probability  429
C.2 Random variables  429
C.3 Conditional probability and independence  431
C.4 Expectation and Markov's inequality  431
C.5 Variance and Chebyshev's inequality  432
C.6 Moment-generating functions  434
C.7 Exercises  435
|
D Concentration Inequalities  437
D.1 Hoeffding's inequality  437
D.2 Sanov's theorem  438
D.3 Multiplicative Chernoff bounds  439
D.4 Binomial distribution tails: Upper bounds  440
D.5 Binomial distribution tails: Lower bound  440
D.6 Azuma's inequality  441
D.7 McDiarmid's inequality  442
D.8 Normal distribution tails: Lower bound  443
D.9 Khintchine-Kahane inequality  443
D.10 Maximal inequality  444
D.11 Chapter notes  445
D.12 Exercises  445
|
E Notions of Information Theory  449
E.1 Entropy  449
E.2 Relative entropy  450
E.3 Mutual information  453
E.4 Bregman divergences  453
E.5 Chapter notes  456
E.6 Exercises  457

F Notation  459
Bibliography  461
Index  475