Preface    x

1 Introduction    1
  1.1 Applications and problems    1
  1.2 Definitions and terminology    3
  1.3 Cross-validation    5
  1.4 Learning scenarios    7
  1.5 Outline    8
|
2 The PAC Learning Framework    11
  2.1 The PAC learning model    11
  2.2 Guarantees for finite hypothesis sets --- consistent case    17
  2.3 Guarantees for finite hypothesis sets --- inconsistent case    21
  2.4 Generalities    24
    2.4.1 Deterministic versus stochastic scenarios    24
    2.4.2 Bayes error and noise    25
    2.4.3 Estimation and approximation errors    26
    2.4.4 Model selection    27
  2.5 Chapter notes    28
  2.6 Exercises    29
|
3 Rademacher Complexity and VC-Dimension    33
  3.1 Rademacher complexity    34
  3.2 Growth function    38
  3.3 VC-dimension    41
  3.4 Lower bounds    48
  3.5 Chapter notes    54
  3.6 Exercises    55
|
4 Support Vector Machines    63
  4.1 Linear classification    63
  4.2 SVMs --- separable case    64
    4.2.1 Primal optimization problem    64
    4.2.2 Support vectors    66
    4.2.3 Dual optimization problem    67
    4.2.4 Leave-one-out analysis    69
  4.3 SVMs --- non-separable case    71
    4.3.1 Primal optimization problem    72
    4.3.2 Support vectors    73
    4.3.3 Dual optimization problem    74
  4.4 Margin theory    75
  4.5 Chapter notes    83
  4.6 Exercises    84
|
|
5 Kernel Methods    89
  5.1 Introduction    89
  5.2 Positive definite symmetric kernels    92
    5.2.1 Definitions    92
    5.2.2 Reproducing kernel Hilbert space    94
    5.2.3 Properties    96
  5.3 Kernel-based algorithms    100
    5.3.1 SVMs with PDS kernels    100
    5.3.2 Representer theorem    101
    5.3.3 Learning guarantees    102
  5.4 Negative definite symmetric kernels    103
  5.5 Sequence kernels    106
    5.5.1 Weighted transducers    106
    5.5.2 Rational kernels    111
  5.6 Chapter notes    115
  5.7 Exercises    116
|
|
6 Boosting    121
  6.1 Introduction    121
  6.2 AdaBoost    122
    6.2.1 Bound on the empirical error    124
    6.2.2 Relationship with coordinate descent    126
    6.2.3 Relationship with logistic regression    129
    6.2.4 Standard use in practice    129
  6.3 Theoretical results    130
    6.3.1 VC-dimension-based analysis    131
    6.3.2 Margin-based analysis    131
    6.3.3 Margin maximization    136
    6.3.4 Game-theoretic interpretation    137
  6.4 Discussion    140
  6.5 Chapter notes    141
  6.6 Exercises    142
|
|
7 On-Line Learning    147
  7.1 Introduction    147
  7.2 Prediction with expert advice    148
    7.2.1 Mistake bounds and Halving algorithm    148
    7.2.2 Weighted majority algorithm    150
    7.2.3 Randomized weighted majority algorithm    152
    7.2.4 Exponential weighted average algorithm    156
  7.3 Linear classification    159
    7.3.1 Perceptron algorithm    160
    7.3.2 Winnow algorithm    168
  7.4 On-line to batch conversion    171
  7.5 Game-theoretic connection    174
  7.6 Chapter notes    175
  7.7 Exercises    176
|
8 Multi-Class Classification    183
  8.1 Multi-class classification problem    183
  8.2 Generalization bounds    185
  8.3 Uncombined multi-class algorithms    191
    8.3.1 Multi-class SVMs    191
    8.3.2 Multi-class boosting algorithms    192
    8.3.3 Decision trees    194
  8.4 Aggregated multi-class algorithms    198
    8.4.1 One-versus-all    198
    8.4.2 One-versus-one    199
    8.4.3 Error-correction codes    201
  8.5 Structured prediction algorithms    203
  8.6 Chapter notes    206
  8.7 Exercises    207
|
|
9 Ranking    209
  9.1 The problem of ranking    209
  9.2 Generalization bound    211
  9.3 Ranking with SVMs    213
  9.4 RankBoost    214
    9.4.1 Bound on the empirical error    216
    9.4.2 Relationship with coordinate descent    218
    9.4.3 Margin bound for ensemble methods in ranking    220
  9.5 Bipartite ranking    221
    9.5.1 Boosting in bipartite ranking    222
    9.5.2 Area under the ROC curve    224
  9.6 Preference-based setting    226
    9.6.1 Second-stage ranking problem    227
    9.6.2 Deterministic algorithm    229
    9.6.3 Randomized algorithm    230
    9.6.4 Extension to other loss functions    231
  9.7 Other ranking criteria    232
  9.8 Chapter notes    233
  9.9 Exercises    234
|
|
10 Regression    237
  10.1 The problem of regression    237
  10.2 Generalization bounds    238
    10.2.1 Finite hypothesis sets    238
    10.2.2 Rademacher complexity bounds    239
    10.2.3 Pseudo-dimension bounds    241
  10.3 Regression algorithms    245
    10.3.1 Linear regression    245
    10.3.2 Kernel ridge regression    247
    10.3.3 Support vector regression    252
    10.3.4 Lasso    257
    10.3.5 Group norm regression algorithms    260
    10.3.6 On-line regression algorithms    261
  10.4 Chapter notes    262
  10.5 Exercises    263
|
|
11 Algorithmic Stability    267
  11.1 Definitions    267
  11.2 Stability-based generalization guarantee    268
  11.3 Stability of kernel-based regularization algorithms    270
    11.3.1 Application to regression algorithms: SVR and KRR    274
    11.3.2 Application to classification algorithms: SVMs    276
    11.3.3 Discussion    276
  11.4 Chapter notes    277
  11.5 Exercises    277
|
12 Dimensionality Reduction    281
  12.1 Principal Component Analysis    282
  12.2 Kernel Principal Component Analysis (KPCA)    283
  12.3 KPCA and manifold learning    285
    12.3.1 Isomap    285
    12.3.2 Laplacian eigenmaps    285
    12.3.3 Locally linear embedding (LLE)    287
  12.4 Johnson-Lindenstrauss lemma    288
  12.5 Chapter notes    290
  12.6 Exercises    290
|
13 Learning Automata and Languages    293
  13.1 Introduction    293
  13.2 Finite automata    294
  13.3 Efficient exact learning    295
    13.3.1 Passive learning    296
    13.3.2 Learning with queries    297
    13.3.3 Learning automata with queries    298
  13.4 Identification in the limit    303
    13.4.1 Learning reversible automata    304
  13.5 Chapter notes    309
  13.6 Exercises    310
|
14 Reinforcement Learning    313
  14.1 Learning scenario    313
  14.2 Markov decision process model    314
  14.3 Policy    315
    14.3.1 Definition    315
    14.3.2 Policy value    316
    14.3.3 Policy evaluation    316
    14.3.4 Optimal policies    318
  14.4 Planning algorithms    319
    14.4.1 Value iteration    319
    14.4.2 Policy iteration    322
    14.4.3 Linear programming    324
  14.5 Learning algorithms    325
    14.5.1 Stochastic approximation    326
    14.5.2 TD(0) algorithm    330
    14.5.3 Q-learning algorithm    331
    14.5.4 SARSA    334
    14.5.5 TD(λ) algorithm    335
    14.5.6 Large state space    336
  14.6 Chapter notes    337

Conclusion    339
|
|
A Linear Algebra Review    341
  A.1 Vectors and norms    341
    A.1.1 Norms    341
    A.1.2 Dual norms    342
  A.2 Matrices    344
    A.2.1 Matrix norms    344
    A.2.2 Singular value decomposition    345
    A.2.3 Symmetric positive semidefinite (SPSD) matrices    346
|
|
B Convex Optimization    349
  B.1 Differentiation and unconstrained optimization    349
  B.2 Convexity    350
  B.3 Constrained optimization    353
|
|
  B.4 Chapter notes    357
|
|
C Probability Review    359
  C.1 Probability    359
  C.2 Random variables    359
  C.3 Conditional probability and independence    361
  C.4 Expectation, Markov's inequality, and moment-generating function    363
  C.5 Variance and Chebyshev's inequality    365
|
D Concentration inequalities    369
  D.1 Hoeffding's inequality    369
  D.2 McDiarmid's inequality    371
  D.3 Other inequalities    373
    D.3.1 Binomial distribution: Slud's inequality    374
    D.3.2 Normal distribution: tail bound    374
    D.3.3 Khintchine-Kahane inequality    374
  D.4 Chapter notes    376
  D.5 Exercises    377
|
|
E Notation    379
References    381
Index    397