Preface xiii
I MACHINE LEARNING FROM WEAK SUPERVISION

1 Introduction 3
  1.1 … 3
    1.1.1 Supervised Learning 3
    1.1.2 Unsupervised Learning 5
    1.1.3 Reinforcement Learning 6
  1.2 Elements of Classification 7
    1.2.1 … 7
    1.2.2 … 8
    1.2.3 Optimization Algorithms 8
  1.3 Aspects of Machine Learning 9
    1.3.1 Logical Learning, Biologically Inspired Learning, and Statistical Learning 9
    1.3.2 Frequentist Learning and Bayesian Learning 11
    1.3.3 Generative Classification and Discriminative Classification 12
    1.3.4 Induction, Deduction, and Transduction 13
  1.4 Improving Data Collection and Weakly Supervised Learning 13
  1.5 Organization of This Book 15
    1.5.1 Weakly Supervised Learning for Binary Classification 15
    1.5.2 Weakly Supervised Learning for Multi-Class Classification 18
    1.5.3 Advanced Topics and Perspectives 18
2 Formulation and Notation 21
  2.1 Binary Classification 21
    2.1.1 … 21
    2.1.2 Classification Models 22
      2.1.2.1 Linear-in-input model 22
      2.1.2.2 Linear-in-parameter model 22
      2.1.2.3 Kernel model 23
      2.1.2.4 Neural network model 25
    2.1.3 … 26
    2.1.4 … 28
    2.1.5 … 30
  2.2 Multi-Class Classification 31
    2.2.1 … 31
    2.2.2 … 32
    2.2.3 … 33
3 Supervised Classification 35
  3.1 Positive-Negative (PN) Classification 35
    3.1.1 … 35
      3.1.1.1 … 35
      3.1.1.2 … 36
      3.1.1.3 … 37
    3.1.2 Theoretical Analysis 38
      3.1.2.1 Targets of convergence 38
      3.1.2.2 Measures of convergence 40
      3.1.2.3 Rademacher complexity 43
      3.1.2.4 Rademacher complexity bounds 46
      3.1.2.5 Estimation error bounds 51
  3.2 Multi-Class Classification 56
    3.2.1 … 56
    3.2.2 Theoretical Analysis 58
      3.2.2.1 Estimation error bounds 58
      3.2.2.2 Classification calibration 61
II WEAKLY SUPERVISED LEARNING FOR BINARY CLASSIFICATION

4 Positive-Unlabeled (PU) Classification 67
  4.1 … 67
  4.2 … 68
  4.3 Unbiased Risk Estimation from PU Data 69
    4.3.1 … 69
    4.3.2 Cost-Sensitive Approach 71
    4.3.3 … 72
  4.4 … 75
    4.4.1 … 75
    4.4.2 … 77
    4.4.3 Comparisons with PN Classification 79
      4.4.3.1 Finite-sample comparisons 79
      4.4.3.2 Asymptotic comparisons 82
5 Positive-Negative-Unlabeled (PNU) Classification 85
  5.1 … 85
  5.2 … 86
  5.3 Manifold-Based Semi-Supervised Classification 87
    5.3.1 Laplacian Regularization 87
    5.3.2 … 89
  5.4 Information-Theoretic Semi-Supervised Classification 90
    5.4.1 Squared-Loss Mutual Information Regularization 90
    5.4.2 … 93
  5.5 … 94
    5.5.1 PNU and PU+NU Risk Estimators 94
    5.5.2 PNU vs. PU+NU Classification 95
    5.5.3 Theoretical Analysis 96
      5.5.3.1 Estimation error bounds 97
      5.5.3.2 Variance reduction 99
  5.6 … 103
    5.6.1 … 103
    5.6.2 PNU Risk for Validation 103
    5.6.3 Comparison with Other Methods 104
  5.7 … 105
    5.7.1 Multi-Class Extension 105
    5.7.2 … 107
    5.7.3 … 107
6 Positive-Confidence (Pconf) Classification 111
  6.1 … 111
  6.2 … 112
  6.3 … 113
  6.4 Empirical Risk Minimization (ERM) Framework 114
  6.5 … 116
  6.6 … 119
  6.7 … 119
    6.7.1 Synthetic Experiments with Linear Models 119
    6.7.2 Benchmark Experiments with Neural Network Models 122
7 Pairwise-Constraint Classification 127
  7.1 … 127
  7.2 … 128
    7.2.1 … 129
    7.2.2 … 130
    7.2.3 Comparison of Sampling Schemes 130
    7.2.4 Pairwise Constraints as Pointwise Data 131
  7.3 Similar-Unlabeled (SU) Classification 131
    7.3.1 Classification Risk Estimation 132
    7.3.2 Minimum-Variance Risk Estimation 132
    7.3.3 … 134
    7.3.4 Class-Priors in SU Classification 135
  7.4 Similar-Dissimilar (SD) and Dissimilar-Unlabeled (DU) Classification 136
    7.4.1 Classification Risk Estimation 136
    7.4.2 Interpretation of SD Risk 137
  7.5 Similar-Dissimilar-Unlabeled (SDU) Classification 140
  7.6 … 140
    7.6.1 Derivation of Estimation Error Bounds 140
    7.6.2 Comparison of Estimation Error Bounds 141
  7.7 … 143
    7.7.1 … 143
    7.7.2 Illustration of SU Classification 143
    7.7.3 Comparison of SU Classification with Other Methods 145
    7.7.4 Comparison of SDU Classification with Other Methods 146
  7.8 … 148
8 Unlabeled-Unlabeled (UU) Classification 149
  8.1 … 149
  8.2 … 150
    8.2.1 Data Generation Process 151
    8.2.2 Performance Measures 151
    8.2.3 Relation to Classification with Noisy Labels 152
  8.3 Risk Estimation from UU Data 152
    8.3.1 Risk Estimation from One Set of U Data 153
    8.3.2 Risk Estimation from Two Sets of U Data 155
      8.3.2.1 … 155
      8.3.2.2 … 157
      8.3.2.3 … 157
    8.3.3 Theoretical Analysis 158
    8.3.4 … 159
      8.3.4.1 … 159
      8.3.4.2 Benchmark experiments with neural network models 160
      8.3.4.3 Comparison with other methods 163
  8.4 … 164
    8.4.1 Analysis of Bayes-Optimal Classifier 164
    8.4.2 KDE-Based Algorithm 165
    8.4.3 LSDD-Based Algorithm 166
    8.4.4 DSDD-Based Algorithm 168
  8.5 … 171
III WEAKLY SUPERVISED LEARNING FOR MULTI-CLASS CLASSIFICATION

9 Complementary-Label Classification 177
  9.1 … 177
  9.2 Risk Estimation from CL Data 178
    9.2.1 … 178
    9.2.2 … 179
    9.2.3 Case-Study for Symmetric Losses 180
    9.2.4 Relation to Classification with Noisy Labels 181
  9.3 … 182
  9.4 Incorporation of Ordinary-Labels 185
  9.5 … 185
    9.5.1 Experiments with CL 186
    9.5.2 Experiments with CL and OL 186
  9.6 Incorporation of Multi-Complementary-Labels 187
    9.6.1 … 187
    9.6.2 Comparison with Multiple Single CLs 189
    9.6.3 Unbiased Risk Estimator 190
    9.6.4 Estimation Error Bound 191
10 Partial-Label Classification 193
  10.1 … 193
  10.2 Formulation and Assumptions 193
    10.2.1 … 194
    10.2.2 Data Generation Assumption 194
  10.3 … 195
  10.4 … 196
  10.5 Proper Partial-Label (PPL) Classification 198
    10.5.1 Data Generation Assumption 198
    10.5.2 … 200
    10.5.3 Theoretical Analysis 201
IV ADVANCED TOPICS AND PERSPECTIVES

11 Non-Negative Correction for Weakly Supervised Classification 207
  11.1 … 207
  11.2 Overfitting of Unbiased Learning Objectives 208
    11.2.1 Binary Classification 208
    11.2.2 Multi-Class Classification 210
  11.3 Numerical Illustration 211
  11.4 Non-Negative Correction 213
    11.4.1 NNPU Classification 213
    11.4.2 NNPNU Classification 215
    11.4.3 NNUU Classification 216
    11.4.4 NNCL Classification 217
    11.4.5 CCUU Classification 217
  11.5 Theoretical Analyses 218
    11.5.1 Bias and Consistency 219
    11.5.2 … 224
  11.6 … 229
    11.6.1 Comparison of PN, uPU, and nnPU Classification 229
    11.6.2 Comparison of uCL and nnCL Classification 233
    11.6.3 Comparison of uUU and ccUU Classification 235
12 Class-Prior Estimation 239
  12.1 … 239
  12.2 Full Distribution Matching 241
  12.3 Mixture Proportion Estimation 242
    12.3.1 Estimation Goal and Optimization Goal 243
    12.3.2 Redefinition of Optimization Goal 244
    12.3.3 Irreducibility Assumption 245
    12.3.4 Anchor Set/Point Assumption 246
    12.3.5 … 248
  12.4 Partial Distribution Matching 248
    12.4.1 … 248
    12.4.2 Differentiable Divergences 249
    12.4.3 Non-Differentiable Divergences 251
    12.4.4 Empirical f-Divergence Estimation 252
  12.5 Penalized L1-Distance Minimization 254
    12.5.1 Penalized L1-Distance 254
    12.5.2 Practical Implementation 256
    12.5.3 Theoretical Analysis 258
      12.5.3.1 Realizability assumption 258
      12.5.3.2 Summary of main results 259
      12.5.3.3 Proofs of main results 259
      12.5.3.4 On the convergence rate of np 261
  12.6 Class-Prior Estimation with Regrouping 262
    12.6.1 … 262
    12.6.2 Practical Implementation 263
    12.6.3 Theoretical Justification 265
      12.6.3.1 A formal definition of regrouping 265
      12.6.3.2 … 267
      12.6.3.3 Convergence analysis 268
      12.6.3.4 Computationally efficient identification of A* 270
      12.6.3.5 Approximation of p'P with a surrogate 271
  12.7 Class-Prior Estimation from Pairwise Data 273
13 Conclusions and Prospects 275
Notes 279
Bibliography 283
Index 293