
Machine Learning from Weak Supervision: An Empirical Risk Minimization Approach [Hardcover]

Fundamental theory and practical algorithms of weakly supervised classification, emphasizing an approach based on empirical risk minimization.
Standard machine learning techniques require large amounts of labeled data to work well. When we apply machine learning to problems in the physical world, however, it is often prohibitively difficult to collect labeled data at such scale. In this book, Masashi Sugiyama, Han Bao, Takashi Ishida, Nan Lu, Tomoya Sakai, and Gang Niu present theory and algorithms for weakly supervised learning, a paradigm of machine learning from weakly labeled data. Emphasizing an approach based on empirical risk minimization and drawing on state-of-the-art research in weakly supervised learning, the book provides both the fundamentals of the field and the advanced mathematical theories underlying them. It can be used as a reference for practitioners and researchers and in the classroom.

The book first mathematically formulates classification problems, defines common notation, and reviews algorithms for supervised binary and multi-class classification. It then explores problems of binary weakly supervised classification, including positive-unlabeled (PU) classification, positive-negative-unlabeled (PNU) classification, and unlabeled-unlabeled (UU) classification. Turning to multi-class classification, it discusses complementary-label (CL) classification and partial-label (PL) classification. Finally, the book addresses more advanced issues, including a family of correction methods to improve the generalization performance of weakly supervised learning and the problem of class-prior estimation.
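To make the empirical-risk-minimization viewpoint concrete, here is a minimal sketch of the risk rewriting behind PU classification (Chapter 4) and its non-negative correction (Chapter 11), in notation assumed for this illustration: \(\pi_+\) is the positive class prior, \(\ell\) a surrogate loss, \(f\) the classifier, and \(p_+\) and \(p\) the positive and unlabeled marginal densities. Because \(p(x) = \pi_+ p_+(x) + (1 - \pi_+)\, p_-(x)\), the classification risk can be expressed without any negative data:

\[
R(f) = \pi_+\,\mathbb{E}_{x \sim p_+}\bigl[\ell(f(x), +1)\bigr]
     + \mathbb{E}_{x \sim p}\bigl[\ell(f(x), -1)\bigr]
     - \pi_+\,\mathbb{E}_{x \sim p_+}\bigl[\ell(f(x), -1)\bigr].
\]

Replacing the expectations with sample averages over the positive set \(P\) and the unlabeled set \(U\) yields an unbiased risk estimator computable from PU data alone. The non-negative correction clamps the implicit negative-class term at zero, counteracting the overfitting of unbiased objectives that the book examines in Chapter 11:

\[
\widetilde{R}(f) = \pi_+\,\widehat{\mathbb{E}}_{P}\bigl[\ell(f(x), +1)\bigr]
 + \max\Bigl\{0,\; \widehat{\mathbb{E}}_{U}\bigl[\ell(f(x), -1)\bigr] - \pi_+\,\widehat{\mathbb{E}}_{P}\bigl[\ell(f(x), -1)\bigr]\Bigr\}.
\]

This is a sketch of the general idea from the PU-learning literature, not the book's exact statement.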
Contents

Preface
I MACHINE LEARNING FROM WEAK SUPERVISION
1 Introduction
1.1 Machine Learning
1.1.1 Supervised Learning
1.1.2 Unsupervised Learning
1.1.3 Reinforcement Learning
1.2 Elements of Classification
1.2.1 Classifiers
1.2.2 Learning Criteria
1.2.3 Optimization Algorithms
1.3 Aspects of Machine Learning
1.3.1 Logical Learning, Biologically Inspired Learning, and Statistical Learning
1.3.2 Frequentist Learning and Bayesian Learning
1.3.3 Generative Classification and Discriminative Classification
1.3.4 Induction, Deduction, and Transduction
1.4 Improving Data Collection and Weakly Supervised Learning
1.5 Organization of This Book
1.5.1 Weakly Supervised Learning for Binary Classification
1.5.2 Weakly Supervised Learning for Multi-Class Classification
1.5.3 Advanced Topics and Perspectives
2 Formulation and Notation
2.1 Binary Classification
2.1.1 Formulation
2.1.2 Classification Models
2.1.2.1 Linear-in-input model
2.1.2.2 Linear-in-parameter model
2.1.2.3 Kernel model
2.1.2.4 Neural network model
2.1.3 Surrogate Losses
2.1.4 Training Samples
2.1.5 Regularization
2.2 Multi-Class Classification
2.2.1 Formulation
2.2.2 Surrogate Losses
2.2.3 Training Samples
3 Supervised Classification
3.1 Positive-Negative (PN) Classification
3.1.1 Formulation
3.1.1.1 One-sample case
3.1.1.2 Two-sample case
3.1.1.3 Comparison
3.1.2 Theoretical Analysis
3.1.2.1 Targets of convergence
3.1.2.2 Measures of convergence
3.1.2.3 Rademacher complexity
3.1.2.4 Rademacher complexity bounds
3.1.2.5 Estimation error bounds
3.2 Multi-Class Classification
3.2.1 Formulation
3.2.2 Theoretical Analysis
3.2.2.1 Estimation error bounds
3.2.2.2 Classification calibration
II WEAKLY SUPERVISED LEARNING FOR BINARY CLASSIFICATION
4 Positive-Unlabeled (PU) Classification
4.1 Introduction
4.2 Formulation
4.3 Unbiased Risk Estimation from PU Data
4.3.1 General Approach
4.3.2 Cost-Sensitive Approach
4.3.3 Convex Approach
4.4 Theoretical Analysis
4.4.1 PU Classification
4.4.2 NU Classification
4.4.3 Comparisons with PN Classification
4.4.3.1 Finite-sample comparisons
4.4.3.2 Asymptotic comparisons
5 Positive-Negative-Unlabeled (PNU) Classification
5.1 Introduction
5.2 Formulation
5.3 Manifold-Based Semi-Supervised Classification
5.3.1 Laplacian Regularization
5.3.2 Implementation
5.4 Information-Theoretic Semi-Supervised Classification
5.4.1 Squared-Loss Mutual Information Regularization
5.4.2 Implementation
5.5 PU+PN Classification
5.5.1 PNU and PU+NU Risk Estimators
5.5.2 PNU vs. PU+NU Classification
5.5.3 Theoretical Analysis
5.5.3.1 Estimation error bounds
5.5.3.2 Variance reduction
5.6 Experiments
5.6.1 Datasets
5.6.2 PNU Risk for Validation
5.6.3 Comparison with Other Methods
5.7 Extensions
5.7.1 Multi-Class Extension
5.7.2 AUC Maximization
5.7.3 Matrix Imputation
6 Positive-Confidence (Pconf) Classification
6.1 Introduction
6.2 Related Works
6.3 Problem Formulation
6.4 Empirical Risk Minimization (ERM) Framework
6.5 Theoretical Analysis
6.6 Implementation
6.7 Experiments
6.7.1 Synthetic Experiments with Linear Models
6.7.2 Benchmark Experiments with Neural Network Models
7 Pairwise-Constraint Classification
7.1 Introduction
7.2 Formulation
7.2.1 One-Sample Case
7.2.2 Two-Sample Case
7.2.3 Comparison of Sampling Schemes
7.2.4 Pairwise Constraints as Pointwise Data
7.3 Similar-Unlabeled (SU) Classification
7.3.1 Classification Risk Estimation
7.3.2 Minimum-Variance Risk Estimation
7.3.3 Convex Formulation
7.3.4 Class-Priors in SU Classification
7.4 Similar-Dissimilar (SD) and Dissimilar-Unlabeled (DU) Classification
7.4.1 Classification Risk Estimation
7.4.2 Interpretation of SD Risk
7.5 Similar-Dissimilar-Unlabeled (SDU) Classification
7.6 Theoretical Analysis
7.6.1 Derivation of Estimation Error Bounds
7.6.2 Comparison of Estimation Error Bounds
7.7 Experiments
7.7.1 Setup
7.7.2 Illustration of SU Classification
7.7.3 Comparison of SU Classification with Other Methods
7.7.4 Comparison of SDU Classification with Other Methods
7.8 Ongoing Research
8 Unlabeled-Unlabeled (UU) Classification
8.1 Introduction
8.2 Problem Formulation
8.2.1 Data Generation Process
8.2.2 Performance Measures
8.2.3 Relation to Classification with Noisy Labels
8.3 Risk Estimation from UU Data
8.3.1 Risk Estimation from One Set of U Data
8.3.2 Risk Estimation from Two Sets of U Data
8.3.2.1 Risk estimation
8.3.2.2 Simplification
8.3.2.3 Special cases
8.3.3 Theoretical Analysis
8.3.4 Experiments
8.3.4.1 Setup
8.3.4.2 Benchmark experiments with neural network models
8.3.4.3 Comparison with other methods
8.4 Generative Approach
8.4.1 Analysis of Bayes-Optimal Classifier
8.4.2 KDE-Based Algorithm
8.4.3 LSDD-Based Algorithm
8.4.4 DSDD-Based Algorithm
8.4.5 Experiments
III WEAKLY SUPERVISED LEARNING FOR MULTI-CLASS CLASSIFICATION
9 Complementary-Label Classification
9.1 Introduction
9.2 Risk Estimation from CL Data
9.2.1 Formulation
9.2.2 Risk Estimation
9.2.3 Case-Study for Symmetric Losses
9.2.4 Relation to Classification with Noisy Labels
9.3 Theoretical Analysis
9.4 Incorporation of Ordinary-Labels
9.5 Experiments
9.5.1 Experiments with CL
9.5.2 Experiments with CL and OL
9.6 Incorporation of Multi-Complementary-Labels
9.6.1 Formulation
9.6.2 Comparison with Multiple Single CLs
9.6.3 Unbiased Risk Estimator
9.6.4 Estimation Error Bound
10 Partial-Label Classification
10.1 Introduction
10.2 Formulation and Assumptions
10.2.1 Formulation
10.2.2 Data Generation Assumption
10.3 Risk Estimation
10.4 Experiments
10.5 Proper Partial-Label (PPL) Classification
10.5.1 Data Generation Assumption
10.5.2 Risk Estimation
10.5.3 Theoretical Analysis
IV ADVANCED TOPICS AND PERSPECTIVES
11 Non-Negative Correction for Weakly Supervised Classification
11.1 Introduction
11.2 Overfitting of Unbiased Learning Objectives
11.2.1 Binary Classification
11.2.2 Multi-Class Classification
11.3 Numerical Illustration
11.4 Non-Negative Correction
11.4.1 NNPU Classification
11.4.2 NNPNU Classification
11.4.3 NNUU Classification
11.4.4 NNCL Classification
11.4.5 CCUU Classification
11.5 Theoretical Analyses
11.5.1 Bias and Consistency
11.5.2 Estimation Error
11.6 Experiments
11.6.1 Comparison of PN, uPU, and nnPU Classification
11.6.2 Comparison of uCL and nnCL Classification
11.6.3 Comparison of uUU and ccUU Classification
12 Class-Prior Estimation
12.1 Introduction
12.2 Full Distribution Matching
12.3 Mixture Proportion Estimation
12.3.1 Estimation Goal and Optimization Goal
12.3.2 Redefinition of Optimization Goal
12.3.3 Irreducibility Assumption
12.3.4 Anchor Set/Point Assumption
12.3.5 Remarks
12.4 Partial Distribution Matching
12.4.1 Formulation
12.4.2 Differentiable Divergences
12.4.3 Non-Differentiable Divergences
12.4.4 Empirical f-Divergence Estimation
12.5 Penalized L1-Distance Minimization
12.5.1 Penalized L1-Distance
12.5.2 Practical Implementation
12.5.3 Theoretical Analysis
12.5.3.1 Realizability assumption
12.5.3.2 Summary of main results
12.5.3.3 Proofs of main results
12.5.3.4 On the convergence rate of np
12.6 Class-Prior Estimation with Regrouping
12.6.1 Motivation
12.6.2 Practical Implementation
12.6.3 Theoretical Justification
12.6.3.1 A formal definition of regrouping
12.6.3.2 Bias reduction
12.6.3.3 Convergence analysis
12.6.3.4 Computationally efficient identification of A*
12.6.3.5 Approximation of p'_P with a surrogate
12.7 Class-Prior Estimation from Pairwise Data
13 Conclusions and Prospects
Notes
Bibliography
Index