This book provides an overview of crowdsourced data management, covering the complete workflow, the underlying algorithms, and open research problems, with a particular focus on the latest techniques and recent advances. The authors identify three key factors that determine the performance of crowdsourced data management: quality control, cost control, and latency control. By surveying and synthesizing a wide spectrum of studies on crowdsourced data management, the book outlines the important factors that must be considered to improve crowdsourced data management. It also introduces the design of a practical crowdsourced database system and presents a number of crowdsourced operators. Self-contained and covering theory, algorithms, techniques, and applications, it is a valuable reference for researchers and students who are new to crowdsourced data management and have a basic knowledge of data structures and databases.
Contents

1 Introduction | 1 (1)
… | 1 (1)
1.2 Crowdsourcing Overview | 2 (2)
1.3 Crowdsourced Data Management | 4 (4)
… | 8 (3)
2 Crowdsourcing Background | 11 (1)
2.1 Crowdsourcing Overview | 11 (1)
2.2 Crowdsourcing Workflow | 12 (4)
2.2.1 Workflow from Requester Side | 12 (3)
2.2.2 Workflow from Worker Side | 15 (1)
2.2.3 Workflow from Platform Side | 16 (1)
2.3 Crowdsourcing Platforms | 16 (2)
2.3.1 Amazon Mechanical Turk (AMT) | 16 (1)
… | 17 (1)
… | 17 (1)
2.4 Existing Surveys, Tutorials, and Books | 18 (1)
2.5 Optimization Goal of Crowdsourced Data Management | 18 (3)
References | 19 (2)
3 Quality Control | 21 (24)
3.1 Overview of Quality Control | 21 (2)
3.2 Truth Inference | 23 (13)
3.2.1 Truth Inference Problem | 23 (2)
3.2.2 Unified Solution Framework | 25 (3)
3.2.3 Comparisons of Existing Works | 28 (7)
3.2.4 Extensions of Truth Inference | 35 (1)
3.3 Task Assignment | 36 (6)
3.3.1 Task Assignment Setting | 36 (4)
3.3.2 Worker Selection Setting | 40 (2)
3.4 Summary of Quality Control | 42 (3)
References | 42 (3)
4 Cost Control | 45 (18)
4.1 Overview of Cost Control | 45 (1)
… | 46 (3)
4.2.1 Difficulty Measurement | 47 (1)
4.2.2 Threshold Selection | 48 (1)
… | 49 (1)
… | 49 (2)
… | 49 (1)
… | 50 (1)
… | 51 (1)
… | 51 (3)
… | 52 (1)
… | 53 (1)
… | 54 (1)
… | 54 (3)
4.5.1 Crowdsourced Aggregation | 54 (1)
… | 55 (2)
… | 57 (1)
… | 57 (3)
4.6.1 User Interface Design | 58 (1)
4.6.2 Non-monetary Incentives | 59 (1)
… | 60 (1)
4.7 Summary of Cost Control | 60 (3)
References | 61 (2)
5 Latency Control | 63 (8)
5.1 Overview of Latency Control | 63 (1)
5.2 Single-Task Latency Control | 64 (2)
… | 64 (1)
5.2.2 Qualification Test Time | 65 (1)
… | 65 (1)
5.3 Single-Batch Latency Control | 66 (2)
… | 66 (1)
5.3.2 Straggler Mitigation | 66 (2)
5.4 Multi-batch Latency Control | 68 (1)
5.4.1 Motivation of Multiple Batches | 68 (1)
… | 68 (1)
5.5 Summary of Latency Control | 69 (2)
References | 70 (1)
6 Crowdsourcing Database Systems and Optimization | 71 (26)
6.1 Overview of Crowdsourcing Database Systems | 71 (4)
6.2 Crowdsourcing Query Language | 75 (7)
… | 75 (1)
… | 76 (1)
… | 77 (1)
… | 78 (2)
… | 80 (2)
6.3 Crowdsourcing Query Optimization | 82 (11)
… | 82 (2)
… | 84 (1)
… | 85 (2)
… | 87 (4)
… | 91 (2)
6.4 Summary of Crowdsourcing Database Systems | 93 (4)
References | 94 (3)
7 Crowdsourced Operators | 97
7.1 Crowdsourced Selection | 97 (4)
7.1.1 Crowdsourced Filtering | 98 (1)
… | 99 (2)
7.1.3 Crowdsourced Search | 101 (1)
7.2 Crowdsourced Collection | 101 (3)
7.2.1 Crowdsourced Enumeration | 101 (3)
… | 104 (1)
7.3 Crowdsourced Join (Crowdsourced Entity Resolution) | 104 (9)
… | 104 (1)
7.3.2 Candidate Set Generation | 105 (1)
7.3.3 Candidate Set Verification | 106 (2)
7.3.4 Human Interface for Join | 108 (1)
… | 109 (4)
7.4 Crowdsourced Sort, Top-k, and Max/Min | 113 (8)
… | 113 (1)
7.4.2 Pairwise Comparisons | 113 (1)
… | 114 (5)
… | 119 (1)
… | 120 (1)
7.5 Crowdsourced Aggregation | 121 (2)
… | 121 (1)
7.5.2 Crowdsourced Median | 122 (1)
7.5.3 Crowdsourced Group By | 123 (1)
7.6 Crowdsourced Categorization | 123 (1)
7.7 Crowdsourced Skyline | 124 (2)
7.7.1 Crowdsourced Skyline on Incomplete Data | 125 (1)
7.7.2 Crowdsourced Skyline with Comparisons | 126 (1)
7.8 Crowdsourced Planning | 126 (6)
7.8.1 General Crowdsourced Planning Query | 127 (2)
7.8.2 An Application: Route Planning | 129 (3)
7.9 Crowdsourced Schema Matching | 132
About the Authors

Guoliang Li is an associate professor in the Department of Computer Science, Tsinghua University, Beijing, China. His research interests include crowdsourced data management, big spatio-temporal data analytics, and large-scale data cleaning and integration. He has published more than 100 papers in leading conferences and journals such as SIGMOD, VLDB, ICDE, SIGKDD, SIGIR, TODS, the VLDB Journal, and TKDE. He was a PC co-chair of WAIM 2014, WebDB 2014, and NDBC 2016, and serves as an associate editor for IEEE Transactions on Knowledge and Data Engineering, the VLDB Journal, Big Data Research, and the IEEE Data Engineering Bulletin. He regularly serves as a PC member of conferences such as SIGMOD, VLDB, KDD, ICDE, WWW, IJCAI, and AAAI. His papers have been cited more than 4500 times. He received the VLDB 2017 Early Research Contribution Award, the IEEE TCDE Early Career Award (2014), the NSFC Excellent Young Scholars Award (2014), and the CCF Young Scientist Award (2014), and was selected for the National Youth Talent Support Program (2016) and as a Young Changjiang Scholar (2016).
Prof. Michael J. Franklin is the inaugural holder of the Liew Family Chair of Computer Science at the University of Chicago. An authority on databases, data analytics, data management, and distributed systems, he also serves as senior advisor to the provost on computation and data science. Most recently he was the Thomas M. Siebel Professor of Computer Science and chair of the Computer Science Division of the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley, where he currently is an adjunct professor. He co-founded and directed Berkeley's Algorithms, Machines and People Laboratory (AMPLab), a leading academic big data analytics research center. The AMPLab won a National Science Foundation CISE "Expeditions in Computing" award, which was announced as part of the White House Big Data Research initiative in March 2012, and has received support from over 30 industrial sponsors. AMPLab has created industry-changing open-source big data software, including Apache Spark and BDAS, the Berkeley Data Analytics Stack. At Berkeley, Professor Franklin also served as an executive committee member of the Berkeley Institute for Data Science, a campus-wide initiative to advance data science environments. He is a fellow of the Association for Computing Machinery and a two-time recipient of the ACM SIGMOD Test of Time Award.