List of Contributors | xiii
About the Editors | xix

Multimodal behavior analysis in the wild: An introduction | 1 | (8)
0.1 Analyzing human behavior in the wild from multimodal data | 1 | (2)
3 | (3)
0.3 Summary of important points | 6 | (1)
7 | (2)

Chapter 1 Multimodal open-domain conversations with robotic platforms | 9 | (18)
9 | (5)
1.1.1 Constructive Dialog Model | 11 | (3)
14 | (4)
1.2.1 Topic shifts and topic trees | 14 | (2)
1.2.2 Dialogs using Wikipedia | 16 | (2)
18 | (3)
1.3.1 Multimodal WikiTalk for robots | 19 | (1)
1.3.2 Multimodal topic modeling | 20 | (1)
21 | (2)
1.4.1 Dialogs using domain ontologies | 21 | (1)
1.4.2 IoT and an integrated robot architecture | 22 | (1)
23 | (1)
24 | (3)

Chapter 2 Audio-motor integration for robot audition | 27 | (26)
27 | (2)
2.2 Audio-motor integration in psychophysics and robotics | 29 | (3)
2.3 Single-microphone sound localization using head movements | 32 | (5)
2.3.1 HRTF model and dynamic cues | 32 | (2)
2.3.2 Learning-based sound localization | 34 | (2)
36 | (1)
2.4 Ego-noise reduction using proprioceptors | 37 | (8)
2.4.1 Ego-noise: challenges and opportunities | 37 | (1)
2.4.2 Proprioceptor-guided dictionary learning | 37 | (2)
2.4.3 Phase-optimized dictionary learning | 39 | (2)
2.4.4 Audio-motor integration via support vector machines | 41 | (3)
44 | (1)
2.5 Conclusion and perspectives | 45 | (1)
46 | (7)

Chapter 3 Audio source separation into the wild | 53 | (26)
53 | (1)
3.2 Multichannel audio source separation | 54 | (4)
3.3 Making MASS go from labs into the wild | 58 | (10)
3.3.1 Moving sources and sensors | 58 | (3)
3.3.2 Varying number of (active) sources | 61 | (2)
3.3.3 Spatially diffuse sources and long mixing filters | 63 | (4)
3.3.4 Ad hoc microphone arrays | 67 | (1)
3.4 Conclusions and perspectives | 68 | (2)
70 | (9)

Chapter 4 Designing audio-visual tools to support multisensory disabilities | 79 | (24)
79 | (3)
82 | (3)
85 | (7)
4.4 Visual recognition module | 88 | (1)
4.4.1 Object-instance recognition | 88 | (1)
4.4.2 Experimental assessment | 89 | (3)
4.5 Complementary hearing aid module | 92 | (3)
4.5.1 Measurement of Glassense beam pattern | 92 | (1)
4.5.2 Analysis of measured beam pattern | 93 | (2)
4.6 Assessing usability with impaired users | 95 | (3)
4.6.1 Glassense field tests with visually impaired | 96 | (1)
4.6.2 Glassense field tests with binaural hearing loss | 96 | (2)
98 | (1)
99 | (4)

Chapter 5 Audio-visual learning for body-worn cameras | 103 | (18)
103 | (2)
5.2 Multi-modal classification | 105 | (2)
5.3 Cross-modal adaptation | 107 | (3)
5.4 Audio-visual reidentification | 110 | (1)
5.5 Reidentification dataset | 111 | (1)
5.6 Reidentification results | 112 | (4)
116 | (1)
116 | (5)

Chapter 6 Activity recognition from visual lifelogs: State of the art and future challenges | 121 | (14)
121 | (2)
6.2 Activity recognition from egocentric images | 123 | (2)
6.3 Activity recognition from egocentric photo-streams | 125 | (2)
127 | (4)
127 | (1)
128 | (1)
6.4.3 Results and discussion | 129 | (2)
131 | (1)
132 | (3)

Chapter 7 Lifelog retrieval for memory stimulation of people with memory impairment | 135 | (24)
135 | (3)
138 | (3)
7.3 Retrieval based on key-frame semantic selection | 141 | (8)
7.3.1 Summarization of autobiographical episodes | 143 | (1)
7.3.2 Semantic key-frame selection | 144 | (2)
7.3.3 Egocentric image retrieval based on CNNs and inverted index search | 146 | (3)
149 | (5)
149 | (1)
150 | (1)
7.4.3 Evaluation measures | 151 | (1)
151 | (2)
153 | (1)
154 | (1)
154 | (1)
155 | (4)

Chapter 8 Integrating signals for reasoning about visitors' behavior in cultural heritage | 159 | (12)
159 | (2)
8.2 Using technology for reasoning about visitors' behavior | 161 | (5)
166 | (1)
167 | (1)
168 | (3)

Chapter 9 Wearable systems for improving tourist experience | 171 | (28)
171 | (1)
172 | (4)
9.3 Behavior analysis for smart guides | 176 | (1)
176 | (10)
186 | (8)
194 | (1)
195 | (4)

Chapter 10 Recognizing social relationships from an egocentric vision perspective | 199 | (26)
199 | (3)
202 | (2)
10.2.1 Head pose estimation | 202 | (1)
10.2.2 Social interactions | 203 | (1)
10.3 Understanding people interactions | 204 | (6)
10.3.1 Face detection and tracking | 205 | (1)
10.3.2 Head pose estimation | 205 | (3)
10.3.3 3D people localization | 208 | (2)
10.4 Social group detection | 210 | (2)
10.4.1 Correlation clustering via structural SVM | 210 | (2)
10.5 Social relevance estimation | 212 | (1)
10.6 Experimental results | 213 | (9)
10.6.1 Head pose estimation | 214 | (1)
10.6.2 Distance estimation | 215 | (1)
216 | (4)
220 | (2)
222 | (1)
223 | (2)

Chapter 11 Complex conversational scene analysis using wearable sensors | 225 | (22)
225 | (2)
11.2 Defining 'in the wild' and ecological validity | 227 | (1)
11.3 Ecological validity vs. experimental control | 228 | (1)
11.4 Ecological validity vs. robust automated perception | 229 | (1)
11.5 Thin vs. thick slices of analysis | 230 | (1)
11.6 Collecting data of social behavior | 230 | (4)
11.6.1 Practical concerns when collecting data during social events | 231 | (3)
11.7 Analyzing social actions with a single body worn accelerometer | 234 | (7)
11.7.1 Feature extraction and classification | 235 | (1)
11.7.2 Performance vs. sample size | 236 | (2)
11.7.3 Transductive parameter transfer (TPT) for personalized models | 238 | (3)
241 | (1)
241 | (1)
242 | (5)

Chapter 12 Detecting conversational groups in images using clustering games | 247 | (22)
247 | (3)
250 | (1)
251 | (4)
12.3.1 Notations and definitions | 251 | (2)
253 | (2)
12.4 Conversational groups as equilibria of clustering games | 255 | (3)
12.4.1 Frustum of attention | 255 | (2)
12.4.2 Quantifying pairwise interactions | 257 | (1)
258 | (1)
12.5 Finding ESS-clusters using game dynamics | 258 | (3)
12.6 Experiments and results | 261 | (4)
261 | (1)
12.6.2 Evaluation metrics and parameter exploration | 262 | (1)
263 | (2)
265 | (1)
265 | (4)

Chapter 13 We are less free than how we think: Regular patterns in nonverbal communication | 269 | (20)
269 | (2)
13.2 On spotting cues: how many and when | 271 | (5)
272 | (1)
273 | (2)
275 | (1)
13.3 On following turns: who talks with whom | 276 | (3)
277 | (1)
278 | (1)
278 | (1)
13.4 On speech dancing: who imitates whom | 279 | (7)
279 | (3)
282 | (4)
286 | (1)
287 | (2)

Chapter 14 Crowd behavior analysis from fixed and moving cameras | 289 | (34)
289 | (4)
14.2 Microscopic and macroscopic crowd modeling | 293 | (2)
14.3 Motion information for crowd representation from fixed cameras | 295 | (3)
14.3.1 Pre-processing and selection of areas of interest | 295 | (1)
14.3.2 Motion-based crowd behavior analysis | 296 | (2)
14.4 Crowd behavior and density analysis | 298 | (3)
14.4.1 Person detection and tracking in crowded scenes | 299 | (1)
14.4.2 Low level features for crowd density estimation | 300 | (1)
14.5 CNN-based crowd analysis methods for surveillance and anomaly detection | 301 | (6)
14.6 Crowd analysis using moving sensors | 307 | (3)
14.7 Metrics and datasets | 310 | (4)
14.7.1 Metrics for performance evaluation | 310 | (2)
14.7.2 Datasets for crowd behavior analysis | 312 | (2)
314 | (1)
315 | (8)

Chapter 15 Towards multi-modality invariance: A study in visual representation | 323 | (26)
15.1 Introduction and related work | 323 | (3)
15.2 Variances in visual representation | 326 | (2)
15.3 Reversal invariance in BoVW | 328 | (9)
15.3.1 Reversal symmetry and Max-SIFT | 329 | (1)
15.3.2 RIDE: generalized reversal invariance | 330 | (2)
15.3.3 Application to image classification | 332 | (1)
15.4 Reversal invariance in CNN | 337 | (7)
15.4.1 Reversal-invariant convolution (RI-Conv) | 337 | (1)
15.4.2 Relationship to data augmentation | 338 | (2)
340 | (1)
15.4.4 ILSVRC2012 classification experiments | 341 | (2)
343 | (1)
344 | (1)
344 | (5)

Chapter 16 Sentiment concept embedding for visual affect recognition | 349 | (20)
349 | (3)
16.1.1 Embeddings for image classification | 350 | (1)
16.1.2 Affective computing | 351 | (1)
16.2 Visual sentiment ontology | 352 | (1)
16.3 Building output embeddings for ANPs | 353 | (4)
16.3.1 Combining adjectives and nouns | 354 | (2)
16.3.2 Loss functions for the embeddings | 356 | (1)
16.4 Experimental results | 357 | (5)
16.4.1 Adjective noun pair detection | 358 | (3)
16.4.2 Zero-shot concept detection | 361 | (1)
16.5 Visual affect recognition | 362 | (3)
16.5.1 Visual emotion prediction | 363 | (1)
16.5.2 Visual sentiment prediction | 364 | (1)
16.6 Conclusions and future work | 365 | (1)
366 | (3)

Chapter 17 Video-based emotion recognition in the wild | 369 | (18)
369 | (1)
370 | (4)
374 | (2)
17.4 Experimental results | 376 | (3)
376 | (2)
17.4.2 ChaLearn Challenges | 378 | (1)
17.5 Conclusions and discussion | 379 | (3)
382 | (1)
382 | (5)

Chapter 18 Real-world automatic continuous affect recognition from audiovisual signals | 387 | (20)
387 | (2)
18.2 Real world vs laboratory settings | 389 | (1)
18.3 Audio and video affect cues and theories of emotion | 389 | (3)
389 | (1)
390 | (1)
18.3.3 Quantifying affect | 391 | (1)
392 | (5)
18.4.1 Multimodal fusion techniques | 392 | (1)
393 | (1)
394 | (2)
18.4.4 Affect recognition competitions | 396 | (1)
18.5 Audiovisual affect recognition: a representative end-to-end learning system | 397 | (5)
398 | (2)
18.5.2 Experiments & results | 400 | (2)
402 | (1)
403 | (4)

Chapter 19 Affective facial computing: Generalizability across domains | 407 | (36)
407 | (2)
409 | (1)
19.3 Approaches to annotation | 410 | (1)
19.4 Reliability and performance | 411 | (2)
19.5 Factors influencing performance | 413 | (2)
19.6 Systematic review of studies of cross-domain generalizability | 415 | (14)
416 | (1)
416 | (3)
19.6.3 Cross-domain generalizability | 419 | (8)
19.6.4 Studies using deep- vs. shallow learning | 427 | (1)
428 | (1)
429 | (4)
433 | (1)
434 | (1)
434 | (9)

Chapter 20 Automatic recognition of self-reported and perceived emotions | 443 | (28)
443 | (2)
20.2 Emotion production and perception | 445 | (3)
20.2.1 Descriptions of emotion | 445 | (1)
20.2.2 Brunswik's functional lens model | 446 | (2)
448 | (1)
20.3 Observations from perception experiments | 448 | (2)
20.4 Collection and annotation of labeled emotion data | 450 | (3)
20.4.1 Emotion-elicitation methods | 450 | (2)
20.4.2 Data annotation tools | 452 | (1)
453 | (4)
453 | (2)
20.5.2 Audio, visual, physiological, and multi-modal datasets | 455 | (2)
20.6 Recognition of self-reported and perceived emotion | 457 | (3)
20.7 Challenges and prospects | 460 | (3)
20.8 Concluding remarks | 463 | (1)
463 | (8)

Index | 471