1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Visual Question Answering in AI tasks . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Categorisation of VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Classified by Data Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Classified by Task Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.3 Others . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Book Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Part I Preliminaries
2 Deep Learning Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Recurrent Neural Networks and variants . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Encoder-Decoder Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6 Memory Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Transformer Networks and BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.8 Graph Neural Networks Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Question Answering (QA) Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 Rule-based methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Information retrieval-based methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Neural Semantic Parsing for QA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Knowledge Base for QA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Part II Image-based VQA
4 The Classical Visual Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3 Generation vs. Classification: Two answering policies . . . . . . . . . . 39
4.4 Joint Embedding Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.1 Sequence-to-Sequence Encoder-Decoder Models . . . . . . . . . . 40
4.4.2 Bilinear Encoding for VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 Awesome Attention Mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5.1 Stacked Attention Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5.2 Hierarchical Question-Image Co-attention . . . . . . . . . . . . . . . 47
4.5.3 Bottom-Up and Top-Down Attention . . . . . . . . . . . . . . . . . . . . 48
4.6 Memory Networks for VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.6.1 Improved Dynamic Memory Networks . . . . . . . . . . . . . . . . . . 50
4.6.2 Memory-Augmented Networks . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.7 Compositional Reasoning for VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.7.1 Neural Modular Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.7.2 Dynamic Neural Module Networks . . . . . . . . . . . . . . . . . . . . . 56
4.8 Graph Neural Networks for VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.8.1 Graph Convolutional Networks . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.8.2 Graph Attention Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.8.3 Graph Convolutional Networks for VQA . . . . . . . . . . . . . . . . . 62
4.8.4 Graph Attention Networks for VQA . . . . . . . . . . . . . . . . . . . . . 63
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 Knowledge-based VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Knowledge Bases introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.1 DBpedia . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.2 ConceptNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Knowledge Embedding Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.1 Word-to-vector representation . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4.2 Bert-based representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.5 Question-to-Query Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.5.1 Query-mapping based methods . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.5.2 Learning based methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.6 How to query knowledge bases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.6.1 RDF query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.6.2 Memory Network query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6 Vision-and-Language Pre-training for VQA . . . . . . . . . . . . . . . . . . . . . . . 87
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2 General Pre-training Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2.1 Embeddings from Language Models . . . . . . . . . . . . . . . . . . . . 88
6.2.2 Generative Pre-Training Model . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2.3 Bidirectional Encoder Representations from Transformers . . 89
6.3 Popular Vision-and-Language Pre-training Methods . . . . . . . . . . . . . 93
6.3.1 Single-Stream Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3.2 Two-Stream Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.4 Fine-tuning on VQA and Other Downstream Tasks . . . . . . . . . . . . . . 98
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Part III Video-based VQA
7 Video Representation Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.1 Hand-crafted local video descriptors . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.2 Data-driven deep learning features for video representation . . . . . . . . 108
7.3 Self-supervised learning for video representation . . . . . . . . . . . . . . . . 109
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
8 Video Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
8.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
8.2.1 Multi-step reasoning dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
8.2.2 Single-step reasoning dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.3 Traditional Video Spatio-Temporal Reasoning Using
Encoder-Decoder Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
9 Advanced Models for Video Question Answering . . . . . . . . . . . . . . . . . . 129
9.1 Attention on Spatio-Temporal Features . . . . . . . . . . . . . . . . . . . . . . . . . 129
9.2 Memory Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
9.3 Spatio-Temporal Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 134
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Part IV Advanced Topics in VQA
10 Embodied VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
10.2 Simulators, Datasets and Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . 142
10.2.1 Simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
10.2.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
10.2.3 Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
10.3 Language-guided Visual Navigation . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
10.3.1 Vision-and-Language Navigation . . . . . . . . . . . . . . . . . . . . . . . 150
10.3.2 Remote Object Localisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
10.4 Embodied QA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
10.5 Interactive QA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
11 Medical VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
11.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
11.3 Medical Image Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
11.3.1 UNet for medical image processing . . . . . . . . . . . . . . . . . . . . . 161
11.4 Answering Medical Related Questions: models and results . . . . . . . . 161
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
12 Text-based VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
12.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.2.1 TextVQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
12.2.2 ST-VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
12.2.3 OCR-VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
12.3 OCR tokens representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
12.4 Simple fusion models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
12.4.1 LoRRA: Look, Read, Reason & Answer . . . . . . . . . . . . . . . . . 167
12.5 Graph-based models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
12.5.1 Structured Multimodal Attentions for TextVQA . . . . . . . . . . . 169
12.6 Transformer-based models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
12.6.1 Multimodal Multi-Copy Mesh model . . . . . . . . . . . . . . . . . . . . 170
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
13 Visual Question Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.2 VQG as Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.3 Generating Questions from Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.4 Generating Questions from Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.5 Adversarial learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
13.6 VQG as Visual Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
14 Visual Dialogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
14.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
14.3 Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
14.3.1 Hierarchical Recurrent Encoder with Attention (HREA)
and memory network (MN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
14.3.2 History-Conditioned Image Attentive Encoder (HCIAE) . . . 181
14.3.3 Sequential Co-Attention Generative Model (CoAtt) . . . . . . . . 182
14.3.4 Synergistic Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
14.4 Visual Co-reference Resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
14.5 Graph Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
14.5.1 Scene Graph for Visual Representations . . . . . . . . . . . . . . . . . 188
14.5.2 GNN for Visual and Dialogue Representations . . . . . . . . . . . . 189
14.6 Pretrained Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
14.6.1 VD_BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
14.6.2 Visual-Dialog BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
15 Referring Expression Comprehension . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
15.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
15.3 Two-stage Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
15.3.1 Joint Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
15.3.2 Co-attention Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
15.3.3 Graph-based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
15.4 One-stage Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
15.5 Reasoning Process comprehension . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Part V Summary and Outlook
16 Summary and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
16.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
16.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
16.2.1 Explainable VQA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
16.2.2 VQA in the wild . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
16.2.3 Eliminating Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
16.2.4 More settings and Applications . . . . . . . . . . . . . . . . . . . . . . . . . 214
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Dr. Qi Wu is Senior Lecturer at the University of Adelaide and Chief Investigator at the ARC Centre of Excellence for Robotic Vision. He is also Director of Vision-and-Language Methods at the Australian Institute for Machine Learning. Dr. Wu has been in the Computer Vision field for 10 years and has a strong track record, having pioneered the field of Vision-and-Language, one of the most interesting and technically challenging areas of Computer Vision. This area, which has emerged over the last 5 years, represents the application of computer vision technology to problems that are closer to Artificial Intelligence. Dr. Wu has made breakthroughs in methods and conceptual understanding to advance the field and is recognised as an international leader in the discipline. Beyond publishing some of the seminal papers in the area, he has organised a series of workshops at CVPR, ICCV, and ACL, and authored key benchmarks that define the field. Recently, he led a team that won second place in the VATEX Video Captioning Challenge and first place in both the TextVQA Challenge and the MedicalVQA Challenge. His achievements have been recognised with the Australian Academy of Science J G Russell Award in 2019, one of four awards to early-career researchers across Australia, and an NVIDIA Pioneer Research Award.
Dr. Peng Wang is Professor at the School of Computer Science, Northwestern Polytechnical University, China. He previously worked at the School of Computer Science, University of Adelaide, for four years. His research interests include computer vision, machine learning, and artificial intelligence.
Dr. Xin Wang is currently Assistant Professor at the Department of Computer Science and Technology, Tsinghua University. His research interests include cross-modal multimedia intelligence and inferable recommendations in social media. He has published several high-quality research papers at top conferences including ICML, KDD, WWW, SIGIR, and ACM Multimedia. In addition to being selected for the 2017 China Postdoctoral Innovative Talents Supporting Program, he received the ACM China Rising Star Award in 2020.
Dr. Xiaodong He is Deputy Managing Director of JD AI Research; Head of the Deep Learning, NLP and Speech Lab; and Technical Vice President of JD.com. He is also Affiliate Professor at the University of Washington (Seattle), where he serves on doctoral supervisory committees. His research interests are mainly in artificial intelligence areas including deep learning, natural language, computer vision, speech, information retrieval, and knowledge representation. He has published more than 100 papers in ACL, EMNLP, NAACL, CVPR, SIGIR, WWW, CIKM, NIPS, ICLR, ICASSP, Proc. IEEE, IEEE TASLP, IEEE SPM, and other venues. He has received several awards including the Outstanding Paper Award at ACL 2015. He is Co-inventor of the DSSM, which is now broadly applied to language, vision, IR, and knowledge representation tasks. He also led the development of the CaptionBot, the world's first image captioning cloud service, deployed in 2016. He and colleagues have won major AI challenges including the 2008 NIST MT Eval, IWSLT 2011, the COCO Captioning Challenge 2015, and VQA 2017. His work has been widely integrated into influential software and services including Microsoft Image Caption Services, Bing & Ads, Seeing AI, Word, and PowerPoint. He has held editorial positions with several IEEE journals, served as Area Chair for NAACL-HLT 2015, and served on the organizing committees/program committees of major speech and language processing conferences. He is IEEE Fellow and Member of the ACL.
Wenwu Zhu is currently Professor in the Department of Computer Science and Technology at Tsinghua University and Vice Dean of the National Research Center for Information Science and Technology. Prior to his current post, he was Senior Researcher and Research Manager at Microsoft Research Asia. He was Chief Scientist and Director at Intel Research China from 2004 to 2008. He worked at Bell Labs, New Jersey, as Member of Technical Staff from 1996 to 1999. He received his Ph.D. degree from New York University in 1996.
His current research interests are in the area of data-driven multimedia networking and multimedia intelligence. He has published over 350 refereed papers and is Inventor or Co-inventor of over 50 patents. He received eight Best Paper Awards, including ACM Multimedia 2012 and IEEE Transactions on Circuits and Systems for Video Technology in 2001 and 2019.
He served as EiC for IEEE Transactions on Multimedia (2017–2019). He serves as Chair of the steering committee for IEEE Transactions on Multimedia, and he serves as Associate EiC for IEEE Transactions on Circuits and Systems for Video Technology. He served as General Co-Chair for ACM Multimedia 2018 and ACM CIKM 2019. He is AAAS Fellow, IEEE Fellow, SPIE Fellow, and Member of The Academy of Europe (Academia Europaea).