Foreword  xvii
Preface  xix
Authors  xxiii
Contributors  xxv

Chapter 1 Deep Learning and Transformers: An Introduction  1 (10)
1.1 Deep Learning: A Historic Perspective  1 (3)
1.2 Transformers and Taxonomy  4 (4)
1.2.1 Modified Transformer Architecture  4 (4)
1.2.1.1 Transformer block changes  4 (1)
1.2.1.2 Transformer sublayer changes  5 (3)
1.2.2 Pre-training Methods and Applications  8 (1)
1.3 ...  8 (3)
1.3.1 Libraries and Implementations  8 (1)
1.3.2 ...  9 (1)
1.3.3 Courses, Tutorials, and Lectures  9 (1)
1.3.4 Case Studies and Details  10 (1)

Chapter 2 Transformers: Basics and Introduction  11 (32)
2.1 Encoder-Decoder Architecture  11 (1)
2.2 ...  12 (2)
2.2.1 ...  12 (1)
2.2.2 ...  13 (1)
2.2.3 ...  14 (1)
2.2.4 Issues with RNN-Based Encoder-Decoder  14 (1)
2.3 ...  14 (5)
2.3.1 ...  14 (2)
2.3.2 Types of Score-Based Attention  16 (2)
2.3.2.1 Dot product (multiplicative)  17 (1)
2.3.2.2 Scaled dot product or multiplicative  17 (1)
2.3.2.3 Linear, MLP, or Additive  17 (1)
2.3.3 Attention-Based Sequence-to-Sequence  18 (1)
2.4 ...  19 (8)
2.4.1 Source and Target Representation  20 (2)
2.4.1.1 ...  20 (1)
2.4.1.2 Positional encoding  20 (2)
2.4.2 ...  22 (4)
2.4.2.1 ...  22 (2)
2.4.2.2 Multi-head attention  24 (1)
2.4.2.3 Masked multi-head attention  25 (1)
2.4.2.4 Encoder-decoder multi-head attention  26 (1)
2.4.3 Residuals and Layer Normalization  26 (1)
2.4.4 Positionwise Feed-forward Networks  26 (1)
...  27 (1)
...  27 (1)
2.5 Case Study: Machine Translation  27 (16)
2.5.1 ...  27 (1)
2.5.2 Data, Tools, and Libraries  27 (1)
2.5.3 Experiments, Results, and Analysis  28 (15)
2.5.3.1 Exploratory data analysis  28 (1)
2.5.3.2 ...  29 (6)
2.5.3.3 ...  35 (3)
2.5.3.4 Results and analysis  38 (1)
2.5.3.5 ...  38 (5)

Chapter 3 Bidirectional Encoder Representations from Transformers (BERT)  43 (28)
3.1 ...  43 (5)
3.1.1 ...  43 (2)
...  45 (1)
...  46 (2)
...  48 (1)
...  48 (1)
...  49 (2)
...  49 (1)
...  50 (1)
3.4 ...  51 (2)
3.4.1 BERT Sentence Representation  51 (1)
3.4.2 ...  52 (1)
3.5 Case Study: Topic Modeling with Transformers  53 (10)
3.5.1 ...  53 (1)
3.5.2 Data, Tools, and Libraries  53 (2)
3.5.2.1 ...  54 (1)
3.5.2.2 Compute embeddings  54 (1)
3.5.3 Experiments, Results, and Analysis  55 (8)
3.5.3.1 ...  55 (1)
3.5.3.2 Topic size distribution  55 (1)
3.5.3.3 Visualization of topics  56 (1)
3.5.3.4 Content of topics  57 (6)
3.6 Case Study: Fine-Tuning BERT  63 (8)
3.6.1 ...  63 (1)
3.6.2 Data, Tools, and Libraries  63 (1)
3.6.3 Experiments, Results, and Analysis  64 (7)

Chapter 4 Multilingual Transformer Architectures  71 (38)
4.1 Multilingual Transformer Architectures  72 (18)
4.1.1 Basic Multilingual Transformer  72 (2)
4.1.2 Single-Encoder Multilingual NLU  74 (11)
4.1.2.1 ...  74 (1)
4.1.2.2 ...  75 (2)
4.1.2.3 ...  77 (1)
4.1.2.4 ...  77 (1)
4.1.2.5 ...  78 (2)
4.1.2.6 ...  80 (1)
4.1.2.7 ...  81 (1)
4.1.2.8 ...  82 (2)
4.1.2.9 ...  84 (1)
4.1.3 Dual-Encoder Multilingual NLU  85 (4)
4.1.3.1 ...  85 (2)
4.1.3.2 ...  87 (2)
4.1.4 ...  89 (1)
4.2 ...  90 (3)
4.2.1 ...  90 (1)
4.2.2 Multilingual Benchmarks  91 (2)
4.2.2.1 ...  91 (1)
4.2.2.2 Structure prediction  92 (1)
4.2.2.3 Question answering  92 (1)
4.2.2.4 Semantic retrieval  92 (1)
4.3 Multilingual Transfer Learning Insights  93 (4)
4.3.1 Zero-Shot Cross-Lingual Learning  93 (3)
4.3.1.1 ...  93 (1)
4.3.1.2 Model architecture factors  94 (1)
4.3.1.3 Model tasks factors  95 (1)
4.3.2 Language-Agnostic Cross-Lingual Representations  96 (1)
4.4 ...  97 (12)
4.4.1 ...  97 (1)
4.4.2 Data, Tools, and Libraries  98 (1)
4.4.3 Experiments, Results, and Analysis  98 (11)
4.4.3.1 Data preprocessing  99 (2)
4.4.3.2 ...  101 (8)

Chapter 5 Transformer Modifications  109 (46)
5.1 Transformer Block Modifications  109 (11)
5.1.1 Lightweight Transformers  109 (5)
5.1.1.1 Funnel-transformer  109 (3)
5.1.1.2 ...  112 (2)
5.1.2 Connections between Transformer Blocks  114 (1)
5.1.2.1 ...  114 (1)
5.1.3 Adaptive Computation Time  115 (1)
5.1.3.1 Universal transformers (UT)  115 (1)
5.1.4 Recurrence Relations between Transformer Blocks  116 (4)
5.1.4.1 ...  116 (4)
5.1.5 Hierarchical Transformers  120 (1)
5.2 Transformers with Modified Multi-Head Self-Attention  120 (25)
5.2.1 Structure of Multi-Head Self-Attention  120 (4)
5.2.1.1 Multi-head self-attention  122 (1)
5.2.1.2 Space and time complexity  123 (1)
5.2.2 Reducing Complexity of Self-Attention  124 (13)
5.2.2.1 ...  124 (2)
5.2.2.2 ...  126 (5)
5.2.2.3 ...  131 (1)
5.2.2.4 ...  132 (5)
5.2.3 Improving Multi-Head Attention  137 (3)
5.2.3.1 Talking-heads attention  137 (3)
5.2.4 Biasing Attention with Priors  140 (1)
5.2.5 ...  140 (1)
5.2.5.1 Clustered attention  140 (1)
5.2.6 Compressed Key-Value Memory  141 (2)
5.2.6.1 Luna: Linear Unified Nested Attention  141 (2)
5.2.7 Low-Rank Approximations  143 (2)
5.2.7.1 ...  143 (2)
5.3 Modifications for Training Task Efficiency  145 (1)
5.3.1 ...  145 (1)
5.3.1.1 Replaced token detection  145 (1)
...  146 (1)
5.4 Transformer Submodule Changes  146 (2)
5.4.1 ...  146 (2)
5.5 Case Study: Sentiment Analysis  148 (7)
5.5.1 ...  148 (1)
5.5.2 Data, Tools, and Libraries  148 (2)
5.5.3 Experiments, Results, and Analysis  150 (5)
5.5.3.1 Visualizing attention head weights  150 (2)
5.5.3.2 ...  152 (3)

Chapter 6 Pre-trained and Application-Specific Transformers  155 (32)
6.1 ...  155 (8)
6.1.1 Domain-Specific Transformers  155 (2)
6.1.1.1 ...  155 (1)
6.1.1.2 ...  156 (1)
6.1.1.3 ...  156 (1)
6.1.2 Text-to-Text Transformers  157 (1)
6.1.2.1 ...  157 (1)
6.1.3 ...  158 (5)
6.1.3.1 GPT: Generative pre-training  158 (2)
6.1.3.2 ...  160 (1)
6.1.3.3 ...  161 (2)
6.2 ...  163 (1)
6.2.1 ...  163 (1)
6.3 Automatic Speech Recognition  164 (2)
6.3.1 ...  165 (1)
6.3.2 ...  165 (1)
6.3.3 HuBERT: Hidden Units BERT  166 (1)
6.4 Multimodal and Multitasking Transformer  166 (3)
6.4.1 Vision-and-Language BERT (VilBERT)  167 (1)
6.4.2 Unified Transformer (UniT)  168 (1)
6.5 Video Processing with Timesformer  169 (3)
6.5.1 ...  169 (1)
6.5.2 ...  170 (2)
6.5.2.1 Spatiotemporal self-attention  171 (1)
6.5.2.2 Spatiotemporal attention blocks  171 (1)
6.6 ...  172 (5)
6.6.1 Positional Encodings in a Graph  173 (1)
6.6.1.1 Laplacian positional encodings  173 (1)
6.6.2 Graph Transformer Input  173 (4)
6.6.2.1 Graphs without edge attributes  174 (1)
6.6.2.2 Graphs with edge attributes  175 (2)
6.7 Reinforcement Learning  177 (3)
6.7.1 Decision Transformer  178 (2)
6.8 Case Study: Automatic Speech Recognition  180 (7)
6.8.1 ...  180 (1)
6.8.2 Data, Tools, and Libraries  180 (1)
6.8.3 Experiments, Results, and Analysis  180 (10)
6.8.3.1 Preprocessing speech data  180 (1)
6.8.3.2 ...  181 (6)

Chapter 7 Interpretability and Explainability Techniques for Transformers  187 (34)
7.1 Traits of Explainable Systems  187 (2)
7.2 Related Areas That Impact Explainability  189 (1)
7.3 Explainable Methods Taxonomy  190 (12)
7.3.1 Visualization Methods  190 (5)
7.3.1.1 Backpropagation-based  190 (4)
7.3.1.2 Perturbation-based  194 (1)
7.3.2 ...  195 (3)
7.3.2.1 Local approximation  195 (3)
7.3.2.2 Model translation  198 (1)
7.3.3 ...  198 (4)
7.3.3.1 Probing mechanism  198 (3)
7.3.3.2 ...  201 (1)
7.4 Attention and Explanation  202 (6)
7.4.1 Attention is Not an Explanation  202 (3)
7.4.1.1 Attention weights and feature importance  202 (2)
7.4.1.2 Counterfactual experiments  204 (1)
7.4.2 Attention is Not Not an Explanation  205 (3)
7.4.2.1 Is attention necessary for all tasks?  206 (1)
7.4.2.2 Searching for adversarial models  207 (1)
7.4.2.3 Attention probing  208 (1)
7.5 Quantifying Attention Flow  208 (2)
7.5.1 Information Flow as DAG  208 (1)
7.5.2 ...  209 (1)
7.5.3 ...  209 (1)
7.6 Case Study: Text Classification with Explainability  210 (11)
7.6.1 ...  210 (1)
7.6.2 Data, Tools, and Libraries  211 (1)
7.6.3 Experiments, Results, and Analysis  211 (10)
7.6.3.1 Exploratory data analysis  211 (1)
7.6.3.2 ...  211 (1)
7.6.3.3 Error analysis and explainability  212 (9)

Bibliography  221 (34)
Index  255