Acknowledgements |
|
iii | |
Preface |
|
v | |
Abbreviations and Definitions |
|
xiii | |
Overview |
|
xv | |
|
|
1 | (9) |
|
1.1 Background and context |
|
|
2 | (3) |
|
1.1.1 The concept of cohesion |
|
|
4 | (1) |
|
1.2 Machine translation (MT) |
|
|
5 | (5) |
|
1.2.1 History of statistical machine translation (SMT) |
|
|
5 | (1) |
|
1.2.2 Statistical machine translation approach |
|
|
6 | (2) |
|
1.2.3 SMT applications and research trends |
|
|
8 | (2) |
|
2 Statistical Machine Translation and Comparable Corpora |
|
|
10 | (49) |
|
|
10 | (2) |
|
2.2 Textual components and corpora |
|
|
12 | (6) |
|
|
12 | (1) |
|
|
13 | (3) |
|
|
16 | (2) |
|
2.3 Moses tool environment for SMT |
|
|
18 | (5) |
|
|
21 | (1) |
|
2.3.2 Operation sequence model (OSM) |
|
|
22 | (1) |
|
2.3.3 Minimum error rate training tool |
|
|
23 | (1) |
|
2.4 Aspects of SMT processing |
|
|
23 | (24) |
|
|
23 | (1) |
|
|
24 | (1) |
|
|
25 | (2) |
|
2.4.3.1 Out of vocabulary words |
|
|
27 | (1) |
|
2.4.3.2 N-gram smoothing methods |
|
|
28 | (3) |
|
|
31 | (1) |
|
2.4.4.1 Noisy channel model |
|
|
31 | (1) |
|
|
32 | (6) |
|
2.4.4.3 Phrase-based models |
|
|
38 | (2) |
|
2.4.5 Lexicalized reordering |
|
|
40 | (4) |
|
|
44 | (2) |
|
2.4.6 Domain text adaptation |
|
|
46 | (1) |
|
|
46 | (1) |
|
2.4.6.2 Adaptation of parallel corpora |
|
|
47 | (1) |
|
2.5 Evaluation of SMT quality |
|
|
47 | (12) |
|
2.5.1 Current evaluation metrics |
|
|
48 | (1) |
|
|
48 | (2) |
|
2.5.1.2 Other SMT metrics |
|
|
50 | (3) |
|
|
53 | (2) |
|
2.5.1.3.1 Evaluation using HMEANT |
|
|
55 | (1) |
|
2.5.1.3.2 HMEANT calculation |
|
|
55 | (1) |
|
2.5.2 Statistical significance test |
|
|
56 | (3) |
|
|
59 | (13) |
|
3.1 Current methods and results in spoken language translation |
|
|
59 | (2) |
|
3.2 Recent methods in comparable corpora exploration |
|
|
61 | (11) |
|
3.2.1 Native Yalign method |
|
|
63 | (2) |
|
3.2.2 A* algorithm for alignment |
|
|
65 | (2) |
|
3.2.3 Needleman-Wunsch algorithm |
|
|
67 | (2) |
|
3.2.4 Other alignment methods |
|
|
69 | (3) |
|
4 Author's Solutions to PL-EN Corpora Processing Problems |
|
|
72 | (46) |
|
4.1 Parallel data mining improvements |
|
|
72 | (3) |
|
4.2 Multi-threaded, tuned and GPU-accelerated Yalign |
|
|
75 | (3) |
|
4.2.1 Needleman-Wunsch algorithm with GPU optimization |
|
|
75 | (1) |
|
4.2.2 Comparison of alignment methods |
|
|
76 | (2) |
|
4.3 Tuning of Yalign method |
|
|
78 | (2) |
|
4.4 Minor improvements in mining for Wikipedia exploration |
|
|
80 | (3) |
|
4.5 Parallel data mining using other methods |
|
|
83 | (3) |
|
4.5.1 The pipeline of tools |
|
|
83 | (1) |
|
4.5.2 Analogy-based method |
|
|
84 | (2) |
|
4.6 SMT metric enhancements |
|
|
86 | (9) |
|
4.6.1 Enhancements to the BLEU metric |
|
|
86 | (3) |
|
4.6.2 Evaluation using enhanced BLEU metric |
|
|
89 | (6) |
|
4.7 Alignment and filtering of corpora |
|
|
95 | (11) |
|
4.7.1 Corpora used for alignment experiments |
|
|
95 | (3) |
|
4.7.2 Filtering and alignment algorithm |
|
|
98 | (4) |
|
|
102 | (2) |
|
4.7.4 Alignment evaluation results |
|
|
104 | (2) |
|
4.8 Baseline system training |
|
|
106 | (1) |
|
4.9 Description of experiments |
|
|
107 | (11) |
|
4.9.1 Text alignment processing |
|
|
107 | (2) |
|
4.9.2 Machine translation experiments |
|
|
109 | (1) |
|
4.9.2.1 TED lectures translation |
|
|
109 | (1) |
|
4.9.2.1.1 Word stems and SVO word order |
|
|
110 | (2) |
|
|
112 | (1) |
|
4.9.2.1.3 Translation and translation parameter adaptation experiments |
|
|
113 | (1) |
|
4.9.2.2 Subtitles and EuroParl translation |
|
|
114 | (1) |
|
4.9.2.3 Medical texts translation |
|
|
114 | (1) |
|
4.9.2.4 Pruning experiments |
|
|
115 | (1) |
|
4.9.3 Evaluation of obtained comparable corpora |
|
|
115 | (1) |
|
4.9.3.1 Native Yalign method |
|
|
115 | (1) |
|
4.9.3.2 Improved Yalign method |
|
|
116 | (1) |
|
4.9.3.3 Parallel data mining using tool pipeline |
|
|
117 | (1) |
|
4.9.3.4 Analogy-based method |
|
|
117 | (1) |
|
5 Results and Conclusions |
|
|
118 | (120) |
|
5.1 Machine translation results |
|
|
118 | (41) |
|
5.1.1 TED lectures experiments |
|
|
118 | (2) |
|
5.1.1.1 Word stems and SVO word order |
|
|
120 | (3) |
|
|
123 | (2) |
|
5.1.1.3 Translation results and translation parameter adaptation |
|
|
125 | (2) |
|
5.1.2 Subtitles and European Parliament proceedings translation results |
|
|
127 | (1) |
|
5.1.3 Medical text translation experiments |
|
|
128 | (2) |
|
5.1.4 Pruning experiments |
|
|
130 | (1) |
|
5.1.5 MT complexity within West-Slavic languages group |
|
|
131 | (1) |
|
5.1.5.1 West-Slavic languages group |
|
|
132 | (2) |
|
5.1.5.2 Differences between Polish and English languages |
|
|
134 | (1) |
|
5.1.5.3 Spoken vs written language |
|
|
135 | (1) |
|
5.1.5.4 Machine translation |
|
|
136 | (1) |
|
|
137 | (1) |
|
|
138 | (2) |
|
5.1.6 SMT in augmented reality systems |
|
|
140 | (1) |
|
5.1.6.1 Review of state of the art |
|
|
141 | (2) |
|
|
143 | (1) |
|
|
144 | (1) |
|
|
145 | (2) |
|
|
147 | (2) |
|
5.1.7 Neural networks in machine translation |
|
|
149 | (2) |
|
5.1.7.1 Neural networks in translations |
|
|
151 | (1) |
|
|
152 | (1) |
|
5.1.7.3 Translation systems |
|
|
153 | (2) |
|
|
155 | (3) |
|
5.1.8 Joint conclusions on SMT |
|
|
158 | (1) |
|
5.2 Evaluation of obtained comparable corpora |
|
|
159 | (18) |
|
5.2.1 Initial mining results using native Yalign solution |
|
|
159 | (2) |
|
5.2.2 Native Yalign method evaluation |
|
|
161 | (4) |
|
5.2.3 Improvements to the Yalign method |
|
|
165 | (10) |
|
5.2.4 Parallel data mining using pipeline of tools |
|
|
175 | (1) |
|
5.2.5 Analogy-based method |
|
|
176 | (1) |
|
5.3 Quasi comparable corpora exploration |
|
|
177 | (4) |
|
5.4 Other fields of MT techniques application |
|
|
181 | (57) |
|
5.4.1 HMEANT and other metrics in re-speaking quality assessment |
|
|
181 | (1) |
|
|
182 | (1) |
|
|
183 | (1) |
|
5.4.1.3 Data and experimental setup |
|
|
184 | (1) |
|
|
184 | (9) |
|
5.4.2 MT evaluation metrics within FACIT translation methodology |
|
|
193 | (1) |
|
5.4.2.1 PROMIS evaluation process |
|
|
194 | (2) |
|
5.4.2.2 Preparation of automatic evaluation metrics and data |
|
|
196 | (2) |
|
5.4.2.3 SMT system preparation |
|
|
198 | (1) |
|
5.4.2.4 Support vector machine classifier evaluation |
|
|
199 | (1) |
|
5.4.2.5 Neural network-based evaluation |
|
|
200 | (1) |
|
|
201 | (4) |
|
5.4.3 Augmenting SMT with semantically-generated virtual-parallel data |
|
|
205 | (3) |
|
5.4.3.1 Previous research |
|
|
208 | (2) |
|
5.4.3.2 Generating virtual parallel data |
|
|
210 | (2) |
|
5.4.3.3 Semantically-enhanced generated corpora |
|
|
212 | (3) |
|
5.4.3.4 Experimental setup |
|
|
215 | (1) |
|
|
216 | (3) |
|
5.4.4 Statistical noisy parallel data filtration and adaptation |
|
|
219 | (1) |
|
|
219 | (1) |
|
5.4.4.2 Building bi-lingual phrase table |
|
|
220 | (1) |
|
5.4.4.3 Experiments and results |
|
|
221 | (4) |
|
5.4.5 Mixing textual data selection methods for improved in-domain adaptation |
|
|
225 | (2) |
|
5.4.5.1 Combined corpora adaptation method |
|
|
227 | (1) |
|
|
227 | (1) |
|
|
228 | (1) |
|
5.4.5.1.3 Levenshtein distance |
|
|
229 | (1) |
|
5.4.5.1.4 Combined methods |
|
|
229 | (1) |
|
5.4.5.2 Results and conclusions |
|
|
230 | (2) |
|
5.4.6 Improving ASR translation quality by de-normalization |
|
|
232 | (2) |
|
5.4.6.1 De-Normalization method |
|
|
234 | (1) |
|
5.4.6.2 De-Normalization experiments |
|
|
235 | (3) |
|
|
238 | (2) |
References |
|
240 | (17) |
Index |
|
257 | |