
E-book: Machine Learning in Translation Corpora Processing [Taylor & Francis e-book]

(Department of Multimedia, Polish-Japanese Academy of Information Technology, Warsaw, Poland)
  • Format: 280 pages, 111 Tables, black and white; 4 Illustrations, color; 37 Illustrations, black and white
  • Publication date: 05-Mar-2019
  • Publisher: CRC Press
  • ISBN-13: 9780429197543
  • Taylor & Francis e-book
  • Price: 240,04 €*
  • * price that grants access to an unlimited number of concurrent users for an unlimited time
  • Regular price: 342,91 €
  • You save 30%

This book reviews ways to improve statistical machine speech translation between Polish and English. Research has been conducted mostly on dictionary-based, rule-based, and syntax-based machine translation techniques. The most popular methodologies and tools are not well suited to the Polish language and therefore require adaptation, and language resources are lacking in both parallel and monolingual data. The main objective of this volume is to develop an automatic and robust Polish-to-English translation system that meets specific translation requirements, and to develop bilingual textual resources by mining comparable corpora.

 

Acknowledgements
Preface
Abbreviations and Definitions
Overview
1 Introduction
1.1 Background and context
1.1.1 The concept of cohesion
1.2 Machine translation (MT)
1.2.1 History of statistical machine translation (SMT)
1.2.2 Statistical machine translation approach
1.2.3 SMT applications and research trends
2 Statistical Machine Translation and Comparable Corpora
2.1 Overview of SMT
2.2 Textual components and corpora
2.2.1 Words
2.2.2 Sentences
2.2.3 Corpora
2.3 Moses tool environment for SMT
2.3.1 Tuning for quality
2.3.2 Operation sequence model (OSM)
2.3.3 Minimum error rate training tool
2.4 Aspects of SMT processing
2.4.1 Tokenization
2.4.2 Compounding
2.4.3 Language models
2.4.3.1 Out of vocabulary words
2.4.3.2 N-gram smoothing methods
2.4.4 Translation models
2.4.4.1 Noisy channel model
2.4.4.2 IBM models
2.4.4.3 Phrase-based models
2.4.5 Lexicalized reordering
2.4.5.1 Word alignment
2.4.6 Domain text adaptation
2.4.6.1 Interpolation
2.4.6.2 Adaptation of parallel corpora
2.5 Evaluation of SMT quality
2.5.1 Current evaluation metrics
2.5.1.1 BLEU overview
2.5.1.2 Other SMT metrics
2.5.1.3 HMEANT metric
2.5.1.3.1 Evaluation using HMEANT
2.5.1.3.2 HMEANT calculation
2.5.2 Statistical significance test
3 State of the Art
3.1 Current methods and results in spoken language translation
3.2 Recent methods in comparable corpora exploration
3.2.1 Native Yalign method
3.2.2 A* algorithm for alignment
3.2.3 Needleman-Wunsch algorithm
3.2.4 Other alignment methods
4 Author's Solutions to PL-EN Corpora Processing Problems
4.1 Parallel data mining improvements
4.2 Multi-threaded, tuned and GPU-accelerated Yalign
4.2.1 Needleman-Wunsch algorithm with GPU optimization
4.2.2 Comparison of alignment methods
4.3 Tuning of Yalign method
4.4 Minor improvements in mining for Wikipedia exploration
4.5 Parallel data mining using other methods
4.5.1 The pipeline of tools
4.5.2 Analogy-based method
4.6 SMT metric enhancements
4.6.1 Enhancements to the BLEU metric
4.6.2 Evaluation using enhanced BLEU metric
4.7 Alignment and filtering of corpora
4.7.1 Corpora used for alignment experiments
4.7.2 Filtering and alignment algorithm
4.7.3 Filtering results
4.7.4 Alignment evaluation results
4.8 Baseline system training
4.9 Description of experiments
4.9.1 Text alignment processing
4.9.2 Machine translation experiments
4.9.2.1 TED lectures translation
4.9.2.1.1 Word stems and SVO word order
4.9.2.1.2 Lemmatization
4.9.2.1.3 Translation and translation parameter adaptation experiments
4.9.2.2 Subtitles and EuroParl translation
4.9.2.3 Medical texts translation
4.9.2.4 Pruning experiments
4.9.3 Evaluation of obtained comparable corpora
4.9.3.1 Native Yalign method
4.9.3.2 Improved Yalign method
4.9.3.3 Parallel data mining using tool pipeline
4.9.3.4 Analogy-based method
5 Results and Conclusions
5.1 Machine translation results
5.1.1 TED lectures experiments
5.1.1.1 Word stems and SVO word order
5.1.1.2 Lemmatization
5.1.1.3 Translation results and translation parameter adaptation
5.1.2 Subtitles and European Parliament proceedings translation results
5.1.3 Medical text translation experiments
5.1.4 Pruning experiments
5.1.5 MT complexity within West-Slavic languages group
5.1.5.1 West-Slavic languages group
5.1.5.2 Differences between Polish and English languages
5.1.5.3 Spoken vs written language
5.1.5.4 Machine translation
5.1.5.5 Evaluation
5.1.5.6 Results
5.1.6 SMT in augmented reality systems
5.1.6.1 Review of state of the art
5.1.6.2 The approach
5.1.6.3 Data preparation
5.1.6.4 Experiments
5.1.6.5 Discussion
5.1.7 Neural networks in machine translation
5.1.7.1 Neural networks in translations
5.1.7.2 Data preparation
5.1.7.3 Translation systems
5.1.7.4 Results
5.1.8 Joint conclusions on SMT
5.2 Evaluation of obtained comparable corpora
5.2.1 Initial mining results using native Yalign solution
5.2.2 Native Yalign method evaluation
5.2.3 Improvements to the Yalign method
5.2.4 Parallel data mining using pipeline of tools
5.2.5 Analogy-based method
5.3 Quasi comparable corpora exploration
5.4 Other fields of MT techniques application
5.4.1 HMEANT and other metrics in re-speaking quality assessment
5.4.1.1 RIBES metric
5.4.1.2 NER metric
5.4.1.3 Data and experimental setup
5.4.1.4 Results
5.4.2 MT evaluation metrics within FACIT translation methodology
5.4.2.1 PROMIS evaluation process
5.4.2.2 Preparation of automatic evaluation metrics and data
5.4.2.3 SMT system preparation
5.4.2.4 Support vector machine classifier evaluation
5.4.2.5 Neural network-based evaluation
5.4.2.6 Results
5.4.3 Augmenting SMT with semantically-generated virtual-parallel data
5.4.3.1 Previous research
5.4.3.2 Generating virtual parallel data
5.4.3.3 Semantically-enhanced generated corpora
5.4.3.4 Experimental setup
5.4.3.5 Evaluation
5.4.4 Statistical noisy parallel data filtration and adaptation
5.4.4.1 Data acquisition
5.4.4.2 Building bi-lingual phrase table
5.4.4.3 Experiments and results
5.4.5 Mixing textual data selection methods for improved in-domain adaptation
5.4.5.1 Combined corpora adaptation method
5.4.5.1.1 Cosine tf-idf
5.4.5.1.2 Perplexity
5.4.5.1.3 Levenshtein distance
5.4.5.1.4 Combined methods
5.4.5.2 Results and conclusions
5.4.6 Improving ASR translation quality by de-normalization
5.4.6.1 De-Normalization method
5.4.6.2 De-Normalization experiments
6 Final Conclusions
References
Index
Krzysztof Wołk holds a PhD Eng. degree in Computer Science and is a graduate of the Polish-Japanese Academy of Information Technology. He is currently an associate professor in the Department of Multimedia at the same university. His research mostly concerns natural language processing and machine learning based on statistical methods, neural networks, and deep learning. He is also interested in IT and its challenges and engages in interdisciplinary projects, particularly those related to HCI, UX, medicine, and psychology.



In addition, he has worked as a lecturer at the Warsaw School of Photography & Graphic Design and as an IT trainer. As a teacher, he specializes primarily in deep learning, machine learning, natural language processing, computational linguistics, multimedia, HCI, UX, mobile applications, HTML5, Adobe applications, and server products from Apple and Microsoft.



In his teaching work, he leads classes at the Faculty of Computer Science and at the New Media Art Department of the Polish-Japanese Academy of Information Technology. He previously led classes and lectures at the Warsaw School of Photography & Graphic Design.