E-book: Machine Learning in Translation Corpora Processing [Taylor & Francis e-book]

Krzysztof Wolk (Department of Multimedia, Polish-Japanese Academy of Information Technology, Warsaw, Poland)

Format: 280 pages, 111 Tables, black and white; 4 Illustrations, color; 37 Illustrations, black and white
Publication date: 05-Mar-2019
Publisher: CRC Press
ISBN-13: 9780429197543

Taylor & Francis e-book
Price: 240,04 €*
* price that guarantees access for an unlimited number of simultaneous users for an unlimited time
Regular price: 342,91 €
You save 30%

Book homepage: https://www.taylorfrancis.com/books/9780429197543

This book reviews ways to improve statistical machine speech translation between Polish and English. Research has been conducted mostly on dictionary-based, rule-based, and syntax-based machine translation techniques. The most popular methodologies and tools are not well suited to the Polish language and therefore require adaptation, and both parallel and monolingual language resources are lacking. The main objective of this volume is to develop an automatic and robust Polish-to-English translation system that meets specific translation requirements, and to develop bilingual textual resources by mining comparable corpora.

Acknowledgements
Preface
Abbreviations and Definitions
Overview

1 Introduction
1.1 Background and context
1.1.1 The concept of cohesion
1.2 Machine translation (MT)
1.2.1 History of statistical machine translation (SMT)
1.2.2 Statistical machine translation approach
1.2.3 SMT applications and research trends

2 Statistical Machine Translation and Comparable Corpora
2.1 Overview of SMT
2.2 Textual components and corpora
2.2.1 Words
2.2.2 Sentences
2.2.3 Corpora
2.3 Moses tool environment for SMT
2.3.1 Tuning for quality
2.3.2 Operation sequence model (OSM)
2.3.3 Minimum error rate training tool
2.4 Aspects of SMT processing
2.4.1 Tokenization
2.4.2 Compounding
2.4.3 Language models
2.4.3.1 Out of vocabulary words
2.4.3.2 N-gram smoothing methods
2.4.4 Translation models
2.4.4.1 Noisy channel model
2.4.4.2 IBM models
2.4.4.3 Phrase-based models
2.4.5 Lexicalized reordering
2.4.5.1 Word alignment
2.4.6 Domain text adaptation
2.4.6.1 Interpolation
2.4.6.2 Adaptation of parallel corpora
2.5 Evaluation of SMT quality
2.5.1 Current evaluation metrics
2.5.1.1 BLEU overview
2.5.1.2 Other SMT metrics
2.5.1.3 HMEANT metric
2.5.1.3.1 Evaluation using HMEANT
2.5.1.3.2 HMEANT calculation
2.5.2 Statistical significance test

3 State of the Art
3.1 Current methods and results in spoken language translation
3.2 Recent methods in comparable corpora exploration
3.2.1 Native Yalign method
3.2.2 A* algorithm for alignment
3.2.3 Needleman-Wunsch algorithm
3.2.4 Other alignment methods

4 Author's Solutions to PL-EN Corpora Processing Problems
4.1 Parallel data mining improvements
4.2 Multi-threaded, tuned and GPU-accelerated Yalign
4.2.1 Needleman-Wunsch algorithm with GPU optimization
4.2.2 Comparison of alignment methods
4.3 Tuning of Yalign method
4.4 Minor improvements in mining for Wikipedia exploration
4.5 Parallel data mining using other methods
4.5.1 The pipeline of tools
4.5.2 Analogy-based method
4.6 SMT metric enhancements
4.6.1 Enhancements to the BLEU metric
4.6.2 Evaluation using enhanced BLEU metric
4.7 Alignment and filtering of corpora
4.7.1 Corpora used for alignment experiments
4.7.2 Filtering and alignment algorithm
4.7.3 Filtering results
4.7.4 Alignment evaluation results
4.8 Baseline system training
4.9 Description of experiments
4.9.1 Text alignment processing
4.9.2 Machine translation experiments
4.9.2.1 TED lectures translation
4.9.2.1.1 Word stems and SVO word order
4.9.2.1.2 Lemmatization
4.9.2.1.3 Translation and translation parameter adaptation experiments
4.9.2.2 Subtitles and EuroParl translation
4.9.2.3 Medical texts translation
4.9.2.4 Pruning experiments
4.9.3 Evaluation of obtained comparable corpora
4.9.3.1 Native Yalign method
4.9.3.2 Improved Yalign method
4.9.3.3 Parallel data mining using tool pipeline
4.9.3.4 Analogy-based method

5 Results and Conclusions
5.1 Machine translation results
5.1.1 TED lectures experiments
5.1.1.1 Word stems and SVO word order
5.1.1.2 Lemmatization
5.1.1.3 Translation results and translation parameter adaptation
5.1.2 Subtitles and European Parliament proceedings translation results
5.1.3 Medical text translation experiments
5.1.4 Pruning experiments
5.1.5 MT complexity within West-Slavic languages group
5.1.5.1 West-Slavic languages group
5.1.5.2 Differences between Polish and English languages
5.1.5.3 Spoken vs written language
5.1.5.4 Machine translation
5.1.5.5 Evaluation
5.1.5.6 Results
5.1.6 SMT in augmented reality systems
5.1.6.1 Review of state of the art
5.1.6.2 The approach
5.1.6.3 Data preparation
5.1.6.4 Experiments
5.1.6.5 Discussion
5.1.7 Neural networks in machine translation
5.1.7.1 Neural networks in translations
5.1.7.2 Data preparation
5.1.7.3 Translation systems
5.1.7.4 Results
5.1.8 Joint conclusions on SMT

5.2 Evaluation of obtained comparable corpora
5.2.1 Initial mining results using native Yalign solution
5.2.2 Native Yalign method evaluation
5.2.3 Improvements to the Yalign method
5.2.4 Parallel data mining using pipeline of tools
5.2.5 Analogy-based method

5.3 Quasi comparable corpora exploration

5.4 Other fields of MT techniques application
5.4.1 HMEANT and other metrics in re-speaking quality assessment
5.4.1.1 RIBES metric
5.4.1.2 NER metric
5.4.1.3 Data and experimental setup
5.4.1.4 Results
5.4.2 MT evaluation metrics within FACIT translation methodology
5.4.2.1 PROMIS evaluation process
5.4.2.2 Preparation of automatic evaluation metrics and data
5.4.2.3 SMT system preparation
5.4.2.4 Support vector machine classifier evaluation
5.4.2.5 Neural network-based evaluation
5.4.2.6 Results
5.4.3 Augmenting SMT with semantically-generated virtual-parallel data
5.4.3.1 Previous research
5.4.3.2 Generating virtual parallel data
5.4.3.3 Semantically-enhanced generated corpora
5.4.3.4 Experimental setup
5.4.3.5 Evaluation
5.4.4 Statistical noisy parallel data filtration and adaptation
5.4.4.1 Data acquisition
5.4.4.2 Building bi-lingual phrase table
5.4.4.3 Experiments and results
5.4.5 Mixing textual data selection methods for improved in-domain adaptation
5.4.5.1 Combined corpora adaptation method
5.4.5.1.1 Cosine tf-idf
5.4.5.1.2 Perplexity
5.4.5.1.3 Levenshtein distance
5.4.5.1.4 Combined methods
5.4.5.2 Results and conclusions
5.4.6 Improving ASR translation quality by de-normalization
5.4.6.1 De-Normalization method
5.4.6.2 De-Normalization experiments

6 Final Conclusions
References
Index

Krzysztof Wolk holds a PhD Eng. degree in Computer Science and is a graduate of the Polish-Japanese Academy of Information Technology. He is currently an associate professor at the Department of Multimedia at the same university. His research is mostly related to natural language processing and machine learning based on statistical methods, neural networks and deep learning. He is also interested in IT and its challenges, and engages in interdisciplinary projects, particularly those related to HCI, UX, medicine and psychology.

In addition, he has worked as a lecturer at the Warsaw School of Photography & Graphic Design, and as an IT trainer. His specialties as a teacher are primarily deep learning, machine learning, natural language processing, computational linguistics, multimedia, HCI, UX, mobile applications, HTML 5, Adobe applications and server products from Apple and Microsoft.

As for his teaching, he leads classes at the Faculty of Computer Science and at the New Media Art Department of the Polish-Japanese Academy of Information Technology, and he previously led classes and lectures at the Warsaw School of Photography & Graphic Design.

Permalink: https://www.kriso.ee/db/9780429197543_pe.html
