Natural Language Processing of Semitic Languages, 2014 ed. [Hardcover]

Edited by Imed Zitouni
  • Format: Hardback, XXIV + 459 pages, height x width: 235x155 mm, weight: 8454 g, 61 illustrations (23 in color, 38 in black and white)
  • Series: Theory and Applications of Natural Language Processing
  • Publication date: 12-May-2014
  • Publisher: Springer-Verlag Berlin and Heidelberg GmbH & Co. K
  • ISBN-10: 3642453570
  • ISBN-13: 9783642453571
  • Hardcover
  • Price: 104,29 €*
  • * The price is final; no further discounts apply
  • Regular price: 122,69 €
  • You save 15%
  • Delivery from the publisher takes approximately 2-4 weeks
  • Free delivery

Research in Natural Language Processing (NLP) has rapidly advanced in recent years, resulting in exciting algorithms for sophisticated processing of text and speech in various languages. Much of this work focuses on English; in this book we address another group of interesting and challenging languages for NLP research: the Semitic languages. The Semitic group of languages includes Arabic (206 million native speakers), Amharic (27 million), Hebrew (7 million), Tigrinya (6.7 million), Syriac (1 million) and Maltese (419 thousand). Semitic languages exhibit unique morphological processes, challenging syntactic constructions and various other phenomena that are less prevalent in other natural languages. These challenges call for unique solutions, many of which are described in this book.

The 13 chapters presented in this book bring together leading scientists from several universities and research institutes worldwide. While this book devotes some attention to cutting-edge algorithms and techniques, its primary purpose is a thorough explication of best practices in the field. Furthermore, every chapter describes how the techniques discussed apply to Semitic languages. The book covers both statistical approaches to NLP, which are dominant across various applications nowadays, and the more traditional rule-based approaches, which have proven useful for several other application domains. We hope that this book will provide a "one-stop shop" for all the requisite background and practical advice when building NLP applications for Semitic languages.

Part I Natural Language Processing Core-Technologies
1 Linguistic Introduction: The Orthography, Morphology and Syntax of Semitic Languages
Ray Fabri
Michael Gasser
Nizar Habash
George Kiraz
Shuly Wintner
1.1 Introduction
1.2 Amharic
1.2.1 Orthography
1.2.2 Derivational Morphology
1.2.3 Inflectional Morphology
1.2.4 Basic Syntactic Structure
1.3 Arabic
1.3.1 Orthography
1.3.2 Morphology
1.3.3 Basic Syntactic Structure
1.4 Hebrew
1.4.1 Orthography
1.4.2 Derivational Morphology
1.4.3 Inflectional Morphology
1.4.4 Morphological Ambiguity
1.4.5 Basic Syntactic Structure
1.5 Maltese
1.5.1 Orthography
1.5.2 Derivational Morphology
1.5.3 Inflectional Morphology
1.5.4 Basic Syntactic Structure
1.6 Syriac
1.6.1 Orthography
1.6.2 Derivational Morphology
1.6.3 Inflectional Morphology
1.6.4 Syntax
1.7 Contrastive Analysis
1.7.1 Orthography
1.7.2 Phonology
1.7.3 Morphology
1.7.4 Syntax
1.7.5 Lexicon
1.8 Conclusion
References
2 Morphological Processing of Semitic Languages
Shuly Wintner
2.1 Introduction
2.2 Basic Notions
2.3 The Challenges of Morphological Processing
2.4 Computational Approaches to Morphology
2.4.1 Two-Level Morphology
2.4.2 Multi-tape Automata
2.4.3 The Xerox Approach
2.4.4 Registered Automata
2.4.5 Analysis by Generation
2.4.6 Functional Morphology
2.5 Morphological Analysis and Generation of Semitic Languages
2.5.1 Amharic
2.5.2 Arabic
2.5.3 Hebrew
2.5.4 Other Languages
2.5.5 Related Applications
2.6 Morphological Disambiguation of Semitic Languages
2.7 Future Directions
References
3 Syntax and Parsing of Semitic Languages
Reut Tsarfaty
3.1 Introduction
3.1.1 Parsing Systems
3.1.2 Semitic Languages
3.1.3 The Main Challenges
3.1.4 Summary and Conclusion
3.2 Case Study: Generative Probabilistic Parsing
3.2.1 Formal Preliminaries
3.2.2 An Architecture for Parsing Semitic Languages
3.2.3 The Syntactic Model
3.2.4 The Lexical Model
3.3 Empirical Results
3.3.1 Parsing Modern Standard Arabic
3.3.2 Parsing Modern Hebrew
3.4 Conclusion and Future Work
References
4 Semantic Processing of Semitic Languages
Mona Diab
Yuval Marton
4.1 Introduction
4.2 Fundamentals of Semitic Language Meaning Units
4.2.1 Morpho-Semantics: A Primer
4.3 Meaning, Semantic Distance, Paraphrasing and Lexicon Generation
4.3.1 Semantic Distance
4.3.2 Textual Entailment
4.3.3 Lexicon Creation
4.4 Word Sense Disambiguation and Meaning Induction
4.4.1 WSD Approaches in Semitic Languages
4.4.2 WSI in Semitic Languages
4.5 Multiword Expression Detection and Classification
4.5.1 Approaches to Semitic MWE Processing and Resources
4.6 Predicate–Argument Analysis
4.6.1 Arabic Annotated Resources
4.6.2 Systems for Semantic Role Labeling
4.7 Conclusion
References
5 Language Modeling
Ilana Heintz
5.1 Introduction
5.2 Evaluating Language Models with Perplexity
5.3 N-Gram Language Modeling
5.4 Smoothing: Discounting, Backoff, and Interpolation
5.4.1 Discounting
5.4.2 Combining Discounting with Backoff
5.4.3 Interpolation
5.5 Extensions to N-Gram Language Modeling
5.5.1 Skip N-Grams and Flex Grams
5.5.2 Variable-Length Language Models
5.5.3 Class-Based Language Models
5.5.4 Factored Language Models
5.5.5 Neural Network Language Models
5.5.6 Syntactic or Structured Language Models
5.5.7 Tree-Based Language Models
5.5.8 Maximum-Entropy Language Models
5.5.9 Discriminative Language Models
5.5.10 LSA Language Models
5.5.11 Bayesian Language Models
5.6 Modeling Semitic Languages
5.6.1 Arabic
5.6.2 Amharic
5.6.3 Hebrew
5.6.4 Maltese
5.6.5 Syriac
5.6.6 Other Morphologically Rich Languages
5.7 Summary
References
Part II Natural Language Processing Applications
6 Statistical Machine Translation
Hany Hassan
Kareem Darwish
6.1 Introduction
6.2 Machine Translation Approaches
6.2.1 Machine Translation Paradigms
6.2.2 Rule-Based Machine Translation
6.2.3 Example-Based Machine Translation
6.2.4 Statistical Machine Translation
6.2.5 Machine Translation for Semitic Languages
6.3 Overview of Statistical Machine Translation
6.3.1 Word-Based Translation Models
6.3.2 Phrase-Based SMT
6.3.3 Phrase Extraction Techniques
6.3.4 SMT Reordering
6.3.5 Language Modeling
6.3.6 SMT Decoding
6.4 Machine Translation Evaluation Metrics
6.5 Machine Translation for Semitic Languages
6.5.1 Word Segmentation
6.5.2 Word Alignment and Reordering
6.5.3 Gender-Number Agreement
6.6 Building Phrase-Based SMT Systems
6.6.1 Data
6.6.2 Parallel Data
6.6.3 Monolingual Data
6.7 SMT Software Resources
6.7.1 SMT Moses Framework
6.7.2 Language Modeling Toolkits
6.7.3 Morphological Analysis
6.8 Building a Phrase-Based SMT System: Step-by-Step Guide
6.8.1 Machine Preparation
6.8.2 Data
6.8.3 Data Preprocessing
6.8.4 Words Segmentation
6.8.5 Language Model
6.8.6 Translation Model
6.8.7 Parameter Tuning
6.8.8 System Decoding
6.9 Summary
References
7 Named Entity Recognition
Behrang Mohit
7.1 Introduction
7.2 The Named Entity Recognition Task
7.2.1 Definition
7.2.2 Challenges in Named Entity Recognition
7.2.3 Rule-Based Named Entity Recognition
7.2.4 Statistical Named Entity Recognition
7.2.5 Hybrid Systems
7.2.6 Evaluation and Shared Tasks
7.2.7 Evaluation Campaigns
7.2.8 Beyond Traditional Named Entity Recognition
7.3 Named Entity Recognition for Semitic Languages
7.3.1 Challenges in Semitic Named Entity Recognition
7.3.2 Approaches to Semitic Named Entity Recognition
7.4 Case Studies
7.4.1 Learning Algorithms
7.4.2 Features
7.4.3 Experiments
7.5 Relevant Problems
7.5.1 Named Entity Translation and Transliteration
7.5.2 Entity Detection and Tracking
7.5.3 Projection
7.6 Labeled Named Entity Recognition Corpora
7.7 Future Challenges and Opportunities
7.8 Summary
References
8 Anaphora Resolution
Khadiga Mahmoud Seddik
Ali Farghaly
8.1 Introduction: Anaphora and Anaphora Resolution
8.2 Types of Anaphora
8.2.1 Pronominal Anaphora
8.2.2 Lexical Anaphora
8.2.3 Comparative Anaphora
8.3 Determinants in Anaphora Resolution
8.3.1 Eliminating Factors
8.3.2 Preferential Factors
8.3.3 Implementing Features in AR (Anaphora Resolution) Systems
8.4 The Process of Anaphora Resolution
8.5 Different Approaches to Anaphora Resolution
8.5.1 Knowledge-Intensive Versus Knowledge-Poor Approaches
8.5.2 Traditional Approach
8.5.3 Statistical Approach
8.5.4 Linguistic Approach to Anaphora Resolution
8.6 Recent Work in Anaphora and Coreference Resolution
8.6.1 Mention-Synchronous Coreference Resolution Algorithm Based on the Bell Tree [24]
8.6.2 A Twin-Candidate Model for Learning-Based Anaphora Resolution [47, 48]
8.6.3 Improving Machine Learning Approaches to Coreference Resolution [36]
8.7 Evaluation of Anaphora Resolution Systems
8.7.1 MUC [45]
8.7.2 B-Cube [2]
8.7.3 ACE (NIST 2003)
8.7.4 CEAF [23]
8.7.5 BLANC [40]
8.8 Anaphora in Semitic Languages
8.8.1 Anaphora Resolution in Arabic
8.9 Difficulties with AR in Semitic Languages
8.9.1 The Morphology of the Language
8.9.2 Complex Sentence Structure
8.9.3 Hidden Antecedents
8.9.4 The Lack of Corpora Annotated with Anaphoric Links
8.10 Summary
References
9 Relation Extraction
Vittorio Castelli
Imed Zitouni
9.1 Introduction
9.2 Relations
9.3 Approaches to Relation Extraction
9.3.1 Feature-Based Classifiers
9.3.2 Kernel-Based Methods
9.3.3 Semi-supervised and Adaptive Learning
9.4 Language-Specific Issues
9.5 Data
9.6 Results
9.7 Summary
References
10 Information Retrieval
Kareem Darwish
10.1 Introduction
10.2 The Information Retrieval Task
10.2.1 Task Definition
10.2.2 The General Architecture of an IR System
10.2.3 Retrieval Models
10.2.4 IR Evaluation
10.3 Semitic Language Retrieval
10.3.1 The Major Known Challenges
10.3.2 Survey of Existing Literature
10.3.3 Best Arabic Index Terms
10.3.4 Best Hebrew Index Terms
10.3.5 Best Amharic Index Terms
10.4 Available IR Test Collections
10.4.1 Arabic
10.4.2 Hebrew
10.4.3 Amharic
10.5 Domain-Specific IR
10.5.1 Arabic–English CLIR
10.5.2 Arabic OCR Text Retrieval
10.5.3 Arabic Social Search
10.5.4 Arabic Web Search
10.6 Summary
References
11 Question Answering
Yassine Benajiba
Paolo Rosso
Lahsen Abouenour
Omar Trigui
Karim Bouzoubaa
Lamia Belguith
11.1 Introduction
11.2 The Question Answering Task
11.2.1 Task Definition
11.2.2 The Major Known Challenges
11.2.3 The General Architecture of a QA System
11.2.4 Answering Definition Questions and Query Expansion Techniques
11.2.5 How to Benchmark QA System Performance: Evaluation Measure for QA
11.3 The Case of Semitic Languages
11.3.1 NLP for Semitic Languages
11.3.2 QA for Semitic Languages
11.4 Building Arabic QA Specific Modules
11.4.1 Answering Definition Questions in Arabic
11.4.2 Query Expansion for Arabic QA
11.5 Summary
References
12 Automatic Summarization
Lamia Hadrich Belguith
Mariem Ellouze
Mohamed Hedi Maaloul
Maher Jaoua
Fatma Kallel Jaoua
Philippe Blache
12.1 Introduction
12.2 Text Summarization Aspects
12.2.1 Types of Summaries
12.2.2 Extraction vs. Abstraction
12.2.3 The Major Known Challenges
12.3 How to Evaluate Summarization Systems
12.3.1 Insights from the Evaluation Campaigns
12.3.2 Evaluation Measures for Summarization
12.4 Single Document Summarization Approaches
12.4.1 Numerical Approach
12.4.2 Symbolic Approach
12.4.3 Hybrid Approach
12.5 Multiple Document Summarization Approaches
12.5.1 Numerical Approach
12.5.2 Symbolic Approach
12.5.3 Hybrid Approach
12.6 Case of Semitic Languages
12.6.1 Language-Independent Systems
12.6.2 Arabic Systems
12.6.3 Hebrew Systems
12.6.4 Maltese Systems
12.6.5 Amharic Systems
12.7 Case Study: Building an Arabic Summarization System (L.A.E)
12.7.1 L.A.E System Architecture
12.7.2 Source Text Segmentation
12.7.3 Interface
12.7.4 Evaluation and Discussion
12.8 Summary
References
13 Automatic Speech Recognition
Hagen Soltau
George Saon
Lidia Mangu
Hong-Kwang Kuo
Brian Kingsbury
Stephen Chu
Fadi Biadsy
13.1 Introduction
13.1.1 Automatic Speech Recognition
13.1.2 Introduction to Arabic: A Speech Recognition Perspective
13.1.3 Overview
13.2 Acoustic Modeling
13.2.1 Language-Independent Techniques
13.2.2 Vowelization
13.2.3 Modeling of Arabic Dialects in Decision Trees
13.3 Language Modeling
13.3.1 Language-Independent Techniques for Language Modeling
13.3.2 Language-Specific Techniques for Language Modeling
13.4 IBM GALE 2011 System Description
13.4.1 Acoustic Models
13.4.2 Language Models
13.4.3 System Combination
13.4.4 System Architecture
13.5 From MSA to Dialects
13.5.1 Dialect Identification
13.5.2 ASR and Dialect ID Data Selection
13.5.3 Dialect Identification on GALE Data
13.5.4 Acoustic Modeling Experiments
13.5.5 Dialect ID Based on Text Only
13.6 Resources
13.6.1 Acoustic Training Data
13.6.2 Training Data for Language Modeling
13.6.3 Vowelization Resources
13.7 Comparing Arabic and Hebrew ASR
13.8 Summary
References
Dr. Imed Zitouni is a Principal Researcher at Microsoft leading the Relevance Measurement Sciences group. Imed received his M.Sc. and Ph.D. with highest honors from the University of Nancy 1, France.

In 1995, he obtained an MEng degree in computer science from ENSI in Tunisia. He is a senior member of IEEE and has served as a member of the IEEE Speech and Language Processing Technical Committee (99-11), the Information Officer of the ACL SIG on Semitic Languages, and an associate editor of the ACM journal TALIP. He is also a member of ISCA and ACL. Imed has served as chair and reviewing committee member for several conferences and journals, and he is the author or co-author of more than 100 patents and papers in international conferences and journals. Imed's research interest is in the area of Multilingual Natural Language Processing (NLP), including Information Retrieval, Information Extraction, Language Modeling, etc. Imed has a particular interest in advancing state-of-the-art technology in the area of Semitic NLP, especially Arabic.

Imed's current research interest is in the area of Multilingual Information Retrieval, focusing on the use of statistics and machine learning techniques to develop web-scale offline and online metrics. He is also working on the use of NLP to add a layer of semantics and understanding to search engines. Prior to joining Microsoft, Imed was a Senior Researcher at IBM for almost a decade, where he led several Multilingual NLP projects, including Arabic NLP, information extraction, semantic role labeling, language modeling, machine translation and speech recognition. Prior to IBM, Imed was a researcher at Bell Laboratories, Lucent Technologies, for almost half a dozen years, working on language modeling, speech recognition, spoken dialog systems and speech understanding. Imed also gained startup experience at DIALOCA in Paris, France, working on e-mail steering and language modeling, and served as a temporary assistant professor at the University of Nancy 1, France.