Arabic NLP: Overview, the State of the Art Challenges and Opportunities Ali Farghaly.


Overview (1) Challenges 1. to the Arabic language and culture 2. to Arabic NLP a – inherent properties of Arabic b – problems of Arabic Linguistics

Overview (2) Inherent Opportunities for the Arabic Language 1. Classical Arabic has survived for 15 centuries, while other languages failed to do so 2. Arabic is capable of reinventing itself 3. Classical Arabic is a living language in which 1.4 billion Muslims perform their daily prayers 4. The significance of the Arabic language culturally, strategically and linguistically

Overview (3) Why is NLP important? Fundamental transition from the Industrial Economy to the Knowledge Economy in the 1980s and 1990s Knowledge is coded in language Necessity for NLP systems to categorize, retrieve, translate, and/or answer questions from unstructured texts

Overview (4) NLP History Four generations of NLP Disappointment with the First Generation of Machine Translation Systems, ALPAC Report (1966) Second Generation of NLP Systems (1970’s-1980’s)

Overview (5) Third Generation NLP Systems 1990s – present Success of Statistical Approaches Problems with Statistical Approaches The Emergence of the Hybrid Approach (4th generation?)

Overview (6) Future Directions in Arabic NLP New Attitude towards Arabic Grammar Focus on Constituency The Need for Arabic Language Planning

Overview (7) Deal with syntactic ambiguity, co-reference, unbounded dependencies, phrasal constituencies, Pro Drop, etc. Clear Objectives of Arabic NLP for the Arab World Could be different from Arabic NLP for the Western World Conclusion

Challenges (1) To the Arabic language and culture The English language is becoming the language of the World Wide Web: e-mails, blogs, chats, etc., taking away functionalities from Arabic The number of books and papers published in the Arab countries is minimal compared to that produced in the USA and other English-speaking countries Thus, we consume rather than produce knowledge No first-class research universities in the Arab world

Challenges (2) Even when we report research, we do not use Arabic Globalization has intensified the influence of Western culture in the Arab World Almost all Arab universities teach science and mathematics in a foreign language

Challenges (3) To Arabic NLP Inherent properties of the Arabic language 1. The Arabic script (no short vowels and no capitalization) 2. Explosion of ambiguity (an average of 2.3 ambiguities per word in other languages versus 19.2 in Arabic). Example: 22 analyses of “ثمن” by Buckwalter (2004)

Challenges (4) 3. Complex word structure, e.g. “ورأيتهم” ‘and I saw them’ 4. The problem of Normalization: ا ، أ ، إ ، آ → ا, losing the distinction between أن ، إن ، آن 5. Arabic as a Pro Drop Language
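
To make the script and normalization points concrete, here is a minimal Python sketch. It assumes normalization means mapping the alef variants آ, أ and إ to bare ا, and that short vowels are the combining marks in the Unicode range U+064B to U+0652; the function names are hypothetical and do not come from the slides or from any Arabic NLP toolkit.

```python
import re

# Alef with madda (U+0622), hamza above (U+0623), hamza below (U+0625).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
# Arabic short-vowel and related marks: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")


def normalize_alef(text):
    """Map every alef variant to bare alef (U+0627)."""
    return ALEF_VARIANTS.sub("\u0627", text)


def strip_diacritics(text):
    """Remove short vowels, as in ordinary unvocalized Arabic writing."""
    return DIACRITICS.sub("", text)


# The three distinct words from the slide: أن ، إن ، آن
words = ["\u0623\u0646", "\u0625\u0646", "\u0622\u0646"]
normalized = {normalize_alef(w) for w in words}
print(len(words), "distinct words become", len(normalized), "form after normalization")

# Two vocalized readings of ثمن: thaman 'price' and thumn 'one eighth'.
readings = ["\u062B\u064E\u0645\u064E\u0646", "\u062B\u064F\u0645\u0652\u0646"]
unvocalized = {strip_diacritics(r) for r in readings}
print(len(readings), "readings share", len(unvocalized), "written form")
```

Both steps are many-to-one mappings, which is one way to see why an analyzer working on normalized, unvocalized text has to return several candidate analyses per token.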

Assumptions The Arabic language can meet all the needs of its speakers The Arabs were producers of knowledge at a time when the rest of the world was a consumer of knowledge Contemporary Arab scholars proved their ability to produce knowledge

Opportunities (1) Lessons from recent history Unprecedented accumulation of knowledge 1. Dramatic increase in the number of academic publications 2. Huge investment in R & D companies 3. Fundamental changes in industry and society similar to the Industrial Revolution 4. Impressive progress in many fields such as medicine, space exploration, computer software and hardware development etc.

The Knowledge Economy Fundamental Aspects of the Knowledge Economy 1. Strategic product is knowledge rather than manufactured goods 2. Industrial workers are replaced by knowledge workers 3. Global labor market 4. Democratization of knowledge

The Knowledge Economy & NLP (1) The age of on-line information, electronic communication, World Wide Web (WWW) Millions of documents are created every minute – from kilobytes -> megabytes -> gigabytes -> terabytes Explosion of knowledge can lead to explosion of ignorance

The Knowledge Economy & NLP (2) Democratization of knowledge through the use of the computer/cell phone as a communication tool Governments, industry, academia, and individuals desperately need tools to process information Information is coded in natural language

The Knowledge Economy & NLP (3) Globalization -> Multilingual applications such as machine translation and cross-language applications Information Retrieval (IR) and Information Extraction (IE) are becoming increasingly important Keyword search is being replaced by question answering systems Knowledge is encoded in natural language

NLP - Flashback The invention of the computer and language 1940s - First application: breaking the Nazis’ secret code - Second application: Russian to English machine translation (Weaver, 1949)

1st Generation of MT Principles of the first generation Capitalized on the speed of lookup offered by the computer MT is essentially a matter of correct pairing of the source language expressions with the target language equivalents Trivial reordering of words
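
The pairing principle above can be sketched as a word-for-word dictionary lookup. This is a toy illustration only: the Spanish-to-English pair, the four-entry lexicon, and the function name are assumptions chosen for brevity, not a reconstruction of any historical system.

```python
# Hypothetical toy bilingual dictionary (Spanish -> English).
TOY_LEXICON = {
    "el": "the",
    "gato": "cat",
    "negro": "black",
    "duerme": "sleeps",
}


def direct_translate(sentence, lexicon):
    """Replace each source word with its dictionary equivalent, in source order.

    Words missing from the dictionary are passed through unchanged.
    """
    words = sentence.lower().split()
    return " ".join(lexicon.get(word, word) for word in words)


result = direct_translate("El gato negro duerme", TOY_LEXICON)
print(result)
```

The output keeps Spanish word order ("cat black"), which shows why lookup alone, without any model of sentence structure, was not enough.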

Problems with 1st Generation MT Naïve concept of language structure Heavy reliance on bilingual dictionaries No attempt to mimic human translation Unrealistic goals and promises

2nd Generation MT (1) Principles of the Transfer Approach Three Components 1. Analysis of source language (SL) 2. Transfer of the structure of SL to TL 3. Generation of target language surface forms
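
The three components can be sketched as a small pipeline. Everything here is a simplifying assumption for illustration: the English-to-Spanish pair, the toy tag and word tables, the single adjective-noun reordering rule, and the function names. Real transfer systems used full parsers and large rule sets.

```python
# Hypothetical toy resources for an English -> Spanish example.
SOURCE_TAGS = {"the": "DET", "black": "ADJ", "cat": "NOUN", "sleeps": "VERB"}
TARGET_FORMS = {"the": "el", "black": "negro", "cat": "gato", "sleeps": "duerme"}


def analyze(sentence):
    """Step 1, analysis: tag each source word with a part of speech."""
    return [(w, SOURCE_TAGS.get(w, "UNK")) for w in sentence.lower().split()]


def transfer(tagged):
    """Step 2, transfer: map source structure to target structure.

    The only rule in this sketch swaps ADJ NOUN into NOUN ADJ.
    """
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out


def generate(tagged):
    """Step 3, generation: emit target-language surface forms."""
    return " ".join(TARGET_FORMS.get(word, word) for word, _ in tagged)


def translate(sentence):
    return generate(transfer(analyze(sentence)))


translation = translate("The black cat sleeps")
print(translation)
```

Separating the stages means the reordering rule operates on categories (ADJ, NOUN) rather than on individual words, so one rule covers every adjective-noun pair in the lexicon.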

2nd Generation MT (2) Basic Principles Linguistic knowledge is essential for the understanding of the source text Target specific domains for better translation More realistic goals and promises

2nd Generation MT (3) Positive developments in NLP technology: chart parsing (Woods 1970), unification grammar (Shieber 1986), definite clause grammar (Pereira 1980) Driven by the commercial market: The Georgetown System, Pan American Health Organization, EUROTRA Project (Interlingua approach), etc. Emergence of lexical approaches to grammar

Problems with 2nd Generation MT Limitations Linguistic knowledge is expensive Explosion of syntactic ambiguity (300 parses for each input sentence) Needed huge computing power Limited successes: The METAL system and the Canadian weather forecast translation system

Statistical Approaches to NLP Built on Probability theory Works well for specific domains Relies on training data (machine learning) Very fast Does not require linguistic knowledge
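
As a minimal sketch of what "built on probability theory" and "relies on training data" can mean in practice, the following estimates bigram probabilities by relative frequency from a three-sentence toy corpus. The corpus, the sentence-boundary markers, and the function names are illustrative assumptions; production systems use far larger corpora and smoothing.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count context words and word pairs, with sentence-boundary markers."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return context_counts, bigram_counts


def bigram_probability(context_counts, bigram_counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


corpus = ["the cat sleeps", "the cat eats", "the dog sleeps"]
context_counts, bigram_counts = train_bigram_model(corpus)

p_seen = bigram_probability(context_counts, bigram_counts, "the", "cat")
p_unseen = bigram_probability(context_counts, bigram_counts, "cat", "runs")
print(p_seen, p_unseen)
```

No grammar is written by hand: the estimates come entirely from counts. The unseen pair receives probability zero, a small-scale view of why such models degrade on text unlike their training data.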

3rd Generation of MT Systems (1) Principles Relies on the machine learning approach Benefits from the existence of huge corpora through the Internet Low development cost Rapid development time

3rd Generation of MT Systems (2) Heavy reliance on parallel corpora at several levels Does not require any linguistic knowledge: “Give me enough parallel corpus, and I will give you a machine translation system in hours” Represents an empirical approach to language “The proof is in the pudding” (Manning 2000) Unlike the transfer approach, does not attempt to mimic human translators

3rd Generation MT Systems (3) Benefited from: Computers becoming much faster, more powerful and less costly Accumulation of huge corpora on the Internet Availability of annotated Treebanks for training (Linguistic Data Consortium)

Problems with 3rd Generation MT Systems Limitations Performs well when dealing with data similar to the training set Performance deteriorates when documents are different from the training set There comes a point when adding more training data does not improve performance (The Threshold Problem)

Problems with 3rd Generation MT Systems There are domains where data is sparse Sometimes the training data itself is noisy (full of errors) Does not provide any insight into language, linguistics or the translation process

Arabic NLP Goals 1. Transfer of knowledge and technology to the Arab World 2. Modernize and fertilize the Arabic language 3. Improve and modernize Arabic linguistics 4. Make information retrieval, extraction, summarization and translation available to the Arab user

Arabic NLP History (1) Followed and integrated with mainstream NLP Kuwait: Mohammed Al-Sharikh & Nabil Ali – Sakhr Morocco: Hlaal (1979, 1985) on Arabic morphology Holland: Everhard Ditter on MSA US: The Weidner English/Arabic MT system

Arabic NLP History (2) IBM Scientific Centers in Kuwait and Cairo France: The Dinar Lexical Data Base, Joseph Dichy Language Resources and Human Language Technology work (ELRA/Elda Choukry)

Arabic NLP History (3) The Language Weaver Statistical Arabic to English MT system The SYSTRAN Arabic to English MT system The Apptek Arabic to English Hybrid MT The LDC Arabic Treebank, University of Pennsylvania

Arabic NLP History (4) The Prague Dependency Arabic Treebank Arabic Entity Extraction (Shaalan 2007; Zitouni 2008) Arabic Dialects Modeling Project at Columbia University, USA (Diab and Habash, 2007)

Future Directions in Arabic NLP (1) New Attitude toward Arabic Grammar The need for explicit description of MSA Consider the idafa: (1) مدير البنك (2) حاد الذكاء (3) فوق المنزل

Future Directions in Arabic NLP (2) The first is a noun phrase The second is an adjectival phrase The third is a prepositional phrase The description of all as idafa is not helpful to Arabic NLP

Future Directions in Arabic NLP (3) We need to focus on constituency without case endings. Consider: (1) قال الرجل أن الوزير قد استدعاه (2) قال الرجل أن الوزير قد أقاله الرئيس In the first, alwaziir is a subject, and in the second it is an object. In both sentences it is marked accusative

Future Directions in Arabic NLP (4) We need to describe rules for Arabic anaphoric relations Subjectless sentences (Pro Drop) Discourse Analysis Arabic love of nominalization

Future Directions in Arabic NLP (5) Defining MSA Mark differences between MSA and CA New Arabic grammars - acknowledging the heritage while being liberated from the paradigm A grammar that is more relevant to Arabic Information Retrieval and Arabic MT

Conclusion (1) Arabic NLP can help in transforming Arab societies Good progress has been achieved in Arabic NLP A more explicit grammar of MSA will enhance and speed the development of NLP systems Arabic needs to be restored as the language of science and research

Conclusion (2) Standards of usage need to be enforced to preserve Arabic as the expression of the Arabic identity Linguists need to do their homework by writing explicit grammars for discourse analysis, anaphoric relations, syntactic structures, etc.