Computational Methods to Vocalize Arabic Texts H. Safadi*, O. Al Dakkak** & N. Ghneim**


Computational Methods to Vocalize Arabic Texts
H. Safadi*, O. Al Dakkak** & N. Ghneim**
*Faculty of Informatics Engineering
**Higher Institute of Applied Science and Technology

Outline
- Introduction
- Previous Works
- Our Work
- Implementation Overview
- Results
- Future Works
- Conclusion
- References & More Information

Introduction
There are several types of vowels in Arabic:
- Long vowels: /A/, /w/, /y/
- Short vowels: /a/ (Fatha), /u/ (Damma), /i/ (Kasra)
- Other symbols: /F/ Tanween-Fateh ('an'); /N/ Tanween-Damm ('un' or 'on'); /K/ Tanween-Kasir ('in' or 'en'); /o/ Sukun, which marks a consonant that is not followed by a vowel; /~/ Shadda, which marks a doubled consonant
The long vowels are in fact long /a/, /u/, /i/. They differ from the corresponding short vowels only in duration.

Introduction
- Short vowels and the other symbols are part of the word and are written as additional marks above or below the letters.
- These marks are usually not written, because an Arabic reader can guess them from knowledge of the language and from the context.
- They are written only when strictly necessary, in cases where the word would be too ambiguous without them.
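To make the marks concrete, here is a minimal Java sketch that maps the transliteration symbols from the introduction to Unicode combining marks and strips them from a vocalized word, which yields the unvocalized form a vocalization system receives as input. The class and method names are hypothetical and not part of the presented system; Kasra is keyed as lowercase i here, and the code points are those of the standard Unicode Arabic block.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the presented system.
final class ArabicDiacritics {
    // Transliteration symbol -> Unicode combining mark (Arabic block).
    static final Map<Character, Character> MARKS = new LinkedHashMap<>();
    static {
        MARKS.put('F', '\u064B'); // Tanween-Fateh
        MARKS.put('N', '\u064C'); // Tanween-Damm
        MARKS.put('K', '\u064D'); // Tanween-Kasir
        MARKS.put('a', '\u064E'); // Fatha
        MARKS.put('u', '\u064F'); // Damma
        MARKS.put('i', '\u0650'); // Kasra
        MARKS.put('~', '\u0651'); // Shadda
        MARKS.put('o', '\u0652'); // Sukun
    }

    // True for the eight marks listed above (U+064B..U+0652).
    static boolean isMark(char c) {
        return c >= '\u064B' && c <= '\u0652';
    }

    // Removes the marks, leaving only the base letters.
    static String strip(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!isMark(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Calling `ArabicDiacritics.strip` on a fully vocalized word returns its three or four bare letters; vocalization is the inverse problem of restoring the removed marks.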

Purpose
While this problem may be trivial for a native Arabic speaker, it is not trivial for computers or for new learners of the language.
Examples of applications that require vocalized Arabic text:
- Educational tools for children and learners
- Search engines
- Text-to-speech engines
- Text mining tools

Previous Works
Despite the abundance of computational studies of Arabic, Arabic vocalization has not been studied enough.
- Sakhr has a commercial system for Arabic vocalization. Unfortunately, the system is completely closed.
- Y. Gal: an HMM trained on vocalized Arabic text (the Holy Quran), with 85% correct vocalization on the same corpus.
- R. Nelken and S. Shieber: weighted finite-state transducers trained on the LDC corpus of M. Maamouri et al., with 93% correct vocalization.

Drawbacks of Previous Works
The previous works are useful attempts to solve the problem. However, they tackle it with a top-down approach: they build a model and train it on a corpus. The problem with this approach is that it depends heavily on the corpus. For example, the Quranic texts used by Gal are not a good representative of modern Arabic, and the newspaper archives in the LDC corpus do not cover all the topics in the language.

Our Proposed System
Our work uses a bottom-up approach, in which we perform a linguistic analysis of Arabic texts in four steps:
1. Parsing the text.
2. Analyzing the text morphologically.
3. Part-of-speech (POS) tagging of the text.
4. Applying linguistic heuristic rules.

Parsing In this step, the text is split into phrases. Each phrase is also split into words. The process is simple; it uses regular expressions to parse the text.
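The slides do not give the regular expressions used, so the following is only a sketch of how such a two-level split could look in Java. The delimiter set (Latin punctuation plus the Arabic comma, semicolon and question mark) and the class name `TextSplitter` are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the parsing step; the delimiter set is an assumption.
final class TextSplitter {
    // Phrase boundaries: Latin punctuation, line breaks, and the Arabic
    // comma (U+060C), semicolon (U+061B) and question mark (U+061F).
    private static final String PHRASE_DELIMS = "[.!?;:,\\n\\u060C\\u061B\\u061F]+";

    // Splits text into phrases, and each phrase into whitespace-separated words.
    static List<List<String>> split(String text) {
        List<List<String>> phrases = new ArrayList<>();
        for (String phrase : text.split(PHRASE_DELIMS)) {
            String trimmed = phrase.trim();
            if (trimmed.isEmpty()) {
                continue; // skip gaps produced by leading or adjacent delimiters
            }
            List<String> words = new ArrayList<>();
            for (String word : trimmed.split("\\s+")) {
                words.add(word);
            }
            phrases.add(words);
        }
        return phrases;
    }
}
```

Each inner list is one phrase, so later steps that look at a word's context can stay within phrase boundaries.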

Morphological Analysis (MA)
Each word is passed to the MA, which returns all the possible vocalizations of the non-vocalized word, together with the POS tag of each possibility.
We use the Buckwalter Arabic MA, which:
- considers each word as a combination of prefix, stem, and suffix;
- has a dictionary of prefixes, stems, and suffixes;
- has two compatibility tables: prefix-stem and stem-suffix.

MA algorithm
- All possible segmentations of a word into prefix-stem-suffix are considered (as long as the stem length is not zero).
- For each segmentation, the prefix, stem, and suffix are looked up in the dictionaries.
- If all three are found, their compatibility is checked; if they are compatible, the segmentation is kept as a possible analysis of the word.
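The enumeration described on this slide can be sketched in Java as follows. This is a simplified illustration and not the Buckwalter analyzer's code: the class name is hypothetical, the dictionaries are plain string sets, and the two compatibility tables are modelled as sets of "prefix|stem" and "stem|suffix" strings, which is an assumption made to keep the example small. The empty string must be present in the prefix and suffix sets for words without affixes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified sketch of prefix-stem-suffix enumeration.
final class PrefixStemSuffixAnalyzer {
    private final Set<String> prefixes;
    private final Set<String> stems;
    private final Set<String> suffixes;
    private final Set<String> prefixStemPairs; // entries of the form "prefix|stem"
    private final Set<String> stemSuffixPairs; // entries of the form "stem|suffix"

    PrefixStemSuffixAnalyzer(Set<String> prefixes, Set<String> stems, Set<String> suffixes,
                             Set<String> prefixStemPairs, Set<String> stemSuffixPairs) {
        this.prefixes = prefixes;
        this.stems = stems;
        this.suffixes = suffixes;
        this.prefixStemPairs = prefixStemPairs;
        this.stemSuffixPairs = stemSuffixPairs;
    }

    // Returns every {prefix, stem, suffix} split that passes both checks.
    List<String[]> analyze(String word) {
        List<String[]> analyses = new ArrayList<>();
        int n = word.length();
        for (int p = 0; p <= n; p++) {          // prefix = word[0, p)
            for (int s = p + 1; s <= n; s++) {  // stem = word[p, s), never empty
                String prefix = word.substring(0, p);
                String stem = word.substring(p, s);
                String suffix = word.substring(s);
                boolean inDictionaries = prefixes.contains(prefix)
                        && stems.contains(stem)
                        && suffixes.contains(suffix);
                if (!inDictionaries) {
                    continue;
                }
                boolean compatible = prefixStemPairs.contains(prefix + "|" + stem)
                        && stemSuffixPairs.contains(stem + "|" + suffix);
                if (compatible) {
                    analyses.add(new String[] {prefix, stem, suffix});
                }
            }
        }
        return analyses;
    }
}
```

In a full analyzer each dictionary entry would also carry its vocalized form and POS tag, so every surviving segmentation yields one candidate vocalization.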

Part-of-speech tagging
After MA, we must choose the correct POS.
- We built a POS tagger for Arabic using unsupervised transformation-based learning on texts collected from the Internet and covering multiple disciplines. (We could not afford the LDC Arabic corpus.)
- The generated rules are then examined manually.
- For some words, POS tagging cannot resolve the ambiguity; we therefore need an additional level of processing.
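The slides do not show the learned rules or their templates. As an illustration only, the sketch below applies one transformation of a common kind used in transformation-based tagging: change tag X to tag Y when the preceding word has tag Z. The class name, the rule template, and the tag names used in the usage note are assumptions; the learning procedure itself is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: applies one "change fromTag to toTag after previousTag" rule.
final class TransformationRule {
    private final String fromTag;
    private final String toTag;
    private final String previousTag;

    TransformationRule(String fromTag, String toTag, String previousTag) {
        this.fromTag = fromTag;
        this.toTag = toTag;
        this.previousTag = previousTag;
    }

    // Returns a new tag sequence; the input list is left unchanged.
    // Tags are rewritten left to right, so each check sees the
    // already-updated tag of the preceding word.
    List<String> apply(List<String> tags) {
        List<String> result = new ArrayList<>(tags);
        for (int i = 1; i < result.size(); i++) {
            if (result.get(i).equals(fromTag) && result.get(i - 1).equals(previousTag)) {
                result.set(i, toTag);
            }
        }
        return result;
    }
}
```

For example, a rule built as `new TransformationRule("VERB", "NOUN", "PREP")` retags a word as a noun when it directly follows a preposition. A tagger would apply an ordered list of such rules to an initial tagging.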

Heuristic rules of disambiguation
- Heuristic disambiguation rules choose a certain POS based on the word and its context. Example: "if the word is shorter than 3 letters and one of its possible parts of speech is preposition, then choose preposition".
- Finally, if the ambiguity still remains after all these levels, one of the candidate POS tags is chosen at random.
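The example rule and the random fallback can be sketched in Java as below. Only the one rule quoted on the slide is encoded; the class name, the tag label "PREP", and passing the random source as a parameter are assumptions made for illustration. The word is expected in unvocalized form, so that its length counts letters only.

```java
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the last disambiguation level.
final class HeuristicDisambiguator {
    // Picks one tag from the candidates left after POS tagging.
    static String choose(String word, List<String> candidateTags, Random random) {
        if (candidateTags.isEmpty()) {
            throw new IllegalArgumentException("no candidate tags for: " + word);
        }
        if (candidateTags.size() == 1) {
            return candidateTags.get(0); // nothing left to disambiguate
        }
        // Rule from the slide: short word that can be a preposition.
        if (word.length() < 3 && candidateTags.contains("PREP")) {
            return "PREP";
        }
        // Fallback: random choice among the remaining candidates.
        return candidateTags.get(random.nextInt(candidateTags.size()));
    }
}
```

Passing a seeded `Random` makes the fallback reproducible, which helps when comparing runs during evaluation.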

Implementation Overview
- We implemented the system in the Java programming language, together with some additional tools and a GUI.
- The implementation allows the user to run each part alone or all of them together.
- Syntactic analysis is not implemented, so final vowels are not considered.

The System's Different Parts

Results
- Because of the lack of digital vocalized Arabic texts, we did not have the opportunity to test our system in depth.
- However, we ran some empirical tests: we vocalized large Arabic texts automatically and gave the results to experts for evaluation.
- The evaluation showed 80-90% correct vocalization.

Future Works – 1
- Syntactic analysis: although texts without the final vowels are well understood, adding this level of analysis will certainly improve performance.
- Semantic analysis: some semantic analysis can help improve performance (for example, when a word has several possibilities with the same POS).

Future Works – 2
- Pragmatic analysis: this type of analysis is useful for conversations and idiomatic expressions.
- Using a supervised method, trained on a tagged corpus, can significantly enhance part-of-speech tagging performance.
- Our work is not finished, and we are still working on improving it.

Conclusion
- Restoring vocalization in Arabic text is essential for computational applications.
- Some attempts have been made to solve the problem.
- We have provided a solution based on linguistic analysis.
- The system has been implemented, and the results are promising.
- We plan to improve and enhance the system.

Thank you for your attention.
Time for questions.