I-AIBS Institute for Artificial Intelligence and Biological Systems


Arabic Language Computing applied to the Quran
A PhD research project by Kais Dukes
I-AIBS Institute for Artificial Intelligence and Biological Systems, School of Computing, University of Leeds

The Challenge: An interdisciplinary approach to understanding the Quran
- (1) Quranic Studies
- (2) Traditional Arabic Linguistics
- (3) Computational Linguistics

(1) What is the Quran?
The last in a series of 5 religious texts:

Holy Book                   Prophet           Text Dated
Suhuf Ibrahim (Scrolls)     Abraham           ?
The Tawrat (Torah)          Moses             1500 BCE?
The Zabur (Psalms)          David             1000 BCE?
The Injil (Gospel)          Jesus             1 CE
The Quran                   Muhammad (PBUH)   610-632 CE

(1) What is the Quran?
The central religious text of Islam
- Classical Arabic, 1300+ years ago
- All believers should learn the text; translations are “interpretations”
- Islamic Law (legal logic)
- Divine guidance & direction
- Science and philosophy
- Has inspired Algebra, Linguistics

(2) Traditional Arabic Linguistics
Originated with Arab scholars studying the language of the Quran (scientific analysis for at least 1000 years, a much older tradition than the linguistic study of English):
- Orthography (diacritics and vowelization)
- Etymology (Semitic roots)
- Morphology (derivation and inflection)
- Syntax (origins of dependency grammar)
- Discourse Analysis & Rhetoric
- Semantics & Pragmatics

(3) Computational Linguistics
- The Quran is online, for keyword search
- BUT verse-by-verse translations are interpretations
- Muslims should access the “true” Classical Arabic source

(3) Computational Linguistics
- How far can we go?
- Is an Artificial Intelligence system realistic?
Example question-answering dialog system:
Question: How long should I breastfeed my child for?
Answer: Mothers should suckle their offspring for two years, if the father wishes to complete the term (The Holy Quran, Verse 2:233).

An AI approach to understanding the Quran
Central Hypothesis: Augmenting the text of the Quran with rich annotation will lead to a more accurate AI system.
- Prepare the data by annotating the Quran.
- Use the data to build an AI system for concept search and question-answering.

Annotating the Quran: Challenges
- Orthography: complex non-standard script
- Morphology (word structure): Arabic is highly inflected and challenging to analyze
- Grammar: phrase structure, dependency
- Semantics: ontology of entities and concepts referred to by pronouns and nouns

Annotating the Quran: Solutions
- Computing advances have made annotation possible, to high accuracy
- Leverage existing resources from Traditional Arabic Grammar
- Machine-Learning annotation followed by manual verification
- Community effort using online volunteers

Recent Advances: Orthography
An accurate digital copy of the Quran?
Encoding issues:
- Missing diacritics
- Simplified script (not Uthmani)
- Windows code page 1256, not Unicode
- A Google search for verse (68:38) on Jan 21, 2008 shows many typos

Recent Advances: Orthography
Tanzil Project (http://tanzil.info)
- Stable version released May 2008
- Uses Unicode XML encoding, including the special characters designed for the complex Arabic script of the Quran
- Manually verified to 100% accuracy by a group of experts who have memorized the entire text of the Quran

Recent Advances: Orthography
Java Quran API (http://jqurantree.org) (Dukes 2009)
- Java classes for querying the Tanzil XML of the Quran
- Gives authentic script on web pages

Recent Advances: Morphology
- Buckwalter Arabic Morphological Analyzer (Tim Buckwalter, 2002)
- Morphological Analysis of the Quran at the University of Haifa (Shuly Wintner, 2004)
- Lexeme & feature based morphological representation of Arabic (Nizar Habash, 2006)

The Haifa Corpus (2004)
- Multiple analyses for each word (up to 5), for example:
  rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg
  rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen
- Not manually verified
- The authors report an F-measure of 86%
- Non-standard annotation scheme, not familiar to traditional Arabic linguists, e.g. extracting a list of all verbs is non-trivial
- Arabic text is encoded only phonetically instead of in the original Arabic script, e.g. searching for a specific root is not easy
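
To illustrate why such searches take extra work, the following Python sketch splits one of the analysis strings quoted above into fields. The field order (root, pattern, part of speech, then features) is an assumption read off the two examples on the slide, not a documented specification of the Haifa format, and the function names are hypothetical.

```python
from typing import Dict, List


def parse_analysis(analysis: str) -> Dict[str, object]:
    """Split a '+'-separated analysis string into labelled fields.

    Assumed layout (inferred from the slide's two examples only):
    root + pattern + part-of-speech + zero or more features.
    """
    root, pattern, pos, *features = analysis.split("+")
    return {"root": root, "pattern": pattern, "pos": pos, "features": features}


def analyses_with_root(analyses: List[str], root: str) -> List[str]:
    """Return the analyses whose (phonetically encoded) root matches."""
    return [a for a in analyses if parse_analysis(a)["root"] == root]


examples = [
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg",
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen",
]
first = parse_analysis(examples[1])
```

Because the root is stored in a phonetic encoding ("rbb"), a user still has to translate an Arabic-script query into that encoding before calling a helper like `analyses_with_root`.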

The Quranic Arabic Corpus (http://corpus.quran.com/)
Kais Dukes, Arabic Language Computing Applied to the Quran, PhD (part-time)
- Word structure: colour-coded morphological analysis
- Translation: word-for-word English translations
- Grammar: dependency parse following Arabic tradition
- Semantics: ontology of entities and concepts
- Machine Learning: annotations used for A.I. training
- Impact: dozens of researchers have collaborated or cited the corpus, and a million visitors have used the website this year

The Quranic Arabic Corpus: Verified Uthmani Script
- Unicode Uthmani Script
- Sourced from the verified Tanzil project

The Quranic Arabic Corpus: Phonetics (faja'alnāhumu)
- Phonetic transcription generated algorithmically
- Guided by Arabic vowelized diacritics
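
As a rough illustration of diacritic-guided transcription, the Python sketch below walks through a fully vowelized word and emits a Latin letter for each consonant and short-vowel mark. It is a toy, not the corpus's actual algorithm: the letter table covers only the slide's example word, and the Arabic spelling used for that word (written as code points, with a plain alif for the long vowel) is an assumption.

```python
# Toy table: only the consonants needed for the slide's example word.
CONSONANTS = {
    "\u0641": "f",   # fa
    "\u062C": "j",   # jim
    "\u0639": "'",   # ayn
    "\u0644": "l",   # lam
    "\u0646": "n",   # nun
    "\u0647": "h",   # ha
    "\u0645": "m",   # mim
}
SHORT_VOWELS = {"\u064E": "a", "\u064F": "u", "\u0650": "i"}  # fatha, damma, kasra
SUKUN = "\u0652"                     # marks a consonant with no following vowel
LENGTHENERS = {"\u0627", "\u0670"}   # alif and superscript (dagger) alif


def transliterate(word: str) -> str:
    """Transcribe a vowelized Arabic word using the diacritics as a guide."""
    out = []
    for ch in word:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in SHORT_VOWELS:
            out.append(SHORT_VOWELS[ch])
        elif ch in LENGTHENERS and out and out[-1] == "a":
            out[-1] = "\u0101"       # fatha followed by alif becomes long a
        elif ch == SUKUN:
            continue                 # no vowel to emit
        else:
            raise ValueError(f"character not covered by this toy table: {ch!r}")
    return "".join(out)


# Assumed spelling of the example word: fa-ja-'a-l-na-(alif)-hu-mu.
EXAMPLE = ("\u0641\u064E\u062C\u064E\u0639\u064E\u0644\u0652"
           "\u0646\u064E\u0627\u0647\u064F\u0645\u064F")
result = transliterate(EXAMPLE)
```

A real transcription would also need rules for gemination, the definite article, and pause forms, which this sketch leaves out.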

The Quranic Arabic Corpus: Interlinear translation
- Word-for-word translation from accepted sources
- Interlinear translation scheme

The Quranic Arabic Corpus: Location Reference (21:70:4)
- Common standard for verses: (Chapter:Verse)
- Extended in the Quranic Arabic Corpus (QAC) to include word numbers and segment numbers, e.g. (21:70:4:2)
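
The reference scheme can be sketched in Python as a small parser that accepts two, three, or four colon-separated numbers. The `Location` type and `parse_location` function are hypothetical names for illustration and are not part of the corpus software.

```python
import re
from typing import NamedTuple, Optional


class Location(NamedTuple):
    chapter: int
    verse: int
    word: Optional[int] = None      # present in (chapter:verse:word)
    segment: Optional[int] = None   # present in (chapter:verse:word:segment)


# Optional surrounding parentheses, then 2 to 4 colon-separated integers.
_LOCATION = re.compile(r"^\(?(\d+):(\d+)(?::(\d+))?(?::(\d+))?\)?$")


def parse_location(text: str) -> Location:
    """Parse a reference such as '(21:70:4:2)' into its numeric parts."""
    match = _LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"not a location reference: {text!r}")
    return Location(*(int(g) if g is not None else None for g in match.groups()))


segment_ref = parse_location("(21:70:4:2)")
verse_ref = parse_location("21:70")
```

Keeping the word and segment numbers optional lets the same type address a whole verse, a single word, or one morphological segment.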

The Quranic Arabic Corpus: Morphological Segmentation
- Division of a single word into multiple segments
- Part-of-speech tag assigned to each segment
- Traditional Arabic Grammar rules used for division

The Quranic Arabic Corpus Morphological segment features

The Quranic Arabic Corpus Arabic Grammar Summary

The Quranic Arabic Treebank: Syntactic Annotation
- Dependency Grammar based on إعراب (i'rāb)
- Syntactico-semantic roles for each word

The Quranic Arabic Treebank Ontology of entities and concepts linked to/from nouns and pronouns in the text

The Quranic Arabic Treebank: Framework for collaboration
- Message Board: “If you come across a word and you feel that a better analysis could be provided, you can suggest a correction online by clicking on an Arabic word” (currently 5228 resolved messages; 1048 under review)
- Resources: Publications, Citations, Reviews, FAQs, Feedback, Data download, Software download, Mailing list

The Quranic Arabic Treebank: Users (researchers, public)
- Artificial Intelligence and Computational Linguistics
- Arabic linguistics
- Quranic and Islamic Studies
- Classical literature analysis
- Anyone who wants to appreciate the Quran

The Quranic Arabic Treebank: new Computational Linguistics?
- First Treebank of Classical Arabic
- Free Treebank of the Quran
- First formal representation of Traditional Arabic Grammar using constituency/dependency graphs
- Machine-Learning parser

The Quranic Arabic Corpus: Part-of-speech Tagging
- Part-of-speech tags adapted from Traditional Arabic Grammar, and mapped to English equivalents (not the other way around)
- These tags apply to words in the Quran, as well as to individual morphological segments in the text

Tag     Name                       Arabic Name
N       Noun                       اسم
PN      Proper noun                اسماء علم
PRON    Personal pronoun           ضمير
DEM     Demonstrative pronoun      اسم اشارة
REL     Relative pronoun           اسم موصول
ADJ     Adjective                  صفة
V       Verb                       فعل
P       Preposition                حرف جر
PART    Particle                   حرف
INTG    Interrogative particle     حرف استفهام
VOC     Vocative particle          حرف نداء
NEG     Negative particle          حرف نفي
FUT     Future particle            حرف استقبال
CONJ    Conjunction                حرف عطف
NUM     Number                     رقم
T       Time adverb                ظرف زمان
LOC     Location adverb            ظرف مكان
EMPH    Emphatic lām prefix        لام التوكيد
PRP     Purpose lām prefix         لام التعليل
IMPV    Imperative lām prefix      لام الامر
INL     Quranic initials           حروف مقطعة
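
For programmatic use, the tag set above can be held in a simple lookup table. The Python sketch below copies the tags and English names from the slide; the `expand_tags` helper is a hypothetical convenience function, not part of the corpus software.

```python
from typing import Dict, List

# Tag -> English name, copied from the slide's tag table.
POS_TAGS: Dict[str, str] = {
    "N": "Noun",
    "PN": "Proper noun",
    "PRON": "Personal pronoun",
    "DEM": "Demonstrative pronoun",
    "REL": "Relative pronoun",
    "ADJ": "Adjective",
    "V": "Verb",
    "P": "Preposition",
    "PART": "Particle",
    "INTG": "Interrogative particle",
    "VOC": "Vocative particle",
    "NEG": "Negative particle",
    "FUT": "Future particle",
    "CONJ": "Conjunction",
    "NUM": "Number",
    "T": "Time adverb",
    "LOC": "Location adverb",
    "EMPH": "Emphatic l\u0101m prefix",
    "PRP": "Purpose l\u0101m prefix",
    "IMPV": "Imperative l\u0101m prefix",
    "INL": "Quranic initials",
}


def expand_tags(tags: List[str]) -> List[str]:
    """Replace each tag with its English name; unknown tags raise KeyError."""
    return [POS_TAGS[tag] for tag in tags]


names = expand_tags(["V", "PRON"])
```

Since the same tags apply to whole words and to individual segments, one table serves both levels of annotation.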

Automatic Annotation: Classical Arabic Dependency Parser
- Joakim Nivre (2009): dependency parsing using a shift/reduce queue/stack architecture with machine learning
- The custom parser follows a similar architecture, but with hand-written rules, and has an F-measure of 77.2%
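
The shift/reduce idea can be sketched in Python as a stack, a queue, and a rule function that decides the next action. This is a minimal illustration and not the project's parser: the three action types are the standard shift, left-arc, and right-arc moves, while the rules, the relation labels ("neg", "subj", "gen", "pp"), and the placeholder words are all hypothetical. Only the part-of-speech tags come from the corpus tag set.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]      # (word, part-of-speech tag)
Arc = Tuple[int, int, str]   # (head index, dependent index, relation label)


def choose_action(second: str, top: str,
                  lookahead: Optional[str]) -> Optional[Tuple[str, str]]:
    """Hand-written toy rules over the two topmost stack tags and the next queue tag."""
    if second == "NEG" and top == "V":
        return ("LEFT", "neg")     # the verb heads the preceding negative particle
    if second == "V" and top == "N":
        return ("RIGHT", "subj")   # the verb heads the following noun
    if second == "P" and top == "N":
        return ("RIGHT", "gen")    # the preposition heads its noun
    if second == "V" and top == "P" and lookahead != "N":
        return ("RIGHT", "pp")     # attach only once the preposition has its noun
    return None                    # no rule fires: shift instead


def parse(tokens: List[Token]) -> List[Arc]:
    stack: List[int] = []
    queue: List[int] = list(range(len(tokens)))
    arcs: List[Arc] = []
    while queue or len(stack) > 1:
        action = None
        if len(stack) >= 2:
            lookahead = tokens[queue[0]][1] if queue else None
            action = choose_action(tokens[stack[-2]][1], tokens[stack[-1]][1], lookahead)
        if action is not None:
            direction, label = action
            # LEFT removes the second-topmost item, RIGHT removes the topmost.
            dependent = stack.pop(-2) if direction == "LEFT" else stack.pop()
            arcs.append((stack[-1], dependent, label))
        elif queue:
            stack.append(queue.pop(0))   # shift the next word onto the stack
        else:
            break                        # nothing to shift and no rule applies
    return arcs


# Placeholder words w1..w5; only the tags matter to the rules.
sentence = [("w1", "NEG"), ("w2", "V"), ("w3", "N"), ("w4", "P"), ("w5", "N")]
arcs = parse(sentence)
```

In a machine-learning parser of the kind the slide attributes to Nivre, a trained classifier would take the place of `choose_action`; the stack and queue mechanics stay the same.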

University of Leeds Postgraduate Researcher Conference 2011
Criteria for “PGR Researcher of the Year 2011”:
- Ability to communicate research to the lay and non-specialist research audience
- Impact/potential impact of the research in terms of e.g. application of findings for economic or social benefit; the significance of the contribution/potential contribution of the research to the academic subject area
- Evidence of local or national publicity or public engagement

Ability to communicate research to the lay and non-specialist audience
Example Feedback (319 comments):
- “I would like to applaud you for your effort” (Prof Behnam Sadeghi, Stanford University)
- “We are big admirers of the work” (Prof Gregory Crane, Classics Dept, Tufts University)
- “I regularly use your work on the Qur'an and read it whenever I can.” (Prof Yousuf Islam, Director, Daffodil International University)
- “Congratulations to all concerned on this project” (Prof Michael Arthur, VC, Leeds Uni)

Impact: application of findings for economic or social benefit
Over a million users already, and growing; many unforeseen social benefits, e.g.:
“I work as a chaplain in correctional centers in the State of Missouri, U.S.A. Thanks for your permission to use the Quranic Arabic Corpus in these correctional centers” (Tadar Wazir)

Impact: significance of the research to the academic subject area
- 10 papers in research conferences & journals
- 25 citations (from Google Scholar) so far
- Positive feedback from top researchers
- Only free-to-download Arabic treebank
- A de-facto standard data-set for AI research

Evidence of local or national publicity or public engagement
- Newspapers, e.g. Muslim Post
- Better still, the website: world-wide public engagement!

Conclusion
This is not the end. Still to come: 2nd half of PhD project; and more?
Kais Dukes, I-AIBS Institute for Artificial Intelligence and Biological Systems, School of Computing, University of Leeds