1
Modeling Natural Language Syntax and Morphology in LingBench IDE™ Parser, Examples from English and Persian
Supervisor: Prof. Dr. Frank Van Eynde
Natlanco Senior Supervisor: Filip De Brabander
Examiner: Prof. Dr. Geert Adriaens
2
Introduction
Background
Defined Project and Tasks
Developing an efficient language model
Parsing at two linguistic levels
1. Morphology
2. Syntax
Conclusion
Tuesday, February 05, 2019 Peyman Nojoumian
3
What is Language Parsing?
Parsing, in its broader sense, is “the identification of parts of a sentence as subject, verb, object, etc. and of words in a sentence as noun (plural), verb (past tense), etc.” [5]
It assigns structures to language components.
It is an intermediate step in the semantic analysis of an input sentence.
4
Parsing Techniques
Originated from Chomsky's formal language theory.
Use computer algorithms and dynamic programming to analyze natural languages.
Based on different techniques: top-down and bottom-up parsing, finite-state automata (FSA), RTN, ATN, etc.
5
Formal Language Theory
Formal Languages
Formal Grammars
Automata (‘Recognizers’ or ‘Transducers’)
Parsers
Formal language: natural language (infinite sentences) = finite elements (vocabularies)
Grammars (automata): rule systems, e.g. S → NP VP
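As a rough illustration of how a finite vocabulary plus a rule system such as S → NP VP yields sentences, here is a minimal sketch. The toy grammar, its word list and the helper name `expand` are all invented for this example; the grammar is deliberately non-recursive so the full sentence set can be enumerated.

```python
from itertools import product

# Hypothetical toy grammar: each non-terminal maps to its alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"]],
    "N": [["dog"], ["cat"]],
    "V": [["sees"], ["sleeps"]],
}

def expand(symbol):
    """Return every word sequence derivable from `symbol` (grammar must be non-recursive)."""
    if symbol not in GRAMMAR:
        return [[symbol]]  # terminal: a one-word sequence
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of every right-hand-side symbol.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([word for part in parts for word in part])
    return results

sentences = [" ".join(words) for words in expand("S")]
print(len(sentences))
print(sentences[:3])
```

Adding a recursive rule (for example NP → NP PP) would make the sentence set infinite, which is the point of the slide; this enumerating sketch would then no longer terminate.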
6
Types of Formal Grammars
Type-0: Recursively enumerable languages (no restrictions on the rules)
Not interesting for natural languages because of their random and uncertain nature.
No uniform way to assign structures to sentences.
S → AB, AB → cde
Type-1: Context-sensitive grammar
Interesting for NL, decidable rules.
Can assign structures to sentences.
|A| → |B| / X_Y
in-possible → impossible
|n| → |m| / _ [labial consonants] (labialization rule in generative phonology)
Type-2: Context-free grammar
Left-hand side has only one node: S → NP VP
Good for showing derivations.
Decidable algorithms.
Large number of well-defined recognition algorithms.
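The labialization rule above (|n| becomes |m| before a labial consonant) can be sketched as a context-dependent rewrite. This is only an illustration: the labial set `p, b, m`, the use of `-` as a morpheme boundary and the function name `labialize` are assumptions made for the example.

```python
import re

# Assumption: the labial consonants are taken to be p, b and m.
LABIALS = "pbm"

def labialize(form: str) -> str:
    """Apply n -> m / _ [labial], looking across an optional '-' morpheme boundary."""
    # The lookahead leaves the labial untouched: it is the context, not the target.
    assimilated = re.sub(rf"n(?=-?[{LABIALS}])", "m", form)
    return assimilated.replace("-", "")

print(labialize("in-possible"))
print(labialize("in-tolerant"))
```

The first call rewrites the nasal because a labial follows; the second leaves it alone because the context does not match.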
7
Chomsky’s hierarchy
Type-3: Finite-state grammar (regular grammars)
The right-hand side contains 1 terminal node followed by 1 non-terminal node: A → aB
[Figure: nested hierarchy, 0-enumerable ⊃ 1-context-sensitive ⊃ 2-context-free ⊃ 3-regular]
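To make the four rule shapes concrete, the following sketch assigns a single rewrite rule to the most restrictive type it fits. It rests on two assumptions that are not in the slides: symbols written in upper case are non-terminals, and Type-1 is approximated by the non-contracting condition (the left-hand side is no longer than the right-hand side). The function name `rule_type` is hypothetical.

```python
def is_nonterminal(symbol: str) -> bool:
    # Assumed convention: upper-case symbols are non-terminals.
    return symbol.isupper()

def rule_type(lhs, rhs):
    """Return the most restrictive Chomsky type (3, 2, 1 or 0) that a rule lhs -> rhs fits."""
    if len(lhs) == 1 and is_nonterminal(lhs[0]):
        # Type-3 (right-linear): A -> a  or  A -> a B
        if len(rhs) == 1 and not is_nonterminal(rhs[0]):
            return 3
        if len(rhs) == 2 and not is_nonterminal(rhs[0]) and is_nonterminal(rhs[1]):
            return 3
        # A single non-terminal on the left is enough for Type-2.
        return 2
    # Type-1 approximated as non-contracting: |lhs| <= |rhs|.
    if len(lhs) <= len(rhs):
        return 1
    return 0

print(rule_type(["A"], ["a", "B"]))            # regular
print(rule_type(["S"], ["NP", "VP"]))          # context-free
print(rule_type(["X", "A", "Y"], ["X", "B", "Y"]))  # context-sensitive
print(rule_type(["A", "B"], ["c"]))            # unrestricted
```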
8
RTN & ATN
RTN: Recursive Transition Network.
It contains FSA (finite-state automata) as directed transitions of terminal or non-terminal nodes.
“In an RTN, every time the machine comes to an arc labeled with a non-terminal, it treats that non-terminal as a subroutine. It places its current location onto a stack, jumps to the non-terminal and then jumps back when that non-terminal has been parsed.” [5]
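The subroutine behaviour described in the quotation can be sketched with a small recursive recognizer. The three networks, the lexicon and the function names below are invented for illustration; Python's own call stack plays the role of the stack that remembers where to resume, and the toy networks avoid left recursion so the search terminates.

```python
# Hypothetical networks: start state 0, arcs per state, and a set of final states.
NETWORKS = {
    "S":  {"arcs": {0: [("NP", 1)], 1: [("VP", 2)]}, "final": {2}},
    "NP": {"arcs": {0: [("det", 1)], 1: [("adj", 1), ("noun", 2)]}, "final": {2}},
    "VP": {"arcs": {0: [("verb", 1)], 1: [("NP", 2)]}, "final": {1, 2}},
}
LEXICON = {"the": "det", "old": "adj", "dog": "noun",
           "cat": "noun", "sees": "verb", "sleeps": "verb"}

def traverse(net, state, pos, words):
    """Return every input position at which `net` can finish when resumed at `state`."""
    network = NETWORKS[net]
    ends = set()
    if state in network["final"]:
        ends.add(pos)
    for label, nxt in network["arcs"].get(state, []):
        if label in NETWORKS:
            # Non-terminal arc: run the sub-network like a subroutine, then resume here.
            for mid in traverse(label, 0, pos, words):
                ends |= traverse(net, nxt, mid, words)
        elif pos < len(words) and LEXICON.get(words[pos]) == label:
            # Terminal arc: consume one word of the matching category.
            ends |= traverse(net, nxt, pos + 1, words)
    return ends

def accepts(sentence: str) -> bool:
    words = sentence.split()
    return len(words) in traverse("S", 0, 0, words)

print(accepts("the old dog sees the cat"))
print(accepts("dog the sees"))
```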
9
Simple Demonstration
10
Augmented Transition Network
An extension to RTN.
Implements conditions and actions within the transitions (for unification).
Uses features.
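A minimal sketch of conditions and actions on transitions follows. It is a flat, single-network simplification (no recursion into sub-networks), and the lexicon, the `num` feature, the register dictionary and all function names are assumptions made for the example: an action stores a feature in a register, and later arcs carry a condition that must hold before the transition is taken.

```python
# Hypothetical lexicon with a category and a number feature per word.
LEXICON = {
    "this":  {"cat": "det",  "num": "sg"},
    "these": {"cat": "det",  "num": "pl"},
    "dog":   {"cat": "noun", "num": "sg"},
    "dogs":  {"cat": "noun", "num": "pl"},
    "barks": {"cat": "verb", "num": "sg"},
    "bark":  {"cat": "verb", "num": "pl"},
}

def always(regs, entry):
    return True

def agrees(regs, entry):
    # Condition: the word's number must match the number stored earlier.
    return regs["num"] == entry["num"]

def set_num(regs, entry):
    # Action: remember the number feature in a register.
    regs["num"] = entry["num"]

def no_action(regs, entry):
    pass

# Arc format: (from_state, category, condition, action, to_state)
ARCS = [
    (0, "det",  always, set_num,   1),
    (1, "noun", agrees, no_action, 2),
    (2, "verb", agrees, no_action, 3),
]
FINAL = {3}

def accepts(sentence: str) -> bool:
    state, regs = 0, {}
    for word in sentence.split():
        entry = LEXICON.get(word)
        if entry is None:
            return False
        for src, cat, cond, act, dst in ARCS:
            if src == state and cat == entry["cat"] and cond(regs, entry):
                act(regs, entry)
                state = dst
                break
        else:
            return False  # no arc was licensed for this word
    return state in FINAL

print(accepts("these dogs bark"))
print(accepts("this dogs bark"))
```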
11
BTN, RTN, ATN
From BTN (Basic Transition Network) [finite-state grammar]
via RTN (Recursive Transition Network) [context-free grammar]
to ATN (Augmented Transition Network) [unrestricted rewrite system]
12
LingBench IDE™
“Integrated Development Environment”: a program user interface that allows the user to create, modify, model, parse, and manipulate a set of linguistic data.
The core parser:
Mainly a combination of different parsing techniques.
Employs RTN, embeds FSA as phrases, uses probabilities to disambiguate, and uses conditional features to allow only well-formed transitions, in compatibility with ATN.
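The slides do not say how LingBench combines probabilities, so the following is only a generic sketch of probability-based disambiguation, not the tool's actual method. It assumes each competing analysis carries a list of transition probabilities and scores an analysis by their product; the function name and all numbers are invented for illustration.

```python
from math import prod

def best_analysis(analyses):
    """Pick the label whose transition probabilities have the highest product.

    `analyses` is a list of (label, [probabilities]) pairs.
    """
    return max(analyses, key=lambda item: prod(item[1]))[0]

# Invented scores for two readings of an ambiguous two-word input.
candidates = [
    ("noun verb", [0.6, 0.7]),
    ("verb noun", [0.2, 0.3]),
]
print(best_analysis(candidates))
```

For long sentences a real system would more likely sum log-probabilities to avoid numeric underflow; the plain product is kept here for readability.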
13
LingBench IDE™
Uses an enhanced 3-dimensional graphic environment for easy, fast and efficient language modeling.
Very easy to use.
Fast and reliable.
Very efficient parsing.
Demonstration
14
Hardware and Software Requirements
A Pentium-class PC with at least a 1 GHz CPU and 256 MB of RAM, equipped with an advanced graphics accelerator.
Works on Windows platforms.
A light version is available to download from the Internet:
15
Defined Project and Tasks
The project was defined to be done in three main phases. Developing language models is easier when it is done in cycles.
1. Getting acquainted with LingBench IDE™; spotting available corpora (lexicon and sentences); starting the modeling with 20 simple sentences and a limited lexicon; reporting the application’s bugs.
16
Defined Project and Tasks
2. Developing a lexicon of 1000 words and a sentence corpus of 150 sentences as the training set; developing further rules and testing them; reporting bugs.
3. Creating a lexicon of words and 250 sentences; using the corpus to design rules and testing them; reporting bugs (5 new versions).
17
Syntactic model of Persian Sentences
Modeling started with 7 phrases using 10 word classes. In the end, 16 rules (phrases) were created and a total of 28 word classes (POSs) were implemented, based on the training set.
18
Sentence Model
19
Sample of a Parsed Sentence
20
Morphological Model of Words
Using 2000 word stems of simple verbs and a number of related verb affixes, the morphological grammar module recognized a large number of our example Persian verb forms.
11 phrases and 36 morphological classes were designed and modeled in the morphological grammar.
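The stem-plus-affix idea can be sketched with a very small analyzer. Everything here is a toy assumption rather than the grammar built in LingBench: forms are given in Latin transliteration, only two verbs are listed, only the prefix mi- and the personal endings are handled, and present-stem forms are required to carry mi-.

```python
# Toy fragment in Latin transliteration (hypothetical, not the 2000-stem lexicon).
PRESENT_STEMS = {"rav": "go", "xor": "eat"}
PAST_STEMS = {"raft": "go", "xord": "eat"}
PRESENT_ENDINGS = {"am": "1sg", "i": "2sg", "ad": "3sg",
                   "im": "1pl", "id": "2pl", "and": "3pl"}
# In the past tense the third person singular has no overt ending.
PAST_ENDINGS = {"am": "1sg", "i": "2sg", "": "3sg",
                "im": "1pl", "id": "2pl", "and": "3pl"}

def analyze(word):
    """Return all (gloss, tense, aspect, person) analyses of a verb form."""
    analyses = []
    for prefix, aspect in (("mi", "imperfective"), ("", "plain")):
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for stems, tense, endings in (
            (PRESENT_STEMS, "present", PRESENT_ENDINGS),
            (PAST_STEMS, "past", PAST_ENDINGS),
        ):
            if tense == "present" and not prefix:
                continue  # toy restriction: present forms must carry mi-
            for stem, gloss in stems.items():
                suffix = rest[len(stem):]
                if rest.startswith(stem) and suffix in endings:
                    analyses.append((gloss, tense, aspect, endings[suffix]))
    return analyses

print(analyze("miravam"))    # mi-rav-am
print(analyze("raftam"))     # raft-am
print(analyze("mixordand"))  # mi-xord-and
```

Returning a list rather than a single result leaves room for ambiguous forms, which a fuller lexicon would produce.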
21
An example of the Morphological Analyzer
22
Further Functionalities of the LingBench IDE™
Lexicon development
Three lexicons: 1) syntactic 2) morphological 3) multi-word
Defining features and efficient probabilities
Defining variables in nodes and transitions to add more constraints
Unicode compatibility
23
Conclusion
It was expected that the system would be able to parse correctly at least 50% of the 200 sentences at the end of the 3rd phase.
100% of the sentences of the corpus are parsable.
The system is estimated to be able to parse 80% of the words and more than 50% of the sentences of a new test corpus.
24
Conclusion
Efficient, fast and reliable parsing was made possible by an enriched lexicon featuring subcategories, word frequencies and relative word probabilities, together with a simple model that combines features of the RTN and ATN and uses probabilities as well as feature structures.
Using the model for the parser module of my PhD thesis: “Designing a diacritizer for Persian”
25
References
1. Assi, S. Mostafa & M. Haji Abdolhosseini (2000). “Grammatical Tagging of a Persian Corpus”, International Journal of Corpus Linguistics, Vol. 5(1), PP
2. Bateni, M. (1998). “Towsife sakhtemane dasturiye zabane Farsi” (Persian Syntax). Amir Kabir Publishers: Tehran.
3. De Brabander, Filip (2003). LingBench IDE™ Users Manual.
4. Garside, R. et al., eds. (1997). Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman.
5. Jurafsky, Daniel & J.H. Martin (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, New Jersey.
6. Kennedy, Graeme (1998). An Introduction to Corpus Linguistics. Longman.
7. Lambton, Ann K. (1996). “Persian Grammar”. London: Cambridge University Press.
8. Lazard, G. (1992). “A Grammar of Contemporary Persian”. California & New York: Mazda Publishers.
9. Longman Dictionary of Language Teaching & Applied Linguistics (1992). London.
10. Manning, Christopher D. & H. Schütze (2001). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge.
11. Nojoumian, Peyman (1999). Design and Implementation of a Computer Assisted Language Learning (CALL) System for Persian. MA Dissertation, Alame Tabatabaei University, Tehran.
12. Sag, Ivan A. & Thomas Wasow (1999). Syntactic Theory: A Formal Introduction. CSLI, Stanford, USA.
13. Van Eynde, Frank & D. Gibbon, eds. (2000). Lexicon Development for Speech and Language Processing. Kluwer Academic Publishers, London.
14. Van Valin, Robert D. & Randy J. LaPolla (1997). Syntax: Structure, Meaning and Function. Cambridge University Press.