வணக்கம்.

Slides:



Advertisements
Similar presentations
Three Basic Problems Compute the probability of a text: P m (W 1,N ) Compute maximum probability tag sequence: arg max T 1,N P m (T 1,N | W 1,N ) Compute.
Advertisements

Morphology.
Noun. Noun - verb noun Noun - verb article- adj. - adj. - Noun - verb.
Part of Speech Tagging Importance Resolving ambiguities by assigning lower probabilities to words that don’t fit Applying to language grammatical rules.
Bilingual Dictionaries
Morphology Nuha Alwadaani.
Morphology Chapter 7 Prepared by Alaa Al Mohammadi.
1 A Hidden Markov Model- Based POS Tagger for Arabic ICS 482 Presentation A Hidden Markov Model- Based POS Tagger for Arabic By Saleh Yousef Al-Hudail.
Lecture -3 Week 3 Introduction to Linguistics – Level-5 MORPHOLOGY
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Dictionary.
Noun Cluster. DEFINITION (fragment from Definition M1 in LSEG) "Noun" is the sentence element representing people, animals, things, abstract notions,
Kalyani Patel K.S.School of Business Management,Gujarat University.
Logics for Data and Knowledge Representation Exercises: Languages Fausto Giunchiglia, Rui Zhang and Vincenzo Maltese.
Morphology For Marathi POS-Tagger Veena Dixit 11/ 10 /2005.
Computational Methods to Vocalize Arabic Texts H. Safadi*, O. Al Dakkak** & N. Ghneim**
Prof. Erik Lu. MORPHOLOGY GRAMMAR MORPHOLOGY MORPHEMES BOUND FREE WORDS LEXICAL GRAMMATICAL NOUNS VERBS ADJECTIVES (ADVERBS) PRONOUNS ARTICLES ADVERBS.
Macedonian DELAS – first results Aleksandar Petrovski Tetovo, Macedonia.
Combining terminology resources and statistical methods for entity recognition: an evaluation Angus Roberts, Robert Gaizauskas, Mark Hepple, Yikun Guo.
Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop Nizar Habash and Owen Rambow Center for Computational Learning.
CSA2050 Introduction to Computational Linguistics Lecture 3 Examples.
Dr.Chisolm What’s happening twitter.com/DrChisolmPlace
SYSTEMS ANALYSIS AND DESIGN TOOLS DATA FLOW DIAGRAMS.
Morphology A Closer Look at Words By: Shaswar Kamal Mahmud.
_____________________ Definition Part of Speech (circle one) Picture Antonym (Opposite) Vocab Word Noun Pronoun Adjective Adverb Conjunction Verb Interjection.
Word classes and part of speech tagging Chapter 5.
Linguistics The ninth week. Chapter 3 Morphology  3.1 Introduction  3.2 Morphemes.
Linguistics The eleventh week. Chapter 4 Syntax  4.1 Introduction  4.2 Word Classes.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging I Introduction Tagsets Approaches.
Role of NLP in Linguistics Dipti Misra Sharma Language Technologies Research Centre International Institute of Information Technology Hyderabad.
Natural Language Processing
LANGUAGE ARTS LA WORKS UNIT 3 REVIEW STUDY GUIDE.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
POS Tagger and Chunker for Tamil
III. MORPHOLOGY. III. Morphology 1. Morphology The study of the internal structure of words and the rules by which words are formed. 1.1 Open classes.
Text segmentation Amany AlKhayat. Before any real processing is done, text needs to be segmented at least into linguistic units such as words, punctuation,
LING/C SC/PSYC 438/538 Lecture 18 Sandiway Fong. Adminstrivia Homework 7 out today – due Saturday by midnight.
Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches.
Word classes and part of speech tagging Chapter 5.
A knowledge rich morph analyzer for Marathi derived forms Ashwini Vaidya IIIT Hyderabad.
The structure and Function of Phrases and Sentences
Parts of Speech By: Miaya Nischelle Sample. NOUN A noun is a person place or thing.
BAMAE: Buckwalter Arabic Morphological Analyzer Enhancer Sameh Alansary Alexandria University Bibliotheca Alexandrina 4th International.
English. New National Curriculum Aims The overarching aim for English in the national curriculum is to promote high standards of language and literacy.
Dr. Pushpak Bhattacharyya
Lecture 9: Part of Speech
Morphology Part 1.
Morphology Morphology Morphology Dr. Amal AlSaikhan Morphology.
Lecture -3 Week 3 Introduction to Linguistics – Level-5 MORPHOLOGY
عمادة التعلم الإلكتروني والتعليم عن بعد
Phrase structures Dr.K.Umaraj Assistant Professor Dept. of Linguistics
Chapter 3 Morphology Without grammar, little can be conveyed. Without vocabulary, nothing can be conveyed. (David Wilkins ,1972) Morphology refers to.
ENGLISH MORPHOLOGY Week 1.
Grammar Workshop Thursday 9th June.
Word Classes and Affixes
LING/C SC/PSYC 438/538 Lecture 20 Sandiway Fong.
CSCI 5832 Natural Language Processing
Language Review Topics
By Mugdha Bapat Under the guidance of Prof. Pushpak Bhattacharyya
Your New Best Friend: Mr. Dictionary
FIRST SEMESTER GRAMMAR
What part of speech is that word?
LING/C SC/PSYC 438/538 Lecture 23 Sandiway Fong.
Língua Inglesa - Aspectos Morfossintáticos
Linguistic Essentials
Natural Language Processing
SANSKRIT ANALYZING SYSTEM
Introduction to English morphology
Introduction to Linguistics
Word phoneme SENTENCE PHRASE SUFFIX prefix PHRASE CLAUSE UTTERANCE PART OF SPEECH MICRO-LINGUISTICS Macro-linguistics Language dictionary LEXICON allophone.
Presentation transcript:

வணக்கம்

சங்க இலக்கியங்களுக்கான உருபன் பகுப்பாய்வி இரா. அகிலன் நிரலாளார் செம்மொழித் தமிழாய்வு மத்திய நிறுவனம் சென்னை

Morphological Analyzer Morphological Analyzer is an essential and basic tool for building any language processing application. Morphological Analysis is the process of providing grammatical information of a word given with its suffix. Morphological Analyzer is a computer program which takes a word as input and produces its grammatical structure as output. Morphological analyzer will return its root/stem word along with its grammatical information depending upon its word category.

Morphological analyzer is playing important role in Natural Language Processing Applications

Rule-based Approaches The rule-based approach has successfully been used in developing many natural language processing systems. Systems that use rule-based transformations are based on a core of solid linguistic knowledge. The linguistic knowledge acquired for one NLP system may be reused to build knowledge required for a similar task in another system. The advantage of the rule-based approach is 1) less-resourced languages, for which large corpora, possibly parallel or bilingual, with representative structures and entities are neither available nor easily affordable, and 2) for morphologically rich languages, which even with the availability of corpora suffer from data sparseness.

Tag sets The tagset which has been used for the present research work contains major 10 categories Noun (NN), Verb (VB), Adjective (ADJ), Adverb (ADV), Pronoun (PR), Postposition (POS), Conjunction (CJ), Particles (PAR), Numerals (NL) and Markers (CM) and subdivided into further 34 categories

Dictionaries The dictionaries plays important role in morphological analyzer. The primary data for Morphological Analyzer have been collected from Classical Tamil Literature. From this, the root word list has been prepared and verified by language experts. The root words have been collected from the authentic editions of Classical Tamil Literature. The database contains the ten root word categories that dictionaries include: the Noun, Verb, Adjective, Adverb, Pronoun, Postposition, Conjunction, Particles, Quantifiers, Numerals, Markers and suffix. The size of database contains around 10,000 words which includes all parts of speech categories

Plural markers The different forms of plural markers in classical Tamil are kaḷ, mar, mār, ar, ār, ir, and īr. The ‘kaḷ’ forms in Tamil are in the different allomorphs of kaḷ, kkaḷ, ṅkaḷ, ṟkaḷ, and ṭkaḷ.

Segmentation rules split the suffix and remove the first character ‘k’ Step 1: Read the input word If (word ) == (dictionary) assign the appropriate tag exit() Step 2: (suffix word) == (‘kaḷ’) { split the suffix assign tag as ‘PL’ assign the remain word tag as ‘NC/NP/ND’ } Step3: (suffix word) == (‘kkaḷ’) split the suffix and remove the first character ‘k’ assign remaining suffix tag as ‘PL’

Step4: (suffix word) == (‘ṅkaḷ’) { split the suffix and remove the first character ‘ṅ’ assign remaining suffix tag as ‘PL’ add(m) in the last character of root word assign the tag as ‘NC/NP/ND’ exit() } Step5: (suffix word ) == (‘ṟkaḷ‘) split the suffix and remove the first character ‘ṟ’ add(l) in the last character of root word Step6: ( suffix word ) == (‘ṭkaḷ’) split the suffix and remove the first character ‘ṭ’ assign the tag as ‘PL’ add(ḷ) in the last character of root word

Step7: ( suffix word ) == (‘mār’) { split the suffix assign tag as ‘PL’ check the dictionary assign the appropriate tag exit() } Step 8: ( suffix word ) == (‘mar’) Step 9: ( suffix word ) == (‘ar’)

Step 10 : ( suffix word ) == (‘ār’) { split the suffix assign tag as ‘PL’ check the dictionary assign the appropriate tag exit() } Step 11 : ( suffix word ) == (‘ir’) Step 12 : ( suffix word ) == (‘īr’)

Step 1: The input word is read by the system and if it is found in the dictionary; then the appropriate tag is assigned. Ex. 1 If the given input மகள் (makaḷ) is checked into the dictionary. Then the tag is assigned. The result is displayed as மகள்/NC (makaḷ/NC) என் மகள் பெரு மடம் யான் பாராட்டத் (அக.397:1) eṉ makaḷ peru maṭam yāṉ pārāṭṭat (aka.397:1) Step 2: The suffix word ‘kaḷ’ is checked and if it is found, the segment is separated and assigned as ‘PL’. Then the remaining word is checked in the dictionary and assigned as ‘NC/NP/ND’. Ex. 2 If the given word is கிளைகள் (kiḷaikaḷ), the suffix ‘kaḷ’ is checked then the suffix is removed and assigned the tag as ‘PL’; the remaining part of the word is checked and if it is found in noun common category then the tag ‘NC’ is assigned; the result is கிளை/NC கள்/PL (kiḷai/NC kaḷ/PL). கேள்ஈவது உண்டு கிளைகளோ துஞ்சுப (நாலடியார்191:2) kēḷ īvatu uṇṭu kiḷaikaḷō tuñcupa (Nālaṭiyār191:2)

Step 3: The suffix ‘kkaḷ’ is checked, if it is found the segment is separated; the first character (k) is removed; then it is tagged as ‘PL’, the remaining part of the word is assigned as ‘NC/NP/ND’. Ex.3 If the given input is அணுக்கள் (aṇukkaḷ). The suffix ‘kkaḷ’ is separated; the first character of the suffix ‘k’ is removed; then the remaining suffix is assigned as ‘PL’; check the remaining part of the word ‘aṇu’ in the dictionary; if it is found then the tag ‘NC’ is assigned; the result is அணு/NN கள்/PL (aṇu/NN kaḷ/PL). நிறைந்த இவ் அணுக்கள் பூதமாய் நிகழின் (மணிமே.27:138) niṟainta iv aṇukkaḷ pūtamāy nikaḻiṉ (maṇimē.27:138) Step 4: The suffix ‘ṅkaḷ’ is checked. If it is found, the segment is separated; in the suffix ‘ṅ’ is removed; the remaining is tagged as ‘PL’; ‘m’ is added in the last character of the remaining part of the word and then it is tagged as ‘NC/NP/ND’. Ex. 4 If the given input is குணங்கள் (kuṇaṅkaḷ) the suffix ‘ṅkaḷ’ is separated; the first character of the suffix ‘ṅ’ is removed; and assigned tag as PL; ‘m’ is added as the last character of the remaining part of the word and assigned the tag as NC. The result is குணம்/NC கள்/PL (kuṇam/NN kaḷ/PL). கண்ணிய பொருளின் குணங்கள் ஆகும் (மணிமே.27:256) kaṇṇiya poruḷiṉ kuṇaṅkaḷ ākum (maṇimē.27:256)

Step 5: The suffix ‘ṟkaḷ’ is checked Step 5: The suffix ‘ṟkaḷ’ is checked. If ‘ṟkaḷ‘ is found, the segment is sepatated; in the suffix the ‘ṟ’ is removed; the remaining is tagged as PL; ‘l’ is added in the last character of the remaining part of the word and assigned the tagged as ‘NC/NP/ND’. Ex. 5 The given input is சொற்கள் (coṟkaḷ) check the suffix‘ṟkaḷ’ segment the first character of the suffix ‘ṟ’ is removed assign the remaining suffix as PL, add ‘l’ in the last character of the remaining word and assign the tag as NN, the result is சொல்/NN கள்/PL (col/NN kaḷ/PL). திருந்துபு நீகற்ற சொற்கள் யாம்கேட்ப (கலி81:13) tiruntupu nī kaṟṟa coṟkaḷ yām kēṭpa (Kali 81:13) Step 6: The suffix ‘ṭkaḷ‘ is checked, if ‘ṭkaḷ‘ is found, the segment is sepatated; in the suffix the ‘ṭ’ is removed; the remaining is tagged as ‘PL’; ‘ḷ’ is added in the last character of the remaining part of the word and assigned the tagged as ‘NC/NP/ND’. Ex. 6 The given input is நாட்கள் (nāṭkaḷ) check the suffix ‘ṭkaḷ’ segment the first character of the suffix ‘ṭ’ then assign the remaining suffix is PL, add ‘ḷ’ in the last of the remaining word then assign the tag as NN. The result is நாள்/NCகள்/PL (nāḷ/NN kaḷ/PL). ஒன்று இட ஈட்டு வருமேல் நின் வாழ் நாட்கள்(ஐந்தி70-56:3) oṉṟu iṭa īṭṭu varumēlniṉ vāḻ nāṭkaḷ (Ainti 70-56:3)

Step 7: The suffix ‘mār’ is checked; the segment is separated and assigned as ‘PL’. Then the remaining word is checked in the dictionary and assigned as ‘NC/NP/ND’. Ex. 7 The given input is பாவைமார் (pāvaimār) check the suffix ‘mār’ segment the suffix and assign the suffix as PL, the remaining word is assign as NN. The result is பாவை்/NC மார்/PL (pāvai/NC mār/PL). பாவைமார் ஆரிக்கும் பாடலே பாடல் (சிலம்பு.29:120) pāvaimār ārikkum pāṭalē pāṭal (cilampu.29:120) Step 8: The suffix ‘mar’ is cheeked; the segment is separated and assigned as ‘PL’. Then the remaining word is checked in the dictionary and assigned as ‘NC/NP/ND’. Ex. 8 the given input is பதின்மர் (patiṉmar) check the suffix ‘mar’ segment the suffix and assign the suffix as PL, the remaining word is assign as NN. The result is பதின்/NC மர்/PL (patiṉ/NC mar/PL). முனியா ஏறுபோல் வைகல் பதின்மரை (கலி.108:48) muṉiyā ēṟupōl vaikal patiṉmarai (kali.108:48)

Step 9: The suffix ‘ar’ is cheeked; the segment is separated and assigned as ‘PL’. Then the remaining word is checked in the dictionary and assigned as ‘NC/NP/ND’. Ex. 9 the given input is கிழவியர் (kiḻaviyar) check the suffix ‘ar’ segment the suffix and assign the suffix as PL, the remaining word is assign as NN. The result is கிழவி/NC அர்/PL (kiḻavi/NC ar/PL). கிழவர் கிழவியர் என்னாது ஏழ்காறும் (பரி:11:120) kiḻavar kiḻaviyar eṉṉātu ēḻkāṟum (pari:11:120) Step 10: The suffix ‘ār’ is cheeked; the segment is separated and assigned as ‘PL’. Then the remaining word is checked in the dictionary and assigned as ‘NC/NP/ND’. Ex. 10 the given input is பேதையார் (pētaiyār) check the suffix ‘ār’ segment the suffix and assign the suffix as PL, the remaining word is assign as NN. The result is பேதை/NC ஆர்/PL (pētai/NC ār/PL). அறிவு எனப்படுவது பேதையார் சொல் நோன்றல் (கலி.133:10) aṟivu eṉappaṭuvatu pētaiyār col nōṉṟal (kali.133:10)

Step 11: The suffix ‘ir’ is cheeked; the segment is separated and assigned as ‘PL’. Then the remaining word is checked in the dictionary and assigned as ‘NC/NP/ND’. Ex. 11 the given input is மகளிர் (makaḷir) check the suffix ‘ir’ segment the suffix and assign the suffix as PL, the remaining word is assign as NN. The result is மகள்/NC இர்/PL (makaḷ/NC ir/PL). மகளிர் கோதை மைந்தர் புனைய உம் பரி.20:20 makaḷir kōtai maintar puṉaiya um pari.20:20 Step 12: The suffix ‘īr’ is cheeked; the segment is separated and assigned as ‘PL’. Then the remaining word is checked in the dictionary and assigned as ‘NC/NP/ND’. Ex. 12 the given input is பெண்டீர் (peṇṭīr) check the suffix ‘īr’ segment the suffix and assign the suffix as PL, the remaining word is assign as NN. The result is பதின்/NC ஈர்/PL (patiṉ/NC īr/PL) விழைவு இலா பெண்டீர் தோள் சேர்வும் உழந்து (திரி.5:2) viḻaivu ilā peṇṭīr tōḷ cērvum uḻantu (tiri.5:2)

for testing and evaluation purposes, frequently-used 3257 words from classical Tamil were taken as input in the analyzer The result of word-level analysis is as follows: 2706 (83%) words were analyzed correctly; 359 (11%) words were analyzed wrongly and 194 (6%) of words were un-analyzed.

நன்றி