Annotating Urdu Corpus


Annotating Urdu Corpus Tafseer Ahmed DHA Suffa University, Karachi

Presentation Plan
  Motivation
  Part of Speech Annotation
  Annotation for Named Entity (NE)
  Practical task
  Dependency Annotation
  Other Annotations

Motivation

Annotated Corpus – preview

Motivation
Software tools for processing an annotated corpus are available:
  Part of Speech Tagger
  Named Entity Recognizer
  Chunker/Shallow Parser
  Language Modelling (& Grammar Learning)
The annotated corpus can be used in:
  Statistical Machine Translation
  Information Extraction

Motivation
Hence, tools are available to create software applications.
However, the missing ingredient for Pakistani languages is an annotated corpus.

Part of Speech (POS) annotation

Part of speech
A part of speech (also a word class, a lexical class, or a lexical category) is a linguistic category of words (or, more precisely, lexical items), which is generally defined by the syntactic or morphological behaviour of the lexical item in question.
Example: John/Noun saw/Verb the/Article saw/Noun

“Traditional” POS Tag set
  Noun
  Pronoun
  Adjective
  Verb
  Adverb
  Preposition
  Conjunction
  Interjection

Tag set sizes
  3 tags: Arabic (“Tradition”): اسم، فعل، حرف
  8 tags: English (“Tradition”)
  48 tags: Penn Treebank Tagset
  282 tags: Hardie’s Urdu Tagset

Granularity Problem
Adjective:
  good, bad, اچھا، برا | adjective
  some, many, چند، کئی | quantifier
  first, second, پہلا، دوسرا | ordinal
Verb:
  go, read, جا، پڑھ | main verb
  is, was, ہے، تھا | helping verb / auxiliary
  can, may, سک، چاہیے | modal verb

POS tagset for Computation

Sample Text - English
one/CD charge/NN of/IN filing/NN a/DT false/JJ return/NN and/CC was/VBD fined/VBN $/$ 5,000/CD and/CC sentenced/VBN to/TO 18/CD months/NNS in/IN prison/NN ./.
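The word/TAG notation in the sample above is easy to read by machine. The following is a minimal sketch, not part of the original slides, of how such a line could be split into (word, tag) pairs; the function name `parse_tagged` is hypothetical, and the sample string is a shortened copy of the slide text.

```python
from collections import Counter


def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so tokens such as './.' or '$/$' work
        # and a word that itself contains '/' stays intact.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


# Shortened copy of the Penn Treebank style sample from the slide.
sample = "one/CD charge/NN of/IN filing/NN a/DT false/JJ return/NN ./."
pairs = parse_tagged(sample)

# Count how often each tag occurs in the sample.
tag_counts = Counter(tag for _, tag in pairs)
```

Splitting on the last slash is a design choice of this sketch; a corpus that escapes slashes differently would need its own rule.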

Urdu Tagset - Issues
  Granularity: from Hardie (282 tags) to Hassan (42)
  Syntactic versus functional tag: Noun, Verb


Urdu Tagset - Issues
Coarse grained tags:
  better for machine learning
  easy/fast annotation
Fine grained tags:
  more information

Urdu Tagset - Issues
Syntactic versus functional behavior
Noun (NN):
  میں نے پانی پیا vs. میں نے سبق یاد کیا
  میں گھر کے اندر گیا

CLE Urdu Tagset

Sample Text - Urdu CLE POS Tagged Data

Tagset for other languages
Sindhi:
  J A Mahar (2010)
  Mutee u Rahman (2012)

“Universal” Tagset
  Naseem et al., 2010
  Google (Petrov et al., 2012)
  Used (and modified) by Nivre’s Universal Dependencies
  TweeboParser (CMU)

Google “Universal” Tagset
  NOUN (nouns): کتاب، کراچی
  VERB (verbs): پڑھتا، ہے، رہا
  ADJ (adjectives): اچھا، چند، پہلا
  ADV (adverbs): روزانہ، بہت، تقریباً
  PRON (pronouns): میں، تم، وہ
  DET (determiners and articles): وہ

“Universal” Tagset
  ADP (prepositions and postpositions): نے، اندر
  NUM (numerals): 2, دو
  CONJ (conjunctions): اور، یا، لیکن
  PRT (particles)
  ‘.’ (punctuation marks): -، ؟
  X (a catch-all for other categories)
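One practical use of a coarse universal tagset is collapsing a fine-grained tagset onto it. The sketch below is an illustration that is not part of the original slides: it maps a handful of Penn Treebank tags onto the twelve universal tags listed above. The table `PTB_TO_UNIVERSAL` is a small hand-written subset that covers only the tags appearing in the English sample text and should be treated as an assumption, not as the published mapping file; the function name `to_universal` is hypothetical.

```python
# Illustrative subset only: Penn Treebank tag -> universal tag.
# (Assumption: written by hand for this example, not loaded from any
# official mapping resource.)
PTB_TO_UNIVERSAL = {
    "CC": "CONJ",
    "CD": "NUM",
    "DT": "DET",
    "IN": "ADP",
    "JJ": "ADJ",
    "NN": "NOUN",
    "NNS": "NOUN",
    "TO": "PRT",
    "VBD": "VERB",
    "VBN": "VERB",
    "$": ".",
    ".": ".",
}


def to_universal(pairs):
    """Map (word, fine_tag) pairs to (word, universal_tag) pairs.

    Tags missing from the table fall back to 'X', the catch-all category.
    """
    return [(word, PTB_TO_UNIVERSAL.get(tag, "X")) for word, tag in pairs]


tagged = [("18", "CD"), ("months", "NNS"), ("in", "IN"), ("prison", "NN")]
coarse = to_universal(tagged)
```

Falling back to X for unknown tags keeps the sketch total; a real conversion would more likely report unmapped tags so the table can be completed.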

Nivre’s “Universal” Tagset
  ADJ: adjective
  ADP: adposition
  ADV: adverb
  AUX: auxiliary verb
  CONJ: coordinating conjunction
  DET: determiner
  INTJ: interjection
  NOUN: noun
  NUM: numeral
  PART: particle
  PRON: pronoun
  PROPN: proper noun
  PUNCT: punctuation
  SCONJ: subordinating conjunction
  SYM: symbol
  VERB: verb
  X: other

An Example (using Petrov Tagset)
Sentence: ذہین لڑکی نے روزانہ ایک اچھی کتاب پڑھی
  ذہین
  لڑکی
  نے | Adp
  روزانہ | Adv
  ایک | Num
  اچھی | Adj
  کتاب | Noun
  پڑھی | Verb

An Example – Noun Features
English: boy|NN boys|NNS
Form | Number | Example
Nominative | Singular | اچھا لڑکا آیا
Nominative | Plural | اچھے لڑکے آئے
Oblique | Singular | اچھے لڑکے نے کہا
Oblique | Plural | اچھے لڑکوں نے کہا

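An Example – Verb Features
English: walk|VB walks|VBZ walked|VBD reading|VBG
Number | Gender | Person | Form | Example
Singular | Masculine | 3 | imperfective | چلتا ہے
Singular | Feminine | 3 | imperfective | چلتی ہے
Plural | Feminine | 3 | imperfective | چلتی ہیں
Singular | Masculine | 3 | perfective | چلا تھا
Singular | Feminine | 3 | perfective | چلی تھی
Singular | Masculine | 1 | subjunctive | چلوں گا
 | | | infinitive | چلنا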

Beyond POS Tagging: Named Entity Recognition

Named Entities
Example: سعید اجمل 14 اکتوبر 1977ء کو فیصل آباد میں پیدا ہوئے
Categories: Person, Organization, Location, Date, Time, Money, Percent, Misc

Inside Outside Beginning (IOB)
Sentence: بل گیٹس مائیکروسافٹ کے بانی ہیں
Word | POS | IOB
بل | Noun | Person-B
گیٹس | Noun | Person-I
مائیکروسافٹ | Noun | Organization-B
کے | AdP | O
بانی | Noun | O
ہیں | Verb | O
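The IOB labels above can be turned back into whole entities by grouping each B label with the I labels that follow it. The following is a minimal sketch, not taken from the slides, that assumes the label style shown on the slide (type first, then -B or -I, with O for tokens outside any entity); the function name `extract_entities` is hypothetical.

```python
def extract_entities(tokens, labels):
    """Group IOB-labelled tokens into (entity_type, text) pairs.

    Labels look like 'Person-B', 'Person-I' or 'O', as on the slide.
    """
    entities = []
    current_words, current_type = [], None
    for word, label in zip(tokens, labels):
        if label == "O":
            kind, position = None, None
        else:
            kind, position = label.rsplit("-", 1)

        # An I label of the same type extends the entity that is open.
        if position == "I" and kind == current_type:
            current_words.append(word)
            continue

        # Anything else closes the open entity, if there is one.
        if current_words:
            entities.append((current_type, " ".join(current_words)))
            current_words, current_type = [], None

        # A B label (or a stray I label) starts a new entity.
        if position in ("B", "I"):
            current_words, current_type = [word], kind

    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


tokens = ["بل", "گیٹس", "مائیکروسافٹ", "کے", "بانی", "ہیں"]
labels = ["Person-B", "Person-I", "Organization-B", "O", "O", "O"]
entities = extract_entities(tokens, labels)
```

Treating a stray I label as the start of a new entity is a lenient choice made for this sketch; a stricter reader could raise an error instead.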

Practical Task
  Annotating an English file
  WebAnno introduction
  Creating an Urdu POS tagset
  Annotating a news item (from today’s newspaper)

Part 2: Syntactic Annotation

Syntactic Representation Phrase Structure vs Dependency Structure

Phrase Structure vs Dependency

Constituent and Functional Structures Lexical Functional Grammar (LFG)

Urdu Examples

Urdu Examples

“Universal” Dependencies

Recap – An Urdu Example
Sentence: ذہین لڑکیوں نے اچھی کتابیں پڑھیں تھیں
Word | Lemma | POS | Features
ذہین | | | Form=Obl
لڑکیوں | لڑکی | Noun | Form=Obl
نے | | AdP |
اچھی | اچھا | Adj | Form=Nom
کتابیں | | NN | Gend=Fem, Form=Nom, Num=Pl
پڑھیں | پڑھ | Verb | Form=Perf, Gend=Fem, Pers=3, Num=Pl
تھیں | ہے | Aux |

Recap – An Urdu Example

CoNLL Format
CoNLL (Conference on Natural Language Learning) format: representing a graph (and other tags) in a text file.
Columns:
  Id
  Word
  Lemma
  Coarse Grained POS
  Fine Grained POS
  Features
  Host
  Dependency Type
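To make the column layout concrete, here is a minimal sketch, not part of the original slides, that reads tab-separated rows with the eight columns listed above. It assumes one token per line and an underscore for an empty field; the names `Token` and `parse_conll`, the example rows, and the head and dependency labels in them are all illustrative assumptions. The CoNLL-X shared-task format carries two further columns that the slide does not list, so they are left out here.

```python
from collections import namedtuple

# One field per column named on the slide ("Host" is stored as `head`).
Token = namedtuple("Token", "id word lemma cpos pos feats head deprel")


def parse_conll(block):
    """Parse one sentence: one token per line, eight tab-separated columns."""
    tokens = []
    for line in block.strip().splitlines():
        fields = line.split("\t")
        if len(fields) != 8:
            raise ValueError("expected 8 columns, got %d" % len(fields))
        fields[0] = int(fields[0])  # token id, counted from 1
        fields[6] = int(fields[6])  # id of the host (head); 0 marks the root
        tokens.append(Token(*fields))
    return tokens


# Hypothetical annotation of "John saw the saw"; heads and labels are
# made up for illustration.
rows = [
    ("1", "John", "John", "NOUN", "NNP", "_", "2", "nsubj"),
    ("2", "saw", "see", "VERB", "VBD", "_", "0", "root"),
    ("3", "the", "the", "DET", "DT", "_", "4", "det"),
    ("4", "saw", "saw", "NOUN", "NN", "_", "2", "dobj"),
]
block = "\n".join("\t".join(row) for row in rows)

sentence = parse_conll(block)
root_words = [t.word for t in sentence if t.head == 0]
```

Because each token stores the id of its host, the dependency graph can be rebuilt from the text file by following the `head` values.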

CoNLL Format

Tools for Annotation
  GATE (General Architecture for Text Engineering), Sheffield
  Brat, Manchester, Tokyo, ...
  WebAnno, Darmstadt
  ...

WebAnno - Demo