Annotating Urdu Corpus Tafseer Ahmed DHA Suffa University, Karachi
Presentation Plan Practical task Motivation Part of Speech Annotation Annotation for Named Entity (NE) Practical task Dependency Annotation Other Annotations
Motivation
Annotated Corpus – preview
Motivation
Software tools for processing an annotated corpus are available:
  Part of Speech Tagger
  Named Entity Recognizer
  Chunker/Shallow Parser
  Language Modelling (& Grammar Learning)
The annotated corpus can be used in:
  Statistical Machine Translation
  Information Extraction
Motivation
Hence, tools are available to create software applications. However, the missing ingredient for Pakistani languages is an annotated corpus.
Part of Speech (POS) annotation
Part of speech
John/Noun saw/Verb the/Article saw/Noun
A part of speech (also a word class, a lexical class, or a lexical category) is a linguistic category of words (or, more precisely, lexical items), which is generally defined by the syntactic or morphological behaviour of the lexical item in question.
“Traditional” POS Tag set Noun Pronoun Adjective Verb Adverb Preposition Conjunction Interjection
Tag set sizes
3 tags: اسم ، فعل، حرف (noun, verb, particle): Arabic ("tradition")
8 tags: English ("tradition")
48 tags: Penn Treebank tagset
282 tags: Hardie's Urdu tagset
Granularity Problem
Adjective:
  good, bad, اچھا ، برا | adjective
  some, many, چند ، کئی | quantifier
  first, second, پہلا ، دوسرا | ordinal
Verb:
  go, read, جا، پڑھ | main verb
  is, was, ہے ، تھا | helping verb / auxiliary
  can, may, سک، چاہیے | modal verb
POS tagset for Computation
Sample Text - English
one/CD charge/NN of/IN filing/NN a/DT false/JJ return/NN and/CC was/VBD fined/VBN $/$ 5,000/CD and/CC sentenced/VBN to/TO 18/CD months/NNS in/IN prison/NN ./.
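The word/TAG notation in the sample above is easy to read back into a program. The following is a minimal sketch, assuming whitespace-separated tokens and a single slash before the tag; the function name `parse_tagged` is chosen here for illustration and does not come from any particular toolkit.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so that tokens such as '$/$' and './.' still work.
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


sample = (
    "one/CD charge/NN of/IN filing/NN a/DT false/JJ return/NN and/CC "
    "was/VBD fined/VBN $/$ 5,000/CD and/CC sentenced/VBN to/TO "
    "18/CD months/NNS in/IN prison/NN ./."
)
pairs = parse_tagged(sample)
print(pairs[:3])
```

Splitting from the right is a small design choice: it keeps the tag intact even when the word itself is a symbol.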
Urdu Tagset - Issues Granularity From Hardie (282 tags) to Hassan (42) Syntactic versus functional tag Noun Verb
Urdu Tagset - Issues
Coarse-grained tags: better for machine learning; easy/fast annotation
Fine-grained tags: more information
Urdu Tagset - Issues
Syntactic versus functional behavior: Noun (NN)
میں نے پانی پیا (I drank water) vs. میں نے سبق یاد کیا (I memorized the lesson)
میں گھر کے اندر گیا (I went inside the house)
CLE Urdu Tagset
Sample Text - Urdu CLE POS Tagged Data
Tagset for other languages Sindhi J A Mahar (2010) Mutee u Rahman (2012)
“Universal” Tagset
Naseem et al., 2010
Google (Petrov et al., 2012)
Used (and modified) by Nivre’s Universal Dependencies and TweeboParser (CMU)
Google “Universal” Tagset
NOUN (nouns): کتاب، کراچی
VERB (verbs): پڑھتا، ہے، رہا
ADJ (adjectives): اچھا، چند، پہلا
ADV (adverbs): روزانہ، بہت، تقریباً
PRON (pronouns): میں، تم، وہ
DET (determiners and articles): وہ
ADP (prepositions and postpositions): نے، اندر
NUM (numerals): 2, دو
CONJ (conjunctions): اور، یا، لیکن
PRT (particles)
‘.’ (punctuation marks): -، ؟
X (a catch-all for other categories)
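Moving from a fine-grained tagset to the coarse tags listed above amounts to a lookup table. The sketch below is illustrative only: the dictionary is a hand-picked subset of Penn Treebank tags, not the full published mapping, and sending every unlisted tag to X is a simplifying assumption made here.

```python
# Illustrative, partial mapping from Penn Treebank tags to coarse "universal" tags.
# Assumption: a hand-picked subset for demonstration, not the complete published mapping.
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VBD": "VERB", "VBN": "VERB",
    "JJ": "ADJ",
    "CD": "NUM",
    "IN": "ADP",
    "DT": "DET",
    "CC": "CONJ",
    ".": ".",
}


def to_universal(tagged, mapping=PENN_TO_UNIVERSAL):
    """Replace each fine-grained tag with its coarse tag; unlisted tags become 'X'."""
    return [(word, mapping.get(tag, "X")) for word, tag in tagged]


fine = [("months", "NNS"), ("fined", "VBN"), ("to", "TO")]
coarse = to_universal(fine)
print(coarse)
```

Note that the mapping only runs one way: NN and NNS both collapse to NOUN, so the number distinction cannot be recovered from the coarse tag.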
Nivre’s “Universal” Tagset ADJ: adjective ADP: adposition ADV: adverb AUX: auxiliary verb CONJ: coordinating conjunction DET: determiner INTJ: interjection NOUN: noun NUM: numeral PART: particle PRON: pronoun PROPN: proper noun PUNCT: punctuation SCONJ: subordinating conjunction SYM: symbol VERB: verb X: other
An Example (using Petrov Tagset)
ذہین لڑکی نے روزانہ ایک اچھی کتاب پڑھی
Word | Tag
ذہین | Adj
لڑکی | Noun
نے | Adp
روزانہ | Adv
ایک | Num
اچھی | Adj
کتاب | Noun
پڑھی | Verb
An Example – Noun Features
English: boy|NN boys|NNS
Form | Number | Example
Nominative | Singular | اچھا لڑکا آیا
Nominative | Plural | اچھے لڑکے آئے
Oblique | Singular | اچھے لڑکے نے کہا
Oblique | Plural | اچھے لڑکوں نے کہا
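An Example – Verb Features
English: walk|VB walks|VBZ walked|VBD reading|VBG
Number | Gender | Person | Form | Example
Singular | Masculine | 3 | imperfective | چلتا ہے
Singular | Feminine | 3 | imperfective | چلتی ہے
Plural | Feminine | 3 | imperfective | چلتی ہیں
Singular | Masculine | - | perfective | چلا تھا
Singular | Feminine | - | perfective | چلی تھی
Singular | Masculine | 1 | subjunctive | چلوں گا
- | - | - | infinitive | چلنا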
Beyond POS Tagging: Named Entity Recognition
Named Entities
سعید اجمل 14 اکتوبر 1977ء کو فیصل آباد میں پیدا ہوئے (Saeed Ajmal was born in Faisalabad on 14 October 1977)
Entity types: Person, Organization, Location, Date, Time, Money, Percent, Misc
Inside Outside Beginning (IOB)
بل گیٹس مائیکروسافٹ کے بانی ہیں (Bill Gates is the founder of Microsoft)
Word | POS | IOB
بل | Noun | Person-B
گیٹس | Noun | Person-I
مائیکروسافٹ | Noun | Organization-B
کے | AdP | O
بانی | Noun | O
ہیں | Verb | O
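Once every token carries an IOB tag, the entities can be read off by grouping each B tag with the I tags that follow it. The sketch below assumes the Type-B / Type-I / O spelling used on this slide (many tools write B-Type instead); the function name `extract_entities` is hypothetical.

```python
def extract_entities(tokens, iob_tags):
    """Group tokens into (entity_type, [words]) spans from 'Type-B' / 'Type-I' / 'O' tags."""
    entities = []
    current_type, current_words = None, []
    for word, tag in zip(tokens, iob_tags):
        if tag == "O":
            etype, position = None, None
        else:
            etype, position = tag.rsplit("-", 1)
        # A span starts on B, or on an I whose type does not continue the open span.
        starts_new = position == "B" or (position == "I" and etype != current_type)
        # Close the open span when we step outside it or a new one begins.
        if current_type is not None and (position is None or starts_new):
            entities.append((current_type, current_words))
            current_type, current_words = None, []
        if position is None:
            continue
        if starts_new:
            current_type, current_words = etype, [word]
        else:
            current_words.append(word)
    if current_type is not None:
        entities.append((current_type, current_words))
    return entities


tokens = ["بل", "گیٹس", "مائیکروسافٹ", "کے", "بانی", "ہیں"]
tags = ["Person-B", "Person-I", "Organization-B", "O", "O", "O"]
entities = extract_entities(tokens, tags)
print(entities)
```

Treating a stray I tag as the start of a span is a lenient choice made for this sketch; a stricter annotator check might report it as an error instead.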
Practical Task Annotating English file WebAnno Introduction Creating Urdu POS Tagset Annotating a news item (from today’s newspaper)
Part 2: Syntactic Annotation
Syntactic Representation Phrase Structure vs Dependency Structure
Phrase Structure vs Dependency
Constituent and Functional Structures Lexical Functional Grammar (LFG)
Urdu Examples
“Universal” Dependencies
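Recap – An Urdu Example
ذہین لڑکیوں نے اچھی کتابیں پڑھیں تھیں
Word | Lemma | POS | Features
ذہین | - | - | Obl
لڑکیوں | لڑکی | Noun | Form=Obl
نے | - | AdP | -
اچھی | اچھا | Adj | Nom
کتابیں | - | NN | Gend=Fem Form=Nom Num=Pl
پڑھیں | پڑھ | Verb | Form=Perf Gend=Fem Pers=3 Num=Pl
تھیں | ہے | Aux | -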
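Feature annotations written as Name=Value pairs, as in the recap above, can be turned into a dictionary for further processing. This is a minimal sketch: the helper name `parse_features` is hypothetical, and it assumes each pair is written as Name=Value, with optional spaces around the equals sign.

```python
import re


def parse_features(text):
    """Collect Name=Value pairs into a dict; bare values without a name are ignored."""
    return {name: value for name, value in re.findall(r"(\w+)\s*=\s*(\w+)", text)}


verb_features = parse_features("Form=Perf Gend=Fem Pers=3 Num=Pl")
print(verb_features)
```

With features in this form, a check such as "does the adjective agree with its noun in Form" becomes a simple dictionary comparison.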
Recap – An Urdu Example
CoNLL Format
CoNLL (Conference on Natural Language Learning) format: represents a graph (and other tags) in a text file.
Columns: Id, Word, Lemma, Coarse-Grained POS, Fine-Grained POS, Features, Host, Dependency Type
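The eight columns named above can be read with a few lines of code. The sketch below assumes one token per line with tab-separated columns in the order listed, and an Id of 0 in the Host column for the root. The sample rows reuse "John saw the saw"; the host numbers and dependency labels in them are illustrative assumptions, not annotations taken from a treebank, and the `Token` class and `parse_conll` function are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int
    word: str
    lemma: str
    cpos: str    # coarse-grained POS
    pos: str     # fine-grained POS
    feats: str   # features, "_" when empty
    head: int    # Id of the host token, 0 for the root
    deprel: str  # dependency type


def parse_conll(block):
    """Parse tab-separated 8-column lines into Token objects."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        if len(cols) != 8:
            raise ValueError(f"expected 8 columns, got {len(cols)}: {line!r}")
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                            cols[4], cols[5], int(cols[6]), cols[7]))
    return tokens


# Hypothetical annotation: heads and labels are assumptions made for illustration.
rows = [
    ("1", "John", "John", "Noun", "NNP", "_", "2", "nsubj"),
    ("2", "saw", "see", "Verb", "VBD", "_", "0", "root"),
    ("3", "the", "the", "Article", "DT", "_", "4", "det"),
    ("4", "saw", "saw", "Noun", "NN", "_", "2", "obj"),
]
sample = "\n".join("\t".join(row) for row in rows)

tokens = parse_conll(sample)
root = next(t for t in tokens if t.head == 0)
dependents = [t.word for t in tokens if t.head == root.id]
print(root.word, dependents)
```

Because each line stores only the Id of its host, the whole dependency graph fits in a flat text file, which is what makes the format convenient for annotation tools and parsers alike.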
CoNLL Format
Tools for Annotation
GATE (General Architecture for Text Engineering), Sheffield
Brat, Manchester, Tokyo, ...
WebAnno, Darmstadt
...
WebAnno - Demo