1
Annotating Urdu Corpus
Tafseer Ahmed, DHA Suffa University, Karachi
2
Presentation Plan
Motivation
Part of Speech Annotation
Annotation for Named Entity (NE)
Practical task
Dependency Annotation
Other Annotations
Practical task
3
Motivation
4
Annotated Corpus – preview
5
Motivation
Software tools for processing an annotated corpus are available:
  Part of Speech Tagger
  Named Entity Recognizer
  Chunker/Shallow Parser
  Language Modelling (& Grammar Learning)
The annotated corpus can be used in:
  Statistical Machine Translation
  Information Extraction
6
Motivation
Hence, tools are available to create software applications.
However, the missing ingredient for Pakistani languages is an annotated corpus.
7
Part of Speech (POS) annotation
8
Part of speech
A part of speech (also a word class, a lexical class, or a lexical category) is a linguistic category of words (or, more precisely, lexical items), which is generally defined by the syntactic or morphological behaviour of the lexical item in question.
Example: John/Noun saw/Verb the/Article saw/Noun
9
“Traditional” POS Tag set
Noun Pronoun Adjective Verb Adverb Preposition Conjunction Interjection
10
Tag set sizes
3 tags: Arabic (“traditional”): اسم ، فعل، حرف (noun, verb, particle)
8 tags: English (“traditional”)
48 tags: Penn Treebank tagset
282 tags: Hardie’s Urdu tagset
11
Granularity Problem
Adjective:
  good, bad, اچھا ، برا : adjective
  some, many, چند ، کئی : quantifier
  first, second, پہلا ، دوسرا : ordinal
Verb:
  go, read, جا، پڑھ : main verb
  is, was, ہے ، تھا : helping verb / auxiliary
  can, may, سک، چاہیے : modal verb
12
POS tagset for Computation
13
Sample Text - English
one/CD charge/NN of/IN filing/NN a/DT false/JJ return/NN and/CC was/VBD fined/VBN $/$ 5,000/CD and/CC sentenced/VBN to/TO 18/CD months/NNS in/IN prison/NN ./.
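The word/TAG notation in the sample can be read programmatically. The following is a minimal sketch, assuming tokens are separated by whitespace and that the tag follows the last slash in each token; the function name `parse_tagged` is hypothetical and only used for illustration.

```python
from collections import Counter

def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so that tokens like "./." and "$/$" work.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

# Excerpt of the sample sentence shown on the slide.
sample = ("was/VBD fined/VBN $/$ 5,000/CD and/CC sentenced/VBN "
          "to/TO 18/CD months/NNS in/IN prison/NN ./.")

pairs = parse_tagged(sample)
tag_counts = Counter(tag for _, tag in pairs)

for word, tag in pairs:
    print(f"{word}\t{tag}")
```

Counting tags this way is a quick check of how the labels are distributed in an annotated file before it is used for training.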
14
Urdu Tagset - Issues
Granularity: from Hardie (282 tags) to Hassan (42)
Syntactic versus functional tag: Noun, Verb
16
Urdu Tagset - Issues
Coarse-grained tags: better for machine learning; easy/fast annotation
Fine-grained tags: more information
17
Urdu Tagset - Issues
Syntactic versus functional behavior
Noun (NN):
  میں نے پانی پیا ("I drank water") vs. میں نے سبق یاد کیا ("I memorized the lesson")
  میں گھر کے اندر گیا ("I went inside the house")
18
CLE Urdu Tagset
19
Sample Text - Urdu CLE POS Tagged Data
20
Tagset for other languages
Sindhi: J. A. Mahar (2010), Mutee U Rahman (2012)
21
“Universal” Tagset
Naseem et al., 2010
Google (Petrov et al., 2012)
Used (and modified) by: Nivre’s Universal Dependencies, TweeboParser (CMU)
22
Google “Universal” Tagset
NOUN (nouns): کتاب، کراچی
VERB (verbs): پڑھتا، ہے، رہا
ADJ (adjectives): اچھا، چند، پہلا
ADV (adverbs): روزانہ، بہت، تقریباً
PRON (pronouns): میں، تم، وہ
DET (determiners and articles): وہ
23
“Universal” Tagset
ADP (prepositions and postpositions): نے، اندر
NUM (numerals): 2, دو
CONJ (conjunctions): اور، یا، لیکن
PRT (particles)
‘.’ (punctuation marks): -، ؟
X (a catch-all for other categories)
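A coarse “universal” tagset is usually applied by mapping each fine-grained tag to one of the twelve categories. The sketch below maps a handful of Penn Treebank tags to the Petrov et al. categories; the table is deliberately partial (only tags that occur in the English sample sentence), and the names `PTB_TO_UNIVERSAL` and `to_universal` are hypothetical. Unknown tags fall back to X, which is an assumption of this sketch rather than part of any published mapping.

```python
# Partial Penn Treebank -> universal mapping (illustrative subset only).
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VBD": "VERB", "VBN": "VERB",
    "JJ": "ADJ",
    "DT": "DET",
    "IN": "ADP",
    "CD": "NUM",
    "CC": "CONJ",
    "TO": "PRT",
    "$": ".", ".": ".",
}

def to_universal(tagged):
    """Replace each fine-grained tag with its coarse category (X if unknown)."""
    return [(word, PTB_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged]

fine = [("sentenced", "VBN"), ("to", "TO"), ("18", "CD"),
        ("months", "NNS"), ("in", "IN"), ("prison", "NN"), (".", ".")]

coarse = to_universal(fine)
for word, tag in coarse:
    print(f"{word}\t{tag}")
```

The mapping loses information (VBD and VBN both become VERB), which is the granularity trade-off discussed earlier: fewer tags are easier to annotate and learn, but say less about each word.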
24
Nivre’s “Universal” Tagset
ADJ: adjective
ADP: adposition
ADV: adverb
AUX: auxiliary verb
CONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PART: particle
PRON: pronoun
PROPN: proper noun
PUNCT: punctuation
SCONJ: subordinating conjunction
SYM: symbol
VERB: verb
X: other
25
An Example (using Petrov Tagset)
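Sentence: ذہین لڑکی نے روزانہ ایک اچھی کتاب پڑھی ("The intelligent girl read a good book every day")
ذہین    Adj
لڑکی    Noun
نے    Adp
روزانہ    Adv
ایک    Num
اچھی    Adj
کتاب    Noun
پڑھی    Verb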
26
An Example – Noun Features
English: boy|NN boys|NNS
Form, Number: Example
Nominative, Singular: اچھا لڑکا آیا
Nominative, Plural: اچھے لڑکے آئے
Oblique, Singular: اچھے لڑکے نے کہا
Oblique, Plural: اچھے لڑکوں نے کہا
27
An Example – Verb Features
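English: walk|VB walks|VBZ walked|VBD reading|VBG
Number, Gender, Person, Form: Example
Singular, Masculine, 3, imperfective: چلتا ہے
Singular, Feminine, 3, imperfective: چلتی ہے
Plural, Feminine, 3, imperfective: چلتی ہیں
Singular, Masculine, 3, perfective: چلا تھا
Singular, Feminine, 3, perfective: چلی تھی
Singular, Masculine, 1, subjunctive: چلوں گا
infinitive: چلنا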
28
Beyond POS Tagging: Named Entity Recognition
29
Named Entities
سعید اجمل 14 اکتوبر 1977ء کو فیصل آباد میں پیدا ہوئے ("Saeed Ajmal was born on 14 October 1977 in Faisalabad")
Entity types: Person, Organization, Location, Date, Time, Money, Percent, Misc
30
Inside Outside Beginning (IOB)
Sentence: بل گیٹس مائیکروسافٹ کے بانی ہیں ("Bill Gates is the founder of Microsoft")
Word    POS    IOB
بل    Noun    Person-B
گیٹس    Noun    Person-I
مائیکروسافٹ    Noun    Organization-B
کے    AdP    O
بانی    Noun    O
ہیں    Verb    O
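The IOB labels for this sentence can be derived mechanically from entity spans. This is a minimal sketch, assuming entities are given as (start, end, label) token offsets with an exclusive end; the function name `to_iob` and the span representation are hypothetical, and the label style (Person-B, Person-I) follows the slide.

```python
def to_iob(tokens, entities):
    """Return one IOB label per token.

    entities: list of (start, end, label) token spans, end exclusive.
    """
    labels = ["O"] * len(tokens)          # every token starts Outside
    for start, end, label in entities:
        labels[start] = f"{label}-B"      # first token of the entity: Beginning
        for i in range(start + 1, end):
            labels[i] = f"{label}-I"      # remaining tokens: Inside
    return labels

tokens = ["بل", "گیٹس", "مائیکروسافٹ", "کے", "بانی", "ہیں"]
entities = [(0, 2, "Person"), (2, 3, "Organization")]

labels = to_iob(tokens, entities)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

The B label is what keeps two adjacent entities apart: without it, بل گیٹس and مائیکروسافٹ could not be distinguished from one three-token entity if they shared a type.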
31
Practical Task
Annotating an English file
WebAnno introduction
Creating an Urdu POS tagset
Annotating a news item (from today’s newspaper)
32
Part 2: Syntactic Annotation
33
Syntactic Representation
Phrase Structure vs Dependency Structure
34
Phrase Structure vs Dependency
35
Constituent and Functional Structures
Lexical Functional Grammar (LFG)
36
Urdu Examples
37
Urdu Examples
38
“Universal” Dependencies
39
Recap – An Urdu Example
ذہین لڑکیوں نے اچھی کتابیں پڑھیں تھیں ("The intelligent girls had read good books")
40
Recap – An Urdu Example
Word    Lemma
ذہین    ذہین
لڑکیوں    لڑکی
نے    نے
اچھی    اچھا
کتابیں    کتاب
پڑھیں    پڑھ
تھیں    ہے
41
Recap – An Urdu Example
Word    Lemma    POS
ذہین    ذہین    Adj
لڑکیوں    لڑکی    Noun
نے    نے    AdP
اچھی    اچھا    Adj
کتابیں    کتاب    NN
پڑھیں    پڑھ    Verb
تھیں    ہے    Aux
42
Recap – An Urdu Example
Word    Lemma    POS    Features
ذہین    ذہین    Adj    Form=Obl
لڑکیوں    لڑکی    Noun    Form=Obl
نے    نے    AdP
اچھی    اچھا    Adj    Form=Nom
کتابیں    کتاب    NN    Gend=Fem, Form=Nom, Num=Pl
پڑھیں    پڑھ    Verb    Form=Perf, Gend=Fem, Pers=3, Num=Pl
تھیں    ہے    Aux
43
Recap – An Urdu Example
44
CoNLL Format
CoNLL (Conference on Computational Natural Language Learning) format
Represents the dependency graph (and other tags) in a text file, one token per line, with these columns:
Id
Word
Lemma
Coarse-grained POS
Fine-grained POS
Features
Host (head)
Dependency type
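The column layout can be illustrated with a small round trip between Python records and tab-separated lines. This is a sketch under stated assumptions: it uses only the eight columns listed above (the CoNLL-X shared-task files carry two further columns, PHEAD and PDEPREL), the field names in `FIELDS` are hypothetical, and the heads and dependency labels for "John saw the saw" are illustrative choices, not taken from the slides.

```python
FIELDS = ["id", "word", "lemma", "cpos", "pos", "feats", "head", "deprel"]

def to_conll(rows):
    """Serialize token records as tab-separated lines, one token per line."""
    return "\n".join("\t".join(str(row[f]) for f in FIELDS) for row in rows)

def from_conll(text):
    """Parse tab-separated lines back into token records."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank separator lines
        row = dict(zip(FIELDS, line.split("\t")))
        row["id"] = int(row["id"])
        row["head"] = int(row["head"])    # 0 marks the root of the sentence
        rows.append(row)
    return rows

# "_" stands for an empty feature set; heads and labels are assumed.
sentence = [
    {"id": 1, "word": "John", "lemma": "John", "cpos": "Noun", "pos": "Noun",
     "feats": "_", "head": 2, "deprel": "subj"},
    {"id": 2, "word": "saw", "lemma": "see", "cpos": "Verb", "pos": "Verb",
     "feats": "_", "head": 0, "deprel": "root"},
    {"id": 3, "word": "the", "lemma": "the", "cpos": "Article", "pos": "Article",
     "feats": "_", "head": 4, "deprel": "det"},
    {"id": 4, "word": "saw", "lemma": "saw", "cpos": "Noun", "pos": "Noun",
     "feats": "_", "head": 2, "deprel": "obj"},
]

text = to_conll(sentence)
print(text)
parsed = from_conll(text)
```

Because each token stores only the Id of its host, the whole dependency graph fits in a flat text file, and the two occurrences of "saw" stay distinct through their different lemmas and tags.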
45
CoNLL Format
46
Tools for Annotation
GATE (General Architecture for Text Engineering), Sheffield
brat, Manchester, Tokyo, ...
WebAnno, Darmstadt
...
47
WebAnno - Demo