Annotating Urdu Corpus


1 Annotating Urdu Corpus
Tafseer Ahmed, DHA Suffa University, Karachi

2 Presentation Plan
Motivation
Part of Speech Annotation
Annotation for Named Entity (NE)
Practical task
Dependency Annotation
Practical task
Other Annotations

3 Motivation

4 Annotated Corpus – preview

5 Motivation
Software tools for processing an annotated corpus are available: Part of Speech Tagger, Named Entity Recognizer, Chunker/Shallow Parser, Language Modelling (& Grammar Learning).
The annotated corpus can be used in: Statistical Machine Translation, Information Extraction.

6 Motivation
Hence, tools are available to create software applications.
However, a missing ingredient for Pakistani languages is an annotated corpus.

7 Part of Speech (POS) annotation

8 Part of speech
A part of speech (also a word class, a lexical class, or a lexical category) is a linguistic category of words (or, more precisely, lexical items), which is generally defined by the syntactic or morphological behaviour of the lexical item in question.
Example: John/Noun saw/Verb the/Article saw/Noun

9 “Traditional” POS Tag set
Noun, Pronoun, Adjective, Verb, Adverb, Preposition, Conjunction, Interjection

10 Tag set sizes
3 Tags: Arabic (“Tradition”): اسم (noun), فعل (verb), حرف (particle)
8 Tags: English (“Tradition”)
48 Tags: Penn Treebank Tagset
282 Tags: Hardie’s Urdu Tagset

11 Granularity Problem
Adjective:
good, bad, اچھا ، برا : adjective
some, many, چند ، کئی : quantifier
first, second, پہلا ، دوسرا : ordinal
Verb:
go, read, جا، پڑھ : main verb
is, was, ہے ، تھا : helping verb / auxiliary
can, may, سک، چاہیے : modal verb

12 POS tagset for Computation

13 Sample Text - English
one/CD charge/NN of/IN filing/NN a/DT false/JJ return/NN and/CC was/VBD fined/VBN $/$ 5,000/CD and/CC sentenced/VBN to/TO 18/CD months/NNS in/IN prison/NN ./.
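The word/TAG notation in the sample is easy to read back into a program. The following is a minimal Python sketch, not part of the original slides; the function name `parse_tagged` is hypothetical, and it assumes that the tag is whatever follows the last slash in each whitespace-separated token.

```python
from collections import Counter


def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so '$/$' gives ('$', '$')
        # and './.' gives ('.', '.').
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


# Excerpt of the tagged sample sentence from the slide.
sample = ("was/VBD fined/VBN $/$ 5,000/CD and/CC sentenced/VBN "
          "to/TO 18/CD months/NNS in/IN prison/NN ./.")

pairs = parse_tagged(sample)
tag_counts = Counter(tag for _, tag in pairs)

for word, tag in pairs:
    print(f"{word}\t{tag}")
```

Counting the tags (`tag_counts`) is often the first sanity check on an annotated file, since unexpected or misspelled tags show up as rare entries.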

14 Urdu Tagset - Issues
Granularity: from Hardie (282 tags) to Hassan (42 tags)
Syntactic versus functional tag: Noun, Verb

15 Urdu Tagset - Issues
Granularity: from Hardie (282 tags) to Hassan (42 tags)

16 Urdu Tagset - Issues
Coarse-grained tags: better for machine learning; easy/fast annotation
Fine-grained tags: more information

17 Urdu Tagset - Issues
Syntactic versus functional behavior
Noun (NN): میں نے پانی پیا (I drank water) vs. میں نے سبق یاد کیا (I memorized the lesson)
میں گھر کے اندر گیا (I went inside the house)

18 CLE Urdu Tagset

19 Sample Text - Urdu CLE POS Tagged Data

20 Tagset for other languages
Sindhi: J A Mahar (2010), Mutee u Rahman (2012)

21 “Universal” Tagset
Naseem et al., 2010
Google (Petrov et al., 2012)
Used (and modified) by: Nivre’s Universal Dependencies, TweeboParser (CMU)

22 Google “Universal” Tagset
NOUN (nouns): کتاب، کراچی
VERB (verbs): پڑھتا، ہے، رہا
ADJ (adjectives): اچھا، چند، پہلا
ADV (adverbs): روزانہ، بہت، تقریباً
PRON (pronouns): میں، تم، وہ
DET (determiners and articles): وہ

23 “Universal” Tagset
ADP (prepositions and postpositions): نے، اندر
NUM (numerals): 2, دو
CONJ (conjunctions): اور، یا، لیکن
PRT (particles)
‘.’ (punctuation marks): -، ؟
X (a catch-all for other categories)
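A coarse tagset of this kind is typically related to a finer one through a lookup table. The sketch below is an illustration only: the dictionary `PTB_TO_UNIVERSAL` is a partial mapping that covers just a few Penn Treebank tags, the helper `to_universal` is a hypothetical name, and falling back to `X` for unknown tags is an assumption based on the catch-all category listed above.

```python
# Partial, illustrative mapping from Penn Treebank tags to the coarse
# "universal" tags; a real mapping file would cover the whole tagset.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VBD": "VERB", "VBN": "VERB",
    "JJ": "ADJ",
    "CD": "NUM",
    "IN": "ADP",
    "DT": "DET",
    "CC": "CONJ",
    "TO": "PRT",
    ".": ".",
}


def to_universal(tagged):
    """Replace each fine-grained tag by its coarse tag; unknown tags become 'X'."""
    return [(word, PTB_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged]


fine = [("sentenced", "VBN"), ("to", "TO"), ("18", "CD"),
        ("months", "NNS"), ("in", "IN"), ("prison", "NN"), (".", ".")]
coarse = to_universal(fine)

for word, tag in coarse:
    print(f"{word}\t{tag}")
```

The mapping only goes one way: several fine tags collapse into one coarse tag, so the fine-grained information cannot be recovered from the coarse annotation.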

24 Nivre’s “Universal” Tagset
ADJ: adjective
ADP: adposition
ADV: adverb
AUX: auxiliary verb
CONJ: coordinating conjunction
DET: determiner
INTJ: interjection
NOUN: noun
NUM: numeral
PART: particle
PRON: pronoun
PROPN: proper noun
PUNCT: punctuation
SCONJ: subordinating conjunction
SYM: symbol
VERB: verb
X: other

25 An Example (using Petrov Tagset)
ذہین لڑکی نے روزانہ ایک اچھی کتاب پڑھی (The intelligent girl read a good book daily)
نے Adp
روزانہ Adv
ایک Num
اچھی Adj
کتاب Noun
پڑھی Verb

26 An Example – Noun Features
English: boy|NN boys|NNS
Form, Number: Example
Nominative, Singular: اچھا لڑکا آیا
Nominative, Plural: اچھے لڑکے آئے
Oblique, Singular: اچھے لڑکے نے کہا
Oblique, Plural: اچھے لڑکوں نے کہا

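27 An Example – Verb Features
English: walk|VB walks|VBZ walked|VBD reading|VBG
Number, Gender, Person, Form: Example
Singular, Masculine, 3, imperfective: چلتا ہے
Singular, Feminine, 3, imperfective: چلتی ہے
Plural, Feminine, 3, imperfective: چلتی ہیں
Singular, Masculine, 3, perfective: چلا تھا
Singular, Feminine, 3, perfective: چلی تھی
Singular, Masculine, 1, subjunctive: چلوں گا
infinitive: چلنا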

28 Named Entity Recognition
Beyond POS Tagging: Named Entity Recognition

29 Named Entities
سعید اجمل 14 اکتوبر 1977ء کو فیصل آباد میں پیدا ہوئے (Saeed Ajmal was born on 14 October 1977 in Faisalabad)
Person, Organization, Location, Date, Time, Money, Percent, Misc

30 Inside Outside Beginning (IOB)
بل گیٹس مائیکروسافٹ کے بانی ہیں (Bill Gates is the founder of Microsoft)
Word, POS, IOB:
بل Noun Person-B
گیٹس Noun Person-I
مائیکروسافٹ Noun Organization-B
کے AdP O
بانی Noun O
ہیں Verb O
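To show how IOB tags are read back into entities, here is a minimal Python sketch. It is not from the slides: the function name `iob_to_spans` is hypothetical, and it assumes the tag layout used in this example (entity type first, then `-B` or `-I`, with `O` for tokens outside any entity).

```python
def iob_to_spans(tokens, tags):
    """Group tokens into (entity_type, text) spans from 'Type-B' / 'Type-I' / 'O' tags."""
    spans = []
    current_type, current_words = None, []

    def flush():
        # Close the entity being built, if there is one.
        nonlocal current_type, current_words
        if current_words:
            spans.append((current_type, " ".join(current_words)))
        current_type, current_words = None, []

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        ent_type, _, position = tag.rpartition("-")
        # A 'B' tag, or an 'I' tag of a different type, starts a new entity.
        if position == "B" or ent_type != current_type:
            flush()
            current_type = ent_type
        current_words.append(token)
    flush()
    return spans


tokens = ["بل", "گیٹس", "مائیکروسافٹ", "کے", "بانی", "ہیں"]
tags = ["Person-B", "Person-I", "Organization-B", "O", "O", "O"]

spans = iob_to_spans(tokens, tags)
for ent_type, text in spans:
    print(ent_type, text)
```

The `B` marker is what lets two adjacent entities of the same type stay separate; without it, consecutive names would merge into one span.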

31 Practical Task
Annotating English file
WebAnno Introduction
Creating Urdu POS Tagset
Annotating a news item (from today’s newspaper)

32 Part 2: Syntactic Annotation

33 Syntactic Representation
Phrase Structure vs Dependency Structure

34 Phrase Structure vs Dependency

35 Constituent and Functional Structures
Lexical Functional Grammar (LFG)

36 Urdu Examples

37 Urdu Examples

38 “Universal” Dependencies

39 Recap – An Urdu Example
ذہین لڑکیوں نے اچھی کتابیں پڑھیں تھیں

40 Recap – An Urdu Example
ذہین لڑکیوں نے اچھی کتابیں پڑھیں تھیں
Lemma: لڑکیوں → لڑکی, اچھی → اچھا, پڑھیں → پڑھ, تھیں → ہے

41 Recap – An Urdu Example
ذہین لڑکیوں نے اچھی کتابیں پڑھیں تھیں
Lemma: لڑکیوں → لڑکی, اچھی → اچھا, پڑھیں → پڑھ, تھیں → ہے
POS: لڑکیوں Noun, نے AdP, اچھی Adj, کتابیں NN, پڑھیں Verb, تھیں Aux

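42 Recap – An Urdu Example
ذہین لڑکیوں نے اچھی کتابیں پڑھیں تھیں
Lemma: لڑکیوں → لڑکی, اچھی → اچھا, پڑھیں → پڑھ, تھیں → ہے
POS: لڑکیوں Noun, نے AdP, اچھی Adj, کتابیں NN, پڑھیں Verb, تھیں Aux
Features, in the slide's left-to-right cell order: Form=Perf Gend=Fem Pers=3 Num=Pl, Gend=Fem Form=Nom Num=Pl, Nom, Form=Obl, Obl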

43 Recap – An Urdu Example

44 CoNLL Format
CoNLL (Conference on Computational Natural Language Learning) format
Representing a graph (and other tags) in a text file
Columns: Id, Word, Lemma, Coarse Grained POS, Fine Grained POS, Features, Host, Dependency Type
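The eight columns listed on this slide map naturally onto one record per token. The following Python sketch is an illustration under stated assumptions, not the format specification: it assumes tab-separated columns, `_` for an empty Features field, `|`-separated `key=value` features, and a Host value of 0 for the root. The names `Token`, `parse_feats` and `parse_conll` are hypothetical, and the sample annotation of "John saw the saw" is made up for the example.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    id: int
    word: str
    lemma: str
    cpos: str             # coarse-grained POS
    pos: str              # fine-grained POS
    feats: Dict[str, str]
    host: int             # id of the head token; 0 is assumed to mean the root
    deprel: str           # dependency type


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Num=Pl|Gend=Fem' into a dict; '_' is assumed to mean no features."""
    if field == "_":
        return {}
    return dict(item.split("=", 1) for item in field.split("|"))


def parse_conll(block: str) -> List[Token]:
    """Parse one sentence: one token per line, eight tab-separated columns."""
    tokens = []
    for line in block.strip().splitlines():
        c = line.split("\t")
        tokens.append(Token(int(c[0]), c[1], c[2], c[3], c[4],
                            parse_feats(c[5]), int(c[6]), c[7]))
    return tokens


# Made-up annotation, for illustration only.
rows = [
    ["1", "John", "John", "NOUN", "NNP", "Num=Sg", "2", "nsubj"],
    ["2", "saw", "see", "VERB", "VBD", "Tense=Past", "0", "root"],
    ["3", "the", "the", "DET", "DT", "_", "4", "det"],
    ["4", "saw", "saw", "NOUN", "NN", "Num=Sg", "2", "dobj"],
]
sample = "\n".join("\t".join(r) for r in rows)

tokens = parse_conll(sample)
dependents_of_verb = [t.word for t in tokens if t.host == 2]
print(dependents_of_verb)
```

Because every token stores only the id of its host, the whole dependency graph fits in a flat text file, which is the point made on the slide.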

45 CoNLL Format

46 Tools for Annotation
GATE (General Architecture for Text Engineering), Sheffield
Brat, Manchester, Tokyo, ...
WebAnno, Darmstadt
...

47 WebAnno - Demo

