LREC 2010, Malta, 17-23 May. Centre for Language Technology. The DAD corpora and their uses. Costanza Navarretta. Funded by Danish Research.


1 The DAD corpora and their uses
Costanza Navarretta, costanza@hum.ku.dk
Centre for Language Technology
LREC 2010, Malta, 17-23 May
Funded by Danish Research Councils – Sussi Olsen, CST

2 The DAD corpora
Parallel or comparable texts and dialogues in Danish and Italian, annotated with information on third-person singular neuter personal pronouns and singular demonstrative pronouns (3sn).
Focus on abstract anaphoric uses (abstract pronouns, AA), i.e. cases where the antecedent is a copula predicate, a verbal phrase, a clause or a discourse segment.

3 Abstract Anaphora in English
Pronouns: it, this, that.
Strong preference for demonstrative pronouns when referring to entities introduced in discourse by clauses: 83.7% of occurrences in a written corpus (Webber 1991).
Similar figures in English written and spoken data (i.a. Byron & Allen 1998, Gundel et al. 2003, 2005).

4 Abstract Anaphora in Danish
Written Danish: det (it/this/that), dette (this).
Spoken Danish: unstressed det (it); d'et (this/that); d'et h'er (this); d'et d'er (that); dette (this), very seldom.

5 Abstract Anaphora in Italian
zero anaphora (subject pro-drop language);
clitics -lo, -ne, -ci (it);
personal pronouns lo, ne, ci;
demonstrative pronouns: ciò (this/that), questo (this), quello (that).

6 The Annotations
Texts: structural information, PoS and lemma (various tagsets).
Spoken data: PoS, (lemma), stress, (prosody, phrases), speakers, interaction segments, utterances, timestamps.
All data (Navarretta & Olsen LREC-2008):
3sn: pronominal function (9), syntactic function;
anaphoric occurrences: referential links and their type, antecedents, syntactic type of antecedents, anaphoric distance;
when AA: also semantic type of reference.

7 Spoken Corpora (da 100000, it 70000)
1. dialogues from AVIP, Italian map-task corpus, Pisa, Napoli, Bari (ftp://ftp.cirass.unina.it/cirass/avip);
2. dialogues and monologues from Danish map-task corpus DanPASS (Grønnum, 2006);
3. multiparty spontaneous dialogues from the Danish LANCHART corpus (Gregersen, 2007);
4. transcriptions of TV-interviews.

8 Written corpora (da 60000, it 50000)
1. Pirandello's (1922) stories and Danish translations;
2. parallel Danish and Italian EU texts;
3. articles from Italian financial newspaper, Il Sole 24 Ore;
4. Danish juridical texts;
5. extracts from the Danish general language PAROLE corpus (Keson and Norling-Christensen, 1998).

9
Pronoun         Non-ref   IA     AA     %       Other   Total
Danish Texts
det             345       152    130    65%     81      708
dette           0         23     71     35%     4       98
total           345       175    201    100%    85      816
Danish Monologues
unstressed      22        107    27     73%     54      210
stressed        1         74     10     17%     45      130
total           23        181    37     100%    99      340
Danish Dialogues
unstressed      158       483    299    59%     467     1407
stressed        10        185    204    41%     197     596
total           168       668    503    100%    664     2003
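As a rough check of the percentage column, each pronoun type's share of the AA occurrences can be recomputed from the AA counts on this slide. The sketch below is plain Python; the dictionary layout and the name `aa_shares` are illustrative only and are not part of any DAD tooling.

```python
# AA counts per pronoun type, copied from the Danish table on this slide.
aa_counts = {
    "texts": {"det": 130, "dette": 71},
    "monologues": {"unstressed det": 27, "stressed det": 10},
    "dialogues": {"unstressed det": 299, "stressed det": 204},
}


def aa_shares(counts):
    """Return each pronoun's share of the AA occurrences, in whole percent."""
    total = sum(counts.values())
    return {pronoun: round(100 * n / total) for pronoun, n in counts.items()}


# One dictionary of percentage shares per corpus type.
shares = {corpus: aa_shares(counts) for corpus, counts in aa_counts.items()}

for corpus, by_pronoun in shares.items():
    print(corpus, by_pronoun)
```

With these counts the stressed monologue pronouns come out at 27%, where the slide prints 17%; the other five shares agree with the slide after rounding.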

10
Pronoun          Non-ref   IA     AA           Other   Total
Italian Texts
zero             34        317    19 (48%)     22      392
clitic           0         100    2 (5%)       4       106
personal         0         165    12 (30%)     4       181
demonstrative    0         16     7 (17%)      4       27
total            34        598    40 (100%)    34      706
Italian Dialogues
zero             1         26     42 (75%)     3       72
clitic           0         19     0            2       21
personal         0         128    11 (20%)     56      195
demonstrative    0         7      3 (5%)       10      20
total            1         180    56 (100%)    71      308

11
Corpus, Antecedent, Pronoun, Total, Pronoun, Total
Danish Texts (det; dette)
Clause: det 72, dette 60
Discourse Seg.: 67
C. predicate: 134
VP: 173
abstract pron: 53
Danish Monologues (unstressed det; stressed det)
Clause: unstressed det 22, stressed det 4
Discourse Seg.: 10
VP: 12
C. predicate: 8562
abstract pron: 1920
Danish Dialogues (unstressed det; stressed det)
Clause: unstressed det 165, stressed det 122
Discourse Seg.: 83
VP: 5235
C. predicate: 20857
abstract pron: 149 55

12
Corpus              Antecedent      Pronoun   Total   Pronoun   Total
Italian texts       Clause          zero      17      ciò       4
                    Discourse S               1                 2
                    Clause          lo, ne    10      questo    -
                    Discourse S               1                 1
                    C. Predicate              1                 -
Italian dialogues   Clause          zero      41      questo    3
                    Clause          lo        3
                    VP                        5                 -
                    C. Predicate    ci        2                 -

13 Discussion
Many factors influence the use of pronouns, see i.a. Hajičová et al. (1990), Borthen et al. (1997), Kaiser (2000), Kaiser and Trueswell (2004), Gundel et al. (2003), Navarretta (2002, 2005).
Navarretta (WARII-2008): the differences in the use of AA pronouns in Danish and Italian with respect to English are systematic.
Language-specific characteristics can partly explain these differences.

14 Pronominal Systems
Pronouns for inanimate entities:
English: 1 gender.
Danish and Italian: 2 inanimate genders.
Danish: common and neuter; only the latter can be an abstract anaphor.
Italian: feminine and masculine; only the latter can be an abstract anaphor.
In English it is more necessary to restrict the interpretation, via the distinction between personal and demonstrative pronouns.

15 Syntax
Constructions such as clefts and left dislocations are much more frequent in Danish than in English and Italian, so in Danish the clause is often the entity which is in "focus" (Gundel et al. 1993); this partly explains the frequent use of personal pronouns (det and unstressed det) with clausal antecedents.
Word order is relatively free in Italian as opposed to Danish and English; the use of abstract substantives in Italian restricts the antecedent search space.

16 Machine Learning Experiments on Danish data (Navarretta, DAARC 2009)
Classifying the function of 3sn-pronouns using the pronominal context (n-grams of various sizes) and the annotated function (only in training), see i.a. Evans (2000), Müller (2007), Hoste et al. (2007).
Several classifiers run on the data as proposed by Daelemans et al. (2005), but with more types of data and a more fine-grained classification.
Weka (Witten and Frank, 2005): results evaluated using 10-fold cross-validation.
Baseline: results of ZeroR, which proposes the most frequent nominal category.
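The setup on this slide can be pictured with a small sketch: a window of tokens around each 3sn pronoun serves as the feature vector, and a ZeroR-style baseline always predicts the most frequent function label seen in training. This is plain Python rather than Weka, and it is not the code used in the experiments; the padding symbol, the window size, the toy sentence and the labels are all assumptions made for illustration.

```python
from collections import Counter

PAD = "<pad>"  # hypothetical padding symbol for sentence edges


def context_window(tokens, index, size):
    """Return `size` tokens on each side of tokens[index], padded at the edges."""
    padded = [PAD] * size + list(tokens) + [PAD] * size
    centre = index + size
    return padded[centre - size:centre] + padded[centre + 1:centre + size + 1]


class ZeroR:
    """Baseline that ignores the features and predicts the majority label."""

    def fit(self, labels):
        # Remember the single most frequent label in the training data.
        self.majority = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, instances):
        return [self.majority for _ in instances]


# Toy Danish sentence ("it rains today"); the pronoun is at position 0.
features = context_window(["det", "regner", "i", "dag"], 0, 2)

# Hypothetical training labels for four pronoun occurrences.
baseline = ZeroR().fit(["IA", "AA", "IA", "non-ref"])
predictions = baseline.predict([features])

print(features)
print(predictions)
```

Any real classifier would be compared against `ZeroR` on the same folds, so that the reported improvement is measured against a classifier that has seen the same label distribution.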

17 Results (F-score)
Texts: 62.4%, monologues: 64.7%, map-task dialogues: 55.4%, multiparty dialogues: 32.9% (improvements of 36.4%, 30.7%, 33.7% and 19.1% respectively with respect to the baseline).
Results on texts and map-task data are in line with results obtained on more restricted tasks, e.g. recognition of het by Hoste et al. (2007).
Recognition of non-referential pronouns is slightly lower than in i.a. Boyd et al. (2005).
Adding PoS and lemma information to the data improves classification, but not significantly; same result as in Hoste et al. (2007).

18 Classification experiments on Italian
Corpus            Algorithm      Precision   Recall   F-score
Texts             Baseline       18.5        43       25.8
                  SMO            79.8        84.3     81.1
                  NBTree         78          83.6     80.4
                  Naive Bayes    71.6        76       73.5
AVIP dialogues    Baseline       25.3        50.3     33.7
                  Kstar          68.9        72.4     69.6
                  SMO            63.5        69.2     65.7
                  NBTree         63          68.5     64.5
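The baseline row for the AVIP dialogues is consistent with what a majority-class predictor yields under support-weighted averaging. The sketch below rests on assumptions not stated on the slide: that the reported scores are averages weighted by class frequency, that classes which are never predicted contribute zero, and that the majority-class share can be read off the baseline recall (0.503). The function name is illustrative.

```python
def zeror_weighted_scores(majority_share):
    """Support-weighted precision, recall and F-score of a majority-class predictor.

    For the majority class: precision = p, recall = 1, F = 2p / (1 + p).
    All other classes are never predicted and are assumed to score 0.
    Weighting by class frequency leaves only the majority term, scaled by p.
    """
    p = majority_share
    precision = p * p
    recall = p * 1.0
    f_score = p * (2 * p / (1 + p))
    return precision, recall, f_score


# Majority-class share assumed from the AVIP baseline recall on this slide.
precision, recall, f_score = zeror_weighted_scores(0.503)

print(round(100 * precision, 1), round(100 * recall, 1), round(100 * f_score, 1))
```

Under these assumptions the three values round to 25.3, 50.3 and 33.7, matching the AVIP baseline row. For the other rows the F-score is not the harmonic mean of the printed precision and recall, which is what one would expect if all three are per-class averages.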

19 Results on Italian data
Improvement of classification with respect to the baseline is 55.1% for texts, 35.9% for dialogues.
There are more types of pronouns in Italian than in Danish, thus the use of each type of pronoun is much more restricted in the former language than in the latter.
Adding PoS and lemma information decreases the performance of the classifier, but not significantly.

20 Conclusion and future work
Annotations in the DAD corpora: characteristics of the use of 3sn in Danish and Italian.
Differences in the use of AA in Danish, English and Italian can be explained in terms of the languages' pronominal systems and syntax.
Annotations are useful to automatically distinguish the function of 3sn.
To do: look at the relation between pronouns, clausal types of antecedents and anaphoric distance; look at parallel data; investigate resolution; investigate the use of lexical resources...

