The DAD corpora and their uses. Costanza Navarretta, Centre for Language Technology. LREC 2010, Malta, 17-23 May. Funded by Danish Research Councils.

Presentation transcript:

Slide 1: The DAD corpora and their uses
Costanza Navarretta, Centre for Language Technology
Funded by Danish Research Councils - Sussi Olsen, CST
LREC 2010, Malta, 17-23 May

Slide 2: The DAD corpora
Parallel or comparable texts and dialogues in Danish and Italian, annotated with information on third-person singular neuter personal pronouns and singular demonstrative pronouns (3sn).
Focus on abstract anaphoric uses (abstract pronouns, AA), i.e. the antecedent is a copula predicate, a verbal phrase, a clause or a discourse segment.

Slide 3: Abstract Anaphora in English
it, this, that
Strong preference for the use of demonstrative pronouns to refer to entities introduced in discourse by clauses:
% of occurrences in written corpus (Webber 1991).
Similar figures in English written and spoken data (i.a. Byron & Allen 1998, Gundel et al. 2003, 2005).

Slide 4: Abstract Anaphora in Danish
written Danish: det (it/this/that), dette (this)
spoken Danish: unstressed det (it); d'et (this/that); d'et h'er (this); d'et d'er (that); dette (this), very seldom

Slide 5: Abstract Anaphora in Italian
zero anaphora (subject pro-drop language);
clitics -lo, -ne, -ci (it);
personal pronouns lo, ne, ci;
demonstrative pronouns: ciò (this/that), questo (this), quello (that)

Slide 6: The Annotations
Texts: structural information, PoS and lemma (various tagsets)
Spoken data: PoS, (lemma), stress, (prosody, phrases), speakers, interaction segments, utterances, timestamps
All data (Navarretta & Olsen LREC-2008):
3sn: pronominal function (9), syntactic function
anaphoric occurrences: referential links and their type, antecedents, syntactic type of antecedents, anaphoric distance
when AA: also semantic type of reference
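The annotation layers listed on this slide can be pictured as one record per pronoun occurrence. The following is only a minimal sketch under stated assumptions: the class name, every field name and all example values (such as "abstract anaphor" or "event") are hypothetical and are not taken from the DAD annotation scheme, and the unit used for anaphoric distance is not specified in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PronounAnnotation:
    """One annotated 3sn pronoun occurrence (all field names are hypothetical)."""
    token: str                                 # surface form, e.g. "det" or "ciò"
    pronominal_function: str                   # one of the nine annotated functions
    syntactic_function: str                    # e.g. "subject", "object"
    antecedent: Optional[str] = None           # antecedent text, if anaphoric
    antecedent_type: Optional[str] = None      # e.g. "clause", "VP"
    link_type: Optional[str] = None            # type of referential link
    anaphoric_distance: Optional[int] = None   # unit of measure is an assumption
    semantic_type: Optional[str] = None        # only filled for abstract anaphora

    def is_anaphoric(self) -> bool:
        """An occurrence counts as anaphoric here if an antecedent is recorded."""
        return self.antecedent is not None


# Illustrative values only; they do not come from the corpora.
example = PronounAnnotation(
    token="det",
    pronominal_function="abstract anaphor",
    syntactic_function="subject",
    antecedent="(a preceding clause)",
    antecedent_type="clause",
    link_type="anaphoric",
    anaphoric_distance=1,
    semantic_type="event",
)
```

Keeping the anaphora-specific fields optional mirrors the slide's layering: every 3sn occurrence has a pronominal and a syntactic function, while links, antecedents and semantic type apply only to a subset.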

Slide 7: Spoken Corpora (da , it 70000)
1. dialogues from AVIP, Italian map-task corpus, Pisa, Napoli, Bari (ftp://ftp.cirass.unina.it/cirass/avip);
2. dialogues and monologues from the Danish map-task corpus DanPASS (Grønnum, 2006);
3. multiparty spontaneous dialogues from the Danish LANCHART corpus (Gregersen, 2007);
4. transcriptions of TV interviews.

Slide 8: Written corpora (da 60000, it 50000)
1. Pirandello's (1922) stories and Danish translations;
2. parallel Danish and Italian EU texts;
3. articles from the Italian financial newspaper Il Sole 24 Ore;
4. Danish juridical texts;
5. extracts from the Danish general language PAROLE corpus (Keson and Norling-Christensen, 1998).

Slide 9
Columns: Pronoun, Non-ref, IA, AA %, Other, Total
Danish Texts
det: %81708
dette: %498
total: %85816
Danish Monologues
unstressed: %54210
stressed: %45130
total: %99340
Danish Dialogues
unstressed: %
stressed: %
total: %

Slide 10
Columns: Pronoun, Non-ref, IA, AA, Other, Total
Italian Texts
zero: (48%)22392
clitic: 01002 (5%)4106
personal: (30%)4181
demonstrative: 016 7 (17%)427
total: (100%)34706
Italian Dialogues
zero: 12642 (75%)372
clitic:
personal: (20%)56195
demonstrative: 073 (5%)10
total: (100%)71308

Slide 11
Columns: Corpus, Antecedent, Pronoun, Total, Pronoun, Total
Danish Texts (pronouns: det, dette)
Clause: det 72, dette 60
Discourse Seg.: 67
C. predicate: 134
VP: 173
abstract pron: 53
Danish Monologs (pronouns: unstressed det, stressed det)
Clause: unstressed det 22, stressed det 4
Discourse Seg.: 10
VP: 12
C. predicate: 8562
abstract pron: 1920
Danish Dialogs (pronouns: unstressed det, stressed det)
Clause: unstressed det 165, stressed det 122
Discourse Seg.: 83
VP: 5235
C. predicate: 20857
abstract pron:

Slide 12
Columns: Corpus, Antecedent, Pronoun, Total, Pronoun, Total
Italian texts
Clause: zero 17; ciò 4
Discourse S: 1; 2
Clause: lo, ne 10; questo -
Discourse S: 1; 1
C. Predicate: 1; -
Italian dialogues
Clause: zero 41; questo 3
Clause: lo 3
VP: 5; -
C. Predicate: ci 2; -

Slide 13: Discussion
Many factors influence the use of pronouns, see i.a. Hajičová et al. (1990), Borthen et al. (1997), Kaiser (2000), Kaiser and Trueswell (2004), Gundel et al. (2003), Navarretta (2002, 2005).
Navarretta (WARII-2008): the differences in the use of AA pronouns in Danish and Italian with respect to English are systematic.
Language-specific characteristics can partly explain these differences.

Slide 14: Pronominal Systems
Pronouns for inanimate entities:
English: 1 gender
Danish and Italian: 2 inanimate genders
Danish: common and neuter; only the latter can be an abstract anaphor
Italian: feminine and masculine; only the latter can be an abstract anaphor
In English it is more necessary to restrict interpretation, via the distinction between personal and demonstrative pronouns.

Slide 15: Syntax
Constructions such as clefts and left dislocations are much more frequent in Danish than in English and Italian; thus in Danish the clause is often the entity which is in "focus" (Gundel et al. 1993). This partly explains the frequent use of personal pronouns (det and unstressed det) with clausal antecedents.
Word order is relatively free in Italian, as opposed to Danish and English; the use of abstract substantives in Italian restricts the antecedent search space.

Slide 16: Machine Learning Experiments on Danish data (Navarretta, DAARC 2009)
Classifying the function of 3sn pronouns using the pronominal context (n-grams of various sizes) and the annotated function (only in training); see i.a. Evans (2000), Müller (2007), Hoste et al. (2007).
Several classifiers were run on the data, as proposed by Daelemans et al. (2005), but with more types of data and a more fine-grained classification.
Weka (Witten and Frank, 2005): results evaluated using 10-fold cross-validation.
Baseline: results of ZeroR, which proposes the most frequent nominal category.
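The set-up on this slide (context n-grams as features, a ZeroR baseline, 10-fold cross-validation) can be illustrated with a small sketch. It is written in plain Python rather than Weka, and it rests on assumptions: the function names, the "<pad>" symbol, the symmetric window and the round-robin (non-stratified) fold assignment are all hypothetical simplifications, not the configuration used in the experiments.

```python
from collections import Counter


def context_window(tokens, index, size):
    """Return `size` tokens to the left and right of tokens[index], padded."""
    pad = ["<pad>"] * size
    padded = pad + list(tokens) + pad
    centre = index + size  # position of the pronoun in the padded list
    return padded[centre - size:centre] + padded[centre + 1:centre + size + 1]


def zero_r(train_labels):
    """ZeroR: always predict the most frequent label seen in training."""
    return Counter(train_labels).most_common(1)[0][0]


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for a round-robin k-fold split."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def baseline_accuracy(labels, k=10):
    """Pooled ZeroR accuracy over the k held-out folds."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k):
        majority = zero_r([labels[i] for i in train])
        correct += sum(1 for i in test if labels[i] == majority)
    return correct / len(labels)


# Illustrative sentence and labels only; they do not come from the corpora.
features = context_window(["jeg", "tror", "det", "er", "rigtigt"], 2, 2)
toy_labels = ["AA"] * 14 + ["IA"] * 6
toy_baseline = baseline_accuracy(toy_labels)
```

Any trained classifier would be evaluated on the same folds, so that its score can be compared with the ZeroR figure.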

Slide 17: Results (F-score)
texts: 62.4%, monologues: 64.7%, map-task dialogues: 55.4%, multiparty dialogues: 32.9% (improvement of 36.4%, 30.7%, 33.7% and 19.1% respectively with respect to the baseline);
results on texts and map-task data are in line with results obtained on more restricted tasks, e.g. recognition of het by Hoste et al. (2007);
recognition of non-referential pronouns is slightly lower than in i.a. Boyd et al. (2005);
adding PoS and lemma information to the data improves classification, but not significantly, the same result as in Hoste et al. (2007).
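To make the F-score and "improvement over the baseline" figures concrete, here is a minimal sketch of how such numbers can be computed. It assumes a support-weighted average of per-class F1 and reads the improvement as a plain difference between two scores; the slides do not state which averaging or which notion of improvement was used, so both are assumptions, as are the function names and the toy labels.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} for every label in gold."""
    scores = {}
    for label in set(gold):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def weighted_f1(gold, predicted):
    """Average per-class F1, weighting each class by its frequency in gold."""
    support = Counter(gold)
    scores = per_class_scores(gold, predicted)
    return sum(scores[label][2] * support[label] for label in support) / len(gold)


# Toy data only: a classifier output and a majority-class (ZeroR-style) output.
gold = ["AA", "AA", "IA", "IA"]
classifier_out = ["AA", "IA", "IA", "IA"]
majority_out = ["IA", "IA", "IA", "IA"]
improvement = weighted_f1(gold, classifier_out) - weighted_f1(gold, majority_out)
```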

Slide 18: Classification experiments on Italian
Columns: Corpus, Algorithm, Precision, Recall, F-score
Texts: Baseline, SMO, NBTree, Naive Bayes
AVIP dialogues: Baseline, Kstar, SMO, NBTree

Slide 19: Results on Italian data
Improvement of classification with respect to the baseline is 55.1% for texts and 35.9% for dialogues.
There are more types of pronouns in Italian than in Danish, thus the use of each type of pronoun is much more restricted in the former language than in the latter.
Adding PoS and lemma information decreases the performance of the classifier, but not significantly.

Slide 20: Conclusion and future work
annotations in the DAD corpora: characteristics of the use of 3sn in Danish and Italian;
differences in the use of AA in Danish, English and Italian can be explained in terms of the languages' pronominal systems and syntax;
annotations are useful for automatically distinguishing the function of 3sn;
to do: look at the relation between pronouns, clausal types of antecedents and anaphoric distance; look at parallel data; investigate resolution; investigate the use of lexical resources...