DiscAn: Towards a Discourse Annotation system for Dutch language corpora, or why and how we would want to annotate corpora on the discourse level. Ted Sanders.


DiscAn: Towards a Discourse Annotation system for Dutch language corpora
or why and how we would want to annotate corpora on the discourse level
Ted Sanders
Utrecht Institute of Linguistics, Universiteit Utrecht

Coherence in discourse
- "Many tourists come to Switzerland. They want to see the mountains." Referential coherence
- "Many tourists come to Switzerland because they want to see the mountains." Relational coherence
- "John was happy. It was a Saturday." We do not need explicit linguistic indicators.

Coherence in discourse, 2
- Coherence is a cognitive phenomenon.
- Coherence relations are conceptual relations that constitute coherence between discourse segments (minimally clauses).
- Connectives, cue phrases and other lexical markers can, but need not, make this coherence explicit.
- Coherence relations are the building blocks of discourse structure (causal, contrastive, additive).

In annotated corpora?
- The discourse level is largely lacking in annotated Dutch corpora.
- There is an international tendency towards discourse annotation:
  - The Penn Discourse Treebank (Prasad, Joshi, Webber et al.)
  - The Potsdam Corpus (Stede et al.)
- At the same time, we do have much data on Dutch, on connectives:
  - Mainly causal
  - Across media (various written genres, spoken, chat)
  - At various stages of annotation

Larger research issues in the field
To be answered on the basis of annotated corpora:
- The meaning and use of connectives vary across languages: omdat vs. parce que vs. weil
- Semantic-pragmatic restrictions on use
- Similarities and differences in acquisition
We will start discourse annotation with a study on the category of causals.

Annotation
Some criteria:
- Order: cause – consequence and vice versa
- Subjectivity: want, puisque, since, denn vs. omdat, parce que, because, weil
- Linguistic marking: yes/no, perspective, etc.
- Characteristics of the segments: propositional attitude, modality, tense, syntax…

Current situation: 15 studies…
[Table: excerpt from an existing coding spreadsheet. Columns: corpus, conn, fragmnr, s1, s2, modality s1, modality s2, protag s1, protag s2, relation. All rows shown are from corpus 7 with the connective omdat; cell values include "irrelevant want feit" (irrelevant, because fact), "Spreker/auteur" (speaker/author), "Expliciet aanwezig" (explicitly present) and "Impliciet" (implicit).]

The DiscAn project has five main goals:
1. standardize and open up an existing set of Dutch corpus analyses of coherence relations and discourse connectives;
2. develop the foundations for a discourse annotation system;
3. improve the metadata by investigating existing CMDI profiles or adding new profiles suited for this type of analysis;
4. inventory the required categories and investigate to what extent these could be included in ISOcat categories for discourse;
5. build an interdisciplinary discourse community of text, corpus and computational linguists to initiate further research in a European context.

A model of analysis
Var 1: Name of the coder (values: the names of the two authors)
Var 2: Number of the fragment (the values were present in the fragments)
Var 3: Utterance number(s) of the segment preceding the connective want (S1)
Var 4: Utterance number(s) of the segment following want (S2)
Var 5: Propositional attitude of S1 (values: action, fact, opinion, observation, knowledge, experience)
Var 6: Propositional attitude of S2 (values: action, fact, opinion, observation, knowledge, experience)
Var 7: Identity of the conceptualizer in S1 (values: speaker/1st person, second person, third person (nominal or pronominal, generic person))
Var 8: Identity of the conceptualizer in S2 (values: speaker/1st person, second person, third person (nominal or pronominal, generic person))
Var 9: Type of relation expressed by want (values: non-volitional content, volitional content, explanation of a mental state, epistemic, textual, speech act)
Var 10: Syntactic modification of want (values: no modification, coordinating conjunction, intensifier, focus element)
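The ten variables above could be stored as one record per coded fragment, with the listed value sets checked automatically. The following is only a minimal sketch under stated assumptions: the class name, field names and the problems() helper are hypothetical and not part of DiscAn, the conceptualizer values are collapsed to three broad categories, and the example record is invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Value sets copied from the variable list; the conceptualizer set is a
# simplification (sub-types of "third person" are not modelled here).
ATTITUDES = {"action", "fact", "opinion", "observation", "knowledge", "experience"}
CONCEPTUALIZERS = {"speaker/1st person", "second person", "third person"}
RELATIONS = {
    "non-volitional content", "volitional content",
    "explanation of a mental state", "epistemic", "textual", "speech act",
}
MODIFICATIONS = {"no modification", "coordinating conjunction", "intensifier", "focus element"}


@dataclass(frozen=True)
class WantAnnotation:
    """One coded occurrence of the connective 'want' (hypothetical layout)."""
    coder: str                      # Var 1
    fragment: str                   # Var 2
    s1_utterances: Tuple[int, ...]  # Var 3: segment preceding 'want'
    s2_utterances: Tuple[int, ...]  # Var 4: segment following 'want'
    attitude_s1: str                # Var 5
    attitude_s2: str                # Var 6
    conceptualizer_s1: str          # Var 7
    conceptualizer_s2: str          # Var 8
    relation: str                   # Var 9
    modification: str               # Var 10

    def problems(self) -> List[str]:
        """Return the names of fields whose value is outside its value set."""
        checks = [
            ("attitude_s1", self.attitude_s1, ATTITUDES),
            ("attitude_s2", self.attitude_s2, ATTITUDES),
            ("conceptualizer_s1", self.conceptualizer_s1, CONCEPTUALIZERS),
            ("conceptualizer_s2", self.conceptualizer_s2, CONCEPTUALIZERS),
            ("relation", self.relation, RELATIONS),
            ("modification", self.modification, MODIFICATIONS),
        ]
        return [name for name, value, allowed in checks if value not in allowed]


# Invented example record, for illustration only.
example = WantAnnotation(
    coder="coder A", fragment="12",
    s1_utterances=(3,), s2_utterances=(4, 5),
    attitude_s1="opinion", attitude_s2="fact",
    conceptualizer_s1="speaker/1st person", conceptualizer_s2="third person",
    relation="epistemic", modification="no modification",
)
print(example.problems())  # prints []
```

Keeping the value sets as plain data makes it easy to compare two coders' records or to flag typos before any agreement statistics are computed.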