DiscAn: Towards a Discourse Annotation system for Dutch language corpora, or why and how we would want to annotate corpora on the discourse level. Ted Sanders.


1 DiscAn: Towards a Discourse Annotation system for Dutch language corpora, or why and how we would want to annotate corpora on the discourse level
Ted Sanders
Utrecht institute of Linguistics, Universiteit Utrecht

2 Coherence in discourse
Referential coherence: Many tourists come to Switzerland. They want to see the mountains.
Relational coherence: Many tourists come to Switzerland because they want to see the mountains.
John was happy. It was a Saturday.
We do not need explicit linguistic indicators.

3 Coherence in discourse, 2
Coherence is a cognitive phenomenon.
Coherence relations are conceptual relations that constitute coherence between discourse segments (minimally clauses).
Connectives, cue phrases and other lexical markers can, but need not, make this coherence explicit.
Coherence relations (causal, contrastive, additive) are the building blocks of discourse structure.

4 In annotated corpora?
The discourse level is largely lacking in annotated Dutch corpora.
There is an international tendency towards discourse annotation:
  The Penn Discourse Treebank (Prasad, Joshi, Webber et al.)
  The Potsdam Corpus (Stede et al.)
At the same time, we do have much data on Dutch connectives:
  Mainly causal
  Across media (various written genres, spoken, chat)
  At various stages of annotation

5 Larger research issues in the field
To be answered on the basis of annotated corpora:
  The meaning and use of connectives varies across languages: omdat vs. parce que vs. weil
  Semantic-pragmatic restrictions on use
  Similarities and differences in acquisition
We will start discourse annotation with a study on the category of causals.

6 Annotation
Some criteria:
  Order: cause – consequence and vice versa
  Subjectivity: want, puisque, since, denn vs. omdat, parce que, because, weil
  Linguistic marking: yes/no, perspective etc.
  Characteristics of the segments: propositional attitude, modality, tense, syntax…
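The subjectivity criterion on this slide can be sketched as a small lookup table. This is only an illustration: the class name `CausalConnective`, the helper `pair_for`, and the `subjective` flag are hypothetical, and treating the first group (want, puisque, since, denn) as the subjective side of the contrast is an assumption about how the slide's "vs." is meant to be read.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CausalConnective:
    """One causal connective from the slide (hypothetical structure)."""
    form: str
    language: str
    subjective: bool  # assumption: first group on the slide = subjective


# The eight connectives named on the slide, paired per language.
CAUSAL_CONNECTIVES: List[CausalConnective] = [
    CausalConnective("want", "Dutch", True),
    CausalConnective("omdat", "Dutch", False),
    CausalConnective("puisque", "French", True),
    CausalConnective("parce que", "French", False),
    CausalConnective("since", "English", True),
    CausalConnective("because", "English", False),
    CausalConnective("denn", "German", True),
    CausalConnective("weil", "German", False),
]


def pair_for(language: str) -> Dict[str, str]:
    """Return the subjective/objective connective pair listed for a language."""
    pair: Dict[str, str] = {}
    for connective in CAUSAL_CONNECTIVES:
        if connective.language == language:
            key = "subjective" if connective.subjective else "objective"
            pair[key] = connective.form
    return pair


if __name__ == "__main__":
    for lang in ("Dutch", "French", "English", "German"):
        print(lang, pair_for(lang))
```

A table like this only records the contrast the slide proposes; whether a given corpus occurrence is actually used subjectively is exactly what the annotation is meant to establish.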

7 Current situation: 15 studies…
Corpus conn fragmnr s1 s2 modality s1 modality s2 protag s1 s2 relation
7 omdat 2502 176 176 11 irrelevant want feit 6 1 1 1 Irrelevant want feit Irrelevant want feit 1
7 omdat 2502b 177 177 21 Spreker/auteur 62 11 Expliciet aanwezig Irrelevant want feit 1
7 omdat 2509 707 707 11 irrelevant want feit 6111 Irrelevant want feit Irrelevant want feit 1
7 omdat 2539 3320 3320 11 irrelevant want feit 6111 Irrelevant want feit Irrelevant want feit 1
7 omdat 2546 3810 3810 12 irrelevant want feit 33231 Irrelevant want feit Impliciet 19
7 omdat 2551 4357 4357 12 irrelevant want feit 31211 Irrelevant want feit Expliciet aanwezig 1
7 omdat 2525 2547 2547 31 Spreker/auteur 6211 Expliciet aanwezig Irrelevant want feit 1
(Dutch cell values: irrelevant want feit = irrelevant, because fact; Spreker/auteur = speaker/author; Expliciet aanwezig = explicitly present; Impliciet = implicit.)

8 The DiscAn project has five main goals:
1. standardize and open up an existing set of Dutch corpus analyses of coherence relations and discourse connectives;
2. develop the foundations for a discourse annotation system;
3. improve the metadata by investigating existing CMDI profiles or adding new profiles suited for this type of analysis;
4. inventory the required categories and investigate to what extent these could be included in ISOcat categories for discourse;
5. bring together an interdisciplinary discourse community of text, corpus and computational linguists to initiate further research in a European context.

9 A model of analysis
Var 1 Name of the coder (values: the names of the two authors)
Var 2 Number of the fragment (the values were present in the fragments)
Var 3 Utterance number(s) of the segment preceding want (S1)
Var 4 Utterance number(s) of the segment following want (S2)
Var 5 Propositional attitude of S1 (values: action, fact, opinion, observation, knowledge, experience)
Var 6 Propositional attitude of S2 (values: action, fact, opinion, observation, knowledge, experience)
Var 7 Identity of the conceptualizer in S1 (values: speaker/1st person, second person, third person (nominal or pronominal), generic person)
Var 8 Identity of the conceptualizer in S2 (values: speaker/1st person, second person, third person (nominal or pronominal), generic person)
Var 9 Type of relation expressed by want (values: non-volitional content, volitional content, explanation of a mental state, epistemic, textual, speech act)
Var 10 Syntactic modification of want (values: no modification, coordinating conjunction, intensifier, focus element)
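The ten variables of this model could be captured as a validated record type. The following is a minimal sketch, not the project's actual data format: the class name `WantAnnotation`, the field names, and the sample values in the usage lines are hypothetical, and reading the conceptualizer values as four separate categories (with "generic person" distinct from "third person") is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple

# Value sets copied from Var 5-10 of the slide.
ATTITUDES = {"action", "fact", "opinion", "observation", "knowledge", "experience"}
# Assumption: "generic person" is a fourth category next to "third person".
CONCEPTUALIZERS = {"speaker/1st person", "second person", "third person", "generic person"}
RELATIONS = {
    "non-volitional content", "volitional content",
    "explanation of a mental state", "epistemic", "textual", "speech act",
}
MODIFICATIONS = {"no modification", "coordinating conjunction", "intensifier", "focus element"}


@dataclass(frozen=True)
class WantAnnotation:
    """One coded occurrence of the connective 'want' (hypothetical record type)."""
    coder: str                      # Var 1
    fragment: int                   # Var 2
    s1_utterances: Tuple[int, ...]  # Var 3: segment preceding 'want'
    s2_utterances: Tuple[int, ...]  # Var 4: segment following 'want'
    attitude_s1: str                # Var 5
    attitude_s2: str                # Var 6
    conceptualizer_s1: str          # Var 7
    conceptualizer_s2: str          # Var 8
    relation: str                   # Var 9
    modification: str               # Var 10

    def __post_init__(self) -> None:
        # Reject any value that is not in the closed set listed on the slide.
        checks = [
            ("attitude_s1", ATTITUDES),
            ("attitude_s2", ATTITUDES),
            ("conceptualizer_s1", CONCEPTUALIZERS),
            ("conceptualizer_s2", CONCEPTUALIZERS),
            ("relation", RELATIONS),
            ("modification", MODIFICATIONS),
        ]
        for name, allowed in checks:
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}: {value!r} is not one of {sorted(allowed)}")


if __name__ == "__main__":
    # Invented example values, for illustration only.
    record = WantAnnotation(
        coder="coder A", fragment=12,
        s1_utterances=(3,), s2_utterances=(4, 5),
        attitude_s1="opinion", attitude_s2="fact",
        conceptualizer_s1="speaker/1st person", conceptualizer_s2="third person",
        relation="epistemic", modification="no modification",
    )
    print(record.relation)
```

Validating against closed value sets at construction time keeps coders' entries comparable across studies, which is the kind of standardization the project goals call for.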

