Corpus 06 Discourse Characteristics

Reasons why discourse studies are not corpus-based:
1. Many discourse features cannot be identified automatically, e.g. known/new information.
2. Analysis tools are not helpful.
Solutions:
1. Develop interactive programs.
2. Use surface grammatical features of a text.

Questions
1. How are references marked in different ways in different kinds of texts?
2. How does the sequence of verbs within a text develop with respect to the marking of tense and voice?

Reference
Noun phrases are a major device for referring to objects, people and other entities. A referring noun phrase can be a full noun phrase or a pronoun: the former typically expresses new information, the latter given information.

Types of reference
Exophoric (also called text-external): refers directly to the speaker or addressee. E.g. you, I.
Anaphoric: refers to a person or thing that has already been mentioned in the text. E.g. it, that.
Inferrable: refers to something that can be inferred from common sense and that is neither exophoric nor anaphoric, such as the restructuring and its debt burden in the following sentence:
The engineering and consulting firm, which has been plagued by losses for five years, said the restructuring is required to relieve its debt burden and “acute shortage of cash.”

Characteristics of referring expressions
Four parameters:
1. Status of information: given versus new
2. For given information, type of reference: anaphoric, exophoric, or inferrable
3. For anaphoric reference, form of expression: pronoun, synonym, or repetition
4. For anaphoric reference, the distance between the anaphoric expression and its antecedent

Steps
1. Grammatically tag all texts.
2. Run the interactive program over each text; it stops whenever it reaches a noun or pronoun.
3. The program prompts the user to select the correct codes for that noun phrase.

Computer processing
Information status: pronouns are automatically coded as given information. For each noun, the program automatically checks whether there is an earlier occurrence of the same noun in the text. If there is, the repeated noun is automatically coded as given information. All other full nouns are pre-coded as new information. These nouns are then checked interactively to determine whether they actually represent given information.
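The pre-coding rule for information status can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide, not the program used in the study. The function name, the (word, tag) input format and the tag labels "NOUN" and "PRON" are assumptions made for the example.

```python
def precode_information_status(tagged_tokens):
    """Pre-code each noun or pronoun as 'given' or 'new'.

    tagged_tokens: list of (word, tag) pairs; the tags "NOUN" and "PRON"
    are hypothetical labels standing in for the tagger's real tagset.
    Returns a list of (word, status, needs_check) triples, one per
    referring expression, in text order.
    """
    seen_nouns = set()
    codes = []
    for word, tag in tagged_tokens:
        if tag == "PRON":
            # Pronouns are always coded as given information.
            codes.append((word, "given", False))
        elif tag == "NOUN":
            key = word.lower()
            if key in seen_nouns:
                # Repeated noun: an earlier occurrence exists in the text.
                codes.append((word, "given", False))
            else:
                # First occurrence: pre-coded as new, flagged for the
                # interactive check described above.
                codes.append((word, "new", True))
            seen_nouns.add(key)
    return codes


tokens = [("firm", "NOUN"), ("said", "VERB"), ("it", "PRON"), ("firm", "NOUN")]
print(precode_information_status(tokens))
```

The needs_check flag marks exactly those nouns that the user would still have to inspect by hand.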

Type of reference
The pronouns I and you are automatically coded as marking exophoric reference. Third person pronouns are automatically labeled anaphoric but checked interactively to identify exophoric and inferrable occurrences. Nouns with given informational status are automatically labeled anaphoric but checked interactively to identify exophoric and inferrable occurrences.
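The same kind of sketch works for the type-of-reference rule. The function below is a hypothetical illustration, assuming each referring expression already carries a tag ("NOUN" or "PRON", both assumed labels) and an information status of "given" or "new".

```python
EXOPHORIC_PRONOUNS = {"i", "you"}  # the two pronouns named on the slide


def precode_reference_type(word, tag, status):
    """Return (reference_type, needs_check) for one referring expression.

    Only given information receives a reference type; new information
    returns (None, False).
    """
    if status != "given":
        return (None, False)
    if tag == "PRON" and word.lower() in EXOPHORIC_PRONOUNS:
        # I and you are coded as exophoric with no further check.
        return ("exophoric", False)
    # All other pronouns and all given nouns default to anaphoric and are
    # flagged so the user can recode exophoric or inferrable cases.
    return ("anaphoric", True)


print(precode_reference_type("you", "PRON", "given"))
print(precode_reference_type("it", "PRON", "given"))
```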

Forms of anaphoric expression
If a noun has been coded as anaphoric and an earlier occurrence of the same noun was found in the text, the referring expression is automatically identified as a noun repetition. Other anaphoric nouns are coded as synonyms.
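As a rough sketch, the form of an anaphoric expression follows directly from the tag and from whether the same noun occurred earlier. The function name and the tag labels are again assumptions made for illustration.

```python
def code_anaphoric_form(tag, has_earlier_occurrence):
    """Classify an anaphoric expression as pronoun, repetition or synonym.

    tag: "PRON" or "NOUN" (hypothetical tag labels).
    has_earlier_occurrence: True if the same noun appeared earlier in the text.
    """
    if tag == "PRON":
        return "pronoun"
    if has_earlier_occurrence:
        return "repetition"
    # Anaphoric noun with no earlier identical noun: coded as a synonym.
    return "synonym"


print(code_anaphoric_form("NOUN", True))
```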

Distance between the target referring expression and its antecedent
The antecedent of every anaphoric noun and pronoun must be identified. For repeated nouns, the antecedent is automatically pre-coded as the earlier occurrence of the same noun; these antecedents are checked interactively to determine whether there is a closer synonymous expression. For all other nouns and pronouns, the user of the interactive program must type in the antecedent. The distance between the target referring expression and its antecedent can then be computed automatically.
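Once each anaphoric expression is linked to its antecedent, the distance computation is simple arithmetic. The sketch below assumes that distance is counted as the number of referring expressions intervening between antecedent and target; the slides do not define the unit at this point, so that choice, like the function names, is an assumption.

```python
def anaphoric_distance(antecedent_index, target_index):
    """Number of referring expressions between an antecedent and its target.

    Both arguments are positions in the ordered list of referring
    expressions of one text (0-based). The unit is an assumption.
    """
    if antecedent_index >= target_index:
        raise ValueError("the antecedent must precede the target expression")
    return target_index - antecedent_index - 1


def mean_distance(links):
    """Average distance over a list of (antecedent_index, target_index) pairs."""
    distances = [anaphoric_distance(a, t) for a, t in links]
    return sum(distances) / len(distances)


# Two invented links: adjacent expressions, and a target four expressions later.
print(mean_distance([(0, 1), (0, 5)]))
```

Averaging these values separately for pronouns and for full nouns, register by register, gives figures of the kind reported in the distance tables.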

Register and Types of Information

Reference: Conversation and speech have relatively frequent referring expressions, although news has the largest number of referring expressions.
Given/new information: Conversation and speech rely heavily on given information, while news and academic prose contain more new information.

Types of Reference

Exophoric: exophoric pronouns account for over half of all given references in conversation, but this is not the case in written registers.
Anaphoric: written registers rely heavily on anaphoric reference. The high proportion of expressions marking new information accounts for the reliance on anaphoric reference in written registers.

Average distance measures for four registers
Conversation: 4.5
Public speeches: 5.5
News reportage: 11.0
Academic prose: 9.0
This makes sense given the difference in the production and comprehension circumstances of written and spoken registers.
Conversation and speeches must be produced and comprehended on-line.
Co-references with short anaphoric distance are easier to understand.
Conversation makes frequent use of exophoric pronouns referring to the speaker or listener.

Average distance measures for pronominal versus full noun anaphoric expressions
Columns: average pronominal distance, average full noun distance
Rows: Conversation, Public speeches, News reportage, Academic prose

Average distance measures for pronominal versus full noun anaphoric expressions
Pronouns tend to occur much closer to their antecedents than repeated full nouns do. The greater the number of intervening referring expressions, the greater the chance of ambiguity and confusion over the intended reference of pronominal forms. Thus full noun expressions are preferred for anaphoric reference over large distances.

Discourse maps of verb tense and voice
Communicative purpose shifts within the course of a text. Example: research articles follow a standard four-part organization: Introduction, Methods, Results, Discussion (I-M-R-D).

Steps of analysis of 19 medical research articles
Step 1: Count the frequency of present tense, past tense and agentless passives across the I-M-R-D sections.
Step 2: Calculate the average frequency counts for each type of section.
Step 3: Compute ANOVA and correlation coefficients for each linguistic feature. The significance level is set at
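The three steps can be illustrated with a small self-contained Python sketch: normalize raw counts to a rate per 1,000 words, then compute a one-way ANOVA F statistic and r2 (the between-group sum of squares divided by the total sum of squares). This is a minimal sketch under stated assumptions, not the procedure used in the study: it treats the sections as independent groups, and the function names and the numbers in the usage example are invented for illustration.

```python
def per_thousand(count, n_words):
    """Normalize a raw frequency count to a rate per 1,000 words."""
    return count * 1000 / n_words


def one_way_anova(groups):
    """Return (F, r2) for a dict mapping group name -> list of normed counts.

    F  = mean square between groups / mean square within groups
    r2 = between-group sum of squares / total sum of squares
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    r2 = ss_between / (ss_between + ss_within)
    return f_value, r2


# Invented normed counts for two section types, for illustration only.
example = {"Introduction": [1.0, 2.0, 3.0], "Methods": [4.0, 5.0, 6.0]}
f_value, r2 = one_way_anova(example)
print(round(f_value, 2), round(r2, 3))
```

An r2 close to 1 means that most of the variation in the normed counts is explained by which section a text sample comes from.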

Mean scores (per 1,000 words) of selected linguistic features across the I-M-R-D sections of English medical research articles (N=19)
Columns: linguistic feature; mean score in each section (I, M, R, D); F; p; r2
Present tense: F=29.25; p<.001; r2=.549
Past tense: F=36.74; p<.001; r2=
Agentless passives: F=33.17; p<.001; r2=
p<.001: H0 is rejected. The difference between groups is significantly larger than the difference within groups.
r2=.549: 54.9% of the variation in the normed counts for present tense can be accounted for by knowing the register category of each text. The differences across registers in the use of present tense verbs are very important in addition to being statistically significant.

Findings
Present tense occurs most frequently in discussion sections, and somewhat less frequently in introductions. Both sections tend to emphasize the current state of our knowledge and the present implications of research findings.
Past tense appears more in methods and results sections, reflecting a focus on reporting past events and procedures.
Agentless passives have a high frequency in methods sections, where they present events impersonally.