Corpus 06 Discourse Characteristics. Reasons why discourse studies are not corpus-based: 1. Many discourse features cannot be identified automatically.

1 Corpus 06 Discourse Characteristics

2 Reasons why discourse studies are not corpus-based: 1. Many discourse features cannot be identified automatically, e.g. known/new information. 2. Analysis tools are not helpful. Solutions: 1. Develop interactive programs. 2. Use surface grammatical features of a text.

3 Questions: 1. How are references marked in different ways in different kinds of texts? 2. How does the sequence of verbs within a text develop with respect to the marking of tense and voice?

4 Reference: Noun phrases are a major device for referring to objects, people, and other entities. A referring expression can be a full noun phrase or a pronoun; the former typically expresses new information, the latter given information.

5 Types of reference. Exophoric: also called text-external, referring directly to the speaker and addressee, e.g. you, I. Anaphoric: referring to a person or thing that has already been referred to in the text, e.g. it, that. Inferrable: referring to something that can be inferred from common sense and that is neither exophoric nor anaphoric, such as "the restructuring" and "its debt burden" in: The engineering and consulting firm, which has been plagued by losses for five years, said the restructuring is required to relieve its debt burden and “acute shortage of cash.”

6 Characteristics of referring expressions. Four parameters: 1. Status of information: given versus new. 2. For given information, type of reference: anaphoric, exophoric, or inferrable. 3. For anaphoric reference, form of expression: pronoun, synonym, or repetition. 4. For anaphoric reference, the distance between the anaphoric expression and its antecedent.

7 Steps: 1. Grammatically tag all texts. 2. Run the interactive program over each text; it stops whenever it reaches a noun or pronoun. 3. The program prompts the user to select the correct codes for that noun phrase.

8 Computer processing. Information status: Pronouns are automatically coded as given information. For each noun, the program automatically checks whether there is an earlier occurrence of the same noun in the text. If there is, the repeated noun is automatically coded as given information. All other full nouns are pre-coded as new information. These nouns are then checked interactively to determine whether they actually represent given information.
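The pre-coding rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag names NOUN and PRON, the function name, and the case-insensitive matching of "the same noun" are hypothetical choices, not details given in the slides.

```python
from typing import List, Tuple

# Hypothetical tag names; a real grammatical tagger uses its own tagset.
NOUN, PRONOUN = "NOUN", "PRON"


def precode_information_status(
    tagged: List[Tuple[str, str]]
) -> List[Tuple[str, str, bool]]:
    """Return (word, status, needs_check) for every noun or pronoun.

    needs_check is True for nouns pre-coded as new, because the slides
    say those are reviewed interactively afterwards.
    """
    seen_nouns = set()
    coded = []
    for word, tag in tagged:
        if tag == PRONOUN:
            # Pronouns are automatically coded as given information.
            coded.append((word, "given", False))
        elif tag == NOUN:
            key = word.lower()  # assumption: matching ignores case
            if key in seen_nouns:
                # Repeated noun: automatically coded as given.
                coded.append((word, "given", False))
            else:
                # First occurrence: pre-coded as new, flagged for review.
                coded.append((word, "new", True))
                seen_nouns.add(key)
    return coded


example = [("firm", NOUN), ("said", "VERB"), ("it", PRONOUN),
           ("firm", NOUN), ("debt", NOUN)]
```

Calling precode_information_status(example) codes the first "firm" and "debt" as new (flagged for review) and codes "it" and the second "firm" as given.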

9 Type of reference. The pronouns I and you are automatically coded as marking exophoric reference. Third person pronouns are automatically labeled anaphoric but checked interactively to identify exophoric and inferrable occurrences. Nouns with given informational status are automatically labeled anaphoric but checked interactively to identify exophoric and inferrable occurrences.

10 Forms of anaphoric expression. If a noun has been coded as anaphoric and an earlier occurrence of the same noun was found in the text, the referring expression is automatically identified as a noun repetition. Other anaphoric nouns are coded as synonyms.
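The automatic defaults for type of reference and form of expression might be combined as in the sketch below. The function name and its arguments are hypothetical; the slides describe the rules but not an implementation, and every label returned here is only a pre-code that the interactive check may overturn.

```python
from typing import Optional, Tuple

# The slides name only "I" and "you" as automatically exophoric.
EXOPHORIC_PRONOUNS = {"i", "you"}


def precode_reference(
    word: str, is_pronoun: bool, status: str, repeated: bool
) -> Tuple[Optional[str], Optional[str]]:
    """Return (reference_type, form) defaults; None where no code applies.

    status is "given" or "new"; repeated is True when an earlier
    occurrence of the same noun was found in the text.
    """
    if status != "given":
        # Type of reference is only coded for given information.
        return (None, None)
    if is_pronoun:
        if word.lower() in EXOPHORIC_PRONOUNS:
            return ("exophoric", None)
        # Other pronouns default to anaphoric, pending review.
        return ("anaphoric", "pronoun")
    # Given nouns default to anaphoric; the form depends on repetition.
    return ("anaphoric", "repetition" if repeated else "synonym")
```

For example, precode_reference("firm", False, "given", True) yields ("anaphoric", "repetition"), while a given noun with no earlier occurrence is pre-coded as a synonym.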

11 Distance between the target referring expression and its antecedent. The antecedent of all anaphoric nouns and pronouns must be identified. For repeated nouns, the antecedent is automatically pre-coded as the earlier occurrence of the same noun; these antecedents are checked interactively to determine if there is a closer synonymous expression. For all other nouns and pronouns, the user of the interactive program must type in the antecedent. The distance between the target referring expression and its antecedent can then be computed automatically.
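Once every anaphoric expression has an antecedent, the distance computation is mechanical. The sketch below assumes that distance is counted as the number of referring expressions intervening between antecedent and target; the slides do not define the unit at this point, so treat that as an assumption, along with the function name and the list-of-indices input format.

```python
from statistics import mean
from typing import List, Optional


def anaphoric_distances(antecedents: List[Optional[int]]) -> List[int]:
    """Compute one distance per anaphoric expression.

    antecedents[i] is the index (in text order) of the antecedent of
    referring expression i, or None if expression i is not anaphoric.
    Distance = number of referring expressions strictly between the two.
    """
    distances = []
    for position, antecedent in enumerate(antecedents):
        if antecedent is None:
            continue
        if not 0 <= antecedent < position:
            raise ValueError(
                f"antecedent of expression {position} must precede it"
            )
        distances.append(position - antecedent - 1)
    return distances


# Five referring expressions; expressions 1, 3 and 4 are anaphoric.
coded_text = [None, 0, None, 0, 2]
distances = anaphoric_distances(coded_text)  # [0, 2, 1]
average_distance = mean(distances)           # 1
```

Averaging such distances over all texts in a register would give one figure per register, comparable across registers.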

12 Register and Types of Information

13 Reference: Conversation and speeches have relatively frequent referring expressions, although news has the largest number of referring expressions. Given/new information: Conversation and speeches rely heavily on given information, while news and academic prose have more new information.

14 Types of Reference

15 Exophoric pronouns account for over half of all given references in conversation; this is not the case in written registers. Anaphoric reference: written registers rely heavily on it. The high proportion of expressions marking new information accounts for the reliance on anaphoric reference in written registers.

16 Average distance measures for four registers
Conversation       4.5
Public speeches    5.5
News reportage    11.0
Academic prose     9.0
This makes sense given the difference in the production and comprehension circumstances of written and spoken registers. Conversation and speeches must be produced and comprehended on-line. Co-references with short anaphoric distance are easier to understand. Conversation also makes frequent use of exophoric pronouns referring to the speaker or listener.

17 Average distance measures for pronominal versus full noun anaphoric expressions
Register           Average pronominal distance   Average full noun distance
Conversation       3.0                            9.0
Public speeches    3.5                           10.0
News reportage     3.0                           13.5
Academic prose     2.5                           10.0

18 Average distance measures for pronominal versus full noun anaphoric expressions. Pronouns tend to occur much closer to their antecedents than repeated full nouns do. The greater the number of intervening referring expressions, the greater the chance of ambiguity and confusion over the intended reference of pronominal forms. Thus full noun expressions are preferred for anaphoric reference over large distances.

19 Discourse maps of verb tense and voice. There are shifts in communicative purpose within the course of a text. Example: research articles follow a standard four-part organization: Introduction, Methods, Results, Discussion (I-M-R-D).

20 Steps of analysis of 19 medical research articles. Step 1: Count the frequencies of present tense, past tense, and agentless passives across the I-M-R-D sections. Step 2: Calculate the average frequency counts for each type of section. Step 3: Compute ANOVA and correlation coefficients for each linguistic feature. The significance level is set at 0.001.
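The three steps can be illustrated with a small pure-Python sketch. It assumes that counts are normed to a rate per 1,000 words and that the reported r2 is the proportion of variance explained by section (between-group sum of squares over total sum of squares); neither formula is spelled out in the slides. The function names and the toy numbers are hypothetical and are not the study's data. Computing the p-value would additionally need the F distribution, which is left out here.

```python
from typing import Dict, List, Tuple


def per_thousand(count: int, n_words: int) -> float:
    """Norm a raw frequency count to a rate per 1,000 words."""
    return 1000 * count / n_words


def one_way_anova(groups: Dict[str, List[float]]) -> Tuple[float, float]:
    """Return (F, r2) for normed counts grouped by section type."""
    values = [v for scores in groups.values() for v in scores]
    grand_mean = sum(values) / len(values)
    ss_between = 0.0
    ss_within = 0.0
    for scores in groups.values():
        group_mean = sum(scores) / len(scores)
        ss_between += len(scores) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in scores)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    r_squared = ss_between / (ss_between + ss_within)
    return f_ratio, r_squared


# Toy normed counts for two section types (illustrative values only).
toy = {"I": [1.0, 2.0, 3.0], "M": [4.0, 5.0, 6.0]}
f_ratio, r_squared = one_way_anova(toy)  # F = 13.5, r2 = 13.5 / 17.5
```

In the toy data the section means are 2.0 and 5.0, so most of the variation lies between sections, which is what a large F and an r2 near 1 indicate.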

21 Mean scores (per 1,000 words) of selected linguistic features across the I-M-R-D sections of English medical research articles (N=19)
Linguistic feature                                  I      M      R      D
Present tense (F=29.25; p<.001; r2=.549)           47.9   21.1   35.9   60.6
Past tense (F=36.74; p<.001; r2=.605)              20.7   48.5   40.3   13.0
Agentless passives (F=33.17; p<.001; r2=.580)      18.4   39.9   16.9   16.3
p<.001: H0 rejected. The difference between groups is significantly larger than the difference within groups. r2=.549: 54.9% of the variation in the normed counts for present tense can be accounted for by knowing the register category (here, the section type) of each text. The differences across registers in the use of present tense verbs are very important in addition to being statistically significant.

22 Findings. Present tense occurs most frequently in discussion sections, and somewhat less frequently in introductions. Both sections tend to emphasize the current state of our knowledge and the present implications of research findings. Past tense appears more in methods and results sections, reflecting a focus on the reportage of past events and procedures. Agentless passives have a high frequency in methods sections, presenting events impersonally.

