1 Compiling a corpus II

2 Corpus
A finite-size, non-random collection of naturally occurring language, in a computer-readable form. Non-random = representative of a language or text type and compiled for an intended functional purpose (see McEnery et al. 2006).

3 Quantitative
- breadth
- coverage (i.e. representative, generalisable results)
- statistical relevance
- descriptive power
- reliability
- the principle of total accountability
- replicability

4 Qualitative
- depth
- contextualisation
- socio-cultural relevance
- explanatory power

5 Cumulative power
Qualitative change cannot be understood, let alone achieved, without noting the accumulation of quantities (Gerbner, 1983: 361).

6 CADS (corpus-assisted discourse studies)
- recognition and quantification of patterns
- systematic analysis of serendipitous discoveries
- going backwards and forwards between the quantitative data (wordlists and keyword lists) and qualitative close reading (concordance lines and extended text)

7
- Select a phenomenon for investigation
- Collect a relevant data set
- Look inside the data set for systematic patterns
- Formalize significant patterns as rules describing natural events

8 The research process
- finding a research question
- designing the appropriate corpus to answer it
- compiling the dataset
- analysing the corpus
- fine-tuning the RQ / coming up with more questions
- finding answers (?)

9 Describing your corpus
- You must give the range of sources, with as much information as possible (e.g. three British broadsheets, the Times, Telegraph and Guardian, and their Sunday editions, six months before and three months after the 2005 elections; or White House briefings for the first two months of each year after the 2008 economic crash)
- Justify your choice
- Indicate the number of words overall
- Indicate, where applicable, the number of texts and average length of texts for each part of the corpus
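The figures a corpus description needs (overall word count, number of texts, average text length per part) can be produced with a few lines of code. What follows is a minimal sketch in Python, assuming each part of the corpus is held as a list of plain-text strings. The `describe_corpus` helper, the toy texts and the very simple regex tokeniser are illustrative assumptions, not part of the course materials; a real study should state which tokenisation rules were used, since they change the counts.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (a deliberately simple tokeniser)."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())

def describe_corpus(parts):
    """Return text count, word count and average text length for each part."""
    summary = {}
    for name, texts in parts.items():
        lengths = [len(tokenize(t)) for t in texts]
        total = sum(lengths)
        summary[name] = {
            "texts": len(texts),
            "words": total,
            "avg_length": total / len(texts) if texts else 0.0,
        }
    return summary

# Hypothetical toy corpus with two parts (invented sentences, not real data)
toy = {
    "Times": ["The election was close.", "Voters went to the polls today."],
    "Guardian": ["A new government was formed."],
}
stats = describe_corpus(toy)
for name, row in stats.items():
    print(name, row)
print("Total words:", sum(row["words"] for row in stats.values()))
```

Reporting the per-part rows alongside the total makes it easy for a reader to see whether one source dominates the corpus.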

10 Wordlists and keyword lists
- Make sure you distinguish between a wordlist and a keyword list
- Make sure you state which is the corpus under investigation and which is the reference corpus
- Usually the reference corpus is bigger
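To make the first distinction concrete: a wordlist is simply every word form in one corpus with its raw frequency, with no comparison to any other corpus. A minimal sketch in Python, assuming a plain-text input and a simple regex tokeniser (the `wordlist` function name and the sample sentence are illustrative assumptions):

```python
import re
from collections import Counter

def wordlist(text):
    """Count every word form in a single text or corpus (raw frequencies)."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

sample = "The cat sat on the mat. The dog sat too."
wl = wordlist(sample)
# Print the three most frequent word forms with their frequencies
for word, freq in wl.most_common(3):
    print(word, freq)
```

A keyword list, by contrast, needs two such wordlists (study corpus and reference corpus) and a statistical comparison between them.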

11 concordances
A concordance brings together a series of fragments of text displaced from their original sequence and, by juxtaposing them vertically, one after the other, makes repetition visible and countable and makes patterns emerge to the surface, while the individual texts are eclipsed.

12 concord
- This tool helps you to see words in their textual environment
- You can order the lines according to the right or left co-text (e.g. R1, R2, R3; L1, L2, L3)
- You can also see collocates, clusters and patterns
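The idea of ordering lines by co-text position can be illustrated outside any particular tool. The following is a minimal sketch in Python of a keyword-in-context (KWIC) display with sorting on L1..Ln or R1..Rn. It does not reproduce how any concordancer is implemented; the function names `concordance` and `sort_lines` and the sample sentence are assumptions made for illustration.

```python
def concordance(tokens, node, span=4):
    """Return (left, node, right) token lists for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            lines.append((left, tok, right))
    return lines

def sort_lines(lines, position="R1"):
    """Sort lines by the word at L1..Ln or R1..Rn; missing positions sort first."""
    side, n = position[0].upper(), int(position[1:])

    def key(line):
        left, _, right = line
        # L1 is the word nearest the node, so read the left co-text backwards
        ctx = left[::-1] if side == "L" else right
        return ctx[n - 1] if len(ctx) >= n else ""

    return sorted(lines, key=key)

text = "the strong tea was hot and the strong wind blew while a weak tea cooled"
tokens = text.split()
lines = concordance(tokens, "tea", span=3)
for left, node, right in sort_lines(lines, "L1"):
    print(f"{' '.join(left):>20}  {node}  {' '.join(right)}")
```

Sorting on L1 groups the modifiers of the node together, while sorting on R1 groups what follows it, which is what makes repeated patterns visible down the page.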

13 collocation
The idea behind collocation is that a word is defined by the relationships it establishes with other words. ‘You shall know a word by the company it keeps’ (Firth 1957)

14 Collocates and statistics
- A collocate is an ‘item that appears with greater than random probability in its (textual) context’ (Hoey 1991: 7).
- Measures of statistical significance (e.g. log-likelihood, z-score, MI score...)
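Every one of these measures starts from the same raw material: how often each word is observed within a fixed span of the node. A minimal sketch in Python of that counting step, assuming a pre-tokenised text and a symmetric window (the `collocates` function, the `span` parameter and the sample sentence are illustrative assumptions):

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens to the left or right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

text = "strong tea and strong coffee but weak tea and strong tea again"
coll = collocates(text.split(), "tea", span=1)
print(coll.most_common())
```

These observed co-occurrence counts are not yet evidence of collocation in Hoey's sense; a significance measure is still needed to compare them with what chance alone would produce.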

15 T-score and MI: two measures of relative statistical significance
T-score measures certainty of collocation, whereas MI score measures strength of collocation (Hunston 2002:73; McEnery & Wilson 2001:86). T-score directs our attention to high-frequency collocates such as grammatical words (and is thus likely to be more useful to the grammarian or lexicographer than to the sociolinguist or discourse analyst), whereas MI score highlights lexical items that are relatively infrequent by themselves but have a higher-than-random probability of co-occurring with the node word (Clear 1993:281). The two scores are useful, above all, in ranking collocations (Manning & Schütze 1999:166).
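The contrast between the two scores is easier to see with numbers. Below is a minimal sketch in Python using one common formulation: expected frequency E = f(node) × f(collocate) × window / N, MI = log2(O / E), and t = (O − E) / √O. Software packages differ in the details (window handling, corrections), so this is an assumption rather than the formula of any particular tool, and all frequencies in the example are invented for illustration.

```python
import math

def expected(f_node, f_coll, corpus_size, window=1):
    """Co-occurrences expected by chance under the simple formulation above."""
    return f_node * f_coll * window / corpus_size

def mi_score(observed, f_node, f_coll, corpus_size, window=1):
    """Mutual information: log2 of observed over expected co-occurrences."""
    return math.log2(observed / expected(f_node, f_coll, corpus_size, window))

def t_score(observed, f_node, f_coll, corpus_size, window=1):
    """T-score: (observed - expected) divided by the square root of observed."""
    exp = expected(f_node, f_coll, corpus_size, window)
    return (observed - exp) / math.sqrt(observed)

# Hypothetical counts: corpus of 1,000,000 tokens, node word occurs 1,000 times
N, F_NODE = 1_000_000, 1_000
frequent = dict(observed=150, f_coll=60_000)  # e.g. a grammatical word
rare = dict(observed=20, f_coll=50)           # e.g. an infrequent lexical item

for label, c in (("frequent", frequent), ("rare", rare)):
    mi = mi_score(c["observed"], F_NODE, c["f_coll"], N)
    t = t_score(c["observed"], F_NODE, c["f_coll"], N)
    print(f"{label:9s} MI={mi:.2f}  t={t:.2f}")
```

With these invented counts the frequent collocate gets the higher t-score and the rare collocate the higher MI score, which is the ranking behaviour the slide describes.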

16 measure of statistical significance
z-score: the number of standard deviations from the mean frequency. It compares the observed frequency with the frequency expected if only chance were affecting the distribution. It does not measure the strength of the relationship, but its significance.
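As a worked illustration, here is a minimal sketch in Python of one common binomial-style formulation of the collocation z-score: p = f(collocate) / (N − f(node)), E = p × f(node) × window, z = (O − E) / √(E(1 − p)). Other formulations exist, so treat this as an assumption rather than the definition used by any specific tool; the counts in the example are invented.

```python
import math

def z_score(observed, f_node, f_coll, corpus_size, window=1):
    """Standard deviations between observed and chance-expected co-occurrence."""
    p = f_coll / (corpus_size - f_node)  # chance of the collocate at any one slot
    exp = p * f_node * window            # expected co-occurrences in all windows
    return (observed - exp) / math.sqrt(exp * (1 - p))

# Hypothetical counts: node occurs 1,000 times, collocate 50 times,
# corpus of 1,000,000 tokens, 20 co-occurrences within an 8-word window
z = z_score(20, f_node=1_000, f_coll=50, corpus_size=1_000_000, window=8)
print(f"z = {z:.2f}")
```

A z-score near zero means the observed co-occurrence is about what chance would give; large positive values mean it is far more frequent than chance predicts.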

17 Quantitative clues
Quantitative indicators highlight particularly promising entry points into the data. They can represent key leads worth pursuing qualitatively, according to the tried and tested principle of corpus linguistics, “Decide on the ‘strongest’ pattern and start there” (Sinclair 2003:xvi).

18 What to do with collocates
- Look for recurrent lexical patterns
- Classify the collocates (semantic grouping)
- Note recurrent semantic patterns
- Note recurrent evaluative patterns
- Concordance co-occurrences and 2nd-level collocation analysis

19 prosody
The node’s property of being associated with a ‘semantically consistent set of collocates’ (Bublitz, 1996: 9). Semantic/evaluative (Morley and Partington 2009) prosody is an expression of evaluation (good/bad; desirable/undesirable; beneficial/dangerous; favourable/unfavourable...).

20 keywords
‘A key word may be defined as a word which occurs with unusual frequency in a given text. This does not mean high frequency but unusual frequency, by comparison with a reference corpus of some kind’ (Scott, 1997: 236).
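The difference between high frequency and unusual frequency can be shown with a small calculation. The sketch below, in Python, uses the log-likelihood statistic, which is one widely used keyness measure (others, such as chi-square, are also common). The `keywords` function, the threshold of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom) and the toy frequencies are assumptions made for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """a, b: word frequency in study / reference corpus; c, d: corpus sizes."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study, reference, min_ll=3.84):
    """Words relatively more frequent in the study corpus, ranked by keyness."""
    c, d = sum(study.values()), sum(reference.values())
    rows = []
    for word, a in study.items():
        b = reference.get(word, 0)
        ll = log_likelihood(a, b, c, d)
        if ll >= min_ll and a / c > b / d:  # keep positive keywords only
            rows.append((word, a, b, round(ll, 2)))
    return sorted(rows, key=lambda r: r[3], reverse=True)

# Hypothetical wordlists: a 100-token study corpus, a 1,000-token reference corpus
study = Counter({"election": 30, "the": 60, "vote": 10})
reference = Counter({"election": 5, "the": 600, "vote": 10, "weather": 385})
for row in keywords(study, reference):
    print(row)
```

In this toy example "the" is the most frequent word in the study corpus but is not a keyword, because its relative frequency is the same in both corpora; "election" and "vote" are less frequent yet key.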

21 What you can do with keywords
- Identify the specificity, trends and aboutness of the study corpus compared to a reference corpus.
- Keywords are a very good source of insights and help identify potentially interesting items for closer observation, but they must be treated with caution.
- Grammatical words tell us more about style, lexical words more about content, but not always.

22 Working with keywords
- Keyword lists do not account for the textual position of words, do not allow a distinction to be made between polysemous meanings, and are independent of context.
- Keyword analysis does not reveal discourses, but it directs the researcher’s attention by highlighting patterns of difference that could otherwise go undetected.
- As with collocation analysis, the software makes the pattern visible; the human works on it.

23 Reading
Apart from the work you have already seen presented in class or in the materials on-line, the practice in reading concordance lines, and the insights into particular aspects of the English language (e.g. evaluation and graduation, figurative language), the first part of Scott and Tribble, Patterns and Meanings in Discourse, and the book by Paul Baker should help you with your corpus compilation and analysis. Use the resources available.

