ENG 626 CORPUS APPROACHES TO LANGUAGE STUDIES exploring frequencies in texts Bambang Kaswanti Purwo


1 ENG 626 CORPUS APPROACHES TO LANGUAGE STUDIES exploring frequencies in texts Bambang Kaswanti Purwo bkaswanti@atmajaya.ac.id

2 Adolphs, Svenja (2006) Ch. 3: the role of frequency information in characterizing whole texts or collections of texts; techniques and practices in data analysis ▪ quantitative exploration of texts and text collections → different types of wordlists; how the wordlists can be used for contrastive studies of different texts ▪ generating hypotheses: frequency lists inform the generation of hypotheses and research questions ▪ testing hypotheses: electronic text analysis can test existing hypotheses in any area that deals with the use of language ▪ facilitating manual processes: from “manual” to “automated”, e.g. extraction of frequency information not necessarily motivated by a particular research question

3 some of the software resources that facilitate the research process ▪ software packages to facilitate the manipulation and analysis of electronic texts ▫ the generation of frequency counts ▫ comparisons of frequency information in different texts ▫ different formats of concordance outputs [including Key Word In Context (KWIC)] » [free of charge via the internet] ◊ The Compleat Lexical Tutor (Tom Cobb) ◊ VIEW: Variation in English Words and Phrases (Mark Davies) » [commercial] ◊ WordSmith Tools (Mike Scott)
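
A KWIC concordance of the kind these packages produce can be approximated in a few lines. The sketch below is only an illustration, not how any of the tools named above work internally; the function name `kwic` and the `width` parameter are assumptions made for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return one concordance line per hit of `keyword` (whole word, any case),
    with up to `width` characters of context on each side."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

if __name__ == "__main__":
    sample = "The cat sat. The dog ran."
    for line in kwic(sample, "the", width=10):
        print(line)
```

Matching on word boundaries keeps the search for "the" from also returning "theory" or "other"; a real concordancer would usually work on tokens rather than characters.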

4 basic information about the text: most software packages ▪ allow textual data to be sorted into concordance outputs ▪ produce some basic information about the text or collection of texts ▫ average sentence length ▫ word length ▫ number of paragraphs ▫ number of individual running words (tokens) ▫ number of different words (types) ▫ number of lexical items and number of grammatical items (in tagged corpora) » type-token ratio ▪ some of the information can be expressed in terms of ratios, e.g. the ratio between grammatical and lexical items in the text (lexical density)

5 the type-token ratio ▪ to gain some basic understanding of the lexical variation within the text ▪ tokens: the number of running words in a text ▪ types: the number of different words. Example: “This chapter moves from the discussion of design and development of electronic text resources to techniques and practices in data analysis.” How many tokens? 21. How many types? 19 (“of” and “and” each occur twice). The type-token ratio: divide the number of tokens by the number of types: 21/19 = 1.11. What is it for? → to assess the level of complexity of a particular text or text collection (e.g. comparisons between documents for different types of audiences); the higher the type-token ratio, the less varied the text
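
The count on this slide can be checked with a short script. This is a minimal sketch that assumes a simple lowercase, letters-only tokenizer (an assumption, not the tokenization used by any particular package) and follows the slide's definition of tokens divided by types. Many other sources define the ratio the other way round, as types divided by tokens, so it is worth checking which convention a given tool reports.

```python
import re

def tokenize(text):
    """Lowercase the text and return its word tokens (simple, assumed tokenizer)."""
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

def type_token_ratio(text):
    """Return (tokens, types, ratio) with ratio = tokens / types, as on the slide."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no tokens")
    types = set(tokens)
    return len(tokens), len(types), len(tokens) / len(types)

if __name__ == "__main__":
    sentence = ("This chapter moves from the discussion of design and development "
                "of electronic text resources to techniques and practices in data analysis.")
    n_tokens, n_types, ratio = type_token_ratio(sentence)
    print(n_tokens, n_types, round(ratio, 2))
```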

6 watch out: the ratio depends on the overall size of the text(s) on which it is based → compare type-token ratios of text(s) of similar length → textual complexity: ▪ sentence and word length ▪ linguistic analysis of grammatical structure ▪ semantic fields of the individual items » word lists ● single words: the frequency of a word or phrase in different text types is important for the description of the context of use (e.g. for English language teaching) ▪ various word lists exist in the ELT context, e.g. Academic Word List (Coxhead 2000) ▪ spoken vs. written discourse ▪ American vs. British English

7 a word list can be presented in ▪ frequency order ▪ alphabetical order ▪ lemmatized format ▪ with grammatical tags ▪ with other analytical tags; a word list can account for ▪ individual items ▪ recurrent sequences of two or more items; lemmatized frequency lists group together words from the same lemma (all grammatical inflections of a word: e.g. say, said, saying, says) ▪ there are often variations of meaning between different forms of the same lemma (Stubbs 1996, Tognini-Bonelli 2001) ▪ [ELT] beneficial to teach all forms of one lemma together and give priority to the most frequently used form
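
The different word list formats can be sketched as follows. The lemma table `LEMMA_MAP` is a tiny hand-made stand-in covering only the slide's say/said/saying/says example; a real lemmatized list would rely on a proper lemmatizer or a tagged corpus. All function names are assumptions for this illustration.

```python
import re
from collections import Counter

# Hypothetical, hand-made lemma table for the slide's example only.
LEMMA_MAP = {"says": "say", "said": "say", "saying": "say"}

def word_list(text, lemma_map=None):
    """Count word forms; if a lemma map is given, count lemmas instead."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if lemma_map:
        tokens = [lemma_map.get(tok, tok) for tok in tokens]
    return Counter(tokens)

def by_frequency(counts):
    """Frequency order: most frequent first, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def alphabetical(counts):
    """Alphabetical order."""
    return sorted(counts.items())

if __name__ == "__main__":
    text = "She said she says what they say"
    print(by_frequency(word_list(text)))             # word forms counted separately
    print(by_frequency(word_list(text, LEMMA_MAP)))  # forms grouped under the lemma 'say'
```

Keeping the per-form counts alongside the lemmatized counts makes it possible to see which form of a lemma is the most frequent one, which is the information the ELT point on this slide relies on.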

8 Table 3.1: basic information from a frequency list: the ten most frequent items in ▪ the spoken CANCODE corpus ▪ the written component of the BNC; some of the key differences between the two discourse modes are highlighted: ▪ both lists contain mainly grammatical items ▪ the spoken corpus includes the personal pronouns I and you (interactive nature of spoken discourse) ▪ yeah: a listener response token in conversation

9

10 ● recurrent continuous sequences; other terms: “lexical bundles” (Biber et al. 1999), “clusters” (Scott 1996); corpus research: a large proportion of language is phrasal in nature (observable tendency for particular items to co-occur in a non-random fashion); collocation: attraction between two words (Ch. 4) [the overall length of the sequences has to be determined at the outset; e.g. WordSmith Tools] Table 3.2: the ten most frequent two-word, three-word, and four-word recurrent sequences in the CANCODE corpus; most of the sequences are concerned with ▪ the management of discourse ▪ the deictics: you and I ▪ attempts to establish mutual understanding: know what I mean, I know, I think, do you think, etc.
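
Extracting recurrent continuous sequences amounts to counting n-grams of a length fixed in advance. The sketch below is one simple way to do this; the function name and the `min_freq` cut-off are assumptions, and the sample utterance is invented for the example rather than taken from CANCODE.

```python
from collections import Counter

def recurrent_sequences(tokens, n, min_freq=2):
    """Return (sequence, count) pairs for every n-word sequence
    that occurs at least `min_freq` times, most frequent first."""
    # zip over n shifted copies of the token list yields every n-gram.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in ngrams)
    return [(seq, c) for seq, c in counts.most_common() if c >= min_freq]

if __name__ == "__main__":
    # Invented sample, not corpus data.
    tokens = "you know what i mean you know what i think".split()
    for n in (2, 3, 4):
        print(n, recurrent_sequences(tokens, n))
```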

11

12 ● comparing frequencies in text collections of different sizes How to compare the frequencies of individual items in two corpora of different sizes? ▪ represent them as a percentage of the overall number of words in the respective corpora ▪ use a norming technique of frequency counts ▫ divide the raw frequency of individual items by the total number of words in a text ▫ we need to decide on an appropriate number of words which forms the basis of the norm ▫ multiply the results by this figure
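
The norming steps above can be sketched as a one-line calculation. The corpus sizes below echo the 35,000-word and five-million-word corpora mentioned elsewhere in this presentation, but the raw counts are invented purely for illustration, and the basis of 10,000 words is an arbitrary choice.

```python
def normed_frequency(raw_count, corpus_size, basis=1_000_000):
    """Raw frequency divided by corpus size, multiplied by the chosen basis
    (e.g. occurrences per 10,000 or per million words)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * basis

if __name__ == "__main__":
    # Invented counts: 350 hits in a 35,000-word corpus vs 25,000 hits in a 5,000,000-word corpus.
    small = normed_frequency(350, 35_000, basis=10_000)
    large = normed_frequency(25_000, 5_000_000, basis=10_000)
    print(small, large)  # both now expressed per 10,000 words
```

With a basis of 100 the same function gives the percentage mentioned in the first bullet, so the two approaches differ only in the multiplier.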

13 » keywords ◊ keywords = items that occur ▪ either with a significantly higher frequency (positive keywords) ▪ or with a significantly lower frequency (negative keywords) in a text or collection of texts when compared to a larger reference corpus (Scott 1997) ◊ keywords are identified on the basis of ▪ statistical comparisons of word frequency lists derived from the target corpus and the reference corpus ▪ [via a chi-square or a log-likelihood analysis] each item in the target corpus is compared to its equivalent in the reference corpus and the statistical significance of the difference is calculated → to generate words that are characteristic or uncharacteristic in a given target corpus
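
A keyness score of this kind can be sketched with the two-corpus log-likelihood formula that is widely used in corpus linguistics. The slides do not say which exact variant the software applies, so this is an assumed formulation, and all names and counts are hypothetical. Expected frequencies are derived from the combined frequency of the item, split in proportion to corpus size.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood statistic for one item in a target and a reference corpus."""
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_target, expected_target), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

def keyness(freq_target, size_target, freq_ref, size_ref):
    """Return (score, direction); direction is 'positive' when the item is
    relatively more frequent in the target corpus, 'negative' when less."""
    score = log_likelihood(freq_target, size_target, freq_ref, size_ref)
    rate_target = freq_target / size_target
    rate_ref = freq_ref / size_ref
    if rate_target == rate_ref:
        return score, "neutral"
    return score, "positive" if rate_target > rate_ref else "negative"

if __name__ == "__main__":
    # Invented counts for illustration only.
    print(keyness(50, 1_000, 100, 10_000))
```

A score above 3.84 corresponds to p < 0.05 for one degree of freedom; keyword tools often apply a stricter threshold, so the cut-off is best treated as a setting to choose rather than a fixed rule.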

14 ● single keywords ◊ target corpus: a 35,000-word corpus of the spoken language of health professionals (HP) ◊ reference corpus: the five-million-word CANCODE corpus of general spoken English; a study of telephone calls made to the British advice helpline provided by the National Health Service (NHS Direct) → the data from the medical consultations was recorded ▪ the most frequent items in both corpora are grammatical items ▪ distribution of personal pronouns: the Health Service data is “other-oriented”: you is the most frequent item ▪ the frequency order of you and I is reversed ▪ right in the Health Service data, yeah in CANCODE: both are listener response tokens ▫ right signals a more transactional nature ▫ yeah signals an interactional nature (encourages the speaker to continue with the turn)

15

16 ▪ comparison of frequency lists can help in the characterization of different spoken genres ▪ keyword analysis (below), based on a log-likelihood calculation, is better suited to highlighting the main elements that are characteristic of a particular text or collection of texts; Table 3.4 shows the top 10 positive keywords; the list gives a better idea of the content of the texts in the HP corpus ▪ reference to medication (antibiotics) ▪ ailments (diarrhoea) ▪ the nature of the discourse (information) ▪ the mode of the discourse (call) ▪ the medical context (NHS, Direct)

17

18 ▪ keywords that mark listener response in an advice-giving setting (ok, okay) ▪ patient-oriented nature (you, your); Table 3.5 confirms the result of the analysis of positive keywords ▪ the discourse in the HP corpus is oriented towards the hearer who phones in with a health problem → you, your ▪ third person pronouns are negative keywords (low in the HP corpus) ▪ the past tense verb was is also a negative keyword → HP reports current medical concerns in the present tense ▪ laughter ([laughs]) occurs significantly more in CANCODE → relatively serious nature of the medical consultations in HP

19

20 ● key sequences: the analysis of keywords can be extended to include recurrent sequences; Table 3.6 (key sequences) provides even stronger evidence of the particular domain of HP discourse ▪ quite a few of the recurrent sequences belong to the “automated response” marking the beginning of telephone interaction with NHS Direct ▪ other sequences relate to the gathering of basic information about the caller ▪ the most significant negative key sequence in the HP corpus: I don’t know (professionals provide knowledge and advice)

21

