751 - 1
Course admin: Canvas info; class rep; email/announcements; readings; resources (computer and USB stick)
Corpus: a sample of naturally occurring language, systematically collected for linguistic analysis. Thus the web is not a corpus, and neither is a collection of examples of conditional sentences. Hunston provides some examples of different types of corpus. (We will work with a variety of corpora.)
Corpus: it is useful to distinguish a balanced corpus (e.g., the Brown corpus) from a corpus consisting of a single genre. A balanced corpus facilitates comparisons (written versus spoken, fiction versus nonfiction) because there are the same number of words in each category and sub-category. However, the texts are not complete and genre distinctions can get washed out. (Think about the purpose for using a corpus.)
Frequency lists: we start with wordlists since they are well known in language teaching. Topics: the structure of frequency lists, creating frequency lists, and keywords. Frequency lists are seemingly straightforward, but we will examine some issues related to their use in language teaching.
Wordlists: wordlists are familiar, for example a vocab list for a reading, or a wordlist for a course or a textbook. In these cases the words are taken from the teaching materials. The wordlist can indicate what the student might be expected to know after taking a course, or it can indicate which words the student will encounter in a particular reading.
Wordlists: for our purposes, we are interested in wordlists associated with large texts. What words are in a corpus, indicating the nature of the corpus (and language use)? What words occur in a language or genre, and with what frequency (alternatively, what words are distinctive for a particular genre such as Business English)? What words does a language learner need to know (for academic study etc.)?
A frequency wordlist for a short text (handout): What type of information can be obtained? What can you say about the form of the frequency list? Consider the word distribution and the frequency distribution.
Larger frequency list
Structure of a word frequency list: the structure is the same for a single text and for a large corpus. Function words are most frequent: 'the' always ranks first in written English texts. Content words come lower in the list. Many words occur only once (hapax legomena). Zipf's law: the frequency of a word is inversely proportional to its rank.
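The structure described on this slide can be seen by building a frequency list by hand. The following is a minimal sketch, not the method used by any particular concordancer: the `tokenize` rule, the function names and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())

def frequency_list(text):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter(tokenize(text))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def hapax_legomena(freq_list):
    """Return the words that occur exactly once."""
    return [word for word, count in freq_list if count == 1]

# Made-up sample text (an assumption, not course data).
sample = "the cat sat on the mat and the dog sat on the log"
freq = frequency_list(sample)

# Under Zipf's law, rank * count stays roughly constant in a large corpus.
# A text this short will not show that, but the layout of the list is the same.
for rank, (word, count) in enumerate(freq, start=1):
    print(rank, word, count, rank * count)

print("hapax legomena:", hapax_legomena(freq))
```

Running the same functions on a much longer text should show the pattern from the slide more clearly: function words at the top and a long tail of words that occur once.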
Frequency list: types and tokens (one type: the; five tokens: the the the the the). Type-token ratio for a text. What is a word? Consider let's, mid-day, he's. Lemma: analyse, analyses, analysed. Lemma: analysis, analyses. Word family: analyse, analyses, analysed, analysis, analytical, analytically, …
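The type-token ratio is the number of distinct word forms (types) divided by the number of running words (tokens). The sketch below computes it before and after grouping forms under a lemma. The `LEMMAS` table is a tiny hand-made assumption, and it ignores the ambiguity noted on the slide: "analyses" can belong to the verb analyse or to the noun analysis.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out word tokens; let's and mid-day stay whole."""
    return re.findall(r"[a-z]+(?:['’-][a-z]+)*", text.lower())

def type_token_ratio(tokens):
    """Number of distinct types divided by number of tokens (0.0 for an empty text)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Hypothetical lemma table for illustration only; it treats "analyses" as a verb form.
LEMMAS = {"analyses": "analyse", "analysed": "analyse"}

text = "She analyses the data. He analysed the data. They analyse the data."
tokens = tokenize(text)
lemmatised = [LEMMAS.get(token, token) for token in tokens]

print(len(tokens), "tokens,", len(set(tokens)), "types")
print("TTR on word forms:", round(type_token_ratio(tokens), 3))
print("TTR on lemmas:", round(type_token_ratio(lemmatised), 3))
```

Counting lemmas instead of word forms lowers the number of types, so the answer to "what is a word?" changes the ratio.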
Wordlists for language teaching: sampling the language as a whole is difficult. We can create as large a corpus as possible. We can then obtain frequency bands: the top 1000 words, etc.
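Frequency bands can be read off a ranked frequency list by cutting it into blocks of a fixed size. This is a minimal sketch under stated assumptions: it bands word forms rather than lemmas or word families, and it uses a band size of 3 only so that the toy example produces more than one band (the function name and sample are hypothetical).

```python
from collections import Counter

def frequency_bands(tokens, band_size=1000):
    """Map each word type to a band number; band 1 holds the most frequent types."""
    ranked = [word for word, _ in
              sorted(Counter(tokens).items(), key=lambda item: (-item[1], item[0]))]
    return {word: rank // band_size + 1 for rank, word in enumerate(ranked)}

# Made-up sample; a real list would come from as large a corpus as possible.
tokens = "the cat sat on the mat and the dog sat on the log".split()
bands = frequency_bands(tokens, band_size=3)

for word, band in bands.items():
    print(band, word)
```

With `band_size=1000` and a large corpus, band 1 would correspond to the top 1000 words, band 2 to the next 1000, and so on.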
Wordlists, general and specialised: wordlists have been around since before the invention of computers. General wordlists are used for curriculum development, textbook writing, etc.
Wordlists, general: Thorndike (1921) created a frequency list from a corpus of 4.5 million words; West's (1953) General Service List; Coxhead's Academic Word List (AWL); Mark Davies' Academic Vocabulary lists.
Academic Word List
Academic Word List: a receptive list (based on morphological derivations). The list excludes words found in non-academic texts (even if they occur in academic texts). Do we need subject- or genre-specific wordlists? (Hyland)
Wordlists: if we can produce a wordlist for English (etc.), then we have some idea of what words to teach (the more frequent first), we can estimate the difficulty of texts, and we can determine what is special about academic English, business English, etc.
Wordlists: an important threshold is 2000 words (Laufer 1994, Nation 2001). Learners who have control over 6000 words should be able to understand around 90% of a typical text. McCarthy (2002) estimates that to reach higher levels of understanding it is necessary to aim for a 10,000-word receptive vocabulary. Corpus studies can help to identify different frequency bands: the top 2000 word band, etc.
Frequency and coverage
Levels     Conversation  Fiction  Newspapers  Academic text
1st 1000   84.3%         82.3%    75.6%       73.5%
2nd 1000   6%            5.1%     4.7%        4.6%
Academic   1.9%          1.7%     3.9%        8.5%
Other      7.8%          10.9%    15.7%       13.3%
Vocab Profile: applying the language frequency bands to a particular text results in a lexical or vocab profile. See Tom Cobb's Vocab Profile site: http://www.lextutor.ca/vp/eng/
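The idea behind a vocab profile can be sketched in a few lines: each token in a text is assigned to the first band list that contains it, and the result is reported as a percentage of tokens per band. This is only an illustration and not how the Lextutor site is implemented. The band lists below are tiny made-up stand-ins, not the real 1st 1000, 2nd 1000 or AWL lists.

```python
from collections import Counter

def vocab_profile(tokens, band_lists):
    """Return the percentage of tokens falling in each band, plus 'Other'."""
    counts = Counter()
    for token in tokens:
        for name, words in band_lists.items():
            if token in words:
                counts[name] += 1
                break
        else:
            # The token was not found in any band list.
            counts["Other"] += 1
    total = len(tokens)
    names = [*band_lists, "Other"]
    return {name: (100 * counts[name] / total if total else 0.0) for name in names}

# Hypothetical toy band lists (assumptions for illustration only).
band_lists = {
    "1st 1000": {"the", "of", "is", "a"},
    "2nd 1000": {"sample"},
    "Academic": {"data", "analysis"},
}

tokens = "the corpus is a sample of the data".split()
profile = vocab_profile(tokens, band_lists)

for name, percent in profile.items():
    print(f"{name:10} {percent:5.1f}%")
```

The percentages for a text can then be set against coverage figures for different genres to judge how demanding the text is likely to be.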
Lextutor: blue = 1000, green = 2000, yellow = AWL
Keyword list: what words are special for a particular corpus? Compare with a reference corpus.
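One common way to make this comparison is to score each word with the log-likelihood statistic, which measures how far its frequency in the study corpus departs from what the reference corpus would predict. The slide does not name a statistic, so the choice of log-likelihood here is an assumption, as are the function names and the two toy "corpora".

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """a, b: word frequency in study and reference corpus; c, d: corpus sizes in tokens."""
    expected_study = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    score = 0.0
    if a:
        score += a * math.log(a / expected_study)
    if b:
        score += b * math.log(b / expected_ref)
    return 2 * score

def keywords(study_tokens, reference_tokens):
    """Return (word, score) pairs for words relatively more frequent in the study corpus."""
    if not study_tokens or not reference_tokens:
        raise ValueError("both corpora must contain at least one token")
    study, reference = Counter(study_tokens), Counter(reference_tokens)
    n_study, n_ref = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = reference[word]
        if a / n_study > b / n_ref:
            scored.append((word, log_likelihood(a, b, n_study, n_ref)))
    return sorted(scored, key=lambda item: (-item[1], item[0]))

# Made-up toy texts standing in for a business corpus and a reference corpus.
study = "the market price of the stock rose as the market opened".split()
reference = "the cat sat on the mat as the dog slept".split()

for word, score in keywords(study, reference):
    print(f"{word:8} {score:.2f}")
```

Words such as "the" that are frequent in both corpora drop out, while words concentrated in the study corpus rise to the top of the list.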
Specialised Word List: create a wordlist from a corpus (using a concordancer or other utilities). You may need to create your own corpus (BootCaT). We will create a business keyword list in the lab.
Some general thoughts: we will be using some simple software in the computer lab. Try not to get too involved in the details of using the software, at least not to the exclusion of broader, conceptual issues. It is important to know the corpus you are using. What does it consist of? Are there any special features, such as all lower case? Are there any annotations?