1
Lexical bundles
Last week we looked for n-grams.
A variant is lexical bundles (although some people say "lexical bundles" when they mean n-grams).
2
What are lexical bundles?
Term from Doug Biber.
Similar to n-grams, but with a minimum frequency and a minimum range (e.g. a sequence must occur in 15 out of the 20 files in the corpus).
Sometimes a minimum MI score is also required.
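The frequency and range thresholds can be sketched in a few lines of Python. This is only an illustration, not Biber's actual procedure: the function name `lexical_bundles`, the whitespace tokenisation, the toy texts and the threshold values are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def lexical_bundles(texts, n=4, min_freq=2, min_range=2):
    """Return n-grams that meet a minimum frequency and a minimum range.

    texts     -- list of strings, one per corpus file (hypothetical input)
    min_freq  -- minimum total number of occurrences in the corpus
    min_range -- minimum number of different texts containing the n-gram
    """
    freq = Counter()  # total occurrences across all texts
    rng = Counter()   # number of texts in which each n-gram occurs
    for text in texts:
        grams = ngrams(text.lower().split(), n)  # naive whitespace tokeniser
        freq.update(grams)
        rng.update(set(grams))  # each text counts once per n-gram
    return {g: (freq[g], rng[g]) for g in freq
            if freq[g] >= min_freq and rng[g] >= min_range}


# Toy corpus of three "files" (invented for illustration)
texts = [
    "on the other hand it is important to note this",
    "it is important to see that on the other hand we agree",
    "on the one hand it is difficult to say",
]
bundles = lexical_bundles(texts, n=4, min_freq=2, min_range=2)
for gram, (f, r) in bundles.items():
    print(" ".join(gram), "freq:", f, "range:", r)
```

With a real corpus the thresholds would be much higher and are usually normalised (for example, occurrences per million words); the values here are only large enough to filter the toy data.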
3
Lexical bundles in academic registers
Academic registers vs. non-academic registers
Across academic registers
Disciplinary variation within the same register
4
Corpus used – Biber et al
5
Examples of LB in academic registers
Referential expressions: "at the bottom of", "is one of the"
Discourse organizers: "on the other hand", "in addition to the"
Stance expressions: "it is difficult to", "it is important to"
6
Disciplinary variation
7
Statistics
Based on probability
How likely is it that some event or outcome is due to chance (e.g. tossing coins)?
Applied to experimental data: drug trials, teaching methods.
Statements about the outcome: the probability that the outcome occurred by chance is less than 1 in 100 (p < 0.01).
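The coin-tossing case can be worked through directly. The sketch below assumes a fair coin and independent tosses, and the helper name `prob_at_least` is invented for the example; it computes the probability of seeing a given number of heads or more purely by chance.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials,
    each with success probability p (binomial upper tail)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Chance of 9 or more heads in 10 tosses of a fair coin: 11/1024
p_nine = prob_at_least(9, 10)

# Chance of 10 heads in 10 tosses of a fair coin: 1/1024
p_ten = prob_at_least(10, 10)

print(f"9+ heads:  p = {p_nine:.4f}")
print(f"10 heads:  p = {p_ten:.5f}")
```

Nine or more heads has a chance probability just above 0.01, so it would not support a "p < 0.01" statement, whereas ten heads out of ten would.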
8
Statistics and texts
I view stats as a way of ranking (presenting) data for you to examine.
We cannot make statements such as “there is a 1 in 100 probability that this text data occurred by chance”.
We can note that a word pair has a high Mutual Information score.
Text analysis is not an experiment, and text data is not random.
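As an illustration of using a score to rank data, here is a minimal sketch of a Mutual Information score for an adjacent word pair. It uses one common formulation (pointwise MI, log base 2); the function name `mi_score` and the toy sentence are assumptions, and corpus tools differ in details such as window size and smoothing.

```python
from collections import Counter
from math import log2


def mi_score(tokens, w1, w2):
    """Pointwise mutual information for the adjacent pair (w1, w2).

    MI = log2( f(w1 w2) * N / (f(w1) * f(w2)) ), where N is the number
    of tokens (used here as an approximation for the number of bigram
    positions as well). Returns None if the pair never occurs.
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    pair = bigrams[(w1, w2)]
    if pair == 0:
        return None
    return log2(pair * n / (unigrams[w1] * unigrams[w2]))


# Toy data (invented): "of" and "course" only ever occur together
tokens = "of course we think that of course the cat sat on the mat".split()
score = mi_score(tokens, "of", "course")
print(round(score, 2))
```

A high score says that the two words occur together more often than their individual frequencies would predict. It is a ranking aid, not a significance test.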
9
Learner Corpora - ICLE
10
ICLE
3,640 essays, 2.5 million words
Students learning English in several European countries.
Task and learner variables for the different texts are stored in a database.
Learners are young adults at an advanced proficiency level.
Main source for learner corpus studies.
I have some preliminary data from Japan and I will use that in some comparisons today.
11
Emerging paradigm
Learner corpus studies are evolving from:
Error analysis: identifying, describing and explaining errors
Contrastive analysis: comparison of languages, learner language and target language; learner language and source language; different learner languages
SLA: notions of interlanguage development, universals, etc.
I won’t talk much about errors today, but learner corpora are sometimes annotated with error tags. You should get some sense of contrastive analysis and SLA.
12
LC results are preliminary
1. We need a wider range of learner corpora, covering a range of proficiency levels and a number of L1-L2 combinations. This point is self-evident.
2. Existing learner corpora contain little in the way of analytical markup, i.e., annotations that code grammatical information. The form-function or form-meaning part of the analysis must therefore be completed manually, with all the practical and theoretical problems typically encountered when assigning language forms to abstract categories.
3. There is uncertainty about the exact nature of the relationship between particular learner corpora and a more general characterisation of interlanguage. A learner corpus often represents just a single genre, such as argumentative essays, and so some features of the learner’s production may be closely associated with that genre rather than being more generally representative of interlanguage. (See Dagneaux 1995.) Thus while we might expect some aspects of the learner’s production to be relatively invariant over a range of modalities and genres, other aspects of production are likely to be highly modality-specific or genre-specific. At present, however, the range of learner corpus types needed to assess the variability of different aspects of language production is not available.
13
Contrastive theme
Research on learner corpora is often inherently contrastive.
Granger (1998) refers to a new research paradigm of contrastive interlanguage analysis, which covers both NS/NNS and NNS/NNS comparisons.
14
Contrastive theme
Studies which use native speaker corpora as a benchmark for the analysis of learner corpora (i.e., NS/NNS comparisons) provide evidence for the nature of interlanguage, focussing on the non-native aspects of learners’ speech or writing.
These studies typically report on features which are overused or underused, in addition to those which are misused by language learners (Leech 1998).
15
Contrastive theme
Alternatively, a comparison of different NNS corpora can be used to highlight aspects of language use and development shared by learners with different language backgrounds.
16
Counting
Learner corpus studies often involve counting particular words or grammatical categories, a process which is not as simple as it sounds because of the ill-formed or variable nature of L2 production data.
17
Counting
A further source of complexity is that automated counting routines are based on formal identity of linguistic items, not form-function identity. For instance, a count of the word "can" in an untagged corpus does not discriminate between "can" as a noun and "can" as a modal auxiliary. Furthermore, in some studies an even finer-grained functional categorisation may be necessary to distinguish, for example, the ability uses of the modal "can" from the permission uses.
The tabulation of form-function linguistic items in learner corpora can be time-consuming in cases where the corpora are not already annotated for the categories of interest. Yet such fine-grained analyses are often needed to give an accurate picture of the nature of interlanguage.
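The difference between a form-based count and a form-function count can be seen in a toy example. The sentence and its part-of-speech tags below are hand-assigned for illustration (MD for modal and NN for noun are borrowed loosely from common tagging conventions); no real tagger is involved.

```python
from collections import Counter

# Hand-tagged toy data: (word, tag) pairs. Tags are illustrative only.
tagged = [
    ("I", "PRP"), ("can", "MD"), ("open", "VB"), ("the", "DT"),
    ("can", "NN"), ("if", "IN"), ("you", "PRP"), ("can", "MD"),
    ("not", "RB"),
]

# Form-based count: every token spelled "can", whatever its function.
# This is all that an untagged corpus allows.
form_count = sum(1 for word, _ in tagged if word.lower() == "can")

# Form-function count: the same tokens, split by part-of-speech tag.
by_tag = Counter(tag for word, tag in tagged if word.lower() == "can")

print("form count:", form_count)
print("by tag:", dict(by_tag))
```

Separating ability uses of modal "can" from permission uses would need a further layer of annotation that POS tags do not provide, which is why such analyses are often done by hand.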
18
Explanation of findings
Once corpus-based data on particular characteristics of interlanguage have been analysed, it is possible to look for explanations for these features, which typically involve factors such as:
L1 transfer: Some forms or grammatical patterns found in the learner’s language production may result from the intrusion of L1.
General learner strategies: To help deal with the complex task of speaking or writing in a second language, the learner may adopt coping strategies such as the use of L1 forms, circumlocution, avoidance strategies, etc.
19
Explanation of findings
Paths of interlanguage development: Some aspects of interlanguage, such as the development of negation or of tense/aspect marking, proceed in a series of stages which may be tracked using longitudinal studies of learner output.
Intralingual overgeneralisation: Some features of the learner’s language may be due to overgeneralisation of an aspect of L2 grammar, such as the use of -ed to mark past tense.
Using corpora of different types (NS corpora, learner corpora, translation corpora) can help to determine the source of the patterns found.
20
Explanation of findings
Input bias: The form of the learner’s production may reflect the particular input received, such as the language used in coursebooks.
Genre/register influences: Researchers working with learner corpora have suggested that the writing of L2 learners contains a variety of informal patterns that are characteristic of spoken discourse.
21
Corpus-derived hypotheses
Aijmer (2002) starts out from the general observation that non-native speakers find it difficult to use English modal verbs appropriately. She compares modal use by Swedish learners of English and by native speakers, and finds a general overuse of modals by Swedish learners and a particular overuse of the modals will, must, have (got) to, should and might.
22
Time dimension
Longitudinal studies are necessary to track the development of individuals’ interlanguage over time.
Cross-sectional or quasi-longitudinal studies are typically used initially.
23
Corpus-based exploratory analyses
We can perform some general corpus analyses to find outstanding patterns which can then be investigated in more detail.
An example of a common type of data used in corpus linguistics is the word frequency list, which can easily be generated for any learner corpus and the results inspected for unusual patterns. But rather than simply examining a word frequency list of a learner corpus, we can compare the frequency of words in a learner corpus with their frequency in a reference corpus, which might be another learner corpus or a native speaker corpus.
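A minimal sketch of such a comparison is given below. It normalises counts to frequencies per 1,000 tokens so that corpora of different sizes can be compared, and then ranks words by the ratio of learner frequency to reference frequency. The function names, the ratio threshold and the toy data are assumptions for illustration; published studies more often use a keyness statistic than a plain ratio.

```python
from collections import Counter


def per_thousand(tokens):
    """Relative frequency of each word per 1,000 tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: 1000 * c / total for w, c in counts.items()}


def overused(learner_tokens, reference_tokens, min_ratio=2.0):
    """Words whose relative frequency in the learner data is at least
    min_ratio times that in the reference data, highest ratio first.
    Words absent from the reference corpus are skipped in this sketch."""
    learner = per_thousand(learner_tokens)
    reference = per_thousand(reference_tokens)
    ratios = {w: learner[w] / reference[w] for w in learner if w in reference}
    hits = [(w, r) for w, r in ratios.items() if r >= min_ratio]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Toy data (invented for illustration)
learner = "i think we must go and i think we should try i think".split()
reference = "we think they went and we tried and we left early".split()

result = overused(learner, reference)
for word, ratio in result:
    print(word, round(ratio, 2))
```

Reversing the two arguments gives candidates for underuse. In either direction the output is a list of hypotheses to check against the texts, not a finding in itself.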
24
Frequency analysis - bigrams
Looking at the table as a whole, we can see that further analysis of the bigram "will be" from the French subcorpus, the bigram "have to" from the German subcorpus, and the use of "I think" in the Japanese subcorpus might be fruitful, as these results suggest an overuse of these forms in the learners’ English. This methodology can be extended to trigrams, tetragrams, and so on. It should be noted, however, that the larger the n-gram, the more idiosyncrasies appear, due to the particular content being described. It is also possible to examine POS tag sequences rather than word sequences (Aarts and Granger 1998).
25
Three-word chunks
In this table I have used some software to calculate the most common three-word chunks in the different learner corpora. I don’t want to say too much about what is involved in that, but again you can see that such data can be used to generate hypotheses about learner language.
26
Sentence-initial words
27
Sentence-initial words
28
Sentence-initial bigrams
29
Sentence-initial bigrams
30
Sentence-initial bigrams
31
Position of words
In addition to a “bag of words” approach, we can look at position data.
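Position data of this kind can be collected with a short script. The sketch below counts how often a set of target words opens or closes a sentence; the regular-expression sentence splitter is deliberately crude, and the function names, word list and example text are all assumptions for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    """Very rough sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def positional_counts(text, targets):
    """Count how often each target word opens or closes a sentence."""
    initial, final = Counter(), Counter()
    for sentence in split_sentences(text):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue
        if words[0] in targets:
            initial[words[0]] += 1
        if words[-1] in targets:
            final[words[-1]] += 1
    return initial, final


# Toy text (invented for illustration)
text = ("Nowadays people travel more. Now we can fly cheaply. "
        "Nowadays this is normal. We did not do this before.")
time_adverbs = {"nowadays", "now", "today", "then", "before", "again"}

initial, final = positional_counts(text, time_adverbs)
print("sentence-initial:", initial.most_common())
print("sentence-final:", final.most_common())
```

Running the same counts separately over each L1 subcorpus and over a native-speaker corpus would give rankings that can be compared side by side.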
32
Position of adverbs
33
Time adverbs
34
Time adverbs
Sentence-initial time adverbs are:
nowadays, now, today (French L1); nowadays, then, now (Polish L1); today, then, now (Swedish L1); now, then, today (NS)
35
In final position, the top-ranked time adverbs are:
again, before, nowadays (French L1); nowadays, before, again (Polish L1); today, again, before (Swedish L1); today, again, now (NS).
36
Two-word adverbs
37
of course
38
Learner corpora
The second reading, on criterial features, is more complex.
39
Criterial features
If you have examples of language use by learners (differentiated by L1 etc.) at different levels, you can use them to find the criterial features associated with each level of proficiency.
Those criterial features can be used in two ways:
As a measure of expected proficiency and as a guide to the focus of language teaching
As input to automatic testing of the level of learners (we return to this topic in a later class)
40
Criterial features
There are different ways you can gather criterial features, for example:
Lexicogrammatical features, perhaps using an n-gram analysis
A large set of grammatical features: you could count features such as passives, relative pronouns, participial clauses, nominalisations
A grammatical analysis: parsing
41
Criterial features
The reading opts for the latter. Note that this differs from the lexico-grammatical view I have described.
Step 1 is to tag the words with part-of-speech tags. There are about 50 or so POS tags, not just V, N, P, etc. E.g., CLAWS7 see
42
Criterial features
The next stage is parsing: analysing sentences in terms of verb frames, e.g. NP V PP (Noun Phrase – Verb – Prepositional Phrase) for "the cat sat on the mat".
Parsing in this case is based on a probabilistic grammar and so is corpus-based.
In addition, the grammatical relations are extracted; thus "the cat" above is identified as SUBJECT.
43
Analysis of L2 language
44
Criterial features
Having done this analysis, it is possible to analyse and compare students at different levels of proficiency, as discussed in the paper.