Guy Aston (SSLMIT, University of Bologna): The learner as corpus designer
… or the art of fruit salads
Learner uses of corpora: form-focussed (data-driven learning); meaning-focussed (learning the culture); skill-focussed (reading practice); browsing environment (serendipity); reference tool for other tasks (reading/writing aid)
Why make your own corpus? You can devise your own recipe; you know what’s in it; you learn how to do it; it can be fun; it can provide practice in language use
The raw ingredients
Devising your own recipe: only the text-type(s) you want; only the texts you want; the quantity you want … small and specialised is beautiful
You know what’s in it: top-down knowledge of corpus; top-down knowledge of texts
You learn how to do it: can be a useful skill for many language workers (technical writers, translators, teachers); can make you a more critical corpus user
It can be fun: provides a challenge; gives a sense of achievement/satisfaction. Practice in language use: design/construction/evaluation of corpora can be communicative activities
Why use standard corpora? Less effort; more reliable; better packaging; you don’t want to learn to make your own
Less effort
More reliable: if it’s well designed; if it fits your needs
Better packaging: metatextual information; annotation; corpus-specific software
You don’t want to learn to make your own?
A compromise strategy: make your own subcorpus, assembling it from the pre-prepared ingredients of a larger corpus; or, in other words … go to a (fruit) salad bar
(Pick ’n’ mix with the BNC)
You have a choice of: text-types; individual texts; selection by pre-determined criteria; selection by hand … or both
You know what went in, so top-down processing is easier. Little effort in comparison with making your own
Good packaging: metatextual information; linguistic annotation; can use software designed for full corpus; indexed
You get to learn: what are(n’t) useful subcorpora; what are(n’t) useful design criteria; how to do it
It can be fun: challenge / achievement / satisfaction. You can talk about its design / construction / evaluation
Talking about fruit salad BNC Sampler: KC2
And now to details … the Sampler awaits!
You can create subcorpora of: specific corpus texts; texts containing solutions to a query; encoded categories of texts; your own categories of texts. And compare them with: other subcorpora; the full corpus
Text analysis: selecting. Choosing specific texts
Viewing the index
Party policies (will/shall be + VVN)
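A query of the "will/shall be + VVN" kind can be pictured as a three-token pattern over part-of-speech-tagged text. The sketch below is only an illustration in plain Python, not the corpus software's own query syntax: the tagged sentence is invented, and the function name `find_modal_passives` is hypothetical. The tags follow the CLAWS C5 tagset used in the BNC, where VVN marks the past participle of a lexical verb.

```python
# Toy tagged text: (word, CLAWS C5 tag) pairs invented for illustration.
tagged = [
    ("Taxes", "NN2"), ("will", "VM0"), ("be", "VBI"), ("cut", "VVN"),
    ("and", "CJC"), ("schools", "NN2"), ("shall", "VM0"), ("be", "VBI"),
    ("improved", "VVN"), ("but", "CJC"), ("we", "PNP"), ("will", "VM0"),
    ("be", "VBI"), ("careful", "AJ0"),
]

def find_modal_passives(tokens):
    """Return (modal, 'be', participle) triples matching will/shall be + VVN."""
    hits = []
    for i in range(len(tokens) - 2):
        (w1, _), (w2, _), (w3, t3) = tokens[i:i + 3]
        if w1.lower() in {"will", "shall"} and w2.lower() == "be" and t3 == "VVN":
            hits.append((w1, w2, w3))
    return hits

hits = find_modal_passives(tagged)
```

On the toy data, "will be careful" is not matched because its third token is tagged as an adjective rather than VVN.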
Or, to return to our fruit salad text …
Most frequent adjectives (KC2)
Appreciating food (KC2)
A bad language subcorpus: texts containing solutions to a query
Choosing the bad language texts
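Building a subcorpus from "texts containing solutions to a query" amounts to keeping every text in which the query matches at least once. A minimal sketch, assuming the corpus is available as a mapping from text ID to plain text; the IDs, the sample sentences, and the helper name `texts_matching` are all invented for illustration, and a milder query stands in for the one on the slide.

```python
import re

# Hypothetical mini-corpus: text IDs and contents are invented for illustration.
texts = {
    "T01": "Oh damn, I forgot the cream for the fruit salad.",
    "T02": "The committee will consider the proposal next week.",
    "T03": "Damned if I know where the strawberries went.",
}

def texts_matching(corpus, pattern):
    """Return the sorted IDs of texts with at least one match for the query."""
    query = re.compile(pattern, re.IGNORECASE)
    return sorted(tid for tid, body in corpus.items() if query.search(body))

# The subcorpus is simply the set of texts in which the query has a solution.
subcorpus_ids = texts_matching(texts, r"\bdamn\w*")
```

The resulting ID list can then be used to restrict later queries to those texts only.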
collocates of f.*k.* (f_ words)
oh fuck.* with oh as collocate
collocates of oh
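Collocate lists of this kind are, at their simplest, counts of the words found within a few tokens of each hit for the node word. The sketch below assumes a plain list of tokens and a symmetric window. The function name `collocates`, the `span` parameter, and the sample tokens are illustrative, and no significance measure (such as mutual information) is applied.

```python
import re
from collections import Counter

def collocates(tokens, node_pattern, span=2):
    """Count words occurring within `span` tokens either side of a node match."""
    node = re.compile(node_pattern)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if node.fullmatch(tok):
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Invented token sequence, for illustration only.
tokens = "oh no not again oh dear oh no".split()
counts = collocates(tokens, r"oh", span=1)
```

Because the node is given as a regular expression, a pattern such as `r"f.*k.*"` could be used in the same way.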
Making subcorpora using encoded categories: ‘context-governed’ spoken texts (monologue: 17 texts; dialogue: 29 texts)
Monologue vs Dialogue. More frequent in monologue*: could, had, he, know, their, were, when, who, your. More frequent in dialogue*: 'll, 'm, any, no, pounds, right, yeah, yes. (*Ranked 20+ positions higher in first 100 words.)
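The "ranked 20+ positions higher in first 100 words" criterion can be approximated by comparing frequency ranks between two subcorpora. This is a sketch under stated assumptions, not the procedure actually used for the slide: the function names are hypothetical, and a word absent from the other subcorpus's top list is treated here as ranked just below it.

```python
from collections import Counter

def ranks(tokens, top_n=100):
    """Map each of the top_n most frequent words to its rank (1 = most frequent)."""
    common = Counter(tokens).most_common(top_n)
    return {word: rank for rank, (word, _) in enumerate(common, start=1)}

def ranked_higher(tokens_a, tokens_b, top_n=100, min_gap=20):
    """Words in A's top_n ranked at least min_gap places higher than in B.

    Assumption: a word missing from B's top_n counts as rank top_n + 1.
    """
    ra, rb = ranks(tokens_a, top_n), ranks(tokens_b, top_n)
    return sorted(w for w, r in ra.items() if rb.get(w, top_n + 1) - r >= min_gap)

# Tiny invented samples; thresholds are scaled down to suit them.
mono = "he he he had had when".split()
dia = "yeah yeah yeah right right he".split()
more_in_mono = ranked_higher(mono, dia, top_n=3, min_gap=2)
more_in_dia = ranked_higher(dia, mono, top_n=3, min_gap=2)
```

With real subcorpora the defaults (`top_n=100`, `min_gap=20`) correspond to the thresholds named on the slide.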
Investigating the differences: no occurrences of all right in monologue. when you’re / you’ll / you’d / you’ve is more common in monologue than when we’re / we’ll / we’d / we’ve; vice-versa in dialogue
Pronoun (+ contraction): we/we’* much more frequent in dialogue. [Chart: you, we, you’*, we’* in monologue (Mo) vs dialogue (Dia)]
you and we [Chart: you vs we, monologue vs dialogue]
Subcorpora using your own categories: David Lee’s book genres: academic non-fiction (13 texts), non-academic non-fiction (15 texts), prose fiction (13 texts)
Distinctive -ly adverbs of: academic non-fiction: accordingly, essentially, eventually, largely, namely, notably, respectively, surprisingly; non-academic non-fiction: effectively, merely, normally, obviously, possibly, specially; prose fiction: carefully, quietly, slightly, slowly, softly, surely, truly
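One rough way to look for adverbs distinctive of a genre is to compare each word's relative frequency in that subcorpus with its relative frequency in a reference subcorpus. The slides do not say how the lists above were derived, so the sketch below is only one possible approach: the ratio threshold, the add-one smoothing, and the name `distinctive_ly` are assumptions. It also picks words by the -ly suffix alone, which would let in non-adverbs such as "family"; with a tagged corpus the adverb tag would be a better filter.

```python
from collections import Counter

def distinctive_ly(target, reference, min_ratio=2.0):
    """-ly words at least min_ratio times more frequent (relatively) in target."""
    t, r = Counter(target), Counter(reference)
    out = []
    for word, n in t.items():
        if not word.endswith("ly"):
            continue
        rel_t = n / len(target)
        # Add-one smoothing (an assumption) avoids dividing by zero.
        rel_r = (r[word] + 1) / (len(reference) + 1)
        if rel_t / rel_r >= min_ratio:
            out.append(word)
    return sorted(out)

# Invented token lists, for illustration only.
target = "namely this namely that namely those largely so it seems".split()
reference = "largely slowly and largely softly she spoke to him then".split()
result = distinctive_ly(target, reference)
```

Here "largely" is not returned because it is relatively more frequent in the reference sample than in the target.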
largely (academic non-fiction)
it (academic non-fiction)
To conclude …
Working with subcorpora can allow: study/comparison of forms/meanings in particular texts/text-types; better-focussed reading practice; more appropriate reference tools for particular tasks; more focussed browsing
Subcorpora: may not be representative (but nor is most language learning data); are good for forming hypotheses to be tested more widely; will allow more interesting uses when extracted from a larger corpus
Making your own provides: better preparation and motivation for corpus use; more critical awareness; lots to talk about
Enjoy!