“Corpus Insights from Lextutor R&D that are too small to publish but too interesting to ignore” +1 Tom Cobb SFU March 12, 2015 1.



Promised Abstract
For proponents of Data-Driven Language Learning, corpora and their frequency lists are supposed to be for learners, but applied linguists and course developers can learn some things too. Like whether it is actually possible to build an L2 lexicon through reading. Like what Zipf's Law really tells us we can and cannot do to simplify reading materials. Like how much English Francophone learners actually "get for free" (and Anglophones for French) from cognates at different stages of learning. And finally whether Data-Driven Learning actually "works" for language learners, and for what in particular. This talk will be a survey of current research spin-off from Lextutor development work.

In the month since, these ideas have been somewhat modified

New + improved focus
Corpora and the frequency lists they generate are supposed to be for learners, … but applied linguists, course developers, and teachers can learn a few things too

Like what learners themselves see as their vocabulary learning needs

Like whether there are more friendly cognates or deadly faux-amis between English and French

Like whether it is more important in ESL to learn the Greco-Latin or Anglo-Saxon part of English … and whether this changes with purpose for learning and stage of learning

Like whether there are really a small number of Greco-Latin word roots that can generate most of the polysyllabic words in English

Like whether Zipf’s Law imposes constraints on how much a text can recycle its lexis

+ Like whether learners can actually benefit (for some part of their learning) from approaching language as a corpus, sliced and diced with software tools that expose its patterns

In other words ~
This talk will be a survey of current research spin-off from Lextutor work with frequency lists
A set of mini-talks … with the subtext that there is a sort of data-driven language learning for language teachers & researchers as well as learners

SPECIFICALLY, … ONE RECENT FRUIT OF THE DDL APPROACH TO LANG LEARNING IS A COMPLETE SET OF FREQUENCY LISTS FOR ENGLISH
- FAMILIZED
- BY K-LEVEL
- SPOKEN + WRITTEN
- US + UK

A COMPLETE SET OF FREQ LISTS ALLOWS US TO TEST SOME CLAIMS THAT ROLL AROUND THE ESL UNIVERSE, UNTESTED AND UNTESTABLE ~ SOME FOR A FEW WEEKS, SOME FOR A FEW DECADES

A complete set of Freq Lists like this

Such lists, allied with relevant software, allow us to evaluate previously unverifiable ideas like ~
Claim 1: There exists a small set of ‘Master Words’ whose parts can unlock most of the polysyllabic lexicon of English
Interesting if true…

The famous ‘14 Master Words’

Finding 1
At every frequency level from 1 to 25k, the master-word parts account for about 10% of tokens
– And this is a generous estimate owing to overmatching (-log- matches ‘flog’ etc.)
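
The overmatching problem is easy to reproduce. The following is a minimal Python sketch, not the Lextutor routine: it assumes word parts are matched as plain substrings, and both the `PARTS` list (a hypothetical subset chosen for illustration) and the toy sentence are invented.

```python
from collections import Counter

# Hypothetical subset of word parts, for illustration only (an assumption,
# not the list used in the talk).
PARTS = ["log", "graph", "spect"]

def matching_words(tokens, parts):
    """Return a Counter of tokens that contain any part as a plain substring."""
    counts = Counter(t.lower() for t in tokens)
    return Counter({w: n for w, n in counts.items() if any(p in w for p in parts)})

def part_coverage(tokens, parts):
    """Share of all tokens that contain at least one part."""
    if not tokens:
        return 0.0
    return sum(matching_words(tokens, parts).values()) / len(tokens)

# Toy text, invented for the example.
tokens = "the biology graph shows that they flog a log in the prospect".split()
hits = matching_words(tokens, PARTS)
coverage = part_coverage(tokens, PARTS)
print(sorted(hits))        # includes 'flog', which has nothing to do with -log-
print(round(coverage, 3))
```

Because substring matching counts ‘flog’ and the free-standing ‘log’ alongside ‘biology’, any coverage figure computed this way can only overstate what the word parts really unlock.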

Claim 2: Frequency provides a reasonable basis for planning vocabulary development in a second language

Group Lex

Finding 2
From 5,533 words entered fairly laboriously by 400+ students (Ss) over 5 years ~
60% of the words that Ss enter are in the 3k-6k zone
– Supports the idea that instructed ESL vocab size is about 2,500 word families
– Ss do not roam about the outer fringes of the lexicon but rather LARGELY seek out items that are in their ZPD
– And… 60% of the words Ss enter are Greco-Latin in origin
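
A count like the 3k-6k figure can be sketched in a few lines of Python. This is an illustration only: the `band_of` lookup is a hypothetical stand-in for real k-level frequency lists, and the words and band numbers in it are invented.

```python
from collections import Counter

def band_profile(words, band_of):
    """Share of words per frequency band; words not in the lookup go to 'off-list'."""
    bands = Counter(band_of.get(w.lower(), "off-list") for w in words)
    total = sum(bands.values())
    return {band: n / total for band, n in bands.items()}

def zone_share(profile, low, high):
    """Summed share of the numeric bands from low to high inclusive."""
    return sum(share for band, share in profile.items()
               if isinstance(band, int) and low <= band <= high)

# Hypothetical lookup: word -> k-level (1 = first thousand families, and so on).
band_of = {"the": 1, "house": 1, "abandon": 3, "contemplate": 4, "meticulous": 6}

# Invented sample of learner-entered words.
entered = ["abandon", "contemplate", "meticulous", "house", "zyzzyva"]
profile = band_profile(entered, band_of)
print(profile)
print(round(zone_share(profile, 3, 6), 2))
```

With real lists in place of `band_of`, the same two functions would give the proportion of learner-entered words falling in any chosen zone.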

Claim 3
The function words & very common words of English are mainly Anglo-Saxon … ~ but the rest of the lexicon (3k and up) is mainly Greco-Latin

VocabProfile 2014 (BNC-COCA)

List carve-up (1k, 2k … 10k, 11k)

So does Anglo-Saxon (ASAX) peter out after 1k-2k?

Finding 3
Greco-Latin (GLAT) and ASAX are about equal all the way to 11k, and possibly beyond
– (So, no, Francophones cannot just get by with “the French part of English”)
English texts can vary from 0% to a max of about 50% GLAT

Claim 4
This massive etymo-duality of English would be pedagogically interesting, except that…
The problem of faux-amis in English-French cognates is major

                      FORM (ortho)
                      SIMILAR                 DIFFERENT
MEANING  SIMILAR      (1) video (vidéo)       (2) school (école)
MEANING  DIFFERENT    (3) actual (actuel)     (4) impeach (empêcher)

Finding 4
Less than 3% of GLAT cognates are in Box 4
– (so 97% are probably usable)
So the faux-amis issue is really not major
– Except for linguists
– Even governments know that Spanish immigrants will learn French and Germans will learn English

VP-Cognates… usable for what?
To modify the ‘cognativity’ of ESL reading texts up and down
– Launch francophone readers with lots of GLAT items
– Wean intermediates off cognates with lots of ASAX items

VocabProfile: Edit-to-a-Profile

Claim 5
Not so fast with the text modifications…
Zipf’s Law places strong constraints on how much texts can be modified
– Particularly with regard to the amount of word recycling they can contain

“Repetition is affected by Zipf’s law as it is in all meaning-focused activity, with over half of the different words appearing only once”
“The use of material written within a controlled vocabulary does little to change this spread of repetitions…”

“Graded readers cannot avoid Zipf’s law, and so half of the different words in a text are likely to occur only once.”

Paul Nation endorses this view

Summary of Zipfian “laws”
Any natural text is about 50% singletons
Regardless of text length
SUCH THAT
– It is fruitless to try to write texts that have any substantial amount of extra recycling
Investigating this claim requires building some new software

Muscle, know, helper

Finding 6
Ungraded novel
Ch 1 – …% – Families – 38%
Ch 2 – 45% – Families – 38%
Ch 1+2 – 36% – Families – 31%

Finding 6
Ungraded story
Ch 1 – …% – Families – 38%
Ch 2 – 45% – Families – 38%
Ch 1+2 – 36% – Families – 31%
Graded story
Ch 1 – 17% – Families – 12%
Ch 2 – 17% – Families – 12%
Ch 1+2 – 10% – Families – 6%

Finding 6
Texts can be modified to have any degree of repetition, from Ø words repeated to every word repeated
Would Zipf consider these to be “natural” texts? Does it matter?
Even unmodified texts increase their amount of recycling with length
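
The singleton measure behind these figures can be approximated with a short Python sketch. It is a simplification under stated assumptions: it counts raw word forms split on whitespace, not word families, and the two “chapters” are invented one-line texts, not the stories analysed in the talk.

```python
from collections import Counter

def singleton_share(text):
    """Share of different words (types) that occur exactly once in the text."""
    counts = Counter(text.lower().split())
    if not counts:
        return 0.0
    singles = sum(1 for n in counts.values() if n == 1)
    return singles / len(counts)

# Invented toy chapters.
ch1 = "the cat sat on the mat"
ch2 = "the dog sat on the log"

print(round(singleton_share(ch1), 2))
print(round(singleton_share(ch2), 2))
print(round(singleton_share(ch1 + " " + ch2), 2))  # lower than either chapter alone
```

Even in this toy case the combined text has a smaller share of singletons than either chapter alone, because words that occur once in each chapter stop being singletons when the chapters are joined.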

Claim 7 French “does not have room for an Academic Word List” -Horst & Cobb,

For a nuance on the finding to this one, come to AAAL in Toronto next week!

My first conclusion, then, is that corpora and frequency lists, queried by appropriate software, can show us the merits of some of the claims circulating in our field. But does this approach show our learners anything?

Claim 8
+ ESL learners generally benefit from hands-on work with corpora


Finding 8
In 56 studies comparing some type of corpus investigation with some other approach to learning the same content, the corpus approach surpassed the other approach by an average of 1.46 standard deviations

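Example of e.s. = 1.5 (a.k.a. a 1.5 std. dev. difference)
Control group: Mean = 80, Std Dev = 10
Experimental group: Mean = 92, Std Dev = …
$92 - 80 = 12$; pooled Std Dev $= \sqrt{(s_C^2 + s_E^2)/2} = 8$, where $s_C$ and $s_E$ are the control and experimental Std Devs; $12 / 8 = 1.5$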
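
The same arithmetic as a minimal Python sketch, assuming the standardized mean difference with an equal-weight pooled standard deviation. The experimental group's Std Dev is not legible in the transcript, so the value used here is an assumption: the one implied by a pooled Std Dev of 8.

```python
import math

def pooled_sd(sd_a, sd_b):
    """Equal-weight pooled standard deviation of two groups."""
    return math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)

def effect_size(mean_exp, mean_ctl, sd_exp, sd_ctl):
    """Standardized mean difference: gap between means in pooled-SD units."""
    return (mean_exp - mean_ctl) / pooled_sd(sd_exp, sd_ctl)

sd_ctl = 10
# ASSUMPTION: experimental SD back-solved from a pooled SD of 8; the slide's
# own value is missing from the transcript.
sd_exp = math.sqrt(2 * 8 ** 2 - sd_ctl ** 2)

d = effect_size(92, 80, sd_exp, sd_ctl)
print(round(pooled_sd(sd_exp, sd_ctl), 2))
print(round(d, 2))
```

Read this way, an effect size of 1.5 means the experimental mean sits one and a half pooled standard deviations above the control mean.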

The greater interest, of course, is not the overall finding, but what corpus consultation is more and less useful for
– Writing, collocation, vocab development, translation
… and for whom
– Beginners, Advanced, ESP, EAP…

Bigger

… For 217 comparison points (research questions)

Learn more, book now for AAAL 2016!

Meantime, your Quiz!