Tracking Language Development with Learner Corpora Xiaofei Lu CALPER 2010 Summer Workshop July 12, 2010.

Presentation transcript:

Tracking Language Development with Learner Corpora Xiaofei Lu CALPER 2010 Summer Workshop July 12, 2010

2 Outline Corpora and learner corpora Graphic Online Language Diagnostic (GOLD)

3 Corpora and learner corpora What is a corpus Types of corpora Corpus design and compilation Corpus annotation Corpus querying and analysis Learner corpora and L2 development Resources

4 What is a corpus? Leech (1992):  an unexciting phenomenon, a helluva lot of text, stored on a computer Sinclair (1991):  a collection of naturally-occurring language text, chosen to characterize a state or a variety of language Sinclair (2004):  a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research

5 Types of corpora General-purpose vs. specialized corpora  The British National Corpus  Michigan Corpus of Academic Spoken English Native vs. learner corpora  International Corpus of Learner English Monolingual vs. parallel & comparable corpora  The JRC-Acquis Multilingual Parallel Corpus  The English-Chinese Parallel Concordancer

6 Types of corpora (cont.) Corpora representing one or diverse varieties  International Corpus of English Synchronic vs. diachronic corpora Spoken vs. written corpora

7 Corpus design Purpose and type of corpus  Spoken/written; cross-sectional/longitudinal External criteria for content selection  Communicative function of a text  Mode, medium, interaction, domain, topic Representativeness, balance, size, sampling Design of the BNC

8 Corpus design (cont.) Encoding meaningful metadata information  Learner: L1, gender, program level, discipline …  Sample: date, mode, task, genre, rating …  Facilitates contrastive and longitudinal studies MICASE speaker and transcript attributes
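A minimal sketch of how encoded metadata supports contrastive and longitudinal studies. The `Sample` class, its field names, and the sample texts are hypothetical illustrations, not the schema of MICASE or of any particular corpus.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One learner text plus learner- and sample-level metadata (illustrative fields)."""
    learner_id: str
    l1: str
    program_level: int
    date: str   # ISO date string, so string order equals chronological order
    task: str
    text: str

def select(samples, **conditions):
    """Return the samples whose metadata fields equal every given condition."""
    return [s for s in samples
            if all(getattr(s, k) == v for k, v in conditions.items())]

samples = [
    Sample("s01", "Chinese", 2, "2010-05-01", "narrative", "I went to school yesterday."),
    Sample("s01", "Chinese", 2, "2010-02-01", "narrative", "I go to school yesterday."),
    Sample("s02", "Korean", 3, "2010-02-01", "argument", "Many people think so."),
]

# Longitudinal view: every sample from one learner, ordered by date.
longitudinal = sorted(select(samples, learner_id="s01"), key=lambda s: s.date)

# Contrastive view: every sample written by learners with a given L1.
chinese_l1 = select(samples, l1="Chinese")
```

Without the metadata fields, neither the per-learner timeline nor the per-L1 subset could be recovered from the texts alone.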

9 Corpus annotation Why annotate Levels of corpus annotation Difficulties for corpus annotation Standards and encoding

10 Why annotate For linguistic research  Allow more effective corpus searches For natural language processing  Spelling and grammar checking  Machine translation

11 Levels of corpus annotation Sentence and word segmentation Part-of-speech (POS) tagging and lemmatization Syntactic parsing Semantic, pragmatic, and discourse tagging Learner corpora: error annotation Project-specific annotation

12 Difficulties for corpus annotation Ambiguity  I saw a pig with binoculars.  Problems for tagging, parsing, & WSD Unknown words  Identification  POS tagging  Semantic annotation

13 Standards and encoding Useful standards  Separable  Documentation  Linguistically consensual  Compatibility with existing standards Encoding  Simple encoding: present_JJ  XML-style: present
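The simple `word_TAG` encoding above can be read back with a few lines of code. This is a sketch under assumptions: the example sentence and the `DT` and `NN` tags are illustrative (only `present_JJ` comes from the slide), and `parse_tagged` is a hypothetical helper name.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs.

    rpartition splits on the LAST underscore, so a word that itself
    contains an underscore keeps it.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("_")
        if not sep:              # no underscore found: treat as an untagged word
            word, tag = token, None
        pairs.append((word, tag))
    return pairs

tagged = "the_DT present_JJ situation_NN"
pairs = parse_tagged(tagged)

# Once tags are separable from words, searching by tag is a simple filter.
adjectives = [word for word, tag in pairs if tag == "JJ"]
```

This illustrates the "separable" criterion: the annotation can be stripped or queried without damaging the original text.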

14 Corpus querying and analysis Using Windows- or web-based software  Good for processing raw corpora  Word frequency, concordances, lexical bundles, and keyword lists  Examples: AntConc and GOLD Using natural language processing tools  Good for processing annotated corpora  Extracting occurrences of grammatical patterns  Examples: Stanford parser and Tregex

15 Resources Books and journals  Hunston (2002): Corpora in Applied Linguistics  McEnery, Xiao, and Tono (2006): Corpus-Based Language Studies  International Journal of Corpus Linguistics  Corpus Linguistics and Linguistic Theory  Corpora Websites and mailing lists  Bookmarks for corpus-based linguists  Linguistic Data Consortium  The Corpora List  Stanford Natural Language Processing Group

16 Learner corpora and L2 development Samples from same students at different times  Did (targeted) language development take place?  Was a particular pedagogical intervention effective? Samples from different students  In what areas do students show different levels of development?  What factors affect students’ language development?

17 Graphic Online Language Diagnostic A free online tool for teachers to assess their students’ language development  Developed at CALPER, Penn State, funded by DOE  Project co-directors: Xiaofei Lu and Michael McCarthy Teachers can use GOLD to  Compile, upload, and manage their own corpora  Share corpora with each other  Search and analyze corpora Demonstration

18 Corpus compilation A user can compile a corpus by  Directly compiling and uploading an XML file  Using the easy-to-use guided XML creation interface An uploaded corpus can be easily managed  Documents can be added or deleted  The whole corpus can be deleted  Content and metadata of individual documents can be easily accessed

19 Corpus sharing GOLD facilitates easy data sharing A corpus may be set to be  Private, shared, or public Corpus owner may give other users right to  View, add, edit, or delete corpora Demonstration

20 Basic corpus information Word count  Alphabetic or numeric order  Can be downloaded as a text file Corpus and document statistics  Mean sentence length  Mean word length  Type-token ratio Demonstration
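The statistics listed above can be approximated as follows. This is a rough sketch, not GOLD's implementation: the regex tokenizer and the naive sentence splitter are simplifying assumptions that work only for clean English text, and the function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercased word tokens from a deliberately simple regex tokenizer."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def split_sentences(text):
    """Naive split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def corpus_stats(text):
    """Basic statistics for a non-empty text."""
    tokens = tokenize(text)
    sentences = split_sentences(text)
    types = set(tokens)
    return {
        "tokens": len(tokens),
        "types": len(types),
        "mean_sentence_length": len(tokens) / len(sentences),      # words per sentence
        "mean_word_length": sum(map(len, tokens)) / len(tokens),   # characters per word
        "type_token_ratio": len(types) / len(tokens),
    }

text = "The cat sat. The cat ran away."
word_counts = Counter(tokenize(text))
stats = corpus_stats(text)

alphabetic = sorted(word_counts.items())   # word list in alphabetic order
numeric = word_counts.most_common()        # word list by descending frequency
```

Type-token ratio falls as texts get longer, so it is best compared across samples of similar length.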

21 Corpus search Select one or more corpora to search Specify key words or phrases  May use the wildcard character, e.g. book* Specify contexts  Size of context window  Context words and their positions Specify metadata conditions
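A keyword search with a wildcard and a context window can be sketched as below. This is an illustration of the general technique, not GOLD's code: the `kwic` function is hypothetical, tokenization is a simple regex, and the wildcard is handled with Python's standard `fnmatch` module.

```python
import fnmatch
import re

def kwic(text, pattern, window=3):
    """Return (left, keyword, right) tuples for tokens matching a wildcard pattern.

    `window` is the number of context words kept on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if fnmatch.fnmatchcase(tok, pattern):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

text = "She booked a room. He read two books about booking rooms."
hits = kwic(text, "book*", window=2)

# A sortable display is then just a sort on one column, here the right context.
by_right_context = sorted(hits, key=lambda h: h[2])
```

Metadata conditions would be applied before this step, by choosing which documents' texts are passed to the search.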

22 Corpus search results Display of search results  Sortable KWIC display of search results  Sortable graphic display of search results Demonstration

23 Lexical bundle/collocation search Procedure  Select one or more corpora to search  Specify search word  Specify contexts  Specify metadata conditions Search results  Sortable list of n-grams found in selected corpora Demonstration
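The n-gram listing described above can be sketched as follows, assuming a simple regex tokenizer and treating each text separately so that n-grams never span two documents. The `bundles` function and the example sentences are hypothetical, not taken from GOLD.

```python
import re
from collections import Counter

def bundles(texts, word, n=3):
    """Count the n-grams that contain `word`, pooled over several texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if word in gram:
                counts[gram] += 1
    return counts

texts = [
    "On the other hand, it is cheap.",
    "On the other hand, it is slow.",
]
counts = bundles(texts, "hand", n=3)
ranked = counts.most_common()   # n-grams by descending frequency
```

Recurrent sequences such as "the other hand" surface at the top of the ranked list, while one-off combinations drop out once a minimum frequency is applied.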

24 Summary of features Difference from other online tools  Can create, share, and search multiple corpora  Can easily search subsets of data  Can work with any language Summary of corpus analysis functions  Word list  Corpus and document statistics: mean sentence length, mean word length, type-token ratio  Corpus search and collocation search

25 Sample questions to ask With data from an individual student, one can either describe or track development in  Patterns of usage of words and phrases – frequency, underuse, overuse, etc.  Lexical and syntactic complexity  Appropriate usage of words and phrases in context  Patterns of usage of lexical bundles

26 Sample questions to ask (cont.) With data from different (groups of) students, one can compare similarities or differences among different (groups of) students in terms of  Patterns of usage of words and phrases – frequency, underuse, overuse, etc.  Lexical and syntactic complexity  Appropriate usage of words and phrases in context  Patterns of usage of lexical bundles
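One common way to make "underuse" and "overuse" concrete is to compare normalized frequencies between two sets of texts. The sketch below assumes a per-1,000-token normalization and a simple ratio; the function names and toy texts are hypothetical, and a real study would add a significance test.

```python
import re

def per_thousand(text, word):
    """Occurrences of `word` per 1,000 tokens of a non-empty text."""
    tokens = re.findall(r"\w+", text.lower())
    return 1000 * tokens.count(word) / len(tokens)

def usage_ratio(learner_text, reference_text, word):
    """Learner rate divided by reference rate.

    A ratio above 1 suggests overuse relative to the reference and a ratio
    below 1 suggests underuse. Returns None when the word never occurs in
    the reference, because the ratio is then undefined.
    """
    ref_rate = per_thousand(reference_text, word)
    if ref_rate == 0:
        return None
    return per_thousand(learner_text, word) / ref_rate

learner = "very good very nice very big"
reference = "a very good and nice big day today"
ratio = usage_ratio(learner, reference, "very")
```

Normalizing by text length matters because groups of students rarely produce the same amount of text.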

27 Future enhancements Corpora for benchmarking Multilingual natural language processing Suggestions on desirable functions welcome