Learner corpus analysis and error annotation Xiaofei Lu CALPER 2010 Summer Workshop July 13, 2010.

Overview
- Analyzing raw corpora
- Error annotation
  - Issues in corpus annotation
  - Granger (2003)

Analyzing raw corpora
- Concordancing software
  - GOLD
  - AntConc
- Other software
  - CLAN
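To make the idea of concordancing concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not how GOLD, AntConc, or CLAN work internally; the function name `kwic`, the `width` parameter, and the sample sentence are all assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines

# Invented learner-style sample text (hypothetical, not from any corpus).
sample = "I have went to school yesterday. She have a book. They have gone home."

for line in kwic(sample, "have", width=20):
    print(line)
```

Aligning every hit in one column is what makes recurring patterns around a word easy to spot by eye.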

Issues in corpus annotation
- Annotation scheme and format
- Annotation procedure
- Annotation quality

Annotation scheme and format
- What are the categories you are using?
  - Linguistically consensual
  - Overspecification vs. underspecification
  - Use short, meaningful codes for your categories
- Annotation format considerations
  - Compatible with annotation scheme
  - Facilitates corpus query

Annotation procedure and quality
- Annotator training
  - Scheme and format
  - Problematic cases and disagreements
- Computer-assisted manual annotation
  - Stanford annotation tool
  - UAM Corpus Tool and NoteTab
- Inter-annotator agreement
  - Cohen’s Kappa
  - Online Kappa calculator
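As a rough illustration of what a Kappa calculator computes, the sketch below implements Cohen's Kappa for two annotators who each assign one label per item: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name and the label codes ("G", "L", "F") are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Observed agreement: proportion of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of the two
    # annotators' marginal proportions for that label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical error-domain codes assigned to ten items by two annotators.
annotator_a = ["G", "G", "L", "F", "G", "L", "F", "F", "G", "L"]
annotator_b = ["G", "L", "L", "F", "G", "L", "F", "G", "G", "L"]

print(round(cohens_kappa(annotator_a, annotator_b), 3))  # 0.697
```

Here the annotators agree on 8 of 10 items (p_o = 0.8) and chance agreement is 0.34, so Kappa comes out lower than the raw agreement rate.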

Granger (2003)
- Learner corpora
- Error annotation
- Error statistics and analysis
- Integration of results into CALL
- Conclusion

Learner corpora
- What is a learner corpus?
- Difference from traditional data in SLA
- Difference from native language data
  - Frequencies
  - Errors
- From error annotation to error detection

Computer-aided error annotation
- Dagneaux, Denness and Granger (1998)
  - Manual correction of L2 French corpus
  - Elaboration of an error tagging system
  - Insertion of error tags and corrections
  - Retrieval of lists of error types and statistics
  - Concordance-based error analysis
- Tagging system
  - Informative but manageable
  - Reusable, flexible, consistent

Error tagging system
- Dulay, Burt & Krashen (1982)
  - System based on linguistic categories (e.g., syntax)
  - Surface structure alternations (e.g., omission)
- Granger’s (2003) three-dimensional taxonomy
  - Error domain
  - Error category
  - Word category

Error tagging system (cont.)
- Error domain and category
  - General level: grammatical, lexical, etc.
  - Domains subdivided into error categories
  - Table 1, page 468
- Word category
  - A POS tagset with 11 major and 54 sub-categories
  - Makes it possible to sort errors by POS categories

Error tagging system (cont.)
- Correct forms inserted next to erroneous forms
  - Facilitates interpretation of error annotations
  - Allows for automatic sorting on correct forms
- Tag insertion using a menu-driven editor
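The sketch below shows why storing the correct form next to the erroneous form is useful: errors can be extracted, sorted on their corrections, and the text can be normalized automatically. The inline format `[DOMAIN|CATEGORY|POS: erroneous -> correction]` and the codes used are hypothetical, invented for this example; they are not Granger's (2003) actual tag syntax.

```python
import re
from typing import NamedTuple

class ErrorTag(NamedTuple):
    domain: str       # error domain, e.g. grammar or form
    category: str     # error category within the domain
    word_class: str   # word (POS) category of the erroneous item
    erroneous: str
    correction: str

# Hypothetical inline format: [DOMAIN|CATEGORY|POS: erroneous -> correction]
TAG_PATTERN = re.compile(
    r"\[(?P<domain>\w+)\|(?P<category>\w+)\|(?P<word_class>\w+):\s*"
    r"(?P<erroneous>[^\]]*?)\s*->\s*(?P<correction>[^\]]*?)\s*\]"
)

def extract_errors(text):
    """Return every annotated error in the text as an ErrorTag."""
    return [ErrorTag(**m.groupdict()) for m in TAG_PATTERN.finditer(text)]

def corrected_text(text):
    """Replace each annotation with its correction."""
    return TAG_PATTERN.sub(lambda m: m.group("correction"), text)

# Invented sample sentence with three annotated errors.
sample = ("Elle habite dans une [G|GEN|ADJ: petit -> petite] maison et elle "
          "[G|ACC|VER: aiment -> aime] le [F|ORT|NOM: jardain -> jardin].")

errors = extract_errors(sample)
for err in sorted(errors, key=lambda e: e.correction):
    print(err.correction, "<-", err.erroneous, f"({err.domain}/{err.category}/{err.word_class})")
print(corrected_text(sample))
```

Sorting on `word_class` instead of `correction` would give the POS-based grouping that the word-category tier makes possible.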

Error statistics and analysis
- Error frequency by domain or (word) category
  - Highest ranked domains: grammar and form
- Error trigrams
- Concordancers for searching error codes
  - AntConc
  - WordSmith Tools
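A small sketch of how such statistics could be computed from an annotated token list. The codes ("G-GEN", "F-ORT", and so on) are hypothetical, and "error trigram" is read here as the erroneous token plus one token of context on each side, which is an assumption rather than Granger's (2003) definition.

```python
from collections import Counter

# Each token is (word, error_code); error_code is None for correct tokens.
# Codes are hypothetical: "DOMAIN-CATEGORY".
tokens = [
    ("elle", None), ("habite", None), ("dans", None), ("une", None),
    ("petit", "G-GEN"), ("maison", None), ("et", None), ("elle", None),
    ("aiment", "G-ACC"), ("le", None), ("jardain", "F-ORT"), (".", None),
]

def domain_frequencies(tokens):
    """Count errors per domain (the part of the code before the hyphen)."""
    return Counter(code.split("-")[0] for _, code in tokens if code)

def category_frequencies(tokens):
    """Count errors per full domain-category code."""
    return Counter(code for _, code in tokens if code)

def error_trigrams(tokens):
    """Return (previous word, erroneous word/code, next word) for each error."""
    trigrams = []
    for i, (word, code) in enumerate(tokens):
        if code is None:
            continue
        prev_word = tokens[i - 1][0] if i > 0 else "<s>"
        next_word = tokens[i + 1][0] if i + 1 < len(tokens) else "</s>"
        trigrams.append((prev_word, f"{word}/{code}", next_word))
    return trigrams

print(domain_frequencies(tokens).most_common())
print(category_frequencies(tokens).most_common())
for trigram in error_trigrams(tokens):
    print(trigram)
```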

Integrating results into CALL
- Goal: a hypermedia CALL program
  - Using NLP and communicative approaches to SLA
  - Traditional and NLP-enabled exercises
  - Automatic error diagnosis and feedback generation
- Error statistics and analysis used to
  - Select linguistic areas to focus on
  - Adapt exercises as a function of attested error types
  - Adapt NLP tools for error diagnosis

Integrating results into CALL (cont.)
- Most error-prone linguistic areas
  - Tense and mood, agreement
  - Articles, complementation, prepositions
- Adapting exercises
  - Exercises reflect type of error-prone context
  - Formal errors through dictation and exercises targeting specific difficulties
  - Attention to punctuation

Integrating results into CALL (cont.)
- Adapting NLP tools for error diagnosis
  - Spell checker and parser
  - Handles orthographic, grammatical, syntactic, and lexical errors
  - Not punctuation, semantic, and tense errors

Granger (2003) summary
- Effective 3-tier error annotation system
  - Limited number of categories per tier
  - Versatile automated data manipulation
- Limitations of error-tagging
  - Element of subjectivity in annotation
  - Focuses on misuse
- Usefulness of error-tagged learner corpus
  - Error statistics help understand learner interlanguage
  - Helps adapt pedagogical materials and programs

Activity
- Using the Stanford annotation tool
  - Annotate a short text using your own scheme, or
  - Annotate a short learner text using Granger’s (2003) scheme
- Query the annotated text using AntConc