Applying some Developments in Corpus Building Technology to Language Teaching and Learning. TALC 2006, Paris
brought to you by … James Thomas, Jan Pomíkalek; Department of Information Technology, Faculty of Informatics, Masaryk University, Brno, Czech Republic
Data Driven Learning: doctoral students of Faculty of Informatics. Faith and skills: needed to ask questions; needed to be able to create queries; needed to believe answers; needed to trust descriptive accounts
What changed … Web-based interface: Bonito became WSE; user friendly; CQL now optional. New features - new results! Word sketches; sketch differences; thesaurus (statistical); frequency distribution (chunks/patterns)
TALC 2004 (2): corpus consultation hampered by students’ limited vocabulary; different tasks needed; concordances need to be sorted. Background: TALC 2002. Readability: average word frequency of each concordance
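A minimal sketch of what sorting concordances by average word frequency could look like. This is not the authors' implementation: the frequency table FREQ, its numbers, and the function names are hypothetical, and treating unknown words as frequency 0 is an assumption made only for illustration.

```python
from statistics import mean

# Hypothetical per-word corpus frequencies (illustrative numbers, not BNC data).
FREQ = {"the": 1000, "a": 800, "is": 700, "result": 50, "method": 40, "robust": 5}


def average_word_frequency(line, freq):
    """Mean frequency of the words in one concordance line.

    Words missing from the table count as 0 (an assumption of this sketch).
    """
    words = line.lower().split()
    if not words:
        return 0.0
    return mean(freq.get(word, 0) for word in words)


def sort_by_readability(lines, freq):
    """Order lines so that those with the most frequent vocabulary come first."""
    return sorted(
        lines,
        key=lambda line: average_word_frequency(line, freq),
        reverse=True,
    )


lines = ["the method is robust", "the result is a result", "robust method"]
ranked = sort_by_readability(lines, FREQ)
for line in ranked:
    print(f"{average_word_frequency(line, FREQ):8.2f}  {line}")
```

Lines made up of common words rise to the top, so students with limited vocabulary would meet the easier examples first.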
Addressing issues of faith and skills: classroom use of concordance printouts; activities set for corpus use; worksheets including instructions; website of sample searches; Moodle’s glossary module
Addressing Problem 1 (cont): lack of faith in general corpus use (3); students find the results convincing; error correction of each other’s written work. Feedback from students: qualitative feedback only. See abstract.
Success created problem #2 BNC not “computer savvy”
BNC - limited application. Dated: 94% of texts from 1985 to 1993; modern technology not accounted for; technical vocabulary missing. Differences between word usage: higher frequency of academic vocabulary not represented (Coxhead), e.g. robust. Solution: revisit an old idea …
TALC 2004: each dept at FI MU was invited to contribute academic papers to a new Informatics Corpus. Metatag sections to serve as models for own writing; language differences between introductions, methodology, conclusions
Ran aground: 1. demand for metadata: too fine-grained, too labour-intensive; few could see the point and were unable to give priority to it. 2. convoluted uploading interface; no Windows version ???; time-consuming procedure for uploading
Addressing this Problem: much improved interface (“Build Corp”, “Corpus Builder”); configurable metadata list; corpus configuration; POS tagging, lemmatization; other transformations can be incorporated, e.g., HTML text. Notes on Corpus Builder: oad.htm
Solutions (3): the time demanded of the individuals. Interface for converting PDFs; save set in folder; upload quickly; metalanguage (ACM). DEMO
Much improved interface: building word sketches; statistical thesaurus; user accounts management; more user-friendly
Enter the Informatics Corpus. Currently contains. Uses to date: illustrative sentences; some worksheets of subjunctive; etc.
What the future holds. Language acquisition: consulting resources doesn’t necessarily lead to retention; log lookups, converted automatically into interactive revision activities. Researching the effectiveness of DDL
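One way the planned conversion of logged lookups into revision activities might work is sketched below, as gap-fill items. The log format, the sample sentences, and the function name make_gap_fill are all hypothetical; the slides do not say which kind of activity the authors have in mind.

```python
import re


def make_gap_fill(lookup_log, sentences):
    """Build one gap-fill item per logged word that occurs in some sentence.

    lookup_log: words a student looked up (hypothetical log format).
    sentences:  candidate illustrative sentences, e.g. taken from a corpus.
    """
    items = []
    # dict.fromkeys removes repeated lookups while keeping first-seen order.
    for word in dict.fromkeys(lookup_log):
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        for sentence in sentences:
            if pattern.search(sentence):
                items.append(
                    {"prompt": pattern.sub("_____", sentence), "answer": word}
                )
                break  # one item per word is enough for this sketch
    return items


log = ["robust", "query", "robust"]
sentences = [
    "The parser is robust to noisy input.",
    "Each query returns a concordance.",
]
items = make_gap_fill(log, sentences)
for item in items:
    print(item["prompt"], "->", item["answer"])
```

Each looked-up word comes back to the student once, in context, with the word itself blanked out.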