TALC Applying some Developments in Corpus Building Technology to Language Teaching and Learning TALC 2006 Paris
James Thomas & Jan Pomikálek, Department of Information Technology, Faculty of Informatics, Masaryk University, Brno, Czech Republic
Data Driven Learning. Doctoral students of the Faculty of Informatics. Training and trust: needed to ask questions; needed to be able to create queries; needed to believe answers; needed to trust descriptive accounts.
TALC 2002. Corpus consultation hampered by students’ limited vocabulary; different tasks needed; concordances need to be sorted. Readability: average word frequency of each concordance. The design of a Lexical Difficulty Filter for language learning on the Internet (pdf).
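The readability idea above, ranking concordance lines by the average frequency of their words, can be sketched roughly as follows. This is a minimal illustration and not the Lexical Difficulty Filter itself: the tokenizer, the function names and the toy reference counts are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def average_word_frequency(line, freq):
    """Mean reference-corpus frequency of the tokens in one concordance line."""
    tokens = tokenize(line)
    if not tokens:
        return 0.0
    return sum(freq.get(token, 0) for token in tokens) / len(tokens)

def sort_by_readability(lines, freq):
    """Order lines from highest to lowest average word frequency,
    so that lines built from common words come first."""
    return sorted(lines, key=lambda line: average_word_frequency(line, freq),
                  reverse=True)

# Toy reference counts (hypothetical data, standing in for a real frequency list).
reference_text = "the cat sat on the mat the dog sat on the rug"
freq = Counter(tokenize(reference_text))

lines = [
    "the ocelot perched on the ottoman",
    "the cat sat on the mat",
]
ranked = sort_by_readability(lines, freq)
```

With a real frequency list in place of the toy counts, the same ordering would put concordance lines made of common vocabulary ahead of lines containing rare words.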
What changed … Web-based interface: Bonito became Word Sketch Engine (WSE); user friendly; CQL now optional (example). New features, new results! (example): word sketches; sketch differences; thesaurus (statistical); frequency distribution (chunks/patterns).
Addressing issues of faith and skills. Worksheets including instructions (example relating to the textbook). Classroom use of concordance printouts (prepositions). Activities set for corpus use (example relating to the textbook). Error correction of each other’s written work.
Addressing Problem 1 (cont). Faith in general corpus use: students find the results convincing and useful. Feedback from students: qualitative feedback only. See abstract. BNC not “computer savvy”.
BNC: limited application. Dated: 94% of texts from 1985 to 1993; modern technology not accounted for; technical vocabulary missing. Differences between word usage: higher frequency of academic vocabulary not represented (Coxhead); see key words list. Solution: revisit an old idea …
TALC 2004. Each department at FI MU was invited to contribute academic papers to a new Informatics Corpus. Metatag sections to serve as models for own writing. Language differences between introductions, methodology, conclusions.
Ran aground. Demand for metadata: too fine-grained; too labour-intensive; few could see the point and were unable to give priority to it. Convoluted uploading interface.
Addressing Problem 2: “Build Corp”, “Corpus Builder”. Configurable metadata list; POS tagging, lemmatization; other transformations can be incorporated, e.g., HTML to text; corpus configuration; building word sketches; compiling statistical thesaurus; user accounts management.
Simplified user’s procedure. Interface for converting PDFs (Abbyy FineReader); save set in folder; upload files; metadata (ACM). Notes provided to users. Demo.
An Informatics Corpus is born. Currently contains: 202 documents; 2,763,259 tokens; 18 ACM categories (over half of the documents in one category).
Uses to date. Key term extraction (here); illustrative sentences: Moodle’s glossary module; words in need of pronunciation attention; some worksheets of adjectives with prepositions; website of sample searches.
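The slides do not say how key terms were extracted from the Informatics Corpus. As one plausible illustration, the sketch below scores each word by a smoothed ratio of its per-million frequency in a focus corpus to that in a reference corpus; the scoring formula, the smoothing constant, the function names and the toy word counts are all assumptions made for the example.

```python
from collections import Counter

def per_million(count, total):
    """Normalise a raw count to a frequency per million tokens."""
    return count * 1_000_000 / total

def keyness(focus_counts, ref_counts, smoothing=1.0):
    """Smoothed frequency ratio for every word in the focus corpus.
    Scores above 1 mean the word is relatively more frequent in the focus corpus."""
    focus_total = sum(focus_counts.values())
    ref_total = sum(ref_counts.values())
    scores = {}
    for word, count in focus_counts.items():
        focus_fpm = per_million(count, focus_total)
        ref_fpm = per_million(ref_counts.get(word, 0), ref_total)
        scores[word] = (focus_fpm + smoothing) / (ref_fpm + smoothing)
    return scores

def top_key_terms(focus_counts, ref_counts, n=3, smoothing=1.0):
    """The n highest-scoring words, ties broken alphabetically."""
    scores = keyness(focus_counts, ref_counts, smoothing)
    return sorted(scores, key=lambda word: (-scores[word], word))[:n]

# Toy corpora (hypothetical data): a specialised text and a general one.
focus = Counter("the parser builds the tree the parser tags tokens".split())
reference = Counter("the dog chased the cat up the tree".split())

scores = keyness(focus, reference)
best = top_key_terms(focus, reference, n=1)
```

Run against a specialised corpus and a general reference corpus such as the BNC, a score of this kind would be expected to surface technical vocabulary while everyday words such as "the" stay near or below 1.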
What the future holds: language acquisition. Consulting resources doesn’t guarantee retention; log corpus consultation, converted automatically into interactive revision activities; researching the effectiveness of DDL.
What the future holds: Corpus Builder. Single-click keyword extraction; automatic conversion from various formats to plain text; POS tagging for LOTE; log user’s use.