GDEX: Automatically finding good dictionary examples in a corpus (Kilgarriff, Auckland 2012)


Users appreciate examples
- Paper: space constraints
- Electronic: no space constraints
  - Give lots of examples
  - Constraint: cost of selection, editing

Project
- Macmillan English Dictionary
  - Already had 1000 collocation boxes
  - Average 8 per box
- New electronic version
  - All 8000 collocations need examples
  - Authentic; from corpus

Old method
- Lexicographer
  - Gets concordance for collocation
  - Reads through until they find a good example
  - Cut, paste, edit

New method
- Lexicographer
  - Gets sorted concordance
    - 20 best examples in spreadsheet
  - Less reading through
  - Tick the first good one, edit

What makes a good example?
- Readable
  - EFL users
- Informative
  - Typical, for the collocation
  - Gives context which helps user understand the target word/phrase

Readability
- 70 years of research
- Not just (or mainly) EFL
  - Educational theory
    - Teaching children to read
  - Instruction manuals
    - Early work: US military
  - Publishing
    - People like newspapers and magazines that they find easy to read

Readability tests
- Flesch Reading Ease test (1948)
  - Average sentence length, average word length
  - In some word processing software
- Many similar measures
- Recent work
  - Training data for different reading levels
  - Language modelling
- Target levels
  - US grades
  - Now, increasingly: Common European Framework
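
As a concrete instance of the Flesch test named above, the sketch below computes the standard Reading Ease score. The slide only says "average sentence length, average word length"; in the standard formula, word length is measured in syllables per word. The function name and the choice to pass pre-computed counts (rather than counting syllables here) are assumptions made for illustration.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    avg_sentence_length = n_words / n_sentences  # words per sentence
    avg_word_length = n_syllables / n_words      # syllables per word
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_word_length


# A 100-word text with 5 sentences and 150 syllables.
score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
print(round(score, 3))
```

Longer sentences and longer words both lower the score, which is why the two averages are the only inputs.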

GDEX
- Get concordance for collocation
- For each sentence
  - Score it
- Sort
- Show best ones to lexicographer
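
The score, sort, show loop can be sketched in a few lines. This is a minimal illustration, not the GDEX implementation: `rank_examples` and `toy_score` are hypothetical names, the scorer is a placeholder, and the default of 20 simply mirrors the "20 best examples" mentioned for the new method.

```python
from typing import Callable, List, Tuple


def rank_examples(sentences: List[str],
                  score: Callable[[str], float],
                  top_n: int = 20) -> List[Tuple[float, str]]:
    """Score every concordance sentence, sort best first, keep the top_n."""
    scored = [(score(s), s) for s in sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


def toy_score(sentence: str) -> float:
    """Placeholder scorer (an assumption): prefers sentences near 18 words."""
    return -abs(len(sentence.split()) - 18)


concordance = [
    "a b c",
    " ".join(["word"] * 18),
    " ".join(["word"] * 10),
]
for value, sentence in rank_examples(concordance, toy_score, top_n=2):
    print(value, sentence)
```

Keeping the scorer as a parameter separates the ranking step from the question of what makes an example good.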

GDEX heuristics
- Sentence length (10-26 words)
- Mostly common words is good
- Rare words are bad
- Sentences start with a capital and end with one of . ! ?
- No [, ], http, \
- Not much other punctuation, numbers
- Not too many capitals
- Typicality: third collocate is a plus
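
Several of these heuristics are simple surface checks. The sketch below shows how a few of them might look; the function names, the whitespace tokenisation, and the idea of passing in a `common_words` set are all assumptions for illustration, and only the thresholds (10-26 words, the sentence-final characters, the banned strings that survive in the slide) come from the list above.

```python
import re
from typing import Set

BANNED = ("[", "]", "http", "\\")  # banned strings recoverable from the slide


def length_ok(sentence: str, lo: int = 10, hi: int = 26) -> bool:
    """True if the sentence has between lo and hi whitespace-separated words."""
    return lo <= len(sentence.split()) <= hi


def well_formed(sentence: str) -> bool:
    """True if it starts with a capital letter and ends with . ! or ?"""
    s = sentence.strip()
    return bool(s) and s[0].isupper() and s[-1] in ".!?"


def has_banned(sentence: str) -> bool:
    """True if any banned string occurs anywhere in the sentence."""
    return any(token in sentence for token in BANNED)


def common_word_ratio(sentence: str, common_words: Set[str]) -> float:
    """Fraction of words found in a (hypothetical) list of common words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    return sum(w in common_words for w in words) / len(words)


example = ("The committee reached a unanimous decision after a long "
           "and difficult debate yesterday.")
print(length_ok(example), well_formed(example), has_banned(example))
```

Each check returns a value that a later weighting step can combine, rather than rejecting sentences outright.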

Weighting
- For each sentence
  - Score on each heuristic
  - Weight scores
  - Add together weighted scores
- How to set weights?
  - Two students
    - Manually judged 1000 "good examples"
  - Weights set so system makes same choices as students
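
The weighted combination amounts to a linear sum over heuristic scores. In the sketch below the three heuristics and, in particular, the weight values are invented for illustration; in the project described here the weights were set to reproduce the students' judgements, and those values are not given in the slides.

```python
from typing import Callable, Dict

Heuristic = Callable[[str], float]

# Illustrative heuristics, each returning 0.0 or 1.0 (assumed, not from GDEX).
HEURISTICS: Dict[str, Heuristic] = {
    "length": lambda s: 1.0 if 10 <= len(s.split()) <= 26 else 0.0,
    "ends_well": lambda s: 1.0 if s.strip().endswith((".", "!", "?")) else 0.0,
    "starts_capital": lambda s: 1.0 if s[:1].isupper() else 0.0,
}

# Made-up weights; real weights would be fitted to human judgements.
WEIGHTS: Dict[str, float] = {
    "length": 0.5,
    "ends_well": 0.25,
    "starts_capital": 0.25,
}


def weighted_score(sentence: str,
                   heuristics: Dict[str, Heuristic],
                   weights: Dict[str, float]) -> float:
    """Score the sentence on each heuristic, weight, and add together."""
    return sum(weights[name] * h(sentence) for name, h in heuristics.items())


good = "She made a strong case for reform during the long debate last night."
bad = "see http"
print(weighted_score(good, HEURISTICS, WEIGHTS))
print(weighted_score(bad, HEURISTICS, WEIGHTS))
```

With a linear sum, changing one weight shifts the ranking in a predictable way, which makes it practical to tune the weights against a set of judged examples.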

Was it successful?
- Did it save lexicographer time?
  - Definitely (says project manager)
- Rough guess
  - Average number of corpus lines to read until you find a good one:
    - Unsorted: 20
    - Sorted: 5

Corpus choice
- Started with BNC, but
  - Too old
  - Not enough examples
- If no good examples in corpus, GDEX can't help
- Changed to ukWaC
  - 20 times bigger; from web; contemporary
  - Better
  - Most web junk filtered out
  - Usually a good example in top twenty

GDEX and TALC
- TALC (Teaching and Language Corpora)
- Goal: bring corpora into language teaching
- Usual problem
  - Concordances are tough for learners to read
- Way forward
  - GDEX examples
  - Half way between dictionary and corpus

GDEX: Models for use
- More examples for dictionaries
  - Speed up, as with MED, or
  - Fully automatic "more examples"
- Corpus query tool
  - Sort concordances, best first
  - Now an option in the Sketch Engine
- Automatic collocations dictionary

Recent developments
- Configurable GDEX
  - For other languages
  - Interface to help set up
- Commonest string
  - Between 'bare collocate' and example
