Corpus Creation for Lexicography
Adam Kilgarriff, Michael Rundell (Lexicography MasterClass, UK)
Elaine Ui Dhonnchadha (ITE, Linguistics Institute of Ireland)

Kilgarriff: Asialex June

Tasks
- Design
- Collection
- Encoding

The project
- A New English-Irish Dictionary
  - Authoritative, general purpose
  - Academics, translators, students, secretaries
- One-year ‘set-up’ phase
  - Limited time, limited budget
  - Many tasks, including corpus development
- Irish and UK Government funded
- Lead contractor: LexMasterClass
- Subcontractor: ITE

Languages
- English
- Irish

The Irish language
- A Celtic language
- Long literary tradition
  - Irish-Latin dictionary from 9th century
- Main language of Ireland until
  - English took over (British imperialist policies)
- 62,000 speakers as main language
- Gaeltacht: Irish-speaking areas
- Three dialects

Gaeltacht areas

Design: English
- Source language for NEID
  - Very large resource wanted
    - E.g. for word sketches, see Friday talk
- Three language varieties
  - Irish (Hiberno-English)
  - British
  - American

- American
  - 100M words
  - Journalistic text available
- British
  - 100M words
  - British National Corpus (BNC)
    - Model balanced corpus
    - Spoken conversation (10%)
    - Books, newspapers, magazines
    - Popular, academic, technical

Hiberno-English
- 25M words
- Goal: balanced like BNC, except
  - No budget for spoken corpus collection
  - New category: web
  - Dates: since independence (1922)
    - Emphasis on current language

Design: Irish
- 30M words
- Starting point: BNC-like
- Native speakers
  - Native speakers' language “better”
  - Many texts written by non-native speakers
  - Record status where possible
    - Newspapers, websites: no info available
- Dialect
  - Record where possible

“High quality Irish”
- Smaller than 150 years ago
- Many documents are translations
- Learners’ errors, inelegant prose
- Samuel Johnson: “writers of the first reputation”
- Con
  - Who judges?
  - Risk of literary or backward-looking bias
    - Lexicographers need a corpus to translate “boot the computer” as well as “the babbling brook”
  - Trench and the OED: “an historian, not a critic”
  - Will a quality filter limit corpus breadth (and size)?

Quality: outcome
- Wide range of text types wanted
- Particular effort to gather native speaker non-translations
- Period for corpus: 1883-present
  - Most earlier texts: literary
  - Most text types: usually recent

Words: actual

Text category       | Irish      | Hiberno-English
Books - imaginative | 7,600,000  | 6,000,000
Books - informative | 8,400,000  | 7,000,000
Newspapers          | 4,500,000  | 5,300,000
Periodicals         | 2,600,000  | 700,000
Official/Govt       | 1,200,000  | 1,000,000
Broadcast           | 400,000    | 0
Websites            | 5,500,000  | 5,000,000
TOTALS              | 30,200,000 | 25,000,000
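
The totals and the share of each text category can be checked with a few lines of Python. This is only a sketch: the word counts are copied from the table above, while the names `ACTUAL_WORDS`, `column_total` and `category_shares` are made up for illustration.

```python
# Word counts copied from the table above: (Irish, Hiberno-English).
ACTUAL_WORDS = {
    "Books - imaginative": (7_600_000, 6_000_000),
    "Books - informative": (8_400_000, 7_000_000),
    "Newspapers": (4_500_000, 5_300_000),
    "Periodicals": (2_600_000, 700_000),
    "Official/Govt": (1_200_000, 1_000_000),
    "Broadcast": (400_000, 0),
    "Websites": (5_500_000, 5_000_000),
}


def column_total(column):
    """Sum one column of the table (0 = Irish, 1 = Hiberno-English)."""
    return sum(counts[column] for counts in ACTUAL_WORDS.values())


def category_shares(column):
    """Return each category's share of its corpus as a percentage."""
    total = column_total(column)
    return {
        category: round(100 * counts[column] / total, 1)
        for category, counts in ACTUAL_WORDS.items()
    }


if __name__ == "__main__":
    for name, column in (("Irish", 0), ("Hiberno-English", 1)):
        print(name, column_total(column))
        for category, share in category_shares(column).items():
            print(f"  {category}: {share}%")
```

Running the script prints the two column totals followed by the percentage breakdown, which makes it easy to compare the balance of the two corpora.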

Collection
- Use existing
- Ask publishers
- Web

Use existing
- Irish: PAROLE corpus (8M words, ITE)
- English
  - British: BNC
  - American: LDC Gigaword – wds journalism
  - Limerick Corpus of Spoken English
  - Northern Ireland Corpus of Transcribed Speech

Ask publishers
- The junkmail problem
- Appeals to national pride
- Charm and persistence
- Team member who knows them all

Web
- Fast becoming the usual place to look
  - Kilgarriff and Grefenstette, CL 2003
- Preliminary experiments
  - At least 15M words of Irish out there
- Hiberno-English
  - English as found on sites where Irish was found

Web issues
- Formats
  - Conversion from pdf etc. needed
- Character representation
  - Not many pages “do the right thing”
- Navigational material: “click here”
- Lists
- Mixed languages
- Duplication
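
Two of these issues, navigational material and duplication, lend themselves to a short illustration. The sketch below is not the project's pipeline: the phrase list and all function names are assumptions, and it only catches exact duplicates (after case and whitespace normalisation), not near-duplicates.

```python
import hashlib
import re

# Hypothetical list of navigational phrases; a real project would tune this.
NAVIGATION_PHRASES = ("click here", "home", "next page", "back to top")


def strip_navigation(lines):
    """Drop lines that consist only of a known navigational phrase."""
    kept = []
    for line in lines:
        if line.strip().lower() in NAVIGATION_PHRASES:
            continue
        kept.append(line)
    return kept


def fingerprint(text):
    """Hash the text after lower-casing and collapsing whitespace."""
    normalised = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def remove_duplicates(documents):
    """Keep the first copy of each document, comparing by fingerprint."""
    seen = set()
    unique = []
    for document in documents:
        key = fingerprint(document)
        if key not in seen:
            seen.add(key)
            unique.append(document)
    return unique
```

Comparing fingerprints rather than raw strings means that copies differing only in spacing or capitalisation count as the same document.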

Text category       | Irish: actual | Irish: target | Hiberno-English: actual | Hiberno-English: target
Books - imaginative | 7,600,000     | 9,000,000     | 6,000,000               | 7,500,000
Books - informative | 8,400,000     | 6,000,000     | 7,000,000               | 5,000,000
Newspapers          | 4,500,000     |               | 5,300,000               | 3,750,000
Periodicals         | 2,600,000     | 2,500,000     | 700,000                 | 2,250,000
Official/Govt       | 1,200,000     | 1,500,000     | 1,000,000               |
Broadcast           | 400,000       | 1,000,…       | 0                       | …,000
Websites            | 5,500,000     |               | 5,000,000               | 4,750,000
TOTALS              | 30,200,000    | 30,000,000    | 25,000,000              | 25,000,000

Encoding
- Clean-up
- Linguistic processing
- Delivery formalism

Clean-up
- Deletion of:
  - Title pages, table of contents, tables, figures, footnotes, endnotes, page headers and footers, crosswords, TV listings, sports results, team listings …
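
Page headers and footers are among the easier items to detect automatically, because the same line recurs from page to page. The following is a minimal sketch of that idea, assuming the text is already split into pages and lines; the function name `drop_running_lines` and the `min_fraction` threshold are illustrative choices, not something described in the talk.

```python
from collections import Counter


def drop_running_lines(pages, min_fraction=0.5):
    """Remove lines that recur on at least min_fraction of the pages.

    pages is a list of pages, each page a list of text lines.
    """
    # With fewer than two pages nothing can be called "repeated".
    if len(pages) < 2:
        return [list(page) for page in pages]

    # Count each distinct non-blank line once per page.
    page_counts = Counter()
    for page in pages:
        page_counts.update({line.strip() for line in page if line.strip()})

    threshold = min_fraction * len(pages)
    running = {line for line, count in page_counts.items() if count >= threshold}
    return [[line for line in page if line.strip() not in running] for page in pages]
```

A threshold that is too low would also delete genuinely repeated content, such as a refrain, so the value needs checking against real data.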

Linguistic processing
- Lemmatize
  - give giving gives given gave => give (verb)
- Part-of-speech tagging
  - bank (verb) or bank (noun)?
- English: existing tools used
- Irish: tools developed from scratch
  - Elaine Ui Dhonnchadha: thesis work
  - Finite state methods, constraint grammar
  - Separate talk
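
To make the two steps concrete, here is a toy illustration using a hand-written lookup table and one crude context rule. It is not the method used in the project (existing tools for English; finite state methods and constraint grammar for Irish). The table entries, word lists and function names are all assumptions.

```python
# Toy lookup table: (word form, part of speech) -> lemma.
# Entries are illustrative only; a real lexicon would be far larger.
LEMMAS = {
    ("give", "verb"): "give",
    ("giving", "verb"): "give",
    ("gives", "verb"): "give",
    ("given", "verb"): "give",
    ("gave", "verb"): "give",
    ("bank", "noun"): "bank",
    ("banks", "noun"): "bank",
    ("bank", "verb"): "bank",
    ("banked", "verb"): "bank",
}

DETERMINERS = {"the", "a", "an", "this", "that"}
VERB_CUES = {"i", "we", "you", "they", "to"}


def possible_tags(word):
    """List every part of speech the lookup table allows for this form."""
    return sorted(pos for (form, pos) in LEMMAS if form == word)


def choose_tag(previous_word, word):
    """Pick one tag, using the previous word when the form is ambiguous."""
    tags = possible_tags(word)
    if not tags:
        return "unknown"
    if len(tags) == 1:
        return tags[0]
    if previous_word in DETERMINERS and "noun" in tags:
        return "noun"
    if previous_word in VERB_CUES and "verb" in tags:
        return "verb"
    return tags[0]  # arbitrary fallback when no rule applies


def analyse(tokens):
    """Return (token, lemma, part of speech) for each token."""
    result = []
    previous = None
    for token in tokens:
        word = token.lower()
        pos = choose_tag(previous, word)
        lemma = LEMMAS.get((word, pos), word)
        result.append((token, lemma, pos))
        previous = word
    return result
```

For example, `analyse(["We", "gave", "the", "bank"])` tags "bank" as a noun because it follows a determiner, while `analyse(["They", "bank", "online"])` tags it as a verb.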

Delivery formalism
- Both
  - XML Corpus Encoding Standards (XCES)
  - For longevity, interchange format
- And
  - Loaded into Word Sketch Engine
  - Corpus query tool optimised for lexicography, linguistic research
  - Good for searching on grammar, text type etc.
    - Friday talk
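
As a rough picture of what an XML interchange format for a lemmatized, tagged corpus can look like, the sketch below writes one sentence with Python's standard `xml.etree.ElementTree` module. The element names `s` and `w` and the attributes `lemma` and `pos` are assumptions chosen for illustration; they are not taken from the XCES specification.

```python
import xml.etree.ElementTree as ET


def encode_sentence(tokens):
    """Build an <s> element holding one <w> element per token.

    tokens is a list of (word form, lemma, part of speech) triples.
    Element and attribute names are illustrative, not taken from XCES.
    """
    sentence = ET.Element("s")
    for form, lemma, pos in tokens:
        word = ET.SubElement(sentence, "w", lemma=lemma, pos=pos)
        word.text = form
    return sentence


if __name__ == "__main__":
    element = encode_sentence([("gave", "give", "verb"), ("banks", "bank", "noun")])
    print(ET.tostring(element, encoding="unicode"))
```

Keeping the annotation in attributes leaves the running text recoverable by reading the element content alone, which suits an archival format.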

Conclusion
- Large corpora for high-quality lexicography
- Developed in one year, modest budget
- Design, collection and encoding
- Delivered in a convenient form for the lexicographer

Thank you