Corpus Creation for Lexicography
  Adam Kilgarriff, Michael Rundell
  Lexicography MasterClass, UK
  Elaine Ui Dhonnchadha
  ITE (Linguistics Institute of Ireland)
Kilgarriff: Asialex June
Tasks
  Design
  Collection
  Encoding
The project
  A New English-Irish Dictionary
  Authoritative, general purpose
  Academics, translators, students, secretaries
  One year ‘set-up’ phase
  Limited time, limited budget
  Many tasks, including corpus development
  Irish and UK Government funded
  Lead contractor: LexMasterClass
  Subcontractor: ITE
Languages
  English
  Irish
The Irish language
  A Celtic language
  Long literary tradition
  Irish-Latin dictionary from 9th century
  Main language of Ireland until English took over (British imperialist policies)
  62,000 speakers as main language
  Gaeltacht: Irish-speaking areas
  Three dialects
Gaeltacht areas
Design: English
  Source language for NEID
  Very large resource wanted
    E.g. for word sketches, see Friday talk
  Three language varieties
    Irish (Hiberno-English)
    British
    American
American
  100M words
  Journalistic text available
British
  100M words
  British National Corpus (BNC)
  Model balanced corpus
    Spoken conversation (10%)
    Books, newspapers, magazines
    Popular, academic, technical
Hiberno-English
  25M words
  Goal: balanced like BNC, except
    No budget for spoken corpus collection
    New category: web
  Dates: since independence (1922)
  Emphasis on current language
Design: Irish
  30M words
  Starting point: BNC-like
  Native speakers
    Native speakers' language “better”
    Many texts written by non-native speakers
    Record status where possible
    Newspapers, websites: no info available
  Dialect
    Record where possible
“High quality Irish”
  Smaller than 150 years ago
  Many documents are translations
  Learners’ errors, inelegant prose
  Samuel Johnson: “writers of the first reputation”
Con
  Who judges?
  Risk of literary or backward-looking bias
  Lexicographers need the corpus to translate “boot the computer” as well as “the babbling brook”
  Trench and the OED: “an historian, not a critic”
  Will a quality filter limit corpus breadth (and size)?
Quality: outcome
  Wide range of text types wanted
  Particular effort to gather native speaker non-translations
  Period for corpus: 1883-present
    Most earlier texts: literary
    Most text types: usually recent
Words: actual

Text category          Irish         Hiberno-English
Books - imaginative    7,600,000     6,000,000
Books - informative    8,400,000     7,000,000
Newspapers             4,500,000     5,300,000
Periodicals            2,600,000     700,000
Official/Govt          1,200,000     1,000,000
Broadcast              400,000       0
Websites               5,500,000     5,000,000
TOTALS                 30,200,000    25,000,000
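As a quick arithmetic check on the figures on this slide, the per-category word counts can be summed and compared with the stated totals. This is only an illustrative sketch: the dictionary names and the `total` helper are made up for the example and are not part of the project's tooling.

```python
# Word counts per text category, copied from the slide ("Words: actual").
irish = {
    "Books - imaginative": 7_600_000,
    "Books - informative": 8_400_000,
    "Newspapers": 4_500_000,
    "Periodicals": 2_600_000,
    "Official/Govt": 1_200_000,
    "Broadcast": 400_000,
    "Websites": 5_500_000,
}
hiberno_english = {
    "Books - imaginative": 6_000_000,
    "Books - informative": 7_000_000,
    "Newspapers": 5_300_000,
    "Periodicals": 700_000,
    "Official/Govt": 1_000_000,
    "Broadcast": 0,
    "Websites": 5_000_000,
}


def total(counts):
    """Sum the word counts over all text categories."""
    return sum(counts.values())


print(f"Irish: {total(irish):,}")                      # Irish: 30,200,000
print(f"Hiberno-English: {total(hiberno_english):,}")  # Hiberno-English: 25,000,000
```

Both sums agree with the TOTALS row.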
Collection
  Use existing
  Ask publishers
  Web
Use existing
  Irish: PAROLE corpus (8M words, ITE)
  English
    British: BNC
    American: LDC Gigaword – wds journalism
    Limerick Corpus of Spoken English
    Northern Ireland Corpus of Transcribed Speech
Ask publishers
  The junkmail problem
  Appeals to national pride
  Charm and persistence
  Team member who knows them all
Web
  Fast becoming the usual place to look
    Kilgarriff and Grefenstette, CL 2003
  Preliminary experiments
    At least 15M words of Irish out there
  Hiberno-English
    English as found on sites where Irish was found
Web issues
  Formats
    Conversion from pdf etc needed
  Character representation
    Not many pages “do the right thing”
  Navigational material: “click here”
  Lists
  Mixed languages
  Duplication
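Two of the web issues listed on this slide, navigational material and duplication, can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the phrase list, the `max_words` threshold and all function names are assumptions made for illustration, and only exact duplicates (after whitespace and case normalisation) are caught.

```python
import hashlib
import re

# Hypothetical list of navigational phrases; a real list would be much longer.
NAV_PHRASES = ("click here", "back to top", "next page")


def strip_navigation(text, max_words=4):
    """Drop short lines that contain a navigational phrase, and blank lines."""
    kept = []
    for line in text.splitlines():
        lowered = line.strip().lower()
        is_short = len(lowered.split()) <= max_words
        is_nav = is_short and any(phrase in lowered for phrase in NAV_PHRASES)
        if lowered and not is_nav:
            kept.append(line.strip())
    return "\n".join(kept)


def fingerprint(text):
    """Hash the text after collapsing whitespace and lowercasing."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Keep the first copy of each page, comparing by fingerprint."""
    seen = set()
    unique = []
    for page in pages:
        key = fingerprint(page)
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique


pages = [
    "Click here\nDia duit, a chara.",
    "click here\nDia  duit, a chara.",   # same page, different spacing
    "Go raibh maith agat.",
]
cleaned = deduplicate([strip_navigation(page) for page in pages])
print(len(cleaned))  # 2
```

Near-duplicates (the same article with a different banner or date stamp) would need a fuzzier comparison than an exact hash.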
Text category          Irish: actual   Irish: target   Hiberno-English: actual   Hiberno-English: target
Books - imaginative    7,600,000       9,000,000       6,000,000                 7,500,000
Books - informative    8,400,000       6,000,000       7,000,000                 5,000,000
Newspapers             4,500,000                       5,300,000                 3,750,000
Periodicals            2,600,000       2,500,000       700,000                   2,250,000
Official/Govt          1,200,000       1,500,000       1,000,000
Broadcast              400,000         1,000, ,000
Websites               5,500,000                       5,000,000                 4,750,000
TOTALS                 30,200,000      30,000,000      25,000,000
Encoding
  Clean-up
  Linguistic processing
  Delivery formalism
Clean-up
  Deletion of:
    Title pages, table of contents, tables, figures, footnotes, endnotes, page headers and footers, crosswords, TV listings, sports results, team listings …
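One item on the clean-up list, page headers and footers, lends itself to a simple heuristic: a line that recurs on a large share of a document's pages is probably a running header or footer. The sketch below assumes the text has already been split into pages; the function name and the `min_share` threshold are hypothetical choices, not the method actually used in the project.

```python
from collections import Counter


def remove_repeated_lines(pages, min_share=0.5):
    """Remove lines that occur on at least `min_share` of the pages.

    `pages` is a list of pages, each page a list of text lines.
    """
    page_count = len(pages)
    if page_count < 2:
        return [list(page) for page in pages]

    # Count, for each distinct non-blank line, how many pages it appears on.
    counts = Counter()
    for page in pages:
        counts.update({line.strip() for line in page if line.strip()})

    repeated = {line for line, n in counts.items() if n / page_count >= min_share}
    return [[line for line in page if line.strip() not in repeated] for page in pages]


pages = [
    ["Sample Gazette", "Story one.", "Page 1"],
    ["Sample Gazette", "Story two.", "Page 2"],
    ["Sample Gazette", "Story three.", "Page 3"],
]
for page in remove_repeated_lines(pages):
    print(page)
```

The running title is removed, but the page numbers survive because each one is different; items such as crosswords or TV listings would need their own, more specific rules.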
Linguistic processing
  Lemmatize
    give giving gives given gave => give (verb)
  Part-of-speech tagging
    bank (verb) or bank (noun)?
  English: existing tools used
  Irish: tools developed from scratch
    Elaine Ui Dhonnchadha: thesis work
    Finite state methods, constraint grammar
    Separate talk
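The two examples on this slide can be made concrete with a toy Python sketch. It is only an illustration: the lookup table and the single context rule are hypothetical, and they stand in for, rather than reproduce, the existing English tools and the finite-state and constraint grammar tools developed for Irish.

```python
# Toy lemma lookup covering only the slide's example.
LEMMAS = {
    "give": "give",
    "giving": "give",
    "gives": "give",
    "given": "give",
    "gave": "give",
}

# Toy lexicon of ambiguous words and their possible parts of speech.
READINGS = {"bank": {"noun", "verb"}}
DETERMINERS = {"the", "a", "an", "this", "that"}


def lemmatize(token):
    """Map an inflected form to its lemma; unknown words map to themselves."""
    word = token.lower()
    return LEMMAS.get(word, word)


def tag(tokens):
    """Start from all readings of a word and discard those the context rules out."""
    tagged = []
    for i, token in enumerate(tokens):
        readings = set(READINGS.get(token.lower(), {"unknown"}))
        previous = tokens[i - 1].lower() if i > 0 else ""
        if len(readings) > 1:
            if previous in DETERMINERS:
                readings.discard("verb")   # "the bank": cannot be a verb
            elif previous == "to":
                readings.discard("noun")   # "to bank": treated as infinitive
        tagged.append((token, sorted(readings)))
    return tagged


print([lemmatize(w) for w in "give giving gives given gave".split()])
print(tag(["the", "bank"]))
print(tag(["to", "bank"]))
```

Discarding impossible readings in context, rather than picking one reading outright, is loosely in the spirit of constraint grammar, although real rule sets are far larger and more careful.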
Delivery formalism
  Both
    XML Corpus Encoding Standard (XCES)
      For longevity, interchange format
  And
    Loaded into Word Sketch Engine
      Corpus query tool optimised for lexicography, linguistic research
      Good for searching on grammar, text type etc
      Friday talk
Conclusion
  Large corpora for high-quality lexicography
  Developed in one year, modest budget
  Design, collection and encoding
  Delivered in a convenient form for the lexicographer
Thank you