1
Corpus Creation for Lexicography
Adam Kilgarriff, Michael Rundell (Lexicography MasterClass, UK)
Elaine Ui Dhonnchadha (ITE, Linguistics Institute of Ireland)
2
Kilgarriff: Asialex June 2005
Tasks
- Design
- Collection
- Encoding
3
The project
- A New English-Irish Dictionary
- Authoritative, general purpose
- Academics, translators, students, secretaries
- One year ‘set-up’ phase
- Limited time, limited budget
- Many tasks, including corpus development
- Irish and UK Government funded
- Lead contractor: LexMasterClass
- Subcontractor: ITE
4
Languages
- English
- Irish
5
The Irish language
- A Celtic language
- Long literary tradition
- Irish-Latin dictionary from 9th century
- Main language of Ireland until 1850-1900
- English took over (British imperialist policies)
- 62,000 speakers as main language
- Gaeltacht: Irish-speaking areas
- Three dialects
6
Gaeltacht areas
7
Design: English
- Source language for NEID
- Very large resource wanted
  - E.g. for word sketches (see Friday talk)
- Three language varieties
  - Irish (Hiberno-English)
  - British
  - American
8
American
- 100M words
- Journalistic text available
British
- 100M words
- British National Corpus (BNC)
- Model balanced corpus
  - Spoken conversation (10%)
  - Books, newspapers, magazines
  - Popular, academic, technical
9
Hiberno-English
- 25M words
- Goal: balanced like BNC, except:
  - No budget for spoken corpus collection
  - New category: web
- Dates: since independence (1922)
- Emphasis on current language
10
Design: Irish
- 30M words
- Starting point: BNC-like
- Native speakers
  - Native speakers' language “better”
  - Many texts written by non-native speakers
  - Record status where possible
  - Newspapers, websites: no info available
- Dialect
  - Record where possible
11
“High quality Irish”
- Smaller than 150 years ago
- Many documents are translations
- Learners’ errors, inelegant prose
- Samuel Johnson: “writers of the first reputation”
Con
- Who judges?
- Risk of literary or backward-looking bias
- Lexicographers need the corpus to translate “boot the computer” as well as “the babbling brook”
- Trench and the OED: “an historian, not a critic”
- Will a quality filter limit corpus breadth (and size)?
12
Quality: outcome
- Wide range of text types wanted
- Particular effort to gather native-speaker non-translations
- Period for corpus: 1883-present
  - Most earlier texts: literary
  - Most text types: usually recent
13
Words: actual

| Text category | Irish | Hiberno-English |
|---|---|---|
| Books - imaginative | 7,600,000 | 6,000,000 |
| Books - informative | 8,400,000 | 7,000,000 |
| Newspapers | 4,500,000 | 5,300,000 |
| Periodicals | 2,600,000 | 700,000 |
| Official/Govt | 1,200,000 | 1,000,000 |
| Broadcast | 400,000 | 0 |
| Websites | 5,500,000 | 5,000,000 |
| TOTALS | 30,200,000 | 25,000,000 |
14
Collection
- Use existing
- Ask publishers
- Web
15
Use existing
- Irish: PAROLE corpus (8M words, ITE)
- English
  - British: BNC
  - American: LDC Gigaword (journalism)
  - Limerick Corpus of Spoken English
  - Northern Ireland Corpus of Transcribed Speech
16
Ask publishers
- The junkmail problem
- Appeals to national pride
- Charm and persistence
- Team member who knows them all
17
Web
- Fast becoming the usual place to look
  - Kilgarriff and Grefenstette, CL 2003
- Preliminary experiments
  - At least 15M words of Irish out there
- Hiberno-English
  - English as found on sites where Irish was found
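Separating Irish pages from English pages found on the same sites could be approximated with a rough function-word count. The sketch below is illustrative only: the word lists, the `guess_language` name and the handling of ties are assumptions, not the method used in the project.

```python
import re

# Small, hand-picked function-word lists (illustrative assumptions only).
IRISH_WORDS = {"agus", "na", "tá", "bhí", "ar", "le", "sé", "atá", "nach"}
ENGLISH_WORDS = {"the", "and", "of", "to", "that", "with", "was", "for"}


def guess_language(text):
    """Return 'irish', 'english' or 'unknown' from function-word counts."""
    tokens = re.findall(r"\w+", text.lower())
    irish = sum(token in IRISH_WORDS for token in tokens)
    english = sum(token in ENGLISH_WORDS for token in tokens)
    if irish == english:
        return "unknown"
    return "irish" if irish > english else "english"
```

A real filter would need far larger word lists, or character n-gram statistics, and a policy for pages that mix both languages.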
18
Web issues
- Formats
  - Conversion from PDF etc. needed
- Character representation
  - Not many pages “do the right thing”
- Navigational material: “click here”
- Lists
- Mixed languages
- Duplication
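Two of these issues, navigational material and duplication, lend themselves to a short sketch. The phrase list and the function names below are hypothetical, and the approach (dropping navigation-only lines, then hashing normalised text) only catches exact duplicates after normalisation; it is not a description of the project's own pipeline.

```python
import hashlib
import re

# Phrases treated as navigational material (an illustrative, hypothetical list).
NAV_PHRASES = ("click here", "home", "next page", "back to top")


def strip_navigation(page):
    """Drop lines that consist only of a navigational phrase."""
    kept = [
        line for line in page.splitlines()
        if line.strip().lower() not in NAV_PHRASES
    ]
    return "\n".join(kept)


def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalised = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Keep the first copy of each page, comparing cleaned text."""
    seen = set()
    unique = []
    for page in pages:
        cleaned = strip_navigation(page)
        key = fingerprint(cleaned)
        if key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

Near-duplicates, such as the same article with a different date stamp, would need a fuzzier comparison than this exact-hash check.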
19
| Text category | Irish: actual | Irish: target | Hiberno-English: actual | Hiberno-English: target |
|---|---|---|---|---|
| Books - imaginative | 7,600,000 | 9,000,000 | 6,000,000 | 7,500,000 |
| Books - informative | 8,400,000 | 6,000,000 | 7,000,000 | 5,000,000 |
| Newspapers | 4,500,000 | | 5,300,000 | 3,750,000 |
| Periodicals | 2,600,000 | 2,500,000 | 700,000 | 2,250,000 |
| Official/Govt | 1,200,000 | 1,500,000 | 1,000,000 | |
| Broadcast | 400,000 | 1,000,000 | 0 | 750,000 |
| Websites | 5,500,000 | | 5,000,000 | 4,750,000 |
| TOTALS | 30,200,000 | 30,000,000 | 25,000,000 | |
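The gap between actual and target counts can be worked out directly from the table. The sketch below uses only rows where both figures appear in the table; the dictionary layout and the `shortfalls` name are illustrative assumptions.

```python
# (actual, target) word counts copied from the table above; rows lacking
# either figure in the table are left out rather than guessed.
IRISH = {
    "Books - imaginative": (7_600_000, 9_000_000),
    "Books - informative": (8_400_000, 6_000_000),
    "Periodicals": (2_600_000, 2_500_000),
    "Official/Govt": (1_200_000, 1_500_000),
    "Broadcast": (400_000, 1_000_000),
}
HIBERNO_ENGLISH = {
    "Books - imaginative": (6_000_000, 7_500_000),
    "Books - informative": (7_000_000, 5_000_000),
    "Newspapers": (5_300_000, 3_750_000),
    "Periodicals": (700_000, 2_250_000),
    "Broadcast": (0, 750_000),
    "Websites": (5_000_000, 4_750_000),
}


def shortfalls(counts):
    """Return {category: target - actual} for categories below target."""
    return {
        category: target - actual
        for category, (actual, target) in counts.items()
        if actual < target
    }


for name, counts in (("Irish", IRISH), ("Hiberno-English", HIBERNO_ENGLISH)):
    for category, gap in shortfalls(counts).items():
        print(f"{name}: {category} is {gap:,} words short of target")
```

On these figures the categories below target are imaginative books and broadcast material in both corpora, plus official/government text for Irish and periodicals for Hiberno-English.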
20
Encoding
- Clean-up
- Linguistic processing
- Delivery formalism
21
Clean-up
Deletion of: title pages, tables of contents, tables, figures, footnotes, endnotes, page headers and footers, crosswords, TV listings, sports results, team listings …
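One item on this list, page headers and footers, can be detected mechanically when a text arrives as separate pages. The following is a minimal sketch under stated assumptions: the `strip_running_lines` name, the 0.5 threshold and the trick of masking digits so that page numbers compare equal are all illustrative choices, not the project's procedure.

```python
import re
from collections import Counter


def _key(line):
    """Normalise a line so that differing page numbers compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_running_lines(pages, min_share=0.5):
    """Remove first/last lines that recur on at least `min_share` of pages."""
    if len(pages) < 2:
        return list(pages)  # too little evidence to call anything a header
    split = [page.splitlines() for page in pages]
    edge_counts = Counter()
    for lines in split:
        if lines:
            # Count each distinct edge line once per page.
            edge_counts.update({_key(lines[0]), _key(lines[-1])})
    threshold = min_share * len(pages)
    running = {k for k, n in edge_counts.items() if k and n >= threshold}
    cleaned = []
    for lines in split:
        body = list(lines)
        if body and _key(body[0]) in running:
            body = body[1:]
        if body and _key(body[-1]) in running:
            body = body[:-1]
        cleaned.append("\n".join(body))
    return cleaned
```

Items such as crosswords or TV listings have no comparably simple signature and would more likely need per-source rules or manual checking.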
22
Linguistic processing
- Lemmatize
  - give, giving, gives, given, gave => give (verb)
- Part-of-speech tagging
  - bank (verb) or bank (noun)?
- English: existing tools used
- Irish: tools developed from scratch
  - Elaine Ui Dhonnchadha: thesis work
  - Finite state methods, constraint grammar
  - Separate talk
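The two steps can be illustrated with a toy lookup table keyed on word form and part of speech. This is a deliberately simplified sketch: the `LEXICON` contents and the `lemmatize` function are hypothetical, and a plain dictionary lookup is a stand-in, not the finite state and constraint grammar methods named on the slide.

```python
# Tiny hand-written lexicon: (word form, part of speech) -> lemma.
LEXICON = {
    ("give", "verb"): "give",
    ("giving", "verb"): "give",
    ("gives", "verb"): "give",
    ("given", "verb"): "give",
    ("gave", "verb"): "give",
    ("banks", "noun"): "bank",
    ("banks", "verb"): "bank",
    ("banked", "verb"): "bank",
}


def lemmatize(form, pos):
    """Return (lemma, pos); unknown forms fall back to the lowercased form."""
    lowered = form.lower()
    return (LEXICON.get((lowered, pos), lowered), pos)


for form in ("give", "giving", "gives", "given", "gave"):
    print(form, "=>", lemmatize(form, "verb"))
```

Keeping the part of speech alongside the lemma is what lets a later query distinguish bank (verb) from bank (noun), even though both share the same lemma string.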
23
Delivery formalism
Both:
- XML Corpus Encoding Standard (XCES)
  - For longevity, interchange format
and
- Loaded into Word Sketch Engine
  - Corpus query tool optimised for lexicography, linguistic research
  - Good for searching on grammar, text type etc.
  - Friday talk
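Corpus query tools commonly take annotated text as one token per line with tab-separated annotations, wrapped in document-level tags that carry metadata such as text type. The sketch below shows that general layout; whether the project delivered exactly this layout is an assumption, and the `to_vertical` name, the `id` and `texttype` attributes and the tag labels are hypothetical.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_id, text_type, tokens):
    """Render (word, tag, lemma) triples, one token per line, inside <doc>."""
    lines = [f"<doc id={quoteattr(doc_id)} texttype={quoteattr(text_type)}>"]
    for word, tag, lemma in tokens:
        lines.append(f"{word}\t{tag}\t{lemma}")
    lines.append("</doc>")
    return "\n".join(lines)


print(to_vertical("d1", "newspaper", [("gave", "VERB", "give"),
                                      ("banks", "NOUN", "bank")]))
```

Storing text type on the document element and the tag and lemma on each token is what makes searches "on grammar, text type etc." possible.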
24
Conclusion
- Large corpora for high-quality lexicography
- Developed in one year, modest budget
- Design, collection and encoding
- Delivered in a convenient form for the lexicographer
Thank you