Corpus Creation for Lexicography Adam Kilgarriff, Michael Rundell Lexicography MasterClass, UK Elaine Ui Dhonnchadha ITE (Linguistics Institute of Ireland)

1 Corpus Creation for Lexicography
Adam Kilgarriff, Michael Rundell
Lexicography MasterClass, UK
Elaine Ui Dhonnchadha
ITE (Linguistics Institute of Ireland)

2 Tasks
- Design
- Collection
- Encoding
Kilgarriff: Asialex June 2005

3 The project
- A New English-Irish Dictionary
  - Authoritative, general purpose
  - Academics, translators, students, secretaries
- One year ‘set-up’ phase
  - Limited time, limited budget
  - Many tasks, including corpus development
- Irish and UK Government funded
- Lead contractor: LexMasterClass
- Subcontractor: ITE

4 Languages
- English
- Irish

5 The Irish language
- A Celtic language
- Long literary tradition
  - Irish-Latin dictionary from 9th century
- Main language of Ireland until 1850-1900
  - English took over (British imperialist policies)
- 62,000 speakers as main language
- Gaeltacht: Irish-speaking areas
- Three dialects

6 Gaeltacht areas

7 Design: English
- Source language for NEID
  - Very large resource wanted
    - E.g. for word sketches, see Friday talk
- Three language varieties
  - Irish (Hiberno-English)
  - British
  - American

8
- American
  - 100M words
  - Journalistic text available
- British
  - 100M words
  - British National Corpus (BNC)
    - Model balanced corpus
    - Spoken conversation (10%)
    - Books, newspapers, magazines
    - Popular, academic, technical

9 Hiberno-English
- 25 M words
- Goal: balanced like BNC except
  - No budget for spoken corpus collection
  - New category: web
  - Dates: since independence (1922)
    - Emphasis on current language

10 Design: Irish
- 30 M words
- Starting point: BNC-like
- Native speakers
  - Native speakers' language “better”
  - Many texts written by non-native speakers
  - Record status where possible
    - Newspapers, websites: no info available
- Dialect
  - Record where possible
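
"Record where possible" suggests per-text metadata in which a missing value is distinct from a negative one. A minimal sketch of such a record follows; the class name, field names and sample entries are hypothetical and do not come from the project.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextRecord:
    """Metadata for one corpus text (hypothetical layout)."""
    title: str
    category: str
    # None means "status not known", e.g. for newspapers and websites.
    native_speaker: Optional[bool] = None
    # None means "dialect not recorded".
    dialect: Optional[str] = None


def known_native(records: List[TextRecord]) -> List[TextRecord]:
    """Keep only texts positively recorded as written by a native speaker."""
    return [r for r in records if r.native_speaker is True]


records = [
    TextRecord("Novel A", "Books - imaginative", native_speaker=True, dialect="Munster"),
    TextRecord("News site B", "Websites"),  # no author information available
]
```

Keeping `None` separate from `False` lets later filters distinguish "not a native speaker" from "no information".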

11 “High quality Irish”
  - Smaller than 150 years ago
  - Many documents are translations
  - Learners’ errors, inelegant prose
  - Samuel Johnson: “writers of the first reputation”
- Con
  - Who judges?
  - Risk of literary or backward-looking bias
    - Lexicographers need a corpus to translate “boot the computer” as well as “the babbling brook”
  - Trench and the OED: “an historian, not a critic”
  - Will a quality filter limit corpus breadth (and size)?

12 Quality: outcome
- Wide range of text types wanted
- Particular effort to gather native speaker non-translations
- Period for corpus: 1883-present
  - Most earlier texts: literary
  - Most text types: usually recent

13 Words: actual

| Text category | Irish | Hiberno-English |
|---|---|---|
| Books - imaginative | 7,600,000 | 6,000,000 |
| Books - informative | 8,400,000 | 7,000,000 |
| Newspapers | 4,500,000 | 5,300,000 |
| Periodicals | 2,600,000 | 700,000 |
| Official/Govt | 1,200,000 | 1,000,000 |
| Broadcast | 400,000 | 0 |
| Websites | 5,500,000 | 5,000,000 |
| TOTALS | 30,200,000 | 25,000,000 |

14 Collection
- Use existing
- Ask publishers
- Web

15 Use existing
- Irish: PAROLE corpus (8M words, ITE)
- English
  - British: BNC
  - American: LDC Gigaword – wds journalism
  - Limerick Corpus of Spoken English
  - Northern Ireland Corpus of Transcribed Speech

16 Ask publishers
- The junkmail problem
- Appeals to national pride
- Charm and persistence
- Team member who knows them all

17 Web
- Fast becoming the usual place to look
  - Kilgarriff and Grefenstette, CL 2003
- Preliminary experiments
  - At least 15 M words of Irish out there
- Hiberno-English
  - English as found on sites where Irish was found

18 Web issues
- Formats
  - Conversion from pdf etc needed
- Character representation
  - Not many pages “do the right thing”
- Navigational material: “click here”
- Lists
- Mixed languages
- Duplication
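
Two of these issues, navigational material and duplication, lend themselves to a short illustration. The sketch below is not the project's pipeline: the phrase list, function names and the choice of SHA-1 fingerprints over normalised text are all assumptions made for the example.

```python
import hashlib
import re

# Illustrative phrases only; a real list would be built by inspecting the data.
NAV_PHRASES = ("click here", "home", "next page")


def strip_navigation(lines):
    """Drop lines that consist solely of a known navigational phrase."""
    return [line for line in lines if line.strip().lower() not in NAV_PHRASES]


def fingerprint(text):
    """Hash the text after collapsing whitespace and lower-casing."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def drop_duplicates(pages):
    """Keep the first copy of each page, comparing by fingerprint."""
    seen = set()
    unique = []
    for page in pages:
        key = fingerprint(page)
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique
```

Exact-match fingerprints only catch copies that differ in spacing or case; near-duplicates (for example the same article with a different page banner) would need a fuzzier comparison.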

19 Words: actual and target

| Text category | Irish: actual | Irish: target | Hiberno-English: actual | Hiberno-English: target |
|---|---|---|---|---|
| Books - imaginative | 7,600,000 | 9,000,000 | 6,000,000 | 7,500,000 |
| Books - informative | 8,400,000 | 6,000,000 | 7,000,000 | 5,000,000 |
| Newspapers | 4,500,000 | | 5,300,000 | 3,750,000 |
| Periodicals | 2,600,000 | 2,500,000 | 700,000 | 2,250,000 |
| Official/Govt | 1,200,000 | 1,500,000 | 1,000,000 | |
| Broadcast | 400,000 | 1,000,000 | 0 | 750,000 |
| Websites | 5,500,000 | | 5,000,000 | 4,750,000 |
| TOTALS | 30,200,000 | 30,000,000 | 25,000,000 | |
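
The transcript gives no target figure for Irish newspapers and websites or for Hiberno-English official/government text. One possible reading, which is an assumption and not something the slide states, is that target equals actual in those cells. The sketch below checks that reading against the totals: 30,000,000 for Irish (from the totals row) and 25,000,000 for Hiberno-English (the 25 M figure on slide 9). It also computes actual minus target per category.

```python
CATEGORIES = ["Books - imaginative", "Books - informative", "Newspapers",
              "Periodicals", "Official/Govt", "Broadcast", "Websites"]

ACTUAL = {
    "Irish": [7_600_000, 8_400_000, 4_500_000, 2_600_000, 1_200_000, 400_000, 5_500_000],
    "Hiberno-English": [6_000_000, 7_000_000, 5_300_000, 700_000, 1_000_000, 0, 5_000_000],
}

# None marks a target that is missing from the transcript.
TARGET = {
    "Irish": [9_000_000, 6_000_000, None, 2_500_000, 1_500_000, 1_000_000, None],
    "Hiberno-English": [7_500_000, 5_000_000, 3_750_000, 2_250_000, None, 750_000, 4_750_000],
}

TOTAL_TARGET = {"Irish": 30_000_000, "Hiberno-English": 25_000_000}


def filled_targets(variety):
    """Assumption: a missing target equals the actual figure for that cell."""
    return [a if t is None else t
            for a, t in zip(ACTUAL[variety], TARGET[variety])]


def gaps(variety):
    """Actual minus target per category; negative means a shortfall."""
    return {cat: a - t for cat, a, t in
            zip(CATEGORIES, ACTUAL[variety], filled_targets(variety))}


for variety in ACTUAL:
    total = sum(filled_targets(variety))
    print(variety, total, total == TOTAL_TARGET[variety])
```

Under that assumption both sets of targets sum exactly to their totals, which makes the reading plausible without confirming it.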

20 Encoding
- Clean-up
- Linguistic processing
- Delivery formalism

21 Clean-up
- Deletion of:
  - Title pages, table of contents, tables, figures, footnotes, endnotes, page headers and footers, crosswords, TV listings, sports results, team listings …

22 Linguistic processing
- Lemmatize
  - give giving gives given gave => give (verb)
- Part-of-speech tagging
  - bank (verb) or bank (noun)?
- English: existing tools used
- Irish: tools developed from scratch
  - Elaine Ui Dhonnchadha: thesis work
  - Finite state methods, constraint grammar
  - Separate talk
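
The two examples on this slide can be made concrete with a toy sketch. It is a plain lookup table plus two hand-written context rules that discard readings, loosely in the spirit of constraint-based disambiguation; it is not the finite state or constraint grammar tooling the slide refers to, and every name and rule in it is hypothetical.

```python
# Toy lemma table covering only the slide's example.
LEMMAS = {"give": "give", "giving": "give", "gives": "give",
          "given": "give", "gave": "give"}

# Possible part-of-speech readings for ambiguous words.
READINGS = {"bank": {"noun", "verb"}}
DETERMINERS = {"the", "a", "an"}


def lemmatize(word):
    """Return the lemma if the form is in the table, else the lower-cased form."""
    return LEMMAS.get(word.lower(), word.lower())


def tag(tokens):
    """Attach the surviving readings to each token."""
    tagged = []
    for i, token in enumerate(tokens):
        readings = set(READINGS.get(token.lower(), {"unknown"}))
        previous = tokens[i - 1].lower() if i > 0 else None
        # Rule 1: after a determiner, discard the verb reading.
        if len(readings) > 1 and previous in DETERMINERS:
            readings.discard("verb")
        # Rule 2: after "to", discard the noun reading.
        if len(readings) > 1 and previous == "to":
            readings.discard("noun")
        tagged.append((token, sorted(readings)))
    return tagged
```

With no disambiguating context the word keeps both readings, which is the situation the slide's question describes.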

23 Delivery formalism
- Both
  - XML Corpus Encoding Standards (XCES)
  - For longevity, interchange format
- And
  - Loaded into Word Sketch Engine
  - Corpus query tool optimised for lexicography, linguistic research
  - Good for searching on grammar, text type etc
- Friday talk
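
To show how lemma, part-of-speech and text-type information might travel with the text, the sketch below writes a document in a one-token-per-line layout with tab-separated columns, of the kind corpus query tools commonly load. This is not XCES, and the element names, attribute names and column order are assumptions for illustration only.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_id, text_type, sentences):
    """Serialise tagged sentences as one token per line.

    sentences is a list of sentences, each a list of
    (word, tag, lemma) tuples.
    """
    lines = [f"<doc id={quoteattr(doc_id)} type={quoteattr(text_type)}>"]
    for sentence in sentences:
        lines.append("<s>")
        for word, pos, lemma in sentence:
            lines.append(f"{word}\t{pos}\t{lemma}")
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines)


sample = to_vertical("d1", "Newspapers",
                     [[("She", "PRON", "she"), ("gave", "VERB", "give")]])
print(sample)
```

Storing the text type as a document attribute is what would make searches restricted by text type possible.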

24 Conclusion
- Large corpora for high-quality lexicography
- Developed in one year, modest budget
- Design, collection and encoding
- Delivered in a convenient form for the lexicographer
- Thank you

