Tools for Historical corpus research, and a corpus of Latin
Barbara McGillivray, Oxford University Press
Adam Kilgarriff, Lexical Computing Ltd.

Outline
  Latin corpora
  Sketch Engine
  LatinISE: a Latin corpus for SkE
  Collecting the texts
  Metadata
  Automatic annotation
  Demo
  Conclusion

Latin corpora

Overview
  Index Thomisticus (1980) by R. Busa S. J.
    First electronic corpus
    11 million words; lemmatized
  Digital editions
    Perseus Digital Library (10 million words)
    Corpus Grammaticorum Latinorum
    Library of Latin Texts (50 million)
    Musisque Deoque

Morphological annotation
  Manual
    LASLA (1.5 million words)
  Automatic
    Morpheus (Perseus)
    CHLT-LEMLAT (ILC-CNR)
    Words (W. Whitaker), Quick Latin

Treebanks
  Latin Dependency Treebank
    53,000 tokens
    Caesar, Cicero, Jerome, Ovid, Petronius, Propertius, Sallust, Vergil
  Index Thomisticus Treebank
    100,000
    Thomas Aquinas
  PROIEL Project
    100,000
    Translations of the New Testament in Latin, Greek, Old Church Slavonic, Armenian, Gothic

Motivation
  Latin is still a less-resourced language
  Features of our corpus
    Size: 13 million words
    Provided with metadata
    Automatically annotated
      Lemmatized
      Part-of-speech tagged
    Included in a clever corpus query system

Sketch Engine

Corpus query tool, since 2003
  Widely used by lexicographers
    Commercial: OUP, CUP, Collins, Macmillan, Le Robert, Cornelsen, Shogakukan
    National dictionary projects: Bulgaria, Czech Republic, Estonia, Netherlands, Slovakia, Slovenia
  Universities
    Linguistics, language research, NLP, language teaching

44 languages and counting
  Large corpora ready-to-use for:
  Arabic, Bengali, Bulgarian, Chinese, Czech, Croatian, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Indonesian, Irish, Italian, Japanese, Korean, Latin, Malay, Malayalam, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Setswana, Slovak, Slovene, Spanish, Swahili, Swedish, Tamil, Telugu, Thai, Turkish, Urdu, Vietnamese

Handles large corpora
  Largest to date: 8 billion words
  Fast
Web-based: no software to install
Build 'instant corpora' from the web
Load your own corpus
  Quota of space on SkE server
Word sketches
  One-page, automatic accounts of a word's grammatical and collocational behaviour
Free 30-day trial: sketchengine.co.uk

Add your language/corpus?
  In your personal area, or maybe for all SkE users
  Always interested in adding more resources
  If it's a corpus that others may want: quid pro quo: free use of tool
  Contact:

LatinISE: a Latin corpus in the Sketch Engine

Collecting the texts
  Three online digital libraries
    LacusCurtius
    IntraText
    Musisque Deoque
  From HTML to verticalised text
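The "HTML to verticalised text" step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the common vertical layout of one token per line inside a `<doc>` element, and the class `TextExtractor`, the function `verticalise` and the regex tokenizer are hypothetical names chosen for this example.

```python
import re
from html.parser import HTMLParser
from xml.sax.saxutils import quoteattr


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


# Naive tokenizer (an assumption): runs of word characters,
# or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def verticalise(html, **metadata):
    """Return one token per line, wrapped in a <doc> element
    whose attributes carry the text's metadata."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    attrs = "".join(f" {k}={quoteattr(str(v))}" for k, v in metadata.items())
    lines = [f"<doc{attrs}>"]
    lines.extend(TOKEN.findall(text))
    lines.append("</doc>")
    return "\n".join(lines)


print(verticalise("<p>Gallia est omnis divisa.</p>", author="Caesar"))
```

Later annotation steps (lemma, part of speech) would typically add tab-separated columns after each token; that is left out here.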

Metadata
  Author; title
  Genre (prose or poetry)
  Era; date; century
    Oldest: Senatus consultum de Bacchanalibus (186 B.C.)
    Most recent: Congregazione per la Dottrina della Fede, Dominus Iesus (2000)
  Metadata used to delete duplicated texts
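The slide does not say how metadata was compared to find duplicates. One plausible approach, sketched below, is to treat two texts as the same work when their normalised author and title match. The record fields (`author`, `title`, `source`) and both function names are assumptions made for this example.

```python
import unicodedata


def metadata_key(record):
    """Build a comparison key from author and title (assumed fields)."""

    def norm(value):
        # Strip diacritics, lowercase, and collapse whitespace.
        value = unicodedata.normalize("NFKD", value)
        value = "".join(c for c in value if not unicodedata.combining(c))
        return " ".join(value.lower().split())

    return (norm(record["author"]), norm(record["title"]))


def drop_duplicates(records):
    """Keep the first record seen for each (author, title) key."""
    seen = set()
    unique = []
    for record in records:
        key = metadata_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


texts = [
    {"author": "Caesar", "title": "De Bello Gallico", "source": "LacusCurtius"},
    {"author": " caesar ", "title": "de bello  gallico", "source": "IntraText"},
    {"author": "Cicero", "title": "De Officiis", "source": "IntraText"},
]
print([t["source"] for t in drop_duplicates(texts)])
```

Exact matching on normalised strings misses variant titles and spellings of author names, so a real collection would probably need manual checks as well.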

Annotation
  Natural Language Processing
    Lemmatization
      PROIEL Project's morphological analyser (Dag Haug)
      Quick Latin
    POS tagging
      TreeTagger (H. Schmid, IMS, University of Stuttgart)
  Advantages
    Not prone to human errors, fast, less costly

The corpus in SkE

Subcorpora
  Subcorpus                                  Size
  Early (VII-II cent. B.C.)               401,557
  Classical (I cent. B.C.)              2,275,030
  Post-classical (I-VI cent. A.D.)      6,080,181
  Medieval (VII-XIV cent. A.D.)         2,920,446
  Modern (XV-XXI cent. A.D.)            2,034,940
  Poetry                                3,818,603
  Prose                                 9,935,401
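Because the subcorpora differ widely in size, raw hit counts are not directly comparable across eras or genres. A small sketch of the usual normalisation to hits per million follows. The sizes are taken from the slide (which does not say whether they count words or tokens), while the function name and the hit counts in the example are hypothetical.

```python
# Subcorpus sizes as given on the slide.
SUBCORPUS_SIZES = {
    "Early": 401_557,
    "Classical": 2_275_030,
    "Post-classical": 6_080_181,
    "Medieval": 2_920_446,
    "Modern": 2_034_940,
    "Poetry": 3_818_603,
    "Prose": 9_935_401,
}


def per_million(hits, subcorpus):
    """Normalise a raw hit count to hits per million for one subcorpus."""
    return hits / SUBCORPUS_SIZES[subcorpus] * 1_000_000


# Hypothetical counts: the same 500 hits weigh far more in the
# small Early subcorpus than in the large Post-classical one.
for name in ("Early", "Post-classical"):
    print(name, round(per_million(500, name), 1))
```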

A first search

Cum (conjunction)

Cum (preposition)

Search a phrase

Magna pars vs. pars magna

Context: Dico/puto/credo quod
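The demo slides appear to use the Sketch Engine search form (phrase search and the Context filter). Comparable searches can also be written in Sketch Engine's Corpus Query Language (CQL), where each token is a bracketed condition such as `[lemma="..."]`. The sketch below only builds query strings. It assumes the corpus exposes the usual `lemma` and `word` attributes, and the helper names `token` and `sequence` are hypothetical.

```python
def token(attr, *values):
    """One CQL token condition; several values become a regex alternation.
    Values are assumed to be plain alphabetic strings (no escaping done)."""
    return f'[{attr}="{"|".join(values)}"]'


def sequence(*tokens):
    """Adjacent tokens in CQL are written one after another."""
    return " ".join(tokens)


# A verb of saying or thinking immediately followed by "quod".
dico_quod = sequence(token("lemma", "dico", "puto", "credo"), token("word", "quod"))

# The two word orders compared on the earlier slide.
magna_pars = sequence(token("lemma", "magnus"), token("lemma", "pars"))
pars_magna = sequence(token("lemma", "pars"), token("lemma", "magnus"))

print(dico_quod)
print(magna_pars)
print(pars_magna)
```

Querying by lemma rather than word form matches every inflected form of the phrase, which may be broader than the exact-phrase search shown in the demo.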

Conclusion

A new large resource for a less-resourced language
NLP tools on a dead language
Advanced corpus queries with Sketch Engine
Future
  Morphological tags (case, mood, voice, …)
  Syntactic tags (Word Sketches)