Million Books Update: Perseus


2000+ Years of Latin
- Classical Latin (200 BCE – 200 CE): Vergil, Caesar, Cicero
- Late/Medieval Latin (200 CE – 1300 CE): Augustine, Thomas Aquinas
- Renaissance/Neo-Latin (1300 CE – present): Erasmus, Luther; Tycho Brahe, Galileo, Kepler, Newton, Euler, Bernoulli, Linnaeus; Thomas Hobbes, Leibniz, Spinoza, Francis Bacon, Descartes

Goal: Tracking linguistic variation in Latin over 2000 years
- Lexical change (new vocabulary, shifts in the meanings of words)
- Syntactic change (including the influence of the author’s L1 on the Latin syntax)
- Topical change (the rise of new genres)
- Identifying the flow of information, e.g., Cicero and Augustine influencing Petrarch; Petrarch influencing Leonardo Bruni

Data
- 1.2M books from the Internet Archive (snapshot of the collection from 2009)
- 27,014 works catalogued as Latin
Problems:
1. Many of these works are not Latin.
2. Recorded dates are dates of publication, not dates of composition.

Language ID
Language ID identifies which of these works actually have Latin as a major language.
Trained a 5-gram language classifier on:
- 24 editions of Wikipedia
- the Perseus classical corpus
- known badly OCR’d Greek in the IA
Results:
- 10,263 of 25,886 books (39.6%) catalogued as “Latin” are classified as something else (mostly Greek or bad OCR).
- 6,790 books in the 1.2M collection not catalogued as Latin are in fact Latin.
- Net total: 25,886 − 10,263 + 6,790 = 22,413 Latin books.
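As a rough illustration of the idea, here is a minimal n-gram language classifier. It is a sketch under assumptions, not the classifier used in the project: it assumes character 5-grams (the slide does not say whether the 5-grams are characters or words), uses add-one smoothing, and trains on two invented toy samples. The names `ngrams`, `train`, and `classify` are hypothetical.

```python
import math
from collections import Counter

N = 5  # n-gram order from the slide; character-level is an assumption


def ngrams(text, n=N):
    """Return the character n-grams of text, padded with spaces at both ends."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples):
    """samples: dict label -> training text. Returns per-label n-gram counts."""
    return {label: Counter(ngrams(text)) for label, text in samples.items()}


def classify(text, models):
    """Return the label whose add-one-smoothed model scores text highest."""
    vocab = len(set().union(*models.values())) + 1  # +1 slot for unseen n-grams
    best_label, best_score = None, -math.inf
    for label, counts in models.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[g] + 1) / (total + vocab)) for g in ngrams(text)
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (illustrative only; a real run would use far larger corpora).
toy = {
    "latin": "gallia est omnis divisa in partes tres quarum unam incolunt belgae",
    "english": "all gaul is divided into three parts one of which the belgae inhabit",
}
models = train(toy)
print(classify("gallia est omnis divisa", models))
```

Adding a dedicated class for badly OCR’d Greek, as the slide describes, would amount to one more entry in the training dictionary.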

Composition dating
With undergraduate students, we are currently establishing the date of composition for each Latin text. So far we have considered 10,398 works:
- 7,055 dated
- 3,343 excluded as not Latin or as reference works (dictionaries, catalogues, lists of manuscripts)
From these 7,055 works, we extract just the Latin to create a dated historical corpus.

7,055 Latin works in the IA, charted by date of composition.

Atomic variables
- Track lexical trends (“America” used more after 1508)
- Track syntactic change (SOV -> SVO)
- Track lexical change (“oratio” used more and more to mean “prayer” rather than “speech”)

“America” (1,066)

“de” (2,955,462)
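Because a rare word such as “America” and a very common one such as “de” differ in raw count by orders of magnitude, trends are easier to compare as relative frequencies per period. The following is a minimal sketch of that normalization, assuming whitespace tokenization and 100-year buckets. The function name `relative_frequency` and the toy documents are invented for illustration and are not from the Perseus pipeline.

```python
from collections import defaultdict


def relative_frequency(docs, term, bucket=100):
    """docs: iterable of (year_of_composition, text).

    Returns {bucket_start_year: occurrences of term per 1,000 tokens}.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for year, text in docs:
        tokens = text.lower().split()
        start = (year // bucket) * bucket  # e.g. 1520 -> 1500
        hits[start] += tokens.count(term.lower())
        totals[start] += len(tokens)
    return {b: 1000 * hits[b] / totals[b] for b in sorted(totals) if totals[b]}


# Invented toy documents, dated by composition rather than publication.
toy_docs = [
    (1450, "de rebus gestis"),
    (1520, "de america nova terra"),
    (1560, "america et de insulis"),
]
america_trend = relative_frequency(toy_docs, "america")
print(america_trend)
```

Dividing by the token count of each bucket keeps periods with many surviving texts from dominating the chart.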

Atomic variables
- Track lexical trends (“America” used more after 1508)
- Track syntactic change (SOV -> SVO)
- Track lexical change (“oratio” used more and more to mean “prayer” rather than “speech”)

Tracking sense variation in 2000 years of Latin
1. Identify translations (130 English translations manually identified by students from a representative range of dates)
2. Word-align Latin text <-> English text (ca. 1.3M words)
3. Induce a sense inventory from the alignment
4. Train a WSD classifier on the noisily aligned texts
5. Automatically classify the remaining 365M words
6. Track lexical change
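The sense-induction and tracking steps can be sketched in a few lines. This is an illustration under assumptions, not the project's method: it treats each English word aligned to a Latin word as a candidate sense, keeps those above a hypothetical `min_share` threshold, and then counts senses per period directly from the aligned links. The real pipeline instead trains a WSD classifier to label the unaligned text. All names and toy data below are invented.

```python
from collections import Counter, defaultdict


def induce_senses(links, min_share=0.2):
    """links: iterable of (latin_word, english_word, year) alignment links.

    Returns {latin_word: [english translations with share >= min_share]},
    most frequent first. Rare links are treated as alignment noise.
    """
    counts = defaultdict(Counter)
    for latin, english, _year in links:
        counts[latin][english] += 1
    return {
        latin: [e for e, c in seen.most_common() if c / sum(seen.values()) >= min_share]
        for latin, seen in counts.items()
    }


def sense_share_by_period(links, word, senses, bucket=500):
    """Return {bucket_start_year: {sense: share of word's links in that period}}."""
    counts = defaultdict(Counter)
    for latin, english, year in links:
        if latin == word and english in senses:
            counts[(year // bucket) * bucket][english] += 1
    return {
        start: {s: seen[s] / sum(seen.values()) for s in senses}
        for start, seen in sorted(counts.items())
    }


# Invented toy alignment links: (Latin word, aligned English word, year).
toy_links = (
    [("oratio", "speech", -50)] * 6
    + [("oratio", "prayer", 1250)] * 3
    + [("oratio", "speech", 1270)]
    + [("oratio", "the", 1270)]  # a noisy link that the threshold filters out
)
inventory = induce_senses(toy_links)
shares = sense_share_by_period(toy_links, "oratio", inventory["oratio"])
print(inventory, shares)
```

In the toy data the share of “prayer” rises from 0 in the earliest bucket to 0.75 in the latest, which is the kind of shift the sense-distribution chart is meant to show.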

Changing sense distribution of “oratio”

Immediate Future
- De-duplicate the collection
- Measure syntactic variation
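The slides do not say how de-duplication will be done. One common approach, shown here purely as an assumed sketch, is to compare books by the Jaccard overlap of their word shingles and flag pairs above a threshold. The names `shingles`, `jaccard`, and `near_duplicates`, the shingle size, and the 0.8 threshold are all illustrative choices.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    return len(a & b) / len(a | b) if a | b else 1.0


def near_duplicates(books, threshold=0.8):
    """books: dict id -> text. Returns id pairs with similarity >= threshold."""
    ids = sorted(books)
    sets = {i: shingles(books[i]) for i in ids}
    return [
        (x, y)
        for n, x in enumerate(ids)
        for y in ids[n + 1:]
        if jaccard(sets[x], sets[y]) >= threshold
    ]


# Toy collection: two copies of one text and one unrelated text.
toy_books = {
    "ed1": "arma virumque cano troiae qui primus ab oris",
    "ed2": "arma virumque cano troiae qui primus ab oris",
    "other": "gallia est omnis divisa in partes tres",
}
print(near_duplicates(toy_books))
```

The pairwise comparison is quadratic in the number of books, so a collection of roughly 22,000 works would likely need an approximate method such as MinHash rather than this exhaustive loop.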