Tracking Linguistic Variation in Historical Corpora David Bamman The Perseus Project, Tufts University.

Slides:



Advertisements
Similar presentations
Rationale for a multilingual corpus for machine translation evaluation Debbie Elliott Anthony Hartley Eric Atwell Corpus Linguistics 2003, Lancaster, England.
Advertisements

The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China
Jing-Shin Chang National Chi Nan University, IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities ( ) & Text Corpora.
Issues in Building and Exploiting Latin Language Resources Marco Passarotti Università Cattolica del Sacro Cuore, Milan (Italy)
Annotating language data Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz Tomaž Erjavec.
1 JCDL 2011 Report Kazunari Sugiyama WING meeting 19 th August, 2011.
An Introduction to English Lexicology Lectured by Huang xue e
Introduction to Old and Middle English: Part I Overview October 28, 2005 Andreas H. Jucker.
Inducing Information Extraction Systems for New Languages via Cross-Language Projection Ellen Riloff University of Utah Charles Schafer, David Yarowksy.
Is this text reliable? Criteria for establishing the “original” text.
Machine Translation Prof. Alexandros Potamianos Dept. of Electrical & Computer Engineering Technical University of Crete, Greece May 2003.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Young Children Learn a Native English Anat Ninio The Hebrew University, Jerusalem 2010 Conference of Human Development, Fordham University, New York Background:
A Hierarchical Phrase-Based Model for Statistical Machine Translation Author: David Chiang Presented by Achim Ruopp Formulas/illustrations/numbers extracted.
Mining Gazetteer Data from Digital Library Collections David Smith Perseus Project Tufts University.
APPROACHES and METHODS IN LANGUAGE TEACHING
Natural Language Processing Expectation Maximization.
Tools for Historical corpus research, and a corpus of Latin Barbara McGillivray Oxford University Press Adam Kilgarriff Lexical Computing Ltd.
Scholars and Philosophers of the Ideas of Humanism Petrarch ( ) Erasmus ( ) Guillaume Bude ( ) Michel de Montaigne ( )
21st Century Classics Gregory Crane Professor and Chair Department of Classics Adjunct Professor of Computer Science Winnick Family Chair of Technology.
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
A brief analysis of Chinese and English thought patterns Kaplan, Robert. (1966) Cultural Thought Patterns in Inter-Cultural Education. Language Learning.
Emma Coonan, University Library David Lowe, University Library Jane Devine Mejia Modern & Medieval Languages Library.
BTANT 129 w5 Introduction to corpus linguistics. BTANT 129 w5 Corpus The old school concept – A collection of texts especially if complete and self-contained:
CIG Conference Norwich September 2006 AUTINDEX 1 AUTINDEX: Automatic Indexing and Classification of Texts Catherine Pease & Paul Schmidt IAI, Saarbrücken.
Elpida Palaiokrassa & Aris Foufas.  Latin is the language that was originally spoken in both regions, Latium which was around Rome and ancient Rome.
1 Corpora: Annotating and Searching LING 5200 Computational Corpus Linguistics Martha Palmer.
Historical Syntax & Lexical Change How Sentence Structure and Vocabulary Change over Time Asian 401.
Historical linguistics Historical linguistics (also called diachronic linguistics) is the study of language change. Diachronic: The study of linguistic.
Syntax. Syntax n The natural internal “grammar” of a language.
2010 Failures in Czech-English Phrase-Based MT 2010 Failures in Czech-English Phrase-Based MT Full text, acknowledgement and the list of references in.
Introducing MorphoLogic to LIRICS Gábor Prószéky MorphoLogic Pázmány Péter Catholic University Faculty.
Coping with Surprise: Multiple CMU MT Approaches Alon Lavie Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking Language.
Chapter 5 Syntax English Linguistics: An Introduction.
Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos.
1 2 Modern Approaches to Corpus Linguistics Dominique L ONGRÉE, LASLA – Université de Liège et FUSL (Bruxelles) automatic taggers as heuristic tools multilevel.
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:
GUIDE : PROF. PUSHPAK BHATTACHARYYA Bilingual Terminology Mining BY: MUNISH MINIA (07D05016) PRIYANK SHARMA (07D05017)
11 Chapter 19 Lexical Semantics. 2 Lexical Ambiguity Most words in natural languages have multiple possible meanings. –“pen” (noun) The dog is in the.
Sources (I). Sources for antiquity How do we know what happened in the Greek and Roman world? How do we know when it happened? How do we know what events.
What’s in a translation rule? Paper by Galley, Hopkins, Knight & Marcu Presentation By: Behrang Mohit.
Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model Chun-Jen Lee Jason S. Chang Thomas C. Chuang AMTA 2004.
1 The Digital du Cange: Moldy Old Tomes Make an Internet Comeback Andrew Gollan and Ross Scaife Modern and Classical Languages, Literatures, and Cultures.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
Zenodotus of Ephesus Pollyana Bell.
Statistical Machine Translation Part III – Many-to-Many Alignments Alexander Fraser CIS, LMU München WSD and MT.
Coping with Surprise: Multiple CMU MT Approaches Alon Lavie Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking Language.
A New Approach for English- Chinese Named Entity Alignment Donghui Feng Yayuan Lv Ming Zhou USC MSR Asia EMNLP-04.
Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
Approaching a New Language in Machine Translation Anna Sågvall Hein, Per Weijnitz.
Improving a Statistical MT System with Automatically Learned Rewrite Rules Fei Xia and Michael McCord IBM T. J. Watson Research Center Yorktown Heights,
Over 350, 000 late 15 th to long 19 th century texts (EEBO, ECCO & BL 19 th Century) Introduction to Historical Texts.
What is humanism? A philosophy that developed during the Renaissance that focused more on the potential of man and less on the doctrine of the Catholic.
Interpreting the New Testament
Syntax By WJQ. Syntax : Syntax is the study of the rules governing the way words are combined to form sentences in a language, or simply, the study of.
Sources. Homer Iliad Odyssey “Homeric Hymns” “Epic cycle” Oral poetry Central role 7 th c. BCE (first written poetry)
A Simple English-to-Punjabi Translation System By : Shailendra Singh.
Recap: distributional hypothesis What is tezgüino? – A bottle of tezgüino is on the table. – Everybody likes tezgüino. – Tezgüino makes you drunk. – We.
[CULTURE]. THE CONCEPT OF THE RENAISSANCE The French term Renaissance means ‘rebirth’ and it refers to the rebirth of classical (Greek and Latin) learning.
Learning to Generate Complex Morphology for Machine Translation Einat Minkov †, Kristina Toutanova* and Hisami Suzuki* *Microsoft Research † Carnegie Mellon.
Changes in English 1 In this presentation we are going to look at the way other languages have influenced English and at the similarities and differences.
Linguistic Annotation of Classical Texts
RECENT TRENDS IN SMT By M.Balamurugan, Phd Research Scholar,
Introduction to Historical Texts
Introduction to Computational Methods for Classical Philology
Measuring Monolinguality
Million Books Update: Perseus
Urdu-to-English Stat-XFER system for NIST MT Eval 2008
A Latin corpus for Sketch Engine
Introduction to Historical Texts
Presentation transcript:

Tracking Linguistic Variation in Historical Corpora David Bamman The Perseus Project, Tufts University

2000+ Years of Latin Classical Latin: 200 BCE – 200 CE – Vergil, Caesar, Cicero Late/Medieval Latin (200 CE – 1300 CE) – Augustine, Thomas Aquinas Renaissance/Neo-Latin (1300 CE – present) – Erasmus, Luther – Tycho Brahe, Galileo, Kepler, Newton, Euler, Bernoulli, Linnaeus – Thomas Hobbes, Leibnitz, Spinoza, Francis Bacon, Descartes

Goal: Tracking Language Change Lexical change (new vocabulary, shift in the meanings of words) Syntactic change (including the influence of the author’s L1 on the Latin syntax) Topical change (the rise of new genres) Identifying the flow of information. E.g., Cicero + Augustine influencing Petrarch; Petrarch influencing Leonardo Bruni.

Data 1.2M books from the Internet Archive (snapshot of collection from 2009) 27,014 works catalogued as Latin Problems: 1. Many of these works are not Latin. 2. Recorded dates = dates of publication, not dates of composition.

27,014 works catalogued as Latin in the IA, charted by “date.”

Language ID Language ID to identify which of these works actually have Latin as a major language. – Trained a language classifier on: 24 editions of Wikipedia Perseus classical corpus Known badly-OCR’d Greek in the IA. Results – ~20% of 27,014 books catalogued as “Latin” are not (mostly Greek) – 4,581 books not catalogued as Latin in the 1.2M collection are in fact so.

Composition dating With undergraduate students, currently establishing the dates of composition for each Latin text. So far, considered 10,398 (38%) of them: – 7,055 dated – 3,343 excluded as not Latin or reference works (dictionaries, catalogues, lists of manuscripts) From these 7,055 works, we extract just the Latin to create a dated historical corpus

27,014 works catalogued as Latin in the IA, charted by “date.”

7,055 Latin works in the IA, charted by date of composition.

Word counts by century. 364,000,000 total.

Atomic variables 1.Track lexical trends – (“America” used more after 1508) 2.Track syntactic change – (SOV -> SVO) 3.Track lexical change – (“oratio” used more and more to mean “prayer” rather than “speech”)

Lexical trends: Google Ngram Viewer

Lexical trends

“America” (1066)

“de” (2,955,462)

“ad” (3,655,191)

“in” (8,126,487)

Atomic variables 1.Track lexical trends – (“America” used more after 1508) 2.Track syntactic change – SOV word order (“The dog me bit”) -> SVO (“The dog bit me”). 3.Track lexical change – (“oratio” used more and more to mean “prayer” rather than “speech”)

Historical treebanks Most recent research and investment in treebanks has focused on modern languages, but treebanks for historical languages are now arising as well: – Middle English (Kroch and Taylor 2000) – Medieval Portuguese (Rocio et al. 2000) – Classical Chinese (Huang et al. 2002) – Old English (Taylor et al. 2003) – Early Modern English (Kroch et al. 2004) – Latin (Bamman and Crane 2006, Passarotti 2007) – Ugaritic (Zemánek 2007) – New Testament Greek, Latin, Gothic, Armenian, Church Slavonic (Haug and Jøhndal 2008)

Design Latin and Greek are heavily inflected languages with a high degree of variability in its word order: constituents of sentences are often broken up with elements of other constituents, as in ista meam norit gloria canitiem (“that glory will know my old age”). Because of this flexibility, we based our annotation standards on the dependency grammar used by the Prague Dependency Treebank (of Czech).

Latin Dependency Treebank AuthorWords Caesar1,488 Cicero6,229 Sallust12,311 Vergil2,613 Jerome8,382 Ovid4,789 Petronius12,474 Propertius4,857 Total53,143

Ancient Greek Dependency Treebank WorkWords Aeschylus (complete)48,172 Hesiod, Shield of Heracles3,834 Hesiod, Theogony8,106 Hesiod, Works and Days6,941 Homer, Iliad128,102 Homer, Odyssey104,467 Sophocles, Ajax9,474 Total309,096

Perseus Digital Library

Treebank Annotation

Graphical editor: build a syntactic annotation by dragging and dropping each word onto its syntactic head.

Annotator forum

Class treebanking Currently being used in 9 universities in the United States, Argentina and Australia.

Perseus Digital Library

Undergraduate Contributions

Ownership Model...

Treebank data

Syntactic variation CiceroCaesarVergilJerome SVO5.3%0%20.8%68.5% SOV26.3%64.7%18.8%4.7% VSO5.3%0%6.3%16.5% VOS0% 10.4%3.1% OSV52.6%35.3%25.0%3.9% OVS10.5%0%18.8%3.1% Word order rates by author (sentences with overt subjects and objects). Cicero, n=19; Caesar, n=17; Vergil, n=48; Jerome, n=127.

Syntactic variation CiceroCaesarVergilJerome OV68.2%95.2%56.2%13.9% VO31.8%4.8%43.8%86.1% Word order rates by author (sentences with one zero-anaphor). OV/VO: Cicero, n=44; Caesar, n=63; Vergil, n=121; Jerome, n=309. SV/VS: Cicero, n=58; Caesar, n=90; Vergil, n=97; Jerome, n=404. CiceroCaesarVergilJerome SV75.9%86.7%53.6%65.8% VS24.1%13.3%46.4%34.2%

Atomic variables 1.Track lexical trends – (“America” used more after 1508) 2.Track syntactic change – (SOV -> SVO) 3.Track lexical change – (“oratio” used more and more to mean “prayer” rather than “speech”)

Dynamic Lexicon

Tracking lexical change SMT based on Brown et al (1990) Different senses for a word in one language are translated by different words in another. “Bank” (English) – financial institution = French “banque” – side of a river = French “rive” (e.g., la rive gauche)

Dynamic Lexicon Sentence level: Moore’s Bilingual Sentence Aligner (Moore 2002) – aligns sentences that are 1-1 translations of each other w/ high precision (98.5% on a corpus of 10K English-Hindi sentences) Word level: MGIZA++ (Gao and Vogel 2008) – parallel version of: GIZA++ (Och and Ney 2003) - implementation of IBM Models 1-5.

Multilingual Alignment Word-level alignment of Homer’s Odyssey

Latin/Greek  English Senses

English  Greek/Latin Senses

Dynamic Lexicon

Parallel Text Data The Internet Archive alone contains editions of Horace’s Odes in eight different languages. Latin: carpe diem quam minimum credula postero (Horace, Ode 1.11) Italian: tu l’oggi goditi: e gli stolti al domani s’affidino (Chiarini 1916) French: Cueille le jour, et ne crois pas au lendemain (De Lisle 1887) English: Seize the present; trust tomorrow e’en as little as you may (Conington 1872) German: Pflücke des Tag’s Blüten, und nie traue dem morgenden (Schmidt 1820) Portuguese: colhe o dia, do de amanh ́a mui pouco confiando (Duriense 1807) Spanish: Coge este dia, dando muy poco credito al siguiente (Campos and Minguez 1783) Early Modern French: Jouissez donc en repos du jour present, & ne vous attendez point au lendemain (Dacier 1681)

Tracking sense variation in 2000 years of Latin 1.Identify translations - (130 English translations manually identified by students from a representative range of dates) 2.Word align Latin text English text - (ca. 1.3M words) 3.Induce a sense inventory from the alignment 4.Train a WSD classifier on noisily aligned texts 5.Automatically classify remaining 365M words 6.Track lexical change

Oratio

Knight

URLs Treebank data Treebank annotation environment Translation information Greek lexicon