Introduction to Computational Methods for Classical Philology

Introduction to Computational Methods for Classical Philology David Bamman The Perseus Project, Tufts University http://nlp.perseus.tufts.edu/docs/xxisnec/slides/1.intro.pdf

Homer Multitext 39-megapixel scans of the 10th-century Marcianus Graecus Z. 454 (= 822) manuscript of the Iliad. Publicly released under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License by the Biblioteca Nazionale Marciana and the Center for Hellenic Studies http://chs.harvard.edu/chs/manuscript_images

Physical Access
Perseus Digital Library (http://www.perseus.tufts.edu)
Latin Library (http://www.thelatinlibrary.com/)
LASLA (http://www.cipl.ulg.ac.be/lsl.htm)
Index Thomisticus (http://www.corpusthomisticum.org/it)
Documenta Catholica Omnia (http://www.documentacatholicaomnia.eu/)
TLG (http://www.tlg.uci.edu/)
Brepols corpora [BTL etc.] (http://www.brepols.net/publishers/cd-rom.htm)
Google Books (http://books.google.com/)
Internet Archive (http://www.archive.org)

Perseus Digital Library

“Open” Access: XML

Philologic (Chicago)

Archimedes (Harvard)

Diogenes (Durham)

Hestia (Open University)

Open Source Perseus http://www.perseus.tufts.edu/hopper/opensource 4.5 million words of Classical Latin 4.9 million words of Ancient Greek TEI-Compliant XML

Internet Archive www.archive.org 27,000+ works in Latin; 1 billion words.

Intellectual Access Large-scale linguistic analysis Tracking language change in 2000 years of Latin Downstream computational tasks Automatically creating dynamic bilingual dictionaries Discovering textual allusions

Tracking Language Change
Lexical change (new vocabulary, shifts in the meanings of words)
Syntactic change (including the influence of the author’s first language on Latin syntax)
Topical change (the rise of new genres)
Identifying the flow of information, e.g., Cicero and Augustine influencing Petrarch; Petrarch influencing Leonardo Bruni.

6,385 Latin works in the Internet Archive, charted by date of publication.

6,385 Latin works in the Internet Archive, charted by date of composition.

“America” (1,006)

“de” (2,955,462)
An interesting pattern emerges when we look at linguistic features over these two thousand years. This chart tracks the frequency of the preposition “de” from the Classical Latin period up to the 19th century. Its use rises visibly: Latin authors use “de” more and more frequently as time goes on. But it is not just “de.”
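A minimal sketch of how a frequency curve like this might be computed, assuming each work is available as a (year, text) pair. The function name `relative_frequency_by_century`, the whitespace tokenization, and the toy data are illustrative assumptions, not the pipeline behind the chart.

```python
from collections import defaultdict

def relative_frequency_by_century(docs, word):
    """Return {century_start_year: occurrences of `word` per 1,000 tokens}.

    `docs` is an iterable of (year, text) pairs (a hypothetical input format).
    """
    hits = defaultdict(int)    # occurrences of `word` in each century
    totals = defaultdict(int)  # all tokens in each century
    for year, text in docs:
        century = (year // 100) * 100   # e.g. 1350 -> 1300, -43 -> -100
        tokens = text.lower().split()   # naive whitespace tokenization
        hits[century] += tokens.count(word)
        totals[century] += len(tokens)
    return {c: 1000 * hits[c] / totals[c] for c in sorted(totals) if totals[c]}

# Toy data, invented for illustration only.
docs = [
    (-43, "arma virumque cano troiae qui primus ab oris"),
    (1320, "de monarchia liber de rebus publicis"),
    (1350, "de vita solitaria"),
]
freq = relative_frequency_by_century(docs, "de")
```

Normalizing by the number of tokens in each period matters here because the works are spread very unevenly across the centuries; raw counts would mostly reflect how much text survives from each period.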

“ad” (3,655,191)

“in” (8,126,487)

“et” (9,317,773)

Vocabulary density in Latin authors from 200 BCE to 1900 CE (Type-Token Ratio)
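The type-token ratio is the number of distinct word forms (types) divided by the number of running words (tokens). A minimal sketch follows; the windowed variant is an added assumption, since the slide does not say how texts of different lengths were made comparable, and the function names are hypothetical.

```python
def type_token_ratio(tokens):
    """Distinct word forms divided by running words."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

def windowed_ttr(tokens, window=1000):
    """Mean TTR over consecutive full windows of `window` tokens.

    Plain TTR falls as a text gets longer, so averaging over fixed-size
    windows makes authors with different amounts of text comparable.
    Falls back to plain TTR when the text is shorter than one window.
    """
    chunks = [tokens[i:i + window]
              for i in range(0, len(tokens) - window + 1, window)]
    if not chunks:
        return type_token_ratio(tokens)
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)

tokens = "arma virumque cano arma".split()
plain = type_token_ratio(tokens)           # 3 types / 4 tokens
windowed = windowed_ttr(tokens, window=2)  # two windows, no repeats in either
```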

Intellectual Access Large-scale linguistic analysis Tracking language change in 2000 years of Latin Computational tasks to extract information from texts Automatically creating dynamic bilingual dictionaries Discovering textual allusions

Use #1: Automatically Building Bilingual Dictionaries Based on parallel text analysis: aligning source texts (here, in Greek and Latin) to translations (English, Spanish, etc.) Driven mainly by statistical machine translation for modern languages.

Parallel Text Data
The Internet Archive alone contains editions of Horace’s Odes in eight different languages.
Latin: carpe diem quam minimum credula postero (Horace, Ode 1.11)
English: Seize the present; trust tomorrow e’en as little as you may (Conington 1872)
French: Cueille le jour, et ne crois pas au lendemain (De Lisle 1887)
Early Modern French: Jouissez donc en repos du jour present, & ne vous attendez point au lendemain (Dacier 1681)
Italian: tu l’oggi goditi: e gli stolti al domani s’affidino (Chiarini 1916)
Spanish: Coge este dia, dando muy poco credito al siguiente (Campos and Minguez 1783)
Portuguese: colhe o dia, do de amanhã mui pouco confiando (Duriense 1807)
German: Pflücke des Tag’s Blüten, und nie traue dem morgenden (Schmidt 1820)

Sense Discovery
SMT based on Brown et al. (1990).
Different senses for a word in one language are translated by different words in another.
“Bank” (English):
financial institution = French “banque”
side of a river = French “rive” (e.g., la rive gauche)

Progressive Alignment
Sentence level: Moore’s Bilingual Sentence Aligner (Moore 2002) aligns sentences that are 1-1 translations of each other with high precision (98.5% on a corpus of 10K English-Hindi sentences).
Word level: MGIZA++ (Gao and Vogel 2008), a parallel version of GIZA++ (Och and Ney 2003), which implements IBM Models 1-5.
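To make the word-level step concrete, here is a minimal sketch of IBM Model 1 only, the simplest of the five models, trained with expectation maximization on sentence-aligned token lists. It is not GIZA++ or MGIZA++: it has no NULL word, no distortion or fertility parameters, and the function names and the three-sentence corpus are invented for illustration.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] = P(target | source) with EM.

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    """
    tgt_vocab = {tw for _, tgt in pairs for tw in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)          # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)            # expected pair counts
        total = defaultdict(float)            # expected counts per source word
        for src, tgt in pairs:
            for tw in tgt:
                # E-step: share each target word among the source words
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: renormalize the expected counts into probabilities
        t = defaultdict(float, {(tw, sw): c / total[sw]
                                for (tw, sw), c in count.items()})
    return t

def best_translation(t, src_word):
    """Most probable target word for `src_word` under the trained table."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == src_word}
    return max(candidates, key=candidates.get)

# Toy sentence-aligned corpus, invented for illustration only.
pairs = [
    (["arma", "virum"], ["arms", "man"]),
    (["arma", "bella"], ["arms", "wars"]),
    (["saeva", "bella"], ["savage", "wars"]),
]
t = train_ibm_model1(pairs)
```

No sentence pair says which word translates which; the co-occurrence pattern across pairs is what lets EM prefer "arms" for "arma" and "wars" for "bella". Collecting the target words with the highest probability for a source word is one simple way to read candidate dictionary senses off such a table.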

Multilingual Alignment Word-level alignment of Homer’s Odyssey

Interlinear translations

Interlinear translations

Latin/Greek → English Senses

English → Greek/Latin Senses

Automatic Bilingual Dictionaries http://nlp.perseus.tufts.edu/lexicon

Use #2: Allusion detection Given a large collection of texts, we can apply computational techniques to look at all pairs of sentences in a collection and determine which are most similar (however we define similarity). --- “Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation ...” (Martin Luther King, Jr. 1963). “Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal” (Abraham Lincoln, 1863).

Classical allusion
Arma virumque cano (Vergil, Aeneid 1.1) (Arms and the man I sing)
μῆνιν ἄειδε θεὰ (Homer, Iliad 1.1) (Rage sing, goddess)
ἄνδρα μοι ἔννεπε, μοῦσα (Homer, Odyssey 1.1) (Man me tell, Muse)
Of man’s first disobedience, and the fruit Of that forbidden tree, whose mortal taste Brought death into the world, and all our woe, With loss of Eden, till one greater Man Restore us, and regain the blissful seat, Sing, heavenly Muse (Milton, Paradise Lost 1.1-6)

Allusion in Latin poetry Arma virumque cano … (Vergil, Aen. 1.1) (“I sing of arms and man”) Arma gravi numero violentaque bella parabam Edere … (Ovid, Amores 1.1-2). (“I was planning to write about arms and violent wars in a heavy meter”) First, we need to identify the variables to look for: what defines similarity?

#1: Identical words Arma gravi numero violentaque bella parabam Edere … (Ovid, Amores 1.1-2). (“I was planning to write about arms and violent wars in a heavy meter”) Arma virumque cano … (Vergil, Aen. 1.1) (“I sing of arms and man”)

#2: Word order Arma gravi numero violentaque bella parabam Edere … (Ovid, Amores 1.1-2). (“I was planning to write about arms and violent wars in a heavy meter”) Arma virumque cano … (Vergil, Aen. 1.1) (“I sing of arms and man”)

#3: Syntax Arma -que bella edere (Ovid) Arma virumque cano (Vergil)

#4: Meter/phonetic similarity Ārmă grăvī nŭmĕrō || … Ārmă vĭrūmqŭe cănō || …

#5: Semantic similarity Arma gravi numero violentaque bella parabam Edere … (Ovid, Amores 1.1-2). (“I was planning to write about arms and violent wars in a heavy meter”) Arma virumque cano … (Vergil, Aen. 1.1) (“I sing of arms and man”) Both are about war (violenta bella) and the instruments of war (arma).

Translate traditional variables into computational terms
Identical words = token similarity
Word order = ngram similarity
Syntax = dependency tree similarity
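The first two variables can be sketched in a few lines. Jaccard overlap is used here as one possible similarity measure; it is an assumption, since the slides do not name the measure. Dependency tree similarity needs parsed input and is left out.

```python
def ngrams(tokens, n):
    """Set of contiguous n-token sequences."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def token_similarity(s1, s2):
    """Identical words, ignoring order."""
    return jaccard(set(s1), set(s2))

def ngram_similarity(s1, s2, n=2):
    """Identical word sequences, which also captures word order."""
    return jaccard(ngrams(s1, n), ngrams(s2, n))

ovid = "arma gravi numero violentaque bella parabam edere".split()
vergil = "arma virumque cano".split()

tok = token_similarity(ovid, vergil)  # only "arma" is shared
big = ngram_similarity(ovid, vergil)  # no bigram is shared
```

On this pair the surface measures are weak: a single shared token and no shared bigram. That is one reason to add position, syntax, meter, and meaning as further variables.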

Allusion Discovery
Test corpus of Latin poets from the Perseus Digital Library. Data syntactically parsed using McDonald et al.’s MSTParser (2005), trained on data from the Latin Dependency Treebank.

Author        Words      Sentences
Ovid          141,091    10,459
Vergil         97,495     6,553
Horace         35,136     2,345
Catullus       14,793       903
Propertius      4,867       366
Total         293,382    20,626

Discovery
nulli illum iuvenes, nullae tetigere puellae (Ov., Met. 3.353)
“No youths, no girls touched him.”
idem cum tenui carptus defloruit ungui nulli illum pueri, nullae optavere puellae (Cat., Carm. 62)
“This same one withered when plucked by a slender nail; no boys, no girls hope for it.”

Discovery

Variable                        TF/IDF
nullae:puellae:ATR              9.24
nullae:puellae
nulli/illum
p:SBJ_EXD_OBJ_CO:u:COORD:v
,/nullae                        8.84
nullus1:puella1                 8.55
...
nulli                           6.30
puellae                         5.55
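A minimal sketch of the weighting idea: features shared by two sentences count for more when they are rare in the collection. The exact formula behind the scores on the slide is not given, so this uses a plain log inverse document frequency over unigram and bigram features only. The syntactic and lemma features are omitted, and all function names are hypothetical.

```python
import math
from collections import Counter

def features(tokens):
    """Unigram features plus adjacent-pair features written as 'a/b'."""
    unigrams = set(tokens)
    bigrams = {f"{a}/{b}" for a, b in zip(tokens, tokens[1:])}
    return unigrams | bigrams

def idf_table(sentences):
    """log(N / df) for every feature seen in the collection."""
    n = len(sentences)
    df = Counter(f for s in sentences for f in features(s))
    return {f: math.log(n / d) for f, d in df.items()}

def rank_by_shared_idf(query, sentences, idf):
    """Score each sentence by the summed IDF of the features it shares
    with `query`; return (score, sentence_index) pairs, best first."""
    q = features(query)
    scored = [(sum(idf.get(f, 0.0) for f in q & features(s)), i)
              for i, s in enumerate(sentences)]
    return sorted(scored, reverse=True)

# Small collection built from lines quoted in these slides.
collection = [
    "nulli illum pueri nullae optavere puellae".split(),
    "arma virumque cano troiae qui primus ab oris".split(),
    "arma procul currusque virum miratur inanes".split(),
    "quid tibi de turba narrem numeroque virorum".split(),
]
idf = idf_table(collection)
query = "nulli illum iuvenes nullae tetigere puellae".split()
ranked = rank_by_shared_idf(query, collection, idf)
```

With the Ovid line as the query, the Catullus line ranks first: it shares four tokens and the pair nulli/illum, each of which occurs in only one sentence of this small collection.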

Arma gravi numero ...
Arma gravi numero violentaque bella parabam Edere ... (Ov., Amores 1.1)
1. Arma procul currusque virum miratur inanes (.059) (Verg., Aen. 6.651) - “At a distance he marvels at the arms and the shadowy chariots of men”
2. Quid tibi de turba narrem numeroque virorum (.042) (Ov., Ep. 16.183) - “What could I tell you of the crowd and the number of men?”
11. Arma virumque cano, Troiae qui primus ab oris Italiam, fato profugus, Laviniaque venit litora, multum ille et terris iactatus et alto vi superum saevae memorem Iunonis ob iram (.025) (Aen. 1.1) - “I sing of arms and the man ...

Summary: elements of computational philology

Tomorrow
II. Linguistic Annotation of Classical Texts: how traditional (non-computational) scholars in Classical Studies can get involved in digital philological projects.