Linguistic Annotation of Classical Texts

Linguistic Annotation of Classical Texts David Bamman The Perseus Project http://nlp.perseus.tufts.edu/docs/xxisnec/slides/2.annotation.pdf

Classical Philology “We should not fail to hear the almost benevolent nuances which for a Greek noble, for example, lie in all the words with which he set himself above the lower people—how a constant type of pity, consideration, and forbearance is mixed in there, sweetening the words, to the point where almost all words which refer to the common man finally remain as expressions for "unhappy," "worthy of pity" (compare deilos [cowardly], deilaios [lowly, mean], ponêros [oppressed by toil, wretched], mochthêros [suffering, wretched]—the last two basically designating the common man as a slave worker and beast of burden).” F. Nietzsche, Genealogy of Morals 1.10

Historical treebanks
Most recent research and investment in treebanks has focused on modern languages, but treebanks for historical languages are now arising as well:
  Middle English (Kroch and Taylor 2000)
  Medieval Portuguese (Rocio et al. 2000)
  Classical Chinese (Huang et al. 2002)
  Old English (Taylor et al. 2003)
  Early Modern English (Kroch et al. 2004)
  Latin (Bamman and Crane 2006, Passarotti 2007)
  Ugaritic (Zemánek 2007)
  New Testament Greek, Latin, Gothic, Armenian, Church Slavonic (Haug and Jøhndal 2008)

Design Latin and Greek are heavily inflected languages with a high degree of variability in their word order: constituents of sentences are often broken up by elements of other constituents, as in ista meam norit gloria canitiem (“that glory will know my old age”).

Design This high level of non-projectivity has encouraged us to base our annotation style on that used by the Prague Dependency Treebank (PDT) for Czech (another non-projective language), while tailoring it for Latin via the grammar of Pinkster (1990). In contrast to the phrase-structure annotation of other treebanks (e.g., the Penn Treebank), the PDT annotation is based on the dependency grammar of Mel'čuk (1988), which links words to their immediate heads without any intervening non-terminal phrasal categories.
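A minimal sketch of what non-projectivity means in a dependency tree, using the example ista meam norit gloria canitiem. The head assignments below are an illustrative analysis (ista modifies gloria, meam modifies canitiem), not data taken from the treebank, and the function names are hypothetical. An arc counts as non-projective when some word lying between the head and its dependent is not dominated by that head.

```python
def is_dominated_by(heads, node, ancestor):
    """Walk up the head chain from `node`; position 0 is the artificial root."""
    while node != 0:
        if node == ancestor:
            return True
        node = heads[node]
    return False


def nonprojective_arcs(heads):
    """`heads` maps each 1-based word position to the position of its head."""
    arcs = []
    for dep, head in heads.items():
        if head == 0:
            continue  # the root has no incoming arc to test
        lo, hi = sorted((dep, head))
        # Every word strictly between head and dependent must descend from the head.
        if any(not is_dominated_by(heads, k, head) for k in range(lo + 1, hi)):
            arcs.append((head, dep))
    return arcs


words = ["ista", "meam", "norit", "gloria", "canitiem"]
# Assumed analysis: ista -> gloria, meam -> canitiem, gloria and canitiem -> norit.
heads = {1: 4, 2: 5, 3: 0, 4: 3, 5: 3}

crossing = nonprojective_arcs(heads)
for head, dep in crossing:
    print(f"non-projective: {words[head - 1]} -> {words[dep - 1]}")
```

Under this analysis both adjective arcs (gloria to ista, canitiem to meam) are flagged, while the subject and object arcs from norit are projective.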

Tagset
  PRED       predicate
  SBJ        subject
  OBJ        object
  ATR        attribute
  ADV        adverbial
  ATV/AtvV   complement
  PNOM       predicate nominal
  OCOMP      object complement
  COORD      coordinator
  APOS       apposing element
  AuxP       preposition
  AuxC       conjunction
  AuxR       reflexive passive
  AuxV       auxiliary verb
  AuxX       commas
  AuxG       bracketing punctuation
  AuxK       terminal punctuation
  AuxY       sentence adverbials
  AuxZ       emphasizing particles
  ExD        ellipsis

Latin Dependency Treebank
  Author       Words
  Caesar       1,488
  Cicero       6,229
  Sallust      12,311
  Vergil       2,613
  Jerome       8,382
  Ovid         4,789
  Petronius    12,474
  Propertius   4,857
  Total        53,143

Ancient Greek Dependency Treebank
  Work                     Words
  Aeschylus (complete)     48,158
  Hesiod, Works and Days   6,303
  Homer, Iliad             38,390
  Homer, Odyssey           99,353
  Total                    192,204

Building Treebanks
Solicit annotations from two independent annotators; reconcile differences between them.
Background: annotators range from advanced undergraduates to PhDs and professors, with the majority being students in graduate programs in Classics.
Average speed: 124 words per hour.
Interannotator accuracy for attachment (ATT), label (LAB), and labeled attachment (LABATT):
                   ATT     LAB     LABATT
  Hesiod, W&D      85.1%   85.9%   79.5%
  Homer, Iliad     87.1%   83.2%   79.3%
  Homer, Odyssey   87.5%   85.7%   80.9%
  Total            87.4%   85.3%   80.6%
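The three agreement figures can be sketched as follows, assuming each annotation is a list of (head, label) pairs aligned word by word. The function name and the toy annotations are hypothetical, and the slide does not say exactly how the reported percentages were computed, so this is only one plausible reading: the share of words on which two annotators agree about the head, the label, or both.

```python
def agreement(ann_a, ann_b):
    """Return (ATT, LAB, LABATT) for two aligned lists of (head, label) pairs."""
    if len(ann_a) != len(ann_b) or not ann_a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(ann_a)
    att = sum(a[0] == b[0] for a, b in zip(ann_a, ann_b)) / n      # same head
    lab = sum(a[1] == b[1] for a, b in zip(ann_a, ann_b)) / n      # same label
    labatt = sum(a == b for a, b in zip(ann_a, ann_b)) / n         # same head and label
    return att, lab, labatt


# Toy annotations of a five-word sentence (invented for illustration).
annotator_a = [(4, "ATR"), (5, "ATR"), (0, "PRED"), (3, "SBJ"), (3, "OBJ")]
annotator_b = [(4, "ATR"), (3, "ATR"), (0, "PRED"), (3, "SBJ"), (3, "ADV")]

att, lab, labatt = agreement(annotator_a, annotator_b)
print(f"ATT {att:.1%}  LAB {lab:.1%}  LABATT {labatt:.1%}")
```

LABATT can never exceed either ATT or LAB, which matches the pattern in the table above.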

Perseus Digital Library

TEI XML Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit. Horum omnium fortissimi sunt … (“The Garonne river separates the Gauls from the Aquitani, and the Marne and the Seine (rivers) separate them from the Belgae. The bravest of all of these are …”)

Treebank Annotation

Treebank Annotation Graphical editor: build a syntactic annotation by dragging and dropping each word onto its syntactic head.

Annotator forum

Class treebanking Currently being used in 8 universities from the United States to Australia.

Undergraduate Contributions

Undergraduate Contributions
Summary: Class treebanking yields lower overall accuracy than highly trained annotators, but:
  Some tasks have high accuracy rates across the board (identify these to maximize undergraduate contributions).
  Individual annotators have different skill levels on different kinds of tasks (use this to identify strengths and weaknesses in understanding).
  The accuracy rates of individual highly trained undergraduates are comparable to those of their more advanced peers.

Ownership Model ...

Scholarly treebank Syntactically annotated corpus that reflects an interpretation of a single scholar. Mambrini Aeschylus.

Treebank Data Download it yourself: http://nlp.perseus.tufts.edu/syntax/treebank/
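A minimal sketch of reading such data in Python. The XML layout below (a `sentence` element containing `word` elements with `id`, `form`, `head` and `relation` attributes) is an assumption made for illustration, so check it against the downloaded files before relying on it. The sample sentence and its analysis are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element and attribute names are assumptions.
SAMPLE = """
<treebank>
  <sentence id="1">
    <word id="1" form="ista" head="4" relation="ATR"/>
    <word id="2" form="meam" head="5" relation="ATR"/>
    <word id="3" form="norit" head="0" relation="PRED"/>
    <word id="4" form="gloria" head="3" relation="SBJ"/>
    <word id="5" form="canitiem" head="3" relation="OBJ"/>
  </sentence>
</treebank>
"""


def read_sentences(xml_text):
    """Return one list of word dictionaries per sentence element."""
    root = ET.fromstring(xml_text.strip())
    sentences = []
    for sentence in root.iter("sentence"):
        words = [
            {
                "id": int(w.get("id")),
                "form": w.get("form"),
                "head": int(w.get("head")),  # 0 marks the root of the sentence
                "relation": w.get("relation"),
            }
            for w in sentence.iter("word")
        ]
        sentences.append(words)
    return sentences


sentences = read_sentences(SAMPLE)
for word in sentences[0]:
    print(word["id"], word["form"], word["head"], word["relation"])
```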

Use
Treebanks are not just for computer scientists and computational linguists; they provide a large dataset for many different types of inquiry by traditional researchers, including:
  rhetoric
  historical linguistics
  lexicography/philology

Rhetoric
Hyperbaton: the transposition of (normally projective) word order for rhetorical effect.
  tris notus abreptas in saxa latentia torquet, saxa vocant Itali mediis quae in fluctibus aras (Vergil, Aen. 1.108-9)
  = tris abreptas notus in saxa torquet, quae saxa in mediis fluctibus latentia Itali aras vocant (Donatus, De Tropis)

Rhetoric
Prepositional hyperbaton, where an adjective occurs to the left of a prepositional phrase that contains its head: “memorem ... ob iram”, “magna cum laude”.
Less frequently, prepositional hyperbaton occurs when the object of the preposition itself occurs outside of the prepositional phrase: “iram ... ob memorem”.

Rhetoric
A treebank lets us locate and measure the rates of this kind of hyperbaton, since it encodes the linear order of words along with their syntactic dependents. Hyperbaton is the interaction of these two variables.
  Author   adj < prep < noun   noun < prep < adj
  Vergil   40.0%               15.6%
  Cicero   8.9%                0%
  Caesar   2.2%
  Jerome
Prepositional phrase hyperbaton rates. Vergil, n=90; Cicero, n=45; Caesar, n=138; Jerome, n=540.
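One way such a query might look, as a sketch under stated assumptions: in the PDT-style annotation the preposition (AuxP) is taken to head its noun, and the adjective depends on the noun as ATR. The `Token` class, the function name and the two toy fragments are hypothetical. A real query would also check part of speech, since ATR covers more than adjectives.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int        # 1-based linear position
    form: str
    head: int      # position of the head, 0 for the root
    relation: str


def prepositional_hyperbaton(tokens):
    """Find ATR modifiers separated from their noun by the noun's preposition."""
    by_id = {t.id: t for t in tokens}
    hits = []
    for mod in tokens:
        if mod.relation != "ATR":
            continue
        noun = by_id.get(mod.head)
        if noun is None:
            continue
        prep = by_id.get(noun.head)
        if prep is None or prep.relation != "AuxP":
            continue
        if mod.id < prep.id < noun.id:
            hits.append(("adj < prep < noun", mod.form, prep.form, noun.form))
        elif noun.id < prep.id < mod.id:
            hits.append(("noun < prep < adj", noun.form, prep.form, mod.form))
    return hits


# Toy fragments; the preposition is attached to the root only for illustration.
magna_cum_laude = [
    Token(1, "magna", 3, "ATR"),
    Token(2, "cum", 0, "AuxP"),
    Token(3, "laude", 2, "ADV"),
]
iram_ob_memorem = [
    Token(1, "iram", 2, "ADV"),
    Token(2, "ob", 0, "AuxP"),
    Token(3, "memorem", 1, "ATR"),
]

first = prepositional_hyperbaton(magna_cum_laude)
second = prepositional_hyperbaton(iram_ob_memorem)
print(first, second)
```

Dividing the number of hits of each type by the number of prepositional phrases per author would give rates of the kind reported in the table.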

Historical linguistics Classical Latin word order is generally thought to be SOV (though see Pinkster 1991 for an opposing argument), but it eventually shifted to the SVO order found in the modern-day Romance languages. Treebanks again provide an excellent resource for examining this shift statistically: they can provide quantitative and reproducible figures that suggest potentially fruitful directions for further qualitative analysis.

Historical linguistics
         Cicero   Caesar   Vergil   Jerome
  SVO    5.3%     0%       20.8%    68.5%
  SOV    26.3%    64.7%    18.8%    4.7%
  VSO                      6.3%     16.5%
  VOS                      10.4%    3.1%
  OSV    52.6%    35.3%    25.0%    3.9%
  OVS    10.5%
Word order rates by author (sentences with overt subjects and objects). Cicero, n=19; Caesar, n=17; Vergil, n=48; Jerome, n=127.

Historical linguistics
        Cicero   Caesar   Vergil   Jerome
  OV    68.2%    95.2%    56.2%    13.9%
  VO    31.8%    4.8%     43.8%    86.1%
        Cicero   Caesar   Vergil   Jerome
  SV    75.9%    86.7%    53.6%    65.8%
  VS    24.1%    13.3%    46.4%    34.2%
Word order rates by author (sentences with one zero-anaphor). OV/VO: Cicero, n=44; Caesar, n=63; Vergil, n=121; Jerome, n=309. SV/VS: Cicero, n=58; Caesar, n=90; Vergil, n=97; Jerome, n=404.
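Counts like these can be read off a dependency treebank by comparing the linear positions of a verb and its SBJ and OBJ dependents. The sketch below assumes each token is a dictionary with `id`, `head` and `relation`; the function name is hypothetical, both sentences are simplified illustrative analyses, and coordinated subjects or objects are ignored.

```python
from collections import Counter


def clause_orders(tokens):
    """Return one pattern such as 'SOV' per verb with exactly one SBJ and one OBJ."""
    orders = []
    for verb in tokens:
        deps = [t for t in tokens if t["head"] == verb["id"]]
        subjects = [t for t in deps if t["relation"] == "SBJ"]
        objects = [t for t in deps if t["relation"] == "OBJ"]
        if len(subjects) == 1 and len(objects) == 1:
            slots = sorted([
                (subjects[0]["id"], "S"),
                (verb["id"], "V"),
                (objects[0]["id"], "O"),
            ])
            orders.append("".join(letter for _, letter in slots))
    return orders


# Simplified from "Gallos ... Garumna flumen ... dividit" (object, subject, verb).
caesar = [
    {"id": 1, "form": "Gallos", "head": 3, "relation": "OBJ"},
    {"id": 2, "form": "Garumna", "head": 3, "relation": "SBJ"},
    {"id": 3, "form": "dividit", "head": 0, "relation": "PRED"},
]
# "ista meam norit gloria canitiem" under an assumed analysis.
propertius = [
    {"id": 1, "form": "ista", "head": 4, "relation": "ATR"},
    {"id": 2, "form": "meam", "head": 5, "relation": "ATR"},
    {"id": 3, "form": "norit", "head": 0, "relation": "PRED"},
    {"id": 4, "form": "gloria", "head": 3, "relation": "SBJ"},
    {"id": 5, "form": "canitiem", "head": 3, "relation": "OBJ"},
]

orders = clause_orders(caesar) + clause_orders(propertius)
counts = Counter(orders)
for pattern, count in counts.items():
    print(pattern, f"{count / len(orders):.1%}")
```

The OV/VO and SV/VS tables would need a variant that accepts verbs with only one of the two arguments present.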

Collaboration
  Index Thomisticus
  PROIEL
  Merge datasets
  Share experience/resources
  Establish common method of annotation

The Index Thomisticus 42,889 words, Scriptum super Sententiis Magistri Petri Lombardi http://itreebank.marginalia.it/

The PROIEL Corpus http://www.hf.uio.no/ifikk/proiel/
  Language (Text)           Completed   In Progress
  Slavic (Codex Marianus)   4,742       12,804
  Latin (Vulgate)           12,414      87,747
  Greek (New Testament)     11,439      94,839
  Gothic (Bible)            1,654       8,305

Collaboration: Comparison
With a single treebank, we can conduct synchronic linguistic or stylistic studies.
With multiple treebanks, we can conduct diachronic studies, measuring the evolution of linguistic phenomena over time.
E.g.: indirect discourse in Latin, ACI vs. quod clauses.

Accusativus cum Infinitivo
  tu es contentus (“you are content”)
  dicebas te esse contentum (“you said that you were content”)

Quod/quia clauses

ACI -> Quod/quia
Existing studies (Mayen 1889, Cuzzolin 1991/1994, Herman 1989)
  Author                 Date        Ratio ACI:Quod
  Tertullian             c. 235 CE   33:1
  Cyprian                c. 258 CE   12:1
  Lucifero di Cagliari   c. 370 CE   6:1

Methodology
ACI = infinitive verb (or accusative participle with optional “esse”) dependent on its head via SBJ or OBJ.
  The accusative SBJ is optional in Latin (a pro-drop language).
  Inspection is still necessary to prune modal and prolative infinitives (dependent on “can” or “begin” etc.).
Quod/quia clause = verb dependent on quod/quia via an argument relation (SBJ/OBJ), as opposed to the adjunct relation (ADV) of, e.g., causal clauses.
All verbs were divided into two classes: verbs of thinking and saying (“you say that you are content”) and impersonal verbs (e.g., “going to the store is permitted”).
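The two definitions above might be turned into a first-pass filter along these lines. This is a sketch, not the procedure actually used: the `Token` fields, the `is_infinitive` flag and the two-lemma stop-list (possum “can”, incipio “begin”) are assumptions standing in for real morphological tags and for the manual inspection the slide calls for. Apart from dicebas te esse contentum, the example sentences are constructed.

```python
from dataclasses import dataclass

ARGUMENT = {"SBJ", "OBJ"}
# Hypothetical stop-list; the slide only mentions "can" and "begin" as examples.
MODAL_OR_PROLATIVE = {"possum", "incipio"}


@dataclass
class Token:
    id: int
    form: str
    lemma: str
    head: int
    relation: str
    is_infinitive: bool = False


def classify(tokens):
    """Return (ACI candidates, quod/quia candidates) as lists of word forms."""
    by_id = {t.id: t for t in tokens}
    aci, quod = [], []
    for t in tokens:
        head = by_id.get(t.head)
        if head is None or t.relation not in ARGUMENT:
            continue
        if t.is_infinitive:
            # Infinitive in an argument slot, unless governed by a modal verb.
            if head.lemma not in MODAL_OR_PROLATIVE:
                aci.append(t.form)
        elif head.lemma in {"quod", "quia"}:
            # Argument clause headed by quod/quia (no part-of-speech check here).
            quod.append(t.form)
    return aci, quod


aci_example = [
    Token(1, "dicebas", "dico", 0, "PRED"),
    Token(2, "te", "tu", 3, "SBJ"),
    Token(3, "esse", "sum", 1, "OBJ", True),
    Token(4, "contentum", "contentus", 3, "PNOM"),
]
modal_example = [  # constructed: "you can come"
    Token(1, "potes", "possum", 0, "PRED"),
    Token(2, "venire", "venio", 1, "OBJ", True),
]
quod_example = [  # constructed: "he said that he came"
    Token(1, "dixit", "dico", 0, "PRED"),
    Token(2, "quod", "quod", 1, "AuxC"),
    Token(3, "venit", "venio", 2, "OBJ"),
]

for sentence in (aci_example, modal_example, quod_example):
    print(classify(sentence))
```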

Results
Verbs of saying and thinking
  Author              ACI   Quod/quia   Ratio
  Classical authors   182   1           99.5%
  Jerome              3     9           25.0%
  Aquinas             35    80          30.4%
Impersonal verbs
  Author              ACI   Quod/quia   Ratio
  Classical authors   33    1           97.1%
  Jerome              15                100%
  Aquinas             27    72          27.3%
Ratio = ACI / (ACI + Quod/quia).

Transparency Kühner himself reported the number of passages he counted: “So hat nach meiner Zählung bei doleo 57 Stellen mit Acc. c. Inf. gegen 4 quod, bei miror 110 gegen 8, bei glorior 19 gegen 2, bei queror 71 gegen 15, bei gaudeo 84 gegen 9 usw.” [“Thus, by my count, doleo has 57 passages with acc. c. inf. against 4 with quod, miror 110 against 8, glorior 19 against 2, queror 71 against 15, gaudeo 84 against 9, etc.”] (1914:77), although it is difficult to say what he meant by the word “Stelle” and impossible to say which texts his counting is based upon. P. Cuzzolin (1991)

Dynamic Lexicon

Inducing selectional preferences
Trained a parser on our Greek and Latin treebanks.
Parsed all the texts in our 3.5M word Latin corpus and 4.9M word Greek corpus.
To find selectional preferences from this noisy data, we use hypothesis tests (log likelihood etc.) to locate syntactic collocations.
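As one example of such a test, the sketch below computes the log-likelihood ratio statistic (G²) for a 2x2 contingency table of verb and object counts. The slide does not say which variant was used or how the reported scores were scaled, so this is an assumption, and the counts are invented for illustration.

```python
import math


def log_likelihood(k11, k12, k21, k22):
    """G^2 for a 2x2 table.

    k11: the verb with this object      k12: the verb with other objects
    k21: other verbs with this object   k22: other verbs with other objects
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Invented counts: a verb that takes one noun as object far more often than chance.
strong = log_likelihood(30, 70, 20, 880)
# Invented counts: no association between verb and object.
none = log_likelihood(10, 10, 10, 10)
print(round(strong, 1), round(none, 1))
```

Ranking a verb's objects by this score pushes strongly associated pairs to the top. Because the parsed data is noisy, high scores are best read as candidates for inspection.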

Dynamic Lexicon: do (to give)
  Latin       English                 OLD def.   Log score
  Opera       Service (= take pains)  22c        254.2
  Obses       Hostage                 11a        21.8
  Signum      Sign                    -          12.6
  Velum       Sail (= set sail)       18f        7.9
  Pecunia     Money (= pay)           6a         7.3
  Negotium    Business                           6.2
  Poena       Penalty                 7b         5.6
  Possessio   Possession              1c         4.8
  Littera     Letter (for delivery)   10a        4.3

Dynamic Lexicon: do (to give)
Strongest OBJ of do in Caesar
  Latin      English     Log score
  Obses      Hostage     18.4
  Opera      Service     11.9
  Suspicio   Suspicion   2.2
  Facultas   Faculty     1.8
  Signum     Sign        1.5
Strongest OBJ of do in Jerome
  Latin      English     Log score
  Potus      Drink       16.6
  Esca       Food        3.3
  Requies    Rest        3.0
  Gloria     Glory       2.4
  Terra      Earth       1.6
Strongest OBJ of do in Ovid
  Latin      English     Log score
  Osculum    Kiss        8.5
  Velum      Sail        5.9
  Munus      Gift        3.5
  Signum     Sign        2.6

How do you get involved? Pick a text. Come tomorrow!