Linguistic Annotation of Classical Texts


1 Linguistic Annotation of Classical Texts
David Bamman The Perseus Project

2 Classical Philology “We should not fail to hear the almost benevolent nuances which for a Greek noble, for example, lie in all the words with which he set himself above the lower people—how a constant type of pity, consideration, and forbearance is mixed in there, sweetening the words, to the point where almost all words which refer to the common man finally remain as expressions for "unhappy," "worthy of pity" (compare deilos [cowardly], deilaios [lowly, mean], ponêros [oppressed by toil, wretched], mochthêros [suffering, wretched]—the last two basically designating the common man as a slave worker and beast of burden).” F. Nietzsche, Genealogy of Morals 1.10

3 Historical treebanks Most recent research and investment in treebanks has focused on modern languages, but treebanks for historical languages are now arising as well: Middle English (Kroch and Taylor 2000) Medieval Portuguese (Rocio et al. 2000) Classical Chinese (Huang et al. 2002) Old English (Taylor et al. 2003) Early Modern English (Kroch et al. 2004) Latin (Bamman and Crane 2006, Passarotti 2007) Ugaritic (Zemánek 2007) New Testament Greek, Latin, Gothic, Armenian, Church Slavonic (Haug and Jøhndal 2008)

4 Design Latin and Greek are heavily inflected languages with a high degree of variability in their word order: constituents of sentences are often broken up by elements of other constituents, as in ista meam norit gloria canitiem (“that glory will know my old age”).

5 Design This high level of non-projectivity has encouraged us to base our annotation style on that used by the Prague Dependency Treebank (PDT) for Czech (another non-projective language), while tailoring it for Latin via the grammar of Pinkster (1990). In contrast to the phrase-structure style annotation of other treebanks (e.g., the Penn Treebank), the PDT annotation is based on the dependency grammar of Mel’cuk (1988), which links words to their immediate heads without any intervening non-terminal phrasal categories.
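The dependency analysis just described can be made concrete with a small sketch. The following is illustrative only (not Perseus code): it encodes the Propertius example as (id, form, head, relation) tuples with hypothetical PDT-style label assignments, and flags crossing arcs using a simplified projectivity check (a word between a dependent and its head whose own head lies outside that span).

```python
# Sketch of a PDT-style dependency analysis of
# "ista meam norit gloria canitiem" (label assignments are hypothetical).
# Each token: (id, form, head id, relation); head 0 is the artificial root.
sentence = [
    (1, "ista",     4, "ATR"),   # "that"     -> gloria
    (2, "meam",     5, "ATR"),   # "my"       -> canitiem
    (3, "norit",    0, "PRED"),  # "will know" (root)
    (4, "gloria",   3, "SBJ"),   # "glory"    -> norit
    (5, "canitiem", 3, "OBJ"),   # "old age"  -> norit
]

def non_projective_arcs(sent):
    """Return arcs (dep, head) that cross another arc: some word strictly
    between dep and head has its own head outside the [dep, head] span.
    (A simplified check; full projectivity uses transitive dominance.)"""
    heads = {tid: head for tid, _, head, _ in sent}
    bad = []
    for tid, _, head, _ in sent:
        if head == 0:
            continue
        lo, hi = sorted((tid, head))
        for k in range(lo + 1, hi):
            if not (lo <= heads[k] <= hi):
                bad.append((tid, head))
                break
    return bad

print(non_projective_arcs(sentence))  # both ATR arcs cross: [(1, 4), (2, 5)]
```

Both attributive arcs are non-projective, which is exactly why a dependency scheme suits this material better than phrase structure.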

6 Tagset
PRED: predicate
SBJ: subject
OBJ: object
ATR: attribute
ADV: adverbial
ATV/AtvV: complement
PNOM: predicate nominal
OCOMP: object complement
COORD: coordinator
APOS: apposing element
AuxP: preposition
AuxC: conjunction
AuxR: reflexive passive
AuxV: auxiliary verb
AuxX: commas
AuxG: bracketing punctuation
AuxK: terminal punctuation
AuxY: sentence adverbials
AuxZ: emphasizing particles
ExD: ellipsis

7 Latin Dependency Treebank
Author: Words
Caesar: 1,488
Cicero: 6,229
Sallust: 12,311
Vergil: 2,613
Jerome: 8,382
Ovid: 4,789
Petronius: 12,474
Propertius: 4,857
Total: 53,143

8 Ancient Greek Dependency Treebank
Work: Words
Aeschylus (complete): 48,158
Hesiod, Works and Days: 6,303
Homer, Iliad: 38,390
Homer, Odyssey: 99,353
Total: 192,204

9 Building Treebanks
Solicit annotations from two independent annotators; reconcile differences between them.
Background: ranges from advanced undergraduates to PhDs and professors, with the majority being students in graduate programs in Classics.
Average speed: 124 words per hour.
Interannotator accuracy: attachment (ATT), label (LAB), labeled attachment (LABATT):
Hesiod, W&D: ATT 85.1%, LAB 85.9%, LABATT 79.5%
Homer, Iliad: ATT 87.1%, LAB 83.2%, LABATT 79.3%
Homer, Odyssey: ATT 87.5%, LAB 85.7%, LABATT 80.9%
Total: ATT 87.4%, LAB 85.3%, LABATT 80.6%
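The three agreement figures above are straightforward to compute. A minimal sketch (the data format is assumed, not the project's actual one): each annotation maps a token id to a (head, label) pair, and the scores are the fractions of tokens on which the two annotators agree.

```python
# Two hypothetical annotations of the same five-token sentence:
# token id -> (head id, relation label). Annotator B disagrees on token 2.
ann_a = {1: (4, "ATR"), 2: (5, "ATR"), 3: (0, "PRED"), 4: (3, "SBJ"), 5: (3, "OBJ")}
ann_b = {1: (4, "ATR"), 2: (3, "ADV"), 3: (0, "PRED"), 4: (3, "SBJ"), 5: (3, "OBJ")}

def agreement(a, b):
    """Return (ATT, LAB, LABATT): fraction of tokens with the same head,
    the same label, and both head and label identical."""
    n = len(a)
    att = sum(a[t][0] == b[t][0] for t in a) / n
    lab = sum(a[t][1] == b[t][1] for t in a) / n
    labatt = sum(a[t] == b[t] for t in a) / n
    return att, lab, labatt

att, lab, labatt = agreement(ann_a, ann_b)
print(f"ATT {att:.0%}  LAB {lab:.0%}  LABATT {labatt:.0%}")  # ATT 80%  LAB 80%  LABATT 80%
```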

10 Perseus Digital Library

11 TEI XML Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit. Horum omnium fortissimi sunt … (“The Garonne river separates the Gauls from the Aquitani, and the Marne and the Seine (rivers) separate them from the Belgae. The bravest of all of these are …”)

12 Treebank Annotation

13 Treebank Annotation Graphical editor: build a syntactic annotation by dragging and dropping each word onto its syntactic head.

14 Annotator forum

15 Class treebanking Currently in use at 8 universities, from the United States to Australia.

16 Perseus Digital Library

17 Perseus Digital Library

18 Undergraduate Contributions

19 Undergraduate Contributions

20 Undergraduate Contributions

21 Undergraduate Contributions
Summary: class treebanking yields lower overall accuracy than highly trained annotators, but:
Some tasks have high accuracy rates across the board (identify these to maximize undergraduate contributions)
Individual annotators have different skill levels on different kinds of tasks (use this to identify strengths and weaknesses in understanding)
The accuracy rates of individual highly trained undergraduates are comparable to those of their more advanced peers.

22 Ownership Model ...

23 Scholarly treebank Syntactically annotated corpus that reflects the interpretation of a single scholar. E.g., Mambrini's annotation of Aeschylus.

24 Treebank Data Download it yourself:

25 Use
Treebanks are not just for computer scientists and computational linguists; they provide a large dataset for many different types of inquiry by traditional researchers, including:
rhetoric
historical linguistics
lexicography/philology

26 Rhetoric Hyperbaton: the transposition of (normally projective) word order for rhetorical effect. tris notus abreptas in saxa latentia torquet, saxa vocant Itali mediis quae in fluctibus aras (Vergil, Aen.) = tris abreptas notus in saxa torquet, quae saxa in mediis fluctibus latentia Itali aras vocant (Donatus, De Tropis)

29 Rhetoric Prepositional hyperbaton, where an adjective occurs to the left of a prepositional phrase that contains its head: “memorem ... ob iram” “magna cum laude” Less frequently, prepositional hyperbaton occurs when the object of the preposition itself occurs outside of the prepositional phrase: “iram ... ob memorem”

30 Rhetoric
A treebank lets us locate and measure the rates of this kind of hyperbaton, since it encodes the linear order of words along with their syntactic dependents. Hyperbaton is the interaction of these two variables.
Prepositional phrase hyperbaton rates (Vergil, n=90; Cicero, n=45; Caesar, n=138; Jerome, n=540):
Vergil: adj < prep < noun 40.0%; noun < prep < adj 15.6%
Cicero: adj < prep < noun 8.9%; noun < prep < adj 0%
Caesar: adj < prep < noun 2.2% (other value missing)
Jerome: (values missing)
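Concretely, the search is a filter over (position, head, relation) triples. A sketch under assumed conventions (the token layout and tags here are invented, though hanging the noun under the preposition mirrors the treebank's AuxP style):

```python
# "memorem Iunonis ob iram": tokens as (id, form, pos, head, rel),
# with the noun "iram" attached under the preposition "ob" (AuxP).
sent = [
    (1, "memorem", "adj",  4, "ATR"),
    (2, "Iunonis", "noun", 4, "ATR"),
    (3, "ob",      "prep", 0, "AuxP"),
    (4, "iram",    "noun", 3, "ADV"),
]

def prepositional_hyperbaton(sent):
    """Find adj < prep < noun triples where the adjective modifies a noun
    that hangs under the preposition: the "memorem ... ob iram" pattern."""
    tok = {t[0]: t for t in sent}
    hits = []
    for tid, form, pos, head, rel in sent:
        if pos != "adj" or head not in tok:
            continue
        noun = tok[head]
        if noun[2] != "noun" or noun[3] not in tok:
            continue
        prep = tok[noun[3]]
        if prep[2] == "prep" and tid < prep[0] < noun[0]:
            hits.append((form, prep[1], noun[1]))
    return hits

print(prepositional_hyperbaton(sent))  # [('memorem', 'ob', 'iram')]
```

Counting hits of this filter per author, divided by the total number of prepositional phrases, yields rate tables like the one above.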

31 Historical linguistics
Classical Latin word order is generally thought to be SOV (though see Pinkster 1991 for an opposing argument), but it eventually transformed into the SVO order found in the modern Romance languages. Treebanks again provide an excellent resource for addressing this question statistically: they can provide a quantitative and reproducible figure that can suggest potentially fruitful directions for further qualitative analysis.

32 Historical linguistics
Word order rates by author (sentences with overt subjects and objects). Cicero, n=19; Caesar, n=17; Vergil, n=48; Jerome, n=127.
SVO: Cicero 5.3%, Caesar 0%, Vergil 20.8%, Jerome 68.5%
SOV: Cicero 26.3%, Caesar 64.7%, Vergil 18.8%, Jerome 4.7%
VSO: 6.3%, 16.5% (remaining values missing)
VOS: 10.4%, 3.1% (remaining values missing)
OSV: Cicero 52.6%, Caesar 35.3%, Vergil 25.0%, Jerome 3.9%
OVS: 10.5% (remaining values missing)

33 Historical linguistics
Word order rates by author (sentences with one zero-anaphor).
OV: Cicero 68.2%, Caesar 95.2%, Vergil 56.2%, Jerome 13.9%
VO: Cicero 31.8%, Caesar 4.8%, Vergil 43.8%, Jerome 86.1%
(OV/VO: Cicero, n=44; Caesar, n=63; Vergil, n=121; Jerome, n=309)
SV: Cicero 75.9%, Caesar 86.7%, Vergil 53.6%, Jerome 65.8%
VS: Cicero 24.1%, Caesar 13.3%, Vergil 46.4%, Jerome 34.2%
(SV/VS: Cicero, n=58; Caesar, n=90; Vergil, n=97; Jerome, n=404)
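The counts in these tables reduce to a simple classification: sort the verb and its overt subject and object dependents by sentence position. A sketch (the positions below are invented for the Caesar example from the TEI slide, where Gallos is the object, Garumna flumen the subject, and dividit the clause-final verb):

```python
def word_order(verb_pos, subj_pos, obj_pos):
    """Classify a clause as SVO, SOV, OSV, etc., from the linear
    positions of the verb and its overt SBJ and OBJ dependents."""
    slots = sorted([("V", verb_pos), ("S", subj_pos), ("O", obj_pos)],
                   key=lambda x: x[1])
    return "".join(label for label, _ in slots)

# "Gallos ab Aquitanis Garumna flumen ... dividit":
# object first, subject mid-sentence, verb last.
print(word_order(verb_pos=7, subj_pos=4, obj_pos=1))  # OSV
```

Running this over every clause with an overt subject and object, grouped by author, produces the percentages above; dropping one argument gives the two-way OV/VO and SV/VS tables.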

34 Collaboration
Index Thomisticus
PROIEL
Merge datasets
Share experience/resources
Establish a common method of annotation

35 The Index Thomisticus 42,889 words of Thomas Aquinas's Scriptum super Sententiis Magistri Petri Lombardi

36 The PROIEL Corpus http://www.hf.uio.no/ifikk/proiel/
Language (Text): Completed / In Progress
Slavic (Codex Marianus): 4,742 / 12,804
Latin (Vulgate): 12,414 / 87,747
Greek (New Testament): 11,439 / 94,839
Gothic (Bible): 1,654 / 8,305

37 Collaboration: Comparison
With a single treebank, we can conduct synchronic linguistic or stylistic studies With multiple treebanks, we can conduct diachronic studies, measuring the evolution of linguistic phenomena over time. E.g.: Indirect discourse in Latin ACI vs. Quod clauses

38 Accusativus cum Infinitivo
tu es contentus (“you are content”) dicebas te esse contentum (“you said that you were content”)

39 Quod/quia clauses

40 ACI -> Quod/quia
Existing studies (Mayen 1889, Cuzzolin 1991/1994, Herman 1989), ratio ACI:Quod:
Tertullian (c. 235 CE): 33:1
Cyprian (c. 258 CE): 12:1
Lucifero di Cagliari (c. 370 CE): 6:1

41 Methodology ACI = infinitive verb (or accusative participle with optional “esse”) dependent on head via SBJ or OBJ Accusative SBJ optional in Latin (pro-drop language) Inspection still necessary to prune modal and prolative infinitives (dependent on “can” or “begin” etc.) Quod/quia clause = verbs dependent via argument relation (SBJ/OBJ) on quod/quia, as opposed to adjunct relation of (e.g.) causal clauses (ADV). Divided all verbs into two classes: verbs of thinking and saying (“you say that you are content”) and impersonal verbs (e.g., “going to the store is permitted”)
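The extraction rule above can be sketched as a filter over dependency arcs. Everything here is illustrative (the token layout is invented, and the manual pruning of modal/prolative infinitives is omitted): an ACI is an infinitive hanging on its governor as SBJ/OBJ, while a quod/quia clause is a finite verb hanging as SBJ/OBJ on a quod/quia conjunction.

```python
def classify_complements(sent):
    """Count ACI vs. quod/quia complements in one sentence.
    Tokens: (id, lemma, pos, mood, head, rel)."""
    tok = {t[0]: t for t in sent}
    aci = quod = 0
    for tid, lemma, pos, mood, head, rel in sent:
        if pos != "verb" or rel not in ("SBJ", "OBJ"):
            continue  # only verbs in argument (not adjunct) positions
        gov = tok.get(head)
        if mood == "infinitive":
            aci += 1
        elif gov and gov[1] in ("quod", "quia"):
            quod += 1
    return aci, quod

# dicebas te esse contentum: the infinitive "esse" is the OBJ of "dicebas"
sent = [
    (1, "dico",      "verb", "indicative", 0, "PRED"),
    (2, "tu",        "pron", "-",          3, "SBJ"),
    (3, "sum",       "verb", "infinitive", 1, "OBJ"),
    (4, "contentus", "adj",  "-",          3, "PNOM"),
]
print(classify_complements(sent))  # (1, 0): one ACI, no quod/quia clause
```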

42 Results
Verbs of saying and thinking (ACI / Quod,quia / % ACI):
Classical authors: 182 / 1 / 99.5%
Jerome: 3 / 9 / 25.0%
Aquinas: 35 / 80 / 30.4%
Impersonal verbs (ACI / Quod,quia / % ACI):
Classical authors: 33 / 1 / 97.1%
Jerome: 15 / 0 / 100%
Aquinas: 27 / 72 / 27.3%

43 Transparency Kühner himself reported the number of passages he counted: “So hat nach meiner Zählung bei doleo 57 Stellen mit Acc. c. Inf. gegen 4 quod, bei miror 110 gegen 8, bei glorior 19 gegen 2, bei queror 71 gegen 15, bei gaude 84 gegen 9 usw.” (1914:77) (“Thus by my count doleo has 57 passages with Acc. c. Inf. against 4 with quod, miror 110 against 8, glorior 19 against 2, queror 71 against 15, gaudeo 84 against 9, etc.”), although it is difficult to say what he meant by the word “Stelle” (“passage”) and impossible to say which texts his counting is based upon. P. Cuzzolin (1991)

44 Dynamic Lexicon

45 Inducing selectional preferences
Trained a parser on our Greek and Latin treebanks.
Parsed all the texts in our 3.5M-word Latin corpus and 4.9M-word Greek corpus.
To find selectional preferences in this noisy data, we use hypothesis tests (log likelihood, etc.) to locate syntactic collocations.
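One common choice for such a hypothesis test is Dunning's log-likelihood ratio (G²) over a 2x2 contingency table of (verb, object) arc counts from the parsed corpus. A sketch with invented counts (assumes no zero cells):

```python
import math

def g2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio for a 2x2 contingency table.
    k11: verb-object pair count; k12: verb with other objects;
    k21: object with other verbs; k22: all remaining arcs."""
    def ll(k, n, p):
        # log-likelihood of k successes in n trials at rate p
        return k * math.log(p) + (n - k) * math.log(1.0 - p)
    n1, n2 = k11 + k12, k21 + k22
    p1, p2, p = k11 / n1, k21 / n2, (k11 + k21) / (n1 + n2)
    return 2.0 * (ll(k11, n1, p1) + ll(k21, n2, p2)
                  - ll(k11, n1, p) - ll(k21, n2, p))

# Invented counts: a verb-object pair seen 20 times, far above chance.
print(round(g2(20, 180, 30, 99770), 1))
```

High G² scores (like the 254.2 for do + opera on the next slide) mark objects that co-occur with a verb far more often than chance predicts; independent counts score near zero.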

46 Dynamic Lexicon: do (to give)
Latin / English / OLD def. / Log score:
Opera: service (= take pains), 22c, 254.2
Obses: hostage, 11a, 21.8
Signum: sign, -, 12.6
Velum: sail (= set sail), 18f, 7.9
Pecunia: money (= pay), 6a, 7.3
Negotium: business, (missing), 6.2
Poena: penalty, 7b, 5.6
Possessio: possession, 1c, 4.8
Littera: letter (for delivery), 10a, 4.3

47 Dynamic Lexicon: do (to give)
Strongest OBJ of do in Caesar (Latin / English / log score):
Obses: hostage, 18.4
Opera: service, 11.9
Suspicio: suspicion, 2.2
Facultas: faculty, 1.8
Signum: sign, 1.5
Strongest OBJ of do in Jerome:
Potus: drink, 16.6
Esca: food, 3.3
Requies: rest, 3.0
Gloria: glory, 2.4
Terra: earth, 1.6
Strongest OBJ of do in Ovid:
Osculum: kiss, 8.5
Velum: sail, 5.9
Munus: gift, 3.5
Signum: sign, 2.6

48 How do you get involved? Pick a text. Come tomorrow!

