Reconstructing Ancient Literary Texts from Noisy Manuscripts


1 Reconstructing Ancient Literary Texts from Noisy Manuscripts
Moshe Koppel, Bar Ilan
Moty Michaely, Bar Ilan
Alex Tal, University of Haifa
Presented by Bar Rosanski

2 Overview Introduction
Previous work
Formalizing the problem
The UR algorithm
Testing it out
Conclusion

3 Introduction In ancient times, original documents were written by hand and then copied by scribes. These copies were often inexact. Flawed copies spread around the world, where they were themselves imperfectly copied again. Only a small subset of these corrupted documents survived to modern times.

4 The Babylonian Talmud Firenze, 1177 Vatican, 1381

5 Republic (Plato) 3rd century 9th century

6 What’s Our Goal? Reconstruct the original text from the manuscripts.
This has traditionally been done using manual methods.

7 Manual Reconstruction Methods
The reconstruction of original text from corrupted manuscripts using manual methods can be divided into two approaches: select a single best manuscript, or create an optimal hybrid.

8 Stemma A tree diagram showing which manuscript was copied from which.
Original Text 2nd Generation 3rd Generation

9 Stemma The "stemmatic" approach is preferable when the collection of extant manuscripts for a given text is relatively complete, especially if the original text itself might be found in the collection. In the case of ancient documents, this situation is very rare.

10 Manual Reconstruction Methods
The reconstruction of original text from corrupted manuscripts using manual methods can be divided into two approaches: select a single best manuscript, or create an optimal hybrid.

11 First Step – Creating The Synopsis
To reconstruct the original text, we first need to arrange all manuscripts so that parallel words or phrases could be compared.

12 First Step – Creating The Synopsis
Let's assume we have the following "manuscripts":
United States on the 4th of July.
USA on Fourth of July.
United States in the end of June.

13 First Step – Creating The Synopsis
Guidelines: Phrases should be in a single column.


15 First Step – Creating The Synopsis
Guidelines: Words with the same meaning (e.g., USA / United States) should not be distinguished.

16 First Step – Creating The Synopsis
Guidelines: Words that play the same role in the text (fourth/end; June/July) should be aligned.
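Following these guidelines, the three example "manuscripts" above might be arranged into a synopsis matrix like this (a hypothetical alignment for illustration; the empty string marks a gap where a manuscript omits a word):

```python
# Illustrative alignment of the three example "manuscripts".
# Column 3 aligns 4th/Fourth/end; column 5 aligns July/July/June.
synopsis = [
    ["United States", "on", "the", "4th",    "of", "July"],
    ["USA",           "on", "",    "Fourth", "of", "July"],
    ["United States", "in", "the", "end",    "of", "June"],
]
```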

17 First Step – Creating The Synopsis
There have been efforts to automate the process of creating synopses from raw text. For the testbeds considered in this research, manual synopses were available, allowing us to focus on the more basic issue of text reconstruction.

18 First Attempt - SMR Simple Majority Rule (SMR): for each column, choose the token found most frequently in that column.
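As a sketch, SMR over a synopsis matrix (rows are manuscripts, columns are token positions; the function name and data layout are assumptions) might look like:

```python
from collections import Counter

def smr(synopsis):
    """Simple Majority Rule: for each column, choose the token
    found most frequently in that column.

    synopsis[i][j] is the token in column j according to manuscript i.
    """
    n_cols = len(synopsis[0])
    reconstruction = []
    for j in range(n_cols):
        column = [ms[j] for ms in synopsis]
        token, _ = Counter(column).most_common(1)[0]
        reconstruction.append(token)
    return reconstruction
```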

19 First Attempt - SMR The good news: this method's accuracy approaches 1 as the number of manuscripts grows. The bad news: in real life, the number of available manuscripts is usually quite limited.

20 Formalizing The Problem
Suppose we have a synopsis of n manuscripts, each of which is divided into m tokens. We can think of the synopsis as an n × m matrix a = {a_ij}, where a_ij is the word in the j-th column according to the i-th manuscript.

21 Formalizing The Problem
Reliability level: we assign a reliability level p_i to each manuscript. For any given token, manuscript i has probability p_i of preserving the correct token. The value of p_i is not known to us.

22 Formalizing The Problem
Number of forms: for any column j ∈ {1, …, m}, there are k_j potential distinct forms other than the original, so each column has at most k_j + 1 distinct choices. We arbitrarily map each choice to a number in the set {1, …, k_j + 1}. For each column j, t_j denotes the number of the correct form.

23 Formalizing The Problem
We wish to choose the most probable choice in each column. The resulting sequence of words is the proposed original text.

24 The UR Algorithm We assign some initial constant value to { 𝑝 𝑖 }. Then we repeat the following two steps until convergence:

25 The UR Algorithm First step: we assume that the prior P(t_j = W) is equal for every form W ∈ {1, …, k_j + 1}. For each j and each W ∈ {1, …, k_j + 1}, compute P(t_j = W | a) using {p_i} and Bayes' rule:

P(t_j = W | a) = P(t_j = W | a_j) = P(a_j | t_j = W) · P(t_j = W) / P(a_j)
= Z · ∏_{i : a_ij = W} p_i · ∏_{i : a_ij ≠ W} (1 − p_i) / k_j

where a_j denotes the j-th column and Z is a normalizing constant.

26 The UR Algorithm Second step: re-estimate the values of p_i using the P(t_j = W | a) values from the previous step. p_i is the average (over j) of the probability that t_j = a_ij:

p_i = (1/m) · Σ_j P(t_j = a_ij | a)
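Putting the two steps together, one possible sketch of the UR iteration (the initialization value, iteration count, and function names are assumptions; the slides iterate until convergence rather than for a fixed count):

```python
def ur(synopsis, n_iters=50, p_init=0.8):
    """Sketch of the UR algorithm. synopsis[i][j] is manuscript i's
    token in column j. Returns the reconstructed text and the
    estimated reliabilities p_i."""
    n, m = len(synopsis), len(synopsis[0])
    # Distinct forms observed per column (at most k_j + 1 choices).
    forms = [sorted({synopsis[i][j] for i in range(n)}) for j in range(m)]
    p = [p_init] * n
    for _ in range(n_iters):
        # Step 1: posterior P(t_j = W | a) for each column j and form W,
        # assuming a uniform prior over forms (absorbed into the
        # normalization Z).
        post = []
        for j in range(m):
            k_j = max(len(forms[j]) - 1, 1)
            weights = {}
            for W in forms[j]:
                w = 1.0
                for i in range(n):
                    w *= p[i] if synopsis[i][j] == W else (1 - p[i]) / k_j
                weights[W] = w
            Z = sum(weights.values())
            post.append({W: w / Z for W, w in weights.items()})
        # Step 2: p_i is the average (over j) posterior probability
        # that manuscript i's token is the correct one.
        p = [sum(post[j][synopsis[i][j]] for j in range(m)) / m
             for i in range(n)]
    # Output the most probable form in each column.
    text = [max(post[j], key=post[j].get) for j in range(m)]
    return text, p
```

With three toy "manuscripts" where two agree on every token, the dissenting manuscript ends up with a lower estimated reliability.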

27 Handling Dependencies
The UR algorithm is optimal only when manuscripts are independent of each other. However, manuscripts are copied from one another, so 3rd-generation manuscripts fall naturally into clusters.

28 Handling Dependencies
The errors in manuscripts in the same cluster tend to be similar. In most real-life cases, domain experts are able to identify these clusters. Given such a clustering, we can apply the UR method recursively to reconstruct the original text.
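A sketch of the recursive idea, here using SMR as the per-level reconstruction routine (UR could be substituted; the clustering and all names are assumptions):

```python
from collections import Counter

def smr(synopsis):
    # Majority vote in each column.
    m = len(synopsis[0])
    return [Counter(ms[j] for ms in synopsis).most_common(1)[0][0]
            for j in range(m)]

def recursive_reconstruct(synopsis, clusters, base=smr):
    # First reconstruct a representative text within each known cluster,
    # then reconstruct the original from the cluster representatives.
    reps = [base([synopsis[i] for i in cluster]) for cluster in clusters]
    return base(reps)
```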

29 Experiments We test the UR algorithm on 3 different groups of manuscripts: Artificial manuscripts – 2nd generation copies. Artificial manuscripts – 3rd generation copies. Real-world example. SMR is used as the baseline algorithm.

30 Artificial Test – 2nd Generation
Manuscripts are copied directly from the original text. Each manuscript has reliability p_i. p_i is chosen from a uniform distribution between and 0.99. If a word is copied incorrectly, it is randomly replaced by one of k_j possible other words. In this way, we generate 20 "manuscripts", each with m tokens.
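A sketch of this generation procedure (the lower bound 0.5 of the reliability distribution is an assumption, since the slide leaves it unstated; the corrupted-word labels are stand-ins for the k_j alternative forms):

```python
import random

def make_manuscripts(original, n=20, k=5, p_low=0.5, seed=0):
    """Generate n artificial 2nd-generation copies of `original`.

    Each manuscript gets a reliability p_i drawn uniformly from
    [p_low, 0.99]; each word is kept with probability p_i, otherwise
    replaced by one of k synthetic alternative forms."""
    rng = random.Random(seed)
    manuscripts, reliabilities = [], []
    for _ in range(n):
        p_i = rng.uniform(p_low, 0.99)
        copy = [w if rng.random() < p_i else f"{w}#err{rng.randrange(k)}"
                for w in original]
        manuscripts.append(copy)
        reliabilities.append(p_i)
    return manuscripts, reliabilities
```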

31 Artificial Test – 2nd Generation
Different manuscript lengths:

32 Artificial Test – 2nd Generation
Different numbers of manuscripts:

33 Artificial Test – 3rd Generation
We generate 20 2nd-generation manuscripts as before. We then generate 200 3rd-generation manuscripts from them. The 3rd-generation manuscripts are used as input. We assume that the clusters are known.

34 Artificial Test – 3rd Generation
We run the following algorithms: 1. SMR 2. UR 3. Recursive SMR 4. Recursive UR

35 Artificial Test – 3rd Generation

36 Real-World Example We use a synoptic version of a single chapter of the Talmud. The synopsis consists of 20 manuscripts and 8,564 columns. The manuscripts split naturally into six clusters.

37 Real-World Example Pre-processing steps on the synopsis:
Within a given column, words with spelling-related differences are treated as identical. Consecutive columns containing a single phrase are merged. This leaves 5,912 columns in the synopsis.

38 Real-World Example Column distribution by number of forms:

39 Real-World Example We apply Recursive UR and Recursive SMR to the processed synopsis. The two methods disagree on 448 of the columns and agree on the rest. For those columns, a domain expert provided the most likely correct word according to his own judgment. Of the 448 disagreements, he determined that 80 were resolvable. In 66 of these 80 cases (82.5%), the expert agreed with the UR algorithm.

40 Conclusions Original ancient texts can be reconstructed using automated methods far more effectively than with a simple majority rule. We have assumed that:
the correct clustering of manuscripts is known.
a given manuscript has some fixed reliability over all words.
each distinct form of a word is an equally probable alternative to the original.


42 Thank You!

