Download presentation
Presentation is loading. Please wait.
Published byRafe Morrison Modified over 9 years ago
1
An Algorithm to Align Words for Historical Comparison Michael A. Covington (The University of Georgia) Journal of Computational Linguistics 1996 February 09, 2011 Hyojin Song
2
Contents Introduction Algorithm Experiment Conclusion 2 / 24
3
3 / 24 Introduction Mami’s here!! Maumi ……..Why…...mam…….. 음성인식을 좀 더 잘 할 수 없을까 ??
4
Introduction The goal of this paper is to apply the comparative method to a pair of words suspected of being cognate An algorithm for finding probably correct alignments on the basis of phonetic similarity › An evaluation metric › A guided search procedure 4 / 24 For example, the correct alignment of Latin “do” with Greek “didomi” is
5
Introduction The segments of two words may be misaligned › affixes (living or fossilized) › reduplication › sound changes › elision › Monophthongization 5 / 24 Motivation › A guided search algorithm for finding the best alignment of one word with another Both words are given in a broad phonetic transcription Only see surface forms, not sound laws or phonological rules
6
Contents Introduction Algorithm › Alignment › The Search Space › The Full Evaluation Metric Experiment Conclusion 6 / 24
7
Algorithm Alignment Inexact string matching › Same words are only exact string matching › Finding the alignment that minimizes the difference between the two words Dynamic programming algorithm › Well known for inexact string matching › However we do not use it, for several reasons The string being aligned are relatively short » The efficiency of dynamic programming on long strings is not needed It gives only one alignment for each pair of strings, not n best alternatives 7 / 24 a b c b d e a b c b d e a b ─ c ─ b d e a b ─ c ─ b d e
8
Algorithm Alignment An alignment can be viewed as › A way of stepping through two words concurrently › Consuming all the segments of each The aligner can perform either a match or skip › A match: when the aligner consumes a segment from each of the two words in a single step › A skip: when it consumes a segment from one word while leaving the other word alone 8 / 24 a b ─ c ─ b d e a b ─ c ─ b d e
9
Algorithm Alignment The aligner is not allowed to perform › In succession, a skip on one string and then a skip on the other Because the result would be almost equivalent to a match This restriction is called as the no-alternating-skips rule To identify the best alignment, the algorithm must assign a penalty to every skip or match › The best alignment is the one with the lowest total penalty. 9 / 24
10
Algorithm Alignment We can use the following penalties: 10 / 24 Then the possible alignments of Spanish el and French le (phonetically [ l ∂ ]) are: 혜원이 두루마리 성위꺼 두루마리 지범이 두루마리
11
Algorithm The Search Space Every possible pair of alignments between words can be presented as the form of a tree For example › Word ‘has’ (English [haez] and German [hat]) › We know that these words correspond Segment-by-segment, but the aligner doesn’t › The best alignment 11 / 24
12
Algorithm The Search Space Several Rules › The aligner tries first a match, then a skip on the first word, then a skip on the second, and computes all the consequences of each › After completing each alignment, It backs up to the most recent untried alternative › “Dead end” in the tree are places where further computation is blocked by the no-alternating-skip rule 12 / 24
13
Algorithm The Search Space As should be evident, the search tree can be quite large › Even if the words being aligned are fairly short Table 1 gives the number of possible alignments for words of various lengths › When both words are of length n, there are about 3 n-1 alignments. › Without the no-alternating-skip rule, › The number would be about 5 n /2 Fortunately, the aligner can greatly Narrow the search › To abandon any branch of the search tree as soon as the accumulated penalty exceeds the total penalty of the best alignment found so far 13 / 24
14
Algorithm The Search Space The search tree after pruning › The total amount of work is roughly cut in half › With larger trees, the saving can be even greater It is important, at each stage, to try matches before trying skips › Otherwise the aligner would start by generating a large number of useless displacements of each string relative to the other 14 / 24
15
Algorithm The Full Evaluation Metric Table 2 shows an evaluation metric › Developed by trial and error › Using the 82 cognate pairs For example › Maumi VS Mami 15 / 24 m a u m i m a - m i m a u m i m a - m i 0 + 5 + 50 + 0 + 5 = 60
16
Contents Introduction Algorithm Experiment › Results on Actual Data Conclusion 16 / 24
17
Experiment Results on Actual Data Table 3 to 10 show how the aligner performed on 82 cognate pairs in various languages › Tables 5-8 are loosely based on the Swadesh word lists of Ringe 1992 Table 3, 4: test set of Spanish-French cognate pairs › This test set is chosen because they are historically close but phonologically very different › The aligner performed almost flawlessly 17 / 24
18
Experiment Results on Actual Data Table 5, 6: test set of English and German cognate pairs › With English and German it did almost as well › The s in this is aligned with the wrong s in dieses because that alignment gave greater phonetic similarity › Taking off the inflectional ending would have prevented this mistake 18 / 24
19
Experiment Results on Actual Data Table 7, 8: test set of English and Latin cognate pairs › They are much harder to pair up Since they are separated by millennia of phonological and morphological change, including Grimm’s Law › Nonetheless, the aligner did reasonably well with them, correctly aligning › Although it found the correct alignment of fish with piscis, it could not distinguish it from three alternatives 19 / 24
20
Experiment Results on Actual Data Table 9: test set of Fox-Menomini cognate pairs › Table 9 shows that the algorithm works well with non-Indo-European languages › Apart from some minor trouble with the suffix of the first item, the aligner had smooth sailing 20 / 24
21
Experiment Results on Actual Data Table 10: test set of other languages cognate pairs › Table 10 shows how the aligner fared with some word pairs involving Latin, Greek, Sanskrit, and Avestan, again without knowledge of morphology › Because it knows nothing about place of articulation or Grimm’s Law, it cannot tell whether the d in daughter corresponds with the th or the g in Greek thugater 21 / 24
22
Contents Introduction Algorithm Experiment Conclusion 22 / 24
23
Conclusion An algorithm for finding probably correct alignments on the basis of phonetic similarity › An evaluation metric › A guided search procedure This alignment algorithm and its evaluation metric are, in effect, a formal reconstruction of something that historical linguists do intuitively. Extended algorithm would be to enable the aligner to recognize assimilation, metathesis, and even reduplication › can assign lower penalties to words than to arbitrary mismatches 23 / 24
24
Thank You! Any question or comment?
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.