An Algorithm to Align Words for Historical Comparison. Michael A. Covington (The University of Georgia). Journal of Computational Linguistics, 1996.


An Algorithm to Align Words for Historical Comparison
Michael A. Covington (The University of Georgia)
Journal of Computational Linguistics, 1996

February 09, 2011
Hyojin Song

Contents
• Introduction
• Algorithm
• Experiment
• Conclusion

Introduction
Mami’s here!! Maumi ... Why ... mam ...
Couldn't speech recognition do a little better?

Introduction
• The goal of this paper is to align the segments of a pair of words suspected of being cognate, which is the first step in applying the comparative method
• An algorithm for finding probably correct alignments on the basis of phonetic similarity
  › An evaluation metric
  › A guided search procedure
• For example, the correct alignment of Latin “do” with Greek “didomi” is (alignment shown as a figure on the slide)

Introduction
• The segments of two words may be misaligned because of
  › affixes (living or fossilized)
  › reduplication
  › sound changes
  › elision
  › monophthongization
• Motivation
  › A guided search algorithm for finding the best alignment of one word with another
    » Both words are given in a broad phonetic transcription
    » The aligner sees only surface forms, not sound laws or phonological rules

Contents
• Introduction
• Algorithm
  › Alignment
  › The Search Space
  › The Full Evaluation Metric
• Experiment
• Conclusion

Algorithm: Alignment
• Inexact string matching
  › Only identical words match exactly
  › The task is to find the alignment that minimizes the difference between the two words
• Dynamic programming algorithm
  › Well known for inexact string matching
  › However, we do not use it, for several reasons
    » The strings being aligned are relatively short, so the efficiency of dynamic programming on long strings is not needed
    » It gives only one alignment for each pair of strings, not the n best alternatives
[Figure: example string alignment; surviving text: a b c b d e / a b ─ c ─ b d e]

Algorithm: Alignment
• An alignment can be viewed as
  › A way of stepping through two words concurrently
  › Consuming all the segments of each
• The aligner can perform either a match or a skip
  › A match: the aligner consumes a segment from each of the two words in a single step
  › A skip: the aligner consumes a segment from one word while leaving the other word alone
[Figure: example string alignment; surviving text: a b ─ c ─ b d e]

Algorithm: Alignment
• The aligner is not allowed to perform, in succession, a skip on one string and then a skip on the other
  › The result would be almost equivalent to a match
  › This restriction is called the no-alternating-skips rule
• To identify the best alignment, the algorithm must assign a penalty to every skip or match
  › The best alignment is the one with the lowest total penalty
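The match/skip moves and the no-alternating-skips rule can be sketched as a small recursive generator. This is only an illustration, not the paper's implementation: the names `alignments` and `show`, and the convention that "skip1" consumes a segment of the first word and "skip2" a segment of the second, are assumptions made for this example.

```python
from typing import Iterator, List, Optional, Tuple

# One step of an alignment: (segment of word 1, segment of word 2).
# None marks the word that is left alone during a skip.
Step = Tuple[Optional[str], Optional[str]]


def alignments(w1: str, w2: str, last: str = "match") -> Iterator[List[Step]]:
    """Yield every alignment of w1 with w2 that obeys the no-alternating-skips rule.

    `last` is the previous move: "match", "skip1" or "skip2" (hypothetical labels).
    """
    if not w1 and not w2:
        yield []  # both words fully consumed
        return
    # Match: consume one segment from each word.
    if w1 and w2:
        for rest in alignments(w1[1:], w2[1:], "match"):
            yield [(w1[0], w2[0])] + rest
    # Skip consuming a segment of word 1; forbidden right after a skip on word 2.
    if w1 and last != "skip2":
        for rest in alignments(w1[1:], w2, "skip1"):
            yield [(w1[0], None)] + rest
    # Skip consuming a segment of word 2; forbidden right after a skip on word 1.
    if w2 and last != "skip1":
        for rest in alignments(w1, w2[1:], "skip2"):
            yield [(None, w2[0])] + rest


def show(alignment: List[Step]) -> str:
    """Render an alignment as two rows, with '-' for skipped positions."""
    top = " ".join(a or "-" for a, _ in alignment)
    bottom = " ".join(b or "-" for _, b in alignment)
    return top + "\n" + bottom


for al in alignments("el", "lə"):
    print(show(al), end="\n\n")
```

Branches that would need an alternating skip simply yield nothing, which corresponds to the dead ends of the search tree.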

Algorithm: Alignment
• We can use the following penalties: (penalty table shown as a figure on the slide)
• Then the possible alignments of Spanish el and French le (phonetically [lə]) are: (alignments shown as a figure on the slide)
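To make the idea of a total penalty concrete, here is a minimal sketch that scores the three rule-respecting alignments of el and [lə]. The penalty values are an assumption for illustration, since the slide's own table is not available in this transcript: 0.0 for identical segments, 0.5 for two different vowels or two different consonants, 1.0 for a vowel paired with a consonant, and 0.5 for a skip. The vowel set and all function names are likewise hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Step = Tuple[Optional[str], Optional[str]]

VOWELS = set("aeiouə")  # assumed vowel inventory, enough for this example

# Hypothetical penalties (not taken from the slide, whose table is missing).
EXACT, SAME_CLASS, MISMATCH, SKIP = 0.0, 0.5, 1.0, 0.5


def step_penalty(a: Optional[str], b: Optional[str]) -> float:
    """Penalty for a single match or skip."""
    if a is None or b is None:
        return SKIP
    if a == b:
        return EXACT
    if (a in VOWELS) == (b in VOWELS):
        return SAME_CLASS  # vowel with vowel, or consonant with consonant
    return MISMATCH


def total_penalty(alignment: List[Step]) -> float:
    """Sum of the penalties of all steps in an alignment."""
    return sum(step_penalty(a, b) for a, b in alignment)


# The three alignments allowed by the no-alternating-skips rule.
candidates: Dict[str, List[Step]] = {
    "e l / l ə": [("e", "l"), ("l", "ə")],
    "e l - / - l ə": [("e", None), ("l", "l"), (None, "ə")],
    "- e l / l ə -": [(None, "l"), ("e", "ə"), ("l", None)],
}

for name, al in candidates.items():
    print(name, total_penalty(al))

best = min(candidates, key=lambda name: total_penalty(candidates[name]))
print("best:", best)
```

Under these assumed values the alignment that pairs the two l segments has the lowest total, which is the kind of outcome the penalty scheme is meant to produce.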

Algorithm: The Search Space
• The set of all possible alignments of a pair of words can be represented as a tree
• For example
  › The word “has” (English [hæz] and German [hat])
  › We know that these words correspond segment by segment, but the aligner does not
  › The best alignment: (shown as a figure on the slide)

Algorithm: The Search Space
• Several rules
  › The aligner first tries a match, then a skip on the first word, then a skip on the second, and computes all the consequences of each
  › After completing each alignment, it backs up to the most recent untried alternative
  › “Dead ends” in the tree are places where further computation is blocked by the no-alternating-skips rule

Algorithm: The Search Space
• As should be evident, the search tree can be quite large
  › Even if the words being aligned are fairly short
• Table 1 gives the number of possible alignments for words of various lengths
  › When both words are of length n, there are about $3^{n-1}$ alignments
  › Without the no-alternating-skips rule, the number would be about $5^n/2$
• Fortunately, the aligner can greatly narrow the search
  › It abandons any branch of the search tree as soon as the accumulated penalty exceeds the total penalty of the best alignment found so far
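The growth of the search space can be checked with a short counting sketch. It counts alignments by recursion over the remaining segments, with and without the no-alternating-skips rule, and prints the counts next to $3^{n-1}$ and $5^n/2$. The function name `count` and the move labels are assumptions made for this example; Table 1 itself is not reproduced here.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count(m: int, n: int, last: str = "match", restricted: bool = True) -> int:
    """Number of alignments when m and n segments remain in the two words.

    `last` is the previous move ("match", "skip1", "skip2"); when `restricted`
    is True, a skip on one word may not directly follow a skip on the other.
    """
    if m == 0 and n == 0:
        return 1
    total = 0
    if m and n:  # match
        total += count(m - 1, n - 1, "match", restricted)
    if m and not (restricted and last == "skip2"):  # skip on word 1
        total += count(m - 1, n, "skip1", restricted)
    if n and not (restricted and last == "skip1"):  # skip on word 2
        total += count(m, n - 1, "skip2", restricted)
    return total


print("n  with rule  3^(n-1)  without rule  5^n/2")
for n in range(2, 8):
    with_rule = count(n, n)
    without_rule = count(n, n, "match", False)
    print(n, with_rule, 3 ** (n - 1), without_rule, 5 ** n / 2)
```

Memoizing on the previous move keeps the count cheap even though the number of alignments itself grows exponentially.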

Algorithm: The Search Space
• The search tree after pruning
  › The total amount of work is roughly cut in half
  › With larger trees, the saving can be even greater
• It is important, at each stage, to try matches before trying skips
  › Otherwise the aligner would start by generating a large number of useless displacements of each string relative to the other
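The guided search can be sketched as a depth-first search that tries a match first, then each kind of skip, and abandons a branch once its accumulated penalty exceeds the best total found so far. This is a sketch under stated assumptions rather than the paper's code: the simple penalty values, the vowel set, and the names `penalty` and `best_alignment` are all hypothetical.

```python
from typing import List, Optional, Tuple

Step = Tuple[Optional[str], Optional[str]]
VOWELS = set("aeiouæə")  # assumed vowel inventory for this example


def penalty(a: Optional[str], b: Optional[str]) -> float:
    """Hypothetical simple penalties, not the metric from the paper."""
    if a is None or b is None:
        return 0.5  # skip
    if a == b:
        return 0.0  # identical segments
    return 0.5 if (a in VOWELS) == (b in VOWELS) else 1.0


def best_alignment(w1: str, w2: str, prune: bool = True) -> dict:
    """Depth-first search for the lowest-penalty alignment, counting visited nodes."""
    best = {"cost": float("inf"), "steps": None, "nodes": 0}

    def search(i: int, j: int, last: str, cost: float, steps: List[Step]) -> None:
        best["nodes"] += 1
        if prune and cost > best["cost"]:
            return  # branch already worse than the best complete alignment
        if i == len(w1) and j == len(w2):
            if cost < best["cost"]:
                best["cost"], best["steps"] = cost, list(steps)
            return
        moves = []
        if i < len(w1) and j < len(w2):
            moves.append(("match", w1[i], w2[j]))  # matches are tried first
        if i < len(w1) and last != "skip2":
            moves.append(("skip1", w1[i], None))
        if j < len(w2) and last != "skip1":
            moves.append(("skip2", None, w2[j]))
        for kind, a, b in moves:
            steps.append((a, b))
            search(i + (a is not None), j + (b is not None), kind,
                   cost + penalty(a, b), steps)
            steps.pop()  # back up to the most recent untried alternative

    search(0, 0, "match", 0.0, [])
    return best


for prune in (True, False):
    result = best_alignment("hæz", "hat", prune)
    print("pruned" if prune else "full", result["cost"], result["steps"], result["nodes"])
```

Because matches are tried first, a good complete alignment is usually found early, and its total then serves as the bound that cuts off later branches.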

Algorithm: The Full Evaluation Metric
• Table 2 shows an evaluation metric
  › Developed by trial and error
  › Using the 82 cognate pairs
• For example
  › Maumi vs. Mami

    m a u m i
    m a - m i    = 60
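The total of 60 for Maumi vs. Mami can be reproduced with a small sketch. Table 2 is not included in this transcript, so the three values used below (0 for identical consonants, 5 for identical vowels, 50 for a skip) are assumptions chosen to be consistent with the slide's total; the remaining cases of the metric are deliberately left out.

```python
from typing import List, Optional, Tuple

VOWELS = set("aeiou")  # assumed vowel inventory for this example

# Assumed values, consistent with the total of 60 shown on the slide.
SAME_CONSONANT = 0
SAME_VOWEL = 5
SKIP = 50


def penalty(a: Optional[str], b: Optional[str]) -> int:
    """Penalty for one step; only identical segments and skips are covered."""
    if a is None or b is None:
        return SKIP
    if a == b:
        return SAME_VOWEL if a in VOWELS else SAME_CONSONANT
    raise ValueError("this sketch does not cover non-identical segments")


def parse(top: str, bottom: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """Turn two aligned rows ('-' marks a skip) into a list of steps."""
    return [(None if a == "-" else a, None if b == "-" else b)
            for a, b in zip(top, bottom)]


alignment = parse("maumi", "ma-mi")
total = sum(penalty(a, b) for a, b in alignment)
print(total)  # 0 + 5 + 50 + 0 + 5 = 60
```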

Contents
• Introduction
• Algorithm
• Experiment
  › Results on Actual Data
• Conclusion

Experiment: Results on Actual Data
• Tables 3 to 10 show how the aligner performed on 82 cognate pairs in various languages
  › Tables 5-8 are loosely based on the Swadesh word lists of Ringe (1992)
• Tables 3, 4: test set of Spanish-French cognate pairs
  › This test set was chosen because the two languages are historically close but phonologically very different
  › The aligner performed almost flawlessly

Experiment: Results on Actual Data
• Tables 5, 6: test set of English and German cognate pairs
  › With English and German the aligner did almost as well
  › The s in this is aligned with the wrong s in dieses because that alignment gave greater phonetic similarity
  › Taking off the inflectional ending would have prevented this mistake

Experiment: Results on Actual Data
• Tables 7, 8: test set of English and Latin cognate pairs
  › These are much harder to pair up, since they are separated by millennia of phonological and morphological change, including Grimm’s Law
  › Nonetheless, the aligner did reasonably well with them, correctly aligning ...
  › Although it found the correct alignment of fish with piscis, it could not distinguish it from three alternatives

Experiment: Results on Actual Data
• Table 9: test set of Fox-Menomini cognate pairs
  › Table 9 shows that the algorithm works well with non-Indo-European languages
  › Apart from some minor trouble with the suffix of the first item, the aligner had smooth sailing

Experiment: Results on Actual Data
• Table 10: test set of cognate pairs from other languages
  › Table 10 shows how the aligner fared with some word pairs involving Latin, Greek, Sanskrit, and Avestan, again without knowledge of morphology
  › Because it knows nothing about place of articulation or Grimm’s Law, it cannot tell whether the d in daughter corresponds to the th or the g in Greek thugater


Conclusion
• An algorithm for finding probably correct alignments on the basis of phonetic similarity
  › An evaluation metric
  › A guided search procedure
• This alignment algorithm and its evaluation metric are, in effect, a formal reconstruction of something that historical linguists do intuitively
• A possible extension would enable the aligner to recognize assimilation, metathesis, and even reduplication
  › It could then assign lower penalties to these than to arbitrary mismatches

Thank You! Any questions or comments?