Inferring phylogenetic models for European and other Languages using MML. Jane N. Ooi (18560210). Supervisor: A./Prof. David L. Dowe.


Table of Contents
 Motivation and Background
 What is a phylogenetic model?
 Phylogenetic Trees and Graphs
 Types of evolution of languages
 Minimum Message Length (MML)
 Multistate distribution – modelling of mutations
 Results/Discussion
 Conclusion and future work

Motivation
 To study how languages have evolved (phylogeny of languages), e.g. artificial languages, European languages.
 To refine natural language compression methods.

Evolution of languages
 What is phylogeny? Phylogeny means evolution.
 What is a phylogenetic model? A phylogenetic tree/graph is a tree/graph showing the evolutionary interrelationships among various species or other entities that are believed to have a common ancestor.

Difference between a phylogenetic tree and a phylogenetic graph
 Phylogenetic trees: each child node has exactly one parent node.
 Phylogenetic graphs (new concept): each child node can descend from one or more parent nodes.

Evolution of languages
 3 types of evolution:
 Evolution of phonology/pronunciation
 Evolution of written script/spelling
 Evolution of grammatical structures

Word        US          UK
schedule    skedule     shedule
leisure     leezhure    lezhure

English      Malay
Mobile       Mobil
Television   Televisyen

Minimum Message Length (MML)
 What is MML? A measure of goodness of classification based on information theory.
 Data can be described using "models".
 MML methods favour the "best" description of the data, where "best" = shortest overall message length.
 Two-part message: MsgLength = MsgLength(model) + MsgLength(data|model)
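As a minimal sketch (all names and numbers hypothetical, not the talk's implementation), the two-part criterion compares hypotheses by total message length; the smaller total wins, automatically trading model complexity against fit:

```python
def msg_length(model_bits: float, data_given_model_bits: float) -> float:
    # Two-part MML message: bits to state the model,
    # plus bits to encode the data under that model.
    return model_bits + data_given_model_bits

# Toy comparison of two hypotheses for the same data: a richer model
# costs more to state but compresses the data better.
# (Numbers are illustrative only.)
rich = msg_length(model_bits=120.0, data_given_model_bits=300.0)
simple = msg_length(model_bits=40.0, data_given_model_bits=420.0)
best = min([("rich", rich), ("simple", simple)], key=lambda h: h[1])
```

Here the richer hypothesis wins (420 vs 460 bits); with a worse fit, the simpler model's shorter first part would win instead.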

Minimum Message Length (MML)
 The degree of similarity between languages can be measured by compressing them in terms of one another.
 Example: Language A, Language B. 3 possibilities:
 Unrelated – shortest message length when the two are compressed separately.
 A descended from B – shortest message length when A is compressed in terms of B.
 B descended from A – shortest message length when B is compressed in terms of A.

Minimum Message Length (MML)
 The best phylogenetic model is the tree/graph that achieves the shortest overall message length.

Modelling mutation between words
 Root language:
 Equal frequencies for all characters: log(size of alphabet) * no. of chars.
 Some characters occur more frequently than others (e.g. English "x" compared with "a"): multistate distribution of characters.
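The two root-language encodings above can be compared directly; a hedged sketch (helper names are mine, and a full MML message would also charge for stating the frequencies):

```python
import math
from collections import Counter

def uniform_cost_bits(text: str, alphabet_size: int = 27) -> float:
    # Equal frequencies: log2(alphabet size) bits per character
    # (26 letters plus the '.' end-of-word marker).
    return len(text) * math.log2(alphabet_size)

def multistate_data_cost_bits(text: str) -> float:
    # Data cost under empirical character frequencies; frequent
    # characters (like 'e') get short codes, rare ones (like 'x') long.
    counts = Counter(text)
    n = len(text)
    return -sum(c * math.log2(c / n) for c in counts.values())

sample = "the.theme.then.there."
```

For skewed text like `sample`, the multistate data cost falls well below the uniform cost of len(sample) * log2(27) bits.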

Modelling mutation between words
 Child languages:
 Multistate distribution with 4 states: insert, delete, copy, change.
 Use string alignment techniques to find the best alignment between words (dynamic programming algorithm).
 MML favours the alignment between words that produces the shortest overall message length.

Example:
recommander
|||||||||||
recommend--
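An alignment like the one above can be found with standard dynamic programming over the four states. A sketch (operation probabilities are the artificial-language mutation rates quoted later in the talk, used here purely as illustrative costs; function names are mine):

```python
import math

PROBS = {"copy": 0.65, "change": 0.20, "insert": 0.05, "delete": 0.10}
COST = {op: -math.log2(p) for op, p in PROBS.items()}
EXTRA = math.log2(26)  # stating the new character for a change or insert

def alignment_cost(parent: str, child: str) -> float:
    """Minimum number of bits to encode `child` given `parent`
    using copy/change/insert/delete operations."""
    n, m = len(parent), len(child)
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n:  # delete parent[i]
                dp[i + 1][j] = min(dp[i + 1][j], dp[i][j] + COST["delete"])
            if j < m:  # insert child[j]
                dp[i][j + 1] = min(dp[i][j + 1],
                                   dp[i][j] + COST["insert"] + EXTRA)
            if i < n and j < m:  # copy or change
                step = (COST["copy"] if parent[i] == child[j]
                        else COST["change"] + EXTRA)
                dp[i + 1][j + 1] = min(dp[i + 1][j + 1], dp[i][j] + step)
    return dp[n][m]
```

`alignment_cost("recommander", "recommend")` prefers the eight copies, one change and two deletes shown above over encoding "recommend" from scratch.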

Work to date
 Preliminary model:
 Only copy and change mutations.
 Words of the same length.
 Artificial and European languages.
 Expanded model:
 Copy, change, insert and delete mutations.
 Words of different lengths.
 Artificial and European languages.

Results – Preliminary model
 Artificial languages:
 A – random
 B – 5% mutation from A
 C – 5% mutation from B
 Full stop "." marks the end of a string.

      A          B          C
1     asdfge.    assfge.    assfge.
2     zlsdrya.   zlcdrya.   zlchrya.
3     wet.       wet.       wbt.
4     vsert.     vsegt.     vsagt.
…     …          …          …
50    …          …          …

Results – Preliminary model
 Possible tree topologies for 3 languages (X, Y, Z): null hypothesis – totally unrelated; partially related (expanded model only); fully related – the expected topology.

Results – Preliminary model
 Possible graph topologies for 3 languages: non-related parents; related parents.

Results – Preliminary model
 Results: best tree = Language B as root, with children Language A (Pmut(B,A)) and Language C (Pmut(B,C)).
 Overall message length = … bits
 Cost of topology = log(5)
 Cost of fixing root language (B) = log(3)
 Cost of root language = … bits
 Branch 1: cost of child language (Lang. A), binomial distribution = … bits
 Branch 2: cost of child language (Lang. C), binomial distribution = … bits
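The branch costs quoted above come from a two-state (copy/change) code; a hedged reconstruction (the exact alphabet-size charge for a changed character is an assumption on my part):

```python
import math

def binomial_child_cost_bits(n_chars: int, n_changes: int,
                             p_mut: float) -> float:
    # Preliminary model: each child character is either copied or
    # changed. Bits = -log2 of the binomial likelihood of the
    # copy/change pattern, plus log2(26) per change to state the
    # new character (assumed alphabet charge).
    n_copy = n_chars - n_changes
    return (-n_changes * math.log2(p_mut)
            - n_copy * math.log2(1.0 - p_mut)
            + n_changes * math.log2(26))
```

At the 5% mutation rate used for the artificial languages, a 100-character vocabulary with 5 changed characters costs roughly 52 bits under this sketch.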

Results – Preliminary model
 European languages: English, French, Spanish.

      English     French      Spanish
1     baby.       bebe.       nene.
2     beach.      plage.      playa.
3     biscuits.   biscuits.   bizocho.
4     cream.      creme.      crema.
…     …           …           …
30    …           …           …

Results – Preliminary model
 Inferred graph: French is the root; Spanish descends from French (Pmut(French, Spanish)); English descends from French and/or Spanish (P(from French), P(from Spanish not French), P(from neither)).
 Cost of parent language (French) = … bits
 Cost of child language (Spanish), binomial distribution = … bits
 Cost of child language (English), trinomial distribution = … bits
 Total tree cost = log(5) + log(3) + log(2) + … = … bits

Results – Expanded model
 16 sets of 4 languages.
 Different-length vocabularies:
 A – randomly generated
 B – mutated from A
 C – mutated from A
 D – mutated from B
 Mutation probabilities:
 Copy – 0.65
 Change – 0.20
 Insert – 0.05
 Delete – 0.10

Results – Expanded model Language A Language B Language C Language D 1awjmv.afjmv.wqmv.afjnv. 2bauke.baxke.auke.bave. 3doinet.domnitdeoinet.domnit. 4 eni. eni.eol.enc.eol. 5 foijgnw. foijgnw.fiogw.foijnw.fidgw. …..……..……..……..…….. …..……..……..……..…….. 50……..……..……..…….. Examples of a set of 4 vocabularies used

Results – Expanded model
 Possible tree structures for 4 languages (A, B, C, D): null hypothesis – totally unrelated; partially related structures.

Results – Expanded model
 Fully related structures, including the expected topology.

Results – Expanded model
 Correct tree structure inferred 100% of the time.
 Sample of inferred tree and cost:
 Language A: size = 383 chars, cost = … bits

Results – Expanded model  Pr(Delete) =  Pr(Insert) =  Pr(Mismatch) =  Pr(Match) =  4 state Multinomial cost = bits bits bits  Pr(Delete) =  Pr(Insert) =  Pr(Mismatch) =  Pr(Match) =  4 state Multinomial cost = bits bits bits  *Note that all multinomial cost includes and extra cost of log(26) to state the new character for mismatch and insert * A B A C

Results – Expanded model
 Branch B → D: Pr(Delete) = …, Pr(Insert) = …, Pr(Mismatch) = …, Pr(Match) = …; 4-state multinomial cost = … bits.
 Cost of fixing topology = log(7) = 2.81 bits
 Total tree cost = log(7) + log(4) + log(3) + log(2) + … = … bits
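Taking the slide's logs as base 2 (consistent with log(7) = 2.81 bits), the structure cost alone sums to about 7.39 bits; the log(4), log(3) and log(2) terms plausibly fix the root and successive parent/child assignments (my reading, not stated on the slide):

```python
import math

# Bits to state the tree structure: one of 7 candidate topologies,
# then the remaining choices stated with log2(4), log2(3), log2(2).
topology_bits = sum(math.log2(k) for k in (7, 4, 3, 2))
```

The word-encoding costs for each branch are then added on top of this structure cost to give the total tree cost.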

Results – Expanded model
 European languages: English, French, German.

      English   French   German
1     even.     meme.    sogar.
2     eyes.     oeil.    auge.
3     false.    faux.    falsch.
4     fear.     peur.    angst.
…     …         …        …
60    …         …        …

Results – Expanded model
 Inferred tree: French → English → German.
 Total cost of this tree = … bits
 Cost of fixing topology = log(4) = 2 bits
 Cost of fixing root language (French) = log(3) = … bits
 Cost of French = no. of chars * log(27) = … bits

Results – Expanded model
 Cost of fixing parent/child language (English) = log(2) = 1 bit
 Cost of multistate distribution (French → English) = … bits; MML inferred probabilities: Pr(Delete) = …, Pr(Insert) = …, Pr(Mismatch) = …, Pr(Match) = …
 Cost of multistate distribution (English → German) = … bits; MML inferred probabilities: Pr(Delete) = …, Pr(Insert) = …, Pr(Mismatch) = …, Pr(Match) = …
 Note that an extra cost of log(26) is needed for each mismatch and log(27) for each insert to state the new character.

Conclusion
 MML methods have managed to:
 infer the correct phylogenetic trees/graphs for artificial languages;
 infer phylogenetic trees/graphs for natural languages by encoding them in terms of one another.
 We cannot conclude that one language really descends from another language; we can only conclude that they are related.

Future work
 Compression – grammar and vocabulary.
 Compression – phonemes of languages.
 Endangered languages – indigenous languages.
 Refine the coding scheme:
 Some characters occur more frequently than others, e.g. English "x" compared with "a".
 Some characters are more likely to mutate from one language to another language.

Questions?

Papers on success of MML
 C. S. Wallace and P. R. Freeman. Single factor analysis by MML estimation. Journal of the Royal Statistical Society, Series B, 54(1).
 C. S. Wallace. Multiple factor analysis by MML estimation. Technical Report CS 95/218, Department of Computer Science, Monash University.
 C. S. Wallace and D. L. Dowe. MML estimation of the von Mises concentration parameter. Technical Report CS 93/193, Department of Computer Science, Monash University, 1993.
 C. S. Wallace and D. L. Dowe. Refinements of MDL and MML coding. The Computer Journal, 42(4).
 P. J. Tan and D. L. Dowe. MML inference of decision graphs with multi-way joins. In Proceedings of the 15th Australian Joint Conference on Artificial Intelligence, Canberra, Australia, 2-6 December 2002, published in Lecture Notes in Artificial Intelligence (LNAI) 2557. Springer-Verlag.
 S. L. Needham and D. L. Dowe. Message length as an effective Ockham's razor in decision tree induction. In Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics (AI+STATS 2001), Key West, Florida, U.S.A., January 2001.
 Y. Agusta and D. L. Dowe. Unsupervised learning of correlated multivariate Gaussian mixture models using MML. In Proceedings of the Australian Conference on Artificial Intelligence 2003, Lecture Notes in Artificial Intelligence (LNAI) 2903. Springer-Verlag.
 J. W. Comley and D. L. Dowe. General Bayesian networks and asymmetric languages. In Proceedings of the Hawaii International Conference on Statistics and Related Fields, June 5-8, 2003.
 J. W. Comley and D. L. Dowe. Minimum Message Length, MDL and Generalised Bayesian Networks with Asymmetric Languages, chapter 11. M.I.T. Press. [Camera-ready copy submitted October 2003.]
 P. J. Tan and D. L. Dowe. MML inference of oblique decision trees. In Proc. 17th Australian Joint Conference on Artificial Intelligence (AI04), Cairns, Qld., Australia. Springer-Verlag, December 2004.