Phylogenetic Tree 12/8/2018.

Presentation transcript:

Phylogenetic Tree 12/8/2018

Phylogenetic Tree: What it is
An evolutionary tree drawn from the characteristics of organisms or from measured distances between them.
It is represented as a tree whose nodes are the organisms (objects) and whose arcs express the proximity between the nodes they join.
The construction is based on how close the organisms are to one another.

Phylogenetic Tree: Motivation
Pure curiosity: biological science.
One species can be studied in place of a related one: drug tests on monkeys stand in for tests on humans, and rare species can be spared in a study.
Drug design based on the evolution of micro-organisms: AIDS and flu vaccine/drug design depends on how the pathogens evolve.
Tracking pathogen sources.
Genesis, archeology, and so on.

Phylogenetic Tree: Topology
Evolutionary distance is not the same as elapsed time: the former is a crude approximation of the latter (if the distance can be calculated at all).
Leaves are objects; internal nodes may or may not be objects (they may represent hypothetical ancestors).
Trees are mostly binary, but not always.

Phylogenetic Tree: Source data types
Discrete characters: for example, "does it have a long beak?" A character may be Boolean or multi-valued. The data is provided in matrix form (objects x characters).
Numerical distance matrix: symmetric pairwise distances measured by some means, e.g., by aligning sequences.
Continuous characters: the character value lies in a numerical domain.

Characters for phylogeny
Characters should be relevant in the context of phylogeny; what is relevant depends on the scientist using them.
Characters should be independent, that is, inherited without interfering with one another (eye color and hair color may not be a good combination in a character set).
All characters must evolve from the same ancestor: we presume that the phylogeny is (1) a tree and (2) connected.
The closest objects are called “homologous”: as many characters as possible have the same or related values.

Phylogeny using character state matrix
A “state” is a tuple with one value for each character (a value may be “unassigned”).
An internal node may be a state with no object assigned to it.
Leaves are the states that correspond to objects with the respective character values.
P. 178: a source character state matrix.

Phylogeny using character state matrix: Problems
Convergent evolution: two non-homologous objects (loosely speaking, most of their characters do not match) happen to have the same value for a character. Representing this needs a cycle in the graph.

Phylogeny using character state matrix: Problems
In one case evolution suggests that the value of a character c evolves from “long” to “short”, and in another case the reverse, so the direction of evolution is ambiguous.
Again, the tree property would have to be violated to accommodate this.

Character domain types
The domain of a character c could be: red <-> blue <-> yellow <-> green.
Here c cannot evolve from blue to green without taking the value yellow first, so c is “ordered”.
A character can also be directed and ordered, instead of undirected as above.

Perfect phylogeny
The source data is problem-free.
Each edge in the phylogeny is a transition of the respective character’s value.
All nodes with the same value for a character must form a subtree (with the transition at its root).
Such a tree is a “perfect phylogeny”.

Perfect phylogeny problem
Given a character state matrix, does a perfect phylogeny exist over it?
The table on p. 178 does not have a perfect phylogeny (presuming transitions always go 0 -> 1). Why?
P. 180: a table and its perfect phylogeny.
What do you do when no perfect phylogeny exists? Presume the data is noisy and minimize the errors made in drawing a perfect phylogeny.

Perfect phylogeny problem
You can always try all possible trees over the objects and check whether each one is a perfect phylogeny.
The total number of such trees over n objects is the product of (2i - 5) for i = 3 to n, which grows at least exponentially with n.
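To get a feel for how quickly the count above grows, the product can be evaluated directly. This is a minimal sketch; the function name is an illustrative choice, not something from the slides.

```python
from math import prod

def count_trees(n: int) -> int:
    """Evaluate the product of (2i - 5) for i = 3..n from the slide."""
    if n < 3:
        raise ValueError("the formula is stated for n >= 3")
    return prod(2 * i - 5 for i in range(3, n + 1))

if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, count_trees(n))
```

Ten objects already give more than two million candidate trees, which is why the exhaustive approach is only of theoretical interest.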

Perfect phylogeny problem: to check existence (Boolean matrix)
Organize the character state matrix column-wise: for each column i, the set of objects is O_i.
Every pair O_i and O_k should satisfy one of: O_i ⊆ O_k, or O_k ⊆ O_i, or O_i ∩ O_k = ∅.
That is, either one set is contained in the other, or the two do not overlap at all.
If two sets overlap only partially, no perfect phylogeny exists.

Perfect phylogeny problem: to check existence (Boolean matrix)
For contradiction, suppose O_i and O_k partially overlap and a perfect phylogeny exists.
Say character i labels the edge (u, v): v and its subtree have i = 1, and all other nodes have i = 0.
Take three objects a, b and c such that a, b ∈ O_i but c ∉ O_i: a and b lie in the subtree of v and c does not.
But suppose b, c ∈ O_k and a ∉ O_k: then b and c must lie in another subtree, cut off by edge k, that excludes a.
Two subtrees that share b must be nested, yet neither contains the other: contradiction.

Perfect phylogeny problem: to check existence (Boolean matrix)
When no partial overlap exists: contained sets go within the same subtree (if O_i ⊆ O_k, then the i-subtree is a subtree of the k-subtree), and disjoint sets form separate subtrees.
This proves the “if and only if” of the condition for perfect phylogeny.
Algorithm for checking: pairwise comparison of the object sets takes O(m^2) comparisons for m characters, and each set comparison itself takes time (up to O(n) for n objects), so the total cost is higher.
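The pairwise condition can be checked directly. The sketch below assumes the matrix is a list of rows of 0/1 values and that O_j is the set of rows holding a 1 in column j; the function names and the small matrices in the usage note are made up for illustration and are not the tables from the textbook.

```python
from itertools import combinations

def column_sets(matrix):
    """Return O_j for every column j: the set of row indices with a 1."""
    n_cols = len(matrix[0]) if matrix else 0
    return [{i for i, row in enumerate(matrix) if row[j] == 1}
            for j in range(n_cols)]

def pairwise_compatible(matrix):
    """True if every pair of column sets is nested or disjoint."""
    for a, b in combinations(column_sets(matrix), 2):
        partially_overlap = (a & b) and not (a <= b or b <= a)
        if partially_overlap:
            return False
    return True
```

For example, `pairwise_compatible([[1, 1, 0], [1, 0, 0], [0, 0, 1]])` holds, while `[[1, 0], [1, 1], [0, 1]]` fails because its two columns share only the middle object. This is the quadratic check, not the faster column-sorting method.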

Perfect phylogeny problem: Algorithm (Boolean matrix)
Sort the columns by number of 1’s (descending).
Scan each row and record, for each box holding a 1, the column number of the nearest 1 to its left in that row (the rightmost 1 before that box).
Scan each column: the recorded values of all its 1-boxes should agree.
Complexity: O(mn) to count, O(m log m) to sort, O(mn) to create the index matrix, O(mn) to check over the index matrix; total O(mn), presuming n > log m.
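The three scans can be sketched as follows. This assumes the usual reading of the index matrix: for each box holding a 1, store the position of the nearest 1 to its left in the same row, or -1 if there is none. The function name and the value -1 are illustrative choices.

```python
def has_perfect_phylogeny(matrix):
    """Column-sorting check on a list of 0/1 rows (objects x characters)."""
    if not matrix:
        return True
    n_rows, n_cols = len(matrix), len(matrix[0])

    # Step 1: order the columns by number of 1s, largest first.
    order = sorted(range(n_cols), key=lambda j: -sum(row[j] for row in matrix))
    rows = [[row[j] for j in order] for row in matrix]

    # Step 2: index matrix; left[i][j] is the nearest 1 to the left of box (i, j).
    left = [[None] * n_cols for _ in range(n_rows)]
    for i, row in enumerate(rows):
        last = -1
        for j, value in enumerate(row):
            if value == 1:
                left[i][j] = last
                last = j

    # Step 3: within each column, all 1-boxes must record the same value.
    for j in range(n_cols):
        seen = {left[i][j] for i in range(n_rows) if rows[i][j] == 1}
        if len(seen) > 1:
            return False
    return True
```

The sketch uses Python's built-in comparison sort for step 1, which matches the O(m log m) term in the complexity count.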

Perfect phylogeny problem: Algorithm (Boolean matrix)
Exercise: try the algorithm on Table 6.1 (p. 178) and Table 6.2 (p. 180).
Construction algorithm: (1) sort the characters (columns) by decreasing number of 1’s, so that larger object sets sit nearer the root; (2) for each object, start at the root; (3) for each character the object has, in sorted order: (4) if an edge for the character already exists, follow it, (5) else create an edge for it; put the object at the node where the walk ends; (6, cosmetic step) if a leaf node holds several objects, create a separate edge for each object. O(nm).
Exercise: try it on Table 6.2 (p. 180).
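A sketch of the construction, under two assumptions that are not spelled out on the slide: the matrix is already known to admit a perfect phylogeny, and the columns are visited from the largest object set to the smallest. The nested-dictionary tree layout and the names are illustrative, and the cosmetic step (6) is left out.

```python
def new_node():
    """A tree node: objects resting here and child edges keyed by character."""
    return {"objects": [], "children": {}}

def build_phylogeny(matrix, names):
    """Insert each object along the path of characters it possesses."""
    n_cols = len(matrix[0])
    # Columns with more 1s come first, so they end up closer to the root.
    order = sorted(range(n_cols), key=lambda j: -sum(row[j] for row in matrix))
    root = new_node()
    for name, row in zip(names, matrix):
        node = root
        for j in order:
            if row[j] == 1:
                # Follow the edge for character j, creating it if needed.
                if j not in node["children"]:
                    node["children"][j] = new_node()
                node = node["children"][j]
        node["objects"].append(name)
    return root
```

With the made-up matrix `[[1, 1, 0], [1, 0, 0], [0, 0, 1]]` and names `["a", "b", "c"]`, object b rests below the edge for character 0, object a one edge further down (character 1), and object c below a separate edge for character 2.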

Perfect phylogeny problem: Algorithm (non-Boolean matrix, but…)
If each character has two states but the order of transition is not known, presume an order: the majority state is 0 and the minority state is 1 (more ancestors are available).
The same lemma must then be applied after this presumption: no partially overlapping sets of objects.

Phylogeny problem: arbitrary domain size, unordered characters
(Def) Triangulated graph: [no big hole] every cycle with more than 3 vertices has a short-cut edge (a chord).
Subtrees of a tree form a triangulated graph (as their intersection graph).
(Def) Intersection graph over subsets: the subsets are the nodes, and there is an edge between each pair of overlapping subsets.

Phylogeny problem: arbitrary domain size, unordered characters
Fig 6.7 (p. 187): intersection graph for Table 6.3 (p. 188) [not triangulated, yet].
(Def) c-triangulated graph: add edges to the intersection graph G only between nodes that belong to different characters; if the graph can be made triangulated this way, then G is c-triangulated.
Fig 6.7 is c-triangulated.

Phylogeny problem: arbitrary domain size, unordered characters
A character state matrix admits a perfect phylogeny if and only if it translates to a c-triangulated graph.
Creating and checking a c-triangulation is NP-hard (related to the max-clique problem).

Phylogeny problem: arbitrary domain size, unordered characters: 2 characters
For 2 characters, the state intersection graph is bipartite.
A perfect phylogeny exists if and only if the state intersection graph is acyclic.
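The acyclicity test for two characters can be sketched with a union-find structure. The input format is an assumption: one (state of character 1, state of character 2) pair per object. Each distinct pair is an edge of the state intersection graph, and an edge whose endpoints are already connected closes a cycle.

```python
def two_characters_compatible(rows):
    """rows: one (state_of_char1, state_of_char2) tuple per object."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Objects with the same pair of states contribute the same edge once.
    for s, t in set(rows):
        a, b = find(("c1", s)), find(("c2", t))
        if a == b:
            return False  # this edge would close a cycle
        parent[a] = b
    return True
```

The state labels in the test data are made up. Three objects with states (r, x), (r, y), (b, y) form a path and pass; adding (b, x) closes a 4-cycle and fails.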

Phylogeny construction: arbitrary domain size, unordered characters: 2 characters
Algorithm: (1) construct the intersection graph; (2) make a new node for each edge (the intersection of the objects in the two old nodes now goes to the new node); (3) connect new nodes if they have overlapping objects; (4) a spanning tree of this graph is the phylogeny; (5, cosmetic step) objects huddled on one node should be put on separate leaves.
Try it on Table 6.4 (p. 190) and check against Fig 6.8 (p. 189).

When Perfect Phylogeny does not exist
Eliminate problematic characters; which ones to eliminate is an optimization problem (remove the minimum number of characters): the compatibility criterion.
Minimize convergence (a character going back to a previous value): the parsimony criterion.
Both are NP-complete problems.

When Perfect Phylogeny does not exist: Compatibility
Compatibility problem: does there exist a subset of characters of a given size for which Lemma 6.1 (no partially overlapping sets of objects) holds, so that a perfect phylogeny exists?
Equivalent to the K-clique problem: does the graph contain a complete subgraph (every pair of its nodes joined by an edge) with K or more nodes?

When Perfect Phylogeny does not exist: Compatibility
Polynomial transformation from Clique to the compatibility problem: nodes become characters, and each edge contributes 3 objects with specific character values.
Every pair of NP-complete problems has polynomial transformations in both directions.
Compatibility can also be transformed to Clique: characters become nodes, and non-overlapping (compatible) pairs of characters become edges.
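The transformation to Clique suggests a brute-force search for small Boolean inputs: treat characters as nodes, join two characters when their object sets are nested or disjoint, and look for the largest set of mutually joined characters. The sketch below is exponential in the number of characters, in line with the NP-completeness noted above; the function name and the test matrix are illustrative.

```python
from itertools import combinations

def largest_compatible_subset(matrix):
    """Return column indices of a largest pairwise-compatible character set."""
    n_cols = len(matrix[0])
    sets = [{i for i, row in enumerate(matrix) if row[j] == 1}
            for j in range(n_cols)]

    def compatible(a, b):
        # Nested or disjoint object sets do not conflict.
        return not (a & b) or a <= b or b <= a

    # Try subset sizes from largest to smallest; the first hit is a max clique.
    for size in range(n_cols, 0, -1):
        for subset in combinations(range(n_cols), size):
            if all(compatible(sets[i], sets[j])
                   for i, j in combinations(subset, 2)):
                return list(subset)
    return []
```

For the made-up matrix `[[1, 0, 1], [1, 1, 0], [0, 1, 0]]`, characters 0 and 1 conflict, so the search settles on two characters and returns `[0, 2]`.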

Phylogeny with Distance Matrix
Input is a distance matrix (square, symmetric) between all pairs of objects, instead of a character state matrix.
Output is a phylogeny with the objects as leaves and arcs labeled with distances.

Phylogeny with Distance Matrix
Additive matrix: a tree can be drawn in which the distance between every pair of leaves equals their distance in the matrix.
Matrices are unlikely to be additive in practice.
For a non-additive matrix, minimize the deviation over the tree: an NP-hard problem.

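Phylogeny with Distance Matrix
Typically we have 2 matrices: (1) upper bounds on the distances and (2) lower bounds.
Metric space: for all i, j, k: d_ij > 0 for i ≠ j, d_ii = 0, d_ij = d_ji, and d_ij <= d_ik + d_kj.
Additive metric spaces satisfy the 4-point condition: any four objects can be labeled i, j, k, l so that d_ij + d_kl = d_ik + d_jl >= d_il + d_jk (the two largest of the three pairwise sums are equal).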
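The 4-point condition can be tested by brute force over all quadruples: of the three ways to pair up four objects, the two largest sums must be equal. A minimal sketch, assuming the matrix is a square list of lists; the tolerance parameter is an addition for floating-point input.

```python
from itertools import combinations

def satisfies_four_point(d, tol=1e-9):
    """Check the 4-point condition on every set of four objects, O(n^4)."""
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        # The two largest sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True
```

The first matrix in the test data comes from a made-up tree with two cherries (a, b) and (c, d), unit leaf edges and a unit internal edge, so it passes; the second is altered so that the three sums are 4, 6 and 8.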

Phylogeny with Distance Matrix
The tree should have internal nodes of degree 3 (Fig 6.9, p. 194).
To add a node z, the arc xy is split proportionately at a point c and z is attached by the arc cz, so that the distances xz and zy come out right.

Phylogeny with Distance Matrix
M_xz = d_xc + d_zc
M_yz = d_yc + d_zc
M_xy = d_xc + d_yc
Three equations in three unknowns, d_xc, d_yc and d_zc, to be solved for.
The tree drawn is unique for 3 objects x, y and z.
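Adding and subtracting the three equations gives a closed form for each unknown, for example d_xc = (M_xy + M_xz - M_yz) / 2. A minimal sketch with an illustrative function name:

```python
def three_leaf_branch_lengths(m_xy, m_xz, m_yz):
    """Solve the three-leaf system for the arcs meeting at the split point c."""
    d_xc = (m_xy + m_xz - m_yz) / 2
    d_yc = (m_xy + m_yz - m_xz) / 2
    d_zc = (m_xz + m_yz - m_xy) / 2
    return d_xc, d_yc, d_zc
```

With the made-up distances M_xy = 5, M_xz = 6, M_yz = 7 this gives arcs of length 2, 3 and 4, and each pair of arcs adds back up to the corresponding matrix entry.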

Phylogeny with Distance Matrix
Adding a 4th object w works the same way as adding the 3rd object z: add it between the older objects x and y, splitting xy at c2.
If c2 coincides with c, discard this attempt and repeat the step on the arc zc.
The object w may hang (from c2) on xz or on yz, but it will not have 2 different valid positions.

Phylogeny with Distance Matrix
The uniqueness of the tree remains valid for any k objects with k > 4, for a metric additive distance matrix.
The algorithm may have to try all possible places to split an arc, but in a metric additive space exactly one position will fit.

Phylogeny: Ultrametric tree
Exercise: get the MST of the complete graph over Table 6.5 (p. 195).
Ultrametric tree construction:
Input: distance matrices for the high cut-off, Mh, and the low cut-off, Ml (Table 6.6, p. 201).
Output: a phylogeny in which leaf-to-leaf distances are within the bounds given by the 2 matrices (Fig 6.16, p. 202).

Phylogeny: Ultrametric tree
Algorithm:
Step 1: compute the MST T over Mh (which algorithm?); this provides the basis for the structure of the tree.
Step 2: compute “cut-off” values for each edge of T using Ml; these provide the basis for the distances on the tree edges.
Step 3: compute the ultrametric tree U and find the distance on each arc using the cut-offs.
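The slide leaves the choice of MST algorithm open. One common option is Kruskal's algorithm, sketched here on a square distance matrix; this is an assumption about how the first step could be done, not the method prescribed by the slides.

```python
def minimum_spanning_tree(dist):
    """Kruskal's algorithm on a complete graph given as a symmetric matrix.

    Returns the chosen edges as (i, j, weight) tuples with i < j.
    """
    n = len(dist)
    edges = sorted((dist[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so keep it
            parent[root_i] = root_j
            tree.append((i, j, weight))
    return tree
```

Sorting the returned edges by weight in non-increasing order gives the order in which the later steps pick edges as roots.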

Phylogeny: Ultrametric tree
Step 2.1: the input is T; the output is a rooted tree R whose internal nodes represent the edges of T.
Sort the edges of the MST T by weight (from Mh), non-increasing.
In each iteration, pick the next edge in sorted order as a root.
The path between its two end nodes must go via the root: the two end nodes of the edge should be in two different subtrees.
The next edge picked from the sorted list, containing a node x, goes on the side of the previous root (xy) where x lies.
When no edge containing a node x is left (all such edges xy have been picked), x is placed on a leaf.

Phylogeny: Ultrametric tree
Step 2.2 (cut-off): for each pair of nodes (x, y), look at the path between them in R.
Find their lowest common ancestor, say (ab) [note that each internal node represents an edge].
Look up the table Ml: if Ml_xy is more than the current cut-off(ab), replace the cut-off with Ml_xy.
In other words, the highest Ml value on any edge on the path from x to y in T should be their distance on the ultrametric tree.
On the example on p. 201-202: the root (ad) is updated for all pairs of nodes on opposite sides: EB(1), ED(1), AD(4), AB(3), CB(4), CD(3).

Phylogeny: Ultrametric tree
Step 3 (ultrametric tree): recompute R the same way as before, but now put distances on the internal nodes.
The height of an internal node is its cut-off / 2.
Note that the computation of R proceeds from the root downwards.
Adjust the distances between nodes as the heights are calculated.
Done.

Comparing phylogenies
Two trees are expected to be isomorphic.
All objects should be on the leaves; if not, make it so.
Pick a node u and its sibling v in T1.
Look for u in T2; if its sibling there is not v, return False.
If the sibling is v, merge u and v into their parent (and remove the subtree with u and v).
Continue bottom-up until both T1 and T2 become single-node trees, then return True.
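The sketch below answers the same question by a different route, named plainly: instead of merging sibling pairs step by step, it reduces each rooted tree to an order-independent canonical form and compares the results. It assumes trees are nested tuples whose leaves are distinct strings; that representation and the function names are illustrative.

```python
def canonical(tree):
    """Order-independent form: a leaf stays a string, an internal node
    becomes the frozenset of its children's canonical forms."""
    if isinstance(tree, str):
        return tree
    return frozenset(canonical(child) for child in tree)

def same_phylogeny(t1, t2):
    """True if two rooted trees over distinct leaf labels are isomorphic."""
    return canonical(t1) == canonical(t2)
```

For example, `(("a", "b"), "c")` and `("c", ("b", "a"))` compare equal, because sibling order does not matter, while `(("a", "c"), "b")` does not match either of them. The shortcut relies on leaf labels being distinct, so no two sibling subtrees collapse into one set element.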