CS262 Lecture 9, Win07, Batzoglou Phylogeny Tree Reconstruction
Phylogenetic Trees
  Nodes: species
  Edges: time of independent evolution
  Edge length represents evolution time, also known as genetic distance; this is not necessarily chronological time
Parsimony: a direct method that does not use distances
  One of the most popular methods:
    GIVEN a multiple alignment
    FIND a tree and a history of substitutions explaining the alignment
  Idea: find the tree that explains the observed sequences with a minimal number of substitutions
  Two computational subproblems:
    1. Find the parsimony cost of a given tree (easy)
    2. Search through all tree topologies (hard)
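Example: Parsimony cost of one column
  [Figure: tree with leaf characters A, B, A, A. The leaves carry the sets {A}, {B}, {A}, {A}; one internal node receives {A, B}, which adds 1 to the cost (C += 1); the remaining internal nodes receive {A}. Final cost C = 1.]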
Parsimony Scoring
  Given a tree and an alignment column u, label internal nodes to minimize the number of required substitutions.
  Initialization: set cost C = 0; node k = 2N - 1 (the root)
  Iteration:
    If k is a leaf, set $R_k = \{ x_k[u] \}$   // $R_k$ is simply the character of the k-th species
    If k is not a leaf, let i, j be the daughter nodes:
      Set $R_k = R_i \cap R_j$ if the intersection is nonempty
      Set $R_k = R_i \cup R_j$, and C += 1, if the intersection is empty
  Termination: the minimal cost of the tree for column u is C
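The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the nested-tuple tree representation, the function name `fitch_cost`, and the species names are assumptions made for the example.

```python
def fitch_cost(tree, column):
    """Return (cost, root_set) for one alignment column.

    tree:   nested pairs; a leaf is a species name (str), an internal
            node is a tuple (left_subtree, right_subtree).
    column: dict mapping species name -> character in this column.
    """
    cost = 0

    def visit(node):
        nonlocal cost
        if isinstance(node, str):        # leaf: R_k = {x_k[u]}
            return {column[node]}
        left, right = node
        r_i, r_j = visit(left), visit(right)
        common = r_i & r_j
        if common:                       # nonempty intersection: keep it
            return common
        cost += 1                        # empty intersection: union, pay 1
        return r_i | r_j

    root_set = visit(tree)
    return cost, root_set


# Hypothetical four-species tree with leaf characters A, B, A, A.
tree = (("s1", "s2"), ("s3", "s4"))
column = {"s1": "A", "s2": "B", "s3": "A", "s4": "A"}
print(fitch_cost(tree, column))
```

The recursion visits daughters before parents, which is the post-order traversal that the iteration step relies on.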
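Example
  [Figure: sets $R_k$ computed on two example trees, one with leaf characters A, A, A, B and one with leaf characters B, A, B, A; internal nodes carry sets such as {A}, {B}, and {A,B}]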
Traceback to find ancestral nucleotides
  1. Choose an arbitrary nucleotide from $R_{2N-1}$ for the root
  2. Having chosen nucleotide r for parent k:
       If $r \in R_i$, choose r for daughter i
       Else, choose an arbitrary nucleotide from $R_i$
  It is easy to see that this traceback produces some assignment of cost C
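A possible sketch of the two passes together, assuming the same nested-tuple tree as a hypothetical input format. The dictionary node layout and the use of `min` to make the "arbitrary" choice deterministic are illustration choices, not part of the lecture.

```python
def fitch_sets(tree, column):
    """Bottom-up pass: annotate every node with its set R_k."""
    if isinstance(tree, str):
        return {"name": tree, "R": {column[tree]}, "kids": []}
    kids = [fitch_sets(child, column) for child in tree]
    r_i, r_j = kids[0]["R"], kids[1]["R"]
    # Intersection if nonempty, otherwise union.
    return {"name": None, "R": (r_i & r_j) or (r_i | r_j), "kids": kids}


def traceback(node, parent_char=None):
    """Top-down pass: keep the parent's character when it is in R_k."""
    if parent_char in node["R"]:
        node["char"] = parent_char
    else:
        node["char"] = min(node["R"])    # arbitrary choice, made deterministic
    for kid in node["kids"]:
        traceback(kid, node["char"])
    return node


def count_substitutions(node):
    """Number of parent/daughter edges whose characters differ."""
    return sum((kid["char"] != node["char"]) + count_substitutions(kid)
               for kid in node["kids"])


tree = (("s1", "s2"), ("s3", "s4"))
column = {"s1": "A", "s2": "B", "s3": "A", "s4": "B"}
root = traceback(fitch_sets(tree, column))
print(root["char"], count_substitutions(root))
```

Counting the substitutions on the labeled tree gives a direct check that the traceback reaches the cost C found by the scoring pass.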
Example
  [Figure: tree with leaf characters A, B, A, B and root set {A, B}. Several labelings of the internal nodes are shown, each with two substitutions (marked x). Some are admissible with the traceback; another is still optimal, but inadmissible with the traceback.]
Multiple Sequence Alignments
Evolution at the DNA level
    …ACGGTGCAGTTACCA…
    …AC----CAGTCCACCA…
  Sequence edits: mutation, deletion
  Rearrangements: inversion, translocation, duplication
Protein Phylogenies
  Proteins evolve by both duplication and species divergence
Orthology and Paralogy
  [Figure: gene tree with leaves HB (Human), WB (Worm), HA1 (Human), HA2 (Human), Yeast, WA (Worm)]
  Orthologs: derived by speciation
  Paralogs: everything else
Orthology, Paralogy, Inparalogs, Outparalogs
Definition
  Given N sequences $x_1, x_2, \ldots, x_N$:
    Insert gaps (-) in each sequence $x_i$, such that
      All sequences have the same length L
      The score of the global map is maximum
  A faint similarity between two sequences becomes significant if it is present in many sequences
  Multiple alignments reveal elements that are conserved among a class of organisms and are therefore important in their common biology
  The patterns of conservation can help us infer the function of the element
Scoring Function: Sum Of Pairs
  Definition: induced pairwise alignment
    A pairwise alignment induced by the multiple alignment
  Example:
    x: AC-GCGG-C
    y: AC-GC-GAG
    z: GCCGC-GAG
  Induces:
    x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
    y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
Sum Of Pairs (cont’d)
  Heuristic way to incorporate the evolutionary tree:
    [Figure: tree with leaves Human, Mouse, Chicken, Duck]
  Weighted SOP: $S(m) = \sum_{k<l} w_{kl} \, s(m^k, m^l)$
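As a rough illustration of the weighted sum-of-pairs score, the sketch below sums a pairwise score over every induced pairwise alignment. The match/mismatch/gap values and the weight dictionary are assumptions chosen for the example; the lecture does not fix them.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one column of an induced pairwise alignment (assumed values)."""
    if a == "-" and b == "-":
        return 0                 # column vanishes in the induced alignment
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def weighted_sop(alignment, weights=None):
    """S(m) = sum over k<l of w_kl * s(m^k, m^l).

    alignment: dict name -> gapped row (all rows have the same length).
    weights:   dict (name_k, name_l) -> w_kl; defaults to 1 for every pair.
    """
    total = 0.0
    for k, l in combinations(sorted(alignment), 2):
        w = 1.0 if weights is None else weights[(k, l)]
        s = sum(pair_score(a, b) for a, b in zip(alignment[k], alignment[l]))
        total += w * s
    return total


rows = {"x": "AC-GCGG-C", "y": "AC-GC-GAG", "z": "GCCGC-GAG"}
print(weighted_sop(rows))
print(weighted_sop(rows, {("x", "y"): 0.5, ("x", "z"): 1.0, ("y", "z"): 2.0}))
```

In a tree-aware scheme the weights would be derived from the tree, for example giving less weight to closely related pairs such as Chicken and Duck; how exactly to derive them is left open here.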
A Profile Representation
  Given a multiple alignment $M = m_1 \ldots m_n$:
    Replace each column $m_i$ with a profile entry $p_i$
      Frequency of each letter in the alphabet
      # gaps
      Optional: # gap openings, extensions, closings
    Can think of this as a “likelihood” of each letter in each position
  Example alignment:
    - A G G C T A T C A C C T G
    T A G – C T A C C A G
    C A G – C T A C C A G
    C A G – C T A T C A C – G G
    C A G – C T A T C G C – G G
  [Figure: profile table for this alignment, with rows A, C, G, T]
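A minimal sketch of turning alignment columns into profile entries, assuming the alphabet A, C, G, T plus the gap character. The function name `build_profile` and the small input alignment are made up for illustration.

```python
from collections import Counter


def build_profile(rows, alphabet="ACGT-"):
    """Return one dict per column: letter -> frequency in that column."""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all alignment rows must have the same length")
    profile = []
    for column in zip(*rows):            # iterate over alignment columns
        counts = Counter(column)
        profile.append({ch: counts[ch] / len(rows) for ch in alphabet})
    return profile


# Hypothetical four-row alignment.
profile = build_profile(["AC-G", "ACCG", "A-CT", "ACCG"])
for entry in profile:
    print(entry)
```

Gap counts are stored as just another symbol here; counts of gap openings, extensions, and closings would need extra bookkeeping per column.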
Multiple Sequence Alignments Algorithms
Multidimensional DP
  Generalization of Needleman-Wunsch:
    $S(m) = \sum_i S(m_i)$   (sum of column scores)
    $F(i_1, i_2, \ldots, i_N)$: optimal alignment up to $(i_1, \ldots, i_N)$
    $F(i_1, i_2, \ldots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$
Multidimensional DP
  Example: in 3D (three sequences), 7 neighbors per cell

$$
F(i,j,k) = \max \begin{cases}
F(i-1, j-1, k-1) + S(x_i, x_j, x_k) \\
F(i-1, j-1, k) + S(x_i, x_j, -) \\
F(i-1, j, k-1) + S(x_i, -, x_k) \\
F(i-1, j, k) + S(x_i, -, -) \\
F(i, j-1, k-1) + S(-, x_j, x_k) \\
F(i, j-1, k) + S(-, x_j, -) \\
F(i, j, k-1) + S(-, -, x_k)
\end{cases}
$$
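The three-sequence recurrence can be sketched directly. The column score below is an assumed sum-of-pairs score with made-up match, mismatch, and gap values, and the function returns only the optimal score, not the alignment itself.

```python
from itertools import product


def column_score(a, b, c, match=1, mismatch=-1, gap=-1):
    """Assumed sum-of-pairs score of one three-character column."""
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue                      # gap-gap pairs contribute nothing
        total += gap if "-" in (p, q) else (match if p == q else mismatch)
    return total


def align3_score(x, y, z):
    """Optimal score of a global alignment of three sequences."""
    n1, n2, n3 = len(x), len(y), len(z)
    neg_inf = float("-inf")
    F = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    F[0][0][0] = 0
    # The 7 neighbors: every nonzero 0/1 step vector (di, dj, dk).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = x[i - 1] if di else "-"
            b = y[j - 1] if dj else "-"
            c = z[k - 1] if dk else "-"
            F[i][j][k] = max(F[i][j][k], F[pi][pj][pk] + column_score(a, b, c))
    return F[n1][n2][n3]


print(align3_score("GAAC", "GAC", "GTAC"))
```

Cells are visited in lexicographic order, so every predecessor is final before it is read. The table has (n1+1)(n2+1)(n3+1) cells, which is why this only works for short sequences.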
Multidimensional DP
  Running time:
    1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
    2. Neighbors per cell: $2^N - 1$
  Therefore: $O(2^N L^N)$
Multidimensional DP (cont’d)
  How do gap states generalize?
    VERY badly!
    Require $2^N - 1$ states, one per combination of gapped/ungapped sequences
    Running time: $O(2^N \cdot 2^N \cdot L^N) = O(4^N L^N)$
  [Figure: the seven states for three sequences X, Y, Z]
Progressive Alignment
  When the evolutionary tree is known:
    Align closest first, in the order of the tree
    In each step, align two sequences x, y, or profiles $p_x$, $p_y$, to generate a new alignment with associated profile $p_{result}$
  Weighted version:
    Tree edges have weights, proportional to the divergence in that edge
    The new profile is a weighted average of the two old profiles
  [Figure: tree over leaves x, y, z, w with internal profiles $p_{xy}$, $p_{zw}$ and root profile $p_{xyzw}$]
Progressive Alignment (cont’d)
  Example
    Profile: (A, C, G, T, -)
    $p_x$ = (0.8, 0.2, 0, 0, 0)
    $p_y$ = (0.6, 0, 0, 0, 0.4)
    $s(p_x, p_y)$ = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
    Result: $p_{xy}$ = (0.7, 0.1, 0, 0, 0.2)
    $s(p_x, -)$ = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
    Result: $p_{x-}$ = (0.4, 0.1, 0, 0, 0.5)
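The profile arithmetic in this example can be checked with a short sketch. The letter-level scores in `s` are placeholder values chosen for illustration; only the expected-score formula and the averaging follow the slide.

```python
ALPHABET = "ACGT-"


def s(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Letter-level score s(a, b); the numeric values are assumptions."""
    if a == "-" and b == "-":
        return 0.0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def profile_score(p, q):
    """Expected letter score: sum over a, b of p[a] * q[b] * s(a, b)."""
    return sum(p[i] * q[j] * s(a, b)
               for i, a in enumerate(ALPHABET)
               for j, b in enumerate(ALPHABET))


def merge_profiles(p, q, wp=0.5, wq=0.5):
    """Weighted average of two profile entries (equal weights by default)."""
    return tuple(wp * a + wq * b for a, b in zip(p, q))


p_x = (0.8, 0.2, 0.0, 0.0, 0.0)
p_y = (0.6, 0.0, 0.0, 0.0, 0.4)
gap_column = (0.0, 0.0, 0.0, 0.0, 1.0)

print(profile_score(p_x, p_y), merge_profiles(p_x, p_y))
print(profile_score(p_x, gap_column), merge_profiles(p_x, gap_column))
```

Terms with zero frequency drop out, so only the four products listed in the example contribute to $s(p_x, p_y)$. For the weighted version, `wp` and `wq` would come from the tree edge weights.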
Progressive Alignment
  When the evolutionary tree is unknown:
    Perform all pairwise alignments
    Define a distance matrix D, where D(x, y) is a measure of evolutionary distance based on the pairwise alignment
    Construct a tree (UPGMA / Neighbor Joining / other methods)
    Align on the tree
  [Figure: sequences x, y, z, w with an unknown tree]
Heuristics to improve alignments
  Iterative refinement schemes
  A*-based search
  Consistency
  Simulated Annealing
  …
Iterative Refinement
  One problem of progressive alignment: initial alignments are “frozen” even when new evidence comes
  Example:
    x: GAAGTT
    y: GAC-TT     (frozen!)
    z: GAACTG
    w: GTACTG
  Now it is clear that the correct alignment is y = GA-CTT
Iterative Refinement
  Algorithm (Barton-Sternberg):
    1. For j = 1 to N, remove $x_j$ and realign it to $x_1 \ldots x_{j-1} x_{j+1} \ldots x_N$
    2. Repeat step 1 until convergence
  [Figure: sequences x, y, z; the projection of x, z is held fixed while y is allowed to vary]
Iterative Refinement
  Example: align (x,y), (z,w), (xy, zw):
    x: GAAGTTA
    y: GAC-TTA
    z: GAACTGA
    w: GTACTGA
  After realigning y:
    x: GAAGTTA
    y: G-ACTTA     (+3 matches)
    z: GAACTGA
    w: GTACTGA
Iterative Refinement
  Example not handled well:
    x:   GAAGTTA
    y_1: GAC-TTA
    y_2: GAC-TTA
    y_3: GAC-TTA
    z:   GAACTGA
    w:   GTACTGA
  Realigning any single $y_i$ changes nothing
Consistency
  [Figure: sequences x, y, z with positions $x_i$, $y_j$, $y_{j'}$, $z_k$ marked]
Consistency
  Basic method for applying consistency:
    Compute all pairs of alignments xy, xz, yz, …
    When aligning x, y during progressive alignment:
      For each $(x_i, y_j)$, let $s(x_i, y_j)$ = function_of($x_i$, $y_j$, $a_{xz}$, $a_{yz}$)
      Align x and y with DP using the modified s(.,.) function
  [Figure: sequences x, y, z with positions $x_i$, $y_j$, $y_{j'}$, $z_k$ marked]
Real-world protein aligners
  MUSCLE
    High throughput
    One of the best in accuracy
  ProbCons
    High accuracy
    Reasonable speed
MUSCLE at a glance
  1. Fast measurement of all pairwise distances between sequences
       $D_{DRAFT}(x, y)$ is defined in terms of the number of common k-mers (k ~ 3); $O(N^2 L \log L)$ time
  2. Build tree $T_{DRAFT}$ based on those distances, with UPGMA
  3. Progressive alignment over $T_{DRAFT}$, resulting in multiple alignment $M_{DRAFT}$
  4. Measure new Kimura-based distances D(x, y) based on $M_{DRAFT}$
  5. Build tree T based on D
  6. Progressive alignment over T, to build M
       Only perform alignment steps for the parts of the tree that have changed
  7. Iterative refinement; for many rounds, do:
       Tree partitioning: split M on one branch and realign the two resulting profiles
       If the new alignment M’ has a better sum-of-pairs score than the previous one, accept
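To make step 1 concrete, here is one simple way to define a distance from shared k-mers. This is an illustrative stand-in only: the normalization below is an assumption, and MUSCLE's own k-mer distance is defined differently in its details.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Multiset of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k=3):
    """1 - (shared k-mers / k-mers in the shorter sequence); assumed formula."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum((cx & cy).values())      # multiset intersection
    denom = min(len(x), len(y)) - k + 1
    if denom <= 0:
        return 1.0                        # too short to contain any k-mer
    return 1.0 - shared / denom


seqs = {"x": "GAAGTTA", "y": "GACTTA", "z": "GAACTGA", "w": "GTACTGA"}
names = sorted(seqs)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(kmer_distance(seqs[a], seqs[b]), 3))
```

No alignment is computed at this stage, which is what makes the draft distances cheap compared with all-pairs dynamic programming.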
PROBCONS at a glance
  1. Computation of all posterior matrices $M_{xy}$: $M_{xy}(i, j) = \mathrm{Prob}(x_i \sim y_j)$, using an HMM
  2. Re-estimation of posterior matrices $M'_{xy}$ with probabilistic consistency:
       $M'_{xy}(i, j) = \frac{1}{N} \sum_{\text{sequence } z} \sum_k M_{xz}(i, k) \, M_{yz}(j, k)$;   $M'_{xy} = \mathrm{Avg}_z (M_{xz} M_{zy})$
  3. Compute, for every pair x, y, the maximum expected accuracy alignment $A_{xy}$: the alignment A that maximizes $\sum_{\text{aligned } (i, j) \text{ in } A} M'_{xy}(i, j)$
       Define $E(x, y) = \sum_{\text{aligned } (i, j) \text{ in } A_{xy}} M'_{xy}(i, j)$
  4. Build tree T with hierarchical clustering using similarity measure E(x, y)
  5. Progressive alignment on T to maximize E(.,.)
  6. Iterative refinement; for many rounds, do:
       Randomized partitioning: split the sequences in M into two subsets by flipping a coin for each sequence, and realign the two resulting profiles
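The consistency re-estimation in step 2 amounts to averaging matrix products. The sketch below uses plain lists of lists and assumes that the sum over z includes x and y themselves, with $M_{xx}$ and $M_{yy}$ taken to be identity matrices; the dictionary layout `post[(p, q)]` is a made-up convention for this example.

```python
def matmul_t(a, b):
    """Return the matrix with entries sum_k a[i][k] * b[j][k]."""
    return [[sum(ai[k] * bj[k] for k in range(len(ai))) for bj in b]
            for ai in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def consistency(post, x, y, names):
    """M'_xy(i, j) = (1/N) * sum_z sum_k M_xz(i, k) * M_yz(j, k).

    post[(p, q)][i][j] holds Prob(p_i ~ q_j) for every ordered pair
    (p, q) of names, including p == q (assumed to be identity).
    """
    rows, cols = len(post[(x, y)]), len(post[(x, y)][0])
    total = [[0.0] * cols for _ in range(rows)]
    for z in names:
        prod = matmul_t(post[(x, z)], post[(y, z)])
        for i in range(rows):
            for j in range(cols):
                total[i][j] += prod[i][j]
    return [[v / len(names) for v in row] for row in total]


# Toy posteriors for three length-2 sequences (made-up numbers).
identity = [[1.0, 0.0], [0.0, 1.0]]
m_xy = [[0.5, 0.0], [0.0, 0.5]]
post = {("x", "x"): identity, ("y", "y"): identity, ("z", "z"): identity,
        ("x", "y"): m_xy, ("y", "x"): transpose(m_xy),
        ("x", "z"): identity, ("z", "x"): identity,
        ("y", "z"): identity, ("z", "y"): identity}
print(consistency(post, "x", "y", ["x", "y", "z"]))
```

In this toy case the weak direct evidence for aligning $x_i$ with $y_i$ (0.5) is raised because both align confidently to the same positions of z.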
Some Resources
  Genome Resources
    Annotation and alignment genome browser at UCSC
    Specialized VISTA alignment browser at LBNL
    ABC: a nice Stanford tool for browsing alignments
  Protein Multiple Aligners
    CLUSTALW: most widely used
    MUSCLE: most scalable
    PROBCONS: most accurate