CS262 Lecture 9, Win07, Batzoglou Phylogeny Tree Reconstruction



Phylogenetic Trees
Nodes: species
Edges: time of independent evolution
Edge length represents evolution time
- AKA genetic distance
- Not necessarily chronological time

Parsimony: a direct method not using distances
One of the most popular methods:
- GIVEN multiple alignment
- FIND tree & history of substitutions explaining alignment
Idea: Find the tree that explains the observed sequences with a minimal number of substitutions
Two computational subproblems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)

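Example: Parsimony cost of one column
[Figure: tree with leaves A, B, A, A. Where two daughter sets do not intersect, the node is labeled {A, B} and the cost is incremented (C += 1); the remaining internal nodes are labeled {A}. Final cost C = 1.]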

Parsimony Scoring
Given a tree, and an alignment column u
Label internal nodes to minimize the number of required substitutions
Initialization: Set cost C = 0; node k = 2N - 1 (the root)
Iteration:
If k is a leaf, set R_k = { x_k[u] }   // R_k is simply the character of the k-th species
If k is not a leaf, let i, j be the daughter nodes (score them first);
  Set R_k = R_i ∩ R_j if the intersection is nonempty
  Set R_k = R_i ∪ R_j, and C += 1, if the intersection is empty
Termination: Minimal cost of tree for column u, = C
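The scoring recursion above can be sketched in Python. This is only a minimal illustration under stated assumptions: the tree is encoded as nested 2-tuples with species names at the leaves, and the column is a dict from species name to character. The names parsimony_cost, tree, and column are hypothetical and do not come from the lecture.

```python
def parsimony_cost(tree, column):
    """Return (C, R) for one alignment column.

    tree:   nested 2-tuples; a leaf is a species name (assumed encoding).
    column: dict mapping species name -> character in this column.
    """
    cost = 0
    sets = {}  # node -> candidate set R_k

    def visit(node):
        nonlocal cost
        if not isinstance(node, tuple):        # leaf: R_k = {x_k[u]}
            sets[node] = {column[node]}
            return sets[node]
        # Daughters are scored before their parent (post-order).
        r_i, r_j = visit(node[0]), visit(node[1])
        common = r_i & r_j
        if common:                             # nonempty intersection
            sets[node] = common
        else:                                  # empty: take union, pay 1
            sets[node] = r_i | r_j
            cost += 1
        return sets[node]

    visit(tree)
    return cost, sets


# One column with leaves A, B, A, A on an assumed balanced tree.
tree = (("s1", "s2"), ("s3", "s4"))
column = {"s1": "A", "s2": "B", "s3": "A", "s4": "A"}
cost, sets = parsimony_cost(tree, column)
print(cost, sets[tree])  # prints: 1 {'A'}
```

Summing this cost over all alignment columns gives the parsimony cost of the tree for the whole alignment.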

Example
[Figure: parsimony sets computed on a tree whose leaves carry A, A, A, B, B, A, B, A; the node sets shown are {A}, {B}, and {A,B}.]

Traceback to find ancestral nucleotides
Traceback:
1. Choose an arbitrary nucleotide from R_{2N-1} for the root
2. Having chosen nucleotide r for parent k,
   If r ∈ R_i, choose r for daughter i
   Else, choose an arbitrary nucleotide from R_i
Easy to see that this traceback produces some assignment of cost C
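A short sketch of this traceback, under the same assumed encoding (nested 2-tuples for the tree, a dict of candidate sets R_k keyed by node). The function names and the pick parameter, which stands in for the "arbitrary" choice, are hypothetical.

```python
def traceback(tree, sets, pick=min):
    """Assign one character to every node, given the candidate sets R_k."""
    labels = {}

    def assign(node, parent_char):
        if parent_char in sets[node]:      # keep the parent's character
            labels[node] = parent_char
        else:                              # otherwise any member of R_i
            labels[node] = pick(sets[node])
        if isinstance(node, tuple):
            for child in node:
                assign(child, labels[node])

    assign(tree, None)                     # the root has no parent
    return labels


def count_substitutions(tree, labels):
    """Number of edges whose two endpoints carry different characters."""
    if not isinstance(tree, tuple):
        return 0
    return sum((labels[child] != labels[tree]) + count_substitutions(child, labels)
               for child in tree)


# Candidate sets for leaves A, B, A, B on an assumed balanced tree.
tree = (("s1", "s2"), ("s3", "s4"))
left, right = tree
sets = {"s1": {"A"}, "s2": {"B"}, "s3": {"A"}, "s4": {"B"},
        left: {"A", "B"}, right: {"A", "B"}, tree: {"A", "B"}}
labels = traceback(tree, sets)
print(labels[tree], count_substitutions(tree, labels))  # prints: A 2
```

Passing pick=max selects B at the root instead and yields another labeling with the same number of substitutions.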

Example
[Figure: tree with leaves A, B, A, B and candidate sets {A, B}, {A}, {B}. Three internal-node labelings are shown (A A A; A B A; B B B), each with two substitutions marked x. Captions: "Admissible with Traceback" and "Still optimal, but inadmissible with Traceback".]

Multiple Sequence Alignments

Evolution at the DNA level
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
SEQUENCE EDITS: Mutation, Deletion
REARRANGEMENTS: Inversion, Translocation, Duplication

Protein Phylogenies
Proteins evolve by both duplication and species divergence

Orthology and Paralogy
[Figure: gene tree with leaves HB Human, WB Worm, HA1 Human, HA2 Human, Yeast, WA Worm]
Orthologs: Derived by speciation
Paralogs: Everything else

Orthology, Paralogy, Inparalogs, Outparalogs


Definition
Given N sequences x_1, x_2, …, x_N:
- Insert gaps (-) in each sequence x_i, such that
  - All sequences have the same length L
  - Score of the global map is maximum
A faint similarity between two sequences becomes significant if present in many
Multiple alignments reveal elements that are conserved among a class of organisms and therefore important in their common biology
The patterns of conservation can help us tell the function of the element

Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
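The induced pairwise alignments can be computed mechanically: take two rows of the multiple alignment and drop every column in which both rows have a gap. The sketch below does that and then sums a pairwise score over all pairs of rows. The scoring values (match +1, mismatch -1, gap -2) and all function names are assumptions chosen for illustration, not values from the lecture.

```python
from itertools import combinations


def induced_pairwise(row_a, row_b):
    """Project two rows of a multiple alignment, dropping gap-gap columns."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if (a, b) != ("-", "-")]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)


def simple_score(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical per-column pairwise score."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows, score=simple_score):
    """Sum the scores of all induced pairwise alignments."""
    total = 0
    for k, l in combinations(range(len(rows)), 2):
        a, b = induced_pairwise(rows[k], rows[l])
        total += sum(score(p, q) for p, q in zip(a, b))
    return total


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]   # x, y, z from the example
print(induced_pairwise(rows[0], rows[1]))  # prints: ('ACGCGG-C', 'ACGC-GAG')
print(sum_of_pairs(rows))                  # prints: -1
```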

Sum Of Pairs (cont’d)
Heuristic way to incorporate evolution tree:
[Figure: tree over Human, Mouse, Chicken, Duck]
Weighted SOP:
S(m) = Σ_{k < l} w_kl s(m^k, m^l)

A Profile Representation
Given a multiple alignment M = m_1 … m_n
- Replace each column m_i with profile entry p_i
  - Frequency of each letter in Σ
  - # gaps
  - Optional: # gap openings, extensions, closings
- Can think of this as a “likelihood” of each letter in each position
[Figure: example multiple alignment of five sequences, the first being - A G G C T A T C A C C T G, shown with its profile table (rows A, C, G, T).]
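As a rough illustration of the profile idea, the sketch below turns each alignment column into a tuple of frequencies over (A, C, G, T, -). The toy alignment and the name build_profile are hypothetical; gap openings, extensions, and closings are not tracked here.

```python
from collections import Counter


def build_profile(rows, alphabet="ACGT-"):
    """Return one frequency tuple per column, ordered as in `alphabet`."""
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all rows of an alignment must have the same length")
    profile = []
    for col in zip(*rows):                 # iterate over columns
        counts = Counter(col)
        profile.append(tuple(counts[ch] / len(rows) for ch in alphabet))
    return profile


# Toy alignment (not the one on the slide).
rows = ["AC-G",
        "AC-G",
        "A-TG",
        "CCTG"]
profile = build_profile(rows)
print(profile[0])  # prints: (0.75, 0.25, 0.0, 0.0, 0.0)
```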

Multiple Sequence Alignments Algorithms

Multidimensional DP
Generalization of Needleman-Wunsch:
S(m) = Σ_i S(m_i)   (sum of column scores)
F(i_1, i_2, …, i_N): Optimal alignment up to (i_1, …, i_N)
F(i_1, i_2, …, i_N) = max over all neighbors of the cube of (F(nbr) + S(nbr))

Multidimensional DP
Example: in 3D (three sequences): 7 neighbors/cell
F(i, j, k) = max{
  F(i-1, j-1, k-1) + S(x_i, x_j, x_k),
  F(i-1, j-1, k  ) + S(x_i, x_j, - ),
  F(i-1, j,   k-1) + S(x_i, -,   x_k),
  F(i-1, j,   k  ) + S(x_i, -,   - ),
  F(i,   j-1, k-1) + S(-,   x_j, x_k),
  F(i,   j-1, k  ) + S(-,   x_j, - ),
  F(i,   j,   k-1) + S(-,   -,   x_k)
}
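The three-sequence recurrence can be written out directly. The sketch below is a small, unoptimized illustration that returns only the optimal score. It assumes a sum-of-pairs column score with hypothetical values (match +1, mismatch -1, gap -2, gap against gap 0); the function name align3 is also an assumption.

```python
from itertools import product


def align3(x, y, z, match=1, mismatch=-1, gap=-2):
    """Optimal sum-of-pairs score for aligning three sequences."""

    def pair(a, b):
        if a == "-" and b == "-":
            return 0
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch

    def column_score(a, b, c):
        return pair(a, b) + pair(a, c) + pair(b, c)

    # The 7 neighbors: every nonzero 0/1 step in (i, j, k).
    moves = [m for m in product((0, 1), repeat=3) if any(m)]

    F = {(0, 0, 0): 0}
    # product() yields cells in lexicographic order, so every
    # predecessor of a cell is filled in before the cell itself.
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = None
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            a = x[pi] if di else "-"
            b = y[pj] if dj else "-"
            c = z[pk] if dk else "-"
            cand = F[(pi, pj, pk)] + column_score(a, b, c)
            if best is None or cand > best:
                best = cand
        F[(i, j, k)] = best
    return F[(len(x), len(y), len(z))]


print(align3("ACG", "ACG", "ACG"))  # prints: 9
```

The table has (|x|+1)(|y|+1)(|z|+1) cells with up to 7 predecessors each, so this is only practical for short sequences.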

Multidimensional DP
Running Time:
1. Size of matrix: L^N, where L = length of each sequence and N = number of sequences
2. Neighbors/cell: 2^N - 1
Therefore: O(2^N L^N)

Multidimensional DP (cont’d)
How do gap states generalize? VERY badly!
- Require 2^N - 1 states, one per combination of gapped/ungapped sequences
- Running time: O(2^N × 2^N × L^N) = O(4^N L^N)
[Figure: the 7 states for three sequences X, Y, Z: XY, XYZ, Z, Y, YZ, X, XZ]
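To get a feel for these growth rates, the sketch below enumerates the 2^N - 1 neighbor offsets and multiplies out the cell and transition counts. It is a back-of-the-envelope count rather than a benchmark, and the function names are hypothetical.

```python
from itertools import product


def neighbor_offsets(n):
    """All nonzero 0/1 steps in n dimensions: 2^n - 1 of them."""
    return [o for o in product((0, 1), repeat=n) if any(o)]


def dp_work(length, n, gap_states=False):
    """Rough operation count for n sequences of the given length."""
    cells = length ** n
    per_cell = 2 ** n - 1                  # neighbors per cell
    if gap_states:
        per_cell *= 2 ** n - 1             # one state per gapped/ungapped pattern
    return cells * per_cell


print(len(neighbor_offsets(3)))            # prints: 7
print(dp_work(100, 3))                     # prints: 7000000
print(dp_work(100, 3, gap_states=True))    # prints: 49000000
```

The same count for six sequences of length 100 is 63 × 10^12 without gap states, which is why the heuristics that follow avoid the full hypercube.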

Progressive Alignment
When evolutionary tree is known:
- Align closest first, in the order of the tree
- In each step, align two sequences x, y, or profiles p_x, p_y, to generate a new alignment with associated profile p_result
Weighted version:
- Tree edges have weights, proportional to the divergence in that edge
- New profile is a weighted average of two old profiles
[Figure: tree over x, y, z, w with internal profiles p_xy and p_zw and root profile p_xyzw]

Progressive Alignment (cont’d)
Example Profile: (A, C, G, T, -)
p_x = (0.8, 0.2, 0, 0, 0)
p_y = (0.6, 0, 0, 0, 0.4)
s(p_x, p_y) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
Result: p_xy = (0.7, 0.1, 0, 0, 0.2)
s(p_x, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
Result: p_x- = (0.4, 0.1, 0, 0, 0.5)
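The profile example can be checked numerically. The sketch below scores one profile column against another as the expected letter-to-letter score, and merges two columns by a weighted average. The substitution scores (match +1, mismatch -1, gap -2, gap against gap 0) and the function names are assumptions made for illustration.

```python
ALPHABET = "ACGT-"


def s(a, b):
    """Hypothetical letter-to-letter score."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def profile_column_score(p, q, score=s):
    """Expected score of a letter drawn from p against a letter drawn from q."""
    return sum(p[i] * q[j] * score(ALPHABET[i], ALPHABET[j])
               for i in range(len(ALPHABET))
               for j in range(len(ALPHABET)))


def merge_columns(p, q, w_p=0.5, w_q=0.5):
    """Weighted average of two profile columns (equal weights by default)."""
    return tuple(w_p * a + w_q * b for a, b in zip(p, q))


p_x = (0.8, 0.2, 0, 0, 0)
p_y = (0.6, 0, 0, 0, 0.4)
gap_column = (0, 0, 0, 0, 1.0)             # aligning p_x against a gap

print(round(profile_column_score(p_x, p_y), 2))                 # prints: -0.44
print(tuple(round(v, 2) for v in merge_columns(p_x, p_y)))      # prints: (0.7, 0.1, 0, 0, 0.2)
print(tuple(round(v, 2) for v in merge_columns(p_x, gap_column)))  # prints: (0.4, 0.1, 0, 0, 0.5)
```

In the weighted version of progressive alignment, w_p and w_q would be derived from the tree edge weights instead of being fixed at 0.5.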

Progressive Alignment
When evolutionary tree is unknown:
- Perform all pairwise alignments
- Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
- Construct a tree (UPGMA / Neighbor Joining / Other methods)
- Align on the tree
[Figure: sequences x, y, z, w with an unknown tree]
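The tree-construction step can be sketched with UPGMA, which repeatedly merges the two closest clusters and averages their distances to everything else, weighted by cluster size. The version below returns only the topology as nested tuples (no branch lengths); the distance values and the name upgma are hypothetical.

```python
def upgma(names, dist):
    """Build a guide-tree topology from pairwise distances.

    dist maps (a, b) -> distance for every unordered pair of names,
    in either key order.
    """
    d = dict(dist)

    def get(a, b):
        return d[(a, b)] if (a, b) in d else d[(b, a)]

    clusters = [(name, 1) for name in names]   # (subtree, number of leaves)
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: get(clusters[ij[0]][0], clusters[ij[1]][0]),
        )
        (a, n_a), (b, n_b) = clusters[i], clusters[j]
        merged = (a, b)
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        # Size-weighted average distance from the new cluster to the others.
        for c, _ in rest:
            d[(merged, c)] = (n_a * get(a, c) + n_b * get(b, c)) / (n_a + n_b)
        clusters = rest + [(merged, n_a + n_b)]
    return clusters[0][0]


# Made-up distances in which x, y are closest, then z, w.
D = {("x", "y"): 2, ("z", "w"): 4,
     ("x", "z"): 8, ("x", "w"): 8, ("y", "z"): 8, ("y", "w"): 8}
print(upgma(["x", "y", "z", "w"], D))  # prints: (('x', 'y'), ('z', 'w'))
```

Progressive alignment would then align x with y, z with w, and finally the two resulting profiles.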

Heuristics to improve alignments
- Iterative refinement schemes
- A*-based search
- Consistency
- Simulated Annealing
- …

Iterative Refinement
One problem of progressive alignment:
Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT   Frozen!
z: GAACTG
w: GTACTG
Once z and w are added, it is clear that the correct alignment is y = GA-CTT

Iterative Refinement
Algorithm (Barton-Sternberg):
1. For j = 1 to N, remove x_j, and realign it to x_1 … x_{j-1} x_{j+1} … x_N
2. Repeat step 1 until convergence
[Figure: sequences x, y, z; the projection of x, z is held fixed while y is allowed to vary]

Iterative Refinement
Example: align (x, y), (z, w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA   + 3 matches
z: GAACTGA
w: GTACTGA

Iterative Refinement
Example not handled well:
x:   GAAGTTA
y_1: GAC-TTA
y_2: GAC-TTA
y_3: GAC-TTA
z:   GAACTGA
w:   GTACTGA
Realigning any single y_i changes nothing

Consistency
[Figure: sequences x, y, z with positions x_i, y_j, y_j', z_k]

Consistency
Basic method for applying consistency
- Compute all pairs of alignments xy, xz, yz, …
- When aligning x, y during progressive alignment,
  - For each (x_i, y_j), let s(x_i, y_j) = function_of(x_i, y_j, a_xz, a_yz)
  - Align x and y with DP using the modified s(.,.) function
[Figure: sequences x, y, z with positions x_i, y_j, y_j', z_k]

Real-world protein aligners
MUSCLE
- High throughput
- One of the best in accuracy
ProbCons
- High accuracy
- Reasonable speed

MUSCLE at a glance
1. Fast measurement of all pairwise distances between sequences
   D_DRAFT(x, y) defined in terms of # common k-mers (k~3): O(N^2 L log L) time
2. Build tree T_DRAFT based on those distances, with UPGMA
3. Progressive alignment over T_DRAFT, resulting in multiple alignment M_DRAFT
4. Measure new Kimura-based distances D(x, y) based on M_DRAFT
5. Build tree T based on D
6. Progressive alignment over T, to build M
   Only perform alignment steps for the parts of the tree that have changed
7. Iterative refinement; for many rounds, do:
   Tree Partitioning: Split M on one branch and realign the two resulting profiles
   If new alignment M' has better sum-of-pairs score than previous one, accept
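Step 1 relies on an alignment-free distance built from shared k-mers. The sketch below shows one simple distance of that kind: one minus the fraction of k-mers shared, relative to the shorter sequence. This is an illustrative assumption and not necessarily the exact formula MUSCLE uses; the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Multiset of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k=3):
    """1 - (shared k-mers) / (k-mers in the shorter sequence)."""
    shortest = min(len(x), len(y)) - k + 1
    if shortest <= 0:
        raise ValueError("both sequences must have at least k letters")
    # Counter intersection keeps the minimum count of each k-mer.
    shared = sum((kmer_counts(x, k) & kmer_counts(y, k)).values())
    return 1.0 - shared / shortest


print(kmer_distance("GAACTG", "GAACTG"))  # prints: 0.0
print(kmer_distance("GAACTG", "GTACTG"))  # prints: 0.5
```

Because no alignment is computed, all N(N-1)/2 distances can be obtained quickly before the draft tree is built.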

PROBCONS at a glance
1. Computation of all posterior matrices M_xy: M_xy(i, j) = Prob(x_i ~ y_j), using an HMM
2. Re-estimation of posterior matrices M'_xy with probabilistic consistency
   M'_xy(i, j) = (1/N) Σ_{sequence z} Σ_k M_xz(i, k) × M_yz(j, k);   M'_xy = Avg_z(M_xz M_zy)
3. Compute for every pair x, y, the maximum expected accuracy alignment
   A_xy: alignment that maximizes Σ_{aligned (i, j) in A} M'_xy(i, j)
   Define E(x, y) = Σ_{aligned (i, j) in A_xy} M'_xy(i, j)
4. Build tree T with hierarchical clustering using similarity measure E(x, y)
5. Progressive alignment on T to maximize E(.,.)
6. Iterative refinement; for many rounds, do:
   Randomized Partitioning: Split sequences in M in two subsets by flipping a coin for each sequence and realign the two resulting profiles
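Steps 2 and 3 can be sketched with plain Python lists. The code below averages the matrix products M_xz M_zy over all sequences z and then runs a small DP that maximizes the summed posterior of the aligned pairs. It assumes that z also ranges over x and y themselves, with M_xx taken as the identity matrix; that convention, the toy numbers, and all function names are assumptions for illustration.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def consistency_transform(posteriors, lengths, x, y):
    """M'_xy = average over z of M_xz M_zy (M_zz assumed to be identity)."""

    def m(a, b):
        if a == b:
            n = lengths[a]
            return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
        if (a, b) in posteriors:
            return posteriors[(a, b)]
        return transpose(posteriors[(b, a)])   # M_ba is the transpose of M_ab

    total = [[0.0] * lengths[y] for _ in range(lengths[x])]
    for z in lengths:
        for i, row in enumerate(matmul(m(x, z), m(z, y))):
            for j, value in enumerate(row):
                total[i][j] += value
    return [[v / len(lengths) for v in row] for row in total]


def expected_accuracy(m):
    """Best achievable sum of M(i, j) over aligned pairs, by DP."""
    rows, cols = len(m), len(m[0])
    best = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            best[i][j] = max(best[i - 1][j - 1] + m[i - 1][j - 1],  # align i with j
                             best[i - 1][j],                        # skip x_i
                             best[i][j - 1])                        # skip y_j
    return best[rows][cols]


# Toy case: three sequences of length 1.
lengths = {"x": 1, "y": 1, "z": 1}
posteriors = {("x", "y"): [[0.6]], ("x", "z"): [[0.9]], ("y", "z"): [[0.9]]}
m_new = consistency_transform(posteriors, lengths, "x", "y")
print(round(m_new[0][0], 2))  # prints: 0.67
```

In the toy case the support through z (0.9 × 0.9 = 0.81) raises the x-y posterior from 0.6 to about 0.67.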

Some Resources
Genome Resources
- Annotation and alignment genome browser at UCSC
- Specialized VISTA alignment browser at LBNL
- ABC: Nice Stanford tool for browsing alignments
Protein Multiple Aligners
- CLUSTALW: most widely used
- MUSCLE: most scalable
- PROBCONS: most accurate