Multiple Sequence Alignment


Evolution at the DNA level
Sequence edits: mutation, deletion
  …ACGGTGCAGTTACCA…
  …AC----CAGTCCACCA…
Rearrangements: inversion, translocation, duplication

All Homologous Sequences Coalesce
Y-chromosome coalescence
Trevor Lamb: As a result of recent work, it has become possible to track the changes that enabled an ancestral chordate C-opsin (that exhibited many properties in common with R-opsins) to evolve into the immediate pre-cursor of modern cone and rod opsins.
http://webvision.umh.es/webvision/Evolution.%20PART%20II.html

Orthology and Paralogy
Orthologs: derived by speciation
Paralogs: everything else
[Figure: gene tree with leaves Yeast, HA1 (Human), HA2 (Human), WA (Worm), HB (Human), WB (Worm)]

Orthology, Paralogy, Inparalogs, Outparalogs

Phylogenetic Trees
Nodes: species
Edges: time of independent evolution
Edge length represents evolution time, also known as genetic distance; not necessarily chronological time

Inferring Phylogenetic Trees
Trees can be inferred by several criteria:
Morphology of the organisms (can lead to mistakes)
Sequence comparison
Example:
Mouse:  ACAGTGACGCCCCAAACGT
Rat:    ACAGTGACGCTACAAACGT
Baboon: CCTGTGACGTAACAAACGA
Chimp:  CCTGTGACGTAGCAAACGA
Human:  CCTGTGACGTAGCAAACGA

Distance Between Two Sequences
Basic principle: distance is proportional to the degree of independent sequence evolution
Given sequences $x_i$, $x_j$, let $d_{ij}$ = distance between the two sequences
One possible definition: $d_{ij}$ = fraction $f$ of sites $u$ where $x_i[u] \neq x_j[u]$
Better scores are derived by modeling evolution as a continuous change process
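The "fraction of differing sites" definition can be sketched in a few lines. This is only an illustration, assuming the sequences are already aligned and of equal length; the function name `p_distance` is made up here, and the sequences are the ones from the "Inferring Phylogenetic Trees" slide.

```python
def p_distance(x: str, y: str) -> float:
    """Fraction of aligned sites u at which x[u] != y[u]."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    if not x:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


# Sequences from the "Inferring Phylogenetic Trees" slide.
seqs = {
    "Mouse":  "ACAGTGACGCCCCAAACGT",
    "Rat":    "ACAGTGACGCTACAAACGT",
    "Baboon": "CCTGTGACGTAACAAACGA",
    "Chimp":  "CCTGTGACGTAGCAAACGA",
    "Human":  "CCTGTGACGTAGCAAACGA",
}

for a, b in [("Mouse", "Rat"), ("Chimp", "Human"), ("Mouse", "Human")]:
    print(a, b, round(p_distance(seqs[a], seqs[b]), 3))
```

Mouse and Rat differ at 2 of 19 sites, Chimp and Human at none, and Mouse and Human at 6 of 19, which already separates the rodents from the primates.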

Molecular Evolution
Modeling sequence substitution:
Consider what happens at a position over a short time Δt
P(t) = vector of probabilities of {A,C,G,T} at time t
Given an alignment between two sequences, we can estimate P(t)
(Simplistic) Count non-match positions in the alignment
How do we estimate t from that information?

Molecular Evolution
Modeling sequence substitution:
Consider what happens at a position over a short time Δt
P(t) = vector of probabilities of {A,C,G,T} at time t
$\mu_{AC}$ = rate of transition from A to C per unit time
$\mu_A = \mu_{AC} + \mu_{AG} + \mu_{AT}$ = rate of transition out of A
$p_A(t+\Delta t) = p_A(t) - p_A(t)\,\mu_A\,\Delta t + p_C(t)\,\mu_{CA}\,\Delta t + p_G(t)\,\mu_{GA}\,\Delta t + p_T(t)\,\mu_{TA}\,\Delta t$

Molecular Evolution
In matrix/vector notation, we get
$P(t+\Delta t) = P(t) + Q\,P(t)\,\Delta t$
where Q is the substitution rate matrix
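Written out entry by entry, and assuming $P(t)$ is the column vector $(p_A, p_C, p_G, p_T)^T$, the update for $p_A$ on the previous slide fixes the first row of $Q$. The remaining rows are filled in by the same pattern; this layout is an inference from that one equation, not something shown on the slide.

```latex
Q =
\begin{pmatrix}
-\mu_A   & \mu_{CA} & \mu_{GA} & \mu_{TA} \\
\mu_{AC} & -\mu_C   & \mu_{GC} & \mu_{TC} \\
\mu_{AG} & \mu_{CG} & -\mu_G   & \mu_{TG} \\
\mu_{AT} & \mu_{CT} & \mu_{GT} & -\mu_T
\end{pmatrix},
\qquad
\mu_X = \sum_{Y \neq X} \mu_{XY}
```

With this convention each column of $Q$ sums to zero, so the entries of $P(t)$ continue to sum to 1 after every step.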

Molecular Evolution
This is a differential equation: $P'(t) = Q\,P(t)$
Q determines a probability distribution over {A,C,G,T} at each position, with stationary (equilibrium) frequencies $\pi_A, \pi_C, \pi_G, \pi_T$
Each Q is an evolutionary model
Some work better than others

Evolutionary Models
Jukes-Cantor, Kimura, Felsenstein, HKY

Estimating Distances
Solve the differential equation and compute expected evolutionary time given sequences
$P'(t) = Q\,P(t)$
Jukes-Cantor:
Let $P_{AA}(t) = P_{CC}(t) = P_{GG}(t) = P_{TT}(t) = r(t)$ and $P_{AC}(t) = \dots = P_{TG}(t) = s(t)$
Then
$r'(t) = -\tfrac{3}{4}\, m\, r(t) + \tfrac{3}{4}\, m\, s(t)$
$s'(t) = -\tfrac{1}{4}\, m\, s(t) + \tfrac{1}{4}\, m\, r(t)$
which is satisfied by
$r(t) = \tfrac{1}{4}\,(1 + 3e^{-mt})$
$s(t) = \tfrac{1}{4}\,(1 - e^{-mt})$
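As a sanity check on the closed form, the two Jukes-Cantor equations for r and s can be integrated numerically and compared with the stated solution. The sketch below uses plain forward Euler steps; the function names, the step count, and the values m = 1, t = 1 are arbitrary choices for illustration.

```python
import math


def jc_closed_form(m: float, t: float):
    """Closed-form Jukes-Cantor solution quoted on the slide."""
    r = 0.25 * (1 + 3 * math.exp(-m * t))
    s = 0.25 * (1 - math.exp(-m * t))
    return r, s


def jc_euler(m: float, t: float, steps: int = 100_000):
    """Integrate r' = -3/4 m r + 3/4 m s and s' = -1/4 m s + 1/4 m r
    with forward Euler, starting from r(0) = 1, s(0) = 0."""
    r, s = 1.0, 0.0  # at t = 0 the base is unchanged with probability 1
    dt = t / steps
    for _ in range(steps):
        dr = (-0.75 * m * r + 0.75 * m * s) * dt
        ds = (-0.25 * m * s + 0.25 * m * r) * dt
        r, s = r + dr, s + ds
    return r, s


if __name__ == "__main__":
    print("closed form:", jc_closed_form(1.0, 1.0))
    print("Euler      :", jc_euler(1.0, 1.0))
```

The two results agree to several decimal places, and in both cases r + 3s stays equal to 1, as it must for a probability distribution over four bases.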

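Estimating Distances
Let p = probability that a base differs between two sequences; solve to find t
Jukes-Cantor:
$r(t) = 1 - p = \tfrac{1}{4}\,(1 + 3e^{-mt})$
$p = \tfrac{3}{4} - \tfrac{3}{4}\, e^{-mt}$
$\tfrac{3}{4} - p = \tfrac{3}{4}\, e^{-mt}$
$1 - \tfrac{4p}{3} = e^{-mt}$
Therefore $mt = -\ln\!\left(1 - \tfrac{4p}{3}\right)$
Letting $d = \tfrac{3}{4}\, mt$ denote substitutions per site, $d = -\tfrac{3}{4} \ln\!\left(1 - \tfrac{4p}{3}\right)$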
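Combining mt = −ln(1 − 4p/3) with d = ¾ mt gives a one-line distance correction. The following is a minimal sketch; the function name and the error handling for p ≥ 3/4 (where the logarithm is undefined) are choices made here, not part of the slides.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Expected substitutions per site, d = -3/4 * ln(1 - 4p/3),
    given the observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 under Jukes-Cantor")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for p in (0.0, 0.05, 0.1, 0.3, 0.6):
        print(f"p = {p:.2f}  ->  d = {jukes_cantor_distance(p):.4f}")
```

For small p the corrected distance is close to p itself; as p approaches 3/4 the correction grows without bound, reflecting multiple substitutions at the same site.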

Estimating Distances
d: branch length in terms of substitutions per site
Models: Jukes-Cantor, Kimura

A simple clustering method for building a tree
UPGMA (unweighted pair group method using arithmetic averages), or the Average Linkage Method
Given two disjoint clusters $C_i$, $C_j$ of sequences,
$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$
Claim that if $C_k = C_i \cup C_j$, then the distance to another cluster $C_l$ is:
$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$

Algorithm: Average Linkage
Initialization:
  Assign each $x_i$ into its own cluster $C_i$
  Define one leaf per sequence, height 0
Iteration:
  Find two clusters $C_i$, $C_j$ such that $d_{ij}$ is minimal
  Let $C_k = C_i \cup C_j$
  Define a node connecting $C_i$, $C_j$, and place it at height $d_{ij}/2$
  Delete $C_i$, $C_j$
Termination:
  When two clusters i, j remain, place the root at height $d_{ij}/2$
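The average-linkage steps above can be sketched roughly as follows. This is an illustrative implementation, not a reference one: the function name, the nested-tuple tree representation `(left, right, height)`, and the toy distance matrix are all assumptions made for the example.

```python
from itertools import combinations


def upgma(labels, matrix):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns a nested tuple (left, right, height); leaves are label strings.
    """
    n = len(labels)
    tree = {i: labels[i] for i in range(n)}   # active clusters
    size = {i: 1 for i in range(n)}           # |C_i|
    # Distances keyed by (smaller id, larger id).
    d = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(tree) > 1:
        # Find the two closest active clusters.
        i, j = min(combinations(sorted(tree), 2), key=lambda pair: d[pair])
        k = next_id
        next_id += 1
        # d_kl = (d_il |C_i| + d_jl |C_j|) / (|C_i| + |C_j|)
        for l in tree:
            if l not in (i, j):
                dil = d[(min(i, l), max(i, l))]
                djl = d[(min(j, l), max(j, l))]
                d[(l, k)] = (dil * size[i] + djl * size[j]) / (size[i] + size[j])
        # New node placed at height d_ij / 2.
        tree[k] = (tree[i], tree[j], d[(i, j)] / 2)
        size[k] = size[i] + size[j]
        for c in (i, j):
            del tree[c]
            del size[c]
    return next(iter(tree.values()))


# Toy ultrametric distances (made up for illustration).
toy_labels = ["A", "B", "C", "D"]
toy_matrix = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
print(upgma(toy_labels, toy_matrix))
```

On this toy matrix A and B join first at height 1, C and D at height 2, and the root sits at height 3. The loop simply runs until one cluster remains, which covers the termination step as its last iteration.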

Neighbor-Joining
Guaranteed to produce the correct tree if distance is additive
May produce a good tree even when distance is not additive
Step 1: Finding neighboring leaves
Define $D_{ij} = (N - 2)\, d_{ij} - \sum_{k \neq i} d_{ik} - \sum_{k \neq j} d_{jk}$
Claim: The above “magic trick” ensures that i, j are neighbors if $D_{ij}$ is minimal
[Figure: four-leaf tree with leaves 1, 2, 3, 4 and branch lengths 0.1, 0.1, 0.1, 0.4, 0.4]
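To see why the corrected quantity matters, here is a small sketch that computes the neighbor-joining criterion on a four-leaf tree. The branch assignment is an assumption about the figure: leaves 1 and 2 share a parent, leaves 3 and 4 share a parent, pendant branches are 0.1, 0.4, 0.1, 0.4 respectively, and the internal branch is 0.1. The function name is likewise made up.

```python
from itertools import combinations


def nj_criterion(d):
    """D_ij = (N - 2) d_ij - sum_k d_ik - sum_k d_jk for every pair i < j.

    Because d[i][i] == 0, each row sum equals the sum over k != i.
    """
    n = len(d)
    row_sum = [sum(row) for row in d]
    return {
        (i, j): (n - 2) * d[i][j] - row_sum[i] - row_sum[j]
        for i, j in combinations(range(n), 2)
    }


# Additive distances for the assumed tree (index 0..3 = leaves 1..4).
d = [
    [0.0, 0.5, 0.3, 0.6],
    [0.5, 0.0, 0.6, 0.9],
    [0.3, 0.6, 0.0, 0.5],
    [0.6, 0.9, 0.5, 0.0],
]

closest = min(combinations(range(4), 2), key=lambda p: d[p[0]][p[1]])
D = nj_criterion(d)
best = min(D, key=D.get)
print("closest pair by raw distance:", closest)
print("pair minimizing D_ij        :", best)
```

Under this assumption the raw closest pair is leaves 1 and 3, which are not neighbors, while the minimum of D falls on a true neighbor pair (1, 2) or (3, 4).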

Mammalian alignments
a, A phylogenetic tree of all 29 mammals used in this analysis based on the substitution rates in the MultiZ alignments. Organisms with finished genome sequences are indicated in blue, high quality drafts in green and 2× assemblies in black. Substitutions per 100 bp are given for each branch; branches with ≥10 substitutions are coloured red, blue indicates <10 substitutions.
b, At 10% FDR, 3.6 million constrained elements can be detected encompassing 4.2% of the genome, including a substantial fraction of newly detected bases (blue) compared to the union of the HMRD 50-bp + Siepel vertebrate elements (ref. 17) (see Supplementary Fig. 4b for comparison to HMRD elements only). The largest fraction of constraint can be seen in coding exons, introns and intergenic regions. For unique counts, the analysis was performed hierarchically: coding exons, 5′ UTRs, 3′ UTRs, promoters, pseudogenes, non-coding RNAs, introns, intergenic. The constrained bases are particularly enriched in coding transcripts and their promoters (Supplementary Fig. 4c).

Genome Evolutionary Rate Profiling (GERP)

Species Trees and Gene Trees
Relationship between gene trees and species trees. (A) Ortholog trees used to study species evolution. Each internal node represents a speciation event (circle). (B) Paralog trees used to study gene family expansions within a single species. Each internal node represents a duplication event (star). (C) General gene trees combine both orthologs and paralogs across multiple species to infer gene duplication (star), gene loss (×), and speciation (circle) events. Each gene is named with the first letter of the corresponding species. The gene tree (black lines) can be viewed as evolving inside the species tree (blue area), implying coordinated speciation events at branching points in the species tree (dotted line). (D) Gene duplication and loss events are inferred by reconciling a gene tree to a species tree, mapping each gene-tree node to its closest species-tree common ancestor node (arrows). (E) When the gene tree is incorrect, many spurious events will be inferred. In this example, a common misplacement of rodents due to long-branch-attraction leads to four spurious events (one duplication and at least three losses).
Rasmussen MD, Kellis M. Genome Res. 2007;17:1932-1942. ©2007 by Cold Spring Harbor Laboratory Press

Evolutionary rates decoupled into gene-specific and species-specific components
(A) Syntenic ortholog trees appear as scaled versions of a common species tree, and can be expressed as the product of a gene-specific rate and species-specific rates. (B) Gene-specific rates of 5154 fly orthologs follow a gamma distribution. (C) Species-specific rates for each lineage follow normal distributions. Means and standard deviations shown in Supplemental Figure S7. (D) Unnormalized (absolute) branch lengths are highly correlated. Lengths for D. virilis and D. ananassae since their last common ancestor across the 5154 orthologs show correlation r = 0.813. (E) Relative branch lengths become independent after normalization by the gene-specific rate (r = 0.082). (F) Correlations are high for all species pairs before normalization, except for very closely related species. (G) Relative lengths are uncorrelated for all species pairs, showing that gene-specific rate accounts for their initial dependencies.
Rasmussen MD, Kellis M. Genome Res. 2007;17:1932-1942. ©2007 by Cold Spring Harbor Laboratory Press

Evaluating gene-tree likelihood using learned rate distributions
(A) Observed distance matrix for mammalian orthologs of hemoglobin-β estimated from an HKY model based on multiple alignments of the four genes. (B–D) Likelihood evaluation for proposed topology T1. (B) Distance matrix M1 is mapped onto the proposed topology T1, resulting in branch lengths a–f. (C) Gene-tree branches are mapped to species-tree branches by reconciliation. Since the gene-tree topology is congruent to the species tree, each branch is mapped to exactly one lineage. (D) The probability of each branch length is evaluated based on species-specific rate distributions. T1 results in overall high-likelihood density, since the resulting relative branch lengths a–f fall near the average rate for the corresponding species-specific distribution (dotted lines). (E–G) Likelihood evaluation for proposed topology T2. (E) Distance matrix M1 is mapped onto the proposed topology T2, resulting in branch lengths v–z. (F) Reconciliation results in one gene duplication and three gene losses; gene-tree branches w and z now span two species-tree branches each and are evaluated based on accordingly longer species-tree rate distributions obtained by summing two normals. (G) The resulting branch lengths z, w, and v show large discrepancies from the average species-rate distributions, resulting in a 3.7-fold lower likelihood for branch lengths corresponding to the incorrect topology T2. (H) All other methods select the incorrect topology T2 due to long-branch attraction, even though the hemoglobin-β genes are unambiguous one-to-one orthologs and should follow the known mammalian phylogeny T1. (I) Branch-level comparison of likelihood scores shows consistently higher scores for T1, the correct topology. Notice that the gene-rate likelihood for T1 is different from that for T2, as the two topologies imply different gene family rates.
Rasmussen MD, Kellis M. Genome Res. 2007;17:1932-1942

Multiple Sequence Alignments

Definition
Given N sequences $x_1, x_2, \dots, x_N$:
Insert gaps (-) in each sequence $x_i$, such that
  All sequences have the same length L
  Score of the global map is maximum
[Figure: alignment of p53, http://ntoc.wordpress.com/2010/03/18/hello-world/]

Applications

Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
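An induced pairwise alignment is obtained by taking two rows of the multiple alignment and dropping every column in which both rows have a gap. A minimal sketch (the function name is made up for this example):

```python
def induced_pairwise(a: str, b: str):
    """Project two rows of a multiple alignment onto a pairwise alignment
    by removing columns where both rows contain a gap."""
    kept = [(p, q) for p, q in zip(a, b) if not (p == "-" and q == "-")]
    return "".join(p for p, _ in kept), "".join(q for _, q in kept)


x = "AC-GCGG-C"
y = "AC-GC-GAG"
z = "GCCGC-GAG"

for name, (r1, r2) in {"x,y": (x, y), "x,z": (x, z), "y,z": (y, z)}.items():
    print(name, induced_pairwise(r1, r2))
```

For the slide's example, the x,y pair loses its third column, the y,z pair loses its sixth column, and the x,z pair is unchanged.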

Sum Of Pairs (cont’d)
Heuristic way to incorporate evolution tree:
[Figure: tree over Human, Mouse, Duck, Chicken]
Weighted SOP: $S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$
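The weighted sum-of-pairs score can be sketched as below. The substitution scores (match +1, mismatch −1, letter against gap −2, gap against gap 0) and the example weights are hypothetical; the slides do not specify a scoring scheme or how the weights are derived from the tree.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one aligned pair of symbols (hypothetical scoring scheme)."""
    if a == "-" and b == "-":
        return 0          # column vanishes in the induced pairwise alignment
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def weighted_sum_of_pairs(rows, weights=None):
    """S(m) = sum over k < l of w_kl * s(m^k, m^l).

    `weights` maps (k, l) with k < l to w_kl; all weights default to 1.
    """
    total = 0.0
    for k, l in combinations(range(len(rows)), 2):
        w = 1.0 if weights is None else weights[(k, l)]
        total += w * sum(pair_score(a, b) for a, b in zip(rows[k], rows[l]))
    return total


rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]  # x, y, z from the previous slide
print(weighted_sum_of_pairs(rows))
print(weighted_sum_of_pairs(rows, {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 2.0}))
```

With unit weights the three pairwise scores are 0, −4 and 3, so the total is −1. Down-weighting the x,z pair and up-weighting the y,z pair changes the total to 4, which shows how the weights shift influence between pairs.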

A Profile Representation
Alignment:
-   A   G   G   C   T   A   T   C   A   C   C   T   G
T   A   G   -   C   T   A   C   C   A   -   -   -   G
C   A   G   -   C   T   A   C   C   A   -   -   -   G
C   A   G   -   C   T   A   T   C   A   C   -   G   G
C   A   G   -   C   T   A   T   C   G   C   -   G   G
Profile:
A   0   1   0   0   0   0   1   0   0  .8   0   0   0   0
C  .6   0   0   0   1   0   0  .4   1   0  .6  .2   0   0
G   0   0   1  .2   0   0   0   0   0  .2   0   0  .4   1
T  .2   0   0   0   0   1   0  .6   0   0   0   0  .2   0
-  .2   0   0  .8   0   0   0   0   0   0  .4  .8  .4   0
Given a multiple alignment $M = m_1 \dots m_n$
Replace each column $m_i$ with profile entry $p_i$
  Frequency of each letter in $\Sigma$
  # gaps
  Optional: # gap openings, extensions, closings
Can think of this as a “likelihood” of each letter in each position
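Computing the profile is a per-column frequency count. The sketch below reproduces the table from the slide's five-row alignment; the function name and the dictionary-per-column representation are choices made here for illustration.

```python
from collections import Counter

ALPHABET = "ACGT-"


def profile(rows):
    """Return one {symbol: frequency} dict per alignment column."""
    n = len(rows)
    columns = []
    for column in zip(*rows):
        counts = Counter(column)
        columns.append({ch: counts[ch] / n for ch in ALPHABET})
    return columns


alignment = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]

prof = profile(alignment)
for ch in ALPHABET:
    print(ch, " ".join(f"{col[ch]:.1f}" for col in prof))
```

Each printed row corresponds to one row of the profile table (A, C, G, T, gap), with one frequency per alignment column.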

Multiple Sequence Alignments Algorithms

Multidimensional DP
Generalization of Needleman-Wunsch:
$S(m) = \sum_i S(m_i)$ (sum of column scores)
$F(i_1, i_2, \dots, i_N)$: optimal alignment up to $(i_1, \dots, i_N)$
$F(i_1, i_2, \dots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$

Multidimensional DP
Example: in 3D (three sequences): 7 neighbors/cell
F(i, j, k) = max{
  F(i – 1, j – 1, k – 1) + S(xi, xj, xk),
  F(i – 1, j – 1, k    ) + S(xi, xj, - ),
  F(i – 1, j    , k – 1) + S(xi, - , xk),
  F(i – 1, j    , k    ) + S(xi, - , - ),
  F(i    , j – 1, k – 1) + S( - , xj, xk),
  F(i    , j – 1, k    ) + S( - , xj, - ),
  F(i    , j    , k – 1) + S( - , - , xk)
}
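The seven-neighbor recurrence can be written directly as a small dynamic program. The sketch below returns only the optimal score, and it assumes a sum-of-pairs column score with hypothetical values (match +1, mismatch −1, letter against gap −2, gap against gap 0); neither the scoring scheme nor the function names come from the slides.

```python
from itertools import product


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Hypothetical pairwise symbol score."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def column_score(col):
    """Sum-of-pairs score of one alignment column."""
    return sum(
        pair_score(col[k], col[l])
        for k in range(len(col))
        for l in range(k + 1, len(col))
    )


def align3_score(x, y, z):
    """Optimal global alignment score of three sequences (3D DP)."""
    # The 7 neighbors: every non-zero choice of which sequences advance.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    F = {(0, 0, 0): 0}
    cells = product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1))
    for i, j, k in cells:  # lexicographic order: predecessors come first
        if (i, j, k) == (0, 0, 0):
            continue
        best = None
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue
            col = (
                x[pi] if di else "-",
                y[pj] if dj else "-",
                z[pk] if dk else "-",
            )
            cand = F[(pi, pj, pk)] + column_score(col)
            if best is None or cand > best:
                best = cand
        F[(i, j, k)] = best
    return F[(len(x), len(y), len(z))]


print(align3_score("ACG", "ACG", "ACG"))
print(align3_score("ACGT", "AGT", "ACT"))
```

Recovering the alignment itself would require storing a back-pointer per cell, which is omitted here to keep the sketch short.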

Multidimensional DP
Running Time:
Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
Neighbors/cell: $2^N - 1$
Therefore: $O(2^N L^N)$

Progressive Alignment
When evolutionary tree is known:
  Align closest first, in the order of the tree
  In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult
Weighted version:
  Tree edges have weights, proportional to the divergence in that edge
  New profile is a weighted average of two old profiles
[Figure: guide tree over x, y, z, w with profiles pxy, pzw and pxyzw at the internal nodes]

Progressive Alignment
Example
Profile: (A, C, G, T, -)
px = (0.8, 0.2, 0, 0, 0)
py = (0.6, 0, 0, 0, 0.4)
s(px, py) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -)
Result: pxy = (0.7, 0.1, 0, 0, 0.2)
s(px, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -)
Result: px- = (0.4, 0.1, 0, 0, 0.5)
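The profile-to-profile column score in this example is an expectation of the symbol score over both profile columns, and the merged column is an average of the two. A rough sketch follows, with hypothetical symbol scores (match +1, mismatch −1, letter against gap −2, gap against gap 0) and equal weights standing in for the unweighted case; the function names are also made up.

```python
ALPHABET = "ACGT-"


def s(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Hypothetical symbol-vs-symbol score."""
    if a == "-" and b == "-":
        return 0.0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def profile_column_score(px, py):
    """Expected score: sum over a, b of px[a] * py[b] * s(a, b)."""
    return sum(
        px[i] * py[j] * s(a, b)
        for i, a in enumerate(ALPHABET)
        for j, b in enumerate(ALPHABET)
    )


def merge_columns(px, py, wx=0.5, wy=0.5):
    """Weighted average of two profile columns (equal weights by default)."""
    return tuple(wx * a + wy * b for a, b in zip(px, py))


px = (0.8, 0.2, 0.0, 0.0, 0.0)
py = (0.6, 0.0, 0.0, 0.0, 0.4)
gap_col = (0.0, 0.0, 0.0, 0.0, 1.0)  # aligning px against a gap column

print(profile_column_score(px, py), merge_columns(px, py))
print(profile_column_score(px, gap_col), merge_columns(px, gap_col))
```

The merged columns match the slide's results, (0.7, 0.1, 0, 0, 0.2) and (0.4, 0.1, 0, 0, 0.5), up to floating-point rounding. In the weighted version, `wx` and `wy` would come from the tree edge weights.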

Progressive Alignment
When evolutionary tree is unknown:
  Perform all pairwise alignments
  Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
  Construct a tree (UPGMA / Neighbor Joining / Other methods)
  Align on the tree
[Figure: sequences x, y, z, w with an unknown tree]

MUSCLE at a glance
Fast measurement of all pairwise distances between sequences
  $D_{DRAFT}(x, y)$ defined in terms of # common k-mers (k ~ 3); $O(N^2 L \log L)$ time
Build tree $T_{DRAFT}$ based on those distances, with UPGMA
Progressive alignment over $T_{DRAFT}$, resulting in multiple alignment $M_{DRAFT}$
Measure new Kimura-based distances D(x, y) based on $M_{DRAFT}$
Build tree T based on D
Progressive alignment over T, to build M
Iterative refinement; for many rounds, do:
  Tree partitioning: split M on one branch and realign the two resulting profiles
  If the new alignment M' has a better sum-of-pairs score than the previous one, accept
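The first step, an alignment-free distance from shared k-mers, can be illustrated with a simple stand-in. The formula below (one minus the fraction of shared k-mers, normalized by the shorter sequence) is an assumption chosen for clarity; it is not claimed to be the exact distance MUSCLE uses, and the function names are made up.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Multiset of overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x: str, y: str, k: int = 3) -> float:
    """1 - (shared k-mers) / (k-mers in the shorter sequence).

    Illustrative stand-in for a draft distance based on common k-mers.
    """
    denom = min(len(x), len(y)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must have length >= k")
    shared = sum((kmer_counts(x, k) & kmer_counts(y, k)).values())
    return 1.0 - shared / denom


if __name__ == "__main__":
    a = "ACAGTGACGCCCCAAACGT"
    b = "ACAGTGACGCTACAAACGT"
    c = "CCTGTGACGTAGCAAACGA"
    print(round(kmer_distance(a, b), 3), round(kmer_distance(a, c), 3))
```

Because no alignment is needed, a distance of this kind can be computed for all pairs quickly and fed to UPGMA to obtain the draft tree.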