Multiple Sequence Alignment

Evolution at the DNA level
Sequence edits: mutation, deletion
…ACGGTGCAGTTACCA…
…AC----CAGTCCACCA…
Rearrangements: inversion, translocation, duplication

All Homologous Sequences Coalesce
Y-chromosome coalescence
Trevor Lamb: As a result of recent work, it has become possible to track the changes that enabled an ancestral chordate C-opsin (that exhibited many properties in common with R-opsins) to evolve into the immediate precursor of modern cone and rod opsins.

Orthology and Paralogy
Orthologs: derived by speciation
Paralogs: everything else
Figure: gene tree with a yeast gene and the leaves HA1 (Human), HA2 (Human), WA (Worm), HB (Human), WB (Worm).

Orthology, Paralogy, Inparalogs, Outparalogs

Phylogenetic Trees
Nodes: species
Edges: time of independent evolution
Edge length represents evolution time, also known as genetic distance; this is not necessarily chronological time.

Inferring Phylogenetic Trees
Trees can be inferred by several criteria:
- Morphology of the organisms (can lead to mistakes)
- Sequence comparison
Example:
Mouse:  ACAGTGACGCCCCAAACGT
Rat:    ACAGTGACGCTACAAACGT
Baboon: CCTGTGACGTAACAAACGA
Chimp:  CCTGTGACGTAGCAAACGA
Human:  CCTGTGACGTAGCAAACGA

Distance Between Two Sequences
Basic principle: distance is proportional to the degree of independent sequence evolution.
Given sequences $x_i$, $x_j$, let $d_{ij}$ be the distance between the two sequences.
One possible definition: $d_{ij}$ = fraction $f$ of sites $u$ where $x_i[u] \neq x_j[u]$.
Better scores are derived by modeling evolution as a continuous change process.
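The simple definition above (fraction of sites that differ) can be sketched in a few lines. This is a minimal illustration, assuming the sequences are already aligned to equal length; the function name `p_distance` is hypothetical, and the sequences are the ones from the example slide.

```python
def p_distance(x: str, y: str) -> float:
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)

mouse = "ACAGTGACGCCCCAAACGT"
rat = "ACAGTGACGCTACAAACGT"
chimp = "CCTGTGACGTAGCAAACGA"
human = "CCTGTGACGTAGCAAACGA"

print(p_distance(mouse, rat))    # 2 of 19 sites differ
print(p_distance(chimp, human))  # identical sequences, distance 0
print(p_distance(mouse, human))  # 6 of 19 sites differ
```

On these toy sequences the distances already group mouse with rat and chimp with human, which is the intuition behind distance-based tree building.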

Molecular Evolution Modeling sequence substitution:
Consider what happens at a position over time Δt.
P(t) = vector of probabilities of {A, C, G, T} at time t.
Given an alignment between two sequences, we can estimate P(t). (Simplistic) Count non-match positions in the alignment.
How do we estimate t from that information?
Figure: base A changing to C over a short interval Δt ≈ 0.

Molecular Evolution Modeling sequence substitution:
Consider what happens at a position over time Δt.
P(t) = vector of probabilities of {A, C, G, T} at time t.
$\mu_{AC}$ = rate of transition from A to C per unit time
$\mu_A = \mu_{AC} + \mu_{AG} + \mu_{AT}$ = rate of transition out of A
$$p_A(t+\Delta t) = p_A(t) - p_A(t)\,\mu_A\,\Delta t + p_C(t)\,\mu_{CA}\,\Delta t + p_G(t)\,\mu_{GA}\,\Delta t + p_T(t)\,\mu_{TA}\,\Delta t$$

Molecular Evolution
In matrix/vector notation, we get
$$P(t+\Delta t) = P(t) + Q\,P(t)\,\Delta t$$
where Q is the substitution rate matrix.

Molecular Evolution
This is a differential equation: $P'(t) = Q\,P(t)$
Q determines a probability distribution over {A, C, G, T} at each position, with stationary (equilibrium) frequencies $\pi_A, \pi_C, \pi_G, \pi_T$.
Each Q is an evolutionary model; some work better than others.

Evolutionary Models
Jukes-Cantor, Kimura, Felsenstein, HKY

Estimating Distances
Solve the differential equation and compute the expected evolutionary time given the sequences: $P'(t) = Q\,P(t)$
Jukes-Cantor: let $P_{AA}(t) = P_{CC}(t) = P_{GG}(t) = P_{TT}(t) = r$ and $P_{AC}(t) = \dots = P_{TG}(t) = s$. Then
$$r'(t) = -\tfrac{3}{4}\,m\,r(t) + \tfrac{3}{4}\,m\,s(t)$$
$$s'(t) = -\tfrac{1}{4}\,m\,s(t) + \tfrac{1}{4}\,m\,r(t)$$
which is satisfied by
$$r(t) = \tfrac{1}{4}\left(1 + 3e^{-mt}\right), \qquad s(t) = \tfrac{1}{4}\left(1 - e^{-mt}\right)$$
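As a sanity check, the closed-form Jukes-Cantor solution can be compared against a direct numerical iteration of the update rule P(t+Δt) = P(t) + Q P(t) Δt. The sketch below assumes a rate matrix with every off-diagonal entry m/4 and every diagonal entry −3m/4, which is the matrix implied by the equations for r′ and s′; the function names, the step count, and the values of m and t are illustrative choices, not part of the slides.

```python
import math

def jukes_cantor_Q(m: float):
    """4x4 rate matrix: each off-diagonal entry m/4, each diagonal entry -3m/4."""
    return [[-0.75 * m if i == j else 0.25 * m for j in range(4)] for i in range(4)]

def euler_evolve(p0, Q, t, steps=20_000):
    """Iterate P(t+dt) = P(t) + Q P(t) dt, starting from P(0) = p0."""
    dt = t / steps
    p = list(p0)
    for _ in range(steps):
        dp = [sum(Q[i][k] * p[k] for k in range(4)) for i in range(4)]
        p = [p[i] + dp[i] * dt for i in range(4)]
    return p

m, t = 1.0, 0.5
# Start with the base known to be A: P(0) = (1, 0, 0, 0).
p = euler_evolve([1.0, 0.0, 0.0, 0.0], jukes_cantor_Q(m), t)
r_closed = 0.25 * (1 + 3 * math.exp(-m * t))   # probability the base is still A
s_closed = 0.25 * (1 - math.exp(-m * t))       # probability of each other base

print(p[0], r_closed)
print(p[1], s_closed)
```

The two columns of output should agree to several decimal places; the small remaining gap is the discretization error of the Euler step.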

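Estimating Distances
Let p = probability that a base differs between the two sequences; solve to find t.
Jukes-Cantor:
$$r(t) = 1 - p = \tfrac{1}{4}\left(1 + 3e^{-mt}\right)$$
$$p = \tfrac{3}{4} - \tfrac{3}{4}e^{-mt}$$
$$\tfrac{3}{4} - p = \tfrac{3}{4}e^{-mt}$$
$$1 - \tfrac{4p}{3} = e^{-mt}$$
Therefore $mt = -\ln\left(1 - \tfrac{4p}{3}\right)$. Letting $d = \tfrac{3}{4}mt$ denote substitutions per site,
$$d = -\tfrac{3}{4}\ln\left(1 - \tfrac{4p}{3}\right)$$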
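Combining mt = −ln(1 − 4p/3) with d = ¾ mt gives d = −¾ ln(1 − 4p/3). A minimal sketch of that correction follows; the function name and the range check are assumptions added for illustration.

```python
import math

def jukes_cantor_distance(p: float) -> float:
    """Substitutions per site, d = -3/4 ln(1 - 4p/3), from mismatch fraction p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for p in (0.01, 0.1, 0.3, 0.6):
    print(p, round(jukes_cantor_distance(p), 4))
```

For small p the corrected distance is close to p itself; as p approaches ¾ (the mismatch rate expected between unrelated sequences under this model) the estimate grows without bound, reflecting multiple substitutions at the same site.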

Estimating Distances
d: branch length in terms of substitutions per site
Models: Jukes-Cantor, Kimura

A simple clustering method for building tree
UPGMA (unweighted pair group method using arithmetic averages), also called the Average Linkage Method
Given two disjoint clusters $C_i$, $C_j$ of sequences,
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
Claim: if $C_k = C_i \cup C_j$, then the distance to another cluster $C_l$ is
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$

Algorithm: Average Linkage
Initialization:
- Assign each $x_i$ to its own cluster $C_i$
- Define one leaf per sequence, at height 0
Iteration:
- Find two clusters $C_i$, $C_j$ such that $d_{ij}$ is minimal
- Let $C_k = C_i \cup C_j$
- Define a node connecting $C_i$, $C_j$, and place it at height $d_{ij}/2$
- Delete $C_i$, $C_j$
Termination:
- When two clusters i, j remain, place the root at height $d_{ij}/2$
Figure: example clustering of five sequences, labeled 1 to 5.
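The average-linkage procedure can be sketched as follows. This is an illustrative implementation rather than a reference one: the data layout (a dictionary keyed by `frozenset` pairs), the function name `upgma`, and the toy distances are all assumptions. Cluster distances are updated with the size-weighted average from the previous slide.

```python
from itertools import combinations

def upgma(dist, names):
    """Average-linkage clustering.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    Returns (tree, root_height), where tree is a nested tuple of leaf names.
    """
    # Each cluster id maps to (subtree, number of leaves, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {frozenset((i, j)): dist[frozenset((names[i], names[j]))]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        i, j = min(combinations(sorted(clusters), 2),
                   key=lambda ij: d[frozenset(ij)])
        dij = d[frozenset((i, j))]
        (ti, ni, _), (tj, nj, _) = clusters[i], clusters[j]
        del clusters[i], clusters[j]
        # Size-weighted average distance from the merged cluster to the rest.
        for l in clusters:
            d[frozenset((next_id, l))] = (
                d[frozenset((i, l))] * ni + d[frozenset((j, l))] * nj
            ) / (ni + nj)
        # The new internal node sits at height d_ij / 2.
        clusters[next_id] = ((ti, tj), ni + nj, dij / 2)
        next_id += 1
    tree, _, height = next(iter(clusters.values()))
    return tree, height

# Toy ultrametric distances (hypothetical): A-B are close, C-D are close.
names = ["A", "B", "C", "D"]
toy = {frozenset(("A", "B")): 2.0, frozenset(("C", "D")): 4.0}
for a in "AB":
    for c in "CD":
        toy[frozenset((a, c))] = 8.0

tree, height = upgma(toy, names)
print(tree, height)
```

Because the toy distances are ultrametric, the recovered tree groups A with B and C with D and places the root at half the cross-cluster distance.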

Neighbor-Joining
Guaranteed to produce the correct tree if the distance is additive; may produce a good tree even when the distance is not additive.
Step 1: Finding neighboring leaves
Define
$$D_{ij} = (N-2)\,d_{ij} - \sum_{k \neq i} d_{ik} - \sum_{k \neq j} d_{jk}$$
Claim: the above “magic trick” ensures that i, j are neighbors if $D_{ij}$ is minimal.
Figure: four-leaf tree (leaves 1, 2, 3, 4) with branch lengths 0.1, 0.1, 0.1, 0.4, 0.4.
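To see why the corrected criterion matters, consider a hypothetical four-leaf additive tree (an assumption for this sketch, chosen to resemble the figure): leaves 1 and 3 are neighbors with branches 0.1 and 0.4, leaves 2 and 4 are neighbors with branches 0.1 and 0.4, and the internal edge has length 0.1. The raw distance is smallest between leaves 1 and 2, which are not neighbors; the criterion D picks a true neighbor pair instead. Indices in the code are 0-based.

```python
def nj_criterion(d):
    """Matrix D_ij = (N-2) d_ij - sum_k d_ik - sum_k d_jk for i != j."""
    n = len(d)
    row_sums = [sum(row) for row in d]  # d[i][i] is 0, so this sums over k != i
    return [[(n - 2) * d[i][j] - row_sums[i] - row_sums[j] if i != j else 0.0
             for j in range(n)] for i in range(n)]

def argmin_pair(matrix):
    """Pair (i, j), i < j, with the smallest matrix entry."""
    n = len(matrix)
    return min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: matrix[ij[0]][ij[1]])

# Path-length distances on the hypothetical tree described above.
d = [
    [0.0, 0.3, 0.5, 0.6],
    [0.3, 0.0, 0.6, 0.5],
    [0.5, 0.6, 0.0, 0.9],
    [0.6, 0.5, 0.9, 0.0],
]

print(argmin_pair(d))                 # closest pair by raw distance: leaves 1 and 2
print(argmin_pair(nj_criterion(d)))   # a true neighbor pair under the criterion
```

Both neighbor pairs tie under D in this symmetric example, so either may be reported; the non-neighbor pair with the smallest raw distance is never chosen.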

Mammalian alignments
a, A phylogenetic tree of all 29 mammals used in this analysis based on the substitution rates in the MultiZ alignments. Organisms with finished genome sequences are indicated in blue, high quality drafts in green and 2× assemblies in black. Substitutions per 100 bp are given for each branch; branches with ≥10 substitutions are coloured red, blue indicates <10 substitutions.
b, At 10% FDR, 3.6 million constrained elements can be detected encompassing 4.2% of the genome, including a substantial fraction of newly detected bases (blue) compared to the union of the HMRD 50-bp + Siepel vertebrate elements (ref. 17; see Supplementary Fig. 4b for comparison to HMRD elements only). The largest fraction of constraint can be seen in coding exons, introns and intergenic regions. For unique counts, the analysis was performed hierarchically: coding exons, 5′ UTRs, 3′ UTRs, promoters, pseudogenes, non-coding RNAs, introns, intergenic. The constrained bases are particularly enriched in coding transcripts and their promoters (Supplementary Fig. 4c).

Genome Evolutionary Rate Profiling (GERP)

Species Trees and Gene Trees
Relationship between gene trees and species trees. (A) Ortholog trees used to study species evolution. Each internal node represents a speciation event (circle). (B) Paralog trees used to study gene family expansions within a single species. Each internal node represents a duplication event (star). (C) General gene trees combine both orthologs and paralogs across multiple species to infer gene duplication (star), gene loss (×), and speciation (circle) events. Each gene is named with the first letter of the corresponding species. The gene tree (black lines) can be viewed as evolving inside the species tree (blue area), implying coordinated speciation events at branching points in the species tree (dotted line). (D) Gene duplication and loss events are inferred by reconciling a gene tree to a species tree, mapping each gene-tree node to its closest species-tree common ancestor node (arrows). (E) When the gene tree is incorrect, many spurious events will be inferred. In this example, a common misplacement of rodents due to long-branch-attraction leads to four spurious events (one duplication and at least three losses). Rasmussen M D , Kellis M Genome Res. 2007;17: ©2007 by Cold Spring Harbor Laboratory Press
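The reconciliation step in panel D (mapping each gene-tree node to its closest species-tree common ancestor) can be sketched as below. This is a simplified illustration under several assumptions: the species tree is given as a child-to-parent dictionary with hypothetical node names, gene-tree leaves are labeled directly with their species, and a gene-tree node is counted as a duplication when it maps to the same species node as one of its children. Losses are not counted here.

```python
def ancestors(node, parent):
    """Path from a species-tree node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lca(a, b, parent):
    """Closest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")

def reconcile(gene_tree, parent):
    """Return (species node this gene-tree node maps to, inferred duplications)."""
    if isinstance(gene_tree, str):
        return gene_tree, 0  # a leaf maps to its own species
    left, right = gene_tree
    map_left, dups_left = reconcile(left, parent)
    map_right, dups_right = reconcile(right, parent)
    mapped = lca(map_left, map_right, parent)
    # Mapping to the same species node as a child signals a duplication.
    is_dup = 1 if mapped in (map_left, map_right) else 0
    return mapped, dups_left + dups_right + is_dup

# Hypothetical species tree: ((human, mouse), dog).
parent = {"human": "human_mouse", "mouse": "human_mouse",
          "human_mouse": "root", "dog": "root"}

congruent = (("human", "mouse"), "dog")
misplaced = (("human", "dog"), "mouse")  # mouse branches off too early

print(reconcile(congruent, parent))  # no duplication needed
print(reconcile(misplaced, parent))  # one spurious duplication inferred
```

The second call shows the effect described in panel E: a single misplaced lineage in the gene tree forces the reconciliation to infer a duplication that never happened.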

Evolutionary rates decoupled into gene-specific and species-specific components
Evolutionary rates decoupled into gene-specific and species-specific components. (A) Syntenic ortholog trees appear as scaled versions of a common species tree, and can be expressed as the product of a gene-specific rate and species-specific rates. (B) Gene-specific rates of 5154 fly orthologs follow a gamma distribution. (C) Species-specific rates for each lineage follow normal distributions. Means and standard deviations shown in Supplemental Figure S7. (D) Unnormalized (absolute) branch lengths are highly correlated. Lengths for D. virilis and D. ananassae since their last common ancestor across the 5154 orthologs show correlation r = (E) Relative branch lengths become independent after normalization by the gene-specific rate (r = 0.082). (F) Correlations are high for all species pairs before normalization, except for very closely related species. (G) Relative lengths are uncorrelated for all species pairs, showing that gene-specific rate accounts for their initial dependencies. Rasmussen M D , Kellis M Genome Res. 2007;17: ©2007 by Cold Spring Harbor Laboratory Press

Evaluating gene-tree likelihood using learned rate distributions.
Evaluating gene-tree likelihood using learned rate distributions. (A) Observed distance matrix for mammalian orthologs of hemoglobin-β estimated from an HKY model based on multiple alignments of the four genes. (B–D) Likelihood evaluation for proposed topology T1. (B) Distance matrix M1 is mapped onto the proposed topology T1, resulting in branch lengths a–f. (C) Gene-tree branches are mapped to species-tree branches by reconciliation. Since the gene-tree topology is congruent to the species tree, each branch is mapped to exactly one lineage. (D) The probability of each branch length is evaluated based on species-specific rate distributions. T1 results in overall high-likelihood density, since the resulting relative branch lengths a–f fall near the average rate for the corresponding species-specific distribution (dotted lines). (E–G) Likelihood evaluation for proposed topology T2. (E) Distance matrix M1 is mapped onto the proposed topology T2, resulting in branch lengths v–z. (F) Reconciliation results in one gene duplication and three gene losses; gene-tree branches w and z now span two species-tree branches each and are evaluated based on accordingly longer species-tree rate distributions obtained by summing two normals. (G) The resulting branch lengths z, w, and v show large discrepancies from the average species-rate distributions, resulting in a 3.7-fold lower likelihood for branch lengths corresponding to the incorrect topology T2. (H) All other methods select the incorrect topology T2 due to long-branch attraction, even though the hemoglobin-β genes are unambiguous one-to-one orthologs and should follow the known mammalian phylogeny T1. (I) Branch-level comparison of likelihood scores shows consistently higher scores for T1, the correct topology. Notice that the gene-rate likelihood for T1 is different from that for T2, as the two topologies imply different gene family rates. Rasmussen M D , Kellis M Genome Res. 2007;17:

Multiple Sequence Alignments

Definition
Given N sequences $x_1, x_2, \dots, x_N$:
Insert gaps (-) in each sequence $x_i$, such that
- All sequences have the same length L
- The score of the global map is maximum
Figure: Alignment of p53, …

Applications

Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment.
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG
Induces:
x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
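The induced pairwise alignment is obtained by keeping two rows of the multiple alignment and deleting every column in which both rows have a gap. A minimal sketch, with a hypothetical function name, applied to the example rows x, y, z:

```python
def induced_pairwise(row_a: str, row_b: str, gap: str = "-"):
    """Project a multiple alignment onto two rows, dropping gap-only columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"

print(induced_pairwise(x, y))  # ('ACGCGG-C', 'ACGC-GAG')
print(induced_pairwise(x, z))  # ('AC-GCGG-C', 'GCCGC-GAG')
print(induced_pairwise(y, z))  # ('AC-GCGAG', 'GCCGCGAG')
```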

Sum Of Pairs (cont’d)
Heuristic way to incorporate the evolutionary tree (figure: tree over Human, Mouse, Duck, Chicken):
Weighted SOP:
$$S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$$
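A weighted sum-of-pairs score can be sketched as follows. The pairwise scoring values (match +1, mismatch −1, gap −2, linear gap cost) and the weight dictionary are assumptions chosen for illustration; in practice the weights would be derived from the tree, for example giving less weight to closely related pairs.

```python
from itertools import combinations

def pair_score(a: str, b: str, match=1.0, mismatch=-1.0, gap=-2.0) -> float:
    """Score the pairwise alignment induced by two rows, column by column."""
    total = 0.0
    for ca, cb in zip(a, b):
        if ca == "-" and cb == "-":
            continue  # column is absent from the induced pairwise alignment
        if ca == "-" or cb == "-":
            total += gap
        else:
            total += match if ca == cb else mismatch
    return total

def weighted_sum_of_pairs(rows, weights=None) -> float:
    """S(m) = sum over k < l of w_kl * s(m^k, m^l); all weights default to 1."""
    total = 0.0
    for k, l in combinations(range(len(rows)), 2):
        w = 1.0 if weights is None else weights[(k, l)]
        total += w * pair_score(rows[k], rows[l])
    return total

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(weighted_sum_of_pairs(rows))  # unweighted: 0 + (-4) + 3
# Hypothetical weights that down-weight the pairs involving the third row.
print(weighted_sum_of_pairs(rows, {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 0.5}))
```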

A Profile Representation
Example alignment:
    - A G G C T A T C A C C T G
    T A G – C T A C C A G
    C A G – C T A C C A G
    C A G – C T A T C A C – G G
    C A G – C T A T C G C – G G
Figure: profile table with one row per letter (A, C, G, T).
Given a multiple alignment $M = m_1 \dots m_n$:
- Replace each column $m_i$ with profile entry $p_i$
  - Frequency of each letter in $\Sigma$
  - Number of gaps
  - Optional: number of gap openings, extensions, closings
- Can think of this as a “likelihood” of each letter in each position
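A profile of this kind can be computed column by column. The sketch below is one possible layout, not a standard format: each column becomes a dictionary of letter frequencies (as a fraction of all rows, an assumption) plus a gap count. The input is the small three-row alignment x, y, z used in the induced-alignment example, and ASCII `-` is assumed as the gap character.

```python
from collections import Counter

def build_profile(rows, alphabet="ACGT", gap="-"):
    """One dict per column: letter frequencies plus the number of gaps."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        entry = {a: counts[a] / len(rows) for a in alphabet}
        entry["gaps"] = counts[gap]
        profile.append(entry)
    return profile

rows = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
profile = build_profile(rows)
print(profile[0])  # column 1: A in two of three rows, G in one
print(profile[2])  # column 3: C in one row, two gaps
```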

Multiple Sequence Alignments
Algorithms

Multidimensional DP
Generalization of Needleman-Wunsch:
$$S(m) = \sum_i S(m_i)$$
(sum of column scores)
$F(i_1, i_2, \dots, i_N)$: optimal alignment up to $(i_1, \dots, i_N)$
$$F(i_1, i_2, \dots, i_N) = \max_{\text{all neighbors of cube}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$$

Multidimensional DP
Example: in 3D (three sequences), 7 neighbors per cell
$$F(i,j,k) = \max \begin{cases}
F(i-1,\, j-1,\, k-1) + S(x_i, x_j, x_k) \\
F(i-1,\, j-1,\, k) + S(x_i, x_j, -) \\
F(i-1,\, j,\, k-1) + S(x_i, -, x_k) \\
F(i-1,\, j,\, k) + S(x_i, -, -) \\
F(i,\, j-1,\, k-1) + S(-, x_j, x_k) \\
F(i,\, j-1,\, k) + S(-, x_j, -) \\
F(i,\, j,\, k-1) + S(-, -, x_k)
\end{cases}$$
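The three-sequence recurrence can be written directly as a dynamic program over a cube of cells. The sketch below returns only the optimal score, not the alignment, and assumes a sum-of-pairs column score with match +1, mismatch −1, gap −2, and gap-versus-gap 0; those values and the function names are illustrative choices.

```python
from itertools import product

def sp_column(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column; a gap-gap pair contributes 0."""
    score = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        score += gap if "-" in (p, q) else (match if p == q else mismatch)
    return score

def align3_score(x, y, z):
    """Optimal sum-of-pairs score of a global alignment of three sequences."""
    F = {(0, 0, 0): 0}
    # The 7 neighbors: every non-empty choice of which sequences advance.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    # Lexicographic order guarantees every predecessor is filled in first.
    for i, j, k in product(range(len(x) + 1), range(len(y) + 1), range(len(z) + 1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = float("-inf")
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if min(pi, pj, pk) < 0:
                continue  # this neighbor lies outside the cube
            column = (x[pi] if di else "-",
                      y[pj] if dj else "-",
                      z[pk] if dk else "-")
            best = max(best, F[(pi, pj, pk)] + sp_column(*column))
        F[(i, j, k)] = best
    return F[(len(x), len(y), len(z))]

print(align3_score("AC", "AC", "AC"))  # two columns of three matching pairs
print(align3_score("A", "A", ""))      # one column: A, A, gap
```

The table holds one cell per index triple, so memory and time grow as the product of the three sequence lengths, which is why this approach is limited to a handful of short sequences.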

Multidimensional DP
Running time:
- Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
- Neighbors per cell: $2^N - 1$
- Therefore: $O(2^N L^N)$

Progressive Alignment
Figure: guide tree joining x and y into $p_{xy}$, z and w into $p_{zw}$, and those two into $p_{xyzw}$.
When the evolutionary tree is known:
- Align closest first, in the order of the tree
- In each step, align two sequences x, y, or profiles $p_x$, $p_y$, to generate a new alignment with associated profile $p_{result}$
Weighted version:
- Tree edges have weights, proportional to the divergence in that edge
- New profile is a weighted average of the two old profiles

Example
Profile: (A, C, G, T, -)
$p_x$ = (0.8, 0.2, 0, 0, 0)
$p_y$ = (0.6, 0, 0, 0, 0.4)
$s(p_x, p_y)$ = 0.8·0.6·s(A, A) + 0.2·0.6·s(C, A) + 0.8·0.4·s(A, -) + 0.2·0.4·s(C, -)
Result: $p_{xy}$ = (0.7, 0.1, 0, 0, 0.2)
$s(p_x, -)$ = 0.8·1.0·s(A, -) + 0.2·1.0·s(C, -)
Result: $p_{x-}$ = (0.4, 0.1, 0, 0, 0.5)
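The profile arithmetic in this example can be reproduced with a short sketch. The column score is the expected value of s(a, b) under the two profile columns, and the merged column is their average. The letter-level scoring function `simple_score` is a hypothetical stand-in (match +1, mismatch −1, letter versus gap −2, gap versus gap 0), since the slides leave s unspecified; equal weights are assumed for the merge.

```python
ALPHABET = ("A", "C", "G", "T", "-")

def column_score(px, py, s):
    """Expected score: sum over a, b of px[a] * py[b] * s(a, b)."""
    return sum(px[i] * py[j] * s(a, b)
               for i, a in enumerate(ALPHABET)
               for j, b in enumerate(ALPHABET))

def merge_columns(px, py, wx=0.5, wy=0.5):
    """Weighted average of two profile columns (equal weights by default)."""
    return tuple(wx * a + wy * b for a, b in zip(px, py))

def simple_score(a, b):
    """Hypothetical letter-level scores, chosen only for illustration."""
    if a == "-" and b == "-":
        return 0.0
    if a == "-" or b == "-":
        return -2.0
    return 1.0 if a == b else -1.0

px = (0.8, 0.2, 0.0, 0.0, 0.0)
py = (0.6, 0.0, 0.0, 0.0, 0.4)
gap_column = (0.0, 0.0, 0.0, 0.0, 1.0)  # aligning px against a pure gap

print(column_score(px, py, simple_score), merge_columns(px, py))
print(column_score(px, gap_column, simple_score), merge_columns(px, gap_column))
```

With unequal weights passed as `wx` and `wy`, the same `merge_columns` function gives the weighted version, where the weights would come from the tree edges.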

When the evolutionary tree is unknown:
- Perform all pairwise alignments
- Define distance matrix D, where D(x, y) is a measure of evolutionary distance based on the pairwise alignment
- Construct a tree (UPGMA / Neighbor Joining / other methods)
- Align on the tree

MUSCLE at a glance
1. Fast measurement of all pairwise distances between sequences: $D_{DRAFT}(x, y)$ is defined in terms of the number of common k-mers (k ≈ 3), in $O(N^2 L \log L)$ time
2. Build tree $T_{DRAFT}$ based on those distances, with UPGMA
3. Progressive alignment over $T_{DRAFT}$, resulting in multiple alignment $M_{DRAFT}$
4. Measure new Kimura-based distances D(x, y) based on $M_{DRAFT}$
5. Build tree T based on D
6. Progressive alignment over T, to build M
7. Iterative refinement; for many rounds, do:
   - Tree partitioning: split M on one branch and realign the two resulting profiles
   - If the new alignment M’ has a better sum-of-pairs score than the previous one, accept
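The first step relies on an alignment-free distance built from shared k-mers. The sketch below illustrates the idea only; it is a simplified measure assumed for this example (one minus the fraction of shared k-mers relative to the shorter sequence) and should not be read as MUSCLE's exact formula.

```python
from collections import Counter

def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Multiset of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(x: str, y: str, k: int = 3) -> float:
    """1 - (shared k-mers) / (number of k-mers in the shorter sequence)."""
    shortest = min(len(x), len(y)) - k + 1
    if shortest <= 0:
        raise ValueError("sequences must be at least k long")
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((kmer_counts(x, k) & kmer_counts(y, k)).values())
    return 1.0 - shared / shortest

print(kmer_distance("ACGTACGT", "ACGTACGT"))  # identical sequences
print(kmer_distance("ACGTAC", "ACGTTT"))      # two of four 3-mers shared
print(kmer_distance("AAAAAA", "CCCCCC"))      # nothing shared
```

Because no alignment is computed, all pairwise distances can be obtained quickly, which is what makes the draft tree cheap to build before the more accurate Kimura-based pass.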
