CS262 Lecture 14, Win06, Batzoglou Multiple Sequence Alignments.


Progressive Alignment
When the evolutionary tree is known:
- Align the closest sequences first, in the order of the tree
- In each step, align two sequences x, y, or profiles p_x, p_y, to generate a new alignment with associated profile p_result
Weighted version:
- Tree edges have weights, proportional to the divergence along that edge
- The new profile is a weighted average of the two old profiles
[Figure: tree over sequences x, y, z, w with profiles p_xy, p_zw and p_xyzw]

Progressive Alignment
When the evolutionary tree is unknown:
- Perform all pairwise alignments
- Define a distance matrix D, where D(x, y) is a measure of evolutionary distance based on the pairwise alignment
- Construct a tree (UPGMA / Neighbor Joining / other methods)
- Align on the tree
[Figure: sequences x, y, z, w with an unknown tree]
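
The tree-construction step can be sketched with UPGMA. This is a minimal illustration, not the lecture's implementation: the function name `upgma`, the `frozenset`-keyed distance dictionary, and the nested-tuple tree representation are all assumptions made for this example.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a guide tree (nested tuples) by UPGMA.

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) -> distance D(a, b).
    """
    size = {name: 1 for name in names}   # cluster -> number of leaves in it
    d = dict(dist)                        # working copy, extended with merged clusters
    while len(size) > 1:
        # Merge the two closest clusters.
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        na, nb = size.pop(a), size.pop(b)
        merged = (a, b)
        # Distance to every remaining cluster is the size-weighted average.
        for c in size:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        size[merged] = na + nb
    return next(iter(size))

# Hypothetical distances: x is close to y, and z is close to w.
example = {
    frozenset(("x", "y")): 2, frozenset(("z", "w")): 2,
    frozenset(("x", "z")): 8, frozenset(("x", "w")): 8,
    frozenset(("y", "z")): 8, frozenset(("y", "w")): 8,
}
guide_tree = upgma(["x", "y", "z", "w"], example)
```

Progressive alignment would then align the two children of each internal node, bottom-up.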

Heuristics to improve alignments
- Iterative refinement schemes
- A*-based search
- Consistency
- Simulated Annealing
- …

Iterative Refinement
One problem of progressive alignment: initial alignments are "frozen" even when new evidence comes.
Example:
  x: GAAGTT
  y: GAC-TT   (frozen!)
  z: GAACTG
  w: GTACTG
Once z and w are seen, it is clear that the correct alignment is y = G-ACTT.

Iterative Refinement
Algorithm (Barton-Sternberg):
1. For j = 1 to N: remove x_j, and realign it to x_1 … x_{j-1} x_{j+1} … x_N
2. Repeat step 1 until convergence
[Figure: sequences x, y, z; the projection of x, z is held fixed while y is allowed to vary]

Iterative Refinement
Example: align (x, y), (z, w), (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
After realigning y:
  x: GAAGTTA
  y: G-ACTTA   (+ 3 matches)
  z: GAACTGA
  w: GTACTGA
Variant: refinement on a tree
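
The "+ 3 matches" claim in the example above can be checked by counting identical residue pairs over all pairs of rows. This is a small sketch; the helper name `sum_of_pairs_matches` is hypothetical, and it counts matches only (no mismatch or gap penalties), which is an assumption about how the slide scores the example.

```python
from itertools import combinations

def sum_of_pairs_matches(rows):
    """Count identical, non-gap residue pairs over every column and every pair of rows."""
    total = 0
    for a, b in combinations(rows, 2):
        total += sum(1 for p, q in zip(a, b) if p == q and p != "-")
    return total

before = ["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]  # x, y, z, w
after = ["GAAGTTA", "G-ACTTA", "GAACTGA", "GTACTGA"]   # y realigned

gain = sum_of_pairs_matches(after) - sum_of_pairs_matches(before)
print(gain)  # 3
```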

Iterative Refinement
Example not handled well:
  x:   GAAGTTA
  y_1: GAC-TTA
  y_2: GAC-TTA
  y_3: GAC-TTA
  z:   GAACTGA
  w:   GTACTGA
Realigning any single y_i changes nothing.

A* for Multiple Alignments
Review of the A* algorithm
Say that we have a gigantic graph G
- START: start node
- GOAL: we want to reach this node along the minimum-cost path
- Dijkstra: O(V log V + E), too slow if the number of edges is huge
- A*: a way of finding the optimal solution faster in practice
[Figure: graph with nodes START, v, GOAL]

A* for Multiple Alignments
Review of the A* algorithm
- g(v) is the cost so far, from START to v
- h(v) is an estimate of the minimum cost from v to GOAL
- f(v) = g(v) + h(v) estimates the minimum cost of a path passing through v; when h(v) never overestimates, the true cost is ≥ g(v) + h(v)
1. Expand the v with the smallest f(v)
2. Never expand v if f(v) ≥ the shortest path to the goal found so far
[Figure: path START to v with cost g(v), and v to GOAL with estimate h(v)]
Lemma: Given sequences x, y, z, …, the sum-of-pairs score of a multiple alignment M is no better than the sum of the scores of the optimal pairwise alignments.
Proof: M induces projected pairwise alignments a_xy, a_yz, a_xz, …, and Score(M) = d(a_xy) + d(a_xz) + d(a_yz) + … Each d(.) is no better than the score of the corresponding optimal pairwise alignment.
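
The search loop reviewed above can be sketched on an explicit graph. This is a generic A* illustration under stated assumptions: the adjacency-dictionary format, the function name `a_star`, and the toy graph are made up for this example, and `h` is assumed never to overestimate the remaining cost.

```python
import heapq

def a_star(graph, start, goal, h):
    """Return the minimum path cost from start to goal, or None if unreachable.

    graph: dict node -> list of (neighbor, edge_cost), with non-negative costs.
    h:     function node -> lower bound on the remaining cost to goal.
    """
    best_g = {start: 0}                  # g(v): cheapest known cost from start
    frontier = [(h(start), start)]       # priority queue ordered by f = g + h
    while frontier:
        f, v = heapq.heappop(frontier)
        if v == goal:
            return best_g[v]
        if f > best_g[v] + h(v):
            continue                     # stale entry: a cheaper path to v was found later
        for u, cost in graph.get(v, []):
            g = best_g[v] + cost
            if g < best_g.get(u, float("inf")):
                best_g[u] = g
                heapq.heappush(frontier, (g + h(u), u))
    return None

# Toy graph (hypothetical): the cheapest route is S -> A -> B -> G with cost 4.
toy = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)], "B": [("G", 1)]}
lower_bound = {"S": 3, "A": 2, "B": 1, "G": 0}
cost = a_star(toy, "S", "G", lower_bound.get)
```

For multiple alignment the graph is implicit: nodes are cells of the N-dimensional DP matrix and neighbors are generated on demand rather than stored.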

A* for Multiple Alignments
- Nodes: cells in the DP matrix
- g(v): alignment cost so far
- h(v): sum-of-pairs of individual pairwise alignments
- Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments
To compute h(v):
- For each pair of sequences x, y, compute F_R(x, y), the DP matrix of scores of aligning a suffix of x to a suffix of y
- Then, at position (i_1, i_2, …, i_N), h(v) becomes the sum of (N choose 2) F_R scores
[Figure: path START to v with cost g(v), and v to GOAL with estimate h(v)]
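
A minimal sketch of the h(v) computation described above, phrased as costs to be minimized. The unit mismatch and gap costs, and the names `suffix_costs` and `make_heuristic`, are assumptions for illustration; the slides do not fix a scoring scheme.

```python
from itertools import combinations

def suffix_costs(x, y, mismatch=1, gap=1):
    """F[i][j] = minimum cost of aligning the suffix x[i:] with the suffix y[j:]."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):       # x suffix against empty y suffix
        F[i][n] = F[i + 1][n] + gap
    for j in range(n - 1, -1, -1):       # empty x suffix against y suffix
        F[m][j] = F[m][j + 1] + gap
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            sub = 0 if x[i] == y[j] else mismatch
            F[i][j] = min(F[i + 1][j + 1] + sub, F[i + 1][j] + gap, F[i][j + 1] + gap)
    return F

def make_heuristic(seqs):
    """Return h(pos): the sum of (N choose 2) pairwise suffix costs at position tuple pos."""
    tables = {
        (a, b): suffix_costs(seqs[a], seqs[b])
        for a, b in combinations(range(len(seqs)), 2)
    }
    def h(pos):
        return sum(table[pos[a]][pos[b]] for (a, b), table in tables.items())
    return h

h = make_heuristic(["GAT", "GT", "GAT"])
start_estimate = h((0, 0, 0))   # lower bound for the whole alignment
```

Each table costs O(L^2) to fill once, after which evaluating h at any cell takes (N choose 2) lookups.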

Consistency
[Figure: three sequences x, y, z with positions x_i, y_j, y_j' and z_k]

Consistency
Basic method for applying consistency:
- Compute all pairs of alignments xy, xz, yz, …
- When aligning x, y during progressive alignment:
  - For each (x_i, y_j), let s(x_i, y_j) = function_of(x_i, y_j, a_xz, a_yz)
  - Align x and y with DP using the modified s(.,.) function
[Figure: three sequences x, y, z with positions x_i, y_j, y_j' and z_k]

Some Resources
Genome Resources
- Annotation and alignment genome browser at UCSC: http://genome.ucsc.edu/cgi-bin/hgGateway
- Specialized VISTA alignment browser at LBNL: http://pipeline.lbl.gov/cgi-bin/gateway2
Protein Multiple Aligners
- CLUSTALW (most widely used): http://www.ebi.ac.uk/clustalw/
- MUSCLE (most scalable): http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
- PROBCONS (most accurate): http://probcons.stanford.edu/

MUSCLE at a glance
1. Fast measurement of all pairwise distances between sequences: D_DRAFT(x, y) is defined in terms of the number of common k-mers (k ~ 3); O(N^2 L log L) time
2. Build tree T_DRAFT based on those distances, with UPGMA
3. Progressive alignment over T_DRAFT, resulting in multiple alignment M_DRAFT
4. Measure new Kimura-based distances D(x, y) based on M_DRAFT
5. Build tree T based on D
6. Progressive alignment over T, to build M
   - Only perform alignment steps for the parts of the tree that have changed
7. Iterative refinement; for many rounds, do:
   - Tree partitioning: split M on one branch and realign the two resulting profiles
   - If the new alignment M' has a better sum-of-pairs score than the previous one, accept it
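
Step 1 can be illustrated with a simple k-mer distance. The slide only says the distance is defined in terms of the number of common k-mers, so the exact formula below (one minus the fraction of shared k-mers relative to the shorter sequence) is an assumption for illustration, as are the function names.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Multiset of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(x, y, k=3):
    """Alignment-free draft distance in [0, 1]; 0 means every k-mer is shared."""
    shortest = min(len(x), len(y)) - k + 1   # k-mers in the shorter sequence
    if shortest <= 0:
        return 1.0                           # too short to compare: treat as maximally distant
    shared = sum((kmer_counts(x, k) & kmer_counts(y, k)).values())
    return 1.0 - shared / shortest

d = kmer_distance("GATTACA", "GATTTCA")   # shares GAT and ATT out of 5 k-mers
```

No alignment is computed at this stage, which is what makes the draft distances cheap enough for large inputs.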

PROBCONS at a glance
1. Computation of all posterior matrices M_xy: M_xy(i, j) = Prob(x_i ~ y_j), using an HMM
2. Re-estimation of posterior matrices M'_xy with probabilistic consistency:
   $M'_{xy}(i, j) = \frac{1}{N} \sum_{z} \sum_{k} M_{xz}(i, k)\, M_{yz}(j, k)$, that is, $M'_{xy} = \mathrm{Avg}_z (M_{xz} M_{zy})$
3. Compute for every pair x, y the maximum expected accuracy alignment A_xy: the alignment A that maximizes $\sum_{(i, j) \in A} M'_{xy}(i, j)$
   Define $E(x, y) = \sum_{(i, j) \in A_{xy}} M'_{xy}(i, j)$
4. Build tree T with hierarchical clustering using similarity measure E(x, y)
5. Progressive alignment on T to maximize E(.,.)
6. Iterative refinement; for many rounds, do:
   - Randomized partitioning: split the sequences in M into two subsets by flipping a coin for each sequence, and realign the two resulting profiles
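
The re-estimation in step 2 is an average of matrix products, sketched below in plain Python. Two assumptions are made here and should be checked against the PROBCONS paper: the sum runs over all N sequences including x and y themselves, and M_xx is treated as the identity, so the terms z = x and z = y each contribute M_xy. The dictionary layout `post[(a, b)]` is hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def consistency_transform(post, x, y, names):
    """Return M'_xy = (1/N) * sum over z of M_xz M_zy.

    post[(a, b)]: posterior matrix with rows over positions of a, columns over b.
    Assumption: z = x and z = y each contribute M_xy (M_xx taken as the identity).
    """
    m_xy = post[(x, y)]
    total = [[2.0 * value for value in row] for row in m_xy]
    for z in names:
        if z in (x, y):
            continue
        product = matmul(post[(x, z)], post[(z, y)])
        for i, row in enumerate(product):
            for j, value in enumerate(row):
                total[i][j] += value
    return [[value / len(names) for value in row] for row in total]

# Tiny hypothetical posteriors: one position per sequence.
post = {("x", "y"): [[0.5]], ("x", "z"): [[1.0]], ("z", "y"): [[1.0]]}
refined = consistency_transform(post, "x", "y", ["x", "y", "z"])
```

In this toy case the third sequence z supports pairing x_1 with y_1, so the refined posterior rises from 0.5 to 2/3.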

Rapid Global Alignments
How to align genomic sequences in (more or less) linear time


Motivation
Genomic sequences are very long:
- Human genome = 3 x 10^9 bases long
- Mouse genome = 2.7 x 10^9 bases long
Aligning genomic regions is useful for revealing common gene structure:
- Useful to compare regions > 1,000,000 bases long

Main Idea
Genomic regions of interest contain islands of similarity, such as genes.
1. Find local alignments
2. Chain an optimal subset of them
3. Refine/complete the alignment
Systems that use this idea to various degrees: MUMmer, GLASS, DIALIGN, CHAOS, AVID, LAGAN, TBA, and others

Saving cells in DP
1. Find local alignments
2. Chain: O(N log N), via L.I.S.
3. Restricted DP

Methods to CHAIN Local Alignments
Sparse Dynamic Programming: O(N log N)

The Problem: Find a Chain of Local Alignments
- (x, y) → (x', y') requires x < x' and y < y'
- Each local alignment has a weight
- FIND the chain with the highest total weight

Quadratic Time Solution
Build a Directed Acyclic Graph (DAG):
- Nodes: local alignments [(x_a, x_b), (y_a, y_b)] with their scores
- Directed edges: pairs of local alignments that can be chained
  edge ((x_a, x_b, y_a, y_b), (x_c, x_d, y_c, y_d)) requires
  x_a < x_b < x_c < x_d and y_a < y_b < y_c < y_d
Each local alignment is a node v_i with alignment score s_i.

Quadratic Time Solution
Initialization: find each node v_a such that there is no edge (u, v_a); set the score V(a) to be s_a
Iteration: for each v_i, the optimal path ending in v_i has total score
  $V(i) = \max_{j \,:\, \text{edge } (v_j, v_i)} \big( \mathrm{weight}(v_j, v_i) + V(j) \big)$
Termination: the optimal global chain ends at $j = \arg\max_j V(j)$; trace the chain back from v_j
Worst case time: quadratic
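
The quadratic recurrence can be sketched directly. In this illustration each local alignment is a tuple `(x_a, x_b, y_a, y_b, score)`, and the weight of an edge into v_i is taken to be s_i; both the tuple layout and that reading of weight(v_j, v_i) are assumptions, as is the function name `best_chain`.

```python
def best_chain(alignments):
    """Return (total score, list of indices) of the highest-scoring chain.

    alignments: list of (x_a, x_b, y_a, y_b, score) with x_a < x_b and y_a < y_b.
    """
    # Any predecessor of i ends before i starts, so sorting by x_a is a topological order.
    order = sorted(range(len(alignments)), key=lambda i: alignments[i][0])
    V, back = {}, {}
    for pos, i in enumerate(order):
        x_a, _, y_a, _, score = alignments[i]
        V[i], back[i] = score, None              # chain consisting of i alone
        for j in order[:pos]:
            _, x_b, _, y_b, _ = alignments[j]
            if x_b < x_a and y_b < y_a and V[j] + score > V[i]:
                V[i], back[i] = V[j] + score, j  # extend the best chain ending in j
    if not V:
        return 0, []
    end = max(V, key=V.get)
    chain, i = [], end
    while i is not None:                          # trace back from the best end node
        chain.append(i)
        i = back[i]
    return V[end], chain[::-1]

hits = [(0, 2, 0, 2, 5), (3, 5, 6, 8, 6), (4, 6, 3, 5, 4), (7, 9, 6, 8, 3)]
score, chain = best_chain(hits)
```

The inner loop over all earlier alignments is what makes this O(N^2); the sparse DP slides that follow replace it with a logarithmic-time lookup.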

Sparse Dynamic Programming
Back to the LCS problem:
Given two sequences
- x = x_1, …, x_m
- y = y_1, …, y_n
Find the longest common subsequence
- Quadratic solution with DP
How about when "hits" x_i = y_j are sparse?

Sparse Dynamic Programming
[Figure: grid of hits for x = 15 3 24 16 20 4 24 3 11 18 and y = 4 20 24 3 11 15 11 4 18 20]
Imagine a situation where the number of hits is much smaller than O(nm), maybe O(n) instead.

Sparse Dynamic Programming: L.I.S.
Longest Increasing Subsequence
Given a sequence over an ordered alphabet
- x = x_1, …, x_m
Find a subsequence
- s = s_1, …, s_k
- s_1 < s_2 < … < s_k

Sparse Dynamic Programming: L.I.S.
Let the input be w: w_1, …, w_n
INITIALIZATION:
  L: 1-indexed array, L[1] ← w_1
  B: 0-indexed array of backpointers; B[0] = 0
  P: array used for traceback
  // L[j]: smallest last element w_i of a j-long LIS seen so far
ALGORITHM:
  for i = 2 to n {
    Find j such that L[j] < w[i] ≤ L[j+1]
    L[j+1] ← w[i]
    B[j+1] ← i
    P[i] ← B[j]
  }
That's it!!! Running time?
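
A runnable sketch of the algorithm above, using binary search for the "find j" step so that the whole loop takes O(n log n). The names `tails`, `tail_idx` and `prev` play the roles of L, B and P on the slide; the 0-indexed arrays and the use of Python's `bisect` module are choices made for this example.

```python
from bisect import bisect_left

def longest_increasing_subsequence(w):
    """Return one longest strictly increasing subsequence of w."""
    tails = []                  # tails[j]: smallest last value of an increasing run of length j + 1
    tail_idx = []               # tail_idx[j]: index in w of tails[j]
    prev = [None] * len(w)      # prev[i]: index of the element preceding w[i] in its run
    for i, value in enumerate(w):
        j = bisect_left(tails, value)     # first position with tails[j] >= value
        if j == len(tails):
            tails.append(value)
            tail_idx.append(i)
        else:
            tails[j] = value
            tail_idx[j] = i
        prev[i] = tail_idx[j - 1] if j > 0 else None
    result = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:                  # trace back from the end of the longest run
        result.append(w[i])
        i = prev[i]
    return result[::-1]

lis = longest_increasing_subsequence([3, 1, 4, 1, 5, 9, 2, 6])
print(lis)  # [1, 4, 5, 6]
```

Each of the n iterations performs one binary search over at most n entries, which answers the slide's closing question.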

Sparse LCS expressed as LIS
Create a sequence w:
- Every matching point (i, j) is inserted into w as follows:
- For each column j, from smallest to largest, insert in w the points (i, j), in decreasing row i order
- The 11 example points are inserted in the order given
- a = (y, x) and b = (y', x') can be chained iff
  - a is before b in w, and
  - y < y'
[Figure: grid of matching points for the example sequences x and y]

Sparse LCS expressed as LIS
Create a sequence w:
  w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
Consider now w's elements as ordered lexicographically, where (y, x) < (y', x') if y < y'.
Claim: an increasing subsequence of w is a common subsequence of x and y.
[Figure: grid of matching points for the example sequences x and y]

Sparse Dynamic Programming for LIS
Algorithm:
  initialize empty array L
  /* at each point, l_j holds the smallest last element w_i among all j-long increasing subsequences seen so far */
  for i = 1 to |w|
    binary search for w[i] in L, to find l_j < w[i] ≤ l_{j+1}
    replace l_{j+1} with w[i]
    keep a backpointer from w[i] to l_j
That's it!!!
[Figure: grid of matching points for the example sequences x and y]

Sparse Dynamic Programming for LIS
Example:
  w = (4,2) (3,3) (10,5) (2,5) (8,6) (1,6) (3,7) (4,8) (7,9) (5,9) (9,10)
  L =
   1. (4,2)
   2. (3,3)
   3. (3,3) (10,5)
   4. (2,5) (10,5)
   5. (2,5) (8,6)
   6. (1,6) (8,6)
   7. (1,6) (3,7)
   8. (1,6) (3,7) (4,8)
   9. (1,6) (3,7) (4,8) (7,9)
  10. (1,6) (3,7) (4,8) (5,9)
  11. (1,6) (3,7) (4,8) (5,9) (9,10)
Longest common subsequence: s = 4, 24, 3, 11, 18
[Figure: grid of matching points for the example sequences x and y]
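
The reduction from sparse LCS to LIS can be sketched end to end: enumerate the hits column by column with rows in decreasing order, then run the LIS routine on the row indices. The function name `sparse_lcs`, the 0-based indices, and the string example are assumptions made for this illustration.

```python
from bisect import bisect_left
from collections import defaultdict

def sparse_lcs(x, y):
    """Return one longest common subsequence of x and y via LIS over the hits."""
    rows_of = defaultdict(list)           # symbol -> increasing list of rows i with y[i] == symbol
    for i, symbol in enumerate(y):
        rows_of[symbol].append(i)
    # w: hits (i, j) listed column by column, rows in decreasing order within a column,
    # so that a strictly increasing run of rows can use each column at most once.
    w = [(i, j) for j, symbol in enumerate(x) for i in reversed(rows_of.get(symbol, []))]
    tails, tail_pos = [], []              # smallest last row per run length, and its position in w
    prev = [None] * len(w)
    for p, (i, _) in enumerate(w):
        k = bisect_left(tails, i)
        if k == len(tails):
            tails.append(i)
            tail_pos.append(p)
        else:
            tails[k] = i
            tail_pos[k] = p
        prev[p] = tail_pos[k - 1] if k > 0 else None
    result = []
    p = tail_pos[-1] if tail_pos else None
    while p is not None:                  # trace back through the chosen hits
        result.append(y[w[p][0]])
        p = prev[p]
    return result[::-1]

common = sparse_lcs("ABCBDAB", "BDCABA")
```

With H hits the running time is O(H log H) after building `rows_of`, which beats the quadratic DP whenever hits are sparse.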

Sparse DP for rectangle chaining
- 1, …, N: rectangles
- (h_j, l_j): y-coordinates of rectangle j
- w(j): weight of rectangle j
- V(j): optimal score of a chain ending in j
- L: list of triplets (l_j, V(j), j)
  - L is sorted by l_j: top to bottom
  - L is implemented as a balanced binary tree
[Figure: a rectangle with y-coordinates h and l]

Sparse DP for rectangle chaining
Main idea:
- Sweep through the x-coordinates
- To the right of b, anything chainable to a is chainable to b
- Therefore, if V(b) > V(a), rectangle a is "useless": remove it
- In L, keep rectangles j sorted with increasing l_j coordinates, which implies they are also sorted with increasing V(j)
[Figure: rectangles a and b with scores V(a) and V(b)]

Sparse DP for rectangle chaining
Go through the rectangle x-coordinates, from lowest to highest:
1. When on the leftmost end of i:
   a. j: rectangle in L with the largest l_j < h_i
   b. V(i) = w(i) + V(j)
2. When on the rightmost end of i:
   a. k: rectangle in L with the largest l_k ≤ l_i
   b. If V(i) > V(k):
      i. INSERT (l_i, V(i), i) in L
      ii. REMOVE all (l_j, V(j), j) with V(j) ≤ V(i) and l_j ≥ l_i
[Figure: rectangles i, j, k]
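
The sweep above can be sketched as follows. Two simplifications are assumptions of this example: L is kept in parallel sorted Python lists searched with `bisect` rather than in a balanced binary tree, so insertions cost O(N) instead of O(log N), and rectangles are tuples `(x_left, x_right, h, l, weight)` with h < l. The function name `chain_rectangles` is hypothetical.

```python
from bisect import bisect_left, bisect_right

def chain_rectangles(rects):
    """Return (best score, chain of rectangle indices) using the left/right sweep."""
    events = []
    for i, (x_left, x_right, _, _, _) in enumerate(rects):
        events.append((x_left, 0, i))    # left ends sort before right ends at equal x
        events.append((x_right, 1, i))
    events.sort()
    ls, vs, ids = [], [], []             # L: entries sorted by l, and therefore by V
    V = [0] * len(rects)
    prev = [None] * len(rects)
    for _, kind, i in events:
        _, _, h, l, weight = rects[i]
        if kind == 0:
            p = bisect_left(ls, h) - 1   # entry with the largest l_j < h_i
            V[i] = weight + (vs[p] if p >= 0 else 0)
            prev[i] = ids[p] if p >= 0 else None
        else:
            p = bisect_right(ls, l) - 1  # entry with the largest l_k <= l_i
            if p < 0 or V[i] > vs[p]:
                start = bisect_left(ls, l)
                end = start
                while end < len(ls) and vs[end] <= V[i]:
                    end += 1             # dominated: l_j >= l_i and V(j) <= V(i)
                ls[start:end] = [l]
                vs[start:end] = [V[i]]
                ids[start:end] = [i]
    if not rects:
        return 0, []
    i = max(range(len(rects)), key=V.__getitem__)
    best, chain = V[i], []
    while i is not None:
        chain.append(i)
        i = prev[i]
    return best, chain[::-1]

boxes = [(0, 2, 0, 2, 5), (3, 5, 6, 8, 6), (4, 6, 3, 5, 4), (7, 9, 6, 8, 3)]
best, chain = chain_rectangles(boxes)
```

Swapping the lists for a balanced search tree keeps the same logic and gives the O(N log N) bound.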

Example
[Figure: five rectangles on x-y axes, labeled id: weight as 1: 5, 2: 6, 3: 3, 4: 4, 5: 2; coordinate ticks 2, 5, 6, 9, 10, 11, 12, 14, 15, 16]

Time Analysis
1. Sorting the x-coordinates takes O(N log N)
2. Going through the x-coordinates: N steps
3. Each of the N steps requires O(log N) time:
   - Searching L takes log N
   - Inserting into L takes log N
   - All deletions are consecutive, so log N per deletion
   - Each element is deleted at most once: N log N for all deletions
Recall that INSERT, DELETE and SUCCESSOR take O(log N) time in a balanced binary search tree.

Examples
Human Genome Browser
[Figure: panels A, B, C]

Whole-genome alignment
Rat, Mouse, Human

Next 2 years: 20+ mammals, and many other animals, will be sequenced and aligned
