Multiple Sequence Alignments. The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z.

Multiple Sequence Alignments

The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z

Definition Given N sequences x 1, x 2,…, x N :  Insert gaps (-) in each sequence x i, such that All sequences have the same length L Score of the global map is maximum A faint similarity between two sequences becomes significant if present in many Multiple alignments can help improve the pairwise alignments

Scoring Function: Sum Of Pairs Definition: Induced pairwise alignment A pairwise alignment induced by the multiple alignment Example: x:AC-GCGG-C y:AC-GC-GAG z:GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Sum Of Pairs (cont’d) Heuristic way to incorporate evolution tree: Human Mouse Chicken Weighted SOP: S(m) =  k<l w kl s(m k, m l ) w kl : weight decreasing with distance Duck

A Profile Representation Given a multiple alignment M = m 1 …m n  Replace each column m i with profile entry p i Frequency of each letter in  # gaps Optional: # gap openings, extensions, closings - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1.8 C.6 1.4 1.6.2 G 1.2.2.4 1 T.2 1.6.2 O.2.8.4.4 E.4 C.2.8.4.2

Multiple Sequence Alignments Algorithms

1. Multidimensional Dynamic Programming Generalization of Needleman-Wunsh: S(m) =  i S(m i ) (sum of column scores) F(i 1,i 2,…,i N ): Optimal alignment up to (i 1, …, i N ) F(i 1,i 2,…,i N )= max (all neighbors of cube) (F(nbr)+S(nbr))

Example: in 3D (three sequences): 7 neighbors/cell F(i,j,k) = max{ F(i-1,j-1,k-1)+S(x i, x j, x k ), F(i-1,j-1,k )+S(x i, x j, - ), F(i-1,j,k-1)+S(x i, -, x k ), F(i-1,j,k )+S(x i, -, - ), F(i,j-1,k-1)+S( -, x j, x k ), F(i,j-1,k )+S( -, x j, x k ), F(i,j,k-1)+S( -, -, x k ) } 1. Multidimensional Dynamic Programming

Running Time: 1.Size of matrix:L N ; Where L = length of each sequence N = number of sequences 2.Neighbors/cell: 2 N – 1 Therefore………………………… O(2 N L N ) 1. Multidimensional Dynamic Programming How do affine gaps generalize? VERY badly!  Require 2 N states, one per combination of gapped/ungapped sequences  Running time: O(2 N  2 N  L N ) = O(4 N L N ) XYXYZZ YYZ XXZ

2. Progressive Alignment When evolutionary tree is known:  Align closest first, in the order of the tree  In each step, align two sequences x, y, or profiles p x, p y, to generate a new alignment with associated profile p result Weighted version:  Tree edges have weights, proportional to the divergence in that edge  New profile is a weighted average of two old profiles x w y z Example Profile: (A, C, G, T, -) p x = (0.8, 0.2, 0, 0, 0) p y = (0.6, 0, 0, 0, 0.4) s(p x, p y ) = 0.8*0.6*s(A, A) + 0.2*0.6*s(C, A) + 0.8*0.4*s(A, -) + 0.2*0.4*s(C, -) Result: p xy = (0.7, 0.1, 0, 0, 0.2) s(p x, -) = 0.8*1.0*s(A, -) + 0.2*1.0*s(C, -) Result: p x- = (0.4, 0.1, 0, 0, 0.5)

2. Progressive Alignment When evolutionary tree is unknown:  Perform all pairwise alignments  Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment  Construct a tree (we will describe more in detail later in the course)  Align on the tree x w y z ?

Aligning two alignments Given two alignments, m 1, m 2, can we find the optimal alignment under SOP scoring, with affine gaps? GTAGTCAGTCG x m 1 ---GTCACGTG y GTCGTCAGTCG z m 2 --CGCCAGGGG w --CGCCAGGGA v m 1 x GGGCACTGCAT y GGTTACGTC-- m 2 z GGGAACTGCAG w GGACGTACC-- v GGACCT-----

Aligning two alignments Given two alignments, m 1, m 2, can we find the optimal alignment under SOP scoring, with affine gaps? NP-hard! GTAGTCAGTCG x m 1 ---GTCACGTG y GTCGTCAGTCG z m 2 --CGCCAGGGG w --CGCCAGGGA v m 1 x GGGCACTGCAT y GGTTACGTC-- m 2 z GGGAACTGCAG w GGACGTACC-- v GGACCT----- Optimistic: assume no gap – don’t pay gap-open penalty Pessimistic: assume gap – pay gap-open penalty

Heuristics to improve multiple alignments Iterative refinement schemes A*-based search Consistency Simulated Annealing …

Iterative Refinement One problem of progressive alignment: Initial alignments are “frozen” even when new evidence comes Example: x:GAAGTT y:GAC-TT z:GAACTG w:GTACTG Frozen! Now clear correct y = GA-CTT

Iterative Refinement Algorithm (Barton-Stenberg): 1.Align most similar x i, x j 2.Align x k most similar to (x i x j ) 3.Repeat 2 until (x 1 …x N ) are aligned 4.For j = 1 to N, Remove x j, and realign to x 1 …x j-1 x j+1 …x N 5.Repeat 4 until convergence Note: Guaranteed to converge

Iterative Refinement For each sequence y 1.Remove y 2.Realign y (while rest fixed) x y z x,z fixed projection allow y to vary

Iterative Refinement Example: align (x,y), (z,w), (xy, zw): x:GAAGTTA y:GAC-TTA z:GAACTGA w:GTACTGA After realigning y: x:GAAGTTA y:G-ACTTA + 3 matches z:GAACTGA w:GTACTGA

Iterative Refinement Example not handled well: x:GAAGTTA y 1 :GAC-TTA y 2 :GAC-TTA y 3 :GAC-TTA z:GAACTGA w:GTACTGA Realigning any single y i changes nothing

A* for Multiple Alignments Review of the A* algorithm v START GOAL Say that we have a gigantic graph G START: start node GOAL: we want to reach this node with the minimum path Dijkstra: O(VlogV + E) – too slow if the number of edges is huge A*: a way of finding the optimal solution faster in practice

A* for Multiple Alignments Review of the A* algorithm v START GOAL g(v) h(v) g(v) is the cost so far h(v) is an estimate of the minimum cost from v to GOAL f(v) ≥ g(v) + h(v) is the minimum cost of a path passing by v 1. Expand v with the smallest f(v) 2. Never expand v, if f(v) ≥ shortest path to the goal found so far Lemma Given sequences x, y, z, … The sum-of pairs score of multiple alignment M is lower (worse) than the sum of the optimal pairwise alignments Proof M induces projected pairwise alignments a xy, a yz, a xz, …, and Score(M) = d(a xy ) + d(a xz ) + d(a yz ) +… Each of d(.) is smaller than the optimal edit distance

A* for Multiple Alignments Nodes: Cells in the DP matrix g(v): alignment cost so far h(v): sum-of-pairs of individual pairwise alignments Initial minimum alignment cost estimate: sum-of-pairs of global pairwise alignments v START GOAL g(v) h(v) To compute h(v) For each pair of sequences x, y, Compute F R (x, y), the DP matrix of scores of aligning a suffix of x to a suffix of y Then, at position (i 1, i 2, …, i N ), h(v) becomes the sum of (N choose 2) F R scores

Consistency z x y xixi yjyj y j’ zkzk

Consistency Basic method for applying consistency Compute all pairs of alignments xy, xz, yz, … When aligning x, y during progressive alignment,  For each (x i, y j ), let s(x i, y j ) = function_of(x i, y j, a xz, a yz )  Align x and y with DP using the modified s(.,.) function z x y xixi yjyj y j’ zkzk

Some Resources Genome Resources Annotation and alignment genome browser at UCSC http://genome.ucsc.edu/cgi-bin/hgGateway Specialized VISTA alignment browser at LBNL http://pipeline.lbl.gov/cgi-bin/gateway2 ABC—Nice Stanford tool for browsing alignments http://encode.stanford.edu/~asimenos/ABC/ Protein Multiple Aligners http://www.ebi.ac.uk/clustalw/ CLUSTALW – most widely used http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py MUSCLE – most scalable http://probcons.stanford.edu/ PROBCONS – most accurate

Whole-genome alignment Rat—Mouse—Human

Next 2 years: 20+ mammals, & many other animals, will be sequenced & aligned

Multiple Sequence Alignments. The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z.

Similar presentations

Presentation on theme: "Multiple Sequence Alignments. The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Multiple Sequence Alignments. The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z.

Similar presentations

Presentation on theme: "Multiple Sequence Alignments. The Global Alignment problem AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC x y z."— Presentation transcript:

Similar presentations

About project

Feedback