Multiple Sequence Alignment

Multiple sequence alignment (Mult-Seq-Align) can detect similarities that cannot be detected with pairwise sequence alignment (Pairwise-Seq-Align) methods, and it detects family characteristics. Three questions: 1. Scoring. 2. Computation of the Mult-Seq-Align. 3. Family representation.

Multiple Sequence Alignment

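Scoring: SP (sum of pairs). The SP score of a column is the sum of the pairwise scores of all pairs of symbols in that column. For example, for column 3 with symbols $(-,A,A)$: $\rho_3(-,A,A) = s(-,A) + s(-,A) + s(A,A)$, where $s$ is the pairwise score and $s(-,-) = 0$. SP total score $= \sum_i \rho_i$, summed over the columns $i$.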
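The column score and the total SP score can be sketched in a few lines. This is a minimal illustration, not the scoring used in the slides: the unit costs (match 0, mismatch 1, gap 1) and the function names are assumptions chosen for the example; only the rule that the pair (-,-) scores 0 comes from the slide.

```python
from itertools import combinations

def pair_score(a, b, mismatch=1, gap=1):
    """Pairwise score of two symbols (assumed unit costs; (-,-) scores 0)."""
    if a == b:
        return 0          # identical symbols, including the pair (-,-)
    if a == "-" or b == "-":
        return gap        # a character aligned with a space
    return mismatch

def sp_column(column):
    """Sum of pairwise scores over all pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_total(rows):
    """SP total score: sum of the column scores of the alignment."""
    return sum(sp_column(col) for col in zip(*rows))

print(sp_column(("-", "A", "A")))       # (-,A) + (-,A) + (A,A)
print(sp_total(["AC-", "A-G", "ACG"]))
```

With a similarity matrix such as PAM or BLOSUM in place of `pair_score`, the same two sums apply; only the pairwise function changes.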

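Induced pairwise alignment: the induced pairwise alignment $a(S_i, S_j)$ is the projection of a multiple alignment onto rows $i$ and $j$, for example $a(S_1, S_2)$, $a(S_2, S_3)$, $a(S_1, S_3)$. With the pair $(-,-)$ scoring 0, SP total score $= \sum_{i<j} \mathrm{score}[a(S_i, S_j)]$.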
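The projection can be illustrated with a short sketch: take two rows of the multiple alignment and drop the columns in which both rows hold a space. The helper names and the unit costs below are assumptions made for the example. Because (-,-) scores 0, summing the induced pairwise scores over all pairs gives the SP total score.

```python
from itertools import combinations

def induced(rows, i, j):
    """Project the multiple alignment onto rows i and j, dropping (-,-) columns."""
    pairs = [(a, b) for a, b in zip(rows[i], rows[j]) if (a, b) != ("-", "-")]
    return "".join(a for a, _ in pairs), "".join(b for _, b in pairs)

def pairwise_score(x, y, mismatch=1, gap=1):
    """Score of a pairwise alignment given as two equal-length rows (assumed unit costs)."""
    total = 0
    for a, b in zip(x, y):
        if a != b:
            total += gap if "-" in (a, b) else mismatch
    return total

def sp_from_projections(rows):
    """SP total score computed as the sum of all induced pairwise scores."""
    return sum(pairwise_score(*induced(rows, i, j))
               for i, j in combinations(range(len(rows)), 2))

rows = ["A-C", "AG-", "A-C"]
print(induced(rows, 0, 2))
print(sp_from_projections(rows))
```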

Dyn.Prog. Solution

Dynamic Programming Solution. The best multiple alignment of $r$ sequences is calculated using an $r$-dimensional hyper-cube. The size of the hyper-cube is $O(\prod_i n_i)$. Time complexity: $O(2^r n^r) \times O(\text{computation of the } \rho \text{ function})$. The exact problem is NP-complete (metrics: sum-of-pairs or evolutionary tree). A more efficient solution is needed.
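To make the hyper-cube concrete, here is a small sketch of the exact dynamic program for the SP cost of $r$ sequences. Each cell is a tuple of prefix lengths, and each cell looks at up to $2^r - 1$ predecessor cells, which is where the $O(2^r n^r)$ factor comes from. The unit costs and the names `optimal_sp_cost` and `pair_dist` are assumptions for illustration; the sketch returns only the optimal cost, not the alignment, and is practical only for very short inputs.

```python
from functools import lru_cache
from itertools import combinations, product

def pair_dist(a, b, mismatch=1, gap=1):
    """Assumed unit-cost distance between two symbols; (-,-) costs 0."""
    if a == b:
        return 0
    if a == "-" or b == "-":
        return gap
    return mismatch

def sp_column(column):
    """SP cost of one column."""
    return sum(pair_dist(a, b) for a, b in combinations(column, 2))

def optimal_sp_cost(seqs):
    """Minimum SP cost over all multiple alignments of seqs (exact, exponential in r)."""
    r = len(seqs)
    # Each move consumes one character from a non-empty subset of the sequences.
    moves = [m for m in product((0, 1), repeat=r) if any(m)]

    @lru_cache(maxsize=None)
    def best(pos):
        if not any(pos):
            return 0  # all prefixes are empty
        candidates = []
        for m in moves:
            if any(step > p for step, p in zip(m, pos)):
                continue  # this move would consume from an exhausted prefix
            prev = tuple(p - step for p, step in zip(pos, m))
            column = tuple(seqs[k][pos[k] - 1] if m[k] else "-" for k in range(r))
            candidates.append(best(prev) + sp_column(column))
        return min(candidates)

    return best(tuple(len(s) for s in seqs))

print(optimal_sp_cost(["ACGT", "AGT", "ACT"]))
```

For two sequences the recurrence reduces to ordinary edit distance, which gives a quick sanity check.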

Multiple Alignment from Pairwise Alignments? Problem: the best pairwise alignments do not necessarily lead to the best multiple alignment.

[Figure: three sequences S1, S2, S3 built from Pattern-A, Pattern-B, Pattern-D and Pattern-X; the pairwise alignments S1S2, S1S3 and S2S3 are shown next to the correct multiple alignment S1S2S3.]

[Figure: S1, S2, S3 and their pairwise alignments S1S2, S1S3, S2S3.] Global or local alignment?

Center Star Alignment
[Figure: a star with center $S_c$ joined to $S_1, S_2, S_3, \dots, S_{k-2}, S_{k-1}, S_k$.]
(a) Scoring scheme: distance.
(b) The scoring scheme satisfies the triangle inequality: for any characters $a, b, c$, $\mathrm{dist}(a,c) \le \mathrm{dist}(a,b) + \mathrm{dist}(b,c)$ (in practice not all scoring matrices satisfy the triangle inequality).
(c) $D(S_i, S_j)$: score of the optimal pairwise alignment.
(d) $D(M) = \sum_{i<j} a_M(S_i, S_j)$: score of the multiple alignment $M$.
(e) $a_{M_c}(S_c, S_i)$: pairwise alignment/score induced by $M_c$.

The Center Star Algorithm:
(a) Find $S_c$ minimizing $\sum_{i \ne c} D(S_c, S_i)$.
(b) Iteratively construct the multiple alignment $M_c$:
1. $M_c = \{S_c\}$.
2. Add the sequences in $S \setminus \{S_c\}$ to $M_c$ one by one so that the induced alignment $a_{M_c}(S_c, S_i)$ of every newly added sequence $S_i$ with $S_c$ is optimal. Add spaces, when needed, to all pre-aligned sequences.
Example of step 2: the pairwise alignments AC-BC / DCABC and AC--BC / DCAABC are merged into the three rows AC--BC / DCA-BC / DCAABC.

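Running time: (a) Choosing $S_c$: $\binom{k}{2}$ pairwise alignments at $O(n^2)$ each. (b) Building $M_c$: $\sum_{i=1}^{k-1} O((i+1)\,n \cdot n) = O(k^2 n^2)$, since the worst-case length of the center row after the addition of $i$ strings is $(i+1) \cdot n$.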
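The two steps of the Center Star method can be sketched as follows. This is an illustrative implementation under stated assumptions, not the slides' own code: unit-cost distances (mismatch 1, gap 1), an arbitrary tie-break in the traceback, and the helper names `align_pair`, `add_to_star` and `center_star` are all choices made for the example. Spaces already present in the center row are reused when the new pairwise alignment also has a space there, which keeps every induced alignment with the center unchanged.

```python
def align_pair(s, t, mismatch=1, gap=1):
    """Optimal global alignment by distance; returns (cost, aligned s, aligned t)."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap
    for j in range(1, m + 1):
        d[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            d[i][j] = min(sub, d[i - 1][j] + gap, d[i][j - 1] + gap)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback from the bottom-right cell
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return d[n][m], "".join(reversed(a)), "".join(reversed(b))

def add_to_star(rows, center_aln, seq_aln):
    """Merge a pairwise alignment (center_aln, seq_aln) into rows; rows[0] is the center row."""
    cur, merged, new = rows[0], [[] for _ in rows], []
    p = q = 0
    while p < len(cur) or q < len(center_aln):
        cur_gap = p < len(cur) and cur[p] == "-"
        new_gap = q < len(center_aln) and center_aln[q] == "-"
        if cur_gap and not new_gap:      # old space in the center: new row gets a space
            for r, row in enumerate(rows):
                merged[r].append(row[p])
            new.append("-"); p += 1
        elif new_gap and not cur_gap:    # new space in the center: pad all old rows
            for col in merged:
                col.append("-")
            new.append(seq_aln[q]); q += 1
        else:                            # same center character, or a shared space
            for r, row in enumerate(rows):
                merged[r].append(row[p])
            new.append(seq_aln[q]); p += 1; q += 1
    return ["".join(m) for m in merged] + ["".join(new)]

def center_star(seqs):
    """Return (index of the center, aligned rows in input order)."""
    k = len(seqs)
    cost = [[align_pair(a, b)[0] for b in seqs] for a in seqs]
    c = min(range(k), key=lambda i: sum(cost[i]))   # step (a)
    rows, order = [seqs[c]], [c]
    for i in range(k):                              # step (b)
        if i != c:
            _, center_aln, seq_aln = align_pair(seqs[c], seqs[i])
            rows = add_to_star(rows, center_aln, seq_aln)
            order.append(i)
    aligned = [None] * k
    for idx, row in zip(order, rows):
        aligned[idx] = row
    return c, aligned

# The merging example from the slide, with ACBC as the center row:
print(add_to_star(["AC-BC", "DCABC"], "AC--BC", "DCAABC"))
```

Note that under the assumed unit costs `center_star` would pick DCABC, not ACBC, as the center of these three strings; the slide's example illustrates only the merging step.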

$D(M_c)$ is at most twice $D(M_{opt})$: $D(M_c) / D(M_{opt}) \le 2(k-1)/k \ (< 2)$.
Proof:
(a) $a_M(S_i, S_j) \ge D(S_i, S_j)$ for any multiple alignment $M$ (an induced alignment is never better than the optimal alignment), and $a_{M_c}(S_c, S_j) = D(S_c, S_j)$.
(b) $a_{M_c}(S_i, S_j) \le a_{M_c}(S_i, S_c) + a_{M_c}(S_c, S_j) = D(S_i, S_c) + D(S_c, S_j)$ (follows from the triangle inequality).
(c) $2 D(M_c) = \sum_{i=1}^{k} \sum_{j \ne i} a_{M_c}(S_i, S_j) \le \sum_{i=1}^{k} \sum_{j \ne i} \big( a_{M_c}(S_i, S_c) + a_{M_c}(S_c, S_j) \big) = 2(k-1) \sum_{j \ne c} a_{M_c}(S_c, S_j) = 2(k-1) \sum_{j \ne c} D(S_c, S_j)$.

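(d) $k \sum_{j \ne c} D(S_c, S_j) = \sum_{i=1}^{k} \sum_{j \ne c} D(S_c, S_j) \le \sum_{i=1}^{k} \sum_{j \ne i} D(S_i, S_j) \le \sum_{i=1}^{k} \sum_{j \ne i} a_{M_{opt}}(S_i, S_j) = 2 D(M_{opt})$. The first inequality holds because $S_c$ minimizes the sum of distances to the other strings.
(e) From (c), $2 D(M_c) \le 2(k-1) \sum_{j \ne c} D(S_c, S_j)$, so $D(M_c)/(k-1) \le \sum_{j \ne c} D(S_c, S_j)$. From (d), $k \sum_{j \ne c} D(S_c, S_j) \le 2 D(M_{opt})$, so $\sum_{j \ne c} D(S_c, S_j) \le 2 D(M_{opt})/k$. Hence $D(M_c) / D(M_{opt}) \le 2(k-1)/k$.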

Scoring Metrics

Question: how to represent a family? Consensus sequence; profiles, HMM; signature.

Consensus Sequence
[Figure: an alignment of Seq1, Seq2 and Seq3 together with its Consensus row; cell values as extracted: a b a a b – – b a a b a.]

Consensus error: $E(S_{con}) = \sum_{i \ne con} D(S_{con}, S_i)$.
Steiner string $S^*$: $E(S^*) = \min_S E(S)$.
Approximation algorithm: $E(S_{\text{center star}}) / E(S^*) \le 2(k-1)/k$ (provided the scoring scheme satisfies the triangle inequality).
Notice: (a) $S_{con}$ is not necessarily one of the input strings. (b) Consensus error and Steiner string are defined without a multiple alignment.

Mult. Align. -> Consensus Sequence
Definitions:
(a) $M_j$: column $j$ of the multiple alignment $M$.
(b) $M_{i,j}$: the character in row $i$, column $j$ of $M$.
(c) Distance between a character $x$ and column $j$: $d(x, M_j) = \sum_i \mathrm{dist}(x, M_{i,j})$.
Given a multiple alignment $M$, the consensus character $x^*_j$ of column $j$ is such that $d(x^*_j, M_j) = \min_x d(x, M_j)$.
Consensus string $S_M$: $S_M = x^*_1 x^*_2 \dots x^*_q$.

Alignment error of $S_M$: $\sigma(S_M) = \sum_j d(x^*_j, M_j)$. The optimal consensus multiple alignment is a multiple alignment $M$ with consensus string $S^*_M$ such that $\sigma(S^*_M) = \min_{M, S_M} \sigma(S_M)$. It can be shown that the optimal consensus multiple alignment specifies the optimal Steiner string $S^*$, and vice versa: from $S^*$ we can construct the optimal consensus multiple alignment.
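For a given multiple alignment, the consensus string and its alignment error follow directly from the column-wise definitions. The sketch below assumes unit-cost distances (mismatch 1, gap 1, equal symbols 0) and takes the candidate characters from the symbols present in the alignment, including the space; the function names are hypothetical. Ties are broken by sorted symbol order, which is an arbitrary choice.

```python
def dist(x, y, mismatch=1, gap=1):
    """Assumed unit-cost distance between two symbols."""
    if x == y:
        return 0
    if x == "-" or y == "-":
        return gap
    return mismatch

def column_distance(x, column):
    """d(x, M_j): sum of distances from x to every symbol in the column."""
    return sum(dist(x, c) for c in column)

def consensus(rows):
    """Return (consensus string S_M, alignment error sigma(S_M))."""
    alphabet = sorted(set("".join(rows)))   # candidate characters, '-' included
    chars, error = [], 0
    for column in zip(*rows):
        best = min(alphabet, key=lambda x: column_distance(x, column))
        chars.append(best)
        error += column_distance(best, column)
    return "".join(chars), error

print(consensus(["aba", "ab-", "-ba"]))
```

The consensus string may itself contain spaces when a space is the closest symbol to a column.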

Conclusion: the Center Star method has approximation ratio $2(k-1)/k$ for: (a) SP multiple alignment; (b) consensus multiple alignment.

Profiles
[Figure: a profile built from the aligned sequences Seq1, Seq2, Seq3 and Seq4.]

Profile Analysis. M. Gribskov, D. Eisenberg. Profile Analysis: detection of distantly related proteins by sequence comparison. The information is expressed in a position-specific scoring table (profile).

Profile calculation The position-specific gap coefficients penalize gaps in conserved regions more heavily than gaps in more variable regions

Profile calculation. The profile entry for character $x$ at position $j$ is $p(x,j)/p(x)$ [or $\log(p(x,j)/p(x))$], where $p(x,j)$ is the frequency with which character $x$ appears in column $j$ of the multiple alignment (a row of the profile table on the previous slide) and $p(x)$ is the frequency with which $x$ appears anywhere in the sequences of the multiple alignment.

Profile alignment: sequence-profile alignment and profile-profile alignment, both by dynamic programming (the same idea as in pairwise sequence alignment).

Reminder: pairwise sequence alignment. Sequence-profile alignment: $S(x,j)$ is the score of aligning character $x$ with column $j$, $S(x,j) = \sum_y \sigma(x,y)\, p(y,j)/p(x)$, where $\sigma(x,y)$ is any regular score for pairwise alignment (PAM-k, BLOSUM-k, ...), $p(x,j)$ is the frequency with which character $x$ appears in column $j$ of the multiple alignment, and $p(x)$ is the frequency with which $x$ appears anywhere in the sequences of the multiple alignment.
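The score $S(x,j)$ can be computed directly from the frequencies. The sketch below follows the formula exactly as written on the slide; everything else is an assumption made for the example: a toy $\sigma$ (match +1, mismatch -1) stands in for PAM or BLOSUM, frequencies are taken over non-space symbols, and the function names are hypothetical. Exact fractions are used so the arithmetic is easy to check by hand.

```python
from collections import Counter
from fractions import Fraction

def column_freqs(rows):
    """p(x, j): fraction of rows that have character x in column j (spaces skipped)."""
    k = len(rows)
    return [{x: Fraction(c, k) for x, c in Counter(col).items() if x != "-"}
            for col in zip(*rows)]

def background_freqs(rows):
    """p(x): fraction of all non-space symbols in the alignment equal to x."""
    counts = Counter(ch for row in rows for ch in row if ch != "-")
    total = sum(counts.values())
    return {x: Fraction(c, total) for x, c in counts.items()}

def sigma(x, y):
    """Toy pairwise score (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1

def profile_score(x, j, col_p, bg_p):
    """S(x, j) = sum_y sigma(x, y) * p(y, j) / p(x), as written on the slide.

    x must occur somewhere in the alignment, otherwise p(x) is undefined here.
    """
    return sum(sigma(x, y) * p for y, p in col_p[j].items()) / bg_p[x]

rows = ["AC", "AC", "AG"]
col_p, bg_p = column_freqs(rows), background_freqs(rows)
print(profile_score("A", 0, col_p, bg_p))
print(profile_score("C", 1, col_p, bg_p))
```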

Profiles in GCG
PileUp creates a multiple sequence alignment from a group of related sequences.
ProfileMake makes a profile from a multiple sequence alignment.
ProfileSearch uses the profile to search a database for sequences with similarity to the group of aligned sequences.
ProfileSegments displays optimal alignments between each sequence in the ProfileSearch output list and the group of aligned sequences (represented by the profile consensus).
ProfileGap makes optimal alignments between one or more sequences and a group of aligned sequences represented as a profile.
ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences.

Iterative pairwise alignment
1. Align some pair.
2. While (not done):
(a) Pick an unaligned string which is "near" some aligned one(s).
(b) Align it with the profile of the previously aligned group. Resulting new spaces are inserted into all strings in the group.

Progressive Alignment: Feng-Doolittle (1987), implemented in PileUp (GCG package)
1. Calculate the pairwise alignment scores, and convert them to distances.
2. Use an incremental clustering algorithm to construct a tree from the distances.
3. Traverse the nodes in their order of addition to the tree, progressively aligning the sequences. This way, the most similar pair is aligned first, followed by the addition of the next most similar sequence or set of sequences.
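Steps 1 and 2, and the traversal order used in step 3, can be sketched as follows. This is not the Feng-Doolittle or PileUp procedure itself: plain edit distance stands in for "alignment scores converted to distances", and average-linkage agglomerative clustering stands in for the incremental clustering algorithm. Both substitutions, and the function names, are assumptions made to keep the example short. The returned list gives the order in which groups would be aligned.

```python
from itertools import combinations

def edit_distance(s, t):
    """Unit-cost edit distance, used here as a stand-in pairwise distance."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def guide_order(seqs):
    """Return the merges, in order, produced by average-linkage clustering.

    Each merge is a nested tuple of sequence indices, e.g. ((0, 1), 2).
    """
    dist = {(i, j): edit_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}

    def average(c1, c2):
        pairs = [(min(i, j), max(i, j)) for i in c1[0] for j in c2[0]]
        return sum(dist[p] for p in pairs) / len(pairs)

    # Each cluster is (member indices, tree built so far).
    clusters = [((i,), i) for i in range(len(seqs))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: average(*ab))
        clusters.remove(a)
        clusters.remove(b)
        merged = (a[0] + b[0], (a[1], b[1]))
        clusters.append(merged)
        merges.append(merged[1])   # most similar groups are merged first
    return merges

print(guide_order(["ACGT", "ACGA", "TTTTTT", "TTTTAA"]))
```

In a full progressive aligner each merge would trigger a sequence-sequence, sequence-profile or profile-profile alignment of the two groups being joined.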

Progressive Alignment: ClustalW (algorithm of Thompson, Higgins, Gibson 1994)
1. Calculate the pairwise alignment scores, and convert them to distances.
2. Use a neighbor-joining algorithm to build a tree from the distances.
3. Align sequence-sequence, sequence-profile, profile-profile in decreasing similarity order.

Alignment tree built by ClustalW

Profile HMMs
$M_i$: main state, models column $i$ of the multiple alignment.
$I_i$: insert state.
$D_i$: delete state.
Aligning a string $S$ versus a profile:
(a) From the profile build an HMM.
(b) Calculate the likelihood of $S$ against the HMM.
