Sequence Similarity

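The Viterbi algorithm for alignment
Compute the following matrices by dynamic programming (DP):
- $M(i, j)$: most likely alignment of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ ending in state M
- $I(i, j)$: most likely alignment of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ ending in state I
- $J(i, j)$: most likely alignment of $x_1 \ldots x_i$ with $y_1 \ldots y_j$ ending in state J
$$M(i, j) = \log P(x_i, y_j) + \max\{\, M(i-1, j-1) + \log(1 - 2\delta),\; I(i-1, j-1) + \log(1 - \varepsilon),\; J(i-1, j-1) + \log(1 - \varepsilon) \,\}$$
$$I(i, j) = \max\{\, M(i-1, j) + \log \delta,\; I(i-1, j) + \log \varepsilon \,\}$$
$$J(i, j) = \max\{\, M(i, j-1) + \log \delta,\; J(i, j-1) + \log \varepsilon \,\} \quad \text{(by symmetry with } I\text{)}$$
Here $\delta$ is the probability of opening a gap and $\varepsilon$ the probability of extending one.
[Figure: pair-HMM state diagram. State M emits $P(x_i, y_j)$, state I emits $P(x_i)$, state J emits $P(y_j)$. Transition scores: M to M, $\log(1 - 2\delta)$; M to I and M to J, $\log \delta$; I to I and J to J, $\log \varepsilon$; I to M and J to M, $\log(1 - \varepsilon)$.]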
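The three recurrences can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the names `pair_hmm_viterbi`, `toy_log_emit`, `delta`, and `epsilon` and all numeric values are assumptions, the gap states carry no emission term (as in the recurrences on the slide), and all three predecessors of M are read from cell (i-1, j-1), as the "Putting it all together" slide describes.

```python
import math

NEG_INF = float("-inf")


def toy_log_emit(a, b):
    """Hypothetical emission log-probabilities for state M (illustrative values)."""
    return math.log(0.2) if a == b else math.log(0.0125)


def pair_hmm_viterbi(x, y, log_emit, delta=0.2, epsilon=0.1):
    """Log-score of the most likely alignment of x and y under a 3-state pair HMM."""
    m, n = len(x), len(y)
    log_stay_m = math.log(1 - 2 * delta)  # M -> M
    log_close = math.log(1 - epsilon)     # I -> M, J -> M
    log_open = math.log(delta)            # M -> I, M -> J
    log_extend = math.log(epsilon)        # I -> I, J -> J

    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    J = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0  # assumption: the path starts in M with nothing emitted

    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                # M consumes x_i and y_j, so all predecessors sit at (i-1, j-1).
                M[i][j] = log_emit(x[i - 1], y[j - 1]) + max(
                    M[i - 1][j - 1] + log_stay_m,
                    I[i - 1][j - 1] + log_close,
                    J[i - 1][j - 1] + log_close,
                )
            if i > 0:
                # I consumes x_i only.
                I[i][j] = max(M[i - 1][j] + log_open, I[i - 1][j] + log_extend)
            if j > 0:
                # J consumes y_j only.
                J[i][j] = max(M[i][j - 1] + log_open, J[i][j - 1] + log_extend)

    return max(M[m][n], I[m][n], J[m][n])
```

Keeping back-pointers alongside each `max` would recover the alignment itself; the sketch returns only the score to stay short.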

One way to view the state paths: states M, I, and J
[Figure: three grids, one each for state M, state I, and state J, with axes labeled $x_1 \ldots x_m$ and $y_1 \ldots y_n$.]

Putting it all together
- States $I(i, j)$ are connected with states I and M at $(i-1, j)$
- States $J(i, j)$ are connected with states J and M at $(i, j-1)$
- States $M(i, j)$ are connected with states M, I, and J at $(i-1, j-1)$
- The optimal solution is the best-scoring path from the top-left to the bottom-right corner
- This gives the likeliest alignment according to our HMM
[Figure: the three grids stacked, with axes labeled $x_1 \ldots x_m$ and $y_1 \ldots y_n$.]

Yet another way to represent this model
[Figure: chain of states from BEGIN to END along sequence X, with match states $Mx_1 \ldots Mx_m$ and insert states $I_x$ and $I_y$.]
- We are aligning, or threading, sequence Y through sequence X
- Every time $y_j$ lands in state $x_i$, we get substitution score $s(x_i, y_j)$
- Every time $y_j$ is gapped, or some $x_i$ is skipped, we pay a gap penalty

From this model, we can compute additional statistics
- $P(x_i \sim y_j \mid x, y)$: the probability that positions i and j align, given that sequences x and y align
$$P(x_i \sim y_j \mid x, y) = \sum_{\alpha} P(\alpha \mid x, y)\, \mathbf{1}(x_i \sim y_j \text{ in } \alpha)$$
where the sum runs over all alignments $\alpha$ of x and y
- We will not cover the details, but this quantity can also be calculated with DP
[Figure: pair-HMM state diagram with states M, I, and J emitting $P(x_i, y_j)$, $P(x_i)$, and $P(y_j)$.]
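For readers who want the shape of that DP, one standard formulation uses the forward and backward algorithms of the pair HMM. The symbols $f^M$ and $b^M$ below are assumptions introduced here and do not appear in the slides: $f^M(i, j)$ is the total probability of all partial alignments of $x_1 \ldots x_i$ and $y_1 \ldots y_j$ that end with $x_i$ matched to $y_j$, and $b^M(i, j)$ is the probability of generating the remaining suffixes given that match.

```latex
P(x_i \sim y_j \mid x, y) \;=\; \frac{f^M(i, j)\; b^M(i, j)}{P(x, y)},
\qquad
P(x, y) \;=\; \sum_{\alpha} P(\alpha, x, y)
```

Both tables are filled with the same recurrences as Viterbi, with each max replaced by a sum over probabilities, so the cost stays quadratic in the sequence length.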

Fast database search: BLAST (Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running time: O(MN); however, orders of magnitude faster than Smith-Waterman in practice
[Figure: query words matched against the DB.]

BLAST: Original Version
- Dictionary: all words of length k (~11 for nucleotides; ~4 for amino acids)
- Alignment initiated between words of alignment score ≥ T (typically T = k)
- Alignment: ungapped extensions until the score falls below a statistical threshold
- Output: all local alignments with score > statistical threshold
[Figure: the DB is scanned for words from the query.]
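The dictionary-and-scan step can be sketched as follows. This is a simplified illustration that seeds only on exact word matches; the real program also accepts similar words that score at least T, and the function names `build_word_index` and `find_seeds` are hypothetical.

```python
from collections import defaultdict


def build_word_index(query, k):
    """Map every length-k word of the query to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(query) - k + 1):
        index[query[pos:pos + k]].append(pos)
    return index


def find_seeds(query, db, k):
    """Scan the database once and report (query_pos, db_pos) for each exact word hit."""
    index = build_word_index(query, k)
    seeds = []
    for db_pos in range(len(db) - k + 1):
        word = db[db_pos:db_pos + k]
        # .get avoids inserting empty entries into the defaultdict during the scan.
        for query_pos in index.get(word, ()):
            seeds.append((query_pos, db_pos))
    return seeds
```

Each seed would then be extended without gaps in both directions until the score drops below the threshold; that step is omitted here.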

PSI-BLAST
Given a sequence query x and a database D:
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct a position-specific matrix M (the profile). Each sequence y is given a weight so that many similar sequences cannot have much influence on a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence

BLAST Variants
- BLASTN: genomic sequences
- BLASTP: proteins
- BLASTX: translated genome versus proteins
- TBLASTN: proteins versus translated genomes
- TBLASTX: translated genome versus translated genome
- PSIBLAST: iterated BLAST search

Multiple Sequence Alignments

Protein Phylogenies
Proteins evolve by both duplication and species divergence

Definition
Given N sequences $x_1, x_2, \ldots, x_N$:
- Insert gaps (-) in each sequence $x_i$ such that all sequences have the same length L and the score of the global map is maximum
- A faint similarity between two sequences becomes significant if present in many
- Multiple alignments can help improve the pairwise alignments

Scoring Function: Sum Of Pairs
Definition: induced pairwise alignment. A pairwise alignment induced by the multiple alignment; columns in which both sequences have a gap are dropped.
Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
Induces:
  x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
  y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
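The induced alignments in this example, and a sum-of-pairs score built from them, can be reproduced with a short sketch. The scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen for illustration; the slides do not fix a scoring scheme.

```python
from itertools import combinations


def induced_pair(a, b):
    """Project two rows of a multiple alignment, dropping columns where both have a gap."""
    kept = [(p, q) for p, q in zip(a, b) if not (p == "-" and q == "-")]
    return "".join(p for p, _ in kept), "".join(q for _, q in kept)


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pairwise alignment column by column (illustrative linear gap cost)."""
    score = 0
    for p, q in zip(a, b):
        if p == "-" or q == "-":
            score += gap
        elif p == q:
            score += match
        else:
            score += mismatch
    return score


def sum_of_pairs(rows):
    """Sum the scores of all induced pairwise alignments."""
    return sum(pair_score(*induced_pair(a, b)) for a, b in combinations(rows, 2))
```

With the three rows from the slide, the (x, y), (x, z), and (y, z) projections score 0, -4, and 3 under these assumed values, for a total of -1.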

Sum Of Pairs (cont’d)
Heuristic way to incorporate the evolutionary tree:
[Figure: tree with leaves Human, Mouse, Chicken, and Duck.]
Weighted SOP: $S(m) = \sum_{k<l} w_{kl}\, s(m^k, m^l)$
$w_{kl}$: weight decreasing with distance

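A Profile Representation
Given a multiple alignment $M = m_1 \ldots m_n$:
- Replace each column $m_i$ with profile entry $p_i$
  - Frequency of each letter in $\Sigma$
  - Number of gaps
  - Optional: number of gap openings, extensions, closings
- Can think of this as a “likelihood” of each letter in each position
[Figure: example multiple alignment of five sequences with its profile table, one row per symbol A, C, G, T, and gap. Alignment rows as they survive in the transcript (the second and third rows are shorter than the others, so some gap characters appear to have been lost):
  - A G G C T A T C A C C T G
  T A G - C T A C C A G
  C A G - C T A C C A G
  C A G - C T A T C A C - G G
  C A G - C T A T C G C - G G]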
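Turning alignment columns into profile entries takes only a few lines. The sketch below assumes a DNA alphabet plus the gap symbol and treats the gap as one more letter; the name `build_profile` and the small input used to exercise it are hypothetical, not the alignment from the slide.

```python
from collections import Counter

ALPHABET = "ACGT-"  # assumption: DNA letters plus the gap symbol


def build_profile(rows):
    """Return one dict per column giving the frequency of each symbol in that column."""
    if not rows:
        raise ValueError("need at least one aligned sequence")
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows of a multiple alignment must have the same length")
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({symbol: counts[symbol] / len(rows) for symbol in ALPHABET})
    return profile
```

For the four rows `AC-`, `AG-`, `AC-`, `TCG`, the first column comes out as A: 0.75, T: 0.25 and the last as gap: 0.75, G: 0.25.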

Multiple Sequence Alignments Algorithms

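Multidimensional DP
Generalization of Needleman-Wunsch:
- $S(m) = \sum_i S(m_i)$ (sum of column scores)
- $F(i_1, i_2, \ldots, i_N)$: score of the optimal alignment up to $(i_1, \ldots, i_N)$
- $F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{nbr}) \big)$, where the max runs over all neighbors of the cell in the N-dimensional cube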

Multidimensional DP
Example: in 3D (three sequences), 7 neighbors per cell
$$F(i, j, k) = \max \begin{cases}
F(i-1, j-1, k-1) + S(x_i, x_j, x_k) \\
F(i-1, j-1, k) + S(x_i, x_j, -) \\
F(i-1, j, k-1) + S(x_i, -, x_k) \\
F(i-1, j, k) + S(x_i, -, -) \\
F(i, j-1, k-1) + S(-, x_j, x_k) \\
F(i, j-1, k) + S(-, x_j, -) \\
F(i, j, k-1) + S(-, -, x_k)
\end{cases}$$
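The seven-neighbor recurrence translates directly into a triple loop. The sketch below is an illustration under assumed scoring: each column is scored by sum of pairs with match +1, mismatch -1, gap -2, and gap-gap pairs scoring 0. The names `neighbor_moves`, `column_score`, and `align3_score` are hypothetical.

```python
from itertools import product


def neighbor_moves(n):
    """All 2^n - 1 ways to advance in at least one of n sequences."""
    return [move for move in product((0, 1), repeat=n) if any(move)]


def column_score(a, b, c, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column; a gap paired with a gap scores 0 (assumption)."""
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue
        if p == "-" or q == "-":
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total


def align3_score(x, y, z):
    """Optimal three-way alignment score via the 7-neighbor recurrence."""
    moves = neighbor_moves(3)
    F = {(0, 0, 0): 0}
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(z) + 1):
                if (i, j, k) == (0, 0, 0):
                    continue
                best = None
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # this neighbor falls outside the cube
                    column = (
                        x[i - 1] if di else "-",
                        y[j - 1] if dj else "-",
                        z[k - 1] if dk else "-",
                    )
                    candidate = F[(pi, pj, pk)] + column_score(*column)
                    if best is None or candidate > best:
                        best = candidate
                F[(i, j, k)] = best
    return F[(len(x), len(y), len(z))]
```

The table has one entry per cell of the cube and each cell inspects up to seven neighbors, which already makes this impractical beyond short sequences.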

Multidimensional DP
Running time:
1. Size of matrix: $L^N$, where L = length of each sequence and N = number of sequences
2. Neighbors per cell: $2^N - 1$
Therefore: $O(2^N L^N)$
How do gap states generalize? VERY badly!
- Require $2^N$ states, one per combination of gapped/ungapped sequences
- Running time: $O(2^N \times 2^N \times L^N) = O(4^N L^N)$
[Figure: the states for three sequences X, Y, Z, labeled by which sequences are ungapped: XYZ, XY, XZ, YZ, X, Y, Z.]

Progressive Alignment
When the evolutionary tree is known:
- Align closest first, in the order of the tree
- In each step, align two sequences x, y, or profiles $p_x$, $p_y$, to generate a new alignment with associated profile $p_{\text{result}}$
Weighted version:
- Tree edges have weights, proportional to the divergence in that edge
- New profile is a weighted average of two old profiles
[Figure: tree over sequences x, y, z, w with internal profiles $p_{xy}$, $p_{zw}$, and $p_{xyzw}$.]
Example profile, with entries ordered (A, C, G, T, -):
  $p_x = (0.8, 0.2, 0, 0, 0)$
  $p_y = (0.6, 0, 0, 0, 0.4)$
  $s(p_x, p_y) = 0.8 \cdot 0.6 \cdot s(A, A) + 0.2 \cdot 0.6 \cdot s(C, A) + 0.8 \cdot 0.4 \cdot s(A, -) + 0.2 \cdot 0.4 \cdot s(C, -)$
  Result: $p_{xy} = (0.7, 0.1, 0, 0, 0.2)$
  $s(p_x, -) = 0.8 \cdot 1.0 \cdot s(A, -) + 0.2 \cdot 1.0 \cdot s(C, -)$
  Result: $p_{x-} = (0.4, 0.1, 0, 0, 0.5)$
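The profile arithmetic in this example can be checked with a short sketch. The substitution function `toy_s` (match +1, mismatch -1, gap -2, gap against gap 0) is an assumption made only to obtain concrete numbers; the slide leaves s unspecified. The helper names are likewise hypothetical.

```python
ALPHABET = "ACGT-"  # entry order used on the slide: (A, C, G, T, -)


def toy_s(a, b):
    """Hypothetical substitution score; the slide leaves s(., .) unspecified."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def profile_column_score(p, q, s=toy_s):
    """Expected score of aligning two profile columns: sum of p[a] * q[b] * s(a, b)."""
    return sum(
        pa * qb * s(a, b)
        for a, pa in zip(ALPHABET, p)
        for b, qb in zip(ALPHABET, q)
    )


def merge_columns(p, q, weight=0.5):
    """Weighted average of two profile columns (equal weights by default)."""
    return tuple(weight * pa + (1 - weight) * qb for pa, qb in zip(p, q))


p_x = (0.8, 0.2, 0.0, 0.0, 0.0)
p_y = (0.6, 0.0, 0.0, 0.0, 0.4)
gap_column = (0.0, 0.0, 0.0, 0.0, 1.0)
```

With equal weights, `merge_columns(p_x, p_y)` gives the slide's $p_{xy}$ and `merge_columns(p_x, gap_column)` gives $p_{x-}$; passing a different `weight` corresponds to the weighted version driven by tree edge lengths.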

Progressive Alignment
When the evolutionary tree is unknown:
- Perform all pairwise alignments
- Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on pairwise alignment
- Construct a tree
- Align on the tree
[Figure: sequences x, y, z, w with an unknown tree.]

Heuristics to improve alignments
- Iterative refinement schemes
- A*-based search
- Consistency
- Simulated annealing
- …

Iterative Refinement
One problem of progressive alignment: initial alignments are “frozen” even when new evidence comes
Example:
  x: GAAGTT
  y: GAC-TT   (frozen!)
  z: GAACTG
  w: GTACTG
Now it is clear that the correct alignment is y = GA-CTT

Iterative Refinement
Algorithm (Barton-Sternberg):
1. For j = 1 to N, remove $x_j$ and realign it to $x_1 \ldots x_{j-1} x_{j+1} \ldots x_N$
2. Repeat step 1 until convergence
[Figure: x and z are held fixed as a projection while y is allowed to vary.]

Iterative Refinement
Example: align (x, y), (z, w), (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
After realigning y:
  x: GAAGTTA
  y: G-ACTTA   (+ 3 matches)
  z: GAACTGA
  w: GTACTGA
Variant: refinement on a tree (“tree partitioning”)

Iterative Refinement
Example not handled well:
  x:   GAAGTTA
  y_1: GAC-TTA
  y_2: GAC-TTA
  y_3: GAC-TTA
  z:   GAACTGA
  w:   GTACTGA
Realigning any single $y_i$ changes nothing

Some Resources
- BLAST & PSI-BLAST
- CLUSTALW: most widely used
- MUSCLE: most scalable
- PROBCONS: most accurate

MUSCLE at a glance
1. Fast measurement of all pairwise distances between sequences: $D_{\text{DRAFT}}(x, y)$ defined in terms of the number of common k-mers (k ~ 3), in $O(N^2 L \log L)$ time
2. Build tree $T_{\text{DRAFT}}$ based on $D_{\text{DRAFT}}$, with a hierarchical clustering method (UPGMA)
3. Progressive alignment over $T_{\text{DRAFT}}$, resulting in multiple alignment $M_{\text{DRAFT}}$
4. Measure distances D(x, y) based on $M_{\text{DRAFT}}$
5. Build tree T based on D
6. Progressive alignment over T, to build M
7. Iterative refinement; for many rounds, do:
   - Tree partitioning: split M on one branch and realign the two resulting profiles
   - If the new alignment M’ has a better sum-of-pairs score than the previous one, accept it
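Step 1 avoids alignment entirely by comparing k-mer content. The sketch below shows one simple way to turn shared k-mers into a distance; the normalization is an assumption made for illustration and is not claimed to be the formula MUSCLE uses, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(x, y, k=3):
    """1 minus the fraction of k-mers shared, relative to the shorter sequence."""
    denominator = min(len(x), len(y)) - k + 1
    if denominator <= 0:
        raise ValueError("both sequences must contain at least one k-mer")
    # Counter intersection keeps the minimum count of each k-mer.
    shared = sum((kmer_counts(x, k) & kmer_counts(y, k)).values())
    return 1.0 - shared / denominator


def distance_matrix(seqs, k=3):
    """All pairwise k-mer distances, as a symmetric list of lists."""
    n = len(seqs)
    D = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            D[a][b] = D[b][a] = kmer_distance(seqs[a], seqs[b], k)
    return D
```

A matrix like this would then feed a clustering method such as UPGMA to produce the draft tree.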

PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins
[Figure: pair-HMM with three states. MATCH emits $x_i$ aligned to $y_j$; INSERT X emits $x_i$ against a gap; INSERT Y emits $y_j$ against a gap.]

A pair-HMM model of pairwise alignment
- Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences
- Transition probabilities ~ gap penalties
- Emission probabilities ~ substitution matrix
Example alignment:
  x: ABRACA-DABRA
  y: AB-ACARDI---
[Figure: pair-HMM with states MATCH, INSERT X, and INSERT Y.]

Computing Pairwise Alignments
- The conditional distribution $P(\alpha \mid x, y)$ reflects the model’s uncertainty over the “correct” alignment of x and y
- The Viterbi algorithm identifies the highest-probability alignment, $\alpha_{\text{viterbi}}$, in $O(L^2)$ time
- Caveat: the most likely alignment is not necessarily the most accurate
- Alternative: find the alignment of maximum expected accuracy
[Figure: the distribution $P(\alpha)$ narrowed to $P(\alpha \mid x, y)$, with $\alpha_{\text{viterbi}}$ marked.]

The Lazy-Teacher Analogy
10 students take a 10-question true-false quiz. How do you make the answer key?
- Approach #1: Use the answer sheet of the best student!
- Approach #2: Weighted majority vote!
[Figure: answer sheets graded from A to C, with the students’ answers to question 4 shown as a mix of T and F.]

Viterbi vs. Maximum Expected Accuracy (MEA)
Viterbi:
- Picks the single alignment with the highest chance of being completely correct
- Mathematically, finds the alignment $\alpha$ that maximizes $E_{\alpha^*}[\mathbf{1}\{\alpha = \alpha^*\}]$
Maximum Expected Accuracy:
- Picks the alignment with the highest expected number of correct predictions
- Mathematically, finds the alignment $\alpha$ that maximizes $E_{\alpha^*}[\text{accuracy}(\alpha, \alpha^*)]$
[Figure: the graded answer sheets from the lazy-teacher analogy, with the majority answer to question 4 highlighted.]
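The contrast is easy to see in the quiz setting. In the sketch below, each answer sheet stands in for a candidate alignment and each weight for its posterior probability; the data, the weights, and the function names are all hypothetical and chosen only so that the two approaches disagree.

```python
def best_student_key(sheets, weights):
    """Approach #1 (Viterbi-like): copy the single highest-weighted answer sheet."""
    best = max(range(len(sheets)), key=lambda idx: weights[idx])
    return list(sheets[best])


def weighted_vote_key(sheets, weights):
    """Approach #2 (MEA-like): decide each question by weighted majority vote."""
    total = sum(weights)
    key = []
    for question in range(len(sheets[0])):
        weight_true = sum(w for sheet, w in zip(sheets, weights) if sheet[question])
        key.append(weight_true > total - weight_true)
    return key


# Three hypothetical students answering three true/false questions.
sheets = [
    [True, True, False],
    [False, True, True],
    [False, False, True],
]
weights = [4, 3, 3]  # integer weights keep the comparisons exact
```

Here the best student answers True, True, False, while the weighted vote gives False, True, True: no single sheet matches the vote, just as the MEA alignment need not coincide with any one high-probability alignment.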