

FASTA, BLAST, Probabilities

2 Reminder
- In the last classes we discussed dynamic programming algorithms for:
  - global alignment
  - local alignment
  - multiple alignment
- All of these assumed a scoring rule that determines the quality of perfect matches, substitutions, insertions, and deletions.

3 Alignment in Real Life
- One of the major uses of alignments is to find sequences in a “database.”
- The current protein database contains about 100 million (i.e., 10^8) residues. Searching with a target sequence of length 1000 therefore requires evaluating about 10^11 matrix cells, which takes about three hours at a rate of 10 million evaluations per second.
- This becomes impractical when, say, one thousand target sequences need to be searched: the run would take about four months.
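The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# Back-of-the-envelope cost of exhaustive DP search (figures from the slide).
db_residues = 10**8        # residues in the protein database
query_len = 1000           # length of one target sequence
cells_per_sec = 10**7      # DP cell evaluations per second

cells = db_residues * query_len                        # matrix cells per query
hours_one = cells / cells_per_sec / 3600               # one query, in hours
days_thousand = 1000 * cells / cells_per_sec / 86400   # 1000 queries, in days

print(f"{cells:.0e} cells, {hours_one:.1f} h per query, "
      f"{days_thousand:.0f} days for 1000 queries")
```

One query costs a little under three hours and a thousand queries a little under four months, which matches the rounded figures on the slide.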

4 Heuristic Fast Search
- Instead, most searches rely on heuristic procedures.
- These are not guaranteed to find the best match.
- Sometimes they will completely miss a high-scoring match.
We now describe the main ideas used by the best known of these heuristic procedures.

5 Basic Intuition
- Almost all heuristic search procedures are based on the observation that real-life matches often contain long stretches of gap-less matches.
- These heuristics try to find significant gap-less matches and then extend them.

6 Banded DP
- Suppose that we have two strings s[1..n] and t[1..m] such that n ≈ m.
- If the optimal alignment of s and t has few gaps, then the path of the alignment will be close to the main diagonal.
[Figure: alignment path close to the diagonal of the s-by-t matrix.]

7 Banded DP
- To find such a path, it suffices to search in a diagonal region of the matrix.
- If the diagonal band has width k, then the dynamic programming step takes O(kn) time.
- This is much faster than the O(n^2) time of standard DP.
[Figure: the s-by-t matrix with a diagonal band of width k; labeled cells V[i, i+k/2], V[i, i+k/2+1], and V[i+1, i+k/2+1], with an "out of range" marker for the cell outside the band.]
Note that i-j is constant along each diagonal.
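A minimal sketch of banded DP for the global alignment score, assuming a simple match/mismatch/linear-gap scoring rule (the default values and the function name `banded_global_score` are illustrative, not taken from the lecture). Only cells with |i - j| <= k/2 are filled, so each row costs O(k) work.

```python
def banded_global_score(s, t, k, match=1, mismatch=-1, gap=-2):
    """Global alignment score using only cells with |i - j| <= k // 2."""
    n, m = len(s), len(t)
    half = k // 2
    if abs(n - m) > half:
        raise ValueError("band too narrow to reach the corner cell")
    NEG = float("-inf")
    # Each row is a dict {j: V[i][j]} holding only the in-band columns.
    prev = {j: j * gap for j in range(0, min(m, half) + 1)}
    for i in range(1, n + 1):
        cur = {}
        lo, hi = max(0, i - half), min(m, i + half)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i * gap          # first column: i leading gaps
                continue
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(
                prev.get(j - 1, NEG) + sub,   # diagonal move
                prev.get(j, NEG) + gap,       # vertical move (out of band -> -inf)
                cur.get(j - 1, NEG) + gap,    # horizontal move (out of band -> -inf)
            )
        prev = cur
    return prev[m]

print(banded_global_score("ACGT", "AGT", 2))
```

Cells outside the band are treated as minus infinity, which is the role of the "out of range" cell in the figure. If the true optimal path leaves the band, the returned score is only a lower bound on the optimum.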

8 Banded DP for local alignment
Problem: where is the banded diagonal? It need not be the main diagonal when looking for a good local alignment. How do we select which subsequences to align using banded DP?
[Figure: a diagonal band of width k in the s-by-t matrix.]
We heuristically find potential diagonals and evaluate them using banded DP. This is the main idea of FASTA.

9 Finding Potential Diagonals
Suppose that we have a relatively long gap-less match:
AGCGCCATGGATTGAGCGA
TGCGACATTGATCGACCTA
- Can we find “clues” that will let us find it quickly?
- Each such clue defines a potential diagonal (which is then evaluated using banded DP).

10 Signature of a Match
Assumption: good matches contain several “patches” of perfect matches:
AGCGCCATGGATTGAGCTA
TGCGACATTGATCGACCTA
Since this is a gap-less alignment, all perfect-match regions should lie on one diagonal.

11 FASTA: finding ungapped matches
Input: strings s and t, and a parameter ktup.
- Find all pairs (i,j) such that s[i..i+ktup]=t[j..j+ktup].
- Locate sets of pairs that are on the same diagonal by sorting according to the difference i-j.
- Compute the score for the diagonal that contains all these pairs.

12 FASTA: finding ungapped matches
Input: strings s and t, and a parameter ktup.
- Find all pairs (i,j) such that s[i..i+ktup]=t[j..j+ktup].
  - Step one: prepare an index of the database and the query sequence such that, given a sequence of length ktup, one gets the list of its positions (linear time).
  - Step two: for each ktup-long word of the query, add 1 to the diagonal (i-j) on which it appears. Then find contiguous (possibly with mismatches) ktup matches along diagonals.
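Steps one and two can be sketched as follows. This is a simplified illustration, not the FASTA implementation: it indexes only t, uses 0-based positions, takes words of exactly ktup letters, and the function name `ktup_diagonal_hits` is hypothetical.

```python
from collections import defaultdict

def ktup_diagonal_hits(s, t, ktup):
    """Count exact ktup-letter word matches per diagonal d = i - j."""
    # Step one: index every ktup-long word of t by its start positions.
    index = defaultdict(list)
    for j in range(len(t) - ktup + 1):
        index[t[j:j + ktup]].append(j)
    # Step two: for each word of s, add 1 on the diagonal of every match.
    diag = defaultdict(int)
    for i in range(len(s) - ktup + 1):
        for j in index.get(s[i:i + ktup], ()):
            diag[i - j] += 1
    return dict(diag)

s = "AGCGCCATGGATTGAGCGA"
t = "TGCGACATTGATCGACCTA"
hits = ktup_diagonal_hits(s, t, 3)
print(sorted(hits.items()))
```

The result maps each diagonal to its number of word hits. Hit counts alone are only clues: a diagonal with many hits still has to be rescored, which is what the later FASTA steps do.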

13 FASTA: using banded DP
Step 3:
- Select the ten highest-scoring contiguous segments.
- Try to score all combinations of these ten segments in order to form a path through the matrix.
Step 4:
- Run banded DP on the region containing the best-scoring path (say, with width 12).
Hence, the algorithm may combine some diagonals into gapped matches (in the figure, diagonals 2 and 3 are combined).
[Figure: the s-by-t matrix with three diagonal segments labeled 1, 2, and 3.]

14 FASTA: practical choices
Some implementation choices and tricks are not described here.
Most applications of FASTA use a very small ktup (1-2 for proteins, and 4-6 for DNA). Higher values are faster, since they yield fewer diagonals to search around, but they increase the chance of missing the optimal local alignment.

15 FASTA: summary
Input: strings s and t, and a parameter ktup = 1, 2, 4, 5, or 6 depending on the application.
Output: a high-scoring local alignment.
1. Find pairs of matching substrings s[i..i+ktup]=t[j..j+ktup].
2. Extend to ungapped diagonals.
3. Extend to gapped matches using banded DP.

16 BLAST Overview (Basic Local Alignment Search Tool)
Input: strings s and t, and a parameter T (a threshold value).
Output: a high-scoring local alignment.
Definition: two strings s and t of length k are a high scoring pair (HSP) if d(s,t) > T (usually only un-gapped alignments are considered).
1. Find high scoring pairs of substrings such that d(s,t) > T. These words serve as seeds for finding longer matches.
2. Extend to ungapped diagonals (as in FASTA).
3. Extend to gapped matches.

17 BLAST Overview (cont.)
Step 1: find high scoring pairs of substrings such that d(s,t) > T (the seeds):
- Find all strings of length k that score at least T with substrings of s in a gapless alignment (k = 4 for proteins, 11 for DNA). (Note: possibly not all k-words must be tested, e.g. when such a word scores less than T with itself.)
- Find in t all exact matches with each of the above strings.
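The seeding step can be illustrated with a brute-force sketch. The scoring rule (match 5, mismatch -4), the DNA alphabet, and the function names are assumptions made for the example. Real BLAST uses a substitution matrix for proteins and far more efficient data structures than enumerating every k-word.

```python
from itertools import product

def word_score(u, v, match=5, mismatch=-4):
    """Gapless score of two equal-length words (illustrative scoring rule)."""
    return sum(match if a == b else mismatch for a, b in zip(u, v))

def blast_seeds(s, t, k, T, alphabet="ACGT"):
    """Return (i, j) pairs where t[j:j+k] scores at least T against s[i:i+k]."""
    # Neighborhood: every k-word w with word_score(w, s[i:i+k]) >= T for some i.
    neighborhood = {}
    for i in range(len(s) - k + 1):
        q = s[i:i + k]
        for w in map("".join, product(alphabet, repeat=k)):
            if word_score(w, q) >= T:
                neighborhood.setdefault(w, []).append(i)
    # Scan t for exact occurrences of neighborhood words.
    seeds = []
    for j in range(len(t) - k + 1):
        for i in neighborhood.get(t[j:j + k], ()):
            seeds.append((i, j))
    return seeds

print(blast_seeds("ACGT", "ACCT", k=4, T=11))
```

Lowering T enlarges the neighborhood and therefore finds more seeds at a higher cost; raising T does the opposite.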

18 Extending Potential Matches
Once a seed is found, BLAST attempts to find a local alignment that extends the seed. Seeds on the same diagonal are combined (as in FASTA), then extended as far as possible in a greedy manner without gaps. During the extension phase, the search stops when the score drops below some lower bound computed by BLAST (to save time). For the best ungapped alignment, run a banded Smith-Waterman (SW) alignment and assign a probabilistic score.
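A sketch of greedy ungapped extension. The stopping rule used here, giving up once the running score falls more than `x_drop` below the best score seen, is an assumption standing in for the "lower bound computed by BLAST"; the scoring values and the function name are likewise illustrative.

```python
def extend_ungapped(s, t, i, j, k, x_drop=10, match=5, mismatch=-4):
    """Extend the seed s[i:i+k] / t[j:j+k] along its diagonal without gaps.

    Returns (score, start_in_s, start_in_t, length) of the best extension.
    """
    def sc(a, b):
        return match if a == b else mismatch

    seed = sum(sc(a, b) for a, b in zip(s[i:i + k], t[j:j + k]))

    def one_side(pairs):
        # Walk away from the seed, remembering the best prefix score so far.
        best = run = 0
        best_len = 0
        for n, (a, b) in enumerate(pairs, start=1):
            run += sc(a, b)
            if run > best:
                best, best_len = run, n
            elif best - run > x_drop:
                break                      # score dropped too far: stop
        return best, best_len

    right, r_len = one_side(zip(s[i + k:], t[j + k:]))
    left, l_len = one_side(zip(reversed(s[:i]), reversed(t[:j])))
    return seed + left + right, i - l_len, j - l_len, l_len + k + r_len

print(extend_ungapped("AAACGTAAA", "CCACGTCCC", 3, 3, 3))
```

In the usage line, the seed CGT is extended one position to the left (the matching A) and not at all to the right.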


26 Why use probability to define and/or interpret a scoring function?
Similarity is probabilistic in nature because biological changes such as mutation, recombination, and selection are not deterministic.
We could answer questions such as:
- How probable is it that two sequences are similar?
- Is the similarity found significant or random?
- How should a similarity score change when, say, the mutation rate of a specific area on the chromosome becomes known?

27 A Probabilistic Model
- For now, we focus on alignment without indels.
- For now, we assume each position (nucleotide/amino acid) is independent of the other positions.
- We consider two options:
  - M: the sequences are Matched (related)
  - R: the sequences are Random (unrelated)

28 Unrelated Sequences
- Our random model of unrelated sequences is simple:
  - Each position is sampled independently from a distribution over the alphabet Σ.
  - We assume there is a distribution q(·) that describes the probability of letters in such positions.
- Then:
  $P(s[1..n], t[1..n] \mid R) = \prod_{i=1}^{n} q(s[i]) \prod_{i=1}^{n} q(t[i])$

29 Related Sequences
- We assume that each pair of aligned positions (s[i],t[i]) evolved from a common ancestor.
- Let p(a,b) be a distribution over pairs of letters.
- p(a,b) is the probability that some ancestral letter evolved into this particular pair of letters.
- Since positions are independent:
  $P(s[1..n], t[1..n] \mid M) = \prod_{i=1}^{n} p(s[i], t[i])$

30 Odds-Ratio Test for Alignment
$Q = \frac{P(s,t \mid M)}{P(s,t \mid R)} = \prod_{i=1}^{n} \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])}$
If Q > 1, then the two strings s and t are more likely to be related (M) than unrelated (R). If Q < 1, then the two strings s and t are more likely to be unrelated (R) than related (M).

31 Log Odds-Ratio Test for Alignment
Taking the logarithm of Q yields
$\log Q = \sum_{i=1}^{n} \log \frac{p(s[i],t[i])}{q(s[i])\, q(t[i])}$
where each summand is labeled Score(s[i],t[i]).
If log Q > 0, then s and t are more likely to be related. If log Q < 0, then they are more likely to be unrelated.
How can we relate this quantity to a score function?

32 Probabilistic Interpretation of Scores
- We define the scoring function via
  $\mathrm{Score}(a,b) = \log \frac{p(a,b)}{q(a)\, q(b)}$
- Then the score of an alignment is the log-ratio between the two models:
  - Score > 0 ⇒ the Matched model (M) is more likely
  - Score < 0 ⇒ the Random model (R) is more likely
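A small sketch of the log-odds interpretation, assuming the per-pair score is log(p(a,b) / (q(a) q(b))) and the alignment has no gaps. The two-letter alphabet and all probabilities below are made-up toy values.

```python
import math

def log_odds_matrix(p, q):
    """Score(a, b) = log(p(a,b) / (q(a) * q(b))) for every pair listed in p."""
    return {(a, b): math.log(pab / (q[a] * q[b])) for (a, b), pab in p.items()}

def alignment_score(s, t, score):
    """Sum of per-position scores of a gap-free alignment, i.e. log Q."""
    assert len(s) == len(t)
    return sum(score[(a, b)] for a, b in zip(s, t))

# Toy model over the alphabet {A, B} (hypothetical numbers).
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}
score = log_odds_matrix(p, q)

print(alignment_score("AABB", "AABB", score))   # positive: M is more likely
print(alignment_score("AABB", "BBAA", score))   # negative: R is more likely
```

Because the score is a sum of logarithms, adding up per-position scores is the same as multiplying the per-position odds ratios.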

33 Estimating Probabilities
- Suppose we are given a long string s[1..n] of letters from Σ.
- We want to estimate the distribution q(·) that generated the sequence.
- How should we go about this?
We build on the theory of parameter estimation in statistics, using either maximum likelihood estimation or the Bayesian approach.

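34 Estimating q(·)
- Suppose we are given a long string s[1..n] of letters from Σ.
  - s can be the concatenation of all sequences in our database.
- We want to estimate the distribution q(·).
  - That is, q is defined per letter.
Likelihood function:
$L(s \mid q) = \prod_{i=1}^{n} q(s[i]) = \prod_{a \in \Sigma} q(a)^{N_a}$
where $N_a$ is the number of occurrences of the letter a in s.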

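35 Estimating q(·) (cont.)
How do we define q?
Likelihood function:
$L(s \mid q) = \prod_{a \in \Sigma} q(a)^{N_a}$, where $N_a$ is the number of occurrences of a in s[1..n].
ML parameters (Maximum Likelihood):
$q(a) = N_a / n$
MAP parameters (Maximum A posteriori Probability): the q that maximizes the posterior probability of q given s, under a prior on q.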
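The maximum likelihood estimate is just the relative frequency of each letter. The sketch below also accepts an optional pseudocount; adding a pseudocount c to every letter corresponds to a MAP estimate under a symmetric Dirichlet prior with parameter c + 1. That choice of prior is an assumption for illustration, since the prior used in the lecture is not given here.

```python
from collections import Counter

def estimate_q(s, alphabet, pseudocount=0.0):
    """Estimate letter probabilities from string s.

    pseudocount=0 gives the ML estimate N_a / n; a positive value smooths
    the estimate so that unseen letters do not get probability zero.
    """
    counts = Counter(s)
    total = len(s) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

print(estimate_q("AACG", "ACGT"))                  # ML: T gets probability 0
print(estimate_q("AACG", "ACGT", pseudocount=1))   # smoothed estimate
```

The smoothed version matters in practice because a zero probability for a letter makes every log-odds score involving that letter undefined.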

36 Estimating p(·,·)
Intuition:
- Find a pair of aligned sequences s[1..n], t[1..n].
- Estimate the probability of pairs:
  $p(a,b) = N_{a,b} / n$
  where $N_{a,b}$ is the number of times a is aligned with b in (s,t).
- Again, s and t can be the concatenation of many aligned pairs from the database.
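Counting aligned pairs can be sketched in a few lines, assuming a single gap-free aligned pair and the estimate p(a,b) = N_ab / n. The function name `estimate_p` is illustrative; pairs are treated as ordered, so (a,b) and (b,a) are counted separately.

```python
from collections import Counter

def estimate_p(s, t):
    """p(a, b) = N_ab / n from one gap-free aligned pair (s, t)."""
    assert len(s) == len(t)
    counts = Counter(zip(s, t))   # N_ab: times a (in s) is aligned with b (in t)
    n = len(s)
    return {pair: c / n for pair, c in counts.items()}

print(estimate_p("AAB", "ABB"))
```

If a symmetric estimate is wanted, the counts for (a,b) and (b,a) can be pooled before normalizing.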

37 Problems in Estimating p(·,·)
- How do we find pairs of aligned sequences?
- How far is the ancestor?
  - earlier divergence ⇒ low sequence similarity
  - later divergence ⇒ high sequence similarity
- Does one letter mutate to the other, or are they both mutations of a common ancestor having yet another residue/nucleotide?

38 Estimating p(·,·) for proteins
Generate a large, diverse collection of accepted mutations. An accepted mutation is a mutation observed in an alignment of closely related protein sequences, for example, the hemoglobin alpha chain in humans and other organisms (homologous proteins).
Recall that [equation not preserved in the transcript].
Define:
- $f_{ab}$ to be the number of mutations a ↔ b,
- $f_a = \sum_{b \neq a} f_{ab}$ to be the total number of mutations of a, and
- $f = \sum_a f_a$ to be the total number of amino acids involved in a mutation.
Note that f is twice the number of mutations.

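39 PAM-1 matrices
For PAM-1 it is assumed that 1% of all amino acids are mutated.
The relative mutability $m_a$ of amino acid a should reflect the probability that a is mutated to any other amino acid:
$m_a = \frac{\#(a\text{-mutations})}{\#(a\text{-occurrences})}$
Let $f_a$ be the number of mutations involving a, $f = \sum_a f_a$, and q(a) the frequency of a. In a sample of N amino acids:
- #(a-mutations) = $\frac{f_a}{f} \cdot \frac{N}{100}$ (the proportion of mutations that concern a, times the number of mutations),
- #(a-occurrences) = $q(a) \cdot N$.
Hence
$m_a = \frac{f_a}{100 \, f \, q(a)}$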

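40 PAM-1 matrices
Define $M_{ab}$ to be the probability matrix for switching from a to b via a mutation:
$M_{aa} = 1 - m_a$
$M_{ab} = m_a \cdot \frac{f_{ab}}{f_a}$ for $a \neq b$
where $m_a$ is the relative mutability of a, $f_{ab}$ is the number of mutations between a and b, and $f_a = \sum_{b \neq a} f_{ab}$.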
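A sketch of the PAM-1 construction under the standard Dayhoff-style definitions, which are assumed here because the slide's formulas are not preserved: relative mutability m_a = f_a / (100 f q(a)), M_aa = 1 - m_a, and M_ab = m_a f_ab / f_a for a ≠ b. The three-letter alphabet and the counts are toy values.

```python
def pam1(f_ab, q):
    """Build the PAM-1 matrix from symmetric mutation counts and frequencies.

    f_ab[(a, b)] is the number of observed a<->b mutations (stored for both
    orders); q[a] is the frequency of amino acid a.
    """
    letters = sorted(q)
    f_a = {a: sum(f_ab.get((a, b), 0) for b in letters if b != a) for a in letters}
    f = sum(f_a.values())
    m = {a: f_a[a] / (100 * f * q[a]) for a in letters}   # relative mutability
    M = {}
    for a in letters:
        for b in letters:
            if a == b:
                M[(a, b)] = 1 - m[a]
            elif f_a[a]:
                M[(a, b)] = m[a] * f_ab.get((a, b), 0) / f_a[a]
            else:
                M[(a, b)] = 0.0
    return M

# Toy data (hypothetical): counts are symmetric, frequencies sum to 1.
q = {"A": 0.5, "B": 0.3, "C": 0.2}
f_ab = {("A", "B"): 3, ("B", "A"): 3, ("A", "C"): 1,
        ("C", "A"): 1, ("B", "C"): 2, ("C", "B"): 2}
M = pam1(f_ab, q)
unchanged = sum(q[a] * M[(a, a)] for a in q)
print(unchanged)
```

With these definitions every row of M sums to 1 and the expected fraction of unchanged amino acids is 0.99, which is the defining property of a 1-percent matrix.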

41 Properties of PAM-1 matrices
Note that
$\sum_b M_{ab} = 1$
Namely, the probabilities of not changing and of changing sum to 1.
Also note that
$\sum_a q(a) \, M_{aa} = 0.99$
where q(a) is the frequency of amino acid a. Namely, only 1% of amino acids change according to this matrix. Hence the name, 1-Percent Accepted Mutation (PAM).
This is a unit of evolutionary change, not of time, because evolution acts differently on distinct sequence types.
What is the substitution matrix for k units of evolutionary time?

42 Model of Evolution
Again, we need to make some assumptions:
- Each position changes independently of the rest.
- The probability of mutation is the same at each position.
- Evolution does not “remember.”
[Figure: a sequence of nucleotides changing over the time steps t, t+Δ, t+2Δ, t+3Δ, t+4Δ.]

43 Model of Evolution
- How do we model such a process?
- This process is called a Markov chain.
A chain is defined by its transition probabilities:
- $P(X_{t+\Delta} = b \mid X_t = a)$ is the probability that the next state is b given that the current state is a.
- We often describe these probabilities by a matrix:
  $M[\Delta]_{ab} = P(X_{t+\Delta} = b \mid X_t = a)$

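44 Multi-Step Changes
- Based on $M_{ab}$, we can compute the probabilities of changes over two time periods by summing over the intermediate state c:
  $P(X_{t+2\Delta} = b \mid X_t = a) = \sum_c M[\Delta]_{ac} \, M[\Delta]_{cb}$
- Thus $M[2\Delta] = M[\Delta] \, M[\Delta]$.
- By induction (homework exercise): $M[k\Delta] = M[\Delta]^k$.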

45 Longer Term Changes
- Estimate $M[\Delta]$ (PAM-1 matrices).
- Use $M[k\Delta] = M[\Delta]^k$ (PAM-k matrices).
- Define [definition not preserved in the transcript].
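Raising the one-step matrix to the k-th power can be sketched without external libraries. The 2-by-2 transition matrix below is a made-up example, not a real PAM-1 matrix, and the helper names are illustrative.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][c] * B[c][j] for c in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(M, k):
    """M raised to the non-negative integer power k by repeated squaring."""
    n = len(M)
    R = [[float(i == j) for j in range(n)] for i in range(n)]   # identity
    while k:
        if k & 1:
            R = mat_mul(R, M)
        M = mat_mul(M, M)
        k >>= 1
    return R

M1 = [[0.99, 0.01],
      [0.02, 0.98]]          # hypothetical one-step transition matrix
M250 = mat_pow(M1, 250)
print(M250)
```

Each row of the result is still a probability distribution, and for large k the rows approach the chain's stationary distribution, so very high PAM numbers carry little information about the starting letter.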

46 Using PAM
- Historically, researchers used PAM-250 (the only one published in the original paper).
- The original PAM matrices were based on a small number of proteins (circa 1978). Later versions use many more examples.
- PAM used to be the most popular scoring rule, but there are some problems with PAM matrices.


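48 Problems with PAM
- The normalization step is quite arbitrary. If, for example, we define relative mutability using the constant 50 rather than 100, we get
  $m_a = \frac{f_a}{50 \, f \, q(a)}$
  and the resulting matrix changes 2% of amino acids, i.e. it is a direct estimate of $M[2\Delta]$.
- Now we have two different ways to estimate the matrix $M[4\Delta]$: either $M[4\Delta] = M[\Delta]^4$ as we did before, or $M[4\Delta] = M[2\Delta]^2$ using the direct estimate. The two need not agree.
- $M[250\Delta]$, for example, does not reflect long-period changes well.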

49 BLOSUM
- Idea: use aligned ungapped regions of protein families. These are assumed to have a common ancestor. The ideas are similar to PAM, but with better statistics and modeling.
- Procedure:
  - Cluster together sequences in a family whenever more than L% identical residues are shared.
  - Count the number of substitutions across different clusters in the same family.
  - Estimate frequencies as before.
- Practice: Blosum50 and Blosum62 are widely used. (See page in the text book). Considered state of the art nowadays.
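The clustering and counting steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the published BLOSUM procedure: clusters are formed by single linkage on pairwise identity, each cluster pair contributes with weight 1/(size1 × size2), and identical pairs are counted along with substitutions because the frequency estimates need both. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def identity(a, b):
    """Fraction of identical positions in two equal-length aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster(block, L):
    """Single-linkage clusters: join sequences sharing more than L% identity."""
    parent = list(range(len(block)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i, j in combinations(range(len(block)), 2):
        if 100 * identity(block[i], block[j]) > L:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(len(block)):
        groups[find(i)].append(i)
    return list(groups.values())

def pair_counts(block, L):
    """Weighted counts of aligned letter pairs between different clusters."""
    counts = defaultdict(float)
    for c1, c2 in combinations(cluster(block, L), 2):
        w = 1.0 / (len(c1) * len(c2))       # each cluster acts as one sequence
        for i in c1:
            for j in c2:
                for a, b in zip(block[i], block[j]):
                    counts[tuple(sorted((a, b)))] += w
    return dict(counts)

block = ["AAAA", "AAAA", "AABB"]            # toy ungapped block
print(pair_counts(block, 80))
```

Clustering keeps a family with many near-identical members from dominating the counts; a higher L merges fewer sequences and so suits comparisons of more closely related proteins.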

50 BLOSUM 62