Chap. 4: Multiple Sequence Alignment


1 Chap. 4: Multiple Sequence Alignment
Pairwise Alignment
Dynamic Programming
Multi-sequence Alignment
FASTA (Fast Alignment)
BLAST (Basic Local Alignment Search Tool)

2 FASTA Pearson and Lipman, 1988, Fast Alignment
Steps
Find exact matches, at least ktup long, between subsequences of the query sequence and subsequences of the database sequences (ktup default: 2 amino acids)
Search for diagonal regions in the alignment matrix that contain many such subsequence matches separated by small distances
Then see whether the initial regions can be joined by allowing gaps
Time saved
Dynamic programming is performed only on the initially filtered sequences, which are already similar
Only pathways through the alignment matrix that remain within a band centered on the highest-scoring initial regions are considered
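The first two FASTA steps can be sketched in a few lines. This is a minimal illustration, not the FASTA implementation: the function name `ktup_diagonal_hits` and the two short example strings are assumptions made for the sketch, and real FASTA goes on to rescore and join the best diagonals.

```python
from collections import defaultdict

def ktup_diagonal_hits(query, target, ktup=2):
    """Count exact k-tuple matches on each diagonal (i - j) of the alignment matrix."""
    # Index every k-tuple of the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # A k-tuple shared at query position i and target position j lies on diagonal i - j.
    diagonals = defaultdict(int)
    for j in range(len(target) - ktup + 1):
        for i in index.get(target[j:j + ktup], ()):
            diagonals[i - j] += 1
    return dict(diagonals)

# Hypothetical example strings; the diagonal with the most hits is the candidate region.
hits = ktup_diagonal_hits("FSGTWYA", "VAGTWYS", ktup=2)
best_diagonal = max(hits, key=hits.get)
```

Diagonals with many hits are the "initial regions" that a later banded dynamic programming step would refine.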

3

4 Blast
Exact methods are good and can pick up very distant relationships
Approximate methods can detect only close relationships well
Maybe OK when the probe sequence is fairly similar to one or more sequences in the databank
Take a small number k of residues in the probe sequence and find all instances of the k-tuple in the database
For the selected candidate sequences, an approximate optimal alignment is performed
Particularly useful in multi-sequence alignments

5 Blast
Detect the best region of local alignment between a query and the target, and whether there are other plausible alignments
Computational efficiency comes from “seeding” the search with a small subset of substrings in the query
Substrings from two sequences may be highly conserved in biological applications
Temple Smith and Michael Waterman, 1981
Biologically relevant diagonal matches are likely to have a higher score

6 Word Length and Threshold
Select a word length w (similar to ktup; default is 3 AA) and a threshold T
Given a word length of w, scan the database for words of length w that score higher than the threshold T
Example: for a human RBP query …FSGTWYA… (query word in red)
A list of words (w=3) is:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS

7 Neighborhood word hits according to Blosum62 (T=11)
Word   Per-residue scores   Total
GTW    6, 5, 11             22    neighborhood word hits > threshold
GSW    6, 1, 11             18
ATW    0, 5, 11             16
NTW    0, 5, 11             16
GTY    6, 5, 2              13
GNW                         10    neighborhood word hits < below threshold (T=11)
GAW                         9
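The word scores above are sums of per-residue substitution scores. The sketch below recomputes them using only the Blosum62 entries quoted on the slide; the table is deliberately partial, the function names are made up for illustration, and treating a hit as "score at least T" is an assumption about how the threshold is applied.

```python
# Partial table: only the Blosum62 entries quoted on the slide (not the full matrix).
BLOSUM62_SUBSET = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}

def pair_score(a, b):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]

def word_score(query_word, candidate):
    """Sum the per-residue scores of two equal-length words."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))

def neighborhood(query_word, candidates, threshold):
    """Keep candidate words scoring at least the threshold (assumed convention)."""
    scored = {w: word_score(query_word, w) for w in candidates}
    return {w: s for w, s in scored.items() if s >= threshold}

candidates = ["GTW", "GSW", "ATW", "NTW", "GTY"]
hits_t11 = neighborhood("GTW", candidates, threshold=11)
hits_t17 = neighborhood("GTW", candidates, threshold=17)
```

Raising the threshold shrinks the neighborhood, which means fewer seeds to extend and a faster but less sensitive search.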

8 Effect of Threshold You can modify the threshold parameter.
The default value for blastp is 11. To change it, enter “-f 16” or “-f 5” in the advanced options of BLAST+. (To find BLAST+, go to BLAST → help → download.)

9
2. Scan the database for entries matching the compiled list
3. In each direction, extension terminates when the score falls more than a certain distance below the best score reached so far
extend <- Hit! -> extend
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
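The termination rule in step 3 can be sketched as an ungapped extension with a drop-off limit. This is an illustration under stated assumptions: the names `extend_one_way`, `extend_hit` and `x_drop` are hypothetical, and the +1/-1 identity scoring stands in for the substitution matrix that BLAST actually uses.

```python
def identity_score(a, b):
    # Assumed toy scoring: +1 for a match, -1 for a mismatch.
    return 1 if a == b else -1

def extend_one_way(query, target, qi, ti, step, x_drop, score_fn):
    """Extend without gaps from (qi, ti); stop once the score drops more than x_drop below the best."""
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        score += score_fn(query[qi], target[ti])
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
        qi += step
        ti += step
    return best, best_len

def extend_hit(query, target, qi, ti, w, x_drop=2, score_fn=identity_score):
    """Extend a word hit of length w in both directions; return (score, query start, query end)."""
    seed = sum(score_fn(query[qi + k], target[ti + k]) for k in range(w))
    right, right_len = extend_one_way(query, target, qi + w, ti + w, +1, x_drop, score_fn)
    left, left_len = extend_one_way(query, target, qi - 1, ti - 1, -1, x_drop, score_fn)
    return seed + left + right, qi - left_len, qi + w + right_len

query = "KENFDKARFSGTWYAMAKKDPEG"
target = "MKGLDIQKVAGTWYSLAMAASD"
score, start, end = extend_hit(query, target, query.find("GTW"), target.find("GTW"), w=3)
segment = query[start:end]
```

Only the best-scoring prefix of each extension is kept, so the trailing residues that caused the drop are not part of the reported segment.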

10

11 Identify all exact matches of k-tuple words (no gaps, no mismatches)
Extend the exact matches in both directions (no gaps, no mismatches)
Join the extended matches, allowing mismatches and gaps, only in limited regions containing preliminary matches

12

13

14 Blast, 1997 Refined to require two independent hits
The hits must occur in close proximity to each other. With this modification, only one seventh as many extensions occur, greatly reducing the time required for a search.

15 Comparison of Search Methods
FASTA vs. BLAST
FASTA is more sensitive for DNA-DNA searches, especially for highly diverged sequences.
BLAST is better at finding short regions of high similarity, while FASTA is better at finding long regions of lower similarity
BLAST will miss similar sequences if they do not have a single identical word
Protein similarity search can find more distant similarities
DNA has four letters, and thus the probability of chance matches is much greater
Protein databanks are much smaller, and searches can be more sensitive
William Pearson (FASTA author): “The number one thing that you should learn is that in general, you should try not do DNA sequence comparison.”
A protein-protein search from BLASTX produces more sensitive results than a DNA-DNA search

16 Blast (Basic Local Alignment Search Tool)
Program   Input     Database   Comparisons
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36

17 DNA can encode six Proteins
Forward reading frames:
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
Reverse reading frames:
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
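The six reading frames can be generated mechanically: three offsets on the given strand and three on its reverse complement. The sketch below is a minimal illustration using the standard genetic code; the function names and the frame labels (`+1` to `-3`) are choices made here, not notation from the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Translate all three offsets on both strands."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

dna = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
frames = six_frames(dna)
```

This is why a translated search such as blastx performs six comparisons per query, and tblastx performs 6 x 6 = 36.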

18 Effect of word size
For blastn, the word size is typically 7, 11, or 15 (EXACT match).
Changing the word size is analogous to changing the threshold for protein searches.
w=15 gives fewer matches and is faster than w=11 or w=7.
For megablast, the word size is 28 and can be adjusted to 64. What will this do?
Megablast is VERY fast for finding closely related DNA sequences!

19 Word-matching problem
Consider two sequences of lengths N and M
A simple local alignment algorithm looks for the longest exactly matching word
Score for the alignment = the length of the longest match (l), e.g., l=6
Let n denote the length of the matching word starting at two random sites (in red; for example, n = 3 for ‘TCC’)
Then l = max(n)
F(l) follows an extreme value distribution (EVD)
G G A T A T C C A G C G C T C C T A T C C G A T A T C T T G G
G G A T A T C C A G C G C T C C T A T C C G A T A T C T T G G
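The score l defined above is the length of the longest common substring, which a small dynamic program can find. In the sketch below the function name is hypothetical, and the way the slide's run of letters is split into two sequences is an assumption made only to have an example; the transcript does not preserve where one sequence ends and the other begins.

```python
def longest_common_word(a, b):
    """Return the longest word that occurs exactly (no gaps, no mismatches) in both sequences."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # n for the word ending at (i, j): one longer than the word ending at (i-1, j-1).
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

# ASSUMED split of the slide's letters into two sequences (not confirmed by the source).
seq1 = "GGATATCCAGCGCTCC"
seq2 = "TATCCGATATCTTGG"
word = longest_common_word(seq1, seq2)
l = len(word)
```

Under this assumed split the longest shared word has length 6, in line with the slide's l=6 example.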

20 Word-matching problem -theory
Prob[two bases at randomly selected sites are equal] = $a$
$n$: length of the matching word starting at these random sites
$P[n \ge l] = a^{l}$
Define $\lambda = -\ln(a) = \ln(1/a)$, so that $P[n \ge l] = e^{-\lambda l}$
Compute $E(l)$, the expected number of matching words of length at least $l$
There are N ways of choosing the first starting site and M ways of choosing the second
If the NM pairs of selected sites were independent, $E(l) = NM\,e^{-\lambda l}$
However, words starting at different sites overlap and are not independent
Thus $E(l) = kNM\,e^{-\lambda l}$ with $k < 1$
Since $P[n < l] = 1 - e^{-\lambda l}$ (the exponential distribution), the density is $P[n = l] = \lambda e^{-\lambda l}$
Prob. of the longest matching word having length $l$: $F(l) = \lambda e^{-\lambda (l-u)} \exp\left(-e^{-\lambda (l-u)}\right)$, with $u = \ln(kMN)/\lambda$

21 EVD Distribution
Thus, $F(m_{max})$ does NOT follow a Gaussian distribution
$F(m_{max})$ follows an extreme value distribution (EVD), also called the Gumbel distribution
$F(m_{max}) = \lambda e^{-\lambda (m_{max}-u)} \exp\left(-e^{-\lambda (m_{max}-u)}\right)$
Single peak, skewed to one side
An EVD arises whenever we deal with the maximum value taken from a large number of independent alternatives
Thus, $m_{max}$ is likely to be considerably higher than the value of m obtained from just two typical sequences
e.g., number of sequences S=2000; length of each sequence N=200; only two bases, C and G, with equal probability
m has a binomial distribution
$m_{max}$ is EVD distributed

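22 EVD distribution
A random variable x with distribution P(x), and a large sample of size S
Find $F(x_{max})$
$\mathrm{Prob}[x < x_{max}] = \int_{-\infty}^{x_{max}} P(x)\,dx$
$F(x_{max})$ = Prob[choose one with $x_{max}$ and the rest $< x_{max}$] $= S\,P(x_{max})\,\{\mathrm{Prob}[x < x_{max}]\}^{S-1}$
When $P(x) = \lambda e^{-\lambda x}$, $\mathrm{Prob}[x < x_{max}] = 1 - e^{-\lambda x_{max}}$, and
$F(x_{max}) = S\lambda e^{-\lambda x_{max}} \left(1 - e^{-\lambda x_{max}}\right)^{S-1} \approx S\lambda e^{-\lambda x_{max}} \exp\left(-S e^{-\lambda x_{max}}\right)$, using $(1-a)^{n} \approx \exp(-na)$
Set $u = \ln(S)/\lambda$, i.e. $S = \exp(\lambda u)$
$F(x_{max}) = \lambda e^{-\lambda (x_{max}-u)} \exp\left(-e^{-\lambda (x_{max}-u)}\right)$
Single peak at $x_{max} = u$
The width of the peak is controlled by $\lambda$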
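A quick numerical check of this derivation compares the exact density of the maximum of S exponential variables with its EVD (Gumbel) approximation. The function names are hypothetical, and S = 2000 with decay constant 1 are example values chosen for the sketch.

```python
import math

def exact_max_density(x, S, lam):
    """Density of the maximum of S independent exponential(lam) variables."""
    return S * lam * math.exp(-lam * x) * (1.0 - math.exp(-lam * x)) ** (S - 1)

def gumbel_density(x, u, lam):
    """Extreme value density with characteristic value u and decay constant lam."""
    z = math.exp(-lam * (x - u))
    return lam * z * math.exp(-z)

S, lam = 2000, 1.0          # example values, assumed for illustration
u = math.log(S) / lam       # location of the peak

# Largest discrepancy between the two densities on a grid around the peak.
grid = [u + 0.1 * step for step in range(-40, 41)]
max_gap = max(abs(exact_max_density(x, S, lam) - gumbel_density(x, u, lam)) for x in grid)
```

For large S the two curves are nearly indistinguishable, which is why the approximation $(1-a)^{n} \approx \exp(-na)$ is harmless here.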

23 The probability density function of EVD
[Figure: probability density of the normal distribution and the extreme value distribution (characteristic value u = 0, decay constant λ = 1); x axis from -5 to 5, probability axis from 0 to 0.40]

24 Significance of matches
Consider the significance of a match
The observed value of the top-hit score for the query is $m_{obs}$
The probability of obtaining a value $m_{max} \ge m_{obs}$ by chance is given by the area under the tail of the distribution
$p(m_{obs}) = 1 - \exp\left(-\exp\left(-\lambda (m_{obs} - u)\right)\right)$
A small p implies that the match is less likely to have arisen by chance (the greater the significance)
In the same example, $m_{obs} = 130$ gives p = 3.3% (a score just big enough to be significant)
As S increases, $F(m_{max})$ shifts to the right, so a higher score is needed to reach the same significance
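The same tail probability can be computed directly, without the EVD approximation, for the two-base example. The sketch assumes that m is the number of matching positions between the query and one database sequence (binomial with 200 trials and probability 0.5) and that the S = 2000 sequences are independent; the function names are hypothetical. The exact figure depends on this reading and need not coincide with the value quoted on the slide.

```python
import math

def binomial_tail(n, p, m):
    """P[X >= m] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))

def top_hit_p_value(m_obs, n, p, S):
    """Probability that the best of S independent sequences scores at least m_obs by chance."""
    tail = binomial_tail(n, p, m_obs)
    return 1.0 - (1.0 - tail) ** S

# Assumed reading of the example: N = 200 positions, match probability 0.5, S = 2000 sequences.
p_130 = top_hit_p_value(130, n=200, p=0.5, S=2000)
p_140 = top_hit_p_value(140, n=200, p=0.5, S=2000)
p_130_bigger_db = top_hit_p_value(130, n=200, p=0.5, S=20000)
```

The last line illustrates the database-size effect: the same observed score is less surprising when more sequences are searched.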

25 Alignment stats (reality)
BLAST and similar tools work by looking for high-scoring local alignments
When gaps are not allowed, pairwise local alignment scores have been shown to be EVD distributed (Karlin and Altschul, 1990)
With gaps, scores are believed to be EVD distributed as well
But the EVD parameters $\lambda$ and $u$ are not known and have to be computed empirically
Once the scoring system and the EVD parameters for a given search algorithm are known, one can estimate the significance of a match
E: expected number of sequences with a score ≥ the observed score S
$E(S) = kMN \exp(-\lambda S)$ (N: length of the query sequence; M: total length of all the sequences in the database)

26 Blast Result: E value
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
An E value is related to a probability value p.
The key equation describing an E value is: $E = Kmn\,e^{-\lambda S}$

27 $E = Kmn\,e^{-\lambda S}$
This equation is derived from a description of the extreme value distribution
S = the score
E = the expect value = the number of high-scoring segment pairs (HSPs) expected to occur with a score of at least S
m, n = the lengths of the two sequences
$\lambda$, K = Karlin-Altschul statistics

28 Properties
The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values.
The expected score for aligning a random pair of residues must be negative! Otherwise, long random alignments would acquire great scores
Parameter K describes the search space (database).
For E=1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly

29 From Raw scores to Bit scores
There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores)
Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes
$S' = \text{bit score} = (\lambda S - \ln K) / \ln 2$
The E value corresponding to a given bit score is: $E = mn\,2^{-S'}$
Bit scores allow you to compare results between different database searches, even using different scoring matrices.
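The two E-value formulas are algebraically the same quantity, which a short calculation makes concrete. In the sketch below the function names are hypothetical, and the values chosen for λ, K, m and n are placeholders for illustration rather than parameters reported by any particular search.

```python
import math

def e_value(raw_score, m, n, lam, K):
    """E = K m n exp(-lambda S), from the raw score."""
    return K * m * n * math.exp(-lam * raw_score)

def bit_score(raw_score, lam, K):
    """S' = (lambda S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value_from_bits(bits, m, n):
    """E = m n 2^(-S'), from the bit score."""
    return m * n * 2.0 ** (-bits)

# Placeholder parameters (assumed for illustration only).
lam, K = 0.267, 0.041
m, n = 250, 10_000_000
raw = 100

bits = bit_score(raw, lam, K)
e_raw = e_value(raw, m, n, lam, K)
e_bits = e_value_from_bits(bits, m, n)
```

Because λ and K are absorbed into the bit score, only m and n are needed to turn a reported bit score back into an E value.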

30 E and p
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search.
A p value is a different way of representing the significance of an alignment.
$p = 1 - e^{-E}$
Default value of E: 10
p < 0.05 is considered to be significant

31 Very small E values are very similar to p values.
E values of about 1 to 10 are far easier to interpret than corresponding p values.
[Table: E versus p; the surviving annotations mark p values of about 0.1, about 0.05, and about 0.001]
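The relationship between E and p is easy to tabulate from the formula p = 1 - exp(-E). The sketch below is an illustration only; the function name and the particular E values listed are choices made here, not the entries of the original slide table.

```python
import math

def p_from_e(e):
    """p = 1 - exp(-E); expm1 keeps precision when E is very small."""
    return -math.expm1(-e)

# Example E values chosen for illustration.
table = [(e, p_from_e(e)) for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001)]
for e, p in table:
    print(f"E = {e:<8g} p = {p:.8f}")
```

For small E the two numbers nearly coincide, while for E between 1 and 10 the p values crowd toward 1 and become hard to tell apart.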

32 Two problems standard BLAST cannot solve
Use human beta globin as a query against human RefSeq proteins, and blastp does not “find” human myoglobin. This is because the two proteins are too distantly related. PSI-BLAST at NCBI as well as hidden Markov models easily solve this problem. How can we search using 10,000 base pairs as a query, or even millions of base pairs? Many BLAST-like tools for genomic DNA are available such as PatternHunter, Megablast, BLAT, and BLASTZ. Page 141

33 Position specific iterated BLAST:
PSI-BLAST The purpose of PSI-BLAST is to look deeper into the database for matches to your query protein sequence by employing a scoring matrix that is customized to your query. Page 146

34 PSI-BLAST
PSI (Position Specific Iterated) BLAST, Altschul et al., 1997
Use the original BLAST algorithm and retrieve database sequences with significant matches (E < 0.01)
A multiple alignment is performed
Place all locally aligned sections of the database sequences below the query sequence
When a gap is inserted in the query sequence, the corresponding residue in the database sequence is removed, so that all sequences are of the same length (for speed and simplicity)
The multiple sequences are then used as input to the 2nd run
Use a PSSM (Position Specific Scoring Matrix) for scoring
The score depends on the frequencies of the residues in the column of the alignment (V is more likely to be aligned with a column containing many V’s or other hydrophobic residues)
Repeat the process until no more new sequences are added

35 Search results During iteration in PSI Blast,
New distantly related sequences can be found that are not detected in a straightforward Blast search
This is due to the extra information in the aligned group of sequences that is not present in any one sequence
On the other hand, as sequences are added, the range of sequences may become too broad
The alignment may end up having little relationship to the original query
The ranking of hits gives some information about the degree of relatedness
But the top hits are not necessarily the most meaningful in terms of evolution
Several top hits to human genes were from bacteria, which led to claims of horizontal gene transfer (disproved by phylogenetic methods)
Frequently, the top hits are not the closest relatives

36 Multiple Sequence Alignment
Based on local sequence alignments
Aims to recognize resemblance even when sequences share only weak similarities
Problem Statement
Given k strings $v_1, \ldots, v_k$ of lengths $n_1, \ldots, n_k$ over an alphabet A (with $A' = A \cup \{-\}$), and a k-dimensional score matrix δ
Find a k × n matrix over A' such that
Row i contains the characters of $v_i$ in order
Every column contains at least one symbol from A
The sum of the scores of the columns is maximum
The global alignment approach can be extended to k dimensions

37 PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein database
[2] PSI-BLAST constructs a multiple sequence alignment, then creates a “profile” or specialized position-specific scoring matrix (PSSM)
Page 146

38 Inspect the blastp output to identify empirical “rules” regarding amino acids tolerated at each position R,I,K C D,E,T K,R,T N,L,Y,G

39 [Figure: position-specific scoring matrix]
Columns: the 20 amino acids (A R N D C Q E G H I L K M F P S T W Y V)
Rows: all the amino acids from position 1 to the end of your PSI-BLAST query protein (shown: ... 37 S, 38 G, 39 T, 40 W, 41 Y, 42 A)

40 [Figure: position-specific scoring matrix; columns A R N D C Q E G H I L K M F P S T W Y V; rows ... 37 S, 38 G, 39 T, 40 W, 41 Y, 42 A]
Note that a given amino acid (such as alanine) in your query protein can receive different scores for matching alanine, depending on its position in the protein

41 [Figure: position-specific scoring matrix; columns A R N D C Q E G H I L K M F P S T W Y V; rows ... 37 S, 38 G, 39 T, 40 W, 41 Y, 42 A]
Note that a given amino acid (such as tryptophan) in your query protein can receive different scores for matching tryptophan, depending on its position in the protein

42 PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein database
[2] PSI-BLAST constructs a multiple sequence alignment, then creates a “profile” or specialized position-specific scoring matrix (PSSM)
[3] The PSSM is used as a query against the database
[4] PSI-BLAST estimates statistical significance (E values)
[5] Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query
Page 146

43

44 Multiple Sequence Alignment Example: k=3
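With three sequences v, w, and u and a 3D δ (the score of a column with x, y, and z)
In global (pairwise) alignment:
$$s_{i,j} = \max \begin{cases} s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \\ s_{i-1,j-1} + \delta(v_i, w_j) \end{cases}$$
In multiple (three-sequence) alignment:
$$s_{i,j,k} = \max \begin{cases} s_{i-1,j,k} + \delta(v_i, -, -) \\ s_{i,j-1,k} + \delta(-, w_j, -) \\ s_{i,j,k-1} + \delta(-, -, u_k) \\ s_{i-1,j-1,k} + \delta(v_i, w_j, -) \\ s_{i-1,j,k-1} + \delta(v_i, -, u_k) \\ s_{i,j-1,k-1} + \delta(-, w_j, u_k) \\ s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) \end{cases}$$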
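The three-sequence recurrence can be turned into a short dynamic program that fills a 3D table. This is a sketch under stated assumptions: the slides leave δ unspecified, so a sum-of-pairs column score with match +1, mismatch -1 and gap -1 is assumed here, and the function names are hypothetical.

```python
from itertools import product

def column_score(x, y, z, match=1, mismatch=-1, gap=-1):
    """Assumed delta: sum-of-pairs score of one column; a gap-gap pair scores 0."""
    total = 0
    for a, b in ((x, y), (x, z), (y, z)):
        if a == "-" and b == "-":
            continue
        total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def align3_score(v, w, u):
    """Optimal global alignment score of three sequences via the 3D recurrence."""
    n1, n2, n3 = len(v), len(w), len(u)
    s = [[[float("-inf")] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    s[0][0][0] = 0
    # Seven predecessor moves: each sequence contributes a character (1) or a gap (0).
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    # Lexicographic order guarantees every predecessor cell is filled before it is read.
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            column = (v[i - 1] if di else "-", w[j - 1] if dj else "-", u[k - 1] if dk else "-")
            s[i][j][k] = max(s[i][j][k], s[pi][pj][pk] + column_score(*column))
    return s[n1][n2][n3]

identical = align3_score("ACGT", "ACGT", "ACGT")
one_missing = align3_score("A", "A", "")
```

The table has (n1+1)(n2+1)(n3+1) cells with seven moves each, which is why the exact approach becomes impractical as the number of sequences grows.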

45 Multiple Sequence Alignment
Time complexity is $O((2n)^{k})$
Heuristic 1
Compute all $\binom{k}{2}$ optimal pairwise alignments and combine them
Does not work all the time

46 Multiple Sequence Alignment
Heuristic 2: Greedy progressive multiple alignment
Select the two strings with the greatest similarity
Merge the two into a new string
Works well for very close sequences
The result may depend on the two seed sequences
Clustal uses this approach

