Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.


1 Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics

2 Which Program should one use?
Most researchers use methods for determining local similarities:
- Smith-Waterman (the gold standard)
- FASTA
- BLAST
FASTA and BLAST do not find every possible alignment of the query with a database sequence. They are used because they run faster than Smith-Waterman.

3 Heuristic Database Search Methods
Smith-Waterman dynamic programming is too compute- and time-intensive for searching big databases
- e.g., UniProt, July 2004: 1.5M sequences
Most popular: BLASTx (Altschul et al. 1990, 1997) and FASTx (Lipman and Pearson 1985)

4 BLAST: Basic Local Alignment Search Tool
Basic idea:
- Identify short, very similar segment pairs, then extend them into local alignments
Critical issues:
- For every database sequence d significantly similar to the query q, at least one segment pair should be found
- Fewer segment pairs means faster computation

5 Definitions
- Maximal Segment Pair (MSP_qd): the pair of equal-length segments with the highest score among all ungapped local alignments between q and d.
- High-Scoring Segment Pair (HSP): a segment pair whose score cannot be increased by shortening or extending it.
- Word: a segment of fixed length w.
- Word pair: a pair of segments, each of length w.
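To make the MSP definition concrete, here is a minimal Python sketch that finds the best-scoring ungapped local alignment by scanning every diagonal with a running maximum. The `score` function (+1 for identity, -1 otherwise) is a hypothetical toy scoring, not something the slides specify; a real search would use a substitution matrix.

```python
def score(a, b, match=1, mismatch=-1):
    """Hypothetical toy scoring: +1 for identical residues, -1 otherwise."""
    return match if a == b else mismatch


def maximal_segment_pair(q, d):
    """Return (best_score, q_start, d_start, length) over all ungapped
    local alignments between q and d."""
    best = (0, 0, 0, 0)
    # Diagonal k = j - i contains every position pair with the same offset.
    for k in range(-(len(q) - 1), len(d)):
        i = max(0, -k)
        j = i + k
        running, start_i = 0, i
        while i < len(q) and j < len(d):
            if running <= 0:
                # A non-positive prefix can never help: restart the segment here.
                running, start_i = 0, i
            running += score(q[i], d[j])
            if running > best[0]:
                best = (running, start_i, start_i + k, i - start_i + 1)
            i += 1
            j += 1
    return best
```

For example, `maximal_segment_pair("ACGTT", "TTACGTA")` reports the segment pair `ACGT`/`ACGT` starting at position 0 in q and position 2 in d. This exhaustive scan costs time proportional to len(q) × len(d), which is exactly what the word-based heuristic is designed to avoid.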

6 Reformulating the Problem
- Identify those database sequences d for which the score of MSP_qd exceeds a threshold V.
- A segment pair scoring at least V contains, with high probability, a word pair scoring at least T.
- Identify word pairs with score at least T, extend them to high-scoring segment pairs, and check whether the score exceeds V.

7 Finding Hits and HSPs
Hit: a word pair scoring at least T
Preprocess q
- Find all words ω_T (of length w) that can score at least T against a word in q
- Save them in an easy-to-use data structure
Find the hits
- Search d for all occurrences (ω_d) of the words ω_T
Extend (heuristically) to high-scoring segment pairs
Perform dynamic programming around HSPs scoring over a certain threshold; this allows the introduction of gaps
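The first two steps (preprocessing q, then scanning d for hits) can be sketched in Python as follows. This is an illustration under stated assumptions, not the BLAST implementation: `toy_score`, `neighborhood`, and `find_hits` are hypothetical names, the scoring is a simple +2/-1 identity scheme rather than a real substitution matrix, and the candidate words are enumerated by brute force.

```python
from collections import defaultdict
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Hypothetical scoring: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


def neighborhood(q, w, T, alphabet=AMINO_ACIDS, score=toy_score):
    """Map every word that scores >= T against some length-w word of q
    to the list of query positions it matches."""
    table = defaultdict(list)
    for i in range(len(q) - w + 1):
        q_word = q[i:i + w]
        # Brute force over all len(alphabet)**w candidate words.
        for cand in product(alphabet, repeat=w):
            if sum(score(a, b) for a, b in zip(cand, q_word)) >= T:
                table["".join(cand)].append(i)
    return table


def find_hits(table, d, w):
    """Scan d once; return (i, j) pairs where the word d[j:j+w]
    hits the query word q[i:i+w]."""
    hits = []
    for j in range(len(d) - w + 1):
        for i in table.get(d[j:j + w], ()):
            hits.append((i, j))
    return hits
```

With `q = "ACD"`, `w = 3`, and `T = 3`, the table holds the exact word plus every word with a single mismatch, so scanning `d = "WWACEWW"` yields one hit at `(0, 2)` even though `ACE` does not occur in q. Raising T shrinks the table and the number of hits, which is the speed versus sensitivity trade-off.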

8 Pre-processing q
Aim:
- Allow rapid identification of all words ω_T in d, together with the locations of the corresponding words in q, so that they can be extended into HSPs
Possibility: a table of 20^w entries, one for each possible word of length w over the 20 amino acids
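One way to realize a table with 20^w entries is to read each word as a base-20 number and use that number as a direct array index. The Python sketch below assumes this encoding; `word_index` and `build_table` are hypothetical helper names, and for brevity only the exact words of q are stored (words that merely score at least T against a query word would be added to their own slots in the same way).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: n for n, aa in enumerate(AMINO_ACIDS)}


def word_index(word):
    """Encode a word as a base-20 integer in the range [0, 20**w)."""
    idx = 0
    for aa in word:
        idx = idx * 20 + CODE[aa]
    return idx


def build_table(q, w):
    """Direct-address table with 20**w slots; slot k lists the positions
    in q at which the word with index k starts."""
    table = [[] for _ in range(20 ** w)]
    for i in range(len(q) - w + 1):
        table[word_index(q[i:i + w])].append(i)
    return table
```

Looking up a database word is then a single array access, `table[word_index(d[j:j + w])]`. The table grows quickly with w (8,000 slots for w = 3, 160,000 for w = 4), which is one reason short words are used.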

9 Pre-processing q

10 Finding HSPs
For each word in d (starting at position j) that hits a word in q (starting at position i), record the hit indexed by its diagonal (j - i). Hits close together on the same diagonal are joined before extension to HSPs.
Extending to an HSP:
- Ideally: move to the ends of the sequences in both directions
- Heuristic: if the score falls "far below" the best seen so far, stop the extension
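The diagonal bookkeeping and the stop-early extension can be sketched as below. This is a simplified Python illustration: the `drop` parameter stands in for "far below", `toy_score` is a hypothetical +2/-1 scoring, and only rightward extension is shown (leftward extension is symmetric).

```python
from collections import defaultdict


def toy_score(a, b):
    """Hypothetical scoring: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


def group_by_diagonal(hits):
    """Group (i, j) hits by their diagonal j - i, sorted by query position."""
    diagonals = defaultdict(list)
    for i, j in hits:
        diagonals[j - i].append((i, j))
    for pairs in diagonals.values():
        pairs.sort()
    return diagonals


def extend_right(q, d, i, j, w, drop, score=toy_score):
    """Extend the word pair q[i:i+w] / d[j:j+w] to the right without gaps.
    Stop once the running score is more than `drop` below the best seen.
    Return (best_score, length_of_best_segment)."""
    running = sum(score(a, b) for a, b in zip(q[i:i + w], d[j:j + w]))
    best, best_len, length = running, w, w
    while i + length < len(q) and j + length < len(d):
        running += score(q[i + length], d[j + length])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > drop:
            break  # fell "far below" the best score: give up
    return best, best_len
```

Hits that share a key in `group_by_diagonal` lie on the same diagonal, so nearby ones can be merged before `extend_right` is called. A larger `drop` tolerates longer mismatched stretches at the cost of more work per hit.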

11 Dynamic Programming Around HSPs
- DP is time-consuming and needs to be constrained
- Starting from an identified HSP, find a "seed pair"
- Perform "forward" and "backward" DP from the seed pair (independently)
- Stop the DP if the score falls T below the best score S' seen so far

12 Significance of alignments Suppose an alignment reveals an intriguing similarity between two sequences. Is the similarity significant, or could it have arisen by chance?

13 Significance of alignment If the score of the alignment observed is no better than might be expected from a random permutation of the sequence, then it is likely to have arisen by chance.

14 How to Generate the Random Sequences?
Global alignment
- Randomize one of the sequences many times, realign each result to the second (fixed) sequence, and collect the distribution of the resulting scores.
Local alignment
- Use the results returned from the entire database as the control population against which to measure the statistics.

15 Statistical parameters
Z-score
- A measure of how unusual the original match is
- A Z-score of 0 means the observed similarity is no better than the average of the control population.
- The higher the Z-score, the greater the probability that the observed alignment did not arise by chance.
- Z-scores ≥ 5 are generally considered significant.
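A Z-score against a shuffled control population can be computed as in the Python sketch below. It assumes the usual definition (observed score minus control mean, divided by the control standard deviation); `shuffled_controls` and its `align_score` argument are hypothetical names, and any alignment scoring function can be plugged in.

```python
import random
import statistics


def z_score(observed, control_scores):
    """Number of standard deviations by which `observed` exceeds the
    mean of the control population."""
    mean = statistics.mean(control_scores)
    sd = statistics.stdev(control_scores)  # needs >= 2 non-identical scores
    return (observed - mean) / sd


def shuffled_controls(seq_a, seq_b, align_score, n=200, seed=0):
    """Score seq_b against n random permutations of seq_a."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    residues = list(seq_a)
    scores = []
    for _ in range(n):
        rng.shuffle(residues)
        scores.append(align_score("".join(residues), seq_b))
    return scores
```

A typical use would be `z_score(align_score(a, b), shuffled_controls(a, b, align_score))`. Shuffling preserves the composition of the sequence while destroying its order, so the control scores reflect what composition alone can produce.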

16 Statistical parameters
P = the probability that the alignment is no better than random
- P ≤ 10^-100: exact match
- P in range 10^-100 to 10^-50: sequences very nearly identical
- P in range 10^-50 to 10^-10: closely related sequences, homology certain
- P in range 10^-5 to 10^-1: usually distant relatives
- P > 10^-1: match probably insignificant

17 Statistical parameters
E-value
- The expected number of sequences that give the same Z-score or better if the database is probed with a random sequence.
- It is found by multiplying the value of P by the size of the database probed.
- Note that E, but not P, depends on the size of the database.

18 Statistical parameters
Interpreting E-values
- E ≤ 0.02: sequences probably homologous
- E between 0.02 and 1: homology cannot be ruled out
- E > 1: a match this good would be expected just by chance
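The relationship between P, database size, and E, together with the rule-of-thumb thresholds above, fits in a few lines of Python. The function names are hypothetical, and the thresholds are simply the ones listed on this slide.

```python
def e_value(p, database_size):
    """E = P multiplied by the number of sequences in the database probed."""
    return p * database_size


def interpret_e(e):
    """Apply the rule-of-thumb E-value thresholds from the slide."""
    if e <= 0.02:
        return "sequences probably homologous"
    if e <= 1:
        return "homology cannot be ruled out"
    return "a match this good is expected by chance"
```

For a database of 1.5 million sequences, a match with P = 10^-9 gives an E-value of about 0.0015 and is probably homologous, whereas P = 10^-6 gives E of about 1.5, which is no better than chance. The same P searched against a smaller database would yield a smaller E.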

19 Rules and thinking..
Percent of identical residues in the optimal alignment
- Over 45%: very similar structures, and a common or at least related function.
- Over 25%: a similar general folding pattern.
- A lower degree of sequence similarity cannot rule out homology.

20 Rules and thinking..
- 18%-25%: the twilight zone, where the suggestion of homology is tantalizing but dangerous
- Absence of significant similarity does not imply that the sequences are not homologous; they could be distantly related (twilight zone or beyond)

