Computational Genomics Lecture #2 This class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir. Changes made by.


Computational Genomics Lecture #2
This class has been edited from Nir Friedman’s lecture, which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.
Background readings:
- Chapters 2.5, 2.7 in Biological Sequence Analysis, Durbin et al.
- Part III in Algorithms on Strings, Trees, and Sequences, Gusfield.
- Chapters , in Introduction to Computational Molecular Biology, Setubal and Meidanis.
1. Hirschberg linear-space alignment
2. Local alignment
3. Heuristic alignment: FASTA and BLAST
4. Scoring functions

2 Global Alignment (reminder)
Last time we saw a dynamic programming algorithm that solves global alignment with the following performance:
Space: O(mn)
Time: O(mn)
- Filling the matrix: O(mn)
- Backtrace: O(m+n)
- Reducing the time to O((m+n)^(2-ε)) is a major open problem.

3 Space Complexity
In real-life applications, n and m can be very large, and the O(mn) space requirement can be too demanding:
- If m = n = 1000, we need 1MB of space.
- If m = n = 10000, we need 100MB of space.
In general, time is cheaper than space, so we can afford some extra computation in order to save space. Can we trade time for space?

4 Why Do We Need So Much Space?
To compute only the value V[n,m] = d(s[1..n],t[1..m]), we need just O(min(n,m)) space: compute V(i,j) column by column, storing only two columns in memory (or row by row if the rows are shorter).
[Figure: DP matrix for the sequences AGC and AAAC]
Note, however, that this “trick” fails when we need to reconstruct the optimal alignment itself: the traceback information requires O(mn) memory.
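The two-row idea can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen only to make the example concrete.

```python
def global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of s and t, keeping only two rows."""
    if len(t) > len(s):
        s, t = t, s  # keep rows as long as the shorter string: O(min(n, m)) space
    prev = [j * gap for j in range(len(t) + 1)]  # row 0: prefix of t against gaps
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)          # column 0: prefix of s against gaps
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,     # align s[i] with t[j]
                          prev[j] + gap,         # s[i] against a gap
                          curr[j - 1] + gap)     # t[j] against a gap
        prev = curr                              # the older row is discarded
    return prev[-1]
```

Only the score survives; the rows needed for a traceback have been overwritten, which is exactly why the trick does not recover the alignment.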

5 Hirschberg’s Space-Efficient Algorithm
Input: sequences s[1,n] and t[1,m] to be aligned.
Idea: divide and conquer.
- If n=1, align s[1,1] and t[1,m] directly.
- Else, find the position (n/2, j) at which an optimal alignment crosses the midline.
- Construct the alignments A = s[1,n/2] vs t[1,j] and B = s[n/2+1,n] vs t[j+1,m].
- Return AB.

6 Finding the Midpoint
The score of the best alignment that goes through (n/2,j) equals V(s[1,n/2],t[1,j]) + V(s[n/2+1,n],t[j+1,m]).
- So we want to find the value(s) of j that maximize this sum.
- For such a j, an optimal alignment goes through (n/2,j).

7 Finding the Midpoint
The score of the best alignment that goes through (n/2,j) equals V(s[1,n/2],t[1,j]) + V(s[n/2+1,n],t[j+1,m]).
- We want to compute these two quantities for all values of j.
- Let V[i,j] = V(s[1,i],t[1,j]) (“forward”).
- Compute V[i,j] just as we did before.
- Store all V[n/2,j].

8 Finding the Midpoint
The score of the best alignment that goes through (n/2,j) equals V(s[1,n/2],t[1,j]) + V(s[n/2+1,n],t[j+1,m]).
- We want to compute these two quantities for all values of j.
- Let U[i,j] = V(s[i+1,n],t[j+1,m]) (“backward”).
- Hey - U[i,j] is not something we have already seen.

9 Finding the Midpoint
- U[i,j] = V(s[i+1,n],t[j+1,m]) is the value of an optimal alignment between a suffix of s and a suffix of t.
- But in the lecture we only talked about alignments between two prefixes.
- Don’t be ridiculous: think backwards.
- U[i,j] is the value of an optimal alignment between prefixes of s reversed and t reversed.

10 Algorithm: Finding the Midpoint
Define
- V[i,j] = V(s[1,i],t[1,j]) (“forward”)
- U[i,j] = V(s[i+1,n],t[j+1,m]) (“backward”)
Then V[i,j] + U[i,j] is the score of the best alignment through (i,j).
- We compute V[i,j] as we did before.
- We compute U[i,j] in a similar manner, going “backward” from U[n,m].
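A minimal sketch of the midpoint computation, under assumed scoring values (match +1, mismatch -1, gap -2) and with hypothetical helper names `last_row` and `midpoint`. The backward values are obtained by running the same forward routine on the reversed strings.

```python
def last_row(s, t, match=1, mismatch=-1, gap=-2):
    """Last row of the global-alignment DP table for s vs t, in O(len(t)) space."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev


def midpoint(s, t):
    """Return (j, score): an optimal alignment crosses row len(s)//2 at column j."""
    half = len(s) // 2
    m = len(t)
    forward = last_row(s[:half], t)               # forward[j] = V[n/2, j]
    backward = last_row(s[half:][::-1], t[::-1])  # backward[k]: s[half:] vs last k letters of t
    totals = [forward[j] + backward[m - j] for j in range(m + 1)]
    best_j = max(range(m + 1), key=totals.__getitem__)
    return best_j, totals[best_j]
```

The score returned by `midpoint` equals the full global alignment score, which gives a simple sanity check against the standard DP. A complete Hirschberg implementation would recurse on the two halves; that part is omitted here.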

11 Space Complexity of the Algorithm
We first find the j for which V[n/2,j] + U[n/2,j] is maximized. To do this, we need the values V[n/2,*] and U[n/2,*], which take O(n+m) space to compute. Once the “midpoint” is computed, we keep it in memory (constant memory) and then solve the sub-problems recursively. The recursion depth is O(log n). The memory requirement is O(1) per level, plus O(m+n) memory that is reused at all recursion levels, for O(n+m) memory overall.

12 Time Complexity
- Time to find a midpoint: cnm (c is a constant).
- The two recursive sub-problems have sizes (n/2,j) and (n/2,m-j), hence T(n,m) = cnm + T(n/2,j) + T(n/2,m-j).
Lemma: T(n,m) ≤ 2cnm.
Proof (by induction): T(n,m) ≤ cnm + 2c(n/2)j + 2c(n/2)(m-j) = 2cnm.
Thus the time complexity is linear in the size of the DP matrix: at worst, twice the cost of the regular solution.

13 Local Alignment
The alignment version we have studied so far is called global alignment: we align the whole sequence s to the whole sequence t. Global alignment is appropriate when s and t are highly similar (examples?), but makes little sense if they are highly dissimilar, for example when s (“the query”) is very short but t (“the database”) is very long.

14 Local Alignment
When s and t are not necessarily similar, we may want to consider a different question:
- Find similar subsequences of s and t.
- Formally, given s[1..n] and t[1..m], find i, j, k, and l such that V(s[i..j],t[k..l]) is maximal.
This version is called local alignment.

15 Local Alignment
- As before, we use dynamic programming.
- We now want V[i,j] to record the maximum value over all alignments of a suffix of s[1..i] and a suffix of t[1..j]. In other words, we look for a suffix of a prefix.
- How should we change the recurrence rule? Same as before, but with an option to start afresh.
- The result is called the Smith-Waterman algorithm, after its inventors (1981).

16 Local Alignment
New option: we can start a new alignment instead of extending a previous one. This corresponds to the alignment of two empty suffixes, which scores 0.

17-21 Local Alignment Example
s = TAATA, t = TACTAA
[Figure: the local-alignment DP matrix for s and t, filled in step by step over five slides]
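The local-alignment recurrence can be sketched in code as below. The scoring values (match +1, mismatch -1, gap -1) are an assumption, since the scores used in the slides' figure are not recoverable from the transcript, so the result need not match the original matrix.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned part of s, aligned part of t) for a local alignment."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,                      # start afresh
                          V[i - 1][j - 1] + sub,  # extend with a (mis)match
                          V[i - 1][j] + gap,      # gap in t
                          V[i][j - 1] + gap)      # gap in s
            if V[i][j] > best:
                best, best_pos = V[i][j], (i, j)
    # Trace back from the best cell until a cell with value 0 is reached.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return best, "".join(reversed(a)), "".join(reversed(b))
```

For example, `smith_waterman("TAATA", "TACTAA")` can be called on the two sequences from the slides. Several local alignments may share the best score; this sketch reports the first one found in row-major order.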

22 Similarity vs. Distance
Two related notions for sequence comparison. Roughly:
- Similarity of 2 sequences? Count matches.
- Distance of 2 sequences? Count mismatches.
Similarity can be either positive or negative. Distance is always non-negative (≥ 0), and identical sequences have zero distance.
HIGH SIMILARITY = LOW DISTANCE

23 Similarity vs. Distance
So far, the alignment scores we dealt with were similarity scores. We sometimes want to measure the distance between sequences rather than their similarity (e.g. in evolutionary reconstruction).
- Can we convert one score to the other (similarity to distance)?
- What should a distance function satisfy?
- Of the global and local versions of alignment, only one is appropriate for a distance formulation.

24 Remark: Edit Distance
In many stringology applications, one often talks about the edit distance between two sequences, defined as the minimum number of edit operations needed to transform one sequence into the other.
- “no change” is charged 0
- “replace” and “indel” are charged 1
This problem can be solved as a global distance alignment problem, using DP. It can easily be generalized to unequal “costs” of replace and indel. To satisfy the triangle inequality, “replace” should not be more expensive than two “indels”.
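As a small illustration of the remark above, edit distance can be computed with the same DP scheme, minimizing cost instead of maximizing score. The function and parameter names are hypothetical; the default costs follow the slide (no change 0, replace 1, indel 1).

```python
def edit_distance(s, t, replace_cost=1, indel_cost=1):
    """Minimum total cost of edit operations turning s into t (two-row DP)."""
    prev = [j * indel_cost for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel_cost] + [0] * len(t)
        for j in range(1, len(t) + 1):
            change = 0 if s[i - 1] == t[j - 1] else replace_cost
            curr[j] = min(prev[j - 1] + change,     # no change / replace
                          prev[j] + indel_cost,     # delete s[i]
                          curr[j - 1] + indel_cost) # insert t[j]
        prev = curr
    return prev[-1]
```

If `replace_cost` is set above `2 * indel_cost`, the DP never uses a replace, because a deletion followed by an insertion is cheaper. This is the situation the triangle-inequality condition rules out.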

25 Alignment with Affine Gap Scores
Observation: insertions and deletions often occur in blocks longer than a single nucleotide.
Consequence: the standard alignment scoring studied in the lecture, which gives a constant penalty d per gap unit, does not model this phenomenon well. Hence, a better gap score model is needed.
Question: can you think of an appropriate change to the scoring system for gaps?

26 More Motivation for Gap Penalties (Improved Pricing of InDels)
Motivation: aligning cDNAs to genomic DNA.
[Figure: a cDNA query aligned against genomic DNA]
Example: in this case, if we penalize every single gap by -1, the similarity score will be very low, and the parent DNA will not be picked up.

27 Variants of Sequence Alignment
We have seen two variants of sequence alignment:
- Global alignment
- Local alignment
Other variants, in the books and in recitation, can also be solved with the same basic idea of dynamic programming:
1. Using an affine cost V(g) = -d - (g-1)e for gaps of length g. The -d (“gap open”) is for introducing a gap, and the -e (“gap extend”) is for continuing the gap. We used d=e=2 in the naïve scoring, but could use a smaller e.
2. Finding the best overlap.
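One common way to handle the affine cost V(g) = -d - (g-1)e with dynamic programming is a three-table recurrence, usually attributed to Gotoh. The sketch below is an illustration under assumed match/mismatch scores (+1/-1), not the variant presented in recitation; the table names M, X, and Y are hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(s, t, match=1, mismatch=-1, d=2, e=1):
    """Global alignment score where a gap of length g costs d + (g - 1) * e."""
    n, m = len(s), len(t)
    # M: s[i] aligned to t[j]; X: s[i] aligned to a gap; Y: t[j] aligned to a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -d - (i - 1) * e   # one leading gap of length i
    for j in range(1, m + 1):
        Y[0][j] = -d - (j - 1) * e   # one leading gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            X[i][j] = max(M[i - 1][j] - d,   # open a new gap
                          X[i - 1][j] - e,   # extend the current gap
                          Y[i - 1][j] - d)   # switch gap direction: a new gap
            Y[i][j] = max(M[i][j - 1] - d, Y[i][j - 1] - e, X[i][j - 1] - d)
    return max(M[n][m], X[n][m], Y[n][m])
```

With d = e the result coincides with the constant-penalty scoring, which is a convenient check. With e < d, one long gap becomes cheaper than several short ones.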

28 Specific Gap Penalties in Use
Motivation: insertions and deletions are rare in evolution, but once they are created, they are easier to extend.
Examples (in the popular BLAST and FASTA, to be studied soon):
BLAST: cost to open a gap = 10 (high penalty); cost to extend a gap = 0.5 (low penalty).
FASTA:

29 Alignment in Real Life
- One of the major uses of alignments is to find sequences in a large “database” (e.g. GenBank).
- The current protein database contains about 100 million (i.e., 10^8) residues! So searching with a target sequence of length 1000 requires evaluating about 10^11 matrix cells, which takes approximately three hours on a processor running 10 million evaluations per second.
- Quite annoying when, say, 1000 target sequences need to be searched, because it will take about four months to run.
- So even O(nm) is too slow.
- We need something faster!
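The running-time estimate above can be checked with a few lines of arithmetic. The function name is hypothetical; the numbers (10^8 residues, length 1000, 10^7 cell evaluations per second) are the ones quoted in the slide.

```python
def full_dp_search_cost(db_residues, query_len, cells_per_second):
    """Return (matrix cells, seconds) for aligning one query against the whole database."""
    cells = db_residues * query_len
    return cells, cells / cells_per_second


cells, seconds = full_dp_search_cost(10**8, 1000, 10**7)
hours_per_query = seconds / 3600               # a little under three hours
days_for_1000_queries = 1000 * seconds / 86400  # roughly four months
```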

30 Heuristic Fast Search
- Instead, most searches rely on heuristic procedures.
- These are not guaranteed to find the best match.
- Sometimes, they will completely miss a high-scoring match.
We now describe the main ideas used by the best known of these heuristic procedures.

31 Basic Intuition
- Almost all heuristic search procedures are based on the observation that good real-life alignments usually contain long runs with no gaps (mostly matches, maybe a few mismatches).
- These heuristics try to find significant gap-less runs and then extend them.

32 A Simple Graphical Representation - Dot Plot
Put a dot at every position with identical nucleotides in the two sequences.
Sequences: CTTAGGACT and GAGGACT
[Figure: dot plot of CTTAGGACT against GAGGACT]

33 A Simple Graphical Representation - Dot Plot
Put a dot at every position with identical nucleotides in the two sequences.
- Long diagonals with dots = long matches (good!)
- Short dotted diagonals = short matches (unimpressive)
[Figure: dot plot of CTTAGGACT against GAGGACT, with the long and the short diagonals marked]

34 Getting Rid of Short Diagonals - “word size”
Start with the original dot plot. Retain a run along a diagonal only if it has “word size” length of 6 or more (for DNA). This “word size” is called Ktup in FASTA and W in BLAST.
[Figure: dot plot of CTTAGGACT against GAGGACT after filtering]
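The filtering step can be sketched as follows: find every maximal diagonal run of identical letters and keep only runs of at least the word size. This is a quadratic-time illustration with a hypothetical function name, not how FASTA or BLAST find their words.

```python
def dot_plot_runs(s, t, word_size):
    """Return (i, j, length) for maximal diagonal runs of dots with length >= word_size.

    Positions are 0-based: the run covers s[i:i+length] and t[j:j+length].
    """
    runs = []
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue                 # no dot at (i, j)
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue                 # (i, j) lies inside a run that started earlier
            length = 0
            while (i + length < len(s) and j + length < len(t)
                   and s[i + length] == t[j + length]):
                length += 1
            if length >= word_size:
                runs.append((i, j, length))
    return runs
```

On the two sequences from the dot-plot slides, `dot_plot_runs("CTTAGGACT", "GAGGACT", 6)` keeps the single long diagonal (the shared AGGACT), while a word size of 1 keeps every isolated dot as well.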

35 Banded DP
- Suppose that we have two strings s[1..n] and t[1..m] such that n ≤ m.
- If the optimal alignment of s and t has few gaps, then the path of the alignment will be close to the diagonal.
[Figure: DP matrix of s against t with an alignment path near the diagonal]

36 Banded DP
- To find such a path, it suffices to search in a diagonal region of the matrix.
- If the diagonal band has width k, then the dynamic programming step takes O(kn) time, much faster than the O(n^2) of standard DP.
- Boundary values are set to 0 (we’re doing local alignment).
[Figure: band of width k around the diagonal. At the upper edge of the band, cell V[i+1, i+k/2+1] can use V[i, i+k/2], but its neighbor V[i, i+k/2+1] is out of range.]
Note that i-j is constant along a diagonal.
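A minimal sketch of banded local alignment, assuming match +1, mismatch -1, gap -1. The band is described here by a diagonal offset `diag` and a `half_width` (both hypothetical parameter names), so that only cells with |(j - i) - diag| <= half_width are filled; cells outside the band read as 0, following the boundary rule above.

```python
def banded_local_score(s, t, diag=0, half_width=2, match=1, mismatch=-1, gap=-1):
    """Best local alignment score using only cells with |(j - i) - diag| <= half_width."""
    best = 0
    prev = {}  # row i-1 of the band: column -> score
    for i in range(1, len(s) + 1):
        curr = {}
        lo = max(1, i + diag - half_width)
        hi = min(len(t), i + diag + half_width)
        for j in range(lo, hi + 1):      # at most 2 * half_width + 1 cells per row
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score = max(0,
                        prev.get(j - 1, 0) + sub,  # same diagonal, previous row
                        prev.get(j, 0) + gap,      # out-of-band cells count as 0
                        curr.get(j - 1, 0) + gap)
            curr[j] = score
            best = max(best, score)
        prev = curr
    return best
```

The work is proportional to the band width times the length of s. The result can be lower than the unrestricted local score when the best alignment leaves the band, which is why the choice of diagonal matters.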

37 Banded DP for Local Alignment
Problem: where is the banded diagonal? It need not be the main diagonal when looking for a good local alignment. How do we select which subsequences to align using banded DP?
We heuristically find potential diagonals and evaluate them using banded DP. This is the main idea of FASTA.
[Figure: band of width k around an off-main diagonal of the s vs t matrix]

38 Finding Potential Diagonals
Suppose that we have a relatively long gap-less alignment:
AGCGCCATGGATTGAGCGA
TGCGACATTGATCGACCTA
- Can we find “clues” that will let us find it quickly?
- Each such alignment defines a potential diagonal, which is then evaluated using banded DP.

39 Signature of a Match
Assumption: good alignments contain several “patches” of perfect matches.
AGCGCCATGGATTGAGCTA
TGCGACATTGATCGACCTA
Since this is a gap-less alignment, all perfect-match regions should be on the same diagonal.

40 FASTA - Finding Ungapped Matches
Input: strings s and t, and a parameter ktup.
- Find all pairs (i,j) such that s[i..i+ktup-1] = t[j..j+ktup-1].
- Locate sets of matching pairs that are on the same diagonal, by sorting according to the difference i-j.
- Compute the score for the diagonal that contains all these pairs.

41 FASTA - Finding Ungapped Matches
Input: strings s and t, and a parameter ktup.
Find all pairs (i,j) such that s[i..i+ktup-1] = t[j..j+ktup-1]:
- Step one: preprocess an index of the database. For every sequence of length ktup, make a list of all positions where it appears. This takes linear time (why?).
- Step two: run over all substrings of length ktup of the query sequence (time is linear in the query size) and identify all matches (i,j).
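The two steps can be sketched with a dictionary-based index. This is a simplified illustration with hypothetical function names; real FASTA implementations differ in many details. Positions are 0-based.

```python
from collections import defaultdict


def ktup_index(t, ktup):
    """Step one: map every length-ktup word of t to the list of positions where it starts."""
    index = defaultdict(list)
    for j in range(len(t) - ktup + 1):
        index[t[j:j + ktup]].append(j)
    return index


def diagonal_hits(s, t, ktup):
    """Step two: count exact ktup matches (i, j), grouped by diagonal i - j."""
    index = ktup_index(t, ktup)
    hits = defaultdict(int)
    for i in range(len(s) - ktup + 1):
        for j in index.get(s[i:i + ktup], ()):
            hits[i - j] += 1
    return dict(hits)
```

Diagonals with many hits are the candidates passed on to the later stages. Grouping by i - j in a dictionary plays the role of the sorting step described earlier.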

42 FASTA - Using Banded DP
Final steps:
- List the highest-scoring diagonal matches.
- Run banded DP on the region containing any high-scoring diagonal (say, with width 12).
Hence, the algorithm may combine some diagonals into gapped matches (in the example below, it combines diagonals 2 and 3).
[Figure: s vs t matrix with three diagonals labeled 1, 2, 3]

43 FASTA - Practical Choices
Most applications of FASTA use a fairly small ktup (2 for proteins, and 6 for DNA). Higher values are faster, yielding fewer diagonals to search around, but increase the chance of missing the optimal local alignment.
Some implementation choices and tricks are not spelled out here.

44 Effect of Word Size (ktup)
- Large word size: fast, less sensitive, more selective. Distant relatives do not have many runs of matches, and unrelated sequences stand no chance of being selected.
- Small word size: slow, more sensitive, less selective.
Example: if ktup = 3, we will consider all substrings containing TCG in this sequence (very sensitive compared to a large word size, but less selective: it will find all TCGs).

45 FASTA Visualization
- Identify all hot spots longer than Ktup. Ignore all short hot spots. The longest hot spot is called init1.
- Extend hot spots to longer diagonal runs. The longest diagonal run is initn.
- Merge diagonal runs. Optimize using SW in a narrow band. The best result is called opt.

46 FastA Output
FastA produces a list, where each entry looks like:
EM_HUM:AF Homo sapiens glucocer (5420) [f] e-176
Each entry contains:
- The database name and entry (accession number).
- The species and a short gene name.
- The length of the sequence.
- Scores: the similarity score of the optimal alignment (opt), the bits score, and the E-value. The latter two both measure the statistical significance of the alignment.

47 FastA Output - Explanation
- E-value is the theoretically expected number of false hits (“random sequences”) per sequence query, given a similarity score (a statistical significance threshold for reporting matches against database sequences). A low E-value means high significance, and fewer matches will be reported.
- Bits is an alternative statistical measure of significance. High bits means high significance.
- Some versions also display a z-score, a measure similar to bits.

48 What Is a Significant E-Value?
How many false positives should we expect? For an E-value of 10^-4: 1 in 10,000.
Database     No. of Entries    False Positives
SwissProt    105,
PIR-PSD      283,
TrEMBL       594,

49 Expect Value (E) and Score (S)
- The expect value estimates how often an alignment score as good as the one found between a query sequence and a database sequence would be found by random chance.
- Example: E-value 10^-2 = 1 in 100 will have the same score.
- For a given score, the E-value increases with increasing size of the database.
- For a given database, the E-value decreases exponentially with increasing score.

50 A Histogram for Observed (=) vs Expected (*)
[Figure: histogram of opt scores. Most scores follow the “usual” bell curve; the “unexpected” high-score sequences stand out from it (signal vs noise).]

51 FASTA - Summary
Input: strings s and t, and a parameter ktup = 2 or 6 or the user’s choice, depending on the application.
Output: a high-scoring local alignment.
1. Find pairs of matching substrings s[i..i+ktup-1] = t[j..j+ktup-1].
2. Extend to ungapped diagonals.
3. Extend to a gapped alignment using banded DP.
4. Can you think of an example of a pair of sequences that has a high local similarity score but will be missed by FASTA?

52 BLAST Overview
Basic Local Alignment Search Tool (BLAST is one of the most quoted papers ever).
Input: strings s and t, and a parameter T = threshold value.
Output: a high-scoring local alignment.
Definition: two strings s and t of length k are a high scoring pair (HSP) if V(s,t) > T (we usually consider un-gapped alignments only, but not necessarily perfect matches).
1. Find high scoring pairs of substrings such that V(s,t) > T.
2. These words serve as seeds for finding longer matches.
3. Extend to ungapped diagonals (as in FASTA).
4. Extend to gapped matches.

53 BLAST Overview (cont.)
Step 1: find high scoring pairs of substrings such that V(s,t) > T (the seeds):
- Find all strings of length k which score at least T with substrings of s in a gapless alignment (k = 4 for AA, 11 for DNA). Note: possibly, not all k-words must be tested, e.g. when such a word scores less than T with itself.
- Find in t all exact matches with each of the above strings.
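The first sub-step, listing all k-words that score at least T against a given word of s, can be sketched by brute-force enumeration. The +1/-1 match/mismatch scoring and the function name are assumptions for illustration only; protein searches would score pairs with a substitution matrix instead, and enumeration is only practical for small k.

```python
from itertools import product


def neighborhood(word, threshold, alphabet="ACGT", match=1, mismatch=-1):
    """All words of len(word) over alphabet whose gapless score against word is >= threshold."""
    result = []
    for cand in product(alphabet, repeat=len(word)):
        score = sum(match if a == b else mismatch for a, b in zip(word, cand))
        if score >= threshold:
            result.append("".join(cand))
    return result
```

For a 3-letter DNA word and threshold 1, this yields the word itself plus the nine words that differ from it in exactly one position. Each of these would then be looked up in t by exact matching.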

54 Extending Potential Matches
Once a seed is found, BLAST attempts to find a local alignment that extends the seed. Seeds on the same diagonal are combined (as in FASTA), then extended as far as possible in a greedy manner.
- During the extension phase, the search stops when the score drops below some lower bound computed by BLAST (to save time).
- A few extensions with the highest scores are kept, and an attempt is made to join them, even if they are on distant diagonals, provided the join improves both scores.
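A minimal sketch of greedy ungapped extension to the right of a seed. The stopping rule used here, giving up once the running score falls more than `x_drop` below the best score seen, is an assumed stand-in for the lower bound mentioned above; the parameter names and the +1/-1 scoring are likewise illustrative.

```python
def extend_right(s, t, i, j, k, x_drop=3, match=1, mismatch=-1):
    """Extend the seed s[i:i+k] / t[j:j+k] rightwards without gaps.

    Returns (best score, length of the best extension, counted from the seed start).
    """
    score = sum(match if s[i + p] == t[j + p] else mismatch for p in range(k))
    best, best_len = score, k
    p = k
    while i + p < len(s) and j + p < len(t) and best - score <= x_drop:
        score += match if s[i + p] == t[j + p] else mismatch
        p += 1
        if score > best:
            best, best_len = score, p  # remember where the best extension ended
    return best, best_len
```

A symmetric routine would extend to the left, and the two results would be combined into one ungapped segment.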

55 Statistical Significance
To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone. Chance could mean sampling from:
(i) Real but non-homologous sequences.
(ii) Real sequences that are shuffled to preserve compositional properties.
(iii) Sequences generated randomly based upon a DNA or protein sequence model.

56 Statistical Significance II
- Even for the simplest random models and scoring systems, very little is known about the random distribution of optimal global alignment scores.
- Monte Carlo experiments can provide rough distributional results for some specific scoring systems and sequence compositions.

57 Statistical Significance III
- Statistics for the scores of local alignments are well understood, especially for gap-less alignments.
- Gap-less alignment = a pair of equal-length segments, one from each of the two sequences.
- Assuming an independence model for letters, plus a score function whose expected value on pairs of letters is negative (why?), one can prove that the expected number of sequences with score at least S (the E-value for the score S) equals E = K m n e^(-λS), where m and n are the lengths of the two sequences and K, λ are constants (depending on the model alone).
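As a small numeric illustration, assume the E-value takes the Karlin-Altschul form E = K m n e^(-λS), with m and n the query and database lengths. The function names are hypothetical, and any K and λ values passed in are placeholders rather than parameters of a real scoring system. The bit score shown is the usual normalization S' = (λS - ln K) / ln 2, so that E = m n 2^(-S').

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments with score >= `score` (assumed form K*m*n*exp(-lam*S))."""
    return K * m * n * math.exp(-lam * score)


def bit_score(score, K, lam):
    """Normalized score S' such that E = m * n * 2 ** (-S')."""
    return (lam * score - math.log(K)) / math.log(2)
```

Doubling the database length n doubles E for a fixed score, and each additional score unit shrinks E by a factor of e^λ, in line with the two trends noted for the expect value.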

58 BLAST Statistics Theory
BLAST is the most frequently used sequence alignment program. An impressive statistical theory, drawing on the renewal theorem, random walks, and sequential analysis, was developed for analyzing the statistical significance of BLAST results. These are all out of scope for this course. See the book “Statistical Methods in Bioinformatics” by Ewens and Grant (Springer 2001) for many details, or the NCBI tutorial for not that many details.

59 Scoring Functions, Reminder
So far, we discussed dynamic programming algorithms for
- global alignment
- local alignment
All of these assumed a scoring function that determines the value of perfect matches, substitutions, insertions, and deletions.

60 Where Does the Scoring Function Come From?
We have defined an additive scoring function by specifying a function σ(·,·) such that
- σ(x,y) is the score of replacing x by y
- σ(x,-) is the score of deleting x
- σ(-,x) is the score of inserting x
But how do we come up with the “correct” score? Answer: by encoding experience of what similar sequences are for the task at hand. Similarity depends on time, evolution trends, and sequence types.

61 Why Is a Probabilistic Setting Appropriate for Defining and Interpreting a Scoring Function?
Similarity is probabilistic in nature because biological changes such as mutation, recombination, and selection are random events. We could answer questions such as:
- How probable is it for two sequences to be similar?
- Is the similarity found significant or spurious?
- How should a similarity score change when, say, the mutation rate of a specific area on the chromosome becomes known?