Sequence comparison and database search
Strings
String (sequence): an ordered succession of characters (symbols).
Alphabet: (1) the DNA alphabet {A, C, G, T} of nucleotides; (2) the 20-character alphabet of amino acids.
Length: |s| is the length of s, and s[i] is its i-th character for i = 1…|s|. Example: s = AATGCA, s[3] = T.
Empty string: ε, with |ε| = 0.
Strings: subsequence and supersequence
Subsequence: a subsequence of s is a sequence that can be obtained from s by removing some characters.
Supersequence: when sequence t is a subsequence of sequence s, we say that s is a supersequence of t.
Strings: substring and superstring
Substring: a substring of s is a string formed by consecutive characters of s, in the same order as they appear in s.
Superstring: when sequence t is a substring of sequence s, we say that s is a superstring of t.
Interval: an interval of a string s is a set of consecutive indices [i..j]. s[i..j] denotes the substring s[i] s[i+1] … s[j], and the empty string when i = j+1. Therefore, for any substring t of s there is at least one interval [i..j] of s with t = s[i..j].
Strings: concatenation, prefix, suffix
Concatenation: the concatenation of two strings s and t is written st (t placed after s).
Repeated concatenation of the same string is written as a power: s^3 = sss.
Prefix: a prefix of s is any substring of s of the form s[1..j] for 0 ≤ j ≤ |s|. Equivalently, t is a prefix of s ⇔ s = tu for some string u.
prefix(s, k): the prefix of s with exactly k characters, 0 ≤ k ≤ |s|.
Suffix: a suffix of s is any substring of the form s[i..|s|] for 1 ≤ i ≤ |s|+1.
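The string notions above map directly onto Python slicing. The sketch below is only an illustration; the helper names (`interval`, `prefix`, `suffix`, `is_subsequence`) are hypothetical and chosen here to mirror the slides' 1-based, inclusive notation.

```python
def interval(s: str, i: int, j: int) -> str:
    """s[i..j] in the slides' 1-based, inclusive notation; empty when i = j + 1."""
    return s[i - 1:j]

def prefix(s: str, k: int) -> str:
    """The prefix of s with exactly k characters (0 <= k <= |s|)."""
    return s[:k]

def suffix(s: str, i: int) -> str:
    """The suffix s[i..|s|] (1 <= i <= |s| + 1)."""
    return s[i - 1:]

def is_subsequence(t: str, s: str) -> bool:
    """True if t can be obtained from s by removing some characters."""
    remaining = iter(s)
    # 'ch in remaining' advances the iterator, so the order of s is respected.
    return all(ch in remaining for ch in t)

s = "AATGCA"
print(interval(s, 3, 3))          # the single character s[3]
print("ATG" in s)                 # substring test
print(is_subsequence("ATA", s))   # subsequence test
print(s * 3 == s + s + s)         # s^3 = sss
```

A substring is always a subsequence, but not the other way round: ATA is a subsequence of AATGCA without being a substring.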
Sequence comparison and database search: outline
What is sequence comparison?
How to compare sequences:
1. Compare two sequences
   a. compare the entire sequences
   b. compare parts of the sequences
2. Compare more than two sequences
biological background
[Figure: three example alignments of DNA sequence pairs; vertical bars mark matching columns and dots mark gaps.]
Pair 1:
  TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
  TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG
Pair 2:
  AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
  AAGGATC TCAGTAATTAATCATGCACCTATGTGGCGG
Pair 3:
  AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
  AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT
biological background
Two notions
Similarity: a measure of how similar two sequences are.
Alignment: a basic operation for comparing two sequences; a way of placing one sequence above the other in order to make clear the correspondence between similar characters or substrings of the sequences.
biological background
Alignment: a correspondence between the elements of two sequences in which order (topology) is kept.
Pairwise alignment: 2 sequences aligned.
Multiple alignment: an alignment of 3 or more sequences.
Examples (a colon marks a column of identical characters):
  FSEYTTHRGHR
  :  ::::: ::
  FESYTTHRPHR

  FESYTTHRGHR
  :::::::: ::
  FESYTTHRPHR
biological background
Four cases, tasks and applications
(1) Two sequences of tens of thousands (10,000s) of characters that differ only by isolated insertions, deletions, and substitutions, as rarely as one per hundred (100) characters. Task: find the differences. Application: comparing two different sequencings.
(2) Two sequences: decide whether a prefix of one is similar to a suffix of the other.
(3) Two sequences: decide whether there are two similar substrings, one from each sequence. Application: analyzing conserved sequences.
comparing two sequences
Alignments involve either:
global comparisons: the entire sequences are compared;
local comparisons: just substrings of the sequences are compared.
(Liao L, Noble WS: Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships. J Comput Biol 2003, 10(6).)
global comparison: example
Example: aligning GACGGATTAG and GATCGGAATAG.
  GA-CGGATTAG
  GATCGGAATAG
The second sequence has an extra T (third column), and one column pairs T with A (eighth column). A space is written as a dash.
global comparison: the basic algorithm
Definitions
Alignment: insert spaces into the sequences so that they have the same size, then place one over the other to create a correspondence between their characters and spaces. A column with spaces in both sequences is not allowed. Spaces may also be inserted at the beginning or end.
Scoring function: a measure of similarity between elements (nucleotides, amino acids, gaps):
  a match (identical characters): +1
  a mismatch (distinct characters): -1
global comparison: the basic algorithm
  a space: -2
Scoring system: reward matches and penalize mismatches and spaces.
  GA-CGGATTAG
  GATCGGAATAG
What is the score of this alignment?
Similarity: sim(s, t) is the maximum alignment score over all alignments of s and t. Many alignments may attain it.
Best (optimal) alignment: an alignment whose score equals the similarity.
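The question on this slide can be answered by summing column scores. The following is a minimal sketch, assuming the scoring system stated above (match +1, mismatch -1, space -2) and using an ASCII hyphen for the space; the function name `alignment_score` is hypothetical.

```python
MATCH, MISMATCH, SPACE = 1, -1, -2

def alignment_score(top: str, bottom: str) -> int:
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column may not contain two spaces")
        if a == "-" or b == "-":
            score += SPACE
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
    return score

print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # 9 matches, 1 mismatch, 1 space: 6
```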
global comparison: the basic algorithm
Needleman-Wunsch: the basic DP algorithm for comparing two sequences.
The number of alignments between two sequences is exponential, so trying them all is not feasible.
Efficient algorithm: dynamic programming (DP), which solves the problem for all pairs of prefixes, from shorter to larger.
An example: s = AAAC, t = AGC.
Idea
Use an (m+1)×(n+1) array a: entry (i, j) holds the similarity between s[1..i] and t[1..j].
p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j].
a[i, j] = max(a[i, j-1] + g, a[i-1, j-1] + p(i, j), a[i-1, j] + g)
global comparison: the basic algorithm
[Figure: the DP array for s = AAAC and t = AGC.]
global comparison: the basic algorithm
2. A good computing order:
   row by row, left to right on each row;
   column by column, top to bottom on each column;
   any other order that makes sure a[i, j-1], a[i-1, j], and a[i-1, j-1] are available when a[i, j] must be computed.
3. Notes:
   the parameter g specifies the space penalty (usually g < 0); here g = -2;
   the scoring function p for pairs of characters is p(a, b) = 1 if a = b, and p(a, b) = -1 if a ≠ b.
global comparison: the basic algorithm
Algorithm Similarity
  input: sequences s and t
  output: similarity between s and t
  m ← |s|
  n ← |t|
  for i ← 0 to m do
      a[i, 0] ← i × g
  for j ← 0 to n do
      a[0, j] ← j × g
  for i ← 1 to m do
      for j ← 1 to n do
          a[i, j] ← max(a[i, j-1] + g, a[i-1, j-1] + p(i, j), a[i-1, j] + g)
  return a[m, n]
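Algorithm Similarity translates almost line for line into Python. This is a sketch under the slides' scoring (p = +1/-1, g = -2); the name `similarity_matrix` is hypothetical, and indices are shifted because Python strings are 0-based.

```python
def similarity_matrix(s: str, t: str, g: int = -2) -> list[list[int]]:
    """Return the (m+1) x (n+1) array a; a[i][j] is the similarity of s[1..i] and t[1..j]."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g               # prefix of s against the empty string
    for j in range(n + 1):
        a[0][j] = j * g               # empty string against a prefix of t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = 1 if s[i - 1] == t[j - 1] else -1
            a[i][j] = max(a[i][j - 1] + g,       # space in s
                          a[i - 1][j - 1] + p,   # two symbols
                          a[i - 1][j] + g)       # space in t
    return a

a = similarity_matrix("AAAC", "AGC")
for row in a:
    print(row)
print("similarity:", a[-1][-1])
```

For s = AAAC and t = AGC the last entry is -1, which is the score of both alignments AAAC/AG-C and AAAC/-AGC.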
optimal alignments
How do we construct an optimal alignment between two sequences, given their similarity?
Idea of Algorithm Align
All we need to do is start at entry (m, n) and follow the arrows until we get to (0, 0).
An optimal alignment can easily be constructed from right to left if we have the array a computed by the basic algorithm.
The variables align-s and align-t are treated as globals in the code.
The call Align(m, n, len) constructs an optimal alignment.
Note: max(|s|, |t|) ≤ len ≤ m+n.
Recursive algorithm for optimal alignment
Algorithm Align
  input: indices i, j; array a given by algorithm Similarity
  output: alignment in align-s, align-t, and length in len
  if i = 0 and j = 0 then
      len ← 0
  else if i > 0 and a[i, j] = a[i-1, j] + g then
      Align(i-1, j, len)
      len ← len + 1
      align-s[len] ← s[i]
      align-t[len] ← -
  else if i > 0 and j > 0 and a[i, j] = a[i-1, j-1] + p(i, j) then
      Align(i-1, j-1, len)
      len ← len + 1
      align-s[len] ← s[i]
      align-t[len] ← t[j]
  else  // has to be j > 0 and a[i, j] = a[i, j-1] + g
      Align(i, j-1, len)
      len ← len + 1
      align-s[len] ← -
      align-t[len] ← t[j]
optimal alignments
Arrow preference
When there is a choice, a column with a space in t has precedence over a column with two symbols, which in turn has precedence over a column with a space in s.
Preference order: space in t (maximum preference), two symbols, space in s (minimum preference).
For s = AAAC and t = AGC this gives
  AAAC
  AG-C
rather than
  AAAC
  -AGC
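The traceback with this arrow preference can be sketched as follows. Note that this is an iterative variant of the recursive Algorithm Align, not a literal transcription: it collects columns from right to left and reverses them at the end. The function names are hypothetical and the scoring is the slides' +1/-1 with g = -2.

```python
def similarity_matrix(s: str, t: str, g: int = -2) -> list[list[int]]:
    """DP array of Algorithm Similarity (p = +1 for a match, -1 for a mismatch)."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g
    for j in range(n + 1):
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = 1 if s[i - 1] == t[j - 1] else -1
            a[i][j] = max(a[i][j - 1] + g, a[i - 1][j - 1] + p, a[i - 1][j] + g)
    return a

def align(s: str, t: str, g: int = -2) -> tuple[str, str]:
    """Trace back from (m, n) to (0, 0), preferring: space in t, two symbols, space in s."""
    a = similarity_matrix(s, t, g)
    i, j = len(s), len(t)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and a[i][j] == a[i - 1][j] + g:
            top.append(s[i - 1]); bottom.append("-")       # space in t
            i -= 1
        elif i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + (1 if s[i - 1] == t[j - 1] else -1):
            top.append(s[i - 1]); bottom.append(t[j - 1])  # two symbols
            i -= 1; j -= 1
        else:                                              # must be a space in s
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))

print(*align("AAAC", "AGC"), sep="\n")
```

Each loop iteration emits one column, so the traceback takes O(len) = O(m+n) steps once the array is available.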
optimal alignments
Complexity of the algorithms (time and space):
Basic dynamic programming (Algorithm Similarity, comparing two sequences): O(mn), or O(n^2) when both sequences have length about n.
Recursive algorithm for an optimal alignment (Algorithm Align): O(len) = O(m+n).
Scoring matrix
Scoring function:
  p(a, a) = 1
  p(a, b) = -1 for a ≠ b
  p(a, -) = p(-, b) = -2
nucleic acid matrix (1)
BLAST matrix: the default matrix for BLAST (match 5, mismatch -4)
       A    T    C    G
  A    5   -4   -4   -4
  T   -4    5   -4   -4
  C   -4   -4    5   -4
  G   -4   -4   -4    5
nucleic acid matrix (1)
Transition-transversion matrix (match 1, transition -1, transversion -5)
       A    T    C    G
  A    1   -5   -5   -1
  T   -5    1   -1   -5
  C   -5   -1    1   -5
  G   -1   -5   -5    1
protein substitution matrix
PAM (Point Accepted Mutation): a point accepted mutation, also known as a PAM, is the replacement of a single amino acid in the primary structure of a protein with another single amino acid, which is accepted by the processes of natural selection.
BLOSUM (BLOcks SUbstitution Matrix): a substitution matrix used for sequence alignment of proteins.
Blocks Substitution Matrix (BLOSUM)
Scores are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins.
The matrix name indicates evolutionary distance: BLOSUM62 was created using sequences sharing no more than 62% identity.
Example: global alignment of VDSCY and VESLCY using BLOSUM62 with gap penalty d = -11.
global comparison (1)
         Gap    V      D      S      C      Y
  Gap    0      1gap   2gap   …
  V      1gap
  E      2gap
  S      …
  L
  C
  Y

global comparison (2)
         Gap    V      D      S      C      Y
  Gap    0     -11    -22    -33    -44    -55
  V     -11    S_ij
  E     -22
  S     -33
  L     -44
  C     -55
  Y     -66

global comparison (3)
The same matrix, with i indexing rows and j indexing columns.
Needleman-Wunsch:
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + \sigma(x_i, y_j) \\ S_{i-1,j} + d & \text{(up to down)} \\ S_{i,j-1} + d & \text{(left to right)} \end{cases}$$
BLOSUM62
global comparison (4)
         Gap    V      D      S      C      Y
  V     -11     4
  E     -22
  S     -33
  L     -44
  C     -55
  Y     -66
Needleman-Wunsch:
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + \sigma(x_i, y_j) \\ S_{i-1,j} + d & \text{(up to down)} \\ S_{i,j-1} + d & \text{(left to right)} \end{cases}$$
BLOSUM62 substitution matrix
global comparison (5)
The next entry in row V uses the pair (V, D): σ(V, D) = -3.
BLOSUM62 substitution matrix
result:
  V D S - C Y
  V E S L C Y
[Figure: the completed matrix, columns Gap, V, D, S, C, Y and rows V, E, S, L, C, Y.]
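The VDSCY / VESLCY example can be reproduced with a short Needleman-Wunsch sketch. Two assumptions are made here: the dictionary `B62` holds only the BLOSUM62 entries needed for these residues, transcribed for illustration (check them against a full BLOSUM62 table before reuse), and the first sequence indexes rows rather than columns. The function names are hypothetical.

```python
# BLOSUM62 entries for residues of VDSCY (first letter) against VESLCY (second letter).
# Assumption: transcribed subset for illustration only.
B62 = {
    "VV": 4,  "VE": -2, "VS": -2, "VL": 1,  "VC": -1, "VY": -1,
    "DV": -3, "DE": 2,  "DS": 0,  "DL": -4, "DC": -3, "DY": -3,
    "SV": -2, "SE": 0,  "SS": 4,  "SL": -2, "SC": -1, "SY": -2,
    "CV": -1, "CE": -4, "CS": -1, "CL": -1, "CC": 9,  "CY": -2,
    "YV": -1, "YE": -2, "YS": -2, "YL": -1, "YC": -2, "YY": 7,
}

def sigma(a: str, b: str) -> int:
    return B62[a + b]

def needleman_wunsch(x: str, y: str, d: int = -11) -> tuple[int, str, str]:
    """Global alignment score and one optimal alignment (linear gap penalty d)."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * d
    for j in range(1, n + 1):
        S[0][j] = j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]),
                          S[i - 1][j] + d,
                          S[i][j - 1] + d)
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:                      # traceback from (m, n) to (0, 0)
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sigma(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return S[m][n], "".join(reversed(ax)), "".join(reversed(ay))

score, ax, ay = needleman_wunsch("VDSCY", "VESLCY")
print(score)
print(ax)
print(ay)
```

With d = -11 the single gap costs more than any mismatch in this example, so the optimal alignment places it opposite L and keeps the pairs V/V, D/E, S/S, C/C, Y/Y, for a total of 4 + 2 + 4 - 11 + 9 + 7 = 15.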
local comparison
Problem: a local alignment between s and t is an alignment between a substring of s and a substring of t.
Algorithm: how do we find the highest-scoring local alignment between two sequences?
Example: which one is better?
  AATC        AATC
  AAT-        AACT
local comparison
Idea
Data structure: an (m+1)×(n+1) array; entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j].
Initialization: the first row and column are initialized with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero.
Local comparison
An example: find the optimal local alignment between LDSCH and GESLCK.
Local comparison (1)   gap: -11
         Gap   L     D     S     C     H
  G      0     S_ij
  E      0
  S      0
  L      0
  C      0
  K      0
Smith-Waterman:
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + p(x_i, y_j) \\ S_{i-1,j} + d & \text{(up to down)} \\ S_{i,j-1} + d & \text{(left to right)} \\ 0 \end{cases}$$
BLOSUM62
Local comparison (2)
         Gap   L     D     S     C     H
  G      0     0
  E      0
  S      0
  L      0
  C      0
  K      0

Local comparison (3)   gap: -11
         Gap   L     D     S     C     H
  G      0     0     0
  E      0
  S      0
  L      0
  C      0
  K      0
Result:
[Figure: the completed matrix, columns Gap, L, D, S, C, H and rows G, E, S, L, C, K.]
  L D S - C H
  G E S L C K
Does it make sense?

local comparison score
1. Smith-Waterman score = 9
  L D S - C H
  G E S L C K
revisit this example
Change the gap penalty from -11 to -4. What do we get?
Find the optimal local alignment between LDSCH and GESLCK.
Local comparison (1)   gap: -4
         Gap   L     D     S     C     H
  G      0     S_ij
  E      0
  S      0
  L      0
  C      0
  K      0
Smith-Waterman:
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + p(x_i, y_j) \\ S_{i-1,j} + d & \text{(up to down)} \\ S_{i,j-1} + d & \text{(left to right)} \\ 0 \end{cases}$$
BLOSUM62
Local comparison (2)   gap: -4
         Gap   L     D     S     C     H
  G      0     0
  E      0
  S      0
  L      0
  C      0
  K      0

Local comparison (3)   gap: -4
         Gap   L     D     S     C     H
  G      0     0     0
  E      0
  S      0
  L      0
  C      0
  K      0
Result:
[Figure: the completed matrix, columns Gap, L, D, S, C, H and rows G, E, S, L, C, K.]
  L D S - C H
  G E S L C K

local comparison score
1. Smith-Waterman score = 11
  L D S - C H
  G E S L C K
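Both Smith-Waterman scores (9 with gap -11, 11 with gap -4) can be checked with the sketch below. Assumptions: `B62` contains only the BLOSUM62 entries needed for LDSCH against GESLCK, transcribed for illustration; the function names are hypothetical; and the traceback returns only the aligned substrings, stopping at the first zero entry.

```python
# BLOSUM62 entries for residues of LDSCH (first letter) against GESLCK (second letter).
# Assumption: transcribed subset for illustration only.
B62 = {
    "LG": -4, "LE": -3, "LS": -2, "LL": 4,  "LC": -1, "LK": -2,
    "DG": -1, "DE": 2,  "DS": 0,  "DL": -4, "DC": -3, "DK": -1,
    "SG": 0,  "SE": 0,  "SS": 4,  "SL": -2, "SC": -1, "SK": 0,
    "CG": -3, "CE": -4, "CS": -1, "CL": -1, "CC": 9,  "CK": -3,
    "HG": -2, "HE": 0,  "HS": -1, "HL": -3, "HC": -3, "HK": -1,
}

def p(a: str, b: str) -> int:
    return B62[a + b]

def smith_waterman(x: str, y: str, d: int) -> tuple[int, str, str]:
    """Best local alignment score and the aligned substrings (linear gap penalty d)."""
    m, n = len(x), len(y)
    H = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay zero
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + p(x[i - 1], y[j - 1]),
                          H[i - 1][j] + d,
                          H[i][j - 1] + d)
            if H[i][j] > best:
                best, bi, bj = H[i][j], i, j
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and H[i][j] > 0:      # stop at the first zero entry
        if H[i][j] == H[i - 1][j - 1] + p(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

for gap in (-11, -4):
    print(gap, smith_waterman("LDSCH", "GESLCK", gap))
```

With gap -11 the best local alignment under this sketch is the single column C/C (score 9), because bridging the gap to reach D/E and S/S would cost more than it gains. With gap -4 the gap becomes affordable and the alignment DS-C / ESLC scores 2 + 4 - 4 + 9 = 11.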
comparing multiple sequences Multiple sequence alignments are used for many reasons, including: (1) to detect regions of variability or conservation in a family of proteins, (2) to provide stronger evidence than pairwise similarity for structural and functional inferences.
comparing multiple sequences
Motivation
Multiple alignment (MA): shows which parts of the sequences s_1, …, s_k are similar and which parts are different.
Multiple alignment is a generalization of pairwise alignment and uses a similar operation: spaces are inserted so that all sequences have the same length.
No column may be made exclusively of spaces.
comparing multiple sequences
Multiple alignments are more common with proteins (amino acid sequences).
How do we evaluate different MAs of the same set of sequences?
comparing multiple sequences
Scoring scheme:
(1) SP measure: scoring an alignment based on pairwise alignments.
(2) Star alignment.
the SP (sum-of-pairs) measure
Scoring an MA with additive functions: the score of the alignment is the sum of its column scores.
"Reasonable" properties:
(1) The function is independent of the order of the sequences, i.e. SP(I, -, I, V) = SP(V, I, I, -).
(2) It rewards the presence of many equal or strongly related residues and penalizes unrelated residues and spaces.
the SP measure
The sum-of-pairs (SP) function is a function that meets these two properties.
E.g., with match = 1, mismatch = -1, and gap = -2:
SP(I, -, I, V) = p(I, -) + p(I, I) + p(I, V) + p(-, I) + p(-, V) + p(I, V)
               = -2 + 1 - 1 - 2 - 2 - 1
               = -7
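The column score above is a sum over all unordered pairs of rows, which `itertools.combinations` enumerates directly. This is a minimal sketch with the slide's values (match 1, mismatch -1, gap -2) and the deck's convention p(-, -) = 0; the names `pair_score` and `sp_column` are hypothetical.

```python
from itertools import combinations

def pair_score(a: str, b: str) -> int:
    """Pairwise score used inside the SP measure."""
    if a == "-" and b == "-":
        return 0          # two spaces contribute nothing
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def sp_column(column: str) -> int:
    """Sum-of-pairs score of one column of a multiple alignment."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

print(sp_column("I-IV"))   # -7
print(sp_column("VII-"))   # same value: the score does not depend on row order
```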
the SP measure
Although there is never an entire column of gaps, if we look at any 2 sequences in the alignment there may be columns where both have gaps. Such a pair scores zero: p(-, -) = 0.
the SP measure
Induced pairwise alignment (projection of a multiple alignment)
E.g., in an MA, select two of the sequences, forget all the rest, and remove the columns with two spaces to derive a true PA (the induced pairwise alignment):
  PEAALYGRFT---IKSDVM
  PEALNYGRY---SSESDVW
becomes
  PEAALYGRFT-IKSDVM
  PEALNYGRY-SSESDVW
α_ij: the PA induced by α on s_i and s_j.
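The projection step is short enough to sketch directly. The function name `induced_pairwise` is hypothetical, and the alignment is assumed to be a list of equal-length rows using "-" for spaces.

```python
def induced_pairwise(alignment: list[str], i: int, j: int) -> tuple[str, str]:
    """Project a multiple alignment onto rows i and j, dropping columns with two spaces."""
    kept = [(a, b) for a, b in zip(alignment[i], alignment[j])
            if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

rows = ["PEAALYGRFT---IKSDVM",
        "PEALNYGRY---SSESDVW"]
print(*induced_pairwise(rows, 0, 1), sep="\n")
```

Only the two columns where both rows hold a space are removed; columns with a single space stay, so the result is a valid pairwise alignment.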
the SP measure: summary
Way 1: compute the score of each column, then add all the column scores.
Way 2: compute the score of each induced PA, then add these scores.
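That the two ways agree can be checked on a small case. The three-row alignment below is a made-up toy example, not taken from the slides; scoring is match 1, mismatch -1, gap -2, with p(-, -) = 0, and all function names are hypothetical.

```python
from itertools import combinations

def pair_score(a: str, b: str) -> int:
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def sp_by_columns(alignment: list[str]) -> int:
    """Way 1: score every column with the SP function and add the column scores."""
    return sum(pair_score(a, b)
               for column in zip(*alignment)
               for a, b in combinations(column, 2))

def sp_by_induced_pairs(alignment: list[str]) -> int:
    """Way 2: score every induced pairwise alignment and add these scores."""
    total = 0
    for r1, r2 in combinations(alignment, 2):
        # Induced PA: drop the columns where both rows hold a space.
        induced = [(a, b) for a, b in zip(r1, r2) if not (a == "-" and b == "-")]
        total += sum(pair_score(a, b) for a, b in induced)
    return total

toy = ["AT-G",   # hypothetical rows, chosen so that no column is all spaces
       "A--G",
       "ACTG"]
print(sp_by_columns(toy), sp_by_induced_pairs(toy))
```

The two totals coincide precisely because p(-, -) = 0: the columns dropped when forming an induced PA would have contributed nothing anyway.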
star alignments
A heuristic method for multiple sequence alignment:
Select a sequence s_c as the center of the star.
For each sequence s_i among s_1, …, s_k with i ≠ c, perform a global alignment between s_i and s_c.
Aggregate the alignments with the principle "once a gap, always a gap."
star alignments
For example, say your sequences are:
  S1  A T T G C C A T T
  S2  A T G G C C A T T
  S3  A T C C A A T T T T
  S4  A T C T T C T T
  S5  A C T G A C C
(1) Find the center sequence.
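One common way to pick the center, sketched below, is to choose the sequence whose pairwise similarities to all the others have the largest sum. Both this criterion and the scoring used (match 1, mismatch -1, space -2) are assumptions here, since the slide does not state them; the function names are hypothetical.

```python
def similarity(s: str, t: str, g: int = -2) -> int:
    """Global similarity with match +1, mismatch -1, space g, keeping only two DP rows."""
    prev = [j * g for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * g]
        for j in range(1, len(t) + 1):
            p = 1 if s[i - 1] == t[j - 1] else -1
            cur.append(max(cur[j - 1] + g, prev[j - 1] + p, prev[j] + g))
        prev = cur
    return prev[-1]

def choose_center(seqs: list[str]) -> int:
    """Index of the sequence with the largest total similarity to all the others."""
    totals = [sum(similarity(s, t) for l, t in enumerate(seqs) if l != k)
              for k, s in enumerate(seqs)]
    return max(range(len(seqs)), key=totals.__getitem__)

seqs = ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"]
c = choose_center(seqs)
print("center: S%d = %s" % (c + 1, seqs[c]))
```

Selecting the center this way needs all k(k-1)/2 pairwise comparisons; the later steps only align each remaining sequence against the chosen center.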
star alignments (2) do pairwise alignments
star alignments (3) build the multiple alignment