Published by Denis Elliott. Modified over 9 years ago.
1
2016-3-19 Sequence comparison and database search
2
Strings
String (sequence): an ordered succession of characters (symbols).
Alphabet: (1) the DNA alphabet {A, C, G, T} of nucleotides; (2) the 20-character alphabet of amino acids.
Length and indexing: |s| is the length of s, and s[i] is its i-th character for 1 ≤ i ≤ |s|. Example: for s = AATGCA, |s| = 6 and s[3] = T.
Empty string: ε, with |ε| = 0.
3
Strings: subsequence and supersequence
Subsequence: a subsequence of s is a sequence that can be obtained from s by removing some characters. For example, TTT is a subsequence of ATATAT.
Supersequence: when sequence t is a subsequence of sequence s, we say that s is a supersequence of t.
4
Strings: substring and superstring
Substring: a substring of s is a string formed by consecutive characters of s, in the same order as they appear in s.
Superstring: when t is a substring of s, we say that s is a superstring of t.
Interval: an interval of a string s is a set of consecutive indices [i..j]; s[i..j] denotes the substring formed by those positions, and the empty string when i = j+1. Therefore, for any substring t of s there is at least one interval [i..j] of s with t = s[i..j].
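The difference between a subsequence (characters may be skipped) and a substring (characters must be consecutive) can be sketched in a few lines of Python. The function names `is_subsequence`, `is_substring`, and `interval` are illustrative choices, not part of the slides; `interval` maps the 1-based notation s[i..j] onto Python's 0-based slices.

```python
def is_subsequence(t: str, s: str) -> bool:
    """True if t can be obtained from s by removing zero or more characters."""
    remaining = iter(s)
    # "ch in remaining" consumes the iterator up to the first occurrence of ch,
    # so later characters of t can only match later positions of s.
    return all(ch in remaining for ch in t)


def is_substring(t: str, s: str) -> bool:
    """True if t is formed by consecutive characters of s."""
    return t in s


def interval(s: str, i: int, j: int) -> str:
    """Return s[i..j] using 1-based, inclusive indices; empty when i == j + 1."""
    return s[i - 1:j]


print(is_subsequence("TTT", "ATATAT"))  # subsequence, characters are skipped
print(is_substring("TTT", "ATATAT"))    # not a substring, T's are not adjacent
print(interval("AATGCA", 3, 3))         # single character s[3]
```

Every substring is also a subsequence, but not the other way around, which is why the two checks disagree on TTT.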
5
Strings: concatenation, prefix, suffix
Concatenation: the concatenation of two strings s and t is written st (t placed after s).
Concatenation of a string with itself is written as a power: s^3 = sss.
Prefix: a prefix of s is any substring of s of the form s[1..j] for 0 ≤ j ≤ |s|. Equivalently, t is a prefix of s if and only if s = tu for some string u.
prefix(s, k): the prefix of s with exactly k characters, for 0 ≤ k ≤ |s|.
Suffix: a suffix of s is any substring of the form s[i..|s|] for 1 ≤ i ≤ |s|+1.
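As a small illustration of these definitions, the sketch below expresses prefix(s, k) and the suffix s[i..|s|] with Python slices. The helper names `prefix`, `suffix_from`, and `is_prefix` are assumptions made for this example; the only subtlety is converting the slides' 1-based indices to Python's 0-based ones.

```python
def prefix(s: str, k: int) -> str:
    """Prefix of s with exactly k characters, 0 <= k <= |s|."""
    if not 0 <= k <= len(s):
        raise ValueError("k must satisfy 0 <= k <= |s|")
    return s[:k]


def suffix_from(s: str, i: int) -> str:
    """Suffix s[i..|s|] with 1-based i, 1 <= i <= |s| + 1 (empty when i = |s| + 1)."""
    if not 1 <= i <= len(s) + 1:
        raise ValueError("i must satisfy 1 <= i <= |s| + 1")
    return s[i - 1:]


def is_prefix(t: str, s: str) -> bool:
    """True if s = tu for some (possibly empty) string u."""
    return s.startswith(t)


s = "AATGCA"
print(prefix(s, 3))        # first three characters
print(suffix_from(s, 4))   # characters 4..6
print("AC" * 3)            # the power s^3 for s = AC
```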
6
Sequence comparison and database search: outline
What is sequence comparison?
How to compare sequences:
1. Compare two sequences
   a. compare the entire sequences
   b. compare parts of the sequences
2. Compare more than two sequences
7
3.1 Biological background
[Figure: alignment of two DNA sequences shown in three blocks; '.' marks a gap. The lines of '|' match markers between the rows did not survive extraction.]
TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG

AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG

AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT
9
3.1 Biological background
Two notions:
Similarity: a measure of how similar two sequences are.
Alignment: the basic operation for comparing two sequences; a way of placing one sequence above the other to make clear the correspondence between similar characters or substrings.
10
3.1 Biological background
Alignment: a correspondence between the elements of two sequences that preserves their order (topology).
Pairwise alignment: two sequences aligned.
Multiple alignment: three or more sequences aligned.
Examples (':' marks identical characters):
FSEYTTHRGHR
:  ::::: ::
FESYTTHRPHR

FESYTTHRGHR
:::::::: ::
FESYTTHRPHR
11
3.1 Biological background
Four cases: task and application
(1) Two sequences of tens of thousands of characters that differ only by isolated insertions, deletions, and substitutions, occurring rarely (about one per hundred characters). Task: find where the differences occur. Application: comparing two different sequencings.
(2) Two sequences: decide whether a prefix of one is similar to a suffix of the other.
(3) Two sequences: decide whether there are two similar substrings, one from each sequence. Application: analyzing conserved sequences.
12
3.2 Comparing two sequences
Alignments involve either:
Global comparisons: the entire sequences are aligned.
Local comparisons: only substrings of the sequences are aligned.
(Liao L, Noble WS: Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships. J Comput Biol 2003, 10(6):857-868.)
13
3.2.1 Global comparison: example
Example: aligning GACGGATTAG and GATCGGAATAG.
GA-CGGATTAG
GATCGGAATAG
The second sequence has an extra T, which is aligned with a space (written as a dash), and one column shows a change from A to T.
14
3.2.1 Global comparison: the basic algorithm
Definitions
Alignment: spaces are inserted into the sequences so that both have the same size, and the sequences are then placed one over the other to create a correspondence between their characters. No column may contain two spaces. (Spaces can also be inserted at the beginning or the end.)
Scoring function: a measure of similarity between elements (nucleotides, amino acids, gaps):
a match (identical characters): +1
a mismatch (distinct characters): -1
15
3.2.1 Global comparison: the basic algorithm
a space: -2
Scoring system: rewards matches and penalizes mismatches and spaces.
GA-CGGATTAG
GATCGGAATAG
What is the score? With 9 matches, 1 mismatch, and 1 space, it is 9 - 1 - 2 = 6.
Similarity sim(s, t): the maximum alignment score over all alignments of s and t. Many alignments may achieve this score.
Best (optimal) alignment: an alignment whose score equals the similarity.
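The scoring system can be checked mechanically. The following is a minimal sketch that scores an already-built alignment column by column, using the slide's values (+1, -1, -2) as defaults; the function name `score_alignment` and the use of '-' for a space are assumptions of this example.

```python
def score_alignment(row_s: str, row_t: str,
                    match: int = 1, mismatch: int = -1, space: int = -2) -> int:
    """Sum the column scores of a pairwise alignment given as two gapped rows."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("a column may not contain two spaces")
        if a == "-" or b == "-":
            total += space       # one character aligned with a space
        elif a == b:
            total += match       # identical characters
        else:
            total += mismatch    # distinct characters
    return total


print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))  # 9 matches, 1 mismatch, 1 space -> 6
```

This only evaluates one given alignment; finding the alignment with the maximum score is the job of the dynamic programming algorithm.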
16
3.2.1 Global comparison: the basic algorithm
Needleman-Wunsch: the basic dynamic programming (DP) algorithm for comparing two sequences.
The number of alignments between two sequences is exponential, so enumerating them all is not feasible.
Efficient algorithm: dynamic programming, which solves the problem for prefixes, from shorter to longer.
An example: s = AAAC, t = AGC
17
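Idea
Use an (m+1)×(n+1) array a: entry (i, j) holds the similarity between s[1..i] and t[1..j].
p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j].
With space penalty g, each entry is computed from three neighboring entries:
a[i, j] = max(a[i, j-1] + g, a[i-1, j-1] + p(i, j), a[i-1, j] + g)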
18
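3.2.1 Global comparison: the basic algorithm
DP array for s = AAAC (rows) and t = AGC (columns), with match +1, mismatch -1, space -2:
            A    G    C
       0   -2   -4   -6
A     -2    1   -1   -3
A     -4   -1    0   -2
A     -6   -3   -2   -1
C     -8   -5   -4   -1
The similarity is the bottom-right entry: sim(s, t) = -1.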
19
3.2.1 Global comparison: the basic algorithm
2. A good computing order:
row by row, left to right on each row; or
column by column, top to bottom on each column; or
any other order that makes sure a[i, j-1], a[i-1, j], and a[i-1, j-1] are available when a[i, j] must be computed.
3. Notes:
parameter g: the space penalty (usually g < 0); here g = -2.
scoring function p for pairs of characters: p(a, b) = 1 if a = b, and p(a, b) = -1 if a ≠ b.
20
3.2.1 Global comparison: the basic algorithm
Algorithm Similarity
input: sequences s and t
output: similarity between s and t
m ← |s|
n ← |t|
for i ← 0 to m do
    a[i, 0] ← i × g
for j ← 0 to n do
    a[0, j] ← j × g
for i ← 1 to m do
    for j ← 1 to n do
        a[i, j] ← max(a[i, j-1] + g, a[i-1, j-1] + p(i, j), a[i-1, j] + g)
return a[m, n]
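Algorithm Similarity translates almost line for line into Python. The sketch below is one possible rendering, assuming the slide's scores (match +1, mismatch -1, g = -2) as defaults; the function name `similarity_matrix` is illustrative, and the function returns the whole array so that intermediate entries can be inspected, with the similarity in the bottom-right entry.

```python
def similarity_matrix(s: str, t: str, g: int = -2,
                      match: int = 1, mismatch: int = -1) -> list[list[int]]:
    """Fill the (m+1) x (n+1) array a, where a[i][j] = sim(s[1..i], t[1..j])."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g            # s[1..i] aligned against spaces only
    for j in range(n + 1):
        a[0][j] = j * g            # t[1..j] aligned against spaces only
    for i in range(1, m + 1):      # row by row, left to right
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + g,        # space in s
                          a[i - 1][j - 1] + p,    # s[i] aligned with t[j]
                          a[i - 1][j] + g)        # space in t
    return a


a = similarity_matrix("AAAC", "AGC")
for row in a:
    print(row)
print("similarity:", a[-1][-1])
```

Python strings are 0-based, so the character the slides call s[i] is `s[i - 1]` in the code.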
21
Optimal alignments
How do we construct an optimal alignment between two sequences once the similarity has been computed?
Idea of Algorithm Align:
Start at entry (m, n) and follow the arrows until reaching (0, 0). Given the matrix a computed by the basic algorithm, an optimal alignment is easily constructed from right to left.
The variables align-s and align-t are treated as globals in the code.
The call Align(m, n, len) constructs an optimal alignment.
Note: max(|s|, |t|) ≤ len ≤ m + n
22
Recursive algorithm for optimal alignment
Algorithm Align
input: indices i, j; array a given by Algorithm Similarity
output: alignment in align-s, align-t, and its length in len
if i = 0 and j = 0 then
    len ← 0
else if i > 0 and a[i, j] = a[i-1, j] + g then
    Align(i-1, j, len)
    len ← len + 1
    align-s[len] ← s[i]
    align-t[len] ← -
else if i > 0 and j > 0 and a[i, j] = a[i-1, j-1] + p(i, j) then
    Align(i-1, j-1, len)
    len ← len + 1
    align-s[len] ← s[i]
    align-t[len] ← t[j]
else  // has to be j > 0 and a[i, j] = a[i, j-1] + g
    Align(i, j-1, len)
    len ← len + 1
    align-s[len] ← -
    align-t[len] ← t[j]
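The traceback can also be written as a loop instead of a recursion. The sketch below is an iterative variant of Algorithm Align, not the recursive form shown on the slide: it tests the three cases in the same order (space in t, then two symbols, then space in s), collects columns from right to left, and reverses them at the end. The function name `align` and the default scores are assumptions carried over from the earlier slides.

```python
def align(s: str, t: str, g: int = -2,
          match: int = 1, mismatch: int = -1) -> tuple[str, str, int]:
    """Return (aligned s, aligned t, similarity) for one optimal global alignment."""
    m, n = len(s), len(t)

    def p(i: int, j: int) -> int:          # 1-based, as in the pseudocode
        return match if s[i - 1] == t[j - 1] else mismatch

    # Same array as Algorithm Similarity.
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g
    for j in range(n + 1):
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i][j - 1] + g, a[i - 1][j - 1] + p(i, j), a[i - 1][j] + g)

    # Walk back from (m, n) to (0, 0), building columns right to left.
    cols_s, cols_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and a[i][j] == a[i - 1][j] + g:                        # space in t
            cols_s.append(s[i - 1]); cols_t.append("-"); i -= 1
        elif i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p(i, j):  # two symbols
            cols_s.append(s[i - 1]); cols_t.append(t[j - 1]); i -= 1; j -= 1
        else:                                                           # space in s
            cols_s.append("-"); cols_t.append(t[j - 1]); j -= 1
    return "".join(reversed(cols_s)), "".join(reversed(cols_t)), a[m][n]


print(align("AAAC", "AGC"))
```

Each loop iteration emits exactly one column, so the traceback takes O(len) = O(m + n) steps once the array is available.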
23
Optimal alignments
Arrow preference: when there is a choice, a column with a space in t has precedence over a column with two symbols, which in turn has precedence over a column with a space in s.
For s = AAAC and t = AGC, Algorithm Align therefore returns
AAAC
AG-C    (maximum preference)
rather than
AAAC
-AGC    (minimum preference)
24
Optimal alignments
Complexity of the algorithms, for time and space:
Basic dynamic programming (Algorithm Similarity, comparing two sequences): O(mn), or O(n^2) when both sequences have length n.
Recursive algorithm for optimal alignment (Algorithm Align): O(len) = O(m + n).
25
Scoring matrix
Scoring function: p(a, a) = 1; p(a, b) = -1 for a ≠ b; p(a, -) = p(-, b) = -2
26
Nucleic acid matrix (1): BLAST matrix
The default matrix for BLAST: 5 for a match, -4 for a mismatch.
      A    T    C    G
A     5   -4   -4   -4
T    -4    5   -4   -4
C    -4   -4    5   -4
G    -4   -4   -4    5
27
Nucleic acid matrix (1): transition-transversion matrix
      A    T    C    G
A     1   -5   -5   -1
T    -5    1   -1   -5
C    -5   -1    1   -5
G    -1   -5   -5    1
28
Protein substitution matrices
PAM (Point Accepted Mutation): a point accepted mutation is the replacement of a single amino acid in the primary structure of a protein with another single amino acid, which is accepted by the processes of natural selection.
BLOSUM (BLOcks SUbstitution Matrix): a substitution matrix used for sequence alignment of proteins.
29
Blocks Substitution Matrix (BLOSUM)
Scores are derived from observed frequencies of substitutions in blocks of local alignments of related proteins.
The matrix name indicates evolutionary distance: BLOSUM62 was created using sequences sharing no more than 62% identity.
30
Example: align VDSCY and VESLCY using BLOSUM62 with gap penalty d = -11.
31
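Global comparison (1)
Initialize the first row and the first column with multiples of the gap penalty:
        Gap      V      D      S      C      Y
Gap       0  1·gap  2·gap      …
V     1·gap
E     2·gap
S         …
L
C
Y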
32
Global comparison (2)
With gap = -11; S(i,j) marks the next entry to compute:
      Gap     V    D    S    C    Y
Gap     0   -11  -22  -33  -44  -55
V     -11  S(i,j)
E     -22
S     -33
L     -44
C     -55
Y     -66
33
Global comparison (3)
Needleman-Wunsch, with row index i and column index j:
S(i,j) = max of:
    S(i-1,j-1) + σ(x_i, y_j)
    S(i-1,j) + d    (up to down)
    S(i,j-1) + d    (left to right)

      Gap     V    D    S    C    Y
Gap     0   -11  -22  -33  -44  -55
V     -11  S(i,j)
E     -22
S     -33
L     -44
C     -55
Y     -66
34
BLOSUM62
35
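Global comparison (4)
Needleman-Wunsch:
S(i,j) = max of:
    S(i-1,j-1) + σ(x_i, y_j)
    S(i-1,j) + d    (up to down)
    S(i,j-1) + d    (left to right)
For the entry (V, V): σ(V, V) = 4 and d = -11, so S = max(0 + 4, -11 - 11, -11 - 11) = 4.
      Gap    V    D    S    C    Y
Gap     0  -11  -22  -33  -44  -55
V     -11    4
E     -22
S     -33
L     -44
C     -55
Y     -66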
36
BLOSUM62 substitution matrix
37
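Global comparison (5)
For the entry in row V, column D: σ(V, D) = -3 and d = -11, so S = max(-11 - 3, -22 - 11, 4 - 11) = -7.
      Gap    V    D    S    C    Y
Gap     0  -11  -22  -33  -44  -55
V     -11    4   -7
E     -22
S     -33
L     -44
C     -55
Y     -66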
38
BLOSUM62 substitution matrix
39
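Result:
V D S - C Y
V E S L C Y

      Gap    V    D    S    C    Y
Gap     0  -11  -22  -33  -44  -55
V     -11    4   -7  -18  -29  -40
E     -22   -7    6   -5  -16  -27
S     -33  -18   -5   10   -1  -12
L     -44  -29  -16   -1    9   -2
C     -55  -40  -27  -12    8    7
Y     -66  -51  -38  -23   -3   15
The score of the optimal global alignment is the bottom-right entry, 15.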
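The worked example can be reproduced with a short script. This is a sketch under stated assumptions: `PAIRS` holds only the BLOSUM62 entries needed for VDSCY versus VESLCY (typed in by hand for this example, keyed by the alphabetically sorted residue pair), and `sigma` and `global_align` are illustrative names. For the traceback it checks the diagonal first, then up, then left, which is one of several valid tie-breaking orders.

```python
# Hand-copied subset of BLOSUM62 covering only the pairs used in this example.
PAIRS = {
    "VV": 4, "DV": -3, "SV": -2, "CV": -1, "VY": -1,
    "EV": -2, "DE": 2, "ES": 0, "CE": -4, "EY": -2,
    "DS": 0, "SS": 4, "CS": -1, "SY": -2,
    "LV": 1, "DL": -4, "LS": -2, "CL": -1, "LY": -1,
    "CD": -3, "CC": 9, "CY": -2, "DY": -3, "YY": 7,
}


def sigma(a: str, b: str) -> int:
    """Symmetric lookup: the key is the residue pair in alphabetical order."""
    return PAIRS["".join(sorted(a + b))]


def global_align(x: str, y: str, d: int) -> tuple[int, str, str]:
    """Needleman-Wunsch with x along the columns and y along the rows."""
    m, n = len(y), len(x)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * d
    for j in range(1, n + 1):
        S[0][j] = j * d
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + sigma(y[i - 1], x[j - 1]),
                          S[i - 1][j] + d,      # up to down
                          S[i][j - 1] + d)      # left to right
    out_x, out_y, i, j = [], [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sigma(y[i - 1], x[j - 1]):
            out_x.append(x[j - 1]); out_y.append(y[i - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + d:
            out_x.append("-"); out_y.append(y[i - 1]); i -= 1
        else:
            out_x.append(x[j - 1]); out_y.append("-"); j -= 1
    return S[m][n], "".join(reversed(out_x)), "".join(reversed(out_y))


score, top, bottom = global_align("VDSCY", "VESLCY", d=-11)
print(score)
print(top)
print(bottom)
```

A full implementation would load the complete 20×20 matrix rather than a hand-typed subset; the subset keeps the example self-contained.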
40
3.2.2 Local comparison
Problem: a local alignment between s and t is an alignment between a substring of s and a substring of t.
Algorithm: how do we find the highest-scoring local alignment between two sequences?
Example: which one is better?
AATC        AATC
AAT-        AACT
41
3.2.2 Local comparison
Idea
Data structure: an (m+1)×(n+1) array; each entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j].
Initialization: the first row and column are initialized with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero.
42
Local comparison: an example
Find the optimal local alignment of
L D S C H
G E S L C K
43
Local comparison (1)
Smith-Waterman, with gap penalty d = -11:
S(i,j) = max of:
    S(i-1,j-1) + p(x_i, y_j)
    S(i-1,j) + d    (up to down)
    S(i,j-1) + d    (left to right)
    0

      Gap     L    D    S    C    H
Gap     0     0    0    0    0    0
G       0  S(i,j)
E       0
S       0
L       0
C       0
K       0
44
BLOSUM62
45
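Local comparison (2)
For the entry in row G, column L: p(G, L) = -4 and d = -11, so S = max(0 - 4, 0 - 11, 0 - 11, 0) = 0.
      Gap    L    D    S    C    H
Gap     0    0    0    0    0    0
G       0    0
E       0
S       0
L       0
C       0
K       0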
46
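Local comparison (3)
For the entry in row G, column D: every candidate is negative (each gap move costs -11), so the entry is again 0.
      Gap    L    D    S    C    H
Gap     0    0    0    0    0    0
G       0    0    0
E       0
S       0
L       0
C       0
K       0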
47
Result:
      Gap    L    D    S    C    H
Gap     0    0    0    0    0    0
G       0    0    0    0    0    0
E       0    0    2    0    0    0
S       0    0    0    6    0    0
L       0    4    0    0    5    0
C       0    0    1    0    9    2
K       0    0    0    1    0    8
The highest entry is 9, in row C, column C.
L D S - C H
G E S L C K
Does it make sense?
48
Local comparison score
1. Smith-Waterman score = 9
L D S - C H
G E S L C K
49
Revisiting the example
If the gap penalty changes from -11 to -4, what do we get?
Find the optimal local alignment of
L D S C H
G E S L C K
50
Local comparison (1)
Smith-Waterman, with gap penalty d = -4:
S(i,j) = max of:
    S(i-1,j-1) + p(x_i, y_j)
    S(i-1,j) + d    (up to down)
    S(i,j-1) + d    (left to right)
    0

      Gap     L    D    S    C    H
Gap     0     0    0    0    0    0
G       0  S(i,j)
E       0
S       0
L       0
C       0
K       0
51
BLOSUM62
52
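Local comparison (2)
For the entry in row G, column L: p(G, L) = -4 and d = -4, so S = max(0 - 4, 0 - 4, 0 - 4, 0) = 0.
      Gap    L    D    S    C    H
Gap     0    0    0    0    0    0
G       0    0
E       0
S       0
L       0
C       0
K       0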
53
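Local comparison (3)
For the entry in row G, column D: every candidate is negative (each gap move costs -4), so the entry is again 0.
      Gap    L    D    S    C    H
Gap     0    0    0    0    0    0
G       0    0    0
E       0
S       0
L       0
C       0
K       0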
54
Result:
      Gap    L    D    S    C    H
Gap     0    0    0    0    0    0
G       0    0    0    0    0    0
E       0    0    2    0    0    0
S       0    0    0    6    2    0
L       0    4    0    2    5    1
C       0    0    1    0   11    7
K       0    0    0    1    7   10
The highest entry is 11, in row C, column C.
L D S - C H
G E S L C K
55
Local comparison score
1. Smith-Waterman score = 11
L D S - C H
G E S L C K
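Both runs of the local comparison example (gap -11 and gap -4) can be reproduced with one small Smith-Waterman sketch. As assumptions of this example, `PAIRS` is a hand-copied subset of BLOSUM62 covering only the residue pairs of LDSCH versus GESLCK, and `sigma` and `local_align` are illustrative names. The traceback starts at the highest entry and stops at the first zero, so it returns only the locally aligned substrings.

```python
# Hand-copied subset of BLOSUM62 (keys: residue pair in alphabetical order).
PAIRS = {
    "GL": -4, "DG": -1, "GS": 0, "CG": -3, "GH": -2,
    "EL": -3, "DE": 2, "ES": 0, "CE": -4, "EH": 0,
    "LS": -2, "DS": 0, "SS": 4, "CS": -1, "HS": -1,
    "LL": 4, "DL": -4, "CL": -1, "HL": -3,
    "CD": -3, "CC": 9, "CH": -3,
    "KL": -2, "DK": -1, "KS": 0, "CK": -3, "HK": -1,
}


def sigma(a: str, b: str) -> int:
    return PAIRS["".join(sorted(a + b))]


def local_align(x: str, y: str, d: int) -> tuple[int, str, str]:
    """Smith-Waterman with x along the columns and y along the rows."""
    m, n = len(y), len(x)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + sigma(y[i - 1], x[j - 1]),
                          S[i - 1][j] + d,
                          S[i][j - 1] + d)
            if S[i][j] > best:
                best, bi, bj = S[i][j], i, j
    out_x, out_y, i, j = [], [], bi, bj
    while S[i][j] > 0:                           # stop at the first zero entry
        if S[i][j] == S[i - 1][j - 1] + sigma(y[i - 1], x[j - 1]):
            out_x.append(x[j - 1]); out_y.append(y[i - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + d:
            out_x.append("-"); out_y.append(y[i - 1]); i -= 1
        else:
            out_x.append(x[j - 1]); out_y.append("-"); j -= 1
    return best, "".join(reversed(out_x)), "".join(reversed(out_y))


for gap in (-11, -4):
    print(gap, local_align("LDSCH", "GESLCK", gap))
```

With the heavier penalty the best local alignment shrinks to the single column C over C (score 9), while with gap -4 it extends to DS-C over ESLC (score 11), which is one way to read the slide's question of whether the first result makes sense.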
56
3.4 Comparing multiple sequences
Multiple sequence alignments are used for many reasons, including:
(1) to detect regions of variability or conservation in a family of proteins;
(2) to provide stronger evidence than pairwise similarity for structural and functional inferences.
57
3.4 Comparing multiple sequences
Motivation
A multiple alignment (MA) of sequences s_1, …, s_k shows which parts of the sequences are similar and which parts are different.
Multiple alignment is a generalization of pairwise alignment and uses a similar operation: spaces are inserted so that all sequences have the same length, and no column may be made exclusively of spaces.
58
3.4 Comparing multiple sequences
Multiple alignments are more common with amino acid sequences (proteins).
How do we evaluate different MAs of the same set of sequences?
59
3.4 Comparing multiple sequences
Scoring scheme:
(1) SP measure: scores an alignment based on its pairwise alignments.
(2) Star alignment
60
3.4.1 The SP (sum-of-pairs) measure
Scoring an MA: additive functions are used here, that is, the score of an alignment is the sum of its column scores.
"Reasonable" properties of a column-scoring function:
(1) It is independent of the order of the sequences, e.g. SP(I, -, I, V) = SP(V, I, I, -).
(2) It rewards the presence of many equal or strongly related residues and penalizes unrelated residues and spaces.
61
3.4.1 The SP measure
The sum-of-pairs (SP) function meets both properties: the score of a column is the sum of the pairwise scores of all pairs of symbols in it.
E.g., with match = 1, mismatch = -1, and gap = -2:
SP(I, -, I, V) = p(I, -) + p(I, I) + p(I, V) + p(-, I) + p(-, V) + p(I, V)
               = -2 + 1 + (-1) + (-2) + (-2) + (-1) = -7
62
3.4.1 The SP measure
Although an alignment never contains an entire column of gaps, any two sequences taken from the alignment may both have gaps in the same column. Such pairs are scored p(-, -) = 0.
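The SP score of a single column can be computed by enumerating all pairs of symbols in it. The sketch below assumes the slide's values (match 1, mismatch -1, gap -2) together with p(-, -) = 0; the names `pair_score` and `sp_column` are illustrative.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score one pair of symbols from a column; two gaps score 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column: str) -> int:
    """Sum-of-pairs score of one column, given as a string of symbols."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


print(sp_column("I-IV"))   # the slide's example column (I, -, I, V)
print(sp_column("VII-"))   # same symbols in another order, same score
```

Because every unordered pair is counted once, the result does not depend on the order of the sequences, which is the first of the two properties listed earlier.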
63
3.4.1 The SP measure
Induced pairwise alignment (projection of a multiple alignment)
E.g., in an MA, select two of the sequences, forget all the rest, and remove the columns with two spaces to derive a true PA (the induced pairwise alignment):
PEAALYGRFT---IKSDVM
PEALNYGRY---SSESDVW
becomes
PEAALYGRFT-IKSDVM
PEALNYGRY-SSESDVW
α_ij: the PA induced by α on s_i and s_j
64
3.4.1 The SP measure: summary
Way 1: compute the score of each column, then add all column scores.
Way 2: compute the score of each induced PA, then add these scores.
Because p(-, -) = 0, both ways give the same SP score.
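The two ways of computing the SP score can be compared directly. In this sketch, `induced_pairwise` implements the projection (drop the columns where both rows have a space), and `sp_by_columns` and `sp_by_pairs` implement Way 1 and Way 2. All names, and the three-row toy alignment at the end, are made up for illustration; the scores are the match 1, mismatch -1, gap -2 values used above.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    if a == "-" and b == "-":
        return 0                      # p(-, -) = 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def induced_pairwise(row_i: str, row_j: str) -> tuple[str, str]:
    """Project a multiple alignment onto two rows, removing double-space columns."""
    kept = [(a, b) for a, b in zip(row_i, row_j) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


def sp_by_columns(rows: list[str]) -> int:
    """Way 1: score each column over all pairs of rows, then add the column scores."""
    return sum(pair_score(a, b) for col in zip(*rows) for a, b in combinations(col, 2))


def sp_by_pairs(rows: list[str]) -> int:
    """Way 2: score each induced pairwise alignment, then add these scores."""
    total = 0
    for row_i, row_j in combinations(rows, 2):
        pi, pj = induced_pairwise(row_i, row_j)
        total += sum(pair_score(a, b) for a, b in zip(pi, pj))
    return total


print(induced_pairwise("PEAALYGRFT---IKSDVM", "PEALNYGRY---SSESDVW"))
toy = ["AT-C", "A--C", "ATGC"]        # hypothetical multiple alignment
print(sp_by_columns(toy), sp_by_pairs(toy))
```

The columns removed by the projection are exactly those that contribute p(-, -) = 0 in Way 1, which is why the two totals agree.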
65
3.4.2 Star alignments
A heuristic method for multiple sequence alignment:
Select a sequence s_c as the center of the star.
For each sequence s_i among s_1, …, s_k with i ≠ c, perform a global alignment of s_i with s_c.
Aggregate the alignments with the principle "once a gap, always a gap."
66
3.4.2 Star alignments
For example, say your sequences are:
S1  A T T G C C A T T
S2  A T G G C C A T T
S3  A T C C A A T T T T
S4  A T C T T C T T
S5  A C T G A C C
(1) Find the center sequence.
67
3.4.2 Star alignments
(2) Do pairwise alignments.
68
3.4.2 Star alignments
(3) Build the multiple alignment.
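The three steps can be sketched end to end. Two assumptions here are not stated on the slides: the center is taken to be the sequence with the largest sum of pairwise similarities to the others (a common choice), and pairwise alignments use the match 1, mismatch -1, space -2 scores from the earlier sections. The function names `global_align` and `star_alignment` are illustrative, and tie-breaking in the traceback is arbitrary, so other equally good outputs are possible.

```python
def global_align(s, t, match=1, mismatch=-1, g=-2):
    """Needleman-Wunsch: return (score, aligned s, aligned t)."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        a[i][0] = i * g
    for j in range(1, n + 1):
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = max(a[i - 1][j - 1] + p, a[i - 1][j] + g, a[i][j - 1] + g)
    out_s, out_t, i, j = [], [], m, n
    while i > 0 or j > 0:
        p = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p:
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and a[i][j] == a[i - 1][j] + g:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return a[m][n], "".join(reversed(out_s)), "".join(reversed(out_t))


def star_alignment(seqs):
    """Return (center index, gapped rows in the original order)."""
    k = len(seqs)
    # Step 1: pick the center (assumed criterion: largest total similarity).
    totals = [sum(global_align(seqs[c], seqs[i])[0] for i in range(k) if i != c)
              for c in range(k)]
    c = totals.index(max(totals))
    center, placed = seqs[c], {}          # gapped center row, index -> gapped row
    for i in range(k):
        if i == c:
            continue
        # Step 2: pairwise global alignment of the center with s_i.
        _, ac, ai = global_align(seqs[c], seqs[i])
        # Step 3: merge, keeping every gap already present ("once a gap, always a gap").
        new_center, new_row = [], []
        grown = {idx: [] for idx in placed}
        p = q = 0
        while p < len(center) or q < len(ac):
            old_gap = p < len(center) and center[p] == "-"
            new_gap = q < len(ac) and ac[q] == "-"
            use_old = not (new_gap and not old_gap)   # False: open a new gap column
            use_new = not (old_gap and not new_gap)   # False: pad the new sequence
            new_center.append(center[p] if use_old else "-")
            for idx, row in placed.items():
                grown[idx].append(row[p] if use_old else "-")
            new_row.append(ai[q] if use_new else "-")
            p += 1 if use_old else 0
            q += 1 if use_new else 0
        center = "".join(new_center)
        placed = {idx: "".join(chars) for idx, chars in grown.items()}
        placed[i] = "".join(new_row)
    placed[c] = center
    return c, [placed[i] for i in range(k)]


seqs = ["ATTGCCATT", "ATGGCCATT", "ATCCAATTTT", "ATCTTCTT", "ACTGACC"]
center_index, rows = star_alignment(seqs)
print("center: S%d" % (center_index + 1))
for row in rows:
    print(row)
```

Only k - 1 pairwise alignments against the center are merged, so the result is a heuristic and not necessarily the multiple alignment with the best SP score.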