Sequence comparison and database search (2016-3-19)


Strings
- String (sequence): an ordered succession of characters (symbols).
- Alphabet: (1) the DNA alphabet {A, C, G, T} of nucleotides; (2) the 20-character alphabet of amino acids.
- Length: |s| is the length of s, and s[i] is its i-th character for 1 ≤ i ≤ |s|. Example: s = AATGCA, s[3] = T.
- Empty string: ε, with |ε| = 0.

Strings: subsequence and supersequence
- Subsequence: a subsequence of s is a sequence that can be obtained from s by removing some characters.
- Supersequence: if t is a subsequence of s, then s is a supersequence of t.

Strings: substring and superstring
- Substring: a substring of s is a string formed by consecutive characters of s, in the same order as they appear in s.
- Superstring: if t is a substring of s, then s is a superstring of t.
- Interval: an interval of a string s is a set of consecutive indices...
- s[i..j] denotes the empty string when i = j + 1.
- Therefore, for any substring t of s there is at least one interval [i..j] of s with t = s[i..j].

Strings: concatenation, prefix and suffix
- Concatenation: the concatenation of two strings s and t is written st, with t placed after s.
- Concatenating a string with itself is written as a power: s^3 = sss.
- Prefix: a prefix of s is any substring of the form s[1..j] for 0 ≤ j ≤ |s|. Equivalently, t is a prefix of s if and only if s = tu for some string u.
- prefix(s, k) refers to the prefix of s with exactly k characters, 0 ≤ k ≤ |s|.
- Suffix: a suffix of s is any substring of the form s[i..|s|] for 1 ≤ i ≤ |s| + 1.
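The string notions above can be checked with a few lines of Python. This is only a sketch: the helper names (`is_subsequence`, `substring`, `prefix`, `suffix`) are made up for illustration, and the helpers translate the slides' 1-based indices into Python's 0-based slices.

```python
def is_subsequence(t: str, s: str) -> bool:
    """True if t can be obtained from s by removing some characters."""
    remaining = iter(s)
    # Each membership test consumes `remaining` up to the matched character,
    # so the characters of t must appear in s in the same order.
    return all(ch in remaining for ch in t)


def substring(s: str, i: int, j: int) -> str:
    """s[i..j] with 1-based, inclusive indices; empty when i = j + 1."""
    return s[i - 1:j]


def prefix(s: str, k: int) -> str:
    """The prefix of s with exactly k characters."""
    return s[:k]


def suffix(s: str, i: int) -> str:
    """s[i..|s|] with a 1-based start index; empty when i = |s| + 1."""
    return s[i - 1:]


s = "AATGCA"
print(substring(s, 3, 3))        # the single character s[3]
print(is_subsequence("ATC", s))  # a subsequence that is not a substring
print("ATC" in s)                # substring test
print(prefix(s, 2) + suffix(s, 3) == s)  # s = tu with t a prefix
```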

Sequence comparison and database search: outline
- What is sequence comparison?
- How to compare sequences:
  1. Compare two sequences
     a. compare the entire sequences
     b. compare parts of the sequences
  2. Compare more than two sequences

3.1 Biological background
[Figure: alignment of two DNA sequences, shown in three blocks; dots mark gaps and vertical bars between the rows mark matching positions]

TT....TGTGTGCATTTAAGGGTGATAGTGTATTTGCTCTTTAAGAGCTG
TTGACAGGTACCCAACTGTGTGTGCTGATGTA.TTGCTGGCCAAGGACTG

AGTGTTTGAGCCTCTGTTTGTGTGTAATTGAGTGTGCATGTGTGGGAGTG
AAGGATC.............TCAGTAATTAATCATGCACCTATGTGGCGG

AAATTGTGGAATGTGTATGCTCATAGCACTGAGTGAAAATAAAAGATTGT
AAA.TATGGGATATGCATGTCGA...CACTGAGTG..AAGGCAAGATTAT


3.1 Biological background: two notions
- Similarity: a measure of how similar two sequences are.
- Alignment: the basic operation for comparing two sequences; a way of placing one sequence above the other to make clear the correspondence between similar characters or substrings of the sequences.

3.1 Biological background
An alignment is a correspondence between the elements of two sequences that keeps their order (topology).
- Pairwise alignment: 2 sequences aligned.
- Multiple alignment: alignment of 3 or more sequences.

FSEYTTHRGHR
:  ::::: ::
FESYTTHRPHR

FESYTTHRGHR
:::::::: ::
FESYTTHRPHR

3.1 Biological background
Four cases, with their tasks and applications:
(1) Two sequences of tens of thousands (10,000) of characters that differ only in isolated places, by insertions, deletions, or substitutions occurring as rarely as once every hundred (100) characters. The task is to find the differences, for example between two different sequencings.
(2) Two sequences: find out whether a prefix of one is similar to a suffix of the other.
(3) Two sequences: find out whether there are two similar substrings, one from each sequence, for example to analyze conserved sequences.

3.2 Comparing two sequences
Alignments involve:
- global comparisons: the entire sequences;
- local comparisons: just substrings of the sequences.
(Liao L, Noble WS: Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships. J Comput Biol 2003, 10(6):857-868.)

3.2.1 Global comparison: example
Aligning GACGGATTAG and GATCGGAATAG:

GA-CGGATTAG
GATCGGAATAG

The second sequence has an extra T, and one position shows a change from A to T. The space is written as a dash.

3.2.1 Global comparison: the basic algorithm
Definitions
- Alignment: spaces are inserted into the sequences so that both have the same size, and the sequences are placed one over the other to create a correspondence. A column with spaces in both sequences is not allowed. (Spaces can be inserted at the beginning or the end.)
- Scoring function: a measure of similarity between elements (nucleotides, amino acids, gaps):
  a match (identical characters): +1
  a mismatch (distinct characters): -1

3.2.1 Global comparison: the basic algorithm
  a space: -2
- Scoring system: rewards matches and penalizes mismatches and spaces.

GA-CGGATTAG
GATCGGAATAG

What is the score?
- Similarity: sim(s, t) is the maximum alignment score; many alignments may attain it.
- Best alignment: an alignment whose score equals the similarity.
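The question on this slide can be answered by summing column scores. The sketch below assumes the scoring system stated in the slides (match +1, mismatch -1, space -2); the function name `alignment_score` is hypothetical.

```python
def alignment_score(row_s: str, row_t: str,
                    match: int = 1, mismatch: int = -1, space: int = -2) -> int:
    """Add up the column scores of an alignment given as two equal-length rows."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same size")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" and b == "-":
            raise ValueError("a column may not contain two spaces")
        if a == "-" or b == "-":
            total += space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# 9 matches, 1 mismatch and 1 space: 9 - 1 - 2
print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # 6
```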

3.2.1 Global comparison: the basic algorithm (Needleman-Wunsch)
Basic DP algorithm for the comparison of two sequences
- The number of alignments between two sequences is exponential.
- An efficient algorithm uses dynamic programming (DP): solve the problem for prefixes, from shorter to larger.
An example: s = AAAC, t = AGC.

Idea
- Use an (m+1)×(n+1) array: entry (i, j) is the similarity between s[1..i] and t[1..j].
- p(i, j) = +1 if s[i] = t[j], and -1 if s[i] ≠ t[j].

3.2.1 Global comparison: the basic algorithm
The array a for s = AAAC (rows) and t = AGC (columns), with match +1, mismatch -1, and space -2:

            A    G    C
       0   -2   -4   -6
  A   -2    1   -1   -3
  A   -4   -1    0   -2
  A   -6   -3   -2   -1
  C   -8   -5   -4   -1

3.2.1 Global comparison: the basic algorithm
2. A good computing order:
- row by row, left to right on each row;
- column by column, top to bottom on each column;
- any other order that makes sure a[i, j-1], a[i-1, j], and a[i-1, j-1] are available when a[i, j] must be computed.
3. Notes:
- Parameter g specifies the space penalty (usually g < 0); here g = -2.
- Scoring function p for pairs of characters: p(a, b) = 1 if a = b, and p(a, b) = -1 if a ≠ b.

3.2.1 Global comparison: the basic algorithm
Algorithm Similarity
  input: sequences s and t
  output: similarity between s and t
  m ← |s|
  n ← |t|
  for i ← 0 to m do
    a[i, 0] ← i × g
  for j ← 0 to n do
    a[0, j] ← j × g
  for i ← 1 to m do
    for j ← 1 to n do
      a[i, j] ← max(a[i, j-1] + g, a[i-1, j-1] + p(i, j), a[i-1, j] + g)
  return a[m, n]
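Algorithm Similarity translates almost line by line into Python. This is a sketch, not a reference implementation: it returns the whole array (so that the last entry is the similarity), uses 0-based indexing, and takes the character scoring function `p` as a parameter defaulting to the +1/-1 scheme of the slides.

```python
def similarity_matrix(s: str, t: str, g: int = -2,
                      p=lambda a, b: 1 if a == b else -1):
    """Fill the (m+1) x (n+1) array; a[i][j] = sim(s[1..i], t[1..j])."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g              # s[1..i] against the empty string
    for j in range(n + 1):
        a[0][j] = j * g              # the empty string against t[1..j]
    for i in range(1, m + 1):        # row by row, left to right
        for j in range(1, n + 1):
            a[i][j] = max(a[i][j - 1] + g,
                          a[i - 1][j - 1] + p(s[i - 1], t[j - 1]),
                          a[i - 1][j] + g)
    return a


a = similarity_matrix("AAAC", "AGC")
for row in a:
    print(row)
print("sim =", a[-1][-1])  # sim = -1
```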

Optimal alignments
How do we construct an optimal alignment between two sequences once the similarity has been computed?
Idea of Algorithm Align
- All we need to do is start at entry (m, n) and follow the arrows until we get to (0, 0).
- An optimal alignment can be easily constructed from right to left if we have the matrix a computed by the basic algorithm.
- The variables align-s and align-t are treated as globals in the code.
- The call Align(m, n, len) will construct an optimal alignment.
- Note: max(|s|, |t|) ≤ len ≤ m + n.

Recursive algorithm for optimal alignment
Algorithm Align
  input: indices i, j, array a given by algorithm Similarity
  output: alignment in align-s, align-t, and length in len
  if i = 0 and j = 0 then
    len ← 0
  else if i > 0 and a[i, j] = a[i-1, j] + g then
    Align(i-1, j, len)
    len ← len + 1
    align-s[len] ← s[i]
    align-t[len] ← -
  else if i > 0 and j > 0 and a[i, j] = a[i-1, j-1] + p(i, j) then
    Align(i-1, j-1, len)
    len ← len + 1
    align-s[len] ← s[i]
    align-t[len] ← t[j]
  else // has to be j > 0 and a[i, j] = a[i, j-1] + g
    Align(i, j-1, len)
    len ← len + 1
    align-s[len] ← -
    align-t[len] ← t[j]
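A runnable sketch of the same traceback follows. It swaps the recursion for a loop that collects columns from right to left and reverses them at the end, and it returns the two rows instead of writing to globals; the three tests are made in the same order as in Algorithm Align. The helper names `fill` and `align` are illustrative.

```python
def p(a: str, b: str) -> int:
    """Match/mismatch score for two characters."""
    return 1 if a == b else -1


def fill(s: str, t: str, g: int = -2):
    """The array of Algorithm Similarity."""
    m, n = len(s), len(t)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        a[i][0] = i * g
    for j in range(n + 1):
        a[0][j] = j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a[i][j] = max(a[i][j - 1] + g,
                          a[i - 1][j - 1] + p(s[i - 1], t[j - 1]),
                          a[i - 1][j] + g)
    return a


def align(s: str, t: str, g: int = -2):
    """Walk from entry (m, n) back to (0, 0), collecting columns right to left."""
    a = fill(s, t, g)
    i, j = len(s), len(t)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and a[i][j] == a[i - 1][j] + g:
            top.append(s[i - 1]); bottom.append("-")        # space in t
            i -= 1
        elif i > 0 and j > 0 and a[i][j] == a[i - 1][j - 1] + p(s[i - 1], t[j - 1]):
            top.append(s[i - 1]); bottom.append(t[j - 1])   # two symbols
            i -= 1; j -= 1
        else:                                               # space in s
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


row_s, row_t = align("AAAC", "AGC")
print(row_s)  # AAAC
print(row_t)  # AG-C
```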

Optimal alignments: arrow preference
- When there is a choice, a column with a space in t (maximum preference) has precedence over a column with two symbols, which in turn has precedence over a column with a space in s (minimum preference).
- For s = AAAC and t = AGC this gives

  AAAC                 AAAC
  AG-C   rather than   -AGC

Optimal alignments: complexity
Time and space complexity of the algorithms:
- Basic dynamic programming (computing Similarity for two sequences): O(mn), or O(n^2) when both sequences have length n.
- Recursive algorithm for optimal alignment: O(len) = O(m + n).

Scoring matrix
Scoring function:
  p(a, a) = 1
  p(a, b) = -1
  p(a, -) = p(-, b) = -2

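Nucleic acid matrix (1): the BLAST matrix, the default matrix for BLAST. Rows and columns are A, T, C, G; each diagonal entry (match) is 5, and the off-diagonal entry shown (A against T) is -4.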

Nucleic acid matrix (1): the transfer matrix. Rows and columns are A, T, C, G; each diagonal entry (match) is 1, and the off-diagonal entries shown are -5.

Protein substitution matrices
- PAM (Point Accepted Mutation): a point accepted mutation, also known as a PAM, is the replacement of a single amino acid in the primary structure of a protein with another single amino acid, which is accepted by the processes of natural selection.
- BLOSUM (BLOcks SUbstitution Matrix): a substitution matrix used for sequence alignment of proteins.

Blocks Substitution Matrix (BLOSUM)
- Scores are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins.
- The matrix name indicates evolutionary distance: BLOSUM62 was created using sequences sharing no more than 62% identity.

Example: global alignment of VDSCY and VESLCY with BLOSUM62 and gap penalty d = -11.

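Global comparison (1): set up the table with VDSCY across the top and VESLCY down the side. The first row and the first column stand for 0 gaps, 1 gap, 2 gaps, and so on.

Global comparison (2): initialize the first row and column with multiples of the gap penalty d = -11.

            V    D    S    C    Y
       0  -11  -22  -33  -44  -55
  V  -11
  E  -22
  S  -33
  L  -44
  C  -55
  Y  -66

Global comparison (3): fill each remaining cell S_{i,j} with the Needleman-Wunsch recurrence

$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + \sigma(x_i, y_j) \\ S_{i-1,j} + d & \text{(up to down)} \\ S_{i,j-1} + d & \text{(left to right)} \end{cases}$$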

[Figure: BLOSUM62 substitution matrix]

Global comparison (4): the first cell, V against V. The diagonal move gives 0 + σ(V, V) = 0 + 4 = 4, and each gap move gives -11 + (-11) = -22, so the cell holds 4.

Global comparison (5): the next cell, V against D. BLOSUM62 gives σ(V, D) = -3, so the diagonal move gives -11 + (-3) = -14; the move from the left gives 4 + (-11) = -7; the move from above gives -22 + (-11) = -33. The cell holds -7.

Result:

V D S - C Y
V E S L C Y

            V    D    S    C    Y
       0  -11  -22  -33  -44  -55
  V  -11    4   -7  -18  -29  -40
  E  -22   -7    6   -5  -16  -27
  S  -33  -18   -5   10   -1  -12
  L  -44  -29  -16   -1    9   -2
  C  -55  -40  -27  -12    8    7
  Y  -66  -51  -38  -23   -3   15
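The worked example can be reproduced with a short script. This is a sketch under stated assumptions: `PAIRS` holds only the BLOSUM62 entries needed for these two sequences (typed in by hand, not loaded from a library), the gap penalty is the linear d = -11 of the slides, and the names `SIGMA` and `needleman_wunsch` are illustrative.

```python
# BLOSUM62 entries for the residue pairs that occur in this example only.
PAIRS = ("VV:4 VE:-2 VS:-2 VL:1 VC:-1 VY:-1 DV:-3 DE:2 DS:0 DL:-4 DC:-3 DY:-3 "
         "SE:0 SS:4 SL:-2 SC:-1 SY:-2 CE:-4 CL:-1 CC:9 CY:-2 YE:-2 YL:-1 YY:7")
SIGMA = {}
for item in PAIRS.split():
    pair, value = item.split(":")
    SIGMA[pair] = SIGMA[pair[::-1]] = int(value)   # the matrix is symmetric


def needleman_wunsch(x: str, y: str, d: int = -11):
    """Global alignment of x (columns) and y (rows) with linear gap penalty d."""
    rows, cols = len(y), len(x)
    S = [[0] * (cols + 1) for _ in range(rows + 1)]
    for j in range(1, cols + 1):
        S[0][j] = j * d
    for i in range(1, rows + 1):
        S[i][0] = i * d
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            S[i][j] = max(S[i - 1][j - 1] + SIGMA[y[i - 1] + x[j - 1]],
                          S[i - 1][j] + d,       # up to down
                          S[i][j - 1] + d)       # left to right
    # Trace back from the bottom-right corner to (0, 0).
    i, j, ax, ay = rows, cols, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + SIGMA[y[i - 1] + x[j - 1]]:
            ax.append(x[j - 1]); ay.append(y[i - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + d:
            ax.append("-"); ay.append(y[i - 1]); i -= 1
        else:
            ax.append(x[j - 1]); ay.append("-"); j -= 1
    return S[rows][cols], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("VDSCY", "VESLCY")
print(score)   # 15
print(top)     # VDS-CY
print(bottom)  # VESLCY
```

The score is the sum of the column scores of the alignment: 4 + 2 + 4 - 11 + 9 + 7 = 15.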

3.2.2 Local comparison
Problem:
- A local alignment between s and t is an alignment between a substring of s and a substring of t.
Algorithm: how do we find the highest-scoring local alignment between two sequences?
Example: AATC AATC Which one is better? AAT - AACT

3.2.2 Local comparison
Idea:
- Data structure: an (m+1)×(n+1) array; entry (i, j) holds the highest score of an alignment between a suffix of s[1..i] and a suffix of t[1..j].
- Initialization: the first row and column are initialized with zeros, because for any entry (i, j) there is always the alignment between the empty suffixes of s[1..i] and t[1..j], which has score zero.

Local comparison: an example
Find the optimal alignment of LDSCH and GESLCK.

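Local comparison (1): set up the table with LDSCH across the top and GESLCK down the side, and initialize the first row and column with zeros. The gap penalty is d = -11.

         L   D   S   C   H
     0   0   0   0   0   0
  G  0
  E  0
  S  0
  L  0
  C  0
  K  0

Each remaining cell S_{i,j} is filled with the Smith-Waterman recurrence

$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + p(x_i, y_j) \\ S_{i-1,j} + d & \text{(up to down)} \\ S_{i,j-1} + d & \text{(left to right)} \\ 0 \end{cases}$$

Local comparison (2): the first cell, G against L. BLOSUM62 gives p(G, L) = -4, so the diagonal move gives -4 and each gap move gives -11. All three are negative, so the cell holds 0.

Local comparison (3): the next cell, G against D, also holds 0: the gap moves give -11 and the diagonal move is negative as well.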

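Result (gap -11), with every cell filled by the Smith-Waterman recurrence:

         L   D   S   C   H
     0   0   0   0   0   0
  G  0   0   0   0   0   0
  E  0   0   2   0   0   0
  S  0   0   0   6   0   0
  L  0   4   0   0   5   0
  C  0   0   1   0   9   2
  K  0   0   0   1   0   8

L D S - C H
G E S L C K

Does it make sense?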

Local comparison score
1. Smith-Waterman score = 9

L D S - C H
G E S L C K

Revisiting this example
Change the gap penalty from -11 to -4. What will we get?
Find the optimal alignment of LDSCH and GESLCK.

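Local comparison (1): the same table, again initialized with zeros in the first row and column, now with gap penalty d = -4.

         L   D   S   C   H
     0   0   0   0   0   0
  G  0
  E  0
  S  0
  L  0
  C  0
  K  0

$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + p(x_i, y_j) \\ S_{i-1,j} + d & \text{(up to down)} \\ S_{i,j-1} + d & \text{(left to right)} \\ 0 \end{cases}$$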

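Result (gap -4), with every cell filled by the Smith-Waterman recurrence:

         L   D   S   C   H
     0   0   0   0   0   0
  G  0   0   0   0   0   0
  E  0   0   2   0   0   0
  S  0   0   0   6   2   0
  L  0   4   0   2   5   1
  C  0   0   1   0  11   7
  K  0   0   0   1   7  10

L D S - C H
G E S L C K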

Local comparison score
1. Smith-Waterman score = 11

L D S - C H
G E S L C K
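Both runs of the local example can be reproduced with the sketch below. Assumptions: `PAIRS` lists only the BLOSUM62 entries needed for LDSCH against GESLCK (typed in by hand), the gap penalty is linear, and the traceback starts at the highest cell and stops at the first 0, so it returns only the substrings that take part in the best local alignment. The function name `smith_waterman` is illustrative.

```python
# BLOSUM62 entries for the residue pairs that occur in this example only.
PAIRS = ("LG:-4 LE:-3 LS:-2 LL:4 LC:-1 LK:-2 DG:-1 DE:2 DS:0 DL:-4 DC:-3 DK:-1 "
         "SG:0 SE:0 SS:4 SC:-1 SK:0 CG:-3 CE:-4 CC:9 CK:-3 "
         "HG:-2 HE:0 HS:-1 HL:-3 HC:-3 HK:-1")
P = {}
for item in PAIRS.split():
    pair, value = item.split(":")
    P[pair] = P[pair[::-1]] = int(value)   # the matrix is symmetric


def smith_waterman(x: str, y: str, d: int):
    """Best local alignment of x (columns) and y (rows) with linear gap penalty d."""
    rows, cols = len(y), len(x)
    S = [[0] * (cols + 1) for _ in range(rows + 1)]   # first row/column stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            S[i][j] = max(0,
                          S[i - 1][j - 1] + P[y[i - 1] + x[j - 1]],
                          S[i - 1][j] + d,
                          S[i][j - 1] + d)
            if S[i][j] > best:
                best, best_i, best_j = S[i][j], i, j
    # Trace back from the highest cell until a cell holding 0 is reached.
    i, j, ax, ay = best_i, best_j, [], []
    while S[i][j] > 0:
        if S[i][j] == S[i - 1][j - 1] + P[y[i - 1] + x[j - 1]]:
            ax.append(x[j - 1]); ay.append(y[i - 1]); i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + d:
            ax.append("-"); ay.append(y[i - 1]); i -= 1
        else:
            ax.append(x[j - 1]); ay.append("-"); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("LDSCH", "GESLCK", d=-11))  # (9, 'C', 'C')
print(smith_waterman("LDSCH", "GESLCK", d=-4))   # (11, 'DS-C', 'ESLC')
```

With d = -11 a gap costs more than any run of matches here can recover, so the best local alignment is the single column C against C. With d = -4 the gap opposite L becomes affordable and the local alignment grows to DS-C over ESLC, scoring 2 + 4 - 4 + 9 = 11.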

3.4 Comparing multiple sequences
Multiple sequence alignments are used for many reasons, including:
(1) to detect regions of variability or conservation in a family of proteins;
(2) to provide stronger evidence than pairwise similarity for structural and functional inferences.

3.4 Comparing multiple sequences: motivation
- A multiple alignment (MA) of sequences s_1, …, s_k shows which parts of the sequences are similar and which parts are different.
- Multiple alignment is a generalization of pairwise alignment and uses similar operations; no column may be made exclusively of spaces.

3.4 Comparing multiple sequences
- Multiple alignments are more common with amino acid sequences (proteins).
- How do we evaluate different MAs of the same set of sequences?

3.4 Comparing multiple sequences
Scoring scheme:
- (1) SP measure: scoring an alignment based on pairwise alignments.
- (2) Star alignment.

3.4.1 The SP (sum-of-pairs) measure
Scoring a multiple alignment
- Additive functions are used here.
- "Reasonable" properties:
  (1) The function is independent of the order of the sequences, i.e. SP(I, -, I, V) = SP(V, I, I, -).
  (2) It rewards the presence of many equal or strongly related residues and penalizes unrelated residues and spaces.

3.4.1 The SP measure
The sum-of-pairs (SP) function is a function that meets the two properties. For example, with match = 1, mismatch = -1, and gap = -2:

SP(I, -, I, V) = p(I, -) + p(I, I) + p(I, V) + p(-, I) + p(-, V) + p(I, V)
               = -2 + 1 + (-1) + (-2) + (-2) + (-1)
               = -7

3.4.1 The SP measure
Although there is never an entire column of gaps, any two sequences taken from the alignment may both have gaps in the same column. Such a pair scores p(-, -) = 0.
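The column score can be sketched in a few lines, assuming the pair scores used on these slides (match 1, mismatch -1, gap -2, and 0 for two spaces). The names `pair_score` and `sp_column` are illustrative.

```python
from itertools import combinations


def pair_score(a: str, b: str) -> int:
    """Score of one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return 0          # two spaces: p(-, -) = 0
    if a == "-" or b == "-":
        return -2         # one space
    return 1 if a == b else -1


def sp_column(column: str) -> int:
    """Sum-of-pairs score of a single column of a multiple alignment."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


print(sp_column("I-IV"))  # -7
print(sp_column("VII-"))  # -7, the order of the sequences does not matter
```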

3.4.1 The SP measure
Induced pairwise alignment (projection of a multiple alignment): in a multiple alignment α, select two of the sequences, forget all the rest, and remove the columns with two spaces to derive a true pairwise alignment (PA).

The two selected rows:
PEAALYGRFT---IKSDVM
PEALNYGRY---SSESDVW

The induced pairwise alignment:
PEAALYGRFT-IKSDVM
PEALNYGRY-SSESDVW

α_ij: the PA induced by α on s_i and s_j.
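Projecting two rows of a multiple alignment amounts to dropping the columns where both rows hold a space. A minimal sketch, with the hypothetical helper name `induced_pairwise`:

```python
def induced_pairwise(row_i: str, row_j: str):
    """Keep every column except those where both rows hold a space."""
    kept = [(a, b) for a, b in zip(row_i, row_j) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


top, bottom = induced_pairwise("PEAALYGRFT---IKSDVM", "PEALNYGRY---SSESDVW")
print(top)     # PEAALYGRFT-IKSDVM
print(bottom)  # PEALNYGRY-SSESDVW
```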

3.4.1 The SP measure: summary
- Way 1: compute the score of each column, and then add all the column scores.
- Way 2: compute the scores of the induced PAs, and then add these scores.
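The two ways give the same total because a pair of spaces contributes 0. The sketch below checks this on a small made-up alignment (the three rows in `ROWS` are hypothetical, not taken from the slides), using match 1, mismatch -1, gap -2.

```python
from itertools import combinations


def pair_score(a: str, b: str) -> int:
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


def sp_by_columns(rows):
    """Way 1: score each column, then add the column scores."""
    return sum(sum(pair_score(a, b) for a, b in combinations(col, 2))
               for col in zip(*rows))


def sp_by_pairs(rows):
    """Way 2: score each induced pairwise alignment, then add these scores."""
    total = 0
    for r1, r2 in combinations(rows, 2):
        # Induced PA: drop the columns where both rows hold a space.
        induced = [(a, b) for a, b in zip(r1, r2) if (a, b) != ("-", "-")]
        total += sum(pair_score(a, b) for a, b in induced)
    return total


ROWS = ["ATG--C",   # hypothetical multiple alignment of three sequences
        "A-G--C",
        "ATGTTC"]
print(sp_by_columns(ROWS), sp_by_pairs(ROWS))
```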

3.4.2 Star alignments
A heuristic method for multiple sequence alignment:
- Select a sequence s_c as the center of the star.
- For each sequence s_i among s_1, …, s_k with i ≠ c, perform a global alignment of s_i with s_c.
- Aggregate the alignments with the principle "once a gap, always a gap."

3.4.2 Star alignments
For example, say your sequences are:
S1 ATTGCCATT
S2 ATGGCCATT
S3 ATCCAATTTT
S4 ATCTTCTT
S5 ACTGACC
(1) Find the center sequence.
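The slides do not state how the center is chosen. A common rule, assumed here, is to pick the sequence whose pairwise global similarities to all the others add up to the largest total. The sketch also assumes the earlier scoring scheme (match +1, mismatch -1, gap -2); `similarity` and `pick_center` are illustrative names.

```python
def similarity(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global similarity by dynamic programming, keeping only two rows."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(curr[j - 1] + gap, prev[j - 1] + p, prev[j] + gap))
        prev = curr
    return prev[-1]


def pick_center(seqs: dict):
    """Assumed rule: the center maximizes the sum of similarities to the others."""
    totals = {
        name: sum(similarity(s, other)
                  for other_name, other in seqs.items() if other_name != name)
        for name, s in seqs.items()
    }
    return max(totals, key=totals.get), totals


SEQS = {
    "S1": "ATTGCCATT",
    "S2": "ATGGCCATT",
    "S3": "ATCCAATTTT",
    "S4": "ATCTTCTT",
    "S5": "ACTGACC",
}
center, totals = pick_center(SEQS)
print("center:", center)
print("totals:", totals)
```

A different scoring scheme or selection rule may pick a different center, so the printed result depends on these assumptions.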

3.4.2 Star alignments
(2) Do the pairwise alignments.

3.4.2 Star alignments
(3) Build the multiple alignment.
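One way to aggregate pairwise alignments under "once a gap, always a gap" is sketched below. The pairwise alignments are taken as given, and the center "ACGT" with its three partners is a made-up example, not the S1 to S5 set above. `merge_into` is a hypothetical helper: a gap already present in the center row is kept, and a gap newly required by a pairwise alignment is inserted into every row built so far.

```python
def merge_into(rows, center_aln, other_aln):
    """Add one sequence to the multiple alignment.

    rows[0] is the center as currently gapped; (center_aln, other_aln) is a
    pairwise alignment of the ungapped center with the new sequence.
    """
    old_center = rows[0]
    out = [[] for _ in rows]
    new_row = []
    p = q = 0
    while p < len(old_center) or q < len(center_aln):
        old_gap = p < len(old_center) and old_center[p] == "-"
        new_gap = q < len(center_aln) and center_aln[q] == "-"
        if old_gap and not new_gap:
            # Gap already in the alignment: keep the column, pad the new row.
            for k, row in enumerate(rows):
                out[k].append(row[p])
            new_row.append("-")
            p += 1
        elif new_gap and not old_gap:
            # New gap in the center: open a column of spaces in every old row.
            for k in range(len(rows)):
                out[k].append("-")
            new_row.append(other_aln[q])
            q += 1
        else:
            # Same center symbol (or a gap on both sides): copy the column.
            for k, row in enumerate(rows):
                out[k].append(row[p])
            new_row.append(other_aln[q])
            p += 1
            q += 1
    return ["".join(r) for r in out] + ["".join(new_row)]


# Hypothetical pairwise alignments of the center "ACGT" with three sequences.
pairwise = [("AC-GT", "ACTGT"), ("ACGT", "A-GT"), ("-ACGT", "TACGT")]
rows = ["ACGT"]
for center_aln, other_aln in pairwise:
    rows = merge_into(rows, center_aln, other_aln)
for row in rows:
    print(row)
```

Each gap in the center row, once introduced, stays in that row and is propagated to every later sequence, which is what keeps all the induced center-to-sequence alignments intact.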
