Centre for Integrative Bioinformatics VU
Sequence Analysis
Alignments 1

Searching for similarities

What is the function of a new gene?
The “lazy” investigation:
- Find a set of similar proteins
- Identify similarities and differences
- For long proteins: identify domains

Domains are structural units in a protein's tertiary structure and often provide a specific (sub)function to the complete protein.

Is similarity really interesting?

Common ancestry is a very important observation: it makes it more likely that genes share the same function.
Homology: sharing a common ancestor
- A binary property (yes/no)
- It's a nice tool: when a known gene G is homologous to an unknown gene X, we gain a lot of information about X

[Figure: tree relating Z, X and G]

Functional and evolutionary

Evolutionary relation, reconstruction:
- Based on sequence
    Identity (simplest method)
    Similarity
- Homology (the ultimate goal)
- Other (e.g., 3D structure)

Functional relation:
Sequence -> Structure -> Function (each arrow reads "determines")

Evolution and 3D protein structure information

Isocitrate dehydrogenase: the distance from the active site (yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution).
Dean, A. M. and G. B. Golding, Pacific Symposium on Biocomputing 2000

How to determine similarity?

Frequent evolutionary events:
1. Substitution
2. Insertion, deletion
3. Duplication
4. Inversion
We'll use only the first two (substitution, insertion/deletion).

Evolution at work
[Figure: tree with common ancestor Z (usually extinct) and descendants X and Y (available)]

Alignment

Mutations: substitution, insertion and deletion
Which alignment is better? Use common sense and call it:
- Simplest
- Most probable
- Maximum likelihood

Scoring

Scoring should give reasonable alignments.
We have to assign scores to:
- Substitution (or match/mismatch)
    DNA
    Proteins
- Gap penalty
    Linear: g(k) = αk
    Affine: g(k) = β + αk
    Concave, e.g.: g(k) = log(k)

The score for an alignment is the sum of the scores of all alignment columns.
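
The three gap penalty shapes named on this slide can be sketched as plain functions of the gap length k. The function names, the parameter names `alpha` (cost per gapped residue) and `beta` (cost of opening a gap) and the default values are assumptions made for illustration; the slide fixes only the shape of each function.

```python
import math


def linear_gap(k, alpha=2.0):
    """Linear penalty: every gapped residue costs the same (alpha is an assumed value)."""
    return alpha * k


def affine_gap(k, alpha=1.0, beta=2.0):
    """Affine penalty: a fixed opening cost beta plus alpha per residue (assumed values)."""
    return beta + alpha * k if k > 0 else 0.0


def concave_gap(k):
    """Concave penalty, using the slide's example g(k) = log(k)."""
    return math.log(k) if k > 0 else 0.0


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(k, linear_gap(k), affine_gap(k), round(concave_gap(k), 3))
```

With a concave penalty each extra residue in a gap costs less than the previous one, so long gaps are penalized comparatively mildly.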

Substitution matrices

Define a score for match/mismatch of letters.
DNA:
- Simple: [matrix not preserved]
- Used in genome alignments: [matrix not preserved]

Substitution matrices for amino acids

Amino acids are not equal:
1. Some are easily substituted because they are similar in:
    biochemical properties
    structure
2. Some mutations occur more often due to similar codons

The two observations above give us substitution matrices.

Blosum62 matrix

# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = , Expected =
[Table: 20 x 20 scores with rows and columns in the order A R N D C Q E G H I L K M F P S T W Y V; values not preserved]

Linear vs. affine scoring

Gap scoring   Introduction   Extension
Linear        -2             -2
Affine        -3             -1
… and +1 for a match

Alignment 1:
Seq1     G   T   A   -   -
Seq2     -   -   A   T   G
Linear  -2  -2  +1  -2  -2   (SUM = -7)
Affine  -3  -1  +1  -3  -1   (SUM = -7)

Alignment 2:
Seq1     G   -   T   -   A
Seq2     -   A   T   G   -
Linear  -2  -2  +1  -2  -2   (SUM = -7)
Affine  -3  -3  +1  -3  -3   (SUM = -11)
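
The sums on this slide can be checked with a small column-by-column scorer. This is a sketch: the function name `score_alignment` is hypothetical, and the `mismatch` value is an assumption that the two example alignments never use.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap_open=-2, gap_extend=-2):
    """Sum the column scores of an alignment given as two equal-length rows.

    A gap column costs gap_open when it starts a run of gaps in one row and
    gap_extend when it continues a run in that same row.
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    previous_gap_row = None  # 1 or 2 if the previous column had a gap in that row
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            gap_row = 1 if a == "-" else 2
            total += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            total += match if a == b else mismatch
            previous_gap_row = None
    return total


if __name__ == "__main__":
    for rows in (("GTA--", "--ATG"), ("G-T-A", "-ATG-")):
        linear = score_alignment(*rows, gap_open=-2, gap_extend=-2)
        affine = score_alignment(*rows, gap_open=-3, gap_extend=-1)
        print(rows, "linear:", linear, "affine:", affine)
```

Linear scoring cannot tell the two alignments apart (both -7), while affine scoring prefers the one with fewer, longer gaps (-7 against -11).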

The algorithm

Goal: find the maximal-scoring alignment
Scores: m for a match, -s for a mismatch, -g for an insertion/deletion
Dynamic programming:
- Solve smaller subproblem(s)
- Iteratively extend the solution

The best alignment of X[1…i] and Y[1…j] has score M[i, j].
[Figure: an alignment of X[1…i] with Y[1…j] that contains gaps]

The algorithm

Goal: find the maximal-scoring alignment
Scores: m for a match, -s for a mismatch, -g for an insertion/deletion
The best alignment of X[1…i] and Y[1…j] has score M[i, j].

3 ways to extend the alignment:

X[1…i-1] X[i]      X[1…i]   -         X[1…i-1] X[i]
Y[1…j-1] Y[j]      Y[1…j-1] Y[j]      Y[1…j]   -

M[i,j] = max of:
  M[i-1,j-1] + m (match) or M[i-1,j-1] - s (mismatch)
  M[i,j-1] - g
  M[i-1,j] - g

The algorithm for linear gap penalties

M[i,j] = max of:
  M[i-1,j-1] + score(X[i],Y[j])
  M[i,j-1] - g
  M[i-1,j] - g

Corresponds to:

X[1…i-1] X[i]      X[1…i]   -         X[1…i-1] X[i]
Y[1…j-1] Y[j]      Y[1…j-1] Y[j]      Y[1…j]   -

score(X[i],Y[j]) is the value from the residue exchange matrix.
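
A minimal sketch of this recurrence follows. The function name `fill_matrix`, the callable `score` and the toy sequences are assumptions for illustration; the code uses 0-based Python strings, so X[i] on the slide is `x[i - 1]` here.

```python
def fill_matrix(x, y, score, g):
    """Return M, where M[i][j] is the best score for aligning x[:i] with y[:j].

    score(a, b) plays the role of the residue exchange matrix and g is the
    linear gap penalty, given as a positive number.
    """
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] - g  # x[:i] aligned against gaps only
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] - g  # y[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # X[i] over Y[j]
                M[i][j - 1] - g,                              # gap over Y[j]
                M[i - 1][j] - g,                              # X[i] over gap
            )
    return M


if __name__ == "__main__":
    identity = lambda a, b: 1 if a == b else -1
    M = fill_matrix("AC", "AGC", identity, g=2)
    print(M[-1][-1])  # 0: A/A match, one gap opposite G, C/C match
```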

Example: global alignment of two sequences

Align two DNA sequences:
- GAGTGA
- GAGGCGA (note the length difference)

Parameters of the algorithm:
- Match: score(A,A) = 1
- Mismatch: score(A,T) = -1
- Gap: g = 2

M[i,j] = max of:
  M[i-1,j-1] ± 1 (+1 for a match, -1 for a mismatch)
  M[i,j-1] - 2
  M[i-1,j] - 2

The algorithm. Step 1: init

Create the matrix.
Initiation:
- 0 at [0,0]
- Apply the equation…

[Matrix: columns j labelled - G A G T G A, rows i labelled - G A G G C G A]

The algorithm. Step 1: init

Initiation of the matrix:
- 0 at pos [0,0]
- Fill in the first row using the "M[i,j-1] - 2" rule (only the cell to the left exists)
- Fill in the first column using the "M[i-1,j] - 2" rule (only the cell above exists)

        -    G    A    G    T    G    A
   -    0   -2   -4   -6   -8  -10  -12
   G   -2
   A   -4
   G   -6
   G   -8
   C  -10
   G  -12
   A  -14
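
Step 1 on its own can be sketched as below. The function name `init_matrix` is hypothetical, and `None` marks the cells that step 2 still has to fill.

```python
def init_matrix(row_seq, col_seq, g):
    """Create the DP matrix and fill only its first row and first column."""
    M = [[None] * (len(col_seq) + 1) for _ in range(len(row_seq) + 1)]
    M[0][0] = 0
    for j in range(1, len(col_seq) + 1):
        M[0][j] = M[0][j - 1] - g  # first row: each value comes from the left
    for i in range(1, len(row_seq) + 1):
        M[i][0] = M[i - 1][0] - g  # first column: each value comes from above
    return M


if __name__ == "__main__":
    M = init_matrix("GAGGCGA", "GAGTGA", g=2)
    print(M[0])                   # [0, -2, -4, -6, -8, -10, -12]
    print([row[0] for row in M])  # [0, -2, -4, -6, -8, -10, -12, -14]
```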

The algorithm. Step 2: fill in

Continue filling in the matrix, remembering from which cell each result comes (arrows).

[Matrix: partially filled in, with arrows pointing to the cell each value came from]
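
The algorithm. Step 2: fill in

We are done… Where's the result?

Matrix values recomputed from the recurrence with match = 1, mismatch = -1, g = 2:

        -    G    A    G    T    G    A
   -    0   -2   -4   -6   -8  -10  -12
   G   -2    1   -1   -3   -5   -7   -9
   A   -4   -1    2    0   -2   -4   -6
   G   -6   -3    0    3    1   -1   -3
   G   -8   -5   -2    1    2    2    0
   C  -10   -7   -4   -1    0    1    1
   G  -12   -9   -6   -3   -2    1    0
   A  -14  -11   -8   -5   -4   -1    2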

The algorithm. Step 3: backtrace

- Start at the last cell of the matrix
- Go in the direction of the arrows
- Sometimes the value may be obtained from more than one cell (which one?)

The algorithm. Step 3: backtrace

Extract the alignments:

a) GAGT-GA
   GAGGCGA

b) GA-GTGA
   GAGGCGA

[Matrix: the filled-in matrix with the backtrace paths a and b marked]
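
The backtrace can be sketched as a recursive walk from the last cell that follows every predecessor able to explain a cell's value, so ties produce several alignments. The function name `align_all` is hypothetical. Instead of storing arrows, this sketch re-checks the recurrence at each cell, which is an implementation choice and not something the slides prescribe.

```python
def align_all(top, side, match=1, mismatch=-1, g=2):
    """Return (best score, list of all co-optimal global alignments)."""
    n, m = len(side), len(top)

    def s(a, b):
        return match if a == b else mismatch

    # Steps 1 and 2: initialise and fill the matrix (rows: side, columns: top).
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = -g * i
    for j in range(1, m + 1):
        M[0][j] = -g * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + s(side[i - 1], top[j - 1]),
                          M[i][j - 1] - g,
                          M[i - 1][j] - g)

    # Step 3: walk back from the last cell, building both rows right to left.
    found = []

    def walk(i, j, top_row, side_row):
        if i == 0 and j == 0:
            found.append((top_row, side_row))
            return
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s(side[i - 1], top[j - 1]):
            walk(i - 1, j - 1, top[j - 1] + top_row, side[i - 1] + side_row)
        if i > 0 and M[i][j] == M[i - 1][j] - g:
            walk(i - 1, j, "-" + top_row, side[i - 1] + side_row)
        if j > 0 and M[i][j] == M[i][j - 1] - g:
            walk(i, j - 1, top[j - 1] + top_row, "-" + side_row)

    walk(n, m, "", "")
    return M[n][m], found


if __name__ == "__main__":
    best, alignments = align_all("GAGTGA", "GAGGCGA")
    print("score:", best)
    for top_row, side_row in alignments:
        print(top_row)
        print(side_row)
        print()
```

For the example sequences the best score is 2, and the walk returns alignments a) and b) together with a third alignment of the same score, GAG-TGA over GAGGCGA.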

Global dynamic programming: general algorithm

M[i,j] = score(X[i],Y[j]) + max of:
  M[i-1,j-1]
  max over 0<x<i-1 of { M[x,j-1] - g_open - (i-x-1)*g_extension }
  max over 0<y<j-1 of { M[i-1,y] - g_open - (j-y-1)*g_extension }

- score(X[i],Y[j]): value from the residue exchange matrix
- g_open: gap open penalty
- g_extension: gap extension penalty
- (i-x-1) and (j-y-1): number of gap extensions

This more general way of dynamic programming also allows for affine or other gap penalties.
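
A sketch of this general scheme with an arbitrary gap function follows. Here M[i][j] is the best score of an alignment that ends with X[i] aligned to Y[j]. The function name, the callable `gap(k)` (a positive penalty for a gap of k residues) and the handling of leading and trailing gaps are assumptions, since the slide shows only the inner recurrence. As in the slide formula, a gap in one sequence cannot be directly followed by a gap in the other.

```python
def general_global_score(x, y, score, gap):
    """Best global alignment score for an arbitrary gap penalty gap(k), k >= 1."""
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        return -gap(n + m) if n + m else 0
    M = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        M[i][0] = -gap(i)  # leading gap opposite x[:i] (assumed boundary)
    for j in range(1, m + 1):
        M[0][j] = -gap(j)  # leading gap opposite y[:j] (assumed boundary)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = M[i - 1][j - 1]
            if j > 1:
                for a in range(1, i - 1):  # preceding column, rows up to i-2
                    best = max(best, M[a][j - 1] - gap(i - a - 1))
            if i > 1:
                for b in range(1, j - 1):  # preceding row, columns up to j-2
                    best = max(best, M[i - 1][b] - gap(j - b - 1))
            M[i][j] = score(x[i - 1], y[j - 1]) + best
    # Trailing gaps (assumed): the last aligned pair may sit before either end.
    best = M[n][m]
    for a in range(1, n):
        best = max(best, M[a][m] - gap(n - a))
    for b in range(1, m):
        best = max(best, M[n][b] - gap(m - b))
    return best


if __name__ == "__main__":
    identity = lambda a, b: 1 if a == b else -1
    print(general_global_score("GAGTGA", "GAGGCGA", identity, lambda k: 2 * k))  # 2
    print(general_global_score("AAAA", "AA", identity, lambda k: 3 + k))         # -3
```

With the linear penalty gap(k) = 2k this reproduces the score of the GAGTGA / GAGGCGA example. The price of the generality is the extra inner loops, which make the fill cubic rather than quadratic in sequence length.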

Easy DP recipe for using affine gap penalties

M[i,j] is the optimal alignment (highest-scoring alignment up to [i,j]).
Check:
- Cell [i-1, j-1]: apply the score for cell [i-1, j-1]
- Preceding row (i-1), up to column j-2: apply appropriate gap penalties
- Preceding column (j-1), up to row i-2: apply appropriate gap penalties

Note about gap penalties

Some affine schemes use
  gap_penalty = -g_open - g_extension*(l-1)
while others use
  gap_penalty = -g_open - g_extension*l
where l is the length of the gap.
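
The two conventions differ only in whether the first gapped residue also pays the extension penalty. A small sketch, with hypothetical function names and illustrative parameter values:

```python
def gap_penalty_first_residue_free(length, g_open, g_extension):
    """Convention 1: the opening penalty already covers the first gapped residue."""
    return -g_open - g_extension * (length - 1)


def gap_penalty_every_residue_extends(length, g_open, g_extension):
    """Convention 2: every gapped residue pays the extension penalty."""
    return -g_open - g_extension * length


if __name__ == "__main__":
    for length in (1, 2, 3):
        print(length,
              gap_penalty_first_residue_free(length, 10, 2),
              gap_penalty_every_residue_extends(length, 10, 2))
```

The two are interchangeable once the opening penalty is adjusted: convention 1 with an opening penalty of g_open + g_extension gives the same values as convention 2 with g_open. This matters when comparing parameter settings between programs.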

Global dynamic programming

g_open = 10, g_extension = 2

[Two matrices with columns D W V T A L K and rows T D W V L K: the residue exchange values and the filled-in DP matrix; values not preserved]

The exchange values are copied from the PAM250 matrix, after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix).
The extra bottom row and rightmost column give the final global alignment scores.

Variation on global alignment

Global alignment: the previous algorithm is called global alignment because it uses all letters from both sequences.

seq X: CAGCACTTGGATTCTCGG
seq Y: CAGC-----G-T----GG

Semi-global alignment: don't penalize start/end gaps (omit the start/end of sequences).

seq X: CAGCA-CTTGGATTCTCGG
seq Y: ---CAGCGTGG

Applications of semi-global alignment:
- Finding a gene in a genome
- Placing a marker onto a chromosome
- One sequence much longer than the other

Danger: really bad alignments for divergent sequences
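
One common way to make start and end gaps free is sketched below: the first row and column are initialised to 0 instead of multiples of -g, and the result is read from the best cell in the last row or last column instead of the corner. The function name and the default scores (taken from the earlier DNA example) are assumptions.

```python
def semiglobal_score(x, y, match=1, mismatch=-1, g=2):
    """Best alignment score when gaps at either end of either sequence are free."""
    n, m = len(x), len(y)
    # First row and first column stay 0: leading gaps are not penalised.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,
                          M[i][j - 1] - g,
                          M[i - 1][j] - g)
    # Trailing gaps are free too: take the best value on the bottom or right edge.
    return max(max(M[n]), max(row[m] for row in M))


if __name__ == "__main__":
    print(semiglobal_score("AACGTAA", "CGT"))     # 3: short sequence found inside the long one
    print(semiglobal_score("TTTACG", "ACGGGG"))   # 3: end of one overlaps start of the other
    print(semiglobal_score("CAGCACTTGGATTCTCGG", "CAGCGTGG"))
```

Because internal gaps and mismatches are still penalised, only the overhanging ends are ignored. For divergent sequences a short end overlap can outscore the intended alignment, which gives the bad alignments this slide warns about.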

Take-home message

- Homology
- Why are we interested in similarity?
- Pairwise alignment: global alignment