Introduction to bioinformatics 2007

Introduction to bioinformatics 2007, Lecture 5: Pair-wise Sequence Alignment (Centre for Integrative Bioinformatics VU)

Bioinformatics “Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975)) “Nothing in bioinformatics makes sense except in the light of Biology”

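Divergent evolution. Ancestral sequence: ABCD. One descendant is ACCD (mutation, B → C); the other is ABD (deletion, C → ø). Pairwise alignment of the two descendants can give ACCD / AB─D or ACCD / A─BD. The first is the true alignment, because the deleted C sat between B and D.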

What can be observed about divergent evolution. (Figure: an ancestral nucleotide and its state in Sequence 1 and Sequence 2 under four scenarios: one substitution, one visible; two substitutions, one visible; two substitutions, none visible; back mutation, not visible.)
1: ACCTGTAATC
2: ACGTGCGATC
     *  **
D = 3/10 (fraction of different sites (nucleotides))
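The observed distance D in the example above can be computed directly. This is a minimal sketch, assuming two already aligned, gap-free sequences of equal length; the function name `p_distance` is hypothetical and not part of the lecture.

```python
def p_distance(seq1, seq2):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    # Count the columns where the two letters are not identical.
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)


# The two sequences from the slide differ at 3 of their 10 sites.
print(p_distance("ACCTGTAATC", "ACGTGCGATC"))
```

Because back mutations and multiple substitutions at one site are invisible, this observed fraction can only underestimate the number of substitutions that actually occurred.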

Convergent evolution. Often seen with shorter motifs (e.g. active sites). A motif (function) has evolved more than once independently, e.g. starting from two very different sequences that adopt different folds. The sequences and associated structures remain different, but the (functional) motif can become identical. Classical example: the serine proteinases subtilisin and chymotrypsin.

Serine proteinase (subtilisin) and chymotrypsin. Different evolutionary origins, but similarities in the reaction mechanisms. Chymotrypsin, subtilisin and carboxypeptidase C have a catalytic triad of serine, aspartate and histidine in common: serine acts as a nucleophile, histidine as a base, and aspartate orients and stabilises the histidine. The geometric orientations of the catalytic residues are similar between families, despite different protein folds. The linear arrangements of the catalytic residues reflect different family relationships. For example, the catalytic triad in the chymotrypsin clan is ordered HDS, but is ordered DHS in the subtilisin clan and SDH in the carboxypeptidase clan.

Serine proteinase (subtilisin), chymotrypsin and carboxypeptidase C: catalytic triads. Read http://www.ebi.ac.uk/interpro/potm/2003_5/Page1.htm

Serine proteinase (subtilisin) and chymotrypsin. There is also divergent evolution: Håkansson K, Wang AH-J, Miller CG (2000) The structure of aspartyl dipeptidase reveals a unique fold with a Ser-His-Glu catalytic triad. Proc Natl Acad Sci U S A 97(26): 14097–14102.

A protein sequence alignment:
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
A DNA sequence alignment:
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
(On the slide, asterisks beneath each alignment mark the identical columns.)

Searching for similarities. What is the function of the new gene? The “lazy” investigation (i.e., no biological experiments, just bioinformatics techniques): – Find a set of similar protein sequences to the unknown sequence – Identify similarities and differences – For long proteins: first identify domains

Intermezzo: what is a domain A domain is a: Compact, semi-independent unit (Richardson, 1981). Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973). Recurring functional and evolutionary module (Bork, 1992). “Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).

Protein domains recur in different combinations The DEATH Domain (DD) Present in a variety of Eukaryotic proteins involved with cell death. Six helices enclose a tightly packed hydrophobic core. Some DEATH domains form homotypic and heterotypic dimers. http://www.mshri.on.ca/pawson

Structural domain organisation can be intricate… Pyruvate kinase (phosphotransferase): β barrel regulatory domain; α/β barrel catalytic substrate-binding domain; α/β nucleotide-binding domain. 1 continuous + 2 discontinuous domains.

Evolutionary and functional relationships. Reconstruct evolutionary relation: – Based on sequence: identity (simplest method), similarity – Homology (common ancestry: the ultimate goal) – Other (e.g., 3D structure). Functional relation: Sequence → Structure → Function

Searching for similarities Common ancestry is more interesting: Makes it more likely that genes share the same function Homology: sharing a common ancestor – a binary property (yes/no) – it’s a nice tool: When (an unknown) gene X is homologous to (a known) gene G it means that we gain a lot of information on X: what we know about G can be transferred to X as a good suggestion.

How to go from DNA to protein sequence A piece of double stranded DNA: 5’ attcgttggcaaatcgcccctatccggc 3’ 3’ taagcaaccgtttagcggggataggccg 5’ DNA direction is from 5’ to 3’

How to go from DNA to protein sequence 6-frame translation using the codon table (last lecture): 5’ attcgttggcaaatcgcccctatccggc 3’ 3’ taagcaaccgtttagcggggataggccg 5’
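The 6-frame translation of the DNA fragment above can be sketched as follows. This is an illustrative sketch, not the lecture's own code: it assumes the standard genetic code, and the function names (`reverse_complement`, `translate`, `six_frame_translation`) and frame labels (`+1` … `-3`) are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return dna.translate(complement)[::-1]


def translate(dna):
    """Translate complete codons from position 0; 'X' marks unknown codons."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate three reading frames on each of the two strands."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


fragment = "attcgttggcaaatcgcccctatccggc"
for name, protein in six_frame_translation(fragment).items():
    print(name, protein)
```

Stop codons are written as `*`; which of the six frames is biologically meaningful has to be decided separately, for example by looking for long open reading frames.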

Evolution and three-dimensional protein structure information. Isocitrate dehydrogenase: the distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution). Dean, A. M. and G. B. Golding: Pacific Symposium on Biocomputing 2000

Bioinformatics tool (diagram): data, algorithm (the tool) and biological interpretation (model).

Example today: pairwise sequence alignment needs a sense of evolution. Global dynamic programming (diagram): two sequences related by evolution are placed along the axes of a search matrix, which is filled using an amino acid exchange matrix and gap penalties (open, extension). The result is the alignment:
MDAGSTVILCFVG-
MDAAST-ILC--GS

How to determine similarity. Frequent evolutionary events at the DNA level: 1. Substitution 2. Insertion, deletion 3. Duplication 4. Inversion. We will restrict ourselves to the first two kinds of event (substitutions and insertions/deletions).

Dynamic programming: scoring alignments
– Substitution (or match/mismatch) scores: • DNA • proteins
– Gap penalty for a gap of length k: • Linear: gp(k) = ak • Affine: gp(k) = b + ak • Concave, e.g.: gp(k) = log(k)
The score for an alignment is the sum of the scores of all alignment columns.
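The three gap penalty shapes listed above can be written as small functions. This is a sketch only; the default parameter values (a = 1, b = 10) are assumptions chosen for illustration, not values from the lecture.

```python
import math


def linear_gap(k, a=1.0):
    """Linear penalty gp(k) = a*k: every gap position costs the same."""
    return a * k


def affine_gap(k, b=10.0, a=1.0):
    """Affine penalty gp(k) = b + a*k: a fixed opening cost plus extension."""
    return b + a * k


def concave_gap(k):
    """Concave penalty gp(k) = log(k): each extra position costs less."""
    return math.log(k)


for k in (1, 2, 5, 10):
    print(k, linear_gap(k), affine_gap(k), round(concave_gap(k), 3))
```

With an affine penalty one long gap is cheaper than several short gaps of the same total length, which reflects the idea that a single insertion or deletion event can span several residues.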

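Dynamic programming: scoring alignments. The score of an alignment of sequences a and b is
$$S_{a,b} = \sum_{l} s(a_l, b_l) - \sum_{k} N_k \, gp(k)$$
where the first sum runs over the aligned residue pairs, $N_k$ is the number of gaps of length $k$, and $gp(k) = \text{gapinit} + k \cdot \text{gapextension}$ (affine gap penalties).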

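DNA: define a score for match/mismatch of letters.
Simple:
     A    C    G    T
A    1   -1   -1   -1
C   -1    1   -1   -1
G   -1   -1    1   -1
T   -1   -1   -1    1
Used in genome alignments:
      A     C     G     T
A    91  -114   -31  -123
C  -114   100  -125   -31
G   -31  -125   100  -114
T  -123   -31  -114    91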

Dynamic programming: scoring alignments, using an amino acid exchange matrix (20×20) and affine gap penalties (open, extension: 10, 1).
T D W V T A L K
T D W L - - I K
Score: s(T,T) + s(D,D) + s(W,W) + s(V,L) - Po - 2Px + s(L,I) + s(K,K)
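The column-by-column score of this example alignment can be sketched in code. The exchange values below are the PAM250 entries for the six residue pairs that occur; the gap penalties Po = 10 and Px = 1 are assumed for illustration, and the function names are hypothetical.

```python
# PAM250 values for the residue pairs in the example alignment only.
EXCHANGE = {("T", "T"): 3, ("D", "D"): 4, ("W", "W"): 17,
            ("V", "L"): 2, ("L", "I"): 2, ("K", "K"): 5}


def exchange_score(x, y, table):
    """Look up a symmetric exchange value for residues x and y."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]


def score_alignment(row_a, row_b, table, gap_open, gap_extend):
    """Sum exchange values over columns; charge Po + k*Px per gap of length k."""
    total = 0
    gap_in = None  # which row the currently open gap is in, if any
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            current = "a" if x == "-" else "b"
            if gap_in != current:
                total -= gap_open      # a new gap starts here
            total -= gap_extend        # every gap position is charged Px
            gap_in = current
        else:
            total += exchange_score(x, y, table)
            gap_in = None
    return total


print(score_alignment("TDWVTALK", "TDWL--IK", EXCHANGE, gap_open=10, gap_extend=1))
```

Under these assumed penalties the six residue pairs contribute 33 and the single gap of length 2 costs 10 + 2·1 = 12.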

Amino acid exchange matrices 2020 How do we get one? And how do we get associated gap penalties? First systematic method to derive a.a. exchange matrices by Margaret Dayhoff et al. (1968) – Atlas of Protein Structure

PAM250 matrix: amino acid exchange matrix (log odds)
...
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   2
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -3   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4
B   0  -1   2   3  -4   1   2   0   1  -2  -3   1  -2  -5  -1   0   0  -5  -3  -2   2
Z   0   0   1   3  -5   3   3  -1   2  -2  -3   0  -2  -5   0   0  -1  -6  -4  -2   2   3
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z
Positive exchange values denote mutations that are more likely than randomly expected, while negative numbers correspond to avoided mutations compared to the randomly expected situation.

Amino acid exchange matrices Amino acids are not equal: 1. Some are easily substituted because they have similar: • physico-chemical properties • structure 2. Some mutations between amino acids occur more often due to similar codons The two above observations give us ways to define substitution matrices

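Pair-wise alignment: combinatorial explosion.
T D W V T A L K
T D W L - - I K
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
For two sequences of length n, the number of alignments is approximately
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
2 sequences of 300 a.a.: ~10^179 alignments (taking n = 300 in this formula). 2 sequences of 1000 a.a.: ~10^600 alignments!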
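The size of this explosion can be checked numerically. The sketch below evaluates the central binomial coefficient exactly and compares it with the 2^(2n)/sqrt(πn) approximation; the function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math


def count_alignments(n):
    """Exact value of the central binomial coefficient (2n choose n)."""
    return math.comb(2 * n, n)


def approx_log10_alignments(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


for n in (10, 100, 1000):
    exact = math.log10(count_alignments(n))
    print(n, round(exact, 2), round(approx_log10_alignments(n), 2))
```

For n = 1000 both values are close to 600, in line with the order of magnitude quoted for two sequences of 1000 amino acids.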

Technique to overcome the combinatorial explosion: Dynamic Programming. Alignment is modelled as a Markov process in which all sequence positions are seen as independent, so the chances of sequence events are independent. Therefore, probabilities per aligned position need to be multiplied. Amino acid matrices contain so-called log-odds values (log10 of probability ratios), so the scores can simply be summed instead of multiplying probabilities.

To say the same more statistically… To perform statistical analyses on messages or sequences, we need a reference model. The model: each letter in a sequence is selected from a defined alphabet in an independent and identically distributed (i.i.d.) manner. This choice of model system will allow us to compute the statistical significance of certain characteristics of a sequence, its subsequences, or an alignment. Given a probability distribution, P, for the letters in an i.i.d. message, the probability of seeing a particular sequence of letters i, j, k, ..., n is simply P_i P_j P_k ··· P_n. As an alternative to multiplying the probabilities, we could sum their logarithms and exponentiate the result: the probability of the same sequence of letters can be computed by exponentiating log P_i + log P_j + log P_k + ··· + log P_n. In practice, when aligning sequences we only add log-odds values (from the residue exchange matrix) and do not exponentiate the final score.
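The equivalence between multiplying probabilities and summing their logarithms can be illustrated in a few lines. The letter probabilities below are made-up values for a four-letter alphabet, and the function names are hypothetical; `math.prod` requires Python 3.8 or later.

```python
import math

# Hypothetical i.i.d. letter probabilities (they sum to 1).
LETTER_PROBS = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}


def sequence_probability(seq, probs):
    """Probability of a sequence under the i.i.d. model: a product."""
    return math.prod(probs[letter] for letter in seq)


def sequence_log_probability(seq, probs):
    """The same quantity on a log scale: a sum of logarithms."""
    return sum(math.log(probs[letter]) for letter in seq)


seq = "GATTACA"
direct = sequence_probability(seq, LETTER_PROBS)
via_logs = math.exp(sequence_log_probability(seq, LETTER_PROBS))
print(direct, via_logs)
```

Working on the log scale also avoids numerical underflow, since the product of many small probabilities quickly becomes too small to represent.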

Sequence alignment History of Dynamic Programming algorithm 1970 Needleman-Wunsch global pair-wise alignment Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol. 48(3):443-53. 1981 Smith-Waterman local pair-wise alignment Smith, TF, Waterman, MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.

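Global dynamic programming. The cell in row i and column j is filled from its three neighbouring cells:
$$H(i,j) = \max \begin{cases} H(i-1,j-1) + S(i,j) & \text{(diagonal)} \\ H(i-1,j) - g & \text{(vertical)} \\ H(i,j-1) - g & \text{(horizontal)} \end{cases}$$
where S(i,j) is the value from the residue exchange matrix and g is the gap penalty. This is a recursive formula.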

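Global dynamic programming, worked example: PAM250, gap penalty = 6 (linear). (Figure: the PAM250 exchange values for the residue pairs of the two example sequences, and the filled search matrix. The matrix border is initialised with the cumulative gap penalties -6, -12, -18, -24, -30, -36, and the row for the first residue, S, reads 2, -4, -10, -16, -22, -28.) The exchange values are copied from the PAM250 matrix (see earlier slide). The extra bottom row and rightmost column give the penalties that would need to be applied due to end gaps. Source: Higgs & Attwood, p. 124.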
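The forward fill and traceback for this kind of example can be sketched as below. The sequences SHAKE and SPEARE are an assumption inferred from the row and column labels that survive in the transcript; the exchange values are the PAM250 entries for the residue pairs involved, and the function names are hypothetical.

```python
# PAM250 entries needed for the assumed example sequences SHAKE and SPEARE.
PAM250_SUBSET = {
    ("S", "S"): 2, ("S", "P"): 1, ("S", "E"): 0, ("S", "A"): 1, ("S", "R"): 0,
    ("H", "S"): -1, ("H", "P"): 0, ("H", "E"): 1, ("H", "A"): -1, ("H", "R"): 2,
    ("A", "P"): 1, ("A", "E"): 0, ("A", "A"): 2, ("A", "R"): -2,
    ("K", "S"): 0, ("K", "P"): -1, ("K", "E"): 0, ("K", "A"): -1, ("K", "R"): 3,
    ("E", "P"): -1, ("E", "E"): 4, ("E", "R"): -1,
}


def pam(x, y):
    """Symmetric lookup in the partial exchange table."""
    return PAM250_SUBSET[(x, y)] if (x, y) in PAM250_SUBSET else PAM250_SUBSET[(y, x)]


def needleman_wunsch(a, b, score, gap):
    """Global alignment with a linear gap penalty; returns score and both rows."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = -gap * i
    for j in range(1, m + 1):
        H[0][j] = -gap * j
    # Forward step: fill every cell from its diagonal, vertical and horizontal neighbour.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(H[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
    # Trace back from the bottom-right cell to the origin.
    i, j, row_a, row_b = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[i][j] == H[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and H[i][j] == H[i - 1][j] - gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return H[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


print(needleman_wunsch("SHAKE", "SPEARE", pam, gap=6))
```

When several predecessors give the same value, this sketch prefers the diagonal move; other tie-breaking rules yield alignments with the same optimal score.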

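Global dynamic programming with affine gap penalties:
$$S_{i,j} = s_{i,j} + \max \begin{cases} \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1)P_x \} \\ S_{i-1,j-1} \\ \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1)P_x \} \end{cases}$$
where $s_{i,j}$ is the exchange value for the residues at positions i and j, $P_i$ is the gap opening penalty and $P_x$ is the gap extension penalty.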

Global dynamic programming, worked example with Gapo = 10, Gape = 2. (Figure: the exchange values for the residue pairs of the example sequences and the filled search matrix.) The exchange values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix). The extra bottom row and rightmost column give the final global alignment scores.

Easy DP recipe for using affine gap penalties. M[i,j] is the score of the optimal alignment (the highest scoring alignment up to [i,j]). To fill it, check: the preceding row up to column j-2 (apply the appropriate gap penalties); the preceding column up to row i-2 (apply the appropriate gap penalties); and cell [i-1, j-1] (no gap penalty). Take the best of these and add the exchange score for [i,j].
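The recipe can be sketched directly as a triple loop. This is an illustrative, unoptimised version (cubic time) that returns only the score. It assumes a virtual start cell with score 0 for leading end gaps and a final scan of the last row and column for trailing end gaps; the function name and the scoring function in the usage lines are hypothetical.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, score, gap_open, gap_extend):
    """Global alignment score with gap penalty gp(k) = gap_open + k * gap_extend."""
    def gp(k):
        return gap_open + k * gap_extend

    n, m = len(a), len(b)
    # S[i][j]: best score of an alignment ending with a[i-1] matched to b[j-1].
    S = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0  # virtual start cell
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1]                    # no gap
            for x in range(0, i - 1):                 # preceding column, up to row i-2
                best = max(best, S[x][j - 1] - gp(i - 1 - x))
            for y in range(0, j - 1):                 # preceding row, up to column j-2
                best = max(best, S[i - 1][y] - gp(j - 1 - y))
            S[i][j] = score(a[i - 1], b[j - 1]) + best
    # Trailing end gaps: scan the last column and the last row.
    final = S[n][m]
    for x in range(0, n):
        final = max(final, S[x][m] - gp(n - x))
    for y in range(0, m):
        final = max(final, S[n][y] - gp(m - y))
    return final


def simple(x, y):
    """Hypothetical match/mismatch scoring for the usage example."""
    return 2 if x == y else -2


print(affine_global_score("AAAGGGTTT", "AAATTT", simple, gap_open=2, gap_extend=1))
```

In this formulation a gap in one sequence cannot be followed immediately by a gap in the other. Faster formulations exist that keep separate matrices for the gap states, but they are a different technique from the recipe on the slide.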

DP is a two-step process Forward step: calculate scores Trace back: start at highest score and reconstruct the path leading to the highest score These two steps lead to the highest scoring alignment (the optimal alignment) This is guaranteed when you use DP!

Semi-global pairwise alignment. Global alignment: all gaps are penalised. Semi-global alignment: N- and C-terminal gaps (end-gaps) are not penalised.
MSTGAVLIY--TS-----
---GGILLFHRTSGTSNS
Here the leading gap in the second sequence and the trailing gap in the first sequence are end-gaps.

Semi-global dynamic programming: two examples with different gap penalties. The exchange values are copied from the PAM250 matrix (see earlier slide), after being made non-negative by adding 8 to each PAM250 matrix cell (-8 is the lowest number in the PAM250 matrix). Global score is 65 - 10 - 1×2 - 10 - 2×2 = 39.

Semi-global pairwise alignment Applications of semi-global: – Finding a gene in genome – Placing marker onto a chromosome – One sequence much longer than the other Danger: if gap penalties high -- really bad alignments for divergent sequences
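A semi-global variant needs only two changes to the global algorithm: the first row and column start at zero, and the final score is the best value in the last row or last column. The sketch below assumes simple match/mismatch scoring and a linear gap penalty for brevity; the function name and the parameter values are hypothetical.

```python
def semi_global_score(a, b, match=1, mismatch=-1, gap=2):
    """Alignment score in which leading and trailing end gaps are free."""
    n, m = len(a), len(b)
    # First row and column stay at zero: leading end gaps cost nothing.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
    # Trailing end gaps are free: take the best cell in the last row or column.
    return max(max(H[n]), max(row[m] for row in H))


# A short sequence placed inside a longer one, as when finding a gene in a genome.
print(semi_global_score("ACGT", "TTACGTTT"))
```

A global alignment of the same pair would be charged for four gap positions, which is why semi-global alignment suits cases where one sequence is much longer than the other.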

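Local dynamic programming (Smith & Waterman, 1981) (diagram): the sequence LCFVMLAGSTVIVGTR and a second sequence are placed along the axes of the search matrix, which is filled using an amino acid exchange matrix (which must contain negative numbers) and gap penalties (open, extension). The best local alignment found is:
AGSTVIVG
A-STILCG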

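Local dynamic programming (Smith & Waterman, 1981):
$$S_{i,j} = \max \begin{cases} s_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1)P_x \} \\ s_{i,j} + S_{i-1,j-1} \\ s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1)P_x \} \\ 0 \end{cases}$$
where $s_{i,j}$ is the exchange value, $P_i$ is the gap opening penalty and $P_x$ is the gap extension penalty. The extra option 0 lets an alignment start afresh at any cell, which is what makes the alignment local.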
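A compact sketch of local alignment follows. For brevity it swaps the affine penalty of the formula for a linear gap penalty and uses simple match/mismatch scoring; these simplifications, the parameter values and the function name are assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=2):
    """Best local alignment score and the two aligned substrings."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 option restarts the alignment whenever the score would go negative.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest cell until a cell with score 0 is reached.
    i, j = best_pos
    row_a, row_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```

Unlike the global case, the traceback starts at the highest-scoring cell anywhere in the matrix, not at the bottom-right corner.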

Dot plots Way of representing (visualising) sequence similarity without doing dynamic programming (DP) Make same matrix, but locally represent sequence similarity by averaging using a window

Comparing two sequences We want to be able to choose the best alignment between two sequences. A simple method of visualising similarities between two sequences is to use dot plots. The first sequence to be compared is assigned to the horizontal axis and the second is assigned to the vertical axis.

Dot plots can be filtered by window approaches (to calculate running averages) and applying a threshold They can identify insertions, deletions, inversions
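A windowed dot plot with a threshold can be sketched as follows. For simplicity the sketch counts identical residues per window instead of averaging exchange-matrix scores; the function names and the default window and threshold values are hypothetical.

```python
def dot_plot(horizontal, vertical, window=1, threshold=1):
    """Rows follow the vertical sequence, columns the horizontal sequence.

    A cell is True when the two windows starting at that position share
    at least `threshold` identical residues.
    """
    rows = []
    for j in range(len(vertical) - window + 1):
        row = []
        for i in range(len(horizontal) - window + 1):
            matches = sum(horizontal[i + k] == vertical[j + k] for k in range(window))
            row.append(matches >= threshold)
        rows.append(row)
    return rows


def render(plot):
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if dot else "." for dot in row) for row in plot)


print(render(dot_plot("ACGTACGT", "ACGT", window=4, threshold=4)))
```

With window=1 every identical pair of residues produces a dot, which is noisy; raising the window and threshold keeps only runs of similarity, which appear as diagonal lines.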