Comparing Two Protein Sequences

Cédric Notredame (22/02/2016) Comparing Two Protein Sequences Cédric Notredame.
Presentation transcript:

Comparing Two Protein Sequences Cédric Notredame

Our Scope: Look once Under the Hood -Pairwise Alignment methods are POWERFUL -Pairwise Alignment methods are LIMITED -If You Understand the LIMITS they Become VERY POWERFUL

Outline -WHY Does It Make Sense To Compare Sequences -HOW Can we Compare Two Sequences ? -HOW Can we Align Two Sequences ? -HOW can I Search a Database ?

Why Does It Make Sense To Compare Sequences ? Sequence Evolution

Why Do We Want To Compare Sequences
wheat --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
        | | |||||||| || | |||    ||| | ||||| ||||
????? KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA
Homology? EXTRAPOLATE from SwissProt to the unknown sequence (?????)

Why Does It Make Sense To Align Sequences ? -Evolution is our Real Tool. -Nature is LAZY and Keeps re-using Stuff. -Evolution is mostly DIVERGENT. Same Sequence -> Same Ancestor

Why Does It Make Sense To Align Sequences ? Same Sequence -> Same Function, Same Origin, Same 3D Fold. Many Counter-examples!

Comparing Is Reconstructing Evolution

An Alignment is a STORY
Ancestor: ADKPKRPLSAYMLWLN
Mutations + Selection give two descendants:
ADKPRRPLS-YMLWLN (Mutation, Deletion)
ADKPKRPKPRLSAYMLWLN (Insertion)
Resulting alignment:
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
Mutations and deletions are the engines of evolution, but selection does the steering… As shown here, it is often impossible to tell insertions and deletions apart, hence their generic name: indels.

Evolution is NOT Always Divergent… (Chen et al, 97, PNAS, 94, 3811-16) Two AFGPs with (ThrAlaAla)n repeats, labelled N and S on the slide: one is Similar To Trypsinogen, the other is NOT Similar to Trypsinogen. SIMILAR Sequences BUT DIFFERENT origin

Evolution is NOT always Divergent… But in MOST cases, you may assume it is… Similar Function DOES NOT REQUIRE Similar Sequence. Same Sequence -> Same Function, Same 3D Fold, Same Origin. Similar Sequence -> Historical Legacy

How Do Sequences Evolve Each Portion of a Genome has its own Agenda.

How Do Sequences Evolve ? CONSTRAINED Genome Positions Evolve SLOWLY. EVERY Protein Family Has its Own Level Of Constraint
Family            Ks    Ka
Histone3          6.4   0
Insulin           4.0   0.1
Interleukin I     4.6   1.4
a-Globin          5.1   0.6
Apolipoprot. AI   4.5   1.6
Interferon G      8.6   2.8
Rates in Substitutions/site/Billion Years as measured on Mouse Vs Human (80 Million years). Ks: Synonymous Mutations, Ka: Non-Synonymous Mutations.

Different molecular clocks for different proteins: another prediction. The Neutral Theory also makes another prediction about molecular clocks, namely that different types of proteins will have different clock rates. In particular, proteins whose structures are such that a small change in the amino acid sequence can impair the function of that protein should evolve at the slowest rates, whereas proteins whose amino acid sequences can be modified fairly dramatically WITHOUT impairing function should evolve at the fastest rates. Do we, in fact, see evidence of this? Yes. Consider the fibrinopeptide class of protein. These proteins are involved in blood clotting. They can perform this function even when there are numerous amino acid changes. They evolve at a relatively rapid rate, as the slope of the line relating aa substitutions to time shows (slide). On the other hand, cytochrome c, a protein involved in respiration metabolism, cannot tolerate many changes to its aa sequence without losing function. As the slide shows, it evolves (“its clock ticks”) at a much slower rate.

How Do Sequences Evolve ? To Make Things Worse, Every Residue has its Own Personality. [Figure: The Amino Acids Venn Diagram, grouping the 20 residues into Aliphatic, Aromatic, Hydrophobic, Small and Polar]

How Do Sequences Evolve ? In a structure, each Amino Acid plays a Special Role (example: OmpR, Cter Domain). In the core, SIZE MATTERS. On the surface, CHARGE MATTERS (- / +)

How Do Sequences Evolve ? Accepted Mutations Depend on the Structure. In the core: Big -> Big, Small -> Small, NO DELETION. On the surface (+ / -): Charged -> Charged, Small <-> Big or Small, DELETIONS

How Can We Compare Sequences ? Substitution Matrices

How Can We Compare Sequences ? To Compare Two Sequences, We need: -Their Structure -Their Function. We Do Not Have Them !!!

How Can We Compare Sequences ? We will Need To Replace Structural Information With Sequence Information. Same Sequence -> Same Origin, Same Function, Same 3D Fold. It CANNOT Work ALL THE TIME !!!

How Can We Compare Sequences ? To Compare Sequences, We need to Compare Residues. We Need to Know How Much it COSTS to SUBSTITUTE: -an Alanine into an Isoleucine -a Tryptophan into a Glycine … The table that contains the costs for all the possible substitutions is called the SUBSTITUTION MATRIX. How to derive that matrix?

How Can We Compare Sequences ? [Figure: the Amino Acids Venn Diagram (Aliphatic, Aromatic, Hydrophobic, Small, Polar)] Using Knowledge Could Work, But we do not know enough about Evolution and Structure. Using Data works better.

How Can We Compare Sequences ? Making a Substitution Matrix -Take 100 nice pairs of Protein Sequences, easy to align (80% identical). -Align them… -Count each mutation in the alignments: -25 Tryptophans into Phenylalanine -30 Isoleucines into Leucine … -For each mutation, set the substitution score to the log odds ratio: Log (Observed / Expected by chance)
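The log odds recipe above can be sketched in a few lines of Python. This is only an illustration: the function name, the base-2 logarithm and the way the expected frequency is computed (product of the two residue frequencies) are assumptions, and the counts in the usage lines are made up.

```python
import math

def log_odds_score(pair_count, total_pairs, freq_a, freq_b):
    """Log odds score of one substitution: Log(Observed / Expected by chance).

    pair_count / total_pairs is the observed frequency of the substitution;
    freq_a * freq_b is the frequency expected if residues paired at random
    (an assumed null model, not taken from the slides).
    """
    observed = pair_count / total_pairs
    expected = freq_a * freq_b
    return math.log2(observed / expected)

# Hypothetical counts: a substitution seen more often than chance scores > 0,
# one seen less often than chance scores < 0.
frequent = log_odds_score(50, 1000, 0.1, 0.25)
rare = log_odds_score(5, 1000, 0.1, 0.25)
print(frequent > 0, rare < 0)
```

A positive score means the substitution is accepted by evolution more often than random pairing would predict, a negative score means it is avoided.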

You’re kidding! … I was struck by lightning twice too!! Gary Larson, The Far Side

How Can We Compare Sequences ? Making a Substitution Matrix -The Diagonal Indicates How Conserved a residue tends to be. -W is VERY Conserved. -Cysteines that make disulfide bridges and those that do not get averaged. -Some Residues are Easier To Mutate into Other, Similar Residues.

How Can We Compare Sequences ? Using a Substitution Matrix
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
(the columns show a Mutation, an Insertion and a Deletion)
Given two Sequences and a Substitution Matrix, We must Compute the CHEAPEST Alignment

Scoring an Alignment
Most popular Substitution Matrices: PAM250, Blosum62 (Most widely used)
TPEA
¦| |
APGA
Raw Score = 1 + 6 + 0 + 2 = 9
Question: Is it possible to get such a good alignment by chance only?
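As a rough sketch of how a raw score is obtained, the snippet below sums one matrix entry per aligned column. The four pair scores are the terms of the sum on the slide, with the third term taken as 0 so that the total is 9 (an inference from the sum). The names `SLIDE_SCORES` and `raw_score` are hypothetical, and a real matrix would hold all residue pairs.

```python
# Pair scores read off the slide's sum (third term inferred as 0).
SLIDE_SCORES = {
    ("T", "A"): 1,
    ("P", "P"): 6,
    ("E", "G"): 0,
    ("A", "A"): 2,
}

def raw_score(seq_a, seq_b, matrix):
    """Sum the substitution score of every aligned column (no gaps handled)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        # The matrix is symmetric, so look the pair up in either order.
        pair = (a, b) if (a, b) in matrix else (b, a)
        total += matrix[pair]
    return total

print(raw_score("TPEA", "APGA", SLIDE_SCORES))  # 1 + 6 + 0 + 2 = 9
```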

Insertions and Deletions: Gap Penalties. Opening a gap is more expensive than extending it: Gap Opening Penalty, Gap Extension Penalty.
Seq A GARFIELDTHE----CAT
      |||||||||||    |||
Seq B GARFIELDTHELASTCAT
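The idea that opening costs more than extending can be sketched as follows. The penalty values (10 and 1) are placeholders, and charging GOP + L x GEP for a gap of length L is one common convention assumed here; some programs charge GOP + (L - 1) x GEP instead.

```python
import re

def affine_gap_cost(gapped_seq, gop=10, gep=1):
    """Total gap cost of one gapped sequence under an affine penalty.

    Each run of '-' of length L costs gop + L * gep (assumed convention).
    """
    total = 0
    for gap in re.finditer(r"-+", gapped_seq):
        length = gap.end() - gap.start()
        total += gop + length * gep
    return total

# One gap of length 4: 10 + 4 * 1
print(affine_gap_cost("GARFIELDTHE----CAT"))
# Two gaps (lengths 3 and 1) cost more than one gap of length 4
print(affine_gap_cost("ADKPRRP---LS-YMLWLN"))
```

With these placeholder values, four gapped positions cost 14 when they are contiguous but 24 when split into two gaps, which is what pushes the aligner towards few, long gaps.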

How Can We Compare Sequences ? Limits of the Substitution Matrices -They ignore non-local interactions and Assume that identical residues are equal. -They assume evolution rate to be constant. (Illustrated with the same story: ADKPKRPLSAYMLWLN, Mutations + Selection, giving ADKPRRPLS-YMLWLN and ADKPKRPKPRLSAYMLWLN)

How Can We Compare Sequences ? Limits of the substitution Matrices Substitution Matrices Cannot Work !!!

How Can We Compare Sequences ? Limits of the Substitution Matrices. I know… But at least, could I get some idea of when they are likely to do all right?

How Can We Compare Sequences ? The Twilight Zone. [Figure: %Sequence Identity against Length, with marks at 30% identity and Length 100. Above the line: Similar Sequence -> Similar Structure (Same 3D Fold). Below it, the Twilight Zone: Different Sequence -> Structure ????]

How Can We Compare Sequences ? The Twilight Zone Substitution Matrices Work Reasonably Well on Sequences that have more than 30 % identity over more than 100 residues
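A quick way to check where an alignment sits relative to this rule of thumb is sketched below. Computing identity over the columns where both sequences have a residue is an assumption (other definitions exist), and the function names are hypothetical; the 30% and 100-residue thresholds are the ones quoted above.

```python
def percent_identity(aln_a, aln_b):
    """Percent of identical residues among columns without a gap."""
    columns = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns)

def outside_twilight_zone(aln_a, aln_b, min_identity=30.0, min_length=100):
    """True when identity is above 30% over more than 100 aligned residues."""
    aligned = sum(1 for a, b in zip(aln_a, aln_b) if a != "-" and b != "-")
    return aligned > min_length and percent_identity(aln_a, aln_b) > min_identity

# Very similar, but far too short for the rule of thumb to apply.
a, b = "ADKPRRPLS-YMLWLN", "ADKPKRPLSAYMLWLN"
print(round(percent_identity(a, b), 1), outside_twilight_zone(a, b))
```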

How Can We Compare Sequences ? Which Matrix Shall I Use? The Initial PAM matrix was computed on 80% similar Proteins. It has been extrapolated to more distantly related sequences (Pam 250, Pam 350). Other Matrices Exist: BLOSUM 42, BLOSUM 62

How Can We Compare Sequences ? Which Matrix Shall I Use? PAM: Distant Proteins -> High Index (PAM 350). BLOSUM: Distant Proteins -> Low Index (Blosum30). GONNET 250 > BLOSUM62 > PAM 250. But This will depend on: -The Family. -The Program Used and Its Tuning. Choosing The Right Matrix may be Tricky… Insertions, Deletions?

HOW Can we Align Two Sequences ? -Dot Matrices -Global Alignments -Local Alignments

Dot Matrices QUESTION What are the elements shared by two sequences ?

Dot Matrices
>Seq1
THEFATCAT
>Seq2
THELASTCAT
Inputs: Sequences, Window size, Stringency

Dot Matrices: Stringency. [Figure: two dot matrices of the same pair, one with Window=1 Stringency=1, the other with Window=11 Stringency=7]
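A minimal sketch of how the window and stringency settings produce the dots, assuming the usual definition: a dot is drawn at (i, j) when at least `stringency` of the `window` residues starting at i and j are identical. The function name and the coordinate convention are assumptions.

```python
def dot_matrix(seq_x, seq_y, window=1, stringency=1):
    """Return the (i, j) start positions where two windows match well enough."""
    dots = []
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            # Count identical residues at the same offset inside both windows.
            matches = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            if matches >= stringency:
                dots.append((i, j))
    return dots

noisy = dot_matrix("THEFATCAT", "THELASTCAT", window=1, stringency=1)
clean = dot_matrix("THEFATCAT", "THELASTCAT", window=3, stringency=3)
print(len(noisy), clean)
```

With Window=1 every shared letter gives a dot, so the plot is noisy; a larger window with a high stringency keeps only the diagonal runs (here THE, TCA and CAT).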

Dot Matrices: Limits -Visual aid -Best Way to EXPLORE the Sequence Organisation -Does NOT provide us with an ALIGNMENT
wheat --DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
        | | |||||||| || | |||    ||| | ||||| ||||
????? KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA

Global Alignments -Take 2 Nice Protein Sequences -A good Substitution Matrix (blosum) -A Gap opening Penalty (GOP) -A Gap extension Penalty (GEP). [Figure: Affine Gap Penalty, Cost against gap length L, with GOP and GEP marked] Parsimony: Evolution takes the simplest path (So We Think…)

Global Alignments -Take 2 Nice Protein Sequences -A good Substitution Matrix (blosum) -A Gap opening Penalty (GOP) -A Gap extension Penalty (GEP) -DYNAMIC PROGRAMMING
>Seq1
THEFATCAT
>Seq2
THEFASTCAT
DYNAMIC PROGRAMMING gives:
THEFA-TCAT
THEFASTCAT

Global Alignments: Brute Force Enumeration
Aligning FAST with FAT by listing every possible alignment, for example:
----FAT   ---FAT-   --F-AT-
FAST---   FAST---   FAST---
Number of alignments to enumerate: (L1+L2)! / ((L1)! * (L2)!)
Too many: use DYNAMIC PROGRAMMING
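To get a feeling for why brute force is hopeless, the count from the slide can be evaluated directly. This sketch simply computes (L1+L2)! / (L1! * L2!) with Python's `math.comb`; the function name is hypothetical.

```python
import math

def alignments_to_enumerate(len_1, len_2):
    """(L1+L2)! / (L1! * L2!), written as a binomial coefficient."""
    return math.comb(len_1 + len_2, len_1)

# FAST (4 residues) against FAT (3 residues): 7! / (4! * 3!)
print(alignments_to_enumerate(4, 3))
# Two proteins of 100 residues each: a number with dozens of digits
print(len(str(alignments_to_enumerate(100, 100))))
```

Even for two short proteins the count is astronomically large (on the order of 10^58 for two 100-residue sequences), whereas dynamic programming fills a table of only (L1+1) x (L2+1) cells.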

Global Alignments: Dynamic Programming (Needleman and Wunsch). Match=1 MisMatch=-1 Gap=-1
         F    A    S    T
    0   -1   -2   -3   -4
F  -1    1    0   -1   -2
A  -2    0    2    1    0
T  -3   -1    1    1    2
Each cell holds the best score of the three possible moves (diagonal, up, left). Tracing back from the bottom right cell (score 2) gives:
F A S T
F A - T
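The table filling and traceback can be sketched as below, using the slide's scores (Match=1, MisMatch=-1, Gap=-1). This is a teaching sketch with a linear gap cost, not an affine GOP/GEP scheme, and when several paths tie the traceback arbitrarily prefers the diagonal move.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global alignment score and one optimal alignment (linear gap cost)."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # seq_a prefix aligned to gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # seq_b prefix aligned to gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in seq_b
                              score[i][j - 1] + gap)       # gap in seq_a
    # Traceback from the bottom right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("FAST", "FAT"))  # (2, 'FAST', 'FA-T')
```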

Global Alignments: DYNAMIC PROGRAMMING -Global Alignments are very sensitive to gap Penalties (GOP, GEP). -Global Alignments do not take into account the MODULAR nature of Proteins. [Figure legend] C: Vitamin K dependent Ca Binding. K: Kringle Domain. G: Growth Factor module. F: Finger Module.

Local Alignments. [Figure: GLOBAL Alignment compared with LOCAL Alignment] Smith And Waterman (SW) = LOCAL Alignment
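A minimal Smith and Waterman sketch is given below. Compared with a global alignment table, the only changes are that a cell is never allowed to drop below 0 and that the traceback starts at the best cell and stops at the first 0. The simple Match=1, MisMatch=-1, Gap=-1 scores are placeholders; real searches use a substitution matrix with GOP and GEP.

```python
def smith_waterman(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score and the two aligned segments."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(0,                            # restart the alignment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback from the best cell until a cell with score 0 is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Only the shared module is reported; the unrelated flanks are ignored.
print(smith_waterman("XXXCATYYY", "ZZCATWW"))  # (3, 'CAT', 'CAT')
```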

Local Alignments We now have a PairWise Comparison Algorithm, We are ready to search Databases

Database Search. [Figure: a QUERY (Q) goes through a Comparison Engine (SW) against every entry of a Database; the hits come back annotated with E-values such as 1.10e-100, 1.10e-20, 1.10e-2, 1.10e-1] E-values: How many times do we expect such an Alignment by chance?

CONCLUSION

Sequence Comparison -Thanks to evolution, We CAN compare Sequences. -There is a relation between Sequence and Structure. -Substitution matrices only work well with similar Sequences (More than 30% id). -The Easiest way to Compare Two Sequences is a dotplot.

A few Addresses