Motivation: “Nothing in biology makes sense except in the light of evolution” (Theodosius Dobzhansky)



Multiple sequence alignment (vWF)
Rat      QEPGGLVVPPTDAPVSSTTPYVEDTPEPPLHNFYCSK
Rabbit   QEPGGMVVPPTDAPVRSTTPYMEDTPEPPLHDFYWSN
Gorilla  QEPGGLVVPPTDAPVSPTTLYVEDISEPPLHDFYCSR
Cat      REPGGLVVPPTEGPVRATTPYVEDTPESTLHDFYCSR
The problem: assign a conservation score to each position.

Finding conserved regions from an alignment
S1  KIFERCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN
S2  KIFERCELARTLKRLGLDGYRGISLANWVCLAKWFWDYN
S3  KVFERCELARTLKRLGMDFYRGISLANWMCLAKWESGYN
S4  KTYERCEFARTLKRNGMSGYYGVSLADWVCLAQHESNYN
S5  KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYN
S6  KVFSKCELAHKLKAQEMDGFGGYSLANWVCMAEYESNFS
Solution 1: assign a score of 1 if the position is fully conserved and a score of 0 if it is variable.
Problem: this method is very “rough”.
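Solution 1 can be sketched in a few lines of Python. This is only an illustration of the all-or-nothing score described on the slide; the function name `binary_conservation` and the dictionary layout are assumptions made for the example, not part of the original material.

```python
# Alignment copied from the slide (labels normalized to S1..S6).
ALIGNMENT = {
    "S1": "KIFERCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN",
    "S2": "KIFERCELARTLKRLGLDGYRGISLANWVCLAKWFWDYN",
    "S3": "KVFERCELARTLKRLGMDFYRGISLANWMCLAKWESGYN",
    "S4": "KTYERCEFARTLKRNGMSGYYGVSLADWVCLAQHESNYN",
    "S5": "KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYN",
    "S6": "KVFSKCELAHKLKAQEMDGFGGYSLANWVCMAEYESNFS",
}


def binary_conservation(alignment):
    """Return 1 for each fully conserved column and 0 for each variable one.

    Hypothetical helper: `alignment` maps sequence names to aligned strings
    of equal length.
    """
    columns = zip(*alignment.values())  # iterate over alignment columns
    return [1 if len(set(column)) == 1 else 0 for column in columns]


scores = binary_conservation(ALIGNMENT)
print("".join(str(s) for s in scores))
```

The output is one digit per alignment column, which makes the roughness of the score visible: a column with a single deviating residue is scored exactly like a column with six different residues.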

Finding conserved regions from an alignment
S1  KIFERCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN
S2  KIFERCELARTLKRLGLDGYRGISLANWVCLAKWFWDYN
S3  KVFERCELARTLKRLGMDFYRGISLANWMCLAKWESGYN
S4  KTYERCEFARTLKRNGMSGYYGVSLADWVCLAQHESNYN
S5  KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYN
S6  KVFSKCELAHKLKAQEMDGFGGYSLANWVCMAEYESNFS
Solution 2: count the number of character states at each position.
Problem: this method does not take the evolutionary tree into account.
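A minimal sketch of Solution 2, assuming the same alignment as on the slide. The helper name `count_states` is hypothetical; it simply counts the distinct residues in each column.

```python
# Alignment copied from the slide (labels normalized to S1..S6).
ALIGNMENT = {
    "S1": "KIFERCELARTDMKLGLDFYKGVSLANWVCLAKWESGYN",
    "S2": "KIFERCELARTLKRLGLDGYRGISLANWVCLAKWFWDYN",
    "S3": "KVFERCELARTLKRLGMDFYRGISLANWMCLAKWESGYN",
    "S4": "KTYERCEFARTLKRNGMSGYYGVSLADWVCLAQHESNYN",
    "S5": "KVFERCELARTLKKLGLDGYKGVSLANWLCLTKWESSYN",
    "S6": "KVFSKCELAHKLKAQEMDGFGGYSLANWVCMAEYESNFS",
}


def count_states(alignment):
    """Return the number of distinct character states in each column."""
    return [len(set(column)) for column in zip(*alignment.values())]


state_counts = count_states(ALIGNMENT)
for position, n_states in enumerate(state_counts, start=1):
    print(position, n_states)
```

A count of 1 marks a fully conserved column; larger counts mark more variable columns. The count does not depend on which sequences carry which residue, which is exactly why it cannot reflect the tree.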

Evolutionary forces (e.g., mutation and selection) are the source of sequence variation.
[Figure: phylogenetic tree with leaves S1 to S6]

A phylogenetic tree represents the history of evolution for the entire sequence. It is inferred based on all positions or from external data (e.g., fossils, other genes).
[Figure: phylogenetic tree with leaves S1 to S6]

Mapping changes onto the tree (first alignment column)
      pos. 1  pos. 2
S1    K       K
S2    A       A
S3    A       A
S4    K       A
S5    K       K
S6    A       K
Leaf states: S1(K), S2(A), S3(A), S4(K), S5(K), S6(A)
3 K's, 3 A's and one replacement

Mapping changes onto the tree (second alignment column)
      pos. 1  pos. 2
S1    K       K
S2    A       A
S3    A       A
S4    K       A
S5    K       K
S6    A       K
Leaf states: S1(K), S2(A), S3(A), S4(A), S5(K), S6(K)
3 K's, 3 A's and 3 replacements

Maximum Parsimony (MP)
When the phylogenetic tree is known, the minimum number of changes needed to “explain” the data is evaluated for each position. The more changes, the more variable the position.

Mapping changes onto the tree
Leaf states: S1(K), S2(A), S3(A), S4(K), S5(K), S6(A)
Maximum parsimony score = 1 -> conserved.

Mapping changes onto the tree
Leaf states: S1(K), S2(A), S3(A), S4(A), S5(K), S6(K)
Maximum parsimony score = 3 -> variable.
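The per-position parsimony score on a known tree can be computed with Fitch's algorithm. The sketch below is an illustration only. The tree figure did not survive in this transcript, so the rooted topology `TREE` is an assumption, chosen because it is consistent with the two scores quoted on the slides; the function names are likewise hypothetical.

```python
# Assumed topology (the original figure is not available):
# ((S1,(S4,S5)),(S2,(S3,S6))), written as nested pairs.
TREE = (("S1", ("S4", "S5")), ("S2", ("S3", "S6")))


def fitch(node, states):
    """Return (possible_states, n_changes) for the subtree rooted at `node`.

    A leaf is a name looked up in `states`; an internal node is a pair of
    subtrees. When the children's state sets do not intersect, one change
    is counted and the union is passed upward.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, states):
    """Minimum number of changes needed to explain one alignment column."""
    return fitch(tree, states)[1]


# The two columns discussed on the slides.
SITE_1 = {"S1": "K", "S2": "A", "S3": "A", "S4": "K", "S5": "K", "S6": "A"}
SITE_2 = {"S1": "K", "S2": "A", "S3": "A", "S4": "A", "S5": "K", "S6": "K"}

print(parsimony_score(TREE, SITE_1))  # 1 -> conserved
print(parsimony_score(TREE, SITE_2))  # 3 -> variable
```

Both columns contain three K's and three A's, so Solutions 1 and 2 cannot tell them apart; only the distribution of the residues over the tree separates the conserved column from the variable one.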

What if the tree is not known…
      pos. 1  pos. 2
S1    K       K
S2    A       A
S3    A       A
S4    K       A
S5    K       K
S6    A       K
[Figure: candidate tree with leaves S1 to S6]
The score of each tree is the sum of its scores over all positions. If the tree is not known, we choose the tree with the lowest score: the maximum parsimony tree.
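The tree-selection rule can be sketched as follows: score every candidate tree by summing its Fitch scores over all columns, then keep the tree with the lowest total. The two candidate topologies below are hypothetical examples, not trees from the original slides, and a real search would have to explore far more than two candidates.

```python
def fitch(node, states):
    """Return (possible_states, n_changes) for the subtree rooted at `node`."""
    if isinstance(node, str):  # leaf: observed residue
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def tree_score(tree, alignment):
    """Sum of per-column parsimony scores for one tree."""
    n_positions = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_positions):
        column = {name: seq[i] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


# The two-column alignment shown on the slide.
ALIGNMENT = {"S1": "KK", "S2": "AA", "S3": "AA",
             "S4": "KA", "S5": "KK", "S6": "AK"}

# Hypothetical candidate topologies (assumptions made for the example).
CANDIDATES = {
    "tree_a": (("S1", ("S4", "S5")), ("S2", ("S3", "S6"))),
    "tree_b": (("S4", ("S1", "S5")), ("S6", ("S2", "S3"))),
}

scores = {name: tree_score(tree, ALIGNMENT) for name, tree in CANDIDATES.items()}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

With these two candidates, `tree_a` needs 4 changes in total and `tree_b` needs 3, so `tree_b` is preferred under the parsimony criterion.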

Parsimony has many shortcomings. To name a few:
(1) All changes are counted the same, which is not true for biological systems (Leu->Ile is much more likely than Leu->His).
(2) It cannot take biological context into account (secondary structures, dependencies among sites, evolutionary distances between the analyzed organisms, etc.).
(3) Its statistical basis is questionable.

Alternative: MAXIMUM-LIKELIHOOD METHOD.

Maximum likelihood uses a probabilistic model of evolution. Each amino acid has a certain probability of changing, and this probability depends on the evolutionary distance. Evolutionary distances are inferred from the entire set of sequences.

Evolutionary distances
Positions can be conserved for two reasons: functional constraints or short evolutionary time. Five replacements in 10 positions between two chimps is considered very variable. Five replacements in 10 positions between a human and a cucumber is not considered that variable. Maximum likelihood takes this information into account.
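To make the dependence on evolutionary distance concrete, here is a small sketch under a deliberately simple assumption: all 20 amino acids are equally frequent and every replacement is equally likely (a 20-state analogue of the Jukes-Cantor model). Real analyses use empirical amino-acid matrices instead, and the two distances used below are hypothetical values chosen only to contrast a very close pair with a very distant one.

```python
import math

N_STATES = 20  # amino acids


def p_same(t, k=N_STATES):
    """Probability that a site shows the same residue after distance t.

    t is measured in expected replacements per site under the
    equal-rates model assumed for this sketch.
    """
    return 1.0 / k + (k - 1.0) / k * math.exp(-k * t / (k - 1.0))


def p_diff(t, k=N_STATES):
    """Probability that a site shows a different residue after distance t."""
    return 1.0 - p_same(t, k)


def expected_differences(n_sites, t):
    """Expected number of differing sites between two sequences."""
    return n_sites * p_diff(t)


CLOSE = 0.01    # hypothetical distance for two very similar organisms
DISTANT = 2.0   # hypothetical distance for two very divergent organisms

for label, t in (("close pair", CLOSE), ("distant pair", DISTANT)):
    print(label, round(expected_differences(10, t), 2))
```

Under these assumptions, five differences in ten positions is far more than expected for the close pair and fewer than expected for the distant pair, which is the intuition behind the chimp and cucumber comparison.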

Maximum Parsimony               Maximum Likelihood
All changes counted the same    Different probabilities for the different types of substitutions
Statistically questionable      Statistically robust
Ignores biological context      Accounts for biological context

The likelihood computations
[Figure: tree with leaves C, K, M and A, internal nodes X, Y and Z, and branch lengths t1 to t6]
With likelihood models we can:
1. Infer the phylogenetic tree
2. Compute conservation for each site
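The likelihood of one alignment column on a tree is usually computed with Felsenstein's pruning algorithm. The sketch below shows the idea on a four-leaf tree with the residues from the figure. Several things are assumptions made for the example: the topology ((C,K),(M,A)), the branch lengths, the uniform root frequencies, and the equal-rates substitution model (real tools use empirical amino-acid matrices).

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = len(AMINO_ACIDS)


def transition_prob(a, b, t):
    """P(residue b after distance t | residue a) under the equal-rates model."""
    decay = math.exp(-K * t / (K - 1.0))
    if a == b:
        return 1.0 / K + (K - 1.0) / K * decay
    return (1.0 - decay) / K


def conditional_likelihoods(node):
    """Return {state: P(leaves below node | node has that state)}.

    A leaf is a one-letter string; an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in AMINO_ACIDS}
    result = {s: 1.0 for s in AMINO_ACIDS}
    for child, t in node:
        below = conditional_likelihoods(child)
        for s in AMINO_ACIDS:
            result[s] *= sum(transition_prob(s, c, t) * below[c]
                             for c in AMINO_ACIDS)
    return result


def site_likelihood(root):
    """Likelihood of one column, assuming uniform root frequencies."""
    return sum(v / K for v in conditional_likelihoods(root).values())


# Hypothetical topology and branch lengths; leaves carry the observed residues.
OBSERVED = [([("C", 0.1), ("K", 0.1)], 0.2), ([("M", 0.1), ("A", 0.1)], 0.2)]
CONSERVED = [([("K", 0.1), ("K", 0.1)], 0.2), ([("K", 0.1), ("K", 0.1)], 0.2)]

print(site_likelihood(OBSERVED))
print(site_likelihood(CONSERVED))
```

Multiplying such site likelihoods over all columns gives the likelihood of the whole alignment for a given tree, which is the quantity a maximum-likelihood tree search tries to maximize.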

Maximum likelihood tree reconstruction
This is computationally very demanding, but efficient algorithms that find approximate solutions have been developed.

Back to conservation: ‘rate of evolution’
Given a multiple sequence alignment (MSA), we define:
Conserved site = slowly evolving site
Variable site = fast-evolving site
We estimate the rate of evolution for each site in the alignment.

Evolutionary rates
We model the rate by assuming that each site i in the sequence has its own rate, r_i, relative to the average rate over all sites. A site of rate 2 evolves twice as fast as the average.
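One way to picture a per-site rate is to multiply every branch length of the tree by a candidate rate r and keep the value of r that maximizes the likelihood of that column. The sketch below does this with a coarse grid search. It illustrates the idea only and is not the Rate4Site implementation; the topology, the branch lengths, the grid and the equal-rates substitution model are all assumptions made for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = len(AMINO_ACIDS)


def transition_prob(a, b, t):
    """P(b after distance t | a) under an assumed equal-rates model."""
    decay = math.exp(-K * t / (K - 1.0))
    if a == b:
        return 1.0 / K + (K - 1.0) / K * decay
    return (1.0 - decay) / K


def cond_lik(node, rate):
    """Pruning step with every branch length multiplied by `rate`."""
    if isinstance(node, str):  # leaf: observed residue
        return {s: float(s == node) for s in AMINO_ACIDS}
    out = {s: 1.0 for s in AMINO_ACIDS}
    for child, t in node:
        below = cond_lik(child, rate)
        for s in AMINO_ACIDS:
            out[s] *= sum(transition_prob(s, c, rate * t) * below[c]
                          for c in AMINO_ACIDS)
    return out


def site_likelihood(tree, rate):
    """Column likelihood for one rate, with uniform root frequencies."""
    return sum(v / K for v in cond_lik(tree, rate).values())


def ml_rate(tree, grid):
    """Grid value of the rate that maximizes the column likelihood."""
    return max(grid, key=lambda r: site_likelihood(tree, r))


def tree_for(column, b=0.1):
    """Hang residues of S1..S6 on a hypothetical topology with equal branches."""
    s1, s2, s3, s4, s5, s6 = column
    return [([(s1, b), ([(s4, b), (s5, b)], b)], b),
            ([(s2, b), ([(s3, b), (s6, b)], b)], b)]


GRID = [0.05 * i for i in range(1, 101)]  # candidate rates 0.05 .. 5.0

rate_conserved = ml_rate(tree_for("KKKKKK"), GRID)  # fully conserved column
rate_variable = ml_rate(tree_for("IIVTVV"), GRID)   # column 2 of the S1..S6 alignment
print(rate_conserved, rate_variable)
```

To express rates relative to the average, as described above, the estimates for all sites would then be divided by their mean, so that a value of 2 means twice as fast as the average site.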

“conseq”: Bcl-XL, a key regulator influencing the release of apoptosis-promoting factors from mitochondria

“rate4site” output (table columns: Position, ML Rate)


Melamed D., et al. J. Virol. (2004) 78:9675-9688. ConSeq was used to study 11 unstructured amino acids in the Capsid Domain (CA) of the Gag protein. The Capsid Domain of the Gag protein makes a major contribution to the assembly process of the virion particle.

Integrating the 3D information We map each color onto the 3D structure.

Integrating the 3D information: validation of the method (1) Do the results make sense to biologists?

Example: Bcl-XL protein (PDB ID 1bxl)
Conservation pattern in the Bcl-XL protein, using an alignment of 53 homologues from ProtoMap.
Primary signal: Bak/Bcl-XL interface.
Secondary signal: BH4 homology region, found only in the Bcl-2 subfamily (BH4 may interact with CED-4).

The Structure of Human Src Tyrosine Kinase (Adapted from: Branden and Tooze, 1999)

SH2-SH3 interface MP results (233 SH2 homologues)

SH2-SH3 interface ML results (233 SH2 homologues)

Web-Server We developed a Web server applying this method. Using this server, one can enter a single PDB structure, and the server finds homologous sequences, produces the alignment and the tree, calculates the conservation scores, and visualizes the results on the 3D structure…