25/05/2004 Evolution/Phylogeny/Pattern recognition Bioinformatics Master Course Bioinformatics Data Analysis and Tools.


Patterns
Some are easy, some are not:
- Knitting patterns
- Cooking recipes
- Pictures (dot plots)
- Colour patterns
- Maps

Example of algorithm reuse: Data clustering
Many biological data analysis problems can be formulated as clustering problems:
- microarray gene expression data analysis
- identification of regulatory binding sites (similarly, splice junction sites, translation start sites, ...)
- (yeast) two-hybrid data analysis (for inference of protein complexes)
- phylogenetic tree clustering (for inference of horizontally transferred genes)
- protein domain identification
- identification of structural motifs
- prediction reliability assessment of protein structures
- NMR peak assignments
- ...

Data Clustering Problems
- Clustering: partition a data set into clusters so that data points in the same cluster are “similar” and points in different clusters are “dissimilar”
- Cluster identification: identify clusters whose features differ significantly from the background

Application Examples
- Regulatory binding site identification: CRP (CAP) binding site
- Two-hybrid data analysis
- Gene expression data analysis
All are solvable by the same algorithm!

Other Application Examples
- Phylogenetic tree clustering analysis (evolutionary trees)
- Protein sidechain packing prediction
- Assessment of prediction reliability of protein structures
- Protein secondary structures
- Protein domain prediction
- NMR peak assignments
- ...

Multivariate statistics – Cluster analysis
[Diagram: raw table (columns C1, C2, C3, C4, C5, C6, ...) → similarity criterion → similarity matrix of scores (5×5) → cluster criterion → dendrogram]

Human Evolution

Comparing sequences - Similarity Score
Many properties can be used:
- Nucleotide or amino acid composition
- Isoelectric point
- Molecular weight
- Morphological characters
But: molecular evolution through sequence alignment

Multivariate statistics – Cluster analysis
Now for sequences
[Diagram: multiple sequence alignment → similarity criterion → similarity matrix of scores (5×5) → phylogenetic tree]

Lactate dehydrogenase multiple alignment
Human         -KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ
Chicken       -KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ
Dogfish       -KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQ
Lamprey       SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ
Barley        TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ
Maizey casei  -KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ
Bacillus      TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ
Lacto__ste    -RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ
Lacto_plant   QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ
Therma_mari   MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ
Bifido        -KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ
Thermus_aqua  MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ
Mycoplasma    -KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ
[Distance matrix over: Human, Chicken, Dogfish, Lamprey, Barley, Maizey, Lacto_casei, Bacillus_stea, Lacto_plant, Therma_mari, Bifido, Thermus_aqua, Mycoplasma]

Multivariate statistics – Cluster analysis
[Diagram: data table (columns C1, C2, C3, C4, C5, C6, ...) → similarity criterion → similarity matrix of scores (5×5) → cluster criterion → dendrogram/tree]

Multivariate statistics – Cluster analysis
Why do it?
- Finding a true typology
- Model fitting
- Prediction based on groups
- Hypothesis testing
- Data exploration
- Data reduction
- Hypothesis generation
But you can never prove a classification/typology!

Cluster analysis – data normalisation/weighting
[Diagram: raw table (columns C1, C2, C3, C4, C5, C6, ...) → normalisation criterion → normalised table (same columns)]
- Column normalisation: x / max
- Column range normalisation: (x - min) / (max - min)
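The two normalisation criteria on this slide can be sketched in a few lines. This is a minimal illustration, not the course's own code: the function names and the toy table are made up here, and it assumes every column has a non-zero maximum and at least two distinct values.

```python
def column_max_normalise(table):
    """Divide every value by the maximum of its column (x / max)."""
    maxima = [max(col) for col in zip(*table)]
    return [[x / m for x, m in zip(row, maxima)] for row in table]


def column_range_normalise(table):
    """Rescale every column to [0, 1] via (x - min) / (max - min)."""
    lows = [min(col) for col in zip(*table)]
    highs = [max(col) for col in zip(*table)]
    return [[(x - lo) / (hi - lo) for x, lo, hi in zip(row, lows, highs)]
            for row in table]


# Hypothetical raw table: 3 objects (rows) measured on 2 variables (columns).
raw = [[2.0, 10.0],
       [4.0, 30.0],
       [8.0, 20.0]]

print(column_max_normalise(raw))
print(column_range_normalise(raw))
```

Range normalisation puts every column on the same [0, 1] scale, so no single variable dominates the distance calculation that follows.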

Cluster analysis – (dis)similarity matrix
[Diagram: raw table (columns C1, C2, C3, C4, C5, C6, ...) → similarity criterion → similarity matrix of scores (5×5)]
Minkowski metrics: $D_{i,j} = \left( \sum_k |x_{ik} - x_{jk}|^r \right)^{1/r}$
- r = 2: Euclidean distance
- r = 1: City block distance
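As a small sketch of the Minkowski metric, the function below computes D(i,j) for two rows of a data table and builds the full (dis)similarity matrix. The names `minkowski` and `distance_matrix` and the example rows are illustrative assumptions.

```python
def minkowski(a, b, r=2):
    """Minkowski distance between two equal-length rows: (sum_k |a_k - b_k|^r)^(1/r)."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)


def distance_matrix(rows, r=2):
    """Symmetric matrix of pairwise Minkowski distances between all rows."""
    return [[minkowski(p, q, r) for q in rows] for p in rows]


# Hypothetical table: 3 objects described by 2 variables.
rows = [[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]]

print(distance_matrix(rows, r=2))  # Euclidean distances
print(distance_matrix(rows, r=1))  # city block distances
```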

Cluster analysis – Clustering criteria
[Diagram: similarity matrix of scores (5×5) → cluster criterion → dendrogram (tree)]
- Single linkage: nearest neighbour
- Complete linkage: furthest neighbour
- Group averaging: UPGMA
- Ward
- Neighbour joining: global measure

Cluster analysis – Clustering criteria
1. Start with N clusters of 1 object each
2. Apply clustering distance criterion iteratively until you have 1 cluster of N objects
3. Most interesting clustering somewhere in between
[Figure: dendrogram (tree) with a distance axis running from N clusters to 1 cluster]

Single linkage clustering (nearest neighbour)
[Figure: scatter plot of objects on axes Char 1 and Char 2, shown over successive merge steps]
Distance from point to cluster is defined as the smallest distance between that point and any point in the cluster
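The agglomerative procedure with the nearest-neighbour rule can be sketched as follows. This is a naive illustration under stated assumptions: `single_linkage` is a hypothetical name, the four 2-D points stand in for objects described by Char 1 and Char 2, and Euclidean distance is used as the similarity criterion.

```python
import math


def single_linkage(points, dist=math.dist):
    """Agglomerative clustering with the nearest-neighbour criterion.

    Returns the merges in order as (cluster_a, cluster_b, distance),
    where clusters are tuples of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]  # N clusters of 1 object
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: smallest distance between any two members.
                d = min(dist(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical objects measured on two characters (Char 1, Char 2).
points = [(0, 0), (0, 1), (4, 0), (4, 2)]
for a, b, d in single_linkage(points):
    print(a, "+", b, "at distance", d)
```

The list of merges with their distances is exactly the information a dendrogram draws; cutting it at a chosen distance gives a clustering "somewhere in between" N clusters and 1 cluster.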

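Cluster analysis – Ward’s clustering criterion
- Per cluster: calculate the Error Sum of Squares (ESS): $\mathrm{ESS} = \sum x^2 - \left( \sum x \right)^2 / n$
- At each step, make the merge that gives the minimum increase of ESS
Suppose: [Table: worked example with columns Obj, Val, clustering, ΔESS]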
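A minimal sketch of Ward's criterion for objects described by a single value, following the ESS formula on the slide. The function names and the example values are assumptions made for illustration; they are not the slide's worked example.

```python
def ess(values):
    """Error sum of squares of one cluster: sum(x^2) - (sum(x))^2 / n."""
    n = len(values)
    return sum(x * x for x in values) - sum(values) ** 2 / n


def ward_increase(a, b):
    """Increase in total ESS caused by merging clusters a and b."""
    return ess(a + b) - ess(a) - ess(b)


def best_ward_merge(clusters):
    """Indices (i, j) of the pair of clusters whose merge increases ESS least."""
    pairs = [(i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    return min(pairs, key=lambda p: ward_increase(clusters[p[0]], clusters[p[1]]))


# Hypothetical objects with one measured value each.
clusters = [[1.0], [2.0], [6.0], [9.0]]
print(best_ward_merge(clusters))           # the pair to merge first
print(ward_increase(clusters[0], clusters[1]))
```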

Multivariate statistics – Cluster analysis
[Diagram: data table (columns C1, C2, C3, C4, C5, C6, ...) → similarity criterion → similarity matrix of scores (5×5) → cluster criterion → phylogenetic tree]

Multivariate statistics – Cluster analysis
[Diagram: data table (columns C1, C2, C3, C4, C5, C6) → similarity criterion → scores (5×5) for the rows and scores (6×6) for the columns → cluster criterion → one dendrogram for each]
Make two-way ordered table using dendrograms

Multivariate statistics – Principal Component Analysis (PCA)
[Diagram: data table (columns C1, C2, C3, C4, C5, C6) → similarity criterion: correlations → correlation matrix (6×6)]
- Calculate the eigenvectors with the greatest eigenvalues: linear combinations of the original variables, mutually orthogonal
- Project the data points onto the new axes (eigenvectors 1 and 2)
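The PCA steps above can be sketched without external libraries. This is only an illustration: it extracts just the first principal component, by power iteration (a technique chosen here, not one the slides name), and the data table and function names are hypothetical. It assumes no column is constant and that the starting vector is not orthogonal to the leading eigenvector.

```python
import math


def correlation_matrix(table):
    """Return (correlation matrix, standardised columns) for a row-wise table."""
    n = len(table)
    cols = list(zip(*table))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / n)
           for c, m in zip(cols, means)]
    z = [[(x - m) / s for x in c] for c, m, s in zip(cols, means, sds)]
    corr = [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]
    return corr, z


def first_component(matrix, iterations=200):
    """Power iteration: (largest eigenvalue, matching unit eigenvector)."""
    v = [1.0 / (i + 1) for i in range(len(matrix))]  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(m * x for m, x in zip(row, v)) for row in matrix]
    return sum(a * b for a, b in zip(v, mv)), v


# Hypothetical table: 4 objects, 2 strongly correlated variables.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2]]
corr, z = correlation_matrix(data)
eigenvalue, axis = first_component(corr)

# Project each object onto the new axis (its score on component 1).
scores = [sum(z[j][i] * axis[j] for j in range(len(axis)))
          for i in range(len(data))]
print(eigenvalue, axis, scores)
```

With two nearly collinear variables almost all of the variance falls on the first component, which is the data-reduction use of PCA.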

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky)
“Nothing in bioinformatics makes sense except in the light of Biology”

Evolution
- Most of bioinformatics is comparative biology
- Comparative biology is based upon evolutionary relationships between compared entities
- Evolutionary relationships are normally depicted in a phylogenetic tree

Where can phylogeny be used
- For example, finding out about orthology versus paralogy
- Predicting secondary structure of RNA
- Studying host-parasite relationships
- Mapping cell-bound receptors onto their binding ligands
- Multiple sequence alignment (e.g. Clustal)

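Phylogenetic tree (unrooted)
[Figure: unrooted tree with leaves human, mouse, fugu and Drosophila; labelled parts: edge, internal node, leaf]
OTU – Operational taxonomic unit

Phylogenetic tree (unrooted)
[Figure: the same tree with the root marked; labelled parts: root, edge, internal node, leaf]
OTU – Operational taxonomic unit

Phylogenetic tree (rooted)
[Figure: rooted tree with leaves human, mouse, fugu and Drosophila along a time axis; labelled parts: root, edge, internal node (ancestor), leaf]
OTU – Operational taxonomic unit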

How to root a tree
- Outgroup: place the root between a distant sequence and the rest of the group
- Midpoint: place the root at the midpoint of the longest path (sum of branches between any two OTUs)
- Gene duplication: place the root between paralogous gene copies
[Figure: example rootings of trees with leaves f, D, m and h, and a gene-duplication tree with two paralogous copies each of f and h]

Combinatoric explosion
# sequences   # unrooted trees   # rooted trees
2             1                  1
3             1                  3
4             3                  15
5             15                 105
6             105                945
7             945                10,395
8             10,395             135,135
9             135,135            2,027,025
10            2,027,025          34,459,425
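The counts on this slide follow the standard closed forms for bifurcating trees: (2n-5)!! unrooted and (2n-3)!! rooted trees for n sequences. A short sketch, with function names chosen here for illustration:

```python
def odd_double_factorial(k):
    """k!! = k * (k - 2) * ... * 3 * 1 for odd k; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n):
    """Number of unrooted bifurcating trees on n >= 2 labelled leaves."""
    return odd_double_factorial(2 * n - 5)


def rooted_trees(n):
    """Number of rooted bifurcating trees on n >= 2 labelled leaves."""
    return odd_double_factorial(2 * n - 3)


for n in range(2, 11):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Each unrooted tree on n leaves has 2n-3 edges on which a root can be placed, which is why the rooted count for n equals the unrooted count for n+1.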

Tree distances
             human  mouse  fugu  Drosophila
human          x
mouse          6      x
fugu           7      3     x
Drosophila                         x
[Figure: tree with leaves human, mouse, fugu and Drosophila]
Evolutionary (sequence) distance = sequence dissimilarity

Phylogeny methods
- Parsimony: fewest number of evolutionary events (mutations); relatively often fails to reconstruct the correct phylogeny
- Distance based: pairwise distances
- Maximum likelihood: L = Pr[Data|Tree]

Parsimony & Distance
Sequences:
Drosophila   t t a t t a a
fugu         a a t t t a a
mouse        a a a a a t a
human        a a a a a a t
Distance matrix:
             human  mouse  fugu  Drosophila
human          x
mouse          2      x
fugu           3      4     x
Drosophila     5      5     3      x
[Figure: two trees over fugu, mouse, human and Drosophila, one labelled parsimony and one labelled distance]
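The distance matrix on this slide can be checked by counting mismatched positions (Hamming distance) between each pair of sequences. The sketch below assumes that this is how the matrix was filled. With the sequences as transcribed, every entry is reproduced except human–fugu, which comes out as 4 rather than the 3 shown; the transcript does not reveal whether the sequence or the matrix entry is the slip.

```python
from itertools import combinations

# Sequences as they appear on the slide (spaces removed).
sequences = {
    "human":      "aaaaaat",
    "mouse":      "aaaaata",
    "fugu":       "aatttaa",
    "Drosophila": "ttattaa",
}


def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


for first, second in combinations(sequences, 2):
    print(first, second, hamming(sequences[first], sequences[second]))
```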

Maximum likelihood
- If data = alignment and hypothesis = tree, then, under a given evolutionary model, maximum likelihood selects the hypothesis (tree) that maximises the probability of the observed data
- Extremely time-consuming method
- We can also test the relative fit to the tree of different models (Huelsenbeck & Rannala, 1997)
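To make "maximises the probability of the observed data" concrete, the sketch below applies the principle to the smallest possible case: estimating the single branch length between two aligned sequences. It does not search over trees. The Jukes-Cantor substitution model, the grid search and all names are assumptions made for this illustration; the slides do not specify a model.

```python
import math


def jc_log_likelihood(t, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by branch length t
    (expected substitutions per site) under the Jukes-Cantor model."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e   # a site keeps the same base
    p_diff = 0.25 - 0.25 * e   # a site changes to one specific other base
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def ml_branch_length(n_sites, n_diff, step=0.001, t_max=3.0):
    """Grid search for the branch length with the highest likelihood."""
    grid = [step * i for i in range(1, int(t_max / step) + 1)]
    return max(grid, key=lambda t: jc_log_likelihood(t, n_sites, n_diff))


# Hypothetical data: 7 aligned sites, 2 of which differ.
estimate = ml_branch_length(7, 2)
closed_form = -0.75 * math.log(1.0 - 4.0 * (2 / 7) / 3.0)
print(estimate, closed_form)
```

The grid maximum agrees with the known closed-form Jukes-Cantor distance. For whole trees no such closed form exists and the likelihood has to be maximised over topologies and all branch lengths, which is why the method is so time consuming.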

Bayesian methods
- Calculate the posterior probability of a tree (Huelsenbeck et al., 2001): the probability that the tree is the true tree given the data and the evolutionary model
- Most computer-intensive technique
- Feasible thanks to Markov chain Monte Carlo (MCMC), a numerical technique for integrating over probability distributions
- Gives a confidence number (posterior probability) per node

Distance methods: fastest
- Clustering criterion using a distance matrix
- Distance matrix filled with pairwise scores (sequence identity, alignment scores, E-values, etc.)

Phylogenetic tree by Distance methods (Clustering)
[Diagram: multiple alignment → similarity criterion → similarity matrix of scores (5×5) → phylogenetic tree]

Lactate dehydrogenase multiple alignment
Human         -KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ
Chicken       -KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ
Dogfish       -KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQ
Lamprey       SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ
Barley        TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ
Maizey casei  -KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ
Bacillus      TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ
Lacto__ste    -RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ
Lacto_plant   QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ
Therma_mari   MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ
Bifido        -KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ
Thermus_aqua  MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ
Mycoplasma    -KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ
[Distance matrix over: Human, Chicken, Dogfish, Lamprey, Barley, Maizey, Lacto_casei, Bacillus_stea, Lacto_plant, Therma_mari, Bifido, Thermus_aqua, Mycoplasma]

Cluster analysis – (dis)similarity matrix
[Diagram: raw table (columns C1, C2, C3, C4, C5, C6, ...) → similarity criterion → similarity matrix of scores (5×5)]
Minkowski metrics: $D_{i,j} = \left( \sum_k |x_{ik} - x_{jk}|^r \right)^{1/r}$
- r = 2: Euclidean distance
- r = 1: City block distance

Cluster analysis – Clustering criteria
[Diagram: similarity matrix of scores (5×5) → cluster criterion → phylogenetic tree]
- Single linkage: nearest neighbour
- Complete linkage: furthest neighbour
- Group averaging: UPGMA
- Ward
- Neighbour joining: global measure

Neighbour joining
- Global measure: tends to produce a tree with minimal total branch length
- At each step, join the two nodes that keep the total branch length minimal (criterion of minimal evolution)
- Agglomerative algorithm
- Leads to an unrooted tree

Neighbour joining
[Figure: panels (a) to (f) showing successive joins, with new internal nodes x and y]
At each step all possible ‘neighbour joinings’ are checked and the one corresponding to the minimal total tree length (calculated by adding all branch lengths) is taken.
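One neighbour-joining step can be sketched as below. Rather than building every candidate tree and summing its branches, the sketch uses the usual Q-matrix formulation of the algorithm, which selects the pair that minimises the total tree length without enumerating trees. The function name `nj_step` and the five-taxon distance matrix are hypothetical and not taken from the slides.

```python
def nj_step(names, d):
    """One neighbour-joining step on a symmetric distance matrix d (n >= 3).

    Returns ((name_i, name_j, branch_i, branch_j), new_names, new_d), where
    the joined pair is replaced by a new internal node in the last position.
    """
    n = len(names)
    totals = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q criterion: the smallest value marks the pair to join.
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from i and j to the new internal node.
    branch_i = 0.5 * d[i][j] + (totals[i] - totals[j]) / (2.0 * (n - 2))
    branch_j = d[i][j] - branch_i
    keep = [k for k in range(n) if k not in (i, j)]
    new_names = [names[k] for k in keep] + ["(%s,%s)" % (names[i], names[j])]
    # Distances among the remaining nodes, plus a column for the new node.
    new_d = [[d[a][b] for b in keep] + [0.5 * (d[a][i] + d[a][j] - d[i][j])]
             for a in keep]
    new_d.append([row[-1] for row in new_d] + [0.0])
    return (names[i], names[j], branch_i, branch_j), new_names, new_d


# Hypothetical distance matrix for five OTUs.
names = ["a", "b", "c", "d", "e"]
dist = [[0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0]]
print(nj_step(names, dist))
```

Repeating the step on the reduced matrix until three nodes remain yields the full unrooted tree.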

How to assess confidence in tree
Bayesian method (time consuming):
- The Bayesian posterior probabilities (BPP) are assigned to internal branches in the consensus tree
- Bayesian Markov chain Monte Carlo (MCMC) analysis software such as MrBayes (Huelsenbeck and Ronquist, 2001) and BAMBE (Simon and Larget, 1998) is now commonly used
- Uses all the data
Distance method (bootstrap):
- Select multiple alignment columns with replacement
- Recalculate the tree
- Compare branches with the original (target) tree
- Repeat many times, so that many different trees are calculated
- How often is each branching (point between 3 nodes) preserved for each internal node?
- Uses samples of the data

The Bootstrap
[Figure: an original alignment next to a “scrambled” alignment whose columns were resampled with replacement, so that some columns appear 2× or 3×; a branch of the resulting tree is marked non-supportive]
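The bootstrap procedure can be sketched as follows. The resampling function does what the slides describe: it draws alignment columns with replacement. Tree building is left to a caller-supplied function, and `closest_pair` is only a hypothetical stand-in for it, reporting the most similar pair of sequences as the single "branching" to be checked. All names and the toy alignment are assumptions for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same width."""
    width = len(alignment[0])
    columns = [rng.randrange(width) for _ in range(width)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_support(alignment, build_groups, replicates=100, seed=0):
    """Fraction of replicates in which each group of the original tree recurs.

    build_groups maps an alignment to a set of hashable groupings
    (for a real tree, the splits defined by its internal branches).
    """
    rng = random.Random(seed)
    original = build_groups(alignment)
    counts = {group: 0 for group in original}
    for _ in range(replicates):
        found = build_groups(bootstrap_alignment(alignment, rng))
        for group in original:
            if group in found:
                counts[group] += 1
    return {group: counts[group] / replicates for group in original}


def closest_pair(alignment):
    """Stand-in for tree building: the pair of sequences with fewest mismatches."""
    pairs = [(i, j) for i in range(len(alignment))
             for j in range(i + 1, len(alignment))]
    i, j = min(pairs, key=lambda p: sum(
        x != y for x, y in zip(alignment[p[0]], alignment[p[1]])))
    return {frozenset((i, j))}


# Hypothetical toy alignment of four sequences.
alignment = ["aaaaaat", "aaaaata", "aatttaa", "ttattaa"]
print(bootstrap_support(alignment, closest_pair, replicates=200))
```

A grouping that recurs in most replicates is well supported by the data; one that appears only in the original tree is not.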