Bioinformatics Lecture 3: Molecular Phylogenetics. By: Dr. Mehdi Mansouri, Mehr 1395.


Phylogenetics Basics: Biological sequence analysis is founded on solid evolutionary principles. Similarities and divergence among related biological sequences revealed by sequence alignment often have to be rationalized and visualized in the context of phylogenetic trees.

What is evolution? Evolution can be defined as the development of a biological form from other preexisting forms, or its progression from its origin to its current form, through natural selection and modification.

Phylogenetics is the study of the evolutionary history of living organisms using treelike diagrams to represent pedigrees of these organisms. The tree branching patterns representing the evolutionary divergence are referred to as phylogeny.

Studying phylogenetics: Fossil records contain morphological information about ancestors of current species and the timeline of divergence, but the fossil record is nonexistent for microorganisms. Molecular data (molecular fossils) are more numerous than fossils and easier to obtain, and are the favored source for reconstructing evolutionary history.

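DNA sequence evolution [Figure: a tree of 7-nucleotide sequences on a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). The ancestral sequence AAGACTT diverges through intermediates (TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT) into the present-day sequences AGCGCTT, AGCACAA, TAGACTT, TAGCCCA and AGGGCAT.]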

Major Assumptions: (1) Molecular sequences used in phylogenetic construction are homologous. (2) Phylogenetic divergence is assumed to be bifurcating. (3) Each position in a sequence evolved independently.

Tree terminology: Terminal node = operational taxonomic unit (OTU). Internal node = hypothetical taxonomic unit (HTU). Peripheral (or terminal) branch = relationship between an OTU and an HTU. Internal branch = relationship between two HTUs.

Clades: A clade is a group of all the taxa that have been derived from a common ancestor, plus the common ancestor itself.

Cladograms & Phylograms: Phylograms show branch order and branch lengths. Cladograms show branching order only; their branch lengths are meaningless. [Figure: the same seven taxa (Bacterium 1, 2, 3 and Eukaryote 1, 2, 3, 4) drawn once as a cladogram and once as a phylogram.]

Dichotomy: all branches bifurcate. Polytomy: the result of a taxon giving rise to more than two descendants, or of an unresolved phylogeny.

Unrooted: no knowledge of a common ancestor; shows the relative relationships of taxa, with no direction of an evolutionary path. Rooted: more informative, because the root gives the tree a direction from the common ancestor to its descendants.

Rooting the tree: An outgroup consists of taxa that are known to fall outside of the group of interest. Choosing one requires some prior knowledge about the relationships among the taxa. The outgroup can be either species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins). Based on lectures by Tal Pupko

Rooting the tree: The midpoint rooting approach roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. It assumes that the taxa are evolving in a clock-like manner. Example with taxa A, B, C, D: d(A,D) = 18, so the root is placed at the midpoint, 18 / 2 = 9 from both A and D.

Molecular clock: This concept was proposed by Emil Zuckerkandl and Linus Pauling (1962) as well as Emanuel Margoliash (1963). The hypothesis states that for any given gene (or protein), the rate of molecular evolution is approximately constant. In their pioneering study, Zuckerkandl and Pauling observed the number of amino acid differences between human globins: β and δ (~6 differences), β and γ (~36 differences), α and β (~78 differences), and α and γ (~83 differences). They could also compare human to gorilla (both β and α globins), observing 2 and 1 differences, respectively. They knew from fossil evidence that humans and gorillas diverged from a common ancestor about 11 MYA. Using this divergence time as a calibration point, they estimated that the gene duplication separating β and δ occurred 44 MYA; β and γ derived from a common ancestor 260 MYA; α and β 565 MYA; and α and γ 600 MYA.
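
The calibration arithmetic behind these dates can be sketched in a few lines of Python. This is an illustration only: averaging the two human-gorilla observations (2 and 1 differences) into a single calibration value of 1.5 is an assumption made here, and the names `divergence_time` and `globin_pairs` are hypothetical. Under that assumption the strict clock gives values close to the rounded figures quoted above.

```python
# Calibration point: human vs. gorilla globins differ at 2 (beta) or 1 (alpha)
# positions, and the two lineages are taken to have split about 11 MYA.
CALIBRATION_MYA = 11.0
CALIBRATION_DIFFS = (2 + 1) / 2  # assumption: average the two observations

# Million years represented by one amino acid difference under a strict clock.
MY_PER_DIFF = CALIBRATION_MYA / CALIBRATION_DIFFS


def divergence_time(differences):
    """Estimate a divergence time in MYA, assuming a constant rate."""
    return differences * MY_PER_DIFF


# Approximate numbers of differences between human globin chains (from the slide).
globin_pairs = {"beta/delta": 6, "beta/gamma": 36, "alpha/beta": 78, "alpha/gamma": 83}
estimates = {pair: divergence_time(n) for pair, n in globin_pairs.items()}

for pair, t in estimates.items():
    print(f"{pair}: about {t:.0f} MYA")
```

The estimate grows linearly with the number of differences, which is exactly the clock assumption; no correction for multiple substitutions at the same site is applied in this sketch.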

3 OTUs: 1 unrooted tree, corresponding to 3 rooted trees

4 OTUs: 3 unrooted trees, corresponding to 15 rooted trees

Finding a true tree is difficult
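
One reason the search is difficult is how fast the number of possible trees grows. The counts for 3 and 4 OTUs given earlier follow the standard formulas for strictly bifurcating trees with labeled tips: (2n-5)!! unrooted and (2n-3)!! rooted topologies for n OTUs. A minimal sketch (the function names are illustrative):

```python
def odd_double_factorial(k):
    """Return k * (k - 2) * ... * 3 * 1 for odd k, and 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 labeled OTUs: (2n-5)!!"""
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    return odd_double_factorial(2 * n - 5)


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 labeled OTUs: (2n-3)!!"""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return odd_double_factorial(2 * n - 3)


for n in (3, 4, 10):
    print(n, unrooted_trees(n), rooted_trees(n))
# 3 1 3
# 4 3 15
# 10 2027025 34459425
```

With only 10 OTUs there are already more than 34 million rooted topologies, so evaluating every tree quickly becomes impractical.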

The Newick format
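
The slide gives only the title, so as a brief illustration: Newick writes a tree as nested parentheses, with sibling subtrees separated by commas and the whole tree terminated by a semicolon; branch lengths, when present, follow a colon, as in (A:0.1,B:0.2);. The sketch below serializes a topology stored as nested Python tuples. The tuple representation and the name `to_newick` are assumptions made for this example, not part of the format.

```python
def to_newick(node):
    """Serialize a nested-tuple tree to Newick text (without the final ';')."""
    if isinstance(node, str):
        # A leaf is written as its label.
        return node
    # An internal node is its children in parentheses, separated by commas.
    return "(" + ",".join(to_newick(child) for child in node) + ")"


tree = (("Human", "Chimpanzee"), "Gorilla")
newick = to_newick(tree) + ";"
print(newick)  # ((Human,Chimpanzee),Gorilla);
```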

Gene phylogeny vs. species phylogeny: The main objective of building phylogenetic trees based on molecular sequences is to reconstruct the evolutionary history of the species involved. A gene phylogeny only describes the evolution of that particular gene or encoded protein. This sequence may evolve more or less rapidly than other genes in the genome. The evolution of a particular sequence does not necessarily correlate with the evolutionary path of the species. A branching point in a species tree is a speciation event. A branching point in a gene tree: which event? The two events may or may not coincide. To obtain a species phylogeny, phylogenetic trees from a variety of gene families need to be constructed to give an overall assessment of the species evolution.

A gene tree may differ from a species tree: S = divergence time for species 1 and 2. G1 = divergence time inferred by using alleles a and f. Alleles d and b are closer to each other than alleles d and f.

Incomplete lineage sorting due to polymorphism at speciation time

Closest living relatives of humans? Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than either is to gorillas. The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least MYA. [Figure: two alternative trees of humans, bonobos, chimpanzees, gorillas and orangutans, each drawn on a time axis in MYA.]

[Figure: tree of Orangutan, Gorilla, Chimpanzee and Human. From the Tree of Life website, University of Arizona.]

Procedure: 1. Choice of molecular markers. 2. Multiple sequence alignment. 3. Choice of a model of evolution. 4. Determine a tree building method. 5. Assess tree reliability.

Choice of molecular markers: Nucleotide or protein sequence data? Nucleic acid (NA) sequences evolve more rapidly, so they can be used for studying very closely related organisms. For example, noncoding regions of mtDNA are often used for evolutionary analysis of different individuals within a population. For more divergent organisms, use either slowly evolving NA sequences (e.g., rRNA) or protein sequences. At the deepest level (e.g., relationships between bacteria and eukaryotes), use conserved protein sequences. NA sequences are a good choice when the sequences are closely related, and they reveal synonymous and nonsynonymous substitutions.

Positive and negative selection

MSA: Multiple sequence alignment is a critical step. Several state-of-the-art alignment programs (e.g., T-Coffee and Praline) should be used. The alignment results from the different programs should be inspected and compared carefully to identify the most reasonable one.

Model of evolution: A simple measure of the divergence of two sequences is the number of substitutions in the alignment; the distance between two sequences is then the proportion of substituted positions. But if A was replaced by C, was the path A → C or A → T → G → C? Back mutation: G → C → G. Parallel mutations: both sequences mutate into the same base (e.g., T) at the same position. All of this obscures the estimation of the true evolutionary distances between sequences. This effect is known as homoplasy and must be corrected. Statistical models are used to infer the true evolutionary distances between sequences.
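
As a concrete illustration of such a correction, the sketch below computes the observed proportion of differences (the p-distance) and then applies the Jukes-Cantor (JC69) formula, d = -(3/4) ln(1 - (4/3) p). The slide text does not name a specific model; JC69 is chosen here only because it is the simplest one, assuming equal base frequencies and equal rates for all substitutions. The example sequences are made up.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def jukes_cantor_distance(p):
    """JC69-corrected distance in expected substitutions per site."""
    if p >= 0.75:
        # At p >= 3/4 the sequences are saturated and the formula is undefined.
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("AAGACTTGCA", "AAGGCTTGCT")  # 2 differences in 10 sites
d = jukes_cantor_distance(p)
print(f"p-distance = {p:.3f}, JC69 distance = {d:.3f}")
```

The corrected distance is always at least as large as the observed proportion, because some substitutions are hidden by multiple, back, and parallel changes.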

Model of evolution

Among-site variation: Up to now we have assumed that different positions in a sequence evolve at the same rate. In reality, this may not be true. In DNA, the rates of substitution differ for different codon positions; the third codon position mutates much faster. In proteins, some amino acids change much more rarely than others owing to functional constraints.

Tree building methods: There are two major categories. Distance-based methods are based on the amount of dissimilarity between pairs of sequences, computed from the sequence alignment. Character-based methods are based on discrete characters, which are the molecular sequences from individual taxa.

Distance-based methods: Calculate the evolutionary distance d_AB between each pair of sequences using an evolutionary model. Construct a distance matrix holding the distances between all pairs of taxa. Based on the distance scores, construct a phylogenetic tree. Clustering algorithms: UPGMA, neighbor joining (NJ). Optimality-based methods: Fitch-Margoliash (FM), minimum evolution (ME).

Clustering methods: UPGMA (Unweighted Pair Group Method with Arithmetic Mean) produces a rooted tree (most phylogenetic methods produce an unrooted tree). The basic assumption of the UPGMA method is that all taxa evolve at a constant rate and are equally distant from the root, implying that a molecular clock is in effect. However, real data rarely meet this assumption. Thus, UPGMA often produces erroneous tree topologies.
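
The clustering procedure can be sketched as follows. This is a minimal illustration, not a reference implementation: the input layout (a dictionary keyed by `frozenset` pairs), the function name `upgma`, and the toy distances are all assumptions made for the example. Each round joins the two closest clusters, places the new node at half their distance (the molecular clock assumption), and averages distances to the remaining clusters weighted by cluster size.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> distance for every pair of labels.
    Returns (newick_string, root_height).
    """
    # Active clusters: Newick label -> (number of taxa, height above the tips).
    clusters = {name: (1, 0.0) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # 1. Find the closest pair of active clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, height_a = clusters.pop(a)
        size_b, height_b = clusters.pop(b)
        # 2. The new node sits at half the distance between the two clusters.
        height = d[frozenset((a, b))] / 2
        merged = f"({a}:{height - height_a:.2f},{b}:{height - height_b:.2f})"
        # 3. Distance to every other cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = (size_a + size_b, height)
    newick, (_, root_height) = next(iter(clusters.items()))
    return newick + ";", root_height


# Toy clock-like (ultrametric) distances, invented for illustration.
pairs = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
         ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
dist = {frozenset(k): float(v) for k, v in pairs.items()}
newick, root_height = upgma(["A", "B", "C", "D"], dist)
print(newick, root_height)
```

Because the toy distances are perfectly clock-like, UPGMA recovers them exactly; with data that violate the clock, the same procedure can join the wrong taxa, which is the weakness described above.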

Distance-based methods, pros and cons: Clustering is fast and can handle large datasets, but it is not guaranteed to find the best tree. The actual sequence information is lost when all the sequence variation is reduced to a single value; hence, ancestral sequences at internal nodes cannot be inferred. NJ does not assume that the rate of evolution is the same in all branches of the tree. NJ is slower but better than UPGMA. Exhaustive tree searching (FM) gives better accuracy.

Character-based methods: Also called discrete methods, they are based directly on the sequence characters. They count mutational events accumulated on the sequences and may therefore avoid the loss of information that occurs when characters are converted to distances. The evolutionary dynamics of each character can be studied, and ancestral sequences can also be inferred. The two most popular character-based approaches are the maximum parsimony (MP) and maximum likelihood (ML) methods.

Maximum parsimony: The tree that requires the fewest substitutions is probably the best explanation of the differences among the taxa under study.
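
Counting the minimum number of substitutions a given tree requires can be sketched with Fitch's small-parsimony procedure, a standard algorithm that the slide does not name explicitly. The nested-tuple tree representation, the function names, and the four toy sequences below are assumptions made for illustration. Each candidate topology is scored, and the one with the lowest score is preferred.

```python
def fitch_site(tree, states):
    """Score one alignment column on a fixed binary tree.

    tree:   nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D")).
    states: dict mapping leaf name -> character at this site.
    Returns (possible states at this node, minimum number of substitutions).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can share a state: no extra substitution is needed.
        return common, left_cost + right_cost
    # Otherwise one substitution must occur on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Total minimum number of substitutions over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


alignment = {"A": "AGGT", "B": "AGCT", "C": "CGCA", "D": "CGGA"}
topologies = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
scores = {label: parsimony_length(t, alignment) for label, t in topologies.items()}
print(scores)  # {'((A,B),(C,D))': 4, '((A,C),(B,D))': 6, '((A,D),(B,C))': 5}
```

In this toy alignment the second column is constant and contributes nothing to any tree, which is why parsimony analyses focus on informative sites.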

MP – pros and cons: The character-based method is able to provide evolutionary information about the sequence characters, such as information regarding homoplasy and ancestral states. It tends to produce more accurate trees than the distance-based methods when sequence divergence is low, because this is the circumstance in which the parsimony assumption of rarity in evolutionary changes holds true. When sequence divergence is high, tree estimation by MP can be less effective, because the original parsimony assumption no longer holds. Estimation of branch lengths may also be erroneous because MP does not employ substitution models to correct for multiple substitutions.

Maximum likelihood – ML Uses probabilistic models to choose a best tree that has the highest probability (likelihood) of reproducing the observed data. ML is an exhaustive method that searches every possible tree topology and considers every position in an alignment, not just informative sites. By employing a particular substitution model that has probability values of residue substitutions, ML calculates the total likelihood of ancestral sequences evolving to internal nodes and eventually to existing sequences. It sometimes also incorporates parameters that account for rate variations across sites.
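
The likelihood idea can be illustrated on the smallest possible case: two aligned sequences joined by a single branch, where the only parameter to estimate is the branch length d. The sketch below assumes the Jukes-Cantor (JC69) substitution model, which the slide does not specify, and uses a simple grid search in place of a real optimizer; the function name and the sequences are made up for the example. Full ML tree inference repeats this kind of calculation over internal nodes and many topologies.

```python
import math


def jc69_log_likelihood(seq1, seq2, d):
    """Log-likelihood of two aligned sequences d substitutions/site apart (JC69)."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    log_l = 0.0
    for a, b in zip(seq1, seq2):
        # 0.25 is the equilibrium frequency of the base observed in seq1.
        log_l += math.log(0.25 * (p_same if a == b else p_diff))
    return log_l


seq1 = "AAGACTTGCA"
seq2 = "AAGGCTTGCT"

# Evaluate every site, not just the variable ones, for each candidate distance.
candidates = [i / 1000 for i in range(1, 2001)]
best_d = max(candidates, key=lambda d: jc69_log_likelihood(seq1, seq2, d))
print(f"ML estimate of the distance: {best_d:.3f}")
```

For two sequences under this model, the value that maximizes the likelihood coincides with the Jukes-Cantor corrected distance, which is a useful sanity check on the calculation.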

ML – pros and cons: Based on well-founded statistics instead of a medieval philosophy. More robust: uses the full sequence information, not just informative sites. Employs a substitution model, which is a strength but also a weakness (choosing the wrong model leads to an incorrect tree). Accurately reconstructs the relationships between sequences that have been separated for a long time. Very time consuming, considerably more so than MP, which is itself more time consuming than clustering methods.