Phylogeny - based on whole genome data


Phylogeny - based on whole genome data. Johanne Ahrenfeldt, PhD Student. 2 × 30 min, more background. Multiple choice test.

What are phylogenetic trees? A very user-friendly way of showing phylogenetic relationships. Trees were traditionally made using aligned sequences of single genes or proteins. Whole genome data may be used to create trees based on SNP calling or K-mer overlap.

Reads -> Mapping -> SNP Reference A G Mapped sequence

What is a SNP? (Wikipedia) A Single Nucleotide Polymorphism (SNP, pronounced snip; plural snips) is a DNA sequence variation occurring commonly* within a population (e.g. 1%) in which a single nucleotide (A, T, C or G) in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes. Some prefer the term SNV (single nucleotide variant).

How does it work?
Strain A: ATTCAGTA
Strain B: ATGCAGTC
Strain C: ATGCAATC
Strain D: ATTCAGTC
Number of nucleotide differences from strain A: B = 2, C = 3, D = 1

Construct distance matrix
Strain A: ATTCAGTA
Strain B: ATGCAGTC
Strain C: ATGCAATC
Strain D: ATTCAGTC

     A  B  C  D
A    0  2  3  1
B    2  0  1  1
C    3  1  0  2
D    1  1  2  0
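The matrix on this slide can be reproduced by counting mismatching positions between each pair of sequences. The following is a minimal sketch of that counting step only; the function names are illustrative and are not taken from NDtree.

```python
from itertools import combinations

# Sequences copied from the slide.
strains = {
    "A": "ATTCAGTA",
    "B": "ATGCAGTC",
    "C": "ATGCAATC",
    "D": "ATTCAGTC",
}

def nucleotide_differences(s1, s2):
    """Count the positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s1, s2))

def distance_matrix(seqs):
    """Return a symmetric dict-of-dicts matrix of pairwise differences."""
    names = list(seqs)
    dist = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        dist[n][m] = dist[m][n] = nucleotide_differences(seqs[n], seqs[m])
    return dist

matrix = distance_matrix(strains)
for name in matrix:
    # One row per strain, in the order A B C D.
    print(name, *(matrix[name][other] for other in matrix))
```

Each printed row corresponds to one row of the matrix on the slide.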

Building the tree
Strain A: ATTCAGTA
Strain B: ATGCAGTC
Strain C: ATGCAATC
Strain D: ATTCAGTC

     A  B  C  D
A    0  2  3  1
B    2  0  1  1
C    3  1  0  2
D    1  1  2  0

[Figure: tree with the leaves Strain A, Strain B, Strain C and Strain D]

Single Nucleotide Polymorphism (SNP) calling. Which criteria should be used to call SNPs? Can it be assumed that positions where no SNP has been called have the same nucleotide as the reference genome? Positions in an assembly where no SNP has been called may include positions where the data give (1) solid evidence that the nucleotide is the same as in the reference sequence, (2) borderline evidence that a SNP could be called, or (3) poor data that do not support any claim about which nucleotide is at that position.

NDtree: nucleotide calling. A different approach from typical SNP calling algorithms: the main distinction is not whether a SNP should be called, but whether there is solid evidence for which nucleotide should be called. PLoS One. 2014 Aug 11;9(8):e104984.

NDtree: simple mapping approach. Cuts all reads into K-mers. Maps all K-mers to the reference genome. Makes ungapped consensus sequences of equal length. PLoS One. 2014 Aug 11;9(8):e104984.

Simple mapping approach - details. The reference genome was split into K-mers of length 17 and stored in a hash table. Only non-overlapping K-mers were stored. Each read with a length of at least 50 was split into 17-mers overlapping by 16. If a match was found, an attempt was made to use it as a seed for an ungapped alignment, using a match score of 1 and a mismatch score of -3. K-mers from the read and its reverse complement were mapped until an alignment with a score of at least 50 was found. When such an alignment was found, the vectors counting the number of A, T, G and C's observed at each position were updated. If all reads were of length 50, the alignment score threshold was set to 40. PLoS One. 2014 Aug 11;9(8):e104984.
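The seeding and scoring steps described on this slide can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the NDtree source: all function names are made up, reads that overhang the reference are skipped, and the reference and read are random toy data.

```python
import random

K = 17  # K-mer length given on the slide
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def index_reference(ref, k=K):
    """Hash table: non-overlapping reference K-mer -> start position."""
    index = {}
    for pos in range(0, len(ref) - k + 1, k):
        index.setdefault(ref[pos:pos + k], pos)  # keep the first occurrence only
    return index

def ungapped_score(read, ref, start, match=1, mismatch=-3):
    """Score a read placed at `start` on the reference, without gaps."""
    if start < 0 or start + len(read) > len(ref):
        return None  # simplification: overhanging reads are skipped
    return sum(match if a == b else mismatch
               for a, b in zip(read, ref[start:start + len(read)]))

def map_read(read, ref, index, k=K, min_len=50, min_score=50):
    """Return (start, oriented_read) for the first alignment reaching min_score."""
    if len(read) < min_len:
        return None
    for oriented in (read, reverse_complement(read)):
        for offset in range(len(oriented) - k + 1):  # K-mers overlapping by k - 1
            seed = index.get(oriented[offset:offset + k])
            if seed is None:
                continue
            start = seed - offset
            score = ungapped_score(oriented, ref, start)
            if score is not None and score >= min_score:
                return start, oriented
    return None

# Toy data (hypothetical): a random 200 bp reference and a 60 bp read
# taken from position 30 with one mismatch introduced.
rng = random.Random(0)
ref = "".join(rng.choice("ACGT") for _ in range(200))
read = list(ref[30:90])
read[10] = "A" if read[10] != "A" else "C"
read = "".join(read)

index = index_reference(ref)
counts = [dict.fromkeys("ATGC", 0) for _ in ref]  # per-position nucleotide counts
hit = map_read(read, ref, index)
if hit is not None:
    start, oriented = hit
    for i, base in enumerate(oriented):
        counts[start + i][base] += 1
```

Storing only non-overlapping reference K-mers keeps the hash table small, while the reads are still cut into every overlapping K-mer, so any error-free stretch of the read that covers a stored reference K-mer can act as a seed.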

NDtree: nucleotide calling. When all reads have been mapped, the significance of the base call at each position is evaluated by calculating the number of reads X having the most common nucleotide at that position, and the number of reads Y supporting other nucleotides. A base is called only if the Z-score computed from X and Y exceeds the threshold, Z > 1.96 (or 3.29), and >90% of the reads support the same base. PLoS One. 2014 Aug 11;9(8):e104984.
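The calling rule can be illustrated with a short sketch. The Z-score formula itself is not preserved in this transcript; the code assumes Z = (X - Y) / sqrt(X + Y), which is believed to be the form used in the cited PLoS One paper, so treat that line as an assumption. The function name and the per-position count dictionary are also illustrative.

```python
import math

def call_base(counts, z_threshold=1.96, min_fraction=0.9):
    """Return the called base for one position, or None if evidence is too weak.

    counts maps each nucleotide to the number of reads supporting it.
    """
    total = sum(counts.values())
    if total == 0:
        return None  # no reads cover this position
    base, x = max(counts.items(), key=lambda kv: kv[1])  # most common nucleotide
    y = total - x                                        # reads supporting others
    z = (x - y) / math.sqrt(x + y)  # ASSUMED formula, see the text above
    if z > z_threshold and x / total > min_fraction:
        return base
    return None

print(call_base({"A": 20, "C": 1, "G": 0, "T": 0}))  # well supported
print(call_base({"A": 3, "C": 0, "G": 0, "T": 0}))   # too few reads
print(call_base({"A": 30, "C": 0, "G": 6, "T": 0}))  # fails the 90% rule
```

Under this assumed formula, a position covered by only a few reads is left uncalled even when the reads all agree, and a well-covered position with a sizeable minority base fails the 90% requirement.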

NDtree: count nucleotide differences. Method 1: each pair of sequences was compared and the number of nucleotide differences in positions called in all sequences was counted. More accurate (z = 1.96 is used as threshold). Use of the -a option. Method 2: each pair of sequences was compared and the number of nucleotide differences in positions called in both sequences was counted. More robust (z = 3.29 is used as threshold).
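The difference between the two counting methods can be shown on a small example. This sketch assumes that uncalled positions in the consensus sequences are marked with "N"; that marker, the function names and the toy sequences are assumptions for illustration only.

```python
UNCALLED = "N"  # assumed marker for positions without a confident base call

def differences_called_in_all(seqs, a, b):
    """Method 1: compare a and b only at positions called in every sequence."""
    usable = [i for i in range(len(seqs[a]))
              if all(s[i] != UNCALLED for s in seqs.values())]
    return sum(seqs[a][i] != seqs[b][i] for i in usable)

def differences_called_in_both(seqs, a, b):
    """Method 2: compare a and b at positions called in both a and b."""
    return sum(x != y for x, y in zip(seqs[a], seqs[b])
               if UNCALLED not in (x, y))

# Toy consensus sequences; strain C has one uncalled position (index 2).
consensus = {
    "A": "ATTCAGTA",
    "B": "ATGCAGTC",
    "C": "ATNCAATC",
}

print(differences_called_in_all(consensus, "A", "B"))   # index 2 is ignored
print(differences_called_in_both(consensus, "A", "B"))  # index 2 is counted
```

With method 1, a single poorly covered strain removes a position from every pairwise comparison, whereas method 2 only drops it from the comparisons that involve that strain.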

NDtree - tree building. Uses two different algorithms to make two different trees: UPGMA and Neighbor Joining. Both methods make trees from distance matrices.

UPGMA algorithm. UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is the simplest method. First, cluster the closest strains. Merge those strains into one point. The distance from the cluster to another strain is the average of the distances from each member of the cluster to that strain. Then merge the second closest. Repeat until all strains are clustered.
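The merging procedure can be sketched in a few lines, here applied to the four-strain distance matrix from the earlier slides. This is a minimal illustration rather than the NDtree implementation: it records only the merge order and merge distances (no branch lengths), and ties are broken by the order of the labels, which is an arbitrary choice.

```python
from itertools import combinations

def upgma(dist):
    """Cluster by UPGMA. dist is a symmetric dict-of-dicts distance matrix.

    Returns the final nested label and the list of merges (a, b, distance).
    """
    sizes = {name: 1 for name in dist}  # active clusters -> number of strains
    d = {frozenset(p): dist[p[0]][p[1]] for p in combinations(dist, 2)}
    merges = []
    while len(sizes) > 1:
        # Closest pair of active clusters; the first minimum wins on ties.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        new = f"({a},{b})"
        merges.append((a, b, d[frozenset((a, b))]))
        for other in sizes:
            if other in (a, b):
                continue
            # Size-weighted mean equals the average over all cluster members.
            d[frozenset((new, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        sizes[new] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)), merges

matrix = {
    "A": {"A": 0, "B": 2, "C": 3, "D": 1},
    "B": {"A": 2, "B": 0, "C": 1, "D": 1},
    "C": {"A": 3, "B": 1, "C": 0, "D": 2},
    "D": {"A": 1, "B": 1, "C": 2, "D": 0},
}
tree, merges = upgma(matrix)
print(tree)  # ((A,D),(B,C))
for a, b, distance in merges:
    print(a, b, distance)
```

In this example the pairs A-D, B-C and B-D are all at distance 1, so a different tie-breaking rule could give a different topology.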

Neighbor Joining. The NJ (Neighbor Joining) method allows for different evolution rates. What you need to know is that it allows for different evolution rates; the algorithm is described in the following slides and, if you are interested, can also be looked up online.

NJ algorithm (Wikipedia). From the current distance matrix, compute the matrix Q and find the pair of taxa with the lowest value in Q (the pair that is closest after correcting for each taxon's total distance to all other taxa, which is not necessarily the pair with the lowest raw distance). These taxa are joined to a newly created node, which is connected to the central node. Calculate the distance from each of the taxa in the pair to this new node. Calculate the distance from each of the taxa outside of this pair to the new node. Start the algorithm again, replacing the pair of joined neighbors with the new node and using the distances calculated in the previous step.
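The loop can be sketched as below, using the standard formulas from the Wikipedia description of neighbor joining: the Q-criterion Q(i,j) = (n-2)d(i,j) - r(i) - r(j), where r is the row sum, for choosing the pair, and the usual branch length and distance updates. The function name, the node labels and the tie-breaking rule are illustrative choices, and this is not the NDtree implementation.

```python
from itertools import combinations

def neighbor_joining(dist):
    """Return the unrooted NJ tree as a list of (node, node, branch_length) edges.

    dist is a symmetric dict-of-dicts distance matrix including the diagonal.
    """
    d = {a: dict(row) for a, row in dist.items()}  # working copy
    edges = []
    next_id = 1
    while len(d) > 2:
        n = len(d)
        r = {a: sum(d[a][k] for k in d if k != a) for a in d}  # row sums
        # Pair minimising Q(i, j); the first minimum wins on ties.
        i, j = min(combinations(d, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        u = f"node{next_id}"
        next_id += 1
        # Branch lengths from the joined taxa to the new node.
        li = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        edges += [(i, u, li), (j, u, lj)]
        # Distances from every remaining taxon to the new node.
        new_row = {k: (d[i][k] + d[j][k] - d[i][j]) / 2
                   for k in d if k not in (i, j)}
        del d[i], d[j]
        for row in d.values():
            del row[i], row[j]
        for k, value in new_row.items():
            d[k][u] = value
        d[u] = dict(new_row)
        d[u][u] = 0
    a, b = d  # the last two nodes are connected directly
    edges.append((a, b, d[a][b]))
    return edges

matrix = {
    "A": {"A": 0, "B": 2, "C": 3, "D": 1},
    "B": {"A": 2, "B": 0, "C": 1, "D": 1},
    "C": {"A": 3, "B": 1, "C": 0, "D": 2},
    "D": {"A": 1, "B": 1, "C": 2, "D": 0},
}
for edge in neighbor_joining(matrix):
    print(edge)
```

For the four-strain matrix from the earlier slides, the branch lengths sum along each path to the original pairwise distances, and strains A and C sit on longer branches than D and B. That illustrates how NJ accommodates unequal evolution rates.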

NJ algorithm (Wikipedia)

UPGMA vs. Neighbor Joining. UPGMA works well when samples have been taken at the same time. Neighbor joining is better when samples have been taken at different times.

NDtree output. distance_names.mat: distance matrix, tab-separated. infile.names: distance matrix, Neighbor format. tree.nj.newick: Neighbor Joining tree, Newick format. Branch lengths are the number of SNPs (nucleotide differences). tree.upgma.newick: UPGMA tree, Newick format.

Controlled evolution Johanne Ahrenfeldt, Master thesis.

Naming of descendants Johanne Ahrenfeldt, Master thesis.

Johanne Ahrenfeldt, Master thesis.

Phylogenetic tree using NDtree (UPGMA)

Phylogenetic tree using neighbor joining Johanne Ahrenfeldt, Master thesis.

Should the template be close? Two strategies for uncalled positions. (1) Uncalled positions are assumed to be identical to the reference strain. A remote template will lead to clustering based on quality and technology. Use the -f option for assimpler.py. (2) Uncalled positions are left out of the analysis. A remote template will make it impossible to differentiate between very closely related strains. PLoS One. 2014 Aug 11;9(8):e104984.

Reference dependency. NDtree is more dependent on the reference. NDtree only finds what we ask it to look for, e.g. what is in the reference.