Presentation is loading. Please wait.

Presentation is loading. Please wait.

Phylogeny - based on whole genome data

Similar presentations


Presentation on theme: "Phylogeny - based on whole genome data"— Presentation transcript:

1 Phylogeny - based on whole genome data
Johanne Ahrenfeldt PhD Student 2 gange 30 min, mere baggrund Multible choice test

2 What are phylogenetic trees
A very user friendly way of showing phylogenetic relationships Trees were traditionally made using aligned sequences of single genes or proteins Whole genome data may be used to create trees based on SNP calling K-mer overlap

3 Reads -> Mapping -> SNP
Reference A G Mapped sequence

4 What is a SNP (Wikipedia)
A Single Nucleotide Polymorphism (SNP, pronounced snip; plural snips) is a DNA sequence variation occurring commonly* within a population (e.g. 1%) in which a Single Nucleotide — A, T, C or G — in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes. Some prefer the term SNV

5 How does it work Strain A ATTCAGTA Strain B ATGCAGTC Strain C ATGCAATC Strain D ATTCAGTC A B C D AA 2 3 1

6 Construct distance matrix
Strain A ATTCAGTA Strain B ATGCAGTC Strain C ATGCAATC Strain D ATTCAGTC A B C D A B C D

7 Building the tree Strain A Strain B Strain C Strain D
Strain A ATTCAGTA Strain B ATGCAGTC Strain C ATGCAATC Strain D ATTCAGTC A B C D A B C D Strain A Strain B Strain C Strain D

8 Single Nucleotide Polymorphism (SNP) calling
Which criteria to use to call SNPs? Can it be assumed that the genomes on the positions where no SNPs have been called have the same nucleotide as the reference genome? Positions in an assembly where no SNPs have been called may contain positions where there in the data is solid evidence that the nucleotide is the same as in the reference sequence borderline evidence that a SNP could be called poor data not supporting any claims on what the nucleotide on that position is

9 Ndtree Nucleotide calling A different approach from the typical SNP calling algorithms, where the main distinction is not between if a SNP should be called or not, but whether there is solid evidence for what nucleotide should be called or not. PLoS One Aug 11;9(8):e

10 Ndtree Simple mapping approach Cuts all reads into Kmers
Maps all Kmers to reference genome Makes ungapped consensus sequences of equal lengths PLoS One Aug 11;9(8):e

11 Simple mapping approach - details
The reference genome was split into K-mers of length 17 and stored in a hash table. Only non-overlapping K-mers was stored. Each read with a length of at least 50 was split in to 17-mers overlapping by 16. If a match was found it was attempted to use it as a seed for an un-gapped alignment, using a match score of 1 and a mismatch score of -3. K-mers from the read and its reverse complement was mapped until an alignment with a score of at least 50 was found. If this was the case vectors counting the number of A, T, G, and C’s found on each position was updated. If all reads were of length 50, the alignment score threshold was set to 40. PLoS One Aug 11;9(8):e

12 NDtree Nucleotide calling When all reads have been mapped, the significance of the base call at each position is evaluated by calculating the number of reads X having the most common nucleotide at that position, and the number of reads Y supporting other nucleotides. A Z-score threshold is calculated as: > 1.96 (or 3.29) >90% of reads supporting the same base PLoS One Aug 11;9(8):e

13 NDtree Count nucleotide differences
Method 1: Each pair of sequences was compared and the number of nucleotide differences in positions called in all sequences was counted. More accurate (z=1.96 is used as threshold) Use of the –a option Method 2: Each pair of sequences was compared and the number of nucleotide differences in positions called in both sequences was counted. More robust (z=3.29 is used as threshold)

14 NDtree – tree building Uses two different algorithms to make two different trees UPGMA Neighbor Joining Both methods makes trees from distance matrices

15 UPGMA algorithm UPGMA (Unweighted Pair Group Method with Arithmetic Mean) Simplest method: First cluster closest strains Merge those strains to one point Distance of cluster to another strain is average of distances for each member in cluster to that strain Then merge second closest Repeat until all strains are clustered

16 Neighbor Joining NJ (Neighbor Joining) method
Allows for different evolution rates What you need to know is that is allows for different evolution rates – the algorithm is described in the following slides – if you are interested, it can also be googled

17 NJ algorithm (Wikipedia)
Find the pair that has the lowest distance. These taxa are joined to a newly created node, which is connected to the central node. Calculate the distance from each of the taxa in the pair to this new node. Calculate the distance from each of the taxa outside of this pair to the new node. Start the algorithm again, replacing the pair of joined neighbors with the new node and using the distances calculated in the previous step.

18 NJ algorithm (Wikipedia)

19 UPGMA vs. Neighbor Joining
UPGMA works well when samples have been taken the same time Neighbor joining is better when samples have been taken at different times

20 NDtree Output distance_names.mat: Distance matrix - Tab separated infile.names: Distance matrix - Neighbor format tree.nj.newick: Neighbor Joining tree - Newick format Branch lengths is number of SNPs (Nucleotide Differences) tree.upgma.newick: UPGMA tree – Newick format

21 Controlled evolution Johanne Ahrenfeldt, Master thesis.

22 Naming of descendants Johanne Ahrenfeldt, Master thesis.

23 Johanne Ahrenfeldt, Master thesis.

24 Phylogenetic tree using Ndtree (UPGMA)

25 Phylogenetic tree using neighbor joining
Johanne Ahrenfeldt, Master thesis.

26 Should the template be close?
Two strategies for uncalled positions Uncalled positions are assumed to be identical to reference strain Remote template will lead to clustering based on quality and technology Use the –f option for assimpler.py Uncalled positions are left out of analysis Remote template will make in impossible to differentiate between very closely related strains PLoS One Aug 11;9(8):e

27 Reference dependency NDtree is more dependent on the reference
NDtree only finds what we ask it to look for, e.g. what is in the reference


Download ppt "Phylogeny - based on whole genome data"

Similar presentations


Ads by Google