Workshop on data analyses

Slides:



Advertisements
Similar presentations
Phylogenetic Tree A Phylogeny (Phylogenetic tree) or Evolutionary tree represents the evolutionary relationships among a set of organisms or groups of.
Advertisements

Bioinformatics Phylogenetic analysis and sequence alignment The concept of evolutionary tree Types of phylogenetic trees Measurements of genetic distances.
METHODS FOR HAPLOTYPE RECONSTRUCTION
An Introduction to Phylogenetic Methods
Wellcome Trust Workshop Working with Pathogen Genomes Module 6 Phylogeny.
GENE TREES Abhita Chugh. Phylogenetic tree Evolutionary tree showing the relationship among various entities that are believed to have a common ancestor.
1 General Phylogenetics Points that will be covered in this presentation Tree TerminologyTree Terminology General Points About Phylogenetic TreesGeneral.
Phylogenetics - Distance-Based Methods CIS 667 March 11, 2204.
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
IE68 - Biological databases Phylogenetic analysis
Molecular Evolution Revised 29/12/06
Tree Reconstruction.
© Wiley Publishing All Rights Reserved. Phylogeny.
“Inferring Phylogenies” Joseph Felsenstein Excellent reference
How missing data and taxon sampling play the role in Phylogeny reconstruction? A case study on a five-gene dataset of Eurotiomycetous endophytic fungi.
. Phylogeny II : Parsimony, ML, SEMPHY. Phylogenetic Tree u Topology: bifurcating Leaves - 1…N Internal nodes N+1…2N-2 leaf branch internal node.
Bioinformatics and Phylogenetic Analysis
Tree Pattern Matching in Phylogenetic Trees Automatic Search for Orthologs or Paralogs in Homologous Gene Sequence Databases By: Jean-François Dufayard,
Methods for Phylogenetics and Evolutionary analysis Jianpeng Xu University of Nebraska-Omah a.
Dispersal models Continuous populations Isolation-by-distance Discrete populations Stepping-stone Island model.
Probabilistic methods for phylogenetic trees (Part 2)
Bioinformatics tools for phylogeny and visualization
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Maximum parsimony Kai Müller.
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
Terminology of phylogenetic trees
BINF6201/8201 Molecular phylogenetic methods
Input for the Bayesian Phylogenetic Workflow All Input values could be loaded as text file or typing directly. Only for the multifasta file is advised.
Molecular phylogenetics
Molecular evidence for endosymbiosis Perform blastp to investigate sequence similarity among domains of life Found yeast nuclear genes exhibit more sequence.
Speciation history inferred from gene trees L. Lacey Knowles Department of Ecology and Evolutionary Biology University of Michigan, Ann Arbor MI
Phylogenetic Analysis. General comments on phylogenetics Phylogenetics is the branch of biology that deals with evolutionary relatedness Uses some measure.
Molecular phylogenetics 1 Level 3 Molecular Evolution and Bioinformatics Jim Provan Page and Holmes: Sections
Lecture 25 - Phylogeny Based on Chapter 23 - Molecular Evolution Copyright © 2010 Pearson Education Inc.
BINF6201/8201 Molecular phylogenetic methods
Bioinformatics 2011 Molecular Evolution Revised 29/12/06.
Population assignment likelihoods in a phylogenetic and demographic model. Jody Hey Rutgers University.
Phylogeny GENE why is coalescent theory important for understanding phylogenetics (species trees)? coalescent theory lets us test our assumptions.
Announcements Urban Forestry project starts this week. Go through protocol. We'll be sending you off on your own. Please act responsibly. Peer review of.
Introduction to Phylogenetics
Using traveling salesman problem algorithms for evolutionary tree construction Chantal Korostensky and Gaston H. Gonnet Presentation by: Ben Snider.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
Chapter 10 Phylogenetic Basics. Similarities and divergence between biological sequences are often represented by phylogenetic trees Phylogenetics is.
SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO Hybrid MPI/Pthreads Parallelization of the RAxML Phylogenetics Code Wayne Pfeiffer.
Phylogeny Ch. 7 & 8.
Ben Stöver WS 2012/2013 Ancestral state reconstruction Molecular Phylogenetics – exercise.
Hickory Dickory Dock: Understanding the Molecular Clock Felisa Wolfe ERUPT: Biocomplexity Seminar 28 Feb 2003.
Phylogenetics.
Ayesha M.Khan Spring Phylogenetic Basics 2 One central field in biology is to infer the relation between species. Do they possess a common ancestor?
Step 3: Tools Database Searching
Bayesian Evolutionary Analysis by Sampling Trees (BEAST) LEE KIM-SUNG Environmental Health Institute National Environment Agency.
Bioinf.cs.auckland.ac.nz Juin 2008 Uncorrelated and Autocorrelated relaxed phylogenetics Michaël Defoin-Platel and Alexei Drummond.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Molecular Evolution. Study of how genes and proteins evolve and how are organisms related based on their DNA sequence Molecular evolution therefore is.
Lecture 19 – Species Tree Estimation
Workshop Biogeography
Introduction to Bioinformatics Resources for DNA Barcoding
Evolutionary genomics can now be applied beyond ‘model’ organisms
Phylogenetic basis of systematics
Linkage and Linkage Disequilibrium
DNA Marker Lecture 10 BY Ms. Shumaila Azam
Molecular Evolution.
Summary and Recommendations
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Chapter 19 Molecular Phylogenetics
Molecular data assisted morphological analyses
Summary and Recommendations
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Presentation transcript:

Workshop on data analyses Guoxing Yin 2018.7.13

System and software we should prepare macOS or Linux system, python, java, R studio package (ade4, ggplot2) PartitionFinder 2 http://www.robertlanfear.com/partitionfinder RAxML https://github.com/stamatak/standard-RAxML ASTRAL https://github.com/smirarab/ASTRAL Structure http://web.stanford.edu/group/pritchardlab/structure_software/release_version s/v2.3.4/html/structure.html PCA https://www.rstudio.com BFD http://www.beast2.org SNAPP http://www.beast2.org

Backgrounds When we finish data-analysis on assemble pipeline, we can get DNA sequence data. Then we can make a reference using DNA sequence data, mapping reads using gatk.sh, we can get SNP data.

SNP (single-nucleotide polymorphism) SNP is a variation in a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population.

Call-SNPs Consensus.pl-------make a reference sequence Gatk.sh--------call-SNPs automatic--- **.vcf Vcftosnps.pl----- **_beast.nex & **_structure.txt & **_rout.txt

Gatk https://software.broadinstitute.org/gatk

Mark duplicates ATCGACTGATCGACTGATCGACTGATCGACTG--------reference ATCGACTGATCGACTTATCGACTGATCGACTG--------sample1 ATCGACTGATCGACTGATCGACTGATCGACTG--------sample2 ATCGACTGATCGACTGATCGACTGATCGACTG--------sample3 ATCGACTGATCGACTGATCGACTGATCGACTG--------sample4

Indel realignment ATCGACTGATCGACTGATCGACTGATCGACTG--------reference ATCGACTGATCGACTGATCGACTGATCGACG---------sample1 ATCGACTGATCGACTGATCGACTGATCGACG---------sample2 ATCGACTGATCGACTGATCGACTGATCGACG---------sample3 ATCGACTGATCGACTGATCGACTGATCGACTG--------sample4 ATCGACTGATCGACTGATCGACTGATCGACTG--------sample5 ATCGACTGATCGACTGATCGACTGATCGACTG--------sample6

Base recalibration ATCGACTGATCGACTGATCGACTGATCGACTG--------reference ATCGACTGATCGACTTATCGACTGATCGACTG--------sample1 ATCGACTGATCGACTGATCGACTGATCGACTG--------sample2 ATCGACTGATCGACTGATCGACTGATCGACTG--------sample3 ATCGACTGATCGACTGATCGACTGATCGACTG--------sample4 ATCGACTGATCGACTGATCGACTGATCGACTG--------sample5 ATCGACTGATCGACTGATCGACTGATCGACTG--------sample6

Preparation for data-analysis DNA_aligned.fas Pick_taxa.pl ----------DNA_selected.fas mafft_aln.pl----------DNA_selected_aligned.fas

ATCGACTGATCGACTGATCGACTGATCGACTG sample1 ATCGA--GATC-----GATCGACTGATCGACTG sample2 ATCGACTGATCGACTGATCGACTGATCGACTG sample3 ATCGACTGATC-----GATCGAC------GACTG sample4 ATCGACTGATCGACTGATCGACTGATCGACTG sample5 ATCGACTGATC-----GATCGACTGATCG----- sample6

PartitionFinder2 PartitionFinder2 is a program for selecting best-fit partitioning schemes and models of evolution for nucleotide, amino acid, and morphology alignments.

alignment DNA_selected_aligned.fas concat_loci.pl--------DNA_selested_aligned.phy

Branchlengths Unlinked: Each subset has its own independent set of branch lengths.

Branchlengths Linked: Only one underlying set of branch lengths is estimated. Each subset has its own scaling parameter .

Data blocks DNA_selected_aligned.fas concat_loci.pl--------DNA_selested_aligned.nex extract_DNAblocks.pl------DNA_blocks.txt

Usage python “<PartitionFinder.py>” “<InputFolder.cfg>” --raxml -- rcluster-max 100

--rcluster-max Setting --rcluster-max to 1000 (i.e. the default values) is usually sufficient to ensure that PF2 will estimate a robust partitioning scheme, even on very large datasets in which there may be millions of possible pairs of data blocks. Setting either of them higher will tend to make the search more thorough and slower.

Example

Tree (in biology) Dendrogram representing the evolutionary relationship between species.

Terminology Branches Internal nodes Root Terminal nodes Phylogram or phylogenetic tree

Based on the molecular level of phylogenetic methods can be divided into two categories Discrete feature based approach Example:Maximum parsimony methods, Maximum likelihood methods, Bayesian methods, etc . Based on distance methods Example :N-J methods, etc.

Maximum likelihood methods(ML) A completely statistical-based phylogenetic tree reconstruction approach that takes into account the probability of each nucleotide substitution in each set of comparisons. The tree with the largest sum of probabilities is most likely the most real phylogenetic tree. The important advantage of probabilistic methods over parsimony is statistically consistent.

Basic evolutionary models Topology and branch length Substitution matrix rTC (= rCT), rTA (= rAT), rTG (= rGT) rCA (= rAC), rCG (= rGC) rAG (= rGA) Stationary base frequencies fT, fC, fA, fG, Taxa 1 Taxa 2 Taxa 3 Taxa 4 Taxa 5

More formally if given some data D and a hypothesis H the likelihood of those data is given by LD = Pr (D | H) Which is the probability of D given H.

What kind of trees Gene trees Species trees Time trees Etc.

Gene trees "Gene" trees represent the evolutionary history of the genes included in the study. Gene trees can provide evidence for gene duplication events, as well as speciation events. Sequences from different homologs can be included in a gene tree; the subsequent analyses should cluster orthologs, thus demonstrating the evolutionary history of the orthologs.

Concatenated tree Concatenated tree use concatenate independence gene, so the tree are more truly reflect the evolutionary history of species. We use RAxML to reconstruct concatenated tree.

RAxML (Randomized Axelerated Maximum Likelihood) It is a popular program for phylogenetic analysis of large datasets under maximum likelihood. Its major strength is a fast maximum likelihood tree search algorithm that returns trees with good likelihood scores.

Usage DNA_selected_aligned.fas concat_loci.pl--------DNA_selested_aligned.phy raxmlHPC-PTHREADS –T 12 –p 12345 –m GTRGAMMA –s ***.phy –n raxml -y -f a –x 12345 -# 100 –q partation.txt

-T number of nodes -p parsimony Random Seed -m substitution Model ­-s sequence File Name ­-n output File Name -f a rapid Bootstrap analysis and search for best-scoring ML tree in one program run –x Specify an integer number (random seed) and turn on rapid bootstrapping -# Specify the number of alternative runs on distinct starting trees

Example

Species trees "Species" trees recover the genealogy of taxa, individuals of a population, etc. Internal nodes represent speciation or other taxonomic events. Species trees should contain sequences from only orthologous genes.

ASTRAL ASTRAL is a tool for estimating an unrooted species tree given a set of unrooted gene trees. ASTRAL is statistically consistent under the multi-species coalescent model. ASTRAL finds the species tree that has the maximum number of shared induced quartet trees with the set of gene trees, subject to the constraint that the set of bipartitions in the species tree comes from a predefined set of bipartitions.

Usage DNA_selected_aligned.fas construct_tree.pl ---------DNA_aligned_seleced.tre Catree.pl -------catree.tre Java –jar positation/astral.jar –i positation/catree.tre –a positation/speciesname.txt –t 2 –o ***.tre

-jar version of astral -i input file -a speciesname -o output file -t 2 full annotation

Example

SNP-analysis Structure PCA BFD SNAPP

Structure  structure is a package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly-used genetic markers, including SNPS, microsatellites, RFLPs and AFLPs.

Usage File -> new project -> name the project + select directory + choose data file -> number of individual (**) + ploidy of data (2) + number of loci (***) + missing data value (-9) -> skip -> choose individual ID for each individual -> Proceed Parameter set -> new -> length of burnin period (***) + number of mcmc reps after burnin (***) -> name parameterProject -> start a job -> set k from m to n (i.e. 2-4) + number of Iterations (3 or 5) -> start Compress the result and upload to Structure Harvester.File -> load structure results -> Browse (choose your result with the best value of K)-> Bar Plot -> show -> sort by Q -> save

Example

PCA (Principal component analysis) PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

Example Working in R studio

BFD (Bays factor delimitation) BFD is to estimate the marginal likelihood of each competing species delimitation model, rank models by marginal likelihood, and use Bayes factors to assess support for model rankings.

Usage Open Beatui Set parameters, save as xx.xml file Fix xx.xml file Open file use beast or commend line

MCMC Robot http://marple.eeb.uconn.edu/mcmcrobot/?page_id=24

The BF scale is as follows: 0 < BF < 2 is not worth more than a bare mention, 2 < BF < 6 is positive evidence, 6 < BF < 10 is strong support, and BF > 10 is decisive.

SNAPP SNAPP (SNP and AFLP Package for Phylogenetic analysis) is package for inferring species trees and species demographics from independent (unlinked) biallelic markers such as well spaced SNPs. It implements a full coalescent model, but uses a novel algorithm to integrate over all possible gene trees, rather than sampling them explicitly.

Usage Open Beauti Set parameters, save file Open file use Beast Check parameters using Trance

Thanks for your attention!