Workshop on data analyses Guoxing Yin 2018.7.13
System and software we should prepare
  macOS or Linux system, Python, Java, RStudio with the R packages ade4 and ggplot2
  PartitionFinder 2 http://www.robertlanfear.com/partitionfinder
  RAxML https://github.com/stamatak/standard-RAxML
  ASTRAL https://github.com/smirarab/ASTRAL
  Structure http://web.stanford.edu/group/pritchardlab/structure_software/release_versions/v2.3.4/html/structure.html
  PCA https://www.rstudio.com
  BFD http://www.beast2.org
  SNAPP http://www.beast2.org
Background Once the assembly pipeline has finished, we have DNA sequence data. We then build a reference from those sequences and map the reads to it with gatk.sh, which gives us SNP data.
SNP (single-nucleotide polymorphism) SNP is a variation in a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population.
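As an illustration of the definition, the sketch below compares aligned sample sequences with a reference and reports the positions where a base differs. The function name `find_snps` and the two toy samples are assumptions for illustration; real SNP calling (as done by GATK) works on mapped reads and also weighs base quality and allele frequency in the population.

```python
def find_snps(reference, samples):
    """Return {position: {sample_name: base}} for sites that differ from the reference.

    Positions are 0-based. Sequences must already be aligned to the reference
    and have the same length; gaps ('-') and 'N' are skipped.
    """
    snps = {}
    for name, seq in samples.items():
        if len(seq) != len(reference):
            raise ValueError(f"{name} is not the same length as the reference")
        for pos, (ref_base, base) in enumerate(zip(reference, seq)):
            if base in "-N" or base == ref_base:
                continue
            snps.setdefault(pos, {})[name] = base
    return snps


# Toy data, for illustration only.
reference = "ATCGACTGATCGACTGATCGACTGATCGACTG"
samples = {
    "sample1": "ATCGACTGATCGACTTATCGACTGATCGACTG",
    "sample2": "ATCGACTGATCGACTGATCGACTGATCGACTG",
}
print(find_snps(reference, samples))  # {15: {'sample1': 'T'}}
```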
Call-SNPs
  Consensus.pl ------- make a reference sequence
  Gatk.sh ------- call SNPs automatically --- **.vcf
  Vcftosnps.pl ------- **_beast.nex & **_structure.txt & **_rout.txt
Gatk https://software.broadinstitute.org/gatk
Mark duplicates
  ATCGACTGATCGACTGATCGACTGATCGACTG--------reference
  ATCGACTGATCGACTTATCGACTGATCGACTG--------sample1
  ATCGACTGATCGACTGATCGACTGATCGACTG--------sample2
  ATCGACTGATCGACTGATCGACTGATCGACTG--------sample3
  ATCGACTGATCGACTGATCGACTGATCGACTG--------sample4
Indel realignment
  ATCGACTGATCGACTGATCGACTGATCGACTG--------reference
  ATCGACTGATCGACTGATCGACTGATCGACG---------sample1
  ATCGACTGATCGACTGATCGACTGATCGACG---------sample2
  ATCGACTGATCGACTGATCGACTGATCGACG---------sample3
  ATCGACTGATCGACTGATCGACTGATCGACTG--------sample4
  ATCGACTGATCGACTGATCGACTGATCGACTG--------sample5
  ATCGACTGATCGACTGATCGACTGATCGACTG--------sample6
Base recalibration
  ATCGACTGATCGACTGATCGACTGATCGACTG--------reference
  ATCGACTGATCGACTTATCGACTGATCGACTG--------sample1
  ATCGACTGATCGACTGATCGACTGATCGACTG--------sample2
  ATCGACTGATCGACTGATCGACTGATCGACTG--------sample3
  ATCGACTGATCGACTGATCGACTGATCGACTG--------sample4
  ATCGACTGATCGACTGATCGACTGATCGACTG--------sample5
  ATCGACTGATCGACTGATCGACTGATCGACTG--------sample6
Preparation for data-analysis
  DNA_aligned.fas
  Pick_taxa.pl ---------- DNA_selected.fas
  mafft_aln.pl ---------- DNA_selected_aligned.fas
  ATCGACTGATCGACTGATCGACTGATCGACTG sample1
  ATCGA--GATC-----GATCGACTGATCGACTG sample2
  ATCGACTGATCGACTGATCGACTGATCGACTG sample3
  ATCGACTGATC-----GATCGAC------GACTG sample4
  ATCGACTGATCGACTGATCGACTGATCGACTG sample5
  ATCGACTGATC-----GATCGACTGATCG----- sample6
PartitionFinder2 PartitionFinder2 is a program for selecting best-fit partitioning schemes and models of evolution for nucleotide, amino acid, and morphology alignments.
alignment
  DNA_selected_aligned.fas
  concat_loci.pl -------- DNA_selected_aligned.phy
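The code of concat_loci.pl is not shown in this workshop, so the following is only a sketch of what a concatenation step typically does: join per-locus alignments into one supermatrix, pad taxa that are missing from a locus with gaps, and record where each locus starts and ends. The function names and the toy loci are hypothetical.

```python
def concatenate_loci(loci):
    """Concatenate per-locus alignments into one supermatrix.

    loci: dict mapping locus name -> dict mapping taxon -> aligned sequence.
    Returns (matrix, blocks): matrix maps taxon -> concatenated sequence,
    blocks is a list of (locus, start, end) with 1-based inclusive coordinates.
    """
    taxa = sorted({taxon for aln in loci.values() for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    blocks = []
    start = 1
    for locus, aln in loci.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {locus} differ in length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa absent from this locus are filled with gaps.
            parts[taxon].append(aln.get(taxon, "-" * length))
        blocks.append((locus, start, start + length - 1))
        start += length
    matrix = {taxon: "".join(chunks) for taxon, chunks in parts.items()}
    return matrix, blocks


def to_phylip(matrix):
    """Format the supermatrix as relaxed sequential PHYLIP text."""
    ntax = len(matrix)
    nchar = len(next(iter(matrix.values())))
    lines = [f"{ntax} {nchar}"]
    lines += [f"{taxon} {seq}" for taxon, seq in matrix.items()]
    return "\n".join(lines) + "\n"


# Toy loci, for illustration only.
loci = {
    "locus1": {"sample1": "ATCGAC", "sample2": "ATCGAT"},
    "locus2": {"sample1": "GGTA", "sample3": "GGTC"},
}
matrix, blocks = concatenate_loci(loci)
print(to_phylip(matrix))
print(blocks)
```

The block coordinates returned alongside the matrix are what the partitioning step needs later on.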
Branchlengths Unlinked: Each subset has its own independent set of branch lengths.
Branchlengths Linked: Only one underlying set of branch lengths is estimated. Each subset has its own scaling parameter.
Data blocks
  DNA_selected_aligned.fas
  concat_loci.pl -------- DNA_selected_aligned.nex
  extract_DNAblocks.pl ------ DNA_blocks.txt
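The exact layout of DNA_blocks.txt is not shown here. Assuming it lists one block per locus in the `name = start-end;` form that PartitionFinder 2 reads under `[data_blocks]` in its configuration file, generating those lines from locus lengths could be sketched as follows. The locus names and lengths are made up.

```python
def data_block_lines(locus_lengths):
    """Build PartitionFinder-style data block lines from (name, length) pairs.

    Coordinates are 1-based and inclusive, in the order the loci were
    concatenated.
    """
    lines = []
    start = 1
    for name, length in locus_lengths:
        if length <= 0:
            raise ValueError(f"{name} has a non-positive length")
        end = start + length - 1
        lines.append(f"{name} = {start}-{end};")
        start = end + 1
    return lines


# Hypothetical loci, for illustration only.
for line in data_block_lines([("locus1", 450), ("locus2", 300), ("locus3", 612)]):
    print(line)
```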
Usage
  python "<PartitionFinder.py>" "<InputFolder.cfg>" --raxml --rcluster-max 100
--rcluster-max Setting --rcluster-max to 1000 (i.e. the default value) is usually sufficient to ensure that PF2 will estimate a robust partitioning scheme, even on very large datasets in which there may be millions of possible pairs of data blocks. Setting it higher will tend to make the search more thorough and slower.
Example
Tree (in biology) Dendrogram representing the evolutionary relationship between species.
Terminology: branches, internal nodes, root, terminal nodes, phylogram or phylogenetic tree
Molecular phylogenetic methods can be divided into two categories
  Character-based (discrete) methods. Examples: maximum parsimony, maximum likelihood, Bayesian methods, etc.
  Distance-based methods. Example: neighbor-joining (NJ), etc.
Maximum likelihood methods (ML) A fully statistical approach to phylogenetic tree reconstruction that takes into account the probability of each nucleotide substitution in each set of comparisons. The tree with the highest likelihood is taken as the best estimate of the true phylogeny. An important advantage of probabilistic methods over parsimony is that they are statistically consistent, provided the substitution model is adequate.
Basic evolutionary models
  Topology and branch lengths (example tree with Taxa 1 to Taxa 5)
  Substitution rate matrix: rTC (= rCT), rTA (= rAT), rTG (= rGT), rCA (= rAC), rCG (= rGC), rAG (= rGA)
  Stationary base frequencies: fT, fC, fA, fG
More formally, given some data D and a hypothesis H, the likelihood of those data is L_D = Pr(D | H), which is the probability of D given H.
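To make the definition concrete, the sketch below computes the log-likelihood of a pairwise alignment under the Jukes-Cantor (JC69) model, where the hypothesis H is the distance d (expected substitutions per site) between the two sequences. JC69 is used here only because it is the simplest substitution model; it is an assumption of this example, not the model used elsewhere in this workshop, and the sequences are toy data.

```python
import math


def jc69_log_likelihood(seq1, seq2, d):
    """Log-likelihood of two aligned sequences at distance d > 0 under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    log_l = 0.0
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            continue  # ignore gapped sites
        # 0.25 is the stationary frequency of the base observed in seq1.
        log_l += math.log(0.25 * (p_same if a == b else p_diff))
    return log_l


def jc69_distance(seq1, seq2):
    """Closed-form maximum likelihood estimate of d under JC69."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    p = sum(a != b for a, b in pairs) / len(pairs)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy data, for illustration only.
s1 = "ATCGACTGATCGACTGATCGACTGATCGACTG"
s2 = "ATCGACTGATCGACTTATCGACTGATCGACTG"
d_hat = jc69_distance(s1, s2)
for d in (0.01, d_hat, 0.1):
    print(f"d = {d:.4f}  lnL = {jc69_log_likelihood(s1, s2, d):.4f}")
```

The middle value of d is the maximum likelihood estimate, so its log-likelihood is the highest of the three. A tree search does the same thing on a larger scale: it compares hypotheses (topologies and branch lengths) by the likelihood they give to the alignment.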
What kind of trees? Gene trees, species trees, time trees, etc.
Gene trees "Gene" trees represent the evolutionary history of the genes included in the study. Gene trees can provide evidence for gene duplication events, as well as speciation events. Sequences from different homologs can be included in a gene tree; the subsequent analyses should cluster orthologs, thus demonstrating the evolutionary history of the orthologs.
Concatenated tree A concatenated tree is built from an alignment that joins independent genes end to end, so it can reflect the evolutionary history of the species more faithfully than a single gene tree. We use RAxML to reconstruct the concatenated tree.
RAxML (Randomized Axelerated Maximum Likelihood) It is a popular program for phylogenetic analysis of large datasets under maximum likelihood. Its major strength is a fast maximum likelihood tree search algorithm that returns trees with good likelihood scores.
Usage
  DNA_selected_aligned.fas
  concat_loci.pl -------- DNA_selected_aligned.phy
  raxmlHPC-PTHREADS -T 12 -p 12345 -m GTRGAMMA -s ***.phy -n raxml -y -f a -x 12345 -# 100 -q partation.txt
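  -T number of threads
  -p random seed for the parsimony starting trees
  -m substitution model
  -s sequence file name
  -n output file name (suffix added to the output files)
  -f a rapid bootstrap analysis and search for the best-scoring ML tree in one program run
  -x integer random seed; turns on rapid bootstrapping
  -# number of alternative runs on distinct starting trees; combined with -f a and -x it sets the number of rapid bootstrap replicates
  -q partition file that assigns models to alignment partitions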
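The -q option expects a partition file. As a sketch, assuming the subsets are available as names with coordinate ranges (for example copied from the PartitionFinder result), lines in RAxML's `DNA, name = start-end` form can be produced like this. The subset names and coordinates are hypothetical.

```python
def raxml_partition_lines(subsets):
    """Format subsets as RAxML partition-file lines.

    subsets: list of (name, ranges), where ranges is a list of
    (start, end) tuples with 1-based inclusive coordinates.
    """
    lines = []
    for name, ranges in subsets:
        spans = ", ".join(f"{start}-{end}" for start, end in ranges)
        lines.append(f"DNA, {name} = {spans}")
    return lines


# Hypothetical subsets, for illustration only.
subsets = [
    ("subset1", [(1, 450), (751, 1362)]),
    ("subset2", [(451, 750)]),
]
print("\n".join(raxml_partition_lines(subsets)))
```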
Example
Species trees "Species" trees recover the genealogy of taxa, individuals of a population, etc. Internal nodes represent speciation or other taxonomic events. Species trees should contain sequences from only orthologous genes.
ASTRAL ASTRAL is a tool for estimating an unrooted species tree given a set of unrooted gene trees. ASTRAL is statistically consistent under the multi-species coalescent model. ASTRAL finds the species tree that has the maximum number of shared induced quartet trees with the set of gene trees, subject to the constraint that the set of bipartitions in the species tree comes from a predefined set of bipartitions.
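The quartet idea is easiest to see with four taxa, where only three unrooted topologies exist. The sketch below is not ASTRAL's algorithm, which handles many taxa and restricts the search to a set of bipartitions; it only counts how often each quartet topology appears among the gene trees and picks the most frequent one. Gene trees are given directly as splits, and all names are hypothetical.

```python
from collections import Counter


def quartet(pair_a, pair_b):
    """Represent the unrooted topology ab|cd as an unordered pair of unordered pairs."""
    return frozenset([frozenset(pair_a), frozenset(pair_b)])


def best_quartet(gene_tree_quartets):
    """Return the most frequent quartet topology and the number of gene trees showing it."""
    counts = Counter(gene_tree_quartets)
    return counts.most_common(1)[0]


# Five hypothetical gene trees on taxa A, B, C, D.
gene_trees = [
    quartet("AB", "CD"),
    quartet("BA", "DC"),   # same topology, written in a different order
    quartet("AC", "BD"),
    quartet("AB", "CD"),
    quartet("AD", "BC"),
]
topology, support = best_quartet(gene_trees)
print(sorted("".join(sorted(side)) for side in topology), support)
```

With more than four taxa, the same count is taken over every set of four taxa, and the species tree that agrees with the largest total is preferred.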
Usage
  DNA_selected_aligned.fas
  construct_tree.pl --------- DNA_aligned_selected.tre
  Catree.pl ------- catree.tre
  java -jar positation/astral.jar -i positation/catree.tre -a positation/speciesname.txt -t 2 -o ***.tre
  -jar the ASTRAL jar file (the version you installed)
  -i input file (gene trees)
  -a species name mapping file
  -o output file
  -t 2 full annotation
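The file passed with -a tells ASTRAL which individuals (tip names in the gene trees) belong to which species. The ASTRAL documentation describes a one-line-per-species form, `species:individual1,individual2`. A sketch that builds such lines for speciesname.txt follows; the species and individual names are made up, and the function name is hypothetical.

```python
def astral_mapping_lines(species_to_individuals):
    """Build mapping-file lines in the species:ind1,ind2 form."""
    seen = set()
    lines = []
    for species, individuals in species_to_individuals.items():
        if not individuals:
            raise ValueError(f"{species} has no individuals")
        for individual in individuals:
            if individual in seen:
                raise ValueError(f"{individual} is assigned to more than one species")
            seen.add(individual)
        lines.append(f"{species}:{','.join(individuals)}")
    return lines


# Hypothetical names, for illustration only.
mapping = {"speciesA": ["A_01", "A_02"], "speciesB": ["B_01"]}
print("\n".join(astral_mapping_lines(mapping)))
```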
Example
SNP-analysis: Structure, PCA, BFD, SNAPP
Structure Structure is a package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly used genetic markers, including SNPs, microsatellites, RFLPs and AFLPs.
Usage
  File -> new project -> name the project + select directory + choose data file -> number of individuals (**) + ploidy of data (2) + number of loci (***) + missing data value (-9) -> skip -> choose individual ID for each individual -> Proceed
  Parameter set -> new -> length of burnin period (***) + number of MCMC reps after burnin (***) -> name the parameter set
  Project -> start a job -> set K from m to n (e.g. 2-4) + number of iterations (3 or 5) -> start
  Compress the result and upload to Structure Harvester.
  File -> load structure results -> Browse (choose your result with the best value of K) -> Bar Plot -> show -> sort by Q -> save
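One statistic Structure Harvester reports to help choose K is the delta K of Evanno et al. (2005). A minimal sketch of that calculation follows, computed here from the mean Ln P(D) of the replicate runs at each K and the standard deviation at K; this is an assumption about the details, and the tool's own output should be preferred. The Ln P(D) values are invented for illustration.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Delta K from replicate Ln P(D) values.

    lnp_by_k: dict mapping K -> list of Ln P(D), one per replicate run.
    Returns {K: delta K} for every K that has both neighbours K-1 and K+1.
    Needs at least two replicates per K with non-identical values.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / stdev(lnp_by_k[k])
    return result


# Invented replicate results, three runs per K.
runs = {
    1: [-5200.1, -5200.4, -5199.8],
    2: [-4800.3, -4801.0, -4799.5],
    3: [-4500.2, -4500.9, -4499.9],
    4: [-4495.0, -4510.0, -4480.0],
}
scores = delta_k(runs)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Because delta K needs both neighbours, the tested range should extend at least one step on either side of the K values of interest.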
Example
PCA (Principal component analysis) PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
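The workshop runs PCA in RStudio. Purely to show what the transformation does, here is a dependency-free sketch that finds the first principal component of a small genotype matrix by power iteration on the covariance matrix. The genotypes (coded 0, 1, 2) are toy data, and power iteration is a stand-in chosen for brevity; an R package would compute all components with a full decomposition.

```python
def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) of the first principal component.

    rows: list of samples, each a list of numeric values (one per SNP).
    """
    n, m = len(rows), len(rows[0])
    col_means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - col_means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix (m x m).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    vec = [1.0 + 0.1 * j for j in range(m)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0:
            raise ValueError("data have no variance")
        vec = [x / norm for x in nxt]
    # Project every centered sample onto the component.
    scores = [sum(c[j] * vec[j] for j in range(m)) for c in centered]
    return vec, scores


# Toy genotypes: three individuals from each of two hypothetical populations.
genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 0, 1],
]
loadings, scores = first_principal_component(genotypes)
for individual, score in enumerate(scores, start=1):
    print(individual, round(score, 3))
```

The first three individuals fall on one side of zero and the last three on the other, which is the kind of separation a PCA plot of SNP data is read for. The sign of a component is arbitrary, so only the grouping matters.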
Example Working in RStudio
BFD (Bayes factor delimitation) BFD estimates the marginal likelihood of each competing species delimitation model, ranks the models by marginal likelihood, and uses Bayes factors to assess support for the model rankings.
Usage
  Open BEAUti
  Set parameters, save as xx.xml file
  Edit the xx.xml file
  Run the file with BEAST or from the command line
MCMC Robot http://marple.eeb.uconn.edu/mcmcrobot/?page_id=24
The BF scale is as follows: 0 < BF < 2 is not worth more than a bare mention, 2 < BF < 6 is positive evidence, 6 < BF < 10 is strong support, and BF > 10 is decisive.
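The scale above matches the convention in which the Bayes factor is reported as twice the difference in log marginal likelihood between two models, BF = 2 × (MLE1 − MLE0). Assuming that convention, the calculation and the lookup can be sketched as follows. The marginal likelihood values are hypothetical, and how to treat values exactly on a boundary (2, 6, 10) is a choice made here, not something the scale specifies.

```python
def bayes_factor(log_ml_model1, log_ml_model0):
    """2 * (ln marginal likelihood of model 1 - ln marginal likelihood of model 0).

    A positive value favours model 1, a negative value favours model 0.
    """
    return 2.0 * (log_ml_model1 - log_ml_model0)


def interpret(bf):
    """Map the size of a Bayes factor onto the scale quoted above."""
    size = abs(bf)
    # Boundary values are assigned to the higher category (an assumption).
    if size < 2:
        return "not worth more than a bare mention"
    if size < 6:
        return "positive evidence"
    if size < 10:
        return "strong support"
    return "decisive"


# Hypothetical log marginal likelihoods of two delimitation models.
bf = bayes_factor(-1243.1, -1250.4)
print(round(bf, 1), interpret(bf))
```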
SNAPP SNAPP (SNP and AFLP Package for Phylogenetic analysis) is a package for inferring species trees and species demographics from independent (unlinked) biallelic markers such as well-spaced SNPs. It implements a full coalescent model, but uses a novel algorithm to integrate over all possible gene trees, rather than sampling them explicitly.
Usage
  Open BEAUti
  Set parameters, save file
  Run the file with BEAST
  Check parameters using Tracer
Thanks for your attention!