Workshop on data analyses

Workshop on data analyses
Guoxing Yin

System and software we should prepare
macOS or Linux system, python, java, R studio package (ade4, ggplot2) PartitionFinder 2 RAxML ASTRAL Structure s/v2.3.4/html/structure.html PCA BFD SNAPP

Backgrounds When we finish data-analysis on assemble pipeline, we can get DNA sequence data. Then we can make a reference using DNA sequence data, mapping reads using gatk.sh, we can get SNP data.

SNP (single-nucleotide polymorphism)
SNP is a variation in a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population.

Call-SNPs Consensus.pl-------make a reference sequence
Gatk.sh call-SNPs automatic--- **.vcf Vcftosnps.pl----- **_beast.nex & **_structure.txt & **_rout.txt

Gatk https://software.broadinstitute.org/gatk

Mark duplicates ATCGACTGATCGACTGATCGACTGATCGACTG--------reference
ATCGACTGATCGACTTATCGACTGATCGACTG sample1 ATCGACTGATCGACTGATCGACTGATCGACTG sample2 ATCGACTGATCGACTGATCGACTGATCGACTG sample3 ATCGACTGATCGACTGATCGACTGATCGACTG sample4

Indel realignment ATCGACTGATCGACTGATCGACTGATCGACTG--------reference
ATCGACTGATCGACTGATCGACTGATCGACG sample1 ATCGACTGATCGACTGATCGACTGATCGACG sample2 ATCGACTGATCGACTGATCGACTGATCGACG sample3 ATCGACTGATCGACTGATCGACTGATCGACTG sample4 ATCGACTGATCGACTGATCGACTGATCGACTG sample5 ATCGACTGATCGACTGATCGACTGATCGACTG sample6

Base recalibration ATCGACTGATCGACTGATCGACTGATCGACTG--------reference
ATCGACTGATCGACTTATCGACTGATCGACTG sample1 ATCGACTGATCGACTGATCGACTGATCGACTG sample2 ATCGACTGATCGACTGATCGACTGATCGACTG sample3 ATCGACTGATCGACTGATCGACTGATCGACTG sample4 ATCGACTGATCGACTGATCGACTGATCGACTG sample5 ATCGACTGATCGACTGATCGACTGATCGACTG sample6

Preparation for data-analysis
DNA_aligned.fas Pick_taxa.pl DNA_selected.fas mafft_aln.pl DNA_selected_aligned.fas

ATCGACTGATCGACTGATCGACTGATCGACTG sample1
ATCGA--GATC-----GATCGACTGATCGACTG sample2 ATCGACTGATCGACTGATCGACTGATCGACTG sample3 ATCGACTGATC-----GATCGAC------GACTG sample4 ATCGACTGATCGACTGATCGACTGATCGACTG sample5 ATCGACTGATC-----GATCGACTGATCG sample6

PartitionFinder2 PartitionFinder2 is a program for selecting best-fit partitioning schemes and models of evolution for nucleotide, amino acid, and morphology alignments.

alignment DNA_selected_aligned.fas
concat_loci.pl DNA_selested_aligned.phy

Branchlengths Unlinked: Each subset has its own independent set of branch lengths.

Branchlengths Linked: Only one underlying set of branch lengths is estimated. Each subset has its own scaling parameter .

Data blocks DNA_selected_aligned.fas
concat_loci.pl DNA_selested_aligned.nex extract_DNAblocks.pl------DNA_blocks.txt

Usage python “<PartitionFinder.py>” “<InputFolder.cfg>” --raxml -- rcluster-max 100

--rcluster-max Setting --rcluster-max to 1000 (i.e. the default values) is usually sufficient to ensure that PF2 will estimate a robust partitioning scheme, even on very large datasets in which there may be millions of possible pairs of data blocks. Setting either of them higher will tend to make the search more thorough and slower.

Example

Tree (in biology) Dendrogram representing the evolutionary relationship between species.

Terminology Branches Internal nodes Root Terminal nodes
Phylogram or phylogenetic tree

Based on the molecular level of phylogenetic methods can be divided into two categories
Discrete feature based approach Example：Maximum parsimony methods, Maximum likelihood methods, Bayesian methods, etc . Based on distance methods Example :N-J methods, etc.

Maximum likelihood methods(ML)
A completely statistical-based phylogenetic tree reconstruction approach that takes into account the probability of each nucleotide substitution in each set of comparisons. The tree with the largest sum of probabilities is most likely the most real phylogenetic tree. The important advantage of probabilistic methods over parsimony is statistically consistent.

Basic evolutionary models
Topology and branch length Substitution matrix rTC (= rCT), rTA (= rAT), rTG (= rGT) rCA (= rAC), rCG (= rGC) rAG (= rGA) Stationary base frequencies fT, fC, fA, fG, Taxa 1 Taxa 2 Taxa 3 Taxa 4 Taxa 5

More formally if given some data D and a hypothesis H the likelihood of those data is given by LD = Pr (D | H) Which is the probability of D given H.

What kind of trees Gene trees Species trees Time trees Etc.

Gene trees "Gene" trees represent the evolutionary history of the genes included in the study. Gene trees can provide evidence for gene duplication events, as well as speciation events. Sequences from different homologs can be included in a gene tree; the subsequent analyses should cluster orthologs, thus demonstrating the evolutionary history of the orthologs.

Concatenated tree Concatenated tree use concatenate independence gene, so the tree are more truly reflect the evolutionary history of species. We use RAxML to reconstruct concatenated tree.

RAxML (Randomized Axelerated Maximum Likelihood)
It is a popular program for phylogenetic analysis of large datasets under maximum likelihood. Its major strength is a fast maximum likelihood tree search algorithm that returns trees with good likelihood scores.

Usage DNA_selected_aligned.fas
concat_loci.pl DNA_selested_aligned.phy raxmlHPC-PTHREADS –T 12 –p –m GTRGAMMA –s ***.phy –n raxml -y -f a –x # 100 –q partation.txt

-T number of nodes -p parsimony Random Seed -m substitution Model -s sequence File Name -n output File Name -f a rapid Bootstrap analysis and search for best-scoring ML tree in one program run –x Specify an integer number (random seed) and turn on rapid bootstrapping -# Specify the number of alternative runs on distinct starting trees

Example

Species trees "Species" trees recover the genealogy of taxa, individuals of a population, etc. Internal nodes represent speciation or other taxonomic events. Species trees should contain sequences from only orthologous genes.

ASTRAL ASTRAL is a tool for estimating an unrooted species tree given a set of unrooted gene trees. ASTRAL is statistically consistent under the multi-species coalescent model. ASTRAL finds the species tree that has the maximum number of shared induced quartet trees with the set of gene trees, subject to the constraint that the set of bipartitions in the species tree comes from a predefined set of bipartitions.

Usage DNA_selected_aligned.fas
construct_tree.pl DNA_aligned_seleced.tre Catree.pl catree.tre Java –jar positation/astral.jar –i positation/catree.tre –a positation/speciesname.txt –t 2 –o ***.tre

-jar version of astral -i input file -a speciesname -o output file -t 2 full annotation

Example

SNP-analysis Structure PCA BFD SNAPP

Structure structure is a package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly-used genetic markers, including SNPS, microsatellites, RFLPs and AFLPs.

Usage File -> new project -> name the project + select directory + choose data file -> number of individual (**) + ploidy of data (2) + number of loci (***) + missing data value (-9) -> skip -> choose individual ID for each individual -> Proceed Parameter set -> new -> length of burnin period (***) + number of mcmc reps after burnin (***) -> name parameterProject -> start a job -> set k from m to n (i.e. 2-4) + number of Iterations (3 or 5) -> start Compress the result and upload to Structure Harvester.File -> load structure results -> Browse (choose your result with the best value of K)-> Bar Plot -> show -> sort by Q -> save

Example

PCA (Principal component analysis)
PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

Example Working in R studio

BFD (Bays factor delimitation)
BFD is to estimate the marginal likelihood of each competing species delimitation model, rank models by marginal likelihood, and use Bayes factors to assess support for model rankings.

Usage Open Beatui Set parameters, save as xx.xml file Fix xx.xml file
Open file use beast or commend line

MCMC Robot

The BF scale is as follows: 0 < BF < 2 is not worth more than a bare mention, 2 < BF < 6 is positive evidence, 6 < BF < 10 is strong support, and BF > 10 is decisive.

SNAPP SNAPP (SNP and AFLP Package for Phylogenetic analysis) is package for inferring species trees and species demographics from independent (unlinked) biallelic markers such as well spaced SNPs. It implements a full coalescent model, but uses a novel algorithm to integrate over all possible gene trees, rather than sampling them explicitly.

Usage Open Beauti Set parameters, save file Open file use Beast
Check parameters using Trance

Thanks for your attention!

Workshop on data analyses

Similar presentations

Presentation on theme: "Workshop on data analyses"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Workshop on data analyses

Similar presentations

Presentation on theme: "Workshop on data analyses"— Presentation transcript:

Similar presentations

About project

Feedback