Bioinformatics trainings, Vietnam Hanoi, November, 2015

Slides:



Advertisements
Similar presentations
A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Exploiting SNP polymorphism data Formation Bio-informatique, 9 au 13 février 2015.
Advertisements

GBS & GWAS using the iPlant Discovery Environment
ChIP-seq analysis Ecole de bioinformatique AVIESAN – Roscoff, Jan 2013.
Genetic Basis of Agronomic Traits Connecting Phenotype to Genotype Yu and Buckler (2006); Zhu et al. (2008) Traditional F2 QTL MappingAssociation Mapping.
DNAseq analysis Bioinformatics Analysis Team
Association Modeling With iPlant
Peter Tsai, Bioinformatics Institute.  University of California, Santa Cruz (UCSC)  A rapid and reliable display of any requested portion of genomes.
Targeted Data Introduction  Many mapping, alignment and variant calling algorithms  Most of these have been developed for whole genome sequencing and.
Variant discovery Different approaches: With or without a reference? With a reference – Limiting factors are CPU time and memory required – Crossbow –
Computational Tools for Finding and Interpreting Genetic Variations Gabor T. Marth Department of Biology, Boston College
MSc GBE Course: Genes: from sequence to function Genome-wide Association Studies Sven Bergmann Department of Medical Genetics University of Lausanne Rue.
The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools Stephen Tetreault Department of Mathematics and Computer Science.
Something related to genetics? Dr. Lars Eijssen. Bioinformatics to understand studies in genomics – São Paulo – June Image:
Bioinformatics Tips NGS data processing and pipeline writing
Before we start: Align sequence reads to the reference genome
NGS Analysis Using Galaxy
Steve Newhouse 28 Jan  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &
Whole Exome Sequencing for Variant Discovery and Prioritisation
Considerations for Analyzing Targeted NGS Data Introduction Tim Hague, CTO.
Polymorphism and Variant Analysis Lab
PAGE: A Framework for Easy Parallelization of Genomic Applications 1 Mucahid Kutlu Gagan Agrawal Department of Computer Science and Engineering The Ohio.
RNAseq analyses -- methods
DAY 1. GENERAL ASPECTS FOR GENETIC MAP CONSTRUCTION SANGREA SHIM.
NGS data analysis CCM Seminar series Michael Liang:
Targeted next generation sequencing for population genomics and phylogenomics in Ambystomatid salamanders Eric M. O’Neill David W. Weisrock Photograph.
Next Generation DNA Sequencing
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.
© 2010 by The Samuel Roberts Noble Foundation, Inc. 1 The Samuel Roberts Noble Foundation, 2510 Sam Noble Parkway, Ardmore, OK, 73401, USA 2 National Center.
Alexis DereeperCIBA courses – Brasil 2011 Detection and analysis of SNP polymorphisms.
1 Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang The Ohio State University.
NextGen Pipeline: Enabling the Plant Science Community Tom Brutnell (lead), Steve Rounsley (co-lead), Matt Vaughn (Engagement Lead) Ed Buckler, Justin.
Finnish Genome Center Monday, 16 November Genotyping & Haplotyping.
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
Introduction to RNAseq
Tutorial 6 High Throughput Sequencing. HTS tools and analysis Review of resequencing pipeline Visualization - IGV Analysis platform – Galaxy Tuning up.
Association mapping for mendelian, and complex disorders January 16Bafna, BfB.
Investigate Variation of Chromatin Interactions in Human Tissues Hiren Karathia, PhD., Sridhar Hannenhalli, PhD., Michelle Girvan, PhD.
GenABEL: an R package for Genome Wide Association Analysis
Ke Lin 23 rd Feb, 2012 Structural Variation Detection Using NGS technology.
Current Data And Future Analysis Thomas Wieland, Thomas Schwarzmayr and Tim M Strom Helmholtz Zentrum München Institute of Human Genetics Geneva, 16/04/12.
Personalized genomics
Objectives Genome-wide investigation – to estimate alternate Poly-Adenylation (APA) usage on 3’UTR – to identify polymorphism of Downstream Sequence Elements.
Calling Somatic Mutations using VarScan
Introduction of the ChIP-seq pipeline Shigeki Nakagome November 16 th, 2015 Di Rienzo lab meeting.
GSVCaller – R-based computational framework for detection and annotation of short sequence variations in the human genome Vasily V. Grinev Associate Professor.
Analysis of Next Generation Sequence Data BIOST /06/2015.
Canadian Bioinformatics Workshops
A brief guide to sequencing Dr Gavin Band Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015 Africa Centre for Health.
Introduction to Variant Analysis of Exome- and Amplicon sequencing data Lecture by: Date: Training: Extended version see: Dr. Christian Rausch 29 May 2015.
Introduction to Exome Analysis in Galaxy Carol Bult, Ph.D. Professor Deputy Director, JAX Cancer Center Short Course Bioinformatics Workshops 2014 Disclaimer…I.
Canadian Bioinformatics Workshops
Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI.
1 Finding disease genes: A challenge for Medicine, Mathematics and Computer Science Andrew Collins, Professor of Genetic Epidemiology and Bioinformatics.
Introductory RNA-seq Transcriptome Profiling of the hy5 mutation in Arabidopsis thaliana.
Inheritance Model testing Andrew Stubbs Dept. Bioinformatics.
From Reads to Results Exome-seq analysis at CCBR
Canadian Bioinformatics Workshops
Risheng Chen et al BMC Genomics
Cancer Genomics Core Lab
Next Generation Sequencing Analysis
Introduction to RAD Acropora millepora.
Genome Wide Association Studies using SNP
EMC Galaxy Course November 24-25, 2014
Rod Eyles1, John Juma1, Morag Ferguson1, Trushar Shah1 1 IITA, Nairobi
Yonglan Zheng Galaxy Hands-on Demo Step-by-step Yonglan Zheng
BF528 - Genomic Variation and SNP Analysis
Canadian Bioinformatics Workshops
Mapping of srt1 by BSA-seq.
The Variant Call Format
Presentation transcript:

Bioinformatics trainings, Vietnam Hanoi, November, 2015 Explore SNP polymorphism data A. Dereeper, G. Sarah, F. Sabot Bioinformatics trainings, Vietnam Hanoi, November, 2015

Tablet Graphical tools to visualize assemblies Accept many formats ACE, SAM, BAM A. Dereeper, G. Sarah, F. Sabot A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

GATK (Genome Analysis ToolKit) Software package to analyse NGS data. Implemented to analyse human resequencing data, for medical purpose (1000 genomes, The Cancer Genome Atlas) Included: depth analyses, quality score recalibration, SNP/InDel detection Complementary with other packages: SamTools, PicardTools, VCFtools, BEDtools PREPROCESS: * Index human genome (Picard), we used HG18 from UCSC. * Convert Illumina reads to Fastq format * Convert Illumina 1.6 read quality scores to standard Sanger scores FOR EACH SAMPLE: 1. Align samples to genome (BWA), generates SAI files. 2. Convert SAI to SAM (BWA) 3. Convert SAM to BAM binary format (SAM Tools) 4. Sort BAM (SAM Tools) 5. Index BAM (SAM Tools) 6. Identify target regions for realignment (Genome Analysis Toolkit) 7. Realign BAM to get better Indel calling (Genome Analysis Toolkit) 8. Reindex the realigned BAM (SAM Tools) 9. Call Indels (Genome Analysis Toolkit) 10. Call SNPs (Genome Analysis Toolkit) 11. View aligned reads in BAM/BAI (Integrated Genome Viewer) A. Dereeper, G. Sarah, F. Sabot A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

Global BAM with read group Fastq (RC1) Fastq (RC2) Fastq (RC3) Fastq (RC4) Cutadapt Cutadapt Cutadapt Cutadapt …. Mapping BWA Mapping BWA Mapping BWA Mapping BWA Add or Replace Groups Add or Replace Groups Add or Replace Groups Add or Replace Groups BAM with read group BAM with read group BAM with read group BAM with read group mergeSam Global BAM with read group VCF file 4 4

For GBS data Tassel pipeline Version 5 TASSEL-GBS Plos One, 2014 A. Dereeper, G. Sarah, F. Sabot A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

GBS RAD-Seq RNA-Seq WGRS Reads pre-processing and mapping Galaxy + workflow SNP Calling and genotype assignation Tassel Genotyping data Storage and mining Genotyping data analyses and visualization (GWAS, diversity…)

Format VCF (Variant Call Format) Advantages: Variation description for each position + genotype assignations Indexed flat files. Binary files also exist: BCF format A. Dereeper, G. Sarah, F. Sabot A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

Autres fonctionalités GATK Other GATK functionalities Module DepthOfCoverage: Allows to get sequencing depth for each gene, each position and each individual Module ReadBackedPhasing: Allows to set, if possible, associations between alleles (phase and haplotypes) when we are in an heterozygote situation. Et non AGG GGA A. Dereeper, G. Sarah, F. Sabot A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

Another format for variant calling (generated by samtools) Format Pileup Another format for variant calling (generated by samtools) Describe alignment row by row (not line by line like in SAM format) Used by VarScan like softwares (varscan pileup2snp) Frequently used for rare variants, with a low frequency (e.g. viral pop) A. Dereeper, G. Sarah, F. Sabot A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

- Based on NoSQL technology Projet Gigwa, pour la gestion des données massives de variants (GBS, RADSeq, WGRS) « With NGS arise serious computational challenges in terms of storage, search, sharing, analysis, and data visualization, that redefine some practices in data management. » - Based on NoSQL technology - Handles VCF files (Variant Call Format) and annotations - Supports multiple variant types: SNPs, InDels, SSRs, SV - Powerful genotyping queries - Easily scalable with MongoDB sharding - Transparent access - Takes phasing information into account when importing/exporting in VCF format Distance: Comparaison de séquences 2 à 2 et à chaque paire on associe une distance de divergence (calcul utilisant un modèle d’évolution donné) => Matrice de distance. On choisit l’arbre avec la plus courte somme de distances. Parcimonie: Sélectionne l’arbre nécessitant le minimum de changements évolutifs (d’évènements de substitution) Bonne approximation dans les régions où les séquences sont proches Likelihood: Méthode probabiliste basé sur une heuristique. Pour une topologie et longueurs de branches données, on estime pour chaque site la probabilité que le pattern de nucléotides observés ait évolué le long de cet arbre. On calcule la vraisemblance de tous les arbres possibles et on retient celui ayant la plus haute vraisemblance… Inférence Bayesienne: basé sur les Chaines de Markov Monte Carlo (MCMC) pour échantilloner des arbres à partir de distribution de topologies d’arbres. A. Dereeper, G. Sarah, F. Sabot A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

http://gigwa.southgreen.fr/gigwa/ Distance: Comparaison de séquences 2 à 2 et à chaque paire on associe une distance de divergence (calcul utilisant un modèle d’évolution donné) => Matrice de distance. On choisit l’arbre avec la plus courte somme de distances. Parcimonie: Sélectionne l’arbre nécessitant le minimum de changements évolutifs (d’évènements de substitution) Bonne approximation dans les régions où les séquences sont proches Likelihood: Méthode probabiliste basé sur une heuristique. Pour une topologie et longueurs de branches données, on estime pour chaque site la probabilité que le pattern de nucléotides observés ait évolué le long de cet arbre. On calcule la vraisemblance de tous les arbres possibles et on retient celui ayant la plus haute vraisemblance… Inférence Bayesienne: basé sur les Chaines de Markov Monte Carlo (MCMC) pour échantilloner des arbres à partir de distribution de topologies d’arbres. http://gigwa.southgreen.fr/gigwa/ A. Dereeper, G. Sarah, F. Sabot, Y. Hueber A. Dereeper, G. Sarah, F. Sabot Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

SNiPlay: Web application for polymorphism analyses http://sniplay.southgreen.fr Distance: Comparaison de séquences 2 à 2 et à chaque paire on associe une distance de divergence (calcul utilisant un modèle d’évolution donné) => Matrice de distance. On choisit l’arbre avec la plus courte somme de distances. Parcimonie: Sélectionne l’arbre nécessitant le minimum de changements évolutifs (d’évènements de substitution) Bonne approximation dans les régions où les séquences sont proches Likelihood: Méthode probabiliste basé sur une heuristique. Pour une topologie et longueurs de branches données, on estime pour chaque site la probabilité que le pattern de nucléotides observés ait évolué le long de cet arbre. On calcule la vraisemblance de tous les arbres possibles et on retient celui ayant la plus haute vraisemblance… Inférence Bayesienne: basé sur les Chaines de Markov Monte Carlo (MCMC) pour échantilloner des arbres à partir de distribution de topologies d’arbres. A. Dereeper, G. Sarah, F. Sabot A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

IFB project “Galaxy4Sniplay” (WP4 IFB, Plant node) A. Dereeper, G. Sarah, F. Sabot, Y. Hueber A. Dereeper, G. Sarah, F. Sabot Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

Available using Galaxy Toolshed Installable on any Galaxy instance A. Dereeper, G. Sarah, F. Sabot, Y. Hueber A. Dereeper, G. Sarah, F. Sabot Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

Bioinformatics trainings, Vietnam Hanoi, November, 2015 Upload a VCF file in SNiPlay http://sniplay.southgreen.fr Upload a VCF file (+ reference if not available in genome collection) Select rice genome The reference corresponce to mRNA A. Dereeper, G. Sarah, F. Sabot Bioinformatics trainings, Vietnam Hanoi, November, 2015 15 15

Filters using VCFtools or Gigwa Maf Missing data Annotation Position… A. Dereeper, G. Sarah, F. Sabot, Y. Hueber A. Dereeper, G. Sarah, F. Sabot Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

SNP annotation using SnpEff A. Dereeper, G. Sarah, F. Sabot, Y. Hueber A. Dereeper, G. Sarah, F. Sabot Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

sNMF Test different values of K (estimates the probability (likelihood tests) that samples are structured in K populations) For the best value of K, the application shows Q estimates for each individual (admixture percent) (probability that the individual belongs to each population)

MDS (Multi-Dimensional Scale) plot SNP-based Distance tree with FastME

Used to measure the degree of polymorphism within a population Diversity analysis Pi: Nucleotide diversity: Average number of nucleotide differences per site between any two DNA sequences chosen randomly from the sample population Used to measure the degree of polymorphism within a population Comparison between individuals A. Dereeper, G. Sarah, F. Sabot A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

=> Can allows the detection of introgression Introgression = Movement of a exogene region (gene flow) from one species into the gene pool of another by the repeated backcrossing of an interspecific hybrid with one of its parent species Widely used in agronomy obtained but can occurs naturally A. Dereeper, G. Sarah, F. Sabot, Y. Hueber A. Dereeper, G. Sarah, F. Sabot Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

Haplotypes Haplotype reconstruction using Gevalt High frequency haplotypes Low frequency haplotype Group distribution whithin this haplotype Distance between 2 haplotypes (nb of mutations) Haplotypes Haplotype reconstruction using Gevalt Network with Haplophyle Available only for regions presenting few variants (short regions, genes) Exploit phased VCF (in progress…) A. Dereeper, G. Sarah, F. Sabot A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

Design de puces Illumina Illumina ship design Submission file for Illumina Fichier de soumission pour Illumina Genotypage file Analyse with BeadStudio software Cartesian coordinates A. Dereeper, G. Sarah, F. Sabot A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

GWAS (Genome-Wide Association Studies) Estimate association between a marker and a phenotypic character Manhattan plots: displays GWAS statistical tests (-log10 pvalue) along chromosomes TASSEL, MLMM sofwares False positives because of the studied structuration panel => correction using structure population et and kinship A. Dereeper, G. Sarah, F. Sabot, Y. Hueber A. Dereeper, G. Sarah, F. Sabot Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

TD: Study of root charaters using GWAS in Oryza sativa japonica TD: Study of root charaters using GWAS in Oryza sativa japonica. Influence of a correction using structure and kinship A. Dereeper, G. Sarah, F. Sabot A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

Analyse de structure de populations Population structure analysis Test different values of K (estimates of probability that samples are structured in K populations) For the best value of K, the application shows Q estimates for each individual (admixture percent) A. Dereeper, G. Sarah, F. Sabot, Y. Hueber A. Dereeper, G. Sarah, F. Sabot Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015

Relatedness between individuals (kinship matrix) TASSEL and plink softwares Estimation of relatedness between individuals using a distance matrix A. Dereeper, G. Sarah, F. Sabot, Y. Hueber A. Dereeper, G. Sarah, F. Sabot Formation Bio-informatique, 9 au 13 février 2015 Bioinformatics trainings, Vietnam Hanoi, November, 2015