Resequencing Genome Timothee Cezard EBI NGS workshop 16/10/2012.

NGS Course – Data Flow (overview)
[Figure: course data-flow diagram. Recoverable labels are listed below; the diagram layout linking speakers to topics is not preserved.]
Topics:
- DNA sequencing
- RNA sequencing
- Sequence archives: ENA/SRA submission and retrieval
- Gene regulation: ChIP-seq analysis
- Gene annotation: RNA-Seq, Ensembl gene build
- Gene expression: RNA-Seq transcriptome analysis
- Data compression
- Resequencing & assembly
- Genome variation & disease
Speakers: Elizabeth Murchison, Jon Teague, Adam Butler, Simon Forbes, Guy Cochrane, Ensembl/John Collins, Myrto Kostadima, Remco Loos, Rajesh Radhakrishnan, Rasko Leinonen, Arnaud Oisel, Marc Rossello, Vadim Zalunin, Timothee Cezard, Laura Clarke, Karim Gharbi
Slides and tutorials are available at:

Overview
DNA (Re)sequencing
- Sequencing technologies
- Sequencing output
- Quality control
Mapping
- Mapping programs
- Sam/Bam format
- Mapping improvements
Variant calling
- Types of variants
- SNPs/indels
- VCF format

Resequencing genomes
[Figure: workflow diagram. Recoverable labels: DNA extraction, library prep.]

Sequencing data
Example reads: GATGGGAAGA GCGGTTCAGC AGGAATGCCG AGACCGATAT CGTATGCCGT
Sequence data:
- Precise
- Fairly unbiased
- Easy to QC
Coverage depth data:
- Can be biased
- Hard to know what's true

Sequencer-specific errors
- Homopolymer runs create false indels
- Specific sequence patterns can create phasing issues

Sequencing output (Fastq format)
Example fastq (the read identifier line is not preserved in this transcript):
GATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDADACBCCCDADBDDCBCD;BBDBDBBBB%%%%%
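As a rough illustration of how a Fastq record is read, the sketch below parses four-line records and decodes the quality string, assuming the Phred+33 encoding (the `%` characters in the example only make sense with an offset of 33). The function names and the read name `read1` are hypothetical; the identifier line of the slide's example was not preserved.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from simple four-line Fastq records."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated Fastq record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed Fastq record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred33_scores(quality: str) -> List[int]:
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(score: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Hypothetical record: the name is made up, the bases come from the slide.
example = ["@read1\n", "GATGGGAAGA\n", "+\n", "CCCCCDD;%%\n"]
for name, seq, qual in parse_fastq(example):
    print(name, seq, phred33_scores(qual))
```

With this encoding, `C` decodes to 34, `;` to 26 and `%` to 4, so the run of `%` at the end of the example read marks very low-confidence base calls.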

Quality control
Questions you should ask (yourself or your sequencing provider):
Sequencing QC:
- How much sequencing?
- What's the sequencing quality?
Library QC:
- What's the base profile across the reads?
- Is there an unexpected GC bias?
- Are there any library preparation contaminants?
Post-mapping QC:
- What is the fragment length distribution? (for paired-end)
- Is there an unexpected duplicate rate?
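Two of the library QC questions (base profile across the reads, GC bias) can be approximated with a few lines of code. This is only a sketch with hypothetical function names, assuming equal-length reads; dedicated tools such as FastQC report these metrics in far more detail.

```python
from collections import Counter
from typing import List, Sequence


def gc_fraction(seq: str) -> float:
    """Fraction of G/C among the unambiguous (A, C, G, T) bases of a read."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / called if called else 0.0


def base_profile(reads: Sequence[str]) -> List[Counter]:
    """Count the bases seen at each read position (reads of equal length)."""
    return [Counter(column) for column in zip(*reads)]


# The five 10-base reads shown on the "Sequencing data" slide.
reads = ["GATGGGAAGA", "GCGGTTCAGC", "AGGAATGCCG", "AGACCGATAT", "CGTATGCCGT"]
print([gc_fraction(r) for r in reads])
print(base_profile(reads)[0])  # base counts at the first read position
```

A strongly skewed base profile at particular positions, or a GC distribution far from what the genome predicts, is a hint to look for adapters or contaminants.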

Example with FastQC
[Figure: FastQC output; images not preserved in this transcript.]

Mapping reads to a reference genome
Problems:
- How to find the best match of a short sequence on a large genome (high sensitivity)
- How to avoid reporting a match when there is no true match
- How to do this for 100,000,000,000 reads in a reasonable amount of time
Solutions:
- Hashing-based algorithms: BLAST, Eland, MAQ, SHRiMP, GSNAP, Stampy. More sensitive in the presence of SNPs/indels.
- Suffix trie + Burrows-Wheeler Transform algorithms: Bowtie, SOAP, BWA. Faster.
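The hashing idea can be sketched as a k-mer index of the reference plus a vote over candidate start positions. This is a toy illustration, not the algorithm of any of the mappers listed above; the function names, the toy reference and the choice of k = 4 are all assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer of the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_positions(read: str, index: Dict[str, List[int]], k: int) -> List[Tuple[int, int]]:
    """Return (reference start, number of supporting k-mers), best first."""
    votes: Counter = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            # A seed hit at pos implies the read starts at pos - offset.
            votes[pos - offset] += 1
    return votes.most_common()


# Toy reference (hypothetical), with one exact read and one carrying a mismatch.
reference = "AGGTCCATGGACCTGCTTATGTGATC"
index = build_kmer_index(reference, k=4)
print(candidate_positions("CCATGGAC", index, k=4))  # exact match
print(candidate_positions("CCACGGAC", index, k=4))  # one mismatch, fewer seeds
```

The read with a mismatch still votes for the correct position because at least one k-mer is untouched by the difference, which is why seed-based approaches tolerate SNPs at the cost of a larger index and more candidate checks.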

Different software for different applications
[Figure: mappers placed according to application; the placement of each tool is not preserved.]
Applications: transcriptome to genome; very fast mapping; mapping to distant reference
Mappers: GSNAP, TopHat, Stampy, SHRiMP, Bowtie, BWA, SMALT, SplitSeek, mrFAST, mrsFAST, SSAHA2, CLC bio, Partek, Genomatix, BWA-SW
Data flow: Fastq → Mapper → Sam/Bam

SAM/BAM format
SAM: Sequence Alignment/Map format v1.4, The SAM Format Specification Working Group (Sept 2011)
- Standardized format for alignments
- BAM: binary equivalent of SAM
- BAM can be indexed for fast record retrieval
- Manipulate SAM/BAM files using samtools and other tools
Two parts:
- Header: contains metadata about the sample
- Alignment:

SAM/BAM format
COLUMNS:
Col  Field  Type    Description
1    QNAME  String  Query template NAME
2    FLAG   Int     bitwise FLAG
3    RNAME  String  Reference sequence NAME
4    POS    Int     1-based leftmost mapping POSition
5    MAPQ   Int     MAPping Quality
6    CIGAR  String  CIGAR string
7    RNEXT  String  Ref. name of the mate/next fragment
8    PNEXT  Int     Position of the mate/next fragment
9    TLEN   Int     observed Template LENgth
10   SEQ    String  fragment SEQuence
11   QUAL   String  ASCII of Phred-scaled base QUALity+33
Example record (partially preserved in this transcript): RNAME ref, POS 37, MAPQ 30, CIGAR 9M, RNEXT =, PNEXT 7, TLEN -39, SEQ CAGCGCAT, followed by optional TAG fields

Bitwise flag
Bit    Integer  Description
0x1    1        template having multiple segments in sequencing
0x2    2        each segment properly aligned according to the aligner
0x4    4        segment unmapped
0x8    8        next segment in the template unmapped
0x10   16       SEQ being reverse complemented
0x20   32       SEQ of the next segment in the template being reversed
0x40   64       the first segment in the template
0x80   128      the last segment in the template
0x100  256      secondary alignment
0x200  512      not passing quality controls
0x400  1024     PCR or optical duplicate
83 = 1010011 in binary format (64 + 16 + 2 + 1)
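Decoding a flag is a matter of testing each bit in turn. The sketch below does this for the flag value 83 used on the slide; the dictionary and function names are hypothetical, while the bit meanings are those of the table above.

```python
from typing import List

# Bit values and their meaning, as listed in the SAM flag table.
SAM_FLAG_BITS = {
    0x1: "template having multiple segments in sequencing",
    0x2: "each segment properly aligned according to the aligner",
    0x4: "segment unmapped",
    0x8: "next segment in the template unmapped",
    0x10: "SEQ being reverse complemented",
    0x20: "SEQ of the next segment in the template being reversed",
    0x40: "the first segment in the template",
    0x80: "the last segment in the template",
    0x100: "secondary alignment",
    0x200: "not passing quality controls",
    0x400: "PCR or optical duplicate",
}


def set_bits(flag: int) -> List[int]:
    """Return the individual flag bits that are set, in ascending order."""
    return [bit for bit in SAM_FLAG_BITS if flag & bit]


def describe_flag(flag: int) -> List[str]:
    """Return the human-readable meaning of every bit set in the flag."""
    return [SAM_FLAG_BITS[bit] for bit in set_bits(flag)]


print(bin(83))
for line in describe_flag(83):
    print("-", line)
```

For 83 this reports a paired, properly aligned read that maps to the reverse strand and is the first segment of its template.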

CIGAR alignment
Ref:   AGGTCCATGGACCTG
       || ||||X|||||||
Query: AG-TCCACGGACCTG
2M1D12M or 2=1D4=1X7=

Ref:   CTTATGTGATC
       |||||||||||
Query: CTTATGTGATCCCTG
11M4S

Op  Description
M   alignment match (can be a sequence match or mismatch)
I   insertion to the reference
D   deletion from the reference
N   skipped region from the reference
S   soft clipping (clipped sequences present in SEQ)
H   hard clipping (clipped sequences NOT present in SEQ)
P   padding (silent deletion from padded reference)
=   sequence match
X   sequence mismatch
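A CIGAR string can be split into (length, operation) pairs, from which the number of reference and query bases it covers follows directly. The sketch below is a minimal parser with hypothetical function names; which operations consume the reference or the query follows the operation table above.

```python
import re
from typing import List, Tuple

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_TOKEN.findall(cigar)]
    # Reject strings containing anything the pattern did not consume.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def reference_length(ops: List[Tuple[int, str]]) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(length for length, op in ops if op in CONSUMES_REFERENCE)


def query_length(ops: List[Tuple[int, str]]) -> int:
    """Number of bases of SEQ described by the CIGAR string."""
    return sum(length for length, op in ops if op in CONSUMES_QUERY)


for cigar in ("2M1D12M", "2=1D4=1X7="):
    ops = parse_cigar(cigar)
    print(cigar, ops, reference_length(ops), query_length(ops))
```

Both spellings of the first alignment span 15 reference bases with a 14-base query, which matches the one-base deletion shown in the alignment. Soft-clipped bases count towards the query length but not the reference length.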

Mapping enhancement
Each read is mapped independently:
- Can borrow knowledge from neighbouring reads to improve mapping
Picard Mark Duplicates: read pairs are duplicates when two or more read pairs have the same coordinates.
Samtools BAQ: hidden Markov model that down-weights mismatching bases if they are close to an indel.
GATK indel realignment: takes all reads around a potential indel and performs a more sensitive alignment.
GATK base recalibration: looks at several pieces of contextual information, such as position in the read or dinucleotide composition, to identify covariates of sequencing errors.
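The duplicate definition above (two or more read pairs with the same coordinates) can be illustrated by grouping pairs on their coordinates and keeping one representative per group. This is a simplified sketch, not Picard's implementation: the `ReadPair` fields, the choice to keep the pair with the highest summed base quality, and all values in the example are assumptions.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class ReadPair(NamedTuple):
    name: str
    chrom: str
    start: int        # leftmost coordinate of the first read
    mate_start: int   # leftmost coordinate of its mate
    reverse: bool     # orientation of the first read
    quality_sum: int  # summed base qualities, used to pick a representative


def find_duplicates(pairs: Iterable[ReadPair]) -> Set[str]:
    """Return the names of pairs that share coordinates with a better pair."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[(pair.chrom, pair.start, pair.mate_start, pair.reverse)].append(pair)

    duplicates: Set[str] = set()
    for group in groups.values():
        best = max(group, key=lambda p: p.quality_sum)
        duplicates.update(p.name for p in group if p is not best)
    return duplicates


pairs = [
    ReadPair("a", "chr1", 100, 300, False, 900),
    ReadPair("b", "chr1", 100, 300, False, 950),  # same coordinates as "a"
    ReadPair("c", "chr1", 101, 300, False, 800),  # different start, kept
]
print(find_duplicates(pairs))
```

In a real pipeline the duplicates are usually flagged (bit 0x400) rather than deleted, so that downstream tools can decide whether to ignore them.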

Indel realignment
[Figure: reads before and after realignment; image not preserved. Surviving sequence label: AACAATATCTATGGA/TTTCG/TTTTG]

The whole pipeline
[Figure: pipeline diagram; the arrows are not preserved.]
Steps shown: raw data; alignment; realignment; mark duplicates; base recalibration; ?; final bam file(s)
Analyses run on the final bam file(s):
- SNPs/indels calling
- CNV calling
- Structural variant calling
- Pool analysis

SNPs and indels calling
                          Samtools mpileup + bcftools   GATK UnifiedGenotyper
Algorithm                 Bayesian based                Bayesian based
Multiple samples calling  yes                           yes
Input                     bam file(s)                   bam file(s)
Output                    vcf file                      vcf file
Runtime                   Rather fast                   Slow but multithreaded
Multi-allelic             Up to 2 alleles               3 by default

VCF format
Variant format designed for the 1000 Genomes Project
- SNPs
- Insertions
- Deletions
- Duplications
- Inversions
- Copy number variation

VCF format
Header: defines the optional fields
##INFO=
##FORMAT=
Variants:
- 8 mandatory columns describing the variant
- 1 column defining the genotype format
- 1 column per sample describing the genotype for that SNP for that sample

HEADER
##fileformat=VCFv4.1
##samtoolsVersion= (r982:295)
##INFO=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT germline tumor
DATA (chromosome numbers, positions and most QUAL values are not preserved in this transcript)
chr T C DP=2;AF1=1;AC1=4;DP4=0,0,0,1;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:38,3,0:1:0:3
chr G T DP=2;AF1=1;AC1=4;DP4=0,0,0,1;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 0/1:33,3,0:1:0:4
chr T C 44. DP=2;AF1=1;AC1=4;DP4=0,0,1,1;MQ=60;FQ=-30.8 GT:PL:DP:SP:GQ 1/1:40,3,0:1:0:8 1/1:37,3,0:1:0:8
chr G A DP=2;AF1=0.5011;AC1=2;DP4=1,0,0,1;MQ=60;FQ=-5.67;PV4=1,1,1,1 GT:PL:DP:SP:GQ 0/1:34,0,23:2:0:28 0/0:0,0,0:0:0:3
chr A T DP=1;AF1=1;AC1=4;DP4=0,0,1,0;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:40,3,0:1:0:4

VCF format
#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO                         FORMAT          germline
chr               T    C    8.65  .       DP=2;AF1=1;AC1=4;…           GT:PL:DP:SP:GQ  0/1:0,0,0:0:0:3
chr               G    T    4.77  .       DP=2;AF1=1;AC1=4;…           GT:PL:DP:SP:GQ  0/1:0,0,0:0:0:3
chr               T    C    44    .       DP=2;AF1=1;AC1=4;…           GT:PL:DP:SP:GQ  1/1:40,3,0:1:0:8
chr               G    A    5.47  .       DP=2;AF1=0.5011;AC1=2;…      GT:PL:DP:SP:GQ  0/1:34,0,23:2:0:28
chr               A    T    10.4  .       DP=1;AF1=1;AC1=4;…           GT:PL:DP:SP:GQ  0/1:0,0,0:0:0:3
(Chromosome numbers, POS and ID values are not preserved in this transcript.)
Column meaning:
- CHROM: chromosome name
- POS: SNP position
- ID: SNP identifier
- REF: reference base
- ALT: alternate base(s)
- QUAL: SNP quality
- FILTER: filtering reasons
- INFO: SNP information
- FORMAT: genotype format
- germline (sample column): genotype information
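To make the column layout concrete, the sketch below splits one tab-separated VCF data line into its fixed columns, the INFO key/value pairs and the per-sample genotype fields. The function name is hypothetical, and the chromosome `chr1` and position `1000` in the example are placeholders because those values did not survive in the transcript; the remaining fields are taken from the first row of the slide.

```python
from typing import Any, Dict, List


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict[str, Any]:
    """Parse one tab-separated VCF data line into a nested dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_dict: Dict[str, Any] = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value if value else True

    # FORMAT names the ':'-separated fields found in each sample column.
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
        "samples": samples,
    }


# "chr1" and 1000 are placeholders; the other values come from the slide.
line = "\t".join(["chr1", "1000", ".", "T", "C", "8.65", ".",
                  "DP=2;AF1=1;AC1=4", "GT:PL:DP:SP:GQ", "0/1:0,0,0:0:0:3"])
record = parse_vcf_line(line, ["germline"])
print(record["ref"], record["alt"], record["qual"])
print(record["info"]["DP"], record["samples"]["germline"]["GT"])
```

Values are left as strings except POS and QUAL, since the correct type of each INFO and FORMAT field is declared in the header lines that this sketch does not read.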

Variant filtering
- Depth of coverage: confident het call = 10X-20X
- SNP quality: depends on the caller
- Genotype quality: 20
- Strand bias
- Biological interpretation
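The filtering criteria above could be combined as in the sketch below. It assumes a minimum depth of 10 and a minimum genotype quality of 20, taken from the figures on the slide, and reads strand counts from the samtools DP4 field (ref-forward, ref-reverse, alt-forward, alt-reverse). The function names and the `min_alt_reads` threshold are hypothetical, and the strand-bias test is deliberately crude.

```python
from typing import Sequence


def strand_bias_suspect(dp4: Sequence[int], min_alt_reads: int = 4) -> bool:
    """Flag calls whose alternate allele is seen on one strand only.

    dp4 holds (ref forward, ref reverse, alt forward, alt reverse) counts.
    min_alt_reads is an assumed threshold below which no judgement is made.
    """
    _, _, alt_forward, alt_reverse = dp4
    alt_total = alt_forward + alt_reverse
    return alt_total >= min_alt_reads and (alt_forward == 0 or alt_reverse == 0)


def passes_filters(depth: int, genotype_quality: int, dp4: Sequence[int],
                   min_depth: int = 10, min_gq: int = 20) -> bool:
    """Apply depth, genotype quality and strand-bias filters to one call."""
    return (
        depth >= min_depth
        and genotype_quality >= min_gq
        and not strand_bias_suspect(dp4)
    )


# Fourth germline call of the example VCF: DP=2, GQ=28, DP4=1,0,0,1.
print(passes_filters(2, 28, (1, 0, 0, 1)))     # depth too low
# Hypothetical well-covered calls.
print(passes_filters(25, 40, (8, 7, 5, 5)))    # balanced strands
print(passes_filters(25, 40, (8, 7, 10, 0)))   # alternate allele on one strand
```

Thresholds like these are starting points; as the slide notes, SNP quality scales differ between callers, and the last filter, biological interpretation, cannot be automated this way.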