Resequencing Genome Timothee Cezard EBI NGS workshop 16/10/2012
NGS Course – Data Flow DNA Sequencing RNA Sequencing Sequence archives ENA/SRA submission and retrieval Gene regulation ChIP-seq analysis Gene annotation RNA-Seq Ensembl gene build Gene expression RNA-Seq Transcriptome analysis Elizabeth Murchison Jon Teague /Adam Butler/ Simon Forbes Data compression Guy Cochrane Ensembl/John CollinsMyrto Kostadima/ Remco Loos Remco Loos/ Myrto Kostadima Rajesh Radhakrishnan Rasko Leinonen Arnaud Oisel Marc Rossello Vadim Zalunin Resequencing & assembly Timothee Cezard Laura Clarke Genome variation & disease Karim Gharbi Overview
Slides and tutorials are available at:
Overview
- DNA (Re)sequencing
  - Sequencing technologies
  - Sequencing output
  - Quality control
- Mapping
  - Mapping programs
  - Sam/Bam format
  - Mapping improvements
- Variant calling
  - Types of variants
  - SNPs/indels
  - VCF format
Resequencing genomes
DNA extraction -> Library prep
Sequencing data
GATGGGAAGA GCGGTTCAGC AGGAATGCCG AGACCGATAT CGTATGCCGT
- Sequence data: precise, fairly unbiased, easy to QC
- Coverage depth data: can be biased, hard to know what's true
Sequencer specific errors
- Homopolymer runs create false indels
- Specific sequence patterns can create phasing issues
Sequencing output (Fastq format)
Example fastq:
GATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDADACBCCCDADBDDCBCD;BBDBDBBBB%%%%%
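The quality line of a fastq record encodes one Phred score per base. As a minimal sketch, assuming the Phred+33 ASCII offset (consistent with the characters in the example above, but an assumption about this dataset), the scores and the error probabilities they imply can be decoded as follows. The function names are illustrative, not part of any library.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert an ASCII quality string into a list of Phred scores.

    The offset of 33 is an assumption (Sanger / Illumina 1.8+ style encoding).
    """
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


# Last few quality characters of the example read above.
tail = "BBBB%%%%%"
for ch, q in zip(tail, phred_scores(tail)):
    print(ch, q, round(error_probability(q), 4))
```

Under this encoding the trailing `%` characters decode to a score of 4, which flags the poly-A tail of the example read as low quality.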
Quality control
Questions you should ask (yourself or your sequencing provider):
- Sequencing QC:
  - How much sequencing?
  - What's the sequencing quality?
- Library QC:
  - What's the base profile across the reads?
  - Is there an unexpected GC bias?
  - Are there any library preparation contaminants?
- Post mapping QC:
  - What is the fragment length distribution? (for paired end)
  - Is there an unexpected duplicate rate?
Example with FastQC
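Two of the library QC questions (base profile across the reads, GC bias) can be approximated with a few lines of code. This is a rough sketch of the idea only, not how FastQC itself is implemented; the function names are hypothetical.

```python
from collections import Counter


def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a read."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def base_profile(reads: list[str]) -> list[Counter]:
    """Count each base at each read position across all reads."""
    length = max((len(read) for read in reads), default=0)
    profile = [Counter() for _ in range(length)]
    for read in reads:
        for position, base in enumerate(read.upper()):
            profile[position][base] += 1
    return profile


reads = ["GATGGGAAGA", "GCGGTTCAGC", "AGGAATGCCG"]
print([round(gc_fraction(read), 2) for read in reads])
print(base_profile(reads)[0])
```

A per-read GC histogram that deviates strongly from the expected genome GC content, or a base profile that is not flat along the read, would be a reason to look more closely at the library.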
Mapping reads to a reference genome
Problems:
- How to find the best match of a short sequence on a large genome (high sensitivity)
- How to not find a match when there is none
- How to do this for 100,000,000,000 reads in a reasonable amount of time
Solutions:
- Hashing based algorithms: BLAST, Eland, MAQ, SHRiMP, GSNAP, Stampy. More sensitive when SNPs/indels are present.
- Suffix trie + Burrows Wheeler Transform algorithms: Bowtie, SOAP, BWA. Faster.
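The hashing approach can be illustrated with a toy seed-and-vote mapper: index every k-mer of the reference in a hash table, look up the k-mers of the read, and let each hit vote for the read start position it implies. This is a simplified sketch of the general idea, not the algorithm of any of the tools listed; real mappers add spaced seeds, seed extension and scoring.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Hash table from each k-mer to the 0-based positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Return (read_start, votes) pairs, best supported start first."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            votes[hit - offset] += 1  # start position implied by this seed
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


reference = "AGGTCCATGGACCTG"
read = "CCACGGACC"  # matches reference position 4 with one mismatch
index = build_kmer_index(reference, k=3)
print(candidate_positions(read, index, k=3))  # [(4, 4)]
```

Seeds overlapping the mismatch find no hit, but the remaining seeds still agree on the same start position, which is why hashing approaches tolerate SNPs and indels well.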
Different software for different applications
Applications: transcriptome to genome, very fast mapping, mapping to distant reference
Mappers: GSNAP, TopHat, Stampy, SHRiMP, Bowtie, BWA, SMALT, SplitSeek, mrFAST, mrsFAST, SSAHA2, CLC bio, Partek, Genomatix, BWA-SW
Mapper: Fastq -> Sam/Bam
SAM/BAM format
SAM: Sequence Alignment/Map format v1.4, The SAM Format Specification Working Group (Sept 2011)
- Standardized format for alignments
- BAM: binary equivalent of SAM
- BAM can be indexed for fast record retrieval
- Manipulate SAM/BAM files using samtools and other tools
Two parts:
- Header: contains metadata about the sample
- Alignment: one line per alignment record
SAM/BAM format
COLUMNS:
Col  Field  Type    Description
1    QNAME  String  Query template NAME
2    FLAG   Int     bitwise FLAG
3    RNAME  String  Reference sequence NAME
4    POS    Int     1-based leftmost mapping POSition
5    MAPQ   Int     MAPping Quality
6    CIGAR  String  CIGAR string
7    RNEXT  String  Ref. name of the mate/next fragment
8    PNEXT  Int     Position of the mate/next fragment
9    TLEN   Int     observed Template LENgth
10   SEQ    String  fragment SEQuence
11   QUAL   String  ASCII of Phred-scaled base QUALity+33
Optional fields follow the 11 mandatory columns as TAG entries.
Example (columns 3 to 10): ref 37 30 9M = 7 -39 CAGCGCAT
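Because every alignment line is tab separated with the same 11 mandatory columns, a record can be split into named fields directly. The sketch below uses only the standard library; the `SamRecord` type, the `parse_sam_line` helper and the example record (read name, flag, sequence and tag) are illustrative assumptions. In practice a dedicated library such as pysam would normally be used.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int        # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: list[str]  # optional TAG:TYPE:VALUE fields, left unparsed


def parse_sam_line(line: str) -> SamRecord:
    """Split one SAM alignment line into its 11 mandatory columns plus tags."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError(f"expected at least 11 columns, found {len(f)}")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


# Hypothetical record, for illustration only.
line = "r001\t83\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, record.tlen)
```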
Bitwise flag
Bit    Integer  Description
0x1    1        template having multiple segments in sequencing
0x2    2        each segment properly aligned according to the aligner
0x4    4        segment unmapped
0x8    8        next segment in the template unmapped
0x10   16       SEQ being reverse complemented
0x20   32       SEQ of the next segment in the template being reversed
0x40   64       the first segment in the template
0x80   128      the last segment in the template
0x100  256      secondary alignment
0x200  512      not passing quality controls
0x400  1024     PCR or optical duplicate
83 = 1010011 in binary format (64 + 16 + 2 + 1)
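Decoding a flag is a matter of testing each bit in turn. The sketch below covers the bits defined in the SAM specification up to 0x400; the short labels and the `decode_flag` name are our own shorthand.

```python
SAM_FLAGS = {
    0x1: "multiple segments (paired)",
    0x2: "properly aligned",
    0x4: "segment unmapped",
    0x8: "next segment unmapped",
    0x10: "reverse complemented",
    0x20: "next segment reversed",
    0x40: "first segment",
    0x80: "last segment",
    0x100: "secondary alignment",
    0x200: "not passing quality controls",
    0x400: "PCR or optical duplicate",
}


def decode_flag(flag: int) -> list[str]:
    """List the description of every bit that is set in a SAM flag."""
    return [label for bit, label in SAM_FLAGS.items() if flag & bit]


print(format(83, "b"))   # 1010011
print(decode_flag(83))
```

For 83 the set bits are 0x1, 0x2, 0x10 and 0x40: a properly aligned, reverse-strand first read of a pair.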
CIGAR alignment
Ref:   AGGTCCATGGACCTG
       || ||||X|||||||
Query: AG-TCCACGGACCTG
2M1D12M or 2=1D4=1X7=

Ref:   CTTATGTGATC
       |||||||||||
Query: CTTATGTGATCCCTG
11M4S

Op  Description
M   alignment match (can be a sequence match or mismatch)
I   insertion to the reference
D   deletion from the reference
N   skipped region from the reference
S   soft clipping (clipped sequences present in SEQ)
H   hard clipping (clipped sequences NOT present in SEQ)
P   padding (silent deletion from padded reference)
=   sequence match
X   sequence mismatch
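A CIGAR string can be parsed with a regular expression, after which the number of reference and query bases it covers follows from which operations consume which sequence. This is a small sketch with illustrative function names; which operations consume the reference (M, D, N, =, X) and the query (M, I, S, =, X) is taken from the SAM specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")
QUERY_CONSUMING = set("MIS=X")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything the pattern did not account for.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)


def query_length(cigar: str) -> int:
    """Number of bases of SEQ accounted for by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)


print(parse_cigar("2M1D12M"))
print(reference_length("2M1D12M"), query_length("2M1D12M"))
```

For the first alignment above, `2M1D12M` spans 15 reference bases and 14 query bases, matching the lengths of the two sequences shown. Checking that `query_length` equals the length of SEQ is a cheap sanity test for any record.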
Mapping enhancement
Each read is mapped independently, so knowledge can be borrowed from neighbouring reads to improve the mapping:
- Picard Mark Duplicates: read pairs are duplicates when two or more read pairs have the same coordinates.
- Samtools BAQ: hidden Markov model that down-weights mismatching bases if they are close to an indel.
- GATK indel realignment: takes all reads around a potential indel and performs a more sensitive alignment.
- GATK base recalibration: looks at several pieces of contextual information, such as position in the read or dinucleotide composition, to identify covariates of sequencing errors.
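The duplicate definition above (two or more read pairs with the same coordinates) can be sketched as a grouping problem. This is a deliberately simplified illustration, not Picard's implementation: the `ReadPair` type and its fields are hypothetical, clipping is ignored, and keeping the pair with the highest summed base quality is an assumed tie-breaking rule.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    name: str
    chrom: str
    start: int        # position of the first read
    mate_start: int   # position of its mate
    reverse: bool     # orientation of the first read
    quality_sum: int  # sum of base qualities over both reads


def mark_duplicates(pairs: list[ReadPair]) -> set[str]:
    """Return the names of pairs that duplicate a better pair at the same coordinates."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[(pair.chrom, pair.start, pair.mate_start, pair.reverse)].append(pair)
    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda pair: pair.quality_sum)
        duplicates.update(pair.name for pair in group if pair is not best)
    return duplicates


pairs = [
    ReadPair("a", "chr1", 100, 350, False, 300),
    ReadPair("b", "chr1", 100, 350, False, 350),
    ReadPair("c", "chr1", 120, 360, False, 280),
]
print(mark_duplicates(pairs))  # {'a'}
```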
Indel realignment AACAATATCTATGGA/TTTCG/TTTTG
The whole pipeline
Raw data -> Alignment -> Realignment -> Mark duplicates -> Base recalibration (?) -> Final bam file(s)
Final bam file(s) ->
- SNPs/Indels calling
- CNV calling
- Structural variant calling
- Pool analysis
SNPs and indels calling
                          Samtools mpileup + bcftools   GATK UnifiedGenotyper
Algorithm                 Bayesian based                Bayesian based
Multiple samples calling  yes                           yes
Input                     bam file(s)                   bam file(s)
Output                    vcf file                      vcf file
Runtime                   Rather fast                   Slow but multithreaded
Multi-allelic             Up to 2 alleles               3 by default
VCF format
Variant Call Format, designed for the 1000 Genomes Project. It can describe:
- SNPs
- Insertions
- Deletions
- Duplications
- Inversions
- Copy number variation
VCF format
Header: defines the optional fields
##INFO=
##FORMAT=
Variants:
- 8 mandatory columns describing the variant
- 1 column defining the genotype format
- 1 column per sample describing the genotype for that variant in that sample
HEADER
##fileformat=VCFv4.1
##samtoolsVersion= (r982:295)
##INFO=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT germline tumor
DATA
chr T C DP=2;AF1=1;AC1=4;DP4=0,0,0,1;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:38,3,0:1:0:3
chr G T DP=2;AF1=1;AC1=4;DP4=0,0,0,1;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 0/1:33,3,0:1:0:4
chr T C 44 . DP=2;AF1=1;AC1=4;DP4=0,0,1,1;MQ=60;FQ=-30.8 GT:PL:DP:SP:GQ 1/1:40,3,0:1:0:8 1/1:37,3,0:1:0:8
chr G A DP=2;AF1=0.5011;AC1=2;DP4=1,0,0,1;MQ=60;FQ=-5.67;PV4=1,1,1,1 GT:PL:DP:SP:GQ 0/1:34,0,23:2:0:28 0/0:0,0,0:0:0:3
chr A T DP=1;AF1=1;AC1=4;DP4=0,0,1,0;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:40,3,0:1:0:4
VCF format SNPs
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT germline
chr T C 8.65 . DP=2;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3
chr G T 4.77 . DP=2;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3
chr T C 44 . DP=2;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 1/1:40,3,0:1:0:8
chr G A 5.47 . DP=2;AF1=0.5011;AC1=2;… GT:PL:DP:SP:GQ 0/1:34,0,23:2:0:28
chr A T 10.4 . DP=1;AF1=1;AC1=4;… GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3
Columns:
- CHROM: chromosome name
- POS: position
- ID: SNP identifier
- REF: reference base
- ALT: alternate base(s)
- QUAL: SNP quality
- FILTER: filtering reasons
- INFO: SNP information
- FORMAT: genotype format
- germline: genotype information
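The column layout above maps directly onto a small parser: split on tabs, split INFO on `;` and `=`, and pair the FORMAT keys with each sample's colon-separated values. The sketch below uses only the standard library and hypothetical helper names. The chromosome name and position in the example line are made up for illustration, since they are not legible in the records shown above. Libraries such as pysam or PyVCF would normally be used instead.

```python
def parse_info(info: str) -> dict:
    """Turn 'DP=2;AF1=1' into a dict; flags without '=' map to True."""
    result = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def parse_vcf_line(line: str, sample_names: list[str]) -> dict:
    """Parse one VCF data line into fixed columns, INFO and per-sample genotypes."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = fields[:9]
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, value.split(":")))
        for name, value in zip(sample_names, fields[9:])
    }
    return {
        "CHROM": chrom, "POS": int(pos), "ID": vid, "REF": ref,
        "ALT": alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": flt, "INFO": parse_info(info), "samples": samples,
    }


# 'chr1' and position 1000 are hypothetical placeholders.
line = "chr1\t1000\t.\tG\tA\t5.47\t.\tDP=2;AF1=0.5011;AC1=2\tGT:PL:DP:SP:GQ\t0/1:34,0,23:2:0:28"
record = parse_vcf_line(line, ["germline"])
print(record["REF"], record["ALT"], record["samples"]["germline"]["GT"])
```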
Variant filtering
- Depth of coverage: a confident het call needs 10X-20X
- SNP quality: depends on the caller
- Genotype quality: 20
- Strand bias
- Biological interpretation
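The first filters above translate into simple threshold checks. The sketch below is one possible reading, not a recommended filter set: the depth and genotype quality defaults follow the figures on the slide (10X and 20), while the strand-bias test (a minimum fraction of alternate-allele reads on the less represented strand) and its 0.1 threshold are assumptions made for illustration.

```python
def passes_filters(depth: int, genotype_quality: int,
                   forward_alt: int, reverse_alt: int,
                   min_depth: int = 10, min_gq: int = 20,
                   min_strand_fraction: float = 0.1) -> bool:
    """Apply depth, genotype quality and a crude strand-bias filter to one call.

    forward_alt / reverse_alt are the counts of alternate-allele reads on each
    strand (for example the last two values of the DP4 INFO field).
    """
    if depth < min_depth or genotype_quality < min_gq:
        return False
    alt_total = forward_alt + reverse_alt
    if alt_total == 0:
        return False
    # Hypothetical strand-bias rule: both strands must support the variant.
    minor_fraction = min(forward_alt, reverse_alt) / alt_total
    return minor_fraction >= min_strand_fraction


print(passes_filters(depth=15, genotype_quality=28, forward_alt=4, reverse_alt=5))
print(passes_filters(depth=2, genotype_quality=28, forward_alt=1, reverse_alt=1))
```

None of the calls in the example VCF above (DP of 1 or 2) would survive a depth threshold of 10X, which illustrates why low-coverage calls need cautious interpretation.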