Published by Jillian Sperry. Modified over 9 years ago.
1
Resequencing Genome
Timothee Cezard
EBI NGS workshop, 16/10/2012
2
NGS Course – Data Flow (Overview)
Topics:
- DNA Sequencing
- RNA Sequencing
- Sequence archives: ENA/SRA submission and retrieval
- Data compression
- Resequencing & assembly
- Genome variation & disease
- Gene regulation: ChIP-seq analysis
- Gene annotation: RNA-Seq, Ensembl gene build
- Gene expression: RNA-Seq, transcriptome analysis
Speakers: Elizabeth Murchison; Jon Teague / Adam Butler / Simon Forbes; Guy Cochrane; Ensembl / John Collins; Myrto Kostadima / Remco Loos; Remco Loos / Myrto Kostadima; Rajesh Radhakrishnan; Rasko Leinonen; Arnaud Oisel; Marc Rossello; Vadim Zalunin; Timothee Cezard; Laura Clarke; Karim Gharbi
3
Slides and tutorials are available at: https://www.wiki.ed.ac.uk/display/GenePoolExternal/NGS+workshop+16.10.2012+at+EBI
5
Overview
DNA (Re)sequencing
- Sequencing technologies
- Sequencing output
- Quality control
Mapping
- Mapping programs
- Sam/Bam format
- Mapping improvements
Variant calling
- Types of variants
- SNPs/indels
- VCF format
7
Resequencing genomes
[Diagram: DNA extraction, library prep]
8
Sequencing data
Example reads: GATGGGAAGA GCGGTTCAGC AGGAATGCCG AGACCGATAT CGTATGCCGT
Sequence data:
- Precise
- Fairly unbiased
- Easy to QC
Coverage depth data:
- Can be biased
- Hard to know what's true
9
Sequencer-specific errors
- Homopolymer runs can create false indels
- Specific sequence patterns can create phasing issues
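Since false indels tend to cluster around homopolymer runs, it can help to know where those runs sit in a read. The sketch below is only an illustration: the function name `homopolymer_runs`, the `min_len=4` threshold and the example read are all made up for this note, not taken from the slides.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for each run of a single base of at least min_len."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


# Made-up read: three bases, a run of six T, then four more bases.
read = "ACG" + "T" * 6 + "GAAC"
print(homopolymer_runs(read))
```

Indel calls that overlap a reported run are worth a second look before they are trusted.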
11
Sequencing output (Fastq format)
Example fastq record:
@ILLUMINA06_0016:6:1:5388:12733#0
GATGGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATATCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDADACBCCCDADBDDCBCD;BBDBDBBBB%%%%%
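A FASTQ record has four lines: a header starting with `@`, the sequence, a separator starting with `+`, and one quality character per base. The sketch below reads one record and converts quality characters to Phred scores. It assumes the Phred+33 encoding (older Illumina pipelines used an offset of 64), and the helper names and the short test record are hypothetical.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (name, sequence, quality string)."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert quality characters to integer Phred scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in qual]


def error_probability(score):
    """Phred definition: Q = -10 * log10(P_error)."""
    return 10 ** (-score / 10)


# Made-up record for illustration only.
name, seq, qual = parse_fastq_record(["@read1\n", "GATG\n", "+\n", "CCD%\n"])
print(name, seq, phred_scores(qual))
```

Under this encoding a `%` character is a very low score (Q4), which is why a trailing run of `%` usually marks bases that should be trimmed or ignored.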
14
Quality control
Questions you should ask (yourself or your sequencing provider):
Sequencing QC:
- How much sequencing?
- What's the sequencing quality?
Library QC:
- What's the base profile across the reads?
- Is there an unexpected GC bias?
- Are there any library preparation contaminants?
Post-mapping QC:
- What is the fragment length distribution? (for paired-end data)
- Is there an unexpected duplicate rate?
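Two of the library QC questions above (base profile across the reads, GC bias) can be approximated with a few lines of code. This is a minimal sketch, not a replacement for a dedicated QC tool: the function names are hypothetical and it assumes upper-case reads.

```python
from collections import Counter


def base_profile(reads):
    """Count each base at each read position; returns one Counter per position."""
    profile = []
    for read in reads:
        for i, base in enumerate(read):
            if i == len(profile):
                profile.append(Counter())
            profile[i][base] += 1
    return profile


def gc_fraction(read):
    """Fraction of G and C bases in one read (0.0 for an empty read)."""
    if not read:
        return 0.0
    return sum(1 for base in read if base in "GC") / len(read)


# The five short reads shown on the "Sequencing data" slide.
reads = ["GATGGGAAGA", "GCGGTTCAGC", "AGGAATGCCG", "AGACCGATAT", "CGTATGCCGT"]
for position, counts in enumerate(base_profile(reads)):
    print(position, dict(counts))
print([round(gc_fraction(read), 2) for read in reads])
```

A flat base profile and a GC distribution close to that of the reference are the expected picture; strong position-specific skews often point at adapters or other library contaminants.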
15
Example with FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
18
Mapping reads to a reference genome
Problems:
- How to find the best match of a short sequence on a large genome (high sensitivity)
- How to not find a match when there is none
- How to do this for 100,000,000,000 reads in a reasonable amount of time
Solutions:
- Hashing-based algorithms: BLAST, Eland, MAQ, SHRiMP, GSNAP, Stampy. More sensitive in the presence of SNPs/indels.
- Suffix trie + Burrows-Wheeler Transform algorithms: Bowtie, SOAP, BWA. Faster.
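To give a feel for the hashing-based family, here is a toy seed lookup: every k-mer of the reference is stored in a hash table, and each k-mer of a read votes for the read start position it implies. This is a simplified sketch under stated assumptions (exact seeds, forward strand only, no extension step) and does not reproduce what any of the listed mappers actually does; the reference string and function names are made up.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Hash every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_positions(read, index, k):
    """Let each k-mer seed of the read vote for the read start position it implies."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset
            if start >= 0:
                votes[start] += 1
    return votes.most_common()


reference = "TTGACCAGTACGGATCCA"      # made-up reference
index = build_kmer_index(reference, k=4)
exact_read = reference[5:13]          # "CAGTACGG"
snp_read = "CAGTACTG"                 # same read with one mismatch
print(candidate_positions(exact_read, index, k=4))
print(candidate_positions(snp_read, index, k=4))
```

The mismatching read still collects votes from the seeds that do not overlap the mismatch, which is the intuition behind the tolerance of seed-based mappers to SNPs.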
21
Different software for different applications
Applications: transcriptome to genome; very fast mapping; mapping to a distant reference
Mappers: GSNAP, TopHat, Stampy, SHRiMP, Bowtie, BWA, SMALT, SplitSeek, mrFAST, mrsFAST, SSAHA2, CLC bio, Partek, Genomatix, BWA-SW
Data flow: Fastq -> Mapper -> Sam/Bam
22
SAM/BAM format
SAM: Sequence Alignment/Map format v1.4, The SAM Format Specification Working Group (Sept 2011)
http://samtools.sourceforge.net/SAM1.pdf
- Standardized format for alignments
- BAM: binary equivalent of SAM
- BAM can be indexed for fast record retrieval
- Manipulate SAM/BAM files using samtools and other tools
Two parts:
- Header: contains metadata about the sample
- Alignment: the alignment records (columns on the next slide)
23
SAM/BAM format
COLUMNS:
Col  Field  Type    Description
1    QNAME  String  Query template NAME
2    FLAG   Int     bitwise FLAG
3    RNAME  String  Reference sequence NAME
4    POS    Int     1-based leftmost mapping POSition
5    MAPQ   Int     MAPping Quality
6    CIGAR  String  CIGAR string
7    RNEXT  String  Ref. name of the mate/next fragment
8    PNEXT  Int     Position of the mate/next fragment
9    TLEN   Int     observed Template LENgth
10   SEQ    String  fragment SEQuence
11   QUAL   String  ASCII of Phred-scaled base QUALity+33
Example record (columns 1 to 12):
1     2   3    4   5   6   7  8  9    10        11  12
R001  83  ref  37  30  9M  =  7  -39  CAGCGCAT  ≈   TAG
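Because SAM is tab-separated text, the eleven mandatory columns can be split out directly. The sketch below is a minimal illustration; `SamRecord` and `parse_sam_line` are hypothetical names, and the example line is an illustrative record rather than data from the workshop. For real work a dedicated library or samtools is the safer choice.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Parse one SAM alignment line into its 11 mandatory columns plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),       # 1-based leftmost mapping position
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=fields[11:],         # optional TAG:TYPE:VALUE fields
    )


# Illustrative alignment line (tab-separated).
line = "r001\t83\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1"
record = parse_sam_line(line)
print(record.qname, record.flag, record.pos, record.cigar, record.tags)
```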
24
Bitwise flag
Bit    Integer  Description
0x1    1        template having multiple segments in sequencing
0x2    2        each segment properly aligned according to the aligner
0x4    4        segment unmapped
0x8    8        next segment in the template unmapped
0x10   16       SEQ being reverse complemented
0x20   32       SEQ of the next segment in the template being reversed
0x40   64       the first segment in the template
0x80   128      the last segment in the template
0x100  256      secondary alignment
0x200  512      not passing quality controls
0x400  1024     PCR or optical duplicate
83 = 1010011 in binary format
25
Bitwise flag http://picard.sourceforge.net/explain-flags.html 83 = 1010011 in binary format
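The flag value is the sum of the bits that are set, so it can be decoded with a bitwise AND against each bit in the table. In this sketch the bit values come from the table, while the short labels and the function name are hypothetical shorthand chosen for this note.

```python
# (bit, short label) pairs; bit values from the flag table, labels are made up.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "last_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
]


def decode_flag(flag):
    """Return the labels of every bit that is set in a SAM flag."""
    return [label for bit, label in SAM_FLAGS if flag & bit]


print(format(83, "b"))   # binary representation of the flag
print(decode_flag(83))
```

For flag 83 the set bits are 1 + 2 + 16 + 64: a paired read, properly aligned, on the reverse strand, and first in its pair.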
26
CIGAR alignment
Ref:   AGGTCCATGGACCTG
       || ||||X|||||||
Query: AG-TCCACGGACCTG
2M1D12M or 2=1D4=1X7=

Ref:   CTTATGTGATC
       |||||||||||
Query: CTTATGTGATCCCTG
11M4S

Op  Description
M   alignment match (can be a sequence match or mismatch)
I   insertion to the reference
D   deletion from the reference
N   skipped region from the reference
S   soft clipping (clipped sequences present in SEQ)
H   hard clipping (clipped sequences NOT present in SEQ)
P   padding (silent deletion from padded reference)
=   sequence match
X   sequence mismatch
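A CIGAR string can be checked mechanically: operations M, I, S, = and X consume bases of the read, while M, D, N, = and X consume bases of the reference. The sketch below parses a CIGAR string and computes both lengths; the function names are hypothetical, and `8M2S` is a made-up string used only to show soft clipping.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError("invalid CIGAR string: " + cigar)
    return [(int(length), op) for length, op in CIGAR_TOKEN.findall(cigar)]


def query_length(cigar):
    """Number of read bases described by the CIGAR (should equal len(SEQ))."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


for cigar in ("2M1D12M", "2=1D4=1X7=", "8M2S"):
    print(cigar, query_length(cigar), reference_span(cigar))
```

Both spellings of the first alignment on the slide describe a 14-base read covering 15 reference bases, which matches the aligned sequences shown.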
27
Mapping enhancement
Each read is mapped independently, so knowledge can be borrowed from neighbouring reads to improve the mapping.
- Picard MarkDuplicates: read pairs are duplicates when two or more read pairs have the same coordinates.
- Samtools BAQ: a hidden Markov model that downweights mismatches that are close to indels.
- GATK indel realignment: takes all the reads around a potential indel and performs a more sensitive alignment.
- GATK base recalibration: looks at several pieces of contextual information, such as position in the read or dinucleotide composition, to identify covariates of sequencing errors.
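The duplicate definition above (two or more read pairs with the same coordinates) can be sketched as a grouping problem. This is only an illustration and not Picard's implementation: the record layout, the field names and the rule of keeping the pair with the highest summed base quality are assumptions made for the example.

```python
from collections import defaultdict


def find_duplicates(pairs):
    """Return the names of read pairs that share coordinates with a better pair.

    Each pair is a dict with hypothetical keys: name, chrom, start, end,
    strand and qual_sum (sum of base qualities, used to pick the pair to keep).
    """
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair["chrom"], pair["start"], pair["end"], pair["strand"])
        groups[key].append(pair)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda pair: pair["qual_sum"])
        duplicates.update(pair["name"] for pair in group if pair is not best)
    return duplicates


# Made-up read pairs: p1 and p2 have identical coordinates.
pairs = [
    {"name": "p1", "chrom": "chr4", "start": 100, "end": 350, "strand": "+", "qual_sum": 2900},
    {"name": "p2", "chrom": "chr4", "start": 100, "end": 350, "strand": "+", "qual_sum": 3100},
    {"name": "p3", "chrom": "chr4", "start": 120, "end": 370, "strand": "+", "qual_sum": 3000},
]
print(find_duplicates(pairs))
```

In a BAM file the flagged pairs would get the 0x400 bit set rather than being removed, so downstream tools can choose whether to ignore them.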
28
Indel realignment AACAATATCTATGGA/TTTCG/TTTTG
32
The whole pipeline
Raw data -> Alignment -> Realignment -> Mark duplicates -> Base recalibration (?) -> Final bam file(s)
33
The whole pipeline (downstream of the final bam file(s))
Final bam file(s) ->
- SNPs/Indels calling
- CNV calling
- Structural variant calling
- Pool analysis
35
SNPs and indels calling
                         Samtools mpileup + bcftools   GATK UnifiedGenotyper
Algorithm                Bayesian based (both)
Multiple-sample calling  yes (both)
Input                    bam file(s) (both)
Output                   vcf file (both)
Runtime                  Rather fast                   Slow but multithreaded
Multi-allelic            Up to 2 alleles               3 by default
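Both callers are described as Bayesian. The idea can be shown with a deliberately simplified diploid model: for each genotype, multiply the probability of every observed base given that genotype and the base's error rate, then normalise. This sketch assumes a flat prior, independent reads, only one alternate allele and base qualities above zero; it is not the model used by samtools or GATK, and all names are hypothetical.

```python
import math


def base_probability(observed, allele, error):
    """P(observed base | true allele), spreading the error over the 3 other bases."""
    return 1 - error if observed == allele else error / 3


def genotype_posteriors(bases, quals, ref, alt):
    """Posterior of 0/0, 0/1 and 1/1 under a flat prior (simplified model)."""
    alleles = {"0/0": (ref, ref), "0/1": (ref, alt), "1/1": (alt, alt)}
    log_likelihood = {genotype: 0.0 for genotype in alleles}
    for base, qual in zip(bases, quals):
        error = 10 ** (-qual / 10)   # Phred score to error probability
        for genotype, (a1, a2) in alleles.items():
            p = 0.5 * base_probability(base, a1, error) + 0.5 * base_probability(base, a2, error)
            log_likelihood[genotype] += math.log(p)
    # Normalise in log space to avoid underflow with many reads.
    top = max(log_likelihood.values())
    weights = {g: math.exp(ll - top) for g, ll in log_likelihood.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Made-up pileup: five reference G and five alternate A, all at Q30.
posteriors = genotype_posteriors("GGGGGAAAAA", [30] * 10, ref="G", alt="A")
print(max(posteriors, key=posteriors.get), posteriors)
```

With only one or two reads the three posteriors stay close together, which is one reason variant calls at low depth deserve little confidence.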
36
VCF format
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
Variant format designed for the 1000 Genomes Project. It can describe:
- SNPs
- Insertions
- Deletions
- Duplications
- Inversions
- Copy number variation
37
VCF format
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
Header: defines the optional fields
##INFO=
##FORMAT=
Variants:
- 8 mandatory columns describing the variant
- 1 column defining the genotype format
- 1 column per sample describing the genotype for that SNP in that sample
38
HEADER
##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=
##FORMAT=
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT germline tumor
DATA
chr4 27668 . T C 8.65 . DP=2;AF1=1;AC1=4;DP4=0,0,0,1;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:38,3,0:1:0:3
chr4 27669 . G T 4.77 . DP=2;AF1=1;AC1=4;DP4=0,0,0,1;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 0/1:33,3,0:1:0:4
chr4 27712 . T C 44 . DP=2;AF1=1;AC1=4;DP4=0,0,1,1;MQ=60;FQ=-30.8 GT:PL:DP:SP:GQ 1/1:40,3,0:1:0:8 1/1:37,3,0:1:0:8
chr4 27774 . G A 5.47 . DP=2;AF1=0.5011;AC1=2;DP4=1,0,0,1;MQ=60;FQ=-5.67;PV4=1,1,1,1 GT:PL:DP:SP:GQ 0/1:34,0,23:2:0:28 0/0:0,0,0:0:0:3
chr4 36523 . A T 10.4 . DP=1;AF1=1;AC1=4;DP4=0,0,1,0;MQ=60;FQ=-27.4 GT:PL:DP:SP:GQ 0/1:0,0,0:0:0:3 1/1:40,3,0:1:0:4
39
VCF format
Column    Meaning
#CHROM    Chromosome name
POS       Position
ID        SNP identifier
REF       Reference base
ALT       Alternate base(s)
QUAL      SNP quality
FILTER    Filtering reasons
INFO      SNP information
FORMAT    Genotype format
germline  Genotype information (one column per sample)
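The column layout translates directly into a small parser: eight fixed columns, a FORMAT column naming the per-sample keys, then one column per sample. The sketch below is a minimal illustration with hypothetical names; it keeps all INFO and genotype values as strings and assumes a tab-separated line with at least nine columns.

```python
def parse_vcf_line(line, sample_names):
    """Parse one VCF data line into a dict; sample_names come from the #CHROM line."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info, fmt = cols[:9]

    info_fields = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_fields[key] = value if value else True   # flags have no value

    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, cols[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_fields,
        "samples": samples,
    }


# The chr4:27774 record from the example slide, with tabs restored.
line = ("chr4\t27774\t.\tG\tA\t5.47\t.\t"
        "DP=2;AF1=0.5011;AC1=2;DP4=1,0,0,1;MQ=60;FQ=-5.67;PV4=1,1,1,1\t"
        "GT:PL:DP:SP:GQ\t0/1:34,0,23:2:0:28\t0/0:0,0,0:0:0:3")
variant = parse_vcf_line(line, ["germline", "tumor"])
print(variant["pos"], variant["alt"], variant["samples"]["germline"]["GT"])
```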
40
Variant filtering
- Depth of coverage: a confident het call needs 10X-20X
- SNP quality (depends on the caller): 30-50
- Genotype quality: 20
- Strand bias
- Biological interpretation
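The numeric thresholds above can be applied as simple hard filters. The sketch below uses the lower end of each range from the slide as a default; the function name and the filter labels are hypothetical, and strand bias and biological interpretation are left out because the slide gives no threshold for them.

```python
def failed_filters(qual, depth, genotype_quality,
                   min_qual=30.0, min_depth=10, min_gq=20):
    """Return the labels of the hard filters a variant call fails (empty list = pass)."""
    reasons = []
    if qual < min_qual:
        reasons.append("LowQual")
    if depth < min_depth:
        reasons.append("LowDepth")
    if genotype_quality < min_gq:
        reasons.append("LowGQ")
    return reasons


# Values taken from the chr4:27712 example record: QUAL=44, DP=2, GQ=8.
print(failed_filters(qual=44.0, depth=2, genotype_quality=8))
```

A call with a good SNP quality can still fail on depth and genotype quality, as the example record does, so the filters are best applied together rather than one at a time.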