MIK Bachelor seminars Sequencing in practice

Presentation transcript:

MIK Bachelor seminars: Sequencing in practice. The technology. Aldo Jongejan (a.jongejan@amc.uva.nl)

What do we want?
Discover the genetic cause of disease: fundamental research, systems biology
Discover why people react differently to medicines: medicinal chemistry, personalized medicine
Diagnostics: quick tests, genetic makeup, risk factors …

Sequencing Updated to Dec. 2012

How does it work?
1 Produce shotgun library from DNA (patient)
2 Capture exon sequences
3 Wash & sequence
4 Map against reference genome
5 Determine variants, filter, compare patients -> candidate genes (Gene A, Gene B)

Sequencing methods "old style" - 1977
Frederick Sanger (1918): structure of insulin, sequencing of nucleic acids. Allan Coulson.
Chain termination, dye termination.
Frederick Sanger & Allan Coulson, 1977: compared with the method of Maxam & Gilbert, their method used fewer toxic chemicals and radioisotopes and was therefore easier to use. It was in use for more than 30 years! Sanger won the Nobel Prize twice (insulin, 1958; sequencing of nucleic acids, 1980).
Extra: http://www.routesgame.com/games/?challengeId=5

Sanger method
Single-stranded DNA + primer. Template sources: cloned in plasmid (denaturation), cloned in bacteriophage M13, cloned in phagemid, PCR.
Drawback of M13: only small fragments can be incorporated. A phagemid circumvents this.
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=genomes&part=A6441

Dye termination Label each ddNTP with a fluorescent marker http://www.phgfoundation.org/tutorials/dna/5.html

High Throughput Sequencing
Sanger method:
Laborious sample preparation (cloning/amplification). Some inserts are unclonable because of a deleterious effect on the host.
Read length is determined by electrophoresis. The longer the strands, the smaller the relative difference in length between strands (N / N+1).
Capillary electrophoresis machines are the bottleneck.
HUGO - 8 yr. preparation / 5 yr. sequencing
New methods highly desirable!

Contest
"the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 (US) per genome."
Accuracy: < 1 error per 100,000 bases
Coverage: > 98% of the genome
Costs: < $10k per genome
http://genomics.xprize.org/
The X Prize organisation stimulated the development of a spacecraft capable of carrying 3 people over a certain distance, and of doing so repeatedly. When people can win a reward, they go for it!

Sequencing methods
Roche 454 Life Sciences: www.454.com, http://www.youtube.com/watch?v=bFNjxKHP8Jc
Illumina: www.illumina.com, http://www.youtube.com/watch?v=77r5p8IBwJk
Applied Biosystems SOLiD system: www3.appliedbiosystems.com, http://www.youtube.com/watch?v=nlvyF8bFDwM
454, a part of Roche, came out with the first alternative in 2005, later followed by the Genome Sequencer FLX (2006).

AB SOLiD
Fragment library is prepared using a universal primer (P1) attached to a magnetic bead. Emulsion PCR. Attach to glass plate.
From Wikipedia: A library of DNA fragments is prepared from the sample to be sequenced and is used to prepare clonal bead populations. That is, only one species of fragment will be present on the surface of each magnetic bead. The fragments attached to the magnetic beads will have a universal P1 adapter sequence attached so that the starting sequence of every fragment is both known and identical. Emulsion PCR takes place in microreactors containing all the necessary reagents for PCR. The resulting PCR products attached to the beads are then covalently bound to a glass slide.
Primers hybridize to the P1 adapter sequence within the library template. A set of four fluorescently labeled di-base probes compete for ligation to the sequencing primer. Specificity of the di-base probe is achieved by interrogating every 1st and 2nd base in each ligation reaction. Multiple cycles of ligation, detection and cleavage are performed, with the number of cycles determining the eventual read length. Following a series of ligation cycles, the extension product is removed and the template is reset with a primer complementary to the n-1 position for a second round of ligation cycles.
Five rounds of primer reset are completed for each sequence tag. Through the primer reset process, each base is interrogated in two independent ligation reactions by two different primers. For example, the base at read position 5 is assayed by primer number 2 in ligation cycle 2 and by primer number 3 in ligation cycle 1.
Mardis, Annu. Rev. Genomics Hum. Genet., 9 (2008), 387

AB SOLiD di-base Metzker, Nat. Rev. Genetics, 11 (2010), 31

AB SOLiD Metzker, Nat. Rev. Genetics, 11 (2010), 31

AB SOLiD
Every base will be interrogated twice: high accuracy
Round 1: every probe is done
Round 2: everything is cleaved off and an enzyme eats away the last base of the P1 adapter, so you start at position n-1
Round 3: again cleave off everything, eat an extra base away and add a bridge probe
Etc.

AB SOLiD Mardis, Annu. Rev. Genomics Hum. Genet., 9 (2008), 387

Other sequencing methods
Helicos (www.helicosbio.com): http://www.youtube.com/watch?v=TboL7wODBj4
Oxford Nanopore Technol. (www.nanoporetech.com): http://vimeo.com/20289048
Pacific Biosciences (www.pacificbiosciences.com): http://www.youtube.com/watch?v=NHCJ8PtYCFc&feature=related
CompleteGenomics (www.completegenomics.com)

References
Metzker, Nat. Rev. Genetics, 11 (2010), 31
Batley & Edwards, BioTechniques, 46 (2009), 333
Morozova & Marra, Genomics, 92 (2008), 255
Kahvejian, Quackenbush & Thompson, Nat. Biotechnol., 26 (2008), 1125
Mardis, Annu. Rev. Genomics Hum. Genet., 9 (2008), 387

Bioinformatics
Data, lots of data!!
NGS:
Raw data - terabytes
Text sequence data - gigabytes
Sequence variation data - mega/kilobytes

Metzker, Nat. Rev. Genetics, 11 (2010), 31 Update with latest figures!!!

SOLiD 5500XL
System accuracy: >99.99% (using ECC; ECC = Exact Call Chemistry)
Throughput/day: Beads 20-30 Gb, nBeads 30-45 Gb
Throughput/run: Beads ~180 Gb (2.8 B reads PE), nBeads ~300 Gb (4.8 B reads)
2 FlowChips (with 6 lanes each)
Read length: 75 bp, 75 bp/35 bp (PE) or 60 bp/60 bp (mate-pairs)
Run time: 1 day (35 bp, 1 lane), 7 days (75 bp/35 bp, 12 lanes)

More numbers
Images are no longer stored; too big
XSQ format (HDF5 based)
One lane PE (F3: 75 bases, R3: 35 bases): ~14 Gb XSQ file
Unpacked: ~65-70 Gb

Sequencing pipeline
Reads. SOLiD: csfasta & qual, 1 set for forward reads, 1 set for reverse reads
Convert to fastq format
Quality control
Then either resequencing (align against reference genome) or de novo assembly

Sequencing pipeline
Alignment fine tuning - influence on variant calling
Realignment: use information of the original alignment to fine-tune the local alignment
Duplicate marking: PCR amplifications
Recalibration: adjust the quality of base calls to better represent the actual situation

http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v3

File formats - SOLiD
Csfasta - base calls
Qual - qualities. Probability of the call being a particular base, expressed as a Phred score: Q = -10 log10(Pe), with Pe = estimated probability of error
F3.csfasta:
>3_1079_178_F3
T11032001032001032001032001032001032001030002002001
F3_QV.qual:
>3_1079_178_F3
28 33 32 33 32 22 30 33 31 33 26 33 29 32 33 22 30 33 22 27 24 26 25 22 27 12 27 22 7 33 15 14 29 8 20 15 23 5 8 16 9 27 18 14 15 7 15 4 4 9
R3.csfasta:
>3_1079_178_F5-P2
T22301002301002301022000022010002000
R3_QV.qual:
22 33 32 33 33 27 31 32 32 33 27 33 30 32 30 21 27 24 8 31 8 22 6 22 21 5 6 14 10 27 26 4 18 6 19
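The relation between a Phred quality value and the estimated error probability can be made concrete with a few lines of Python. This is a minimal sketch of the standard Phred definition; the function names are made up for illustration and are not part of any SOLiD tool.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality value Q into the estimated error probability Pe."""
    return 10 ** (-q / 10)


def error_probability_to_phred(pe):
    """Convert an estimated error probability Pe into a Phred quality value Q."""
    return -10 * math.log10(pe)


if __name__ == "__main__":
    # First few quality values of read 3_1079_178_F3 from the slide.
    for q in [28, 33, 32, 33, 32, 22]:
        print(q, phred_to_error_probability(q))
```

A quality of 30 therefore corresponds to an estimated 1 in 1000 chance that the call is wrong, and a quality of 20 to 1 in 100.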

Colorspace
>3_1079_178_F3
T11032001032001032001032001032001032001030002002001
Color code for each pair of adjacent bases (rows: first base, columns: second base):
     A  C  G  T
A    0  1  2  3
C    1  0  3  2
G    2  3  0  1
T    3  2  1  0
What is the sequence?
aldo:~/mySeq$ ls
<basename>_F3.csfasta <basename>_F3_QV.qual <basename>_R3.csfasta <basename>_R3_QV.qual
aldo:~/mySeq$ solid2fastq.pl <basename> <newTitle>
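To answer "What is the sequence?", the color table can be applied step by step, starting from the known primer base. The sketch below assumes the usual 2-bit coding A=0, C=1, G=2, T=3, under which the color of a base pair equals the XOR of the two codes; the function names are hypothetical.

```python
BASES = "ACGT"  # index in this string is the 2-bit code of the base


def color_of(first, second):
    """Color (0-3) that encodes the transition from base `first` to base `second`."""
    return BASES.index(first) ^ BASES.index(second)


def decode_colorspace(read):
    """Translate a csfasta read (primer base followed by colors) into bases.

    The primer base itself is not part of the decoded sequence.
    """
    previous = read[0]
    decoded = []
    for color in read[1:]:
        previous = BASES[BASES.index(previous) ^ int(color)]
        decoded.append(previous)
    return "".join(decoded)


if __name__ == "__main__":
    print(decode_colorspace("T11032001032001"))
```

For the read on the slide the decoded sequence starts with GTTAGG. Note that each base depends on all colors before it, so a single wrong color call changes every base that follows; this is one reason to align in colorspace rather than decode first.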

solid2fastq
BWA - bio-bwa.sourceforge.net
Fastq: one fastq file for forward reads, one for reverse reads
What happened? To the qualities? To the sequence?
Double encoding: not using 0123, but "ACGT" … to make life easy!!
F3:
@run27_sampleD05_1149_27042011:3_1079_178/1
CATGAACATGAACATGAACATGAACATGAACATGAACATAAAGAAGAAC
+
BABA7?B@B;B>AB7?B7<9;:7<-<7(B0/>)508&)1*<3/0(0%%*
solid2fastq knocks off the first two positions (the primer T and the first color, as the color coding works with pairs of bases)
R3:
@run27_sampleD05_1149_27042011:3_1079_178/2
GTACAAGTACAAGTACAGGAAAAGGACAAAGAAA
+
BABB<@AAB<B?A?6<9)@)7'76&'/+<;%3'4
There is no biological meaning in the double encoding!!!
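What happened to the sequence and to the qualities can be reproduced in a few lines. This is a sketch of the transformation as it appears in the example on the slide, not the code of solid2fastq.pl itself: colors 0, 1, 2, 3 are rewritten as A, C, G, T, the primer base and the first color are dropped, the first quality value is dropped, and the remaining qualities become ASCII characters. The offset of 33 is inferred from the slide's example; the function names are hypothetical.

```python
DOUBLE_ENCODING = {"0": "A", "1": "C", "2": "G", "3": "T"}


def double_encode(csfasta_read):
    """Drop the primer base and the first color, then write colors as letters."""
    colors = csfasta_read[2:]
    return "".join(DOUBLE_ENCODING[color] for color in colors)


def encode_qualities(quals, offset=33):
    """Drop the first quality value and encode the rest as ASCII characters."""
    return "".join(chr(q + offset) for q in quals[1:])


if __name__ == "__main__":
    print(double_encode("T11032001032001"))
    print(encode_qualities([28, 33, 32, 33, 32, 22, 30]))
```

The letters produced this way are only a relabeling of colors. They are not the bases of the read, which is why the double-encoded fastq has to be aligned against a reference indexed in colorspace.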

Quality Check FastQC www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ http://www.bioinformaticslaboratory.nl/twiki/bin/view/Sequencing/Solid20110407Exome

BWA
Burrows-Wheeler Alignment, aligning in colorspace
Li & Durbin, Bioinformatics, 25 (2009), 1754-1760
Index reference genome in colorspace (index):
bwa index -c -a <is|bwtsw> refFasta
Create SAI file per fastq file (suffix array index) of all the reads (aln):
bwa aln <Options> refFasta read1.fq.gz > read1.sai
Convert SA coordinates to chromosomal coordinates (samse/sampe) -> SAM file:
bwa sampe <Options> refFasta read1.sai read2.sai read1.fq.gz read2.fq.gz > out.sam

Stats
Coverage: the average number of reads covering a nucleotide in a reconstructed sequence
Coverage = (number of reads (N) * avg. length of reads (L)) / length of orig. genome (G)
G=2000, N=8, L=500 -> Cov.=2
On target
Per chromosome
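The coverage formula is simple enough to check by hand, but a small helper makes the arithmetic explicit. The function name is made up for illustration.

```python
def average_coverage(num_reads, avg_read_length, genome_length):
    """Average number of reads covering a nucleotide: N * L / G."""
    return num_reads * avg_read_length / genome_length


if __name__ == "__main__":
    # Example from the slide: G=2000, N=8, L=500
    print(average_coverage(num_reads=8, avg_read_length=500, genome_length=2000))
```

For targeted data such as exomes, the same calculation is usually restricted to reads that fall on target, with the target size taking the place of G.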

SAM/BAM
Sequence Alignment/Map
Li et al., Bioinformatics, 25 (2009), 2078-2079
BAM is the binary version of SAM (compressed)
run27_sampleD05_1149_27042011:3_1079_178 117 1 10052 0 * = 10052 0 TTTCTTTGTCCTTTTCCTGTACTTGTACTTGTAC 4'3%;<+/'&67'7)@)9<6?A?B<BAA@<BBAB RG:Z:RH1 CQ:Z:!4'3%;<+/'&67'7)@)9<6?A?B<BAA@<BBAB CS:Z:T22301002301002301022000022010002000
run27_sampleD05_1149_27042011:3_1079_178 153 1 10052 0 48M = 10052 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA 83>))"%X3(C8GQNGPVH[SH\RR\Z]]]\]]]]]]]]]]]]]]]]] X0:i:468 MD:Z:48 RG:Z:RH1 XG:i:0 AM:i:0 CM:i:0 SM:i:0 XM:i:3 XO:i:0 CQ:Z:!83>))"%X3(C8GQNGPVH[SH\RR\Z]]]\]]]]]]]]]]]]]]]]] CS:Z:T11032001032001032001032001032001032001030002002001 XT:A:R
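The second column of each SAM record (117 and 153 in the example) is a bitwise FLAG. The sketch below decodes it using the bit values from the SAM specification; the short labels are chosen here for readability and are not official names.

```python
# (bit, label) pairs; the bit values follow the SAM specification.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "read_unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "read_reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAGS if flag & bit]


if __name__ == "__main__":
    for flag in (117, 153):
        print(flag, decode_flag(flag))
```

For the two records on the slide, 117 marks a read that is itself unmapped, while 153 marks a mapped read on the reverse strand whose mate is unmapped.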

Samtools & PicardTools
samtools.sourceforge.net, picard.sourceforge.net
Tools to manipulate SAM/BAM files
Generating index for fasta file, needed for quick look-ups:
samtools faidx ref.fasta
Getting unique reads:
samtools view -uhq1 <in.bam> > <out.bam>
Marking duplicates (PicardTools):
java -jar MarkDuplicates.jar I=<in.bam> O=<out.bam> M=<metrics.txt>

Realignment - SRMA (Short Read Micro re-Aligner)
Homer & Nelson, Genome Biol., 11 (2010), R99
Figure: before realignment, after realignment

Recalibration - GATK
McKenna et al., Genome Res., 20 (2010), 1297-303
Problem: the reported QUAL scores in reads are NOT close enough to the actual probability of mismatching the reference genome.
Solution: correct for variation in quality with machine cycle and sequence context, by analyzing the covariation among several features of a base. For example:
Reported quality score
The position within the read
The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine
Probability of mismatching the reference genome
These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.
The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field of each read in the output BAM are more accurate, in that the reported quality score is closer to the actual probability of mismatching the reference genome. By correcting for variation in quality with machine cycle and sequence context, the tool provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.
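The idea of a tabular correction can be illustrated with a strongly simplified sketch. This is not GATK's implementation: it uses only two covariates (reported quality and position in the read), and the pseudocounts of 1 and 2 are an assumption made here to avoid taking the logarithm of zero. In practice, positions of known variants are excluded before mismatches are counted.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Build a recalibration table from aligned bases.

    `observations` is an iterable of (reported_quality, cycle, is_mismatch)
    tuples, one per aligned base. The result maps (reported_quality, cycle)
    to the empirical Phred quality observed for that bin.
    """
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for reported_quality, cycle, is_mismatch in observations:
        key = (reported_quality, cycle)
        totals[key] += 1
        if is_mismatch:
            mismatches[key] += 1

    table = {}
    for key, total in totals.items():
        # Smoothed mismatch rate (pseudocounts are an illustrative assumption).
        rate = (mismatches[key] + 1) / (total + 2)
        table[key] = round(-10 * math.log10(rate))
    return table


if __name__ == "__main__":
    # 998 bases reported as Q30 at cycle 1, of which 9 mismatch the reference.
    bases = [(30, 1, True)] * 9 + [(30, 1, False)] * 989
    print(empirical_qualities(bases))
```

In this toy example, bases reported as Q30 mismatch the reference about once in 100 times, so the table assigns them an empirical quality of 20. Recalibration then replaces the reported quality of every base in that bin by the empirical one.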

Variant calling - VarScan
Koboldt et al., Bioinformatics, 25 (2009), 2283-5
Reference nucleotide (reference genome): A
4 aligned reads (from patient) show A at the indicated position
6 aligned reads (from patient) show T at the indicated position
Total number of reads covering indicated position: 10
Frequency of reads supporting variant: 6/10 = 60%
Samtools' mpileup followed by VarScan
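The frequency calculation for a single position can be written out as follows. This is only the counting step from the slide, under the assumption that the bases observed at the position are given as a string; it is not VarScan's algorithm, which additionally applies thresholds such as minimum coverage, base quality and a p-value. The function name is hypothetical.

```python
from collections import Counter


def variant_frequencies(observed_bases, reference_base):
    """Fraction of reads supporting each non-reference base at one position."""
    depth = len(observed_bases)
    counts = Counter(observed_bases)
    return {
        base: count / depth
        for base, count in counts.items()
        if base != reference_base
    }


if __name__ == "__main__":
    # 4 reads show the reference base A, 6 reads show T.
    print(variant_frequencies("AAAATTTTTT", reference_base="A"))
```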

Annotation - Annovar
Wang et al., Nucl. Acids Res., 38 (2010), e164
One line per variant (headers were in bold on the slide; below, each header row is followed by its values)
Sample Func Gene ExonicFunc
D05_1149.bam exonic DIP2C synonymous SNV
AAChange Conserved SegDup 1000G_ALL
X NM_014974:c.G1248A:p.P416P 511 0.98
dbSNP130 dbSNP131 SIFT Chr Start End Ref Obs
rs6560837 rs6560837 10 445061 445061 C T
Chrom Position Ref Var Reads1 Reads2 VarFreq
chr10 445061 C T 0 3 100
Strands1 Strands2 Qual1 Qual2 Pvalue MapQual1 MapQual2
0 1 0 26 0.98 0 1
Reads1Plus Reads1Minus Reads2Plus Reads2Minus
0 0 3 0 T

Future perspectives
Microarrays out, NGS in! Instead of probes, just sequence all of it
www.1000genomes.org, GoNL
Personalized medicines
SNP detection
Influence of "junk" DNA
The 1000 Genomes Project is an international collaboration to produce an extensive public catalog of human genetic variation, including SNPs and structural variants, and their haplotype contexts. This resource will support genome-wide association studies and other medical research studies. The genomes of about 2000 unidentified people from about 20 populations around the world will be sequenced using next-generation sequencing technologies. The results of the study will be freely and publicly accessible to researchers worldwide.

To boldly go where no man has gone before!