MIK Bachelor seminars Sequencing in practice


1 MIK Bachelor seminars Sequencing in practice
The technology Aldo Jongejan

2 What do we want?
Discover the genetic cause of disease: fundamental research, systems biology
Discover why people react differently to medicines: medicinal chemistry, personalized medicine
Diagnostics: quick tests, genetic makeup, risk factors

3 Sequencing Updated to Dec. 2012

4 How does it work?
[Figure: workflow from patient DNA to candidate genes (Gene A, Gene B)]
1 Produce shotgun library from DNA (patient)
2 Capture exon sequences
3 Wash & sequence
4 Map against reference genome
5 Determine variants, filter, compare patients -> candidate genes

5 Sequencing methods “old style” - 1977
Frederick Sanger (1918): structure of insulin, sequencing of nucleic acids
Allan Coulson
Chain termination, dye termination
Frederick Sanger & Allan Coulson (based on the method of Maxam & Gilbert, but using fewer toxic chemicals and radioisotopes, and therefore easier to use). In use for more than 30 years! Won the Nobel Prize twice (insulin, 1958; sequencing of nucleic acids, 1980). Extra:

6 Sanger method Single-stranded DNA Primer Cloned in plasmid
Denaturation Cloned in bacteriophage M13 Cloned in phagemid PCR Primer
Disadvantage of M13: only small fragments can be incorporated. A phagemid circumvents this.

7 Dye termination Label each ddNTP with a fluorescent marker

8 High Throughput Sequencing
Sanger method
Laborious sample preparation: cloning/amplification
Read length determined by electrophoresis
Capillary electrophoresis machines are the bottleneck
HUGO: 8 yr. preparation / 5 yr. sequencing
New methods highly desirable!
Some inserts are unclonable due to a deleterious effect on the host.
Longer strands: smaller percentage difference between strands of length N and N+1.

9 Contest
Accuracy: < 1 error per 100,000 bases
Coverage: > 98% of the genome
Costs: < $10k per genome
The X Prize organization stimulated the development of a spacecraft able to carry 3 people over a certain distance, and to do so several times a day. When people are offered a reward, they go for it!
"the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 (US) per genome."

10 Sequencing methods
Roche 454 Life Sciences
Illumina
Applied Biosystems SOLiD system www3.appliedbiosystems.com
454, a part of Roche, introduced the first alternative in 2005, later followed by the Genome Sequencer FLX (2006).

11 AB SOLiD
Fragment library is prepared using a universal primer (P1) attached to a magnetic bead
Emulsion PCR
Attach to glass plate
From Wikipedia: A library of DNA fragments is prepared from the sample to be sequenced, and is used to prepare clonal bead populations. That is, only one species of fragment will be present on the surface of each magnetic bead. The fragments attached to the magnetic beads will have a universal P1 adapter sequence attached so that the starting sequence of every fragment is both known and identical. Emulsion PCR takes place in microreactors containing all the necessary reagents for PCR. The resulting PCR products attached to the beads are then covalently bound to a glass slide.
Primers hybridize to the P1 adapter sequence within the library template. A set of four fluorescently labeled di-base probes compete for ligation to the sequencing primer. Specificity of the di-base probe is achieved by interrogating every 1st and 2nd base in each ligation reaction. Multiple cycles of ligation, detection and cleavage are performed with the number of cycles determining the eventual read length. Following a series of ligation cycles, the extension product is removed and the template is reset with a primer complementary to the n-1 position for a second round of ligation cycles.
Five rounds of primer reset are completed for each sequence tag. Through the primer reset process, each base is interrogated in two independent ligation reactions by two different primers. For example, the base at read position 5 is assayed by primer number 2 in ligation cycle 2 and by primer number 3 in ligation cycle 1.
Mardis, Annu. Rev. Genomics Hum. Genet., 9 (2008), 387

12 AB SOLiD di-base Metzker, Nat. Rev. Genetics, 11 (2010), 31

13 AB SOLiD Metzker, Nat. Rev. Genetics, 11 (2010), 31

14 AB SOLiD
Round 1: all probe ligation cycles are run
Round 2: everything is cleaved off and an enzyme eats away the last base of the P1 adapter, so you start at position n-1
Round 3: again cleave off everything, eat away one more base and add a bridge probe
Etc.
Every base is interrogated twice: high accuracy

15 AB SOLiD Mardis, Annu. Rev. Genomics Hum. Genet., 9 (2008), 387

16 Other sequencing methods
Helicos
Oxford Nanopore Technol.
Pacific Biosciences
CompleteGenomics

17 References
Metzker, Nat. Rev. Genetics, 11 (2010), 31
Batley & Edwards, BioTechniques, 46 (2009), 333
Morozova & Marra, Genomics, 92 (2008), 255
Kahvejian, Quackenbush & Thompson, Nat. Biotechnol., 26 (2008), 1125
Mardis, Annu. Rev. Genomics Hum. Genet., 9 (2008), 387

18 Bioinformatics
Data, lots of data!!
NGS: raw data - terabytes
Text-sequence data - gigabytes
Sequence variation data - mega/kilobytes

19 Metzker, Nat. Rev. Genetics, 11 (2010), 31
Update with latest figures!!!

20 SOLiD 5500XL
System accuracy: >99.99% (using ECC; ECC = Exact Call Chemistry)
Throughput/Day: Beads – Gb, nBeads – Gb
Throughput/Run: Beads ~180 Gb (2.8 B reads PE), nBeads ~300 Gb (4.8 B reads)
2 FlowChips (with 6 lanes each)
Read length: 75 bp, 75 bp/35 bp (PE) or 60 bp/60 bp (mate-pairs)
Run time: 1 day (35 bp, 1 lane), 7 days (75 bp/35 bp, 12 lanes)

21 More numbers
Images no longer stored; too big
XSQ format (HDF5 based)
One lane PE (F3: 75 bases, R3: 35 bases): ~14 Gb XSQ file
Unpacked: ~65-70 Gb

22 Sequencing pipeline
Reads
SOLiD: csfasta & qual; 1 set for forward reads, 1 set for reverse reads
Convert to fastq format
Quality control
Resequencing: align against reference genome
De novo assembly

23 Sequencing pipeline
Alignment fine-tuning: influences variant calling
Realignment: use information from the original alignment to fine-tune the local alignment
Duplicate marking: flag reads that stem from PCR amplification
Recalibration: adjust the quality of base calls to better represent the actual situation

24 http://www.broadinstitute.org/gsa/wiki/index

25 File formats - SOLiD
Csfasta - basecalls
Qual - qualities, given as Phred scores: Q = -10 log10(Pe), with Pe = estimated probability of error in the call
>3_1079_178_F3 T
>3_1079_178_F3
R3.csfasta: >3_1079_178_F5-P2 T
R3_QV.qual:
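The relation between a Phred score and the error probability Pe can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names are hypothetical, and the parser assumes a .qual record line holds space-separated integer quality values.

```python
import math


def phred_from_error(p_error: float) -> float:
    """Phred score Q = -10 * log10(Pe) for an error probability Pe."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Inverse relation: Pe = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def parse_qual_line(line: str) -> list[int]:
    """Split one quality line (assumed: space-separated integers) into scores."""
    return [int(token) for token in line.split()]


if __name__ == "__main__":
    for q in parse_qual_line("10 20 30"):
        print(q, error_from_phred(q))
```

A score of 20 thus corresponds to a 1-in-100 chance that the call is wrong, and a score of 30 to 1 in 1000.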

26 Colorspace
>3_1079_178_F3 T
                 Second base
                 A  C  G  T
First base   A   0  1  2  3
             C   1  0  3  2
             G   2  3  0  1
             T   3  2  1  0
What is the sequence?
aldo:~/mySeq$ ls
<basename>_F3.csfasta <basename>_F3_QV.qual <basename>_R3.csfasta <basename>_R3_QV.qual
aldo:~/mySeq$ solid2fastq.pl <basename> <newTitle>
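To answer "What is the sequence?" by hand, start from the known primer base and apply one colour at a time. The Python sketch below does the same. It assumes the standard SOLiD two-base code, in which identical neighbouring bases give colour 0; the function names are illustrative, and the short read used in the example is made up, not taken from the slide.

```python
BASES = "ACGT"


def encode_color(first: str, second: str) -> int:
    """Colour of a base pair: XOR of the base indices (A=0, C=1, G=2, T=3)."""
    return BASES.index(first) ^ BASES.index(second)


def decode_colorspace(read: str) -> str:
    """Translate a csfasta read (primer base followed by colours) to bases.

    The primer base itself is not part of the returned sequence.
    """
    previous = read[0]
    decoded = []
    for color in read[1:]:
        if color not in "0123":
            raise ValueError(f"cannot decode colour {color!r}")
        # The colour plus the previous base fixes the next base.
        previous = BASES[BASES.index(previous) ^ int(color)]
        decoded.append(previous)
    return "".join(decoded)


if __name__ == "__main__":
    print(decode_colorspace("T0123"))
```

Because every base depends on all colours before it, a single miscalled colour changes the rest of the decoded read. That is one reason to align in colorspace rather than decode first.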

27 solid2fastq
BWA - bio-bwa.sourceforge.net
Fastq: one fastq file for forward reads, one for reverse reads
What happened? To the qualities? To the sequence?
Double encoding: not using 0123, but “ACGT” … to make life easy!!
@run27_sampleD05_1149_ :3_1079_178/1
CATGAACATGAACATGAACATGAACATGAACATGAACATAAAGAAGAAC
+
solid2fastq clips off the first two positions (the primer T and the first base, because the color coding works with pairs of bases)
R3:
@run27_sampleD05_1149_ :3_1079_178/2
GTACAAGTACAAGTACAGGAAAAGGACAAAGAAA
+
There is no biological meaning in the double encoding!!!
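The double encoding described here is a plain character substitution. The sketch below illustrates the idea in Python. It is not a copy of solid2fastq.pl: the function name is hypothetical, and mapping an uncalled colour "." to "N" is an assumption.

```python
# Colours 0123 are written as the letters ACGT; '.' (no call) becomes N (assumed).
DOUBLE_ENCODING = str.maketrans("0123.", "ACGTN")


def double_encode(cs_read: str) -> str:
    """Drop the primer base and the first colour, then relabel the colours.

    The letters in the result are still colours, not nucleotides.
    """
    return cs_read[2:].translate(DOUBLE_ENCODING)


if __name__ == "__main__":
    print(double_encode("T0123012"))
```

The output looks like DNA but must only be aligned against a reference that was indexed in colorspace.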

28 Quality Check FastQC www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

29 BWA
Li & Durbin, Bioinformatics, 25 (2009)
Burrows-Wheeler Alignment
Aligning in colorspace
Index reference genome in colorspace (index):
bwa index -c -a <is|bwtsw> refFasta
Create SAI file per fastq file (suffix array index) of all the reads (aln):
bwa aln <Options> refFasta read1.fq.gz > read1.sai
Convert SA coordinates to chromosomal coordinates (samse/sampe) -> SAM file:
bwa sampe <Options> refFasta read1.sai read2.sai read1.fq.gz read2.fq.gz > out.sam

30 Stats
Coverage: the average number of reads covering a nucleotide in the reconstructed sequence
Coverage = (number of reads N * average read length L) / length of the original genome G
G=2000, N=8, L=500 -> Cov.=2
On target
Per chromosome
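The coverage formula is easy to check with the numbers on the slide. The Python sketch below is a minimal illustration with hypothetical function names; the second helper simply inverts the formula to estimate how many reads a target coverage requires.

```python
import math


def mean_coverage(n_reads: int, read_length: float, genome_length: int) -> float:
    """Average coverage = N * L / G."""
    return n_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: float, genome_length: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


if __name__ == "__main__":
    print(mean_coverage(n_reads=8, read_length=500, genome_length=2000))
    print(reads_needed(target_coverage=2, read_length=500, genome_length=2000))
```

This is an average only: individual positions can be covered much more or much less, which is why the on-target and per-chromosome statistics are reported separately.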

31 SAM/BAM
Li et al., Bioinformatics, 25 (2009)
Sequence Alignment/Map
BAM is the binary (compressed) version of SAM
run27_sampleD05_1149_ :3_1079_ * = TTTCTTTGTCCTTTTCCTGTACTTGTACTTGTAC RG:Z:RH1 CS:Z:T
run27_sampleD05_1149_ :3_1079_ 48M = CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAA 83>))"%X3(C8GQNGPVH[SH\RR\Z]]]\]]]]]]]]]]]]]]]]] X0:i:468 MD:Z:48 RG:Z:RH1 XG:i:0 AM:i:0 CM:i:0 SM:i:0 XM:i:3 XO:i:0 CQ:Z:!83>))"%X3(C8GQNGPVH[SH\RR\Z]]]\]]]]]]]]]]]]]]]]] CS:Z:T XT:A:R

32 Samtools & PicardTools
samtools.sourceforge.net, picard.sourceforge.net
Tools to manipulate SAM/BAM files
Generating index for fasta file, needed for quick lookups:
samtools faidx ref.fasta
Getting unique reads (mapping quality of at least 1):
samtools view -uhq1 <in.bam> > <out.bam>
Marking duplicates (PicardTools):
java -jar MarkDuplicates.jar I=<in.bam> O=<out.bam> M=<metrics.txt>

33 Realignment – SRMA
Homer & Nelson, Genome Biol., 11 (2010), R99
SRMA: Short Read Micro re-Aligner
[Figure: alignment before realignment and after realignment]

34 Recalibration - GATK
McKenna et al., Genome Res., 20 (2010)
Problem: reported QUAL scores in reads are NOT close enough to the actual probability of mismatching the reference genome.
Solution: correct for variation in quality with machine cycle and sequence context, by analyzing the covariation among several features of a base:
Reported quality score
The position within the read
The preceding and current nucleotide (sequencing chemistry effect) observed by the sequencing machine
Probability of mismatching the reference genome
These covariates are then applied through a piecewise tabular correction to recalibrate the quality scores of all reads in a BAM file.
The tools in this package recalibrate base quality scores of sequencing-by-synthesis reads in an aligned BAM file. After recalibration, the quality scores in the QUAL field in each read in the output BAM are more accurate in that the reported quality score is closer to its actual probability of mismatching the reference genome. Moreover, the recalibration tool attempts to correct for variation in quality with machine cycle and sequence context, and by doing so provides not only more accurate quality scores but also more widely dispersed ones. The system works on BAM files coming from many sequencing platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences, etc.
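The idea of a tabular correction can be illustrated with a toy example in Python. This is a sketch of the principle only, not GATK's implementation: the observation tuples, the function names and the add-one smoothing are all assumptions made for the illustration. Bases are binned by reported quality, position in the read and dinucleotide context, and each bin receives the quality that matches its observed mismatch rate.

```python
import math
from collections import defaultdict


def build_table(observations):
    """Count mismatches per covariate bin.

    observations: iterable of (reported_q, cycle, context, is_mismatch),
    a hypothetical layout chosen for this sketch.
    """
    table = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total bases]
    for reported_q, cycle, context, is_mismatch in observations:
        cell = table[(reported_q, cycle, context)]
        cell[0] += int(is_mismatch)
        cell[1] += 1
    return table


def empirical_quality(mismatches: int, total: int) -> int:
    """Phred score of the observed mismatch rate, with add-one smoothing (assumed)."""
    p_mismatch = (mismatches + 1) / (total + 2)
    return round(-10 * math.log10(p_mismatch))


def recalibrate(reported_q, cycle, context, table):
    """Look up the empirical quality; keep the reported one for unseen bins."""
    cell = table.get((reported_q, cycle, context))
    if cell is None:
        return reported_q
    return empirical_quality(cell[0], cell[1])


if __name__ == "__main__":
    # 998 bases reported as Q30 at cycle 1 after an A: 9 of them mismatch.
    obs = [(30, 1, "AC", True)] * 9 + [(30, 1, "AC", False)] * 989
    print(recalibrate(30, 1, "AC", build_table(obs)))
```

In this toy data set the bases reported as Q30 mismatch the reference about once per 100 bases, so they are rescored to roughly Q20. A real pipeline would leave known variant sites out of the mismatch counts so that true variants are not counted as sequencing errors.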

35 Variant calling - VarScan
Koboldt et al., Bioinformatics, 25 (2009)
[Figure: 10 reads from the patient aligned to the reference genome; at the indicated position 4 reads show A and 6 reads show T]
Reference nucleotide: A
Total number of reads covering indicated position: 10
Frequency of reads supporting variant: 6/10 = 60%
Samtools’ mpileup followed by VarScan
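The frequency calculation on this slide can be reproduced with a short Python sketch. It is an illustration only, not how VarScan is implemented: the function name is hypothetical, and the input is assumed to be the plain list of bases observed at one position, whereas real mpileup output encodes reference matches as "." and ",".

```python
from collections import Counter


def variant_frequency(bases, reference):
    """Return (variant base, supporting reads, depth, frequency) for one position.

    The most frequent non-reference base is reported; the variant base is
    None when there is no coverage or every read matches the reference.
    """
    counts = Counter(base.upper() for base in bases)
    depth = sum(counts.values())
    non_reference = {b: n for b, n in counts.items() if b != reference.upper()}
    if depth == 0 or not non_reference:
        return None, 0, depth, 0.0
    variant, supporting = max(non_reference.items(), key=lambda item: item[1])
    return variant, supporting, depth, supporting / depth


if __name__ == "__main__":
    # The slide's example: 4 reads with A, 6 reads with T, reference A.
    print(variant_frequency("AAAATTTTTT", "A"))
```

A caller would then apply thresholds, for example a minimum depth and a minimum variant frequency, before reporting the position.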

36 Annotation - Annovar
Wang et al., Nucl. Acids Res., 38 (2010), e164
One line per variant…in bold the header
Sample Func Gene ExonicFunc
D05_1149.bam exonic DIP2C synonymous SNV
AAChange Conserved SegDup 1000G_ALL
X NM_014974:c.G1248A:p.P416P
dbSNP130 dbSNP131 SIFT Chr Start End Ref Obs
rs rs C T
Chrom Position Ref Var Reads1 Reads2 VarFreq
chr C T
Strands1 Strands2 Qual1 Qual2 Pvalue MapQual1 MapQual2 Reads1Plus Reads1Minus Reads2Plus Reads2Minus
T

37

38 Future perspectives
Microarrays out, NGS in! Instead of probes, just sequence all of it
GoNL
Personalized medicines
SNP detection
Influence of “junk” DNA
The 1000 Genomes Project is an international collaboration to produce an extensive public catalog of human genetic variation, including SNPs and structural variants, and their haplotype contexts. This resource will support genome-wide association studies and other medical research studies. The genomes of about 2000 unidentified people from about 20 populations around the world will be sequenced using next-generation sequencing technologies. The results of the study will be freely and publicly accessible to researchers worldwide.

39 To boldly go where no man has gone before!

