Introduction to next generation sequencing
Rolf Sommer Kaas
National Food Institute, Technical University of Denmark
Outline: History; Next generation sequencing (454, Ion Torrent, Illumina, PacBio, MinION); Output; Data Analysis
History (timeline figure, starting in ‘77)
History: Human genome project (1998)
Random shotgun sequencing: fast, 300 mill. $
Hierarchical shotgun sequencing: 3 billion $
2001: Draft
2003: Complete
History: 2004, next generation sequencing
454 Life Sciences: parallelized pyrosequencing, reducing costs 6-fold.
Figure sources: Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Accessed 31-Oct-14; European Nucleotide Archive (ENA).
Next generation sequencing: platforms
- Roche, 454 Life Sciences (GS FLX Titanium)
- Life Technologies (Ion Torrent & Ion Proton)
- Illumina (HiSeq, MiSeq, Genome Analyzer)
- Pacific Biosciences (PacBio RS)
- Oxford Nanopore (MinION, PromethION, GridION)
Next generation sequencing: method outline (library)
1. Fragment DNA
2. Ligate adapters (amplification primer, sequencing primer, barcode)
3. Amplification
4. Sequencing
Next generation sequencing technologies: Ion Torrent
Problem with homopolymers; fast; expensive; long insert sizes; low throughput; cheapest
Next generation sequencing technologies: Illumina (Genome Analyzer, HiSeq, MiSeq)
Short reads (~ bp); good accuracy; high throughput
Next generation sequencing technologies: PacBio
Expensive; lower accuracy; long reads (~5000 bp)
Next generation sequencing technologies: Nanopore
Upcoming technology, released to select labs.
Up to 80,000 bp reads.
MinION: 150 million bp per 6 h (30x coverage of E. coli).
Instruments: MinION, GridION, PromethION
Next generation sequencing technologies: machine distribution
Illumina is the most common. ABI SOLiD is not as big as it appears.
Output: sample, reads, raw reads
Output: What is sequence data?
Sequence data is stored in FASTA files.
FASTA example: each record is a header/ID line followed by the sequence.
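The header/sequence layout of a FASTA file can be sketched in a few lines of Python. This is a minimal illustration, not a production parser: the function name `parse_fasta` and the two example records are hypothetical, and it assumes the usual convention that a header line starts with `>`.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records as {header: sequence}."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


# Hypothetical example data, not taken from the slides.
example = """>contig_1 hypothetical example
ACGTACGT
TTGACA
>contig_2
GGGCCC
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

Because the file is read as plain lines of text, any hidden formatting added by a word processor would break this kind of parsing.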
Output: Handling sequence data? Watch out!
The same FASTA file in Word: this should be fine…
What your data actually looks like: oh no! This won't work…
Take home message: use “pure text editors” and save files in “txt” format.
Examples: Notepad (Win), TextEdit (Mac), Sublime Text (all)
Output: What is the data? Fastq files
What is Fastq? Fasta + quality scores. 1 read, 4 lines.
Fastq example:
ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTTGTGCCTGAAAAGTGGGCGCA
+
ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC
+
AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT
+
AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG
+
bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc
Output: Fastq files, the 4 lines of each read
1. Header/ID
2. DNA sequence
3. Name field (optional)
4. Quality scores
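The "1 read, 4 lines" layout can be sketched as a small Python reader. The names `parse_fastq` and `phred_scores` and the example read are hypothetical. It assumes the common convention that the header line starts with `@` and the name line with `+`. The quality offset is also an assumption: 33 is the usual default today, while the lowercase letters in the quality string on the slide suggest the older offset of 64.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) for each 4-line record."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of 4")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at row {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


# Hypothetical single read; offset 64 is assumed for the decoding below.
example = ["@read_1", "ACNGT", "+", "bbbee"]
fastq_records = list(parse_fastq(example))
for header, seq, qual in fastq_records:
    print(header, seq, phred_scores(qual, offset=64))
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q = 20 means roughly one wrong base call in a hundred.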
Output: Paired and single end
Single end reads
Insert size (e.g. 300 bp)
Paired end reads
Long insert size (e.g. bp)
Output: Splitting & clipping data
Splitting Fastq data using barcodes (aka multiplexing).
De-multiplexing is usually done by the sequencer.
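Splitting reads by barcode and clipping the barcode off can be sketched as below. This is a simplified illustration, not how any particular sequencer does it: the function `demultiplex`, the barcodes, and the sample names are hypothetical, and it assumes the barcode sits at the start of the read and must match exactly.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality)


def demultiplex(reads: List[Read], barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Assign each read to a sample by its leading barcode, then clip the barcode."""
    bins: Dict[str, List[Read]] = defaultdict(list)
    for header, seq, qual in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                cut = len(barcode)
                # Clip the barcode from both the sequence and the quality string.
                bins[sample].append((header, seq[cut:], qual[cut:]))
                break
        else:
            # No barcode matched: keep the read untouched in a separate bin.
            bins["unassigned"].append((header, seq, qual))
    return dict(bins)


# Hypothetical barcodes and reads.
barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}
reads = [
    ("r1", "ACGTGGGCCC", "IIIIHHHHHH"),
    ("r2", "TTAGAAAT", "IIIIIIII"),
    ("r3", "GGGGAAAA", "IIIIIIII"),
]
split = demultiplex(reads, barcodes)
for sample, sample_reads in split.items():
    print(sample, sample_reads)
```

Real de-multiplexers typically also tolerate a mismatch or two in the barcode; that is left out here to keep the sketch short.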
Output: Data quality
Output: Data quality. Trimming data
In the Fastq example, the ambiguous base N near the start of the first read (the read starting AC N GTG) is highlighted.
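One simple form of trimming removes low-quality bases from the 3' end of a read. The sketch below is only one of several possible strategies and is not taken from the slides: the function `trim_3prime`, the threshold of 20, and the example read are assumptions, and a Phred offset of 33 is assumed.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Cut bases from the 3' end until one has Phred quality >= min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    # Sequence and quality must always be trimmed together.
    return seq[:end], qual[:end]


# Hypothetical read: 'I' decodes to Q40, '#' to Q2 and '!' to Q0 at offset 33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII#!#")
print(trimmed_seq, trimmed_qual)
```

Dedicated trimming tools usually work with sliding windows and also handle adapters and ambiguous bases; this sketch only shows the basic idea.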
Output: Data quality. Coverage & Depth
Coverage: average number of times each position in the genome is covered by reads.
C = N * L / G
N: number of reads
L: read length
G: genome size
Example: N = 5 mill., L = 100 bp, G = 5 Mbp
C = 5 * 100 / 5 = 100X
On average, 100 reads cover each position in the genome.
Depth: number of reads that cover a particular position in the genome (reads per site = depth).
Breadth-of-coverage (target or assembly): c = assembly size / target size
Example: assembly = 4.9 mill., target = 5 mill.
c = 4.9 / 5 = 0.98
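The two worked examples can be reproduced with a couple of one-line functions. The function names are hypothetical; the numbers are the ones from the slide.

```python
def average_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """C = N * L / G."""
    return n_reads * read_length / genome_size


def breadth_of_coverage(assembly_size: int, target_size: int) -> float:
    """c = assembly size / target size."""
    return assembly_size / target_size


# Values from the slide: 5 million reads of 100 bp on a 5 Mbp genome.
coverage = average_coverage(5_000_000, 100, 5_000_000)
# Values from the slide: 4.9 Mbp assembled against a 5 Mbp target.
breadth = breadth_of_coverage(4_900_000, 5_000_000)
print(f"coverage = {coverage:.0f}X, breadth = {breadth:.2f}")
```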
Output: Data storage & Access
International Nucleotide Sequence Database Collaboration (INSDC)
- Europe: European Bioinformatics Institute (EBI)
- United States: National Center for Biotechnology Information (NCBI)
- Asia: DNA Data Bank of Japan (DDBJ)
Data Analysis: workflow
1. Data splitting, clipping, and trimming
2. Either de novo assembly, followed by further analysis (e.g. gene finding)
3. Or mapping to a reference, followed by further analysis (e.g. SNP trees)
Data Analysis: Bioinformatic platforms
Unix: Mac OS X, Linux
DOS: Windows
Bioinformatic tools: CLC bio, MEGA, and Geneious
Unix…
Data Analysis: Bioinformatic platforms. Web-tools to the rescue!
+ Platform independent
+ Requires little computer resources
+ Can be done everywhere
- Requires patience
Available analyses include MLST, resistance genes, SNP calling and tree creation, and species identification.
Other platforms offer many NGS tools but have a steep learning curve.
Data Analysis: De novo assembly
Different sequencers require different assemblers, depending on output and error profile.
Assembler Newbler: 454, Ion Torrent
Assembler Velvet: Illumina, ABI SOLiD (color space)
Data Analysis: De novo assembly. Velvet: the unnecessarily complex assembler
K-mer based assembler.
User needs to set K; longer reads allow a larger K.
Everything is defined in “Kmer-space”:
Nucleotide length = Kmer_length + K - 1
Kmer_coverage = Nucleotide_coverage * (Read_length - K + 1) / Read_length
Data Analysis: De novo assembly. Velvet assembly example
>NODE_1_length_91928_cov_
AGTTCATTGATAAATCTTTTTTGATTATCATCAACGAGTGCCCACACAGATTGATTGGTT
TATATTGTTAAAGAGCTTTTCCTATCGAAATCGCTTTTAAGCTCAATTCGCTAGGGCTGC
GTATATTACGCTTATTCAGTTGAGTGTCAAACGTTATTTTCTA...
K = 83
Nucleotide length = Kmer_length + K - 1 = 91928 + 83 - 1 = 92010
Kmer_coverage = Nucleotide_coverage * (Read_length - K + 1) / Read_length
Nucleotide_coverage = Kmer_coverage / ((300 - 83 + 1) / 300) = 31.84
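The conversions between “Kmer-space” and nucleotide space can be checked with a short sketch. The function names are hypothetical and this is not part of Velvet itself; the formulas and the values K = 83, read length 300, node length 91928, and nucleotide coverage 31.84 are the ones quoted on the slides.

```python
def nucleotide_length(kmer_length: int, k: int) -> int:
    """Nucleotide length = Kmer_length + K - 1."""
    return kmer_length + k - 1


def kmer_coverage(nucleotide_cov: float, read_length: int, k: int) -> float:
    """Kmer_coverage = Nucleotide_coverage * (Read_length - K + 1) / Read_length."""
    return nucleotide_cov * (read_length - k + 1) / read_length


def nucleotide_coverage(kmer_cov: float, read_length: int, k: int) -> float:
    """Inverse of kmer_coverage: convert back to nucleotide space."""
    return kmer_cov * read_length / (read_length - k + 1)


contig_bp = nucleotide_length(91928, 83)
kcov = kmer_coverage(31.84, read_length=300, k=83)
print(contig_bp, round(kcov, 2), round(nucleotide_coverage(kcov, 300, 83), 2))
```

The k-mer coverage is always lower than the nucleotide coverage, because a read of length L only contains L - K + 1 k-mers.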
Data Analysis: De novo quality check
Number of contigs: fewer is generally better.
Total size of contigs.
N50: with contigs sorted from largest to smallest, the size of the contig at which the cumulative size reaches 50% of the total size of contigs.
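N50 can be computed directly from a list of contig sizes. A minimal sketch follows; the function name `n50` and the contig sizes are hypothetical.

```python
from typing import List


def n50(contig_lengths: List[int]) -> int:
    """Size of the contig at which the cumulative size reaches 50% of the total."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of the total size
            return length
    raise AssertionError("unreachable for a non-empty list")


# Hypothetical contig sizes in bp: total 290, half 145, reached by the second contig.
contigs = [80, 70, 50, 40, 30, 20]
print(len(contigs), sum(contigs), n50(contigs))
```

A larger N50 means that more of the assembly sits in long contigs, which is why it is usually reported together with the number of contigs and the total size.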
Data Analysis: workflow
Data splitting, clipping, and trimming; de novo assembly; further analysis (e.g. gene finding)
Data Analysis: Further data analysis
Contigs: gene finding, resistance, MLST, etc.
Data Analysis: Further data analysis. Genes are not just genes…
Find genes by open reading frames + Shine-Dalgarno + motifs.
Not there does not mean it is NOT there: the gene may be not assembled or truncated.
“Hypothetical” & “Putative”, the curse of bioinformatics:
- Annotated gene: verified in the lab
- “Hypothetical” or “Putative” annotations
- No match to original sequence
- The evil circle of BLAST similarity
Suggested annotation service: RAST
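The open reading frame part of gene finding can be sketched as a scan for a start codon followed by an in-frame stop codon. This is a deliberately simplified illustration, not what an annotation service does: the function `find_orfs` and the example sequence are hypothetical, only ATG is treated as a start codon, only the forward strand is scanned, and Shine-Dalgarno sites and other motifs are ignored.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) for ATG...stop stretches on the forward strand.

    Coordinates are 0-based and end-exclusive; min_codons counts the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical sequence with one ORF (ATG AAA TTT TAG) in the third frame.
example_seq = "CCATGAAATTTTAGGG"
orfs = find_orfs(example_seq)
for start, end, frame in orfs:
    print(frame, start, end, example_seq[start:end])
```

An ORF that runs off the end of a contig without a stop codon is not reported by this sketch, which is one way a truncated gene can go missing.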
Data Analysis: workflow
Data splitting, clipping, and trimming; de novo assembly with further analysis (e.g. gene finding), or mapping to a reference
Data Analysis: Mapping to a reference
Raw reads are aligned to a reference sequence.
Some reads do not match the reference.
Some regions of the reference do not match any reads.
Mappers: BWA, Bowtie, MAQ, CGE
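The outcome of mapping (covered positions, uncovered positions, and reads that match nowhere) can be illustrated with a toy exact-match mapper. This is not how BWA, Bowtie, or MAQ work, since real mappers use indexes and tolerate mismatches. The function `map_reads`, the reference, and the reads are hypothetical, and each read is placed only at its first exact match.

```python
from typing import List, Tuple


def map_reads(reference: str, reads: List[str]) -> Tuple[List[int], List[str]]:
    """Place each read at its first exact match and count per-position depth."""
    depth = [0] * len(reference)
    unmapped = []
    for read in reads:
        pos = reference.find(read)
        if pos == -1:
            unmapped.append(read)  # read does not match the reference
            continue
        for i in range(pos, pos + len(read)):
            depth[i] += 1
    return depth, unmapped


# Hypothetical reference and reads.
reference = "AACCGGTTAC"
depth, unmapped = map_reads(reference, ["AACC", "CCGG", "TTTT"])
print(depth)
print(unmapped)
```

Positions with depth 0 are reference regions that no read matched, and the `unmapped` list holds reads that did not match the reference.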
Thank you for listening
Questions?