Introduction to next generation sequencing Rolf Sommer Kaas.


National Food Institute, Technical University of Denmark
Outline: History; Next generation sequencing (454, Ion Torrent, Illumina, PacBio, MinION); Output; Data Analysis

History [timeline figure, starting in ’77]

History: Human genome project. Hierarchical Shotgun Sequencing: 3 billion $. 1998, Random Shotgun Sequencing: fast, 300 mill. $.

History: Human genome project. 2001: Draft. 2003: Complete.

History: 2004, Next Generation Sequencing. 454 Life Sciences: parallelized pyrosequencing. Reduced costs 6-fold.

History: 2004, Next Generation Sequencing. Sources: Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Accessed 31-oct-14; European Nucleotide Archive (ENA).

Next generation sequencing platforms:
Roche, 454 Life Sciences (GS FLX Titanium)
Life Technologies (Ion Torrent & Ion Proton)
Illumina (HiSeq, MiSeq, GenomeAnalyzer)
Pacific Biosciences (PacBio RS)
Oxford Nanopore (MinION, PromethION, GridION)

Next generation sequencing: Method outline (library)
1. Fragment DNA
2. Ligate adapters (amplification primer, sequencing primer, barcode)
3. Amplification
4. Sequencing

Next generation sequencing technologies: Ion Torrent. Problem with homopolymers; Fast; Expensive; Long insert sizes; Low throughput; Cheapest.

Next generation sequencing technologies: Illumina (Genome Analyzer, HiSeq, MiSeq). Short reads (~ bp); Good accuracy; High throughput.

Next generation sequencing technologies: PacBio. Expensive; Lower accuracy; Long reads (~5000 bp).

Next generation sequencing technologies: Nanopore. Upcoming technology, released to select labs. Up to 80,000 bp reads. MinION: 150 mill. bp per 6 h (30x coverage of E. coli). Devices: MinION, GridION, PromethION.

Next generation sequencing technologies: Machine distribution. Illumina is the most common. ABI SOLiD is not as big as it appears.

Output: Sample, reads, raw reads.

Output: What is sequence data? Sequence data is stored in fasta files. Each Fasta record consists of a Header/ID line followed by the Sequence.
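The slide's Fasta example did not survive in the transcript, so the following is a minimal sketch of reading the header/sequence layout described above. The record names and sequences are hypothetical, and the parser assumes the common convention that a header line starts with ">".

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted lines.

    Assumes each record starts with a '>' header line; sequence lines
    that follow are concatenated until the next header.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Hypothetical two-record example (not taken from the slides).
example = """>seq1 hypothetical example record
ACGTACGTAC
GTTAGC
>seq2
TTGACA
"""
records = parse_fasta(example.splitlines())
for name, sequence in records.items():
    print(name, len(sequence))
```

Because the format is plain text, a file saved from a word processor with hidden formatting would generally not parse this way.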

Output: Handling sequence data? Watch out!
The same FASTA file opened in Word looks like it should be fine, but what your data actually looks like, with the hidden formatting included, won't work.
Take home message: Use “pure text editors”. Examples: Notepad (Win), TextEdit (Mac), Sublime Text (all). Save files in “txt” format.

Output: What is the data? Fastq files
What is Fastq? Fasta + quality scores. 1 read, 4 lines:
1. Header/ID
2. DNA sequence
3. Name field (optional)
4. Quality scores
Fastq example: ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTTGTGCCTGAAAAGTGGGCGCA + ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC + AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT + AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG + bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc
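The "1 read, 4 lines" layout can be sketched as a small reader. This is an illustration only: the read ID below is hypothetical (the header lines are missing from the transcript), and the code assumes the usual convention that the header line starts with "@" and the name field line starts with "+".

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    header: str    # line 1: Header/ID (without the leading '@')
    sequence: str  # line 2: DNA sequence
    name: str      # line 3: optional name field (without the leading '+')
    quality: str   # line 4: one quality character per base


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield one FastqRecord per group of four non-blank lines."""
    clean = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(clean) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per read")
    for i in range(0, len(clean), 4):
        header, seq, name, qual = clean[i:i + 4]
        if not header.startswith("@") or not name.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], seq, name[1:], qual)


# Hypothetical read: the ID is invented; bases and quality characters
# are the first ten symbols of the lines shown on the slide.
example = "@read1\nACNGTGTTTT\n+\nbbbeeeeegg\n"
fastq_records = list(parse_fastq(example.splitlines()))
print(fastq_records[0])
```

Grouping by four lines rather than searching for "@" avoids misreading a quality line that happens to begin with "@".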

Output: Paired and Single End. Single end reads. Paired end reads: insert size (e.g. 300 bp). Long insert size (eg bp).

Output: Splitting & clipping data. Reads in the Fastq file are split by sample using barcodes (aka multiplexing). De-multiplexing is usually done by the sequencer.
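Splitting and clipping by barcode can be sketched as follows. This is a simplified illustration, not how any particular sequencer does it: the barcodes and sample names are hypothetical, the barcode is assumed to sit at the start of each read, and only exact matches are accepted.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Read = Tuple[str, str]  # (sequence, quality string)


def demultiplex(reads: List[Read], barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Split reads by leading barcode and clip the barcode off.

    barcodes maps barcode sequence -> sample name. Reads with no
    matching barcode are collected under "unassigned", unclipped.
    """
    bins: Dict[str, List[Read]] = defaultdict(list)
    for seq, qual in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                n = len(barcode)
                # Clip the barcode from both the bases and the qualities.
                bins[sample].append((seq[n:], qual[n:]))
                break
        else:
            bins["unassigned"].append((seq, qual))
    return dict(bins)


# Hypothetical barcodes and reads.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = [
    ("ACGTGGGTTT", "iiiiiiiiii"),
    ("TGCAAACCC", "hhhhhhhhh"),
    ("GGGGAAAA", "cccccccc"),
]
split = demultiplex(reads, barcodes)
for sample, sample_reads in split.items():
    print(sample, sample_reads)
```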

Output: Data quality and trimming data. Trimming removes low-quality or uncalled bases from the reads. In the Fastq example, the first read begins AC N GTGTTTTT..., with an uncalled base (N) at the third position.
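A minimal sketch of quality trimming, under two assumptions that the slides do not state: quality characters are decoded with an offset of 64 (suggested by the lowercase letters in the example quality line, but other files use 33), and trimming simply removes bases from the 3' end until one reaches a chosen threshold. Real trimmers use more refined rules.

```python
from typing import List, Tuple


def phred_scores(quality: str, offset: int = 64) -> List[int]:
    """Decode a quality string into integer scores (offset is an assumption)."""
    return [ord(char) - offset for char in quality]


def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 64) -> Tuple[str, str]:
    """Drop bases from the 3' end while their score is below min_q."""
    scores = phred_scores(qual, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read whose last two bases have very low quality ('B').
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "iiiihhggBB")
print(trimmed_seq, trimmed_qual)
```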

Output, data quality: Coverage & Depth
Coverage: average number of times each position in the genome is covered by reads.
C = (N × L) / G, where N: number of reads, L: read length, G: genome size.
Example: N = 5 mill, L = 100 bp, G = 5 Mbp, so C = (5 mill × 100) / 5 mill = 100X. On average, 100 reads cover each position in the genome.
Depth: number of reads that cover a particular position in the genome (reads per site = depth).
Breadth-of-coverage (target or assembly): c = assembly size / target size.
Example: assembly = 4.9 mill, target = 5 mill, so c = 4.9/5 = 0.98.
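The coverage, depth, and breadth definitions can be checked with a few lines of code. The function names are chosen here for illustration, and the per-position depth example uses hypothetical read start positions on a tiny 10 bp genome.

```python
from typing import List


def average_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """C = N * L / G."""
    return n_reads * read_length / genome_size


def breadth_of_coverage(assembly_size: float, target_size: float) -> float:
    """c = assembly size / target size."""
    return assembly_size / target_size


def depth_per_position(read_starts: List[int], read_length: int, genome_size: int) -> List[int]:
    """Count how many reads cover each position (0-based start coordinates)."""
    depth = [0] * genome_size
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


# Numbers from the slide's examples.
print(average_coverage(5_000_000, 100, 5_000_000))
print(breadth_of_coverage(4_900_000, 5_000_000))

# Hypothetical toy case: three 4 bp reads on a 10 bp genome.
toy_depth = depth_per_position([0, 2, 4], read_length=4, genome_size=10)
print(toy_depth)
```

The toy case shows why the average alone can mislead: the mean depth is above 1, yet the last two positions are not covered at all.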

Output: Data storage & Access. International Nucleotide Sequence Database Collaboration (INSDC):
Europe: European Bioinformatics Institute (EBI)
United States: National Center for Biotechnology Information (NCBI)
Asia: DNA Data Bank of Japan (DDBJ)

Data Analysis: Data splitting, clipping, and trimming, followed by one of two routes.
De novo: Assembly, then further analysis (e.g. gene finding).
Reference: Mapping to a reference, then further analysis (e.g. SNP trees).

Data Analysis, bioinformatic platforms: Unix (Mac OS X, Linux) and DOS (Windows). Bioinformatic tools: CLC bio, MEGA, and Geneious. Unix…

Data Analysis, bioinformatic platforms: Web-tools to the rescue!
+ Platform independent
+ Requires little computer resources
+ Can be done everywhere
- Requires patience
Services: MLST, resistance genes, SNP calling and tree creation, species identification.
Many NGS tools; steep learning curve.

Data Analysis, de novo assembly: Different sequencers require different assemblers, depending on output and error profile.
Assembler Newbler: 454, Ion Torrent.
Assembler Velvet: Illumina, ABI SOLiD (color space).

Data Analysis, de novo assembly: Velvet, the unnecessarily complex assembler
K-mer based assembler. User needs to set K; longer reads allow a larger K.
Everything is defined in “Kmer-space”:
Nucleotide_length = Kmer_length + K - 1
Kmer_coverage = Nucleotide_coverage * (Read_length - K + 1) / Read_length

Data Analysis, de novo assembly: Velvet assembly example
>NODE_1_length_91928_cov_
AGTTCATTGATAAATCTTTTTTGATTATCATCAACGAGTGCCCACACAGATTGATTGGTT
TATATTGTTAAAGAGCTTTTCCTATCGAAATCGCTTTTAAGCTCAATTCGCTAGGGCTGC
GTATATTACGCTTATTCAGTTGAGTGTCAAACGTTATTTTCTA...
K = 83
Nucleotide length = Kmer_length + K - 1 = 91928 + 83 - 1 = 92010
Coverage conversion with Read_length = 300: Kmer_coverage = Nucleotide_coverage * (300 - 83 + 1) / 300; converted value: 31.84
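The two "Kmer-space" conversions can be written as small helper functions. The function names are illustrative and are not part of Velvet; the inverse conversion is simply the coverage formula solved for the nucleotide coverage. The coverage input of 30.0 is a hypothetical value, since the coverage in the example contig header is truncated in the transcript.

```python
def nucleotide_length(kmer_length: int, k: int) -> int:
    """Nucleotide_length = Kmer_length + K - 1."""
    return kmer_length + k - 1


def kmer_coverage(nucleotide_cov: float, read_length: int, k: int) -> float:
    """Kmer_coverage = Nucleotide_coverage * (Read_length - K + 1) / Read_length."""
    return nucleotide_cov * (read_length - k + 1) / read_length


def nucleotide_coverage(kmer_cov: float, read_length: int, k: int) -> float:
    """Inverse of kmer_coverage: solve the same formula for nucleotide coverage."""
    return kmer_cov * read_length / (read_length - k + 1)


# Contig length from the example header, with K = 83.
print(nucleotide_length(91928, 83))

# Hypothetical nucleotide coverage of 30.0 with 300 bp reads and K = 83.
example_kmer_cov = kmer_coverage(30.0, 300, 83)
print(example_kmer_cov)
print(nucleotide_coverage(example_kmer_cov, 300, 83))
```

The larger K is relative to the read length, the smaller the k-mer coverage becomes for the same nucleotide coverage.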

Data Analysis: De novo quality check
Number of contigs: fewer is generally better.
N50: sort the contigs by size, largest first, and add them up until 50% of the total size of contigs is reached; the size of the contig at that point is the N50.
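A minimal sketch of the N50 calculation, using the common definition (the size of the contig at which the running total, taken from the largest contig downwards, first reaches half of the total assembly size). The contig sizes are hypothetical.

```python
from typing import List


def n50(contig_lengths: List[int]) -> int:
    """Return the N50 of a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("at least one contig is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop at the contig that brings the running sum to >= 50% of the total.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical assembly: total size 290, so half is 145.
contigs = [80, 70, 50, 40, 30, 20]
print(len(contigs), sum(contigs), n50(contigs))
```

Here the two largest contigs (80 + 70 = 150) already cover half of the assembly, so the N50 is 70.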

Data Analysis: Further data analysis. From contigs: gene finding, resistance, MLST, etc.

Data Analysis, further data analysis: Genes are not just genes…
Find genes by Open Reading Frames + Shine-Dalgarno + motifs.
Not there does not mean it is NOT there: the gene may be not assembled or truncated.
“Hypothetical” & “Putative”: the curse of bioinformatics.
The evil circle of BLAST similarity: annotated gene, verified in the lab; “Hypothetical” or “Putative” annotations; no match to original sequence.
Suggested annotation service: RAST
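The open reading frame part of gene finding can be sketched as below. This is a deliberately simplified illustration: it scans only the forward strand, treats ATG as the only start codon, and ignores Shine-Dalgarno sites and other motifs that real gene finders use. The input sequence is hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) half-open coordinates of forward-strand ORFs.

    An ORF runs from an ATG to the first in-frame stop codon (inclusive).
    ORFs without a stop codon, e.g. cut off at a contig end, are not reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Hypothetical sequence with one short ORF.
dna = "CCATGAAATTTTAGGG"
orfs = find_orfs(dna)
for begin, end in orfs:
    print(begin, end, dna[begin:end])
```

Because an ORF cut off at the end of a contig has no stop codon, this sketch skips it, which is one way a truncated gene can appear to be "not there".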

Data Analysis: Mapping to a reference. Raw reads are aligned to a reference sequence. Some reads do not match the reference, and some regions of the reference do not match any reads. Mappers: BWA, Bowtie, MAQ, CGE.

Thank you for listening. Questions?