Introduction to next generation sequencing

Presentation transcript:

Introduction to next generation sequencing Pimlapas Leekitcharoenphon (Shinny)

History
1953: Watson & Crick

The first DNA sequences (24 bp of the lac operator, in 1973) were obtained in the early 1970s using laborious methods based on two-dimensional chromatography. Several notable advancements in DNA sequencing were made during the 1970s. Frederick Sanger developed rapid DNA sequencing methods at the MRC Centre, Cambridge, UK, and published a method for "DNA sequencing with chain-terminating inhibitors" in 1977.[6] Walter Gilbert and Allan Maxam at Harvard also developed sequencing methods, including one for "DNA sequencing by chemical degradation".[7][8] In 1973, Gilbert and Maxam reported the sequence of 24 base pairs using a method known as wandering-spot analysis.[9] The first full DNA genome to be sequenced was that of bacteriophage φX174, in 1977.[10]

Leroy E. Hood's laboratory at the California Institute of Technology and Smith announced the first semi-automated DNA sequencing machine in 1986.[citation needed] This was followed by Applied Biosystems' marketing of the first fully automated sequencing machine, the ABI 370, in 1987.

In 1995, Venter, Hamilton Smith, and colleagues at The Institute for Genomic Research (TIGR) published the first complete genome of a free-living organism, the bacterium Haemophilus influenzae. The circular chromosome contains 1,830,137 bases, and its publication in the journal Science[12] marked the first published use of whole-genome shotgun sequencing, eliminating the need for initial mapping efforts. By 2001, shotgun sequencing methods had been used to produce a draft sequence of the human genome.[13][14]

Several new methods for DNA sequencing were developed in the mid to late 1990s. These techniques comprise the first of the "next-generation" sequencing methods.

First delivered to the United States Government in 1950, the UNIVAC 1101, or ERA 1101, is considered to be the first computer capable of storing and running a program from memory.


History
1980: The 3-billion-dollar Human Genome Project

Human Genome Project (1990-2003)
2001: Draft
2003: Complete

History
1998: The privately owned company Celera, founded by Craig Venter, sets out to sequence the human genome.
2001: Both genome projects release a draft of the human genome.
2003: The projects


History
2005: Next Generation Sequencing
454 Life Sciences: parallelized pyrosequencing

454 Life Sciences markets a parallelized version of pyrosequencing. The first version of their machine reduced sequencing costs 6-fold compared to automated Sanger sequencing.

History
2005: Next Generation Sequencing
Figure sources: European Nucleotide Archive (ENA) statistics (http://www.ebi.ac.uk/ena/about/statistics); Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Accessed 31-Oct-14.

Next generation sequencing:
Roche, 454 Life Sciences (GS FLX Titanium)
Life Technologies (Ion Torrent & Ion Proton)
Illumina (HiSeq, MiSeq, GenomeAnalyzer)

Third generation sequencing:
Pacific Biosciences (PacBio RS)
Oxford Nanopore (MinION, PromethION, GridION)

The different NGS platforms on the market

----- Meeting Notes (10/2/13 09:46) -----
They differ in how they sequence and have different pros and cons. First I will go through library preparation and then give a short take on the different platforms.

Next generation sequencing
Method outline - library
1. Fragment DNA
2. Ligate adapters
3. Amplification
4. Sequencing
Adapter labels: amplification primer, barcode, sequencing primer

----- Meeting Notes (10/2/13 09:07) -----
1. A library of fragment targets is created...
2. Adapters containing universal priming sites are ligated to the target ends.
3. These adapters consist of an amplification primer, used to amplify the fragments, and a sequencing primer, used to sequence the fragments. If you are sequencing more than one isolate/sample, the adapter will also contain a barcode that is specific to each sample.
4. Finally the fragments are amplified and sequenced by one of the NGS machines…

Next generation sequencing technologies
454 and Ion Torrent

454: The sequencing machine contains many picoliter wells, each containing a single bead carrying amplified DNA together with sequencing enzymes (beads coupled with sulfurylase and luciferase). 454 then uses luciferase to generate light for detection of the individual nucleotides.
Advantages: long read sizes
Disadvantages: expensive

Ion Torrent: Sequencing is based on the detection of hydrogen ions that are released during the polymerisation of DNA.
Advantages: cheap
Disadvantages: low throughput

For both: fast, but problems with homopolymers
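To make the homopolymer issue concrete: in flow-based chemistries such as 454 and Ion Torrent, a run of identical bases is incorporated in a single flow, so the run length has to be inferred from signal strength. The sketch below is only an illustration, not part of either platform's software. It lists homopolymer runs in a read, using the start of the first read from the Fastq example in this talk. The function name and the `min_len` threshold are hypothetical.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=3):
    """Return (base, start, length) for every run of identical bases
    that is at least min_len long. min_len is an arbitrary example value."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)  # size of this run of identical bases
        if length >= min_len:
            runs.append((base, pos, length))
        pos += length
    return runs

# First 19 bases of the first read in the Fastq example
read = "ACNGTGTTTTTAGTTATTG"
print(homopolymer_runs(read, min_len=4))
```

Regions reported by such a scan are the places where a flow-based read is most likely to contain an insertion or deletion.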

Next generation sequencing
Illumina: Genome Analyzer, HiSeq, MiSeq

Illumina: the most widely used next-generation sequencers on the market, with three platforms shown here.
To determine the sequence, four types of reversible terminator bases are added and non-incorporated nucleotides are washed away. A camera takes images of the fluorescently labeled nucleotides, then the dye is chemically removed from the DNA, allowing the next cycle to begin.
Advantages: good accuracy and high throughput
Disadvantages: short reads (~100-300 bp), long run time

Third generation sequencing technologies
PacBio: the first third generation sequencer on the market

With the PacBio platform, single DNA polymerase molecules are attached to the bottom surface of individual detectors (zero-mode waveguide detectors) that can obtain sequence information while phospholinked nucleotides are being incorporated into the growing primer strand.
Advantages: long reads (on avg. 20,000 bp), quick run time, low cost per sample
Disadvantages: relatively low accuracy, needs more input material than other platforms, expensive (machine)

Third generation sequencing technologies Nanopore Upcoming technology Released to select labs

Third generation sequencing technologies
Nanopore: MinION, GridION, PromethION
Reads of 80,000+ bp
MinION: 150 million bp per 6 h (30x coverage of E. coli)
38.2 % error rate on the MinION
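The "30x coverage of E. coli" figure can be checked with a one-line calculation: average coverage is the total number of sequenced bases divided by the genome size. The sketch below assumes an E. coli genome of roughly 4.6 million bp; that value is not on the slide, and exact sizes vary by strain.

```python
def average_coverage(total_bases, genome_size):
    """Average depth of coverage = sequenced bases / genome length."""
    return total_bases / genome_size

MINION_YIELD_BP = 150_000_000  # 150 million bp per 6 h, from the slide
ECOLI_GENOME_BP = 4_600_000    # assumption: approximate E. coli genome size

print(round(average_coverage(MINION_YIELD_BP, ECOLI_GENOME_BP), 1))
```

With these numbers the result is a little over 30x, which agrees with the rounded figure on the slide.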

Output
What is sequence data?
Sequence data is stored in fasta files.
Fasta example (labelled parts): Header/ID, Sequence
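As a rough illustration of the fasta layout (a header/ID line starting with ">", followed by one or more sequence lines), here is a minimal reader. The record names and sequences in the example are made up for illustration, and the function is a sketch rather than a replacement for an established parser.

```python
def parse_fasta(lines):
    """Return {header: sequence} from an iterable of fasta lines.
    A header line starts with '>'; following lines are joined into one sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

# Hypothetical two-record file; the names and sequences are illustrative only
example = """>seq1 hypothetical example record
ACGTTAGCAGAATCGC
TTTCTGTTCG
>seq2
AGCGTGACAAAC
""".splitlines()

for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```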

Output Sample Raw reads Reads

Output
What is the data? What is Fastq?
Fastq files: Fasta + quality scores
1 read, 4 lines:
Line 1: Header/ID (starts with @)
Line 2: DNA sequence
Line 3: Name field (optional; the line starting with +)
Line 4: Quality scores

Fastq example:
@FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1
ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTTGTGCCTGAAAAGTGGGCGCA
+
_BP`ccceggcegihiiighiifhihfddgfhi^efgfhhhhhegiiiiiiiihiihihggeeccdddcccacWTT^acc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1
ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC
+
bb_eeceefeggehhdagfghhiihfghighhffhifhhcghfdhiihafgdceba`a\aaccc^V]^baccaccXaaX^bbcccaac[_X]]a[aacXT
@FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1
AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT
+
bbbeeeeefggfgiihgiigiiiiiiiffgifgeghiiihhfefffhhhfgh_fhggdgegeaceeacbdcbcc\^aa]``_^bb]bcccccbac_a^bc
@FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1
AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG
+
bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc

----- Meeting Notes (10/2/13 14:19) -----
When the sequencing is done, you get your output…

----- Meeting Notes (10/2/13 14:27) -----
In this next section I will go through the output data. The sequence data is output as fastq files. And what are fastq files? A fastq file is a fasta file with quality scores! So this is an example of a fastq file.
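A minimal sketch of reading the "1 read, 4 lines" layout and turning the quality string into numeric scores. Each quality character encodes a Phred score as its ASCII code minus an offset. The offset of 64 used here is an assumption based on the character range in the example (older Illumina pipelines); many current files use 33 instead. The record is the first read of the example, truncated to 10 bases for brevity, and the function names are hypothetical.

```python
def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from an iterable of Fastq lines."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed Fastq record starting at line {i + 1}")
        yield header[1:], seq, qual

def phred_scores(qual, offset=64):
    """Decode a quality string; offset=64 is an assumption for this example."""
    return [ord(ch) - offset for ch in qual]

# First read of the example, truncated to its first 10 bases
example = [
    "@FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1",
    "ACNGTGTTTT",
    "+",
    "_BP`cccegg",
]

for header, seq, qual in parse_fastq(example):
    print(header)
    print(list(zip(seq, phred_scores(qual))))
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that the base call is wrong.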

Output
Paired and Single End
Single end reads
Paired end reads: insert size (e.g. 300 bp)
Paired end reads: long insert size (e.g. 8000 bp)
Figure labels: Fragment, Adaptors, gap

Output
Splitting & clipping data using barcodes
De-multiplexing is usually done by the sequencer.

Fastq example:
@FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1
ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTTGTGCCTGAAAAGTGGGCGCA
+
_BP`ccceggcegihiiighiifhihfddgfhi^efgfhhhhhegiiiiiiiihiihihggeeccdddcccacWTT^acc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1
ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC
+
bb_eeceefeggehhdagfghhiihfghighhffhifhhcghfdhiihafgdceba`a\aaccc^V]^baccaccXaaX^bbcccaac[_X]]a[aacXT
@FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1
AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT
+
bbbeeeeefggfgiihgiigiiiiiiiffgifgeghiiihhfefffhhhfgh_fhggdgegeaceeacbdcbcc\^aa]``_^bb]bcccccbac_a^bc
@FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1
AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG
+
bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc
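Splitting and clipping can be sketched in a few lines. This assumes, as the example reads appear to show, that the barcode is written after "#" in the header and also occupies the first bases of the read; real pipelines differ in where the barcode sits and usually tolerate mismatches. The helper names are hypothetical, and the reads are truncated to 12 bases.

```python
from collections import defaultdict

def barcode_from_header(header):
    """Extract the barcode from a header such as '...:2082#AGCGT/1'."""
    return header.split("#", 1)[1].split("/", 1)[0]

def demultiplex(records):
    """Group (header, seq, qual) records by barcode and clip the barcode
    from the start of each read. Assumes the barcode is the read's prefix."""
    samples = defaultdict(list)
    for header, seq, qual in records:
        barcode = barcode_from_header(header)
        n = len(barcode)
        samples[barcode].append((header, seq[n:], qual[n:]))
    return dict(samples)

# Reads 2-4 of the example, truncated to their first 12 bases
reads = [
    ("FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1", "ACGTTAGCAGAA", "bb_eeceefegg"),
    ("FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1", "AGCGTGACAAAC", "bbbeeeeefggf"),
    ("FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1", "AGCGTCTGACTC", "bbbeeeeegggg"),
]

for barcode, group in demultiplex(reads).items():
    print(barcode, [seq for _, seq, _ in group])
```

Clipping the quality string together with the sequence keeps the two lines the same length, which downstream tools expect.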

Data quality Output


Output
Data quality: Trimming data

Fastq example:
@FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1
ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTGTGCCTGAAAAGTGGGCGCA
+
_BP`ccceggcegihiiighiifhihfddgfhi^efgfhhhhhegiiiiiiiihiihihggeeccdddcccacWTT^acc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1
ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC
+
bb_eeceefeggehhdagfghhiihfghighhffhifhhcghfdhiihafgdceba`a\aaccc^V]^baccaccXaaacc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1
AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT
+
bbbeeeeefggfgiihgiigiiiiiiiffgifgeghiiihhfefffhhhfgh_fhggdgegeaceeacbdcbcc\^aa]``_^bb]bcccccbac_a^bc
@FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1
AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG
+
bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc

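One simple way to trim is to cut low-quality bases from the 3' end of each read until a base meets a quality threshold. The sketch below shows that idea only; it is not necessarily the trimming method used in the talk. The threshold of 20 and the Phred offset of 64 are assumptions, and the input is the last 20 bases of the first example read, whose quality string ends in a run of "B" characters.

```python
def trim_3prime(seq, qual, min_q=20, offset=64):
    """Remove trailing bases whose Phred score is below min_q.
    min_q and offset are example values, not taken from the slides."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Last 20 bases and quality characters of the first read in the example
seq = "GTGCCTGAAAAGTGGGCGCA"
qual = "[ab_`]`[_b`^BBBBBBBB"

trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
print(trimmed_seq, len(seq) - len(trimmed_seq), "bases removed")
```

Under the offset-64 assumption, "B" decodes to a score of 2, so the eight trailing bases fall below the threshold and are removed.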

Output
Data storage & Access
International Nucleotide Sequence Database Collaboration (INSDC)
Asia: DNA Data Bank of Japan (DDBJ)
Europe: European Bioinformatics Institute (EBI)
United States: National Center for Biotechnology Information (NCBI)
Exchange interval shown between the databases: 24 h

Output
Data storage & Access
European Bioinformatics Institute (EBI): http://www.ebi.ac.uk/ena

Data Analysis
Data splitting, clipping, and trimming
De novo assembly, then further analysis (e.g. gene finding)
Mapping to a reference, then further analysis (SNPs, metagenomics)

Data Analysis
Assembly: de novo & reference based
De novo assembly (De Bruijn graphs): from reads to contigs
Reference based assembly: from reads and a reference genome to a consensus sequence
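The De Bruijn graph idea can be sketched on toy data: every k-mer in the reads becomes an edge from its first k-1 bases to its last k-1 bases, and a contig is spelled out by walking an unbranched path. This is a simplified illustration with made-up reads and a tiny k. Real assemblers such as Velvet or SPAdes add error correction, graph simplification and much more, none of which is shown here.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, edges are distinct k-mers."""
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in edges.items()}

def walk_unbranched(graph):
    """Spell a contig by following single-successor edges from a start node
    (a node with no incoming edge). Returns '' if no such node exists."""
    targets = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in targets]
    if not starts:
        return ""
    node = starts[0]
    contig = node
    seen = {node}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in seen:  # stop on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig

# Two hypothetical overlapping reads
reads = ["ACGTT", "CGTTA"]
graph = de_bruijn(reads, k=3)
print(graph)
print(walk_unbranched(graph))
```

The two reads overlap by four bases, and the walk merges them into a single contig one base longer than either read.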

Data Analysis
Further data analysis
Contigs: gene finding, resistance, MLST, etc.

Data Analysis
De novo assembly
Different sequencers require different assemblers, depending on output and error profile.
454, Ion Torrent: assembler Newbler
Illumina: assemblers Velvet, SPAdes


Data Analysis
Mapping to a reference
Mappers (CGE): BWA, Bowtie, MAQ
Figure labels: Reference sequence; raw reads; "Do not match reference"; "Do not match any reads"
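To illustrate what "mapping" means, here is a deliberately naive mapper that slides each read along the reference and reports the best position within a mismatch limit. This is not how BWA, Bowtie or MAQ work internally (they use indexed data structures to stay fast on large genomes). The reference is the start of the third example read, reused here as a stand-in; the second read and the mismatch limit are hypothetical.

```python
def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) of the best ungapped placement of read
    on reference, or None if no placement is within max_mismatches."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best

# Stand-in reference: first 28 bases of the third read in the Fastq example
reference = "AGCGTGACAAACATTTTATTGCGCCCGG"

print(map_read("GACAAAC", reference))   # a read taken from the reference
print(map_read("CCCCCCCC", reference))  # a read that does not match the reference
```

Reads for which the function returns None correspond to the "Do not match reference" label on the slide.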
