Introduction to next generation sequencing Pimlapas Leekitcharoenphon (Shinny)
History
1953: Watson & Crick.
The first DNA sequences (24 bp of the lac operator, in 1973) were obtained in the early 1970s using laborious methods based on two-dimensional chromatography. Several notable advancements in DNA sequencing were made during the 1970s. Frederick Sanger developed rapid DNA sequencing methods at the MRC Centre, Cambridge, UK and published a method for "DNA sequencing with chain-terminating inhibitors" in 1977.[6] Walter Gilbert and Allan Maxam at Harvard also developed sequencing methods, including one for "DNA sequencing by chemical degradation".[7][8] In 1973, Gilbert and Maxam reported the sequence of 24 base pairs using a method known as wandering-spot analysis.[9] The first full DNA genome to be sequenced was that of bacteriophage φX174 in 1977.[10]
Leroy E. Hood's laboratory at the California Institute of Technology and Smith announced the first semi-automated DNA sequencing machine in 1986. This was followed by Applied Biosystems' marketing of the first fully automated sequencing machine, the ABI 370, in 1987. In 1995, Venter, Hamilton Smith, and colleagues at The Institute for Genomic Research (TIGR) published the first complete genome of a free-living organism, the bacterium Haemophilus influenzae. The circular chromosome contains 1,830,137 bases, and its publication in the journal Science[12] marked the first published use of whole-genome shotgun sequencing, eliminating the need for initial mapping efforts. By 2001, shotgun sequencing methods had been used to produce a draft sequence of the human genome.[13][14]
Several new methods for DNA sequencing were developed in the mid to late 1990s. These techniques comprise the first of the "next-generation" sequencing methods. In 1996,
First delivered to the United States Government in 1950, the UNIVAC 1101 or ERA 1101 is considered to be the first computer that was capable of storing and running a program from memory.
History 1980 Human Genome Project (3 billion USD)
History
1990-2003: Human Genome Project (2001: draft; 2003: complete).
1998: The privately owned company Celera, founded by Craig Venter, sets out to sequence the human genome.
2001: Both genome projects release a draft of the human genome.
2003: The projects
Next Generation Sequencing
History, 2005: 454 Life Sciences markets a parallelized version of pyrosequencing. The first version of their machine reduced sequencing costs 6-fold compared to automated Sanger sequencing.
Data sources: European Nucleotide Archive (ENA) statistics (http://www.ebi.ac.uk/ena/about/statistics); Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Accessed 31-oct-14.
The different NGS platforms on the market
Next generation sequencing:
  Roche, 454 Life Sciences (GS FLX Titanium)
  Life Technologies (Ion Torrent & Ion Proton)
  Illumina (HiSeq, MiSeq, GenomeAnalyzer)
Third generation sequencing:
  Pacific Biosciences (PacBio RS)
  Oxford Nanopore (MinION, PromethION, GridION)
----- Meeting Notes (10/2/13 09:46) -----
They differ in how they sequence and have different pros and cons. First I will go through library preparation and then talk briefly about the different platforms.
Next generation sequencing: Method outline - library
1. Fragment DNA
2. Ligate adapters (amplification primer, sequencing primer, barcode)
3. Amplification
4. Sequencing
----- Meeting Notes (10/2/13 09:07) -----
1. A library of fragment targets is created...
2. Adapters containing universal priming sites are ligated to the target ends.
3. These adapters consist of an amplification primer, to amplify the fragments, and a sequencing primer, which is used to sequence the fragments. If you are sequencing more than one isolate/sample, the adapter will also contain a barcode that is specific to each sample.
4. Finally the fragments are amplified and sequenced by one of the NGS machines…
Next generation sequencing technologies: 454 and Ion Torrent
454: The sequencing machine contains many picoliter wells, each containing a single bead carrying amplified DNA, together with sequencing enzymes (beads coupled with sulfurylase and luciferase). 454 then uses luciferase to generate light for detection of the individual nucleotides.
Advantages: Long read sizes
Disadvantages: Expensive
Ion Torrent: Sequencing is based on the detection of hydrogen ions that are released during the polymerisation of DNA.
Advantages: Cheap
Disadvantages: Low throughput
For both: Fast, but problems with homopolymers
Slide labels: 454: expensive, long insert sizes. Ion Torrent: low throughput, cheapest.
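To make the homopolymer problem concrete: both platforms register a run of identical bases as one signal whose strength has to be translated into a run length, so long runs are the regions where length errors are most likely. The sketch below only locates such runs in a sequence. The function name and the `min_len` threshold are illustrative assumptions, not part of any platform's software.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for every run of identical bases
    that is at least min_len long (min_len is an arbitrary choice)."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


# Hypothetical short sequence, for illustration only
print(homopolymer_runs("ACGTTTTTAGGGC"))
```

Runs reported this way are the positions worth inspecting first when 454 or Ion Torrent reads disagree with a reference.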
Next generation sequencing: Illumina (Genome Analyzer, HiSeq, MiSeq)
Illumina: The most widely used next-generation sequencers on the market, with three instruments: Genome Analyzer, HiSeq and MiSeq. To determine the sequence, four types of reversible terminator bases are added and non-incorporated nucleotides are washed away. A camera takes images of the fluorescently labeled nucleotides, then the dye is chemically removed from the DNA, allowing the next cycle to begin.
Advantages: Good accuracy and high throughput
Disadvantages: Short reads (~100-300 bp), long run time
Third generation sequencing technologies: PacBio
PacBio: the first third generation sequencer on the market. With the PacBio platform, single DNA polymerase molecules are attached to the bottom surface of individual detectors (zero-mode waveguide detectors) that can obtain sequence information while phospholinked nucleotides are being incorporated into the growing primer strand.
Advantages: Long reads (on average 20,000 bp), quick run time, low cost per sample
Disadvantages: Relatively low accuracy, needs more input material than other platforms, expensive machine
Third generation sequencing technologies: Nanopore
Upcoming technology, released to select labs
Third generation sequencing technologies: Nanopore
Instruments: MinION, GridION, PromethION
Reads of 80,000+ bp
MinION: 150 million bp per 6 h (30x coverage of E. coli)
38.2 % error rate on the MinION
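The "30x coverage of E. coli" figure can be checked with a rough calculation: mean coverage is the total number of sequenced bases divided by the genome size. The sketch below assumes an E. coli genome of about 4.6 million bp; that number is an assumption brought in for illustration and is not stated on the slide.

```python
def mean_coverage(total_bases, genome_size):
    """Average number of times each genome position is sequenced."""
    return total_bases / genome_size


# Assumption: approximate E. coli genome size in bp (not from the slide)
E_COLI_GENOME_BP = 4_600_000
# From the slide: MinION yield of 150 million bp per 6 h run
MINION_YIELD_BP = 150_000_000

print(round(mean_coverage(MINION_YIELD_BP, E_COLI_GENOME_BP), 1))
```

Under that assumption the result is a little above 30x, in line with the slide.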
What is sequence data?
Output: Sequence data is stored in fasta files.
Fasta example: Header/ID, followed by the sequence
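As a minimal sketch of how the Header/ID and sequence parts of a fasta file can be read: header lines start with ">" and the sequence may continue over several lines. The record names and sequences below are made up for illustration, and the function name is an assumption rather than a library API.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical records, for illustration only
example = """>contig_1 hypothetical example record
ACGTTAGCAGAATCGC
TTTCTGTTCG
>contig_2
AGCGTGACAAAC
""".splitlines()

records = list(parse_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```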
Output Sample Raw reads Reads
What is the data? What is Fastq?
Output: Fastq files = Fasta + quality scores
1 read, 4 lines
Fastq example:
@FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1
ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTTGTGCCTGAAAAGTGGGCGCA
+
_BP`ccceggcegihiiighiifhihfddgfhi^efgfhhhhhegiiiiiiiihiihihggeeccdddcccacWTT^acc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1
ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC
+
bb_eeceefeggehhdagfghhiihfghighhffhifhhcghfdhiihafgdceba`a\aaccc^V]^baccaccXaaX^bbcccaac[_X]]a[aacXT
@FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1
AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT
+
bbbeeeeefggfgiihgiigiiiiiiiffgifgeghiiihhfefffhhhfgh_fhggdgegeaceeacbdcbcc\^aa]``_^bb]bcccccbac_a^bc
@FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1
AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG
+
bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc
----- Meeting Notes (10/2/13 14:19) -----
When the sequencing is done, you get your output…
----- Meeting Notes (10/2/13 14:27) -----
In this next section I will go through the output data. The sequence data is output as fastq files. And what are fastq files? They are fasta files with quality scores! So this is an example of a fastq file.
What is Fastq? Each read takes four lines:
1. Header/ID (starts with @)
2. DNA sequence
3. Name field (optional, after the +)
4. Quality scores
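The four-line layout can be read with a few lines of code. In the sketch below each quality character is turned into a Phred score Q by subtracting an offset from its ASCII code, and Q maps to an error probability p = 10^(-Q/10). The default offset of 64 is an assumption: the characters in the example (B up to i) fit the older Illumina +64 encoding, whereas most current data uses +33. The function names and the short record are hypothetical.

```python
def parse_fastq(lines):
    """Yield (header, sequence, quality) from 4-line fastq records."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("fastq input must contain 4 lines per read")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed record at non-blank line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def phred_scores(qual, offset=64):
    """Decode a quality string; offset=64 is an assumption (see lead-in)."""
    return [ord(char) - offset for char in qual]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Hypothetical, shortened record for illustration
example = ["@read1/1", "ACNGT", "+", "_BP`c"]
for header, seq, qual in parse_fastq(example):
    print(header, seq, phred_scores(qual))
```

Checking the offset before decoding matters: with the wrong offset every score is shifted by 31.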
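Output: Paired and Single End
Single end reads: the fragment is read from one end only.
Paired end reads: the fragment is read from both ends, which can leave an unsequenced gap in the middle. Insert size e.g. 300 bp.
Long insert size: e.g. 8000 bp.
Adaptors flank the fragment.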
Splitting & clipping data using barcodes
Output: De-multiplexing is usually done by the sequencer.
In the Fastq example, the barcode follows the # in each header (e.g. #ACGTT) and is also the first five bases of the read, which are clipped after splitting.
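Splitting and clipping can be sketched as follows, assuming headers of the form ID#BARCODE/1 as in the Fastq example and assuming that every read begins with its barcode. The reads below are the first 12 bases of three reads from the example, shortened for brevity; the function names are hypothetical and this is not how a sequencer's own de-multiplexing software works.

```python
from collections import defaultdict


def barcode_from_header(header):
    """Extract the barcode between '#' and '/' in 'ID#ACGTT/1'."""
    return header.split("#", 1)[1].split("/", 1)[0]


def demultiplex(records, clip=True):
    """Group (header, seq, qual) records by barcode. With clip=True,
    a barcode-length prefix is removed from sequence and quality
    (assumption: the read starts with the barcode)."""
    bins = defaultdict(list)
    for header, seq, qual in records:
        barcode = barcode_from_header(header)
        if clip:
            seq, qual = seq[len(barcode):], qual[len(barcode):]
        bins[barcode].append((header, seq, qual))
    return dict(bins)


# Reads shortened to 12 bases for illustration
reads = [
    ("FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1", "ACGTTAGCAGAA", "bb_eeceefegg"),
    ("FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1", "AGCGTGACAAAC", "bbbeeeeefggf"),
    ("FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1", "AGCGTCTGACTC", "bbbeeeeegggg"),
]

bins = demultiplex(reads)
for barcode, members in sorted(bins.items()):
    print(barcode, len(members))
```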
Data quality Output
Trimming data
Output: Fastq example:
@FCC0CD5ACXX:1:1101:1103:2048#ACCGT/1
ACNGTGTTTTTAGTTATTGTTTTGTTAAGTTGGGTTTTTTGTACCCAATAGCCAACAAGCCGCCTTTATGGCGGTTTTTTGTGCCTGAAAAGTGGGCGCA
+
_BP`ccceggcegihiiighiifhihfddgfhi^efgfhhhhhegiiiiiiiihiihihggeeccdddcccacWTT^acc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1165:2058#ACGTT/1
ACGTTAGCAGAATCGCTTTCTGTTCGTTTTCCACCTGCGACAGACGCACCGGACCACGGTTGGCGAGATCGTCGCGCAGAATATCGGCGGCACGCTGCGAC
+
bb_eeceefeggehhdagfghhiihfghighhffhifhhcghfdhiihafgdceba`a\aaccc^V]^baccaccXaaacc[ab_`]`[_b`^BBBBBBBB
@FCC0CD5ACXX:1:1101:1135:2082#AGCGT/1
AGCGTGACAAACATTTTATTGCGCCCGGTTTTATCCAGCTTGAATGCCTGACGAAAGAAGATGATGGTGACGACGATGGAGAGAACAATCAGCACCAGATT
+
bbbeeeeefggfgiihgiigiiiiiiiffgifgeghiiihhfefffhhhfgh_fhggdgegeaceeacbdcbcc\^aa]``_^bb]bcccccbac_a^bc
@FCC0CD5ACXX:1:1101:1239:2083#AGCGT/1
AGCGTCTGACTCACACAAAAACGGTAACACAGTTATCCACAGAATCAGGGGATAAGGCCGGAAAGAACATGTGAGCAAAAAGGCAAAGCCAGGACAAAAGG
+
bbbeeeeegggggiiiiiiiiiigifhhiiighiiihhiiiiiiihiiiiiiiiiihiigcdbbdcdcccccdccccccccacccccccbcccacccccc
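In the trimming example, the runs of B at the end of some quality lines are the lowest scores under a +64 offset, so those read ends are natural candidates for trimming. The sketch below removes bases from the 3' end until it meets one at or above a threshold. The threshold of 20 and the offset of 64 are assumptions, and dedicated trimming tools typically apply more elaborate rules (for example sliding windows), so this is only an illustration of the idea.

```python
def trim_3prime(seq, qual, min_q=20, offset=64):
    """Drop bases from the 3' end while their Phred score is below min_q.
    min_q=20 and offset=64 are illustrative assumptions."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: 'h' decodes to 40 and 'B' to 2 with offset 64
print(trim_3prime("ACGTACGTAC", "hhhhhhBBBB"))
```

A read whose qualities are all below the threshold comes back empty and would normally be discarded.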
Data storage & Access
Output: International Nucleotide Sequence Database Collaboration (INSDC)
Asia: DNA Data Bank of Japan (DDBJ)
Europe: European Bioinformatics Institute (EBI)
United States: National Center for Biotechnology Information (NCBI)
Data are exchanged between the three databases every 24 h.
Data storage & Access
Output: European Bioinformatics Institute (EBI), http://www.ebi.ac.uk/ena
Data Analysis
Data splitting, clipping, and trimming, followed by either:
  De novo assembly, then further analysis (e.g. gene finding)
  Mapping to a reference, then further analysis (SNPs, metagenomics)
Data Analysis: Assembly, de novo & reference based
De novo assembly (De Bruijn graphs): reads are assembled into contigs.
Reference based assembly: reads are placed on a reference genome to give a consensus sequence.
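The De Bruijn graph idea can be sketched in a few lines: every k-mer in the reads becomes an edge from its first k-1 bases to its last k-1 bases, and a contig is spelled by following edges for as long as the path is unambiguous. This is a toy illustration with made-up, error-free reads and an arbitrary k. Real assemblers also deal with sequencing errors, reverse complements and repeats, none of which is handled here.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_contig(graph, start):
    """Follow unambiguous edges from start and spell out the contig.
    Stops at a branch, a dead end, or a node that was already visited."""
    contig, node, seen = start, start, {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break
        node = successors.pop()
        if node in seen:
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Hypothetical overlapping reads, for illustration only
reads = ["ACGTTAG", "GTTAGCA", "TAGCAGA"]
graph = de_bruijn_graph(reads, k=4)
print(walk_contig(graph, "ACG"))
```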
Data Analysis: Further data analysis
Contigs: gene finding, resistance, MLST, etc.
Data Analysis: De novo assembly
Different sequencers require different assemblers, depending on their output and error profile.
454, Ion Torrent: assembler Newbler
Illumina: assemblers Velvet, SPAdes
Data Analysis: Mapping to a reference
CGE mappers: BWA, Bowtie, MAQ
Raw reads are aligned to the reference sequence. Some reads do not match the reference, and some regions of the reference do not match any reads.
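The two outcomes on the mapping slide, reads that do not match the reference and reference regions that no read matches, can be shown with a deliberately naive sketch. It places each read by exact string search at its first occurrence only. This is not how BWA, Bowtie or MAQ work (they use indexed search and tolerate mismatches); the reference and reads are made up and the function name is hypothetical.

```python
def map_reads(reference, reads):
    """Place reads on the reference by exact match (first hit only).
    Returns (hits, unmapped_reads, uncovered_reference_positions)."""
    hits, unmapped = [], []
    covered = [False] * len(reference)
    for read in reads:
        pos = reference.find(read)
        if pos == -1:
            unmapped.append(read)  # read does not match the reference
            continue
        hits.append((read, pos))
        for i in range(pos, pos + len(read)):
            covered[i] = True
    # Positions of the reference that no read matched
    uncovered = [i for i, is_covered in enumerate(covered) if not is_covered]
    return hits, unmapped, uncovered


# Hypothetical reference and reads, for illustration only
reference = "ACGTTAGCAGAATCGC"
hits, unmapped, uncovered = map_reads(reference, ["ACGTTA", "GCAGAA", "TTTTTT"])
print(hits, unmapped, uncovered)
```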