Published by Megan Jones. Modified over 9 years ago.
1
Towards your own genome
2
Designing your Sequencing Run
https://genohub.com/next-generation-sequencing-guide/
Sequencing strategy
Genome size and genome complexity?! Estimated from a related organism, PFGE, or flow cytometry
3
Noncoding DNA in genomes
4
Repetitive DNA in the human genome
5
Sequencing strategy
Template and library prep: fragment (SE), paired-end (PE) or mate pair (MP); BAC clones, fosmids....
Sequencing platform
6
Genome sequencing: Comparison of NGS methods
Method | Single-molecule real-time sequencing (Pacific Bio) | Ion semiconductor (Ion Torrent sequencing) | Pyrosequencing (454) | Sequencing by synthesis (Illumina) | Sequencing by ligation (SOLiD sequencing) | Chain termination (Sanger sequencing)
Read length | 2900 bp average [38] | 200 bp | 700 bp | 50 to 250 bp | 50+35 or 50+50 bp | 400 to 900 bp
Accuracy | 87% - 99% | 98% | 99.9% | 98% | 99.9%
Reads per run | 35-75 thousand | up to 5 million | 1 million | up to 3 billion | 1.2 to 1.4 billion | N/A
Time per run | 30 minutes to 2 hours | 2 hours | 24 hours | 1 to 10 days, depending upon sequencer | 1 to 2 weeks | 20 minutes to 3 hours
Cost per 1 million bases | $2 | $1 | $10 | $0.05 to $0.15 | $0.13 | $2400
Advantages | Longest read length. Fast. Detects 4mC, 5mC, 6mA. [41] | Less expensive equipment. Fast. | Long read size. Fast. | Potential for high sequence yield, depending upon sequencer model and desired application. | Low cost per base. | Long individual reads. Useful for many applications.
Disadvantages | Low yield at high accuracy. Equipment can be very expensive. | Homopolymer errors. | Runs are expensive. Homopolymer errors. | Short reads. | Slower than other methods. | More expensive and impractical for larger sequencing projects.
7
Application: de novo assemblies
Instrument | BACs, plastids, & microbial genomes | Transcriptome | Plant & animal genome
454 – GS Jr. | B – good but expensive | C – need multiple runs, expensive | D – cost prohibitive
454 – FLX+ | A – good, need to multiplex to be economical | B – good but expensive, libraries usually normalized, not best for short RNAs | C – OK as part of a mixed platform strategy, prohibitive to use alone
MiSeq – v2 | A – good, need to multiplex for best economics | A/B – expensive for rare transcripts (compared to HiSeq), but reads are longer for better assembly | B – expensive relative to HiSeq, but additional read length can be valuable
HiSeq 2000/2500, standard run | B/C – more data than needed unless highly indexed; assembly more challenging than 454 or MiSeq | A – good, assembly more challenging than 454 but much more data available for analyses | A – primary data type in many current projects; requires mate-pair libraries
HiSeq 2500, rapid run (projected) | B – more data than needed unless highly indexed; assembly more challenging than 454 | A – good, assembly more challenging than 454 but much more data available for analyses | A – will probably be more expensive than HiSeq2000, but increased read length may be worth it
Ion Torrent – 314 | B/C – OK, lowest experimental cost but reads are shorter & more expensive than Illumina | C – OK, but reads are shorter & more expensive than Illumina | D – cost prohibitive, reads shorter than alternatives
Ion Torrent – 318 | B/A – good, less data than MiSeq | B/A – good, less data than MiSeq, reads similar to 454 titanium but less expensive | C – high cost relative to Proton or Illumina, more economical than 454 for mixed platform strategy
Ion Torrent Proton I | B – more data than needed unless indexed; assembly more challenging than 454 or Illumina | B/A – assembly currently more challenging than Illumina or 454 | B – expensive relative to HiSeq or Proton II/III
Ion Torrent Proton II (projected) | B/C – more data than needed unless highly indexed; assembly more challenging than 454 or Illumina | B/A – assembly currently more challenging than Illumina or 454 | A/B – should be similar to HiSeq
Ion Torrent Proton III (forecast) | C – more data than needed unless highly indexed | B/A – need assembly pipelines | A – cost per MB could make it the best
SOLiD – 5500 | C – more data than needed unless highly indexed; assembly more challenging than 454 or Illumina | C/D – short reads make assembly challenging or impossible
PacBio – RS | B – good for hybrid assemblies; not economical for solo assemblies – requires high coverage due to high error rates | B/D – good for hybrid assemblies; too expensive for solo use; short RNA is challenging | B/D – good for hybrid assemblies & scaffolding (mixed platform strategy); cost prohibitive for solo use
8
Application: resequencing
Platform – instrument | Targeted loci | Transcript counting | Genome resequencing
454 – GS Jr. | B/C – good but expensive, need to limit loci | D – cost prohibitive | D – cost prohibitive for large genomes
454 – FLX+ | B – good but expensive, should limit loci | D – cost prohibitive | D – cost prohibitive for large genomes
MiSeq | A/B – good, fewer and higher cost reads than HiSeq | B – more expensive than HiSeq or SOLiD or ProtonII+ | B/C – expensive for large genomes
HiSeq 2000/2500 – standard run | A – primary data type in many current projects; best for many loci | A – primary data type in many current projects
HiSeq 2500 – rapid run (projected) | A – faster path to leading data type | A/B – likely to be slightly more expensive than with standard flow cell | A – faster path to leading data type
Ion Torrent – 314 | C – OK but expensive, need to limit loci | D – cost prohibitive
Ion Torrent – 318 | B – good, slightly less data per run than MiSeq | B/C – more expensive than HiSeq or SOLiD; new informatics pipelines needed; new error profile | C – expensive for large genomes
Ion Torrent Proton I | A/B – similar to MiSeq, but different error profile will inhibit switching | B – more expensive than Illumina or SOLiD; new informatics pipelines needed (different error profile than Illumina) | B – expensive relative to HiSeq or Proton II+
Ion Torrent Proton II (projected) | A/B – similar to HiSeq, but different error profile will inhibit switching | A/B – new informatics pipelines needed | A – supposed to set new pricing standard, could become leading shorter-read platform
Ion Torrent Proton III (forecast) | A/B – costs projected to be better than HiSeq; error profile different than Illumina | A/B – new informatics pipelines needed | A – supposed to set new pricing standard, could become leading shorter-read platform
SOLiD – 5500xl | B – harder to assemble than Illumina | A/B – used much less than HiSeq
PacBio – RS | C/D – expensive but can sequence difficult regions | D – cost prohibitive | C/D – cost prohibitive except for structural variants
9
Bacterial genomes
10
Noncoding DNA in genomes
11
Bacterial genomes
14
Complex Bacterial Genomes
Fosmid and plasmid library; Sanger
15
Simplified Bacterial Genomes
MDA (multiple displacement amplification) for 16 h on one lysed cell
3 kb Sanger libraries plus 454
15 gaps (chimeric clones)
Sanger finishing
Polishing by Illumina reads
37 regions
Sanger polishing
454 (average read length 225 bp)
Illumina (33 bp)
16
Bacterial genomes
17
Eukaryotic Genomes
18
Eukaryotic Genomes: Fish genomes
Template: a female fish was chosen because of its XX sex chromosome constitution
Roche 454 Titanium (3 and 20 kb libraries)
Illumina PE, insert size 200 bp and 75 bp reads
Physical map: fingerprints with ABI3730 from the WLC-1247 BAC library (insert size of 160 kb; 10× genome coverage with a total of 43,192 clones available)
19
Bird genomes
20
Mammalian genomes
HiSeq2000
DNA isolated from blood
21
Extremely large genomes
Loblolly pine (Pinus taeda): the largest genome assembled to date
DNA template: a single megagametophyte, the haploid tissue of a single pine seed – quantity
Long-fragment mate pair libraries from the parental diploid DNA
Novel fosmid DiTag libraries
N50 scaffold size of 66.9 kbp
22
Raw Data Trimming and Filtering
Quality score
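The quality score named on this slide is commonly a Phred score, Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The sketch below converts between Q and P and decodes a FASTQ quality string. The function names are hypothetical, and the ASCII offset of 33 is an assumption (the common Sanger/Illumina 1.8+ encoding); older files may use a different offset.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Turn a FASTQ quality string into integer Phred scores.

    offset=33 is assumed here; it is not stated on the slide.
    """
    return [ord(ch) - offset for ch in qual]
```

Under these assumptions, Q20 corresponds to a 1 in 100 chance of error and Q30 to 1 in 1000.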
23
Raw Data Trimming and Filtering
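A minimal sketch of what trimming and filtering can look like, assuming reads are already decoded into a sequence plus a list of Phred scores. The thresholds (Q20, 30 bases) and function names are illustrative assumptions, not values from the slides; dedicated trimming tools apply more elaborate rules such as sliding windows and adapter removal.

```python
def trim_3prime(seq: str, quals: list[int], min_q: int = 20):
    """Remove bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def passes_filter(seq: str, quals: list[int],
                  min_len: int = 30, min_mean_q: float = 20.0) -> bool:
    """Keep a read only if it is long enough and its mean quality is adequate."""
    if not seq or len(seq) < min_len:
        return False
    return sum(quals) / len(quals) >= min_mean_q
```

Trimming is applied first, then the filter decides whether the shortened read is still worth keeping.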
25
Assembly
N50, N75
Contigs, Scaffolds
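N50 is the length L such that contigs (or scaffolds) of length L or longer contain at least half of the total assembled bases; N75 uses 75% instead. A small sketch of that definition follows; the function name `nx` and the example lengths are made up for illustration.

```python
def nx(lengths: list[int], x: float = 50) -> int:
    """Return the Nx statistic (N50 by default) for a list of sequence lengths."""
    threshold = sum(lengths) * x / 100
    running = 0
    # Walk from the longest sequence down until x% of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return 0  # empty input
```

For the hypothetical lengths 80, 70, 50, 40, 30 and 20 (290 bases in total), the two longest contigs already cover 150 bases, so N50 is 70, while N75 is 40.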
26
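Assembly: K-mer
A k-mer is a subsequence of length k. K-mers shared by a pair of reads point to a sequence the two reads have in common, that is, a candidate overlap.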
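To make the k-mer idea concrete, the sketch below lists the k-mers of a read and the k-mers two reads share. The function names and example reads are hypothetical.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(read_a: str, read_b: str, k: int) -> set[str]:
    """K-mers present in both reads; a non-empty result hints at an overlap."""
    return kmers(read_a, k) & kmers(read_b, k)
```

For example, the reads ACGTAC and GTACGG share the single 4-mer GTAC.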
27
Assembly: K-mer
28
Assembly
29
Assembly – algorithms
Repeats!
OLC: Overlap/Layout/Consensus
Overlap: all-against-all overlap discovery with a seed-and-extend heuristic; k-mers serve as alignment seeds, and the seed length affects sensitivity
Layout: construction and manipulation of an overlap graph leads to an approximate read layout
Consensus: multiple sequence alignment (MSA) determines the precise layout and then the consensus sequence. Loading the base calls is constrained by computer memory
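The overlap step can be illustrated with a toy all-against-all comparison. This is a brute-force sketch under simplifying assumptions (exact matches only, no k-mer seeding, no reverse complements), not how any particular assembler implements it; the function names and reads are hypothetical.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that exactly matches a prefix of b.

    Returns 0 when no overlap of at least min_len exists.
    """
    min_len = max(min_len, 1)
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def overlap_graph(reads: list[str], min_len: int = 3) -> dict:
    """Directed overlap graph: (read_a, read_b) -> overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = suffix_prefix_overlap(a, b, min_len)
        if n:
            graph[(a, b)] = n
    return graph
```

The layout step would then look for a path through this graph, and repeats show up as reads with several equally plausible successors.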
30
Assembly vs Repetitive DNA
32
Assembly vs Repetitive DNA and Coverage
Why is coverage important?
resolution
repeat discovery, copy number estimation
binning of metagenomic data
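Average coverage is usually estimated as C = N × L / G, with N reads of length L over a genome of size G. The sketch below applies that formula in both directions. The function names and the example numbers (a 5 Mb genome, 150 bp reads) are illustrative assumptions, not figures from the slides.

```python
import math


def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage, C = N * L / G."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)
```

With these assumed numbers, one million 150 bp reads give 30× average coverage of a 5 Mb genome. A region assembled at roughly twice the average depth is a hint of a collapsed two-copy repeat, which is why coverage helps with copy number estimation.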
33
Assembly vs GC content
Why is GC important?
affecting coverage
HGT (horizontal gene transfer) discovery
binning of metagenomic data
Both GC-rich fragments and AT-rich fragments are underrepresented in the Illumina sequencing results
34
Assembly vs GC content
Less even coverage with Illumina
35
Assembling algorithms and Scaffolders
Velvet and Velvet Optimizer
Newbler
Celera
MaSuRCA
http://en.wikipedia.org/wiki/Sequence_assembly
36
Assembling algorithms and Scaffolders
37
Annotation
38
Ready for Annotation?
Checking gene coverage:
UCOs - Ultra Conserved Orthologs (Kozik et al., 2007)
CEGMA - Core Eukaryotic Genes Mapping Approach (Parra et al., 2007)
SICO genes - single-copy genes, Proteobacteria (Lerat et al., 2003)
Median gene length roughly proportional to genome size
Percent gaps: library insert size vs. 50 “N”s
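One reading of the "percent gaps" note is that scaffold gaps are written as runs of N, either sized from the library insert size or as a fixed placeholder of 50 Ns. Under that assumption, the sketch below reports the share of N bases and the length of each gap run; the function names and the example scaffold are hypothetical.

```python
import re


def percent_gaps(scaffold: str) -> float:
    """Percentage of positions in the scaffold that are N."""
    if not scaffold:
        return 0.0
    return 100 * scaffold.upper().count("N") / len(scaffold)


def gap_runs(scaffold: str) -> list[int]:
    """Lengths of consecutive runs of N, in order of appearance."""
    return [m.end() - m.start() for m in re.finditer(r"N+", scaffold.upper())]
```

Many gaps of exactly 50 would suggest placeholder gaps of unknown size rather than estimates derived from insert sizes.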
39
Ready for Annotation?
UCOs: Sanger, 454, Illumina
40
Annotation of Prokaryotic Genomes
Automated pipelines and annotation software: RAST, BASys, SOP, PROKKA, IMG ER
Gene prediction: GLIMMER, Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)
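To show the kind of signal a prokaryotic gene finder starts from, here is a naive open reading frame (ORF) scan. It is a deliberately simple stand-in and not the method used by GLIMMER or Prodigal, which score candidates with statistical models; it only scans the forward strand, and the function name and minimum length are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Forward-strand ORFs as (start, end) with end exclusive, stop codon included.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon.
    min_codons counts the codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)
```

A real predictor would also scan the reverse complement, consider alternative start codons, and rank overlapping candidates.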
41
Annotation of Prokaryotic Genomes
Repeated errors
Inconsistent gene names
Additional data and postgenomic experiments
42
Annotation of Eukaryotic Genomes
Standard draft assembly
High quality draft assembly
Two phases
1. Computation phase: repeat masking (homopolymers, transposable elements); evidence alignment (proteins, ESTs, RNA-seq data aligned); ab initio gene prediction vs evidence-driven gene prediction
2. Annotation phase: finding a consensus
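Repeat masking can be illustrated for the simplest case named above, homopolymers. The sketch soft-masks long single-base runs by lowercasing them, a common convention for marking repeats without deleting sequence. The function name and the run-length threshold of 6 are assumptions, and masking transposable elements needs a repeat library and a dedicated tool rather than a regular expression.

```python
import re


def soft_mask_homopolymers(seq: str, min_run: int = 6) -> str:
    """Lowercase every run of at least min_run identical uppercase bases."""
    # ([ACGT]) captures one base; \1{n,} requires n more copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: m.group(0).lower(), seq)
```

Downstream gene predictors that honor soft masking can then avoid seeding predictions inside the lowercase regions.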
43
Annotation of Eukaryotic Genomes
Gene prediction and gene annotation are not synonyms!
Predictors do not report untranslated regions (UTRs) or splice variants
44
Annotation of Eukaryotic Genomes