Published by Juliana Payne. Modified over 9 years ago.
1
Steps in a genome sequencing project

Funding and sequencing strategy
  source of funding identified / community drive
  development of sequencing strategy
    random shotgun (chromosome & whole genome): sheared gDNA libraries, physical maps not necessary, fast, whole genome coverage produced quickly, assembly may be problematic
    clone-by-clone (map-as-you-go): BAC, YAC, cosmid libraries & physical maps, slower, data produced less quickly from isolated regions
  procurement of DNA: library construction, test sequencing, analysis of data
  large-scale sequencing of libraries

Assembly and data release
  for shotgun projects:
    at 3X: first assembly, release of genome data
    at 5-6X: ~97% of genes sequenced
    at 8-10X coverage: final assembly
  for clone-by-clone: sequence of clones released as completed

Closure
  gap closure, repeat resolution, identification of mis-assemblies: time-consuming, expensive
  comparison to physical/genetic/optical maps

Gene finding and annotation
  train gene finding algorithms and predict gene models
  genome annotation: auto-annotation vs manual annotation
  genome analysis, comparative genomics, publication, final data release to GenBank
2
Sequencing strategies for long DNA We can’t directly sequence long DNA (yet), but we can assemble the master sequence from smaller pieces.
3
Shotgun Library Construction & Sequencing
Concept:
1) Shred long DNA into lots of random short fragments
2) Sequence both ends of the fragments
3) Reassemble the original DNA from overlapping sequences of the fragments
SOUNDS EASY!
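The first two steps of the concept can be mimicked in a few lines of Python. This is only a toy sketch: the function names, the fixed insert size, and the error-free reads are assumptions made for illustration, not part of the slides or of any real sequencing pipeline.

```python
import random

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))

def shotgun_mate_pairs(genome, n_fragments, insert_size, read_len, seed=0):
    """Simulate random shearing and read both ends of each fragment.

    Hypothetical helper: every fragment has exactly `insert_size` bases
    and reads contain no errors, which real libraries never achieve.
    """
    if not 0 < read_len <= insert_size <= len(genome):
        raise ValueError("need 0 < read_len <= insert_size <= len(genome)")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_fragments):
        # Step 1: pick a random fragment (random start, fixed length).
        start = rng.randrange(0, len(genome) - insert_size + 1)
        fragment = genome[start:start + insert_size]
        # Step 2: read inward from both ends; the second read comes
        # from the opposite strand, hence the reverse complement.
        read_5p = fragment[:read_len]
        read_3p = reverse_complement(fragment[-read_len:])
        pairs.append((read_5p, read_3p))
    return pairs

# Toy usage: a random 2 kb "genome", 1 kb inserts, 50-base reads.
toy_genome = "".join(random.Random(1).choice("ACGT") for _ in range(2000))
toy_pairs = shotgun_mate_pairs(toy_genome, n_fragments=20,
                               insert_size=1000, read_len=50)
print(toy_pairs[0])
```

Step 3, reassembly, is the hard part. The simulation keeps track of which two reads came from the same fragment, and that pairing is the information an assembler relies on later.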
4
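Methods (for random shearing): sonication, syringe, nebulization. NOT RESTRICTION ENZYMES (they cut at fixed recognition sites, so the fragments would not be random).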
5
Size-selected shotgun fragment libraries
Small insert library provides most of the sequence coverage (contigs)
Large insert libraries help order the contigs (and scaffolds)
6
[Figure: two mate pairs, one with ~1 kb between the reads and one with ~9 kb between the reads. Each mate pair consists of a 5’ end read and a 3’ end read.]
7
Assembly of contigs from mate pairs
must have high-quality (well-trimmed) input DNA, to reduce false overlaps
reads must be mostly mate pairs (<25% single reads)
library insert size variance must be kept low (<10%) for accurate prediction of the distance between mate-pair sequences
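The role of overlaps can be illustrated with a naive suffix/prefix check. This is a minimal sketch, assuming error-free reads and exact matching; `overlap_length`, `merge_reads`, and the `min_overlap` threshold are hypothetical names chosen here, and real assemblers use far more tolerant and efficient methods.

```python
def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    A higher threshold means fewer chance (false) overlaps.
    """
    min_overlap = max(min_overlap, 1)
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def merge_reads(a, b, min_overlap=3):
    """Merge `b` onto the end of `a` if they overlap, else return None."""
    k = overlap_length(a, b, min_overlap)
    if k == 0:
        return None
    return a + b[k:]

print(overlap_length("ACGTTGCA", "TGCAGGT"))  # 4 (shared "TGCA")
print(merge_reads("ACGTTGCA", "TGCAGGT"))     # ACGTTGCAGGT
```

Poorly trimmed read ends tend to produce spurious short matches in a check like this, which is one way to see why well-trimmed input matters.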
8
Scaffolds, or ‘Why we sequence mate pairs from longer fragments’
[Figure label: low-complexity/repetitive]
Knowing the sizes of inserts can tell us roughly what we don’t know (sometimes).
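The arithmetic behind "roughly what we don't know" can be sketched as follows. The coordinate conventions and function names are assumptions made for this example: a mate pair spans two contigs, and the gap is whatever part of the insert is not accounted for inside either contig.

```python
def estimate_gap(insert_size, left_contig_len, left_read_start, right_read_end):
    """Estimate the gap (bp) between two contigs bridged by one mate pair.

    insert_size     : expected fragment length, outer end to outer end
    left_contig_len : length of the contig holding the 5' end read
    left_read_start : 0-based start of the 5' end read in the left contig
    right_read_end  : 0-based exclusive end of the 3' end read in the
                      right contig (bases of the fragment inside it)
    """
    inside_left = left_contig_len - left_read_start
    inside_right = right_read_end
    return insert_size - inside_left - inside_right

def gap_range(insert_size, insert_spread, left_contig_len,
              left_read_start, right_read_end):
    """Same estimate, widened by the library's insert-size spread (bp)."""
    gap = estimate_gap(insert_size, left_contig_len,
                       left_read_start, right_read_end)
    return gap - insert_spread, gap + insert_spread

# A ~9 kb insert with 3,000 bp in the left contig and 2,500 bp in the right:
print(estimate_gap(9000, 20000, 17000, 2500))    # 3500
print(gap_range(9000, 900, 20000, 17000, 2500))  # (2600, 4400)
```

A negative estimate would suggest the two contigs actually overlap, or that the pair was placed incorrectly. The wider the insert-size spread, the less the estimate says, which is why low library variance matters.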
9
Scaffolds into chromosomes
10
What does “8X coverage” mean??
Two ways of thinking about COVERAGE:
- The average number of times any given base in the genome was sequenced (in this case, each base was read 8 times on average; of course a particular base may have been read more or fewer than 8 times).
- Also, the amount of sequence that was obtained, relative to the length of the whole genome (in this case, the aggregate length of all reads was 8 times the genome length).
Lander & Waterman (1988) determined that for an ideal genome project (no ‘difficult’ regions) 8X-10X coverage is sufficient to confidently complete the genome.
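Both views of coverage reduce to the same number, which a short calculation makes concrete. The sketch below also uses the standard Poisson idealization (reads placed uniformly at random, no difficult regions) to estimate the fraction of bases no read touches. The function names and example numbers are made up for illustration.

```python
import math

def coverage(n_reads, read_len, genome_len):
    """Aggregate read length divided by genome length (the 'X' value)."""
    return n_reads * read_len / genome_len

def expected_fraction_unsequenced(c):
    """Chance that a given base is hit by zero reads at coverage c.

    Poisson approximation for an idealized genome; real genomes with
    repeats and biased regions do worse than this.
    """
    return math.exp(-c)

c = coverage(n_reads=80_000, read_len=500, genome_len=5_000_000)
print(c)  # 8.0
print(expected_fraction_unsequenced(c))
```

Under this idealization only a tiny fraction of bases is missed at 8X, so the gaps left in real projects come mostly from regions that violate the model rather than from sampling bad luck.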
11
NO EUKARYOTIC GENOME IS THAT WELL-BEHAVED
So even with 8X shotgun coverage there’s likely at least ~1% of the genome remaining to be finished, by more laborious and expensive means. (The human genome... are we there yet??)
Some genomes are relatively well-behaved: nearly all sequence reads were assembled into contigs, then scaffolds, then chromosomes, with relatively few or no gaps remaining (e.g., Plasmodium falciparum).
Some genomes are very badly behaved and far from finished; reads may remain unassigned to contigs, much less scaffolds, much less chromosomes. There are lots of gaps (Ns) and lots of repeats. E.g., the Trichomonas vaginalis genome: huge, highly repetitive, AT-rich; low-quality sequence was allowed in to increase coverage/gene calls in ‘difficult’ regions.
12
Finishing
Closure of gaps between contigs/scaffolds
Correction of misassemblies
Resequencing of low-coverage/low-quality regions
This is usually the most time-consuming part of the project. Repeat/low-complexity regions can be hard to sequence, and it can be hard to know where to ‘put’ them in the final assembly.
13
Sequence hierarchy
genome (all chromosomes)
Chromosome (one or more scaffolds... ultimately one contig!): ordered sets w/gaps
Scaffold (two or more contigs): ordered sets w/gaps, size estimated
contig: overlapping, ordered sets, no gaps
reads (mate-pair & single)
Scaffolds and contigs are not biological entities.
14
Post-sequencing steps
Automated gene calling (setting boundaries)
Annotation (guessing function)
Manual: refining gene models, correcting annotation
Should be an ONGOING process... wish it was
15
OTHER STUFF (demonstrated on the websites)
Adding columns
Sorting (some are presorted)
Gaps: more than one N (within scaffold, gap between scaffolds), vs ambiguities (contig) (see P. falciparum)
Chromosome as one giant contig... or one giant scaffold
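The gap-versus-ambiguity distinction can be checked on a sequence string with a simple scan for runs of N. This sketch assumes the convention stated above (a run of more than one N is a gap, a lone N is an ambiguous base call); the function names are hypothetical, and individual databases may encode gaps differently.

```python
import re

def n_runs(seq):
    """Return (start, length) for each maximal run of N in the sequence."""
    return [(m.start(), len(m.group()))
            for m in re.finditer(r"N+", seq.upper())]

def classify_n_runs(seq):
    """Split N runs into gaps (length > 1) and ambiguities (length == 1)."""
    runs = n_runs(seq)
    gaps = [run for run in runs if run[1] > 1]
    ambiguities = [run for run in runs if run[1] == 1]
    return gaps, ambiguities

gaps, ambiguities = classify_n_runs("ACGTNACGTNNNNNACGNT")
print(gaps)         # [(9, 5)]
print(ambiguities)  # [(4, 1), (17, 1)]
```

A sequence with no gap-length runs behaves like one giant contig; one with such runs is better thought of as a scaffold.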