Genome Characterization: Assembly/sequencing. BIO520 Bioinformatics, Jim Lund. Assigned reading: Ch 9.


1 Genome Characterization: Assembly/sequencing. BIO520 Bioinformatics, Jim Lund. Assigned reading: Ch 9

2 The (original) genome sequencing process
Organism Selection -> Library Creation -> Sequencing -> Assembly -> Gap Closure -> Finishing -> Annotation

3 The (current) genome sequencing process
Organism Selection -> Sequencing -> Assembly -> Annotation
Next-gen random sequencing lets library generation be skipped.
Gap closure and finishing are often skipped, at least for now.

4 Contigs, Islands
(Figure; labels: contigs, Island)

5 Assembly pipeline
1. Sequence reads.
2. Phred: base calling.
3. cross_match: screen out vector, E. coli sequence.
4. Phrap: assemble contigs.
5. Consed: view assembly, correct problems.
6. Finishing.

6 Assembly Methods
Strip out vector (or contaminant)
Mask known repeats
Trim off unreliable data
Find matches (n seq x n seq comparisons)
  – how long (what ktuple [10 common])
  – how perfect (reliability index)
  – where to look? (ends only vs entire)
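The "find matches" step can be sketched as a k-tuple index, which avoids comparing every read against every other read in full. This is only an illustration under stated assumptions: the function name `candidate_overlaps` and the toy reads are hypothetical, only exact k-tuple hits on the forward strand are considered, and real assemblers such as phrap go on to align and score the candidates.

```python
from collections import defaultdict
from itertools import combinations


def candidate_overlaps(reads, k=10):
    """Return pairs of read names that share at least one exact k-tuple.

    Hypothetical helper for illustration; forward strand only.
    """
    index = defaultdict(set)  # k-tuple -> names of the reads that contain it
    for name, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)

    pairs = set()
    for names in index.values():
        # Any two reads sharing this k-tuple are candidates for alignment.
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs


# Toy reads (made up): r1 and r2 share a 14-base stretch, r3 shares nothing.
reads = {
    "r1": "ACGTACGTTAGCCGATAGGCTA",
    "r2": "TAGCCGATAGGCTATTTGACCA",
    "r3": "GGGGGGGGCCCCCCCCAAAAAA",
}
print(candidate_overlaps(reads, k=10))
```

A longer k-tuple gives fewer chance matches but misses overlaps that contain a sequencing error within every window, which is the "how long" trade-off on the slide.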

7 Assembly Programs
PHRAP FAMILY
  – phred/phrap/consed/cross_match
  – Developed by Phil Green, University of Washington
Other assemblers
  – phrap, kangaroo, phrapo
  – CAP, TIGRAssembler,...
http://www.phrap.org/

8 Assembly
Phred - reads DNA sequencing trace files, calls bases, and assigns a quality value to each called base.
  – The quality value is a log-transformed error probability, specifically: Q = -10 log10(Pe)
  – Q = quality value, Pe = error probability.
  – Q = 20 -> 1% chance of miscall, Q = 30 -> 0.1% chance of miscall.
Phrap - assembles shotgun DNA sequence data.
Consed/Autofinish - view, edit, and finish sequence assemblies created with phrap.
  – Allows the user to pick primers and templates
  – Suggests additional sequencing reactions
  – Suggests digests and forward/reverse pair information to check accuracy of assembly.
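The quality formula Q = -10 log10(Pe) can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only; the function names are chosen here for illustration and are not part of phred.

```python
import math


def error_prob_to_quality(pe):
    """Q = -10 * log10(Pe), the phred quality for error probability Pe."""
    return -10 * math.log10(pe)


def quality_to_error_prob(q):
    """Inverse of the formula above: Pe = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    print(f"Q={q}: miscall probability {quality_to_error_prob(q):.4%}")
```

Each increase of 10 in Q corresponds to a tenfold drop in the chance of a miscall.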

9 Poisson statistics for sequencing completion
P0 = e^(-LN/G)
L = read length, N = number of reads, G = genome size
E. coli 15kb, H. sapiens 900kb
Coverage 1 = 1-fold = 1X

Coverage (X)      1    3    8     10     50
% not sequenced   37   5    0.03  0.005  < 1e-20
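The Poisson estimate P0 = e^(-LN/G) is easy to evaluate directly. The sketch below assumes reads land uniformly at random; the function names and the 150 kb / 500 bp example values are chosen here for illustration.

```python
import math


def fraction_unsequenced(read_length, n_reads, genome_size):
    """P0 = exp(-L*N/G): chance that a given base is covered by no read."""
    coverage = read_length * n_reads / genome_size
    return math.exp(-coverage)


def reads_needed(read_length, genome_size, coverage):
    """Number of reads N that gives the requested fold coverage c = L*N/G."""
    return math.ceil(coverage * genome_size / read_length)


# Illustrative values: a 150 kb target sequenced with 500 bp reads.
for c in (1, 3, 8, 10):
    n = reads_needed(500, 150_000, c)
    p0 = fraction_unsequenced(500, n, 150_000)
    print(f"{c}X ({n} reads): {p0:.4%} not sequenced")
```

Because P0 depends only on the fold coverage LN/G, the same percentages apply to any genome size at equal coverage.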

10 Gaps
Number of gaps = N e^(-c)
N = number of reads, c = fold coverage
150 kb target clone, 500 bp reads

Coverage (X)   Reads    Gaps
1              300      111
5              1500     10
8              2400     1
10             3000     0
50             15000

11 Gaps
Number of gaps = N e^(-c)
N = number of reads, c = fold coverage
Human genome, 3 Gb, 1,000 bp reads

Coverage (X)   Reads                              Gaps
1              3e6                                1,000,000
5              1.5e7                              100,000
8              2.4e7                              8,000
10             3e7                                1,400
50             3.75e8 (454 Seq, 400 bp reads)     7
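The gap estimate, number of gaps = N e^(-c), can be sketched in Python as follows. The function name `expected_gaps` is an assumption made for this example, and the model ignores repeats, cloning bias, and the minimum overlap needed to join two reads.

```python
import math


def expected_gaps(genome_size, read_length, coverage):
    """Expected number of gaps, N * exp(-c), with N = c * G / L reads."""
    n_reads = coverage * genome_size / read_length
    return n_reads * math.exp(-coverage)


# 150 kb clone with 500 bp reads, and a 3 Gb genome with 1,000 bp reads.
for c in (1, 5, 8, 10):
    clone = expected_gaps(150_000, 500, c)
    genome = expected_gaps(3_000_000_000, 1_000, c)
    print(f"{c}X: clone ~{clone:.1f} gaps, genome ~{genome:,.0f} gaps")
```

The gap count falls roughly exponentially with coverage, which is why a small clone closes by random sequencing at about 8X to 10X while a whole genome still has thousands of gaps at the same depth.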

12 Contigs, Islands
(Figure; labels: contigs, Island; bases T, T, T, C)

13 Finishing
GOALS
  – >95% coverage on BOTH strands
  – every base covered 3X
  – resolve ambiguities
Finish when random sequencing is no longer productive (~8X range)
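The first two finishing goals can be checked mechanically once read placements are known. The sketch below is an illustration under assumptions: reads are given as hypothetical (start, end, strand) tuples with 0-based half-open coordinates on the consensus, and the function name `finishing_report` is made up for this example.

```python
def finishing_report(length, reads, min_depth=3, min_both_strands=0.95):
    """Check per-base depth and both-strand coverage against the goals."""
    fwd = [0] * length  # depth from forward-strand reads
    rev = [0] * length  # depth from reverse-strand reads
    for start, end, strand in reads:
        depth = fwd if strand == "+" else rev
        for pos in range(start, end):
            depth[pos] += 1

    both = sum(1 for f, r in zip(fwd, rev) if f > 0 and r > 0) / length
    shallow = [i for i in range(length) if fwd[i] + rev[i] < min_depth]
    return {
        "both_strand_fraction": both,
        "shallow_positions": shallow,
        "goals_met": both > min_both_strands and not shallow,
    }


# Toy 10-base consensus: the last two bases are covered only 2X.
report = finishing_report(10, [(0, 10, "+"), (0, 10, "-"), (0, 8, "+")])
print(report)
```

The positions reported as shallow are the ones that would need directed reads rather than more random sequencing.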

14 Sequence finishing. How?
Identify gaps, ambiguities
  – Captured gaps: the gap is contained in a clone
Extend from end of contigs
  – Resequencing, new chemistry.
  – Specific primers
  – Subcloning and sequencing.
Uncaptured gaps.
  – New specific primers
  – PCR across gap, sequence PCR product.
Resolve ambiguities
  – Consensus or resequence: specific primers, different chemistry

15 Large clone sequencing process
Phase 1: Unfinished, may be unordered/unoriented contigs, with gaps.
Phase 2: Unfinished, fully oriented and ordered sequence, may contain gaps and low-quality sequences.
Phase 3: Finished, no gaps.

16 Genome assembly after initial contigs are made
Order clones/contig sequences:
  – Sequence overlaps: clone/contig end sequences.
  – Clone fingerprints.
  – Anchor using other maps: sequence-based markers on genetic or physical maps; conserved synteny to other genomes.
Easiest when re-sequencing, e.g., another human genome!

17 Process Control
LIMS
  – Laboratory information management system
AIMS
  – Analysis information management system

18 Hard genome sequencing problems Repeats Complex genome structures Where does a clone from a repetitive region map?

19 Approaches to sequence repeat problems
Multiple fragment sizes in 1 project
Use length/distance info
New assemblers, e.g., ARACHNE

20 Results of Multi-length Fragment Assembly
Contigs
"Supercontigs"
Clone links for finishing
Clone map
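One way to picture how clone links turn contigs into supercontigs is to group contigs that are joined by enough paired-end links. The sketch below is a simplified illustration, not the method used by ARACHNE or any specific assembler: the function name, the `min_links` threshold, and the toy data are all assumptions, and it only groups contigs without ordering or orienting them.

```python
from collections import Counter


def build_supercontigs(contigs, clone_links, min_links=2):
    """Group contigs connected by at least min_links clone links."""
    # Count links per unordered contig pair, ignoring within-contig pairs.
    support = Counter(
        tuple(sorted(link)) for link in clone_links if link[0] != link[1]
    )

    parent = {c: c for c in contigs}  # union-find forest

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    for (a, b), n in support.items():
        if n >= min_links:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


# Toy data: each link is the pair of contigs hit by the two ends of one clone.
links = [("c1", "c2"), ("c2", "c1"), ("c2", "c3"), ("c2", "c3"), ("c3", "c4")]
print(build_supercontigs(["c1", "c2", "c3", "c4"], links))
```

Requiring more than one link per join is a common precaution, since a single link may come from a chimeric clone or a read placed in the wrong copy of a repeat.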

21 DOE Joint Genome Institute (JGI) Prokaryote Finishing Standards
All low-quality areas (<Q30) are reviewed and resequenced.
The final error rate must be less than 0.2 per 10 Kb.
No single-clone coverage is permitted (minimum of 2x depth everywhere).
Single-stranded regions are manually inspected and quantified.
All positions where an aligned high-quality read (>Q29) disagrees with the consensus base are checked.
All strings of xxxx are resolved in the final sequence.
All repeats are verified.
The ends of final contigs (chromosomes, plasmids) are checked.
The final assembly is given a manual QC check.
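Two of these standards lend themselves to a quick automated screen: locating stretches below Q30 and estimating the error rate per 10 Kb. The sketch below assumes a list of phred-scaled consensus qualities, one per base, and estimates errors as the sum of 10^(-Q/10); whether JGI measures its error rate this way is not stated in the slides, and the function names are hypothetical.

```python
def low_quality_regions(qualities, threshold=30):
    """Return half-open (start, end) runs where quality is below threshold."""
    regions, start = [], None
    for i, q in enumerate(qualities):
        if q < threshold and start is None:
            start = i  # a low-quality run begins here
        elif q >= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(qualities)))
    return regions


def expected_errors_per_10kb(qualities):
    """Estimated errors per 10,000 bases from phred-scaled qualities."""
    expected = sum(10 ** (-q / 10) for q in qualities)
    return expected / len(qualities) * 10_000


quals = [40] * 5 + [20] * 3 + [40] * 2  # made-up consensus qualities
print(low_quality_regions(quals))
print(expected_errors_per_10kb([40] * 100))
```

Under this estimate a uniform Q40 consensus gives about 1 expected error per 10 Kb, so meeting a 0.2 per 10 Kb target needs an average quality of roughly Q47.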

22 Completed genomes
23 complete, 329 in assembly, 389 in progress
Groups: Plants, Animals, Protists, Fungi
http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi
Arabidopsis thaliana, Caenorhabditis elegans, Candida glabrata, Cryptococcus neoformans, Cyanidioschyzon merolae, Debaryomyces hansenii, Drosophila melanogaster, Encephalitozoon cuniculi, Entamoeba histolytica, Eremothecium gossypii, Homo sapiens, Kluyveromyces lactis, Leishmania major Friedlin, Mus musculus, Oryza sativa, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Trypanosoma cruzi, Yarrowia lipolytica

23 Genomes
Complete Eukaryotes: 23 complete, 329 in assembly, 389 in progress
  – Human, mouse, rat, zebrafish
  – Homo sapiens neanderthalensis
  – Drosophila, Anopheles, Caenorhabditis
  – Arabidopsis, oat, corn, barley, rice, tomato
  – Saccharomyces, Schizosaccharomyces, Magnaporthe, Cryptococcus, Candida…
  – Encephalitozoon cuniculi, Guillardia theta
  – Toxoplasma, Plasmodium
  – And many more…

24 Eubacteria and Archaea genomes
608 Bacteria and 48 Archaea completed
Comprehensive Microbial Resource
  – http://pathema.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi
Joint Genome Institute
  – http://www.jgi.doe.gov/genome-projects/
  – 2065 genome projects underway or completed!
NCBI Genomes

25 Genome Centers
Joint Genome Institute (DOE)
Whitehead Institute (MIT)
TIGR
Washington University (St. Louis)
Celera
Sanger Institute (the other UK)
RIKEN (Japan)
Beijing Genomics Institute (China)
Max Planck (Germany)
…

26 Where do you find Genomic data?
NCBI
  – Entrez (by clone, by RefSeq)
  – Genome (view and search map)
Genome center sites
Organism genome project sites
Annotation projects
  – UCSC Genome Browser
  – Ensembl Genome Browser

27 Arabidopsis http://mips.helmholtz-muenchen.de/plant/athal/index.jsp

28 C. elegans (nematode) http://wormbase.org

