Genome Characterization: Assembly/sequencing. BIO520 Bioinformatics, Jim Lund. Assigned reading: Ch 9.



The (original) genome sequencing process
Organism Selection -> Library Creation -> Sequencing -> Assembly -> Gap Closure -> Finishing -> Annotation

The (current) genome sequencing process
Organism Selection -> Sequencing -> Assembly -> Annotation
–Next-gen random sequencing lets library generation be skipped.
–Gap closure and finishing often get skipped, at least for now.

Contigs, Islands
[Figure: contigs grouped into an island]

Assembly pipeline
1. Sequence reads.
2. Phred: base calling.
3. cross_match: screen out vector, E. coli sequence.
4. Phrap: assemble contigs.
5. Consed: view assembly, correct problems.
6. Finishing.

Assembly Methods
Strip out vector (or contaminant)
Mask known repeats
Trim off unreliable data
Find matches (n seq x n seq comparisons)
–how long (what ktuple [10 common])
–how perfect (reliability index)
–where to look? (ends only vs entire)
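The "find matches" step can be illustrated with a minimal sketch, assuming exact shared k-tuples are used only to nominate candidate read pairs. This is not how phrap or any named assembler is implemented; `candidate_overlaps` and the example reads are hypothetical. Indexing k-tuples avoids aligning every read against every other read.

```python
from collections import defaultdict
from itertools import combinations

def candidate_overlaps(reads, k=10):
    """Return pairs of read indices that share at least one exact k-tuple."""
    index = defaultdict(set)  # k-tuple -> indices of reads containing it
    for i, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add(i)
    pairs = set()
    for ids in index.values():
        # Every pair of reads sharing this k-tuple is a candidate overlap.
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Hypothetical reads: the first two share the 11-base stretch AATAGGTTCCA.
reads = [
    "ACGATTACAATAGGTTCCA",
    "AATAGGTTCCAGGTTAACC",
    "TTTTTTTTTTGGGGGGGGGG",
]
print(candidate_overlaps(reads, k=10))
```

Candidate pairs would still need a real alignment and a quality check ("how perfect") before being accepted as overlaps.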

Assembly Programs
PHRAP FAMILY
–phred/phrap/consed/cross_match
–Developed by Phil Green, U of Wash.
Other assemblers
–phrap, kangaroo, phrapo,
–CAP, TIGRAssembler,...

Assembly
Phred: reads DNA sequencing trace files, calls bases, and assigns a quality value to each called base.
–The quality value is a log-transformed error probability, specifically: Q = -10 log10(Pe)
–Q = quality value, Pe = error probability.
–Q = 20 -> 1% chance of miscall, Q = 30 -> 0.1% chance of miscall.
Phrap: assembles shotgun DNA sequence data.
Consed/Autofinish: view, edit, and finish sequence assemblies created with phrap.
–Allows the user to pick primers and templates.
–Suggests additional sequencing reactions.
–Suggests digests and forward/reverse pair information to check accuracy of assembly.
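The quality formula Q = -10 log10(Pe) can be checked numerically. The following is a small sketch; the function names are illustrative and are not part of phred itself.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Error probability Pe for a Phred quality value Q (Pe = 10^(-Q/10))."""
    return 10 ** (-q / 10)

def error_prob_to_phred(pe: float) -> float:
    """Phred quality value Q = -10 * log10(Pe)."""
    return -10 * math.log10(pe)

for q in (10, 20, 30, 40):
    print(f"Q={q}: chance of miscall = {phred_to_error_prob(q):.4%}")
```

Each step of 10 in Q corresponds to a tenfold drop in the chance of a miscall.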

Poisson statistics for sequencing completion
P_0 = e^{-LN/G}
L = read length, N = # reads, G = genome size
Coverage 1 = 1-fold = 1X
[Table residue: coverage vs. % not sequenced; surviving entries: E. coli 15kb, H. sapiens 900kb, < 1e20]
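The Poisson estimate P_0 = e^{-LN/G} can be tabulated with a short sketch. The 150 kb target, 500 bp read length, and coverage values below are illustrative inputs, not a reconstruction of the slide's table.

```python
import math

def fraction_unsequenced(read_length: float, n_reads: float, genome_size: float) -> float:
    """P0 = e^(-L*N/G): chance that a given base is covered by no read."""
    coverage = read_length * n_reads / genome_size
    return math.exp(-coverage)

# Illustrative inputs (assumptions): 150 kb target, 500 bp reads.
genome_size = 150_000
read_length = 500
for coverage in (1, 5, 8, 10):
    n_reads = coverage * genome_size / read_length
    p0 = fraction_unsequenced(read_length, n_reads, genome_size)
    print(f"{coverage}X: {p0:.4%} of bases expected unsequenced")
```

Multiplying P_0 by the genome size gives the expected number of unsequenced bases under the same model.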

Gaps
Number of Gaps = N e^{-c}
N = # of reads, c = fold coverage
150kb Target Clone, 500 bp reads
Coverage, reads: 1, 300; 5, [remaining table values and the Gaps column lost in extraction]

Gaps
Number of Gaps = N e^{-c}
N = # of reads, c = fold coverage
Human genome, 3Gb, 1,000 bp reads

Coverage   Reads
1          3e6
5          1.5e7
8          2.4e7
10         3e7
50         3.75e8   (454 Seq, 400bp reads)

Gaps (values as extracted, partly garbled): 1,000, ,000 8,000 1,400 7
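The gap formula N e^{-c} can be evaluated directly. This is a sketch under the slides' stated settings; the helper names are hypothetical, and coverages 8 and 10 for the 150 kb clone are assumed by analogy with the human genome slide, since those rows did not survive.

```python
import math

def reads_needed(target_bp: float, read_bp: float, coverage: float) -> float:
    """Number of reads N giving the requested fold coverage c."""
    return target_bp * coverage / read_bp

def expected_gaps(n_reads: float, coverage: float) -> float:
    """Expected number of gaps, N * e^(-c)."""
    return n_reads * math.exp(-coverage)

cases = [
    ("150 kb clone, 500 bp reads", 150_000, 500, (1, 5, 8, 10)),
    ("3 Gb genome, 1,000 bp reads", 3_000_000_000, 1_000, (1, 5, 8, 10)),
    ("3 Gb genome, 400 bp reads", 3_000_000_000, 400, (50,)),
]
for label, target_bp, read_bp, coverages in cases:
    print(label)
    for c in coverages:
        n = reads_needed(target_bp, read_bp, c)
        print(f"  {c}X: {n:.3g} reads, about {expected_gaps(n, c):.3g} gaps")
```

The exponential term dominates: going from 8X to 10X adds only 25% more reads but cuts the expected gap count several-fold.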

Contigs, Islands
[Figure: contigs grouped into an island; base labels T T T C]

Finishing
GOALS
–>95% coverage on BOTH strands
–every base covered 3X
–resolve ambiguities
Finish when random sequencing is no longer productive (~8X range)

Sequence finishing. How?
Identify gaps, ambiguities
Captured gaps: the gap is contained in a clone. Extend from end of contigs:
–Resequencing, new chemistry.
–Specific primers.
–Subcloning and sequencing.
Uncaptured gaps:
–New specific primers.
–PCR across gap, sequence PCR product.
Resolve ambiguities:
–Consensus or resequence.
–Specific primers, different chemistry.

Large clone sequencing process
Phase 1: Unfinished, may be unordered/unoriented contigs, with gaps.
Phase 2: Unfinished, fully oriented and ordered sequence, may contain gaps and low-quality sequences.
Phase 3: Finished, no gaps.

Genome assembly after initial contigs are made
Order clones/contig sequences:
–Sequence overlaps. Clone/contig end sequences.
–Clone fingerprints.
–Anchor using other maps: sequence-based markers on genetic or physical maps; conserved synteny to other genomes.
Easiest when re-sequencing, e.g., another human genome!

Process Control
LIMS: Laboratory information management system
AIMS: Analysis information management system

Hard genome sequencing problems
–Repeats
–Complex genome structures
Where does a clone from a repetitive region map?

Approaches to sequence repeat problems
–Multiple fragment sizes in 1 project
–Use length/distance info
–New assemblers, e.g. ARACHNE

Results of Multi-length Fragment Assembly
–Contigs
–“Supercontigs”
–Clone links for finishing
–Clone map

DOE Joint Genome Institute (JGI) Prokaryote Finishing Standards
–All low-quality areas (<Q30) are reviewed and resequenced.
–The final error rate must be less than 0.2 per 10 Kb.
–No single-clone coverage is permitted (minimum of 2x depth everywhere).
–Single-stranded regions are manually inspected and quantified.
–All positions where an aligned high-quality read (>Q29) disagrees with the consensus base are checked.
–All strings of xxxx are resolved in the final sequence.
–All repeats are verified.
–The ends of final contigs (chromosomes, plasmids) are checked.
–The final assembly is given a manual QC check.
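Two of the numeric criteria above (the Q30 review threshold and the 0.2 errors per 10 Kb limit) lend themselves to a simple automated check. The sketch below is illustrative only and is not JGI's actual tooling; the function names and the example quality values are hypothetical.

```python
def low_quality_runs(quals, threshold=30):
    """Return half-open (start, end) intervals where quality < threshold."""
    runs, start = [], None
    for i, q in enumerate(quals):
        if q < threshold and start is None:
            start = i                      # a low-quality run begins
        elif q >= threshold and start is not None:
            runs.append((start, i))        # the run ended just before i
            start = None
    if start is not None:
        runs.append((start, len(quals)))   # run extends to the end
    return runs

def meets_error_rate(expected_errors, length_bp, max_per_10kb=0.2):
    """True if the error rate is below max_per_10kb errors per 10 Kb."""
    return expected_errors / length_bp * 10_000 < max_per_10kb

# Hypothetical consensus quality values for a 7-base stretch.
quals = [40, 40, 25, 20, 45, 50, 10]
print(low_quality_runs(quals))
print(meets_error_rate(expected_errors=1, length_bp=100_000))
```

For scale, 0.2 errors per 10 Kb is a per-base error probability of 2e-5, which corresponds to roughly Q47 under Q = -10 log10(Pe).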

Completed genomes
23 complete, 329 in assembly, in progress 389
Legend: Plants, Animals, Protists, Fungi
Arabidopsis thaliana
Caenorhabditis elegans
Candida glabrata
Cryptococcus neoformans
Cyanidioschyzon merolae
Debaryomyces hansenii
Drosophila melanogaster
Encephalitozoon cuniculi
Entamoeba histolytica
Eremothecium gossypii
Homo sapiens
Kluyveromyces lactis
Leishmania major Friedlin
Mus musculus
Oryza sativa
Saccharomyces cerevisiae
Schizosaccharomyces pombe
Trypanosoma cruzi
Yarrowia lipolytica

Genomes Complete
Eukaryotes: 23 complete, 329 in assembly, in progress 389
–Human, mouse, rat, zebrafish,
–Homo sapiens neanderthalensis
–Drosophila, Anopheles, Caenorhabditis
–Arabidopsis, oat, corn, barley, rice, tomato
–Saccharomyces, Schizosaccharomyces, Magnaporthe, Cryptococcus, Candida…
–Encephalitozoon cuniculi, Guillardia theta
–Toxoplasma, Plasmodium
–And many more…

Eubacteria and Archaea genomes
608 Bacteria and 48 Archaea completed
Comprehensive Microbial Resource
–http://pathema.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi
Joint Genome Institute
–2065 genome projects underway or completed!
NCBI Genomes

Genome Centers
–Joint Genome Institute (DOE)
–Whitehead Institute (MIT)
–TIGR
–Washington University (St. Louis)
–Celera
–Sanger Institute (the other UK)
–RIKEN (Japan)
–Beijing Genomics Institute (China)
–Max Planck (Germany)
–…

Where do you find Genomic data?
NCBI
–Entrez (by clone, by Refseq)
–Genome (view and search map)
Genome center sites
Organism genome project sites
Annotation projects
–UCSC Genome Browser,
–Ensembl Genome Browser

Arabidopsis

C. elegans (nematode)