Advancing Science with DNA Sequence Microbial Genome Assembly and Finishing Alla Lapidus, Ph.D. Microbial genomics DOE Joint Genome Institute, Walnut Creek,

Presentation transcript:

Advancing Science with DNA Sequence Microbial Genome Assembly and Finishing Alla Lapidus, Ph.D. Microbial genomics DOE Joint Genome Institute, Walnut Creek, CA

A typical microbial project
Flowchart stages: Sequencing, Contigs, Base calling, Quality screening, Auto-assembly, Vector screening, Gap closure, FINISHING, Assembly, Public release, Annotation

Processing microbial projects (sequencing)
- Sanger only (yesterday)
  - 4x coverage in 3 kb + 4x in 8 kb + fosmids to 1x if possible
  - Total ~$50k for a 5 Mb genome draft
- Hybrid Sanger/pyrosequence/Solexa (today)
  - 4x coverage 8 kb Sanger + 20x coverage 454 shotgun + 20x Solexa (quality improvement)
  - Total ~$35k for a 5 Mb genome draft
- Solexa (tomorrow, starting this week)
  - 20x coverage 454 standard + 4x coverage 454 paired end (PE) + 50x coverage Solexa shotgun (quality improvement; gaps)
  - Total ~$10k per 5 Mb genome draft

Assembly (assembler)
- Sanger reads only (phrap, PGA, Arch, etc.). Diagram labels: 3kb, 8kb, kb
- Hybrid Sanger/pyrosequence/Solexa (no special assemblers; use PGA and Arachne). Diagram labels: 454 contig, 8kb, shreds
- 454/Solexa (Newbler, PCAP), 454 reads only. Diagram labels: Shotgun reads, PE reads

Role of Solexa data: "The Polisher"
- Align Solexa reads
- Identify errors
- Automatically suggest corrections for manual curation
- Automatically suggest and implement corrections
[Diagram: bases G, T, A; "List Disc": x1 - G, x2 - T, x3 - A, etc.]
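The align / identify / suggest steps can be illustrated with a small pileup vote. This is only a sketch under stated assumptions, not the JGI Polisher itself: the function name `suggest_corrections`, the `(start, sequence)` read representation, and the `min_depth` and `min_fraction` thresholds are all hypothetical, and the reads are assumed to be already aligned without gaps, so only substitutions (not frame shifts) are reported.

```python
from collections import Counter


def suggest_corrections(consensus, aligned_reads, min_depth=3, min_fraction=0.8):
    """Suggest base substitutions where short reads disagree with the consensus.

    aligned_reads is a list of (start, sequence) pairs: ungapped alignments
    with a 0-based start position on the consensus (an assumed format).
    Returns a list of (position, consensus_base, suggested_base, depth).
    """
    # One base counter per consensus position.
    piles = [Counter() for _ in consensus]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            # Ignore bases that fall off the consensus and ambiguous calls (N).
            if 0 <= pos < len(consensus) and base in "ACGT":
                piles[pos][base] += 1

    suggestions = []
    for pos, pile in enumerate(piles):
        depth = sum(pile.values())
        if depth < min_depth:
            continue  # too little evidence to suggest anything
        base, count = pile.most_common(1)[0]
        if base != consensus[pos] and count / depth >= min_fraction:
            suggestions.append((pos, consensus[pos], base, depth))
    return suggestions


if __name__ == "__main__":
    reads = [(0, "ACCTA"), (1, "CCTAC"), (2, "CTACG"), (0, "ACCT")]
    print(suggest_corrections("ACGTACGT", reads))
```

Returning suggestions rather than editing the consensus in place mirrors the "suggest corrections for manual curation" mode on the slide; applying them automatically would be a separate step.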

Errors corrected by Solexa
[Figure: Solexa reads aligned against a 454 contig, with a frame shift detected in the 454 contig. Track labels: 454 contig, Finished consensus, Sanger reads.]

Assembly: unordered set of contigs
- What we get
- Clone walk (Sanger lib)
- Ordered sets of contigs (scaffolds)
- New technologies: no clones to walk off
- PCR - sequence (diagram labels: pri1, pri2, PCR product)

Why do we have gaps
What are gaps (Sanger)? Genome areas not covered by random shotgun.
- Sequencing coverage may not span all regions of the genome, thus producing gaps in the assembly.
- Assembly of the shotgun reads may produce misassembled regions due to repetitive sequences.
- A biased base content (this can result in failure to be cloned, poor stability in the chosen host-vector system, or inability of the polymerase to reliably copy the sequence):
  - AT-rich DNA clones poorly in bacteria (cloning bias; promoter-like structures) => uncaptured gaps
  - GC-rich DNA is difficult to PCR and to sequence and often requires the use of special chemistry => captured gaps
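For the first cause (random shotgun coverage that misses some regions), the classic Lander-Waterman model gives a rough estimate of how much of the genome stays uncovered. The sketch below is an illustration added here, not something from the slides: it assumes uniformly random read placement and ignores the minimum overlap needed to join reads, so it says nothing about gaps caused by cloning bias or GC-rich regions. The function name and example numbers are hypothetical.

```python
import math


def lander_waterman(genome_size, read_length, num_reads):
    """Estimate coverage statistics for random shotgun sequencing.

    Returns (coverage, uncovered_fraction, expected_contigs), where
    coverage c = N * L / G, the fraction of bases never sampled is e^-c,
    and the expected number of contigs (roughly, gaps) is N * e^-c.
    """
    coverage = num_reads * read_length / genome_size
    uncovered_fraction = math.exp(-coverage)
    expected_contigs = num_reads * math.exp(-coverage)
    return coverage, uncovered_fraction, expected_contigs


if __name__ == "__main__":
    # Hypothetical example: 5 Mb genome, 700 bp reads, about 8x coverage.
    c, frac, contigs = lander_waterman(5_000_000, 700, 57_143)
    print(f"coverage {c:.1f}x, uncovered {frac:.2e}, expected contigs {contigs:.1f}")
```

Even at 8x the model still predicts some gaps, which is consistent with the slide's point that drafts need a separate gap-closure step.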

Low GC project and 454
Thermotoga lettingae TMO (JGI ID )
- Draft assembly:
  - 55 total contigs; 41 contigs >2 kb
  - 38% GC
  - biased Sanger libraries
- Draft assembly total contigs; 1 contig >2 kb; no cloning
- 6,810 bases 454 only out of 2,170,737 bp
- average length of gaps

High GC stops (Sanger and Hybrid)
Small hairpins (inverted repeat sequences) in the DNA re-anneal either during sequencing or during electrophoresis, resulting in failed sequencing reactions or unreadable electrophoresis results. (This can be mitigated by adding modifiers to the reaction, sequencing smaller clones, and running gels at higher temperatures in the presence of stronger denaturants.)
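To make "small hairpins (inverted repeat sequences)" concrete, the following sketch scans a sequence for a stem whose reverse complement reappears a short loop further on. It is a naive illustration, not a tool mentioned in the talk: the function names and the `stem`, `min_loop`, and `max_loop` defaults are assumptions, and it only finds perfect stems of one fixed length.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_hairpins(seq, stem=6, min_loop=3, max_loop=8):
    """Return (start, stem_length, loop_length) for perfect inverted repeats.

    A hit means seq[start:start+stem] is followed, after a loop of
    loop_length bases, by its own reverse complement.
    """
    hits = []
    n = len(seq)
    for i in range(max(0, n - 2 * stem - min_loop + 1)):
        target = revcomp(seq[i:i + stem])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if j + stem > n:
                break  # right arm would run past the end of the sequence
            if seq[j:j + stem] == target:
                hits.append((i, stem, loop))
    return hits


if __name__ == "__main__":
    # GC-rich stem GCGCCG, 4-base loop, then the reverse complement CGGCGC.
    print(find_hairpins("TTGCGCCGAAAACGGCGCTT"))
```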

High GC project and 454
Xylanimonas cellulosilytica DSM (3.8 Mb; 72.1% GC)
- PGA assembly: 9x of 8 kb
- PGA assembly: 9x of 8 kb + 454

Assembly | Total contigs | Major contigs | Scaffolds | Misassemblies* | N50
PGA-8kb  |               |               |           |                | ,048
PGA-8kb  |               |               |           |                | ,369
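The N50 column in the comparison above is a standard contiguity statistic: the contig length such that contigs of that length or longer contain at least half of the assembled bases. A minimal sketch follows; the function name and the example lengths are illustrative, not values from the slide.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are sorted from longest to shortest and summed until the
    running total reaches at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Total is 20; the two longest contigs (6 + 5 = 11) pass the halfway mark.
    print(n50([2, 3, 4, 5, 6]))
```

A higher N50 from the same data usually means fewer, longer contigs, which is why it is reported next to the contig, scaffold, and misassembly counts.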

What is Finishing?
The process of taking a rough draft assembly composed of shotgun sequencing reads and identifying and resolving misassemblies, sequence gaps, and regions of low quality to produce a highly accurate finished DNA sequence.
Current standards:
1. All low-quality areas in the consensus (<Q30) should be reviewed and re-sequenced.
2. No single-clone coverage, i.e. a minimum of 2X depth everywhere.
3. The final error rate should be less than 1 per 50 kb.
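The three standards translate into simple per-base checks. The sketch below is an assumption-laden illustration, not JGI's pipeline: it supposes that per-base consensus quality and clone depth are already available as two lists of equal length, and the function names are hypothetical. It also uses the Phred definition Q = -10 log10(p) to show that "1 error per 50 kb" corresponds to an average quality of roughly Q47.

```python
import math


def phred_from_error_rate(error_probability):
    """Convert an error probability into a Phred quality score."""
    return -10 * math.log10(error_probability)


def review_targets(qualities, depths, min_quality=30, min_depth=2):
    """Return 0-based consensus positions that fail standards 1 or 2.

    qualities: per-base consensus Phred scores (assumed input).
    depths: per-base clone coverage (assumed input).
    """
    if len(qualities) != len(depths):
        raise ValueError("qualities and depths must have the same length")
    return [
        pos
        for pos, (q, d) in enumerate(zip(qualities, depths))
        if q < min_quality or d < min_depth
    ]


if __name__ == "__main__":
    print(round(phred_from_error_rate(1 / 50_000), 1))
    print(review_targets([40, 25, 50, 45], [3, 4, 1, 2]))
```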

Genome closure issues
- Resolve repeats and misassemblies
  - Repeats within or in the vicinity of other repeats
  - Large repetitive regions
  - Complex repetitive regions (tandems)
- Fill in gaps
  - DNA region lethal to E. coli (Sanger libraries)
  - Hairpins, GC-rich regions, hard stops, or other secondary structure / physical premature termination
  - Hard to PCR (new technologies)
- Other issues
  - Homopolymeric tracts and other polymorphisms (SNPs, VNTRs, indels)

JGI Microbial Finishing
Currently: >250 individual microbes
"I am all for finished genomes! It will serve us best in the long run. Unfinished ones are likely to contribute to some chaos." (Prof. Sallie W. Chisholm, MIT)

Metagenomic assembly
- A metagenomic sequencing project is typically very large.
- Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members.
- Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies.
- Chimeric contigs are produced by co-assembly of sequencing reads originating from different species.
- Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly.
- No assemblers have been developed for metagenomic data sets.
The whole-genome shotgun sequencing approach has been used for a number of microbial community projects; however, useful quality control and assembly of these data require reassessing methods developed to handle relatively uniform sequences derived from isolate microbes.

QC: Annotation of poor quality sequence
To avoid this:
- make sure you use high quality sequence;
- choose a proper assembler

Recommendations for metagenomic assembly
- Use a trimmer (Lucy etc.) to treat reads PRIOR to assembly
- Do not use PHRAP for metagenomic projects
- None of the existing assemblers is designed for metagenomic data, but assemblers like PGA work better with paired-read information and produce better assemblies
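As an illustration of what trimming reads before assembly involves, here is a simple sliding-window quality trimmer. It is emphatically not Lucy's algorithm, only a generic sketch: the function name `trim_bounds` and the `window` and `min_mean_quality` defaults are assumptions chosen for the example.

```python
def trim_bounds(qualities, window=10, min_mean_quality=20):
    """Return (start, end) of the region to keep, as a half-open slice.

    The kept region runs from the start of the first window whose mean
    Phred quality reaches min_mean_quality to the end of the last such
    window. Returns (0, 0) when no window passes, i.e. discard the read.
    """
    n = len(qualities)
    if n < window:
        return (0, 0)
    good = [
        sum(qualities[i:i + window]) / window >= min_mean_quality
        for i in range(n - window + 1)
    ]
    if not any(good):
        return (0, 0)
    first = good.index(True)
    last = len(good) - 1 - good[::-1].index(True)
    return (first, last + window)


if __name__ == "__main__":
    # Low-quality ends around a high-quality core.
    quals = [5] * 5 + [30] * 20 + [5] * 5
    start, end = trim_bounds(quals, window=5, min_mean_quality=20)
    print(start, end)
```

The returned bounds can be applied to both the base string and the quality list with an ordinary slice, for example `seq[start:end]`.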

Metagenomic finishing: projects
Completed projects:
- Candidatus Korarchaeum cryptofilum OPF8: the first of this apparently ancient hyperthermophilic phyletic group to be sequenced
- Desulforudis audaxviator: isolated from old water in fissures of a South African gold mine at a depth of 3000 meters. Finished with Sanger and 454
- Candidatus Accumulibacter phosphatis Type IIA (CAP): from an EBPR sludge community, US
In progress:
- Candidatus Endomicrobium trichonymphae: an intracellular symbiont of a flagellate protist, itself part of the hindgut community of a termite host. It is of interest in the pursuit of the efficient breakdown of cellulose and lignin necessary in the hoped-for conversion of bulk plant materials to CO2-neutral fuel

Metagenomic finishing: approach
Binning: which DNA fragment is derived from which phylotype? (BLAST; GC%; read depth)
[Diagram labels: Non-CAP reads; CAP reads; Lucy/PGA; complete genome of Candidatus Accumulibacter phosphatis (CAP); ~45%]
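Two of the binning signals named on this slide, GC% and read depth, can be combined in a very simple filter. The sketch below is only an illustration of the idea and not the procedure used for the CAP genome: the function names, the input format, and the `gc_tolerance` and `depth_fold` thresholds are hypothetical, and the BLAST evidence mentioned on the slide is left out entirely.

```python
def gc_percent(sequence):
    """GC content as a percentage of unambiguous (A, C, G, T) bases."""
    sequence = sequence.upper()
    acgt = sum(sequence.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (sequence.count("G") + sequence.count("C")) / acgt


def bin_contigs(contigs, target_gc, target_depth, gc_tolerance=5.0, depth_fold=2.0):
    """Split contigs into (target_bin, other) by GC% and mean read depth.

    contigs maps a name to (sequence, mean_read_depth); this format is an
    assumption. A contig joins the target bin when its GC% is within
    gc_tolerance points of target_gc and its depth is within a factor of
    depth_fold of target_depth.
    """
    target_bin, other = [], []
    for name, (sequence, depth) in contigs.items():
        gc_ok = abs(gc_percent(sequence) - target_gc) <= gc_tolerance
        depth_ok = target_depth / depth_fold <= depth <= target_depth * depth_fold
        (target_bin if gc_ok and depth_ok else other).append(name)
    return target_bin, other


if __name__ == "__main__":
    example = {
        "c1": ("GCGCGCAT" * 10, 20.0),  # 75% GC, expected depth
        "c2": ("ATATATGC" * 10, 20.0),  # 25% GC
        "c3": ("GCGCGCAT" * 10, 2.0),   # right GC, far too shallow
    }
    print(bin_contigs(example, target_gc=75.0, target_depth=20.0))
```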

The end