My portfolio: Sequencing projects
Alla Lapidus, Ph.D.
Associate Professor, Fox Chase Cancer Center

EDUCATION:
1980: M.S. in Physics (with honors), Department of Theoretical and Experimental Physics, Moscow Physics-Engineering Institute (МИФИ), Moscow, Russia
1986: Ph.D. in Molecular Biology, Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia

PROFESSIONAL MEMBERSHIPS and SERVICE:
- Reviewer for Frontiers in Evolutionary and Genomic Microbiology
- Reviewer for Nucleic Acids Research
- Reviewer for PLoS Genetics
- Current: Organizing Committee Member, “Sequencing, Finishing and Analysis in the Future” (SFAF) meeting
- American Society for Microbiology
- Grant Reviewer (INTAS)

FIRST GENOMIC PROJECT: 1994, European project: Bacillus subtilis, INRA, France
UNIVERSITY of CHICAGO: 1998, Rhodobacter capsulatus genome
INTEGRATED GENOMICS, Inc: 2001, Director of Sequencing Center, Chicago
JOINT GENOME INSTITUTE (LBNL): 2003, Genome Finishing Group, Projects coordinator
FOX CHASE CANCER CENTER, Cancer Genome Institute: Director of Bioinformatics
May 30th - June 1, Santa Fe
Genome assembly and finishing group at JGI
Projects in the group
- Microbial projects: ~120 genomes assembled and finished
- Metagenomes: 3 completely finished members of different communities, plus approach development
- Fungi: genome assembly and partial improvement
- Single cell: one finished genome
- Bioinformatics: small tools needed for assembly improvement and visualization
- Quality control: needs and approaches
Major Sequencing Centers for Prokaryotic Finished Genomes in 2011: 1765 projects
Nikos C. Kyrpides
Metagenomes
Metagenomic Assembly Challenges: Molecular Biology
- Low representation of individual species
- Requires high depth of coverage to find all species; low-abundance species remain unassembled
- Extraction bias, etc.
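The depth-of-coverage point can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes reads are sampled in proportion to each member's share of the community DNA, and the function names, the 60 Gb data volume, and the 5 Mb genome size are hypothetical values chosen for the example, not figures from a specific JGI project.

```python
def expected_coverage(total_bases, dna_fraction, genome_size):
    """Expected mean sequencing depth for one community member.

    Assumption: reads are drawn in proportion to the member's share of the
    total community DNA (dna_fraction, between 0 and 1).
    """
    return total_bases * dna_fraction / genome_size


def bases_needed(target_coverage, dna_fraction, genome_size):
    """Total bases to sequence so that one member reaches target_coverage."""
    return target_coverage * genome_size / dna_fraction


if __name__ == "__main__":
    # Hypothetical run: 60 Gb of sequence, 5 Mb genomes at different abundances.
    for fraction in (0.10, 0.001, 0.0001):
        depth = expected_coverage(60e9, fraction, 5e6)
        print(f"{fraction:.2%} of community DNA -> about {depth:.1f}x depth")
```

Under these assumptions a dominant member is sequenced far more deeply than an assembler needs, while a member contributing 0.01% of the DNA barely reaches 1x, which is why low-abundance species tend to stay unassembled.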
Metagenomic Assembly Challenges: Software
Algorithm bias
- De novo assembly assumes a normal, even distribution of sequence data
- Assumes all k-mers will have similar coverage
Memory and run times
- Illumina-sequenced metagenomes can generate > 60 GB of data; assembly software could require > 512 GB RAM
- Lower k-mer values generate more k-mers, increasing the memory requirement, computing cycles, and assembly time; they can also improve assembly of low-abundance species
Individual genome finishing
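Why a lower k yields more k-mers follows from simple counting: a read of length L contains L - k + 1 k-mers. The sketch below shows only that arithmetic. It counts k-mer occurrences, not distinct k-mers, so it is not a memory estimate for any particular assembler; the function names and the 100 bp read length are assumptions for illustration.

```python
def kmers_per_read(read_length, k):
    """Number of k-mers in one read: L - k + 1, or 0 if the read is shorter than k."""
    return max(read_length - k + 1, 0)


def total_kmer_occurrences(n_reads, read_length, k):
    """k-mer occurrences across n_reads reads of equal length (not distinct k-mers)."""
    return n_reads * kmers_per_read(read_length, k)


if __name__ == "__main__":
    # Hypothetical 100 bp reads; smaller k means more k-mers per read.
    for k in (21, 31, 51, 63):
        print(f"k={k}: {kmers_per_read(100, k)} k-mers per 100 bp read")
```

The number of distinct k-mers an assembler must hold in memory also depends on sequencing errors and community diversity, which this count does not model.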
What is needed for better performance
- Reduce the size of data sets
- Implement a read quality approach (trimming, filtering, binning)
- Choose the best assembler
- Select the “best” assembly
  - Merging Velvet assemblies with minimus
  - Algorithm for selection of the best k-mer
- Process automation
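One way to picture "selection of the best k-mer" is to assemble at several k values and rank the results by a contiguity statistic. The slides do not say which criterion was used, so the sketch below is an assumption: it ranks by N50 and breaks ties by total assembled length. The function names and the contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """N50: the contig length at which the running sum of lengths, taken from
    longest to shortest, first reaches half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def pick_best_k(assemblies):
    """assemblies maps a k value to the list of contig lengths it produced.

    Assumed criterion: highest N50, ties broken by total assembled length.
    """
    return max(assemblies, key=lambda k: (n50(assemblies[k]), sum(assemblies[k])))


if __name__ == "__main__":
    # Hypothetical contig lengths from three Velvet runs at different k.
    runs = {31: [100, 50, 50], 51: [150, 50], 63: [80, 60, 60]}
    for k, lengths in sorted(runs.items()):
        print(f"k={k}: N50={n50(lengths)}")
    print("selected k:", pick_best_k(runs))
```

N50 alone rewards misjoined contigs, so in practice a check against misassemblies or a reference would be needed alongside it.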
METAGENOME SAMPLES 2011: 1927 samples
Nikos C. Kyrpides http://
Simple Communities are Very Complex
High-resolution metagenomics targets specific functional types in complex microbial communities
M. Kalyuzhnaya, A. Lapidus, N. Ivanova, A. Copeland, A. McHardy, E. Szeto, A. Salamov, I. Grigoriev, D. Suciu, S. Levine, V.M. Markowitz, I. Rigoutsos, S. Tringe, D. Bruce, P. Richardson, M. Lidstrom & L. Chistoserdova
Nature Biotechnology 26 (2008). Published online: 17 August 2008
Fungal projects
Over a million species in the Kingdom Fungi have evolved over millions of years to occupy diverse ecological niches and have accumulated an enormous but yet undiscovered natural arsenal of potentially useful innovations. While the number of fungal genome sequencing projects continues to increase, the phylogenetic breadth of current sequencing targets is extremely limited. Exploration of phylogenetic and ecological diversity of fungi by genome sequencing is therefore a potentially rich source of valuable metabolic pathways and enzyme activities that will remain undiscovered and unexploited until a systematic survey of phylogenetically diverse genome sequences is undertaken.
Fungal assembly challenges
- Small amount of gDNA => poor-quality libraries and an insufficient number of libraries (different paired-end (PE) libraries are needed, and they are hard to make with new sequencing protocols)
- Large data sets
- Polyploidy
- Large variety of repeats (lengths, complexity)
Figure 2. Model of the evolution of the N. tetrasperma mat A mating-type chromosome. The order of rearrangement events is shown in A and begins with the ancestral mat A chromosome (1), which was collinear with mat a and the mating-type chromosome of N. crassa. The 1.2-Mb inversion occurred first and produced the orientation in 2. This event was followed relatively quickly by the 5.3-Mb inversion (3). The 68-kb inversion, shown as the line at the far right of B, occurred much later to produce the current arrangement of the mat A chromosome (B). The 1.2-Mb inversion (breakpoints shown in red) is flanked by unique 50-bp duplications (D) that would have been in an inverted orientation before the occurrence of the large inversion, consistent with rearrangement via staggered single-strand breaks. The 5.3-Mb inversion (breakpoints shown in blue) is flanked by Mariner transposable elements (M), consistent with rearrangement via ectopic recombination. Mariner remnants were not present in either of the homologous regions in the mat a chromosome. The overlapping nature of these two inversions explains the relocated genomic region. The 68-kb inversion is flanked by a microsatellite-containing, low-complexity sequence and may have occurred due to ectopic recombination between blocks of microhomology. MAT denotes the location of the mating-type locus, while CEN shows the location of the centromere.
Massive Changes in Genome Architecture Accompany the Transition to Self-Fertility in the Filamentous Fungus Neurospora tetrasperma (Genetics September; 189(1): 55–69.)
Single-cell approach
- Single-cell genomics amplifies DNA from single bacterial cells using Multiple Displacement Amplification (MDA) so that their genomes can be sequenced
- Only 2% of microbes can be cultured
- Discovery of novel enzymes, new antibiotics and more
- Cancer research and clinical diagnostics
Single-cell Process
Challenges with Single-cell
- Single-cell methodology is sensitive to reagent or processing contamination from multiple displacement amplification (MDA).
- MDA produces non-uniform read coverage, posing problems for current short-read assemblers.
- MDA produces chimeric reads.
- De novo assembly of complete genome sequences.
Candidatus Sulcia muelleri DMIN
Sulcia cell isolation and sequence coverage, closure and polishing locations along the Sulcia DMIN single cell genome. (A) Micromanipulation of the single Sulcia cell from the sharpshooter bacteriome metasample. (B) Sequence coverage including closure and polishing locations along the finished, circular Sulcia DMIN genome, with circles corresponding to the following features, starting with the outermost circle: (1) Illumina sequence coverage ranging from 0–3276 (mean 303 ± 386), (2) pyrosequence coverage ranging from 0–231 (mean 42 ± 39), (3) Sanger sequence coverage ranging from 0–30 (mean 10 ± 7), (4) locations of captured (green) and uncaptured gaps (orange), (5) polishing locations corrected using Illumina (blue) and Sanger (purple) sequence, (6) GC content heat map (dark blue to light green = low to high values) and (7) GC skew.
Woyke T, Tighe D, Mavromatis K, Clum A, Copeland A, Schackwitz W, Lapidus A, et al. One Bacterial Cell, One Complete Genome. PLoS ONE 5(4): e10314.
Identification of MDA contaminants
Red = suspect contaminant
S. Trong, JGI
MDA coverage bias
Panels: single-cell k-mer distribution; isolate k-mer distribution; shotgun sequencing theoretical k-mer distribution (current short-read assemblers expect this uniformity)
Woyke T, et al. PLoS ONE 6(10): e26161.
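The k-mer distributions compared on this slide are abundance histograms: for each multiplicity, how many distinct k-mers occur that many times in the reads. A minimal sketch of how such a histogram could be computed follows. It is a toy, in-memory version that ignores reverse complements and sequencing errors; the function names and the example reads are hypothetical and this is not the tool used to produce the figure.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(kmer_counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(kmer_counts.values())


if __name__ == "__main__":
    # Two hypothetical overlapping reads.
    reads = ["ACGTAC", "CGTACG"]
    histogram = abundance_histogram(count_kmers(reads, 3))
    for multiplicity, n_kmers in sorted(histogram.items()):
        print(f"{n_kmers} distinct k-mers seen {multiplicity} time(s)")
```

For an evenly covered isolate the histogram has a single peak near the mean coverage; MDA-amplified single-cell data spreads across a much wider range of multiplicities, which is the bias the slide illustrates.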
Normalizing read coverage improves assembly!
Woyke T, Sczyrba A, Lee J, Rinke C, Tighe D, et al. Decontamination of MDA Reagents for Single Cell Whole Genome Amplification. PLoS ONE 6(10): e26161.
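The slide does not describe how the reads were normalized. As an illustration of the general idea only, the sketch below uses a digital-normalization-style rule, which is a swapped-in technique and not confirmed as the method in the cited paper: a read is kept only if the median abundance of its k-mers, counted among the reads already kept, is still below a cutoff. The function name, `k`, and `cutoff` are hypothetical.

```python
from collections import Counter
from statistics import median


def normalize_reads(reads, k=21, cutoff=20):
    """Discard reads from regions that are already covered deeply enough.

    Assumed rule (digital-normalization style): keep a read only if the
    median count of its k-mers among previously kept reads is below cutoff.
    """
    seen = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        if median(seen[kmer] for kmer in kmers) < cutoff:
            kept.append(read)
            seen.update(kmers)  # only kept reads contribute to the counts
    return kept


if __name__ == "__main__":
    # Hypothetical data: one over-amplified region and one rare region.
    reads = ["ACGTTGCA"] * 5 + ["GGATCCAA"]
    kept = normalize_reads(reads, k=4, cutoff=2)
    print(f"kept {len(kept)} of {len(reads)} reads")
```

The effect is to cap coverage in over-amplified regions while leaving thinly covered regions untouched, which brings the k-mer distribution closer to the uniformity short-read assemblers expect.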
JGI’s Approach
- Develop a pipeline to assemble single-cell genomes that addresses contamination and read coverage problems and provides as much genome completeness as possible.
- Provide QC metrics to evaluate contaminants in the reads and assembly.
Allpaths-LG vs Velvet
- Allpaths-LG (APLG) uses less memory and requires much less data.
- For microbes and fungi, APLG gives fewer contigs, better N50 numbers, and better annotation when compared to a reference than Velvet does.
- APLG rarely has large misassemblies, whereas with Velvet you have to adjust the minimum pair cutoff to make sure you don't get misassemblies.
- For single cells, both Allpaths-LG and Velvet are run and the results are merged.
- APLG is not used for metagenomes.
- APLG works best with at least an overlapping standard library and a mate-pair library between 3 and 8 kb (real or fake). If you give Allpaths-LG a mate-pair library over 10 kb without also providing a smaller mate-pair library, it can get confused.
- APLG doesn't accept variable-length mate-pair data.
Microbial assemblies and finishing
See the “Assembly and finishing” presentation for microbial assemblies and bioinformatics tools.
Thank you!