My portfolio: Sequencing projects. Alla Lapidus, Ph.D., Associate Professor, Fox Chase Cancer Center. EDUCATION: 1980, M.S. in Physics (with honors).

Presentation transcript:

My portfolio: Sequencing projects

Alla Lapidus, Ph.D., Associate Professor, Fox Chase Cancer Center

EDUCATION:
1980: M.S. in Physics (with honors), Department of Theoretical and Experimental Physics, Moscow Physics-Engineering Institute (МИФИ), Moscow, Russia
1986: Ph.D. in Molecular Biology, Institute of Genetics and Selection of Industrial Microorganisms, Moscow, Russia

PROFESSIONAL MEMBERSHIPS and SERVICE:
Reviewer for Frontiers in Evolutionary and Genomic Microbiology
Reviewer for Nucleic Acids Research
Reviewer for PLoS Genetics
current: Organizing Committee Member, “Sequencing, Finishing and Analysis in the Future” (SFAF) meeting
American Society for Microbiology
Grant Reviewer (INTAS)

FIRST GENOMIC PROJECT: 1994 - European project: Bacillus subtilis, INRA, France
UNIVERSITY of CHICAGO: 1998 - Rhodobacter capsulatus genome
INTEGRATED GENOMICS, Inc: 2001 - Director of Sequencing Center, Chicago
JOINT GENOME INSTITUTE (LBNL): 2003 - Genome Finishing Group, Projects coordinator
Fox Chase Cancer Center, Cancer Genome Institute: Director of Bioinformatics

May 30th - June 1, Santa Fe

Genome assembly and finishing group at JGI

Projects in the group
- Microbial projects: ~120 genomes assembled and finished
- Metagenomes: 3 completely finished members of different communities, plus approach development
- Fungi: genome assembly and partial improvement
- Single cell: one finished genome
- Bioinformatics: small tools needed for assembly improvement and visualization
- Quality control: needs and approaches

Major Sequencing Centers for Prokaryotic Finished Genomes in 2011: 1765 projects (Nikos C. Kyrpides)

Metagenomes

Metagenomic Assembly Challenges - Molecular Biology
- Low representation of individual species
- Requires high depth of coverage to find all species; low-abundance species remain unassembled
- Extraction bias, etc.

Metagenomic Assembly Challenges - Software
- Algorithm bias: de novo assembly assumes a normal, even distribution of sequence data and assumes all k-mers will have similar coverage
- Memory and run times: Illumina-sequenced metagenomes can generate > 60 GB of data, and assembly software could require > 512 GB RAM
- Lower k-mer values generate more k-mers, increasing memory requirements and compute cycles and therefore assembly time; they can also improve assembly of low-abundance species
- Individual genome finishing
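The effect of k on the number of k-mers can be illustrated with a small sketch. It is not taken from the slides: the function names and the toy reads are assumptions made for illustration. A read of length L contributes L - k + 1 k-mers, so a lower k yields more k-mer instances for an assembler to store and process.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        # A read of length L yields L - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmers_per_read(read_length, k):
    """Number of k-mers a single read of the given length contributes."""
    return max(read_length - k + 1, 0)


if __name__ == "__main__":
    # Toy reads, purely illustrative.
    reads = ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT"]
    for k in (3, 5, 7):
        counts = count_kmers(reads, k)
        print(k, sum(counts.values()), len(counts))
```

The per-k-mer counts returned by `count_kmers` are also the raw material for the coverage histograms that assemblers implicitly assume to be roughly uniform.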

What is needed for better performance
- Reduce the size of data sets
- Implement a read quality approach (trimming, filtering, binning)
- Choose the best assembler
- Select the “best” assembly: merge Velvet assemblies with minimus; use an algorithm to select the best k-mer
- Automate the process
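As an illustration of the read quality step, here is a minimal sketch of 3'-end quality trimming followed by a length filter. The slides do not say which tool or thresholds were used; the Phred+33 encoding, the cutoff of 20, the minimum length of 30, and all function names below are assumptions.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose quality is below min_q."""
    scores = phred_scores(qual, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_length_filter(seq, min_length=30):
    """Keep only reads that are still long enough after trimming."""
    return len(seq) >= min_length


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    seq, qual = trim_3prime("ACGTAC", "IIII##")
    print(seq, qual, passes_length_filter(seq, min_length=4))
```

Trimming before assembly shrinks the data set and removes error-prone bases that would otherwise create spurious k-mers.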

METAGENOME SAMPLES 2011: 1927 samples. Nikos C. Kyrpides http://

Simple Communities are Very Complex
High-resolution metagenomics targets specific functional types in complex microbial communities. M. Kalyuzhnaya, A. Lapidus, N. Ivanova, A. Copeland, A. McHardy, E. Szeto, A. Salamov, I. Grigoriev, D. Suciu, S. Levine, V.M. Markowitz, I. Rigoutsos, S. Tringe, D. Bruce, P. Richardson, M. Lidstrom & L. Chistoserdova. Nature Biotechnology 26 (2008). Published online: 17 August 2008

Fungal projects
Over a million species in the Kingdom Fungi have evolved over millions of years to occupy diverse ecological niches and have accumulated an enormous, as yet undiscovered natural arsenal of potentially useful innovations. While the number of fungal genome sequencing projects continues to increase, the phylogenetic breadth of current sequencing targets is extremely limited. Exploring the phylogenetic and ecological diversity of fungi by genome sequencing is therefore a potentially rich source of valuable metabolic pathways and enzyme activities, which will remain undiscovered and unexploited until a systematic survey of phylogenetically diverse genome sequences is undertaken.

Fungal assembly challenges
- Small amount of gDNA => poor-quality libraries and an insufficient number of libraries (different paired-end (PE) libraries are needed, and these are hard to make with new sequencing protocols)
- Large data sets
- Polyploidy
- Large variety of repeats (lengths, complexity)

Figure 2. Model of the evolution of the N. tetrasperma mat A mating-type chromosome. The order of rearrangement events is shown in A and begins with the ancestral mat A chromosome (1), which was collinear with mat a and the mating-type chromosome of N. crassa. The 1.2-Mb inversion occurred first and produced the orientation in 2. This event was followed relatively quickly by the 5.3-Mb inversion (3). The 68-kb inversion, shown as the line at the far right of B, occurred much later to produce the current arrangement of the mat A chromosome (B). The 1.2-Mb inversion (breakpoints shown in red) is flanked by unique 50-bp duplications (D) that would have been in an inverted orientation before the occurrence of the large inversion, consistent with rearrangement via staggered single-strand breaks. The 5.3-Mb inversion (breakpoints shown in blue) is flanked by Mariner transposable elements (M), consistent with rearrangement via ectopic recombination. Mariner remnants were not present in either of the homologous regions in the mat a chromosome. The overlapping nature of these two inversions explains the relocated genomic region. The 68-kb inversion is flanked by a microsatellite-containing, low-complexity sequence and may have occurred due to ectopic recombination between blocks of microhomology. MAT denotes the location of the mating-type locus, while CEN shows the location of the centromere.
Massive Changes in Genome Architecture Accompany the Transition to Self-Fertility in the Filamentous Fungus Neurospora tetrasperma (Genetics, September; 189(1): 55–69.)

Single-cell approach
- Single-cell genomics is a method for amplifying DNA from single bacterial cells using Multiple Displacement Amplification (MDA)
- Only 2% of microbes can be cultured
- Discovery of novel enzymes, new antibiotics and more
- Cancer research and clinical diagnostics

Single-cell Process

Challenges with Single-cell
- Single-cell methodology is sensitive to reagent or processing contamination from multiple displacement amplification (MDA).
- MDA produces non-uniform read coverage, posing problems for current short-read assemblers.
- MDA produces chimeric reads.
- De novo assembly of complete genome sequences.

Candidatus Sulcia muelleri DMIN
Sulcia cell isolation and sequence coverage, closure and polishing locations along the Sulcia DMIN single-cell genome. (A) Micromanipulation of the single Sulcia cell from the sharpshooter bacteriome metasample. (B) Sequence coverage, including closure and polishing locations, along the finished, circular Sulcia DMIN genome, with circles corresponding to the following features, starting with the outermost circle: (1) Illumina sequence coverage ranging from 0–3276 (mean 303 ± 386), (2) pyrosequence coverage ranging from 0–231 (mean 42 ± 39), (3) Sanger sequence coverage ranging from 0–30 (mean 10 ± 7), (4) locations of captured (green) and uncaptured (orange) gaps, (5) polishing locations corrected using Illumina (blue) and Sanger (purple) sequence, (6) GC content heat map (dark blue to light green = low to high values) and (7) GC skew.
Woyke T, Tighe D, Mavromatis K, Clum A, Copeland A, Schackwitz W, Lapidus A, et al. One Bacterial Cell, One Complete Genome. PLoS ONE 5(4): e10314.
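The GC content and GC skew tracks named in the caption can be computed per window along a sequence. The sketch below uses the common definitions, GC content = (G + C) / length and GC skew = (G - C) / (G + C). The window size, step, function names, and toy sequence are assumptions and are not taken from the paper.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """(G - C) / (G + C); defined as 0.0 when the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def sliding_windows(seq, size, step):
    """Yield (start, window) pairs covering the sequence."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


if __name__ == "__main__":
    genome = "GGGCATATGCCCATGGGA"  # toy sequence, illustrative only
    for start, window in sliding_windows(genome, size=6, step=6):
        print(start, round(gc_content(window), 2), round(gc_skew(window), 2))
```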

Identification of MDA contaminants. Red = suspect contaminant (S. Trong, JGI)

MDA coverage bias
- Single-cell k-mer distribution
- Shotgun sequencing theoretical k-mer distribution; current short-read assemblers expect this uniformity
- Isolate k-mer distribution
Woyke T, et al. PLoS ONE 6(10): e26161.

Normalizing read coverage improves assembly! Woyke T, Sczyrba A, Lee J, Rinke C, Tighe D, et al. Decontamination of MDA Reagents for Single Cell Whole Genome Amplification. PLoS ONE 6(10): e26161.
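The slides do not state which normalization procedure was used. One common idea, shown here only as a hedged sketch, is to stream the reads and discard a read once the median count of its k-mers already reaches a cutoff, which flattens over-amplified regions. The function names, `k`, and `cutoff` are illustrative assumptions.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return the list of overlapping k-mers in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize_reads(reads, k=5, cutoff=2):
    """Keep a read only while its median k-mer count is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        # Counts reflect only the reads kept so far.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


if __name__ == "__main__":
    # Five copies of an over-amplified read plus one rare read (toy data).
    reads = ["ACGTTGCAAG"] * 5 + ["TTTTTGGGGG"]
    print(normalize_reads(reads, k=5, cutoff=2))
```

In this toy run the over-represented read is kept only until its k-mers reach the cutoff, while the rare read is retained.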

JGI’s Approach
- Develop a pipeline to assemble single-cell genomes that addresses contamination and read coverage problems and provides as much genome completeness as possible.
- Provide QC metrics to evaluate contaminants in the reads and assembly.

Allpaths-LG vs Velvet
- Allpaths-LG (APLG) uses less memory and requires much less data.
- For microbes and fungi, Allpaths gives fewer contigs, better N50 numbers and better annotation when compared to a reference than Velvet does.
- APLG rarely has large misassemblies, whereas with Velvet you have to adjust the minimum pair cutoff to make sure you don't get misassemblies.
- For single cells, both Allpaths and Velvet are run and the results are merged.
- APLG is not used for metagenomes.
- APLG works best with at least an overlapping standard library and a mate pair library between 3-8 kb (real or fake). If you give Allpaths a mate pair library over 10 kb without also providing a smaller mate pair library, it can get confused.
- APLG doesn't accept variable-length mate pair data.
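Since the comparison relies on contig counts and N50, a short sketch of how N50 is computed may help. N50 is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. The helper `summarize` and the example contig lengths are hypothetical and are not results from the slides.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def summarize(name, contig_lengths):
    """Collect the two metrics mentioned above for one assembly."""
    return {"assembly": name,
            "contigs": len(contig_lengths),
            "n50": n50(contig_lengths)}


if __name__ == "__main__":
    # Made-up contig lengths for two hypothetical assemblies.
    print(summarize("assembly_a", [100, 80, 50, 30, 20]))
    print(summarize("assembly_b", [40, 40, 40, 40, 40, 40, 40]))
```

N50 alone does not detect misassemblies, so it is best read alongside a comparison against a reference, as the slide does.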

Microbial assemblies and finishing
See the “Assembly and finishing” presentation for microbial assemblies and bioinformatics tools.

Thank you!