High performance computational analysis of DNA sequences from different environments. Rob Edwards, Computer Science and Biology. edwards.sdsu.edu, www.theseed.org

Presentation transcript:

Outline: There is a lot of sequence; Tools for analysis; More computers; Can we speed analysis?

How much has been sequenced? [Plot: number of known sequences by year; annotations: first bacterial genome, 100 bacterial genomes, 1,000 bacterial genomes, environmental sequencing]

How much will be sequenced? [Chart labels: 100 people; everybody in San Diego; everybody in USA; all cultured Bacteria; one genome from every species; most major microbial environments]

Metagenomics (Just sequence it). Samples: 200 liters water, g fresh fecal matter, 50 g soil. Diagram labels: Sequence; Epifluorescent microscopy; Concentrate and purify bacteria, viruses, etc.; Extract nucleic acids; Publish papers

The SEED Family

The metagenomics RAST server

Automated Processing

Summary View

Metagenomics Tools Annotation & Subsystems

Metagenomics Tools Phylogenetic Reconstruction

Metagenomics Tools Comparative Tools

Outline: There is a lot of sequence; Tools for analysis; More computers

How much data so far: 986 metagenomes; 79,417,238 sequences; 17,306,834,870 bp (17 Gbp). Average: ~15-20 Mbp per metagenome. Sequencing platforms: ~300 GS20, ~300 FLX, ~300 Sanger

Computes

Linear compute complexity

Just waiting …

Overall compute time [Plot: hours of compute time vs. input size (MB)]. ~19 hours of compute per input megabyte

How much so far: 986 metagenomes; 79,417,238 sequences; 17,306,834,870 bp (17 Gbp). Average: ~15-20 Mbp per metagenome. Compute time (on a single CPU): 328,814 hours = 13,700 days = 38 years. Sequencing platforms: ~300 GS20, ~300 FLX, ~300 Sanger
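
The single-CPU figure on this slide can be reproduced from the ~19 hours per input megabyte quoted earlier. The sketch below is a back-of-the-envelope check, not the speaker's own calculation. It assumes one base pair per byte, 1 MB = 1,000,000 bytes, truncation to whole megabytes, and a 365-day year. The function name `estimate_compute_hours` is hypothetical.

```python
# Figures quoted on the slides.
TOTAL_BP = 17_306_834_870   # total base pairs across all metagenomes
N_METAGENOMES = 986
HOURS_PER_MB = 19           # approximate compute cost per input megabyte


def estimate_compute_hours(total_bp, hours_per_mb=HOURS_PER_MB):
    """Estimate single-CPU hours for `total_bp` base pairs.

    Assumptions (not stated in the talk): one base pair is one byte,
    1 MB is 1,000,000 bytes, and partial megabytes are dropped.
    """
    megabytes = total_bp // 1_000_000
    return megabytes * hours_per_mb


hours = estimate_compute_hours(TOTAL_BP)
days = hours / 24
years = days / 365
mean_bp = TOTAL_BP / N_METAGENOMES

print(f"{hours:,} hours, about {days:,.0f} days, about {years:.1f} years")
print(f"mean size: {mean_bp / 1e6:.1f} Mbp per metagenome")
```

Because the cost is linear in input size, the same function gives the expected compute for any new data set under the same assumptions.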

Outline: There is a lot of sequence; Tools for analysis; More computers; Can we speed analysis?

Shannon’s Uncertainty (Peter’s surprisal): p(x_i) is the probability of the occurrence of each base or string
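
As a minimal sketch of the quantity named on this slide, the function below computes the standard Shannon uncertainty, H = -Σ p(x_i) log2 p(x_i), over the k-mers of a sequence. The base-2 logarithm, the use of overlapping k-mers, and the function name `shannon_uncertainty` are assumptions. The talk does not say how strings were counted.

```python
import math
from collections import Counter


def shannon_uncertainty(sequence, k=1):
    """Return H = -sum(p * log2(p)) over the overlapping k-mers of `sequence`.

    p is the observed frequency of each k-mer (k=1 gives single bases).
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A sequence using all four bases equally has the maximum of 2 bits per base;
# a homopolymer has no uncertainty at all.
print(shannon_uncertainty("ACGTACGT"))
print(shannon_uncertainty("AAAAAAAA"))
print(shannon_uncertainty("ACGTACGT", k=2))
```

Passing `k=6` would give the 6-mer variant; the alphabet of possible strings grows as 4^k, so longer sequences are needed for stable estimates.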

Surprisal in Sequences

Uncertainty Correlates With Similarity

But it’s not just randomness…

Uncertainty in complete genomes. Which has more surprisal: coding regions or non-coding regions? [Plot panels: Coding regions; Non-coding regions]

More extreme differences with 6-mers [Plot panels: Coding regions; Non-coding regions]

Can we predict proteins? Short sequences of 100 bp. Translate into amino acids. Can we predict which are real and could be doing something? Test with bacterial proteins
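
The translation step can be sketched as follows. This is an illustration only: it uses the standard genetic code and translates all six reading frames, which is a common choice for short reads but is an assumption here, since the slide only says "translate into amino acids". The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return three forward-strand frames, then three reverse-strand frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


for frame in six_frame_translations("ATGGCCTAA"):
    print(frame)
```

A 100 bp read yields at most 33 amino acids per frame, and a `*` marks a stop codon.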

Kullback-Leibler Divergence (KLD): difference between two probability distributions. Here: difference between amino acid composition and average amino acid composition. Calculate KLD for 372 bacterial genomes
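
A minimal sketch of this comparison, assuming the usual definition KLD(P‖Q) = Σ p_i log2(p_i / q_i) with P the amino acid composition of one sequence set and Q the average composition. The pseudocount, the base-2 logarithm, and the function names are assumptions added for illustration and are not taken from the talk.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(protein, pseudocount=1.0):
    """Return amino acid frequencies for `protein` as a dict.

    A pseudocount (an assumption here) keeps every frequency above zero,
    so the divergence below is always finite.
    """
    counts = Counter(aa for aa in protein.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """Return sum(p_i * log2(p_i / q_i)) over the 20 amino acids."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS if p[aa] > 0)


# Toy example: a uniform "average" composition versus a lysine-rich sequence.
average = composition(AMINO_ACIDS)
biased = composition("MKKKKLKKKIKKKNKK")
print(kl_divergence(biased, average))
print(kl_divergence(average, average))
```

KLD is not symmetric, so the direction matters: here the observed composition is P and the average is Q.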

KLD varies by bacteria Colored by taxonomy of the bacteria

KLD varies by bacteria

Most divergent genomes: Borrelia garinii (Spirochaetes); Mycoplasma mycoides (Mollicutes); Ureaplasma parvum (Mollicutes); Buchnera aphidicola (Gammaproteobacteria); Wigglesworthia glossinidia (Gammaproteobacteria)

Divergence and metabolism [Plot labels: Bifidobacterium, Bacillus, Nostoc, Salmonella, Chlamydophila, mean of all bacteria]

Divergence and amino acids [Plot labels: Ureaplasma, Wigglesworthia, Borrelia, Buchnera, Mycoplasma, Bacteria mean, Archaea mean, Eukaryotic mean]

Predicting KLD: $y = 2x^2 - 2x + 0.5$ [Plot: KLD per genome vs. percent G+C]
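
The fitted curve factors as 2(x - 0.5)^2, so it reaches zero at x = 0.5 and rises symmetrically on either side. The sketch below evaluates it under the assumption that x is the G+C content expressed as a fraction between 0 and 1. The axis is labeled "Percent G+C" and the slide does not give the scaling, so treat that as a guess. Function names are hypothetical.

```python
def gc_fraction(dna):
    """Return the fraction of G and C among the A, C, G, T bases in `dna`."""
    dna = dna.upper()
    counted = sum(dna.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("no A, C, G or T bases found")
    return (dna.count("G") + dna.count("C")) / counted


def predicted_kld(gc):
    """Evaluate y = 2x^2 - 2x + 0.5, assuming x is the G+C fraction (0 to 1)."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    return 2 * gc ** 2 - 2 * gc + 0.5


for gc in (0.25, 0.50, 0.75):
    print(gc, predicted_kld(gc))
```

Under this reading, genomes with balanced G+C content are predicted to have the lowest divergence, and strongly AT-rich or GC-rich genomes the highest.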

Summary: Shannon’s uncertainty could predict useful sequences. KLD varies too much to be useful and is driven by %G+C content

New solutions for old problems?

Xen and the art of imagery

The cell phone problem

Searching the SEED by SMS [Diagram labels: SMS text "seed search histidine coli"; AUTO SEEDSEARCHES; edwards.sdsu.edu; SEED databases; reply "22 proteins in E. coli"; locations: Anywhere, Idaho, GMCS429, Argonne]

Challenges: Too much data. Not easy to prioritize. New models for HPC needed. New interfaces to look at data

Acknowledgements: Sajia Akhter, Rob Schmieder, Nick Celms, Sheridan Wright, Ramy Aziz, FIG, the mg-RAST team, Rick Stevens, Peter Salamon, Barb Bailey, Forest Rohwer, Anca Segall