Real Time DNA Sequence Analysis: New tools for mining data Rob Edwards San Diego State University, San Diego, CA Argonne National Laboratory, Argonne, IL CS 2009

Outline ● The problem – Metagenomics ● Current analytical tools ● Alternative approaches ● Real time tools

Sampling the environment...

Chickens, Cows, Mice, and People; Oh my!

Human-associated microbes and viruses ● More bacteria than somatic cells by at least an order of magnitude ● More phages than bacteria by an order of magnitude ● NIH started the Human Microbiome Project last year ● Nasal; Oral; Skin; Gastro-intestinal; Urogenital

Why Metagenomics? What is there? How many are there? What are they doing? Experimental manipulations? Real time metagenomics?

Computational Metagenomics

The SEED Family www.nmpdr.org www.theseed.org

Annotations vs. sequences

Subsystems Make Up Metabolism Wikipedia Metabolism http://en.wikipedia.org/wiki/Portal:Metabolism

Subsystem spreadsheet (conceptually)

Three level “hierarchy” ● Amino Acids and Derivatives – Alanine, serine, and glycine – Serine Biosynthesis ● Amino Acids and Derivatives – Lysine, threonine, methionine, and cysteine – Methionine Biosynthesis ● Make your own subsystems! ● Over 1,000 Subsystems

Annotation of Complete Genomes ● Automated user originated processing ● Takes 1-7 hours depending on size and complexity of the genome ● ~2,000 external submissions, including hundreds of genomes not yet publicly released ● Reannotation of >500 genomes complete ● 1,000 users, 200 organizations, 25 countries ● http://rast.nmpdr.org/

The Live ASM Test (Philadelphia, 2009) ● 10 genomes submitted on Thursday at 6 pm ● First annotation complete before 8 am Friday ● Remaining annotations completed Friday before noon ● (there were others in the pipeline too!) ● Presentation ASM 2009 Tuesday, 8 pm

The metagenomics RAST server

Freely available comparative tools

Computational Requirements [Plot: hours of compute time vs. input size (MB)] ~19 hours of compute per input megabyte

How much so far ● Total: 3,565 metagenomes; 334,168,924 sequences; 88,311,139,391 bp (88 Gbp) ● Largest metagenome: 729 Mbp, 11,719,618 reads ● Public: 394 metagenomes; 54,414,564 sequences; 22,234,298,797 bp (22 Gbp) ● Compute time (on a single CPU): 1,677,911 hours = 69,912 days = 191 years
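
The single-CPU compute time quoted above appears to follow from the ~19 hours per input megabyte rate applied to the total sequence volume. The sketch below reproduces the arithmetic; treating one megabyte of input as 10^6 bases is an assumption made here, not something the slides state.

```python
# Figures quoted on the slides
TOTAL_BP = 88_311_139_391   # total bases across all 3,565 metagenomes
HOURS_PER_MB = 19           # "~19 hours of compute per input megabyte"

# Assumption (not from the slides): 1 MB of input is about 10**6 bases.
BASES_PER_MB = 10**6

# Integer arithmetic, rounding down at each step
hours = TOTAL_BP * HOURS_PER_MB // BASES_PER_MB
days = hours // 24
years = days // 365

print(f"{hours:,} hours = {days:,} days = {years} years")
# prints: 1,677,911 hours = 69,912 days = 191 years
```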

Lots of computers, no pattern

But there are problems... ● More sequence data to compare to ● The algorithm is really O(n^2) not O(n) ● More metagenomes being sequenced ● More sequences per metagenome Most jobs sit around in queues doing nothing!

How much has been sequenced? [Plot: number of known sequences by year; labelled points: first bacterial genome, 100 bacterial genomes, 1,000 bacterial genomes, environmental sequencing]

How much will be sequenced? [Plot: projection by year; labelled points: 100 people, all cultured Bacteria, everybody in San Diego, everybody in USA, one genome from every species, most major microbial environments]

Sequenced bacterial genomes

Read lengths of metagenomes [Histogram: average read length (bp) vs. number of metagenomes]

There has to be a better way! Click Here!

How does it work?

Client – Server – Server ● Send sequence file (PERL CGI) ● User selects DNA sequence file ● Split file into random length sections ● File section sent for processing (PERL CGI) ● Return number of sections and new URL to receive results (HTML, Javascript) ● Request next result chunk (XMLHttpRequest) ● Return each result (YAML) ● Return each result (JSON) ● Parse results, sum, and render display (Javascript)
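
The split, process, and sum pattern in this workflow can be sketched in a few lines. This is only an illustration in Python, not the Perl CGI and Javascript of the actual service. The function names, the section size limits, and the toy "GC-rich / AT-rich" classifier standing in for the server-side analysis are all hypothetical.

```python
import random
from collections import Counter

def split_into_sections(records, min_size=2, max_size=5, rng=None):
    """Yield consecutive sections of `records`, each of a random length."""
    if rng is None:
        rng = random.Random()
    i = 0
    while i < len(records):
        size = rng.randint(min_size, max_size)
        yield records[i:i + size]
        i += size

def process_section(section):
    """Hypothetical stand-in for the server: label each sequence by GC content."""
    def label(seq):
        gc = seq.count("G") + seq.count("C")
        return "GC-rich" if 2 * gc > len(seq) else "AT-rich"
    return Counter(label(seq) for seq in section)

def analyze(records, rng=None):
    """Process sections one at a time and keep a running sum of the results."""
    totals = Counter()
    for section in split_into_sections(records, rng=rng):
        totals.update(process_section(section))  # partial totals could be displayed here
    return totals

records = ["GGCC", "ATAT", "GCGA", "AATT", "GGGG", "ATGC"]
totals = analyze(records, rng=random.Random(0))
print(totals)
```

Because each section is summarised independently and the summaries are simply added, the totals do not depend on where the file happens to be cut, and a display can be refreshed after every section rather than after the whole file.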

But why is it so fast? Big computers More small computers Better algorithm

How BLAST works Make lookup table (hash table) for query Scan database for hits Extensions of hits
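
The three steps above can be illustrated with a deliberately simplified sketch. It uses exact word matches and extends a hit only while letters are identical. Real BLAST scores neighbouring words with a substitution matrix and stops extension on a score threshold, so this is not BLAST itself. The function names are hypothetical.

```python
from collections import defaultdict

def build_lookup(query, word_size=3):
    """Step 1: hash table mapping each query word to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table

def scan(subject, table, word_size=3):
    """Step 2: slide along the database sequence and collect exact word hits."""
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], ()):
            hits.append((i, j))  # (query position, subject position)
    return hits

def extend(query, subject, i, j, word_size=3):
    """Step 3 (simplified): grow the hit left and right while letters are identical."""
    q_start, s_start = i, j
    while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
        q_start -= 1
        s_start -= 1
    q_end, s_end = i + word_size, j + word_size
    while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
        q_end += 1
        s_end += 1
    return query[q_start:q_end]

query = "VPAFPAERQA"
subject = "LQRTVPAFPAERQAALWPCV"
hits = scan(subject, build_lookup(query))
first_q, first_s = hits[0]
print(hits[0], extend(query, subject, first_q, first_s))
```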

How BLAST Works Protein sequence Filter for words above a threshold Find all words in the protein sequence (>3 letters by default) Extend while score is above another threshold Calculate & report final score for alignment high scoring pairs Map Reduce

How oligo server works ● Generate unique 10-mers ● Use suffix tree to search for them

Subsystem spreadsheet (conceptually) A column becomes a protein family

Identify unique oligos (Fam 57)
… LQRTVPAFPAERQAALWPCV …
… LQKSVPAFPAERQAALWPCV …
… LGKTVPAFPAERQAALWPCV …
… VQRTVPAFPAERQAALWPCV …
… L-RSVPAFPAERQAALWPCV …
… L-RPVPAFPAERQALLWHCV …
… VQKTVPAFPAERQAILWHCV …
… VQRSVPVF-AERQAVLWHCV …
● Oligomer is unique to this family ● Doesn't have to contain all members ● All members should be represented by oligos
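
One way to picture the selection of family-specific oligos is to collect every 10-mer from every family and keep those seen in exactly one family. The sketch below does that on a toy input. The two Family 57 sequences are taken from the alignment above, while the Family 6 sequence and the function names are hypothetical, and the real pipeline may select oligos differently.

```python
from collections import defaultdict

def kmers(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def unique_oligos(families, k=10):
    """Map each family id to the k-mers that occur in that family and in no other."""
    owners = defaultdict(set)  # k-mer -> set of families containing it
    for fam, seqs in families.items():
        for seq in seqs:
            for oligo in kmers(seq, k):
                owners[oligo].add(fam)
    result = {fam: set() for fam in families}
    for oligo, fams in owners.items():
        if len(fams) == 1:
            result[next(iter(fams))].add(oligo)
    return result

families = {
    57: ["LQRTVPAFPAERQAALWPCV", "LQKSVPAFPAERQAALWPCV"],  # from the slide
    6: ["MKTVPAFPAERQGGHW"],                               # hypothetical sequence
}
oligos = unique_oligos(families)
print("VPAFPAERQA" in oligos[57])  # unique to family 57
print("TVPAFPAERQ" in oligos[57])  # shared with family 6, so discarded
```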

Suffix tree for quick look up ● Oligo sequence: VPAFPAERQA ● [Diagram: tree with one level per oligo position, each level branching over the amino-acid letters (A C F N P Q R W X ...); leaves point to families, e.g. Family 57, Family 6, Family 17, Family 34]
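
The lookup in the diagram can be approximated with a plain trie keyed on fixed-length oligos: one dictionary level per letter and a family id at the leaf. This is a simplification of the suffix tree named on the slide, and the class and function names are hypothetical.

```python
from collections import Counter

class OligoTrie:
    """Letter-by-letter tree; the node reached by a full oligo stores its family."""

    def __init__(self):
        self.root = {}

    def add(self, oligo, family):
        node = self.root
        for letter in oligo:
            node = node.setdefault(letter, {})
        node["family"] = family  # multi-letter key, cannot clash with a residue

    def lookup(self, oligo):
        node = self.root
        for letter in oligo:
            node = node.get(letter)
            if node is None:
                return None
        return node.get("family")

def scan_protein(trie, protein, k=10):
    """Count family assignments over every k-letter window of a protein."""
    counts = Counter()
    for i in range(len(protein) - k + 1):
        family = trie.lookup(protein[i:i + k])
        if family is not None:
            counts[family] += 1
    return counts

trie = OligoTrie()
trie.add("VPAFPAERQA", 57)
trie.add("VPAFPAERQG", 6)  # hypothetical oligo for another family
counts = scan_protein(trie, "LQRTVPAFPAERQAALWPCV")
print(trie.lookup("VPAFPAERQA"), counts)
```

Each lookup costs at most ten dictionary steps regardless of how many oligos are stored, which is the property that makes this kind of structure attractive for fast searches.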

Does it work?

Vector dot products ● 8,920 dimensional space (one for each function) ● Compare real time to BLAST based ● Compare within a sample (intrinsic) ● Compare sample to genomes (extrinsic)
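
A comparison of this kind can be sketched by treating each analysis as a vector of counts per function and taking the dot product. Whether the vectors were normalised is not stated on the slide, so the cosine form below is an assumption, as are the function names and the example counts.

```python
import math

def dot(a, b):
    """Dot product of two sparse count vectors stored as {function: count}."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(key, 0) for key, value in a.items())

def cosine(a, b):
    """Normalised dot product: 1.0 for identical profiles, 0.0 for no overlap."""
    norm = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / norm if norm else 0.0

# Hypothetical counts for two of the 8,920 functions
real_time = {"Serine Biosynthesis": 3, "Methionine Biosynthesis": 4}
blast_based = {"Serine Biosynthesis": 6, "Methionine Biosynthesis": 8}

print(dot(real_time, blast_based))
print(cosine(real_time, blast_based))
```

Normalising removes the effect of sequencing depth, so a subsample with the same proportions as the full sample scores as identical to it.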

RT vs BLASTX

Intrinsic (how well does a subsample represent the whole sample)

Extrinsic (how well do complete genomes represent the metagenomes)

From Sequences To Environments Naneh Apkarian, Michelle Creek, Eric Guan, Mayra Hernandez, Kate Isaacs, Chris Peterson, Todd Regh From Sequences To Environments REU Summer 2009 – Liz Dinsdale, Barb Bailey, Imre Tuba

What's next for real time analysis?

Acknowledgements My Lab Ramy Aziz Sajia Akhter Robert Schmieder Victor Seguritan Carny Cheng Kate McNair Nick Celms Daniel Cuevas Matt Hagen Josh Hoffman Vasken Kamikisisan Matt Seitz Sheridan Wright C-SUPERB CSU Program for Education and Research In Biotechnology

Acknowledgements Environmental Genomics Liz Dinsdale Forest Rohwer Brian White Mya Breitbart All the labs that provided sequence Metagenomics Annotation Server Rick Stevens Folker Meyer Bob Olson Daniel Paarman Mark D'Souza Jared Wilkening Andreas Wilke Statistics & Web services Liz Dinsdale Robert Schmieder Dana Hall Beltran Rodriguez-Brito Bahador Nosrat FIG Ross Overbeek Veronika Vonstein Annotators www.nmpdr.org www.theseed.org Artist Paula Morris Argonne Sequencing Marc Domanus Areej Ammar PHANTOME Matt Sullivan Mya Breitbart Jeff Elhai
