Real Time DNA Sequence Analysis: New tools for mining data Rob Edwards San Diego State University, San Diego, CA Argonne National Laboratory, Argonne,

1 Real Time DNA Sequence Analysis: New tools for mining data Rob Edwards San Diego State University, San Diego, CA Argonne National Laboratory, Argonne, IL CS 2009

2 Outline ● The problem – Metagenomics ● Current analytical tools ● Alternative approaches ● Real time tools

3

4 Sampling the environment...

5 Chickens, Cows, Mice, and People; Oh my!

6 Human-associated microbes and viruses ● More bacteria than somatic cells by at least an order of magnitude ● More phages than bacteria by an order of magnitude ● NIH started the Human Microbiome Project last year ● Nasal; Oral; Skin; Gastro-intestinal; Urogenital

7 Why Metagenomics? What is there? How many are there? What are they doing? Experimental manipulations? Real time metagenomics?

8

9 Computational Metagenomics

10 The SEED Family www.nmpdr.org www.theseed.org

11 Annotations vs. sequences

12 Subsystems Make Up Metabolism Wikipedia Metabolism http://en.wikipedia.org/wiki/Portal:Metabolism

13 Subsystem spreadsheet (conceptually)

14 Three level “hierarchy” ● Amino Acids and Derivatives – Alanine, serine, and glycine – Serine Biosynthesis ● Amino Acids and Derivatives – Lysine, threonine, methionine, and cysteine – Methionine Biosynthesis ● Make your own subsystems! ● Over 1,000 Subsystems

15

16 Annotation of Complete Genomes ● Automated, user-originated processing ● Takes 1-7 hours depending on size and complexity of the genome ● ~2,000 external submissions, including hundreds of genomes not yet publicly released ● Reannotation of >500 genomes complete ● 1,000 users, 200 organizations, 25 countries ● http://rast.nmpdr.org/

17 The Live ASM Test, Philadelphia, 2009 ● 10 genomes submitted on Thursday at 6 pm ● First annotation complete before 8 am Friday ● Remaining annotations completed Friday before noon ● (there were others in the pipeline too!) ● Presentation: ASM 2009, Tuesday, 8 pm

18 The metagenomics RAST server

19 Freely available comparative tools

20 Computational Requirements [Plot: hours of compute time vs. input size (MB)] ~19 hours of compute per input megabyte

21 How much so far ● Total: 3,565 metagenomes; 334,168,924 sequences; 88,311,139,391 bp (88 Gbp) ● Largest metagenome: 729 Mbp, 11,719,618 reads ● Public: 394 metagenomes; 54,414,564 sequences; 22,234,298,797 bp (22 Gbp) ● Compute time (on a single CPU): 1,677,911 hours = 69,912 days = 191 years
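The single-CPU figure above appears consistent with the earlier estimate of ~19 hours of compute per input megabyte applied to the 88 Gbp total. A minimal sketch of that arithmetic follows; it assumes one byte per base pair, 1 MB = 1,000,000 bytes, and a 365-day year (none of which is stated on the slides), and `compute_time` is a hypothetical helper name.

```python
HOURS_PER_MB = 19  # estimate quoted on the "Computational Requirements" slide


def compute_time(total_bp, hours_per_mb=HOURS_PER_MB):
    """Return (hours, days, years) of single-CPU time for total_bp bases.

    Assumption: 1 bp is stored as 1 byte and 1 MB is 1,000,000 bytes.
    """
    hours = total_bp / 1_000_000 * hours_per_mb
    days = hours / 24
    years = days / 365
    return hours, days, years


# Total bases across all metagenomes, taken from the slide.
hours, days, years = compute_time(88_311_139_391)
print(int(hours), int(days), int(years))
```

Under these assumptions the truncated values match the hours, days, and years quoted on the slide.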

22 Lots of computers, no pattern

23 But there are problems... ● More sequence data to compare to ● The algorithm is really O(n²) not O(n) ● More metagenomes being sequenced ● More sequences per metagenome ● Most jobs sit around in queues doing nothing!

24 How much has been sequenced? [Plot: number of known sequences by year; annotations: first bacterial genome, 100 bacterial genomes, 1,000 bacterial genomes, environmental sequencing]

25 How much will be sequenced? [Plot by year; annotations: 100 people, everybody in San Diego, everybody in USA, all cultured bacteria, one genome from every species, most major microbial environments]

26 Sequenced bacterial genomes

27 Read lengths of metagenomes [Histogram: average read length (bp) vs. number of metagenomes]

28

29 There has to be a better way! Click Here!

30 How does it work?

31 Client – Server – Server ● Send sequence file (PERL CGI) ● User selects DNA sequence file ● Split file into random length sections ● File section sent for processing (PERL CGI) ● Return number of sections and new URL to receive results (HTML, Javascript) ● Request next result chunk (XMLHttpRequest) ● Return each result (YAML) ● Return each result (JSON) ● Parse results, sum, and render display (Javascript)
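Two steps of this flow, splitting the input into random-length sections and summing the per-section results, can be sketched as below. This is an illustration only: the slides describe a Perl CGI and JavaScript implementation, whereas this sketch is in Python, and the function names, the size bounds, and the dictionary shape of a "result" are all assumptions.

```python
import random
from collections import Counter


def split_into_sections(records, min_size=2, max_size=5, seed=None):
    """Split a list of sequence records into consecutive sections whose
    lengths are drawn at random from [min_size, max_size].
    The final section may be shorter than min_size."""
    rng = random.Random(seed)
    sections = []
    i = 0
    while i < len(records):
        size = rng.randint(min_size, max_size)
        sections.append(records[i:i + size])
        i += size
    return sections


def sum_results(per_section_counts):
    """Add up per-section {function: count} results into one total."""
    total = Counter()
    for counts in per_section_counts:
        total.update(counts)
    return total


# Hypothetical usage: 23 placeholder records, and two made-up section results.
sections = split_into_sections([f"read_{n}" for n in range(23)], seed=1)
totals = sum_results([{"Serine Biosynthesis": 1},
                      {"Serine Biosynthesis": 2, "Methionine Biosynthesis": 1}])
```

Because each section is processed independently and the results are simply added, the display can be updated as each chunk returns rather than after the whole file is done.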

32 But why is it so fast? ● Big computers ● More small computers ● Better algorithm

33 How BLAST works ● Make lookup table (hash table) for query ● Scan database for hits ● Extensions of hits

34 How BLAST Works ● Protein sequence ● Filter for words above a threshold ● Find all words in the protein sequence (>3 letters by default) ● Extend while score is above another threshold ● Calculate & report final score for alignment ● High scoring pairs ● Map Reduce
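The three stages named on these two slides (lookup table for the query, scan for hits, extension of hits) can be sketched as a toy seed-and-extend search. This is a simplified illustration, not BLAST itself: it seeds on exact 3-letter words only (no scored neighborhood words), uses a flat match/mismatch score instead of a substitution matrix, and does ungapped extension with a simple drop-off rule. All function names and parameter values are assumptions.

```python
from collections import defaultdict


def build_lookup(query, w=3):
    """Hash table mapping each length-w word in the query to its positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def _extend(query, subject, qi, si, step, match, mismatch, drop):
    """Extend in one direction; stop once the score falls `drop` below the best.
    Returns (best score gained, number of positions kept)."""
    score = best = best_len = n = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score >= drop:
            break
        qi += step
        si += step
    return best, best_len


def seed_and_extend(query, subject, w=3, match=1, mismatch=-1, drop=2):
    """Return sorted (score, query_start, subject_start, length) tuples."""
    table = build_lookup(query, w)
    hits = set()
    for si in range(len(subject) - w + 1):            # scan the subject
        for qi in table.get(subject[si:si + w], ()):  # seed hits
            right, rlen = _extend(query, subject, qi + w, si + w, 1,
                                  match, mismatch, drop)
            left, llen = _extend(query, subject, qi - 1, si - 1, -1,
                                 match, mismatch, drop)
            hits.add((w * match + left + right, qi - llen, si - llen,
                      w + llen + rlen))
    return sorted(hits, reverse=True)


# Hypothetical sequences: the query occurs exactly inside the subject.
print(seed_and_extend("MKVLAAGIW", "XXMKVLAAGIWXX"))
```

Every seed on the same diagonal extends to the same alignment here, so storing hits in a set collapses them into one reported pair.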

35 How oligo server works ● Generate unique 10-mers ● Use suffix tree to search for them

36 Subsystem spreadsheet (conceptually) A column becomes a protein family

37 Identify unique oligos (Fam 57)
… LQRTVPAFPAERQAALWPCV …
… LQKSVPAFPAERQAALWPCV …
… LGKTVPAFPAERQAALWPCV …
… VQRTVPAFPAERQAALWPCV …
… L-RSVPAFPAERQAALWPCV …
… L-RPVPAFPAERQALLWHCV …
… VQKTVPAFPAERQAILWHCV …
… VQRSVPVF-AERQAVLWHCV...
● Oligomer is unique to this family ● Doesn't have to contain all members ● All members should be represented by oligos
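One way to pick out 10-mers that occur in a single family is sketched below. It is an illustration under stated assumptions, not the oligo server's actual procedure: gap characters are stripped before extracting k-mers, the second family and its sequence are invented for the example, and `kmers` and `unique_oligos` are hypothetical names.

```python
from collections import defaultdict


def kmers(seq, k=10):
    """All length-k substrings of seq, with alignment gaps ('-') removed first."""
    seq = seq.replace("-", "")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_oligos(families, k=10):
    """Map each k-mer that occurs in exactly one family to that family."""
    owners = defaultdict(set)
    for family, sequences in families.items():
        for seq in sequences:
            for oligo in kmers(seq, k):
                owners[oligo].add(family)
    return {oligo: next(iter(fams))
            for oligo, fams in owners.items() if len(fams) == 1}


families = {
    # Two rows from the alignment on the slide.
    "Fam 57": ["LQRTVPAFPAERQAALWPCV", "LQKSVPAFPAERQAALWPCV"],
    # Invented family that shares the 10-mer TVPAFPAERQ with Fam 57.
    "Fam 6": ["TVPAFPAERQWWWWWWWW"],
}
oligos = unique_oligos(families)
print(oligos.get("VPAFPAERQA"))   # found only in Fam 57
print(oligos.get("TVPAFPAERQ"))   # shared by both families, so not unique
```

An oligo need not occur in every member of its family; it only has to be absent from every other family.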

38 Suffix tree for quick look up [Diagram: oligo sequence VPAFPAERQA; tree levels branching on amino-acid letters (A C F N P Q R W X ...); leaves labeled Family 57, Family 6, Family 17, Family 34]
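The letter-by-letter lookup in this diagram can be sketched with a plain trie (prefix tree) built from nested dictionaries. Note the substitution: the slide names a suffix tree, and a trie over fixed-length oligos is a simpler stand-in chosen here for illustration. The oligo-to-family assignments in the example are invented, and all function names are assumptions.

```python
from collections import Counter

FAMILY_KEY = "$family"  # cannot collide with single-letter child keys


def build_trie(oligo_to_family):
    """Build a nested-dict trie; each complete oligo ends at its family label."""
    root = {}
    for oligo, family in oligo_to_family.items():
        node = root
        for letter in oligo:
            node = node.setdefault(letter, {})
        node[FAMILY_KEY] = family
    return root


def lookup(trie, oligo):
    """Walk the trie one letter at a time; return the family or None."""
    node = trie
    for letter in oligo:
        node = node.get(letter)
        if node is None:
            return None
    return node.get(FAMILY_KEY)


def scan(trie, protein, k=10):
    """Count family hits over every k-mer window of a protein sequence."""
    hits = Counter()
    for i in range(len(protein) - k + 1):
        family = lookup(trie, protein[i:i + k])
        if family is not None:
            hits[family] += 1
    return hits


# Invented assignments, for illustration only.
trie = build_trie({"VPAFPAERQA": "Family 57", "VPAFPAERQW": "Family 6"})
print(lookup(trie, "VPAFPAERQA"))
print(scan(trie, "LQRTVPAFPAERQAALWPCV"))
```

Each lookup costs at most k dictionary steps regardless of how many oligos are stored, which is what makes this kind of structure attractive for scanning reads quickly.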

39 Does it work?

40 Vector dot products ● 8,920 dimensional space (one for each function) ● Compare real time to BLAST based ● Compare within a sample (intrinsic) ● Compare sample to genomes (extrinsic)
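A comparison of two functional profiles by dot product can be sketched as follows. Each profile is treated as a sparse vector of counts with one dimension per function. Normalizing the dot product to a cosine similarity is an assumption made here so that samples of different sizes are comparable; the slide only says "vector dot products". The function names and the example counts are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two sparse {function: count} vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(count * v.get(function, 0) for function, count in u.items())


def cosine_similarity(u, v):
    """Dot product scaled by both vector lengths; 0.0 if either is empty."""
    norm = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / norm if norm else 0.0


# Invented counts for two profiles of the same sample.
real_time = {"Serine Biosynthesis": 12, "Methionine Biosynthesis": 5}
blast_based = {"Serine Biosynthesis": 10, "Methionine Biosynthesis": 6}
print(cosine_similarity(real_time, blast_based))
```

A value near 1 means the two profiles point in nearly the same direction in function space, whatever their total read counts.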

41 RT vs BLASTX

42 Intrinsic (how well does a subsample represent the whole sample)

43 Extrinsic (how well do complete genomes represent the metagenomes)

44 From Sequences To Environments ● Naneh Apkarian, Michelle Creek, Eric Guan, Mayra Hernandez, Kate Isaacs, Chris Peterson, Todd Regh ● REU Summer 2009 – Liz Dinsdale, Barb Bailey, Imre Tuba

45 What's next for real time analysis?

46

47 Acknowledgements ● My Lab: Ramy Aziz, Sajia Akhter, Robert Schmieder, Victor Seguritan, Carny Cheng, Kate McNair, Nick Celms, Daniel Cuevas, Matt Hagen, Josh Hoffman, Vasken Kamikisisan, Matt Seitz, Sheridan Wright ● C-SUPERB: CSU Program for Education and Research In Biotechnology

48 Acknowledgements ● Environmental Genomics: Liz Dinsdale, Forest Rohwer, Brian White, Mya Breitbart, all the labs that provided sequence ● Metagenomics Annotation Server: Rick Stevens, Folker Meyer, Bob Olson, Daniel Paarman, Mark D'Souza, Jared Wilkening, Andreas Wilke ● Statistics & Web services: Liz Dinsdale, Robert Schmieder, Dana Hall, Beltran Rodriguez-Brito, Bahador Nosrat ● FIG: Ross Overbeek, Veronika Vonstein, annotators ● www.nmpdr.org www.theseed.org ● Artist: Paula Morris ● Argonne Sequencing: Marc Domanus, Areej Ammar ● PHANTOME: Matt Sullivan, Mya Breitbart, Jeff Elhai

