Published by Jerome Carroll. Modified over 8 years ago.
1
Real Time DNA Sequence Analysis: New tools for mining data ● Rob Edwards ● San Diego State University, San Diego, CA ● Argonne National Laboratory, Argonne, IL ● CS 2009
2
Outline ● The problem – Metagenomics ● Current analytical tools ● Alternative approaches ● Real time tools
4
Sampling the environment...
5
Chickens, Cows, Mice, and People; Oh my!
6
Human-associated microbes and viruses ● More bacteria than somatic cells by at least an order of magnitude ● More phages than bacteria by an order of magnitude ● NIH started the human microbiome project last year ● Nasal; Oral; Skin; Gastro-intestinal; Urogenital
7
Why Metagenomics? ● What is there? ● How many are there? ● What are they doing? ● Experimental manipulations? ● Real time metagenomics?
9
Computational Metagenomics
10
The SEED Family ● www.nmpdr.org ● www.theseed.org
11
Annotations vs. sequences
12
Subsystems Make Up Metabolism (image: Wikipedia Metabolism portal, http://en.wikipedia.org/wiki/Portal:Metabolism)
13
Subsystem spreadsheet (conceptually)
14
Three level “hierarchy” ● Amino Acids and Derivatives – Alanine, serine, and glycine – Serine Biosynthesis ● Amino Acids and Derivatives – Lysine, threonine, methionine, and cysteine – Methionine Biosynthesis ● Make your own subsystems! ● Over 1,000 Subsystems
16
Annotation of Complete Genomes ● Automated, user-originated processing ● Takes 1-7 hours depending on size and complexity of the genome ● ~2,000 external submissions, including hundreds of genomes not yet publicly released ● Reannotation of >500 genomes complete ● 1,000 users, 200 organizations, 25 countries ● http://rast.nmpdr.org/
17
The Live ASM Test (Philadelphia, 2009) ● 10 genomes submitted on Thursday at 6 pm ● First annotation complete before 8 am Friday ● Remaining annotations completed Friday before noon ● (there were others in the pipeline too!) ● Presentation ASM 2009 Tuesday, 8 pm
18
The metagenomics RAST server
19
Freely available comparative tools
20
Computational Requirements ● [Plot: hours of compute time vs. input size (MB)] ● ~19 hours of compute per input megabyte
21
How much so far ● Total: 3,565 metagenomes; 334,168,924 sequences; 88,311,139,391 bp (88 Gbp) ● Largest metagenome: 729 Mbp, 11,719,618 reads ● Public: 394 metagenomes; 54,414,564 sequences; 22,234,298,797 bp (22 Gbp) ● Compute time (on a single CPU): 1,677,911 hours = 69,912 days = 191 years
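The single-CPU total above appears to match the ~19 hours per input megabyte figure applied to the 88 Gbp total. The sketch below reproduces that arithmetic. It assumes one base pair per byte, 10^6 bytes per megabyte, 24-hour days, and 365-day years; these conversions are an inference from the numbers, not something the slides state.

```python
# Assumption: 1 bp ~ 1 byte of input, 1 MB = 1,000,000 bytes.
HOURS_PER_MB = 19            # "~19 hours of compute per input megabyte"
TOTAL_BP = 88_311_139_391    # total bp across all metagenomes


def compute_time(total_bp, hours_per_mb=HOURS_PER_MB):
    """Return (hours, days, years) of single-CPU compute for total_bp of input."""
    megabytes = total_bp / 1_000_000
    hours = megabytes * hours_per_mb
    days = hours / 24
    years = days / 365
    return hours, days, years


hours, days, years = compute_time(TOTAL_BP)
# Truncate (rather than round) to whole units, as the slide appears to do.
print(f"{int(hours):,} hours = {int(days):,} days = {int(years)} years")
```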
22
Lots of computers, no pattern
23
But there are problems... ● More sequence data to compare to ● The algorithm is really O(n²) not O(n) ● More metagenomes being sequenced ● More sequences per metagenome ● Most jobs sit around in queues doing nothing!
24
How much has been sequenced? ● [Plot: number of known sequences by year, marking the first bacterial genome, 100 bacterial genomes, 1,000 bacterial genomes, and environmental sequencing]
25
How much will be sequenced? ● [Plot: projected sequencing by year, marking 100 people, everybody in San Diego, everybody in USA, all cultured Bacteria, one genome from every species, and most major microbial environments]
26
Sequenced bacterial genomes
27
Read lengths of metagenomes ● [Plot: # metagenomes vs. average read length (bp)]
29
There has to be a better way! Click Here!
30
How does it work?
31
Client – Server – Server ● User selects DNA sequence file ● Send sequence file (Perl CGI) ● Split file into random length sections ● File section sent for processing (Perl CGI) ● Return number of sections and new URL to receive results (HTML, JavaScript) ● Request next result chunk (XMLHttpRequest) ● Return each result (YAML) ● Return each result (JSON) ● Parse results, sum, and render display (JavaScript)
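The split, process, and sum steps in this flow can be sketched as follows. This is a minimal single-process illustration, not the Perl CGI and JavaScript implementation the slide describes. `split_random`, `process_section`, and `run` are hypothetical names, the section size limits are arbitrary, and `process_section` is a stand-in that only counts reads by their first base.

```python
import random
from collections import Counter


def split_random(records, min_size=1, max_size=5, rng=None):
    """Split a list of sequences into consecutive sections of random length."""
    rng = rng or random.Random()
    sections = []
    i = 0
    while i < len(records):
        size = rng.randint(min_size, max_size)
        sections.append(records[i:i + size])
        i += size
    return sections


def process_section(section):
    """Hypothetical stand-in for server-side analysis of one section."""
    return Counter(seq[0] for seq in section if seq)


def run(records, rng=None):
    """Process each section independently and sum the partial results."""
    total = Counter()
    for section in split_random(records, rng=rng):
        total.update(process_section(section))  # Counter.update adds counts
    return total


if __name__ == "__main__":
    reads = ["ACGT", "CCGA", "ATTA", "GGCA", "CATG", "ACCA", "TTGA"]
    print(run(reads))
```

Because the per-section results are plain counts, they can be summed in any order, which is what lets a client render partial totals as each chunk arrives.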
32
But why is it so fast? ● Big computers ● More small computers ● Better algorithm
33
How BLAST works ● Make lookup table (hash table) for query ● Scan database for hits ● Extensions of hits
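The three steps can be illustrated with a much-simplified sketch. It is not BLAST itself: it seeds on exact word matches only and extends while residues are identical, whereas BLAST scores words and extensions with a substitution matrix and thresholds. The function names and the word size `w=3` are assumptions made for this example.

```python
from collections import defaultdict


def build_lookup(query, w=3):
    """Step 1: hash table mapping each query word to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def scan(table, subject, w=3):
    """Step 2: yield (query_pos, subject_pos) for every exact word hit."""
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], ()):
            yield i, j


def extend(query, subject, i, j, w=3):
    """Step 3: ungapped extension in both directions while residues match."""
    start_q, start_s = i, j
    while start_q > 0 and start_s > 0 and query[start_q - 1] == subject[start_s - 1]:
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + w, j + w
    while end_q < len(query) and end_s < len(subject) and query[end_q] == subject[end_s]:
        end_q += 1
        end_s += 1
    return start_q, start_s, end_q - start_q


def best_hit(query, subject, w=3):
    """Return (query_start, subject_start, length) of the longest extension."""
    table = build_lookup(query, w)
    hits = {extend(query, subject, i, j, w) for i, j in scan(table, subject, w)}
    return max(hits, key=lambda h: h[2], default=None)


if __name__ == "__main__":
    print(best_hit("AAVPAFPAERQA", "GGGVPAFPAERQAGG"))
```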
34
How BLAST Works ● Protein sequence ● Find all words in the protein sequence (>3 letters by default) ● Filter for words above a threshold ● Extend while score is above another threshold ● Calculate & report final score for alignment high scoring pairs ● (Map / Reduce)
35
How oligo server works ● Generate unique 10-mers ● Use suffix tree to search for them
36
Subsystem spreadsheet (conceptually) ● A column becomes a protein family
37
Identify unique oligos
Fam 57:
… LQRTVPAFPAERQAALWPCV …
… LQKSVPAFPAERQAALWPCV …
… LGKTVPAFPAERQAALWPCV …
… VQRTVPAFPAERQAALWPCV …
… L-RSVPAFPAERQAALWPCV …
… L-RPVPAFPAERQALLWHCV …
… VQKTVPAFPAERQAILWHCV …
… VQRSVPVF-AERQAVLWHCV...
● Oligomer is unique to this family ● Doesn't have to contain all members ● All members should be represented by oligos
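One way to find oligos that occur in exactly one family is sketched below. It is a sketch under assumptions, not the oligo server's actual procedure: the function names are hypothetical, alignment gap characters are simply dropped before extracting 10-mers, and the "fam6" sequence is invented so that there is something to contrast against.

```python
from collections import defaultdict

K = 10  # oligo length; the slides use 10-mers


def kmers(seq, k=K):
    """All k-mers of a sequence, after dropping alignment gap characters."""
    seq = seq.replace("-", "")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_oligos(families, k=K):
    """Map each k-mer that occurs in exactly one family to that family."""
    owners = defaultdict(set)
    for family, seqs in families.items():
        for seq in seqs:
            for oligo in kmers(seq, k):
                owners[oligo].add(family)
    return {o: next(iter(f)) for o, f in owners.items() if len(f) == 1}


if __name__ == "__main__":
    families = {
        # Two members taken from the slide's Fam 57 alignment.
        "fam57": ["LQRTVPAFPAERQAALWPCV", "VQRSVPVF-AERQAVLWHCV"],
        # Hypothetical sequence, made up to share one 10-mer with fam57.
        "fam6": ["GGAERQAALWPCGG"],
    }
    oligos = unique_oligos(families)
    print(oligos.get("VPAFPAERQA"))   # present only in fam57
    print("AERQAALWPC" in oligos)     # shared with fam6, so not unique
```

As on the slide, an oligo does not need to occur in every member of its family; it only needs to occur in no other family.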
38
Suffix tree for quick look up ● Oligo sequence: VPAFPAERQA ● [Figure: tree that branches on one residue per level (A C F N P Q R W X...), followed from the first residue V down to a leaf; leaves shown are Family 57, Family 6, Family 17, Family 34]
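The lookup pictured here can be approximated with a plain trie (prefix tree) over fixed-length oligos. That is a swapped-in, simpler structure than the suffix tree the slide names. `OligoTrie`, `scan_read`, and the `"$"` leaf marker are hypothetical choices for this sketch.

```python
from collections import Counter

LEAF = "$"  # marker key for the family stored at the end of an oligo


class OligoTrie:
    """Trie with one level per residue; each complete oligo ends in a family."""

    def __init__(self):
        self.root = {}

    def add(self, oligo, family):
        node = self.root
        for residue in oligo:
            node = node.setdefault(residue, {})
        node[LEAF] = family

    def lookup(self, oligo):
        """Return the family for an oligo, or None if it is not stored."""
        node = self.root
        for residue in oligo:
            node = node.get(residue)
            if node is None:
                return None
        return node.get(LEAF)


def scan_read(read, trie, k=10):
    """Count family hits over every k-length window of a protein sequence."""
    hits = Counter()
    for i in range(len(read) - k + 1):
        family = trie.lookup(read[i:i + k])
        if family is not None:
            hits[family] += 1
    return hits


if __name__ == "__main__":
    trie = OligoTrie()
    trie.add("VPAFPAERQA", 57)
    print(scan_read("LQRTVPAFPAERQAALWPCV", trie))
```

Each lookup walks at most k levels, so its cost depends on the oligo length and not on how many oligos are stored.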
39
Does it work?
40
Vector dot products ● 8,920 dimensional space (one for each function) ● Compare real time to BLAST based ● Compare within a sample (intrinsic) ● Compare sample to genomes (extrinsic)
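A comparison of two samples as function-count vectors might look like the sketch below. Scaling each vector to unit length before taking the dot product is an assumption made here so that samples of different sizes are comparable; the slides do not say how the vectors were scaled. The three-element vectors stand in for the 8,920-dimensional ones.

```python
import math


def normalize(counts):
    """Scale a count vector to unit length."""
    norm = math.sqrt(sum(c * c for c in counts))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [c / norm for c in counts]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


def similarity(a, b):
    """Dot product of unit-length vectors; 1 means identical proportions."""
    return dot(normalize(a), normalize(b))


if __name__ == "__main__":
    # Toy function counts (one position per function), e.g. real time vs BLAST.
    real_time = [30, 0, 10]
    blast_based = [60, 0, 20]
    print(similarity(real_time, blast_based))
```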
41
RT vs BLASTX
42
Intrinsic (how well does a subsample represent the whole sample)
43
Extrinsic (how well do complete genomes represent the metagenomes)
44
From Sequences To Environments ● Naneh Apkarian, Michelle Creek, Eric Guan, Mayra Hernandez, Kate Isaacs, Chris Peterson, Todd Regh ● REU Summer 2009 – Liz Dinsdale, Barb Bailey, Imre Tuba
45
What's next for real time analysis?
47
Acknowledgements ● My Lab: Ramy Aziz, Sajia Akhter, Robert Schmieder, Victor Seguritan, Carny Cheng, Kate McNair, Nick Celms, Daniel Cuevas, Matt Hagen, Josh Hoffman, Vasken Kamikisisan, Matt Seitz, Sheridan Wright ● C-SUPERB: CSU Program for Education and Research In Biotechnology
48
Acknowledgements ● Environmental Genomics: Liz Dinsdale, Forest Rohwer, Brian White, Mya Breitbart, all the labs that provided sequence ● Metagenomics Annotation Server: Rick Stevens, Folker Meyer, Bob Olson, Daniel Paarman, Mark D'Souza, Jared Wilkening, Andreas Wilke ● Statistics & Web services: Liz Dinsdale, Robert Schmieder, Dana Hall, Beltran Rodriguez-Brito, Bahador Nosrat ● FIG: Ross Overbeek, Veronika Vonstein, Annotators (www.nmpdr.org, www.theseed.org) ● Artist: Paula Morris ● Argonne Sequencing: Marc Domanus, Areej Ammar ● PHANTOME: Matt Sullivan, Mya Breitbart, Jeff Elhai