1
High performance computational analysis of DNA sequences from different environments. Rob Edwards, Computer Science, Biology. edwards.sdsu.edu www.theseed.org
2
Outline: There is a lot of sequence; Tools for analysis; More computers; Can we speed analysis?
3
How much has been sequenced? [Plot: number of known sequences by year. Annotations: first bacterial genome; 100 bacterial genomes; 1,000 bacterial genomes; environmental sequencing]
4
How much will be sequenced? [Plot annotations: 100 people; everybody in San Diego; everybody in USA; all cultured bacteria; one genome from every species; most major microbial environments]
5
Metagenomics (Just sequence it). Samples: 200 liters water; 5-500 g fresh fecal matter; 50 g soil. Steps: concentrate and purify bacteria, viruses, etc; epifluorescent microscopy; extract nucleic acids; sequence; publish papers
6
The SEED Family
7
The metagenomics RAST server
8
Automated Processing
9
Summary View
10
Metagenomics Tools Annotation & Subsystems
11
Metagenomics Tools Phylogenetic Reconstruction
12
Metagenomics Tools Comparative Tools
13
Outline: There is a lot of sequence; Tools for analysis; More computers
14
How much data so far: 986 metagenomes; 79,417,238 sequences; 17,306,834,870 bp (17 Gbp); average ~15-20 Mbp per metagenome; ~300 GS20, ~300 FLX, ~300 Sanger
15
Computes
16
Linear compute complexity
17
Just waiting …
18
Overall compute time: ~19 hours of compute per input megabyte [Plot: hours of compute time vs. input size (MB)]
19
How much so far: 986 metagenomes; 79,417,238 sequences; 17,306,834,870 bp (17 Gbp); average ~15-20 Mbp per metagenome; ~300 GS20, ~300 FLX, ~300 Sanger. Compute time (on a single CPU): 328,814 hours = 13,700 days = 38 years
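The single-CPU estimate on this slide can be reproduced from the figures on this and the previous slide. The sketch below is only a back-of-the-envelope check; it assumes one byte per base and 1 MB = 1,000,000 bytes, neither of which is stated in the talk, and the function name is made up for illustration.

```python
# Figures from the slides. The bp-to-MB conversion (one byte per base,
# 1 MB = 1,000,000 bytes) is an assumption made for this sketch.
TOTAL_BP = 17_306_834_870
HOURS_PER_MB = 19
N_METAGENOMES = 986


def single_cpu_estimate(total_bp, hours_per_mb):
    """Return (hours, days, years) of compute on one CPU."""
    megabytes = total_bp // 1_000_000  # whole megabytes only
    hours = megabytes * hours_per_mb
    days = hours / 24
    years = days / 365
    return hours, days, years


hours, days, years = single_cpu_estimate(TOTAL_BP, HOURS_PER_MB)
print(f"{hours:,} hours, about {days:,.0f} days, about {years:.1f} years")

# Mean data set size, for comparison with the "~15-20 Mbp" figure.
mean_mbp = TOTAL_BP / N_METAGENOMES / 1e6
print(f"mean size: {mean_mbp:.1f} Mbp per metagenome")
```

Under these assumptions the hour total matches the slide exactly, and the year figure rounds to the 38 years quoted.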
20
Outline: There is a lot of sequence; Tools for analysis; More computers; Can we speed analysis?
21
Shannon’s Uncertainty (Peter’s surprisal): p(xi) is the probability of the occurrence of each base or string
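The equation on this slide did not survive in the transcript. The standard form of Shannon's uncertainty is H = -sum over i of p(xi) * log2 p(xi), and the sketch below computes it over the k-mers of a sequence. It is a minimal illustration, not the code used in the talk: the function name, the overlapping k-mer counting, and the example sequences are assumptions.

```python
from collections import Counter
from math import log2


def kmer_uncertainty(seq, k=1):
    """Shannon uncertainty, in bits, of the overlapping k-mers in seq.

    Uses sum(p * log2(1 / p)), which equals -sum(p * log2(p)).
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return sum((n / total) * log2(total / n) for n in counts.values())


# Hypothetical toy sequences.
even = kmer_uncertainty("ACGTACGTACGT")    # four bases, equally frequent
repeat = kmer_uncertainty("AAAAAAAAAAAA")  # a single repeated base
print(even, repeat)
```

Passing k=6 gives the 6-mer variant mentioned later in the talk; a longer k needs a longer sequence before the k-mer frequencies become meaningful.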
22
Surprisal in Sequences
23
Uncertainty Correlates With Similarity
24
But it’s not just randomness…
25
Which has more surprisal: coding regions or non-coding regions? Uncertainty in complete genomes [Plot panels: coding regions; non-coding regions]
26
More extreme differences with 6-mers [Plot panels: coding regions; non-coding regions]
27
Can we predict proteins? Short sequences of 100 bp translate into 30-35 amino acids. Can we predict which are real and could be doing something? Test with bacterial proteins
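To make the read-length arithmetic concrete: a 100 bp read contains 32 or 33 complete codons depending on the reading frame, which falls inside the 30-35 amino acid range quoted above. The sketch below translates a read in all six frames using the standard genetic code (NCBI translation table 1). The function names and the example read are hypothetical, and the talk does not say how its translations were produced.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Three forward frames followed by three reverse-complement frames."""
    seq = seq.upper()
    return [
        translate(strand[offset:])
        for strand in (seq, reverse_complement(seq))
        for offset in range(3)
    ]


read = "ATGGCT" * 16 + "ATGG"  # hypothetical 100 bp read
for peptide in six_frame_translations(read):
    print(len(peptide), peptide)
```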
28
Kullback-Leibler Divergence: the difference between two probability distributions; here, the difference between amino acid composition and average amino acid composition. Calculate KLD for 372 bacterial genomes
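As a rough illustration of this calculation, the sketch below computes D(p || q) = sum over amino acids of p * log2(p / q), where p is one genome's amino acid composition and q is the average composition across genomes. The pseudocount, the helper names, and the toy protein strings are all assumptions for the example; the slides do not give the exact procedure or the log base used.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(protein, pseudocount=1.0):
    """Amino acid frequencies; the pseudocount keeps every frequency above zero."""
    counts = Counter(aa for aa in protein.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def mean_composition(compositions):
    """Unweighted average of several composition dictionaries."""
    n = len(compositions)
    return {aa: sum(c[aa] for c in compositions) / n for aa in AMINO_ACIDS}


def kld(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[aa] * log2(p[aa] / q[aa]) for aa in AMINO_ACIDS)


# Hypothetical concatenated proteomes standing in for real genomes.
genomes = {
    "genome_a": "MKKIINNKYFLIKNNIYKKLNIFKK",
    "genome_b": "MAGRPLAAGVRRGAPLVAAGGRDAL",
    "genome_c": "MSTEQLKDVAHGNPRIWYFCMTSEL",
}
compositions = {name: composition(seq) for name, seq in genomes.items()}
average = mean_composition(list(compositions.values()))
divergences = {name: kld(p, average) for name, p in compositions.items()}
for name, value in divergences.items():
    print(name, round(value, 4))
```

Averaging the compositions without weighting by genome size is one possible choice; weighting by protein count would be another.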
29
KLD varies by bacteria Colored by taxonomy of the bacteria
30
KLD varies by bacteria
31
Most divergent genomes: Borrelia garinii – Spirochaetes; Mycoplasma mycoides – Mollicutes; Ureaplasma parvum – Mollicutes; Buchnera aphidicola – Gammaproteobacteria; Wigglesworthia glossinidia – Gammaproteobacteria
32
Divergence and metabolism [Plot series: Bifidobacterium; Bacillus; Nostoc; Salmonella; Chlamydophila; mean of all bacteria]
33
Divergence and amino acids [Plot series: Ureaplasma; Wigglesworthia; Borrelia; Buchnera; Mycoplasma; bacteria mean; archaea mean; eukaryotic mean]
34
Predicting KLD: y = 2x^2 - 2x + 0.5 [Plot: KLD per genome vs. percent G+C]
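The fitted curve can be evaluated directly. The sketch below assumes x is the G+C content expressed as a fraction between 0 and 1, even though the axis is labelled as a percentage; that reading is an assumption, chosen because the curve then equals 2(x - 0.5)^2 and reaches zero at 50% G+C. The function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C among the A, C, G, T characters in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def predicted_kld(x):
    """Quadratic from the slide, y = 2x^2 - 2x + 0.5.

    x is taken to be a G+C fraction in [0, 1]; this is an assumption.
    """
    return 2 * x**2 - 2 * x + 0.5


# Hypothetical G+C fractions spanning AT-rich to GC-rich genomes.
for x in (0.25, 0.50, 0.75):
    print(f"G+C {x:.0%}: predicted KLD {predicted_kld(x):.3f}")
```

On this reading the prediction is symmetric around 50% G+C, so strongly AT-rich and strongly GC-rich genomes are predicted to be equally divergent from the average composition.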
35
Summary: Shannon’s uncertainty could predict useful sequences; KLD varies too much to be useful and is driven by %G+C content
36
New solutions for old problems?
37
Xen and the art of imagery
38
The cell phone problem
39
Searching the SEED by SMS [Diagram: a phone keypad; the text message "seed search histidine coli"; AUTOSEEDSEARCHES @ GMAIL.COM; edwards.sdsu.edu; SEED databases; the reply "22 proteins in E. coli". Location labels: Anywhere, Idaho, GMCS429, Argonne]
40
Challenges: Too much data; Not easy to prioritize; New models for HPC needed; New interfaces to look at data
41
Acknowledgements: Sajia Akhter, Rob Schmieder, Nick Celms, Sheridan Wright, Ramy Aziz, FIG, The mg-RAST team, Rick Stevens, Peter Salamon, Barb Bailey, Forest Rohwer, Anca Segall