Presentation is loading. Please wait.

Presentation is loading. Please wait.

High performance computational analysis of DNA sequences from different environments Rob Edwards Computer Science Biology edwards.sdsu.eduwww.theseed.org.

Similar presentations


Presentation on theme: "High performance computational analysis of DNA sequences from different environments Rob Edwards Computer Science Biology edwards.sdsu.eduwww.theseed.org."— Presentation transcript:

1 High performance computational analysis of DNA sequences from different environments Rob Edwards Computer Science Biology edwards.sdsu.eduwww.theseed.org

2 Outline  There is a lot of sequence  Tools for analysis  More computers  Can we speed analysis

3 First bacterial genome 100 bacterial genomes 1,000 bacterial genomes Number of known sequences Year How much has been sequenced? Environmental sequencing

4 Everybody in San Diego Everybody in USA All cultured Bacteria 100 people How much will be sequenced? One genome from every species Most major microbial environments

5 Metagenomics (Just sequence it) 200 liters water 5-500 g fresh fecal matter 50 g soil Sequence Epifluorescent Microscopy Concentrate and purify bacteria, viruses, etc Extract nucleic acids Publish papers

6 The SEED Family

7 The metagenomics RAST server

8 Automated Processing

9 Summary View

10 Metagenomics Tools Annotation & Subsystems

11 Metagenomics Tools Phylogenetic Reconstruction

12 Metagenomics Tools Comparative Tools

13 Outline  There is a lot of sequence  Tools for analysis  More computers

14 How much data so far 986 metagenomes 79,417,238 sequences 17,306,834,870 bp (17 Gbp) Average: ~15-20 M bp per genome ~300 GS20 ~300 FLX ~300 Sanger

15 Computes

16 Linear compute complexity

17 Just waiting …

18 Hours of Compute Time Input size (MB) Overall compute time ~19 hours of compute per input megabyte

19 How much so far 986 metagenomes 79,417,238 sequences 17,306,834,870 bp (17 Gbp) Average: ~15-20 M bp per genome Compute time (on a single CPU): 328,814 hours = 13,700 days = 38 years ~300 GS20 ~300 FLX ~300 Sanger

20 Outline  There is a lot of sequence  Tools for analysis  More computers  Can we speed analysis

21 Shannon’s Uncertainty Shannon’s Uncertainty – Peter’s surprisal p(xi) is the probability of the occurrence of each base or string

22 Surprisal in Sequences

23 Uncertainty Correlates With Similarity

24 But it’s not just randomness…

25 Which has more surprisal: coding regions or non-coding regions? Uncertainty in complete genomes Coding regionsNon-coding regions

26 More extreme differences with 6-mers Coding regionsNon-coding regions

27 Can we predict proteins Short sequences of 100 bp Translate into 30-35 amino acids Can we predict which are real and could be doing something? Test with bacterial proteins

28 Kullback-Leibler Divergence Difference between two probability distributions Difference between amino acid composition and average amino acid composition Calculate KLD for 372 bacterial genomes

29 KLD varies by bacteria Colored by taxonomy of the bacteria

30 KLD varies by bacteria

31 Most divergent genomes Borrelia garinii – Spirochaetes Mycoplasma mycoides – Mollicutes Ureaplasma parvum – Mollicutes Buchnera aphidicola – Gammaproteobacteria Wigglesworthia glossinidia – Gammaproteobacteria

32 Divergence and metabolism Bifidobacterium Bacillus Nostoc Salmonella Chlamydophila Mean of all bacteria

33 Divergence and amino acids Ureaplasma Wigglesworthia Borrelia Buchnera Mycoplasma Bacteria mean Archaea mean Eukaryotic mean

34 Predicting KLD y = 2x 2 -2x+0.5 KLD per genome Percent G+C

35 Summary Shannon’s uncertainty could predict useful sequences KLD varies too much to be useful and is driven by %G+C content

36 New solutions for old problems?

37 Xen and the art of imagery

38 The cell phone problem

39 Searching the seed by SMS 1 2 3 4 5 6 7 8 9 * 0 # 1 2 3 4 5 6 7 8 9 * 0 # seed search histidine coli seed search histidine coli GMAIL.COM @ AUTO SEEDSEARCHES edward s. sdsu. edu SEED databases 22 proteins in E. coli ) ) ) ) ) ) ) ) AnywhereIdahoGMCS429Argonne

40 Challenges Too much data Not easy to prioritize New models for HPC needed New interfaces to look at data

41 Acknowledgements Sajia Akhter Rob Schmieder Nick Celms Sheridan Wright Ramy Aziz FIG The mg-RAST team Rick Stevens Peter Salamon Barb Bailey Forest Rohwer Anca Segall


Download ppt "High performance computational analysis of DNA sequences from different environments Rob Edwards Computer Science Biology edwards.sdsu.eduwww.theseed.org."

Similar presentations


Ads by Google