The data flood: We need a bigger boat
James A. Foster
The Initiative for Bioinformatics and Evolutionary Studies (IBEST)
Biological Sciences, Bioinformatics and Computational Biology
University of Idaho
2 JAF INBRE Data Flood 8/4/09 Outline
✦ Where is this flood of data coming from?
✦ What kind of tool is appropriate for this amount of data?
✦ What kind of a tool is “bioinformatics”?
✦ How about an example?
3 DNA sequencing data flood
[Table: year, throughput (bp/day), technology. Technologies named: Gels, ABI 370, ABI 377, a later ABI instrument, /FLX, then "???". The year and bp/day values were lost in extraction.]
4 The data flood: DNA example
[Table: year, bp/day, notes. The year and bp/day values were lost in extraction. Surviving notes: Manual: φx; Gel: ABI; Gel: ABI; Cap: ABI; /FLX; 2012: ??]
Water analogy, in increasing order: 1L; Barrel (176 gallons); Big pool (2x6x12m); football field, 20m deep; Lakes Michigan/Huron; all Great Lakes (nearly); ocean?
5 Bioinformatics tools
Year   Data volume
1977   1L
1986   barrel
1995   big pool
1998   football field
2008   Lake Michigan
2009   Great Lakes
2012   ocean?
Technology (pairing with the rows above was lost in extraction): hose, pfd, Kayak, Orca?, bigger boat?, Glomar?, spoon
6 Bioinformatics: bigger boat?
[Figure labels: Your thesis; Data; The Computer (bioinformatics); Hypo; You; Your hypothesis]
7 Reflection on the metaphor
✦ At some point, you can use fundamentally different techniques: spoons versus boats
✦ At some point, you can test fundamentally new hypotheses: not “we need a smaller shark”
✦ Sometimes the old technology is still good: the kayak was appropriate in this picture
✦ The new technology may be for a different purpose: fishing versus deep sea exploration
8 ✦ Technology quiz!
9 What does this do?
10 What does this do? Not that! THIS! A Bigger Boat
Whatever you tell it to do!
11 What is Bioinformatics?
✦ Bioinformatics is what you tell the computer to do with your data
12 Of Boats and Bioinformatics
Bioinformatics is what you do with the boat you are in during the data flood.
You might be able to do more with a bigger boat.
13 Sampling emergent diversity
✦ Get ALL DNA along an age-variant transect
    10 samples per site
    time since exposure: 5y, 19y, 40y, 63y, 100y, and 150y
    “chronoclines” sample ecosystems by age
✦ Who’s there?
✦ How does the ecosystem change over time?
14 Bioinformatics problems
✦ Estimate α diversity: number of “species” in each sample and age group
✦ Estimate β diversity: amount of variation in “species” between age groups
✦ Determine which species (no quotes) are present in each sample (not part of this talk)
Biological questions:
    How do soil bacteria respond to retreating glaciers?
    How do microbial soil communities change?
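The two diversity estimates above can be sketched in a few lines. This is a minimal illustration, not the estimators used in the study: it assumes each sample has already been reduced to a table of cluster counts, uses observed richness and the bias-corrected Chao1 estimator as stand-ins for α diversity, and uses Jaccard dissimilarity on presence/absence as a stand-in for β diversity. The cluster names (`otu1`, ...) and counts are made up.

```python
from collections import Counter


def observed_richness(counts):
    """Number of clusters ("species") seen at least once in a sample."""
    return sum(1 for c in counts.values() if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness estimate from singleton/doubleton counts."""
    s_obs = observed_richness(counts)
    f1 = sum(1 for c in counts.values() if c == 1)  # clusters seen once
    f2 = sum(1 for c in counts.values() if c == 2)  # clusters seen twice
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def jaccard_dissimilarity(a, b):
    """1 - |A and B| / |A or B|, computed on presence/absence only."""
    present_a = {k for k, c in a.items() if c > 0}
    present_b = {k for k, c in b.items() if c > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


# Hypothetical cluster counts for a young and an old sample.
young = Counter({"otu1": 5, "otu2": 1, "otu3": 2})
old = Counter({"otu1": 3, "otu3": 1, "otu4": 1, "otu5": 2})

print(observed_richness(young), observed_richness(old))  # 3 4
print(chao1(old))                                        # 4.5
print(jaccard_dissimilarity(young, old))
```

Observed richness undercounts when many clusters are rare, which is why an estimator such as Chao1 is often reported alongside it.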
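15 Lots of data (post QC)
Age     Samples   Sequences
5y      9         35,…
19y     10        41,…
40y     8         33,…
63y     9         41,…
100y    8         41,…
150y    8         40,…
Total   52        233,…
[The last three digits of each sequence count and the DNA (Mbp) column were lost in extraction.]
Note: a SMALL run; the maximum is 37 GB per 8-hour run, 1.6 Bbp/day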
16 Bioinformatics objectives
✦ Determine species
✦ Cluster by species
✦ Cluster by age
✦ Explain data in terms of biological processes and age (tell a story)
Too much data: 233K sequences!
17 Trick: Turn it upside down
✦ Cluster each of 52 samples (approx. 6k each), choose a proxy sequence
✦ Cluster proxies by age (approx. 40k each)
✦ Cluster combined sequences to get species (quantify richness)
✦ Build +/- matrix
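The bottom-up idea above can be sketched as follows. This is a simplified toy, not the pipeline used in the study: it collapses the per-age step into a single pooled step, compares sequences position by position instead of aligning them, uses a greedy first-match clustering, and reads the "+/- matrix" as a presence/absence matrix. The function names, the 0.97 default threshold, and the sample data are all assumptions made for illustration.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(seqs, threshold=0.97):
    """Each sequence joins the first proxy it matches, else it founds a new cluster."""
    proxies, members = [], []
    for s in seqs:
        for i, p in enumerate(proxies):
            if identity(s, p) >= threshold:
                members[i].append(s)
                break
        else:
            proxies.append(s)
            members.append([s])
    return proxies, members


def presence_absence(samples, threshold=0.97):
    """Cluster within samples, then cluster the proxies, then build the matrix."""
    # Stage 1: reduce each sample to one proxy sequence per cluster.
    per_sample = {
        name: greedy_cluster(seqs, threshold)[0] for name, seqs in samples.items()
    }
    # Stage 2: cluster the pooled proxies to get "species".
    pooled = [p for proxies in per_sample.values() for p in proxies]
    species, _ = greedy_cluster(pooled, threshold)
    # Stage 3: one row per sample, one column per species.
    matrix = {
        name: [any(identity(p, sp) >= threshold for p in proxies) for sp in species]
        for name, proxies in per_sample.items()
    }
    return species, matrix


# Hypothetical 10 bp "reads"; a 0.9 threshold tolerates one mismatch.
samples = {
    "5y_a": ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"],
    "150y_a": ["AAAAAAAAAA", "GGGGGGGGGG"],
}
species, matrix = presence_absence(samples, threshold=0.9)
print(species)  # ['AAAAAAAAAA', 'CCCCCCCCCC', 'GGGGGGGGGG']
print(matrix)   # {'5y_a': [True, True, False], '150y_a': [True, False, True]}
```

The point of the trick is cost: the expensive all-against-all comparison only ever runs on one sample, or on the much smaller set of proxies, never on all 233K sequences at once.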
18 Bioinformatics challenges
✦ Move data between computers (IGS, laptop, IBEST Core)
✦ File the data in a retrievable way
✦ Associate metadata with data
✦ Cluster sequences within/between samples
✦ Associate clusters with species
✦ Compute diversity statistics
✦ Prepare publications and talks
✦ (much more)
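Two of the challenges above, filing data retrievably and associating metadata with it, come down to keeping a catalog next to the sequence files. A minimal sketch using Python's built-in `sqlite3` module; the table layout, sample IDs, and file paths are hypothetical and are not the schema used by the project.

```python
import sqlite3


def build_catalog(rows):
    """Create an in-memory catalog linking each sample to its age and file."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples ("
        " sample_id TEXT PRIMARY KEY,"
        " age_years INTEGER NOT NULL,"
        " fasta_path TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def files_for_age(conn, age_years):
    """Return the sequence files for every sample of a given exposure age."""
    cur = conn.execute(
        "SELECT fasta_path FROM samples WHERE age_years = ? ORDER BY sample_id",
        (age_years,),
    )
    return [row[0] for row in cur.fetchall()]


# Hypothetical entries: (sample ID, years since exposure, file location).
catalog = build_catalog([
    ("s01", 5, "data/5y/s01.fasta"),
    ("s02", 5, "data/5y/s02.fasta"),
    ("s03", 150, "data/150y/s03.fasta"),
])
print(files_for_age(catalog, 5))  # ['data/5y/s01.fasta', 'data/5y/s02.fasta']
```

With the metadata in a table, "cluster by age" becomes a query followed by a loop over files, and the same catalog can travel with the data between machines.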
19 Conclusions
✦ Biology
    There are thousands of species of bacteria in arctic soil
    The number of bacterial species increases as time of post-glacial exposure increases
✦ Algorithmics (want a job?)
    “Quantity has a quality all its own” (V.I.Lenin)
    Need new algorithms to use new hardware
    Database/dataset management is crucial
20 Thanks!
✦ Ursel Schüette
✦ Zaid Abdo
✦ Jacob Pierson
✦ Larry Forney
✦ Rob Lyon
✦ The Forney-Top lab
✦ John Bunge, Cornell
✦ The Relational Database project, MSU
✦ to INBRE for the excuse
✦ to IBEST for the science
✦ to NIH, NSF, and UI for the money (P20RR16448, P20RR016454, EPS080935)
21 ✦ Discussion?
22 Extra stuff
Intentionally blank
23 Roche 454: a genome a day
24 Metagenomics
✦ Harvest approximately the first 300bp of every 16S rRNA molecule, all samples
    Ribosome: required to translate mRNA into protein (conserved)
    Common marker for microbial species
✦ Cluster by evolutionary relationships (“species”)
✦ Analyze by chronocline
25 Future work: same tune, new lyrics
✦ Data from human microbiome
    How do microbial communities vary between healthy and sick people?
✦ Data from polluted soil (Yangtze River, PRC)
    How do microbial communities vary as pollution increases?
✦ Data from longitudinal transects
    How does microbial diversity change with latitude?