eScience Data Management
Bill Howe, PhD, eScience Institute
It's not just size that matters, it's what you can do with it
Photo from the eScience Rollout, 11/5/08 (pictured: me)
My Background
BS, Industrial and Systems Engineering, GA Tech, 1999
Big 3 consulting with Deloitte: residual guilt from call centers of consultants burning $50k/day
Independent consulting, 2000-01: Microsoft, Siebel, Schlumberger, Verizon
PhD, Computer Science, Portland State University, 2006 (via OGI)
Dissertation: "GridFields: Model-Driven Data Manipulation in the Physical Sciences"; Advisor: David Maier
Postdoc and Data Architect, NSF Science and Technology Center for Coastal Margin Observation and Prediction (CMOP)
All Science is Becoming eScience
Old model: "Query the world" (data acquisition coupled to a specific hypothesis)
New model: "Download the world" (data acquired en masse, independent of hypotheses)
But: acquisition now outpaces analysis
Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Medicine: ubiquitous digital records, MRI, ultrasound
Oceanography: high-resolution models, cheap sensors, satellites
Biology: automated PCR, high-throughput sequencing
"Increase Data Collection Exponentially in Less Time, with FlowCAM"
Paradigm progression: empirical, analytical, computational, X-informatics
The Long Tail
The long tail is getting fatter: notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)
Researchers with growing data management challenges but limited resources for cyberinfrastructure: no dedicated IT staff, overreliance on simple tools (e.g., spreadsheets)
Figure: data inventory vs. ordinal position, from CERN (~15 PB/year), LSST (~100 PB), PanSTARRS (~40 PB), SDSS (~100 TB), and CARMEN (~50 TB) down the tail to ocean modelers, seismologists, and microbiologists
"The future is already here. It's just not very evenly distributed." -- William Gibson
Heterogeneity Also Drives Costs
Figure: # of bytes vs. # of data types, for CERN (~15 PB/year; particle interactions), LSST (~100 PB; images, objects), PanSTARRS (~40 PB; images, objects, trajectories), OOI (~50 TB/year; simulation results, satellite, gliders, AUVs, vessels, more), SDSS (~100 TB; images, objects), and biologists (~10 TB; sequences, alignments, annotations, BLAST hits, metadata, phylogenetic trees)
Facets of Data Management
Query languages, storage management, web services, visualization, workflow, data integration, knowledge extraction, crawlers, access methods, data mining, distributed programming models, provenance
Common thread: complexity-hiding interfaces
The DB maxim: push computation to the data (see the sketch below)
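A minimal sketch of the "push computation to the data" maxim, using Python's built-in sqlite3 module; the table, station names, and values are hypothetical, not from any real CMOP or Armbrust Lab schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # hypothetical; a real database would live on disk
conn.execute("CREATE TABLE obs (station TEXT, depth REAL, salinity REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?, ?)",
                 [("jetty", 2.0, 12.5), ("jetty", 8.0, 28.1), ("channel", 3.5, 15.0)])

# Anti-pattern: pull every row to the client, then filter and average there.
rows = conn.execute("SELECT depth, salinity FROM obs").fetchall()
shallow = [s for (d, s) in rows if d < 5.0]
avg_client = sum(shallow) / len(shallow)

# Pushing computation to the data: the engine filters and aggregates,
# so only a single number crosses the interface.
(avg_db,) = conn.execute(
    "SELECT AVG(salinity) FROM obs WHERE depth < 5.0").fetchone()
assert avg_client == avg_db
```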
Example: Relational Databases
At IBM Almaden in the '60s and '70s, Codd worked out a formal basis for tabular data representation, organization, and access [Codd 70].
The early systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!
Now: a $10B market and the de facto standard for data management. SQL is "intergalactic dataspeak."
Key ideas: physical data independence, logical data independence
Medium-Scale Data Management Toolbox
Relational databases
Scientific workflow systems
Science "mashups"
"Dataspace" systems
The "hammer" of data management
[Howe, Freire, Silva, et al. 2008] [Howe, Green-Fishback, Maier 2009] [Howe, Maier, Rayner, Rucker 2008]
Large-Scale Data Management Toolbox
MapReduce: parallel programming using functional programming abstractions (Google); see the sketch below
Dryad: parallel programming via relational algebra, plus type safety, monitoring, debugging (Michael Isard, Microsoft Research)
Amazon S3: RDBMS-like features in the cloud; note: cost effectiveness unclear for large datasets
Howe, Freire, Silva: 2009 NSF CluE Award
Connolly, Gardner: 2009 NSF CluE Award
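A purely illustrative sketch of the MapReduce abstraction in Python, computing average salinity per month; it runs serially here, whereas a real framework shards records across machines and shuffles intermediate (key, value) pairs between phases. The record format is hypothetical.

```python
from collections import defaultdict

records = [("2008-02-01", 28.1), ("2008-02-15", 29.4), ("2008-05-03", 21.7)]

def map_fn(date, salinity):
    yield date[5:7], salinity            # key = month, value = one reading

def reduce_fn(month, values):
    return month, sum(values) / len(values)

groups = defaultdict(list)
for rec in records:
    for key, value in map_fn(*rec):      # map phase
        groups[key].append(value)
# shuffle/group-by-key phase happens implicitly in `groups`
print([reduce_fn(k, vs) for k, vs in groups.items()])   # reduce phase
```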
Current Activities
Consulting: Armbrust Lab (next slide)
Research: MapReduce for Oceanographic Simulations (+ Visualization and Workflow)
Consulting: Armbrust Lab
Initial goal: corral and inventory all relevant data
SOLiD sequencer: potentially 0.5 TB/day of flat files
Metadata: small relational DB + Rails/Django web app (a hypothetical sketch follows)
Data products: visualizations, intermediate results, ad hoc scripts and programs -- key idea: these are data too
Initial goal: amplify programmer effort
Change is constant: no "one size fits all" solution; ad hoc development is the norm
Strategy: teach biologists to "fish" (David Schruth's R course)
Strategy: develop an infrastructure that enables and encourages reuse -- scientific workflow systems
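A hypothetical Django model illustrating the "small relational DB + web app" approach to run metadata; the fields and names are invented for illustration, not the lab's actual schema, and the snippet assumes it lives inside an existing Django app.

```python
from django.db import models

class SequencingRun(models.Model):
    """Metadata for one sequencer run; the flat files themselves stay on disk."""
    sample_name = models.CharField(max_length=100)
    instrument = models.CharField(max_length=50, default="SOLiD")
    run_date = models.DateField()
    output_path = models.CharField(max_length=255)    # where the flat files live
    size_gb = models.FloatField(null=True, blank=True)

    def __str__(self):
        return f"{self.sample_name} ({self.run_date})"
```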
Scientific Workflow Systems
Value proposition: more time on science, less time on code
How: by providing language features emphasizing sharing, reuse, reproducibility, rapid prototyping, and efficiency -- provenance, automatic task parallelism, visual programming, caching, domain-specific toolkits (see the sketch below)
Many examples from the eScience and DB communities: Trident (MSR), Taverna (Manchester), Kepler (UCSD), VisTrails (Utah), and more
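A toy Python sketch (not modeled on any particular workflow system) of two of the features listed above, caching of intermediate results and provenance capture; the step names, file name, and values are hypothetical.

```python
import hashlib, json

CACHE, PROVENANCE = {}, []

def step(func):
    """Wrap a pipeline step: reuse cached results and record what ran."""
    def wrapper(*args):
        key = hashlib.sha1(repr((func.__name__, args)).encode()).hexdigest()
        if key not in CACHE:
            CACHE[key] = func(*args)
            PROVENANCE.append({"step": func.__name__, "inputs": repr(args)})
        return CACHE[key]
    return wrapper

@step
def extract(path):        # stand-in for reading raw instrument output
    return [12.5, 28.1, 30.4]

@step
def summarize(values):    # stand-in for an analysis step
    return sum(values) / len(values)

result = summarize(tuple(extract("run42.raw")))  # re-running reuses the cache
print(result)
print(json.dumps(PROVENANCE, indent=2))
```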
Photo: the Trident Scientific Workflow Workbench for Oceanography, developed by Microsoft Research, demonstrated at Microsoft's TechFest
Screenshots: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
A collaborative visualization history:
Bill (CMOP) computes salt flux using GridFields
Erik (Utah) adds vector streamlines and adjusts opacity
Bill (CMOP) adds an isosurface of salinity
Peter Lawson adds discussion of the scientific interpretation
source: VisTrails (Silva, Freire, Anderson) and GridFields (Howe)
Strategy at Armbrust Lab
1. Develop a benchmark suite of workflow exemplars and use them to evaluate workflow offerings
2. "Let a hundred flowers blossom" -- deploy multiple solutions in practice to assess user uptake
3. "Pay as you go" -- evolve a toolkit rather than attempt a comprehensive, monolithic data management juggernaut
Informed by two of Jim Gray's Laws of Data Engineering: start with "20 queries"; go from "working to working"
NSF Award: Cluster Exploratory (CluE)
Partnership between NSF, IBM, and Google
Data-intensive computing: an "I/O farm" for massive queries, not massive simulations; "in ferro" experiments
Goal: to "cloud-enable" GridFields and VisTrails -- 10+-year climatologies at interactive speeds
Requires turning over up to 25 TB in under 5 seconds (see the back-of-envelope sketch below)
Provenance, reproducibility, visualization: VisTrails connects a rich desktop experience to the cloud query engine
Co-PIs: Claudio Silva and Juliana Freire, University of Utah
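A back-of-envelope check (my numbers, not from the proposal) of why scanning 25 TB in under 5 seconds forces a scale-out, push-computation-to-the-data design:

```python
# Aggregate bandwidth needed to scan 25 TB in 5 seconds.
data_tb = 25
seconds = 5
per_disk_mb_s = 100          # rough sequential read rate of one 2009-era disk

needed_mb_s = data_tb * 1e6 / seconds       # 5,000,000 MB/s = 5 TB/s aggregate
disks = needed_mb_s / per_disk_mb_s         # ~50,000 disks scanning in parallel
print(f"{needed_mb_s:,.0f} MB/s aggregate -> ~{disks:,.0f} disks")
# Conclusion: a brute-force scan is infeasible on a single machine; you need
# massive parallelism plus data reduction (indexing, aggregation, caching)
# running near the data.
```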
Amdahl's Laws
Gene Amdahl (1965): laws for a balanced system
i. Parallelism: max speedup is (S+P)/S, for serial work S and parallelizable work P
ii. One bit of IO/sec per instruction/sec (BW)
iii. One byte of memory per instruction/sec (MEM)
iv. One IO per 50,000 instructions (IO)
Modern multi-core systems move farther away from Amdahl's Laws (Bell, Gray, and Szalay 2006)
For a Blue Gene, BW = 0.001 and MEM = 0.12; for the JHU cluster, BW = 0.5 and MEM = 1.04
source: Alex Szalay, keynote, eScience 2008
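For reference, the standard fractional form of Amdahl's speedup bound (my formulation, equivalent to law i above):

```latex
\[
  \mathrm{Speedup}(N) \;=\; \frac{1}{\,s + \dfrac{1-s}{N}\,},
  \qquad
  \lim_{N \to \infty} \mathrm{Speedup}(N) \;=\; \frac{1}{s},
\]
% where $s$ is the serial fraction of the work and $N$ the number of processors;
% no amount of parallel hardware beats the $1/s$ ceiling.
```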
Climatology figure: average surface salinity (psu) by month (February and May), showing the Columbia River plume off the Washington and Oregon coasts (animation)
Epilogue
We're here to help!
SIG Wiki:
eScience Blog:
eScience website:
eScience requirements are fractal
"The future is already here. It's just not very evenly distributed." -- William Gibson
Diagram: eScience spans high-performance computing, data management, consulting, online collaboration tools, and CS research
It's what you can do with it
Relational database: SQL, plus UDTs and UDFs as needed
FASTA databases: alignments, rarefaction curves, phylogenetic trees, filtering
MapReduce: roll your own
Dryad: relational algebra available; you can still roll your own if needed
A data deluge in all fields
Acquisition eventually outpaces analysis
Astronomy: SDSS, now LSST, PanSTARRS
Biology: PCR, SOLiD sequencing
Oceanography: high-resolution models, cheap sensors
Marine microbiology: flow cytometry
Paradigm progression: empirical, analytical, computational, X-informatics
"Increase Data Collection Exponentially in Less Time, with FlowCAM"
Diagram: eScience research connects high-performance computing, data management, consulting, online collaboration, community building, and technology transfer
Query Languages
Organize and encapsulate access methods
Raise the level of abstraction beyond general-purpose languages (GPLs)
Identify and exploit opportunities for algebraic optimization
What is algebraic optimization? Consider the expression x/z + y/z. Since x/z + y/z = (x + y)/z, the latter form is less expensive: it involves only one division.
Tables -- SQL
XML -- XQuery, XPath
RDF -- SPARQL
Streams -- StreamSQL, CQL
Meshes (e.g., finite element simulations) -- GridFields
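The same idea at the query-language level, in relational algebra (a standard textbook rewrite, not specific to any one system): when the predicate p mentions only attributes of R, the selection can be pushed below the join,

```latex
\[
  \sigma_{p}\,(R \bowtie S) \;=\; \bigl(\sigma_{p}\,R\bigr) \bowtie S ,
\]
% valid whenever $p$ references only attributes of $R$; the filter runs first,
% so far fewer tuples ever reach the (expensive) join.
```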
Example: Relational Databases (In Codd we Trust…)
At IBM Almaden in the '60s and '70s, Codd worked out a formal basis for working with tabular data [1].
The early relational systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!
[1] E. F. Codd, "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM 13(6), pp. 377-387, 1970
The Database Game: do the same thing as Codd, but with new data types: XML (trees), RDF (graphs), streams, DNA sequences, images, arrays, simulation results, etc.
Gray's Laws of Data Engineering
Jim Gray: scientific computing is revolving around data
Need a scale-out solution for analysis
Take the analysis to the data!
Start with "20 queries"
Go from "working to working"
DISSC: Data-Intensive Scalable Scientific Computing
slide source: Alex Szalay, keynote, eScience 2008
Data Management