eScience Data Management
Bill Howe, PhD, eScience Institute
It's not just size that matters, it's what you can do with it
Photo from the eScience Rollout, 11/5/08 (pictured: me)
My Background
BS, Industrial and Systems Engineering, GA Tech, 1999
Big 3 consulting with Deloitte: residual guilt from call centers of consultants burning $50k/day
Independent consulting, 2000-01: Microsoft, Siebel, Schlumberger, Verizon
PhD, Computer Science, Portland State University, 2006 (via OGI)
Dissertation: "GridFields: Model-Driven Data Manipulation in the Physical Sciences"; Advisor: David Maier
Postdoc and Data Architect, NSF Science and Technology Center for Coastal Margin Observation and Prediction (CMOP)
All Science is Becoming eScience
Old model: "Query the world" (data acquisition coupled to a specific hypothesis)
New model: "Download the world" (data acquired en masse, independent of hypotheses)
But: acquisition now outpaces analysis
Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
Medicine: ubiquitous digital records, MRI, ultrasound
Oceanography: high-resolution models, cheap sensors, satellites
Biology: automated PCR, high-throughput sequencing
"Increase Data Collection Exponentially in Less Time, with FlowCAM"
Paradigm progression: empirical, analytical, computational, X-informatics
The Long Tail
The long tail is getting fatter: notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)
Researchers with growing data management challenges but limited resources for cyberinfrastructure: no dedicated IT staff, overreliance on simple tools (e.g., spreadsheets)
Figure: data inventory vs. ordinal position, from CERN (~15 PB/year), LSST (~100 PB), PanSTARRS (~40 PB), SDSS (~100 TB), and CARMEN (~50 TB) down the tail to ocean modelers, seismologists, and microbiologists
"The future is already here. It's just not very evenly distributed." -- William Gibson
Heterogeneity Also Drives Costs
Figure: # of bytes vs. # of data types, for CERN (~15 PB/year; particle interactions), LSST (~100 PB; images, objects), PanSTARRS (~40 PB; images, objects, trajectories), OOI (~50 TB/year; simulation results, satellite, gliders, AUVs, vessels, more), SDSS (~100 TB; images, objects), and biologists (~10 TB; sequences, alignments, annotations, BLAST hits, metadata, phylogenetic trees)
Facets of Data Management
Query languages, storage management, web services, visualization, workflow, data integration, knowledge extraction, crawlers, access methods, data mining, distributed programming models, provenance
Common thread: complexity-hiding interfaces
The DB maxim: push computation to the data (see the sketch below)
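A minimal sketch of the "push computation to the data" maxim, using Python's built-in sqlite3 module; the table, station names, and values are hypothetical, not from any real CMOP or Armbrust Lab schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # hypothetical; a real database would live on disk
conn.execute("CREATE TABLE obs (station TEXT, depth REAL, salinity REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?, ?)",
                 [("jetty", 2.0, 12.5), ("jetty", 8.0, 28.1), ("channel", 3.5, 15.0)])

# Anti-pattern: pull every row to the client, then filter and average there.
rows = conn.execute("SELECT depth, salinity FROM obs").fetchall()
shallow = [s for (d, s) in rows if d < 5.0]
avg_client = sum(shallow) / len(shallow)

# Pushing computation to the data: the engine filters and aggregates,
# so only a single number crosses the interface.
(avg_db,) = conn.execute(
    "SELECT AVG(salinity) FROM obs WHERE depth < 5.0").fetchone()
assert avg_client == avg_db
```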
Example: Relational Databases
At IBM Almaden in the '60s and '70s, Codd worked out a formal basis for tabular data representation, organization, and access [Codd 70].
The early systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!
Now: a $10B market and the de facto standard for data management. SQL is "intergalactic dataspeak."
Key ideas: physical data independence, logical data independence
Medium-Scale Data Management Toolbox
Relational databases
Scientific workflow systems
Science "mashups"
"Dataspace" systems
The "hammer" of data management
[Howe, Freire, Silva, et al. 2008] [Howe, Green-Fishback, Maier 2009] [Howe, Maier, Rayner, Rucker 2008]
Large-Scale Data Management Toolbox
MapReduce: parallel programming using functional programming abstractions (Google); see the sketch below
Dryad: parallel programming via relational algebra, plus type safety, monitoring, debugging (Michael Isard, Microsoft Research)
Amazon S3: RDBMS-like features in the cloud; note: cost effectiveness unclear for large datasets
Howe, Freire, Silva: 2009 NSF CluE Award
Connolly, Gardner: 2009 NSF CluE Award
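A purely illustrative sketch of the MapReduce abstraction in Python, computing average salinity per month; it runs serially here, whereas a real framework shards records across machines and shuffles intermediate (key, value) pairs between phases. The record format is hypothetical.

```python
from collections import defaultdict

records = [("2008-02-01", 28.1), ("2008-02-15", 29.4), ("2008-05-03", 21.7)]

def map_fn(date, salinity):
    yield date[5:7], salinity            # key = month, value = one reading

def reduce_fn(month, values):
    return month, sum(values) / len(values)

groups = defaultdict(list)
for rec in records:
    for key, value in map_fn(*rec):      # map phase
        groups[key].append(value)
# shuffle/group-by-key phase happens implicitly in `groups`
print([reduce_fn(k, vs) for k, vs in groups.items()])   # reduce phase
```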
Current Activities
Consulting: Armbrust Lab (next slide)
Research: MapReduce for Oceanographic Simulations (+ Visualization and Workflow)
Consulting: Armbrust Lab
Initial goal: corral and inventory all relevant data
SOLiD sequencer: potentially 0.5 TB/day of flat files
Metadata: small relational DB + Rails/Django web app (a hypothetical sketch follows)
Data products: visualizations, intermediate results, ad hoc scripts and programs -- key idea: these are data too
Initial goal: amplify programmer effort
Change is constant: no "one size fits all" solution; ad hoc development is the norm
Strategy: teach biologists to "fish" (David Schruth's R course)
Strategy: develop an infrastructure that enables and encourages reuse -- scientific workflow systems
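A hypothetical Django model illustrating the "small relational DB + web app" approach to run metadata; the fields and names are invented for illustration, not the lab's actual schema, and the snippet assumes it lives inside an existing Django app.

```python
from django.db import models

class SequencingRun(models.Model):
    """Metadata for one sequencer run; the flat files themselves stay on disk."""
    sample_name = models.CharField(max_length=100)
    instrument = models.CharField(max_length=50, default="SOLiD")
    run_date = models.DateField()
    output_path = models.CharField(max_length=255)    # where the flat files live
    size_gb = models.FloatField(null=True, blank=True)

    def __str__(self):
        return f"{self.sample_name} ({self.run_date})"
```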
Scientific Workflow Systems
Value proposition: more time on science, less time on code
How: by providing language features emphasizing sharing, reuse, reproducibility, rapid prototyping, and efficiency -- provenance, automatic task parallelism, visual programming, caching, domain-specific toolkits (see the sketch below)
Many examples from the eScience and DB communities: Trident (MSR), Taverna (Manchester), Kepler (UCSD), VisTrails (Utah), and more
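A toy Python sketch (not modeled on any particular workflow system) of two of the features listed above, caching of intermediate results and provenance capture; the step names, file name, and values are hypothetical.

```python
import hashlib, json

CACHE, PROVENANCE = {}, []

def step(func):
    """Wrap a pipeline step: reuse cached results and record what ran."""
    def wrapper(*args):
        key = hashlib.sha1(repr((func.__name__, args)).encode()).hexdigest()
        if key not in CACHE:
            CACHE[key] = func(*args)
            PROVENANCE.append({"step": func.__name__, "inputs": repr(args)})
        return CACHE[key]
    return wrapper

@step
def extract(path):        # stand-in for reading raw instrument output
    return [12.5, 28.1, 30.4]

@step
def summarize(values):    # stand-in for an analysis step
    return sum(values) / len(values)

result = summarize(tuple(extract("run42.raw")))  # re-running reuses the cache
print(result)
print(json.dumps(PROVENANCE, indent=2))
```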
Photo: the Trident Scientific Workflow Workbench for Oceanography, developed by Microsoft Research, demonstrated at Microsoft's TechFest
Screenshots: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
A collaborative visualization history:
Bill (CMOP) computes salt flux using GridFields
Erik (Utah) adds vector streamlines and adjusts opacity
Bill (CMOP) adds an isosurface of salinity
Peter Lawson adds discussion of the scientific interpretation
source: VisTrails (Silva, Freire, Anderson) and GridFields (Howe)
Strategy at Armbrust Lab
1. Develop a benchmark suite of workflow exemplars and use them to evaluate workflow offerings
2. "Let a hundred flowers blossom" -- deploy multiple solutions in practice to assess user uptake
3. "Pay as you go" -- evolve a toolkit rather than attempt a comprehensive, monolithic data management juggernaut
Informed by two of Jim Gray's Laws of Data Engineering: start with "20 queries"; go from "working to working"
NSF Award: Cluster Exploratory (CluE)
Partnership between NSF, IBM, and Google
Data-intensive computing: an "I/O farm" for massive queries, not massive simulations; "in ferro" experiments
Goal: to "cloud-enable" GridFields and VisTrails -- 10+-year climatologies at interactive speeds
Requires turning over up to 25 TB in under 5 seconds (see the back-of-envelope sketch below)
Provenance, reproducibility, visualization: VisTrails connects a rich desktop experience to the cloud query engine
Co-PIs: Claudio Silva and Juliana Freire, University of Utah
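A back-of-envelope check (my numbers, not from the proposal) of why scanning 25 TB in under 5 seconds forces a scale-out, push-computation-to-the-data design:

```python
# Aggregate bandwidth needed to scan 25 TB in 5 seconds.
data_tb = 25
seconds = 5
per_disk_mb_s = 100          # rough sequential read rate of one 2009-era disk

needed_mb_s = data_tb * 1e6 / seconds       # 5,000,000 MB/s = 5 TB/s aggregate
disks = needed_mb_s / per_disk_mb_s         # ~50,000 disks scanning in parallel
print(f"{needed_mb_s:,.0f} MB/s aggregate -> ~{disks:,.0f} disks")
# Conclusion: a brute-force scan is infeasible on a single machine; you need
# massive parallelism plus data reduction (indexing, aggregation, caching)
# running near the data.
```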
Amdahl's Laws
Gene Amdahl (1965): laws for a balanced system
i. Parallelism: max speedup is (S+P)/S, for serial work S and parallelizable work P
ii. One bit of IO/sec per instruction/sec (BW)
iii. One byte of memory per instruction/sec (MEM)
iv. One IO per 50,000 instructions (IO)
Modern multi-core systems move farther away from Amdahl's Laws (Bell, Gray, and Szalay 2006)
For a Blue Gene, BW = 0.001 and MEM = 0.12; for the JHU cluster, BW = 0.5 and MEM = 1.04
source: Alex Szalay, keynote, eScience 2008
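For reference, the standard fractional form of Amdahl's speedup bound (my formulation, equivalent to law i above):

```latex
\[
  \mathrm{Speedup}(N) \;=\; \frac{1}{\,s + \dfrac{1-s}{N}\,},
  \qquad
  \lim_{N \to \infty} \mathrm{Speedup}(N) \;=\; \frac{1}{s},
\]
% where $s$ is the serial fraction of the work and $N$ the number of processors;
% no amount of parallel hardware beats the $1/s$ ceiling.
```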
Climatology figure: average surface salinity (psu) by month (February and May), showing the Columbia River plume off the Washington and Oregon coasts (animation)
Epilogue
We're here to help!
SIG Wiki:
eScience Blog:
eScience website:
eScience requirements are fractal
"The future is already here. It's just not very evenly distributed." -- William Gibson
Diagram: eScience spans high-performance computing, data management, consulting, online collaboration tools, and CS research
It's what you can do with it
Relational database: SQL, plus UDTs and UDFs as needed
FASTA databases: alignments, rarefaction curves, phylogenetic trees, filtering
MapReduce: roll your own
Dryad: relational algebra available; you can still roll your own if needed
A data deluge in all fields
Acquisition eventually outpaces analysis
Astronomy: SDSS, now LSST, PanSTARRS
Biology: PCR, SOLiD sequencing
Oceanography: high-resolution models, cheap sensors
Marine microbiology: flow cytometry
Paradigm progression: empirical, analytical, computational, X-informatics
"Increase Data Collection Exponentially in Less Time, with FlowCAM"
Diagram: eScience research connects high-performance computing, data management, consulting, online collaboration, community building, and technology transfer
Query Languages
Organize and encapsulate access methods
Raise the level of abstraction beyond general-purpose languages (GPLs)
Identify and exploit opportunities for algebraic optimization
What is algebraic optimization? Consider the expression x/z + y/z. Since x/z + y/z = (x + y)/z, the latter form is less expensive: it involves only one division.
Tables -- SQL
XML -- XQuery, XPath
RDF -- SPARQL
Streams -- StreamSQL, CQL
Meshes (e.g., finite element simulations) -- GridFields
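The same idea at the query-language level, in relational algebra (a standard textbook rewrite, not specific to any one system): when the predicate p mentions only attributes of R, the selection can be pushed below the join,

```latex
\[
  \sigma_{p}\,(R \bowtie S) \;=\; \bigl(\sigma_{p}\,R\bigr) \bowtie S ,
\]
% valid whenever $p$ references only attributes of $R$; the filter runs first,
% so far fewer tuples ever reach the (expensive) join.
```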
Example: Relational Databases (In Codd we Trust…)
At IBM Almaden in the '60s and '70s, Codd worked out a formal basis for working with tabular data [1].
The early relational systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!
[1] E. F. Codd, "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM 13(6), pp. 377-387, 1970
The Database Game: do the same thing as Codd, but with new data types: XML (trees), RDF (graphs), streams, DNA sequences, images, arrays, simulation results, etc.
Gray's Laws of Data Engineering
Jim Gray: scientific computing is revolving around data
Need a scale-out solution for analysis
Take the analysis to the data!
Start with "20 queries"
Go from "working to working"
DISSC: Data-Intensive Scalable Scientific Computing
slide source: Alex Szalay, keynote, eScience 2008
Data Management