
1 Addressing the Data Deluge: the Structuring, Sharing, and Preserving of Scientific Experiment Data
Beth Plale, Sangmi Lee, Scott Jensen, Yiming Sun
Computer Science Dept., Indiana University

2 The Data Deluge
Computational science is increasingly data intensive and getting more so. Why?
• More complex computations:
  – Nested model runs
  – Linked models
  – Finer resolution
• More sources of data products:
  – Observational data products
      Streaming continuously from hundreds of sensor and network sources, scaling to thousands
      Large archives
  – Annotations
  – Model configuration parameters
  – Output results
  – Model data
  – Statistical data (e.g., data mining)

3 Problem
Computational scientists are reaching the limit of their ability to manage the data products associated with their investigations.
  – A scientist can touch hundreds to thousands of data products in a single investigation.

4 The Experiment as a Day’s Work
Diagram labels: NetRad radar ingest; Fetch data products; Forecast model execution (20 versions); Convert to format suitable for assimilation; Plan 20 run ensemble; Analyze final files of each run; Request to NetRad radar control system; Assimilate into 3D grid.
Timeline label: 6 hr run followed by 3 hr run followed by 1 hr run …

5 Why not just put up a metadata database and let them come?
• The King’s solution.
• Burdens users (people or programs) with:
  – Knowing where the database is located
  – Knowing the schema of the database
  – Initiating all communication with the database
  – Generating all metadata
  – Knowing precisely how to write the queries
• We can’t afford the King’s solution: we have to be more aggressive if our solution is to be widely used.

6 Who are our users? (psst…scientists)
• Users don’t want to write precise SQL
  – That is, learn the nuances of a relational schema
• Users won’t hand-code metadata
• Scientists don’t want to have to think about hierarchies of files, versions, or replicas. They want to run experiments and do their science.
• Scientists use Google: they know searching can be fast and flexible, far more flexible than
    % find . -name "03052005:1300:25:30.nc" -print

7 myLEAD: an ‘active’ metadata catalog
• If we’re going to have half a chance of being widely used, we have to be the ones who reach three-quarters of the way across the gulf. Our users reach the other quarter:
  – Easy query “writing”
  – Automated metadata generation
  – Transparent structure management
  – Transparent versioning management
  – Expressive query writing



14 Conventional Numerical Weather Prediction
• OBSERVATIONS: Radar Data; Mobile Mesonets; Surface Observations; Upper-Air Balloons; Commercial Aircraft; Geostationary and Polar Orbiting Satellite; Wind Profilers; GPS Satellites
• Analysis/Assimilation: Quality Control; Retrieval of Unobserved Quantities; Creation of Gridded Fields
• Prediction: PCs to Teraflop Systems
• Product Generation, Display, Dissemination
• End Users: NWS; Private Companies; Students
The process is entirely serial and pre-scheduled: no response to weather!

15 The LEAD Vision: No Longer Serial or Static
• OBSERVATIONS: Radar Data; Mobile Mesonets; Surface Observations; Upper-Air Balloons; Commercial Aircraft; Geostationary and Polar Orbiting Satellite; Wind Profilers; GPS Satellites
• Analysis/Assimilation: Quality Control; Retrieval of Unobserved Quantities; Creation of Gridded Fields
• Prediction: PCs to Teraflop Systems
• Product Generation, Display, Dissemination
• End Users: NWS; Private Companies; Students


17 Architecture Part I: Distribution scheme of metadata catalogs
Sites shown: IU; NCSA Illinois; UA Huntsville; Millersville; UCAR Unidata; Okla Univ
• Master catalog
• Satellite catalogs at each of 5 sites
• Each satellite replicates its contents to the master catalog
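
The replication scheme on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the myLEAD implementation: the class names `Catalog` and `SatelliteCatalog`, the `register` method, and the push-on-every-write policy are hypothetical. The slide says only that each satellite replicates its contents to the master.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Catalog:
    """A metadata catalog; records are keyed by (site, product id)."""
    name: str
    records: Dict[Tuple[str, str], dict] = field(default_factory=dict)

    def store(self, site: str, product_id: str, metadata: dict) -> None:
        # Copy the metadata so later changes by the caller do not leak in.
        self.records[(site, product_id)] = dict(metadata)


@dataclass
class SatelliteCatalog(Catalog):
    """A per-site catalog that forwards every write to a master (assumed policy)."""
    master: Optional[Catalog] = None

    def register(self, product_id: str, metadata: dict) -> None:
        # Write locally first, then replicate the same record to the master.
        self.store(self.name, product_id, metadata)
        if self.master is not None:
            self.master.store(self.name, product_id, metadata)


master = Catalog("master")
huntsville = SatelliteCatalog("UA Huntsville", master=master)
millersville = SatelliteCatalog("Millersville", master=master)

# The same file name at two sites stays distinct because the key includes the site.
huntsville.register("001.nc", {"type": "WRF output"})
millersville.register("001.nc", {"type": "GOES"})

print(sorted(master.records))
```

Keying master records by site as well as product id is a design choice made for this sketch so that identically named products from different satellites do not overwrite each other.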

18 Architecture Part II: single catalog

19 Providing higher level functionality: Structure, sharing, preservation, querying

20 Axes of Functionality
• Preservation: temporary data product → versioning through time
• Sharing (increasing levels of access): does not know existence → Depth 2: searchable → Depth 3: browsable
• Structure (increasing levels of transparency): flat structure

21 Higher-level functionality: transparent structure
• Structure: creating structure in the metadata catalog that is transparent to the user, based on knowledge of control flow
  – Why? We want to hide the structure so users don’t need to learn it and abide by it, but
  – Structure gives the user more attributes to query on

22 Capturing process in the structure
Diagram labels:
• Workspaces: Bob’s workspace (Dec 04); Bob’s workspace (Feb 05); Bob’s workspace (Mar 05)
• Workspace contents: Hurricane Ivan; SE OK quadrant; Vortice study 98-00; Experim-Dec04; Experim-Feb05; Input data sets; Workflow templates; WRF output; WRF output files; Published results; 001.nc...; 150.nc
• Metadata Catalog: Table of collection; Table of file; Table of User
• Physical data storage: ftp://storageserver.org/file1998o768

23 Example Query: contains structure, but only vaguely
LeadQuery:
    SELECT TARGET = collection
    WHERE collection.date = “February 20, 2005”
    WITHIN experiment.name = “mytest1”
    and CONTAINS (file.type = “GOES” or file.type = “Eta”)
    and file.geoProperty = “precipitation”
    RECURSIVE
ResultSet: TARGET_ONLY
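
One way to read the query above is as a recursive filter over a tree of collections. The sketch below evaluates that reading against in-memory data. The semantics are assumptions, since the slide does not define them: WITHIN selects the experiment by name, RECURSIVE walks nested collections, CONTAINS is satisfied by any single file at or below the collection that matches both file conditions, and TARGET_ONLY returns just the matching collections. The names `Collection`, `File`, `walk`, and `lead_query` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Set


@dataclass
class File:
    name: str
    type: str
    geo_property: str


@dataclass
class Collection:
    name: str
    date: str
    files: List[File] = field(default_factory=list)
    children: List["Collection"] = field(default_factory=list)


def walk(coll: Collection) -> Iterator[Collection]:
    """Yield a collection followed by every collection nested beneath it."""
    yield coll
    for child in coll.children:
        yield from walk(child)


def lead_query(experiments: Iterable[Collection], experiment_name: str, date: str,
               file_types: Set[str], geo_property: str) -> List[str]:
    hits = []
    for exp in experiments:
        if exp.name != experiment_name:          # WITHIN experiment.name = ...
            continue
        for coll in walk(exp):                   # RECURSIVE
            if coll.date != date:                # WHERE collection.date = ...
                continue
            # CONTAINS: some file at or below this collection matches both conditions.
            if any(f.type in file_types and f.geo_property == geo_property
                   for sub in walk(coll) for f in sub.files):
                hits.append(coll.name)           # TARGET_ONLY: collection only
    return hits


mytest1 = Collection("mytest1", "February 19, 2005", children=[
    Collection("inputs", "February 20, 2005",
               files=[File("a.nc", "GOES", "precipitation")]),
    Collection("outputs", "February 20, 2005",
               files=[File("b.nc", "WRF", "precipitation")]),
    Collection("older", "February 18, 2005",
               files=[File("c.nc", "Eta", "precipitation")]),
])
other = Collection("mytest2", "February 20, 2005",
                   files=[File("d.nc", "GOES", "precipitation")])

result = lead_query([mytest1, other], "mytest1", "February 20, 2005",
                    {"GOES", "Eta"}, "precipitation")
print(result)
```

The user names an experiment and a few attributes; the traversal of the hierarchy stays inside the query engine, which is the sense in which the query contains structure only vaguely.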

24 Creating structure in database that mirrors structure of experiment workflow
Diagram labels:
• Components: myLEAD agent; myLEAD server; Decoder service; Notif service
• Messages: Product requests; Product registers; Notification msgs
• Workflow: Gather data products; Run 12 hour forecast (6 hrs to complete); Analyze results; Based on analysis, gather other products; Analyze results; Run 6 hr forecast (3 hrs to complete)
• Duration label: 12 hrs
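
The idea of an agent that builds catalog structure from workflow notifications can be sketched as follows. Everything concrete here is an assumption made for illustration: the message kinds `step_started` and `product_registered`, the message fields, the product names, and the `WorkflowMirror` class are hypothetical, and the actual myLEAD agent protocol is not described on the slide.

```python
from typing import List, Tuple


class WorkflowMirror:
    """Files each registered product under the workflow step that was active."""

    def __init__(self, experiment: str) -> None:
        self.experiment = experiment
        # One (step name, products) entry per step, in the order steps started.
        self.steps: List[Tuple[str, List[str]]] = []

    def on_notification(self, msg: dict) -> None:
        kind = msg["kind"]
        if kind == "step_started":
            # A new step opens a new, initially empty collection.
            self.steps.append((msg["step"], []))
        elif kind == "product_registered":
            if not self.steps:
                raise ValueError("product registered before any step started")
            # The product is attached to the most recently started step.
            self.steps[-1][1].append(msg["product"])
        # Any other message kind is ignored in this sketch.


mirror = WorkflowMirror("mytest1")
for msg in [
    {"kind": "step_started", "step": "Gather data products"},
    {"kind": "product_registered", "product": "radar-001.nc"},
    {"kind": "step_started", "step": "Run 12 hour forecast"},
    {"kind": "product_registered", "product": "wrf-out-001.nc"},
    {"kind": "step_started", "step": "Analyze results"},
]:
    mirror.on_notification(msg)

for step, products in mirror.steps:
    print(step, products)
```

Because steps are kept as an ordered list rather than a dictionary, a step name that occurs twice in a workflow (such as "Analyze results") gets two separate collections.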

25 Higher-level functionality: sharing
• Depth-0: participant (P) is unaware that experiment data (E) owned by user (U) exists
• Depth-1: P is aware that E exists
• Depth-2: P can search E
• Depth-3: P can browse the content of E
• Depth-4: P can access E and its contents
• Depth-5: P can remove and write E
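
The six sharing depths map naturally onto an ordered enumeration. The sketch below assumes the depths are cumulative (a participant granted Depth-3 can also do everything at Depth-0 through Depth-2), which the "increasing levels of access" axis suggests but the slides do not state outright. The `Depth` and `SharingPolicy` names and the participant and experiment names in the usage lines are hypothetical.

```python
from enum import IntEnum
from typing import Dict, Tuple


class Depth(IntEnum):
    UNAWARE = 0   # P does not know E exists
    AWARE = 1     # P knows E exists
    SEARCH = 2    # P can search E
    BROWSE = 3    # P can browse the content of E
    ACCESS = 4    # P can access E and its contents
    WRITE = 5     # P can remove and write E


class SharingPolicy:
    """Tracks the depth granted to each (participant, experiment) pair."""

    def __init__(self) -> None:
        self._grants: Dict[Tuple[str, str], Depth] = {}

    def grant(self, participant: str, experiment: str, depth: Depth) -> None:
        self._grants[(participant, experiment)] = depth

    def depth(self, participant: str, experiment: str) -> Depth:
        # With no explicit grant, the participant is unaware of the experiment.
        return self._grants.get((participant, experiment), Depth.UNAWARE)

    def allows(self, participant: str, experiment: str, needed: Depth) -> bool:
        # Cumulative model: a higher depth implies every lower one.
        return self.depth(participant, experiment) >= needed


policy = SharingPolicy()
policy.grant("alice", "Experim-Dec04", Depth.BROWSE)

print(policy.allows("alice", "Experim-Dec04", Depth.SEARCH))
print(policy.allows("alice", "Experim-Dec04", Depth.ACCESS))
print(policy.depth("carol", "Experim-Dec04").name)
```

Defaulting to `UNAWARE` keeps experiment data private until its owner shares it explicitly.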


27 Experimental evaluation

28 Experiment environment
• myLEAD client: dual-processor Dell PowerEdge 6400 Xeon server (700 MHz Pentium III), 2 GB RAM, 100 GB RAID 5, RedHat 7.2, JDK 1.4.2
• myLEAD server: dual-processor 2.0 GHz Opterons, 16 GB RAM, Gentoo Linux, OGSA-DAI 3.0, Globus MCS 3.1, MySQL 5.0
• LAN: 1 Gbps switched Ethernet

29 Workload used in experimental evaluation
Characterizing “simple” and “hard”

Create                Simple     Hard
Objects created       1-11       203-500
Attributes created    2-5512-1012
Depth of “tree”       1-3        7-9

Query                 Simple     Hard
Tables joined         11-13      36-42
Number attributes     0-2        10
Size of result set    2K         0.4-0.6M

30 Response time for querying a single object having an increasing


34 Related Work
• myGrid
  – Intelligent Systems for Molecular Biology 2003
• mySpace
  – UK e-Science All Hands Meeting 2003
• NEESgrid metadata catalog
  – NEESgrid technical report 2004
• Roma personal metadata service
  – Mobile Networks and Applications 2002
• Presto Document System
  – User Interface Software and Technology 1999
• Semantic File Systems
  – SOSP 1991


36 The end

37 Seeds of solution in Internet?
• The Internet has proven the utility of a user-oriented view toward information space management
  – Search, tag: browser, bookmarks
  – Publish: blogs, web page tools
• But the web is not completely appropriate. The web is
  – Single-writer, multiple-reader, and
  – Search-and-download.
• Apply the concept of a user-oriented view to managing the data space
• Want the ability to work locally.
  – myLEAD: a tool to help an investigator make sense of, and operate in, the vast information space that is computational science (e.g., mesoscale meteorology).

