Addressing the Data Deluge: the Structuring, Sharing, and Preserving of Scientific Experiment Data Beth Plale Sangmi Lee Scott Jensen Yiming Sun Computer.


Addressing the Data Deluge: the Structuring, Sharing, and Preserving of Scientific Experiment Data Beth Plale Sangmi Lee Scott Jensen Yiming Sun Computer Science Dept. Indiana University

The Data Deluge
Computational science is increasingly data-intensive, and becoming more so. Why?
• More complex computations:
  – Nested model runs
  – Linked models
  – Finer resolution
• More sources of data products:
  – Observational data products: streaming continuously from hundreds of sensor and network sources, scaling to thousands; large archives
  – Annotations
  – Model configuration parameters
  – Output results
  – Model data
  – Statistical data (e.g., data mining)

Problem
Computational scientists are reaching the limit of their ability to manage the data products associated with their investigations.
  – A scientist can touch hundreds to thousands of data products in a single investigation.

The Experiment as a Day's Work
[Workflow diagram. Labels: NetRad Radar ingest; Fetch Data products; Convert to format suitable for assim; Assimilate into 3D grid; Plan 20 Run ensemble; Forecast Model Execution (20 versions); Analyze Final Files of Each run; Request to NetRad radar control system. Timing note: 6 hr run followed by 3 hr run followed by 1 hr run …]

Why not just put up a metadata database and let them come?
• The King's solution.
• Burdens users (people or programs) with:
  – Knowing where the database is located
  – Knowing the schema of the database
  – Initiating all communication with the database
  – Generating all metadata
  – Knowing precisely how to write the queries
• We can't afford the King's solution - we have to be more aggressive if our solution is to be widely used.

Who are our users? (psst…scientists)
• Users don't want to write precise SQL
  – That is, learn the nuances of a relational schema
• Users won't hand-code metadata
• Scientists don't want to have to think about hierarchies of files, versions, or replicas. They want to run experiments and do their science.
• Scientists use Google - they know searching can be fast and flexible - far more flexible than
  % find. -n “ :1300:25:30.nc” -print

myLEAD: an 'active' metadata catalog
• If we are to have half a chance of being widely used, we must be the ones who reach 3/4 of the way across the gulf. Our users reach the other 1/4:
  – Easy query "writing"
  – Automated metadata generation
  – Transparent structure management
  – Transparent versioning management
  – Expressive query writing

Conventional Numerical Weather Prediction
[Pipeline diagram, built up stage by stage over six slides. Stages in order:]
• OBSERVATIONS: Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites
• Analysis/Assimilation: Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields
• Prediction: PCs to Teraflop Systems
• Product Generation, Display, Dissemination
• End Users: NWS, Private Companies, Students
The process is entirely serial and pre-scheduled: no response to weather!

The LEAD Vision: No Longer Serial or Static
[Diagram with the same stages as the conventional pipeline:]
• OBSERVATIONS: Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites
• Analysis/Assimilation: Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields
• Prediction: PCs to Teraflop Systems
• Product Generation, Display, Dissemination
• End Users: NWS, Private Companies, Students

Architecture Part I: Distribution scheme of metadata catalogs
[Diagram. Sites shown: IU, NCSA Illinois, UA Huntsville, Millersville, UCAR Unidata, Okla Univ]
• Master catalog
• Satellite catalogs at each of 5 sites
• Each satellite replicates its contents to the master catalog
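The satellite-to-master replication described on this slide can be illustrated with a small sketch. It is only a toy model under stated assumptions: the class names `Catalog` and `SatelliteCatalog`, the `pending` set, and the site-prefixed keys are hypothetical and are not taken from myLEAD, whose actual replication mechanism the slides do not describe.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy metadata catalog: maps a product ID to its metadata record."""
    name: str
    records: dict = field(default_factory=dict)


@dataclass
class SatelliteCatalog(Catalog):
    """Site-local catalog that tracks records not yet sent to the master."""
    pending: set = field(default_factory=set)

    def register(self, product_id, metadata):
        # Store locally and mark the record as awaiting replication.
        self.records[product_id] = metadata
        self.pending.add(product_id)

    def replicate_to(self, master):
        """Push unreplicated records to the master; return how many were sent."""
        sent = 0
        for product_id in sorted(self.pending):
            # Assumption: prefix with the site name so IDs from different
            # sites cannot collide in the master catalog.
            master.records[f"{self.name}/{product_id}"] = self.records[product_id]
            sent += 1
        self.pending.clear()
        return sent


master = Catalog("master")
iu = SatelliteCatalog("IU")
iu.register("file1", {"type": "GOES"})
first_pass = iu.replicate_to(master)
second_pass = iu.replicate_to(master)
```

In this sketch a second replication pass sends nothing, because the satellite only pushes records registered since the last pass.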

Architecture Part II: single catalog

Providing higher level functionality: Structure, sharing, preservation, querying

Axes of Functionality
[Diagram with three axes: Structure, Sharing, Preservation. Labels on the diagram: "Does not know existence", "Depth 2: searchable", "Depth 3: browsable"; "Flat structure"; "Temporary data product", "Versioning through time"; "Increasing levels of access"; "Increasing levels of transparency".]

Higher-level functionality: transparent structure
• Structure: creating structure in the metadata catalog transparently to the user, based on knowledge of control flow
  – Why? We want to hide the structure so users don't need to learn it and abide by it, but
  – Structure gives the user more attributes to query on

Capturing process in the structure
[Diagram. Three snapshots of a user's workspace: Bob's workspace (Dec 04), Bob's workspace (Feb 05), Bob's workspace (Mar 05). Each shows the collections Hurricane Ivan, SE OK quadrant, and Vortice study. Other labels: Input data sets; WRF output; WRF output files; Workflow templates; Published results; Experim-Dec04; Experim-Feb05; 150.nc. Metadata Catalog: Table of collection, Table of file, Table of User. Physical data storage: ftp://storageserver.org/file1998o768]

Example Query: contains structure, but only vaguely

LeadQuery:
    SELECT TARGET = collection
    WHERE collection.date = "February 20, 2005"
    WITHIN experiment.name = "mytest1"
    and CONTAINS (file.type = "GOES" or file.type = "Eta")
    and file.geoProperty = "precipitation"
    RECURSIVE
ResultSet: TARGET_ONLY
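To make the shape of this query concrete, here is a rough sketch of how a query like it might be evaluated over an in-memory collection tree. The semantics are an assumed reading of the slide, not the myLEAD definition: WITHIN is taken to mean "nested under the named experiment", and CONTAINS with RECURSIVE is taken to mean "has a matching file at any depth". The names `File`, `Collection`, and `lead_query` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class File:
    type: str
    geo_property: str


@dataclass
class Collection:
    name: str
    date: str = ""
    files: list = field(default_factory=list)
    children: list = field(default_factory=list)


def all_files(coll):
    """Yield files in coll and, recursively, in its sub-collections."""
    yield from coll.files
    for child in coll.children:
        yield from all_files(child)


def descendants(coll):
    """Yield every collection nested below coll, at any depth."""
    for child in coll.children:
        yield child
        yield from descendants(child)


def lead_query(experiment, date, file_types, geo_property):
    """Return names of collections within `experiment` that have the given
    date and contain, at any depth, a file of one of the given types with
    the given geo property."""
    return [
        c.name
        for c in descendants(experiment)
        if c.date == date
        and any(
            f.type in file_types and f.geo_property == geo_property
            for f in all_files(c)
        )
    ]


inputs = Collection("inputs", "February 20, 2005",
                    files=[File("GOES", "precipitation")])
outputs = Collection("outputs", "February 20, 2005",
                     files=[File("WRF", "wind")])
experiment = Collection("mytest1", children=[inputs, outputs])
matches = lead_query(experiment, "February 20, 2005",
                     {"GOES", "Eta"}, "precipitation")
```

The user names an experiment and a few attributes but never spells out the path from experiment to file, which is the sense in which the query "contains structure, but only vaguely".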

Creating structure in the database that mirrors the structure of the experiment workflow
[Diagram. Components: myLEAD agent, myLEAD server, Decoder service, Notif service. Message types: product requests, product registers, notification msgs. Workflow steps, spanning 12 hrs: Gather data products; Run 12 hour forecast (6 hrs to complete); Analyze results; Based on analysis, gather other products; Run 6 hr forecast (3 hrs to complete); Analyze results.]
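One way to picture an agent that mirrors workflow structure in the catalog is a listener that turns notification messages into a nested collection tree. This is a minimal sketch under assumptions: the message format (`kind`, `step`, `product`) and the class name `WorkflowMirror` are invented for illustration, and the real myLEAD agent and notification service interfaces are not shown in the slides.

```python
class WorkflowMirror:
    """Builds a collection tree from workflow notification messages."""

    def __init__(self, experiment_name):
        self.root = {"name": experiment_name, "children": [], "products": []}
        # Stack of currently open steps; the top is where products are filed.
        self._stack = [self.root]

    def handle(self, message):
        kind = message["kind"]
        current = self._stack[-1]
        if kind == "step_started":
            # A new workflow step becomes a new sub-collection.
            node = {"name": message["step"], "children": [], "products": []}
            current["children"].append(node)
            self._stack.append(node)
        elif kind == "product_registered":
            # Products are filed under whichever step is currently running.
            current["products"].append(message["product"])
        elif kind == "step_finished":
            if len(self._stack) > 1:
                self._stack.pop()
        else:
            raise ValueError(f"unknown message kind: {kind}")


mirror = WorkflowMirror("mytest1")
for msg in [
    {"kind": "step_started", "step": "Gather data products"},
    {"kind": "product_registered", "product": "radar.nc"},
    {"kind": "step_finished"},
    {"kind": "step_started", "step": "Run 12 hour forecast"},
    {"kind": "product_registered", "product": "forecast.nc"},
    {"kind": "step_finished"},
]:
    mirror.handle(msg)
step_names = [child["name"] for child in mirror.root["children"]]
```

The user never creates a collection by hand here; the structure falls out of the control flow, which is the transparency the slide is after.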

Higher-level functionality: sharing
• Depth-0: participant (P) is unaware that experiment data (E) owned by user (U) exists
• Depth-1: P is aware that E exists
• Depth-2: P can search E
• Depth-3: P can browse the content of E
• Depth-4: P can access E and its contents
• Depth-5: P can remove and write E
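The six sharing depths map naturally onto an ordered enumeration. The sketch below assumes the depths are cumulative, so that a participant granted a given depth also holds every lower one; the slides suggest this ("increasing levels of access") but do not state it outright. The names `SharingDepth` and `is_allowed` are hypothetical.

```python
from enum import IntEnum


class SharingDepth(IntEnum):
    """Sharing depths from the slide, ordered from least to most access."""
    UNAWARE = 0   # P is unaware that E exists
    AWARE = 1     # P is aware that E exists
    SEARCH = 2    # P can search E
    BROWSE = 3    # P can browse the content of E
    ACCESS = 4    # P can access E and its contents
    MODIFY = 5    # P can remove and write E


def is_allowed(granted, required):
    """Assumed cumulative model: a grant covers every depth at or below it."""
    return granted >= required


can_search = is_allowed(SharingDepth.BROWSE, SharingDepth.SEARCH)
can_modify = is_allowed(SharingDepth.BROWSE, SharingDepth.MODIFY)
```

With a depth-3 grant, searching is permitted and modification is not.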

Experimental evaluation

Experiment environment
• myLEAD client: dual-processor Dell PowerEdge 6400 Xeon server (700 MHz Pentium III), 2 GB RAM, 100 GB RAID 5, Red Hat 7.2, JDK
• myLEAD server: dual-processor 2.0 GHz Opterons, 16 GB RAM, Gentoo Linux, OGSA-DAI 3.0, Globus MCS 3.1, MySQL 5.0
• LAN: 1 Gbps switched Ethernet

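Workload used in experimental evaluation
Characterizing "simple" and "hard"
[Two tables, each with columns Simple and Hard; cell values did not survive in the transcript apart from the fragment "2K M".]
• Create: Objects created; Attributes created; Depth of "tree"
• Query: Tables joined; Number attributes; Size of result set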

Response time for querying a single object having an increasing

Related Work
• myGrid
  – Intelligent Systems for Molecular Biology 2003
• mySpace
  – UK e-Science All Hands Meeting 2003
• NEESgrid metadata catalog
  – NEESgrid technical report 2004
• Roma personal metadata service
  – Mobile Networks and Applications 2002
• Presto Document System
  – User Interface Software and Technology 1999
• Semantic File Systems
  – SOSP 1991

The end

Seeds of solution in Internet?
• The Internet has proven the utility of a user-oriented view of information space management
  – Search, tag: browser, bookmarks
  – Publish: blogs, web page tools
• But the web is not completely appropriate. The web is
  – Single-writer, multiple-reader, and
  – Search-and-download.
• Apply the concept of a user-oriented view to managing the data space
• Want the ability to work locally.
  – myLEAD: a tool to help an investigator make sense of, and operate in, the vast information space that is computational science (e.g., mesoscale meteorology).