Archiving derived and temporally changing geospatial data in LEAD Beth Plale Department of Computer Science School of Informatics Indiana University.

Presentation transcript:

LEAD (Linked Environments for Atmospheric Discovery): dynamic, adaptive forecasting of mesoscale severe storms
-- GGF leveraged: service-oriented architecture, moving to WSRF, WS-Notification, service registry, Globus RLS, OGSA-DAI
-- Beth Plale, IU: data subsystem architecture, myLEAD personal information space, “VO” catalog
-- Dennis Gannon, IU: workflow (GBPEL), portal/science gateway, TeraGrid, XSUL, notification
-- Oklahoma Univ: mesoscale meteorology
-- Unidata: IDD, LDM
-- NCSA: brokering
-- UNC (Reed): monitoring
-- UAH: data mining atmospheric data
-- Millersville, Howard University and UG educ
-- NSF ATM

LEAD Data Subsystem Architecture
Layers shown: access interfaces, access services, resource services, resources
Components shown:
-- Personal workspace browser
-- Geospatial Query GUI
-- Ask ontology
-- Viz client (IDV)
-- Resource Catalog: VO data and compute resources
-- myLEAD: user information space
-- Noesis Ontology: concepts and vocabulary
-- Query Service: query mediation
-- THREDDS Catalogs: web browser metadata
-- Name Service: single global naming system
-- Automated metadata generation: a capability
-- Stream Service: from LDM to user’s app
-- Steerable instruments: CASA
-- Grid storage repository
-- Unidata data dissem client (LDM)
-- OPeNDAP data server

Petascale data collections increasingly crucial to research and education in science and engineering
Current influential technology factors:
-- Powerful and affordable sensors, processors, instruments, automated equipment
-- Reductions in storage costs make it cost-effective to maintain large data collections
-- Existence of the Internet makes it easier to share data
As a result, researchers increasingly conduct research using data originally generated by others: genomics, climate modeling, demographic studies.

Magnitude and breadth of proliferation of data generation in US
-- The same technological advances that produced inexpensive digital cameras have enabled a new generation of high-resolution scientific instruments and sensors
-- An increasing amount of valuable content is “born digital” and can only be managed, preserved, and used in digital form
-- Advances in biomedical research depend on building and preserving complex genomic databases
-- Research in biodiversity and ecosystems, global climate change, meteorology, space science depends on the ability to combine vast quantities of digital information with complex models and analytical tools

Problem Domain: storage, retrieval, access to petascale data collections in science and engineering
-- Digital data collections* are the foundation for analysis using automated analytical tools
-- Long-lived data undergoes constant re-analysis for improved algorithms or with alternate use in mind
-- Analysis depends not just on sensed or computer-generated data but on the metadata that characterizes the environment and the sensing instrument
*Data: text, numbers, images, video or movie clips, audio, software, algorithms, equations, models, simulations
*Digital data collections: the data itself, plus the infrastructure and organizations needed to preserve access to the data

Petascale data sets require new work style
-- Analysis tools growing more complex
-- Many analysis algorithms are super-linear, often needing N^2 or N^3 time to process N data points
-- I/O bandwidth has not kept pace with storage capacity: capacity increased 100-fold while storage bandwidth increased 10-fold
-- Too many files (> 1 million) for a local file system to manage
-- File name and directory hierarchy not enough
-- Can’t download dataset to laptop and process, analyze, visualize
-- Move end-user’s program to the data, only communicate questions and answers
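The scaling figures on this slide can be checked with a little arithmetic. The sketch below is illustrative only: the 100-fold and 10-fold growth factors are from the slide, while the baseline capacity and bandwidth are hypothetical values chosen to keep the numbers round.

```python
# Sketch: why capacity growing faster than bandwidth hurts full-collection scans.
# The 100x and 10x growth factors come from the slide; the baseline capacity
# and bandwidth below are hypothetical.

def scan_time_seconds(capacity_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to read an entire store once at a sustained bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


def relative_cost(n_growth: float, exponent: int) -> float:
    """Factor by which O(N^exponent) work grows when N grows n_growth-fold."""
    return n_growth ** exponent


base_capacity = 1e12    # hypothetical baseline: 1 TB
base_bandwidth = 100e6  # hypothetical baseline: 100 MB/s

before = scan_time_seconds(base_capacity, base_bandwidth)
after = scan_time_seconds(base_capacity * 100, base_bandwidth * 10)

print(f"full scan before: {before:.0f} s, after: {after:.0f} s")
print(f"scan time grows {after / before:.0f}x")
for k in (1, 2, 3):
    print(f"O(N^{k}) work grows {relative_cost(100, k):.0f}x when N grows 100x")
```

Whatever baseline is chosen, the scan-time ratio is 100/10, which is the motivation for moving the program to the data rather than the data to the program.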

Problem statement: The technologies, strategies, methodologies, and resources needed to manage digital information have not kept pace with innovations in the creation and capture of digital information. Current approaches do not scale to petascale data collections.

Typical analysis for mesoscale meteorologists: compare model results to observational data

Research Domain: Archiving derived data products and temporally changing data products
-- Archiving: saving “born-digital” content for future use and reuse
-- Derived data products: data products that are the result of further processing of original raw data
-- Temporally changing data products: data that is continuously changing through regular additions streamed into the archive
-- Ad hoc actions taken by content creators, or in conjunction with workflow processes
-- Approach: general data models, standardized metadata schemas, standard, highly modular system-level architecture (grid computing), well-accepted communication protocol

Our current research challenges are in:
-- Repository architecture
-- Define technical architecture
-- Build tools to acquire, use, store data
-- Predict repository use for provisioning physical infrastructure
-- Representation of temporal and procedural relationships
-- Provenance
-- Automated metadata generation
-- Snapshots of temporally changing data products

User access to personal workspace is through LEAD portal

Early interface for sharing data

Creating structure in user’s archive that models their investigation steps
[Figure: workflow and myLEAD agent exchange product requests, product registers, and notification msgs with the myLEAD server; decoder service and notif service also shown. Timeline label: 12 hrs.]
Workflow steps shown:
-- Gather data products
-- Run 12 hour forecast (6 hrs to complete)
-- Analyze results
-- Based on analysis, gather other products
-- Analyze results
-- Run 6 hr forecast (3 hrs to complete)

Capturing process in the structure
[Figure: Bob’s workspace at three points in time (Dec 04, Feb 05, Mar 05). Collections shown: Hurricane Ivan, SE OK quadrant, Vortice study, Input data sets, WRF output, Workflow templates; file shown: 150.nc. Metadata catalog: table of collection, table of file, table of user, with entries Experim-Dec04 and Experim-Feb05. Physical data storage: WRF output files, published results, e.g. ftp://storageserver.org/file1998o768]
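A minimal sketch of how a workspace hierarchy like the one on this slide might be modeled. The class names, the nesting order of the collections, and the pairing of 150.nc with the ftp URL are assumptions made for illustration; the slide only shows that collections, files, and users are catalogued separately from physical storage.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class FileEntry:
    """Logical file plus a pointer to where the bytes physically live."""
    name: str
    physical_url: str


@dataclass
class Collection:
    """A named node in the workspace hierarchy (hypothetical model)."""
    name: str
    children: List["Collection"] = field(default_factory=list)
    files: List[FileEntry] = field(default_factory=list)

    def add(self, child: "Collection") -> "Collection":
        """Attach a sub-collection and return it so calls can be chained."""
        self.children.append(child)
        return child

    def walk(self, prefix: str = "") -> Iterator[Tuple[str, FileEntry]]:
        """Yield (logical path, file) pairs depth-first."""
        path = f"{prefix}/{self.name}"
        for entry in self.files:
            yield path, entry
        for child in self.children:
            yield from child.walk(path)


@dataclass
class Workspace:
    owner: str
    roots: List[Collection] = field(default_factory=list)


# Assumed nesting: experiment > event > region > study > product category.
ws = Workspace(owner="Bob")
experiment = Collection("Experim-Dec04")
ws.roots.append(experiment)
study = (
    experiment.add(Collection("Hurricane Ivan"))
    .add(Collection("SE OK quadrant"))
    .add(Collection("Vortice study"))
)
inputs = study.add(Collection("Input data sets"))
study.add(Collection("WRF output"))
inputs.files.append(FileEntry("150.nc", "ftp://storageserver.org/file1998o768"))

for root in ws.roots:
    for path, entry in root.walk():
        print(path, entry.name, "->", entry.physical_url)
```

Keeping the logical path separate from the physical URL is what lets the same stored file appear under different investigations without being copied.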

Archiving derived and temporally changing data products
[Figure: personal archive catalog; runs on TeraGrid HPC machines; runs on TeraGrid storage servers; users Deepti, Greg, Carolyn. Labels: 4 < reads < < writes < 100]

Archiving derived and temporally changing data products
Challenge: criteria for determining the number of versions necessary to preserve a meaningful sense of an object’s evolution over time.

Estimate size of LEAD’s personal archive repository (for provisioning)
-- Canonical workflow: single 12 hr forecast (10%)
-- Educational workflow: simple analysis (50%)
-- Ensemble workflow: multi-forecast run (5%)
-- Data access workload: “retrieve all data products for Katrina and store to my personal repository” (35%)
-- done in advance of any real users
-- estimated number of users: 500

Estimating file usage distribution: based on arrival rates of LEAD observational data sources

Estimated resource needs of archival repository for 500 active users
-- Total sustained read/write bandwidth = Mbps
-- Storage needs = 21.2 TB
-- I/O rate = 1,667.6 files read/written per min
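The aggregate estimates can be broken down per user with simple division. This is a sketch of that arithmetic only: the user count, storage total, I/O rate, and workload percentages are taken from the slides, while the decimal unit conversion (1 TB = 1000 GB) and the variable names are assumptions.

```python
# Figures from the slides.
USERS = 500
STORAGE_TB = 21.2
FILES_PER_MIN = 1667.6
WORKLOAD_MIX = {
    "canonical": 0.10,    # single 12 hr forecast
    "educational": 0.50,  # simple analysis
    "ensemble": 0.05,     # multi-forecast run
    "data access": 0.35,  # bulk retrieval into personal repository
}

# The four workload shares should account for the whole workload.
assert abs(sum(WORKLOAD_MIX.values()) - 1.0) < 1e-9

# Assumption: decimal units, 1 TB = 1000 GB.
storage_gb_per_user = STORAGE_TB * 1000 / USERS
files_per_user_per_min = FILES_PER_MIN / USERS
files_per_sec = FILES_PER_MIN / 60

print(f"storage per user: {storage_gb_per_user:.1f} GB")
print(f"file operations per user: {files_per_user_per_min:.2f} per min")
print(f"aggregate file operations: {files_per_sec:.1f} per sec")
```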

Empirical validation of a hypothesis often involves gathering information into a mental model. How can an archiving system help?
-- Mental models: ideas, thoughts, concepts, opinions, theories, frames, schema, viewpoints, perspectives, values, beliefs…
-- Result models: diagrams, maps, illustrations, visual metaphors, pictures, graphs, matrices, schematics, icons, cartoons…
-- When sufficient information is gathered, the scientist synthesizes it into knowledge that allows acceptance or rejection of the hypothesis
-- Archiving system can assemble info for synthesis into knowledge

Forecast workflow example
Steps:
-- select geospatial region over which forecast is to be run
-- use as parameter to model (ARPS, WRF)
-- model generates products
-- products visualized

Tracking investigation progress
myLEAD offloads the mundane work of gathering, storing, and tracking data products used during an experimental investigation. These products provide keys to construction of the mental “results model.”
[Figure labels: myLEAD service, db]

Constructing a ‘result object’
-- Result object: collection of key materials assembled during workflow execution deemed important to decision making
-- Selected derived data objects are added to the result object
-- Determining what is important and what is not is a research challenge
Simple example. Suppose:
1. Geospatial region selected as input to forecast model
2. Based on user’s role in evacuation decisions, then
3. System adds link to result object to display road maps and population density maps based on the geospatial region

When the forecast model completes and the user visually examines model results, the LEAD data subsystem simultaneously pops up maps of population density and transportation network over that area. Shaves minutes off critical decision-making.
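The evacuation example can be sketched as a single rule applied to a result object. Everything named here (ResultObject, apply_evacuation_rule, the role string, the bounding-box tuple, the sample coordinates) is hypothetical; the slides describe the behaviour, not an interface.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed region encoding: (west, south, east, north) in degrees.
BoundingBox = Tuple[float, float, float, float]


@dataclass
class ResultObject:
    """Key materials gathered during a workflow run (hypothetical model)."""
    items: List[str] = field(default_factory=list)
    links: List[str] = field(default_factory=list)


def apply_evacuation_rule(
    result: ResultObject, region: BoundingBox, user_role: str
) -> ResultObject:
    """Add map links for the forecast region when the user makes evacuation decisions."""
    if user_role == "evacuation decision maker":
        for layer in ("road map", "population density map"):
            result.links.append(f"{layer} for region {region}")
    return result


# Illustrative region only; the coordinates are made up for the example.
region: BoundingBox = (-98.0, 33.5, -94.5, 35.5)

planner = apply_evacuation_rule(ResultObject(), region, "evacuation decision maker")
student = apply_evacuation_rule(ResultObject(), region, "student")

print(planner.links)
print(student.links)
```

The rule fires when the region is chosen, so the links are already in place by the time the forecast finishes and the user opens the results.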

Key metrics used in experimental evaluation
-- Query response time: elapsed time between when the client issues a request and when it receives the response
-- Scalability: gradually increase the amount of work the server must do to satisfy a request
-- Add metadata for 1 file, 100, 1000,
-- Add 1, 100, 500, 1000 attributes to a file all at once, or one at a time
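A minimal sketch of how the two metrics could be measured from the client side. InMemoryCatalog is a hypothetical stand-in for the catalog server, so the timings it produces say nothing about myLEAD itself; only the attribute counts (1, 100, 500, 1000) and the all-at-once versus one-at-a-time comparison come from the slide.

```python
import time
from typing import Callable, Dict


class InMemoryCatalog:
    """Hypothetical stand-in for a metadata catalog server."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Dict[str, str]] = {}

    def add_attributes(self, file_id: str, attrs: Dict[str, str]) -> None:
        """One 'request': attach a batch of attributes to a file."""
        self.attributes.setdefault(file_id, {}).update(attrs)


def response_time(request: Callable[[], None]) -> float:
    """Elapsed seconds between issuing a request and getting control back."""
    start = time.perf_counter()
    request()
    return time.perf_counter() - start


def make_attrs(n: int) -> Dict[str, str]:
    return {f"attr{i}": str(i) for i in range(n)}


def add_all_at_once(catalog: InMemoryCatalog, file_id: str, n: int) -> None:
    catalog.add_attributes(file_id, make_attrs(n))


def add_one_at_a_time(catalog: InMemoryCatalog, file_id: str, n: int) -> None:
    for key, value in make_attrs(n).items():
        catalog.add_attributes(file_id, {key: value})


for n in (1, 100, 500, 1000):
    batch, single = InMemoryCatalog(), InMemoryCatalog()
    t_batch = response_time(lambda: add_all_at_once(batch, "file0", n))
    t_single = response_time(lambda: add_one_at_a_time(single, "file0", n))
    print(f"{n:5d} attributes: all at once {t_batch:.6f} s, one at a time {t_single:.6f} s")
```

Against a real server each call in the one-at-a-time case would pay a network round trip, which is the overhead the scalability test is designed to expose.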

Experiment environment
-- Client and server run on separate dual 2.0 GHz Opterons, 16 GB RAM
-- Machines connected via Fibre Channel to a 3.5 TB SAN array (16 250 GB SATA drives)
-- Gigabit Ethernet connection between machines
-- Linux Red Hat Enterprise

Test architecture and breakdown of measured system components
[Figure: test client, myLEAD toolkit, myLEAD server, OGSA-DAI, myLEAD stored procedures, MySQL database; machine labels tyr02*, tyr03; figure labels a, b, c, d, e, f, g]
* Acknowledgements to National Science Foundation Grant No

Performance overhead of adding attributes to a metadata description
[Chart: axis in sec]

Issue query with 166K result set. Examine where overheads lie.

Ongoing needs in use of GIS
[Figure: Noesis ontology; GEO Query GUI; resource catalog; myLEAD user info space; query service; THREDDS, OPeNDAP, LDM; THREDDS, OPeNDAP, CDM; Minnesota map server]
-- Metadata in FGDC-based LEAD metadata schema
-- Data in binary (often)
-- Extract temperatures for region from surface (METAR) data, generate shape file