An Architecture for Real-Time Warehousing of Scientific Data Ramon Lawrence and Anton Kruger IIHR, University of Iowa



Page 2 The University of Iowa. Copyright© 2005 Ramon Lawrence - An Architecture for Real-Time Warehousing of Scientific Data
Overview
Our goal is to build a general archival architecture for storing and querying massive amounts of scientific data. This presentation will discuss our current architecture and how it is being used in a national project to archive weather radar data in the United States.
The architecture achieves four basic design goals:
1) scalable - can handle terabyte-scale data sets
2) extensible - types of data and metadata stored can change
3) inexpensive - uses cheap hardware and open-source software
4) usable - researchers can interact with the system in a variety of intuitive ways

Page 3: Motivation
The size of scientific data sets in many domains is increasing dramatically. This is placing a burden on IT infrastructure for storing, processing, and querying the data effectively.
- As sensor networks are deployed, this will get even worse.
Although data warehousing techniques are well known, managing data sets of this scale is an impediment to research.
One of the most basic challenges is finding data relevant to the research (the data finding problem). To avoid browsing a large data set, suitable metadata describing the data must be generated, stored, and made queryable by the researcher.

Page 4: Desirable Architecture Properties
Our architecture is designed with four key properties:
1) scalable - The system can accommodate more data simply by adding low-cost PCs. Data files are transparently allocated and replicated across nodes without custom hardware/software.
2) extensible - The types of metadata generated and stored may change over time as the research evolves.
3) inexpensive - Low-cost hardware and open-source software are used.
4) usable - Researchers can interact with the data archive in a variety of ways, including directly through C code, web forms, or web services.

Page 5: Archive Architecture Overview
[Figure: archive architecture overview]

Page 6: Architecture Components
The components:
- Extractor - the only component specific to the data set. It is the code module for computing desired metadata statistics on the data. Its output is XML that conforms to a standard schema defined by the Loader.
- Loader - the module responsible for storing metadata in the database and using rules to place data files on retrieval servers. This component is not data set specific. Different and evolving metadata is supported by a general database schema.
- Metadata archive - a relational database that stores the metadata and pointers to the data. The various front-end tools (C code, web interface, etc.) build SQL queries against the metadata to find data with specific properties, along with the file locations.
- Retrieval server - any machine capable of running an HTTP server and acting as a data file store.
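To make the Extractor's role concrete, here is a minimal sketch in Python. The slides do not show the XML schema, so the element and attribute names (`datafile`, `attribute`, `name`, `location`), the example URL, and the choice of statistics are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def extract_metadata(reflectivity_values):
    """Compute simple statistics for one data file (hypothetical Extractor)."""
    return {
        "MaximumReflectivity": max(reflectivity_values),
        "MinimumReflectivity": min(reflectivity_values),
    }


def to_xml(file_id, location, stats):
    """Serialize the statistics as XML for the Loader.

    Element and attribute names are illustrative assumptions, not the
    project's actual schema.
    """
    root = ET.Element("datafile", id=str(file_id), location=location)
    for name, value in sorted(stats.items()):
        attr = ET.SubElement(root, "attribute", name=name)
        attr.text = str(value)
    return ET.tostring(root, encoding="unicode")


stats = extract_metadata([12.5, 40.0, 3.0])
xml_doc = to_xml(1, "http://server1.example/file1.raw", stats)
print(xml_doc)
```

Because only this step knows the file format, a different data set would need a new `extract_metadata`, while anything consuming the XML could stay unchanged.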

Page 7: Case Study: Archiving NEXRAD Data
- There are over 150 NEXt generation RADars (NEXRAD) that collect real-time precipitation data across the United States.
  - The system has been operational for about 10 years, and the amount of collected data is continually expanding.
  - How a radar works: A radar emits a coherent train of microwave pulses and processes reflected pulses. Each processed pulse corresponds to a bin. There are multiple bins in a ray (beam). Rotating the radar 360° is a sweep. After a sweep the radar elevation angle is increased, and another sweep is performed. All sweeps together form a volume.
Our goal is to provide the community with access to the vast archives and real-time data collected by the NEXRAD system.
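The bin / ray / sweep / volume hierarchy described above can be sketched as nested data structures. This is only an illustration of the terminology: the class and field names are assumptions, and the tiny volume built here (two sweeps, four rays each, four bins per ray) is far smaller than a real scan.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ray:
    azimuth_deg: float
    bins: List[float] = field(default_factory=list)  # one value per bin


@dataclass
class Sweep:
    elevation_deg: float                             # fixed for one rotation
    rays: List[Ray] = field(default_factory=list)


@dataclass
class Volume:
    sweeps: List[Sweep] = field(default_factory=list)

    def bin_count(self) -> int:
        """Total number of bins across all sweeps and rays."""
        return sum(len(ray.bins) for sweep in self.sweeps for ray in sweep.rays)


def make_sweep(elevation_deg: float) -> Sweep:
    # Four rays, 90 degrees apart, each with four empty bins.
    rays = [Ray(azimuth_deg=float(a), bins=[0.0] * 4) for a in range(0, 360, 90)]
    return Sweep(elevation_deg=elevation_deg, rays=rays)


volume = Volume(sweeps=[make_sweep(0.5), make_sweep(1.5)])
print(volume.bin_count())  # 32
```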

Page 8: Usefulness of NEXRAD Data
Although the NEXRAD system was designed for severe weather forecasting, the data collected has been used in many areas, including:
- flood prediction
- bird and insect migration
- rainfall estimation
The value of this data has been noted by an NRC report, which labeled it a “critical resource.”
Enhancing Access to NEXRAD Data—A Critical National Resource. National Academy Press, Washington D.C. ISBN , 1999

Page 9: Archiving NEXRAD Data
Despite its value, the archival system for NEXRAD data is unsatisfactory. The National Climatic Data Center (NCDC) maintains a tape archive of the RAW data, but provides few tools for finding relevant data and processing it for research.
Some real-time data is distributed by the University Corporation for Atmospheric Research (UCAR) using their Unidata Internet Data Distribution (IDD) system. However, this still requires users to be able to:
- extract and process a RAW data stream in real-time
- archive it appropriately
- generate metadata and indexes for retrieving it when required
- filter the data set to reduce the amount of space required
- develop custom tools for analysis and processing

Page 10: Data Size Challenges
Individual NEXRAD Level II scans are not large ( KB). However, archiving 150 radars that each produce 10 scans per hour results in an archive rate of 36,000 scans/day = 17 GB/day.
Although the cost of storage has decreased dramatically (1 TB for under $10,000), this still requires a hardware investment.
A major challenge: how do you find the data files of interest?
- Answer: Queryable metadata that allows you to ask for files with certain properties without browsing the entire collection.
- One problem: The metadata can be huge as well, making it inefficient to search. Even worse, scientific metadata tends to change as research evolves.
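The archive-rate figures above can be checked with a few lines of arithmetic. The radar count, scan rate, and 17 GB/day come from the slide; the yearly total and the time to fill 1 TB are derived here and assume 1 TB = 1000 GB.

```python
RADARS = 150            # radars archived (from the slide)
SCANS_PER_HOUR = 10     # scans per radar per hour (from the slide)
GB_PER_DAY = 17         # archive rate quoted on the slide

scans_per_day = RADARS * SCANS_PER_HOUR * 24
tb_per_year = GB_PER_DAY * 365 / 1000      # assumes 1 TB = 1000 GB
days_per_tb = 1000 / GB_PER_DAY

print(scans_per_day)            # 36000
print(round(tb_per_year, 1))    # 6.2
print(round(days_per_tb))       # 59
```

At the quoted rate, a 1 TB store fills in roughly two months, which is why the slide treats storage as a recurring hardware investment.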

Page 11: Metadata Archive
[Figure: User/Client's View. Labels: User/Client, Program Library, Query Metadata, Metadata Archive, Get URIs, Get data (HTTP), Distributed Data Archive (NCDC, Iowa, etc.)]
Example query: “Find all the 2002 storms over the Ralston Creek watershed with mean areal precipitation greater than X mm, and with a spatial extent of more than Z km², with a duration of less than N hours. I want the data in GeoTIFF.”
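The client's view suggested by this slide (query the metadata, get URIs back, then fetch the data over HTTP) might look roughly like the sketch below. It uses an in-memory SQLite database as a stand-in for the metadata archive. The table `storm_file`, its columns, the example URIs, and the numeric thresholds are all hypothetical; the watershed filter and GeoTIFF conversion from the example query are left out.

```python
import sqlite3
import urllib.request

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE storm_file (uri TEXT, year INTEGER, "
    "mean_precip_mm REAL, extent_km2 REAL, duration_h REAL)"
)
conn.executemany(
    "INSERT INTO storm_file VALUES (?, ?, ?, ?, ?)",
    [
        ("http://archive1.example/a.raw", 2002, 12.0, 80.0, 3.0),
        ("http://archive2.example/b.raw", 2002, 4.0, 90.0, 2.0),
        ("http://archive1.example/c.raw", 2001, 15.0, 85.0, 1.0),
    ],
)


def find_uris(conn, year, min_precip_mm, min_extent_km2, max_duration_h):
    """Step 1: query the metadata and return the URIs of matching files."""
    rows = conn.execute(
        "SELECT uri FROM storm_file WHERE year = ? AND mean_precip_mm > ? "
        "AND extent_km2 > ? AND duration_h < ? ORDER BY uri",
        (year, min_precip_mm, min_extent_km2, max_duration_h),
    )
    return [uri for (uri,) in rows]


def fetch(uri):
    """Step 2: download one data file over HTTP (not called in this sketch)."""
    with urllib.request.urlopen(uri) as response:
        return response.read()


uris = find_uris(conn, 2002, 10.0, 50.0, 6.0)
print(uris)  # ['http://archive1.example/a.raw']
```

The point of the split is that the metadata query touches only the small relational store, and the large data files are transferred only for the URIs that match.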

Page 12: Current Status and Future Work
We have implemented a prototype version of the architecture that is currently archiving 30 radars in real-time. Some basic statistics are being generated and can be used to retrieve data files of interest.
Accessible at:
Immediate plans:
- Generate standardized metadata for use by hydrologists.
- Link NEXRAD data to basin information so that rainfall estimation and flood prediction can be performed.
This research is supported by NSF ITR Grant ATM : “A Comprehensive Framework for Use of NEXRAD Data in Hydrometeorology and Hydrology”.

Page 13: NEXRAD Project Participants
The University of Iowa (Lead)
- W.F. Krajewski (PI)
- A.A. Bradley, A. Kruger, R. Lawrence
Princeton University
- J.A. Smith (PI)
- M. Steiner, M.L. Baeck
National Climatic Data Center
- S.A. Delgreco (PI)
- S. Ansari
UCAR/Unidata Program Center
- M. K. Ramamurthy (PI)
- W.J. Weber

Thank You!

Page 15: Extra Slides...

Page 16: NEXRAD Data Management Challenges
Storing NEXRAD Level II data results in many interesting database challenges:
- Data size - A historical archive of NEXRAD data consumes many terabytes of space.
- Flexibility/Variability - Unlike in commercial warehouses, the types of data and metadata that should be stored in the warehouse are not well understood and evolve over time.
- Real-Time response - The data should be loaded and queryable in real-time as it is received from the radars.
- Scientific Workflow - It is desirable to capture and share sequences of calculations on the raw data (scientific workflows) and develop tools that seamlessly interact with the archive.

Page 17: Flexibility Challenges
Ideally, the system should allow arbitrary metadata to be associated with NEXRAD files, and that metadata should be easy to add, update, and query.
Unfortunately, relational databases do not handle variable information well. Although there are some known schema designs that can handle variability, they are inefficient for large data sets.
- Good news: This is not unique to hydrology. Researchers in other domains are building grids to share data/metadata and face the same challenges (e.g. GriPhyN - physics grid).
- Bad news: Representing and querying variable data (especially within a relational database) is an active research problem.

Page 18: Flexibility Example
One way to represent variable metadata on a data file in a relational database is to have a single table:
metadata(dataFileId, attributeName, attributeValue)
Example:
- Data file 1 has three attributes: ArealCoverage, MaximumReflectivity, MinimumReflectivity. Data file 2 has two attributes, and file 3 has only one.
- Note that this schema allows any (variable) number of attributes per file.
A challenge: How would you return all files that have ArealCoverage > 5 and MaximumReflectivity > 20?
Answer: Join two copies of the metadata table on dataFileId, filtering one copy on each attribute.
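The self-join answer can be tried out with SQLite. The table and attribute names follow the slide; the attribute values, and which attributes files 2 and 3 carry, are invented here because the slide's example table did not survive in the transcript.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (dataFileId INTEGER, attributeName TEXT, "
    "attributeValue REAL)"
)
# Illustrative rows: file 1 has three attributes, file 2 two, file 3 one.
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        (1, "ArealCoverage", 7.0),
        (1, "MaximumReflectivity", 35.0),
        (1, "MinimumReflectivity", 2.0),
        (2, "ArealCoverage", 9.0),
        (2, "MaximumReflectivity", 15.0),
        (3, "ArealCoverage", 3.0),
    ],
)

# One copy of the table per attribute being tested, joined on the file id.
query = """
SELECT a.dataFileId
FROM metadata AS a
JOIN metadata AS b ON a.dataFileId = b.dataFileId
WHERE a.attributeName = 'ArealCoverage' AND a.attributeValue > 5
  AND b.attributeName = 'MaximumReflectivity' AND b.attributeValue > 20
ORDER BY a.dataFileId
"""
matching_files = [row[0] for row in conn.execute(query)]
print(matching_files)  # [1]
```

Each additional attribute in the condition adds another copy of `metadata` to the join, which is one reason this kind of schema becomes inefficient on large data sets.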

Page 19: Scientific Workflow
A workflow is a sequence of steps that is performed on data.
- Workflows have received considerable attention where documents must be routed between individuals.
  - Think of a funding proposal being internally routed through your university.
A scientific workflow is a sequence of steps performed on scientific data. Each step uses as input the output of the previous step.
An example workflow in hydrology:
- retrieve the raw data files of interest
- remove ground clutter and Anomalous Propagation (AP)
- calculate estimated rainfall
- map calculations to a basin
Our goal is to support such workflows.
- How to represent and store intermediary products?
- How to make the tools/algorithms interoperable?
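The example hydrology workflow can be sketched as a chain of functions in which each step consumes the previous step's output and every intermediate product is recorded. All function names are hypothetical and the step bodies are placeholders: real clutter/AP removal and rainfall estimation are far more involved than the threshold and mean used here.

```python
def retrieve(file_ids):
    # Stand-in for fetching raw files: one list of values per file.
    return {fid: [5.0, 60.0, 25.0] for fid in file_ids}


def remove_clutter(scans, threshold=55.0):
    # Placeholder filter, not a real ground-clutter/AP algorithm.
    return {fid: [v for v in vals if v < threshold] for fid, vals in scans.items()}


def estimate_rainfall(scans):
    # Placeholder "estimate": the mean of the remaining values.
    return {fid: sum(vals) / len(vals) for fid, vals in scans.items()}


def map_to_basin(rain, basin="Ralston Creek"):
    return {"basin": basin, "rainfall": rain}


def run_workflow(steps, data):
    """Run the steps in order, keeping every intermediate product."""
    products = []
    for step in steps:
        data = step(data)
        products.append((step.__name__, data))
    return data, products


result, products = run_workflow(
    [retrieve, remove_clutter, estimate_rainfall, map_to_basin], [1, 2]
)
print(result)
```

Keeping `products` alongside the final result is one simple way to approach the slide's question about storing intermediary products; making each step accept and return plain data is a small gesture toward interoperability.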

Page 20: A Watershed or Basin
A watershed is an area of land that drains water, sediment and dissolved materials to a common receiving body or outlet.

Page 21: NRC Quote on NEXRAD Data Archiving
“[t]he limited use of ground-based radar rainfall data outside of the operational environment is partially attributed to the lack of research-quality data products and partially to poor archiving practices.”
NRC Report, 2002

Page 22: Metadata
[Figure: the example query shown twice, once labeled Basic and once labeled Derived/Complex]
“Find all the 2002 storms over the Ralston Creek watershed with mean areal precipitation greater than X mm, and with a spatial extent of more than Z km², with a duration of less than N hours. I want the data in GeoTIFF.”

Page 23: Consortium of Universities for the Advancement of Hydrologic Science (CUAHSI)