An Architecture for Real-Time Warehousing of Scientific Data Ramon Lawrence and Anton Kruger IIHR, University of Iowa

1 An Architecture for Real-Time Warehousing of Scientific Data Ramon Lawrence and Anton Kruger IIHR, University of Iowa ramon-lawrence@uiowa.edu http://www.cs.uiowa.edu/~rlawrenc/ http://www.iihr.uiowa.edu/~hml/projects/nexrad-itr

2 Page 2 The University of Iowa. Copyright© 2005 Ramon Lawrence - An Architecture for Real-Time Warehousing of Scientific Data
Overview
Our goal is to build a general archival architecture for storing and querying massive amounts of scientific data. This presentation will discuss our current architecture and how it is being used in a national project to archive weather radar data in the United States.
The architecture achieves four basic design goals:
- 1) scalable - can handle terabyte-scale data sets
- 2) extensible - types of data and metadata stored can change
- 3) inexpensive - uses cheap hardware and open-source software
- 4) usable - researchers can interact with the system in a variety of intuitive ways

3 Motivation
The size of scientific data sets in many domains is increasing dramatically. This places a burden on the IT infrastructure for storing, processing, and querying the data effectively.
- As sensor networks are deployed, this will get even worse.
Although data warehousing techniques are well known, managing data sets of this scale remains an impediment to research.
One of the most basic challenges is finding data relevant to the research (the data finding problem). To avoid browsing a large data set, suitable metadata describing the data must be generated, stored, and made queryable by the researcher.

4 Desirable Architecture Properties
Our architecture is designed with four key properties:
- 1) scalable - The system can accommodate more data simply by adding low-cost PCs. Data files are transparently allocated and replicated across nodes without custom hardware/software.
- 2) extensible - The types of metadata generated and stored may change over time as the research evolves.
- 3) inexpensive - Low-cost hardware and open-source software are used.
- 4) usable - Researchers can interact with the data archive in a variety of ways, including directly through C code, web forms, or web services.

5 Archive Architecture Overview

6 Architecture Components
The components:
- Extractor - the only component specific to the data set. It is the code module for computing the desired metadata statistics on the data. Its output is XML that follows a standard schema defined by the Loader.
- Loader - the module responsible for storing metadata in the database and using rules to place data files on retrieval servers. This component is not data set specific. Different and evolving metadata is supported by a general database schema.
- Metadata archive - a relational database that stores the metadata and pointers to the data. SQL queries are built using the various front-end tools (C code, web interface, etc.) to query the metadata and find data with specific properties, along with the file locations.
- Retrieval server - any machine capable of running an HTTP server and acting as a data file store.
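The Extractor-to-Loader handoff can be sketched as follows. This is only an illustration under stated assumptions: the slides do not give the Loader's XML schema, so the element and attribute names (`datafile`, `attribute`, `name`), the function `extract_metadata`, and the chosen statistics are all hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_metadata(file_name, values):
    """Compute simple statistics for one data file and serialize them as XML.

    The element names below are hypothetical; the real schema is defined
    by the Loader and is not shown in the slides.
    """
    stats = {
        "minimum": min(values),
        "maximum": max(values),
        "mean": sum(values) / len(values),
    }
    root = ET.Element("datafile", name=file_name)
    for attribute_name, value in stats.items():
        # One <attribute> element per statistic, so new statistics can be
        # added without changing the document structure.
        ET.SubElement(root, "attribute", name=attribute_name).text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = extract_metadata("scan_0001.raw", [1.0, 2.0, 6.0])
print(xml_text)
```

Emitting one generic element per statistic, rather than one tag per statistic, is a design choice made here so that a data-set-independent Loader could accept new metadata types unchanged.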

7 Case Study: Archiving NEXRAD Data
- There are over 150 NEXt generation RADars (NEXRAD) that collect real-time precipitation data across the United States.
  - The system has been operational for about 10 years, and the amount of collected data is continually expanding.
  - How a radar works: A radar emits a coherent train of microwave pulses and processes reflected pulses. Each processed pulse corresponds to a bin. There are multiple bins in a ray (beam). Rotating the radar 360° is a sweep. After a sweep, the radar elevation angle is increased and another sweep is performed. All sweeps together form a volume.
Our goal is to provide the community with access to the vast archives and real-time data collected by the NEXRAD system.
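The bin / ray / sweep / volume hierarchy described above can be pictured as nested containers. The sketch below is illustrative only: the class and field names are hypothetical, the elevation angles and counts are made-up toy values, and this is not the NEXRAD Level II file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Ray:
    azimuth_deg: float
    bins: List[float]  # one value per processed pulse (bin)


@dataclass
class Sweep:
    elevation_deg: float  # fixed for one full 360-degree rotation
    rays: List[Ray] = field(default_factory=list)


@dataclass
class Volume:
    sweeps: List[Sweep] = field(default_factory=list)

    def bin_count(self) -> int:
        return sum(len(ray.bins) for sweep in self.sweeps for ray in sweep.rays)


def toy_volume(elevations, rays_per_sweep, bins_per_ray):
    """Build a volume with placeholder bin values (all zeros)."""
    volume = Volume()
    for elevation in elevations:
        sweep = Sweep(elevation_deg=elevation)
        for i in range(rays_per_sweep):
            azimuth = 360.0 * i / rays_per_sweep
            sweep.rays.append(Ray(azimuth_deg=azimuth, bins=[0.0] * bins_per_ray))
        volume.sweeps.append(sweep)
    return volume


volume = toy_volume(elevations=[0.5, 1.5], rays_per_sweep=4, bins_per_ray=3)
print(volume.bin_count())  # 2 sweeps * 4 rays * 3 bins = 24
```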

8 Usefulness of NEXRAD Data
Although the NEXRAD system was designed for severe weather forecasting, the data collected has been used in many areas, including:
- flood prediction
- bird and insect migration
- rainfall estimation
The value of this data has been noted by an NRC report, which labeled it a “critical resource.”
Enhancing Access to NEXRAD Data—A Critical National Resource. National Academy Press, Washington D.C. ISBN 0-309-06636-0, 1999

9 Archiving NEXRAD Data
Despite its value, the archival system for NEXRAD data is unsatisfactory. The National Climatic Data Center (NCDC) maintains a tape archive of the RAW data, but provides few tools for finding relevant data and processing it for research.
Some real-time data is distributed by the University Corporation for Atmospheric Research (UCAR) using their Unidata Internet Data Distribution (IDD) system. However, this still requires users to be able to:
- extract and process a RAW data stream in real-time
- archive it appropriately
- generate metadata and indexes for retrieving it when required
- filter the data set to reduce the amount of space required
- develop custom tools for analysis and processing

10 Data Size Challenges
Individual NEXRAD Level II scans are not large (300-1000 KB). However, archiving 150 radars that each produce 10 scans per hour results in an archive rate of 36,000 scans/day = 17 GB/day.
Although the cost of storage has decreased dramatically (1 TB for under $10,000), this still requires a hardware investment.
A major challenge is how to find the data files of interest.
- Answer: Queryable metadata that allows you to ask for files with certain properties without browsing the entire collection.
- One problem: The metadata can be huge as well, making it inefficient to search. Even worse, scientific metadata tends to change as research evolves.
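The archive-rate figures can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 1,000,000 KB); the average scan size is not stated on the slide, so the code only derives the value that the 17 GB/day figure implies.

```python
# Figures taken from the slide.
RADARS = 150
SCANS_PER_HOUR = 10
HOURS_PER_DAY = 24

# Assumption: decimal units, 1 GB = 1,000,000 KB.
KB_PER_GB = 1_000_000

scans_per_day = RADARS * SCANS_PER_HOUR * HOURS_PER_DAY


def daily_volume_gb(mean_scan_kb):
    """Daily archive volume in GB for a given average scan size in KB."""
    return scans_per_day * mean_scan_kb / KB_PER_GB


low_gb = daily_volume_gb(300)    # every scan at the smallest quoted size
high_gb = daily_volume_gb(1000)  # every scan at the largest quoted size

# Average scan size implied by the quoted 17 GB/day.
implied_mean_kb = 17 * KB_PER_GB / scans_per_day

print(scans_per_day)                   # 36000
print(round(low_gb, 1), round(high_gb, 1))
print(round(implied_mean_kb))
```

Under these assumptions the quoted 17 GB/day falls between the 300 KB and 1000 KB bounds and corresponds to an average scan of a little under 500 KB.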

11 Metadata Archive
Example request: “Find all the 2002 storms over the Ralston Creek watershed with mean areal precipitation greater than X mm, and with a spatial extent of more than Z km², with a duration of less than N hours. I want the data in GeoTIFF.”
[Diagram labels: User/Client; User/Client’s View; Program Library; Query Metadata; Metadata Archive; Get URIs; Get data; HTTP; Distributed Data Archive (NCDC, Iowa, etc.)]

12 Current Status and Future Work
We have implemented a prototype version of the architecture that is currently archiving 30 radars in real-time. Some basic statistics are being generated and can be used to retrieve data files of interest. Accessible at:
- http://nexrad.cs.uiowa.edu
Immediate plans:
- Generate standardized metadata for use by hydrologists.
- Link NEXRAD data to basin information so that rainfall estimation and flood prediction can be performed.
This research is supported by NSF ITR Grant ATM 0427422: “A Comprehensive Framework for Use of NEXRAD Data in Hydrometeorology and Hydrology”.

13 NEXRAD Project Participants
The University of Iowa (Lead)
- W.F. Krajewski (PI)
- A.A. Bradley, A. Kruger, R. Lawrence
Princeton University
- J.A. Smith (PI)
- M. Steiner, M.L. Baeck
National Climatic Data Center
- S.A. Delgreco (PI)
- S. Ansari
UCAR/Unidata Program Center
- M. K. Ramamurthy (PI)
- W.J. Weber

14 An Architecture for Real-Time Warehousing of Scientific Data Ramon Lawrence and Anton Kruger IIHR, University of Iowa ramon-lawrence@uiowa.edu http://www.cs.uiowa.edu/~rlawrenc/ http://www.iihr.uiowa.edu/~hml/projects/nexrad-itr Thank You!

15 Extra Slides...

16 NEXRAD Data Management Challenges
Storing NEXRAD Level II data results in many interesting database challenges:
- Data size - A historical archive of NEXRAD data consumes many terabytes of space.
- Flexibility/Variability - Unlike in commercial warehouses, the types of data and metadata that should be stored in the warehouse are not well understood and evolve over time.
- Real-time response - The data should be loaded and queryable in real-time as it is received from the radars.
- Scientific workflow - It is desirable to capture and share sequences of calculations on the raw data (scientific workflows) and develop tools that seamlessly interact with the archive.

17 Flexibility Challenges
Ideally, the system should allow arbitrary metadata to be associated with NEXRAD files, and that metadata should be easy to add, update, and query.
Unfortunately, relational databases do not handle variable information well. Although there are some known schema designs that can handle variability, they are inefficient for large data sets.
- Good news: This is not unique to hydrology. Researchers in other domains are building grids to share data/metadata and face the same challenges (e.g. GriPhyN - physics grid).
- Bad news: Representing and querying variable data (especially within a relational database) is an active research problem.

18 Flexibility Example
One way to represent variable metadata on a data file in a relational database is to have a single table:
    metadata(dataFileId, attributeName, attributeValue)
Example:
- Data file 1 has three attributes: ArealCoverage, MaximumReflectivity, MinimumReflectivity. Data file 2 has two attributes, and file 3 has only one.
- Note that this schema allows any (variable) number of attributes per file.
A challenge: How would you return all files that have ArealCoverage > 5 and MaximumReflectivity > 20?
Answer: Join two copies of the metadata table on dataFileId, filtering one copy on each attribute.
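The self-join answer can be tried out with an in-memory SQLite database. This is a minimal sketch: the table and column names come from the slide, but the attribute values, the choice of which two attributes file 2 carries, and the use of SQLite are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata ("
    " dataFileId INTEGER, attributeName TEXT, attributeValue REAL)"
)

# Made-up values: file 1 has three attributes, file 2 has two, file 3 has one.
rows = [
    (1, "ArealCoverage", 7.5),
    (1, "MaximumReflectivity", 45.0),
    (1, "MinimumReflectivity", 3.0),
    (2, "ArealCoverage", 9.0),
    (2, "MaximumReflectivity", 15.0),
    (3, "ArealCoverage", 2.0),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)

# Each condition needs its own copy of the table, joined on the file id.
query = """
SELECT m1.dataFileId
FROM metadata AS m1
JOIN metadata AS m2 ON m1.dataFileId = m2.dataFileId
WHERE m1.attributeName = 'ArealCoverage' AND m1.attributeValue > 5
  AND m2.attributeName = 'MaximumReflectivity' AND m2.attributeValue > 20
ORDER BY m1.dataFileId
"""
matching_files = [row[0] for row in conn.execute(query)]
print(matching_files)  # [1]
```

Every additional attribute condition adds another copy of the table to the join, which is one reason this kind of schema becomes expensive on large data sets.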

19 Scientific Workflow
A workflow is a sequence of steps that is performed on data.
- Workflows have received considerable attention where documents must be routed between individuals.
  - Think of a funding proposal being internally routed through your university.
A scientific workflow is a sequence of steps performed on scientific data. Each step uses as input the output of the previous step.
An example workflow in hydrology:
- retrieve the raw data files of interest
- remove ground clutter and Anomalous Propagation (AP)
- calculate estimated rainfall
- map calculations to a basin
Our goal is to support such workflows.
- How to represent and store intermediary products?
- How to make the tools/algorithms interoperable?
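The idea that each step consumes the previous step's output, with intermediary products kept along the way, can be sketched as a simple chain of functions. The step functions here are placeholders with hypothetical names; they only tag the product with a stage label and do no real clutter removal or rainfall estimation.

```python
from typing import Callable, Dict, List, Tuple

Product = Dict[str, object]
Step = Callable[[Product], Product]


# Placeholder steps: each returns a new product marked with its stage.
def retrieve_raw_files(product: Product) -> Product:
    return {**product, "stage": "raw"}


def remove_clutter_and_ap(product: Product) -> Product:
    return {**product, "stage": "cleaned"}


def estimate_rainfall(product: Product) -> Product:
    return {**product, "stage": "rainfall"}


def map_to_basin(product: Product) -> Product:
    return {**product, "stage": "basin"}


WORKFLOW: List[Tuple[str, Step]] = [
    ("retrieve", retrieve_raw_files),
    ("clean", remove_clutter_and_ap),
    ("rainfall", estimate_rainfall),
    ("basin", map_to_basin),
]


def run_workflow(steps, initial):
    """Run steps in order, feeding each the previous output.

    Returns the final product and every intermediary product by step name.
    """
    intermediates = {}
    current = initial
    for name, step in steps:
        current = step(current)
        intermediates[name] = current
    return current, intermediates


final, intermediates = run_workflow(WORKFLOW, {"request": "example"})
print(final["stage"])       # basin
print(list(intermediates))  # ['retrieve', 'clean', 'rainfall', 'basin']
```

Giving every step the same input and output type is one possible answer to the interoperability question, at the cost of pushing structure into the product itself.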

20 A Watershed or Basin
A watershed is an area of land that drains water, sediment, and dissolved materials to a common receiving body or outlet.

21 NRC Quote on NEXRAD Data Archiving
“[t]he limited use of ground-based radar rainfall data outside of the operational environment is partially attributed to the lack of research-quality data products and partially to poor archiving practices.”
NRC Report, 2002

22 Metadata
“Find all the 2002 storms over the Ralston Creek watershed with mean areal precipitation greater than X mm, and with a spatial extent of more than Z km², with a duration of less than N hours. I want the data in GeoTIFF.”
[The slide shows this request twice, once labeled “Basic” and once labeled “Derived/Complex”.]

23 Consortium of Universities for the Advancement of Hydrologic Sciences (CUAHSI)

