Archiving derived and temporally changing geospatial data in LEAD
Beth Plale, Department of Computer Science, School of Informatics, Indiana University
LEAD (Linked Environments for Atmospheric Discovery): dynamic, adaptive forecasting of mesoscale severe storms.
GGF technologies leveraged: service-oriented architecture (moving to WSRF), WS-Notification, service registry, Globus RLS, OGSA-DAI.
- Beth Plale, IU: data subsystem architecture, myLEAD personal information space, "VO" catalog
- Dennis Gannon, IU: workflow (GBPEL), portal/science gateway, TeraGrid, XSUL, notification
- Oklahoma Univ.: mesoscale meteorology
- Unidata: IDD, LDM
- NCSA: brokering
- UNC (Reed): monitoring
- UAH: data mining of atmospheric data
- Millersville, Howard University: 6-12 and undergraduate education
NSF ATM-0331480
LEAD Data Subsystem Architecture (diagram).
- Access interfaces: personal workspace browser, geospatial query GUI, ontology query ("ask ontology"), visualization client (IDV).
- Access services: resource catalog (VO data and compute resources), myLEAD user information space, Noesis ontology (concepts and vocabulary), query service (query mediation), THREDDS catalogs (web-browsable metadata), name service (single global naming system), automated metadata generation, stream service (from LDM to the user's application).
- Resource services: steerable instruments (CASA), grid storage repository, Unidata data dissemination client (LDM), OPeNDAP data server.
Petascale data collections are increasingly crucial to research and education in science and engineering. Current influential technology factors:
- powerful, affordable sensors, processors, instruments, and automated equipment
- reductions in storage costs that make it cost-effective to maintain large data collections
- the Internet, which makes it easier to share data
As a result, researchers increasingly conduct research using data originally generated by others: genomics, climate modeling, demographic studies.
Magnitude and breadth of the proliferation of data generation in the US:
- The same technological advances that produced inexpensive digital cameras have enabled a new generation of high-resolution scientific instruments and sensors.
- An increasing amount of valuable content is "born digital" and can only be managed, preserved, and used in digital form.
- Advances in biomedical research depend on building and preserving complex genomic databases.
- Research in biodiversity and ecosystems, global climate change, meteorology, and space science depends on the ability to combine vast quantities of digital information with complex models and analytical tools.
Problem domain: storage, retrieval, and access to petascale data collections in science and engineering.
- Digital data collections* are the foundation for analysis using automated analytical tools.
- Long-lived data undergoes constant re-analysis with improved algorithms or with alternate uses in mind.
- Analysis depends not just on sensed or computer-generated data but on the metadata that characterizes the environment and the sensing instrument.
*Data: text, numbers, images, video or movie clips, audio, software, algorithms, equations, models, simulations.
*Digital data collections: the data itself, plus the infrastructure and organizations needed to preserve access to it.
Petascale data sets require a new work style:
- Analysis tools are growing more complex; many analysis algorithms are super-linear, often needing N^2 or N^3 time to process N data points.
- I/O bandwidth has not kept pace with storage capacity: capacity has increased 100-fold while storage bandwidth has increased only 10-fold.
- There are too many files (> 1 million) for a local file system to manage; file names and a directory hierarchy are not enough.
- You can't download the dataset to a laptop and process, analyze, and visualize it there.
- Instead, move the end-user's program to the data, and communicate only questions and answers.
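The capacity-versus-bandwidth gap above can be made concrete with back-of-envelope arithmetic. A minimal sketch; the 10 Gbit/s link and the ~1 GB question/answer payload are illustrative assumptions, not numbers from the talk:

```python
# Back-of-envelope sketch of the "move the program to the data" argument.
# All specific sizes and rates here are illustrative assumptions.

def transfer_time_days(dataset_bytes: float, link_bps: float) -> float:
    """Time to move a dataset over a network link, in days."""
    seconds = dataset_bytes * 8 / link_bps
    return seconds / 86400

PETABYTE = 1e15
# Assumption: a well-provisioned 10 Gbit/s link, fully utilized.
download = transfer_time_days(1 * PETABYTE, 10e9)
# Assumption: shipping only the question and answer moves ~1 GB total.
remote = transfer_time_days(1e9, 10e9)

print(f"Download 1 PB: {download:.1f} days")                       # ~9.3 days
print(f"Ship question/answer (~1 GB): {remote * 86400:.1f} seconds")
```

Even a fully utilized fast link takes over a week to move a petabyte, while a query and its answer move in under a second, which is the motivation for co-locating computation with the archive.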
Problem statement: the technologies, strategies, methodologies, and resources needed to manage digital information have not kept pace with innovations in the creation and capture of digital information. Current approaches do not scale to petascale data collections.
Typical analysis for mesoscale meteorologists: compare model results to observational data.
Research domain: archiving derived data products and temporally changing data products.
- Archiving: saving "born-digital" content for future use and reuse.
- Derived data products: data products that result from further processing of original raw data.
- Temporally changing data products: data that changes continuously through regular additions streamed into the archive, either through ad hoc actions taken by content creators or in conjunction with workflow processes.
Approach: general data models, standardized metadata schemas, a standard and highly modular system-level architecture (grid computing), and well-accepted communication protocols.
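The general data model named in the approach can be sketched minimally. The class and field names below are hypothetical, chosen only to illustrate derived-product provenance and snapshots of a temporally changing product:

```python
# A minimal sketch of a data model for archived products. The class and
# field names are assumptions for illustration; the talk names the
# concepts (derived products, snapshots, provenance), not an API.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DataProduct:
    name: str
    uri: str                                          # location in grid storage
    derived_from: list = field(default_factory=list)  # provenance links
    snapshots: list = field(default_factory=list)     # temporally changing data

    def add_snapshot(self, uri: str) -> None:
        """Record a new version of a continuously changing product."""
        self.snapshots.append((datetime.now(timezone.utc), uri))

# A derived product records which raw inputs it was computed from.
raw = DataProduct("NEXRAD level II", "ftp://example.org/raw/kTLX")
forecast = DataProduct("WRF 12h forecast", "ftp://example.org/derived/run42",
                       derived_from=[raw.uri])
forecast.add_snapshot("ftp://example.org/derived/run42/hour06")
print(len(forecast.snapshots))  # 1
```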
Our current research challenges:
- Repository architecture: define the technical architecture; build tools to acquire, use, and store data; predict repository use for provisioning the physical infrastructure.
- Representation of temporal and procedural relationships: provenance; automated metadata generation; snapshots of temporally changing data products.
User access to the personal workspace is through the LEAD portal.
Early interface for sharing data
Creating structure in the user's archive that models their investigation steps (diagram). A myLEAD agent sits between the workflow and the myLEAD server, exchanging product requests, product registrations, and notification messages via a decoder service and a notification service. Example 12-hour investigation: gather data products; run a 12-hour forecast (6 hours to complete); analyze results; based on the analysis, gather other products; analyze results; run a 6-hour forecast (3 hours to complete).
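The agent's role can be sketched as a listener that mirrors investigation steps into a personal catalog. The names below (`CatalogAgent`, `notify`) are assumptions for illustration; the real myLEAD agent exchanges WS-Notification messages with a myLEAD server:

```python
# Sketch of an agent that records which workflow step produced which
# product, building per-experiment structure in a personal catalog.
# Class and method names are illustrative assumptions.
class CatalogAgent:
    def __init__(self):
        self.experiments = {}          # experiment id -> ordered (step, product)

    def start_experiment(self, exp_id: str) -> None:
        self.experiments[exp_id] = []

    def notify(self, exp_id: str, step: str, product_uri: str) -> None:
        """Handle a workflow notification: register the product under its step."""
        self.experiments[exp_id].append((step, product_uri))

agent = CatalogAgent()
agent.start_experiment("forecast-2005-02")
agent.notify("forecast-2005-02", "gather", "ldm://surface/metar-0600")
agent.notify("forecast-2005-02", "12h-forecast", "file:///scratch/wrf/out.nc")
print(agent.experiments["forecast-2005-02"][1][0])  # 12h-forecast
```

Because every product registration arrives tagged with its step, the resulting catalog entries reproduce the shape of the investigation, which is the point of the slide.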
Capturing process in the structure (diagram). Bob's workspace evolves over time (Dec 04, Feb 05, Mar 05), accumulating experiments (Experim-Dec04, Experim-Feb05) that group input data sets (Hurricane Ivan; SE OK quadrant; vortice study 98-00), workflow templates, WRF output files (001.nc ... 150.nc), and published results. The metadata catalog (tables of users, collections, and files) points into physical data storage (e.g., ftp://storageserver.org/file1998o768).
Archiving derived and temporally changing data products (diagram). Each user (e.g., Deepti, Greg, Carolyn) has a personal archive catalog; workflows run on TeraGrid HPC machines and data lives on TeraGrid storage servers. A typical experiment performs between 4 and 100 reads and between 4 and 100 writes.
Archiving derived and temporally changing data products. Challenge: criteria for determining the number of versions necessary to preserve a meaningful sense of an object's evolution over time.
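One candidate criterion is to keep snapshots at roughly exponentially spaced ages, so recent change is preserved densely and old change sparsely. This policy is an assumption offered as a sketch, not the talk's answer to the open challenge:

```python
# Sketch of an exponential-thinning retention policy for snapshots of a
# temporally changing product. The policy itself is an assumption.
import math

def thin_snapshots(ages_hours):
    """Keep one snapshot per doubling age bucket [2^k, 2^(k+1))."""
    kept, seen = [], set()
    for age in sorted(ages_hours):           # youngest snapshot wins each bucket
        bucket = math.floor(math.log2(max(age, 1.0)))
        if bucket not in seen:
            seen.add(bucket)
            kept.append(age)
    return kept

# Nine snapshots collapse to six, with density decreasing as age grows.
print(thin_snapshots([0.5, 1, 2, 3, 5, 9, 17, 33, 40]))  # [0.5, 2, 5, 9, 17, 33]
```

The number of retained versions then grows only logarithmically with the object's lifetime, which bounds storage while still recording the shape of its evolution.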
Estimating the size of LEAD's personal archive repository (for provisioning). Assumed workload mix:
- Canonical workflow: single 12-hour forecast (10%)
- Educational workflow: simple analysis (50%)
- Ensemble workflow: multi-forecast run (5%)
- Data access workload: "retrieve all data products for Katrina and store to my personal repository" (35%)
Done in advance of any real users; estimated number of users: 500.
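The provisioning estimate is a weighted mix of these workflow classes. In the sketch below the mix fractions come from the slide, but the per-run file counts and runs per user per day are illustrative assumptions, not LEAD's measured numbers:

```python
# Sketch of the provisioning estimate as a weighted workload mix.
# Fractions are from the slide; file counts and run rates are assumptions.
mix = {                        # (fraction of runs, files written per run)
    "canonical 12h forecast":  (0.10, 120),
    "educational analysis":    (0.50, 15),
    "ensemble multi-forecast": (0.05, 600),
    "data access (Katrina)":   (0.35, 40),
}

users, runs_per_user_per_day = 500, 4   # assumptions
expected_files_per_run = sum(f * n for f, n in mix.values())
daily_files = users * runs_per_user_per_day * expected_files_per_run

print(f"expected files/run: {expected_files_per_run:.1f}")
print(f"files/day for {users} users: {daily_files:,.0f}")
```

Swapping in measured per-class file counts and arrival rates turns the same two-line calculation into the repository's actual provisioning input.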
Estimating the file usage distribution: based on the arrival rates of LEAD observational data sources.
Estimated resource needs of the archival repository for 500 active users:
- Total sustained read/write bandwidth: 157.9 Mbps
- Storage needs: 21.2 TB
- I/O rate: 1,667.6 files read/written per minute
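As a sanity check, these aggregates can be divided back into per-user figures; only the division is new here, the totals are the slide's:

```python
# Per-user view of the slide's aggregate estimates for 500 active users.
USERS = 500
bandwidth_mbps, storage_tb, files_per_min = 157.9, 21.2, 1667.6

per_user_kbps = bandwidth_mbps * 1000 / USERS
per_user_gb = storage_tb * 1000 / USERS
per_user_files_min = files_per_min / USERS

print(f"{per_user_kbps:.0f} kbps, {per_user_gb:.1f} GB, "
      f"{per_user_files_min:.2f} files/min per user")
```

Roughly 316 kbps, 42 GB, and 3.3 files per minute per user: modest individually, but only the storage and I/O aggregates, not the per-user numbers, drive the repository's hardware provisioning.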
Empirical validation of a hypothesis often involves gathering information into a mental model. How can an archiving system help?
- Mental models: ideas, thoughts, concepts, opinions, theories, frames, schemas, viewpoints, perspectives, values, beliefs...
- Result models: diagrams, maps, illustrations, visual metaphors, pictures, graphs, matrices, schematics, icons, cartoons...
When sufficient information has been gathered, the scientist synthesizes it into knowledge that allows acceptance or rejection of the hypothesis. The archiving system can assemble information for synthesis into knowledge.
Forecast workflow example. Steps:
- select the geospatial region over which the forecast is to be run
- use the region as a parameter to the model (ARPS, WRF)
- the model generates products
- the products are visualized
Tracking investigation progress: myLEAD offloads the mundane work of gathering, storing, and tracking the data products used during an experimental investigation. These products provide the keys to constructing a mental "results model." (Diagram: myLEAD service and database.)
Constructing a 'result object'. A result object is a collection of key materials assembled during workflow execution and deemed important to decision making. Selected derived data objects are added to the result object; determining what is important and what is not is a research challenge.
Simple example:
1. A geospatial region is selected as input to the forecast model.
2. Based on the user's role in evacuation decisions,
3. the system adds a link to the result object that displays road maps and population density maps for that geospatial region.
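The example's role-based rule can be sketched as a small function. The rule set, role name, and map-service URIs below are hypothetical, introduced only to illustrate the augmentation step:

```python
# Sketch of the slide's rule: augment a result object based on the user's
# role. Function, role, and URI names are illustrative assumptions.
def build_result_object(region: str, user_role: str, products: list) -> dict:
    result = {"region": region, "products": list(products), "links": []}
    if user_role == "evacuation-planner":
        # Hypothetical map services keyed by the forecast's geospatial region.
        result["links"].append(f"maps://roads?region={region}")
        result["links"].append(f"maps://population-density?region={region}")
    return result

ro = build_result_object("SE-OK", "evacuation-planner", ["wrf/out.nc"])
print(len(ro["links"]))  # 2
```

The open research question on the slide is what such rules should be in general; this sketch only shows where a rule would plug in.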
When the forecast model completes and the user visually examines the model results, the LEAD data subsystem simultaneously pops up maps of population density and the transportation network over that area, shaving minutes off critical decision making.
Key metrics used in the experimental evaluation:
- Query response time: elapsed time between when the client issues a request and when it receives the response.
- Scalability: gradually increase the amount of work the server must do to satisfy a request. Add metadata for 1, 100, 1,000, and 10,000 files; add 1, 100, 500, or 1,000 attributes to a file, either all at once or one at a time.
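The measurement itself is a simple end-to-end timing loop over the scalability knob. A minimal sketch; `add_attributes` is a stand-in assumption for the real myLEAD client call:

```python
# Sketch of the evaluation harness: time a request end-to-end while
# sweeping the attribute count. The workload function is a stand-in.
import time

def add_attributes(n: int) -> None:
    """Stand-in workload: build n attribute records, as the real call would."""
    _ = [{"name": f"attr{i}", "value": i} for i in range(n)]

def response_time(workload, n: int) -> float:
    """Elapsed seconds between issuing the request and receiving the response."""
    start = time.perf_counter()
    workload(n)
    return time.perf_counter() - start

for n in (1, 100, 500, 1000):
    t = response_time(add_attributes, n)
    print(f"{n:5d} attributes: {t * 1000:.3f} ms")
```

For the all-at-once versus one-at-a-time comparison on the slide, the same harness would wrap either a single batched call or a loop of n single-attribute calls.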
Experiment environment:
- Client and server run on separate dual 2.0 GHz Opterons with 16 GB RAM
- Machines connected via Fibre Channel to a 3.5 TB SAN array (16 250 GB SATA drives)
- Gigabit Ethernet connection between the machines
- Red Hat Enterprise Linux
Test architecture and breakdown of measured system components (diagram): a test client drives the myLEAD toolkit, which calls the myLEAD server through OGSA-DAI; the server invokes myLEAD stored procedures against a MySQL database. Client and server run on separate machines (tyr02, tyr03). Acknowledgements to National Science Foundation Grant No. 0202048.
Performance overhead of adding attributes to a metadata description (chart).
Issue a query with a 166K result set and examine where the overheads lie.
Ongoing needs in the use of GIS (diagram): the Noesis ontology, geospatial query GUI, resource catalog, myLEAD user information space, and query service sit above the data services (THREDDS, OPeNDAP, LDM, CDM). Metadata is in the FGDC-based LEAD metadata schema; the data is often binary. Example task: extract temperatures for a region from surface (METAR) data, generate a shape file, and display it with the Minnesota MapServer.