
UNIVERSITY of MARYLAND GLOBAL LAND COVER FACILITY
High Performance Computing in Support of Geospatial Information Discovery and Mining
Joseph JaJa
Institute for Advanced Computer Studies and Department of Electrical and Computer Engineering
University of Maryland, College Park

Outline
- Brief Background:
  - Laboratory for Parallel and Distributed Computing
  - NPACI
- Environmental Informatics
  - Information Discovery
  - Ingestion and Metadata Management
  - Advanced Image Processing and Data Fusion
- Geospatial Data Mining
  - Data Structures
  - Analysis and Mining Tools
- Emerging Trends and Planned Activities

Laboratory for Parallel and Distributed Computing
Advanced high-end computing platforms in support of research in systems software tools and scalable algorithms for a wide variety of science applications.
Current platforms:
- 16-node IBM SP2 with a large disk array - Grand Challenge Project and IBM Shared University Research.
- 10-node DEC cluster, each node with 4 Alpha processors - Keck Foundation and Grand Challenge Project.
- SP-based supercomputer for Earth System Science applications - NSF, IBM Shared University Award, and NASA.
- 32-node (64-processor) Linux cluster with Gigabit Ethernet, for systems software tools and applications - NSF and IBM SUR.
- A large SMP coupled with a 10-TB "active" disk array - NSF and IBM SUR.
- 32-node IBM SP in support of scientific computing and computational biology - Center for Scientific Computing and IBM.

Supporting Hardware for the GLCF
[Diagram. Labels: IBM SP2 RS6000; 4-way high node, 1 GB memory; silver nodes; thin nodes; 3 TB of disks; tape robot, 18 terabytes, 8 drives; 3590 drive.]

NSF Partnership for Advanced Computational Infrastructure
UMD is a major partner in PACI/UCSD, one of the two surviving supercomputer centers.
UMD roles:
- Data cache site
- R&D participation in the thrust areas:
  - Programming Tools and Environments
  - Data Intensive Computing
  - Earth Systems Science
- Resources

Environmental Informatics: NPACI Project
Develop and prototype a software infrastructure on top of multiple distributed data sites that will allow:
- Information discovery from distributed, heterogeneous environmental and biodiversity data sources.
- Integration with current and emerging Web technologies.
- Advanced browsing, subsetting, and image processing at different granule levels, including automatic overlay of different types of data.

Initial Prototype
[Diagram. Labels: users; WWW interface; GLCF user workspace; Informix (sites, workspace, and remote link management); data search and retrieval; data overlay; ESS Web Site; KUBirds; SDSC-SRB; LTER.]

Software Modules
[Diagram. Labels: WWW interface; map server; data overlay; ingestion; data transport; remote site description; database (site, workspace, and remote management); meta-data and preview; image browsing and processing; distributed search and retrieve; workspace.]

Remote Data or Link Ingestion
XML DTD at three levels: granule, data set, and web site.
- Granule level: describes the data item. The XML fields are either extracted from the header file or, for historical data, provided by the user.
- Data set level: describes the list of data at the collection level and specifies the searchable parameters.
- Web site level: gives the information for a data provider's whole site, such as the search engine and the interaction protocol.
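As an illustration of the granule level described above, a record might be read as in the following minimal sketch. The element names, attribute names, and sample values are hypothetical and are not taken from the actual DTD; the `source` attribute is an assumed way of recording whether a field came from the header file or from the user.

```python
import xml.etree.ElementTree as ET

# Hypothetical granule-level record. Element and attribute names are
# illustrative assumptions, not the real GLCF DTD.
GRANULE_XML = """
<granule id="g-001" dataset="example-landcover" site="example-site">
  <field name="sensor" source="header">ExampleSensor</field>
  <field name="acquired" source="header">1999-07-04</field>
  <field name="notes" source="user">historical scan</field>
</granule>
"""


def parse_granule(xml_text):
    """Return (ids, fields) for one granule record.

    ids links the granule to its data set and web site levels;
    fields maps each field name to a (value, source) pair.
    """
    root = ET.fromstring(xml_text)
    ids = {
        "granule": root.get("id"),
        "dataset": root.get("dataset"),
        "site": root.get("site"),
    }
    fields = {
        f.get("name"): (f.text.strip(), f.get("source"))
        for f in root.findall("field")
    }
    return ids, fields


ids, fields = parse_granule(GRANULE_XML)
print(ids["dataset"], fields["notes"])
```

Keeping the data set and site identifiers on the granule is one possible way to tie the three levels together; the original design may link them differently.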

Collaboration Scenario
- No migration: No data is migrated, but each site provides its interaction protocol and searchable parameters. Each site needs to provide an FTP or HTTP service for users to access the data.
- Metadata Management System: The granule-level XML metadata is transferred to and ingested at UMD. The raw data remains hosted at the original site.

Geospatial Data Analysis and Mining
Develop basic building blocks to efficiently manage and analyze large-scale spatio-temporal data:
- Efficient indexing schemes for large-scale heterogeneous geospatial raster data.
- Built-in modules for aggregate and statistical analysis over space and time.
- Mining for spatio-temporal regions that satisfy user-specified characteristics.
- Efficient algorithms for clustering, discovery of association rules, and decision-tree induction.

A Typical Class of Queries
Given a time series of geospatial data and a set of functions {f}, determine the regions and time intervals for which each function varies in a specified fashion.
Example: Find regions with land cover type x in which there is an unusually warm and dry winter season, followed by a summer drought lasting d days, followed by a period of above-normal precipitation.
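The query class above can be sketched as an ordered sequence of phases, each a predicate that must hold for a minimum number of consecutive time steps. This is only an illustrative brute-force scan, not the method used in the project; the data layout (one list of (temperature anomaly, precipitation anomaly) pairs per region), the thresholds, and all names are assumptions.

```python
def earliest_run_end(series, start, pred, min_len):
    """Index just past the earliest run of min_len consecutive steps,
    at or after start, that all satisfy pred; None if there is none."""
    run = 0
    for i in range(start, len(series)):
        run = run + 1 if pred(series[i]) else 0
        if run == min_len:
            return i + 1
    return None


def matches(series, phases):
    """True if the phases occur in order (gaps between them allowed)."""
    pos = 0
    for pred, min_len in phases:
        pos = earliest_run_end(series, pos, pred, min_len)
        if pos is None:
            return False
    return True


def find_regions(regions, cover, phases):
    """Names of regions with the given land cover whose series match."""
    return [
        name
        for name, (region_cover, series) in regions.items()
        if region_cover == cover and matches(series, phases)
    ]


# Each step is (temperature anomaly, precipitation anomaly); the
# thresholds below are made-up values for illustration.
d = 3
phases = [
    (lambda s: s[0] > 1 and s[1] < -1, 2),  # warm and dry winter
    (lambda s: s[1] < -1, d),               # drought lasting d steps
    (lambda s: s[1] > 1, 1),                # above-normal precipitation
]
regions = {
    "A": ("forest", [(2, -2), (2, -2), (0, -2), (0, -2), (0, -2), (0, 2)]),
    "B": ("forest", [(2, -2), (0, 0), (0, -2), (0, -2), (0, -2), (0, 2)]),
    "C": ("urban", [(2, -2), (2, -2), (0, -2), (0, -2), (0, -2), (0, 2)]),
}
print(find_regions(regions, "forest", phases))
```

Taking the earliest qualifying run for each phase leaves the most room for later phases, so this greedy scan finds a match whenever one exists under these assumptions. It visits every region, which is the cost that indexing schemes are meant to avoid.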

Preliminary Results
Efficient data structures built around multidimensional arrays and R-trees:
- Three-dimensional arrays that include aggregate and statistical values of attributes of interest (average, maximum, minimum, sum, standard deviation, etc.).
- R-tree built around attributes such that each node contains rectangular regions whose indicators fall within that node.
- Efficient high-performance algorithms to build these data structures.
- Efficient algorithms to perform bulk updates.
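One way to picture the aggregate arrays above is a coarse grid of mergeable summaries over (time, row, column) blocks. The sketch below is an assumption-laden illustration, not the authors' structure: the `Summary` class, the block size, and the use of a dictionary keyed by block index in place of a dense three-dimensional array are all choices made for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Summary:
    """Mergeable aggregate: enough state for mean, min, max, sum, std."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    lo: float = math.inf
    hi: float = -math.inf

    def add(self, v):
        self.merge(Summary(1, v, v * v, v, v))

    def merge(self, other):
        self.count += other.count
        self.total += other.total
        self.total_sq += other.total_sq
        self.lo = min(self.lo, other.lo)
        self.hi = max(self.hi, other.hi)

    @property
    def mean(self):
        return self.total / self.count

    @property
    def std(self):
        # Population standard deviation; clamp tiny negative round-off.
        return math.sqrt(max(self.total_sq / self.count - self.mean ** 2, 0.0))


def build_blocks(data, b):
    """Summarize data[t][y][x] over cubes of side b, keyed by block index."""
    blocks = {}
    for t, plane in enumerate(data):
        for y, row in enumerate(plane):
            for x, v in enumerate(row):
                blocks.setdefault((t // b, y // b, x // b), Summary()).add(v)
    return blocks


def query(blocks, lo, hi):
    """Merge the summaries of all blocks with index in [lo, hi) per axis."""
    out = Summary()
    for key, s in blocks.items():
        if all(l <= k < h for k, l, h in zip(key, lo, hi)):
            out.merge(s)
    return out


# 2 time steps x 2 rows x 4 columns holding the values 0..15.
data = [[[t * 8 + y * 4 + x for x in range(4)] for y in range(2)] for t in range(2)]
blocks = build_blocks(data, 2)
whole = query(blocks, (0, 0, 0), (1, 1, 2))
print(whole.count, whole.mean, whole.lo, whole.hi)
```

Because summaries merge, a bulk update such as a newly arrived time slab can be summarized on its own and merged into the affected blocks without rescanning older data.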

Emerging Trends and Planned Activities
- Persistent Distributed Data Archives: including data, information, and knowledge management infrastructure (project in collaboration with NARA, led by SDSC).
- Computational Grid: widely distributed computational and storage resources that can be accessed as if all the resources were local (NPACI).
- Storage Area Networks (SAN): storage devices (tapes, disk arrays, NAS) are connected to servers via Fibre Channel.