National Partnership for Advanced Computational Infrastructure Data Intensive Computing Information Based Computing Digital Libraries / Metacomputing Services.

Presentation transcript:

National Partnership for Advanced Computational Infrastructure Data Intensive Computing Information Based Computing Digital Libraries / Metacomputing Services Reagan W. Moore San Diego Supercomputer Center

[Diagram labels: Distributed Archives, Application, Digital Library, Data Mining, Information Based Computing, Information Discovery, Collection Building]

Co-evolution of Technology
- Supercomputer Centers and Digital Libraries
- Both support large-scale processing and storage of data
- Will the supercomputer centers of the future be digital libraries?

Researchers: Chaitanya Baru, Amarnath Gupta, Bertram Ludaescher, Richard Marciano, Yannis Papakonstantinou, Arcot Rajasekar, Wayne Schroeder, Michael Wan

Outline
- Two views of computing
  - Execution environment: metacomputing systems
  - Data management environment: digital library
- Analysis for moving data to the process or the process to the data
- Data Management Environment
- Information Based Computing

[Diagram: layered comparison of Digital Libraries (Publication / Services Environment) and the Metacomputing Environment (Execution Environment). Labels: Multimedia / GIS / MVD / XML / LDAP / CORBA / Z39.50; Presentation Interface; Object Based Information Model; Data Management for publication; Data Resources; Parallel I/O - MPI; Constructors: turning data sets into objects; Data Management for execution; Data Resources]

Choice between Environments
- Should we provide services for manipulating information?
  - Move the process to the data
- Should we provide execution environments?
  - Move data to the process

Data Distribution Comparison
- Two systems: a data handling platform and a supercomputer
- Execution rates r and R, with r < R
- Bandwidths linking the systems are B and b
- Operations per bit for analysis: O
- Operations per bit for data transfer: o
- Reduce the size of the data from S bytes to s bytes and analyze
- Should the data reduction be done before transmission?

Distributing Services
Compare times for analyzing data with size reduction from S to s

Stage            Reduce at data handling platform    Reduce at supercomputer
Read data        S/B                                 S/B
Reduce data      OS/r                                OS/R
Transmit data    os/r                                oS/r
Network          s/b                                 S/b
Receive data     os/R                                oS/R

Comparison of Time
- Processing at supercomputer (all S bytes are moved, then reduced at rate R):
  $$T(\mathrm{Super}) = S/B + oS/r + S/b + oS/R + OS/R$$
- Processing at archive (data is reduced at rate r, then s bytes are moved):
  $$T(\mathrm{Archive}) = S/B + OS/r + os/r + s/b + os/R$$

Optimization Parameter Selection
- The comparison is an algebraic inequality in eight independent variables (S, s, B, b, R, r, O, o)
- $T(\mathrm{Super}) < T(\mathrm{Archive})$:
  $$S/B + oS/r + S/b + oS/R + OS/R < S/B + OS/r + os/r + s/b + os/R$$
- Which variable provides the simplest optimization criterion?
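
The two pipelines can be compared numerically by summing their stage times. The following is a minimal sketch, not code from the presentation: the function names and sample values are hypothetical, sizes are assumed to be in bits, bandwidths in bits/sec, and execution rates in operations/sec. The first function reduces the data at rate r before transmission; the second ships all S and reduces at rate R.

```python
def time_reduce_at_archive(S, s, B, b, r, R, O, o):
    """Reduce at the data handling platform (rate r), then ship s."""
    read = S / B          # read the full data set
    reduce_ = O * S / r   # analysis/reduction at the slower rate r
    transmit = o * s / r  # protocol work to send the reduced data
    network = s / b       # reduced data crosses the network
    receive = o * s / R   # protocol work to receive at the supercomputer
    return read + reduce_ + transmit + network + receive


def time_reduce_at_supercomputer(S, s, B, b, r, R, O, o):
    """Ship all S, then reduce at the supercomputer (rate R)."""
    read = S / B
    transmit = o * S / r  # protocol work to send the full data set
    network = S / b
    receive = o * S / R
    reduce_ = O * S / R   # analysis/reduction at the faster rate R
    return read + transmit + network + receive + reduce_


# Hypothetical sample values, chosen only to exercise both regimes.
sample = dict(S=8e12, s=8e10, B=8e8, b=8e8, r=1e9, R=1e11, o=1.0)
for O in (1.0, 100.0):
    archive = time_reduce_at_archive(O=O, **sample)
    supercomputer = time_reduce_at_supercomputer(O=O, **sample)
    faster = "supercomputer" if supercomputer < archive else "archive"
    print(f"O={O}: processing at the {faster} is faster")
```

With these sample values, the low-complexity case favors reducing at the archive and the high-complexity case favors moving all of the data.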

Scaling Parameters
- Data size reduction ratio: s/S
- Execution slow-down ratio: r/R
- Problem complexity: o/O
- Communication/execution balance: r/(ob)
- When r/(ob) = 1, the data processing rate is the same as the data transmission rate
- Optimal designs have r/(ob) = 1
- Note: r/o is the number of bits/sec that can be processed

Complexity Analysis
- Moving all of the data is faster, $T(\mathrm{Super}) < T(\mathrm{Archive})$, for a sufficiently complex analysis:
  $$O > \frac{o\,(1 - s/S)\,[1 + r/R + r/(ob)]}{1 - r/R}$$
- Note: as the execution ratio r/R approaches 1, the required complexity becomes infinite
- Also, as the amount of data reduction goes to zero (s approaches S), the required complexity goes to zero
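
The criterion can be recovered by comparing the time to move all S bytes and reduce at rate R with the time to reduce at rate r before transmission. This is a sketch of the algebra using only the symbols defined in the slides; the intermediate steps are a reconstruction, not part of the original presentation.

```latex
\begin{align*}
\frac{oS}{r} + \frac{S}{b} + \frac{oS}{R} + \frac{OS}{R}
  &< \frac{OS}{r} + \frac{os}{r} + \frac{s}{b} + \frac{os}{R}
  && \text{(the common read term $S/B$ cancels)} \\
(S - s)\left(\frac{o}{r} + \frac{1}{b} + \frac{o}{R}\right)
  &< OS\left(\frac{1}{r} - \frac{1}{R}\right) \\
o\left(1 - \frac{s}{S}\right)\left[1 + \frac{r}{R} + \frac{r}{ob}\right]
  &< O\left(1 - \frac{r}{R}\right)
  && \text{(multiply both sides by $r/S$)} \\
O &> \frac{o\,(1 - s/S)\,[1 + r/R + r/(ob)]}{1 - r/R}
  && \text{(divide by $1 - r/R > 0$, since $r < R$)}
\end{align*}
```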

Bandwidth Optimization
- Moving all of the data is faster, $T(\mathrm{Super}) < T(\mathrm{Archive})$, for a sufficiently fast network:
  $$b > \frac{(r/O)\,(1 - s/S)}{1 - r/R - (o/O)\,(1 + r/R)\,(1 - s/S)}$$
- Note: the denominator changes sign when $O < o\,(1 + r/R)\,(1 - s/S) / (1 - r/R)$
- Even with an infinitely fast network, it is better to do the processing at the archive if the complexity is too small
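
The bandwidth criterion, including the case where no network is fast enough, can be sketched as a small function. This is an illustrative example under stated assumptions: the function name and sample values are hypothetical, and returning infinity when the denominator is not positive is one way to encode "process at the archive regardless of bandwidth".

```python
import math


def min_network_bandwidth(S, s, r, R, O, o):
    """Bandwidth b above which moving all of the data is faster.

    Returns math.inf when the denominator of the criterion is not
    positive, i.e. when even an infinitely fast network does not help.
    """
    reduction = 1.0 - s / S
    denominator = 1.0 - r / R - (o / O) * (1.0 + r / R) * reduction
    if denominator <= 0.0:
        return math.inf
    return (r / O) * reduction / denominator


# Hypothetical sample values: high complexity, then low complexity.
print(min_network_bandwidth(S=8e12, s=8e10, r=1e9, R=1e11, O=100.0, o=1.0))
print(min_network_bandwidth(S=8e12, s=8e10, r=1e9, R=1e11, O=1.0, o=1.0))
```

For the low-complexity sample the function returns infinity, matching the slide's remark about an infinitely fast network.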

Execution Rate Optimization
- Moving all of the data is faster, $T(\mathrm{Super}) < T(\mathrm{Archive})$, for a sufficiently fast supercomputer:
  $$R > \frac{r\,[1 + (o/O)\,(1 - s/S)]}{1 - (o/O)\,(1 - s/S)\,[1 + r/(ob)]}$$
- Note: the denominator changes sign when $O < o\,(1 - s/S)\,[1 + r/(ob)]$
- Even with an infinitely fast supercomputer, it is better to process at the archive if the complexity is too small

Data Reduction Optimization
- Moving all of the data is faster, $T(\mathrm{Super}) < T(\mathrm{Archive})$, when the data reduction is small enough:
  $$s > S \left\{ 1 - \frac{(O/o)\,(1 - r/R)}{1 + r/R + r/(ob)} \right\}$$
- Note: the criterion changes sign when $O > o\,[1 + r/R + r/(ob)] / (1 - r/R)$
- When the complexity is sufficiently large, it is faster to process on the supercomputer even when the data can be reduced to one bit
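
The data reduction criterion can be evaluated the same way. A minimal sketch, assuming consistent units for all parameters; the function name and sample values are hypothetical, and clamping the threshold at zero is an illustrative way to express the case where any amount of reduction still favors the supercomputer.

```python
def min_reduced_size(S, b, r, R, O, o):
    """Reduced size s above which moving all of the data is faster.

    A result of 0.0 means the criterion has changed sign: moving all
    of the data wins no matter how small the reduced data set is.
    """
    ratio = (O / o) * (1.0 - r / R) / (1.0 + r / R + r / (o * b))
    return max(S * (1.0 - ratio), 0.0)


# Hypothetical sample values: low complexity, then high complexity.
print(min_reduced_size(S=8e12, b=8e8, r=1e9, R=1e11, O=1.0, o=1.0))
print(min_reduced_size(S=8e12, b=8e8, r=1e9, R=1e11, O=100.0, o=1.0))
```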

Is the Future Environment a Metacomputer or a Digital Library?
- Sufficiently high complexity
  - Move data to processing engine
  - Digital Library execution of remote services
  - Traditional supercomputer processing of applications
- Sufficiently low complexity
  - Move process to the data source
  - Metacomputing execution of remote applications
  - Traditional digital library service

The IBM Digital Library Architecture
[Diagram labels: Application (DL client); Library Server; Metadata in DB2 or Oracle; Text and Image indices; "Federated" search; Object Server; Videocharger; DB2; ADSM; Oracle; Distributed storage resources (SRB); (MCAT)]

Generalization of Digital Library
- Scaling transparency
  - Support for arbitrary size data sets
  - Support for arbitrary data type
- Location transparency
  - Access to remote data
  - Access to heterogeneous (non-uniform) storage systems
  - Remove restriction of local disk space size
- Name service transparency
  - Support for multiple views (naming conventions) for data
- Presentation transparency
  - Support for alternate representations of data

Describing Information Content

State-of-the-art Information Management: Digital Library

High Performance Storage
- Provide access to tertiary storage: scale size of repository
  - Disk caches
  - Tape robots
  - Manage migration of data between disk and tape
- High Performance Storage System (IBM)
  - Provides service classes
  - Support for parallel I/O
  - Support for terabyte-sized data sets
  - Provide recoverable name space

State-of-the-art Storage: HPSS
- Store Teraflops computer output
  - Growth TB data per year
  - Data access rate: 7 TB/day = 80 MB/sec
  - 2-week data cache: 10 TB
- Scalable control platform
  - 8-node SP (32 processors)
- Support digital libraries
  - Support for millions of data sets
  - Integration with database meta-data catalogs
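
The quoted access rate can be checked with a unit conversion. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes); the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day_to_mb_per_sec(tb_per_day):
    """Convert a sustained rate in TB/day to MB/sec (decimal units)."""
    bytes_per_day = tb_per_day * 1e12
    return bytes_per_day / SECONDS_PER_DAY / 1e6


print(f"{tb_per_day_to_mb_per_sec(7):.1f} MB/sec")
```

Under these assumptions the result is about 81 MB/sec, in line with the slide's rounded figure of 80 MB/sec.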

HPSS Archival Storage System
[Diagram labels: High Performance Gateway Node; High Node and Wide Node disk movers with HiPPI drivers; SSA RAID arrays (54 GB, 108 GB, 160 GB); 830 GB MaxStrat RAID; Silver Nodes running Storage / Purge, Bitfile / Migration, Nameservice / PVL, Log Daemon, Tape / disk mover, DCE / FTP / HIS, and Log Client; RS6000 Tape Mover and PVR (9490); HiPPI Switch; TrailBlazer3 Switch; 3490 Tape; Magstar 3590 Tape; 9490 Robots with four, seven, and eight tape drives]

HPSS Bandwidths
- SDSC has achieved: [table not recovered]
- Striping required to achieve desired I/O rates

Turning Archives into Digital Libraries
- Meta-data based access to data sets
- Support for application of methods (procedures) to data sets
- Support for information discovery
- Support for publication of data sets
- Research issue: optimization of data distribution between database and archive

DB2/HPSS Integration
- Collaboration with IBM TJ Watson Research Center: Ming-Ling Lo, Sriram Padmanabhan, Vibby Gottemukkala
- Features:
  - Prototype, works with DB2 UDB (Version 5)
  - DB2 is able to use an HPSS file as a tablespace container
  - DB2 handles DCE authentication to HPSS
  - Regular as well as long (LOB) data can be stored in HPSS
  - Optional disk buffer between DB2 and HPSS
- [Diagram labels: Database Table (columns C1 through C5); DB2; DB2 disk buffer; HPSS; HPSS disk cache]

Generalizing Digital Libraries
- SRB: location transparency
  - Access to heterogeneous systems
  - Access to remote systems
- MCAT: name service transparency
  - Extensible schema support
- MIX: presentation transparency
  - Mediation of information with XML
  - Support for semi-structured data
- Access scaling
  - MPI-I/O access to data sets using parallel I/O

SRB Software Architecture
[Diagram labels: Application (SRB client); SRB APIs; SRB; User Authentication; Dataset Location; Access Control; Type; Replication; Logging; Metadata Catalog (MCAT); UniTree; HPSS; DB2; Illustra; Unix]

14 Installed SRB Sites
[Map labels: Rutgers; NCSA; Montana State University; Large Archives]

SRB / MCAT Features
- Support for collection hierarchy
  - Allows grouping of heterogeneous data sets into a single logical collection
  - Hierarchical access control, with ticket mechanism
- Replication
  - Optional replication at the time of creation
  - Can choose replica on read
- Proxy operations
  - Supports proxy (remote) move and copy operations
- Monitoring capability
- Supports storing/querying of system- and user-defined “metadata” for data sets and resources
  - API for ad hoc querying of metadata
  - Ability to extend schemas and define new schemas
  - Ability to associate data sets with multiple metadata schemas
  - Ability to relate attributes across schemas
- Implemented in Oracle and DB2
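
To make the ideas of a logical collection, replica choice on read, and metadata queries concrete, here is a toy in-memory sketch. It is not the SRB or MCAT API: every class, method, resource name, and path below is hypothetical and exists only to illustrate the concepts listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # hypothetical storage resource name, e.g. "hpss"
    physical_path: str  # where this copy lives on that resource


@dataclass
class DataSet:
    logical_name: str
    replicas: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


class ToyCatalog:
    """Maps logical collections to data sets, replicas, and metadata."""

    def __init__(self):
        self._collections = {}  # collection name -> {logical name -> DataSet}

    def register(self, collection, logical_name, replica, **metadata):
        datasets = self._collections.setdefault(collection, {})
        dataset = datasets.setdefault(logical_name, DataSet(logical_name))
        dataset.replicas.append(replica)
        dataset.metadata.update(metadata)
        return dataset

    def choose_replica(self, collection, logical_name, preferred=None):
        """Pick the replica on the preferred resource, else the first one."""
        dataset = self._collections[collection][logical_name]
        for replica in dataset.replicas:
            if replica.resource == preferred:
                return replica
        return dataset.replicas[0]

    def query(self, **criteria):
        """Return logical names whose metadata matches every criterion."""
        return sorted(
            dataset.logical_name
            for datasets in self._collections.values()
            for dataset in datasets.values()
            if all(dataset.metadata.get(k) == v for k, v in criteria.items())
        )


catalog = ToyCatalog()
catalog.register("/demo", "run1.dat", Replica("unix", "/data/run1.dat"), kind="raw")
catalog.register("/demo", "run1.dat", Replica("hpss", "/archive/run1.dat"))
print(catalog.choose_replica("/demo", "run1.dat", preferred="hpss").physical_path)
print(catalog.query(kind="raw"))
```

The client refers to the data set only by collection and logical name; which physical copy is read is decided at access time.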

MCAT Schema Integration
- Publish schema for each collection
  - Clusters of attributes form a table
  - Tables implement the schema
- Use Tokens to define semantic meaning
  - Associate Token with each attribute
- Use DAG to automate queries
  - Specify directed linkage between clusters of attributes
- Tokens - Clusters - Attributes
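
One plausible reading of "use DAG to automate queries" is that the directed links between attribute clusters tell the catalog which tables to join when a query touches attributes from different clusters. The sketch below illustrates that reading with a breadth-first search; the cluster names and the function are hypothetical and are not taken from MCAT.

```python
from collections import deque


def join_path(links, start, goal):
    """Find a chain of linked clusters from start to goal, or None.

    `links` maps each cluster to the clusters it points to, so the
    search follows the directed linkage only in its stated direction.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in links.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical attribute clusters and directed links between them.
links = {
    "collection": ["dataset"],
    "dataset": ["replica", "user_metadata"],
    "replica": ["resource"],
}
print(join_path(links, "collection", "resource"))
```

A query that selects on a collection attribute and returns a resource attribute would, under this reading, join the tables along the returned path.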

Publishing A New Schema

Adding Attributes to the New Schema

Displaying Attributes From Selected Schemas

Security
- Integration of SDSC Encryption Authentication system (SEA) with Globus GSI
  - Kerberos within security domain
  - Globus for inter-realm authentication
- Access control lists per data set
- Audit trails of usage
- Need support for third-party authentication
  - User A accesses data under the control of digital library B when the data is stored at site C

MIX: Mediation of Information using XML
- Support for “active” views
- Wrapper: convert XMAS query to local query language, and data in native format to XML
- [Diagram labels: BBQ Interface; Active View 1; Active View 2; XMAS query; Mediator; XMAS query “fragment”; Wrapper; XML data; SQL Database; Spreadsheet; HTML files; Local Data Repository]
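
The wrapper's second job, presenting data in its native format as XML, can be illustrated with the Python standard library. This is a sketch only: it does not implement XMAS or any MIX component, the function and element names are hypothetical, and it assumes column names are valid XML tag names.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(collection_tag, rows):
    """Wrap tabular rows (a list of dicts) as a single XML document string."""
    root = ET.Element(collection_tag)
    for row in rows:
        record = ET.SubElement(root, "record")
        for column, value in row.items():
            # Each column becomes a child element holding the value as text.
            ET.SubElement(record, column).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical rows, as they might come back from a local SQL query.
rows = [{"name": "run1.dat", "size": 1024}]
print(rows_to_xml("datasets", rows))
```

A mediator could then combine such XML fragments from several wrappers without knowing the native format of each source.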

Integration of Digital Library with Metacomputing Systems
- NTON OC-192 network (LLNL - Caltech - SDSC)
- HPSS archive
- Globus metacomputing system
- SRB data handling system
- MCAT extensible metadata
- MIX semi-structured data mediation using XML
- ICE collaboration environment
- Feature extraction

Data Intensive and High-Performance Distributed Computing
[Diagram labels: Application Toolkits; Domain Specific Services Layer; Generic Services Layer; Resources Layer; Local Resource Management; Information Services; Data Repositories; Fault Detection; Resource Management; Network Caching; Metadata; Communication Libs.; Grid-enabled Libs; Visualization; Resource Discovery; Resource Brokering; End-to-End QoS; Remote Data Access; Interdomain Security; Scheduling]

Research Activities
- Support for remote execution of data manipulation procedures
  - Globus - SRB integration
- Automated feature extraction
  - XML based tagging of features
  - XML query language for storing attributes into the Intelligent Archive
- Integration with RIO: parallel I/O transport

Views of Software Infrastructure
- Software infrastructure supports user applications
  - Reason for existence of software is to provide explicit capabilities required by applications
- What is the user perspective for building new software systems?
- Is the integration of digital library and metacomputing systems the final version?

Software Integration Projects
- NSF Computational Grid: middleware using distributed state information to support metacomputing services
- DOE Data Visualization Corridor: collaboratively visualize multi-terabyte sized data sets
- NASA Information Power Grid: integrate data repositories with applications and visualization systems
- DARPA Quorum: provide quality of service guarantees

User Requirements - Five Software Environments
- Code Development: resources support
- Run-time: parallel tools and libraries
- Distributed Run-Time: metacomputing environment
- Interaction Environments: collaboration, presentation
- Publication / Discovery / Retrieval: data intensive computing environment

Metacomputing Environment: Data Flow Perspective
[Diagram layers: Archival Storage System; Remote Data Manipulation; Data Handling System; Data Staging System; Data Caching System; Distributed Execution Environment; Object Oriented Interface; Application]

Publication Environment: Data Flow Perspective
[Diagram layers: Archival Storage System; Remote Data Manipulation; Data Handling System; Collection Management Software; Digital Library Services; Data Set Constructor; Run-time Access; Application]

Run-time Environment: Data Flow Perspective
[Diagram layers: Archival Storage System; Data Handling System; Data Caching System; Library Interoperation; Data Structures Library; Memory Tiling; Parallel I/O Library; Application]

Interaction Environment: Data Flow Perspective
[Diagram layers: Archival Storage System; Data Manipulation System; Data Caching System; Data Formatting System; Rendering System; Visualization Environment; Collaboration Environment; Application]

Taxonomy of User Requirements

Comparison of Environments

Comparison of Environments

PACI Environments

PACI Environments

PACI Environments

Future Systems
- Automation of
  - Information discovery
  - Application execution
  - Publication of results
- Integration of
  - Code development
  - Run-time support
  - Distributed computing
  - Collaborative analysis
  - Information publication

Further Information