National Partnership for Advanced Computational Infrastructure Data Intensive Computing Information Based Computing Digital Libraries / Metacomputing Services.

1 National Partnership for Advanced Computational Infrastructure Data Intensive Computing Information Based Computing Digital Libraries / Metacomputing Services Reagan W. Moore San Diego Supercomputer Center moore@sdsc.edu http://www.npaci.edu/DICE

2 National Partnership for Advanced Computational Infrastructure Distributed Archives Application Digital Library Data Mining Information Based Computing Information Discovery Collection Building

3 National Partnership for Advanced Computational Infrastructure Co-evolution of Technology Supercomputer Centers and Digital Libraries Both support large scale processing & storage of data Will the supercomputer centers of the future be digital libraries?

4 National Partnership for Advanced Computational Infrastructure Researchers Chaitanya Baru Amarnath Gupta Bertram Ludaescher Richard Marciano Yannis Papakonstantinou Arcot Rajasekar Wayne Schroeder Michael Wan

5 National Partnership for Advanced Computational Infrastructure Outline Two views of computing Execution environment - metacomputing systems Data Management environment - digital library Analysis for moving data to the process or the process to the data Data Management Environment Information Based Computing

6 National Partnership for Advanced Computational Infrastructure Digital Libraries Multimedia / GIS / MVD / XML / LDAP / CORBA / Z39.50 Publication / Services Environment Presentation Interface Object Based Information Model Data Management for publication Data Resources Parallel I/O - MPI Constructors: turning data sets into objects Data Resources Data Management for execution Metacomputing Environment Execution Environment

7 National Partnership for Advanced Computational Infrastructure Choice between Environments Should we provide services for manipulating information Move the process to the data Should we provide execution environments Move data to the process

8 National Partnership for Advanced Computational Infrastructure Data Distribution Comparison
Data Handling Platform: execution rate r
Supercomputer: execution rate R, with r < R
Bandwidths linking the systems are B (data to data handling platform) and b (data handling platform to supercomputer)
Operations per bit for analysis is O
Operations per bit for data transfer is o
Reduce size of data from S bytes to s bytes and analyze
Should the data reduction be done before transmission?

9 National Partnership for Advanced Computational Infrastructure Distributing Services
Compare times for analyzing data with size reduction from S to s
Reduce on the Data Handling Platform, then transmit to the Supercomputer:
    Read Data       S / B
    Reduce Data     O S / r
    Transmit Data   o s / r
    Network         s / b
    Receive Data    o s / R
Transmit from the Data Handling Platform, then reduce on the Supercomputer:
    Read Data       S / B
    Transmit Data   o S / r
    Network         S / b
    Receive Data    o S / R
    Reduce Data     O S / R

10 National Partnership for Advanced Computational Infrastructure Comparison of Time
Processing at archive (reduce S to s, then transmit s): T(Archive) = S/B + OS/r + os/r + s/b + os/R
Processing at supercomputer (transmit all of S, then reduce): T(Super) = S/B + oS/r + S/b + oS/R + OS/R

11 National Partnership for Advanced Computational Infrastructure Optimization Parameter Selection
Have algebraic equation with eight independent variables.
T(Super) < T(Archive)
S/B + oS/r + S/b + oS/R + OS/R < S/B + OS/r + os/r + s/b + os/R
Which variable provides the simplest optimization criterion?
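The two timing expressions can be evaluated directly. The following is a minimal sketch, not code from the presentation: the class and function names are illustrative, the numeric values are invented, and all quantities are assumed to be in mutually consistent units. The functions are named after what each path does (reduce first, or move everything) rather than after the slide labels.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scenario:
    """Symbols follow the slides; the values used below are made up."""
    S: float  # size of the raw data
    s: float  # size of the reduced data
    B: float  # bandwidth for reading the data
    b: float  # bandwidth of the network to the supercomputer
    r: float  # execution rate of the data handling platform
    R: float  # execution rate of the supercomputer (r < R)
    O: float  # operations per bit for the analysis
    o: float  # operations per bit for data transfer


def time_reduce_then_send(p: Scenario) -> float:
    """Read S, reduce on the data handling platform, send only s."""
    return p.S / p.B + p.O * p.S / p.r + p.o * p.s / p.r + p.s / p.b + p.o * p.s / p.R


def time_send_then_reduce(p: Scenario) -> float:
    """Read S, send all of S, reduce on the supercomputer."""
    return p.S / p.B + p.o * p.S / p.r + p.S / p.b + p.o * p.S / p.R + p.O * p.S / p.R


simple = Scenario(S=1e12, s=1e9, B=1e8, b=1e7, r=1e9, R=1e11, O=10.0, o=1.0)
complex_analysis = replace(simple, O=1000.0)

for label, case in (("O = 10", simple), ("O = 1000", complex_analysis)):
    print(label, time_reduce_then_send(case), time_send_then_reduce(case))
```

With these invented numbers, reducing at the data handling platform wins for the simple analysis and moving all of the data wins for the complex one.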

12 National Partnership for Advanced Computational Infrastructure Scaling Parameters
Data size reduction ratio: s/S
Execution slow down ratio: r/R
Problem complexity: o/O
Communication/Execution balance: r/(ob)
When r/(ob) = 1, the data processing rate is the same as the data transmission rate. Optimal designs have r/(ob) = 1. Note (r/o) is the number of bits/sec that can be processed.
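A small sketch of how the four ratios might be computed for one configuration. The function name, the dictionary keys, and the numbers are assumptions chosen only so that the balance ratio comes out to 1.

```python
def scaling_parameters(S, s, r, R, O, o, b):
    """Return the four dimensionless ratios named on the slide."""
    return {
        "data_size_reduction": s / S,
        "execution_slow_down": r / R,
        "problem_complexity": o / O,
        "balance": r / (o * b),
    }


# Invented values: the platform processes r/o = 1e8 bits/sec, matching b.
ratios = scaling_parameters(S=1e12, s=1e9, r=1e9, R=1e11, O=100.0, o=10.0, b=1e8)
print(ratios)
```

Here r/o equals b, so the platform can process bits exactly as fast as the network can carry them.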

13 National Partnership for Advanced Computational Infrastructure Complexity Analysis
Moving all of the data is faster, T(Super) < T(Archive), for a sufficiently complex analysis:
O > o (1 - s/S) [1 + r/R + r/(ob)] / (1 - r/R)
Note, as the execution ratio r/R approaches 1, the required complexity becomes infinite. Also, as the amount of data reduction goes to zero (s/S approaches 1), the required complexity goes to zero.
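The complexity criterion follows from the two timing expressions in a few steps. The derivation below is a reconstruction, not taken from the slides; the subscripts "move" (all of S is sent and then reduced on the supercomputer) and "reduce" (S is reduced to s before sending) are labels introduced here to describe what each path does.

```latex
\begin{align*}
T_{\text{move}}   &= \frac{S}{B} + \frac{oS}{r} + \frac{S}{b} + \frac{oS}{R} + \frac{OS}{R}, \\
T_{\text{reduce}} &= \frac{S}{B} + \frac{OS}{r} + \frac{os}{r} + \frac{s}{b} + \frac{os}{R}.
\end{align*}
The read term $S/B$ cancels, so $T_{\text{move}} < T_{\text{reduce}}$ is equivalent to
\begin{align*}
(S - s)\left(\frac{o}{r} + \frac{1}{b} + \frac{o}{R}\right)
  &< O S \left(\frac{1}{r} - \frac{1}{R}\right).
\end{align*}
Multiplying both sides by $r/(oS)$ gives the dimensionless form
\begin{align*}
\left(1 - \frac{s}{S}\right)\left(1 + \frac{r}{R} + \frac{r}{ob}\right)
  &< \frac{O}{o}\left(1 - \frac{r}{R}\right),
\end{align*}
and, since $r < R$,
\begin{align*}
O &> o\,\frac{\left(1 - \frac{s}{S}\right)\left(1 + \frac{r}{R} + \frac{r}{ob}\right)}{1 - \frac{r}{R}}.
\end{align*}
```

Only the four scaling ratios and o survive, which is why B does not appear in any of the criteria.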

14 National Partnership for Advanced Computational Infrastructure Bandwidth Optimization
Moving all of the data is faster, T(Super) < T(Archive), for a sufficiently fast network:
b > (r/O) (1 - s/S) / [1 - r/R - (o/O) (1 + r/R) (1 - s/S)]
Note the denominator changes sign when O < o (1 + r/R) (1 - s/S) / (1 - r/R)
Even with an infinitely fast network, it is better to do the processing at the archive if the complexity is too small.
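A sketch of the bandwidth criterion as a function. The function name, the `reduction` parameter (standing for s/S), and the sample values are assumptions; returning `None` is one way to represent the case where the denominator is not positive and no network speed is enough.

```python
def min_network_bandwidth(r, R, O, o, reduction):
    """Smallest b for which moving all of the data is faster.

    reduction is the ratio s/S. Returns None when the denominator is
    not positive, i.e. even an infinitely fast network does not help.
    """
    denominator = 1 - r / R - (o / O) * (1 + r / R) * (1 - reduction)
    if denominator <= 0:
        return None
    return (r / O) * (1 - reduction) / denominator


# Invented values: a complex analysis, then a trivial one.
b_min = min_network_bandwidth(r=1e9, R=1e11, O=1000.0, o=1.0, reduction=1e-3)
no_b = min_network_bandwidth(r=1e9, R=1e11, O=1.0, o=1.0, reduction=1e-3)
print(b_min, no_b)
```

For the complex analysis the threshold is roughly 1.01e6 in the same units as r/O; for the trivial analysis the function returns `None`.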

15 National Partnership for Advanced Computational Infrastructure Execution Rate Optimization
Moving all of the data is faster, T(Super) < T(Archive), for a sufficiently fast supercomputer:
R > r [1 + (o/O) (1 - s/S)] / {1 - (o/O) (1 - s/S) [1 + r/(ob)]}
Note the denominator changes sign when O < o (1 - s/S) [1 + r/(ob)]
Even with an infinitely fast supercomputer, it is better to process at the archive if the complexity is too small.
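The execution rate criterion can be sketched the same way. The function name, the `reduction` parameter (s/S), and the numbers are illustrative assumptions.

```python
def min_supercomputer_rate(r, b, O, o, reduction):
    """Smallest R for which moving all of the data is faster.

    reduction is the ratio s/S. Returns None when the denominator is
    not positive, i.e. no supercomputer is fast enough.
    """
    denominator = 1 - (o / O) * (1 - reduction) * (1 + r / (o * b))
    if denominator <= 0:
        return None
    return r * (1 + (o / O) * (1 - reduction)) / denominator


# Invented values: r/(ob) = 100, so the network is the bottleneck.
R_min = min_supercomputer_rate(r=1e9, b=1e7, O=1000.0, o=1.0, reduction=1e-3)
no_R = min_supercomputer_rate(r=1e9, b=1e7, O=50.0, o=1.0, reduction=1e-3)
print(R_min, no_R)
```

With these values the supercomputer needs to be only about 11% faster than the data handling platform when O = 1000, while for O = 50 the function returns `None`.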

16 National Partnership for Advanced Computational Infrastructure Data Reduction Optimization
Moving all of the data is faster, T(Super) < T(Archive), when the data reduction is small enough:
s > S {1 - (O/o) (1 - r/R) / [1 + r/R + r/(ob)]}
Note the criterion changes sign when O > o [1 + r/R + r/(ob)] / (1 - r/R)
When the complexity is sufficiently large, it is faster to process on the supercomputer even when data can be reduced to one bit.
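A sketch of the data reduction criterion, expressed as a threshold on the ratio s/S. The function name and the sample values are assumptions. A negative result means every possible reduction, down to a single bit, still favors the supercomputer.

```python
def reduction_threshold(r, R, b, O, o):
    """Value of s/S above which moving all of the data is faster.

    A negative result means moving all of the data is faster for any s.
    """
    return 1 - (O / o) * (1 - r / R) / (1 + r / R + r / (o * b))


# Invented values: a simple analysis, then a complex one.
simple_case = reduction_threshold(r=1e9, R=1e11, b=1e7, O=10.0, o=1.0)
complex_case = reduction_threshold(r=1e9, R=1e11, b=1e7, O=1000.0, o=1.0)
print(simple_case, complex_case)
```

In the simple case the threshold is about 0.90: moving everything pays off only if the reduced data would still be more than roughly 90% of the original size.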

17 National Partnership for Advanced Computational Infrastructure Is the Future Environment a Metacomputer or a Digital Library? Sufficiently high complexity Move data to processing engine Digital Library execution of remote services Traditional supercomputer processing of applications Sufficiently low complexity Move process to the data source Metacomputing execution of remote applications Traditional digital library service

18 National Partnership for Advanced Computational Infrastructure The IBM Digital Library Architecture Application (DL client) Metadata in DB2 or Oracle Videocharger DB2 ADSM Oracle Library Server Text and Image indices “Federated” search Object Server Distributed storage resources (SRB) (MCAT)

19 National Partnership for Advanced Computational Infrastructure Generalization of Digital Library Scaling transparency Support for arbitrary size data sets Support for arbitrary data type Location transparency Access to remote data Access to heterogeneous (non-uniform) storage systems Remove restriction of local disk space size Name service transparency Support for multiple views (naming conventions) for data Presentation transparency Support for alternate representations of data

20 National Partnership for Advanced Computational Infrastructure Describing Information Content

21 National Partnership for Advanced Computational Infrastructure State-of-the-art Information Management: Digital Library

22 National Partnership for Advanced Computational Infrastructure High Performance Storage Provide access to tertiary storage - scale size of repository Disk caches Tape robots Manage migration of data between disk and tape High Performance Storage System - IBM Provides service classes Support for parallel I/O Support for terabyte sized data sets Provide recoverable name space

23 National Partnership for Advanced Computational Infrastructure State-of-the-art Storage: HPSS Store Teraflops computer output Growth - 200 TB data per year Data access rate - 7 TB/day = 80 MB/sec 2-week data cache - 10 TB Scalable control platform 8-node SP (32 processors) Support digital libraries Support for millions of data sets Integration with database meta-data catalogs
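The quoted rates can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes); reading the 10 TB cache as holding two weeks of newly written data is an interpretation, not a statement from the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TB = 1e12  # assumed decimal terabyte
MB = 1e6   # assumed decimal megabyte

# 7 TB/day expressed as a sustained rate in MB/sec.
access_rate_mb_per_s = 7 * TB / SECONDS_PER_DAY / MB

# Data written in two weeks at 200 TB per year.
two_week_growth_tb = 200 * 14 / 365

print(round(access_rate_mb_per_s, 1), round(two_week_growth_tb, 1))
```

The access rate comes out at about 81 MB/sec, consistent with the 80 MB/sec on the slide, and two weeks of growth is about 7.7 TB, which fits within a 10 TB cache.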

24 National Partnership for Advanced Computational Infrastructure HPSS Archival Storage System
[Diagram of the HPSS configuration. Components shown:
High Performance Gateway Node
High Node and Wide Node, each running a disk mover with HiPPI driver
Silver Node: Storage / Purge, Bitfile / Migration, Nameservice / PVL, Log Daemon
Seven Silver Nodes: Tape / disk mover, DCE / FTP / HIS, Log Client
RS6000: Tape Mover, PVR (9490)
Disk: SSA RAID units of 54 GB, 108 GB, and 160 GB; 830 GB MaxStrat RAID
Tape: 3490 Tape; Magstar 3590 Tape; 9490 Robots with four, seven, and eight tape drives
Switches: HiPPI Switch, TrailBlazer3 Switch]

25 National Partnership for Advanced Computational Infrastructure SDSC has achieved: Striping required to achieve desired I/O rates HPSS Bandwidths

26 National Partnership for Advanced Computational Infrastructure Turning Archives into Digital Libraries Meta-data based access to data sets Support for application of methods (procedures) to data sets Support for information discovery Support for publication of data sets Research issue - optimization of data distribution between database and archive

27 National Partnership for Advanced Computational Infrastructure DB2/HPSS Integration
[Diagram: database table with columns C1 to C5; DB2 with DB2 disk buffer; HPSS with HPSS disk cache]
Collaboration with IBM TJ Watson Research Center: Ming-Ling Lo, Sriram Padmanabhan, Vibby Gottemukkala
Features:
Prototype, works with DB2 UDB (Version 5)
DB2 is able to use an HPSS file as a tablespace container
DB2 handles DCE authentication to HPSS
Regular as well as long (LOB) data can be stored in HPSS
Optional disk buffer between DB2 and HPSS

28 National Partnership for Advanced Computational Infrastructure Generalizing Digital Libraries SRB - Location transparency Access to heterogeneous systems Access to remote systems MCAT - Name service transparency Extensible Schema support MIX - Presentation transparency Mediation of information with XML Support for semi-structured data Access scaling MPI-I/O access to data sets using parallel I/O

29 National Partnership for Advanced Computational Infrastructure SRB Software Architecture
[Diagram. Components shown: Application (SRB client); SRB APIs; SRB with User Authentication, Dataset Location, Access Control, Type, Replication, Logging; Metadata Catalog (MCAT); storage systems UniTree, HPSS, DB2, Illustra, Unix]

30 National Partnership for Advanced Computational Infrastructure 14 Installed SRB Sites Rutgers NCSA Montana State University Large Archives

31 National Partnership for Advanced Computational Infrastructure SRB / MCAT Features
Support for Collection hierarchy: allows grouping of heterogeneous data sets into a single logical collection; hierarchical access control, with ticket mechanism
Replication: optional replication at the time of creation; can choose replica on read
Proxy operations: supports proxy (remote) move and copy operations
Monitoring capability
Supports storing/querying of system- and user-defined “metadata” for data sets and resources
API for ad hoc querying of metadata
Ability to extend schemas and define new schemas
Ability to associate data sets with multiple metadata schemas
Ability to relate attributes across schemas
Implemented in Oracle and DB2

32 National Partnership for Advanced Computational Infrastructure MCAT Schema Integration Publish schema for each collection Clusters of attributes form a table Tables implement the schema Use Tokens to define semantic meaning Associate Token with each attribute Use DAG to automate queries Specify directed linkage between clusters of attributes Tokens - Clusters - Attributes
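One way to picture how directed links between clusters could automate queries is to treat each cluster as a table and search the link graph for the chain of joins connecting two clusters. The sketch below is purely illustrative and is not MCAT's implementation: every cluster, attribute, token, and join key name is hypothetical.

```python
from collections import deque

# Hypothetical clusters: each cluster of attributes forms one table.
CLUSTERS = {
    "collection": ["collection_name", "owner"],
    "dataset": ["dataset_name", "size"],
    "replica": ["replica_number"],
    "resource": ["resource_name", "resource_type"],
}

# Hypothetical tokens giving each attribute a semantic meaning.
TOKENS = {
    "collection_name": "title",
    "dataset_name": "title",
    "owner": "creator",
    "resource_type": "storage class",
}

# Hypothetical directed links between clusters, labelled with a join key.
LINKS = {
    "collection": {"dataset": "collection_id"},
    "dataset": {"replica": "dataset_id"},
    "replica": {"resource": "resource_id"},
    "resource": {},
}


def attributes_with_token(token):
    """All attributes, across clusters, that share one semantic token."""
    return sorted(a for a, t in TOKENS.items() if t == token)


def join_path(start, goal):
    """Breadth-first search for a chain of directed links from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        cluster, path = queue.popleft()
        if cluster == goal:
            return path
        for neighbour, key in LINKS[cluster].items():
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [(cluster, neighbour, key)]))
    return None  # no directed chain exists


print(join_path("collection", "resource"))
print(attributes_with_token("title"))
```

Because the links are directed, a path exists from `collection` to `resource` but not in the reverse direction. A query generator could walk the returned chain to emit the joins without the user naming any tables.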

33 National Partnership for Advanced Computational Infrastructure Publishing A New Schema

34 National Partnership for Advanced Computational Infrastructure Adding Attributes to the New Schema

35 National Partnership for Advanced Computational Infrastructure Displaying Attributes From Selected Schemas

36 National Partnership for Advanced Computational Infrastructure Security Integration of SDSC Encryption Authentication system (SEA) with Globus GSI Kerberos within security domain Globus for inter-realm authentication Access control lists per data set Audit trails of usage Need support for third-party authentication User A accesses data under the control of digital library B when the data is stored at site C

37 National Partnership for Advanced Computational Infrastructure MIX: Mediation of Information using XML
[Diagram. Components shown: BBQ Interface; Active View 1; Active View 2; Mediator; Wrappers over an SQL Database, a Spreadsheet, and HTML files; Local Data Repository. Messages shown: XMAS query; XMAS query “fragment”; XML data]
Wrapper: convert XMAS query to local query language, and data in native format to XML
Support for “active” views
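The wrapper and mediator roles can be illustrated with a toy example. This is a sketch under heavy assumptions, not the MIX implementation: the function names, element names, and sample data are invented, and a plain field filter stands in for an XMAS query, whose syntax is not shown in the slides.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(source_name, rows):
    """Wrapper step: turn native rows (dicts) into an XML element."""
    source = ET.Element("source", name=source_name)
    for row in rows:
        record = ET.SubElement(source, "record")
        for field, value in row.items():
            ET.SubElement(record, field).text = str(value)
    return source


def spreadsheet_wrapper(csv_text):
    """Wrap a spreadsheet, represented here as CSV text."""
    return rows_to_xml("spreadsheet", csv.DictReader(io.StringIO(csv_text)))


def mediator(wrapped_sources, field, wanted):
    """Mediator step: build one view from records matching a simple filter."""
    view = ET.Element("view")
    for source in wrapped_sources:
        for record in source.findall("record"):
            if record.findtext(field) == wanted:
                view.append(record)
    return view


csv_text = "name,site\nclimate,SDSC\nbrain,NCSA\n"
database_rows = [{"name": "survey", "site": "SDSC"}]  # stands in for SQL results

view = mediator(
    [spreadsheet_wrapper(csv_text), rows_to_xml("database", database_rows)],
    field="site",
    wanted="SDSC",
)
names = [record.findtext("name") for record in view]
print(names)
print(ET.tostring(view, encoding="unicode"))
```

The client sees a single XML view and never needs to know that one record came from a spreadsheet and the other from a database.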

38 National Partnership for Advanced Computational Infrastructure Integration of Digital Library with Metacomputing Systems NTON OC-192 network (LLNL - Caltech - SDSC) HPSS archive Globus metacomputing system SRB data handling system MCAT extensible metadata MIX semi-structured data mediation using XML ICE collaboration environment Feature extraction

39 National Partnership for Advanced Computational Infrastructure INFORMATION SERVICES Data Intensive and High-Performance Distributed Computing Local Resource Management Data Repositories Resources Layer Fault Detection Resource Management Generic Services Layer Domain Specific Services Layer Application Toolkits Network Caching Metadata Communication Libs. Grid-enabled Libs Visualization Resource Discovery Resource Brokering End-to-End QoS Remote Data Access Interdomain Security Scheduling

40 National Partnership for Advanced Computational Infrastructure Research Activities Support for remote execution of data manipulation procedures Globus - SRB integration Automated feature extraction XML based tagging of features XML query language for storing attributes into the Intelligent Archive Integration with RIO - parallel I/O transport

41 National Partnership for Advanced Computational Infrastructure Views of Software Infrastructure Software infrastructure supports user applications Reason for existence of software is to provide explicit capabilities required by applications What is the user perspective for building new software systems? Is the integration of digital library and metacomputing systems the final version?

42 National Partnership for Advanced Computational Infrastructure Software Integration Projects NSF Computational Grid - Middleware using distributed state information to support metacomputing services DOE Data Visualization Corridor - collaboratively visualize multi-terabyte sized data sets NASA Information Power Grid - integrate data repositories with applications and visualization systems DARPA Quorum - provide quality of service guarantees

43 National Partnership for Advanced Computational Infrastructure User Requirements - Five Software Environments Code Development Resources support Run-time Parallel Tools and Libraries Distributed Run-Time Metacomputing environment Interaction Environments Collaboration, presentation Publication / Discovery / Retrieval Data intensive computing environment

44 National Partnership for Advanced Computational Infrastructure Metacomputing Environment Data Flow Perspective Archival Storage System Remote Data Manipulation Data Handling System Data Staging System Data Caching System Distributed Execution Environment Object Oriented Interface Application

45 National Partnership for Advanced Computational Infrastructure Publication Environment Data Flow Perspective Archival Storage System Remote Data Manipulation Data Handling System Collection Management Software Digital Library Services Data Set Constructor Run-time Access Application

46 National Partnership for Advanced Computational Infrastructure Run-time Environment Data Flow Perspective Archival Storage System Data Handling System Data Caching System Library Interoperation Data Structures Library Memory Tiling Parallel I/O Library Application

47 National Partnership for Advanced Computational Infrastructure Interaction Environment Data Flow Perspective Archival Storage System Data Manipulation System Data Caching System Data Formatting System Rendering System Visualization Environment Collaboration Environment Application

48 National Partnership for Advanced Computational Infrastructure Taxonomy of User Requirements

49 National Partnership for Advanced Computational Infrastructure Comparison of Environments

50 National Partnership for Advanced Computational Infrastructure Comparison of Environments

51 National Partnership for Advanced Computational Infrastructure PACI Environments

52 National Partnership for Advanced Computational Infrastructure PACI Environments

53 National Partnership for Advanced Computational Infrastructure PACI Environments

54 National Partnership for Advanced Computational Infrastructure Future Systems Automation of Information discovery Application execution Publication of results Integration of Code Development Run-time support Distributed computing Collaborative analysis Information publication

55 National Partnership for Advanced Computational Infrastructure Further Information http://www.npaci.edu/DICE

