San Diego Supercomputer Center Grid Physics Network (GriPhyN) University of Florida Data Grid Management Systems (DGMS) Arun Swaran Jagatheesan San Diego.

Presentation transcript:

Data Grid Management Systems (DGMS)
Arun Swaran Jagatheesan
San Diego Supercomputer Center, University of California, San Diego
Grid Physics Network (GriPhyN), University of Florida
Tutorial at the Fourth IEEE International Conference on Data Mining (ICDM 2004), Brighton, UK, November 2004

San Diego Supercomputer Center ICDM 2004 University of Florida
Dynamic Calibration of Content
• Academic researchers and students
• Business analysts and Office of the CTO
• Software architects
• Software developers
• Savvy users
Just to make sure all of us get the most out of this tutorial by the time we leave this room.

Some Questions
• What is a data grid?
• Is there a problem worth my (student's) Ph.D. thesis?
• Can data grid technologies help the IT department save cost?
• What are the concepts I should know to design a data grid product?
• What are the undocumented tricks of the trade that worked?
• Are there people using these, or is it still in research labs?
• Been hearing about it a lot. Can I see something for real?

Tutorial Outline
• Introduction to Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
• Gridflows
• DGMS Related Topics
• Q & A Session
• Hands-on/Demo Session

Tutorial Outline
• Introduction to Data Grids
  o The “Grid” Vision
  o Hype/Reality
  o Where: Data Grid Infrastructures in production
  o Why: Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
• Gridflows
• DGMS Related Topics
• Q & A Session
• Hands-on/Demo Session

We All Know the Story
• Information is growing …, information explosion
• In 2002, 5 exabytes of data were produced by mankind (equivalent to all words ever spoken by human beings)
• In 2006, 62 billion e-mails in 800~1335 PB
• Mostly files, e-mails, multimedia
Source: “How Much Information? 2003”, SIMS, UC Berkeley

Enterprise Data Storage
• Distributed in multiple locations and resources
  o Geographically or administratively distributed
• Multiple “autonomous administrative domains”
  o Independent control over operations
  o Business driven: collaborations, acquisitions, etc.
• Heterogeneous storage and data handling systems
  o File systems, databases, archives, e-mail/web servers, …
• Can we have a single logical namespace for data storage?
• Can enterprises reduce TCO by treating data storage as a single utility or infrastructure?

Distributed Computing
(Images courtesy of Computer History Museum)

The “Grid” Vision

Data Grids: Hype / Reality?
• Forrester Research (May 2004): “Data and infrastructure are top of mind for grid at more than 50 percent of firms … the vision of data grids will become part of a greater vision of storage virtualization and information life cycle management”
• CIO magazine (May 2004): “While most people think of computational grids, enterprises are looking into data grids”
• Why talk about Busine$$ in an IEEE conference?
  o Necessity drives business; business drives standards and technology evolution; …
  o Grid is not just technology alone, but also standards evolution (which requires business participation)

Media Perspective
• CNN: The world's biggest brain: Distributed [grid] computing
• BBC: Grid virtualises processing and storage resources and lets people use, or rent, the capacity they need for particular tasks
• Lots more

Visionaries Perspective
• Computing and storage shared amongst autonomous organizations [using a grid-enabled cyberinfrastructure]
• As a utility, like commodities, used within inter/intra organizational collaborations
• Government agencies and policy makers have subscribed to this and don't want their country to be left behind

Academia Perspective
• “Same questions, different answers”
• Parallel distributed computing (with multiple organizations)
  o Autonomous administrative domains
  o Heterogeneous infrastructure
  o Hide heterogeneity using a logical resource namespace
• Change in computing models to take advantage of grid?
  o Large bandwidth, large storage space, large computing power everywhere: does the “large” affect the models/algorithms?
• Old wine in a new bottle
  o It's a solved problem. Nothing new
• “Greed Computing”: use it to get funding

User / Vendor Perspective
• Vendor
  o Yes, it's a brand new paradigm.
  o We were ready for this a long time back. In fact our product X had all the concepts.
  o Do you want to test drive our product and give feedback?
• Users
  o Our resources (human, computer) are distributed world-wide
  o Collaborations that can span multiple resources from autonomous administrations of the same company
  o Reducing the Total Cost of Operation (TCO), flexibility to create or use logical resource pools
  o Automobile industry, bio-tech, electronics, … (mostly distributed teams with multiple locations and very large data/computing)

Reality?
• Navigating the hype wave
  o Touting immediate soft benefits
  o Just an application
• Need for standards
  o Standards that can facilitate new algorithms that take advantage of heterogeneous infrastructure
  o No use without a standard on which interoperable products are developed
• How long is the wait?
  o Will grid computing deliver? How soon? Or is it just hype that will fade away?
  o Some quite promising technologies are already available

Autonomous Administrative Domain
• A grid entity that:
  o Manages one or more grid resources
  o Can make its own policies
  o Might abide by a superior or global policy
  o Can act as a resource provider or requestor or both
• Examples:
  o A department or research lab in a university
  o An HR or finance department of a company (sub-organization)
  o Or simply a single computational or storage resource that manages itself, governed by some policies
• A grid / enterprise contains one or more autonomous administrative domains with distributed heterogeneous resources

Data Grid Resources
• Context (information)
  o Information about digital entities (location, size, owners, …)
  o Relationship between digital entities (replicas, collection, …)
  o Behavior of the digital entities (services)
• Content (data)
  o Structured and unstructured
  o Virtual or derived
• Commodity (producers and consumers)
  o Storage resources
  o Also providers, brokers and requestors

Very Large Scale Data Storage
(Diagram, built up over four slides)
• Grid Resource Providers (GRP) provide content and/or storage
• An autonomous administrative domain (for example, a research lab) contains one or more Grid Resource Providers
• Three domains hold the physical files /…/text1.txt, /…//text2.txt and /txt3.txt:
  o Storage-R-Us Resource Providers: data + storage (50)
  o Finance Department: data + storage (40)
  o Research Lab: data + storage (10)
• Logical namespace (need not be the same as the physical view of resources): data + storage (100)
  o /home/arun.sdsc/exp1
  o /home/arun.sdsc/exp1/text1.txt
  o /home/arun.sdsc/exp1/text2.txt
  o /home/arun.sdsc/exp1/text3.txt
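The logical namespace on this slide can be sketched as a small lookup table. This is only an illustration, not SRB code: the domain names, the capacities (50, 40, 10) and the logical paths come from the slide, while the provider keys, the assignment of files to domains and the physical paths for text1.txt and text2.txt are hypothetical, since the slide elides them.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    capacity: int  # storage units as shown on the slide


# Domains and capacities from the slide; the dictionary keys are made up.
providers = {
    "storage-r-us": Provider("Storage-R-Us Resource Providers", 50),
    "finance": Provider("Finance Department", 40),
    "research-lab": Provider("Research Lab", 10),
}

# Logical path -> (provider key, physical path).
# Which domain holds which file, and the first two physical paths,
# are assumptions made for this sketch.
catalog = {
    "/home/arun.sdsc/exp1/text1.txt": ("storage-r-us", "/vault/a/text1.txt"),
    "/home/arun.sdsc/exp1/text2.txt": ("finance", "/data/b/text2.txt"),
    "/home/arun.sdsc/exp1/text3.txt": ("research-lab", "/txt3.txt"),
}


def resolve(logical_path):
    """Map a logical path to (domain name, physical path)."""
    key, physical = catalog[logical_path]
    return providers[key].name, physical


def list_collection(prefix):
    """List logical paths under a collection, regardless of where they live."""
    root = prefix.rstrip("/") + "/"
    return sorted(p for p in catalog if p.startswith(root))


# The logical view pools the capacity of all domains.
total_capacity = sum(p.capacity for p in providers.values())
```

A user only ever names `/home/arun.sdsc/exp1/...`; moving a file to another domain changes the catalog entry, not the logical path.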


DGMS Technology Usage
• NSF Southern California Earthquake Center digital library
• Worldwide Universities Network data grid
• NASA Information Power Grid
• NASA Goddard Data Management System data grid
• DOE BaBar High Energy Physics data grid
• NSF National Virtual Observatory data grid
• NSF ROADnet real-time sensor collection data grid
• NIH Biomedical Informatics Research Network data grid
• NARA research prototype persistent archive
• NSF National Science Digital Library persistent archive
• NHPRC Persistent Archive Test bed

Southern California Earthquake Center

Southern California Earthquake Center
• Build community digital library
• Manage simulation and observational data
  o 60 TB, several million files
• Provide web-based interface
  o Support standard services on digital library
• Manage data distributed across multiple sites
  o USC, SDSC, UCSB, SDSU, SIO
• Provide standard metadata
  o Community based descriptive metadata
  o Administrative metadata
  o Application specific metadata

SCEC Data Management Technologies
• Portals: knowledge interface to the library, presenting a coherent view of the services
• Knowledge management systems: organize relationships between SCEC concepts and semantic labels
• Process management systems: data processing pipelines to create derived data products
• Web services: uniform capabilities provided across SCEC collections
• Data grid: management of collections of distributed data
• Computational grid: access to distributed compute resources
• Persistent archive: management of technology evolution


NASA Data Grids
• NASA Information Power Grid
  o NASA Ames, NASA Goddard
  o Distributed data collection using the SRB
• ESIP federation
  o Led by Joseph JaJa (U Md)
  o Federation of ESIP data resources using the SRB
• NASA Goddard Data Management System
  o Storage repository virtualization (Unix file system, Unitree archive, DMF archive) using the SRB
• NASA EOS Petabyte store
  o Storage repository virtualization for EMC persistent store using the Nirvana version of SRB

NIH BIRN SRB Data Grid
• Biomedical Informatics Research Network
  o Access and analyze biomedical image data
  o Data resources distributed throughout the country
  o Medical schools and research centers across the US
• Stable high performance grid based environment
  o Coordinate data sharing
  o Federate collections
  o Support data mining and analysis

BIRN: Inter-organizational Data

SRB Collections at SDSC


Why They Require Data Grids
• Inter/Intra Organizational Sharing
• Inter/Intra Organizational Data Storage Utility
• Data Storage Resource Plug-n-play provisioning
• Data Preservation (Technology Migration)
• Information Lifecycle Management (ILM)

Inter/intra Organizational Sharing
(Diagram: two Grid Resource Providers)
• Research Lab 2: “We have relevant data for this project” (/txt3.txt)
• Research Lab 1: “You can use our storage for this project”

Inter/intra Organizational Sharing
• Sharing of resources between autonomous domains
  o Either the same (intra) or different (inter) organizations
• Shared resources
  o Data, storage, IT staff
• Logical namespace for collaboration
  o Shared data and physical resources available in the logical namespace for usage
  o Inter-organizational digital libraries, personal digital libraries

Inter/intra Organizational Utility
(Diagram, shown over two slides: Grid Resource Providers at the East Coast Office (/…/text1.txt), the West Coast Offices (/txt3.txt) and the Data Center, joined into one utility)
That was so easy in a slide show

Inter/intra Organizational Utility
• Create a logical data storage utility
  o Virtualization of enterprise resources (data and storage)
• Manage resource usage based on demand and supply
  o Distribute the quantity and QoS of resources in an enterprise based on project demands, priorities and usage
• Saves a lot in TCO
  o Total Cost of Operation: managing logically unified distributed resources without losing flexibility and local autonomy

Data Storage Resource Plug-n-play
• Plug-n-play provisioning
  o Add a storage resource to the grid without major reconfiguration
• Logical resources
  o Logical namespace of all the resources, to which resources can be added (or removed gracefully without affecting the applications)
• Update resources based on demand and supply
  o Add resources to the storage pool from another inter/intra organizational partner (another department or data center)

Data Preservation (Technology Migration)
• Facilitate “technology migration”
  o Flexible enterprise data architecture for technology evolution
• Update seamlessly to new file/storage system resources
  o Hardware changes, software changes
  o Change archival resource from digital tape to disks
  o Change from magnetic to optical
• The application and users are not aware of any change
  o Significant saving by avoiding downtime
  o Create a replicated resource of all or selected data

Information Lifecycle Management (ILM)
• Business oriented management of data and resources
  o If data is in demand: replicate, move to higher QoS
  o Archive only the less accessed data
• Grid middleware facilitates ILM in the background
  o More than one physical resource as a single logical resource
  o Logical namespace of online and offline data
  o Irrespective of the number of replicas and resources added
• Compliance with federal regulations
  o Health, finance and many other domains now have regulations on digital backup of transactions and communication
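The ILM rule on this slide (replicate or promote data that is in demand, archive data that is rarely accessed) can be sketched as a simple policy function. The thresholds, the 30-day window and the action names below are hypothetical; the slide does not say how demand is measured.

```python
def ilm_action(accesses_last_30_days, hot_threshold=100, cold_threshold=1):
    """Pick a lifecycle action from recent access counts.

    Thresholds are illustrative assumptions, not values from the tutorial.
    """
    if accesses_last_30_days >= hot_threshold:
        return "replicate"   # in demand: add replicas or move to higher QoS
    if accesses_last_30_days <= cold_threshold:
        return "archive"     # rarely accessed: move to archival storage
    return "keep"            # leave on the current resource


def plan(access_counts):
    """Map each logical name to an action; logical names never change."""
    return {name: ilm_action(n) for name, n in access_counts.items()}
```

Because applications address data by logical name, whatever action the middleware takes in the background is invisible to them.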


Using a Data Grid – in Abstract
(Diagram: a user and the data grid)
• Ask for data
  o User asks for data from the data grid
• Data delivered
  o The data is found and returned
  o Where and how details are managed by the data grid
  o But access controls are specified by the owner

Real World: Physical Heterogeneities
• Multiple autonomous administrative domains
• Distributed digital entities in different domains
• Heterogeneous storage resources and systems
• Distributed users and authentication mechanisms
• Different user preferences of usage
• Logical hierarchy
  o Users, groups, sub-organizations/departments, administrative domains, enterprises, virtual enterprise

Data Grid: Everything Is Logical
• Logical namespace of grid resources
• Collection hierarchy
• Logical resources
• Grid users and virtual organizations

Transparencies/Virtualizations (bits, data, information, …)
(Layered diagram)
• Layers: storage resource transparency; storage location transparency; data identifier transparency; data replica transparency; virtual data transparency; semantic data organization (with behavior); inter-organizational information storage management
• Example labels shown in the diagram: E:\srbVault\image.jpg; /users/srbVault/image.jpg; Select … from srb.mdas.td where ...; image_0.jpg … image_100.jpg; image.sql, image.cgi, image.wsdl; patientRecordsCollection, myActiveNeuroCollection

Data Grid Transparencies
• Find data without knowing the identifier
  o Descriptive attributes
• Access data/storage without knowing the location
  o Logical name space
• Access data without knowing the type of storage
  o Storage repository abstraction
• Provide transformations for any data collection
  o Data behavior abstraction

Data Grid Abstractions
• Storage repository virtualization
  o Standard operations supported on storage systems
• Data virtualization
  o Logical name space for files: global persistent identifier
• Information repository virtualization
  o Standard operations to manage collections in databases
• Access virtualization
  o Standard interface to support alternate APIs
• Latency management mechanisms
  o Aggregation, parallel I/O, replication, caching
• Security interoperability
  o GSSAPI, inter-realm authentication, collection-based authorization

Data Organization
• Physical organization of the data
  o Distributed data
  o Heterogeneous resources
  o Multiple formats (structured and unstructured)
• Logical organization
  o Impose logical structure for data sets
  o Collections of semantically related data sets
• Users create their own views (collections) of the data grid

Data Identifier Transparency
Four types of data identifiers:
• Unique name
  o OID or handle
• Descriptive name
  o Descriptive attributes (metadata)
  o Semantic access to data
• Collective name
  o Logical name space of a collection of data sets
  o Location independent
• Physical name
  o Physical location of resource and physical path of data

Mappings on Resource Name Space
• Define a logical resource name
  o List of physical resources
• Replication
  o A write to the logical resource completes when all physical resources have a copy
• Load balancing
  o A write to the logical resource completes when a copy exists on the next physical resource in the list
• Fault tolerance
  o A write to the logical resource completes when copies exist on “k” of “n” physical resources
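The three write-completion rules above can be compared side by side in a small simulation. This is a sketch of the semantics described on the slide, not the SRB implementation: the class, the policy strings and the `down` parameter (a set of unreachable resources) are all made up for illustration.

```python
class LogicalResource:
    """A logical resource name mapped to an ordered list of physical resources."""

    def __init__(self, physical, policy, k=None):
        self.physical = list(physical)
        self.policy = policy   # "replication", "load_balancing" or "fault_tolerance"
        self.k = k             # required copies, used only by "fault_tolerance"
        self._cursor = 0       # next list position for load balancing

    def write(self, down=frozenset()):
        """Return the physical resources holding a copy when the write completes.

        `down` simulates unreachable resources. Raises IOError when the
        policy's completion rule cannot be met.
        """
        reachable = [p for p in self.physical if p not in down]
        if self.policy == "replication":
            # Completes only when every physical resource has a copy.
            if len(reachable) != len(self.physical):
                raise IOError("replication needs every physical resource")
            return reachable
        if self.policy == "load_balancing":
            # Completes when the next resource in the list has a copy.
            target = self.physical[self._cursor % len(self.physical)]
            self._cursor += 1
            if target in down:
                raise IOError(f"{target} is unreachable")
            return [target]
        if self.policy == "fault_tolerance":
            # Completes when at least k of the n resources have a copy.
            if len(reachable) < self.k:
                raise IOError(f"fewer than {self.k} resources reachable")
            return reachable
        raise ValueError(f"unknown policy: {self.policy}")
```

With three physical resources and one of them down, a replication write fails, a fault-tolerance write with k=2 still completes, and a load-balancing write depends only on whether the next resource in the rotation is reachable.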

Data Replica Transparency
• Replication
  o Improve access time
  o Improve reliability
  o Provide disaster backup and preservation
  o Physically or semantically equivalent replicas
• Replica consistency
  o Synchronization across replicas on writes
  o Updates might use “m of n” or any other policy
  o Distributed locking across multiple sites
• Versions of files
  o Time-annotated snapshots of data

Latency Management: Bulk Operations
• Bulk register
  o Create a logical name for a file
• Bulk load
  o Create a copy of the file on a data grid storage repository
• Bulk unload
  o Provide containers to hold small files and pointers to each file location
• Bulk delete
  o Mark as deleted in metadata catalog
  o After specified interval, delete file
• Bulk metadata load
• Requests for bulk operations for access control setting, …
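The two-step bulk delete described above (mark as deleted in the metadata catalog, then remove the file after a specified interval) can be sketched as follows. The class and method names are hypothetical and timestamps are passed in explicitly so the behaviour is easy to follow; this is not the MCAT schema or an SRB API.

```python
class MiniCatalogWithTrash:
    """Toy metadata catalog with mark-then-purge deletion."""

    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.files = {}        # logical name -> stand-in for stored data
        self.deleted_at = {}   # logical name -> time it was marked deleted

    def bulk_register(self, names):
        """Create logical names for many files in one request."""
        for name in names:
            self.files[name] = b""

    def bulk_delete(self, names, now):
        """Step 1: only mark the entries as deleted in the catalog."""
        for name in names:
            if name in self.files:
                self.deleted_at[name] = now

    def purge(self, now):
        """Step 2: remove files whose retention interval has elapsed."""
        expired = [n for n, t in self.deleted_at.items()
                   if now - t >= self.retention]
        for name in expired:
            del self.files[name]
            del self.deleted_at[name]
        return sorted(expired)

    def visible(self):
        """Names a user would still see (not marked as deleted)."""
        return sorted(n for n in self.files if n not in self.deleted_at)
```

Marking many entries in one catalog request is cheap, while the expensive physical deletion is deferred and batched.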

In Short…
• The whole data grid infrastructure is logical
  o All physical heterogeneities and distribution are hidden
  o Flexibility needed for ever changing business demands in enterprise data management
  o The presentation of the underlying physical infrastructure is controlled by the autonomous administrative domains in the grid
• Grid middleware has to be the “plumber”
  o Has to do a lot of “plumbing” to provide all these transparencies without any significant degradation in performance or QoS
  o Distributed data management: latency, replicas, logical namespace, metadata, P2P, database tuning, etc.

Tutorial Outline
• Introduction to Data Grids
• Data Grid Design philosophies
• SDSC Storage Resource Broker
  o SRB Architecture
  o SRB Clients
  o SRB Demo
• Gridflows
• DGMS Related Topics
• Q & A Session
• Hands-on/Demo Session

Storage Resource Broker
• Distributed data management technology
• Developed at San Diego Supercomputer Center (Univ. of California, San Diego)
  o DARPA Massive Data Analysis
  o DARPA/USPTO Distributed Object Computation Test bed
  o 2000 to present: NSF, NASA, NARA, DOE, DOD, NIH, NLM, NHPRC
• Applications
  o Data grids: data sharing
  o Digital libraries: data publication
  o Persistent archives: data preservation
• Used in national and international projects in support of Astronomy, Bio-Informatics, Biology, Earth Systems Science, Ecology, Education, Geology, Government records, High Energy Physics, Seismology

Acknowledgement: SDSC SRB Team
• In picture: Arun Jagatheesan, George Kremenek, Sheau-Yen Chen, Arcot Rajasekar, Reagan Moore, Michael Wan, Roman Olschanowsky, Bing Zhu, Charlie Cowart
• Not in picture: Wayne Schroeder, Tim Warnock (BIRN), Lucas Gilbert, Marcio Faerman (SCEC), Antoine De Torcy
• Students: Jonathan Weinberg, Yufang Hu, Daniel Moore, Grace Lin, Allen Ding, Yi Li
• Emeritus: Vicky Rowley (BIRN), Qiao Xin, Ethan Chen, Reena Mathew, Erik Vandekieft, Xi (Cynthia) Sheng

Three Tier Architecture
• Clients (any interface/API)
  o Your preferred access mechanism
• Servers (SRB Server)
  o Manage interactions with storage systems
  o Federated to support direct interactions between servers
• Metadata catalog (MCAT)
  o Separation of metadata management from data storage
  o State persistence using a well-tuned database

SDSC Storage Resource Broker & Meta-data Catalog
(Architecture diagram, top to bottom)
• Application
• Access APIs: C, C++ libraries; Unix shell; Java, NT browsers; Linux I/O; DLL / Python; GridFTP; OAI; WSDL
• SRB Server: consistency management / authorization-authentication; logical name space; latency management; data transport; metadata transport
• Catalog abstraction: databases (DB2, Oracle, Sybase, SQLServer)
• Storage abstraction, drivers: HRM; archives (HPSS, ADSM, UniTree, DMF); databases (DB2, Oracle, Postgres); file systems (Unix, NT, Mac OSX)

Federated SRB Server Model
(Diagram)
• A client issues a read using a logical name or an attribute condition
• MCAT: 1. logical-to-physical mapping; 2. identification of replicas; 3. access and audit control
• SRB servers and SRB agents: peer-to-peer brokering, server(s) spawning
• Data access and parallel data access from resources R1 and R2 (steps 5/6)
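The three catalog steps named in the diagram (logical-to-physical mapping, identification of replicas, access and audit control) can be sketched with a toy catalog. All class, field and server names here are hypothetical; the real MCAT is a relational catalog whose schema is not shown in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    logical_name: str
    replicas: list                 # [(server, physical_path), ...]
    readers: set                   # users allowed to read
    attributes: dict = field(default_factory=dict)


class ToyMcat:
    def __init__(self, entries):
        self.entries = {e.logical_name: e for e in entries}
        self.audit_log = []        # (user, logical_name, allowed)

    def find(self, **conditions):
        """Look up logical names by an attribute condition."""
        return sorted(
            name for name, e in self.entries.items()
            if all(e.attributes.get(k) == v for k, v in conditions.items())
        )

    def open_for_read(self, user, logical_name):
        """Check access, record the attempt, then return all replicas."""
        entry = self.entries[logical_name]
        allowed = user in entry.readers
        self.audit_log.append((user, logical_name, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not read {logical_name}")
        # The caller (an SRB agent, in the diagram) picks which replica to use.
        return list(entry.replicas)
```

A client never sees the physical paths unless the catalog hands them out, which is what lets access control and auditing sit in one place.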

SRB Latency Management
[Figure: Latency management mechanisms between source and destination across the network: replication; server-initiated I/O; streaming; parallel I/O; caching; client-initiated I/O; remote proxies, staging; data aggregation, containers; prefetch.]

SRB Name Spaces
- Digital entities (files, blobs, structured data, …)
  - Logical name space for files for global identifiers
- Resources
  - Logical names for managing collections of resources
- User names (user-name / domain / SRB-zone)
  - Distinguished names for users to manage access controls
- MCAT metadata
  - Standard metadata attributes, Dublin Core, administrative metadata

Logical Name Space
- Global, location-independent identifiers for digital entities
  - Organized as collection hierarchy
  - Attributes mapped to logical name space
- Attributes managed in a database
- Types of administrative metadata
  - Physical location of file
  - Owner, size, creation time, update time
  - Access controls
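A minimal sketch of what "attributes mapped to a logical name space" could look like, assuming a plain dictionary in place of the catalog database. The logical names, attribute keys, and the helpers `list_collection` and `query` are invented for illustration and are not SRB calls.

```python
# Toy catalog: logical name -> attributes. All names and values are made up.
catalog = {
    "/demo/sky/plate01.fits": {"owner": "alice", "size": 1_000_000_000, "survey": "DPOSS"},
    "/demo/sky/plate02.fits": {"owner": "bob", "size": 250_000_000, "survey": "DPOSS"},
    "/demo/seismic/run7.dat": {"owner": "alice", "size": 40_000_000},
}


def list_collection(catalog, collection):
    """Browse by hierarchy: return logical names under a collection path."""
    prefix = collection.rstrip("/") + "/"
    return sorted(name for name in catalog if name.startswith(prefix))


def query(catalog, **conditions):
    """Discover by attribute: return names whose attributes match every condition."""
    return sorted(
        name
        for name, attrs in catalog.items()
        if all(attrs.get(key) == value for key, value in conditions.items())
    )


sky_files = list_collection(catalog, "/demo/sky")
alice_dposs = query(catalog, owner="alice", survey="DPOSS")
```

Neither lookup mentions a physical location, which is the point of the logical name space: where the bytes live is just another attribute held in the catalog.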

Remote Proxies
- Extract image cutout from Digital Palomar Sky Survey
  - Image size 1 Gbyte
  - Shipping the image to a server to extract the cutout took 2-4 minutes (5-10 Mbytes/sec)
- Remote proxy performed cutout directly on storage repository
  - Extracted cutout by partial file reads
  - Image cutouts returned in 1-2 seconds
- Remote proxies are a mechanism to aggregate I/O commands
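The timing on this slide can be checked with simple arithmetic. The sketch below assumes decimal units (1 Gbyte = 10^9 bytes, 1 Mbyte = 10^6 bytes); the function name is illustrative.

```python
IMAGE_BYTES = 1_000_000_000  # 1 Gbyte image, decimal units assumed


def transfer_seconds(size_bytes, mbytes_per_sec):
    """Time to ship size_bytes over a link sustaining mbytes_per_sec."""
    return size_bytes / (mbytes_per_sec * 1_000_000)


slow = transfer_seconds(IMAGE_BYTES, 5)   # 200.0 seconds, about 3.3 minutes
fast = transfer_seconds(IMAGE_BYTES, 10)  # 100.0 seconds, about 1.7 minutes
```

Transfer alone therefore accounts for roughly 1.7 to 3.3 minutes, in line with the 2-4 minutes quoted, whereas the remote proxy moves only the cutout and returns in 1-2 seconds.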

Virtual Data Abstraction
- Virtual Data or “On Demand Data”
  - Created on demand if not already available
  - Recipe to create derived data
  - Grid-based computation to create derived data product
- Object-based storage (extended data operations)
  - Data subsetting at the remote storage repository
  - Data formatting at the remote storage repository
  - Metadata extraction at the remote storage repository
  - Bulk data manipulation at the remote storage repository
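The "created on demand if not already available" idea can be sketched as a store that keeps a recipe per derived product and runs it only on first access. `VirtualDataStore` and the recipe below are hypothetical; in a real grid the recipe would launch a grid computation rather than a local function.

```python
from typing import Callable


class VirtualDataStore:
    """Toy store: returns existing data, or derives it from a registered recipe."""

    def __init__(self):
        self._materialized: dict[str, bytes] = {}
        self._recipes: dict[str, Callable[[], bytes]] = {}
        self.derivations = 0  # how many times a recipe actually ran

    def register_recipe(self, name: str, recipe: Callable[[], bytes]) -> None:
        self._recipes[name] = recipe

    def get(self, name: str) -> bytes:
        if name not in self._materialized:       # not already available
            self._materialized[name] = self._recipes[name]()  # run the recipe
            self.derivations += 1
        return self._materialized[name]


store = VirtualDataStore()
store.register_recipe("cutout-17", lambda: b"derived pixels")  # stand-in recipe
first = store.get("cutout-17")   # derived on demand
second = store.get("cutout-17")  # served from the materialized copy
```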

Grid Bricks
- Integrate data management system, data processing system, and data storage system into a modular unit
  - Commodity based disk systems (1 TB)
  - Memory (1 GB)
  - CPU (1.7 GHz)
  - Network connection (Gig-E)
  - Linux operating system
- Data Grid technology to manage name spaces
  - User names (authentication, authorization)
  - File names
  - Collection hierarchy

Data Grid Brick
- Hardware components
  - Intel Celeron 1.7 GHz CPU
  - SuperMicro P4SGA PCI Local bus ATX mainboard
  - 1 GB memory (266 MHz DDR DRAM)
  - 3Ware Escalade port PCI bus IDE RAID
  - 10 Western Digital Caviar 200-GB IDE disk drives
  - 3Com Etherlink 3C996B-T PCI bus 1000Base-T
  - Redstone RMC-4F2-7 4U ten bay ATX chassis
  - Linux operating system
- Cost is $2,200 per Tbyte plus tax
- Gig-E network switch costs $500 per brick
- Effective cost is about $2,700 per TByte

Grid Bricks at SDSC
- Used to implement “picking” environments for 10-TB collections
  - Web-based access
  - Web services (WSDL/SOAP) for data subsetting
- Implemented 15 TB of storage
  - Astronomy sky surveys, NARA prototype persistent archive, NSDL web crawls
- Must still apply Linux security patches to each Grid Brick
- Grid bricks managed through SRB
  - Logical name space, user IDs, access controls
  - Load leveling of files across bricks
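The slide does not say how SRB levels load across bricks. One simple possibility, shown purely as an assumption, is a greedy rule that places each file (largest first) on the brick currently holding the fewest bytes. The function name, file names, and sizes are hypothetical.

```python
def place_files(file_sizes, bricks):
    """Greedy load leveling: largest files first, each onto the emptiest brick."""
    used = {brick: 0 for brick in bricks}
    placement = {}
    for name, size in sorted(file_sizes.items(), key=lambda kv: -kv[1]):
        target = min(bricks, key=lambda brick: used[brick])  # ties: first brick wins
        placement[name] = target
        used[target] += size
    return placement


# Sizes in GB, invented for the example.
placement = place_files({"a": 50, "b": 30, "c": 20, "d": 10}, ["brick1", "brick2"])
```

With these inputs, brick1 ends up with 60 GB (a, d) and brick2 with 50 GB (b, c). Because users address files by logical name, such a placement could change without anyone needing to know which brick holds a file.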

Data Grid Federation
- Data grids provide the ability to name, organize, and manage data on distributed storage resources
- Federation provides a way to name, organize, and manage data on multiple data grids.

SRB Zones
- Each SRB zone uses a metadata catalog (MCAT) to manage the context associated with digital content
- Context includes:
  - Administrative, descriptive, authenticity attributes
  - Users
  - Resources
  - Applications

SRB Peer-to-Peer Federation
- Mechanisms to impose consistency and access constraints on:
  - Resources
    - Controls on which zones may use a resource
  - User names (user-name / domain / SRB-zone)
    - Users may be registered into another domain, but retain their home zone, similar to Shibboleth
  - Data files
    - Controls on who specifies replication of data
  - MCAT metadata
    - Controls on who manages updates to metadata

Peer-to-Peer Federation
1. Occasional Interchange - for specified users
2. Replicated Catalogs - entire state information replication
3. Resource Interaction - data replication
4. Replicated Data Zones - no user interactions between zones
5. Master-Slave Zones - slaves replicate data from master zone
6. Snow-Flake Zones - hierarchy of data replication zones
7. User / Data Replica Zones - user access from remote to home zone
8. Nomadic Zones “SRB in a Box” - synchronize local zone to parent
9. Free-floating “myZone” - synchronize without a parent zone
10. Archival “BackUp Zone” - synchronize to an archive
SRB Version released December 19, 2003

Principal peer-to-peer federation approaches (1536 possible combinations)

Comparison of peer-to-peer federation approaches
[Figure: Zone types shown: Free Floating, Occasional Interchange, Replicated Data, User and Data Replica, Resource Interaction, Nomadic, Replicated Catalog, Snow Flake, Master Slave, Archival. Labels on the figure: Partial User-ID Sharing; Partial Resource Sharing; No Metadata Synch; Hierarchical Zone Organization; One Shared User-ID; System Managed Replication; Connection From Any Zone; Complete Resource Sharing; System Set Access Controls; System Controlled Complete Synch; Complete User-ID Sharing; System Controlled Partial Metadata Synch; No Resource Sharing; Super Administrator Zone Control; System Controlled Complete Metadata Synch.]

Data Grid Federation - zoneSRB
[Figure: Application layer and access APIs: C, C++, Java libraries; Unix shell; Linux I/O; DLL / Python, Perl; Java, NT browsers; OAI, WSDL, OGSA; HTTP. Federation management. Consistency & metadata management / authorization-authentication, audit. Logical name space; latency management; data transport; metadata transport. Catalog abstraction: databases (DB2, Oracle, Sybase, Postgres, mySQL, Informix). Storage repository virtualization: archives (tape, HPSS, ADSM, UniTree, DMF, CASTOR, ADS); ORB; databases (DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix); file systems (Unix, NT, Mac OSX).]

SDSC SRB Clients
- C library calls
  - Provide access to all SRB functions
- Shell commands
  - Provide access to all SRB functions
- mySRB web browser
  - Provides hierarchical collection view
- inQ Windows browser
  - Provides Windows style directory view
- Jargon Java API
  - Similar to java.io API
- Matrix WSDL/SOAP Interface
  - Aggregates SRB requests into a SOAP request; has a Java API and GUI
- Python, Perl, C++, OAI, Windows DLL, Mac DLL, Linux I/O redirection, GridFTP (soon)

SDSC SRB Demo
- If possible from the venue
- Constraints: Wireless, Murphy’s live demo laws; WAN, SRB, Storage, …

What we are familiar with …

What we are not familiar with, yet =)
[Figure: inQ Windows Browser Interface]

How do they differ?
- A folder does NOT mean a physical folder
- Files do NOT mean physical files
- Everything is logical
- Everything is distributed
- Permissions are NOT rwxrwxrwx
- Permissions are on an object-by-object basis

inQ
- Windows OS only
- User Guide at
- Download .exe from

mySRB
- Web-based access to the SRB
- Secure HTTP
- Uses cookies for session control

mySRB Features
- Access to both data and metadata
- Data & file management
- Collection creation and management
- Metadata handling
- Browsing & querying interface
- Access control
- New file creation without upload

[Figure: mySRB interface to an SRB collection]

[Figure: Provenance metadata]

SDSC SRB Information
- SRB user community posts problems and solutions
- Request copy of source
- Access FAQ, installation instructions, papers
- SRB-Zilla (bugzilla)

SRB Availability
- SRB source distributed to academic and research institutions
- Commercial use access through UCSD Technology Transfer Office
  - William Decker

Tutorial Outline
- Introduction to Data Grids
- Data Grid Design philosophies
- SDSC Storage Resource Broker
- Gridflows
  - Introduction
  - Matrix Project
  - Data Grid Language
- DGMS Related Topics
- Q & A Session
- Hands-on/Demo Session

Work in progress
GfMS is ‘Hard Hat Area’ (Research)

Data handling pipeline in SCEC (data to information pipeline)
- Metadata derivation
- Ingest Metadata
- Ingest Data
- Determine analysis pipeline
- Initiate automated analysis
- Organize result data into distributed data grid collections
- Use the optimal set of resources based on the task, on demand
- Pipeline could be triggered by input at data source or by a data request from user
- All gridflow activities stored for data flow provenance

Gridflows (Grid Workflow)
- Automation of an execution pipeline
  - Data and/or tasks processed by multiple autonomous grid resources
  - According to set of procedural rules
  - Confluence of multiple autonomous administrative domains
- GridFlow Execution Servers
  - By themselves are from autonomous administrative domains
  - P2P (Distributed) Control

Need for Gridflows
- Data-intensive and/or compute-intensive processes
  - Long-running processes or pipelines on the Grid
  - (e.g.) If job A completes, execute jobs x, y, z; else execute job B.
- Self-organization/management of data
  - Semi-automation of data, storage distribution, curation processes
  - (e.g.) After each data insert into a collection, update the meta-data information about the collection or replicate the collection
- Knowledge Generation
  - Offline data analysis and knowledge generation pipelines
  - (e.g.) What inferences can be drawn from the new seismology graphs added to this collection? Which domain scientist will be interested in studying these new possible pre-results?

SDSC Matrix Project
- CS Research & Development
  - Gridflow Description, Data Grid Administration Rules
  - Gridflow P2P protocols for Gridflow Server Communication
- Development
  - SRB Data Grid Web Services
  - SRB Datagrid flow automation and provenance
- Theory to Practice
  - Help in customized development & deployment of gridflow concepts in scientific / grid applications
  - Visibility and assist in standardization of efforts at GGF

Advantages from Data Grid Perspective
- Reduces the client-server communication
  - The whole execution logic is sent to the server
  - Fewer WAN messages
  - Our experiments show a significant increase in performance
- Datagrid Information Lifecycle Management
  - Autonomic: “Move data at 9:00 PM on weekdays and on weekends”
- Data Grid Administration
  - Power users and sophisticated users
  - Data Grid Administrator (rules to manage data grid)
  - Scientist or Librarian (visualized data flow programming)

What they want?
- We know the business (scientific) process
- Cyberinfrastructure is all we care about (why bother about atoms or DNA)

What they want?
- Use DGL to describe your process logic with abstract references to datagrid infrastructure dependencies

Why a Gridflow Language?
- Infrastructure independent description
  - Abstract references to hardware and cyberinfrastructure
- Description of execution flow logic
  - Separate the execution flow logic from application logic
  - (e.g.) MonteCarlo is an application; executing it 10 times, or until a variable becomes zero, is execution logic
- Procedural rules associated with execution flow
- Provenance
  - What happened, when, who, how …? (and querying)
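The Monte Carlo example on this slide can be made concrete with a short Python sketch. It is not DGL: `monte_carlo_step` stands in for the application, and `run_flow` holds only the execution logic (repeat up to 10 times, or stop when a value reaches zero). Both names are hypothetical.

```python
import random


def monte_carlo_step(rng, samples=1000):
    """Application logic: one Monte Carlo estimate of pi (illustrative only)."""
    inside = sum(
        1 for _ in range(samples) if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4 * inside / samples


def run_flow(task, max_runs=10, stop_when=lambda result: False):
    """Execution logic: repeat task up to max_runs times or until stop_when holds."""
    results = []
    for _ in range(max_runs):
        result = task()
        results.append(result)
        if stop_when(result):
            break
    return results


# "Execute it 10 times": the application is untouched, only the flow says how often.
rng = random.Random(0)
estimates = run_flow(lambda: monte_carlo_step(rng), max_runs=10)

# "Until a variable becomes zero": same flow logic, different task and stop rule.
budget = {"remaining": 3}


def step_with_budget():
    budget["remaining"] -= 1
    return budget["remaining"]


runs = run_flow(step_with_budget, max_runs=10, stop_when=lambda value: value == 0)
```

Changing how often or under what condition the task repeats touches only `run_flow` and its arguments, which is the separation the slide argues for.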

Gridflow Language Requirements
- High level abstract descriptions
  - Abstract description of cyberinfrastructure dependencies
- Simple yet flexible
  - Flexible to describe complex requirements (no brute force)
- Gridflow dependency patterns
  - Based on execution structure and data semantics
  - (Parallel, Sequential, fork-new), (milestones, for-each, switch-case) ...
- Asynchronous execution
  - For long-run requests
- Querying using existing standard
  - XQuery

Gridflow Language Requirements
- Process meta data and annotations
  - Runtime definition, update and querying of meta-data
- Runtime management of gridflows
  - Stop gridflow at run time
- Partitioning
  - Facility in language to divide a gridflow request into multiple requests (Excellent Research Topic)
- Import descriptions
  - Refer to other gridflows in execution

Data Grid Language (DGL)
- XML based gridflow description
  - Describes execution flow logic
- ECA-based rule description for execution
  - ECA = Event, Condition, Action
- Querying of status of gridflow
  - XQuery / simple query of a gridflow execution
- Scoped variables and gridflow patterns
  - For control of execution flow logic
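An Event-Condition-Action rule can be sketched in a few lines of Python. This is a generic illustration of the ECA pattern, not DGL syntax or Matrix code. The rule mirrors the earlier example of updating collection metadata after each insert; `Rule`, `RuleEngine`, and the context keys are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    event: str                          # Event: what happened
    condition: Callable[[dict], bool]   # Condition: should we react?
    action: Callable[[dict], None]      # Action: what to do


class RuleEngine:
    def __init__(self):
        self.rules = []

    def add(self, rule):
        self.rules.append(rule)

    def fire(self, event, context):
        """Run every rule registered for event whose condition holds."""
        fired = 0
        for rule in self.rules:
            if rule.event == event and rule.condition(context):
                rule.action(context)
                fired += 1
        return fired


collection_meta = {"file_count": 0, "replicate_requests": 0}


def count_file(ctx):
    collection_meta["file_count"] += 1


def request_replication(ctx):
    collection_meta["replicate_requests"] += 1


engine = RuleEngine()
# After each insert, update the collection's metadata.
engine.add(Rule("insert", lambda ctx: True, count_file))
# After an insert of a large file (threshold is made up), also ask for replication.
engine.add(Rule("insert", lambda ctx: ctx["size"] > 1_000_000, request_replication))

engine.fire("insert", {"name": "small.dat", "size": 10})
engine.fire("insert", {"name": "large.dat", "size": 5_000_000})
```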

DGL Requests
- Data Grid Flow
  - An XML structure that describes the execution logic, associated procedural rules and grid environment variables
- Status Query
  - An XML structure used to query the execution status of any gridflow or sub-flow at any level of granularity
- A DGL or Matrix client sends any of these to the Matrix Server

Data Grid Request
- Annotations about the Data Grid Request
- Can be either a Flow or a Status Query

Grid User
[Figure: Example Grid User entry with the values Matrix-demo, sdsc, ******, /home/Matrix-demo.sdsc, sdsc-unix, 0]

Grid Ticket

VO Info

Flow
- Scoped variables that can control the flow
- Logic used by the sub-members
- Sub-members that are the real execution statements
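The three parts of a flow (scoped variables, logic, sub-members) and the idea of querying status at any level can be modelled in Python. This is a toy model under stated assumptions, not the DGL schema: the class and field names are invented, and only sequential logic is implemented.

```python
from collections import ChainMap
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    run: Callable  # called with the variables visible in the enclosing flows
    status: str = "pending"


@dataclass
class Flow:
    name: str
    variables: dict = field(default_factory=dict)  # scoped to this flow
    members: list = field(default_factory=list)    # Steps or nested Flows
    status: str = "pending"

    def execute(self, outer=None):
        # Inner variables shadow outer ones; lookups fall through to the parent.
        scope = ChainMap(self.variables, outer if outer is not None else {})
        self.status = "running"
        for member in self.members:  # sequential logic only in this sketch
            if isinstance(member, Flow):
                member.execute(scope)
            else:
                member.run(scope)
                member.status = "done"
        self.status = "done"

    def find(self, name):
        """Status query helper: locate a flow or step by name at any depth."""
        if self.name == name:
            return self
        for member in self.members:
            if isinstance(member, Flow):
                hit = member.find(name)
                if hit is not None:
                    return hit
            elif member.name == name:
                return member
        return None


log = []
inner = Flow(
    "ingest",
    variables={"replicas": 2},
    members=[Step("store", lambda s: log.append((s["collection"], s["replicas"])))],
)
outer = Flow(
    "pipeline",
    variables={"collection": "demo", "replicas": 1},
    members=[Step("derive-metadata", lambda s: log.append(("meta", s["replicas"]))), inner],
)
outer.execute()
store_status = outer.find("store").status
```

The inner flow sees `collection` from its parent but uses its own `replicas` value, which is what scoping means here; `find` shows how a status query could address a sub-flow or a single step.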

Using DG-Modeler
[Figure: GUI for dataflow programming]

Gridflow Process I
[Figure labels: End user using DGBuilder; gridflow description; Data Grid Language]

Gridflow Process II
[Figure labels: Abstract gridflow using Data Grid Language; planner; concrete gridflow using Data Grid Language]

Gridflow Process III
[Figure labels: Gridflow P2P network; gridflow processor; concrete gridflow using Data Grid Language]

Other Gridflow Research Projects
- GriPhyN Pegasus, Sphinx, Matrix
- Taverna (MyGrid)
- Kepler (also from SDSC)
- GridAnt
- …

Tutorial Outline
- Introduction to Data Grids
- Data Grid Design philosophies
- SDSC Storage Resource Broker
- Gridflows
- DGMS Related Topics
- Q & A Session
- Hands-on/Demo Session

DGMS Philosophy
- Collective view of inter-organizational data
- Operations on datagrid space
- Local autonomy and global state consistency
- Collaborative datagrid communities
- Multiple administrative domains or “Grid Zones”
- Self-describing and self-manipulating data
- Horizontal and vertical behavior
- Loose coupling between data and behavior (dynamically)
- Relationships between a digital entity and its physical locations, logical names, meta-data, access control, behavior, “Grid Zones”

DGMS Research Issues
- Self-organization of datagrid communities
  - Using knowledge relationships across the datagrids
  - Inter-datagrid operations based on semantics of data in the communities (different ontologies)
- High speed data transfer
  - Terabyte to transfer
  - Protocols, routers needed
- Latency Management
  - Data source speed >> data sink speed
- Datagrid Constraints
  - Data placement and scheduling
  - How many replicas, where to place them…

Work Vision Ahead
Half-baked research ahead

Active Datagrid Collections
[Figure labels: Resources (SDSC, National Lab, University of Gators); Data Sets (121.Event, Thit.xml, Hits.sql); Behavior (getEvents(), addEvent())]

Active Datagrid Collections
[Figure labels: Heterogeneous, distributed physical data (121.Event, Thit.xml, Hits.sql) at SDSC, National Lab, University of Gators; dynamic or virtual data (getEvents(), addEvent())]

Active Datagrid Collections
- Logical Collection gives location and naming transparency
[Figure labels: myHEP-Collection; 121.Event, Thit.xml, Hits.sql; SDSC, National Lab, University of Gators; Meta-data]

Active Datagrid Collections
- Now add behavior or services to this logical collection
[Figure labels: myHEP-Collection; 121.Event, Thit.xml, Hits.sql; SDSC, National Lab, University of Gators; Meta-data; Collection state and services; Horizontal Services; getEvents(), addEvent()]

Active Datagrid Collections
- ADC: logical view of data & operations
[Figure labels: myHEP-Collection; 121.Event, Thit.xml, Hits.sql; SDSC, National Lab, University of Gators; Meta-data; Collection state and services; Horizontal Services; getEvents(), addEvent(); ADC specific Operations + Model View Controllers]

Active Datagrid Collections
- Digital entities
  - Physical and virtual data present in the datagrid
- Meta-data
  - Standardized schema with domain specific schema extensions
- Services
  - Horizontal datagrid services and vertical domain specific services (portType) or pipelines (DGL)
- State
  - Events, collective state, mappings to domain services to be invoked

Related Technologies/Links
- A complete history of the Grid
- SDSC Storage Resource Broker
- Globus Data Grid
- The Legion Project

Global Grid Forum (GGF)
- Global Forum for Information Exchange and Collaboration
- Promote and support the development and deployment of Grid Technologies
- Creation and documentation of “best practices”, technical specifications (standards), user experiences, …
- Modeled after Internet Standards Process (IETF, RFC 2026)

Tutorial Outline
- Introduction to Data Grids
- Data Grid Design philosophies
- SDSC Storage Resource Broker
- Gridflows
- DGMS Related Topics
- Q & A Session
- Hands-on/Demo Session

Data Grid Mining
- Distributed data mining
  - Underlying infrastructure heterogeneous (not uniform LAN or bandwidth or memory)
  - Mining software (algorithms) to take advantage of the logical resource namespace to select execution site
- Can the mining software estimate and acquire required cyberinfrastructure resources before it starts?
  - Grid standards must be evolved to communicate this infrastructure dependent information
- Co-location of dependent data or tasks; distribution (parallel execution) of independent tasks at different domains

Data Grid Mining
- Using the Data Grid to mine data
  - Replicating or selecting the right resources for mining
- Cost-based analysis for the best utilization of the heterogeneous infrastructure
  - Data, mining software, execution location
  - Move data or code or execution location to alternative location based on QoS and available budget
- Co-locating or distributing appropriate data mining steps
  - (e.g.) NVO co-add at FNAL and SDSC (distribution + co-location)
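One way to read "move data or code based on QoS and available budget" is as a small selection problem: among the sites the budget allows, pick the one that finishes soonest once data movement is counted. The sketch below is only an assumed cost model; the `Site` fields, the site names, and all numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    compute_hours: float   # assumed time to run the mining step at this site
    cost_per_hour: float   # assumed charge for compute at this site
    transfer_hours: float  # assumed time to move the data here (0 if already local)


def choose_site(sites, budget):
    """Return the affordable site with the shortest total time, or None."""
    affordable = [s for s in sites if s.compute_hours * s.cost_per_hour <= budget]
    if not affordable:
        return None
    return min(affordable, key=lambda s: s.compute_hours + s.transfer_hours)


sites = [
    Site("archive", compute_hours=10, cost_per_hour=1, transfer_hours=0),  # code moves to data
    Site("cluster", compute_hours=2, cost_per_hour=4, transfer_hours=3),   # data moves to code
    Site("hpc", compute_hours=1, cost_per_hour=20, transfer_hours=3),
]

modest = choose_site(sites, budget=10)    # cluster: 5 hours total, cost 8
generous = choose_site(sites, budget=25)  # hpc: 4 hours total, cost 20
```

A fuller model would also price the data transfer and account for co-location of dependent steps, which the slide lists as open questions.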

Q & A Session; feedback
- There is a significant difference between being here in this room today and flipping through the slides from the internet later, so let's make sure we all benefit from the 3 hours we spend here

Tutorial Outline
- Introduction to Data Grids
- Data Grid Design philosophies
- SDSC Storage Resource Broker
- Gridflows
- DGMS Related Topics
- Q & A Session
- Hands-on/Demo Session

For More Information
Arun Swaran Jagatheesan
San Diego Supercomputer Center
University of California, San Diego