Data Grid Management Systems (DGMS) Arun Jagatheesan San Diego Supercomputer Center

Slides:



Advertisements
Similar presentations
Building Shared Collections Using the Storage Resource Broker Storage Resource Broker Reagan W. Moore
Advertisements

San Diego Supercomputer Center & National Partnership for Advance Computational Infrastructure Storage Resource Broker Reagan W. Moore San Diego Supercomputer.
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids for Collection Federation Reagan W. Moore University.
GFS OGF-22 Global Resource Naming Developers: Reagan Moore Arcot Mike.
The Storage Resource Broker and.
The Storage Resource Broker and.
Overview of the SDSC Storage Resource Broker Wayne Schroeder (and other SRB team members) May, 2004 San Diego Supercomputer Center, University of California.
Peter Berrisford RAL – Data Management Group SRB Services.
Data Grid: Storage Resource Broker Mike Smorul. SRB Overview Developed at San Diego Supercomputing Center. Provides the abstraction mechanisms needed.
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Data Grids, Digital Libraries, and Persistent Archives ESIP.
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids Reagan W. Moore San Diego Supercomputer Center.
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Data Grids, Digital Libraries and Persistent Archives Reagan.
NATIONAL PARTNERSHIP FOR ADVANCED COMPUTATIONAL INFRASTRUCTURE SAN DIEGO SUPERCOMPUTER CENTER Particle Physics Data Grid PPDG Data Handling System Reagan.
San Diego Supercomputer Center, University of California at San Diego Grid Physics Network (GriPhyN) University of Florida A Data Storage Language for.
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure Integration of Data Grids, Digital Libraries, and Persistent.
San Diego Supercomputer Center NARA Research Prototype Persistent Archive Building Preservation Environments with Data Grid Technology (NARA Research Prototype.
San Diego Supercomputer CenterNational Partnership for Advanced Computational Infrastructure1 Grid Based Solutions for Distributed Data Management Reagan.
A Very Brief Introduction to iRODS
Federating Archives in the DELAMAN Network Reagan W. Moore San Diego Supercomputer Center Storage Resource.
Security Requirements for Shared Collections Storage Resource Broker Reagan W. Moore
VL-e PoC Introduction Maurice Bouwhuis VL-e work shop, April 7 th, 2006.
Applying Data Grids to Support Distributed Data Management Storage Resource Broker Reagan W. Moore Ian Fisk Bing Zhu University of California, San Diego.
Modern Data Management Overview Storage Resource Broker Reagan W. Moore
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for Advanced.
By: Roman Olschanowsky An Introduction to the.
Data Grids and Data Management Storage Resource Broker Reagan W. Moore
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for Advanced.
National Partnership for Advanced Computational Infrastructure Digital Library Architecture Reagan Moore Chaitan Baru Amarnath Gupta George Kremenek Bertram.
San Diego Supercomputer CenterUniversity of California, San Diego Preservation Research Roadmap Reagan W. Moore San Diego Supercomputer Center
San Diego Supercomputer Center Grid Physics Network (GriPhyN) University of Florida Programming Gridflows using Matrix Arun Jagatheesan Architect, SDSC.
Jan Storage Resource Broker Managing Distributed Data in a Grid A discussion of a paper published by a group of researchers at the San Diego Supercomputer.
Data Grids and Data Management Storage Resource Broker Reagan W. Moore
Managing Simulation Output Storage Resource Broker Reagan W. Moore
San Diego Supercomputer Center SDSC Storage Resource Broker SRB as data grid solution (Chinese version) Arun Jagatheesan San Diego Supercomputer.
San Diego Supercomputer Center Grid Physics Network (GriPhyN) University of Florida Dataflows in SRB using SDSC Matrix Arun Jagatheesan Architect & Team.
Rule-Based Data Management Systems Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar {moore, schroede, mwan, {moore, schroede, mwan,
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure University of Florida 1 Arun Jagatheesan Reagan Moore San.
San Diego Supercomputer Center Grid Physics Network (GriPhyN) University of Florida Data Grid Management Systems (DGMS) Arun Swaran Jagatheesan San Diego.
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for Advanced.
San Diego Supercomputer Center SDSC Storage Resource Broker Data Grid Automation Arun Jagatheesan et al., San Diego Supercomputer Center University of.
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for Advanced.
EGEE-II INFSO-RI Enabling Grids for E-sciencE Data Grid Services/SRB/SRM & Practical Hai-Ning Wu Academia Sinica Grid Computing.
Production Data Grids SRB - iRODS Storage Resource Broker Reagan W. Moore
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for Advanced.
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure SRB + Web Services = Datagrid Management System (DGMS) Arcot.
Rule-Based Preservation Systems Reagan W. Moore Wayne Schroeder Mike Wan Arcot Rajasekar Richard Marciano {moore, schroede, mwan, sekar,
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Persistent Management of Distributed Data Reagan W. Moore.
National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center Persistent Archive for the NSDL Reagan W. Moore Charlie Cowart.
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure University of Florida 1 Arun Jagatheesan Reagan Moore San.
San Diego Supercomputer Center Grid Physics Network (GriPhyN) University of Florida DGL: The Assembly Language for Grid Computing Arun swaran Jagatheesan.
San Diego Supercomputer CenterNational Partnership for Advanced Computational Infrastructure1 Data Grids, Digital Libraries, and Persistent Archives Reagan.
The Global Land Cover Facility is sponsored by NASA and the University of Maryland.The GLCF is a founding member of the Federation of Earth Science Information.
SAN DIEGO SUPERCOMPUTER CENTER By: Roman Olschanowsky An Introduction to the.
Introduction to The Storage Resource.
San Diego Supercomputer Center National Partnership for Advanced Computational Infrastructure San Diego Supercomputer Center National Partnership for.
San Diego Supercomputer Center, University of California at San Diego Grid Physics Network (GriPhyN) University of Florida Data Grid and Gridflow Management.
Biomedical Informatics Research Network The Storage Resource Broker & Integration with NMI Middleware Arcot Rajasekar, BIRN-CC SDSC October 9th 2002 BIRN.
National Archives and Records Administration1 Integrated Rules Ordered Data System (“IRODS”) Technology Research: Digital Preservation Technology in a.
Rights Management for Shared Collections Storage Resource Broker Reagan W. Moore
Building Preservation Environments with Data Grid Technology Reagan W. Moore Presenter: Praveen Namburi.
Collection-Based Persistent Archives Arcot Rajasekar, Richard Marciano, Reagan Moore San Diego Supercomputer Center Presented by: Preetham A Gowda.
Building Preservation Environments Reagan W. Moore San Diego Supercomputer Center Storage Resource Broker.
Preservation Data Services Persistent Archive Research Group Reagan W. Moore October 1, 2003.
Building Preservation Environments from Federated Data Grids Reagan W. Moore San Diego Supercomputer Center Storage.
Data Grids, Digital Libraries and Persistent Archives: An Integrated Approach to Publishing, Sharing and Archiving Data. Written By: R. Moore, A. Rajasekar,
Collection Based Persistent Archives
Arcot Rajasekar Michael Wan Reagan Moore (sekar, mwan,
San Diego Supercomputer Center University of California, San Diego
VORB Virtual Object Ring Buffers
Presentation transcript:

Data Grid Management Systems (DGMS) Arun Jagatheesan San Diego Supercomputer Center

Acknowledgement: SDSC SRB Team  Arun Jagatheesan  George Kremenek  Sheau-Yen Chen  Arcot Rajasekar  Reagan Moore  Michael Wan  Roman Olschanowsky  Bing Zhu  Charlie Cowart Not In Picture:  Wayne Schroeder  Tim Warnock(BIRN)  Lucas Gilbert  Marcio Faerman (SCEC)  Antoine De Torcy Students: Xi (Cynthia) Sheng Allen Ding Grace Lin Jonathan Weinberg Yufang Hu Yi Li Emeritus: Vicky Rowley (BIRN) Qiao Xin Daniel Moore Ethan Chen Reena Mathew Erik Vandekieft Ullas Kapadia

Talk Outline Grid Computing and Data Grids Inter-organizational Information Management using Data Grids Gridflows and Data Grids Opportunities and Challenges

Grid as Utility Computing

NIH BIRN Data Grid Biomedical Informatics Research Network  Medical schools and research centers across the country  Access and analyze biomedical images  Coordinate sharing of data and storage resources  Storage virtualization, Data virtualization, Inter-organizational information virtualization  Inter and Intra Organizational Information Storage Management

BIRN: Inter-organizational Data

TeraGrid: 13.6 TF, 6.8 TB memory, 900 TB network disk, 10 PB archive ESnet HSCC MREN/Abilene Starlight OC-12 OC-3 vBNS Abilene MREN 15xxp Origin UniTree 1024p IA p IA-64 Calren 574p IA-32 Chiba City 128p Origin HR Display & VR Facilities NTON vBNS Abilene Calren ESnet Caltech 0.5 TF.4 TB Memory 86 TB disk NCSA 6+2 TF 4 TB Memory 400 TB disk SDSC 4.1 TF 2 TB Memory 500 TB SAN ANL 1 TF.25 TB Memory 25 TB disk HPSS 9 PB Juniper M160 OC-12 OC-48 OC p HP X-Class 128p HP V p IA-32 Myrinet Chicago & LA DTF Core Switch/Routers Cisco 65xx Catalyst Switch (256 Gb/s Crossbar) 1176p IBM SP 1.7 TFLOPs Blue Horizon OC x Sun E10K OC-12 OC-3 8 Sun Server 16 GbE 24 Extreme Blk Diamond OC-12 ATM

NASA Data Grids NASA Information Power Grid  NASA Ames, NASA Goddard  Distributed data collection using the SRB ESIP federation  Led by Joseph JaJa (U Md)  Federation of ESIP data resources using the SRB NASA Goddard Data Management System  Storage repository virtualization (Unix file system, Unitree archive, DMF archive) using the SRB NASA EOS Petabyte store  Storage repository virtualization for EMC persistent store using the Nirvana version of SRB

Southern California Earthquake Center

Build community digital library Manage simulation and observational data  Anelastic wave propagation output  10 TBs, 1.5 million files Provide web-based interface  Support standard services on digital library Manage data distributed across multiple sites  USC, SDSC, UCSB, SDSU, SIO Provide standard metadata  Community based descriptive metadata  Administrative metadata  Application specific metadata

Gridflow in SCEC (data  information pipeline) Metadata derivation Ingest Metadata Ingest Data Determine analysis pipeline Initiate automated analysis Organize result data into distributed data grid collections Use the optimal set of resources based on the task – on demand Pipeline could be triggered by input at data source or by a data request from user All gridflow activities stored for data flow provenance and querying

Commonality in all these projects Distributed data management  Data Grids, Digital Libraries, Persistent Archives  Workflow/dataflow Pipelines Data sharing across administrative domains  Common logical name space for all registered digital entities Data publication  Browsing and discovery of data in collections Data Preservation  Management of technology evolution

Talk Outline Grid Computing and Data Grids Inter-organizational Information Management using Data Grids SDSC Storage Resource Broker (SRB) Gridflows and Data Grids Opportunities and Challenges

Data Grid Using a Data Grid – in Abstract Ask for data User asks for data from the data grid Data delivered The data is found and returned Where & how details are managed by data grid But access controls are specified by owner

Common Data Grid Components Federated client-server architecture  Servers can talk to each other independently of the client Infrastructure independent naming  Logical names for users, resources, files, applications Collective ownership of data  Collection-owned data, with infrastructure independent access control lists Context management  Record state information in a metadata catalog from data grid services such as replication Abstractions for dealing with heterogeneity

Data Grid Transparencies Find data without knowing the identifier  Descriptive attributes Access data without knowing the location  Logical name space Access data without knowing the type of storage  Storage repository abstraction Retrieve data using your preferred API  Access abstraction Provide transformations for any data collection  Data behavior abstraction

Logical Layers (bits,data,information,..) Storage Resource Transparency Storage Location Transparency E:\srbVault\image.jpg /users/srbVault/image.jpg Select … from srb.mdas.td where... Data Identifier Transparency image_0.jpg…image_100.jpg Data Replica Transparency image.sqlimage.cgiimage.wsdl Virtual Data Transparency Semantic data Organization (with behavior) patientRecordsCollectionmyActiveNeuroCollection Inter- organizational Information Storage Management Storage Virtualization Data Virtualization Information Virtualization

Talk Outline Grid Computing and Data Grids Inter-organizational Information Management using Data Grids SDSC Storage Resource Broker (SRB) Gridflows and Data Grids Opportunities and Challenges

SDSC Storage Resource Broker Distributed data management technology  Developed at San Diego Supercomputer Center (Univ. of California, San Diego)  DARPA Massive Data Analysis  DARPA/USPTO Distributed Object Computation Testbed  2000 to present - NSF, NASA, NARA, DOE, DOD, NIH, NLM, NHPRC Applications  Data grids - data sharing  Digital libraries - data publication  Persistent archives - data preservation  Used in national and international projects in support of Astronomy, Bio-Informatics, Biology, Earth Systems Science, Ecology, Education, Geology, Government records, High Energy Physics, Seismology

SRB Data Collections at SDSC

SRB Data Grid Environments NSF Southern California Earthquake Center digital library Worldwide Universities Network data grid NASA Information Power Grid NASA Goddard Data Management System data grid DOE BaBar High Energy Physics data grid NSF National Virtual Observatory data grid NSF ROADnet real-time sensor collection data grid NIH Biomedical Informatics Research Network data grid NARA research prototype persistent archive NSF National Science Digital Library persistent archive NHPRC Persistent Archive Testbed

Why they use SRB for Data Management? Logical Namespace of data collections Replica Transparency and management Meta-data Management Storage resource virtualization Information/data virtualization Latency Management and bulk operations Virtual data abstraction Inter or Intra-organizational namespace of data and storage resources Preserve data irrespective of technology evolution (say for 400 years) Digital Libraries Data Grids Data on demand Persistent Archives

Our “magic” recipe Every thing in the world is abstract (kind of Matrix) Databases:  Physical layer, logical layer  Physical layer not visible to user SRB  Physical layer, logical layer, inter-zone layer  Every this is logical – but visible to users (as long as they have access)

Unix Shell Java, NT Browsers GridFTP OAI WSDL SDSC Storage Resource Broker & Meta-data Catalog (One Zone) HRM Archives HPSS, ADSM, UniTree, DMF Databases DB2, Oracle, Postgres File Systems Unix, NT, Mac OSX Application C, C++, Libraries Access APIs SRB Drivers Storage Abstraction Catalog Abstraction Databases DB2, Oracle, Sybase, SQLServer Consistency Management / Authorization-Authentication Logical Name Space Latency Management Data Transport Metadata Transport SRB Server Linux I/O DLL / Python Our Magic Recipe: Every thing in this world is “abstract”

SRB server SRB agent SRB server Federated SRB server model MCAT Read Client SRB agent Logical Name Or Attribute Condition 1.Logical-to-Physical mapping 2.Identification of Replicas 3.Access & Audit Control Peer-to-peer Brokering Server(s) Spawning Data Access Parallel Data Access R1 R2 5/6

SRB Name Spaces Digital Entities (files, blobs, Structured data,,)  Logical name space for files for global identifiers Resources  Logical names for managing collections of resources User names (user-name / domain / SRB-zone)  Distinguished names for users to manage access controls MCAT metadata  Standard metadata attributes,administrative metadata, user-defined meta

Data Grid Federation Data grids provide the ability to name, organize, and manage data on distributed storage resources Federation provides a way to name, organize, and manage data on multiple data grids.

SRB Zones Each SRB zone uses a metadata catalog (MCAT) to manage the context associated with digital content Each SRB zone is an autonomous organization or a sub-organization Context includes:  Administrative, descriptive, authenticity attributes  Users  Resources  Applications

SRB Peer-to-Peer Federation Mechanisms to impose consistency and access constraints on:  Resources  Controls on which zones may use a resource  User names (user-name / domain / SRB- zone)  Users may be registered into another domain, but retain their home zone, similar to Shibboleth  Data files  Controls on who specifies replication of data  MCAT metadata  Controls on who manages updates to metadata

Peer-to-Peer Federation 1. Occasional Interchange - for specified users 2. Replicated Catalogs - entire state information replication 3. Resource Interaction - data replication 4. Replicated Data Zones - no user interactions between zones 5. Master-Slave Zones - slaves replicate data from master zone 6. Snow-Flake Zones - hierarchy of data replication zones 7. User / Data Replica Zones - user access from remote to home zone 8. Nomadic Zones “SRB in a Box” - synchronize local zone to parent 9. Free-floating “myZone” - synchronize without a parent zone 10. Archival “Backup Zone” - synchronize to an archive SRB Version released December 19, 2003

Unix Shell Java, NT Browsers OAI, WSDL, OGSA HTTP Archives - Tape, HPSS, ADSM, UniTree, DMF, CASTOR,ADS Databases DB2, Oracle, Sybase, SQLserver,Postgres, mySQL, Informix File Systems Unix, NT, Mac OSX Application ORB Storage Repository Virtualization Catalog Abstraction Databases DB2, Oracle, Sybase, Postgres, mySQL, Informix C, C++, Java Libraries Logical Name Space Latency Management Data Transport Metadata Transport Consistency & Metadata Management / Authorization-Authentication Audit Linux I/O DLL / Python, Perl Federation Management Data Grid Federation - zoneSRB

Talk Outline Grid Computing and Data Grids Inter-organizational Information Management using Data Grids SDSC Storage Resource Broker (SRB) Gridflows and Data Grids Opportunities and Challenges

Gridflows Grid Workflow (Gridflow) is the automation of a execution pipeline in which data or tasks are processed through multiple autonomous grid resources according to a set of procedural rules Gridflows are executed on resources that are dynamically obtained through confluence of one or more autonomous administrative domains (peers)

Need for Gridflows Data-intensive and/or compute-intensive processes  Long run processes or pipelines on the Grid  (e.g) If job A completes execute jobs x, y, z; else execute job B. Self-organization/management of data  Semi-automation of data, storage distribution, curation processes  (e.g) After each data insert into a collection, update the meta-data information about the collection or replicate the collection Knowledge Generation  Offline data analysis and knowledge generation pipelines  (e.g) What inferences can be assumed from the new seismology graphs added to this collection? Which domain scientist will be interested to study these new possible pre-results?

Data Grid Language Assembly Language for Grid Computing Describes Gridflow  Both structure-based and state-based gridflow patterns  Described ECA based rules  Inbuilt support to define data grid datatypes like collections, … Query Gridflow  Query on the execution of any gridflow (any granular detail)  XQuery is used to query on the status of gridflow and its attributes Manage Gridflow  Start or stop the gridflow in execution

SDSC Matrix Project R&D effort that is ready for production now  Gridflow Protocols  Gridflow Language Descriptions  Version 3.0 released Community based  Both Industry and Academia can benefit by participation  Involves University of Florida, UCSD, … (Are you In?) Multiple Projects could be benefited  Very large academic data grid projects  Industries which want to be the early adopters

Matrix Gridflow Server Architecture Matrix Agent Abstraction In Memory Store JDBC Agents for java, WSDL and other grid executables Persistence (Store) Abstraction ECA rules Handler Matrix Data Grid Request Processor Transaction Handler Status query Handler Gridflow Meta data Manager JMS Messaging Interface JAXM Wrapper SOAP Service for Matrix Clients Flow Handler and Execution Manager Workflow Query Processor XQuery Processor Event Publish Subscribe, Notification SDSC SRB Agents Other SDSC Data Services WSDL Description Sangam P2P Gridflow Broker and Protocols

Talk Outline Grid Computing and Data Grids Inter-organizational Information Management using Data Grids SDSC Storage Resource Broker (SRB) Gridflows and Data Grids Opportunities and Challenges

DGMS Philosophy Collective view of  Inter-organizational data  Operations on datagrid space Local autonomy and global state consistency Collaborative datagrid communities  Multiple administrative domains or “Grid Zones” Self-describing and self-manipulating data  Horizontal and vertical behavior  Loose coupling between data and behavior (dynamically)  Relationships between a digital entity and its Physical locations, Logical names, Meta-data, Access control, Behavior, “Grid Zones”.

DGMS Research Issues Self-organization of datagrid communities  Using knowledge relationships across the datagrids  Inter-datagrid operations based on semantics of data in the communities (different ontologies) High speed data transfer  Terabyte to transfer  Protocols, routers Latency Management  Data source speed >> data sink speed Datagrid Constraints Data placement and scheduling  How many replicas, where to place them…

Global Grid Forum (GGF) Global Forum for Information Exchange and Collaboration  Promote and support the development and deployment of Grid Technologies  Creation and documentation of “best practices”, technical specifications (standards), user experiences, …  Modeled after Internet Standards Process (IETF, RFC 2026) 

Talk Outline Grid Computing and Data Grids Inter-organizational Information Management using Data Grids SDSC Storage Resource Broker (SRB) Gridflows and Data Grids Opportunities and Challenges Summary

Let us share dreams to make them for real Arun S. Jagatheesan San Diego Supercomputer Center

inQ Windows Browser Interface

mySRB Interface to a SRB Collection

Provenance Metadata