Data Grids and Data Management
Storage Resource Broker
Reagan W. Moore
Topics
- Data management evolution
  - Shared collections
  - Digital Libraries
  - Persistent Archives
- Building shared collections
  - Project level / National level / International
- Demonstration of shared collections
  - Access to collections at SDSC
Shared Collections
- Data grids support the creation of shared collections that may be distributed across multiple institutions, sites, and storage systems.
- Digital libraries publish data and provide services for discovery and display.
- Persistent archives preserve data, managing the migration to new technology.
Generic Infrastructure
- Can a single system provide all of the features needed to implement each type of data management system, while supporting access across administrative domains and managing data stored in multiple types of storage systems?
- The answer is data grid technology.
Shared Collections
The purpose of the SRB data grid is to enable the creation of a collection that is shared between academic institutions.
- Register digital entity into the shared collection
- Assign owner, access controls
- Assign descriptive, provenance metadata
- Manage state information
  - Audit trails, versions, replicas, backups, locks
  - Size, checksum, validation date, synchronization date, …
- Manage interactions with storage systems
  - Unix file systems, Windows file systems, tape archives, …
- Manage interactions with preferred access mechanisms
  - Web browser, Java, WSDL, C library, …
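The registration steps above can be sketched in a few lines of Python. This is a minimal illustration, not the SRB API: the `DigitalEntity` record, the `register` and `validate` functions, and the choice of SHA-256 as the checksum are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DigitalEntity:
    """Hypothetical catalog record for one registered digital entity."""
    logical_name: str
    owner: str
    size: int
    checksum: str
    validation_date: datetime
    access: dict = field(default_factory=dict)      # user -> permission
    metadata: dict = field(default_factory=dict)    # descriptive / provenance
    replicas: list = field(default_factory=list)    # physical locations
    audit_trail: list = field(default_factory=list)


def register(catalog, logical_name, owner, data, location, metadata=None):
    """Register data under a logical name and record its state information."""
    now = datetime.now(timezone.utc)
    entity = DigitalEntity(
        logical_name=logical_name,
        owner=owner,
        size=len(data),
        checksum=hashlib.sha256(data).hexdigest(),
        validation_date=now,
        access={owner: "write"},
        metadata=dict(metadata or {}),
        replicas=[location],
    )
    entity.audit_trail.append((now, owner, "register"))
    catalog[logical_name] = entity
    return entity


def validate(entity, data):
    """Recompute the checksum; refresh the validation date only on a match."""
    ok = hashlib.sha256(data).hexdigest() == entity.checksum
    if ok:
        entity.validation_date = datetime.now(timezone.utc)
    return ok
```

The point of the sketch is that ownership, access controls, metadata, and state information all live in the catalog record rather than in the storage system that holds the bytes.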
Federated Server Architecture
[Figure labels: Application; Read; Logical Name or Attribute Condition; SRB server; SRB agent; MCAT; Peer-to-peer Brokering; Server(s) Spawning; Data Access; Parallel Data Access; resources R1, R2]
MCAT functions shown in the figure:
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control
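The three catalog functions named on this slide can be sketched as a single lookup routine. This is an illustrative model only: the dictionary layout of `mcat`, the `resolve_read` function, and the `AccessDenied` exception are hypothetical and do not reflect the real MCAT schema or SRB calls.

```python
class AccessDenied(Exception):
    """Raised when the catalog does not grant the requested access."""


def resolve_read(mcat, user, logical_name, online_resources):
    """Resolve a read request on a logical name to one physical replica."""
    # 1. Logical-to-physical mapping: the catalog record holds the mapping.
    record = mcat[logical_name]

    # 3. Access and audit control: every attempt is logged, granted or not.
    allowed = user in record["readers"]
    record["audit"].append((user, "read", "granted" if allowed else "denied"))
    if not allowed:
        raise AccessDenied(f"{user} may not read {logical_name}")

    # 2. Identification of replicas: take the first replica that is online.
    for resource, path in record["replicas"]:
        if resource in online_resources:
            return resource, path
    raise LookupError(f"no online replica of {logical_name}")
```

For example, with replicas on R1 and R2, a request made while R1 is offline resolves to the R2 copy without the application changing the name it asked for.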
Generic Infrastructure
- Digital libraries now build upon data grids to manage distributed collections
  - DSpace digital library - MIT and Hewlett-Packard
  - Fedora digital library - Cornell University and University of Virginia
- Persistent archives build upon data grids to manage technology evolution
  - NARA research prototype persistent archive
  - California Digital Library - Digital Preservation Repository
  - NSF National Science Digital Library persistent archive
National Science Digital Library
- URLs for educational material for all grade levels registered into repository at Cornell
- SDSC crawls the URLs, registers the web pages into an SRB data grid, builds a persistent archive
  - 750,000 URLs
  - 13 million web pages
  - About 3 TBs of data
Southern California Earthquake Center
SCEC Community Library
[Figure labels: Select Scenario; Fault Model; Source Model; Select Receiver (Lat/Lon); Output Time History Seismograms; SCEC Digital Library]
- Intuitive User Interface
  - Pull-Down Query Menus
  - Graphical Selection of Source Model
  - Clickable LA Basin Map (Olsen)
  - Seismogram/History extraction (Olsen)
- Access SCEC Digital Library
  - Data stored in a data grid
  - Annotated by modelers
  - Standard naming convention
  - Automated extraction of selected data and metadata
  - Management of visualizations
Terashake Data Handling
- Simulate 7.7 magnitude earthquake on San Andreas fault
- 50 Terabytes in a simulation
- Move 10 Terabytes per day
- Post-Processing of wave field
- Movies of seismic wave propagation
- Seismogram formatting for interactive on-line analysis
- Velocity magnitude
- Displacement vector field
- Cumulative peak maps
- Statistics used in visualizations
- Register derived data products into SCEC digital library
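The two volume figures on this slide imply a sustained transfer rate and a minimum transfer time. The short calculation below works them out under two assumptions that the slide does not state: a terabyte is taken as 10^12 bytes, and the full 50 TB output is moved at the quoted 10 TB per day.

```python
TB = 10**12            # assumption: decimal terabyte
SECONDS_PER_DAY = 86_400


def sustained_rate_mb_per_s(tb_per_day):
    """Average rate, in MB/s (10^6 bytes), needed to move tb_per_day."""
    return tb_per_day * TB / SECONDS_PER_DAY / 10**6


def days_to_move(total_tb, tb_per_day):
    """Days needed to move total_tb at a steady tb_per_day."""
    return total_tb / tb_per_day


print(round(sustained_rate_mb_per_s(10), 2))  # about 115.74 MB/s
print(days_to_move(50, 10))                   # 5.0 days
```

Under these assumptions, moving 10 TB per day means holding roughly 116 MB/s around the clock, and one simulation's output takes about five days to move.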
ROADNet Sensor Network Data Integration
Frank Vernon - UCSD/SIO
[Figure labels: Humidity; Wind Speed; Seismic; Climate; Ecological; Wireless; Oceanography; Geophysics; Fire start; Rain start]
NARA Persistent Archive
Federation of Three Independent Data Grids
[Figure labels: NARA; U Md; SDSC; MCAT]
- Original data at NARA, data replicated to U Md & SDSC
- Replicated copy at U Md for improved access, load balancing and disaster recovery
- Active archive at SDSC, user access
- Demonstrate preservation environment
  - Authenticity
  - Integrity
  - Management of technology evolution
  - Mitigation of risk of data loss
    - Replication of data
    - Federation of catalogs
  - Management of preservation metadata
  - Scalability
    - Types of data collections
    - Size of data collections
Worldwide University Network Data Grid
- A functioning, general purpose international Data Grid for academic collaborations
- Manchester-SDSC mirror
[Map labels: SDSC; Manchester; Southampton; White Rose; NCSA; U. Bergen]
WUNGrid Collections
- BioSimGrid
  - Molecular structure collaborations
- White Rose Grid
  - Distributed Aircraft Maintenance Environment
- Medieval Studies
- Music Grid
- e-Print collections
  - DSpace
- Astronomy
BioSimGrid
Kaihsu Tai, Stuart Murdock, Bing Wu, Muan Hong Ng, Steven Johnston, Hans Fangohr, Simon J. Cox, Paul Jeffreys, Jonathan W. Essex, Mark S. P. Sansom (2004) BioSimGrid: towards a worldwide repository for biomolecular simulations. Org. Biomol. Chem. 2:3219–3221 DOI: /b411352g
University of Oxford
- Mark Sansom, Biochemistry
- Paul Jeffreys, e-Science
- Kaihsu Tai, Biochemistry
- Bing Wu, Biochemistry / e-Science
University of Southampton
- Jonathan Essex, Chemistry
- Simon Cox, e-Science
- Stuart Murdock, Chemistry / e-Science
- Muan Hong Ng, e-Science
- Hans Fangohr, e-Science
- Steven Johnston, e-Science
Elsewhere
- David Moss, Birkbeck, London
- Adrian Mulholland, Bristol
- Charles Laughton, Nottingham
- Leo Caves, York
KEK Data Grid
- A functioning, general purpose international Data Grid for high-energy physics
[Map labels: Japan; Taiwan; South Korea; Australia; Poland; US]
BaBar High-energy Physics
- A functioning international Data Grid for high-energy physics
- Moved over 100 TBs of data
[Map labels: Stanford Linear Accelerator; Lyon, France; Rome, Italy; San Diego; RAL, UK]
Astronomy Data Grid
- A functioning international Data Grid for Astronomy
- Moved over 400,000 images
[Map labels: Chile; Tucson, Arizona; NCSA, Illinois]
International Institutions (2005)
Storage Resource Broker 3.3.1
[Architecture figure, listed from the application layer down to storage]
- Application
- Access interfaces
  - C Library, Java
  - Unix Shell
  - NT Browser, Kepler Actors
  - http, Portlet, WSDL, OAI-PMH
  - DSpace, OpenDAP, GridFTP, Fedora
  - Linux I/O
  - C++
  - DLL / Python, Perl, Windows
- Federation Management
- Consistency & Metadata Management / Authorization, Authentication, Audit
- Logical Name Space; Latency Management; Data Transport; Metadata Transport
- Storage Repository Abstraction
  - Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
  - File Systems - Unix, NT, Mac OSX
  - ORB
  - Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix
- Database Abstraction
  - Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix
SRB Objectives
- Automate all aspects of data discovery, access, management, analysis, preservation
  - Security paramount
  - Distributed data
- Provide distributed data support for
  - Data sharing - data grids
  - Data publication - digital libraries
  - Data preservation - persistent archives
  - Data collections - Real time sensor data
SRB Developers
- Reagan Moore - PI
- Michael Wan - SRB Architect
- Arcot Rajasekar - SRB Manager
- Wayne Schroeder - SRB Productization
- Charlie Cowart - inQ
- Lucas Gilbert - Jargon
- Bing Zhu - Perl, Python, Windows
- Antoine de Torcy - mySRB web browser
- Sheau-Yen Chen - SRB Administration
- George Kremenek - SRB Collections
- Arun Jagatheesan - Matrix workflow
- Marcio Faerman - SCEC Application
- Sifang Lu - ROADnet Application
- Richard Marciano - SALT persistent archives
- Contributors from UK e-Science, Academia Sinica, Ohio State University, Aerospace Corporation, …
75 FTE-years of support
About 300,000 lines of C
History
- DARPA Massive Data Analysis Systems
- DARPA/USPTO Distributed Object Computation Testbed
- NSF National Partnership for Advanced Computational Infrastructure
- DOE Accelerated Strategic Computing Initiative data grid
- NARA persistent archive
- NASA Information Power Grid
- NLM Digital Embryo digital library
- DOE Particle Physics data grid
- NSF Grid Physics Network data grid
- NSF National Virtual Observatory data grid
- NSF National Science Digital Library persistent archive
- NSF Southern California Earthquake Center digital library
- NIH Biomedical Informatics Research Network data grid
- NSF Real-time Observatories, Applications, and Data management Network
- NSF ITR, Constraint based data systems
- LC Digital Preservation Lifecycle Management
- LC National Digital Information Infrastructure and Preservation program
Development
- SRB December 15, 2000
  - Basic distributed data management system
  - Metadata Catalog
- SRB February 18, 2003
  - Parallel I/O support
  - Bulk operations
- SRB August 30, 2003
  - Federation of data grids
- SRB October 31, 2005
  - Feature requests (extensible schema)
Separation of Access Method from Storage Protocols
- The data grid maps from the operations used by the access method to a standard set of operations used to interact with the storage system
[Figure labels: Access Method; Access Operations; Data Grid; Storage Operations; Storage Protocol; Storage System]
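The mapping described on this slide is essentially a driver interface. The sketch below is one way to express it in Python, assuming a deliberately small operation set (`read`, `write`, `stat`); the class names and the `grid_copy` helper are hypothetical, and `MemoryDriver` merely stands in for a non-file back end such as an archive.

```python
import os
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Standard set of storage operations the data grid relies on."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def stat(self, path: str) -> int:
        """Return the size in bytes."""


class UnixFileDriver(StorageDriver):
    """Maps the standard operations onto a local file system."""

    def __init__(self, root):
        self.root = root

    def _full(self, path):
        return os.path.join(self.root, path.lstrip("/"))

    def read(self, path):
        with open(self._full(path), "rb") as f:
            return f.read()

    def write(self, path, data):
        full = self._full(path)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "wb") as f:
            f.write(data)

    def stat(self, path):
        return os.path.getsize(self._full(path))


class MemoryDriver(StorageDriver):
    """Stand-in for a back end with a different native protocol."""

    def __init__(self):
        self._objects = {}

    def read(self, path):
        return self._objects[path]

    def write(self, path, data):
        self._objects[path] = bytes(data)

    def stat(self, path):
        return len(self._objects[path])


def grid_copy(src, src_path, dst, dst_path):
    """Access-method code: written once, works with any pair of drivers."""
    dst.write(dst_path, src.read(src_path))
    return dst.stat(dst_path)
```

Because `grid_copy` sees only the standard operations, supporting a new storage system means writing one more driver rather than changing any access method.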
Data Grid Operations
- File access
  - Open, close, read, write, seek, stat, synch, …
  - Audit, versions, pinning, checksums, synchronize, …
  - Parallel I/O and firewall interactions
  - Versions, backups, replicas
- Latency management
  - Bulk operations
    - Register, load, unload, delete, …
  - Remote procedures
    - HDFv5, data filtering, file parsing, replicate, aggregate
- Metadata management
  - SQL generation, schema extension, XML import and export, browsing, queries
- GGF, “Operations for Access, Management, and Transport at Remote Sites”
Examples of Extensibility
- Storage Repository Driver evolution
  - Initially supported Unix file system
  - Added archival access - UniTree, HPSS
  - Added FTP/HTTP
  - Added database blob access
  - Added database table interface
  - Added Windows file system
  - Added project archives - Dcache, Castor, ADS
  - Added Object Ring Buffer, Datascope
  - Added GridFTP version 3.3
- Database management evolution
  - Postgres
  - DB2
  - Oracle
  - Informix
  - Sybase
  - mySQL (most difficult port - no locks, no views, limited SQL)
Examples of Extensibility
- The 3 fundamental APIs are C library, shell commands, Java
  - Other access mechanisms are ported on top of these interfaces
- API evolution
  - Initial access through C library, Unix shell command
  - Added iNQ Windows browser (C++ library)
  - Added mySRB Web browser (C library and shell commands)
  - Added Java (Jargon)
  - Added Perl/Python load libraries (shell command)
  - Added WSDL (Java)
  - Added OAI-PMH, OpenDAP, DSpace digital library (Java)
  - Added Kepler actors for dataflow access (Java)
  - Added GridFTP version 3.3 (C library)
  - Added Fedora
Logical Name Spaces
- Data access directly between application and storage repository using names required by the local repository
- Storage Repository
  - Storage location
  - User name
  - File name
  - File context (creation date, …)
  - Access constraints
- Data Access Methods (C library, Unix, Web Browser)
Logical Name Spaces
- Data is organized as a shared collection
- Storage Repository
  - Storage location
  - User name
  - File name
  - File context (creation date, …)
  - Access constraints
- Data Grid
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
- Data Collection
- Data Access Methods (C library, Unix, Web Browser)
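A small model makes the benefit of the logical name spaces concrete: the name an application uses stays fixed while the physical location changes underneath it. The `LogicalNameSpace` class and its methods are hypothetical and cover only the resource and file name spaces; user names, metadata, and constraints are left out of this sketch.

```python
class LogicalNameSpace:
    """Toy model of logical resource and logical file name spaces."""

    def __init__(self):
        self.resources = {}  # logical resource name -> physical storage location
        self.files = {}      # logical file name -> (logical resource, local name)

    def add_resource(self, logical_resource, physical_location):
        self.resources[logical_resource] = physical_location

    def register(self, logical_name, logical_resource, local_name):
        if logical_resource not in self.resources:
            raise KeyError(f"unknown resource: {logical_resource}")
        self.files[logical_name] = (logical_resource, local_name)

    def migrate(self, logical_name, new_resource, new_local_name):
        """Point an existing logical name at new storage; the name is unchanged."""
        if logical_name not in self.files:
            raise KeyError(f"unknown logical name: {logical_name}")
        self.register(logical_name, new_resource, new_local_name)

    def resolve(self, logical_name):
        """Return (physical storage location, name required by that repository)."""
        logical_resource, local_name = self.files[logical_name]
        return self.resources[logical_resource], local_name
```

After a `migrate` call, `resolve` returns the new repository and local name for the same logical name, which is how a collection can survive the replacement of a storage system.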
Federation Between Data Grids
- Access controls and consistency constraints on cross registration of digital entities
- Data Collection A, managed by its own Data Grid
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
- Data Collection B, managed by its own Data Grid
  - Logical resource name space
  - Logical user name space
  - Logical file name space
  - Logical context (metadata)
  - Control/consistency constraints
- Data Access Methods (Web Browser, DSpace, OAI-PMH)
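Cross registration between two independent data grids can be sketched as follows. Everything here is an assumption made for illustration: the `Zone` class, the `shares_with` set used as the access control, and the convention of prefixing the home zone name onto the logical name. The slide does not specify how SRB implements any of these.

```python
class Zone:
    """Hypothetical independent data grid with its own catalog."""

    def __init__(self, name):
        self.name = name
        self.catalog = {}         # logical name -> record
        self.shares_with = set()  # names of zones allowed to cross-register


def cross_register(source, target, logical_name):
    """Register in `target` a reference to an entity that stays in `source`."""
    # Access control: the home zone decides which zones may see its entities.
    if target.name not in source.shares_with:
        raise PermissionError(f"{source.name} does not share with {target.name}")
    record = source.catalog[logical_name]
    remote_name = f"/{source.name}{logical_name}"
    # Consistency constraint: carry the checksum so the copy can be verified.
    target.catalog[remote_name] = {
        "home_zone": source.name,
        "logical_name": logical_name,
        "checksum": record["checksum"],
    }
    return remote_name
```

Each zone keeps its own name spaces and catalog; only a reference, governed by the home zone's sharing policy, crosses the boundary.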
Types of Risk
- Media failure
  - Replicate data onto multiple media
- Vendor specific systemic errors
  - Replicate data onto multiple vendor products
- Operational error
  - Replicate data onto a second administrative domain
- Natural disaster
  - Replicate data to a geographically remote site
- Malicious user
  - Replicate data to a deep archive
How Many Replicas
Three sites minimize risk
- Primary site
  - Supports interactive user access to data
- Secondary site
  - Supports interactive user access when first site is down
  - Provides 2nd media copy, located at a remote site, uses different vendor product, independent administrative procedures
- Deep archive
  - Provides 3rd media copy, staging environment for data ingestion, no user access
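The risk types and the three-site layout can be combined into a simple policy check. This is a sketch under stated assumptions: the `Replica` fields and the `unmitigated_risks` function are hypothetical, a replica count of two or more is used as a proxy for "multiple media", and a replica with no user access is treated as the deep archive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str
    vendor: str
    admin_domain: str
    user_access: bool


def unmitigated_risks(replicas):
    """Return the risk types that the given set of replicas does not cover."""
    risks = []
    if len(replicas) < 2:
        risks.append("media failure")
    if len({r.vendor for r in replicas}) < 2:
        risks.append("vendor specific systemic errors")
    if len({r.admin_domain for r in replicas}) < 2:
        risks.append("operational error")
    if len({r.site for r in replicas}) < 2:
        risks.append("natural disaster")
    # A deep archive is modelled as a replica that users cannot reach.
    if all(r.user_access for r in replicas):
        risks.append("malicious user")
    return risks


# Placeholder site, vendor, and domain names for a three-site plan.
plan = [
    Replica("primary", "vendor-1", "domain-1", user_access=True),
    Replica("secondary", "vendor-2", "domain-2", user_access=True),
    Replica("deep-archive", "vendor-2", "domain-3", user_access=False),
]
print(unmitigated_risks(plan))  # []
```

Dropping the deep archive from `plan` leaves "malicious user" uncovered, which is the argument for a third copy rather than two.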
Deep Archive
[Figure labels: zones Z1, Z2, Z3; Deep Archive; Staging Zone; Remote Zone; Register Z2:D2:U2, Pull; Register Z3:D3:U3, Pull; Firewall; Server initiated I/O; PVN; No access by Remote zones]
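The pull-only data flow suggested by this slide can be modelled in a few lines. The zone roles and the direction of each pull are an interpretation of the figure labels, and the `Zone` class with its `pulls_from` set is hypothetical; the sketch shows only that data moves when the receiving side initiates the transfer, so nothing outside can write into the deep archive.

```python
class Zone:
    """Hypothetical zone that can only receive data by pulling it."""

    def __init__(self, name, pulls_from=()):
        self.name = name
        self.pulls_from = set(pulls_from)  # zones this zone may read from
        self.data = {}

    def pull(self, source, logical_name):
        """Server-initiated I/O: this zone copies data from `source`."""
        if source.name not in self.pulls_from:
            raise PermissionError(f"{self.name} may not access {source.name}")
        self.data[logical_name] = source.data[logical_name]


# Assumed roles: remote zone -> staging zone -> deep archive.
remote = Zone("remote")
staging = Zone("staging", pulls_from={"remote"})
deep = Zone("deep-archive", pulls_from={"staging"})

remote.data["/col/file1"] = b"payload"
staging.pull(remote, "/col/file1")  # staging registers and pulls from remote
deep.pull(staging, "/col/file1")    # deep archive pulls from staging
```

There is no push operation in this model, and the remote zone has an empty `pulls_from` set, so it can neither write to nor read from the deep archive.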
For More Information
Reagan W. Moore
San Diego Supercomputer Center