Compute Resources – HPC centers, institutional clusters
DFC Collaboration Environment – Data Grid
Community Resources – Repository, Catalog

DFC Vision
Build collaboration environment
– Sharing of data, information, and knowledge
Form national data cyberinfrastructure
– Federation of existing data management systems
Support reproducible data-driven research
– Encapsulate knowledge within shared workflows
Enable student participation in research
– Policy-controlled analysis of “live” data

Data Driven Science and Engineering Collaboration Environments
– Oceanography – Ocean Observatories Initiative
  Archiving climatic data records from real-time sensor data streams
– Engineering – CIBER-U
  Engineering Digital Library: curating civil engineering data, materials data, archaeology data, student training materials
– Hydrology – EarthCube
  Automating hydrology research workflows (data retrieval, transformation, analysis)
– Plant biology – the iPlant Collaborative
  Enabling collaborative research across existing data repositories
– Cognitive science – the Temporal Dynamics of Learning Center
  Managing research data, applying IRB policies
– Social Science – the Odum Institute
  Integrating policy-based data management with the existing Dataverse repository

Challenges
Federated national data cyberinfrastructure
– Existing projects have web services, data repositories, digital libraries, archives, processing pipelines, science portals
– What are the interoperability mechanisms needed to enable federation of existing resources?

DFC Builds on the iRODS data grid (integrated Rule-Oriented Data System)
1. Astrophysics – Auger supernova search
2. Atmospheric science – NASA Langley Atmospheric Sciences Center
3. Biology – Phylogenetics at CC IN2P3
4. Climate – NOAA National Climatic Data Center
5. Cognitive Science – Temporal Dynamics of Learning Center
6. Computer Science – GENI experimental network
7. Cosmic Ray – AMS experiment on the International Space Station
8. Dark Matter Physics – Edelweiss II
9. Earth Science – NASA Center for Climate Simulations
10. Ecology – CEED Caveat Emptor Ecological Data
11. Engineering – CIBER-U
12. High Energy Physics – BaBar / Stanford Linear Accelerator
13. Hydrology – Institute for the Environment, UNC-CH; Hydroshare
14. Genomics – Broad Institute, Wellcome Trust Sanger Institute, NGS
15. Medicine – Sick Kids Hospital
16. Neuroscience – International Neuroinformatics Coordinating Facility
17. Neutrino Physics – T2K and dChooz neutrino experiments
18. Oceanography – Ocean Observatories Initiative
19. Optical Astronomy – National Optical Astronomy Observatory
20. Particle Physics – Indra multi-detector collaboration at IN2P3
21. Plant genetics – the iPlant Collaborative
22. Quantum Chromodynamics – IN2P3
23. Radio Astronomy – Cyber Square Kilometer Array, TREND, BAOradio
24. Seismology – Southern California Earthquake Center
25. Social Science – Odum, TerraPop

Policy Concept Graph
[Diagram: concept graph for policy-based data management. Recoverable labels follow.]
– Concepts: Collection, Digital Object, Attribute, Purpose, Property, Policy, Procedure, Workflow, Function, Operation, Persistent State Information, Client Action, Periodic Assessment Criteria, Policy Enforcement Point
– Relationship labels: Has, Defines, Controls, Updates, Invokes, Chains, Isa, HasFeature, SubType
– Feature labels: Completeness, Correctness, Consensus, Consistency
– Property examples: Integrity, Authenticity, Access control
– Policy examples: Replication Policy, Checksum Policy, Quota Policy, Data Type Policy
– Operation examples: GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj
– Persistent state examples: DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM

Policy-based Data Management – Implementation in iRODS
[Diagram: the policy concept graph annotated with iRODS counts. Recoverable labels follow.]
– Purpose (5 main types)
– SubType labels: Archive, Data grid, Collection, Digital Library, Processing Pipeline
– Feature labels: Completeness, Correctness, Consensus, Consistency
– Property (7 default), including Integrity, Authenticity, Access control
– Policy (11 default), including Replication Policy, Checksum Policy, Quota Policy, Data Type Policy
– Procedure (11 default)
– Clients (50)
– Policy Enforcement Points (70)
– Micro-service (317), including msiGetUserACL, msiSetDataType, msiSetQuota, msiDataObjRepl, msiSysChksumDataObj
– Persistent State Information (338), including DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM
– Other concepts: Collection, Digital Object, Attribute, Workflow, Operation, Periodic Assessment Criteria
– Relationship labels: Has, Defines, Controls, Updates, Invokes, Chains, Isa, HasFeature, SubType
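
The chain on this slide (client action, policy enforcement point, policy, micro-services, persistent state) can be sketched in a few lines. This is only an illustrative model, not iRODS code: the two functions are plain Python stand-ins named after msiSysChksumDataObj and msiDataObjRepl, and the enforcement-point name "post_put", the policy keys, and the in-memory object are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DigitalObject:
    """Hypothetical in-memory object; state keys follow the slide's names."""
    data_id: int
    content: bytes
    state: Dict[str, object] = field(default_factory=dict)


MicroService = Callable[[DigitalObject], None]


def sys_chksum_data_obj(obj: DigitalObject) -> None:
    """Stand-in for msiSysChksumDataObj: record a checksum as persistent state."""
    obj.state["DATA_CHECKSUM"] = hashlib.sha256(obj.content).hexdigest()


def data_obj_repl(obj: DigitalObject) -> None:
    """Stand-in for msiDataObjRepl: only counts one more replica."""
    obj.state["DATA_REPL_NUM"] = int(obj.state.get("DATA_REPL_NUM", 0)) + 1


# A policy controls a procedure, modeled here as a chain of micro-services.
POLICIES: Dict[str, List[MicroService]] = {
    "checksum_policy": [sys_chksum_data_obj],
    "replication_policy": [data_obj_repl],
}

# A policy enforcement point maps a client action to the policies it invokes.
# The name "post_put" is an assumption for this sketch.
ENFORCEMENT_POINTS: Dict[str, List[str]] = {
    "post_put": ["checksum_policy", "replication_policy"],
}


def enforce(point: str, obj: DigitalObject) -> None:
    """Run every micro-service of every policy bound to the enforcement point."""
    obj.state["DATA_ID"] = obj.data_id
    for policy_name in ENFORCEMENT_POINTS.get(point, []):
        for micro_service in POLICIES[policy_name]:
            micro_service(obj)


def assess_integrity(obj: DigitalObject) -> bool:
    """Periodic assessment: does the recorded checksum still match the content?"""
    current = hashlib.sha256(obj.content).hexdigest()
    return obj.state.get("DATA_CHECKSUM") == current
```

Calling enforce("post_put", obj) after an upload records the state, and assess_integrity(obj) can then be run periodically to verify the integrity property against that state.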

Federation Approach
Use middleware to implement unifying name spaces for:
1. Users – Single sign-on
2. Collections – Directories, workflow, time series
3. Objects – Files, soft links, workflows
4. Storage systems – Cloud, tape, file systems, objects
5. Metadata – Provenance, description, state
6. Policies – Management, assessment
7. Micro-services – Procedures, interactions
DFC - CNI

DFC Federation Hub
[Diagram: hub zone federated with partner zones.]
Hub – Zone: dfcmain, Port: 1237
Hub catalog and resources – iCAT, hydroResc, res-bk15, res-dfcmain, demoResc
Federated zones (port) – renci (1247), ooi (1247), TDLC (6688), odumMain (1247), dfctest (1248), engineering (1247), hydrology (2823)
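
As a rough illustration of how a client might use this federation table, the sketch below maps a logical path to the zone and port that serve it, relying on the iRODS convention that a logical path starts with the zone name. The zone names and ports are copied from the slide; host names are not given there, so none appear, and the function name resolve_zone is hypothetical.

```python
from typing import Dict, Tuple

# Hub zone and port exactly as printed on the slide.
HUB_ZONE = "dfcmain"
HUB_PORT = 1237

# Federated zones and their ports, as listed on the slide.
FEDERATED_ZONES: Dict[str, int] = {
    "renci": 1247,
    "ooi": 1247,
    "TDLC": 6688,
    "odumMain": 1247,
    "dfctest": 1248,
    "engineering": 1247,
    "hydrology": 2823,
}


def resolve_zone(logical_path: str) -> Tuple[str, int]:
    """Return (zone, port) for a logical path of the form /zone/..."""
    if not logical_path.startswith("/"):
        raise ValueError("logical path must be absolute")
    zone = logical_path.split("/")[1]  # first component after the leading slash
    if zone == HUB_ZONE:
        return zone, HUB_PORT
    if zone not in FEDERATED_ZONES:
        raise KeyError(f"zone not in federation: {zone!r}")
    return zone, FEDERATED_ZONES[zone]
```

For example, resolve_zone("/hydrology/home/alice/run1.nc") yields the hydrology zone and its port, while a path in an unlisted zone raises KeyError.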

National Infrastructure
[Diagram: layered architecture.]
Research Environment – Portals, Applications, Workflows
DFC Collaboration Environment – Data Grid
Community Resource Repository, Community Resource Catalog, Community Resource Services
Existing infrastructure – XSEDE, Kepler, OOI, TDLC, iPlant, CUAHSI, NCDC, Dataverse, GeoBrain, DataONE, NCSA Polyglot

The Challenge: Support reproducible data-driven research
Deliver the capability to manage, mine, and publish knowledge through collaboration environments.
[Diagram labels: Experiments, Archives, Sensors, Literature, Simulation]
The Future: Reproducible Research

National Infrastructure Approach
1. Build national data cyberinfrastructure prototype
– Support multiple science and engineering domains by loosely coupling their existing infrastructure with a collaboration environment
2. Develop generic interoperability framework
– Define the generic infrastructure needed for the national infrastructure to manage knowledge as well as data and information
3. Define interoperability mechanisms
– Support access across the disparate types of infrastructure in common use
4. Define domain specific extensions
– Support three levels: technical interoperability, project level policy, and end user usage requirements

Interoperability Mechanisms
[Diagram: three groups of mechanisms; labels listed in slide order.]
– Data: Data Access, Data Manipulation, Micro-services, Storage Driver
– Information: Collection Registration, Information Exchange, Soft Links, Message Queue, Information Manipulation, Database Query
– Knowledge: Knowledge Creation, Analysis Workflows, Knowledge Management, Procedures: Micro-services
Policies control execution of each interoperability mechanism

DataNet Interoperability
[Diagram: DataNet systems connected through the DFC collaboration environment.]
Research Environment – Portals, Applications, Workflows
DFC Collaboration Environment
Links – Message Queue, Web Service
Connected systems – DataONE Member Node, DataONE Coordinating Node, TerraPop Server, SEAD Portal (VIVO), SEAD Engagement Center, SEAD Data, DFC Data Grid

DFC Interoperability Layers
(Layer – Mechanism: examples)
Authentication – PAM / GSSAPI: InCommon, GSI, Kerberos, Shibboleth, LDAP
Workflows – Micro-Services: Kepler, NCSA Cyberintegrator, Taverna, NCSA Polyglot
Data Manipulation – Format Drivers: NetCDF, HDF5, THREDDS, ERDDAP
Networks – Network Drivers: HTTPS, TCP/IP, Parallel TCP/IP, RBUDP
Data Access – Micro-Services: DataONE, Data Conservancy, CUAHSI, NCDC
Clients – OpenSocial: Web browsers, Web Services, Workflows, FUSE, Synchronization, MediaWiki
Vocabulary – Micro-Services: HIVE, (Cheshire)
Messaging – Micro-Services: AMQP, iRODS Xmsg
Management – Policies: (RDA Policies), (ISO Criteria)
Storage Systems – Storage Drivers: File Systems, Tape Archives, Object Stores, Cloud Storage

Interoperability Mechanisms
Drivers
– Encapsulate knowledge to support your operations at the remote repository: partial I/O, parsing of formats, manipulation of data structures
– Authentication, format, storage
Micro-services
– Encapsulate knowledge needed to interact with an external system or with a data set using the remote protocol
– Data access, external workflows, semantics, messaging
Policies
– Encapsulate knowledge needed for management functions
– Federation control, administrative tasks, validation checks
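
One way to picture the three mechanism types is as three small interfaces, with a policy deciding whether a micro-service may run on top of a driver. The sketch below is an assumption-laden illustration, not the iRODS plugin API: every class and function name is hypothetical, and the driver is an in-memory stand-in for a remote repository.

```python
from abc import ABC, abstractmethod
from typing import Dict


class Driver(ABC):
    """Encapsulates access to a repository, here reduced to partial I/O."""

    @abstractmethod
    def read(self, name: str, offset: int, length: int) -> bytes: ...


class MicroService(ABC):
    """Encapsulates one interaction with a data set through a driver."""

    @abstractmethod
    def run(self, driver: Driver, name: str) -> bytes: ...


class Policy(ABC):
    """Encapsulates a management decision, here a validation check."""

    @abstractmethod
    def permits(self, user: str, name: str) -> bool: ...


class InMemoryDriver(Driver):
    """Hypothetical storage driver backed by a dictionary."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = objects

    def read(self, name: str, offset: int, length: int) -> bytes:
        return self._objects[name][offset:offset + length]


class FetchHeader(MicroService):
    """Hypothetical micro-service: read only the first four bytes."""

    def run(self, driver: Driver, name: str) -> bytes:
        return driver.read(name, 0, 4)


class OwnerOnly(Policy):
    """Hypothetical policy: only the registered owner may access an object."""

    def __init__(self, owners: Dict[str, str]) -> None:
        self._owners = owners

    def permits(self, user: str, name: str) -> bool:
        return self._owners.get(name) == user


def execute(policy: Policy, service: MicroService, driver: Driver,
            user: str, name: str) -> bytes:
    """The policy controls execution of the micro-service."""
    if not policy.permits(user, name):
        raise PermissionError(f"{user} may not access {name}")
    return service.run(driver, name)
```

Keeping the three roles separate means a new storage system needs only a new driver, and a new management rule needs only a new policy, with the micro-services left unchanged.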

Assertion
Three basic types of interoperability mechanisms are sufficient for assembling national data cyberinfrastructure
Example: Linked software-defined networks to data grids
– From an iRODS data grid, controlled the selection of three disjoint network paths for optimizing data transport by adding appropriate policy enforcement points and micro-services
Expect functionality currently in data grid middleware to migrate into network middleware
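
The slide does not say how a path was chosen among the three disjoint network paths. As a minimal sketch of what such a policy could look like, the code below picks the path with the lowest estimated transfer time for a given object size. The selection criterion, the NetworkPath fields, and all names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class NetworkPath:
    """Hypothetical description of one network path."""
    name: str
    bandwidth_mbps: float
    latency_ms: float


def estimated_transfer_seconds(path: NetworkPath, size_bytes: int) -> float:
    """Simple model: one-way latency plus serialization time at full bandwidth."""
    bits = size_bytes * 8
    return path.latency_ms / 1000.0 + bits / (path.bandwidth_mbps * 1_000_000)


def select_path(paths: Sequence[NetworkPath], size_bytes: int) -> NetworkPath:
    """Policy decision: choose the path with the lowest estimated transfer time."""
    if not paths:
        raise ValueError("no network paths available")
    return min(paths, key=lambda p: estimated_transfer_seconds(p, size_bytes))
```

Under this model a small object favors a low-latency path and a large object favors a high-bandwidth path, which is the kind of decision a policy enforcement point could make just before a transfer starts.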

Future Architecture
[Diagram: two layered stacks; labels listed below.]
– Clients, Data Grid Middleware, Resources
– Clients, Network Middleware, Data Grid Middleware, Resources
Labels – DFC Federation, GEMI - GENI, Virtual collection, Virtual network
