National Science Foundation Cooperative Agreement: OCI-0940841
Reagan Moore, PI
Mary Whitton, Project Manager


2 Policy Topics
Policy-based Data Management
Practical Policy Working Group outcomes
– Data Center policies
Applications
– DataNet Federation Consortium analyzed 175 policies for
  Data sharing (research collaborations)
  SILS Digital library (personal collections)
  RDA Practical Policy (data centers)
  UNC-CH Protected data (secure medical workspace)
  Odum/Dataverse (archive)
  NSF data management plans (publication)
– Science Observatory Network (real-time sensor data)
– PECE/RPI (anthropology)
– NOAA NCDC (archive)

3 Policy-based Data Management

4 Summary of the Problem
Practical Policy: Assertion or assurance that is enforced about a (data) collection (data set, digital object, file) by the creators of the collection
Computer actionable policies are used to
– enforce data management
– automate administrative tasks
– validate compliance with assessment criteria
– automate scientific data processing and analyses
Users motivated by issues related to scale, distribution

5 Practical Policy Working Group

6 Practical Policy members represented
– 11 types of data management systems
– 30 institutions
– 2 testbeds
  iRODS
    Renaissance Computing Institute,
    DataNet Federation Consortium – DFC
  GPFS
    Institute of Physics of the Academy of Sciences, CESNET
    Garching Computing Centre – RZG
Published two documents
– Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Templates”, February, 2015
– Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Implementations”, February, 2015
Policy Templates

7 Data Center Policies
Contextual metadata extraction – Automate extraction of metadata from files
Data access control – Automate application of appropriate access controls
Data backup – Automate creation of replicas
Data format control – Automate identification of data format
Data retention – Apply a retention period
Disposition – Apply a disposition policy at end of retention period
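The retention and disposition policies above can be sketched as follows. This is a minimal illustration in plain Python, not the working group's implementation; the `Record` fields and the disposition values "delete" and "archive" are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Record:
    """Hypothetical state kept for each file in a collection."""
    name: str
    ingested: date
    retention_days: int   # retention period applied at ingestion
    disposition: str      # assumed values: "delete" or "archive"


def due_for_disposition(records: list[Record], today: date) -> list[tuple[str, str]]:
    """Return (name, disposition) for records whose retention period has ended."""
    due = []
    for rec in records:
        expires = rec.ingested + timedelta(days=rec.retention_days)
        if today >= expires:
            due.append((rec.name, rec.disposition))
    return due
```

A periodic job could call `due_for_disposition` and hand each result to whatever action the collection's disposition policy names.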

8 Data Center Policies
Integrity (including replication) – Verify integrity and replace bad copies
Notification – Manage events about changes to the collection
Restricted searching – Manage searches on collection
Storage cost reports – Generate cost report
Use agreements – Manage use agreements before data are retrieved
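The integrity policy (verify integrity and replace bad copies) can be sketched as below. This is an illustration on local files using only the Python standard library; in a data grid the replicas would live on separate storage systems. The choice of SHA-256 and the function names are assumptions for the example.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_and_repair(replicas: list[Path], expected: str) -> list[Path]:
    """Check each replica against the checksum recorded at ingestion and
    overwrite bad or missing copies from a good one.
    Returns the list of replicas that were repaired."""
    good = [p for p in replicas if p.exists() and sha256_of(p) == expected]
    if not good:
        raise RuntimeError("no valid replica left to repair from")
    bad = [p for p in replicas if p not in good]
    for path in bad:
        shutil.copyfile(good[0], path)
    return bad
```

The returned list doubles as input for a notification or report policy, since it records which copies had to be replaced.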

9 Digital Library Management

10 LifeTime Library Policies
Requirements
– Enable students to create a personal digital collection
– Provide pedagogy mechanisms for experimenting with:
  Naming: File names
  Arrangement: Organization in collections
  Description: Tags and metadata
  Access controls: Sharing and publication
  Ingestion: Controlled loading of data
  Distribution: Storage locations

11 Student Experiences
Students invariably:
– Changed their minds about the purpose of the collection
– Changed their minds about the description
  Term definitions tended to drift over the semester
– Changed their minds about the arrangement
  Added new collections for additional types of data
Resulting collections had:
– 1,000 to 10,000 files
– 2 gigabytes to 150 gigabytes in size
– 4 to 10 metadata attributes per file

12 Protected Data

13 Protected Data Management
UNC-CH has published an administrator’s guide for the management of protected data. This includes:
– PII: Personally Identifiable Information
– PHI: Protected Health Information
– PCI: Payment Card Industry information
The question is whether each of the tasks specified in the guide can be automated as policies enforced by the data grid.
See Chapter 6 of the Policy Examples Workbook
– This specifies 51 tasks that should be managed by the administrator

14 Protected Data Tasks
1 Check for presence of PII on ingestion
2 Check for viruses on ingestion
3 Check passwords for required attributes
4 Encrypt data on ingestion
5 Encrypt data transfers
6 Federation - control data copies (access control)
7 Federation - manage remote data grid interactions (update rule base)
8 Federation - periodically copy data
9 Federation - manage data retrieval (update access controls)
10 Generate checksum on ingestion
11 Generate report of corrections to data sets or access controls
12 Generate report for cost (time) required to audit events
13 Generate report of types of protected assets present within a collection
14 Generate report of all security and corruption events
15 Generate report of the policies that are applied to the collections
16 List all storage systems being used
17 List persons who can access a collection

15 Protected Data Tasks
18 List staff by position and required training courses
19 List versions of technology that are being used
20 Maintain document on independent assessment of software
21 Maintain log of all software changes, OS upgrades
22 Maintain log of disclosures
23 Maintain password history on user name
24 Parse event trail for all accessed systems
25 Parse event trail for all persons accessing collection
26 Parse event trail for all unsuccessful attempts to access data
27 Parse event trail for changes to policies
28 Parse event trail for inactivity
29 Parse event trail for updates to rule bases
30 Parse event trail to correlate data accesses with client actions
31 Provide test environment to verify policies on new systems
32 Provide test system for evaluating a recovery procedure
33 Provide training courses for users
34 Replicate data sets on ingestion

16 Protected Data Tasks
35 Replicate iCAT periodically
36 Set access approval flag
37 Set access controls
38 Set access restriction until approval flag is set
39 Set approval flag per collection for enabling bulk download
40 Set asset protection classifier for data sets based on type of PII
41 Set flag for whether tickets can be used on files in a collection
42 Set lockout flag and period on user name - counting number of tries
43 Set password update flag on user name
44 Set retention period for data reviews
45 Set retention period on ingestion
46 Track systems by type (server, laptop, router, …)
47 Verify approval flags within a collection
48 Verify files have not been corrupted
49 Verify presence of required replicas
50 Verify that no controlled data collections have public or anonymous access
51 Verify that protected assets have been encrypted

17 Task Automation
There are some unifying requirements across tasks:
– Checking material for PII, viruses
– Management of passwords
– Generation of log files for all actions done
– Creation of state information to track processes
– Management of encryption
– Management of access controls
– Generation of audit trails
– Parsing of events to demonstrate compliance over time
– Verification that processes were correctly applied
Many of these requirements can also be applied to digital libraries and research collaborations
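Parsing of events can be illustrated with a small sketch that combines task 26 (parse event trail for unsuccessful attempts to access data) and task 42 (set lockout flag, counting number of tries). The event tuple layout, the "denied" outcome value, and the threshold of three tries are assumptions made for the example; a real audit trail would have its own schema.

```python
from collections import Counter

# Hypothetical event records: (user name, action, outcome).
EVENTS = [
    ("alice", "read", "ok"),
    ("bob", "login", "denied"),
    ("bob", "login", "denied"),
    ("bob", "login", "denied"),
    ("carol", "read", "denied"),
]


def failed_attempts(events):
    """Count unsuccessful access attempts per user name (task 26)."""
    return Counter(user for user, _action, outcome in events if outcome == "denied")


def users_to_lock(events, max_tries=3):
    """Return, sorted, the user names whose failures reach max_tries (task 42)."""
    counts = failed_attempts(events)
    return sorted(user for user, n in counts.items() if n >= max_tries)
```

Keeping the count and the lockout decision in separate functions lets the same parsed counts feed a compliance report as well as the lockout flag.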

18 Preservation

19 Cross-Disciplinary Data Discovery and Geographically Distributed Preservation
(DFC April 2013 NSF Review)

20 Archive Policies
The Dataverse network has about 800 gigabytes of data that may contain protected information.
An archive is needed with independent management of the material to ensure recovery in the case of a disaster.
– Digital objects and provenance metadata must be reloadable into Dataverse.
– Assessment criteria need to be evaluated to verify integrity.
– Access controls must be enforced on restricted data.
– Dataverse naming convention must be retained.
Approach is to replicate the data holdings into an iRODS data grid.

21 Policies
See Chapter 5 of the Policy Examples Workbook
– Odum preservation policies
Preservation tasks include:
– Staging files between Dataverse and iRODS
– Checking data for presence of protected information
– Periodic verification of integrity and replicas
– Verification of access controls
– Reports on usage statistics

22 NSF Data Management Plans

23 The National Science Foundation has mandated that every project provide a 2-page description of how data will be managed.
Each NSF directorate published guidelines on what the data management should include.
An analysis of 12 sets of requirements identified 38 data management tasks that could be automated.
See Chapter 7 of Policy Template Workbook

24 NSF DMP Requirements

25 NSF DMP Requirements

26 Science Observatory Network

27 Real-Time Sensor Data
Harvest sensor data from the Antelope Real Time Sensor orb.
– Manages environmental, oceanic, seismic data
– More than 3,000 sensors across the US
Register each sensor as an independent collection
– Retrieve the most recent sensor data
– Harvest sensor data periodically
– Transform to JSON, netCDF
– Provide access to archived data
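Two of the steps above, retrieving the most recent sensor data and transforming it to JSON, can be sketched in plain Python. The reading layout (a dict with `time` and `value` keys) and the sensor identifier are assumptions for illustration; this does not use or represent the Antelope interface.

```python
import json


def latest_reading(readings):
    """Return the reading with the largest timestamp, or None if there are none."""
    return max(readings, key=lambda r: r["time"], default=None)


def to_json(sensor_id, reading):
    """Serialize one reading, tagged with its sensor, as a JSON string."""
    return json.dumps({"sensor": sensor_id, **reading}, sort_keys=True)
```

A periodic harvest could call `latest_reading` per registered sensor and store the `to_json` result in that sensor's collection.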

28 PECE / RPI

29 Collection Management Policies
– Contextual metadata extraction
– Data access control
– Data backup
– Data format control
– Data retention
– Disposition
– Integrity (including replication)
– Notification
– Restricted searching
– Storage cost reports
– Use agreements

30 NOAA NCDC

31 NOAA Climatic Data Center
Manages an archive of climate data records received from multiple sources
– Uses a staging area to
  Check input data for viruses
  Manage ingestion into a tape archive
Challenges
– Needed a way to improve security
  Eliminate direct access to storage within the NOAA firewall
– Needed a way to automate management of each file
  Verify archival storage before file is deleted
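The last requirement, verifying archival storage before a file is deleted, can be sketched as a checksum comparison between the staged file and its archived copy. This is an illustration on local paths with the Python standard library, not NCDC's procedure; the function names and the use of SHA-256 are assumptions.

```python
import hashlib
from pathlib import Path


def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def delete_if_archived(staged: Path, archived: Path) -> bool:
    """Delete the staged file only when the archived copy exists and matches it.
    Returns True if the staged file was removed."""
    if archived.exists() and checksum(archived) == checksum(staged):
        staged.unlink()
        return True
    return False
```

Returning a boolean rather than raising lets a clean-up pass skip files that are not yet safely archived and retry them later.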

32 [Figure: network diagram. Labels: External Providers (FTP/FTPS); NCDC External Firewall; FTP Load Balance; ftp1, ftp2, ftp3, ftp4, ftp5; DMZ Landing Zone: Open for data delivery; iRODS DMZ Grid (/DMZ, /Archive, /NR2, /NR3); DMZ Firewall; iRODS Secure Ingest; NCDC Internal Network (FTP PUSH/PULL); iRODS NCDC Grid (/NCDC, /Ingest, /Archive, /NR2, /NR3); ingest1, ingest2; Disk Cache; Tape; HDSS]
iRODS is:
– Secure authentication
– Security via Obscurity (one to bind them)
– Uses a pull mechanism to move data into NCDC grid
– A virtual management tool (clean-up)
– Scope is entire grid

33 Policy Examples Workbook
Policy Templates Workbook

