Working Group: Practical Policy Rainer Stotzka, Reagan Moore.


1 Working Group: Practical Policy Rainer Stotzka, Reagan Moore

2 Agenda
Thursday, March 27, 2014, 3:30-5:00 PM
  - Introduction to policy-based data management
  - Discussion of data policy manager for EUDAT (Mark van de Sanden)
  - Presentation on natural language rule processing (Chitta Baral)
  - Initial presentation of summary of policies across data centers and research projects (Jewel Ward)
Friday, March 28, 2014, 11:00 AM-12:30 PM
  - Discussion of policy summary
  - Identification of best practices
  - Discussion of policy testing: interoperability testbed
  - Integration with deliverables from other working groups
      - Persistent identifiers
      - Linked-data: HIVE
      - Type registry
      - Data Foundation and Terminology
      - Preservation interest group

3 Practical Policy Working Group Focuses:
  - Identify the most important policies
  - Practical implementations for managing research data collections
  - Provide recommendations for a “starter kit”
  - Testbeds:
      - Evaluate standard policies
      - Test interoperability across WGs
Policy: Assertion or assurance that is enforced about a collection or a dataset

4 Concept Graph by Reagan Moore
[Figure: concept graph. Concept labels as extracted: Collection, Purpose, Policy, Property, Procedure, Persistent State Information, Consistency, Integrity, Workflow, Function, SysChksumDataObj. Relation labels as extracted: Defines, Controls, Updates, HasFeature, Isa, Chains.]

5 Concept Graph by Reagan Moore
[Figure: extended concept graph. Concept labels as extracted: Collection, Purpose, Completeness, Correctness, Consensus, Consistency, Attribute, Policy, Property, Procedure, Client Action, Periodic Assessment Criteria, Policy Enforcement Point, Workflow, Function, Operation, Persistent State Information, Digital Object, Replication Policy, Checksum Policy, Quota Policy, Data Type Policy, Integrity, Authenticity, Access control, GetUserACL, SetDataType, SetQuota, DataObjRepl, SysChksumDataObj, DATA_ID, DATA_REPL_NUM, DATA_CHECKSUM. Relation labels as extracted: Defines, HasFeature, Has, Controls, Updates, Invokes, Has SubType, Isa, Chains.]

6 Policy Categories Collection-based Policies Integrity Data Lifecycle Management Data Staging Federation Description Publication Compliance Data Management Plans Access Control Preservation Provenance Replication Regulatory Management Administrative Assessment

7 Management
  - List of policies in the RDA Wiki
  - Monthly telephone conferences (RDA)
  - “Policy of the month”: review of policies that have been submitted
  - 54 persons registered
Testbeds
  - iRODS: Renaissance Computing Institute
  - E-iRODS: DataNet Federation Consortium (DFC)
  - dCache: Institute of Physics of the Academy of Sciences, CESNET
  - DataVerse: Odum Institute

8 Interactions with other WGs
Data Foundation and Terminology WG
  - Discussion of a vocabulary for operations
Preservation Infrastructure IG
  - Policies for preservation
Persistent Identifiers
  - Properties versus operations on identifiers
Data Citation WG
  - Type registry
Metadata
  - Linked-data vocabularies

9 EUDAT Data Policy Manager
  - Peisar – Storage Policies at CESNET

10 Natural Language Rule Processing
  - Why?
      - Users or domain experts need not learn the syntax of the rule language.
      - They specify their rules using natural language.
  - How?
      - A natural language specification of rules is translated, in two steps, to rules in the syntax of the rule language.
      - Step 1: Natural language to an intermediate language (the focus is on correct translation of natural language and on dealing with its challenges and quirks).
      - Step 2: Intermediate language to rule language (this should be more straightforward, as both languages are formal and the intermediate language has a very restricted vocabulary).
  - Our focus in this presentation is on Step 1.

11 Underlying Technical Approach
Montague’s approach:
  - The meanings of words and phrases are lambda calculus formulas.
  - The meaning (or translation) of a sentence is obtained by combining the meanings of its words and phrases, usually as dictated by a grammar.
  - Categorial grammars (especially CCG) are often used because they give directionality regarding how to combine.

12 NL to Policy Example
Meanings of the parts:
  λz. print(z)
  λy. y@finance
  λx. report(x)
Combination:
  (λy. y@finance) @ (λx. report(x))
  = (λx. report(x)) @ finance
  = report(finance)
  print(report(finance))
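The reduction on this slide can be replayed in a few lines of Python. This is only an illustrative sketch: formal terms are encoded as plain strings, and the function names (`report`, `print_term`, `apply_to_finance`) are hypothetical and not part of NL2KR.

```python
# Illustrative sketch only: formal terms are plain strings, and all
# function names here are hypothetical (they are not NL2KR APIs).

def report(x):
    """λx. report(x)"""
    return f"report({x})"

def print_term(z):
    """λz. print(z)"""
    return f"print({z})"

def apply_to_finance(y):
    """λy. y@finance : apply the function y to the constant 'finance'."""
    return y("finance")

# (λy. y@finance) @ (λx. report(x))  reduces to  report(finance)
inner = apply_to_finance(report)
# (λz. print(z)) @ report(finance)   reduces to  print(report(finance))
policy = print_term(inner)
```

Here `inner` holds the meaning of the inner phrase and `policy` the meaning of the whole expression.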

13 Illustration of Montague’s approach using CCG and λ-calculus
  - Every boxer walks.

14 The Key Issue(s)
Where do we get the lambda expressions from?
  - Handcrafting them is not scalable
  - Lambda expressions get complex in a hurry, and handcrafting creates a bottleneck
  - Too many words
  - Since the target language is not unique, we cannot painstakingly make new dictionaries for each target language
  - Target languages evolve
Other standard issues
  - Ambiguity: multiple meanings of words; word sense disambiguation; etc.

15 How to get the lambda expressions?
How do we learn natural languages?
  - Often
      - We know the meaning of a sentence
      - We know the meaning of most of the individual words in that sentence
      - But we do not know a priori the meaning of some particular word(s) in that sentence
      - We are able to correctly guess the meaning of those words
  - Follow a similar approach
      - Given a set of training examples and an initial dictionary, learn the lambda expressions for the words in those examples that are not in the dictionary
      - Inverse lambda operators

16 Inverse λ Example
  - Every boxer walks.

17 Inverse λ – another Example
[Figure: derivation tree. Node labels as extracted:]
  λz. print(z)
  λx. report(x)
  report(finance)
  print(report(finance))
  λy. y@finance
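The inverse-λ step can be illustrated with a deliberately simplified Python sketch. It assumes terms are plain strings of the form `functor(argument)` and that the known word meaning has the form λx. functor(x); the function names `inverse_right` and `inverse_left` are hypothetical and do not reflect the actual NL2KR operators, which handle far more general terms.

```python
# Simplified sketch of the inverse-λ idea; it handles only terms of the
# form functor(argument). Function names are hypothetical, not NL2KR APIs.

def inverse_right(h, functor):
    """Given H and G = λx. functor(x), return A such that G @ A == H."""
    prefix = functor + "("
    if not (h.startswith(prefix) and h.endswith(")")):
        raise ValueError(f"{h!r} is not of the form {functor}(...)")
    return h[len(prefix):-1]

def inverse_left(h, functor):
    """Given H and G = λx. functor(x), return F such that F @ G == H.

    With H = functor(A), the answer is F = λy. y@A.
    """
    return f"λy. y@{inverse_right(h, functor)}"

sentence = "print(report(finance))"
phrase = inverse_right(sentence, "print")   # meaning of the remaining phrase
unknown = inverse_left(phrase, "report")    # meaning of the unknown word
```

Starting from the sentence meaning and the two known entries, the sketch recovers first the phrase meaning and then the entry that was missing from the dictionary.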

18 Another Example
[Figure: derivation tree. Node labels as extracted:]
  λz. send(email,z)
  λx. curator(x)
  curator(collection)
  send(email, curator(collection))
  λy. y@collection
  λy. λz. send(y,z)
  email
  λx.x
  collection
  λx.λy. y@x

19 NL2KR System Architecture
[Figure: architecture diagram with components NL2KR-L and NL2KR-T]

20 NL2KR-L System Learning Process
  - Generate all parse trees of the sentences
  - Learn lexicon using Inverse-λ and Generalization
  - Generalize complete lexicon
  - Parameter Estimation

21 NL2KR-T System Translation Process
  - Generate all parse trees of the sentences
  - Generalize the missing meanings of words and recompute parse trees
  - PCCG to rank the translation

22 Current Status
  - We have a prototype that translates English descriptions of policy rules to a formal representation
  - Working towards making it usable in iRODS
      - Step 1: English to a formal policy specification (in an intermediate language)
      - Step 2: Formal policy specification to rules (in a lower-level language)

23 Illustration: Training Data Set
Policy | IPDL Translation
Generate audit_trail for all changes to rules | generate(audit_trail(changes(rules)))
Transfer ownership to rods | transfer(ownership, rods)
Generate report listing all preservation_attributes | generate(report(list(preservation_attributes)))
Migrate files to new storage | migrate(files, storage(new))
Protect the integrity of Data_folder | protect(integrity(data_folder))
Generate audit_trail for notifications on problems | generate(audit_trail(notifications(problems)))
Create AIP template from SIP template | create(template(aip); template(sip))
Create rule based-on AIP template | create(rule; template(aip))
On deletion of files from collection erase metadata | When deletion(collection(files)); do erase(metadata)
Generate report summarizing information of micro_services | generate(report(summary(information(micro_services))))
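To show what a consumer of these translations might do in Step 2, here is a minimal Python sketch that parses the nested functional form into a tree. The grammar is an assumption inferred from the examples on this slide (a name, optionally followed by parenthesized arguments separated by `,` or `;`); it does not cover the `When ...; do ...` form, and `parse_term` is a hypothetical helper, not part of NL2KR or iRODS.

```python
# Sketch of a parser for the nested functional form seen in the examples.
# The grammar is assumed, and parse_term is a hypothetical helper.

def parse_term(text):
    """Parse 'f(a, g(b))' into ('f', [('a', []), ('g', [('b', [])])])."""
    pos = 0

    def skip_ws():
        nonlocal pos
        while pos < len(text) and text[pos].isspace():
            pos += 1

    def term():
        nonlocal pos
        skip_ws()
        start = pos
        while pos < len(text) and (text[pos].isalnum() or text[pos] == "_"):
            pos += 1
        name = text[start:pos]
        if not name:
            raise ValueError(f"expected a name at position {pos}")
        args = []
        skip_ws()
        if pos < len(text) and text[pos] == "(":
            pos += 1                      # consume "("
            args.append(term())
            skip_ws()
            while pos < len(text) and text[pos] in ",;":
                pos += 1                  # consume the separator
                args.append(term())
                skip_ws()
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                      # consume ")"
        return (name, args)

    result = term()
    skip_ws()
    if pos != len(text):
        raise ValueError(f"unexpected input at position {pos}")
    return result

tree = parse_term("migrate(files, storage(new))")
```

A rule generator could then walk `tree` and map each functor to an operation in the target rule language.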

24 Illustration: Initial Lexicon
Word | CCG category | Semantics
Transfer | (S\NP)/NP | λx. λy. transfer(x,y)
ownership | N |
rods | N |
all | NP/N; N/N; NP/NP | λx. x
the | NP/N; N/N; NP/NP | λx. x
Generate | (S\NP)/NP | λx. generate(x)

25 Illustration: Iteration 1 of Inverse λ

26 Illustration: Lexicon after parameter estimation
(Semantics and weights are listed in the order extracted; the one-to-one pairing is not fully recoverable.)
Word | CCG category | Semantics | Weight
Transfer | (S\NP)/NP | λx. λy.transfer(x, y); λx. λy.transfer(x @y); λx. transfer(x) | 0.07646726; -0.024746018; -0.024746014
ownership | N | ownership; λx. ownership(x) | 0.07570592; -0.023981703
rods | N | λx. rods(x) | 0.07493635; -0.023250459
to | (NP\(S\NP))/NP; (NP\NP)/NP | λy. λx. x@y | 0.10719467; -0.0895291
report | N | λx. report(x) | -0.08859752; 0.105146274
Protect | (S\NP)/NP | λx. λy. protect(x, y); λx. λy. protect(x @y); λx. protect(x) | 0.07548905; -0.024013432
listing | (NP\(S\NP))/NP | λy. λx. x@list(y) | 0.009448431
… | … | … | …
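The weights feed the PCCG ranking mentioned for NL2KR-T. A common formulation of a probabilistic CCG is a log-linear model in which a candidate's score is the sum of the weights of the lexical entries it uses; whether NL2KR uses exactly this form is an assumption here. The sketch below also assumes that weights pair with semantics in the order listed on the slide, and the second candidate is a hypothetical competitor invented for illustration.

```python
import math

# Sketch of log-linear ranking (assumed PCCG formulation, not confirmed
# for NL2KR). Each candidate maps to the weights of its lexical entries.

def rank(candidates):
    """Return (translation, probability) pairs, most probable first."""
    scores = {t: sum(ws) for t, ws in candidates.items()}
    z = sum(math.exp(s) for s in scores.values())      # normalizer
    probs = {t: math.exp(s) / z for t, s in scores.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

candidates = {
    # Transfer, ownership, rods, to: first-listed weight for each word
    "transfer(ownership, rods)": [0.07646726, 0.07570592, 0.07493635, 0.10719467],
    # hypothetical competing parse that uses the lower-weighted entries
    "transfer(ownership(rods))": [-0.024746014, -0.023981703, 0.07493635, 0.10719467],
}
ranked = rank(candidates)
```

The candidate built from the higher-weighted entries comes out first in `ranked`.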

27 NL2KR Webpage

28 NL2KR Download Page

29 From the NL2KR manual

30 Natural Language Rule Processing: Conclusion
  - Described an approach to translate a natural language (NL) specification to an intermediate (formal) language, which can then be translated to rules.
  - Theory: Added inverse-lambda-based learning to Montague’s lambda-calculus-based approach.
  - System: Developed the NL2KR system.
  - Used the NL2KR system to build a translation system from NL to the Intermediate Policy Description Language.
  - The NL2KR system can be used for developing translation systems from natural language to other formal languages.
  - It has been evaluated in domains such as Geoquery, the Robocup language, puzzles, and biology questions.

31 Invitation
We are seeking: data experts & domain scientists!
  - Provide policies already in use: RDA Wiki
      - Description
      - Implementation
  - Express wishes about policies you might need
  - Discuss and analyze policies
  - Enhance the cross-over to other WGs, IGs and initiatives

32 Survey of 30 Institutions for Highest Priority Policies
Policy | Importance
Integrity | 217
Preservation | 150
Access control | 126
Provenance | 108
Data Management plans | 99
Publication | 75
Replication | 66
Data staging | 52
Federation | 37
Metadata sharing | 23
Regulatory | 16
Collection properties | 7
Identifiers | 7
Data sharing | 7
Versioning | 7
Licensing | 6
Format | 6
Data Life Cycle | 6
Arrangement | 5
Processing | 5

33 Summary of policies in production use
1. Policy for data retention. How long, how short? Need preservation, or not? (5) Retention and disposition
2. Notification policies. (Ex.: must warn data researcher that their data will be deleted at X time.) (6) Notification on event
3. Transferability policies. The data must be transferable from the repository back to the researcher and the repository of origin. Or, in the event of defunding, the data must be de-accessioned and moved to another repository (or not, depending on relevant SOPs, agreements, etc.).
4. Policies re: costs and who pays for all of this data storage (8)
5. Policies around context. Sometimes the original data and additional metadata are needed. Sometimes, the context or derived data is what matters, and not the data itself. (7)
6. Policies re: tagging/annotating data
7. Search/Information Retrieval policies. What parts of the data will you search on, or not search on? (4) Controlling search
8. Standard Sys Admin policies: (1) replication, back up, (2) integrity checks, syncing with back ups.
9. Content policies: do we care what content and file formats users upload? Some do, some don't. (3) Transformative migration
10. Policy to educate researchers about all of the different policies relevant to the data repository. For example, a user agreement/Terms & conditions statement that researchers must check off.

34 Best Practices for production policies
  - Consensus on a policy
      - Use at multiple institutions
      - Generality
  - Best practice policy components
      - Name of operation that policy controls
      - Constraints that policy implements
      - State information that policy uses or modifies
      - Verification policy
      - Example of running code
      - Documentation

35 Operations managed by policies
  - Paper posted that lists 70 operations
      - Policy-verification.docx
  - Candidate operations
      - Access control
      - Backups
      - Data retention
      - Descriptive metadata
      - Format creation
      - Integrity checks
      - Notification
      - Policy constraints
      - Replication
      - Restricted search
      - Storage cost
      - Tags
      - Use agreements

36 Types of policies
Policy type | Operation
Access | Set access control; Check access control; Audit access control
Backups (time-stamped copies) | Create copy; Set timestamp; Verify timestamps
Contextual metadata | Extract metadata; Register metadata; Verify metadata
Data Retention | Set retention period; Check retention; Verify retention
Disposition | Define migration location; Migrate data; Verify migration

37 Policy Types
Policy type | Operation
Format requirements | Specify required format; Create format; Verify formats
Integrity checks | Set checksum; Verify checksum
Notification | Define events; Send e-mail on event; Log notices
Policy constraints by collection, researcher, funding | Select constraint; Apply constraint to policy; Verify constraints
Restricted searching | Set search limits; Execute restricted search
Signing of use agreements | Generate use form; Store agreement; Verify agreement
Storage cost tracking | Record usage; Audit usage; Generate storage cost report
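As an illustration of the "Set checksum / Verify checksum" pair for integrity checks, here is a minimal Python sketch using the standard `hashlib` module. The in-memory `catalog` dictionary is a hypothetical stand-in for persistent state information, and the function names are illustrative; a production system such as iRODS would use its own operations and metadata catalog.

```python
import hashlib

# Hypothetical stand-in for persistent state information (name -> checksum).
catalog = {}

def set_checksum(name, data):
    """Compute a SHA-256 checksum for the object and record it."""
    catalog[name] = hashlib.sha256(data).hexdigest()
    return catalog[name]

def verify_checksum(name, data):
    """Return True if the object's current content matches the record."""
    return catalog.get(name) == hashlib.sha256(data).hexdigest()

original = b"observation,value\n1,42\n"
set_checksum("run1.csv", original)
intact = verify_checksum("run1.csv", original)
corrupted = verify_checksum("run1.csv", original + b"x")
```

Separating the "set" step from the "verify" step mirrors the table: the first updates state information, the second is an assessment that can run periodically.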

38 Replication Policy
  - Operation that is being controlled
      - Replicate a file
  - Controls
      - When is replication done?
          - When file is ingested
          - When file is changed
      - Which files are replicated? Choose based on:
          - Collection
          - User
          - Size
  - Replication properties
      - Choice of replication location
      - Choice of access controls on replica
      - Requirement for checksum
      - Verification of checksum on replica creation
  - Variants:
      - Versioning of changes vs replication
      - Backups vs replication (time-stamped copy)
  - Verification
      - When should replica existence be verified?
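The controls listed on this slide can be sketched as a small policy object that decides whether a given event should trigger replication. This is an assumption-laden illustration: the class, its field names, the event names (`"ingest"`, `"change"`), and the default location are all hypothetical and do not correspond to any particular data management system.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class ReplicationPolicy:
    # When is replication done? (hypothetical event names)
    triggers: Set[str] = field(default_factory=lambda: {"ingest", "change"})
    # Which files are replicated? None means "no restriction".
    collections: Optional[Set[str]] = None
    users: Optional[Set[str]] = None
    min_size: int = 0
    # Replication properties (hypothetical defaults)
    replica_location: str = "archive-site"
    require_checksum: bool = True

    def applies(self, event, collection, user, size):
        """Return True if this event on this file should trigger replication."""
        if event not in self.triggers:
            return False
        if self.collections is not None and collection not in self.collections:
            return False
        if self.users is not None and user not in self.users:
            return False
        return size >= self.min_size

policy = ReplicationPolicy(collections={"/zone/home/projectA"}, min_size=1024)
decision = policy.applies("ingest", "/zone/home/projectA", "alice", 4096)
```

Keeping the selection criteria as data rather than code makes it easier to compare policies across institutions, which is the kind of comparison the working group is after.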

39 Policy : Operation : Constraints : State Information
Policy type | Operation | Constraints | State information
Replication | Set replica properties | When? | Default policy enforcement points
 | | Number of replicas | Default number
 | | Where is the replica put? | Default replica location
 | | Which files (collection/user/size)? | Default policy selection criteria; Default criterion value
 | | Set replica access controls? | Default access control
 | | Require checksum? | Replica checksum flag
 | | When audit? | Default time period
 | Replicate | Delayed or immediate | Replica location; Replica creation time; Replica access control; Replica name; Replica owner; Replica number
 | Verify replica numbers | Periodic rule | Audit time stamp; Log of problems and actions
 | Replace missing replicas | | Replica location; Replica creation time; Replica access control; Replica name; Replica owner; Replica number
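The "Verify replica numbers" and "Replace missing replicas" operations can be sketched as one periodic check. The sketch below is hypothetical throughout: replicas are modeled as a dictionary from file name to a list of locations, `make_replica` is a caller-supplied stand-in for the real replication operation, and the log is a plain list.

```python
from datetime import datetime, timezone

def verify_replica_numbers(replicas, required, make_replica, log):
    """Check each file's replica count and replace missing replicas.

    replicas: dict mapping file name to a list of locations (mutated in place).
    Returns the audit time stamp.
    """
    for name, locations in replicas.items():
        missing = required - len(locations)
        if missing > 0:
            log.append(f"{name}: {len(locations)} of {required} replicas found")
        for _ in range(max(0, missing)):
            locations.append(make_replica(name, len(locations)))
            log.append(f"{name}: created replica at {locations[-1]}")
    return datetime.now(timezone.utc)

replicas = {"a.dat": ["site1", "site2"], "b.dat": ["site1"]}
log = []
audit_time = verify_replica_numbers(
    replicas,
    required=2,
    make_replica=lambda name, n: f"site{n + 1}",  # hypothetical placement rule
    log=log,
)
```

The return value and the log correspond to the "Audit time stamp" and "Log of problems and actions" state information in the table.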

40 Interactions with other Working Groups
  - Interoperability testbed
      - Demonstrate that RDA recommendations can be jointly implemented
  - Control policies
      - Demonstrate that a desired practice can be applied consistently
  - Assessment policies
      - Verify that a recommended practice is followed
  - Integration
      - Demonstrate semantic consistency across systems-level integration
      - Example: are data objects considered to be immutable?

41 Practical Policy WG Interfaces with the other WGs
  - Interoperability testbed provided by Practical Policy WG
  - Persistent identifiers
      - Handle system
  - Metadata
      - HIVE linked-data vocabularies
  - Type registry
      - Expect implementation for integration
  - Data Foundation and Terminology
      - Exchange of concepts based on use cases
  - Preservation interest group
      - ISO 16363 assessment policies

42 Proposal - Special Interest Group on Interoperability Testbeds
  - New interest group is driven by the need to have testbeds with a longer lifetime than the Practical Policy working group.
  - Current testbeds
      - Dataverse
      - dCache
      - iRODS
  - Testbed functions
      - Demonstrate interoperability
      - Provide platform to evaluate proposed best practices / software
  - We need working groups to provide software systems or policies for testing.
  - Need a liaison to each working group

43 Special Interest Group on Interoperability Testbeds
Interested participants include:
  - David Antos, CESNET
  - Jon Crabtree, Dataverse
  - Marcio Faerman, OSU
  - Patrick Fuhrmann, dCache testbed, DESY
  - Thomas Jejkal, KIT Data Manager repository
  - Tibor Kalman, Persistent identifier consortium
  - Reagan Moore, DataNet Federation Consortium
  - Jakub Peisar, dCache testbed
  - Raphael Ritz, MPG

