Query Language Constructs for Provenance
Murali Mani, Mohamad Alawa, Arunlal Kalyanasundaram
University of Michigan, Flint
Presented at IDEAS 2011.
Provenance Metadata
Provenance is data about the origins of data.
Applications:
  Checking whether a data item is valid (health records)
  Deciding how much to trust an inference or observation (scientific computation)
  Audit trails (manufacturing, shipping, trading)
The database community has found that provenance can be useful in:
  updating views
  maintenance of materialized views
  interpretation of query results
  querying probabilistic/uncertain data
In short, numerous applications …
OPM (Open Provenance Model)
http://openprovenance.org/
Developed by several researchers who have been involved with provenance.
Describes a logical representation of provenance information for a wide variety of applications.
Provenance information is represented as a directed graph consisting of:
  Nodes (each an artifact, a process, or an agent)
  Edges, or dependencies. There are 5 types of edges:
    used: a process used an artifact
    wasGeneratedBy: an artifact was generated by a process
    wasControlledBy: a process was controlled by an agent
    wasTriggeredBy: a process was triggered by another process
    wasDerivedFrom: an artifact was derived from another artifact
Nodes and edges have annotations (attribute-value pairs).
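The node and edge types listed above can be sketched as a small data model. This is only an illustrative sketch, not part of OPM or of the talk: the names `Node`, `Edge`, and `EDGE_ENDPOINTS` are hypothetical, and the endpoint constraints are read directly off the five edge definitions on this slide.

```python
from dataclasses import dataclass, field

# Allowed (source kind, target kind) for each edge type, following the
# five edge definitions above (e.g. "used" goes from a process to an artifact).
EDGE_ENDPOINTS = {
    "used": ("process", "artifact"),
    "wasGeneratedBy": ("artifact", "process"),
    "wasControlledBy": ("process", "agent"),
    "wasTriggeredBy": ("process", "process"),
    "wasDerivedFrom": ("artifact", "artifact"),
}


@dataclass
class Node:
    name: str
    kind: str  # "artifact", "process", or "agent"
    annotations: dict = field(default_factory=dict)  # attribute-value pairs


@dataclass
class Edge:
    type: str
    source: Node
    target: Node
    annotations: dict = field(default_factory=dict)  # attribute-value pairs

    def __post_init__(self):
        # Reject edge types and endpoint kinds that the model does not define.
        expected = EDGE_ENDPOINTS.get(self.type)
        if expected is None:
            raise ValueError(f"unknown edge type: {self.type}")
        actual = (self.source.kind, self.target.kind)
        if actual != expected:
            raise ValueError(f"{self.type} expects {expected}, got {actual}")
```

Checking endpoint kinds at construction time keeps malformed graphs (for example, a `used` edge that starts at an artifact) from being built in the first place.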
OPM: A Simple Example
[Figure: OPM graph with process P (type=division) and artifacts A1, A2, A3, A4; edge labels: used(dividend), used(divisor), wasGeneratedBy(quotient), wasGeneratedBy(remainder)]
A1, A2 are artifacts.
P is a process that performs division (A1/A2). Note the used edges between P and A1, A2.
A3, A4 are artifacts generated by P (representing quotient, remainder). Note the wasGeneratedBy edges between P and A3, A4.
Example taken from
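The division example can be written down as plain data. This is a minimal sketch under stated assumptions: the tuple layout and the helper name `roles` are hypothetical, and the pairing of roles with artifacts (A1 as dividend, A2 as divisor, A3 as quotient, A4 as remainder) is inferred from "A1/A2" and "quotient, remainder" in the slide text, since the figure itself is not reproduced here.

```python
# Nodes with their annotations (attribute-value pairs).
nodes = {
    "P": {"kind": "process", "type": "division"},
    "A1": {"kind": "artifact"},
    "A2": {"kind": "artifact"},
    "A3": {"kind": "artifact"},
    "A4": {"kind": "artifact"},
}

# Edges as (edge type, source, target, role annotation).
# The role-to-artifact pairing is an assumption (see the note above).
edges = [
    ("used", "P", "A1", "dividend"),
    ("used", "P", "A2", "divisor"),
    ("wasGeneratedBy", "A3", "P", "quotient"),
    ("wasGeneratedBy", "A4", "P", "remainder"),
]


def roles(edge_type, process):
    """Map role -> artifact for edges of `edge_type` that touch `process`."""
    result = {}
    for etype, source, target, role in edges:
        if etype != edge_type:
            continue
        if etype == "used" and source == process:
            result[role] = target  # used: process -> artifact
        elif etype == "wasGeneratedBy" and target == process:
            result[role] = source  # wasGeneratedBy: artifact -> process
    return result


inputs = roles("used", "P")
outputs = roles("wasGeneratedBy", "P")
```

Here `inputs` holds the artifacts that P used and `outputs` the artifacts that P generated, each keyed by its role annotation.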
Queries for OPM
We can write complex "multi-step inference" queries using Datalog/SQL based on the different edges in OPM.
  Example: find artifacts directly or indirectly derived from another artifact (a recursive query using wasDerivedFrom edges)
However, is this sufficient? We may also need to express:
  Sub-graph isomorphism (given a graph query pattern, check whether the pattern appears in a provenance graph)
    Studied in graph query languages ([Graph-QL], [OPQL], …)
  Shortest path queries (using some notion of distance)
    Typically not studied in graph query languages
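Both kinds of query can be illustrated over a plain edge list. The sketch below is not the query model of this talk: the artifact names and the functions `derived_from` and `shortest_path` are hypothetical, and distance is taken to be hop count, which is only one possible notion of distance.

```python
from collections import deque

# Hypothetical wasDerivedFrom edges: (derived artifact, source artifact).
was_derived_from = [("B", "A"), ("C", "B"), ("D", "B"), ("E", "C"), ("E", "A")]


def derived_from(origin, edges):
    """Artifacts directly or indirectly derived from `origin`."""
    children = {}
    for derived, source in edges:
        children.setdefault(source, []).append(derived)
    seen, stack = set(), [origin]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def shortest_path(start, goal, edges):
    """Fewest-hop path from `start` to `goal` along wasDerivedFrom edges, or None."""
    parents = {}
    for derived, source in edges:
        parents.setdefault(derived, []).append(source)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in parents.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None
```

For instance, `derived_from("A", was_derived_from)` collects every artifact reachable by following wasDerivedFrom edges backwards from A, while `shortest_path("E", "A", was_derived_from)` returns one shortest derivation chain from E back to A.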
Our Approach
Examples of Generalized Selection Operator
Conclusions and Future Work
Observation: a provenance query language should not be restricted to Datalog/SQL.
We developed a query model that provides constructs for querying structure and for querying content.
Using our query model, we can express a wide range of queries, including shortest path (not expressible using SQL/Datalog).
References
[Graph-QL]: He, H., and Singh, A. K. Graphs-at-a-time: Query Language and Access Methods for Graph Databases. ACM SIGMOD (2008).
[OPQL]: Lim, C., Lu, S., Chebotko, A., and Fotouhi, F. OPQL: A First OPM-Level Query Language for Scientific Workflow Provenance. IEEE SCC (2011).
[OPM]: The Open Provenance Model (OPM), available at