Data Provenance and Data Quality Inference
The University of Texas at Dallas, Computer Science
11/13/2006
Ping Mao, Jungin Kim

Contents
- Data Quality
  - Overview
  - Quality Inference
- Data Provenance
  - Data Provenance Definitions
  - Taxonomy of Provenance Techniques

Data Quality Overview
- What is data quality?
  - Accuracy
  - Timeliness
  - Credibility (trustworthiness)
  - Subjective: it depends on users and domains

Data Quality Overview
- Example
  - Database collected over a period of time and by a variety of company departments

  Company Name | Address     | Number of Employees
  A            | 20 Rode St. | 3,000
  B            | 50 Main Av. | 500

Data Quality Overview
- Questions:
  - When was it created?
  - Where did it come from?
  - How and why was it obtained?

  Company Name | Address     | Number of Employees
  A            | 20 Rode St. | 3,000
  B            | 50 Main Av. | 500

  Tags attached to the values (date, source): Jan-12-00, by sales; Feb-5-00, by ABC; Oct-24-00, by acctg; Oct-10-00, by EFG

Data Quality Overview
- How to store it?
  - Annotations by tagging
  - Provenance
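Annotation by tagging can be sketched as a value that carries its own "when" and "who" metadata. This is only an illustration: the `TaggedValue` class, its field names, and the pairing of the tag with this particular address are assumptions, not something the slides specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedValue:
    """A stored value plus an annotation saying when and by whom it was recorded."""
    value: str
    recorded_on: Optional[str] = None  # e.g. "Jan-12-00"
    source: Optional[str] = None       # e.g. "sales"

    def is_annotated(self) -> bool:
        """True only when both the date and the source are known."""
        return self.recorded_on is not None and self.source is not None


# Hypothetical pairing of a value from the example table with one of its tags.
address = TaggedValue("20 Rode St.", recorded_on="Jan-12-00", source="sales")
untagged = TaggedValue("A")
```

With tags stored next to each value, the questions "when was it created" and "where did it come from" can be answered per value rather than per table.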

Data Quality Inference
- Next question:
  - Can we trust data sets or data sources?
- Answer:
  - Rank the data sets generated from the data sources by quality

Data Quality Inference
- Motivation
  - Data are:
    - Distributed
    - Erroneous
    - Shared and integrated

Data Quality Inference
- Data source ranking
  1. Rank the data sets or sources in order of their accuracy
  2. Determine the top-k most accurate data sets or sources

Data Quality Inference
- Framework
  - D: a set of data sources
  - Ti(k, v): table for a query Q, where k is the key and v is the value at time t
  - Ai ∈ [0, 1]: accuracy of data source Di
  - Ai < Aj if Di is less accurate than Dj
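Once an accuracy estimate Ai in [0, 1] is available for each source Di, ranking and top-k selection reduce to a sort. The sketch below assumes the estimates are already computed. The function names and the sample numbers are hypothetical and are not results from the slides.

```python
from typing import Dict, List, Tuple


def rank_sources(accuracy: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (source, accuracy) pairs ordered from most to least accurate."""
    return sorted(accuracy.items(), key=lambda item: item[1], reverse=True)


def top_k(accuracy: Dict[str, float], k: int) -> List[str]:
    """Return the names of the k most accurate sources."""
    return [name for name, _ in rank_sources(accuracy)[:k]]


# Hypothetical accuracy estimates in [0, 1].
estimates = {"D1": 0.62, "D2": 0.91, "D3": 0.47}
best_two = top_k(estimates, 2)
```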

Data Quality Inference
- General framework
  - h(t): historical function, 0 ≤ h(t) ≤ 1
    - Weighted sum over the last w time indexes
  - c(i,t): cohesion function

Data Quality Inference
- Cohesion function, c(i,t)
  - Determines:
    - the new accuracy estimate
    - how well the data agree with one another
  - f(i,t): dampening factor function
  - a(i,j,t): agreement function

Data Quality Inference
- Dampening factor function, f(i,t)
  - A probability, f(i,t), for each data source
  - Similar to Google's PageRank:
    - High-quality sites receive a higher PageRank
    - Google remembers each time it conducts a search
  - Prevents a solution of all zeros

Data Quality Inference
- Agreement function, a(i,j,t)
  - tupleOverlap(i,j,t)
    - Measures the proportion of tuples in approximate agreement
  - cosineOverlap(i,j,t)
    - Measures the complement of the cosine distance between two sets of data over the same key values
  - eOverlap(i,j,t): Euclidean-based function
    - Euclidean distance in n dimensions

Data Quality Inference
- Agreement function, a(i,j,t)
  - Using Euclidean distance:
    eOverlap(i,j,t) = 1 - eDist(V(i,j,t), V(j,i,t))
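The three agreement functions can be sketched as follows. The slides do not give their exact definitions, so several choices here are assumptions: each source's table is a key-to-number mapping, V(i,j,t) is read as the values of source i over the keys it shares with source j, "approximate agreement" means a relative tolerance `tol`, and eDist is scaled into [0, 1] by dividing by the sum of the two vector norms.

```python
import math
from typing import Dict, List, Tuple

Table = Dict[str, float]  # key -> value reported by one source at time t


def shared_values(ti: Table, tj: Table) -> Tuple[List[float], List[float]]:
    """Values of both sources over the keys they have in common."""
    keys = sorted(ti.keys() & tj.keys())
    return [ti[k] for k in keys], [tj[k] for k in keys]


def norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def tuple_overlap(ti: Table, tj: Table, tol: float = 0.05) -> float:
    """Proportion of shared keys whose values agree within a relative tolerance."""
    vi, vj = shared_values(ti, tj)
    if not vi:
        return 0.0
    agree = sum(
        1 for a, b in zip(vi, vj)
        if abs(a - b) <= tol * max(abs(a), abs(b), 1e-12)
    )
    return agree / len(vi)


def cosine_overlap(ti: Table, tj: Table) -> float:
    """Complement of the cosine distance, i.e. the cosine similarity."""
    vi, vj = shared_values(ti, tj)
    if norm(vi) == 0 or norm(vj) == 0:
        return 0.0
    dot = sum(a * b for a, b in zip(vi, vj))
    return dot / (norm(vi) * norm(vj))


def e_dist(vi: List[float], vj: List[float]) -> float:
    """Euclidean distance scaled into [0, 1] (the scaling is an assumption)."""
    total = norm(vi) + norm(vj)
    if total == 0:
        return 0.0
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vi, vj))) / total


def e_overlap(ti: Table, tj: Table) -> float:
    """eOverlap = 1 - eDist over the shared keys."""
    vi, vj = shared_values(ti, tj)
    if not vi:
        return 0.0
    return 1.0 - e_dist(vi, vj)


# Hypothetical tables from three sources.
t1 = {"a": 3.0, "b": 4.0}
t2 = {"a": 3.0, "b": 4.0, "c": 9.0}
t3 = {"a": 4.0, "b": 3.0}
```

Identical values over the shared keys give an overlap of 1 under all three measures, and the measures penalize disagreement differently, which is why the choice of agreement function is a parameter of the framework.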

Data Quality Inference
- Experimental results
  - 100 data sources
  - 20 different tuples (key, value)
  - Randomly assigned
  - Dampening factor f(i,t) = 0.5

Data Quality Inference
- Experimental results

Data Provenance
- Data Provenance Definitions
- Taxonomy of Provenance Techniques
  - Application of Provenance
  - Subject of Provenance
  - Representation of Provenance
  - Provenance Storage
  - Provenance Dissemination
- Examples of Data Provenance Techniques

What is Data Provenance
- Data provenance:
  - In the database systems domain: data provenance, a kind of metadata sometimes called "lineage" or "pedigree", is the description of the origins of a piece of data and the process by which it arrived in a database.
  - Data provenance is also described as information that helps determine the derivation history of a data product, starting from its original sources.
- E-Science:
  - E-science is computationally intensive science. It is also the type of science that is carried out in highly distributed network environments, or science that uses immense data sets that require grid computing. Examples include social simulations, particle physics, earth sciences and bioinformatics...

Why Data Provenance is important
- When you find some data on the Web, do you have any information about how it got there? It may well have been copied from somewhere else on the Web, which in turn may also have been copied, and in the process it may have been transformed and edited.
- If you are a scientist, or any kind of scholar, you want confidence in the accuracy and timeliness of the data you are working with.
- Medical research requires tight controls on the quality of data because mistakes can harm people's health. In bioinformatics the consequences of poor data quality may not be as immediate, but data quality is no less important.
- Among the sciences, molecular biology is possibly one of the most sophisticated consumers of modern database technology and has generated a wealth of new database issues. A substantial fraction of research in genetics is conducted in "dry" laboratories using in silico experiments, that is, analysis of data in the available databases.

Taxonomy of Provenance Techniques
- The survey (reference 1) categorizes provenance systems based on:
  - Why they record provenance: application of data provenance
  - What they describe: subject of provenance
  - How they represent provenance: provenance representation
  - How they store provenance: storing provenance
  - Ways to disseminate provenance: provenance dissemination

Taxonomy of Provenance

Application of Provenance
- Provenance systems can support a number of uses. Several applications of provenance information are as follows:
  - Data Quality: Lineage can be used to estimate data quality and data reliability based on the source data and transformations. It can also provide proof statements on data derivation.
  - Audit Trail: Provenance can be used to trace the audit trail of data, determine resource usage, and detect errors in data generation.
  - Replication Recipes: Detailed provenance information can allow repetition of data derivation, help maintain its currency, and be a recipe for replication.
  - Attribution: Pedigree can establish the copyright and ownership of data, enable its citation, and determine liability in case of erroneous data.
  - Informational: A generic use of lineage is to query based on lineage metadata for data discovery. It can also be browsed to provide a context to interpret data.

Subject of Provenance
- Provenance models:
  - Data-oriented model: an explicit model in which lineage metadata is gathered specifically about the data product. One can delineate the provenance metadata about the data product from metadata concerning other resources.
  - Process-oriented model: an indirect model in which the deriving processes are the primary entities for which provenance is collected, and the data provenance is determined by inspecting the input and output data products of these processes.
- Provenance granularity (coarse-grained / fine-grained):
  - The usefulness of provenance, and the cost of collecting and storing it in a certain domain, are linked to the granularity at which it is collected.
  - Granularity ranges from provenance on attributes and tuples in a database to provenance for collections of files, say, generated by an ensemble experiment run.
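The process-oriented model can be illustrated with a small sketch in which provenance is recorded only about processes, and the provenance of a data product is found by inspecting which process lists it as an output. The `ProcessRecord` class, the function name, and the file names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ProcessRecord:
    """Provenance is recorded about the process: what it read and what it wrote."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def producing_process(product: str, records: List[ProcessRecord]) -> Optional[ProcessRecord]:
    """Find the recorded process whose outputs include the given data product."""
    for record in records:
        if product in record.outputs:
            return record
    return None  # no recorded process produced it, so it is an original source


# Hypothetical process records.
records = [
    ProcessRecord("clean", inputs=("raw.csv",), outputs=("clean.csv",)),
    ProcessRecord("aggregate", inputs=("clean.csv", "regions.csv"), outputs=("report.csv",)),
]
origin = producing_process("report.csv", records)
```

In a data-oriented model the same information would instead be stored directly with `report.csv` as its own lineage metadata.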

Representation of Provenance
- Two major approaches:
  - Annotations
    - Metadata comprising the derivation history of a data product is collected as annotations and descriptions about source data and processes.
    - Advantage: annotations are richer and, in addition to the derivation history, often include the parameters passed to the derivation processes, the versions of the workflows that will enable reproduction of the data, or even related publication references.
  - Inversion
    - Uses the property by which some derivations can be inverted to find the input data supplied to them to derive the output data. Examples include queries and user-defined functions in databases that can be inverted automatically or by explicit functions.
    - Advantage: more compact. Limitation: the information it provides is sparse and limited to the derivation history of the data.
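The inversion approach can be illustrated with a derivation that is easy to invert: a per-department total. No annotations are stored. The inputs behind an output value are recovered by running an inverse query against the source rows. The table contents and function names below are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def total_by_department(rows: List[Row]) -> Dict[str, int]:
    """Forward derivation: sum employee counts per department."""
    totals: Dict[str, int] = {}
    for row in rows:
        totals[row["dept"]] = totals.get(row["dept"], 0) + row["employees"]
    return totals


def invert_total(rows: List[Row], dept: str) -> List[Row]:
    """Inverse of the derivation: the input rows that produced the total for `dept`."""
    return [row for row in rows if row["dept"] == dept]


# Hypothetical source rows.
source = [
    {"dept": "sales", "employees": 10},
    {"dept": "sales", "employees": 20},
    {"dept": "acctg", "employees": 5},
]
totals = total_by_department(source)
contributing = invert_total(source, "sales")
```

The inverse query recovers which rows contributed, but nothing about parameters, workflow versions, or who ran the derivation, which is the sparseness the slide refers to.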

Representation of Provenance (contd.)
- Many current provenance systems that use annotations have adopted XML for representing the lineage information.
- Some also capture semantic information within provenance using domain ontologies in languages like RDF and OWL. Ontologies precisely express the concepts and relationships used in the provenance and provide good contextual information.

Provenance Storage
- Scalability
  - Provenance information can grow to be larger than the data it describes if the data is fine-grained and the provenance information is rich, so the manner in which the provenance metadata is stored is important to its scalability.
  - The inversion method is arguably more scalable than using annotations. However, one can reduce storage needs in the annotation method by recording just the immediately preceding transformation step that creates the data and recursively inspecting the provenance information of those ancestors for the complete derivation history.
- Overhead
  - Less frequently used provenance information can be archived to reduce storage overhead, or a demand-supply model based on usefulness can retain provenance only for data that is frequently used.
  - If provenance depends on users manually adding annotations instead of being collected automatically, the burden on the user may prevent complete provenance from being recorded and made available in a machine-accessible form that has semantic value.
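The storage-saving idea of recording only the immediately preceding transformation step, then recursing through ancestors for the full history, can be sketched as below. The store layout (a dictionary from product name to its creating step and direct inputs) and all names are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Each data product stores only the step that created it: (transformation, direct inputs).
ImmediateStep = Tuple[str, List[str]]
HistoryEntry = Tuple[str, str, List[str]]


def full_history(product: str, store: Dict[str, ImmediateStep]) -> List[HistoryEntry]:
    """Rebuild the complete derivation history by walking back through ancestors."""
    history: List[HistoryEntry] = []
    seen: Set[str] = set()

    def visit(name: str) -> None:
        if name in seen or name not in store:
            return  # already expanded, or an original source with no recorded step
        seen.add(name)
        transformation, inputs = store[name]
        for parent in inputs:
            visit(parent)  # ancestors are listed before the products derived from them
        history.append((name, transformation, inputs))

    visit(product)
    return history


# Hypothetical one-step records: "a" is an original source and has no entry.
store = {
    "b": ("clean", ["a"]),
    "c": ("aggregate", ["b"]),
}
history_of_c = full_history("c", store)
```

Each product carries a constant-size record, and the cost of reconstructing the full history is paid only when someone asks for it.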

Provenance Dissemination
- Visual graph
  - A common way of disseminating provenance data is through a derivation graph that users can browse and inspect.
- Queries
  - Users can also search for datasets based on their provenance metadata, such as to locate all datasets generated by executing a certain workflow.
  - If semantic provenance information is available, these query results can automatically feed input datasets for a workflow at runtime.
  - The derivation history of datasets can be used to replicate data at another site, or to update a dataset if it is stale due to changes made to its ancestors.
- Service API
  - Provenance retrieval APIs can additionally allow users to implement their own mechanism of usage.
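Two of the query uses above, finding all datasets generated by a given workflow and detecting a dataset that is stale because an ancestor changed, can be sketched over a small in-memory catalog. The `DatasetProvenance` fields, the logical timestamps, and the catalog contents are hypothetical, and the derivation graph is assumed to be acyclic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DatasetProvenance:
    workflow: str        # workflow that generated the dataset
    parents: List[str]   # datasets it was directly derived from
    modified: int        # logical timestamp of the last (re)generation


def generated_by(catalog: Dict[str, DatasetProvenance], workflow: str) -> List[str]:
    """Names of all datasets whose provenance names the given workflow."""
    return sorted(name for name, p in catalog.items() if p.workflow == workflow)


def is_stale(catalog: Dict[str, DatasetProvenance], name: str) -> bool:
    """A dataset is stale if a parent changed after it was generated, or is itself stale."""
    record = catalog[name]
    for parent in record.parents:
        if parent not in catalog:
            continue  # no provenance recorded for this parent
        if catalog[parent].modified > record.modified or is_stale(catalog, parent):
            return True
    return False


# Hypothetical catalog.
catalog = {
    "raw": DatasetProvenance("ingest", [], modified=5),
    "clean": DatasetProvenance("cleaning", ["raw"], modified=3),
    "report": DatasetProvenance("reporting", ["clean"], modified=4),
    "other": DatasetProvenance("cleaning", ["raw"], modified=6),
}
```

Here `report` counts as stale even though its direct parent is older than it, because that parent is itself out of date with respect to `raw`.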

Survey of Data Provenance Techniques

Provenance in a Bioinformatics Grid (myGrid)
- myGrid builds a personalised problem-solving environment that helps bioinformaticians find, adapt, construct and execute in silico experiments.
- It keeps the scientist informed as to the provenance of data relevant to their experiment space.

What is the problem?
- Provenance recording should be part of the infrastructure, so that users can elect to enable it when they execute their complex tasks over the Grid or in Web Services environments.
- Currently, the Web Services protocol stack and the Open Grid Services Architecture do not provide any support for recording provenance.

Architectural Vision
- Provenance gathering is a collaborative process that involves multiple entities, including the workflow enactment engine, the enactment engine's client, the service directory, and the invoked services.
- Provenance data will be submitted to one or more "provenance repositories" acting as storage for provenance data.
- Upon the user's request, some analysis, navigation and reasoning over provenance data can be undertaken.

Architectural Vision
- Storage could be provided by a provenance service.
- The provenance service would provide support for analysis, navigation or reasoning over provenance.
- Client-side support is needed for submitting provenance data to the provenance service.

Prototype Overview

Conclusion
- Provenance is a rather unexplored domain.
- There is a need to design a configurable architecture capable of supporting multiple requirements from very different application domains.
- There is a need to further investigate the algorithmic foundations of provenance, which will lead to scalable and secure industrial solutions.

Future work
- Using heterogeneous data sources
- Large data sources
- Historical measurement
- Dynamic measurement
- Security and authorization of data provenance
- Managing provenance in diverse domains

References
1) Yogesh L. Simmhan, Beth Plale, and Dennis Gannon, "A Survey of Data Provenance in e-Science," in SIGMOD Record, Vol. 34, No. 3, Sept.
2) "Using Semantic Web Technologies for Representing e-Science Provenance"
3) Jan Brase, "Using digital library techniques - Registration of scientific primary data," in ECDL, hannover.de/Arbeiten/Publikationen/2004/brase_TIB_hannover.pdf
4) Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan, "Why and Where: A Characterization of Data Provenance," in ICDT
5) Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan, "Data Provenance: Some Basic Issues"
6) Wang-Chiew Tan, "Research Problems in Data Provenance"
7) Raymond K. Pon and Alfonso F. Cárdenas, "Data Quality Inference"
8) Wang, R., Kon, H. & Madnick, S. (1993), Data Quality Requirements Analysis and Modelling, Ninth International Conference of Data Engineering, Vienna, Austria.
9) Wand, Y. and Wang, R. (1996), "Anchoring Data Quality Dimensions in Ontological Foundations," Communications of the ACM, November.