Trio: A System for Integrated Management of Data, Accuracy, and Lineage. Jennifer Widom, Stanford University.

Presentation transcript:

1 Trio: A System for Integrated Management of Data, Accuracy, and Lineage
Jennifer Widom, Stanford University

2 Basic Premise
Traditional database management systems (DBMSs) are too rigid for some applications
– Every data item is either in the database or it isn’t
– Its value is absolute
– Its derivation history is irrelevant
Trio relaxes these constraints by making (1) data, (2) accuracy, and (3) lineage all first-class, interrelated concepts

3 Formula for a Database Research Project
Consider traditional DBMSs: highly sensitive to their basic assumptions
Add a twist or two
Forced to reconsider many aspects of data management and query processing
– Data model and algebra, query language, query optimization and execution, index structures, storage structures, application and user interfaces, …
→ Many Ph.D. theses
→ Significant prototype effort

4 Next in the Talk
Trio features: overview of the twists Trio introduces to a conventional DBMS
Specific project goals and non-goals
A quiz, to wake you up
Several motivating applications

5 Trio Features: Accuracy
Data values may be inexact
Ex: A numeric value lying somewhere in the range [3.2, 4.4]
Ex: A record belonging in the database with probability 97%
Ex: A relation missing about 15% of its records

6 Trio Features: Accuracy
Queries over inexact data return inexact answers
Ex: This result value lies somewhere in the range [1, 6]
Ex: This record belongs in the result with probability 85%
Ex: This answer is missing about 25% of its records

7 Trio Features: Lineage
Lineage is an integral part of the data model
– Suppose record r was derived by query Q over data D at time T. Trio keeps track.
– Lineage captures:
  Query-based derivations
  Program-based derivations
  Update-based derivations
  Bulk data loads
  Data import from outside sources

8 Trio Features: Querying Accuracy
Accuracy may be queried
Ex: Find all numeric values whose approximation is within 1%
Ex: Consider only records with ≥ 98% chance of belonging in the database
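
The two example queries on this slide can be sketched in Python. This is only an illustration, not TriQL: the `Record` class, its field names, and the reading of "within 1%" as "half-width of the range is at most 1% of its midpoint" are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """Hypothetical record: a numeric range plus a tuple-level confidence."""
    value_min: float   # lower bound of the approximate numeric value
    value_max: float   # upper bound of the approximate numeric value
    confidence: float  # chance that the record belongs in the database


def within_relative_error(rec: Record, tolerance: float) -> bool:
    """True if the range's half-width is at most `tolerance` times its midpoint."""
    mid = (rec.value_min + rec.value_max) / 2
    half_width = (rec.value_max - rec.value_min) / 2
    return mid != 0 and half_width <= tolerance * abs(mid)


def high_confidence(records: List[Record], threshold: float = 0.98) -> List[Record]:
    """Keep only records whose confidence meets the threshold."""
    return [r for r in records if r.confidence >= threshold]
```

For instance, a record with range [99.5, 100.5] and confidence 0.99 passes both filters, while one with range [3.2, 4.4] and confidence 0.97 passes neither.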

9 Trio Features: Querying Lineage
Lineage may be queried
Ex: Find all records whose derivation includes data from relation R
Ex: Determine whether record r was derived from data imported on 4/1/04

10 Trio Features: Combining Lineage, Accuracy
Lineage and accuracy may be combined in queries
Ex: Find all records derived solely from high-confidence data

11 Trio Features: Enhancing Updates
Lineage can be used to enhance updates
Data updates
– Propagate updates from base to derived data (or vice versa), similar to materialized views
Accuracy updates
– When data becomes more (or less) exact, compute and propagate the effect on accuracy of derived data

12 Trio is Not…
A comprehensive temporal DBMS
A DBMS for semistructured data
The last word in approximate or uncertain data, or in data lineage
A federated or distributed system

13 Main Contributions – Trio Goals
1. Simple and usable model incorporating both accuracy and lineage
2. Query language that extends SQL to handle data, accuracy, and lineage together
3. Working system that is straightforward and efficient enough to actually get used

14 Quiz Time: What’s wrong with this logo?

15 Motivating Applications
No lack of them
– How similar are they really?
– Can we accommodate all of them?
– Which 2-3 “killer apps” should we focus on?

16 Scientific Data Management
Scientific experiments produce vast amounts of data
– Often structured but inexact
Additional data imported from outside sources
– Sources vary in quality and reliability
Many levels of derivation and aggregation
Data and accuracy evolve over time
→ A perfect fit for Trio

17 Sensor Data Management
Some sensor environments permit centralized collection
– With sufficient bandwidth and battery power
Readings may still be unreliable
– Missing values, incorrect values
Many levels of derivation and aggregation
– May want to trace to original sensor readings
→ Another perfect fit

18 Data Cleaning
Deduplication: Find and merge items likely to represent the same real-world entity
– Uncertainty in data, in match, and in merge
“Merge history” (lineage) important for:
– Computing and propagating uncertainty
– Unmerging
→ Yet another perfect fit!
Related application: “Profile Assembly”

19 More Applications
Storing/querying approximate values
Hypothetical reasoning
Online query processing
Privacy preservation
And others…

20 Running Example: Christmas Bird Count

21 Bird Counters

22 Remainder of the Talk
The Trio Data Model (TDM)
– A solid proposal
The Trio Query Language (TriQL)
– Basic features
– Underlying algebra
The Trio System (Trio)
– Very basic architectural choices

23 The Trio Data Model (TDM)
WARNING: This model is preliminary and subject to change
1. Data: relational (+ user-defined types)
2. Accuracy: approximation, confidence, and coverage
3. Lineage

24 TDM: Accuracy
a) Attribute-level approximation
b) Tuple-level (or relation-level) confidence
c) Relation-level coverage

25 TDM: Approximation
Broadly, an approximate value is a set of possible values along with a probability distribution over them

26 TDM: Approximation
Specifically, each Trio attribute value is either:
1) Exact value (default)
2) Set of values, each with probability ∈ [0,1] (probabilities sum to 1)
3) Min + Max for a range (uniform distribution)
4) Mean + Deviation for a Gaussian distribution
Type 2 sets may include “unknown” (┴)
Independence of approximate values within a tuple
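
The four kinds of attribute value listed above could be modeled as follows. This is a minimal sketch, not Trio's representation: the class names, the `UNKNOWN` sentinel standing in for ┴, and the validation rules are assumptions chosen to mirror the slide.

```python
import math
from dataclasses import dataclass
from typing import Any, Dict, Union

UNKNOWN = "UNKNOWN"  # hypothetical stand-in for the slide's "unknown" marker


@dataclass
class Exact:
    """Type 1: an exact value (the default)."""
    value: Any


@dataclass
class ValueSet:
    """Type 2: possible values, each with a probability; may include UNKNOWN."""
    probs: Dict[Any, float]

    def __post_init__(self) -> None:
        if any(p < 0 or p > 1 for p in self.probs.values()):
            raise ValueError("each probability must lie in [0, 1]")
        if not math.isclose(sum(self.probs.values()), 1.0):
            raise ValueError("probabilities must sum to 1")


@dataclass
class Range:
    """Type 3: min and max of a range, read as a uniform distribution."""
    low: float
    high: float

    def __post_init__(self) -> None:
        if self.low > self.high:
            raise ValueError("low must not exceed high")


@dataclass
class Gaussian:
    """Type 4: mean and deviation of a Gaussian distribution."""
    mean: float
    deviation: float


ApproxValue = Union[Exact, ValueSet, Range, Gaussian]
```

A species sighting might then be written as `ValueSet({"sparrow": 0.7, "finch": 0.2, UNKNOWN: 0.1})`, and a latitude as `Range(37.40, 37.45)`.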

27 TDM: Confidence
Each tuple t has confidence ∈ [0,1]
– Informally: chance of t correctly belonging in the relation
– Default: confidence = 1
– Can also define at relation level

28 TDM: Coverage
Each relation R has coverage ∈ [0,1]
– Informally: percentage of the correct R that is present
– Default: coverage = 1

29 True Meaning of Accuracy
Question: What does inaccurate data really mean?
Answer: Nothing in particular
– We provide the mechanisms
– Application determines interpretation
Sort of

30 Accuracy in CBC
Each participant P: P-sightings (time, latitude, longitude, species)
Approximation
– time, latitude, longitude: range or Gaussian
– species: set of values with probabilities; may include ┴
Confidence: based on P’s capabilities or experience; tuple- or relation-level
Coverage: fraction of activity captured by P

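31 Subtleties in TDM Accuracy Model
Difference between: one tuple with set-approximation {a, b, c, d}
and: four tuples a, b, c, d, each with conf = 0.25
Difference between: one tuple c with conf = 0.5
and: one tuple with set-approximation {c, ┴}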

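32 Subtleties: CBC
Difference between: one tuple with set-approximation {sparrow, finch, toucan, macaw}
and: four tuples sparrow, finch, toucan, macaw, each with conf = 0.25
Difference between: one tuple sparrow with conf = 0.5
and: one tuple with set-approximation {sparrow, ┴}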

33 TDM Accuracy – Status
Studying the varied literature in uncertain, probabilistic, fuzzy, approximate, incomplete, and imprecise databases
Studying several applications in detail
– Christmas Bird Count
– Microarray database (SMD)
– Gene sequence database (SGD)
– Sensor and RFID data
– Deduplication

34 TDM Accuracy – Status (cont’d)
Decisions: what’s in and what’s out
Out → user-defined types
Accuracy model decisions closely tied to algebra of operations (TriQL)

35 TDM: Lineage
Lineage (a.k.a. provenance)
– How data came into existence
– How it has evolved over time
TDM tracks lineage at tuple level

36 Digression: No-Overwrite Storage
Tuples never physically updated or deleted
– Create new tuple
– “Expire” old one
– Associate them using lineage
Advantages
– Historical lineage
– Phantom lineage (deleted data only)
– Versioning
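
The create/expire/associate steps above can be illustrated with a small in-memory store. This is a sketch under assumptions, not Trio's storage layer: `NoOverwriteStore`, `StoredTuple`, and the integer timestamps are hypothetical names and simplifications.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class StoredTuple:
    tuple_id: int
    data: Dict[str, Any]
    created_at: int
    expired_at: Optional[int] = None    # None means this version is current
    derived_from: Optional[int] = None  # lineage link to the previous version


class NoOverwriteStore:
    """Tuples are never physically updated or deleted, only expired."""

    def __init__(self) -> None:
        self.tuples: Dict[int, StoredTuple] = {}
        self._next_id = 1

    def insert(self, data: Dict[str, Any], now: int,
               derived_from: Optional[int] = None) -> int:
        t = StoredTuple(self._next_id, dict(data), now, derived_from=derived_from)
        self.tuples[t.tuple_id] = t
        self._next_id += 1
        return t.tuple_id

    def update(self, tuple_id: int, changes: Dict[str, Any], now: int) -> int:
        """Expire the old version and create a new one linked to it."""
        old = self.tuples[tuple_id]
        old.expired_at = now
        return self.insert({**old.data, **changes}, now, derived_from=tuple_id)

    def delete(self, tuple_id: int, now: int) -> None:
        """Logical delete: the tuple stays available for lineage queries."""
        self.tuples[tuple_id].expired_at = now

    def history(self, tuple_id: int) -> List[int]:
        """Version chain from the given tuple back to its first version."""
        chain: List[int] = []
        current: Optional[int] = tuple_id
        while current is not None:
            chain.append(current)
            current = self.tuples[current].derived_from
        return chain
```

Because expired tuples remain in the store, a deleted tuple can still appear in the lineage of other data, which is the kind of "phantom lineage" the slide mentions.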

37 Back to Lineage
When, how, and from-what was a tuple t derived?
When
– Usually “at time T”
– Sometimes “now”

38 TDM: Lineage – How
Result of a query, or part of a query-defined view
Inserted by a program, invoked with certain parameters
Result of a database update
Part of a bulk data load
Part of a data import from outside sources

39 TDM: Lineage – From What
Differs for different lineage types (details omitted)
Instance-based versus schema-based lineage
– Data values vs. schema elements
– Fine-grained vs. coarse-grained
– Expensive vs. cheap
Forward versus backward lineage
– What data was used to derive tuple t?
– What data was tuple t used to derive?
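
The two questions at the end of this slide correspond to traversing a derivation graph in opposite directions. The sketch below assumes instance-based, tuple-level lineage kept as two adjacency maps; `LineageGraph` and its method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set


class LineageGraph:
    def __init__(self) -> None:
        # derived tuple -> tuples it was directly derived from
        self.sources: Dict[Hashable, Set[Hashable]] = defaultdict(set)
        # source tuple -> tuples directly derived from it
        self.targets: Dict[Hashable, Set[Hashable]] = defaultdict(set)

    def record(self, derived: Hashable, inputs: Iterable[Hashable]) -> None:
        """Register that `derived` was computed from `inputs`."""
        for src in inputs:
            self.sources[derived].add(src)
            self.targets[src].add(derived)

    @staticmethod
    def _closure(start: Hashable, edges: Dict[Hashable, Set[Hashable]]) -> Set[Hashable]:
        """All nodes reachable from `start` along `edges` (excluding `start`)."""
        seen: Set[Hashable] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def backward(self, t: Hashable) -> Set[Hashable]:
        """What data was used to derive tuple t?"""
        return self._closure(t, self.sources)

    def forward(self, t: Hashable) -> Set[Hashable]:
        """What data was tuple t used to derive?"""
        return self._closure(t, self.targets)
```

Keeping both maps trades extra storage for cheap traversal in either direction; a schema-based variant would store relation names instead of tuple identifiers.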

40 Formalizing Lineage
Every database has a (logical) Lineage Relation
Lineage (tupleID, derivation-type, time, how-derived, lineage-data)
tupleID is key
Inexact lineage using TDM accuracy model?
Default for each relation: no lineage
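
One way to picture the Lineage relation is as an ordinary table. The sketch below uses Python's standard `sqlite3` module; the column types, the underscore spellings of the attribute names, the ISO date format, and the sample rows are assumptions for illustration only.

```python
import sqlite3


def build_lineage_db() -> sqlite3.Connection:
    """Create an in-memory Lineage table with two illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE Lineage (
            tupleID         INTEGER PRIMARY KEY,  -- tupleID is the key
            derivation_type TEXT NOT NULL,        -- query, program, update, load, import
            time            TEXT,                 -- when the tuple was derived
            how_derived     TEXT,                 -- e.g. query text or program name
            lineage_data    TEXT                  -- what the tuple was derived from
        )
        """
    )
    conn.executemany(
        "INSERT INTO Lineage VALUES (?, ?, ?, ?, ?)",
        [
            (1, "import", "2004-04-01", "hypothetical_feed", None),
            (2, "query", "2004-04-02", "SELECT ... FROM R", "1"),
        ],
    )
    return conn


def imported_on(conn: sqlite3.Connection, tuple_id: int, date: str) -> bool:
    """Was this tuple itself part of a data import on the given date?"""
    row = conn.execute(
        "SELECT 1 FROM Lineage "
        "WHERE tupleID = ? AND derivation_type = 'import' AND time = ?",
        (tuple_id, date),
    ).fetchone()
    return row is not None
```

A tuple with no row in the table simply has no recorded lineage, matching the slide's default of "no lineage" for each relation.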

41 Lineage in CBC
Raw observations merged and massaged into main database each January
Combined with previous years
– Interesting design question: Capture evolution in data or in lineage?
Correlations with other data: environmental, geographic, population, …

42 TriQL: The Trio Query Language
Queries and updates
Extend SQL
Keep it simple
Keep it closed
– Queries on TDM data produce TDM data (e.g., no ranked results)

43 TriQL Steps
1. Semantics of standard SQL over TDM data
2. Extensions to SQL for queries involving explicit operations on accuracy and lineage
3. Update options, especially accuracy updates

44 TriQL Steps
1. Semantics of standard SQL over TDM data
2. Extensions to SQL for queries involving explicit operations on accuracy and lineage
3. Update options, especially accuracy updates

45 Standard SQL Over TDM Data
Query(data + accuracy + lineage) → Result(data + accuracy + lineage)
Lineage computation “straightforward”
– Based on previous work
Accuracy computation quickly gets interesting
– Define Accuracy Algebra

46 Accuracy Algebra: Basic Questions
Tuple t (conf = 0.8) ⋈ tuple u (conf = 0.7) = tuple tu (conf = ???)
– Minimum, maximum, product, …
→ Support multiple join operators + UDFs
Relation (coverage = 0.8) Op relation (coverage = 0.9) = relation (coverage = ???)
– Op: ⋈, ∩, ∪, –, …
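
The candidate rules named on this slide give different answers for the same pair of input confidences. The sketch below simply evaluates the three candidates on the slide's numbers; it does not claim which rule Trio adopts, and `join_confidence` and `CANDIDATES` are hypothetical names. Exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction
from typing import Callable, Dict

# Candidate ways to combine the confidences of two joined tuples.
CANDIDATES: Dict[str, Callable[[Fraction, Fraction], Fraction]] = {
    "minimum": min,
    "maximum": max,
    "product": lambda a, b: a * b,  # matches treating the two tuples as independent
}


def join_confidence(conf_t: Fraction, conf_u: Fraction, rule: str) -> Fraction:
    """Confidence of the joined tuple under the named candidate rule."""
    return CANDIDATES[rule](conf_t, conf_u)


t_conf = Fraction(8, 10)  # conf = 0.8
u_conf = Fraction(7, 10)  # conf = 0.7
results = {rule: join_confidence(t_conf, u_conf, rule) for rule in CANDIDATES}
```

With these inputs the three rules yield 7/10, 4/5, and 14/25 respectively, which is why the slide raises the possibility of supporting multiple join operators and user-defined functions.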

47 Accuracy Algebra: Observation (1)
Simple operations can turn approximation into confidence
– σ A=x over a tuple with A = {x, y} yields a tuple with A = x, conf = 0.5
– A tuple with A = {w, x, y} ⋈ a tuple with A = {x, y, z} yields a tuple with A = {x, y}, conf = 2/9
Non-uniform set-approximations, interval approximations, aggregation, …
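
Both numbers on this slide can be reproduced by enumerating possible values, assuming each set-approximation is uniform and the two joined values are independent. Those two assumptions, and the helper names below, are illustrative and not taken from the slides.

```python
from fractions import Fraction
from itertools import product
from typing import Dict, Hashable, Sequence, Tuple

Dist = Dict[Hashable, Fraction]


def uniform(values: Sequence[Hashable]) -> Dist:
    """Uniform distribution over a set-approximation."""
    p = Fraction(1, len(values))
    return {v: p for v in values}


def select_equals(approx: Dist, constant: Hashable) -> Fraction:
    """Confidence of the result tuple of a selection A = constant."""
    return approx.get(constant, Fraction(0))


def equijoin(left: Dist, right: Dist) -> Tuple[Fraction, Dist]:
    """Join two independent approximate values on equality.

    Returns the confidence that the values match, and the distribution of
    the joined value given that they match.
    """
    matches: Dist = {}
    for (a, pa), (b, pb) in product(left.items(), right.items()):
        if a == b:
            matches[a] = matches.get(a, Fraction(0)) + pa * pb
    conf = sum(matches.values(), Fraction(0))
    if conf == 0:
        return conf, {}
    return conf, {v: p / conf for v, p in matches.items()}


selection_conf = select_equals(uniform(["x", "y"]), "x")
join_conf, join_dist = equijoin(uniform(["w", "x", "y"]), uniform(["x", "y", "z"]))
```

Here `selection_conf` is 1/2 and `join_conf` is 2/9 (the two matching worlds, x with x and y with y, each have probability 1/9), with the joined value equally likely to be x or y.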

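48 Accuracy Algebra: Observation (2)
Simple operations can produce inexpressible results
– A tuple with A = {x, y} ⋈ relation AB with tuples (x, 1) and (x, 2) yields tuples (x, 1) and (x, 2), conf = 0.5
– σ A=B over AB with approximation {x, y} yields tuples (x, x) and (y, y), conf = 0.25
Approximate approximations?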

49 Accuracy Algebra: Status
Studying simplified TDM accuracy model
– Set-approximations + “maybe” tuples
– Various theoretical results
Worrying about lack of closure
TDM as user interface to more powerful (closed, complete) underlying model?

50 Accuracy Algebra: Status (cont’d)
Studying applications
– CBC, SMD, SGD, sensor, RFID, deduplication
– Frequency of various operations
What’s in and what’s out?
Out → user-defined functions

51 The Trio Prototype
Usual goals
– Rapid deployment of first version
– Resilience to research fickleness
– Reasonably efficient
– Extensibility
Need to choose among:
1. Implement on top of a conventional DBMS
2. Build from scratch
3. Use extensible OR-DBMS

52 Trio on Conventional DBMS
+ Rapid deployment
+ Resilience to (small) changes
− Messy, inefficient, uninteresting?
− No customized storage structures, indexes, query optimization, …
[Diagram: TDM Data → Data Translator → Conventional Relations; TriQL Query → Query Translator → SQL Query; RDBMS]

53 Trio from Scratch
+ Keeps the grad students busy
+ Can experiment at every level of the system, fine-tune performance
− Delayed deployment
− Not resilient to changes?
− Many DBMS functions re-coded
  Buffer management, concurrency control, recovery, …

54 Trio using Extensible OO-DBMS
[Diagram: Extensible OO-DBMS with the following extensions]
Built-in Accuracy Predicates
Built-in Lineage Functions
Stored Procedures for Special Query Processing
Extensions to Query Operators
Data Types Extended with Accuracy
Data Types for Lineage Relation

55 Conclusion
1. Data 2. Accuracy 3. Lineage
– Data Model: Combine and distill previous work
– Query Language: Algebra, TriQL, UDFs
– System: Efficient, usable, soon
Plenty of applications
Google: “stanford trio”

56 Thanks
Omar Benjelloun, Ashok Chandra, Anish Das Sarma, Alon Halevy, Evan Fanfan Zeng, Jim Gray, Wei Hong, David DeWitt, Dave Maier