Seminar on Databases and the Internet. Focus: Probabilistic and Uncertain Databases

Presentation transcript:

Seminar on Databases and the Internet
Focus: Probabilistic and Uncertain Databases
Credit: This slideshow is a hodge-podge of slides, some of which were taken from Dan Suciu, Benny Kimelfeld, and Gerhard Weikum.

Traditionally, Databases are Deterministic and Queries are Boolean
- An item either is in the database or is not
  – Database represents a “complete world”
- A tuple either is in the query answer or is not
- The schema of the data is well known
- This applies to both relational data and XML (= tree-structured data)
- Why is this so important?
- Does this capture real-world application scenarios?

Example 1: RFID Ecosystem at UW [Welbourne’2007]

RFID Data is Uncertain
- RFID data = noisy
  – SIGHTING(tagID, antennaID, time)
- Derived data = probabilistic
  – “John entered Room 524 at 9:15” prob=0.6
  – “John carried laptop x77 at 11:03” prob=0.8
  – ...
- Queries
  – “Which people were in Room 478 yesterday?”
Massive amounts of probabilistic data from RFIDs, sensors

Example 2: Information Extraction [Gupta&Sarawagi’2006]
Source text: ...52 A Goregaon West Mumbai...
Alternative extractions (columns: ID, House-No, Street, City, P):
  – 1 | 52 | Goregaon West | Mumbai
  – A | Goregaon West | Mumbai
  – Goregaon | West Mumbai
  – A | Goregaon | West Mumbai
Here probabilities are meaningful: ≈20% of such extractions are correct

Example 3: Scanning Aerial Photography Find regions that include a factory building and a road … with a high probability

Analyzing a Region
[Figure: aerial-photo region with labeled objects: road (90%); road (60%); factory bldg. & wall (40%) / house & road (30%); house (50%) / factory bldg. (50%); factory bldg. (40%) / apt. building (50%). Matches: 45%, 36%, 24%, 36%.]
- What is the probability that this region is an answer (i.e., includes a factory building and a road)?
- But specifying the probability of each match does not answer the question!
- The probability of each match can be significantly smaller than the probability that there is any match

Question: What must we rethink to deal with uncertain data? Ideas?
Answer: Everything!
  – Data model
  – Query semantics (meaning)
  – Physical data storage
  – Efficient evaluation algorithms
  – Intuitive querying

Types of Uncertainty
- Uncertain Queries:
  – Data may (or may not) be deterministic
  – User wants to get approximate answers to the query (= fuzzy queries)
- Uncertain or Probabilistic Data:
  – May have possibilities for data, without probabilities (= uncertain database)
  – May have probabilities of data (= probabilistic database)
  – Queries may or may not be exact

Fuzzy Queries

When are Fuzzy Queries Needed?
- When there are no answers
- When there are too many answers
- When the user has preferences over the data
- When the user is not sufficiently familiar with the structure of the database

The Empty Answers Problem [Agrawal, Chaudhuri, Das, Gionis 2003]
- Query is overspecified: no answers
- Example: try to buy a house in SF…
    SELECT * FROM Houses
    WHERE bedrooms = 3 AND style = 'craftsman'
      AND district = 'Noe Valley' AND price < …
… good luck!

WHIRL: IR over Relations [W.W. Cohen: SIGMOD’98]
Add text-similarity selection and join to relational algebra
Example:
    Select * From Movies M, Reviews R
    Where M.Plot ~ "fight" And M.Year > 1990
      And R.Rating > 3 And M.Title ~ R.Title
Movies (Title, Plot, …, Year):
  – Matrix | In the near future … computer hacker Neo … fight training …
  – Hero | In ancient China … fights … sword fight … fights Broken Sword …
  – Shrek 2 | In Far Far Away … our lovely hero fights with cat killer …
Reviews (Title, Comment, …, Rating):
  – Titles: Matrix 1; Matrix Reloaded; Matrix Eigenvalues; Ying xiong aka. Hero
  – Comments: … matrix spectrum … orthonormal …; … fight for peace …; … sword fight … dramatic colors …; … cool fights … new techniques …; … fights … and more fights …; … fairly boring …
The record linkage problem!

XML Document
[Figure: XML document tree. Element labels: bib, journal, article, title, author, name, details, aff. Text values: “Journal of XML”, “XML: Fighting Global Terrorism”, “A Survey of XML”, “E. Smith”, “Cal Tech”, “U. of Chicago”.]

Query
[Figure: the same XML document next to a query tree with nodes bib, article, title, name. Legend: Descendant Edge, Child Edge.]

Query
[Figure: the same XML document next to a query tree with nodes author, article, title, name, article, title.]
Cannot be satisfied:
1. Root nodes do not agree
2. No available
3. Author should be below article, not above
[Kanza et al.: PODS98, PODS01] [Amer-Yahia et al.: SIGMOD04] [Brodiansky et al.: CIKM07]

Another Type of Fuzzy Query: Keyword Proximity Search

Keyword Search
- Nowadays… exposure to many databases
  – Different types (relational, XML, RDF…)
  – Different schemas
- Not easy to use traditional paradigms of querying (e.g., SQL, XQuery, SPARQL) and, moreover, they require a thorough understanding of the schema
- Goal: Enable users to instantly pose (inaccurate) queries without knowing the schema
- The natural (and popular) option: Keyword Search

Keyword Proximity Search (KPS)
- Data have varying degrees of structure
  – Relational (w/ foreign keys), XML (w/ id-references)
  – Natural representation by a graph
  – Usually, data-centric rather than document-centric
- A query is a set of keywords
  – No structural constraints
- The Goal: Extract meaningful parts of data w.r.t. the keywords
References: Agrawal et al., ICDE’02; Hristidis et al., VLDB’02,03, ICDE’03; Bhalotia et al., VLDB’05; Kacholia et al., VLDB’06; Ding et al., ICDE’07; Liu et al., SIGMOD’06; Wang et al., VLDB’06; Luo et al., SIGMOD’07; Golenberg, SIGMOD’07; …

Example: Search in RDB
search: Belgium, Brussels
Cities (ID, Name, Population):
  – 22 | Amsterdam
  – Brussels
Organizations (ID, Name, Head Q.):
  – 135 | EU | 73
  – 175 | ESA | 81
Countries (Code, Name, Area, Capital):
  – NL | Netherlands
  – B | Belgium
Memberships (Country, Org.):
  – B | 135
  – NL | 135

search: Belgium, Brussels
[Figure: the same Cities, Organizations, Countries and Memberships tables]
Brussels is the capital city of Belgium

search: Belgium, Brussels
[Figure: the same Cities, Organizations, Countries and Memberships tables]
Brussels hosts EU and Belgium is a member

Example: Search in XML
search: Yannakakis, Approximation

search: Yannakakis, Approximation
Yannakakis wrote a paper about Approximation

search: Yannakakis, Approximation
Yannakakis is cited by a paper about Approximation

Some Open Problems
- A lot of the work done is applicable only to unsatisfiable queries, but not to queries with too many answers
- No user feedback to improve results of fuzzy queries
- Limited queries
- Little has been done for the case where the query is fuzzy and the data is uncertain
  – We look at uncertain data (with exact queries) now…

Uncertain Databases

Motivation
- Until now, we assumed that the data was certain, but the query was uncertain
  – query was answered approximately
- If data comes from many sources, of varying reliability, then data is uncertain
  – how is such uncertainty represented?
  – how do we query uncertain data?

Main Issues [Das Sarma, Benjelloun, Halevy, Widom, ICDE06]
- Possible worlds semantics
- Constructs/Models for uncertain data
  – completeness: can everything be modeled?
  – closure: can query answers always be represented in the model?
  – expressiveness
- Also: membership, uniqueness, equivalence, minimization

Example
- Volunteers record birds that they have seen
- Tables:
  – BirdInfo(birdname, color, size)
  – Sightings(observer, when, where, birdname)
- A volunteer may not be sure which bird she saw
- Explicit model: List all possible databases
- Implicit model: Use a compact representation for possible databases

Attribute-Or Model
Amy saw a jay, and Mike saw either a crow or a raven
  Observer | When | Where | BirdName
  Amy | 12/23/06 | Technion | jay
  Mike | 12/23/06 | Technion | {crow, raven}
This relation represents two instances:
  Observer | When | Where | BirdName
  Amy | 12/23/06 | Technion | jay
  Mike | 12/23/06 | Technion | crow
and
  Observer | When | Where | BirdName
  Amy | 12/23/06 | Technion | jay
  Mike | 12/23/06 | Technion | raven
- Is this model complete?
- What should be the answer to SELECT * FROM Sightings WHERE Birdname='raven'
- Is this model closed?
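The closure question above can be explored with a small sketch. The code below is an illustration only: the function names `possible_worlds` and `select_bird` are hypothetical, and a Python set stands in for an attribute-or. It expands the relation into its ordinary instances and runs the selection in each one.

```python
from itertools import product

# Attribute-or relation from the slide: the last field is a set of alternatives.
sightings = [
    ("Amy", "12/23/06", "Technion", {"jay"}),
    ("Mike", "12/23/06", "Technion", {"crow", "raven"}),
]

def possible_worlds(relation):
    """Yield each ordinary relation obtained by picking one alternative per tuple."""
    choices = [sorted(row[3]) for row in relation]
    for picks in product(*choices):
        yield [row[:3] + (bird,) for row, bird in zip(relation, picks)]

def select_bird(world, bird):
    """SELECT * FROM Sightings WHERE BirdName = bird, evaluated on one world."""
    return [row for row in world if row[3] == bird]

worlds = list(possible_worlds(sightings))
answers = [select_bird(world, "raven") for world in worlds]

for world, answer in zip(worlds, answers):
    print(world, "->", answer)
```

In one world the answer is empty and in the other it holds Mike's tuple. The set of answers therefore differs in whether a tuple exists at all, which attribute-ors by themselves do not seem able to express; this is the kind of observation the closure question is after.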

Maybe-Tuples
- A maybe-tuple is one that may or may not appear
- What are the instances represented by this relation?
  Observer | When | Where | BirdName
  Amy | 12/23/06 | Technion | jay
  Mike | 12/23/06 | Technion | {crow, raven} ?

Tuple Constraints: Example
  Observer | When | Where | BirdName
  t1: Amy | 12/23/06 | Technion | jay
  t2: Amy | 12/23/06 | Technion | crow
Constraint: t1 or t2
- What instances are represented here?
- Can it be represented by attribute-ors?
- Can it be represented by maybe-tuples?
- Can it be represented by attribute-ors with maybe-tuples?

Open Questions
- Solutions have been presented to achieve closure, completeness and efficiency in relational databases [Antova, Koch, Olteanu, ICDE07] [Benjelloun, Das Sarma, Halevy, Theobald, Widom, J of VLDB, 2008]
- How should one model an uncertain XML database?
  – Especially while taking into consideration closure, completeness, efficiency?

Probabilistic Databases

What is a Probabilistic Database?
- “An item belongs to the database” is a probabilistic event
  – Tuple-existence uncertainty
  – Attribute-value uncertainty
- “A tuple is an answer to the query” is a probabilistic event
- Can be extended to all data models

Possible Worlds Semantics
Attribute domains: int, char(30), varchar(55), datetime
  # values: $2^{32}$, $2^{120}$, $2^{440}$, $2^{64}$
Relational schema: Employee(name:varchar(55), dob:datetime, salary:int)
  # of tuples: $2^{440} \times 2^{64} \times 2^{32}$
  # of instances: $2^{2^{440} \times 2^{64} \times 2^{32}}$
Database schema: Employee(...), Projects(...), Groups(...), WorksFor(...)
  # of instances: N (= BIG but finite)

The Definition
The set of all possible database instances: $INST = \{I_1, I_2, I_3, \ldots, I_N\}$
Definition: A probabilistic database $I^p$ is a probability distribution on INST, i.e. $\Pr : INST \to [0,1]$ s.t. $\sum_{i=1}^{N} \Pr(I_i) = 1$
Definition: A possible world is $I$ s.t. $\Pr(I) > 0$
We will use $\Pr$ and $I^p$ interchangeably

Example
$I^p$ =
$I_1$, $\Pr(I_1) = 1/3$:
  Customer | Address | Product
  John | Seattle | Gizmo
  John | Seattle | Camera
  Sue | Denver | Gizmo
$I_2$, $\Pr(I_2) = 1/12$:
  Customer | Address | Product
  John | Boston | Gadget
  Sue | Denver | Gizmo
$I_3$, $\Pr(I_3) = 1/2$:
  Customer | Address | Product
  John | Seattle | Gizmo
  John | Seattle | Camera
  Sue | Seattle | Camera
$I_4$, $\Pr(I_4) = 1/12$:
  Customer | Address | Product
  John | Boston | Gadget
  Sue | Seattle | Camera
Possible worlds = $\{I_1, I_2, I_3, I_4\}$

Tuples as Events
One tuple $t$: event $t \in I$
  $\Pr(t) = \sum_{I:\, t \in I} \Pr(I)$
Two tuples $t_1, t_2$: event $t_1 \in I \wedge t_2 \in I$
  $\Pr(t_1 t_2) = \sum_{I:\, t_1 \in I \wedge t_2 \in I} \Pr(I)$
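As a worked instance of the two sums above, the following sketch computes tuple probabilities over the four-world example from the earlier slide. The helper name `pr_tuples` and the assignment of tables to $I_1, \ldots, I_4$ in slide order are assumptions made for illustration.

```python
from fractions import Fraction

# Four possible worlds with their probabilities (Customer, Address, Product).
worlds = {
    "I1": ({("John", "Seattle", "Gizmo"), ("John", "Seattle", "Camera"),
            ("Sue", "Denver", "Gizmo")}, Fraction(1, 3)),
    "I2": ({("John", "Boston", "Gadget"), ("Sue", "Denver", "Gizmo")},
           Fraction(1, 12)),
    "I3": ({("John", "Seattle", "Gizmo"), ("John", "Seattle", "Camera"),
            ("Sue", "Seattle", "Camera")}, Fraction(1, 2)),
    "I4": ({("John", "Boston", "Gadget"), ("Sue", "Seattle", "Camera")},
           Fraction(1, 12)),
}

def pr_tuples(*tuples):
    """Sum Pr(I) over all worlds I that contain every given tuple."""
    return sum(p for inst, p in worlds.values() if all(t in inst for t in tuples))

t1 = ("John", "Seattle", "Gizmo")
t2 = ("Sue", "Denver", "Gizmo")

print(pr_tuples(t1))       # Pr(t1)
print(pr_tuples(t2))       # Pr(t2)
print(pr_tuples(t1, t2))   # Pr(t1 t2)
```

Here $\Pr(t_1 t_2)$ differs from $\Pr(t_1) \cdot \Pr(t_2)$, so the two tuples are correlated in this distribution.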

Query Semantics
Given a query $Q$ and a probabilistic database $I^p$, what is the meaning of $Q(I^p)$?

Query Semantics
Semantics 1: Possible Answers. A probability distribution on sets of tuples:
  $\forall A.\ \Pr(Q = A) = \sum_{I \in INST:\, Q(I) = A} \Pr(I)$
Semantics 2: Possible Tuples. A probability function on tuples:
  $\forall t.\ \Pr(t \in Q) = \sum_{I \in INST:\, t \in Q(I)} \Pr(I)$

Example: Query Semantics
$Purchase^p$ =
$I_1$, $\Pr(I_1) = 1/3$:
  Name | City | Product
  John | Seattle | Gizmo
  John | Seattle | Camera
  Sue | Denver | Gizmo
  Sue | Denver | Camera
$I_2$, $\Pr(I_2) = 1/12$:
  Name | City | Product
  John | Boston | Gizmo
  Sue | Denver | Gizmo
  Sue | Seattle | Gadget
$I_3$, $\Pr(I_3) = 1/2$:
  Name | City | Product
  John | Seattle | Gizmo
  John | Seattle | Camera
  Sue | Seattle | Camera
$I_4$, $\Pr(I_4) = 1/12$:
  Name | City | Product
  John | Boston | Camera
  Sue | Seattle | Camera
Query:
    SELECT DISTINCT x.product
    FROM Purchase^p x, Purchase^p y
    WHERE x.name = 'John' and x.product = y.product and y.name = 'Sue'
Possible answers semantics:
  Answer set | Probability
  Gizmo, Camera | 1/3 = $\Pr(I_1)$
  Gizmo | 1/12 = $\Pr(I_2)$
  Camera | 7/12 = $\Pr(I_3) + \Pr(I_4)$
Possible tuples semantics:
  Tuple | Probability
  Camera | 11/12 = $\Pr(I_1) + \Pr(I_3) + \Pr(I_4)$
  Gizmo | 5/12 = $\Pr(I_1) + \Pr(I_2)$
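The two result tables can be checked by evaluating the query in every world and adding up probabilities. This is a brute-force sketch, not an evaluation algorithm from the slides; the function `q` is a hand-written Python equivalent of the SQL query, and the world-to-probability assignment follows slide order.

```python
from collections import defaultdict
from fractions import Fraction

# Possible worlds of Purchase^p as (instance, probability); rows are (Name, City, Product).
WORLDS = [
    ({("John", "Seattle", "Gizmo"), ("John", "Seattle", "Camera"),
      ("Sue", "Denver", "Gizmo"), ("Sue", "Denver", "Camera")}, Fraction(1, 3)),
    ({("John", "Boston", "Gizmo"), ("Sue", "Denver", "Gizmo"),
      ("Sue", "Seattle", "Gadget")}, Fraction(1, 12)),
    ({("John", "Seattle", "Gizmo"), ("John", "Seattle", "Camera"),
      ("Sue", "Seattle", "Camera")}, Fraction(1, 2)),
    ({("John", "Boston", "Camera"), ("Sue", "Seattle", "Camera")}, Fraction(1, 12)),
]

def q(instance):
    """Products bought by both John and Sue (the self-join query on one world)."""
    john = {prod for name, _, prod in instance if name == "John"}
    sue = {prod for name, _, prod in instance if name == "Sue"}
    return frozenset(john & sue)

possible_answers = defaultdict(Fraction)  # answer set -> probability
possible_tuples = defaultdict(Fraction)   # single tuple -> probability

for instance, prob in WORLDS:
    answer = q(instance)
    possible_answers[answer] += prob
    for product in answer:
        possible_tuples[product] += prob

print(dict(possible_answers))
print(dict(possible_tuples))
```

The possible-answers probabilities sum to 1 because they form a distribution over answer sets; the possible-tuples probabilities do not need to, since several tuples can appear in the same answer.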

Possible-Worlds Semantics: Summary
- Very powerful model
  – Complete: Can capture any instance distribution, any tuple correlations
- Intuitive, clean formal semantics for any SQL query
  – Translates to queries over deterministic instances

Possible Worlds Semantics: Summary (contd.)
- Possible answers semantics
  – Precise
  – Can be used to compose queries
  – Difficult user interface
- Possible tuples semantics
  – Less precise, but simple; sufficient for most apps
  – Cannot be used to compose queries
  – Simple user interface

Possible Worlds Semantics: Summary (contd.)
- Not very useful as a representation or implementation tool
  – HUGE number of possible worlds!
- Need more effective representation formalisms
  – Something that users can understand/explore
  – Allow more efficient query execution
  – Avoid “possible worlds explosion”
  – Perhaps giving up completeness

Explicit Independent Tuples
Tuple-independent probabilistic database:
- $TUP = \{t_1, t_2, \ldots, t_M\}$ = all tuples
- $pr : TUP \to [0,1]$ (no restrictions)
- $INST = \mathcal{P}(TUP)$, $N = 2^M$
- $\Pr(I) = \prod_{t \in I} pr(t) \times \prod_{t \notin I} (1 - pr(t))$

Tuple Prob.: Possible Worlds
$I^p$ =
  Name | City | pr
  John | Seattle | $p_1 = 0.8$
  Sue | Boston | $p_2 = 0.6$
  Fred | Boston | $p_3 = 0.9$
Possible worlds:
  $I_1 = \emptyset$: $(1-p_1)(1-p_2)(1-p_3)$
  $I_2$ = {John}: $p_1(1-p_2)(1-p_3)$
  $I_3$ = {Sue}: $(1-p_1)p_2(1-p_3)$
  $I_4$ = {Fred}: $(1-p_1)(1-p_2)p_3$
  $I_5$ = {John, Sue}: $p_1 p_2 (1-p_3)$
  $I_6$ = {John, Fred}: $p_1 (1-p_2) p_3$
  $I_7$ = {Sue, Fred}: $(1-p_1) p_2 p_3$
  $I_8$ = {John, Sue, Fred}: $p_1 p_2 p_3$
$\sum = 1$
$E[\,size(I^p)\,] = 2.3$ tuples
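The eight worlds and the two summary numbers on this slide can be reproduced by direct enumeration. The sketch below is an illustration under the tuple-independence assumption stated on the previous slide; the function name `enumerate_worlds` is hypothetical.

```python
from itertools import product
from math import isclose

# Tuple probabilities from the slide: p1 = 0.8, p2 = 0.6, p3 = 0.9.
pr = {("John", "Seattle"): 0.8, ("Sue", "Boston"): 0.6, ("Fred", "Boston"): 0.9}

def enumerate_worlds(pr):
    """Yield (instance, probability) for every subset of the tuples."""
    tuples = list(pr)
    for mask in product([False, True], repeat=len(tuples)):
        instance = frozenset(t for t, keep in zip(tuples, mask) if keep)
        prob = 1.0
        for t, keep in zip(tuples, mask):
            # A present tuple contributes pr(t), an absent one 1 - pr(t).
            prob *= pr[t] if keep else 1.0 - pr[t]
        yield instance, prob

all_worlds = list(enumerate_worlds(pr))
total = sum(prob for _, prob in all_worlds)
expected_size = sum(len(instance) * prob for instance, prob in all_worlds)

print(len(all_worlds), total, expected_size)
```

By linearity of expectation the expected size is simply $p_1 + p_2 + p_3$, so the enumeration is only needed as a check; it stops being feasible once $M$ grows, since there are $2^M$ worlds.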

Tuple-Independent DBs are Incomplete
  Name | Address | pr
  John | Seattle | $p_1$
  Sue | Seattle | $p_2$
$I^p$ has the possible worlds {(John, Seattle), (Sue, Seattle)}, {(John, Seattle)} and $\emptyset$; the probabilities shown are $p_1$, $p_1 p_2$ and $1 - p_1 - p_1 p_2$.
- Very limited: cannot capture correlations across tuples
- Not closed: query operators can introduce complex correlations!

Tuple Prob.: Query Evaluation
Person:
  Name | City | pr
  John | Seattle | $p_1$
  Sue | Boston | $p_2$
  Fred | Boston | $p_3$
Purchase:
  Customer | Product | Date | pr
  John | Gizmo | ... | $q_1$
  John | Gadget | ... | $q_2$
  John | Gadget | ... | $q_3$
  Sue | Camera | ... | $q_4$
  Sue | Gadget | ... | $q_5$
  Sue | Gadget | ... | $q_6$
  Fred | Gadget | ... | $q_7$
Query:
    SELECT DISTINCT x.city
    FROM Person x, Purchase y
    WHERE x.Name = y.Customer and y.Product = 'Gadget'
Result:
  Tuple | Probability
  Seattle | $p_1 \left(1 - (1-q_2)(1-q_3)\right)$
  Boston | $1 - \left(1 - p_2\left(1 - (1-q_5)(1-q_6)\right)\right) \times \left(1 - p_3 q_7\right)$
Query evaluation is usually intractable (#P-complete)
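To see where the two formulas come from, the sketch below compares them with a brute-force sum over all $2^{10}$ possible worlds of the two tables. The numeric values for $p_i$ and $q_i$ are hypothetical, the row number in each Purchase key stands in for the elided Date column, and `brute_force` is an illustrative helper rather than a method from the slides.

```python
from itertools import product
from math import isclose

# Hypothetical probabilities; only the table shapes come from the slide.
person = {("John", "Seattle"): 0.8, ("Sue", "Boston"): 0.6, ("Fred", "Boston"): 0.9}
purchase = {  # (Customer, Product, row number) -> q_i
    ("John", "Gizmo", 1): 0.5, ("John", "Gadget", 2): 0.3, ("John", "Gadget", 3): 0.7,
    ("Sue", "Camera", 4): 0.4, ("Sue", "Gadget", 5): 0.2, ("Sue", "Gadget", 6): 0.6,
    ("Fred", "Gadget", 7): 0.1,
}

def brute_force(person, purchase):
    """Pr(city in answer), summed over every world of both independent tables."""
    rows = [("person", k, p) for k, p in person.items()]
    rows += [("purchase", k, q) for k, q in purchase.items()]
    result = {}
    for mask in product([False, True], repeat=len(rows)):
        prob, people, buys = 1.0, [], []
        for (rel, key, p), keep in zip(rows, mask):
            prob *= p if keep else 1.0 - p
            if keep:
                (people if rel == "person" else buys).append(key)
        cities = {city for name, city in people
                  for cust, prod, _ in buys if cust == name and prod == "Gadget"}
        for city in cities:
            result[city] = result.get(city, 0.0) + prob
    return result

p1, p2, p3 = person.values()
q1, q2, q3, q4, q5, q6, q7 = purchase.values()

# Closed forms from the slide.
seattle = p1 * (1 - (1 - q2) * (1 - q3))
boston = 1 - (1 - p2 * (1 - (1 - q5) * (1 - q6))) * (1 - p3 * q7)

exact = brute_force(person, purchase)
print(seattle, exact["Seattle"])
print(boston, exact["Boston"])
```

The closed forms work here because each city's answer depends on disjoint, independent groups of tuples. The brute-force version is exponential in the number of tuples and is only meant as a check on small inputs.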

Other Models for Probabilistic Data
- There are many more models for relational probabilistic data (see papers on course site)
- There are many models for probabilistic XML
  – We will see one now

A ProTDB Document [Nierman & Jagadish 02]
- Rooted tree
- 2 types of nodes: ordinary nodes and distributional nodes
- 2 types of distributions: independent and mutually exclusive
[Figure: example document tree. Node labels: aerial-photo, region, neighborhood, house, size (values m, s), building, factory, park.lot, heliport, vehicle, type, track, private. One edge probability is legible: 0.8.]

A ProTDB Document [Nierman & Jagadish 02]
- A probability for each outgoing edge of a distributional node
[Figure: the same document tree with edge probabilities; legible values: 0.8, 0.8, 0.4.]

Instance Generation: Step 1
- Traverse the tree top-down
- Distributional nodes choose a set of children
  – Independent nodes: choose children independently
  – Mutually exclusive nodes: choose at most one child
- Drop unchosen children
[Figure: the ProTDB document tree from the previous slides.]

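Instance Generation: Step 2
- Drop the distributional nodes
[Figure: the tree after step 1. Remaining labels: aerial-photo, region, neighborhood, house, size (value s), factory, heliport, vehicle, type, track. Legible edge probabilities: 0.5, 0.8, 0.3.]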

Instance Generation: Step 2
- Drop the distributional nodes
- Connect each ordinary node to its closest ancestor
[Figure: the resulting tree. Labels: aerial-photo, region, neighborhood, house, size (value s), factory, heliport, vehicle, type, track.]
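The two generation steps can be sketched as a single recursive pass. Everything in the code below is an assumption made for illustration: the class names `Ordinary` and `Dist`, the tags "ind" and "mux", and the small example document, whose shape and probabilities are not the ones in the figure.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Ordinary:
    label: str
    children: list = field(default_factory=list)

@dataclass
class Dist:
    kind: str    # "ind" (independent) or "mux" (mutually exclusive)
    edges: list  # list of (probability, child) pairs

def choose(node, rng):
    """Step 1: a distributional node picks which children survive."""
    if node.kind == "ind":
        return [child for p, child in node.edges if rng.random() < p]
    # mux: at most one child; the leftover probability mass means "no child".
    r, acc = rng.random(), 0.0
    for p, child in node.edges:
        acc += p
        if r < acc:
            return [child]
    return []

def sample(node, rng):
    """Return the ordinary subtrees that attach to the closest ordinary ancestor."""
    if isinstance(node, Dist):
        # Step 2: the distributional node itself is dropped and its chosen
        # descendants are passed up to the nearest ordinary node.
        return [t for child in choose(node, rng) for t in sample(child, rng)]
    kids = [t for child in node.children for t in sample(child, rng)]
    return [Ordinary(node.label, kids)]

# Hypothetical p-document using a few labels from the slides.
doc = Ordinary("region", [
    Dist("ind", [
        (0.8, Ordinary("house", [
            Dist("mux", [(0.5, Ordinary("size=m")), (0.3, Ordinary("size=s"))]),
        ])),
        (0.4, Ordinary("factory")),
    ]),
])

instance = sample(doc, random.Random(0))[0]
print(instance)
```

Passing a seeded `random.Random` keeps runs reproducible. Repeating the call many times and counting how often a label appears gives a Monte Carlo estimate of that node's probability, which is one simple way to sanity-check a hand computation.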

The Result: An Ordinary Document
[Figure: the resulting ordinary tree. Labels: aerial-photo, region, neighborhood, house, size (value s), factory, heliport, vehicle, type, track.]
Many interesting queries are tractable! [Kimelfeld, Sagiv, VLDB07] [Kimelfeld, Sagiv, Cohen, PODS08] [Kimelfeld, Kosharovsky, Sagiv, SIGMOD08]

Some Interesting Open Questions
- Possible answers semantics (instead of possible tuples semantics) for ProTDB
- Efficient physical storage
- External memory algorithms for query processing
- More expressive models (e.g., Bayesian correlations)
- Richer query languages
- Incremental update of results
- Many more…