Representing and Querying Correlated Tuples in Probabilistic Databases

Presentation transcript:

Representing and Querying Correlated Tuples in Probabilistic Databases Prithviraj Sen Amol Deshpande

Outline: General Info; Introduction; Independent tuples model; Tuple correlations; Representing Dependencies; Query evaluation; Experiments; Conclusions & Work to be done

General info There is high demand for storing uncertain data. Issues with the use of probabilistic databases: 1) Existing probabilistic databases make simplistic assumptions about the data (in particular, that tuples are independent of one another), which makes them difficult to use in applications that naturally produce correlated data. 2) Most probabilistic databases can answer only a restricted subset of the queries that can be expressed in traditional query languages. To tackle these limitations, the paper proposes a framework that can represent not only probabilistic tuples but also correlations among them.

Outline: General Info; Introduction; Independent tuples model; Tuple correlations; Probabilistic graphical models & factored representations; Representing Dependencies; Query evaluation; Experiments; Conclusions & Work to be done

Introduction (1/2) Database research has primarily concentrated on how to store and query exact data. Many real-world applications, however, produce large amounts of uncertain data. In such cases databases need to do more than simply store and retrieve data; they have to help the user sift through the uncertainty and find the results most likely to be the answer.

Introduction (2/2) Numerous approaches (models) have been proposed to handle uncertainty. However, most models make assumptions about data uncertainty that restrict their applicability: they cannot easily model or handle dependencies and correlations among tuples.


Independent tuples model (1/2) One of the most commonly used tuple-level uncertainty models associates existence probabilities with individual tuples and assumes that the tuples are independent of each other. Let D^p be a database with relations S^p (containing tuple s1 with probability 0.6 and tuple s2 with probability 0.5) and T^p (containing tuple t1 with probability 0.4). The table on the slide lists all possible worlds. Because the tuples are independent, the probability of each world is obtained simply by multiplying the per-tuple probabilities: the existence probability for each tuple present in the world and its complement for each tuple absent from it. The query is then executed over these possible worlds.
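The possible-worlds construction for this example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; only the three probabilities come from the slide, and the function and variable names are made up for this sketch.

```python
from itertools import product

# Existence probabilities from the slide's example: s1, s2 in S^p and t1 in T^p.
tuple_probs = {"s1": 0.6, "s2": 0.5, "t1": 0.4}

def possible_worlds(probs):
    """Yield (world, probability) pairs, assuming the tuples are independent."""
    names = list(probs)
    for present in product([True, False], repeat=len(names)):
        world = frozenset(n for n, is_in in zip(names, present) if is_in)
        prob = 1.0
        for n, is_in in zip(names, present):
            # Multiply p for a present tuple and (1 - p) for an absent one.
            prob *= probs[n] if is_in else 1.0 - probs[n]
        yield world, prob

worlds = dict(possible_worlds(tuple_probs))
for world, prob in worlds.items():
    print(sorted(world), round(prob, 3))
```

Three tuples give 2^3 = 8 worlds whose probabilities sum to 1; the world containing all three tuples has probability 0.6 × 0.5 × 0.4 = 0.12.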

Independent tuples model (2/2) Evaluating a query by enumerating the set of possible worlds is clearly intractable, because the number of possible worlds grows exponentially with the number of tuples. Intensional semantics guarantee results in accordance with possible worlds semantics but are computationally expensive. Extensional semantics are computationally cheaper but do not guarantee results in accordance with possible worlds semantics. Even when base tuples are independent of each other, the intermediate tuples generated during query evaluation are typically correlated.
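The gap between the two semantics can be seen on the three-tuple example (s1: 0.6, s2: 0.5, t1: 0.4). The sketch below assumes a hypothetical query, since the transcript does not show the actual one: a result tuple exists whenever t1 is present together with s1 or s2, as in a join of S and T followed by a projection. The extensional plan shown here treats the two join results as independent even though both depend on t1.

```python
from itertools import product

p = {"s1": 0.6, "s2": 0.5, "t1": 0.4}

def result_exists(world):
    # Hypothetical query: t1 joins with s1 or s2, then the join is projected.
    return world["t1"] and (world["s1"] or world["s2"])

def possible_worlds_probability(p):
    """Sum the probabilities of all worlds in which the result tuple exists."""
    names = list(p)
    total = 0.0
    for bits in product([True, False], repeat=len(names)):
        world = dict(zip(names, bits))
        weight = 1.0
        for n in names:
            weight *= p[n] if world[n] else 1.0 - p[n]
        if result_exists(world):
            total += weight
    return total

def extensional_probability(p):
    """Join first, then project while (wrongly) assuming independent join results."""
    i1 = p["s1"] * p["t1"]  # intermediate tuple from s1 and t1
    i2 = p["s2"] * p["t1"]  # intermediate tuple from s2 and t1
    return 1.0 - (1.0 - i1) * (1.0 - i2)

exact = possible_worlds_probability(p)
extensional = extensional_probability(p)
print(round(exact, 3), round(extensional, 3))
```

Under these assumptions the possible-worlds answer is 0.4 × (1 − 0.4 × 0.5) = 0.32, while this extensional plan returns 1 − 0.76 × 0.8 = 0.392, because the two intermediate tuples share t1 and are therefore correlated.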


Tuple correlations (1/2) Consider now four sets of possible worlds derived from the same database as before, but with different dependencies that we may want to represent.

Tuple correlations (2/2) Although the tuple probabilities associated with s1, s2 and t1 are identical, the query results are drastically different across these four databases. Since both intensional and extensional semantics assume base tuple independence, neither can be used directly for query evaluation in such cases.
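A small numerical sketch makes the point concrete. The joint distributions below are hypothetical (the slide's four tables are not in the transcript); each keeps the marginals P(s1) = 0.6 and P(t1) = 0.4, with s2 independent at 0.5. The query is again an assumed one: the result exists when t1 is present together with s1 or s2.

```python
# Joint distributions over (s1 present, t1 present). Hypothetical values,
# chosen so that every case has the same marginals P(s1)=0.6 and P(t1)=0.4.
joints = {
    "independent": {(1, 1): 0.24, (1, 0): 0.36, (0, 1): 0.16, (0, 0): 0.24},
    "mutually exclusive": {(1, 1): 0.0, (1, 0): 0.6, (0, 1): 0.4, (0, 0): 0.0},
    "positively correlated": {(1, 1): 0.4, (1, 0): 0.2, (0, 1): 0.0, (0, 0): 0.4},
}
P_S2 = 0.5  # s2 is kept independent of s1 and t1 in all three cases

def marginals(joint):
    """Return (P(s1 present), P(t1 present)) for a joint over (s1, t1)."""
    p_s1 = sum(w for (s1, _), w in joint.items() if s1)
    p_t1 = sum(w for (_, t1), w in joint.items() if t1)
    return p_s1, p_t1

def query_probability(joint):
    """Probability that t1 is present together with s1 or s2 (assumed query)."""
    total = 0.0
    for (s1, t1), w in joint.items():
        for s2, w2 in ((1, P_S2), (0, 1.0 - P_S2)):
            if t1 and (s1 or s2):
                total += w * w2
    return total

results = {name: query_probability(joint) for name, joint in joints.items()}
for name, value in results.items():
    print(name, marginals(joints[name]), round(value, 3))
```

With these numbers the same marginals yield a result probability of 0.32 under independence, 0.2 under mutual exclusivity and 0.4 under positive correlation.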


Representing correlations (1/3) Associate every tuple t in the probabilistic database with a Boolean-valued random variable Xt. A factor f(X) is a function of a (small) set of random variables X, where 0 <= f(X) <= 1. Define factors on (sub)sets of the tuple-based random variables to encode correlations. The probability of an instantiation of the database is given by the product of all the factors.
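Written as a formula, the last statement reads as follows. The notation is an assumption made for this note, not taken from the slide: f_1, ..., f_k are the factors, X_i is the set of variables that f_i is defined on, and the factors are taken to be chosen so that the products sum to one over all instantiations.

```latex
\Pr(X = x) \;=\; \prod_{i=1}^{k} f_i(X_i = x_i),
\qquad 0 \le f_i(X_i) \le 1 .
```

The independent tuples model is the special case with one single-variable factor per tuple, f_t(X_t = 1) = p_t and f_t(X_t = 0) = 1 - p_t.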

Representing correlations(2/3) Suppose we want to represent mutual exclusivity between tuples s1 and t1. In particular, let us try to represent the possible worlds:
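One way such a factored representation might look in code is sketched below. The factor values are hypothetical (the slide's table is not in the transcript); they are chosen so that s1 and t1 never co-occur while keeping P(s1) = 0.6 and P(t1) = 0.4, with s2 independent at 0.5.

```python
from itertools import product

# Factor over (Xs1, Xt1) encoding mutual exclusivity: the two tuples are
# never present together. Values are hypothetical, not taken from the slide.
f_s1_t1 = {(1, 0): 0.6, (0, 1): 0.4, (1, 1): 0.0, (0, 0): 0.0}
# Factor over Xs2 alone: s2 is independent of the other two tuples.
f_s2 = {(1,): 0.5, (0,): 0.5}

def world_probability(xs1, xs2, xt1):
    """Probability of one database instantiation: the product of all factors."""
    return f_s1_t1[(xs1, xt1)] * f_s2[(xs2,)]

worlds = {
    (xs1, xs2, xt1): world_probability(xs1, xs2, xt1)
    for xs1, xs2, xt1 in product((0, 1), repeat=3)
}
nonzero = {w: p for w, p in worlds.items() if p > 0}
for (xs1, xs2, xt1), p in nonzero.items():
    print(f"s1={xs1} s2={xs2} t1={xt1}: {p}")
```

Only four of the eight instantiations receive non-zero probability, and none of them contains both s1 and t1.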

Representing correlations(3/3) Suppose we want to represent positive correlation between t1 and s1. In particular, let us try to represent the possible worlds:

Probabilistic graphical model representation A probabilistic graphical model is a graph whose nodes represent random variables and whose edges represent correlations. [Figure: graphs over Xs1, Xs2 and Xt1 for three cases: complete independence, mutual exclusivity, positive correlation]

Probabilistic graphical model representation [Figure: graph over X1, X2, X3]


Query evaluation: basic idea Treat intermediate tuples as regular tuples. Carefully represent correlations between intermediate tuples, base tuples and result tuples to construct a probabilistic graphical model. Cast the probability computations resulting from query evaluation as inference in probabilistic graphical models.

Query evaluation: example

Here Fq is the set of factors required to generate tuple t, defined inductively on the query.

Query evaluation: example Probabilistic graphical model. Query evaluation problem in probabilistic databases: compute the probability of the result tuple, summed over all possible worlds of the database. Equivalent problem in probabilistic graphical models: marginal probability computation, for which inference algorithms can be used. [Figure: graphical model over Xs1, Xs2, Xt1, Xi1, Xi2, Xr1]
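The correspondence between query evaluation and marginal computation can be sketched with brute-force inference. The graph structure is an assumption inferred from the node names (Xi1 from Xs1 and Xt1, Xi2 from Xs2 and Xt1, Xr1 from Xi1 and Xi2); the base probabilities are the independent ones from the earlier example, and the helper names are made up for this sketch.

```python
from itertools import product

VARS = ["s1", "s2", "t1", "i1", "i2", "r1"]
BASE = {"s1": 0.6, "s2": 0.5, "t1": 0.4}  # independent base tuples

def f_and(out, a, b):
    # Deterministic factor: an intermediate (join) tuple exists iff both inputs exist.
    return 1.0 if out == (a and b) else 0.0

def f_or(out, a, b):
    # Deterministic factor: the projected result exists iff either input exists.
    return 1.0 if out == (a or b) else 0.0

def joint(x):
    """Product of all factors for one full assignment x (dict of 0/1 values)."""
    weight = 1.0
    for name, p in BASE.items():
        weight *= p if x[name] else 1.0 - p
    weight *= f_and(x["i1"], x["s1"], x["t1"])
    weight *= f_and(x["i2"], x["s2"], x["t1"])
    weight *= f_or(x["r1"], x["i1"], x["i2"])
    return weight

def marginal(var, value=1):
    """Marginal probability of var == value, summing the joint over all assignments."""
    total = 0.0
    for bits in product((0, 1), repeat=len(VARS)):
        x = dict(zip(VARS, bits))
        if x[var] == value:
            total += joint(x)
    return total

p_r1 = marginal("r1")
print(round(p_r1, 3))
```

The marginal of Xr1 comes out as 0.32, matching the possible-worlds answer for the independent case. Summing over all assignments is exponential in the number of variables; it stands in here for the inference algorithms the slide refers to, which exploit the graph structure instead.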

[Figure: graphical model over Xs2, Xt1, Xi1, Xi2, Xr1]

Representing probabilistic relations A tuple can participate in several dependencies. In the authors' implementation, tuples are stored as part of a relation, and the uncertainty is stored using partitions. A partition consists of a factor and a set of references to the tuples whose random variables are the arguments of that factor. In addition, each tuple t in a relation carries a list of pointers to the partitions that hold references to it. The figure shows how the relations and partitions are organized for the database with the nxor dependency: the dashed lines are the pointers from tuples to partitions, and the solid lines are the references from partitions to tuples.
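The two-way linkage between tuples and partitions could be modeled roughly as below. This is an illustrative in-memory sketch, not the authors' storage layout; the class names, the helper add_partition and the factor values are all hypothetical.

```python
from dataclasses import dataclass, field

# eq=False keeps identity-based comparison, which avoids recursing through
# the circular tuple <-> partition references.
@dataclass(eq=False)
class ProbTuple:
    name: str
    # Pointers to every partition whose factor mentions this tuple.
    partitions: list = field(default_factory=list)

@dataclass(eq=False)
class Partition:
    # Factor: maps 0/1 assignments of the referenced tuples' variables to values.
    factor: dict
    # References to the tuples whose random variables are the factor's arguments.
    tuples: list

def add_partition(factor, tuples):
    """Create a partition and register the back-pointer on each referenced tuple."""
    part = Partition(factor=factor, tuples=list(tuples))
    for t in tuples:
        t.partitions.append(part)
    return part

s1, s2, t1 = ProbTuple("s1"), ProbTuple("s2"), ProbTuple("t1")
# Hypothetical factor values for a positive correlation between s1 and t1.
corr = add_partition({(1, 1): 0.4, (1, 0): 0.2, (0, 1): 0.0, (0, 0): 0.4}, [s1, t1])
solo = add_partition({(1,): 0.5, (0,): 0.5}, [s2])
print([t.name for t in corr.tuples], len(t1.partitions))
```

Following the pointers from a tuple gives every factor it participates in, and following the references from a partition gives the arguments of its factor, so both directions of the figure can be traversed.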


Experiments (1/3) The database contains 860 publications from CiteSeer [GBL98]. The experiment searched for publications by a given (misspelt) author name, which naturally involves mutual exclusivity correlations.

Experiments (2/3) Experiments were run on a randomly generated TPC-H dataset of size 10MB. The first bar for each query indicates the time it took to run the full query, including all the database operations and the probabilistic computations. The second bar indicates the time it took to run only the database operations using the Java implementation.

Experiments(3/3) The result of running an average query over a synthetically generated dataset containing tuples


Conclusions There is an increasing need for database solutions for efficiently managing and querying uncertain data exhibiting complex correlation patterns. A simple and intuitive framework is presented, based on probabilistic graphical models, for explicitly modeling correlations among tuples in a probabilistic database.

Work to be done Problem: Although conceptually the approach presented allows for capturing arbitrary tuple correlations, exact query evaluation over large datasets exhibiting complex correlations may not always be feasible. Future considerations: 1) Develop approximate query evaluation techniques that can be used in such cases. 2) Develop disk-based query evaluation algorithms so that the techniques can scale to very large datasets.